Analysis by Languages - Subspace Gaussian Mixture Models for Language Identification and Dysart

As a final study, we analyze in detail the performance of one of the best performing fusions, namely the fusion of the acoustic and prosodic i-Vector-based systems. This fusion performed best in the 30 s task and very close to the best in the 10 s task. Although some fusions performed better in the 3 s task (concretely, the fusion of all acoustic systems, including GMM, and the prosodic system), we think that in a real scenario this would be our preferred solution owing to its consistency, simplicity, and better generalization of conclusions.

First, Figures 8.2, 8.3, and 8.4 show Cavg for each language individually. Compared with the results obtained with each i-Vector system individually (results of the prosodic system were already shown in Figure 7.7, but are included here again for easier comparison), we see a clear advantage of the fusion for all languages in all tasks, except for English and Spanish in the 30 s task, for which the acoustic i-Vector-based system performed slightly better. However, we already saw above that, in general, the fusion was beneficial. The benefits were especially remarkable for Mandarin, which could be expected: Mandarin is a tonal language and, as we saw in Section sec:ProsResults, the prosodic and formant system performed very well for this language. Therefore, the fusion of the acoustic and prosodic systems could be expected to be richer for Mandarin than for the rest of languages. Surprisingly, Russian was the other language for which the fusion performed best.

Figure 8.2: Results for fusion of acoustic and prosodic i-Vector systems for the 3 s task - 600-dimension i-Vectors with Gaussian classifier. Prosodic system includes pitch, energy, duration, and also F1 and F2, obtained in fixed segments of 200 ms with 10 ms shift, and order 5 polynomials. Results with 20 h for training and 1 h for development, per language, in terms of 100 · Cavg.

Figure 8.3: Results for fusion of acoustic and prosodic i-Vector systems for the 10 s task - 600-dimension i-Vectors with Gaussian classifier. Prosodic system includes pitch, energy, duration, and also F1 and F2, obtained in fixed segments of 200 ms with 10 ms shift, and order 5 polynomials. Results with 20 h for training and 1 h for development, per language, in terms of 100 · Cavg.

Figure 8.4: Results for fusion of acoustic and prosodic i-Vector systems for the 30 s task - 600-dimension i-Vectors with Gaussian classifier. Prosodic system includes pitch, energy, duration, and also F1 and F2, obtained in fixed segments of 200 ms with 10 ms shift, and order 5 polynomials. Results with 20 h for training and 1 h for development, per language, in terms of 100 · Cavg.

In the 3 s task, the language that benefited most from the fusion was Mandarin, with a 33.87% relative improvement over the acoustic i-Vector system alone. Russian followed with 19.74%. The rest of languages showed relative improvements between 5% and 10%.

In the 10 s task, Russian, with 52.17%, and English, with 47.69%, were the languages with the largest relative decrease in Cavg for the fusion with respect to the acoustic i-Vector-based system alone. The other languages showed relative improvements between 13% and 14%.

True language   English   Farsi   Hindi  Mandarin  Russian  Spanish
English CTS       72.73    4.96    6.61      4.13     2.48     2.48
English BNBS      89.43    2.11    0.91      1.51     0.91     1.21
Farsi CTS          6.09   75.63    4.06      3.05     3.05     1.02
Farsi BNBS         1.44   70.19    4.81      6.73     4.81     1.92
Hindi CTS          1.43    5.71   65.71     10.00     4.29     2.86
Hindi BNBS         3.16    7.76   55.17      2.01     4.60     9.20
Mandarin CTS       0.77    3.86    3.47     80.31     2.70     1.54
Mandarin BNBS      0.00    1.73    1.73     89.60     0.00     1.16
Russian CTS        3.60    3.60    3.60      2.16    76.98     1.44
Russian BNBS       0.66    2.32    2.98      0.33    85.43     2.65
Spanish CTS        6.06    6.06   15.58      3.46     6.06    49.35
Spanish BNBS       2.66    5.32    4.79      0.53     4.26    77.13

Table 8.1: Confusion matrix of the acoustic and prosodic fusion i-Vector-based LID system for the 3 s task - 600-dimension i-Vectors with Gaussian classifier. Results with 20 h for training and 1 h for development, per language. Rows are true spoken language and transmission channel, and columns are decisions made by our system (in % of files).

In the 30 s task, Russian with 57.89%, Mandarin with 44.44%, and Hindi with 34.89% relative improvements of the fusion with respect to the acoustic i-Vector-based system alone were the languages that benefited most from the fusion. On the other hand, Cavg for English increased by 26.67% with the fusion with respect to the acoustic i-Vector-based system alone, and Cavg for Spanish increased by 7.69%.
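The relative figures in the preceding paragraphs can be read as the reduction in Cavg of the fusion, expressed as a percentage of the Cavg of the acoustic i-Vector system alone. The following is a minimal sketch of that arithmetic; the function name and the numeric inputs are hypothetical and are not values taken from the experiments.

```python
def relative_improvement(cavg_baseline: float, cavg_fused: float) -> float:
    """Relative Cavg reduction (in %) of the fused system over the baseline.

    Positive values mean the fusion lowered Cavg (an improvement);
    negative values mean the fusion raised Cavg (a degradation).
    """
    if cavg_baseline <= 0:
        raise ValueError("baseline Cavg must be positive")
    return 100.0 * (cavg_baseline - cavg_fused) / cavg_baseline


# Hypothetical values of 100 * Cavg, chosen only to illustrate the arithmetic.
print(relative_improvement(2.0, 1.5))  # fusion lowers Cavg: positive result
print(relative_improvement(1.5, 1.9))  # fusion raises Cavg: negative result
```

Under this convention, the increases reported for English and Spanish in the 30 s task correspond to negative relative improvements.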

Next, we show the confusion matrices for the fusion of acoustic and prosodic i-Vector systems. In this case, we split the results into CTS and BNBS transmissions to see whether there were common patterns that help to better understand the results. They are shown in Tables 8.1, 8.2, and 8.3 for the 3 s, 10 s, and 30 s tasks, respectively.

In the three tasks, the clearest imbalance between channels was for Spanish. In the 3 s task, 77.13% of BNBS files were correctly classified, while only 49.35% of CTS files were correctly classified. In the 10 s and 30 s tasks, this difference was progressively reduced. The most reasonable explanation we find for this result is that, unlike for the rest of languages, we did not have LRE09cts data for Spanish in the Dev dataset (see Table 4.2). These data probably helped considerably to build more robust models for the rest of languages, and Spanish could not benefit from them.
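As a quick check of how the Spanish channel imbalance shrinks with test duration, the gap between BNBS and CTS correct-classification rates can be computed directly from the Spanish diagonal entries of Tables 8.1, 8.2, and 8.3. This is only an illustrative sketch; the dictionary layout and the helper name are our own.

```python
# Correct-classification rates (% of files) for Spanish, copied from the
# diagonal entries of Tables 8.1, 8.2, and 8.3.
spanish_correct = {
    "3 s": {"CTS": 49.35, "BNBS": 77.13},
    "10 s": {"CTS": 81.82, "BNBS": 94.68},
    "30 s": {"CTS": 94.37, "BNBS": 99.47},
}


def channel_gap(rates: dict) -> float:
    """Absolute BNBS minus CTS difference, in percentage points."""
    return round(rates["BNBS"] - rates["CTS"], 2)


gaps = {task: channel_gap(rates) for task, rates in spanish_correct.items()}
for task, gap in gaps.items():
    print(f"{task}: BNBS - CTS = {gap:.2f} points")
```

The gap decreases monotonically from the 3 s to the 30 s task, which is consistent with the progressive reduction described above.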

True language   English   Farsi   Hindi  Mandarin  Russian  Spanish
English CTS       96.69    0.83    1.65      0.00     0.00     0.00
English BNBS      98.93    0.36    0.00      0.00     0.00     0.00
Farsi CTS          0.00   95.43    1.02      0.51     0.51     0.51
Farsi BNBS         0.00   85.65    0.96      3.35     0.96     0.96
Hindi CTS          0.00    1.43   95.71      1.43     0.00     0.00
Hindi BNBS         0.36    0.91   89.29      0.36     1.09     1.09
Mandarin CTS       0.00    0.00    1.16     97.68     0.00     0.00
Mandarin BNBS      0.00    0.00    0.41     99.59     0.00     0.00
Russian CTS        0.00    0.00    0.72      0.00    97.84     0.00
Russian BNBS       0.00    0.00    0.33      0.00    99.01     0.00
Spanish CTS        0.43    2.16    5.19      1.73     0.00    81.82
Spanish BNBS       0.00    0.53    1.60      0.00     0.53    94.68

Table 8.2: Confusion matrix of the acoustic and prosodic fusion i-Vector-based LID system for the 10 s task - 600-dimension i-Vectors with Gaussian classifier. Results with 20 h for training and 1 h for development, per language. Rows are true spoken language and transmission channel, and columns are decisions made by our system (in % of files).

True language   English   Farsi   Hindi  Mandarin  Russian  Spanish
English CTS      100.00    0.00    0.00      0.00     0.00     0.00
English BNBS      98.00    0.00    0.00      0.00     0.00     0.00
Farsi CTS          0.00   98.98    0.51      0.00     0.00     0.00
Farsi BNBS         0.00   96.14    0.00      0.00     0.00     0.00
Hindi CTS          0.00    0.00  100.00      0.00     0.00     0.00
Hindi BNBS         0.00    0.00   97.24      0.00     0.69     0.00
Mandarin CTS       0.00    0.00    0.00     99.61     0.00     0.00
Mandarin BNBS      0.00    0.00    0.00    100.00     0.00     0.00
Russian CTS        0.00    0.00    0.00      0.00   100.00     0.00
Russian BNBS       0.00    0.00    0.00      0.00   100.00     0.00
Spanish CTS        0.43    0.87    0.87      0.00     0.00    94.37
Spanish BNBS       0.00    0.00    0.00      0.00     0.00    99.47

Table 8.3: Confusion matrix of the acoustic and prosodic fusion i-Vector-based LID system for the 30 s task - 600-dimension i-Vectors with Gaussian classifier. Results with 20 h for training and 1 h for development, per language. Rows are true spoken language and transmission channel, and columns are decisions made by our system (in % of files).

is one database, SRE08, that was present in the rest of languages except English. This could influence the performance in the 3 s task. For the 10 s and 30 s tasks, we did not see such a clear difference.

For Hindi, we did not include CALLFRIEND data since preliminary experiments with GMMs seemed to indicate a worse performance when this database was included. In the results, CTS models seemed to be working better than BNBS models.

A large percentage of Spanish CTS files (15.58%) were confused with Hindi in the 3 s task. In addition to the lack of Spanish LRE09cts data, this type of error may have been accentuated by the exclusion of CALLFRIEND data for Hindi. The problem persisted in the 10 s task, with 5.19% of Spanish CTS files misclassified as Hindi.

There were other confusions between pairs of languages for a specific transmission channel that we cannot fully explain, such as the 10% of Hindi CTS files misclassified as Mandarin in the 3 s task. In addition to language differences, they are probably just due to database selection. Thus, some languages benefited from one database more than others simply because the audio files were cleaner or there were fewer labeling errors. For example, we know that VOA3 contains a non-negligible number of English recordings labeled as other languages. We applied a cleaning process to alleviate this problem [Martínez et al., 2011c], but the use of VOA3 probably benefited English more than the rest of languages.

As a final note, we want to remark again on the very good results obtained, especially for the 30 s task, where the lowest correct classification rate was 94.37% and all confusion rates were below 1%. Also, in the 10 s task, all but three correct classification rates were over 94%. Probably, the biggest effort from now on should be put into the 3 s task. The short duration of the audio recordings makes this problem a very challenging and interesting one.
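The two claims about the 30 s task can be checked mechanically against Table 8.3. The sketch below copies the table values into a dictionary (the data layout and function name are our own, for illustration only) and extracts the lowest correct-classification rate and the largest off-diagonal confusion.

```python
LANGS = ["English", "Farsi", "Hindi", "Mandarin", "Russian", "Spanish"]

# Rows of Table 8.3 (30 s task), in % of files; columns follow LANGS.
TABLE_8_3 = {
    ("English", "CTS"): [100.00, 0.00, 0.00, 0.00, 0.00, 0.00],
    ("English", "BNBS"): [98.00, 0.00, 0.00, 0.00, 0.00, 0.00],
    ("Farsi", "CTS"): [0.00, 98.98, 0.51, 0.00, 0.00, 0.00],
    ("Farsi", "BNBS"): [0.00, 96.14, 0.00, 0.00, 0.00, 0.00],
    ("Hindi", "CTS"): [0.00, 0.00, 100.00, 0.00, 0.00, 0.00],
    ("Hindi", "BNBS"): [0.00, 0.00, 97.24, 0.00, 0.69, 0.00],
    ("Mandarin", "CTS"): [0.00, 0.00, 0.00, 99.61, 0.00, 0.00],
    ("Mandarin", "BNBS"): [0.00, 0.00, 0.00, 100.00, 0.00, 0.00],
    ("Russian", "CTS"): [0.00, 0.00, 0.00, 0.00, 100.00, 0.00],
    ("Russian", "BNBS"): [0.00, 0.00, 0.00, 0.00, 100.00, 0.00],
    ("Spanish", "CTS"): [0.43, 0.87, 0.87, 0.00, 0.00, 94.37],
    ("Spanish", "BNBS"): [0.00, 0.00, 0.00, 0.00, 0.00, 99.47],
}


def diagonal_and_confusions(table):
    """Split each row into its correct rate and its off-diagonal confusions."""
    correct, confusions = {}, []
    for (lang, channel), row in table.items():
        idx = LANGS.index(lang)  # column of the true language
        correct[(lang, channel)] = row[idx]
        confusions.extend(v for j, v in enumerate(row) if j != idx)
    return correct, confusions


correct, confusions = diagonal_and_confusions(TABLE_8_3)
worst = min(correct, key=correct.get)
print("lowest correct rate:", worst, correct[worst])
print("largest confusion rate:", max(confusions))
```

The same helper can be applied to Tables 8.1 and 8.2 by swapping in their rows.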

Results on 2009 NIST LRE Database

Contents

9.1 2009 NIST LRE database

9.2 Training Database

9.3 Development Database

9.4 Experimental Setup

In this chapter, we present results on the whole 2009 NIST LRE database [National Institute of Standards and Technology (NIST), 2009] using the techniques developed in this Thesis. In recent years, many researchers have used this database in their experiments, and therefore the results reported in this chapter will allow an easier comparison with other works. Remember that in previous chapters we only used 6 target languages, for which we could collect a specific number of hours for the train and development datasets. Now, however, we do not restrict the size of the train and development datasets, but use all available data. The whole database includes 23 target languages, and the number of hours per language is unbalanced. Thus, this is a more challenging problem, and we will be able to see whether the conclusions previously drawn hold in other scenarios. Additionally, we will present results using a flat prior for all languages, as in previous chapters, but also using a prior equal to 0.5 for the target language and 0.5 split among the rest of languages (remember that we perform a binary detection task where each audio file is evaluated against each target language individually), which is the common strategy in NIST evaluations and is adopted in many works in the literature.
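To make the two prior configurations concrete, the following minimal sketch builds both for a 23-language closed set, assuming the non-target half of the probability mass is split equally among the remaining languages. The function names and the placeholder language labels (lang_01 to lang_23) are hypothetical.

```python
def flat_priors(languages):
    """Same prior for every language (flat prior)."""
    p = 1.0 / len(languages)
    return {lang: p for lang in languages}


def detection_priors(languages, target, p_target=0.5):
    """Prior p_target for the target language; the remaining mass is
    split equally among all other languages (an assumption of this sketch)."""
    if target not in languages:
        raise ValueError("target must be one of the languages")
    p_other = (1.0 - p_target) / (len(languages) - 1)
    return {lang: (p_target if lang == target else p_other) for lang in languages}


# Placeholder labels standing in for the 23 target languages.
languages = [f"lang_{i:02d}" for i in range(1, 24)]

flat = flat_priors(languages)
det = detection_priors(languages, target="lang_01")

print(f"flat prior per language: {flat['lang_01']:.4f}")
print(f"target prior: {det['lang_01']:.4f}")
print(f"non-target prior: {det['lang_02']:.4f}")
```

With 23 languages, the flat prior is 1/23 per language, whereas in the detection setting each non-target language receives 0.5/22. A separate prior vector is built for each target language, since every file is evaluated against each target individually.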

9.1 2009 NIST LRE database

We report results on the 2009 NIST LRE database. This database includes 41793 files covering 40 different languages. We focused only on the 23 target languages of the closed-set task, which reduces the number of files to 31178. The channel type can be CTS or BNBS. The distribution of files belonging to the target languages among the 3 s, 10 s, and 30 s tasks can be seen in Table D.1 of Appendix D. Also, in order to see which languages are more likely to be confused with each other, we classify them by families in Table 9.1.
