Conclusion and Future Work - Local representations and random sampling for speaker verification

7.1 Conclusion

In this thesis, we have discussed the assumption of low-dimensionality of channel space typical in current state-of-the art speaker verification systems. We have proposed to partition the acoustic space into local regions where we can apply channel compensation techniques separately. This would lead to more tractable solutions and gives a better view of the channel and speaker spaces. We argued that obtaining a single number as the distance between two speakers is questionable and using local regions may also support better modeling of inter-speaker variabilities. We have proposed two architectures for speaker verification in local regions in which current approaches can easily be used.

The two architectures both apply channel compensation locally and differ in where the subsequent speaker modeling and scoring steps are realized. We have implemented a GMM supervector SVM system with local NAP based on one of these architectures.

We have also worked on a special case of intra-speaker variability, namely within-session variability. We have proposed a method to improve performance by compensating vari-abilities mainly coming from actual realizations of phones as well as varivari-abilities across phones. We have shown that our technique not only improves performance for short test utterances but also for long test utterances.

Chapter 7. Conclusion and Future Work 61

7.2 Future Work

For future improvements on our proposals, we will focus on the below directions.

• Further studies on local region based speaker verification must be explored. We want to apply other approaches like TVS, LDA + WCCN, and JFA in the local architectures we proposed. In this thesis we only used one of the architectures (Architecture 1) where only channel compensation is applied locally. It may be interesting to model speakers and inter-speaker variabilities locally. We also want to explore how well local systems may be fused in the score level. Since the dimension of local regions are small enough we can leave the linearity assumption and try to learn the manifold of speaker space.

• Better ways of partitioning the acoustic space is also worth working on. It will be most beneficial if we can find a way to partition the space that will help in subsequent steps of channel compensation.

• In this thesis we have used random sampling to generate short segments from the training utterance. We tried to learn a better hyperplane by SVM taking phonetic intra-speaker variabilities into account. As a future work, we plan to investigate ways to remove these variabilities by feature transforms applied to the training supervectors.

Bibliography

[1] J. Pelecanos and S. Sridharan. Feature warping for robust speaker verification. In 2001: A Speaker Odyssey-The Speaker Recognition Workshop, 2001.

[2] H. Hermansky and N. Morgan. RASTA processing of speech. IEEE transactions on speech and audio processing, 2(4):578–589, 1994.

[3] D.A. Reynolds. Channel robust speaker verification via feature mapping. In 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2003.

Proceedings.(ICASSP’03), volume 2, 2003.

[4] A. Adami, R. Mihaescu, D.A. Reynolds, and J.J. Godfrey. Modeling prosodic dynamics for speaker recognition. In Proc. ICASSP, volume 4, pages 788–791, 2003.

[5] W.D. Andrews, M.A. Kohler, J.P. Campbell, J.J. Godfrey, and J. Hern´ andez-Cordero. Gender-dependent phonetic refraction for speaker recognition. In IEEE International Conference on Acoustics, Speech, and Signal Processing, 2002. Pro-ceedings.(ICASSP’02), 2002.

[6] J. Navr´atil, Q. Jin, W. Andrews, and J. Campbell. Phonetic speaker recognition using maximum likelihood binary decision tree models. In Proc. of ICASSP, 2003.

[7] D. Reynolds, W. Andrews, J. Campbell, J. Navratil, B. Peskin, A. Adami, Q. Jin, D. Klusacek, J. Abramson, R. Mihaescu, et al. The SuperSID project: Exploiting high-level information for high-accuracy speaker recognition. In Proc. ICASSP, volume 4, pages 784–787, 2003.

[8] R. Auckenthaler, M. Carey, and H. Lloyd-Thomas. Score normalization for text-independent speaker verification systems. Digital Signal Processing, 10(1-3):42–54, 2000.

Bibliography 63

[9] D.A. Reynolds, T.F. Quatieri, and R.B. Dunn. Speaker verification using adapted Gaussian mixture models. Digital signal processing, 10(1-3):19–41, 2000.

[10] P. Kenny. Joint factor analysis of speaker and session variabil-ity: theory and algorithms. The Centre de Recherche Informa-tique de Montreal, Technical Report CRIM-06/08-13, 2005. URL http://www.crim.ca/perso/patrick.kenny/FAtheory.pdf.

[11] P. Kenny, M. Mihoubi, and P. Dumouchel. New MAP estimators for speaker recog-nition. In Eighth European Conference on Speech Communication and Technology, 2003.

[12] P. Kenny, G. Boulianne, P. Ouellet, and P. Dumouchel. Speaker and session vari-ability in GMM-based speaker verification. IEEE Transactions on Audio, Speech, and Language Processing, 15(4):1448–1460, 2007.

[13] P. Kenny, G. Boulianne, and P. Dumouchel. Eigenvoice modeling with sparse train-ing data. IEEE Transactions on Speech and Audio Processtrain-ing, 13(3):345–354, 2005.

[14] O. Glembek, L. Burget, N. Dehak, N. Brummer, and P. Kenny. Comparison of scoring methods used in speaker recognition with Joint Factor Analysis (PDF). In IEEE International Conference on Acoustics, Speech, and Signal Processing, 2009.

Proceedings.(ICASSP’09), pages 4057–4060, 2009.

[15] N. Dehak, P. Kenny, R. Dehak, P. Ouellet, and P. Dumouchel.

Front-end factor analysis for speaker verification. submitted to IEEE Transaction on Audio, Speech and Language Processing. URL http://groups.csail.mit.edu/sls/publications/2010/.

[16] N. Dehak, R. Dehak, P. Kenny, N. Brummer, P. Ouellet, and P. Dumouchel. Sup-port vector machines versus fast scoring in the low-dimensional total variability space for speaker verification. In Proceedings of Interspeech, 2009.

[17] M. Senoussaoui, P. Kenny, N. Dehak, and P. Dumouchel. An i-vector Extractor Suitable for Speaker Recognition with both Microphone and Telephone Speech. In Odyssey: The Speaker and Language Recognition Workshop, Brno, Czech Republic, 2010.

Bibliography 64

[18] WM Campbell, DE Sturim, DA Reynolds, and A. Solomonoff. Svm Based Speaker Verification Using a GMM Supervector Kernel and NAP Variability Compensation.

In IEEE International Conference on Acoustics, Speech, and Signal Processing, 2006. Proceedings.(ICASSP’06).

[19] W. Campbell. Generalized linear discriminant sequence kernels for speaker recogni-tion. In IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 1, 2002.

[20] A. Solomonoff, W.M. Campbell, and I. Boardman. Advances in channel compen-sation for SVM speaker recognition. In Proc. ICASSP, pages 629–632, 2005.

[21] A.O. Hatch and A. Stolcke. Generalized linear kernels for one-versus-all classifica-tion: application to speaker recognition. In 2006 IEEE International Conference on Acoustics, Speech and Signal Processing, 2006. ICASSP 2006 Proceedings, 2006.

[22] Alvin Martin and Mark Przybocki. The nist speaker recognition evalua-tion series. National Institute of Standard and Technology’s web-site. URL http://www.nist.gov/speech/test/spk.

[23] URL http://www.ldc.upenn.edu/.

[24] M.A. Przybocki and A.F. Martin. Odyssey Text Independent Evaluation Data. In 2001: A Speaker Odyssey-The Speaker Recognition Workshop. ISCA, 2001.

[25] C. Cieri, L. Corson, D. Graff, and K. Walker. Resources for new research directions in speaker recognition: The Mixer 3, 4 and 5 corpora. In Proc. Interspeech, pages 950–954, 2007.

[26] C. Cieri, J.P. Campbell, H. Nakasone, D. Miller, and K. Walker. The mixer corpus of multilingual, multichannel speaker recognition data. In Proc. 4th International Conference on Language Resources and Evaluation, pages 26–28, 2004.

[27] C. Cieri, W. Andrews, J.P. Campbell, G. Doddington, J. Godfrey, S. Huang, M. Liberman, A. Martin, H. Nakasone, M. Przybocki, et al. The mixer and tran-script reading corpora: Resources for multilingual, crosschannel speaker recogni-tion research. In Proceedings of the Fifth Internarecogni-tional Conference on Language Resources and Evaluation (LREC), pages 117–120, 2006.

Bibliography 65

[28] L. Brandschain, C. Cieri, D. Graff, A. Neely, and K. Walker. Speaker recognition:

building the Mixer 4 and 5 corpora. In Proceedings of the Sixth International Conference on Language Resources and Evaluation, 2008.

[29] P. Kenny, P. Ouellet, N. Dehak, V. Gupta, and P. Dumouchel. A study of inter-speaker variability in inter-speaker verification. IEEE Trans. Audio, Speech and Language Processing, 16(5):980–988, July 2008.

[30] A.O. Hatch, S. Kajarekar, and A. Stolcke. Within-class covariance normalization for SVM-based speaker recognition. In Ninth International Conference on Spoken Language Processing, pages 1471–1474, 2006.

[31] G.R.G. Lanckriet, N. Cristianini, P. Bartlett, L.E. Ghaoui, and M.I. Jordan. Learn-ing the kernel matrix with semidefinite programmLearn-ing. The Journal of Machine Learning Research, 5:27–72, 2004.

[32] M. Gonen and E. Alpaydin. Localized multiple kernel learning. In Proceedings of the 25th international conference on Machine learning, pages 352–359. ACM, 2008.

[33] R. Vogt, C. Lustri, and S. Sridharan. Factor analysis modelling for speaker verifi-cation with short utterances. In Odyssey: The Speaker and Language Recognition Workshop, 2008.

[34] R. Vogt, B. Baker, and S. Sridharan. Factor analysis subspace estimation for speaker verification with short utterances. In Proc. Interspeech, pages 853–856, 2008.

[35] R. Vogt, J. Pelecanos, N. Scheffer, S. Kajarekar, and S. Sridharan. Within-session variability modelling for factor analysis speaker verification. In Proc. Interspeech, 2009.

[36] S.S. Kajarekar. Across-phone variability and diagonal term in joint factor analysis for speaker recognition. In In: ICASSP, 14th-19th of March, 2010., Dallas, Texas, USA.

[37] M. Collet, Y. Mami, D. Charlet, and F. Bimbot. Probabilistic anchor models approach for speaker verification. In Ninth European Conference on Speech Com-munication and Technology, 2005.

In document Local representations and random sampling for speaker verification (Page 72-77)