6 Conclusions and future work
6.3 Conclusions
This dissertation introduces a novel speaker adaptation approach, Parallel Reference Speaker Weighting (PRSW), based on parallel acoustic and articulatory Hidden Markov Models. The approach combines a robust normalized articulatory space and palate-referenced articulatory features with speaker-weighted adaptation to form an inversion mapping for new speakers, so that articulatory trajectories can be estimated accurately when no kinematic data are available. The proposed PRSW method is evaluated on the newly collected Marquette EMA-MAE corpus using 20 native English speakers. Cross-speaker inversion results show that, when the reference speakers are well chosen and have consistent acoustic and articulatory patterns, PRSW achieves speaker-independent inversion performance close to that of a speaker-dependent system, without requiring kinematic training data for the new speaker.
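The weighting idea summarized above can be illustrated with a minimal sketch. It assumes each reference speaker is represented by one acoustic mean vector and one parallel articulatory mean vector, and it uses ordinary least squares as a stand-in for the weight estimation, which this section does not specify. The function names `estimate_speaker_weights` and `adapt_articulatory_mean` are hypothetical and chosen only for illustration.

```python
import numpy as np


def estimate_speaker_weights(ref_acoustic_means, new_acoustic_mean):
    """Estimate one weight per reference speaker from acoustics only.

    ref_acoustic_means: array of shape (R, Da), one row per reference speaker.
    new_acoustic_mean:  array of shape (Da,), from the new speaker's audio.
    Assumption: least squares stands in for the actual estimation criterion.
    """
    # Solve ref_acoustic_means.T @ w ~= new_acoustic_mean for w.
    weights, *_ = np.linalg.lstsq(
        ref_acoustic_means.T, new_acoustic_mean, rcond=None
    )
    return weights


def adapt_articulatory_mean(ref_articulatory_means, weights):
    """Apply the acoustic-derived weights to the parallel articulatory means.

    ref_articulatory_means: array of shape (R, Dk), rows aligned with the
    acoustic reference means. Returns an articulatory mean of shape (Dk,).
    """
    return weights @ ref_articulatory_means


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    ref_acoustic = rng.normal(size=(3, 6))      # 3 reference speakers (toy data)
    ref_articulatory = rng.normal(size=(3, 4))  # parallel articulatory means
    new_acoustic = np.array([0.2, 0.5, 0.3]) @ ref_acoustic
    w = estimate_speaker_weights(ref_acoustic, new_acoustic)
    print("weights:", w)
    print("adapted articulatory mean:", adapt_articulatory_mean(ref_articulatory, w))
```

The point of the sketch is that the weights are computed only from the new speaker's acoustic data, and the same weights are then reused on the articulatory side, so no kinematic data from the new speaker enters the computation.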