7. Conclusions and Future work

7.3 Future work

Several lines of future research can be considered as extensions of the work developed in this dissertation. The areas that would benefit most from further research are briefly discussed in the following paragraphs.

Automatic adjustment of the timings for the wavelet analysis. The selection of the wavelet timings is a critical point, since an accurate representation of the prosodic events of speech depends on it. One way to establish the correct timings, already used by [Vai13], is to study the peak prominence at the word and syllable levels and relate it to the syllable and word boundaries. The level whose peaks match the syllable/word slots most closely would be selected as the syllable/word level, allowing the construction of the complete wavelet domain.
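The selection rule described in this paragraph can be sketched as follows. This is only an illustration under stated assumptions, not the procedure of [Vai13]: the matching criterion (the fraction of slots that contain exactly one peak) and the names slot_match_score and select_level are hypothetical and introduced here for the example. Peak times and slot boundaries are assumed to be given in the same time unit.

```python
from bisect import bisect_left


def slot_match_score(peak_times, slots):
    """Fraction of slots (start, end) that contain exactly one peak.

    Assumed criterion: a level fits the syllable/word tier well when
    every slot holds a single peak of that level.
    """
    if not slots:
        return 0.0
    peaks = sorted(peak_times)
    hits = 0
    for start, end in slots:
        # Number of peaks t with start <= t < end.
        count = bisect_left(peaks, end) - bisect_left(peaks, start)
        if count == 1:
            hits += 1
    return hits / len(slots)


def select_level(peaks_by_level, slots):
    """Return the wavelet level whose peaks best match the slots."""
    return max(peaks_by_level,
               key=lambda level: slot_match_score(peaks_by_level[level], slots))
```

A level with several peaks per slot is too fine for the tier and a level with peaks in only a few slots is too coarse; both receive a low score under this assumed criterion.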

Testing other wavelet transforms. Even though the Mexican Hat wavelet has shown good properties for representing the prosodic events, other wavelets with different properties can be tried. Candidates include the Morlet wavelet (or Gabor wavelet), which is closely related to the human auditory perception scale, and wavelets that allow a full reconstruction of the original signal without depending on the dilation and scaling parameters, such as the Daubechies wavelets.
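To make the comparison concrete, the sketch below evaluates a signal at a single scale with two interchangeable mother wavelets. It is a minimal, direct O(N^2) implementation for illustration, not the analysis code used in this work. The function names, the unit-norm Mexican Hat normalization and the Morlet centre frequency omega0 = 5 are assumptions made for this example.

```python
import math


def mexican_hat(t):
    """Mexican Hat (Ricker) mother wavelet, normalized to unit energy."""
    norm = 2.0 / (math.sqrt(3.0) * math.pi ** 0.25)
    return norm * (1.0 - t * t) * math.exp(-0.5 * t * t)


def morlet_real(t, omega0=5.0):
    """Real part of a Morlet wavelet; omega0 = 5 is an assumed value."""
    return math.pi ** -0.25 * math.cos(omega0 * t) * math.exp(-0.5 * t * t)


def cwt_at_scale(signal, wavelet, scale, dt=1.0):
    """Continuous wavelet transform of `signal` at one scale.

    Each output sample k is the inner product of the signal with the
    wavelet dilated by `scale` and centred on sample k.
    """
    n = len(signal)
    gain = dt / math.sqrt(scale)
    return [
        gain * sum(signal[m] * wavelet((m - k) * dt / scale) for m in range(n))
        for k in range(n)
    ]
```

Because the wavelet is passed as a function, swapping the Mexican Hat for another candidate only changes one argument, for example cwt_at_scale(f0_contour, morlet_real, 5.0), where f0_contour is a hypothetical list of interpolated F0 values.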

Improving the statistical mapping technique. DKPLS has shown its capability to model prosody in the wavelet domain; however, several improvements could enhance the performance of the system. It is suggested to treat each prosodic unit separately: the phoneme/syllable/word boundaries on the corresponding wavelet level would be used to model each prosodic unit, for instance with CARTs, based on the position and amplitude of the peak present in the slot.
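The per-unit features mentioned in this paragraph could be extracted along the following lines. This is a sketch under stated assumptions: the function name peak_features, the use of sample indices for the slots and the normalization of the peak position to the range 0 to 1 are choices made for this example, not part of the proposed system.

```python
def peak_features(coeffs, slots):
    """Peak position and amplitude for every prosodic unit.

    coeffs: wavelet coefficients of one level, one value per frame.
    slots:  (start, end) frame indices of each unit, end exclusive.
    Returns one (relative_position, amplitude) pair per slot, or None
    for an empty slot. The relative position runs from 0.0 (first
    frame of the slot) to 1.0 (last frame).
    """
    features = []
    for start, end in slots:
        segment = coeffs[start:end]
        if not segment:
            features.append(None)
            continue
        # Index of the largest coefficient inside the slot.
        k = max(range(len(segment)), key=segment.__getitem__)
        rel_pos = k / (len(segment) - 1) if len(segment) > 1 else 0.0
        features.append((rel_pos, segment[k]))
    return features
```

Source and target feature pairs obtained this way could then serve as training data for a regression tree, one tree per wavelet level.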

Testing the system on diverse databases. The proposed method has shown good results for speakers whose speaking styles are clearly different; consequently, the system could also be tested on emotional databases, where the prosody is clearly different for every emotion. Moreover, a complete prosody and emotion conversion system requires a detailed conversion of the speaking rate and the duration of the syllables. The approach proposed by [Nav14], modeling the syllable duration with CARTs, would be an appropriate alternative.

REFERENCES

[Abe88] M. Abe, S. Nakamura, K. Shikano, and H. Kuwabara, Voice conversion through vector quantization. Proc. of ICASSP, pp. 565-568, 1988.

[Bec86] M. Beckman and J. Pierrehumbert, Intonational structure in Japanese and English. Phonology Yearbook 3, pp. 255-309, 1986.

[Bel87] M. Bellanger, Adaptive Digital Filters and Signal Analysis. Marcel Dekker, 1987.

[Che03] Y. Chen, M. Chu, E. Chang, J. Liu, and R. Liu, Voice conversion with GMM and MAP adaptation. Proc. of Interspeech, pp. 2413-2416, 2003.

[CMU] CMU ARCTIC databases for speech synthesis, J. Kominek and A. Black. http://festvox.org/cmu_arctic/index.html, last access, 10-2-2014.

[Dau92] I. Daubechies, Ten Lectures on Wavelets. SIAM, 1992.

[de 93] S. de Jong, SIMPLS: An alternative approach to partial least squares regression. Chemometrics and Intelligent Laboratory Systems, pp. 251-263, 1993.

[Dem77] A. Dempster, N. Laird, and D. Rubin, Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 39, no. 1:pp. 1-38, 1977.

[Dur60] J. Durbin, The fitting of time-series models. Revue de l'Institut International de Statistique 28, pp. 233-244, 1960.

[Dux04] H. Duxans, A. Bonafonte, A. Kain, and J. v. Santen, Including dynamic and phonetic in voice conversion systems. Proc. of ICSLP, pp. 5-8, 2004.

[Emb04] M. Embrechts, B. Szymanski, and K. Sternickel, Ch 10: Introduction to scientific data mining: Direct kernel methods and applications. In Computationally Intelligent Hybrid Systems: The Fusion of Soft Computing and Hard Computing, pp. 317-363, Wiley Interscience, 2004.

[Err10a] D. Erro, A. Moreno, and A. Bonafonte, INCA algorithm for training VC systems from nonparallel corpora. IEEE Transactions on Audio, Speech, and Language Processing, 2010.

[Err10b] D. Erro, A. Moreno, and A. Bonafonte, Voice conversion based on weighted frequency warping. IEEE Transactions on Audio, Speech, and Language Processing, 2010.

[Gil03] B. Gillet and S. King, Transforming F0 contours. Eurospeech, pp. 101-104, 2003.

[Gro85] A. Grossman, J. Morlet, and T. Paul, Transforms associated to square integrable group representations. Journ. Math. Phys., 1985.

[Haa10] A. Haar, Zur Theorie der orthogonalen Funktionensysteme. Math. Annal, 69, pp. 331-371, 1910.

[Hel07] E. Helander and J. Nurminen, A novel method for prosody prediction in voice conversion. Proc. of ICASSP, 2007.

[Hel10] E. Helander, J. Nurminen, J. Míguez, and M. Gabbouj, Maximum a posteriori voice conversion using sequential Monte Carlo methods. Proc. of Interspeech, pp. 1716-1719, 2010.

[Hel12] E. Helander, H. Silen, T. Virtanen, and M. Gabbouj, Voice conversion using dynamic kernel partial least squares regression. IEEE Transactions on Audio, Speech, and Language Processing, pp. 806-817, 2012.

[Hua92] X. Huang, A. Acero, and H. Hon, Spoken Language Processing. Prentice Hall PTR, 1992.

[Hym85] L. Hyman, A theory of phonological weight. Foris Publications, 1985.

[Ina03] Z. Inanoglu, Transforming Pitch in a Voice Conversion Framework. Master's thesis, St. Edmund's College, University of Cambridge, 2003.

[Ina07] Z. Inanoglu and S. Young, A system for transforming the emotion in speech: combining data-driven conversion techniques for prosody and voice quality. Proc. of Interspeech, 2007.

[Kai01] A. Kain and M. Macon, Spectral voice conversion for text-to-speech synthesis. Proc. of ICASSP, pp. 813-816, 2001.

[Kaw99] H. Kawahara, I. Masuda-Katsuse, and A. de Cheveigné, Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds. Speech Communication, 27:pp. 187-207, 1999.

[Lei10] M. Lei, Y. Wu, F. K. Soong, Z. Ling, and L. Dai, A hierarchical F0 modeling method for HMM-based speech synthesis. Proc. of Interspeech, 2010.

[Lev46] N. Levinson, The Wiener root-mean-square error criterion in filter design and prediction. Journal of Mathematics and Physics 25, pp. 261-278, 1946.

[Lib77] M. Liberman and A. Prince, On stress and linguistic rhythm. Linguistic Inquiry 8, pp. 249-336, 1977.

[Lie67] P. Lieberman, Intonation, perception and language. MIT Press, Cambridge, Mass., 1967.

[Mal98] S. Mallat, A wavelet tour of signal processing. Academic Press, 1998.

[Nav14] S. Navarro, Automatic conversion of emotion within a speaker independent framework. Master's thesis, Tampere University of Technology, 2014.

[Nes86] M. Nespor and I. Vogel, Prosodic Phonology. Dordrecht: Foris, 1986.

[Nur06] J. Nurminen, V. Popa, J. Tian, Y. Tang, and I. Kiss, A parametric approach for voice conversion. Proc. of the TC-STAR Workshop on Speech-to-Speech Translation, 2006.

[Pri83] A. Prince, Relating to the grid. Linguistic Inquiry 14, pp. 19-100, 1983.

[Ric95] J. Rice, Mathematical Statistics and Data Analysis. Duxbury Press, 1995.

[Sak78] H. Sakoe and S. Chiba, Dynamic programming algorithm optimization for spoken word recognition. IEEE Trans. on Acoustics, Speech and Signal Processing, pp. 43-49, 1978.

[Sel80] E. Selkirk, The role of prosodic categories in English word stress. Linguistic Inquiry 11, pp. 563-605, 1980.

[Sel86] E. Selkirk, Phonology and Syntax. Cambridge: MIT Press, 1986.

[She96] Y. Sheng, Wavelet transform. In The transforms and applications handbook, pp. 747-827, The Electrical Engineering Handbook Series, 1996.

[Sil13] H. Silen, J. Nurminen, E. Helander, and M. Gabbouj, Voice conversion for non-parallel datasets using dynamic kernel partial least squares regression. Proc. of Interspeech, 2013.

[SPT] SPTK toolkit, Speech signal processing toolkit (SPTK) version 3.7. http://sp-tk.sourceforge.net/, last access, 5-2-2014.

[Sty98] Y. Stylianou, O. Cappé, and E. Moulines, Continuous probabilistic transform for voice conversion. IEEE Transactions on Speech and Audio Processing, 6(2):pp. 131-142, 1998.

[Sty05] Y. Stylianou, Modeling speech based on harmonic plus noise models. In Nonlinear Speech Modeling and Applications, Springer Berlin / Heidelberg, 2005.

[Sun13] A. Suni, D. Aalto, T. Raitio, P. Alku, and M. Vainio, Wavelets for intonation modeling in HMM speech synthesis. 8th ISCA Workshop on Speech Synthesis, pp. 285-290, 2013.

[Tod07] T. Toda, A. Black, and K. Tokuda, Voice conversion based on maximum-likelihood estimation of spectral parameter trajectory. IEEE Transactions on Audio, Speech, and Language Processing, pp. 2222-2235, 2007.

[Tok94] K. Tokuda, T. Kobayashi, T. Masuko, and S. Imai, Mel-generalized cepstral analysis - a unified approach to speech spectral estimation. Proc. of the ICSLP, 1994.

[Vai13] M. Vainio, A. Suni, and D. Aalto, Continuous wavelet transform for analysis of speech prosody. TRASP, pp. 78-81, 2013.

[Wan08] C. Wang, Z. Ling, B. Zhang, and L. Dai, Multi-layer F0 modeling for HMM-based speech synthesis. Proc. ISCSLP, pp. 129-132, 2008.

[Wik] Wikipedia, Speech production. Online, accessed Feb. 10, 2014, available: http://en.wikipedia.org/wiki/Speech_production.

[You80] R. M. Young, An Introduction to Nonharmonic Fourier Series. Academic Press, New York, 1980.

In document Prosody and Wavelets: Towards a natural speaking style conversion (Page 63-68)