
Chapter 7 Conclusions and future work

7.2 Future work

In this study, only a limited amount of speech data was used for accent analysis and spectral training. Speech segmentation was carried out manually, which is a time-consuming and tedious process even at word level. In future, a larger speech corpus should be considered. Reasonably accurate segmentation at phoneme level can be achieved with an automatic speech segmentation tool, such as HMM-based segmentation. This would remove the tedious work of manual segmentation, and a more accurate transformation function could then be developed from a larger training data set at phoneme level.
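The idea behind HMM-based segmentation can be illustrated with a much-simplified forced-alignment sketch. It is not the tool proposed above: it assumes one state per phone, no transition probabilities, and per-frame log-likelihoods that have already been computed by some acoustic model. The function name `force_align` and the toy scores are hypothetical.

```python
def force_align(frame_scores, n_phones):
    """Return the start frame of each phone in a known left-to-right sequence.

    frame_scores[t][p] is the log-likelihood of frame t under the p-th phone
    of the transcription. Every phone must cover at least one frame.
    Simplification (assumption): one state per phone, no transition costs.
    """
    n_frames = len(frame_scores)
    if n_phones < 1 or n_frames < n_phones:
        raise ValueError("need at least one frame per phone")
    neg_inf = float("-inf")
    # best[t][p]: score of the best path that ends in phone p at frame t
    best = [[neg_inf] * n_phones for _ in range(n_frames)]
    # entered[t][p]: True if that best path moved from phone p-1 to p at frame t
    entered = [[False] * n_phones for _ in range(n_frames)]
    best[0][0] = frame_scores[0][0]
    for t in range(1, n_frames):
        for p in range(n_phones):
            stay = best[t - 1][p]
            advance = best[t - 1][p - 1] if p > 0 else neg_inf
            if advance > stay:
                best[t][p] = advance + frame_scores[t][p]
                entered[t][p] = True
            elif stay > neg_inf:
                best[t][p] = stay + frame_scores[t][p]
    # Backtrack from the last phone at the last frame to recover boundaries.
    starts = [0] * n_phones
    p = n_phones - 1
    for t in range(n_frames - 1, 0, -1):
        if entered[t][p]:
            starts[p] = t
            p -= 1
    return starts


# Toy scores (made up): 6 frames, 3 phones in transcription order.
toy_scores = [
    [0, -5, -5], [0, -5, -5],
    [-5, 0, -5], [-5, 0, -5], [-5, 0, -5],
    [-5, -5, 0],
]
print(force_align(toy_scores, 3))
```

In a practical system the scores would come from trained phone HMMs and the alignment would run over the HMM states, but the dynamic-programming step that places the phoneme boundaries follows the same pattern.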

In this study, spectral mapping between the two accents was performed via model-based transformation. In this method, a model based on acoustic features of speech (the first three formant frequencies) was built for the two accents. A transformation function was then generated in a training phase. In the conversion phase, the trained transformation function was used to predict target speech features from a new set of source speech features. Finally, the predicted features were used to produce the transformed speech at the synthesis stage. The advantage of model-based conversion is that it is text independent: it can convert unconstrained words and utterances into a desired voice or accent. However, model-based conversion depends heavily on the accuracy of the model, which in turn depends on the size of the training data set; a larger training data set can improve the accuracy of the model. In future, the effect of the amount of training data and of the quality of the speech data, such as the strength and authenticity of the accents produced by the speaker, can be explored.
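The training and conversion phases described above can be sketched in a few lines. The sketch below is only an illustration under stated assumptions: it swaps in a simple per-formant least-squares line (slope and offset for each of F1, F2 and F3) in place of the transformation function used in this thesis, and it assumes the source and target frames are already time-aligned. The function names and the toy formant values are hypothetical.

```python
from statistics import mean


def fit_formant_map(source_frames, target_frames):
    """Training phase: fit one (slope, offset) pair per formant.

    source_frames and target_frames are equal-length lists of time-aligned
    (F1, F2, F3) tuples in Hz. Ordinary least squares is an assumption made
    for this sketch, not the mapping used in the study.
    """
    if len(source_frames) != len(target_frames) or len(source_frames) < 2:
        raise ValueError("need at least two time-aligned frame pairs")
    params = []
    for k in range(3):  # F1, F2, F3
        xs = [frame[k] for frame in source_frames]
        ys = [frame[k] for frame in target_frames]
        x_bar, y_bar = mean(xs), mean(ys)
        sxx = sum((x - x_bar) ** 2 for x in xs)
        if sxx == 0:
            raise ValueError("source formant shows no variation")
        sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
        slope = sxy / sxx
        params.append((slope, y_bar - slope * x_bar))
    return params


def convert_frame(params, source_frame):
    """Conversion phase: predict target formants for one new source frame."""
    return tuple(a * f + b for (a, b), f in zip(params, source_frame))


# Toy training pairs (made-up values, not data from the study).
source = [(300.0, 2200.0, 2900.0), (500.0, 1800.0, 2600.0), (700.0, 1200.0, 2500.0)]
target = [(340.0, 2080.0, 2950.0), (580.0, 1720.0, 2650.0), (820.0, 1180.0, 2550.0)]
mapping = fit_formant_map(source, target)
print(convert_frame(mapping, (400.0, 2000.0, 2700.0)))
```

With only a handful of training frames the fitted slopes and offsets are poorly constrained, which is the dependence on training-set size noted above; the same structure accepts a richer model in place of the straight-line fit.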

Since only a small fraction of speech carries discriminative information about accent, and these distinctions are sometimes subtle and hard for untrained ears to pick up, accent conversion is more challenging than voice conversion in terms of accent modelling and perceptual evaluation of accent.

Accent conversion, and its application to purposes such as language learning or manipulating the accent of a film recording, remains a challenging subject. The research in this thesis was based on only two specific British accents; however, it provides an alternative approach to exploring accent analysis and conversion.


References

1. Murray, I. R. and Arnott, J. L. (2008) Applying an analysis of acted vocal emotions to improve the simulation of synthetic speech. Computer Speech and Language, 22, pp.107-129.

2. Iida, A., Campbell, N., Higuchi, F. and Yasumura, M. (2003) A corpus-based speech synthesis system with emotion. Speech Communication, 40, pp.161-187.

3. Murray, I. R. and Arnott, J. L. (1995) Implementation and testing of a system for producing emotion-by-rule in synthetic speech. Speech Communication, 16, pp.369-390.

4. Cahn, J. E. (1990) The generation of affect in synthesized speech. Journal of the American Voice I/O Society, 8, pp.1-19.

5. Murray, I. R., Arnott, J. L. and Rohwer, A. (1996) Emotional stress in synthetic speech: progress and future directions. Speech Communication, 20, pp.85-91.

6. Shuang, Z., Bakis, R. and Qin, Y. (2008) IBM Voice Conversion Systems for 2007 TC-STAR Evaluation. Tsinghua Science and Technology, 13, pp.510-514.

7. Hasan, M. M., Nasr, A. M. and Sultana, S. (2005) An approach to voice conversion using feature statistical mapping. Applied Acoustics, 66(5), pp.513-532.

8. Arslan, L. M. (1999) Speaker Transformation Algorithm using Segmental Codebooks (STASC). Speech Communication, 28(3), pp.211-226.

9. Wells, J. C. (1982) Accents of English. Vol. 1, 2, 3. Cambridge University Press

10. Kuwabara, H. and Sagisaka, Y. (1995) Acoustic characteristics of speaker individuality: Control and conversion. Speech Communication, 16(2), pp.165-173.

11. Cook, P. R. (1993) SPASM, a real time vocal tract physical model controller, and singer, the companion software synthesis system. Computer Music, 17, pp.30-44.

12. Karlsson, I. (1992) Modelling voice variations in female speech synthesis. Speech Communication, 11, pp.491-495.

13. Milenkovic, P. H. (1993) Voice source model for continuous control of pitch period. Acoustical Society of America, 93, pp.1087-1096.

14. Matsumoto, S. H., Hiki, T. S. and Nimura, T. (1973) Multidimensional representation of personal quality of vowels and its acoustical correlates. IEEE Transactions on Audio and Electroacoustics, 21(5), pp.428-436.

15. Fant, G. (1960) Acoustic theory of speech production. The Hague: Mouton.

16. Foulkes, P. and Docherty, G. J. (1999) Urban voices: overview, in Urban voices. Editors: Foulkes, P. and Docherty, G. J. Arnold: London, pp.1-24.

17. Ikeno, A. and Hansen, J. H. L. (2007) The effect of listener accent background on accent perception and comprehension. EURASIP Journal

18. D'Arcy, S. M., Russell, M. J., Browning, S. R. and Tomlinson, M. J. (2004) The accents of the British Isles (ABI) corpus. MIDL Workshop, Paris, pp.115-119.

19. Yan, Q., Saeed, V., Dimitrios, R. and Ho, C. (2003) Analysis of acoustic correlates of British, Australian and American Accent. IEEE Workshop on Automatic Speech Recognition and Understanding, pp.345-350.

20. Watt, D., and Milroy, L. (1999) Patterns of variation and change in three Newcastle vowels: is this dialect levelling, in Urban voices. Editors: Foulkes, P. and Docherty, G. J. Arnold: London, pp.25-46.

21. Clopper, C. G., Pisoni, D. B. and Jong, K. D. (2005) Acoustic characteristics of the vowel systems of six regional varieties of American English. Acoustical Society of America, 118(3), pp.1661-1676.

22. Adank, P., Hout, R. V. and Velde, H. V. (2007) An acoustic description of the vowels of northern and southern standard Dutch II: Regional varieties. Acoustical Society of America, 121(2), pp.1130-1141.

23. Arslan, L. M. and Hansen, J. H. L. (1996) Language accent classification in American English. Speech Communication, 18(4), pp.353-367.

24. Zheng, Y. L., Sproat, R., Gu, L. and Shafran, I. (2005) Accent detection and speech recognition for Shanghai-accented mandarin. Interspeech. Lisbon, pp.217-220.

25. Ghorshi, S., Vaseghi, S. and Yan, Q. (2008) Cross-entropic comparison of formants of British, Australian and American English accents. Speech Communication, 50, pp.564-579.

26. Angkititrakul, P. and Hansen, J. H. L. (2006) Advances in phone-based modeling for automatic accent classification. IEEE Transactions on Audio, Speech, and Language Processing, pp.634-646.

27. Huckvale, M. (2004) ACCDIST: a metric for comparing speakers' accents. International Conference on Spoken Language Processing. Korea.

28. Felps, D., Bortfeld, H. and Gutierrez-Osuna, R. (2009) Foreign accent conversion in computer assisted pronunciation training. Speech Communication, 51(10), pp.920-932.

29. Yamagishi, J., Usabaev, B., King, S., Watts, O., Dines, J., Tian, J., Guan, Y., Hu, R., Oura, K., Wu, Y. J., Tokuda, K., Karhila, R. and Kurimo, M. (2010) Thousands of voices for HMM-based speech synthesis: analysis and application of TTS systems built on various ASR corpora. IEEE Transactions on Audio, Speech, and Language Processing, pp.984-1004.

30. Huckvale, M. and Yanagisawa, K. (2007) Spoken language conversion with accent morphing. ISCA Speech Synthesis Workshop. Bonn, Germany. pp.64-70.

31. Kleijn, K. and Paliwal, K. (1998) Speech coding and synthesis, Elsevier Science B.V., The Netherlands.

32. Roth, K., Kauppinen, I., Esquef, P. and Välimäki, V. (2003) Frequency warped Burg's method for AR-modeling. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics. New Paltz, New York. pp.5-

34. http://accent.gmu.edu/.

35. www.fon.hum.uva.nl/praat/.

36. Mixdorff, H. and Pfitzinger, H. R. (2005) Analysing fundamental frequency contours and local speech rate in map task dialogs. Speech Communication, 46, pp.310-325.

37. Yoon, K. (2008) Design and evaluation of prosodically-sensitive concatenative units for a Korean TTS system. Computer Speech and Language, 22(3), pp.273-294.

38. Weil, K. S., Fitch, J. L. and Wolf, V. I. (2000) Diphthong changes in style shifting from southern English to standard American English. Journal of Communication Disorders, 33, pp.151-163.

39. Gerosa, M., Giuliani, D. and Brugnara, F. (2009) Towards age-independent acoustic modeling. Speech Communication, 51, pp.499-509.

40. Brugnara, F., Falavigna, D. and Omologo, M. (1993) Automatic segmentation and labeling of speech based on Hidden Markov Models. Speech Communication, 12(4), pp.357-370.

41. Ajmera, J., Mccowan, I. and Bourlard, H. (2003) Speech/music segmentation using entropy and dynamism features in a HMM classification framework. Speech Communication, 40(3), pp.351-363.

42. Turk, O. and Arslan, L. M. (2006) Robust processing techniques for voice conversion. Computer Speech and Language, 20, pp.441-467.

43. http://www.phon.ucl.ac.uk/resource/sfs.

44. http://htk.eng.cam.ac.uk.

45. Chinn, C. and Thorne, S. (2001) Proper Brummie: A dictionary of Birmingham words and phrases. Brewin Books.

46. Ladefoged, P. (1996) Elements of acoustic phonetics (2nd ed.). Chicago: The University of Chicago Press.

47. Ladefoged, P. (2000) A course in phonetics (4th ed.). Fort Worth, TX: Harcourt Brace Jovanovich.

48. Childers, D. G. (1978) Modern spectrum analysis. IEEE Press, New York. pp.252-255.

49. Childers, D. G. (2000) Speech processing and synthesis toolboxes. New York: Wiley.

50. O'Shaughnessy, D. (1987) Speech communication: human and machine. Addison-Wesley Publishing Company.

51. Deterding, D. (1997) The formants of monophthong vowels in Standard Southern British English Pronunciation. Journal of the International Phonetic Association, 27, pp.47-55.

52. Cherif, P., Bouafif, L. and Dabbabi, T. (2001) Pitch detection and formant analysis of Arabic speech processing. Applied Acoustics, 62, pp.1129-1140.

53. Veprek, P. and Scordilis, M. S. (2002) Analysis, enhancement and evaluation of five pitch determination techniques. Speech Communication, 37, pp.249-270.

54. Boersma, P. (1993) Accurate short-term analysis of the fundamental frequency and the harmonics-to-noise ratio of a sampled sound. Proceedings of the Institute of Phonetic Sciences, University of Amsterdam. pp.97-110.

55. Krishnan, A., Xu, Y., Gandour, J. T. and Cariani, P. A. (2004) Human frequency-following response: representation of pitch contours in Chinese tones. Hearing Research, 189, pp.1-12.

56. Esther, G., Brechtje, P., Francis, N. and Kimberley, F. (2000) Pitch accent realization in four varieties of British English. Journal of Phonetics, 28, pp.161-185.

57. Grover, C., Jamieson, D. G. and Dobrovolsky, M. B. (1987) Intonation in English, French and German: Perception and production. Language and Speech, 30(3), pp.277-296.

58. Mannell, R. (2002) Modelling of the segmental and prosodic aspects of speech intensity in synthetic speech. Proceedings of the Ninth Australian International Conference on Speech Science and Technology. Melbourne.

59. Scobbie, J. M., Hewlett, N. and Turk, A. E. (1999) Standard English in Edinburgh and Glasgow: the Scottish vowel length rule revealed, in Urban voices. Editors: Foulkes, P. and Docherty, G. J. Arnold: London, pp.230-245.

60. Moulines, E. and Sagisaka, Y. (1995) Voice conversion: State of the art and perspectives. Speech Communication, 16(2), pp.125-126.

61. Mouchtaris, A., Spiegel, J. V. and Mueller, P. (2004) Non-parallel training for voice conversion by maximum likelihood constrained adaptation. IEEE International Conference on Acoustics, Speech and Signal Processing. pp.I1-I4.

62. Valbret, H., Moulines, E. and Tubach, J. P. (1992) Voice transformation using PSOLA technique. Speech Communication, 11, pp.175-187.

63. Wu, C. H., Hsia, C. C., Liu, T. H. and Wang, J. F. (2006) Voice conversion using duration-embedded bi-HMMs for expressive speech synthesis. IEEE Transactions on Audio, Speech and Language Processing, 14(4), pp.1109-1116.

64. Loots, L. and Niesler, T. (2001) Automatic conversion between pronunciations of different English accents. Speech Communication, 53(1), pp.75-84.

65. Mahadeva Prasanna, S. R., Gupta, C. S. and Yegnanarayana, B. (2006) Extraction of speaker-specific excitation information from linear prediction residual of speech. Speech Communication, 48(10), pp.1243-1261.

66. Mizuno, H. and Abe, M. (1995) Voice conversion algorithm based on piecewise linear conversion rules of formant frequency and spectrum tilt. Speech Communication, 16, pp.153-164.

67. Narendranath, M., Murthy, H. A., Rajendran, S. and Yegnanarayana, B. (1995) Transformation of formants for voice conversion using artificial neural networks. Speech Communication, 16, pp.207-216.

68. Abe, M., Nakamura, S., Shikano, K. and Kuwabara, H. (1998) Voice conversion through vector quantization. Proceedings of the IEEE Int.

69. Stylianou, Y., Cappé, O. and Moulines, E. (1998) Continuous probabilistic transformation for voice conversion. IEEE Transactions on Speech and Audio Processing, pp.131-142.

70. Toda, T., Saruwatari, H. and Shikano, K. (2000) STRAIGHT-based voice conversion algorithm based on Gaussian mixture model. Proceedings of the International Conference on Spoken Language Processing, 3, pp.279-282.

71. Toda, T., Saruwatari, H. and Shikano, K. (2001) Voice conversion algorithm based on Gaussian Mixture Model with Dynamic Frequency Warping of STRAIGHT spectrum. Proceedings of the International Conference on Acoustics, Speech and Signal Processing, pp.841-844.

72. Chen, Y., Zheng, Z. and Wang, C. (2003) Voice conversion with smoothed GMM and MAP adaptation. Eurospeech, pp.2413-2416.

73. Rao, K. S. (2010) Voice conversion by mapping the speaker-specific features using pitch synchronous approach. Computer Speech and Language, 24, pp.474-494.

74. Guido, R. C., Vieira, L. S., Júnior, S. B., Sanchez, F. L., Maciel, C. D., SILVA Fonseca, E. S. and Pereira, J. C. (2007) A neural-wavelet architecture for voice conversion. Neurocomputing, 71, pp.174-180.

75. Suendermann, D., Ney, H. and Hoege, H. (2003) VTLN-based cross-language voice conversion. Proceedings of Automatic Speech Recognition and Understanding, pp.676-681.

76. Makhoul, J. (1975) Linear prediction: A tutorial review. Proceedings of the IEEE, 63(4), pp.561-580.

77. Malcangi, M. and Frontini, D. (2009) Language-independent, neural network-based, text-to-phones conversion. Neurocomputing, 73(1-3), pp.87-96.

78. Scordilis, M. S. (1996) A neural formant synthesizer. Mathematics and Computers in Simulation, 40, pp.615-622.

79. Desai, S., Raghavendra, E. R., Yegnanarayana, B., Black, A. W. and Prahallad, K. (2009) Voice conversion using artificial neural networks. IEEE International Conference on Acoustics, Speech and Signal Processing, pp.3893-3896.

80. Toda, T., Black, A. W. and Tokuda, K. (2008) Statistical mapping between articulatory movements and acoustic spectrum using a Gaussian mixture model. Speech Communication, 50(3), pp.215-227.

81. Jordan, M. I. and Xu, L. (1995) Convergence results for the EM approach to mixtures of experts architectures. Neural Networks, 8(9), pp.1409-1431.

82. Ghahramani, Z. and Jordan, M. I. (1994) Supervised learning from incomplete data via an EM approach. Advances in Neural Information Processing Systems, 6, pp.120-127.

83. Kain, A. B. (2001) High resolution voice transformation. PhD thesis, Department of Computer Science and Engineering. Oregon Health and Science University.


84. Bandyopadhyay, S. and Maulik, U. (2002) An evolutionary technique based on K-Means algorithm for optimal clustering. Information Sciences, 146(1-4), pp.221-237.

85. Toda, T. (2003) High-quality and flexible speech synthesis with segment selection and voice conversion. PhD thesis, Department of Information Processing. University of Nara Institute of Science and Technology.

86. Verhelst, W. (2000) Overlap-add methods for time-scaling of speech. Speech Communication, 30, pp.207-221.

87. Arslan, L. M. and Talkin, D. (1998) Speaker transformation using sentence HMM based alignment and detailed prosody modification. Proceedings of International Conference on Acoustics, Speech and Signal Processing, 1, pp.289-292.

88. Moulines, E. and Charpentier, F. (1990) Pitch-synchronous waveform processing techniques for text-to-speech synthesis using diphones. Speech Communication, 9, pp.453-467.

89. Lin, Q., Jan, E., Che, C., Yuk, D. and Flanagan, J. (1996) Selective use of the speech spectrum and a VQGMM method for speaker identification. International Conference on Spoken Language Processing, pp.2415-2418.


Appendix I

Text material used in this study

Group A: Each sentence was uttered 10 times in each accent

Sentence 1 It was more like sugar.
Sentence 2 This is no place for you.
Sentence 3 A burst of laughter was his reward.
Sentence 4 He was an athlete and a giant.
Sentence 5 The issue was not in doubt.

Group B: Each sentence was uttered once in each accent

Sentence 1 Will we ever forget it?
Sentence 2 If I ever needed a fighter in my life I need one now.
Sentence 3 There was a change now.
Sentence 4 It occurred to me that there would have to be an accounting.
Sentence 5 I had faith in them.
Sentence 6 She turned in at the hotel.
Sentence 7 I was the only one who remained sitting.
Sentence 8 We’ll have to watch our chances.
Sentence 9 Meanwhile I’ll go out to breathe a spell.
Sentence 10 He moved away as quietly as he had come.
Sentence 11 It was a curious coincidence.
Sentence 12 There was nothing on the rock.
Sentence 13 Anyway, no one saw her like that.
Sentence 14 The men stared into each other’s face.
Sentence 15 What was the object of your little sensation?
Sentence 16 There has been a change, she interrupted him.
Sentence 17 His face was streaming with blood.
Sentence 18 The nightglow was treacherous to shoot by.
Sentence 19 For a full minute he crouched and listened.
Sentence 20 You must sleep, he urged.

Group C: The paragraph was uttered once in each accent

Please call Stella. Ask her to bring these things with her from the store: six spoons of fresh snow peas, five thick slabs of blue cheese, and maybe a snack for her brother Bob. We also need a small plastic snake and a big toy frog for the kids. She can scoop these things into three red bags, and we will go meet her Wednesday at the train station.


Appendix II

Twelve vowels and the words from which the vowels were extracted

Table 1 Twelve vowels and the words from which the vowels were extracted

Vowel  Example  Phonetic spelling  Word in this study  Phonetic spelling
AA     barn     B AA N             laughter            L AA F T AX
AE     pat      P AE T             and                 AE N D
AO     born     B AO N             more                M AO
AX     about    AX B AW T          sugar               S UH G AX
ER     burn     B ER N             burst               B ER S T
IH     pit      P IH T             his                 HH IH Z
OH     pot      P OH T             not                 N OH T
UH     good     G UH D             sugar               S UH G AX
UW     boon     B UW N             you                 Y UW
AY     buy      B AY               like                L AY K
EY     bay      B EY               place               P L EY S
OW     loan     L OW N             no                  N OW

*The British English Example Pronunciation Dictionary (BEEP) [90] was used for phonemic transcription.

In document Accent Conversion via Formant-based Spectral Mapping and Pitch Contour Modification (Page 98-108)