4.3 Impact of Mismatch between Adaptation & Synthesis Languages
4.3.7 Follow-Up 2: Effects of the Number of Iterations of Transform Estimation
mapping-based system captured more undesirable language information than a single global transform did and thus led to worse adaptation performance. By the same reasoning, iteratively re-estimating the transforms could also introduce more undesirable language information into a data mapping-based system.
An experiment was carried out to measure the impact of the number of transform estimation iterations. Cross-lingual speaker adaptation by data mapping was carried out on the average voice AV-ENG-UK with adaptation data CMN-100 and ADP-DEU-100 in 20 speakers’ voices. Two sets of CSMAPLR transforms for the synthesis of DATA-TEST-ENG-25, one containing a single global transform and the other containing multiple regression class-specific transforms, were each estimated with one to six iterations.
Mel-cepstral distortion on the test data set DATA-TEST-ENG-25 was calculated for the 20 target speakers and is presented in Figure 4.11.
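For readers unfamiliar with the measure, the following is a minimal sketch of how mel-cepstral distortion is commonly computed, using the widely used definition (10 / ln 10) · sqrt(2 · Σ_d (c_d − c'_d)²) averaged over frames. It assumes that the reference and synthesized frames are already time-aligned and that the energy coefficient c0 is excluded; both points, as well as the function and parameter names, are assumptions of this sketch rather than details confirmed by the text.

```python
import math


def mel_cepstral_distortion(ref_frames, syn_frames, skip_c0=True):
    """Return the mean mel-cepstral distortion (in dB) over aligned frames.

    ref_frames, syn_frames: equal-length lists of mel-cepstral vectors.
    skip_c0: assumption of this sketch; drops the energy term c0.
    """
    if not ref_frames or len(ref_frames) != len(syn_frames):
        raise ValueError("frame sequences must be non-empty and time-aligned")

    start = 1 if skip_c0 else 0
    scale = 10.0 / math.log(10.0) * math.sqrt(2.0)

    total = 0.0
    for ref, syn in zip(ref_frames, syn_frames):
        # Squared Euclidean distance between the two cepstral vectors.
        squared = sum((r - s) ** 2 for r, s in zip(ref[start:], syn[start:]))
        total += scale * math.sqrt(squared)
    return total / len(ref_frames)


if __name__ == "__main__":
    reference = [[1.0, 0.5, 0.2], [1.1, 0.4, 0.1]]
    synthesized = [[0.9, 0.4, 0.2], [1.0, 0.4, 0.3]]
    print(mel_cepstral_distortion(reference, synthesized))
```

A lower value indicates that the synthesized spectrum is closer to the target speaker's natural speech, which is how the curves in Figure 4.11 should be read.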
As we anticipated, estimating adaptation transforms by data mapping in an iterative manner is detrimental to cross-lingual speaker adaptation most of the time. In particular, as Figure 4.11 shows, mel-cepstral distortion on DATA-TEST-ENG-25 consistently increases (i) when the input language is substantially phonologically distinct from the output language (e.g., Mandarin to English adaptation), regardless of whether a single global transform or multiple regression class-specific transforms are estimated, and (ii) even when the languages are much closer (e.g., German to English adaptation) if multiple regression class-specific transforms are estimated.
4.4 Conclusions
Two main issues have been covered in this chapter. Firstly, the possibility of employing cross-lingual speaker adaptation in an unsupervised fashion in the context of personalized speech-to-speech translation was investigated.
Unsupervised cross-lingual speaker adaptation was implemented by combining the recently developed decision tree marginalization and HMM state mapping techniques. It was observed that unsupervised cross-lingual speaker adaptation was comparable to supervised adaptation in terms of spectrum adaptation in the scenario of personalized speech-to-speech translation, even though the automatically obtained transcriptions of the adaptation data had a very high phoneme error rate. This is the hoped-for outcome: in subsequent research on the personalization of speech-to-speech translation, researchers can simply focus on the supervised case.
The second issue was how language mismatch degrades HMM state mapping-based cross-lingual speaker adaptation. This chapter demonstrated how the various sources of language mismatch affected the different adaptation systems. From these results, it can be concluded that although HMM state mapping is an effective method of relating two different languages, it remains sensitive to the negative impact of language mismatch. Reducing this mismatch is thus a key to advancing the state of the art. Currently, HMM state mapping rules are always constructed according to the minimum K-L divergence criterion. Alternative mapping criteria have not been investigated.
Moreover, the impacts of the number of regression class-specific transforms and the quantity of adaptation data on cross-lingual speaker adaptation have been investigated. It was found that the performance of cross-lingual speaker adaptation was degraded when many regression class-specific transforms were estimated. From the results of this part of the study, it becomes clear that current approaches are largely unable to take advantage of a large quantity of adaptation data, mainly because the language mismatch between average voice synthesis models and adaptation data introduces too much unwanted language-specific information. In order to better reduce the negative impact of language mismatch and in so doing enable the effective use of a regression class tree, it is necessary to introduce new techniques that model speaker characteristics and inherent differences between languages separately, or to find a new method of growing a regression class tree.
Figure 4.11 – Mel-cepstral distortion of data mapping systems on DATA-TEST-ENG-25 with respect to the number of iterations of transform estimation. The blue and red polylines correspond to estimating a single global transform and multiple regression class-specific transforms, respectively.
Lastly, it was found in both investigations that the data mapping approach outperforms the transform mapping approach. Consequently, only the data mapping approach will be investigated in the following work. It was also found that estimating adaptation transforms iteratively in the data mapping approach is detrimental to the performance of cross-lingual speaker adaptation. Thus, in the experiments in Chapter 5 only a single iteration of transform estimation is employed, unless otherwise stated.
The contributions presented in this chapter were originally published in the following conference papers:
– Hui LIANG, John DINES and Lakshmi SAHEER, “A Comparison of Supervised and Unsupervised Cross-Lingual Speaker Adaptation Approaches for HMM-Based Speech Synthesis”, Proc. of ICASSP, pp. 4598–4601, March 2010.
– Hui LIANG and John DINES, “An Analysis of Language Mismatch in HMM State Mapping-Based Cross-Lingual Speaker Adaptation”, Proc. of Interspeech, pp. 622–625, September 2010.
Using Phonological Knowledge
In the previous chapter, HMM state mapping with the K-L divergence as a measure of the similarity between state distributions was shown to be a simple and effective technique that enables cross-lingual speaker adaptation for text-to-speech synthesis. However, the weakness of this technique is also apparent: it constructs state mapping rules based only on the means and variances of HMM state distributions, ignoring any other information that may contribute positively to state mapping construction, for example, the phoneme(s) which an HMM state represents. In this chapter, a jointly data-driven and phonological knowledge-guided approach that produces enhanced state mapping rules is presented: HMM state distributions derived from the input and output languages are clustered according to broad phonetic categories using a decision tree, and state mapping rules are then constructed only within each resultant phonologically consistent cluster as per the minimum K-L divergence criterion.
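The idea of restricting the minimum K-L divergence search to phonologically consistent clusters can be illustrated with a minimal sketch. Several simplifications here are assumptions of the sketch and not the method described in this chapter: each state is a single diagonal-covariance Gaussian, the divergence is symmetrized by summing both directions, each state carries a hand-assigned broad phonetic `category` label in place of the clusters produced by a decision tree, and the mapping direction (from `source_states` to `target_states`) is left generic. All names are hypothetical.

```python
import math


def kl_diag_gauss(mean_p, var_p, mean_q, var_q):
    """K-L divergence KL(p || q) between two diagonal-covariance Gaussians."""
    return 0.5 * sum(
        math.log(vq / vp) + (vp + (mp - mq) ** 2) / vq - 1.0
        for mp, vp, mq, vq in zip(mean_p, var_p, mean_q, var_q)
    )


def symmetric_kl(p, q):
    """Symmetrized divergence (an assumption of this sketch)."""
    return (kl_diag_gauss(p["mean"], p["var"], q["mean"], q["var"])
            + kl_diag_gauss(q["mean"], q["var"], p["mean"], p["var"]))


def map_states(source_states, target_states, same_category_only=True):
    """Map each source state to the target state with minimum divergence.

    With same_category_only=True the search is limited to target states that
    share the source state's broad phonetic category; if no such state
    exists, the search falls back to all target states.
    """
    mapping = {}
    for s_name, s in source_states.items():
        candidates = {
            t_name: t for t_name, t in target_states.items()
            if not same_category_only or t["category"] == s["category"]
        }
        if not candidates:
            candidates = target_states
        mapping[s_name] = min(
            candidates, key=lambda name: symmetric_kl(s, candidates[name])
        )
    return mapping


if __name__ == "__main__":
    source = {"s1": {"mean": [0.0], "var": [1.0], "category": "vowel"}}
    target = {
        "t_vowel": {"mean": [2.0], "var": [1.0], "category": "vowel"},
        "t_fric": {"mean": [0.1], "var": [1.0], "category": "fricative"},
    }
    print(map_states(source, target, same_category_only=False))
    print(map_states(source, target, same_category_only=True))
```

In the toy data, the unconstrained search pairs the vowel state with an acoustically close fricative state, whereas the constrained search keeps the pairing within the vowel category. This is the kind of phonologically implausible mapping that the proposed approach is intended to rule out.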
Apart from this, the previous chapter showed that regression class trees which followed the decision tree structure for state tying provided minimal benefits and usually resulted in degradation of synthesis quality. Thus the basic idea of the jointly data-driven and phonological knowledge-guided approach is also applied to regression class tree growth: HMM state distributions from the output language are clustered according to broad phonetic categories using a decision tree, which is then directly used as a regression class tree for cross-lingual speaker adaptation.
In this chapter, HMM state mapping is presented from the data mapping perspective since the previous chapter has shown a preference for this approach, though the proposed jointly data-driven and phonological knowledge-guided approach may equally generalize to other state mapping approaches. Adaptation of spectrum, which is the dominant component of speaker identity [Türk and Arslan, 2003], is the focus of this research.
One potential source of confusion in this chapter should be noted: two sets of decision trees are discussed, one of which is obtained in the normal training stage of synthesis models while the other is generated during the enhancement of state mapping rules by the jointly data-driven and phonological knowledge-guided approach. The two sets of decision trees serve completely distinct purposes. Furthermore, the trees derived for enhanced state mapping rules are also distinct from those derived for enhanced regression classes.
5.1 Preliminary Investigations
First of all, two preliminary experiments were carried out, in order to test the hypothesis on the sub-optimality of the minimum K-L divergence criterion for determining state mapping rules between average voice synthesis models of two languages.