5.1 VLBR-based voice conversion system
5.1.3 Conversion of the VLBR parameters
In the development of the first version of a VLBR-based voice conversion system, the emphasis was placed on the conversion of the pitch and the LSFs because the first experiments showed these parameters to be particularly important perceptually. Other parameters, such as voicing and the residual spectrum, were used in part as complementary information and were exploited in the model training, but no explicit conversion was performed for them. The conversion of the LSF vectors is performed using an extended vector that also contains the derivative of the LSF vector, to take some dynamic context information into account. This combined feature vector is transformed through GMM modeling, using Equation (5.1). Only the true LSF part is retained after conversion. The conversion utilizes several modes, each with its own GMM of 8 Gaussian components. The number of components was selected based on practical experimentation. In the first implementation described in this section, the modes were decided in a data-driven manner based on the voicing parameter, i.e., the LSF data is clustered during the model training into separate sets using the corresponding voicing information, and similar voicing-based mode selections are used during the conversion phase. The motivation for using voicing-based modes is similar to that of the segmental speech coding approach presented in Section 3.2, i.e., different types of speech signals typically benefit from different types of processing. Also, because the VLBR codec already operates on different segment types, the same voicing-based segmentation decisions can be used directly in VLBR-based voice conversion.
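The extended-vector and mode-selection logic described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the system: the central-difference delta in `append_deltas`, the mode labels, and the `converters` mapping are hypothetical names introduced here, and each converter merely stands in for a mode-specific GMM transform of the kind given by Equation (5.1).

```python
# Sketch only: the delta formula and the mode labels are assumptions,
# not details taken from the thesis.

def append_deltas(frames):
    """Extend each LSF vector with a first-order difference (assumed delta)."""
    extended = []
    last = len(frames) - 1
    for t, frame in enumerate(frames):
        prev = frames[max(t - 1, 0)]
        nxt = frames[min(t + 1, last)]
        # Central difference, with the edge frames repeated at the boundaries.
        delta = [(n - p) / 2.0 for n, p in zip(nxt, prev)]
        extended.append(list(frame) + delta)
    return extended


def convert_lsf_track(frames, voicing, converters):
    """Convert an LSF track frame by frame, one converter per voicing mode.

    converters maps a mode label to a callable that takes an extended
    vector and returns a converted extended vector (for instance, a GMM
    regression trained for that mode).
    """
    dim = len(frames[0])
    converted = []
    for vec, mode in zip(append_deltas(frames), voicing):
        out = converters[mode](vec)
        converted.append(out[:dim])  # keep only the static LSF part
    return converted
```

Keeping the mode decision outside the converters mirrors the description above: the voicing information selects the model, and the model itself only sees the extended vector.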
The pitch parameter is transformed through the associated GMM in the frequency domain using Equation (5.1). During unvoiced parts, the fixed pitch value is left unchanged. The GMM with 8 Gaussian components used for the pitch conversion is trained on aligned data, with the additional requirement of having matched voicing between the source and the target data.
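As an illustration of the pitch conversion step, the sketch below applies a scalar GMM regression to a single pitch value. Equation (5.1) is not reproduced in this subsection, so the code assumes the commonly used joint-density form, in which each component contributes a linear regression weighted by its posterior probability. The function names and the dictionary layout of the components are hypothetical, and the parameter values in the usage example are invented for demonstration only.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a one-dimensional Gaussian."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def convert_pitch(f0, voiced, components):
    """Convert one pitch value with a scalar joint-density GMM (assumed form).

    Each component is a dict with weight w, source mean mu_x, target mean
    mu_y, source variance var_x and cross-covariance cov_yx.
    """
    if not voiced:
        return f0  # unvoiced frames keep their fixed pitch value
    # Weighted likelihood of the source pitch under each component.
    likes = [c["w"] * gaussian_pdf(f0, c["mu_x"], c["var_x"]) for c in components]
    total = sum(likes)
    if total == 0.0:
        return f0  # source value far from every component; leave unchanged
    converted = 0.0
    for like, c in zip(likes, components):
        # Per-component linear regression from source to target pitch.
        regression = c["mu_y"] + c["cov_yx"] / c["var_x"] * (f0 - c["mu_x"])
        converted += like / total * regression
    return converted
```

For example, with a single hypothetical component `{"w": 1.0, "mu_x": 100.0, "mu_y": 200.0, "var_x": 100.0, "cov_yx": 50.0}`, a voiced source pitch of 110 maps to 205, while any unvoiced frame is returned unchanged.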
After the conversion of the pitch parameter, the residual amplitude spectrum is processed accordingly. This processing is needed because the length of the amplitude spectrum vector depends on the pitch value at the corresponding time instant, as discussed earlier in this thesis. This means that the residual spectrum, although essentially unchanged, is re-sampled to fit the dimension dictated by the converted pitch at that time.
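The re-sampling step can be illustrated with the following sketch. It rests on two assumptions that are not stated in this subsection: that the vector holds one amplitude per pitch harmonic below the Nyquist frequency (`harmonic_count`, with an 8-kHz sampling rate as the default), and that linear interpolation is an acceptable re-sampling method. Both function names are hypothetical.

```python
def harmonic_count(f0_hz, sample_rate_hz=8000):
    """Assumed rule: one amplitude per pitch harmonic up to Nyquist."""
    return max(1, int((sample_rate_hz / 2.0) // f0_hz))


def resample_spectrum(amplitudes, new_length):
    """Linearly interpolate an amplitude vector to new_length points."""
    old_length = len(amplitudes)
    if new_length == 1 or old_length == 1:
        return [amplitudes[0]] * new_length
    resampled = []
    for k in range(new_length):
        # Map the new index onto the old index axis, endpoints aligned.
        pos = k * (old_length - 1) / (new_length - 1)
        lo = int(pos)
        hi = min(lo + 1, old_length - 1)
        frac = pos - lo
        resampled.append((1.0 - frac) * amplitudes[lo] + frac * amplitudes[hi])
    return resampled
```

Under these assumptions, converting a frame from a 100-Hz to a 200-Hz pitch would shrink its amplitude vector from 40 to 20 values while preserving the overall spectral shape.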
Once the parameters have been converted as described above, they are used together to re-synthesize the transformed waveform. The signal generation part of the VLBR decoder can be used as such for synthesizing the waveform in a pitch-synchronous manner.
5.1.4 Performance evaluation
The initial VLBR-based voice conversion system described in this section was evaluated in listening tests in the context of the second TC-STAR [TC-13] evaluation campaign. The evaluation covered aspects related to both speaker identity and speech quality. The evaluation was carried out by an independent evaluation agency.
Test set-up
The data set used in the testing included UK English speech data from four different speakers (two female and two male speakers). The training set included 159 sentences per speaker and a distinct testing set consisted of 9 sentences per speaker. The same sentences were recorded from all the speakers.
Among the 12 possible conversion directions, 4 were chosen as the directions included in the test. For the selected directions, the test organizer provided the recorded source sentences used in the test. These source sentences were converted using the voice conversion system to the voices of the target speakers. The converted signals were evaluated by 20 native non-expert listeners.
The listening test included two parts. In the first part, the listeners were asked to evaluate the speaker identity, without considering the speech quality, using the 5-level scale summarized in Table 5.1. The true target signals recorded from the target speakers, available only to the test organizer, were used as the reference.
Table 5.1: Scale used for evaluation of speaker identity. The listeners were asked to evaluate whether the two samples in the given pair were spoken by the same person or not. The real target speaker was used as the reference speaker.
Grade   Meaning
5       Definitely identical
4       Probably identical
3       Not sure
2       Probably different
1       Definitely different

Table 5.2: Scale used in the evaluation of speech quality.

Grade   Meaning
5       Excellent
4       Good
3       Fair
2       Poor
1       Bad
In the second part, the listeners evaluated the perceptual quality of the converted speech using the mean opinion score (MOS) grades shown in Table 5.2.
Results and discussion
The results are summarized in Table 5.3 and Table 5.4. Table 5.3 contains the results from the first part of the listening test, focusing on the evaluation of speaker identity. The results from the speech quality evaluation are summarized in Table 5.4.
Table 5.3: Results from the first part of the evaluation (speaker identity, with the target speaker used as the reference in every sample pair). F denotes a female and M a male speaker. The column Average shows the combined score for all the directions.

Direction   F1 to F2   F1 to M2   M1 to F2   M1 to M2   Average

Table 5.4: Results achieved from the second part of the evaluation (speech quality).

                        MOS score
Achieved score          2.09
Reference 1 (source)    4.80
Reference 2 (target)    4.78

The first observation that can be made from the evaluation results is that there were large differences between the conversion directions. Moreover, despite the moderate average scores, the speaker identity conversion was sometimes perceived as very successful, as indicated by the more detailed sentence-level results not shown here. This can be regarded as a good result for two main reasons. First, the initial system that participated in the evaluation was a rather elementary system that only converted the LSFs and the pitch parameter. Second, the conversion was performed in a frame-wise manner without considering the frame-to-frame evolution of the parameters or the intonation contours.
As can be seen from Table 5.4, a rather low score was achieved in the speech quality evaluation. There are several clear reasons for this. First, the system produced 8-kHz output signals while all the other signals included in the listening test (e.g., the reference samples and the samples from the other TC-STAR participants) had a sampling rate of 16 kHz. Second, the source signals also contained some non-speech elements such as audible breathing, and the parametric speech and conversion models created many audible artifacts at the corresponding places in the output signals. Third, the frame-by-frame conversion made the converted parameter contours, including the pitch contours, somewhat noisy, and this was also audible in the output signals. Fourth, the GMM-based conversion approach is known to have shortcomings related to overfitting and oversmoothing, as discussed, e.g., in [Nur12]. Finally, the fact that not all the parameters were converted also had an impact on the quality.
It should also be noted that the use of the simple interpolation-based alignment had a small negative impact on the output quality. Furthermore, it was found later on, as a part of the research whose main findings were reported in [Hel08b], that the two-step alignment procedure described in Section 5.1.1 is unnecessary and sometimes even counterproductive, because occasionally erroneous phoneme boundary locations constrain the frame-level alignment. The main outcome of the study [Hel08b] was that even though the alignment of the training data significantly affects the voice conversion quality, it is possible to obtain the same quality as with hand-marked labels using only simple voice activity detection, dynamic time warping (DTW) and certain additional considerations. This effectively renders the phoneme boundary information unnecessary at best and, as mentioned above, detrimental at worst.
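For reference, the core of a DTW-based frame alignment can be sketched as below. This is the textbook algorithm with unconstrained step sizes, given only as an illustration: it does not include the voice activity detection or the additional considerations of [Hel08b], and the function name `dtw_path` and the caller-supplied `dist` function are assumptions introduced here.

```python
def dtw_path(source, target, dist):
    """Return the minimum-cost alignment path between two frame sequences."""
    n, m = len(source), len(target)
    inf = float("inf")
    # cost[i][j] is the best accumulated cost of aligning the first i
    # source frames with the first j target frames.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(source[i - 1], target[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    # Backtrack from the end to recover the aligned frame pairs.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        ]
        _, i, j = min(steps)
    path.reverse()
    return path
```

Each pair in the returned path gives a source frame index and the target frame index aligned with it; such pairs would then form the training data for the conversion models.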
Several techniques that enhance the performance of the initial VLBR-based system are presented in Sections 5.2–5.4. In those discussions, the alignment is handled using the DTW-based approach, but otherwise the system presented in this section is used as the baseline voice conversion system.