2.5 Voice conversion
2.5.3 Conversion methods
The most popular approach for the actual conversion task has been Gaussian mixture model (GMM) based conversion, as proposed in [Kai98] and [Sty98]. The approach of [Sty98] uses GMMs for modeling the density of the source features, while that of [Kai98] models the joint density of both source and target features. The GMM-based conversion approach, implemented as proposed in [Kai98], is also used in Chapter 5. More information on this simple but effective approach is provided in Section 5.1.
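As a rough illustration of the joint-density formulation, the conversion function of a GMM fitted to joint source-target vectors is commonly written as shown below. The notation is an assumption of this sketch and is not taken from Section 5.1: x is a source feature vector, M is the number of mixture components, alpha_m are the mixture weights, mu_m^x and mu_m^y are the source and target parts of the m-th mean vector, and Sigma_m^xx and Sigma_m^yx are the corresponding blocks of the m-th covariance matrix.

```latex
F(\mathbf{x}) = \sum_{m=1}^{M} p_m(\mathbf{x})
  \left[ \boldsymbol{\mu}_m^{y}
  + \boldsymbol{\Sigma}_m^{yx} \left( \boldsymbol{\Sigma}_m^{xx} \right)^{-1}
    \left( \mathbf{x} - \boldsymbol{\mu}_m^{x} \right) \right],
\qquad
p_m(\mathbf{x}) =
  \frac{\alpha_m \, \mathcal{N}\!\left( \mathbf{x};\, \boldsymbol{\mu}_m^{x},\, \boldsymbol{\Sigma}_m^{xx} \right)}
       {\sum_{k=1}^{M} \alpha_k \, \mathcal{N}\!\left( \mathbf{x};\, \boldsymbol{\mu}_k^{x},\, \boldsymbol{\Sigma}_k^{xx} \right)}
```

Each component contributes a linear regression from source to target features, and the posterior probabilities p_m(x) blend these regressions softly.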
In addition to the GMM-based approach, a wide variety of different conversion techniques have been proposed in the literature. Examples of different approaches include neural network based conversion studied in [Nar95], [Wat02], and more recently in [Des10], hidden Markov model based conversion [Kim97], codebook based conversion studied, e.g., in [Abe88], [Ars97], and [Esl11], and non-linear conversion techniques such as [Son11] and [Hel12]. In addition, there have been proposals that combine these different approaches. For example, a hybrid approach combining GMM-based conversion and codebook based conversion has been proposed in [Kan05]. More information on the different conversion methods can be found, e.g., in [Nur12], and in the other references mentioned in this section.
Chapter 3
VLBR – segmental speech coding for efficient storage
In speech storage applications, many of the traditional speech codec design constraints discussed in Section 2.2 can be relaxed in order to achieve higher quality and/or lower bitrate [Won92][Mud98]. For example, the limitations regarding encoding delay can be relaxed or omitted and the lack of bit errors in most storage applications enables the use of lossless coding and/or all kinds of predictors and memory-based solutions. In addition, variable bitrate can be conveniently used to adaptively adjust the parameter update rate [Rou82][Lee01] and the quantization accuracy based on the short-time properties of the input speech. All of these aspects are considered in the development of the VLBR codec and the related compression techniques presented in this chapter.
Since the aim was to achieve relatively good speech quality, speech coding solutions such as the ones presented, e.g., in [Rou82], [Rou83], [Shi88], [Pic89], and [Cer98], which aim at producing intelligible speech at bitrates as low as 0.15 kbps, were considered too coarse, and their usual limitation to only one speaker was also seen as undesirable. Furthermore, since the memory consumption of the decoder had to be kept as small as possible, solutions based on the use of a large speech database (see, e.g., [Lee01] and [Lee02]) were also ruled out. Thus, the development approach chosen was to start with the models and techniques typically used at bitrates between 1.2 and 4.0 kbps and to develop further solutions for making the overall process more efficient.
The first section of this chapter introduces the parametric representation used in the VLBR codec. Neither the parametric representation itself nor the parameter estimation methods are considered to be core contributions of this thesis, but short descriptions are provided due to their important role in the VLBR codec and consequently in the whole thesis. Section 3.2 can be regarded as the central section of this chapter since it provides an overview of the first developmental version of the VLBR codec and introduces the related mode-based segmental processing and quantization solutions. The remaining parts of the chapter discuss additional techniques that can be used to further enhance the efficiency of the VLBR codec. In Section 3.3, a specific general-purpose quantizer structure, based on the multi-mode matrix quantization of adjacent parameter vectors using low-complexity vector-based predictions, is introduced. In Section 3.4, further bitrate reductions are sought by compressing the quantizer index data using lossless compression. In particular, an enhanced version of the conventional dynamic codebook reordering technique [DeN96] is proposed and evaluated. The final core contribution of this chapter, a novel preprocessing method based on perceptual irrelevancy removal, is presented in Section 3.5.
The discussions provided in the main parts of this chapter, i.e., Section 3.2, Section 3.3, Section 3.4, and Section 3.5, are largely based on the publications [Räm04], [Nur03b], [Nur07a], [Nur06a], and [Läh03b].
3.1 Parametric representation
The selection of the parametric model is one of the crucial issues when designing a parametric speech codec. The main reason for this is that the chosen parametric representation directly sets an upper limit on the achievable speech quality. The parametric model also indirectly affects the bitrate of the codec because the perceptual importance varies from parameter to parameter, and there are also differences in the achievable compression efficiencies, i.e., the number of bits needed for a perceptually accurate representation of a parameter value or a set of parameter values varies considerably.
As discussed in Section 2.2, most modern speech codecs utilize the idea of linear prediction due to its beneficial properties. Thus, it was an easy decision to base the operation of the new storage codec on linear prediction. The selection between the excitation models based on waveform interpolation and sinusoidal modeling was not as straightforward, but the sinusoidal modeling approach was eventually chosen. One of the reasons for this selection was the fact that the resulting parameter set is slightly more compact, but it is worth noting that a similar codec could be built around the waveform interpolation approach as well.
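To give a concrete picture of the linear prediction step mentioned above, the following minimal Python sketch estimates predictor coefficients with the autocorrelation method and the Levinson-Durbin recursion. It is an illustration only and should not be read as the analysis used in the VLBR codec: the function names, the sign convention of the predictor, the example frame, and the absence of windowing are all assumptions of this sketch.

```python
def autocorrelation(frame, max_lag):
    """Biased autocorrelation r[0..max_lag] of one frame of samples."""
    n = len(frame)
    return [sum(frame[i] * frame[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the autocorrelation normal equations recursively.

    Convention assumed here: the prediction is
        x_hat[n] = sum_{k=1..order} a[k] * x[n - k].
    Returns (coefficients a[1..order], residual energy).
    """
    a = [0.0] * (order + 1)   # a[0] is unused
    err = r[0]                # prediction error energy at order 0
    for i in range(1, order + 1):
        if err <= 0.0:        # degenerate input (e.g. all-zero frame)
            break
        # Reflection coefficient for order i.
        acc = r[i] - sum(a[k] * r[i - k] for k in range(1, i))
        k_i = acc / err
        # Update the lower-order coefficients, then append the new one.
        new_a = a[:]
        new_a[i] = k_i
        for k in range(1, i):
            new_a[k] = a[k] - k_i * a[i - k]
        a = new_a
        err *= (1.0 - k_i * k_i)
    return a[1:], err


# Hypothetical 8-sample frame, used only to show the call sequence.
frame = [0.0, 1.0, 0.5, -0.3, -0.6, -0.1, 0.4, 0.3]
r = autocorrelation(frame, max_lag=2)
coeffs, residual = levinson_durbin(r, order=2)
```

In a parametric codec of the kind discussed here, the coefficients would typically be converted to a representation better suited for quantization, and the residual would be described by the chosen excitation model, which in this case is sinusoidal.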