University of Mons
Doctoral School MUSICS
Signal Processing
PhD THESIS
to obtain the title of
PhD in Applied Sciences
of University of Mons
Specialty :
Speech Processing
Defended by
Benjamin Picart
Statistical Parametric Speech Synthesis Based on the Degree of Articulation
Thesis Advisor: Thierry Dutoit
Thesis Co-Advisor: Thomas Drugman
prepared at University of Mons, Faculté Polytechnique,
TCTS Lab
defended on October 29, 2013
Jury:
Prof. Marc Pirlot - University of Mons (UMONS)
Prof. Thierry Dutoit - University of Mons (UMONS)
Prof. Francis Grenez - Université Libre de Bruxelles (ULB)
Prof. Simon King - University of Edinburgh (Scotland)
Dr. Thomas Drugman - University of Mons (UMONS)
Dr. Vincent Pagel - Acapela Group S.A. (Mons)
Dr. Raphael Sebbe - Creaceed S.P.R.L. (Mons)
To my grandfather Marcel.
To my grandmother Marcelle, my parents Annie and Pascal, my sister Justine and my girlfriend Virginie.
When a distinguished but elderly scientist states that something is possible, he is almost certainly right. When he states that something is impossible, he is very probably wrong.
Arthur C. Clarke (16th of December 1917 - 19th of March 2008)
For certain you have to be lost to find a place as can’t be found. Elseways, everyone would know where it was.
Geoffrey Rush, alias Hector Barbossa Pirates of the Caribbean: At World’s End
Abstract
Nowadays, speech synthesis is part of various daily life applications. The ultimate goal of such technologies consists in extending the possibilities of interaction with the machine, in order to get closer to human-like communications. However, current state-of-the-art systems often lack realism: although high-quality speech synthesis can be produced by many researchers and companies around the world, synthetic voices are generally perceived as hyperarticulated. In any case, their degree of articulation is fixed once and for all.
The present thesis falls within the more general quest for enriching expressivity in speech synthesis. The main idea consists in improving statistical parametric speech synthesis, whose most famous example is Hidden Markov Model (HMM) based speech synthesis, by introducing a control of the articulation degree, so as to enable synthesizers to automatically adapt their way of speaking to the contextual situation, like humans do. The degree of articulation, which is probably the least studied prosodic parameter, is characterized by modifications of phonetic context, of speech rate and of spectral dynamics (vocal tract rate of change). It depends upon the surrounding environment and the communication context, and provides information on the relationship between the speaker and the listener(s).
According to Lindblom’s “H and H” theory, speakers are expected to vary their output along a continuum of hypo and hyperarticulated speech. Compared to the neutral case, hyperarticulated speech tends to maximize the clarity of the speech signal by increasing the articulation efforts to produce it, while hypoarticulated speech is produced with minimal articulation efforts. The work presented in this PhD thesis provides a thorough and detailed study on the analysis and synthesis of hypo and hyperarticulated speech in the framework of HMM-based speech synthesis. This framework is very convenient for creating a synthesizer whose speaker characteristics and speaking styles can be easily modified.
In order to achieve this goal, a new French database consisting of three distinct and parallel sets (one for each articulation degree to be studied, i.e. neutral, hypoarticulated and hyperarticulated speech) was recorded. This database allows: i) the study of both acoustic and phonetic modifications due to articulatory effort changes; ii) the design of a high-quality speech synthesizer integrating a continuous control of the articulation degree. This first requires addressing the issue of speaking style adaptation to derive hypo and hyperarticulated speech from the neutral synthesizer. Once this is done, an interpolation and extrapolation of the resulting models makes it possible to finely tune the voice so that it is generated with the desired articulatory efforts. Secondly, we perform a perceptual study of speech with a variable articulation degree, specifically focusing on: i) the internal mechanisms leading to the perception of the degree of articulation by listeners (i.e. cepstrum, prosody, phonetic transcription adaptation and the complete adaptation); ii) how intelligibility and various other voice dimensions are affected. Based on the ensuing conclusions, we finally implement an automatic modification of the degree of articulation in an existing standard neutral voice for which no hypo or hyperarticulated recordings are available.
Keywords: HMM-based Speech Synthesis, Speech Analysis, Expressive Speech, Degree of Articulation, Speaking Style Adaptation, Speaking Style Transposition, Voice Quality, Speech Intelligibility
Acknowledgements
The present thesis has been fulfilled within the Circuit Theory and Signal Processing (TCTS) lab of the Faculté Polytechnique (FPMs) in the University of Mons (UMONS), and was made possible by the support from the “Fonds pour la formation à la Recherche dans l’Industrie et dans l’Agriculture” (FRIA).
I would like to express my deepest gratitude to my supervisor, Prof. Thierry Dutoit, and to my co-supervisor, Dr. Thomas Drugman, for their kindness, their insightful guidance, their availability and their support throughout this thesis. I am also thankful to Acapela Group S.A. for the fruitful collaboration and for providing me with their linguistic front-end. In particular, I am grateful to Mr. Geoffrey Wilfart, Mr. Fabrice Malfrère, Dr. Vincent Pagel and Mr. Olivier Deroo for their judicious advice, their time and the industrial partnership. I would also like to thank Hui Liang and Lakshmi Saheer, from Idiap Research Institute, for their help and advice when I started my thesis.
I would like to thank all the people working in TCTS for their friendship and help throughout my thesis, and particularly: Alexis, Maria, Onur, Jérôme and Joëlle, Thomas, Jean-Marc, Hüseyin, Sandrine, Nicolas R., Stéphane, Loïc, Thierry R., Matéi, Radwan, Matthieu, Thierry C., Stéphanie, Nicolas D., Johan, William, Nathalie, Bernard, Joël, etc. Special thanks to the card player team, for all the good games: Thomas, Justine, Zacharie, Amaury, Vasiliki and Christophe. Further special thanks to Anderson, who introduced me to the game of Go and got me addicted to it. I am also grateful to Caroline, Hatice and Véronique, from the Secrétariat des Etudes, for their kindness since I came to FPMs as a student, almost 9 years ago.
This is also the End of an Era. I thank Nicolas Linze, alias Reggie, for all the good times we spent together during our 7-year parallel academic path (which was so close that even our FRIA project defenses started some minutes apart!), and also all the nice people I have met across the world.
For their availability and their judicious comments, I am thankful to all my thesis proofreaders: Thierry Dutoit, Thomas Drugman, Alexis Moinet, Jérôme Urbain and Joëlle Tilmanne, Maria Astrinaki, Onur Babacan and Sandrine Brognaux.
I would finally like to express my deepest gratitude to my grandfather for his kind thoughts, his help and advice when needed, and to my grandmother, my parents, my sister and Virginie for their kind support all along this journey.
Contents
1 General Introduction
1.1 Introduction
1.1.1 Unit Selection Speech Synthesis
1.1.2 HMM-based Speech Synthesis
1.1.3 The Degree of Articulation
1.2 Contributions and Structure of the Thesis
2 Background
2.1 Introduction
2.2 Markov Model Theory
2.2.1 Discrete-Time Markov Process
2.2.2 Hidden Markov Model
2.3 Overview of HMM-based Speech Synthesis
2.4 Training Step in HMM-based Speech Synthesis
2.4.1 Spectral Parameters
2.4.2 F0 Modeling
2.4.3 State Duration
2.4.4 Clustering
2.5 Synthesis Step in HMM-based Speech Synthesis
2.5.1 Maximizing $P(q \mid W, \hat{\lambda})$
2.5.2 Maximizing $P(O \mid \hat{q}, \hat{\lambda})$
2.6 Voice Adaptation Techniques
2.6.1 Maximum Likelihood Linear Regression (MLLR)
2.6.2 Maximum A Posteriori (MAP) Adaptation
3 Creation of a Database with various Degrees of Articulation
3.1 Introduction
3.2 Database Specifications
3.3 Recording Hardware
3.3.1 Audio Acquisition System - Motu 8pre
3.3.2 Microphone - AKG C3000B
3.3.3 XLR Connections
3.3.4 Digital Effects - Behringer Virtualizer DSP1000
3.3.5 Amplifier - Behringer Powerplay Pro-8 HA8000
4 Analysis of Hypo and Hyperarticulated Speech
4.1 Introduction
4.1.1 Increase in the Articulation Effort
4.1.2 Decrease in the Articulation Effort
4.1.3 Contributions and Structure of the Chapter
4.2 Acoustic Analysis
4.2.1 Vocal Tract-based Modifications
4.2.2 Glottal-based Modifications
4.3 Phonetic Analysis
4.3.1 Glottal Stops
4.3.2 Phone Variations
4.3.3 Phone Durations
4.3.4 Speech Rate
4.4 Conclusions
5 HMM-based Synthesis of Hypo and Hyperarticulated Speech
5.1 Introduction
5.1.1 Reactive Speech Synthesis
5.1.2 Knowledge Integration in Speech Synthesis
5.1.3 Contributions and Structure of the Chapter
5.2 Method
5.3 Acoustic Analysis
5.4 Objective Evaluation
5.5 Subjective Evaluation
5.6 Conclusions
6 Continuous Control of the Degree of Articulation
6.1 Introduction
6.1.1 From Source toward Target Speakers’ Voice
6.1.2 Interpolation and Extrapolation between Statistical Models
6.1.3 Contributions and Structure of the Chapter
6.2 Speaking Style Adaptation
6.2.1 Method
6.2.2 Objective Evaluation
6.2.3 Subjective Evaluation
6.3 Interpolation and Extrapolation of the Degree of Articulation
6.3.1 Method
6.3.2 Perception of the Degree of Articulation
6.3.3 Segmental Quality of the Interpolation and Extrapolation
7 Subjective Assessment of Hypo and Hyperarticulated Speech
7.1 Introduction
7.1.1 Speech Intelligibility Estimation
7.1.2 Speech Intelligibility Enhancement
7.1.3 Contributions and Structure of the Chapter
7.2 Effects Influencing the Perceived Degree of Articulation
7.2.1 Method
7.2.2 Experiments
7.3 Intelligibility and Quality Assessments of Hypo and Hyperarticulated speech
7.3.1 Method
7.3.2 Semantically Unpredictable Sentences Test
7.3.3 Absolute Category Rating Test
7.4 Conclusions
8 Varying the Degree of Articulation of Any Voice within HMM-based Speech Synthesis
8.1 Introduction
8.1.1 Creating Target Style Model without any Target Style Speech Data
8.1.2 Contributions and Structure of the Chapter
8.2 Creation of the Articulation Model
8.3 Techniques for the Transposition of the Articulation Model to a New Speaker
8.4 Prosody Transposition
8.4.1 Experimental Framework
8.4.2 Speech Quality of the Prosody Model Transposition
8.4.3 Perception of the Degree of Articulation
8.5 Filter Transposition
8.5.1 Experimental Framework
8.5.2 Speech Quality of the Filter Model Transposition
8.5.3 Perception of the Degree of Articulation
8.5.4 Identity Preservation Assessment
8.5.5 Conclusions on Filter Transposition
8.6 Generalization to Other Voices
8.6.1 Experimental Framework
8.6.2 Speech Quality of the Prosody and Filter Models Transposition
8.6.3 Perception of the Degree of Articulation
8.6.4 Identity Preservation Assessment
8.7 Conclusions
9 General Conclusion and Future Works
9.1 Conclusions
9.1.1 Creation of a Database with various Degrees of Articulation
9.1.2 Analysis of Hypo and Hyperarticulated Speech
9.1.4 Subjective Assessment of Hypo and Hyperarticulated Speech
9.1.5 Varying the Degree of Articulation of Any Voice within HMM-based Speech Synthesis
9.2 Thesis Contributions
9.3 Perspectives
9.3.1 In Direct Continuity
9.3.2 Average-Voice-based Speech Synthesis integrating the Degree of Articulation
9.3.3 Generalization to other types of Data and Languages
Bibliography
A Publications
A.1 Journals
A.2 Conference Proceedings
List of Figures
2.1 Schematic representation of (a) a 3-state ergodic HMM and (b) a 4-state left-to-right HMM, together with emission $B = \{b_j(o)\}$ and transition $A = \{a_{ij}\}$ probabilities associated with each state.
2.2 Output distributions: (a) Gaussian PDF, (b) Gaussian mixture PDF, (c) Multi-stream PDF. Adapted from [Yamagishi 2006].
2.3 Overview of the HMM-based Speech Synthesis System (“H-Triple-S” - HTS), from [Zen et al. 2009].
2.4 F0 pattern modeling, from [Masuko 2002].
2.5 Multi-Space probability Distribution (MSD) and observations, from [Masuko 2002].
2.6 A HMM based on Multi-Space probability Distribution (MSD), from [Masuko 2002].
2.7 Multi-Space probability Distribution Hidden Markov Model (MSD-HMM) for F0 modeling, from [Tokuda & Zen 2009].
2.8 HMM duration PDFs modeled either by their state self-transition probabilities (decreasing exponential blue curve) or by a Gaussian distribution (Gaussian red curve), from [Tokuda & Zen 2009].
2.9 Decision tree context clustering, from [Tokuda et al. 2002b].
2.10 Duration synthesis, from [Yamagishi 2006].
2.11 Generated speech parameter trajectory, from [Tokuda & Zen 2009].
2.12 Maximum Likelihood Linear Regression (MLLR) and its related algorithms, adapted from [Yamagishi et al. 2009a].
2.13 Combined algorithm of the (C)MLLR and MAP adaptation, adapted from [Yamagishi et al. 2009a].
2.14 Relationship between the MAP and the ML estimates, adapted from [Yamagishi 2006].
3.1 Sound-proof room equipped in order to record natural-sounding NEU, HPO and HPR speech.
3.2 Schematic illustration of the “standard recording protocol” designed in this work to induce the speaker’s (a) HPO (“amplification” effect) and (b) HPR (“cathedral effect”) speech.
4.1 Vocalic triangle estimated on the original recordings for each DoA, together with dispersion ellipses.
4.2 Pitch histograms for each DoA.
4.3 Averaged magnitude spectrum of the glottal source for each DoA (in the top right corner, a zoom on the glottal formant frequency).
4.5 Number of glottal stops for each vowel and for each DoA.
4.6 Phone duration histograms. (a) Front, central, back & nasal vowels. (b) Plosive & fricative consonants. (c) Pauses.
4.7 Phone duration histograms. (a) Semi-vowels. (b) Trill consonants.
5.1 Standard training of the NEU, HPO and HPR full data models, from the database containing 1220 training sentences for each DoA.
5.2 Vocalic triangle estimated on the generated recordings for each DoA, together with dispersion ellipses.
5.3 Subjective evaluation of the overall speech quality of the full data models (mean score with its 95% CI).
6.1 Standard training of the NEU, HPO and HPR full data models (Chapter 5), from the database containing 1220 training sentences for each DoA. Adaptation of the NEU full data model using CMLLR transform with HPO and HPR speech data to produce HPO and HPR adapted models (Section 6.2). Implementation of a tuner, manually adjustable by the user, for a continuous control of the DoA (Section 6.3).
6.2 Objective evaluation - Average MCD [dB] computed between the adapted and the full data models. Black dots indicate actual measures.
6.3 Objective evaluation - RMSE of log F0 [cent] computed between the adapted and the full data models. Black dots indicate actual measures.
6.4 Objective evaluation - RMSE of vowel durations [number of frames] (frame shift = 5 ms) computed between the adapted and the full data models. Black dots indicate actual measures.
6.5 Subjective evaluation of the overall speech quality of the adapted models - Effect of the number of adaptation sentences on CCR scores (mean scores with their 95% confidence intervals).
6.6 Subjective evaluation of the adapted models - Perceived interpolation and extrapolation ratio as a function of the actual interpolation and extrapolation ratio, together with its 95% confidence interval.
7.1 Subjective evaluation of the perception of the DoA - Mean PDA scores with their 95% confidence intervals (CI) for each DoA.
7.2 Subjective evaluation of the perception of the DoA - ACR test.
7.3 Subjective intelligibility evaluation of the DoA (SUS Test) - Mean word (top) and sentence (bottom) recognition accuracies [%], together with their 95% CI.
7.4 Subjective quality evaluation of the DoA (ACR Test) - Mean scores together with their 95% CI.
8.1 Vocalic triangles estimated on the original NEU recordings for Voices A, B, M and F, together with dispersion ellipses.
8.2 Creation of the articulation model on Voice A. Transforms are computed in two alternative ways, using LS or CMLLR adaptation.
8.3 Comparison of mean vector µ adaptation in CMLLR and model-space LS.
8.4 Prosody and filter adaptation transforms computed on Voice A are applied to an existing standard NEU Voice B with no HPO or HPR recordings available for generating Voice B HPO and HPR adapted models. The most successful method (selected through various evaluations) is then used for automatically modifying the DoA of two other speakers (Voices M and F).
8.5 Transposition of the articulation model learned on Voice A to Voice B. Leaf nodes mapping is performed in two alternative ways, using phonetic (based on decision trees) or acoustic (based on KL divergence) mapping.
8.6 CMOS test for prosody transposition - Mean CMOS score for each method and each DoA, together with their 95% confidence intervals (CI).
8.7 CMOS test for prosody transposition - Detailed preference scores (expressed in [%]), averaged for all the participants and utterances used in the test, for each method compared to the baseline, for HPR speech (left) and HPO speech (right).
8.8 CPDA test for prosody transposition - Mean score of the perceived DoA using the four methods or the baseline (1 being the reference DoA, defined on Voice A), together with their 95% CI.
8.9 MOS test for the second pruning step - Overall speech quality (with its 95% CI) of the sentences synthesized by the HPO and HPR transposed models of Voice B.
8.10 CMOS test for filter transposition - Mean CMOS score for each method and each DoA, together with their 95% CI.
8.11 CPDA test for filter transposition - Mean score of the perceived DoA using the four methods or the baseline (1 being the reference DoA, defined on Voice A), together with their 95% CI.
8.12 ID test for filter transposition - Mean score for each method and each DoA (Voice A = 0, Voice B = 1), together with their 95% CI.
8.13 Vocalic triangles estimated on the synthesized HPR, NEU and HPO speech for Voices A, B, M and F, together with dispersion ellipses.
8.14 CMOS test for the generalization of the prosody and filter transposition - Mean CMOS scores for each DoA using the LSP_LS_Phn method, together with their 95% CI.
8.15 CPDA test for the generalization of the prosody and filter transposition - Mean score of the perceived DoA using the LSP_LS_Phn method (1 being the reference DoA, defined on Voice A), together with their 95% CI.
8.16 ID test for the generalization of the prosody and filter transposition - Mean score for each DoA using the LSP_LS_Phn method (Voice A = 0, Voice M or F = 1), together with their 95% CI.
List of Tables
2.1 Mel-Generalized Cepstral (MGC) analysis.
4.1 Vocalic space (in kHz²) for the three DoA for the original sentences.
4.2 Deleted and inserted phone percentage in HPO and HPR speech respectively, compared to NEU style, and their repartition inside the words: total (first row), beginning (second row), middle (third row), end (fourth row).
4.3 Speech rates and related time information for NEU, HPO & HPR speech, together with the positive or negative variation from the NEU style (in [%]).
5.1 Vocalic space (in kHz²) for the three DoA for the synthesized sentences.
5.2 Objective evaluation of the overall speech quality of the full data models: average MCD [dB], RMSE_lf0 [cent] and RMSE_dur [number of frames] (frame shift = 5 ms) with their 95% confidence intervals (CI) for each DoA.
6.1 Grades in the CCR scale.
6.2 Grades in the CMOS scale.
6.3 Subjective evaluation of the adapted models (CMOS test) - Perceived synthesis quality of the test sentence X vs. the NEU sentence B (CMOS scores with their 95% confidence intervals).
7.1 Four different synthesizers, so as to analyze the internal mechanisms leading to the perception of the DoA by listeners.
7.2 Answering the questions by comparing the synthesizers performance.
7.3 Question list asked to listeners during the ACR test, together with their corresponding extreme category responses [de Mareüil et al. 2006].
7.4 Question list (complement to Table 7.3) asked to listeners during the ACR test, together with their corresponding extreme category responses [de Mareüil et al. 2006].
8.1 Speech rates, mean and standard deviation of F0 values for Voices A NEU, HPO and HPR recordings and for Voice B, M and F NEU recordings.
8.2 Methods for applying the prosody and filter transposition transforms from Voice A to Voice B.
8.3 Selected methods after the first pruning step (* and **) and after the second one (**). Observed artefacts on the rejected methods are also indicated (u: filter instability; g: occurrence of glitches; i: complete target speaker identity loss).
Acronyms
• ACR: Absolute Category Rating
• AI: Articulation Index
• ANN: Artificial Neural Network
• ANOVA: ANalysis Of VAriance
• ASR: Automatic Speech Recognition
• CCD: Complex Cepstrum-based Decomposition
• CCR: Comparison Category Rating
• CI: Confidence Interval
• CMOS: Comparative Mean Opinion Score
• CMLLR: Constrained Maximum Likelihood Linear Regression
• CSMAPLR: Constrained Structural Maximum A Posteriori Linear Regression
• dB: Decibel
• DoA: Degree of Articulation
• DSM of residual signal: Deterministic plus Stochastic Model of residual signal
• DSP: Digital Signal Processor
• EM: Expectation-Maximization
• FWS: Frequency Weighted Segmental
• F0: Fundamental frequency
• Fx: Formant x (x = formant id)
• GCI: Glottal Closure Instant
• GMM: Gaussian Mixture Model
• GP: Glimpse Proportion
• HMM: Hidden Markov Model
• HNM: Harmonic plus Noise Model
• HPR: HyPeRarticulation or HyPeRarticulated
• HSM: Harmonic/Stochastic Model
• HSMM: Hidden Semi Markov Model
• HTS: HMM-based Speech Synthesis System (“H-Triple-S”)
• Hz: Hertz
• LAR: Log Area Ratio
• LF model: Liljencrants-Fant model
• LPC: Linear Predictive Coding
• LS: Linear Scaling
• LSF: Line Spectral Frequency
• LSP: Line Spectral Pairs
• MAP: Maximum A Posteriori
• MCD: Mel-Cepstral Distortion
• MELP: Mixed Excitation Linear Prediction
• MFA: Mixtures of Factor Analyzers
• MFCC: Mel-Frequency Cepstrum Coefficient
• MGC coefficients: Mel Generalized Cepstral coefficients
• ML: Maximum Likelihood
• MLLR: Maximum Likelihood Linear Regression
• MLSA: Mel Log Spectrum Approximation
• MOS: Mean Opinion Score
• MRGV: Multiple-Regression Global Variance
• MSD-HSMM: Multi-Space probability Distribution Hidden Semi Markov Models
• NEU: NEUtral
• NLP: Natural Language Processor
• PARCOR coefficients: PARtial CORrelation coefficients
• PDA: Perceived Degree of Articulation
• PDF: Probability Density Function
• PWI: Prototype Waveform Interpolation
• RIR: Room Impulse Response
• RMSE: Root-Mean-Square Error
• RMSE_dur: Root-Mean-Square Error of vowel durations
• RMSE_lf0: Root-Mean-Square Error of log F0
• SAT: Speaker-Adaptive Training
• SEDREAMS: Speech Event Detection using the REsidual And Mean-based Signals
• SII: Speech Intelligibility Index
• SNR: Signal to Noise Ratio
• STI: Speech Transmission Index
• STOI: Short-Time Objective Intelligibility
• STRAIGHT: Speech Transformation and Representation using Adaptive Interpolation of weiGHTed spectrum
• SUS: Semantically Unpredictable Sentences
• TCGPP: Template Constrained Generalized Posterior Probability
• TTS: Text-To-Speech
• VC: Voice Conversion
• VTLN: Vocal Tract Length Normalization
• WSS: Weighted Spectral Slope
Chapter 1
General Introduction
1.1 Introduction
Nowadays, the speech synthesis market is expanding. In addition to the numerous daily life multimedia applications, the speech synthesis domain is most of the time associated with the search for extending the interaction possibilities between the human and the machine, in order to get closer to human-like communications. On the one hand, speech quality is characterized by its naturalness, its intelligibility and its expressivity. On the other hand, speech synthesizer efficiency is characterized by the amount of resources required for synthesis (i.e. the amount of data collected during the database recording, and the amount of time needed for collecting and processing them) and by the number of languages available (if possible with the same voice).
Two main techniques govern speech synthesis: unit selection [Hunt & Black 1996] and statistical parametric speech synthesis [Zen et al. 2009].
Unit selection speech synthesis, in which appropriate subword units are automatically selected from a natural speech database, allows the generation of high-quality human-like sounding speech, but requires a huge amount of resources. The basic idea relies on two cost functions: the target cost, which represents how well the selected unit matches the target, and the concatenation cost, representing how well two selected units combine. The target cost function can also be calculated in advance using tree-based clustering [Donovan & Woodland 1995] [Black & Taylor 1997]. At synthesis time, the goal thus consists in the minimization of the overall cost of a label sequence to be produced, which is equal to the sum of the target and concatenation cost functions. Many works have focused on this kind of speech synthesis, and more information can be found in [Taylor 2009].
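The cost minimization described above can be sketched as a dynamic-programming (Viterbi-style) search over candidate units. This is only a minimal illustration under stated assumptions, not the implementation of any particular system: the function name `select_units`, the toy numeric "units" and the two cost functions are hypothetical, and every target position is assumed to have at least one candidate.

```python
def select_units(targets, candidates, target_cost, concat_cost):
    """Pick one candidate per target so that the summed target and
    concatenation costs are minimal. Returns (selected_units, total_cost).

    targets:     list of target specifications (hypothetical representation)
    candidates:  candidates[i] is the non-empty list of units for targets[i]
    target_cost: function (target, unit) -> cost of using unit for target
    concat_cost: function (previous_unit, unit) -> cost of joining them
    """
    if not targets:
        return [], 0
    # best[i][j]: minimal cost of any unit sequence ending with candidates[i][j]
    best = [[target_cost(targets[0], u) for u in candidates[0]]]
    back = [[-1] * len(candidates[0])]
    for i in range(1, len(targets)):
        row, ptr = [], []
        for unit in candidates[i]:
            # Cost of reaching this unit from each possible predecessor.
            costs = [best[i - 1][k] + concat_cost(prev, unit)
                     for k, prev in enumerate(candidates[i - 1])]
            k_best = min(range(len(costs)), key=costs.__getitem__)
            row.append(costs[k_best] + target_cost(targets[i], unit))
            ptr.append(k_best)
        best.append(row)
        back.append(ptr)
    # Backtrack from the cheapest final unit.
    j = min(range(len(best[-1])), key=best[-1].__getitem__)
    total = best[-1][j]
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][j])
        j = back[i][j]
    path.reverse()
    return path, total


# Toy usage: units and targets are plain numbers (e.g. pitch values in Hz),
# and both costs are absolute differences. Purely illustrative.
units, cost = select_units(
    targets=[100, 110, 120],
    candidates=[[95, 130], [108, 140], [100, 121]],
    target_cost=lambda t, u: abs(t - u),
    concat_cost=lambda a, b: abs(a - b),
)
print(units, cost)  # [95, 108, 121] 34
```

In a real system the costs would combine several weighted sub-costs (spectral, prosodic and phonetic context distances), and the candidate lists would be pruned before the search.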
In direct contrast with this selection of actual unmodified instances of speech from a database, statistical parametric speech synthesis might be most simply described as generating the average of some sets of similarly sounding speech segments [Zen et al. 2009]. This produces good quality speech synthesis, but slightly degraded by its “buzziness” compared to what is generated by unit selection speech synthesis. It has the significant advantage of greatly reducing the memory footprint (as only the statistical models have to be stored), although the runtime computation may be much higher. As explained further in the present thesis, the speech parameters, i.e. spectrum, fundamental frequency (F0) and phone duration, allowing the reconstruction of any speech unit, are statistically modeled and generated by Hidden Markov Models (HMMs) or Hidden Semi-Markov Models (HSMMs). This is the reason why the most famous example of statistical parametric speech synthesis is often called HMM-based speech synthesis.
1.1.1 Unit Selection Speech Synthesis
ATR ν-talk was the first to demonstrate the effectiveness of the automatic selection of appropriate units [Sagisaka et al. 1992], based on minimizing acoustic distortions between selected units and the target spectrum. Then CHATR generalized these techniques to multiple languages and an automatic scheme [Hunt & Black 1996], by taking into account both the prosodic and phonetic appropriateness of units. Synthetic speech is directly linked to the database. Indeed, the more carefully the database is recorded (i.e. high quality recordings), the higher the generated speech quality. The quality is also directly linked with the size of the database, as a larger database implies a better unit coverage (although never perfect [Möbius 2003]). Most current commercial systems use this synthesis technique. Its main characteristics are:
✓ high-quality speech synthesis, as speech units are directly selected from a database of actual human speech (there is no underlying statistical process);
✗ it is not very portable on embedded devices which often have limited memory resources, as a large database is required (typically around 400 MB) in order to cover most of the phonetic and prosodic contexts. However, it should be noted that the runtime computation may be lower compared to HMM-based speech synthesis;
✗ it is not very flexible, as speech units cannot be easily and straightforwardly modified (e.g. changes in spectrum, fundamental frequency and phone duration). If expressive speech synthesis is required, for any arbitrary style, a database containing this kind of expressive human speech is necessary. Recording a database is time-consuming and therefore expensive;
✗ Automatic Speech Recognition (ASR) techniques are hardly applicable (e.g. speaker adaptation methods).
Although this is a successful method, high quality speech synthesis cannot be guaranteed in all cases. The synthesized speech quality can be dramatically degraded if the input text requires phonetic and prosodic contexts that are under-represented in the database. Even if it is not common, a single bad concatenation in an utterance can dramatically affect the resulting subjective appreciation. Due to the prohibitive number of possible combinations between units, it is impossible to ensure that no bad joins or inadequate unit selection will occur, except in the special case of limited domain synthesizers [Black & Lenzo 2000] where the database is designed for specific applications.
As selected units cannot be (easily) modified, synthetic speech is limited to the same style as the one in the original recordings. As a consequence, larger speech databases containing various speaking styles are required in order to limit this effect and have more control on the synthesis (like IBM’s stylistic synthesis [Eide et al. 2004]). Unfortunately, recording large databases with variations is very difficult and costly [Black 2003]. The time needed to record a normal database varies from 8h to 40h, depending on the language and on the desired synthetic quality. Moreover, this data has to be processed afterwards (i.e. annotations, segmentation, etc.), which can last several months.
1.1.2 HMM-based Speech Synthesis
A new method for speech synthesis emerged around ten years ago: statistical parametric speech synthesis. This technique consists of two parts: the training and the synthesis steps. During the training step, a natural speech database is analyzed, i.e. for each analysis frame, spectral (filter contribution) and excitation (glottal source contribution) parameters are extracted. These parameters are modeled through context-dependent (e.g. phonetic, prosodic, etc.) statistical models. During the synthesis step, the input text is first converted into such a context-dependent label sequence. The idea is that realistic parameters should be generated by the models, by maximizing the likelihood of the sequence given the model. The speech signal is eventually reconstructed from some parametric representation of speech. The main characteristics of this approach are:
+ higher portability for embedded devices, which often have limited memory resources (as only the statistical models have to be stored). However, it should be noted that the runtime computation may be higher compared to unit selection speech synthesis;
+ higher flexibility, including the possibility of using voice conversion and techniques developed for ASR like speaker adaptation methods, potentially leading to more expressive speech synthesis;
+ smaller memory footprint, typically within one MB, as only the statistical models have to be stored;
- lower synthetic speech quality, often termed as “buzziness”.
The latter characteristic is the main drawback of HMM-based speech synthesis. This is mainly due to the fact that this synthesis technique is based on a parametric representation of the speech signal: the excitation signal, consisting of either a pulse train for voiced speech, or a white noise for unvoiced speech, is far too simplistic. Several studies have been and are still focusing on this latter issue, in order to improve the output speech quality of such systems: among others, the Speech Transformation and Representation using Adaptive Interpolation of weiGHTed spectrum (STRAIGHT) [Kawahara et al. 1999] [Kawahara & Morise 2011], the glottal-flow-derivative model [Cabral et al. 2007] [Cabral et al. 2008] and the Deterministic plus Stochastic Model (DSM) of the residual signal [Drugman et al. 2009b] [Drugman & Dutoit 2012].
1.1.3 The Degree of Articulation
Current state-of-the-art systems often lack realism: synthetic voices are most of the time perceived as hyperarticulated, and in any case, their degree of articulation is fixed once and for all. The expressivity of synthetic voices can be improved by modifying various prosodic parameters, including the fifth dimension of prosody: the degree of articulation.
According to Lindblom’s “H and H” theory [Lindblom 1983], speakers are expected to vary their output along a continuum of hypo- and hyperarticulated speech. Compared to the neutral case, hyperarticulated speech tends to maximize the clarity of the speech signal by increasing the articulation efforts to produce it, while hypoarticulated speech is produced with minimal articulation efforts. Therefore the degree of articulation (DoA) provides information on the relationship between the speaker and the listeners, as well as on the speaker’s introversion and extroversion in real-life situations [Beller 2009]. This status can be induced by contextual factors (like the listener’s emotional state) or simply by the speaker’s own expressivity. Indeed, when talkers speak, they also listen to each other [Cooke et al. 2012]. Speakers can adopt a speaking style allowing them to be more easily understood in difficult communication situations. In this work, “hyperarticulated speech” (HPR) refers to the situation of a person talking in a reverberant environment, e.g. a teacher or a speaker talking in front of a large audience (important articulation efforts have to be made to be understood by everybody). “Hypoarticulated speech” (HPO) refers to the situation of a person talking in a quiet environment (e.g. in a library) or very close to someone (few articulation efforts have to be made to be understood). “Neutral speech” (NEU) refers to the daily life situation of a person reading a text aloud without emotion (e.g. no happiness, no anger, no excitement, etc.) and without any specific articulation efforts to produce the speech, keeping only the sentence intonation: rising intonation for questions, flat intonation for affirmative or negative sentences, etc. It is worth noting that these three modes of expressivity are emotionless, but can vary amongst speakers as reported in [Beller 2009]. The influence of emotion on the DoA has been studied in [Beller 2007] [Beller et al. 2008] and is out of the scope of this work.
The DoA is characterized by modifications of the phonetic stream, of fundamental frequency, of speech rate and of spectral dynamics (vocal tract rate of change). A common measure of the DoA consists in defining formant targets for each phone, taking coarticulation into account, and studying differences between real observations and targets vs. the speech rate [Wouters & Macon 2001]. Since defining formant targets is not an easy task, Beller proposed in [Beller 2009] a statistical measure of the DoA by studying the joint evolution of the vocalic triangle (i.e. the shape formed by the vowels /a/, /i/ and /u/ in the F1 - F2 space) area and the speech rate. A recent study presented a computational model of human speech production to provide a continuous adjustment according to environmental conditions [Nicolao et al. 2012].
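As an illustration of the vocalic triangle area mentioned above, the following minimal sketch computes the area of the triangle spanned by /a/, /i/ and /u/ in the F1 - F2 plane with the shoelace formula. The function name and the formant values are purely hypothetical placeholders chosen for the example; they are not measurements from this work, and the sketch does not reproduce the statistical measure of [Beller 2009].

```python
def vocalic_triangle_area(f_a, f_i, f_u):
    """Area (in Hz^2) of the triangle whose corners are the (F1, F2)
    points of the vowels /a/, /i/ and /u/ (shoelace formula)."""
    (x1, y1), (x2, y2), (x3, y3) = f_a, f_i, f_u
    return 0.5 * abs(x1 * (y2 - y3) + x2 * (y3 - y1) + x3 * (y1 - y2))


# Hypothetical mean formants (F1, F2) in Hz, for illustration only.
formants = {"a": (700.0, 1300.0), "i": (300.0, 2200.0), "u": (320.0, 800.0)}
area = vocalic_triangle_area(formants["a"], formants["i"], formants["u"])
print(f"vocalic triangle area: {area:.0f} Hz^2")
```

Under this reading, a larger area would be expected for more articulated speech and a smaller one for less articulated speech, which is why the area is studied jointly with the speech rate.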
In direct connection with HPR speech, the “Lombard effect” [Lombard 1911] refers to the speech changes due to the immersion of the speaker in a noisy environment. It is indeed known that a speaker tends to increase his vocal efforts to be more easily understood while talking in a background noise [Summers et al. 1988]. Various aspects of the Lombard effect were already studied, including acoustic and articulatory characteristics [Garnier et al. 2006b] [Garnier et al. 2006a], features extracted from the glottal flow [Drugman & Dutoit 2010a], or changes of F0 and of the spectral tilt [Lu & Cooke 2009].
Some works have been done in the framework of concatenative speech synthesis to enhance speech intelligibility by means of a kind of Lombard or HPR speech. For example, speech intelligibility improvement has been performed for a limited domain task in [Langner & Black 2005] based on voice conversion techniques. For this, they recorded the CMU_SIN database [Langner & Black 2004] containing two parallel corpora obtained respectively under clean and noisy conditions. Another example is the Loudmouth synthesizer [Patel et al. 2006], which emulates human modifications (both acoustic and linguistic) to speech in noise by manipulating word duration, fundamental frequency and intensity. In [Bonardo & Zovato 2007], it is proposed to tune dynamic range controllers (e.g. compressors and limiters) and some user controls (e.g. speaking rate and loudness) to improve the intelligibility of synthesized speech. Various methods allowing automatic modification of speech in order to achieve the same goal are investigated in [Anumanchipalli et al. 2010] (e.g. boosting the signal amplitude in important frequency bands, modification of prosodic and spectral properties, etc.). Another work [Cerňak 2006] introduced an additional measure evaluating intelligibility for the unit cost, so as to bias the synthesis by choosing more intelligible units from the speech database.
A new method for extracting or modifying mel cepstral coefficients based on an intelligibility measure for speech in noise, the Glimpse proportion measure, has been proposed in [Valentini-Botinhao et al. 2012a] [Valentini-Botinhao et al. 2012b]. Lombard speech synthesis in HMM-based speech synthesis [Zen et al. 2009] has also been performed in [Raitio et al. 2011a]. Nonetheless, the Lombard effect is a reflex produced unconsciously due to the noisy surrounding environment [Junqua 1993] [Pick et al. 1989], while HPR speech is defined as the voice produced with increased articulatory efforts compared to the NEU style. From a general point of view, these latter efforts might therefore also result from a voluntary decision to enhance speech intelligibility to facilitate the listener’s comprehension (like in the case of teaching). A similar case happens when people hyperarticulate in front of interactive systems, hoping to correct their recognition errors [Oviatt et al. 1998].
1.2 Contributions and Structure of the Thesis
The present thesis provides a detailed and complete study on the analysis and the integration of a variable DoA in HMM-based speech synthesis: NEU speech, HPO (or casual) and HPR (or clear) speech. HPO and HPR speech are of interest in many daily life applications: expressive voice conversion (e.g. for embedded systems and video games); “reading speed” control for visually impaired people (i.e. fast speech synthesizers, more easily produced using HPO speech, as synthetic speech at very high speaking rates is frequently used by blind users to increase the amount of presented information [Pucher et al. 2010a] [Moos & Trouvain 2007] [Stent et al. 2011]); improving intelligibility performance in adverse environments (e.g. perceiving GPS voice inside a moving car, understanding train or flight information in stations or halls); adapting the difficulty level to the student’s progress when learning foreign languages (i.e. from HPR to HPO speech); etc. Note also that the ultimate goal of our research is to be able to continuously control the DoA of an existing standard NEU voice for which no HPO and HPR recordings are available.
The present thesis is divided into chapters and structured as follows. Personal contributions are indicated in italic. Audio examples for each DoA are available online at http://tcts.fpms.ac.be/~picart.
Chapter 2 explains the theoretical background related to the Markov models theory and to the HMM-based speech synthesis system.
Chapter 3 describes the creation, the recording protocol and the specifications of a specific database used throughout all next chapters. This database is unique, in the sense that: i) it contains three parallel sets, each one containing 1359 sentences pronounced with a different DoA (i.e. NEU, HPO and HPR speech), allowing a thorough analysis of the effects caused and induced by the DoA; ii) it is made of high-quality recordings (i.e. recorded in a sound-proof room, which is noise or perturbation-free), in order to generate high-quality HMM-based speech synthesis with a varying DoA.
Chapter 4 details the analysis of the acoustic and phonetic characteristics of HPO and HPR speech, compared to the NEU case. Acoustic and phonetic analyses are performed on the previously recorded database. It is shown that a variable DoA is reflected by considerable changes of both vocal tract and glottal characteristics, and of speech rate, phone durations, phone variations and the presence of glottal stops.
Chapter 5 focuses on the synthesis of NEU, HPO and HPR speech in the framework of HMM-based speech synthesis. This first synthesis experiment is conducted by training a specific synthesizer for each DoA, using the entire training set of the corresponding database. Both objective and subjective evaluations aiming to assess the generated speech quality are performed, and it is shown that synthesized HPO speech seems to be less naturally rendered than NEU speech, and that the latter style seems to be less naturally rendered than HPR speech.
Chapter 6 investigates the implementation of a continuous control of the DoA in the framework of HMM-based speech synthesis. By means of inter-speaker voice adaptation techniques, applied here to intra-speaker voice adaptation, we study in a first step the adaptation of a NEU speech synthesizer to directly generate HPO and HPR speech using a limited amount of HPO and HPR speech data. We show that around 7 (for HPO) and 13 (for HPR) minutes of speech are needed to adapt cepstra with a good quality, while only half of it is sufficient to adapt F0 and phone duration correctly. The implementation of a continuous control of the DoA is then proposed in a second step. We prove that good quality NEU, HPO and HPR speech, and also any intermediate, interpolated or extrapolated DoA, can be obtained from a HMM-based speech synthesizer.
Chapter 7 focuses on the understanding of the internal mechanisms leading to high-quality HMM-based speech synthesis with various DoAs, as well as how intelligibility and other voice dimensions are affected when the synthesizer is embedded in adverse environments. In a first step, the process of adapting a NEU speech synthesizer to directly generate HPO and HPR speech is broken down into four factors: cepstrum, prosody, phonetic transcription and the complete adaptation. The impact of these factors on the perceived DoA is studied, and the importance of prosody and cepstrum adaptation as well as the use of a Natural Language Processor able to generate realistic HPO and HPR phonetic transcriptions is quantified. Moreover, HPO and HPR speech is assessed through various dimensions: comprehension, non-monotony, fluidity and pronunciation. In a second step, we focus on the assessment of both the intelligibility and the quality of speech when the HMM-based speech synthesizers integrating a variable DoA are working in adverse conditions. Simulated noisy and reverberant conditions are applied to the speech produced by the latter synthesizers, and we quantify how the possibility of varying the DoA improves the intelligibility of synthetic speech in various adverse conditions. Again, HPO and HPR speech is assessed through a subjective multi-dimensional evaluation.
Chapter 8 implements the ultimate goal of our research, i.e. the automatic modification of the DoA of an existing standard NEU voice for which no HPO or HPR recordings are available, in the framework of HMM-based speech synthesis. The idea consists in finding new methods to transpose, to a target voice, the DoA model estimated on a source voice. Starting from a source speaker for which NEU, HPO and HPR speech data is available, statistical transformations are computed during the adaptation of the NEU speech synthesizer. These transformations are then applied to a new target speaker for which no HPO or HPR recordings are available. Four statistical methods are investigated. They differ in the speaking style adaptation technique (model-space Linear Scaling LS vs. CMLLR) and in the speaking style transposition approach (phonetic vs. acoustic correspondence) they use. The methods are model-independent, in the sense that they can be applied to the prosody (pitch and phone duration) and filter models independently. Moreover, we investigate various parametric spaces for representing the spectral envelope in order to find out the most appropriate space for our purpose.
Chapter 2
Background
Contents
2.1 Introduction
2.2 Markov Model Theory
    2.2.1 Discrete-Time Markov Process
    2.2.2 Hidden Markov Model
2.3 Overview of HMM-based Speech Synthesis
2.4 Training Step in HMM-based Speech Synthesis
    2.4.1 Spectral Parameters
    2.4.2 F0 Modeling
    2.4.3 State Duration
    2.4.4 Clustering
2.5 Synthesis Step in HMM-based Speech Synthesis
    2.5.1 Maximizing $P(\mathbf{q} \mid W, \hat{\lambda})$
    2.5.2 Maximizing $P(\mathbf{O} \mid \hat{\mathbf{q}}, \hat{\lambda})$
2.6 Voice Adaptation Techniques
    2.6.1 Maximum Likelihood Linear Regression (MLLR)
    2.6.2 Maximum A Posteriori (MAP) Adaptation
2.1 Introduction
This chapter is devoted to explaining the theoretical background required to understand the techniques used throughout this work. Most of those methods are based on the widely used and well-known Hidden Markov Model (HMM), which is a statistical way of modeling systems in various domains. Speech signals, for instance, can be well characterized as a parametric random process, and the parameters of the stochastic process can be determined in a precise and well-defined way [Rabiner & Juang 1993].
The mathematical basics of the HMM-based modeling process are described in Section 2.2. Those models are then used in a concrete application: HMM-based speech synthesis. An overview of the HMM-based Speech Synthesis System (“H-Triple-S” - HTS) [Zen et al. 2009] is detailed in Section 2.3. This system is made of two main parts: the training step and the synthesis step. We first describe the training procedure of those HMM models in Section 2.4. After that, speech synthesis is performed in Section 2.5. Finally, relying on the inherent flexibility of HMM-based speech synthesis due to the statistical modeling process, voice adaptation techniques are detailed in Section 2.6, in order to modify a source speaker’s voice to sound as if it were pronounced by a target speaker.
2.2 Markov Model Theory
Hidden Markov Models (HMMs) are statistical models used to characterize observed time series. They were and are still widely used in Automatic Speech Recognition (ASR). More recently, they have proven to be also useful for speech synthesis. Despite their intrinsic simplicity, HMMs are able to model complex systems. In order to understand their mechanisms, we first describe the discrete-time Markov processes in Section 2.2.1. HMMs are then detailed in Section 2.2.2.
2.2.1 Discrete-Time Markov Process
A discrete-time Markov process is a stochastic finite state machine which, at any time, can be in one amongst $N$ distinct states. Transitions between states occur on a discrete time basis, according to a set of state transition probabilities

$$P(q_t = j \mid q_{t-1} = i, q_{t-2} = k, \ldots) \tag{2.1}$$

denoting the probability of being in state $j$ at time $t$, given state $i$ at time $t-1$, state $k$ at time $t-2$, etc.

In the case of a first-order Markov process, the transition probability associated with state $j$ at time $t$ depends only on the state $i$ at time $t-1$, i.e.,

$$P(q_t = j \mid q_{t-1} = i, q_{t-2} = k, \ldots) = P(q_t = j \mid q_{t-1} = i) \tag{2.2}$$

and in the case of time-independent transition probabilities, the first-order Markov process can be described by the following parameters:

$$a_{ij} = P(q_t = j \mid q_{t-1} = i), \quad 1 \le i, j \le N \tag{2.3}$$

where $a_{ij}$ represents the probability of the state changing from $i$ to $j$, under the following constraints:

$$a_{ij} \ge 0 \quad \forall i, j \tag{2.4a}$$

$$\sum_{j=1}^{N} a_{ij} = 1 \quad \forall i \tag{2.4b}$$
Such a process can be considered as an observable Markov model, because each state corresponds to an observable physical state of the system.
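To make the constraints (2.4a) and (2.4b) concrete, the sketch below checks that a matrix is a valid transition matrix and draws a state sequence from a first-order, time-independent Markov process. The function names, the 3-state matrix and the random seed are assumptions made only for this illustration.

```python
import numpy as np


def is_valid_transition_matrix(A, tol=1e-12):
    """Check the constraints a_ij >= 0 and sum_j a_ij = 1 for every row i."""
    A = np.asarray(A, dtype=float)
    return bool(
        A.ndim == 2
        and A.shape[0] == A.shape[1]
        and np.all(A >= 0.0)
        and np.allclose(A.sum(axis=1), 1.0, atol=tol)
    )


def sample_chain(A, initial_state, length, rng):
    """Draw a state sequence: the next state depends only on the current one."""
    A = np.asarray(A, dtype=float)
    states = [initial_state]
    for _ in range(length - 1):
        current = states[-1]
        states.append(int(rng.choice(A.shape[0], p=A[current])))
    return states


# Hypothetical 3-state matrix (no backward transitions), for illustration only.
A = np.array([[0.8, 0.2, 0.0],
              [0.0, 0.7, 0.3],
              [0.0, 0.0, 1.0]])
rng = np.random.default_rng(0)
path = sample_chain(A, initial_state=0, length=20, rng=rng)
print(is_valid_transition_matrix(A), path)
```

Because every entry below the diagonal is zero in this example, the sampled state index can never decrease along the sequence.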
2.2.2 Hidden Markov Model
A Hidden Markov Model (HMM) is a finite state machine which generates a sequence of observations. But, in this case, the states cannot be directly observed (they are hidden) [Boite et al. 1999] [Rabiner & Juang 1993]. A HMM is a doubly embedded stochastic process in which the state changes at each time unit according to the state transition probabilities of a Markov process, and then generates the observational data through the output probability function associated with the current state.
Figure 2.1: Schematic representation of (a) a 3-state ergodic HMM and (b) a 4-state left-to-right HMM, together with emission $\mathbf{B} = \{b_j(\mathbf{o})\}$ and transition $\mathbf{A} = \{a_{ij}\}$ probabilities associated with each state.
Figure 2.1 provides some examples of typical HMM topologies. Figure 2.1a represents a 3-state completely interconnected model in which each state can be reached from every other state in a single transition. A model in which each state can be reached from any other state in a finite number of transitions is called ergodic. Figure 2.1b represents a 4-state left-to-right model, also called Bakis model, in which successive state indices are greater than or equal to the preceding ones. Left-to-right models with no skip are widely used to model speech units.
In the following, $\mathbf{o}$ is a $d$-dimensional observation vector ($\mathbf{o}_t$ representing a particular observation at time $t$).
An $N$-state HMM is defined by its model parameters

$$\lambda = \{\mathbf{A}, \mathbf{B}, \boldsymbol{\pi}\} \tag{2.5}$$

including:

• the initial state probabilities $\boldsymbol{\pi} = \{\pi_i\}_{i=1}^{N}$:

$$\pi_i = P(q_1 = i), \quad 1 \le i \le N \tag{2.6}$$

• the matrix of state transition probabilities $\mathbf{A} = \{a_{ij}\}_{i,j=1}^{N}$ where

$$a_{ij} = P(q_{t+1} = j \mid q_t = i), \quad 1 \le i, j \le N \tag{2.7}$$

is the probability of changing from state $i$ to state $j$, under the common hypotheses of considering an underlying first order Markov process (i.e. the transition probability depends only on the current state, not on the previous ones) and time-independent transition probabilities. In the case of a fully connected HMM, i.e. when each state can reach all the other ones in one step, we have $a_{ij} > 0$. In other cases, we could have $a_{ij} = 0$ for one or more $(i, j)$ state pairs.

• a matrix of emission probabilities $\mathbf{B} = \{b_j(\mathbf{o}_t)\}_{j=1}^{N}$ where

$$b_j(\mathbf{o}_t) = P(\mathbf{o}_t \mid q_t = j), \quad 1 \le j \le N \tag{2.8}$$

is the probability of generating the observation $\mathbf{o}_t$ given state $j$ at time $t$:

– in a discrete distribution HMM, $\mathbf{o}_t \in V = \{v_1, v_2, \ldots, v_K\}$ ($K$ being the number of distinct observation symbols per state) and

$$b_j(\mathbf{o}_t) = b_j(k) = P(\mathbf{o}_t = v_k \mid q_t = j), \quad 1 \le k \le K \tag{2.9}$$

defines the probability of observing the output $\mathbf{o}_t = v_k$ while being in state $j$, $j = 1, 2, \ldots, N$.

– in a Continuous Distribution HMM (CD-HMM), $\mathbf{o}_t \in \mathbb{R}^d$ and the emission probability distribution is generally modeled by a multivariate Gaussian mixture distribution as follows:

$$b_j(\mathbf{o}_t) = \sum_{m=1}^{M} c_{jm}\, \mathcal{N}(\mathbf{o}_t; \boldsymbol{\mu}_{jm}, \boldsymbol{\Sigma}_{jm}), \quad 1 \le j \le N \tag{2.10}$$

where:

∗ $M$ is the number of Gaussian components in the mixture;

∗ $c_{jm}$ is the weight of mixture component $m$ in state $j$, respecting the following constraints:

$$c_{jm} \ge 0, \quad 1 \le j \le N,\ 1 \le m \le M \tag{2.11a}$$

$$\sum_{m=1}^{M} c_{jm} = 1, \quad 1 \le j \le N \tag{2.11b}$$

∗ $\mathcal{N}(\mathbf{o}_t; \boldsymbol{\mu}_{jm}, \boldsymbol{\Sigma}_{jm})$ corresponds to the $m$th Gaussian mixture component in state $j$ (with mean vector $\boldsymbol{\mu}_{jm}$ and covariance matrix $\boldsymbol{\Sigma}_{jm}$). Note that the Gaussian assumption is made without any loss of generality [Rabiner & Juang 1993].
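In the general case, the above-mentioned multivariate Gaussian PDF is expressed as follows:

$$\mathcal{N}(\mathbf{o}; \boldsymbol{\mu}_{jm}, \boldsymbol{\Sigma}_{jm}) = \frac{1}{(2\pi)^{d/2} |\boldsymbol{\Sigma}_{jm}|^{1/2}} \exp\left(-\frac{1}{2} (\mathbf{o} - \boldsymbol{\mu}_{jm})^{\top} \boldsymbol{\Sigma}_{jm}^{-1} (\mathbf{o} - \boldsymbol{\mu}_{jm})\right) \tag{2.12}$$

In the case of a diagonal covariance matrix (i.e. when the coefficients of the feature vector are not correlated with each other), the latter equation becomes:

$$\mathcal{N}(\mathbf{o}; \boldsymbol{\mu}_{jm}, \boldsymbol{\Sigma}_{jm}) = \prod_{i=1}^{d} \frac{1}{\sqrt{2\pi \Sigma_{jmi}^{2}}} \exp\left(-\frac{1}{2} \left(\frac{o_i - \mu_{jmi}}{\Sigma_{jmi}}\right)^{2}\right) \tag{2.13}$$

where $\mu_{jmi}$ represents the $i$th component of $\boldsymbol{\mu}_{jm}$ and $\Sigma_{jmi}^{2}$ are the diagonal elements of the covariance matrix $\boldsymbol{\Sigma}_{jm}$.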
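The emission probability of Equation 2.10 with the diagonal-covariance Gaussian of Equation 2.13 can be sketched as follows. This is a minimal illustration, not the implementation used in this work: the function names and the numerical values are assumptions, and the `variances` arguments hold the diagonal elements $\Sigma_{jmi}^{2}$ of each component.

```python
import numpy as np


def diag_gaussian_pdf(o, mean, var):
    """Gaussian PDF with diagonal covariance (Equation 2.13).
    `var` holds the diagonal elements of the covariance matrix."""
    o, mean, var = (np.asarray(x, dtype=float) for x in (o, mean, var))
    norm = np.prod(1.0 / np.sqrt(2.0 * np.pi * var))
    return float(norm * np.exp(-0.5 * np.sum((o - mean) ** 2 / var)))


def gmm_emission(o, weights, means, variances):
    """Emission probability b_j(o) of one state as a weighted sum of
    Gaussian components (Equation 2.10)."""
    return float(sum(c * diag_gaussian_pdf(o, mu, var)
                     for c, mu, var in zip(weights, means, variances)))


# Hypothetical 2-component mixture over 2-dimensional observations.
o = np.array([0.2, -0.1])
weights = [0.6, 0.4]                       # must satisfy (2.11a) and (2.11b)
means = [[0.0, 0.0], [1.0, -1.0]]
variances = [[1.0, 1.0], [0.5, 2.0]]
b = gmm_emission(o, weights, means, variances)
print(b)
```

In practice such densities are usually evaluated in the log domain to avoid numerical underflow for high-dimensional feature vectors; the linear-domain form is kept here only because it mirrors the equations directly.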
Figure 2.2: Output distributions: (a) Gaussian PDF, (b) Gaussian mixture PDF, (c) Multi-stream PDF. Adapted from [Yamagishi 2006].
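When the observation vector $\mathbf{o}_t$ is divided into $S$ stochastically independent data streams, i.e., $\mathbf{o} = \left[\mathbf{o}_1^{\top}, \mathbf{o}_2^{\top}, \ldots, \mathbf{o}_S^{\top}\right]^{\top}$ as illustrated in Figure 2.2, $b_j(\mathbf{o})$ can be formulated as a product of Gaussian mixture densities [Yamagishi 2006]:

$$b_j(\mathbf{o}) = \prod_{s=1}^{S} b_{js}(\mathbf{o}_s) \tag{2.14a}$$

$$= \prod_{s=1}^{S} \left[\sum_{m=1}^{M_s} c_{jsm}\, \mathcal{N}(\mathbf{o}_s; \boldsymbol{\mu}_{jsm}, \boldsymbol{\Sigma}_{jsm})\right], \quad 1 \le j \le N \tag{2.14b}$$

where $M_s$ is the number of components in stream $s$, and $c_{jsm}$, $\boldsymbol{\mu}_{jsm}$ and $\boldsymbol{\Sigma}_{jsm}$ are respectively the weight, mean vector and covariance matrix of the $m$th mixture component of state $j$ in stream $s$.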
Modeling real-world processes by HMMs requires solving the three following basic problems [Ferguson 1980b], whose formal efficient mathematical solutions are detailed for instance in [Boite et al. 1999] [Rabiner & Juang 1993]:
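• problem #1 ($P(\mathbf{O} \mid \lambda)$ Evaluation): given a HMM model $\lambda = \{\mathbf{A}, \mathbf{B}, \boldsymbol{\pi}\}$, how to compute efficiently the probability $P(\mathbf{O} \mid \lambda)$ of the observation sequence $\mathbf{O} = (\mathbf{o}_1, \ldots, \mathbf{o}_t, \ldots, \mathbf{o}_T)$?
• problem #2 (Optimal State Sequence): given a HMM model $\lambda = \{\mathbf{A}, \mathbf{B}, \boldsymbol{\pi}\}$, how to determine the state sequence $\mathbf{q} = (q_1, \ldots, q_t, \ldots, q_T)$ that best explains the observation sequence $\mathbf{O} = (\mathbf{o}_1, \ldots, \mathbf{o}_t, \ldots, \mathbf{o}_T)$?
• problem #3 (Parameter Estimation): given the observation sequence $\mathbf{O} = (\mathbf{o}_1, \ldots, \mathbf{o}_t, \ldots, \mathbf{o}_T)$, how to adjust the model parameters $\lambda = \{\mathbf{A}, \mathbf{B}, \boldsymbol{\pi}\}$ in order to maximize $P(\mathbf{O} \mid \lambda)$?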
2.2.2.1 Solution to Problem #1: P(O|λ) Evaluation
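The probability $P(\mathbf{O} \mid \lambda)$ of the observation sequence $\mathbf{O} = (\mathbf{o}_1, \ldots, \mathbf{o}_t, \ldots, \mathbf{o}_T)$ given the model $\lambda$ can be efficiently computed by the Forward-Backward procedure. This procedure is based on the forward and backward probabilities defined as:

• $\alpha_t(i) = P(\mathbf{o}_1, \mathbf{o}_2, \ldots, \mathbf{o}_t, q_t = i \mid \lambda)$, the probability of the partial observation sequence $\mathbf{o}_1, \mathbf{o}_2, \ldots, \mathbf{o}_t$ until time $t$ and state $i$ at time $t$, given the model $\lambda$;

• $\beta_t(i) = P(\mathbf{o}_{t+1}, \mathbf{o}_{t+2}, \ldots, \mathbf{o}_T \mid q_t = i, \lambda)$, the probability of the partial observation sequence from $t+1$ to the end, given state $i$ at time $t$ and the model $\lambda$.

$\alpha_t(i)$ and $\beta_t(i)$ can be calculated recursively as follows:

1. Initialization

$$\alpha_1(i) = \pi_i\, b_i(\mathbf{o}_1), \quad 1 \le i \le N \tag{2.15a}$$

$$\beta_T(i) = 1, \quad 1 \le i \le N \tag{2.15b}$$

2. Recursion

$$\alpha_{t+1}(i) = \left[\sum_{j=1}^{N} \alpha_t(j)\, a_{ji}\right] b_i(\mathbf{o}_{t+1}), \quad 1 \le i \le N,\ 1 \le t \le T-1 \tag{2.16a}$$

$$\beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(\mathbf{o}_{t+1})\, \beta_{t+1}(j), \quad 1 \le i \le N,\ t = T-1, \ldots, 1 \tag{2.16b}$$

Finally, the probability $P(\mathbf{O} \mid \lambda)$ is given by

$$P(\mathbf{O} \mid \lambda) = \sum_{i=1}^{N} \alpha_T(i) \tag{2.17}$$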
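The recursions above can be sketched in a few lines of NumPy. The sketch assumes that the emission probabilities have already been evaluated and stored in a matrix `B` with `B[t, i]` $= b_i(\mathbf{o}_t)$; this layout, the function name and the toy 2-state discrete model are assumptions of the example. No scaling is applied, so the code is only suitable for short sequences.

```python
import numpy as np


def forward_backward(pi, A, B):
    """Forward and backward probabilities of an N-state HMM.

    pi : (N,)   initial state probabilities
    A  : (N, N) transition probabilities, A[i, j] = a_ij
    B  : (T, N) emission probabilities, B[t, i] = b_i(o_t)
    Returns alpha (T, N), beta (T, N) and P(O | lambda).
    """
    pi, A, B = (np.asarray(x, dtype=float) for x in (pi, A, B))
    T, N = B.shape
    alpha = np.zeros((T, N))
    beta = np.ones((T, N))            # initialization: beta_T(i) = 1
    alpha[0] = pi * B[0]              # initialization: alpha_1(i) = pi_i b_i(o_1)
    for t in range(T - 1):            # forward recursion
        alpha[t + 1] = (alpha[t] @ A) * B[t + 1]
    for t in range(T - 2, -1, -1):    # backward recursion
        beta[t] = A @ (B[t + 1] * beta[t + 1])
    return alpha, beta, float(alpha[-1].sum())


# Hypothetical 2-state model with 2 discrete observation symbols.
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])
E = np.array([[0.5, 0.5],             # E[i, k] = P(symbol k | state i)
              [0.1, 0.9]])
obs = [0, 1, 0]
B = E[:, obs].T                       # B[t, i] = b_i(o_t)
alpha, beta, likelihood = forward_backward(pi, A, B)
print(likelihood)
```

A useful sanity check is that $\sum_i \alpha_t(i)\,\beta_t(i)$ gives the same value $P(\mathbf{O} \mid \lambda)$ for every $t$, which follows directly from the definitions of the two quantities.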
2.2.2.2 Solution to Problem #2: Optimal State Sequence
The difficulty here lies with the definition of the “optimal” state sequence, that is, there are several possible optimality criteria. However, the most widely used criterion is to find the single best state sequence (path), i.e., to maximize $P(\mathbf{q} \mid \mathbf{O}, \lambda)$, which is equivalent to maximizing $P(\mathbf{q}, \mathbf{O} \mid \lambda)$. This can be achieved based on dynamic programming techniques, using the Viterbi algorithm [Viterbi 1967] [Forney 1973].
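Let $\delta_t(i)$ be the highest probability along a single path which accounts for the first $t$ observations $\mathbf{O}_1^t = (\mathbf{o}_1, \ldots, \mathbf{o}_t)$ and ends in state $i$:

$$\delta_t(i) = \max_{q_1 q_2 \ldots q_{t-1}} P(q_1, q_2, \ldots, q_{t-1}, q_t = i, \mathbf{O}_1^t \mid \lambda) \tag{2.18}$$

The Viterbi algorithm can be written as follows:

1. Initialization

$$\delta_1(i) = \pi_i\, b_i(\mathbf{o}_1), \quad 1 \le i \le N \tag{2.19a}$$

$$\psi_1(i) = 0, \quad 1 \le i \le N \tag{2.19b}$$

2. Recursion

$$\delta_t(j) = \max_{1 \le i \le N} \left[\delta_{t-1}(i)\, a_{ij}\right] \cdot b_j(\mathbf{o}_t), \quad 1 \le j \le N,\ 2 \le t \le T \tag{2.20a}$$

$$\psi_t(j) = \arg\max_{1 \le i \le N} \left[\delta_{t-1}(i)\, a_{ij}\right], \quad 1 \le j \le N,\ 2 \le t \le T \tag{2.20b}$$

In order to later retrieve the followed state sequence, it is necessary to keep track of the argument that maximized Equation 2.20a for each $t$ and each $j$. This is the role of the array $\psi_t(j)$.

3. Termination

$$p^* = P(\mathbf{O}, \mathbf{q}^* \mid \lambda) = \max_{1 \le i \le N} \left[\delta_T(i)\right] \tag{2.21a}$$

$$q_T^* = \arg\max_{1 \le i \le N} \left[\delta_T(i)\right] \tag{2.21b}$$

4. Path (state sequence) backtracking

$$q_t^* = \psi_{t+1}(q_{t+1}^*), \quad t = T-1, T-2, \ldots, 1 \tag{2.22}$$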
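The four steps above translate directly into the following sketch. As in the equations, probabilities are multiplied in the linear domain (practical implementations generally work with log-probabilities instead). The `B[t, i]` $= b_i(\mathbf{o}_t)$ layout, the function name and the toy model are assumptions of the example, and state indices are 0-based.

```python
import numpy as np


def viterbi(pi, A, B):
    """Single best state sequence of an HMM.

    pi : (N,)   initial state probabilities
    A  : (N, N) transition probabilities, A[i, j] = a_ij
    B  : (T, N) emission probabilities, B[t, i] = b_i(o_t)
    Returns the best path (list of 0-based states) and P(O, q* | lambda).
    """
    pi, A, B = (np.asarray(x, dtype=float) for x in (pi, A, B))
    T, N = B.shape
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)
    delta[0] = pi * B[0]                          # initialization
    for t in range(1, T):                         # recursion
        scores = delta[t - 1][:, None] * A        # scores[i, j] = delta_{t-1}(i) a_ij
        psi[t] = scores.argmax(axis=0)            # best predecessor of each state j
        delta[t] = scores.max(axis=0) * B[t]
    path = np.zeros(T, dtype=int)
    path[-1] = delta[-1].argmax()                 # termination
    for t in range(T - 2, -1, -1):                # backtracking
        path[t] = psi[t + 1, path[t + 1]]
    return path.tolist(), float(delta[-1].max())


# Hypothetical 2-state model with 2 discrete observation symbols.
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])
E = np.array([[0.5, 0.5],                         # E[i, k] = P(symbol k | state i)
              [0.1, 0.9]])
obs = [0, 1, 0]
B = E[:, obs].T
path, p_star = viterbi(pi, A, B)
print(path, p_star)
```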
2.2.2.3 Solution to Problem #3: Parameter Estimation
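There is no known analytical solution for finding, in a closed form, the model parameter set $\lambda = \{\mathbf{A}, \mathbf{B}, \boldsymbol{\pi}\}$ which globally maximizes the probability $P(\mathbf{O} \mid \lambda)$ of a given observation sequence $\mathbf{O}$:

$$\lambda^* = \arg\max_{\lambda} P(\mathbf{O} \mid \lambda) = \arg\max_{\lambda} \sum_{\mathbf{q}} P(\mathbf{O}, \mathbf{q} \mid \lambda) \tag{2.23}$$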
However, a parameter set $\lambda$ which locally maximizes the likelihood $P(\mathbf{O} \mid \lambda)$ can be obtained using an iterative procedure such as the Baum-Welch algorithm (also called forward-backward algorithm) [Dempster et al. 1977] or the Viterbi algorithm, depending on whether the likelihood is estimated by considering all possible paths in the model or only the best one respectively. These algorithms are variants of the Expectation-Maximization (EM) procedure, a general technique for finding maximum likelihood estimators in models including hidden (also called latent or missing) variables, such as the states of an HMM model.
In the following, the Baum-Welch algorithm for the CD-HMM with Gaussian mixture distributions is briefly described. Corresponding formulae for single Gaussian or discrete output distributions can be derived straightforwardly.
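In the Expectation step, the current model parameter set $\lambda'$ is used to compute the posterior probabilities of the HMM hidden variables as follows:

• the transition posterior probability $\xi_t(i, j)$ is the probability of being in state $i$ at time $t$ and in state $j$ at time $t+1$ given the model $\lambda'$ and the observation sequence $\mathbf{O}$, i.e.,

$$\xi_t(i, j) = P(q_t = i, q_{t+1} = j \mid \mathbf{O}, \lambda') \tag{2.24}$$

We have:

$$\xi_t(i, j) = \frac{P(q_t = i, q_{t+1} = j, \mathbf{O} \mid \lambda')}{P(\mathbf{O} \mid \lambda')} = \frac{P(q_t = i, q_{t+1} = j, \mathbf{O} \mid \lambda')}{\sum_{i,j=1}^{N} P(q_t = i, q_{t+1} = j, \mathbf{O} \mid \lambda')} \tag{2.25}$$

and from the definition of the forward and backward probabilities $\alpha_t(i)$ and $\beta_t(i)$, it follows that

$$\xi_t(i, j) = \frac{\alpha_t(i)\, a_{ij}\, b_j(\mathbf{o}_{t+1})\, \beta_{t+1}(j)}{P(\mathbf{O} \mid \lambda')} = \frac{\alpha_t(i)\, a_{ij}\, b_j(\mathbf{o}_{t+1})\, \beta_{t+1}(j)}{\sum_{i,j=1}^{N} \alpha_t(i)\, a_{ij}\, b_j(\mathbf{o}_{t+1})\, \beta_{t+1}(j)} \tag{2.26}$$

• the state posterior probability $\gamma_t(i)$ is the probability of being in state $i$ at time $t$ given the model $\lambda'$ and the observation sequence $\mathbf{O}$, i.e.,

$$\gamma_t(i) = P(q_t = i \mid \mathbf{O}, \lambda') \tag{2.27}$$

We have:

$$\gamma_t(i) = \frac{P(\mathbf{O}, q_t = i \mid \lambda')}{P(\mathbf{O} \mid \lambda')} = \frac{P(\mathbf{O}, q_t = i \mid \lambda')}{\sum_{j=1}^{N} P(\mathbf{O}, q_t = j \mid \lambda')} \tag{2.28}$$

and from the definition of the forward and backward probabilities $\alpha_t(i)$ and $\beta_t(i)$, it follows that

$$\gamma_t(i) = \frac{\alpha_t(i)\, \beta_t(i)}{P(\mathbf{O} \mid \lambda')} = \frac{\alpha_t(i)\, \beta_t(i)}{\sum_{j=1}^{N} \alpha_t(j)\, \beta_t(j)} \tag{2.29}$$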
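The probability $\gamma_t(i, m)$ of being in state $i$ at time $t$ given the model $\lambda'$ and the observation sequence $\mathbf{O}$, taking into account only the $m$th component of the considered state Gaussian mixture output distribution, is given by

$$\gamma_t(i, m) = \frac{\alpha_t(i)\, \beta_t(i)}{\sum_{j=1}^{N} \alpha_t(j)\, \beta_t(j)} \cdot \frac{c_{im}\, \mathcal{N}(\mathbf{o}_t; \boldsymbol{\mu}_{im}, \boldsymbol{\Sigma}_{im})}{\sum_{n=1}^{M} c_{in}\, \mathcal{N}(\mathbf{o}_t; \boldsymbol{\mu}_{in}, \boldsymbol{\Sigma}_{in})} \tag{2.30}$$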
In the Maximization step, the current model parameter set $\lambda'$ is replaced by a new parameter set $\lambda$ which maximizes the auxiliary function

$$Q(\lambda, \lambda') = \sum_{\mathbf{q}} P(\mathbf{q} \mid \mathbf{O}, \lambda') \ln P(\mathbf{O}, \mathbf{q} \mid \lambda) \tag{2.31}$$

taking into account the posterior probabilities of the HMM hidden variables computed in the expectation step. Applied iteratively, this procedure can be proved to increase the model likelihood $P(\mathbf{O} \mid \lambda)$ monotonically and converge to a critical point.
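The maximization of the auxiliary function $Q(\lambda, \lambda')$ over $\lambda$, subject to the constraints $\sum_{j=1}^{N} \pi_j = 1$, $\sum_{j=1}^{N} a_{ij} = 1$ and $\sum_{m=1}^{M} c_{im} = 1$ ($1 \le i \le N$), leads to the following reestimation formulae:

$$\bar{\pi}_i = \gamma_1(i) \tag{2.32}$$

$$\bar{a}_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{t=1}^{T-1} \gamma_t(i)} \tag{2.33}$$

$$\bar{c}_{im} = \frac{\sum_{t=1}^{T} \gamma_t(i, m)}{\sum_{t=1}^{T} \sum_{n=1}^{M} \gamma_t(i, n)} \tag{2.34}$$

$$\bar{\boldsymbol{\mu}}_{im} = \frac{\sum_{t=1}^{T} \gamma_t(i, m) \cdot \mathbf{o}_t}{\sum_{t=1}^{T} \gamma_t(i, m)} \tag{2.35}$$

$$\bar{\boldsymbol{\Sigma}}_{im} = \frac{\sum_{t=1}^{T} \gamma_t(i, m) \cdot (\mathbf{o}_t - \boldsymbol{\mu}_{im}) \cdot (\mathbf{o}_t - \boldsymbol{\mu}_{im})^{\top}}{\sum_{t=1}^{T} \gamma_t(i, m)} \tag{2.36}$$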
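As a hedged illustration of one Expectation-Maximization iteration, the sketch below treats the simplest special case: scalar observations and a single Gaussian per state ($M = 1$, so that $c_{i1} = 1$ and $\gamma_t(i, 1) = \gamma_t(i)$). Several choices are assumptions of the example and not necessarily those of this work: the function names, the toy data and initial values, the absence of scaling, and the use of the freshly updated mean in the variance update (the usual EM convention).

```python
import numpy as np


def gaussian(o, mean, var):
    """Univariate Gaussian density."""
    return np.exp(-0.5 * (o - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)


def forward_backward(pi, A, B):
    """Unscaled forward and backward probabilities; B[t, i] = b_i(o_t)."""
    T, N = B.shape
    alpha, beta = np.zeros((T, N)), np.ones((T, N))
    alpha[0] = pi * B[0]
    for t in range(T - 1):
        alpha[t + 1] = (alpha[t] @ A) * B[t + 1]
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[t + 1] * beta[t + 1])
    return alpha, beta, alpha[-1].sum()


def baum_welch_step(obs, pi, A, means, variances):
    """One EM iteration. Returns the updated parameters and the likelihood
    of the observations under the *input* parameters."""
    obs, pi, A, means, variances = (
        np.asarray(x, dtype=float) for x in (obs, pi, A, means, variances))
    B = gaussian(obs[:, None], means[None, :], variances[None, :])
    alpha, beta, likelihood = forward_backward(pi, A, B)
    # E-step: state posteriors gamma_t(i) and transition posteriors xi_t(i, j).
    gamma = alpha * beta / likelihood
    xi = (alpha[:-1, :, None] * A[None, :, :]
          * (B[1:] * beta[1:])[:, None, :]) / likelihood
    # M-step: re-estimation of pi, A, means and variances.
    new_pi = gamma[0]
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    occupancy = gamma.sum(axis=0)
    new_means = (gamma * obs[:, None]).sum(axis=0) / occupancy
    new_vars = (gamma * (obs[:, None] - new_means[None, :]) ** 2).sum(axis=0) / occupancy
    return new_pi, new_A, new_means, new_vars, float(likelihood)


# Hypothetical data: two clusters of scalar observations.
obs = np.array([0.1, -0.2, 0.0, 0.3, 2.9, 3.1, 3.0, 2.8, 0.2, -0.1])
pi = np.array([0.5, 0.5])
A = np.array([[0.6, 0.4], [0.4, 0.6]])
means, variances = np.array([0.5, 2.5]), np.array([1.0, 1.0])
likelihoods = []
for _ in range(5):
    pi, A, means, variances, lik = baum_welch_step(obs, pi, A, means, variances)
    likelihoods.append(lik)
print(likelihoods)
```

The list of likelihoods collected over the iterations should be non-decreasing, which is the monotonicity property stated above.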
2.3 Overview of HMM-based Speech Synthesis
In such a statistical parametric speech synthesis system, speech parameters are extracted from a database of natural speech signals. Generative models are then trained to model those parameters, combined with their associated labels. The synthesized speech waveform is eventually built from the parametric representations of speech learned by the HMMs.
Statistical parametric speech synthesis is called HMM-based speech synthesis when HMMs are used as generative models, albeit any other generative models could be implemented. An overview of the HMM-based speech synthesis system is displayed in Figure 2.3. This system is made of two main parts: the training and the synthesis steps.
Figure 2.3: Overview of the HMM-based Speech Synthesis System (“H-Triple-S” - HTS), from [Zen et al. 2009].
2.4 Training Step in HMM-based Speech Synthesis
During the training step, spectral and excitation parameters are extracted from a database of natural speech signals. Spectral parameters typically consist of the mel-cepstral coefficients together with their first and second derivatives (respectively $\Delta$ and $\Delta^2$, detailed in Section 2.5.2). Excitation parameters are generally the logarithm of the fundamental frequency log(F0), also with its $\Delta$ and $\Delta^2$ coefficients. HMMs are then trained on these speech parameters together with their respective labels. As a result, not only spectrum parameters but also F0 and duration are modeled in a unified framework. The model parameter set is commonly estimated based on the following Maximum Likelihood (ML) criterion, using the EM algorithm described in Section 2.2.2.3:

$$\hat{\lambda} = \arg\max_{\lambda} \{P(\mathbf{O} \mid W, \lambda)\} \tag{2.37}$$

where $\lambda$ is the model parameter set, $\mathbf{O}$ is the training data set and $W$ is the label sequence set associated with $\mathbf{O}$.
2.4.1 Spectral Parameters
Feature extraction basically relies on the source-filter model, in which speech is described as a source signal, representing the air flow at the vocal folds, passed through a time-varying filter, representing the effect of the vocal tract [Dutoit & Dupont 2010]. It is based on the hypothesis that the glottis and the vocal tract are fully decoupled, leading to the separation of the filter and source parts of the model.
The general approach consists in extracting some smooth representation of the signal power spectral density (characteristic of the filter frequency response), usually estimated over analysis frames of typically 25 ms with 5 ms shifts. This takes into account the time-varying nature of both the source and the filter. The main tools used in spectral parameters extraction include:
• short-time Fourier transform, providing the power and phase spectra of short analysis frames;

• Linear Predictive Coding (LPC) in which the vocal tract is modeled by an all-pole filter, whose transfer function is described as:

$$H(z) = \frac{K}{1 - \sum_{m=0}^{M} c(m)\, z^{-m}} \tag{2.38}$$

where $K$ and $c(m)$ are respectively the gain of the filter and the $M$th order LP coefficients;

• cepstrum, computed as the inverse short-time Fourier transform of the logarithm of the power spectrum. In this case:

$$H(z) = \exp \sum_{m=0}^{M} c(m)\, z^{-m} \tag{2.39}$$

where $c(m)$ are the $M$th order cepstral coefficients. It can be shown that low order elements of the cepstrum vectors provide a good approximation of the filter part of the model;

• Mel-Frequency Cepstrum Coefficients (MFCCs), which take into account the human auditory system, where the cepstral coefficients are computed for a spectrum that has been warped along a nonlinear spectral scale.
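To illustrate the frame-based analysis and the cepstrum item of the list above, here is a minimal NumPy sketch that cuts a signal into 25 ms frames with 5 ms shifts and computes the low-order cepstral coefficients of each frame as the inverse Fourier transform of the log power spectrum. The function names, the Hanning window, the FFT size, the cepstral order and the synthetic test signal are assumptions chosen for the example; this is not the mel-cepstral analysis used in HTS.

```python
import numpy as np


def frame_signal(x, fs, frame_ms=25.0, shift_ms=5.0):
    """Cut a 1-D signal into overlapping analysis frames (one frame per row)."""
    frame_len = int(round(fs * frame_ms / 1000.0))
    shift = int(round(fs * shift_ms / 1000.0))
    n_frames = 1 + (len(x) - frame_len) // shift
    return np.stack([x[i * shift: i * shift + frame_len] for i in range(n_frames)])


def real_cepstrum(frame, n_fft=1024, order=24):
    """First `order` + 1 cepstral coefficients of one frame."""
    windowed = frame * np.hanning(len(frame))
    power = np.abs(np.fft.rfft(windowed, n_fft)) ** 2
    log_power = np.log(power + 1e-12)      # small floor avoids log(0)
    cepstrum = np.fft.irfft(log_power, n_fft)
    return cepstrum[:order + 1]


# Synthetic 1 s signal at 16 kHz (a 120 Hz sine plus a little noise).
fs = 16000
t = np.arange(fs) / fs
rng = np.random.default_rng(0)
x = np.sin(2.0 * np.pi * 120.0 * t) + 0.01 * rng.standard_normal(fs)

frames = frame_signal(x, fs)
cepstra = np.array([real_cepstrum(f) for f in frames])
print(frames.shape, cepstra.shape)
```

Keeping only the low-order coefficients corresponds to retaining the smooth spectral envelope, i.e. the filter part of the source-filter model, while discarding the fine harmonic structure.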
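In direct continuity, the Mel-Generalized Cepstral (MGC) analysis was proposed in [Kobayashi & Imai 1984]:
$$H(z) = \begin{cases} \left( 1 + \gamma \sum_{m=0}^{M} c_{\alpha,\gamma}(m)\, z_{\alpha}^{-m} \right)^{1/\gamma} & \text{if } 0 < |\gamma| \le 1 \quad (2.40a) \\ \exp \sum_{m=0}^{M} c_{\alpha,\gamma}(m)\, z_{\alpha}^{-m} & \text{if } \gamma = 0 \quad (2.40b) \end{cases}$$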
where $c_{\alpha,\gamma}(m)$ are the Mth order MGC coefficients. The variable $z_{\alpha}^{-1}$ is expressed as the following first order all-pass function:
$$z_{\alpha}^{-1} = \Psi(z) = \frac{z^{-1} - \alpha}{1 - \alpha z^{-1}} \tag{2.41}$$
modeling the nonlinear frequency transformation performed by the human auditory system. Combined with the frequency warping factor α, γ is a parameter that makes it possible to obtain various standard types of coefficients, as illustrated in Table 2.1. With a judicious choice of α, the resulting mel scale becomes a good approximation of the human perceptual scale: e.g. α = 0.31 for a sampling rate of 8 kHz and α = 0.42 for a sampling rate of 16 kHz.
Table 2.1: Mel-Generalized Cepstral (MGC) analysis.
             γ = 0                    0 < |γ| < 1                          |γ| = 1
α = 0        Cepstral analysis        Generalized cepstral analysis        LPC analysis
|α| < 1      Mel-cepstral analysis    Mel-generalized cepstral analysis    Mel-LPC analysis
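The frequency warping induced by the all-pass function of Equation 2.41 can be inspected numerically. On the unit circle, Ψ has unit magnitude and its phase defines a warped frequency axis. The sketch below computes that axis and compares it with a mel scale. The mel formula used here, the normalisation of both scales to the Nyquist frequency and all function names are assumptions made for illustration.

```python
import numpy as np

def warped_frequency(omega, alpha):
    """Warped angular frequency (radians) given by the phase response of
    the first order all-pass function of Equation 2.41."""
    return omega + 2.0 * np.arctan2(alpha * np.sin(omega),
                                    1.0 - alpha * np.cos(omega))

def mel(f_hz):
    """One common mel-scale formula (assumed here, not taken from the text)."""
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

def warping_error(alpha, fs, n_points=256):
    """Largest absolute gap between the warped scale and the mel scale,
    both normalised to 1 at the Nyquist frequency."""
    f = np.linspace(0.0, fs / 2.0, n_points)
    omega = 2.0 * np.pi * f / fs
    warped = warped_frequency(omega, alpha) / np.pi
    target = mel(f) / mel(fs / 2.0)
    return float(np.max(np.abs(warped - target)))
```

With α = 0 the warping reduces to the identity (the first row of Table 2.1). Comparing `warping_error(0.42, 16000)` with `warping_error(0.0, 16000)` gives a rough idea of how much closer the warped axis gets to the assumed mel scale.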
2.4.2 F0 Modeling
To model fixed-dimensional parameter sequences, such as spectral parameters, single multivariate Gaussian distributions are typically used as stream-output distributions. However, it is difficult to apply either a discrete or a continuous distribution to variable-dimensional parameter sequences, such as log(F0). Indeed, F0 values are not defined in unvoiced regions: the observation sequence of an F0 pattern is composed of one-dimensional continuous values in voiced regions and of a discrete symbol representing “unvoiced” in unvoiced regions.
Considering that the observed F0 value is drawn from a one-dimensional space $\Omega_1$ and the unvoiced symbol from a zero-dimensional space $\Omega_2$, this kind of observation sequence can be modeled by a Multi-Space probability Distribution (MSD), as displayed in Figure 2.4. The integration of MSD in the HMM framework is called MSD-HMM [Tokuda et al. 1999].
2.4. Training Step in HMM-based Speech Synthesis 21
Figure 2.4: F0 pattern modeling, from [Masuko 2002].
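As a minimal sketch of this representation, an F0 track can be turned into a sequence of (space index, value) pairs, where voiced frames carry a one-dimensional log(F0) value and unvoiced frames carry an empty (zero-dimensional) value. The convention that unvoiced frames are marked by an F0 of 0, as well as the names used, are assumptions made for the example.

```python
import math

VOICED, UNVOICED = 1, 2  # indices of the 1-D space and of the 0-D space

def f0_to_msd_observations(f0_hz):
    """Map an F0 track to (space_index, value) pairs.
    Assumption: a value of 0 marks an unvoiced frame."""
    observations = []
    for f0 in f0_hz:
        if f0 > 0.0:
            # voiced: one-dimensional continuous observation log(F0)
            observations.append((VOICED, (math.log(f0),)))
        else:
            # unvoiced: discrete symbol with no continuous value
            observations.append((UNVOICED, ()))
    return observations
```

For instance, `f0_to_msd_observations([0.0, 120.0, 0.0])` yields one voiced observation surrounded by two unvoiced ones.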
2.4.2.1 Multi-Space probability Distribution (MSD)
In general, an MSD is described considering a sample space Ω which consists of G spaces, as illustrated in Figure 2.5:
$$\Omega = \bigcup_{g=1}^{G} \Omega_g \tag{2.42}$$
where $\Omega_g$ is an $n_g$-dimensional real space $\mathbb{R}^{n_g}$. Each space has its own dimensionality.
Figure 2.5: Multi-Space probability Distribution (MSD) and observations, from [Masuko 2002].
Each space $\Omega_g$ has its probability $w_g$, i.e. $P(\Omega_g) = w_g$, where $\sum_{g=1}^{G} w_g = 1$. If $n_g > 0$, each space has a probability distribution function $\mathcal{N}_g(x)$ with $x \in \mathbb{R}^{n_g}$. If $n_g = 0$, $\Omega_g$ is assumed to contain only one sample point.
In an MSD, each observation vector o consists of a set of space indices X and a continuous random variable $x \in \mathbb{R}^{n}$, that is:
$$o = (X, x) \tag{2.43}$$
where all spaces specified by X are n-dimensional. Note that X does not necessarily include all indices which specify n-dimensional spaces. Not only the observation vector x but also the space index set X is a random variable, determined at each observation. The observation probability of o is defined by
$$b(o) = \sum_{g \in S(o)} w_g\, \mathcal{N}_g(V(o)) \tag{2.44}$$
where
$$S(o) = X \tag{2.45a}$$
$$V(o) = x \tag{2.45b}$$
Although $\mathcal{N}_g(x)$ does not exist for $n_g = 0$ since $\Omega_g$ contains only one sample point, for simplicity of notation, $\mathcal{N}_g(x) \equiv 1$ is defined for $n_g = 0$.
As an example, the observation $o_1$ shown in Figure 2.5 consists of a set of space indices $X_1 = \{1, 2, G\}$ and a three-dimensional vector $x_1 \in \mathbb{R}^3$. Thus the random variable x is drawn from one of the three spaces $\Omega_1, \Omega_2, \Omega_G \in \mathbb{R}^3$, and its PDF is given by $w_1 \mathcal{N}_1(x) + w_2 \mathcal{N}_2(x) + w_G \mathcal{N}_G(x)$.
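The observation probability of Equation 2.44 can be sketched as follows, assuming (for illustration only) that each space with $n_g > 0$ carries a single Gaussian and that the parameters are stored in dictionaries keyed by space index. The function names and the data layout are hypothetical.

```python
import numpy as np

def gaussian_pdf(x, mean, cov):
    """Multivariate Gaussian density. Returns 1 for a zero-dimensional
    observation, following the convention N_g(x) = 1 when n_g = 0."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    n = x.size
    if n == 0:
        return 1.0
    mean = np.atleast_1d(np.asarray(mean, dtype=float))
    cov = np.atleast_2d(np.asarray(cov, dtype=float))
    diff = x - mean
    norm = np.sqrt((2.0 * np.pi) ** n * np.linalg.det(cov))
    return float(np.exp(-0.5 * diff @ np.linalg.solve(cov, diff)) / norm)

def msd_observation_probability(space_indices, x, weights, means, covs):
    """b(o): sum over g in S(o) of w_g * N_g(V(o)), as in Equation 2.44."""
    return sum(weights[g] * gaussian_pdf(x, means[g], covs[g])
               for g in space_indices)

# Hypothetical F0 example with G = 2: space 1 is voiced (1-D), space 2 is unvoiced (0-D).
weights = {1: 0.7, 2: 0.3}
means = {1: [5.0], 2: None}
covs = {1: [[0.04]], 2: None}
b_voiced = msd_observation_probability({1}, [5.0], weights, means, covs)
b_unvoiced = msd_observation_probability({2}, [], weights, means, covs)
```

In this two-space setting, an unvoiced observation simply receives the weight of the zero-dimensional space, while a voiced observation receives the voiced weight multiplied by the Gaussian density of log(F0).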
2.4.2.2 HMMs based on Multi-Space probability Distribution (MSD-HMM)
An N-state MSD-HMM λ is specified by the initial state probability distribution $\pi = \{\pi_i\}_{i=1}^{N}$, the state transition probability distribution $A = \{a_{ij}\}_{i,j=1}^{N}$ and the state output probability distribution $B = \{b_j(o)\}_{j=1}^{N}$ given in Equation 2.44.
As shown in Figure 2.6, each state i has G PDFs $\mathcal{N}_{i1}(\cdot), \mathcal{N}_{i2}(\cdot), \ldots, \mathcal{N}_{iG}(\cdot)$ associated with their weights $w_{i1}, w_{i2}, \ldots, w_{iG}$, satisfying $\sum_{g=1}^{G} w_{ig} = 1$.
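To make the parameter set concrete, the sketch below stores the quantities named above in a small container and checks the sum-to-one constraints. The class name, the field layout and the omission of the per-space PDF parameters are simplifying assumptions made for illustration. This is not a description of any particular toolkit.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class MsdHmm:
    """Parameters of an N-state MSD-HMM with G spaces per state.
    The PDF parameters of each space are left out for brevity."""
    initial: np.ndarray        # shape (N,): pi_i
    transition: np.ndarray     # shape (N, N): a_ij
    space_weights: np.ndarray  # shape (N, G): w_ig
    space_dims: tuple          # length G: dimensionality n_g of each space

    def validate(self, tol=1e-8):
        """Check shapes and the sum-to-one constraints on pi, A and the weights."""
        n_states = self.initial.shape[0]
        ok_shapes = (self.transition.shape == (n_states, n_states)
                     and self.space_weights.shape == (n_states, len(self.space_dims)))
        ok_sums = (abs(self.initial.sum() - 1.0) < tol
                   and np.allclose(self.transition.sum(axis=1), 1.0, atol=tol)
                   and np.allclose(self.space_weights.sum(axis=1), 1.0, atol=tol))
        return bool(ok_shapes and ok_sums)
```

For F0 modeling, `space_dims` would typically be `(1, 0)`: one one-dimensional voiced space and one zero-dimensional unvoiced space.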
The MSD-HMM parameter estimation procedure is derived from the Baum-Welch algorithm used for conventional HMMs.
In the Expectation step, the posterior probabilities of the MSD-HMM hidden variables are computed from the current model parameters as follows:
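• the posterior probability of being in state i and space h at time t, given the observation sequence and the model,
$$\gamma_t(i, h) = \frac{\alpha_t(i)\, \beta_t(i)}{\sum_{j=1}^{N} \alpha_t(j)\, \beta_t(j)} \cdot \frac{w_{ih}\, \mathcal{N}_{ih}(V(o_t))}{\sum_{g \in S(o_t)} w_{ig}\, \mathcal{N}_{ig}(V(o_t))} \tag{2.46}$$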
• the posterior probability of transitions from state i to state j at time t + 1, given the observation sequence and the model,
ξt(i, j) = αt(i)aijbj(ot+1)βt+1(j) N