Festival TTS Training Material
TTS Group
Indian Institute of Technology Madras Chennai - 600036
India
Contents
1 Introduction
1.1 Nature of scripts of Indian languages
1.2 Convergence and divergence
2 What is Text to Speech Synthesis?
2.1 Components of a text-to-speech system
2.2 Normalization of non-standard words
2.3 Grapheme-to-phoneme conversion
2.4 Prosodic analysis
2.5 Methods of speech generation
2.5.1 Parametric synthesis
2.5.2 Concatenative synthesis
2.6 Primary components of the TTS framework
2.7 Screen readers for the visually challenged
3 Overall Picture
4 Labeling Tool
4.1 How to Install LabelingTool
4.2 Troubleshooting of LabelingTool
5 Labeling Tool User Manual
5.1 How To Use Labeling Tool
5.2 How to do label correction using Labeling tool
5.3 Viewing the labelled file
5.4 Control file
5.5 Performance results for 6 Indian Languages
5.6 Limitations of the tool
6 Unit Selection Synthesis Using Festival
6.1 Cluster unit selection
6.2 Choosing the right unit type
6.3 Collecting databases for unit selection
6.4 Preliminaries
6.5 Building utterance structures for unit selection
6.6 Making cepstrum parameter files
6.7 Building the clusters
7 Building Festival Voice
8 Customizing festival for Indian Languages
8.1 Some of the parameters that were customized to deal with Indian languages in festival framework
8.2 Modifications in source code
9 Trouble Shooting in festival
9.1 Troubleshooting (Issues related with festival)
9.2 Troubleshooting (Issues might occur while synthesizing)
11 NVDA Windows Screen Reader
11.1 Compiling Festival in Windows
12 SAPI compatibility for festival voice
13 Sphere Converter Tool
13.1 Extraction of details from header of the input file
13.1.1 Calculate sample minimum and maximum values
13.1.2 RAW Files
13.1.3 MULAW Files
13.1.4 Output in encoded format
13.2 Configfile
14 Sphere Converter User Manual
14.1 How to Install the Sphere converter tool
14.2 How to use the tool
14.3 Fields in Properties
14.4 Screenshot
14.5 Example of data in the Config file (default properties)
1 Introduction
This training is conducted for new members who have joined the TTS consortium. The main aim of the TTS consortium is to develop text-to-speech (TTS) systems in all 22 official languages of India. These systems are used to build screen readers, which are spoken interfaces for information access that help visually challenged people use a computer with ease, and to make computing ubiquitous and inclusive.
1.1 Nature of scripts of Indian languages
The scripts in Indian languages have originated from the ancient Brahmi script. The basic units of the writing system are referred to as Aksharas. The properties of Aksharas are as follows:
1. An Akshara is an orthographic representation of a speech sound in an Indian language.
2. Aksharas are syllabic in nature.
3. The typical forms of Akshara are V, CV, CCV and CCCV, thus having a generalized form of C*V, where C denotes a consonant and V denotes a vowel.
As Indian languages are Akshara based, and an Akshara is a subset of a syllable, a syllable-based unit selection synthesis system has been built for Indian languages. Further, a syllable corresponds to a basic unit of production, as opposed to the diphone or the phone. Earlier efforts by the consortium members, in particular IIIT Hyderabad and IIT Madras, indicate that natural-sounding synthesisers for Indian languages can be built using the syllable as the basic unit.
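The generalized C*V form can be illustrated with a small sketch. The Python snippet below is only an illustration and is not part of the Festival or consortium code. It assumes the input has already been reduced to a string of C (consonant) and V (vowel) symbols. The function name split_aksharas and the choice to return trailing consonants as a separate final unit are assumptions made for this example.

```python
import re

# Illustrative only: an Akshara is zero or more consonants followed by a vowel (C*V).
# Trailing consonants with no following vowel are returned as a final unit (an assumption).
AKSHARA = re.compile(r"C*V|C+$")

def split_aksharas(cv_pattern):
    """Split a string of 'C'/'V' symbols into C*V units."""
    return AKSHARA.findall(cv_pattern)

print(split_aksharas("CVCCVV"))  # ['CV', 'CCV', 'V']
```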
1.2 Convergence and divergence
The official languages of India, except English and Urdu, share a common phonetic base, i.e., they share a common set of speech sounds. This common phonetic base consists of around 50 phones, including 15 vowels and 35 consonants. While all of these languages share a common phonetic base, some of the languages such as Hindi, Marathi and Nepali also share a common script known as Devanagari. Languages such as Telugu, Kannada and Tamil, however, have their own scripts.
The property that makes these languages unique can be attributed to the phonotactics of each language rather than to the scripts and speech sounds. Phonotactics refers to the permissible combinations of phones that can co-occur in a language. This implies that the distribution of syllables encountered in each language is different. Another dimension in which the Indian languages differ significantly is prosody, which includes the duration, intonation and prominence associated with each syllable in a word or a sentence.
2 What is Text to Speech Synthesis?
A Text to Speech Synthesis System converts text input to speech output. The conversion of text into spoken form is deceptively nontrivial. A naive approach is to store and concatenate the basic sounds (also referred to as phones) of a language to produce a speech waveform. But natural speech consists of co-articulation, i.e., the effect of coupling two sounds together, and prosody at the syllable, word, sentence and discourse level, which cannot be synthesised by simple concatenation of phones. Another method often employed is to store a huge dictionary of the most common words. However, such a method cannot synthesise the millions of names and acronyms which are not in the dictionary. It also cannot generate appropriate intonation and duration for words in different contexts. Thus a text-to-speech approach using phones provides flexibility but cannot produce intelligible and natural speech, while word-level concatenation produces intelligible and natural speech but is not flexible. In order to balance flexibility against intelligibility and naturalness, sub-word units such as diphones, which capture the essential coarticulation between adjacent phones, are used as suitable units in a text-to-speech system.
2.1 Components of a text-to-speech system
A typical architecture of a Text-to-Speech (TTS) system is shown in the figure below. The components of a text-to-speech system can be broadly categorized as text processing and methods of speech generation.
Text processing: In the real world, the typical input to a text-to-speech system is text as available in electronic documents, newspapers, blogs, emails etc. The text available in the real world is anything but a sequence of words available in a standard dictionary. The text contains several non-standard words such as numbers, abbreviations, homographs and symbols built using punctuation characters, such as the exclamation mark (!) and smileys such as :-). The goal of the text processing module is to process the input text, normalize the non-standard words, predict the prosodic pauses and generate the appropriate phone sequences for each of the words.
2.2 Normalization of non-standard words
The text in the real world consists of words whose pronunciation is typically not found in dictionaries or lexicons, such as "IBM", "CMU" and "MSN". Such words are referred to as non-standard words (NSW). The various categories of NSW are:
1. Numbers whose pronunciation changes depending on whether they refer to currency, time, telephone numbers, zip code etc.
2. Abbreviations, contractions and acronyms such as ABC, US, approx., Ctrl-C, lb.
3. Punctuation such as 3-4, +/-, and/or.
4. Dates, time, units and URLs.
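As a rough illustration of NSW normalization, the Python sketch below expands a few abbreviations and reads digit strings one digit at a time, as one might for a telephone number or zip code. It is a toy example under stated assumptions: the abbreviation table, the function name normalize and the digit-by-digit reading are all hypothetical, and a real system would first classify each number (currency, time, date and so on) before choosing how to expand it.

```python
import re

# Hypothetical expansion table; a real system needs a much larger, language-specific one.
ABBREVIATIONS = {"approx.": "approximately", "lb.": "pound"}
DIGIT_NAMES = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]

def normalize(text):
    """Expand known abbreviations and read digit strings one digit at a time."""
    words = []
    for token in text.split():
        if token in ABBREVIATIONS:
            words.append(ABBREVIATIONS[token])
        elif re.fullmatch(r"[0-9]+", token):
            # Digit-by-digit reading, as for a telephone number or zip code.
            words.extend(DIGIT_NAMES[int(d)] for d in token)
        else:
            words.append(token)
    return " ".join(words)

print(normalize("pin 600036 approx. 2 lb."))
# pin six zero zero zero three six approximately two pound
```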
2.3 Grapheme-to-phoneme conversion
Given the sequence of words, the next step is to generate a sequence of phones. For languages such as Spanish, Telugu, Kannada, where there is a good correspondence between what is written and what is spoken, a set of simple rules may often suffice. For languages such as English where the relationship between the orthography and pronunciation is complex, a standard pronunciation dictionary such as CMU-DICT is used. To handle unseen words, a grapheme-to-phoneme generator is built using machine learning techniques.
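The two strategies above (dictionary lookup, and rules for words that are not in the dictionary) can be combined as in the following Python sketch. The lexicon entry, the letter-to-phone table and the function name to_phones are hypothetical and exist only for illustration. The fallback here is a simple one-letter-one-phone rule set, which stands in for the machine-learned grapheme-to-phoneme generator mentioned above.

```python
# Hypothetical lexicon and letter rules, for illustration only.
LEXICON = {"speech": ["s", "p", "iy", "ch"]}
LETTER_TO_PHONE = {"k": "k", "m": "m", "l": "l", "a": "a", "i": "i", "u": "u"}

def to_phones(word):
    """Look the word up in the lexicon; fall back to letter-to-phone rules."""
    word = word.lower()
    if word in LEXICON:
        return list(LEXICON[word])
    # Fallback: one phone per letter; letters without a rule are skipped.
    return [LETTER_TO_PHONE[ch] for ch in word if ch in LETTER_TO_PHONE]

print(to_phones("speech"))  # found in the lexicon
print(to_phones("kamala"))  # built from the letter rules
```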
2.4 Prosodic analysis
Prosodic analysis deals with the modeling and generation of appropriate duration and intonation contours for the given text. This is inherently difficult since prosody is absent in text. For example, the sentences "where are you going?", "where are you GOING?" and "where are YOU going?" have the same text content but can be uttered with different intonation and duration to convey different meanings. To predict appropriate duration and intonation, the input text needs to be analyzed. This can be performed by a variety of algorithms, including simple rules, example-based techniques and machine learning algorithms. The generated duration and intonation contour can be used to manipulate the context-insensitive diphones in diphone-based synthesis or to select an appropriate unit in unit selection voices.
2.5 Methods of speech generation
The methods of conversion of a phone sequence to a speech waveform can be categorized into parametric, concatenative and statistical parametric synthesis.
2.5.1 Parametric synthesis
Parameters such as formants and linear prediction coefficients are extracted from the speech signal of each phone unit. These parameters are modified during synthesis time to incorporate the co-articulation and prosody of a natural speech signal. The required modifications are specified in terms of rules which are derived manually from observations of speech data. These rules cover duration, intonation, co-articulation and the excitation function. Examples of early parametric synthesis systems are Klatt's formant synthesizer and MITalk.
2.5.2 Concatenative synthesis
Derivation of rules in parametric synthesis is a laborious task. Also, the quality of synthesized speech using traditional parametric synthesis is found to be robotic. This has led to the development of concatenative synthesis, where examples of speech units are stored and used during synthesis. Concatenative synthesis is based on the concatenation (or stringing together) of segments of recorded speech. Generally, concatenative synthesis produces the most natural-sounding synthesized speech. However, differences between natural variations in speech and the nature of the automated techniques for segmenting the waveforms sometimes result in audible glitches in the output. There are three main sub-types of concatenative synthesis.
1. Unit selection synthesis - Unit selection synthesis uses large databases of recorded speech. During database creation, each recorded utterance is segmented into some or all of the following: individual phones, diphones, half-phones, syllables, morphemes, words, phrases, and sentences. Typically, the division into segments is done using a specially modified speech recognizer set to a "forced alignment" mode with some manual correction afterward, using visual representations such as the waveform and spectrogram. An index of the units in the speech database is then created based on the segmentation and acoustic parameters like the fundamental frequency (pitch), duration, position in the syllable, and neighboring phones. At run time, the desired target utterance is created by determining the best chain of candidate units from the database (unit selection). This process is typically achieved using a specially weighted decision tree.
Unit selection provides the greatest naturalness, because it applies only a small amount of digital signal processing (DSP) to the recorded speech. DSP often makes recorded speech sound less natural, although some systems use a small amount of signal processing at the point of concatenation to smooth the waveform. The output from the best unit-selection systems is often indistinguishable from real human voices, especially in contexts for which the TTS system has been tuned. However, maximum naturalness typically requires unit-selection speech databases to be very large, in some systems ranging into the gigabytes of recorded data, representing dozens of hours of speech. Also, unit selection algorithms have been known to select segments from a place that results in less than ideal synthesis (e.g. minor words become unclear) even when a better choice exists in the database. Recently, researchers have proposed various automated methods to detect unnatural segments in unit-selection speech synthesis systems.
2. Diphone synthesis - Diphone synthesis uses a minimal speech database containing all the diphones (sound-to-sound transitions) occurring in a language. The number of diphones depends on the phonotactics of the language: for example, Spanish has about 800 diphones, and German about 2500. In diphone synthesis, only one example of each diphone is contained in the speech database. At runtime, the target prosody of a sentence is superimposed on these minimal units by means of digital signal processing techniques such as linear predictive coding, PSOLA or MBROLA. Diphone synthesis suffers from the sonic glitches of concatenative synthesis and the robotic-sounding nature of formant synthesis, and has few of the advantages of either approach other than small size. As such, its use in commercial applications is declining, although it continues to be used in research because there are a number of freely available software implementations.
3. Domain-specific synthesis - Domain-specific synthesis concatenates prerecorded words and phrases to create complete utterances. It is used in applications where the variety of texts the system will output is limited to a particular domain, like transit schedule announcements or weather reports. The technology is very simple to implement, and has been in commercial use for a long time, in devices like talking clocks and calculators. The level of naturalness of these systems can be very high because the variety of sentence types is limited, and they closely match the prosody and intonation of the original recordings. Because these systems are limited by the words and phrases in their databases, they are not general-purpose and can only synthesize the combinations of words and phrases with which they have been preprogrammed. The blending of words within naturally spoken language, however, can still cause problems unless the many variations are taken into account.
For example, in non-rhotic dialects of English the "r" in words like "clear" /ˈklɪə/ is usually only pronounced when the following word has a vowel as its first letter (e.g. "clear out" is realized as /ˌklɪəɹˈʌʊt/). Likewise in French, many final consonants become no longer silent if followed by a word that begins with a vowel, an effect called liaison. This alternation cannot be reproduced by a simple word-concatenation system, which would require additional complexity to be context-sensitive.
The speech units used in concatenative synthesis are typically at the diphone level so that the natural co-articulation is retained. Duration and intonation are derived either manually or automatically from the data and are incorporated during synthesis time. Examples of diphone synthesizers are Festival diphone synthesis and MBROLA. The possibility of storing more than one example of a diphone unit, due to the increase in storage and computation capabilities, has led to the development of unit selection synthesis. Multiple examples of a unit, along with the relevant linguistic and phonetic context, are stored and used in unit selection synthesis. The quality of unit selection synthesis is found to be more natural than that of diphone and parametric synthesis. However, unit selection synthesis lacks consistency, i.e., the quality of its output can vary.
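Choosing the best chain of candidate units, as described for unit selection synthesis above, can be sketched as a dynamic-programming search that minimizes the sum of a target cost (how well a unit matches the desired specification) and a join cost (how smoothly two adjacent units concatenate). The Python sketch below is a generic illustration and is not Festival's cluster unit selection code. The names select_units, target_cost and join_cost are hypothetical, every target is assumed to have at least one candidate, and the example costs compare only a single pitch value per unit.

```python
def select_units(targets, candidates, target_cost, join_cost):
    """Return the lowest-cost chain of units, one candidate per target."""
    # best[i][j] = (total cost of the best chain ending in candidate j of target i,
    #               index of the previous candidate on that chain)
    best = [[(target_cost(targets[0], c), None) for c in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for cand in candidates[i]:
            cost, back = min(
                (best[i - 1][k][0] + join_cost(prev, cand), k)
                for k, prev in enumerate(candidates[i - 1])
            )
            row.append((cost + target_cost(targets[i], cand), back))
        best.append(row)
    # Trace back from the cheapest final candidate.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    chain = []
    for i in range(len(targets) - 1, -1, -1):
        chain.append(candidates[i][j])
        j = best[i][j][1]
    return chain[::-1]

# Toy example: each unit is (syllable, pitch in Hz); each target is a desired pitch.
targets = [100, 120]
candidates = [[("ka", 100), ("ka", 140)], [("ma", 118), ("ma", 150)]]
chain = select_units(
    targets, candidates,
    target_cost=lambda t, unit: abs(t - unit[1]),
    join_cost=lambda a, b: abs(a[1] - b[1]),
)
print(chain)  # [('ka', 100), ('ma', 118)]
```

Storing a back-pointer per candidate keeps the search linear in the number of targets, rather than enumerating every possible chain.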
2.6 Primary components of the TTS framework
1. Speech Engine - One of the most widely used speech engines is eSpeak. eSpeak uses the "formant synthesis" method, which allows many languages to be provided with a small footprint. The speech synthesized is intelligible and provides quick responses, but lacks naturalness. The demand is for a high quality, natural sounding TTS system. We have used the Festival speech synthesis system developed at The Centre for Speech Technology Research, University of Edinburgh, which provides a framework for building speech synthesis systems and offers full text to speech support through a number of APIs. A large corpus based unit selection paradigm has been employed. This paradigm is known to produce intelligible, natural sounding speech output, but has a larger footprint.
2. Screen Readers - The role of a screen reader is to identify and interpret what is being displayed on the screen and transfer it to the speech engine for synthesis. JAWS is the most popular screen reader used worldwide for Microsoft Windows based systems. But the main drawback of this software is its high cost, approximately 1300 USD, whereas the average per capita income in India is 1045 USD. Different open source screen readers are freely available. We chose ORCA for Linux based systems and NVDA for Windows based systems. ORCA is a flexible screen reader that provides access to the graphical desktop via user-customizable combinations of speech, braille and magnification. ORCA supports the Festival GNOME speech synthesizer and comes bundled with popular Linux distributions like Ubuntu and Fedora. NVDA is a free screen reader which enables vision impaired people to access computers running Windows. NVDA is popular among the members of the AccessIndia community. AccessIndia is a mailing list which provides an opportunity for visually impaired computer users in India to exchange information as well as conduct discussions related to assistive technology and other accessibility issues. NVDA has already been integrated with the Festival speech engine by Olga Yakovleva.
3. Typing tool for Indian Languages - The typing tools map the qwerty keyboard to Indian language characters. Widely used tools to input data in Indian languages are the Smart Common Input Method (SCIM) and the inbuilt InScript keyboard, for Linux and Windows systems respectively. The same tools have been used for our TTS systems as well.
2.7 Screen readers for the visually challenged
India is home to the world's largest visually challenged (VC) population. In today's digital world, disability is equated to inability. Little attention is paid to people with disabilities, and social inclusion and acceptance is always a challenge. The perceived inability of people with disability, the perceived cost of special education and attitudes towards inclusive education are major constraints for effective delivery of education. Education is THE means of developing the capabilities of people with disability, to enable them to develop their potential, become self sufficient, escape poverty and provide a means of entry to fields previously denied to them. The aim of this project is to make a difference in the lives of VC persons. VC persons need to depend on others to access common information that others take for granted, such as newspapers, bank statements, and scholastic transcripts. Assistive technologies (AT) are necessary to enable physically challenged persons to become part of the mainstream of society. A screen reader is an assistive technology that helps people who are visually challenged, visually impaired, illiterate or learning disabled to access standard computer software, such as word processors, spreadsheets, email and the Internet.
Before the start of this project, Indian Institute of Technology, Madras (IIT Madras) had been conducting a training programme for visually challenged people to enable them to use the computer using the screen reader JAWS, with English as the language. Although the VC persons have benefited from this programme, most of them felt that:
• The English accent was difficult to understand.
• Most students would have preferred a reader in their native language.
• They would prefer English spoken in an Indian accent.
• The price for the individual purchase of JAWS was very high.
Against this backdrop, it was felt imperative to build assistive technologies in the vernacular. An initiative was taken by DIT, Ministry of Information Technology, to:
1. Sponsor the development of natural sounding text-to-speech synthesis systems in different Indian languages.
2. Ensure that the TTS systems are also integrated with open source screen readers.
3 Overall Picture
1. Data Collection - Text crawled from a news site and a site for stories for children.
2. Cleaning up of Data - From the crawled data, sentences were picked to maximize syllable coverage.
3. Recording - The sentences that were picked were then recorded in a studio which was a completely noise-free environment.
4. Labeling - The wavefiles were then manually labeled using the semi-automatic labeling tool to get accurate syllable boundaries.
5. Training - Using the wavefiles and their transcriptions, the Indian language unit selection voice was built.
6. Testing - Using the voice built, a MOS test was conducted with visually challenged end users as the evaluators.
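The source does not say how sentences were picked to maximize syllable coverage. One common way to approximate it is a greedy selection that repeatedly takes the sentence contributing the most syllables not yet covered. The Python sketch below shows that idea under stated assumptions: the greedy strategy, the function name pick_sentences and the input format (a mapping from sentence id to its list of syllables) are illustrative choices and are not taken from the consortium's scripts.

```python
def pick_sentences(sentences, max_sentences):
    """Greedily choose sentences that add the most uncovered syllables.

    sentences: dict mapping a sentence id to the list of syllables it contains.
    Returns (chosen ids in selection order, set of covered syllables).
    """
    covered = set()
    chosen = []
    remaining = dict(sentences)
    while remaining and len(chosen) < max_sentences:
        # Sentence with the largest number of new syllables (first one wins ties).
        best = max(remaining, key=lambda s: len(set(remaining[s]) - covered))
        gain = set(remaining[best]) - covered
        if not gain:
            break  # nothing left adds new syllables
        chosen.append(best)
        covered |= gain
        del remaining[best]
    return chosen, covered

corpus = {"s1": ["ka", "ma", "la"], "s2": ["ka", "ma"], "s3": ["ti", "ka"]}
print(pick_sentences(corpus, 2)[0])  # ['s1', 's3']
```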
4 Labeling Tool
It is a widely accepted fact that the accuracy of the labeling of speech files has a great bearing on the quality of unit selection synthesis. Manual labeling is a time consuming and daunting task. It is also not trivial to label waveforms manually at the syllable level. The DONLabel labeling tool provides an automatic way of performing labeling, given an input waveform and the corresponding text in utf8 format. The tool makes use of group delay based segmentation to provide the segment boundaries. The size of the segment labels generated can vary from monosyllables to polysyllables as the Window Scale Factor (WSF) parameter is varied from small to large values. Our labeling process makes use of:
• The Ergodic HMM (EHMM) labeling procedure provided by Festival
• The group delay based algorithm (GD)
• The Vowel Onset Point (VOP) detection algorithm
The labeling tool displays one panel showing the segment boundaries estimated by the group delay algorithm, another panel showing the segment boundaries estimated by the EHMM process, and a panel for VOP, which shows how many vowel onset points are present within each segment provided by the group delay algorithm. Comparing the labeling outputs of the EHMM process and the VOP algorithm helps greatly in adjusting the labels provided by the group delay algorithm, if necessary. By using VOP as an additional cue, manual intervention during the labeling process can be eliminated. It also improves the accuracy of the labels generated by the labeling tool.
The tool works for 6 different Indian languages, namely
• Hindi
• Tamil
• Malayalam
• Marathi
• Telugu
• Bengali
The tool also displays the text (utf8) in segmented format along with the speech file.
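The VOP panel described above amounts to counting vowel onset points inside each group delay segment. The Python sketch below shows one way to compute those counts. It is not the labeling tool's own code: the function name vops_per_segment is hypothetical, boundaries and VOPs are assumed to be times in seconds, and boundaries are assumed to be sorted. Reading a count other than one as a hint that a segment holds more or less than a single syllable is a heuristic, not a rule stated by the tool.

```python
from bisect import bisect_left

def vops_per_segment(boundaries, vops):
    """Count vowel onset points falling in each [start, end) segment.

    boundaries: sorted segment boundary times from the group delay algorithm.
    vops: vowel onset point times (any order).
    """
    vops = sorted(vops)
    counts = []
    for start, end in zip(boundaries, boundaries[1:]):
        counts.append(bisect_left(vops, end) - bisect_left(vops, start))
    return counts

print(vops_per_segment([0.0, 0.2, 0.5, 0.9], [0.05, 0.25, 0.4, 0.6]))  # [1, 2, 1]
```

In this example the middle segment contains two vowel onsets, so it is a candidate for an extra boundary.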
4.1 How to Install LabelingTool
1. Copy the html folder to /var/www folder. If www folder is not there in /var, create a folder named www and extract the html folder into it. So we have the labelingTool code in /var/www/html/labelingTool/
2. Install the Java compiler using the following command:
sudo apt-get install sun-java6-jdk
The following error may appear:
Reading package lists... Done
Building dependency tree
Reading state information... Done
Package sun-java6-jdk is not available, but is referred to by another package.
This may mean that the package is missing, has been obsoleted, or is only available from another source
E: Package 'sun-java6-jdk' has no installation candidate
Installing the JRE with
sudo apt-get install sun-java6-jre
may give a similar error:
Reading package lists... Done
Building dependency tree
Reading state information... Done
Package sun-java6-jre is not available, but is referred to by another package.
This may mean that the package is missing, has been obsoleted, or is only available from another source
E: Package 'sun-java6-jre' has no installation candidate
One solution is:
sudo add-apt-repository "deb http://archive.canonical.com/ lucid partner"
sudo add-apt-repository "deb http://ftp.debian.org/debian squeeze main contrib non-free"
sudo add-apt-repository "deb http://ppa.launchpad.net/chromium-daily/ppa/ubuntu/ lucid main"
sudo add-apt-repository "deb http://ppa.launchpad.net/flexiondotorg/java/ubuntu/ lucid main"
sudo apt-get update
The other solution is:
For Ubuntu 10.04 LTS, the sun-java6 packages have been dropped from the Multiverse section of the Ubuntu archive. It is recommended that you use openjdk-6 instead.
If you cannot switch from the proprietary Sun JDK/JRE to OpenJDK, you can install the sun-java6 packages from the Canonical Partner Repository. You can configure your system to use this repository via the command line:
sudo add-apt-repository "deb http://archive.canonical.com/ lucid partner"
sudo apt-get update
sudo apt-get install sun-java6-jre sun-java6-plugin
sudo apt-get install sun-java6-jdk
sudo update-alternatives --config java
For Ubuntu 10.10, the sun-java6 packages have been dropped from the Multiverse section of the Ubuntu archive. It is recommended that you use openjdk-6 instead.
If you cannot switch from the proprietary Sun JDK/JRE to OpenJDK, you can install the sun-java6 packages from the Canonical Partner Repository. You can configure your system to use this repository via the command line:
sudo add-apt-repository "deb http://archive.canonical.com/ maverick partner"
sudo apt-get update
sudo apt-get install sun-java6-jre sun-java6-plugin
sudo apt-get install sun-java6-jdk
If the above does not work (for other versions of Ubuntu), you can create a local repository as follows:
cd ~/
wget https://github.com/flexiondotorg/oab-java6/raw/0.2.1/oab-java6.sh -O oab-java6.sh
chmod +x oab-java6.sh
sudo ./oab-java6.sh
and then run:
sudo apt-get install sun-java6-jdk
sudo apt-get install sun-java6-jre
Source:
https://github.com/flexiondotorg/oab-java6/blob/a04949f242777eb040150e53f4dbcd4a3ccb7568/README.rst
3. Install PHP using the following command:
sudo apt-get install php5
4. Install apache2 using the following command:
sudo apt-get install apache2
Update the paths in the following file:
/etc/apache2/sites-available/default
Set all paths of cgi-bin to "/var/www/html/cgi-bin". A sample default file is attached.
5. Install speech-tools using the following command:
sudo apt-get install speech-tools
6. Install tcsh using the following command:
sudo apt-get install tcsh
7. Enable JavaScript in the properties of the browser used. Use Google Chrome or Mozilla Firefox.
8. Install the Java plugin for the browser:
sudo apt-get install sun-java6-plugin
Create a symbolic link to the Java plugin libnpjp2.so using the following commands:
sudo ln -s /usr/lib/jvm/java-6-sun/jre/plugin/i386/libnpjp2.so /etc/alternatives/mozilla-javaplugin.so
sudo ln -s /etc/alternatives/mozilla-javaplugin.so /usr/lib/mozilla/plugins/libnpjp2.so
9. Give full permissions to the html folder:
sudo chmod -R 777 html/
10. Update the java.policy file so that it contains the following:
grant {
permission java.security.AllPermission;
};
11. In the /var/www/html/labelingTool/jsrc/install file, make sure that the correct path of javac is provided as per your installation. For example:
/usr/lib/jvm/java-6-sun-1.6.0.26/bin/javac
The version of Java is 1.6.0.26 here; it might be different in your installation. Check the path and give the correct values.
12. Install the tool. Go to /var/www/html/labelingTool/jsrc and run the following command:
sudo ./install
It might give the following output, which is not an error:
Note: LabelingTool.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
13. Restart apache using the following command:
sudo /etc/init.d/apache2 restart
14. Check whether Java applets are enabled in the browser by using the following link:
http://javatester.org/enabled.html
In that webpage, the LIVE box should display "This web browser can indeed run Java applets". Wait for some time for the display to appear.
If it displays "This web browser can NOT run Java applets", there is an issue with the Java applets. Look up how to enable Java in your version of the browser and fix the issue.
15. Replace the Pronunciation_Rules.pl file in the /var/www/html/labelingTool folder with your language-specific code (the name should remain the same: Pronunciation_Rules.pl).
16. Open the browser and go to the following link:
http://localhost/main.php
NOTE: The VOP algorithm is not used in the current version of the labelingTool, so please ignore anything related to VOP in the sections below.
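After working through the installation steps, it can help to confirm that the required programs are reachable. The Python sketch below is an optional convenience and not part of the labeling tool. The binary names in REQUIRED_TOOLS are assumptions about what the packages above install (for example, php for the php5 package and ch_wave for speech-tools), and apache2 may live in a directory such as /usr/sbin that is not on every user's PATH.

```python
import shutil

# Assumed binary names for the packages installed in the steps above.
REQUIRED_TOOLS = ["javac", "php", "apache2", "tcsh", "ch_wave"]

def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools that cannot be found on the PATH."""
    return [tool for tool in tools if which(tool) is None]

if __name__ == "__main__":
    missing = missing_tools()
    print("Missing:", ", ".join(missing) if missing else "none")
```

The which parameter exists only so the lookup can be replaced when testing. In normal use the default shutil.which is sufficient.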
4.2 Troubleshooting of LabelingTool
1. When the LabelingTool is working correctly, the following files are generated in the labelingTool/results folder: boundary segments spec low vop wav sig gd spectrum low segments indicator tmp.seg
2. When the boundaries are manually updated (deleted, added or moved) and saved, two more files are created in the results folder:
ind upd
segments updated
3. If the vopUpdate button is clicked after manually updating and saving, another new file is created in the results folder:
vop updated
4. If a file named 'vop' is not generated in the labelingTool/results folder and the labelit.php page gets stuck, you need to compile the vop module.
Follow the steps below.
(a) cd /var/www/html/labelingTool/Vop-Lab
(b) make -f MakeEse clean
(c) make -f MakeEse
(d) cd bin
(e) cp Multiple_Vopd ../../bin/
5. If the above files are not being created, try running the programs from the command line as follows. Execute them from the /var/www/html/labelingTool/bin folder.
• The command line usage of the WordsWithSilenceRemoval program is as follows:
WordsWithSilenceRemoval ctrlFile waveFile sigFile spectraFile boundaryFile thresSilence(ms) thresVoiced(ms)
Example:
./WordsWithSilenceRemoval fe-words2.base /home/text 0001.wav ../results/spec ../results/boun 100 100
Two files named spec and boun should be generated in the results folder. If they are not created, try recompiling:
cd /var/www/html/labelingTool/Segmentation
make -f MakeWordsWithSilenceRemoval clean
make -f MakeWordsWithSilenceRemoval
cp bin/WordsWithSilenceRemoval /var/www/html/labelingTool/bin/
• The command line usage of the Multiple_Vopd program is as follows:
Multiple_Vopd ctrlFile waveFile(in sphere format) segments_file vopfile
Example:
./Multiple_Vopd fe-ctrl.ese ../results/wav ../results/segments ../results/vop
On running the Multiple_Vopd binary, a file 'vop' should be generated in the results folder.
6. If the file 'wav' is not produced in the results folder, speech tools are not installed.
How to check if speech tools are installed: after installing speech tools, check that the following command works:
ch_wave -info <wave file name>
This command should give information about that wave file.
If speech tools was installed along with Festival and there is no link to it in /usr/bin, make a link in the /usr/bin folder that points to the ch_wave binary file.
7. How to check if tcsh is installed:
Type the command 'tcsh' and a new prompt will appear.
8. Provide full permissions to the labelingTool folder and its subfolders so that new files can be created and updated without any permission issues. If required, the following commands can be used in the labelingTool folder:
chmod -R 777 *
chown -R root:root *
9. The java.policy file should be updated as specified in the installation steps, otherwise it may result in the error "Error writing Lab File".
10. When the lab file is viewed in the browser, if utf8 text is not displayed correctly, set the browser's character encoding to utf8:
Tools -> Options -> Content -> Fonts and Colors (Advanced menu) -> Default character encoding (utf8)
Restart the browser.
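Several of the checks above come down to asking whether a given file exists in the results folder. The Python sketch below automates that check. It is a convenience sketch rather than part of the tool: the function name missing_results is hypothetical, and the default list contains only the 'wav' and 'vop' files named explicitly in the troubleshooting steps. Pass a different list of names for other files.

```python
import os

def missing_results(results_dir, expected=("wav", "vop")):
    """Return the expected result files that are absent from results_dir."""
    return [name for name in expected
            if not os.path.isfile(os.path.join(results_dir, name))]

if __name__ == "__main__":
    # Path taken from the installation steps; adjust if the tool lives elsewhere.
    print(missing_results("/var/www/html/labelingTool/results"))
```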
5 Labeling Tool User Manual
5.1 How To Use Labeling Tool
The front page of the tool can be opened using the URL http://localhost/main.php. A screen shot of the front page is shown below.
The front page has the following fields
• The speech file in wav format should be provided. It can be browsed using the browse button.
• The corresponding utf8 text has to be provided in the text file. It can be browsed using the browse button. The text file that is uploaded should not have any special characters.
• The ehmm lab file generated by Festival while building the voice can be provided as input. This is an optional field.
• The gd lab file generated by the labeling tool in a previous attempt to label the same file can be provided as input. This is an optional field. If the user had labelled a file half way and saved the lab file, it can be provided here so as to label the rest of it or to correct the labels.
• The threshold for voiced segments has to be provided in the text box. It varies for each wav file. The value is in milliseconds (e.g. 100, 200, 50).
• The threshold for unvoiced segments has to be provided in the text box. It varies for each wav file. The value is in milliseconds (e.g. 100, 200, 50). If the speech file has very long silences, a high value can be provided as the threshold.
• WSF (window scale factor) can be selected from the drop down list. The default value is 5. Depending on the output, the user will need to change the WSF value to find the one that provides the best possible segmentation for the speech file.
• Submit the details to the tool using the submit button.
A screen shot of the filled up front page is given below.
Loading Page
On clicking ’submit’ button on the front page the following page will be displayed.
Validation for data entered
• If all files were loaded successfully and proper values were given for the thresholds on the front page, the message Click View to see the results... will be displayed as shown above.
• If the wave file was not provided on the front page, the following error appears on the loading page: Error uploading wav file. Wav file must be entered
• If the text file was not provided on the front page, the following error appears on the loading page: Error uploading text file. Text file must be entered
• If the threshold for voiced segments was not provided on the front page, the following error appears on the loading page: Threshold for voiced segments must be entered
• If the threshold for unvoiced segments was not provided on the front page, the following error appears on the loading page: Threshold for unvoiced segments must be entered
• If a non-numeric value is entered for the threshold of voiced or unvoiced segments on the front page, the following error appears on the loading page: Numeric value must be entered for thresholds
• The wav file loaded will be copied to the /var/www/html/UploadDir folder as text.wav
• The lab file (ehmm label file) will be copied to the /var/www/html/labelingTool/lab folder as temp.lab. If an error occurs while moving it to the lab folder, the following error will be displayed: Error moving lab file.
• The gd lab file (group delay label file) will be copied to the /var/www/html/labelingTool/results folder with the name gd_lab. If an error occurs while moving it to that folder, the following error will be displayed: Error moving gdlab file.
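The validation rules listed above can be summarised in a few lines of code. The following is only a minimal Python sketch of those checks, not the tool's actual implementation (the tool itself is served from main.php); the function names and the convention that an empty string means "not provided" are assumptions made for illustration. The error strings are the ones quoted above.

```python
def is_number(text):
    """Return True if text parses as a number (an illustrative helper)."""
    try:
        float(text)
    except ValueError:
        return False
    return True


def validate_front_page(wav_name, text_name, voiced_thr, unvoiced_thr):
    """Return the error messages for one front-page submission.

    All arguments are raw form strings; an empty string means the field
    was left blank. This mirrors the checks described in the manual.
    """
    errors = []
    if not wav_name:
        errors.append("Error uploading wav file. Wav file must be entered")
    if not text_name:
        errors.append("Error uploading text file. Text file must be entered")
    if not voiced_thr:
        errors.append("Threshold for voiced segments must be entered")
    if not unvoiced_thr:
        errors.append("Threshold for unvoiced segments must be entered")
    # The numeric check only applies to thresholds that were supplied.
    supplied = [t for t in (voiced_thr, unvoiced_thr) if t]
    if any(not is_number(t) for t in supplied):
        errors.append("Numeric value must be entered for thresholds")
    return errors
```

An empty list corresponds to the case in which the loading page shows the "Click View to see the results..." message.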
• The Labelit Page
On clicking view button on the loading page the labelit page will be loaded. A screenshot of this page along with the markings for each panel is given below.
Note: If the error message Error reading file 'http://localhost/labelingTool/tmp/temp.wav' appears, it means that some other file (e.g. a text file) was uploaded in place of the wav file.
• Panels on the Labelit Page
It has 6 main panels:
– EHMM Panel displays the lab files generated by festival using the EHMM algorithm while building voices.
– Wave Panel displays the speech waveform in segmented format. (Note: the waveform does not appear exactly as it does in WaveSurfer. This is because of limitations in Java.)
– Text Panel displays the segmented text (in utf8 format) with the syllable as the basic unit.
– GD Panel draws the group delay curve, which is the result of the group delay algorithm. Wherever a peak appears is considered to be a segment boundary.
– VOP Panel shows the number of vowel onset points found between each pair of segment boundaries provided by group delay. Green corresponds to one vowel onset point, meaning the segment boundary found by the group delay algorithm is correct. Red corresponds to zero vowel onset points, meaning the segment boundary found by the group delay algorithm is wrong and needs to be deleted. Yellow corresponds to more than one vowel onset point, meaning that between the two boundaries found by the group delay algorithm there should be one or more additional boundaries.
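The colour coding of the VOP panel amounts to counting vowel onset points inside each group delay segment. The sketch below illustrates that rule in Python under stated assumptions: boundaries and vowel onset points are given as times in seconds, a segment is taken as the half-open interval between two consecutive boundaries, and the function names are hypothetical. It is not the tool's own code.

```python
from bisect import bisect_left


def vop_counts(boundaries, vops):
    """Count vowel onset points inside each segment [b[i], b[i+1])."""
    vops = sorted(vops)
    counts = []
    for start, end in zip(boundaries, boundaries[1:]):
        counts.append(bisect_left(vops, end) - bisect_left(vops, start))
    return counts


def vop_colour(count):
    """Map a VOP count to the colour described for the VOP panel."""
    if count < 0:
        raise ValueError("count must be non-negative")
    if count == 0:
        return "red"      # no vowel onset: a boundary should be deleted
    if count == 1:
        return "green"    # exactly one vowel onset: boundary is correct
    return "yellow"       # several vowel onsets: a boundary is missing


def panel_colours(boundaries, vops):
    """Colour of every segment, in order."""
    return [vop_colour(c) for c in vop_counts(boundaries, vops)]
```

For example, `panel_colours([0.0, 0.2, 0.25, 0.6], [0.1, 0.3, 0.5])` marks the first segment green, the short second segment red and the third segment yellow.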
• Resegment
The WSF selected for this example is '5'. A different wsf will give a different set of boundaries: the smaller the wsf, the greater the number of boundaries, and vice versa. To experiment with different wsf values, select the WSF from the drop-down list and click 'RESEGMENT'. A screenshot for the same text (as in the above figure) with a greater wsf selected is shown below.
The above figure shows the segmentation using wsf = 12. It gives fewer boundaries. The figure below shows the same waveform with a smaller wsf (wsf = 3). It gives more boundaries.
So the ideal wsf for the waveform has to be found. An easy way is to check that the text segments reach approximately the end of the waveform (no text segments missing, and not many segments without text).
• Menu Bar
The menu bar is just above the EHMM Panel, with the heading 'Waveform'. The menu bar contains the following buttons, in this order from left to right:
– Save button: The lab file can be saved using the save button. After making any changes to the segments (deletion, addition or dragging), the save button has to be clicked if the changes are to be kept.
– Play the waveform: The entire wave file will be played on pressing this button.
– Play the selection: Select some portion of the waveform (say a segment) and play just that part using this button. This button can be used to verify each segment.
– Play from selection: Plays the waveform starting from the current selection to the end. Click the mouse on the waveform and a yellow line will appear to show the selection. On clicking this button, the file will be played from that selected point to the end.
– Play to selection: Plays the waveform from the beginning to the end of the current selection.
– Stop the playback: Stops the playing of the wave file.
– Zoom to fit: Displays the selected portion of the wave zoomed in.
– Zoom 1: Displays the entire wave.
– Zoom in: Zooms in on the wave.
– Zoom out: Zooms out on the wave.
– Update VOP Panel: After changing the segments (dragging, adding or deleting), clicking this button recalculates the VOP algorithm on the new set of segments. After making the changes, the save button must be pressed before updating the VOP panel.
Some screenshots are given below to demonstrate the use of the menu bar.
The figure below shows how to select a portion of the waveform (drag using the mouse on the wave panel) and play that part. The selected portion appears shaded in yellow.
The figure below shows how to select a point (click using the mouse on the wave panel) and play from the selection to the end of the file. The selected point appears as a yellow line.
The next figure shows how to select a portion of the wave and zoom to fit.
The next figure shows how the portion of the wave file selected in the above figure appears when zoomed.
5.2 How to do label correction using Labeling tool
Each segment given by the group delay algorithm can be listened to in order to decide, with the help of the VOP and EHMM panels, whether the segment is correct and whether it matches the text segmentation.
• Deletion of a Segment
All the segments appear as red lines in the labeling tool output. A segment can be deleted by right clicking on that particular segment on the slider panel. The figure below shows the original output of the labeling tool for the Hindi wave file.
The third and fourth segments are very close to each other and one has to be deleted. Ideally we delete the fourth one. The VOP panel shows red (an indication to delete one) for that segment. The user can decide whether to delete the boundary to the right or to the left of the red segment after listening. On deletion of the fourth segment (right click on the slider panel on that segment head), the text segments shift and fit after the silence segment, as shown in the figure below.
On listening to each segment it is seen that the segment between two of the syllables is wrong. It has to be deleted. The VOP panel gives red for the segment and the corresponding peak in the group delay panel is below the threshold. Peaks below the threshold in the group delay curve are usually not segment boundaries, but sometimes the algorithm computes one as a boundary. The threshold value in the GD panel is the middle line in magenta.
There are 2 more red columns in the VOP panel. The last one is correct and we have to delete a segment. The second-last red column is incorrect and GD gives the correct segment, hence it need not be deleted. VOP is always used as a reference for the GD algorithm; it can be wrong in some cases. Yellow in the VOP panel usually means that a new segment should be added, but here the yellow appears in the silence region and we ignore it.
On completing the label corrections, the save button has to be pressed. On clicking the Save button a dialog box appears with the message: Lab File Saved Click Next to Continue
A silence segment gets deleted on clicking the right boundary of the silence segment.
• Update VOP Panel
After saving the changes made to the labels, the VOP update button has to be clicked to rerun the VOP algorithm on the new segments. The updated output is shown in the figure below.
• Adding A Segment
A segment can be added by right clicking with the mouse on the slider panel at the point where the segment needs to be added. The figure below shows a case in which a segment needs to be added.
The VOP panel shows three yellow columns here, of which the second is true. The GD plot shows a small peak in that segment, so we can be sure that the segment has to be added at that peak. In the above figure it can be seen that the mouse is placed on the slider panel at the location where the new segment is to be added. The figure below shows the corresponding corrected wave file after the VOP update.
• Sliding a Segment
A segment can be moved to left or right by clicking on the head of the segment boundary on the slider panel and dragging left or right. Sliding can be used if required while correcting the labels.
• Modification of labfile
If a half-corrected lab file is already present (gd lab file present), upload it from the ./labelingTool/labfiles directory using the gd lab file option on the main page. Irrespective of the wsf value, the earlier lab file will be loaded. But if resegmentation is used, the existing labels are discarded and regenerated based on the new wsf value. After modification, when the Save button is pressed the same lab file is updated, but a backup copy of the lab file is created before updating.
Note: If the system creates a lab file with the same name as one that already exists in the labfiles directory, it creates a backup copy of that file. The backup copy is hidden by default; to view it press CTRL + h.
• Logfiles
The tool generates a separate log file for each lab file (e.g. text0001.log) in the ./labelingTool/logfiles directory. Please clean this directory at regular intervals.
5.3 Viewing the labelled file
Once the corrections are made and the save button is clicked, the lab file is generated in the /var/www/html/labelingTool/labfiles directory and it can be viewed by clicking on the 'next' link. The following message appears on clicking 'next': Download the labfile: labfile
Click on the link 'labfile'. The lab file will appear in the browser window as below.
5.4 Control file
A control file is placed at the location /var/www/html/labelingTool/bin/fewords.base. The parameters in the control file are given below. These parameters can be adjusted by the user to get better segmentation results.
• windowSize: size of frame for energy computation
• waveType: type of the waveform:
0 Sphere PCM
1 Sphere Ulaw
2 plain sample, one short integer per line
3 RAW, sequence of bytes, each sample 8 bit
4 RAW16, two bytes/sample, Big Endian
5 RAW16, two bytes/sample, Little Endian
6 Microsoft RIFF, standard wav format
• winScaleFactor: should be chosen based on syllable rate; choose by trial and error
• gamma: reduces the dynamic range of energy
• fftOrder and fftSize: MUST be set to ZERO!!
• frameAdvanceSamples: frame shift for energy computation
• ThresEnergy, thresZero, thresSpectralFlatness: thresholds used for voiced/unvoiced detection
When a parameter is set to zero, it is NOT used. Examples were tested with ENERGY only.
• Sampling rate of the signal: required for giving boundary information in seconds.
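Two of the constraints stated for the control file (a waveType code between 0 and 6, and fftOrder and fftSize set to zero) are easy to check mechanically before running the tool. The Python sketch below assumes the control file has already been read into a dictionary keyed by parameter name; the on-disk format of fewords.base is not described here, so no parser is shown, and the function name is hypothetical.

```python
# waveType codes as listed for the control file.
WAVE_TYPES = {
    0: "Sphere PCM",
    1: "Sphere Ulaw",
    2: "plain sample, one short integer per line",
    3: "RAW, sequence of bytes, each sample 8 bit",
    4: "RAW16, two bytes/sample, big endian",
    5: "RAW16, two bytes/sample, little endian",
    6: "Microsoft RIFF, standard wav format",
}


def check_control(params):
    """Return a list of problems found in an already-parsed control file.

    params is a dict such as {"waveType": 6, "fftOrder": 0, "fftSize": 0}.
    Only the constraints stated in the manual are checked.
    """
    problems = []
    if params.get("waveType") not in WAVE_TYPES:
        problems.append("waveType must be one of 0..6")
    for name in ("fftOrder", "fftSize"):
        if params.get(name, 0) != 0:
            problems.append(name + " must be set to zero")
    return problems
```

An empty result means only that these two constraints hold; the remaining parameters still have to be tuned by trial and error as described above.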
5.5 Performance results for 6 Indian Languages
Testing was conducted on a set of test sentences for all 6 Indian languages, and the percentage of correctness was calculated using the following formula. The calculations were done after segmenting with the tool using the best wsf and threshold values.
\[ \text{Percentage of correctness} = \left[ 1 - \frac{\text{no. of insertions} + \text{no. of deletions}}{\text{total no. of segments}} \right] \times 100 \]
Language     Percentage of Correctness
Hindi        86.83%
Malayalam    78.68%
Telugu       85.40%
Marathi      80.24%
Bengali      77.84%
Tamil        77.38%
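As a worked illustration of the correctness formula above, the following Python sketch computes the percentage from the number of inserted and deleted boundaries. The function name and the sample counts in the usage note are illustrative only and are not taken from the evaluation reported in the table.

```python
def percentage_correct(insertions, deletions, total_segments):
    """Percentage of correctness: [1 - (insertions + deletions) / total] * 100."""
    if total_segments <= 0:
        raise ValueError("total_segments must be positive")
    return (1.0 - (insertions + deletions) / total_segments) * 100.0
```

For instance, 5 insertions and 5 deletions over 100 segments give a correctness of about 90 percent.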
5.6 Limitations of the tool
• Zooming is not enabled for the VOP and EHMM panels
• The waveform is not displayed as it is in WaveSurfer
6 Unit Selection Synthesis Using Festival
This chapter discusses some of the options for building waveform synthesizers using unit selection techniques in Festival.
By "unit selection" we actually mean the selection of some unit of speech, which may be anything from a whole phrase down to a diphone (or even smaller). Technically diphone selection is a simple case of this. Typically, however, what we mean is that, unlike diphone selection, in unit selection there is more than one example of each unit and some mechanism is used to select between them at run time.
The theory is straightforward, but designing such systems, finding the appropriate selection criteria and weighting the costs of relative candidates is a non-trivial problem. Techniques like this often produce very high quality, very natural sounding synthesis. They can also produce some very bad synthesis when the database has unexpected holes and/or the selection costs fail.
6.1 Cluster unit selection
The idea is to take a database of general speech and try to cluster each phone type into groups of acoustically similar units based on the (non-acoustic) information available at synthesis time, such as phonetic context, prosodic features (F0 and duration) and higher level features such as stressing, word position, and accents. The actual features used may easily be changed and experimented with, as can the definition of the acoustic distance between the units in a cluster.
The basic processes involved in building a waveform synthesizer for the clustering algorithm are as follows. A high level walkthrough of the scripts to run is given after these lower level details.
1. Collect the database of general speech.
2. Build the utterance structure.
3. Building coefficients for acoustic distances, typically some form of cepstrum plus F0, or some pitch synchronous analysis (e.g. LPC).
4. Build distances tables, precalculating the acoustic distance between each unit of the same phone type.
5. Dump selection features (phone context, prosodic, positional and whatever) for each unit type.
6. Build cluster trees using wagon with the features and acoustic distances dumped by the previous two stages
7. Building the voice description itself
6.2 Choosing the right unit type
Before you start you must decide what unit type you are going to use. Note there are two dimensions here. The first is size, such as phone, diphone or demi-syllable. The second is type itself, which may be simple phone, phone plus stress, phone plus word etc. The code here and the related files basically assume the unit size is phone. However, because you may also include a percentage of the previous unit in the acoustic distance measure, this unit size is more effectively phone plus previous phone, and is thus somewhat diphone-like. The cluster method itself has no actual restrictions on the unit size, but the basic synthesis code is currently assuming phone sized units.
The second dimension, type, is very open and we expect that controlling this will be a good method to attain high quality general unit selection synthesis. The parameter clunit_name_feat may be used to define the unit type. The simplest conceptual example is the one used in limited domain synthesis. There we distinguish each phone with the word it comes from, thus a d from the word limited is distinct from the d in the word domain. Such distinctions can hard-partition the space of phones into types that can be more manageable.
The decision of how to carve up that space depends largely on the intended use of the database. The more distinctions you make, the less you depend on the clustering acoustic distance, but the more you depend on your labels (and the speech) being (absolutely) correct. The mechanism to define the unit type is through a (typically) user defined feature function. In the given setup scripts this feature function will be called lisp_INST_LANG_NAME::clunit_name. Thus the voice simply defines the function INST_LANG_NAME::clunit_name to return the unit type for the given segment. If you wanted to make a diphone unit selection voice this function could simply be
(define (INST_LANG_NAME::clunit_name i)
  (string_append
    (item.name i)
    "_"
    (item.feat i "p.name")))
Thus the unit type would be the phone plus its previous phone. Note that the first part of a unit name is assumed to be the phone name in various parts of the code, thus although you may think it would be neater to return previousphone_phone, that would mess up some other parts of the code.
In the limited domain case the word is attached to the phone. You can also consider some demi-syllable information or more to differentiate between different instances of the same phone. The important thing to remember is that at synthesis time the same function is called to identify the unit type, which is used to select the appropriate cluster tree to select from. Thus you need to ensure that if you use, say, diphones, your database really does have all diphones in it.
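The coverage requirement in the last sentence can be checked offline before building a voice. The Python sketch below mirrors the diphone-style naming (phone, underscore, previous phone) on plain phone lists and reports unit types that a target phone sequence needs but the database lacks. It is an illustration outside Festival, not part of the festvox scripts; the function names are hypothetical, and using "0" as the name of a missing previous phone is an assumption about what the feature returns at the start of an utterance.

```python
def unit_types(phones, no_prev="0"):
    """Name each phone as PHONE_PREVIOUSPHONE, phone name first."""
    names = []
    prev = no_prev  # assumed placeholder when there is no previous segment
    for phone in phones:
        names.append(phone + "_" + prev)
        prev = phone
    return names


def missing_unit_types(target_phones, database_utterances):
    """Unit types needed for target_phones that never occur in the database.

    database_utterances is a list of phone lists, one per utterance.
    """
    available = set()
    for utt in database_utterances:
        available.update(unit_types(utt))
    return sorted(set(unit_types(target_phones)) - available)
```

If `missing_unit_types` returns a non-empty list, there is no cluster tree for those unit types, so the naming scheme is too fine for the recorded data.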
6.3 Collecting databases for unit selection
Unlike diphone databases, which are carefully constructed to ensure specific coverage, one of the advantages of unit selection is that a much more general database is desired. However, although voices may be built from existing data not specifically gathered for synthesis, there are still factors about the data that will help make better synthesis.
As with diphone databases, the more cleanly and carefully the speech is recorded the better the synthesized voice will be. As we are going to be selecting units from different parts of the database, the more similar the recordings are, the less likely bad joins will occur. However, unlike diphone databases, prosodic variation is probably a good thing, as it is those variations that can make synthesis from unit selection sound more natural. Good phonetic coverage is also useful, at least phone coverage if not complete diphone coverage. Also, synthesis using these techniques seems to retain aspects of the original database. If the database is broadcast news stories, the synthesis from it will typically sound like read news stories (or more importantly will sound best when it is reading news stories).
Again the notes about recording the database apply, though it will sometimes be the case that the database is already recorded and beyond your control; in that case you will always have something legitimate to blame for poor quality synthesis.
6.4 Preliminaries
Throughout our discussion we will assume the following database layout. It is highly recommended that you follow this format, otherwise scripts and examples will fail. There are many ways to organize databases and many of such choices are arbitrary; here is our "arbitrary" layout.
The basic database directory should contain the following directories
bin/ - Any database specific scripts for processing. Typically this first contains a copy of standard scripts that are then customized when necessary to the particular database.
wav/ - The waveform files. These should be headered, one utterance per file, with a standard name convention. They should have the extension .wav and the fileid consistent with all other files through the database (labels, utterances, pitch marks etc).
lab/ - The segmental labels. This is usually the master label files; these may contain more information than the labels used by festival, which will be in festival/relations/Segment/.
lar/ - The EGG files (laryngograph files) if collected.
pm/ - Pitchmark files as generated from the lar files or from the signal directly.
festival/ - Festival specific label files.
festival/relations/ - The processed labeled files for building Festival utterances, held in directories whose name reflects the relation they represent: Segment/, Word/, Syllable/ etc.
festival/utts/ - The utterance files as generated from the festival/relations/ label files.
Other directories will be created for various processing reasons.
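The layout above can be created or checked with a short script. The following Python sketch is only a convenience illustration and is not one of the festvox setup scripts; the function names are hypothetical, and only the directories named in the list are handled.

```python
import os

# Directories named in the database layout described above.
DB_DIRS = [
    "bin",
    "wav",
    "lab",
    "lar",
    "pm",
    "festival/relations",
    "festival/utts",
]


def make_layout(root):
    """Create every expected directory under root (existing ones are kept)."""
    for rel in DB_DIRS:
        os.makedirs(os.path.join(root, *rel.split("/")), exist_ok=True)


def missing_dirs(root):
    """Return the expected directories that do not exist under root."""
    return [
        rel for rel in DB_DIRS
        if not os.path.isdir(os.path.join(root, *rel.split("/")))
    ]
```

Running `missing_dirs(".")` from the database directory before starting a build is a quick way to catch a misnamed folder.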
6.5 Building utterance structures for unit selection
In order to make access well defined you need to construct Festival utterance structures for each of the utterances in your database. This (in its basic form) requires labels for segments, syllables, words, phrases, F0 targets, and intonation events. Ideally these should all be carefully hand labeled but in most cases that's impractical. There are ways to automatically obtain most of these labels but you should be aware of the inherent errors in the labeling system you use (including labeling systems that involve human labelers). Note that when a unit selection method that fundamentally uses segment boundaries is to be used, its quality is ultimately determined by the quality of the segmental labels in the database.
For the unit selection algorithm described below the segmental labels should use the same phoneset as used in the actual synthesis voice. However a more detailed phonetic labeling may be more useful (e.g. marking closures in stops), mapping that information back to the phone labels before actual use. Autoaligned databases typically aren't accurate enough for use in unit selection. Most autoaligners are built using speech recognition technology where actual phone boundaries are not the primary measure of success. General speech recognition systems primarily measure words correct (or more usefully semantically correct) and do not require phone boundaries to be accurate. If the database is to be used for unit selection it is very important that the phone boundaries are accurate. Having said this though, we have successfully used the aligner described in the diphone chapter above to label general utterances where we knew which phone string we were looking for. Using such an aligner may be a useful first pass, but the result should always be checked by hand.
It has been suggested that aligning techniques and unit selection training techniques can be used to judge the accuracy of the labels and basically exclude any segments that appear to fall outside the typical range for the segment type. Thus it is believed that unit selection algorithms should be able to deal with a certain amount of noise in the labeling. This is the desire of researchers in the field, but we are some way from that, and the easiest way at present to improve the quality of unit selection algorithms is to ensure that segmental labeling is as accurate as possible. Once we have a better handle on selection techniques themselves it will then be possible to start experimenting with noisy labeling.
However it should be added that this unit selection technique (and many others) supports what is termed "optimal coupling", where the acoustically most appropriate join point is found automatically at run time when two units are selected for concatenation. This technique is inherently robust to boundary labeling errors of at least a few tens of milliseconds.
For the cluster method defined here it is best to construct more than simply segments, durations and an F0 target. A whole syllabic structure plus word boundaries, intonation events and phrasing allow a much richer set of features to be used for clusters. See the Section called Utterance building in the Chapter called A Practical Speech Synthesis System for a more general discussion of how to build utterance structures for a database.
6.6 Making cepstrum parameter files
In order to cluster similar units in a database we build an acoustic representation of them. This is also still a research issue, but in the example here we will use Mel cepstrum. Interestingly we do not generate these at fixed intervals, but at pitch marks. Thus we have a parametric spectral representation of each pitch period. We have found this a better method, though it does require that pitchmarks are reasonably identified.
Here is an example script which will generate these parameters for a database; it is included in festvox/src/unitsel/make_mcep.
for i in $*
do
   fname=`basename $i .wav`
   echo $fname MCEP
   $SIG2FV $SIG2FVPARAMS -otype est_binary $i -o mcep/$fname.mcep -pm pm/$fname.pm -window_type hamming
done
The above builds coefficients at fixed frames. We have also experimented with building parameters pitch synchronously and have found a slight improvement in the usefulness of the measure based on this. We do not pretend that this part is particularly neat in the system but it does work. When pitch synchronous parameters are built the clunits module will automatically put the local F0 value in coefficient 0 at load time. This happens to be appropriate for LPC coefficients. The script in festvox/src/general/make_lpc can be used to generate the parameters, assuming you have already generated pitch marks.
Note the secondary advantage of using LPC coefficients is that they are required anyway for LPC resynthesis, so less information about the database is required at run time. We have not yet tried pitch synchronous Mel frequency cepstrum coefficients but that should be tried. Also a more general duration/number of pitch periods match algorithm is worth defining.
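For readers who drive the build from Python rather than the shell, the per-file command issued by the make_mcep loop shown earlier in this section can be assembled as below. This is a sketch only: it builds the argument list and does not run anything, the default program name "sig2fv" stands in for whatever $SIG2FV points to on your system, and the extra parameters held in $SIG2FVPARAMS are not listed in this section, so they are passed in by the caller.

```python
import os


def mcep_command(wav_path, sig2fv="sig2fv", params=()):
    """Build the argument list that the make_mcep loop runs for one file.

    sig2fv and params stand in for $SIG2FV and $SIG2FVPARAMS; their real
    values come from the festvox environment and are assumptions here.
    """
    fname = os.path.basename(wav_path)
    if fname.endswith(".wav"):          # same effect as `basename $i .wav`
        fname = fname[: -len(".wav")]
    return [
        sig2fv, *params,
        "-otype", "est_binary",
        wav_path,
        "-o", "mcep/" + fname + ".mcep",
        "-pm", "pm/" + fname + ".pm",
        "-window_type", "hamming",
    ]
```

The resulting list can be handed to `subprocess.run` once the mcep/ and pm/ directories exist and the pitch mark files have been generated.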
6.7 Building the clusters
Cluster building is mostly automatic. Of course you need the clunits modules compiled into your version of Festival. Version 1.3.1 or later is required, the version of clunits in 1.3.0 is buggy and incomplete and will not work. To compile in clunits, add
ALSO_INCLUDE += clunits
to the end of your festival/config/config file, and recompile. To check if an installation already has support for clunits check the value of the variable *modules*.
The file festvox/src/unitsel/build_clunits.scm contains the basic parameters to build a cluster model for a database that has utterance structures and acoustic parameters. The function build_clunits will build the distance tables, dump the features and build the cluster trees. Many parameters are set for the particular database (and instance of cluster building) through the Lisp variable clunits_params. A reasonable set of defaults is given in that file, and reasonable run-time parameters will be copied into festvox/INST_LANG_VOX_clunits.scm when a new voice is set up.
The function build_clunits runs through all the steps, but in order to better explain what is going on we will go through each step and at that time explain which parameters affect the substep.
The first stage is to load in all the utterances in the database, sort them into segment type and name them with individual names (as TYPE_NUM). This first stage is required for all other stages, so even if you are not running build_clunits you still need to run this stage first. This is done by the calls
(format t "Loading utterances and sorting types\n")
(set! utterances (acost:db_utts_load dt_params))
(set! unittypes (acost:find_same_types utterances))
(acost:name_units unittypes)
Though the function build_clunits_init will do the same thing. This uses the following parameters
name STRING
db_dir FILENAME
The pathname of the database, typically . as in the current directory.
utts_dir FILENAME
The directory containing the utterances.
utts_ext FILENAME
The file extension for the utterance files.
files
The list of file ids in the database.
For example for the KED example these parameters are
(name 'ked_timit)
(db_dir "/usr/awb/data/timit/ked/")
(utts_dir "festival/utts/")
(utts_ext ".utt")
(files ("kdt_001" "kdt_002" "kdt_003" ... ))
In the examples below the list of fileids is extracted from the given prompt file at call time.
The next stage is to load the acoustic parameters and build the distance tables. The acoustic distance between each segment of the same type is calculated and saved in the distance table. Precalculating this saves a lot of time, as the clustering will require these numbers many times.
This is done by the following two function calls
(format t "Loading coefficients\n")
(acost:utts_load_coeffs utterances)
(format t "Building distance tables\n")
(acost:build_disttabs unittypes clunits_params)
The following parameters influence the behaviour.
coeffs_dir FILENAME
The directory (from db_dir) that contains the acoustic coefficients as generated by the script make_mcep.
coeffs_ext FILENAME
The file extension for the coefficient files.
get_stds_per_unit
Takes the value t or nil. If t, the parameters for the type of segment are normalized using the means and standard deviations found for the class. Thus a mean Mahalanobis Euclidean distance is found between units rather than simply a Euclidean distance. The recommended value is t.
ac_left_context FLOAT
The amount of the previous unit to be included in the distance. 1.0 means all, 0.0 means none. This parameter may be used to make the acoustic distance sensitive to the previous acoustic context. The recommended value is 0.8.
dur_pen_weight FLOAT
The penalty factor for duration mismatch between units.
f0_pen_weight FLOAT
The penalty factor for F0 mismatch between units.
ac_weights (FLOAT FLOAT ...)
The weights for each parameter in the coefficient files, used while finding the acoustic distance between segments. There must be the same number of weights as there are parameters in the coefficient files. The first parameter is (in normal operations) F0. It is common to give proportionally more weight to F0 than to each individual other parameter. The remaining parameters are typically MFCCs (and possibly delta MFCCs). Finding the right parameters and weightings is one of the key goals in unit selection synthesis, so it is not easy to give concrete recommendations. The following aren't bad, but there may be better ones too, though we suspect that real human listening tests are probably the best way to find better values.
An example is
(coeffs_dir "mcep/")
(coeffs_ext ".mcep")
(dur_pen_weight 0.1)
(get_stds_per_unit t)
(ac_left_context 0.8)
(ac_weights (0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5))
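To make the roles of the weights, the per-class standard deviations and the duration penalty more concrete, here is a deliberately simplified Python sketch of a precalculated distance table. It is not the computation performed by the clunits module: the linear frame alignment, the form of the duration penalty and all function names are assumptions chosen for illustration, and left context and the F0 penalty are left out.

```python
import math


def frame_distance(a, b, stds, weights):
    """Weighted distance between two coefficient vectors, each coefficient
    scaled by the standard deviation of its class (cf. get_stds_per_unit)."""
    total = 0.0
    for x, y, s, w in zip(a, b, stds, weights):
        diff = (x - y) / s
        total += w * diff * diff
    return math.sqrt(total)


def unit_distance(u, v, stds, weights, dur_pen_weight=0.1):
    """Mean frame distance between two units plus a duration penalty.

    Units are non-empty lists of frames. Frames of the longer unit are
    mapped linearly onto the shorter one (an assumed alignment); the
    penalty grows with the relative length mismatch (also assumed).
    """
    if not u or not v:
        raise ValueError("units must contain at least one frame")
    longer, shorter = (u, v) if len(u) >= len(v) else (v, u)
    total = 0.0
    for i, frame in enumerate(longer):
        j = i * len(shorter) // len(longer)
        total += frame_distance(frame, shorter[j], stds, weights)
    penalty = dur_pen_weight * (len(longer) - len(shorter)) / len(longer)
    return total / len(longer) + penalty


def distance_table(units, stds, weights, dur_pen_weight=0.1):
    """Symmetric table of distances between all units of one type."""
    n = len(units)
    table = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i):
            d = unit_distance(units[i], units[j], stds, weights, dur_pen_weight)
            table[i][j] = table[j][i] = d
    return table
```

Computing the table once per unit type is what makes the later tree building affordable, since every candidate question needs the distances between many pairs of units.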
The next stage is to dump the features that will be used to index the clusters. Remember the clusters are defined with respect to the acoustic distance between each unit in the cluster, but they are indexed by these features. These features are those which will be available at text-to-speech time when no acoustic information is available. Thus they include things like phonetic and prosodic context rather than spectral information. The named features may (and probably should) be over-general, allowing the decision tree building program wagon to decide which of these features actually make an acoustic distinction in the units.
The function to dump the features is
(format t "Dumping features for clustering\n")
(acost:dump_features unittypes utterances clunits_params)
The parameters which affect this function are
feats_dir FILENAME
The directory where the features will be saved (by segment type).
feats LIST
The list of features to be dumped. These are standard festival feature names with respect to the Segment relation.
For our KED example these values are
(feats_dir "festival/feats/")
(feats
  (occurid
   p.name p.ph_vc p.ph_ctype p.ph_vheight p.ph_vlng p.ph_vfront p.ph_vrnd p.ph_cplace p.ph_cvox
   n.name n.ph_vc n.ph_ctype n.ph_vheight n.ph_vlng n.ph_vfront n.ph_vrnd n.ph_cplace n.ph_cvox
   segment_duration seg_pitch p.seg_pitch n.seg_pitch
   R:SylStructure.parent.stress
   seg_onsetcoda n.seg_onsetcoda p.seg_onsetcoda
   R:SylStructure.parent.accented
   pos_in_syl syl_initial syl_final
   R:SylStructure.parent.syl_break
   R:SylStructure.parent.R:Syllable.p.syl_break
   pp.name pp.ph_vc pp.ph_ctype pp.ph_vheight pp.ph_vlng pp.ph_vfront pp.ph_vrnd pp.ph_cplace pp.ph_cvox))
Now that we have the acoustic distances and the feature descriptions of each unit, the next stage is to find a relationship between those features and the acoustic distances. We do this using the CART tree builder wagon. It finds the questions about features that best minimize the acoustic distance between the units in each class. wagon has many options, many of which are apposite to this task. It is interesting that this learning task is closed: we are trying to classify all the units in the database, so there is no test set as such. In synthesis, however, there will be desired units whose feature vectors did not exist in the training set.
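The idea of choosing a question that minimizes acoustic distance within each class can be sketched in Python as follows. This is a toy illustration and not wagon itself: the impurity measure (mean pairwise distance weighted by subset size), the restriction to "feature equals value" questions, and the names mean_pairwise_distance and best_question are assumptions made for this example.

```python
from itertools import combinations


def mean_pairwise_distance(indices, dist):
    """Mean acoustic distance over all pairs of units in a subset."""
    pairs = list(combinations(indices, 2))
    if not pairs:
        return 0.0
    return sum(dist[i][j] for i, j in pairs) / len(pairs)


def best_question(units, dist, min_size=1):
    """Pick the (feature, value) question whose yes/no split is tightest.

    units: list of dicts mapping feature names to values.
    dist:  symmetric matrix of acoustic distances between units.
    Returns (score, feature, value, yes_indices, no_indices) or None.
    """
    best = None
    questions = sorted({(f, v) for u in units for f, v in u.items()})
    for feat, val in questions:
        yes = [i for i, u in enumerate(units) if u.get(feat) == val]
        no = [i for i, u in enumerate(units) if u.get(feat) != val]
        if len(yes) < min_size or len(no) < min_size:
            continue  # plays the role of a minimum cluster size
        score = (len(yes) * mean_pairwise_distance(yes, dist)
                 + len(no) * mean_pairwise_distance(no, dist))
        if best is None or score < best[0]:
            best = (score, feat, val, yes, no)
    return best


units = [
    {"p.name": "a", "n.name": "t"},
    {"p.name": "a", "n.name": "k"},
    {"p.name": "i", "n.name": "t"},
    {"p.name": "i", "n.name": "k"},
]
# Units 0 and 1 sound alike, as do units 2 and 3.
dist = [
    [0.0, 0.1, 1.0, 1.0],
    [0.1, 0.0, 1.0, 1.0],
    [1.0, 1.0, 0.0, 0.1],
    [1.0, 1.0, 0.1, 0.0],
]
result = best_question(units, dist)
print(result)
```

In this toy data the previous phone separates the acoustically similar units, so the question on p.name is chosen over the questions on n.name.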
The clusters are built by the following function
(format t "Building cluster trees\n")
(acost:find_clusters (mapcar car unittypes) clunits_params)
The parameters that affect the tree building process are
trees_dir FILENAME
The directory where the decision tree for each segment type will be saved.
wagon_field_desc LIST
A filename of a wagon field descriptor file. This is a standard field description (field name plus field type) that is required by wagon. An example is given in festival/clunits/all.desc, which should be sufficient for the default feature list, though if you change the feature list (or the values those features can take) you may need to change this file.
wagon_progname FILENAME
The pathname for the wagon CART building program. This is a string and may also include any extra parameters you wish to give to wagon.
wagon_cluster_size INT
The minimum cluster size (the wagon -stop value).
prune_reduce INT
The number of elements in each cluster to remove in pruning. This removes the units in the cluster that are furthest from the center. This is done within the wagon training.
cluster_prune_limit INT
This is a post-wagon build operation on the generated trees (and perhaps a more reliable method of pruning). It defines the maximum number of units that will be in a cluster at a tree leaf; wagon_cluster_size is the minimum size. This is useful when there are large numbers of some particular unit type which cannot be differentiated, for example silence segments whose context is nothing other than silence. Another use is to cause only the center example units to be used. We have used this in building diphone databases from general databases, by making the selection features include only phonetic context features and then restricting the number of diphones we take by setting this number to 5 or so.
unittype_prune_threshold INT
When making complex unit types this defines the minimal number of units of that type required before building a tree. When building cascaded unit selection synthesizers it is often not worth excluding large stages if there is, say, only one example of a particular demi-syllable.
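The pruning described for prune_reduce and cluster_prune_limit, keeping the units nearest the cluster center and dropping the furthest, can be sketched in Python as below. This is a hedged illustration rather than Festival's implementation: treating the "center" as the unit with the smallest mean distance to the others, and the name prune_cluster, are assumptions made for this example.

```python
def prune_cluster(members, dist, limit):
    """Keep at most `limit` units, preferring those closest to the center.

    members: list of unit indices in one cluster (one tree leaf).
    dist:    symmetric matrix of acoustic distances between units.
    Closeness to the center is approximated by a unit's mean distance
    to the other members (an assumption of this sketch).
    """
    if limit <= 0:
        raise ValueError("limit must be positive")
    if len(members) <= limit:
        return list(members)

    def mean_distance(i):
        others = [j for j in members if j != i]
        return sum(dist[i][j] for j in others) / len(others)

    ranked = sorted(members, key=mean_distance)
    return ranked[:limit]


# Unit 3 is an outlier, far from units 0, 1 and 2.
dist = [
    [0.0, 0.1, 0.2, 1.0],
    [0.1, 0.0, 0.1, 1.0],
    [0.2, 0.1, 0.0, 1.0],
    [1.0, 1.0, 1.0, 0.0],
]
kept_three = prune_cluster([0, 1, 2, 3], dist, 3)
kept_one = prune_cluster([0, 1, 2, 3], dist, 1)
print(kept_three, kept_one)
```

With a limit of 1 only the most central unit survives, which mirrors the use described above of restricting a cluster to its center examples.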
Note that, as the distance tables can be large, there is an alternative function that builds both the distance table and the clusters in one step, deleting each distance table immediately after use. You then only need enough disk space for the type with the largest number of phones. To do this, call
(acost:disttabs_and_clusters unittypes clunits_params)
and remove the calls to acost:build_disttabs and acost:find_clusters.
In our KED example these have the values
(trees_dir "festival/trees/")
(wagon_field_desc "festival/clunits/all.desc")
(wagon_progname "/usr/awb/projects/speech_tools/bin/wagon")
(wagon_cluster_size 10)
(prune_reduce 0)
The final stage in building a cluster model is to collect the generated trees into a single file and dump the unit catalogue, i.e. the list of unit names, their files and their positions in them. This is done by the Lisp functions
(acost:collect_trees (mapcar car unittypes) clunits_params)
(format t "Saving unit catalogue\n")
(acost:save_catalogue utterances clunits_params)
The only parameter that affects this is
catalogue_dir FILENAME
The directory where the catalogue will be saved (the name parameter is used to name the file).
By default this is
(catalogue_dir "festival/clunits/")
There are a number of parameters that are specified with a cluster voice. These relate to the run-time aspects of the cluster model. They are
join_weights FLOATLIST
A set of weights, in the same format as ac_weights, that are used in optimal coupling to find the best join point between two candidate units. This is separate from ac_weights because different values are likely to be desired, in particular a larger F0 weight (column 0).
continuity_weight FLOAT
The factor by which the join cost is multiplied relative to the target cost. This is probably not very relevant, given that the target cost is merely the distance from the cluster center.
log_scores 1
If specified, the join scores are converted to logs. For databases that have a tendency to contain non-optimal joins (probably any non-limited domain database), this may be useful to stop synthesis of longer sentences from failing. The problem is that the sum of very large numbers can lead to overflow, and this helps reduce it. You could alternatively change continuity_weight to a number less than 1, which would also partially help. However, such overflows are often a pointer to some other problem (poor distribution of phones in the database), so this is probably just a hack.
optimal_coupling INT
If 1, this uses optimal coupling and searches the cepstrum vectors at each join point to find the best possible join point. This is computationally expensive (and requires loading many cepstrum files), but does give better results. If the value is 2, this only checks the coupling distance at the given boundary (and does not move it). This is often adequate in good databases (e.g. limited domain), and is certainly faster.
extend_selections INT
If 1, the selected cluster will be extended to include any unit from the cluster of the previous segment's candidate units that has the correct phone type (and is not already included in the current cluster). This is experimental but has shown its worth and hence is recommended. It means that instead of selecting just single units, selection is effectively selecting the beginnings of multi-segment units. This option encourages far longer units.
pm_coeffs_dir FILENAME
The directory (relative to db_dir) where the pitchmarks are.
pm_coeffs_ext FILENAME
The file extension for the pitchmark files.
sig_dir FILENAME
The directory containing the waveforms of the units (or residuals if Residual LPC is being used, PCM waveforms if PSOLA is being used).
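The difference between the two optimal_coupling settings, and the effect of log_scores, can be illustrated with the following Python sketch. It is not Festival's algorithm: the representation of the join region as two equal-length frame lists, the absolute-difference distance, the use of log(1 + cost) for log scores, and the names frame_distance and join_cost are all assumptions made for this example.

```python
import math


def frame_distance(a, b, weights):
    """Weighted absolute difference between two frames (column 0 is F0)."""
    return sum(w * abs(x - y) for x, y, w in zip(a, b, weights))


def join_cost(left, right, weights, boundary, optimal_coupling=1,
              log_scores=False):
    """Return (join_index, cost) for joining two candidate units.

    left, right: equal-length lists of frames covering the join region.
    boundary:    index of the labelled boundary within that region.
    optimal_coupling == 1 searches every index for the cheapest join;
    any other value only scores the labelled boundary.
    """
    if len(left) != len(right):
        raise ValueError("left and right must cover the same frames")
    candidates = range(len(left)) if optimal_coupling == 1 else [boundary]
    best = min(candidates,
               key=lambda k: frame_distance(left[k], right[k], weights))
    cost = frame_distance(left[best], right[best], weights)
    if log_scores:
        cost = math.log(1.0 + cost)  # compress large costs before summing
    return best, cost


left = [[100.0, 1.0], [105.0, 1.0], [120.0, 2.0]]
right = [[130.0, 1.0], [106.0, 1.0], [100.0, 0.0]]
join_weights = [1.0, 0.5]

searched = join_cost(left, right, join_weights, boundary=0, optimal_coupling=1)
fixed = join_cost(left, right, join_weights, boundary=0, optimal_coupling=2)
logged = join_cost(left, right, join_weights, boundary=0, optimal_coupling=2,
                   log_scores=True)
print(searched, fixed, logged)
```

Searching moves the join to the frame where the two units match best at the price of scoring every candidate frame, while the fixed-boundary mode scores a single frame and accepts whatever mismatch the labelled boundary has.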