Academic year: 2020

BHAASHIKA: TELUGU TTS SYSTEM

Dr. K.V.N. SUNITHA, Professor & HOD, Computer Science Department,

G. Narayanamma Institute of Technology & Science, Hyderabad, India.

P. SUNITHA DEVI, Assistant Professor, Computer Science Department,

G. Narayanamma Institute of Technology & Science, Hyderabad, India.

Abstract:

Speech synthesis is the process of building machinery that can generate human-like speech from any text input, imitating a human speaker. The objective of a text-to-speech (TTS) system is to convert arbitrary text into its corresponding spoken waveform. Text processing and speech generation are the two main components of a TTS system. To build a natural-sounding speech synthesis system, it is essential that the text processing component produce an appropriate sequence of phonemic units. Current state-of-the-art speech synthesizers are based on data-driven concatenative synthesis: the required sequence of basic units is synthesized by selecting appropriate unit instances from a large database containing multiple instances of each unit with varying prosodic properties. The performance of a speech synthesis system depends on the size of the speech units used; a system can use phones, diphones or syllables as basic units, and the highest performance is achieved by domain-specific systems whose basic units are words or sentences. The quality of a speech synthesizer is judged by its closeness to the natural human voice and by its understandability. This paper describes an approach to building a Telugu TTS system using the concatenative synthesis method with the syllable as the basic unit of concatenation.

Keywords: speech synthesis, text processing, speech generation, concatenative synthesis, syllable.

1. Introduction

1.1. Speech Synthesis

The conversion of words in written form into speech is nontrivial. Even if we store a huge dictionary covering the most common words, a TTS system still needs to deal with millions of names and acronyms. Moreover, to sound natural, the intonation of the sentences must be appropriately generated. Speech cannot be synthesized simply by cutting and pasting smaller units together; attention has to be paid to smoothing out the discontinuities in the process so that the resulting signal approximates natural speech. According to the speech generation model used, speech synthesis can be classified into three categories: articulatory synthesis, formant synthesis and concatenative synthesis [1]. Based on the degree of manual intervention in the design, speech synthesis can be classified into two categories: synthesis by rule and data-driven synthesis.


Rule-based synthesis is intelligible but sounds unnatural, because it is difficult to capture all the slight variations of natural speech in a small set of derived rules. Concatenative synthesis simply plays back the waveforms matching the phone string: an utterance is synthesized by concatenating several speech fragments. Unlike synthesis by rule, it requires neither rules nor manual tuning, and since each segment is completely natural, very natural output can be expected. However, speech segments are greatly affected by coarticulation, so concatenating two segments that were not originally adjacent can produce spectral or prosodic discontinuities. Spectral discontinuities occur when the formants at the concatenation point do not match; prosodic discontinuities occur when the pitch at the concatenation point does not match. Listeners rate synthetic speech containing large discontinuities as poor, even if each individual segment is very natural.

Several attempts have also been made to achieve speech synthesis using speech segments. In order to minimize discontinuities, it would be preferable to have word length or longer segments. This requires storing large repositories of speech segments. To synthesize a given passage, it is also necessary to identify the appropriate segments such that synthesis can be done with the fewest joins. This was difficult earlier but is becoming increasingly possible with the availability and economy of mass storage devices.

Prosody and intonation (modulation of pitch and volume) are quite important for natural-sounding speech. This is a complex aspect, and the skill is not easy to acquire even for humans except during their early years (this is the reason for the accent problems people have when learning second and subsequent languages). Speech synthesis systems exist that replicate the prosodic features of human speech; this involves fairly complex parsing of the input sentences and rather complex rules to determine the intonation patterns.

1.2. Existing Systems

Many of the improvements in speech synthesis over the past years have come from creative use of technologies developed for speech recognition. We can build systems that interact through speech: the system listens to what is said, computes or does something, and then speaks back using spoken language generation. Festvox, developed by the speech group at CMU, provides a general framework for building speech synthesis systems [2]. It offers full text-to-speech through a number of APIs: from the shell level, through a Scheme command interpreter, as a C++ library, from Java, and via an Emacs interface. The Festvox project aims at automating the processes involved in building synthetic voices for new languages. Creating Festival TTS voices for other languages, especially Indian languages, is very similar and mostly requires language-specific changes to the code in these modules. Festival uses the diphone as its basic unit.

The synthesis approach of Black and Taylor [3] [4] builds decision trees for units of the same type, based on questions concerning prosodic and phonemic context and acoustic similarity with neighboring units. During synthesis, for each target unit, the appropriate decision tree is used to find the best cluster of candidate units; a search is then made for the best path through the candidate units, taking into account the distance of a candidate unit from its cluster center and the cost of joining two adjacent units. These approaches use the phone as the basic unit and either predict features for target units or use classification and regression techniques to build decision trees.

Concatenative synthesis stores pre-recorded speech segments which are later retrieved and replayed to produce synthesized speech. The type of unit stored as a pre-recorded speech segment depends on the developer's preference and on which unit best represents the particular language. These units can be classified into two types: context dependent units such as phonemes, syllables and words, and context independent units such as diphones, triphones and demisyllables [1]. Another representation is a unit selection database, in which several types of unit are available and there are multiple instances for an utterance [5].


2. Telugu Language

The scripts of Indian languages originated from the ancient Brahmi script. The basic units of the writing system are characters, which are orthographic representations of speech sounds. A character in Indian language scripts is close to a syllable and can typically take the following forms: C, V, CV, CCV and CVC, where C is a consonant and V is a vowel. There are about 35 consonants and about 18 vowels in Indian languages [6]. An important feature of Indian language scripts is their phonetic nature: there is a more or less one-to-one correspondence between what is written and what is spoken, and the rules required to map letters to sounds are almost straightforward. All Indian language scripts share a common phonetic base.

2.1. Syllabification Rules

There is almost a one-to-one correspondence between what is written and what is spoken in Indian languages: each character in an Indian language script corresponds to a sound of that language. A consonant character is inherently bound with the vowel sound /a/ and is almost always pronounced with this vowel. In some occurrences this vowel is not pronounced, which is referred to as Inherent Vowel Suppression (IVS); it occurs at both word-final and word-middle positions, and a few heuristic rules are used to detect IVS of a consonant character. While letter-to-phone rules are almost straightforward in Indian languages, syllabification rules are not trivial, so rules are needed to break a word into syllables. We have derived simple rules for syllabification, i.e. rules for grouping clusters of the form C*VC*, based on heuristic analysis of several words in the Telugu language.

3. Bhaashika Architecture

After a brief study of the Telugu language model and a survey of existing speech synthesizers, it was found that adapting a synthesizer engine built for another language causes the synthesized speech to sound foreign [7]. Hence it is important for us to have our own synthesizer as the starting point for a robust speech synthesis engine. Development of a Telugu synthesizer needs to consider the phonetic and linguistic nature of the Telugu language as well as the problems of the synthesis approach. The architecture of our system is shown in Fig 1.

Fig 1: Bhaashika -Telugu TTS system Architecture


Text syllabification segments the normalized text into syllable units according to Telugu language rules. The Phone Set Definition module defines the complete set of phones used in Telugu speech, including feature definitions (e.g. vowel/consonant, lip rounding) for these phones. The Lexical Analysis module is used to arrive at the phones that make up the pronunciation of a particular word. Since Telugu is phonetic in nature, a dictionary is not required for lexical analysis; instead, this module defines letter-to-sound (LTS) rules which derive the speech phones from the spelling of the word.

The Syllable Extraction and Concatenation module searches for the sequence of syllables obtained from the syllabification module and combines the syllable unit sound files to produce synthesized speech. Prosodic phrasing makes the synthesized speech more understandable; phrasing is done based on punctuation.
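As an illustrative sketch of the concatenation step, the selected units can be joined at the sample level, appending silence after units that end a word or sentence. The buffer layout, sample format (16-bit mono PCM) and the silence length are our assumptions, not details from the paper; a real implementation would also read and write RIFF WAV files.

```c
#include <stddef.h>
#include <string.h>

#define SILENCE_SAMPLES 800   /* e.g. 50 ms at an assumed 16 kHz rate */

/* Appends one unit's samples to the utterance buffer and returns the new
   length; optionally appends a silence unit after it. The caller must
   ensure dst has room for src_len + SILENCE_SAMPLES more samples. */
size_t append_unit(short *dst, size_t dst_len,
                   const short *src, size_t src_len, int add_silence) {
    memcpy(dst + dst_len, src, src_len * sizeof(short));
    dst_len += src_len;
    if (add_silence) {
        memset(dst + dst_len, 0, SILENCE_SAMPLES * sizeof(short));
        dst_len += SILENCE_SAMPLES;
    }
    return dst_len;
}
```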

4. Creation Of Speech Corpus

The synthesis quality is inherently bound to the speech database from which the units are selected. The design and development of a speech corpus balanced in terms of coverage and prosodic diversity of all the units is one of the major issues in such methods [8] [9] [10]. The set of selected sentences was recorded in a quiet room with a noise-cancelling microphone using the recording facilities of a typical multimedia computer system. The recorded speech was then manually segmented at syllable boundaries using the Praat tool for accuracy. Annotation of the recorded sound file is done using the TextGrid object. A TextGrid is an annotation/labeling object with one or more layers or tiers; each tier can be either an interval tier (contiguous labeled intervals) or a point tier (labeled points). The wave signal consists of variations, each corresponding to a speech utterance or syllable, and the entire wave signal is segmented on this basis. Segmentation must be done carefully to attain accuracy in the speech output.

Fig 2: Segmentation of Recorded Speech using PRAAT
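A minimal data structure in the spirit of a Praat interval tier can make the annotation concrete. The field and function names here are illustrative, not Praat's actual API, and the sample labels below are taken from the namaskAramu example later in the paper.

```c
#include <stddef.h>

/* One labeled interval on an interval tier: the start/end time of a
   segmented syllable in the recording plus its label. */
struct interval {
    double xmin;      /* start time, seconds */
    double xmax;      /* end time, seconds */
    char label[32];   /* syllable text */
};

/* Returns the label of the interval containing time t, or NULL if t
   falls outside every interval on the tier. */
const char *label_at(const struct interval *tier, int n, double t) {
    for (int i = 0; i < n; i++)
        if (t >= tier[i].xmin && t < tier[i].xmax)
            return tier[i].label;
    return NULL;
}
```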


Structure for storing the speech units:

struct syllable {
    char *syllable_name;   /* text of the syllable */
    char *syllable_next;   /* syllable following it in the recording */
    char *syllable_prev;   /* syllable preceding it in the recording */
    int   syllable_pos;    /* position of the syllable in the word */
    int   syllable_id;     /* identifier of the unit in the corpus */
};

5. Implementation

The Telugu speech synthesizer developed uses a concatenative approach with the syllable as the basic unit. The synthesis process is divided into three phases.

5.1. Text Normalization

In practice, input text from books and news articles consists of standard words (whose entries can be found in the pronunciation dictionary) and non-standard words such as initials, digits, symbols and abbreviations. Mapping non-standard words to a set of standard words depends on the context and is a non-trivial problem. For example, the number 120 has to be expanded to నూట ఇరవై (WX notation "nUta iravai"), Rs. 300 to మూడు వందల రూపాయలు ("mUdu vaMxala rUpAyalu"), and Phone: 3005412 to "Ponu nambaru, mUdu sunnA sunnA, Edu nalagu okati reVMdu". Similarly, punctuation characters and their combinations such as >, !, -, $, #, %, /, which may be encountered in ratios, percentages and comparisons, have to be mapped to a set of standard words according to the context. Other such situations include initials, company names, street addresses, titles, and non-native words such as bank, computer, etc.
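The digit-wise reading used for the phone-number example can be sketched as a table lookup. Only the digits whose WX-notation words appear in the paper's own examples are filled in here; the rest of the table would be completed from a full Telugu number lexicon.

```c
#include <string.h>

/* WX-notation words for digits, taken from the paper's examples;
   entries 4..9 are deliberately left NULL in this sketch. */
static const char *wx_digit[10] = {
    "sunnA",   /* 0 */
    "okati",   /* 1 */
    "reVMdu",  /* 2 */
    "mUdu",    /* 3 */
};

/* Writes the space-separated digit-wise expansion of a digit string
   into out, skipping digits not covered by the table. */
void expand_digits(const char *digits, char *out) {
    out[0] = '\0';
    for (const char *p = digits; *p; p++) {
        int d = *p - '0';
        if (d < 0 || d > 9 || wx_digit[d] == NULL) continue;
        if (out[0]) strcat(out, " ");
        strcat(out, wx_digit[d]);
    }
}
```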

5.2. Text Syllabification

The text syllabification function extracts syllables from each normalized word and arranges them in sequence according to Telugu phonological rules. Syllable structure is represented as C*VC* in most Indian languages, such as Telugu, Tamil and Hindi. Syllables in the Telugu language can exist as a vowel alone or take the forms CV, VC, CVC and CCVC.

The syllabification module takes the input in WX notation, identifies all the vowels and consonants, and then groups them into syllables.

Algorithm for Syllabification

- Read the input text, which is in WX notation.
- Label the characters of the normalized text as consonants and vowels using the following rules:
  o Any consonant except (y, H, M) followed by y is a single consonant; label it as C.
  o Any consonant except (y, r, l, lY, lYY) followed by r is taken as a single consonant.
  o Consonants (k, c, t, w, p, g, j, d, x, b, m, R, S, s) followed by l are taken as a single consonant.
  o Consonants (k, c, t, w, p, g, j, d, x, b, R, S, s, r) followed by v are taken as a single consonant.
  o Label the remaining characters as Vowel (V) or Consonant (C) depending on the set to which each belongs.
  o Store the attribute of the word in terms of (C*VC*)* in file1.
- For each word in the normalized text, get its label attribute from file1:
  o If the first character is a C, associate it with the nearest vowel on its right.
  o If the last character is a C, associate it with the nearest vowel on its left.
  o If a sequence corresponds to VV, break it as V-V.
  o The strings separated by "-" are identified as syllable units.
- Repeat for every word.
- Store the text in syllable form in file2 for the synthesis process.
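Under one simplifying reading of the grouping heuristics above (a lone consonant between two vowels starts the next syllable, while in a cluster of consonants the first one closes the previous syllable, and leading/trailing consonants attach to the nearest vowel), the grouping step can be sketched over the C/V label string. This is our interpretation for illustration, not the authors' code.

```c
#include <string.h>

/* Inserts '-' at syllable boundaries in a C/V label string for one word.
   Heuristics: V-CV for a single intervocalic consonant, VC-CV / VC-CCV
   for clusters, V-V for adjacent vowels; edge consonants attach inward.
   Words are assumed shorter than 128 labels. */
void syllabify(const char *labels, char *out) {
    size_t n = strlen(labels);
    int boundary[128] = {0};   /* boundary[i]: break after labels[i] */
    long prev_v = -1;
    if (n >= 128) { out[0] = '\0'; return; }
    for (size_t i = 0; i < n; i++) {
        if (labels[i] != 'V') continue;
        if (prev_v >= 0) {
            size_t k = i - (size_t)prev_v - 1;  /* consonants in between */
            boundary[k <= 1 ? prev_v : prev_v + 1] = 1;
        }
        prev_v = (long)i;
    }
    size_t o = 0;
    for (size_t i = 0; i < n; i++) {
        out[o++] = labels[i];
        if (boundary[i] && i + 1 < n) out[o++] = '-';
    }
    out[o] = '\0';
}
```

Run on the label strings from the worked examples in Section 6, this reproduces CV-CVC-CV-CV-CV for namaskAramu, CV-CVC-CCV for ruXrAksha and CV-CV-CVC for kaMpyutar.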

5.3. Syllable Extraction and Concatenation

This module receives a sequence of syllables arranged according to the raw text. Based on this list, the Syllable Extraction module searches for speech units in the speech corpus. The context of the syllable to be searched is weighted according to its next syllable, previous syllable and position in the word: a matching next context is given a weight of 4, a matching previous context 2, and a matching word position 1.

Algorithm for Syllable Extraction

- Read the syllabified text.
- Compute the weight for each syllable using the following steps:
  o If the search syllable matches the next, previous and position contexts in the speech corpus, it is given a weight of 7.
  o Else, if it matches next and previous but not position, it is given a weight of 6.
  o Else, if it matches next and position but not previous, it is given a weight of 5.
  o Else, if it matches next only, it is given a weight of 4.
  o Else, if it matches previous and position but not next, it is given a weight of 3.
  o Else, if it matches previous only, it is given a weight of 2.
  o Else, if it matches position only, it is given a weight of 1.
- The syllable with the maximum weight is selected for synthesis.
- If the next context of the syllable is a space or an end-of-sentence mark, include a silence unit accordingly.
- Concatenate all the selected units and generate a single wave file containing the utterance of the given text.
- Play the wave file.
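The weighting in the steps above amounts to summing the individual context weights (4 for next, 2 for previous, 1 for position), which can be sketched against the corpus record structure from Section 4. This is an illustrative reading, not the authors' implementation.

```c
#include <string.h>

/* Corpus record, mirroring the struct given in Section 4. */
struct syllable {
    char *syllable_name;
    char *syllable_next;
    char *syllable_prev;
    int   syllable_pos;
    int   syllable_id;
};

/* weight = 4 if the candidate's next-syllable context matches the target,
   + 2 if the previous-syllable context matches, + 1 if the word position
   matches -- so all three matching gives 7, next+previous 6, and so on. */
int unit_weight(const struct syllable *cand,
                const char *next, const char *prev, int pos) {
    int w = 0;
    if (strcmp(cand->syllable_next, next) == 0) w += 4;
    if (strcmp(cand->syllable_prev, prev) == 0) w += 2;
    if (cand->syllable_pos == pos)              w += 1;
    return w;
}

/* Returns the index of the highest-weight candidate (ties keep the first). */
int select_unit(const struct syllable *cands, int n,
                const char *next, const char *prev, int pos) {
    int best = 0, best_w = -1;
    for (int i = 0; i < n; i++) {
        int w = unit_weight(&cands[i], next, prev, pos);
        if (w > best_w) { best_w = w; best = i; }
    }
    return best;
}
```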

6. Results


Fig 3: Front end of the Bhaashika Telugu TTS system

The input text is converted into readable text, and the syllabification module generates the sequence of syllables to be extracted from the speech corpus, concatenated and played. The synthesizer developed segments all words correctly, including loan words in the Telugu vocabulary, although the utterance may not be an exact match.

Fig 4: Text input can be browsed and played

Our syllable segmentation rules have shown acceptable results. The following are a few examples of segmentation for different types of words in the Telugu language.


Telugu origin word: నమస్కారము
Word: namaskAramu (WX representation)
Consonant-Vowel identification: CVCVCCVCVCV
Syllabification: CV-CVC-CV-CV-CV
Synthesizer output: na-mas-kA-ra-mu

Telugu origin word: రుద్రాక్ష
Word: ruXrAksha (WX representation)
Consonant-Vowel identification: CVCVCCCV (<Xr> is taken as one consonant)
Syllabification: CV-CVC-CCV
Synthesizer output: ru-XrAk-sha

English loan word: కంప్యూటర్
Word: kaMpyutar (WX representation)
Consonant-Vowel identification: CVCVCVC (<aM> is taken as one vowel and <py> as one consonant)
Syllabification: CV-CV-CVC
Synthesizer output: kaM-pyu-tar

Segment concatenation has also produced reasonably intelligible synthetic speech. However, distortion still occurs between concatenated segments, so applying waveform modification to handle the distortion is one option for obtaining more natural speech. Based on an informal perceptibility test, it is easy to determine what is being said when the input consists of words and short sentences; spoken text is harder to comprehend when it is paragraph length. This perceptibility difficulty is more significant without a prosody modification algorithm.

7. Conclusions

Bhaashika, a Telugu TTS system using syllables as the basic unit of concatenation, has been presented. The quality of the synthesized speech is reasonably natural. The proposed approach minimizes coarticulation effects and prosodic mismatch between adjacent units. Selection of a unit is based on the preceding and succeeding contexts of the syllable and on its position in the word: the system implements a matching function which assigns a weight to each candidate syllable, and the syllable with the maximum weight is selected as the output speech unit. We have observed the efficiency of this approach for the Telugu language and found that it performs well.

8. Future Enhancements

Future work will modify the speech parameters, such as pitch, formants and intensity, at the concatenation points. More attention also needs to be paid to the frequency and amplitude range, as well as the duration of each pre-recorded segment, to produce high-quality speech.

References

[1] X. Huang, A. Acero and H.-W. Hon, "Spoken Language Processing: A Guide to Theory, Algorithm and System Development", Prentice Hall, New Jersey, 2001.

[2] Sreekanth Majji and A. G. Ramakrishnan, "Festival Based Maiden TTS System for Tamil Language", 3rd Language & Technology Conference: Human Language Technologies as a Challenge for Computer Science and Linguistics, Poznań, Poland, October 5-7, 2007.

[3] Alan W. Black and Kevin Lenzo, "Building Voices in the Festival Speech Synthesis System", www.festvox.org, 2000.

[4] Alan W. Black and Kevin Lenzo, "Building Synthetic Voices", Carnegie Mellon University, 2003.

[5] A. W. Black and N. Campbell, "Optimising Selection of Units from Speech Databases for Concatenative Synthesis", Proc. Eurospeech '95, Madrid, Spain, September 1995, pp. 581-584.

[6] S. P. Kishore, Rohit Kumar and Rajeev Sangal, "A Data Driven Synthesis Approach for Indian Languages using Syllables as Basic Unit", Proc. Intl. Conf. on NLP (ICON) 2002, Mumbai, India, 2002, pp. 311-316.

[7] A. M. Zeki and N. Azizah, "A Speech Synthesizer for Malay Language", National Conference on Research and Development in Computer Science, Selangor, Malaysia, October 2001.

[8] Alan W. Black and Paul Taylor, "Automatically Clustering Similar Units for Unit Selection in Speech Synthesis", Proc. Eurospeech '97, 2:601-604.

[9] Min Chu, Hu Peng, Hong-yun Yang and Eric Chang, "Selecting Non-Uniform Units from a Very Large Corpus for Concatenative Speech Synthesizer", Proc. ICASSP, Salt Lake City, 2001.
