
[Figure: Text-to-speech conversion. Labels shown: word list, part of speech, phrase parsing; phonetics, intonation, rhythm]

the influences between phonemes or by storing examples of concatenations of phonemes and using them to build up more complex acoustic units. It is at this stage that features like pitch or frequency are incorporated. The synthesis is either articulation-based, rule-based or concatenation-based.

Articulation-based synthesis

The production of speech is based on the physics of sound production and the physiology of speech. Equations of physics describe the radiation of sound waves from the mouth. Unfortunately, this approach is far too costly and complex to be used in practical applications and it leaves many problems still unsolved.

Rule-based synthesis

This approach is based on the dynamic evolution of speech. Typically, the synthesis is done using formants, large concentrations of energy from which it is usually possible to identify phonemes with a high degree of reliability. Formants refer to voiced phonemes only and are very difficult to estimate from speech data. For this reason, a good synthesis requires a significant trial and error process, which does not guarantee a high degree of naturalness unless the right rules are applied. A rule-based synthesizer requires many parameters (up to 60) to be tuned and analyzed and a thorough knowledge of the data to be handled.

However, this method provides the possibility to study speaker-dependent voice characteristics. These can be used to build rules to switch between different synthetic voices in a relatively simple way.
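To make the idea of formant-driven synthesis more concrete, the following minimal sketch generates a voiced, vowel-like sound by passing a pulse train through a cascade of second-order resonators, one per formant. It is only an illustration under stated assumptions: the function names, the formant frequencies and the bandwidths are hypothetical values chosen for the example, and a real rule-based synthesizer tunes many more parameters over time.

```python
import math


def resonator(signal, freq_hz, bandwidth_hz, sample_rate):
    """Filter `signal` through one second-order digital resonator (one formant)."""
    t = 1.0 / sample_rate
    c = -math.exp(-2.0 * math.pi * bandwidth_hz * t)
    b = 2.0 * math.exp(-math.pi * bandwidth_hz * t) * math.cos(2.0 * math.pi * freq_hz * t)
    a = 1.0 - b - c  # normalizes the gain at 0 Hz to 1
    out = []
    y1 = y2 = 0.0
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def synthesize_vowel(formants, pitch_hz=120.0, duration_s=0.2, sample_rate=16000):
    """Excite a cascade of formant resonators with a pulse train (voiced source)."""
    n = int(duration_s * sample_rate)
    period = int(sample_rate / pitch_hz)
    signal = [1.0 if i % period == 0 else 0.0 for i in range(n)]
    for freq_hz, bandwidth_hz in formants:
        signal = resonator(signal, freq_hz, bandwidth_hz, sample_rate)
    return signal


# Assumed (frequency, bandwidth) pairs in Hz for an "ah"-like vowel; illustrative only.
AH_FORMANTS = [(700.0, 80.0), (1200.0, 90.0), (2600.0, 120.0)]
samples = synthesize_vowel(AH_FORMANTS)
```

Changing the pitch or the formant table changes the perceived voice, which hints at how rules can switch between synthetic voices.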

Concatenation-based synthesis

This approach synthesizes speech by concatenating blocks of speech sound stored in a database. The creation of the database plays a vital role in determining the quality of the synthesized speech. The major advantage of this solution is that all acoustic aspects of a real speaker are taken into account. However, because of this high degree of specialization, when a new voice or style needs to be added, the database containing the speech blocks has to be re-segmented and re-analyzed. Also, a new set of recording details is required for each type of addition: whether it is a male or female voice, whether the speech is fast or slow, or whether the voice is young or age-marked.
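The basic mechanism can be sketched as a lookup of stored units followed by joining them. The example below is a toy illustration only: the unit names, the sample values and the linear crossfade used to smooth each join are assumptions made for the sketch, not a description of any particular product.

```python
def crossfade_concat(units, overlap):
    """Join lists of samples, linearly crossfading `overlap` samples at each join."""
    if not units:
        return []
    out = list(units[0])
    for unit in units[1:]:
        k = min(overlap, len(out), len(unit))
        start = len(out) - k
        for i in range(k):
            w = (i + 1) / (k + 1)  # weight of the incoming unit
            out[start + i] = (1 - w) * out[start + i] + w * unit[i]
        out.extend(unit[k:])
    return out


def synthesize(unit_names, database, overlap=4):
    """Look up each named unit in the database and concatenate the recordings."""
    missing = [name for name in unit_names if name not in database]
    if missing:
        raise KeyError(f"units not in database: {missing}")
    return crossfade_concat([database[name] for name in unit_names], overlap)


# Toy database: unit name -> recorded samples (made-up numbers).
DATABASE = {
    "h-e": [0.0, 0.1, 0.3, 0.2, 0.1, 0.0],
    "e-l": [0.0, 0.2, 0.4, 0.4, 0.2, 0.1],
    "l-o": [0.1, 0.3, 0.5, 0.3, 0.1, 0.0],
}
speech = synthesize(["h-e", "e-l", "l-o"], DATABASE, overlap=2)
```

The sketch also shows why adding a voice is costly: every entry in the database belongs to one speaker and one speaking style, so a new voice means a new database.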

3.3.2 Quality assessment for TTS

Synthesized speech does not yet match the natural human voice, but how close can it get? An increasing number of TTS products are now available off the shelf, each with its own features.

Chapter 3. Overview of speech technology

But what does quality mean in this context? At first glance, quality can be associated with the following characteristics:

1. Naturalness

2. Intelligibility

3. Pleasantness

A good TTS engine needs to produce speech that is as close as possible to a true human voice, while keeping speech understandable, clear and pleasant.

TTS engines can usually model both female and male voices. Female voices are often found to be clearer than their male counterparts, although other studies report male voices as being more intelligible because their fundamental frequency and formant ranges are more suitable for telephone conversations. Furthermore, the modeling adopted for TTS with female voices produces less robotic and more pleasant speech.

When evaluating the quality of synthetic speech, the following elements are typically taken into account:

• segmental rendering

• stress, rhythm and intonation accuracy

• variability of speaking rate

• control of intonation

• voice quality

• dialectal variation

However, on what basis is it possible to claim that a given TTS product performs better than another? Is it because the sound is better or because there are fewer mistakes? Comparing different TTS methodologies is not easy because many different components contribute to the synthesis. For example, a TTS engine may incorporate language-specific knowledge such as accents or dialects, and an artificial voice can be listened to in a variety of environments, such as over the telephone or through a speaker in a car. These settings affect the subjective perception of speech.

Hence, if quality is evaluated by measuring human perception, the assessment may be impaired by the listener’s capacity to understand what is read out or by tiredness (when multiple products are tested in sequence). Researchers have looked into possible objective methods of assessing synthesized speech, for example the use of resynthesized speech (a method of coding natural speech into parameters that can be used by a synthesizer). In this way, a comparison between the natural waveform and the artificial waveform generated by the TTS engine gives an estimate of the degree of naturalness of the synthetic voice.

The concept of naturalness has so far been used in an intuitive manner, but it has not yet been properly defined. What can be perceived as natural, especially when an objective evaluation is required? The answer is based on phonetic theories stating that a speaker usually adapts to a listener in order to minimize the cognitive load during perception. This means that the speaker is able to pick up signs of fatigue, comprehension problems or other behaviors and adjust the speech accordingly. Although this phenomenon can be identified in human speakers, it is not yet available for synthesized voices. Hence, naturalness is compromised to some extent by the inability to adapt to the listener. Research in this area is still ongoing, but it is possible to measure the extent to which an artificial voice emulates the characteristics of a natural one. These characteristics have been modeled and parametrized using human waveforms and are compared against the synthesized ones.
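As a rough illustration of comparing parametrized characteristics, the sketch below measures how far a synthetic pitch contour deviates from a natural one. The choice of pitch (fundamental frequency per frame) as the parameter, the convention that 0 marks an unvoiced frame, and the sample values are all assumptions made for the example; the text does not specify which parameters or distance measures are actually used.

```python
import math


def contour_rmse(natural_f0, synthetic_f0):
    """Root-mean-square error in Hz over frames voiced (non-zero) in both contours.

    Returns None when the contours share no voiced frame.
    """
    if len(natural_f0) != len(synthetic_f0):
        raise ValueError("contours must have the same number of frames")
    pairs = [(n, s) for n, s in zip(natural_f0, synthetic_f0) if n > 0 and s > 0]
    if not pairs:
        return None
    return math.sqrt(sum((n - s) ** 2 for n, s in pairs) / len(pairs))


# Made-up contours, one value per frame; 0 means the frame is unvoiced.
natural = [0, 120, 130, 140, 0]
synthetic = [0, 118, 130, 146, 110]
deviation = contour_rmse(natural, synthetic)
```

A smaller deviation suggests that the synthetic voice follows the natural contour more closely on this one parameter; it says nothing about the adaptation to the listener discussed above.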

Quality testing of TTS products still lacks a solid foundation because the speaker/listener interaction is not taken into account when the speech production is evaluated.

3.4 IBM ViaVoice

IBM ViaVoice is a family of products providing both ASR and TTS capabilities.

The speech recognition engine supports dictionaries containing approximately 200,000 words and is equipped with tools to generate customized grammars and user dictionaries.

The TTS engine is a rule-based speech synthesizer.

3.4.1 Multilingual support for ViaVoice

The IBM ViaVoice TTS engine is capable of synthesizing speech in different languages. However, a given engine is only capable of producing speech for a single language. Table 3-3 shows the languages supported by IBM ViaVoice on each platform where the engine can be installed. ViaVoice can be used across different platforms, although many languages are not yet available on the market.

Table 3-3 Language support for IBM ViaVoice

Language      WIN NT/95/98   AIX       Solaris   Linux
US English    Yes            Yes       Beta      Yes
UK English    Yes            Yes       Beta      Planned
French        Yes            Yes       Beta      Planned
German        Yes            Yes       Beta      Planned
Italian       Yes            Planned   Planned   Planned
Spanish       Yes            Planned   Planned   Planned
Portuguese    Yes            n/a       n/a       Planned
Japanese      Yes*           Planned   n/a       Planned
Chinese       Yes*           Planned   n/a       Planned
Finnish       Planned        n/a       n/a       n/a

*not all dialects of Japanese and Chinese are currently supported

3.4.2 ViaVoice limitations

Although it is possible to find products with multilingual support, a given ViaVoice engine is only capable of synthesizing speech for a single language. Multilingual support is still possible, but it requires some form of logic to switch between several TTS engines, each of which supports a different language.

ViaVoice, like other TTS engines, still lacks a high degree of naturalness in some situations, especially because of the limited prosody associated with the text.

Unfortunately, intonation and rhythm are affected by the sentence parsing process, which is itself dependent on the original text. So, for example, the list in Table 3-4 is not rendered as a list when spoken by a TTS engine. In fact, the parsing neither identifies the logical sentences that reflect the intention of the original writer, nor correctly expresses conventional symbols like “1/5” (one fifth), which is spelled out instead.

Table 3-4 Example of parsing error in ViaVoice TTS

Original text                            ViaVoice TTS parsed sentences
1. Open a new command window             1.Open a new command window
2. Resize it to be 1/5 of your screen    2.Resize it to be 1 slash 5 of your screen
3. cd to \temp                           3. CD to slash temp
4. Type setup.exe                        4.Type setup.exe

This does not impair the overall quality of the product, since the limitations are determined by the need for further development in speech synthesis, as mentioned in 3.6, “Future development” on page 59.
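One way an application could reduce the effect of the parsing problem shown in Table 3-4 is to normalize the text before handing it to the TTS engine. The following is a minimal sketch under that assumption; it is not part of ViaVoice, and the function names and the small fraction vocabulary are hypothetical. It splits a numbered list into separate sentences and expands simple fractions into words.

```python
import re

# Small illustrative vocabularies; a real normalizer would cover far more cases.
NUMERATORS = {1: "one", 2: "two", 3: "three", 4: "four"}
DENOMINATORS = {2: "half", 3: "third", 4: "quarter", 5: "fifth"}


def expand_fraction(match):
    """Turn a matched 'n/d' into words, or leave it unchanged if unknown."""
    num, den = int(match.group(1)), int(match.group(2))
    if num not in NUMERATORS or den not in DENOMINATORS:
        return match.group(0)
    word = DENOMINATORS[den]
    if num > 1:
        word = "halves" if den == 2 else word + "s"
    return f"{NUMERATORS[num]} {word}"


def normalize(text):
    """Expand fractions, then split a numbered list into one sentence per item."""
    text = re.sub(r"\b(\d+)/(\d+)\b", expand_fraction, text)
    items = re.split(r"(?:^|\s+)\d+\.\s*(?=\S)", text)
    return [item.strip() + "." for item in items if item.strip()]


# Raw string so that the backslash in the path is kept literally.
original = (r"1. Open a new command window 2. Resize it to be 1/5 of your "
            r"screen 3. cd to \temp 4. Type setup.exe")
sentences = normalize(original)
```

Each returned sentence can then be sent to the engine separately, so that the pauses and intonation follow the structure of the list. Path names and commands would need further, application-specific rules that this sketch does not attempt.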

We mentioned earlier that the ViaVoice TTS engine is a rule-based, formant synthesizer and that concatenation-based synthesis is the main practical alternative.

Hence, ViaVoice does not yet offer a concatenation-based solution that would allow potential customers to compare performance and select the most suitable approach for their application domain. A concatenation-based synthesizer is planned, however, and is currently under development and testing.
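The single-language limitation described in this section means that a multilingual application needs its own logic to route text to the right engine. The sketch below shows one possible shape for that logic. The `StubEngine` class and its `speak` method are hypothetical stand-ins, not the ViaVoice API; a real application would wrap the actual engine calls behind a similar interface.

```python
class StubEngine:
    """Stand-in for a single-language TTS engine (hypothetical interface)."""

    def __init__(self, language):
        self.language = language

    def speak(self, text):
        # A real engine would produce audio; the stub returns a tagged string.
        return f"[{self.language}] {text}"


class MultilingualTTS:
    """Routes each request to the engine registered for its language."""

    def __init__(self, engines, default_language):
        if default_language not in engines:
            raise ValueError("default language has no engine")
        self._engines = dict(engines)
        self._default = default_language

    def speak(self, text, language=None):
        # Fall back to the default engine when the language is unknown.
        engine = self._engines.get(language, self._engines[self._default])
        return engine.speak(text)


tts = MultilingualTTS(
    {"en-US": StubEngine("en-US"), "fr": StubEngine("fr")},
    default_language="en-US",
)
greeting = tts.speak("Bonjour", language="fr")
fallback = tts.speak("Hello", language="fi")
```

Falling back to a default engine is a design choice made for the sketch; an application could equally reject text in a language for which no engine is installed.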

3.5 Examples of voice-enabled applications

This section describes potential applications of both ASR and TTS. The list is not meant to be complete and is provided for guidance only. References to specific products are included for completeness only.

3.5.1 Speech recognition applications

This section discusses some applications built on speech recognition.