Future development - Voice-activated applications

3.6 Future development

Although speech recognition research has made considerable progress in recent years, much remains to be done. The algorithms currently in use fail in a variety of situations, from a simple alteration of the speaker's voice to a change of environment.

Most efforts will therefore be directed towards improving recognition performance.

Portability

Current speech recognition works very well in a given context, but performance degrades significantly when the situation changes. Normally, one would collect new data and retrain the system. However, this is an expensive and time-consuming procedure, and it is not always worth the effort when the context may change again. One possible way to improve portability is to increase the size of the vocabulary and the number of rules in the grammar. However, regardless of the size of these databases, there is always a chance of encountering an unpredicted context, and the problem then resurfaces. Introducing some form of dynamic update for both the grammar and the vocabulary may be the answer, but other questions need to be addressed, such as deciding when and how the update should be done.

Adaptation

This refers to the ability of the recognition process to adapt to new conditions, such as a change in the hardware used to input speech. The more constraints one can put on the incoming speech, the better the performance that can be achieved. However, this approach makes the system less flexible, which may turn out to be a drawback. It is not unusual to upgrade devices such as microphones or telephones (they may also simply need to be replaced due to breakage), but such a simple action can have a serious impact on recognition. Some characteristics of the device used to input speech are included in the acoustic model to improve performance, and these features are often device-dependent. Hence, when the device changes, the model is no longer a valid description of the underlying environment, and performance suffers.
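One widely used way to reduce this device dependence, not described in this section and shown here only as an illustration, is cepstral mean normalization. It relies on the fact that a fixed channel (for example, a particular microphone) appears as a roughly constant offset added to every cepstral feature vector, so subtracting the per-utterance mean removes it. The sketch below assumes the features are already available as plain lists of floats; the function name and the sample values are hypothetical.

```python
def cepstral_mean_normalize(frames):
    """Subtract the per-dimension mean over all frames of one utterance.

    frames: list of feature vectors (lists of floats), one per analysis window.
    Assumption: a fixed input device adds the same offset to every frame.
    """
    if not frames:
        return []
    dims = len(frames[0])
    count = len(frames)
    # Mean of each feature dimension across the utterance.
    means = [sum(frame[d] for frame in frames) / count for d in range(dims)]
    # Removing the mean also removes any constant, device-specific offset.
    return [[frame[d] - means[d] for d in range(dims)] for frame in frames]


# Hypothetical example: the same utterance captured through two devices.
device_a = [[1.0, 2.0], [3.0, 6.0], [5.0, 1.0]]
channel_offset = [0.5, -2.0]  # made-up constant difference between devices
device_b = [[v + o for v, o in zip(frame, channel_offset)] for frame in device_a]

norm_a = cepstral_mean_normalize(device_a)
norm_b = cepstral_mean_normalize(device_b)
```

After normalization the two sets of features coincide, so a model trained on one device sees the same input from the other. This only compensates for a constant offset; differences that vary over time or with frequency content still require adapting or retraining the model.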

Out-of-vocabulary words

As described in Section 3.2.1, “System architecture” on page 39, a vocabulary contains all the words that a voice recognition engine should identify. However, it is not possible to guarantee that a speaker will never use a word outside the vocabulary, since the speaker may not be aware of what the vocabulary contains.

At present, the recognition process tries to identify the closest match for the input received, because it cannot tell whether a given word belongs to the vocabulary or not. This behavior is undesirable for command and control applications, since a wrong action will be invoked. A good way to solve this problem might be to use minimum threshold levels for the word match. In this case, when a match is very poor, an out-of-vocabulary exception can be raised and a message can be sent to the speaker. The drawback is that, depending on the threshold, words that do belong to the vocabulary might be misclassified as not belonging to it.
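The threshold idea can be sketched as follows. This is a minimal illustration, not how any particular engine works: a real recognizer would compare acoustic match scores, whereas here plain string similarity from Python's standard `difflib` module stands in for the match score. The function name, the vocabulary and the threshold values are all assumptions made for the example.

```python
from difflib import SequenceMatcher


def recognize(heard, vocabulary, threshold=0.75):
    """Return (word, score) for the closest vocabulary entry.

    If even the best match scores below the threshold, the word is treated
    as out-of-vocabulary and (None, score) is returned instead.
    String similarity is a stand-in for an acoustic match score.
    """
    best_word, best_score = None, 0.0
    for word in vocabulary:
        score = SequenceMatcher(None, heard, word).ratio()
        if score > best_score:
            best_word, best_score = word, score
    if best_score < threshold:
        return None, best_score  # reject: no action should be invoked
    return best_word, best_score


VOCABULARY = ["stop", "start", "pause"]  # hypothetical command set

accepted = recognize("stopp", VOCABULARY)                    # close to "stop"
rejected = recognize("banana", VOCABULARY)                   # nothing similar
too_strict = recognize("stopp", VOCABULARY, threshold=0.95)  # false rejection
```

With the default threshold, the slightly distorted input is still mapped to "stop", while the unrelated word is rejected rather than forced onto the nearest command. Raising the threshold to 0.95 rejects the distorted but valid input as well, which is exactly the trade-off described above.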

Robustness

Robustness, in its various aspects, will possibly be one of the most challenging areas in speech recognition in the coming years. Communications are changing very rapidly, and speech engines need to adapt quickly to new media and to the needs of potential users of the system. For example, although speech recognition can be very accurate for certain applications, accuracy can drop noticeably when a non-native speaker is involved. The same is true for native speakers with a strong regional accent. Although our ear can easily be trained to identify words regardless of accent and intonation, a speech recognition system cannot do so yet. In fact, this aspect of robustness is closely linked with the problems of adaptation and out-of-vocabulary words, as regional accents influence the perception of speech. Robustness is also affected by transient interference on telephone conversations and by speech coming from devices with a low signal-to-noise ratio. Hence, this represents another potential area for improvement.

Spontaneous speech

During a normal conversation, people may well sneeze, cough or hesitate before talking; there may also be a conversation between other people in the background. In all these circumstances, speech recognition performs quite poorly because the added noise affects the quality of the actual speech. It is certainly desirable to have a system that can deal with all these conditions easily.

Language modeling

Currently, speech technology is far from being human-like because of the various limitations imposed in order to perform both recognition and synthesis. However, the constant rise in mobile device use is also pushing for wider use of voice technologies. ASR and TTS can offer functionalities that are not available in natural language, such as random access to data, remote data access and sorting. A breakthrough in the use of these resources will only be possible when improved language modeling is available. New models should be able to remove much of the speech variability caused by accents and external noise (for ASR), as well as synthesize voice using appropriate intonation and pauses (for TTS). Statistical models are not able to capture all speech features, and the use of other techniques (especially for prosody) could be a successful approach to improving the accuracy of both ASR and TTS.

Dynamics modeling

ASR and TTS normally treat the information contained in window frames as is, without dynamics. However, speech is very dynamic, and this form of variability needs to be taken into account.
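One common way to bring some dynamics into frame-based features, offered here as an illustration rather than as the approach this section has in mind, is to append delta coefficients: for each frame, an estimate of how fast each feature is changing across the neighboring frames. The sketch below uses the usual regression formula over a window of `n` frames on each side and repeats the first and last frame at the edges; the function name and the sample data are hypothetical.

```python
def delta(frames, n=2):
    """Estimate the rate of change of each feature at each frame.

    frames: list of feature vectors (lists of floats), one per window frame.
    n: number of neighboring frames used on each side.
    Edges are handled by repeating the first and last frame.
    """
    denom = 2 * sum(k * k for k in range(1, n + 1))
    last = len(frames) - 1
    result = []
    for t in range(len(frames)):
        row = []
        for d in range(len(frames[t])):
            acc = 0.0
            for k in range(1, n + 1):
                after = frames[min(t + k, last)][d]
                before = frames[max(t - k, 0)][d]
                # Differences further from the center frame get more weight.
                acc += k * (after - before)
            row.append(acc / denom)
        result.append(row)
    return result


# Hypothetical one-dimensional feature that rises by 1.0 per frame.
ramp = [[0.0], [1.0], [2.0], [3.0], [4.0]]
ramp_delta = delta(ramp)

# A feature that never changes has no dynamics at all.
flat_delta = delta([[5.0, 5.0]] * 4)
```

For the steadily rising feature, the delta at the middle frame equals the slope of 1.0 per frame, and for the constant feature every delta is zero. The static vector and its deltas are typically concatenated so that the model sees both the value and its trend.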

Prosody

As mentioned earlier in this chapter, prosody provides information about intonation, rhythm, etc. In TTS, prosody is significantly affected by the use of punctuation. When no punctuation is found in a long sentence, a human speaker naturally breaks the words into shorter phrases. An artificial system, however, is not able to perform this task because it cannot associate a meaning with the words being spoken. Consequently, sentences are identified by proper punctuation marks (such as full stops, question marks and exclamation marks), as a TTS engine is unable to place pauses autonomously.
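The dependence on punctuation can be illustrated with a small sketch. It assumes a hypothetical preprocessing step that splits the input text at full stops, question marks and exclamation marks, and treats each resulting chunk as one unit after which a pause is placed. It does not reflect the internals of any specific TTS engine.

```python
import re


def split_into_sentences(text):
    """Split text at full stops, question marks and exclamation marks.

    Each returned chunk keeps its closing punctuation. Text without any
    of these marks comes back as a single chunk, however long it is.
    """
    chunks = re.findall(r"[^.!?]+[.!?]*", text)
    return [chunk.strip() for chunk in chunks if chunk.strip()]


punctuated = split_into_sentences("Turn left. Are you sure? Stop!")
unpunctuated = split_into_sentences(
    "turn left then go straight on and stop at the second traffic light"
)
```

The punctuated input yields three units, each followed by a natural place for a pause. The unpunctuated input yields a single long unit, so a system that relies only on punctuation has nowhere to pause, whereas a human reader would break it into phrases based on meaning. A splitter this simple is also fooled by abbreviations such as "Dr.", which is one more reason punctuation alone is a fragile cue.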

Prosodic information is currently neither available for synthesized speech nor captured during recognition, although it would notably improve naturalness (for speech synthesis) and understanding (for speech recognition). The major problem to be solved is the integration of such information into the overall engine architecture.

In document Mobile Applications with IBM WebSphere Everyplace Access Design and Development (Page 77-80)