In this chapter we showed that computing a reference transcription from experts' annotations of text is a good alternative to deriving a human reference through analysis of spoken versions of the text. The reference transcription we obtained with this approach was used for the evaluation of three state-of-the-art Text-to-Speech systems for Dutch, and of PROS-3 under three conditions.
The evaluation of the TTS systems showed that there are substantial differences between automatically generated prosodic structure and the prosodic structure as assigned by human experts. The analysis shows that the systems make a considerable number of incorrect insertions and omissions. These errors are mainly due to incorrect or insufficient syntactic information about the input text.
Next, we evaluated PROS-3 as such, since it is the algorithm which we will use as a starting point for our attempts at creating an improved module for assignment of prosodic structure. As mentioned before, we learned that incorrect or insufficient syntactic information was the major error-inducing factor for allocation of phrase boundaries and accents. Therefore, we made a first attempt to investigate the effect of proper syntactic information by evaluating PROS-3 in combination with such information (derived through manual correction of the output of the robust parser). Since PROS-3 assigns too many phrase boundaries, we defined a revised phrasing algorithm. We also evaluated PROS-3 in combination with proper syntactic information and the revised phrasing algorithm. These evaluation studies showed that proper syntactic information alone did not improve the allocation of phrase boundaries, but did do so in combination with the revised phrasing algorithm. Proper syntactic information did improve the allocation of accents, and the revised phrasing algorithm did not make a difference here.
A perception experiment in which listeners had to indicate which version of a sentence was the most acceptable showed that they strongly prefer the version based on correct syntactic information in combination with the revised algorithm for prosodic phrasing. Still, there is a slight preference for the version based on the reference transcription. These results support our expectation that improved accentuation can only be appreciated in combination with adequate phrasing.
From the error analysis we learned that the major problem for phrasing is the allocation of boundaries at junctures preceding an attached prepositional phrase (PP). The major problem for accentuation is the allocation of accents on sentence final verbs preceded by a nominal constituent. Therefore, in the following chapters we will focus on finding a method for predicting the status of the PP (i.e. noun or verb attachment) and the status of the nominal constituent that precedes the sentence final verb (i.e. argument or condition). This will be done after a study on computing the perceptual costs of these types of errors.
3 Tolerance for errors
In this chapter we describe two experiments that we performed to investigate the effect of prosodic phrasing and accentuation on the acceptability of synthetic speech. In the first experiment we show that listeners are more tolerant towards an incorrectly omitted phrase boundary than towards an incorrectly inserted boundary at the juncture preceding an attached PP. Thus, we would rather allocate too few boundaries than too many.
This implies that in machine learning experiments we should have a bias for predicting noun attachment, because for noun attached PPs correct phrasing means that no boundary is allocated preceding the PP, whereas for verb attached PPs a boundary is allocated. Furthermore, we show that it is better not to allocate a medium boundary preceding an attached PP. In the second experiment we show that incorrect accent insertions on a sentence final verb are as bad as accent omissions. Thus, we should find an optimum in accent allocation, such that there are as few accent insertions and omissions as possible. This implies that in machine learning experiments we should be as good at predicting arguments as at predicting conditions. Finally, a comparison of two different pitch contours yields no clear evidence of an effect of the shape of the pitch contour.
3.1 Introduction
From the previous chapters we learned that in state-of-the-art Text-to-Speech synthesis systems assignment of prosodic structure (accents and phrase boundaries) is not yet a solved problem. Accents and phrase boundaries are often omitted or allocated in the wrong places. Previous research (e.g. Nooteboom and Kruyt, 1987; Sanderman and Collier, 1997) showed that correct prosodic information helps the listener when processing text, whereas incorrect prosodic structure may impede the listener’s comprehension. This means that it will take more time and effort from the listener when processing speech with incorrect prosodic structure, and in the worst case the listener might not understand the conveyed information correctly.
We introduce the contrast correct versus incorrect prosodic structure. In this context, the notion ‘correct’ means that a boundary or accent is allocated (or not) according to the syntactic structure, whereas ‘incorrect’ means that a boundary or accent is allocated (or not) in contradiction to the syntactic structure. Other contrasts that we refer to in this chapter are insertion versus omission and acceptable versus unacceptable. The notion ‘insertion’ means that a boundary or accent is allocated where there should not be one according to the syntactic structure. The notion ‘omission’ means that a boundary or accent is not allocated where there should have been one according to the syntactic structure. The notion ‘acceptable’ means that listeners approve of the phrasing structure or accentuation structure, whereas ‘unacceptable’ means that the listeners disapprove of it.
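The definitions of insertion and omission can be written down as a small decision function. The following is only a minimal sketch: the function name `classify_allocation` and its boolean parameters are assumptions made for illustration and do not come from the evaluation procedure itself.

```python
def classify_allocation(expected: bool, allocated: bool) -> str:
    """Label one juncture or word.

    expected  -- True if the syntactic structure calls for a boundary/accent
    allocated -- True if a boundary/accent was actually allocated
    (Both parameter names are hypothetical, chosen for this sketch.)
    """
    if allocated == expected:
        return "correct"
    # An element that is present but should not be is an insertion;
    # an element that is absent but should be present is an omission.
    return "insertion" if allocated else "omission"
```

For example, `classify_allocation(expected=False, allocated=True)` labels a boundary placed where the syntactic structure does not license one as an insertion.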
For the assignment of prosodic structure we focus on allocation of prosodic phrase boundaries and accents. Prosodic phrasing indicates which parts of an utterance belong together, syntactically and semantically (Bolinger, 1989; Sanderman, 1996). This information is used by the listener to deduce the relations between the words in the sentence. For instance, adjectives provide information about the status of a noun. When there is a phrase boundary inserted (i.e. incorrectly allocated) between the adjective and the noun, the listener might wrongly conclude that there is no semantic or syntactic relation between the two words.
A similar problem occurs when the accentuation of a sentence is incorrect. Accents provide information about which words the listener should pay attention to. Chafe (1974) states that accents highlight the information that should be at the center of attention of the listener. However, when a word is accented while it should not have been, the attention of the listener is erroneously attracted by that word. In this case the listener might pay less attention to the more important words, and then has to recover the intended information by backtracking through the sentence.
From the evaluation study (described in Chapter 2) we learned that many phrasing errors occur at junctures preceding the PP in [NP PP] or [PP PP] sequences. Therefore, we focus on the allocation of prosodic phrase boundaries at junctures preceding a prepositional phrase in Dutch. We only consider sentences in which the PP is preceded by a nominal phrase or another prepositional phrase. In certain sentences a prosodic phrase boundary can be realized at such a juncture (indicated with [ ] in example 3.1), whereas in other sentences a phrase boundary preceding the PP is incorrect. The appropriateness of a phrase boundary depends on the status of the PP. If the PP is noun attached (as in example 3.1a), a phrase boundary preceding the PP is undesirable. If the PP is verb attached (as in example 3.1b), a phrase boundary preceding the PP is possible, although not mandatory.
(3.1) (a) He accused the president [ ] of the National Bank.
(b) He accused the president [ ] of the bank robbery.
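The dependence of the boundary on the attachment status of the PP can be illustrated with a short sketch. The names `Attachment`, `boundary_allowed` and `render`, and the use of `|` to mark a permitted boundary, are assumptions introduced here for illustration only; the attachment labels are given by hand, not predicted.

```python
from enum import Enum


class Attachment(Enum):
    NOUN = "noun"  # PP modifies the preceding noun (example 3.1a)
    VERB = "verb"  # PP modifies the verb (example 3.1b)


def boundary_allowed(attachment: Attachment) -> bool:
    """A boundary before the PP is possible only for verb attachment."""
    return attachment is Attachment.VERB


def render(head: str, pp: str, attachment: Attachment) -> str:
    """Show the sentence with '|' where a boundary may be realized."""
    separator = " | " if boundary_allowed(attachment) else " "
    return f"{head}{separator}{pp}."


print(render("He accused the president", "of the National Bank", Attachment.NOUN))
print(render("He accused the president", "of the bank robbery", Attachment.VERB))
```

Note that for verb attachment the boundary is permitted rather than required, so `boundary_allowed` returning `True` does not mean that a boundary must be realized.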
Since TTS-systems will always make some errors when performing phrase boundary allocation, we want to find out which type of error is the least problematic for the listener (i.e. the least unacceptable; the type towards which the listener is the most tolerant), so that we can try to shift the number of phrasing errors in the direction of the one that is less problematic. We perform a perception experiment in which subjects are asked to indicate their preference¹ for the sentence with or without a phrase boundary preceding the PP. We assume that the tolerance for errors can be interpreted as an indicator for the perceptual costs of errors: if the tolerance for an error is low (meaning that the acceptability of an utterance is low, due to the error), this error induces high perceptual costs. The results of the perception experiment will be used for the allocation of prosodic phrase boundaries in synthetic speech. Besides, the results will also be used for another part of our research, which is a machine learning experiment to predict whether the PP is noun attached or verb attached (described in Chapter 4).
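One way to picture how perceptual costs could shift errors towards the less problematic type is a cost-sensitive decision threshold. This is a sketch under stated assumptions, not the method used in this research: it assumes a classifier that outputs a probability `p_verb` of verb attachment, and the two cost parameters are hypothetical placeholders rather than measured values.

```python
def boundary_threshold(cost_insertion: float, cost_omission: float) -> float:
    """Probability of verb attachment above which a boundary is placed.

    Placing a boundary risks an insertion with expected cost
    (1 - p_verb) * cost_insertion; leaving it out risks an omission with
    expected cost p_verb * cost_omission. Placing is cheaper when
    p_verb > cost_insertion / (cost_insertion + cost_omission).
    Both costs are hypothetical inputs, not experimental results.
    """
    if cost_insertion < 0 or cost_omission < 0:
        raise ValueError("costs must be non-negative")
    total = cost_insertion + cost_omission
    if total == 0:
        raise ValueError("at least one cost must be positive")
    return cost_insertion / total


def place_boundary(p_verb: float, cost_insertion: float, cost_omission: float) -> bool:
    """Place a boundary only if that minimizes the expected perceptual cost."""
    return p_verb > boundary_threshold(cost_insertion, cost_omission)
```

With equal costs the threshold is 0.5. If insertions are assumed to be three times as costly as omissions, the threshold rises to 0.75, so fewer boundaries are placed and the remaining errors shift towards omissions.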
From the evaluation study for Dutch we also learned that many accentuation errors are made on the sentence final verb phrase. These errors are due to the lack of information about the status of the nominal constituent preceding the sentence final verb. This nominal constituent can be either a condition or an argument to the verb. Linguistic investigations (Gussenhoven, 1984; Baart, 1987; Marsi, 2001) showed that conditions can be left out and are not subcategorized for by the verb. All other constituents are arguments. We argue that the sentence final verb should be accented in sentences with a condition preceding it, whereas the verb should not be accented in sentences with an argument preceding the verb. This means that the appropriateness of accentuation of the verb depends on the status of the nominal constituent preceding the verb.
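The accentuation rule argued for above can be summarized in a few lines. This is a minimal sketch; the names `ConstituentStatus` and `accent_final_verb` are assumptions for illustration, and the status of the nominal constituent is taken as given rather than predicted.

```python
from enum import Enum


class ConstituentStatus(Enum):
    ARGUMENT = "argument"    # subcategorized for by the verb
    CONDITION = "condition"  # can be left out, not subcategorized for


def accent_final_verb(status: ConstituentStatus) -> bool:
    """Accent the sentence final verb only when a condition precedes it."""
    return status is ConstituentStatus.CONDITION
```

Under this rule, accenting the verb after an argument counts as an accent insertion, and leaving it unaccented after a condition counts as an accent omission.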
As for allocation of phrase boundaries, TTS-systems will always make some errors when allocating accents. Therefore, we also want to find out whether an inserted accent or an omitted accent is less problematic for the listener. This way we can try to shift the number of accentuation errors in the direction of the one that is less problematic. We perform a perception experiment in which subjects are asked to indicate their preference for either the sentence with or without an accented sentence final verb. The results of this study will also be used for another part of our research, which is a machine learning experiment to predict the status of the nominal constituent preceding the sentence final verb (described in Chapter 5).
¹ Note that the notion ‘preferred sentence’ is used here for the version that is most appreciated by the listeners. When they have to choose between two incorrect versions, the one that is most appreciated will be referred to as the preferred version.
In section 3.2, we describe the perception experiment on the allocation of prosodic phrase boundaries at junctures preceding the PP. In section 3.3, we describe the perception experiment on the accentuation of sentence final verbs. In section 3.4, we discuss what the results of these experiments imply for further experiments and for the allocation of prosodic structure in synthetic speech.