Chapter 5: Transform Synthesis
5.6 An Alternative Network Configuration
In previous sections, the new analysis step of transform synthesis was presented, where formant amplitude control signals were optimised for a particular synthesiser via a transformational step.
The transformations were performed using five MLP networks, one for each synthesiser amplitude control signal. Each of the five networks accepted as input a subset of the synthesiser control signals and output a single transformed control signal.
Initially, five networks were chosen rather than one in order to reduce the amount of training data required, since small networks can typically be trained adequately with a comparatively small body of data.
An alternative, and possibly preferable, approach would be to construct a single network which accepted as input all the synthesiser control signals and output all five transformed amplitude control signals. Such a network would be preferred for the following reasons:
• A single network would be faster and easier to manipulate than five smaller networks.
• As all control signals are used within one network, the potential for more complex associations between inputs, and hence a better transform, is created.
However, there are disadvantages:
• Practically, it may be impossible to train a large network adequately with only a limited amount of training data.
• Choosing the appropriate network size and configuration for the task is a non-trivial problem. Poor choices in the network design may lead to inappropriate generalisations, where idiosyncrasies in the training data are highlighted, reducing the performance of the network.
[Figure: the ten control signals from each of four frames (at times N+1, N, N-1 and N-2) are presented to the input of the network, which outputs the transformed amplitude control signals ALF, A1, A2, A3 and AHF.]
Fig. 5.21: Schematic of single network configuration
This section presents some initial results from two experiments using a single network configuration. The aim of the first experiment was to investigate how well a large network would perform on unseen data when trained on a small set of good quality data. The aim of the second experiment was to investigate whether a single large network, trained on a comparatively large training set, would produce the same quality of synthesis as the set of five smaller networks.
Fig. 5.21 shows schematically the structure of the single network, the details of which are given in table 5.6. The network consisted of a single hidden layer of sigmoidal units and an output layer of linear units.
Control signals presented to input units: F0, V, ALF, F1, A1, F2, A2, F3, A3, AHF
No. of input units: 40
No. of hidden units: 10
No. of output units: 5
Output control signals: ALF, A1, A2, A3, AHF
Table 5.6: Configuration of single net
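As an illustration only, the configuration in Fig. 5.21 and table 5.6 can be sketched in a few lines of Python. The sketch assumes the table is read as 40 input, 10 hidden and 5 output units, and that the 40 inputs are the ten control signals of frames N+1, N, N-1 and N-2 concatenated in that order. The function names, the clamping of frame indices at the ends of an utterance, the bias terms and the weight initialisation range are all hypothetical choices and are not taken from the original implementation.

```python
import math
import random

SIGNALS = ["F0", "V", "ALF", "F1", "A1", "F2", "A2", "F3", "A3", "AHF"]
FRAME_OFFSETS = [1, 0, -1, -2]      # frames N+1, N, N-1, N-2 (order assumed)
N_IN, N_HID, N_OUT = 40, 10, 5      # assumed reading of table 5.6


def build_input(frames, n):
    """Concatenate the four context frames around frame n into one vector."""
    vec = []
    for off in FRAME_OFFSETS:
        # Assumption: indices are clamped at the start and end of the utterance.
        idx = min(max(n + off, 0), len(frames) - 1)
        vec.extend(frames[idx])
    return vec


def init_layer(n_in, n_out, rng):
    """Each row holds n_in weights followed by one bias (hypothetical layout)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_in + 1)]
            for _ in range(n_out)]


def forward(x, w_hid, w_out):
    """Sigmoidal hidden layer followed by a linear output layer."""
    hid = []
    for row in w_hid:
        # zip stops at len(x), so the trailing bias is added separately.
        net = sum(w * v for w, v in zip(row, x)) + row[-1]
        hid.append(1.0 / (1.0 + math.exp(-net)))
    out = [sum(w * h for w, h in zip(row, hid)) + row[-1] for row in w_out]
    return hid, out


# Usage with made-up frame data (six frames of ten control signals each).
rng = random.Random(0)
frames = [[rng.random() for _ in SIGNALS] for _ in range(6)]
w_hid = init_layer(N_IN, N_HID, rng)
w_out = init_layer(N_HID, N_OUT, rng)
x = build_input(frames, 2)
hid, out = forward(x, w_hid, w_out)
```

The five values in `out` would correspond to the transformed ALF, A1, A2, A3 and AHF signals for frame N. Any scaling of the input or output signals is omitted here because the text does not describe it.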
5.6.1 The Training Data
As mentioned in the previous section, two data sets were used for training. Set (a) consisted of a comparatively small number of very good quality phrases, giving a total of 952 pattern vectors. The phrases are tabulated in table D 4, appendix D, section D2.2. Set (b) consisted of a much larger number of phrases, giving a total of 4192 pattern vectors. Details of this phrase set are also given in appendix D, section D2.2, and the phrases are tabulated in table D 5.
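To put the two training set sizes in context, the following rough calculation compares the number of pattern vectors with the number of free parameters in the network. It assumes that table 5.6 is read as 40 input, 10 hidden and 5 output units and that every hidden and output unit has one bias weight. Neither assumption is confirmed by the text, so the figures are indicative only.

```python
N_IN, N_HID, N_OUT = 40, 10, 5  # assumed reading of table 5.6


def count_weights(n_in, n_hid, n_out):
    """Weights plus one bias per unit for a single-hidden-layer network."""
    return (n_in + 1) * n_hid + (n_hid + 1) * n_out


n_weights = count_weights(N_IN, N_HID, N_OUT)
ratios = {}
for name, n_patterns in (("a", 952), ("b", 4192)):
    # Pattern vectors available per free parameter in the network.
    ratios[name] = n_patterns / n_weights
    print(f"set ({name}): {n_patterns} patterns, "
          f"{ratios[name]:.1f} patterns per weight")
```

Under these assumptions the network has 465 free parameters, so set (a) supplies about two pattern vectors per weight and set (b) about nine.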
5.6.2 Training the Single Network
For the first experiment the training was as follows. The complete data set was presented to the network 5000 times, and the network weights were updated after every pattern vector presentation. The learning rate η was set to η = 0.1 and the momentum term α was set to α = 0.9.
For the second experiment the training was as follows. The complete data set was presented to the network 2000 times. The network weights were updated after every pattern vector presentation. However, the learning rate was increased from η = 0.1 to η = 0.5 and the momentum term reduced from α = 0.9 to α = 0.5.
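The training regime described above (weights updated after every pattern vector, with learning rate η and momentum α) can be sketched as follows. This is a minimal illustration that assumes the standard back-propagation update with momentum, Δw(t) = -η ∂E/∂w + α Δw(t-1), and a squared-error loss. The text does not state the error function, the weight initialisation or the treatment of biases, so those details, the function names and the toy data are all hypothetical.

```python
import math
import random


def make_layer(n_in, n_out, rng):
    # Each row: n_in weights followed by one bias (bias handling is assumed).
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_in + 1)]
            for _ in range(n_out)]


def forward(x, w_hid, w_out):
    xb = x + [1.0]                               # append constant bias input
    hid = [1.0 / (1.0 + math.exp(-sum(w * v for w, v in zip(row, xb))))
           for row in w_hid]                     # sigmoidal hidden units
    hb = hid + [1.0]
    out = [sum(w * h for w, h in zip(row, hb)) for row in w_out]  # linear
    return xb, hb, out


def train_online(patterns, w_hid, w_out, epochs, eta, alpha):
    """Per-pattern gradient descent with momentum (squared error assumed)."""
    dw_hid = [[0.0] * len(r) for r in w_hid]
    dw_out = [[0.0] * len(r) for r in w_out]
    for _ in range(epochs):                      # one pass over the data set
        for x, target in patterns:
            xb, hb, out = forward(x, w_hid, w_out)
            # Linear outputs: the delta is simply the output error.
            d_out = [o - t for o, t in zip(out, target)]
            # Hidden deltas use the output weights from before this update.
            d_hid = [hb[j] * (1.0 - hb[j])
                     * sum(d_out[k] * w_out[k][j] for k in range(len(w_out)))
                     for j in range(len(w_hid))]
            # Weights are changed immediately after each pattern vector.
            for k, row in enumerate(w_out):
                for j in range(len(row)):
                    dw_out[k][j] = -eta * d_out[k] * hb[j] + alpha * dw_out[k][j]
                    row[j] += dw_out[k][j]
            for j, row in enumerate(w_hid):
                for i in range(len(row)):
                    dw_hid[j][i] = -eta * d_hid[j] * xb[i] + alpha * dw_hid[j][i]
                    row[i] += dw_hid[j][i]


def mean_sq_error(patterns, w_hid, w_out):
    total = 0.0
    for x, target in patterns:
        out = forward(x, w_hid, w_out)[2]
        total += sum((o - t) ** 2 for o, t in zip(out, target))
    return total / len(patterns)


# Toy usage: a 2-3-1 network on made-up data, with the settings of the
# first experiment (eta = 0.1, alpha = 0.9) but far fewer passes.
rng = random.Random(1)
toy = [([0.0, 0.0], [0.0]), ([0.0, 1.0], [0.5]),
       ([1.0, 0.0], [0.5]), ([1.0, 1.0], [1.0])]
w_hid = make_layer(2, 3, rng)
w_out = make_layer(3, 1, rng)
before = mean_sq_error(toy, w_hid, w_out)
train_online(toy, w_hid, w_out, epochs=200, eta=0.1, alpha=0.9)
after = mean_sq_error(toy, w_hid, w_out)
```

For the second experiment the same function would be called with `epochs=2000, eta=0.5, alpha=0.5`. The larger learning rate is partly offset by the smaller momentum term, since the effective step size of such an update scales roughly with η / (1 - α).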
5.6.3 Initial Results
The experimental results for the single network trained on data set (a) were as expected. The network reached an acceptable minimum in the error surface, giving an average error of 2.8 dB per control signal on the training data. However, when presented with unseen data the network performed poorly. The synthetic speech quality was judged, informally, to be poor.
Initial results for the single network trained using the large data set (b) are encouraging. The results of informal comparison listening, using unseen data presented to the single network and to the five small networks, suggest that there was little difference between the two approaches. However, a more rigorous comparison of the two approaches must be undertaken before clear conclusions can be drawn.
5.7 Conclusions
This chapter has presented an extra analysis step in the synthesis-by-analysis process, called transform synthesis, which was designed to correct for possible amplitude errors in the AAM analysis stage and for idiosyncrasies in the design of the synthesiser used. It was shown that:
• Multi-layer perceptrons represent an efficient and appropriate method of implementing this analysis step. Once trained, the computational overhead of performing the transform synthesis step is very small¹.
• The extra analysis step significantly improved the quality of synthetic speech produced using AAM alone.
• The quality of synthetic speech produced by this analysis was judged, in three out of six cases, to be identical to that of the best currently available method of synthesis-by-analysis, but at a much reduced computational cost.
¹ The whole synthesis-by-analysis process takes approximately 90 seconds to analyse 2 seconds of speech, running on a MicroVAX 3600.