Combined Recognition Methods - Robust speech recognition with spectrogram factorisation

In ASR, there are standardised processing chains like GMM-HMM recognition with cepstral features, which are readily available in software implementations [213]. Despite their shortcomings in some applications like robust recognition, they have been fine-tuned over years to cover several algorithmic stages like language modelling, feature extraction, statistical modelling and so on. Reinventing the whole toolset would be a daunting task and usually not even necessary. Therefore new paradigms are often tested with replacement or combination of new and established system components. This also applies to NMF-based processing, which may act in multiple roles ranging from plain front-end enhancement to direct state likelihood estimation or word spotting. This section illustrates a few approaches proposed for joint recognition with NMF as one of the components involved in more complex systems.

The terminology of alternative modelling methods can be derived from parallels in artificial neural network (ANN) systems, especially multi-layer perceptrons (MLPs), which started to emerge for ASR in the 1980s [100], and then have repeatedly appeared in literature since the 1990s [19, 116, 141], gradually evolving into deep neural networks (DNNs) [67]. There are two major branches of systems employing MLPs in ASR. In a hybrid approach, an ANN is trained to produce direct posterior likelihoods for HMM states, thus replacing the whole statistical modelling and evaluation [5]. In tandem systems, the ANN outputs are modelled statistically with GMMs, hence acting as features instead of e.g. mel-cepstra [64] but not producing direct state likelihoods. Further variants, comparisons and insights into these major routes are provided in later work [19, 176].
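The difference between the two branches can be sketched in a few lines of NumPy. This is a minimal illustration on random toy data, not an implementation from the cited works: the function names are hypothetical, and the logarithm and PCA post-processing in the tandem branch is an assumed, commonly used way of making network outputs better suited for GMM modelling.

```python
import numpy as np

EPS = 1e-10  # guards against log(0); value chosen for illustration


def hybrid_scores(posteriors):
    """Hybrid-style path: network outputs are passed on as log state scores."""
    return np.log(posteriors + EPS)


def tandem_features(posteriors, n_components):
    """Tandem-style path: network outputs become features for a later GMM stage.

    Assumed post-processing: take logs, then decorrelate and reduce
    dimensionality with PCA (computed here via SVD).
    """
    log_post = np.log(posteriors + EPS)
    centred = log_post - log_post.mean(axis=0)
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    return centred @ vt[:n_components].T


# Toy data: 50 frames, 8 states, softmax over random activations.
rng = np.random.default_rng(0)
activations = rng.normal(size=(50, 8))
posteriors = np.exp(activations)
posteriors /= posteriors.sum(axis=1, keepdims=True)

scores = hybrid_scores(posteriors)         # (50, 8), passed directly to decoding
features = tandem_features(posteriors, 3)  # (50, 3), would be modelled with GMMs
```

In the hybrid case the decoder consumes the per-state scores as they are, whereas in the tandem case the projected outputs are only an intermediate representation that still requires a trained statistical model before decoding.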

The plain sparse classification system described in this thesis is essentially hybrid-like, because it produces state likelihoods as its output. However, in [170], the SC output is modelled with GMMs, making the system similar to tandem recognition. These parallels between ANN- and NMF-based single-stream recognition are illustrated with simplified flowcharts in Figure 5.2. The first two paths a) and b) represent conventional GMM evaluation of e.g. MFCC or PLP features, optionally with an enhancement front-end. The middle paths c) and d) correspond to hybrid and tandem recognition with ANNs, respectively. The last two paths e) and f) represent direct sparse classification and statistical modelling of SC outputs.

Figure 5.2: Main components of single-stream recognition paths employing statistical modelling, spectrogram factorisation, and neural networks, starting from spectral features and ultimately producing likelihoods for back-end decoding:
a) conventional GMM system with e.g. mel-cepstral features [133]
b) GMM system with NMF feature enhancement [44, 137]
c) hybrid ANN recognition [5]
d) tandem ANN recognition [64]
e) NMF sparse classification [37, 44]
f) statistical modelling of SC output [170]
[Flowchart blocks: feature transform (MFCC, PLP), statistical model (GMM), NMF FE, NMF SC, ANN, PCA, back-end decoding. Signal types: noisy spectral features, enhanced spectral features, noisy cepstral (etc.) features, enhanced cepstral features, mid-level representation, state likelihoods.]

Single-stream processing in consecutive algorithm steps is not the only option for recognition, though. In [167, 168], a dynamic Bayesian network (DBN) is used to combine SC and MFCC likelihoods. Similarly in [169], estimates from SC and a three-layer MLP are combined by either summing or multiplying the state probabilities to produce the combined posterior probability for decoding. In [40], SC and NMF-enhanced MFCC probabilities are combined with a product rule. In [203, 208], a bi-directional long short-term memory recurrent neural network (BLSTM-RNN) is used in conjunction with NMF-enhanced MFCC-GMMs for probability combination. In [35] and [204], three streams are combined: NMF-enhanced MFCCs, sparse classification, and a BLSTM-RNN. These latter systems, typically computing the product of stream probabilities with exponent weight factors, can be referred to as multi-stream, hybrid-like recognisers. Figure 5.3 shows schematic views of three of these systems, namely [167], [169] and [204]. Other combinations can be illustrated similarly or as subsets of these examples.

Figure 5.3: Recognition paths employing stream combinations:
a) MFCC+SC with DBN combination [167]
b) ANN+SC with sum or product likelihood combination [169]
c) triple-stream MFCC+ANN+SC recognition from enhanced features [35, 204]
[Flowchart combination blocks: DBN, sum (+), product (×).]
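The sum and weighted product rules for combining state probabilities from several streams can be sketched as follows. This is a minimal illustration under stated assumptions rather than the procedure of any cited system: the function names, the toy probabilities and the weights are hypothetical, and the per-frame renormalisation is added here only to make the outputs easy to compare.

```python
import numpy as np


def combine_product(stream_probs, weights, eps=1e-12):
    """Product rule with exponent weights: p(s) ~ prod_k p_k(s) ** w_k.

    Computed in the log domain for numerical stability, then renormalised
    per frame (the renormalisation is an assumption of this sketch).
    """
    log_comb = sum(w * np.log(p + eps) for p, w in zip(stream_probs, weights))
    log_comb = log_comb - log_comb.max(axis=1, keepdims=True)
    comb = np.exp(log_comb)
    return comb / comb.sum(axis=1, keepdims=True)


def combine_sum(stream_probs, weights):
    """Sum rule: weighted average of the stream probabilities per state."""
    comb = sum(w * p for p, w in zip(stream_probs, weights))
    return comb / comb.sum(axis=1, keepdims=True)


# Toy example: one frame, three states, two streams (values are made up).
sc_probs = np.array([[0.7, 0.2, 0.1]])    # e.g. sparse classification stream
gmm_probs = np.array([[0.4, 0.5, 0.1]])   # e.g. MFCC-GMM stream

product = combine_product([sc_probs, gmm_probs], weights=[1.0, 1.0])
average = combine_sum([sc_probs, gmm_probs], weights=[0.5, 0.5])
```

With equal weights the product rule favours states on which both streams agree, while the sum rule is more forgiving when one stream assigns a very low probability to the correct state. In practice the exponent weights would be tuned on development data.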

In these multi-stream experiments, all feature streams have been found complementary; that is, combined evaluation surpasses the recognition rates of its single components even if the FE and SC outputs are derived from the same NMF system. Apart from FE and SC streams, NMF output has also been used for estimating masks in uncertainty and missing data decoding [44, 78, 79]. Meanwhile, deep neural networks have gained a lot of attention in ASR, being employed in ASR applications by major companies and producing state-of-the-art results in recognition of real-world speech [67]. They should be able to provide even more complementary information to joint systems, again improving the overall recognition rate. Yet another path, beyond the scope of this thesis, consists of spatial algorithms, which are likely to become increasingly important as multi-microphone devices gain popularity, and which demonstrably improve the recognition results further [16, 125, 175].

For actual combination of streams, many more algorithms have been proposed in literature than the DBN, sum, and product approaches previously employed in joint systems containing NMF components. Recogniser output voting error reduction (ROVER) uses a variety of voting schemes to find an optimal word transition network from multiple system outputs [32]. Confusion network combination (CNC) is its later extension [29]. BAYCOM stands for Bayesian combination using a decision-theoretic approach that is expected to provide optimal combination weights even for streams with considerably differing error rates [147]. Driven decoding algorithm (DDA) performs dynamic search between a primary system and auxiliary systems or manual transcripts [88]. There is no particular reason preventing the use of NMF-based components in these fusion methods as well, although no examples appear to exist in literature yet.
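To give a flavour of output-level voting, the following is a much simplified sketch of frequency-based voting over word hypotheses. It is not the ROVER algorithm itself: it assumes that the hypotheses have already been aligned into slots of equal length, with an empty string marking a missing word, and it ignores confidence scores. The function name, the NULL marker and the example sentences are all hypothetical.

```python
from collections import Counter

NULL = ""  # hypothetical marker for an empty slot in an aligned hypothesis


def vote_aligned(hypotheses):
    """Pick the most frequent word in each aligned slot.

    Slots won by NULL are dropped from the output. Ties go to the word
    that appears first in the order the hypotheses are listed.
    """
    if len({len(h) for h in hypotheses}) != 1:
        raise ValueError("hypotheses must be pre-aligned to equal length")
    result = []
    for slot in zip(*hypotheses):
        word, _count = Counter(slot).most_common(1)[0]
        if word != NULL:
            result.append(word)
    return result


# Toy outputs from three recognisers, already aligned slot by slot.
system_a = ["turn", "the", "light", "on"]
system_b = ["turn", "a", "light", "on"]
system_c = ["turn", "the", "right", NULL]

voted = vote_aligned([system_a, system_b, system_c])
```

Here each system makes a different error, yet the voted result recovers the word sequence that most systems agree on in every slot, which is the basic motivation for combining recogniser outputs.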

We can conclude that there is a multitude of established and novel recognition paths, NMF-based or not, which provide partially overlapping yet ultimately complementary information for joint recognition. This raises interesting questions on how to incorporate the strengths of different methods in a joint system while minimising the redundancy and computational complexity. Because the best-performing robust systems are currently relatively heavy combinations of multiple methods [3, 183], these questions can be expected to remain highly relevant in the quest for human-like or even superhuman ASR performance.
