Published work - Methods for Addressing Data Diversity in Automatic Speech Recognition

This section lists the peer-reviewed papers published during the PhD studies. The first six publications were introduced in Section 1.3 and contain the main contributions of this thesis. The remaining publications contain auxiliary work related to this thesis.

1. Mortaza Doulaty, Oscar Saz, Thomas Hain, “Data-selective transfer learning for multi-domain speech recognition,” in Proceedings of Interspeech, Dresden, Germany, 2015.

2. Mortaza Doulaty, Oscar Saz, Thomas Hain, “Unsupervised domain discovery using latent Dirichlet allocation for acoustic modelling in speech recognition,” in Proceedings of Interspeech, Dresden, Germany, 2015.

3. Mortaza Doulaty, Oscar Saz, Raymond W. M. Ng, Thomas Hain, “Latent Dirichlet allocation based organisation of broadcast media archives for deep neural network adaptation,” in Proceedings of IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Scottsdale, Arizona, USA, 2015.

4. Mortaza Doulaty, Oscar Saz, Raymond W. M. Ng, Thomas Hain, “Automatic genre and show identification of broadcast media,” in Proceedings of Interspeech, San Francisco, California, USA, 2016.

5. Mortaza Doulaty, Richard Rose, Olivier Siohan, “Automatic optimization of data perturbation distributions for multi-style training in speech recognition,” in Proceedings of IEEE Workshop on Spoken Language Technology (SLT), San Diego, California, USA, 2016.

6. Oscar Saz, Mortaza Doulaty, Thomas Hain, “Background-tracking acoustic features for genre identification of broadcast shows,” in Proceedings of IEEE Workshop on Spoken Language Technology (SLT), Lake Tahoe, Nevada, USA, 2014.

7. Oscar Saz, Mortaza Doulaty, Salil Deena, Rosanna Milner, Raymond W. M. Ng, Madina Hasan, Yulan Liu, Thomas Hain, “The 2015 Sheffield system for transcription of multi-genre broadcast media,” in Proceedings of IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Scottsdale, Arizona, USA, 2015.

8. Rosanna Milner, Oscar Saz, Salil Deena, Mortaza Doulaty, Raymond W. M. Ng, Thomas Hain, “The 2015 Sheffield system for longitudinal diarisation of broadcast media,” in Proceedings of IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Scottsdale, Arizona, USA, 2015.

9. Raymond W. M. Ng, Mortaza Doulaty, Rama Doddipatla, Wilker Aziz, Kashif Shah, Oscar Saz, Madina Hasan, Ghada AlHarbi, Lucia Specia, Thomas Hain, “The USFD spoken language translation system for IWSLT 2014,” in Proceedings of International Workshop on Spoken Language Translation (IWSLT), Lake Tahoe, Nevada, USA, 2014.

10. Salil Deena, Madina Hasan, Mortaza Doulaty, Oscar Saz, Thomas Hain, “Combining feature and model-based adaptation of RNNLMs for multi-genre broadcast speech recognition,” in Proceedings of Interspeech, San Francisco, California, USA, 2016.

11. Thomas Hain, Jeremy Christian, Oscar Saz, Salil Deena, Madina Hasan, Raymond W. M. Ng, Rosanna Milner, Mortaza Doulaty, Yulan Liu, “webASR 2 - improved cloud based speech technology,” in Proceedings of Interspeech, San Francisco, California, USA, 2016.

12. Raymond W. M. Ng, Mauro Nicolao, Oscar Saz, Madina Hasan, Bhusan Chettri, Mortaza Doulaty, Tan Lee, Thomas Hain, “The Sheffield language recognition system in NIST LRE 2015,” in Proceedings of the Speaker and Language Recognition Workshop Odyssey, Bilbao, Spain, 2016.

13. Erfan Loweimi, Mortaza Doulaty, Jon Barker, Thomas Hain, “Long-term statistical feature extraction from speech signal and its application in emotion recognition,” in Proceedings of International Conference on Statistical Language and Speech Processing (SLSP), Budapest, Hungary, 2015.

Ch1. Introduction

Ch2. Background

Ch3. Data selection and augmentation techniques

Ch4. Identification of genres and shows in media data

Ch5. Latent domain acoustic model adaptation

Ch6. Conclusion and future work

Chapter 2 Background

2.1 Introduction

The term domain is often used loosely to describe collections of speech data that share the same acoustic attributes and variabilities, such as the type of speech (read vs. spontaneous), communication channel, background conditions and number of speakers. Conventional ASR domains include broadcast news, meetings, telephony speech, audio books, lectures and talks (Benesty et al., 2007; Huang et al., 2001; Jurafsky and Martin, 2000). However, the concept of a domain is complex and not bound to specific criteria. This section provides a new definition of a domain from a statistical point of view, based on the notation introduced in (Pan and Yang, 2010).

A domain is defined as a pair which consists of a feature space and a marginal probability distribution of data in that space:

$D = \{\mathcal{X},\, P(X)\}$ (2.1)

where $\mathcal{X}$ is a feature space, $X = \{x_1, \ldots, x_n\} \subseteq \mathcal{X}$ is a dataset and $P(X)$ is the marginal probability distribution of the data in the feature space. With this notation, two domains are different when their feature spaces differ, their marginal probability distributions differ, or both.

For the ASR task, $\mathcal{X}$ is the space of all arbitrary-length segments of feature vectors (e.g. 39-dimensional MFCC vectors), $X$ is a training dataset and $x_i \in X$ is a particular speech segment. Conventional ASR domains such as meetings, read speech or talks can be considered to share the same feature space but to have different marginal probability distributions.
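The idea of two domains sharing a feature space while differing in their marginal distributions can be illustrated with a small sketch. Everything here is an assumption made for illustration: the data are synthetic isotropic Gaussians rather than real MFCC segments, each "segment" is reduced to a single 39-dimensional vector, the names `sample_domain`, `same_feature_space` and `mean_gap` are hypothetical, and the distance between sample means is only a crude proxy for a difference in $P(X)$.

```python
import random
import statistics

def sample_domain(mean, std, n, dim, seed):
    """Draw n synthetic feature vectors of size dim from an isotropic Gaussian."""
    rng = random.Random(seed)
    return [[rng.gauss(mean, std) for _ in range(dim)] for _ in range(n)]

def same_feature_space(X_a, X_b):
    """Simplifying assumption: a feature space is identified by its dimensionality."""
    return len(X_a[0]) == len(X_b[0])

def mean_gap(X_a, X_b):
    """Euclidean distance between per-dimension sample means of two datasets."""
    dim = len(X_a[0])
    mu_a = [statistics.fmean(x[d] for x in X_a) for d in range(dim)]
    mu_b = [statistics.fmean(x[d] for x in X_b) for d in range(dim)]
    return sum((a - b) ** 2 for a, b in zip(mu_a, mu_b)) ** 0.5

DIM = 39  # dimensionality used in the text's MFCC example

# Two hypothetical domains with different marginals, plus a second
# independent sample from the first domain as a reference point.
X_read = sample_domain(0.0, 1.0, 500, DIM, seed=0)
X_meeting = sample_domain(0.5, 1.0, 500, DIM, seed=1)
X_read_again = sample_domain(0.0, 1.0, 500, DIM, seed=2)

gap_across = mean_gap(X_read, X_meeting)
gap_within = mean_gap(X_read, X_read_again)
```

Under these assumptions, `same_feature_space(X_read, X_meeting)` holds while `gap_across` is clearly larger than `gap_within`, which is the situation the definition describes: one feature space, two marginal distributions.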

A task is defined as:

$T = \{\mathcal{Y},\, f(\cdot)\}$ (2.2)

where $\mathcal{Y}$ is a label space and $f(\cdot)$ is a prediction function which maps an input sequence to an output sequence:

$f : \mathcal{X} \rightarrow \mathcal{Y}$. (2.3)

Two tasks are considered different when their label spaces are different or they have different prediction functions or both.

In supervised learning, the training data consists of pairs $(x_i, y_i)$ such that $f(x_i) = y_i$, with $x_i \in \mathcal{X}$, $y_i \in \mathcal{Y}$, $X_{trn} = \{x_1, \ldots, x_n\}$ and $Y_{trn} = \{y_1, \ldots, y_n\}$.

In a probabilistic learning framework, $f(\cdot)$ can be viewed as $P(y|x)$, the posterior probability of the output $y$ given the input $x$. This function is usually not observed directly and is instead learned from the training data.

In the speech recognition example, $\mathcal{Y}$ is the set of all possible sequences of words in English (defined as L in chapter 1) and $f(\cdot)$ is a function which maps an audio segment to a sequence of words. Using the same audio signal for speech recognition and for emotion identification (where the task is to identify the emotion of the speaker) constitutes two different tasks, since both the label spaces and the prediction functions are different, even though both tasks share the same input to their prediction functions.
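A task as a pair of label space and prediction function can be written down directly. The following is a minimal sketch under stated assumptions: the `Task` class, the two toy predictors and `tasks_differ` are hypothetical names, the label spaces are tiny finite sets rather than all English word sequences, and prediction functions are compared by identity, which is a simplification of "different prediction functions".

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Sequence

Segment = Sequence[float]  # stand-in for one audio segment x_i

@dataclass(frozen=True)
class Task:
    """A task T = {Y, f()}: a label space and a prediction function."""
    label_space: FrozenSet[str]
    predict: Callable[[Segment], str]

def toy_recogniser(x: Segment) -> str:
    """Placeholder for speech recognition: picks one of two 'transcripts'."""
    return "yes" if sum(x) / len(x) > 0 else "no"

def toy_emotion(x: Segment) -> str:
    """Placeholder for emotion identification based on peak amplitude."""
    return "angry" if max(abs(v) for v in x) > 1.0 else "neutral"

def tasks_differ(a: Task, b: Task) -> bool:
    """Tasks differ if label spaces differ, prediction functions differ, or both."""
    return a.label_space != b.label_space or a.predict is not b.predict

asr = Task(frozenset({"yes", "no"}), toy_recogniser)
emotion = Task(frozenset({"angry", "neutral"}), toy_emotion)

# The same input is fed to both prediction functions.
segment = [0.2, -1.4, 0.9]
asr_label = asr.predict(segment)
emotion_label = emotion.predict(segment)
```

Both tasks consume the same `segment`, yet `tasks_differ(asr, emotion)` is true because neither the label spaces nor the prediction functions coincide.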

In many machine learning problems, the source and target domains (the underlying distributions of the training and test data) are assumed to be the same: $D_{trn} = D_{tst}$. Furthermore, the tasks are assumed to be identical as well: $T_{trn} = T_{tst}$. In realistic scenarios, however, the domains usually differ, which causes a mismatch between the training and test domains. The next section is devoted to the domain mismatch problem.
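The effect of a train/test mismatch can be seen even in a toy setting. The sketch below is an illustration under assumptions, not a description of any system in this thesis: the data are one-dimensional synthetic Gaussians, the "model" is a single threshold, and the mismatch is a constant offset added to the test features, loosely imitating a change of channel. Strictly speaking, such an offset also moves the optimal decision boundary in feature space; the sketch is only meant to show that a model fitted to one distribution can degrade on another.

```python
import random

def make_data(n, offset, seed):
    """Two-class 1-D data: class 0 ~ N(-1, 1), class 1 ~ N(+1, 1), plus an offset."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        y = rng.randint(0, 1)
        x = rng.gauss(2 * y - 1, 1.0) + offset
        data.append((x, y))
    return data

def fit_threshold(data):
    """Place the decision threshold midway between the two class means."""
    xs0 = [x for x, y in data if y == 0]
    xs1 = [x for x, y in data if y == 1]
    return (sum(xs0) / len(xs0) + sum(xs1) / len(xs1)) / 2

def accuracy(data, threshold):
    """Fraction of points for which 'x above threshold' agrees with the label."""
    correct = sum(1 for x, y in data if (x > threshold) == (y == 1))
    return correct / len(data)

train = make_data(2000, offset=0.0, seed=0)         # training domain
test_matched = make_data(2000, offset=0.0, seed=1)  # same domain as training
test_shifted = make_data(2000, offset=1.5, seed=2)  # shifted test domain

threshold = fit_threshold(train)
acc_matched = accuracy(test_matched, threshold)
acc_shifted = accuracy(test_shifted, threshold)
```

With these settings `acc_matched` is noticeably higher than `acc_shifted`, even though the label space and the classifier are unchanged between the two evaluations.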
