Conclusion - Methods for Addressing Data Diversity in Automatic Speech Recognition

This chapter provided an overview of data selection techniques for ASR. The techniques were studied in the context of reducing the mismatch between training and test conditions, with the ultimate goal of improving recognition accuracy. Two new approaches for data selection were introduced. The first is a similarity measure based on a likelihood ratio, in which training data is selected according to its similarity to a target test set. Experimental results were reported on a highly diverse dataset in which data from six different domains were pooled together, and various mismatched conditions arising from the use of out-of-domain and cross-domain data were studied. The proposed method was shown to improve the WER under different mismatched conditions. The second approach is based on phone posteriors computed by a reference model. First, the effectiveness of the proposed metric in quantifying the variations present in the data, in the form of signal-to-noise ratios, was studied; the metric was then generalised to learn the distributions of other sources of variability. These distributions were then used to create a training corpus that matches the distributions of variations present in a target test set. Using this MTR training corpus was shown to reduce the WER significantly compared with a uniformly perturbed training corpus.
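The idea of likelihood-ratio-based selection can be illustrated with a minimal sketch. It is not the implementation used in this work: it assumes each model is a single diagonal-covariance Gaussian fitted to acoustic feature frames, and the function names (`fit_diag_gaussian`, `llr_score`, `select_utterances`) and the toy data are hypothetical. Each training utterance is scored by its average per-frame log-likelihood ratio between a model of the target set and a model of the pooled training data, and the highest-scoring utterances are kept.

```python
import math
from statistics import fmean, pvariance


def fit_diag_gaussian(frames):
    """Fit a diagonal-covariance Gaussian to a list of feature vectors."""
    dims = list(zip(*frames))
    means = [fmean(d) for d in dims]
    # Floor the variance so the log-density stays finite (assumed floor value).
    variances = [max(pvariance(d), 1e-6) for d in dims]
    return means, variances


def log_density(frame, model):
    """Log-density of one frame under a diagonal Gaussian."""
    means, variances = model
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
        for x, m, v in zip(frame, means, variances)
    )


def llr_score(utterance, target_model, background_model):
    """Average per-frame log-likelihood ratio (target vs. background)."""
    return fmean(
        log_density(f, target_model) - log_density(f, background_model)
        for f in utterance
    )


def select_utterances(pool, target_frames, keep):
    """Return ids of the `keep` pool utterances most similar to the target."""
    background_frames = [f for utt in pool.values() for f in utt]
    target_model = fit_diag_gaussian(target_frames)
    background_model = fit_diag_gaussian(background_frames)
    ranked = sorted(
        pool,
        key=lambda uid: llr_score(pool[uid], target_model, background_model),
        reverse=True,
    )
    return ranked[:keep]


# Toy two-dimensional "features": two domains with clearly different means.
pool = {
    "news_01": [[0.1, 0.0], [-0.2, 0.3], [0.0, -0.1]],
    "news_02": [[0.2, -0.3], [0.1, 0.1], [-0.1, 0.2]],
    "sport_01": [[5.1, 4.8], [4.9, 5.2], [5.0, 5.1]],
    "sport_02": [[4.7, 5.0], [5.3, 4.9], [5.1, 5.2]],
}
target_frames = [[5.0, 5.0], [4.8, 5.1], [5.2, 4.9], [5.1, 5.2]]
selected = select_utterances(pool, target_frames, keep=2)
```

In this toy setting the two utterances drawn from the same region as the target frames are selected. A practical system would likely use richer models (for example Gaussian mixtures) and real acoustic features; those choices are outside what this sketch assumes.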

CHAPTER 4 Identification of genres and shows in media data

4.1 Introduction

The amount of digital media grows every day, driven by digital television, online streaming services and social media. There are over 28,000 TV broadcasting stations in the world, more than 300 hours of video are uploaded to YouTube alone every minute, and users around the globe spend more than 100 million hours a day watching Facebook videos (Central Intelligence Agency, 2016; Facebook, 2016; YouTube, 2016). This creates a huge demand for effective techniques for the automatic processing of digital media, so that its content can be easily searched, retrieved and navigated.

Multimedia data may have associated meta-data, which facilitates automatic processing for downstream tasks such as indexing. Meta-data can be either structured or unstructured. Examples of structured meta-data include genre labels, number of speakers, speaker labels, duration, date and time of production, date and time of broadcast, broadcast type and broadcast media. Examples of unstructured meta-data include descriptions and textual summaries. Genre labels may include news, sports, comedy, documentary and drama, which are categories that imply more than purely semantic information. For example, shows that belong to the same genre may share similar acoustic conditions.

Some of these meta-data are objective, such as duration, and some are subjective, such as genre labels. Objective properties are usually observable and measurable, while subjective properties may be difficult, and in some cases impossible, to measure. Assigning subjective tags can be challenging even for humans. For example, a news programme that discusses a rise in oil prices after the death of a political figure could be considered to belong to any of these genres: news, finance or politics.

The extra information provided by meta-data is usually used for efficient querying, navigation, browsing and discovery (Chowdhury, 2010). For example, classifying multimedia data into genres or other categories makes content discovery easier for the users of information retrieval systems.

Meta-data tags might not always be available for the data, especially for subjective properties such as genre labels. In large historical digital archives there may also be inconsistencies in the tags, especially if the tagging was performed manually by several people. Manual labelling of digital archives is usually not considered a viable option, even for medium-sized archives, especially under budget and time constraints. Automatic labelling of genres or other similar tags is therefore an important task for multimedia and information retrieval systems, and is the main motivation for the study in this chapter. Furthermore, since shows that belong to the same genre usually share similar acoustic conditions, this information can also be used in acoustic model adaptation for mismatch reduction, which further motivates the study. The empirical results supporting this argument for improving the WER of ASR systems are provided in the next chapter. In this chapter, the main aim is to automatically tag media data with genre and show labels.

Research in the media processing field is further motivated by initiatives such as the “MediaEval benchmarking for multimedia evaluation” (Larson et al., 2013), or the “Robust, as accurate as human genre classification for video” challenges within the multimedia grand challenges of the ACM multimedia conference (Challenge, 2010).

Given the applications of genre labelling in multimedia information retrieval systems and its potential applications in acoustic model adaptation in ASR systems, the main research question this chapter addresses is how broadcast media data can be assigned subjective tags, such as genre labels, using audio. It also investigates which sources of information are required to further improve genre classification accuracy. To answer these questions, two techniques for genre identification are proposed in this chapter. The first approach is based on a set of local features called background tracking features, and the second approach is based on a latent modelling technique called latent Dirichlet allocation (LDA). The LDA approach is also used for the show identification task for the first time. An overview of genre identification techniques is provided in the next section.
