As in speech recognition, though to a lesser extent, deep learning has become a topic of intense interest in audio and music processing, but only quite recently. As an example, the first major event on deep learning for speech recognition took place in 2009, followed by a series of events including a comprehensive tutorial on the topic at ICASSP-2012 and a special issue of IEEE Transactions on Audio, Speech, and Language Processing, the premier publication for speech recognition, in the same year. The first major event on deep learning for audio and music processing appears to be the special session at ICASSP-2014, titled Deep Learning for Music [14].
In the general field of audio and music processing, the areas impacted by deep learning are mainly music signal processing and music information retrieval [15, 22, 141, 177, 178, 179, 319]. Deep learning presents a unique set of challenges in these areas. Music audio signals are time series in which events are organized in musical time rather than in real time, and musical time changes as a function of rhythm and expression. The measured signals typically combine multiple voices that are synchronized in time and overlapping in frequency, mixing both short-term and long-term temporal dependencies. The influencing factors include musical tradition, style, composer and interpretation. The high complexity and variety give rise to signal representation problems that are well suited to the high levels of abstraction afforded by the perceptually and biologically motivated processing techniques of deep learning.
In the early work on audio signals reported by Lee et al. [215] and their follow-up work, a convolutional structure is imposed on the RBM while building up a DBN. Convolution is carried out in time by sharing weights between hidden units, in an attempt to detect the same “invariant” feature at different times. A max-pooling operation is then performed, in which the maximal activations over small temporal neighborhoods of hidden units are retained, inducing some local temporal invariance. The resulting convolutional DBN is applied to audio as well as speech data for a number of tasks including music artist and genre classification, speaker identification, speaker gender classification, and phone classification, with promising results presented.
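The weight sharing and max-pooling steps described above can be sketched in a few lines. This is a minimal, deterministic illustration and not the probabilistic convolutional RBM of [215]; the function names, the two-tap kernel, and the toy signals are our own assumptions.

```python
def conv1d_valid(signal, kernel):
    """Apply one shared kernel at every time step (valid cross-correlation)."""
    width = len(kernel)
    return [
        sum(signal[t + i] * kernel[i] for i in range(width))
        for t in range(len(signal) - width + 1)
    ]


def max_pool(activations, width):
    """Keep the largest activation in each non-overlapping temporal window."""
    return [
        max(activations[start:start + width])
        for start in range(0, len(activations) - width + 1, width)
    ]


# Hypothetical detector and toy signals: the same event, shifted by one frame.
kernel = [1.0, -1.0]
signal_a = [0, 0, 1, 0, 0, 0, 0, 0, 0]
signal_b = [0, 0, 0, 1, 0, 0, 0, 0, 0]

acts_a = conv1d_valid(signal_a, kernel)
acts_b = conv1d_valid(signal_b, kernel)
pooled_a = max_pool(acts_a, 4)
pooled_b = max_pool(acts_b, 4)
```

In this toy case the hidden activations `acts_a` and `acts_b` differ, because the event occurs at different frames, while the pooled outputs coincide, because the shift stays inside one pooling window. This is the local temporal invariance referred to above.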
The RNN has also recently been applied to music processing applications [22, 40, 41], where the use of ReLU hidden units instead of logistic or tanh nonlinearities is explored in the RNN. As reviewed in Section 7.2, ReLU units compute y = max(x, 0), and lead to sparser gradients, less diffusion of credit and blame in the RNN, and faster training. The RNN is applied to the task of automatic recognition of chords from audio music, an active area of research in music information retrieval. The motivation for using the RNN architecture is its power in modeling dynamical systems. The RNN incorporates an internal memory, or hidden state, represented by a self-connected hidden layer of neurons. This property makes it well suited to modeling temporal sequences, such as frames in a magnitude spectrogram or chord labels in a harmonic progression. When well trained, the RNN is endowed with the power to predict the output at the next time step given the previous ones. Experimental results show that the RNN-based automatic chord recognition system is competitive with existing state-of-the-art approaches [275]. The RNN is capable of learning basic musical properties such as temporal continuity, harmony and temporal dynamics. It can also efficiently search for the most musically plausible chord sequences when the audio signal is ambiguous, noisy or weakly discriminative.
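To make the role of the self-connected hidden layer concrete, the sketch below implements one recurrent update with ReLU units, h_t = max(W_in x_t + W_rec h_{t-1} + b, 0). It is an illustration under our own assumptions: the weight values, the two-dimensional "frames", and all names are hypothetical and are not taken from [22, 40, 41].

```python
def relu(vector):
    """ReLU nonlinearity, y = max(x, 0), applied element-wise."""
    return [max(x, 0.0) for x in vector]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def rnn_step(x, h_prev, w_in, w_rec, bias):
    """One recurrent update: the new state depends on the input and the old state."""
    pre_activation = [
        a + b + c
        for a, b, c in zip(matvec(w_in, x), matvec(w_rec, h_prev), bias)
    ]
    return relu(pre_activation)


# Hypothetical weights and a two-frame toy input sequence.
w_in = [[1.0, 0.0], [0.0, -1.0]]
w_rec = [[0.5, 0.0], [0.0, 0.5]]
bias = [0.0, 0.0]
frames = [[1.0, 1.0], [0.0, 0.0]]

state = [0.0, 0.0]
states = []
for frame in frames:
    state = rnn_step(frame, state, w_in, w_rec, bias)
    states.append(state)
```

After the second, all-zero frame the first hidden unit is still active (at half its previous value), which shows how the hidden state carries information forward in time. The second unit is clipped to zero by the ReLU in both steps, a small instance of the sparsity mentioned above.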
A recent review article by Humphrey et al. [179] provides a detailed analysis of content-based music informatics, and in particular of why progress is decelerating throughout the field. The analysis concludes that hand-crafted feature design is sub-optimal and unsustainable, that the power of shallow architectures is fundamentally limited, and that short-time analysis cannot encode musically meaningful structure. These conclusions motivate the use of deep learning methods aimed at automatic feature learning. By embracing feature learning, it becomes possible to optimize a music retrieval system’s internal feature representation, or to discover it directly, since deep architectures are especially well suited to characterizing the hierarchical nature of music.
Finally, we review the very recent work by van den Oord et al. [371] on content-based music recommendation using deep learning methods. Automatic music recommendation has become an increasingly significant and useful technique in practice. Most recommender systems rely on collaborative filtering, which suffers from the cold start problem: it fails when no usage data is available. Thus, collaborative filtering is not effective for recommending new and unpopular songs. Deep learning methods power the latent factor model for recommendation, which predicts the latent factors from music audio when they cannot be obtained from usage data. A traditional approach using a bag-of-words representation of the audio signals is compared with deep CNNs in a rigorous evaluation. The results show highly sensible recommendations produced by the predicted latent factors using deep CNNs. The study demonstrates that a combination of convolutional neural networks and richer audio features leads to such promising results for content-based music recommendation.
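The way predicted latent factors sidestep the cold start problem can be sketched as follows. The sketch assumes the common latent factor setup in which a user's preference for a song is scored by an inner product of user and song factors. The linear map standing in for the deep CNN, the feature values, and all names are hypothetical and are not taken from [371].

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def predict_item_factors(audio_features, weights):
    """Stand-in for the audio model: map audio features to latent factors.

    In [371] this role is played by a deep CNN; a fixed linear map is used
    here only to keep the sketch self-contained.
    """
    return [dot(row, audio_features) for row in weights]


def rank_songs(user_factors, factors_by_song):
    """Order songs by predicted preference (user-song inner product)."""
    return sorted(
        factors_by_song,
        key=lambda song: dot(user_factors, factors_by_song[song]),
        reverse=True,
    )


# Hypothetical new songs with no usage data, described only by audio features.
weights = [[1.0, 0.0], [0.0, 1.0]]
new_song_audio = {"song_a": [0.9, 0.1], "song_b": [0.1, 0.9]}
factors_by_song = {
    song: predict_item_factors(features, weights)
    for song, features in new_song_audio.items()
}

user_factors = [1.0, 0.0]  # learned from this user's listening history
ranking = rank_songs(user_factors, factors_by_song)
```

No listening data for `song_a` or `song_b` is needed: their factors come from the audio alone, so both can be ranked for the user as soon as they are released.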
As with speech recognition and speech synthesis, much more work is expected from the music and audio signal processing community in the near future.
8 Selected Applications in Language Modeling and Natural Language Processing
Research in language, document, and text processing has seen increasing popularity recently in the signal processing community, and has been designated as one of the main focus areas by the IEEE Signal Processing Society’s Speech and Language Processing Technical Committee. Applications of deep learning to this area started with language modeling (LM), where the goal is to assign a probability to any arbitrary sequence of words or other linguistic symbols (e.g., letters, characters, phones, etc.). Natural language processing (NLP) or computational linguistics also deals with sequences of words or other linguistic symbols, but the tasks are much more diverse (e.g., translation, parsing, text classification, etc.) and do not focus on providing probabilities for linguistic symbols. The connection is that LM is often an important and very useful component of NLP systems.
Application to NLP is currently one of the most active areas in deep learning research, and deep learning is also considered a promising direction by the NLP research community. However, the overlap between deep learning and NLP researchers is so far not nearly as large as that for the application areas of speech or vision. This is partly because the hard evidence for the superiority of deep learning over the current state-of-the-art NLP methods has not been as strong as in speech or visual object recognition.
8.1 Language modeling
Language models (LMs) are a crucial part of many successful applications, such as speech recognition, text information retrieval, statistical machine translation and other NLP tasks. Traditional techniques for estimating the parameters in LMs are based on N-gram counts. Despite the known weaknesses of N-grams and huge efforts of research communities across many fields, N-grams remained the state of the art until neural network and deep learning based methods were shown to significantly lower the perplexity of LMs, one common (but not ultimate) measure of LM quality, on several standard benchmark tasks [245, 247, 248].
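As a concrete reference point for count-based estimation and for the perplexity measure, the sketch below trains a bigram LM from counts and evaluates the perplexity, i.e., the exponential of the average negative log-probability per predicted word. The add-one smoothing, the toy corpus, and the function names are our own assumptions, chosen only to keep the example small.

```python
import math
from collections import Counter


def train_bigram(tokens, vocab):
    """Estimate P(word | previous word) from counts, with add-one smoothing."""
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    history_counts = Counter(tokens[:-1])
    vocab_size = len(vocab)

    def prob(prev, word):
        return (bigram_counts[(prev, word)] + 1) / (history_counts[prev] + vocab_size)

    return prob


def perplexity(prob, tokens):
    """exp of the average negative log-probability of each predicted word."""
    pairs = list(zip(tokens, tokens[1:]))
    log_sum = sum(math.log(prob(prev, word)) for prev, word in pairs)
    return math.exp(-log_sum / len(pairs))


# Toy corpus over a two-word vocabulary (hypothetical data).
train_tokens = "a b a b a b".split()
prob = train_bigram(train_tokens, vocab={"a", "b"})

seen_ppl = perplexity(prob, "a b a b".split())    # pattern seen in training
unseen_ppl = perplexity(prob, "a a a a".split())  # pattern never seen
```

The test sequence that follows the training pattern receives a lower perplexity than the one that does not. A lower perplexity means the model is less surprised by the data, which is the sense in which the neural LMs discussed next improve on N-grams.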
Before we discuss neural network based LMs, we note the use of hierarchical Bayesian priors in building up deep and recursive structure for LMs [174]. Specifically, the Pitman-Yor process is exploited as the Bayesian prior, from which a deep (four-layer) probabilistic generative model is built. It offers a principled approach to LM smoothing by incorporating the power-law distribution of natural language. As discussed in Section 3, this type of prior knowledge embedding is more readily achievable in the generative probabilistic modeling setup than in the discriminative neural network based setup. The reported results on LM perplexity reduction are not nearly as strong as those achieved by the neural network based LMs, which we discuss next.
There has been a long history [19, 26, 27, 433] of using (shallow) feed-forward neural networks in LMs, called the NNLM. The use of DNNs in the same way for LMs appeared more recently in [8]. An LM is a function that captures the salient statistical characteristics of the distribution of sequences of words in natural language. It allows one to make probabilistic predictions of the next word given preceding ones.
An NNLM is one that exploits the neural network’s ability to learn distributed representations in order to reduce the impact of the curse of dimensionality. The original NNLM, with a feed-forward neural network structure, works as follows. The input of the N-gram NNLM is formed by using a fixed-length history of N − 1 words. Each of the previous N − 1 words is encoded using the very sparse 1-of-V coding, where V is the size of the vocabulary. Then, this 1-of-V orthogonal representation of words is projected linearly to a lower dimensional space, using a projection matrix shared among words at different positions in the history. This type of continuous-space, distributed representation of words is called “word embedding,” and is very different from the common symbolic or localist representation [26, 27]. After the projection layer, a hidden layer with a nonlinear activation function, which is either a hyperbolic tangent or a logistic sigmoid, is used. An output layer of the neural network then follows the hidden layer, with the number of output units equal to the size of the full vocabulary. After the network is trained, the output layer activations represent the “N-gram” LM’s probability distribution.
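The forward computation just described can be sketched as follows. The sketch assumes a softmax output layer so that the activations form a probability distribution, and it uses the fact that projecting a 1-of-V vector through a matrix is equivalent to looking up one row of that matrix. The layer sizes, the random initialization, and all names are illustrative assumptions.

```python
import math
import random


def softmax(scores):
    """Normalize scores into a probability distribution (numerically stable)."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def init_params(vocab_size, embed_dim, context, hidden, seed=0):
    """Randomly initialize an untrained NNLM (sizes are illustrative)."""
    rng = random.Random(seed)

    def matrix(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

    return {
        "embed": matrix(vocab_size, embed_dim),       # shared projection matrix
        "w_h": matrix(hidden, context * embed_dim),   # projection -> hidden
        "b_h": [0.0] * hidden,
        "w_o": matrix(vocab_size, hidden),            # hidden -> full vocabulary
        "b_o": [0.0] * vocab_size,
    }


def nnlm_forward(history_ids, params):
    """Return P(next word | N-1 history words) for every word in the vocabulary."""
    # Projecting a 1-of-V vector is a row lookup; the same matrix is used
    # for every position in the history.
    x = [value for word_id in history_ids for value in params["embed"][word_id]]
    h = [
        math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(params["w_h"], params["b_h"])
    ]
    scores = [
        sum(w * hi for w, hi in zip(row, h)) + b
        for row, b in zip(params["w_o"], params["b_o"])
    ]
    return softmax(scores)


# A trigram-style NNLM (N - 1 = 2 history words) over a 10-word vocabulary.
params = init_params(vocab_size=10, embed_dim=4, context=2, hidden=8)
probs = nnlm_forward([3, 7], params)
```

The final softmax touches every one of the V output units. This per-word cost, which grows with the vocabulary, is what makes NNLMs expensive to train with large vocabularies.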
The main advantage of NNLMs over the traditional counting-based N-gram LMs is that the history is no longer seen as an exact sequence of N − 1 words, but rather as a projection of the entire history into some lower dimensional space. This leads to a reduction of the total number of parameters in the model that have to be trained, resulting in automatic clustering of similar histories. Compared with the class-based N-gram LMs, the NNLMs are different in that they project all words into the same low dimensional space, in which there can be many degrees of similarity between words. On the other hand, NNLMs have much larger computational complexity than N-gram LMs.
Let’s look at the strengths of the NNLMs again from the viewpoint of distributed representations. A distributed representation of a symbol is a vector of features which characterize the meaning of the symbol. Each element in the vector participates in representing the meaning. With an NNLM, one relies on the learning algorithm to discover meaningful, continuous-valued features. The basic idea is to learn to associate each word in the dictionary with a continuous-valued vector representation, which in the literature is called a word embedding, where each word corresponds to a point in a feature space. One can imagine that each dimension of that space corresponds to a semantic or grammatical characteristic of words. The hope is that functionally similar words get to be closer to each other in that space, at least along some directions. A sequence of words can thus be transformed into a sequence of these learned feature vectors. The neural network learns to map that sequence of feature vectors to the probability distribution over the next word in the sequence. The distributed representation approach to LMs has the advantage that it allows the model to generalize well to sequences that are not in the set of training word sequences, but that are similar in terms of their features, i.e., their distributed representation. Because neural networks tend to map nearby inputs to nearby outputs, word sequences with similar features are mapped to similar predictions.
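The notion of functionally similar words lying close together in the feature space can be made concrete with a similarity measure. The sketch below uses cosine similarity, one common choice. The three-dimensional vectors are hand-picked, hypothetical values used purely for illustration, not embeddings learned by any model discussed here.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot_product = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot_product / (norm_u * norm_v)


# Hypothetical embeddings: two nouns that behave alike, and one verb.
embedding = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.1],
    "run": [0.0, 0.1, 0.9],
}

noun_noun = cosine_similarity(embedding["cat"], embedding["dog"])
noun_verb = cosine_similarity(embedding["cat"], embedding["run"])
```

With embeddings of this kind, a history containing "dog" presents the network with nearly the same input as the same history containing "cat", so a prediction learned for one transfers to the other even if only one of the two word sequences was seen in training.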
The above ideas of NNLMs have been implemented in various studies, some involving deep architectures. The idea of structuring hierarchically the output of an NNLM in order to handle large vocabularies was introduced in [18, 262]. In [252], the temporally factored RBM was used for language modeling. Unlike the traditional N-gram model, the factored RBM uses distributed representations not only for context words but also for the words being predicted.
This approach is generalized to deeper structures as reported in [253].
Subsequent work on NNLMs with “deep” architectures can be found in [205, 207, 208, 245, 247, 248]. As an example, Le et al. [207] describe an NNLM with a structured output layer (SOUL-NNLM), where the processing depth in the LM is focused in the neural network’s output representation. Figure 8.1 illustrates the SOUL-NNLM architecture with hierarchical structure in the output layers of the neural network, which shares the same architecture with the conventional NNLM up to the hidden layer. The hierarchical structure for the network’s output vocabulary is in the form of a clustering tree, shown to the right of Figure 8.1, where each word belongs to only one class and ends in a single leaf node of the tree. As a result of the hierarchical structure, the SOUL-NNLM enables the training of the NNLM with a full, very large vocabulary. This gives it an advantage over the traditional NNLM, which requires short-lists of words in order to carry out efficient computation in training.
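The computational benefit of a hierarchical output can be seen in its simplest, two-level form, where the probability of a word is factored as P(w | h) = P(c(w) | h) P(w | c(w), h), with c(w) the single class to which w belongs. The sketch below is a deliberately simplified stand-in for the deeper clustering tree of the SOUL-NNLM [207]; the class assignment, the weights, and the names are hypothetical.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    """Normalize scores into a probability distribution (numerically stable)."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def class_factored_prob(word, hidden, class_of, class_weights, word_weights):
    """P(word | h) = P(class | h) * P(word | class, h)."""
    classes = sorted(class_weights)
    class_probs = softmax([dot(class_weights[c], hidden) for c in classes])

    word_class = class_of[word]
    # Only the words in the target's own class are scored and normalized.
    members = sorted(w for w in class_of if class_of[w] == word_class)
    member_probs = softmax([dot(word_weights[w], hidden) for w in members])

    return class_probs[classes.index(word_class)] * member_probs[members.index(word)]


# Hypothetical four-word vocabulary split into two classes.
class_of = {"cat": "noun", "dog": "noun", "runs": "verb", "eats": "verb"}
class_weights = {"noun": [1.0, 0.0], "verb": [0.0, 1.0]}
word_weights = {
    "cat": [0.5, 0.1],
    "dog": [0.2, -0.3],
    "runs": [-0.4, 0.6],
    "eats": [0.1, 0.2],
}
hidden = [0.3, -0.2]  # hidden-layer output for some word history

total = sum(
    class_factored_prob(w, hidden, class_of, class_weights, word_weights)
    for w in class_of
)
```

The factored probabilities still sum to one over the full vocabulary, yet evaluating one word requires normalizing only over the classes and over the members of a single class. With a balanced two-level split this is on the order of the square root of V operations instead of V.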
Figure 8.1: The SOUL-NNLM architecture with hierarchical structure in the output layers of the neural network [after [207], @IEEE].
As another example of neural-network-based LMs, the work described in [247, 248] and [245] makes use of RNNs to build large-scale language models, called RNNLMs. The main difference between the feed-forward and the recurrent architectures for LMs lies in how they represent the word history. For the feed-forward NNLM, the history is still just the previous several words. For the RNNLM, in contrast, an effective representation of the history is learned from the data during training. The hidden layer of the RNN represents all previous history and not just the N − 1 previous words, so the model can theoretically represent long context patterns. A further important advantage of the RNNLM over its feed-forward counterpart is the possibility of representing more advanced patterns in the word sequence. For example, patterns that rely on words that could have occurred at variable positions in the history can be encoded much more efficiently with the recurrent architecture. That is, the RNNLM can simply remember some specific word in the state of the hidden layer, while the feed-forward NNLM would need to use parameters for each specific position of the word in the history.
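The difference between a fixed window and a recurrent state can be illustrated with a deliberately hand-wired example. The single-unit recurrence below simply latches once a given word has been seen. It is not a trained RNNLM, and the names and the example sentence are our own assumptions.

```python
def recurrent_flag(words, target):
    """One self-connected unit that switches on once `target` has occurred."""
    state = 0.0
    for word in words:
        # The self-connection carries the old state forward unchanged.
        state = max(state, 1.0 if word == target else 0.0)
    return state


def window_flag(words, target, n):
    """The same question answered from the last n - 1 words only (n >= 2)."""
    if n < 2:
        raise ValueError("an N-gram history needs n >= 2")
    return 1.0 if target in words[-(n - 1):] else 0.0


# An opening quote followed by several ordinary words (hypothetical input).
words = ["she", "said", '"', "hello", "there", "my", "friend"]

seen_by_rnn = recurrent_flag(words, '"')
seen_by_trigram = window_flag(words, '"', n=3)
```

The recurrent unit reports that a quote is open regardless of how many words ago it occurred, whereas the trigram-sized window has already lost that information. A feed-forward model could recover it only by enlarging the window and spending parameters on each position at which the quote might have appeared.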
The RNNLM is trained using the algorithm of back-propagation through time; see details in [245], which provided Figure 8.2 to show how the RNN unfolds during training into a deep feed-forward network (here going three steps back in time).
Figure 8.2: During the training of RNNLMs, the RNN unfolds into a deep feed-forward network; based on Figure 3.2 of [245].
The training of the RNNLM achieves stability and fast convergence, helped by capping the growing gradient in training RNNs. Adaptation schemes for the RNNLM are also developed by sorting the training data with respect to their relevance and by training the model during processing of the test data. Empirical comparisons with state-of-the-art counting-based N-gram LMs show that the RNNLM achieves much lower perplexity, as reported in [247, 248] and [245].
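One common way to cap a growing gradient is to rescale it whenever its norm exceeds a threshold, as sketched below. Whether [245] caps the norm or the individual gradient components is not stated here, so the norm-based variant and the threshold value should be read as assumptions.

```python
import math


def clip_by_norm(gradient, max_norm):
    """Rescale the gradient so that its Euclidean norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in gradient))
    if norm <= max_norm:
        return list(gradient)  # small gradients pass through unchanged
    scale = max_norm / norm
    return [g * scale for g in gradient]


# A gradient of norm 5 capped to norm 1; the direction is preserved.
exploding = [3.0, 4.0]
clipped = clip_by_norm(exploding, max_norm=1.0)
clipped_norm = math.sqrt(sum(g * g for g in clipped))

# A gradient already inside the limit is left as it is.
small = clip_by_norm([0.3, 0.4], max_norm=1.0)
```

Rescaling keeps the update direction and limits only the step size, which prevents an occasional very large gradient, typical when errors are propagated back through many time steps, from destabilizing training.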
Separate work on applying the RNN to an LM with characters instead of words as the unit can be found in [153, 357]. Many interesting properties, such as the ability to capture long-term dependencies (e.g., matching opening and closing quotes in a paragraph), are demonstrated.
However, the usefulness of characters instead of words as units in practical applications is not clear, because the word is such a powerful representation for natural language. Changing words to characters in LMs may limit most practical application scenarios, and training becomes more difficult. Word-level models currently remain superior.
In the most recent work, Mnih and Teh [255] and Mnih and Kavukcuoglu [254] have developed a fast and simple training algorithm for NNLMs. Despite their superior performance, NNLMs have been used less widely than standard N-gram LMs due to the much longer training time. The reported algorithm makes use of a method called noise-contrastive estimation or NCE [139] to achieve much faster training for NNLMs, with time complexity independent of the vocabulary size; hence a flat instead of tree-structured output layer in the NNLM is used. The idea behind NCE is to perform nonlinear logistic regression to discriminate between the observed data and some artificially generated noise. That is, to estimate parameters in a density model of observed data, we can learn to discriminate between samples from the data distribution and samples from a known noise distribution. As an important special case, NCE is particularly attractive for unnormalized distributions (i.e., free from partition functions in the denominator). In order to apply NCE to train NNLMs efficiently, Mnih and Teh [255] and Mnih and Kavukcuoglu [254] first formulate the learning problem so that the distribution of the next word is expressed in terms of a scoring function. The NNLM can then be viewed as a way to quantify the compatibility between the word history and a candidate next word using the scoring function. The model distribution used in training the NNLM thus becomes the exponentiated scoring function, normalized by a sum of the same quantity over all possible words. By removing this costly normalization factor, NCE is shown to speed up NNLM training by more than an order of magnitude.
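The NCE objective for a single training example can be sketched as follows. The sketch uses the standard NCE formulation with k noise samples per observed word, in which the probability that a word came from the data rather than the noise is modeled as sigmoid(s(w, h) − log(k P_n(w))), where s is the unnormalized scoring function and P_n the noise distribution. The toy scores, the uniform noise distribution, and the names are hypothetical.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(1 / (1 + exp(-x)))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def nce_loss(score_fn, history, target, noise_words, noise_prob):
    """NCE loss for one observed (history, target) pair and k noise words."""
    k = len(noise_words)

    def logit(word):
        # Log-odds that `word` came from the data rather than the noise.
        return score_fn(word, history) - math.log(k * noise_prob[word])

    loss = -log_sigmoid(logit(target))      # label the observed word as data
    for word in noise_words:
        loss -= log_sigmoid(-logit(word))   # label each sample as noise
    return loss


# Hypothetical uniform noise distribution over a four-word vocabulary.
noise_prob = {"cat": 0.25, "dog": 0.25, "the": 0.25, "sat": 0.25}
history = ("the",)
noise_words = ["dog", "the"]  # k = 2 samples drawn from the noise distribution

good_scores = {"cat": 2.0, "dog": -2.0, "the": -2.0, "sat": -2.0}
bad_scores = {"cat": -2.0, "dog": 2.0, "the": 2.0, "sat": 2.0}

loss_good = nce_loss(lambda w, h: good_scores[w], history, "cat", noise_words, noise_prob)
loss_bad = nce_loss(lambda w, h: bad_scores[w], history, "cat", noise_words, noise_prob)
```

Only k + 1 scores are evaluated per training example, regardless of the vocabulary size, and no sum over all words appears anywhere in the loss. A scoring function that ranks the observed word above the noise samples receives the lower loss, which is what drives learning.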
A similar concept to NCE is used in the recent work of [250], which is called negative sampling. This is applied to a simplified version of