Deep neural network architectures - Compositional hierarchical model for music information retr

As deep neural network architectures have become the preferred approach for classification and segmentation, as well as other tasks that involve the processing of images, videos and sound, they are given a more focused overview in this section. To fully familiarize the reader with neural-network-based deep architectures, we first briefly elaborate on the history and evolution of neural networks, followed by a short description of some of the prevalent deep neural network architectures. We conclude the section with an overview of their applications in MIR.

2.2.1 Neural networks

Artificial neural networks in the form of the perceptron were introduced in the late fifties and early sixties by Rosenblatt [62,63], who defined the perceptron as a three-layer structure with one input layer, a second non-adaptive layer with hand-coded features, and an output layer. Although perceptrons were an innovative and promising algorithm, they were limited in their learning capacity (only linearly separable problems), and their features were hand-coded rather than learned.

Decades later, the introduction of the backpropagation algorithm for weight adjustment extended first-generation perceptrons by removing the need for hand-coding of weights and by introducing non-linear activation functions [64]. Backpropagation, also called the backward propagation of errors, is a generalization of the delta rule [65]. Based on an annotated training set, the outputs of the neural network for a given input are compared to the annotations. The error is calculated as the difference between the expected and the produced outputs and is used to adjust the weights of the network’s hidden layer. This step is repeated for each layer backwards, from the output to the input layer. The procedure can be iterated until the error is satisfactorily small. The whole process can be time-consuming, depending on the number of training samples and network layers.

Although backpropagation-based artificial neural networks have successfully been used in a variety of problem domains, they possess several shortcomings. Large networks that would, for example, model complex perceptual tasks are difficult to train, as the size of the appropriate annotated datasets increases and learning becomes unstable. The training algorithm may often converge to a local minimum and thus a good solution may not be found.

Deep neural networks are essentially neural networks with a high number of layers. In recent years they have become the preferred algorithm for solving a large number of tasks involving multimedia materials. Why deep architectures are more successful than shallow ones is still unclear. The reasons may lie in the hierarchical nature of the tasks we are trying to solve, the number of neurons needed for the same accuracy (a shallow network may need to be larger than a deep one for the same task), and the fact that shallow networks are more difficult to train. Many different deep neural network architectures have been introduced over the years; here we summarize several of the more prominent ones.
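Before turning to the individual architectures, the backpropagation procedure described earlier in this subsection can be illustrated with a minimal sketch in plain Python. It is not taken from the cited works: the XOR toy dataset, the network size (two inputs, a few sigmoid hidden units, one sigmoid output), the learning rate and all names such as `train_xor` are assumptions chosen only to keep the example short.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def train_xor(epochs=3000, lr=0.5, hidden=3, seed=0):
    """Train a small sigmoid network on XOR; return the mean squared error per epoch."""
    rng = random.Random(seed)
    # Annotated training set (assumed toy data): inputs and expected outputs.
    data = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0),
            ([1.0, 0.0], 1.0), ([1.0, 1.0], 0.0)]
    # Each hidden unit has two input weights and a bias.
    w_h = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(hidden)]
    # The output unit has one weight per hidden unit and a bias.
    w_o = [rng.uniform(-1, 1) for _ in range(hidden + 1)]
    losses = []
    for _ in range(epochs):
        total = 0.0
        for x, target in data:
            # Forward pass: compute hidden activations and the network output.
            h = [sigmoid(w[0] * x[0] + w[1] * x[1] + w[2]) for w in w_h]
            y = sigmoid(sum(w_o[j] * h[j] for j in range(hidden)) + w_o[hidden])
            # Error: difference between produced and expected output.
            err = y - target
            total += err * err
            # Backward pass: output delta first, then hidden deltas
            # (computed with the output weights before they are updated).
            d_o = err * y * (1.0 - y)
            d_h = [d_o * w_o[j] * h[j] * (1.0 - h[j]) for j in range(hidden)]
            # Weight updates, from the output layer back towards the input.
            for j in range(hidden):
                w_o[j] -= lr * d_o * h[j]
            w_o[hidden] -= lr * d_o
            for j in range(hidden):
                w_h[j][0] -= lr * d_h[j] * x[0]
                w_h[j][1] -= lr * d_h[j] * x[1]
                w_h[j][2] -= lr * d_h[j]
        losses.append(total / len(data))
    return losses


losses = train_xor(epochs=500)
print(losses[0], losses[-1])
```

The per-epoch error is returned so that its decrease over the iterations can be inspected; whether the network fully solves XOR depends on the random initialization, which mirrors the local-minimum issue mentioned above.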

Deep belief networks

The deep belief network approach [66] emerged in 2006, when it outperformed kernelized support vector machines on the MNIST database of handwritten digits. It addressed some of the issues of shallow networks by introducing gradual layer-by-layer learning and the ability to train on non-annotated data.

A deep belief network (DBN) is a generative model, comprised of several layers of latent variables. The units at the lowest layer represent the input vector of the data, while the subsequent layers represent latent variables. The connections between these layers are directed in a top-down manner. In contrast, the top two layers are linked with undirected connections in order to form associative memory. The units of the latent layers can be observed as feature detectors.

Figure 2.2: An abstract representation of the DBN structure. This example shows a model with an input data layer at the bottom, followed by two latent layers (h1 and h2), and two output layers, representing high-level concepts extracted from the data. The highest two layers are connected undirectedly, whereas the latent and the input layers form top-down directed connections.

Deep belief networks reflect a hierarchy by processing the signal through several stages, extracting simple features at lower layers and modeling complex structures at higher layers. Such deep learning embodies the idea of learning the less-complex abstract representations of the data on one layer and later composing these representations into more complex high-level structures present in the data.

The model can be applied to a specific task in two stages: the first stage consists of layer-by-layer learning or pre-training of the model on a training set. At the second stage, the model is applied to the dataset of interest. Training a DBN may seem a difficult problem; however, by symmetrically connecting the hidden and output layers, the model can be viewed as a restricted Boltzmann machine [67]. Each layer of a DBN is learned independently, thus facilitating the learning process compared to the previous attempts with multi-layer artificial neural networks. The layer-wise unsupervised learning process may also be implemented by a greedy approach for weight optimization [68]. The most discernible features from different classes are stimulated. During inference over a given dataset, the information is extracted and passed from the input layer to the highest layer over a number of latent layers. The output of the highest computed DBN layer may be used as an input for standard machine-learning classification techniques. The highest output layer may also be hand-coded, depending on the problem task. For example, the output layer may contain only a single node summing all the outputs of the previous layer and applying a threshold function for a binary classification.

Convolutional neural networks

Convolutional neural networks (CNNs) also consist of an input and an output layer connected by a number of hidden layers. As the name implies, their main difference from DBNs is the convolutional layers, which apply correlations with the (learned) filters to their input and provide the resulting feature maps as outputs. Since a filter is only applied to a small portion of the input (its receptive field), it only has a small number of parameters, which is beneficial when compared to a fully-connected standard network layer. Additionally, pooling layers can be included to reduce the size of the feature maps produced by the network filters; they group and summarize blocks of activations on the previous layer into single outputs. The entire network commonly consists of tens or even hundreds of convolutional layers, optionally followed by one or more fully connected layers used for classification. Specialized CNN architectures, such as inception [69] and residual networks [70], have been introduced for specific domains.
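The two operations described above, correlation with a filter and pooling, can be sketched in a few lines of plain Python. For brevity the sketch works on a one-dimensional signal with a fixed, hand-picked filter; in an actual CNN the filter weights are learned and the inputs are usually two-dimensional or larger. The function names and the example values are assumptions made for illustration.

```python
def correlate1d(signal, kernel):
    """Valid cross-correlation: slide the kernel over the signal without flipping it."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]


def max_pool1d(feature_map, size):
    """Summarize non-overlapping blocks of `size` activations by their maximum."""
    return [max(feature_map[i:i + size])
            for i in range(0, len(feature_map) - size + 1, size)]


signal = [0, 1, 3, 2, 5, 4]
kernel = [1, -1]  # receptive field of two samples, hence only two weights

feature_map = correlate1d(signal, kernel)
pooled = max_pool1d(feature_map, 2)

print(feature_map)  # [-1, -2, 1, -3, 1]
print(pooled)       # [-1, 1]
```

The filter here has two weights regardless of the signal length, whereas a fully connected layer mapping the same six inputs to five outputs would need thirty weights (biases aside), which illustrates the parameter saving mentioned above.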

Recurrent neural networks

The networks described so far provide an abstraction of a single input entity or a small number of neighboring input entities. When observing time-domain signals, their long-term evolution is also important. To model this aspect, recurrent neural networks (RNNs) were proposed. In RNNs, feed-forward connections from lower to higher layers are complemented by feedback connections from higher to lower layers. These connections can model delays in the signal and thus act as memory-like sequence modeling units. RNNs can therefore model temporal sequences. Several recurrent network models have been introduced, such as the long short-term memory (LSTM) by Hochreiter and Schmidhuber [71].
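The memory effect of a feedback connection can be shown with a minimal sketch. It uses a single tanh unit whose previous output is fed back into itself, which is a simplification of the layer-to-layer feedback described above and is not an LSTM. The function name `rnn_forward`, the weights and the impulse input are assumptions made for illustration.

```python
import math


def rnn_forward(inputs, w_in, w_rec, bias=0.0, h0=0.0):
    """Run a single recurrent unit over a sequence and return its state at each step."""
    h = h0
    states = []
    for x in inputs:
        # The new state depends on the current input and on the previous state.
        h = math.tanh(w_in * x + w_rec * h + bias)
        states.append(h)
    return states


impulse = [1.0, 0.0, 0.0]  # a single event followed by silence

with_memory = rnn_forward(impulse, w_in=1.0, w_rec=0.5)
without_memory = rnn_forward(impulse, w_in=1.0, w_rec=0.0)

print(with_memory)     # the response to the impulse decays gradually
print(without_memory)  # the response vanishes as soon as the input does
```

With the feedback weight set to zero the unit reacts only to the current input, while a non-zero feedback weight lets the effect of an earlier input persist over the following time steps.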

Figure 2.3: An abstract representation of the CNN structure. This example shows a model with an input data layer at the bottom, followed by a convolutional layer and a max-pooling layer. The highest two layers are fully connected.

Generative adversarial networks

In 2014, Goodfellow et al. [72] proposed the generative adversarial network (GAN), a combination of two neural networks. The proposed model is an attempt to overcome two difficulties of existing deep generative networks, as expressed by the authors: the difficulty of approximating many intractable probabilistic computations which arise in maximum likelihood estimation and related strategies; and the difficulty of leveraging the benefits of piece-wise linear units in the generative context. The approach consists of two models: a generative model G and a discriminative model D. While the generative model captures the data distribution and generates samples based on a latent space, the discriminative model determines whether a sample originates from G’s distribution or from the data distribution. The model D is trained to maximize the probability of assigning the correct label to both training samples and samples generated by G. The model G is trained to minimize the difference between its own distribution and the training data distribution, thus trying to fool D. GANs have been mainly applied to computer vision problems, such as video generation (e.g. [73]) and object categorization (e.g. [74]).
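The opposing goals of D and G can be made concrete with the value function from the original GAN formulation, V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))], which D tries to maximize and G tries to minimize. The sketch below only evaluates this quantity for given discriminator outputs and does not train any network. The function name `gan_value` and the example probabilities are assumptions made for illustration.

```python
import math


def gan_value(d_real, d_fake):
    """Evaluate V(D, G) from D's outputs on real samples and on samples generated by G.

    d_real: probabilities D assigns to real samples being real.
    d_fake: probabilities D assigns to generated samples being real.
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term


# D separates real from generated samples well: high value, good for D.
confident = gan_value(d_real=[0.9, 0.8], d_fake=[0.1, 0.2])

# D cannot tell the two apart and outputs 0.5 everywhere: value is -log(4).
fooled = gan_value(d_real=[0.5, 0.5], d_fake=[0.5, 0.5])

print(confident, fooled)
```

A discriminator that outputs 0.5 for every sample corresponds to the situation in which G's distribution matches the data distribution, so G has succeeded in fooling D.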

In document Compositional hierarchical model for music information retrieval (Page 31-35)