6.4 Structural Annotation using a multi-label CNN

6.4.3 Data augmentation

A straightforward method of training the network would consist of extracting all available feature matrices from a given training set and passing them, together with the corresponding labels, to the network. This approach has the drawback that the network is presented with a large number of training instances per song relative to the number of available recordings. More specifically, a hop size of 14.3 ms yields over 12500 training examples for a 3-minute recording. The number of available songs, however, is expected to be far lower, given that the training stage requires manual annotations (here we train on 80 manually annotated recordings). As a consequence, the network is prone to learning the timbral, melodic and harmonic content of these songs instead of generic characteristics which generalise well to unseen data.

Fig. 6.10 Illustration of the feature extraction stage.

Therefore, we limit the total number of training instances to 40000, which are randomly drawn from the training data. We assume that this amount covers a representative sample of the training space. We furthermore apply a set of randomised data augmentation techniques to each feature matrix to increase the variety presented to the network and thus prevent overfitting. The three augmentation strategies applied here are intended to mimic the natural variety among a large number of recordings with respect to pitch range, key transposition, recording quality, timbre and tempo. These methods have been shown to increase the detection accuracy and prevent overfitting in the context of vocal detection [180].
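The random draw described above can be sketched as follows. This is a minimal illustration and not the implementation used in this study: the function name `draw_training_subset`, the array layout (one feature matrix per row of `features`) and the fixed seed are assumptions made for the example.

```python
import numpy as np

def draw_training_subset(features, labels, n_instances=40000, seed=0):
    """Randomly draw at most n_instances (feature matrix, label) pairs.

    Sampling is done without replacement; the seed is an assumption
    added here only to make the sketch reproducible.
    """
    rng = np.random.default_rng(seed)
    n_available = len(features)
    n_draw = min(n_instances, n_available)
    idx = rng.choice(n_available, size=n_draw, replace=False)
    return features[idx], labels[idx]

# Number of frames in a 3-minute recording at a hop size of 14.3 ms
frames_per_song = int(180.0 / 0.0143)
```

With 80 training recordings of a few minutes each, the pool of candidate frames is far larger than 40000, so most frames of any given song are never shown to the network.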

More specifically, we apply the following three augmentation stages to the spectrogram of each training instance:

• Frequency-domain filtering using a Gaussian-shaped filter response following the equation

\[ f(x) = d \cdot e^{-0.5 \cdot \frac{(x-\mu)^2}{\sigma^2}} \tag{6.34} \]

where µ corresponds to a randomly chosen frequency bin between 150 Hz and 8 kHz, σ is randomly chosen between 5 and 7 semitones, and the attenuation d is randomly chosen between +10 and −10 dB.

• Pitch shifting with a randomly chosen stretching factor between 0.8 and 1.2.

As described in [180], both time stretching and pitch shifting can be efficiently implemented as affine image transformations on the two-dimensional spectrogram representation, and frequency filtering is achieved by simply multiplying each frame of the spectrogram by the randomly generated frequency response.
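The frequency-domain filtering stage can be sketched as below. Several details are assumptions that the text does not specify: the centre frequency is drawn uniformly in Hz, the distance x is measured in semitones relative to µ, the gain in dB is converted to a linear amplitude factor, and the centre frequency of every spectrogram bin is available as an array (here called `bin_freqs_hz`). The function names are hypothetical.

```python
import numpy as np

def random_gaussian_filter(bin_freqs_hz, rng):
    """Draw a random Gaussian-shaped frequency response (linear gain per bin)."""
    mu_hz = rng.uniform(150.0, 8000.0)   # centre frequency (assumed uniform in Hz)
    sigma_st = rng.uniform(5.0, 7.0)     # width in semitones
    d_db = rng.uniform(-10.0, 10.0)      # peak gain or attenuation in dB
    # Distance of every bin from the centre frequency, in semitones
    x = 12.0 * np.log2(bin_freqs_hz / mu_hz)
    gain_db = d_db * np.exp(-0.5 * x**2 / sigma_st**2)
    # Assumption: the spectrogram holds linear magnitudes, hence the factor 20
    return 10.0 ** (gain_db / 20.0)

def apply_filter(spectrogram, response):
    """Multiply every frame (column) of a (bins x frames) spectrogram by the response."""
    return spectrogram * response[:, None]

rng = np.random.default_rng(0)
bin_freqs_hz = np.geomspace(50.0, 11025.0, 128)  # hypothetical bin centres
spectrogram = np.ones((128, 22))
augmented = apply_filter(spectrogram, random_gaussian_filter(bin_freqs_hz, rng))
```

Because the response is drawn anew for each training instance, the same excerpt is presented with a different spectral colouring every time it is used.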

6.4.4 CNN architecture

The CNN architecture employed in this study, shown in Figure 6.11, comprises three main blocks. In the first block, each 128 × 22 input matrix is passed through two consecutive convolutional layers with 64 and 32 convolutional masks (feature detectors) of size 3 × 3 and then subsampled by 3 × 3 max pooling. The second block consists of two further convolutional layers with 128 and 32 masks, again followed by a 3 × 3 max pooling layer. The output of each convolutional operation is passed through a rectified linear unit (ReLU) function. The resulting 32 × 11 × 7 tensor is then unfolded (flattened), yielding a 4928 × 1 one-dimensional representation, which is fed as input to the third block, a standard fully connected feed-forward neural network. This network consists of two hidden layers of 256 and 64 units and a sigmoid output layer holding the four units described above.
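The fully connected part of the network can be illustrated with a small NumPy forward pass. The layer sizes (4928, 256, 64 and 4) are taken from the description above; the ReLU activation in the hidden layers, the random initialisation and all function names are assumptions made for this sketch, and the convolutional blocks are omitted.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_head(rng, sizes=(4928, 256, 64, 4)):
    """Create random weights for the dense layers (initialisation is an assumption)."""
    params = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        weights = rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_out))
        params.append((weights, np.zeros(n_out)))
    return params

def head_forward(params, flat):
    """Map flattened feature vectors (batch x 4928) to four sigmoid outputs."""
    h = flat
    for weights, bias in params[:-1]:
        h = relu(h @ weights + bias)     # hidden activation assumed to be ReLU
    weights, bias = params[-1]
    return sigmoid(h @ weights + bias)   # one independent sigmoid unit per class

rng = np.random.default_rng(0)
params = init_head(rng)
outputs = head_forward(params, rng.normal(size=(3, 4928)))  # shape (3, 4)
```

Each output unit is an independent sigmoid, so several classes can be active for the same frame.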

Fig. 6.11 Illustration of the CNN architecture.

As mentioned earlier, we aim at detecting solo strummed and picked guitar sections and discard the guitar playing technique when the singing voice is present. We furthermore assume that (as is the case in the vast majority of classical flamenco recordings) only a single guitar is present. Consequently, the three classes vocals, strummed guitar and picked guitar are, conceptually speaking, mutually exclusive. The presence of the palmas, however, is independent of the presence of the other three components. We therefore model the problem as a multi-label classification task and do not hard-code the aforementioned mutual exclusivity into the network. Instead, we assume that the network will learn this property, and we deal with ambiguous output in the post-processing stage described below. This design choice has the additional advantage that it allows us to detect silence, which is encoded in the output as all sigmoid units predicting the zero class.

Initial experiments have furthermore shown that the presented multi-label architecture yields performance similar to that of an ensemble of four CNNs, where each CNN solves a binary task, focusing on a single instrumental component.

6.4.5 Post-processing

As previously described, when an input image is given, the output at the four sigmoid units is interpreted as the multi-label prediction for the time instant corresponding to the middle of the respective frame subsequence. Since vocals, strummed guitar and picked guitar are mutually exclusive by definition, it is necessary to define a procedure for the case of ambiguous output, when more than one of these classes is predicted at a given time instant. In this case, we simply set the label of the node with the highest activation to one and all remaining nodes to zero. Note that, as described above, the case when none of the classes is predicted as true is valid, since it corresponds to silence.
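The disambiguation rule can be sketched as follows. The order of the output units, the decision threshold of 0.5 and the function name are assumptions for illustration. The sketch also reads the rule as applying only to the three mutually exclusive classes, leaving the palmas unit untouched because it is independent of the others.

```python
import numpy as np

# Assumed (hypothetical) order of the output units:
# 0 = vocals, 1 = strummed guitar, 2 = picked guitar, 3 = palmas
EXCLUSIVE = [0, 1, 2]

def resolve_ambiguity(activations, threshold=0.5):
    """Convert (frames x 4) sigmoid activations to binary labels.

    If more than one of the mutually exclusive classes exceeds the
    threshold, only the one with the highest activation is kept.
    """
    activations = np.asarray(activations, dtype=float)
    labels = (activations >= threshold).astype(int)
    ambiguous = labels[:, EXCLUSIVE].sum(axis=1) > 1
    rows = np.flatnonzero(ambiguous)
    winners = activations[rows][:, EXCLUSIVE].argmax(axis=1)
    labels[np.ix_(rows, EXCLUSIVE)] = 0
    labels[rows, np.asarray(EXCLUSIVE)[winners]] = 1
    return labels

activations = [
    [0.90, 0.70, 0.10, 0.80],  # vocals and strummed guitar both active
    [0.20, 0.10, 0.30, 0.60],  # palmas only
    [0.10, 0.20, 0.30, 0.10],  # nothing active: silence
    [0.40, 0.60, 0.95, 0.20],  # strummed and picked guitar both active
]
labels = resolve_ambiguity(activations)
# rows: [1 0 0 1], [0 0 0 1], [0 0 0 0], [0 0 1 0]
```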

6.4.6 Classifier training

The proposed segmentation system is trained and evaluated on a set of 100 recordings which were manually annotated in the scope of this study. All songs are commercial flamenco recordings taken from the corpusCOFLA [102] collection. The annotated dataset was split into training (80 songs), validation (10 songs) and test (10 songs) sets. In order to create a realistic scenario, the splits are artist-filtered, meaning that no artist appears in more than one split. This standard measure is taken to reduce the risk that the classifiers will overfit to the timbral characteristics of a particular singer or recording style.

After the feature extraction stage is carried out on the aforementioned dataset, the network is trained for a maximum of 100 epochs using the Adam [96] algorithm to minimise the mean squared error over the training set. During each training epoch, the training images are shuffled and grouped into mini-batches of 128 images. An early-stopping criterion is also used, which terminates the training procedure if the loss value does not decrease significantly (by at least 0.01) during a patience period of 5 epochs.
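The early-stopping criterion can be sketched as below. The text does not state which loss is monitored or what the decrease is measured against; this sketch assumes that each epoch's loss is compared with the best value seen so far. The class and function names are hypothetical.

```python
class EarlyStopping:
    """Stop when the loss has not improved by min_delta for `patience` epochs."""

    def __init__(self, min_delta=0.01, patience=5):
        self.min_delta = min_delta
        self.patience = patience
        self.best = float("inf")
        self.wait = 0

    def should_stop(self, loss):
        if self.best - loss >= self.min_delta:
            # Significant improvement: remember it and reset the counter
            self.best = loss
            self.wait = 0
        else:
            self.wait += 1
        return self.wait >= self.patience

def run_training(epoch_losses, max_epochs=100):
    """Return the number of epochs run for a given sequence of epoch losses."""
    stopper = EarlyStopping()
    for epoch, loss in enumerate(epoch_losses[:max_epochs], start=1):
        if stopper.should_stop(loss):
            return epoch
    return min(len(epoch_losses), max_epochs)

# The loss stalls after the second epoch, so training stops after epoch 7
epochs_run = run_training([0.50, 0.40, 0.395, 0.394, 0.393, 0.392, 0.391, 0.390])
```

In an actual training loop, `epoch_losses` would be replaced by the loss computed after each pass over the shuffled mini-batches.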

6.4.7 Baseline methods

We evaluate the performance of the proposed method for each of the target classes separately by computing the binary classification accuracy as the percentage of correctly classified frames for a particular class over the test set.
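The evaluation measure amounts to a column-wise comparison of predicted and annotated frame labels. A minimal sketch, assuming both are stored as binary (frames × classes) arrays, with a hypothetical function name and toy data:

```python
import numpy as np

def per_class_accuracy(y_true, y_pred):
    """Fraction of correctly classified frames, computed separately per class."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return (y_true == y_pred).mean(axis=0)

y_true = [[1, 0, 0, 1], [0, 1, 0, 1], [0, 0, 1, 0], [0, 0, 0, 0]]
y_pred = [[1, 0, 0, 1], [1, 0, 0, 1], [0, 0, 1, 1], [0, 0, 0, 0]]
accuracy = per_class_accuracy(y_true, y_pred)  # one value per class
```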

In order to assess the advantage of using a deep network and to estimate the difficulty of the task itself, we furthermore compare the proposed method to the performance of an ensemble of shallow classifiers. More specifically, we train a separate classifier based on the vocal detection method described in [188] for each of the four tasks. The method uses one Gaussian Mixture Model (GMM) per class, trained on frame-level mel-frequency cepstral coefficients and log frequency power coefficients. For a detailed description, we refer to [188]. We furthermore smooth the resulting decision sequence with a median filter of 0.5 s length.

Method      vocals   palmas   strummed guitar   picked guitar
proposed    97%      96%      93%               95%
baseline    88%      77%      66%               69%

Table 6.2 Experimental results: classification accuracy in % for all four tasks.
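The median smoothing applied to the baseline decisions can be sketched as follows. The frame rate of the baseline is not stated here, so the sketch assumes the 14.3 ms hop size used elsewhere in this section, which gives a window of 35 frames for 0.5 s. The edge padding and the function name are further assumptions.

```python
import numpy as np

def median_smooth(decisions, hop_s=0.0143, length_s=0.5):
    """Smooth a binary frame-level decision sequence with a median filter."""
    decisions = np.asarray(decisions)
    k = int(round(length_s / hop_s))
    if k % 2 == 0:
        k += 1                      # odd window keeps the output binary
    half = k // 2
    # Repeat the first and last decision so the output keeps the input length
    padded = np.pad(decisions, half, mode="edge")
    windows = np.lib.stride_tricks.sliding_window_view(padded, k)
    return np.median(windows, axis=1).astype(decisions.dtype)

sequence = np.zeros(200, dtype=int)
sequence[50:150] = 1    # a sustained active segment
sequence[100] = 0       # single-frame dropout inside the segment
sequence[10] = 1        # isolated spurious detection
smoothed = median_smooth(sequence)
```

Isolated single-frame errors such as the two in the toy sequence are removed, while the sustained segment is preserved.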

All baseline methods and the CNN approach are trained and evaluated using the same split into training, validation and test sets.

6.4.8 Experimental results

The results of the proposed CNN-based system are shown in Table 6.2 together with the performance obtained using the GMM-based baseline method. For all four tasks, the deep convolutional architecture yields a binary classification accuracy above 90%. The performance of the shallow classifier is considerably lower, in particular for the two tasks involving guitar playing techniques: the obtained classification accuracy is only 66% for strummed guitar detection and 69% for picked guitar detection. In comparison, the deep network yields classification accuracies of 93% and 95%, respectively.

These results indicate that the proposed system can generate reliable automatic annotations, which can form the basis for related computational tasks and can be used in data-driven exploratory studies (see Section 6.5). The experiment has furthermore demonstrated the capability of CNNs to solve complex audio-based classification tasks for which standard shallow algorithms yield unsatisfactory results.

In document Flamenco music information retrieval. (Page 168-172)