6.4 Structural Annotation using a multi-label CNN
6.4.9 What did the networks learn?
A heavily criticised aspect of deep learning, and to a certain extent of machine learning algorithms in general, is the fact that the resulting computational model is not interpretable by humans. While all learned weights of a pre-trained deep architecture are accessible, the sheer number of parameters, which can be in the range of several millions, and the increasing level of abstraction throughout the layers make it impossible for humans to understand what the network learned and which characteristics of the data it considers when making a decision. As a result, deep networks have gained a reputation as "well-performing black boxes".
This aspect is problematic in various ways. While deep architectures can solve many complex problems, they do not provide the means for scientists to gain deeper knowledge of the problem itself. As a result, the problem solution does not yield a better understanding of the domain. In addition, the lack of transparency makes it difficult for engineers to estimate the robustness of a model with respect to unseen data characteristics, and the development and fine-tuning of deep networks is often reduced to numerous iterations of trial and error.
Recently, a number of visualisation techniques have been proposed in the image processing community in order to overcome these limitations and allow developers and users to better understand how a deep network operates and what it has learned [184, 55, 237]. In the context of image classification, activation maps or salience maps [237] have proven to be a useful tool for validating that a network bases its decision on relevant pixel areas within an input image.
Another, more generally applicable strategy is activation maximisation. The idea is to artificially generate images which, when passed as input to the network, cause a high activation in a specific feature map in a particular convolutional layer [55, 184]. In this way, it is possible to estimate which characteristics of the input image are targeted by the corresponding convolutional filter. Case studies on networks trained for image classification (footnote 6) and handwritten character recognition [55] have revealed that first layer filters detect elementary structural components, such as edges of different orientation and shape. Deeper layers tend to target more complex textures, consisting of combinations of the basic shapes learned in prior layers.
Here, we apply the activation maximisation strategy to the audio domain and attempt to gain insight into the underlying mechanisms of the trained multi-layer CNN by generating input feature matrices which maximise filter maps of the first and third convolutional layer. Visualisations of the original input space (Figure 6.12) show that the presence of the targeted instrumental components can not only be heard, but also visually identified in the feature representation. We therefore directly follow the method described in [55] to investigate to what extent the structure of our network is similar to that of networks trained for image classification and handwritten character recognition.
Let $x$ be the network input and $A_n^{(l)}$ be the feature map of size $(I \times J)$ created by the $n$th filter of the $l$th convolutional layer. Assuming that the network is trained and all parameters are fixed, the values observed in the feature map $A_n^{(l)}[i, j](x)$ depend only on the input image $x$. Mathematically speaking, we aim to generate an image $x^*$ such that

$$x^* = \arg\max_x \sum_{i,j} A_n^{(l)}[i, j](x). \tag{6.35}$$

6 https://blog.keras.io/how-convolutional-neural-networks-see-the-world.html
Fig. 6.12 Log-mel features extracted from audio excerpts containing (a) vocals, (b) vocals and palmas, (c) picked guitar and (d) strummed guitar.
While this is a non-convex optimisation problem, we attempt to find at least a local maximum by employing gradient ascent based on the score function
$$\xi(x) = \sum_{i,j} A_n^{(l)}[i, j](x) - r \cdot \lVert x \rVert \tag{6.36}$$
where the second term is an $\ell_2$-norm with regularisation weight parameter $r$. Starting with random pixel values for $x$, at each step we compute the gradient $\frac{\delta \xi(x)}{\delta x}$ w.r.t. the input image and update the pixel values as follows

$$x := x + \beta \cdot \frac{\delta \xi(x)}{\delta x} \tag{6.37}$$
where $\beta$ is the learning rate. The values for parameters $\beta$ and $r$ were determined empirically by observing convergence for different settings. The regularisation parameter was set to $r = 1 \cdot 10^{-5}$ and the learning rate was set to $\beta = 0.001$ for maximising the first layer activation and to $\beta = 0.1$ for maximising the third layer activation.
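The procedure can be illustrated with a minimal, dependency-free Python sketch. It is not the implementation used for the experiments: the trained CNN is replaced by a single hypothetical 3 × 3 edge filter followed by a ReLU, so that the gradient of the score function can be written out by hand, whereas for a real multi-layer network it would normally be obtained through automatic differentiation in a deep learning framework. The names `feature_map`, `score`, `score_gradient`, `maximise_activation` and `EDGE_KERNEL`, the 8 × 8 input size, the initialisation range and the number of steps are all assumptions made for illustration; only the score function, the update rule and the values of β and r are taken from the text.

```python
import math
import random


def feature_map(x, kernel):
    """Valid 2-D cross-correlation followed by ReLU (toy stand-in for one feature map)."""
    kh, kw = len(kernel), len(kernel[0])
    rows, cols = len(x) - kh + 1, len(x[0]) - kw + 1
    return [[max(0.0, sum(kernel[a][b] * x[i + a][j + b]
                          for a in range(kh) for b in range(kw)))
             for j in range(cols)]
            for i in range(rows)]


def l2_norm(x):
    """Euclidean norm of all pixel values."""
    return math.sqrt(sum(v * v for row in x for v in row))


def score(x, kernel, r):
    """Score function: summed feature-map activation minus r times the norm of x."""
    return sum(sum(row) for row in feature_map(x, kernel)) - r * l2_norm(x)


def score_gradient(x, kernel, r):
    """Hand-derived gradient of the score w.r.t. every input pixel (toy model only)."""
    kh, kw = len(kernel), len(kernel[0])
    grad = [[0.0] * len(x[0]) for _ in x]
    for i, row in enumerate(feature_map(x, kernel)):
        for j, activation in enumerate(row):
            if activation > 0.0:  # ReLU passes gradient only where it is active
                for a in range(kh):
                    for b in range(kw):
                        grad[i + a][j + b] += kernel[a][b]
    norm = l2_norm(x)
    if norm > 0.0:  # gradient of r * ||x|| is r * x / ||x||
        for i, row in enumerate(x):
            for j, v in enumerate(row):
                grad[i][j] -= r * v / norm
    return grad


def maximise_activation(kernel, shape, beta, r, steps, seed=0):
    """Gradient ascent on the score, starting from random pixel values."""
    rng = random.Random(seed)
    x = [[rng.uniform(-0.1, 0.1) for _ in range(shape[1])]
         for _ in range(shape[0])]
    history = [score(x, kernel, r)]
    for _ in range(steps):
        grad = score_gradient(x, kernel, r)
        # Update rule: x := x + beta * gradient
        x = [[v + beta * g for v, g in zip(x_row, g_row)]
             for x_row, g_row in zip(x, grad)]
        history.append(score(x, kernel, r))
    return x, history


# Hypothetical vertical-edge filter; NOT a filter learned by the network.
EDGE_KERNEL = [[-1.0, 0.0, 1.0] for _ in range(3)]

x_star, history = maximise_activation(
    EDGE_KERNEL, shape=(8, 8), beta=0.1, r=1e-5, steps=50)
print("initial score:", history[0])
print("final score:", history[-1])
```

Plotting `x_star` as an image gives the kind of visualisation discussed in this section, here for the toy filter only. Because the toy score is piecewise linear, it keeps growing with further steps instead of settling at a local maximum, so the number of steps acts as the stopping criterion in this sketch.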
Figures 6.13 and 6.14 show the resulting images for several convolutional filters of the first and third layer, respectively. Similar to the aforementioned image processing task, it can be seen that the first layer filters appear to focus on basic shapes, such as different types of edges.
Fig. 6.13 Input feature matrices which maximise selected first layer filter maps.
The third layer filters seem to combine these elements into more complex textures. It is interesting to see that some of these textures show strong similarities with instrument-specific spectral patterns: the wave-shaped patterns in filter 1 bear a striking resemblance to the parallel continuous contours caused by vocal vibrato (Figures 6.12 (a) and (b)). Furthermore, the patterns in filters 34 and 53 are somewhat similar to the spectra produced by strummed guitar sections (Figure 6.12 (d)), where the percussive onset produces vertical lines across the spectrum, followed by parallel horizontal lines caused by the notes contained in the chord and their harmonics.
Fig. 6.14 Input feature matrices which maximise selected third layer filter maps.