2.5 Background on tools, techniques and methods
2.5.2 Models
Four different models were developed for both the speech and music emotion recognition experiments: a Support Vector Machine (SVM), a Random Forest (RF), a Multilayer Perceptron (MLP) and a Deep Neural Network (D-NN). For the speech emotion recognition experiments, every model is a classifier that produces categorical emotion labels as output. For the music emotion recognition experiments, every model is instead a two-value regressor that produces two numerical outputs representing valence and arousal.
The first three models were built using ‘scikit-learn’, a popular Python-based machine learning library. The D-NN was built using ‘Keras’, a popular deep learning library also in Python. Some background on each model is given below, explaining how they function and showcasing their differences to readers unfamiliar with these types of models.
Support Vector Machine
A Support Vector Machine (SVM) uses hyperplanes in the feature vector space, but it uses them differently for classification and regression. For classification, the hyperplanes split the data points (feature vectors) into the possible classes. They serve as decision boundaries: an input vector is assigned to one class or another depending on which side of the hyperplane it lies.
For classification, the quality of a linear split is measured by summing the distances from the hyperplane to the closest sample of each class, where a larger total distance is better. For regression, the hyperplanes serve as a function that estimates the regression value, where the goal is to find a curve (dictated by the hyperplanes) that minimizes the deviation of the data points from it. A lower deviation means that the regression function lies closer to the data points on average, so the values it produces are more accurate.
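The split-quality measure described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the scikit-learn implementation: the function names, the toy data and the two candidate hyperplanes are all assumptions made for the example, and both hyperplanes are assumed to separate the two classes.

```python
import math

def distance_to_hyperplane(w, b, x):
    """Distance from point x to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return abs(sum(wi * xi for wi, xi in zip(w, x)) + b) / norm

def split_quality(w, b, samples):
    """Sum, over the classes, of the distance of the closest sample to the hyperplane.

    `samples` is a list of (feature_vector, label) pairs.
    """
    closest = {}
    for x, label in samples:
        d = distance_to_hyperplane(w, b, x)
        closest[label] = min(d, closest.get(label, math.inf))
    return sum(closest.values())

# Toy 2-D data with two classes (illustrative values only).
samples = [((2.0, 2.0), 1), ((3.0, 3.0), 1), ((0.0, 0.0), -1), ((-1.0, 0.0), -1)]

# Two hand-picked hyperplanes (lines in 2-D) that both separate the classes.
quality_a = split_quality((1.0, 1.0), -2.0, samples)  # the line x1 + x2 = 2
quality_b = split_quality((1.0, 0.0), -1.0, samples)  # the line x1 = 1

# The first hyperplane leaves more room to the closest samples, so it scores higher.
best = "a" if quality_a > quality_b else "b"
```

An actual SVM does not compare a handful of candidates like this; it searches for the hyperplane that maximizes this kind of distance during training.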
By default these hyperplanes split the space linearly, but in most cases the data is not linearly separable. This is often solved by applying a kernel that maps all data points (feature vectors) non-linearly to a new space, in which the model then tries to find suitable hyperplanes. Mapped back to the original feature space, these hyperplanes appear as non-linear decision boundaries, as depicted in Figure 2.6.
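The effect of such a mapping can be illustrated with a small sketch. It assumes an explicit, hand-written feature map and hand-picked hyperplane parameters; a real kernel SVM applies the mapping implicitly through a kernel function and learns the hyperplane from data. All names and values below are hypothetical.

```python
def feature_map(x):
    """Explicit non-linear map from a 1-D input to a 2-D space: x -> (x, x^2)."""
    return (x, x * x)

def classify(w, b, z):
    """Return +1 or -1 depending on the side of the hyperplane w.z + b = 0 on which z lies."""
    score = sum(wi * zi for wi, zi in zip(w, z)) + b
    return 1 if score >= 0 else -1

# 1-D toy data: the outer points belong to class +1, the inner points to class -1.
# No single threshold on x separates them, so they are not linearly separable.
points = [-2.0, -0.5, 0.5, 2.0]
labels = [1, -1, -1, 1]

# In the mapped space the hyperplane x^2 = 1 (w = (0, 1), b = -1) separates the classes.
w, b = (0.0, 1.0), -1.0
predictions = [classify(w, b, feature_map(x)) for x in points]
```

Seen from the original 1-D space, this single linear boundary in the mapped space corresponds to two thresholds (x = -1 and x = 1), which is the kind of non-linear boundary a kernel makes possible.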
Random Forest
Random Forest models are based on decision trees. In a decision tree, the input is passed down iteratively to either the left or the right branch, depending on whether the input (feature vector) meets a certain condition. After a certain number of splits a leaf is reached, which holds a categorical label when the Random Forest is a classifier, or a numerical value when it is a regressor. An example of a simple decision tree is depicted in Figure 2.7.
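The traversal described above can be sketched as follows. The tree structure, the feature names (`tempo`, `loudness`), the thresholds and the labels are hypothetical and only serve to illustrate how an input reaches a leaf; they are not taken from the trained models.

```python
def predict(node, features):
    """Walk from the root to a leaf and return the value stored in that leaf."""
    while "value" not in node:
        # Go left when the condition holds, otherwise go right.
        if features[node["feature"]] <= node["threshold"]:
            node = node["left"]
        else:
            node = node["right"]
    return node["value"]

# A hand-built classification tree with two splits and three leaves.
tree = {
    "feature": "tempo", "threshold": 100.0,
    "left": {"value": "sad"},
    "right": {
        "feature": "loudness", "threshold": 0.5,
        "left": {"value": "calm"},
        "right": {"value": "happy"},
    },
}

label_slow = predict(tree, {"tempo": 80.0, "loudness": 0.9})
label_fast_loud = predict(tree, {"tempo": 140.0, "loudness": 0.8})
```

For a regressor the same traversal applies; the leaves would simply store numerical values instead of labels.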
Type | Feature class
Loudness [1] | average loudness
Complexity [1] | dynamic complexity
Silence [27] | silence rate 20dB, silence rate 30dB, silence rate 60dB
Spectral [252] | spectral rms, spectral flux, spectral centroid, spectral kurtosis, spectral spread, spectral skewness, spectral rolloff, spectral decrease, spectral strongpeak, spectral energy, spectral energyband low, spectral energyband middle low, spectral energyband middle high, spectral energyband high, spectral entropy, spectral complexity, spectral contrast coeffs, spectral contrast valleys
Barkbands [288] | barkbands, barkbands crest, barkbands flatness db, barkbands kurtosis, barkbands skewness, barkbands spread
Melbands [405] | melbands, melbands128, melbands crest, melbands flatness db, melbands kurtosis, melbands skewness, melbands spread
Erbbands [405] | erbbands, erbbands crest, erbbands flatness db, erbbands kurtosis, erbbands skewness, erbbands spread
Other [720] | mfcc, gfcc, dissonance, pitch salience
Rhythm [121] | beats count, beats loudness, beats loudness band ratio, bpm histogram first peak bpm, bpm histogram first peak spread, bpm histogram first peak weight, bpm histogram second peak bpm, bpm histogram second peak spread, bpm histogram second peak weight, onset rate, danceability
Tonal [413] | hpcp, thpcp, hpcp entropy, hpcp crest, key temperley, key krumhansl, key edma, chords strength, chords histogram, chords changes rate, chords number rate, chords key, chords scale, tuning frequency, tuning diatonic strength, tuning equal tempered deviation, tuning nontempered energy ratio
Table 2.5: A list of all features included in Essentia’s ‘essentia streaming extractor music.exe’ (based on the output file)
A Random Forest consists of a large number of decision trees that together form an ensemble. Every decision tree in the ensemble makes a prediction based on the same feature vector. The prediction of the ensemble is the most frequently predicted class for a classifier, or the mean of the tree outputs for a regressor.
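The two ways of combining the tree predictions can be sketched as below. The "trees" here are hypothetical stand-in functions with fixed outputs rather than trained decision trees, and the helper names are assumptions; the sketch only shows the aggregation step.

```python
from collections import Counter
from statistics import mean

def forest_classify(trees, features):
    """Return the class label predicted by the largest number of trees."""
    votes = [tree(features) for tree in trees]
    return Counter(votes).most_common(1)[0][0]

def forest_regress(trees, features):
    """Return the per-output mean of the tree predictions (here: valence, arousal)."""
    outputs = [tree(features) for tree in trees]
    return tuple(mean(column) for column in zip(*outputs))

# Stand-in classification trees: each ignores its input and returns a fixed label.
class_trees = [lambda f: "happy", lambda f: "sad", lambda f: "happy"]

# Stand-in regression trees: each returns a fixed (valence, arousal) pair.
value_trees = [lambda f: (0.2, 0.5), lambda f: (0.4, 0.7), lambda f: (0.6, 0.9)]

features = [0.1, 0.2, 0.3]  # the same feature vector is given to every tree
ensemble_label = forest_classify(class_trees, features)
ensemble_values = forest_regress(value_trees, features)
```

With these stand-ins, two of the three classification trees agree, so their label wins the vote, while the regression output is the average valence and the average arousal over the three trees.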
Multilayer perceptron
The multilayer perceptron (MLP) is a type of feed-forward artificial neural network. It consists of at least three layers: the input layer, one or more hidden layers and the output layer. Each node in the hidden and output layers is a perceptron. The perceptrons use non-linear activation functions, and because at least two layers of the model always consist of perceptrons, the model can make non-linear separations. These non-linear separations determine which class a given sample belongs to when the model is a classifier, and which numerical value is predicted when it is a regressor. MLPs are always feed-forward, unlike some other neural networks, such as recurrent networks. This means that the output of one layer only affects layers deeper in the model, not earlier layers or the layer itself.
Each node (perceptron) in the hidden and output layers is a linear classifier that multiplies its input x (for the first hidden layer, the feature vector) by a set of weights w and adds a bias b. The result is passed through a non-linear activation function φ to produce a single output. The function of a single perceptron can be written as follows:
$$y = \varphi\left(\sum_{i=1}^{n} w_i x_i + b\right) \qquad (2.1)$$
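Equation 2.1 translates almost directly into code. The sketch below is a plain-Python illustration, not the scikit-learn or Keras implementation; the function name and the choice of tanh as the activation function φ are assumptions made for the example.

```python
import math

def perceptron(x, w, b, activation=math.tanh):
    """Compute y = activation(sum_i w_i * x_i + b) for a single perceptron."""
    if len(x) != len(w):
        raise ValueError("x and w must have the same length")
    weighted_sum = sum(wi * xi for wi, xi in zip(w, x)) + b
    return activation(weighted_sum)

# Two inputs whose weighted sum is 0.5 * 1.0 + (-0.25) * 2.0 + 0.0 = 0.0.
y = perceptron(x=[1.0, 2.0], w=[0.5, -0.25], b=0.0)
```

Because the weighted sum in this example is zero and tanh(0) = 0, the perceptron outputs 0.0; other weights, biases or activation functions give other outputs.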
Figure 2.6: Support Vector Classifier hyperplanes depicted in four different spaces using the Iris Flower dataset. Each of the four visualizations shows the use of a different kernel, which maps the data points and the hyperplanes used for class separation into a different space. (Scikit-learn, 2007)
Deep neural network
A network is deep when it has more than one hidden layer, so an MLP can also be considered a D-NN if it has more than one hidden layer. The advantage of a ‘deep’ model is that each hidden layer transforms its input into a more abstract representation usable by the next layer, allowing the model to work at different levels of abstraction. This can be beneficial when learning the correct mapping from input to prediction. Our D-NN uses multiple hidden layers, but no special learning techniques. The MLP uses a single hidden layer, as it would otherwise be too similar to the D-NN model.
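The difference between a single hidden layer and multiple hidden layers can be sketched as a feed-forward pass over a list of layers. This is a minimal plain-Python illustration with hand-picked weights; the layer sizes, the ReLU activation and all names are assumptions and do not reflect the configuration of the models built with scikit-learn and Keras.

```python
def relu(z):
    """Non-linear activation used for the hidden layers in this sketch."""
    return max(0.0, z)

def identity(z):
    """Linear output activation, as is common for regression outputs."""
    return z

def layer_forward(inputs, weights, biases, activation):
    """Apply one layer: every node computes activation(w . inputs + b)."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]

def network_forward(x, layers):
    """Feed-forward pass: the output of each layer is the input of the next."""
    for weights, biases, activation in layers:
        x = layer_forward(x, weights, biases, activation)
    return x

hidden_1 = ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0], relu)
hidden_2 = ([[2.0, 0.0], [-1.0, 1.0]], [0.0, 0.0], relu)
output = ([[1.0, 1.0], [1.0, -1.0]], [0.0, 0.5], identity)  # two outputs

x = [1.0, -1.0]
shallow_out = network_forward(x, [hidden_1, output])         # one hidden layer
deep_out = network_forward(x, [hidden_1, hidden_2, output])  # two hidden layers
```

The same `network_forward` function serves both cases: the only difference between the shallow and the deep variant is the number of hidden layers in the list.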
Deep neural networks can take many forms. A model can be supervised, semi-supervised or unsupervised and there exist many learning architectures. Most of these aspects are too complex to cover here. We recommend reading the work by LeCun et al. (2015) for more in-depth information.
Figure 2.7: A simple decision tree (Victor, 2019)