
2. Music Data Analysis

2.4. Classification

2.4.1. Decision trees and random forest

A classification method that generates perhaps the most interpretable models is the decision tree. Figure 2.11 illustrates an example from the study described later in Section 5.1.1. Two features, or attributes, are used here to categorise chords into those with guitar (category ‘Guitar’) and those without guitar (category ‘NOT Guitar’). The starting tree node is called the root, and in each node a decision is made depending on whether the corresponding feature value is above or below a certain threshold. In general, each node may have more than two successors, and more complex queries are also possible, e.g., ‘if (feature 1 < 0.5) AND (feature 2 = 0.4), go to the left child node’. The tree leaves contain the instances that are identified by the attribute queries on the path from the root to a leaf. The tree from Fig. 2.11 uses only two features and tolerates some misclassifications as a strategy against overfitting (see Def. 4.1 in Section 4.2). For example, a leaf on the left side of the tree contains 614 chords without guitar, which have ‘envelope 1’ values less than 0.057, but also 141 chords with guitar.

Figure 2.11.: A decision tree example.
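The traversal from the root to a leaf can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `Node`, `Leaf` and `classify` are hypothetical, and of the values in the example tree only the threshold 0.057 on ‘envelope 1’ comes from the text; the second feature, its threshold and the leaf categories are invented.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Leaf:
    category: str  # category assigned to every instance that reaches this leaf


@dataclass
class Node:
    feature: str                # name of the feature queried at this node
    threshold: float            # query: is the feature value below the threshold?
    below: Union["Node", Leaf]  # successor if the value is below the threshold
    above: Union["Node", Leaf]  # successor otherwise


def classify(tree, instance):
    """Follow the queries from the root until a leaf is reached."""
    node = tree
    while isinstance(node, Node):
        value = instance[node.feature]
        node = node.below if value < node.threshold else node.above
    return node.category


# Hypothetical tree: only the 0.057 threshold on 'envelope 1' is from the text.
tree = Node(
    feature="envelope 1",
    threshold=0.057,
    below=Leaf("NOT Guitar"),
    above=Node("feature 2", 0.4, Leaf("NOT Guitar"), Leaf("Guitar")),
)
```

A leaf here stores only its majority category, which is how a leaf holding 614 chords without guitar and 141 chords with guitar would label all instances that reach it as ‘NOT Guitar’.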

One of the most critical decisions during tree construction is the choice of the attributes for the queries. The well-established algorithms ID3 and C4.5 [177] derive their concepts from information theory, investigated by Claude E. Shannon in [194]. In this theory, the value of a message is measured by the minimal number of trials required to guess it. As an example, if a 4-letter word consists of exactly two ‘A’ and two ‘B’ symbols, the number of possible words is $\frac{4 \cdot 3}{2} = 6$: AABB, ABAB, ABBA, BABA, BAAB and BBAA. For guessing a word, at least three binary questions with a yes/no answer are required; a first question could be, for example: ‘does the word belong to the subgroup of the three words AABB, ABAB and ABBA?’. If a 4-letter word consists of exactly three ‘C’ symbols and one ‘D’ symbol, we have only 4 possible words (CCCD, CCDC, CDCC and DCCC), and it is possible to guess any word with only two yes/no questions. The number of necessary questions is equal to $\log_2 |W|$ (rounded up to a whole number of questions), where $|W|$ is the number of different words.
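The two word-guessing examples can be checked by brute force. The following sketch enumerates the distinct words and rounds $\log_2 |W|$ up to a whole number of yes/no questions; the function names are chosen here for illustration only.

```python
from itertools import permutations
from math import ceil, log2


def distinct_words(symbols):
    """Return all distinct orderings of the given symbols, sorted."""
    return sorted({"".join(p) for p in permutations(symbols)})


def questions_needed(symbols):
    """Yes/no questions that suffice to single out one word: ceil(log2 |W|)."""
    return ceil(log2(len(distinct_words(symbols))))
```

For the symbols `"AABB"` this enumerates six words and three questions, and for `"CCCD"` four words and two questions, in agreement with the text.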

Consider now that the symbols correspond to the categories of the $T$ classification instances, which are organised by a subtree below a node. The node information content (the number of trials necessary for guessing a category of an instance below this node) can then be measured by its entropy $H(X)$:

$$H(X) = -\sum_{i=1}^{C} \frac{\mathrm{freq}(c_i)}{T} \cdot \log_2\left(\frac{\mathrm{freq}(c_i)}{T}\right), \tag{2.17}$$

where $C$ is the number of categories and $\mathrm{freq}(c_i)$ is the number of instances from $X$ which belong to category $c_i$, $i \in \{1, ..., C\}$.
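As a minimal sketch, Eq. (2.17) can be computed directly from the category counts of the instances below a node. The function name `entropy` and the convention that empty categories contribute zero are choices made for this illustration.

```python
from math import log2


def entropy(freqs):
    """H(X) as in Eq. (2.17); freqs[i] is freq(c_i), the count of category c_i."""
    total = sum(freqs)  # T, the number of instances below the node
    # Categories with zero count are skipped (0 * log2(0) is taken as 0).
    return sum(-(f / total) * log2(f / total) for f in freqs if f > 0)
```

Two equally frequent categories give an entropy of exactly one bit, a pure node gives zero, and the leaf mentioned above with 614 and 141 chords lies in between.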

The efficiency of candidate nodes can be measured by the information gain $\mathrm{gain}(X, Q_{DT})$, with the aim of reducing the information content that is carried by a node with a query $Q_{DT}$:

$$\mathrm{gain}(X, Q_{DT}) = H(X) - \sum_{j=1}^{k} \frac{|X_j|}{T} \cdot H(X_j), \tag{2.18}$$

where $X_j$ are the instances of the $k$ outcomes after the query $Q_{DT}$.
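Eq. (2.18) can be sketched in the same way. The block below repeats a small entropy helper so that it is self-contained; representing each outcome $X_j$ by its list of category counts is an assumption of this illustration.

```python
from math import log2


def entropy(freqs):
    """Entropy of a node given its per-category instance counts, Eq. (2.17)."""
    total = sum(freqs)
    return sum(-(f / total) * log2(f / total) for f in freqs if f > 0)


def information_gain(parent_freqs, child_freqs):
    """Eq. (2.18): H(X) minus the size-weighted entropies of the k outcomes X_j."""
    total = sum(parent_freqs)  # T
    remainder = sum(sum(child) / total * entropy(child) for child in child_freqs)
    return entropy(parent_freqs) - remainder
```

A query that separates two equally frequent categories perfectly, e.g. `information_gain([2, 2], [[2, 0], [0, 2]])`, gains the full bit, whereas a query whose outcomes have the same category mix as the parent gains nothing.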

Several further enhancements led to the development of the decision tree algorithm C4.5 (for details see [177]): handling of missing feature values, grouping of feature values, tree pruning, etc. The last technique is especially important, since overly large trees increase the danger of overfitting, where a model perfectly describes the data on which it was trained but is no longer suitable for reasonable classification of other instances. A forerunner of C4.5, the ID3 decision tree algorithm, incorporates reduced error pruning, where a node is replaced by a leaf with the most frequent category of the succeeding instances. The performance of the original node and of the leaf is measured by the classification error on a validation set. Because this requires some of the classification instances to be reserved for the independent set, the restriction was removed by rule post-pruning during the development of C4.5. Here, a large and overfitted tree is built from the training data. Afterwards, this tree is converted to a set of rules, which are partly pruned by sorting out rules with respect to their performance and its deviation on the training set.
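Reduced error pruning, as described above, can be sketched as a bottom-up pass over the tree. This is a simplified illustration and not the procedure of any particular implementation: the tree is assumed to be a nested dictionary with the hypothetical keys `feature`, `threshold`, `below` and `above`, leaves are plain category strings, and the candidate leaf takes the most frequent category among the training instances that reach the node.

```python
from collections import Counter


def predict(node, x):
    """Descend until a leaf (a plain category string) is reached."""
    while isinstance(node, dict):
        branch = "below" if x[node["feature"]] < node["threshold"] else "above"
        node = node[branch]
    return node


def errors(node, data):
    """Number of misclassified (instance, category) pairs."""
    return sum(predict(node, x) != y for x, y in data)


def prune(node, train, valid):
    """Reduced error pruning, bottom-up; returns the possibly pruned subtree."""
    if not isinstance(node, dict) or not train:
        return node
    f, t = node["feature"], node["threshold"]

    def part(data, low):
        # Instances that follow the 'below' (low=True) or 'above' branch.
        return [(x, y) for x, y in data if (x[f] < t) == low]

    node = dict(
        node,
        below=prune(node["below"], part(train, True), part(valid, True)),
        above=prune(node["above"], part(train, False), part(valid, False)),
    )
    # Candidate leaf: most frequent category of the instances below this node.
    leaf = Counter(y for _, y in train).most_common(1)[0][0]
    # Keep the leaf if it is no worse than the subtree on the validation set.
    return leaf if errors(leaf, valid) <= errors(node, valid) else node
```

Replacing a subtree when the leaf is merely no worse (`<=`) favours smaller trees; a strict comparison would prune less aggressively.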

A modification of the decision trees, the random forest (RF), builds an ensemble of unpruned trees and estimates the label output by majority voting [19]. During tree construction, for each tree node a number $m_{RF} \leq F$ of random candidate features is selected, and the best split among them is used. The default RF algorithm uses $m_{RF} = \sqrt{F}$. The advantage of the RF is that it usually performs very well by averaging the tree outcomes. It is also fast, since no pruning is applied. However, the performance of the random forest suffers from a large number of noisy features, because the share of irrelevant features among the $m_{RF}$ selected ones increases [82]. As we can see in the discussion of the experiments (Chapter 5), the RF method, like other classifiers, tends to increase its performance when feature selection is applied beforehand. Another drawback is that the classification models are no longer interpretable, compared to a single tree.
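The two ingredients that distinguish the RF from a single tree, random candidate features per node and majority voting over the trees, can be sketched as follows. The function names are hypothetical, and rounding $\sqrt{F}$ down to an integer is an assumption of this sketch, since the text does not state how a non-integer value is handled.

```python
import random
from collections import Counter
from math import isqrt


def candidate_features(num_features, rng):
    """Draw m_RF distinct feature indices for one node.

    Assumption: m_RF is sqrt(F) rounded down, but at least 1.
    """
    m_rf = max(1, isqrt(num_features))
    return rng.sample(range(num_features), m_rf)


def majority_vote(tree_outputs):
    """Category predicted by the largest number of trees in the ensemble."""
    return Counter(tree_outputs).most_common(1)[0][0]
```

With $F = 16$ features, each node would examine four randomly drawn candidates. If most of the 16 features are noisy, most of these draws contain few or no relevant features, which illustrates why prior feature selection can help.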
