
Supervised Learning (Classification and Prediction)

1.7 Data Mining and Machine Learning

1.7.2 Supervised Learning (Classification and Prediction)

The goal of prediction [144] is to predict a target attribute for new objects based on the values of their other attributes. The relationship between the target attribute and the other attributes is learned from a training set for which the target attribute is already known. The training data capture an empirical dependency between the ordinary attributes and the target attribute, and the data mining technique builds an explicit model of the observed dependencies. If the target values are categorical, the prediction is called classification; if they are continuous numerical values, it is referred to as regression. A number of popular classification methods have already been applied to problems from the field of biology and medicine [145,121,146].

Cross-Validation

To obtain an unbiased assessment of a classifier's predictive quality, its performance must be tested on an independent data set [147,148]. Since the availability of samples is often limited in a clinical setting, withholding a substantial proportion of the data for testing might reduce the quality of the predictive model. Thus k-fold cross-validation can be used to estimate how well a classifier will perform in practice. For this, the original sample set is divided into k subsamples. One subsample is retained as the test data for model validation, while the remaining k − 1 subsamples (the training set) are used to learn the classifier. This ensures that the classification result for each sample is unbiased by knowledge of that particular sample. The validation procedure is repeated k times (the folds), with each subsample being used exactly once as the validation set. In stratified k-fold cross-validation, the folds are selected such that each fold contains roughly the same proportions of class labels. The classification results obtained for each sample when that sample was not part of the training set can be used to rate the classifier. Common criteria include accuracy, sensitivity, specificity and the area under the curve (AUC) [149].

Overfitting of the predictor can be a major limitation in supervised learning, and one of the main reasons for cross-validation is to check whether the model has been overfitted. Overfitting means that the model was optimized for the available training data but predicts independent data poorly. This happens when the model picks up random variations that do not represent true relationships. It is also important that all preprocessing and feature selection steps that use class knowledge are included in the cross-validation. Otherwise a substantial bias is introduced into the cross-validation results [150], which leads to an overestimation of prediction accuracy.
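The stratified k-fold procedure described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used in this work: the function names `stratified_k_fold` and `cross_validate`, the round-robin assignment of samples to folds, and the placeholder majority-class classifier (`fit_majority`, `predict_majority`) are all assumptions made for the example.

```python
import random
from collections import Counter, defaultdict


def stratified_k_fold(labels, k, seed=0):
    """Split sample indices into k folds with similar class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the samples of one class round-robin over the folds.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds


def cross_validate(samples, labels, k, fit, predict, seed=0):
    """Return one prediction per sample, made by a model that never saw it."""
    folds = stratified_k_fold(labels, k, seed)
    out_of_fold = [None] * len(samples)
    for test_fold in folds:
        held_out = set(test_fold)
        train_idx = [i for i in range(len(samples)) if i not in held_out]
        # Everything that uses class labels (including feature selection)
        # belongs inside `fit`, so it only ever sees the training folds.
        model = fit([samples[i] for i in train_idx],
                    [labels[i] for i in train_idx])
        for i in test_fold:
            out_of_fold[i] = predict(model, samples[i])
    return out_of_fold


# Hypothetical placeholder classifier: always predicts the majority class.
def fit_majority(train_x, train_y):
    return Counter(train_y).most_common(1)[0][0]


def predict_majority(model, x):
    return model


labels = ["a"] * 6 + ["b"] * 4
samples = list(range(len(labels)))
predictions = cross_validate(samples, labels, 2, fit_majority, predict_majority)
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
```

The pooled out-of-fold predictions can then be compared with the true labels to compute criteria such as accuracy, sensitivity or specificity. Passing the whole training pipeline in as `fit` is one way to keep label-dependent preprocessing inside the cross-validation loop.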

Decision Trees

Classification using decision trees such as CART (classification and regression trees) [146] is quite popular in the machine learning community. A decision tree classifies a pattern by performing a sequence of simple tests, where the tests performed at subsequent levels depend on the outcome of the previous tests. For the class of binary trees there are two types of tests: (i) equality tests on a categorical attribute and (ii) inequality tests on a single real-valued attribute. An example of the first type is color = red, and an example of the second type is length ≥ 24 cm. When the rules are represented as a tree structure $T$, each tree node $t$ represents a rule that tests one variable $X_i$ from a set of input variables $X_1, X_2, \ldots, X_D$. Using these rule sets, the classifier partitions the input space $\mathbb{R}^D$ into cuboid regions for predicting an output variable $Y$. The decision node at the top of the tree, which has only outgoing edges, is called the root node. To predict the output variable $y$ for a sample $x$, drop $x$ down $T$ and follow its path until a terminal (leaf) node is reached. Each leaf node has only incoming edges and represents a specific class, which is associated with a certain partition defined by a sequence of tests. The class associated with a specific leaf node is the majority class of the training samples that, according to the rule set, are assigned to this leaf node. A new sample assigned to this leaf node is then classified accordingly. See Figure 1.4 for an example.
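Dropping a sample down a binary tree can be illustrated with a small Python sketch. The `Node` and `Leaf` classes, the `kind` codes "eq" and "ge", the convention that a passed test follows the `if_true` branch, and the class names "A", "B" and "C" are hypothetical choices for this example. Only the two tests (color = red, length ≥ 24) come from the text, and the tree is not the one shown in Figure 1.4.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Leaf:
    label: str  # class assigned to every sample that reaches this leaf


@dataclass
class Node:
    attribute: str   # name of the tested input variable
    kind: str        # "eq": equality test, "ge": inequality test (>=)
    value: object    # category or threshold used by the test
    if_true: Union["Node", Leaf]
    if_false: Union["Node", Leaf]


def predict(tree, sample):
    """Follow the path from the root to a leaf and return the leaf's class."""
    node = tree
    while isinstance(node, Node):
        observed = sample[node.attribute]
        if node.kind == "eq":
            passed = observed == node.value
        else:
            passed = observed >= node.value
        node = node.if_true if passed else node.if_false
    return node.label


# Hypothetical tree: the root tests color = red, its right child length >= 24.
tree = Node("color", "eq", "red",
            Leaf("A"),
            Node("length", "ge", 24.0, Leaf("B"), Leaf("C")))

first = predict(tree, {"color": "red", "length": 10.0})
second = predict(tree, {"color": "blue", "length": 30.0})
third = predict(tree, {"color": "blue", "length": 5.0})
```

Each root-to-leaf path corresponds to one sequence of tests, so every leaf covers one cuboid region of the input space.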

Construction of a tree classifier requires a labelled training set

$$L = \{(x_1, y_1), \ldots, (x_N, y_N)\} \qquad (1.9)$$

where $x_n$ is an object measured in the input variables $X_1, X_2, \ldots, X_D$ with $n = 1, \ldots, N$. The class label of each object is defined by $y_n$, which can take a value $k \in \{1, \ldots, l\}$. In the case of a binary classification problem an object can belong to either of two classes, i.e. $k \in \{1, 2\}$. The tree model is then constructed by partitioning the training set of measurement vectors into "purer" subsets. To this end, every possible value of each variable is considered as a candidate for each split. The goodness of a split is evaluated by the achieved decrease in impurity. The most established choice of impurity within a tree node $t$ [146] is given by the Gini index

$$I_G(t) = \sum_{j=1}^{l} p_j(t)\,\bigl(1 - p_j(t)\bigr) \qquad (1.10)$$

where we assume that there are $l$ classes and $p_1(t), p_2(t), \ldots, p_l(t)$ are the proportions of the samples in node $t$ that belong to each of the $l$ classes. The Gini index, as used by CART, measures how often a randomly chosen element of the set would be labelled incorrectly if it were labelled randomly according to the distribution of labels in the subset. It is therefore computed by summing, over all classes, the probability of an item of that class being chosen times the probability of a mistake in categorizing that item. It reaches its minimum (zero) when all cases in the node fall into a single target category. Based on the respective impurity function, a feature and split are rated according to the decrease in impurity [146]

$$\Delta(s, t) = I(t) - h(t_L)\,I(t_L) - h(t_R)\,I(t_R) \qquad (1.11)$$

where $s$ is a split of node $t$, and $h(t_L)$ and $h(t_R)$ are the proportions of the samples of node $t$ that fall into its left and right daughter nodes, respectively. The split that leads to the highest decrease in impurity is chosen for the tree. Applying the node-splitting procedure recursively usually produces an overgrown tree (too many descendant nodes), which is likely to overfit the training data. Thus a pruning step is necessary, which removes some nodes to achieve an optimal bias-variance trade-off. The decision which nodes to remove is usually based on an independent test set or cross-validation.
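As a worked illustration of equations (1.10) and (1.11), the following Python sketch computes the Gini index of a node and searches for the threshold on a single real-valued attribute that gives the largest decrease in impurity. The function names and the toy data are hypothetical, and sending samples that pass the test "value ≥ threshold" to the left daughter node is an assumption of this sketch. Pruning is not shown.

```python
from collections import Counter


def gini(labels):
    """Gini index of a node: sum over classes of p_j * (1 - p_j)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return sum((c / n) * (1 - c / n) for c in Counter(labels).values())


def impurity_decrease(parent, left, right):
    """Decrease in impurity obtained by splitting parent into left and right."""
    n = len(parent)
    h_left, h_right = len(left) / n, len(right) / n
    return gini(parent) - h_left * gini(left) - h_right * gini(right)


def best_threshold(values, labels):
    """Try every observed value as a threshold for the test value >= threshold."""
    best_split, best_gain = None, 0.0
    for threshold in sorted(set(values)):
        # Assumption: samples passing the test go to the left daughter node.
        left = [y for v, y in zip(values, labels) if v >= threshold]
        right = [y for v, y in zip(values, labels) if v < threshold]
        if not left or not right:
            continue  # not a real split: one daughter node would be empty
        gain = impurity_decrease(labels, left, right)
        if gain > best_gain:
            best_split, best_gain = threshold, gain
    return best_split, best_gain


# Toy data: one real-valued attribute and two classes.
values = [1.0, 2.0, 3.0, 4.0]
labels = ["a", "a", "b", "b"]
threshold, gain = best_threshold(values, labels)
```

For this toy data the parent node has a Gini index of 0.5, and the threshold 3.0 separates the two classes completely, so both daughter nodes are pure and the decrease in impurity equals 0.5. A full tree builder would repeat this search over all variables and recurse into the daughter nodes.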

Figure 1.4. A simplified decision tree model for the discrimination between plants, insects and mammals.
