Methods for Comparison
7.1.4 Optimisation of parameters
Frequently it is desirable to tune some parameter to get the best performance from an algorithm: examples might be the amount of pruning in a decision tree or the number of hidden nodes in the multilayer perceptron. When the objective is to minimise the error-rate of the tree or perceptron, the training data might be divided into two parts: one to build the tree or perceptron, and the other to measure the error rate. A plot of error-rate against the
parameter will indicate what the best choice of parameter should be. However, the error rate corresponding to this choice of parameter is a biased estimate of the error rate of the classification rule when tested on unseen data. When it is necessary to optimise a parameter in this way, we recommend a three-stage process for very large datasets: (i) hold back 20% as a test sample; (ii) of the remainder, divide into two, with one set used for building the rule and the other for choosing the parameter; (iii) use the chosen parameter to build a rule for the complete training sample (containing 80% of the original data) and test this rule on the test sample.
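The three-stage process can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the one-dimensional k-nearest-neighbour classifier, the synthetic data, and all function names (`predict_knn`, `error_rate`, `three_stage`, `make_data`) are hypothetical stand-ins and are not the procedures used in the trials. The tuning parameter here is the number of neighbours k.

```python
import random

def predict_knn(train, x, k):
    """Classify x by majority vote among its k nearest training points."""
    nearest = sorted(train, key=lambda row: abs(row[0] - x))[:k]
    votes = {}
    for _, label in nearest:
        votes[label] = votes.get(label, 0) + 1
    # Ties are broken in favour of the smallest label.
    return max(sorted(votes), key=votes.get)

def error_rate(train, test, k):
    """Fraction of test rows misclassified by a rule built on train."""
    wrong = sum(predict_knn(train, x, k) != y for x, y in test)
    return wrong / len(test)

def three_stage(data, candidates, seed=0):
    rows = list(data)
    random.Random(seed).shuffle(rows)
    # (i) hold back 20% as the test sample
    n_test = len(rows) // 5
    test, training = rows[:n_test], rows[n_test:]
    # (ii) divide the remainder in two: one part builds the rule,
    #      the other chooses the parameter
    half = len(training) // 2
    build, select = training[:half], training[half:]
    best_k = min(candidates, key=lambda k: error_rate(build, select, k))
    # (iii) rebuild on the complete training sample (80% of the data)
    #       and test once on the held-back sample
    return best_k, error_rate(training, test, best_k)

def make_data(n, seed=1):
    """Synthetic two-class data: class 0 centred at 0, class 1 centred at 2."""
    rng = random.Random(seed)
    return [(rng.gauss(2.0 * (i % 2), 1.0), i % 2) for i in range(n)]

candidates = [1, 3, 5, 9, 15]
best_k, test_error = three_stage(make_data(500), candidates)
print("chosen k:", best_k, "test error:", test_error)
```

The error rate measured in step (ii) is used only to pick the parameter; the figure reported is the one from step (iii), because the test sample played no part in the choice.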
Thus, for example, Watkins (1987) gives a description of cross-validation in the context of testing decision-tree classification algorithms, and uses cross-validation as a means of selecting better decision trees. Similarly, in this book, cross-validation was used by Backprop in finding the optimal number of nodes in the hidden layer, following the procedure outlined above. This was also done for the trials involving Cascade. However, cross-validation runs involve a greatly increased amount of computational labour, multiplying the learning time by the number of folds, and this problem is particularly serious for neural networks.
In StatLog, most procedures had a tuning parameter that could be set to a default value, and where this was possible the default parameters were used. This was the case, for example, with the decision trees: generally no attempt was made to find the optimal amount of pruning, and accuracy and “mental fit” (see Chapter 5) are thereby sacrificed for the sake of speed in the learning process.
7.2 ORGANISATION OF COMPARATIVE TRIALS
We describe in this section what we consider to be the ideal setup for comparing classification procedures. It is not easy to compare very different algorithms on a large number of datasets, and in practice some compromises have to be made. We will not detail the compromises that we made in our own trials, but attempt to set out the ideals that we tried to follow, and give a brief description of the UNIX-based procedures that we adopted. If a potential trialist wishes to perform another set of trials, is able to cast the relevant algorithms into the form that we detail here, and moreover is able to work within a UNIX environment, then we can recommend that he uses our test procedures. This will guarantee comparability with the majority of our own results.
In the following list of desiderata, we use the notation file1, file2, ... to denote arbitrary files that either provide data or receive output from the system. Throughout we assume that files used for training/testing are representative of the population and are statistically similar to each other.
1. Training Phase. The most elementary functionality required of any learning algorithm is to be able to take data from one file file1 (by assumption file1 contains known classes) and create the rules.
- (Optionally) The resulting rules (or parameters defining the rule) may be saved to another file file3.
- (Optionally) A cost matrix (in file2, say) can be read in and used in building the rules.
2. Testing Phase. The algorithm can read in the rules and classify unseen data, in the following sequence:
- Read in the rules or parameters from the training phase (either passed on directly from the training phase if that immediately precedes the testing phase or read from the file file3)
- Read in a set of unseen data from a file file4 with true classifications that are hidden from the classifier
- (Optionally) Read in a cost matrix from a file file5 (normally file5 = file2) and use this cost matrix in the classification procedure
- (Optionally) Output the classifications to a file file6
- If true classifications were provided in the test file file4, output to file file7 a confusion matrix whose rows represent the true classifications and whose columns represent the classifications made by the algorithm
The two steps above constitute the most basic element of a comparative trial, and we describe this basic element as a simple Train-and-Test (TT) procedure. All algorithms used in our trials were able to perform the Train-and-Test procedure.
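As a rough illustration of the Train-and-Test element, the sketch below works in memory rather than through file1, file3, file4 and file7, and omits the optional cost matrix. The nearest-class-mean rule and the names `train`, `classify`, `confusion_matrix` and `train_and_test` are assumptions made for the example, not part of the StatLog procedures. It also assumes that every class in the test data occurs in the training data.

```python
def train(rows):
    """Training phase: rows are (value, known_class) pairs.
    The 'rule' is simply the mean value of each class."""
    sums, counts = {}, {}
    for x, c in rows:
        sums[c] = sums.get(c, 0.0) + x
        counts[c] = counts.get(c, 0) + 1
    return {c: sums[c] / counts[c] for c in sums}

def classify(rule, x):
    """Assign x to the class whose mean is closest."""
    return min(sorted(rule), key=lambda c: abs(x - rule[c]))

def confusion_matrix(true_classes, predicted, classes):
    """Rows index the true class, columns the class chosen by the algorithm."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for t, p in zip(true_classes, predicted):
        matrix[index[t]][index[p]] += 1
    return matrix

def train_and_test(train_rows, test_rows):
    rule = train(train_rows)
    classes = sorted(rule)
    # The true classes are kept aside; the classifier sees only the values.
    hidden_truth = [c for _, c in test_rows]
    predictions = [classify(rule, x) for x, _ in test_rows]
    return confusion_matrix(hidden_truth, predictions, classes)

train_rows = [(0.1, "a"), (0.3, "a"), (1.9, "b"), (2.2, "b")]
test_rows = [(0.0, "a"), (1.2, "a"), (2.0, "b")]
matrix = train_and_test(train_rows, test_rows)
print(matrix)  # [[1, 1], [0, 1]]
```

In the printed matrix the first row says that, of the two test cases truly in class "a", one was classified as "a" and one as "b".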
7.2.1 Cross-validation
To follow the cross-validation procedure, it is necessary to build an outer loop of control procedures that divide up the original file into its component parts and successively use each part as test file and the remaining part as training file. Of course, the cross-validation procedure results in a succession of mini-confusion matrices, and these must be combined to give the overall confusion matrix. All this can be done within the Evaluation Assistant shell provided the classification procedure is capable of the simple Train-and-Test steps above. Some more sophisticated algorithms may have a cross-validation procedure built in, of course, and if so this is a distinct advantage.
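The outer loop might look like the following sketch, which is not the Evaluation Assistant code. The function `cross_validate` accepts any Train-and-Test step that returns a confusion matrix; `nearest_mean_tt` is a hypothetical stand-in for such a step, and the example assumes every class appears in the training part of every fold.

```python
def nearest_mean_tt(train_rows, test_rows, classes):
    """Stand-in Train-and-Test step returning a confusion matrix
    (rows = true class, columns = assigned class)."""
    means = {}
    for c in classes:
        values = [x for x, label in train_rows if label == c]
        means[c] = sum(values) / len(values)
    matrix = [[0] * len(classes) for _ in classes]
    for x, true_class in test_rows:
        guess = min(classes, key=lambda c: abs(x - means[c]))
        matrix[classes.index(true_class)][classes.index(guess)] += 1
    return matrix

def cross_validate(rows, n_folds, train_and_test, classes):
    """Use each part in turn as the test file and the rest as the training
    file, adding the mini-confusion matrices into one overall matrix."""
    total = [[0] * len(classes) for _ in classes]
    for fold in range(n_folds):
        test_rows = rows[fold::n_folds]
        train_rows = [r for i, r in enumerate(rows) if i % n_folds != fold]
        mini = train_and_test(train_rows, test_rows, classes)
        for i, row in enumerate(mini):
            for j, count in enumerate(row):
                total[i][j] += count
    return total

rows = [(0.0, "a"), (2.0, "b"), (0.2, "a"), (2.1, "b"), (0.4, "a"), (1.8, "b")]
classes = ["a", "b"]
overall = cross_validate(rows, 3, nearest_mean_tt, classes)
error_rate = 1 - sum(overall[i][i] for i in range(len(classes))) / len(rows)
print(overall, error_rate)
```

Because each observation is tested exactly once, the entries of the overall matrix sum to the number of observations, and the overall error rate is read off its off-diagonal entries.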
7.2.2 Bootstrap
The use of the bootstrap procedure makes it imperative that combining of results, files etc. is done automatically. Once again, if an algorithm is capable of simple Train-and-Test, it can be embedded in a bootstrap loop using Evaluation Assistant (although perhaps we should admit that we never used the bootstrap in any of the datasets reported in this book).
7.2.3 Evaluation Assistant
Evaluation Assistant is a tool that facilitates the testing of learning algorithms on given datasets and provides standardised performance measures. In particular, it standardises timings of the various phases, such as training and testing. It also provides statistics describing the trial (mean error rates, total confusion matrices, etc.). It can be obtained from J. Gama of the University of Porto. For details of this, and other publicly available software and datasets, see Appendices A and B. Two versions of Evaluation Assistant exist:
- Command version (EAC)
- Interactive version (EAI)
The command version of Evaluation Assistant (EAC) consists of a set of basic commands that enable the user to test learning algorithms. This version is implemented as a set of C-shell scripts and C programs.
The interactive version of Evaluation Assistant (EAI) provides an interactive interface that enables the user to set up the basic parameters for testing. It is implemented in C and
the interactive interface exploits X windows. This version generates a customised version of some EAC scripts which can be examined and modified before execution.
Both versions run on a SUN SPARCstation and other compatible workstations.
7.3 CHARACTERISATION OF DATASETS
An important objective is to investigate why certain algorithms do well on some datasets and not so well on others. This section describes measures of datasets which may help to explain our findings. These measures are of three types: (i) very simple measures such as the number of examples; (ii) statistically based, such as the skewness of the attributes; and (iii) information theoretic, such as the information gain of attributes. We discuss information theoretic measures in Section 7.3.3. There is a need for a measure which indicates when decision trees will do well. Bearing in mind the success of decision trees in image segmentation problems, it seems that some measure of multimodality might be useful in this connection.
Some algorithms have built-in measures which are given as part of the output. For example, CASTLE measures the Kullback-Leibler information in a dataset. Such measures are useful in establishing the validity of specific assumptions underlying the algorithm and, although they do not always suggest what to do if the assumptions do not hold, at least they give an indication of internal consistency.
The measures should continue to be elaborated and refined in the light of experience.
7.3.1 Simple measures
The following descriptors of the datasets give very simple measures of the complexity or size of the problem. Of course, these measures might advantageously be combined to give other measures more appropriate for specific tasks, for example by taking products, ratios or logarithms.
Number of observations
This is the total number of observations in the whole dataset. In some respects, it might seem more sensible to count only the observations in the training data, but this is generally a large fraction of the total number in any case.
Number of attributes
The total number of attributes in the data as used in the trials. Where categorical attributes were originally present, these were converted to binary indicator variables.
Number of classes
The total number of classes represented in the entire dataset.
Number of binary attributes, Bin.att
The total number of attributes that are binary (including categorical attributes coded as indicator variables). By definition, the remaining attributes (the total number of attributes minus Bin.att) are numerical (either continuous or ordered).
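These simple measures are straightforward to compute. The sketch below assumes the data have already been coded as in the trials, with each observation held as a tuple of attribute values plus a class label. The function name `simple_measures`, the dictionary keys other than Bin.att, and the rule for detecting a binary attribute (every value is 0 or 1) are assumptions made for the example; a numerical attribute that happened to take only the values 0 and 1 would also be counted as binary.

```python
def simple_measures(rows):
    """rows: list of (attribute_tuple, class_label) pairs, with categorical
    attributes already converted to 0/1 indicator variables."""
    n_observations = len(rows)
    n_attributes = len(rows[0][0])
    n_classes = len({label for _, label in rows})
    # Assumed test for 'binary': the attribute takes only the values 0 and 1.
    bin_att = sum(
        1
        for j in range(n_attributes)
        if all(attrs[j] in (0, 1) for attrs, _ in rows)
    )
    return {
        "observations": n_observations,
        "attributes": n_attributes,
        "classes": n_classes,
        "Bin.att": bin_att,
        # The remaining attributes are numerical by definition.
        "numerical": n_attributes - bin_att,
    }

rows = [
    ((5.1, 1, 0), "x"),
    ((4.9, 0, 1), "y"),
    ((6.3, 0, 0), "x"),
    ((5.8, 1, 1), "z"),
]
measures = simple_measures(rows)
print(measures)
```

For this toy dataset the function reports 4 observations, 3 attributes, 3 classes, 2 binary attributes and 1 numerical attribute.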