Learning, in the general sense of the word, like intelligence and knowledge, may be difficult to define precisely. For the purpose of this thesis, therefore, we will narrow down the scope to something more manageable. Learning, in our context, refers to the discipline of machine learning that is concerned with developing computational models of learning in machines [Mitchell, 1997]. Put simply, machine learning is the study of computer algorithms that improve automatically through experience. Formally, when we refer to learning, we imply the following definition [Mitchell, 1997]:
“A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.”
For example, a program that learns to play chess might improve its performance as measured by its ability to win at the class of tasks involving playing chess games, through experience obtained by playing games against a human player.
In particular, we are interested in a branch of machine learning called decision tree learning [Mitchell, 1997; Quinlan, 1986, 1993] that involves the use of decision trees to draw conclusions based on a history of observations. Our choice of decision trees for learning is motivated by several factors. Firstly, decision trees support hypotheses that are a disjunction of conjunctive terms, and this representation is compatible with how context formulas are generally written. Secondly, decision trees are robust against training data that may contain errors. This is especially relevant in stochastic domains where applicable plans may nevertheless fail due to unforeseen circumstances. Finally, decision tree learning is a well-developed technology: it has several competitive implementations and a mature theory behind it.
A decision tree may be viewed as a tree-like flowchart, where each internal node represents a decision point, i.e., a test for some attribute, and each outgoing branch represents a possible value of that attribute. Each path from the root node to a leaf node constitutes a decision path, and terminates in a categorisation, or classification.
Figure 2.4 shows an example decision tree for the travelling domain for deciding whether to travel by tram or not. Here, the decision paths terminating in the √ (or ×) classification indicate the final decision based on the chosen attribute values. In this example, travelling by tram is a good idea for short distances in wet weather, i.e., (dist = short ∧ outlook = rain), but not otherwise, i.e., (dist = long) ∨ (dist = short ∧ outlook = sun).
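As a minimal sketch, the tram tree can be written down as a nested structure and evaluated by following a single decision path. The dictionary layout and the names TRAM_TREE, classify and tram_formula are assumptions made for illustration only; they are not part of the original work.

```python
# Hypothetical encoding of the tram tree from Figure 2.4.
# An internal node is {"attribute": name, "branches": {value: subtree}};
# a leaf is a bool (True for the tick, False for the cross).
TRAM_TREE = {
    "attribute": "dist",
    "branches": {
        "long": False,
        "short": {
            "attribute": "outlook",
            "branches": {"sun": False, "rain": True},
        },
    },
}


def classify(tree, situation):
    """Follow one decision path from the root down to a leaf."""
    node = tree
    while isinstance(node, dict):
        value = situation[node["attribute"]]
        node = node["branches"][value]
    return node


def tram_formula(situation):
    """The positive decision path written as a conjunction of attribute tests."""
    return situation["dist"] == "short" and situation["outlook"] == "rain"


# Enumerate every combination of the two attributes used in the example.
situations = [
    {"dist": d, "outlook": o}
    for d in ("short", "long")
    for o in ("sun", "rain")
]
for s in situations:
    print(s, classify(TRAM_TREE, s))
```

On all four combinations of dist and outlook, the tree and the formula give the same answer, which illustrates why a tree reads naturally as a disjunction of conjunctive terms: each path ending in a positive leaf contributes one conjunction.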
[Figure 2.4 (tree diagram): the root tests dist; dist = long leads to ×; dist = short leads to a test on outlook, where outlook = sun leads to × and outlook = rain leads to √.]
Figure 2.4: A decision tree for the travelling domain to decide if one should travel by tram.
The typical use of decision trees is for generalising from past experiences to categorise unseen situations. For instance, one may recall several ways in which to entertain children, such as by taking them to the park, reading books, and watching trains go by. If one were to use these past experiences to guide their decision making when interacting with a new child, then this would constitute an inductive decision making process. A key concern here is determining how to construct the decision tree for deciding what activity to perform with the child.
The problem of learning decision trees may be described as follows: given a set of examples, each described by a set of attributes and a known categorisation, the task is to learn the structure of a decision tree that correctly classifies the examples and may be used to decide the category of an unseen example.
Putting together a decision tree is a matter of choosing attributes to test at each node in the tree. The key is to decide which attributes to test first since the order in which various attributes are tested will invariably impact the size of the final tree. Intuitively, we would like to test the most important attributes first. However, since examples are generally not annotated with information about the importance of the attributes, we require a more generic method for determining this. One way to do this is by looking at how different combinations of attribute values impact the categorisation.
As an example, consider again the travelling problem where we would like to decide the best mode of transportation for any given situation. Say we decided to cycle to work, but it rained on the way and so we concluded that that was an unsatisfactory outcome. Now, we may record this categorisation, i.e., unsatisfactory, against the situation in which we made the decision, i.e., the values of such attributes as money = 200, outlook = rain, day = Monday, and so on. However, simply by considering this one experience we cannot determine which attribute(s) actually contributed to the unsatisfactory outcome. If, on the other hand, we had several experiences of cycling to work under different situations, then by analysing them collectively we may justifiably conclude that the weather outlook was indeed most influential to the outcome.
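To make this collective analysis concrete, the following sketch tallies outcomes per attribute value over a handful of cycling experiences. The experiences themselves, and the helper names outcomes_by_value and separates_outcomes, are invented for illustration; only the attribute names come from the example above.

```python
from collections import Counter, defaultdict

# Hypothetical cycling experiences: (situation, outcome) pairs.
EXPERIENCES = [
    ({"money": 200, "outlook": "rain", "day": "Monday"}, "unsatisfactory"),
    ({"money": 200, "outlook": "sun", "day": "Monday"}, "satisfactory"),
    ({"money": 50, "outlook": "rain", "day": "Tuesday"}, "unsatisfactory"),
    ({"money": 50, "outlook": "sun", "day": "Tuesday"}, "satisfactory"),
    ({"money": 200, "outlook": "sun", "day": "Friday"}, "satisfactory"),
    ({"money": 50, "outlook": "rain", "day": "Friday"}, "unsatisfactory"),
]


def outcomes_by_value(experiences, attribute):
    """Count how often each outcome occurs for each value of one attribute."""
    tally = defaultdict(Counter)
    for situation, outcome in experiences:
        tally[situation[attribute]][outcome] += 1
    return {value: dict(counts) for value, counts in tally.items()}


def separates_outcomes(experiences, attribute):
    """True if every observed value of the attribute has a single outcome."""
    tallies = outcomes_by_value(experiences, attribute)
    return all(len(counts) == 1 for counts in tallies.values())


for name in ("money", "outlook", "day"):
    print(name, outcomes_by_value(EXPERIENCES, name))
```

In this invented data set, only outlook splits the experiences into groups with a single outcome each, whereas money and day leave the outcomes mixed. No single experience would reveal this on its own.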
For the purpose of this thesis, we use the algorithm J48, a version of Quinlan’s C4.5 [Quinlan, 1993] algorithm for inducing decision trees, from the Weka learning package [Witten and Frank, 1999]. The basic algorithm conceptually performs a similar analysis to our example above by calculating the information gain of an attribute with respect to the set of examples. It then (i) places the attribute with the highest information gain at the root of the decision tree; (ii) creates a branch for each observed value of that attribute; (iii) assigns the relevant examples to each branch; and (iv) repeats the process for each subset thus created. The end result is that the attributes that contribute the most to the outcome are placed earlier in the decision path, and are considered first when evaluating a new situation.
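Steps (i) to (iv) can be sketched as a short recursive procedure. This is a simplified, ID3-style illustration under stated assumptions, not J48 or C4.5 as implemented in Weka: it handles only discrete attributes, does no pruning, and selects attributes by plain information gain, Gain(S, A) = H(S) − Σ_v (|S_v| / |S|) H(S_v), where H is the entropy of the class labels and S_v is the subset of S with value v for attribute A. C4.5 itself uses a normalised variant of this measure, the gain ratio. The example data and all function names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels).values()
    return -sum((n / total) * math.log2(n / total) for n in counts)


def information_gain(examples, attribute):
    """Expected reduction in entropy from splitting on one attribute."""
    labels = [label for _, label in examples]
    remainder = 0.0
    for value in {attrs[attribute] for attrs, _ in examples}:
        subset = [lab for attrs, lab in examples if attrs[attribute] == value]
        remainder += len(subset) / len(examples) * entropy(subset)
    return entropy(labels) - remainder


def build_tree(examples, attributes):
    """Recursively induce a tree of nested dicts; leaves are class labels."""
    labels = [label for _, label in examples]
    majority = Counter(labels).most_common(1)[0][0]
    # Stop when the subset is pure or no attributes remain to test.
    if len(set(labels)) == 1 or not attributes:
        return majority
    # (i) pick the attribute with the highest information gain.
    best = max(attributes, key=lambda a: information_gain(examples, a))
    remaining = [a for a in attributes if a != best]
    branches = {}
    # (ii) one branch per observed value, (iii) with its own examples,
    # (iv) and the same procedure applied to each subset.
    for value in sorted({attrs[best] for attrs, _ in examples}):
        subset = [(a, lab) for a, lab in examples if a[best] == value]
        branches[value] = build_tree(subset, remaining)
    return {"attribute": best, "branches": branches}


# Hypothetical training examples for the tram decision.
EXAMPLES = [
    ({"dist": "short", "outlook": "rain"}, True),
    ({"dist": "short", "outlook": "rain"}, True),
    ({"dist": "short", "outlook": "sun"}, False),
    ({"dist": "long", "outlook": "rain"}, False),
    ({"dist": "long", "outlook": "sun"}, False),
    ({"dist": "long", "outlook": "rain"}, False),
]

tree = build_tree(EXAMPLES, ["outlook", "dist"])
print(tree)
```

On this invented data, dist has the higher information gain and is placed at the root, with outlook tested only on the short branch. This reproduces the shape of the tree in Figure 2.4.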
Assuming consistent data, i.e., where no two examples have the same values for the attributes but are categorised differently, it is always possible to construct a decision tree that classifies the training cases with complete accuracy. However, complete accuracy on the training cases is not in itself a valid measure of the usefulness of the decision tree if the data is incomplete; it may instead indicate overfitting, i.e., where the decision tree performs well on the training data but does not generalise well to unseen data.
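The consistency assumption can be checked mechanically before a tree is induced. The sketch below is illustrative only; the function names find_conflicts and is_consistent and the sample data are assumptions, and attribute values are assumed to be hashable.

```python
from collections import defaultdict


def find_conflicts(examples):
    """Return attribute assignments that occur with more than one label."""
    seen = defaultdict(set)
    for attrs, label in examples:
        # Attribute names are unique, so sorting the items gives a stable key.
        key = tuple(sorted(attrs.items()))
        seen[key].add(label)
    return {key: labels for key, labels in seen.items() if len(labels) > 1}


def is_consistent(examples):
    """True if no two examples agree on all attributes but differ in label."""
    return not find_conflicts(examples)


# Hypothetical data: the second list repeats a situation with a new label.
consistent = [
    ({"dist": "short", "outlook": "rain"}, True),
    ({"dist": "long", "outlook": "rain"}, False),
]
inconsistent = consistent + [({"dist": "short", "outlook": "rain"}, False)]

print(is_consistent(consistent), is_consistent(inconsistent))
```

When conflicts are present, no tree over the given attributes can classify every training case correctly, since the conflicting examples necessarily reach the same leaf.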
Approaches to address overfitting in decision trees broadly aim to do one of two things. They either (i) stop growing the tree early, i.e., before it perfectly classifies all training samples; or (ii) allow the tree to grow fully and then prune it afterwards, the latter generally being considered more effective [Mitchell, 1997]. Overall, though, the induction process will trade off some accuracy in classification for compactness of representation.