
PCTs for Single and Multiple Targets

In this section, we discuss the predictive clustering trees (PCTs) algorithm that deals with the standard classification and regression tasks, as well as the more complex tasks of multi-target classification and regression. PCTs that predict a single discrete or continuous target are called single target decision trees (STDTs), whereas those that predict multiple targets (a tuple of discrete or continuous variables) simultaneously are called multiple targets decision trees (MTDTs). We consider both STDTs and MTDTs and refer to them jointly as PCTs.

The task of mining predictive clustering trees (PCTs), for single and multiple targets, can be formalized as follows:

Given:

• A descriptive space X that consists of tuples of values of primitive data types (boolean, discrete or continuous), i.e., X = {X1, X2, . . . , Xm} spanned by m independent (or predictor) variables Xj,

• A target space Y = {Y1, Y2, . . . , Yq} spanned by q dependent (or target) variables Yj,

• A set T of training examples (xi, yi) with xi ∈ X and yi ∈ Y.

Find: a tree structure τ which represents:

• A set of hierarchically organized clusters on T such that for each u ∈ T, the clusters Ci0, Ci1, . . . , Cir exist for which u ∈ Cir and the containment relation Ci0 ⊇ Ci1 ⊇ . . . ⊇ Cir is satisfied. Clusters Ci0, Ci1, . . . , Cir are associated with the nodes ti0, ti1, . . . , tir, respectively, where each tij ∈ τ is a direct child of ti(j−1) ∈ τ (j = 1, . . . , r) and ti0 is the root.

Learning Predictive Clustering Trees (PCTs) 69

• A predictive piecewise function f : X → Y, defined according to the hierarchically organized clusters. In particular, ∀u ∈ X,

\[ f(u) = \sum_{t_i \in \mathrm{leaves}(\tau)} D(u, t_i)\, f_{t_i}(u) \tag{5.1} \]

where

\[ D(u, t_i) = \begin{cases} 1 & \text{if } u \in C_i \\ 0 & \text{otherwise} \end{cases} \]

and $f_{t_i}(u)$ is a (multi-objective) prediction function associated with the leaf $t_i$.
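Since the leaf clusters partition the examples, exactly one indicator D(u, ti) in Equation (5.1) equals 1 for a given u, so evaluating f amounts to routing u to its leaf and returning that leaf's prediction. The following is a minimal Python sketch of this reading. It assumes binary Boolean tests at internal nodes and constant prototypes at the leaves; the names Node and predict are hypothetical and introduced here only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple


@dataclass
class Node:
    """Hypothetical tree node: internal nodes hold a test, leaves a prototype."""
    test: Optional[Callable[[Sequence[float]], bool]] = None
    yes: Optional["Node"] = None
    no: Optional["Node"] = None
    prototype: Optional[Tuple[float, ...]] = None


def predict(node: Node, u: Sequence[float]) -> Tuple[float, ...]:
    """Follow the tests from the root to the unique leaf whose cluster contains u."""
    while node.test is not None:
        node = node.yes if node.test(u) else node.no
    return node.prototype


# A two-leaf tree predicting a pair of targets (illustrative values only).
tree = Node(
    test=lambda u: u[0] > 5.0,
    yes=Node(prototype=(1.0, 2.0)),
    no=Node(prototype=(0.0, 0.5)),
)
```

Here predict(tree, [7.0]) returns the prototype of the "yes" leaf and predict(tree, [3.0]) that of the "no" leaf.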

[Figure 5.1: three panels, (a), (b) and (c), each plotting the descriptive space against the target space.]

Figure 5.1: An illustration of predictive clustering: (a) clustering in the target space, (b) clustering in the descriptive space, and (c) clustering in both the target and descriptive spaces. Note that the target and descriptive spaces are presented here as one-dimensional axes for easier interpretation, but can actually be of higher dimensionality. Figure taken from (Blockeel, 1998).

Clusters are identified according to both the descriptive space and the target space, i.e., in X × Y (Figure 5.1(c)). This is different from what is commonly done in predictive modeling (Figure 5.1(a)) and classical clustering (Figure 5.1(b)), where only one of the spaces is generally considered.

We can now proceed to describe the top-down induction algorithm for building PCTs for single and multiple discrete and continuous targets. It is a recursive algorithm which takes as input the set of examples and the function η : V ↦ X × Y and partitions the set of nodes V until a stopping criterion is satisfied.

The construction of PCTs is not very different from that of standard decision trees (see, for example, the C4.5 algorithm proposed by Quinlan (1993)): at each internal node t, a test has to be selected according to a given evaluation function. The main difference is that PCTs select the best test by maximizing the (inter-cluster) variance reduction, defined as:

\[ \Delta_Y(C, P) = \mathrm{Var}_Y(C) - \sum_{C_k \in P} \frac{|C_k|}{|C|}\, \mathrm{Var}_Y(C_k) \tag{5.2} \]

where C represents the cluster associated to t and P defines the partition {C1,C2} of C. The partition is defined according to a Boolean test on a predictor variable in X. By maximizing the variance reduction, the cluster homogeneity is maximized, improving at the same time the predictive performance. VarY(C) is the variance computed on the Y variable (class) in the cluster C.
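As a small worked illustration of Equation (5.2), the sketch below computes the variance reduction of a candidate partition for a single continuous target. It assumes the population variance as Var and represents each cluster simply as a list of target values; the function names are hypothetical and chosen only for this example.

```python
def variance(values):
    """Population variance of a list of numbers (0.0 for an empty list)."""
    n = len(values)
    if n == 0:
        return 0.0
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n


def variance_reduction(cluster, partition, var=variance):
    """Var(C) minus the size-weighted variances of the parts C_k of the partition P."""
    total = len(cluster)
    weighted = sum(len(part) / total * var(part) for part in partition)
    return var(cluster) - weighted


cluster = [1.0, 1.0, 5.0, 5.0]
good_split = [[1.0, 1.0], [5.0, 5.0]]   # homogeneous parts
poor_split = [[1.0, 5.0], [1.0, 5.0]]   # parts as mixed as the parent
```

The first partition separates the two groups of values and removes all of the parent's variance, whereas the second leaves it unchanged, so a test inducing the first partition would be preferred. Passing a different var function is one way to mirror the idea that the variance is a parameter of the framework.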

If the variance Var(·) and the predictive function f (·) are considered as parameters, instantiated for the specific learning task at hand, it is possible to easily adapt PCTs to different domains and different tasks. The PCT framework allows different definitions of appropriate variance functions for different types of data and can thus handle complex structured data as targets.


Figure 5.2: An example of a multi-target regression tree. The image presents a multi-target regression tree where the splits are binary and the predictions of the two targets are given in brackets in each node of the tree (Stojanova, 2009).

To construct a single target classification tree, the variance function Var(·) returns the Gini index of the target variable of the examples in the partition E (i.e., $\mathrm{Var}(E) = 1 - \sum_{y \in Y} p(y, E)^2$, where p(y, E) is the probability that an instance in E belongs to the class y), whereas the predictive function of a nominal target variable is the probability distribution across its possible values.
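The Gini-based instantiation can be sketched in a few lines of Python. The sketch assumes that p(y, E) is estimated by the relative frequency of class y among the examples in E; the function names are hypothetical.

```python
from collections import Counter


def class_distribution(labels):
    """Relative frequency of each class: the prototype of a classification leaf."""
    n = len(labels)
    return {y: count / n for y, count in Counter(labels).items()}


def gini(labels):
    """Gini index 1 - sum_y p(y, E)^2, taken as 0.0 for an empty set of examples."""
    if not labels:
        return 0.0
    return 1.0 - sum(p * p for p in class_distribution(labels).values())


mixed = ["a", "a", "b", "b"]
pure = ["a", "a", "a"]
```

A cluster whose examples all share one class has Gini index 0, while an evenly mixed two-class cluster has Gini index 0.5.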

To construct a single target regression tree, the variance function Var(·) returns the variance of the response values of the examples in a cluster E (i.e., Var(E) = Var(Y)), whereas the predictive function is the average of the response values in the cluster.

The instantiation of the variance and prototype functions for multiple targets decision trees is done as follows. In the case of multiple targets classification trees, the variance function is computed as the sum of the Gini indexes of the target variables, i.e., $\mathrm{Var}(E) = \sum_{i=1}^{T} \mathrm{Gini}(E, Y_i)$, where T here denotes the number of target variables (q in the formalization above). Furthermore, one can also use the sum of the entropies of the class variables as the variance function, i.e., $\mathrm{Var}(E) = \sum_{i=1}^{T} \mathrm{Entropy}(E, Y_i)$ (this definition has also been used in the context of multi-label prediction (Clare, 2003)). The prototype function returns, for each target variable, a vector of probabilities that an instance belongs to each of its class values. Using these probabilities, the majority class for each target attribute can be calculated.
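The multi-target classification case can be illustrated with the following Python sketch. It assumes each example's targets are given as a tuple of class labels and that class probabilities are estimated by relative frequencies; all function names are hypothetical.

```python
import math
from collections import Counter


def distribution(labels):
    """Relative class frequencies for one target variable."""
    n = len(labels)
    return {y: count / n for y, count in Counter(labels).items()}


def gini(labels):
    return 1.0 - sum(p * p for p in distribution(labels).values())


def entropy(labels):
    return -sum(p * math.log2(p) for p in distribution(labels).values())


def multi_target_variance(rows, impurity=gini):
    """Sum of the per-target impurities over all target variables."""
    return sum(impurity(column) for column in zip(*rows))


def prototype(rows):
    """One class-probability distribution per target variable."""
    return [distribution(column) for column in zip(*rows)]


def majority_classes(rows):
    """Most probable class for each target variable."""
    return [max(dist, key=dist.get) for dist in prototype(rows)]


# Four examples with two nominal targets each (illustrative data).
rows = [("a", "x"), ("a", "y"), ("b", "y"), ("a", "y")]
```

Calling multi_target_variance(rows, impurity=entropy) switches to the entropy-based variant without changing the rest of the code.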

In the case of multiple targets regression trees, the variance is calculated as the sum of the variances of the target variables, i.e., $\mathrm{Var}(E) = \sum_{i=1}^{T} \mathrm{Var}(Y_i)$. The variances of the targets are normalized, so each target contributes equally to the overall variance. The prototype function (calculated at each leaf) returns as a prediction a vector of the mean values of the target variables. The prediction is calculated using the training instances that belong to the given leaf.
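The text does not spell out how the target variances are normalized. One plausible scheme, assumed here purely for illustration, divides each target's variance within the cluster by that target's variance over the whole training set, so that targets on very different scales contribute comparably. The function names below are hypothetical.

```python
def variance(values):
    """Population variance of a sequence of numbers."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n


def normalized_variance(cluster_rows, training_rows):
    """Sum over targets of (variance in the cluster / variance in the training set).

    Assumption: normalization by the training-set variance of each target.
    Targets that are constant in the training set are skipped.
    """
    local = [variance(column) for column in zip(*cluster_rows)]
    overall = [variance(column) for column in zip(*training_rows)]
    return sum(l / o for l, o in zip(local, overall) if o > 0)


def prototype(rows):
    """Vector of per-target means of the examples in a leaf."""
    return [sum(column) / len(column) for column in zip(*rows)]


# Two targets on very different scales (illustrative data).
training = [(1.0, 100.0), (3.0, 300.0), (5.0, 500.0), (7.0, 700.0)]
cluster = training[:2]
```

In this example both targets contribute equally to the normalized variance of the cluster even though the second target's raw variance is four orders of magnitude larger.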

In addition to these instantiations of the variance function for classical classification and regression problems, the CLUS system also implements other variance functions, such as reduced error, information gain, gain ratio and m-estimate.

Finally, the algorithm evaluates all possible tests to be put in a node. If no acceptable test can be found, that is, if no test significantly reduces variance (as measured by a statistical F-test), then the algorithm creates a leaf and labels it with a representative case, or prototype, of the given instances.
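The overall top-down procedure can be sketched as follows for a single continuous target. This is a simplified illustration and not the CLUS implementation: it considers only threshold tests on numeric descriptive variables, and it replaces the statistical F-test with a plain minimum variance reduction threshold (min_reduction, a hypothetical parameter). The function names and the dictionary-based tree representation are likewise assumptions.

```python
def variance(values):
    """Population variance of a non-empty list of numbers."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n


def best_test(X, y):
    """Return (reduction, feature index, threshold) of the best test, or None."""
    best = None
    n = len(y)
    for j in range(len(X[0])):
        # Dropping the largest value keeps both sides of every split non-empty.
        for threshold in sorted({x[j] for x in X})[:-1]:
            left = [yi for x, yi in zip(X, y) if x[j] <= threshold]
            right = [yi for x, yi in zip(X, y) if x[j] > threshold]
            reduction = variance(y) - (
                len(left) / n * variance(left) + len(right) / n * variance(right)
            )
            if best is None or reduction > best[0]:
                best = (reduction, j, threshold)
    return best


def induce(X, y, min_reduction=1e-3):
    """Recursively grow a tree; create a leaf when no test is acceptable."""
    candidate = best_test(X, y)
    if candidate is None or candidate[0] < min_reduction:
        return {"prototype": sum(y) / len(y)}  # leaf labeled with the mean
    _, j, threshold = candidate
    left = [(x, yi) for x, yi in zip(X, y) if x[j] <= threshold]
    right = [(x, yi) for x, yi in zip(X, y) if x[j] > threshold]
    return {
        "feature": j,
        "threshold": threshold,
        "left": induce([x for x, _ in left], [yi for _, yi in left], min_reduction),
        "right": induce([x for x, _ in right], [yi for _, yi in right], min_reduction),
    }


X = [[1.0], [2.0], [8.0], [9.0]]
y = [1.0, 1.0, 5.0, 5.0]
tree = induce(X, y)
```

On this toy data the root splits the single descriptive variable between 2.0 and 8.0, and both children become leaves because no further test reduces the variance.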

Figure 5.2 gives an example of a multi-target regression tree. The splits are binary and the predictions of the two targets are given in brackets in each node of the tree. In particular, it presents an example dataset for outcrossing rate prediction. The descriptive variables describe different environmental and ecological properties of the study area, while the targets are measurements of the transgenic male-fertile (MF) and the non-transgenic male-sterile (MS) lines of oilseed rape across the study area. More details on the dataset are given in (Demšar et al., 2005).