
In Chapter 3, we discussed several batch methods for learning linear model trees, which can be categorized mainly by the approach taken for solving the selection decision. One category of algorithms ignores the fact that the predictions will be computed using linear models instead of a constant regressor, and uses a variant of the least squares split evaluation function. As a result, there is no guarantee that the chosen split will minimize the mean squared error of the model tree.

The other category of algorithms tries to minimize the least squares error taking into account the hypothetical linear models on both sides of each candidate split. While the second category of algorithms gives theoretical guarantees that the mean squared error of the model tree will be minimized, its computational complexity is much higher compared to the first category of algorithms. Another problematic aspect of fitting linear models on each side of the candidate splits is that the problem is intractable for real-valued (numerical) attributes.

For these reasons, we have chosen an approach which is agnostic to the existence of linear models in the leaves. The model tree is thus built just like a regression tree, using the standard deviation reduction as an evaluation function. The linear models are computed after the execution of a split by using a computationally lightweight method based on online learning of un-thresholded perceptrons.
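As a reminder of what this evaluation function measures, the following is a minimal sketch of the standard deviation reduction for one candidate binary split. It assumes the usual definition, in which the standard deviation of the targets before the split is reduced by the size-weighted standard deviations of the two subsets. The function names `sd` and `sdr` are illustrative and are not taken from the FIMT-DD implementation.

```python
import math

def sd(values):
    """Population standard deviation of a list of target values."""
    n = len(values)
    if n == 0:
        return 0.0
    mean = sum(values) / n
    return math.sqrt(sum((v - mean) ** 2 for v in values) / n)

def sdr(left, right):
    """Standard deviation reduction of splitting the targets into left/right."""
    whole = left + right
    n = len(whole)
    return sd(whole) - len(left) / n * sd(left) - len(right) / n * sd(right)
```

A split that separates the target values perfectly yields a large reduction, while a split that leaves both subsets as spread out as the original set yields a reduction of zero.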

Learning Model Trees from Time-Changing Data Streams 77

Figure 9: An illustration of a single-layer perceptron

The perceptron is the basic building block of an artificial neural network (ANN) system, invented by Frank Rosenblatt. The simplest perceptron model is depicted in Figure 9. Here, each $w_i$ is a real-valued weight that determines the contribution of input $x_i$ to the perceptron output $o$.

All the numerical attributes are mapped to a corresponding input, while the target attribute is mapped to the perceptron's output. The symbolic (categorical) attributes must first be transformed into a set of binary attributes. More precisely, given inputs $x_1$ through $x_n$, the output $o$ computed by the perceptron is:

$$o = f\left(\sum_{i=1}^{n} w_i x_i - b\right) = f(\mathbf{w}\mathbf{x} - b) \qquad (44)$$

where b is a threshold and f is an activation function. The activation function can be, e.g., a step function, a linear function, a hyperbolic tangent function or a sigmoid function.
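To make Equation 44 concrete, here is a minimal sketch of the output computation with a pluggable activation function. The function names and the example numbers are assumptions chosen for illustration. The identity activation corresponds to the un-thresholded (linear) unit used in the leaves.

```python
import math

def perceptron_output(weights, inputs, b, activation=lambda s: s):
    """Compute o = f(sum_i w_i * x_i - b) for one input vector."""
    s = sum(w * x for w, x in zip(weights, inputs)) - b
    return activation(s)

def step(s):
    """Step activation: 1 if the weighted sum reaches the threshold, else 0."""
    return 1.0 if s >= 0 else 0.0

def sigmoid(s):
    """Logistic sigmoid activation."""
    return 1.0 / (1.0 + math.exp(-s))

# Linear unit: 0.5 * 2.0 + (-1.0) * 1.0 - 0.25 = -0.25
o_linear = perceptron_output([0.5, -1.0], [2.0, 1.0], b=0.25)
# Same weighted sum passed through a step function
o_step = perceptron_output([0.5, -1.0], [2.0, 1.0], b=0.25, activation=step)
```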

Learning a perceptron involves choosing values for the weights, including the threshold b, and is conditioned upon the function used to measure its error with respect to the desired output. Since we are dealing with a regression task, we want to obtain a linear unit which will give a best-fit approximation to the target concept. In that sense, the least squared error is a natural choice of an error function, E(w):

$$E(\mathbf{w}) = \frac{1}{2}(y - o)^2$$

where y and o are the target value and the perceptron output value for a given training example x.

The perceptron learning procedure is given as follows:

1. Initialize the weights w1, ..., wn, and the threshold b to random values.

2. Present a vector x to the perceptron and calculate the output o.

3. Update the weights using the following training rule: w ← w + η(y − o)x, where η is the learning rate.

4. Go to step 2 and repeat until a user-specified error is reached.

The training rule used in this learning procedure is known as the delta or Widrow-Hoff additive rule, for which it is known that the learning process converges even when the examples do not follow a linear model. The learning rule adjusts the weights of the inputs until the perceptron starts to produce satisfactory outputs. However, the threshold parameter b remains fixed throughout learning once the perceptron is initialized. Depending on the value of b, this could slow down convergence. As a solution, an additional input with a fixed value, $x_0 = 1$, is presented to the perceptron, with a weight $w_0 = -b$ initialized to a random value. With this modification, the convergence of the training algorithm is improved and the model is simplified.
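The procedure above, including the constant input that absorbs the threshold, can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the FIMT-DD implementation: the stopping criterion of step 4 is replaced by a fixed number of passes over a small in-memory data set, and the names `train_linear_unit`, `epochs` and `seed` are hypothetical.

```python
import random

def train_linear_unit(examples, eta=0.01, epochs=200, seed=0):
    """Online delta-rule training of a linear (un-thresholded) unit.

    `examples` is a list of (x, y) pairs, where x is a list of n inputs.
    Returns [w0, w1, ..., wn], where w0 plays the role of -b.
    """
    rng = random.Random(seed)
    n = len(examples[0][0])
    # Step 1: random initial weights, including w0 for the constant input.
    w = [rng.uniform(-1.0, 1.0) for _ in range(n + 1)]
    # Assumption: a fixed number of passes stands in for the error threshold.
    for _ in range(epochs):
        for x, y in examples:
            xa = [1.0] + list(x)  # prepend the constant input x0 = 1
            # Step 2: output of the linear unit.
            o = sum(wi * xi for wi, xi in zip(w, xa))
            # Step 3: w <- w + eta * (y - o) * x
            w = [wi + eta * (y - o) * xi for wi, xi in zip(w, xa)]
    return w
```

On noise-free data generated by y = 2x + 1, for instance, the returned weights approach w0 = 1 and w1 = 2 when the learning rate is small enough.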


The key idea behind the delta rule is to use gradient descent to search the hypothesis space H, which in this case is the set of all possible real-valued weight vectors; that is, $H = \{\mathbf{w} \mid \mathbf{w} \in \mathbb{R}^{n+1}\}$. For linear units, the error surface associated with the hypothesis space is always parabolic with a single global minimum. The direction of movement along this surface can be found by computing the derivative of E with respect to each component of the vector w, known as the gradient. The gradient gives the direction that produces the steepest increase of E, so the weight vector is moved in the opposite direction, which decreases E.

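In other words, $w_i \leftarrow w_i - \eta \frac{\partial E}{\partial w_i}$; thus, the steepest descent is achieved by altering each component $w_i$ in proportion to the corresponding partial derivative $\frac{\partial E}{\partial w_i}$:

$$w_i \leftarrow w_i + \eta (y - o) x_i, \quad \text{for } i \neq 0. \qquad (45)$$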
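For completeness, the partial derivative can be worked out explicitly. The short derivation below assumes a linear unit with the constant input included, $o = \sum_{i=0}^{n} w_i x_i$, and the squared error $E(\mathbf{w})$ defined earlier; it arrives at the training rule stated in step 3 of the learning procedure.

```latex
\begin{align*}
\frac{\partial E}{\partial w_i}
  &= \frac{\partial}{\partial w_i}\,\frac{1}{2}(y - o)^2
   = (y - o)\,\frac{\partial}{\partial w_i}\Bigl(y - \sum_{k=0}^{n} w_k x_k\Bigr)
   = -(y - o)\,x_i, \\
w_i - \eta\,\frac{\partial E}{\partial w_i}
  &= w_i + \eta\,(y - o)\,x_i .
\end{align*}
```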

The weights are updated with every training example arriving in the leaf instead of using a batch update. The learning rate η can be kept constant or it can decrease with the number of examples seen during the process of learning. The second option gives a smooth decay of the learning rate by using a formula where the value of the learning rate is inversely proportional to the number of instances:

$$\eta = \frac{\eta_0}{1 + n\,\eta_d}. \qquad (46)$$

where n is the number of examples used for training. As proposed in our previous work (?), the initial learning rate $\eta_0$ and the learning rate decay parameter $\eta_d$ should be set to appropriately low values (e.g., $\eta_0 = 0.1$ and $\eta_d = 0.005$). If a constant learning rate is preferred, it should be set to a small value (e.g., 0.01). In the experimental evaluation presented in the following sections, we refer to the two versions of the FIMT algorithm as FIMT Decay (with a decaying learning rate) and FIMT Const (with a constant learning rate).
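The decay schedule of Equation 46 is simple enough to state as a one-line function. The sketch below uses the example values from the text as defaults; the function name is an assumption made for illustration.

```python
def decayed_learning_rate(n, eta0=0.1, eta_d=0.005):
    """Learning rate after n training examples: eta0 / (1 + n * eta_d)."""
    return eta0 / (1.0 + n * eta_d)

# With the example settings, the rate starts at 0.1 and is halved
# once 200 examples have been seen, since 1 + 200 * 0.005 = 2.
rates = [decayed_learning_rate(n) for n in (0, 200, 1800)]
```

With these settings, the rate reaches the suggested constant value of 0.01 only after 1800 examples, so the decaying variant takes larger steps early in the life of a leaf.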

The perceptron represents a linear approximation of the training examples observed at the leaf node. Consequently, the weights of the perceptron are equal to the parameters of the fitted linear model at the node. Compared to existing approaches, this process is of linear complexity, given that a stochastic gradient descent method is used.

An important part of the process is the normalization of the inputs before they are presented to the perceptron. This is necessary to ensure that each of them will have the same influence during the process of training. The normalization can be performed incrementally by maintaining the sums $\sum_i x_j$ and $\sum_i x_j^2$, for $j = 1, ..., n$. FIMT-DD uses a variant of the studentized residual equation, where instead of dividing by one standard deviation ($\sigma$), it divides by three standard deviations. The equation is:

$$x'_i = \frac{x_i - \bar{x}}{3\sigma}. \qquad (47)$$
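The incremental normalization can be sketched for a single attribute as follows. The class name `RunningNormalizer` is hypothetical, and two details are assumptions not fixed by the text: the sample (n - 1) form of the variance, and returning 0 while fewer than two values have been seen or the variance is zero.

```python
import math

class RunningNormalizer:
    """Keeps the count, sum and sum of squares of one attribute."""

    def __init__(self):
        self.n = 0
        self.s = 0.0    # running sum of x
        self.ss = 0.0   # running sum of x squared

    def update(self, x):
        """Account for one newly observed attribute value."""
        self.n += 1
        self.s += x
        self.ss += x * x

    def normalize(self, x):
        """Return (x - mean) / (3 * sd) from the sums kept so far."""
        if self.n < 2:
            return 0.0  # assumption: not enough data to estimate sd
        mean = self.s / self.n
        var = (self.ss - self.s * self.s / self.n) / (self.n - 1)
        sd = math.sqrt(max(var, 0.0))  # guard against rounding below zero
        if sd == 0.0:
            return 0.0  # assumption: constant attribute maps to zero
        return (x - mean) / (3.0 * sd)
```

Because only three numbers are stored per attribute, the memory cost is constant and each update takes constant time.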

The learning phase of each perceptron is performed in parallel with the process of growing a node, and ends with a split or when a pre-pruning rule is applied to the leaf. If a split is performed, the linear model in the parent node is passed down to the child nodes. This avoids training from scratch: learning continues in each of the new leaves independently, according to the examples observed in each leaf separately. Since the feature space covered by the parent node is divided among the child nodes, this can be seen as fine-tuning the linear model in the corresponding sub-region of the instance space. Compared to the approach of Potts and Sammut (2005), FIMT-DD has a significantly lower computational complexity, which is linear in the number of regressors. An important disadvantage of FIMT-DD is that the split selection process is agnostic to the existence of linear models in the leaves. The process of learning linear models in the leaves will not explicitly reduce the size of the regression tree. However, if the linear model fits the examples assigned to the leaf well, no further splitting is necessary and pre-pruning can be applied.
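The hand-over of the parent's linear model at a split can be illustrated with a small sketch. The `Leaf` class and the `split_leaf` function are hypothetical names used only for this example; split selection, normalization and pre-pruning are left out. The point it shows is that each child starts from its own copy of the parent's weights and then adapts independently.

```python
class Leaf:
    """A leaf holding a linear model with weights [w0, w1, ..., wn]."""

    def __init__(self, weights):
        self.weights = list(weights)  # own copy, not shared with other leaves

    def predict(self, x):
        xa = [1.0] + list(x)  # constant input x0 = 1
        return sum(w * xi for w, xi in zip(self.weights, xa))

    def learn(self, x, y, eta=0.01):
        """One delta-rule update with the example (x, y)."""
        xa = [1.0] + list(x)
        err = y - self.predict(x)
        self.weights = [w + eta * err * xi for w, xi in zip(self.weights, xa)]

def split_leaf(parent):
    """Create two children that start from the parent's linear model."""
    return Leaf(parent.weights), Leaf(parent.weights)

parent = Leaf([1.0, 2.0])
left, right = split_leaf(parent)
left.learn([1.0], 10.0)  # only the left child sees this example
```

After the update, the left child's weights have moved towards the new example, while the right child still holds the weights inherited from the parent.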