Hoeffding-based Regression and Model Trees

Following the successful line of Hoeffding-based algorithms for classification, probabilistic estimates based on the Hoeffding bound have also been used for the task of learning regression trees from an infinite stream of examples. The FIMT, FIRT-DD and FIMT-DD algorithms by Ikonomovska and Gama (2008) and Ikonomovska et al. (2009, 2010b) are representatives of Hoeffding-based learning algorithms in the domain of regression analysis. FIMT is an online algorithm for learning linear model trees from stationary data streams. The statistical stability of the split selection decisions in FIMT is achieved by using the Chernoff bound. FIRT-DD is an extended version of FIMT that adds change detection abilities when learning regression trees from time-changing data streams. FIMT-DD is an improved version of the previous algorithms that enables learning linear model trees from time-changing data streams and is based on the Hoeffding bound instead. As such, it will be used in the presentation of our main ideas.

The inductive inference process of FIMT-DD is very similar to the one used in the previously discussed Hoeffding-based decision tree learning algorithms, with two major differences. The first is the way the probabilistic estimate is obtained in the comparison of candidate splits. The second is the incremental induction of additional linear models in the leaves of the tree.

The pseudocode of FIMT-DD is given in Algorithm 1. Like existing Hoeffding-based algorithms for classification, FIMT-DD starts with an empty leaf and reads examples in the order of their arrival. Each example is passed down to a leaf using the standard Traverse procedure, whose pseudocode is given in Algorithm 2. The procedure encapsulates the change detection and adaptation mechanisms of FIMT-DD. Regardless of whether a change has been detected, it always returns the leaf node that the training example reaches. The details of the drift detection and adaptation methods used in FIMT-DD are presented in the succeeding sections.

Every leaf has to maintain a set of sufficient statistics, updated using a corresponding procedure UpdateStatistics. A split selection decision is reached upon observing a sufficiently large sample of training examples, unless a pre-pruning rule is applied which corresponds to a stopping decision. The split evaluation measure is a variant of the one used by Breiman et al. (1984).

The selection decisions made by FIMT-DD are based on a probabilistic estimate of the ratio of standard deviation reduction (SDR) values for the two best performing candidate splits. After observing a portion of the instances from the data stream in one of its leaf nodes, the algorithm finds the best split for each attribute and then ranks the attributes according to the SDR value of their split candidates. If the splitting criterion is satisfied, FIMT-DD introduces a splitting test on the best attribute, creating two new leaves, one for each branch of the split.

Learning Model Trees from Time-Changing Data Streams 71

Algorithm 1 The incremental FIMT-DD algorithm. Pseudocode.

Input: δ - confidence level, N_Att - number of attributes, and n_min - chunk size
Output: T - current model tree

for ∞ do
    e ← ReadNext()
    Seen ← Seen + 1
    Leaf ← Traverse(T, e)        ▷ Traverses the example e to the corresponding Leaf node.
    UpdateStatistics(Leaf)       ▷ Updates the necessary statistics in the Leaf node.
    UpdatePerceptron(Leaf)       ▷ Updates the weights of the perceptron located in the Leaf node.
    if Seen mod n_min = 0 then
        for i = 1 → N_Att do
            S_i = FindBestSplitPerAttribute(i)        ▷ Computes the best split test per attribute.
        end for
        S_a ← Best(S_1, ..., S_NAtt)
        S_b ← SecondBest(S_1, ..., S_NAtt)
        if S_b / S_a < 1 − sqrt( ln(1/δ) / (2 × Seen) ) then
            MakeSplit(T, S_a)        ▷ Creates a binary split by applying S_a.
        end if
    end if
end for

Algorithm 2 The Traverse(T, e) procedure of the FIMT-DD algorithm. Pseudocode.

Input: T - root node, e - training example, PruneOnly - boolean parameter
Output: Leaf - leaf node where the example e will be traversed

if IsLeaf(T) ≠ True then
    Change ← UpdateChangeDetection(T, e)        ▷ Updates the statistics used in the Page-Hinkley test.
    if Change is True then
        if PruneOnly then
            Leaf ← Prune(T)        ▷ Prunes the whole sub-tree T.
        else
            InitiateAlternateTree(T)        ▷ Creates an alternate tree rooted at T.
            Leaf ← Traverse(T, e)        ▷ Traverses the example e to the corresponding Leaf node.
        end if
    end if
else
    Leaf ← T
end if
Return (Leaf)


More precisely, given a leaf where a sample of the dataset $S$ of size $N$ has been observed, a hypothetical binary split $h_A$ over attribute $A$ would divide the examples in $S$ into two disjoint subsets $S_L$ and $S_R$, with sizes $N_L$ and $N_R$, respectively ($S = S_L \cup S_R$; $N = N_L + N_R$). The formula for the SDR measure of the split $h_A$ is given below:

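$$\mathrm{SDR}(h_A) = \mathrm{sd}(S) - \frac{N_L}{N}\,\mathrm{sd}(S_L) - \frac{N_R}{N}\,\mathrm{sd}(S_R) \qquad (41)$$

$$\mathrm{sd}(S) = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y_i - \bar{y}\right)^2} = \sqrt{\frac{1}{N}\left(\sum_{i=1}^{N} y_i^2 - \frac{1}{N}\Big(\sum_{i=1}^{N} y_i\Big)^2\right)} \qquad (42)$$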
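The second form of Eq. (42) means that a leaf only needs three running sums per candidate branch to evaluate a split. The following is a minimal Python sketch of that idea, not the authors' implementation; the names `TargetStats` and `sdr` are hypothetical and chosen only for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class TargetStats:
    """Running sums of the target values (hypothetical helper): N, sum of y, sum of y^2."""
    n: int = 0
    sum_y: float = 0.0
    sum_y2: float = 0.0

    def add(self, y: float) -> None:
        # Incremental update, one example at a time.
        self.n += 1
        self.sum_y += y
        self.sum_y2 += y * y

    def sd(self) -> float:
        # Second form of Eq. (42): sqrt((sum y^2 - (sum y)^2 / N) / N).
        if self.n == 0:
            return 0.0
        var = (self.sum_y2 - self.sum_y ** 2 / self.n) / self.n
        return math.sqrt(max(var, 0.0))  # clamp small negative rounding errors


def sdr(left: TargetStats, right: TargetStats) -> float:
    """Standard deviation reduction of a binary split, following Eq. (41)."""
    n = left.n + right.n
    if n == 0:
        return 0.0
    # The parent statistics are simply the sums of the two branches.
    parent = TargetStats(n, left.sum_y + right.sum_y, left.sum_y2 + right.sum_y2)
    return parent.sd() - left.n / n * left.sd() - right.n / n * right.sd()


# Usage: a split that separates targets {1, 1} from {3, 3} perfectly.
left, right = TargetStats(), TargetStats()
for y in (1.0, 1.0):
    left.add(y)
for y in (3.0, 3.0):
    right.add(y)
print(sdr(left, right))
```

In this toy case both branches have zero standard deviation and the parent has standard deviation 1, so the whole deviation is removed by the split.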

Let $h_A$ be the best split (over attribute $A$) and $h_B$ the second best split (over attribute $B$). Let us further consider the ratio of the SDR values for the best two splits ($h_A$ and $h_B$) as a real-valued random variable $r$:

$$r = \mathrm{SDR}(h_B) / \mathrm{SDR}(h_A). \qquad (43)$$

Let us further observe the ratios between the best two splits after each consecutive example. Every observed value of the ratio can be considered a real-valued random variable $r_1, r_2, \ldots, r_N$. Let us denote the upper and lower bounds on the estimated sample average $\bar{r}$ as $\bar{r}^{+} = \bar{r} + \sqrt{\frac{\ln(1/\delta)}{2 \times N}}$ and $\bar{r}^{-} = \bar{r} - \sqrt{\frac{\ln(1/\delta)}{2 \times N}}$, where the true value lies in the interval $[\bar{r}^{-}, \bar{r}^{+}]$.

If the upper bound $\bar{r}^{+}$ of the sample average is below 1, then the true average is also below 1. Therefore, after observing a sequence of $N$ examples, the best attribute estimated from the available sample is truly the best over the whole distribution with probability $1 - \delta$. The algorithm selects the split $h_A$ with confidence $1 - \delta$, and proceeds with evaluating the next selection decision in one of the leaves of the regression tree. The newly arriving instances are passed down along the branches corresponding to the outcome of the test.
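The decision rule above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the reference implementation: the function names `hoeffding_bound` and `should_split` are hypothetical, and the bound is used in the form given in the text, which presumes that the ratio takes values in an interval of width 1.

```python
import math


def hoeffding_bound(delta: float, n: int) -> float:
    """Half-width of the confidence interval: sqrt(ln(1/delta) / (2 * n))."""
    return math.sqrt(math.log(1.0 / delta) / (2.0 * n))


def should_split(mean_ratio: float, delta: float, n: int) -> bool:
    """True when the upper bound on the average SDR ratio is below 1.

    mean_ratio is the sample average of SDR(h_B) / SDR(h_A) over n examples.
    """
    return mean_ratio + hoeffding_bound(delta, n) < 1.0


# Usage: the same observed ratio supports a split only once enough
# examples have been seen for the interval to shrink.
print(should_split(0.85, delta=0.01, n=50))   # False: interval still too wide
print(should_split(0.85, delta=0.01, n=200))  # True: upper bound is below 1
```

Rearranging the inequality gives the test used in Algorithm 1, where the ratio is compared against one minus the bound.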

The splitting criterion of FIMT-DD is sensitive to situations in which two or more splits have very similar or even identical SDR values. This type of situation is treated as a tie, and is handled using the method proposed by Domingos and Hulten (2000). Here ε denotes the half-width of the confidence interval, $\varepsilon = \sqrt{\ln(1/\delta) / (2 \times N)}$. The main idea is that, if the confidence intervals determined by ε have shrunk substantially and one still cannot distinguish between the best splits, choosing any of them would be equally satisfying. The stopping rule can then be defined by the maximal value of ε with which the user is satisfied. Let us denote this threshold value by τ. If ε becomes smaller than τ (e.g., τ = 0.05) and the splitting criterion is still not satisfied, we can argue that both of the competing candidate splits are equally good, and the one with the higher SDR value is chosen.
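A minimal sketch of how the tie-breaking rule might be combined with the splitting criterion is given below. The function name `decide_split` and its string return values are hypothetical, introduced only for illustration; the default τ = 0.05 is the example value mentioned in the text.

```python
import math


def decide_split(mean_ratio: float, delta: float, n: int, tau: float = 0.05) -> str:
    """Return 'split', 'split_tie' or 'wait' for a leaf that has seen n examples."""
    eps = math.sqrt(math.log(1.0 / delta) / (2.0 * n))
    if mean_ratio + eps < 1.0:
        return "split"      # the best candidate is better with confidence 1 - delta
    if eps < tau:
        return "split_tie"  # candidates are indistinguishable: take the higher SDR
    return "wait"           # keep collecting examples


# Usage: two nearly identical candidates (ratio close to 1).
print(decide_split(0.99, delta=0.01, n=500))   # 'wait': eps is still above tau
print(decide_split(0.99, delta=0.01, n=1000))  # 'split_tie': eps has dropped below tau
print(decide_split(0.50, delta=0.01, n=100))   # 'split': clear winner
```

With δ = 0.01 and τ = 0.05, ε falls below τ after roughly 920 examples, so a tie is never left unresolved for much longer than that.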

In document Algorithms for Learning Regression Trees and Ensembles on Evolving Data Streams. Elena Ikonomovska (Page 84-86)