Trees
Let us look closely at the selection phase of Hoeffding-based tree learning algorithms. Given a desired level of confidence and a range for the dispersion of the random variables, the error assigned to every probabilistic estimate decreases steadily as more examples are observed, at a rate that does not depend on the data. While some candidate splits exhibit a strong advantage over the rest, there are situations in which there is no clear difference in the estimated merits of two or more of them. The Hoeffding bound ignores the value of the mean and the variance of the random variables and does not take situations of this type into account. Therefore, when the value of the estimated mean is zero, the algorithm will not be able to make an informed decision even if an infinite number of examples is observed.
This problem has been resolved in an ad hoc fashion with the previously discussed tie-breaking mechanism. The tie-breaking mechanism determines the amount of data that has to be processed before any decision is reached. As a result, a constant delay is associated with every ambiguous situation.
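The constant delay can be made concrete with a small sketch. It assumes the standard form of the Hoeffding bound, ε = sqrt(R² ln(1/δ) / (2n)), and a tie-breaking rule that forces a decision once ε drops to a threshold τ or below. The names `hoeffding_epsilon`, `examples_until_tie_break`, and `tau` are illustrative and do not come from the original text.

```python
import math


def hoeffding_epsilon(n: int, delta: float, value_range: float = 1.0) -> float:
    """Hoeffding bound: with probability 1 - delta, the true mean lies
    within epsilon of the empirical mean of n observations."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def examples_until_tie_break(tau: float, delta: float, value_range: float = 1.0) -> int:
    """Smallest n for which epsilon <= tau (assumed tie-breaking rule)."""
    return math.ceil(value_range ** 2 * math.log(1.0 / delta) / (2.0 * tau ** 2))


if __name__ == "__main__":
    # epsilon depends only on n, delta and the range, never on the data
    for n in (200, 1000, 5000):
        print(n, round(hoeffding_epsilon(n, delta=1e-6), 4))
    print(examples_until_tie_break(tau=0.05, delta=1e-6))
```

Under these assumptions, with δ = 10⁻⁶ and τ = 0.05 every ambiguous situation costs 2764 examples before a decision is forced, however similar the competing splits are.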
Online Option Trees for Regression 99
Here, we propose a different solution, which exploits option nodes as an ambiguity-resolving mechanism. The main idea is as follows: if the standard selection decision is replaced with an easier selection decision, where multiple splits are chosen and added as options, there is no need to wait for more statistical evidence. A selection decision on multiple splits will be reached sooner, although with a higher approximation error. However, each of these decisions can be revised later, when more evidence has been collected for each possible split.
There is an additional value in introducing options in the structure of the regression tree, which is not restricted to Hoeffding trees. If we think of tree induction as a search through the hypothesis space H, an option node is best understood as a branching point in the search trajectory. Each optional split represents a different direction that can be followed in the exploration of the search space. The hill-climbing search strategy is therefore replaced with a more robust one, which enables us to revise the selection decisions in hindsight. As a result, an option tree has a different inductive bias than an ordinary decision tree, because it will not necessarily select the first smallest tree that fits the training data best in the general-to-specific ordering of hypotheses. Being able to explore a larger sub-space of H, an option tree is expected to be more stable and more robust than an ordinary decision tree.
7.2.1 Ambiguity-based Splitting Criterion
We begin with a brief high-level description of the ORTO algorithm and then discuss the relevant points in more detail. The algorithm follows the standard top-down approach for learning regression trees. The pseudo-code of ORTO with a generic strategy for combining the predictions is given in Algorithm 4.
Similarly to FIMT-DD, it starts with an empty leaf and reads examples from the stream in the order of their arrival. Each example is traversed to a leaf, where the necessary statistics (such as ∑ y_k and ∑ y_k²) are maintained per splitting point. After a minimum of nmin examples have been observed in a leaf of the tree, the algorithm examines the splitting criterion. If an ambiguous situation is encountered, the algorithm proceeds by creating an option node. Otherwise, a standard splitting node is introduced, and the procedure is performed recursively for the leaf nodes in which a multiple of nmin examples has been observed. Unlike FIMT-DD, the ORTO algorithm makes use of a prediction rule that enables it to combine multiple predictions obtained for a single example. The prediction rule is used only in the testing phase; thus, the algorithm for building an option tree is independent of the choice of strategy for combining the predictions.
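The role of the per-split statistics can be illustrated with a minimal sketch. It assumes that each side of a candidate split keeps only the count, ∑ y_k and ∑ y_k², and that variance reduction is the variance of the whole sample minus the size-weighted variances of the two partitions. The class and function names are hypothetical, and the exact merit function used by the algorithm is defined elsewhere in the text.

```python
from dataclasses import dataclass


@dataclass
class TargetStats:
    """Sufficient statistics of the target on one side of a candidate split."""
    n: int = 0
    sum_y: float = 0.0
    sum_y2: float = 0.0

    def add(self, y: float) -> None:
        # Constant-time update per example; no examples are stored.
        self.n += 1
        self.sum_y += y
        self.sum_y2 += y * y

    def variance(self) -> float:
        if self.n == 0:
            return 0.0
        mean = self.sum_y / self.n
        # Clamp small negative values caused by floating-point rounding.
        return max(self.sum_y2 / self.n - mean * mean, 0.0)


def variance_reduction(left: TargetStats, right: TargetStats) -> float:
    """Assumed definition: Var(S) - |S_L|/|S| Var(S_L) - |S_R|/|S| Var(S_R)."""
    total = TargetStats(left.n + right.n,
                        left.sum_y + right.sum_y,
                        left.sum_y2 + right.sum_y2)
    if total.n == 0:
        return 0.0
    return (total.variance()
            - left.n / total.n * left.variance()
            - right.n / total.n * right.variance())
```

Because the statistics of the two partitions add up to those of the whole sample, the merit of a split point can be evaluated at any time without revisiting past examples.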
Given the first portion of n instances, the algorithm computes the best split for each attribute and ranks these splits according to their variance reduction (VR) value. Let A1, A2, A3, ..., Ad be the attribute ranking obtained after observing n examples. Let us further consider the ratio of the variance reduction values of any attribute from the subset {A2, ..., Ad} and the best one A1, e.g., rn = VR(A2)/VR(A1), at the moment of observing n examples. We observe r1, r2, ..., rn as independent random variables with values in the range [0, 1]; that is, for 1 ≤ i ≤ n it holds that P(ri ∈ [0, 1]) = 1. The empirical mean of these variables is
\[ r = \frac{1}{n} \sum_{i=1}^{n} r_i. \]
By using the Hoeffding bound in the same way as in FIMT-DD, we can bound the probability of an error in approximating the true average with the estimate r ± ε, where ε is computed using Equation 40. To allow the introduction of option nodes, we make the following modification: if, after observing nmin examples, the inequality r + ε < 1 holds, a normal split is performed; otherwise, an option node is introduced with splits on all the attributes Ai for which the inequality
\[ \frac{VR(A_i)}{VR(A_1)} > 1 - \varepsilon, \quad i \neq 1 \tag{53} \]
Algorithm 4 ORTO: An incremental option tree learning algorithm. Pseudocode.
Input: δ - confidence level, nmin - chunk size, NAtt - number of attributes, Agg - aggregation method, β - decay factor
Output: T - current option model tree
for ∞ do
    e ← ReadNext()
    Seen ← Seen + 1
    Leaf ← Traverse(T, e)    ▷ The example e is traversed to the corresponding leaf node Leaf.
    UpdateStatistics(Leaf, e)    ▷ Updates the split-evaluation statistics using the attribute vector of e.
    TrainThePerceptron(Leaf, e)
    if Seen mod nmin = 0 then
        for i = 1 → NAtt do
            Si = FindBestSplit(i)    ▷ Computes the best split per attribute.
        end for
        Sa ← Best(Si), i = 1, ..., NAtt
        Sb ← SecondBest(Si), i = 1, ..., NAtt
        if Sb/Sa < 1 − √(ln(1/δ) / (2 × N)) then
            MakeSplit(T, Sa)    ▷ Creates a binary split using Sa.
        else
            O ← CreateOptionNode(T)    ▷ Transforms the leaf node T into an option node.
            MakeOption(O, Sa)    ▷ Creates one optional split applying Sa.
            Counter ← 0
            k ← CountPossibleOptions(O)    ▷ Computes the number of possible optional splits.
            l ← GetCurrentLevel(T)    ▷ The root node is at level 0.
            for i = 1 → NAtt do
                if i ≠ a ∧ Counter < k × β^l ∧ Si/Sa > 1 − √(ln(1/δ) / (2 × Seen)) then
                    MakeOption(O, Si)    ▷ Creates an optional split applying Si.
                    Counter ← Counter + 1
                end if
            end for
        end if
    end if
    return Prediction(T, Agg, e)
end for
holds. The splits for which the inequality r + ε < 1 does not hold are approximately equally discriminative; that is, on the observed sample of the data, their variance reduction is approximately equal to that of the best split. The fact that this inequality does not hold expresses the lack of the confidence necessary to discard them.
In other words, as long as the inequality r + ε < 1 does not hold, the algorithm has to collect more statistical evidence in favor of the observed best split. In general, when the algorithm needs more evidence than the initial nmin examples to determine the best split, we either have a tie situation, in which A1 competes with one or several other candidate splits, or noisy data, which make the evaluation less reliable. Instead of waiting for more examples to be observed, which would eventually give preference to one of the splits or declare a tie situation, we accept all of the competitive splits as allowed directions for the search. With the modified splitting criterion, the tree is allowed to grow faster. At the same time, having multiple options is a strategy to overcome the one-step look-ahead "myopia" of greedy search.
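The ambiguity-based criterion can be sketched in a few lines. This illustration makes two simplifying assumptions: the range of the ratio is 1, so that ε = sqrt(ln(1/δ) / (2n)), and, as in the pseudocode of Algorithm 4, the ratio is evaluated on the current variance reduction values rather than on a running mean. The function name `choose_splits` and its return convention are hypothetical.

```python
import math
from typing import List, Tuple


def choose_splits(vr: List[float], n: int, delta: float) -> Tuple[str, List[int]]:
    """Decide between a normal split and an option node.

    vr[i] is the variance reduction of the best split on attribute i.
    Returns ("split", [best]), ("option", [best, competitors...]) or
    ("none", []) when no candidate reduces the variance.
    """
    if not vr:
        return "none", []
    eps = math.sqrt(math.log(1.0 / delta) / (2.0 * n))
    ranked = sorted(range(len(vr)), key=lambda i: vr[i], reverse=True)
    best = ranked[0]
    if vr[best] <= 0.0:
        return "none", []
    ratios = {i: vr[i] / vr[best] for i in ranked[1:]}
    second = max(ratios.values(), default=0.0)
    if second + eps < 1.0:
        # Enough confidence that the best attribute is truly the best.
        return "split", [best]
    # Ambiguous: keep every attribute that cannot yet be discarded.
    competitors = [i for i in ranked[1:] if ratios[i] > 1.0 - eps]
    return "option", [best] + competitors
```

For example, with variance reductions 0.9, 0.5 and 0.88 after 200 examples and δ = 0.01, the first and third attributes cannot be separated and both become options, whereas the same values after 20000 examples lead to a normal split on the first attribute.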
7.2.2 Limiting the Number of Options
With this approach, however, we might encounter situations in which multiple candidate splits are equally discriminative with respect to the target attribute. Allowing splits of this type is a reasonable decision when the reduction of the variance is estimated to be high. However, when multiple splits are equally discriminative but do not contribute a significant reduction of the variance, allowing options on all of them is unnecessary and would result in excessive growth of the tree. For this reason, it is crucial to have some form of restriction on the number of options or the number of trees, which will control the tree growth and the memory allocation. Keeping in mind that the greatest reduction of variance is achieved with splits positioned at the higher levels of the hierarchy, that is, near the root node, an intuitive approach is to reduce the maximum number of allowed options as the level of the node increases. Although there is no theoretical support for the claim that options are most useful near the root of the tree, this has been shown empirically by Kohavi and Kunz (1997) in their study on batch methods for learning option trees.
We decided to rely on the existing empirical evidence for the batch learning setup and introduce a restriction on the number of options that tightens with the depth of the tree. The number of possible options is thus given by
\[ o = k \cdot \beta^{\mathit{level}}, \tag{54} \]
where k is the number of competitive attributes chosen using the inequality in Equation 53, and β is a decay factor. We refer to o as the decaying option factor, which will be used as an alternative to the basic algorithm (with option factor k). The levels in the tree are enumerated starting with 0 at the root.
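A short sketch shows how the decaying option factor could be applied at a node. It assumes that the competitive splits are supplied in order of preference and that, as in Algorithm 4, a further option is admitted while the number already admitted is strictly below k · β^level. The function names are illustrative.

```python
from typing import List


def decaying_option_factor(k: int, beta: float, level: int) -> float:
    """Equation 54: o = k * beta ** level, with the root at level 0."""
    return k * beta ** level


def limit_options(competitors: List[int], beta: float, level: int) -> List[int]:
    """Keep competitors while the count stays below the decaying factor."""
    budget = decaying_option_factor(len(competitors), beta, level)
    kept: List[int] = []
    for attribute in competitors:
        if len(kept) < budget:
            kept.append(attribute)
    return kept
```

With β = 0.5 and four competitive splits, all four are admitted at the root, two at level 1, and one at level 2. Because the comparison is strict, at least one competitor is admitted at any depth whenever k > 0 and β > 0.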
For an intuitive comparison of option trees with ensembles of trees, it is more useful to limit the size of the option tree in a different way. Instead of deciding explicitly on the number of allowed options per level or per node, one can place a constraint on the maximum number of trees represented by a single option tree, using a parameter Tmax. When this number is reached, introducing options is no longer allowed. This control mechanism is similar to the way the size of an ensemble is constrained, that is, by limiting the number of possible base models. Therefore, we can compare the predictive accuracy and the memory consumption of an ensemble with those of an option tree that represents an equal number of base models.
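One way to enforce such a constraint is to count the trees that an option tree represents. The sketch below relies on a counting rule that is an assumption and is not stated in the text: a leaf represents one tree, a split node represents the product of the counts of its children, and an option node represents the sum of the counts of its options. The `Node` class and both function names are hypothetical.

```python
import math
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    kind: str = "leaf"  # "leaf", "split" or "option"
    children: List["Node"] = field(default_factory=list)


def count_trees(node: Node) -> int:
    """Number of ordinary trees represented by the option tree at node."""
    if node.kind == "leaf":
        return 1
    counts = [count_trees(child) for child in node.children]
    if node.kind == "option":
        return sum(counts)      # exactly one option is followed per tree
    return math.prod(counts)    # branches combine independently


def options_allowed(root: Node, t_max: int) -> bool:
    """Options may be introduced only while fewer than t_max trees exist."""
    return count_trees(root) < t_max
```

Under this rule, a split node whose two branches are option nodes with three and two optional splits represents 3 × 2 = 6 trees, so with Tmax = 6 no further options would be introduced.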