Hoeffding bounds are a non-parametric method which, given a sequence of independent observations of a random variable, permit one to place a confidence interval on the underlying mean value of the random variable (Hoeffding, 1963). Namely, the Hoeffding bound can be
68 Learning Model Trees from Time-Changing Data Streams
used to achieve a (δ, ε)-approximation of the mean value of a sequence of random variables X1, X2, ..., enabling us to state that, with probability 1 − δ, the true underlying mean (Xtrue) lies within distance ε of the sample mean (Xest):
$$\Pr\left[\,|X_{true} - X_{est}| > \varepsilon\,\right] < 2e^{-2N\varepsilon^{2}/R^{2}} \qquad (39)$$
Here, R is the (known a priori) maximum range of the random variables and N is the number of samples.
We can therefore make statements like "We are 99% confident that our estimate of the average is within ε of the true average", which translates to Pr[|Xtrue − Xest| > ε] < 0.01. The confidence parameter is typically denoted by δ. Equating the right-hand side of Equation (39) with δ, we get an estimate of the precision of the approximation in terms of N, R and δ:
$$\varepsilon = \sqrt{\frac{R^{2}\ln(2/\delta)}{2N}}. \qquad (40)$$
The value of ε decreases monotonically as more samples are observed, which corresponds to a shrinking confidence interval. Naturally, our estimate improves as we observe more samples from the unknown distribution, but in this case the probability of failing to bound our estimate also decays quickly. In particular, one of the main benefits of the Hoeffding bound is that it gives an exponential decay, in the number of samples N, of the probability that the sample mean deviates from its expected value by more than ε.
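As an illustration of Equation (40), the following minimal sketch computes ε for given N, R and δ, and inverts the same formula to obtain the number of samples needed for a target ε. The function names `hoeffding_epsilon` and `samples_needed` are hypothetical and chosen only for this example.

```python
import math


def hoeffding_epsilon(n: int, value_range: float, delta: float) -> float:
    """Half-width epsilon of the two-sided bound, as in Equation (40)."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie in (0, 1)")
    return math.sqrt(value_range ** 2 * math.log(2.0 / delta) / (2.0 * n))


def samples_needed(epsilon: float, value_range: float, delta: float) -> int:
    """Smallest N for which Equation (40) yields a half-width of at most epsilon."""
    if epsilon <= 0.0:
        raise ValueError("epsilon must be positive")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie in (0, 1)")
    return math.ceil(value_range ** 2 * math.log(2.0 / delta) / (2.0 * epsilon ** 2))
```

For a fixed δ and R, ε shrinks in proportion to 1/√N, so halving the interval width requires roughly four times as many samples.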
Another advantage of the Hoeffding bound is that it makes no assumption about the distribution of the random variables. This comes at the price of greater conservatism: if we could make stronger assumptions, the same confidence intervals could be reached using fewer examples. Nevertheless, Hoeffding bounds have been used extensively in a variety of machine learning tasks. In Chapter 2, we have already shown how the Hoeffding bound can be applied in the context of PAC learning to determine the number of training instances necessary to find a hypothesis in H that is within distance ε of the best one, H∗, with probability 1 − δ.
Hoeffding bounds have also been used in racing models for the task of automated model selection (Maron and Moore, 1994, 1997). The main idea is to race competing models until a clear winner is found. The racing algorithm starts with a collection of models, each of which is associated with two pieces of information: a current estimate of its average error and the number of test points it has processed. At each iteration of the algorithm, a random instance is selected from a given test set. Then, for each model, a leave-one-out error is computed using the testing instance; this is used to update the model's estimate of its own average error. The Hoeffding bound is used to determine probabilistically how close the model's estimate is to the true average error. The models whose lower bound on their average error is greater than the upper bound on the average error of the best model are eliminated from the race. In contrast to beam search, the set of models is known from the beginning of the evaluation process, whereas in beam search the final set of models is chosen heuristically from an unknown superset. A slightly different implementation of the same idea has been used for racing features in an algorithm for feature subset selection, in the work of Moore and Lee (1994). At each step in forward feature selection, all the features available to be added are raced against each other in parallel. A feature drops out of the race if the estimated upper bound on its merit is lower than the estimated lower bound on the merit of the best feature.
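The elimination rule of the racing algorithm can be sketched as follows. This is a simplified illustration, not the implementation of Maron and Moore: the function `race` and its arguments are hypothetical, each model is represented only by a callable that returns its error on a test point, test points are drawn without replacement, and the same δ is applied at every iteration without any correction for the number of models or iterations.

```python
import math
import random


def hoeffding_epsilon(n, value_range, delta):
    """Half-width of the two-sided Hoeffding interval after n samples."""
    return math.sqrt(value_range ** 2 * math.log(2.0 / delta) / (2.0 * n))


def race(error_fns, test_points, value_range=1.0, delta=0.05, seed=0):
    """Race models until one survives or the test points run out.

    error_fns maps a model name to a callable(point) returning an error
    in [0, value_range]. Returns (surviving model names, points used).
    """
    rng = random.Random(seed)
    points = list(test_points)
    rng.shuffle(points)  # random order, i.e. sampling without replacement

    alive = set(error_fns)
    totals = {name: 0.0 for name in error_fns}
    n = 0
    for point in points:
        n += 1
        # Every surviving model is evaluated on the same test point.
        for name in alive:
            totals[name] += error_fns[name](point)

        eps = hoeffding_epsilon(n, value_range, delta)
        means = {name: totals[name] / n for name in alive}
        best_upper = min(means.values()) + eps

        # Drop models whose lower error bound exceeds the best upper bound.
        alive = {name for name in alive if means[name] - eps <= best_upper}
        if len(alive) == 1:
            break
    return alive, n
```

Because eliminated models are never evaluated again, the cost per test point falls as the race progresses, which is the source of the speed-up over evaluating every model on the full test set.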
The same idea has been further leveraged by Kohavi (1994) in a different algorithm for feature subset selection, where the task is formulated as a state space search with probabilistic estimates. In this problem setup, each state represents a subset of features, and each move in the space is drawn from the set of possible moves {"add one feature", "remove one feature"}. The state evaluation function f(s) is an indicator of the quality of the state s, thus the goal is to find the state with the maximal value of f∗(s). The problem can be approached using a simple hill climbing method or other more sophisticated alternatives, such as simulated annealing, beam search or a genetic algorithm. However, each step in the search would have to be accompanied by an evaluation phase, which makes the exploration of the search space a very tedious task. The authors have therefore proposed a probabilistic evaluation function, which offers a trade-off between decreased accuracy of the estimates and increased state exploration. By treating the evaluation function as a random variable, the Hoeffding bound is used to provide a probabilistic estimate of the true value of f(s).
These early ideas on using probabilistic estimates for speeding up different machine learning tasks have inspired the work of Domingos and Hulten (2000), now considered a main reference point for incremental algorithms for learning decision trees. Domingos and Hulten (2000) have used the feature racing idea in a decision tree learning algorithm as a method for approximate probabilistic evaluation of the set of possible refinements of the current hypothesis, given a leaf node as a reference point. The probability of failing to make the correct selection decision (the one that would have been chosen if an infinite set of training examples were available) is bounded by the parameter δ. The resulting sequence of probably approximately correct splitting decisions gives an asymptotically approximately correct model.
More precisely, given an evaluation function f and a set of refinements r1, r2, ..., rm, the algorithm tries to estimate the difference in performance between the best two refinements with a desired level of confidence 1 − δ. Let $r_i$ and $r_j$ be the best two refinements, whose estimated evaluation function values after observing N examples are $\hat{f}(r_i)$ and $\hat{f}(r_j)$, with $\hat{f}(r_i) > \hat{f}(r_j)$. Let us denote the difference of interest by $\Delta\hat{f} = \hat{f}(r_i) - \hat{f}(r_j)$. We consider it as a random variable and apply the Hoeffding inequality to bound the probability that it deviates from the true difference $\Delta f = f(r_i) - f(r_j)$ by more than ε: $\Pr[|\Delta f - \Delta\hat{f}| > \varepsilon] < 2e^{-2N\varepsilon^{2}/R^{2}}$. If $\Delta\hat{f} - \Delta f < \varepsilon$ with probability 1 − δ, then $\Delta f > \Delta\hat{f} - \varepsilon$. Given that $\Delta\hat{f} > \varepsilon$, the true difference must be positive with probability 1 − δ, which means that $f(r_i) > f(r_j)$. This is valid as long as the estimated values $\hat{f}$ can be viewed as the average of f values for the examples observed at the leaf node¹. Thus, when $\Delta\hat{f} > \varepsilon$, there is enough statistical evidence to support the choice of $r_i$.
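The selection test can be illustrated with a minimal sketch, assuming that the estimates of the evaluation function are already available as a mapping from refinements to values and using the two-sided form of ε from Equation (40). The function `choose_refinement` and its arguments are hypothetical names introduced for this example only. The best refinement is returned once the observed gap to the runner-up exceeds ε, and `None` is returned while the evidence is still insufficient.

```python
import math


def hoeffding_epsilon(n, value_range, delta):
    """Half-width of the two-sided Hoeffding interval after n samples."""
    return math.sqrt(value_range ** 2 * math.log(2.0 / delta) / (2.0 * n))


def choose_refinement(estimates, n, value_range, delta):
    """Return the best refinement if its lead is statistically supported.

    estimates maps each candidate refinement to its estimated merit after
    n examples. Returns None when the gap between the two best candidates
    does not yet exceed epsilon, i.e. more examples are needed.
    """
    if len(estimates) < 2:
        raise ValueError("at least two candidate refinements are required")

    # Rank candidates by estimated merit, highest first.
    ranked = sorted(estimates, key=estimates.get, reverse=True)
    best, runner_up = ranked[0], ranked[1]

    gap = estimates[best] - estimates[runner_up]
    eps = hoeffding_epsilon(n, value_range, delta)
    return best if gap > eps else None
```

With the same estimates, a call that returns `None` for a small N can return the best refinement for a larger N, since ε shrinks while the gap stays fixed.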
A key property of the Hoeffding tree algorithm is that it is possible to guarantee, under realistic assumptions, that the trees it produces are asymptotically arbitrarily close to the ones produced by a batch learner. Domingos and Hulten (2000) have proven that the maximal expected disagreement between the Hoeffding tree (induced by the Hoeffding tree algorithm with a desired probability δ given an infinite sequence of examples) and an asymptotic batch decision tree, induced by choosing at each node the attribute with the true greatest f (i.e., by using an infinite number of examples at each node), is bounded by δ/p, where p is the probability that an example reaches a leaf (assumed constant for simplicity). A useful application of the bound is that, instead of providing a desired level of confidence in each splitting decision, users can now specify as input to the Hoeffding tree algorithm the maximum expected disagreement they are willing to accept, given enough examples for the tree to settle, and an estimate of the leaf probability p.
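As a small worked illustration of this use of the bound, the sketch below solves δ/p ≤ d for δ, where d is the maximum expected disagreement the user is willing to accept. The function name `delta_for_disagreement` is hypothetical, and the constant leaf probability p is the simplifying assumption already made in the text.

```python
def delta_for_disagreement(max_disagreement: float, leaf_probability: float) -> float:
    """Largest delta for which the bound delta / p stays within max_disagreement."""
    if not 0.0 < max_disagreement < 1.0:
        raise ValueError("max_disagreement must lie in (0, 1)")
    if not 0.0 < leaf_probability <= 1.0:
        raise ValueError("leaf_probability must lie in (0, 1]")
    # delta / p <= d  is equivalent to  delta <= d * p
    return max_disagreement * leaf_probability
```

For instance, accepting an expected disagreement of at most 1% with a leaf probability of 1% leads to a δ of about 0.0001 for each splitting decision.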
The applicability of the Hoeffding bound depends on the assumption of observing a sequence of independent and identically distributed random variables, which can be difficult to achieve in practice. Another caveat when using the bound is the hidden assumption that the evaluation function can be viewed as an average over a sequence of random variables. Finally, the Hoeffding tree algorithm implicitly races only two features in the refinement selection performed at the reference leaf node. This is due to the assumption that the third-best and all of the lower ranked features have sufficiently smaller merits, so that their probability of being the best refinement is very small and can be neglected. Despite these assumptions, which might be difficult to satisfy in practice, Hoeffding trees have been shown to achieve remarkable accuracy and efficiency in learning.
Hoeffding-based algorithms (Gama et al., 2004b, 2003; Hulten et al., 2001) can process millions of examples efficiently in the order of their arrival, without having to store any training data. Given enough training examples, Hoeffding trees have been shown to achieve a performance comparable to that of a batch decision tree. This is empirical evidence that a sequence of probably approximately correct splitting decisions, supported by the Hoeffding probabilistic bound, can provide the means for successful convergence to a hypothesis which is very near to the optimal hypothesis.