Boosting - Generalized Maximum Entropy, Convexity and Machine Learning

In this section we show how the boosting problem (Schapire, 2001) fits into our framework. We distinguish between the boosting problem and boosting algorithms. Our focus is the boosting problem; algorithms for solving it are discussed in a later chapter. Originally the two concepts were introduced together and were intertwined. This turns out to be unnecessary: the boosting problem can be described and analyzed without reference to a specific algorithm.

Our development closely follows that of Lebanon and Lafferty (2001), with the notation changes needed to make the picture clear in our setting. They showed that AdaBoost (Collins et al., 2002) and exponential family maximum likelihood estimation could be seen as identical, except for a special normalization constraint utilized in the latter. Further specialization of both models leads to a correspondence between logistic regression and binary AdaBoost. Here we continue the connection-making, fitting these models into the even larger picture developed in this thesis.

5.3.1 Notation

The general problem in boosting involves an input-output pair $(x, y) \in \mathcal{X} \times \mathcal{Y}$ and a number of deterministic functions $f_j : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$. Often, each $f_j$ is a classifier, termed a weak learner. The goal is to find the best set of weights with which to combine the outputs of the weak learners into a better classifier, hence the term boosting. The outputs of the $f_j$'s are known on the data set, and are available to classify new examples. In other respects they are black boxes, and the results of this section do not require the interpretation just given in order to hold.

In the problem setup of Lebanon and Lafferty (2001), the data sample is denoted $((x_n, y_n))_{n=1}^{N}$. We will need several facts about the sample. The empirical distribution is

$$\tilde{p}_{X,Y}(x, y) = \frac{1}{N} \sum_{n=1}^{N} \delta(x_n, x)\,\delta(y_n, y).$$

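This function is zero at points not in the sample and (usually) takes the value $1/N$ at points in the sample; a pair that occurs $k$ times in the sample receives the value $k/N$. The need for the redundant-looking subscript on $\tilde{p}_{X,Y}(x, y)$ will become apparent in a moment. The empirical marginal and empirical conditional distributions are defined analogously as

$$\tilde{p}_X(x, y) = \frac{1}{N} \sum_{n=1}^{N} \delta(x_n, x), \quad \text{and} \quad \tilde{p}_{Y|X}(x, y) = \begin{cases} \dfrac{\tilde{p}_{X,Y}(x, y)}{\tilde{p}_X(x, y)} & \text{if } \tilde{p}_X(x, y) \neq 0, \\ 0 & \text{if } \tilde{p}_X(x, y) = 0. \end{cases}$$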

For each $x$ that appears in the sample, $\tilde{p}_{Y|X}(x, \cdot)$ is a probability distribution, a different one at each $x$. This notation makes it easier to switch to vector notation. Consider the pair $(x, y)$ as the result of a one-to-one finite map from a generic index set $I = \{1, \ldots, |\mathcal{X}|\,|\mathcal{Y}|\}$ to the sample space $\mathcal{X} \times \mathcal{Y}$. Thus for $i \in I$ it makes sense to write $(x, y) = (x, y)(i)$, with the $(x, y)$ on the right as the name of the map in one direction. We can switch back with the inverse map $i = (x, y)^{-1}$, so that we write $i = i(x, y)$. Naturally the index $i$ is more conducive to defining vectors elementwise. In a mathematical sense this could be considered trivial but, from the viewpoint of understanding, it is critical. The change makes it clear that if all indexes and sums are taken over $(x, y)$ rather than separately for $x$ and $y$, it becomes easy to take advantage of vector/matrix notation. In the standard statistical style of presentation, this is not so common.
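The index map can be illustrated with a minimal NumPy sketch. The sizes `nx` and `ny` are hypothetical, positions within $\mathcal{X}$ and $\mathcal{Y}$ stand in for the elements themselves, and the indices are zero-based, whereas $I$ in the text starts at 1. The ordering with $y$ varying fastest is an assumption here; it is the one used later for the normalization constraint.

```python
import numpy as np

# Hypothetical sizes of the finite sets X and Y (illustrative assumption).
nx, ny = 3, 2
shape = (nx, ny)

# Map every flat index i to its (x, y) position; row-major order means
# the y position varies fastest as i increases.
i_all = np.arange(nx * ny)
x_pos, y_pos = np.unravel_index(i_all, shape)

# Inverse map: recover i = i(x, y) from the pair of positions.
i_back = np.ravel_multi_index((x_pos, y_pos), shape)

for i, a, b in zip(i_all, x_pos, y_pos):
    print(f"i = {i}  <->  (x position, y position) = ({a}, {b})")
```

The round trip `i -> (x, y) -> i` returns every index unchanged, which is all that the one-to-one property requires.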

Based on the foregoing discussion, the vectors $\tilde{\mathbf{p}}_{X,Y}$, $\tilde{\mathbf{p}}_X$, $\tilde{\mathbf{p}}_{Y|X}$ all have obvious definitions based on the definitions of $\tilde{p}_{X,Y}(x, y)$, $\tilde{p}_X(x, y)$ and $\tilde{p}_{Y|X}(x, y)$.
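As a concrete illustration, the three vectors can be computed for a small sample. The sets `X_vals`, `Y_vals` and the sample below are hypothetical, and the flat index is zero-based with $y$ varying fastest; this is a sketch of the definitions above under those assumptions, not a prescribed implementation.

```python
import numpy as np

# Hypothetical finite sets and toy sample (illustrative assumptions).
X_vals = [0, 1, 2]
Y_vals = [-1, 1]
sample = [(0, 1), (0, 1), (1, -1), (1, 1)]   # x = 2 never occurs
N = len(sample)
nx, ny = len(X_vals), len(Y_vals)

def index(x, y):
    """Zero-based index i(x, y) with y varying fastest."""
    return X_vals.index(x) * ny + Y_vals.index(y)

# Empirical joint distribution: each sample point contributes 1/N.
p_xy = np.zeros(nx * ny)
for x_n, y_n in sample:
    p_xy[index(x_n, y_n)] += 1.0 / N

# Empirical marginal, repeated over y so that it is indexed by (x, y).
p_x = np.repeat(p_xy.reshape(nx, ny).sum(axis=1), ny)

# Empirical conditional, set to zero wherever the marginal is zero.
p_y_given_x = np.divide(p_xy, p_x, out=np.zeros_like(p_xy), where=p_x != 0)

print(p_xy)
print(p_x)
print(p_y_given_x)
```

Because the marginal is stored with the redundant $y$ argument, all three vectors have the same length $|\mathcal{X}|\,|\mathcal{Y}|$ and can be combined elementwise.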

The objective function is defined as

$$\sum_{x \in \mathcal{X}} \tilde{p}_X(x) \sum_{y \in \mathcal{Y}} \Delta_s\big(p(y|x), q(y|x)\big).$$

In Lebanon and Lafferty (2001), $s$ is SBG entropy ($s(p) = p \log(p)$) and $q(y|x)$ is an initial guess, an input variable. That means $\Delta_s$ is the same as the unnormalized KL-divergence (theorem ref). Also, the notation used in Lebanon and Lafferty (2001) contains some syntactic sugar meant to convey the role of certain terms. The term $\tilde{p}_X(x)$ could be written as $\tilde{p}_X(x, y)$. The index $(y|x)$ should be regarded as just a different way to express the index $(x, y)$, while $p(y|x)$ and $q(y|x)$ are actually meant to be just positive measures. With all that in mind, the objective is equivalent to

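$$\sum_{(x,y) \in \mathcal{X} \times \mathcal{Y}} \tilde{p}_X(x, y)\,\Delta_s\big(p(x, y), q(x, y)\big) = \sum_{i \in I} \tilde{p}_X(i)\,\Delta_s\big(p(i), q(i)\big) = \langle \tilde{\mathbf{p}}_X, \Delta_s[\mathbf{p}, \mathbf{q}] \rangle.$$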
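The equivalence of the nested sum and the inner product can be checked numerically. The sketch below assumes that $\Delta_s$ takes the standard elementwise unnormalized KL form $p \log(p/q) - p + q$, which is the divergence generated by $s(p) = p \log(p)$; the marginal `p_x` and the positive measures `p`, `q` are hypothetical test values.

```python
import numpy as np

def delta_s(p, q):
    """Elementwise unnormalized KL-divergence: p log(p/q) - p + q, for p, q > 0."""
    return p * np.log(p / q) - p + q

nx, ny = 3, 2                                # hypothetical |X| and |Y|
rng = np.random.default_rng(0)
p_x = np.array([0.5, 0.5, 0.0])              # hypothetical empirical marginal on X
p = rng.uniform(0.1, 1.0, size=(nx, ny))     # positive measure p(y|x)
q = rng.uniform(0.1, 1.0, size=(nx, ny))     # initial guess q(y|x)

# Nested-sum form: sum over x of p_x(x) times the sum over y of the divergence.
double_sum = sum(
    p_x[a] * sum(delta_s(p[a, b], q[a, b]) for b in range(ny))
    for a in range(nx)
)

# Vector form: repeat the marginal over y, flatten with y varying fastest,
# and take the inner product <p_X, Delta_s[p, q]>.
p_x_vec = np.repeat(p_x, ny)
inner = p_x_vec @ delta_s(p.ravel(), q.ravel())

print(double_sum, inner)
```

The two values agree up to floating-point rounding, and the divergence vanishes when `p` and `q` coincide.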

Now the connection of the objective function to our setting is clear. The function is a weighted version of unnormalized KL-divergence. The weights depend on the data sample (but not on the $y$-values). Otherwise it perfectly fits the format of (3.23). Next, the feature constraints (Equation 1, Lebanon and Lafferty (2001)) are given as:

$$\sum_{(x,y) \in \mathcal{X} \times \mathcal{Y}} \tilde{p}_X(x)\,p(y|x)\,\big(f_j(x, y) - E_{\tilde{p}}(f_j|x)\big) = 0 \quad \text{for } j = 1, \ldots, J. \tag{5.11}$$

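Now we can introduce the term $\tilde{f}_j$, using the usual definition, as a function of $(x, y)$:

$$\tilde{f}_j(x, y) := \tilde{f}_j(x) := E_{\tilde{p}}(f_j|x) = \sum_{y' \in \mathcal{Y}} \tilde{p}_{Y|X}(x, y')\,f_j(x, y'),$$

and use it to construct the matrix

$$\tilde{F} = \big[\tilde{f}_1, \ldots, \tilde{f}_J\big]^{\top},$$

whose $j$-th row is the vector $\tilde{f}_j$, laid out like the feature matrix $F$ formed from $f_1, \ldots, f_J$ (one feature per row, so that the product below is defined). With these, the feature constraint (5.11) can be written as

$$(F - \tilde{F})\,(\operatorname{diag} \tilde{\mathbf{p}}_X)\,\mathbf{p} = \mathbf{0}. \tag{5.12}$$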
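The matrix form of the feature constraint can be checked against the sum in (5.11) on a small example. In the sketch below the sample counts and the features `f` are hypothetical, the matrices are laid out with one feature per row (an assumption made so that the product is defined without a transpose), and indices are zero-based with $y$ varying fastest. It also checks that the empirical conditional distribution itself satisfies the constraint.

```python
import numpy as np

nx, ny, J = 3, 2, 2                          # hypothetical |X|, |Y| and number of features
rng = np.random.default_rng(1)

# Hypothetical sample counts for each (x, y); the last x is unobserved.
counts = np.array([[0.0, 2.0], [1.0, 1.0], [0.0, 0.0]])
row = counts.sum(axis=1, keepdims=True)
p_x = row.ravel() / counts.sum()             # empirical marginal on X
p_y_given_x = counts / np.maximum(row, 1.0)  # empirical conditional, zero for unseen x

f = rng.normal(size=(J, nx, ny))             # hypothetical features f[j, x, y]
f_tilde_x = (p_y_given_x * f).sum(axis=2)    # conditional expectations E(f_j | x), shape (J, nx)

F = f.reshape(J, nx * ny)                    # row j holds f_j over the flat index
F_tilde = np.repeat(f_tilde_x, ny, axis=1)   # row j holds f~_j, constant in y
p_x_vec = np.repeat(p_x, ny)                 # marginal indexed by (x, y)

def constraint(p_vec):
    """Left-hand side of the matrix form: (F - F~) diag(p_X) p."""
    return (F - F_tilde) @ np.diag(p_x_vec) @ p_vec

# An arbitrary positive measure: the matrix form matches the explicit sum.
p = rng.uniform(0.1, 1.0, size=(nx, ny))
sum_form = np.array([
    sum(p_x[a] * p[a, b] * (f[j, a, b] - f_tilde_x[j, a])
        for a in range(nx) for b in range(ny))
    for j in range(J)
])
matrix_form = constraint(p.ravel())

# The empirical conditional distribution satisfies the constraint.
residual_emp = constraint(p_y_given_x.ravel())
print(matrix_form, sum_form, residual_emp)
```

The residual for the empirical conditional is zero because, for every observed $x$, the expectation of $f_j - \tilde{f}_j$ under $\tilde{p}_{Y|X}(x, \cdot)$ vanishes, while unobserved $x$ carry zero weight.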

In this formulation, $\tilde{F}$ reflects an interaction between the features and the sample data.¹ This differs from the unconditional formulation typified by the classic maxent problem. There the goal is to satisfy a constraint of the form $A\mathbf{p} = \mathbf{b}$, where the data only enters the problem through $\mathbf{b}$. Here $F$ plays the role of $A$, while $\mathbf{p}$ plays the same role as before, and the middle term is purely data dependent. The first term $(F - \tilde{F})$, in effect, represents a data-dependent feature matrix.

The normalization constraint is also affected by the conditional setup. Instead of the unconditional normalization corresponding to $\mathbf{1}^{\top}\mathbf{p} = 1$, we have

$$\sum_{y \in \mathcal{Y}} p(y|x) = 1 \quad \forall x \in \mathcal{X}.$$

Following a similar line of reasoning, and relying on the natural index map (where $y$ values vary fastest), we can express this as

$$(I_{|\mathcal{X}|} \otimes \mathbf{1}^{\top}_{|\mathcal{Y}|})\,\mathbf{p} = \mathbf{1}.$$
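The Kronecker form of the normalization constraint can be verified directly. In this sketch the sizes and the measure `p` are hypothetical, and the flat vector is stored with $y$ varying fastest, as in the text.

```python
import numpy as np

nx, ny = 3, 2                                # hypothetical |X| and |Y|
rng = np.random.default_rng(2)

# An arbitrary positive measure q(y|x), and a row-normalized version p(y|x).
q = rng.uniform(0.1, 1.0, size=(nx, ny))
p = q / q.sum(axis=1, keepdims=True)

# I_{|X|} kron 1^T_{|Y|}: one sparse row per x, with ones over that x's y-block.
K = np.kron(np.eye(nx), np.ones((1, ny)))

lhs = K @ p.ravel()                          # equals 1 for every x when p is normalized
block_sums = K @ q.ravel()                   # in general, the sum over y for each x
print(K)
print(lhs)
```

Each row of `K` touches only the $|\mathcal{Y}|$ entries belonging to a single $x$, so applying it to the flat vector reproduces the per-$x$ sums over $y$.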

In the boosting problem the single normalization constraint may be replaced with many sparse constraints, one for each $x$. The choice of whether or not to impose this constraint leads to the distinction between AdaBoost (unnormalized) and maximum likelihood for exponential models. Lebanon and Lafferty point out that further model distinctions can be made based on the choices of $\mathcal{Y}$ and $f_j$. For example, letting $\mathcal{Y} = \{-1, 1\}$ leads to logistic regression in the normalized case and binary AdaBoost otherwise.

¹ Under the consistent data assumption, each value of $x_n$ is associated with only one value

5.4 Regression Models and Generalized Maxent
