Notes on the Bernoulli model - Multiple instance classification with ensemble classifiers


5.1.4 Notes on the Bernoulli model

Discriminative Bernoulli model. The Bernoulli model as stated above is a generative bag model, i.e. it provides the probability of the instance classes given the bag class, P(y | c). It is also possible to formulate a discriminative Bernoulli model that provides the probability of the bag class given the instance classes, P(c | y), as follows:

\[
\begin{array}{c|cc}
q_{\mathrm{Bern,disc}}(y, c, \gamma) & y = 0 & y = 1 \\ \hline
c = 0 & 1 - \gamma & 0 \\
c = 1 & \gamma & 1
\end{array}
\tag{5.30}
\]

In this case, the Bernoulli parameter γ describes the probability that a negative instance belongs to a positive bag.

The difference between the generative and the discriminative Bernoulli model is just the normalization. As we have seen in Section 4.1.2, however, the normalization has a large impact on the bag prior. For the discriminative Bernoulli model the bag prior reads

\[
P_{\mathrm{Bern,disc}}(c = 0) = \left( \frac{1 - \gamma}{2} \right)^{N} . \tag{5.31}
\]

The bag prior depends both on the Bernoulli parameter γ and on the bag size N, which makes implementing the discriminative Bernoulli model more complicated than the generative Bernoulli model. Since there is no clear advantage of the discriminative Bernoulli model over the generative one, we will not pursue the discriminative Bernoulli model further (see footnote 3).

Chapter 5 Alternative Bag Models for Multiple Instance Applications

Interpretation of the Bernoulli parameter. It is tempting to interpret the Bernoulli parameter β as the (observed or expected) ratio of positive instances in positive bags. This would indeed be the case if the instance classes were determined by the Bernoulli model alone, since β is defined as β = q_Bern(y = 1 | c = 1). However, the instance classes are in fact determined by both the Bernoulli model and the instance classifier:

\begin{align}
P_{\mathrm{Bern}}(y_n = 1 \mid c = 1) &= \frac{1}{Z} \cdot q_{\mathrm{Cl}}(y_n \mid x_n, \theta) \cdot q_{\mathrm{Bern}}(y_n \mid c) \tag{5.32} \\
&= \frac{p_n \beta}{p_n \beta + (1 - p_n)(1 - \beta)} . \tag{5.33}
\end{align}

Moreover, the instance classifier often has a “strong opinion” about the instance class, meaning that p_n is close to zero or close to one, while β is usually in a “moderate” range.

In other words, the instance classifier dominates the instance class prediction, while the Bernoulli model just “pushes the decision boundary a little”, so that more instances get classified as positive; the larger β, the stronger the push.

As described above, the MI model is equivalent to β = 1/2, but of course this does not imply that exactly half of the instances from positive bags must be positive, while the other half must be negative.

To pinpoint the issue, let us consider a bag that contains fewer positive than negative instances. According to the above-mentioned (faulty) interpretation, the Bernoulli parameter should be β < 1/2. But then the Bernoulli model would penalize positive instances in positive bags, which is obviously wrong.
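Equation (5.33) can be evaluated directly to see how strongly β moves the instance posterior. The following is a minimal Python sketch; the function name `bernoulli_posterior` and the sample values are illustrative assumptions and not part of the original method description.

```python
def bernoulli_posterior(p, beta):
    """Posterior P(y_n = 1 | c = 1) as given by Eq. (5.33).

    p    -- instance classifier output p_n for the positive class
    beta -- Bernoulli parameter
    (Both argument names are hypothetical, chosen for this sketch.)
    """
    numerator = p * beta
    normalizer = p * beta + (1.0 - p) * (1.0 - beta)  # Z in Eq. (5.32)
    return numerator / normalizer


# Compare a confident-negative, an undecided and a confident-positive instance.
for p in (0.05, 0.5, 0.95):
    values = [bernoulli_posterior(p, beta) for beta in (0.3, 0.5, 0.7)]
    print(p, [round(v, 3) for v in values])
```

For β = 1/2 the posterior equals p_n, which matches the equivalence with the MI model. For β > 1/2 the posterior lies above p_n and for β < 1/2 below it. A confident classifier output such as p_n = 0.05 stays small for a moderate β (about 0.11 for β = 0.7), which illustrates that the instance classifier dominates the prediction.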

Estimation of the Bernoulli parameter. It is not clear a priori which value of the Bernoulli parameter β yields the best result. It is therefore tempting to optimize β during training in order to find the best value automatically.

However, as we will discuss in this paragraph, this is not possible.

The reason can be seen by inspecting the parameter likelihood in Figure 5.2. Let us first assume that the result of the instance classifier p_n is fixed. For any instance with p_n < 1/2, the optimum is β = 0; for any instance with p_n > 1/2, the optimum is β = 1. So β always diverges to one of the extreme values.
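The divergence follows from the fact that the likelihood of a single instance is linear in β. As a short worked step, write L(β) (a symbol introduced here for illustration) for the likelihood whose logarithm is differentiated in Eq. (5.34):

```latex
\begin{align*}
L(\beta) &= \beta p_n + (1 - \beta)(1 - p_n) \\
         &= (1 - p_n) + \beta \, (2 p_n - 1), \\
\frac{\mathrm{d}L}{\mathrm{d}\beta} &= 2 p_n - 1 .
\end{align*}
```

The slope does not depend on β. On the interval [0, 1], L is therefore maximal at β = 1 if p_n > 1/2 and at β = 0 if p_n < 1/2; for p_n = 1/2 the likelihood is flat and β is undetermined.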

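If we have a bag with more than one instance, the situation is a bit more complex. There are well-behaved situations, e.g. [...]. The derivative of the log-likelihood of a single instance with respect to β is

\[
\frac{\mathrm{d}}{\mathrm{d}\beta} \log \bigl( \beta p + (1 - \beta)(1 - p) \bigr) = \frac{1}{\beta + \frac{1 - p}{2p - 1}} \tag{5.34}
\]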

Footnote 3: In effect, correcting for the unwanted bag prior of the discriminative Bernoulli model would lead back to the generative Bernoulli model.


At the optimum, the sum of derivatives must be zero. For two instances we obtain

\begin{align}
\beta_{\mathrm{opt}} + \frac{1 - p_1}{2 p_1 - 1} &= -\beta_{\mathrm{opt}} - \frac{1 - p_2}{2 p_2 - 1} \tag{5.35} \\
\beta_{\mathrm{opt}} &= \frac{1}{2} \left( \frac{p_1 - 1}{2 p_1 - 1} + \frac{p_2 - 1}{2 p_2 - 1} \right) \tag{5.36}
\end{align}

Equation (5.36) is plotted in Figure 5.6. For more than two instances, we expect a similar behavior: for many values of p_n, β_opt is degenerate (if all p_n > 1/2, then β_opt = 1; if all p_n < 1/2, then β_opt = 0). In the small non-degenerate band, β_opt is very sensitive to small changes of p_n.
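The degenerate cases and the sensitivity of the non-degenerate band can be reproduced numerically. The sketch below evaluates Eq. (5.36) for a bag with two instances; the function names are hypothetical, and restricting β to the interval [0, 1] by clipping is an assumption made here, not a step stated in the text.

```python
def bag_likelihood(beta, ps):
    """Likelihood of beta for one positive bag with fixed instance outputs ps."""
    result = 1.0
    for p in ps:
        result *= beta * p + (1.0 - beta) * (1.0 - p)
    return result


def beta_opt_two_instances(p1, p2):
    """Maximizer of the two-instance likelihood on [0, 1] (assumed range)."""
    s1 = 2.0 * p1 - 1.0  # slope of the first linear factor in beta
    s2 = 2.0 * p2 - 1.0  # slope of the second linear factor in beta
    if s1 * s2 < 0.0:
        # Opposite slopes: the product is a concave parabola in beta,
        # and Eq. (5.36) gives its stationary point.
        stationary = 0.5 * ((p1 - 1.0) / s1 + (p2 - 1.0) / s2)
        return min(1.0, max(0.0, stationary))
    # Equal signs (or a zero slope): the likelihood is monotone on [0, 1],
    # so the optimum sits at an endpoint (degenerate case).
    at_one = bag_likelihood(1.0, (p1, p2))
    at_zero = bag_likelihood(0.0, (p1, p2))
    return 1.0 if at_one >= at_zero else 0.0


for p2 in (0.55, 0.56, 0.57):
    print(p2, beta_opt_two_instances(0.45, p2))
```

With p_1 = 0.45 fixed, moving p_2 from 0.55 to 0.56 shifts the optimum from 0.5 to about 0.92, and at p_2 = 0.57 the stationary point already lies outside [0, 1], so the clipped optimum is 1. This is the sensitivity described above.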

Figure 5.6: Left: Value of the Bernoulli parameter β_opt that optimizes the likelihood for a bag with two instances with fixed instance class probabilities p_1, p_2. Right: In the white region, [...]

The second problem with the estimation of β is the interaction with the instance classifier. Even if β_opt has a reasonable value (between 0.5 and 0.9) for fixed p_n, we must also take into account that for β > 0.5, the likelihood increases with increasing p_n. Again, we analyze the situation for a bag containing a single instance: there is only one stationary point (at β = p = 1/2), and this is not a stable minimum, but a saddle point. The two degenerate minima are located at β = p = 1 and at β = p = 0.

\begin{align}
\frac{\mathrm{d}^2 \mathrm{NLL}}{\mathrm{d}\beta^2} &= \frac{1}{\left( \beta + \frac{1 - p}{2p - 1} \right)^2} \tag{5.37} \\
\frac{\mathrm{d}^2 \mathrm{NLL}}{\mathrm{d}\beta \, \mathrm{d}p} &= \frac{-1}{\bigl( \beta (2p - 1) + 1 - p \bigr)^2} \tag{5.38} \\
H(\beta = p = 1/2) &= \begin{pmatrix} 0 & -4 \\ -4 & 0 \end{pmatrix} \tag{5.39}
\end{align}

The Hessian (5.39) has the eigenvalues +4 and −4, i.e. it is indefinite, which confirms that the stationary point is a saddle point.
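The saddle point can be checked numerically. The following sketch evaluates the Hessian of the single-instance negative log-likelihood at β = p = 1/2 and computes its eigenvalues in closed form. All function names are illustrative assumptions. The second derivative with respect to p is not given in the text and is obtained here by the symmetry of the likelihood in β and p.

```python
import math


def nll(beta, p):
    """Negative log-likelihood of a single instance in a positive bag."""
    return -math.log(beta * p + (1.0 - beta) * (1.0 - p))


def hessian(beta, p):
    """Analytic Hessian of nll with respect to (beta, p)."""
    d = beta * (2.0 * p - 1.0) + 1.0 - p
    # Equivalent to Eq. (5.37), written without the division by (2p - 1).
    h_bb = (2.0 * p - 1.0) ** 2 / d ** 2
    h_bp = -1.0 / d ** 2  # Eq. (5.38)
    # Assumed by symmetry in beta and p; not stated in the text.
    h_pp = (2.0 * beta - 1.0) ** 2 / d ** 2
    return [[h_bb, h_bp], [h_bp, h_pp]]


def eigenvalues_symmetric_2x2(h):
    """Eigenvalues of a symmetric 2x2 matrix, smallest first."""
    mean = 0.5 * (h[0][0] + h[1][1])
    radius = math.sqrt((0.5 * (h[0][0] - h[1][1])) ** 2 + h[0][1] ** 2)
    return mean - radius, mean + radius


H = hessian(0.5, 0.5)
print(H, eigenvalues_symmetric_2x2(H))
print(nll(0.5, 0.5), nll(1.0, 1.0), nll(0.0, 0.0))
```

At β = p = 1/2 the eigenvalues are −4 and +4, so the point is neither a minimum nor a maximum. The negative log-likelihood there is log 2, whereas it drops to zero at both corners β = p = 1 and β = p = 0, which are the degenerate minima.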

To sum up: the Bernoulli parameter does not describe the ratio of positive instances in positive bags; it is a rather heuristic parameter that induces the algorithm to find more positive instances. How many more positive instances will be found depends on the details of the dataset and the instance classifier. β does not have a clear meaning.

In document Multiple Instance Learning with Random Forests and Applications in Industrial Optical Inspection (Page 78-81)