The first algorithm to solve the multiple instance learning problem was proposed by Dietterich et al. [39]. Here, the authors assume that the positive instances reside in a single axis-parallel rectangle (APR). The diverse density framework in [84, 133] assumes that positive instances form a Gaussian-like pattern around some concept points. Wang et al. [127] propose an approach based on nearest neighbors and on nearest neighbors of nearest neighbors. An SVM-based approach is presented in [48, 74], where the authors design kernel functions on bags instead of on instances. Mangasarian et al. [82] use a succession of linear programs to solve the MIL problem. Boosting-based approaches include, for example, Viola et al. [125]. In addition, some deterministic annealing approaches have been proposed [49, 76] that try to uncover the instance labels over several training iterations. Joulin and Bach [61] propose a convex formulation for estimating the latent instance labels. An interesting extension of the finite-sized MIL algorithms is proposed in [4], which learns from manifold bags of infinite size. Also noteworthy is the method proposed in [12], which tackles the runtime complexity of multiple instance learning algorithms using a fast bundle method. Settles et al. [108] apply the idea of active learning to MIL by querying the labels of instances in positive bags. Li et al. [78] report increased performance in object tracking if learning is performed from batches of bags.
A large group of existing MIL methods follows the large margin principle [2, 49, 31, 52]. These models are based on extensions of the support vector machine (SVM) optimization problem with positive identifiability and negative exclusion constraints. MI-SVM [2] adds, for each positive bag, a single constraint on the instance that has the largest discriminant value. mi-SVM [2] treats the instance labels in positive bags as latent variables to be learned from data. Although mi-SVM is the more flexible model, MI-SVM exhibits better prediction performance, since predicting the latent variables in mi-SVM so as to maximize the margin introduces a strong bias toward false positives. This inherent difficulty in regularizing the intra-bag distribution of instance labels has been pointed out by [52], and is addressed there by constraining the instances in positive bags to lie on a manifold perpendicular to the decision boundary.
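The "largest discriminant value" idea behind MI-SVM can be illustrated with a minimal sketch. It assumes a linear discriminant with hypothetical, hand-picked parameters `w` and `b`; it only scores a bag and does not reproduce the training procedure of [2].

```python
import numpy as np

def bag_score(bag, w, b):
    """Score a bag by its instance with the largest discriminant value.

    bag: array of shape (n_instances, n_features).
    w, b: hypothetical linear discriminant parameters (assumed, not learned here).
    Returns the maximal score and the index of the instance attaining it.
    """
    scores = bag @ w + b            # one discriminant value per instance
    idx = int(np.argmax(scores))    # the single instance that represents the bag
    return float(scores[idx]), idx

# Illustrative values only.
w = np.array([1.0, -1.0])
b = 0.0
bag = np.array([[0.0, 2.0], [3.0, 1.0], [1.0, 1.0]])
score, witness = bag_score(bag, w, b)
```

Only the selected instance influences the bag's score here, which is why a model of this kind ignores the remaining instances of a positive bag.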
by Kim and de la Torre [63], who introduce an extension of Gaussian process classification to the MIL setting. The model relies on a very similar principle to MI-SVM, namely describing positive bags by a single instance having the largest separation from the border. Apart from the fact that it enjoys a principled parameter tuning mechanism, this model shares all the weaknesses of MI-SVM, such as not being robust to false positives in positive bags, due to the fact that it ignores the information stored in all the instances in a positive bag but one. Raykar proposes a Bayesian formulation [103] to MIL that can effectively perform feature selection.
A third alternative approach is decision tree based MIL. Blockeel et al. [16] adapted decision trees to the MIL problem by introducing a priority queue into the tree construction process. Leistner et al. [76] propose a deterministic annealing procedure for uncovering the hidden instance labels over several forest training iterations. In this chapter, we introduce another MIL algorithm along this line, which is an extended version of the work of Blockeel et al. [16].
Recent exemplary applications of MIL include diabetic retinopathy screening [101], cancer detection from tissue images [129], visual saliency estimation [128], and content-based object detection and tracking [109, 5]. MIL is also useful in drug activity prediction, where each molecule constitutes a bag, each configuration of a molecule is an instance, and the binding of any of these configurations to the desired target yields a positive label, as first introduced by Dietterich et al. [39]. More recent applications of MIL to this problem include finding the interaction of proteins with Calmodulin molecules [11], and finding bioactive conformers [46].
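The bag labeling rule in the drug activity example can be written as a short sketch. The function name and the toy label lists are hypothetical; in practice the instance labels are not observed, so the rule describes how bag labels arise rather than how they are computed from data.

```python
def bag_label(instance_labels):
    """A bag (molecule) is positive if any instance (configuration) is positive."""
    return int(any(instance_labels))

# Illustrative values only: 1 means the configuration binds to the target.
binding_molecule = [0, 0, 1]      # one binding configuration suffices
non_binding_molecule = [0, 0, 0]  # no configuration binds

labels = [bag_label(binding_molecule), bag_label(non_binding_molecule)]
```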
6.3 Decision trees
We briefly review the decision tree algorithm already covered in Section 1.7. Let $\mathcal{D} = \{(x_1, y_1), (x_2, y_2), \cdots, (x_N, y_N)\}$ be a data set of $N$ instances, where
$x_i = [x_{i1}, x_{i2}, \cdots, x_{iD}]$ are $D$-dimensional vectors of observed instances and $y_i$ are the associated
labels. Suppose that $M(y_s)$ is a scalar measure based on the labels of an instance set $s$, denoted
by $y_s$. A decision tree is built by the following steps:
• For each observed value $x_{ij}$ of each feature $j$, divide the instances into two groups according to the split rule $f_j > x_{ij}$, where $f_j$ denotes the value an instance takes for feature $j$. Calculate a goodness measure for the split, $\theta_{ij} = M(y_{f_j > x_{ij}}) + M(y_{f_j \le x_{ij}})$.
• Create a node for the feature $\hat{j}$ that gives the highest $\theta_{ij}$, together with two child nodes, and assign all instances with $f_{\hat{j}} > x_{i\hat{j}}$ to one child node and the rest to the other.
• For each child node, repeat this process recursively on its assigned set of instances until every node contains instances belonging to a single class.
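The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: the measure $M$ is taken to be the negated Gini impurity so that a higher value means a purer group, the function names and the toy data are hypothetical, and a majority-vote leaf is added for the case in which no split separates the remaining instances.

```python
import numpy as np

def goodness(y):
    """Assumed measure M: negated Gini impurity (0 for a pure or empty set)."""
    if len(y) == 0:
        return 0.0
    _, counts = np.unique(y, return_counts=True)
    f = counts / counts.sum()
    return float(np.sum(f ** 2)) - 1.0

def best_split(X, y):
    """Try every observed value x_ij as a threshold; return the (feature, threshold)
    pair with the highest theta_ij, or None if no threshold separates the data."""
    best, best_theta = None, -np.inf
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            right = X[:, j] > t
            if right.all() or not right.any():
                continue  # skip splits that leave one side empty
            theta = goodness(y[right]) + goodness(y[~right])
            if theta > best_theta:
                best, best_theta = (j, float(t)), theta
    return best

def build_tree(X, y):
    """Split recursively until each node holds a single class."""
    if len(np.unique(y)) == 1:
        return {"leaf": y[0]}
    split = best_split(X, y)
    if split is None:  # identical instances with different labels: majority vote
        values, counts = np.unique(y, return_counts=True)
        return {"leaf": values[np.argmax(counts)]}
    j, t = split
    right = X[:, j] > t
    return {"feature": j, "threshold": t,
            "right": build_tree(X[right], y[right]),
            "left": build_tree(X[~right], y[~right])}

def predict(tree, x):
    """Follow the split rules from the root down to a leaf."""
    while "leaf" not in tree:
        tree = tree["right"] if x[tree["feature"]] > tree["threshold"] else tree["left"]
    return tree["leaf"]

# Illustrative values only: one feature, two classes.
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([0, 0, 1, 1])
tree = build_tree(X, y)
```

The sketch adds the two group measures without weighting, as in the formula for $\theta_{ij}$; practical implementations commonly weight each group by its size.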
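Let $f_c$ denote the ratio of instances in $y_s$ that belong to class $c$, and let $C$ be the number of classes. Two widely used examples of goodness measures are Gini Impurity:
$$I_G = 1 - \sum_{c=1}^{C} f_c^2$$
and Information Gain, computed from the entropy:
$$I_E = -\sum_{c=1}^{C} f_c \log f_c.$$
Both quantities are zero for a set containing a single class and grow as the classes become more mixed, so a better split corresponds to lower values.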
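As a small worked example, both quantities can be computed directly from the class ratios $f_c$. The function names are hypothetical, and the natural logarithm is an assumption, since the text does not fix the base.

```python
import math
from collections import Counter

def class_fractions(labels):
    """Ratio f_c of instances belonging to each class c present in the set."""
    n = len(labels)
    return [count / n for count in Counter(labels).values()]

def gini_impurity(labels):
    """I_G = 1 - sum_c f_c^2."""
    return 1.0 - sum(f * f for f in class_fractions(labels))

def entropy(labels):
    """I_E = -sum_c f_c log f_c, using the natural logarithm (assumed base)."""
    return -sum(f * math.log(f) for f in class_fractions(labels))

# Illustrative values only.
mixed = [0, 0, 1, 1]  # evenly mixed: Gini 0.5, entropy log(2)
pure = [1, 1, 1]      # single class: both measures are zero
```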
Gini impurity is widely used by the CART (Classification And Regression Tree) algorithm [20], and information gain is more often preferred with the C4.5 algorithm [102]. The outlined axis-orthogonal splitting procedure can be replaced by more complex splitting tests over multiple features at a time; see, for example, [88].