Boosting for learning from sparsely labeled data

So far we have focused on two paradigms for localizing objects in computer vision, i.e., object detection and segmentation. Common to both problems is the need for a classification model that distinguishes object from non-object image features. Training these classification models requires annotation that can be expensive to obtain for large datasets. For example, detailed object ground-truth annotation, usually a segmentation mask, is laborious to create manually. It would therefore be advantageous to relax the labeling requirements by assuming that only partial or coarse annotation is available.

More generally, classification is the problem of assigning a class (or label) to a new observation on the basis of a set of training data. The resulting model is commonly referred to as a classifier. A classification problem is supervised if the class membership of the observations in the training set is known, and unsupervised otherwise. We call the training data labeled or unlabeled, respectively. Due to the annotation availability issue discussed above, in this thesis we focus on the semi-supervised classification problem, where only partial class membership information in the training set is available. Semi-supervised classification studies the problem of using both labeled and unlabeled data to learn a classifier. As we will show in Chapter 6, in some practical applications, including object segmentation, semi-supervised classifiers achieve a level of performance comparable to fully supervised classifiers, and therefore either reduce the amount of required annotation or eliminate the need for detailed annotation.

There is a large body of literature on semi-supervised learning, and we refer readers to the recent book [31] for a comprehensive review. Generally, semi-supervised learning methods can be categorized as either transductive or inductive, based on the nature of inference. Transductive algorithms can only predict labels of data seen during training. Typical approaches include label propagation [238] and LLGC [236]. The goal of transductive learning is to predict labels for an observed but unlabeled transduction set, and the algorithm commonly makes use of the geometric properties of the data distribution. More specifically, many transductive learning algorithms are based on the manifold assumption, which assumes that the data lie on a low-dimensional manifold in a (high-dimensional) input feature space. The geometry of the data distribution can be captured by representing the dataset as a graph, with data points as vertices and pairwise similarities between data points as edge weights. Inductive methods, on the other hand, build a general decision rule over the input feature space and can therefore be used to predict the labels of data unseen during training. Examples of inductive methods include co-training [17] and semi-supervised SVM [12]. One of the most widely used ideas underlying these methods is the cluster assumption, which assumes that decision boundaries are more likely to pass through regions of the feature space with lower data density. It should be noted, however, that although the manifold assumption is inherently transductive, it can also be used to regularize decision boundaries in inductive methods. For example, manifold regularization [11] adds a data-dependent geometric regularization term to the objective function of a max-margin classifier (e.g., an SVM). Our work in this thesis belongs to the inductive category and is inspired by this manifold regularization idea. Specifically, our method is based on the manifold assumption in Laplacian Eigenmaps [10].
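
The graph representation described above can be illustrated with a small sketch. It is not the method developed in this thesis: the Gaussian similarity, the bandwidth `sigma`, the fully connected graph, and all function names are assumptions made for the example. The sketch builds a weight matrix W, forms the unnormalized graph Laplacian L = D − W, and evaluates the quadratic form f^T L f, which for symmetric W equals (1/2) Σ_ij W_ij (f_i − f_j)^2 and is the kind of data-dependent smoothness penalty that manifold regularization adds to a classifier's objective.

```python
import math

def gaussian_weights(points, sigma=1.0):
    """Pairwise similarities W[i][j] = exp(-||xi - xj||^2 / (2 sigma^2)); zero diagonal."""
    n = len(points)
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                d2 = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
                W[i][j] = math.exp(-d2 / (2.0 * sigma ** 2))
    return W

def laplacian(W):
    """Unnormalized graph Laplacian L = D - W, with D the diagonal degree matrix."""
    n = len(W)
    return [[(sum(W[i]) if i == j else 0.0) - W[i][j] for j in range(n)]
            for i in range(n)]

def smoothness(L, f):
    """Quadratic form f^T L f; small when similar points receive similar scores."""
    n = len(f)
    return sum(f[i] * L[i][j] * f[j] for i in range(n) for j in range(n))

# Toy data (hypothetical): two tight clusters on a line.
points = [(0.0,), (0.1,), (5.0,), (5.1,)]
L = laplacian(gaussian_weights(points))
smooth_scores = [1.0, 1.0, -1.0, -1.0]  # decision boundary in the low-density gap
rough_scores = [1.0, -1.0, 1.0, -1.0]   # decision boundary cuts through both clusters
```

On this toy data, `smoothness(L, smooth_scores)` is far smaller than `smoothness(L, rough_scores)`, so the penalty favors the scoring that keeps each cluster on one side of the boundary. Unlabeled points contribute to L exactly as labeled ones do, which is what makes the penalty usable in a semi-supervised setting.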

Many classification algorithms are commonly used in the computer vision literature. These include decision trees, ensemble learning (e.g., boosting and random forests), k-nearest neighbors, and SVMs, to name a few [15]. In our work, we choose the boosting classification framework and, more specifically, extend the margin distribution boosting (MDBoost) algorithm [182] to support semi-supervised learning based on manifold regularization. We choose the boosting framework because the max-margin nature of boosting algorithms makes it straightforward to introduce manifold regularization for semi-supervised learning and to obtain an inductive learning algorithm. More importantly, the geometry of the (labeled and unlabeled) data distributions can be assimilated into the margin-cost-based objective function. As a result, the algorithm can be trained efficiently and incrementally using column generation, thus retaining the stage-wise gradient descent training procedure. This is in contrast to methods such as the semi-supervised SVM [12], which involves solving a computationally expensive mixed integer program in the semi-supervised case.

Several works have extended supervised boosting algorithms to a semi-supervised setting. Semi-supervised MarginBoost [28] generalizes the margin concept to unlabeled data and minimizes a margin-based loss by functional gradient descent. Chen and Wang also minimize a margin-based loss and introduce additional local smoothness into the regularization in Regularized Boost [33]. SERBoost [175] aims to scale up to large datasets by using expectation regularization. In ASSEMBLE [13] and SemiBoost [131], the authors introduce the notion of pseudo-labels for unlabeled data and boost any supervised classifier by iteratively relabeling the unlabeled data. Unlike these existing approaches, the algorithm proposed in this thesis optimizes the margin distribution directly within a totally corrective framework, while coherently incorporating manifold regularization on both labeled and unlabeled data.

For completeness, we briefly review the AdaBoost and MDBoost algorithms below.

AdaBoost. AdaBoost is the first and most commonly used variant of boosting algorithms [207]. Mathematically, let $D_l = \{(x_i, y_i)\}_{i=1,\dots,M}$ be the training data set, where $x_i \in \mathcal{X}$ is the input feature vector and $y_i \in \{-1, +1\}$ is the output label. Given the training data, our goal is to train a classifier that assigns a binary label to any input vector $x$. In the setting of boosting methods, the classifier consists of a weighted combination of weak learners (classifiers).

More specifically, denote by $h(\cdot) \in \mathcal{H}$ a weak learner that maps an input vector $x$ to a binary output. We assume that we choose $K$ weak learners from the set $\mathcal{H}$ for our boosted classifier, and define a matrix $H \in \mathbb{Z}^{M \times K}$ that collects all the predictions of these weak learners on the training data. That is, $H_{ij} = h_j(x_i)$ is the label (in $\{+1, -1\}$) given by the weak learner $h_j(\cdot)$ on the training example $x_i$. We also use $H_{i:} := [H_{i1} \; H_{i2} \; \cdots \; H_{iK}]$ to denote the $i$-th row of $H$, which constitutes the output of all the weak learners on the training example $x_i$. Let $\alpha$ be the weight vector for the weak learners. We can write the output of the final classifier on any training example $x_i$ as $H_{i:}\alpha$, and the so-called (unnormalized) margin at $x_i$ is defined as $y_i H_{i:} \alpha$.
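
To make the notation concrete, the following sketch builds the prediction matrix H and the unnormalized margins for a toy problem. The threshold stumps, the data, and the weight vector are hypothetical values chosen for illustration; only the definitions of H and of the margin come from the text.

```python
def predictions_matrix(weak_learners, X):
    """H[i][j] = h_j(x_i), with every entry in {-1, +1}."""
    return [[h(x) for h in weak_learners] for x in X]

def margins(H, y, alpha):
    """Unnormalized margins rho_i = y_i * (H_i: alpha)."""
    return [yi * sum(hij * aj for hij, aj in zip(row, alpha))
            for row, yi in zip(H, y)]

# Hypothetical weak learners: threshold stumps on a scalar feature.
stumps = [
    lambda x: 1 if x > 0.0 else -1,
    lambda x: 1 if x > 1.0 else -1,
    lambda x: 1 if x > 2.0 else -1,
]
X = [-1.0, 0.5, 1.5, 3.0]   # M = 4 training examples
y = [-1, -1, 1, 1]          # labels in {-1, +1}
alpha = [0.2, 0.5, 0.3]     # K = 3 non-negative weights (assumed values)

H = predictions_matrix(stumps, X)
rho = margins(H, y, alpha)
```

A positive margin means the weighted vote agrees with the label; here every entry of `rho` is positive, so the combined classifier labels all four examples correctly even though the first stump alone misclassifies `x = 0.5`.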

AdaBoost can be viewed as a gradient descent procedure that minimizes the exponential classification error (or loss) function. The training procedure of AdaBoost is a greedy algorithm that constructs an additive combination of weak classifiers such that the following exponential loss is minimized [36]:

$$L(y, f(x)) = \exp(-y H(x)), \tag{2.27}$$

where

$$H(x) = \mathrm{sign}\left( \sum_{i=1}^{N} \alpha_i h_i(x) \right). \tag{2.28}$$

Here $\alpha_i$ is the weight coefficient for the $i$-th weak learner, and $N$ is the number of weak learners.
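
The greedy procedure can be sketched as follows. This is a minimal version of the standard discrete AdaBoost update, written for illustration only: the stump weak learners, the candidate pool, the toy data, and the helper names are assumptions, and the `1e-12` guard is an implementation convenience rather than part of the algorithm.

```python
import math

def stump(threshold, polarity):
    """Hypothetical weak learner on a scalar feature: `polarity` if x > threshold, else -polarity."""
    return lambda x: polarity if x > threshold else -polarity

def adaboost(X, y, candidates, rounds):
    """Discrete AdaBoost: greedily add weak learners to reduce the exponential loss."""
    m = len(X)
    w = [1.0 / m] * m          # distribution over training examples
    ensemble = []              # list of (alpha, weak learner) pairs

    for _ in range(rounds):
        def weighted_error(h):
            return sum(wi for wi, xi, yi in zip(w, X, y) if h(xi) != yi)

        h = min(candidates, key=weighted_error)
        err = weighted_error(h)
        if err >= 0.5:         # no candidate is better than chance
            break
        err = max(err, 1e-12)  # guard against log of infinity for a perfect learner
        alpha = 0.5 * math.log((1.0 - err) / err)
        ensemble.append((alpha, h))

        # Misclassified examples gain weight, correctly classified ones lose weight.
        w = [wi * math.exp(-alpha * yi * h(xi)) for wi, xi, yi in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def predict(ensemble, x):
    """Sign of the weighted vote, as in (2.28)."""
    score = sum(alpha * h(x) for alpha, h in ensemble)
    return 1 if score >= 0 else -1

# Toy data that no single stump can separate.
X = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [1, 1, -1, -1, 1, 1]
candidates = [stump(t, p) for t in (-0.5, 0.5, 1.5, 2.5, 3.5, 4.5) for p in (1, -1)]
ensemble = adaboost(X, y, candidates, rounds=3)
```

On this toy set the best single stump misclassifies two of the six examples, while the weighted vote of three stumps classifies all of them correctly. Note that each round only adds one new weight and never revisits earlier ones, which is the stage-wise behavior that totally corrective methods replace.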

Margin theory and MDBoost. One way of explaining the success of boosting lies in margin theory [178]. Several methods, such as LPBoost [39], adopt the minimum margin as an alternative learning criterion for boosting. Reyzin and Schapire [168] point out that the generalization performance of boosting algorithms may depend more on the margin distribution than on the minimum margin. Based on this observation, Shen and Li propose MDBoost, which achieves promising classification performance by directly maximizing the average margin and minimizing the margin variance [182].

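Specifically, let $\rho_i$ denote the unnormalized margin for the $i$-th example, i.e., $\rho_i = y_i H_{i:} \alpha$, $\forall i = 1, \dots, M$. The cost function and the learning problem in MDBoost can be written as follows:

$$\min_{\alpha} \; \frac{1}{2(M-1)} \sum_{i>j} (\rho_i - \rho_j)^2 - \sum_{i=1}^{M} \rho_i, \quad \text{s.t. } \alpha \succcurlyeq 0, \; \mathbf{1}^\top \alpha = D, \tag{2.29}$$

where $D$ is a regularization parameter. By defining a matrix $A \in \mathbb{R}^{M \times M}$, where

$$A = \begin{bmatrix} 1 & -\frac{1}{M-1} & \cdots & -\frac{1}{M-1} \\ -\frac{1}{M-1} & 1 & \cdots & -\frac{1}{M-1} \\ \vdots & \vdots & \ddots & \vdots \\ -\frac{1}{M-1} & -\frac{1}{M-1} & \cdots & 1 \end{bmatrix},$$

the optimization problem can be rewritten into the following form:

$$\min_{\alpha} \; \frac{1}{2} \rho^\top A \rho - \mathbf{1}^\top \rho, \quad \text{s.t. } \alpha \succcurlyeq 0, \; \mathbf{1}^\top \alpha = D, \; \rho_i = y_i H_{i:} \alpha, \; \forall i = 1, \dots, M. \tag{2.30}$$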
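
The equivalence of the pairwise form (2.29) and the matrix form (2.30) can be checked numerically. The sketch below is illustrative only: the margin values are made up and the function names are hypothetical. It also evaluates a third form that uses the identity Σ_{i>j} (ρ_i − ρ_j)^2 = M Σ_i (ρ_i − ρ̄)^2, which shows that the first term of the cost is M/2 times the unbiased sample variance of the margins.

```python
import statistics

def pairwise_cost(rho):
    """Objective of (2.29): scaled sum of squared pairwise differences minus the margin sum."""
    M = len(rho)
    pair_sum = sum((rho[i] - rho[j]) ** 2 for i in range(M) for j in range(i))
    return pair_sum / (2.0 * (M - 1)) - sum(rho)

def matrix_cost(rho):
    """Objective of (2.30): 0.5 * rho^T A rho - 1^T rho."""
    M = len(rho)
    # A has ones on the diagonal and -1/(M-1) everywhere else.
    A = [[1.0 if i == j else -1.0 / (M - 1) for j in range(M)] for i in range(M)]
    quad = sum(rho[i] * A[i][j] * rho[j] for i in range(M) for j in range(M))
    return 0.5 * quad - sum(rho)

rho = [1.0, 0.6, 0.4, 1.0]  # hypothetical margins
M = len(rho)
# statistics.variance uses the (M - 1) denominator, i.e. the unbiased sample variance.
variance_cost = 0.5 * M * statistics.variance(rho) - sum(rho)
```

All three quantities agree up to floating-point rounding, which makes the trade-off in the objective explicit: a small spread of the margins and a large total margin both lower the cost.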

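It has been shown [183] that the problem in (2.30) can be efficiently solved by considering its dual form, i.e.,

$$\min_{r, u} \; r + \frac{1}{2D} (u - \mathbf{1})^\top A^{-1} (u - \mathbf{1}), \quad \text{s.t. } \sum_{i=1}^{M} u_i y_i H_{i:} \preccurlyeq r \mathbf{1}^\top. \tag{2.31}$$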

The form of the dual problem allows us to incrementally search the solution space using the column generation technique. At each iteration, we obtain a new weak classifier by finding the most violated constraint:

$$h'(\cdot) = \operatorname*{argmax}_{h(\cdot)} \; \sum_{i=1}^{M} u_i y_i h(x_i). \tag{2.32}$$
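
The weak-learner search in (2.32) can be sketched as an exhaustive scan over a finite candidate pool. This is an illustration under stated assumptions: the stump parameterization, the pool, the toy data, and the dual variables `u` are hypothetical, and the sketch covers only the selection step, not the solution of the dual problem itself.

```python
def stump_predict(params, x):
    """Hypothetical weak learner: params = (threshold, polarity) on a scalar feature."""
    threshold, polarity = params
    return polarity if x > threshold else -polarity

def most_violated(candidates, X, y, u):
    """Return the candidate maximizing sum_i u_i * y_i * h(x_i), together with that score."""
    def score(params):
        return sum(ui * yi * stump_predict(params, xi) for ui, yi, xi in zip(u, y, X))

    best = max(candidates, key=score)
    return best, score(best)

X = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [1, 1, -1, -1, 1, 1]
u = [1 / 12, 1 / 12, 1 / 6, 1 / 6, 1 / 4, 1 / 4]  # assumed dual variables
candidates = [(t, p) for t in (-0.5, 0.5, 1.5, 2.5, 3.5, 4.5) for p in (1, -1)]

best, best_score = most_violated(candidates, X, y, u)
```

Since the constraint in (2.31) requires the score of every weak learner to be at most r, the selected learner is worth adding as a new column only when `best_score` exceeds the current value of r; otherwise no constraint is violated and the search can stop.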

While the MDBoost learning cost incorporates the margin variance information, the global variance can be restrictive and cannot describe the finer structure of the distribution beyond second-order statistics. In our work, we propose to use a “local” version of the variance that considers the geometric properties of the data manifold. More importantly, the idea of exploiting the geometric properties of the data distribution extends naturally to the semi-supervised learning setting. In Chapter 6, we propose the Semi-supervised Laplacian MDBoost algorithm, which addresses the above shortcomings of MDBoost. In addition, we apply the new semi-supervised learning algorithm to a number of object segmentation tasks to verify its efficacy.

In document Context-driven Object Detection and Segmentation with Auxiliary Information (Page 69-73)