AdaBoost learning of shape and color features for object recognition

2. Related Work

Sung and Poggio (Sung & Poggio, 1998) and Rowley et al. (Rowley et al., 1998) present early trainable systems in the face detection domain. The former assume a mixture of Gaussians for both the object and background classes, while the latter use a multilayer neural network. A number of methods follow with different learning algorithms. The major obstacle to building a generic object detection system on these methods lies in how they explore the training data. The performance of appearance-based methods, where all pixel values are used in classification, is likely to degrade when background is embedded in the object examples. In addition, the scalability of the learning techniques has to be examined, because the training data for object classes other than human faces can be large and/or high-dimensional. Finally, the “bootstrap” method of collecting negative examples (Sung & Poggio, 1998) is not easy to automate for generic object classes; for example, the accuracy required of the system in each bootstrapping round has to be set. Viola and Jones (Viola & Jones, 2004) present a fast object detection system that uses a cascade of classifiers. This type of classifier provides a viable approach to exploring negative examples. However, training an optimal classifier of this type is extremely difficult (Viola & Jones, 2004), so a heuristic approach is adopted. As a result, the generalization performance of a cascade of classifiers is not clear.

The works in (Burl & Perona, 1996; Ioffe & Forsyth, 2001; Agarwal & Roth, 2002) belong to the class of detection-by-part methods. In this approach the object parts are first detected, then grouped to form objects according to an explicit spatial relationship among parts. This approach is intuitive. However, under the presented formulation only translation of parts is dealt with. The difficult problem of learning the object model is addressed in (Weber et al., 2000; Ioffe & Forsyth, 2001; Agarwal & Roth, 2002) where the specific object class is handled.

Detection by part can be seen from a different perspective. Mohan et al. (Mohan et al., 2001) model the pedestrian object class by six components. The support vector machine learning method (Vapnik, 1998) is used to train a detector for each component and to train a combined classifier. The system shows robust detection even when partial occlusion occurs, which is a clear advantage of this approach. Their results also show that a combination of classifiers outperforms a single classifier. The major drawback is that the model is constructed manually. This problem is also present in their related work (Papageorgiou & Poggio, 2000), where a reduced subset of features is selected manually to improve the detection speed.

In summary, methods that assume a generative model are not suitable for a generic detection system, while distribution-free methods such as the support vector machine (SVM) (Vapnik, 1998) or the Sparse Network of Winnows (SNoW) (Yang et al., 2000) do not fully address the class imbalance problem in an automatic manner. In addition, appearance-based methods do not provide a viable solution to the problem where background is embedded in the object examples. The detection-by-part approach deals with this problem elegantly. However, the problem of learning a generic object model remains unsolved. Furthermore, current methods consider object parts at one scale and orientation only, and hence important discriminative features might not be used.

3. AdaBoost learning

Let us consider a standard two-class classification problem. Let there be a training set $\{(x_i, y_i)\}$ drawn from some fixed but unknown distribution $P(x, y)$ on $X \times Y$, where $X$ is the space of the data variable $x$ and $Y = \{-1, 1\}$ is the set of values of the class label $y$. In our context, $-1$ denotes the background class and $1$ denotes the object class. The task is to predict the label $y$ given $x$.

Among the various learning techniques, ensemble learning methods (Freund & Schapire, 1997; Breiman, 1998) are suited to our problem because they are efficient and robust with respect to training data while making no assumption about the underlying distribution. The fact that they work directly in the distribution space of the input data allows us to deal with the class imbalance problem in a simple manner. They are flexible in that prior knowledge can be incorporated via the class of base classifiers. This allows us to design a discriminative model combining both color and shape information.
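One simple way to act on the class imbalance through the example distribution is to start from weights that give each class the same total mass. The sketch below illustrates that idea only; the equal split and the name `balanced_initial_weights` are assumptions for illustration, not a scheme stated in this section.

```python
def balanced_initial_weights(labels):
    """Return weights D_1 in which each class holds total mass 0.5.

    Assumption of this sketch: labels take values in {-1, +1}
    (-1 = background, +1 = object) and both classes are present.
    """
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = sum(1 for y in labels if y == -1)
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")
    # Spread 0.5 uniformly over the examples of each class.
    return [0.5 / n_pos if y == 1 else 0.5 / n_neg for y in labels]


# One object example against four background examples.
labels = [1, -1, -1, -1, -1]
d1 = balanced_initial_weights(labels)
```

With such weights, a hypothesis that labels everything as background has a weighted error of 0.5 rather than 0.2, so it no longer looks attractive to the learner.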

In this paper we are interested in a class of ensemble methods which finds a sparse linear combination of base classifiers (Freund & Schapire, 1997). Specifically, suppose that there is a set of classifiers (weak hypotheses) $H = \{h_t : X \to Y\}$ and a learning algorithm (base learner) which returns a hypothesis $h_t \in H$ for any distribution over the inputs. The number of classifiers in $H$ could be infinite. A classifier ensemble is constructed by iteratively calling the base learner with an appropriate distribution, depending on the empirical performance of the hypotheses learned in the previous steps.
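To make the notion of a base learner concrete, the following sketch uses threshold classifiers on a single feature (decision stumps) as the hypothesis set. This choice is an assumption for illustration; it is not the color and shape base classifiers of this paper, and `learn_stump` is a hypothetical name. The point it shows is the interface: given the examples and a distribution over them, return the hypothesis with the smallest weighted error.

```python
def learn_stump(xs, ys, weights):
    """Hypothetical base learner over decision stumps.

    A stump predicts `polarity` when x[feature] > threshold and
    `-polarity` otherwise. Returns (weighted_error, hypothesis).
    """
    best = None
    for feature in range(len(xs[0])):
        # Candidate thresholds: the observed values of this feature.
        for threshold in sorted({x[feature] for x in xs}):
            for polarity in (1, -1):
                # Weighted error: total weight of misclassified examples.
                err = sum(
                    w for x, y, w in zip(xs, ys, weights)
                    if (polarity if x[feature] > threshold else -polarity) != y
                )
                if best is None or err < best[0]:
                    best = (err, feature, threshold, polarity)
    err, feature, threshold, polarity = best
    return err, (lambda x: polarity if x[feature] > threshold else -polarity)


# Toy data: the first feature separates the classes, the second does not.
xs = [(0.1, 5.0), (0.4, 1.0), (0.6, 4.0), (0.9, 2.0)]
ys = [-1, -1, 1, 1]
err, h = learn_stump(xs, ys, [0.25] * 4)
```

Because the error is computed under the supplied weights, calling the same learner with a different distribution can return a different hypothesis, which is what the iterative construction relies on.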

The AdaBoost algorithm (Freund & Schapire, 1997) is a powerful ensemble learning method. Empirical studies in (Breiman, 1998) show that the generalization performance of the AdaBoost algorithm is similar to or slightly better than that of related ensemble methods. We choose the AdaBoost algorithm because of its ease of implementation. A summary of the algorithm is as follows.

The AdaBoost Algorithm

Input: $N$ examples $\{(x_i, y_i)\}$ and an initial distribution represented by a set of weights $D_1(i)$ over the examples.

Do for $t = 1, \ldots, T$

1. Learn a hypothesis $h_t \in H$ from the training examples with distribution $D_t$.

2. Calculate the empirical error of $h_t$:

$$\epsilon_t = \Pr_{i \sim D_t}\left[h_t(x_i) \neq y_i\right] \qquad (1)$$

3. Set $\alpha_t = \frac{1}{2} \ln \frac{1 - \epsilon_t}{\epsilon_t}$.

4. Update

$$D_{t+1}(i) = \frac{D_t(i)\, e^{-\alpha_t y_i h_t(x_i)}}{Q_t}$$

where $Q_t$ is a normalization factor.

Output: The final classifier

$$f_H(x) = \operatorname{sign}\left(g_H(x)\right) \qquad (2)$$

where

$$g_H(x) = \sum_{t=1}^{T} \alpha_t h_t(x) \qquad (3)$$

$g_H(x)$ might be used to indicate the confidence of the classification.
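The algorithm summary can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in this paper: the names `adaboost`, `confidence`, `classify` and `pool_learner` are hypothetical, the base learner picks from a small fixed pool of one-dimensional threshold classifiers, the loop stops early when the weighted error is 0 or at least 0.5 (a case the summary does not cover), and sign(0) is taken as +1.

```python
import math


def adaboost(xs, ys, base_learner, rounds, weights=None):
    """Return a list of (alpha_t, h_t) pairs built over `rounds` iterations."""
    n = len(xs)
    d = list(weights) if weights is not None else [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        h = base_learner(xs, ys, d)                       # step 1
        eps = sum(w for x, y, w in zip(xs, ys, d) if h(x) != y)  # step 2, Eq. (1)
        if eps <= 0.0 or eps >= 0.5:
            # Sketch-specific choice: alpha_t would be infinite or non-positive.
            if eps <= 0.0 and not ensemble:
                ensemble.append((1.0, h))
            break
        alpha = 0.5 * math.log((1.0 - eps) / eps)         # step 3
        ensemble.append((alpha, h))
        d = [w * math.exp(-alpha * y * h(x)) for x, y, w in zip(xs, ys, d)]
        q = sum(d)                                        # normalization factor Q_t
        d = [w / q for w in d]                            # step 4
    return ensemble


def confidence(ensemble, x):
    """g_H(x) of Eq. (3)."""
    return sum(alpha * h(x) for alpha, h in ensemble)


def classify(ensemble, x):
    """f_H(x) of Eq. (2); sign(0) is mapped to +1 in this sketch."""
    return 1 if confidence(ensemble, x) >= 0 else -1


def pool_learner(xs, ys, d):
    """Hypothetical base learner: best threshold classifier from a fixed pool."""
    pool = [(lambda x, t=t, s=s: s if x > t else -s)
            for t in (0.5, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5) for s in (1, -1)]
    return min(pool, key=lambda h: sum(w for x, y, w in zip(xs, ys, d) if h(x) != y))


# Toy 1-D data that no single threshold classifier can fit.
xs = [1, 2, 3, 4, 5, 6]
ys = [1, 1, -1, -1, 1, 1]
ensemble = adaboost(xs, ys, pool_learner, rounds=3)
predictions = [classify(ensemble, x) for x in xs]
```

As a check on steps 3 and 4: an error of 0.2 gives a coefficient of ½ ln 4 ≈ 0.693, so before normalization the weights of misclassified examples are doubled and those of correctly classified examples are halved. On the toy data above, every single threshold classifier misclassifies at least two points, while the three-round combination fits all six.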

In document Workshop on Machine Learning Techniques for Processing Multimedia Content (Page 64-66)