

2.3.2 AdaBoost

Among the algorithms used for classification, AdaBoost [Freund and Schapire, 1997] is a weighted voting algorithm that selects weak classifiers and combines them into a strong classifier. A weak learner is any statistical classifier that performs better than random guessing. Initially, every training example is assigned an equal weight. If an example is classified incorrectly, its weight is increased in the next iteration, forcing the weak learner to focus on the hard examples.

AdaBoost has been used successfully to handle large amounts of data [Morra et al., 2008] [Heisele et al., 2003], a task that is difficult for many otherwise robust classifiers. It has been reported to provide fast and accurate solutions in face recognition and other areas [Viola and Jones, 2002] [Papageorgiou and Poggio, 1999] [Savio et al., 2009]. The most basic

2.3. Machine learning 55

zero [Freund and Schapire, 1997]. It has been shown that AdaBoost can boost a weak learning algorithm into a strong learning algorithm with an arbitrarily low error rate.

Initially, a weight $w_1(i) = 1/N$ ($N$ is the size of the dataset) is assigned to each data point $x_i$. A weak learner $h_j$ ($j = 1..K$) consists of one of the features, a threshold and a Boolean function. At each iteration $t$ ($t = 1..T$) in the training phase, AdaBoost finds the weak learner $h_t(x)$ with the lowest error $\epsilon_t$ compared to the labels $y_i$,

$$\epsilon_t = \min_{j=1..K} \sum_{i=1}^{N} w_t(i)\,|h_j(x_i) - y_i|$$

and assigns coefficient $\alpha_t$ to it. The lower the error $\epsilon_t$, the greater the coefficient $\alpha_t$:

$$\alpha_t = \frac{1}{2} \log\frac{1 - \epsilon_t}{\epsilon_t}$$
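The selection step and the coefficient can be illustrated with a minimal sketch. It assumes labels and weak learner outputs in {-1, +1} (to agree with the exponential weight update) and measures the error as the total weight of the misclassified samples. The names `stump_predict`, `best_stump` and `alpha_from_error`, as well as the toy data, are hypothetical and not taken from the original work.

```python
import math

def stump_predict(x, feature, threshold, polarity):
    """Weak learner: one feature, one threshold, one Boolean direction.
    Returns a label in {-1, +1}."""
    return polarity if x[feature] > threshold else -polarity

def best_stump(X, y, w, stumps):
    """Return (weighted error, stump) for the candidate with the lowest
    weighted error under the current weights w."""
    best = None
    for stump in stumps:
        # Weighted error: total weight of the samples this stump gets wrong.
        err = sum(wi for xi, yi, wi in zip(X, y, w)
                  if stump_predict(xi, *stump) != yi)
        if best is None or err < best[0]:
            best = (err, stump)
    return best

def alpha_from_error(eps):
    """alpha_t = 1/2 * log((1 - eps_t) / eps_t); grows as the error shrinks."""
    return 0.5 * math.log((1.0 - eps) / eps)

# Hypothetical toy data: four 1-D samples with labels in {-1, +1}.
X = [(1.0,), (2.0,), (3.0,), (4.0,)]
y = [-1, -1, 1, -1]
w = [0.25] * 4                                    # w_1(i) = 1/N
stumps = [(0, 1.5, 1), (0, 2.5, 1), (0, 3.5, 1)]  # (feature, threshold, polarity)

eps, stump = best_stump(X, y, w, stumps)
alpha = alpha_from_error(eps)
print(stump, eps, round(alpha, 4))
```

On this toy set the stump with threshold 2.5 misclassifies only the last sample, so its weighted error is 0.25 and it is the one selected.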

The real power of AdaBoost is that after each iteration $t$ the weights $w_t(i)$ of misclassified data points are increased. AdaBoost can therefore focus on the examples with high weights, which appear difficult for the weak learners to classify.

$$w_{t+1}(i) = \frac{w_t(i)\,\exp(-\alpha_t y_i h_t(x_i))}{2\sqrt{\epsilon_t(1 - \epsilon_t)}}$$
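As a rough illustration of this update, the sketch below applies it to one hypothetical round. It assumes labels and weak learner outputs in {-1, +1} and weights that sum to one, in which case the denominator acts as the normalisation factor. The function name `update_weights` and the toy values are assumptions made for the example.

```python
import math

def update_weights(w, y, h, alpha, eps):
    """w_{t+1}(i) = w_t(i) * exp(-alpha * y_i * h_i) / (2 * sqrt(eps * (1 - eps)))."""
    z = 2.0 * math.sqrt(eps * (1.0 - eps))   # normalisation factor
    return [wi * math.exp(-alpha * yi * hi) / z
            for wi, yi, hi in zip(w, y, h)]

# Hypothetical round: four samples, the weak learner gets the last one wrong.
w = [0.25, 0.25, 0.25, 0.25]
y = [-1, -1, 1, -1]          # true labels y_i
h = [-1, -1, 1, 1]           # weak learner outputs h_t(x_i)

eps = sum(wi for wi, yi, hi in zip(w, y, h) if yi != hi)   # weighted error
alpha = 0.5 * math.log((1.0 - eps) / eps)
w_next = update_weights(w, y, h, alpha, eps)
print([round(v, 4) for v in w_next])
```

In this example the updated weights still sum to one, and the single misclassified sample now carries half of the total weight, so the next weak learner is pushed towards classifying it correctly.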

This procedure is repeated for $T$ iterations. Finally, AdaBoost uses a weighted vote to combine all selected weak learners into a single final classifier, and the final hypothesis is:

$$H(x) = \begin{cases} 1, & \dfrac{1}{\exp\left(-\sum_{t=1}^{T} \alpha_t h_t(x)\right) + 1} > 0.5 \\[2ex] 0, & \dfrac{1}{\exp\left(-\sum_{t=1}^{T} \alpha_t h_t(x)\right) + 1} < 0.5 \end{cases}$$
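The final vote can be sketched as follows, assuming the weak learners output values in {-1, +1} and that their coefficients are already known. The function name `strong_classify` and the coefficient values are hypothetical. The formula leaves the case of a confidence of exactly 0.5 undefined; this sketch returns 0 there, which is an assumption.

```python
import math

def strong_classify(alphas, weak_outputs):
    """Return 1 if the logistic of the weighted vote exceeds 0.5, else 0."""
    score = sum(a * h for a, h in zip(alphas, weak_outputs))
    confidence = 1.0 / (math.exp(-score) + 1.0)
    # Assumption: a confidence of exactly 0.5 is mapped to class 0.
    return 1 if confidence > 0.5 else 0

alphas = [0.6, 0.3, 0.2]      # hypothetical coefficients alpha_t

# One strong weak learner outvotes two weaker ones: 0.6 - 0.3 - 0.2 > 0.
label_a = strong_classify(alphas, [1, -1, -1])
# The reverse pattern gives a negative weighted vote.
label_b = strong_classify(alphas, [-1, 1, 1])
print(label_a, label_b)
```

Since the logistic function exceeds 0.5 exactly when its argument is positive, the decision depends only on the sign of the weighted vote; the logistic value can additionally be read as a confidence rating.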

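The convergence of the training process is important. Freund and Schapire [Freund and Schapire, 1997] show that the training error of the final hypothesis has an upper bound:

$$\prod_t \left[ 2\sqrt{\epsilon_t(1 - \epsilon_t)} \right] = \prod_t \sqrt{1 - 4\gamma_t^2} \le \exp\left( -2 \sum_t \gamma_t^2 \right)$$

where $\gamma_t = 1/2 - \epsilon_t$ measures how much better than random the weak hypothesis of iteration $t$ is.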
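The two steps in this bound can be spelled out as a short worked derivation. It assumes the usual definition $\gamma_t = 1/2 - \epsilon_t$ and the elementary inequality $1 - x \le e^{-x}$; the numerical values at the end are an illustrative choice, not results from the original work.

```latex
% Step 1: rewrite the per-round factor in terms of the edge gamma_t.
\epsilon_t (1 - \epsilon_t)
  = \left(\tfrac{1}{2} - \gamma_t\right)\left(\tfrac{1}{2} + \gamma_t\right)
  = \tfrac{1}{4} - \gamma_t^2
\quad\Longrightarrow\quad
2\sqrt{\epsilon_t (1 - \epsilon_t)} = \sqrt{1 - 4\gamma_t^2}

% Step 2: apply 1 - x <= exp(-x) with x = 4 gamma_t^2, then take the product.
\sqrt{1 - 4\gamma_t^2} \le \sqrt{\exp(-4\gamma_t^2)} = \exp(-2\gamma_t^2)
\quad\Longrightarrow\quad
\prod_t \sqrt{1 - 4\gamma_t^2} \le \exp\Bigl(-2 \sum_t \gamma_t^2\Bigr)

% Illustration: epsilon_t = 0.4 in every round, so gamma_t = 0.1, and T = 100.
\prod_{t=1}^{100} \sqrt{0.96} = 0.96^{50} \approx 0.130
\;\le\; \exp(-2 \cdot 100 \cdot 0.01) = e^{-2} \approx 0.135
```

Even weak hypotheses that are only ten percentage points better than random therefore drive the bound on the training error down geometrically with the number of rounds.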

Therefore, if each weak hypothesis is slightly better than random, the training error drops exponentially fast. When the learned AdaBoost classifier is used to classify unseen samples, the generalization error is shown by Freund and Schapire [Freund and Schapire, 1997] to be at most:

$$\hat{\Pr}\left[\mathrm{margin}(x, y) \le \theta\right] + \tilde{O}\left( \sqrt{\frac{d}{m\theta^2}} \right)$$

for $\theta > 0$ with high probability, where $m$ is the number of training samples and $d$ is the VC dimension of the weak hypothesis space. The bounds on training and generalisation error given above show that AdaBoost can efficiently assemble weak learners into a strong classifier.
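The first term of the generalisation bound is the fraction of training samples whose margin does not exceed θ. A minimal sketch of how it could be computed is given below. The text does not define the margin, so the sketch assumes the commonly used normalised form, the label times the weighted vote divided by the sum of the coefficients, with labels and weak outputs in {-1, +1}. All names and values are hypothetical.

```python
def margin(alphas, weak_outputs, label):
    """Assumed normalised margin: y * sum(alpha_t * h_t(x)) / sum(alpha_t).
    Lies in [-1, 1]; positive when the weighted vote agrees with the label."""
    vote = sum(a * h for a, h in zip(alphas, weak_outputs))
    return label * vote / sum(alphas)

def empirical_margin_error(margins, theta):
    """Fraction of training samples whose margin is at most theta."""
    return sum(1 for m in margins if m <= theta) / len(margins)

alphas = [0.5, 0.3, 0.2]          # hypothetical coefficients alpha_t
samples = [                        # (weak learner outputs, true label)
    ([1, 1, 1], 1),                # all weak learners correct
    ([1, -1, 1], 1),               # one weak learner wrong
    ([-1, 1, 1], 1),               # the weighted vote is tied
    ([1, 1, -1], -1),              # the weighted vote is wrong
]

margins = [margin(alphas, h, y) for h, y in samples]
rate = empirical_margin_error(margins, theta=0.1)
print(rate)
```

Two of the four toy samples have a margin of at most 0.1, so the empirical term is 0.5 here. A larger θ makes this term larger while shrinking the complexity term, which is the trade-off expressed by the bound.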

In practice, AdaBoost is simple to implement as it has very few parameters to tune. In addition, no prior knowledge is required for learning, and the most representative features are selected during learning. AdaBoost comes with theoretical guarantees on the convergence of the training error and on the generalization error. It is also reported that AdaBoost usually does not overfit, even after training for thousands of iterations.

AdaBoost has been tested and used in many applications and, along with its applications in different areas, various versions of AdaBoost have been proposed. These variants differ in the update scheme of the weights or use other classifiers as the weak learner. However, the performance of AdaBoost depends on the data and the weak classifiers. The consistency of the training set is crucial for the quality of the classification. Insufficient data or overly complex weak hypotheses can cause AdaBoost to fail. Boosting also seems sensitive to noise. Another drawback of AdaBoost is that training is quite time-consuming: at each iteration, all features are tested on all training samples, which makes the training time proportional to both the number of features and the size of the training set. The features, the number of features and the training set need to be well defined to achieve good classification within a sensible training time. Haar-like features are calculated to effectively encode domain information to represent each sample.


[Figure: trees 1 to N, each built from nodes and terminal nodes; the sample receives votes 1 to N, which are combined by majority vote.]

Figure 2.16 – A random forest of N trees. Color filled terminal nodes give predictions for two classes. When a test sample runs down each individual tree, marked in green, majority votes of terminal nodes are taken as the final prediction for this sample.

sequences, which integrates the motion and the static local appearance features and generates accurate boundary criteria via AdaBoost. Furthermore, through boosting error analysis, the posterior probability density function of the shape model is calculated based on the confidence ratings. A principal component analysis (PCA) subspace of shape is also used to constrain the shape variation and lower the dimensionality.