Learning Filters for Object Recognition with Linear Support Vector Machine

(1)

Procedia

Engineering

Procedia Engineering 00 (2011) 000–000 www.elsevier.com/locate/procedia

Advanced in Control Engineering and Information Science

Learning filters for object recognition with linear support

vector machine

Jiabao Wang*, Jianjiang Lu, Yafei Zhang, Weiguang Xu

Institute of Command Automation, PLA University of Science and Technology, Nanjing 210007, China

Abstract

This paper is contributed to object recognition with linear Support Vector Machine (SVM), from which we can learn the filters for recognizing the particular object categories. We compare two different SVM solvers for the primal and dual problems, and present a new fast training method, called as Partial Gradient Descent (PGD), to estimate the filters from the large-scale dataset. Our learning approach is directly proposed for the primal problem not for the dual, but it has great generalization performance and training efficiency than the popular dual optimization methods when we just want to find an approximate solution. Experiment proves the success of our learning method, especially in the large-scale truth dataset.

Selection and/or peer-review under responsibility of [CEIS 2011] Keywords: Object recognition; Partial gradient descent; Linear SVM

1. Introduction

Object recognition is a significant challenge in computer vision. In this paper, we concentrate on the problem of learning discriminative models (filters), which are used for detecting and recognizing objects in static images. Here, the key problem of object recognition system is how to learn the discriminative object models or filters from a large-scale dataset. As for this problem, many predecessors have explored it. A soft linear SVM was used to learn a pedestrian model by Dalal and Triggs based on the INRIA

* Corresponding author.

E-mail address: [email protected].

Open access under CC BY-NC-ND license.

(2)

Person dataset [1]. A latent variable based non-convex linear SVM also was employed to learn the discriminative, multi-scale and deformable part models for detecting objects in the PASCAL VOC benchmarks [2,3]. Above two linear SVMs showed their powerful discriminative capability and achieved significant breakthrough in object detection. However, the above two linear SVMs were solved by using different methods, from the dual and primal problems respectively. As we all know, the most SVM solvers were to optimize the dual problem because of the less complexity and various existing tools. In this paper, we proposed a new fast training method, called as Partial Gradient Descent (PGD), to learn the filters from the primal problem based on the work in [3]. The difference between ours and [3] is that the selected instances have different influence on updating the parameters. The correctly classified example with higher response should be retrained fewer times in the next epochs.

The vast majority of literatures on SVM problem concentrated on the dual quadratic programming (QP) optimization problem [4,5,6]. The early SVM algorithms were based on direction search and modified gradient projection algorithms. Recently, decomposition methods were designed to overcome the cost of the full kernel matrix. Sequential Minimal Optimization (SMO) was another widely popular methodology for solving dual QP problem [4]. It works by finding a pair of instances that violate the KKT conditions and taking an optimization step along the feasible direction. Besides, Collobert et al. showed that non-convexity can provide scalability advantages over non-convexity [5], and Ertekin et al. proposed a non-convex online SVM by using the Ramp Loss function to suppress the number of the support vectors (SVs) for generalization performance [6].

However, in this paper, we pay more attention to the solvers of the primal SVM problem, because it is easier to get the filters directly from the primal optimization than from the dual. Besides, the gap between the primal and the dual exists when the duality is weak. Primal optimizations for linear SVMs and non-linear cases are studied by Keerthi et al. [7] and Chapelle [8] respectively. Chapelle have proved that the result from the primal is the same as the dual but superior when the goal is to find an approximate solution [8]. So we solve the primal problem based on the stochastic gradient descent method.

In the next section, we present the general formation of linear SVM in the primal and the dual, and then introduce our optimization method for the primal linear SVM in Section 4. Section 5 presents our experiments on the INRIA Person dataset.

2. Primal and Dual Problem

Here, we consider two class classification problems. Given a training dataset D={( , )}x yi i ni=1, where d

i

x R∈ is the feature vector of the i-th instance, is the label of the instance and n is the number of the instances in the dataset, we want to learn a classification rule of the form: y ∈ − +i { 1, 1}

( ) T

f xθ =w x b+ (1)

where , is the weight, which means the normal vector of the classification hyper-plane, is the offset, and the sign of θ=( , )w b w f x( ) represents the predicted classification of x. b

Using the classification rule (1), we can obtain the optimum hyper-plane of the traditional soft margin SVM by minimizing the following primal objective function:

1 1 1 min ( ) ( ( )) 2 n T i i i J w w C H y f xθ θ θ = +

∑

₌ (2) where H z =1( ) max(0,1−z) is the Hinge Loss and the constant is the misclassification penalty. C

In practice, the quadratic programming problem in (2) is often solved by optimizing the dual problem: 1 1 1 1 max ( ) ( ) 2 n n n T i i i j i j i j i G y x α α =

∑

₌ α −

∑ ∑

= =α α x (3)

(3)

1 min(0, ) max(0, ), . . 1 0, i i i n i i Cy Cy s t α i α = ≤ ≤ ≤ ≤ = ⎧⎪ ⎨ ⎪⎩

∑

n x (4) Above formulation has a slight deviation from the standard representation, because the coefficients can take on negative values by inheriting the signs of the labels .y_i

α

i

After solving the dual optimization problem, the parameter w in (2) can be represented as a linear combination of the vectors in the training set:

1 n

i i i

w=

∑

₌α (5) Note that we only discuss the linear SVM in this section without involving non-linear classification problem, because the parameter w is not in the original input space when it mapped into feature space. Therefore, it is hard to obtain the high dimension feature weight vector by (5) when instance x is mapped into the kernel space.

3. Primal Optimization

In this paper, we resort to the stochastic gradient descent approach from the primal optimization without maintaining large number of SVs. Besides, we search hard negative samples because there are a very large number of negative examples in training set. Our work is similar to the [3], in which all the correctly classified samples were removed from the cache for a fixed time, but here we adopt a different cache removing operation. We set different intervals for the correctly classified examples according to their predicted scores. We call this strategy as Partial Gradient Descent due to the distinct treatment of the examples. It works as follows:

Algorithm 1: Partial Gradient Descent for Primal SVM Initialize

_θ

0₌₀.

Repeat

1. Select an instance ( , )x y_i _i from the training dataset randomly. 2. Compute the predicting value of the instance yˆ_i x_i: yˆ_i= f x_θt( )_i . 3. Update the parameter:

_θ

t 1

_θ

t t

t

μ θ

+ _{= −} _{, where}

t

μ

is the learning rate. 4. If y y <i iˆ 1, then

θ θ

= +

μ

tnCy xi i.

5. Else set a waiting time for x_i according to .yˆ_i

Until _Jt+1_{( )}

_θ

₋_Jt_{( )}

_θ

_<

_ε

.

According to [9], we set the learning rate

μ

t =1t for the linear SVM and shrink for each random selected instance (Step 3). When the instance i

θ

x is misclassified, we add a scalar multiple of xi to the

parameter (Step 4). Otherwise, we set a waiting time for the correctly classified instance according to the predicting value (Step 5). This means the instance will not take part in the computing until the waiting time expired. This searching procedure decreases the objective function and converges to a local minimum or saddle point in finite iterations.

θ

ˆi y ( t J

θ

)

Note that if we set the waiting time for the correctly classified instance xi based on , it means the

larger is, the more its waiting time is. Thus, it should wait more time to take part into the optimization process again. This strategy can produce a fast convergence of the objective function and consume less time for training.

ˆi

y

ˆi

(4)

4. Experiment

4.1. Dataset and Features

The PGD algorithm is evaluated for object recognition on the INRIA Person datasets. The positive images are in normalized 64×128 pixel format and each corresponds to one positive example, and the negative examples are random selected from the negative images. Note that the number of the negative examples is much more than that of the negative images. After a preliminary object filter is trained, we employed it as object detector to search exhaustively over a scale-space pyramid for “hard” examples, and then retrained the filter with the “hard” examples. All the experiments in this section were performed by the PGD learning approach and the SVM Light software, which is a solver of the dual optimization problem. We used a soft linear SVM classifier to obtain the filters by (5) in the SVM Light.

In our experiments, the extended HOG features in [3] are employed. It includes 9 contrast-insensitive orientations and 18 contrast-sensitive orientations, and 4 dimensions capturing the overall gradient energy in square block. The first 27-dimensional orientation feature vector is produced by the analytic projection of the 36-dimensional contrast-insensitive feature vector and the 72-dimensional contrast-sensitive.

4.2. Results

Our learning method is compared with SVM Light, and trained the filters with random and retrained strategies. So there are four kinds of learning models: PGD random, SVM Light random, PGD retrained and SVM Light retrained. The precision and the efficiency are evaluated in our experiments. Table 1 shows the training time and test error by the four kind of learning models. It is obviously that our training method use much less time to learn the filters, and gain better performance than the SVM Light. Note that in Table 1, the train time only indicates the optimization process cost, not including data pre-processing.

Table 1. Train time and test error

PGD

random SVM Light random retrained PGD SVM Light retrained

Train time (secs) 53.33 369.97 52.78 289.77

Test error (%) 5.10 8.83 5.09 7.34 1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0.7 0.75 0.8 0.85 0.9 0.95 Recall P re ci si on

Class: INRIA Person

PGD Retrained(0.9491) SVMLight Retrained(0.9266) PGD Random(0.9490) SVMLight Random(0.9117) 400 1 2 0 50 100 150 200 250 300 350 Time s ( se cs ) PGD SVMLight Random Retrained

Fig. 1. Precision/Recall curves for filters trained by four different methods on the INRIA Person dataset. We also present the average precision in the label box.

Fig. 2. Comparing training time using different methods. Note that the time is only the optimization procession, not include data pre-processing.

(5)

0 20 40 60 80 100 120 140 160 180 200 3.5 4 4.5 5 5.5 6 Iterations O bj ec tiv e

Class: INRIA Person

PGD Retrained PGD Random

Fig. 4. Filters learnt from the INRIA Person dataset using: (a) PGD random, (b) PGD retrained, (c) SVM Light random, (d) SVM Light retrained.

Fig. 3. The objective function with respect to the number of iterations.

(c) (d)

(b) (a)

The performance of our approach and the SVM Light are evaluated by the precision/recall curves, which could give more intuitive and sensitive evaluation than the ROC curve. The principal quantitative measure is the average precision. We performed our comparison on the above four models and the results are showed in Fig. 1, which clearly shows that our PGD learning method could achieve great precision and better generalization performance than the SVM Light.

Another significant performance factor is the efficiency, including the train time and the convergence speed. The train time consumed on the four learning models is given in Fig. 2. It is obvious that our PGD computing cost is much less than the SVM Light. Besides, we test the convergence of our learning method, and the results are presented in Fig. 3. It is notable that our PGD method can quickly converge to a local minimum or saddle point.

5. Conclusion

The novel learning method, Partial Gradient Descent (PGD), has great performance and efficiency to estimate the filters for object recognition in large-scale truth dataset. Experiments proved that our primal SVM solver is more efficient than the dual one when we only want to get an estimate of the optimization. It is amazing to learn filters using PGD, especially when the negative samples are tremendous.

References

[1] Dalal N, Triggs B. Histograms of oriented gradients for human detection. In CVPR; 2005.

[2] Felzenszwalb P, McAllester D, Ramaman D. A discriminatively trained, multiscale, deformable part model. In CVPR; 2008. [3] Felzenszwalb P, Girshick R, McAllester D, Ramanan D. Object detection with discriminatively trained part based models,”

PAMI; 2010, 32(9):1627-1645.

[4] Platt J. Fast training of support vector machines using sequential minimal optimization. In Advances in Kernel Methods:

Support Vector Learning; 1998.

[5] Collobert R, Sinz F, Weston J, Bottou L. Trading convexity for scalability. In ICML; 2006.

[6] Ertekin S, Bottou L, Giles C. Nonconvex online support vector machines. PAMI; 2011, 33(2):368-381.

[7] Keerthi S, DeCoste D. A modified finite newton method for fast solution of large scale linear svms. JMLR; 2005, 6:341-361. [8] Chapelle O. Training a support vector machine in the primal. Neural Computation; 2007, 19(5):1155-1178.