• No results found

Nonparallel Hyperplanes Support Vector Machine for Multi-class Classification

N/A
N/A
Protected

Academic year: 2021

Share "Nonparallel Hyperplanes Support Vector Machine for Multi-class Classification"

Copied!
9
0
0

Loading.... (view fulltext now)

Full text

(1)

Nonparallel Hyperplanes Support Vector Machine for

Multi-class Classification

Xuchan Ju

1,2

, Yingjie Tian

2,3

, Dalian Liu

4

, and Zhiquan Qi

2,3

1 School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing, China 2 Research Center on Fictitious Economy & Data Science, CAS, Beijing, China

[email protected]

3 Key Laboratory of Big Data Mining and Knowledge management, CAS, Beijing, China

tyj,[email protected]

4 Department of Basic Course Teaching, Beijing Union University, Beijing, China

[email protected]

Abstract

In this paper, we proposed a nonparallel hyperplanes classifier for multi-class classification, termed as NHCMC. This method inherits the idea of multiple birth support vector ma-chine(MBSVM), that is the ”max” decision criterion instead of the ”min” one, but it has the incomparable advantages than MBSVM. First, the optimization problems in NHCMC can be solved efficiently by sequential minimization optimization (SMO) without needing to com-pute the large inverses matrices before training as SVMs usually do; Second, kernel trick can be applied directly to NHCMC, which is superior to existing MBSVM. Experimental results on lots of data sets show the efficiency of our method in multi-class classification accuracy.

Keywords: multi-class classification, support vector machine, nonparallel, quadratic programming, kernel function

1

Introduction

Support vector machines(SVMs) are proposed by Vapnik and his co-workers[1, 18, 19]for classi-fication, regression, or other problems. SVMs depend on the principle of maximum margin, dual theory and kernel trick to be so successful. The well-known algorithm sequential minimization optimization (SMO) can be applied and solve the problem efficiently.

Recently, twin support vector machine(TWSVM)[4, 9] which works faster than conventional SVC has been successfully proposed. It seeks two nonparallel proximal hyperplanes such that each hyperplane is closest to one of two classes and as far as possible from the other class. TWSVMs have been studied extensively[11, 8, 10, 14].

Corresponding author

Volume 51, 2015, Pages 1574–1582

ICCS 2015 International Conference On Computational Science

1574 Selection and peer-review under responsibility of the Scientific Programme Committee of ICCS 2015 c

(2)

Although TWSVMs only solve two smaller QPPs, they have to compute the inverse of matrices, it is in practice intractable or even impossible for a large data set by the classical methods. Moreover, for the nonlinear case, TWSVMs consider the kernel-generated surfaces instead of hyperplanes and construct extra two different primal problems, i.e., they have to solve two problems for linear case and two other problems for nonlinear case separately. In order to avoid these drawbacks above, some nonparallel hyperplanes classifiers has been proposed[6, 16, 15].

For the multi-class classification problem, a new multi-class classifier, termed as multiple birth support vector machine (MBSVM), has been proposed[20]. In this classifier, it seeks K hyperplanes by solving k quadratic programming problems(QPPs). The QPPs are developed as the k-th class are as far as the k-th hyperplane while the rest points are proximal to the k-th hyperplane. A new point is assigned to the class label depending on which of the K hyperplanes it lies farthest to. Based on this trick, the size of each QPP in MBSVM has much lower complexity and faster than the existing multi-class SVMs. However, these QPPs have to compute the inverse of matrices and they have to solve different problems for linear case and nonlinear case.

In this paper, we develop nonparallel hyperplanes classifier for multi-class classification with this trick while it can avoid drawbacks in MBSVM. Moreover, SMO can be applied and solve the problem efficiently.

The paper is organized as follows. Section 2 briefly dwells on the standard C-SVC, TWSVM and NHC. Section 3 proposes our NHCMC. Section 4 deals with experimental results and Section 5 contains concluding remarks.

2

Background

In this section, we introduce the standard C-SVC, TWSVM and NHC briefly.

2.1

C

-SVC

Consider the binary classification problem with the training set

T ={(x1, y1),· · ·,(xl, yl)} ∈(Rn× Y)l, (1) where xi ∈Rn, yi ∈ Y ={1,−1}, i = 1,· · · , l, standard C-SVC formulates the problem as a convex quadratic programming problem (QPP)

min w,b,ξ 1 2w 2+Cl i=1 ξi, s. t. yi((w·xi) +b)1−ξi, i= 1,· · ·, l, ξi0, i= 1,· · · , l, (2)

(3)

where ξ= (ξ1,· · ·, ξl), andC > 0 is a penalty parameter. For this primal problem,C-SVC solves its Lagrangian dual problem

min α 1 2 l i=1 l j=1 αiαjyiyjK(xi, xj) l i=1 αi, s. t. l i=1 yiαi= 0, 0αiC, i= 1,· · · , l, (3)

where K(x, x) is the kernel function, which is also a convex QPP and then constructs the decision function.

2.2

TWSVM

Consider the binary classification problem with the training set

T ={(x1,+1),· · · ,(xp,+1),(xp+1,−1),· · ·,(xp+q,−1)}, (4) wherexi Rn, i = 1,· · ·, p+q. Let A = (x1,· · ·, xp)T Rp×n B = (xp+1,· · · , xp+q)T Rq×nand l = p+q. For linear classification problem, TWSVM seeks a pair of nonparallel

hyperplanes

(w+·x) +b+= 0 and (w−·x) +b−= 0. (5)

by solving two smaller QPPS min w+,b+,ξ− 1 2(Aw++e+b+) T(Aw ++e+b+) +c1eT−ξ−, s.t. (Bw++eb+) +ξe, ξ0, (6) and min w−,b−,ξ+ 1 2(Bw+e−b−) T(Bw +e−b−) +c2eT+ξ+, s.t. (Aw+e+b) +ξ+e+, ξ+0, (7)

whereci, i= 1,2 are penalty parameters. e+ ande are vectors of ones of appropriate dimen-sions.For nonlinear classification problem, two kernel-generated surfaces instead of hyperplanes are considered and two other primal problems are constructed.

2.3

NHC

For training set(4), NHC constructs two QPPs as follows. min w+,b+,ξ− 1 2c3(w+ 2+b2 +) + 1 2η T +η++c1eT−ξ−, s.t. Aw++e+b+=η+, (Bw++eb+) +ξe, ξ0, (8)

(4)

and min w−,b−,ξ+ 1 2c4(w 2+b2 ) +12ηT−η−+c2eT+ξ+, s.t. Bw+eb=η, (Aw+e+b−) +ξ+e+, ξ+0, (9)

where ci, i = 1,2,3,4 are penalty parameters. e+ and e are vectors of ones of appropriate dimensions. Their dual forms can be written as follows.

max λ,α 1 2(λ TαT) ˆQ(λT αT)T+c 3eT−α, s.t. 0αc1e, (10) where ˆ Q= AAT+c 3I ABT BAT BBT +E, (11) and max θ,γ 1 2(θ TγT) ˜Q(θTγT)T+c 4eT+γ, s.t. 0γc2e+, (12) where ˜ Q= BBT+c 4I BAT ABT AAT +E, (13)

and I is an identity matrix ofq×q, E is a matrix of l×l with all entries equal to one. For nonlinear classification problem, kernel function can be added directly.

3

Nonparallel hyperplanes classifier for multi-class

classi-fication(NHCMC)

Consider the multiple classification problem with the training set:

T ={(x1, y1),· · · ,(xl, yl)}, (14) wherexi∈Rn, i= 1,· · · , l, yi∈ {1,· · ·, K}is the corresponding pattern ofxi.

For multiple classification, we seekK nonparallel hyperplanes:

(wk·x) +bk = 0, k= 1,· · ·, K (15) For convenience, we denote the number of each class of the training set (14) as lk and the points belonging tok-th class asAk∈Rlk×n,k= 1,· · ·, K. Besides, we define the matrix

Bk= [A1,· · · , Ak−1, Ak+1,· · ·, Ak] (16)

(5)

0 0.5 1 1.5 2 2.5 3 3.5 0 0.5 1 1.5 2 2.5 l1 l2 l3

Figure 1: A toy example learned by the linear NHCMC

3.1

Linear case

3.1.1 The Primal Problem

We seek to constructknonparallel hyperplanes (15) by solving the following convex quadratic programming problems(QPPs): min wk,bk,ηk,ξk 1 2C1wk 2+1 2η kηk+C2ek2ξk, s.t. Bkwk+ek1bk =ηk, (Akwk+ek2bk) +ξk ek2 ξk 0, (17)

where ηk ∈R(l−lk)is a variable and ξk is a slack variable. ek1∈R(l−lk)and ek2∈Rlk are the vectors of ones. C10 andC20 are penalty parameter.

In order to illustrate the primal problem of NHCMC, we generated an artificial two-dimensional three-class dataset. The geometric interpretation of above problem with x∈ R2 is shown in Figure 1,where minimizes the sum of the squared distance from the hyperplanes of K−1 classes, that is all classes except for those of thek−th class, and the points of thek−th class are far from the its hyperplane. Take the ”*” class in Figure 1 as an example. We hope the hyperplane of the ”*” classl1is far from the ”*” points and closest the ”+” and triangular points. In order to minimize misclassification, the points of the k−th class are at distance 1 from the hyperplane, and we minimize the sum of error variables with soft margin lost. The differences between multiple birth support vector machine(MBSVM)[20] and NHCMC are that we introduce a regularization term to implement structural risk minimization(SRM) principle and a variable is introduced to make a term of objective function to be constraints. These changes have many positive effects on original NHCMC.

(6)

3.1.2 The Dual Problem

In order to get the solutions of problem (17), we need to derive its dual problem. The Lagrangian of the problem (17) is given by

L(wk, bk, ηk, ξk, α, β, λ) =1 2C1wk 2+1 2η kηk +C2ek2ξk+λ(Bkwk+ek1bk−ηk) −α(A kwk+ek2bk+ξk−ek2)−βξk, (18)

whereα= (α1,· · ·, αlk),β = (β1,· · · , βlk),λ= (λ1,· · · , λllk)are the Lagrange multiplier vectors. The Karush-Kuhn-Tucker (KKT) conditions[13] forwk, bk, ηk, ξk andα, β, λare given by ∇wkL=C1wk+Bkλ−Akα= 0, (19) ∇bkL=ek1λ−ek2α= 0, (20) ∇ηkL=ηk−λ= 0, (21) ∇ξkL=C2ek2−α−β= 0, (22) Bkwk+ek1bk=ηk, (23) (Akwk+ek2bk) +ξkek2, ξk0 (24) α((A kwk+ek2bk) +ξk) = 0, βξk= 0 (25) α0, β0 (26)

Sinceβ0, from (22) we have

0αC2ek2. (27)

And from (19), we have

wk =C1 1(B

kλ−Akα), (28)

Then putting (28) and (21) into the Lagrangian and using (19)(26), we obtain the dual problem of problem (17) min ˆ π 1 2πˆ Λˆˆπ+ ˆκπ,ˆ s.t.e k1λ−ek2α= 0, ˆ CπCˆ2, (29) where ˆ π = (λ, α), (30) ˆ κ = (0,−C1ek2), (31) ˆ C1 = (−∞ek1,0), (32) ˆ C2 = (+∞ek1, C2ek2), (33) ˆ Λ = ˆ Q1 Qˆ2 Qˆ 2 Qˆ3 (34)

(7)

ˆ Q1 = BkBk +C1I, (35) ˆ Q2 = BkAk, (36) ˆ Q3 = AkAk, (37)

whereI is an identity matrix.

After getting the solution of the problem (29), we can obtain wk and b∗k with (19) and (23). A new pointx∈Rn is assigned to classk(k∈1,· · ·, K), depending on which of theK hyperplanes given by (15) it lies farthest to. The decision function is defined as

f(x) = arg max k=1,···,K |(w∗ k·x) +b∗k| ||w∗ k|| , (38) where| · |is the prependicular distance of pointxfrom the hyperplane (wk·x) +bk = 0, k = 1,· · ·, K.

3.2

Nonlinear case

Now we extend the linear NHCMC to the nonlinear case. Totally different with the existing MBSVM, we do not need to consider the extra kernel-generated surfaces since only inner prod-ucts appear in the dual problems (29), so the kernel functions can be applied directly in the problems and the linear NHCMC is easily extended to the nonlinear multiple classifier.

min ˆ π 1 2ˆπ Λˆˆπ+ ˆκπ,ˆ s.t.e k1λ−ek2α= 0, ˆ C1πˆCˆ2, (39) where ˆ π = (λ, α), (40) ˆ κ = (0,−C1ek2), (41) ˆ C1 = (−∞ek1,0), (42) ˆ C2 = (+∞ek1, C2ek2), (43) ˆ Λ = ˆ Q1 Qˆ2 Qˆ 2 Qˆ3 (44) ˆ Q1 = K(Bk, Bk) +C1I, (45) ˆ Q2 = K(Bk, Ak), (46) ˆ Q3 = K(Ak, Ak), (47)

whereIis an identity matrix andK(·,·) is the kernel function. The corresponding conclusions are similar with the linear case except that the inner product (x·x) is taken instead of the

(8)

4

Numerical experiments

In this section, in order to validate the performance of our NHCMC, we compare it with one-versus-rest(l-v-r) method[12], one-versus-one(l-v-l) method[17], directed acyclic graph SVM(DAGSVM)[5], the multi-class SVM(M-SVM)[3], crammer and singer(C&S) method[7] and MBSVM[20] on several publicly available benchmark datasets, some of which are used in [20]. NHCMC is implemented in MATLAB 2010 on a PC with an Intel Core I5 processor and 2 GB RAM.

All samples were scaled such that the features locate in [0,1] before training. For the first seven datasets, the RBF kernelK(x, x) = exp(−x−σx2) is applied and for the latter three datasets, the rectangular kernel[2] K(x, ET) with E typically of size 10% of E is used. The parameters are tuned for best classification accuracy in the range 28 to 212. Table 1 lists the average tenfold cross-validation results of these methods to the first six datasets and the average fourfold cross-validation results of these methods to the latter four datasets in term of accuracy. The results of the former six methods come from [20].

5

Conclusion

In this paper, we have proposed a nonparallel hyperplanes classifier for multiple classification. The idea of NHCMC approximates to MBSVM, that is the ”max” decision criterion instead of the ”min” one, but NHCMC has incomparable advantages. First, we do not need to compute the large inverses matrices before training which is inevitable in MBSVM, and furthermore the optimization problems in NHCMC can be solved efficiently by SMO. Second, totally different with MBSVM, kernel trick can be applied directly to NHC. Experimental results show the efficiency of our method in classification accuracy. Therefore this novel method could be a promising approach and worth of further study in this field.

Acknowledgment

This work has been partially supported by grants from National Natural Science Foundation of China (No.11271361, No.61472390, No.71331005), Major International (Regional) Joint Re-search Project (No.71110107026), “New Start” Academic ReRe-search Project of Beijing Union University (No. ZK10201409) and the Ministry of water resources special funds for scientific research on public causes (No.201301094).

References

[1] C. Cortes and V. N. Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, 1995. [2] Fung G and Mangasarian OL. Proximal support vector machine classifiers. In: 7th International

proceedings on knowledge discovery and data mining, pages 77–86, 2001.

[3] Weston J and Watkins C. Multi-class support vector machines. CSD-TR-98-04 royal holloway, University of London, Egham, UK, 1998.

[4] Jayadeva, R.Khemchandani, and S.Chandra. Twin support vector machines for pattern classifica-tion. IEEE Transaction on Pattern Analysis and Machine Intelligence, 2007.

[5] Platt JC, Cristianini N, and Shawe-Taylor J. Large margin dags for multiclass classification.Adv Neural Inf Process Syst, 12:547–553, 2000.

(9)

Datasets l-v-r l-v-l DAGSVM M-SVM C&S MBSVM NHCMC

acc(%) acc(%) acc(%) acc(%) acc(%) acc(%) acc(%)

Iris 99.67 97.33 96.67 97.33 97.33 98.00 98.45 Wine 98.88 99.44 98.88 98.88 98.88 98.24 98.62 Glass 71.96 71.50 73.83 71.03 71.96 69.52 74.55 Vowel 98.49 99.05 98.67 98.49 98.67 98.79 99.71 Vehicle 87.47 86.64 86.05 86.99 86.76 86.07 86.91 Segment 97.53 97.40 97.36 97.58 97.32 96.02 98.43 dna 95.78 95.45 95.45 95.62 95.87 96.04 97.25 Satimage 91.70 91.30 91.25 91.25 92.35 90.55 91.60 Letter 97.88 97.98 97.98 97.76 97.68 93.28 98.07 Shuttle 99.91 99.92 99.92 99.91 99.94 99.95 100.00

Table 1: Average results of the benchmark datasets

[6] Xu Chan Ju and Ying Jie Tian. Efficient implementation of nonparallel hyperplanes classifier.

International Conferences on Web Intelligence, pages 5–9, 2012.

[7] Crammer K and Singer Y. On the algorithmic implementation of multiclass kernel-based vector machines. J Mach Learn Res, 2:265–292, 2002.

[8] R. Khemchandani, R.K. Jayadeva, and S. Chandra. Optimal kernel selection in twin support vector machines. Pattern Recognit. Lett., 3:1:77–88, 2009.

[9] M. A. Kumar and M. Gopal. Least squares twin support vector machines for pattern classication.

Expert Systems withApplication, May 2007.

[10] M.A. Kumar and M. Gopal. Least squares twin support vector machines for pattern classification.

Expert Syst. Appl., 36:4:7535–7543, May 2009.

[11] M.A. Kumar and M. Gopal. Application of smoothing technique on twin support vector machines.

Pattern Recognit. Lett., 29:13:1842–1848, Oct. 2008.

[12] Bottou L, Cortes C, Denker JS, Drucher H, Guyon I, Jackel LD, LeCun Y, M¨uuller UA, Sackinger E, Simard P, and Vapnik V. Comparison of classifier methods: a case study in handwriting digit recognition. IAPR (eds) Proceedings of the international conference on pattern recognition, IEEE Computer Society Press, pages 77–82, 1994.

[13] O.L. Mangasarian. Nonlinear programming. Philadelphia, PA:SIAM, 1994.

[14] X. Peng. Tsvr: An efficient twin support vector machine for regression.Neural Networks, 23:3:365– 372, 2010.

[15] Yingjie Tian, Xuchan Ju, Zhiquan Qi, and Yong Shi. Improved twin support vector machine.

SCIENCE CHINA Mathematics, 2:417–432, 2014.

[16] Yingjie Tian, Zhiquan Qi, Xuchan Ju, Yong Shi, and Xiaohui Liu. Nonparallel support vector ma-chines for pattern classification. IEEE TRANSACTIONS ON CYBERNETICS, DOI 10.1109/T-CYB.2013.2279167, 2013.

[17] Krebel U. Pairwise classification and support vector machines. In: Scholkopf B, Burges CJC, Smola AJ (eds) Advances in Kernel methods: support vector learning, MIT Press, Cambridge, MA, pages 255–268, 1999.

[18] V. N. Vapnik. The Nature of Statistical Learning Theory. New York: Springer, 1996. [19] V. N. Vapnik. Statistical Learning Theory. New York: John Wiley and Sons, 1998.

[20] Zhi-Xia Yang, Yuan-Hai Shao, and Xiang-Sun Zhang. Multiple birth support vector machine for multi-class classification. Neural Computing and Applications, 22:Issue 1 Supplement: 153–161, 2013.

Figure

Figure 1: A toy example learned by the linear NHCMC
Table 1: Average results of the benchmark datasets

References

Related documents

Against this backdrop, we investigate household over-indebtedness and its associated financial vulnerability based on an analysis of credit bureau (CB) data, which contain the

To comply with the County Council’s Corporate Minimum Standards and Government guidance on Best Value, a Review Team and Project Team were established comprising representatives

Figure 57 shows the change in the average CT-number of the entire core sample as a function of time for all the experiments performed with organic rich shale using carbon

The indirect correlation between growth rate and stocking density has been studied and different parameters as fish age, size, water quality, various items related to nutrition

-I thought I had to have my hands on everything…I have learned to delegate…Delegating will help get the Assistant AD ready for the next level…this allows me to do some

The Echoes project is developing a virtual reality environment to investigate social skills development in children, particularly those with autism. ‘Some children

Criteria fulfillment shows few strong points but the areas for improvement predominate over the strong points. 1 Not satisfied – areas for

archive all such documents, materials, and transcripts or recordings of statements received, in a manner that will ensure their preservation and accessibility to the public and