Nonparallel Hyperplanes Support Vector Machine for Multi-class Classification

(1)

Nonparallel Hyperplanes Support Vector Machine for

Multi-class Classiﬁcation

Xuchan Ju

1,2

_{, Yingjie Tian}

2,3∗

_{, Dalian Liu}

4

_{, and Zhiquan Qi}

2,3

1 _{School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing, China} 2 _{Research Center on Fictitious Economy & Data Science, CAS, Beijing, China}

[email protected]

3 _{Key Laboratory of Big Data Mining and Knowledge management, CAS, Beijing, China}

tyj,[email protected]

4 _{Department of Basic Course Teaching, Beijing Union University, Beijing, China}

[email protected]

Abstract

In this paper, we proposed a nonparallel hyperplanes classifier for multi-class classification, termed as NHCMC. This method inherits the idea of multiple birth support vector ma-chine(MBSVM), that is the ”max” decision criterion instead of the ”min” one, but it has the incomparable advantages than MBSVM. First, the optimization problems in NHCMC can be solved efficiently by sequential minimization optimization (SMO) without needing to com-pute the large inverses matrices before training as SVMs usually do; Second, kernel trick can be applied directly to NHCMC, which is superior to existing MBSVM. Experimental results on lots of data sets show the efficiency of our method in multi-class classification accuracy.

Keywords: multi-class classiﬁcation, support vector machine, nonparallel, quadratic programming, kernel function

1 Introduction

Support vector machines(SVMs) are proposed by Vapnik and his co-workers[1, 18, 19]for classi-ﬁcation, regression, or other problems. SVMs depend on the principle of maximum margin, dual theory and kernel trick to be so successful. The well-known algorithm sequential minimization optimization (SMO) can be applied and solve the problem eﬃciently.

Recently, twin support vector machine(TWSVM)[4, 9] which works faster than conventional SVC has been successfully proposed. It seeks two nonparallel proximal hyperplanes such that each hyperplane is closest to one of two classes and as far as possible from the other class. TWSVMs have been studied extensively[11, 8, 10, 14].

∗_{Corresponding author}

Volume 51, 2015, Pages 1574–1582

ICCS 2015 International Conference On Computational Science

1574 Selection and peer-review under responsibility of the Scientiﬁc Programme Committee of ICCS 2015 c

(2)

Although TWSVMs only solve two smaller QPPs, they have to compute the inverse of matrices, it is in practice intractable or even impossible for a large data set by the classical methods. Moreover, for the nonlinear case, TWSVMs consider the kernel-generated surfaces instead of hyperplanes and construct extra two diﬀerent primal problems, i.e., they have to solve two problems for linear case and two other problems for nonlinear case separately. In order to avoid these drawbacks above, some nonparallel hyperplanes classiﬁers has been proposed[6, 16, 15].

For the multi-class classification problem, a new multi-class classifier, termed as multiple birth support vector machine (MBSVM), has been proposed[20]. In this classifier, it seeks K hyperplanes by solving k quadratic programming problems(QPPs). The QPPs are developed as the k-th class are as far as the k-th hyperplane while the rest points are proximal to the k-th hyperplane. A new point is assigned to the class label depending on which of the K hyperplanes it lies farthest to. Based on this trick, the size of each QPP in MBSVM has much lower complexity and faster than the existing multi-class SVMs. However, these QPPs have to compute the inverse of matrices and they have to solve different problems for linear case and nonlinear case.

In this paper, we develop nonparallel hyperplanes classifier for multi-class classification with this trick while it can avoid drawbacks in MBSVM. Moreover, SMO can be applied and solve the problem efficiently.

The paper is organized as follows. Section 2 brieﬂy dwells on the standard C-SVC, TWSVM and NHC. Section 3 proposes our NHCMC. Section 4 deals with experimental results and Section 5 contains concluding remarks.

2 Background

In this section, we introduce the standard C-SVC, TWSVM and NHC brieﬂy.

2.1 C

-SVC

Consider the binary classiﬁcation problem with the training set

T ={(x₁, y₁),· · ·,(x_l, y_l)} ∈(Rn× Y)l, (1) where x_i ∈Rn, y_i ∈ Y ={1,−1}, i = 1,· · · , l, standard C-SVC formulates the problem as a convex quadratic programming problem (QPP)

min w,b,ξ 1 2w 2₊_Cl i=1 ξi, s. t. y_i((w·x_i) +b)1−ξ_i, i= 1,· · ·, l, ξi0, i= 1,· · · , l, (2)

(3)

where ξ= (ξ₁,· · ·, ξ_l), andC > 0 is a penalty parameter. For this primal problem,C-SVC solves its Lagrangian dual problem

min α 1 2 l i=1 l j=1 αiαjyiyjK(xi, xj)− l i=1 αi, s. t. l i=1 yiαi= 0, 0α_iC, i= 1,· · · , l, (3)

where K(x, x) is the kernel function, which is also a convex QPP and then constructs the decision function.

2.2 TWSVM

Consider the binary classiﬁcation problem with the training set

T ={(x₁,+1),· · · ,(x_p,+1),(x_p+1,−1),· · ·,(x_p+q,−1)}, (4) wherex_i ∈ Rn, i = 1,· · ·, p+q. Let A = (x₁,· · ·, x_p)T ∈ Rp×n B = (x_p+1,· · · , x_p+q)T ∈ Rq×n_and _l ₌ _p₊_q_{. For linear classiﬁcation problem, TWSVM seeks a pair of nonparallel}

hyperplanes

(w+·x) +b+= 0 and (w−·x) +b−= 0. (5)

by solving two smaller QPPS min w+,b+,ξ− 1 2(Aw++e+b+) T_(Aw ++e+b+) +c1eT−ξ−, s.t. −(Bw₊+e₋b₊) +ξ₋e₋, ξ₋0, (6) and min w−,b−,ξ+ 1 2(Bw−+e−b−) T_(Bw −+e−b−) +c2eT+ξ+, s.t. (Aw₋+e₊b₋) +ξ₊e₊, ξ₊0, (7)

wherec_i, i= 1,2 are penalty parameters. e₊ ande₋ are vectors of ones of appropriate dimen-sions.For nonlinear classiﬁcation problem, two kernel-generated surfaces instead of hyperplanes are considered and two other primal problems are constructed.

2.3 NHC

For training set(4), NHC constructs two QPPs as follows. min w+,b+,ξ− 1 2c3(w+ 2₊_b2 +) + 1 2η T +η++c1eT−ξ−, s.t. Aw₊+e₊b₊=η₊, −(Bw₊+e₋b₊) +ξ₋e₋, ξ₋0, (8)

(4)

and min w−,b−,ξ+ 1 2c4(w− 2₊_b2 −) +1₂ηT−η−+c2eT+ξ+, s.t. Bw₋+e₋b₋=η₋, (Aw−+e+b−) +ξ+e+, ξ+0, (9)

where c_i, i = 1,2,3,4 are penalty parameters. e₊ and e₋ are vectors of ones of appropriate dimensions. Their dual forms can be written as follows.

max λ,α − 1 2(λ T_αT_{) ˆ}_Q(λT _αT₎T₊_c 3eT−α, s.t. 0αc₁e₋, (10) where ˆ Q= AAT₊_c 3I ABT BAT _BBT +E, (11) and max θ,γ − 1 2(θ T_γT_{) ˜}_Q₍_θT_γT₎T₊_c 4eT+γ, s.t. 0γc₂e₊, (12) where ˜ Q= BBT₊_c 4I BAT ABT _AAT +E, (13)

and I is an identity matrix ofq×q, E is a matrix of l×l with all entries equal to one. For nonlinear classiﬁcation problem, kernel function can be added directly.

3 Nonparallel hyperplanes classiﬁer for multi-class

classi-ﬁcation(NHCMC)

Consider the multiple classiﬁcation problem with the training set:

T ={(x₁, y₁),· · · ,(x_l, y_l)}, (14) wherex_i∈Rn, i= 1,· · · , l, y_i∈ {1,· · ·, K}is the corresponding pattern ofx_i.

For multiple classiﬁcation, we seekK nonparallel hyperplanes:

(w_k·x) +b_k = 0, k= 1,· · ·, K (15) For convenience, we denote the number of each class of the training set (14) as l_k and the points belonging tok-th class asA_k∈Rlk×n,k= 1,· · ·, K_{. Besides, we deﬁne the matrix}

Bk= [A1,· · · , Ak−1, Ak+1,· · ·, Ak] (16)

(5)

0 0.5 1 1.5 2 2.5 3 3.5 0 0.5 1 1.5 2 2.5 l1 l2 l3

Figure 1: A toy example learned by the linear NHCMC

3.1 Linear case

3.1.1 The Primal Problem

We seek to constructknonparallel hyperplanes (15) by solving the following convex quadratic programming problems(QPPs): min wk,bk,ηk,ξk 1 2C1wk 2₊1 2η kηk+C2ek2ξk, s.t. B_kw_k+e_k1b_k =η_k, (A_kw_k+e_k2b_k) +ξ_k e_k2 ξk 0, (17)

where η_k ∈R(l−lk)_{is a variable and} ξ_k _{is a slack variable.} e_k1∈R(l−lk)_and e_k2∈Rlk _{are the} vectors of ones. C₁0 andC₂0 are penalty parameter.

In order to illustrate the primal problem of NHCMC, we generated an artificial two-dimensional three-class dataset. The geometric interpretation of above problem with x∈ R2 is shown in Figure 1,where minimizes the sum of the squared distance from the hyperplanes of K−1 classes, that is all classes except for those of thek−th class, and the points of thek−th class are far from the its hyperplane. Take the ”*” class in Figure 1 as an example. We hope the hyperplane of the ”*” classl₁is far from the ”*” points and closest the ”+” and triangular points. In order to minimize misclassification, the points of the k−th class are at distance 1 from the hyperplane, and we minimize the sum of error variables with soft margin lost. The differences between multiple birth support vector machine(MBSVM)[20] and NHCMC are that we introduce a regularization term to implement structural risk minimization(SRM) principle and a variable is introduced to make a term of objective function to be constraints. These changes have many positive effects on original NHCMC.

(6)

3.1.2 The Dual Problem

In order to get the solutions of problem (17), we need to derive its dual problem. The Lagrangian of the problem (17) is given by

L(w_k, b_k, η_k, ξ_k, α, β, λ) =1 2C1wk 2₊1 2η kηk +C₂e_k2ξ_k+λ(B_kw_k+e_k1b_k−η_k) −α₍_A kwk+ek2bk+ξk−ek2)−βξk, (18)

whereα= (α₁,· · ·, α_l_k),β = (β₁,· · · , β_l_k),λ= (λ₁,· · · , λ_l₋_l_k)are the Lagrange multiplier vectors. The Karush-Kuhn-Tucker (KKT) conditions[13] forw_k, b_k, η_k, ξ_k andα, β, λare given by ∇wkL=C1wk+Bkλ−Akα= 0, (19) ∇bkL=ek1λ−ek2α= 0, (20) ∇ηkL=ηk−λ= 0, (21) ∇ξkL=C2ek2−α−β= 0, (22) Bkwk+ek1bk=ηk, (23) (A_kw_k+e_k2b_k) +ξ_ke_k2, ξ_k0 (24) α₍₍_A kwk+ek2bk) +ξk) = 0, βξk= 0 (25) α0, β0 (26)

Sinceβ0, from (22) we have

0αC₂e_k2. (27)

And from (19), we have

wk =−_C1 1(B

kλ−Akα), (28)

Then putting (28) and (21) into the Lagrangian and using (19)∼(26), we obtain the dual problem of problem (17) min ˆ π 1 2πˆ _Λˆˆ_π_{+ ˆ}_κ_π,_ˆ s.t.e k1λ−ek2α= 0, ˆ C1ˆπCˆ2, (29) where ˆ π = (λ, α), (30) ˆ κ = (0,−C₁e_k2), (31) ˆ C1 = (−∞e_k1,0), (32) ˆ C2 = (+∞e_k1, C₂e_k2), (33) ˆ Λ = _ˆ Q1 Qˆ2 −_Qˆ 2 Qˆ3 (34)

(7)

ˆ Q1 = BkBk +C1I, (35) ˆ Q2 = BkAk, (36) ˆ Q3 = AkAk, (37)

whereI is an identity matrix.

After getting the solution of the problem (29), we can obtain w_k∗ and b∗_k with (19) and (23). A new pointx∈Rn is assigned to classk(k∈1,· · ·, K), depending on which of theK hyperplanes given by (15) it lies farthest to. The decision function is deﬁned as

f(x) = arg max k=1,···,K |(w∗ k·x) +b∗k| ||w∗ k|| , (38) where| · |is the prependicular distance of pointxfrom the hyperplane (w_k·x) +b_k = 0, k = 1,· · ·, K.

3.2 Nonlinear case

Now we extend the linear NHCMC to the nonlinear case. Totally diﬀerent with the existing MBSVM, we do not need to consider the extra kernel-generated surfaces since only inner prod-ucts appear in the dual problems (29), so the kernel functions can be applied directly in the problems and the linear NHCMC is easily extended to the nonlinear multiple classiﬁer.

min ˆ π 1 2ˆπ _Λˆˆ_π_{+ ˆ}_κ_π,_ˆ s.t.e k1λ−ek2α= 0, ˆ C1πˆCˆ2, (39) where ˆ π = (λ, α), (40) ˆ κ = (0,−C₁e_k2), (41) ˆ C1 = (−∞e_k1,0), (42) ˆ C2 = (+∞e_k1, C₂e_k2), (43) ˆ Λ = _ˆ Q1 Qˆ2 −_Qˆ 2 Qˆ3 (44) ˆ Q1 = K(Bk, Bk) +C1I, (45) ˆ Q2 = K(Bk, Ak), (46) ˆ Q3 = K(Ak, Ak), (47)

whereIis an identity matrix andK(·,·) is the kernel function. The corresponding conclusions are similar with the linear case except that the inner product (x·x_{) is taken instead of the}

(8)

4 Numerical experiments

In this section, in order to validate the performance of our NHCMC, we compare it with one-versus-rest(l-v-r) method[12], one-versus-one(l-v-l) method[17], directed acyclic graph SVM(DAGSVM)[5], the multi-class SVM(M-SVM)[3], crammer and singer(C&S) method[7] and MBSVM[20] on several publicly available benchmark datasets, some of which are used in [20]. NHCMC is implemented in MATLAB 2010 on a PC with an Intel Core I5 processor and 2 GB RAM.

All samples were scaled such that the features locate in [0,1] before training. For the first seven datasets, the RBF kernelK(x, x) = exp(−x−_σx2) is applied and for the latter three datasets, the rectangular kernel[2] K(x, ET) with E typically of size 10% of E is used. The parameters are tuned for best classification accuracy in the range 2−8 to 212. Table 1 lists the average tenfold cross-validation results of these methods to the first six datasets and the average fourfold cross-validation results of these methods to the latter four datasets in term of accuracy. The results of the former six methods come from [20].

5 Conclusion

In this paper, we have proposed a nonparallel hyperplanes classifier for multiple classification. The idea of NHCMC approximates to MBSVM, that is the ”max” decision criterion instead of the ”min” one, but NHCMC has incomparable advantages. First, we do not need to compute the large inverses matrices before training which is inevitable in MBSVM, and furthermore the optimization problems in NHCMC can be solved efficiently by SMO. Second, totally different with MBSVM, kernel trick can be applied directly to NHC. Experimental results show the efficiency of our method in classification accuracy. Therefore this novel method could be a promising approach and worth of further study in this field.

Acknowledgment

This work has been partially supported by grants from National Natural Science Foundation of China (No.11271361, No.61472390, No.71331005), Major International (Regional) Joint Re-search Project (No.71110107026), “New Start” Academic ReRe-search Project of Beijing Union University (No. ZK10201409) and the Ministry of water resources special funds for scientiﬁc research on public causes (No.201301094).

References

[1] C. Cortes and V. N. Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, 1995. [2] Fung G and Mangasarian OL. Proximal support vector machine classiﬁers. In: 7th International

proceedings on knowledge discovery and data mining, pages 77–86, 2001.

[3] Weston J and Watkins C. Multi-class support vector machines. CSD-TR-98-04 royal holloway, University of London, Egham, UK, 1998.

[4] Jayadeva, R.Khemchandani, and S.Chandra. Twin support vector machines for pattern classiﬁca-tion. IEEE Transaction on Pattern Analysis and Machine Intelligence, 2007.

[5] Platt JC, Cristianini N, and Shawe-Taylor J. Large margin dags for multiclass classiﬁcation.Adv Neural Inf Process Syst, 12:547–553, 2000.

(9)

Datasets l-v-r l-v-l DAGSVM M-SVM C&S MBSVM NHCMC

acc(%) acc(%) acc(%) acc(%) acc(%) acc(%) acc(%)

Iris 99.67 97.33 96.67 97.33 97.33 98.00 98.45 Wine 98.88 99.44 98.88 98.88 98.88 98.24 98.62 Glass 71.96 71.50 73.83 71.03 71.96 69.52 74.55 Vowel 98.49 99.05 98.67 98.49 98.67 98.79 99.71 Vehicle 87.47 86.64 86.05 86.99 86.76 86.07 86.91 Segment 97.53 97.40 97.36 97.58 97.32 96.02 98.43 dna 95.78 95.45 95.45 95.62 95.87 96.04 97.25 Satimage 91.70 91.30 91.25 91.25 92.35 90.55 91.60 Letter 97.88 97.98 97.98 97.76 97.68 93.28 98.07 Shuttle 99.91 99.92 99.92 99.91 99.94 99.95 100.00

Table 1: Average results of the benchmark datasets

[6] Xu Chan Ju and Ying Jie Tian. Eﬃcient implementation of nonparallel hyperplanes classiﬁer.

International Conferences on Web Intelligence, pages 5–9, 2012.

[7] Crammer K and Singer Y. On the algorithmic implementation of multiclass kernel-based vector machines. J Mach Learn Res, 2:265–292, 2002.

[8] R. Khemchandani, R.K. Jayadeva, and S. Chandra. Optimal kernel selection in twin support vector machines. Pattern Recognit. Lett., 3:1:77–88, 2009.

[9] M. A. Kumar and M. Gopal. Least squares twin support vector machines for pattern classication.

Expert Systems withApplication, May 2007.

[10] M.A. Kumar and M. Gopal. Least squares twin support vector machines for pattern classiﬁcation.

Expert Syst. Appl., 36:4:7535–7543, May 2009.

[11] M.A. Kumar and M. Gopal. Application of smoothing technique on twin support vector machines.

Pattern Recognit. Lett., 29:13:1842–1848, Oct. 2008.

[12] Bottou L, Cortes C, Denker JS, Drucher H, Guyon I, Jackel LD, LeCun Y, M¨uuller UA, Sackinger E, Simard P, and Vapnik V. Comparison of classiﬁer methods: a case study in handwriting digit recognition. IAPR (eds) Proceedings of the international conference on pattern recognition, IEEE Computer Society Press, pages 77–82, 1994.

[13] O.L. Mangasarian. Nonlinear programming. Philadelphia, PA:SIAM, 1994.

[14] X. Peng. Tsvr: An eﬃcient twin support vector machine for regression.Neural Networks, 23:3:365– 372, 2010.

[15] Yingjie Tian, Xuchan Ju, Zhiquan Qi, and Yong Shi. Improved twin support vector machine.

SCIENCE CHINA Mathematics, 2:417–432, 2014.

[16] Yingjie Tian, Zhiquan Qi, Xuchan Ju, Yong Shi, and Xiaohui Liu. Nonparallel support vector ma-chines for pattern classiﬁcation. IEEE TRANSACTIONS ON CYBERNETICS, DOI 10.1109/T-CYB.2013.2279167, 2013.

[17] Krebel U. Pairwise classiﬁcation and support vector machines. In: Scholkopf B, Burges CJC, Smola AJ (eds) Advances in Kernel methods: support vector learning, MIT Press, Cambridge, MA, pages 255–268, 1999.

[18] V. N. Vapnik. The Nature of Statistical Learning Theory. New York: Springer, 1996. [19] V. N. Vapnik. Statistical Learning Theory. New York: John Wiley and Sons, 1998.

[20] Zhi-Xia Yang, Yuan-Hai Shao, and Xiang-Sun Zhang. Multiple birth support vector machine for multi-class classiﬁcation. Neural Computing and Applications, 22:Issue 1 Supplement: 153–161, 2013.