2017 2nd International Conference on Computational Modeling, Simulation and Applied Mathematics (CMSAM 2017) ISBN: 978-1-60595-499-8
Robust Multi-Weight Vector Projection Support Vector Machine
Heng-hao ZHAO and Qiao-lin YE
*College of Information Science and Technology, Nanjing Forestry University, Nanjing, Jiangsu 210037, P. R. China
Jiangsu Key Laboratory of Image and Video Understanding for Social Safety, Nanjing University of Science and Technology, 210094, P. R. China
*Corresponding author
Keywords: MVSVM, L1-norm, Iterative algorithm, Robustness to outliers.
Abstract. The recently proposed multi-weight vector projection support vector machine (MVSVM) is an outstanding algorithm for binary classification. However, it measures distance in the objective function by the squared L2-norm, which exaggerates the impact of outliers. To alleviate this, we propose an effective algorithm, termed robust MVSVM based on the L1-norm distance (L1-MVSVM), in which the distance in the objective is measured by the L1-norm. In addition, we design an iterative algorithm to solve the resulting L1-norm optimization problem, and its convergence is theoretically ensured. Finally, the effectiveness of L1-MVSVM is verified through extensive experiments.
Introduction
Over the last two decades, the support vector machine (SVM) has received a great deal of attention owing to its strong generalization ability, and it has become a powerful classification method in machine learning [1]. For binary classification, Mangasarian and Wild proposed a new approach named the generalized eigenvalue proximal support vector machine (GEPSVM) [2], in which each of the two data sets is proximal to one of two distinct planes that are not parallel to each other. This characteristic of GEPSVM leads to lower computational complexity and outstanding classification performance; on the XOR problem in particular, GEPSVM has a large advantage over SVM. Following GEPSVM, many researchers have improved on it in various aspects. Guarracino et al. proposed the regularized general eigenvalue classifier (ReGEC) [3]. Ye et al. proposed multi-weight vector projection support vector machines (MVSVM) [4-5].
GEPSVM and its improved variants are sensitive to outliers and noise because they adopt a squared L2-norm distance criterion. In recent years, many papers have shown that the L1-norm distance is robust to outliers and noise [6-9]. In [9], Kwak first introduced the L1-norm distance into PCA. The L1-norm distance was then used in the discriminant criterion of LDA for feature extraction. Based on the L1-norm distance, Li et al. proposed the robust L1-NPSVM [6], which replaces the squared L2-norm distance criterion in GEPSVM with the L1-norm distance and thereby makes the classifier robust to outliers and noise.
We consider a binary classification problem in an n-dimensional space. Suppose we have $m$ training samples, denoted by $\{(x_j^{(i)}, y_i) \mid i = 1, 2;\ j = 1, 2, \dots, m_i\}$. Here $x_j^{(i)}$ denotes the $j$-th sample of the $i$-th class, and $y_i \in \{-1, 1\}$ is the class label, which marks the sample as positive or negative. In what follows, the matrix $A$ of size $m_1 \times n$ collects the positive samples and the matrix $B$ of size $m_2 \times n$ collects the negative samples. We also define a pair of column vectors of ones, $e_1 \in R^{m_1}$ and $e_2 \in R^{m_2}$.
In this paper, we address the lack of robustness of MVSVM to outliers. The convergence of the proposed algorithm is proved theoretically, and its effectiveness is verified by experimental results.
MVSVM
Like GEPSVM, MVSVM has two eigenvalue formulations, but it differs from GEPSVM in spirit. Instead of seeking specific planes, MVSVM seeks a weight-vector projection for each class, $w_1$ and $w_2$. MVSVM is fast to compute and also handles complex Exclusive Or (XOR) problems well.
The optimization criteria of MVSVM are given by:
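$$\max_{w_1}\ \Big(\frac{1}{m_1}\sum_{j=1}^{m_1} w_1^T x_j^{(1)} - \frac{1}{m_2}\sum_{j=1}^{m_2} w_1^T x_j^{(2)}\Big)^2 - \beta \sum_{i=1}^{m_1}\Big(w_1^T x_i^{(1)} - \frac{1}{m_1}\sum_{j=1}^{m_1} w_1^T x_j^{(1)}\Big)^2 \tag{1}$$
$$\max_{w_2}\ \Big(\frac{1}{m_2}\sum_{j=1}^{m_2} w_2^T x_j^{(2)} - \frac{1}{m_1}\sum_{j=1}^{m_1} w_2^T x_j^{(1)}\Big)^2 - \beta \sum_{i=1}^{m_2}\Big(w_2^T x_i^{(2)} - \frac{1}{m_2}\sum_{j=1}^{m_2} w_2^T x_j^{(2)}\Big)^2 \tag{2}$$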
Let $\mu_1 = \frac{1}{m_1}\sum_{j=1}^{m_1} x_j^{(1)}$ be the mean vector of the positive samples and $\mu_2 = \frac{1}{m_2}\sum_{j=1}^{m_2} x_j^{(2)}$ the mean vector of the negative samples. $S_1 = (A - e_1\mu_1^T)^T (A - e_1\mu_1^T)$ is the divergence matrix of the positive samples, $S_2 = (B - e_2\mu_2^T)^T (B - e_2\mu_2^T)$ is the divergence matrix of the negative samples, and $S_3 = (\mu_2 - \mu_1)(\mu_2 - \mu_1)^T$ is the inter-class divergence matrix. Problems (1) and (2) can then be rewritten as:
$$\max_{w_1}\ w_1^T S_3 w_1 - \beta\, w_1^T S_1 w_1 \tag{3}$$
$$\max_{w_2}\ w_2^T S_3 w_2 - \beta\, w_2^T S_2 w_2 \tag{4}$$
where $\beta$ is a free trade-off parameter. According to these criteria, MVSVM finds two optimal weight-vector projections (one for each class), such that each of the two data sets is closest to its own class mean while points with different labels are separated as far as possible.
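As an illustration of criteria (3) and (4), the following minimal NumPy sketch builds the three divergence matrices and picks a direction for each class. The function names, the random test data, and in particular the unit-norm constraint on $w$ (which turns each criterion into a symmetric eigenvalue problem) are assumptions made for this example and are not taken from the paper.

```python
import numpy as np

def mvsvm_scatter(A, B):
    """Return (S1, S2, S3) for positive samples A (m1 x n) and negative samples B (m2 x n)."""
    mu1 = A.mean(axis=0)
    mu2 = B.mean(axis=0)
    S1 = (A - mu1).T @ (A - mu1)      # (A - e1 mu1^T)^T (A - e1 mu1^T)
    S2 = (B - mu2).T @ (B - mu2)      # (B - e2 mu2^T)^T (B - e2 mu2^T)
    d = (mu2 - mu1)[:, None]
    S3 = d @ d.T                      # (mu2 - mu1)(mu2 - mu1)^T
    return S1, S2, S3

def mvsvm_direction(S_within, S3, beta):
    """Assumption: maximise w^T (S3 - beta * S_within) w subject to ||w||_2 = 1."""
    M = S3 - beta * S_within          # symmetric by construction
    _, vecs = np.linalg.eigh(M)       # eigenvalues are returned in ascending order
    return vecs[:, -1]                # eigenvector of the largest eigenvalue

# Illustrative random data (not from the paper).
rng = np.random.default_rng(0)
A = rng.normal(loc=1.0, size=(30, 4))
B = rng.normal(loc=-1.0, size=(40, 4))
S1, S2, S3 = mvsvm_scatter(A, B)
w1 = mvsvm_direction(S1, S3, beta=0.1)
w2 = mvsvm_direction(S2, S3, beta=0.1)
```

Under the stated unit-norm assumption, `w1` and `w2` are the leading eigenvectors of $S_3 - \beta S_1$ and $S_3 - \beta S_2$ respectively.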
Related Works
MVSVM Based on the L1-norm Distance
The traditional MVSVM model is:
$$\min_{w_1}\ \frac{w_1^T (A - e_1\mu_1^T)^T (A - e_1\mu_1^T)\, w_1}{w_1^T (\mu_2 - \mu_1)(\mu_2 - \mu_1)^T w_1} \tag{5}$$
$$\min_{w_2}\ \frac{w_2^T (B - e_2\mu_2^T)^T (B - e_2\mu_2^T)\, w_2}{w_2^T (\mu_2 - \mu_1)(\mu_2 - \mu_1)^T w_2} \tag{6}$$
Problems (5) and (6) have the same form, so we take problem (5) as the example.
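Setting $H = A - e_1\mu_1^T$ and $G = (\mu_2 - \mu_1)^T$, problem (5) can be rewritten as:
$$\min_{w_1}\ \frac{\|Hw_1\|_2^2}{\|Gw_1\|_2^2} \tag{7}$$
Replacing the squared L2-norm distance with the L1-norm distance gives the L1-MVSVM problem:
$$\min_{w_1}\ \frac{\|Hw_1\|_1}{\|Gw_1\|_1} \tag{8}$$
To facilitate the solution, we rewrite problem (8) in maximization form: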
$$\max_{w_1}\ \frac{\sum_{i=1}^{m_1} |g_i w_1|}{\sum_{i=1}^{m_2} |h_i w_1|} \tag{9}$$
where $g_i$ and $h_i$ denote the $i$-th rows of $G$ and $H$. Note that rescaling $w_1$ does not change the objective value of the original problem.
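The scale invariance of objective (9) can be checked numerically. The sketch below is only an illustration: the helper name `l1_ratio` and the random data are assumptions, and $G$ is stored as a single row, following the definition $G = (\mu_2 - \mu_1)^T$.

```python
import numpy as np

def l1_ratio(G, H, w):
    """Objective (9): sum_i |g_i w| / sum_i |h_i w|, with g_i, h_i the rows of G, H."""
    return np.abs(G @ w).sum() / np.abs(H @ w).sum()

# Illustrative random data (not from the paper).
rng = np.random.default_rng(1)
A = rng.normal(loc=1.0, size=(20, 3))
B = rng.normal(loc=-1.0, size=(25, 3))
mu1, mu2 = A.mean(axis=0), B.mean(axis=0)
H = A - mu1                  # H = A - e1 mu1^T
G = (mu2 - mu1)[None, :]     # G = (mu2 - mu1)^T as a 1 x n row

w = rng.normal(size=3)
base = l1_ratio(G, H, w)
scaled = l1_ratio(G, H, -7.5 * w)   # any nonzero rescaling of w
```

Numerator and denominator are both scaled by the same factor, so `base` and `scaled` agree; this is why a normalization constraint can be imposed without changing the problem.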
We therefore convert problem (9) into a maximization problem with an equality constraint:
$$\max_{w_1}\ \sum_{i=1}^{m_1} |g_i w_1|, \quad \text{s.t.}\ \sum_{i=1}^{m_2} |h_i w_1| = 1 \tag{10}$$
Note the identities:
$$\sum_{i=1}^{m_1} |g_i w_1| = \sum_{i=1}^{m_1} w_1^T \frac{g_i^T g_i}{|g_i w_1|} w_1 = \sum_{i=1}^{m_1} \mathrm{sign}(g_i w_1)(g_i w_1) \tag{11}$$
$$\sum_{i=1}^{m_2} |h_i w_1| = \sum_{i=1}^{m_2} w_1^T \frac{h_i^T h_i}{|h_i w_1|} w_1 = \sum_{i=1}^{m_2} \mathrm{sign}(h_i w_1)(h_i w_1) \tag{12}$$
Setting $f_{ii} = 1/|h_i w_1|$ and $k_i = \mathrm{sign}(g_i w_1)$, problem (10) can be rewritten as:
$$\max_{w_1}\ \sum_{i=1}^{m_1} k_i (g_i w_1), \quad \text{s.t.}\ \sum_{i=1}^{m_2} w_1^T (h_i^T f_{ii} h_i) w_1 = 1 \tag{13}$$
Let $w_1^{(p)}$ denote the $p$-th iterate of $w_1$. Then $w_1^{(p+1)}$ is obtained from:
$$w_1^{(p+1)} = \arg\max_{w_1}\ \sum_{i=1}^{m_1} k_i^{(p)} (g_i w_1), \quad \text{s.t.}\ \sum_{i=1}^{m_2} w_1^T (h_i^T f_{ii}^{(p)} h_i) w_1 = 1 \tag{14}$$
where $f_{ii}^{(p)} = 1/|h_i w_1^{(p)}|$ and $k_i^{(p)} = \mathrm{sign}(g_i w_1^{(p)})$. It is easy to verify that $k_i^{(p)} (g_i w_1)$ is the first-order Taylor expansion of $|g_i w_1|$ at the point $w_1^{(p)}$. Problem (14) can be written in matrix form:
$$w_1^{(p+1)} = \arg\max_{w_1}\ K^{(p)} G w_1, \quad \text{s.t.}\ w_1^T H^T F^{(p)} H w_1 = 1 \tag{15}$$
where $F^{(p)} = \mathrm{diag}(f_{11}^{(p)}, f_{22}^{(p)}, \dots, f_{m_2 m_2}^{(p)})$ and $K^{(p)} = \mathrm{sign}(w_1^{(p)T} G^T)$.
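The two ingredients of problem (15) can be sanity-checked numerically: at the current iterate, the weighted quadratic form equals the L1 sum over the rows of $H$, and the linear term never exceeds the L1 sum over the rows of $G$, with equality at the current iterate. The sketch below uses random matrices and hypothetical helper names purely for illustration.

```python
import numpy as np

# Illustrative random data; rows of G and H play the roles of g_i and h_i.
rng = np.random.default_rng(3)
G = rng.normal(size=(10, 4))
H = rng.normal(size=(12, 4))
w_p = rng.normal(size=4)                  # current iterate w1^(p)

F_p = np.diag(1.0 / np.abs(H @ w_p))      # F^(p) = diag(1 / |h_i w1^(p)|)
K_p = np.sign(G @ w_p)                    # K^(p), stored as a 1-D array

def l1_sum(M, w):
    """Sum of absolute values of the entries of M w."""
    return np.abs(M @ w).sum()

def surrogate(w):
    """Linear objective of problem (15): K^(p) G w."""
    return K_p @ (G @ w)

def weighted_quadratic(w):
    """Constraint function of problem (15): w^T H^T F^(p) H w."""
    return w @ H.T @ F_p @ H @ w

# Gap between the true L1 sum and the linear surrogate at random trial points.
trials = rng.normal(size=(200, 4))
gaps = np.array([l1_sum(G, w) - surrogate(w) for w in trials])
```

All entries of `gaps` are non-negative because $\mathrm{sign}(a)\,b \le |b|$ for any scalars $a$ and $b$, so maximizing the surrogate pushes up a lower bound on the numerator of (9).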
We now derive the closed-form solution of problem (15). Construct the Lagrangian of problem (15):
$$L(w_1, \gamma) = K^{(p)} G w_1 - \frac{1}{2}\gamma\,\big(w_1^T H^T F^{(p)} H w_1 - 1\big) \tag{16}$$
where $\gamma$ is the Lagrange multiplier. Taking the derivative of $L(w_1, \gamma)$ with respect to $w_1$ and setting it to zero gives the solution of problem (15):
$$w_1^{(p+1)} = \frac{1}{\gamma}\,(H^T F^{(p)} H)^{-1} G^T K^{(p)T} \tag{17}$$
Substituting the solution (17) into the equality constraint $w_1^T H^T F^{(p)} H w_1 = 1$ gives:
$$\gamma = \sqrt{(K^{(p)} G)(H^T F^{(p)} H)^{-1}(G^T K^{(p)T})} \tag{18}$$
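For reference, the intermediate steps of this Lagrangian argument can be written out as follows. This is a sketch under two assumptions that the text does not state explicitly: the constraint term enters the Lagrangian with a minus sign, and $H^T F^{(p)} H$ is invertible (it is symmetric by construction). The positive root of $\gamma$ is taken so that the objective $K^{(p)} G w_1$ is maximized rather than minimized.

```latex
\begin{aligned}
\frac{\partial L}{\partial w_1}
  &= G^T K^{(p)T} - \gamma\, H^T F^{(p)} H\, w_1 = 0
  \;\Longrightarrow\;
  w_1 = \frac{1}{\gamma}\,(H^T F^{(p)} H)^{-1} G^T K^{(p)T}, \\
1 = w_1^T H^T F^{(p)} H\, w_1
  &= \frac{1}{\gamma^2}\, K^{(p)} G\,(H^T F^{(p)} H)^{-1} G^T K^{(p)T}
  \;\Longrightarrow\;
  \gamma = \sqrt{K^{(p)} G\,(H^T F^{(p)} H)^{-1} G^T K^{(p)T}}.
\end{aligned}
```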
Finally, we get:
$$w_1^{(p+1)} = \frac{(H^T F^{(p)} H)^{-1} G^T K^{(p)T}}{\sqrt{K^{(p)} G\,(H^T F^{(p)} H)^{-1} G^T K^{(p)T}}} \tag{19}$$
Similarly, applying the same procedure to problem (6) gives:
$$w_2^{(p+1)} = \frac{(E^T F^{(p)} E)^{-1} N^T K^{(p)T}}{\sqrt{K^{(p)} N\,(E^T F^{(p)} E)^{-1} N^T K^{(p)T}}} \tag{20}$$
where $E = B - e_2\mu_2^T$ and $N = (\mu_2 - \mu_1)^T$, with $F^{(p)}$ and $K^{(p)}$ defined analogously from $E$, $N$ and $w_2^{(p)}$.
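A minimal NumPy sketch of one update of the form (19) is given below. The function name `l1_mvsvm_update`, the optional `delta` argument (a ridge term added to $H^T F^{(p)} H$) and the random test data are assumptions for illustration only; a linear solve is used in place of an explicit matrix inverse.

```python
import numpy as np

def l1_mvsvm_update(G, H, w_p, delta=0.0):
    """One update of the form (19), starting from the current iterate w_p.

    `delta` is a hypothetical ridge term added to H^T F H; 0.0 gives the plain update.
    """
    f = 1.0 / np.abs(H @ w_p)                # diagonal of F^(p)
    k = np.sign(G @ w_p)                     # K^(p) as a 1-D array
    M = H.T @ (f[:, None] * H) + delta * np.eye(H.shape[1])
    b = G.T @ k                              # G^T K^(p)T
    z = np.linalg.solve(M, b)                # M^{-1} G^T K^(p)T without forming the inverse
    return z / np.sqrt(b @ z)                # divide by sqrt(K G M^{-1} G^T K^T)

# Illustrative random data (not from the paper).
rng = np.random.default_rng(6)
H = rng.normal(size=(20, 3))
G = rng.normal(size=(1, 3))
w_p = rng.normal(size=3)

w_next = l1_mvsvm_update(G, H, w_p)
M = H.T @ (H / np.abs(H @ w_p)[:, None])     # H^T F^(p) H
constraint_value = w_next @ M @ w_next       # the constraint of (15) asks for 1
```

With `delta=0.0` the returned vector satisfies the equality constraint of problem (15) up to rounding error, which gives a quick way to test an implementation.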
Algorithm 1: An efficient iterative algorithm to solve problem (8).
Data: the data matrix $X = [A; B]$
Result: $w_1$
Set $H = A - e_1\mu_1^T$ and $G = (\mu_2 - \mu_1)^T$. Initialize $w_1^{(p)}$ and set $p = 1$.
Repeat:
1) Compute $F^{(p)} = \mathrm{diag}(f_{11}^{(p)}, f_{22}^{(p)}, \dots, f_{m_2 m_2}^{(p)})$ and $K^{(p)} = \mathrm{sign}(w_1^{(p)T} G^T)$.
2) Compute $w_1^{(p+1)}$ by solving problem (15).
3) Normalize $w_1^{(p+1)}$ by $w_1^{(p+1)} = w_1^{(p+1)} / \|w_1^{(p+1)}\|_1$ and set $p = p + 1$.
Until convergence
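The steps of Algorithm 1 can be sketched in NumPy as follows. This is an illustrative implementation under stated assumptions, not the authors' MATLAB code: the function and parameter names (`l1_mvsvm_fit`, `delta`, `tol`, `max_iter`, `seed`), the random initialization, and the small floor that guards against division by zero are all choices made for this example. The defaults `tol=1e-3` and `max_iter=20` mirror the stopping rule reported in the Experiment section.

```python
import numpy as np

def l1_ratio(G, H, w):
    """Objective (9): sum_i |g_i w| / sum_i |h_i w|."""
    return np.abs(G @ w).sum() / np.abs(H @ w).sum()

def l1_mvsvm_fit(A, B, delta=0.0, tol=1e-3, max_iter=20, seed=0):
    """Iterate the closed-form update for w1; returns (w1, objective history)."""
    mu1, mu2 = A.mean(axis=0), B.mean(axis=0)
    H = A - mu1                                   # H = A - e1 mu1^T
    G = (mu2 - mu1)[None, :]                      # G = (mu2 - mu1)^T as a 1 x n row
    rng = np.random.default_rng(seed)
    w = rng.normal(size=A.shape[1])               # assumption: random initial w1
    w /= np.abs(w).sum()
    history = [l1_ratio(G, H, w)]
    for _ in range(max_iter):
        # Step 1: diagonal of F^(p) (floored to avoid division by zero) and K^(p).
        f = 1.0 / np.maximum(np.abs(H @ w), 1e-12)
        k = np.sign(G @ w)
        # Step 2: closed-form solution of problem (15), optionally ridge-regularized.
        M = H.T @ (f[:, None] * H) + delta * np.eye(H.shape[1])
        b = G.T @ k
        z = np.linalg.solve(M, b)
        w = z / np.sqrt(b @ z)
        # Step 3: L1 normalization of the new iterate.
        w /= np.abs(w).sum()
        history.append(l1_ratio(G, H, w))
        if abs(history[-1] - history[-2]) < tol:
            break
    return w, history

# Illustrative random data (not from the paper).
rng = np.random.default_rng(0)
A = rng.normal(loc=1.0, size=(40, 5))
B = rng.normal(loc=-1.0, size=(50, 5))
w1, history = l1_mvsvm_fit(A, B)
```

With `delta=0.0` the recorded objective values in `history` should be non-decreasing, in line with Theorem 1; a positive `delta` corresponds to the regularization discussed at the end of this section.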
Theorem 1: In each iteration, Algorithm 1 monotonically increases the objective function (9).
Proof: We iteratively solve problem (15); denote its objective function by $J(w_1)$. Since $w_1^{(p+1)}$ is the maximizer of problem (15), we have $J(w_1^{(p+1)}) \ge J(w_1^{(p)})$. Therefore:
$$K^{(p)} G w_1^{(p+1)} \ge K^{(p)} G w_1^{(p)} \tag{21}$$
$$\sum_{i=1}^{m_1} \mathrm{sign}(g_i w_1^{(p)})(g_i w_1^{(p+1)}) \ge \sum_{i=1}^{m_1} \mathrm{sign}(g_i w_1^{(p)})(g_i w_1^{(p)}) = \sum_{i=1}^{m_1} |g_i w_1^{(p)}| \tag{22}$$
Since the absolute value function $f(w_1) = |g_i w_1|$ is convex, we have the inequality:
$$f(w_1^{(p+1)}) \ge f(w_1^{(p)}) + f'(w_1)\big|_{w_1 = w_1^{(p)}}\,(w_1^{(p+1)} - w_1^{(p)}) \tag{23}$$
Inequality (23) can be written as
$$\sum_{i=1}^{m_1} |g_i w_1^{(p+1)}| \ge \sum_{i=1}^{m_1} |g_i w_1^{(p)}| + \sum_{i=1}^{m_1} \mathrm{sign}(g_i w_1^{(p)})\, g_i (w_1^{(p+1)} - w_1^{(p)}) \tag{24}$$
so that
$$\sum_{i=1}^{m_1} |g_i w_1^{(p+1)}| \ge \sum_{i=1}^{m_1} \mathrm{sign}(g_i w_1^{(p)})\, g_i w_1^{(p+1)} \tag{25}$$
Combining (22) and (25):
$$\sum_{i=1}^{m_1} |g_i w_1^{(p+1)}| \ge \sum_{i=1}^{m_1} |g_i w_1^{(p)}| \tag{26}$$
For any two positive scalars $v$ and $u$, we have:
$$(v - u)^2 \ge 0 \ \Rightarrow\ v^2 - 2vu + u^2 \ge 0 \ \Rightarrow\ v - \frac{v^2}{2u} \le u - \frac{u^2}{2u} \tag{27}$$
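Inequality (27) is elementary, and a quick numerical check makes the slack explicit: the right-hand side exceeds the left-hand side by exactly $(v-u)^2/(2u)$. The sampling ranges below are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
v = rng.uniform(0.0, 10.0, size=10_000)
u = rng.uniform(1e-3, 10.0, size=10_000)   # u must be strictly positive

lhs = v - v**2 / (2 * u)                   # v - v^2 / (2u)
rhs = u - u**2 / (2 * u)                   # u - u^2 / (2u), which equals u / 2
slack = rhs - lhs                          # algebraically (v - u)^2 / (2u)
```

Equality holds only when $v = u$, which is what makes the bound tight at the current iterate.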
Setting $v = |h_i w_1^{(p+1)}|$ and $u = |h_i w_1^{(p)}|$ gives:
$$|h_i w_1^{(p+1)}| - \frac{(h_i w_1^{(p+1)})^2}{2|h_i w_1^{(p)}|} \le |h_i w_1^{(p)}| - \frac{(h_i w_1^{(p)})^2}{2|h_i w_1^{(p)}|} \tag{28}$$
Therefore
$$\sum_{i=1}^{m_2} |h_i w_1^{(p+1)}| - \sum_{i=1}^{m_2} \frac{(h_i w_1^{(p+1)})^2}{2|h_i w_1^{(p)}|} \le \sum_{i=1}^{m_2} |h_i w_1^{(p)}| - \sum_{i=1}^{m_2} \frac{(h_i w_1^{(p)})^2}{2|h_i w_1^{(p)}|} \tag{29}$$
Since
$$w_1^{(p)T} H^T F^{(p)} H w_1^{(p)} = \sum_{i=1}^{m_2} |h_i w_1^{(p)}| = 1 \tag{30}$$
$$w_1^{(p+1)T} H^T F^{(p)} H w_1^{(p+1)} = \sum_{i=1}^{m_2} \frac{(h_i w_1^{(p+1)})^2}{|h_i w_1^{(p)}|} = 1 \tag{31}$$
inequality (29) gives $\sum_{i=1}^{m_2} |h_i w_1^{(p+1)}| \le 1$, so we have
$$\sum_{i=1}^{m_2} |h_i w_1^{(p+1)}| \le w_1^{(p+1)T} H^T F^{(p)} H w_1^{(p+1)} \tag{32}$$
Further
$$\sum_{i=1}^{m_1} |g_i w_1^{(p+1)}| = \frac{\sum_{i=1}^{m_1} |g_i w_1^{(p+1)}|}{w_1^{(p+1)T} H^T F^{(p)} H w_1^{(p+1)}} \le \frac{\sum_{i=1}^{m_1} |g_i w_1^{(p+1)}|}{\sum_{i=1}^{m_2} |h_i w_1^{(p+1)}|} \tag{33}$$
Combining (33), (26) and the equation $\sum_{i=1}^{m_2} |h_i w_1^{(p)}| = 1$, we get:
$$\frac{\sum_{i=1}^{m_1} |g_i w_1^{(p+1)}|}{\sum_{i=1}^{m_2} |h_i w_1^{(p+1)}|} \ge \frac{\sum_{i=1}^{m_1} |g_i w_1^{(p)}|}{\sum_{i=1}^{m_2} |h_i w_1^{(p)}|} \tag{34}$$
Inequality (34) shows that the objective (9) does not decrease from one iteration to the next, which means that L1-MVSVM can find a local maximum point and that the algorithm converges. In practice, to guarantee termination, the conventional approach is to stop when the difference in objective value between two successive iterations falls below a small threshold, while also capping the number of iterations at a given value. In the iterative process, we can see from (19) and (20) that the matrices $G^T D^{(p)} G$ and $H^T F^{(p)} H$ are only guaranteed to be positive semidefinite, which can lead to an inexact or unstable solution. We address this by regularization: the two matrices are replaced with $G^T D^{(p)} G + \delta I$ and $H^T F^{(p)} H + \delta I$.
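The effect of the $\delta I$ term can be seen in a small example. In the sketch below, $H$ deliberately has fewer rows than columns (an illustrative worst case, not a setting from the paper), so $H^T F^{(p)} H$ is rank deficient and cannot be inverted reliably; adding $\delta I$ lifts every eigenvalue by $\delta$.

```python
import numpy as np

rng = np.random.default_rng(5)
H = rng.normal(size=(2, 4))              # 2 samples in 4 dimensions: rank at most 2
w = rng.normal(size=4)
F = np.diag(1.0 / np.abs(H @ w))         # F^(p) built from the current iterate

M = H.T @ F @ H                          # positive semidefinite, but singular here
delta = 2.0 ** -8                        # one value of the form 2^i
M_reg = M + delta * np.eye(4)            # regularized matrix H^T F H + delta I

rank_plain = np.linalg.matrix_rank(M)
min_eig_reg = np.linalg.eigvalsh(M_reg).min()
```

After regularization the smallest eigenvalue is approximately `delta`, so the linear system in the update step has a unique, numerically stable solution.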
Experiment
To test the classification performance of L1-MVSVM, we compare it with GEPSVM, L1-NPSVM and MVSVM on UCI datasets. Each dataset contains only two classes, and all sample data are normalized to the interval [-1, 1] to reduce the differences in scale between features. To obtain the best generalization performance, the parameter δ was selected from the set $\{2^i \mid i = -12, -11, \dots, 12\}$ using ten-fold cross-validation. The iterative algorithms terminate when the change in objective value between two successive iterations is less than 0.001, with a maximum of 20 iterations.
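The preprocessing and parameter grid described above might be set up as in the following sketch. Per-feature min-max scaling is one plausible reading of "normalized to [-1, 1]" and is an assumption here, as are the helper name, the handling of constant features, the random stand-in data, and the random fold assignment.

```python
import numpy as np

def scale_to_unit_interval(X):
    """Min-max scale each feature (column) of X to [-1, 1].

    Assumption: constant columns are mapped to -1 instead of dividing by zero.
    """
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)
    return 2.0 * (X - lo) / span - 1.0

# Candidate values for delta: 2^i for i = -12, -11, ..., 12.
delta_grid = [2.0 ** i for i in range(-12, 13)]

# Illustrative data and a ten-fold split of the sample indices.
rng = np.random.default_rng(7)
X = rng.uniform(-5.0, 20.0, size=(50, 3))
X_scaled = scale_to_unit_interval(X)
folds = np.array_split(rng.permutation(len(X)), 10)
```

Each entry of `folds` would serve once as the held-out set while `delta` is chosen from `delta_grid` by average validation accuracy.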
The experiments were run on a PC with an Intel(R) Core(TM) i7-3632QM CPU @ 2.20GHz and 8.00GB of RAM, under the Windows 10 operating system with MATLAB 2014b.
Table 1. Test accuracy on the UCI datasets.
Dataset       GEPSVM    L1-NPSVM  MVSVM     L1-MVSVM
Brightdata    95.73%    95.94%    97.28%    95.53%
Checkdata     52.60%    52.60%    49.70%    54.30%
Germ          68.90%    68.70%    70.00%    70.30%
Haberm        74.43%    74.76%    71.91%    76.09%
Housingdata   71.53%    72.53%    71.92%    74.93%
Ionodata      78.33%    79.19%    84.90%    86.89%
Monk1         78.05%    78.94%    66.50%    66.48%
Monk2         67.89%    67.72%    71.05%    70.53%
Monk3         78.50%    78.50%    80.15%    77.61%
Musk          78.93%    79.54%    78.30%    76.67%
Spect         78.29%    78.29%    73.87%    77.51%
Votes         95.63%    95.63%    95.62%    95.17%
Vowel         88.44%    88.44%    89.95%    90.72%
Wpbc          74.68%    74.68%    74.32%    76.18%
Conclusions
In this paper, we have proposed an MVSVM based on the L1-norm distance for binary classification, termed L1-MVSVM. In contrast with MVSVM, the use of the L1-norm distance makes L1-MVSVM more robust to outliers and improves the flexibility of the model. Further, we designed an iterative algorithm to solve the L1-norm optimization problem; it is easy to implement, and its convergence to a local optimum is theoretically ensured. In the experiments, L1-MVSVM achieves the highest test accuracy among MVSVM, GEPSVM, L1-NPSVM and L1-MVSVM on 7 of the 14 UCI datasets in Table 1.
References
[1] C. Cortes, V. Vapnik. Support vector networks. Machine learning. 1995; 20: 273-297.
[2] O.L. Mangasarian, E.W. Wild. Multisurface proximal support vector machine classification via generalized eigenvalues. IEEE Trans. Pattern Anal. Mach. Intell. 2006, 28:69-74.
[3] M. R. Guarracino, C. Cifarelli, O. Seref, P. M. Pardalos. A classification method based on generalized eigenvalue problems. Optim. Method Softw. 2007, 22: 73-81.
[4] Q. L. Ye, C.X. Zhao, N. Ye and Y.N. Chen. Multi-Weight Vector Projection Support vector machines, Pattern Recognition Letters, 2010, 31:2006-2011.
[5] Q. L. Ye, N. Ye, T.M. Yin, Enhanced multi-weight vector projection support vector machine. Pattern Recognition Letters 2014, 42: 91-100.
[6] C. N. Li, Y.H. Shao, and N.Y. Deng, Robust L1-norm non-parallel proximal support vector machine. Optimization, 2016. 65(1): p. 1-15.
[7] H. X. Wang, X.S. Lu, Z.L. Hu, and W.M. Zheng. Fisher discriminant analysis with L1-norm. IEEE Trans. Cybern., 2014, 6(44): 828-842.
[8] W. M. Zheng, Z. C. Lin, and H. X. Wang, “L1-norm distance Kernel Discriminant Analysis via Bayes Error Bound Optimization for Robust Feature Extraction,” IEEE Trans. Neural Netw., 2014, 4(24):793-805.