Data Perturbation Analysis of the Support Vector Classifier Dual Model

(1)

ISSN Print: 1945-3116

DOI: 10.4236/jsea.2018.1110027 Oct. 23, 2018 459 Journal of Software Engineering and Applications

Data Perturbation Analysis of the Support

Vector Classifier Dual Model

Chun Cai

1

_{, Xikui Wang}

2

1_{College of Arts and Science, Beijing Union University, Beijing, China}

2_{Department of Statistics, University of Manitoba, Winnipeg, Manitoba, Canada}

Abstract

The paper establishes a theorem of data perturbation analysis for the support vector classifier dual problem, from which the data perturbation analysis of the corresponding primary problem may be performed through standard re-sults. This theorem derives the partial derivatives of the optimal solution and its corresponding optimal decision function with respect to data parameters, and provides the basis of quantitative analysis of the influence of data errors on the optimal solution and its corresponding optimal decision function. The theorem provides the foundation for analyzing the stability and sensitivity of the support vector classifier.

Keywords

Support Vector Classifier, Partial Derivative, Sensitivity, Stability

1. Introduction

Many methods of data mining exist, including machine learning which is a ma-jor research direction of artificial intelligence. The theory of statistics plays a fundamental role in machine learning. However the classic theory of statistics is often based on large sample properties, while in reality we often face small sam-ples, sometimes with a very limited number of observations due to resource or other constraints. Consequently the performance of some large sample methods may be unsatisfactory in certain real applications. Vapnik and collaborators pioneered in the 1960s machine learning for finite samples and developed the statistical learning theory [1] [2]. To deal with the inconsistency between mini-mizing the empirical risk and minimini-mizing the expected risk in statistical learning, they proposed the principle of structural risk minimization to investigate the

How to cite this paper: Cai, C. and Wang, X.K. (2018) Data Perturbation Analysis of the Support Vector Classifier Dual Model. Journal of Software Engineering and Ap-plications, 11, 459-466.

https://doi.org/10.4236/jsea.2018.1110027

Received: April 10, 2018 Accepted: October 20, 2018 Published: October 23, 2018

(2)

DOI: 10.4236/jsea.2018.1110027 460 Journal of Software Engineering and Applications consistency in the machine learning process. Later on, Vapnik and his colleagues at AT & T Bell Laboratory proposed the method of support vector machine [3]. This has then been further developed into the method of support vector classifi-er (SVC) within statistical learning theory, which has shown satisfactory pclassifi-er- per-formance and is becoming a general method of machine learning. We focus our paper on SVC.

Let the training data set be

{

(

1, 1

) (

, 2, 2

)

, ,

(

,

)

}

(

)

m m m

T = x y x y  x y ∈ X Y× ,

where n

i

x X R∈ = is the training input and y Y_i∈ = − +

{

1, 1

}

is the training output, i=1,2, , m. The essence of the SVC problem for data mining is to seek a real-valued function f x

( )

on the training input set _Rn_{such that the} deci-sion function sgn f x

( )

infers the corresponding category in the set

{

− +1, 1

}

of an arbitrary data input _{x X R}_{∈ =} n_{, where} _{sgn f x}

_{( )}

_{= −}₁_if _{f x}

_{( )}

_<₀_and

( )

1

sgn f x = if f x

( )

>0 [4] [5]. Because it is not guaranteed that a linear space over the original input space _Rn_{can be found to separate the training} input set, a transformation φ:x→ =x φ

( )

x is often introduced from the in-put space _Rn_{to a high-dimensional Hilbert space}__{such that the training} input set corresponds to a training set in . Then a super-plane is sought after in  to separate the input space to solve the data mining classification prob-lems.

The primary problem of the standard support vector classifier (SVC) is to

(

)

(

)

(

( )

)

(

)

T

, , ₁

T 1

min , , 2

subject to , , 1 0, 1,2, , , , , 0, 1,2, , ,

m

m i i

w b R R _i

i i i i

m i i

w b w w C

g w b y w b i m

g w b i m

ψ τ ψ ψ

ψ ψ

∈ ∈ ∈ ₌

+

= +

= − + − + ≤ =

= − ≤ =

∑

 



x (1)

where T stands for transpose, Ci>0, 1, ,i=  m and ψ =

(

ψ1, ,ψm

)

T are pe-nalty parameters, w consists of the slopes of the super-plane, b R∈ is the in-tercept of the super-plane, and xi=φ

( )

xi .

The quadratic Wolfe dual problem of the above primary problem (1) is to

( )

T T

T 1 min

2

subject to 0,

0, 1,2, , ,

i i i

W H e

h y

C h i m

α α α α α

α α

= −

= =

≥ = ≥ = 

(2)

where α =

(

α1, ,αm

)

T,

(

)

T 1, , m

y= y _ y _, e=

(

1, ,1_

)

T_,

(

)

(

( )

)

T

( )

,

i j i j

K x x = φ x φ x _{is the kernel function, and} Hij=y y K x xi j

(

i, j

)

. The relationships between the optimal solution *_, _{1, ,}

i i m

α =  of the dual

problem (2) and the optimal solution

(

_{w b}*_, *

)

_{of the primary problem (1) is}

given by

( )

(

)

(

)

(

)

* *

1

T

* * * *

1 ,

, , for any 0, .

m i i i i

m

i i i j j j i i i j

w y

b y x w y y K x x C

α

φ

α

=

= =

= − = − ∈

∑

(3)

DOI: 10.4236/jsea.2018.1110027 461 Journal of Software Engineering and Applications The corresponding optimal decision function is given by sgn f x

( )

, where

( )

*

(

)

*

1 ,

m

i i i

i

f x =

∑

₌α y K x x b+ _{. Furthermore, each component} *

i

α of the optim-al solution _α*_{corresponds to a training input, and we call a training input}

i x a support vector if its corresponding * ₀

i

α > _.

In data mining, the training input data x ii, 1,2, ,=  m of the support vector classifier (SVC) model are approximation of the true values. When using the approximate values to establish the SVC model, errors in data will inevitably impact on the optimal solution and the corresponding optimal decision function of the SVC model. When the upper bound of data errors is known, the method of data perturbation analysis is used to derive the upper bounds of the optimal solution and its corresponding optimal decision function.

We are interested in the stability and sensitivity analysis of the optimal solu-tion and its corresponding optimal decision funcsolu-tion. Our analysis is based on the Fiacco’s Theorem on the sensitivity analysis of a general class of nonlinear programming problem [6] [7] [8]. The second order sufficiency condition re-quired for the Fiacco Theorem was further studied in [9]. On the other hand, kernel functions have been used in machine learning [10] [11] [12]. In this paper, we establish a Theorem on data perturbation analysis for a general class of SVC problems with kernel functions. The paper is organized as follows. The main Theorem and its Lemmas are introduced in the next section. Section 3 concludes the paper.

2. Data Perturbation Analysis of the SVC Dual Problem

Suppose that

(

_{w b}*_{, ,}*_ψ*

)

_{is the optimal solution of the primary problem}

(1). Corresponding to the solution

(

_{w b}*_{, ,}*_ψ*

)

_{, we divide, for convenience,}

the training data

(

x y1, 1

) (

, x y2, 2

)

, ,

(

x ym, m

)

into categories A B C, , as fol-lows:

1) A category: points satisfying

( )

T * * ₁

i w b+ =

x and yi = +1, or

( )

T * * ₁

i w b+ = −

x and yi = −1. These points are denoted as

(

x y1, 1

) (

, x y2, 2

)

, , ,

(

x yt t

)

for convenience. Denote A=

{

1, , t

}

.

2) B category: points satisfying

( )

T * * ₁

i w b+ >

x and yi = +1, or

( )

T * * ₁

i w b+ < −

(

x yt+1, t+1

) (

, xt+2,yt+2

)

, ,

(

x ys, s

)

for convenience. Denote B= +

{

t 1, , s

}

. 3) C category: points satisfying

( )

T * * ₁

i w b+ <

x and yi = +1, or

( )

T * * ₁

i w b+ > −

(

xs+1,ys+1

) (

, xs+2,ys+2

)

, ,

(

x ym, m

)

for convenience. Denote C=

{

s+1, , m

}

. We call those indices i such that either g w b_i

(

, ,ψ

)

=0_or g_{m i}₊

(

w b, ,ψ

)

=0 as working constraint indices. Let’s start with a lemma.

Lemma 1. Suppose that

(

_{w b}*_{, ,}*_ψ*

)

_{is the optimal solution of the primary}

problem (1). Then the set of all working constraint indices is give by

(

)

{

}

{

} {

}

*_{, ,}* * _{1, , ,} _{1, ,} _, _{1, ,} _, _{1, ,}

: : .

I w b t m m t m t m s s m

A m i i A m j j B C

ψ = + + + + + +

= + ∈ + ∈

   

(4)

DOI: 10.4236/jsea.2018.1110027 462 Journal of Software Engineering and Applications Proof. The proof is straightforward from the definitions of working constraint index and the observation that points in A and B categories imply that ψ =i 0 and points in the C category imply that ψ >i 0. 

Then we offer a sufficient condition for the linear independence of the gra-dients indexed by the set _{I w b}

(

*_{, ,}*_ψ*

)

_.

Lemma 2. Let *

(

* *

)

T 1, , m

α = α _α _{be the optimal solution of the dual}

prob-lem (2). If there exists any component *

j

α

_of _α*_{such that} ₀ *

j Cj

α

< < _where

1,2, ,

j∈ _ m_{, then the set of gradients}

(

)

T T

* * *

, hi , , ,

h _{i I w b} _ψ

α α

 _∂ __∂ _ 

  _∈ 

__∂  _{ }_∂ _ 

 

  is

linearly independent.

Proof. For i A∈ , suppose there exists 1≤ ≤t t1 such that αi*>0 for 1,2, ,i

i= _ t_but * ₀

i

α = _for i t= +₁ 1, ,_ t_{. For}i B∈ , we always have * ₀

i

α = _.

For i C∈ , we always have *

i Ci

α = _{. The set of gradients}

(

)

T T

* * *

, hi , , ,

h _{i I w b} _ψ

α α

 _∂ __∂ _ 

  _∈ 

__∂  _{ }_∂ _ 

 

  becomes

1

2

3

0 0 0 0 0

1 0 0 0 0

0 1 0 0 0

, , , , , , ,

0 0 1 0 0

0 0 0 1 0

0 0 0 0 1

m y y y

y

          

     ₋     

          

     −     

          

         

         ₋

         

     _{   }    _{   }

 _{         }

 

     

 

 

     

                  _

Since it is assumed that there is a support vector multiplier *

i

α such that

*

0<α_i <C_i_{, the number of gradients in}

(

)

T

* * *

, , ,

i

h _{i I w b} _ψ

α

__∂ _ 

 _∈ 

__∂ _ 

 

  must be

smaller than the dimension m, which is the dimension of the vector y_{= }_∂∂_αhT   .

Therefore the vector T

h _y

α ∂

 _{ =}

_∂ 

  whose m components are either −1 or +1

cannot be expressed as a linear combination of the set of gradients

(

)

T

* * *

, , ,

i

h _{i I w b} _ψ

α

__∂ _ 

 _∈ 

_∂  

 

 

 . Hence the set of constrained gradients is linearly

independent. 

We use the lower case letter p to denote the training input variable and p0 to denote the training input data. When the form of the kernel function φ

( )

x _is known, the following main Theorem investigates how the errors in the training input data impact on the optimal solution and the corresponding optimal deci-sion function of the SVC model.

Theorem 3. Suppose that *

(

* *

)

T 1, , m

α = α α is the optimal solution of the

(5)

DOI: 10.4236/jsea.2018.1110027 463 Journal of Software Engineering and Applications *

b . Furthermore, *

(

* *

)

1, , m

g = g _ g _and *

(

* *

) (

* *

)

1, , m gm1, ,g2m

ψ = ψ _ψ = ₊ _ _.

Suppose that A category input vectors x x1, , ,2 xt are all support vectors with

*

0<α_i <C i_i, 1,2, ,= _ t_{, and the set of vectors}

( )

1 1 2 2

1 2

, , , t t

t

y x y x y x

y y y

φ φ  φ 

   

 

   

     

corresponding to points in category A is linearly independent. Then the follow-ing results hold.

1) The optimal solution *

(

* *

)

T 1, , m

α = α _α _{is unique, and the corresponding}

multiplier _b* _{as well as} *

(

* *

)

T

1, , m

g = g _ g _and

(

) (

T

)

T

* * * * *

1, , m gm1, ,g2m

ψ = ψ _ψ = ₊ _ _{are all unique.}

2) There is a neighbourhood N p

( )

0 of p0 on which there is a unique

continuously differentiable function y p

( )

=

(

α

( ) ( ) ( ) ( )

p b p g p, , ,ψ p

)

_such

that

a)

( )

(

* * * *

)

0 , , ,

y p = α b g ψ _,

b) α

( )

p _{is a feasible solution to the dual problem (2) for any} p N p∈

( )

₀ _,

c) the partial derivatives of y p

( )

=

(

α

( ) ( ) ( ) ( )

p b p g p, , ,ψ p

)

_{with respect}

to data parameters satisfy

( )

T

1 T

T

, p

b p

M p M p

g p

p

α

ψ __∂ _ 

_ _ 

∂

  

 

∂ 

 

 _∂ 

 

 _{ =}

__∂ _ 

_ _ 

_∂ _ 

 

∂   _∂  

 

where

( )

(

)

( )

T T

1 , 1 p 1, , m p m, p

L

M p g g y

p

α _α _α _α

∂ ∇ 

= −_ ∇ ∇ ∇ _

∂

   , ∇αL is the

gradient of L with respect to

α

_, ∇p iα, 1, ,i=  m is the gradient of αi with

respect to p,

( )

T

p y α

∇ _{is the gradient of} _yT_α_{with respect to p,}_{and L is the}

Lagrange function of the primary problem (1), M p

( )

E F

U V

 

=  

 , and

(

)

(

)

(

)

(

)

(

)

(

)

(

)

(

)

(

)

2

1 1 1 1 2 1 2 1 1

2

2 1 2 1 2 2 2 2 2

2

1 1 2 2

, , ,

, , , _,

, , ,

m m m m

m m m m m m m

y K x x y y K x x y y K x x

y y K x x y K x x y y K x x

E

y y K x x y y K x x y K x x

 

 

=  

 

 

 

   



1

2

1 0 0 1 0 0

0 1 0 0 1 0

,

0 0 1 0 0 1 m

y y F

y

−

 

 ₋ 

 

=_ _

 

−

 

 

        

(6)

DOI: 10.4236/jsea.2018.1110027 464 Journal of Software Engineering and Applications *

1 * 2

*

* 1

* 2

*

1 2

0 0

,

0 0

m

m m g

g

g U

y y y

ψ ψ

ψ

− 

 

−

 

−

 

=  

 

 

 

   

  

   

 

and

1

2

1 1

2 2

0 0 0 0 0 0

0 0 0 0 0 0 0

0 0 0 0 0 0 0 0

m

m m

V C

C

α α

α

α −

 

 ₋ 

 

−

 

= −

 

−

 

 − 

 

 

 

        

 

        

 

Proof.

Our results follow directly from the Fiacco Theorem after checking the fol-lowing three conditions required in the Fiacco Theorem:

The Second Order Sufficiency Condition for _α*_,

Linear Independence of Gradients over the Working Constraint Set, The Strict Complementarity Property.

Proof of the Second Order Sufficiency Condition. Suppose that

(

)

T

* * *

1, , m

α = α _ α _{is the optimal solution to the dual problem. Then there exist}

multipliers * *

(

* *

)

T

1

, , , m

m

b ∈R g = g _ g ∈R _and

(

) (

T

)

T

* * * * *

1, , m gm1, ,g2m Rm

ψ = ψ _ψ = ₊ _ ∈ _{satisfying the Kurash-Kuhn-Tucker}

Condition:

* * * * _0,

Hα − +e b y g− +ψ =

* _{0 for any ,}

i

g ≥ i

* _{0 for any ,}

i i

ψ ≥

( )

_* T _* 0,

g α =

( ) (

_* T _*

)

0,

C

ψ α − =

where H=

( )

Hij and

(

)

T 1, , m C= C _C _.

The Hessian matrix corresponding to the Lagrangian function

(

)

( )

(

( )

)

( )

( ) (

)

* * * *

T T T T

* * T * * * * * * *

, , ,

1 _e

2

L b g

H b y g C

α ψ

α α α α α ψ α

(7)

DOI: 10.4236/jsea.2018.1110027 465 Journal of Software Engineering and Applications becomes the matrix H.

For i A∈ , we have assumed that * ₀

i

α > _{. For}i B∈ , we have * _0, * ₀

i i

g > α =

and * ₀

i

ψ = . For i C∈ , we have * _0, *

i i i

g = α =C and * ₀

i

ψ > .

Define the set

{

_: T _0; _{0, 1, , ;} _0, _{1, ,}

}

i i

Z= z z y= z ≤ i= _t z = i t= + _ m _{. Then for} any z=

(

z1, , ,0, ,0 zt 

)

∈Z z, ≠0, we have

(

)

(

)

(

)

( )

1

T 2 * * * * *

1 1

T

1 1

, , , , , ,0, ,0 , , 0

0

, t

t t

t

t t

i i i i i i

i i

z

z z

z L b g z z z H z z H

z

z y x z y x

α ψ

φ φ

= =

   

  _{ }

  _{ }

∇ =  = _{ }

   _{ }

       

   

= _

∑

 _{ }

∑

_



   



where _∇2_L

(

_α*_{, , ,}_{b g}* *_ψ*

)

₌_H_{is the Hessian matrix of}_L

(

_α*_{, , ,}_{b g}* *_ψ*

)

_and

*

H is the upper left t t× _{sub-matrix of H.}

Suppose that z y1 1φ

( )

x1 ++z yt tφ

( )

xt =0. Then because the set of vectors

( )

1 1 2 2

1 2

, , , t t

t

y x y x y x

y y y

φ φ  φ 

   

 

   

     

corresponding to points in category A is linearly independent and that

1 1 t t 0

z y + +_ z y = _{, we must have} z₁== =zt 0. This contradicts with the as-sumption that z≠0. Hence we must have z y1 1φ

( )

x1 ++z yt tφ

( )

xt ≠0. This implied that _zT_∇2_L

(

_α*_{, , ,}_{b g}* *_ψ*

)

_z_>₀_{and therefore the Second Order} Suffi-ciency Condition is satisfied by _α*_.

Proof of the Linear Independence of Gradients over the Working Con-straint Set. This is Lemma 2.

Proof of the Strict Complementarity Property. We need to show that _g*

and _α*_{cannot be 0 simultaneously, and}_ψ*_and _α*₋_C_{cannot be 0} simul-taneously. For i A∈ , we have * ₀

i

α > and hence the intersection between A and the working constraint set is empty because of the assumption

* _,

i C i Ai

α < ∈ . For i B∈ , we have * ₀

i

α = and multiplier * ₀

i

g > . For i C∈ , we have *

i Ci

α = _{and multiplier} * ₀

i

ψ > _{. Therefore the Strict Complementarity}

Property holds. 

3. Conclusion

(8)

foun-DOI: 10.4236/jsea.2018.1110027 466 Journal of Software Engineering and Applications dation of quantitative analysis based on the derivatives.

Acknowledgements

This work is funded with grants from the Importation and Development of High-Caliber Talents Project of Beijing Municipal Institutions CIT & TCD201404080 and New Start Academic Research Projects of Beijing Union University. Xikui Wang acknowledges research support from the Natural Sciences and Engineering Research Council (NSERC) of Canada and from the University of Manitoba.

Conflicts of Interest

The authors declare no conflicts of interest regarding the publication of this pa-per.

References

[1] Vapnik, V.N. and Lerner, A. (1963) Pattern Recognition Using Generalized Portrait Method. Automation and Remote Control, 24, 774-780.

[2] Vapnik, V.N. (1982) Estimation of Dependence Based on Empirical Data. Sprin-ger-Verlag, New York.

[3] Cortes, C. and Vapnik, V.N. (1995) Support Vector Networks. Machine Learning 20, 273-297. https://doi.org/10.1007/BF00994018

[4] Deng, N. and Tian, Y. (2004) New Data Mining Method—Support Vector Machine. Science Press, Beijing.

[5] Liu, J., Li, S.C. and Luo, X. (2012) Classification Algorithm of Support Vector Ma-chine via p-Norm Regularization. Acta Automatica Sinica,38, 76-87.

https://doi.org/10.3724/SP.J.1004.2012.00076

[6] Fiacco, A. (1976) Sensitivity Analysis for Nonlinear Programming Using Penalty Methods. Mathematical Programming, 10, 287-331.

https://doi.org/10.1007/BF01580677

[7] Liu, B. (1988) Nonlinear Programming. Beijing Institute of Technology Press, Bei-jing.

[8] Wu, Q. and Yu, Z. (1989) Sensitivity Analysis for Nonlinear Programming Using Generalized Reduced Gradient Method. Journal of Beijing Institute of Technology, 9, 88-96.

[9] Cai, C., Liu, B. and Deng, N. (2006) Second Order Sufficient Conditions Property for Linear Support Vector Classifier. Journal of China Agricultural University, 11, 92-95.

[10] Scholkopf, B., Burges, C.J.C. and Smola, A.J. (1999) Advances in Kernel Me-thods-Support Vector Learning. MIT Press, Cambridge, MA, 327-352.

[11] Cristianin, N. and Shawe-Taylar, J. (2000)An Introduction to Support Vector Ma-chines and Other Kernel-Based Learning Methods. Cambridge University Press, Cambridge. https://doi.org/10.1017/CBO9780511801389