Cost-Sensitive Support Vector Machine for Semi-Supervised Learning

(1)

Procedia Computer Science 18 ( 2013 ) 1684 – 1689

Selection and peer review under responsibility of the organizers of the 2013 International Conference on Computational Science doi: 10.1016/j.procs.2013.05.336

International Conference on Computational Science, ICCS 2013

Cost-Sensitive Support Vector Machine for Semi-Supervised

Learning

Zhiquan Qi

a

_{, Yingjie Tian}

a,∗

_{, Yong Shi}

a,∗

_{, Xiaodan Yu}

b

a_{Research Center on Fictitious Economy}_&_{Data Science, Chinese Academy of Sciences, Beijing 100190, China} b_{College of Information Science}_&_{Technology, University of Nebraska at Omaha, Omaha, NE 68182, USA.}

Abstract

Cost-sensitive learning has been a hot research topic in machine learning. Many cost-sensitive methods have been successfully applied in many real-world applications such as disease diagnosis, fraud detection and business decision making. In this paper, we proposed a new Cost-Sensitive Laplacian Support Vector Machine(called Cos-LapSVM) , which can deal with the cost-sensitive problem in Semi-Supervised Learning. The eﬀectiveness of the proposed method is demonstrated via experiments on UCI datasets.

Keywords: SVM, Cost-Sensitive, Semi-Supervised Learning

1. Introduction

Semi-Supervised Learning (SSL) [1, 2, 3] has attracted an increasing amount of interests in machine learning [4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]. One main reason is that the labeled examples are always rare but there are large amount of unlabeled examples available in many practical problems. Several novel approaches for making use of the unlabeled data to improve the performance of classifiers have been proposed. Graph based methods are very important branch, where nodes in the graph are the labeled and unlabeled points, and weighted edges reflect the similarities of nodes. The initially assumption of these methods is that all points are located in a low dimensional manifold, and the graph is used to approximate the underlying manifold. Neighboring point pairs connected by large weight edges tend to have the same labels and vice versa. By the means, the labels associated with data can be propagated throughout the graph. By using the graph Laplacian, [16] proposed a novel Laplacian Support Vector Machine (LapSVM). Unlike other methods based on graph [17, 18, 19], LapSVM is a natural out-of-sample extension, which can classify data that becomes available after the training process, without having to retrain the classifier or resort to various heuristics [16]. A lot of experiments show that LapSVM achieves state of the art performance in semi-supervised classification [20].

In many real-world applications, different misclassifications often have different costs, such as disease di-agnosis, fraud detection, business decision making [21, 22] and object recognition [23, 24] and so on. These classification problem is usually called cost-sensitive learning problem [25, 26, 27, 28, 29, 30], which aimed to minimize the total misclassification costs.

∗_{Corresponding author}

Email addresses:[email protected](Zhiquan Qi),[email protected](Yingjie Tian) Open access under CC BY-NC-ND license.

(2)

In this paper, we consider the problem of how to extend LapSVM algorithms to solve the cost-sensitive learn-ing problem and proposed a new algorithm: Cos-LapSVM.

The remaining parts of the paper are organized as follows. Section 2 describes the detail of Cos-LapSVM; All public datasets experiment results on real data sets are shown in the section 3; The last section gives the conclusions.

2. Cost-Sensitive Laplacian Support Vector Machine(Cos-LapSVM) 2.1. Laplacian Semi-supervised Learning Framework

Regularization [31] is a key technology for obtaining smooth decision functions and thus avoiding over-ﬁtting to the training data, which is widely used in machine learning [32, 16]. Recently, the regularization framework has been recently extended in SSL ﬁeld by [16] as follows.

Given a set of labeled data

T ={(x1,y1),· · ·,(xl,yl)} ∈(n× Y)l, (1)

wherexi∈ n,yi∈ Y={1,−1},i=1,· · ·,l, and a set of unlabeled data

(xl+1,· · ·,xl+u), (2)

wherexi ∈ n. For a kernel functionK(·,·) , which associated with a reproducing kernel Hilbert spaceHk, the

decision function can be obtained by minimizing f∗=arg min f∈Hk l i=1 V(xi,yi,f)+γHf2H+γMf2M, (3)

where f is a unknown decision function,Vrepresents some loss function on the labeled data,γ_H is the weight off2

H and controls the complexity of f in the reproducing kernel Hilbert space. γMis the weight off2Mand

controls the complexity of the function in the intrinsic geometry of marginal distribution,f2

Mis able to penalize

f along the Riemann manifoldM. 2.2. Cos-LapSVM

Similar to the LapSVM, we use the decision function

f(x)=(w·Φ(x))+b, (4)

whereΦis a nonlinear mapping from a low dimensional space to a higher dimensional Hilbert spaceH. According to Hilbert space theory([33]),wcan be expressed asw=_il+₌u₁αiΦ(xi). So, the decision function is written as

f(x)=

l+u

i=1

αiK(xi,x)+b, (5)

whereKis an chosen kernel function: K(xi·xj) = (Φ(xi)·Φ(xj)). Also, the regularization termf2_H can be

expressed as

f2_H =w2=(Φα)(Φα)=αKα. (6)

For the manifold regularizationf₊2

M, a data adjacency graphW(l+u)×(l+u) is deﬁned by nodesWi,j, which

repre-sents the similarity of every pair of input samples. The weight matrixWmay be deﬁned byknearest neighbor or graph kernels as follows([16]):

Wi j=

exp(−xi−xj2₂/2σ2), ifxi,xjare neighbor;

(3)

wherexi−xj2₂denotes the Euclidean norm inn. So the manifold regularization is deﬁned by f2 M= 1 (l+u)2 l+u i,j=1 Wi,j(f(xi)−f(xj))2= fL f, (8)

whereL = D−W is the graph Laplacian, D is a diagonal matrix with itsi-th diagonalDii = lj+=u1Wi j, and f =[f(x1),· · ·,f(xl+u)]=Kα, here we drop the bias termb. According to the cost-sensitive learning method of

[30], the primal problem of Cos-LapSVM can be written as min ξ∈Rl_,α∈_Rn[λ1 {i|yi=1} ξi+λ2 {i|yi=−1} ξi]+γHαKα +γMαKLKα, s.t. l+u j=1 αiK(xi,xj)+b≥1−ξi, yi=1, l+u j=1 αiK(xi,xj)+b≤ −λ3+ξi, yi=−1, ξi≥0, i=1,· · ·,l, (9) where λ1=C1, λ2 =2C−1−1, λ3= 1 2C₋1−1, (10) C1andC−1are the penalty factors for the positive and negative class.

By introducing its Lagrange function L(ξ, α,b, β, ρ, η)=[λ1 {i|yi=1} ξi+λ2 {i|yi=−1} ξi]+γHαKα +γMαKLKα− {i|yi=1} βi( l+u j=1 αiK(xi,xj) +b−1+ξi)+ {i|yi=−1} ρi( l+u j=1 αiK(xi,xj)+ b+λ3−ξi)− l i=1 ηiξi, (11)

whereβi, ρi, ηi∈Rare the Lagrange multipliers, therefore the dual problem of (11) can be formulated as

max

ξ,α,b,β,ρ,ηL(ξ, α,b, β, ρ, η), (12)

s.t.∇_ξ,α,bL(ξ, α,b, β, ρ, η)=0, (13)

βi, ρi, ηi≥0. (14)

From equation (13) we get

∇bL= {i|yi=1} βi− {i|yi=−1} ρi=0, (15) ∇ξiL=λ1−βi−ηi=0, yi=1, (16) ∇ξiL=λ2−ρi−ηi=0, yi=−1, (17) ∇αL=(2γHK+2γMKLK)α−K J1β +K J₂ρ =0, (18)

(4)

whereJ1=[E1 0] is al×(l+u) matrix withE1asl×ldiagonal matrix ((E1)ii=1, ifyi=1, otherwise, (E1)ii=0);

similarly,J2 =[E2 0] is al×(l+u) matrix withE2asl×ldiagonal matrix ((E2)ii =−1, ifyi =−1, otherwise,

(E1)ii=0);β ∈Rl_{, if}_y i=−1,β i=0;η ∈Rl_{, if}_y i=1,η i=0.

Substituting the above equations into problem (13), the dual problem can be expressed as max β_,ρ − 1 2(β J1−ρ J2)K(2γH +2γMKL)−1(J1β −J₂ρ) + l i=1 (β_i+λ3ρ i) s.t. {i|yi=1} βi− {i|yi=−1} ρi=0, 0≤βi≤λ1, yi=1, 0≤ρi≤λ2, yi=−1. (19)

It is not diﬃcult to ﬁnd that (19) will degenerate to the standard LapSVM whenC1 =C−1=1. 3. Experiments

Table 1. Parameters and Samples on UCI datasets

Datasets Dimensions Number of cost(P,N) cost(N,P)

samples Hepatitis 19 155 0.67 0.33 Australian 14 690 0.75 0.25 BUPA liver 6 345 0.8 0.2 CMC 9 844 0.83 0.17 Credit 19 690 0.85 0.15 Diabetis 8 768 0.87 0.13 Flare-Solar 9 1066 0.88 0.12

Table 2. Results on UCI datasets

Datasets Cos-LapSVM LapSVM T S V M

Hepatitis 0.208 0.223 0.254 Australian 0.424 0.476 0.462 BUPA liver 0.389 0.412 0.433 CMC 0.485 0.521 0.553 Credit 0.343 0.432 0.411 Diabetis 0.457 0.460 0.471 Flare-Solar 0.477 0.501 0.535

The performance of the LapSVM was evaluated in UCI datasets (see Table 1). We compare the Cos-LapSVM against Cos-LapSVM and TSVM[34]. Cos-LapSVM and TSVM are cost-blind. We use sampling method [35] to make these two method solve the cost-sensitive problem. Each data set is split into two equal halves, one for training and the other for testing. Each training set contains 20% labeled examples. The RBF kernel is always used. The testing accuracies of all experiments are computed using standard 10-fold cross validation.γ_H, γ_Mand RBF kernel parameterσare all selected from the set{2i_|_i₌₋₇_,_{· · ·}_,₇_}_{by 10-fold cross validation on the tuning}

(5)

to the training set to learn the ﬁnal decision function. We use the Average Cost(AC) to evaluate these algorithms’ performance. The AC can be expressed as

AC= FP×cost(N,P)+FN×cost(P,N)

T P+T N+FP+FN , (20)

where TP is true positive, TN is true negative, FP is false positive and FN is false negative;cost(N,P) is the cost of which negative data is classiﬁed into positive one andcost(P,N) is the cost of which positive data is classiﬁed into negative one. All algorithms are implemented by using MATLAB 2010. The experiment environment: Intel Core i7-2600 CPU, 4 GB memory.

From Table 2, we ﬁnd that Cos-LapSVM signiﬁcantly outperforms the cost-sensitive extensions of LapSVM and TSVM. LapSVM has a better performance than TSVM in most cases. The main reason is that our cost-sensitive method roots in the method of [30], which is derived as the minimizer of the associated risk and can avoid the shortcomings of previous approaches to cost-sensitive SVM design.

4. Conclusion

In this paper, we proposed a new Cost-Sensitive Laplacian Support Vector Machine(called Cos-LapSVM). All experiments in datasets show that the performance of the Cos-LapSVM is better than that of the LapSVM and the TSVM. However, since Cos-LapSVM needs to computing an extra inverse matrix before solving the quadratic programming, so in the future work, we will try to solve the Cos-LapSVM in its primal formulation.

5. Acknowledgment

This work has been partially supported by grants from National Natural Science Foundation of China( NO.70921061, NO.11271361, No. 61202226), the CAS/SAFEA International Partnership Program for Creative Research Teams, Major International( Ragional) Joint Research Project(NO.71110107026), the President Fund of GUCAS. References

[1] M. Seeger, Learning with labeled and unlabeled data, Tech. rep. (2001).

[2] O. Chapelle, B. Sch¨olkopf, A. Zien (Eds.), Semi-Supervised Learning (Adaptive Computation and Machine Learning), The MIT Press, 2006.

[3] X. Zhu, Semi-supervised learning literature survey (2006).

[4] Y. Tian, Y. Shi, X. Liu, Recent advances on support vector machines research, Technological and Economic Development of Economy 18(1) (2012) 5–33.

[5] Z. Qi, Y. Tian, S. Yong, Twin support vector machine with universum data, Neural Networks 36C (2012) 112–119. doi:doi:

10.1016/j.neunet.2012.09.004.

[6] Z. Qi, Y. Tian, S. Yong, Robust twin support vector machine for pattern classiﬁcation, Pattern Recognition 46(1) (2013) 305–316. doi:doi:10.1016/j.patcog.2012.06.019.

[7] Z. Qi, Y. Tian, Y. Shi, Structural twin support vector machine for classiﬁcation, Knowledge-Based Systems.

[8] C. Zhang, Y. Shao, Y. Tan, N. Deng, Mixed-norm linear support vector machine, Neural Comput & Applicdoi:10.1007/ s00521-012-1166-0.

[9] N. C.Zhang, Y.Tian, The new interpretation of support vector machines on statistical learning theory, Science China Mathematics 53(1) (2010) 151–164.

[10] Y. Shao, C. Zhang, X. Wang, N. Deng, Improvements on twin support vector machines, Neural Networks, IEEE Transactions on 22 (6) (2011) 962–968.

[11] Y. Li, Y. Shao, N. Deng, Improved prediction of palmitoylation sites using pwms and svm, Protein and Peptide Letters 18 (2) (2011) 186.

[12] Z. Qi, Y. Tian, S. Yong, Regularized multiple criteria linear programming via linear programming, in: Procedia Computer Science, Vol. 9, 2012, pp. 1234–1239.

[13] Z. Qi, Y. Tian, S. Yong, Regular multiple criteria linear programming for semi-supervised classiﬁcation, in: ICDM(OEDM), 2012. [14] Z. Qi, Y. Tian, S. Yong, Multi-instance classiﬁcation based on regularized multiple criteria linear programming, Neural Computing &

Applicationsdoi:10.1007/s00521-012-1008-0.

[15] Z. Qi, S. Yong, Structural regular multiple criteria linear programming for classiﬁcation problem, International Journal of Computers, Communications & Control 7(4) (2012) 732–742. doi:doi: http://dx.doi.org/10.1016/j.neunet. 2012.07.011.

[16] M. Belkin, P. Niyogi, V. Sindhwani, Manifold Regularization: A Geometric Framework for Learning from Labeled and Unlabeled Examples, Journal of Machine Learning Research 7 (2006) 2399–2434.

(6)

[17] T. Joachims, Transductive learning via spectral graph partitioning, in: In ICML, 2003, pp. 290–297.

[18] M. Belkin, P. Niyogi, Using Manifold Structure for Partially Labelled Classiﬁcation, in: NIPS, 2002, pp. 953–960.

[19] X. Zhu, Z. Ghahramani, J. La_ﬀerty, Semi-supervised learning using gaussian ﬁelds and harmonic functions, in: IN ICML, 2003, pp. 912–919.

[20] Z. Qi, Y. Tian, S. Yong, Laplacian twin support vector machine for semi-supervised classiﬁcation, Neural Networks 35 (2012) 46–53. doi:doi: http://dx.doi.org/10.1016/j.neunet. 2012.07.011.

[21] X. Yu, D. Khazanchi, An exploratory study of the impact of it capabilities adaptation on shared mental models similarity, in: MWAIS 2011, 2011.

[22] X. Yu, D. Owens, D. Khazanchi, Building socioemotional environments in metaverses for virtual teams in healthcare: A conceptual exploration, in: Health Information Science, Springer, 2012, pp. 4–12.

[23] Z. Qi, Y. Xu, L. Wang, Y. Song, Online multiple instance boosting for object detection, Neurocomputing 74 (10) (2011) 1769–1775. doi:10.1016/j.neucom.2011.02.011.

URLhttp://dx.doi.org/10.1016/j.neucom.2011.02.011

[24] Z. Qi, Y. Tian, S. Yong, Eﬃcient railway tracks detection and turnouts recognition method using hog features, Neural Computing & Applicationsdoi:10.1007/s00521-012-0846-0.

[25] M. Kubat, S. Matwin, Addressing the curse of imbalanced training sets: One-sided selection, in: In Proceedings of the Fourteenth International Conference on Machine Learning, Morgan Kaufmann, 1997, pp. 179–186.

[26] S. ichi Amari, S. Wu, Improving support vector machine classiﬁers by modifying kernel functions, NEURAL NETWORKS 12 (1999) 783–789.

[27] N. V. Chawla, K. W. Bowyer, L. O. Hall, W. P. Kegelmeyer, Smote: Synthetic minority over-sampling technique., J. Artif. Intell. Res. (JAIR) 16 (2002) 321–357.

[28] G. Wu, E. Y. Chang, KBA: Kernel Boundary Alignment Considering Imbalanced Data Distribution, IEEE Transactions on Knowledge and Data Engineering 17 (6) (2005) 786–795. doi:10.1109/TKDE.2005.95.

[29] H. Masnadi-Shirazi, N. Vasconcelos, Cost-Sensitive Boosting, IEEE Transactions on Pattern Analysis and Machine Intelligence 33 (2) (2011) 294–309.

[30] H. Masnadi-Shirazi, N. Vasconcelos, Risk minimization, probability elicitation, and cost-sensitive SVMs, in: J. F¨urnkranz, T. Joachims (Eds.), Proceedings of the 27th International Conference on Machine Learning (ICML-10), Omnipress, Omnipress, Haifa, Israel, 2010, p. 759–766.

URLhttp://www.icml2010.org/papers/376.pdf

[31] A. Tikhonov, Regularization of incorrectly posed problems, Sov. Math. Dokl. 4 (1963) 1624–1627.

[32] T. Evgeniou, M. Pontil, T. Poggio, Regularization Networks and Support Vector Machines, Advances in Computational Mathematics 13 (1) (2000) 1–50.

[33] B. Sch¨olkopf, A. J. Smola, Learning with kernels: support vector machines, regularization, optimization, and beyond, Adaptive compu-tation and machine learning, MIT Press, 2002.

[34] V. N. Vapnik, The Nature of Statistical Learning Theory, Springer New York, 1996.

[35] C. Elkan, The foundations of cost-sensitive learning, in: In Proceedings of the Seventeenth International Joint Conference on Artiﬁcial Intelligence, 2001, pp. 973–978.