High-Dimensional Data Classification Based on Smooth Support Vector Machines


Procedia Computer Science 72 ( 2015 ) 477 – 484

1877-0509 © 2015 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

Peer-review under responsibility of organizing committee of Information Systems International Conference (ISICO2015) doi: 10.1016/j.procs.2015.12.129


The Third Information Systems International Conference

High-Dimensional Data Classification Based on Smooth Support Vector Machines

Santi Wulan Purnami^a,*, Shofi Andari^a, Yuniati Dian Pertiwi^a

^a Department of Statistics, Institut Teknologi Sepuluh Nopember, ITS Campus, Sukolilo, Surabaya 60111

Abstract

Classification of high-dimensional data arises in many statistical and data mining studies. The support vector machine (SVM) is a data mining technique that has been studied extensively and has shown remarkable success in many applications. Many researchers have developed SVM variants to improve its performance, such as the smooth support vector machine (SSVM). In this study, variants of SSVM (spline SSVM and piecewise polynomial SSVM) are proposed for high-dimensional classification. Theoretical results demonstrate that the piecewise polynomial SSVM has better approximation performance than the other smoothing functions. Numerical comparison results show that the piecewise polynomial SSVM performs slightly better than the spline SSVM.

© 2015 Published by Elsevier Ltd. Selection and/or peer-review under responsibility of the scientific committee of The Third Information Systems International Conference (ISICO 2015)

Keywords: high-dimensional data, classification, SSVM, polynomial, spline

1. Introduction

In pattern recognition problems, including classification and clustering, choosing the proper algorithm is the key to obtaining the best classifier and therefore the best performance in the analysis. However, people are prone to make mistakes during analysis, particularly when trying to establish relationships between multiple features [1]. One should pay attention to how a given algorithm behaves on particular data. Machine learning can often be applied successfully to these problems.

One of the challenging aspects of supervised learning is dealing with high-dimensional data. The term refers to any dataset with a large number of features, regardless of whether the data are metric or nonmetric. This kind of data is becoming a prominent issue in many fields along with developments in database and data engineering. Machine learning is still applied successfully to classify high-dimensional datasets. In healthcare, for instance, quantitative measures are now at the forefront of solving classification problems, especially in detecting and identifying particular diseases from microarray data. Identifying a disease in a patient is challenging because it is highly subjective and relies on the physician's expertise [2].

* Corresponding author. Tel.: +62-31-594-3352; fax: +62-31-592-2940.

E-mail address: [email protected].

Support vector machines (SVMs) [3], like other learning machines such as neural networks and logistic regression, have been developed in many ways to deal with supervised learning or classification problems. The algorithm is well known for its high prediction accuracy, which is attributed to the global solution of its optimization problem. However, the computing time on larger datasets remains the main problem. For this reason, many variants of the machine have been developed around the world. The smooth support vector machine (SSVM) [4] was proposed by Lee and Mangasarian as a classifier that uses a smoothing method to reformulate the SVM. Because the objective function is not twice differentiable, a smoothing function is inserted to smooth the unconstrained optimization problem. Since then, much research has focused on improving the smoothing function, e.g., the quadratic polynomial function [11], the fourth-order polynomial function [11], the spline function [8], and the piecewise polynomial function [7,10]. All four functions have been compared in [5]. It was reported that SSVM with the piecewise polynomial function (PPSSVM-1) proposed by Luo et al. [7] provided the best performance in classifying breast cancer diagnostics.

In [9], Purnami et al. tested the performance of PPSSVM-1 and compared it with the SSVM based on another piecewise polynomial function proposed by Wu and Wang [10], from now on called PPSSVM-2. The comparison showed that PPSSVM-2 performed better on most simulated datasets as well as in classifying cervical cancer diagnostics [9]. The data generation scheme of the previous study [9], which was designed to accommodate a high-dimensional scenario, is adopted in this study with more challenging dataset sizes. The datasets generated for this study focus on large numbers of features. The machines used in this study are SSVM with the original smoothing function, SSVM with the spline function (TSSVM), and PPSSVM-2. Finally, the three methods are compared in terms of binary classification accuracy.

2. Literature Review

2.1. Smooth Support Vector Machine (SSVM)

SSVM was proposed by Lee and Mangasarian [4]. It starts from the linear case, which can be converted to an unconstrained optimization problem. We consider the problem of classifying $n$ points in the $m$-dimensional real space $\mathbb{R}^m$, represented by the $n \times m$ matrix $A$, according to the membership of each point $A_i$ in the class 1 or -1, as specified by a given $n \times n$ diagonal matrix $D$ with 1 or -1 along its diagonal. For this problem the standard SVM is given by the following quadratic program:

$$\min_{(w,\gamma,y)\in\mathbb{R}^{m+1+n}} \; \nu e'y + \frac{1}{2} w'w \quad \text{s.t.} \quad D(Aw - e\gamma) + y \ge e, \quad y \ge 0 \tag{1}$$

where $\nu$ is a positive weight, $y$ is the slack variable, $e$ is a column vector of ones of appropriate dimension, $w$ is the normal vector of size $m \times 1$, and $\gamma$ is the bias value that determines the relative location of the hyperplane.


In the SSVM approach, the modified SVM problem is as follows:

$$\min_{(w,\gamma,y)\in\mathbb{R}^{m+1+n}} \; \frac{\nu}{2} y'y + \frac{1}{2}\left(w'w + \gamma^2\right) \quad \text{s.t.} \quad D(Aw - e\gamma) + y \ge e, \quad y \ge 0 \tag{2}$$

The constraint in (2) can be written as

$$y = \left(e - D(Aw - e\gamma)\right)_+ \tag{3}$$

Thus, we can replace y in constraint (2) by (3) and convert the SVM problem (2) into an equivalent SVM which is an unconstrained optimization problem as follows:

$$\min_{(w,\gamma)} \; \frac{\nu}{2} \left\| \left(e - D(Aw - e\gamma)\right)_+ \right\|_2^2 + \frac{1}{2}\left(w'w + \gamma^2\right) \tag{4}$$

The objective function in (4) is not twice differentiable, so conventional optimization methods that require the Hessian matrix cannot be applied to it directly. Lee and Mangasarian [4] applied a smoothing technique and replaced the plus function $x_+$ by the integral of the sigmoid function:

$$p(x,\alpha) = x + \frac{1}{\alpha}\log\left(1 + \exp(-\alpha x)\right), \quad \alpha > 0 \tag{5}$$

This p function with a smoothing parameter α is used here to replace the plus function of (4) to obtain a smooth support vector machine (SSVM):

$$\min_{(w,\gamma)\in\mathbb{R}^{m+1}} \; \frac{\nu}{2} \left\| p\left(e - D(Aw - e\gamma), \alpha\right) \right\|_2^2 + \frac{1}{2}\left(w'w + \gamma^2\right) \tag{6}$$
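The smoothing step in (5) can be illustrated with a short sketch. The function names `plus` and `smooth_plus` are hypothetical and chosen only for this example; the code simply evaluates $p(x,\alpha)$ in a numerically stable way and shows that the gap to the plus function shrinks as $\alpha$ grows.

```python
import math


def plus(x):
    """Plus function x_+ = max(x, 0)."""
    return max(x, 0.0)


def smooth_plus(x, alpha):
    """p(x, alpha) = x + (1/alpha) * log(1 + exp(-alpha * x)), as in (5).

    For alpha * x < 0 the algebraically equivalent form
    log(1 + exp(alpha * x)) / alpha is used to avoid overflow.
    """
    z = alpha * x
    if z >= 0:
        return x + math.log1p(math.exp(-z)) / alpha
    return math.log1p(math.exp(z)) / alpha


if __name__ == "__main__":
    grid = [i / 100.0 for i in range(-300, 301)]  # x in [-3, 3]
    for alpha in (1.0, 5.0, 50.0):
        gap = max(smooth_plus(x, alpha) - plus(x) for x in grid)
        print(f"alpha = {alpha:5.1f}   max gap to plus function = {gap:.5f}")
```

The largest gap occurs at $x = 0$, where $p(0,\alpha) = \log 2 / \alpha$, so a larger smoothing parameter gives a closer approximation.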

The optimization problem for the nonlinear SSVM is as follows:

$$\min_{(u,\gamma)} \; \frac{\nu}{2} \left\| p\left(e - D(K(A,A')Du - e\gamma), \alpha\right) \right\|_2^2 + \frac{1}{2}\left(u'u + \gamma^2\right) \tag{7}$$

where $K(A, A')$ is a kernel function from $\mathbb{R}^{n \times m} \times \mathbb{R}^{m \times n}$ to $\mathbb{R}^{n \times n}$.
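As a rough illustration of how the nonlinear objective (7) is evaluated, the sketch below computes it for a tiny dataset. The paper does not state which kernel is used, so the Gaussian kernel here is an assumption, as are the function names and the parameter `mu` (kernel width). The labels `d` hold the diagonal of $D$.

```python
import math


def smooth_plus(x, alpha):
    """Integral of the sigmoid function, p(x, alpha) from (5)."""
    z = alpha * x
    if z >= 0:
        return x + math.log1p(math.exp(-z)) / alpha
    return math.log1p(math.exp(z)) / alpha


def gaussian_kernel(A, B, mu):
    """Assumed kernel: K[i][j] = exp(-mu * ||A_i - B_j||^2) over rows of A and B."""
    return [[math.exp(-mu * sum((a - b) ** 2 for a, b in zip(row_a, row_b)))
             for row_b in B] for row_a in A]


def nonlinear_ssvm_objective(u, gamma, A, d, nu, alpha, mu):
    """nu/2 * ||p(e - D(K(A, A')Du - e*gamma), alpha)||^2 + (u'u + gamma^2)/2."""
    K = gaussian_kernel(A, A, mu)  # rows of A play the role of columns of A'
    n = len(A)
    loss = 0.0
    for i in range(n):
        # i-th entry of K(A, A') D u, with D = diag(d)
        score = sum(K[i][j] * d[j] * u[j] for j in range(n))
        residual = 1.0 - d[i] * (score - gamma)
        loss += smooth_plus(residual, alpha) ** 2
    return 0.5 * nu * loss + 0.5 * (sum(v * v for v in u) + gamma ** 2)


if __name__ == "__main__":
    A = [[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.2]]  # n = 4 points, m = 2
    d = [-1, -1, 1, 1]
    for u in ([0.0] * 4, [1.0] * 4):
        value = nonlinear_ssvm_objective(u, 0.0, A, d, nu=1.0, alpha=5.0, mu=1.0)
        print(f"u = {u}  objective = {value:.4f}")
```

Because the objective is smooth in $(u, \gamma)$, it can be handed to a gradient-based or Newton-type solver; no solver is included here.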

2.2. Spline Smooth Support Vector Machine (TSSVM)

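The spline SSVM was proposed by Yuan et al. [8], who use a new smooth function to replace $x_+$ in optimization problem (7). It is the following three-order spline function:

$$T(x,k) = \begin{cases} 0, & \text{if } x < -\frac{1}{k}, \\ \frac{k^2}{6}x^3 + \frac{k}{2}x^2 + \frac{1}{2}x + \frac{1}{6k}, & \text{if } -\frac{1}{k} \le x < 0, \\ -\frac{k^2}{6}x^3 + \frac{k}{2}x^2 + \frac{1}{2}x + \frac{1}{6k}, & \text{if } 0 \le x \le \frac{1}{k}, \\ x, & \text{if } x > \frac{1}{k} \end{cases} \tag{8}$$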
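A minimal sketch of the spline function in (8) is given below, assuming the four branches as written there. The function name `spline_plus` is hypothetical. The small driver checks the values at the knots $x = -1/k$, $x = 0$ and $x = 1/k$.

```python
def spline_plus(x, k):
    """Three-order spline approximation T(x, k) of the plus function, as in (8)."""
    if x < -1.0 / k:
        return 0.0
    if x < 0.0:
        return (k * k / 6.0) * x ** 3 + (k / 2.0) * x ** 2 + 0.5 * x + 1.0 / (6.0 * k)
    if x <= 1.0 / k:
        return -(k * k / 6.0) * x ** 3 + (k / 2.0) * x ** 2 + 0.5 * x + 1.0 / (6.0 * k)
    return x


if __name__ == "__main__":
    k = 2.0
    for x in (-1.0 / k, 0.0, 1.0 / k):
        print(f"T({x:+.2f}, {k}) = {spline_plus(x, k):.6f}")
```

The function joins the zero branch at $x = -1/k$ and the identity branch at $x = 1/k$, and it takes the value $1/(6k)$ at $x = 0$.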

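If we replace the plus function in problem (7) by the spline function (8), a new smooth SVM model is obtained as follows:

$$\min_{(u,\gamma)} \; \frac{\nu}{2} \left\| T\left(e - D(K(A,A')Du - e\gamma), k\right) \right\|_2^2 + \frac{1}{2}\left(u'u + \gamma^2\right) \tag{9}$$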

2.3. Piecewise Polynomial Smooth Support Vector Machine

Many researchers have proposed variants of the smoothing function in SSVM to increase its performance. Luo [7] proposed a piecewise polynomial function to approximate the plus function. The formulation of this function is as follows:

3 2 3 2 1 1 , 1 5 1 1 ( , k) (k 1) 3 , 32 1 0, i x x k f x x k x kx x x k k k x k ­ t ° ° ° § · ® ¨© ¸¹ ° ° ° d ¯ (10)

Wu and Wang [10] then proposed a piecewise polynomial function with a different formulation, as follows:

$$f_2(x,k) = \begin{cases} 0, & x < -\frac{1}{3k}, \\ \frac{3k^2}{2}\left(x + \frac{1}{3k}\right)^3, & -\frac{1}{3k} \le x < 0, \\ x + \frac{3k^2}{2}\left(\frac{1}{3k} - x\right)^3, & 0 \le x \le \frac{1}{3k}, \\ x, & x > \frac{1}{3k} \end{cases} \tag{11}$$
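The PP-2 function can be sketched in a few lines. The branch coefficients used here are an assumption: they follow a reading of (11) in which the two cubic pieces meet at $x = 0$ with matching value, first and second derivative. The function name `pp2_plus` is hypothetical.

```python
def pp2_plus(x, k):
    """Piecewise polynomial approximation f_2(x, k) of the plus function (PP-2)."""
    a = 1.0 / (3.0 * k)  # knots are at -a, 0 and a
    if x < -a:
        return 0.0
    if x < 0.0:
        return 1.5 * k * k * (x + a) ** 3
    if x <= a:
        return x + 1.5 * k * k * (a - x) ** 3
    return x


if __name__ == "__main__":
    k = 2.0
    a = 1.0 / (3.0 * k)
    for x in (-a, 0.0, a):
        print(f"f2({x:+.4f}, {k}) = {pp2_plus(x, k):.6f}")
```

Under this reading the function equals $1/(18k)$ at $x = 0$ and coincides with the plus function outside $[-1/(3k), 1/(3k)]$, a narrower transition interval than the spline in (8).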

The piecewise polynomial function proposed by Luo [7] is called PP-1, and the piecewise polynomial function proposed by Wu and Wang [10] is called PP-2. If we replace the plus function in optimization problem (7) by PP-1, the SSVM based on PP-1, called PPSSVM-1, is obtained as follows:

$$\min_{(u,\gamma)\in\mathbb{R}^{n+1}} \; \frac{\nu}{2} \left\| f_1\left(e - D(K(A,A')Du - e\gamma), k\right) \right\|_2^2 + \frac{1}{2}\left(u'u + \gamma^2\right) \tag{12}$$

Likewise, if PP-2 replaces the plus function in optimization problem (7), PPSSVM-2 is obtained as follows:

$$\min_{(u,\gamma)\in\mathbb{R}^{n+1}} \; \frac{\nu}{2} \left\| f_2\left(e - D(K(A,A')Du - e\gamma), k\right) \right\|_2^2 + \frac{1}{2}\left(u'u + \gamma^2\right) \tag{13}$$

Purnami et al. [9] compared PPSSVM-1 and PPSSVM-2 both theoretically and numerically. The theoretical comparison shows that PP-2 approximates the plus function better than PP-1; the proof can be found in [9]. The numerical comparison used several combinations of the number of observations and the number of features. PPSSVM-2 performed better than PPSSVM-1 in accuracy, sensitivity, specificity, and computation time.

In this study, SSVM, TSSVM and PPSSVM-2 are compared for the analysis of high-dimensional data. The comparison is carried out both theoretically and through numerical experiments.


3. Comparison of Three SSVM on High Dimensional Data

3.1. Theoretical Comparison of Three SSVM

The comparison of the three SSVMs (SSVM, TSSVM, and PPSSVM-2) is based on the following lemmas:

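Lemma 1: Let $p(x, k)$ be defined as the integral of the sigmoid function (5) and let $x_+$ be the plus function. For $x \in \mathbb{R}$ with $|x| < \rho$ and $k > 0$:

$$p(x,k)^2 - x_+^2 \le \left(\frac{\log 2}{k}\right)^2 + \frac{2\rho}{k}\log 2$$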

The proof of Lemma 1 is as in [4].

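Lemma 2: Let $\rho = \frac{1}{k}$. Applying Lemma 1 to the integral of the sigmoid function (5), we obtain:

$$p(x,k)^2 - x_+^2 \le \left(\frac{\log 2}{k}\right)^2 + \frac{2\rho}{k}\log 2 = \frac{1}{k^2}\left((\log 2)^2 + 2\log 2\right) \approx \frac{0.6927}{k^2}$$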

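Lemma 3: Let $\Omega \subset \mathbb{R}$ and let $T(x, k)$ be defined as in (8). Then the following results are easily obtained.

(i) $T(x,k) \in C^2(\Omega)$, $x \in \Omega$; that is, $T(x,k)$ satisfies the following equalities at the points $x = \pm\frac{1}{k}$ and $x = 0$:

$$\begin{cases} T\left(-\frac{1}{k},k\right) = 0, & \lim_{x\to 0^-} T(x,k) = \lim_{x\to 0^+} T(x,k), & T\left(\frac{1}{k},k\right) = \frac{1}{k}, \\ T'\left(-\frac{1}{k},k\right) = 0, & \lim_{x\to 0^-} T'(x,k) = \lim_{x\to 0^+} T'(x,k), & T'\left(\frac{1}{k},k\right) = 1, \\ T''\left(-\frac{1}{k},k\right) = 0, & \lim_{x\to 0^-} T''(x,k) = \lim_{x\to 0^+} T''(x,k), & T''\left(\frac{1}{k},k\right) = 0. \end{cases}$$

(ii) $T(x,k) \ge x_+$ for all $x \in \Omega$;

(iii) for all $x \in \Omega$ and $k \in \mathbb{R}^+$:

$$T(x,k)^2 - x_+^2 \le \frac{1}{24k^2}$$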

Lemma 4: The function PP-2 as defined in (11) has the following properties:

(i) $\forall x \in \mathbb{R}$: $f_2(x,k) \ge x_+$;

(ii) $\forall x \in \mathbb{R}$:

$$f_2(x,k)^2 - x_+^2 \le \frac{1}{216k^2}$$

The proof for this lemma is as written in [9].

From Lemmas 1 to 4, the following performance comparison of the smoothing functions is obtained.

Theorem 1. Let $\rho = \frac{1}{k}$.

(i) If the integral of the sigmoid function is defined as in (5), then by Lemma 2,

$$p(x,k)^2 - x_+^2 \le \left(\frac{\log 2}{k}\right)^2 + \frac{2\rho}{k}\log 2 = \frac{1}{k^2}\left((\log 2)^2 + 2\log 2\right) \approx 0.6927\,\frac{1}{k^2}$$

(ii) If the three-order spline function is defined as in (8), then by Lemma 3,

$$T(x,k)^2 - x_+^2 \le \frac{1}{24k^2} \approx 0.0417\,\frac{1}{k^2}$$

(iii) If the piecewise polynomial function (PP-2) is defined as in (11), then by Lemma 4,

$$f_2(x,k)^2 - x_+^2 \le \frac{1}{216k^2} \approx 0.00463\,\frac{1}{k^2}$$

From Theorem 1, it can be concluded that the piecewise polynomial smooth function (PP-2) has the smallest bound on the difference between the squared smooth function and the squared plus function. In other words, PP-2 approximates the plus function better than the other two functions.
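The ordering in Theorem 1 can be checked numerically with a short sketch. The code below evaluates the largest value of (smooth function)² minus (plus function)² on a grid for $k = 1$. The function names, the grid, and the PP-2 branch coefficients (taken from a reading of (11) with $C^2$ matching at $x = 0$) are assumptions of this example, and a grid search only approximates the true maximum. One further observation, offered with caution: the constant 0.6927 matches $(\log 2)^2 + 2\log 2$ evaluated with base-10 logarithms, whereas with the natural logarithm used in (5) the same expression is about 1.8667, so the code compares the sigmoid integral against the natural-logarithm value.

```python
import math


def plus(x):
    return max(x, 0.0)


def sigmoid_integral(x, k):
    """p(x, k) from (5), with the smoothing parameter written as k."""
    z = k * x
    if z >= 0:
        return x + math.log1p(math.exp(-z)) / k
    return math.log1p(math.exp(z)) / k


def spline(x, k):
    """T(x, k) from (8)."""
    if x < -1.0 / k:
        return 0.0
    sign = 1.0 if x < 0.0 else -1.0
    if x <= 1.0 / k:
        return sign * (k * k / 6.0) * x ** 3 + (k / 2.0) * x ** 2 + 0.5 * x + 1.0 / (6.0 * k)
    return x


def pp2(x, k):
    """f_2(x, k), assumed reading of (11)."""
    a = 1.0 / (3.0 * k)
    if x < -a:
        return 0.0
    if x < 0.0:
        return 1.5 * k * k * (x + a) ** 3
    if x <= a:
        return x + 1.5 * k * k * (a - x) ** 3
    return x


def max_squared_gap(smooth, k, half_width=3.0, steps=6000):
    """Largest value of smooth(x, k)^2 - plus(x)^2 on a uniform grid."""
    best = 0.0
    for i in range(steps + 1):
        x = -half_width + 2.0 * half_width * i / steps
        best = max(best, smooth(x, k) ** 2 - plus(x) ** 2)
    return best


if __name__ == "__main__":
    k = 1.0
    ln2 = math.log(2.0)
    cases = (
        ("sigmoid integral", sigmoid_integral, ln2 ** 2 + 2.0 * ln2),
        ("spline", spline, 1.0 / 24.0),
        ("PP-2", pp2, 1.0 / 216.0),
    )
    for name, fn, bound in cases:
        print(f"{name:17s} grid max = {max_squared_gap(fn, k):.5f}   bound = {bound:.5f}")
```

On this grid the three maxima keep the order stated in Theorem 1, with PP-2 smallest. Since each smooth function scales as $s(x,k) = s(kx,1)/k$, the gaps for other values of $k$ follow by dividing by $k^2$.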

3.2. Numerical Comparison of Three SSVM

The performance of the three SSVMs is evaluated numerically. We generated high-dimensional data with varying numbers of observations n and variables m. Accuracy and processing time were measured using 10-fold testing. All experiments were run in Matlab R2011b (32-bit) on a PC with an Intel Core i7 3.20 GHz processor, 9 GB of RAM, and a 64-bit operating system. The results of the experiments on high-dimensional data using SSVM, TSSVM and PPSSVM2 are presented as follows.

Table 1. Accuracy (%) of SSVM, TSSVM and PPSSVM2

Data (n, m)    SSVM     TSSVM    PPSSVM2
50, 10         92.50    97.50    97.50
50, 50         75.00    67.50    75.00
50, 100        62.50    65.00    67.50
50, 500        52.50    42.50    52.50
100, 10        93.33    94.44    94.44
100, 50        87.78    84.44    90.00
100, 100       75.00    76.00    78.00
100, 500       70.00    68.89    70.00


Table 2. Processing time (sec) of SSVM, TSSVM and PPSSVM2

Data (n, m)    SSVM      TSSVM     PPSSVM2
50, 10         1.6692    6.1152    1.2324
50, 50         1.1388    2.8392    1.4267
50, 100        1.2012    2.4804    1.5600
50, 500        1.6692    1.8252    1.7628
100, 10        2.1996    16.302    7.5972
100, 50        2.0124    6.8328    4.3212
100, 100       2.3244    6.9732    4.6644
100, 500       3.1356    4.6488    4.1184

With a small number of features, the three SSVMs all achieved high accuracy. Accuracy decreases as the number of features in the dataset grows. Table 1 shows that PPSSVM2 attains the highest accuracy, or ties for it, on every dataset. On the datasets where two methods share the same accuracy, PPSSVM2 is never worse, so the ties do not change its ranking for high-dimensional data analysis. Although the formulation of PPSSVM2 is more complex, its smoothing function has a better approximation performance than the others (see Theorem 1). The processing times of the three methods do not differ greatly. TSSVM and PPSSVM2 take up to several seconds longer than SSVM on most datasets because their algorithms are more complex. Overall, PPSSVM2 performs slightly better than TSSVM.
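The comparison drawn from Table 1 and Table 2 can be reproduced with a few lines of arithmetic. The sketch below only re-tabulates the reported numbers; the variable and function names are hypothetical, and the averages over the eight datasets are a summary added for illustration, not a statistic reported in the study.

```python
METHODS = ("SSVM", "TSSVM", "PPSSVM2")

# (n, m): accuracies in percent, in the order of METHODS (Table 1)
ACCURACY = {
    (50, 10): (92.50, 97.50, 97.50),
    (50, 50): (75.00, 67.50, 75.00),
    (50, 100): (62.50, 65.00, 67.50),
    (50, 500): (52.50, 42.50, 52.50),
    (100, 10): (93.33, 94.44, 94.44),
    (100, 50): (87.78, 84.44, 90.00),
    (100, 100): (75.00, 76.00, 78.00),
    (100, 500): (70.00, 68.89, 70.00),
}

# (n, m): processing times in seconds, in the order of METHODS (Table 2)
TIME_SEC = {
    (50, 10): (1.6692, 6.1152, 1.2324),
    (50, 50): (1.1388, 2.8392, 1.4267),
    (50, 100): (1.2012, 2.4804, 1.5600),
    (50, 500): (1.6692, 1.8252, 1.7628),
    (100, 10): (2.1996, 16.302, 7.5972),
    (100, 50): (2.0124, 6.8328, 4.3212),
    (100, 100): (2.3244, 6.9732, 4.6644),
    (100, 500): (3.1356, 4.6488, 4.1184),
}


def column_means(table):
    """Average of each method's column over all datasets."""
    rows = list(table.values())
    return {name: sum(row[j] for row in rows) / len(rows)
            for j, name in enumerate(METHODS)}


def best_or_tied_counts(table):
    """Number of datasets on which each method reaches the row maximum."""
    return {name: sum(1 for row in table.values() if row[j] == max(row))
            for j, name in enumerate(METHODS)}


if __name__ == "__main__":
    mean_acc = column_means(ACCURACY)
    mean_time = column_means(TIME_SEC)
    wins = best_or_tied_counts(ACCURACY)
    for name in METHODS:
        print(f"{name:8s} mean accuracy = {mean_acc[name]:.2f}%   "
              f"best or tied on {wins[name]} of {len(ACCURACY)} datasets   "
              f"mean time = {mean_time[name]:.2f} s")
```

Counting ties as wins matches the reading of Table 1 in which PPSSVM2 is never below the other two methods.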

4. Conclusions and Recommendations

The theoretical results demonstrate that the piecewise polynomial smooth function (PP-2) has better approximation performance than the others. Based on the numerical results, the three SSVMs compared in this study show consistent performance in terms of accuracy. The numerical comparison also shows that PPSSVM2 performs slightly better than TSSVM.

References

[1] S. Kotsiantis, "Supervised Machine Learning: A Review of Classification Techniques," Informatica, vol. 31, pp. 249-268, 2007.

[2] Y. Haowen and G. Rumbe, "Comparative Study of Classification Techniques on Breast Cancer," International Journal of Artificial Intelligence and Interactive Multimedia, vol. 1, no. 3, pp. 5-12, 2010.

[3] V. N. Vapnik, The Nature of Statistical Learning Theory, Springer-Verlag, 1995.

[4] Y. J. Lee and O. L. Mangasarian, "SSVM: A Smooth Support Vector Machine for Classification," Computational Optimization and Applications, vol. 20, pp. 5-22, 2001.

[5] S. W. Purnami, E. Abdullah, J. M. Zain and S. P. Rahayu, "A comparison of smoothing functions in smooth support vector machine," International Conference on Software Engineering & Computer Systems, 2009.

[6] Y. J. Lee and O. L. Mangasarian, "A Smooth Support Vector Machine," 2011.

[7] L. Luo, "Study on Piecewise Polynomial Smooth Approximation to the Plus Function," in Proceedings of the ICARCV, 2006.

[8] Y. Yuan, W. Fan and D. Pu, "Spline Function Smooth Support Vector Machine For Classification," Journal of Industrial and Management Optimization, vol. 3, no. 3, pp. 529-542, 2007.

[9] Purnami et al., "... Piecewise Polynomial Smooth Support Vector Machine to Classify Diagnosis of Cervical Cancer," International Journal of Applied Mathematics and Statistics, vol. 53, no. 6, pp. 159-166, 2015.

[10] Q. Wu and W. Wang, "Piecewise-Smooth Support Vector Machine for Classification," Mathematical Problems in Engineering, Hindawi Publishing Corporation, 2013.

[11] Y. Yuan, J. Yan and C. Xu, "Polynomial smooth support vector machine (PSSVM)," Chinese Journal of Computers, vol. 28, pp. 9-17, 2005.
