
Multicategory Classifier

Daewon Lee and Jaewook Lee

Department of Industrial and Management Engineering, Pohang University of Science and Technology,

Pohang, Kyungbuk 790-784, Korea {woosuhan,jaewookl}@postech.ac.kr

Abstract. Support vector machines are primarily designed for binary classification. Multicategory classification problems are typically solved by combining several binary machines. In this paper, we propose a novel classifier that uses only one machine, even for multiclass data sets. The proposed method consists of two phases. The first phase builds a trained kernel radius function via support vector domain decomposition. The second phase constructs a dynamical system corresponding to the trained kernel radius function to decompose the data domain and to assign a class label to each decomposed domain. Numerical results show that our method is robust and efficient for multicategory classification.

1 Introduction

The support vector machine (SVM), rooted in statistical learning theory, has been successfully applied to diverse pattern recognition problems [1], [6]. The main idea of a conventional SVM is to construct an optimal hyperplane that separates 'binary class' data so that the margin is maximal. For multiclass problems, several approaches for multiclass SVMs [4] have been proposed. Most of the previous approaches try to reduce a multiclass problem to a set of multiple binary classification problems to which a conventional SVM can be applied. Those approaches, however, have some drawbacks: they not only generate inaccurate decision boundaries in some regions due to the unbalanced data size for each class, but also suffer from a masking problem, in which some class of data is overwhelmed by others and is consequently ignored in the decision step.

To overcome such difficulties, in this paper we propose a novel, efficient, and robust classifier for multicategory classification. The proposed method consists of two phases. In the first phase, we build a trained kernel radius function via support vector domain decomposition [2], [3]. In the second phase, we construct a dynamical system corresponding to the trained kernel radius function and decompose the data domain into a small number of disjoint regions, where each region, which is itself a basin of attraction of the constructed system, is classified by the class label of the corresponding stable equilibrium point. As a result, we can classify an unknown test point by using the dynamical system and the class information for each decomposed region.

J. Wang, X. Liao, and Z. Yi (Eds.): ISNN 2005, LNCS 3496, pp. 857–862, 2005.


2 The Proposed Method

2.1 Phase I: Building a Trained Kernel Radius Function via Support Vector Domain Decomposition

The basic idea of support vector domain decomposition is to map data points by means of an inner-product kernel to a high-dimensional feature space and to find, not the optimal separating hyperplane, but the smallest sphere that contains most of the mapped data points in the feature space. This sphere, when mapped back to the data space, can decompose a data domain into several regions. The support vector domain decomposition builds a trained kernel radius function as follows [2], [3], [9]: let $\{x_i\} \subset X$ be a given data set of $N$ points, with $X \subset \mathbb{R}^n$, the data space. Using a nonlinear transformation $\Phi$ from $X$ to some high-dimensional feature space, we look for the smallest enclosing sphere of radius $R$ described by the following model:

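$$
\begin{aligned}
\min\ & R^2 \\
\text{s.t.}\ & \|\Phi(x_j) - a\|^2 \le R^2 + \xi_j, \\
& \xi_j \ge 0, \quad \text{for } j = 1, \ldots, N
\end{aligned}
\tag{1}
$$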

where $a$ is the center and $\xi_j$ are slack variables allowing for soft boundaries. To solve this problem, we introduce the Lagrangian

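$$
L = R^2 - \sum_j \left(R^2 + \xi_j - \|\Phi(x_j) - a\|^2\right)\beta_j - \sum_j \xi_j \mu_j + C \sum_j \xi_j,
$$

where $\beta_j \ge 0$ and $\mu_j \ge 0$ are Lagrange multipliers, $C$ is a constant, and $C \sum_j \xi_j$ is a penalty term. The solution of the primal problem (1) can be obtained by solving its dual problem: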

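$$
\begin{aligned}
\max\ & W = \sum_j \Phi(x_j)^2 \beta_j - \sum_{i,j} \beta_i \beta_j\, \Phi(x_i) \cdot \Phi(x_j) \\
\text{subject to}\ & 0 \le \beta_j \le C, \quad \sum_j \beta_j = 1, \quad j = 1, \ldots, N
\end{aligned}
\tag{2}
$$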

Only those points with $0 < \beta_j < C$ lie on the boundary of the sphere and are called support vectors (SVs).
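The step from the Lagrangian to the dual (2) is not spelled out in the text. A standard way to fill it in, sketched here under the assumption that $\beta_j \ge 0$ and $\mu_j \ge 0$ are the multipliers of the two constraint families in (1), is to set the derivatives of $L$ with respect to $R$, $a$, and $\xi_j$ to zero and then apply the complementarity conditions:

```latex
\begin{align*}
\frac{\partial L}{\partial R} = 0 &\;\Rightarrow\; \sum_j \beta_j = 1, \\
\frac{\partial L}{\partial a} = 0 &\;\Rightarrow\; a = \sum_j \beta_j \Phi(x_j), \\
\frac{\partial L}{\partial \xi_j} = 0 &\;\Rightarrow\; \beta_j = C - \mu_j
  \;\Rightarrow\; 0 \le \beta_j \le C, \\
\text{complementarity:}\quad & \xi_j \mu_j = 0, \qquad
  \left(R^2 + \xi_j - \|\Phi(x_j) - a\|^2\right)\beta_j = 0 .
\end{align*}
```

Substituting the first three relations back into $L$ eliminates $R$, $a$, and $\xi_j$ and leaves $W$ as in (2). For $0 < \beta_j < C$ we have $\mu_j > 0$, hence $\xi_j = 0$ and $\|\Phi(x_j) - a\|^2 = R^2$, which is why such points lie exactly on the sphere boundary.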

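The trained kernel radius function, defined by the squared radial distance of the image of $x$ from the sphere center, is then given by

$$
\begin{aligned}
f(x) := R^2(x) &= \|\Phi(x) - a\|^2 \\
&= K(x, x) - 2 \sum_j \beta_j K(x_j, x) + \sum_{i,j} \beta_i \beta_j K(x_i, x_j)
\end{aligned}
\tag{3}
$$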

where the inner products $\Phi(x_i) \cdot \Phi(x_j)$ are replaced by a kernel function $K(x_i, x_j)$.
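As a minimal sketch of how (3) can be evaluated once the coefficients $\beta_j$ are known, the following Python code assumes a Gaussian kernel $K(u, v) = \exp(-q\,\|u - v\|^2)$ with width parameter `q`. The kernel choice, the function names, and the two-point toy data are illustrative assumptions, not taken from the paper.

```python
import math

def gaussian_kernel(u, v, q):
    """Assumed kernel: K(u, v) = exp(-q * ||u - v||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-q * sq_dist)

def kernel_radius(x, data, beta, q):
    """Evaluate f(x) from (3) for given points x_j and coefficients beta_j."""
    self_term = gaussian_kernel(x, x, q)  # equals 1 for the Gaussian kernel
    cross_term = sum(b * gaussian_kernel(xj, x, q) for xj, b in zip(data, beta))
    const_term = sum(
        bi * bj * gaussian_kernel(xi, xj, q)
        for xi, bi in zip(data, beta)
        for xj, bj in zip(data, beta)
    )
    return self_term - 2.0 * cross_term + const_term

# Hypothetical toy data: two points with equal weights, which satisfy
# the constraints of (2) whenever C >= 0.5.
data = [(0.0, 0.0), (2.0, 0.0)]
beta = [0.5, 0.5]
q = 1.0

f_at_point = kernel_radius((0.0, 0.0), data, beta, q)
f_at_midpoint = kernel_radius((1.0, 0.0), data, beta, q)
f_far_away = kernel_radius((10.0, 0.0), data, beta, q)
```

In this toy setting $f$ is smallest at the data points, larger at the midpoint between them, and largest far from the data, which is the kind of landscape the second phase exploits. The last double sum does not depend on $x$, so in practice it could be computed once and cached.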


2.2 Phase II: The Class Label Assignments of the Decomposed Regions

In the second phase, we first construct the following generalized gradient descent process corresponding to the trained kernel radius function (3):

$$
\frac{dx}{dt} = -\operatorname{grad}_G f(x) \equiv -G(x)^{-1} \nabla f(x)
\tag{4}
$$

where $G(x)$ is a positive definite symmetric matrix for all $x \in \mathbb{R}^n$. (Such a $G$ is called a Riemannian metric on $\mathbb{R}^n$.) A state vector $\bar{x}$ satisfying the equation $\nabla f(\bar{x}) = 0$ is called an equilibrium point of (4), and it is called an (asymptotically) stable equilibrium point if all the eigenvalues of its corresponding Jacobian matrix, $J_f(\bar{x}) \equiv \nabla^2 f(\bar{x})$, are positive. The basin of attraction (or stability region) of a stable equilibrium point $\bar{x}$ is defined as the set of all the points converging to $\bar{x}$ when the process (4) is applied, i.e.,

$$
A(\bar{x}) := \{x(0) \in \mathbb{R}^n : \lim_{t \to \infty} x(t) = \bar{x}\}.
$$
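To make the notions of equilibrium point and basin of attraction concrete, here is a small sketch of process (4) with $G(x) = I$, discretized by forward Euler steps. It does not use a trained kernel radius function; the potential $f(x, y) = (x^2 - 1)^2 + y^2$, the step size, and the function names are stand-in assumptions chosen because the stable equilibria, $(\pm 1, 0)$, are known in closed form.

```python
import math

def grad_f(p):
    """Gradient of the stand-in potential f(x, y) = (x^2 - 1)^2 + y^2."""
    x, y = p
    return (4.0 * x * (x * x - 1.0), 2.0 * y)

def flow_to_equilibrium(p0, step=0.05, tol=1e-10, max_steps=10000):
    """Forward-Euler discretization of dx/dt = -grad f(x), i.e. (4) with G = I."""
    p = tuple(p0)
    for _ in range(max_steps):
        g = grad_f(p)
        if math.hypot(g[0], g[1]) < tol:  # gradient vanishes at an equilibrium
            break
        p = (p[0] - step * g[0], p[1] - step * g[1])
    return p

# Two starting points on opposite sides of the line x = 0.
right = flow_to_equilibrium((0.5, 1.0))
left = flow_to_equilibrium((-0.3, -2.0))
```

For this stand-in potential, starting points with $x > 0$ end at $(1, 0)$ and those with $x < 0$ end at $(-1, 0)$, so the two open half-planes play the role of the basins $A(\bar{x})$. The point $(0, 0)$ is also an equilibrium, but its Hessian has a negative eigenvalue, so it is not stable.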

One nice property of system (4) is that, under fairly mild conditions, it can be shown that the whole data space is composed of the closures of the basins, that is to say,

$$
\mathbb{R}^n = \bigcup_{i=1}^{N} \operatorname{cl}(A(\bar{x}_i))
$$

where $\{\bar{x}_i;\ i = 1, \ldots, N\}$ is the set of all the stable equilibrium points ([7], [8]). (See Fig. 1.)

This property of the constructed system (4) enables us to decompose the data domain into a small number of disjoint regions (i.e., basins of attraction), where each region is represented by the corresponding stable equilibrium point. Next, we denote the set of the training data points converging to a stable equilibrium point $\bar{x}_k$ by $\langle \bar{x}_k \rangle$ and apply a majority vote on the set $\langle \bar{x}_k \rangle$ to determine the class label of the corresponding stable equilibrium point $\bar{x}_k$. Then each point of a decomposed region, say $A(\bar{x}_k)$, is assigned the same class label as that of the corresponding stable equilibrium point $\bar{x}_k$. As a result, to classify an unknown test point, we apply the process (4) to it and assign it the class label of the stable equilibrium point to which it converges.
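The labeling and classification steps just described can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the one-dimensional potential $f(x) = (x^2 - 1)^2$ stands in for the trained kernel radius function, and the helper names, step size, and toy training set are hypothetical.

```python
from collections import Counter

def descend(x, step=0.05, steps=2000):
    """Steepest-descent iterates for the stand-in potential f(x) = (x^2 - 1)^2."""
    for _ in range(steps):
        x -= step * 4.0 * x * (x * x - 1.0)
    return round(x, 3)  # rounding merges numerically identical equilibria

def label_equilibria(points, labels):
    """Majority vote over the training points that converge to each equilibrium."""
    votes = {}
    for x, y in zip(points, labels):
        votes.setdefault(descend(x), Counter())[y] += 1
    return {sep: counts.most_common(1)[0][0] for sep, counts in votes.items()}

def classify(x, sep_labels):
    """Assign the label of the equilibrium that the test point converges to."""
    return sep_labels[descend(x)]

# Hypothetical training set: each basin contains one minority-class point.
train_x = [-1.4, -0.9, -0.2, 0.3, 0.8, 1.5]
train_y = ["A", "A", "B", "B", "B", "A"]

sep_labels = label_equilibria(train_x, train_y)
prediction = classify(0.6, sep_labels)
```

Here the three training points with negative coordinates converge to $-1$ and vote for class A, the other three converge to $+1$ and vote for class B, and the test point $0.6$ inherits the label of the equilibrium at $+1$. Only one label per stable equilibrium point needs to be stored once training is done.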

3 Simulation Results and Discussion

The proposed algorithm for multicategory classification has been simulated on five benchmark data sets. A description of the data sets is given in Table 1.


Fig. 1. Topographic map of the trained kernel radius function

Table 1. Benchmark data description

Data set   No. of classes   No. of training data   No. of test data
Iris       3                100                    50
Wine       3                118                    60
Sonar      2                104                    104
Ring       4                160                    40
Orange     8                170                    30

comparison are the training and the test misclassification error rates. Simulation results are shown in Table 2. The experimental results show that the proposed method achieves much better accuracy on the problem with the largest number of categories (Orange, eight classes), whereas on the three- and four-category problems it performs slightly better than, or on par with, the one-against-one and one-against-rest methods.

In addition to this experimental result, the proposed method has several nice features. Firstly, the process (4) can be implemented as various discretized learning algorithms depending on the choice of $G(x)$ [7]. For example, if $G(x) = I$, it is the steepest descent algorithm; if $G(x) = B_f(x)$, where $B_f$ is a positive definite matrix approximating the Hessian $\nabla^2 f(x)$, then it is the quasi-Newton method; if $G(x) = [\nabla^2$


Table 2. Simulation results on five benchmark problems (S.E.P.: stable equilibrium points)

            One-against-one      One-against-rest     Proposed method
Data set    training   test      training   test      parameters   No. of    training   test
            error      error     error      error     (C, q)       S.E.P.    error      error
Iris        0.040      0.060     0.020      0.060     (1, 3.0)     10        0.020      0.040
Wine        0          0.317     0          0.050     (1, 0.5)     23        0          0.050
Sonar       0.096      0.289     0.096      0.289     (1, 0.5)     78        0.029      0.144
Ring        0.010      0.075     0          0.075     (1, 1.0)     9         0          0.025
Orange      0.142      0.168     0.106      0.130     (1, 1.0)     10        0.006      0

Fig. 2. Change of topology of the trained kernel radius function over the kernel parameter $q$

kernel parameter, $q$. Fig. 2 empirically shows the robustness of the topology of the trained kernel radius function with respect to the kernel parameter $q$.

4 Conclusions

In this paper, a new classifier for multicategory classification problems has been proposed. The proposed method first builds a trained kernel radius function via support vector domain decomposition and then constructs a dynamical system corresponding to the trained kernel radius function, from which a given data domain is decomposed into several disjoint regions. Benchmark results demonstrated a significant performance improvement of the proposed method over the one-against-one and one-against-rest methods. An application of the method to larger-scale practical problems remains to be investigated.

Acknowledgement

This work was supported by the Korea Research Foundation under grant number KRF-2004-041-D00785.

References

1. Vapnik, V.N.: An Overview of Statistical Learning Theory. IEEE Trans. on Neural Networks, 10 (1999) 988-999


3. Ben-Hur, A., Horn, D., Siegelmann, H.T., Vapnik, V.N.: Support Vector Clustering. Journal of Machine Learning Research, 2 (2001) 125-137

4. Hsu, C.-W., Lin, C.-J.: A Comparison of Methods for Multiclass Support Vector Machines. IEEE Trans. on Neural Networks, 13 (2002) 415-425

5. Weston, J., Watkins, C.: Multi-Class Support Vector Machines. Proc. ESANN99, M. Verleysen, Ed., Brussels, Belgium (1999)

6. Burges, C.J.: A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery, 2 (1998) 121-167

7. Lee, J.: Dynamic Gradient Approaches to Compute the Closest Unstable Equilibrium Point for Stability Region Estimate and Their Computational Limitations. IEEE Trans. on Automatic Control, 48 (2003) 321-324

8. Lee, J., Chiang, H.-D.: A Dynamical Trajectory-Based Methodology for Systematically Computing Multiple Optimal Solutions of General Nonlinear Programming Problems. IEEE Trans. on Automatic Control, 49 (2004) 888-899