## 578 IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 18, NO. 2, MARCH 2007

**Equilibrium-Based Support Vector Machine for**
**Semisupervised Classification**

Daewon Lee and Jaewook Lee

**Abstract—A novel learning algorithm for semisupervised classification is proposed. The proposed method first constructs a support function that estimates a support of a data distribution using both labeled and unlabeled data. Then, it partitions the whole data space into a small number of disjoint regions with the aid of a dynamical system. Finally, it labels the decomposed regions utilizing the labeled data and the cluster structure described by the constructed support function. Simulation results show the effectiveness of the proposed method in labeling out-of-sample unlabeled test data as well as in-sample unlabeled data.**

**Index Terms—Dynamical systems, inductive learning, kernel methods, semisupervised learning, support vector machines (SVMs).**

## I. INTRODUCTION

In statistical machine learning, there are three different scenarios: supervised learning, unsupervised learning, and semisupervised learning. In supervised learning, a set of labeled data is given, and the task is to construct a classifier that predicts the labels of future unknown data. In unsupervised learning such as clustering [14], [18], only a set of unlabeled data is given, and the task is to segment the unlabeled data into clusters that reflect meaningful structure of the data domain. In *semisupervised learning*, a set of both labeled and unlabeled data is given, and the task is to construct a better classifier using both labeled and unlabeled data than would be obtained using only the labeled data, as in supervised learning.

Recently, semisupervised learning has come to occupy an important position among the aforementioned scenarios in many real-world applications such as bioinformatics, web and text mining, database marketing, face recognition, video indexing, etc. This is because a large amount of unlabeled data can be easily collected by automated means in many practical learning domains, while labeled data are often difficult, expensive, or time consuming to obtain, as they often require the efforts of human experts [12].

Many learning algorithms have been developed to solve semisupervised learning problems, including graph-based models, generative mixture models using expectation–maximization (EM) [5], self-training, cotraining [12], the transductive support vector machine (TSVM) and its variants [1], [4], kernel methods [17], semisupervised clustering methods [13], [19], etc. Most of these existing methods are, however, designed for transductive semisupervised learning. Since transduction is only concerned with predicting the *given* specific test points (e.g., in-sample unlabeled points in semisupervised learning), it does not provide a straightforward way to make predictions on out-of-sample points for inductive learning. Although some transductive methods can be extended into inductive ones, their performance on out-of-sample points tends to be poor or inefficient.

In this letter, to overcome such difficulties, we propose a novel robust, efficient, and inductive learning algorithm for semisupervised learning. The proposed method consists of three phases. In the first phase, we build a support function that characterizes the support of the multidimensional distribution of a given data set consisting of both labeled and unlabeled data. In the second phase, we decompose the whole data space into a small number of separate clustered regions via a dynamical system associated with the constructed support function. Finally, in the third phase, we assign a class label to each decomposed region utilizing the information of its constituent labeled data and the topological and dynamical properties of the constructed dynamical system, thereby classifying in-sample unlabeled data as well as unknown out-of-sample data. The detailed procedure of each phase is described in Section II (see Fig. 1).

Manuscript received August 11, 2005; revised March 13, 2006 and July 13, 2006; accepted September 28, 2006. This work was supported by the Korea Science and Engineering Foundation (KOSEF) under Grant R01-2005-000-10746-0.

The authors are with the Department of Industrial and Management Engineering, Pohang University of Science and Technology, Pohang, Kyungbuk 790-784, Korea (e-mail: [email protected]).

Color versions of one or more of the figures in this letter are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TNN.2006.889495

## II. PROPOSED METHOD

*A. Phase I: Constructing a Trained Gaussian Kernel Support Function via SVDD*

Suppose that a set of labeled or unlabeled data $\{(x_i, y_i)\}_{i=1}^{N} \subset X \times Y$ is given, where $x_i \in \mathbb{R}^n$ [or its reduced-dimensional representation via linear principal component analysis (PCA), for example] denotes an example and $y_i$ denotes its label, or null when it is unlabeled.

In Phase I, we construct a support function that estimates a support of a data distribution. Any available method that estimates a support or density of a data distribution [11], [15] may be employed. In this letter, we adopt the support vector domain description (SVDD) procedure suggested in [15] and applied to diverse clustering problems [2], [6], [7]. The SVDD algorithm [2], [9], [15] maps data points by means of a nonlinear transformation $\Phi$ from $X$ to some high-dimensional feature space and finds the smallest enclosing sphere of radius $R$ that contains most of the mapped data points in the feature space, described by the following model:

$$\min\; R^2 + C\sum_{i=1}^{N}\xi_i$$
$$\text{subject to } \|\Phi(x_j) - a\|^2 \le R^2 + \xi_j,\quad \xi_j \ge 0,\quad \text{for } j = 1,\ldots,N \tag{1}$$

where $a$ is the center and $\xi_j$ are slack variables allowing for soft boundaries. The solution of the primal problem (1) can then be obtained by solving its dual problem

$$\max\; W = \sum_j \beta_j K(x_j, x_j) - \sum_{i,j}\beta_i\beta_j K(x_i, x_j)$$
$$\text{subject to } 0 \le \beta_j \le C,\quad \sum_j \beta_j = 1,\quad j = 1,\ldots,N \tag{2}$$

where the inner product $\Phi(x_i)\cdot\Phi(x_j)$ is replaced by a kernel $K(x_i, x_j)$. Only those points with $0 < \beta_j < C$ lie on the boundary of the sphere and are called support vectors (SVs). Note that both labeled and unlabeled data are used as the training set in Phase I; no information on the labels $y_i$ is involved, so the labels do not affect the solution of (2), which can be easily seen from the form of (2).
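As a small illustration of the dual problem (2), the sketch below solves it for a toy data set with a Gaussian kernel. The data, parameter values, and the SLSQP solver are illustrative assumptions for exposition; the letter's own implementation uses the LibSVM library.

```python
# Sketch: solving the SVDD dual (2) with a Gaussian kernel on toy data.
# The data set, q, C, and the SLSQP solver choice are illustrative assumptions.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))      # toy training set (labeled + unlabeled together)
q, C = 1.0, 0.5                   # kernel width and soft-margin constant

# Gaussian kernel matrix K[i, j] = exp(-q * ||x_i - x_j||^2)
D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-q * D2)

# Dual objective W = sum_j b_j K(x_j, x_j) - sum_ij b_i b_j K(x_i, x_j);
# we minimize -W subject to 0 <= b_j <= C and sum_j b_j = 1.
def neg_W(b):
    return -(b @ np.diag(K) - b @ K @ b)

n = len(X)
res = minimize(neg_W, np.full(n, 1.0 / n), method="SLSQP",
               bounds=[(0.0, C)] * n,
               constraints=[{"type": "eq", "fun": lambda b: b.sum() - 1.0}])
beta = res.x
sv = (beta > 1e-6) & (beta < C - 1e-6)   # support vectors on the sphere boundary
print(res.success, beta.sum(), sv.sum())
```

In practice only a few `beta` entries come out nonzero, which is the sparsity property the letter exploits in Phase II.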

Now, let its solution be $\beta_j$, $j = 1,\ldots,N$, and let $J \subset \{1,\ldots,N\}$ be the set of indices of the nonzero $\beta_j$. Then, the trained Gaussian kernel support function, defined by the squared radial distance of the image of $x$ from the sphere center, is given by

$$f(x) := R^2(x) = \|\Phi(x) - a\|^2 = K(x, x) - 2\sum_j \beta_j K(x_j, x) + \sum_{i,j}\beta_i\beta_j K(x_i, x_j) = 1 - 2\sum_{j\in J}\beta_j e^{-q\|x - x_j\|^2} + \sum_{i,j\in J}\beta_i\beta_j e^{-q\|x_i - x_j\|^2} \tag{3}$$

Fig. 1. (a) Original data set; unlabeled points and labeled points are marked with distinct symbols, and the number for each labeled point represents its class label. (b) Contour map of the trained Gaussian support function constructed in Phase I. (c) Basin cells generated by Phase II. The solid lines represent a set of contours given by $\{x : f(x) = r_s\}$, and the dash–dot lines represent the boundaries of basin cells; each stable equilibrium vector, a representative point in its basin cell, is also marked. (d) Ultimate labeled regions determined by Phase III. The solid lines represent decision boundaries separating the labeled regions.

where a widely used Gaussian kernel of the form $K(x_i, x_j) = \exp(-q\|x_i - x_j\|^2)$ with width parameter $q$ is employed. For an illustration, Fig. 1(b) shows a contour map of a trained Gaussian support function for the original data set of Fig. 1(a), which includes both labeled and unlabeled data.
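A direct evaluation of the trained support function (3) can be sketched as follows; the points, dual weights, and kernel width below are illustrative stand-ins for an actual SVDD solution.

```python
# Sketch: evaluating the trained Gaussian kernel support function (3).
# Xj, beta, and q are illustrative stand-ins for a trained SVDD solution.
import numpy as np

def support_function(x, Xj, beta, q):
    """f(x) = 1 - 2 sum_j beta_j e^{-q||x-x_j||^2}
            + sum_ij beta_i beta_j e^{-q||x_i-x_j||^2}."""
    k_x = np.exp(-q * ((Xj - x) ** 2).sum(axis=1))   # K(x_j, x) for nonzero-beta points
    D2 = ((Xj[:, None] - Xj[None, :]) ** 2).sum(-1)
    const = beta @ np.exp(-q * D2) @ beta            # the x-independent double sum
    return 1.0 - 2.0 * beta @ k_x + const

Xj = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])  # points with nonzero beta
beta = np.array([0.5, 0.25, 0.25])
q = 1.0
print(support_function(np.array([0.2, 0.2]), Xj, beta, q))
```

Points deep inside the described support give small values of $f$, while points far from the data approach the constant $1 + \sum_{i,j}\beta_i\beta_j K(x_i,x_j)$, which is how the contours $\{x : f(x) = r_s\}$ enclose the clusters.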

One distinguishing feature of the kernel support function trained via SVDD is that cluster boundaries can be constructed by a set of contours that enclose the points in data space, given by the set $\{x : f(x) = r_s\}$ where $r_s = R^2(x_i)$ for any SV $x_i$. Another distinguishing feature is that, in practice, only a small portion of the $\beta_j$ take nonzero values, which not only simplifies the cluster structure, but also greatly reduces the computational burden involved in computing $f$ or its derivative.

*B. Phase II: Decomposing the Data Space Into Separate Clustered Regions via a Dynamical System*

The objective of Phase II is to decompose the whole data space, say $\mathbb{R}^n$, into separate clustered regions. To this end, we build the following dynamical system, which will be shown to preserve the topological structure of the clusters described by $f$ in (3):

$$\frac{dx}{dt} = F(x) := -x + \sum_{j\in J}\mu_j(x)\,x_j \quad\text{where}\quad \mu_j(x) = \frac{\beta_j e^{-q\|x - x_j\|^2}}{\sum_{j\in J}\beta_j e^{-q\|x - x_j\|^2}}. \tag{4}$$

Note that $0 < \mu_j(x) < 1$ and $\sum_{j\in J}\mu_j(x) = 1$. The existence of a unique solution (or trajectory) $x(\cdot) : \mathbb{R} \to \mathbb{R}^n$ for each initial condition $x(0)$ is guaranteed by the smoothness of the function $F$. A state vector $s \in \mathbb{R}^n$ satisfying the equation $F(s) = 0$ is called an *equilibrium vector* of (4), and an (asymptotically) *stable equilibrium vector* (SEV) if all the eigenvalues of the Jacobian of $F$ at $s$ have negative real parts. Geometrically, for each $x$, the vector field $F(x)$ in (4) is orthogonal to the level hypersurface of $f$ through $x$.


Fig. 2. Contour map of the trained kernel support function for varying level values. Two connected components in (a) with a level value $R_{old}$ are merged into one connected component in (b) with a level value $R_{new} > R_{old}$.

This orthogonality makes each trajectory flow inward and remain in one of the clusters described by $f$, which will be rigorously proved below.

The *basin of attraction* of a stable equilibrium vector $s$ is defined as the set of all points converging to $s$ when process (4) is applied, i.e.,

$$A(s) := \{x(0) \in \mathbb{R}^n : \lim_{t\to\infty} x(t) = s\}.$$

A *basin cell* of a stable equilibrium vector $s$, one important concept used in this letter, is defined as the closure of the basin $A(s)$ and is denoted by $\bar{A}(s)$. From the form of $F(\cdot)$ in (4), the basin cell $\bar{A}(s)$ can be interpreted in the context of clustering as a single approximated Gaussian cluster whose center is a stable equilibrium vector $s$ satisfying

$$s = \frac{\sum_{j\in J}\beta_j e^{-q\|s - x_j\|^2}\, x_j}{\sum_{j\in J}\beta_j e^{-q\|s - x_j\|^2}}.$$

The next result, which serves as the theoretical basis of Phase II, shows that the data space can be decomposed into several basin cells under process (4) while preserving the topological structure of the clusters described by the support function $f$.

*Theorem 1:* Each connected component of the level set $L_f(r) := \{x \in \mathbb{R}^n : f(x) \le r\}$ for any level value $r$ is positively invariant, i.e., if a point is on a connected component of $L_f(r)$, then its entire positive trajectory lies on the same component when process (4) is applied. Furthermore, the whole data space is composed of the basin cells, i.e.,

$$\mathbb{R}^n = \bigcup_{i=1}^{M} \bar{A}(s_i) \tag{5}$$

where $\{s_i : i = 1,\ldots,M\}$ is the set of the stable equilibrium vectors of (4).

*Proof:* See Appendix.

One nice property of the constructed system (4) is that the topological structure of each level set $L_f(r)$ is preserved under process (4). Another nice property of (4) is that the data space can be decomposed into a small number of disjoint regions (i.e., basin cells), where each region is represented by a stable equilibrium vector. From a computational point of view, we can identify the basin cell to which a data point belongs by locating the stable equilibrium vector to which it converges under (4), without directly determining the exact basin cells. In the illustrative example of Fig. 1(c), we can see that the whole data space is decomposed into 14 disjoint regions (A1–A14) and all the data points within a basin cell converge to a common stable equilibrium vector under process (4).
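The convergence step can be sketched numerically: an explicit Euler discretization of (4) with unit step size gives the fixed-point iteration $x \leftarrow \sum_{j\in J}\mu_j(x)x_j$. The data, weights, tolerances, and symbol names below are illustrative assumptions, not the letter's Matlab-based implementation.

```python
# Sketch: following process (4) to a stable equilibrium vector (SEV).
# Unit-step explicit Euler on dx/dt = -x + sum_j mu_j(x) x_j gives the
# fixed-point iteration below. Data, beta, q, and tolerances are illustrative.
import numpy as np

def follow_to_sev(x0, Xj, beta, q, max_iter=200, tol=1e-8):
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        w = beta * np.exp(-q * ((Xj - x) ** 2).sum(axis=1))  # beta_j e^{-q||x-x_j||^2}
        x_new = (w @ Xj) / w.sum()                           # sum_j mu_j(x) x_j
        if np.linalg.norm(x_new - x) < tol:
            break
        x = x_new
    return x

# two well-separated blobs -> trajectories settle into two distinct SEVs
Xj = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])
beta = np.array([0.25, 0.25, 0.25, 0.25])
s1 = follow_to_sev([0.3, 0.3], Xj, beta, q=1.0)
s2 = follow_to_sev([4.8, 5.2], Xj, beta, q=1.0)
print(s1, s2)
```

Points started near either blob converge to that blob's SEV, which is how a data point is assigned to its basin cell without computing the cell boundary explicitly.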

*C. Phase III: Classifying the Decomposed Regions*

Up to now, we have not used any information of labels provided in a labeled data set. In Phase III, we classify each decomposed region, a basin cell constructed in Phase II, with the aid of a labeled data set as follows.

First, for each basin cell $\bar{A}(s)$ that contains at least one labeled data point, we take a majority vote over the labeled data points in it to assign a class label to $\bar{A}(s)$ (and hence to its stable equilibrium vector). All the unlabeled data points in $\bar{A}(s)$ are then assigned the same class label. In the illustrative example of Fig. 1(c) and (d), the decomposed region A1 is predicted to be class 1, A4 and A7 are predicted to be class 2 by the majority votes over the labeled data points in each region, and so on.
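The majority vote can be sketched as follows; the cell assignments and labels are illustrative, and a cell with no labeled points is simply left unlabeled at this stage.

```python
# Sketch: majority vote of Phase III. Each basin cell containing labeled
# points takes the most frequent label among them; cells without labeled
# points are left out here. Cell ids and labels are illustrative.
from collections import Counter

def label_cells(cell_of_point, labels):
    """cell_of_point[i]: basin-cell id of point i; labels[i]: class or None."""
    votes = {}
    for cell, y in zip(cell_of_point, labels):
        if y is not None:
            votes.setdefault(cell, Counter())[y] += 1
    return {cell: c.most_common(1)[0][0] for cell, c in votes.items()}

cells = [0, 0, 0, 1, 1, 2]
labels = [1, 1, None, 2, None, None]   # cell 2 has no labeled point
print(label_cells(cells, labels))      # → {0: 1, 1: 2}
```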

Second, for a basin cell $\bar{A}(s)$ with no labeled data point in it [A2 in Fig. 1(c)], we utilize the cluster structure of the trained Gaussian kernel support function $f$ constructed in Phase I to classify it. Concretely, we notice that the level set $L_f(r_s)$ is composed of several disjoint clusters

$$L_f(r_s) = \{x : f(x) \le r_s\} = C_1 \cup \cdots \cup C_p \tag{6}$$

where $r_s = R^2(x_i)$ for some SV $x_i$ and each cluster $C_i$, $i = 1,\ldots,p$, is a connected component of $L_f(r_s)$ [see Fig. 1(c)]. Therefore, if two decomposed regions (i.e., basin cells) share the same cluster, it is natural to assign the same class label to these regions. For an illustration, in Fig. 1(c), the region A2 with no labeled data point shares the same cluster with region A6; therefore, A2 and A6 are assigned the same class label, as shown in Fig. 1(d).

## TABLE I

BENCHMARK DATA DESCRIPTION AND PARAMETER SETTINGS

If the cluster containing an unlabeled region has no labeled data point at all, we increase the level value $r$ of $L_f(r)$ until its connected component merges with a labeled component. Then, we assign to the unlabeled component the same class label as the labeled component. To illustrate this, see Fig. 2. In Fig. 2(a), with a level value $r = R_{old}$, a left-sided connected component contains no labeled data while a right-sided connected component contains a labeled data point, say "6". By increasing the level value $r$ from $R_{old}$ to $R_{new}$, the two connected components are merged into one connected component with a labeled data point in it, and are assigned the class "6", as shown in Fig. 2(b).

To identify the connected components of a level set $L_f(r)$, we employ the following reduced-complete graph (R-CG) labeling strategy [2], [6] restricted to the set of the stable equilibrium vectors $\{s_k\}_{k=1}^{M}$, generating an adjacency matrix $A_{ij}$ between pairs $s_i$ and $s_j$: $A_{ij} = 1$ if $\max_{0\le\lambda\le 1} f(\lambda s_i + (1-\lambda)s_j) \le r$, and $A_{ij} = 0$ otherwise. A pair of decomposed regions $\bar{A}(s_i)$ and $\bar{A}(s_j)$ is then assigned to the same connected component of $L_f(r)$ if $s_i$ and $s_j$ belong to the same connected component of the graph induced by $A$. A simple labeling strategy using the nearest neighboring labeled data as in [7] can also be adopted for data sets with well-partitioned convex shapes. For a data set with highly curved data distributions, a more robust strategy suggested in [7] may be preferred, but is not reported here.
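The R-CG check can be sketched by sampling $f$ along the segment between two SEVs and then taking graph connected components. The support function, SEVs, sampling resolution, and level value below are illustrative assumptions.

```python
# Sketch: R-CG adjacency between SEVs. Two SEVs are adjacent when f stays
# at or below the level r along the segment between them, sampled at a few
# interior points. f, the SEVs, and r are illustrative stand-ins.
import numpy as np

def adjacency(sevs, f, r, n_samples=20):
    M = len(sevs)
    A = np.zeros((M, M), dtype=int)
    lam = np.linspace(0.0, 1.0, n_samples)
    for i in range(M):
        for j in range(i + 1, M):
            seg = lam[:, None] * sevs[i] + (1 - lam[:, None]) * sevs[j]
            if max(f(p) for p in seg) <= r:
                A[i, j] = A[j, i] = 1
    return A

def components(A):
    """Connected components of the graph induced by A, via depth-first search."""
    M, comp, cur = len(A), [-1] * len(A), 0
    for s in range(M):
        if comp[s] == -1:
            stack = [s]
            while stack:
                u = stack.pop()
                if comp[u] == -1:
                    comp[u] = cur
                    stack.extend(v for v in range(M) if A[u][v])
            cur += 1
    return comp

# toy 1-D support function: two wells at 0 and 4, barrier of height 4 between
f = lambda p: min((p[0] - 0.0) ** 2, (p[0] - 4.0) ** 2)
sevs = np.array([[0.0], [4.0]])
A = adjacency(sevs, f, r=1.0)
print(A, components(A))
```

With `r=1.0` the barrier separates the two SEVs into different components; raising `r` above the barrier height merges them, mirroring the level-value increase of Fig. 2.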

After labeling all the decomposed regions, we can use the trained classifier not only to classify the in-sample unlabeled data, but also to predict class labels of future unknown out-of-sample data by applying process (4), which is one distinguishing feature of the proposed method. Specifically, for a given test data point, we apply process (4) with this point as the initial condition and locate the stable equilibrium vector to which it converges. Then, we assign the class label of the corresponding stable equilibrium vector to the test data point.
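The inductive prediction step reduces to a small lookup once the SEVs are labeled; the `converge` routine below is a hypothetical stand-in for the Phase II machinery, and the SEVs and labels are illustrative.

```python
# Sketch: inductive prediction for an out-of-sample point. Push the point
# through process (4) via a converge() routine (a stand-in here), match the
# limit against the stored SEVs, and return that cell's class label.
import numpy as np

def predict(x_test, converge, sevs, sev_labels):
    s = converge(x_test)                                   # limit of process (4)
    k = int(np.argmin([np.linalg.norm(s - si) for si in sevs]))
    return sev_labels[k]

# toy stand-in: "process (4)" here simply snaps a point to the nearest SEV
sevs = [np.array([0.0, 0.0]), np.array([5.0, 5.0])]
converge = lambda x: min(sevs, key=lambda s: np.linalg.norm(np.asarray(x) - s))
print(predict([0.4, -0.2], converge, sevs, sev_labels=[1, 2]))   # → 1
```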

## III. NUMERICAL RESULTS AND REMARKS

In order to evaluate the effectiveness of the proposed method, denoted by "Proposed," we conducted experiments on 12 data sets and compared its generalization performance with that of existing methods. A description of the data sets is given in Table I. The "tae," "ring," and "sunflower" data sets are artificially generated from multimodal and nonlinearly separable distributions. "sonar," "iris," "wine," "satimage," "segment," and "shuttle" are widely used classification data sets from the University of California at Irvine (UCI) repository [20]. The "Coil20," "g50c," and "Uspst" data sets are taken from [4]. To check the performance of inductive learning, we randomly partition each unlabeled data set into an (in-sample) *unlabeled* set and an (out-of-sample) test set as in [17]. (As a result, we obtain somewhat different results from those of [4], in which the data sets do not contain test examples.)

The main parts of "Proposed" are implemented as follows. The quadratic programming solver for optimizing the kernel support function is based on the LibSVM library [21]. To find SEVs, an unconstrained minimization solver or a nonlinear equation solver from the Matlab optimization toolbox is used.

We performed experiments based on the setup of [4] and [17]. The model parameters ($C$, $q$, $C^*$, etc.) are chosen among combinations of values on a finite grid by fivefold cross validation on unlabeled examples. The best combinations are reported in Table I. Each data set has 100 random splits partitioning the training set into labeled and unlabeled sets. Performance is evaluated by the averaged error rate and standard deviation over the 100 splits to provide an analysis of the statistical relevance of the reported results.

Our interest in the experiments is to show the generalization performance of "Proposed" as an inductive method. To do this, we calculate misclassification error rates on unlabeled data and test data, and then compare the performance of "Proposed" with four widely used and very competitive semisupervised learning algorithms: SVM [3], [16], the mixed ensemble approach (MEA-EM) [5], $\nabla$TSVM, and low-density separation (LDS) [4]. SVM constructs an optimal separating hyperplane using only labeled data. The MEA-EM algorithm is a clustering-based method closely related to the proposed method. $\nabla$TSVM is a modified model of TSVM that reduces the time complexity by directly optimizing the objective in the primal using nonlinear optimization techniques. LDS aims to find a decision boundary passing through low-density regions by combining a so-called manifold learning step (e.g., Isomap) with $\nabla$TSVM.


TABLE II

AVERAGED MISCLASSIFICATION ERROR RATES (%) AND THEIR STANDARD DEVIATIONS ON UNLABELED AND TEST EXAMPLES

First, as shown in Table II, LDS reports only unlabeled errors. Second, LDS is not applicable to large-scale data sets (e.g., "satimage" and "shuttle") due to its heavy complexity and large memory requirement. On the other hand, "Proposed" yields unlabeled errors similar to, and sometimes slightly worse than, those of LDS, but achieves good generalization performance for future unseen patterns, showing its potential as a good alternative for inductive semisupervised learning, whereas LDS, being transductive, offers no straightforward way to classify such patterns. As a result, "Proposed" shows statistically comparable performance (better or slightly worse) relative to the other methods on unlabeled in-sample data and fairly good generalization performance on out-of-sample data.

*Remarks*

1) Our proposed method shares some similarities with other clustering-based semisupervised algorithms in [5], [13], and [19]. These methods first cluster the given sample data points and then assign to each cluster a label based on some prespecified rule using the labeled data set. However, most of them focus on clustering in-sample data, not out-of-sample data, thereby concentrating on enhancing the performance of transductive learning rather than inductive learning [17]. Although they can be extended to label the entire space, which is inductive, by some simple strategy (e.g., after K-means clustering, assigning each out-of-sample point the class label of the nearest cluster center), the performance on out-of-sample data is often unsatisfactory if restrictive cluster assumptions are violated [4], [17]. Moreover, since they must determine the number of components (or clusters) and parameter values, they easily fail if appropriate values are not found. On the other hand, the proposed method employs an SVDD, which has a good ability to describe the data distribution rather than to cluster the data points themselves. It also automatically detects the optimized structure of a high-dimensional data distribution with a highly nonlinear shape. These properties make the proposed method, by means of a dynamical system process, a competitive method for inductive semisupervised learning.

2) To analyze the time complexity of the proposed method, let $N$ be the number of training patterns and $M (\ll N)$ the number of SEVs. The proposed method involves a quadratic programming (QP) procedure in Phase I, and most QP solvers have time complexity $O(N^3)$ [10]. In Phase II, the time complexity of obtaining the decomposition (5) is $O(Nm)$, where $m$ is the average number of iteration steps to converge to an SEV; $m$ is independent of $N$ and usually takes a value between 5 and 20 [6]. In Phase III, the complexity of labeling the $M$ decomposed regions is $O(M^2)$. Putting this together, the time complexity of the proposed method is $O(N^3 + Nm + M^2) \simeq O(N^3)$ for large-scale data sets. This implies that the computing speed of the proposed method is comparable to that of other existing methods such as $\nabla$TSVM and LDS [4].

## IV. CONCLUSION

In this letter, we have proposed an inductive semisupervised learning method. The proposed method first builds a trained Gaussian kernel support function that estimates a support of a data distribution via an SVDD procedure using both labeled and unlabeled data. Then, it decomposes the whole data space into separate clustered regions, i.e., basin cells, with the aid of a dynamical system. Finally, it classifies the decomposed regions utilizing the information of the labeled data and the topological structure of the clusters described by the constructed support function. A theoretical basis of the proposed method is also given. Benchmark results demonstrate that the proposed method has competitive performance for inductive learning (i.e., the ability to label out-of-sample unlabeled test points as well as in-sample unlabeled points) and is applicable to large-scale data sets. Application of the proposed method to larger scale practical semisupervised learning problems remains to be investigated.

APPENDIX
PROOF OF THEOREM 1

*Proof:*

1) Let $x(0) = x_0$ be a point in a connected component, say $C$, of $L_f(r)$, and let $x(t)$ be the trajectory starting at $x(0) = x_0$. Since

$$\frac{d}{dt} f(x) = \nabla f(x)^T \frac{dx}{dt} = -4q \sum_{j\in J}\beta_j e^{-q\|x - x_j\|^2}\, \Big\| x - \sum_{j\in J}\mu_j(x)\,x_j \Big\|^2 \le 0$$

$f(x(t))$ is a nonincreasing function of $t \in \mathbb{R}$, and so we have $f(x(t)) \le f(x_0)$ for all $t \ge 0$, or equivalently, $\{x(t) : t \ge 0\} \subset L_f(r)$. Since $\{x(t) : t \ge 0\}$ is connected, we must have $x(t) \in C$ for all $t \ge 0$.

2) First, we show that every trajectory is bounded. Let $V(x) = \frac{1}{2}\|x\|^2$ and choose $R > \max_{j\in J}\|x_j\|$. Then, on $\|x\| = R$, we have

$$\frac{d}{dt}V(x) = x^T\frac{dx}{dt} = -x^T\Big(x - \sum_{j\in J}\mu_j(x)\,x_j\Big) = -\|x\|^2 + \sum_{j\in J}\mu_j(x)\,x^T x_j \le -\|x\|^2 + 1\cdot\|x\|\max_j\|x_j\| < 0.$$

This implies that any trajectory starting from a point on $\|x\| = R$ enters the bounded set $\|x\| \le R$, and hence $\{x(t) : t \ge 0\}$ is bounded.

Next, we show that every bounded trajectory converges to one of the equilibrium vectors. Since the closure of $\{x(t) : t \ge 0\}$ is nonempty and compact and $g(t) = f(x(t))$ is a nonincreasing function of $t$, $g$ is bounded from below because $f$ is continuous. Hence, $g(t)$ has a limit $a$ as $t \to \infty$. Let $\omega(x_0)$ be the $\omega$-limit set of $x_0$. Then, for any $p \in \omega(x_0)$, there exists a sequence $\{t_n\}$ with $t_n \to \infty$ and $x(t_n) \to p$ as $n \to \infty$. By the continuity of $f$, $f(p) = \lim_{n\to\infty} f(x(t_n)) = a$. Hence, $f(p) = a$ for all $p \in \omega(x_0)$. Since $\omega(x_0)$ is an invariant set, for all $x \in \omega(x_0)$

$$\frac{d}{dt} f(x) = -4q \sum_{j\in J}\beta_j e^{-q\|x - x_j\|^2}\, \Big\| x - \sum_{j\in J}\mu_j(x)\,x_j \Big\|^2 = 0$$

or, equivalently, $F(\omega(x_0)) = 0$. Since every bounded trajectory converges to its $\omega$-limit set and $x(t)$ is bounded, $x(t)$ approaches $\omega(x_0)$ as $t \to \infty$. Hence, it follows that every bounded trajectory of system (4) converges to one of the equilibrium vectors.

Therefore, the trajectory of $x(0) = x_0$ under process (4) approaches one of its equilibrium vectors, say $s$. If $s$ is not a stable equilibrium vector, then the region of attraction of $s$ has dimension less than or equal to $n - 1$. Therefore, we have

$$\mathbb{R}^n = \bigcup_{i=1}^{M} \bar{A}(s_i)$$

where $\{s_i : i = 1,\ldots,M\}$ is the set of the stable equilibrium vectors of system (4).

## REFERENCES

[1] K. P. Bennett and A. Demiriz, "Semi-supervised support vector machines," in *Advances in Neural Information Processing Systems*. Cambridge, MA: MIT Press, 1999, vol. 11, pp. 368–374.
[2] A. Ben-Hur, D. Horn, H. T. Siegelmann, and V. Vapnik, "Support vector clustering," *J. Mach. Learn. Res.*, vol. 2, pp. 125–137, 2001.
[3] C. J. Burges, "A tutorial on support vector machines for pattern recognition," *Data Mining Knowl. Disc.*, vol. 2, no. 2, pp. 121–167, 1998.
[4] O. Chapelle and A. Zien, "Semi-supervised classification by low density separation," in *Proc. 10th Int. Workshop Artif. Intell. Statist.*, 2005, pp. 57–64.
[5] E. Dimitriadou, A. Weingessel, and K. Hornik, "A mixed ensemble approach for the semi-supervised problem" [Online]. Available: http://citeseer.ist.psu.edu/590958.html
[6] J. Lee and D. Lee, "An improved cluster labeling method for support vector clustering," *IEEE Trans. Pattern Anal. Mach. Intell.*, vol. 27, no. 3, pp. 461–464, Mar. 2005.
[7] ——, "Dynamic characterization of cluster structures for robust and inductive support vector clustering," *IEEE Trans. Pattern Anal. Mach. Intell.*, vol. 28, no. 11, pp. 1869–1874, Nov. 2006.
[8] D. Lee and J. Lee, "A novel semi-supervised learning method using support vector domain description," presented at the World Congr. Comput. Intell. (WCCI), Vancouver, BC, Canada, Jul. 16–21, 2006.
[9] ——, "Domain described support vector classifier for multi-classification problems," *Pattern Recognit.*, vol. 40, pp. 41–51, 2007.
[10] J. C. Platt, "Fast training of support vector machines using sequential minimal optimization," in *Advances in Kernel Methods: Support Vector Machines*. Cambridge, MA: MIT Press, 1999, pp. 185–208.
[11] B. Schölkopf, J. Platt, J. Shawe-Taylor, A. J. Smola, and R. C. Williamson, "Estimating the support of a high-dimensional distribution," *Neural Comput.*, vol. 13, no. 7, pp. 1443–1472, 2001.
[12] M. Seeger, "Learning with labeled and unlabeled data," Univ. Edinburgh, Tech. Rep., 2001.
[13] S. Basu, M. Bilenko, and R. J. Mooney, "A probabilistic framework for semi-supervised clustering," in *Proc. 10th ACM SIGKDD Int. Conf. Knowl. Disc. Data Mining*, Seattle, WA, Aug. 2004, pp. 59–68.
[14] A. Szymkowiak-Have, M. A. Girolami, and J. Larsen, "Clustering via kernel decomposition," *IEEE Trans. Neural Netw.*, vol. 17, no. 1, pp. 256–264, Jan. 2006.
[15] D. M. J. Tax and R. P. W. Duin, "Support vector domain description," *Pattern Recognit. Lett.*, vol. 20, pp. 1191–1199, 1999.
[16] V. N. Vapnik, "An overview of statistical learning theory," *IEEE Trans. Neural Netw.*, vol. 10, no. 5, pp. 988–999, Sep. 1999.
[17] V. Sindhwani, P. Niyogi, and M. Belkin, "Beyond the point cloud: From transductive to semi-supervised learning," in *Proc. 22nd Int. Conf. Mach. Learn.*, Bonn, Germany, 2005, pp. 824–831.
[18] R. Xu and D. Wunsch, II, "Survey of clustering algorithms," *IEEE Trans. Neural Netw.*, vol. 16, no. 3, pp. 645–678, May 2005.
[19] S. Zhong, "Semi-supervised model-based document clustering: A comparative study," *Mach. Learn.*, vol. 65, no. 1, pp. 3–29, Oct. 2006.
[20] Univ. California, Irvine, "UCI repository of machine learning databases" [Online]. Available: http://www.ics.uci.edu/~mlearn/MLRepository.html