Feature Space Restructuring for SVMs with Application to Text Categorization


Hiroya Takamura and Yuji Matsumoto

Department of Information Technology

Nara Institute of Science and Technology

8516-9, Takayama, Ikoma, 630-0101 Japan

fhiroya-t,[email protected]

Abstract

In this paper, we propose a new method of text categorization based on feature space restructuring for SVMs. In our method, independent components of document vectors are extracted using ICA and concatenated with the original vectors. This restructuring makes it possible for SVMs to focus on the latent semantic space without losing information given by the original feature space. Using this method, we achieved high performance in text categorization with both small and large numbers of labeled data.

1 Introduction

The task of text categorization has been extensively studied in Natural Language Processing. Most successful works rely on a large number of classified data. However, it is hard to collect classified data, so considering real applications, text categorization must be realized even with a small number of labeled data. Several methods to realize it have been proposed so far (Nigam et al., 2000), but they need to be further developed. For that purpose, we have to take advantage of the invaluable information offered by the property of unlabeled data. In this paper, we propose a new categorization method based on Support Vector Machines (SVMs) (Vapnik, 1995) and Independent Component Analysis (ICA) (Herault and Jutten, 1986; Bell and Sejnowski, 1995). SVM is gaining popularity as a classifier with high performance, and ICA is one of the most promising algorithms in the field of signal processing, which extracts independent components from mixed signals.

SVM has been applied in many applications such as Image Processing and Natural Language Processing. The idea of applying SVM to text categorization [...] 1998). However, when the number of labeled data is small, SVM often fails to produce a good result, although several efforts against this problem have been made. There are two strategies for improving performance in the case of a limited number of data. One is to modify the learning algorithm itself (Joachims, 1999a; Glenn and Mangasarian, 2001), and the other is to process training data (Weston et al., 2000), including the selection of features. In this paper, we focus on the latter, especially on feature space restructuring. For processing training data, Principal Component Analysis (PCA) is often adopted in classifiers such as the k-Nearest Neighbor method (Mitchell, 1997). But the conventional dimension-reduction methods fail for SVM, as shown by the experiments in Section 6. Unlike the conventional ones, our approach uses the components obtained with ICA to augment the dimension of the feature space.

ICA is built on the assumptions that the sources are independent of each other and that the signals observed at multiple points are linear mixtures of the sources. While the theoretical aspects of ICA are being studied, its applicability is often pointed out, as in (Bell and Sejnowski, 1997). The idea of using ICA for text clustering is adopted in several works such as (Isbell and Viola, 1998). In those works, the vector representation model is adopted (i.e., each text is represented as a vector with the word frequencies as the elements). It is reported, however, that the independent components do not always correspond to the desired classes, but represent some kind of characteristics of texts (Kolenda et al., 2000). Kaban and Girolami (2000) showed that the number of potential components was larger than


Taking these observations into consideration, we take the following strategy: first we perform ICA on input document vectors, and second, create the restructured information by concatenating the reduced vectors (i.e., the values of the independent components) and the original feature vectors.

PCA is an alternative restructuring method. So we conducted experiments using SVM with various input vectors: original feature vectors, reduced feature vectors and restructured feature vectors (reduction and restructuring are performed by PCA and ICA). For comparison, we conducted experiments using Transductive SVM (TSVM) (Joachims, 1999a) as well, which is designed for the case of a small number of labeled data.

Using the proposed method (SVM with ICA), we obtain better results than ordinary SVM and TSVM, with both small and large numbers of labeled data.

2 Support Vector Machines

2.1 Brief Overview of Support Vector Machines

Support Vector Machine (SVM) is one of the large-margin classifiers (Smola et al., 2000). Given a set of pairs,

\[ (x_1, y_1), (x_2, y_2), \dots, (x_n, y_n), \qquad \forall i,\ x_i \in \mathbb{R}^d,\ y_i \in \{-1, 1\} \tag{1} \]

of a feature vector and a label, SVM constructs a separating hyperplane with the largest margin (the distance between the hyperplane and the vectors, see Figure 1):

\[ f(x) = w \cdot x + b. \tag{2} \]

Finding the largest margin is equivalent to minimizing the norm $\|w\|$, which is expressed as:

\[ \min.\ \frac{1}{2}\|w\|^2, \qquad \text{s.t.}\ \forall i,\ y_i (x_i \cdot w + b) - 1 \ge 0. \tag{3} \]

This is realized by solving the quadratic program (dual problem of (3)):

\[ \max.\ \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j\, x_i \cdot x_j, \qquad \text{s.t.}\ \sum_i \alpha_i y_i = 0,\ \forall i,\ \alpha_i \ge 0, \tag{4} \]

Figure 1: Support Vector Machine (the solid line corresponds to the optimal hyperplane). Legend: positive example, negative example, margin.

where the $\alpha_i$'s are Lagrange multipliers. Using the $\alpha_i$'s that maximize (4), $w$ is expressed as

\[ w = \sum_i \alpha_i y_i x_i. \tag{5} \]

Substituting (5) into (2), we obtain

\[ f(x) = \sum_i \alpha_i y_i\, x_i \cdot x + b. \tag{6} \]

Unlabeled data are classified according to the signs of (6).
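To make the roles of (5) and (6) concrete, the following is a minimal sketch that assumes the multipliers have already been obtained by some quadratic-program solver. The function names, the toy data, and the tie-breaking at zero are illustrative assumptions and are not part of the method described above.

```python
def dot(u, v):
    """Plain dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def weight_vector(alphas, labels, vectors):
    """Equation (5): w = sum_i alpha_i * y_i * x_i."""
    dim = len(vectors[0])
    w = [0.0] * dim
    for alpha, y, x in zip(alphas, labels, vectors):
        for k in range(dim):
            w[k] += alpha * y * x[k]
    return w


def decision_value(alphas, labels, vectors, b, x):
    """Equation (6): f(x) = sum_i alpha_i * y_i * (x_i . x) + b."""
    return sum(a * y * dot(xi, x) for a, y, xi in zip(alphas, labels, vectors)) + b


def classify(alphas, labels, vectors, b, x):
    """Label by the sign of f(x); a value of exactly zero goes to +1 (arbitrary choice)."""
    return 1 if decision_value(alphas, labels, vectors, b, x) >= 0 else -1


# Toy problem with illustrative values: two training points on the diagonal.
train_x = [[1.0, 1.0], [-1.0, -1.0]]
train_y = [1, -1]
alphas = [0.25, 0.25]  # satisfies the constraint sum_i alpha_i * y_i = 0
bias = 0.0

w = weight_vector(alphas, train_y, train_x)
```

In this toy case both training points lie exactly on the margin, so the decision value at each of them is plus or minus one.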

2.2 Kernel Method

SVM is a linear classifier and its separating ability is limited. To compensate for this limitation, the Kernel Method is usually combined with SVM (Vapnik, 1995).

In the Kernel Method, the dot product in (4) and (6) is replaced by a more general inner product $K(x_i, x)$, called the kernel function. The polynomial kernel $(x_i \cdot x_j + 1)^d$ ($d \in \mathbb{N}^+$) and the RBF kernel $\exp\{-\|x_i - x_j\|^2 / 2\sigma^2\}$ are often used. Using the kernel method means that feature vectors are mapped into a (higher dimensional) Hilbert space and linearly separated there. This mapping structure makes non-linear separation possible, although SVM is basically a linear


although it deals with a high dimensional (possibly infinite) Hilbert space, there is no need to compute high dimensional vectors explicitly. Only the general inner products of two vectors are needed. This leads to a relatively small computational overhead.

2.3 Transductive SVMs

The Transductive Support Vector Machine (TSVM) is introduced in (Joachims, 1999a), which is one realization of transductive learning in (Vapnik, 1995). It is designed for classification with a small number of labeled data. Its algorithm is approximately as follows:

1. Construct a hyperplane using labeled data in the same way as the ordinary SVMs.

2. Classify the unlabeled (test) data according to the current hyperplane.

3. Select the pair of a positively classified sample and a negatively classified sample that are nearest to the hyperplane.

4. Exchange the labels of those samples, if the margin gets larger by exchanging them.

5. Terminate if a stopping criterion is satisfied. Otherwise, go back to step 2.

This is one way to search for the largest margin, permitting the relabeling of test data that have already been labeled by the classifier in the previous iteration.

3 Independent Component Analysis

Independent Component Analysis (ICA) is a method by which source signals are extracted from mixed signals. It is based on the assumptions that the sources $s \in \mathbb{R}^m$ are statistically independent of each other and that the observed signals $x \in \mathbb{R}^n$ are linear mixtures of the sources:

\[ x = As. \tag{7} \]

Here the matrix $A$ is called a mixing matrix. We observe $x$ as a time series and estimate both $A$ and $s = (s_1, \dots, s_m)$. So our purpose here is to find a demixing matrix $W$ such that $s_1, \dots, s_m$ are as independent of each other as possible:

\[ s = Wx. \tag{8} \]

learning with an objective function indicating independence. There are several criteria of independence and their learning rules, among which we take here the Infomax approach (Bell and Sejnowski, 1995), but with natural gradient (Amari, 1998). Its learning rule is

\[ \Delta W = \left(I + (\mathbf{1} - 2g(Wx))(Wx)^t\right) W, \tag{9} \]

where $g(u) = 1/(1 + \exp(-u))$.
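As an illustration of the learning rule (9), the sketch below computes the update for a single observation. It assumes that the demixing matrix is square, that $g$ is applied elementwise, and that the term multiplying $(Wx)^t$ is the all-ones vector minus $2g(Wx)$. The learning rate `lr` and all function names are assumptions introduced here; the rule itself only specifies the direction of the update.

```python
import math


def logistic(u):
    """g(u) = 1 / (1 + exp(-u)), written to avoid overflow for large |u|."""
    if u >= 0:
        return 1.0 / (1.0 + math.exp(-u))
    e = math.exp(u)
    return e / (1.0 + e)


def matvec(M, v):
    """Matrix-vector product with M given as a list of rows."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def matmul(A, B):
    """Matrix-matrix product of two lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def infomax_delta(W, x):
    """Update direction (I + (1 - 2 g(Wx)) (Wx)^t) W for one observation x."""
    u = matvec(W, x)
    n = len(u)
    coeff = [1.0 - 2.0 * logistic(ui) for ui in u]
    # G[i][j] = identity entry plus the outer product of coeff and u.
    G = [[(1.0 if i == j else 0.0) + coeff[i] * u[j] for j in range(n)]
         for i in range(n)]
    return matmul(G, W)


def infomax_step(W, x, lr=0.01):
    """One gradient step W + lr * delta; lr is an assumed step size."""
    delta = infomax_delta(W, x)
    return [[w + lr * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]
```

In practice the update would be averaged over many observations (or mini-batches) and iterated until the matrix stabilizes; that outer loop is omitted here.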

4 Text Categorization Enhanced with Feature Space Restructuring

As in most previous works, we adopt the Vector Space Model (Salton and McGill, 1983) for representing documents. In this framework, each document $d$ is represented as a vector $(f_1, \dots, f_d)$ with word frequencies as its elements.

4.1 Feature Space Restructuring

First we reduce the dimension of document vectors using PCA or ICA. As for PCA, we follow the previous work described in, e.g., (Deerwester et al., 1990). Isbell and Viola (1998) use ICA for dimension reduction and obtain a good result in Information Retrieval. At the first step of our method, where the reduced vectors are obtained, we follow their method. In this framework, each document $d$ is considered as a linear mixture of sources $s$ representing topics. Each word plays the role of a "microphone" and receives a word frequency in the document as a mixed signal at each time unit. This formulation is represented by the equation:

\[ d = As, \tag{10} \]

where $A$ is a mixing matrix. Although both $A$ and $s$ are unknown, they can be obtained using the independence assumption. The source signals $s$ are considered as a reduced expression of this document. In the case of PCA, the restructuring is processed in the same way. The only difference is that independent components correspond to principal components for the PCA case.

After computing a reduced vector $s$ with PCA or ICA, we concatenate the original vector $d$ and the reduced vector $s$:

\[ \hat{d} = \begin{pmatrix} d \\ s \end{pmatrix} \tag{11} \]

only on the reduced information, but make use of both the reduced and the original information, that is, the restructured information.
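The concatenation step can be sketched as follows, assuming that a linear map from a document vector to its reduced vector has already been estimated by PCA or ICA. The matrix `W`, the function names, and the numbers are hypothetical and serve only to show the layout of the restructured vector.

```python
def project(W, d):
    """Reduced vector s = W d, with W given as a list of rows."""
    return [sum(w * x for w, x in zip(row, d)) for row in W]


def restructure(W, d):
    """Concatenate the original vector d with its reduced vector s."""
    return list(d) + project(W, d)


# Hypothetical 2-component map over a 3-word vocabulary (illustrative values).
W = [[1.0, 0.0, 1.0],
     [0.0, 2.0, 0.0]]
doc = [1.0, 2.0, 3.0]
d_hat = restructure(W, doc)  # 3 original features followed by 2 components
```

The restructured vector therefore has as many elements as the vocabulary plus the number of extracted components, and the original word frequencies are kept unchanged in its first block.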

4.2 Text Categorization

Regarding $\hat{d}$ as the input feature vector of a document, we use SVM for categorization.

Since SVMs are binary classifiers themselves, we take here the one-versus-rest method to apply them to multi-class classification tasks.
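The one-versus-rest setup amounts to building one binary labeling per category. The following is a minimal sketch under the assumption that each document comes with a set of category names; the function name and the sample data are illustrative only.

```python
def one_versus_rest_labels(doc_categories, categories):
    """For each category, label a document +1 if it belongs to it, else -1."""
    return {
        c: [1 if c in cats else -1 for cats in doc_categories]
        for c in categories
    }


# Illustrative data: the category sets of three documents.
docs = [{"earn"}, {"acq"}, {"trade", "money-fx"}]
targets = one_versus_rest_labels(docs, ["earn", "acq", "trade"])
```

Each entry of `targets` can then be passed, together with the restructured vectors, to a separate binary SVM.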

5 Theoretical Perspective

5.1 Validation as a Kernel Function

The proposed feature restructuring method can be considered as the use of a certain kernel for the pre-restructured feature space. We give an explanation for the linear case. Given two vectors, $d_1$ and $d_2$, the kernel function $K$ in the restructured space is expressed as,

\[ K(\hat{d}_1, \hat{d}_2) = \hat{d}_1^t \hat{d}_2 = d_1^t d_2 + s_1^t s_2 = d_1^t d_2 + d_1^t A^t A d_2. \tag{12} \]

Considering the fact that each of two terms above is a kernel and that the sum of two kernels is also a kernel (Vapnik, 1995), the proposed restructuring is equivalent to using a certain kernel in the pre-restructured space.
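The identity in (12) can be checked numerically on a small case. In the sketch below, `A` denotes the matrix that maps a document vector to its reduced vector, as in the last line of (12); its entries and the two document vectors are made-up integers chosen so that the arithmetic is exact.

```python
def dot(u, v):
    """Plain dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def project(A, d):
    """Reduced vector s = A d, with A given as a list of rows."""
    return [dot(row, d) for row in A]


def restructured_kernel(A, d1, d2):
    """Sum of two linear kernels: d1 . d2 + (A d1) . (A d2)."""
    return dot(d1, d2) + dot(project(A, d1), project(A, d2))


# Illustrative values only.
A = [[1, 0, 1],
     [0, 2, 0]]
d1 = [1, 2, 3]
d2 = [0, 1, 1]

# Linear kernel evaluated directly on the concatenated vectors.
d1_hat = d1 + project(A, d1)
d2_hat = d2 + project(A, d2)
direct = dot(d1_hat, d2_hat)
```

Here `direct` and `restructured_kernel(A, d1, d2)` agree, which is the linear case of the statement that restructuring is equivalent to a kernel on the original space.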

5.2 Interpretation of Feature Space Restructuring

The expression (12) shows that weights are put on the latent semantic indices determined by ICA and PCA respectively. The criterion of meaningfulness depends on which of ICA and PCA is used. Note that weighting is different from reducing. In the dimension-reduction methods, only the latent semantic space is considered, but in our method, the original feature space still directly influences the classification result.

This property of our method makes it possible to focus on the information given by the latent semantic space, without losing information given by the original feature space.

In text categorization, classes to be predicted are sometimes characterized by local information such as the occurrence of a certain word, [...] words. Considering this situation and the above property of our method, it is not surprising that our method gives a good result.

6 Experiments

To evaluate the proposed method, we conducted several experiments.

The data used here is the Reuters-21578 dataset. The most frequent 6 categories are extracted from the training set of the corpus. This leaves 4872 documents (see Table 1). Some part of them is used as training data and the rest is used as test data. Only the words occurring more than twice are used. Both stemming and stop-word removal are performed. For computation, we used SVM-light (Joachims, 1999b).

We conducted two kinds of experiments. The first one focuses on evaluating the performance of the proposed method for each category, with a fixed number of labeled data (Section 6.1). The second one is conducted to show that the proposed method gives a good result also when the number of labeled data increases (Section 6.2).

The results are evaluated by F-measures. To evaluate the performance across categories, we computed Micro-average and Macro-average (Yang, 1999) of F-measures. Micro-average is obtained by first computing precision and recall for all categories and then using them to compute the F-measure. Macro-average is computed by first calculating F-measures for each category and then averaging them. Micro-average tends to be dominated by large-sized categories, and Macro-average by small-sized ones.
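The two averages can be sketched from per-category counts of true positives, false positives and false negatives. The sketch assumes the balanced F-measure (equal weight on precision and recall), which is not stated explicitly above, and the counts are invented for illustration rather than taken from the experiments.

```python
def f_measure(tp, fp, fn):
    """Balanced F-measure 2PR/(P+R), written directly in terms of counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_average(counts):
    """Mean of the per-category F-measures."""
    return sum(f_measure(*c) for c in counts.values()) / len(counts)


def micro_average(counts):
    """F-measure computed from counts pooled over all categories."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    return f_measure(tp, fp, fn)


# Illustrative (tp, fp, fn) counts, not taken from the experiments.
counts = {"large": (8, 2, 0), "small": (1, 0, 3)}
macro = macro_average(counts)
micro = micro_average(counts)
```

With these counts the micro-average stays close to the F-measure of the large category, while the macro-average is pulled down by the small one, in line with the tendency described above.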

The kernel function used here is a linear kernel. The number of independent or principal components extracted by ICA or PCA is set to 50.

6.1 Performance with a Fixed Number of Data

In this experiment, we treated 100, 500, 1000 and 2000 samples as labeled respectively and kept the other 4772, 4372, 3872 and 2872 samples unlabeled. The experiment was conducted 10 times for each sample size repeatedly with randomly selected labeled samples and their


Table 1: Documents used in Experiments

category    number of documents
earn        2673
acq         1435
trade        225
crude        223
money-fx     176
interest     140

combinations of restructuring methods are written. "Original" means the data of original document vectors. "PCA" and "ICA" mean the data of only reduced vectors, respectively. "Original+PCA" and "Original+ICA" are the restructured data explained in Section 4.

The proposed method yields a high F-measure in all the categories for 1000 and 2000 labeled data and in most categories for 100 and 500 labeled data. The last two rows of Tables 2, 3, 4 and 5 show that both Micro-average and Macro-average are the highest for the proposed method. This means that the proposed method performs well both for large-sized categories (e.g., earn) and small-sized categories (e.g., interest), regardless of the number of labeled data.

6.2 Performance for the Increase of the Labeled Data

To investigate how each method behaves when the number of labeled data increases, we conducted this experiment. The number of labeled data ranges from 100 to 2000. The results are shown in Figure 2 and Figure 3. "PCA" gives a good score only with a small number of data and "Original" gives a good score only with a large number of data. In contrast to them, the proposed method produces high performance both with small and large numbers of data.

7 Conclusions

We proposed a new method of feature space restructuring for SVM. In our method, independent components are extracted using ICA and concatenated with the original vectors. Using these new vectors in the restructured space, we achieved high performance both with small and large numbers of labeled data.

The proposed method can be applied also

Figure 2: Micro-average (x-axis: Number of Labeled Data, 0 to 2000; y-axis: Micro-average of F-measures; curves: Original+ICA, PCA, Original, Original(TSVM))

Figure 3: Macro-average (x-axis: Number of Labeled Data, 0 to 2000; y-axis: Macro-average of F-measures; curves: Original+ICA, PCA, Original, Original(TSVM))

that they are robust against noise and can handle a high-dimensional feature space. From this point of view, it is expected that the proposed method is useful for kernel-based methods, to which SVM belongs.

As future work, we need to find a way to decide the number of independent components to be extracted. In this paper, we set the number appropri-

Table 2: F-Measures (100 Labeled Data)

Method        Original  Original(TSVM)  PCA    ICA    Original+PCA  Original+ICA
earn          92.96     84.00           91.13  86.60  92.97         92.88
acq           85.88     81.42           85.67  80.86  85.91         87.48
trade         36.52     65.59           72.41  72.28  36.68         70.73
crude         65.69     70.90           79.75  80.67  65.93         82.87
money-fx      32.46     45.01           52.69  54.37  32.47         48.62
interest      51.30     52.69           64.44  63.48  51.30         64.84
microaverage  83.63     79.48           85.98  82.14  83.66         87.40
macroaverage  60.80     66.60           74.34  73.04  60.87         74.56

Table 3: F-Measures (500 Labeled Data)

Method        Original  Original(TSVM)  PCA    ICA    Original+PCA  Original+ICA
earn          96.49     93.97           94.38  93.45  96.49         96.70
acq           93.23     91.57           89.18  87.45  93.22         93.41
trade         86.31     80.81           87.42  86.58  86.37         91.70
crude         83.33     79.78           81.36  78.28  83.43         87.12
money-fx      62.94     64.88           72.83  73.45  63.17         73.99
interest      59.31     52.02           73.37  72.18  59.31         70.41
microaverage  92.17     89.75           90.54  89.33  92.19         93.48
macroaverage  80.26     77.17           83.09  81.89  80.34         85.55

Table 4: F-Measures (1000 Labeled Data)

Method        Original  Original(TSVM)  PCA    ICA    Original+PCA  Original+ICA
earn          97.15     95.52           96.07  95.53  97.15         97.26
acq           94.60     93.77           92.18  91.44  94.60         94.84
trade         91.19     86.11           87.13  86.87  91.23         93.25
crude         87.99     80.03           80.93  78.75  87.99         89.41
money-fx      73.68     68.85           72.96  72.68  69.96         80.99
interest      75.34     57.26           72.83  68.25  75.34         79.27
microaverage  94.23     91.79           92.31  91.54  94.09         94.90
macroaverage  86.65     80.25           83.68  82.25  86.04         89.17

Table 5: F-Measures (2000 Labeled Data)

Method        Original  Original(TSVM)  PCA    ICA    Original+PCA  Original+ICA
earn          97.48     95.92           97.18  97.12  97.48         97.55
acq           95.39     94.39           94.78  94.80  95.39         95.65
trade         93.81     86.33           88.61  85.28  93.81         95.90
crude         89.88     80.35           82.63  78.56  89.88         90.25
money-fx      77.44     70.60           74.84  70.69  77.49         81.56
interest      82.71     62.15           73.99  68.46  82.76         83.02
microaverage  95.19     92.43           93.93  93.26  95.20         95.58


model selection such as Minimum Description Length (Rissanen, 1987) or Akaike Information Criterion (Akaike, 1974) could be a good theoretical basis.

As explained in Section 4, two terms $d_1^t d_2$ and $d_1^t A^t A d_2$ are simply concatenated in our method. But either of these terms can be multiplied by a certain constant. This means that either of the original space and the Latent Semantic Space can be weighted. Searching for the best weighting scheme is one of the future works.
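One way to realize such a weighting, given here only as a sketch, is to scale each block before concatenation. The constants $\lambda_1, \lambda_2 \ge 0$ are symbols introduced for this illustration and were not used in the experiments.

```latex
\[
\hat{d}^{(\lambda)} =
\begin{pmatrix} \sqrt{\lambda_1}\, d \\ \sqrt{\lambda_2}\, s \end{pmatrix}
\quad\Longrightarrow\quad
\hat{d}^{(\lambda)\,t}_1 \hat{d}^{(\lambda)}_2
= \lambda_1\, d_1^t d_2 + \lambda_2\, s_1^t s_2
= \lambda_1\, d_1^t d_2 + \lambda_2\, d_1^t A^t A d_2 .
\]
```

Setting $\lambda_1 = \lambda_2 = 1$ recovers the unweighted concatenation, and because a non-negative multiple of a kernel is again a kernel, the weighted sum remains a valid kernel.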

Acknowledgment

We would like to thank Thomas Kolenda (Technical University of Denmark) for helping us with the code.

References

Akaike, H. 1974. A New Look at the Statistical Model Identification. IEEE Trans. Autom. Control, vol. AC-19, pp. 716-723.

Amari, S. 1998. Natural Gradient Works Efficiently in Learning. Neural Computation, vol. 10-2, pp. 251-276.

Bell, A. J. and Sejnowski, T. J. 1995. An Information Maximization Approach to Blind Separation and Blind Deconvolution. Neural Computation, 7, 1129-1159.

Bell, A. J. and Sejnowski, T. J. 1997. The 'Independent Components' of Natural Scenes are Edge Filters. Vision Research, 37(23), pp. 3327-3338.

Deerwester, S., Dumais, T., Landauer, T., Furnas, W. and Harshman, A. 1990. Indexing by Latent Semantic Analysis. Journal of the Society for Information Science, 41(6), pp. 391-497.

Glenn, F. and Mangasarian, O. 2001. Semi-Supervised Support Vector Machines for Unlabeled Data Classification. Optimization Methods and Software, pp. 1-14.

Herault, J. and Jutten, J. 1986. Space or Time Adaptive Signal Processing by Neural Network Models. Neural networks for computing: AIP conference proceedings 151, pp. 206-211.

Isbell, C. and Viola, P. 1998. Restructuring Sparse High Dimensional Data for Effective Retrieval. Advances in Neural Information

Support Vector Machines: Learning with Many Relevant Features. Proceedings of the European Conference on Machine Learning, pp. 137-142.

Joachims, T. 1999a. Transductive Inference for Text Classification using Support Vector Machines. Machine Learning - Proc. 16th Int'l Conf. (ICML '99), pp. 200-209.

Joachims, T. 1999b. Making large-Scale SVM Learning Practical. Advances in Kernel Methods - Support Vector Learning, pp. 169-184.

Kaban, A. and Girolami, M. 2000. Unsupervised Topic Separation and Keyword Identification in Document Collections: A Projection Approach. Technical Report.

Kolenda, T., Hansen, L. K. and Sigurdsson, S. 2000. Independent Components in Text. Advances in Independent Component Analysis, Springer-Verlag, pp. 235-256.

Mitchell, T. 1997. Machine Learning, McGraw Hill.

Nigam, K., McCallum, A., Thrun, S. and Mitchell, T. 2000. Text Classification from Labeled and Unlabeled Documents using EM. Machine Learning, 39(2/3), pp. 103-134.

Rissanen, J. 1987. Stochastic Complexity. Journal of Royal Statistical Society, Series B, 49(3), pp. 223-239.

Salton, G. and McGill, M. J. 1983. Introduction to Modern Information Retrieval. McGraw-Hill Book Company, New York.

Smola, A., Bartlett, P., Schölkopf, B. and Schuurmans, D. 2000. Advances in Large Margin Classifiers. MIT Press.

Vapnik, V. 1995. The Nature of Statistical Learning Theory. Springer.

Weston, J., Mukherjee, S., Chapelle, O., Pontil, M., Poggio, T. and Vapnik, V. 2000. Feature Selection for SVMs. In Advances in Neural Information Processing Systems, volume 13.

Yang, Y. An Evaluation of Statistical Approaches to Text Categorization. Information