with Application to Text Categorization
Hiroya Takamura and Yuji Matsumoto
Department of Information Technology
Nara Institute of Science and Technology
8516-9, Takayama, Ikoma, 630-0101 Japan
fhiroya-t,[email protected]
Abstract
In this paper, we propose a new method of text categorization based
on feature space restructuring for SVMs. In our method, independent
components of document vectors are extracted using ICA and
concatenated with the original vectors. This restructuring makes it
possible for SVMs to focus on the latent semantic space without
losing information given by the original feature space. Using this
method, we achieved high performance in text categorization with
both small and large numbers of labeled data.
1 Introduction
The task of text categorization has been extensively studied in
Natural Language Processing. Most successful works rely on a large
number of classified data. However, it is hard to collect classified
data, so considering real applications, text categorization must be
realized even with a small number of labeled data. Several methods
to realize it have been proposed so far (Nigam et al., 2000), but
they need to be further developed. For that purpose, we have to take
advantage of invaluable information offered by the property of
unlabeled data. In this paper, we propose a new categorization
method based on Support Vector Machines (SVMs) (Vapnik, 1995) and
Independent Component Analysis (ICA) (Herault and Jutten, 1986; Bell
and Sejnowski, 1995). SVM is gaining popularity as a classifier with
high performance, and ICA is one of the most promising algorithms in
the field of signal processing, which extracts independent
components from mixed signals.
SVM has been applied in many applications such as Image Processing
and Natural Language Processing. The idea of applying SVM to text
categorization was introduced in (Joachims, 1998). However, when the
number of labeled
data is small, SVM often fails to produce a good result, although
several efforts against this problem have been made. There are two
strategies for improving performance in the case of a limited number
of data. One is to modify the learning algorithm itself (Joachims,
1999a; Glenn and Mangasarian, 2001), and the other is to process
training data (Weston et al., 2000), including the selection of
features. In this paper, we focus on the latter, especially on
feature space restructuring. For processing training data, Principal
Component Analysis (PCA) is often adopted in classifiers such as the
k-Nearest Neighbor method (Mitchell, 1997). But the conventional
dimension-reduction methods fail for SVM, as shown by the
experiments in Section 6. Unlike the conventional ones, our approach
uses the components obtained with ICA to augment the dimension of
the feature space.
ICA is built on the assumptions that the sources are independent of
each other and that the signals observed at multiple points are
linear mixtures of the sources. While the theoretical aspects of ICA
are being studied, its potential for applications is often pointed
out, as in (Bell and Sejnowski, 1997). The idea of using ICA for
text clustering is adopted in several works such as (Isbell and
Viola, 1998). In those works, the vector representation model is
adopted (i.e., each text is represented as a vector with the word
frequencies as the elements). It is reported, however, that the
independent components do not always correspond to the desired
classes, but represent some kind of characteristics of texts
(Kolenda et al., 2000). In (Kaban and Girolami, 2000), they showed
that the number of potential components was larger than
Taking these observations into consideration, we take the following
strategy: first we perform ICA on input document vectors, and
second, create the restructured information by concatenating the
reduced vectors (i.e., the values of the independent components) and
the original feature vectors.
PCA is an alternative restructuring method. So we conducted
experiments using SVM with various input vectors: original feature
vectors, reduced feature vectors and restructured feature vectors
(reduction and restructuring are performed by PCA and ICA). For
comparison, we conducted experiments using Transductive SVM (TSVM)
(Joachims, 1999a) as well, which is designed for the case of a small
number of labeled data.
Using the proposed method (SVM with ICA), we obtain better results
than ordinary SVM and TSVM, with both small and large numbers of
labeled data.
2 Support Vector Machines
2.1 Brief Overview of Support Vector Machines
Support Vector Machine (SVM) is one of the large-margin classifiers
(Smola et al., 2000). Given a set of pairs,
    $$(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n), \qquad \forall i,\ x_i \in \mathbb{R}^d,\ y_i \in \{-1, 1\} \tag{1}$$
of a feature vector and a label, SVM constructs a separating
hyperplane with the largest margin (the distance between the
hyperplane and the vectors, see Figure 1):
    $$f(x) = w \cdot x + b. \tag{2}$$
Figure 1: Support Vector Machine (the solid line corresponds to the
optimal hyperplane; the plot marks positive examples, negative
examples and the margin).
Finding the largest margin is equivalent to minimizing the norm
$\|w\|$, which is expressed as:
    $$\min.\ \frac{1}{2}\|w\|^2, \qquad \text{s.t.}\ \forall i,\ y_i (x_i \cdot w + b) - 1 \geq 0. \tag{3}$$
This is realized by solving the quadratic program (dual problem of
(3)):
    $$\max.\ \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j\, x_i \cdot x_j, \qquad \text{s.t.}\ \sum_i \alpha_i y_i = 0,\ \forall i,\ \alpha_i \geq 0, \tag{4}$$
where the $\alpha_i$'s are Lagrange multipliers. Using the
$\alpha_i$'s that maximize (4), $w$ is expressed as
    $$w = \sum_i \alpha_i y_i x_i. \tag{5}$$
Substituting (5) into (2), we obtain
    $$f(x) = \sum_i \alpha_i y_i\, x_i \cdot x + b. \tag{6}$$
Unlabeled data are classified according to the signs of (6).
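
As a minimal sketch of how (6) is used at prediction time, the
following Python fragment evaluates f(x) from a set of support
vectors and returns its sign. The function names and the toy values
(two support vectors, multipliers of 0.25, zero bias) are assumptions
made for illustration only; in practice the multipliers are obtained
by solving (4).

```python
def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def decision_value(x, support_vectors, labels, alphas, b):
    """Evaluate f(x) = sum_i alpha_i * y_i * (x_i . x) + b, as in Eq. (6)."""
    return sum(a * y * dot(sv, x)
               for sv, y, a in zip(support_vectors, labels, alphas)) + b


def classify(x, support_vectors, labels, alphas, b):
    """Return +1 or -1 according to the sign of f(x)."""
    value = decision_value(x, support_vectors, labels, alphas, b)
    return 1 if value >= 0 else -1


# Toy values (assumed, not taken from the experiments).
SUPPORT_VECTORS = [(1.0, 1.0), (-1.0, -1.0)]
LABELS = [1, -1]
ALPHAS = [0.25, 0.25]   # satisfies sum_i alpha_i * y_i = 0
BIAS = 0.0

if __name__ == "__main__":
    for point in [(2.0, 0.0), (-3.0, 1.0)]:
        print(point, classify(point, SUPPORT_VECTORS, LABELS, ALPHAS, BIAS))
```

With these toy values the weight vector of (5) is (0.5, 0.5), so both
support vectors lie exactly on the margin (f = +1 and f = -1).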
2.2 Kernel Method
SVM is a linear classifier and its separating ability is limited. To
compensate for this limitation, the Kernel Method is usually
combined with SVM (Vapnik, 1995).
In the Kernel Method, the dot product in (4) and (6) is replaced by
a more general inner product $K(x_i, x)$, called the kernel
function. The polynomial kernel $(x_i \cdot x_j + 1)^d$
($d \in \mathbb{N}_+$) and the RBF kernel
$\exp\{-\|x_i - x_j\|^2 / 2\sigma^2\}$ are often used. Using the
kernel method means that feature vectors are mapped into a
(higher-dimensional) Hilbert space and linearly separated there.
This mapping structure makes non-linear separation possible,
although SVM is basically a linear classifier. Although the kernel
method deals with a high-dimensional (possibly infinite) Hilbert
space, there is no need to compute high-dimensional vectors
explicitly. Only the general inner products of two vectors are
needed. This leads to a relatively small computational overhead.
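
The two kernels named above can be written down directly. The sketch
below is illustrative only: the function names and default parameter
values are assumptions, and `kernel_decision_value` simply restates
(6) with the dot product replaced by a kernel, so that no
high-dimensional vector is ever constructed.

```python
import math


def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def polynomial_kernel(x_i, x_j, degree=2):
    """(x_i . x_j + 1)^d for a positive integer d."""
    return (dot(x_i, x_j) + 1.0) ** degree


def rbf_kernel(x_i, x_j, sigma=1.0):
    """exp(-||x_i - x_j||^2 / (2 * sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x_i, x_j))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def kernel_decision_value(x, support_vectors, labels, alphas, b, kernel):
    """Eq. (6) with the dot product replaced by kernel(x_i, x)."""
    return sum(a * y * kernel(sv, x)
               for sv, y, a in zip(support_vectors, labels, alphas)) + b
```

Passing `dot` as the `kernel` argument recovers the linear case,
which is the setting used in the experiments of Section 6.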
2.3 Transductive SVMs
The Transductive Support Vector Machine (TSVM) is introduced in
(Joachims, 1999a), which is one realization of transductive learning
in (Vapnik, 1995). It is designed for classification with a small
number of labeled data. Its algorithm is approximately as follows:
1. construct a hyperplane using labeled data in the same way as the
ordinary SVMs.
2. classify the unlabeled (test) data according to the current
hyperplane.
3. select the pair of a positively classified sample and a
negatively classified sample that are nearest to the hyperplane.
4. exchange the labels of those samples, if the margin gets larger
by exchanging them.
5. terminate if a stopping criterion is satisfied. Otherwise, go
back to step 2.
This is one way to search for the largest margin, permitting the
relabeling of test data that have already been labeled by the
classifier in the previous iteration.
3 Independent Component Analysis
Independent Component Analysis (ICA) is a method by which source
signals are extracted from mixed signals. It is based on the
assumptions that the sources $s \in \mathbb{R}^m$ are statistically
independent of each other and that the observed signals
$x \in \mathbb{R}^n$ are linear mixtures of the sources:
    $$x = As. \tag{7}$$
Here the matrix $A$ is called a mixing matrix. We observe $x$ as a
time series and estimate both $A$ and $s = (s_1, \ldots, s_m)$. So
our purpose here is to find a demixing matrix $W$ such that
$s_1, \ldots, s_m$ are as independent of each other as possible:
    $$s = Wx. \tag{8}$$
The demixing matrix $W$ is estimated by learning with an objective
function indicating independence. There are several criteria of
independence and their learning rules, among which we take here the
Infomax approach (Bell and Sejnowski, 1995), but with the natural
gradient (Amari, 1998). Its learning rule is
    $$\Delta W = (I + (I - 2g(Wx))(Wx)^t)W, \tag{9}$$
where $g(u) = 1/(1 + \exp(-u))$.
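
To make the learning rule (9) concrete, the following Python sketch
applies one update to a demixing matrix for a single observation x.
Several details are assumptions and not taken from the text: the
update is applied per sample, the step size `learning_rate` is a
hypothetical parameter, and the inner "I" in (9) is read as a vector
of ones so that the term (1 - 2g(Wx))(Wx)^t is an outer product.

```python
import math


def logistic(u):
    """g(u) = 1 / (1 + exp(-u))."""
    return 1.0 / (1.0 + math.exp(-u))


def matmul(a, b):
    """Product of two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def infomax_step(w, x, learning_rate=0.01):
    """Return W + learning_rate * (I + (1 - 2 g(Wx)) (Wx)^t) W."""
    m = len(w)
    # u = W x, the current estimate of the sources.
    u = [sum(w[i][k] * x[k] for k in range(len(x))) for i in range(m)]
    # v = 1 - 2 g(u), computed element-wise.
    v = [1.0 - 2.0 * logistic(u_i) for u_i in u]
    # M = I + v u^t, an m-by-m matrix.
    mat = [[(1.0 if i == j else 0.0) + v[i] * u[j] for j in range(m)]
           for i in range(m)]
    delta = matmul(mat, w)  # Delta W of Eq. (9)
    return [[w[i][j] + learning_rate * delta[i][j]
             for j in range(len(w[0]))]
            for i in range(m)]
```

In an actual estimation this step would be repeated over many
observations until W stabilizes; the stopping rule is left out here.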
4 Text Categorization Enhanced
with Feature Space Restructuring
As in most previous works, we adopt the Vector Space Model (Salton
and McGill, 1983) for representing documents. In this framework,
each document $d$ is represented as a vector $(f_1, \ldots, f_d)$
with word frequencies as its elements.
4.1 Feature Space Restructuring
First we reduce the dimension of document vectors using PCA or ICA.
As for PCA, we follow the previous work described in, e.g.,
(Deerwester et al., 1990). In (Isbell and Viola, 1998), they use ICA
for dimension reduction and obtain a good result in Information
Retrieval. At the first step of our method, where the reduced
vectors are obtained, we follow their method. In this framework,
each document $d$ is considered as a linear mixture of sources $s$
representing topics. Each word plays the role of a "microphone" and
receives a word frequency in the document as a mixed signal at each
time unit. This formulation is represented by the equation:
    $$d = As, \tag{10}$$
where $A$ is a mixing matrix. Although both $A$ and $s$ are unknown,
they can be obtained using the independence assumption. The source
signals $s$ are considered as a reduced expression of this document.
In the case of PCA, the restructuring is processed in the same way.
The only difference is that independent components correspond to
principal components for the PCA case.
After computing a reduced vector $s$ with PCA or ICA, we concatenate
the original vector $d$ and the reduced vector $s$:
    $$\hat{d} = \begin{pmatrix} d \\ s \end{pmatrix}. \tag{11}$$
In this way, we do not rely only on the reduced information, but
make use of both the reduced and the original information, that is,
the restructured information.
4.2 Text Categorization
Regarding $\hat{d}$ as the input feature vector of a document, we
use SVM for categorization.
Since SVMs are binary classifiers themselves, we take here the
one-versus-rest method to apply them to multi-class classification
tasks.
5 Theoretical Perspective
5.1 Validation as a Kernel Function
The proposed feature restructuring method can be considered as the
use of a certain kernel for the pre-restructured feature space. We
give an explanation for the linear case. Given two vectors, $d_1$
and $d_2$, the kernel function $K$ in the restructured space is
expressed as
    $$K(\hat{d}_1, \hat{d}_2) = \hat{d}_1^t \hat{d}_2 = d_1^t d_2 + s_1^t s_2 = d_1^t d_2 + d_1^t A^t A d_2. \tag{12}$$
Considering the fact that each of the two terms above is a kernel
and that the sum of two kernels is also a kernel (Vapnik, 1995), the
proposed restructuring is equivalent to using a certain kernel in
the pre-restructured space.
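
The identity in (12) can be checked numerically on a small example.
In the Python sketch below, `PROJECTION` is a hypothetical 2-by-3
matrix standing in for whatever linear transform ICA or PCA has
learned, and the two document vectors are made up; only the
structure of the computation (concatenate, then take the dot
product) follows the text.

```python
def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def project(matrix, d):
    """Reduced vector s obtained by applying a linear map to d."""
    return [dot(row, d) for row in matrix]


def restructure(matrix, d):
    """Concatenate the original vector d with its reduced vector s."""
    return list(d) + project(matrix, d)


# Hypothetical linear map standing in for a learned ICA or PCA transform.
PROJECTION = [[1.0, 0.0, 1.0],
              [0.0, 2.0, 0.0]]

d1 = [1.0, 2.0, 0.0]
d2 = [0.0, 1.0, 3.0]

# Linear kernel in the restructured space.
lhs = dot(restructure(PROJECTION, d1), restructure(PROJECTION, d2))
# Sum of the kernel on the original vectors and the kernel on the reduced ones.
rhs = dot(d1, d2) + dot(project(PROJECTION, d1), project(PROJECTION, d2))
```

Here `lhs` and `rhs` coincide, which is the linear case of the
argument above: the dot product of concatenated vectors splits into
one term per block.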
5.2 Interpretation of Feature Space
Restructuring
The expression (12) shows that weights are put on the latent
semantic indices determined by ICA and PCA respectively. The
criterion of meaningfulness depends on which of ICA and PCA is used.
Note that weighting is different from reducing. In the
dimension-reduction methods, only the latent semantic space is
considered, but in our method, the original feature space still
directly influences the classification result.
This property of our method makes it possible to focus on the
information given by the latent semantic space, without losing
information given by the original feature space.
In text categorization, classes to be predicted are sometimes
characterized by local information such as the occurrence of a
certain word,
words. Considering this situation and the above property of our
method, it is not surprising that our method gives a good result.
6 Experiments
To evaluate the proposed method, we conducted several experiments.
The data used here is the Reuters-21578 dataset. The 6 most frequent
categories are extracted from the training set of the corpus. This
leaves 4872 documents (see Table 1). Some part of them is used as
training data and the rest is used as test data. Only the words
occurring more than twice are used. Both stemming and stop-word
removal are performed. For computation, we used SVM-light (Joachims,
1999b).
We conducted two kinds of experiments. The first one focuses on
evaluating the performance of the proposed method for each category,
with a fixed number of labeled data (Section 6.1). The second one is
conducted to show that the proposed method gives a good result also
when the number of labeled data increases (Section 6.2).
The results are evaluated by F-measures. To evaluate the performance
across categories, we computed the Micro-average and Macro-average
(Yang, 1999) of F-measures. Micro-average is obtained by first
computing precision and recall for all categories and then using
them to compute the F-measure. Macro-average is computed by first
calculating F-measures for each category and then averaging them.
Micro-average tends to be dominated by large-sized categories, and
Macro-average by small-sized ones.
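
The difference between the two averages can be illustrated with a
short Python sketch. It assumes the balanced F-measure (F1) and
per-category counts of true positives, false positives and false
negatives; the function names and the counts in `COUNTS` are made up
for illustration and are not results from the experiments.

```python
def f_measure(tp, fp, fn):
    """Balanced F-measure from true positives, false positives, false negatives."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2.0 * precision * recall / (precision + recall)


def micro_average(counts):
    """Pool the counts over all categories, then compute one F-measure."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    return f_measure(tp, fp, fn)


def macro_average(counts):
    """Compute the F-measure per category, then take the unweighted mean."""
    scores = [f_measure(*c) for c in counts.values()]
    return sum(scores) / len(scores)


# Made-up (tp, fp, fn) counts for one large and one small category.
COUNTS = {"large": (90, 10, 10), "small": (1, 1, 3)}

if __name__ == "__main__":
    print("micro:", micro_average(COUNTS))
    print("macro:", macro_average(COUNTS))
```

With these counts the large category has an F-measure of 0.9 and the
small one about 0.33; the micro-average stays close to the large
category, while the macro-average is pulled down by the small one.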
The kernel function used here is a linear kernel. The number of
independent or principal components extracted by ICA or PCA is set
to 50.
6.1 Performance with a Fixed Number of Data
In this experiment, we treated 100, 500, 1000 and 2000 samples as
labeled respectively and kept the other 4772, 4372, 3872 and 2872
samples unlabeled. The experiment was repeated 10 times for each
sample size with randomly selected labeled samples and their
Table 1: Documents used in Experiments
category    number of documents
earn        2673
acq         1435
trade        225
crude        223
money-fx     176
interest     140
combinations of restructuring methods are written. "Original" means
the data of original document vectors. "PCA" and "ICA" mean the data
of only reduced vectors, respectively. "Original+PCA" and
"Original+ICA" are the restructured data explained in Section 4.
The proposed method yields a high F-measure in all the categories
for 1000 and 2000 labeled data and in most categories for 100 and
500 labeled data. The last two rows of Tables 2, 3, 4 and 5 show
that both Micro-average and Macro-average are the highest for the
proposed method. This means that the proposed method performs well
both for large-sized categories (e.g., earn) and small-sized
categories (e.g., interest), regardless of the number of labeled
data.
6.2 Performance for the Increase of the Labeled Data
To investigate how each method behaves when the number of labeled
data increases, we conducted this experiment. The number of labeled
data ranges from 100 to 2000. The results are shown in Figure 2 and
Figure 3. "PCA" gives a good score only with a small number of data
and "Original" gives a good score only with a large number of data.
In contrast to them, the proposed method produces high performance
with both small and large numbers of data.
7 Conclusions
We proposed a new method of feature space restructuring for SVM. In
our method, independent components are extracted using ICA and
concatenated with the original vectors. Using these new vectors in
the restructured space, we achieved high performance with both small
and large numbers of labeled data.
The proposed method can be applied also
Figure 2: Micro-average (micro-average of F-measures plotted against
the number of labeled data, 0 to 2000, for Original+ICA, PCA,
Original and Original(TSVM)).
Figure 3: Macro-average (macro-average of F-measures plotted against
the number of labeled data, 0 to 2000, for Original+ICA, PCA,
Original and Original(TSVM)).
that they are robust against noise and can handle a high-dimensional
feature space. From this point of view, it is expected that the
proposed method is useful for kernel-based methods, to which SVM
belongs.
As a future work, we need to find a way to decide the number of
independent components to be extracted. In this paper, we set the
number
appropri-
Table 2: F-Measures (100 Labeled Data)
Method        Original  Original(TSVM)  PCA    ICA    Original+PCA  Original+ICA
earn          92.96     84.00           91.13  86.60  92.97         92.88
acq           85.88     81.42           85.67  80.86  85.91         87.48
trade         36.52     65.59           72.41  72.28  36.68         70.73
crude         65.69     70.90           79.75  80.67  65.93         82.87
money-fx      32.46     45.01           52.69  54.37  32.47         48.62
interest      51.30     52.69           64.44  63.48  51.30         64.84
microaverage  83.63     79.48           85.98  82.14  83.66         87.40
macroaverage  60.80     66.60           74.34  73.04  60.87         74.56
Table 3: F-Measures (500 Labeled Data)
Method        Original  Original(TSVM)  PCA    ICA    Original+PCA  Original+ICA
earn          96.49     93.97           94.38  93.45  96.49         96.70
acq           93.23     91.57           89.18  87.45  93.22         93.41
trade         86.31     80.81           87.42  86.58  86.37         91.70
crude         83.33     79.78           81.36  78.28  83.43         87.12
money-fx      62.94     64.88           72.83  73.45  63.17         73.99
interest      59.31     52.02           73.37  72.18  59.31         70.41
microaverage  92.17     89.75           90.54  89.33  92.19         93.48
macroaverage  80.26     77.17           83.09  81.89  80.34         85.55
Table 4: F-Measures (1000 Labeled Data)
Method        Original  Original(TSVM)  PCA    ICA    Original+PCA  Original+ICA
earn          97.15     95.52           96.07  95.53  97.15         97.26
acq           94.60     93.77           92.18  91.44  94.60         94.84
trade         91.19     86.11           87.13  86.87  91.23         93.25
crude         87.99     80.03           80.93  78.75  87.99         89.41
money-fx      73.68     68.85           72.96  72.68  69.96         80.99
interest      75.34     57.26           72.83  68.25  75.34         79.27
microaverage  94.23     91.79           92.31  91.54  94.09         94.90
macroaverage  86.65     80.25           83.68  82.25  86.04         89.17
Table 5: F-Measures (2000 Labeled Data)
Method        Original  Original(TSVM)  PCA    ICA    Original+PCA  Original+ICA
earn          97.48     95.92           97.18  97.12  97.48         97.55
acq           95.39     94.39           94.78  94.80  95.39         95.65
trade         93.81     86.33           88.61  85.28  93.81         95.90
crude         89.88     80.35           82.63  78.56  89.88         90.25
money-fx      77.44     70.60           74.84  70.69  77.49         81.56
interest      82.71     62.15           73.99  68.46  82.76         83.02
microaverage  95.19     92.43           93.93  93.26  95.20         95.58
model selection such as Minimum Description Length (Rissanen, 1987)
or Akaike Information Criterion (Akaike, 1974) could be a good
theoretical basis.
As explained in Section 4, two terms $d_1^t d_2$ and
$d_1^t A^t A d_2$ are simply concatenated in our method. But either
of these terms can be multiplied by a certain constant. This means
that either of the original space and the Latent Semantic Space can
be weighted. Searching for the best weighting scheme is one of the
future works.
Acknowledgment
We would like to thank Thomas Kolenda (Technical University of
Denmark) for helping us with the code.
References
Akaike, H. 1974. A New Look at the Statistical Model Identification.
IEEE Trans. Autom. Control, vol. AC-19, pp. 716-723.
Amari, S. 1998. Natural Gradient Works Efficiently in Learning.
Neural Computation, vol. 10-2, pp. 251-276.
Bell, A. J. and Sejnowski, T. J. 1995. An Information Maximization
Approach to Blind Separation and Blind Deconvolution. Neural
Computation, 7, 1129-1159.
Bell, A. J. and Sejnowski, T. J. 1997. The 'Independent Components'
of Natural Scenes are Edge Filters. Vision Research, 37(23), pp.
3327-3338.
Deerwester, S., Dumais, T., Landauer, T., Furnas, W. and Harshman,
A. 1990. Indexing by Latent Semantic Analysis. Journal of the
Society for Information Science, 41(6), pp. 391-497.
Glenn, F. and Mangasarian, O. 2001. Semi-Supervised Support Vector
Machines for Unlabeled Data Classification. Optimization Methods and
Software, pp. 1-14.
Herault, J. and Jutten, J. 1986. Space or Time Adaptive Signal
Processing by Neural Network Models. Neural networks for computing:
AIP conference proceedings 151, pp. 206-211.
Isbell, C. and Viola, P. 1998. Restructuring Sparse High Dimensional
Data for Effective Retrieval. Advances in Neural Information
Processing Systems.
Joachims, T. 1998. Text Categorization with Support Vector Machines:
Learning with Many Relevant Features. Proceedings of the European
Conference on Machine Learning, pp. 137-142.
Joachims, T. 1999a. Transductive Inference for Text Classification
using Support Vector Machines. Machine Learning - Proc. 16th Int'l
Conf. (ICML '99), pp. 200-209.
Joachims, T. 1999b. Making large-Scale SVM Learning Practical.
Advances in Kernel Methods - Support Vector Learning, pp. 169-184.
Kaban, A. and Girolami, M. 2000. Unsupervised Topic Separation and
Keyword Identification in Document Collections: A Projection
Approach. Technical Report.
Kolenda, T., Hansen, L. K. and Sigurdsson, S. 2000. Independent
Components in Text. Advances in Independent Component Analysis,
Springer-Verlag, pp. 235-256.
Mitchell, T. 1997. Machine Learning, McGraw Hill.
Nigam, K., McCallum, A., Thrun, S. and Mitchell, T. 2000. Text
Classification from Labeled and Unlabeled Documents using EM.
Machine Learning, 39(2/3), pp. 103-134.
Rissanen, J. 1987. Stochastic Complexity. Journal of Royal
Statistical Society, Series B, 49(3), pp. 223-239.
Salton, G. and McGill, M. J. 1983. Introduction to Modern
Information Retrieval. McGraw-Hill Book Company, New York.
Smola, A., Bartlett, P., Scholkopf, B. and Schuurmans, D. 2000.
Advances in Large Margin Classifiers. MIT Press.
Vapnik, V. 1995. The Nature of Statistical Learning Theory.
Springer.
Weston, J., Mukherjee, S., Chapelle, O., Pontil, M., Poggio, T. and
Vapnik, V. 2000. Feature Selection for SVMs. In Advances in Neural
Information Processing Systems, volume 13.
Yang, Y. 1999. An Evaluation of Statistical Approaches to Text
Categorization. Information