Learning Multi-label Alternating Decision Trees from Texts and Data

Francesco De Comité, Rémi Gilleron, Marc Tommasi

To cite this version:

Francesco De Comité, Rémi Gilleron, Marc Tommasi. Learning Multi-label Alternating Decision Trees from Texts and Data. Petra Perner, Azriel Rosenfeld. International Conference on Machine Learning and Data Mining, 2003, Leipzig, Germany. Springer, pp. 35-49, 2003, Lecture Notes in Artificial Intelligence. ⟨inria-00536733⟩

HAL Id: inria-00536733

https://hal.inria.fr/inria-00536733

Submitted on 16 Nov 2010

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.


Francesco De Comité (1), Rémi Gilleron (2), and Marc Tommasi (2)

Équipe Grappa, EA 3588

(1) Lille 1 University, 59655 Villeneuve d'Ascq Cedex, France, decomite@lifl.fr

(2) Lille 3 University, 59653 Villeneuve d'Ascq Cedex, France, {gilleron,tommasi}@univ-lille3.fr

Abstract. Multi-label decision procedures are the target of the supervised learning algorithm we propose in this paper. Multi-label decision procedures map examples to a finite set of labels. Our learning algorithm extends Schapire and Singer's AdaBoost.MH and produces sets of rules that can be viewed as trees like Alternating Decision Trees (invented by Freund and Mason). Experiments show that we take advantage of both performance and readability by using boosting techniques as well as tree representations of large sets of rules. Moreover, a key feature of our algorithm is the ability to handle heterogeneous input data: discrete and continuous values and text data.

Keywords: boosting, alternating decision trees, text mining, multi-label problems

1 Introduction

When a patient spends more than 3 days in center X, measures of albuminuria as well as proteinuria are made. But if the patient is in center Y, then only measures of albuminuria are made.

These sentences can be viewed as a multi-label classification procedure because more than one label among {albuminuria, proteinuria} may be assigned to a given description of a situation. That is to say, we are faced with a categorization task for which the categories are not mutually exclusive. This work is originally motivated by a practical problem in medicine where each patient may be described by continuous-valued attributes (e.g. measures), nominal attributes (e.g. sex, smoker, ...) and text data (e.g. descriptions, comments, ...). It was important to produce rules that can be interpreted by physicians and also rules that reveal correlations between label predictions. These requirements have led to the realization of the algorithm presented in this paper.

This work was partially supported by project DATADIAB: "ACI télémédecine et technologies pour la santé" and project TACT/TIC Feder & CPER Région Nord-Pas de Calais.

Learning algorithms that can handle multi-label problems are therefore valuable. Of course, there are many strategies for applying a combination of many binary classifiers to solve multi-label problems ([ASS00]). But most of them ignore correlations between the different labels. The AdaBoost.MH algorithm proposed by Schapire and Singer ([SS00]) efficiently handles multi-label problems. For a given example, it also provides a real value as an outcome for each label. For practical applications, these values are important because they can be interpreted as a confidence rate about the decision for the considered label. AdaBoost.MH implements boosting techniques that are theoretically proved to transform a weak learner, called the base classifier, into a strong one. The main idea of boosting is to combine many simple and moderately inaccurate rules built by the base classifier into a single highly accurate rule. The combination is a weighted majority vote over all simple rules. Boosting has been extensively studied and many authors have shown that it performs well on standard machine learning tasks ([Bre98], [FS96], [FS97]). Unfortunately, as pointed out by Freund and Mason and other authors, the rule ultimately produced by a boosting algorithm may be difficult to understand and to interpret.

In [FM99], Freund and Mason introduce Alternating Decision Trees (ADTrees). The motivation of Freund and Mason was to obtain intelligible classification models when applying boosting methods. ADTrees are classification models inspired by both decision trees and option trees ([KK97]). ADTrees provide a symbolic representation of a classification procedure and give, together with the classification, a measure of confidence. Freund and Mason also propose an alternating decision tree learning algorithm called ADTBoost in [FM99]. The ADTBoost algorithm originates from two ideas: first, decision tree learning algorithms may be analyzed as boosting algorithms (see [DKM96], [KM96]); second, boosting algorithms could be used because ADTrees generalize both voted decision stumps and voted decision trees. Recently, the ADTree formalism has been extended to the multiclass case ([HPK+02]).

We propose in this paper to extend the ADTree formalism to handle multi-label decision procedures. Decision procedures are learned from data described by nominal, continuous and text data. Our algorithm can be understood as an extension of AdaBoost.MH that permits a better readability of the classification rule ultimately produced, as well as an extension of ADTBoost that handles multi-label classification problems. The multi-label ADTree formalism gives an intelligible set of rules viewable as a tree. Moreover, rules allow atomic tests to be combined. In our implementation, we can handle tests on (continuous or discrete) tabular data as well as tests on text data. This is particularly valuable in medicine, where descriptions of patients combine diagnostic analysis, comments, dosages, measures, and so on. For instance, a rule can be built on both temperature and diagnostic: if temperature > 37.5 and diagnostic contains "Cardiovascular" then .... This kind of combination in a unique rule is not considered by other algorithms like Boostexter, which implements AdaBoost.MH. We expect [...] information that can be interpreted by experts. In this paper, we only present results on freely available data sets. We compare our algorithm ADTBoost.MH with AdaBoost.MH on two data sets: the Reuters collection and a new data set built from news articles.

In Section 2, we define multi-label problems and we recall how AdaBoost.MH works. Alternating Decision Trees are presented in Section 3. We define multi-label ADTrees, which generalize ADTrees to the multi-label case, in Section 4. An example is given in Figure 3. A multi-label ADTree is an easily readable representation of both different ADTrees (one ADTree per label) and of many decision stumps (one per boosting round). We propose a multi-label ADTree learning algorithm, ADTBoost.MH, based on both ADTBoost and AdaBoost.MH [SS98]. Experiments are given in Section 5. They show that our algorithm reaches the performance of well-tuned algorithms like Boostexter.

2 Boosting and Multi-label Problems

Most supervised learning algorithms deal with binary or multiclass classification tasks. In such a case, an instance belongs to one class and the goal of learning algorithms is to find a hypothesis which minimizes the probability that an instance is misclassified by the hypothesis. Even when a learning algorithm does not apply to the multiclass case, there exist several methods that can combine binary decision procedures in order to solve multiclass problems ([DB95], [ASS00]). In this paper, we consider the more general problem, called the multi-label classification problem, in which an example may belong to any number of classes. Formally, let $X$ be the universe and let us consider a set of labels $\mathcal{Y} = \{1, \dots, k\}$. The goal is to find, with input a sample $S = \{(x_i, Y_i) \mid x_i \in X,\ Y_i \subseteq \mathcal{Y},\ 1 \le i \le m\}$, a hypothesis $h : X \to 2^{\mathcal{Y}}$ with low error. It is unclear how to define the error in the multi-label case because different definitions are possible, depending on the application we are faced with. In this paper, we only consider the Hamming error.

Hamming error: The goal is to predict the set of labels associated with an example. Therefore, one takes into account prediction errors (an incorrect label is predicted) and missing errors (a label is not predicted). Let us consider a target function $c : X \to 2^{\mathcal{Y}}$ and a hypothesis $h : X \to 2^{\mathcal{Y}}$; the Hamming error of $h$ is defined by:

$$E_H(h) = \frac{1}{k} \sum_{l=1}^{k} D\big(\{x \in X \mid (l \in h(x) \wedge l \notin c(x)) \vee (l \notin h(x) \wedge l \in c(x))\}\big) \qquad (1)$$

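The factor $\frac{1}{k}$ normalizes the error in the interval $[0, 1]$. The training error over a sample $S$ is:

$$E_H(h, S) = \frac{1}{km} \sum_{i,l} \Big( [\![\, l \in h(x_i) \wedge l \notin Y_i \,]\!] + [\![\, l \notin h(x_i) \wedge l \in Y_i \,]\!] \Big) \qquad (2)$$

where $[\![ a ]\!]$ equals 1 if $a$ holds and 0 otherwise.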
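As a rough illustration of Equation (2), the training Hamming error can be computed by counting, for each example, the labels on which the predicted set and the true set disagree. The sketch below is only an assumption about how one might code it: the function name `hamming_training_error`, the toy sample and the constant predictor `always_one` are hypothetical and do not come from the paper.

```python
from typing import Callable, Sequence, Set


def hamming_training_error(
    predict: Callable[[object], Set[int]],
    sample: Sequence[tuple],
    k: int,
) -> float:
    """Empirical Hamming error of Equation (2) for a label set of size k."""
    mistakes = 0
    for x, true_labels in sample:
        predicted = predict(x)
        # The symmetric difference holds the wrongly predicted labels
        # together with the missed labels.
        mistakes += len(predicted ^ set(true_labels))
    return mistakes / (k * len(sample))


# Hypothetical toy sample with k = 3 labels (illustration only).
sample = [("a", {1, 2}), ("b", {3}), ("c", set())]


def always_one(x):
    """Hypothetical predictor that always outputs the label set {1}."""
    return {1}


error = hamming_training_error(always_one, sample, k=3)
print(error)
```

On this toy sample the predictor makes 1 + 2 + 1 = 4 label mistakes out of 3 x 3 = 9 (example, label) pairs.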

In the rest of the paper, we will consider learning algorithms that output mappings $h$ from $X \times \mathcal{Y}$ into $\mathbb{R}$. The real value $h(x, l)$ can be viewed as a prediction value for the label $l$ for the instance $x$. Given $h$, we define a multi-label interpretation $h_m$ of $h$: $h_m$ is a mapping $X \to 2^{\mathcal{Y}}$ such that $h_m(x) = \{l \in \mathcal{Y} \mid h(x, l) > 0\}$.

AdaBoost.MH

Schapire and Singer introduce AdaBoost.MH in [SS00]. Originally, the algorithm assumes a weak learner from $X \times \mathcal{Y}$ to $\mathbb{R}$. In this section we focus on boosting decision stumps. We are given a set of conditions $C$. Weak hypotheses are therefore rules of the form if $c$ then $(a_l)_{l \in \mathcal{Y}}$ else $(b_l)_{l \in \mathcal{Y}}$, where $c \in C$ and $a_l, b_l$ are real values for each label $l$. Weak hypotheses therefore make their predictions based on a partitioning of the domain $X$.

The AdaBoost.MH learning algorithm is given in Algorithm 1. Multi-label data are first transformed into binary data. Given a set of labels $Y \subseteq \mathcal{Y}$, let us define $Y[l]$ to be $+1$ if $l \in Y$ and to be $-1$ if $l \notin Y$. Given an input sample of $m$ examples, the main idea is to replace each training example $(x_i, Y_i)$ by $k$ examples $((x_i, l), Y_i[l])$ for $l \in \mathcal{Y}$. AdaBoost.MH maintains a distribution over $X \times \mathcal{Y}$. It re-weights the sample at each boosting step. Basically, examples that were misclassified by the hypothesis in the previous round have a higher weight in the current round. It is proved in [SS98] that the normalization factor $Z_t$ induced by the re-weighting realizes a bound on the empirical Hamming loss of the current hypothesis. Therefore, this bound is used to guide the choice of the weak hypothesis at each step. AdaBoost.MH tries to minimize the error by minimizing the normalization factor, denoted by $Z_t$, at each step. Let us denote by $W_+^l(c)$ (resp. $W_-^l(c)$) the sum of weights of the positive (resp. negative) examples that satisfy condition $c$ and have label $l$:

$$W_+^l(c) = \sum_{i=1}^{m} D_t(i, l)\, [\![\, x_i \text{ satisfies } c \wedge Y_i[l] = +1 \,]\!]$$

$$W_-^l(c) = \sum_{i=1}^{m} D_t(i, l)\, [\![\, x_i \text{ satisfies } c \wedge Y_i[l] = -1 \,]\!]$$

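Therefore, one can prove analytically that the best prediction values $a_l$ and $b_l$ which minimize $Z_t$ at each step are:

$$a_l = \frac{1}{2} \ln \frac{W_+^l(c)}{W_-^l(c)}, \qquad b_l = \frac{1}{2} \ln \frac{W_+^l(\neg c)}{W_-^l(\neg c)} \qquad (3)$$

and the resulting normalization factor is:

$$Z_t(c) = 2 \sum_{l=1}^{k} \left( \sqrt{W_+^l(c)\, W_-^l(c)} + \sqrt{W_+^l(\neg c)\, W_-^l(\neg c)} \right) \qquad (4)$$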

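Algorithm 1 AdaBoost.MH(T), where T is the number of boosting rounds

Input: a sample $S = \{(x_1, Y_1), \dots, (x_m, Y_m) \mid x_i \in X,\ Y_i \subseteq \mathcal{Y} = \{1, \dots, k\}\}$; $C$ is a set of base conditions

1: Transform $S$ into $S_k = \{((x_i, l), Y_i[l]) \mid 1 \le i \le m,\ l \in \mathcal{Y}\} \subseteq (X \times \mathcal{Y}) \times \{-1, +1\}$
2: Initialize the weights: for all $1 \le i \le m$ and $l \in \mathcal{Y}$, $w_1(i, l) = 1$
3: for t = 1..T do
4:   choose $c$ which minimizes $Z_t(c)$ according to Equation 4
5:   build the rule $r_t$: if $c$ then $\frac{1}{2} \ln \frac{W_+^l(c)}{W_-^l(c)}$ else $\frac{1}{2} \ln \frac{W_+^l(\neg c)}{W_-^l(\neg c)}$
6:   Update weights: $w_{t+1}(i, l) = w_t(i, l)\, e^{-Y_i[l]\, r_t(x_i, l)}$
7: end for

Output: $f(x, l) = \sum_{t=1}^{T} r_t(x, l)$.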
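The loop of Algorithm 1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Boostexter implementation: instances are dictionaries, conditions are plain Python predicates, the normalization factor is summed over both blocks $c$ and $\neg c$ of the partition, and a small constant `eps` (not part of Equation 3) is added to avoid division by zero. The names `adaboost_mh`, `eps` and the toy data are hypothetical.

```python
import math


def adaboost_mh(sample, labels, conditions, rounds, eps=1e-9):
    """Sketch of AdaBoost.MH with decision stumps.

    sample: list of (x, Y) pairs, Y being the set of labels of x.
    conditions: list of predicates over instances.
    """
    # w[(i, l)] is the weight of the binary example ((x_i, l), Y_i[l]).
    w = {(i, l): 1.0 for i in range(len(sample)) for l in labels}
    rules = []
    for _ in range(rounds):
        total = sum(w.values())
        best = None
        for c in conditions:
            # W[block][l] = [W_plus, W_minus]; block 0 is c, block 1 is not c.
            W = [{l: [0.0, 0.0] for l in labels} for _ in range(2)]
            for i, (x, Y) in enumerate(sample):
                block = 0 if c(x) else 1
                for l in labels:
                    W[block][l][0 if l in Y else 1] += w[(i, l)] / total
            z = 2 * sum(math.sqrt(p * n) for blk in W for p, n in blk.values())
            if best is None or z < best[0]:
                best = (z, c, W)
        _, c, W = best
        # Prediction values; eps is an added smoothing assumption.
        a = {l: 0.5 * math.log((W[0][l][0] + eps) / (W[0][l][1] + eps)) for l in labels}
        b = {l: 0.5 * math.log((W[1][l][0] + eps) / (W[1][l][1] + eps)) for l in labels}
        rules.append((c, a, b))
        # Re-weighting: misclassified (example, label) pairs gain weight.
        for i, (x, Y) in enumerate(sample):
            r = a if c(x) else b
            for l in labels:
                y = 1 if l in Y else -1
                w[(i, l)] *= math.exp(-y * r[l])

    def f(x, l):
        return sum((a if c(x) else b)[l] for c, a, b in rules)

    return f


# Hypothetical toy data: label 1 goes with high temperature, label 2 with low.
sample = [({"t": 39.0}, {1}), ({"t": 36.5}, {2}), ({"t": 38.2}, {1}), ({"t": 36.9}, {2})]
conditions = [lambda x: x["t"] > 37.5, lambda x: x["t"] > 39.5]
f = adaboost_mh(sample, [1, 2], conditions, rounds=2)
print(f({"t": 40.0}, 1) > 0, f({"t": 40.0}, 2) > 0)
```

The returned function `f` plays the role of the output $f(x, l)$; a label is predicted when its value is positive.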

3 Alternating Decision Trees

Alternating Decision Trees (ADTrees) are introduced by Freund and Mason in [FM99]. They are similar to option trees developed by Kohavi et al. [KK97]. An important motivation of Freund and Mason was to obtain intelligible classification models when applying boosting methods. Alternating decision trees contain splitter nodes and prediction nodes. A splitter node is associated with a test; a prediction node is associated with a real value. An example of an ADTree is given in Figure 1. It is composed of four splitter nodes and nine prediction nodes. An instance defines a set of paths in an ADTree. The classification which is associated with an instance is the sign of the sum of the predictions along the paths in the set defined by this instance. Consider the ADTree in Figure 1 and the instance x = (colour = red, year = 1989, ...): the sum of predictions is +0.2 + 0.2 + 0.6 + 0.4 + 0.6 = +2, thus the classification is +1 with high confidence. For the instance x = (colour = red, year = 1999, ...), the sum of predictions is +0.4 and the classification is +1 with low confidence. For the instance x = (colour = white, year = 1999, ...), the sum of predictions is -0.7 and the classification is -1 with medium confidence.

[Figure 1: an ADTree with a root prediction node (+0.2) and four splitter nodes (year < 1998, colour = white, year > 1995, colour = red), each with a Y branch and an N branch leading to a prediction node.]

Fig. 1. An example of ADTree; bold prediction nodes define the set of nodes associated with the instance x = (colour = red, year = 1989, ...).

The ADTree depicted in Fig. 1 can also be viewed as consisting of a root prediction node and four units of three nodes each. Each unit is a decision rule and is composed of a splitter node and two prediction nodes that are its children. It is easy to give another description of ADTrees using sets of rules. For instance, the ADTree in Fig. 1 is described by the following set of rules:

If TRUE then (if TRUE then +0.2 else 0) else 0
If TRUE then (if year < 1998 then +0.2 else -0.4) else 0
If year < 1998 then (if colour = red then +0.6 else -0.1) else 0
If year < 1998 then (if year > 1995 then -0.2 else +0.4) else 0
If TRUE then (if colour = white then -0.5 else +0.6) else 0
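The five rules listed for Fig. 1 can be evaluated mechanically, which gives a quick check of the three sums quoted in the text. The sketch below is only one possible encoding: the tuple layout `(precondition, condition, value_if_true, value_if_false)` and the function name `adtree_score` are assumptions made for illustration.

```python
def TRUE(x):
    """The always-true condition."""
    return True


# Each rule: (precondition, condition, value if condition holds, value otherwise).
RULES = [
    (TRUE, TRUE, +0.2, 0.0),
    (TRUE, lambda x: x["year"] < 1998, +0.2, -0.4),
    (lambda x: x["year"] < 1998, lambda x: x["colour"] == "red", +0.6, -0.1),
    (lambda x: x["year"] < 1998, lambda x: x["year"] > 1995, -0.2, +0.4),
    (TRUE, lambda x: x["colour"] == "white", -0.5, +0.6),
]


def adtree_score(rules, x):
    """Sum the predictions of all rules whose precondition holds for x."""
    total = 0.0
    for pre, cond, a, b in rules:
        if pre(x):
            total += a if cond(x) else b
    return total


for colour, year in [("red", 1989), ("red", 1999), ("white", 1999)]:
    score = adtree_score(RULES, {"colour": colour, "year": year})
    print(colour, year, round(score, 2), "+1" if score > 0 else "-1")
# red 1989 2.0 +1
# red 1999 0.4 +1
# white 1999 -0.7 -1
```

A rule whose precondition fails contributes 0, which is why the two rules guarded by year < 1998 drop out for the 1999 instances.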

ADTBoost

Rules in an ADTree are similar to decision stumps. Consequently, one can apply boosting methods in order to design an ADTree learning algorithm. The algorithm proposed by Freund and Mason [FM99] is based on this idea. It relies on Schapire and Singer's [SS98] study of boosting methods. A rule in an ADTree defines a partition of the instance space into three blocks defined by $C_1 \wedge c_2$, $C_1 \wedge \neg c_2$ and $\neg C_1$. Following this observation, Freund and Mason apply a variant of AdaBoost proposed in [SS98] for domain-partitioning weak hypotheses. Basically, the learning algorithm builds an ADTree with a top-down strategy. At every boosting step, it selects and adds a new rule or, equivalently, a new unit consisting of a splitter node and two prediction nodes. Contrary to the case of decision trees, the reader should note that a new rule can be added below any prediction node in the current ADTree.

Freund and Mason's algorithm ADTBoost is given as Algorithm 2. Similarly to the AdaBoost algorithm, ADTBoost updates weights at each step (line 8). The quantity $Z_t(C_1, c_2)$ is a normalization coefficient. It is defined by

$$Z_t(C_1, c_2) = 2 \left( \sqrt{W_+(C_1 \wedge c_2)\, W_-(C_1 \wedge c_2)} + \sqrt{W_+(C_1 \wedge \neg c_2)\, W_-(C_1 \wedge \neg c_2)} \right) + W(\neg C_1) \qquad (5)$$

where $W_+(C)$ (resp. $W_-(C)$) is the sum of the weights of the positive (resp. negative) examples that satisfy condition $C$. It has been shown that the product of all such coefficients gives an upper bound on the training error. Therefore, ADTBoost selects a precondition $C_1$ and a condition $c_2$ that minimize $Z_t(C_1, c_2)$ (line 5) in order to minimize the training error. The prediction values $a$ and $b$ (line 6) are chosen according to results in [SS98].

ADTBoost is competitive with boosting decision tree learning algorithms such as C5 + Boost. Moreover, the size of the ADTree generated by ADTBoost is often smaller than the model generated by other methods. These two points have strongly motivated our choices, although ADTBoost suffers from some drawbacks, e.g. the choice of the number of boosting rounds is difficult and overfitting occurs. We discuss these problems in the conclusion.

Algorithm 2 ADTBoost(T), where T is the number of boosting rounds

Input: a sample $S = \{(x_1, y_1), \dots, (x_m, y_m) \mid x_i \in X,\ y_i \in \{-1, +1\}\}$; a set of base conditions $C$.

1: Initialize the weights: for all $1 \le i \le m$, $w_{i,1} = 1$
2: Initialize the ADTree: $R_1 = \{r_1 :$ (if T then (if T then $\frac{1}{2} \ln \frac{W_+(T)}{W_-(T)}$ else 0) else 0)$\}$
3: Initialize the set of preconditions: $P_1 = \{T\}$
4: for t = 1..T do
5:   Choose $C_1 \in P_t$ and $c_2 \in C$ which minimize $Z_t(C_1, c_2)$ according to Equation 5
6:   $R_{t+1} = R_t \cup \{r_{t+1} :$ (if $C_1$ then (if $c_2$ then $\frac{1}{2} \ln \frac{W_+(C_1 \wedge c_2)}{W_-(C_1 \wedge c_2)}$ else $\frac{1}{2} \ln \frac{W_+(C_1 \wedge \neg c_2)}{W_-(C_1 \wedge \neg c_2)}$) else 0)$\}$
7:   $P_{t+1} = P_t \cup \{C_1 \wedge c_2,\ C_1 \wedge \neg c_2\}$
8:   update weights: $w_{i,t+1} = w_{i,t}\, e^{-y_i r_t(x_i)}$
9: end for

Output: ADTree $R_{T+1}$

4 Multi-label Alternating Decision Trees

Multi-label ADTrees

We generalize ADTrees to the case of multi-label problems. A prediction node is now associated with a set of real values, one for each label. An example of such an ADTree is given in Figure 3.

Let $X$ be the universe; the conjunction is denoted by $\wedge$, the negation is denoted by $\neg$, and let T be the True condition. Let $(0)_{l \in \mathcal{Y}}$ be a vector of $k$ zeros.

Definition 1. Let $C$ be a set of base conditions, where a base condition is a boolean predicate over instances. A precondition is a conjunction of base conditions and negations of base conditions. A rule in an ADTree is defined by a precondition $C_1$, a condition $c_2$ and two vectors of real numbers $(a_l)_{l \in \mathcal{Y}}$ and $(b_l)_{l \in \mathcal{Y}}$:

if $C_1$ then (if $c_2$ then $(a_l)_{l \in \mathcal{Y}}$ else $(b_l)_{l \in \mathcal{Y}}$) else $(0)_{l \in \mathcal{Y}}$

A multi-label alternating decision tree (multi-label ADTree) is a set $R$ of such rules satisfying properties (i) and (ii):

(i) the set $R$ must include an initial rule for which the precondition $C_1$ is T, the condition $c_2$ is T, and $b$ equals $(0)_{l \in \mathcal{Y}}$;

(ii) whenever the set $R$ contains a rule with a precondition $C'_1$, $R$ also contains another rule with precondition $C_1$ and there is a base condition $c_2$ such that either $C'_1 = C_1 \wedge c_2$ or $C'_1 = C_1 \wedge \neg c_2$.

A multi-label ADTree maps each instance to a vector of real numbers in the following way:

Definition 2. A rule $r$: if $C_1$ then (if $c_2$ then $(a_l)_{l \in \mathcal{Y}}$ else $(b_l)_{l \in \mathcal{Y}}$) else $(0)_{l \in \mathcal{Y}}$ associates a real value $r(x, l)$ with any $(x, l) \in X \times \mathcal{Y}$. If $(x, l)$ satisfies $C_1 \wedge c_2$ then $r(x, l)$ equals $a_l$; if $(x, l)$ satisfies $C_1 \wedge \neg c_2$ then $r(x, l)$ equals $b_l$; otherwise, $r(x, l)$ equals 0. An ADTree $R = \{r_i\}_{i \in I}$ associates a prediction value $R(x, l) = \sum_{i \in I} r_i(x, l)$ with any $(x, l) \in X \times \mathcal{Y}$. A multi-label classification hypothesis $H$ is associated with $R$; it is defined by $H(x, l) = \mathrm{sign}(R(x, l))$, and the real number $|R(x, l)|$ is interpreted as the confidence assigned to $H(x, l)$, i.e. the confidence assigned to the label $l$ for the instance $x$.
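Definition 2 can be illustrated with a short Python sketch that sums per-label rule values and reads off the sign and confidence. The prediction vectors below are borrowed from the first two rules of Figure 2; everything else is an assumption made for illustration: the tuple encoding of rules, the function names, the substring test standing in for "subject contains Birthday", and the convention that a score of exactly 0 is reported as -1.

```python
LABELS = ["misc_w", "sci_sp", "sci_co", "comp_a", "rec_ar"]


def TRUE(x):
    """The always-true condition T."""
    return True


# Each rule: (precondition C1, condition c2, vector a, vector b).
RULES = [
    (TRUE, TRUE, [-1.01, -0.95, -0.93, -0.91, -0.93], [0.0] * 5),
    (
        TRUE,
        lambda x: "Birthday" in x["subject"],  # simplified "contains" test
        [1.71, 1.50, -6.35, -6.35, 1.71],
        [-7.35, -2.02, -0.86, -0.84, -1.96],
    ),
]


def rule_value(rule, x):
    """Vector (r(x, l))_l of Definition 2 for a single rule."""
    pre, cond, a, b = rule
    if not pre(x):
        return [0.0] * len(a)
    return a if cond(x) else b


def predict(rules, x):
    """Map each label to (H(x, l), confidence |R(x, l)|)."""
    scores = [0.0] * len(LABELS)
    for rule in rules:
        for j, value in enumerate(rule_value(rule, x)):
            scores[j] += value
    return {label: (1 if s > 0 else -1, abs(s)) for label, s in zip(LABELS, scores)}


pred = predict(RULES, {"subject": "Happy Birthday Arthur"})
print([label for label in LABELS if pred[label][0] == 1])
# ['misc_w', 'sci_sp', 'rec_ar']
```

With only these two rules, a subject that does not contain "Birthday" gets a negative score for every label, so the predicted label set is empty.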

ADTBoost.MH

Our multi-label alternating decision tree learning algorithm is derived from both the ADTBoost and AdaBoost.MH algorithms. Following [SS98], we now make precise the calculation of the prediction values and the value of the normalization factor $Z_t(C_1, c_2)$. The reader should note that we consider partitions over $X \times \mathcal{Y}$. On round $t$, let us denote the current distribution over $X \times \mathcal{Y}$ by $D_t$, and let us consider $W_+^l(C)$ (resp. $W_-^l(C)$) as the sum of the weights of the positive (resp. negative) examples that satisfy condition $C$ and have label $l$:

$$W_+^l(C) = \sum_{i=1}^{m} D_t(i, l)\, [\![\, x_i \text{ satisfies } C \wedge Y_i[l] = +1 \,]\!]$$

$$W_-^l(C) = \sum_{i=1}^{m} D_t(i, l)\, [\![\, x_i \text{ satisfies } C \wedge Y_i[l] = -1 \,]\!]$$

$$W^l(C) = \sum_{i=1}^{m} D_t(i, l)\, [\![\, x_i \text{ satisfies } C \,]\!]$$

It is easy to prove that the normalization factor $Z_t$ and the best prediction values $a$ and $b$ are:

$$a_l = \frac{1}{2} \ln \frac{W_+^l(C_1 \wedge c_2)}{W_-^l(C_1 \wedge c_2)}, \qquad b_l = \frac{1}{2} \ln \frac{W_+^l(C_1 \wedge \neg c_2)}{W_-^l(C_1 \wedge \neg c_2)} \qquad (6)$$

$$Z_t(C_1, c_2) = 2 \sum_{l=1}^{k} \left( \sqrt{W_+^l(C_1 \wedge c_2)\, W_-^l(C_1 \wedge c_2)} + \sqrt{W_+^l(C_1 \wedge \neg c_2)\, W_-^l(C_1 \wedge \neg c_2)} \right) + \sum_{l=1}^{k} W^l(\neg C_1) \qquad (7)$$

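The algorithm ADTBoost.MH is given as Algorithm 3. In order to avoid extreme values for confidence values, we use the following formulas:

$$a_l = \frac{1}{2} \ln \frac{W_+^l(C_1 \wedge c_2) + \varepsilon}{W_-^l(C_1 \wedge c_2) + \varepsilon}, \qquad b_l = \frac{1}{2} \ln \frac{W_+^l(C_1 \wedge \neg c_2) + \varepsilon}{W_-^l(C_1 \wedge \neg c_2) + \varepsilon} \qquad (8)$$

with $\varepsilon = \frac{1}{2mk}$.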
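A small numerical sketch shows why the smoothing of Equation (8) matters: without $\varepsilon$, a block containing no negative weight would give an infinite prediction value. The function name `smoothed_value` and the weights 0.1 and 0.0 are hypothetical; the values m = 524 and k = 5 are taken from the news data set described in Section 5.

```python
import math


def smoothed_value(w_plus, w_minus, m, k):
    """Prediction value of Equation (8) with epsilon = 1 / (2 m k)."""
    eps = 1.0 / (2 * m * k)
    return 0.5 * math.log((w_plus + eps) / (w_minus + eps))


# Hypothetical block: positive weight 0.1, no negative weight at all.
value = smoothed_value(0.1, 0.0, m=524, k=5)
print(value)

# Swapping the two weights flips the sign of the prediction value.
print(smoothed_value(0.0, 0.1, m=524, k=5))
```

Here the ratio inside the logarithm is (0.1 + 1/5240) / (1/5240) = 525, so the value stays finite and equals half the natural logarithm of 525.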

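Algorithm 3 ADTBoost.MH(T), where T is the number of boosting rounds

Input: a sample $S = \{(x_1, Y_1), \dots, (x_m, Y_m) \mid x_i \in X,\ Y_i \subseteq \mathcal{Y} = \{1, \dots, k\}\}$; $C$ is a set of base conditions

1: Transform $S$ into $S_k = \{((x_i, l), Y_i[l]) \mid 1 \le i \le m,\ l \in \mathcal{Y}\} \subseteq (X \times \mathcal{Y}) \times \{-1, +1\}$
2: Initialize the weights: for all $1 \le i \le m$ and $l \in \mathcal{Y}$, $w_1(i, l) = 1$
3: Initialize the multi-label ADTree: $R_1 = \{r_1 :$ (if T then (if T then $(a_l = \frac{1}{2} \ln \frac{W_+^l(T)}{W_-^l(T)})_{l \in \mathcal{Y}}$ else $(b_l = 0)_{l \in \mathcal{Y}}$) else 0)$\}$
4: Initialize the set of preconditions: $P_1 = \{T\}$
5: for t = 1..T do
6:   choose $C_1 \in P_t$ and $c_2 \in C$ which minimize $Z_t(C_1, c_2)$ according to Equation 7
7:   $R_{t+1} = R_t \cup \{r_{t+1} :$ (if $C_1$ then (if $c_2$ then $(a_l = \frac{1}{2} \ln \frac{W_+^l(C_1 \wedge c_2)}{W_-^l(C_1 \wedge c_2)})_{l \in \mathcal{Y}}$ else $(b_l = \frac{1}{2} \ln \frac{W_+^l(C_1 \wedge \neg c_2)}{W_-^l(C_1 \wedge \neg c_2)})_{l \in \mathcal{Y}}$) else 0)$\}$
8:   $P_{t+1} = P_t \cup \{C_1 \wedge c_2,\ C_1 \wedge \neg c_2\}$
9:   Update weights: $w_{t+1}(i, l) = w_t(i, l)\, e^{-Y_i[l]\, r_t(x_i, l)}$
10: end for

Output: multi-label ADTree $R_{T+1}$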
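Algorithm 3 can be sketched compactly in Python. This is an illustrative reading, not the authors' implementation: conditions are Python predicates, weights are normalized into a distribution before each evaluation of $Z_t$, the smoothed values of Equation (8) are used for every rule (including the initial one), and the weights are updated right after each rule is added. The names `adtboost_mh`, `stats`, `add_rule` and the toy data are hypothetical.

```python
import math


def adtboost_mh(sample, labels, conditions, rounds):
    """Sketch of ADTBoost.MH. sample: list of (x, Y); conditions: predicates."""
    m, k = len(sample), len(labels)
    eps = 1.0 / (2 * m * k)                     # smoothing of Equation (8)
    w = [[1.0] * k for _ in range(m)]           # w[i][j]: weight of (x_i, label j)
    y = [[1 if l in Y else -1 for l in labels] for _, Y in sample]
    rules = []

    def true(x):
        return True

    def stats(pre, cond):
        """W_+ and W_- per label for both blocks under pre, plus W(not pre)."""
        total = sum(map(sum, w))
        blocks = [[[0.0, 0.0] for _ in range(k)] for _ in range(2)]
        outside = 0.0
        for i, (x, _) in enumerate(sample):
            if not pre(x):
                outside += sum(w[i]) / total
                continue
            blk = blocks[0 if cond(x) else 1]
            for j in range(k):
                blk[j][0 if y[i][j] > 0 else 1] += w[i][j] / total
        return blocks, outside

    def values(block):
        return [0.5 * math.log((p + eps) / (n + eps)) for p, n in block]

    def add_rule(pre, cond, a, b):
        rules.append((pre, cond, a, b))
        for i, (x, _) in enumerate(sample):     # re-weight covered examples
            if pre(x):
                r = a if cond(x) else b
                for j in range(k):
                    w[i][j] *= math.exp(-y[i][j] * r[j])

    blocks, _ = stats(true, true)
    add_rule(true, true, values(blocks[0]), [0.0] * k)   # initial rule
    preconditions = [true]
    for _ in range(rounds):
        best = None
        for pre in preconditions:
            for cond in conditions:
                blocks, outside = stats(pre, cond)
                z = 2 * sum(math.sqrt(p * n) for blk in blocks for p, n in blk) + outside
                if best is None or z < best[0]:
                    best = (z, pre, cond, blocks)
        _, pre, cond, blocks = best
        add_rule(pre, cond, values(blocks[0]), values(blocks[1]))
        preconditions.append(lambda x, p=pre, c=cond: p(x) and c(x))
        preconditions.append(lambda x, p=pre, c=cond: p(x) and not c(x))

    def score(x, label):
        j = labels.index(label)
        return sum((a if cond(x) else b)[j] for pre, cond, a, b in rules if pre(x))

    return score


# Hypothetical toy data mixing a text test and a numeric test.
sample = [
    ({"subject": "birthday party", "lines": 10}, {"a"}),
    ({"subject": "birthday cake", "lines": 40}, {"a", "b"}),
    ({"subject": "space shuttle", "lines": 12}, {"c"}),
    ({"subject": "space launch", "lines": 50}, {"b", "c"}),
]
conditions = [lambda x: "birthday" in x["subject"], lambda x: x["lines"] < 25]
score = adtboost_mh(sample, ["a", "b", "c"], conditions, rounds=2)
print(score({"subject": "birthday song", "lines": 30}, "b") > 0)
```

In this toy run the first boosting round picks the text test and the second picks the numeric test, so the resulting rules combine both kinds of attributes.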

Relations with other formalisms

Multi-label ADTrees trivially extend several common formalisms in machine learning.

Voted decision stumps produced for instance by AdaBoost.MH, presented as Algorithm 1, are obviously multi-label ADTrees where each rule has T as a precondition. [...] $c_2$ which minimizes $Z_t(T, c_2)$ according to Equation 7. Thus, AdaBoost.MH can be considered as a parameterization of ADTBoost.MH.

(Multi-label) decision trees. A (multi-label) decision tree is an ADTree with the following restrictions: any inner prediction node contains 0; there is at most one splitter node below every prediction node; prediction nodes at a leaf position contain values that can be interpreted as classes (using for instance the sign function in the binary case).

Weighted vote of (multi-label) decision trees. Voted decision trees $t_1, \dots, t_k$ associated with weights $w_1, \dots, w_k$ are also simply transformed into ADTrees. One needs to add prediction nodes containing the weight $w_i$ at every leaf of the tree $t_i$ and graft all trees at the root of an ADTree.

5 Experiments

In this section, we describe the experiments we conducted with ADTBoost.MH. We mainly argue that we can obtain both accurate and readable classifiers over tabular and text data using multi-label alternating decision trees.

Implementation

Our implementation of ADTBoost.MH supports item descriptions that include discrete attributes, continuous attributes and texts. In the case of a discrete attribute $A$ whose domain is $\{v_1, \dots, v_n\}$, the set of base conditions consists of binary conditions of the form $A = v_i$, $A \neq v_i$. In the case of a continuous attribute $A$, we consider binary conditions of the form $A < v_i$. Finally, in the case of text-valued attributes over a vocabulary $V$, base conditions are of the form "$m$ occurs in $A$", where $m$ belongs to $V$.
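The three families of base conditions can be generated from the training rows along the following lines. This is a sketch under assumptions: the text does not say how the thresholds $v_i$ are chosen, so midpoints between consecutive observed values are used here; words are obtained by lower-casing and splitting on whitespace; and only $A = v_i$ is generated, since a rule already carries a branch for the negated condition. The names `base_conditions`, `rows` and `kinds` are hypothetical.

```python
def base_conditions(rows, kinds):
    """Build (description, predicate) pairs from training rows.

    rows: list of dicts; kinds: attribute -> 'discrete' | 'continuous' | 'text'.
    """
    conditions = []
    for attr, kind in kinds.items():
        values = [row[attr] for row in rows]
        if kind == "discrete":
            for v in sorted(set(values)):
                conditions.append((f"{attr} = {v}", lambda x, a=attr, v=v: x[a] == v))
        elif kind == "continuous":
            distinct = sorted(set(values))
            # Assumed threshold choice: midpoints between observed values.
            for low, high in zip(distinct, distinct[1:]):
                t = (low + high) / 2
                conditions.append((f"{attr} < {t}", lambda x, a=attr, t=t: x[a] < t))
        elif kind == "text":
            vocabulary = sorted({word for text in values for word in text.lower().split()})
            for word in vocabulary:
                conditions.append(
                    (f"{word} occurs in {attr}",
                     lambda x, a=attr, m=word: m in x[a].lower().split())
                )
    return conditions


# Hypothetical rows shaped like the news data set attributes.
rows = [
    {"domain": "edu", "lines": 10, "subject": "Happy Birthday"},
    {"domain": "com", "lines": 35, "subject": "Clarke secrets"},
]
kinds = {"domain": "discrete", "lines": "continuous", "subject": "text"}
conds = base_conditions(rows, kinds)
names = [name for name, _ in conds]
print(names)
```

Each predicate can then be passed to a boosting loop as a candidate condition $c_2$.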

Missing values are handled in ADTBoost.MH as in Quinlan's C4.5 software ([Qui93]). Let us consider an ADTree $R$, a position $p$ in $R$ associated with a condition based on an attribute $A$, and an instance for which the value of $A$ is missing. We estimate probabilities for the assertions to be true or false, based on the observation of the training set. The obtained values are assigned to the missing value of this instance and then propagated below $p$.

Data sets

The Reuters collection is the most commonly used collection for text classification. We use a formatted version of Reuters version 2 (also called Reuters-21450) prepared by Y. Yang and colleagues. Documents are labeled to belong to at least one of the 135 possible categories. A "sub-category" relation governs categories. [...] articles in the whole Reuters data set belong to exactly one category; in the experiments we select categories and articles for which overlaps between categories are more significant.

News collection. We prepare a new data set from newsgroup archives in order to build a multi-label classification problem where cases are described by texts, continuous and nominal values. We obtain from ftp://ftp.cs.cmu.edu/user/ai/pubs/news/comp.ai/ news articles posted in the comp.ai newsgroup in July 1997. Some articles in this forum have been cross-posted to several newsgroups. The classification task consists in finding in which newsgroups a news article has been cross-posted.

Each news article is described by seven attributes. Two of them are textual data: the subject and the text of the news. Four attributes are continuous (natural numbers): the number of lines in the text of the news, the number of references (that is, the number of parents in the thread discussion), the number of capitalized words and the number of words in the text of the news. One attribute is discrete: the top-level domain of the sender's email address. We have dropped small words (fewer than three letters) and non purely alphabetic words (e.g. R2D2) (see footnote 4). There are 524 articles and we keep only the five most frequent cross-posted newsgroups as labels. The five newsgroups are: misc.writing (61 posts), sci.space.shuttle (68 posts), sci.cognitive (70 posts), rec.arts.sf.written (70 posts) and comp.ai.philosophy (73 posts). Only 171 articles were cross-posted to at least one of these five newsgroups (60 in one, 51 in two, 60 in three).

Results

We first train our algorithm on the Reuters data set in order to evaluate it against Boostexter (an available implementation of AdaBoost.MH). The Reuters data set consists of a training set and a test set. But, following the protocol explained in [SS00], we merge these two sets and focus on the nine topics constituting the top hierarchy. We then select the subsets of the k classes with the largest number of articles for k = 3, ..., 9. Results were computed with 3-fold cross-validation, the number of boosting steps being set to 30. We report in Table 1 one-error, coverage and average precision for Boostexter and ADTBoost.MH.

In the news classification problem, ranks of labels are less relevant. We only report the Hamming error. Our algorithm ADTBoost.MH builds rules that may have large preconditions. This feature allows the space to be partitioned in a very fine way, and the training error can decrease very quickly. This can be observed on the news data set. After 30 boosting steps, the training error is 0 for the model generated by ADTBoost.MH. This is achieved by Boostexter after 230 boosting steps. On the one hand, smaller models can be generated by ADTBoost.MH. On the other hand, both ADTBoost.MH and ADTBoost tend to overspecialize and this phenomenon seems to occur more quickly for ADTBoost.MH.

k   Error    Cover  Prec    Error    Cover  Prec
3   6.01%    0.07   0.97    6.48%    0.08   0.97
4   7.03%    0.10   0.96    7.93%    0.11   0.96
5   8.31%    0.12   0.95    8.99%    0.14   0.95
6   12.70%   0.24   0.92    12.34%   0.24   0.92
7   14.72%   0.31   0.91    14.32%   0.30   0.91
8   16.01%   0.34   0.91    15.90%   0.35   0.90
9   16.77%   0.40   0.89    16.60%   0.39   0.89

Table 1. Comparing Boostexter and ADTree on the Reuters data set (Error: one-error, Cover: coverage, Prec: average precision). The number k is the number of labels in the multi-label classification problem.

Footnote 4: Data sets and perl scripts are available on http://www.grappa.univ-lille3.fr/recherche/datasets.

Table 2 reports the Hamming error on the cross-posted news data set computed with ten-fold cross-validation. Note that the Hamming error of the procedure that associates the empty set of labels with each case is 0.306.

Boosting steps   ADTBoost.MH   Boostexter
10               0.024         0.022
30               0.023         0.019
50               0.017         0.017
100              0.017         0.014

Table 2. Hamming error of ADTBoost.MH and Boostexter on the cross-posted news data set.

Figure 2 shows an example of rules produced by ADTBoost.MH on this data set and its graphical representation is depicted in Fig. 3. Both representations allow the model generated by ADTBoost.MH to be interpreted. For instance, according to the weights computed in the first five rules, one may say that when an article is cross-posted in sci.space.shuttle it is not in sci.cognitive. On the contrary, rec.arts.sf.written seems to be correlated with sci.space.shuttle. Below is an example of a rule with conditions that mix tests over textual data and tests over continuous data.

If (subject not contains Birthday) and (subject not contains Clarke) and (subject not contains Secrets)
Class :                   misc_w  sci_sp  sci_co  comp_a  rec_ar
If #lines >= 22.50 then   -0.37   -3.24   0.24    0.17    0.08
If #lines < 22.50 then    -0.12   -2.88   -3.51   -0.41   -5.07
Else 0

If TRUE
Class :                                   misc_w  sci_sp  sci_co  comp_a  rec_ar
If TRUE then                              -1.01   -0.95   -0.93   -0.91   -0.93
Else 0

--- Rule 1
If TRUE
Class :                                   misc_w  sci_sp  sci_co  comp_a  rec_ar
If subject contains Birthday then         1.71    1.50    -6.35   -6.35   1.71
If subject not contains Birthday then     -7.35   -2.02   -0.86   -0.84   -1.96
Else 0

--- Rule 2
If subject not contains Birthday
Class :                                   misc_w  sci_sp  sci_co  comp_a  rec_ar
If subject contains Emotional then        -2.93   -5.59   7.03    2.65    -5.62
If subject not contains Emotional then    -4.12   0.05    -0.41   -0.37   0.05
Else 0

--- Rule 3
If subject not contains Birthday
Class :                                   misc_w  sci_sp  sci_co  comp_a  rec_ar
If subject contains Clarke then           -0.46   6.92    -5.30   -5.33   6.89
If subject not contains Clarke then       -2.31   -6.92   0.01    0.01    -1.10
Else 0

--- Rule 4
If (subject not contains Birthday) and (subject not contains Clarke)
Class :                                   misc_w  sci_sp  sci_co  comp_a  rec_ar
If subject contains Secrets then          -0.17   -1.99   2.61    -5.83   -4.92
If subject not contains Secrets then      -1.32   -3.59   -0.33   0.02    0.02
Else 0

[Figure 3: tree rendering of the rules of Figure 2. The root prediction node and each branch of Rules 1 to 4 (subject contains / does not contain Birthday, Emotional, Clarke, Secrets) list, for each of the five newsgroups, the prediction value and the number of matching items out of 524.]

Fig. 3. A five-rule multi-label ADTree built on the cross-posted news data set.

6 Conclusion

We have proposed a learning algorithm, ADTBoost.MH, that can handle multi-label problems and produce intelligible models. It is based on boosting methods and seems to reach the performance of well-tuned algorithms like AdaBoost.MH. Further work concerns a closer analysis of the overspecialization phenomenon.

The number of rules in a multi-label ADTree is related to the number of boosting rounds in boosting algorithms like AdaBoost.MH. Readability of a multi-label ADTree is clearly altered when the number of rules becomes large. But this possibly large set of rules, depicted as a tree, comes with weights and with an ordering that permits "stratified" interpretations. Indeed, due to the algorithmic bias in the algorithm, rules that are generated first contribute the most to reducing the training error. Nonetheless, navigation tools and high quality user interfaces should be built to improve readability.

References

[ASS00] Erin L. Allwein, Robert E. Schapire, and Yoram Singer. Reducing multiclass to binary: a unifying approach for margin classifiers. In Proc. 17th International Conf. on Machine Learning, pages 9-16, 2000.

[Bre98] L. Breiman. Combining predictors. Technical report, Statistic Department, 1998.

[DB95] T. G. Dietterich and G. Bakiri. Solving multiclass learning problems via error-correcting output codes. Journal of Artificial Intelligence Research, 2:263-286, 1995.

[DKM96] T. Dietterich, M. Kearns, and Y. Mansour. Applying the weak learning framework to understand and improve C4.5. In Proc. 13th International Conference on Machine Learning, pages 96-104. Morgan Kaufmann, 1996.

[FM99] Yoav Freund and Llew Mason. The alternating decision tree learning algorithm. In Proc. 16th International Conf. on Machine Learning, pages 124-133, 1999.

[FS96] Yoav Freund and Robert E. Schapire. Experiments with a new boosting algorithm. In Proc. 13th International Conference on Machine Learning, pages 148-156. Morgan Kaufmann, 1996.

[FS97] Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119-139, August 1997.

[HPK+02] G. Holmes, B. Pfahringer, R. Kirkby, E. Frank, and M. Hall. Multiclass alternating decision trees. In Proceedings of the European Conference on Machine Learning. Springer Verlag, 2002.

[KK97] Ron Kohavi and Clayton Kunz. Option decision trees with majority votes. In Proc. 14th International Conference on Machine Learning, pages 161-169. Morgan Kaufmann, 1997.

[KM96] M. Kearns and Y. Mansour. On the boosting ability of top-down decision tree learning algorithms. In Proceedings of the Twenty-Eighth Annual ACM Symposium on the Theory of Computing, pages 459-468, 1996.

[Qui93] J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA, 1993.

[SS98] Robert E. Schapire and Yoram Singer. Improved boosting algorithms using confidence-rated predictions. In Proceedings of the 11th Annual Conference on Computational Learning Theory (COLT-98), pages 80-91, New York, July 24-26 1998. ACM Press.