CiteSeerX — The CN2 Induction Algorithm

(1)

In: Machine Learning Journal, 3 (4), pp261-283, Netherlands: Kluwer (1989) http://www.cs.utexas.edu/users/pclark/papers

The CN2 Induction Algorithm

PeterClark (p [email protected])

Tim Niblett([email protected])

The Turing Institute,

36 N. HanoverSt.,

Glasgow, G1 2AD, U.K.

Octob er 1988

Abstract

Systems for inducing concept descriptions from examples are valuableto ols for

assistinginthetaskofknowledgeacquisitionforexp ertsystems. Thispap erpresents

adescription andempiricalevaluationofanewinductionsystem,cn2,designed for

theecientinductionof simple,comprehensible pro ductionrulesindomainswhere

problemsofp o ordescriptionlanguageand/ornoisemayb eprese nt. Implementations

ofthecn2,id3andaqalgorithmsarecomparedonthreemedicalclassicationtasks.

Keywords: conceptlearning,ruleinduction,noise,comprehensibility,cn2.

(2)

In the task of constructing exp ert systems, systems for inducing concept descriptions

from examples have proved useful ineasing the b ottleneck of knowledge acquisition[1].

Twofamiliesofsystems,basedontheid3[2]and aq[3]algorithms,haveb eenesp ecially

successful. Thesebasicalgorithmsassumenonoiseinthedomain,searchingforaconcept

descriptionthat classiestrainingdata p erfectly. Howeverfortheapplicationofsystems

based on these algorithms to real-world domains, metho ds for handling noisy data are

required. In particular, mechanisms for avoiding the overtting of the induced concept

descriptiontothedataareneeded,requiringrelaxationoftheconstraintthattheinduced

description mustb eclassify thetrainingdatap erfectly.

Fortunatelytheid3algorithmlendsitselftoeasymo dicationallowingthisconstraint

tob erelaxed,bythenatureofitsgeneral-to-sp ecicsearch. Treepruningtechniques(e.g.

[4, 5]), as usedfor example in thesystems c4[6] and assistant [7], have proved tob e

eective metho ds of avoiding overtting. The aq algorithm, however, is less easy to

mo dify due to itsdep enden ce on sp ecic trainingexamples during its search. Existing

implementations(e.g. aq11 [8] and aq15 [9]) deal with noisy data by using pre- and

p ost-pro cessing techniques whileleavingthebasicaq algorithmintact. Ourobjective in

designing cn2 is to mo dify the aq algorithm itself insuch a way that this dep end ence

on sp ecicexamples is removed andthe spaceof rulessearchedis increased. Asa result

statisticaltechniques analogoustothose usedfortreepruning canthenb eappliedinthe

generationofif-then rules,anda simpler algorithmis achieved.

Wecanidentifyseveralrequirementsthatlearningsystemsshould meetiftheyareto

prove useful ina varietyofreal-worldsituations:

Accurate classication. The induced rules should b e able to classify new examples

accurately,even inthepresence ofnoise.

Simple rules. For the sake of comprehensibility, the induced rules should b e as short

as p ossible. However, when noise is present, rules that are overtted tend to b e

long. Thus, toinduce shortrules,one mustusuallyrelax therequirement thatthe

induced rules b econsistent withallthetraining data. The choice of howmuchto

relaxthisrequirement involvesa trade-ob etween accuracy and simplicity[10].

Ecient rule generation. Ifoneexp ectstouselargeexamplesets,itisimp ortantthat

thealgorithmscales up tocomplex situations. In practice, it is desirablethat the

timetakenforrule generationb e linearinthesizeof theexampleset.

(3)

evaluationof cn2, a new induction algorithm. It combines the eciency and abilityto

cop ewith noisydataofid3 withtheif-thenrule formand exible searchstrategyof the

aq family. The representation forrules outputbycn2is an orderedset of if-thenrules,

also known as a `decision list' [11]. cn2 uses a heuristic function to terminate search

during rule construction, based on an estimate of the noise present in the data. This

results in rules that do not necessarily classify all the training examples correctly, but

thatp erformwellonnew data.

In the following section we describ e cn2 and three other systems used for our com-

parative study. These include: aqr, the authors'reconstruction of Michalskiet al'saq

algorithm; Kononenko, Bratko and Roskar's (1984) assistant,a variant of id3; and a

simpleBayesianclassierwhichisusedtoprovideareference forthep erformanceof the

otheralgorithms. Ineachcaseweconsiderthetimecomplexityofthevariousalgorithms.

In section 4, we compare the p erformance of thealgorithms on three medicaltasks; we

alsocomparethep erformanceofassistantandcn2ontwosynthetictasks. Insection5

we discuss thesignicance of these results, and we followthiswith somesuggestionsfor

futureworkinsection 6.

2 CN2 and Algorithms for Comparative Study

cn2 and the other algorithms used in exp eriments are now presented. Because cn2

has b een develop ed from studyof b oth theid3 and aq algorithms, we rst present the

id3-based system assistant and the aq-based system aqr b efore presenting cn2 and

discussingitsrelationshipto thesealgorithms.

Wecharacterize thesystems alongthreedimensions. These are:

Therepresentationlanguage forthe induced knowledge;

Thep erformance engine forexecutingtherules;and

Thelearningalgorithmand itsasso ciatedsearchheuristics.

In all of our exp eriments, theexample description language consisted of attributes, at-

tributevaluesand user-sp eciedclasses. Thislanguagewasthesameforeachalgorithm.

2.1 Assistant

The assistant algorithm [7]is a descendant of Quinlan's id3 (1983), and incorp orates

atree pruning mechanismforhandlingnoisy data.

(4)

A b e aset ofattributesfordescribingexamples

TE(E)b e aterminationcriterion

ID M(a

i

;E) b ean evaluationfunctionwhere a

i 2A

Pro cedureassistant(E)returningTREE:

If: EsatisestheterminationcriterionTE(E)thenreturnaleafno de

forTREE,lab elled withthemostcommon classofexamples inE.

Else: determine the attribute a

best

2 A with the largest value of the

function ID M(a

best

;E). Then, for each value v

j

of attribute a

best ,

generate subtrees using assistant(E

j

) where E

j

are those examples

in E with value v

j

for attribute a

best

. Return a no de lab elled as a

teston attributea

best

with thesesubtrees attached.

Table 1: Thecore ofthe assistantalgorithm

2.1.1 Concept Description Language and Interpretation

assistantrepresentsacquired knowledgeintheformofdecisiontrees. Aninternalno de

of a tree sp ecies a test of an attribute, with each outgoing branch corresp onding to a

p ossible result of this test. Leaf no des represent the classicationto b e assigned to an

example.

To classify a new example, a path from the ro ot of the decision tree to a leaf no de

is traced. At each internal no de reached, the branch corresp onding to the value of the

attributetested atthat no de isfollowed. The classat theleaf no de represents theclass

predictionforthat example.

2.1.2 Learning Algorithm

assistant induces a decision tree by rep eatedly sp ecializing leaf no des of an initially

single-no ded tree. The sp ecialization op eration involves replacing a leaf no de with an

attributetest,andaddingnewleavestothatno decorresp ondingtothep ossibleresultsof

thattest. Heuristicsdeterminewhichattributetotestonandwhentostopsp ecialization.

Table 1summarizes thisalgorithm.

(5)

assistantusesanentropymeasuretoguidethegrowthofthedecisiontree,asdescrib ed

byQuinlan (1983) . This corresp onds to thefunctionIDM inTable 1. In addition, the

algorithmcan applya treecutometho dbased onan estimateofmaximalclassication

precision. Thistechniqueestimateswhetheradditionalbranchingwouldreduceclassica-

tionalaccuracy and ifso, terminatessearch (thereareno user-changeable parametersin

thiscalculation). Thiscutocriterion corresp onds tothefunctionTE intheTable1. If

assistantistogeneratean`unpruned' tree,theterminationcriterionTE(E)issatised

ifalltheexamplesE have thesameclassvalue.

2.2 AQR

aqr is an induction system that uses the basic aq algorithm [3] to generate a set of

classication rules. Many systems use this algorithm in a more sophisticated manner

thanaqr toimprovepredictive accuracy and rule simplicity(e.g., aq11 [8]uses a more

complex metho d of rule interpretation that involvesdegrees of conrmation). aqr is a

reconstructionof a straightforward aq-based system.

aqrinducesasetofdecisionrules,oneforeachclass. Eachruleisoftheform`if<cover>

thenpredict <class>',where<cover> is a b o oleancombination of attributetestsas we

nowdescrib e.

The basic test on an attribute is called a selector. The following are examples of

selectors:

hCloudy=yesi

hWeather=wet^stormyi

hTemp>60i

aqr allows tests in the set f=;;>;6=g. A conjunct of selectors is called a complex,

and a disjunct of complexes is called a cover. We say that an expression (a selector,

complex, or cover) covers an example if the expression is true of the example. Thus,

theemptycomplex (conjunctof zeroattribute tests)covers allexamplesand the empty

cover (disjunct of zero complexes)covers no examples. A cover is stored alongwith an

asso ciated class value, representing the most common class of those training examples

which itcovers.

(6)

conditionssatisedbythe example. Iftheexamplesatises onlyone rule,then theclass

predicted bythatrule isassignedtotheexample. If theexamplesatisesmorethanone

rule, the most common class of training examples that were covered by those rules is

predicted. Iftheexampleisnotcoveredbyanyrule,thenit isassignedbydefaulttothe

classthat o ccurred mostfrequentlyinthetrainingexamples.

2.2.2 The Learning Algorithm

The aq rule-generation algorithm has b een describ ed elsewhere (e.g. [8, 12, 13]), and

the aqr system is an instance of thisgeneral algorithm. aqr generates a decision rule

for each class in turn. Having chosen a class on which to fo cus, it forms a disjunct of

complexes (the cover) to serve as the conditionof the rule forthat class. This pro cess

o ccurs instages; each stagegenerates a single complex, and then removes theexamples

itcoversfromthetrainingset. Thisstep isrep eateduntilenoughcomplexesarefoundto

coverallthe examples of thechosen class. Thiswhole pro cess is rep eated foreach class

inturn. Table 2summarizes theaqralgorithm.

2.2.3 Heuristic Functions

The particularheuristic functions used by the aq algorithmare implementationdep en-

dent. Theheuristicusedbyaqr tocho osetheb estcomplexis\maximizethenumb erof

p ositiveexamplescovered". Theheuristicusedtotrimthepartialstarduring generation

of a complexis \maximize thesum of p ositive examples covered and negativeexamples

excluded". In the case of a tie for either heuristic, the system prefers complexes with

fewerselectors. Seeds arechosenatrandomandnegativeexamples arechosenaccording

totheirdistancefrom theseed(nearestonesare pickedrst,where distanceisthe num-

b er of attributeswithdierent valuesin theseed and negative example). In thecase of

contradictions(i.e. iftheseed andnegativeexamplehave identicalattributevalues) the

negative example is ignored and a dierent one is chosen, since the complex cannot b e

sp ecializedtoexclude itbut stillinclude theseed.

2.3 A Bayesian Classier

To establish a reference p oint, we also implemented a simple Bayesian classier and

compareditsb ehaviortothatof theother algorithms.

(7)

Pro cedure AQR(POS;NEG) returning COVER:

let COVER b etheemptycover;

whileCOVER do esnot coverallp ositiveexamples inPOS

selecta SEED,i.e. ap ositive examplenotcoveredbyCOVER;

callpro cedure STAR(SEED;NEG) togenerate theSTAR(aset)of

complexes thatcoverSEEDbut no examplesinNEG;

selecttheb est complexBESTfromthestaraccording touser-dened criteria;

add BESTasan extra disjuncttoCOVER;

returnCOVER.

Pro cedure STAR(SEED;NEG) returning STAR:

let STARb etheset containing theemptycomplex;

whileone ormorecomplexes inSTARcoverssome negativeexamples inNEG,

selecta negativeexampleE

neg

coveredbya complexinSTAR;

Sp ecializecomplexes inSTARtoexclude E

neg by:

letEXTENSIONb eallselectors thatcoverSEEDbut notE

neg

;

letSTARb e thesetfx^yjx2STAR;y2EXTENSIONg;

remove allcomplexesinSTARsubsumed byothercomplexes inSTAR;

Remove the worstcomplexes fromSTAR

until sizeofSTARisless thanorequaltouser-dened maximum(maxstar).

returnSTAR.

(8)

Thisclassierrepresentsits`decisionrule'asamatrixofprobabilitiesp(v

j jC

k

)sp ecifying

theprobabilityof o ccurrenceof each attributevalue given each class. To classify a new

example, one appliesBayes'theorem

p(C

i j

^

v

j )=

p(

V

v

j jC

i )p(C

i )

P

k p(

V

v

j jC

k )p(C

k )

where the summation is over the n classes and p(C

i j

V

v

j

) denotes the probability that

the example is of class C

i

given v

j (

V

is the symb ol for conjunction, V

v

j

denoting a

conjunctof attributevaluesallo ccurringinan example). Onecalculates thisprobability

foreveryclass,andthenselectstheclasswiththehighestprobability. Thetermp(C

k )is

estimated from the distributionof thetraining examples amongclasses. Ifone assumes

indep enden ce of attributes,p(

V

v

j jC

k

) can b ecalculatedusing

p(

^

v

j jC

k )=

Y

j p(v

j jC

k )

and the values p(v

j jC

k

) from the probabilitymatrix. Note that, unlike the other algo-

rithmswe have discussed, our implementation of the Bayesianclassier requires one to

examinethevaluesofallattributeswhenmakinga prediction.

We should note that there also exist more sophisticated applications of the Bayes

rule in which the attribute testsare ordered[14]. Such a sequential technique adds the

contribution of each testto a total; when this score exceeds a threshold, the algorithm

exitswithaclassprediction. Suchaninterpretationmayb emorecomprehensibletoauser

thantheapproach we have used,aswellaslimitingthetestsrequiredforclassication.

2.3.2 The Learning Algorithm

TheBayesianlearningmetho dconstructsthematrixp(v

j jC

k

)fromthetrainingexamples

by examining thefrequency of valuesineach class. One can compute thismatrixeither

incrementally,incorp oratingone instanceat atime,or non incrementally,using alldata

attheoutset.

2.3.3 Heuristics

Sometimes a value of zero is calculated from the training data for some elements of

the p(v

j jC

k

) matrix. Like all elements of the matrix, this numb er is subject to error

due to the nite training data available. However as as classication of new examples

involvesmultiplyingelements together,a zero element can havedrasticeect, nullifying

(9)

zero elementsinthematrixwould,givenmoredata,convergeon asmall,non-zero value

and hence replace the zeros with some appropriate estimate. In our implementation a

valueofp(C

i

)2(1=N)wasused, whereN isthenumb eroftrainingexamples. Thefactor

1=N representstheincreasingcertaintythatthiselementmusthaveanalmost-zerovalue

withincreasing sizeoftrainingdata.

2.4 The Default Rule

Finally, an `algorithm' which simply assigns the most commonly o ccurring class to all

newexamples, withno referencetotheir attributesatall,wasusedforcomparison with

theotheralgorithms. Interestingly,thissimplepro cedurewasofcomparablep erformance

totheother algorithmsinone domaintested (describ edlater),and thus proveda useful

extraalgorithmtoconsider.

2.5 The CN2 Algorithm

Having presented assistant, aqr, and the other algorithms used in the exp eriments,

we now present cn2, rst describing how the general algorithm arises naturally from

considerationof theid3 andaq algorithmsand thendescribingitsdetails.

id3 iseasilyadaptable tohandle noisydatabyvirtueof itstop-downtree generation

approach. Duringinduction, all p ossibleattributetests areconsidered when`growing'a

leaf no de in the tree, and entropy is used to select the b est one to place at that no de.

Overttingof decision trees can thus b e avoided by halting tree growth when no more

signicant information can b e gained by further growth. We wish to apply a similar

metho dtothepro duction ofif-then rulesratherthandecisiontrees.

The aq algorithm, when generating a complex, also p erforms a general-to-sp ecic

searchfortheb estcomplex. However,onlysp ecializationswhichexcludesomeparticular

covered negative example from the complex while ensuring some particular `seed' p os-

itive example remains covered are considered, iterating until all negative examples are

excluded. As a result, aq searches only the space of complexes completely consistent

withthe trainingdata. Theaq algorithmemploysa b eam search,which can b e viewed

asseveralhill-climbingsearchesinparallel.

Forthecn2algorithm,wehaveretainedtheb eamsearchmetho doftheaqalgorithm

but rstly removed itsdep ende nce on sp ecic examples during search and secondly ex-

tended its search space toinclude rules which do notp erform p erfectly on the training

(10)

sp ecializations of a complex, similar to the way id3 considers all attribute tests when

growingano deinthetree. Indeed,withab eamwidthofone thecn2algorithmb ehaves

equivalentlytoid3growingasingle treebranch. Thus byp erforminga top-downsearch

forcomplexes,itisnowp ossibletoapplyacutometho dsimilartodecisiontreepruning

tohalt sp ecializationwhenno furtherstatisticallysignicantsp ecializationsarep ossible.

Finally, we note that the rules cn2 pro duces are an ordered list of if-then rules,

rather than an unordered set of if-then rules as pro duced by aq-based systems. Both

representationshavetheirresp ectiveadvantagesanddisadvantagesforcomprehensibility

{orderindep endentrulesrequiresomeadditionalmechanismtob eprovidedtoresolveany

rulecon icts whichmayo ccur(thusdetracting fromastrictlogicalinterpretationof the

rules),whileorderedrulesalsosacriceadegreeincomprehensibilityastheinterpretation

of a single rule is dep endent on which other rules preceded it in the list. It is p ossible

to make cn2 pro duce unordered if-then rules byappropriately changing the evaluation

function 1

.

Rules in the ordered list which cn2 induces are each of the form `if <complex> then

predict<class>',where<complex>hasthesamedenitionasforaqr,namelyaconjunct

ofattributetests. Thisorderedrule representationisaversionofwhatRivest(1987)has

termed decisionlists. The lastrule incn2'slist is a`default rule'whichsimply predicts

themostcommonly o ccurringclass inthetrainingdataforallnewexamples.

To use the induced rules to classify new examples, cn2 applies an interpretation in

which each rule is triedinorder untilone is found whoseconditions aresatised by the

exampleb eingclassied. Theresultingclasspredictionofthisruleisthenassignedasthe

class of that example. Thus, the ordering of the rules is imp ortant. If no induced rules

aresatised,the naldefaultrule assignsthemostcommonclass tothenew example.

2.5.2 Learning Algorithm

Table 3 presents a summary of the cn2 algorithm. This works in an iterative fashion,

each iterationsearching for a complex covering a large numb er of examples of a single

class C and few of other classes. The complex must b e b oth predictive and reliable,as

1

E.g. replaceentropy withE=(+veexs.coveredminus-veexs. covered),thengeneraterulesforeach

classinturn.

(11)

it covers are removed from the trainingset and the rule `if <complex> then predict C'

is added to the end of the rule list. This pro cess iterates until no more satisfactory

complexes can b efound.

Thesystemsearchesforcomplexesbycarryingoutaprunedgeneral-to-sp ecicsearch.

At each stage in the search, cn2 retains a size-limited set or star S of `b est complexes

found sofar'. The systemexamines onlysp ecializationsofthisset, carryingouta b eam

search of the space of complexes. A complex is sp ecialized by either adding a new con-

junctivetermorremovinga disjunctiveelement inone ofitsselector. Eachcomplexcan

b e sp ecializedin severalways,and cn2 generatesand evaluates allsuch sp ecializations.

Thestaristrimmedaftercompletionofthisstepbyremovingitslowestrankingelements

asmeasured byan evaluationfunctionthatwewilldescrib e shortly.

Our implementationofthesp ecializationstepistorep eatedlyintersect 2

thesetofall

p ossibleselectorswiththecurrentstar,eliminatingallthenullandunchangedelementsin

theresultingsetofcomplexes. (Anullcomplexisonethatcontainsapairofincompatible

selectors, e.g. big=y^big=n). cn2 deals with continuous attributes in a manner

similar to assistant - by dividing the range of values of each attribute into discrete

sub-ranges. Testsonsuchattributesexaminewhether avalueisgreaterorless(orequal)

thanthe values atsub-range b oundaries. Thecomplete range of valuesand sizeof each

sub-rangeis providedbytheuser.

Fordealing with unknown attribute values, cn2 usethe simple metho dof replacing

unknown valueswiththemostcommonlyo ccurringvalue(ormidvalue ofthemostcom-

monly o ccurring sub-range, in the case of numeric attributes) forthat attribute in the

trainingdata.

2.5.3 Heuristics

Thecn2algorithmmustmaketwoheuristicdecisionsduringthelearningpro cess,andit

employstwoevaluationfunctionstoaidinthesedecisions. Firstitmustassessthequality

of complexes, determining if a new complex should replace the `b est complex' found so

farand alsowhich complexes in thestar S todiscard ifthe maximum sizeis exceeded.

Computing this involves rst nding the set E 0

of examples which a complex covers

(i.e.,whichsatisfyallofits selectors)andthe probabilitydistributionP =(p

1

;:::p

n ) of

2

TheintersectionofsetAwithsetBisthesetfx^yjx2A;y2Bg. Forexample,using`:'toabbreviate

`^', fa:b;a:c;b:dg intersected with fa;b;c;dg is fa:b;a:b:c;a:b:d;a:c;a:c:d;b:d;b:c:dg. If we now remove

unchangedelementsinthissetweobtainfa:b:c;a:b:d;a:c:d;b:c:dg.

(12)

Let: Eb e aset of trainingexamples;

Pro cedure CN2(E)returning RULELIST:

letRULE LISTb etheemptylist;

rep eat

let BESTCPX b eFindBest Complex(E);

if BESTCPX isnotnil then

LetE 0

b etheexamples coveredbyBEST CPX;

Remove fromEthe examplesE 0

coveredbyBESTCPX;

LetCb e themostcommonclass ofexamples inE 0

;

Add therule `ifBESTCPXthen class=C'totheend of RULELIST,

untilBEST CPXisnil or Eisempty.

returnRULE LIST.

Pro cedure FindBest Complex(E)returning BEST CPX:

lettheset STARcontainonlytheempty complex;

letBEST CPXb enil;

letSELECTORSb etheset of allp ossible selectors;

whileSTARisnotempty,

sp ecialize allcomplexes inSTARasfollows:

let NEWSTARb e thesetfx^yjx2STAR;y2SELECTORSg;

Remove allcomplexesinNEWSTAR that areeitherinSTAR(i.e.,the

unsp ecializedones) orare null (e.g. big=y^big=n)

for every complexC

i

inNEWSTAR:

ifC

i

isstatisticallysignicant when testedonE andb etter than

BESTCPXaccording touser-dened criteria whentested onE,

then replacethecurrent valueof BEST CPXby C

i

;

rep eat removeworstcomplexes from NEWSTAR

untilsizeof NEWSTAR isuser-dened maximum;

let STARb e NEWSTAR;

returnBEST CPX.

(13)

examples inE amongclasses(where n= numb er of classes represented inthe training

data). cn2thenuses theinformation-theoreticentropymeasure

Entr opy=0 X

i p

i log

2 (p

i )

toevaluate complex quality(the lowertheentropy theb etter the complex). Thisfunc-

tion thus prefers complexes covering a large numb er of examples of a single class and

few examplesof other classes, and hence such complexes scorewellon thetraining data

when used to predict the majority class covered. It should b e noted that this func-

tion was used in preference to a simple `p ercentage correct' measure (e.g. by taking

max(P)), most imp ortantly b ecause entropy will distinguish probability distributions

such as P = (0:7;0:1;0:1;0:1) and P = (0:7;0:3;0;0) in favor of the latter where as

max(P)willnot. This is desirable,sincethere existmore waysof sp ecializingthelatter

toacomplexidentifyingonlyoneclass;iftheexamplesofthemajorityclassareexcluded

by sp ecialization, the distributions b ecome P = (0;0:33;0:33;0:33)and P = (0;1;0;0)

resp ectively. In addition, the entropy measure tends to direct the search in the direc-

tion of moresignicant rules; empirically, rules of high entropy also tend to have high

signicance.

The second evaluation function tests whether a complex is signicant. By this we

refer to a complex that lo cates a regularity unlikely to have o ccurred by chance, and

thus re ectsa genuinecorrelationb etween attribute valuesand classes. Toassess signif-

icance,cn2 comparestheobserved distributionamongclasses ofexamples satisfyingthe

complex with the expected distribution that would result if thecomplex selected exam-

plesrandomly. Somedierencesinthese distributionswillresultfrom randomvariation.

The issue is whether the observed dierences are to o great to b e accounted for purely

by chance. If so, cn2 assumes that the complex re ects a genuine correlation b etween

attributesand classes.

To test signicance,the system uses the likeliho o dratio statistic [15]. This is given

by

2 n

X

i=1 f

i log(f

i

=e

i );

wherethedistributionF =(f

1

;:::;f

n

)istheobservedfrequencydistributionofexamples

amongclasses satisfying a given complexand E =(e

1

;:::;e

n

) is the exp ected frequency

distribution of the same numb er of examples under the assumption that the complex

selectsexamples randomly. Thisis taken asthe N = P

f

i

covered examples distributed

amongclasses with the same probabilityas that of examples in the entire trainingset.

(14)

tance b etween the two distributions 3

. Under suitable assumptions, one can show that

this statistic is distributed approximately as 2

with n01 degrees of freedom. This

provides a measure of indicatessignicance | thelower thescore, the more likely that

theapparent regularityis due tochance.

Thus these two functions - entropy and signicance - serve to determine whether

complexes found during search are b oth `go o d' (i.e. high accuracy when predicting the

majorityclass covered) and `reliable'(high accuracy on trainingdata is not justdue to

chance)resp ectively. cn2usesthesefunctionstorep eatedlysearchforthe`b est'complex

whichalsopassessomeminimumthresholdofreliabilityuntilnomorereliablecomplexes

can b efound.

3 Time Complexity of the Algorithms

The assistant,aqr and cn2algorithmsare allsearchinga very largespace of concept

descriptions,andalluseheuristicstoguidethissearch. Furthermore,thecn2,assistant

and aqr algorithms attempt to pro duce structures that are b oth consistent with the

trainingexamplesandascompactasp ossible. Inthedesignofsuchalgorithms,thereisa

tradeob etweenexecutionsp eedandthesizeoftheinducedstructures. Ineachcase,the

exhaustive searchfora smallestset ofstructures, althoughdesirable,iscomputationally

infeasible.

Amajorapplicationofthesealgorithmsistoextractusefulinformationfromverylarge

databases,p erhaps with millionsof examples. Withthisin mind, itis worthexamining

thecomplexityofeachalgorithm. Tob epracticalforverylargeproblems,theirb ehavior

should b elinear,ornear-linear, inthenumb erof examplesand attributes.

Since theoverall complexityof each algorithmisdomain-dep endent, we insteadpro-

videupp er b oundsforthecriticalcomp onentsof thealgorithms. Wedonotconsider the

complexityof thecutopro cedure usedbyassistant.

In ourtreatment we usethefollowingvariablestodenote parameters:

e: Thesizeof theexampleset

a: Thenumb er ofattributes

s: Themaximumstarsize (forcn2and aqr)

3

WeassumethatF iscontinuous withresp ect toE,i.e.that thef

i

are zerowhenthee

i

arezero.

(15)

We alsoassume thateach attributeisbinary valued and thatthere are twoclasses .

3.1 Time Complexity of Assistant

Thecriticalcomp onent inassistantisthepro cessofselectinga testattributeonwhich

tobranch. Eachsuchchoice involvesthefollowingop erations:

1. For each attribute, example counts are put in an array, indexed by class and at-

tribute. ThistakesTime: o(e1a);

2. Theentropyfunctionis calculatedforeach attributetakingTime: o(a);

3. Oncetheb estattributeisfound,theexamples aredividedintotwosets;thistakes

timeTime: o(e) 5

.

Therefore, the overall time for a single attribute choice is o(a1e). The time taken to

construct the complete tree dep end s very much on the structure of the tree. It seems

reasonable touse therst gure onlyforcomparativepurp oses, as argued ab ove. Thus

the amount of time taken by assistant for the basicattribute selection op eration is a

linearfunctionofthenumb erofexamples,whenthenumb erofclassesandattributesare

heldconstant.

Weshould notethatextensionstothisalgorithmthatusereal-valuedattributes(such

asacls[16])mustsorttheexamplesbyattributevalue atthe rststage. Thisincreases

theoverall timeb oundtoo(a1eloge).

3.2 Time Complexityof CN2

Thebasicop erationincn2isthesp ecializationofthecomplexesinthecurrentstar. The

numb erofsingleselectorcomplexeswithoutdisjuncts is2a. The numb erofintermediate

complexesgeneratedis atmosta1s,and thetimetaken toevaluateanexampleagainsta

complexisb ounded byo(a). Threestepsare requiredforthissp ecializationop eration:

Multiplying each complexin the starbythe set of single selectorrules; this takes

Time: o(a1s);

4

Onemightalsoconsiderthe complexityasafunctionofthenumb er ofdistinctattributevaluesand

classes. Wehavenotdone thisinouranalysis.

5

Withappropriatedatastructures,itmayb ep ossibletomuchofthisworkintherststage,butthis

do esnotaectthecomplexityclass. Similarly,onecanincludeanyterminationtestthatislinearinthe

numb erofexamples.

(16)

Sortingthecomplexesbyvalueandthentrimthestar,whichtakesTime: o(a1slog(a1s)).

Therefore, the overall time for a single sp ecialization step is b ounded by o(a1s(e +

log(a1s))). As with assistant, the time required is a linear function of the numb er

ofexamples. Ifwerestrictthethesizeofthestartoone,thetimerequiredhasthesame

orderasforassistant. Ingeneral,exp erience indicatesthatthetimeconstantsinvolved

aresomewhat lessforassistantand other variants on id3thanforcn2.

3.3 Time Complexity of AQR

In aqr,the basic op erationis thesp ecialization of complexes ina star. This op eration

issimilarto thatof cn2, except thatonlysp ecializationscausing a negative exampleto

b euncoveredbycomplexes inaqr's stararegenerated. We showthecomplexityof this

op erationis thesameasthatof cn2.

Foreachnegativeexample, thefollowingstepsarep erformed:

Anegativeexampleisfoundbyiteratingthroughthenegativeset. Weassumethat

thenumb er of negative examples is notless than somexed fraction of the entire

exampleset. ThistakesTimeo(e1s);

Theset ofselectorsthat distinguishthenegative examplefrom theseedarefound;

ThistakesTimeo(a);

Each complex in the star is sp ecialized by intersection with this set of selectors,

takingTimeo(a1s);

Theresulting complexesare evaluated,whichtakesTimeo(a1s1e);

Thecomplexes aresorted andthe startrimmed,takingTimeo(a1slog(a1s)).

Thus, for each negative example the time is b ounded by o(a1s(e+log(a1s))). This is

the same gure as obtained for cn2. Observe that the numb er of iterations of this

pro cess (makingthestardisjointfroma negativeexample)is b oundedbythenumb er of

attributes,notby thenumb erof examples.

In practice, although the order of time taken by the algorithms for this particular

op eration of pro ducing a new star is the same, cn2 is faster overallthan aqr. This is

b ecause the numb er of iterations of this op eration is lower in cn2 than in aqr, since

cn2 may halt sp ecialization of a complex b efore it p erforms p erfectly on the training

(17)

arecoveredifno furtherstatisticallysignicant rulescan b efound.

3.4 Time Complexity of the Bayesian Classier

ThetimecomplexityoftheBayes'classierforgeneratingaprobabilitymatrixiso(a1e),

whereaisthenumb erofattributesandethenumb erofexamples. Thislearningalgorithm

was substantially faster than the other algorithmsb ecause the run time is indep end ent

of thedecision`rule'generated. In additionthisbasic op erationis p erformedonlyonce,

unliketheab ove algorithms wherethebasicop erationis rep eatedlyapplied.

3.5 Summary and Actual Run Times

We haveshownthat thetimecomplexityof thebasiclearningstep forallthealgorithms

tested is linear in the numb er of examples (o(a:e) for assistant,o(a:e:s) for aqr and

cn2). Thisisan essentialrequirement foranyalgorithmthat mustworkwithvery large

datasets.

We brie y consider the time complexity of the entire induction pro cess, requiring

iterationofthebasiclearningstepsanalysedab ove. Withidealnoise-tolerantalgorithms,

givenacertainminimumnumb erofexamples,conceptdescriptionsrepresentingonlythe

genuineregularitiesinthedatashouldb einduced; additionalexamplesshouldnotcause

theconceptdescriptiontogrowfurtherandb ecomeovertted,henceinthisidealcasethe

ab ovegures also represent thetimecomplexityof theoverall learningtask. When this

idealis not met (e.g. when a concept description classifying the trainingdata p erfectly

is sought for), cn2 would at worst induce e rules of length a giving an overall time

complexityof o(a 2

:e 2

:s)[17]. assistantsorts a total of eexamples among aattributes

foreachlevelofthetree,givinganoveralltimecomplexityofo(a 2

:e)asthetreedepthis

b ounded bya. A similarworst-casetime complexitytocn2 holds foraqr.

Weprovidetheactualrun timesforinterest,while notingthat itis diculttomake

fair comparisons of sp eed due to dierences in implementation language and metho d.

All run times are for inducing a classication pro cedure in the lymphography domain

(section4.2.1)using afour-megabyte Sun 3/75. assistant,implementedinab out 5000

lines of Pascal, to ok one minute run-time. cn2 and aqr, each implemented in ab out

400 lines of Prolog and with a value of fteen for maxstar, to ok 15 and 170 minutes

run-timeresp ectively. The Bayesianclassier, implemented in150 lines of Prolog,to ok

afewsecondstocalculateitsprobabilitymatrix. Whileitisdiculttodrawconclusions

(18)

fastest,followedbyassistant,cn2andaqr)isafairre ectionoftherelativecomputa-

tion involved inusing the algorithms. Moredetailedempirical comparisonsof time and

memory requirements of id3 and the aq-based system aq11p have b een conducted by

O'Rorke (1982)and Jackson (1985)inthedomainof chessend-games.

4 Experiments with the Algorithms

4.1 Dependent Measures

Inadditiontocomputationalcomplexity,weevaluatetheb ehaviorofthesesystemsbytwo

othercriteria,classicationalaccuracyandsyntacticcomplexityoftheacquiredstructure.

Thistwofoldevaluationis motivatedbyconsidering thesesystemsasto olsforknowledge

acquisition for exp ert systems. A useful system should induce rules that are accurate -

sothat theyp erform well- aswellascomprehensible - so thatthey can b e validatedby

an exp ert andused forexplanation.

Wemeasureeachalgorithm'sclassicationaccuracybysplittingthedataintoatrain-

ingsetand atestset,presentingthe algorithmwiththetrainingset toinducea concept

description and thenmeasuringthe p ercentage of correctpredictions made by thatcon-

cept description on the test set. Quinlan (1983,1987) and others have taken a similar

approachtomeasuring accuracy.

Cross-algorithm comparisons of the complexity of concept descriptions are dicult

duetothedierencesinrepresentationandthedegree ofsubjectivityinvolvedinjudging

complexity. Thus, we will onlycompare the gross features of the knowledge structures

inducedbythedierentalgorithms. Forassistant'sdecisiontrees,wemeasurecomplex-

itybythenumb erofno des (including leaves)inthetree. Forcn2andaqr,wemeasure

complexitybythenumb erofselectorsinthenalrulelistandrulesetresp ectively. These

measuresrevealthegrossfeatures oftheinduced decisionrules. More detailedmeasures

of rulecomplexityhave b eenmade byO'Rorke (1982)but arenotused here. We assign

acomplexityofone tothedefaultrule basedon itsequivalencetoadecisiontree witha

singleno de.

Assessing the complexity of a Bayesianrule is more dicult. One could count the

numb er of elements in the p(V

j jC

k

) matrix. Thus, for a domain with n classes and a

attributes,eachwithan averageof v p ossiblevalues, thecomplexitywouldb ea2v2n.

However,suchameasure isindep endent ofthetrainingexamples,anditignoresfeatures

(19)

large and the rest small). Still, lacking any b etter measure, we provide the size of the

matrixasa roughguide.

4.2 Experiments on Natural Domains

The ab ove algorithms were tested on three sets of medical data,whichwe willdescrib e

shortly. These data were obtained from the Institute of Oncology at the University

Medical Center in Ljubljana,Yugoslavia[7]. In each test, 70%of thetrainingexamples

wereselectedatrandomfromtheentiredataset,andtheremaining30%ofthedatawere

usedfortesting. Thealgorithmswereallrunonthesametrainingdataandtheirinduced

knowledgestructurestestedusingthesametestdata. Fivesuchtestswerep erformedfor

each of the three domains,and the resultswere averaged. Thesedata are thus identical

to those used totest aq15 in [9], though the particularrandom 70% and 30% samples

aredierent.

Both cn2and aqrweregiven a value of15formaxstar inallruns.

4.2.1 Three Medical Domains

Table 4 summarizesthe characteristics of thethree medicaldomains usedin theexp eri-

ments. Therstoftheseinvolvedlymphography. Forpatientswithsusp ectedcancer,itis

imp ortantforphysicians todistinguishb etween diagnosessuch asmetastases,malignant

lymphoma or simply normal ndings; patient data relating to this task were collected

from Ljubljana's Oncology Institute. These data were consistent, i.e. examples of any

two classeswere always dierent. Allthe tested algorithmspro duced fairlysimple and

accurate rules. Unlike the other two domains, thisdata set wasnot submitted toa de-

tailedcheckingafteritsoriginalcompilationbytheMedicalCenter,andthusmaycontain

errorsinattributevalues.

The second domaininvolvedpredictingwhether patientswho haveundergone breast

cancer op erations will exp erience recurrence of the illness within ve years of the op-

eration. The recurrence rate is ab out 30%, and hence such prognosis is imp ortant for

determining p ost-op erational treatment. These data were veried after collection, and

thus arelikelytob erelativelyfreeof errors.

Thenalmedicaldomainfo cussedonpredictingthelo cationofprimarytumor. Physi-

ciansdistinguish b etween 22p ossible lo cations, predicted from datasuch asage, hysto-

logictyp e of carcinoma,and p ossible lo cationsof detected metastases,and is againim-

(20)

Domain Domain

Prop erty Lymphography Breast Cancer PrimaryTumor

No. ofattributes 18 9 17

Min-maxno. of vals/att 2-8 2-5 2-3

Average 3.3 2.8 2.2

valuesp er attribute

Numb er ofclasses 4 2 22

Totalno. ofexamples 148 286 339

Distributionof examples 2,81, 61,4 85,201 84, 20,9,14,39, 1,14,

amongclasses 6,0,2,28,16, 7,24,

2,1,10,29, 6,2,1,24

p ortantindeterminingtreatmentofpatients. Thesedatawereinconsistent,i.e. examples

of dierent classes existedwith identical attributevalues. Thesedata wasveriedafter

collecting,andthusarelikelytob erelativelyerror-free. Thesetofattributesisrelatively

incomplete,i.e. not sucient toinduce highqualityrules.

4.2.2 Results with Natural Domains

Table 5 presentsthe resultsforeach algorithmon eachdomain,averaged overve runs.

Ineachcase,wepresenttheaverageaccuracyonthetestdataandtheaveragecomplexity

of the resulting knowledge structures. cn2 was tested using three values of signicance

threshold and assistant was run with and without pruning. Only one version of the

other systemswererun.

The table contains some interesting regularities. Most imp ortant is that the algo-

rithms designed to reduce problems caused by noisy data achieve a lower complexity

withoutdamaging theirpredictive accuracy. Forexamplein thelymphographydomain,

the version of cn2 with the highest threshold achieved the sameclassication accuracy

as the other algorithms by inducing (on average) only eight rules each containing 1.6

selectors. The treepruning versionofassistantpro ducedsimilarresults.

Both systems apply a similar technique to reducing complexity, namely sometimes

halting sp ecialization of concept descriptions b efore they classify thetraining examples

p erfectly. Asa result,assistantand cn2avoidoverttingtheirdecisiontreesandrules

(21)

three naturaldomains. (ComplexityfortheBayesclassier isthesizeof theprobability

matrix).

Lymphography BreastCancer PrimaryTumor

Algorithm Domain Domain Domain

Accuracy Complexity Accuracy Complexity Accuracy Complexity

Default rule 56% 1 71% 1 26% 1

assistant:

(nopruning) 79% 41 62% 112 40% 178

(pruning) 78% 36 68% 44 42% 52

Bayes 83% (240)y 65% (540)y 39% (465)y

aqr 76% 76 72% 208 35% 562

cn2:

(90%threshold) 78% 24 70% 28 37% 33

(95%threshold) 81% 22 70% 20 36% 42

(99%threshold) 82% 12 71% 4 36% 19

ySeediscussioninsection4.1ab outdicultiesinmeasuringthecomplexityofBayesianclassier

Table 6: Accuracy of the dierent algorithms on training and testdata. The rep orted

versionofassistantincorp oratedpruningand theversionofcn2useda99%threshold.

Accuracy of decisionpro cedureson trainingandtestdata:

Algorithm Lymphography Breast Cancer PrimaryTumor

Train Test Train Test Train Test

Default rule 54% 56% 70% 71% 23% 26%

assistant 98% 78% 85% 68% 53% 42%

Bayes 89% 83% 70% 65% 48% 39%

aqr 100% 76% 100% 72% 75% 35%

cn2 91% 82% 72% 71% 37% 36%

(22)

set until it achieves as nearly complete consistency with the training data as p ossible,

resulting in an overtted rule set. Table 6 illustratesthis eect bycomparing accuracy

on thetrainingand testdata.

TheresultsalsoshowthattheBayesianclassierdo eswell,p erformingcomparablyto

the moresophisticated algorithms in allthree domains and givingthe highest accuracy

in the lymphography domain. Table 6 shows that this metho d regularly overts the

trainingdata,butthatitsp erformanceinthetestsetisstillgo o d. Evenmoresurprising

is the b ehavior of the frequency-based default rule, which outp erforms assistant and

theBayesmetho d on thebreastcancer domain. Thissuggeststhat inthebreast cancer

domain there are virtuallyno signicant correlations b etween attributes and classes in

the data. This is re ected by cn2's inability to nd signicant rules in this domain at

99% threshold, suggesting that, in this domain at least, the signicance test has b een

eectiveinlteringoutrules representing chanceregularities.

In general, the dierences in p erformance seem to b e due less to the learning algo-

rithms than to the nature of the domains; for example the b est classication accuracy

forlymphographywasbarely half ashigh as that forprimary tumor. This suggests the

need foradditionalstudies toexaminetheroleof domainregularityonlearning.

4.3 Experiments on Articial Domains

Tob etterunderstandtheeectsofovertting,weexp erimentedwithcn2andassistant

ontwoarticialdomainsthatletuscontroltheamountofnoiseinthedata. Bothdomains

containedtwelveattributesand200exampleswhichwereevenlydistributedb etweentwo

classes. Theydiered onlyinthenumb erofvalueseach attributecouldtake(twoin the

rst domainand eight inthesecond).

Inb othcases,thetargetconcept foroneclasscouldb estatedasasimpleconjunctive

ruleoftheform`if(a=v

1

)^111^(d=v

1

)thenclassX'.Bothalgorithmscanrepresent

such a regularitycompactly. The second class wassimply thenegationof the rst. 50%

ofthe datawasused fortraining,50%for testing,and resultsaveraged overve trials.

Foreach domain, we varied the amount of noise in thetraining data and measured

the eect on complexityand accuracy on the test data. Table 7 rep orts the resultsfor

therst articialdomainwithtwovaluesp erattribute,and Table 8forthe secondwith

eightvaluesp erattribute.

The p ercentage of noiseadded indicatesthe prop ortion of attributeand class values

(23)

%of CN2 RULE (99%threshold) ASSISTANT RULE

Noise Non-defaulty unpruned pruned

Added Accuracy Accuracy Size Accuracy Size Accuracy Size

0% 95% 100% 3 99% 8 99% 8

2% 88% 99% 5 96% 16 98% 11

5% 88% 95% 10 91% 32 95% 16

10% 82% 95% 15 86% 45 91% 24

20% 73% 86% 20 76% 60 84% 27

40% 67% 76% 25 65% 74 76% 23

60% 56% 64% 26 62% 75 67% 23

100% 45% 49% 28 46% 85 43% 12

ythis refers to the accuracyof those cn2 rulesfound bysearch, i.e. excludingthe extra default

rule(`everythingisclassX') attheendoftherulelist. See discussioninSection 4.3.1.

inthetrainingexamplesthathaveb een randomized,whereattributesandclasseschosen

forrandomizationhaveequalchanceoftakinganyofthep ossiblevaluesforthatattribute

orclass. Notethat nonoisewasintro duced tothetestdata.

4.3.1 Results with Articial Domains

In exp erimenting with articialdomains, we are able toexamine severalfeatures of the

algorithmsrelatingtotheirabilitytohandlenoise. Firstlyweareinterestedinthedegra-

dation of accuracy and simplicityof concept descriptions as noise levels are increased.

Secondly, for cn2, it is also interesting to examine how the accuracy and simplicityof

individualrules(aswellasthatoftherulesetasawhole)isaectedbynoiseinthedata.

The results reveal some surprising features ab out b oth cn2 and assistant. Com-

paring classicational accuracy alone, assistant p erformed b etter than cn2 in these

particular domains. However, comparing complexity of concept description, cn2 pro-

duced simpler concept descriptions than assistantexcept athigh levels of noisein the

rst domain.

Ideally, as the level of noise approaches 100%, b oth algorithms should fail to nd

any signicant regularities in the data and thus converge on a concept description of

complexityone (forcn2'sdefaultrule alone orasingle no dedecisiontree). Howeverfor

(24)

%of CN2 RULE (99%threshold) ASSISTANT RULE

Noise Non-defaulty unpruned pruned

Added Accuracy Accuracy Size Accuracy Size Accuracy Size

0% 93% 98% 8 99% 6 99% 6

2% 83% 99% 10 97% 12 97% 12

5% 86% 94% 13 96% 15 96% 15

10% 80% 98% 10 93% 22 93% 22

20% 73% 88% 15 85% 27 85% 27

40% 68% 82% 5 75% 33 75% 33

60% 63% 90% 4 66% 40 66% 40

100% 50% 58% 1 55% 43 55% 43

ythis refers to the accuracyof those cn2 rulesfound bysearch, i.e. excludingthe extra default

rule(`everythingisclassX') attheendoftherulelist. See discussioninSection 4.3.1.

cn2, this o ccurred only in the second of the two domains tested and did not o ccur in

either domain for assistant. Indeed, in the second domain assistant's tree pruning

mechanismdidnotprune thetree atall. cn2'snding ofrules intherst domain,even

at100% noiselevel,can b e understo o d asdue toa combination of the large numb er of

rulessearched (e.g. there are12211210=1320rulesof lengththree inthespace)and

the high coverage of these rules (each length three rule covers on average 100=2 3

= 12

examples). Enough rules are searched so that, even with 99% signicance test, some

chancecoverage ofthe12(average)exampleswillapp earsignicant. Thisdid noto ccur

in the second domain as the coverage of rules wasconsiderably less; each length three

rulecoversonaverage100=8 3

0:5examples,to ofewforthesignicancetesttosucceed.

Thesecomparativeresultsandb ehaviorasnoiselevelapproaches100%suggeststhat

thethresholdingmetho ds usedinb oth cn2and assistantneed tob emoresensitiveto

theprop ertiesoftheapplicationdomain. Researchonimprovementstocn2'ssignicance

test[17]and assistant's pruningmechanism [19]iscurrentlyb eingconducted.

Finally,weconsidertheaccuracyofcn2'sindividualrulesaswellasthatofthewhole

rule set. The gures presented in Tables 7 and 8 for `non-default accuracy', referring

totheaccuracy of cn2's rulesexcluding caseswhere thedefaultrule res, indicatethat

the rule list consists of high accuracy rules plus a low accuracy (50% in this domain)

(25)

helping an exp ert articulate his or her knowledge, as each individual rule found (apart

from thedefault rule) represents a strongregularityin the training data. The decision

tree equivalent would b etoexaminethe individualbranches generated,and theiruse in

assistingan exp ert. Quinlan(1987) hasrecentlyconducted work inthisarea.

5 Discussion

The results on the natural domains indicate that dierent metho ds of halting the rule

sp ecializationpro cess,b esideshavingtheeectofreducingrulecomplexity,donotgreatly

aectpredictiveaccuracy. Thiseecthasb eenrep ortedinanumb erofpap ers([7,9,21])

Indeed it p erhaps mayb e the case that anytechniquewill have thiseect, providing a

certain maximumlevelof pruning isnot exceeded. If thisis thecase then an algorithm

should b epreferred ifitmostclosely estimatesthismaximumlevel.

The results in Table 6 suggest that the 99% threshold for the cn2 algorithmis ap-

propriateforthethreenaturaldomains;theaccuracy ontrainingdataisclosetothaton

testdata indicating that, inthese domains atleast, the algorithm is notoverttingthe

data. Additionally,highaccuracy is maintained,indicating that theconcept description

isnot underteither.

The results of the tests on the articial domains, in particularthe tests with 100%

noise, indicatethatthe current measure of signicance used bycn2could b e improved.

Asthenoiselevelreaches100%,thealgorithmshould ideallyalmostalwaysndnorules.

The fact that this only o ccurred in one of the two articial domains suggests that the

signicancemeasure should b emoresensitive toprop ertiesof thedomaininquestion.

In many waysthe comparisonswiththe aqr system areunfair, astheaq algorithm

was never intended to deal alone with noisydata. It wasincluded in these exp eriments

toexaminethe basicaq algorithm'ssensitivitytonoise. In practiceitis rarelyused on

itsown,andinstead enhanced bya numb erof pre- and p ost-rule-generationtechniques.

Exp eriments with the aq15 system [9] show that with p ost-pruning of the rules and a

probability-basedor` exible matching'rule applicationmetho d,one can achieveresults

similarto thoseof cn2and assistantinterms of accuracyand complexity.

The principal advantage of cn2 over aqr is that the algorithm supp orts a cuto

mechanism | it do es not restrict itssearch toonly those rulesthat are consistent with

thetrainingdata. cn2demonstratesthatonecan successfullycontrolthesearchthrough

thelargerspaceof inconsistentruleswiththeuse ofjudiciouslychosensearchheuristics.

(26)

achieved a simple metho d for generating noise tolerant if-then rules, allowing easy re-

pro ducibility and analysisof the algorithm. In addition, we feel that the requirements

of interactiveinduction, wheretheuser interactswith systemduring and afterrule gen-

eration, in particular the requirement for go o d explanation facilities, indicate that the

logicalruleinterpretationusedbycn2willhavepracticaladvantagesoverthemorecom-

plexprobabilisticruleinterpretationnecessaryforapplyingorder-indep endentrulessuch

asgenerated byaqr whererule con icts mayo ccur.

Another result of interest is the high p erformance of the Bayesian classier. Al-

though theindep endence assumptionof the classiermayb e unjustiedin thedomains

tested,itdidnotp erformsignicantlyworse interms ofaccuracy thanother algorithms

and it remains an op en question as to how sensitive Bayesian metho ds are to violated

indep enden ce assumptions. Although the probability matrices pro duced by the tested

classierare diculttocomprehend,theexp eriments suggestthat variantsof theBayes

classierpro ducingmorecomprehensibledecisionpro cedureswouldb eworthyof further

investigation.

6 Conclusion

Inthis pap er we havedemonstratedan induction algorithmthat combines theb estfea-

turesoftheid3andaqalgorithms,allowingtheapplicationofstatisticalmetho dssimilar

totreepruninginthegenerationofif-thenrules. Itissimilartoassistantinitseciency

and abilitytohandlenoisy data,whereasitpartiallysharestherepresentation language

and exible search strategy of aqr. By including a mechanism for handling noise into

thealgorithmitself, asimplemetho d forgeneratingnoisetolerant if-thenruleshasb een

achieved allowing easyrepro ducibilityand analysisofthe system.

The exp eriments wehave conducted showthat the algorithmhas, innoisy domains,

comparablep erformancetoassistant. Byinducingconceptdescriptionsbasedonif-then

rules,itprovidesato olforassistingintheconstructionofknowledge-basedsystemswhere

classicationpro ceduresbasedon rulesrather thandecisiontrees aredesired. The most

obviousimprovementto thealgorithm,suggestedbythe resultson articialdomains,is

animprovement tothesignicancemeasure used.

(27)

We thank Donald Michie and Claude Sammut for their careful reading and valuable

commentson earlier draftsof thispap er. We alsothank PatLangley and the reviewers

for theirdetailedcommentsand suggestionsab out the presentation. We are grateful to

G.Klanjscek, M.Soklic and M.Zwitter of theUniversityMedicalCenter, Ljubljana for

the use of the medical data and I. Kononenko for itsconversion to a form suitable the

induction algorithms. This work was supp orted by the Oce of Naval Research under

contractN00014-85-G-0243aspart oftheCognitiveScience ResearchProgram.

References

[1] Peter Mowforth. Some applicationswith inductive exp ert system shells. TIOP 86-

002,Turing Institute, Glasgow,UK, 1986.

[2] J.Ross Quinlan. Learningecient classicationpro ceduresandtheirapplicationto

chess endgames. In J. G. Carb onell, R.S. Michalski, and T. M. Mitchell, editors,

Machine Learning,vol. 1. Tioga,Palo Alto,Ca,1983.

[3] R. S.Michalski. Onthequasi-minimalsolutionof thegeneral coveringproblem. In

Proceedings of the 5th international symposium on Information Processing (FCIP

69), Vol.A3(Switching circuits), Bled, Yugoslavia,pages 125{128,1969.

[4] J. R. Quinlan. Simplifying decision trees. Int. Journal of Man-Machine Studies,

27(3):221{234,Septemb er1987.

[5] Tim Niblett. Constructing decision trees in noisy domains. In I. Bratko and

N. Lavrac,editors, Progress in Machine Learning(proceedingsof the 2nd European

Working Session onLearning),pages 67{78.Sigma,Wilmslow,UK,1987.

[6] J. R. Quinlan, P. J. Compton, K. A. Horn, and L. Lazarus. Inductive knowledge

acquisition: acasestudy.InApplicationsofExpertSystems,pages157{173.Addison-

Wesley,Wokingham,UK,1987.

[7] IgorKononenko,Ivan Bratko,andEgidijaRoskar. Exp erimentsinautomaticlearn-

ing of medicaldiagnostic rules. Technical rep ort,Facultyof ElectircalEngineering,

E. KardeljUniversity,Ljubljana, 1984.

(28)

1

ingmetho dologyandthedescriptionofprogramaq11. ISG83-5,Dept.ofComputer

Science, Univ. of IllinoisatUrbana-Champaign,Urbana, 1983.

[9] R. S. Michalski,I. Mozetic,J. Hong, and N. Lavrac. The AQ15 inductive learning

system: anoverviewandexp eriments. InProceedingsofIMAL1986,Orsay,France,

1986.Universitede Paris-Sud.

[10] WayneIba,JamesWogulis,andPatLangley. Tradingosimplicityandcoveragein

incrementalconceptlearning.InJohnLaird,editor,Proc.5thInt.Conf.onMachine

Learning,pages73{79. Kaufmann,Ca,1988.

[11] R. L.Rivest. Learning decisionlists. Machine Learning,2(3):229{246,1987.

[12] R.S.MichalskiandR.Chilausky.Learningbyb eingtoldandlearningfromexamples:

an exp erimental comparison of the two metho ds of knowledge acquisition in the

contextof developing an exp ert system for soyb ean diagnosis. Policy Analysisand

Information Systems,4(2):125{160,1980.

[13] P. O'Rorke. A comparative study of inductive learning systems AQ11P and ID3

using a chess end-game testproblem. ISG 82-2, Dept. of ComputerScience, Univ.

of Illinoisat Urbana-Champaign,Urbana,1982.

[14] A. Wald. Sequential Analysis. Wiley,New York, 1947.

[15] J. Kalb eish. Probabilityand Statistical Inference,volume2. Springer-Verlag,NY,

1979.

[16] A. Paterson and T.Niblett. ACLS Manual.Version1. Glasgow,1982.

[17] Philip K. Chan. A criticalreview of cn2: A p olytheticclassier system. Technical

Rep ort CS-88-09,Dept. ofComputerScience, Vanderbild Univ.,Tennessee, 1988.

[18] J.A. Jackson. Economicsofautomaticgenerationofrulesfrom examplesinaChess

end-game. UIUCDCS-F 85-932, Dept. of Computer Science, Univ. of Illinois at

Urbana-Champaign, Urbana,1985.

[19] Bojan Cestnik, Igor Kononenko, and Ivan Bratko. Assistant 86: A knowledge-

elicitationto olforsophisticatedusers.InI.BratkoandN.Lavrac,editors,Progressin

Machine Learning (proceedingsof the 2nd EuropeanWorking Sessionon Learning),

pages 31{45.Sigma,Wilmslow,UK,1987.

(29)

editor,IJCAI-87,pages 304{307,Ca,1987.Kaufmann.

[21] T.NiblettandI.Bratko.Learningdecisionrulesinnoisydomains.InM.A.Bramer,

editor, Research and Development in Expert Systems III, pages 25{34. Cambridge

UniversityPress, 1987.