In: Machine Learning Journal, 3 (4), pp261-283, Netherlands: Kluwer (1989) http://www.cs.utexas.edu/users/pclark/papers
The CN2 Induction Algorithm
PeterClark (p [email protected])
Tim Niblett([email protected])
The Turing Institute,
36 N. HanoverSt.,
Glasgow, G1 2AD, U.K.
Octob er 1988
Abstract
Systems for inducing concept descriptions from examples are valuableto ols for
assistinginthetaskofknowledgeacquisitionforexp ertsystems. Thispap erpresents
adescription andempiricalevaluationofanewinductionsystem,cn2,designed for
theecientinductionof simple,comprehensible pro ductionrulesindomainswhere
problemsofp o ordescriptionlanguageand/ornoisemayb eprese nt. Implementations
ofthecn2,id3andaqalgorithmsarecomparedonthreemedicalclassicationtasks.
Keywords: conceptlearning,ruleinduction,noise,comprehensibility,cn2.
In the task of constructing exp ert systems, systems for inducing concept descriptions
from examples have proved useful ineasing the b ottleneck of knowledge acquisition[1].
Twofamiliesofsystems,basedontheid3[2]and aq[3]algorithms,haveb eenesp ecially
successful. Thesebasicalgorithmsassumenonoiseinthedomain,searchingforaconcept
descriptionthat classiestrainingdata p erfectly. Howeverfortheapplicationofsystems
based on these algorithms to real-world domains, metho ds for handling noisy data are
required. In particular, mechanisms for avoiding the overtting of the induced concept
descriptiontothedataareneeded,requiringrelaxationoftheconstraintthattheinduced
description mustb eclassify thetrainingdatap erfectly.
Fortunatelytheid3algorithmlendsitselftoeasymo dicationallowingthisconstraint
tob erelaxed,bythenatureofitsgeneral-to-sp ecicsearch. Treepruningtechniques(e.g.
[4, 5]), as usedfor example in thesystems c4[6] and assistant [7], have proved tob e
eective metho ds of avoiding overtting. The aq algorithm, however, is less easy to
mo dify due to itsdep enden ce on sp ecic trainingexamples during its search. Existing
implementations(e.g. aq11 [8] and aq15 [9]) deal with noisy data by using pre- and
p ost-pro cessing techniques whileleavingthebasicaq algorithmintact. Ourobjective in
designing cn2 is to mo dify the aq algorithm itself insuch a way that this dep end ence
on sp ecicexamples is removed andthe spaceof rulessearchedis increased. Asa result
statisticaltechniques analogoustothose usedfortreepruning canthenb eappliedinthe
generationofif-then rules,anda simpler algorithmis achieved.
Wecanidentifyseveralrequirementsthatlearningsystemsshould meetiftheyareto
prove useful ina varietyofreal-worldsituations:
Accurate classication. The induced rules should b e able to classify new examples
accurately,even inthepresence ofnoise.
Simple rules. For the sake of comprehensibility, the induced rules should b e as short
as p ossible. However, when noise is present, rules that are overtted tend to b e
long. Thus, toinduce shortrules,one mustusuallyrelax therequirement thatthe
induced rules b econsistent withallthetraining data. The choice of howmuchto
relaxthisrequirement involvesa trade-ob etween accuracy and simplicity[10].
Ecient rule generation. Ifoneexp ectstouselargeexamplesets,itisimp ortantthat
thealgorithmscales up tocomplex situations. In practice, it is desirablethat the
timetakenforrule generationb e linearinthesizeof theexampleset.
evaluationof cn2, a new induction algorithm. It combines the eciency and abilityto
cop ewith noisydataofid3 withtheif-thenrule formand exible searchstrategyof the
aq family. The representation forrules outputbycn2is an orderedset of if-thenrules,
also known as a `decision list' [11]. cn2 uses a heuristic function to terminate search
during rule construction, based on an estimate of the noise present in the data. This
results in rules that do not necessarily classify all the training examples correctly, but
thatp erformwellonnew data.
In the following section we describ e cn2 and three other systems used for our com-
parative study. These include: aqr, the authors'reconstruction of Michalskiet al'saq
algorithm; Kononenko, Bratko and Roskar's (1984) assistant,a variant of id3; and a
simpleBayesianclassierwhichisusedtoprovideareference forthep erformanceof the
otheralgorithms. Ineachcaseweconsiderthetimecomplexityofthevariousalgorithms.
In section 4, we compare the p erformance of thealgorithms on three medicaltasks; we
alsocomparethep erformanceofassistantandcn2ontwosynthetictasks. Insection5
we discuss thesignicance of these results, and we followthiswith somesuggestionsfor
futureworkinsection 6.
2 CN2 and Algorithms for Comparative Study
cn2 and the other algorithms used in exp eriments are now presented. Because cn2
has b een develop ed from studyof b oth theid3 and aq algorithms, we rst present the
id3-based system assistant and the aq-based system aqr b efore presenting cn2 and
discussingitsrelationshipto thesealgorithms.
Wecharacterize thesystems alongthreedimensions. These are:
Therepresentationlanguage forthe induced knowledge;
Thep erformance engine forexecutingtherules;and
Thelearningalgorithmand itsasso ciatedsearchheuristics.
In all of our exp eriments, theexample description language consisted of attributes, at-
tributevaluesand user-sp eciedclasses. Thislanguagewasthesameforeachalgorithm.
2.1 Assistant
The assistant algorithm [7]is a descendant of Quinlan's id3 (1983), and incorp orates
atree pruning mechanismforhandlingnoisy data.
A b e aset ofattributesfordescribingexamples
TE(E)b e aterminationcriterion
ID M(a
i
;E) b ean evaluationfunctionwhere a
i 2A
Pro cedureassistant(E)returningTREE:
If: EsatisestheterminationcriterionTE(E)thenreturnaleafno de
forTREE,lab elled withthemostcommon classofexamples inE.
Else: determine the attribute a
best
2 A with the largest value of the
function ID M(a
best
;E). Then, for each value v
j
of attribute a
best ,
generate subtrees using assistant(E
j
) where E
j
are those examples
in E with value v
j
for attribute a
best
. Return a no de lab elled as a
teston attributea
best
with thesesubtrees attached.
Table 1: Thecore ofthe assistantalgorithm
2.1.1 Concept Description Language and Interpretation
assistantrepresentsacquired knowledgeintheformofdecisiontrees. Aninternalno de
of a tree sp ecies a test of an attribute, with each outgoing branch corresp onding to a
p ossible result of this test. Leaf no des represent the classicationto b e assigned to an
example.
To classify a new example, a path from the ro ot of the decision tree to a leaf no de
is traced. At each internal no de reached, the branch corresp onding to the value of the
attributetested atthat no de isfollowed. The classat theleaf no de represents theclass
predictionforthat example.
2.1.2 Learning Algorithm
assistant induces a decision tree by rep eatedly sp ecializing leaf no des of an initially
single-no ded tree. The sp ecialization op eration involves replacing a leaf no de with an
attributetest,andaddingnewleavestothatno decorresp ondingtothep ossibleresultsof
thattest. Heuristicsdeterminewhichattributetotestonandwhentostopsp ecialization.
Table 1summarizes thisalgorithm.
assistantusesanentropymeasuretoguidethegrowthofthedecisiontree,asdescrib ed
byQuinlan (1983) . This corresp onds to thefunctionIDM inTable 1. In addition, the
algorithmcan applya treecutometho dbased onan estimateofmaximalclassication
precision. Thistechniqueestimateswhetheradditionalbranchingwouldreduceclassica-
tionalaccuracy and ifso, terminatessearch (thereareno user-changeable parametersin
thiscalculation). Thiscutocriterion corresp onds tothefunctionTE intheTable1. If
assistantistogeneratean`unpruned' tree,theterminationcriterionTE(E)issatised
ifalltheexamplesE have thesameclassvalue.
2.2 AQR
aqr is an induction system that uses the basic aq algorithm [3] to generate a set of
classication rules. Many systems use this algorithm in a more sophisticated manner
thanaqr toimprovepredictive accuracy and rule simplicity(e.g., aq11 [8]uses a more
complex metho d of rule interpretation that involvesdegrees of conrmation). aqr is a
reconstructionof a straightforward aq-based system.
2.2.1 Concept Description Language and Interpretation
aqrinducesasetofdecisionrules,oneforeachclass. Eachruleisoftheform`if<cover>
thenpredict <class>',where<cover> is a b o oleancombination of attributetestsas we
nowdescrib e.
The basic test on an attribute is called a selector. The following are examples of
selectors:
hCloudy=yesi
hWeather=wet^stormyi
hTemp>60i
aqr allows tests in the set f=;;>;6=g. A conjunct of selectors is called a complex,
and a disjunct of complexes is called a cover. We say that an expression (a selector,
complex, or cover) covers an example if the expression is true of the example. Thus,
theemptycomplex (conjunctof zeroattribute tests)covers allexamplesand the empty
cover (disjunct of zero complexes)covers no examples. A cover is stored alongwith an
asso ciated class value, representing the most common class of those training examples
which itcovers.
conditionssatisedbythe example. Iftheexamplesatises onlyone rule,then theclass
predicted bythatrule isassignedtotheexample. If theexamplesatisesmorethanone
rule, the most common class of training examples that were covered by those rules is
predicted. Iftheexampleisnotcoveredbyanyrule,thenit isassignedbydefaulttothe
classthat o ccurred mostfrequentlyinthetrainingexamples.
2.2.2 The Learning Algorithm
The aq rule-generation algorithm has b een describ ed elsewhere (e.g. [8, 12, 13]), and
the aqr system is an instance of thisgeneral algorithm. aqr generates a decision rule
for each class in turn. Having chosen a class on which to fo cus, it forms a disjunct of
complexes (the cover) to serve as the conditionof the rule forthat class. This pro cess
o ccurs instages; each stagegenerates a single complex, and then removes theexamples
itcoversfromthetrainingset. Thisstep isrep eateduntilenoughcomplexesarefoundto
coverallthe examples of thechosen class. Thiswhole pro cess is rep eated foreach class
inturn. Table 2summarizes theaqralgorithm.
2.2.3 Heuristic Functions
The particularheuristic functions used by the aq algorithmare implementationdep en-
dent. Theheuristicusedbyaqr tocho osetheb estcomplexis\maximizethenumb erof
p ositiveexamplescovered". Theheuristicusedtotrimthepartialstarduring generation
of a complexis \maximize thesum of p ositive examples covered and negativeexamples
excluded". In the case of a tie for either heuristic, the system prefers complexes with
fewerselectors. Seeds arechosenatrandomandnegativeexamples arechosenaccording
totheirdistancefrom theseed(nearestonesare pickedrst,where distanceisthe num-
b er of attributeswithdierent valuesin theseed and negative example). In thecase of
contradictions(i.e. iftheseed andnegativeexamplehave identicalattributevalues) the
negative example is ignored and a dierent one is chosen, since the complex cannot b e
sp ecializedtoexclude itbut stillinclude theseed.
2.3 A Bayesian Classier
To establish a reference p oint, we also implemented a simple Bayesian classier and
compareditsb ehaviortothatof theother algorithms.
Pro cedure AQR(POS;NEG) returning COVER:
let COVER b etheemptycover;
whileCOVER do esnot coverallp ositiveexamples inPOS
selecta SEED,i.e. ap ositive examplenotcoveredbyCOVER;
callpro cedure STAR(SEED;NEG) togenerate theSTAR(aset)of
complexes thatcoverSEEDbut no examplesinNEG;
selecttheb est complexBESTfromthestaraccording touser-dened criteria;
add BESTasan extra disjuncttoCOVER;
returnCOVER.
Pro cedure STAR(SEED;NEG) returning STAR:
let STARb etheset containing theemptycomplex;
whileone ormorecomplexes inSTARcoverssome negativeexamples inNEG,
selecta negativeexampleE
neg
coveredbya complexinSTAR;
Sp ecializecomplexes inSTARtoexclude E
neg by:
letEXTENSIONb eallselectors thatcoverSEEDbut notE
neg
;
letSTARb e thesetfx^yjx2STAR;y2EXTENSIONg;
remove allcomplexesinSTARsubsumed byothercomplexes inSTAR;
Remove the worstcomplexes fromSTAR
until sizeofSTARisless thanorequaltouser-dened maximum(maxstar).
returnSTAR.
Thisclassierrepresentsits`decisionrule'asamatrixofprobabilitiesp(v
j jC
k
)sp ecifying
theprobabilityof o ccurrenceof each attributevalue given each class. To classify a new
example, one appliesBayes'theorem
p(C
i j
^
v
j )=
p(
V
v
j jC
i )p(C
i )
P
k p(
V
v
j jC
k )p(C
k )
where the summation is over the n classes and p(C
i j
V
v
j
) denotes the probability that
the example is of class C
i
given v
j (
V
is the symb ol for conjunction, V
v
j
denoting a
conjunctof attributevaluesallo ccurringinan example). Onecalculates thisprobability
foreveryclass,andthenselectstheclasswiththehighestprobability. Thetermp(C
k )is
estimated from the distributionof thetraining examples amongclasses. Ifone assumes
indep enden ce of attributes,p(
V
v
j jC
k
) can b ecalculatedusing
p(
^
v
j jC
k )=
Y
j p(v
j jC
k )
and the values p(v
j jC
k
) from the probabilitymatrix. Note that, unlike the other algo-
rithmswe have discussed, our implementation of the Bayesianclassier requires one to
examinethevaluesofallattributeswhenmakinga prediction.
We should note that there also exist more sophisticated applications of the Bayes
rule in which the attribute testsare ordered[14]. Such a sequential technique adds the
contribution of each testto a total; when this score exceeds a threshold, the algorithm
exitswithaclassprediction. Suchaninterpretationmayb emorecomprehensibletoauser
thantheapproach we have used,aswellaslimitingthetestsrequiredforclassication.
2.3.2 The Learning Algorithm
TheBayesianlearningmetho dconstructsthematrixp(v
j jC
k
)fromthetrainingexamples
by examining thefrequency of valuesineach class. One can compute thismatrixeither
incrementally,incorp oratingone instanceat atime,or non incrementally,using alldata
attheoutset.
2.3.3 Heuristics
Sometimes a value of zero is calculated from the training data for some elements of
the p(v
j jC
k
) matrix. Like all elements of the matrix, this numb er is subject to error
due to the nite training data available. However as as classication of new examples
involvesmultiplyingelements together,a zero element can havedrasticeect, nullifying
zero elementsinthematrixwould,givenmoredata,convergeon asmall,non-zero value
and hence replace the zeros with some appropriate estimate. In our implementation a
valueofp(C
i
)2(1=N)wasused, whereN isthenumb eroftrainingexamples. Thefactor
1=N representstheincreasingcertaintythatthiselementmusthaveanalmost-zerovalue
withincreasing sizeoftrainingdata.
2.4 The Default Rule
Finally, an `algorithm' which simply assigns the most commonly o ccurring class to all
newexamples, withno referencetotheir attributesatall,wasusedforcomparison with
theotheralgorithms. Interestingly,thissimplepro cedurewasofcomparablep erformance
totheother algorithmsinone domaintested (describ edlater),and thus proveda useful
extraalgorithmtoconsider.
2.5 The CN2 Algorithm
Having presented assistant, aqr, and the other algorithms used in the exp eriments,
we now present cn2, rst describing how the general algorithm arises naturally from
considerationof theid3 andaq algorithmsand thendescribingitsdetails.
id3 iseasilyadaptable tohandle noisydatabyvirtueof itstop-downtree generation
approach. Duringinduction, all p ossibleattributetests areconsidered when`growing'a
leaf no de in the tree, and entropy is used to select the b est one to place at that no de.
Overttingof decision trees can thus b e avoided by halting tree growth when no more
signicant information can b e gained by further growth. We wish to apply a similar
metho dtothepro duction ofif-then rulesratherthandecisiontrees.
The aq algorithm, when generating a complex, also p erforms a general-to-sp ecic
searchfortheb estcomplex. However,onlysp ecializationswhichexcludesomeparticular
covered negative example from the complex while ensuring some particular `seed' p os-
itive example remains covered are considered, iterating until all negative examples are
excluded. As a result, aq searches only the space of complexes completely consistent
withthe trainingdata. Theaq algorithmemploysa b eam search,which can b e viewed
asseveralhill-climbingsearchesinparallel.
Forthecn2algorithm,wehaveretainedtheb eamsearchmetho doftheaqalgorithm
but rstly removed itsdep ende nce on sp ecic examples during search and secondly ex-
tended its search space toinclude rules which do notp erform p erfectly on the training
sp ecializations of a complex, similar to the way id3 considers all attribute tests when
growingano deinthetree. Indeed,withab eamwidthofone thecn2algorithmb ehaves
equivalentlytoid3growingasingle treebranch. Thus byp erforminga top-downsearch
forcomplexes,itisnowp ossibletoapplyacutometho dsimilartodecisiontreepruning
tohalt sp ecializationwhenno furtherstatisticallysignicantsp ecializationsarep ossible.
Finally, we note that the rules cn2 pro duces are an ordered list of if-then rules,
rather than an unordered set of if-then rules as pro duced by aq-based systems. Both
representationshavetheirresp ectiveadvantagesanddisadvantagesforcomprehensibility
{orderindep endentrulesrequiresomeadditionalmechanismtob eprovidedtoresolveany
rulecon icts whichmayo ccur(thusdetracting fromastrictlogicalinterpretationof the
rules),whileorderedrulesalsosacriceadegreeincomprehensibilityastheinterpretation
of a single rule is dep endent on which other rules preceded it in the list. It is p ossible
to make cn2 pro duce unordered if-then rules byappropriately changing the evaluation
function 1
.
2.5.1 Concept Description Language and Interpretation
Rules in the ordered list which cn2 induces are each of the form `if <complex> then
predict<class>',where<complex>hasthesamedenitionasforaqr,namelyaconjunct
ofattributetests. Thisorderedrule representationisaversionofwhatRivest(1987)has
termed decisionlists. The lastrule incn2'slist is a`default rule'whichsimply predicts
themostcommonly o ccurringclass inthetrainingdataforallnewexamples.
To use the induced rules to classify new examples, cn2 applies an interpretation in
which each rule is triedinorder untilone is found whoseconditions aresatised by the
exampleb eingclassied. Theresultingclasspredictionofthisruleisthenassignedasthe
class of that example. Thus, the ordering of the rules is imp ortant. If no induced rules
aresatised,the naldefaultrule assignsthemostcommonclass tothenew example.
2.5.2 Learning Algorithm
Table 3 presents a summary of the cn2 algorithm. This works in an iterative fashion,
each iterationsearching for a complex covering a large numb er of examples of a single
class C and few of other classes. The complex must b e b oth predictive and reliable,as
1
E.g. replaceentropy withE=(+veexs.coveredminus-veexs. covered),thengeneraterulesforeach
classinturn.
it covers are removed from the trainingset and the rule `if <complex> then predict C'
is added to the end of the rule list. This pro cess iterates until no more satisfactory
complexes can b efound.
Thesystemsearchesforcomplexesbycarryingoutaprunedgeneral-to-sp ecicsearch.
At each stage in the search, cn2 retains a size-limited set or star S of `b est complexes
found sofar'. The systemexamines onlysp ecializationsofthisset, carryingouta b eam
search of the space of complexes. A complex is sp ecialized by either adding a new con-
junctivetermorremovinga disjunctiveelement inone ofitsselector. Eachcomplexcan
b e sp ecializedin severalways,and cn2 generatesand evaluates allsuch sp ecializations.
Thestaristrimmedaftercompletionofthisstepbyremovingitslowestrankingelements
asmeasured byan evaluationfunctionthatwewilldescrib e shortly.
Our implementationofthesp ecializationstepistorep eatedlyintersect 2
thesetofall
p ossibleselectorswiththecurrentstar,eliminatingallthenullandunchangedelementsin
theresultingsetofcomplexes. (Anullcomplexisonethatcontainsapairofincompatible
selectors, e.g. big=y^big=n). cn2 deals with continuous attributes in a manner
similar to assistant - by dividing the range of values of each attribute into discrete
sub-ranges. Testsonsuchattributesexaminewhether avalueisgreaterorless(orequal)
thanthe values atsub-range b oundaries. Thecomplete range of valuesand sizeof each
sub-rangeis providedbytheuser.
Fordealing with unknown attribute values, cn2 usethe simple metho dof replacing
unknown valueswiththemostcommonlyo ccurringvalue(ormidvalue ofthemostcom-
monly o ccurring sub-range, in the case of numeric attributes) forthat attribute in the
trainingdata.
2.5.3 Heuristics
Thecn2algorithmmustmaketwoheuristicdecisionsduringthelearningpro cess,andit
employstwoevaluationfunctionstoaidinthesedecisions. Firstitmustassessthequality
of complexes, determining if a new complex should replace the `b est complex' found so
farand alsowhich complexes in thestar S todiscard ifthe maximum sizeis exceeded.
Computing this involves rst nding the set E 0
of examples which a complex covers
(i.e.,whichsatisfyallofits selectors)andthe probabilitydistributionP =(p
1
;:::p
n ) of
2
TheintersectionofsetAwithsetBisthesetfx^yjx2A;y2Bg. Forexample,using`:'toabbreviate
`^', fa:b;a:c;b:dg intersected with fa;b;c;dg is fa:b;a:b:c;a:b:d;a:c;a:c:d;b:d;b:c:dg. If we now remove
unchangedelementsinthissetweobtainfa:b:c;a:b:d;a:c:d;b:c:dg.
Let: Eb e aset of trainingexamples;
Pro cedure CN2(E)returning RULELIST:
letRULE LISTb etheemptylist;
rep eat
let BESTCPX b eFindBest Complex(E);
if BESTCPX isnotnil then
LetE 0
b etheexamples coveredbyBEST CPX;
Remove fromEthe examplesE 0
coveredbyBESTCPX;
LetCb e themostcommonclass ofexamples inE 0
;
Add therule `ifBESTCPXthen class=C'totheend of RULELIST,
untilBEST CPXisnil or Eisempty.
returnRULE LIST.
Pro cedure FindBest Complex(E)returning BEST CPX:
lettheset STARcontainonlytheempty complex;
letBEST CPXb enil;
letSELECTORSb etheset of allp ossible selectors;
whileSTARisnotempty,
sp ecialize allcomplexes inSTARasfollows:
let NEWSTARb e thesetfx^yjx2STAR;y2SELECTORSg;
Remove allcomplexesinNEWSTAR that areeitherinSTAR(i.e.,the
unsp ecializedones) orare null (e.g. big=y^big=n)
for every complexC
i
inNEWSTAR:
ifC
i
isstatisticallysignicant when testedonE andb etter than
BESTCPXaccording touser-dened criteria whentested onE,
then replacethecurrent valueof BEST CPXby C
i
;
rep eat removeworstcomplexes from NEWSTAR
untilsizeof NEWSTAR isuser-dened maximum;
let STARb e NEWSTAR;
returnBEST CPX.
examples inE amongclasses(where n= numb er of classes represented inthe training
data). cn2thenuses theinformation-theoreticentropymeasure
Entr opy=0 X
i p
i log
2 (p
i )
toevaluate complex quality(the lowertheentropy theb etter the complex). Thisfunc-
tion thus prefers complexes covering a large numb er of examples of a single class and
few examplesof other classes, and hence such complexes scorewellon thetraining data
when used to predict the majority class covered. It should b e noted that this func-
tion was used in preference to a simple `p ercentage correct' measure (e.g. by taking
max(P)), most imp ortantly b ecause entropy will distinguish probability distributions
such as P = (0:7;0:1;0:1;0:1) and P = (0:7;0:3;0;0) in favor of the latter where as
max(P)willnot. This is desirable,sincethere existmore waysof sp ecializingthelatter
toacomplexidentifyingonlyoneclass;iftheexamplesofthemajorityclassareexcluded
by sp ecialization, the distributions b ecome P = (0;0:33;0:33;0:33)and P = (0;1;0;0)
resp ectively. In addition, the entropy measure tends to direct the search in the direc-
tion of moresignicant rules; empirically, rules of high entropy also tend to have high
signicance.
The second evaluation function tests whether a complex is signicant. By this we
refer to a complex that lo cates a regularity unlikely to have o ccurred by chance, and
thus re ectsa genuinecorrelationb etween attribute valuesand classes. Toassess signif-
icance,cn2 comparestheobserved distributionamongclasses ofexamples satisfyingthe
complex with the expected distribution that would result if thecomplex selected exam-
plesrandomly. Somedierencesinthese distributionswillresultfrom randomvariation.
The issue is whether the observed dierences are to o great to b e accounted for purely
by chance. If so, cn2 assumes that the complex re ects a genuine correlation b etween
attributesand classes.
To test signicance,the system uses the likeliho o dratio statistic [15]. This is given
by
2 n
X
i=1 f
i log(f
i
=e
i );
wherethedistributionF =(f
1
;:::;f
n
)istheobservedfrequencydistributionofexamples
amongclasses satisfying a given complexand E =(e
1
;:::;e
n
) is the exp ected frequency
distribution of the same numb er of examples under the assumption that the complex
selectsexamples randomly. Thisis taken asthe N = P
f
i
covered examples distributed
amongclasses with the same probabilityas that of examples in the entire trainingset.
tance b etween the two distributions 3
. Under suitable assumptions, one can show that
this statistic is distributed approximately as 2
with n01 degrees of freedom. This
provides a measure of indicatessignicance | thelower thescore, the more likely that
theapparent regularityis due tochance.
Thus these two functions - entropy and signicance - serve to determine whether
complexes found during search are b oth `go o d' (i.e. high accuracy when predicting the
majorityclass covered) and `reliable'(high accuracy on trainingdata is not justdue to
chance)resp ectively. cn2usesthesefunctionstorep eatedlysearchforthe`b est'complex
whichalsopassessomeminimumthresholdofreliabilityuntilnomorereliablecomplexes
can b efound.
3 Time Complexity of the Algorithms
The assistant,aqr and cn2algorithmsare allsearchinga very largespace of concept
descriptions,andalluseheuristicstoguidethissearch. Furthermore,thecn2,assistant
and aqr algorithms attempt to pro duce structures that are b oth consistent with the
trainingexamplesandascompactasp ossible. Inthedesignofsuchalgorithms,thereisa
tradeob etweenexecutionsp eedandthesizeoftheinducedstructures. Ineachcase,the
exhaustive searchfora smallestset ofstructures, althoughdesirable,iscomputationally
infeasible.
Amajorapplicationofthesealgorithmsistoextractusefulinformationfromverylarge
databases,p erhaps with millionsof examples. Withthisin mind, itis worthexamining
thecomplexityofeachalgorithm. Tob epracticalforverylargeproblems,theirb ehavior
should b elinear,ornear-linear, inthenumb erof examplesand attributes.
Since theoverall complexityof each algorithmisdomain-dep endent, we insteadpro-
videupp er b oundsforthecriticalcomp onentsof thealgorithms. Wedonotconsider the
complexityof thecutopro cedure usedbyassistant.
In ourtreatment we usethefollowingvariablestodenote parameters:
e: Thesizeof theexampleset
a: Thenumb er ofattributes
s: Themaximumstarsize (forcn2and aqr)
3
WeassumethatF iscontinuous withresp ect toE,i.e.that thef
i
are zerowhenthee
i
arezero.
We alsoassume thateach attributeisbinary valued and thatthere are twoclasses .
3.1 Time Complexity of Assistant
Thecriticalcomp onent inassistantisthepro cessofselectinga testattributeonwhich
tobranch. Eachsuchchoice involvesthefollowingop erations:
1. For each attribute, example counts are put in an array, indexed by class and at-
tribute. ThistakesTime: o(e1a);
2. Theentropyfunctionis calculatedforeach attributetakingTime: o(a);
3. Oncetheb estattributeisfound,theexamples aredividedintotwosets;thistakes
timeTime: o(e) 5
.
Therefore, the overall time for a single attribute choice is o(a1e). The time taken to
construct the complete tree dep end s very much on the structure of the tree. It seems
reasonable touse therst gure onlyforcomparativepurp oses, as argued ab ove. Thus
the amount of time taken by assistant for the basicattribute selection op eration is a
linearfunctionofthenumb erofexamples,whenthenumb erofclassesandattributesare
heldconstant.
Weshould notethatextensionstothisalgorithmthatusereal-valuedattributes(such
asacls[16])mustsorttheexamplesbyattributevalue atthe rststage. Thisincreases
theoverall timeb oundtoo(a1eloge).
3.2 Time Complexityof CN2
Thebasicop erationincn2isthesp ecializationofthecomplexesinthecurrentstar. The
numb erofsingleselectorcomplexeswithoutdisjuncts is2a. The numb erofintermediate
complexesgeneratedis atmosta1s,and thetimetaken toevaluateanexampleagainsta
complexisb ounded byo(a). Threestepsare requiredforthissp ecializationop eration:
Multiplying each complexin the starbythe set of single selectorrules; this takes
Time: o(a1s);
4
Onemightalsoconsiderthe complexityasafunctionofthenumb er ofdistinctattributevaluesand
classes. Wehavenotdone thisinouranalysis.
5
Withappropriatedatastructures,itmayb ep ossibletomuchofthisworkintherststage,butthis
do esnotaectthecomplexityclass. Similarly,onecanincludeanyterminationtestthatislinearinthe
numb erofexamples.
Sortingthecomplexesbyvalueandthentrimthestar,whichtakesTime: o(a1slog(a1s)).
Therefore, the overall time for a single sp ecialization step is b ounded by o(a1s(e +
log(a1s))). As with assistant, the time required is a linear function of the numb er
ofexamples. Ifwerestrictthethesizeofthestartoone,thetimerequiredhasthesame
orderasforassistant. Ingeneral,exp erience indicatesthatthetimeconstantsinvolved
aresomewhat lessforassistantand other variants on id3thanforcn2.
3.3 Time Complexity of AQR
In aqr,the basic op erationis thesp ecialization of complexes ina star. This op eration
issimilarto thatof cn2, except thatonlysp ecializationscausing a negative exampleto
b euncoveredbycomplexes inaqr's stararegenerated. We showthecomplexityof this
op erationis thesameasthatof cn2.
Foreachnegativeexample, thefollowingstepsarep erformed:
Anegativeexampleisfoundbyiteratingthroughthenegativeset. Weassumethat
thenumb er of negative examples is notless than somexed fraction of the entire
exampleset. ThistakesTimeo(e1s);
Theset ofselectorsthat distinguishthenegative examplefrom theseedarefound;
ThistakesTimeo(a);
Each complex in the star is sp ecialized by intersection with this set of selectors,
takingTimeo(a1s);
Theresulting complexesare evaluated,whichtakesTimeo(a1s1e);
Thecomplexes aresorted andthe startrimmed,takingTimeo(a1slog(a1s)).
Thus, for each negative example the time is b ounded by o(a1s(e+log(a1s))). This is
the same gure as obtained for cn2. Observe that the numb er of iterations of this
pro cess (makingthestardisjointfroma negativeexample)is b oundedbythenumb er of
attributes,notby thenumb erof examples.
In practice, although the order of time taken by the algorithms for this particular
op eration of pro ducing a new star is the same, cn2 is faster overallthan aqr. This is
b ecause the numb er of iterations of this op eration is lower in cn2 than in aqr, since
cn2 may halt sp ecialization of a complex b efore it p erforms p erfectly on the training
arecoveredifno furtherstatisticallysignicant rulescan b efound.
3.4 Time Complexity of the Bayesian Classier
ThetimecomplexityoftheBayes'classierforgeneratingaprobabilitymatrixiso(a1e),
whereaisthenumb erofattributesandethenumb erofexamples. Thislearningalgorithm
was substantially faster than the other algorithmsb ecause the run time is indep end ent
of thedecision`rule'generated. In additionthisbasic op erationis p erformedonlyonce,
unliketheab ove algorithms wherethebasicop erationis rep eatedlyapplied.
3.5 Summary and Actual Run Times
We haveshownthat thetimecomplexityof thebasiclearningstep forallthealgorithms
tested is linear in the numb er of examples (o(a:e) for assistant,o(a:e:s) for aqr and
cn2). Thisisan essentialrequirement foranyalgorithmthat mustworkwithvery large
datasets.
We brie y consider the time complexity of the entire induction pro cess, requiring
iterationofthebasiclearningstepsanalysedab ove. Withidealnoise-tolerantalgorithms,
givenacertainminimumnumb erofexamples,conceptdescriptionsrepresentingonlythe
genuineregularitiesinthedatashouldb einduced; additionalexamplesshouldnotcause
theconceptdescriptiontogrowfurtherandb ecomeovertted,henceinthisidealcasethe
ab ovegures also represent thetimecomplexityof theoverall learningtask. When this
idealis not met (e.g. when a concept description classifying the trainingdata p erfectly
is sought for), cn2 would at worst induce e rules of length a giving an overall time
complexityof o(a 2
:e 2
:s)[17]. assistantsorts a total of eexamples among aattributes
foreachlevelofthetree,givinganoveralltimecomplexityofo(a 2
:e)asthetreedepthis
b ounded bya. A similarworst-casetime complexitytocn2 holds foraqr.
Weprovidetheactualrun timesforinterest,while notingthat itis diculttomake
fair comparisons of sp eed due to dierences in implementation language and metho d.
All run times are for inducing a classication pro cedure in the lymphography domain
(section4.2.1)using afour-megabyte Sun 3/75. assistant,implementedinab out 5000
lines of Pascal, to ok one minute run-time. cn2 and aqr, each implemented in ab out
400 lines of Prolog and with a value of fteen for maxstar, to ok 15 and 170 minutes
run-timeresp ectively. The Bayesianclassier, implemented in150 lines of Prolog,to ok
afewsecondstocalculateitsprobabilitymatrix. Whileitisdiculttodrawconclusions
fastest,followedbyassistant,cn2andaqr)isafairre ectionoftherelativecomputa-
tion involved inusing the algorithms. Moredetailedempirical comparisonsof time and
memory requirements of id3 and the aq-based system aq11p have b een conducted by
O'Rorke (1982)and Jackson (1985)inthedomainof chessend-games.
4 Experiments with the Algorithms
4.1 Dependent Measures
Inadditiontocomputationalcomplexity,weevaluatetheb ehaviorofthesesystemsbytwo
othercriteria,classicationalaccuracyandsyntacticcomplexityoftheacquiredstructure.
Thistwofoldevaluationis motivatedbyconsidering thesesystemsasto olsforknowledge
acquisition for exp ert systems. A useful system should induce rules that are accurate -
sothat theyp erform well- aswellascomprehensible - so thatthey can b e validatedby
an exp ert andused forexplanation.
Wemeasureeachalgorithm'sclassicationaccuracybysplittingthedataintoatrain-
ingsetand atestset,presentingthe algorithmwiththetrainingset toinducea concept
description and thenmeasuringthe p ercentage of correctpredictions made by thatcon-
cept description on the test set. Quinlan (1983,1987) and others have taken a similar
approachtomeasuring accuracy.
Cross-algorithm comparisons of the complexity of concept descriptions are dicult
duetothedierencesinrepresentationandthedegree ofsubjectivityinvolvedinjudging
complexity. Thus, we will onlycompare the gross features of the knowledge structures
inducedbythedierentalgorithms. Forassistant'sdecisiontrees,wemeasurecomplex-
itybythenumb erofno des (including leaves)inthetree. Forcn2andaqr,wemeasure
complexitybythenumb erofselectorsinthenalrulelistandrulesetresp ectively. These
measuresrevealthegrossfeatures oftheinduced decisionrules. More detailedmeasures
of rulecomplexityhave b eenmade byO'Rorke (1982)but arenotused here. We assign
acomplexityofone tothedefaultrule basedon itsequivalencetoadecisiontree witha
singleno de.
Assessing the complexity of a Bayesianrule is more dicult. One could count the
numb er of elements in the p(V
j jC
k
) matrix. Thus, for a domain with n classes and a
attributes,eachwithan averageof v p ossiblevalues, thecomplexitywouldb ea2v2n.
However,suchameasure isindep endent ofthetrainingexamples,anditignoresfeatures
large and the rest small). Still, lacking any b etter measure, we provide the size of the
matrixasa roughguide.
4.2 Experiments on Natural Domains
The ab ove algorithms were tested on three sets of medical data,whichwe willdescrib e
shortly. These data were obtained from the Institute of Oncology at the University
Medical Center in Ljubljana,Yugoslavia[7]. In each test, 70%of thetrainingexamples
wereselectedatrandomfromtheentiredataset,andtheremaining30%ofthedatawere
usedfortesting. Thealgorithmswereallrunonthesametrainingdataandtheirinduced
knowledgestructurestestedusingthesametestdata. Fivesuchtestswerep erformedfor
each of the three domains,and the resultswere averaged. Thesedata are thus identical
to those used totest aq15 in [9], though the particularrandom 70% and 30% samples
aredierent.
Both cn2and aqrweregiven a value of15formaxstar inallruns.
4.2.1 Three Medical Domains
Table 4 summarizesthe characteristics of thethree medicaldomains usedin theexp eri-
ments. Therstoftheseinvolvedlymphography. Forpatientswithsusp ectedcancer,itis
imp ortantforphysicians todistinguishb etween diagnosessuch asmetastases,malignant
lymphoma or simply normal ndings; patient data relating to this task were collected
from Ljubljana's Oncology Institute. These data were consistent, i.e. examples of any
two classeswere always dierent. Allthe tested algorithmspro duced fairlysimple and
accurate rules. Unlike the other two domains, thisdata set wasnot submitted toa de-
tailedcheckingafteritsoriginalcompilationbytheMedicalCenter,andthusmaycontain
errorsinattributevalues.
The second domaininvolvedpredictingwhether patientswho haveundergone breast
cancer op erations will exp erience recurrence of the illness within ve years of the op-
eration. The recurrence rate is ab out 30%, and hence such prognosis is imp ortant for
determining p ost-op erational treatment. These data were veried after collection, and
thus arelikelytob erelativelyfreeof errors.
Thenalmedicaldomainfo cussedonpredictingthelo cationofprimarytumor. Physi-
ciansdistinguish b etween 22p ossible lo cations, predicted from datasuch asage, hysto-
logictyp e of carcinoma,and p ossible lo cationsof detected metastases,and is againim-
Domain Domain
Prop erty Lymphography Breast Cancer PrimaryTumor
No. ofattributes 18 9 17
Min-maxno. of vals/att 2-8 2-5 2-3
Average 3.3 2.8 2.2
valuesp er attribute
Numb er ofclasses 4 2 22
Totalno. ofexamples 148 286 339
Distributionof examples 2,81, 61,4 85,201 84, 20,9,14,39, 1,14,
amongclasses 6,0,2,28,16, 7,24,
2,1,10,29, 6,2,1,24
p ortantindeterminingtreatmentofpatients. Thesedatawereinconsistent,i.e. examples
of dierent classes existedwith identical attributevalues. Thesedata wasveriedafter
collecting,andthusarelikelytob erelativelyerror-free. Thesetofattributesisrelatively
incomplete,i.e. not sucient toinduce highqualityrules.
4.2.2 Results with Natural Domains
Table 5 presentsthe resultsforeach algorithmon eachdomain,averaged overve runs.
Ineachcase,wepresenttheaverageaccuracyonthetestdataandtheaveragecomplexity
of the resulting knowledge structures. cn2 was tested using three values of signicance
threshold and assistant was run with and without pruning. Only one version of the
other systemswererun.
The table contains some interesting regularities. Most imp ortant is that the algo-
rithms designed to reduce problems caused by noisy data achieve a lower complexity
withoutdamaging theirpredictive accuracy. Forexamplein thelymphographydomain,
the version of cn2 with the highest threshold achieved the sameclassication accuracy
as the other algorithms by inducing (on average) only eight rules each containing 1.6
selectors. The treepruning versionofassistantpro ducedsimilarresults.
Both systems apply a similar technique to reducing complexity, namely sometimes
halting sp ecialization of concept descriptions b efore they classify thetraining examples
p erfectly. Asa result,assistantand cn2avoidoverttingtheirdecisiontreesandrules
three naturaldomains. (ComplexityfortheBayesclassier isthesizeof theprobability
matrix).
Lymphography BreastCancer PrimaryTumor
Algorithm Domain Domain Domain
Accuracy Complexity Accuracy Complexity Accuracy Complexity
Default rule 56% 1 71% 1 26% 1
assistant:
(nopruning) 79% 41 62% 112 40% 178
(pruning) 78% 36 68% 44 42% 52
Bayes 83% (240)y 65% (540)y 39% (465)y
aqr 76% 76 72% 208 35% 562
cn2:
(90%threshold) 78% 24 70% 28 37% 33
(95%threshold) 81% 22 70% 20 36% 42
(99%threshold) 82% 12 71% 4 36% 19
ySeediscussioninsection4.1ab outdicultiesinmeasuringthecomplexityofBayesianclassier
Table 6: Accuracy of the dierent algorithms on training and testdata. The rep orted
versionofassistantincorp oratedpruningand theversionofcn2useda99%threshold.
Accuracy of decisionpro cedureson trainingandtestdata:
Algorithm Lymphography Breast Cancer PrimaryTumor
Train Test Train Test Train Test
Default rule 54% 56% 70% 71% 23% 26%
assistant 98% 78% 85% 68% 53% 42%
Bayes 89% 83% 70% 65% 48% 39%
aqr 100% 76% 100% 72% 75% 35%
cn2 91% 82% 72% 71% 37% 36%
set until it achieves as nearly complete consistency with the training data as p ossible,
resulting in an overtted rule set. Table 6 illustratesthis eect bycomparing accuracy
on thetrainingand testdata.
TheresultsalsoshowthattheBayesianclassierdo eswell,p erformingcomparablyto
the moresophisticated algorithms in allthree domains and givingthe highest accuracy
in the lymphography domain. Table 6 shows that this metho d regularly overts the
trainingdata,butthatitsp erformanceinthetestsetisstillgo o d. Evenmoresurprising
is the b ehavior of the frequency-based default rule, which outp erforms assistant and
theBayesmetho d on thebreastcancer domain. Thissuggeststhat inthebreast cancer
domain there are virtuallyno signicant correlations b etween attributes and classes in
the data. This is re ected by cn2's inability to nd signicant rules in this domain at
99% threshold, suggesting that, in this domain at least, the signicance test has b een
eectiveinlteringoutrules representing chanceregularities.
In general, the dierences in p erformance seem to b e due less to the learning algo-
rithms than to the nature of the domains; for example the b est classication accuracy
forlymphographywasbarely half ashigh as that forprimary tumor. This suggests the
need foradditionalstudies toexaminetheroleof domainregularityonlearning.
4.3 Experiments on Articial Domains
Tob etterunderstandtheeectsofovertting,weexp erimentedwithcn2andassistant
ontwoarticialdomainsthatletuscontroltheamountofnoiseinthedata. Bothdomains
containedtwelveattributesand200exampleswhichwereevenlydistributedb etweentwo
classes. Theydiered onlyinthenumb erofvalueseach attributecouldtake(twoin the
rst domainand eight inthesecond).
Inb othcases,thetargetconcept foroneclasscouldb estatedasasimpleconjunctive
ruleoftheform`if(a=v
1
)^111^(d=v
1
)thenclassX'.Bothalgorithmscanrepresent
such a regularitycompactly. The second class wassimply thenegationof the rst. 50%
ofthe datawasused fortraining,50%for testing,and resultsaveraged overve trials.
Foreach domain, we varied the amount of noise in thetraining data and measured
the eect on complexityand accuracy on the test data. Table 7 rep orts the resultsfor
therst articialdomainwithtwovaluesp erattribute,and Table 8forthe secondwith
eightvaluesp erattribute.
The p ercentage of noiseadded indicatesthe prop ortion of attributeand class values
%of CN2 RULE (99%threshold) ASSISTANT RULE
Noise Non-defaulty unpruned pruned
Added Accuracy Accuracy Size Accuracy Size Accuracy Size
0% 95% 100% 3 99% 8 99% 8
2% 88% 99% 5 96% 16 98% 11
5% 88% 95% 10 91% 32 95% 16
10% 82% 95% 15 86% 45 91% 24
20% 73% 86% 20 76% 60 84% 27
40% 67% 76% 25 65% 74 76% 23
60% 56% 64% 26 62% 75 67% 23
100% 45% 49% 28 46% 85 43% 12
ythis refers to the accuracyof those cn2 rulesfound bysearch, i.e. excludingthe extra default
rule(`everythingisclassX') attheendoftherulelist. See discussioninSection 4.3.1.
inthetrainingexamplesthathaveb een randomized,whereattributesandclasseschosen
forrandomizationhaveequalchanceoftakinganyofthep ossiblevaluesforthatattribute
orclass. Notethat nonoisewasintro duced tothetestdata.
4.3.1 Results with Articial Domains
In exp erimenting with articialdomains, we are able toexamine severalfeatures of the
algorithmsrelatingtotheirabilitytohandlenoise. Firstlyweareinterestedinthedegra-
dation of accuracy and simplicityof concept descriptions as noise levels are increased.
Secondly, for cn2, it is also interesting to examine how the accuracy and simplicityof
individualrules(aswellasthatoftherulesetasawhole)isaectedbynoiseinthedata.
The results reveal some surprising features ab out b oth cn2 and assistant. Com-
paring classicational accuracy alone, assistant p erformed b etter than cn2 in these
particular domains. However, comparing complexity of concept description, cn2 pro-
duced simpler concept descriptions than assistantexcept athigh levels of noisein the
rst domain.
Ideally, as the level of noise approaches 100%, b oth algorithms should fail to nd
any signicant regularities in the data and thus converge on a concept description of
complexityone (forcn2'sdefaultrule alone orasingle no dedecisiontree). Howeverfor
%of CN2 RULE (99%threshold) ASSISTANT RULE
Noise Non-defaulty unpruned pruned
Added Accuracy Accuracy Size Accuracy Size Accuracy Size
0% 93% 98% 8 99% 6 99% 6
2% 83% 99% 10 97% 12 97% 12
5% 86% 94% 13 96% 15 96% 15
10% 80% 98% 10 93% 22 93% 22
20% 73% 88% 15 85% 27 85% 27
40% 68% 82% 5 75% 33 75% 33
60% 63% 90% 4 66% 40 66% 40
100% 50% 58% 1 55% 43 55% 43
ythis refers to the accuracyof those cn2 rulesfound bysearch, i.e. excludingthe extra default
rule(`everythingisclassX') attheendoftherulelist. See discussioninSection 4.3.1.
cn2, this o ccurred only in the second of the two domains tested and did not o ccur in
either domain for assistant. Indeed, in the second domain assistant's tree pruning
mechanismdidnotprune thetree atall. cn2'snding ofrules intherst domain,even
at100% noiselevel,can b e understo o d asdue toa combination of the large numb er of
rulessearched (e.g. there are12211210=1320rulesof lengththree inthespace)and
the high coverage of these rules (each length three rule covers on average 100=2 3
= 12
examples). Enough rules are searched so that, even with 99% signicance test, some
chancecoverage ofthe12(average)exampleswillapp earsignicant. Thisdid noto ccur
in the second domain as the coverage of rules wasconsiderably less; each length three
rulecoversonaverage100=8 3
0:5examples,to ofewforthesignicancetesttosucceed.
Thesecomparativeresultsandb ehaviorasnoiselevelapproaches100%suggeststhat
thethresholdingmetho ds usedinb oth cn2and assistantneed tob emoresensitiveto
theprop ertiesoftheapplicationdomain. Researchonimprovementstocn2'ssignicance
test[17]and assistant's pruningmechanism [19]iscurrentlyb eingconducted.
Finally,weconsidertheaccuracyofcn2'sindividualrulesaswellasthatofthewhole
rule set. The gures presented in Tables 7 and 8 for `non-default accuracy', referring
totheaccuracy of cn2's rulesexcluding caseswhere thedefaultrule res, indicatethat
the rule list consists of high accuracy rules plus a low accuracy (50% in this domain)
helping an exp ert articulate his or her knowledge, as each individual rule found (apart
from thedefault rule) represents a strongregularityin the training data. The decision
tree equivalent would b etoexaminethe individualbranches generated,and theiruse in
assistingan exp ert. Quinlan(1987) hasrecentlyconducted work inthisarea.
5 Discussion
The results on the natural domains indicate that dierent metho ds of halting the rule
sp ecializationpro cess,b esideshavingtheeectofreducingrulecomplexity,donotgreatly
aectpredictiveaccuracy. Thiseecthasb eenrep ortedinanumb erofpap ers([7,9,21])
Indeed it p erhaps mayb e the case that anytechniquewill have thiseect, providing a
certain maximumlevelof pruning isnot exceeded. If thisis thecase then an algorithm
should b epreferred ifitmostclosely estimatesthismaximumlevel.
The results in Table 6 suggest that the 99% threshold for the cn2 algorithmis ap-
propriateforthethreenaturaldomains;theaccuracy ontrainingdataisclosetothaton
testdata indicating that, inthese domains atleast, the algorithm is notoverttingthe
data. Additionally,highaccuracy is maintained,indicating that theconcept description
isnot underteither.
The results of the tests on the articial domains, in particularthe tests with 100%
noise, indicatethatthe current measure of signicance used bycn2could b e improved.
Asthenoiselevelreaches100%,thealgorithmshould ideallyalmostalwaysndnorules.
The fact that this only o ccurred in one of the two articial domains suggests that the
signicancemeasure should b emoresensitive toprop ertiesof thedomaininquestion.
In many waysthe comparisonswiththe aqr system areunfair, astheaq algorithm
was never intended to deal alone with noisydata. It wasincluded in these exp eriments
toexaminethe basicaq algorithm'ssensitivitytonoise. In practiceitis rarelyused on
itsown,andinstead enhanced bya numb erof pre- and p ost-rule-generationtechniques.
Exp eriments with the aq15 system [9] show that with p ost-pruning of the rules and a
probability-basedor` exible matching'rule applicationmetho d,one can achieveresults
similarto thoseof cn2and assistantinterms of accuracyand complexity.
The principal advantage of cn2 over aqr is that the algorithm supp orts a cuto
mechanism | it do es not restrict itssearch toonly those rulesthat are consistent with
thetrainingdata. cn2demonstratesthatonecan successfullycontrolthesearchthrough
thelargerspaceof inconsistentruleswiththeuse ofjudiciouslychosensearchheuristics.
achieved a simple metho d for generating noise tolerant if-then rules, allowing easy re-
pro ducibility and analysisof the algorithm. In addition, we feel that the requirements
of interactiveinduction, wheretheuser interactswith systemduring and afterrule gen-
eration, in particular the requirement for go o d explanation facilities, indicate that the
logicalruleinterpretationusedbycn2willhavepracticaladvantagesoverthemorecom-
plexprobabilisticruleinterpretationnecessaryforapplyingorder-indep endentrulessuch
asgenerated byaqr whererule con icts mayo ccur.
Another result of interest is the high p erformance of the Bayesian classier. Al-
though theindep endence assumptionof the classiermayb e unjustiedin thedomains
tested,itdidnotp erformsignicantlyworse interms ofaccuracy thanother algorithms
and it remains an op en question as to how sensitive Bayesian metho ds are to violated
indep enden ce assumptions. Although the probability matrices pro duced by the tested
classierare diculttocomprehend,theexp eriments suggestthat variantsof theBayes
classierpro ducingmorecomprehensibledecisionpro cedureswouldb eworthyof further
investigation.
6 Conclusion
Inthis pap er we havedemonstratedan induction algorithmthat combines theb estfea-
turesoftheid3andaqalgorithms,allowingtheapplicationofstatisticalmetho dssimilar
totreepruninginthegenerationofif-thenrules. Itissimilartoassistantinitseciency
and abilitytohandlenoisy data,whereasitpartiallysharestherepresentation language
and exible search strategy of aqr. By including a mechanism for handling noise into
thealgorithmitself, asimplemetho d forgeneratingnoisetolerant if-thenruleshasb een
achieved allowing easyrepro ducibilityand analysisofthe system.
The exp eriments wehave conducted showthat the algorithmhas, innoisy domains,
comparablep erformancetoassistant. Byinducingconceptdescriptionsbasedonif-then
rules,itprovidesato olforassistingintheconstructionofknowledge-basedsystemswhere
classicationpro ceduresbasedon rulesrather thandecisiontrees aredesired. The most
obviousimprovementto thealgorithm,suggestedbythe resultson articialdomains,is
animprovement tothesignicancemeasure used.
We thank Donald Michie and Claude Sammut for their careful reading and valuable
commentson earlier draftsof thispap er. We alsothank PatLangley and the reviewers
for theirdetailedcommentsand suggestionsab out the presentation. We are grateful to
G.Klanjscek, M.Soklic and M.Zwitter of theUniversityMedicalCenter, Ljubljana for
the use of the medical data and I. Kononenko for itsconversion to a form suitable the
induction algorithms. This work was supp orted by the Oce of Naval Research under
contractN00014-85-G-0243aspart oftheCognitiveScience ResearchProgram.
References
[1] Peter Mowforth. Some applicationswith inductive exp ert system shells. TIOP 86-
002,Turing Institute, Glasgow,UK, 1986.
[2] J.Ross Quinlan. Learningecient classicationpro ceduresandtheirapplicationto
chess endgames. In J. G. Carb onell, R.S. Michalski, and T. M. Mitchell, editors,
Machine Learning,vol. 1. Tioga,Palo Alto,Ca,1983.
[3] R. S.Michalski. Onthequasi-minimalsolutionof thegeneral coveringproblem. In
Proceedings of the 5th international symposium on Information Processing (FCIP
69), Vol.A3(Switching circuits), Bled, Yugoslavia,pages 125{128,1969.
[4] J. R. Quinlan. Simplifying decision trees. Int. Journal of Man-Machine Studies,
27(3):221{234,Septemb er1987.
[5] Tim Niblett. Constructing decision trees in noisy domains. In I. Bratko and
N. Lavrac,editors, Progress in Machine Learning(proceedingsof the 2nd European
Working Session onLearning),pages 67{78.Sigma,Wilmslow,UK,1987.
[6] J. R. Quinlan, P. J. Compton, K. A. Horn, and L. Lazarus. Inductive knowledge
acquisition: acasestudy.InApplicationsofExpertSystems,pages157{173.Addison-
Wesley,Wokingham,UK,1987.
[7] IgorKononenko,Ivan Bratko,andEgidijaRoskar. Exp erimentsinautomaticlearn-
ing of medicaldiagnostic rules. Technical rep ort,Facultyof ElectircalEngineering,
E. KardeljUniversity,Ljubljana, 1984.
1
ingmetho dologyandthedescriptionofprogramaq11. ISG83-5,Dept.ofComputer
Science, Univ. of IllinoisatUrbana-Champaign,Urbana, 1983.
[9] R. S. Michalski,I. Mozetic,J. Hong, and N. Lavrac. The AQ15 inductive learning
system: anoverviewandexp eriments. InProceedingsofIMAL1986,Orsay,France,
1986.Universitede Paris-Sud.
[10] WayneIba,JamesWogulis,andPatLangley. Tradingosimplicityandcoveragein
incrementalconceptlearning.InJohnLaird,editor,Proc.5thInt.Conf.onMachine
Learning,pages73{79. Kaufmann,Ca,1988.
[11] R. L.Rivest. Learning decisionlists. Machine Learning,2(3):229{246,1987.
[12] R.S.MichalskiandR.Chilausky.Learningbyb eingtoldandlearningfromexamples:
an exp erimental comparison of the two metho ds of knowledge acquisition in the
contextof developing an exp ert system for soyb ean diagnosis. Policy Analysisand
Information Systems,4(2):125{160,1980.
[13] P. O'Rorke. A comparative study of inductive learning systems AQ11P and ID3
using a chess end-game testproblem. ISG 82-2, Dept. of ComputerScience, Univ.
of Illinoisat Urbana-Champaign,Urbana,1982.
[14] A. Wald. Sequential Analysis. Wiley,New York, 1947.
[15] J. Kalb eish. Probabilityand Statistical Inference,volume2. Springer-Verlag,NY,
1979.
[16] A. Paterson and T.Niblett. ACLS Manual.Version1. Glasgow,1982.
[17] Philip K. Chan. A criticalreview of cn2: A p olytheticclassier system. Technical
Rep ort CS-88-09,Dept. ofComputerScience, Vanderbild Univ.,Tennessee, 1988.
[18] J.A. Jackson. Economicsofautomaticgenerationofrulesfrom examplesinaChess
end-game. UIUCDCS-F 85-932, Dept. of Computer Science, Univ. of Illinois at
Urbana-Champaign, Urbana,1985.
[19] Bojan Cestnik, Igor Kononenko, and Ivan Bratko. Assistant 86: A knowledge-
elicitationto olforsophisticatedusers.InI.BratkoandN.Lavrac,editors,Progressin
Machine Learning (proceedingsof the 2nd EuropeanWorking Sessionon Learning),
pages 31{45.Sigma,Wilmslow,UK,1987.
editor,IJCAI-87,pages 304{307,Ca,1987.Kaufmann.
[21] T.NiblettandI.Bratko.Learningdecisionrulesinnoisydomains.InM.A.Bramer,
editor, Research and Development in Expert Systems III, pages 25{34. Cambridge
UniversityPress, 1987.