Neurocomputing
journal homepage: www.elsevier.com/locate/neucom
On expressiveness and uncertainty awareness in rule-based classification for data streams
Thien Le a, Frederic Stahl a,∗, Mohamed Medhat Gaber b, João Bártolo Gomes c, Giuseppe Di Fatta a

a Department of Computer Science, University of Reading, Whiteknights, Reading, Berkshire, RG6 6AY, United Kingdom
b School of Computing and Digital Technology, Birmingham City University, Curzon St, Birmingham B4 7XG, England, United Kingdom
c DataRobot, 1 International Place, 5th Floor, Boston, MA 02110, United States
Article info

Article history:
Received 25 February 2016
Revised 24 May 2017
Accepted 26 May 2017
Available online 3 June 2017

Keywords:
Data Stream Mining
Big Data Analytics
Classification
Expressiveness
Abstaining
Modular classification rule induction
Abstract

Mining data streams is a core element of Big Data Analytics. It represents the velocity of large datasets, which is one of the four aspects of Big Data, the other three being volume, variety and veracity. As data streams in, models are constructed using data mining techniques tailored towards continuous and fast model update. The Hoeffding Inequality has been among the most successful approaches in learning theory for data streams. In this context, it is typically used to provide a statistical bound for the number of examples needed in each step of an incremental learning process. It has been applied to both classification and clustering problems. Despite the success of the Hoeffding Tree classifier and other data stream mining methods, such models fall short of explaining how their results (i.e., classifications) are reached (black boxing). The expressiveness of decision models in data streams is an area of research that has attracted less attention, despite its paramount practical importance. In this paper, we address this issue, adopting the Hoeffding Inequality as an upper bound to build decision rules which can help decision makers with informed predictions (white boxing). We termed our novel method Hoeffding Rules with respect to the use of the Hoeffding Inequality in the method, for estimating whether a rule induced from a smaller sample would be of the same quality as a rule induced from a larger sample. The new method brings in a number of novel contributions, including handling uncertainty through abstaining, dealing with continuous data through Gaussian statistical modelling, and an experimentally proven fast algorithm. We conducted a thorough experimental study using benchmark datasets, showing the efficiency and expressiveness of the proposed technique when compared with the state-of-the-art.
© 2017 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
1. Introduction
One problem the research area of 'Big Data Analytics' is concerned with is the analysis of high velocity data, also known as streaming data [1,2], that challenges our computational resources. The analysis of these fast streaming data in real-time is also known as the emerging area of Data Stream Mining (DSM) [2,3]. One important data mining technique, and in turn DSM category of techniques, is classification. Traditional data mining builds its classification models on static batch training sets, allowing several iterations over the data. This is different in DSM, as the classification model needs to be induced in linear or sublinear time complexity [4]. Furthermore, DSM classification techniques need to allow dynamic adaptation to concept drifts as the data streams in [4]. Applications of DSM classification techniques are manifold and comprise, for example, monitoring the stock market from handheld devices [5], real-time monitoring of a fleet of vehicles [6], real-time sensing of data in the chemical process industry using soft-sensors [7], and sentiment analysis using real-time micro-blogging data such as Twitter data [8], to mention a few.

∗ Corresponding author.
E-mail addresses: t.d.le@pgr.reading.ac.uk (T. Le), f.t.stahl@reading.ac.uk, Frederic.T.Stahl@gmail.com (F. Stahl), mohamed.gaber@bcu.ac.uk (M.M. Gaber), joao@datarobot.com (J.B. Gomes), g.difatta@reading.ac.uk (G. Di Fatta).
The challenge of data stream classification lies in the need of the classifier to adapt in real-time to concept drifts, which is significantly more challenging if the data stream is of high velocity. Many data stream classification techniques are based on the 'Top Down Induction of Decision Trees', also known as the 'divide-and-conquer' approach [9], such as [10,11]. However, the decision tree format is also a major weakness and often requires irrelevant information to be available to perform a classification task [12]. Moreover, adaptation of trees is harder compared with rules when a change occurs; this could be a disadvantage for real-time applications.

http://dx.doi.org/10.1016/j.neucom.2017.05.081
The here presented work proposes the Hoeffding Rules data stream classifier, which is based on modular classification rules instead of trees. Hoeffding Rules can easily be assimilated by humans and, at the same time, does not require unnecessary information to be available for the classification tasks. Rule induction from data streams can be traced back to the Very Fast Decision Rules (VFDR) [13] and eRules data stream classifiers [9] for numerical data. eRules induces modular classification rules from data streams, but requires extensive parameter tuning by the user in order to achieve adequate classification accuracy, including the setting of the window size. Noting this drawback, which affects the accuracy if the parameters are not set correctly, a statistical measure that automatically tunes the parameters is desirable. Addressing this issue, Hoeffding Rules adjusts these parameters dynamically with very little input required by the user. The here presented Hoeffding Rules algorithm is based on the Prism [12] rule induction approach using a sliding window [14,15]. However, this window for buffering training data is adjusted dynamically by making use of the Hoeffding Inequality [16]. One important property of Hoeffding Rules, compared with the popular Hoeffding Tree data stream classification approach [10], is that Hoeffding Rules can be configured to abstain from classifying an unseen data instance when it is uncertain about its class label. In addition, our approach is computationally efficient and hence is suitable for real-time requirements.
An important strength of the proposed technique is the high expressiveness of the rules. Thus, having the rules as the representation of the output can help users in making timely and informed decisions. Output expressiveness increases trust in data stream analytics, which is one of the challenges facing adaptive learning systems [17]. To address the expressiveness issue for offline black box machine learning models, the new algorithm Local Interpretable Model-Agnostic Explanations (LIME) has been proposed in [18]. The method generates a short explanation for each new classified or regressed instance out of a predictive model, in a form that is interpretable by humans (it can be expressed as rules, in a way). The work has attracted a great deal of media attention and has emphasised the need for expressive models. Model trust has been further emphasised. This work and many other follow-up research papers have been the result of experimental work that showed some serious flaws in deep learning models (a highly accurate black box approach) [19]. The work showed that miss-classification by deep learning models of some images – due to added noise to these images – can occur for examples that are surprisingly obvious to humans. Again, model interpretability and trust have been emphasised as an important area of research.
The utility of expressiveness is introduced in this paper to refer to the cost of expressiveness when comparing the accuracy of two methods. As accuracy has been the dominating measure of interest in comparing classifiers in both static and streaming environments, it is evident that real-time decision making based on streaming models still suffers from the issue of trust [17]. To address this issue, the user is able to determine an accuracy loss band (ζ), such that the model can be expressive enough to grant trust, and at the same time the accuracy can be tolerated at (−ζ%) of any other best performing classifier which is less expressive (it can be a total black box). We argue that such a new measure will open the door for more trustful white box models. In many applications (e.g., surveillance, medical diagnosis, terrorism detection), decisions need to be based on clear arguments. In such applications, having a trustful model with a competent accuracy can be much more appreciated than having a highly accurate model that does not convey any reasoning about its decision. Other examples of applications that require convincing arguments can be found in [18].

Fig. 1. Hierarchy of output expressiveness.
This paper is organised as follows: Section 2 highlights work related to the Hoeffding Rules approach. Section 3 highlights our dynamic rule induction and adaptation approach for data streams. An experimental evaluation and discussion is presented in Section 4. Concluding remarks are provided in Section 5.
2. Related work
The velocity aspect of the Big Data trend is the main driver of work done for over a decade in the area of data stream mining – long before the Big Data term was coined. Among the proposed techniques in the area is a long list of classification techniques. Approaches to data stream classification have varied from approximation to ensemble learning. Two motives stimulated such developments: (1) fast model construction addressing the high speed nature of data streams; and (2) change in the underlying distribution of the data, in what has been commonly known as concept drift.
The Hoeffding Inequality [16] has found its way from the statistical literature of the 1960s to make an impact in data stream mining, underpinning a number of techniques, mostly in classification, with notable success. The Hoeffding bound is a statistical upper bound on the probability that the sum of bounded random variables deviates from its expected value. The basic Hoeffding bound has been extended and adopted in successfully developing a number of streaming techniques that were termed collectively as Very Fast Machine Learning (VFML) [20].
Earlier work on data stream mining addressed the aforementioned issues. However, the end user perspective has been greatly missing, and hence the user's trust in such systems was frequently questioned. This issue has been discussed in a position paper by Zliobaite et al. [17].
In this paper we address this issue, attempting to provide the end user with the most expressive knowledge representation for data stream classification, i.e., rules. We argue that rules can provide the users with informative decisions that enhance the trust in streaming systems. Fig. 1 shows a hierarchy of output expressiveness, with rule-based models being at the top of all of the other classification techniques.
2.1. Rule induction from data streams
FLORA is a family of algorithms for data stream rule induction that adjusts its window size dynamically using a heuristic based on the predictive accuracy and concept descriptions. The most recent FLORA algorithm, FLORA4, addresses the issue of concept drift. It can use previous concept descriptions in situations of recurring changes and is robust to noise in the data stream [21]. From the AQ-PM family of algorithms, the AQ11-PM system is the most representative. AQ11-PM uses learned rules to select data instances from the training data that lie on the boundaries of induced concept descriptions and stores these so-called 'extreme ones' in partial memory [22]. However, those approaches are still not adapted to high speed data stream environments, especially those featuring continuous attributes. The FACIL [23] algorithm works similarly to AQ11-PM. However, FACIL does not require all stored data instances to be extreme, and the rules are not revised immediately when they become inconsistent. Adaptation to drift is achieved by simply deleting older rules. Nevertheless, none of these approaches was evaluated on massive datasets with numerical values (FACIL requires that the numeric data is normalised between 0 and 1). The most recent approach is VFDR [13,24], which shares ideas with VFML and was implemented and tested in MOA [14], a workbench for evaluating data stream learning algorithms. VFDR is able to learn an ordered or unordered rule set.
VFDR is the algorithm most similar to our approach (Hoeffding Rules). The main difference between VFDR and the Hoeffding Rules approach proposed in this paper is that Hoeffding Rules induction is based on Prism [12], while VFDR induction is similar to the induction of the Hoeffding Tree. Moreover, the Hoeffding Rules approach can abstain from classifying, which contributes to the high expressiveness of the induced rule set.
For a more in-depth review of the related work we refer the reader to a survey on incremental rule-based learners [25].
3. Hoeffding Rules: expressive real-time classification rule inducer
This section highlights the development of the proposed Hoeffding Rules classifier conceptually. It involves the induction of an initial classifier in the form of a set of expressive 'IF… THEN…' rules. The section first highlights expressive rule sets in general in Section 3.1, and then discusses the Prism algorithm as a basic approach for inducing such rules on batch data in Section 3.2. Prism has been adopted by Hoeffding Rules as the basic process for inducing expressive rules. However, it has been enhanced with a more expressive rule term induction method for continuous attributes, based on probability density distributions, as described in Section 3.3. Section 3.4 then describes the Hoeffding bound used by Hoeffding Rules as a metric to estimate a good dynamic window size of the data stream from which to induce expressive rules. Lastly, Section 3.5 illustrates the overall Hoeffding Rules real-time induction process.
3.1. Expressive rule representation and induction
Expressive classification rules are learnt from a given set of labelled data instances, which consist of attribute values, and rule learning algorithms construct one or more rules of the form:

IF t1 AND t2 AND t3 ... AND tk THEN class ωi

The left side of a rule is the conditional part of the rule, which consists of a conjunction of rule terms. A rule term is a logical test that determines whether a data example to be classified has the classification ωi or not. A classification rule can have one up to k rule terms, where k is the number of attributes in the data.

A rule term can have different forms for both categorical and continuous attributes. A rule term for categorical attributes typically has the form (α = v), in which v is one of the possible values of attribute α. For continuous attributes, binary splitting techniques are widely used, such as in [26–28]. With binary splitting a rule term is of the form (α < v) or (α ≥ v); in this case, v is a constant from the range of observed values for attribute α. Hence, if the data instances satisfy the body or conditional part of the rule, then the rule predicts ωi as the class label.

Fig. 2. An example of the replicated subtree problem for the rules: IF w = 0 AND x = 1 THEN class = a, IF y = 0 AND z = 1 THEN class = a. Otherwise, class = b.

3.2. Predictive rule learning process
This section discusses the induction of expressive rules, such as the ones described in Section 3.1, based on the Prism algorithm [12]. Hoeffding Rules' basic rule induction strategy is also based on Prism. Prism uses a 'separate-and-conquer' approach to induce expressive rules. In contrast to the 'divide-and-conquer' rule induction approach (which generates decision trees, such as for example the C4.5 algorithm [29]), Prism generates decision rules directly from training data and not in the intermediate form of a tree:

IF w = 0 AND x = 1 THEN class = a
IF y = 0 AND z = 1 THEN class = a
Otherwise, class = b

The three rules above cannot be represented in the form of a decision tree, as they do not have any attributes in common. Representing these rules in a decision tree would require adding unnecessary and meaningless rule terms. This is also known as the 'replicated subtree problem' [30,31], illustrated in Fig. 2 for the two rules above.

The tree structure example in Fig. 2 is generated under the assumption that there exist only the four attributes (w, x, y, z); that each attribute is associated with either the value 0 or 1; and that instances covered by the two rules above are classified as belonging to class a, whereas the remaining rules predict class b.

This example reveals that the 'divide-and-conquer' rule induction approach can lead to unnecessarily large and complex trees [30], whereas the Prism algorithm is able to induce modular rules, such as the two rules above, that have no attribute in common. Also, the authors of [32] discuss that decision tree models are less expressive, as they tend to be complex and difficult to interpret by humans once the tree model grows to a certain size. Likewise, the authors of the well-known C4.5 decision tree induction algorithm [29] acknowledged that pruning a decision tree model does not guarantee simplicity, and the pruned tree can still be too cumbersome to be understood by humans.

It has been shown in [28] that Prism's induction method of expressive rules achieves a similar classification accuracy compared with decision tree based classifiers and sometimes even outperforms decision trees (especially if the data is noisy or there are clashes in the data) [28]. Thus, Prism has been chosen as the basic rule induction strategy for Hoeffding Rules. Another reason for choosing Prism is that it naturally abstains from classification if no rule matches, whereas a decision tree based approach forces a classification [9]. Abstaining may be necessary in critical applications where miss-classifications are very costly, such as in medical or financial applications.

Prism follows the 'separate-and-conquer' approach, which repeatedly executes the following two steps: (1) induce a new rule and add it to the rule set, and (2) remove all data instances covered by the new rule. The stopping criterion for executing these two steps is usually when all data instances are covered by the rule set. Hence, this approach is also often referred to as the 'covering approach'. Cendrowska's original Prism algorithm for categorical attributes implements this 'separate-and-conquer' approach as shown in Algorithm 1, where tα is a possible attribute-value pair
Algorithm 1: Learning classification rules from labelled data instances using Prism.

1   for i = 1 → C do
2       D ← Dataset;
3       while D does not contain only instances of class ωi do
4           forall the attributes α ∈ D do
5               Calculate the conditional probability P(ωi | tα) for all possible rule terms tα;
            end
7           Select the tα with the maximum conditional probability P(ωi | tα) as rule term;
8           D ← S, create a subset S from D containing all the instances covered by tα;
        end
10      The induced rule R is a conjunction of all tα at line 7;
11      Remove all instances covered by rule R from the original Dataset;
12      repeat
13          lines 3 to 9;
        until all instances of class ωi have been removed;
    end
(rule term) and D is the training data. The algorithm is executed for each class ωi in turn on the original training data D.

There have been variations of Prism, such as N-Prism, which also deals with continuous attributes [28]; PrismTCS, which imposes an order on the rules in the rule set [33]; PMCRI, which is a scalable parallel version of PrismTCS [31]; and Prism-based ensemble approaches such as Random Prism [34].
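The 'separate-and-conquer' loop of Algorithm 1 can be sketched for one target class on categorical data roughly as follows. This is a simplified illustration, not the authors' implementation; the helper names (`best_term`, `induce_rules_for_class`) are ours, and clash handling (identical instances with different labels) is omitted:

```python
# Simplified sketch of Prism's 'separate-and-conquer' loop for one
# target class on categorical data. Instances are (features, label)
# pairs; features are dicts. Clash handling is omitted.

def best_term(D, target):
    """Pick the attribute-value pair t with maximum P(target | t)."""
    best, best_p = None, -1.0
    for attrs, _ in D:
        for a, v in attrs.items():
            covered = [lab for f, lab in D if f.get(a) == v]
            p = sum(1 for lab in covered if lab == target) / len(covered)
            if p > best_p:
                best, best_p = (a, v), p
    return best

def induce_rules_for_class(D, target):
    rules, remaining = [], list(D)
    while any(lab == target for _, lab in remaining):
        terms, covered = [], list(remaining)
        # 'conquer': specialise until the covered subset is pure
        while any(lab != target for _, lab in covered):
            a, v = best_term(covered, target)
            terms.append((a, v))
            covered = [(f, lab) for f, lab in covered if f.get(a) == v]
        rules.append(terms)
        # 'separate': drop the instances the new rule covers
        remaining = [(f, lab) for f, lab in remaining
                     if not all(f.get(a) == v for a, v in terms)]
    return rules

D = [({"w": 0, "x": 1}, "a"), ({"w": 0, "x": 0}, "b"),
     ({"w": 1, "x": 1}, "b"), ({"w": 0, "x": 1}, "a")]
print(induce_rules_for_class(D, "a"))  # [[('w', 0), ('x', 1)]]
```

Running the loop for class b afterwards, as the outer `for` of Algorithm 1 does, would yield the remaining modular rules.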
Hoeffding Rules uses this basic Prism approach to induce rules from a recent subset of the data stream. However, unlike Prism, Hoeffding Rules uses a more expressive representation of rule terms for continuous data, which will be discussed next in Section 3.3; it also uses the Hoeffding bound to adapt the induced rule set to concept drifts in the data stream in real-time, as will be discussed in Section 3.4.
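The abstaining behaviour discussed in this section can be sketched as follows: a rule-based classifier simply returns no label when no rule fires (an illustrative sketch with categorical terms; the `classify` helper is ours):

```python
# Abstaining rule-based classification: if no rule covers an instance,
# return None instead of forcing a prediction, unlike a decision tree.
# Rules are (terms, label) pairs with categorical terms. Illustrative.

def classify(rules, instance):
    for terms, label in rules:
        if all(instance.get(a) == v for a, v in terms):
            return label
    return None                     # abstain: no rule matches

rules = [([("w", 0), ("x", 1)], "a"),
         ([("y", 0), ("z", 1)], "a")]
print(classify(rules, {"w": 0, "x": 1}))          # a
print(classify(rules, {"w": 1, "x": 0, "y": 1}))  # None -> abstain
```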
Fig. 3. Cut-point calculations to induce a rule term for continuous and categorical attributes.
3.3. Probability density distribution for expressive continuous rule terms
The original Prism algorithm [12] only works on categorical attributes and produces rule terms of the form (α = ν). eRules [9] and Very Fast Decision Rules (VFDR) [13] are among the few algorithms specifically developed for learning rules directly from a data stream in real-time. For continuous attributes, eRules and VFDR produce rule terms of the form (α < ν) or (α ≥ ν), and (α ≤ ν) or (α > ν), respectively. A summary of the process of how eRules and VFDR deal with continuous attributes can be described as follows:

1. For each possible value αj of a continuous attribute α, calculate the conditional probability for a given target class for both rule terms (α < ν) and (α ≥ ν), or (α ≤ ν) and (α > ν).
2. Return the rule term which has the overall best conditional probability for the target class.

It is evident that this process of dealing with continuous attributes requires many cut-point calculations for the conditional probabilities for each possible value αj of a continuous attribute. The example illustrated in Fig. 3 comprises just six data instances, one continuous attribute, one categorical attribute and two possible class labels. It shows how many cut-point calculations are needed by eRules or VFDR in order to induce one rule term. The number of cut-point calculations needed for each continuous attribute is the number of unique values of the attribute multiplied by 2. Clearly both algorithms, eRules and VFDR, still require a lot of calculations even though the data in the example is very small. This is a drawback, as computationally efficient methods are needed for mining data streams. Furthermore, eRules and VFDR use a 'separate-and-conquer' strategy, which requires many iterations until a rule is completed.
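The cut-point cost described above can be made concrete with a small sketch (the helper names are illustrative, not from eRules or VFDR): every unique value of a continuous attribute yields two candidate binary-split terms, each of which must be scored.

```python
# Sketch of the binary-split evaluation that eRules/VFDR-style learners
# perform on a continuous attribute: every unique value v yields two
# candidate rule terms, (alpha < v) and (alpha >= v). Illustrative only.

def candidate_cut_terms(values):
    """Enumerate the candidate binary-split rule terms for one attribute."""
    cuts = []
    for v in sorted(set(values)):
        cuts.append(("<", v))
        cuts.append((">=", v))
    return cuts

def best_cut(values, labels, target):
    """Score each candidate term by P(target | term) and keep the best."""
    best, best_p = None, -1.0
    for op, v in candidate_cut_terms(values):
        covered = [lab for x, lab in zip(values, labels)
                   if (x < v if op == "<" else x >= v)]
        if not covered:
            continue
        p = covered.count(target) / len(covered)
        if p > best_p:
            best, best_p = (op, v), p
    return best, best_p

values = [1.0, 2.0, 2.0, 3.0, 4.0, 5.0]   # six instances, as in Fig. 3's scale
labels = ["a", "a", "a", "b", "b", "b"]
print(len(candidate_cut_terms(values)))    # 2 * 5 unique values = 10
print(best_cut(values, labels, "a"))       # (('<', 2.0), 1.0)
```

Even this toy attribute requires ten scored candidates, which illustrates why the interval-based Gaussian method of the next section is attractive for streams.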
G-eRules uses the Gaussian distribution of the attribute associated with a class label, as introduced in [35], to create rule terms of the form (x < α ≤ y), and thus avoids frequent cut-point calculations. Evidence of the improvements in performance while maintaining the accuracy of the induced rules is discussed in [35]. This method is also used in the Hoeffding Rules algorithm to avoid frequent cut-point calculations. It is also more expressive than inducing rule terms from binary splits, as rule terms of the form (x < α ≤ y) can describe an interval of data. One would need two rule terms induced by binary splitting to describe the same interval of data values of a particular attribute. For each continuous attribute of the instances, a Gaussian distribution representing all possible values of that continuous attribute for a given target class is used to generate these more expressive rule terms. If the data instances have class labels ω1, ω2, ..., ωi, then we can compute the value of a continuous attribute that is most relevant to a particular class label, based on the Gaussian distribution of the values associated with this particular class label.

Fig. 4. The shaded area represents a range of values of continuous attribute α for class ωi.
The Gaussian distribution is calculated for a continuous attribute α with a mean μ and a variance σ² from all possible numeric values associated with the class label ωi. The class conditional density probability is calculated as in Eq. (1):

P(tα | ωi) = P(tα | μ, σ²) = (1 / √(2πσ²)) exp(−(tα − μ)² / (2σ²))    (1)

Hence, a heuristic based on P(ωi | tα), or equivalently log(P(ωi | tα)), is calculated and used to determine the probability of a class label for a valid value of a continuous attribute as in Eq. (2):

log(P(ωi | tα)) = log(P(tα | ωi)) + log(P(ωi)) − log(P(tα))    (2)

The probability between two values can be calculated for the range between these two values, such that if x falls within this range, then x belongs to class ωi. This method may not guarantee to capture the full details of intricate continuous attributes, but the computational and memory efficiency can be improved significantly compared with the binary splitting technique, as shown in [35]. The efficiency results from the fact that the Gaussian distribution only needs to be calculated once and can be incrementally updated over time with just two variables, μ and σ².

As illustrated in Fig. 4, the shaded area between x and y should represent the most common values of a continuous attribute α for class ωi. A good rule term for a continuous attribute is derived by choosing an area under the curve of the corresponding distribution for which the density class probability P(x < α ≤ y) is the greatest.

This technique is used to identify a possible rule term in the form of (x < α ≤ y), which is highly relevant to a range of values of the continuous attribute α for a target class ωi from a subset of data instances. The process can be described in the following steps:

1. The mean μ and variance σ² for each class label are calculated for each available continuous attribute.
2. Eqs. (1) and (2) are used to work out the class conditional density and posterior class probability for each value of a continuous attribute for a target class.
3. The value with the greatest posterior class probability is selected.
4. Choose a smaller value compared with the value selected in step 3 that has the greatest posterior class probability among all smaller values. Choose also a greater value compared with the value selected in step 3 that has the greatest posterior class probability among all greater values.
5. Calculate the density probability between the two values from step 4 by using the corresponding Gaussian distribution calculated in step 1 for the target class.
6. Select the range of the attribute (x < α ≤ y) as a rule term, for which the density class probability is the maximum.

Fig. 5. Sliding windows process.

The normality assumption should not cause major problems if large enough sample sizes are used (>30 or 40) [36]. As stated by Altman and Bland [37], the distribution of the data can be ignored if the samples consist of hundreds of observations. This is the case for data stream classifiers such as Hoeffding Rules, as they are designed for infinite data streams. Also, there are a few notable points from the central limit theorem [37,38] regarding the normality assumption:

• If the sample data is approximately normally distributed, then the sampling distribution will also be normally distributed.
• If the sample size is large enough (>30 or 40), then the sampling distribution tends to be normally distributed, regardless of the actual underlying distribution of the data.

From the points mentioned above, true normality is considered to be a myth, but a good estimation of normality can be confirmed by using normal plots or significance tests [37]. The main idea behind these tests is to show whether data significantly deviates from normality [36–38].
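The class-conditional density of Eq. (1), the log-posterior of Eq. (2), and the incremental update of μ and σ² mentioned above can be sketched together as follows. This is a minimal illustration; Welford's online update is our choice of incremental scheme, as the paper does not name one:

```python
import math

# Per-class running Gaussian for one continuous attribute: Welford's
# online update keeps mu and sigma^2 current without revisiting old
# data, and pdf() evaluates the class-conditional density of Eq. (1).

class RunningGaussian:
    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):                     # population variance
        return self.m2 / self.n if self.n else 0.0

    def pdf(self, t):
        """Eq. (1): P(t | omega_i) under the fitted Gaussian."""
        v = self.variance
        return math.exp(-(t - self.mean) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)

def log_posterior(density, prior, evidence):
    """Eq. (2): log P(omega_i|t) = log P(t|omega_i) + log P(omega_i) - log P(t)."""
    return math.log(density) + math.log(prior) - math.log(evidence)

g = RunningGaussian()
for x in [2.0, 4.0, 6.0]:          # values of attribute alpha for class omega_i
    g.update(x)
print(g.mean, round(g.variance, 4))  # 4.0 2.6667
print(g.pdf(4.0) > g.pdf(6.0))       # density peaks at the class mean: True
```

Only `n`, `mean` and `m2` are stored per class and attribute, which matches the claim that the distribution is maintained with just two summary statistics besides the count.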
The next section describes the adaptation of the rule induction process to data streams.
3.4. Using the Hoeffding bound to ensure quality of learnt rules from a data stream
It is reasonable to assume that the recent data in a data stream is more likely to reflect the current concept accurately compared with older data [39]. Some works [9,13,40–42] in data mining discuss and use a sliding windows process as illustrated in Fig. 5.
The fundamental idea of the sliding windows process is that a window is maintained which stores the most recently seen data instances, and from which older data instances are dropped according to some set of rules. Data instances in a window can be used for the following three tasks [14,43]:

1. To detect change, using a statistical test to compare the underlying probability distribution between different sub-windows.
2. To obtain updated statistics from recent data instances.
3. To rebuild or revise the learnt model after the data has changed.

By using the sliding windows technique, algorithms will not be affected by stale data, and windows can also be used as a tool to approximate the amount of memory required [1].
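A minimal fixed-size sliding window (task 2 above: statistics over recent instances) can be sketched as follows; this is an illustrative buffer only, not the dynamically sized window that Hoeffding Rules itself uses:

```python
from collections import deque

# Minimal fixed-size sliding window: the newest instance pushes the
# oldest one out, so statistics are always computed over recent data.

class SlidingWindow:
    def __init__(self, size):
        self.buf = deque(maxlen=size)

    def add(self, instance):
        self.buf.append(instance)   # oldest instance dropped automatically

    def mean(self):
        return sum(self.buf) / len(self.buf)

w = SlidingWindow(3)
for x in [1, 2, 3, 10]:
    w.add(x)
print(list(w.buf), w.mean())        # [2, 3, 10] 5.0
```

The jump of the mean from 2.0 to 5.0 after the value 10 arrives illustrates how a window tracks the current concept while forgetting stale data.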
The proposed Hoeffding Rules algorithm uses the Hoeffding Inequality [16] to estimate the confidence of whether adding a rule term to a rule, or stopping the rule's induction process, is appropriate. This makes it more likely that the rule will cover instances from the stream that match the rule's target class. The use of the Hoeffding Inequality in Hoeffding Rules was inspired by [10,44,45]. In addition, the word 'Hoeffding' in our algorithm name is derived from the name of the formula called the Hoeffding Inequality [16], which provides a statistical measurement in confidence of the sample mean of n independent data instances x1, x2, ..., xn. If Etrue is the true mean and Eest is the estimation of the true mean from an independent sample, then the difference in probability between Etrue and Eest is bounded by Eq. (3), where R is the possible range of the difference between Etrue and Eest:

P[|Etrue − Eest| > ε] < 2e^(−2nε²/R²)    (3)

From the bounds of the Hoeffding Inequality, it is assumed that with the confidence of 1 − δ, the estimation of the mean is within ε of the true mean. In other words, we have:

P[|Etrue − Eest| > ε] < δ    (4)

From Eqs. (3) and (4) and solving for ε, a bound on how close the estimated mean is to the true mean after n observations, with a confidence of at least 1 − δ, is defined as follows:

ε = √(R² ln(1/δ) / 2n)    (5)
By using the Hoeffding bound as an independent metric to verify the true likeness of a rule term, we say that the rules that satisfy the Hoeffding bound are likely to be as good as the rules learnt from an infinite data stream.

ε is calculated after the rule term with the best conditional probability for class ωi is selected. However, the rule term will only be added to the current rule if the difference of the conditional probabilities between the selected (best) and the second best rule term is greater than ε. Otherwise, the rule's induction process is completed and the rule is added to the rule set. A new iteration for a new rule is started again, with the data instances covered by the previous rule removed.

If G(tα) is the heuristic measurement that is used to test the rule term tα, then R in Eq. (5) represents the range of G(tα). G(tα) in our approach is the conditional probability P(class = i | tα) at which a rule term tα covers a target class ωi. Hence, the probability range R of a rule term is 1. n is the number of data instances that the rule has covered so far.
Concerning the goodness of the best rule term, let tαj be the rule term with the highest conditional probability from the current iteration, and tαj−1 be the rule term with the second highest conditional probability from the current iteration; then:

ΔG = G(tαj) − G(tαj−1)    (6)

If ΔG > ε, then the Hoeffding bound guarantees that with a probability of 1 − δ, the true ΔG ≥ (ΔG − ε). If this is the case, we include the rule term into the current rule as part of Algorithm 1 at step 7 and continue to search for a new rule term if the rule still covers data instances of classifications other than the target class. Once all possible rules are induced for all class labels from the current window, all instances covered by the rules are removed and the instances not covered are added to a temporary buffer. This buffer is then combined with the data from the next sliding window for inducing new classification rules.

Essentially, we use the Hoeffding bound to determine, with a confidence of 1 − δ, that the observed conditional probability, with which the rule term covers the target class in n examples, is the same as we would observe for an infinite number of data instances.
The next section brings together the previously outlined methods for rule induction, dealing with continuous attributes, and adaptation to concept drift.
3.5. Overall learning process of Hoeffding Rules
The following sections describe how we adapted and combined sliding windows, the Hoeffding Inequality, and the Prism algorithm to induce and maintain an adaptive modular set of decision rules for streaming data. These techniques have been discussed in greater detail in Sections 3.1–3.4.
3.5.1. Inducing the initial classifier
The first step of Hoeffding Rules' execution is the generation of the initial classifier, which is done in batch mode using Prism on the first n instances in the window. As described in Section 3.2, the Prism algorithm is able to induce expressive classification rules directly from training data by using a 'separate-and-conquer' search strategy. The method of inducing numerical rule terms in this algorithm has been replaced with the computationally more efficient way of inducing numerical rule terms, based on the Gaussian probability density distributions described in Section 3.3 of this paper.

For the first window, the window size n is predefined. Subsequently, the number of data instances for each window consists of unseen data instances plus the data instances not covered by the rules from the previous window. The learning process of Hoeffding Rules is described in Algorithm 2.
Algorithm 2: Hoeffding Rules – inducing rules from an infinite data stream.

1   R ← Learnt rule set;
2   r ← A classification rule;
3   S ← Stream of data instances;
4   Wunseen ← Buffer of unseen data instances;
5   WHB ← Buffer of data instances not covered by rules from previous Wunseen;
6   n: pre-defined window size;
7   while S has more data instances do
8       i ← new instance from S;
9       if r ∈ R covers i then
10          Validate the rule r and remove if necessary;
        else
12          Add i to Wunseen;
13          if |Wunseen| = n then
14              W := Wunseen + WHB;
15              empty(Wunseen, WHB);
16              Learn rule set R′ in batch mode as in Algorithm 3 from W;
17              Add R′ to R;
18              WHB := data instances not covered by r ∈ R′ in W;
            end
        end
    end
3.5.2. Evaluating existing rules and removing obsolete rules

The evaluation and removal of rules is done online. Once labelled instances are available, the rules of the current classifier are applied to these instances. Each rule remembers how many instances it has correctly and incorrectly classified in the past. From this, the rule can update its accuracy after each classification attempt. If a rule's classification accuracy drops below a pre-defined threshold (by default 0.8) and the rule has also taken part in a minimum number of classification attempts (by default 5), then the rule is removed. The reason for requiring a minimum number of classification attempts is to avoid the rule being removed too early.
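The online rule-evaluation policy just described (default accuracy threshold 0.8 and minimum of 5 attempts, both from the text) can be sketched as:

```python
# Online rule evaluation as described in Section 3.5.2: a rule tracks
# its classification attempts and is removed once its accuracy drops
# below a threshold, but only after a minimum number of attempts.

class RuleStats:
    def __init__(self, min_accuracy=0.8, min_attempts=5):
        self.correct, self.attempts = 0, 0
        self.min_accuracy, self.min_attempts = min_accuracy, min_attempts

    def record(self, was_correct):
        self.attempts += 1
        self.correct += int(was_correct)

    def should_remove(self):
        if self.attempts < self.min_attempts:
            return False          # too early to judge the rule
        return self.correct / self.attempts < self.min_accuracy

s = RuleStats()
s.record(False)                    # first attempt fails ...
for _ in range(4):
    s.record(True)                 # ... but the next four succeed
print(s.should_remove())           # False: accuracy 4/5 = 0.8 is tolerated
```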
Algorithm 3: Hoeffding Rules – inducing rules in batch mode.

 1 for i = 1 → C do
 2   D ← input Dataset;
 3   while D contains classes other than ω_i do
 4     forall the α in D do
 5       if α is categorical then
 6         Calculate the conditional probability P(ω_i | t_α) for all rule terms t_α;
       else if α is continuous then
 7         Calculate mean μ and variance σ² of continuous attribute α for class ω_i;
 8         foreach value α_j of attribute α do
 9           Calculate P(α_j | ω_i) based on the Gaussian distribution created in line 7;
         end
11         Select α_j of attribute α which has the highest value of P(α_j | ω_i);
12         Create t_α in the form x < α ≤ y as discussed in Section 3.3;
13         Calculate P(t_α | ω_i), where t_α is in the form x < α ≤ y;
       end
     end
16     Calculate Hoeffding bound ε = √(R² ln(1/δ) / (2 · no. instances in D));
17     if P(t_α | ω_i)_best − P(t_α | ω_i)_second-best > ε then
18       Select t_α for which P(t_α | ω_i) is a maximum;
     else
20       Stop inducing current rule;
     end
22     Create subset S of D containing all the instances covered by t_α;
23     D ← S;
   end
25   R is a conjunction of all the rule terms built at line 17;
26   Remove all instances covered by rule R from input Dataset;
27   repeat lines 2 to 22 until all instances of ω_i have been removed;
30   Reset input Dataset to its initial state;
 end
 return induced rules;
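The two numeric ingredients of Algorithm 3 — the Hoeffding bound test of lines 16–17 and the Gaussian scoring of continuous attribute values from Section 3.3 — can be sketched as follows. This is an illustrative Python sketch; the function names are ours, not the authors':

```python
import math

def hoeffding_bound(value_range, delta, n):
    """Hoeffding bound of line 16: eps = sqrt(R^2 * ln(1/delta) / (2n)),
    where `value_range` is the range R of the measured quantity and n is
    the number of instances in the current subset D."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))

def gaussian_pdf(x, mu, var):
    """Density of N(mu, var); used to score candidate values of a
    continuous attribute for the target class (Section 3.3)."""
    return math.exp(-(x - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def keep_specialising(p_best, p_second_best, delta, n, value_range=1.0):
    """Line 17: continue adding rule terms only if the best candidate term
    beats the second best by more than the Hoeffding bound."""
    return (p_best - p_second_best) > hoeffding_bound(value_range, delta, n)
```

For example, with δ = 0.01 and 200 instances, the bound is roughly 0.107, so a best term must outscore the runner-up by more than that margin before the rule is specialised further.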
For example, if the rule's minimum number of classification attempts before removal is only 1, then it would be removed already if the first classification attempt fails. However, with the default settings, the rule would at least 'survive' 5 attempts. Assuming that only the first of the 5 attempts fails, the rule would be retained, as it has an accuracy of 4/5 = 0.8, which is the minimum classification accuracy required.
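The bookkeeping behind this removal policy can be sketched as follows. This is an illustrative Python sketch with the default thresholds from Section 3.5.2; the class name is ours:

```python
class RuleStats:
    """Per-rule bookkeeping for online rule validation (Section 3.5.2).
    A rule is removed once its accuracy drops below `min_accuracy` AND it
    has taken part in at least `min_attempts` classification attempts."""

    def __init__(self, min_accuracy=0.8, min_attempts=5):
        self.correct = 0
        self.attempts = 0
        self.min_accuracy = min_accuracy
        self.min_attempts = min_attempts

    def record(self, was_correct):
        # called after each classification attempt the rule takes part in
        self.attempts += 1
        if was_correct:
            self.correct += 1

    @property
    def accuracy(self):
        return self.correct / self.attempts if self.attempts else 1.0

    def should_remove(self):
        # "drops below" the threshold: exactly 0.8 is still retained
        return (self.attempts >= self.min_attempts
                and self.accuracy < self.min_accuracy)
```

With the defaults, a rule that fails only the first of 5 attempts sits exactly at 4/5 = 0.8 and is retained, reproducing the example above.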
The default settings may be adjusted according to the user requirements. A lower minimum accuracy will result in the classifier adapting more slowly; however, a high accuracy may result in rules expiring quickly, and thus more computation is required to induce new rules. Also, a low number of minimum classification attempts will result in rules expiring quickly, and a high number of minimum classification attempts will result in a slower adaptation. In
Fig. 6. Combining data instances that satisfy the Hoeffding bound from the previous window with the unseen data instances from the current window.
our experiments, we have found that the default values work well in most cases. The default values have been used in all experimental results presented in this paper.
3.5.3. Storing data instances that do not satisfy the Hoeffding bound
One of the notable features of Hoeffding Rules is the use of the Hoeffding Inequality to determine the credibility of a rule term, as described in Section 3.4. For an algorithm based on the 'separate-and-conquer' strategy in batch data, a new rule term is searched for and added to the current rule until the rule only covers data instances of the target class. A sliding window technique is used, as described in Section 3.4, to actively learn rules in real-time. The window contains the most recent training data instances, and these data instances are used to induce classification rules over time. However, the Hoeffding Rules algorithm does not always induce rules that cover only examples of the target class, because Hoeffding Rules will stop inducing further rule terms if the rule does not satisfy the Hoeffding bound metric on the current subset of data instances.
As illustrated in Fig. 6, once all possible rules are induced from the sliding window, all data instances that are not covered by the newly created rules are stored in a buffer. This buffer is combined with the next window of unseen data instances from the stream. Hence, after the first window, each sliding window is filled with unseen data instances from the stream and the instances from the Hoeffding bound buffer from the previous windows. The Hoeffding bound buffer contains instances that are not covered by the current rule set.
3.5.4. Addition of new rules
The addition of new rules also takes place online. As outlined in Section 3.5.2, Hoeffding Rules applies its current rules to new data instances that are already labelled, in order to evaluate the rule set's accuracy. However, if none of the rules applies to a labelled data instance, then this data instance is added to the window. Once the window of unseen instances reaches the defined threshold, the data instances are learnt as outlined in Algorithm 2 to induce new rules, which are then added to the current classifier. Next, the window is reset by removing all instances from it.

This is based on the assumption that the instances in the window will primarily cover concepts that are not reflected by the current classifier. Thus, rules induced from this window will primarily reflect these missing concepts. By adding these rules to the classifier, it is expected that the classifier will adapt automatically to new emerging concepts in the data stream.
4. Experimental evaluation and discussion
An empirical evaluation has been conducted to evaluate Hoeffding Rules in terms of accuracy, adaptivity to concept drift and the trade-off in accuracy between a white box model such as Hoeffding Rules and a less expressive model such as Hoeffding Tree. In addition, Hoeffding Rules has also been evaluated in terms of its expressiveness compared with its more direct competitors, Hoeffding Tree and VFDR. This has been accomplished empirically, by measuring the average number of decision steps needed for classifying an unseen data instance, and qualitatively, by examining some of the decision rules induced by Hoeffding Rules on two examples. The implementation of the proposed learning system was developed in Java, using the MOA [14] environment as a test-bed. MOA stands for Massive Online Analysis and is an open-source framework for data stream mining. Related to the WEKA project [46], it includes a collection of machine learning algorithms and evaluation tools specialised for data stream learning problems. The MOA evaluation features (i.e., prequential error, implemented as EvaluatePrequential) were used in our experiments. For the MOA evaluation, the sampling frequency was set to 10,000. This technique and setting are commonly used in the data stream mining literature, such as in [14,47]. In the remainder of this section, unless stated otherwise, the default parameters of the MOA platform were used.
4.1. Experimental setup
Two different classifiers were used for comparing and analysing the Hoeffding Rules algorithm.
• VFDR (Very Fast Decision Rules) [13] is, to the best of our knowledge, the state-of-the-art in data stream rule learning.
• Hoeffding Tree [10] is the state-of-the-art in decision tree learning from data streams.
The rationale behind this choice is twofold: (1) Hoeffding Tree has established itself over the last decade as the state-of-the-art in data stream classification, with a long history of success; and (2) both techniques use the Hoeffding bound for result approximation, which we have also adopted in our technique. It is worth noting that opaque black box methods recently trialled in a data stream setting, such as deep learning [48], lack the expressiveness element, and thus have not been chosen for our experimental study.
Table 1 shows the default parameters for the classifiers that were used in all of our experiments unless stated otherwise.
4.2. Datasets
Different synthetic and real world datasets were used in our experiments. As for synthetic datasets, stream generators available in MOA were used. The real world datasets 'Airlines' and 'Forest Covertype' are known and used for batch learning, in which case all data instances from the datasets are read and learnt in one pass. However, we simulate these datasets as data streams by reading data instances from them in ordered sequence over time.
Each dataset used in our experiments can be summarised as follows:
• SEA artificial generator was introduced in [49] to test their stream ensemble algorithm. The dataset has two class labels and three continuous attributes, of which one attribute is irrelevant to the class labels and the underlying concept of the data stream. More information on how this data stream is generated is described in [49]. Bifet et al. [9,13,15] also use this dataset in their empirical evaluations among other datasets.
• Random Tree Generator was introduced in [10] and generates a stream based on a randomly generated tree. The generator is based on what is proposed in [10]. It produces concepts that in theory should favour decision tree learners. It constructs a decision tree by choosing attributes at random for splitting and assigns a random class label to each leaf. Once the tree is built, new examples are generated by assigning uniformly distributed random values to attributes, which then determine the class label using the randomly generated tree.
• STAGGER was introduced by Schlimmer and Granger [50] to test the STAGGER concept drift tracking algorithm. The STAGGER concepts are available as a data stream generator in MOA and have been used as a benchmark to test for concept drift in [50]. The dataset represents a simple block world defined by three nominal attributes, size, colour and shape, each comprising 3 different values. The target concepts are:
size ≡ small ∧ color ≡ red
color ≡ green ∨ shape ≡ circular
size ≡ (medium ∨ large)
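Expressed as code, the three target concepts are simple boolean predicates over a block. This is an illustrative Python sketch; the dictionary keys are our naming of the three nominal attributes:

```python
# The three STAGGER target concepts as boolean predicates over a block
# described by (size, colour, shape).
concepts = [
    lambda b: b["size"] == "small" and b["colour"] == "red",
    lambda b: b["colour"] == "green" or b["shape"] == "circular",
    lambda b: b["size"] in ("medium", "large"),
]

block = {"size": "small", "colour": "red", "shape": "square"}
labels = [c(block) for c in concepts]   # which concepts the block satisfies
```

A small red square block, for instance, satisfies only the first concept.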
While performing preliminary experiments with the data stream generators in MOA, it became apparent that the concepts defined did not match the ones proposed in the original paper. This was observed from the rules generated by our approach from the STAGGER data stream, which highlights the expressiveness of the rules induced by Hoeffding Rules. The 'bug' was reported to the MOA development team and the current, corrected implementation of the generator has been used for the experiments presented in this section.
• Forest CoverType contains the forest cover type for 30 × 30 meter cells obtained from US Forest Service (USFS) Region 2 Resource Information System (RIS) data. It contains 581,012 instances and 54 attributes, and it has been used in several papers on data stream classification, e.g., in [51].
• Airlines dataset was generated based on the regression dataset by Elena Ikonomovska, which consists of about 500,000 flight records. The main task for this dataset is to predict whether a given flight will be delayed based on the information of the scheduled departure. This dataset has three continuous and four categorical attributes. Elena Ikonomovska also uses this dataset in one of her studies on data streams [52]. This dataset was downloaded from the MOA website and used in our empirical experiments without any modifications.
All synthetic data stream generators are controllable by parameters, and Table 2 shows the settings used for all synthetic streams in our evaluation.
4.3. Utility of expressiveness
The empirical evaluation is focused on the cost of expressiveness when comparing the accuracy and performance between classifiers for data streams.
4.3.1. Accuracy loss band
As mentioned in [17], a model learnt from labelled data instances may produce high predictive accuracy for unlabelled data, but the learnt model can be hard and complex to understand for human users or even domain experts. All classifiers were evaluated on the same base datasets in order to examine the trade-off between rule-based classifiers such as VFDR and Hoeffding Rules
Table 1
Parameter settings for classifiers used in the experiments.

Hoeffding Rules:
  HR.SW (sliding window size): 200
  HR.MRT (minimum rule tries): 3
  HR.RVT (rule validation threshold): 0.7
  HR.HBT (Hoeffding bound threshold): 0.01
  HR.APT (adaptation threshold): 0.5

VFDR (Very Fast Decision Rules):
  VF.P (% of total samples seen in the node): 0.1
  VF.SC (split confidence): 0.0
  VF.TT (tie threshold): 0.05
  VF.AAP (anomaly probability threshold): 0.99
  VF.PT (probability threshold): 0.1
  VF.AT (anomaly threshold): 15
  VF.GP (grace period): 200
  VF.PF (prediction function): First hit
  VF.OR (ordered rules): false
  VF.AD (anomaly detection): false

Hoeffding Tree:
  HT.NUE (numeric attribute estimator): Gaussian observer
  HT.NOE (categorical attribute estimator): Nominal observer
  HT.GP (grace period): 200
  HT.SC (split criterion): Information gain
  HT.SCON (split confidence): 0.0
  HT.TTH (tie threshold): 0.05
  HT.BS (binary split): false
  HT.RPA (remove poor attribute): false
  HT.NPP (no pre-prune): false
  HT.LP (leaf predictor): NBAdaptive
Table 2
Parameter settings for synthetic stream generators used in the experiments.

Random Tree (no drift):
  RT.TRSV (tree random seed value): 1; RT.ISV (instance seed value): 1; RT.NCL (number of class labels): 4; RT.NCA (number of categorical attributes): 5; RT.NNA (number of numerical attributes): 5; RT.NVPCA (number of values per categorical attribute): 5; RT.MTD (max tree depth): 5; RT.FLL (first leaf level): 3; RT.LF (leaf fraction): 15%

Random Tree with drift:
  Before drift: as above (RT.TRSV: 1; RT.ISV: 1). After drift: RT.TRSV: 5; RT.ISV: 5; all other parameters unchanged. Drift at: 150,000; drift width: 10,000.

SEA (no drift):
  SEA.F (classification function as defined in the paper): 1; SEA.IRS (seed for random generation of instances): 1; SEA.BC (balanced class): true; SEA.NP (noise percentage): 10%

SEA with drift:
  Before drift: SEA.F: 1. After drift: SEA.F: 2; SEA.IRS: 1, SEA.BC: true and SEA.NP: 10% unchanged. Drift at: 150,000; drift width: 10,000.

STAGGER (no drift):
  ST.IRS (instance random seed): 1; ST.F (classification function): 1; ST.BC (balanced class): true

STAGGER with drift:
  Before drift: ST.F: 1. After drift: ST.F: 2; ST.IRS: 1 and ST.BC: true unchanged. Drift at: 150,000; drift width: 10,000.

* 400,000 data instances are generated for each experiment.
* In each experiment, all classifiers are given identical data instances in the same sequenced order.
compared with tree based classifiers such as Hoeffding Tree. Concept drift was also simulated in all synthetic datasets from 150,000 data instances onwards for approximately 1000 further instances, where both concepts were present before switching completely to the new concept. The accuracy loss band ζ can be either positive or negative, where positive values indicate that Hoeffding Rules achieves a better accuracy and negative values indicate the shortfall in accuracy compared with its competitor.

Figs. 7a–f and 8a and b show that the accuracy loss band ζ of Hoeffding Rules is very competitive compared with Hoeffding Tree, while clearly outperforming VFDR in most cases. The reader should note that the existing implementations of Hoeffding Tree, VFDR and the synthetic data generators in MOA were used; these classifiers may have been optimised to work well on these synthetic datasets. Two real datasets, Airlines and Covertype, are chosen and included for an unbiased evaluation. VFDR is the closest algorithm to Hoeffding Rules because it is a native rule-based classifier with the ability to produce rules directly from the seen labelled data instances. However, VFDR does not offer abstaining and forces a classification. Evidently, Hoeffding Rules has a positive loss band compared with VFDR on both real and synthetic datasets and outperforms Hoeffding Tree on the Airlines dataset, while suffering a minor negative loss band on a few occasions on the Covertype dataset.
4.3.2. Cost of expressiveness
We also estimated the cost of expressiveness in terms of the steps required for a model to predict a class label. This evaluation is focussed on the most expressive rule and tree based algorithms. More decision making steps in rule and tree based algorithms imply more, and potentially unnecessary and costly, tests for a user to obtain a classification, which is not desirable [12,53]. Support Vector Machine and Artificial Neural Network based algorithms are not investigated here as they are nearly non-expressive and difficult to comprehend by human analysts. Also, instance based and probabilistic models are not very interesting to the human analyst in terms of expressiveness, as they only indirectly explain a classification through either the enumeration of data instances in the
Fig. 7. Difference in accuracy compared with other classifiers for synthetic data streams.
Fig. 8. Difference in accuracy compared with other classifiers for real data streams.
case of instance based learners, or through basic probabilities in the case of probabilistic learners.
Classification steps for Hoeffding Rules and VFDR refer to the number of rule terms (conditions) in a rule that are needed for classifying a data instance. Similarly, classification steps for Hoeffding Tree refer to the number of nodes that need to be visited (conditions) from the root of the tree to the leaf that provides a particular classification. As shown in Fig. 9a–h, Hoeffding Rules is more competitive in terms of the number of classification steps over time compared with the Hoeffding Tree classifier. We observed in all experiments that Hoeffding Tree does not start building its tree model until several thousand data instances have been buffered to satisfy the Hoeffding Inequality. Hoeffding Tree does not limit the depth of the tree, nor does it have a built-in pruning mechanism to restrict its size, thus it grows larger over time. This is reflected in the increasing number of steps required over time to classify data instances, as can be seen in Fig. 9a–h. In addition, we expect all three classifiers to adapt to concept drifts. The rule-based classifiers will change their rule sets, whereas Hoeffding Tree will replace obsolete subtrees with newer subtrees. Either way, a concept drift may alter the number of steps needed to reach a classification significantly. This explains the more abrupt changes of
Fig. 9. The average number of classification steps needed by Hoeffding Rules compared with Hoeffding Tree and VFDR.
Table 3
Accuracy evaluation between Hoeffding Tree, VFDR and Hoeffding Rules.
(For SEA, Random Tree and STAGGER the two values are: no drift / with concept drift.)

Algorithm        Measure (%)          SEA            Random Tree     STAGGER         Forest Covertype  Airlines
Hoeffding Rules  Tentative accuracy   82.6 / 84.30   81.86 / 75.62   100 / 98.50     74.24             66.74
                 Abstaining rate      8.07 / 6.43    49.51 / 56.02   22.27 / 28.52   10.04             16.47
VFDR             Overall accuracy     81.3 / 82.26   52.28 / 47.07   100 / 86.20     61.32             62.50
Hoeffding Tree   Overall accuracy     88.08 / 89.23  88.98 / 61.08   100 / 99.75     82.04             66.04
Fig. 11. Learning time of Hoeffding Rules, Hoeffding Tree and VFDR.
classification steps of some algorithms for the STAGGER and Covertype data streams depicted in Fig. 9f and h. Regarding the Covertype data stream, it consists of real data and thus it is not known if there is a concept drift. However, the more abrupt changes in the number of classification steps indicate that there is a drift starting somewhere between instances 150,000 and 200,000. Thus we have used a separate concept drift detection method, a version of the Micro-Cluster based concept drift detection method presented in [54,55], which has confirmed a drift starting at position 150,000 instances.
Fig. 9a–h also compares the average number of classification steps of Hoeffding Rules with VFDR. The figures show that VFDR has a very similar average number of classification steps compared with Hoeffding Rules. For all observed cases, except for the Covertype data stream, Hoeffding Rules needs an average of only about 1.5 classification steps. However, as shown in Table 3, Hoeffding Rules achieves a much higher classification accuracy than VFDR and in most cases also requires less time to learn, as illustrated in Fig. 11. In most cases where there is concept drift, it can be seen that the number of average classification steps increases for Hoeffding Tree, whereas Hoeffding Rules' and VFDR's average number of classification steps stays almost constant.
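The step-counting metric can be sketched as follows: for a rule it is the number of rule terms, for a tree the number of internal nodes visited from the root to the reached leaf. This is an illustrative Python sketch; the tuple-based tree encoding is a hypothetical minimal one, not MOA's:

```python
def rule_steps(rule_terms):
    """Steps to classify with a rule = number of its conditions (terms)."""
    return len(rule_terms)

def tree_steps(node, instance):
    """Steps to classify with a tree = nodes visited from root to leaf.
    `node` is either a leaf label or a tuple
    (attribute, threshold, left_child, right_child) — a hypothetical
    minimal binary-tree encoding used only for illustration."""
    steps = 0
    while isinstance(node, tuple):
        attr, thr, left, right = node
        node = left if instance[attr] <= thr else right
        steps += 1
    return steps
```

For example, the Covertype rule "IF Soil-Type40 = 0 AND Soil-Type30 = 0 AND Soil-Type38 = 0 THEN Class = 0" costs 3 steps, while a tree's cost grows with the depth of the leaf that is reached.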
The number of steps required for a classification task demonstrates the effort required to translate a classification from a tree based model into the form of an expressive 'IF ... THEN ...' rule. The current implementation of Hoeffding Tree in MOA, as well as the Hoeffding Tree algorithm presented in its original paper [10], do not provide any mechanism for translating a leaf into an expressive rule. One can argue that a set of decision rules can be easily extracted from an existing tree model, as Quinlan [56] mentioned as early as 1986. An extracted decision rule from a hierarchical model represents logic tests from the root node to a leaf. However, as Hoeffding Tree is designed to adapt to changes in data streams, this extraction process would need to be repeated every time the tree expands or changes its structure. Therefore, maintaining an accurate and up-to-date rule set from a Hoeffding Tree could be a challenging and computationally demanding task.
For synthetic datasets with concept drift at 150,000, in Fig. 9b, d and f, we detected notable changes in the number of steps required for classification using Hoeffding Tree, but not when using Hoeffding Rules and VFDR. This is expected behaviour because Hoeffding Tree may need to replace an entire subtree or, in the worst case, the entire tree from the root node if there is a drift in the data stream. However, rule-based classifiers such as Hoeffding Rules and VFDR do not need to replace larger numbers of rules, as rules can be examined, altered and replaced individually. Examining, replacing and altering individual 'rules' (decision paths from the root node to a leaf node) is not possible in trees without also altering further 'rules' that are connected to the rule to be changed through intermediate tree nodes. For real datasets, we do not have an absolute ground truth whether concept drifts are encoded in the data or not, but we expect to see concept drifts in real-life data streams. In particular, we saw a correlated behaviour between abstaining and classification steps in Figs. 9h and 10 for the Covertype dataset between 150,000 and 200,000 instances. As mentioned before, we have used a version of the Micro-Cluster based concept drift detection method presented in [54,55], which confirmed a drift starting at position 150,000 instances. In this particular case, Hoeffding Rules dropped slowly in the average number of steps needed for classification, and Hoeffding Tree suddenly stopped growing and stalled for about 50,000 data instances before starting to grow further. Hence, we believe that Hoeffding Tree also adapted to a drift here.
4.3.3. Expressive rules' ranking and interpretation
At any given time a user can inspect the decision rules directly from the rule set produced by Hoeffding Rules and can be confident that the rules reflect the underlying pattern encoded in the data stream at that time. For example, we extracted the top 3 rules (based on the rules' individual classification accuracy) learnt from the two real datasets, Covertype and Airlines, at 50,000 data instances:
• Covertype dataset:
  • IF Soil-Type40 = 0 AND Soil-Type30 = 0 AND Soil-Type38 = 0 THEN Class = 0
  • IF Soil-Type38 = 1 THEN Class = 7
  • IF Soil-Type10 = 1 AND 0.57 < Aspect <= 0.64 THEN Class = 6
• Airlines dataset:
  • IF AirportTo = CHA AND DayOfWeek = 3 AND AirportFrom = ATL AND Airline = EV THEN Class = 1
  • IF AirportFrom = CLT THEN Class = 0
  • IF AirportTo = AZO THEN Class = 1
As can be seen, rules induced by Hoeffding Rules can be modular (independent from each other), meaning that the rules do not necessarily have any attributes in common, which is not possible when rules are represented in a tree structure. Rules extracted from a tree will have at least the attribute chosen for the split at the root node in common, even if this attribute may not be necessary for some classification tasks. Thus a tree structure may result in potentially unnecessary and costly tests for the user. This is also known as the replicated subtree problem [30]. We refer to Section 3.2 for an explanation of the replicated subtree problem.
In addition, as mentioned in Section 4.2, while performing preliminary experiments with the data stream generators in MOA, it became apparent that the concepts defined for the STAGGER generator did not match the ones proposed in the original paper [50]. This further highlights the expressiveness of the rule sets induced by Hoeffding Rules. This 'bug' was reported to the MOA development team and a corrected version of the stream generator was used for the experiments described in this paper.
4.4. Abstaining from classification
The results presented in Section 4.3 showed the tolerance of rule-based classifiers compared with decision tree based classifiers for the utility of expressiveness. Another important feature of Hoeffding Rules is the ability to abstain. In Fig. 10, we observe that the abstaining rate of Hoeffding Rules decreases as Hoeffding Rules processes more instances, except for the Random Tree dataset. For synthetic datasets with concept drift, we also notice that at the point of simulated concept drift (150,000), the abstaining rate spiked up but quickly recovered as the classifier adapted to the new concept. This indicates one of the major benefits of abstaining instances from classification: when Hoeffding Rules is uncertain about a classification, it does not risk classifying unseen data instances. This feature is desirable or even crucial in domains and applications where misclassification is costly or irreversible. As mentioned before, we have used a version of a Micro-Cluster based concept drift detection method presented in [54,55], which confirmed a drift starting at position 150,000 instances.
Table 3 shows the accuracy evaluation between Hoeffding Tree, VFDR and Hoeffding Rules. One unique feature of Hoeffding Rules is its ability to refuse a classification if the classifier is not confident enough to output a decisive prediction from its rule set. We recorded the tentative accuracy and abstaining rate for Hoeffding Rules, as well as the overall accuracy for Hoeffding Tree and VFDR. Tentative accuracy for Hoeffding Rules is the accuracy for instances where the classifier is confident to produce a reliable prediction. The tentative accuracy was also used for estimating the utility of expressiveness shown in Section 4.3.
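These two measures can be sketched as follows, treating an abstained prediction as `None`. This is an illustrative Python sketch; the function name and the `None` convention are ours:

```python
def tentative_accuracy_and_abstain_rate(predictions, truths):
    """Tentative accuracy = accuracy over the instances the classifier did
    not abstain on (prediction is not None); abstaining rate = fraction of
    abstained instances. Mirrors the measures reported in Table 3."""
    abstained = sum(1 for p in predictions if p is None)
    decided = [(p, t) for p, t in zip(predictions, truths) if p is not None]
    correct = sum(1 for p, t in decided if p == t)
    tentative_acc = correct / len(decided) if decided else 0.0
    abstain_rate = abstained / len(predictions)
    return tentative_acc, abstain_rate
```

For example, with predictions [1, None, 0, 1] against true labels [1, 0, 0, 0], the classifier abstains on 1 of 4 instances (rate 0.25) and is correct on 2 of the 3 decided ones.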
We can see that Hoeffding Rules outperforms VFDR in all cases, both on synthetic and real data streams. Hoeffding Rules is also competitive with Hoeffding Tree; the accuracy of both classifiers is very close, with Hoeffding Rules outperforming Hoeffding Tree in 3 cases. However, compared with Hoeffding Tree, Hoeffding Rules produces a more expressive rule set and fundamentally does not suffer from the replicated subtree problem. Also, the classification rules generated by Hoeffding Rules can be easily interpreted and examined by human users or domain experts. Decision trees would first need to undergo a further processing step before a human user can interpret the rules. This would translate the tree into rules by single passes from the root node down to each leaf. This may well be too cumbersome and too time consuming a task for the decision taker.
Automating tree traversal to increase expressiveness is an additional process that is linear with respect to the number of tree nodes, or in the best case of balanced trees, it can be O(log(tn)), where tn is the number of nodes in the tree [57].
4.5. Computational efficiency
In order to examine Hoeffding Rules' computational efficiency, we have compared Hoeffding Rules, VFDR and Hoeffding Tree on the same data streams as in Table 3. As shown in Fig. 11, the execution time of Hoeffding Rules outperforms that of VFDR by far and is also very close to that of Hoeffding Tree.
It can be seen that Hoeffding Rules is particularly superior to VFDR for data streams with mostly numerical attributes, such as the SEA and Random Tree (RT) data streams. This is expected, as VFDR needs many cut-point calculations for inducing new rule terms from numerical attributes, whereas Hoeffding Rules just needs to update the Gaussian Probability Density Distributions. This computational difference has been discussed in Section 3.3.
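One standard way to maintain such a per-class Gaussian incrementally is Welford's algorithm, which costs O(1) per instance rather than re-scanning candidate cut-points. This is an illustrative sketch; the paper does not specify the exact update scheme used:

```python
class RunningGaussian:
    """Incremental mean/variance via Welford's algorithm — one O(1) update
    per instance. One such accumulator per (class, continuous attribute)
    would suffice to keep the Gaussian density estimates current."""

    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)   # running sum of squared deviations

    @property
    def variance(self):
        # population variance; returns 0.0 before any data arrives
        return self.m2 / self.n if self.n else 0.0
```

After feeding the values 1, 2, 3, 4 the accumulator holds mean 2.5 and population variance 1.25, without ever storing the values themselves.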
In summary, loosely speaking, Hoeffding Rules shows a much better performance in terms of utility of expressiveness compared with its direct competitor VFDR and is also competitive compared with its less expressive competitor Hoeffding Tree. The same holds for the evaluation with respect to computational efficiency: Hoeffding Rules' runtimes are much shorter compared with VFDR and only slightly longer compared with Hoeffding Tree.
5. Conclusions
The research presented in this paper is motivated by the fact that rule-based data stream classification models are more expressive than other models, such as decision tree models, instance based models and probabilistic models. Inducing a classifier on data streams poses some unique challenges compared with data mining from batch data, as the pattern encoded in the stream may change over time, which is known as concept drift. While most data stream mining classification techniques focus on achieving a high accuracy and quick adaptation to concept drift, they are often rather unfriendly, cumbersome or too complex to provide trustworthy decisions to users, which is undesirable in many domains such as surveillance or medical applications.
This paper proposed the new Hoeffding Rules data stream classifier, which focusses on producing an expressive rule set that adapts to concept drift in real-time. Compared with less expressive data stream classifiers, Hoeffding Rules explains how a decision is reached. The algorithm is based on a 'separate-and-conquer' approach and the Hoeffding bound to adapt the rule set to concept drifts in the data. In contrast to existing well-established data stream classifiers, Hoeffding Rules may decide to abstain from classifying a data instance if it is uncertain about the true class label. This again is desirable in applications where a false classification label may be very costly, such as in medical applications or network intrusion detection. Additionally, the abstained data instances can also be considered for active learning, which is a direction to go forward for our proposed Hoeffding Rules classifier to improve and maximise the effectiveness of the learnt model. In this way, the abstaining feature is not just reducing misclassification but can also potentially improve the accuracy of the overall model.
An empirical evaluation examined the utility of inducing expressive rules with Hoeffding Rules compared with its competitors VFDR and Hoeffding Tree. VFDR is another highly expressive rule-based classifier, and Hoeffding Tree is a less expressive but state-of-the-art tree based data stream classifier. The evaluation measured the loss band