
e-NSP: Efficient negative sequential pattern mining


Contents lists available at ScienceDirect

Artificial Intelligence

www.elsevier.com/locate/artint

Longbing Cao a,*, Xiangjun Dong b, Zhigang Zheng c

a University of Technology Sydney, Australia
b Qilu University of Technology, Jinan, China
c University of Technology Sydney, Australia

Article info

Article history:
Received 13 January 2015
Received in revised form 3 March 2016
Accepted 7 March 2016
Available online 10 March 2016

Keywords:
Negative sequence analysis
Sequence analysis
Behavior analytics
Non-occurring behavior
Behavior informatics
Behavior computing
Pattern mining

Abstract

As an important tool for behavior informatics, negative sequential patterns (NSP) (such as missing medical treatments) are critical and sometimes much more informative than positive sequential patterns (PSP) (e.g. using a medical service) in many intelligent systems and applications such as intelligent transport systems, healthcare and risk management, as they often involve non-occurring but interesting behaviors. However, discovering NSP is much more difficult than identifying PSP due to the significant problem complexity caused by non-occurring elements, high computational cost and huge search space in calculating negative sequential candidates (NSC). So far, the problem has not been formalized well, and very few approaches have been proposed to mine for specific types of NSP, which rely on database re-scans after identifying PSP in order to calculate the NSC supports. This has been shown to be very inefficient or even impractical, since the NSC search space is usually huge. This paper proposes a very innovative and efficient theoretical framework: set theory-based NSP mining (ST-NSP), and a corresponding algorithm, e-NSP, to efficiently identify NSP by involving only the identified PSP, without re-scanning the database. Accordingly, negative containment is first defined to determine whether a data sequence contains a negative sequence based on set theory. Second, an efficient approach is proposed to convert the negative containment problem to a positive containment problem. The NSC supports are then calculated based only on the corresponding PSP. This not only avoids the need for additional database scans, but also enables the use of existing PSP mining algorithms to mine for NSP. Finally, a simple but efficient strategy is proposed to generate NSC.
Theoretical analyses show that e-NSP performs particularly well on datasets with a small number of elements in a sequence, a large number of itemsets and low minimum support. e-NSP is compared with two currently available NSP mining algorithms via intensive experiments on three synthetic and six real-life datasets from aspects including data characteristics, computational costs and scalability. e-NSP is tens to thousands of times faster than baseline approaches, and offers a sound and effective approach for efficient mining of NSP in large scale datasets by directly using existing PSP mining algorithms.

© 2016 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

The source codes of e-NSP are available from http://www-staff.it.uts.edu.au/~lbcao/.

* Corresponding author.
E-mail address: [email protected] (L. Cao).
http://dx.doi.org/10.1016/j.artint.2016.03.001
0004-3702/© 2016 The Authors. Published by Elsevier B.V.


1. Introduction

Behavior is widely seen in our daily study, work, living and entertainment [7]. A critical issue in understanding behavior from the informatics perspective, namely behavior informatics [6,9], is to understand the complexities, dynamics and impact of non-occurring behaviors (NOB) [8]. Mining negative sequential patterns (NSP) [43] is one of the few approaches available for understanding NOB. NSP refer to frequent sequences with non-occurring and occurring behaviors (also called negative and positive behaviors in behavior and sequence analysis), such as a driver failing to stop before driving through an intersection with a stop sign, or a missing treatment in a medical service.

Discovering NSP is becoming increasingly important, and sometimes plays a role that cannot be replaced by analyzing occurring behaviors alone in many intelligent systems and applications, such as intelligent transport systems (ITS), health and medical management systems, bioinformatics, biomedical systems, risk management, counter-terrorism, and security. For example, in ITS, negative driving behavior patterns that result from drivers failing to follow certain traffic rules could cause serious traffic problems or even disasters. In healthcare, a patient missing an important medical appointment might result in serious health issues. In gene sequencing, the non-occurrence of certain genes may be associated with particular diseases. Such problems cannot be handled by the identification of occurring behavior patterns alone.

Formally, a NSP in healthcare may appear as follows. Assume $p_1 = \langle abc\,X \rangle$ is a positive sequential pattern (PSP); $p_2 = \langle ab\,\neg c\,Y \rangle$ is a NSP, where a, b and c stand for medical service codes indicating the services a patient has received in healthcare, and X and Y stand for disease status. $p_1$ shows that patients who usually receive medical services a, b and then c are likely to have disease status X, whereas $p_2$ indicates that patients receiving treatments of a and b but NOT c have a high probability of having disease status Y.

Although intensive efforts have been made to develop PSP (such as $p_1$) mining algorithms such as GSP [32], FreeSpan [15], SPADE [34], PrefixSpan [30], and SPAM [4], NSP (such as $p_2$) cannot be described or discovered by these algorithms. This is because mining NSP is much more difficult than mining PSP [8], particularly due to the following three intrinsic complexities.

• Problem complexity. The hidden nature of non-occurring items makes definition of the NSP mining problem complicated, particularly the NSP format and negative containment. This is why researchers present different and even inconsistent definitions and constraints in their identification of NSP. As research into NSP is at a very early stage, it is important to formalize the problem properly and comprehensively.

• High computational complexity. Existing methods calculate the support of negative sequential candidates (NSC) by additionally scanning the database after identifying PSP. This leads to additional costs and low efficiency in mining NSP. It is thus essential to develop efficient NSP mining methods without database re-scanning.

• Large NSC search space. The existing approaches generate k-size NSC by conducting a joining operation on (k-1)-size NSP. This results in a huge number of NSC [11,24,26,38], which makes it difficult to search for meaningful outputs. Further, NSC does not satisfy the Apriori principle [38]. It is a challenge to prune the large proportion of meaningless NSC, and it is thus important to develop efficient approaches for generating a limited number of truly useful NSC.

NSP mining is at an early stage, and has seen only very limited progress in recent years [3,11,12,14,16–18,20,21,38–40,43]. All existing methods are very inefficient and are too specific for mining NSP. As NSP analysis is very complex, challenging and immature, the addition of appropriate constraints makes the problem solvable to some degree. All the reported work in NSP analysis therefore incorporates constraints on format, frequency and/or negative elements from respective aspects (see more discussion in Section 3.2.1) to reduce the number of NSC, discover specific NSP of particular interest, and enhance computational efficiency. More importantly, there are different definitions of the most important concept in NSP mining, negative containment, which defines whether a data sequence contains a negative sequence (see more details in Section 4.2). Some definitions are more generic and typical [11,12,38,39] than others which either incorporate additional constraints on negation format and containment [21–23,25–29] or offer no clear definition [18,31].

To address the above intrinsic complexities in NSP mining and make it efficient for real-life applications, this paper proposes an innovative, flexible and efficient framework: the set theory-based NSP mining framework (ST-NSP) and a comprehensive algorithm called e-NSP to instantiate the ST-NSP framework. We first formalize the NSP mining problem by defining some important concepts in NSP, including negative containment. The formalization draws a clear boundary between whether a data sequence set contains a NSC or not. Building on set theory, e-NSP then calculates the support of NSC based on the support of the corresponding PSP, without additional database scans. We convert the negative containment problem to a positive containment problem. The NSC supports are then calculated by only using a NSC's corresponding PSP information. In this way, there is no need to re-scan the database after discovering PSP. More importantly, any existing PSP algorithms can then be directly used or slightly changed to discover NSP.

We specify the frequency, format and negative element constraints in e-NSP to make the problem consistent with set theory (frequency constraint), reduce confusion, ambiguity and uncertainty (format constraint), and to control computational complexity (negative element constraint). Our definitions and specifications of such constraints and negative containment form the ST-NSP framework and make e-NSP consistent with set theory, which thus enables e-NSP to be much more efficient and flexible than existing methods.


Below, we summarize the main differences between our work and the related works in terms of the constraints imposed on NSP, key definitions and concepts, as well as the significant contributions made in this work:

• An innovative and very efficient framework ST-NSP and its corresponding theoretical design and algorithm e-NSP are proposed to discover NSP within one database scan; e-NSP is built on set theory and is the only work reported so far that uses set theory for sequence analysis, which opens a new paradigm for NSP analysis.

• e-NSP is built on a systematic and comprehensive statement and formalization of the NSP problem, and incorporates new concepts, specifications on constraints and effective technical design that make it efficient and flexible, in addition to offering substantial theoretical analysis in terms of data characteristics represented by data factors and computational costs.

• e-NSP introduces three constraints on frequency, format and negative elements respectively, which not only make e-NSP consistent with typical existing work but also support the proposed framework of set theory-based NSP study.

• Theoretical analysis shows that e-NSP is much less sensitive to data factors and is much more efficient on datasets that have a small number of elements in a sequence and a large number of itemsets, compared to available baseline algorithms we can find. This advantage of e-NSP is especially clear when minimum support is low. It is thus very suitable for large scale data.

Unfortunately, it is not easy to find baseline datasets and NSP algorithms that satisfy similar NSP settings. We compare e-NSP with two modified NSP algorithms in the literature using intensive experiments on three synthetic and six real-world datasets from many perspectives, including computational complexity against different minimum supports on 8 distinct datasets, data characteristics analysis on 48 combinations of various data factors on 16 subsets, and scalability tests on two scalable datasets with low minimum support. The experimental results verify the theoretical analyses, showing that e-NSP is much more flexible and efficient and is tens to thousands of times faster than the baseline methods, thus highly suitable for large scale datasets. To the best of our knowledge, this is the first approach that efficiently discovers NSP by involving PSP only without rescanning databases, directly applies existing PSP discovery algorithms, and is particularly effective for very large datasets. This demonstrates the significant value of the proposed ST-NSP framework.

The remainder of the paper is organized as follows. Section 2 discusses the related work and gaps in the current knowledge. In Section 3, we formalize the problem of mining PSP and NSP, providing corresponding definitions and constraints. The ST-NSP framework and the e-NSP algorithm are detailed in Section 4. The theoretical analyses of e-NSP compared to baseline are presented in Section 5 from the perspective of data factors. Section 6 presents substantial experimental results and a real-life case study. Discussions on several critical issues are offered in Section 7, followed by conclusions and future work in Section 8.

2. Related work

In this section, we first discuss the related work on a fundamental concept in NSP; that is, negative containment. Different researchers present inconsistent definitions and explanations of negative containment for respective interests and purposes. In [11], a data sequence $ds = \langle dc \rangle$ cannot contain negative sequence $ns = \langle \neg(ab)\,c\,\neg d \rangle$ since $size(ns) > size(ds)$; while the work in [38] allows that ds contains ns. Another critical issue is how to deal with a non-occurring element. Chen et al. [11] argued that $ds = \langle dc \rangle$ cannot contain $\langle \neg c\,d \rangle$ because $\langle d \rangle$ in ds has no antecedent itemset; ds cannot contain $\langle c\,\neg d \rangle$ because $\langle c \rangle$ in ds has no successor. However, Zheng et al. [38] allowed that ds contains $\langle c\,\neg d \rangle$. Furthermore, the containment position of each element is very tricky. Chen et al. [11] proposed that a data sequence $ds = \langle aacbc \rangle$ does not contain a negative sequence $ns = \langle a\,\neg b\,c \rangle$, since opposite evidence of $\langle abc \rangle$ can be found in ds. However, Zheng et al. [38] presented a divided opinion since $ds = \langle aacbc \rangle$ matches a and c; their algorithm finds the corresponding positive element in ds for each negative element of ns, such as the second a for $\neg b$. According to our understanding, since $\langle e \rangle$ means that e occurs, no element (including element e) occurs before or after e. Accordingly, $\langle e \rangle$ contains $\langle e\,\neg ? \rangle$, $\langle \neg ?\,e \rangle$, $\langle \neg ?\,e\,\neg ? \rangle$, where "?" represents any element.

Second, we summarize the status of the NSP research. Unlike PSP mining, which has been widely explored, very limited research outcomes are available in the literature on mining NSP. We briefly introduce what we have been able to find. Zheng et al. [38] proposed a negative version of the GSP algorithm, i.e. NegGSP, to mine for NSP. This algorithm first discovers PSP by GSP, then generates and prunes NSC. It then counts the support of NSC by re-scanning the database to generate negative patterns. Chen et al. [11] proposed a PNSP approach for mining positive and negative sequential patterns in the form of $\langle (abc)\,\neg(de)\,(ijk) \rangle$. This approach is broken into three stages. PSP are mined by traditional algorithms and all positive itemsets are derived from these PSP. All negative itemsets are then derived from these positive itemsets. Lastly, both positive and negative itemsets are joined to generate NSC, which are in turn joined iteratively to generate longer NSC in an Apriori-like way. This approach calculates the support of NSC by re-scanning the database. In [22–24], the authors only handled NSP in which the last element was negative. An algorithm NSPM in [24] mines such NSP. The extended versions of [24] are in [22,23], which add fuzzy and strong constraints respectively to NSPM. In [39], a genetic algorithm is proposed to mine NSP. It generates candidates by crossover and mutation by involving a dynamic fitness function to generate as many candidates as possible and avoid population stagnation. In [26], only NSP in the form of $(\neg A, B)$, $(A, \neg B)$ and $(\neg A, \neg B)$ are identified, which is similar to mining negative association rules [13,33]. The work in [26] requires $A \cap B = \emptyset$, which is a usual constraint in association rule mining but is a very strict constraint in sequential pattern mining. It generates frequent itemsets first, then generates frequent and infrequent sequences, and lastly derives NSP from the infrequent sequences. Three extended versions of [26] can be found in [27–29], in which conditions are added to fuzzy, multiple level and multiple minimum supports, respectively. Although the authors of [31] mentioned the question of mining NSP, they did not propose how to mine them. In [20], NSP are mined in the same form as [26] in incremental transaction databases.

Table 1
Notation description.

Symbol            Description
$I$               A set of items, $I = \{x_1, \ldots, x_n\}$, consisting of n items $x_k$ $(1 \le k \le n)$
$s$               A sequence, $s = \langle s_1, \ldots, s_l \rangle$, consisting of l elements $s_j$ $(1 \le j \le l)$
min_sup           Minimum support threshold
$ns$              A negative sequence
$length(s)$       Length of sequence s, referring to the number of items in all elements in s
$size(s)$         Size of a sequence s, referring to the total number of elements in s
$sup(s)$          The support of s
$PP(ns)$          ns's positive partner
$EidS_s$          Element-id set of sequence s
$OPS(EidS_s)$     Order-preserving sequence with $EidS_s$
$MPS(s)$          Maximum positive sub-sequence of s
$1$-$negMS_{ns}$  1-neg-size maximum sub-sequence of ns
$1$-$negMSS_{ns}$ 1-neg-size maximum sub-sequence set of ns
FSE or fse        First sub-sequence ending position
LSB or lsb        Last sub-sequence beginning position

Zhao et al. [35] proposed an approach to mining event-oriented negative sequential rules from infrequent sequences in the form of $\langle A \rangle \Rightarrow \langle \neg B \rangle$, $\langle \neg A \rangle \Rightarrow \langle B \rangle$, $\langle \neg A \rangle \Rightarrow \langle \neg B \rangle$. Based on the work in [35], Zhao et al. [36] also presented an approach for discovering both positive and negative impact-oriented sequential rules. Issues about sequence classification using positive and negative patterns were discussed in [25,37]. Positive and negative usage patterns are used in [19] to filter Web recommendation lists. None of these papers involve NSP mining directly.

The above discussions about negative containment and existing NSP research involve the key issue of various constraints applied to NSP mining. As detailed in Section 3.2.1, constraints are generally imposed according to the frequency of elements, patterns or positive partners, the format of continuous negative elements, and the negation of elements or items.

Another relevant research topic is negative association rule mining [5,33]. However, as the ordering relationship between items and elements in a sequence is inbuilt in NSP, it is much more challenging to discover NSP than negative associations and patterns. In fact, the ordinal nature of NSP means that algorithms for negative association rule mining and negative pattern mining cannot be directly used to mine NSP.

In summary, the above discussions show that NSP research presents the following strong early-stage characteristics:

• Significant inconsistency in key concepts and settings. In particular, there is no consolidated concept of negative containment in the literature. In Section 4.2, we will present a generic definition of negative containment and formalize the issue.

• Different constraints are incorporated into NSP, leading to varied settings and even divided assumptions about NSP. In Section 3.2.1, we will provide formal and generic definitions of frequency constraint, format constraint, and negative element constraint.

• NSP mining is embedded with specific constraints and requires rescanning databases.

• Existing NSP approaches either re-scan a database as a result of inefficient computational design or do not directly address the problem of NSP mining. In Section 4, e-NSP is introduced which scans a database only once.

Our proposed ST-NSP (see Section 4.1) and e-NSP (more details in Section 4) directly address the above fundamental issues and the substantial complexities of NSP mining by building an innovative, formal, comprehensive and generic design, together with theoretical analysis and experimental evaluation.

3. Problem statement

In a sequence, a non-occurring item is called a negative item and an occurring item is called a positive item. Sequences that consist of at least one negative item are called negative sequences. The sequences in source data are called data sequences [32]. Classic sequential pattern mining handles occurring items only, and generates frequent positive sequences [11,24,26,38]. Below, we formally define PSP and NSP. The main symbols used in this paper are listed in Table 1.

3.1. Positive Sequential Patterns – PSP

Let $I = \{x_1, x_2, \ldots, x_n\}$ be a set of items. An itemset is a subset of I. A sequence is an ordered list of itemsets. A sequence s is denoted by $\langle s_1 s_2 \ldots s_l \rangle$, where $s_j \subseteq I$ $(1 \le j \le l)$. $s_j$ is also called an element of the sequence, and denoted as $(x_1 x_2 \ldots x_m)$, where $x_k$ is an item, $x_k \in I$ $(1 \le k \le m)$. For simplicity, the brackets are omitted if an element only has one item, i.e., element $(x)$ is coded x. To reduce complexity, we assume that an item occurs at most once in an element, but can appear multiple times in different elements of a sequence.

The length of sequence s, denoted as $length(s)$, is the total number of items in all elements in s. s is a k-length sequence if $length(s) = k$. The size of sequence s, denoted as $size(s)$, is the total number of elements in s. s is a k-size sequence if $size(s) = k$. For example, a given sequence $s = \langle (ab)cd \rangle$ is composed of 3 elements $(ab)$, c and d, or 4 items a, b, c and d. Therefore s is a 4-length and 3-size sequence.

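Sequence $s_\alpha = \langle \alpha_1 \alpha_2 \ldots \alpha_n \rangle$ is called a sub-sequence of sequence $s_\beta = \langle \beta_1 \beta_2 \ldots \beta_m \rangle$ and $s_\beta$ is a super-sequence of $s_\alpha$, denoted as $s_\alpha \subseteq s_\beta$, if there exists $1 \le j_1 < j_2 < \ldots < j_n \le m$ such that $\alpha_1 \subseteq \beta_{j_1}, \alpha_2 \subseteq \beta_{j_2}, \ldots, \alpha_n \subseteq \beta_{j_n}$. We also say that $s_\beta$ contains $s_\alpha$. For example, $\langle a \rangle$, $\langle d \rangle$ and $\langle (ab)d \rangle$ are all sub-sequences of $\langle (ab)cd \rangle$.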
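The sub-sequence test above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a sequence is modelled as a list of frozensets, and the helper names `seq` and `is_subsequence` are hypothetical, not taken from the e-NSP source code. Greedy leftmost matching suffices, because matching each element as early as possible never rules out a later match.

```python
def seq(*elements):
    """Build a sequence from strings, one string per element.

    seq("ab", "c", "d") models <(ab)cd>; each character is one item.
    (Hypothetical helper for illustration only.)
    """
    return [frozenset(e) for e in elements]


def is_subsequence(alpha, beta):
    """Return True if alpha is a sub-sequence of beta.

    Every element of alpha must be a subset of some element of beta,
    and the matched positions in beta must be strictly increasing.
    """
    j = 0
    for a in alpha:
        # advance to the first remaining element of beta that contains a
        while j < len(beta) and not a <= beta[j]:
            j += 1
        if j == len(beta):
            return False
        j += 1  # the next element of alpha must match strictly later
    return True


s = seq("ab", "c", "d")  # <(ab)cd>
print(is_subsequence(seq("a"), s))
print(is_subsequence(seq("d"), s))
print(is_subsequence(seq("ab", "d"), s))
```

The three calls correspond to the three sub-sequences named in the example above.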

A sequence database D is a set of tuples $\langle sid, ds \rangle$, where sid is the sequence_id and ds is the data sequence. The number of tuples in D is denoted as $|D|$. The set of tuples containing sequence s is denoted as $\{\langle s \rangle\}$. The support of s, denoted as $sup(s)$, is the number of $\{\langle s \rangle\}$, i.e., $sup(s) = |\{\langle s \rangle\}| = |\{\langle sid, ds \rangle, \langle sid, ds \rangle \in D \wedge (s \subseteq ds)\}|$. min_sup is a minimum support threshold predefined by users. Sequence s is called a frequent (positive) sequential pattern if $sup(s) \ge min\_sup$. By contrast, s is infrequent if $sup(s) < min\_sup$.
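As a rough illustration of the support definition, the sketch below counts the tuples of a toy database whose data sequence contains a given sequence. The database contents, the sequence ids and the names `seq`, `is_subsequence`, `support` and `is_frequent` are all assumptions made for this example. A real PSP miner such as GSP or PrefixSpan would not count support by brute force like this.

```python
def seq(*elements):
    """seq("ab", "c") models <(ab)c>; hypothetical helper."""
    return [frozenset(e) for e in elements]


def is_subsequence(alpha, beta):
    """Greedy leftmost containment test for itemset sequences."""
    j = 0
    for a in alpha:
        while j < len(beta) and not a <= beta[j]:
            j += 1
        if j == len(beta):
            return False
        j += 1
    return True


def support(s, database):
    """sup(s): number of <sid, ds> tuples in D whose ds contains s."""
    return sum(1 for _sid, ds in database if is_subsequence(s, ds))


def is_frequent(s, database, min_sup):
    """s is a frequent (positive) sequential pattern if sup(s) >= min_sup."""
    return support(s, database) >= min_sup


# Toy database D (made up for illustration)
D = [
    (10, seq("a", "bc", "d", "cde")),  # <a(bc)d(cde)>
    (20, seq("ab", "c", "d")),         # <(ab)cd>
    (30, seq("d", "c")),               # <dc>
]

print(support(seq("c", "d"), D))
print(is_frequent(seq("c", "d"), D, min_sup=2))
```

Here min_sup is treated as an absolute count, matching the definition of support as a number of tuples.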

PSP mining aims to discover all positive sequences that satisfy a given minimum support. For simplicity, we often omit "positive" when discussing positive items, positive elements and positive sequences in mining PSP.

3.2. Negative Sequential Patterns – NSP

3.2.1. Constraints on NSP

In real applications such as health and medical business, the number of NSC and the identified negative sequences are often large, but many of them are not actionable [10]. The number of NSC may be huge or even infinite if no constraints are added. This makes NSP mining very challenging. For example, for a dataset with 10 distinct items, the total number of potential itemsets is 1023 ($= C(10,1) + C(10,2) + \ldots + C(10,10)$), where $C(m, n)$ represents the number of combinations created by choosing n items from m distinct items. Assuming that 10 of the 1023 are frequent, if the maximum size of data sequences in DB is 3, the total number of NSC could be $1033^1 + 1033^2 + 1033^3$. Many of the combinations are uninteresting or meaningless in reality.
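The counting argument above can be checked numerically. The sketch below relies on one assumption about the extracted text: the NSC total is read as 1033 raised to the powers 1, 2 and 3, where 1033 is the 1023 positive itemsets plus the 10 negated frequent itemsets that may fill each element position. The variable names are illustrative only.

```python
from math import comb

n_items = 10

# Non-empty itemsets over 10 distinct items: C(10,1) + ... + C(10,10)
n_itemsets = sum(comb(n_items, k) for k in range(1, n_items + 1))

# Assumed reading of the garbled total: each element position holds one of
# the 1023 positive itemsets or one of the 10 negated frequent itemsets.
n_frequent = 10
n_elements = n_itemsets + n_frequent

# Candidate sequences of size 1, 2 and 3
max_size = 3
n_candidates = sum(n_elements ** k for k in range(1, max_size + 1))

print(n_itemsets, n_elements, n_candidates)
```

Even for this tiny alphabet the candidate count exceeds a billion, which is the motivation for the constraints introduced below.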

Various constraints have been added to existing NSP mining, such as those detailed in [11,38,39] and [24], to deal with the fundamental challenges embedded in NSP mining and the current immature situation in NSP study. Although they are neither consistent nor generic, these constraints aim to reduce problem complexity and the number of NSC, and to efficiently discover actionable NSP. For example, in health insurance, a common business rule says if a prosthesis (a) has been charged for, there should be a charge for a prior corresponding procedure (b). Accordingly, if a customer claims a but does not claim b, which is represented as a NSP: $\langle \neg b\,a \rangle$, then the claim behavior could be potentially suspicious. More complicated NSP may be found, which contain more negative items, such as $\langle a\,\neg b\,a\,\neg c \rangle$ (c is another procedure).

In this work, we incorporate three constraints in e-NSP: frequency constraint, format constraint, and negative element constraint. The reasons for introducing these three constraints are as follows. First, as explained above, NSP mining is at an early stage and is too complicated to conduct without the imposition of specific constraints. Second, we specify constraints on frequency, format and negative elements in order to reduce problem complexity, and in particular, to build a framework of NSP mining on set theory to extract negative sequential patterns from identified positive patterns (frequency constraint), reduce the number of NSC (format constraint), and substantially reduce the complexity in handling partially negative elements (negative element constraint).

Below, we define key concepts for these three constraints for mining NSP, and further explain why each of them is introduced.

Definition 1 (Positive Partner). The positive partner of a negative element $\neg e$ is e, denoted as $p(\neg e)$, i.e., $p(\neg e) = e$. The positive partner of positive element e is e itself, i.e., $p(e) = e$. The positive partner of a negative sequence $ns = \langle s_1 \ldots s_k \rangle$ changes all negative elements in ns to their positive partners, denoted as $p(ns)$, i.e., $p(ns) = \{\langle s_1' \ldots s_k' \rangle \mid s_i' = p(s_i), s_i \in ns\}$. For example, $p(\langle \neg(ab)\,c\,\neg d \rangle) = \langle (ab)\,c\,d \rangle$.
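A minimal sketch of Definition 1 follows, assuming a negative sequence is modelled as a list of elements that each carry a negation flag. The `Element` type and the helpers `pos`, `neg` and `positive_partner` are hypothetical names chosen for this illustration.

```python
from typing import FrozenSet, List, NamedTuple


class Element(NamedTuple):
    """One element of a sequence: an itemset plus a negation flag."""
    items: FrozenSet[str]
    negative: bool = False


def pos(items: str) -> Element:
    """Positive element, e.g. pos("ab") models (ab)."""
    return Element(frozenset(items), False)


def neg(items: str) -> Element:
    """Negative element, e.g. neg("ab") models the negation of (ab)."""
    return Element(frozenset(items), True)


def positive_partner(ns: List[Element]) -> List[Element]:
    """p(ns): replace every negative element by its positive partner."""
    return [Element(e.items, False) for e in ns]


ns = [neg("ab"), pos("c"), neg("d")]  # <not(ab) c not-d>
print(positive_partner(ns) == [pos("ab"), pos("c"), pos("d")])
```

Positive elements are left unchanged, which matches $p(e) = e$.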

Constraint 1 (Frequency constraint). For simplicity, this paper only focuses on the negative sequences ns whose positive partners are frequent, i.e., $sup(p(ns)) \ge min\_sup$. In contrast, the authors in [11] and [38] only require that the positive partner of each element in ns is frequent.

Although there may be many negative sequences that can be mined from infrequent positive partner sequences (just as many useful negative association rules can be mined from infrequent itemsets [33]), requiring positive partners to be frequent serves multiple purposes: (1) users are often interested in the absence of certain "frequent" (positive) itemsets; the positive partners of those negative itemsets appearing in negative sequential patterns should therefore satisfy a certain frequency. (2) If we do not enforce this constraint, the number of NSC may be huge or even infinite, which would lead to


A similar constraint has been adopted by other researchers, although it is specified differently for particular purposes. For example, in [11], only the positive partner of each element in negative sequence ns is required to be frequent. In our work, we require that the positive partner of ns as a whole, not only of each element, should be frequent, i.e., $sup(p(ns)) \ge min\_sup$. This is because in this work we build a new framework that allows negative sequential patterns to be mined only from positive sequential patterns, by using set theory to rapidly "calculate" the support of NSC based only on the support of their corresponding PSP without additional database scans.

Constraint 2 (Format constraint). Continuous negative elements in a NSC are not allowed.

Example 1. $\langle \neg(ab)\,c\,\neg d \rangle$ satisfies Constraint 2, but $\langle \neg(ab)\,\neg c\,d \rangle$ does not. This is similar to the settings in [11,38].

This constraint is introduced for three reasons. (1) If two or more continuous negative elements are allowed, more and more negative sequential candidates can be generated, which would result in a very challenging and sophisticated task. (2) In practice, it would be difficult to ascertain the correct order of two continuous negative elements if there were no positive elements between them. (3) As argued in [11], the order of consecutive negative itemsets is trifling for many applications. As a result, adding this constraint will substantially reduce the number of NSC.

Constraint 3 (Negative element constraint). The smallest negative unit in a NSC is an element. If an element consists of more than one item, either all or none of the items are allowed to be negative.

Example 2. $\langle \neg(ab)\,cd \rangle$ satisfies Constraint 3, but $\langle (\neg a\,b)\,cd \rangle$ does not because, in element $(\neg a\,b)$, only $\neg a$ is negative while b is not.

Although negative sequential patterns with negative items (such as in $\langle (\neg a\,b)\,cd \rangle$) may be also interesting in real applications, the complexity could be too high to solve the problem. Introducing this constraint avoids the complexity of handling partially negative elements, and substantially reduces the number of NSC. This is because the number of NSC will increase exponentially if negative items are allowed. For example, given positive pattern $ps = \langle (ab)(cde)(fghi) \rangle$, the number of NSC is only 4 if this constraint is satisfied; they are $\langle \neg(ab)(cde)(fghi) \rangle$, $\langle (ab)\neg(cde)(fghi) \rangle$, $\langle (ab)(cde)\neg(fghi) \rangle$ and $\langle \neg(ab)(cde)\neg(fghi) \rangle$. If negative items are allowed, the number of NSC generated from $\langle (ab)(cde)(fghi) \rangle$ is 1260 ($= 4 \times (C(2,1) + C(2,2)) \times (C(3,1) + C(3,2) + C(3,3)) \times (C(4,1) + C(4,2) + C(4,3) + C(4,4)) = 4 \times 3 \times 7 \times 15$). It would be very inefficient and complex to generate all NSC. When the size and length of data is huge, it is unrealistic to generate all NSC. Therefore, similar to the settings in [11,38], we also apply this constraint to avoid the complexity of handling partially negative elements.
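Both counts in this example can be reproduced with a short sketch. The enumeration below negates whole elements only (Constraint 3) and rejects adjacent negations (Constraint 2); it is a plain brute-force illustration consistent with the example, not necessarily the NSC generation strategy that e-NSP itself uses. The names `satisfies_format_constraint` and `generate_nsc`, and the encoding of elements as strings, are assumptions.

```python
from itertools import combinations
from math import comb


def satisfies_format_constraint(negated_positions):
    """Constraint 2: no two negated elements may be adjacent."""
    ps = sorted(negated_positions)
    return all(b - a > 1 for a, b in zip(ps, ps[1:]))


def generate_nsc(psp):
    """Enumerate candidates obtained by negating whole elements of psp.

    psp is a list of elements (each a string of items). Each candidate is
    a list of (element, is_negative) pairs with at least one negation.
    """
    k = len(psp)
    candidates = []
    for m in range(1, k + 1):
        for positions in combinations(range(k), m):
            if satisfies_format_constraint(positions):
                candidates.append(
                    [(e, i in positions) for i, e in enumerate(psp)]
                )
    return candidates


psp = ["ab", "cde", "fghi"]  # <(ab)(cde)(fghi)>
nsc = generate_nsc(psp)
print(len(nsc))

# Item-level count from the text: 4 x (non-empty subsets of each element)
item_level = len(nsc)
for element in psp:
    item_level *= sum(comb(len(element), r) for r in range(1, len(element) + 1))
print(item_level)
```

Because the candidates are derived from a PSP that is already frequent, every candidate's positive partner is frequent, so Constraint 1 holds by construction in this sketch.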

In summary, the above three constraints introduced into e-NSP not only maintain a certain consistency with existing work, but also enable the efficient generation of NSC based on their positive partners and enable the calculation of NSC support. This leads to a fundamentally new and efficient framework for efficient discovery of NSP in large scale data.

In practice, with the development of more efficient learning frameworks, data structures, and NSP mining algorithms, these constraints may be progressively relaxed. Given the current stage of maturity of NSP mining, we only work on those negative sequences that satisfy the above three constraints in this work.

3.2.2. NSP concepts

According to Constraint 3, the definition of sub-sequences in a positive sequence is not applicable to negative sequences. Below, we define sub-sequence and super-sequence for negative sequences, in addition to the concepts of element-id set and order-preserving sequence.

Definition 2 (Positive/Negative Element-id Set). Element-id is the order number of an element in a sequence. Given a sequence $s = \langle s_1 s_2 \ldots s_m \rangle$, $id(s_i) = i$ is the element identifier of element $s_i$. Element-id set $EidS_s$ of s is the set that includes all elements and their ids in s, i.e., $EidS_s = \{(s_i, id(s_i)) \mid s_i \in s\} = \{(s_1, 1), (s_2, 2), \ldots, (s_m, m)\}$ $(1 \le i \le m)$.

The sets including all positive and all negative element-ids of a sequence s are called the positive and negative element-id sets of s, denoted as $EidS_s^+$ and $EidS_s^-$, respectively. For example, for $s = \langle \neg(ab)\,c\,\neg d \rangle$, $EidS_s^+ = \{(c, 2)\}$ and $EidS_s^- = \{(\neg(ab), 1), (\neg d, 3)\}$.

Definition 3 (Order-preserving Sequence). For any subset $EidS_s' = \{(\alpha_1, id_1), (\alpha_2, id_2), \ldots, (\alpha_p, id_p)\}$ $(1 < p \le m)$ of $EidS_s$, $\alpha = \langle \alpha_1 \alpha_2 \ldots \alpha_p \rangle$, if $\forall \alpha_i, \alpha_{i+1} \in \alpha$ $(1 \le i < p)$, there exists $id_i < id_{i+1}$, then $\alpha$ is called an order-preserving sequence of $EidS_s'$, denoted as $\alpha = OPS(EidS_s')$.

Example 3. Given $s = \langle \neg(ab)\,c\,\neg d \rangle$, its $EidS_s = \{(\neg(ab), 1), (c, 2), (\neg d, 3)\}$, $EidS_s^+ = \{(c, 2)\}$, $EidS_s^- = \{(\neg(ab), 1), (\neg d, 3)\}$. We can obtain $OPS(EidS_s^+) = \langle c \rangle$. Also, if $EidS_s' = \{(\neg(ab), 1), (c, 2)\}$, we can create a sequence $OPS(EidS_s') = \langle \neg(ab)\,c \rangle$.
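The element-id set and the order-preserving sequence translate directly into set and sort operations. The sketch below is an illustration under assumptions: elements are modelled as (itemset, negation flag) tuples, ids are 1-based as in Definition 2, and all function names are hypothetical.

```python
from typing import FrozenSet, NamedTuple


class Element(NamedTuple):
    """One element of a sequence: an itemset plus a negation flag."""
    items: FrozenSet[str]
    negative: bool = False


def pos(items):
    return Element(frozenset(items), False)


def neg(items):
    return Element(frozenset(items), True)


def eids(s):
    """EidS_s: the set of (element, 1-based id) pairs of sequence s."""
    return {(e, i) for i, e in enumerate(s, start=1)}


def positive_eids(s):
    """Positive element-id set: pairs whose element is positive."""
    return {(e, i) for e, i in eids(s) if not e.negative}


def negative_eids(s):
    """Negative element-id set: pairs whose element is negative."""
    return {(e, i) for e, i in eids(s) if e.negative}


def ops(eid_subset):
    """OPS: elements of an element-id subset, ordered by increasing id."""
    return [e for e, _id in sorted(eid_subset, key=lambda pair: pair[1])]


s = [neg("ab"), pos("c"), neg("d")]  # <not(ab) c not-d>
print(ops(positive_eids(s)) == [pos("c")])
print(ops({(neg("ab"), 1), (pos("c"), 2)}) == [neg("ab"), pos("c")])
```

Keeping the id alongside each element is what lets a subset be turned back into a sequence without ambiguity, even when the same element occurs more than once.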


Fig. 1. The set theory-based NSP mining framework: ST-NSP.

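Definition 4 (Sub-sequence and Super-sequence of a Negative Sequence). Sequence $s_\alpha$ is called a sub-sequence of a negative sequence $s_\beta$, and $s_\beta$ is a super-sequence of $s_\alpha$, if there exists $EidS_{s_\beta}'$, a subset of $EidS_{s_\beta}$, such that $s_\alpha = OPS(EidS_{s_\beta}')$, denoted as $s_\alpha \subseteq s_\beta$. If $s_\alpha$ is a negative sequence, it is required to satisfy Constraint 2, which means that there must not be continuous negative elements in $s_\alpha$.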

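Example 4. Given $s_\beta = \langle \neg(ab)\,cd \rangle$ and $s_\alpha = \langle \neg(ab)\,d \rangle$, $EidS_{s_\beta} = \{(\neg(ab), 1), (c, 2), (d, 3)\}$, and $EidS_{s_\beta}' = \{(\neg(ab), 1), (d, 3)\}$ is a subset of $EidS_{s_\beta}$. $s_\alpha$ is a sub-sequence of $s_\beta$ since $s_\alpha = OPS(EidS_{s_\beta}')$.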

Definition 5 (Maximum Positive Sub-sequence). Let $ns = \langle s_1 s_2 \ldots s_m \rangle$ be a m-size and n-neg-size negative sequence $(m - n > 0)$, $OPS(EidS_{ns}^+)$ is called the maximum positive sub-sequence of ns, denoted as $MPS(ns)$.

Example 5. Given a negative sequence $s = \langle \neg(ab)\,cd \rangle$, $EidS_s^+ = \{(c, 2), (d, 3)\}$, its maximum positive sub-sequence is $MPS(s) = \langle cd \rangle$.
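Definitions 4 and 5 can be sketched as follows, under the same assumed representation of elements as (itemset, negation flag) tuples. The reading of Definition 4 used here is that a sub-sequence is obtained by deleting whole elements from the super-sequence while keeping the order of the rest, with Constraint 2 checked on the result. The function names are hypothetical.

```python
from typing import FrozenSet, NamedTuple


class Element(NamedTuple):
    """One element of a sequence: an itemset plus a negation flag."""
    items: FrozenSet[str]
    negative: bool = False


def pos(items):
    return Element(frozenset(items), False)


def neg(items):
    return Element(frozenset(items), True)


def has_adjacent_negatives(s):
    """True if s violates Constraint 2 (two consecutive negative elements)."""
    return any(a.negative and b.negative for a, b in zip(s, s[1:]))


def is_negative_subsequence(s_alpha, s_beta):
    """Definition 4 sketch: s_alpha equals OPS of some subset of EidS of s_beta.

    Elements are compared whole (same items, same negation flag), so this
    is an order-preserving deletion test, plus the Constraint 2 check.
    """
    if has_adjacent_negatives(s_alpha):
        return False
    remaining = iter(s_beta)
    # each element of s_alpha must be found, in order, among what is left
    return all(any(a == b for b in remaining) for a in s_alpha)


def mps(ns):
    """MPS(ns): the positive elements of ns, in their original order."""
    return [e for e in ns if not e.negative]


s_beta = [neg("ab"), pos("c"), pos("d")]   # <not(ab) c d>
s_alpha = [neg("ab"), pos("d")]            # <not(ab) d>
print(is_negative_subsequence(s_alpha, s_beta))
print(mps(s_beta) == [pos("c"), pos("d")])
```

Unlike the positive case, an element here must match exactly rather than as a subset, which reflects Constraint 3: an element is the smallest unit that can be negated.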

Definition 6 (Negative Sequential Pattern). A negative sequence s is a negative sequential pattern (NSP) if its support is not less than the threshold min_sup.

4. The set theory-based NSP framework and e-NSP algorithm

4.1. The set theory-based NSP mining framework

Here we propose an innovative NSP mining framework, based on set theory. The framework and working mechanism of the proposed set theory-based NSP mining framework (ST-NSP for short) are illustrated in Fig. 1. We also propose an efficient NSP mining algorithm, called e-NSP, to instantiate the ST-NSP framework. The e-NSP algorithm is summarized in Section 4.7, and an example is given in Section 4.8.

A NSP algorithm instantiating the ST-NSP framework consists of several components:

(1) negative containment to define how a data sequence contains a negative sequence, which will be discussed in Section 4.2;
(2) a negative conversion strategy to convert the negative containment to positive containment, and then use the information of corresponding PSP to calculate the support of a NSC, which will be discussed in Section 4.3;
(3) NSC support calculation to calculate the support of NSC, which will be discussed in Section 4.4;
(4) NSC generation to generate NSC, which will be discussed in Section 4.5; and
(5) an e-NSP data structure and optimization strategy to calculate the union set, which will be discussed in Section 4.6.

Given a sequence database, algorithms like e-NSP built on the ST-NSP framework work on the following process to

discoverNSP,inwhichsteps(2)to(4)relyonthesettheory.

(1) AllPSPareminedbyatraditionalPSPalgorithm(withslightchangesifnecessary)ornewPSPminingalgorithm; (2) NSCaregeneratedbasedontheidentifiedPSPintermsofthethreeconstraintsproposedinSection3.2.1; (3) ThegeneratedNSCareconvertedtotheircorrespondingPSPintermsofthenegativeconversionstrategy;

(4) The NSC support is calculated based on the corresponding PSP support in terms of negative containment, relevant constraintsandtheproposede-NSPdatastructureandoptimizationstrategy;

(5) Lastly,NSPareidentifiedfromtheNSCtosatisfycertainsupportcriteria.

4.2. Negative containment

As a sub-sequence (e.g., $s_1 = \langle d \rangle$) may occur more than once in its super-sequence (e.g., $s_2 = \langle a\, (bc)\, d\, (cde) \rangle$), we need to know the exact positions of $s_2$ containing $s_1$ from the left and right sides of $s_2$. We therefore define the fundamental concept of negative containment below.


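Definition 7 (First Sub-sequence Ending Position / Last Sub-sequence Beginning Position). Given a data sequence $ds = \langle d_1 d_2 \ldots d_t \rangle$ and a positive sequence $\alpha$,

(1) if $\exists p\ (1 < p \le t)$, $\alpha \subseteq \langle d_1 \ldots d_p \rangle \wedge \alpha \not\subseteq \langle d_1 \ldots d_{p-1} \rangle$, then $p$ is called the First Sub-sequence Ending Position, denoted as $FSE(\alpha, ds)$; if $\alpha \subseteq \langle d_1 \rangle$ then $FSE(\alpha, ds) = 1$;
(2) if $\exists q\ (1 \le q < t)$, $\alpha \subseteq \langle d_q \ldots d_t \rangle \wedge \alpha \not\subseteq \langle d_{q+1} \ldots d_t \rangle$, then $q$ is called the Last Sub-sequence Beginning Position, denoted as $LSB(\alpha, ds)$; if $\alpha \subseteq \langle d_t \rangle$ then $LSB(\alpha, ds) = t$;
(3) if $\alpha \not\subseteq ds$, then $FSE(\alpha, ds) = 0$, $LSB(\alpha, ds) = 0$.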

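Example 6. Given $ds = \langle a\, (bc)\, d\, (cde) \rangle$. $FSE(\langle a \rangle, ds) = 1$, $FSE(\langle c \rangle, ds) = 2$, $FSE(\langle c\, d \rangle, ds) = 3$, $LSB(\langle a \rangle, ds) = 1$, $LSB(\langle c \rangle, ds) = 4$, $LSB(\langle c\, d \rangle, ds) = 2$, $LSB(\langle (cd) \rangle, ds) = 4$.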
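The two positions can be computed with one greedy scan each: left to right for FSE and right to left for LSB. The following Python sketch is an illustration only. The list-of-frozensets encoding and the helper names `seq`, `fse` and `lsb` are assumptions made here, not part of the paper.

```python
def seq(*elements):
    """Build a positive sequence; each string is one element (an item set)."""
    return [frozenset(e) for e in elements]


def fse(alpha, ds):
    """First Sub-sequence Ending position (1-based); 0 if alpha is not in ds."""
    if not alpha:
        return 0
    i = 0
    for pos, element in enumerate(ds, start=1):
        if alpha[i] <= element:          # alpha[i] is a subset of this element
            i += 1
            if i == len(alpha):
                return pos               # earliest position where alpha ends
    return 0


def lsb(alpha, ds):
    """Last Sub-sequence Beginning position (1-based); 0 if alpha is not in ds."""
    if not alpha:
        return 0
    i = len(alpha) - 1
    for pos in range(len(ds), 0, -1):    # scan from the right-hand side
        if alpha[i] <= ds[pos - 1]:
            i -= 1
            if i < 0:
                return pos               # latest position where alpha begins
    return 0


ds = seq("a", "bc", "d", "cde")          # <a (bc) d (cde)>
```

With this encoding, `fse(seq("c", "d"), ds)` and `lsb(seq("c", "d"), ds)` correspond to $FSE(\langle c\, d \rangle, ds)$ and $LSB(\langle c\, d \rangle, ds)$ in Example 6, while `seq("cd")` denotes the single element $(cd)$.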

Our definition of a data sequence containing a negative sequence is as follows. We use $n$-neg-size to denote a negative sequence containing $n$ negative elements.

Definition 8 (Negative Containment). Let $ds = \langle d_1 d_2 \ldots d_t \rangle$ be a data sequence, $ns = \langle s_1 s_2 \ldots s_m \rangle$ be an $m$-size and $n$-neg-size negative sequence, (1) if $m > 2t + 1$, then $ds$ does not contain $ns$; (2) if $m = 1$ and $n = 1$, then $ds$ contains $ns$ when $p(ns) \not\subseteq ds$; (3) otherwise, $ds$ contains $ns$ if, $\forall (s_i, id(s_i)) \in EidS^{-}_{ns}$ $(1 \le i \le m)$, one of the following three cases holds:

(a) $(lsb = 1)$ or $(lsb > 1) \wedge p(s_1) \not\subseteq \langle d_1 \ldots d_{lsb-1} \rangle$, when $i = 1$;
(b) $(fse = t)$ or $(0 < fse < t) \wedge p(s_m) \not\subseteq \langle d_{fse+1} \ldots d_t \rangle$, when $i = m$;
(c) $(fse > 0 \wedge lsb = fse + 1)$ or $(fse > 0 \wedge lsb > fse + 1) \wedge p(s_i) \not\subseteq \langle d_{fse+1} \ldots d_{lsb-1} \rangle$, when $1 < i < m$,

where $fse = FSE(MPS(\langle s_1 s_2 \ldots s_{i-1} \rangle), ds)$ and $lsb = LSB(MPS(\langle s_{i+1} \ldots s_m \rangle), ds)$.

In the above definition, Case (a) indicates that the first element in $ns$ is negative. “$(lsb > 1) \wedge p(s_1) \not\subseteq \langle d_1 \ldots d_{lsb-1} \rangle$” means that $\langle d_{lsb} \ldots d_t \rangle$ contains $MPS(\langle s_2 \ldots s_m \rangle)$ but $\langle d_1 \ldots d_{lsb-1} \rangle$ does not contain $p(s_1)$. “$lsb = 1$” means that the last sub-sequence's beginning position is 1, so $p(s_1)$ cannot be contained by $ds$. Case (b) indicates that the last element in $ns$ is negative. Case (c) indicates that the negative element is between the first and last element in $ns$. “$lsb > fse + 1$” ensures there is at least one element in “$\langle d_{fse+1} \ldots d_{lsb-1} \rangle$”. “$fse > 0 \wedge lsb = fse + 1$” means that $d_{fse}$ and $d_{lsb}$ are contiguous elements, so $p(s_i)$ cannot be contained within them.

Example 7. Given $ds = \langle a\, (bc)\, d\, (cde) \rangle$, we have

(1) $ns = \langle \neg a\, c \rangle$. $EidS^{-}_{ns} = \{(\neg a, 1)\}$. $ds$ does not contain $ns$. $lsb = 4 > 1$, but $p(s_1) = \langle a \rangle \subseteq \langle d_1 \ldots d_3 \rangle = \langle a\, (bc)\, d \rangle$ (Case (a)).
(2) $ns = \langle \neg a\, a\, c \rangle$. $EidS^{-}_{ns} = \{(\neg a, 1)\}$. $ds$ contains $ns$ because $lsb = 1$ (Case (a)).
(3) $ns = \langle (ab)\, \neg(cd) \rangle$. $EidS^{-}_{ns} = \{(\neg(cd), 2)\}$. $ds$ does not contain $ns$ because $fse = 0$ (Case (b)).
(4) $ns = \langle (de)\, \neg(cd) \rangle$. $EidS^{-}_{ns} = \{(\neg(cd), 2)\}$. $ds$ contains $ns$ because $fse = 4$ ($t = 4$) (Case (b)).
(5) $ns = \langle a\, \neg d\, d\, \neg d \rangle$. $EidS^{-}_{ns} = \{(\neg d, 2), (\neg d, 4)\}$. $ds$ does not contain $ns$. For $(\neg d, 2)$, $fse = 1$, $lsb = 4$, but $p(\neg d) \subseteq \langle d_2 \ldots d_3 \rangle = \langle (bc)\, d \rangle$ (Case (c)). If one negative element does not satisfy the condition, we do not need to consider other negative elements.
(6) $ns = \langle a\, \neg b\, b\, \neg a\, (cde) \rangle$. $EidS^{-}_{ns} = \{(\neg b, 2), (\neg a, 4)\}$. $ds$ contains $ns$. For $(\neg b, 2)$, $fse = 1$, $lsb = 2$, $fse > 0 \wedge lsb = fse + 1$ (Case (c)); for $(\neg a, 4)$, $fse = 2$, $lsb = 4$, $p(\neg a) \not\subseteq \langle d_3 \rangle = \langle d \rangle$ (Case (c)).
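The three cases of the negative containment definition reduce to one check per negative element: find the window of the data sequence lying strictly between the FSE of the positive part on the left and the LSB of the positive part on the right, and require that the window does not contain the negated element. The Python sketch below follows that reading. It is an illustration under assumptions: elements are hypothetical `(items, is_negative)` pairs, all helper names are invented here, and the input is assumed to satisfy Constraint 2 (no two adjacent negative elements).

```python
def pos(e):
    return (frozenset(e), False)         # positive element


def neg(e):
    return (frozenset(e), True)          # negative element


def contains(ds, alpha):
    """True if positive sequence alpha is a sub-sequence of ds (greedy scan)."""
    i = 0
    for element in ds:
        if i < len(alpha) and alpha[i] <= element:
            i += 1
    return i == len(alpha)


def fse(alpha, ds):
    """First Sub-sequence Ending position (1-based), 0 if absent."""
    i = 0
    for p, element in enumerate(ds, start=1):
        if alpha and alpha[i] <= element:
            i += 1
            if i == len(alpha):
                return p
    return 0


def lsb(alpha, ds):
    """Last Sub-sequence Beginning position (1-based), 0 if absent."""
    i = len(alpha) - 1
    for p in range(len(ds), 0, -1):
        if alpha and alpha[i] <= ds[p - 1]:
            i -= 1
            if i < 0:
                return p
    return 0


def mps(ns):
    return [items for items, negative in ns if not negative]


def neg_contains(ds, ns):
    """Negative containment of ns in ds, case by case as in the definition."""
    t, m = len(ds), len(ns)
    n = sum(1 for _, negative in ns if negative)
    if m > 2 * t + 1:
        return False
    if m == 1 and n == 1:
        return not contains(ds, [ns[0][0]])
    for i, (items, negative) in enumerate(ns):
        if not negative:
            continue
        if i == 0:                                   # Case (a)
            end = lsb(mps(ns[1:]), ds)
            if end == 0:
                return False
            window = ds[:end - 1]
        elif i == m - 1:                             # Case (b)
            start = fse(mps(ns[:-1]), ds)
            if start == 0:
                return False
            window = ds[start:]
        else:                                        # Case (c)
            start = fse(mps(ns[:i]), ds)
            end = lsb(mps(ns[i + 1:]), ds)
            if start == 0 or end <= start:
                return False
            window = ds[start:end - 1]
        if contains(window, [items]):                # negated element occurs
            return False
    return True


ds = [frozenset(e) for e in ("a", "bc", "d", "cde")]   # <a (bc) d (cde)>
```

An empty window (for example `lsb = 1` in Case (a) or `lsb = fse + 1` in Case (c)) can never contain the negated element, which is why those sub-cases hold unconditionally in the definition.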

Note that negative sequences do not satisfy the Apriori property. As shown in Example 7, $sup(\langle \neg a\, c \rangle) = 0$, $sup(\langle \neg a\, a\, c \rangle) = 1$, $sup(\langle \neg a\, c \rangle) < sup(\langle \neg a\, a\, c \rangle)$, though $\langle \neg a\, c \rangle \subseteq \langle \neg a\, a\, c \rangle$. Accordingly, we cannot simply apply or change the pattern pruning strategies available in PSP mining to NSP mining, and new NSP filtering methods need to be developed.

4.3. Negative conversion

e-NSP is built on the strategy of converting negative containment to positive containment; PSP mining can then be used to mine NSP. For this, we define a special sub-sequence: the 1-neg-size Maximum Sub-sequence.

Definition 9 (1-neg-size Maximum Sub-sequence). For a negative sequence $ns$, its sub-sequences that include $MPS(ns)$ and one negative element $e$ are called 1-neg-size maximum sub-sequences, denoted as $1\text{-}negMS = OPS(EidS^{+}_{ns}, e)$, where $e \in EidS^{-}_{ns}$. The sub-sequence set including all 1-neg-size maximum sub-sequences of $ns$ is called the 1-neg-size maximum sub-sequence set, denoted as $1\text{-}negMSS_{ns}$, $1\text{-}negMSS_{ns} = \{OPS(EidS^{+}_{ns}, e) \mid \forall e \in EidS^{-}_{ns}\}$.

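Example 8. 1) $ns = \langle \neg(ab)\, c\, \neg d \rangle$, $1\text{-}negMSS_{ns} = \{\langle \neg(ab)\, c \rangle, \langle c\, \neg d \rangle\}$; 2) $ns = \langle \neg a\, (bc)\, d\, \neg(cde) \rangle$, $1\text{-}negMSS_{ns} = \{\langle \neg a\, (bc)\, d \rangle, \langle (bc)\, d\, \neg(cde) \rangle\}$.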
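Generating the 1-neg-size maximum sub-sequence set amounts to keeping all positive elements plus exactly one negative element at a time. A minimal Python sketch, assuming the hypothetical `(items, is_negative)` element encoding and an invented function name `one_neg_ms_set`:

```python
def pos(e):
    return (frozenset(e), False)         # positive element


def neg(e):
    return (frozenset(e), True)          # negative element


def one_neg_ms_set(ns):
    """All sub-sequences made of MPS(ns) plus exactly one negative element."""
    result = []
    for k, (_, negative) in enumerate(ns):
        if negative:
            # Keep every positive element and only the k-th negative one,
            # preserving the original order.
            result.append([el for j, el in enumerate(ns) if not el[1] or j == k])
    return result


ns1 = [neg("ab"), pos("c"), neg("d")]                 # <not(ab) c not(d)>
ns2 = [neg("a"), pos("bc"), pos("d"), neg("cde")]     # <not(a) (bc) d not(cde)>
```

The size of the returned list equals the neg-size $n$ of the input, so an $n$-neg-size candidate yields exactly $n$ positive partners to look up later.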

Corollary 1 (Negative Conversion Strategy). Given a data sequence $ds = \langle d_1 d_2 \ldots d_t \rangle$, and $ns = \langle s_1 s_2 \ldots s_m \rangle$, which is an $m$-size and $n$-neg-size negative sequence, the negative containment problem can be converted to the following problem: data sequence $ds$ contains negative sequence $ns$ if and only if the following two conditions hold: (1) $MPS(ns) \subseteq ds$; and (2) $\forall\, 1\text{-}negMS \in 1\text{-}negMSS_{ns}$, $p(1\text{-}negMS) \not\subseteq ds$.

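Example 9. Given $ds = \langle a\, (bc)\, d\, (cde) \rangle$, 1) if $ns = \langle a\, \neg d\, d\, \neg d \rangle$, $1\text{-}negMSS_{ns} = \{\langle a\, \neg d\, d \rangle, \langle a\, d\, \neg d \rangle\}$, then $ds$ does not contain $ns$ because $p(\langle a\, \neg d\, d \rangle) = \langle a\, d\, d \rangle \subseteq ds$; 2) if $ns = \langle a\, \neg b\, b\, \neg a\, (cde) \rangle$, $1\text{-}negMSS_{ns} = \{\langle a\, \neg b\, b\, (cde) \rangle, \langle a\, b\, \neg a\, (cde) \rangle\}$, then $ds$ contains $ns$ because $MPS(ns) = \langle a\, b\, (cde) \rangle \subseteq ds \wedge p(\langle a\, \neg b\, b\, (cde) \rangle) \not\subseteq ds \wedge p(\langle a\, b\, \neg a\, (cde) \rangle) \not\subseteq ds$.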
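The negative conversion strategy can be expressed with ordinary positive sub-sequence tests only. The Python sketch below is a hedged illustration: the `(items, is_negative)` encoding and the names `contains`, `positive_partner`, `one_neg_ms_set` and `contains_negative` are assumptions introduced here, and the function covers only the two conditions of the corollary.

```python
def pos(e):
    return (frozenset(e), False)         # positive element


def neg(e):
    return (frozenset(e), True)          # negative element


def contains(ds, alpha):
    """True if positive sequence alpha is a sub-sequence of ds (greedy scan)."""
    i = 0
    for element in ds:
        if i < len(alpha) and alpha[i] <= element:
            i += 1
    return i == len(alpha)


def mps(ns):
    """Maximum positive sub-sequence: positive elements only, in order."""
    return [items for items, negative in ns if not negative]


def positive_partner(ns):
    """p(ns): the same elements with every negation removed."""
    return [items for items, _ in ns]


def one_neg_ms_set(ns):
    """MPS(ns) plus exactly one negative element, for each negative element."""
    return [
        [el for j, el in enumerate(ns) if not el[1] or j == k]
        for k, (_, negative) in enumerate(ns) if negative
    ]


def contains_negative(ds, ns):
    """ds contains ns iff MPS(ns) is in ds and no p(1-negMS) is in ds."""
    if not contains(ds, mps(ns)):
        return False
    return all(
        not contains(ds, positive_partner(sub)) for sub in one_neg_ms_set(ns)
    )


ds = [frozenset(e) for e in ("a", "bc", "d", "cde")]   # <a (bc) d (cde)>
```

On the six candidates of Example 7 this function gives the same answers as the position-based definition, which is what the corollary asserts.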

Corollary 1 shows that the problem of whether a data sequence contains a negative sequence can be converted to the problem of whether a data sequence does not contain other positive sequences. This lays a foundation for calculating the support of negative sequences by using only the information of the corresponding positive sequences. Its proof is given below.

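Proof of Corollary 1. Here we only prove that Case (c) in the negative containment definition is equivalent to the negative conversion strategy, because Cases (a) and (b) can be proved in the same way. In Case (c), condition “$(fse > 0 \wedge lsb = fse + 1)$” indicates that $d_{fse}$ and $d_{lsb}$ are contiguous elements, so $p(s_i)$ cannot be contained within them. It is a special case of another condition “$(fse > 0 \wedge lsb > fse + 1) \wedge p(s_i) \not\subseteq \langle d_{fse+1} \ldots d_{lsb-1} \rangle$”. Hence, we only need to prove that “$(fse > 0 \wedge lsb > fse + 1) \wedge p(s_i) \not\subseteq \langle d_{fse+1} \ldots d_{lsb-1} \rangle$” is equivalent to the negative conversion strategy.

For $\forall (s_i, id(s_i)) \in EidS^{-}_{ns}$ $(1 \le i \le m)$, “$0 < fse$” means $MPS(\langle s_1 s_2 \ldots s_{i-1} \rangle) \subseteq \langle d_1 \ldots d_{fse} \rangle$, and “$0 < lsb$” means $MPS(\langle s_{i+1} \ldots s_m \rangle) \subseteq \langle d_{lsb} \ldots d_t \rangle$. Since $fse < lsb$, $MPS(\langle s_1 s_2 \ldots s_{i-1} s_{i+1} \ldots s_m \rangle) \subseteq \langle d_1 \ldots d_{fse} d_{lsb} \ldots d_t \rangle$, i.e., $MPS(ns) \subseteq ds$. On the other hand, if $MPS(ns) \subseteq ds$, for $\forall (s_i, id(s_i)) \in EidS^{-}_{ns}$, there must exist $0 < fse < lsb$ s.t. $MPS(\langle s_1 s_2 \ldots s_{i-1} \rangle) \subseteq \langle d_1 \ldots d_{fse} \rangle$ and $MPS(\langle s_{i+1} \ldots s_m \rangle) \subseteq \langle d_{lsb} \ldots d_t \rangle$.

In addition, according to the definition of 1-neg-size maximum sub-sequence, $MPS(\langle s_1 s_2 \ldots s_{i-1} \rangle)$, $MPS(\langle s_{i+1} \ldots s_m \rangle)$ and $s_i$ only construct a 1-neg-size maximum sub-sequence $1\text{-}negMS$ of $s_i$, so “$(fse > 0 \wedge lsb > fse + 1) \wedge p(s_i) \not\subseteq \langle d_{fse+1} \ldots d_{lsb-1} \rangle$” also means $p(1\text{-}negMS) \not\subseteq ds$. For $\forall (s_i, id(s_i)) \in EidS^{-}_{ns}$, “$(fse > 0 \wedge lsb > fse + 1) \wedge p(s_i) \not\subseteq \langle d_{fse+1} \ldots d_{lsb-1} \rangle$” can be converted to: $\forall\, 1\text{-}negMS \in 1\text{-}negMSS_{ns}$, $p(1\text{-}negMS) \not\subseteq ds$, and vice versa. $\Box$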

4.4. Supports of negative sequences

To calculate the support of negative sequences, we first present Corollary 2.

Corollary 2 (Support of negative sequences). Given an $m$-size and $n$-neg-size negative sequence $ns$, for $\forall\, 1\text{-}negMS_i \in 1\text{-}negMSS_{ns}$ $(1 \le i \le n)$, the support of $ns$ in sequence database $D$ is:

$$sup(ns) = |\{ns\}| = \left|\{MPS(ns)\} - \bigcup_{i=1}^{n} \{p(1\text{-}negMS_i)\}\right| \qquad (1)$$

This can be easily derived from Corollary 1. Because $\bigcup_{i=1}^{n} \{p(1\text{-}negMS_i)\} \subseteq \{MPS(ns)\}$, Equation (1) can be rewritten as:

$$sup(ns) = |\{MPS(ns)\}| - \left|\bigcup_{i=1}^{n} \{p(1\text{-}negMS_i)\}\right| = sup(MPS(ns)) - \left|\bigcup_{i=1}^{n} \{p(1\text{-}negMS_i)\}\right| \qquad (2)$$

Example 10. $sup(\langle \neg a\, (bc)\, d\, \neg(cde) \rangle) = sup(\langle (bc)\, d \rangle) - |\{\langle a\, (bc)\, d \rangle\} \cup \{\langle (bc)\, d\, (cde) \rangle\}|$; $sup(\langle \neg(ab)\, c\, \neg d \rangle) = sup(\langle c \rangle) - |\{\langle (ab)\, c \rangle\} \cup \{\langle c\, d \rangle\}|$.

If $ns$ contains only one negative element, the support of $ns$ is:

$$sup(ns) = sup(MPS(ns)) - sup(p(ns)) \qquad (3)$$

Example 11. $sup(\langle (ab)\, \neg c\, d \rangle) = sup(\langle (ab)\, d \rangle) - sup(\langle (ab)\, c\, d \rangle)$.

In particular, for negative sequence $\langle \neg e \rangle$,

$$sup(\langle \neg e \rangle) = |D| - sup(\langle e \rangle) \qquad (4)$$
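Equations (2) to (4) can be checked on a toy database with plain set operations. The sketch below is illustrative only: the five-sequence database `db` is invented for this example, all function names are assumptions, and the sid sets are built here by scanning `db` purely for demonstration.

```python
def pos(e):
    return (frozenset(e), False)         # positive element


def neg(e):
    return (frozenset(e), True)          # negative element


def contains(ds, alpha):
    """True if positive sequence alpha is a sub-sequence of ds (greedy scan)."""
    i = 0
    for element in ds:
        if i < len(alpha) and alpha[i] <= element:
            i += 1
    return i == len(alpha)


def mps(ns):
    return [items for items, negative in ns if not negative]


def positive_partner(ns):
    return [items for items, _ in ns]


def one_neg_ms_set(ns):
    return [
        [el for j, el in enumerate(ns) if not el[1] or j == k]
        for k, (_, negative) in enumerate(ns) if negative
    ]


def sid_set(db, alpha):
    """The set {alpha}: ids of the data sequences that contain alpha."""
    return {sid for sid, ds in db.items() if contains(ds, alpha)}


def support(db, ns):
    """sup(ns) = |{MPS(ns)}| - |union of {p(1-negMS_i)}|, as in Equation (2)."""
    base = sid_set(db, mps(ns))
    excluded = set()
    for sub in one_neg_ms_set(ns):
        excluded |= sid_set(db, positive_partner(sub))
    return len(base) - len(excluded)


# Hypothetical toy database: sid -> data sequence.
db = {
    10: [frozenset("a"), frozenset("b"), frozenset("c")],    # <a b c>
    20: [frozenset("a"), frozenset("c")],                    # <a c>
    30: [frozenset("ab"), frozenset("c"), frozenset("d")],   # <(ab) c d>
    40: [frozenset("c"), frozenset("d")],                    # <c d>
    50: [frozenset("b"), frozenset("a")],                    # <b a>
}
```

For a candidate with a single negative element the union has one member, so `support` reduces to Equation (3); for $\langle \neg e \rangle$ the maximum positive sub-sequence is empty and is contained in every data sequence, which gives Equation (4).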

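In the following, we illustrate negative containment in terms of set theory. Fig. 2 shows the intersection of sequences $\langle a \rangle$ and $\langle b \rangle$. $\{\langle a \rangle\}$, $\{\langle b \rangle\}$ mean the set of tuples that respectively contain sequences $\langle a \rangle$, $\langle b \rangle$ in a sequence database. There are three 2-length sequences: $\langle a\, b \rangle$, $\langle b\, a \rangle$ and $\langle (ab) \rangle$, and four disjoint sets: $\{\langle (ab) \rangle\ \text{only}\}$, $\{\langle a\, b \rangle\ \text{only}\}$, $\{\langle b\, a \rangle\ \text{only}\}$ and $\{\langle a\, b \rangle\} \cap \{\langle b\, a \rangle\}$, which are the sets of tuples that contain sequences $\langle (ab) \rangle$ only, $\langle a\, b \rangle$ only, $\langle b\, a \rangle$ only, and both $\langle a\, b \rangle$ and $\langle b\, a \rangle$ respectively.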

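Taking $\{\langle a\, \neg b \rangle\}$ as an example, as seen in Fig. 2, we have:

$$\begin{aligned}
\{\langle a\, \neg b \rangle\} &= (\{\langle a \rangle\} - \{\langle b \rangle\}) \cup \{\langle (ab) \rangle\ \text{only}\} \cup \{\langle b\, a \rangle\ \text{only}\} \\
&= \{\langle a \rangle\} - \left(\{\langle a\, b \rangle\ \text{only}\} \cup (\{\langle a\, b \rangle\} \cap \{\langle b\, a \rangle\})\right) \\
&= \{\langle a \rangle\} - \{\langle a\, b \rangle\}
\end{aligned}$$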


Fig. 2. The intersection of $\{\langle a \rangle\}$ and $\{\langle b \rangle\}$.

Fig. 3. The interpretation of $sup(\langle a\, \neg b\, c\, \neg d\, e\, \neg f \rangle)$ in terms of set theory.

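This result is consistent with the negative containment definition, by which the data sequences containing $\langle a\, \neg b \rangle$ are the same sequences that contain $\langle a \rangle$ but do not contain $\langle a\, b \rangle$.

Subsequently, we obtain: $sup(\langle a\, \neg b \rangle) = sup(\langle a \rangle) - sup(\langle a\, b \rangle)$.

Similarly, $sup(\langle \neg a\, b \rangle) = sup(\langle b \rangle) - sup(\langle a\, b \rangle)$; $sup(\langle b\, \neg a \rangle) = sup(\langle b \rangle) - sup(\langle b\, a \rangle)$; and $sup(\langle \neg b\, a \rangle) = sup(\langle a \rangle) - sup(\langle b\, a \rangle)$.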

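Fig. 3 illustrates the meaning of $sup(ns)$ in terms of set theory. Given $ns = \langle a\, \neg b\, c\, \neg d\, e\, \neg f \rangle$, $1\text{-}negMSS_{ns} = \{\langle a\, \neg b\, c\, e \rangle, \langle a\, c\, \neg d\, e \rangle, \langle a\, c\, e\, \neg f \rangle\}$, $sup(ns) = |\{\langle a\, c\, e \rangle\}| - |\{\langle a\, b\, c\, e \rangle\} \cup \{\langle a\, c\, d\, e \rangle\} \cup \{\langle a\, c\, e\, f \rangle\}|$.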

The above examples show that our proposed negative containment definition is consistent with set theory. Therefore, set properties are applicable for calculating $sup(ns)$.

From Equation (2), we can see that $sup(ns)$ can be easily calculated if we know $sup(MPS(ns))$ and $\left|\bigcup_{i=1}^{n} \{p(1\text{-}negMS_i)\}\right|$. According to Constraint 3 and the negative candidate generation approach discussed in Section 4.5, $MPS(ns)$ and $p(1\text{-}negMS_i)$ are frequent. $sup(MPS(ns))$ can be directly obtained by applying traditional PSP mining algorithms.

We further explain why the positive partners $p(1\text{-}negMS_i)$ are frequent. Constraint 3 ensures that the smallest negative unit in an NSC is an element. Hence, the positive partner of each NSC generated from a PSP is the PSP itself. $p(1\text{-}negMS_i)$ is a sub-sequence of the PSP. Both $p(1\text{-}negMS_i)$ and the PSP are positive sequences and meet the downward closure property; therefore, every $p(1\text{-}negMS_i)$ is frequent.

Now the problem is how to calculate $\left|\bigcup_{i=1}^{n} \{p(1\text{-}negMS_i)\}\right|$. Our approach is as follows. We store the sid of the tuples containing $p(1\text{-}negMS_i)$ into set $\{p(1\text{-}negMS_i)\}$, then calculate the union set of $\{p(1\text{-}negMS_i)\}$. Because $p(1\text{-}negMS_i)$ are frequent, the sid of the tuples containing $p(1\text{-}negMS_i)$ can be easily obtained by those well-known algorithms with minor modifications. For instance, we store the sid of the tuples containing $p(1\text{-}negMS_i)$ to $\{p(1\text{-}negMS_i)\}$.
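Once a PSP miner has recorded the sid set of each frequent positive pattern, the support of a candidate needs only table lookups and a set union, with no further pass over the database. The Python sketch below illustrates that idea under stated assumptions: the dictionary `psp_sids` stands in for whatever structure the PSP mining step produces, its contents are invented, and `key` and `support_from_psp` are hypothetical names rather than part of e-NSP.

```python
def key(*elements):
    """Hashable form of a positive sequence: a tuple of frozensets."""
    return tuple(frozenset(e) for e in elements)


def support_from_psp(ns, psp_sids):
    """sup(ns) from stored sid sets only, following Equation (2).

    ns is a list of (items, is_negative) pairs. A KeyError signals that a
    required positive pattern was not recorded by the PSP mining step.
    """
    mps_key = tuple(items for items, negative in ns if not negative)
    covered = set()
    for k, (_, negative) in enumerate(ns):
        if negative:
            # Positive partner of the k-th 1-neg-size maximum sub-sequence.
            partner = tuple(
                items
                for j, (items, is_neg) in enumerate(ns)
                if not is_neg or j == k
            )
            covered |= psp_sids[partner]
    return len(psp_sids[mps_key]) - len(covered)


# Hypothetical output of a PSP miner: pattern -> sids of supporting tuples.
psp_sids = {
    key("c"): {10, 20, 30, 40},
    key("b", "c"): {10, 30},
    key("c", "d"): {30, 40},
}

# Candidate <not(b) c not(d)>.
candidate = [(frozenset("b"), True), (frozenset("c"), False), (frozenset("d"), True)]
```

The union is taken over sid sets rather than by summing supports because the same tuple may contain several positive partners, as sid 30 does in this table.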
