e-NSP: Efficient negative sequential pattern mining

Contents lists available at ScienceDirect

Artificial Intelligence

www.elsevier.com/locate/artint


Longbing Cao a,∗, Xiangjun Dong b, Zhigang Zheng c

a University of Technology Sydney, Australia
b Qilu University of Technology, Jinan, China
c University of Technology Sydney, Australia

Article info

Article history:
Received 13 January 2015
Received in revised form 3 March 2016
Accepted 7 March 2016
Available online 10 March 2016

Keywords: Negative sequence analysis; Sequence analysis; Behavior analytics; Non-occurring behavior; Behavior informatics; Behavior computing; Pattern mining

Abstract

As an important tool for behavior informatics, negative sequential patterns (NSP) (such as missing medical treatments) are critical and sometimes much more informative than positive sequential patterns (PSP) (e.g. using a medical service) in many intelligent systems and applications such as intelligent transport systems, healthcare and risk management, as they often involve non-occurring but interesting behaviors. However, discovering NSP is much more difficult than identifying PSP due to the significant problem complexity caused by non-occurring elements, high computational cost and huge search space in calculating negative sequential candidates (NSC). So far, the problem has not been formalized well, and very few approaches have been proposed to mine for specific types of NSP, which rely on database re-scans after identifying PSP in order to calculate the NSC supports. This has been shown to be very inefficient or even impractical, since the NSC search space is usually huge. This paper proposes a very innovative and efficient theoretical framework: set theory-based NSP mining (ST-NSP), and a corresponding algorithm, e-NSP, to efficiently identify NSP by involving only the identified PSP, without re-scanning the database. Accordingly, negative containment is first defined to determine whether a data sequence contains a negative sequence based on set theory. Second, an efficient approach is proposed to convert the negative containment problem to a positive containment problem. The NSC supports are then calculated based only on the corresponding PSP. This not only avoids the need for additional database scans, but also enables the use of existing PSP mining algorithms to mine for NSP. Finally, a simple but efficient strategy is proposed to generate NSC.
Theoretical analyses show that e-NSP performs particularly well on datasets with a small number of elements in a sequence, a large number of itemsets and low minimum support. e-NSP is compared with two currently available NSP mining algorithms via intensive experiments on three synthetic and six real-life datasets from aspects including data characteristics, computational costs and scalability. e-NSP is tens to thousands of times faster than baseline approaches, and offers a sound and effective approach for efficient mining of NSP in large scale datasets by directly using existing PSP mining algorithms.

✩ The source codes of e-NSP are available from http://www-staff.it.uts.edu.au/~lbcao/.

∗ Corresponding author.

E-mail address: [email protected] (L. Cao).
http://dx.doi.org/10.1016/j.artint.2016.03.001


1. Introduction

Behavior is widely seen in our daily study, work, living and entertainment [7]. A critical issue in understanding behavior from the informatics perspective, namely behavior informatics [6,9], is to understand the complexities, dynamics and impact of non-occurring behaviors (NOB) [8]. Mining negative sequential patterns (NSP) [43] is one of the few approaches available for understanding NOB. NSP refer to frequent sequences with non-occurring and occurring behaviors (also called negative and positive behaviors in behavior and sequence analysis), such as a driver failing to stop before driving through an intersection with a stop sign, or a missing treatment in medical service.

Discovering NSP is becoming increasingly important, and sometimes plays a role that cannot be replaced by analyzing occurring behaviors alone in many intelligent systems and applications, such as intelligent transport systems (ITS), health and medical management systems, bioinformatics, biomedical systems, risk management, counter-terrorism, and security. For example, in ITS, negative driving behavior patterns that result from drivers failing to follow certain traffic rules could cause serious traffic problems or even disasters. In healthcare, a patient missing an important medical appointment might result in serious health issues. In gene sequencing, the non-occurrence of certain genes may be associated with particular diseases. Such problems cannot be handled by the identification of occurring behavior patterns alone.

Formally, a NSP in healthcare may appear as follows. Assume $p_1 = \langle abc\,X \rangle$ is a positive sequential pattern (PSP); $p_2 = \langle ab\,\neg c\,Y \rangle$ is a NSP, where a, b and c stand for medical service codes indicating the services a patient has received in healthcare, and X and Y stand for disease status. $p_1$ shows that patients who usually receive medical services a, b and then c are likely to have disease status X, whereas $p_2$ indicates that patients receiving treatments of a and b but NOT c have a high probability of having disease status Y.

Although intensive efforts have been made to develop PSP (such as $p_1$) mining algorithms such as GSP [32], FreeSpan [15], SPADE [34], PrefixSpan [30], and SPAM [4], NSP (such as $p_2$) cannot be described or discovered by these algorithms. This is because mining NSP is much more difficult than mining PSP [8], particularly due to the following three intrinsic complexities.

• Problem complexity. The hidden nature of non-occurring items makes definition of the NSP mining problem complicated, particularly the NSP format and negative containment. This is why researchers present different and even inconsistent definitions and constraints in their identification of NSP. As research into NSP is at a very early stage, it is important to formalize the problem properly and comprehensively.

• High computational complexity. Existing methods calculate the support of negative sequential candidates (NSC) by additionally scanning the database after identifying PSP. This leads to additional costs and low efficiency in mining NSP. It is thus essential to develop efficient NSP mining methods without database re-scanning.

• Large NSC search space. The existing approaches generate k-size NSC by conducting a joining operation on (k-1)-size NSP. This results in a huge number of NSC [11,24,26,38], which makes it difficult to search for meaningful outputs. Further, NSC does not satisfy the Apriori principle [38]. It is a challenge to prune the large proportion of meaningless NSC, and it is thus important to develop efficient approaches for generating a limited number of truly useful NSC.

NSP mining is at an early stage, and has seen only very limited progress in recent years [3,11,12,14,16–18,20,21,38–40,43]. All existing methods are very inefficient and are too specific for mining NSP. As NSP analysis is very complex, challenging and immature, the addition of appropriate constraints makes the problem solvable to some degree. All the reported work in NSP analysis therefore incorporates constraints on format, frequency and/or negative elements from respective aspects (see more discussion in Section 3.2.1) to reduce the number of NSC, discover specific NSP of particular interest, and enhance computational efficiency. More importantly, there are different definitions of the most important concept in NSP mining, negative containment, which defines whether a data sequence contains a negative sequence (see more details in Section 4.2). Some definitions are more generic and typical [11,12,38,39] than others, which either incorporate additional constraints on negation format and containment [21–23,25–29] or offer no clear definition [18,31].

To address the above intrinsic complexities in NSP mining and make it efficient for real-life applications, this paper proposes an innovative, flexible and efficient framework, the set theory-based NSP mining framework (ST-NSP), and a comprehensive algorithm called e-NSP to instantiate the ST-NSP framework. We first formalize the NSP mining problem by defining some important concepts in NSP, including negative containment. The formalization draws a clear boundary as to whether a data sequence set contains a NSC or not. Building on set theory, e-NSP then calculates the support of NSC based on the support of the corresponding PSP, without additional database scans. We convert the negative containment problem to a positive containment problem. The NSC supports are then calculated by only using a NSC's corresponding PSP information. In this way, there is no need to re-scan the database after discovering PSP. More importantly, any existing PSP algorithm can then be directly used or slightly changed to discover NSP.

We specify the frequency, format and negative element constraints in e-NSP to make the problem consistent with set theory (frequency constraint), reduce confusion, ambiguity and uncertainty (format constraint), and to control computational complexity (negative element constraint). Our definitions and specifications of such constraints and negative containment form the ST-NSP framework and make e-NSP consistent with set theory, which thus enables e-NSP to be much more efficient and flexible than existing methods.


Below, we summarize the main differences between our work and the related works in terms of the constraints imposed on NSP, key definitions and concepts, as well as the significant contributions made in this work:

• An innovative and very efficient framework ST-NSP and its corresponding theoretical design and algorithm e-NSP are proposed to discover NSP within one database scan; e-NSP is built on set theory and is the only work reported so far that uses set theory for sequence analysis, which opens a new paradigm for NSP analysis.

• e-NSP is built on a systematic and comprehensive statement and formalization of the NSP problem, and incorporates new concepts, specifications on constraints and effective technical design that make it efficient and flexible, in addition to offering substantial theoretical analysis in terms of data characteristics represented by data factors and computational costs.

• e-NSP introduces three constraints on frequency, format and negative elements respectively, which not only make e-NSP consistent with typical existing work but also support the proposed framework of set theory-based NSP study.

• Theoretical analysis shows that e-NSP is much less sensitive to data factors and is much more efficient on datasets that have a small number of elements in a sequence and a large number of itemsets, compared to the available baseline algorithms we can find. This advantage of e-NSP is especially clear when minimum support is low. It is thus very suitable for large scale data.

Unfortunately, it is not easy to find baseline datasets and NSP algorithms that satisfy similar NSP settings. We compare e-NSP with two modified NSP algorithms in the literature using intensive experiments on three synthetic and six real-world datasets from many perspectives, including computational complexity against different minimum supports on 8 distinct datasets, data characteristics analysis on 48 combinations of various data factors on 16 subsets, and scalability tests on two scalable datasets with low minimum support. The experimental results verify the theoretical analyses, showing that e-NSP is much more flexible and efficient and is tens to thousands of times faster than the baseline methods, thus highly suitable for large scale datasets. To the best of our knowledge, this is the first approach that efficiently discovers NSP by involving PSP only without rescanning databases, directly applies existing PSP discovery algorithms, and is particularly effective for very large datasets. This demonstrates the significant value of the proposed ST-NSP framework.

The remainder of the paper is organized as follows. Section 2 discusses the related work and gaps in the current knowledge. In Section 3, we formalize the problem of mining PSP and NSP, providing corresponding definitions and constraints. The ST-NSP framework and the e-NSP algorithm are detailed in Section 4. The theoretical analyses of e-NSP compared to baselines are presented in Section 5 from the perspective of data factors. Section 6 presents substantial experimental results and a real-life case study. Discussions on several critical issues are offered in Section 7, followed by conclusions and future work in Section 8.

2. Related work

In this section, we first discuss the related work on a fundamental concept in NSP; that is, negative containment. Different researchers present inconsistent definitions and explanations of negative containment for respective interests and purposes. In [11], a data sequence $ds = \langle dc \rangle$ cannot contain negative sequence $ns = \langle \neg(ab)\,c\,\neg d \rangle$ since $size(ns) > size(ds)$; while the work in [38] allows that ds contains ns. Another critical issue is how to deal with a non-occurring element. Chen et al. [11] argued that $ds = \langle dc \rangle$ cannot contain $\langle \neg c\,d \rangle$ because $\langle d \rangle$ in ds has no antecedent itemset; ds cannot contain $\langle c\,\neg d \rangle$ because $\langle c \rangle$ in ds has no successor. However, Zheng et al. [38] allowed that ds contains $\langle c\,\neg d \rangle$. Furthermore, the containment position of each element is very tricky. Chen et al. [11] proposed that a data sequence $ds = \langle aacbc \rangle$ does not contain a negative sequence $ns = \langle a\,\neg b\,c \rangle$, since opposite evidence of $\langle abc \rangle$ can be found in ds. However, Zheng et al. [38] presented a differing opinion since $ds = \langle aacbc \rangle$ matches a and c; their algorithm finds the corresponding positive element in ds for each negative element of ns, such as the second a for $\neg b$. According to our understanding, since $\langle e \rangle$ means that e occurs, no element (including element e) occurs before or after e. Accordingly, $\langle e \rangle$ contains $\langle e\,\neg ? \rangle$, $\langle \neg ?\,e \rangle$, $\langle \neg ?\,e\,\neg ? \rangle$, where "?" represents any element.

Second, we summarize the status of the NSP research. Unlike PSP mining, which has been widely explored, very limited research outcomes are available in the literature on mining NSP. We briefly introduce what we have been able to find. Zheng et al. [38] proposed a negative version of the GSP algorithm, i.e. NegGSP, to mine for NSP. This algorithm first discovers PSP by GSP, then generates and prunes NSC. It then counts the support of NSC by re-scanning the database to generate negative patterns. Chen et al. [11] proposed a PNSP approach for mining positive and negative sequential patterns in the form of $\langle (abc)\,\neg(de)\,(ijk) \rangle$. This approach is broken into three stages. PSP are mined by traditional algorithms and all positive itemsets are derived from these PSP. All negative itemsets are then derived from these positive itemsets. Lastly, both positive and negative itemsets are joined to generate NSC, which are in turn joined iteratively to generate longer NSC in an Apriori-like way. This approach calculates the support of NSC by re-scanning the database. In [22–24], the authors only handled NSP in which the last element was negative. An algorithm NSPM in [24] mines such NSP. The extended versions of [24] are in [22,23], which add fuzzy and strong constraints respectively to NSPM. In [39], a genetic algorithm is proposed to mine NSP. It generates candidates by crossover and mutation by involving a dynamic fitness function to generate as many candidates as possible and avoid population stagnation. In [26], only NSP in the form of $(\neg A, B)$, $(A, \neg B)$ and $(\neg A, \neg B)$ are identified, which is similar to mining negative association rules [13,33]. The work in [26] requires $A \cap B = \emptyset$, which is a usual constraint in association rule mining but is a very strict constraint in sequential pattern mining. It generates frequent itemsets first, then generates frequent and infrequent sequences, and lastly derives NSP from the infrequent sequences. Three extended versions of [26] can be found in [27–29], in which conditions are added to fuzzy, multiple level and multiple minimum supports, respectively. Although the authors of [31] mentioned the question of mining NSP, they did not propose how to mine them. In [20], NSP are mined in the same form as [26] in incremental transaction databases.

Table 1
Notation description.

| Symbol | Description |
| --- | --- |
| $I$ | A set of items, $I = \{x_1, \ldots, x_n\}$, consisting of n items $x_k$ ($1 \le k \le n$) |
| $s$ | A sequence, $s = \langle s_1, \ldots, s_l \rangle$, consisting of l elements $s_j$ ($1 \le j \le l$) |
| min_sup | Minimum support threshold |
| $ns$ | A negative sequence |
| $length(s)$ | Length of sequence s, referring to the number of items in all elements in s |
| $size(s)$ | Size of a sequence s, referring to the total number of elements in s |
| $sup(s)$ | The support of s |
| $PP(ns)$ | ns's positive partner |
| $EidS_s$ | Element-id set of sequence s |
| $OPS(EidS_s)$ | Order-preserving sequence with $EidS_s$ |
| $MPS(s)$ | Maximum positive sub-sequence of s |
| $1\text{-}negMS_{ns}$ | 1-neg-size maximum sub-sequence of ns |
| $1\text{-}negMSS_{ns}$ | 1-neg-size maximum sub-sequence set of ns |
| FSE or fse | First sub-sequence ending position |
| LSB or lsb | Last sub-sequence beginning position |

Zhao et al. [35] proposed an approach to mining event-oriented negative sequential rules from infrequent sequences in the form of $\langle A \rangle \Rightarrow \langle \neg B \rangle$, $\langle \neg A \rangle \Rightarrow \langle B \rangle$, $\langle \neg A \rangle \Rightarrow \langle \neg B \rangle$. Based on the work in [35], Zhao et al. [36] also presented an approach for discovering both positive and negative impact-oriented sequential rules. Issues about sequence classification using positive and negative patterns were discussed in [25,37]. Positive and negative usage patterns are used in [19] to filter Web recommendation lists. None of these papers involve NSP mining directly.

The above discussions about negative containment and existing NSP research involve the key issue of various constraints applied to NSP mining. As detailed in Section 3.2.1, constraints are generally imposed according to the frequency of elements, patterns or positive partners, the format of continuous negative elements, and the negation of elements or items.

Another relevant research topic is negative association rule mining [5,33]. However, as the ordering relationship between items and elements in a sequence is inbuilt in NSP, it is much more challenging to discover NSP than negative associations and patterns. In fact, the ordinal nature of NSP means that algorithms for negative association rule mining and negative pattern mining cannot be directly used to mine NSP.

In summary, the above discussions show that NSP research presents the following strong early-stage characteristics:

• Significant inconsistency in key concepts and settings. In particular, there is no consolidated concept of negative containment in the literature. In Section 4.2, we will present a generic definition of negative containment and formalize the issue.

• Different constraints are incorporated into NSP, leading to varied settings and even divided assumptions about NSP. In Section 3.2.1, we will provide formal and generic definitions of frequency constraint, format constraint, and negative element constraint.

• NSP mining is embedded with specific constraints and requires rescanning databases.

• Existing NSP approaches either re-scan a database as a result of inefficient computational design or do not directly address the problem of NSP mining. In Section 4, e-NSP is introduced, which scans a database only once.

Our proposed ST-NSP (see Section 4.1) and e-NSP (more details in Section 4) directly address the above fundamental issues and the substantial complexities of NSP mining by building an innovative, formal, comprehensive and generic design, together with theoretical analysis and experimental evaluation.

3. Problem statement

In a sequence, a non-occurring item is called a negative item and an occurring item is called a positive item. Sequences that consist of at least one negative item are called negative sequences. The sequences in source data are called data sequences [32]. Classic sequential pattern mining handles occurring items only, and generates frequent positive sequences [11,24,26,38]. Below, we formally define PSP and NSP. The main symbols used in this paper are listed in Table 1.

3.1. Positive Sequential Patterns – PSP

Let $I = \{x_1, x_2, \ldots, x_n\}$ be a set of items. An itemset is a subset of I. A sequence is an ordered list of itemsets. A sequence s is denoted by $\langle s_1 s_2 \ldots s_l \rangle$, where $s_j \subseteq I$ ($1 \le j \le l$). $s_j$ is also called an element of the sequence, and denoted as $(x_1 x_2 \ldots x_m)$, where $x_k$ is an item, $x_k \in I$ ($1 \le k \le m$). For simplicity, the brackets are omitted if an element only has one item, i.e., element $(x)$ is coded $x$. To reduce complexity, we assume that an item occurs at most once in an element, but can appear multiple times in different elements of a sequence.

The length of sequence s, denoted as $length(s)$, is the total number of items in all elements in s. s is a k-length sequence if $length(s) = k$. The size of sequence s, denoted as $size(s)$, is the total number of elements in s. s is a k-size sequence if $size(s) = k$. For example, a given sequence $s = \langle (ab)\,c\,d \rangle$ is composed of 3 elements $(ab)$, c and d, or 4 items a, b, c and d. Therefore s is a 4-length and 3-size sequence.
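The length and size definitions above can be illustrated with a small sketch. The encoding is an assumption made here for illustration only (it is not taken from the paper): a sequence is a Python list of `frozenset` elements, items are single characters, and the helper `seq` is hypothetical.

```python
from typing import FrozenSet, List

Element = FrozenSet[str]      # one itemset (element) of a sequence
Sequence = List[Element]      # ordered list of elements

def seq(*elements: str) -> Sequence:
    """Hypothetical helper: each string lists the single-character items of one element."""
    return [frozenset(e) for e in elements]

def length(s: Sequence) -> int:
    """Total number of items over all elements of s."""
    return sum(len(e) for e in s)

def size(s: Sequence) -> int:
    """Total number of elements of s."""
    return len(s)

s = seq("ab", "c", "d")        # the sequence <(ab)cd>
print(length(s), size(s))      # 4 3
```

Using `frozenset` for elements reflects the assumption stated in the text that an item occurs at most once in an element.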

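Sequence $s_\alpha = \langle \alpha_1 \alpha_2 \ldots \alpha_n \rangle$ is called a sub-sequence of sequence $s_\beta = \langle \beta_1 \beta_2 \ldots \beta_m \rangle$, and $s_\beta$ is a super-sequence of $s_\alpha$, denoted as $s_\alpha \subseteq s_\beta$, if there exist $1 \le j_1 < j_2 < \ldots < j_n \le m$ such that $\alpha_1 \subseteq \beta_{j_1}, \alpha_2 \subseteq \beta_{j_2}, \ldots, \alpha_n \subseteq \beta_{j_n}$. We also say that $s_\beta$ contains $s_\alpha$. For example, $\langle a \rangle$, $\langle d \rangle$ and $\langle (ab)\,d \rangle$ are all sub-sequences of $\langle (ab)\,c\,d \rangle$.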
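The sub-sequence relation above can be checked with a greedy left-to-right scan. The following is a minimal sketch under the same illustrative encoding (lists of `frozenset` elements); the function name `is_subsequence` is a hypothetical choice, not the paper's.

```python
from typing import FrozenSet, List

Sequence = List[FrozenSet[str]]

def is_subsequence(alpha: Sequence, beta: Sequence) -> bool:
    """True if indices j1 < j2 < ... exist with alpha[i] a subset of beta[ji].

    Matching each element of alpha at the earliest possible position of beta
    is sufficient: an earlier match never blocks a later one.
    """
    j = 0
    for a in alpha:
        # advance to the first remaining element of beta that includes a
        while j < len(beta) and not a <= beta[j]:
            j += 1
        if j == len(beta):
            return False
        j += 1  # the next element of alpha must match strictly later
    return True

beta = [frozenset("ab"), frozenset("c"), frozenset("d")]          # <(ab)cd>
print(is_subsequence([frozenset("a")], beta))                     # True
print(is_subsequence([frozenset("ab"), frozenset("d")], beta))    # True
print(is_subsequence([frozenset("d"), frozenset("a")], beta))     # False
```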

A sequence database D is a set of tuples $\langle sid, ds \rangle$, where sid is the sequence_id and ds is the data sequence. The number of tuples in D is denoted as $|D|$. The set of tuples containing sequence s is denoted as $\{\langle s \rangle\}$. The support of s, denoted as $sup(s)$, is the number of tuples in $\{\langle s \rangle\}$, i.e., $sup(s) = |\{\langle s \rangle\}| = |\{\langle sid, ds \rangle \mid \langle sid, ds \rangle \in D \wedge (s \subseteq ds)\}|$. min_sup is a minimum support threshold predefined by users. Sequence s is called a frequent (positive) sequential pattern if $sup(s) \ge min\_sup$. By contrast, s is infrequent if $sup(s) < min\_sup$.
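As a small illustration of the support definition, the sketch below counts the tuples of a toy database that contain a given sequence. The three data sequences and the threshold are invented for this example only, and the names `support` and `is_subsequence` are hypothetical.

```python
from typing import FrozenSet, List, Tuple

Sequence = List[FrozenSet[str]]

def is_subsequence(alpha: Sequence, beta: Sequence) -> bool:
    """Greedy containment test: alpha[i] must be a subset of some later beta[j]."""
    j = 0
    for a in alpha:
        while j < len(beta) and not a <= beta[j]:
            j += 1
        if j == len(beta):
            return False
        j += 1
    return True

def support(s: Sequence, database: List[Tuple[int, Sequence]]) -> int:
    """sup(s): number of tuples <sid, ds> in D whose data sequence contains s."""
    return sum(1 for _sid, ds in database if is_subsequence(s, ds))

def seq(*elements: str) -> Sequence:
    return [frozenset(e) for e in elements]

# Toy database (made up for illustration).
D = [
    (1, seq("a", "bc", "d", "cde")),   # <a(bc)d(cde)>
    (2, seq("ab", "c", "d")),          # <(ab)cd>
    (3, seq("d", "c")),                # <dc>
]
min_sup = 2  # assumed absolute threshold

cd = seq("c", "d")
print(support(cd, D), support(cd, D) >= min_sup)   # 2 True
```

Here support is an absolute count, matching the definition in the text; a relative threshold would divide by `len(D)`.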

PSP mining aims to discover all positive sequences that satisfy a given minimum support. For simplicity, we often omit "positive" when discussing positive items, positive elements and positive sequences in mining PSP.

3.2. Negative Sequential Patterns – NSP

3.2.1. Constraints on NSP

In real applications such as health and medical business, the number of NSC and the identified negative sequences are often large, but many of them are not actionable [10]. The number of NSC may be huge or even infinite if no constraints are added. This makes NSP mining very challenging. For example, for a dataset with 10 distinct items, the total number of potential itemsets is 1023 ($= C(10,1) + C(10,2) + \ldots + C(10,10)$), where $C(m,n)$ represents the number of combinations created by choosing n items from m distinct items. Assuming that 10 of the 1023 are frequent, if the maximum size of data sequences in DB is 3, the total number of NSC could be $1033^1 + 1033^2 + 1033^3$. Many of the combinations are uninteresting or meaningless in reality.
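The counts in this example can be reproduced with a few lines of arithmetic. The reading of 1033 as 1023 candidate positive itemsets plus the negations of the 10 frequent ones is our interpretation of the example, not something the text states explicitly.

```python
from math import comb

n_items = 10
# Non-empty itemsets over 10 distinct items: C(10,1) + ... + C(10,10).
itemsets = sum(comb(n_items, k) for k in range(1, n_items + 1))
print(itemsets)  # 1023

frequent = 10
# Assumed reading: each sequence position holds one of the 1023 itemsets
# or the negation of one of the 10 frequent itemsets.
units = itemsets + frequent
max_size = 3
total_nsc = sum(units ** k for k in range(1, max_size + 1))
print(units, total_nsc)
```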

Various constraints have been added to existing NSP mining, such as those detailed in [11,38,39] and [24], to deal with the fundamental challenges embedded in NSP mining and the current immature situation in NSP study. Although they are neither consistent nor generic, these constraints aim to reduce problem complexity and the number of NSC, and to efficiently discover actionable NSP. For example, in health insurance, a common business rule says if a prosthesis (a) has been charged for, there should be a charge for a prior corresponding procedure (b). Accordingly, if a customer claims a but does not claim b, which is represented as a NSP: $\langle \neg b\,a \rangle$, then the claim behavior could be potentially suspicious. More complicated NSP may be found, which contain more negative items, such as $\langle a\,\neg b\,a\,\neg c \rangle$ (c is another procedure).

In this work, we incorporate three constraints in e-NSP: frequency constraint, format constraint, and negative element constraint. The reasons for introducing these three constraints are as follows. First, as explained above, NSP mining is at an early stage and is too complicated to conduct without the imposition of specific constraints. Second, we specify constraints on frequency, format and negative elements in order to reduce problem complexity, and in particular, to build a framework of NSP mining on set theory to extract negative sequential patterns from identified positive patterns (frequency constraint), reduce the number of NSC (format constraint), and substantially reduce the complexity in handling partially negative elements (negative element constraint).

Below, we define key concepts for these three constraints for mining NSP, and further explain why each of them is introduced.

Definition 1 (Positive Partner). The positive partner of a negative element $\neg e$ is e, denoted as $p(\neg e)$, i.e., $p(\neg e) = e$. The positive partner of positive element e is e itself, i.e., $p(e) = e$. The positive partner of a negative sequence $ns = \langle s_1 \ldots s_k \rangle$ changes all negative elements in ns to their positive partners, denoted as $p(ns)$, i.e., $p(ns) = \{\langle s'_1 \ldots s'_k \rangle \mid s'_i = p(s_i), s_i \in ns\}$. For example, $p(\langle \neg(ab)\,c\,\neg d \rangle) = \langle (ab)\,c\,d \rangle$.
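A minimal sketch of the positive partner follows. It assumes a hypothetical encoding in which every element of a (possibly negative) sequence is a pair `(is_negative, items)`; this encoding is our illustration, not the paper's data structure.

```python
from typing import FrozenSet, List, Tuple

# Hypothetical encoding: (is_negative, items) for each element.
NegElement = Tuple[bool, FrozenSet[str]]
NegSequence = List[NegElement]

def positive_partner(ns: NegSequence) -> NegSequence:
    """p(ns): turn every negative element into its positive partner."""
    return [(False, items) for _is_negative, items in ns]

# ns = <¬(ab) c ¬d>
ns = [(True, frozenset("ab")), (False, frozenset("c")), (True, frozenset("d"))]
print(positive_partner(ns))   # encodes <(ab)cd>
```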

Constraint 1 (Frequency constraint). For simplicity, this paper only focuses on the negative sequences ns whose positive partners are frequent, i.e., $sup(p(ns)) \ge min\_sup$. In contrast, the authors in [11] and [38] only require that the positive partner of each element in ns is frequent.

Although there may be many negative sequences that can be mined from infrequent positive partner sequences (just as many useful negative association rules can be mined from infrequent itemsets [33]), requiring positive partners to be frequent serves multiple purposes: (1) users are often interested in the absence of certain "frequent" (positive) itemsets; the positive partners of those negative itemsets appearing in negative sequential patterns should therefore satisfy a certain frequency. (2) If we do not enforce this constraint, the number of NSC may be huge or even infinite, which would lead to …

A similar constraint has been adopted by other researchers, although it is specified differently for particular purposes. For example, in [11], only the positive partner of each element in negative sequences ns is required to be frequent. In our work, we require that the positive partner of ns as a whole, not only of each element, should be frequent, i.e., $sup(p(ns)) \ge min\_sup$. This is because in this work we build a new framework that allows negative sequential patterns to be mined only from positive sequential patterns, by using set theory to rapidly "calculate" the support of NSC based only on the support of their corresponding PSP without additional database scans.

Constraint 2 (Format constraint). Continuous negative elements in a NSC are not allowed.

Example 1. $\langle \neg(ab)\,c\,\neg d \rangle$ satisfies Constraint 2, but $\langle \neg(ab)\,\neg c\,d \rangle$ does not. This is similar to the settings in [11,38].

This constraint is introduced for three reasons. (1) If two or more continuous negative elements are allowed, more and more negative sequential candidates can be generated, which would result in a very challenging and sophisticated task. (2) In practice, it would be difficult to ascertain the correct order of two continuous negative elements if there were no positive elements between them. (3) As argued in [11], the order of consecutive negative itemsets is trifling for many applications. As a result, adding this constraint will substantially reduce the number of NSC.

Constraint 3 (Negative element constraint). The smallest negative unit in a NSC is an element. If an element consists of more than one item, either all or none of the items are allowed to be negative.

Example 2. $\langle \neg(ab)\,c\,d \rangle$ satisfies Constraint 3, but $\langle (\neg a\,b)\,c\,d \rangle$ does not because, in element $(\neg a\,b)$, only $\neg a$ is negative while b is not.

Although negative sequential patterns with negative items (such as in $\langle (\neg a\,b)\,c\,d \rangle$) may also be interesting in real applications, the complexity could be too high to solve the problem. Introducing this constraint avoids the complexity of handling partially negative elements, and substantially reduces the number of NSC. This is because the number of NSC will increase exponentially if negative items are allowed. For example, given positive pattern $ps = \langle (ab)(cde)(fghi) \rangle$, the number of NSC is only 4 if this constraint is satisfied; they are $\langle \neg(ab)(cde)(fghi) \rangle$, $\langle (ab)\,\neg(cde)(fghi) \rangle$, $\langle (ab)(cde)\,\neg(fghi) \rangle$ and $\langle \neg(ab)(cde)\,\neg(fghi) \rangle$. If negative items are allowed, the number of NSC generated from $\langle (ab)(cde)(fghi) \rangle$ is 1260 ($= 4 * (C(2,1) + C(2,2)) * (C(3,1) + C(3,2) + C(3,3)) * (C(4,1) + C(4,2) + C(4,3) + C(4,4)) = 4 * 3 * 7 * 15$). It would be very inefficient and complex to generate all NSC. When the size and length of data is huge, it is unrealistic to generate all NSC. Therefore, similar to the settings in [11,38], we also apply this constraint to avoid the complexity of handling partially negative elements.
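The two counts in this example can be checked by brute force. The sketch below enumerates every way of negating whole elements of a positive pattern (Constraint 3) such that no two negated elements are adjacent (Constraint 2). It is an illustration of the constraints only, written under an assumed `(is_negative, items)` encoding; it is not the paper's NSC generation procedure, and `generate_nsc` is a hypothetical name.

```python
from itertools import combinations
from math import comb
from typing import FrozenSet, List, Tuple

Element = FrozenSet[str]
NegElement = Tuple[bool, Element]   # (is_negative, items): assumed encoding

def generate_nsc(psp: List[Element]) -> List[List[NegElement]]:
    """Enumerate candidates obtained by negating whole, non-adjacent elements."""
    k = len(psp)
    candidates = []
    for r in range(1, k + 1):
        for negated in combinations(range(k), r):
            # Constraint 2: no two negated positions may be adjacent.
            if all(b - a > 1 for a, b in zip(negated, negated[1:])):
                # Constraint 3: negation applies to an element as a whole.
                candidates.append([(i in negated, e) for i, e in enumerate(psp)])
    return candidates

ps = [frozenset("ab"), frozenset("cde"), frozenset("fghi")]   # <(ab)(cde)(fghi)>
nsc = generate_nsc(ps)
print(len(nsc))   # 4

def nonempty_subsets(n: int) -> int:
    """C(n,1) + ... + C(n,n), as used in the text's item-level count."""
    return sum(comb(n, j) for j in range(1, n + 1))

item_level = len(nsc) * nonempty_subsets(2) * nonempty_subsets(3) * nonempty_subsets(4)
print(item_level)  # 1260
```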

In summary, the above three constraints introduced into e-NSP not only maintain a certain consistency with existing work, but also enable the efficient generation of NSC based on their positive partners and enable the calculation of NSC support. This leads to a fundamentally new framework for efficient discovery of NSP in large scale data.

In practice, with the development of more efficient learning frameworks, data structures, and NSP mining algorithms, these constraints may be progressively relaxed. Given the current stage of maturity of NSP mining, we only work on those negative sequences that satisfy the above three constraints in this work.

3.2.2. NSP concepts

According to Constraint 3, the definition of sub-sequences in a positive sequence is not applicable to negative sequences. Below, we define sub-sequence and super-sequence for negative sequences, in addition to the concepts of element-id set and order-preserving sequence.

Definition 2 (Positive/Negative Element-id Set). Element-id is the order number of an element in a sequence. Given a sequence $s = \langle s_1 s_2 \ldots s_m \rangle$, $id(s_i) = i$ is the element identifier of element $s_i$. Element-id set $EidS_s$ of s is the set that includes all elements and their ids in s, i.e., $EidS_s = \{(s_i, id(s_i)) \mid s_i \in s\} = \{(s_1, 1), (s_2, 2), \ldots, (s_m, m)\}$ ($1 \le i \le m$). The set including all positive and negative element-ids of a sequence s is called the positive and negative element-id set of s, denoted as $EidS^+_s$, $EidS^-_s$, respectively. For example, for $s = \langle \neg(ab)\,c\,\neg d \rangle$, $EidS^+_s = \{(c, 2)\}$, $EidS^-_s = \{(\neg(ab), 1), (\neg d, 3)\}$.

Definition 3 (Order-preserving Sequence). For any subset $EidS'_s = \{(\alpha_1, id_1), (\alpha_2, id_2), \ldots, (\alpha_p, id_p)\}$ ($1 < p \le m$) of $EidS_s$, $\alpha = \langle \alpha_1 \alpha_2 \ldots \alpha_p \rangle$, if $\forall \alpha_i, \alpha_{i+1} \in \alpha$ ($1 \le i < p$), there exists $id_i < id_{i+1}$, then $\alpha$ is called an order-preserving sequence of $EidS_s$, denoted as $\alpha = OPS(EidS'_s)$.

Example 3. Given $s = \langle \neg(ab)\,c\,\neg d \rangle$, its $EidS_s = \{(\neg(ab), 1), (c, 2), (\neg d, 3)\}$, $EidS^+_s = \{(c, 2)\}$, $EidS^-_s = \{(\neg(ab), 1), (\neg d, 3)\}$. We can obtain $OPS(EidS^+_s) = \langle c \rangle$. Also, if $EidS'_s = \{(\neg(ab), 1), (c, 2)\}$, we can create a sequence $OPS(EidS'_s) = \langle \neg(ab)\,c \rangle$.
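Example 3 can be traced with a short sketch. Elements are again encoded as hypothetical `(is_negative, items)` pairs, and the function names `eid_set` and `ops` are illustrative choices rather than names from the paper.

```python
from typing import FrozenSet, List, Set, Tuple

NegElement = Tuple[bool, FrozenSet[str]]     # (is_negative, items): assumed encoding
EidPair = Tuple[NegElement, int]             # (element, element-id)

def eid_set(s: List[NegElement]) -> Set[EidPair]:
    """EidS_s: every element of s paired with its 1-based position."""
    return {(e, i) for i, e in enumerate(s, start=1)}

def ops(pairs: Set[EidPair]) -> List[NegElement]:
    """OPS: order the chosen elements by increasing element-id."""
    return [e for e, _eid in sorted(pairs, key=lambda pair: pair[1])]

# s = <¬(ab) c ¬d>
s = [(True, frozenset("ab")), (False, frozenset("c")), (True, frozenset("d"))]
eids = eid_set(s)
positive_eids = {(e, i) for e, i in eids if not e[0]}    # EidS+_s
negative_eids = {(e, i) for e, i in eids if e[0]}        # EidS-_s

print(ops(positive_eids))                                # encodes <c>
print(ops({pair for pair in eids if pair[1] in (1, 2)})) # encodes <¬(ab) c>
```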

Fig. 1. The set theory-based NSP mining framework: ST-NSP.

Definition 4 (Sub-sequence and Super-sequence of a Negative Sequence). Sequence $s_\alpha$ is called a sub-sequence of a negative sequence $s_\beta$, and $s_\beta$ is a super-sequence of $s_\alpha$, if $\forall EidS'_{s_\beta}$, $EidS'_{s_\beta}$ is a subset of $EidS_{s_\beta}$, $s_\alpha = OPS(EidS'_{s_\beta})$, denoted as $s_\alpha \subseteq s_\beta$. If $s_\alpha$ is a negative sequence, it is required to satisfy Constraint 2, which means that there must not be continuous negative elements in $s_\alpha$.

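Example 4. Given $s_\beta = \langle \neg(ab)\,c\,d \rangle$ and $s_\alpha = \langle \neg(ab)\,d \rangle$, $EidS_{s_\beta} = \{(\neg(ab), 1), (c, 2), (d, 3)\}$, and $EidS'_{s_\beta} = \{(\neg(ab), 1), (d, 3)\}$ is a subset of $EidS_{s_\beta}$. $s_\alpha$ is a sub-sequence of $s_\beta$ since $s_\alpha = OPS(EidS'_{s_\beta})$.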

Definition 5 (Maximum Positive Sub-sequence). Let $ns = \langle s_1 s_2 \ldots s_m \rangle$ be an m-size and n-neg-size negative sequence ($m - n > 0$). $OPS(EidS^+_{ns})$ is called the maximum positive sub-sequence of ns, denoted as $MPS(ns)$.

Example 5. Given a negative sequence $s = \langle \neg(ab)\,c\,d \rangle$, $EidS^+_s = \{(c, 2), (d, 3)\}$, its maximum positive sub-sequence is $MPS(s) = \langle cd \rangle$.
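Under the same assumed `(is_negative, items)` encoding, the maximum positive sub-sequence and the check for Constraint 2 reduce to a few lines. Both function names are hypothetical.

```python
from typing import FrozenSet, List, Tuple

NegElement = Tuple[bool, FrozenSet[str]]   # (is_negative, items): assumed encoding

def mps(ns: List[NegElement]) -> List[FrozenSet[str]]:
    """MPS(ns): the positive elements of ns, kept in their original order."""
    return [items for is_negative, items in ns if not is_negative]

def satisfies_constraint2(ns: List[NegElement]) -> bool:
    """True if ns has no two adjacent negative elements."""
    return not any(a[0] and b[0] for a, b in zip(ns, ns[1:]))

# s = <¬(ab) c d>
s = [(True, frozenset("ab")), (False, frozenset("c")), (False, frozenset("d"))]
print(mps(s))                      # encodes <cd>
print(satisfies_constraint2(s))    # True
```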

Definition 6 (Negative Sequential Pattern). A negative sequence s is a negative sequential pattern (NSP) if its support is not less than the threshold min_sup.

4. The set theory-based NSP framework and e-NSP algorithm

4.1. The set theory-based NSP mining framework

Here we propose an innovative NSP mining framework, based on set theory. The structure and working mechanism of the proposed set theory-based NSP mining framework (ST-NSP for short) are illustrated in Fig. 1. We also propose an efficient NSP mining algorithm, called e-NSP, to instantiate the ST-NSP framework. The e-NSP algorithm is summarized in Section 4.7, and an example is given in Section 4.8.

A NSP algorithm instantiating the ST-NSP framework consists of several components:

(1) negative containment to define how a data sequence contains a negative sequence, which will be discussed in Section 4.2;
(2) a negative conversion strategy to convert the negative containment to positive containment, and then use the information of corresponding PSP to calculate the support of a NSC, which will be discussed in Section 4.3;
(3) NSC support calculation to calculate the support of NSC, which will be discussed in Section 4.4;
(4) NSC generation to generate NSC, which will be discussed in Section 4.5; and
(5) an e-NSP data structure and optimization strategy to calculate the union set, which will be discussed in Section 4.6.

Given a sequence database, algorithms like e-NSP built on the ST-NSP framework follow the process below to discover NSP, in which steps (2) to (4) rely on set theory.

(1) All PSP are mined by a traditional PSP algorithm (with slight changes if necessary) or a new PSP mining algorithm;
(2) NSC are generated based on the identified PSP in terms of the three constraints proposed in Section 3.2.1;
(3) The generated NSC are converted to their corresponding PSP in terms of the negative conversion strategy;
(4) The NSC support is calculated based on the corresponding PSP support in terms of negative containment, relevant constraints and the proposed e-NSP data structure and optimization strategy;
(5) Lastly, NSP are identified from the NSC to satisfy certain support criteria.

4.2. Negative containment

As a sub-sequence (e.g., $s_1 = \langle d \rangle$) may occur more than once in its super-sequence (e.g., $s_2 = \langle a\,(bc)\,d\,(cde) \rangle$), we need to know the exact positions at which $s_2$ contains $s_1$, counted from the left and right sides of $s_2$. We therefore define the fundamental concepts underlying negative containment below.


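Definition 7 (First Sub-sequence Ending Position / Last Sub-sequence Beginning Position). Given a data sequence $ds = \langle d_1 d_2 \ldots d_t \rangle$ and a positive sequence $\alpha$,

(1) if $\exists p$ ($1 < p \le t$), $\alpha \subseteq \langle d_1 \ldots d_p \rangle \wedge \alpha \nsubseteq \langle d_1 \ldots d_{p-1} \rangle$, then $p$ is called the First Sub-sequence Ending Position, denoted as $FSE(\alpha, ds)$; if $\alpha \subseteq \langle d_1 \rangle$ then $FSE(\alpha, ds) = 1$;
(2) if $\exists q$ ($1 \le q < t$), $\alpha \subseteq \langle d_q \ldots d_t \rangle \wedge \alpha \nsubseteq \langle d_{q+1} \ldots d_t \rangle$, then $q$ is called the Last Sub-sequence Beginning Position, denoted as $LSB(\alpha, ds)$; if $\alpha \subseteq \langle d_t \rangle$ then $LSB(\alpha, ds) = t$;
(3) if $\alpha \nsubseteq ds$, then $FSE(\alpha, ds) = 0$, $LSB(\alpha, ds) = 0$.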

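Example 6. Given $ds = \langle a\,(bc)\,d\,(cde) \rangle$: $FSE(\langle a \rangle, ds) = 1$, $FSE(\langle c \rangle, ds) = 2$, $FSE(\langle c\,d \rangle, ds) = 3$, $LSB(\langle a \rangle, ds) = 1$, $LSB(\langle c \rangle, ds) = 4$, $LSB(\langle c\,d \rangle, ds) = 2$, $LSB(\langle (cd) \rangle, ds) = 4$.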
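The positions in Definition 7 can be computed by a direct scan. The sketch below is an illustration under assumed conventions rather than the paper's code: a positive sequence is a list of `frozenset` item sets, `contains` is a greedy left-to-right containment test, and the names `seq`, `contains`, `fse` and `lsb` are hypothetical. It assumes a non-empty $\alpha$ and favours clarity over speed.

```python
from typing import FrozenSet, List

Sequence = List[FrozenSet[str]]  # assumed encoding of a positive sequence


def seq(*elements: str) -> Sequence:
    """seq("a", "bc") builds <a (bc)> from single-character items."""
    return [frozenset(e) for e in elements]


def contains(ds: Sequence, alpha: Sequence) -> bool:
    """True if alpha is a sub-sequence of ds (greedy earliest matching)."""
    matched = 0
    for d in ds:
        if matched < len(alpha) and alpha[matched] <= d:
            matched += 1
    return matched == len(alpha)


def fse(alpha: Sequence, ds: Sequence) -> int:
    """First Sub-sequence Ending Position: smallest p with alpha in <d_1..d_p>, else 0."""
    for p in range(1, len(ds) + 1):
        if contains(ds[:p], alpha):
            return p
    return 0


def lsb(alpha: Sequence, ds: Sequence) -> int:
    """Last Sub-sequence Beginning Position: largest q with alpha in <d_q..d_t>, else 0."""
    for q in range(len(ds), 0, -1):
        if contains(ds[q - 1:], alpha):
            return q
    return 0


# Example 6: ds = <a (bc) d (cde)>
ds = seq("a", "bc", "d", "cde")
print(fse(seq("c", "d"), ds), lsb(seq("c", "d"), ds), lsb(seq("cd"), ds))
```

Scanning prefixes for `fse` and suffixes for `lsb` follows the definition literally; a single forward and a single backward pass would give the same positions more cheaply.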

Our definition of a data sequence containing a negative sequence is as follows. We use $n$-neg-size to denote a negative sequence containing $n$ negative elements.

Definition 8 (Negative Containment). Let $ds = \langle d_1 d_2 \ldots d_t \rangle$ be a data sequence and $ns = \langle s_1 s_2 \ldots s_m \rangle$ be an $m$-size and $n$-neg-size negative sequence. (1) If $m > 2t + 1$, then $ds$ does not contain $ns$; (2) if $m = 1$ and $n = 1$, then $ds$ contains $ns$ when $p(ns) \nsubseteq ds$; (3) otherwise, $ds$ contains $ns$ if, $\forall (s_i, id(s_i)) \in EidS^{-}_{ns}$ ($1 \le i \le m$), one of the following three cases holds:

(a) $(lsb = 1)$ or $(lsb > 1) \wedge p(s_1) \nsubseteq \langle d_1 \ldots d_{lsb-1} \rangle$, when $i = 1$;
(b) $(fse = t)$ or $(0 < fse < t) \wedge p(s_m) \nsubseteq \langle d_{fse+1} \ldots d_t \rangle$, when $i = m$;
(c) $(fse > 0 \wedge lsb = fse + 1)$ or $(fse > 0 \wedge lsb > fse + 1) \wedge p(s_i) \nsubseteq \langle d_{fse+1} \ldots d_{lsb-1} \rangle$, when $1 < i < m$,

where $fse = FSE(MPS(\langle s_1 s_2 \ldots s_{i-1} \rangle), ds)$ and $lsb = LSB(MPS(\langle s_{i+1} \ldots s_m \rangle), ds)$.

In the above definition, Case (a) indicates that the first element in $ns$ is negative. "$(lsb > 1) \wedge p(s_1) \nsubseteq \langle d_1 \ldots d_{lsb-1} \rangle$" means that $\langle d_{lsb} \ldots d_t \rangle$ contains $MPS(\langle s_2 \ldots s_m \rangle)$ but $\langle d_1 \ldots d_{lsb-1} \rangle$ does not contain $p(s_1)$. "$lsb = 1$" means that the last sub-sequence's beginning position is 1, so $p(s_1)$ cannot be contained by $ds$. Case (b) indicates that the last element in $ns$ is negative. Case (c) indicates that the negative element is between the first and last elements in $ns$. "$lsb > fse + 1$" ensures there is at least one element in "$\langle d_{fse+1} \ldots d_{lsb-1} \rangle$". "$fse > 0 \wedge lsb = fse + 1$" means that $d_{fse}$ and $d_{lsb}$ are contiguous elements, so $p(s_i)$ cannot be contained between them.

Example 7. Given $ds = \langle a\,(bc)\,d\,(cde) \rangle$, we have

(1) $ns = \langle \neg a\,c \rangle$. $EidS^{-}_{ns} = \{(\neg a, 1)\}$. $ds$ does not contain $ns$: $lsb = 4 > 1$, but $p(s_1) = \langle a \rangle \subseteq \langle d_1 \ldots d_3 \rangle = \langle a\,(bc)\,d \rangle$ (Case (a)).
(2) $ns = \langle \neg a\,a\,c \rangle$. $EidS^{-}_{ns} = \{(\neg a, 1)\}$. $ds$ contains $ns$ because $lsb = 1$ (Case (a)).
(3) $ns = \langle (ab)\,\neg(cd) \rangle$. $EidS^{-}_{ns} = \{(\neg(cd), 2)\}$. $ds$ does not contain $ns$ because $fse = 0$ (Case (b)).
(4) $ns = \langle (de)\,\neg(cd) \rangle$. $EidS^{-}_{ns} = \{(\neg(cd), 2)\}$. $ds$ contains $ns$ because $fse = 4$ ($t = 4$) (Case (b)).
(5) $ns = \langle a\,\neg d\,d\,\neg d \rangle$. $EidS^{-}_{ns} = \{(\neg d, 2), (\neg d, 4)\}$. $ds$ does not contain $ns$. For $(\neg d, 2)$, $fse = 1$, $lsb = 4$, but $p(\neg d) \subseteq \langle d_2 \ldots d_3 \rangle = \langle (bc)\,d \rangle$ (Case (c)). If one negative element does not satisfy the condition, we do not need to consider the other negative elements.
(6) $ns = \langle a\,\neg b\,b\,\neg a\,(cde) \rangle$. $EidS^{-}_{ns} = \{(\neg b, 2), (\neg a, 4)\}$. $ds$ contains $ns$. For $(\neg b, 2)$, $fse = 1$, $lsb = 2$, so $fse > 0 \wedge lsb = fse + 1$ (Case (c)); for $(\neg a, 4)$, $fse = 2$, $lsb = 4$, and $p(\neg a) \nsubseteq \langle d_3 \rangle = \langle d \rangle$ (Case (c)).
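The case analysis of Definition 8 can be traced mechanically. The following Python sketch is one possible reading of the definition, not the authors' code: a negative sequence is encoded as a list of `(is_negative, items)` pairs, and all function names are hypothetical. It assumes that every negative element in a multi-element $ns$ has at least one positive element on each side it needs, so that `fse` and `lsb` are never called with an empty sequence.

```python
from typing import FrozenSet, List, Tuple

Element = Tuple[bool, FrozenSet[str]]  # assumed encoding: (is_negative, items)


def pos(items: str) -> Element:
    return (False, frozenset(items))


def neg(items: str) -> Element:
    return (True, frozenset(items))


def contains(ds, alpha) -> bool:
    """Greedy test that positive sequence alpha is a sub-sequence of ds."""
    matched = 0
    for d in ds:
        if matched < len(alpha) and alpha[matched] <= d:
            matched += 1
    return matched == len(alpha)


def fse(alpha, ds) -> int:
    return next((p for p in range(1, len(ds) + 1) if contains(ds[:p], alpha)), 0)


def lsb(alpha, ds) -> int:
    return next((q for q in range(len(ds), 0, -1) if contains(ds[q - 1:], alpha)), 0)


def mps(ns: List[Element]):
    """Positive elements of ns, in order, as plain item sets."""
    return [items for is_neg, items in ns if not is_neg]


def neg_contains(ds, ns: List[Element]) -> bool:
    """Negative containment following cases (1)-(3) of Definition 8."""
    t, m = len(ds), len(ns)
    if m > 2 * t + 1:
        return False
    if m == 1 and ns[0][0]:
        return not contains(ds, [ns[0][1]])
    for i, (is_neg, items) in enumerate(ns, start=1):
        if not is_neg:
            continue
        if i == 1:  # Case (a): window is <d_1 .. d_{lsb-1}>
            l = lsb(mps(ns[1:]), ds)
            ok = l == 1 or (l > 1 and not contains(ds[:l - 1], [items]))
        elif i == m:  # Case (b): window is <d_{fse+1} .. d_t>
            f = fse(mps(ns[:-1]), ds)
            ok = f == t or (0 < f < t and not contains(ds[f:], [items]))
        else:  # Case (c): window is <d_{fse+1} .. d_{lsb-1}>
            f, l = fse(mps(ns[:i - 1]), ds), lsb(mps(ns[i:]), ds)
            ok = f > 0 and (l == f + 1 or (l > f + 1 and not contains(ds[f:l - 1], [items])))
        if not ok:
            return False  # one failing negative element is enough
    return True


ds = [frozenset(e) for e in ("a", "bc", "d", "cde")]
example7 = [
    [neg("a"), pos("c")],
    [neg("a"), pos("a"), pos("c")],
    [pos("ab"), neg("cd")],
    [pos("de"), neg("cd")],
    [pos("a"), neg("d"), pos("d"), neg("d")],
    [pos("a"), neg("b"), pos("b"), neg("a"), pos("cde")],
]
results = [neg_contains(ds, ns) for ns in example7]
print(results)
```

The six entries of `example7` correspond to items (1) to (6) of Example 7, in order.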

Note that negative sequences do not satisfy the Apriori property. As shown in Example 7, $sup(\langle \neg a\,c \rangle) = 0$ and $sup(\langle \neg a\,a\,c \rangle) = 1$, so $sup(\langle \neg a\,c \rangle) < sup(\langle \neg a\,a\,c \rangle)$ even though $\langle \neg a\,c \rangle \subseteq \langle \neg a\,a\,c \rangle$. Accordingly, we cannot simply apply or adapt the pattern pruning strategies available in PSP mining to NSP mining, and new NSP filtering methods need to be developed.

4.3. Negative conversion

e-NSP is built on the strategy of converting negative containment to positive containment; PSP mining can then be used to mine NSP. For this, we define a special sub-sequence: the 1-neg-size Maximum Sub-sequence.

Definition 9 (1-neg-size Maximum Sub-sequence). For a negative sequence $ns$, its sub-sequences that include $MPS(ns)$ and one negative element $e$ are called 1-neg-size maximum sub-sequences, denoted as $1\text{-}negMS = OPS(EidS^{+}_{ns}, e)$, where $e \in EidS^{-}_{ns}$. The sub-sequence set including all 1-neg-size maximum sub-sequences of $ns$ is called the 1-neg-size maximum sub-sequence set, denoted as $1\text{-}negMSS_{ns}$, where $1\text{-}negMSS_{ns} = \{OPS(EidS^{+}_{ns}, e) \mid \forall e \in EidS^{-}_{ns}\}$.

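Example 8. 1) $ns = \langle \neg(ab)\,c\,\neg d \rangle$, $1\text{-}negMSS_{ns} = \{\langle \neg(ab)\,c \rangle, \langle c\,\neg d \rangle\}$; 2) $ns = \langle \neg a\,(bc)\,d\,\neg(cde) \rangle$, $1\text{-}negMSS_{ns} = \{\langle \neg a\,(bc)\,d \rangle, \langle (bc)\,d\,\neg(cde) \rangle\}$.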


Corollary 1 (Negative Conversion Strategy). Given a data sequence $ds = \langle d_1 d_2 \ldots d_t \rangle$, and $ns = \langle s_1 s_2 \ldots s_m \rangle$, which is an $m$-size and $n$-neg-size negative sequence, the negative containment problem can be converted to the following problem: data sequence $ds$ contains negative sequence $ns$ if and only if the following two conditions hold: (1) $MPS(ns) \subseteq ds$; and (2) $\forall\, 1\text{-}negMS \in 1\text{-}negMSS_{ns}$, $p(1\text{-}negMS) \nsubseteq ds$.

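Example 9. Given $ds = \langle a\,(bc)\,d\,(cde) \rangle$, 1) if $ns = \langle a\,\neg d\,d\,\neg d \rangle$, $1\text{-}negMSS_{ns} = \{\langle a\,\neg d\,d \rangle, \langle a\,d\,\neg d \rangle\}$, then $ds$ does not contain $ns$ because $p(\langle a\,\neg d\,d \rangle) = \langle a\,d\,d \rangle \subseteq ds$; 2) if $ns = \langle a\,\neg b\,b\,\neg a\,(cde) \rangle$, $1\text{-}negMSS_{ns} = \{\langle a\,\neg b\,b\,(cde) \rangle, \langle a\,b\,\neg a\,(cde) \rangle\}$, then $ds$ contains $ns$ because $MPS(ns) = \langle a\,b\,(cde) \rangle \subseteq ds \wedge p(\langle a\,\neg b\,b\,(cde) \rangle) \nsubseteq ds \wedge p(\langle a\,b\,\neg a\,(cde) \rangle) \nsubseteq ds$.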
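The conversion in Corollary 1 needs only positive containment tests. The sketch below illustrates this under assumed conventions (a negative sequence is a list of `(is_negative, items)` pairs; `one_neg_mss`, `positive_partner` and `neg_contains_by_conversion` are hypothetical names) and should be read as an illustration of the corollary rather than as the e-NSP implementation.

```python
from typing import FrozenSet, List, Tuple

Element = Tuple[bool, FrozenSet[str]]  # assumed encoding: (is_negative, items)


def pos(items: str) -> Element:
    return (False, frozenset(items))


def neg(items: str) -> Element:
    return (True, frozenset(items))


def contains(ds, alpha) -> bool:
    """Greedy test that positive sequence alpha is a sub-sequence of ds."""
    matched = 0
    for d in ds:
        if matched < len(alpha) and alpha[matched] <= d:
            matched += 1
    return matched == len(alpha)


def mps(ns: List[Element]):
    """Maximum positive sub-sequence as plain item sets."""
    return [items for is_neg, items in ns if not is_neg]


def one_neg_mss(ns: List[Element]) -> List[List[Element]]:
    """1-negMSS: for each negative element, keep it together with all positive elements."""
    return [
        [e for j, e in enumerate(ns) if not e[0] or j == k]
        for k, e_k in enumerate(ns)
        if e_k[0]
    ]


def positive_partner(s: List[Element]):
    """p(s): drop the negation marks."""
    return [items for _, items in s]


def neg_contains_by_conversion(ds, ns: List[Element]) -> bool:
    """Corollary 1: MPS(ns) is in ds and no p(1-negMS) is in ds."""
    if not contains(ds, mps(ns)):
        return False
    return all(not contains(ds, positive_partner(s)) for s in one_neg_mss(ns))


# Example 9: ds = <a (bc) d (cde)>
ds = [frozenset(e) for e in ("a", "bc", "d", "cde")]
ns1 = [pos("a"), neg("d"), pos("d"), neg("d")]
ns2 = [pos("a"), neg("b"), pos("b"), neg("a"), pos("cde")]
print(neg_contains_by_conversion(ds, ns1), neg_contains_by_conversion(ds, ns2))
```

Because only `contains` on positive sequences is used, the same check can be answered from information a PSP miner already produces, which is the point of the conversion.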

Corollary 1 shows that the problem of whether a data sequence contains a negative sequence can be converted to the problem of whether a data sequence does not contain other positive sequences. This lays a foundation for calculating the support of negative sequences by using only the information of the corresponding positive sequences. Its proof is given below.

Proof of Corollary 1. Here we only prove that Case (c) in the negative containment definition is equivalent to the negative conversion strategy, because Cases (a) and (b) can be proved in the same way. In Case (c), condition "$(fse > 0 \wedge lsb = fse + 1)$" indicates that $d_{fse}$ and $d_{lsb}$ are contiguous elements, so $p(s_i)$ cannot be contained between them. It is a special case of the other condition "$(fse > 0 \wedge lsb > fse + 1) \wedge p(s_i) \nsubseteq \langle d_{fse+1} \ldots d_{lsb-1} \rangle$". Hence, we only need to prove that "$(fse > 0 \wedge lsb > fse + 1) \wedge p(s_i) \nsubseteq \langle d_{fse+1} \ldots d_{lsb-1} \rangle$" is equivalent to the negative conversion strategy.

For $(s_i, id(s_i)) \in EidS^{-}_{ns}$ ($1 \le i \le m$), "$0 < fse$" means $MPS(\langle s_1 s_2 \ldots s_{i-1} \rangle) \subseteq \langle d_1 \ldots d_{fse} \rangle$, and "$0 < lsb$" means $MPS(\langle s_{i+1} \ldots s_m \rangle) \subseteq \langle d_{lsb} \ldots d_t \rangle$. Since $fse < lsb$, $MPS(\langle s_1 s_2 \ldots s_{i-1} s_{i+1} \ldots s_m \rangle) \subseteq \langle d_1 \ldots d_{fse} d_{lsb} \ldots d_t \rangle$, i.e., $MPS(ns) \subseteq ds$. On the other hand, if $MPS(ns) \subseteq ds$, for $\forall (s_i, id(s_i)) \in EidS^{-}_{ns}$, there must exist $0 < fse < lsb$ s.t. $MPS(\langle s_1 s_2 \ldots s_{i-1} \rangle) \subseteq \langle d_1 \ldots d_{fse} \rangle$ and $MPS(\langle s_{i+1} \ldots s_m \rangle) \subseteq \langle d_{lsb} \ldots d_t \rangle$.

In addition, according to the definition of 1-neg-size maximum sub-sequence, $MPS(\langle s_1 s_2 \ldots s_{i-1} \rangle)$, $MPS(\langle s_{i+1} \ldots s_m \rangle)$ and $s_i$ together construct exactly one 1-neg-size maximum sub-sequence $1\text{-}negMS$ of $s_i$, so "$(fse > 0 \wedge lsb > fse + 1) \wedge p(s_i) \nsubseteq \langle d_{fse+1} \ldots d_{lsb-1} \rangle$" also means $p(1\text{-}negMS) \nsubseteq ds$. For $\forall (s_i, id(s_i)) \in EidS^{-}_{ns}$, "$(fse > 0 \wedge lsb > fse + 1) \wedge p(s_i) \nsubseteq \langle d_{fse+1} \ldots d_{lsb-1} \rangle$" can be converted to: $\forall\, 1\text{-}negMS \in 1\text{-}negMSS_{ns}$, $p(1\text{-}negMS) \nsubseteq ds$, and vice versa. $\square$

4.4. Supports of negative sequences

To calculate the support of negative sequences, we first present Corollary 2.

Corollary 2 (Support of negative sequences). Given an $m$-size and $n$-neg-size negative sequence $ns$, for $\forall\, 1\text{-}negMS_i \in 1\text{-}negMSS_{ns}$ ($1 \le i \le n$), the support of $ns$ in sequence database $D$ is:

$$sup(ns) = |\{ns\}| = \left|\{MPS(ns)\} - \bigcup_{i=1}^{n} \{p(1\text{-}negMS_i)\}\right| \qquad (1)$$

This can be easily derived from Corollary 1. Because $\bigcup_{i=1}^{n} \{p(1\text{-}negMS_i)\} \subseteq \{MPS(ns)\}$, Equation (1) can be rewritten as:

$$sup(ns) = |\{MPS(ns)\}| - \left|\bigcup_{i=1}^{n} \{p(1\text{-}negMS_i)\}\right| = sup(MPS(ns)) - \left|\bigcup_{i=1}^{n} \{p(1\text{-}negMS_i)\}\right| \qquad (2)$$

Example 10. $sup(\langle \neg a\,(bc)\,d\,\neg(cde) \rangle) = sup(\langle (bc)\,d \rangle) - |\{\langle a\,(bc)\,d \rangle\} \cup \{\langle (bc)\,d\,(cde) \rangle\}|$; $sup(\langle \neg(ab)\,c\,\neg d \rangle) = sup(\langle c \rangle) - |\{\langle (ab)\,c \rangle\} \cup \{\langle c\,d \rangle\}|$.

If $ns$ contains only one negative element, the support of $ns$ is:

$$sup(ns) = sup(MPS(ns)) - sup(p(ns)) \qquad (3)$$

Example 11. $sup(\langle (ab)\,\neg c\,d \rangle) = sup(\langle (ab)\,d \rangle) - sup(\langle (ab)\,c\,d \rangle)$.

In particular, for negative sequence $\langle \neg e \rangle$,

$$sup(\langle \neg e \rangle) = |D| - sup(\langle e \rangle) \qquad (4)$$
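Equations (2) to (4) can be checked on a small example by working with sets of sequence ids. The snippet below is a sketch under stated assumptions: the four-sequence database `db` is made up for illustration (only sid 10 reuses the data sequence of the earlier examples), the `(is_negative, items)` encoding is an assumed convention, and `sid_set` and `support` are hypothetical helper names. It scans the database directly for clarity, whereas the approach described in the text obtains the sid sets from the PSP mining step.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

Element = Tuple[bool, FrozenSet[str]]  # assumed encoding: (is_negative, items)


def pos(items: str) -> Element:
    return (False, frozenset(items))


def neg(items: str) -> Element:
    return (True, frozenset(items))


def contains(ds, alpha) -> bool:
    """Greedy test that positive sequence alpha is a sub-sequence of ds."""
    matched = 0
    for d in ds:
        if matched < len(alpha) and alpha[matched] <= d:
            matched += 1
    return matched == len(alpha)


def mps(ns: List[Element]):
    return [items for is_neg, items in ns if not is_neg]


def one_neg_mss(ns: List[Element]):
    return [
        [e for j, e in enumerate(ns) if not e[0] or j == k]
        for k, e_k in enumerate(ns)
        if e_k[0]
    ]


def sid_set(db: Dict[int, list], alpha) -> Set[int]:
    """{alpha}: ids of the data sequences that contain positive sequence alpha."""
    return {sid for sid, ds in db.items() if contains(ds, alpha)}


def support(db: Dict[int, list], ns: List[Element]) -> int:
    """Equation (2): sup(MPS(ns)) minus the size of the union of the {p(1-negMS_i)}."""
    base = sid_set(db, mps(ns))
    partners = [[items for _, items in s] for s in one_neg_mss(ns)]
    union = set().union(*(sid_set(db, p) for p in partners))
    return len(base) - len(union)


# Hypothetical toy database: sid -> data sequence.
db = {
    10: [frozenset(e) for e in ("a", "bc", "d", "cde")],
    20: [frozenset(e) for e in ("bc", "d")],
    30: [frozenset(e) for e in ("a", "bc", "d")],
    40: [frozenset(e) for e in ("c", "d")],
}
ns_a = [neg("a"), pos("bc"), pos("d"), neg("cde")]  # first sequence of Example 10
ns_b = [neg("ab"), pos("c"), neg("d")]              # second sequence of Example 10
ns_e = [neg("e")]                                   # single element, Equation (4)
print(support(db, ns_a), support(db, ns_b), support(db, ns_e))
```

Taking the union before counting matters: a data sequence that contains several of the positive partners must be subtracted only once.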

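In the following, we illustrate negative containment in terms of set theory. Fig. 2 shows the intersection of sequences $\langle a \rangle$ and $\langle b \rangle$. $\{\langle a \rangle\}$ and $\{\langle b \rangle\}$ denote the sets of tuples that respectively contain sequences $\langle a \rangle$ and $\langle b \rangle$ in a sequence database. There are three 2-length sequences: $\langle a\,b \rangle$, $\langle b\,a \rangle$ and $\langle (ab) \rangle$, and four disjoint sets: $\{\langle (ab) \rangle\,only\}$, $\{\langle a\,b \rangle\,only\}$, $\{\langle b\,a \rangle\,only\}$ and $\{\langle a\,b \rangle\} \cap \{\langle b\,a \rangle\}$, which are the sets of tuples that contain sequences $\langle (ab) \rangle$ only, $\langle a\,b \rangle$ only, $\langle b\,a \rangle$ only, and both $\langle a\,b \rangle$ and $\langle b\,a \rangle$, respectively.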

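Taking $\{\langle a\,\neg b \rangle\}$ as an example, as seen in Fig. 2, we have:

$$\begin{aligned}
\{\langle a\,\neg b \rangle\} &= (\{\langle a \rangle\} - \{\langle b \rangle\}) \cup \{\langle (ab) \rangle\,only\} \cup \{\langle b\,a \rangle\,only\} \\
&= \{\langle a \rangle\} - \big(\{\langle a\,b \rangle\,only\} \cup (\{\langle a\,b \rangle\} \cap \{\langle b\,a \rangle\})\big) \\
&= \{\langle a \rangle\} - \{\langle a\,b \rangle\}
\end{aligned}$$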


Fig. 2. The intersection of $\{\langle a \rangle\}$ and $\{\langle b \rangle\}$.

Fig. 3. The interpretation of $sup(\langle a\,\neg b\,c\,\neg d\,e\,\neg f \rangle)$ in terms of set theory.

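This result is consistent with the negative containment definition, by which the data sequences in $\{\langle a\,\neg b \rangle\}$ are exactly the sequences that are in $\{\langle a \rangle\}$ but not in $\{\langle a\,b \rangle\}$.

Subsequently, we obtain: $sup(\langle a\,\neg b \rangle) = sup(\langle a \rangle) - sup(\langle a\,b \rangle)$.

Similarly, $sup(\langle \neg a\,b \rangle) = sup(\langle b \rangle) - sup(\langle a\,b \rangle)$; $sup(\langle b\,\neg a \rangle) = sup(\langle b \rangle) - sup(\langle b\,a \rangle)$; and $sup(\langle \neg b\,a \rangle) = sup(\langle a \rangle) - sup(\langle b\,a \rangle)$.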

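Fig. 3 illustrates the meaning of $sup(ns)$ in terms of set theory. Given $ns = \langle a\,\neg b\,c\,\neg d\,e\,\neg f \rangle$, $1\text{-}negMSS_{ns} = \{\langle a\,\neg b\,c\,e \rangle, \langle a\,c\,\neg d\,e \rangle, \langle a\,c\,e\,\neg f \rangle\}$, and $sup(ns) = |\{\langle a\,c\,e \rangle\}| - |\{\langle a\,b\,c\,e \rangle\} \cup \{\langle a\,c\,d\,e \rangle\} \cup \{\langle a\,c\,e\,f \rangle\}|$.

The above examples show that our proposed negative containment definition is consistent with set theory. Therefore, set properties are applicable for calculating $sup(ns)$.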

From Equation (2), we can see that $sup(ns)$ can be easily calculated if we know $sup(MPS(ns))$ and $\left|\bigcup_{i=1}^{n} \{p(1\text{-}negMS_i)\}\right|$. According to Constraint 3 and the negative candidate generation approach discussed in Section 4.5, $MPS(ns)$ and $p(1\text{-}negMS_i)$ are frequent. $sup(MPS(ns))$ can be directly obtained by applying traditional PSP mining algorithms.

We further explain why the positive partners $p(1\text{-}negMS_i)$ are frequent. Constraint 3 ensures that the smallest negative unit in an NSC is an element. Hence, the positive partner of each NSC generated from a PSP is the PSP itself. $p(1\text{-}negMS_i)$ is a sub-sequence of the PSP. Both $p(1\text{-}negMS_i)$ and the PSP are positive sequences and meet the downward closure property; therefore, $p(1\text{-}negMS_i)$ are frequent.

Now the problem is how to calculate $\left|\bigcup_{i=1}^{n} \{p(1\text{-}negMS_i)\}\right|$. Our approach is as follows. We store the sid of the tuples containing $p(1\text{-}negMS_i)$ into set $\{p(1\text{-}negMS_i)\}$, then calculate the union set of $\{p(1\text{-}negMS_i)\}$. Because $p(1\text{-}negMS_i)$ are frequent, the sid of the tuples containing $p(1\text{-}negMS_i)$ can be easily obtained by well-known PSP mining algorithms with minor modifications. For instance, we store the sid of the tuples containing p