Contents lists available at ScienceDirect
Artificial Intelligence
www.elsevier.com/locate/artint

e-NSP: Efficient negative sequential pattern mining ✩

Longbing Cao a,∗, Xiangjun Dong b, Zhigang Zheng c
a University of Technology Sydney, Australia
b Qilu University of Technology, Jinan, China
c University of Technology Sydney, Australia
Article info
Article history:
Received 13 January 2015
Received in revised form 3 March 2016
Accepted 7 March 2016
Available online 10 March 2016
Keywords:
Negative sequence analysis
Sequence analysis
Behavior analytics
Non-occurring behavior
Behavior informatics
Behavior computing
Pattern mining

Abstract
As an important tool for behavior informatics, negative sequential patterns (NSP) (such as missing medical treatments) are critical and sometimes much more informative than positive sequential patterns (PSP) (e.g. using a medical service) in many intelligent systems and applications such as intelligent transport systems, healthcare and risk management, as they often involve non-occurring but interesting behaviors. However, discovering NSP is much more difficult than identifying PSP due to the significant problem complexity caused by non-occurring elements, high computational cost and huge search space in calculating negative sequential candidates (NSC). So far, the problem has not been formalized well, and very few approaches have been proposed to mine for specific types of NSP, which rely on database re-scans after identifying PSP in order to calculate the NSC supports. This has been shown to be very inefficient or even impractical, since the NSC search space is usually huge. This paper proposes a very innovative and efficient theoretical framework: set theory-based NSP mining (ST-NSP), and a corresponding algorithm, e-NSP, to efficiently identify NSP by involving only the identified PSP, without re-scanning the database. Accordingly, negative containment is first defined to determine whether a data sequence contains a negative sequence based on set theory. Second, an efficient approach is proposed to convert the negative containment problem to a positive containment problem. The NSC supports are then calculated based only on the corresponding PSP. This not only avoids the need for additional database scans, but also enables the use of existing PSP mining algorithms to mine for NSP. Finally, a simple but efficient strategy is proposed to generate NSC.
Theoretical analyses show that e-NSP performs particularly well on datasets with a small number of elements in a sequence, a large number of itemsets and low minimum support. e-NSP is compared with two currently available NSP mining algorithms via intensive experiments on three synthetic and six real-life datasets from aspects including data characteristics, computational costs and scalability. e-NSP is tens to thousands of times faster than baseline approaches, and offers a sound and effective approach for efficient mining of NSP in large scale datasets by directly using existing PSP mining algorithms.
© 2016 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
✩ The source codes of e-NSP are available from http://www-staff.it.uts.edu.au/~lbcao/.
∗ Corresponding author. E-mail address: [email protected] (L. Cao).
http://dx.doi.org/10.1016/j.artint.2016.03.001
ISSN 0004-3702
1. Introduction
Behavior is widely seen in our daily study, work, living and entertainment [7]. A critical issue in understanding behavior from the informatics perspective, namely behavior informatics [6,9], is to understand the complexities, dynamics and impact of non-occurring behaviors (NOB) [8]. Mining negative sequential patterns (NSP) [43] is one of the few approaches available for understanding NOB. NSP refer to frequent sequences with non-occurring and occurring behaviors (also called negative and positive behaviors in behavior and sequence analysis), such as a driver failing to stop before driving through an intersection with a stop sign, or a missing treatment in a medical service.
Discovering NSP is becoming increasingly important, and sometimes plays a role that cannot be replaced by analyzing occurring behaviors alone, in many intelligent systems and applications, such as intelligent transport systems (ITS), health and medical management systems, bioinformatics, biomedical systems, risk management, counter-terrorism, and security. For example, in ITS, negative driving behavior patterns that result from drivers failing to follow certain traffic rules could cause serious traffic problems or even disasters. In healthcare, a patient missing an important medical appointment might result in serious health issues. In gene sequencing, the non-occurrence of certain genes may be associated with particular diseases. Such problems cannot be handled by the identification of occurring behavior patterns alone.
Formally, a NSP in healthcare may appear as follows. Assume $p_1 = \langle abc\,X \rangle$ is a positive sequential pattern (PSP) and $p_2 = \langle ab\neg c\,Y \rangle$ is a NSP, where a, b and c stand for medical service codes indicating the services a patient has received in healthcare, and X and Y stand for disease status. $p_1$ shows that patients who usually receive medical services a, b and then c are likely to have disease status X, whereas $p_2$ indicates that patients receiving treatments a and b but NOT c have a high probability of having disease status Y. Although intensive efforts have been made to develop PSP (such as $p_1$) mining algorithms, such as GSP [32], FreeSpan [15], SPADE [34], PrefixSpan [30], and SPAM [4], NSP (such as $p_2$) cannot be described or discovered by these algorithms. This is because mining NSP is much more difficult than mining PSP [8], particularly due to the following three intrinsic complexities.
• Problem complexity. The hidden nature of non-occurring items makes the definition of the NSP mining problem complicated, particularly the NSP format and negative containment. This is why researchers present different and even inconsistent definitions and constraints in their identification of NSP. As research into NSP is at a very early stage, it is important to formalize the problem properly and comprehensively.
• High computational complexity. Existing methods calculate the support of negative sequential candidates (NSC) by additionally scanning the database after identifying PSP. This leads to additional costs and low efficiency in mining NSP. It is thus essential to develop efficient NSP mining methods without database re-scanning.
• Large NSC search space. The existing approaches generate k-size NSC by conducting a joining operation on (k-1)-size NSP. This results in a huge number of NSC [11,24,26,38], which makes it difficult to search for meaningful outputs. Further, NSC do not satisfy the Apriori principle [38]. It is a challenge to prune the large proportion of meaningless NSC, and it is thus important to develop efficient approaches for generating a limited number of truly useful NSC.

NSP mining is at an early stage, and has seen only very limited progress in recent years [3,11,12,14,16–18,20,21,38–40,43]. All existing methods are very inefficient and are too specific for mining NSP. As NSP analysis is very complex, challenging and immature, the addition of appropriate constraints makes the problem solvable to some degree. All the reported work in NSP analysis therefore incorporates constraints on format, frequency and/or negative elements from respective aspects (see more discussion in Section 3.2.1) to reduce the number of NSC, discover specific NSP of particular interest, and enhance computational efficiency. More importantly, there are different definitions of the most important concept in NSP mining, negative containment, which defines whether a data sequence contains a negative sequence (see more details in Section 4.2). Some definitions are more generic and typical [11,12,38,39] than others, which either incorporate additional constraints on negation format and containment [21–23,25–29] or offer no clear definition [18,31].
To address the above intrinsic complexities in NSP mining and make it efficient for real-life applications, this paper proposes an innovative, flexible and efficient framework, the set theory-based NSP mining framework (ST-NSP), and a comprehensive algorithm called e-NSP to instantiate the ST-NSP framework. We first formalize the NSP mining problem by defining some important concepts in NSP, including negative containment. The formalization draws a clear boundary between whether a data sequence contains a NSC or not. Building on set theory, e-NSP then calculates the support of NSC based on the support of the corresponding PSP, without additional database scans. We convert the negative containment problem to a positive containment problem. The NSC supports are then calculated by only using a NSC's corresponding PSP information. In this way, there is no need to re-scan the database after discovering PSP. More importantly, any existing PSP algorithm can then be directly used, or slightly changed, to discover NSP.
We specify the frequency, format and negative element constraints in e-NSP to make the problem consistent with set theory (frequency constraint), reduce confusion, ambiguity and uncertainty (format constraint), and to control computational complexity (negative element constraint). Our definitions and specifications of such constraints and negative containment form the ST-NSP framework and make e-NSP consistent with set theory, which thus enables e-NSP to be much more efficient and flexible than existing methods.
Below, we summarize the main differences between our work and the related works in terms of the constraints imposed on NSP, key definitions and concepts, as well as the significant contributions made in this work:
• An innovative and very efficient framework ST-NSP and its corresponding theoretical design and algorithm e-NSP are proposed to discover NSP within one database scan; e-NSP is built on set theory and is the only work reported so far that uses set theory for sequence analysis, which opens a new paradigm for NSP analysis.
• e-NSP is built on a systematic and comprehensive statement and formalization of the NSP problem, and incorporates new concepts, specifications on constraints and effective technical design that make it efficient and flexible, in addition to offering substantial theoretical analysis in terms of data characteristics represented by data factors and computational costs.
• e-NSP introduces three constraints on frequency, format and negative elements respectively, which not only make e-NSP consistent with typical existing work but also support the proposed framework of set theory-based NSP study.
• Theoretical analysis shows that e-NSP is much less sensitive to data factors and is much more efficient on datasets that have a small number of elements in a sequence and a large number of itemsets, compared to the available baseline algorithms we can find. This advantage of e-NSP is especially clear when minimum support is low. It is thus very suitable for large scale data.
Unfortunately, it is not easy to find baseline datasets and NSP algorithms that satisfy similar NSP settings. We compare e-NSP with two modified NSP algorithms in the literature using intensive experiments on three synthetic and six real-world datasets from many perspectives, including computational complexity against different minimum supports on 8 distinct datasets, data characteristics analysis on 48 combinations of various data factors on 16 subsets, and scalability tests on two scalable datasets with low minimum support. The experimental results verify the theoretical analyses, showing that e-NSP is much more flexible and efficient and is tens to thousands of times faster than the baseline methods, thus highly suitable for large scale datasets. To the best of our knowledge, this is the first approach that efficiently discovers NSP by involving PSP only without rescanning databases, directly applies existing PSP discovery algorithms, and is particularly effective for very large datasets. This demonstrates the significant value of the proposed ST-NSP framework.
The remainder of the paper is organized as follows. Section 2 discusses the related work and gaps in the current knowledge. In Section 3, we formalize the problem of mining PSP and NSP, providing corresponding definitions and constraints. The ST-NSP framework and the e-NSP algorithm are detailed in Section 4. The theoretical analyses of e-NSP compared to the baselines are presented in Section 5 from the perspective of data factors. Section 6 presents substantial experimental results and a real-life case study. Discussions on several critical issues are offered in Section 7, followed by conclusions and future work in Section 8.
2. Related work
In this section, we first discuss the related work on a fundamental concept in NSP, namely negative containment. Different researchers present inconsistent definitions and explanations of negative containment for their respective interests and purposes. In [11], a data sequence $ds = \langle dc \rangle$ cannot contain the negative sequence $ns = \langle \neg(ab)c\neg d \rangle$ since $size(ns) > size(ds)$, while the work in [38] allows that ds contains ns. Another critical issue is how to deal with a non-occurring element. Chen et al. [11] argued that $ds = \langle dc \rangle$ cannot contain $\langle \neg cd \rangle$ because $\langle d \rangle$ in ds has no antecedent itemset, and that ds cannot contain $\langle c\neg d \rangle$ because $\langle c \rangle$ in ds has no successor. However, Zheng et al. [38] allowed that ds contains $\langle c\neg d \rangle$.
Furthermore, the containment position of each element is very tricky. Chen et al. [11] proposed that a data sequence $ds = \langle aacbc \rangle$ does not contain a negative sequence $ns = \langle a\neg bc \rangle$, since opposite evidence of $\langle abc \rangle$ can be found in ds. However, Zheng et al. [38] presented a different opinion, since $ds = \langle aacbc \rangle$ matches a and c; their algorithm finds the corresponding positive element in ds for each negative element of ns, such as the second a for $\neg b$. According to our understanding, since $\langle e \rangle$ means that e occurs, no element (including element e) occurs before or after e. Accordingly, $\langle e \rangle$ contains $\langle e\neg ? \rangle$, $\langle \neg ?e \rangle$ and $\langle \neg ?e\neg ? \rangle$, where "?" represents any element.
Second, we summarize the status of NSP research. Unlike PSP mining, which has been widely explored, very limited research outcomes are available in the literature on mining NSP. We briefly introduce what we have been able to find. Zheng et al. [38] proposed a negative version of the GSP algorithm, i.e. NegGSP, to mine for NSP. This algorithm first discovers PSP by GSP, then generates and prunes NSC. It then counts the support of NSC by re-scanning the database to generate negative patterns. Chen et al. [11] proposed a PNSP approach for mining positive and negative sequential patterns in the form of $\langle (abc)\neg(de)(ijk) \rangle$. This approach is broken into three stages. PSP are mined by traditional algorithms and all positive itemsets are derived from these PSP. All negative itemsets are then derived from these positive itemsets. Lastly, both positive and negative itemsets are joined to generate NSC, which are in turn joined iteratively to generate longer NSC in an Apriori-like way. This approach calculates the support of NSC by re-scanning the database. In [22–24], the authors only handled NSP in which the last element was negative. An algorithm NSPM in [24] mines such NSP. The extended versions of [24] are in [22,23], which add fuzzy and strong constraints respectively to NSPM. In [39], a genetic algorithm is proposed to mine NSP. It generates candidates by crossover and mutation, involving a dynamic fitness function to generate as many candidates as possible and avoid population stagnation. In [26], only NSP in the form of $(\neg A, B)$, $(A, \neg B)$ and $(\neg A, \neg B)$ are identified, which is similar to mining negative association rules [13,33]. The work in [26] requires $A \cap B = \emptyset$, which is a usual constraint in association rule mining but is a very strict constraint in sequential pattern mining. It generates frequent itemsets first, then generates frequent and infrequent sequences, and lastly derives NSP from the infrequent sequences. Three extended versions of [26] can be found in [27–29], in which conditions are added for fuzzy, multiple-level and multiple minimum supports, respectively. Although the authors of [31] mentioned the question of mining NSP, they did not propose how to mine them. In [20], NSP are mined in the same form as [26] in incremental transaction databases.

Table 1
Notation description.

Symbol            Description
$I$               A set of items, $I = \{x_1, \ldots, x_n\}$, consisting of n items $x_k$ $(1 \le k \le n)$
$s$               A sequence, $s = \langle s_1, \ldots, s_l \rangle$, consisting of l elements $s_j$ $(1 \le j \le l)$
min_sup           Minimum support threshold
$ns$              A negative sequence
$length(s)$       Length of sequence s, referring to the number of items in all elements in s
$size(s)$         Size of a sequence s, referring to the total number of elements in s
$sup(s)$          The support of s
$PP(ns)$          ns's positive partner
$EidS_s$          Element-id set of sequence s
$OPS(EidS_s)$     Order-preserving sequence with $EidS_s$
$MPS(s)$          Maximum positive sub-sequence of s
$1\text{-}negMS_{ns}$    1-neg-size maximum sub-sequence of ns
$1\text{-}negMSS_{ns}$   1-neg-size maximum sub-sequence set of ns
FSE or fse        First sub-sequence ending position
LSB or lsb        Last sub-sequence beginning position
Zhao et al. [35] proposed an approach to mining event-oriented negative sequential rules from infrequent sequences in the form of $\langle A \rangle \Rightarrow \langle \neg B \rangle$, $\langle \neg A \rangle \Rightarrow \langle B \rangle$ and $\langle \neg A \rangle \Rightarrow \langle \neg B \rangle$. Based on the work in [35], Zhao et al. [36] also presented an approach for discovering both positive and negative impact-oriented sequential rules. Issues about sequence classification using positive and negative patterns were discussed in [25,37]. Positive and negative usage patterns are used in [19] to filter Web recommendation lists. None of these papers involve NSP mining directly.
The above discussions about negative containment and existing NSP research involve the key issue of the various constraints applied to NSP mining. As detailed in Section 3.2.1, constraints are generally imposed according to the frequency of elements, patterns or positive partners, the format of continuous negative elements, and the negation of elements or items.
Another relevant research topic is negative association rule mining [5,33]. However, as the ordering relationship between items and elements in a sequence is inbuilt in NSP, it is much more challenging to discover NSP than negative associations and patterns. In fact, the ordinal nature of NSP means that algorithms for negative association rule mining and negative pattern mining cannot be directly used to mine NSP.
In summary, the above discussions show that NSP research presents the following strong early-stage characteristics:
• Significant inconsistency in key concepts and settings. In particular, there is no consolidated concept of negative containment in the literature. In Section 4.2, we will present a generic definition of negative containment and formalize the issue.
• Different constraints are incorporated into NSP, leading to varied settings and even divided assumptions about NSP. In Section 3.2.1, we will provide formal and generic definitions of the frequency constraint, format constraint, and negative element constraint.
• NSP mining is embedded with specific constraints and requires rescanning databases.
• Existing NSP approaches either re-scan a database as a result of inefficient computational design or do not directly address the problem of NSP mining. In Section 4, e-NSP is introduced, which scans a database only once.

Our proposed ST-NSP (see Section 4.1) and e-NSP (more details in Section 4) directly address the above fundamental issues and the substantial complexities of NSP mining by building an innovative, formal, comprehensive and generic design, together with theoretical analysis and experimental evaluation.
3. Problem statement
In a sequence, a non-occurring item is called a negative item and an occurring item is called a positive item. Sequences that consist of at least one negative item are called negative sequences. The sequences in source data are called data sequences [32]. Classic sequential pattern mining handles occurring items only, and generates frequent positive sequences [11,24,26,38]. Below, we formally define PSP and NSP. The main symbols used in this paper are listed in Table 1.
3.1. Positive Sequential Patterns – PSP
Let $I = \{x_1, x_2, \ldots, x_n\}$ be a set of items. An itemset is a subset of I. A sequence is an ordered list of itemsets. A sequence s is denoted by $\langle s_1 s_2 \ldots s_l \rangle$, where $s_j \subseteq I$ $(1 \le j \le l)$. $s_j$ is also called an element of the sequence, and is denoted as $(x_1 x_2 \ldots x_m)$, where $x_k$ is an item, $x_k \in I$ $(1 \le k \le m)$. For simplicity, the brackets are omitted if an element only has one item, i.e., element $(x)$ is coded $x$. To reduce complexity, we assume that an item occurs at most once in an element, but can appear multiple times in different elements of a sequence. The length of sequence s, denoted as $length(s)$, is the total number of items in all elements in s. s is a k-length sequence if $length(s) = k$. The size of sequence s, denoted as $size(s)$, is the total number of elements in s. s is a k-size sequence if $size(s) = k$. For example, a given sequence $s = \langle (ab)cd \rangle$ is composed of 3 elements $(ab)$, c and d, or 4 items a, b, c and d. Therefore s is a 4-length and 3-size sequence.
Sequence $s_\alpha = \langle \alpha_1 \alpha_2 \ldots \alpha_n \rangle$ is called a sub-sequence of sequence $s_\beta = \langle \beta_1 \beta_2 \ldots \beta_m \rangle$, and $s_\beta$ is a super-sequence of $s_\alpha$, denoted as $s_\alpha \subseteq s_\beta$, if there exist $1 \le j_1 < j_2 < \ldots < j_n \le m$ such that $\alpha_1 \subseteq \beta_{j_1}, \alpha_2 \subseteq \beta_{j_2}, \ldots, \alpha_n \subseteq \beta_{j_n}$. We also say that $s_\beta$ contains $s_\alpha$. For example, $\langle a \rangle$, $\langle d \rangle$ and $\langle (ab)d \rangle$ are all sub-sequences of $\langle (ab)cd \rangle$.
A sequence database D is a set of tuples $\langle sid, ds \rangle$, where sid is the sequence_id and ds is the data sequence. The number of tuples in D is denoted as $|D|$. The set of tuples containing sequence s is denoted as $\{\langle s \rangle\}$. The support of s, denoted as $sup(s)$, is the number of tuples in $\{\langle s \rangle\}$, i.e., $sup(s) = |\{\langle s \rangle\}| = |\{\langle sid, ds \rangle \mid \langle sid, ds \rangle \in D \wedge (s \subseteq ds)\}|$. min_sup is a minimum support threshold predefined by users. Sequence s is called a frequent (positive) sequential pattern if $sup(s) \ge min\_sup$. By contrast, s is infrequent if $sup(s) < min\_sup$. PSP mining aims to discover all positive sequences that satisfy a given minimum support. For simplicity, we often omit "positive" when discussing positive items, positive elements and positive sequences in mining PSP.
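The definitions of sub-sequence, length, size and support above can be illustrated with a minimal Python sketch. The representation (each element as a `frozenset` of single-character items), the helper `seq`, and the toy database `D` are assumptions made for illustration only; they are not taken from the e-NSP source code.

```python
from typing import FrozenSet, List, Sequence

Element = FrozenSet[str]   # an itemset, e.g. frozenset("ab") stands for (ab)
Seq = List[Element]        # a sequence is an ordered list of elements


def seq(*elements: str) -> Seq:
    """Hypothetical helper: one string of item characters per element."""
    return [frozenset(e) for e in elements]


def length(s: Seq) -> int:
    """Total number of items in all elements of s."""
    return sum(len(e) for e in s)


def size(s: Seq) -> int:
    """Total number of elements in s."""
    return len(s)


def is_subsequence(alpha: Seq, beta: Seq) -> bool:
    """True if alpha is a sub-sequence of beta.

    Greedy left-to-right matching: each element of alpha must be a subset
    of some element of beta, with strictly increasing positions.
    """
    j = 0
    for element in beta:
        if j < len(alpha) and alpha[j] <= element:
            j += 1
    return j == len(alpha)


def support(s: Seq, database: Sequence[Seq]) -> int:
    """Number of data sequences in the database that contain s."""
    return sum(is_subsequence(s, ds) for ds in database)


s = seq("ab", "c", "d")                       # <(ab)cd>
D = [s, seq("a", "d"), seq("c", "ab")]        # toy database (assumption)
```

With this toy database, `support(seq("a", "d"), D)` counts the first two data sequences, so `<ad>` would be frequent for any `min_sup` of at most 2.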
3.2. Negative Sequential Patterns – NSP
3.2.1. Constraints on NSP
In real applications such as health and medical business, the number of NSC and the identified negative sequences are often large, but many of them are not actionable [10]. The number of NSC may be huge or even infinite if no constraints are added. This makes NSP mining very challenging. For example, for a dataset with 10 distinct items, the total number of potential itemsets is 1023 ($= C(10,1) + C(10,2) + \ldots + C(10,10)$), where $C(m,n)$ represents the number of combinations created by choosing n items from m distinct items. Assuming that 10 of the 1023 are frequent, if the maximum size of data sequences in DB is 3, the total number of NSC could be $1033^1 + 1033^2 + 1033^3$. Many of the combinations are uninteresting or meaningless in reality.
Various constraints have been added to existing NSP mining, such as those detailed in [11,38,39] and [24], to deal with the fundamental challenges embedded in NSP mining and the current immature situation in NSP study. Although they are neither consistent nor generic, these constraints aim to reduce problem complexity and the number of NSC, and to efficiently discover actionable NSP. For example, in health insurance, a common business rule says that if a prosthesis (a) has been charged for, there should be a charge for a prior corresponding procedure (b). Accordingly, if a customer claims a but does not claim b, which is represented as a NSP $\langle \neg ba \rangle$, then the claim behavior could be potentially suspicious. More complicated NSP may be found which contain more negative items, such as $\langle a\neg ba\neg c \rangle$ (c is another procedure).
In this work, we incorporate three constraints in e-NSP: frequency constraint, format constraint, and negative element constraint. The reasons for introducing these three constraints are as follows. First, as explained above, NSP mining is at an early stage and is too complicated to conduct without the imposition of specific constraints. Second, we specify constraints on frequency, format and negative elements in order to reduce problem complexity, and in particular, to build a framework of NSP mining on set theory to extract negative sequential patterns from identified positive patterns (frequency constraint), reduce the number of NSC (format constraint), and substantially reduce the complexity in handling partially negative elements (negative element constraint).
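The size of the unconstrained candidate space in the 10-item example can be checked with a few lines of Python. This is only an arithmetic sketch: it assumes that each element of a candidate may be any of the 1023 positive itemsets or the negation of one of the 10 frequent itemsets (1033 choices per element), and that candidates have size 1, 2 or 3. The variable names are illustrative.

```python
from math import comb

n_items = 10
# C(10,1) + C(10,2) + ... + C(10,10): all non-empty itemsets over 10 items
n_itemsets = sum(comb(n_items, k) for k in range(1, n_items + 1))

n_frequent = 10                      # itemsets assumed frequent in the example
choices = n_itemsets + n_frequent    # assumed: positive itemset or negated frequent itemset
max_size = 3                         # maximum size of a data sequence in DB

# candidates of size 1, 2 and 3 built from `choices` options per element
n_candidates = sum(choices ** k for k in range(1, max_size + 1))
print(n_itemsets, choices, n_candidates)
```

Even for this tiny setting the total exceeds one billion candidates, which is the motivation for the constraints introduced next.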
Below, we define key concepts for these three constraints for mining NSP, and further explain why each of them is introduced.
Definition 1 (Positive Partner). The positive partner of a negative element $\neg e$ is e, denoted as $p(\neg e)$, i.e., $p(\neg e) = e$. The positive partner of a positive element e is e itself, i.e., $p(e) = e$. The positive partner of a negative sequence $ns = \langle s_1 \ldots s_k \rangle$ changes all negative elements in ns to their positive partners, denoted as $p(ns)$, i.e., $p(ns) = \{\langle s'_1 \ldots s'_k \rangle \mid s'_i = p(s_i), s_i \in ns\}$. For example, $p(\langle \neg(ab)c\neg d \rangle) = \langle (ab)cd \rangle$.
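A minimal sketch of Definition 1 in Python follows. The `Element` type and the helpers `pos` and `neg` are hypothetical names chosen for illustration; an element is modelled as an itemset plus a flag saying whether the whole element is negated.

```python
from typing import FrozenSet, List, NamedTuple


class Element(NamedTuple):
    """An element of a (possibly negative) sequence; illustrative type."""
    items: FrozenSet[str]
    negative: bool = False


def pos(items: str) -> Element:
    """Positive element, e.g. pos("ab") stands for (ab)."""
    return Element(frozenset(items), False)


def neg(items: str) -> Element:
    """Negative element, e.g. neg("ab") stands for ¬(ab)."""
    return Element(frozenset(items), True)


def positive_partner(ns: List[Element]) -> List[Element]:
    """p(ns): replace every negative element by its positive partner."""
    return [Element(e.items, False) for e in ns]


ns = [neg("ab"), pos("c"), neg("d")]          # <¬(ab) c ¬d>
partner = positive_partner(ns)                # <(ab) c d>
```

Positive elements are left unchanged, so applying `positive_partner` twice gives the same result as applying it once.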
Constraint 1 (Frequency constraint). For simplicity, this paper only focuses on the negative sequences ns whose positive partners are frequent, i.e., $sup(p(ns)) \ge min\_sup$. In contrast, the authors in [11] and [38] only require that the positive partner of each element in ns is frequent.
Although there may be many negative sequences that can be mined from infrequent positive partner sequences (just as many useful negative association rules can be mined from infrequent itemsets [33]), requiring positive partners to be frequent serves multiple purposes: (1) users are often interested in the absence of certain "frequent" (positive) itemsets; the positive partners of those negative itemsets appearing in negative sequential patterns should therefore satisfy a certain frequency. (2) If we do not enforce this constraint, the number of NSC may be huge or even infinite, which would lead to
A similar constraint has been adopted by other researchers, although it is specified differently for particular purposes. For example, in [11], only the positive partner of each element in negative sequences ns is required to be frequent. In our work, we require that the positive partner of ns as a whole, not only that of each element, should be frequent, i.e., $sup(p(ns)) \ge min\_sup$. This is because in this work we build a new framework that allows negative sequential patterns to be mined only from positive sequential patterns, by using set theory to rapidly "calculate" the support of NSC based only on the support of their corresponding PSP without additional database scans.
Constraint 2 (Format constraint). Continuous negative elements in a NSC are not allowed.
Example 1. $\langle \neg(ab)c\neg d \rangle$ satisfies Constraint 2, but $\langle \neg(ab)\neg cd \rangle$ does not.
This is similar to the settings in [11,38]. This constraint is introduced for three reasons. (1) If two or more continuous negative elements are allowed, more and more negative sequential candidates can be generated, which would result in a very challenging and sophisticated task. (2) In practice, it would be difficult to ascertain the correct order of two continuous negative elements if there were no positive elements between them. (3) As argued in [11], the order of consecutive negative itemsets is trifling for many applications. As a result, adding this constraint will substantially reduce the number of NSC.
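Constraint 2 amounts to a simple check on adjacent elements, sketched below. The tuple representation `(items, is_negative)` and the function name are assumptions for illustration.

```python
from typing import List, Tuple

Element = Tuple[str, bool]   # (items, is_negative), e.g. ("ab", True) is ¬(ab)


def satisfies_format_constraint(ns: List[Element]) -> bool:
    """True if no two adjacent elements of ns are both negative (Constraint 2)."""
    return not any(a[1] and b[1] for a, b in zip(ns, ns[1:]))


ok = [("ab", True), ("c", False), ("d", True)]    # <¬(ab) c ¬d>
bad = [("ab", True), ("c", True), ("d", False)]   # <¬(ab) ¬c d>
```

Here `ok` and `bad` are the two sequences of Example 1; only the first passes the check.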
Constraint 3 (Negative element constraint). The smallest negative unit in a NSC is an element. If an element consists of more than one item, either all or none of the items are allowed to be negative.
Example 2. $\langle \neg(ab)cd \rangle$ satisfies Constraint 3, but $\langle (\neg ab)cd \rangle$ does not because, in element $(\neg ab)$, only $\neg a$ is negative while b is not.
Although negative sequential patterns with negative items (such as in $\langle (\neg ab)cd \rangle$) may also be interesting in real applications, the complexity could be too high to solve the problem. Introducing this constraint avoids the complexity of handling partially negative elements, and substantially reduces the number of NSC. This is because the number of NSC will increase exponentially if negative items are allowed. For example, given positive pattern $ps = \langle (ab)(cde)(fghi) \rangle$, the number of NSC is only 4 if this constraint is satisfied; they are $\langle \neg(ab)(cde)(fghi) \rangle$, $\langle (ab)\neg(cde)(fghi) \rangle$, $\langle (ab)(cde)\neg(fghi) \rangle$ and $\langle \neg(ab)(cde)\neg(fghi) \rangle$. If negative items are allowed, the number of NSC generated from $\langle (ab)(cde)(fghi) \rangle$ is 1260 ($= 4 \times (C(2,1)+C(2,2)) \times (C(3,1)+C(3,2)+C(3,3)) \times (C(4,1)+C(4,2)+C(4,3)+C(4,4)) = 4 \times 3 \times 7 \times 15$). It would be very inefficient and complex to generate all NSC. When the size and length of data is huge, it is unrealistic to generate all NSC. Therefore, similar to the settings in [11,38], we also apply this constraint to avoid the complexity of handling partially negative elements.
In summary, the above three constraints introduced into e-NSP not only maintain a certain consistency with existing work, but also enable the efficient generation of NSC based on their positive partners and enable the calculation of NSC support. This leads to a fundamentally new and efficient framework for efficient discovery of NSP in large scale data.
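The two counts in the example above (4 versus 1260) can be reproduced with a short sketch. It enumerates the non-empty sets of element positions with no two adjacent positions (Constraints 2 and 3 together), and then applies the product formula quoted in the text for the item-level case. The function and variable names are illustrative, and this enumeration is not the NSC generation strategy of e-NSP itself, which is described in Section 4.5.

```python
from itertools import combinations
from typing import List, Tuple


def non_adjacent_subsets(k: int) -> List[Tuple[int, ...]]:
    """All non-empty sets of positions 0..k-1 with no two adjacent positions."""
    result = []
    for r in range(1, k + 1):
        for positions in combinations(range(k), r):
            if all(b - a > 1 for a, b in zip(positions, positions[1:])):
                result.append(positions)
    return result


def render(pattern: List[str], negated: Tuple[int, ...]) -> str:
    """Write a candidate with the elements at `negated` positions negated."""
    body = "".join(("¬" if i in negated else "") + f"({e})"
                   for i, e in enumerate(pattern))
    return "<" + body + ">"


pattern = ["ab", "cde", "fghi"]                       # <(ab)(cde)(fghi)>
element_level = non_adjacent_subsets(len(pattern))    # candidates under Constraints 2 and 3
candidates = [render(pattern, negated) for negated in element_level]

# Item-level count following the formula in the text:
# 4 * (2^2 - 1) * (2^3 - 1) * (2^4 - 1)
item_level = len(element_level)
for element in pattern:
    item_level *= 2 ** len(element) - 1
```

For the three-element pattern, `candidates` lists exactly the four NSC named in the text, while `item_level` evaluates to the much larger figure obtained when individual items may be negated.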
In practice, with the development of more efficient learning frameworks, data structures, and NSP mining algorithms, these constraints may be progressively relaxed. Given the current stage of maturity of NSP mining, we only work on those negative sequences that satisfy the above three constraints in this work.
3.2.2. NSP concepts
According to Constraint 3, the definition of sub-sequences in a positive sequence is not applicable to negative sequences. Below, we define sub-sequence and super-sequence for negative sequences, in addition to the concepts of element-id set and order-preserving sequence.
Definition 2 (Positive/Negative Element-id Set). Element-id is the order number of an element in a sequence. Given a sequence $s = \langle s_1 s_2 \ldots s_m \rangle$, $id(s_i) = i$ is the element identifier of element $s_i$. The element-id set $EidS_s$ of s is the set that includes all elements and their ids in s, i.e., $EidS_s = \{(s_i, id(s_i)) \mid s_i \in s\} = \{(s_1, 1), (s_2, 2), \ldots, (s_m, m)\}$ $(1 \le i \le m)$. The sets including all positive and all negative element-ids of a sequence s are called the positive and negative element-id sets of s, denoted as $EidS^+_s$ and $EidS^-_s$, respectively. For example, for $s = \langle \neg(ab)c\neg d \rangle$, $EidS^+_s = \{(c, 2)\}$ and $EidS^-_s = \{(\neg(ab), 1), (\neg d, 3)\}$.
Definition 3 (Order-preserving Sequence). For any subset $EidS'_s = \{(\alpha_1, id_1), (\alpha_2, id_2), \ldots, (\alpha_p, id_p)\}$ $(1 < p \le m)$ of $EidS_s$ and $\alpha = \langle \alpha_1 \alpha_2 \ldots \alpha_p \rangle$, if $\forall \alpha_i, \alpha_{i+1} \in \alpha$ $(1 \le i < p)$, $id_i < id_{i+1}$ holds, then $\alpha$ is called an order-preserving sequence of $EidS'_s$, denoted as $\alpha = OPS(EidS'_s)$.
Example 3. Given $s = \langle \neg(ab)c\neg d \rangle$, its $EidS_s = \{(\neg(ab), 1), (c, 2), (\neg d, 3)\}$, $EidS^+_s = \{(c, 2)\}$ and $EidS^-_s = \{(\neg(ab), 1), (\neg d, 3)\}$. We can obtain $OPS(EidS^+_s) = \langle c \rangle$. Also, if $EidS'_s = \{(\neg(ab), 1), (c, 2)\}$, we can create a sequence $OPS(EidS'_s) = \langle \neg(ab)c \rangle$.
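Definitions 2 and 3 translate directly into set operations, as the following sketch suggests. The representation is an assumption for illustration: an element is a pair `(items, is_negative)` with `items` written as a canonical (sorted) string, and an element-id entry pairs an element with its 1-based position.

```python
from typing import List, Set, Tuple

Element = Tuple[str, bool]        # (items, is_negative); items in sorted order
EidEntry = Tuple[Element, int]    # (element, 1-based element id)


def eids(s: List[Element]) -> Set[EidEntry]:
    """Element-id set EidS_s."""
    return {(e, i) for i, e in enumerate(s, start=1)}


def eids_pos(s: List[Element]) -> Set[EidEntry]:
    """Positive element-id set EidS+_s."""
    return {(e, i) for (e, i) in eids(s) if not e[1]}


def eids_neg(s: List[Element]) -> Set[EidEntry]:
    """Negative element-id set EidS-_s."""
    return {(e, i) for (e, i) in eids(s) if e[1]}


def ops(subset: Set[EidEntry]) -> List[Element]:
    """Order-preserving sequence: elements of the subset sorted by element id."""
    return [e for e, _ in sorted(subset, key=lambda entry: entry[1])]


s = [("ab", True), ("c", False), ("d", True)]     # <¬(ab) c ¬d>
```

Applied to the sequence of Example 3, `ops(eids_pos(s))` yields the single-element sequence `<c>`, and `ops` of the subset containing ids 1 and 2 yields `<¬(ab) c>`.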
Fig. 1. The set theory-based NSP mining framework: ST-NSP.
Definition 4 (Sub-sequence and Super-sequence of a Negative Sequence). Sequence $s_\alpha$ is called a sub-sequence of a negative sequence $s_\beta$, and $s_\beta$ is a super-sequence of $s_\alpha$, denoted as $s_\alpha \subseteq s_\beta$, if there exists a subset $EidS'_{s_\beta}$ of $EidS_{s_\beta}$ such that $s_\alpha = OPS(EidS'_{s_\beta})$. If $s_\alpha$ is a negative sequence, it is required to satisfy Constraint 2, which means that there must not be continuous negative elements in $s_\alpha$.
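Example 4. Given $s_\beta = \langle \neg(ab)cd \rangle$ and $s_\alpha = \langle \neg(ab)d \rangle$, $EidS_{s_\beta} = \{(\neg(ab), 1), (c, 2), (d, 3)\}$, and $EidS'_{s_\beta} = \{(\neg(ab), 1), (d, 3)\}$ is a subset of $EidS_{s_\beta}$. $s_\alpha$ is a sub-sequence of $s_\beta$ since $s_\alpha = OPS(EidS'_{s_\beta})$.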
Definition 5 (Maximum Positive Sub-sequence). Let $ns = \langle s_1 s_2 \ldots s_m \rangle$ be an m-size and n-neg-size negative sequence $(m - n > 0)$. $OPS(EidS^+_{ns})$ is called the maximum positive sub-sequence of ns, denoted as $MPS(ns)$.
Example 5. Given a negative sequence $s = \langle \neg(ab)cd \rangle$, $EidS^+_s = \{(c, 2), (d, 3)\}$, and its maximum positive sub-sequence is $MPS(s) = \langle cd \rangle$.
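Definitions 4 and 5 can be sketched as follows. Note that, unlike positive containment, a sub-sequence of a negative sequence is obtained by keeping a subset of whole elements in their original order, so elements are compared for equality rather than by subset inclusion. The tuple representation `(items, is_negative)` and the function names are assumptions for illustration.

```python
from typing import List, Tuple

Element = Tuple[str, bool]   # (items, is_negative); items in sorted order


def mps(ns: List[Element]) -> List[Element]:
    """Maximum positive sub-sequence: the positive elements of ns, in order."""
    return [e for e in ns if not e[1]]


def has_consecutive_negatives(s: List[Element]) -> bool:
    """True if two adjacent elements are both negative (violates Constraint 2)."""
    return any(a[1] and b[1] for a, b in zip(s, s[1:]))


def is_neg_subsequence(alpha: List[Element], beta: List[Element]) -> bool:
    """True if alpha can be obtained from beta by deleting elements
    and alpha has no consecutive negative elements."""
    if has_consecutive_negatives(alpha):
        return False
    remaining = iter(beta)
    # each element of alpha must be found, in order, among the rest of beta
    return all(any(e == b for b in remaining) for e in alpha)


s_beta = [("ab", True), ("c", False), ("d", False)]   # <¬(ab) c d>
s_alpha = [("ab", True), ("d", False)]                # <¬(ab) d>
```

With the sequences of Examples 4 and 5, `is_neg_subsequence(s_alpha, s_beta)` holds and `mps(s_beta)` is `<cd>`.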
Definition 6 (Negative Sequential Pattern). A negative sequence s is a negative sequential pattern (NSP) if its support is not less than the threshold min_sup.
4. The set theory-based NSP framework and e-NSP algorithm
4.1. The set theory-based NSP mining framework
Here we propose an innovative NSP mining framework, based on set theory. The framework and working mechanism of the proposed set theory-based NSP mining framework (ST-NSP for short) is illustrated in Fig. 1. We also propose an efficient NSP mining algorithm, called e-NSP, to instantiate the ST-NSP framework. The e-NSP algorithm is summarized in Section 4.7, and an example is given in Section 4.8.
A NSP algorithm instantiating the ST-NSP framework consists of several components:
(1) negative containment to define how a data sequence contains a negative sequence, which will be discussed in Section 4.2;
(2) a negative conversion strategy to convert the negative containment to positive containment, and then use the information of corresponding PSP to calculate the support of a NSC, which will be discussed in Section 4.3;
(3) NSC support calculation to calculate the support of NSC, which will be discussed in Section 4.4;
(4) NSC generation to generate NSC, which will be discussed in Section 4.5; and
(5) an e-NSP data structure and optimization strategy to calculate the union set, which will be discussed in Section 4.6.
Given a sequence database, algorithms like e-NSP built on the ST-NSP framework use the following process to discover NSP, in which steps (2) to (4) rely on set theory.
(1) All PSP are mined by a traditional PSP algorithm (with slight changes if necessary) or a new PSP mining algorithm;
(2) NSC are generated based on the identified PSP in terms of the three constraints proposed in Section 3.2.1;
(3) The generated NSC are converted to their corresponding PSP in terms of the negative conversion strategy;
(4) The NSC support is calculated based on the corresponding PSP support in terms of negative containment, relevant constraints and the proposed e-NSP data structure and optimization strategy;
(5) Lastly, NSP are identified from the NSC to satisfy certain support criteria.
4.2. Negative containment
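As a sub-sequence (e.g., $s_1 = \langle d \rangle$) may occur more than once in its super-sequence (e.g., $s_2 = \langle a(bc)d(cde) \rangle$), we need to know the exact positions at which $s_2$ contains $s_1$, counted from the left and right sides of $s_2$. We therefore define the fundamental concept of negative containment below.

Definition 7 (First Sub-sequence Ending Position / Last Sub-sequence Beginning Position). Given a data sequence $ds = \langle d_1 d_2 \ldots d_t \rangle$ and a positive sequence $\alpha$,
(1) if $\exists p\ (1 < p \le t)$, $\alpha \subseteq \langle d_1 \ldots d_p \rangle \wedge \alpha \nsubseteq \langle d_1 \ldots d_{p-1} \rangle$, then $p$ is called the First Sub-sequence Ending Position, denoted as $\mathrm{FSE}(\alpha, ds)$; if $\alpha \subseteq \langle d_1 \rangle$ then $\mathrm{FSE}(\alpha, ds) = 1$;
(2) if $\exists q\ (1 \le q < t)$, $\alpha \subseteq \langle d_q \ldots d_t \rangle \wedge \alpha \nsubseteq \langle d_{q+1} \ldots d_t \rangle$, then $q$ is called the Last Sub-sequence Beginning Position, denoted as $\mathrm{LSB}(\alpha, ds)$; if $\alpha \subseteq \langle d_t \rangle$ then $\mathrm{LSB}(\alpha, ds) = t$;
(3) if $\alpha \nsubseteq ds$, then $\mathrm{FSE}(\alpha, ds) = 0$, $\mathrm{LSB}(\alpha, ds) = 0$.

Example 6. Given $ds = \langle a(bc)d(cde) \rangle$. $\mathrm{FSE}(\langle a \rangle, ds) = 1$, $\mathrm{FSE}(\langle c \rangle, ds) = 2$, $\mathrm{FSE}(\langle cd \rangle, ds) = 3$, $\mathrm{LSB}(\langle a \rangle, ds) = 1$, $\mathrm{LSB}(\langle c \rangle, ds) = 4$, $\mathrm{LSB}(\langle cd \rangle, ds) = 2$, $\mathrm{LSB}(\langle (cd) \rangle, ds) = 4$.

Our definition of a data sequence containing a negative sequence is as follows. We use n-neg-size to denote a negative sequence containing $n$ negative elements.

Definition 8 (Negative Containment). Let $ds = \langle d_1 d_2 \ldots d_t \rangle$ be a data sequence, and $ns = \langle s_1 s_2 \ldots s_m \rangle$ be an m-size and n-neg-size negative sequence,
(1) if $m > 2t + 1$, then $ds$ does not contain $ns$;
(2) if $m = 1$ and $n = 1$, then $ds$ contains $ns$ when $p(ns) \nsubseteq ds$;
(3) otherwise, $ds$ contains $ns$ if, $\forall (s_i, id(s_i)) \in EidS^{-}_{ns}\ (1 \le i \le m)$, one of the following three cases holds:
(a) $(\mathit{lsb} = 1)$ or $(\mathit{lsb} > 1) \wedge p(s_1) \nsubseteq \langle d_1 \ldots d_{\mathit{lsb}-1} \rangle$, when $i = 1$;
(b) $(\mathit{fse} = t)$ or $(0 < \mathit{fse} < t) \wedge p(s_m) \nsubseteq \langle d_{\mathit{fse}+1} \ldots d_t \rangle$, when $i = m$;
(c) $(\mathit{fse} > 0 \wedge \mathit{lsb} = \mathit{fse} + 1)$ or $(\mathit{fse} > 0 \wedge \mathit{lsb} > \mathit{fse} + 1) \wedge p(s_i) \nsubseteq \langle d_{\mathit{fse}+1} \ldots d_{\mathit{lsb}-1} \rangle$, when $1 < i < m$,
where $\mathit{fse} = \mathrm{FSE}(\mathrm{MPS}(\langle s_1 s_2 \ldots s_{i-1} \rangle), ds)$ and $\mathit{lsb} = \mathrm{LSB}(\mathrm{MPS}(\langle s_{i+1} \ldots s_m \rangle), ds)$.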
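The positions FSE and LSB from Definition 7, on which Definition 8 relies, can be computed with two greedy scans, one from each end of the data sequence. The following Python sketch is illustrative only and is not taken from the paper: the helper names `seq`, `fse` and `lsb`, and the encoding of each element as a frozenset of single-character items, are assumptions made for this example.

```python
from typing import FrozenSet, List, Sequence

Element = FrozenSet[str]  # assumption: an element is a set of one-character items


def seq(*elements: str) -> List[Element]:
    """Hypothetical helper: seq("a", "bc") encodes the sequence <a(bc)>."""
    return [frozenset(e) for e in elements]


def fse(alpha: Sequence[Element], ds: Sequence[Element]) -> int:
    """First Sub-sequence Ending Position (1-based); 0 if alpha is not in ds."""
    if not alpha:
        raise ValueError("alpha must be a non-empty positive sequence")
    k = 0  # index of the next element of alpha to match
    for pos, d in enumerate(ds, start=1):
        if alpha[k] <= d:  # element containment is a subset test
            k += 1
            if k == len(alpha):
                return pos  # earliest position at which alpha is fully matched
    return 0


def lsb(alpha: Sequence[Element], ds: Sequence[Element]) -> int:
    """Last Sub-sequence Beginning Position (1-based); 0 if alpha is not in ds."""
    if not alpha:
        raise ValueError("alpha must be a non-empty positive sequence")
    k = len(alpha) - 1  # match alpha from its last element backwards
    for pos in range(len(ds), 0, -1):
        if alpha[k] <= ds[pos - 1]:
            k -= 1
            if k < 0:
                return pos  # latest position from which alpha still fits
    return 0


ds = seq("a", "bc", "d", "cde")  # the data sequence of Example 6
print(fse(seq("c", "d"), ds), lsb(seq("c", "d"), ds))
```

Matching each element of the pattern as early as possible gives the smallest ending position, and matching from the right as late as possible gives the largest beginning position, so one pass per function suffices. On the data sequence of Example 6 the sketch returns the values listed there, for instance 3 and 2 for the pattern `<cd>`.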
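In the above definition, Case (a) indicates that the first element in $ns$ is negative. "$(\mathit{lsb} > 1) \wedge p(s_1) \nsubseteq \langle d_1 \ldots d_{\mathit{lsb}-1} \rangle$" means that $\langle d_{\mathit{lsb}} \ldots d_t \rangle$ contains $\mathrm{MPS}(\langle s_2 \ldots s_m \rangle)$ but $\langle d_1 \ldots d_{\mathit{lsb}-1} \rangle$ does not contain $p(s_1)$. "$\mathit{lsb} = 1$" means that the last sub-sequence's beginning position is 1, so $p(s_1)$ cannot be contained in $ds$ before that position. Case (b) indicates that the last element in $ns$ is negative. Case (c) indicates that the negative element is between the first and last elements in $ns$. "$\mathit{lsb} > \mathit{fse} + 1$" ensures that there is at least one element in $\langle d_{\mathit{fse}+1} \ldots d_{\mathit{lsb}-1} \rangle$. "$\mathit{fse} > 0 \wedge \mathit{lsb} = \mathit{fse} + 1$" means that $d_{\mathit{fse}}$ and $d_{\mathit{lsb}}$ are contiguous elements, so $p(s_i)$ cannot be contained between them.

Example 7. Given $ds = \langle a(bc)d(cde) \rangle$, we have
(1) $ns = \langle \neg a\, c \rangle$. $EidS^{-}_{ns} = \{(\neg a, 1)\}$. $ds$ does not contain $ns$: $\mathit{lsb} = 4 > 0$, but $p(s_1) = \langle a \rangle \subseteq \langle d_1 \ldots d_3 \rangle = \langle a(bc)d \rangle$ (Case (a)).
(2) $ns = \langle \neg a\, a\, c \rangle$. $EidS^{-}_{ns} = \{(\neg a, 1)\}$. $ds$ contains $ns$ because $\mathit{lsb} = 1$ (Case (a)).
(3) $ns = \langle (ab) \neg (cd) \rangle$. $EidS^{-}_{ns} = \{(\neg (cd), 2)\}$. $ds$ does not contain $ns$ because $\mathit{fse} = 0$ (Case (b)).
(4) $ns = \langle (de) \neg (cd) \rangle$. $EidS^{-}_{ns} = \{(\neg (cd), 2)\}$. $ds$ contains $ns$ because $\mathit{fse} = 4$ ($t = 4$) (Case (b)).
(5) $ns = \langle a \neg d\, d \neg d \rangle$. $EidS^{-}_{ns} = \{(\neg d, 2), (\neg d, 4)\}$. $ds$ does not contain $ns$. For $(\neg d, 2)$, $\mathit{fse} = 1$, $\mathit{lsb} = 4$, but $p(\neg d) \subseteq \langle d_2 \ldots d_3 \rangle = \langle (bc)d \rangle$ (Case (c)). If one negative element does not satisfy the condition, we do not need to consider the other negative elements.
(6) $ns = \langle a \neg b\, b \neg a (cde) \rangle$. $EidS^{-}_{ns} = \{(\neg b, 2), (\neg a, 4)\}$. $ds$ contains $ns$. For $(\neg b, 2)$, $\mathit{fse} = 1$, $\mathit{lsb} = 2$, so $\mathit{fse} > 0 \wedge \mathit{lsb} = \mathit{fse} + 1$ (Case (c)); for $(\neg a, 4)$, $\mathit{fse} = 2$, $\mathit{lsb} = 4$, and $p(\neg a) \nsubseteq \langle d_3 \rangle = \langle d \rangle$ (Case (c)).

Note that negative sequences do not satisfy the Apriori property. As shown in Example 7, $\mathrm{sup}(\langle \neg a\, c \rangle) = 0$ and $\mathrm{sup}(\langle \neg a\, a\, c \rangle) = 1$, so $\mathrm{sup}(\langle \neg a\, c \rangle) < \mathrm{sup}(\langle \neg a\, a\, c \rangle)$, though $\langle \neg a\, c \rangle \subseteq \langle \neg a\, a\, c \rangle$. Accordingly, we cannot simply apply or adapt the pattern pruning strategies available in PSP mining to NSP mining, and new NSP filtering methods need to be developed.

4.3. Negative conversion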
e-NSP is built on the strategy of converting negative containment to positive containment; PSP mining can then be used to mine NSP. For this, we define a special sub-sequence: the 1-neg-size Maximum Sub-sequence.
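Definition 9 (1-neg-size Maximum Sub-sequence). For a negative sequence $ns$, its sub-sequences that include $\mathrm{MPS}(ns)$ and one negative element $e$ are called 1-neg-size maximum sub-sequences, denoted as $1\text{-}\mathrm{negMS} = \mathrm{OPS}(EidS^{+}_{ns}, e)$, where $e \in EidS^{-}_{ns}$. The sub-sequence set including all 1-neg-size maximum sub-sequences of $ns$ is called the 1-neg-size maximum sub-sequence set, denoted as $1\text{-}\mathrm{negMSS}_{ns}$, $1\text{-}\mathrm{negMSS}_{ns} = \{\mathrm{OPS}(EidS^{+}_{ns}, e) \mid \forall e \in EidS^{-}_{ns}\}$.

Example 8. 1) $ns = \langle \neg (ab)\, c \neg d \rangle$, $1\text{-}\mathrm{negMSS}_{ns} = \{\langle \neg (ab)\, c \rangle, \langle c \neg d \rangle\}$; 2) $ns = \langle \neg a (bc) d \neg (cde) \rangle$, $1\text{-}\mathrm{negMSS}_{ns} = \{\langle \neg a (bc) d \rangle, \langle (bc) d \neg (cde) \rangle\}$.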
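Definition 9 amounts to keeping every positive element of $ns$ together with exactly one negative element, in the original order. The following Python sketch illustrates this under assumptions that are not part of the paper: a negative sequence is encoded as a list of `(is_negative, items)` pairs, a leading `"!"` marks a negative element in the hypothetical helper `nseq`, and items are single characters.

```python
from typing import FrozenSet, List, Tuple

NegElement = Tuple[bool, FrozenSet[str]]  # assumption: (is_negative, items)


def nseq(*elements: str) -> List[NegElement]:
    """Hypothetical helper: nseq("!ab", "c", "!d") encodes <¬(ab) c ¬d>."""
    return [(e.startswith("!"), frozenset(e.lstrip("!"))) for e in elements]


def mps(ns: List[NegElement]) -> List[NegElement]:
    """Maximum positive sub-sequence: ns with every negative element dropped."""
    return [el for el in ns if not el[0]]


def one_neg_mss(ns: List[NegElement]) -> List[List[NegElement]]:
    """All 1-neg-size maximum sub-sequences: MPS(ns) plus one negative element."""
    negative_ids = [i for i, el in enumerate(ns) if el[0]]
    return [
        [el for i, el in enumerate(ns) if not el[0] or i == keep]
        for keep in negative_ids
    ]


def show(ns: List[NegElement]) -> str:
    """Render a sequence in the angle-bracket notation used in the text."""
    parts = []
    for negative, items in ns:
        body = "".join(sorted(items))
        if len(items) > 1:
            body = "(" + body + ")"
        parts.append(("¬" if negative else "") + body)
    return "<" + "".join(parts) + ">"


for sub in one_neg_mss(nseq("!a", "bc", "d", "!cde")):
    print(show(sub))
```

For $ns = \langle \neg (ab)\, c \neg d \rangle$ the sketch produces the two sub-sequences given in Example 8; an n-neg-size sequence always yields exactly n of them, one per negative element.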
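Corollary 1 (Negative Conversion Strategy). Given a data sequence $ds = \langle d_1 d_2 \ldots d_t \rangle$, and $ns = \langle s_1 s_2 \ldots s_m \rangle$, which is an m-size and n-neg-size negative sequence, the negative containment problem can be converted to the following problem: data sequence $ds$ contains negative sequence $ns$ if and only if the following two conditions hold: (1) $\mathrm{MPS}(ns) \subseteq ds$; and (2) $\forall\, 1\text{-}\mathrm{negMS} \in 1\text{-}\mathrm{negMSS}_{ns}$, $p(1\text{-}\mathrm{negMS}) \nsubseteq ds$.

Example 9. Given $ds = \langle a(bc)d(cde) \rangle$, 1) if $ns = \langle a \neg d\, d \neg d \rangle$, $1\text{-}\mathrm{negMSS}_{ns} = \{\langle a \neg d\, d \rangle, \langle a\, d \neg d \rangle\}$, then $ds$ does not contain $ns$ because $p(\langle a \neg d\, d \rangle) = \langle add \rangle \subseteq ds$; 2) if $ns = \langle a \neg b\, b \neg a (cde) \rangle$, $1\text{-}\mathrm{negMSS}_{ns} = \{\langle a \neg b\, b (cde) \rangle, \langle a\, b \neg a (cde) \rangle\}$, then $ds$ contains $ns$ because $\mathrm{MPS}(ns) = \langle ab(cde) \rangle \subseteq ds \wedge p(\langle a \neg b\, b (cde) \rangle) \nsubseteq ds \wedge p(\langle a\, b \neg a (cde) \rangle) \nsubseteq ds$.

Corollary 1 shows that the problem of whether a data sequence contains a negative sequence can be converted to the problem of whether the data sequence does not contain other positive sequences. This lays a foundation for calculating the support of negative sequences by using only the information of the corresponding positive sequences. Its proof is given below.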
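Proof of Corollary 1. Here we only prove that Case (c) in the negative containment definition is equivalent to the negative conversion strategy, because Cases (a) and (b) can be proved in the same way. In Case (c), the condition "$(\mathit{fse} > 0 \wedge \mathit{lsb} = \mathit{fse} + 1)$" indicates that $d_{\mathit{fse}}$ and $d_{\mathit{lsb}}$ are contiguous elements, so $p(s_i)$ cannot be contained between them. It is a special case of the other condition "$(\mathit{fse} > 0 \wedge \mathit{lsb} > \mathit{fse} + 1) \wedge p(s_i) \nsubseteq \langle d_{\mathit{fse}+1} \ldots d_{\mathit{lsb}-1} \rangle$". Hence, we only need to prove that "$(\mathit{fse} > 0 \wedge \mathit{lsb} > \mathit{fse} + 1) \wedge p(s_i) \nsubseteq \langle d_{\mathit{fse}+1} \ldots d_{\mathit{lsb}-1} \rangle$" is equivalent to the negative conversion strategy.

For $(s_i, id(s_i)) \in EidS^{-}_{ns}\ (1 \le i \le m)$, "$0 < \mathit{fse}$" means $\mathrm{MPS}(\langle s_1 s_2 \ldots s_{i-1} \rangle) \subseteq \langle d_1 \ldots d_{\mathit{fse}} \rangle$, and "$0 < \mathit{lsb}$" means $\mathrm{MPS}(\langle s_{i+1} \ldots s_m \rangle) \subseteq \langle d_{\mathit{lsb}} \ldots d_t \rangle$. Since $\mathit{fse} < \mathit{lsb}$, $\mathrm{MPS}(\langle s_1 s_2 \ldots s_{i-1} s_{i+1} \ldots s_m \rangle) \subseteq \langle d_1 \ldots d_{\mathit{fse}} d_{\mathit{lsb}} \ldots d_t \rangle$, i.e., $\mathrm{MPS}(ns) \subseteq ds$. On the other hand, if $\mathrm{MPS}(ns) \subseteq ds$, for $\forall (s_i, id(s_i)) \in EidS^{-}_{ns}$, there must exist $0 < \mathit{fse} < \mathit{lsb}$ s.t. $\mathrm{MPS}(\langle s_1 s_2 \ldots s_{i-1} \rangle) \subseteq \langle d_1 \ldots d_{\mathit{fse}} \rangle$ and $\mathrm{MPS}(\langle s_{i+1} \ldots s_m \rangle) \subseteq \langle d_{\mathit{lsb}} \ldots d_t \rangle$.

In addition, according to the definition of the 1-neg-size maximum sub-sequence, $\mathrm{MPS}(\langle s_1 s_2 \ldots s_{i-1} \rangle)$, $\mathrm{MPS}(\langle s_{i+1} \ldots s_m \rangle)$ and $s_i$ together construct exactly one 1-neg-size maximum sub-sequence $1\text{-}\mathrm{negMS}$ of $s_i$, so "$(\mathit{fse} > 0 \wedge \mathit{lsb} > \mathit{fse} + 1) \wedge p(s_i) \nsubseteq \langle d_{\mathit{fse}+1} \ldots d_{\mathit{lsb}-1} \rangle$" also means $p(1\text{-}\mathrm{negMS}) \nsubseteq ds$. For $\forall (s_i, id(s_i)) \in EidS^{-}_{ns}$, "$(\mathit{fse} > 0 \wedge \mathit{lsb} > \mathit{fse} + 1) \wedge p(s_i) \nsubseteq \langle d_{\mathit{fse}+1} \ldots d_{\mathit{lsb}-1} \rangle$" can be converted to: $\forall\, 1\text{-}\mathrm{negMS} \in 1\text{-}\mathrm{negMSS}_{ns}$, $p(1\text{-}\mathrm{negMS}) \nsubseteq ds$, and vice versa. □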
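The two conditions of Corollary 1 translate directly into a containment test that needs only positive sub-sequence checks. The Python sketch below is a minimal illustration, not the e-NSP implementation: the helpers `seq`, `nseq`, `contains` and `neg_contains`, the `"!"` marker for negative elements, and the single-character items are all assumptions made for this example.

```python
from typing import FrozenSet, List, Sequence, Tuple

Element = FrozenSet[str]           # assumption: a set of one-character items
NegElement = Tuple[bool, Element]  # assumption: (is_negative, items)


def seq(*elements: str) -> List[Element]:
    """Hypothetical helper: seq("a", "bc") encodes the positive sequence <a(bc)>."""
    return [frozenset(e) for e in elements]


def nseq(*elements: str) -> List[NegElement]:
    """Hypothetical helper: a leading "!" marks a negative element."""
    return [(e.startswith("!"), frozenset(e.lstrip("!"))) for e in elements]


def contains(alpha: Sequence[Element], ds: Sequence[Element]) -> bool:
    """Positive containment: greedy left-to-right element matching."""
    k = 0
    for d in ds:
        if k < len(alpha) and alpha[k] <= d:
            k += 1
    return k == len(alpha)


def neg_contains(ns: Sequence[NegElement], ds: Sequence[Element]) -> bool:
    """Corollary 1: MPS(ns) is in ds and no p(1-negMS) is in ds."""
    positives = [items for negative, items in ns if not negative]  # MPS(ns)
    if not contains(positives, ds):
        return False
    for keep, (negative, _) in enumerate(ns):
        if not negative:
            continue
        # Positive partner of the 1-negMS that keeps negative element `keep`.
        partner = [items for i, (n, items) in enumerate(ns) if i == keep or not n]
        if contains(partner, ds):
            return False
    return True


ds = seq("a", "bc", "d", "cde")
print(neg_contains(nseq("a", "!d", "d", "!d"), ds))
print(neg_contains(nseq("a", "!b", "b", "!a", "cde"), ds))
```

On the data sequence of Example 9 the sketch reports that $\langle a \neg d\, d \neg d \rangle$ is not contained and that $\langle a \neg b\, b \neg a (cde) \rangle$ is contained. It also agrees with the six cases of Example 7, without computing FSE or LSB.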
4.4. Supports of negative sequences
To calculate the support of negative sequences, we first present Corollary 2.
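Corollary 2 (Support of negative sequences). Given an m-size and n-neg-size negative sequence $ns$, for $\forall\, 1\text{-}\mathrm{negMS}_i \in 1\text{-}\mathrm{negMSS}_{ns}\ (1 \le i \le n)$, the support of $ns$ in sequence database $D$ is:

$$\mathrm{sup}(ns) = |\{ns\}| = \left|\{\mathrm{MPS}(ns)\} - \bigcup_{i=1}^{n} \{p(1\text{-}\mathrm{negMS}_i)\}\right| \tag{1}$$

This can be easily derived from Corollary 1. Because $\bigcup_{i=1}^{n} \{p(1\text{-}\mathrm{negMS}_i)\} \subseteq \{\mathrm{MPS}(ns)\}$, Equation (1) can be rewritten as:

$$\mathrm{sup}(ns) = |\{\mathrm{MPS}(ns)\}| - \left|\bigcup_{i=1}^{n} \{p(1\text{-}\mathrm{negMS}_i)\}\right| = \mathrm{sup}(\mathrm{MPS}(ns)) - \left|\bigcup_{i=1}^{n} \{p(1\text{-}\mathrm{negMS}_i)\}\right| \tag{2}$$

Example 10. $\mathrm{sup}(\langle \neg a (bc) d \neg (cde) \rangle) = \mathrm{sup}(\langle (bc)d \rangle) - |\{\langle a(bc)d \rangle\} \cup \{\langle (bc)d(cde) \rangle\}|$; $\mathrm{sup}(\langle \neg (ab)\, c \neg d \rangle) = \mathrm{sup}(\langle c \rangle) - |\{\langle (ab)c \rangle\} \cup \{\langle cd \rangle\}|$.

If $ns$ contains only one negative element, the support of $ns$ is:

$$\mathrm{sup}(ns) = \mathrm{sup}(\mathrm{MPS}(ns)) - \mathrm{sup}(p(ns)) \tag{3}$$

Example 11. $\mathrm{sup}(\langle (ab) \neg c\, d \rangle) = \mathrm{sup}(\langle (ab)d \rangle) - \mathrm{sup}(\langle (ab)cd \rangle)$.

In particular, for the negative sequence $\langle \neg e \rangle$,

$$\mathrm{sup}(\langle \neg e \rangle) = |D| - \mathrm{sup}(\langle e \rangle) \tag{4}$$

In the following, we illustrate negative containment in terms of set theory. Fig. 2 shows the intersection of sequences $\langle a \rangle$ and $\langle b \rangle$. $\{\langle a \rangle\}$ and $\{\langle b \rangle\}$ denote the sets of tuples that contain sequences $\langle a \rangle$ and $\langle b \rangle$, respectively, in a sequence database. There are three 2-length sequences: $\langle ab \rangle$, $\langle ba \rangle$ and $\langle (ab) \rangle$, and four disjoint sets: $\{\langle (ab) \rangle\ \text{only}\}$, $\{\langle ab \rangle\ \text{only}\}$, $\{\langle ba \rangle\ \text{only}\}$ and $\{\langle ab \rangle\} \cap \{\langle ba \rangle\}$, which are the sets of tuples that contain sequences $\langle (ab) \rangle$ only, $\langle ab \rangle$ only, $\langle ba \rangle$ only, and both $\langle ab \rangle$ and $\langle ba \rangle$, respectively. Taking $\{\langle a \neg b \rangle\}$ as an example, as seen in Fig. 2, we have:

$$\begin{aligned} \{\langle a \neg b \rangle\} &= (\{\langle a \rangle\} - \{\langle b \rangle\}) \cup \{\langle (ab) \rangle\ \text{only}\} \cup \{\langle ba \rangle\ \text{only}\} \\ &= \{\langle a \rangle\} - \big(\{\langle ab \rangle\ \text{only}\} \cup (\{\langle ab \rangle\} \cap \{\langle ba \rangle\})\big) \\ &= \{\langle a \rangle\} - \{\langle ab \rangle\} \end{aligned}$$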
Fig. 2. The intersection of $\{\langle a \rangle\}$ and $\{\langle b \rangle\}$.
Fig. 3. The interpretation of $\mathrm{sup}(\langle a \neg b\, c \neg d\, e \neg f \rangle)$ in terms of set theory.
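This result is consistent with the negative containment definition, by which the data sequences containing $\langle a \neg b \rangle$ are exactly those that contain $\langle a \rangle$ but do not contain $\langle ab \rangle$. Subsequently, we obtain:

$$\mathrm{sup}(\langle a \neg b \rangle) = \mathrm{sup}(\langle a \rangle) - \mathrm{sup}(\langle ab \rangle).$$

Similarly,

$$\mathrm{sup}(\langle \neg a\, b \rangle) = \mathrm{sup}(\langle b \rangle) - \mathrm{sup}(\langle ab \rangle);$$

$$\mathrm{sup}(\langle b \neg a \rangle) = \mathrm{sup}(\langle b \rangle) - \mathrm{sup}(\langle ba \rangle);$$

and

$$\mathrm{sup}(\langle \neg b\, a \rangle) = \mathrm{sup}(\langle a \rangle) - \mathrm{sup}(\langle ba \rangle).$$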
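Equation (2) can be tried out on a small database by collecting, for each positive sequence, the set of identifiers of the data sequences that contain it. The Python sketch below is an illustration under stated assumptions rather than the e-NSP data structure of Section 4.6: the toy database `db`, its identifiers, the helpers `seq`, `nseq`, `contains`, `sid_set` and `support`, and the `"!"` marker for negative elements are all hypothetical.

```python
from typing import Dict, FrozenSet, List, Sequence, Set, Tuple

Element = FrozenSet[str]           # assumption: a set of one-character items
NegElement = Tuple[bool, Element]  # assumption: (is_negative, items)


def seq(*elements: str) -> List[Element]:
    return [frozenset(e) for e in elements]


def nseq(*elements: str) -> List[NegElement]:
    """A leading "!" marks a negative element, e.g. nseq("a", "!b") is <a¬b>."""
    return [(e.startswith("!"), frozenset(e.lstrip("!"))) for e in elements]


def contains(alpha: Sequence[Element], ds: Sequence[Element]) -> bool:
    """Positive containment by greedy left-to-right matching."""
    k = 0
    for d in ds:
        if k < len(alpha) and alpha[k] <= d:
            k += 1
    return k == len(alpha)


def sid_set(alpha: Sequence[Element], db: Dict[int, List[Element]]) -> Set[int]:
    """Identifiers of all data sequences that contain the positive sequence."""
    return {sid for sid, ds in db.items() if contains(alpha, ds)}


def support(ns: Sequence[NegElement], db: Dict[int, List[Element]]) -> int:
    """Equation (2): sup(MPS(ns)) minus the size of the union of partner sets."""
    positives = [items for negative, items in ns if not negative]  # MPS(ns)
    union: Set[int] = set()
    for keep, (negative, _) in enumerate(ns):
        if negative:
            partner = [items for i, (n, items) in enumerate(ns) if i == keep or not n]
            union |= sid_set(partner, db)
    return len(sid_set(positives, db)) - len(union)


# Hypothetical toy database, keyed by sequence identifier.
db = {
    10: seq("a", "bc", "d", "cde"),
    20: seq("a", "b"),
    30: seq("b", "a"),
    40: seq("ab"),
    50: seq("a", "b", "a"),
    60: seq("c"),
}
print(support(nseq("a", "!b"), db))
print(support(nseq("!a", "b", "!a"), db))
```

In this toy database $\mathrm{sup}(\langle a \rangle) = 5$ and $\mathrm{sup}(\langle ab \rangle) = 3$, so the first call gives 2, matching $\mathrm{sup}(\langle a \neg b \rangle) = \mathrm{sup}(\langle a \rangle) - \mathrm{sup}(\langle ab \rangle)$. The second call has two negative elements whose partner sets overlap in sequence 50, which is why the size of the union, rather than the sum of the two supports, has to be subtracted.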
Fig. 3 illustrates the meaning of $\mathrm{sup}(ns)$ in terms of set theory. Given $ns = \langle a \neg b\, c \neg d\, e \neg f \rangle$, $1\text{-}\mathrm{negMSS}_{ns} = \{\langle a \neg b\, c\, e \rangle, \langle a\, c \neg d\, e \rangle, \langle a\, c\, e \neg f \rangle\}$, and $\mathrm{sup}(ns) = |\{\langle ace \rangle\}| - |\{\langle abce \rangle\} \cup \{\langle acde \rangle\} \cup \{\langle acef \rangle\}|$.

The above examples show that our proposed negative containment definition is consistent with set theory. Therefore, set properties are applicable for calculating $\mathrm{sup}(ns)$.

From Equation (2), we can see that $\mathrm{sup}(ns)$ can be easily calculated if we know $\mathrm{sup}(\mathrm{MPS}(ns))$ and $\left|\bigcup_{i=1}^{n} \{p(1\text{-}\mathrm{negMS}_i)\}\right|$. According to Constraint 3 and the negative candidate generation approach discussed in Section 4.5, $\mathrm{MPS}(ns)$ and $p(1\text{-}\mathrm{negMS}_i)$ are frequent. $\mathrm{sup}(\mathrm{MPS}(ns))$ can be directly obtained by applying traditional PSP mining algorithms.

We further explain why the positive partners $p(1\text{-}\mathrm{negMS}_i)$ are frequent. Constraint 3 ensures that the smallest negative unit in an NSC is an element. Hence, the positive partner of each NSC generated from a PSP is the PSP itself. $p(1\text{-}\mathrm{negMS}_i)$ is a sub-sequence of the PSP. Both $p(1\text{-}\mathrm{negMS}_i)$ and the PSP are positive sequences and meet the downward closure property; therefore, $p(1\text{-}\mathrm{negMS}_i)$ are frequent. Now the problem is how to calculate