Process Mining
Making Sense of Processes Hidden in Big Event Data
EIS Colloquium, 7-12-2012, TU/e, Eindhoven
Wil van der Aalst
Positioning Process Mining
process mining
data-oriented analysis
(data mining, machine learning, business intelligence)
process model analysis
(simulation, verification, etc.)
p e rf o rm a n c e -o ri e n te d q u e s tio n s , p ro b le m s a n d s o lu tio n s c o m p lia n c e -o ri e n te d q u e s tio n s , p ro b le m s a n d s o lu tio n s
Advances in Process Mining
• Many process discovery and conformance checking algorithms and tools are available (cf. the various ProM packages).
• Also commercial software based on these ideas: Disco (Fluxicon), Reflect (Futura/Percentive), BPMOne (Pallas
Athena/Perceptive), ARIS Process Performance Manager (Software AG), Futura Reflect (Futura Technology), Interstage Automated Process Discovery (Fujitsu), QPR ProcessAnalyzer/Analysis (QPR Software), flow (fourspark), Discovery Analyst (StereoLOGIC), etc.
• We applied process mining in over 100 organizations.
More than 75 people involving more than 50 organizations created the Process Mining Manifesto in the context of the IEEE Task Force on Process Mining. Available in 13 languages
Big Data: Even Dilbert and the "pointy-haired boss" know about it …
Moore's Law
D=1.56 D=2.03
• Starting point 2010:
• Harddisk 1 Terabyte = 1012 bytes
• Digital Universe 1.2 Zettabyte = 1.2*1021 bytes
(estimate in IDC’s annual report, “The Digital Universe Decade – Are You Ready?” May 2010)
• Disk needs to grow 230.16 = 1.2* 109 = 1.2*1021/ 1012 times its current size.
• Assuming D=1.56 this takes 30.16*1.56 = 47.05
years.
• Hence, in 2060 your laptop can contain all of
today's digital universe (internet, computer files, transaction logs, movies, photos, music, books, databases, etc.)!
A simple calculation
What if? book car c add extra insurance d change booking e confirm initiate check-in j check driver’s license k charge credit card i select car g supply car in a b skip extra insurance f h add extra insurance skip extra insurance l out c1 c2 c3 c4 c5 c6 c7 c8 c9 c10 c11 acefgijkl acddefhkjil abdefjkgil acdddefkhijl acefgijkl abefgjikl ... process discovery conformance checking
there are more than 1000 different
activities?
there are more than 1.000.000
cases?
there are more than 100.000.000
Decomposing/Distributing
Process Mining Problems
How to decompose/distribute process discovery? abcdeg abdcefbcdeg abdceg abcdefbcdeg abdcefbdceg abcdefbdceg abcdeg abdceg abdcefbdcefbdceg abcdeg abcdefbcdefbdceg abcdefbdceg abcdeg abdceg abdcefbcdeg abcdeg
?
How to decompose/distribute process discovery? abcdeg abdcefbcdeg abdceg abcdefbcdeg abdcefbdceg abcdefbdceg abcdeg abdceg abdcefbdcefbdceg abcdeg abcdefbcdefbdceg abcdefbdceg abcdeg abdceg abdcefbcdeg abcdeg a b c d e c1 in c2 c3 c4 c5 g out c6 f
How to decompose/distribute conformance checking? abcdeg adcefbcfdeg abdceg abcdefbcdeg abdfcefdceg acdefbdceg abcdeg abdceg abdcefbdcefbdceg abcdeg abcdefbcdefbdceg abcdefbdceg acdefg adcfeg abdcefcdfeg abcdeg a b c d e c1 in c2 c3 c4 c5 g out c6 f
?
How to decompose/distribute conformance checking? abcdeg adcefbcfdeg abdceg abcdefbcdeg abdfcefdceg acdefbdceg abcdeg abdceg abdcefbdcefbdceg abcdeg abcdefbcdefbdceg abcdefbdceg acdefg adcfeg abdcefcdfeg abcdeg a b c d e c1 in c2 c3 c4 c5 g out c6 f abcdeg abdceg abcdefbcdeg abcdeg abdceg abdcefbdcefbdceg abcdeg abcdefbcdefbdceg abcdefbdceg abcdeg a b c d e c1 in c2 c3 c4 c5 g out c6 f adcefbcfdeg abdfcefdceg acdefbdceg acdefg adcfeg b is often skipped f occurs too often
Replication: Same event log on all computing nodes abcdeg abdcefbcdeg abdceg abcdefbcdeg abdcefbdceg abcdefbdceg abcdeg abdceg abdcefbdcefbdceg abcdeg abcdefbcdefbdceg abcdefbdceg abcdeg abdceg abdcefbcdeg abcdeg abcdeg abdcefbcdeg abdceg abcdefbcdeg abdcefbdceg abcdefbdceg abcdeg abdceg abdcefbdcefbdceg abcdeg abcdefbcdefbdceg abcdefbdceg abcdeg abdceg abdcefbcdeg abcdeg abcdeg abdcefbcdeg abdceg abcdefbcdeg abdcefbdceg abcdefbdceg abcdeg abdceg abdcefbdcefbdceg abcdeg abcdefbcdefbdceg abcdefbdceg abcdeg abdceg abdcefbcdeg abcdeg abcdeg abdcefbcdeg abdceg abcdefbcdeg abdcefbdceg abcdefbdceg abcdeg abdceg abdcefbdcefbdceg abcdeg abcdefbcdefbdceg abcdefbdceg abcdeg abdceg abdcefbcdeg abcdeg abcdeg abdcefbcdeg abdceg abcdefbcdeg abdcefbdceg abcdefbdceg abcdeg abdceg abdcefbdcefbdceg abcdeg abcdefbcdefbdceg abcdefbdceg abcdeg abdceg abdcefbcdeg abcdeg
Only makes sense if random elements, e.g., genetic process mining.
Classification based on partitioning of event log: vertical and horizontal
sets of cases
sets of activities
Vertical distribution I: Split cases arbitrarily
abcdeg abdcefbcdeg abdceg abcdefbcdeg abdcefbdceg abcdefbdceg abcdeg abdceg abdcefbdcefbdceg abcdeg abcdefbcdefbdceg abcdefbdceg abcdeg abdceg abdcefbcdeg abcdeg abcdeg abdcefbcdeg abdceg abcdefbcdeg abdcefbdceg abcdefbdceg abcdeg abdceg abdcefbdcefbdceg abcdeg abcdefbcdefbdceg abcdefbdceg abcdeg abdceg abdcefbcdeg abcdeg sets of cases
Vertical distribution II:
Split cases based on a specific feature
abcdeg abdcefbcdeg abdceg abcdefbcdeg abdcefbdceg abcdefbdceg abcdeg abdceg abdcefbdcefbdceg abcdeg abcdefbcdefbdceg abcdefbdceg abcdeg abdceg abdcefbcdeg abcdeg abcdeg abdceg abcdeg abdceg abcdeg abcdeg abdceg abcdeg abdcefbcdeg abcdefbcdeg abdcefbdceg abcdefbdceg abcdefbdceg abdcefbcdeg abdcefbdcefbdceg abcdefbcdefbdceg short long medium
Horizontal distribution abcdeg abdcefbcdeg abdceg abcdefbcdeg abdcefbdceg abcdefbdceg abcdeg abdceg abdcefbdcefbdceg abcdeg abcdefbcdefbdceg abcdefbdceg abcdeg abdceg abdcefbcdeg abcdeg abeg abefbeg abeg abefbeg abefbeg abefbeg abeg abeg abefbefbeg abeg abefbefbeg abefbeg abeg abeg abefbeg abeg bcde bdcebcde bdce bcdebcde bdcebdce bcdebdce bcde bdce bdcebdcebdce bcde bcdebcdebdce bcdebdce bcde bdce bdcebcde bcde sets of activities
a
b
cd
e
g
=
a
be
g
+
b
cd
e
Horizontal distribution: The key idea abeg abefbeg abeg abefbeg abefbeg abefbeg abeg abeg abefbefbeg abeg abefbefbeg abefbeg abeg abeg abefbeg abeg bcde bdcebcde bdce bcdebcde bdcebdce bcdebdce bcde bdce bdcebdcebdce bcde bcdebcdebdce bcdebdce bcde bdce bdcebcde bcde a b e c1 in g out c6 f b c d e c2 c3 c4 c5 projected on a,b,e,f,g projected on b,c,d,e
System Net
• Labeled Petri net (P,T,F,l)
• Silent transitions and visible transitions (unique or not),
• One initial marking Minit, one final marking Mfinal
a start a = register request b = examine file c = check ticket d = decide e = reinitiate request f = send acceptance letter g = pay compensation h = send rejection letter b c d g h e end c1 c2 c3 c4 c5 t1 f t2 t3 t4 t5 t6 t7 t8 t9 t10 t11 c6 c7 c8 c9
Traces and Logs
• A system net SN has a set of possible visible traces ϕ(SN) starting in Minit and ending Mfinal only showing the visible steps.
• An event log L is a multiset of traces.
• Two main process mining problems:
1. Conformance checking: Given L and SN, evaluate the "conformance" (e.g., fitness, precision, generalization, etc.) of L and ϕ(SN)
2. Process discovery: Given L, create SN such that the conformance of L and ϕ(SN) is "as good as possible"
Valid Decomposition
• Union of subnets is original net
• No shared places
• Shared transitions are visible and have unique label
a start b c d g h e end c1 c2 c3 c4 c5 t1 f t2 t3 t4 t5 t6 t7 t8 t9 t10 t11 c6 c7 c8 c9 d g h e end c5 f t5 t6 t7 t8 t9 t10 t11 c6 c7 c8 c9 a start b c d e c1 c2 c3 c4 t1 t2 t3 t4 t5 t6
Another Decomposition
Requirement
• Union of subnets is original net
• No shared places
• Shared transitions are visible and unique
a start b c d g h e end c1 c2 c3 c4 c5 t1 f t2 t3 t4 t5 t6 t7 t8 t9 t10 t11 c6 c7 c8 c9 d g h e end c5 f t5 t6 t7 t8 t9 t10 t11 c6 c7 c8 c9 a c d e c2 c4 t1 t4 t5 a start t1 a b d e c1 c3 t1 t2 t3 t5 t6 t6
Maximal Valid Decomposition a start t1 SN1 a c e c2 t1 t4 t6 SN3 a b d e c1 c3 t1 t2 t3 t5 t6 SN2 d g h e c5 f t5 t6 t7 t8 t9 t10 c6 c7 SN5 c d c4 t4 t5 SN4 g h end f t8 t9 t10 t11 c8 c9 SN6 a start b c d g h e end c1 c2 c3 c4 c5 t1 f t2 t3 t4 t5 t6 t7 t8 t9 t10 t11 c6 c7 c8 c9
Maximal Decomposition
• Construction: group arcs iteratively
• Maximal decomposition is unique
and always a valid decomposition
a start b c d g h e end c1 c2 c3 c4 c5 t1 f t2 t3 t4 t5 t6 t7 t8 t9 t10 t11 c6 c7 c8 c9 a start t1 SN1 a c e c2 t1 t4 t6 SN3 a b d e c1 c3 t1 t2 t3 t5 t6 SN2 d g h e c5 f t5 t6 t7 t8 t9 t10 c6 c7 SN5 c d c4 t4 t5 SN4 g h end f t8 t9 t10 t11 c8 c9 SN6 y y x
Non-unique visible labels
• Union of subnets is original net
• No shared places
• Shared transitions are visible and unique
a start b b d g h e end c1 c2 c3 c4 c5 t1 f t2 t3 t4 t5 t6 t7 t8 t9 t10 t11 c6 c7 c8 c9 a b e c2 t1 t4 t6 a b d e c1 c3 t1 t2 t3 t5 t6 b d c4 t4 t5 SN7 a b b d e c1 c2 c3 c4 t1 t2 t3 t4 t5 t6
Conformance checking can be decomposed !!!
• Let L be an event log, SN a system net, and D={SN1,SN2, … SNn} a valid decomposition
• Li is the sublog of SNi (L projected onto visible
transitions of SNi) a start b c d g h e end c1 c2 c3 c4 c5 t1 f t2 t3 t4 t5 t6 t7 t8 t9 t10 t11 c6 c7 c8 c9 a start t1 SN1 a c e c2 t1 t4 t6 SN3 a b d e c1 c3 t1 t2 t3 t5 t6 SN2 d g h e c5 f t5 t6 t7 t8 t9 t10 c6 c7 SN5 c d c4 t4 t5 SN4 g h end f t8 t9 t10 t11 c8 c9 SN6 L is perfectly fitting SN if and only if
Example of alignment for observed trace a,b,c,d,e,c,d,g,f a start b c d g h e end c1 c2 c3 c4 c5 t1 f t2 t3 t4 t5 t6 t7 t8 t9 t10 t11 c6 c7 c8 c9 a start t1 SN1 a c e c2 t1 t4 t6 SN3 a b d e c1 c3 t1 t2 t3 t5 t6 SN2 d g h e c5 f t5 t6 t7 t8 t9 t10 c6 c7 SN5 c d c4 t4 t5 SN4 g h end f t8 t9 t10 t11 c8 c9 SN6 Etc. a,b,c,d,e,c,d,g,f
Quantifying Conformance
• Fraction of cases perfectly fitting SN equals the
fraction of cases for which each projected log is Li is perfectly fitting SNi
• For fitness at the event level (costs based alignment) it is possible to compute an optimistic value.
PAGE 31 a b c d c1 c2 start end a b c d c1 c2 start end b c
a,b
c,d
a,a,b
a,b,c,d
Discovery can also be distributed!
• Assume the set of activities
is split in overlapping sets.
• Split log L in sublogs Li
based on these sets.
• Discover a model SNi per
sublog.
• Merge the models SNi into
SN and return result.
All the earlier guarantees hold, e.g., L is perfectly fitting SN if and only if each projected log is Li is perfectly fitting SNi
Example
• Let's use the Alpha algorithm
• Alpha algorithm is not very suitable for discovering transition bordered subnets.
• By adding start and end activities we get (in this case) perfectly fitting subnets.
Discover model per sublog (Alpha algorithm) a i b e d start end a c e d start end d j h e start end j start g end f k SN1 SN2 SN3 SN4 h
Merge models a i b c d g h e j f k t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 t11 t12 t13 p5 p1 p2 p3 p4 p6 {1,2,3,4} {1,2} {1} {1} {2} {1,2,3} {1,2,3} p7 p8 p9 p10 p11 p12 {3,4} {3,4} p13 p14 p15 p16 p17 p18 p19 p20 p21 p22 p23 p24 p25 {4} {4} {4} {1,2,3,4} a i b e d start end a c e d start end d j h e start end j start g end f k SN1 SN2 SN3 SN4 h
L is perfectly fitting SN because each projected log is Li is perfectly fitting SNi
Simplify by removing redundant places and initialization
a i b c d g h e j f k t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 t11 t12 t13 p5 p1 p2 p3 p4 p6 {1,2,3,4} {1,2} {1} {1} {2} {1,2,3} {1,2,3} p7 p8 p9 p10 p11 p12 {3,4} {3,4} p13 p14 p15 p16 p17 p18 p19 p20 p21 p22 p23 p24 p25 {4} {4} {4} {1,2,3,4} a start i b c d g h e end j f k
Passages
Wil van der Aalst. Decomposing Process Mining Problems Using Passages. In Petri Nets 2012, LNCS 7347 , pages 72-91. Springer, 2012.
Passage P=(X,Y) a b d X Y f e h c g i causal dependency: may trigger or enable
Minimal passages a b d X Y f e h c g i X Y
a passage is minimal if it does not contain smaller passages
Passages define an equivalence relation on the edges in the graph
a b c d e f g h i
Minimal passage 1: (X,Y) = ({a},{b,c}) a b c d e f g h i X={a} Y={b,c}
Minimal passage 2: (X,Y) = ({b,c,d},{d,e,f}) a b c d e f g h i X={b,c,d} Y={d,e,f}
Minimal passage 3: (X,Y) = ({e},{g}) a b c d e f g h i X={e} Y={g}
Minimal passage 4: (X,Y) = ({f},{h}) a b c d e f g h i X={f} Y={h}
Minimal passage 5: (X,Y) = ({g,h},{i}) a b c d e f g h i X={g,h} Y={i}
Union of two passages a b c d e f g h i a b c d e f g h i a b c d e f g h i
Difference of two passages a b c d e f g h i a b c d e f g h i a b c d e f g h i
Properties of passages X1 X2 X5 X3 X4 Y1 Y2 Y5 Y3 Y4 If then:
Passage partitioning
• P1, P2, … ,Pn is a passage partitioning if and only if
− P1, P2, … ,Pn are passages,
− the passages are disjoint, and
− the passages cover all edges.
a b c d e f g h i 5 passages
More examples of passage partitionings PAGE 53 a b c d e f g h i a b c d e f g h i a b c d e f g h i 2 passages 2 passages 4 passages ({a},{b,c}) ({b,c,d},{d,e,f}) ({e,g,h},{g,i}) ({f},{h}) ({e,f,g,h},{g,h,i}) ({a,b,c,d},{b,c,d,e,f}) ({b,c,d},{d,e,f})
Minimal passages
• A passage is minimal if it does not contain smaller passages.
• Each edge is involved in precisely one minimal passage.
• Passages are composed of minimal passages.
• The minimal passages form a passage partitioning.
a b c d e f g h i
Passages are special case of decomposition
• Conformance checking:
1. Compute causal structure from Petri net
2. Compute (minimal) passages and construct corresponding subnets
3. Subnets provide a valid decomposition 4. Hence, conformance checking* can be
decomposed/distributed!
• Process discovery:
1. Compute causal structure from event log* 2. Compute (minimal) passages and construct
corresponding sublogs
3. Discover subnets for sublogs* 4. Merge subnets into overall model
Conformance checking book car c add extra insurance d change booking e confirm initiate check-in j check driver’s license k charge credit card i select car g supply car in a b skip extra insurance f h add extra insurance skip extra insurance l out c1 c2 c3 c4 c5 c6 c7 c8 c9 c10 c11 acefl acddefl abdefl acdddefl acefl abefl ...
?
Create Skeleton book car c add extra insurance d change booking e confirm initiate check-in supply car a b skip extra insurance f l book car c add extra insurance d change booking e confirm initiate check-in j check driver’s license k charge credit card i select car g supply car in a b skip extra insurance f h add extra insurance skip extra insurance l out c1 c2 c3 c4 c5 c6 c7 c8 c9 c10 c11 acefl acddefl abdefl acdddefl acefl abefl ... conformance checking
Net fragments per passage book car c add extra insurance a b skip extra insurance c1 c add extra insurance d change booking e confirm b skip extra insurance c2 e confirm initiate check-in f c3 initiate check-in j check driver’s license k charge credit card i select car g supply car f h add extra insurance skip extra insurance l c4 c5 c6 c7 c8 c9 c10 c11 book car c add extra insurance d change booking e confirm initiate check-in supply car a b skip extra insurance f l ac ac ab ac ac ab ... ce cdde bde cddde ce be ... ef ef ef ef ef ef ... fl fl fl fl fl fl ... acefl acddefl abdefl acdddefl acefl abefl ...
Suppose d is executed too late book car c add extra insurance a b skip extra insurance c1 c add extra insurance d change booking e confirm b skip extra insurance c2 e confirm initiate check-in f c3 initiate check-in j check driver’s license k charge credit card i select car g supply car f h add extra insurance skip extra insurance l c4 c5 c6 c7 c8 c9 c10 c11 book car c add extra insurance d change booking e confirm initiate check-in supply car a b skip extra insurance f l ac ac ab ac ac ab ... ced cdde bde cddde ce be ... ef ef ef ef ef ef ... fl fl fl fl fl fl ... acefdl acddefl abdefl acdddefl acefl ...
causal structure obtained using heuristics & domain knowledge
Discovery example PAGE 60 abcdeg abdcefbcdeg abdceg abcdefbcdeg abdcefbdceg abcdefbdceg abcdeg abdceg abdcefbdcefbdceg abcdeg abcdefbcdefbdceg abcdefbdceg abcdeg abdceg abdcefbcdeg abcdeg a b c d e g f ab abfb ab ... bcd bdcbcd bdc ... cde dcecde dce ... eg efeg eg ... a in g out b f a b c d c d e f e g a b c d e c1 in c2 c4 g out c6 f abcdeg abdcefbcdeg abdceg abcdefbcdeg abdcefbdceg abcdefbdceg abcdeg abdceg abdcefbdcefbdceg abcdeg abcdefbcdefbdceg abcdefbdceg abcdeg abdceg abdcefbcdeg abcdeg a b c d e g f ab abfb ab ... bcd bdcbcd bdc ... cde dcecde dce ... eg efeg eg ... abcdeg abdcefbcdeg abdceg abcdefbcdeg abdcefbdceg abcdefbdceg abcdeg abdceg abdcefbdcefbdceg abcdeg abcdefbcdefbdceg abcdefbdceg abcdeg abdceg abdcefbcdeg abcdeg a b c d e g f ab abfb ab ... bcd bdcbcd bdc ... cde dcecde dce ... eg efeg eg ...
({a,f},{b}) ({b},{c,d}) ({c,d},{e}) ({e},{f,g})
ab abfb ab ... bcd bdcbcd bdc ... cde dcecde dce ... eg efeg eg ... b f a b c d c d e f e g a in g out b f a b c d c d e f e g a b c d e c1 in c2 c3 c4 c5 g out c6 f
Tool Support in ProM
(implemented by Eric Verbeek)
discovery
Result
passage graph
Example Passage
Process Model has 11 Passages
Diagnostics per Passage
Discovery (no initial model, just events)
algorithm used to derive causal graph
(to determine passages)
discovery algorithm (applied per passage)
Result (alpha + ILP with proper completion)
sound WF-net perfectly fitting the
Conclusion
• Process discovery and conformance checking can be decomposed!
• Generic approach with various formal
guarantees independent of the techniques used.
• Open issues:
− Experimental evaluation (in progress, see Eric's presentation)
− Better algorithms for computing a causal structure are
needed (independent of decomposition)
formal (not just a
picture)
fast (should not
take years) sound
(result should at least be free of deadlocks, etc.) ability to balance all conformance dimensions (fitness, precision, generalization, and simplicity) incl. noise provide guarantees (not just a best
effort) 1 2 3 4 5 energy efficient politically correct
Today's sheep have only 1.5-3.5 legs
formal fast balance sound guaran.
alpha + + - - - heuristic + + +/- - - genetic AK + - +/- - +/- genetic JB + - + + +/- fuzzy miner - + +/- - - Disco - + +/- - - regions SB + - - + + regions LB + - - + + alpha ++ + + - - - alpha # + + - - - SMT + - - + + MFM + + - - +