

2.4 Process Discovery

2.4.1 The Process Discovery Problem

Traditional process discovery approaches are carried out statically, as an "offline" method, in contrast to event-based BAM or EDBPM solutions (see Section 2.2.3). This is reflected by the fact that the input for these algorithms is an entire event log. In contrast to the generic event logs introduced in Section 2.2.2, discovering a control-flow requires the log to contain only a minimal set of event features: every event needs to have a reference (1) to its process instance, e.g. via a unique identifier, and (2) to the corresponding activity, e.g. via a unique name. Furthermore, it is assumed that the log contains exactly one event for each activity execution, i.e. activity lifecycle events (e.g. start, assign, complete) are not regarded. In the context of process discovery, the execution of any two process instances is assumed to be independent, i.e. the execution order within one instance does not depend on the execution order of another instance. In terms of process log terminology (see Section 2.2.2), all events resulting from the execution of the same process instance are captured in one trace, which is sometimes also called a case. Accordingly, an event is represented by a pair (t, s) where t links to the trace and s to the activity. Figure 2.12 shows an example process consisting of simple one-letter activities: a, b, c, d, e, f, g, h. This example BP serves as the reference process for explaining the log-model terminology and concepts. Since any two traces are assumed to be independent of each other, only the order of the activities within a trace is of interest, i.e. a trace can be specified by a sequence of activities ordered by their occurrence: t1 = [a,b,d,e] denotes a trace that conforms to the

Furthermore, traces consisting only of the activity order are called simple traces, and event logs consisting only of simple traces are called simple event logs [233]. An example of a simple event log for the business process in Figure 2.12 is⁶

$$L_1 = \{[b,a]^4,\ [a,b,d,e]^5,\ [b,a,e,d]^4,\ [b,a,c,a,b,c,b,a,d,e,e,d]^6,\ [g,g]^2,\ [f,h]^3,\ [f,f,h,f,g,h,g,f,h]^8,\ [g,h,f]^2\} \tag{2.1}$$

The log L1 consists of eight different traces, each occurring a number of times. The position of the traces in the log is of no relevance for process discovery algorithms since all have the same "weight" and need to be taken into account equally. Note that log L1 does not include all possible behaviour of the process from Figure 2.12, e.g. even though in the model activity b may be followed by activity e, no trace in L1 contains this behaviour.
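The step from (t, s) event pairs to a simple event log can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the event stream `events` and its case identifiers are hypothetical, the helper names `to_simple_log` and `directly_follows` are not taken from the cited literature, and "followed by" is read here as "directly followed by".

```python
from collections import Counter, defaultdict

# Hypothetical event stream: (trace_id, activity) pairs in order of occurrence.
events = [
    ("case-1", "a"), ("case-2", "b"), ("case-1", "b"),
    ("case-2", "a"), ("case-1", "d"), ("case-1", "e"),
    ("case-3", "b"), ("case-3", "a"),
]


def to_simple_log(events):
    """Group events by trace and count identical activity sequences."""
    traces = defaultdict(list)
    for trace_id, activity in events:
        traces[trace_id].append(activity)
    # A simple event log is a multiset of simple traces.
    return Counter(tuple(seq) for seq in traces.values())


def directly_follows(log):
    """Return all pairs (x, y) where y occurs immediately after x in a trace."""
    return {(t[i], t[i + 1]) for t in log for i in range(len(t) - 1)}


# Log L1 from Equation (2.1); the counts are the trace frequencies.
L1 = Counter({
    tuple("ba"): 4, tuple("abde"): 5, tuple("baed"): 4,
    tuple("bacabcbadeed"): 6, tuple("gg"): 2, tuple("fh"): 3,
    tuple("ffhfghgfh"): 8, tuple("ghf"): 2,
})

small_log = to_simple_log(events)
b_then_e_observed = ("b", "e") in directly_follows(L1)
```

Using a multiset (`Counter`) rather than a list mirrors the remark that the position of a trace in the log carries no information. Under the direct-succession reading, `b_then_e_observed` evaluates to false for L1, which matches the incompleteness example above.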

Logs that do not include all possible behaviour of a source model are called incomplete logs [128] and have emerged as a challenge for rediscovering BPs. Another challenge originating from real-life problems is the analysis of logs that contain exceptional/infrequent behaviour. Behaviour that does not conform to the source model is called noise or infrequent behaviour [127]. For instance, adding trace [a,b,d,e,e] to log L1 would result in an incomplete event log containing infrequent behaviour since the new trace does not conform to the source model, i.e.

$$L_2 = \{[b,a]^4,\ [a,b,d,e]^5,\ [b,a,e,d]^4,\ [b,a,c,a,b,c,b,a,d,e,e,d]^6,\ [g,g]^2,\ [f,h]^3,\ [f,f,h,f,g,h,g,f,h]^8,\ [g,h,f]^2,\ [a,b,d,e,e]^1\} \tag{2.2}$$

Considering that real-life logs are incomplete and/or contain infrequent behaviour, conventional challenges in process discovery originate from the motivation to find the "best fitting" model, i.e. discovered processes should be an accurate reflection of the behaviour contained in the log. Determining the "best fitting" model that conforms to a given log is essentially a balance between over-fitting and under-fitting [46,243]:

• An over-fitting model allows for only the exact behaviour recorded in the log, i.e. it does not generalise and therefore does not allow for any additional behaviour [254]. A naive example of an over-fitting model is the Trace Model: here, every distinct trace of a log is transferred into an activity sequence, and these sequences are combined in a choice construct (see left model in Figure 2.13 for log L1). The example model, as well as trace models in general, contains duplicate activities.

• An under-fitting model is "overgeneralised" [243], which usually means that it allows "... too much behaviour that is not supported ..." [254] by the corresponding log. Considering log L1, a model that allows any permutation of the involved activities is considered under-fitting. A naive example is the Flower Model (see right model in Figure 2.13), which allows for all permutations (with infinite repetition) of a, b, c, d, e, f, g, h.

Fig. 2.13 Mined Trace Model (left) and Flower Model (right) for Log L1

⁶ The power values denote the respective occurrences of the traces in the log, e.g. trace [b,a] occurs 4 times.
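The two naive models can be sketched as acceptance predicates over traces. This is a minimal illustration, not a mining algorithm from the cited literature: the function names `trace_model` and `flower_model` are hypothetical, and each model is reduced to the set of traces it accepts rather than being built as a Petri net or BPMN model.

```python
# Log L1 from Equation (2.1), written as {trace: frequency} with one
# character per activity.
L1 = {
    "ba": 4, "abde": 5, "baed": 4, "bacabcbadeed": 6,
    "gg": 2, "fh": 3, "ffhfghgfh": 8, "ghf": 2,
}


def trace_model(log):
    """Over-fitting baseline: accept exactly the distinct traces of the log."""
    allowed = {tuple(trace) for trace in log}
    return lambda trace: tuple(trace) in allowed


def flower_model(log):
    """Under-fitting baseline: accept any sequence over the log's activities.

    Assumption: the empty sequence is accepted as well.
    """
    alphabet = {activity for trace in log for activity in trace}
    return lambda trace: all(activity in alphabet for activity in trace)


accepts_trace = trace_model(L1)
accepts_flower = flower_model(L1)

# The noisy trace [a,b,d,e,e] is rejected by the trace model but accepted by
# the flower model, which only checks that every activity is known.
noisy = "abdee"
results = (accepts_trace(noisy), accepts_flower(noisy))
```

The trace model stores one entry per distinct trace, so it grows with the log, whereas the flower model only needs the activity alphabet. This reflects the difference in simplicity between the two baselines.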

Both the Trace Model and the Flower Model are considered evaluation baselines for over-fitting and under-fitting models, respectively. The quality of process discovery approaches is essentially determined by their ability to discover the "best fitting" model from a given event log, i.e. evaluation is based on log-model conformance checking techniques (see Figure 2.9). What constitutes a "best fitting" model strongly depends on the given requirements and goals, e.g. the three BP models in Figures 2.12 and 2.13 are each a valid representation of log L1 and have their respective advantages and disadvantages. Thus, process discovery represents a multi-goal optimisation problem, i.e. a trade-off between the following four quality criteria [233]:

1. Fitness: The ability of the discovered model to allow for the behaviour recorded in the event log [3]; in other conformance checking literature and in general data mining terminology, this quality criterion corresponds to the recall measure [146].

2. Precision: The ability of the discovered model to not allow for behaviour unrelated to what is recorded in the event log [2,146,155].

3. Generalisation: The ability of the discovered model to generalise the example behaviour recorded in the event log [236].

4. Simplicity: A measure to avoid overly complex models, e.g. a "spaghetti" model [89] as shown in Figure 2.14.

Fig. 2.14 Excerpt of a large "Spaghetti" Process Model [89]
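To make the trade-off between fitness and precision tangible, the following sketch computes two deliberately simplified scores for the Trace Model and the Flower Model. These are assumptions for illustration only and are not the metrics defined in [2,3,146,155]: `trace_fitness` is the frequency-weighted share of log traces a model accepts, and `toy_precision` is the share of directly-follows pairs allowed by a model that are actually observed in the log. All function names are hypothetical.

```python
# Logs L1 and L2 as {trace: frequency}, one character per activity.
L1 = {
    "ba": 4, "abde": 5, "baed": 4, "bacabcbadeed": 6,
    "gg": 2, "fh": 3, "ffhfghgfh": 8, "ghf": 2,
}
L2 = dict(L1, abdee=1)  # L1 plus the infrequent trace [a,b,d,e,e]


def pairs(traces):
    """Directly-follows pairs occurring in a collection of traces."""
    return {(t[i], t[i + 1]) for t in traces for i in range(len(t) - 1)}


def trace_fitness(accepts, log):
    """Frequency-weighted fraction of log traces accepted by the model."""
    fitting = sum(count for trace, count in log.items() if accepts(trace))
    return fitting / sum(log.values())


def toy_precision(model_pairs, log):
    """Fraction of the model's directly-follows pairs observed in the log."""
    return len(model_pairs & pairs(log)) / len(model_pairs)


# Trace Model mined from L1: accepts exactly the traces of L1.
trace_accepts = lambda trace: trace in L1
trace_pairs = pairs(L1)

# Flower Model over the activities of L1: accepts any sequence of them.
alphabet = {a for trace in L1 for a in trace}
flower_accepts = lambda trace: set(trace) <= alphabet
flower_pairs = {(x, y) for x in alphabet for y in alphabet}

scores = {
    "trace": (trace_fitness(trace_accepts, L2),
              toy_precision(trace_pairs, L1)),
    "flower": (trace_fitness(flower_accepts, L2),
               toy_precision(flower_pairs, L1)),
}
```

Under these simplified scores, the Trace Model mined from L1 does not fit the additional trace in L2, while the Flower Model fits every trace but allows many directly-follows pairs that never occur in the log. Generalisation and simplicity are not captured by this sketch.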

With respect to those four criteria, under-fitting models like the flower model score well for fitness, generalisation, and simplicity but very poorly for precision, whereas over-fitting models like the trace model score well for fitness and precision but poorly for generalisation and simplicity. The fitness and precision criteria originate from traditional data mining challenges and evaluate the accuracy of a mined model. The criteria of generalisation and simplicity, on the other hand, emphasise the need to create less complex and human-readable models. Disregarding these criteria in favour of the accuracy-based criteria often leads to the discovery of overly complex and highly interconnected business processes, so-called "spaghetti" models (see Figure 2.14) [89,234]. This is especially the case when dealing with logs that contain incomplete and/or infrequent (noisy) behaviour. In recent literature, e.g. in [26,118,126,147,148,175,234], an increasingly popular notion to promote generalisation and simplicity is the restriction of the BP model language to a structure that only allows for constructs which are easy to understand. The assumption here is that a well-structured model is far easier to analyse while still being able to represent the main behaviour recorded in the log, i.e. a little accuracy (fitness and precision) is sacrificed in favour of a much increased structuredness [118].

A BP model is "... (well-)structured, if for every node with multiple outgoing arcs (a split) there is a corresponding node with multiple incoming arcs (a join), and vice versa, such that the fragment of the model between the split and the join forms a single-entry-single-exit (SESE) process component" [175]. The processes shown in Figure 2.2 and Figure 2.12 are examples of such well-structured processes. This definition allows the introduction of a hierarchy of different constructs within a process, i.e. it supports separation and drill-down capabilities that help to understand independent parts of the process individually. For instance, the example process in Figure 2.12 is essentially a choice between the executions of activities a, b, c, d, e (right branch) or f, g, h (left branch); the right branch is essentially a sequence of activities a, b, c (top right) followed by activities d, e (bottom right), and so on. BP models conforming to this definition of (well-)structuredness are also called block-structured BP models in the context of this thesis. Other representations with a very similar structure definition are Hierarchical Process Models, an abstracted view on block-structured BPMN-like process models [147,148], and Process Trees, a simplified notation of block-structured Petri nets/workflow nets [26,126].
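The hierarchy induced by block-structuredness can be sketched as a small tree data structure. The sketch below is an assumption-laden illustration: the `Block` class and the helpers `activities` and `render` are hypothetical, only the outer choice and the sequence in the right branch are taken from the description of Figure 2.12 above, and the operator of each inner block is left unspecified (marked "?") because it is not stated in the text.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Block:
    """A single-entry-single-exit block: an operator over child nodes."""
    operator: str  # e.g. "xor" (choice) or "seq" (sequence); "?" = unspecified
    children: List["Node"] = field(default_factory=list)


Node = Union[Block, str]  # a leaf is simply an activity name


def activities(node):
    """Collect the activities of a (sub-)tree from left to right."""
    if isinstance(node, str):
        return [node]
    return [a for child in node.children for a in activities(child)]


def render(node, depth=0):
    """Return an indented outline, one line per block or activity."""
    pad = "  " * depth
    if isinstance(node, str):
        return [pad + node]
    lines = [pad + node.operator]
    for child in node.children:
        lines.extend(render(child, depth + 1))
    return lines


# Outer structure as described for Figure 2.12; inner operators are unknown.
figure_2_12 = Block("xor", [
    Block("seq", [Block("?", ["a", "b", "c"]), Block("?", ["d", "e"])]),
    Block("?", ["f", "g", "h"]),
])

right_branch = figure_2_12.children[0]
outline = "\n".join(render(figure_2_12))
```

Because every block has a single entry and a single exit, any subtree such as `right_branch` can be inspected in isolation, which is the drill-down capability mentioned above.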