
3.3 Phases of Application

3.3.3 Data Processing

Phase 3 is data processing, which begins with choosing the algorithms most suitable for the given case. The problem statement should be used as a reference, as numerous metrics can be calculated, most of which require a slightly different approach. Most of the data processing can be done by available process mining software and should therefore only require the proper setup in most cases. The steps of phase 3 can be seen in Figure 3.7, where the progression of the methodology is also shown.

Phase 3: Data Processing

• Data filtering and cleaning.

• Process discovery.

Figure 3.7: Application phase 3.

3.3.3.1 Process Discovery

The first step of phase 3 occurs directly after the event log has been loaded successfully into the process mining software. Step 1 of phase 3 entails the discovery of the process within the event log. This can also be seen in the process mining procedure shown in Subsection 2.3.1, Figure 2.10. The first step in process discovery is the selection of the algorithm responsible for mining the event log. The selection of the mining algorithm has an impact on the process models being constructed, and an unsuitable choice can therefore yield results that are unreliable or not suited to the initial goals. From a practical point of view, mining algorithms balance four quality criteria when performing process discovery, as discussed in Van der Aalst (2011). The first criterion is fitness, the ability of the model to represent the actual behaviour in the event log. The second, which closely relates to fitness, is precision. Precision constrains the discovered model so that it does not allow behaviour that is not part of the original event log. In other words, the model should not be underfitted, that is, so generic that it allows any process behaviour. An example of an underfitted model is shown in Figure 3.8. Thirdly, generalisation refers to the ability of the algorithm to generalise the behaviour and build a suitable model from that generalisation. This should, however, be done without overfitting the event log (Van der Aalst, 2011). Overfitting can be seen as the attempt of an algorithm to create a model that is too precise in showing event log behaviour. Overfitting then leads to a model in which no clear patterns can be observed. An overfitted model can also be seen in Figure 3.8. Lastly, it is desirable that the model is as simple as possible, which is the criterion of simplicity.
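The trade-off between fitness and precision can be illustrated with a minimal Python sketch. It rests on strong simplifying assumptions that are not part of the cited work: each model is described by a finite set of allowed traces, fitness is taken as the share of cases the model can replay, and precision as the share of the model's traces that were observed. The event log, the three models and the function names are hypothetical, and generalisation and simplicity are not measured here.

```python
from collections import Counter


def trace_fitness(log, model_traces):
    """Share of cases in the log whose trace the model can replay."""
    total = sum(log.values())
    fitting = sum(n for trace, n in log.items() if trace in model_traces)
    return fitting / total


def trace_precision(log, model_traces):
    """Share of the traces allowed by the model that were observed in the log."""
    observed = sum(1 for trace in model_traces if trace in log)
    return observed / len(model_traces)


# Hypothetical event log: trace variant -> number of cases.
log = Counter({"abd": 6, "acd": 3, "abcd": 1})

# Hypothetical models, each given as the finite set of traces it allows.
models = {
    "narrow": {"abd"},  # only the most frequent variant
    "balanced": {"abd", "acd"},
    "permissive": {"abd", "acd", "abcd", "acbd", "ad", "abbd", "accd", "adcb"},
}

for name, model_traces in models.items():
    print(
        f"{name}: fitness={trace_fitness(log, model_traces):.3f}, "
        f"precision={trace_precision(log, model_traces):.3f}"
    )
```

In this toy setting the permissive model replays every case but allows mostly unobserved behaviour (underfitting), while the narrow model allows nothing unobserved but cannot replay a large share of the cases.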

Different algorithms used in process mining satisfy each of these criteria to a different degree, and the choice should therefore depend on the situation. Subsection 2.3.3 discusses the different types of algorithms used in process mining for the discovery of models, the first of which is the α-algorithm. In general, this is not a favourable algorithm, as it has difficulty addressing noise, unsuitable behaviour and overly complex routing of cases, as discussed in Subsection 2.3.1. When using the α-algorithm it must also be assumed that the event log is entirely complete, and even when this is true there are still some complications, which are discussed in full detail in Van der Aalst (2011). For these reasons, heuristic and genetic mining alternatives should be considered, as they perform more favourably with regard to the four quality criteria discussed above and can deal with infrequent paths and internal loops.
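The sensitivity of the α-algorithm to noise can be made concrete with a small Python sketch of its first step, the derivation of ordering relations from the directly-follows pairs in the log. Only this first step is shown, not the construction of the Petri net. The example traces and the function names are hypothetical. Because the relations ignore how often a pair occurs, a single noisy trace is enough to change them.

```python
from itertools import product


def directly_follows(traces):
    """Return all pairs (a, b) where b directly follows a in some trace."""
    return {(t[i], t[i + 1]) for t in traces for i in range(len(t) - 1)}


def footprint(traces):
    """Classify every activity pair as causal, parallel or unrelated."""
    df = directly_follows(traces)
    activities = sorted({a for t in traces for a in t})
    relations = {}
    for a, b in product(activities, repeat=2):
        if (a, b) in df and (b, a) in df:
            relations[(a, b)] = "||"  # parallel: both orders were observed
        elif (a, b) in df:
            relations[(a, b)] = "->"  # causality: a is followed by b, never the reverse
        elif (b, a) in df:
            relations[(a, b)] = "<-"  # reverse causality
        else:
            relations[(a, b)] = "#"   # never directly follow each other
    return relations


# Hypothetical log: each string is one trace, each character one activity.
traces = ["abcd", "acbd", "aed"]
clean = footprint(traces)
print(clean[("a", "b")], clean[("b", "c")], clean[("e", "b")])

# One noisy trace in which b directly follows e changes the relation between e and b.
noisy = footprint(traces + ["aebd"])
print(noisy[("e", "b")])
```

Heuristic approaches differ mainly in that they take the frequency of each directly-follows pair into account before deriving such relations.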

(a) An underfitted model.

(b) An overfitted model.

Figure 3.8: Example of fitted models for the same event log.

Adapted from Van der Aalst (2011)

A good example is presented in Van der Aalst et al. (2011), where a comparison is made between different models that are all constructed from the same event log. These models clearly show the influence that different algorithms can have on the resulting model when there is a focus on one or more of the four quality criteria mentioned above. These models are shown in Figure 3.9. Taking Figure 3.9 and considering N1 to N4 in terms of the quality criteria, it can be seen that N4 places an emphasis on fitness and precision. This causes a decline in the generalisation of the model, as well as in the model's simplicity. In other words, the model caters too strongly towards every trace instance and ends up being a complete visual representation of every trace. N3 reduces the precision to a point where the model is so generalised that it does not bind the process to any constraints. N2 attempts to make the model as simple as possible, but at the cost of it being generalised to a point where the fitness suffers. N1 finds an optimal solution where a balance between the four quality criteria is achieved.

N1: fitness = good, precision = good, generalisation = good, simplicity = good

N2: fitness = bad, precision = good, generalisation = bad, simplicity = good

N3: fitness = good, precision = bad, generalisation = good, simplicity = good

N4: fitness = good, precision = good, generalisation = bad, simplicity = bad

(All variants in log)

Frequency (#): 455, 191, 177, 144, 111, 82, 56, 47, 38, 33, 14, 11, 9, 8, 5, 3, 2, 2, 1, 1, 1

Trace: acdeh, abdeg, adceh, abdeh, acdeg, adceg, abdeh, acdefdbeh, adbeg, acdefbdeh, acdefbdeg, acdefdbeg, adcefcdeh, adcefdbeh, adcefbdeg, acdefbdefdbeg, adcefdbeg, adcefbdefbdeh, adbefbdefdbeg, adcefdbefcdefdbeg

Figure 3.9: Model alternatives based on algorithm.

Adapted from Van der Aalst (2011)

3.3.3.2 The Inductive Visual Miner

As the selection of appropriate algorithms can be a difficult task, whether in the commercial or the academic environment, Leemans, Fahland and Van der Aalst (2014b) introduce a complete software plug-in for ProM called the Inductive Visual Miner (IvM). This tool aims to package all the steps required for process discovery without the conventional iterative process that other algorithms require before a suitable model is generated. That iterative process usually entails setting parameters and generating a model without immediate feedback on changes. The IvM plug-in allows the model to be animated according to the given event log and the event log to be filtered interactively while the changes are observed. The IvM plug-in is available as a free plug-in for ProM and can be installed using the package manager, as discussed in Subsection 2.3.7.

Prepare Log → Filter Log → Process Discovery → Align Model → Filter Node Selection → Animate

Figure 3.10: IvM tool process.

Adapted from Leemans et al. (2014b)

Figure 3.10 shows the process implemented within the IvM software plug-in. The process entails the following steps:

1. Prepare Log: Using the perspective classifier setting, the events are classified. The perspective considered here is the control-flow perspective or the resource perspective;

2. Filter Log: Process instances in the log are filtered out according to the frequency by which they occur;

3. Process Discovery: Using the Inductive Miner (Leemans, Fahland and Van der Aalst, 2014a), the process model is discovered;

4. Align Model: The log traces are replayed and put on top of the created model for user feedback;

5. Filter Node Selection: Filters are placed on every node (activity) to show the traces which pass through that node; and

6. Animate: Using the timestamps in the event log, traces are animated on the process model. If there are no timestamps, the IvM generates random timestamps for visual feedback.
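Step 2 of the list above, filtering process instances by how frequently they occur, can be sketched in Python as follows. This is a minimal illustration and not the filter implemented in the IvM: the function name, the `coverage` parameter and the example log are assumptions made for this sketch, and the interactive filters of the plug-in may work differently.

```python
from collections import Counter


def filter_frequent_variants(log, coverage=0.8):
    """Keep the most frequent trace variants until `coverage` of all cases is retained."""
    variants = Counter(tuple(trace) for trace in log)
    total = sum(variants.values())
    kept, covered = set(), 0
    for variant, count in variants.most_common():
        if covered / total >= coverage:
            break  # enough cases are covered, drop the remaining variants
        kept.add(variant)
        covered += count
    return [trace for trace in log if tuple(trace) in kept]


# Hypothetical log of ten cases: each string is one trace, each character one activity.
log = ["acdeh"] * 5 + ["abdeg"] * 3 + ["adceh"] + ["adbeg"]
filtered = filter_frequent_variants(log, coverage=0.8)
print(len(filtered), sorted(set(filtered)))
```

Lowering `coverage` removes more of the infrequent variants, which generally yields a simpler model at the expense of fitness.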

After the IvM has been run and the filters have been adjusted to the satisfaction of the analyst, the model can be saved and used for further analytical purposes.

The IvM tool can also show the deviation of the event log from the generated model. Leemans et al. (2014b) identify a deviation as either a log move or a model move. A log move is an event in the log that is not permitted by the model at that point. A model move indicates that an activity occurred according to the model but is not present in the log. Both deviations are visualised as dashed red lines, which show either a path around an event (model move) or a task that loops into itself (log move).
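The distinction between log moves and model moves can be illustrated with a simplified Python sketch that aligns one log trace with one run of the model. It is an assumption of this sketch that a single model run is given as a sequence of activities; actual alignment techniques search over all runs that the model allows and use a cost function. The function name and the example sequences are hypothetical.

```python
def align(trace, model_run):
    """Align a log trace with one model run via a longest-common-subsequence table."""
    n, m = len(trace), len(model_run)
    # lcs[i][j] holds the LCS length of trace[i:] and model_run[j:].
    lcs = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if trace[i] == model_run[j]:
                lcs[i][j] = 1 + lcs[i + 1][j + 1]
            else:
                lcs[i][j] = max(lcs[i + 1][j], lcs[i][j + 1])

    moves, i, j = [], 0, 0
    while i < n or j < m:
        if i < n and j < m and trace[i] == model_run[j]:
            moves.append(("sync", trace[i]))  # log and model agree
            i += 1
            j += 1
        elif j == m or (i < n and lcs[i + 1][j] >= lcs[i][j + 1]):
            moves.append(("log move", trace[i]))  # event only in the log
            i += 1
        else:
            moves.append(("model move", model_run[j]))  # activity only in the model
            j += 1
    return moves


# Hypothetical example: the log contains an unexpected x and is missing c.
for move in align("abxd", "abcd"):
    print(move)
```

In the example, the event x is reported as a log move and the skipped activity c as a model move, which corresponds to the two kinds of dashed red lines described above.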