Model evaluation is a critical issue in process modelling because more than one model may be able to represent the same data. Moreover, the question “what is the best model?” is difficult to answer and depends heavily on the domain itself.
Furthermore, event logs are built from reality; thus, negative examples of the process are not found in the log. Interestingly, according to [90], event logs have two characteristics that seem to conflict:
Event logs are trustworthy, because everything that is recorded in the log must have happened; however, not everything that might have happened is recorded in the log (incompleteness). To cope with the incompleteness issue, a model should be able to represent other behaviours that might be allowed, while care must be taken not to model behaviours that are not allowed.
To address these evaluation issues, metrics are needed. Several studies, such as [90] and [91], have discussed process model evaluation metrics. These metrics assess process models from different perspectives: fitness, precision, generalisation and simplicity.
(a) Fitness
Fitness, sometimes called replay fitness, is the ability of a model to reflect all process instances that are recorded in the log. Model fitness can be derived by counting the false negatives, that is, the behaviours that exist in the log but cannot be reproduced by the model.
Fitness is calculated in our experiments based on [92], where a simple alignment score assigns a cost to a move on log (the event is observed in the log only) and to a move on model (the event is observed in the model only), and no cost to a synchronised move between log and model. The alignment cost is calculated for each individual process instance with respect to the generated model after finding the optimal alignment. The alignment cost is then normalised by the cost of the worst case, in which no synchronised move occurs between the log and the model (the denominator). The best fitness score is 1 and the worst is 0. The fitness score is calculated for each individual trace and then averaged over the total number of traces.
\[
\text{fitness} = 1 - \frac{\text{optimal cost}(\text{log}, \text{model})}{\text{move}(\text{log}) + \text{move}(\text{model})} \tag{2.1}
\]
To illustrate the fitness calculation, consider the model M presented in Figure 2.2 (left) and the trace (a,b,c,d,e,g). First, we compute an alignment between the trace and model M, which counts the moves between model and trace. Different alignments can be found, such as those presented in (A) and (B); however, the ProM tool uses an algorithm that guarantees an optimal alignment [92], such as alignment (B), whose cost is 2, whereas the cost of alignment (A) is 4.
Figure 2.2: Fitness calculation example; the symbol ‘>>’ represents no synchronisation.
Hence, using the fitness formula presented in (2.1), the fitness(M, trace) of alignment (A) is 1 − 4/(6+6) = 0.67 and the fitness(M, trace) of alignment (B) is 1 − 2/(6+6) = 0.83, which is the fitness reported by the ProM tool. It should be noted that evaluating a process model only by its ability to reflect reality (replay fitness) is not enough, because of the incompleteness issue and the lack of negative examples.
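The arithmetic of the worked example can be reproduced with a short Python sketch. It is only an illustration of formula (2.1) under unit costs, not the ProM implementation: the function names are hypothetical, the alignment is assumed to be given already (finding the optimal one is not shown), and the sample alignment, including the model-only activity `h`, is invented solely to produce a cost of 2.

```python
SKIP = ">>"  # symbol used in Figure 2.2 for "no synchronisation"


def alignment_cost(alignment):
    """Cost of a given alignment: 1 per move on log or move on model,
    0 per synchronised move. Each step is a (log_step, model_step) pair."""
    return sum(1 for log_step, model_step in alignment
               if log_step == SKIP or model_step == SKIP)


def trace_fitness(cost, log_moves, model_moves):
    """Formula (2.1): cost normalised by the worst case, in which every
    event is a move on log and every model step is a move on model."""
    return 1 - cost / (log_moves + model_moves)


def log_fitness(trace_scores):
    """Average of the per-trace fitness scores over the number of traces."""
    return sum(trace_scores) / len(trace_scores)


# Hypothetical alignment with one move on log ('g') and one move on model ('h').
example = [("a", "a"), ("b", "b"), ("c", "c"), ("d", "d"), ("e", "e"),
           ("g", SKIP), (SKIP, "h")]

fitness_a = round(trace_fitness(4, 6, 6), 2)                        # alignment (A)
fitness_b = round(trace_fitness(alignment_cost(example), 6, 6), 2)  # cost 2, as in (B)
print(fitness_a, fitness_b)  # 0.67 0.83
```

The denominator values 6 and 6 are taken directly from the worked example above; for other traces they would have to be derived from the trace length and the model.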
(b) Precision
The precision metric measures the fraction of the behaviour allowed by the model that is actually observed in the log. In other words, an imprecise model is one that represents a negative trace (if the definition of negative examples is known) or an extremely anomalous process that differs from those observed in the log.
In our evaluation, precision is computed using [48] and [93], which score the alignment between traces and model while accounting for behaviour that the model allows but that is never seen in the log.
\[
\text{precision} = \frac{\text{number of observed events in the log at a particular position}}{\text{number of allowed events in the model at that position}} \tag{2.2}
\]
where both counts are weighted by how often the position occurs and summed over all positions. To calculate the numerator and the denominator, the authors use the best alignment sequence to construct a prefix automaton in which each state is weighted by the number of occurrences of the corresponding prefix of events. The prefix automaton is then enriched with the edges that are allowed by the model but not observed in the log, which they call escaping edges. The method helps to identify the set of observed behaviour as well as the set of unobserved behaviour that the model can generate. An example of the precision calculation, adopted from [2], is explained here.
Figure 2.3: Precision calculation example adopted from [2, pg.5]
Suppose we have the model M in (a); log L1 includes two traces [(a,c), (a,d)], shown in the prefix tree in (b), and log L2 includes three traces [(a,c), (a,d), (a,b,a,b,a,b,a,b,a,b,a,c)], shown in the prefix tree in (c). The red edges represent the moves that are allowed by the model but not observed in the log. The gray circles of the prefix automaton are weighted by the number of tokens that enabled the move. Based on the model M and the precision formula in (2.2), precision(L1, M) = (2∗1 + 2∗2 + 1∗0 + 1∗0) / (2∗1 + 2∗3 + 1∗0 + 1∗0) = 6/8 = 0.75 and precision(L2, M) = 0.714.
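The L1 calculation can be sketched in Python as follows. This is a simplified illustration, not the method of [48] or [93]: it assumes that every trace fits the model perfectly (so no alignment step is needed), it weights each prefix by the number of traces that share it, and it takes the number of events the model allows after each prefix as a hand-written input. The names `prefix_states`, `precision` and `allowed_l1` are hypothetical, and the allowed counts are read off the denominator of the worked example rather than derived from the model in Figure 2.3.

```python
from collections import Counter, defaultdict


def prefix_states(log):
    """Build the prefix tree of a log.

    Returns the weight of each prefix (how many traces pass through it)
    and the set of events observed directly after it."""
    weight = Counter()
    observed = defaultdict(set)
    for trace in log:
        for i in range(len(trace) + 1):
            prefix = tuple(trace[:i])
            weight[prefix] += 1
            if i < len(trace):
                observed[prefix].add(trace[i])
    return weight, observed


def precision(log, allowed_count):
    """Weighted observed events divided by weighted allowed events."""
    weight, observed = prefix_states(log)
    numerator = sum(w * len(observed[p]) for p, w in weight.items())
    denominator = sum(w * allowed_count[p] for p, w in weight.items())
    return numerator / denominator


log_l1 = [("a", "c"), ("a", "d")]

# Assumed number of events the model allows after each prefix,
# matching the terms 2*1 + 2*3 + 1*0 + 1*0 of the worked example.
allowed_l1 = {(): 1, ("a",): 3, ("a", "c"): 0, ("a", "d"): 0}

print(precision(log_l1, allowed_l1))  # 0.75
```

The single unobserved event after prefix (a) is what lowers the score from 1 to 0.75; it corresponds to the escaping (red) edge in the prefix tree.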
Models with low precision allow a large number of unobserved events at a particular move, whereas models with high precision permit only observed events. As with data mining evaluation metrics, the F-score is used to balance the two accuracy metrics (fitness and precision) using the formula:
F-score = 2 ∗ fitness ∗ precision