1 W P = ∅;
2 correct = initialize(); 3 foreach (a, a0) ∈ Amc do
4 if notCausedByF iltering(a, a0) then 5 $= tok(norm(λ(a)));
6 $0 = tok(norm(λ0(a0)));
7 similarity= σ.$($, $0);
8 foreach (w, w0) ∈ maxW ordP airs($, $0) do 9 if w 6= w0 then
10 W P = W P ∪ (w, w0);
11 correct(w, w0) = correct(w, w0) + ϑ − similarity;
12 end 13 end 14 end 15 end 16 foreach (w, w0) ∈ W P do 17 σ.w(w, w0) = σ.w(w, w0) + correct(w, w0); 18 if σ.w(w, w0) > 1 then 19 σ.w(w, w0) = 1; 20 end 21 if σ.w(w, w0) < 0 then 22 σ.w(w, w0) = 0; 23 end 24 σ.w(w, w0) = σ.w(w0, w) 25 end
expert. Moreover, there can be false negatives where one of the activities has an equally labeled counterpart in the other process model and thus the activity pair was considered to not correspond, but it was identified by the expert as a correspondence. If one of these two conditions applies the notCausedByF iltering function returns false and the activity pair is not processed, as the misclassification is not caused by the bag-of-words similarity.
For all other activity pairs, the algorithm determines the bag-of-words similarity (lines 5 to 7). Here, the word similarity σ.w which was applied by the BOT configuration to previously propose the alignment is utilized. Next, the function maxW ordP airs is applied (line 8) to determine the set of word pairs that need to be adapted with regard to the current activity pair. For each word w in the union of the two bag-of-words it
yields a word pair (w, w0) consisting of the word w and the word w0 from the other
bag-of-words that yielded the maximum word similarity score for w. The respective set of word pairs thus comprises all word pairs that contributed to the misclassification of the activity pair. Note that the word similarity function is a symmetric function where for any word pair (w, w0) it holds that σ.w(w, w0) = σ.w(w0, w). Thus, in order
to ensure a consistent management of word pairs at this point, each word pair (w, w0)
is arranged alphabetically, i.e., w occurs before w0 in a dictionary. Then, the algorithm
iterates over the determined set of word pairs (line 8 to 13). If the word pair consists of two different words (line 9) the word pair (w, w0) is added to W P (line 10). Moreover,
the correction value corr(w, w0) for this word pair is updated (line 11). Therefore, the
algorithm subtracts the overall similarity score for the activity pair from the threshold and adds the respective difference to the stored correction value. The difference between the threshold and the overall similarity score characterizes the degree to which the similarity of the activities was misjudged. In case of a false positive the difference is negative and decreasing the word similarity by this difference for each of the determined word pairs will result in a similarity score that better reflects the similarity assessment of the expert. Analogously, the difference is positive for a false negative and increasing the word similarities respectively will lead to a higher overall similarity score. Note that if a word pair contributes to several misclassifications, its correction value is the sum of all differences yielded for the respective activity pairs.
Once the first step is finished, the word similarity values are updated by iterating over all word pairs in W P (lines 16 to 25). For each word pair the correction value is added to the word similarity score (line 17). Here, a word pair that predominantly contributed to false positives will have a negative correction value as its similarity was generally overestimated. Thus, its new word similarity score will be lower. In contrast, a word pair that predominantly contributed to false negatives will have a positive correction value. This indicates that its similarity was underestimated and its similarity score needs to be higher. In order to ensure that the word similarity σ.w is bound to the interval [0, 1], the new word similarity value is modified, if the update leads to a value outside this interval (lines 18 to 23). That is, if the value is larger than 1, it is set to 1. Additionally, if it is smaller than 0, it is set to 0. Moreover, σ.w is a symmetric function that yields the same similarity score for two words independent of the ordering of the words. Thus, the new similarity value for the word pair (w, w0) is also assigned to the
To investigate the effect of the word similarity adaptation, the following experiment is carried out based on the development datasets. In the experiment different config- urations of BOT are used to determine alignments for process models. Here, in all configurations stemming and pruning are disabled. In order to achieve a broad evalu- ation of the adaptation algorithm, the three word similarity measures that are part of OPBOT (LEV, LIN, and 2CS) as well as five different threshold values (.5, .6, .7, .8, and .9) are applied. Consequently, 15 different BOT configurations are used.
For each of the two datasets all process model pairs are matched following the process for feedback collection from Figure 6.1. In this regard, the gold standards are used to simulate the expert feedback and the process is executed for each of the BOT configura- tions separately. That is, in each iteration a model pair from the dataset is matched by the respective BOT configuration. The proposed alignment is then stored to compute the effectiveness once all process model pairs have been matched. Moreover, it is com- pared to the gold standard for the model pair and all misclassifications are determined. The falsely classified activity pairs are then passed to the word similarity adaptation algorithm to adjust the word similarity applied by the respective BOT configuration. In this regard, the threshold that was set for the BOT configuration will be used to deter- mine the correction value. Once the algorithm is done, the next iteration of the process is carried out. That is, the next model pair is processed by the BOT configuration using the updated word similarity. This way the word similarity is adjusted stepwise until the whole model collection is matched. Note that by storing the alignment in each iteration, it is ensured that the alignments which are used to compute the effectiveness at the end only depend on the adaptation that was achieved before the model pair was processed. All adaptations in later iterations do not impact the assessment of the effectiveness for the alignments.
A factor that influences the word similarity adaptation is the order in which the pro- cess model pairs are matched. That is because the order of model pairs determines the order of the similarity adapations. Thus, to examine the degree to which the or- dering influences the adaptation, 100 orders were randomly generated for each of the model collections. Each of these random orders was processed by each of the 15 BOT configurations. Consequently, a total of 1,500 separate runs was carried out for each dataset.
As a first indicator for the effect of the word adaptation Table 6.2 shows the maximum micro f-measure that was observed for each dataset. That is, the table reports the best result that was observed for any of the 1,500 runs. Additionally, the best micro f-measure
Table 6.2.: Maximum effectiveness of BOT configurations with and without adaptation
Dataset σ.w ϑ adaptation prµ reµ Fµ
BR LEV2CS .8.8 not appliedapplied .538.751 .399.582 .458.656 UA LEVLEV .7.8 not appliedapplied .597.646 .307.550 .405.594
yielded by any of the 15 BOT configurations without the feedback collection is used as a baseline, i.e., when the word similarity was not adapted. The order of the model pairs is irrelevant for the latter case, i.e., a specific BOT configuration yields the same effectiveness for all orderings. As the table reveals, the adaptation can have a strong positive impact on the effectiveness. On BR the maximum micro f-measure based on the word similarity adaptation amounts to (.458 vs. .656 =) 143% of the micro f-measureb for the best BOT configuration without the adaptation. The effect on UA is similar, i.e.,
.405 vs. .594 = 146%. In both cases this is due to an increase in the precision and moreb
important in the recall. With regard to the recall, the adaptation achieves a relative performance of (.399 vs. .582 =) 146% on BR and of (.307 vs. .550b =) 179% on UA.b
While these results show the potential of the adaptation, they only consider the maxi- mum effectiveness and thus draw an optimistic picture. To refine the analysis, Table 6.3 summarizes the improvements that were achieved for each of the 15 BOT configura- tions. Here, for each BOT configuration the micro f-measure that was yielded when the adaptation was not applied served as a baseline. In this context, the improvement for a BOT configuration gained in a certain run is the difference between the baseline and the micro f-measure that was achieved in this run. Accordingly, a positive value indicates that the adaptation of the word similarities improved the overall micro f-measure. As 100 runs were carried out for each of the BOT configurations, the table summarizes the improvements in terms of the maximum, the minimum, and the average improvement for each of the 15 configurations.
A first interesting result is that for each BOT configuration the micro f-measures were always improved regardless of the order in which the model pairs were matched. That is because the minimum values are all positive. Yet, the actual impact of the adaptation varies. On BR the difference to the baseline varies between .06 and .29. The situation is similar on UA where the improvements fall into the interval [.05, .28]. To this end, the variance can partly be explained by the different ordering of the model pairs. Here, the average difference between the maximum and the minimum effectiveness per BOT
Table 6.3.: Improvements of the micro f-measure
Average Maximum Minimum
Dataset ϑ LEV LIN 2CS LEV LIN 2CS LEV LIN 2CS
.5 .09 .13 .13 .12 .15 .15 .06 .09 .10 .6 .17 .18 .14 .20 .20 .16 .14 .14 .12 BR .7 .22 .21 .15 .26 .24 .17 .19 .17 .13 .8 .25 .22 .15 .29 .25 .18 .21 .18 .12 .9 .26 .22 .17 .29 .26 .19 .22 .18 .14 .5 .11 .09 .17 .14 .11 .20 .08 .05 .15 .6 .10 .14 .18 .14 .18 .21 .07 .10 .15 UA .7 .13 .17 .18 .16 .21 .22 .09 .13 .15 .8 .17 .19 .15 .20 .23 .19 .14 .15 .12 .9 .24 .23 .15 .28 .28 .19 .19 .17 .12
configuration is .06 on BR and .07 on UA. However, the results also show that the improvement depends on the word similarity and the threshold. To better understand the impact of these two features, Figure 6.2 outlines the distribution of the effectiveness values yielded by all configurations with a certain threshold or a certain word similarity. That means the distribution for each of the five different threshold values is based on 300 runs per dataset, i.e., 100 different orderings of the model pairs per word similarity. In contrast, the distribution for each word similarity relies on 500 runs, i.e., 100 different orderings of the model pairs per threshold value.
The figure shows that in each dataset the distributions of the micro f-measures are similar for all the three word similarities. However, there are differences with regard to the precision and recall values. That is, for LEV and LIN the precision takes higher values than the recall. For 2CS it is the other way around. In contrast to the word similarities the differences in the effectiveness are larger for the different threshold values. On both datasets an increase in the threshold is connected with an increase in the precision and a decrease in the recall. Moreover, the best micro f-measures are on average yielded for a threshold value of .7 or .8. This observation shows that the overall effectiveness achieved by adapting the word similarities depends on the quality of the respective BOT configuration.
While these analysis results address the overall effectiveness of the word similarity adaptation approach, the average f-measure yielded for each model pair with and without the adaptation are contrasted in Figure 6.3. As the figure reveals, the word adaptation achieves an improvement for the majority of the model pairs. On BR the f-measure
prμ reμ Fμ (a) Thresholds on BR prμ reμ Fμ (b) Thresholds on UA prμ reμ Fμ (c) Word similarities on BR prμ reμ Fμ (d) Word similarities on UA
Figure 6.2.: Overview of the effectiveness for the thresholds and word similarities
is improved for 29 model pairs and on UA for 30. On both datasets there are more than 20 model pairs for which the f-measure with the adaptation is lower than .3 and can be lifted to approximately .5. However, the figure also shows that the micro f- measure is decreased for some of the model pairs and on average seems to be located at approximately .5 for all pairs. According to this result, model pairs for which a high effectiveness is already achieved without the similarity adaptation should be matched at the beginning when the effect of the word similarity adaptation is low and the high effectiveness can still be achieved.
In summary the results demonstrate that due to the positive impact on the effec- tiveness, the word similarity adaptation can be considered as a means to adjust the bag-of-words similarity in a way that it better reflects the domain characteristics of a certain model collection. Thus, the algorithm constitutes a lightweight supervised WSD approach that adjusts similarity values, but does not learn semantic relations between words. The results also revealed two problems that influence the improvements in the effectiveness. First, the effect can vary strongly depending on the specific ordering of the model pairs, the threshold and the word similarity. Second, the adaptation algorithm
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 Fμ Model Pair
without adaptation with adaptation (a) BR 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 Fμ Model Pair
without adaptation with adaptation (b) UA
Figure 6.3.: Average f-measure yielded per model pair with and without the adaptation
might lead to situations where the micro f-measure for a model pair is actually de- creased compared to the effectiveness yielded without the adaptation. Thus, a matching technique that utilizes the word similarity adaptation algorithm should also incorporate strategies to mitigate these effects.
6.3. Transitivity
The second strategy to improve the effectiveness of matchers through expert feedback applies a well-known property in mathematics: transitivity. Generally speaking, transi- tivity can be interpreted in the following way: if two things are equal to the same thing, they are also equal to one another. In mathematical terms a binary relation R ⊆ X × X is transitive, if ∀x1, x2, x3 ∈ X : [(x1, x2) ∈ R ∧ (x2, x3) ∈ R] ⇒ (x1, x3) ∈ R [Bronshtein
Process A Approve Aptitude Accept Applicant Process B ... ... ... ... Process A Offer Place at University Accept Applicant Process C ... ... ... ... Process B Offer Place at University Approve Aptitude Process C ... ... ... ...
Figure 6.4.: Example for transitive correspondences
Accordingly, the idea is here to decide whether an activity pair (a0, a00) corresponds
or not by analyzing the true alignments that were already discovered during feedback collection. In particular, the idea is to search these true alignments for an activity a that corresponds to a0 as well as to a00. If such an activity a exists, it is considered as
evidence towards the correspondence relation between (a0, a00). An example of such tran-
sitive correspondences is shown in Figure 6.4. This example comprises three activities, “accept applicant” from process A, “approve aptitude” from process B, and “offer place at university” from process C. Because all three activities depict the task of determining, if an applicant is qualified for a certain course of study, each of the activities corresponds to both other activities. Accordingly, transitivity holds between these correspondences. Consequently, the alignments between process A and B as well as process A and C might be used to automatically infer the alignment between process B and C. Of course, any other constellation where two of the alignments are known is also conceivable.
In order to investigate to which degree transitivity exists in model collections, the gold standard alignments of the two development datasets are examined in the following. Moreover, the global clustering coefficient χ ∈ [0, 1] [Wasserman and Faust, 1994], also referred to as the graph transitivity index, is used as a means to measure the extent to which transitivity holds in the datasets. In graph theory, it provides information on the degree to which nodes tend to form clusters within a graph. It is also of interest for the analysis of social networks [Luce and Perry, 1949; Holland and Leinhardt, 1971] where it provides information on the existence of groups whose members share a certain relation, e.g., groups of friends.
The global clustering coefficient relies on a graph representation of the data where the nodes represent the elements and the edges are the relations between the elements. In the context of business process model matching this graph contains one node per activity in the model collection and the edges depict the correspondences that exist in the collection.
From such a graph, the set of triplets is derived in order to compute the global clustering coefficient. Here, a triplet is a 3-tuple that consists of three distinct nodes from the graph. Accordingly, an activity triplet contains three distinct activities from the model collection. Yet, as the goal is to examine how likely it is for three activities to transitively correspond, only those triplets that consist of activities from different process models are considered. The reason is that correspondences exist between different process models. Thus, triplets with activities from the same model contain at least two activities that do not correspond and can hence be ignored.
Definition 6.1 (Activity triplets). Let {Pi}ki=1 with Pi = (Ni, Ai, Ei, λi, τi) be a
collection of k ∈ N≥2 process models. Then, the set of activity triplets A3 is defined as
A3 = {(a, a0, a00)|(a, a0, a00) ∈ Ax× Ay × Az∧1 ≤ x, y, z ≤ k ∧ x 6= y 6= z 6= x}
Given the set of activity triplets, the global clustering coefficient is defined as the ratio of the number of transitive activity triplets and the number of potentially transitive triplets. A transitive triplet is a triplet where each activity corresponds to both other activities. That is, a transitive triplet comprises three activities that correspond to each other. By contrast, potentially transitive triplets are all triplets for which at least one activity corresponds to both other activities. This means, a potentially transitive triplet satisfies the condition of the transitivity. Thus, the global clustering coefficient is the percentage of cases where the transitivity condition is fulfilled and transitivity actually holds. Consequently, if the global clustering coefficient is 1 all correspondences transitively inferred from the existence of two other correspondences truly exist. The lower the coefficient is the more often will a transitively inferred correspondence be incorrect, as the number of cases increases where transitivity is falsely concluded.
Definition 6.2 (Global clustering coefficient). Let {Pi}ki=1with Pi = (Ni, Ai, Ei, λi,
τi) be a collection of k ∈ N≥2 process models and A3 be the set of activity triplets.
Further, let {Aj}lj=1be a set of l ∈ N>2alignments where there is at most one alignment