4.5 Summary
6.3.2 Adaptability Metrics Implementation
To implement the approach, it is necessary to set adaptability scores, as defined
in Def. 32, for every relevant element of BPMN. Score
setting As discussed in Sect. 2.2.2.3, we limit the evaluation to BPMN processes [26, pp. 143–314], since these form the executable part of the language and, hence, the target of our metrics. Other types of process models, such as collaborations or process choreographies are not considered. We reviewed the specification and set scores for all elements that are permitted in BPMN processes. These are activities, including tasks and subProcesses, gateways, and events. Thus, we cover all control-flow aspects of BPMN, but abstract from data-flow here. Moreover, we omit language elements that are relevant to the visualization of the process model only, such as lanes. The result is a total of 94 language elements for which we set a score.
Plugin structure After setting the scores, we extended our metrics suite, prope, with a plugin
for computing the adaptability of BPMN process models. The structure of this plugin is depicted in Fig. 6.3. As before, the plugin is hooked into the architecture of prope through a custom implementation of a FileAnalyzer. This AdaptabilityAnalyzer starts with a number of sanity checks on a given file, using a BpmnInspector. These sanity checks include the checking of the usage of correct namespaces and the validation of referential integrity of the elements in the process model using the BPMNspector tool26. If a process model does not pass these checks, it is excluded from further analysis. After the initial validation, the AdaptabilityAnalyzer continues with the computation of adaptability metrics using different implementations that follow the definitions from Sect. 6.1.1. In particular, the implementation of the binary adaptability metric can be configured with different threshold values that are used as cutoff criterion. The metric
26The project page of this BPMN file validation tool can be found at http://www.bpmnspector. org/.
Figure 6.3.: Adaptability Plugin for the Metrics Suite
implementations themselves utilize AdaptableElements that are representations of concrete BPMN elements and their adaptability score. Each AdaptableElement is identified by its name and a list of possible adaptations. This list corresponds to the set of alternative representations, as defined in Def. 32. Hence, the size of the list is the adaptability score of the element. Moreover, an AdaptableElement contains a documentation that explains design decisions for the configuration of the list of adaptions and an XPath expression that is used to locate the element in a BPMN process model. The set of all these elements is constructed via a factory class, AdaptableElements. Since the AdaptableElements, and particularly the list of adaptions, is imperative for the computation of the adaptability metric values, they are listed in tabular form in Appendix C.
6.3.3. Results
In the following sections, we first describe the nature and construct usage of the gathered process models in Sect. 6.3.3.1. Thereafter, we present and discuss descriptive metric data in Sect. 6.3.3.2 and evaluate the hypotheses to determine the most useful metrics in Sect. 6.3.3.3 to 6.3.3.5.
6.3.3.1. BPMN Process Models and Usage of Elements
The process models we obtained can be seen as representative of the open source usage of BPMN, since they are all gathered from freely accessible projects. Correct-
ness checks
As indicated in the previous section, we performed several correctness checks, such
as the correct usage of namespaces and the validation of element references, on the process models and excluded them from the analysis if they did not pass these checks. For instance, despite the fact that a file contains the correct BPMN 2.0 namespace, a process model defined in it might use a different one. The amount of such process models in our data set is negligible and more than 99% of the analyzed process models are using the correct BPMN namespace. When it comes to referential integrity, we could detect issues, such as an EventDefinition referenced in an event but not found in the process model, in 8% of the cases. This step reduced the initial set of process models from 2995 to 2745. Finally, 12% of the process models, a total of 327, are explicitly marked as executable, i.e., the isExecutable flag of the process element is set to true.
Figure 6.4.: Occurence Frequency of BPMN Elements
Occur- rence frequency In an influential paper [202], which also had impact on the current version of
BPMN [216], zur Muehlen and Recker analyzed the usage of BPMN elements in process models and found that only a very small subset of the existing elements were actually used in practice. As this information can be used to refine the computation of our metrics, we reproduced this analysis for the process models at hand. The plot in Fig. 6.4 depicts the occurrence frequency of selected BPMN elements, i.e., the percentage of process models in which the elements occur. We limit the figure to the ten most frequent elements found in the process models. Although the evaluation from [202] uses an older revision of BPMN, our results largely reinstate theirs. By far the most common elements in BPMN process models that occur in almost all cases are SequenceFlows and ordinary Start- and EndEvents. The next most frequent elements are two specific types of tasks.
This is similar to [202] where tasks are not separated by a specific type. The occurrence frequency of elements is important to the metrics computation for the following reason: The most frequent elements will need to be present in every process model and adapting them is simply not an option. If included in the adaptability computation, these elements would introduce noise into the metric value. For instance, SequenceFlows are very common in any process model and including them in the computation would introduce a strong weighting towards the adaptability of SequenceFlows in general. Since they cannot be adapted, this weighting would be noise. Based on the data depicted in Fig. 6.4, we can exclude the most common elements from the metrics computation, being the elements that occur in more than two thirds of all process models. These are SequenceFlows and ordinary Start- and EndEvents.
6.3.3.2. Comparison of the Metrics
Table 6.2 lists descriptive statistics for the process models, sorted into executable and nonexecutable models, i. e., process models that have the isExecutable flag set to true or false respectively. We computed the weighted metric as defined in Sect. 6.1.1.2, and the binary metric, as defined in Sect. 6.1.1.1, with the thresholds set at 20%, 40%, 60%, and 80% of the most adaptable element of the language to see which threshold performs best. The metric values are very similar
Table 6.2.: Descriptive Statistics for the Process Libraries
Process Group N Statistics AMbin., AMbin., AMbin., AMbin., AMweighted
t = 0.2 t = 0.4 t = 0.6 t = 0.8 M ean 0.91 0.88 0.60 0.09 0.56 std(X) 0.14 0.16 0.15 0.16 0.14 executable 327 Disc.P ow. 0.08 0.09 0.12 0.09 0.21 nonexecutable 2418 M ean 0.96 0.93 0.59 0.09 0.59 std(X) 0.12 0.14 0.19 0.16 0.12 Disc.P ow. 0.02 0.02 0.02 0.01 0.05
for the two process groups, but there are strong differences among the different metrics. Differ- ences in metric distri- butions
A first step is to examine if there are significant differences between the adaptability of executable and nonexecutable process models. To determine this, we first performed the Shapiro-Wilk test to see if the metric values are normally distributed. Since this is clearly not the case, we chose a nonparametric test, the Mann-Whitney U test [188] to find out if there are differences in the distributions of the metric values for the two process groups. The null hypothesis here is that there are no significant differences in the distributions of the metric values for the two process groups. Here, p-values do not reach a significant level for any of the metrics, so the null hypothesis cannot be rejected and there seem to be no significant differences between executable and nonexecutable process models in the data at hand. The reason for this might be that most process models
we obtained are in fact used for execution, but not marked with the respective flag. In the following, we evaluate the previously stated hypotheses regarding the quality of the metrics.
6.3.3.3. Hypothesis 1 – Discriminative Power
We can show that our metrics perform better when analyzing process models marked for execution and we can filter for the metrics that are particularly good. This can be achieved by looking at the discriminative power of the metrics, their ability to differentiate between different process models. The better a metric differentiates between different process models, the better it can be used for quality comparison and ranking of process models. To determine the discriminative power, we removed all duplicate metric values from the different data sets, resulting in a list of unique metric values for each metric and process group. By comparing the amount of unique values to the total amount of values, we obtained the percentage of unique values for the process models in the different sets. These percentage values are listed in the third row of each process group in Table 6.2. It can be seen that the discriminative power is much higher for executable process models and in particular the weighted adaptability degree performs best. Looking at the binary degrees, a threshold set at 60% yields best results. This narrows the scope of the useful metrics to the weighted metric and the binary metric with a threshold set at 60%.
6.3.3.4. Hypothesis 2 – Process Model Size
A further quality characteristic of software metrics is the ability to compare systems of different size. Typically, metrics tend to produce very different values when the system size differs a lot [184], thus rendering them insufficient for comparing systems of a very different size. To investigate how well our metrics perform in the face of process models of different size, we extracted two sets of process models. These are small and large models separated into executable and nonexecutable groups. The set of small process models refers to the first quartile with respect to the number of elements in the process model, i.e., the 25% smallest models. In the same fashion, the set of large process models corresponds to the fourth quartile with respect to the elements in the process models. We use the Mann-Whitney U test [188] as above to see if the metrics for small and large process models differ in their distribution. The null hypothesis here is that there are no significant differences in the distributions of the metric values for the two process groups. We limit this comparison to AMweighted and AMbinary
with a threshold set at 60%.
Table 6.3 depicts the results of this test separated by process groups. Because we are doing four tests, we have to adjust the significance level to 0.05/4 = 0.0125. In all but one case, p-values become significant, meaning that the distributions of the process groups for the metrics really are different. Consequently, these metrics should be treated with care when comparing process models of very different size. The only exception to this is AMweighted for executable process
Table 6.3.: Descriptive Statistics and Mann-Whitney U Test for Differences in Process Model Size
Process Nsmall Nlarge Statistics AMbin., AMweighted
Group t : 60% ˜ xsmall 0.67 0.62 ˜ xlarge 0.58 0.53 p 0.004 0.09 executable 62 80 U 2559 2803 nonexecutable 573 564 ˜ xsmall 0.67 0.62 ˜ xlarge 0.43 0.54 p 2.2e−16 2.2e−16 U 220285 213544
models. Here, no significant differences of the metric values for small and large process models could be detected. This means that the metric really can be used for comparing process models of different size.
6.3.3.5. Hypothesis 3 – Stability
A third quality factor for the metrics addresses their resilience to variances over time. Repeated measurements should produce similar results to demonstrate the stability of the mechanism of computation, i.e., the predictability of the metric value. One way to test this aspect, recommended by [171, p. 12], is by checking if metric values of repeated measurements differ only up to a certain accuracy threshold. This can be evaluated with the following formula, adapted from [171, p. 12]: AM (p2) − AM (p1) AM (p2) < Acc.T hreshold (6.7) By comparing the metric values of different snapshots of our data set over time, we can evaluate this aspect. For this reason, we repeated the data gathering described in Sect. 6.3.1 five months later and obtained a second snapshot of the data. This data set is slightly smaller, with 2719 instead of 2997 process models, but contains a larger amount of executable process models with 565 instead of 327 cases. Since we do not compare single process models, but different sets of process models, we replace the metric values in equation 6.7 with the median values of the different sets. As [171] leaves no hint for a suitable accuracy threshold, we set it to 0.05, meaning that the difference in metric values of repeated measurement should be no higher. The median values and results are depicted in Table 6.4. In nearly all cases, median values are identical. In case of the weighted metric for executable process models, the result is 0.04 which is still below the threshold.
Table 6.4.: Descriptive Statistics and Accuracy for Repeated Measurements
Process N1 N2 Statistics AMbin., AMweighted
Group t : 60% ˜ x1 0.63 0.59 ˜ x2 0.63 0.57 executable 327 565 Acc. 0 0.04 nonexecutable 2418 1921 ˜ x1 0.67 0.62 ˜ x2 0.67 0.62 Acc. 0 0
6.4. Summary
In this chapter, we proposed a set of metrics for measuring and assessing the adaptability of service-oriented and process-aware software. Adaptability in this context refers to the design-time adaption of executable process models to enable their execution on a different engine. In Sect. 6.1, we proposed and formally defined a set of metrics that capture this quality characteristic, based on the notion of alternative representations of process elements. The idea is that elements for which a higher amount of standard-conformant alternative representations exist can be considered as more adaptable.
The proposed metrics are internal complexity metrics. The theoretical vali- dation in Sect. 6.2 shows that the metrics fulfill related measurement-theoretic properties, i. e., non-negativity, null value, symmetry, and monotonicity. Due to the normalization of the metrics with respect to process model size, they fail to satisfy additivity. Moreover, the metrics are defined on an interval scale and are computed by counting. Their purpose is status assessment and the information of stakeholders.
We implemented the metrics computation for BPMN process models and gathered a large set of real-world process models for the practical evaluation in Sect. 6.3. The occurrence frequency of process elements in the gathered process models shows that only a small part of the vocabulary of BPMN is used in practice. We used this information to refine the metrics computation. The analysis of the discriminative power of the metrics shows that the weighted adaptability metric and the binary adaptability metric with a threshold set at 60% perform best. However, the former is the only metric that allows for the meaningful comparison of process models of different size. Finally, all metrics produce stable results.
In summary, the weighted adaptability metric outperforms the other proposed metrics. Hence, it is our answer to research question 2.3 and we can recommend it as a theoretically valid and practically tested mechanism for computing the adaptability of service-oriented and process-aware software.
Parts of this chapter have been taken from [85]. If software is not directly portable, nor easy to adapt, the remaining alternative, specified by the SQuaRE model [53], is to replace it. Replacement might also be required when upgrading an execution environment and a high degree of replace- ability reduces the risk of vendor lock-in [53, p. 16]. This quality characteristic forms the concluding part of the measurement framework. It is addressed by research question 2.4, the target of this chapter: What are suitable metrics for measuring replaceability? As discussed in Sect. 2.3, the SQuaRE model defines replaceability as the “degree to which a product can replace another specified soft- ware product for the same purpose in the same environment ” [53, p. 16]. In terms
of this work, a product refers to a piece of service-oriented and process-aware software, whereas the environment is the engine. The definition indicates that replaceability is not computed based on a single piece of software, but based on a paired comparison of an original piece of software and a replacement candidate. This constitutes a notable difference to the quality characteristics that have been the focus of the previous chapters. As opposed to these characteristics, replaceability is no inherent property of a piece of software.
In this chapter, we deviate from the methodology applied for the previous three quality characteristics. The reason for this deviation is that a large amount of metrics that seem applicable for assessing replaceability has already been proposed, validated, and evaluated, e.g., [217–227]. What is more, comparative studies that evaluate these metrics do already exist [228]. Adding yet another set of replaceability metrics on top of the existing body of metrics is neither desirable, nor likely to form a novel contribution. Instead of proposing new metrics, we review existing metrics and discuss their suitability for our area of application, the replaceability evaluation of executable software. We propose an extension to existing metrics that is necessary to support our use case and evaluate the resulting metrics computation. The proposed extension is orthogonal to the definition of existing metrics and does not change their way of computation. A theoretical validation of existing metrics is not required, since this is part of the sources in which they are defined.
Structure of the chapter We start by explaining the nature of replaceability in Sect. 7.1. This is followed
by a review of existing metrics in Sect. 7.2. The review includes a discussion of desirable properties for replaceability metrics and presents a categorization of existing metrics. Moreover, we select a subset of applicable metrics based on their targeted area of application, provide a closer discussion of these metrics, and ultimately choose two metrics for an evaluation. Next, Sect. 7.3 states deficiencies of these metrics for replaceability assessment, proposes an extension to cope with
these deficiencies, and evaluates the selected metrics using a set of synthetic process models. This allows to decide on which metric fits best. Finally, Sect. 7.4 concludes with a summary.
7.1. On the Measurement of Replaceability
As indicated in the previous section, the quality characteristic of replaceability, as defined by the SQuaRE model [53], differs from the other characteristics related to portability. These characteristics of portability, installability, and adaptability are inherent properties of a single piece of software or software system. They can be computed based on the static attributes or dynamic behavior of the piece of software alone. In contrast, replaceability, as defined above, is the degree to which a software product can replace another software product. This implies that replaceability cannot be evaluated for a single isolated piece of software, but requires a paired combination of two pieces: the currently installed software product and a candidate for its replacement. In our case, this translates to a currently enacted executable process model and one or more replacement candidates. If the change of an engine becomes necessary, for instance due to an upgrade, and a process model is neither directly portable to this engine, nor easy to adapt, it could be replaced by an alternative that can be executed on this engine. Clearly, this is only possible if alternatives to the original model do exist. Consequently, the evaluation of replaceability is not always possible or desired.