7.2.3 Identification of the Artefact Element Level Change Type
7.2.3.2 Change Identification and Representation: Graph-based Approach
For the purposes of this work, instead of XML differencing, a graph-based approach was chosen to identify and represent modified entities and their respective change types. The motivating factor is that since artefact data is already represented as graph nodes, no conversion to another representation is required. The higher the number of interim representations, the higher the potential for consistency issues to arise, should any of these representations change. Additionally, the graph-based approach makes it possible to express changes and differences between two versions of artefacts at the graph level. Since graph nodes model artefact elements, the translation of changes to the artefact element level is straightforward.
The idea of using graph-based differencing of versions of a software system is not new and has been expressed in the literature [188]. A graph-based approach presents a number of alternatives for implementation. One example is the format used for storing change data: the framework may utilise a custom XML-based format, or the GraphML format can be extended to cater for expressing changes. Another consideration is the level at which graph differencing is performed, such as at the graph database or at the GraphML level, using Java APIs or custom objects. This work adopts some aspects of the XyDelta change representation model to identify the change type and the changed entity. In XyDelta, nodes in the original XML document are given unique identifiers, which are stored in the XidMap. The delta between two versions of the XML document is expressed through operations on these identifiers: if an identifier is not found in the new version, it corresponds to a node that has been deleted [186].
GraphML File Differencing
In the approach adopted in this work, each node property is mapped to a key-value representation, which uniquely identifies property values of graph nodes, through which graph nodes and their changes can be identified. The approach is illustrated through an example shown in Listing 7.1. As mentioned in Chapter 6, each data element in the GraphML file has a key attribute with pre-defined values. These attributes can take the role of keys in the key-value pairs. Every data element has a value associated with it, which can be assigned a unique identifier such as V + a
sequential number. In case two data elements have the same value, they are assigned the same identifier, i.e. the same value in the key-value pair. In this example the D1 key of both nodes has the same value (public), hence both are assigned the value V1. Unique identifiers of nodes, represented by D8 data keys, are excluded from differencing since they are not the same across two versions of the GraphML file. Hence, every D8 key is assigned a set value (V3 in this example).
<node id="1">
  <data key="d0">BitIOCommonTest</data> --> V0
  <data key="d1">public</data> --> V1
  <data key="d6">class</data> --> V2
  <data key="d8">sc0Path</data> --> V3
  <data key="d9"/> --> V4
</node>
<node id="2">
  <data key="d0">rnd</data> --> V5
  <data key="d1">public</data> --> V1
  <data key="d2">Random</data> --> V6
  <data key="d6">Field</data> --> V7
  <data key="d8">sc1Path</data> --> V3
</node>
Listing 7.1: GraphML file, version n - mapping to key-value pairs
The nodes shown in Listing 7.1 can be represented as collections of key-value pairs. A graph G, which consists of these two nodes, may therefore be defined as:
Let graph G:
N1 --> (D0 --> V0, D1 --> V1, D6 --> V2, D8 --> V3, D9 --> V4)
N2 --> (D0 --> V5, D1 --> V1, D2 --> V6, D6 --> V7, D8 --> V3)
Listing 7.2: Graph G defined as a collection of key-value pairs
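The identifier assignment described above can be sketched in Java as follows. This is a minimal illustration, not the framework's actual code: the class name ValueIdSketch and the methods idFor and encode are hypothetical, and it assumes that the shared D8 identifier is simply taken from the same sequence on first use, which is what Listing 7.1 suggests.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: assigns "V" + sequential number identifiers to data values.
public class ValueIdSketch {
    private final Map<String, String> idsByValue = new HashMap<>();
    private String nodeIdPlaceholder; // single identifier shared by every d8 entry
    private int next = 0;

    // Returns the identifier for a data element; equal values share one identifier.
    String idFor(String key, String value) {
        if ("d8".equals(key)) {
            // Assumption: node identifiers (d8) are excluded from differencing
            // by mapping all of them to one fixed identifier.
            if (nodeIdPlaceholder == null) {
                nodeIdPlaceholder = "V" + next++;
            }
            return nodeIdPlaceholder;
        }
        String id = idsByValue.get(value);
        if (id == null) {
            id = "V" + next++;
            idsByValue.put(value, id);
        }
        return id;
    }

    // Encodes one node given as {key, value} pairs, e.g. {"d1", "public"}.
    Map<String, String> encode(String[][] dataElements) {
        Map<String, String> pairs = new LinkedHashMap<>();
        for (String[] element : dataElements) {
            pairs.put(element[0].toUpperCase(), idFor(element[0], element[1]));
        }
        return pairs;
    }

    public static void main(String[] args) {
        ValueIdSketch ids = new ValueIdSketch();
        Map<String, Map<String, String>> g = new LinkedHashMap<>();
        g.put("N1", ids.encode(new String[][] {
            {"d0", "BitIOCommonTest"}, {"d1", "public"}, {"d6", "class"},
            {"d8", "sc0Path"}, {"d9", ""}}));
        g.put("N2", ids.encode(new String[][] {
            {"d0", "rnd"}, {"d1", "public"}, {"d2", "Random"},
            {"d6", "Field"}, {"d8", "sc1Path"}}));
        System.out.println(g);
    }
}
```

Run on the two nodes of Listing 7.1, the sketch yields the same pairs as Listing 7.2. Reusing the same instance for the next version of the file would assign a fresh identifier only to values not seen before.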
The subsequent (modified) version of the same GraphML file is presented in Listing 7.4: the D1 data key of the first node was edited from public to protected. Graph G’, which is version n+1 of graph G, is therefore defined as:
Let graph G’:
N1 --> (D0 --> V0, D1 --> V8, D6 --> V2, D8 --> V3, D9 --> V4)
N2 --> (D0 --> V5, D1 --> V1, D2 --> V6, D6 --> V7, D8 --> V3)
Listing 7.3: Graph G’ defined as a collection of key-value pairs
<node id="1">
  <data key="d0">BitIOCommonTest</data> --> V0
  <data key="d1">protected</data> --> V8
  <data key="d6">class</data> --> V2
  <data key="d8">sc0Path</data> --> V3
  <data key="d9"/> --> V4
</node>
<node id="2">
  <data key="d0">rnd</data> --> V5
  <data key="d1">public</data> --> V1
  <data key="d2">Random</data> --> V6
  <data key="d6">Field</data> --> V7
  <data key="d8">sc1Path</data> --> V3
</node>
Listing 7.4: GraphML file, version n+1 - mapping to key-value pairs
Listing 7.3 shows that the D1-V1 key-value pair was modified to D1-V8; therefore it can be concluded that the node was edited. The three basic types of changes can be defined as follows:
Add: if a node exists in graph G’ but is not present in graph G, it was added.
Edit: if a node exists in both graph G and graph G’ and any of its key-value pairs are modified, it was edited.
Delete: if a node exists in graph G but is not present in graph G’, it was deleted.
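The three definitions can be expressed directly over the key-value representation. The following Java sketch is illustrative only: the names ChangeTypeSketch, ChangeType and classify are hypothetical, and it assumes that a node carries the same label (N1, N2) in both versions, which sidesteps the element identification problem discussed later in this section.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of the Add / Edit / Delete definitions.
public class ChangeTypeSketch {
    enum ChangeType { ADD, EDIT, DELETE, UNCHANGED }

    // Classifies every node of G (version n) and G' (version n+1).
    static Map<String, ChangeType> classify(
            Map<String, Map<String, String>> g,
            Map<String, Map<String, String>> gPrime) {
        Map<String, ChangeType> result = new TreeMap<>();
        for (Map.Entry<String, Map<String, String>> node : g.entrySet()) {
            Map<String, String> newer = gPrime.get(node.getKey());
            if (newer == null) {
                result.put(node.getKey(), ChangeType.DELETE);   // in G only
            } else if (newer.equals(node.getValue())) {
                result.put(node.getKey(), ChangeType.UNCHANGED);
            } else {
                result.put(node.getKey(), ChangeType.EDIT);     // a pair differs
            }
        }
        for (String key : gPrime.keySet()) {
            if (!g.containsKey(key)) {
                result.put(key, ChangeType.ADD);                // in G' only
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> n2 = Map.of("D0", "V5", "D1", "V1", "D2", "V6", "D6", "V7", "D8", "V3");
        Map<String, Map<String, String>> g = Map.of(
            "N1", Map.of("D0", "V0", "D1", "V1", "D6", "V2", "D8", "V3", "D9", "V4"),
            "N2", n2);
        Map<String, Map<String, String>> gPrime = Map.of(
            "N1", Map.of("D0", "V0", "D1", "V8", "D6", "V2", "D8", "V3", "D9", "V4"),
            "N2", n2);
        System.out.println(classify(g, gPrime)); // prints {N1=EDIT, N2=UNCHANGED}
    }
}
```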
Implementation of GraphML File Differencing
To implement the solution in Java, both the previous and current versions of the graph (pre and post modification) obtained from GraphML files are parsed into a nested hashmap data structure. Nodes of the graph are identified by the keys of the outer hashmap, while node property names and their values constitute its values and are themselves stored as a hashmap, as illustrated by Figure 7.2. Change identification can also be realised using a constraint modelling system, such as Conjure [189], which allows the problem to be solved effectively. This option, however, was dismissed for the current implementation, since it requires a constraint solver to be integrated into the framework, resulting in further complexity. Furthermore, should any changes be introduced to the underlying algorithm, utilising Java collections for implementing both the change and node identification problems provides a more flexible way of incorporating those changes.
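One way of obtaining such a nested structure is sketched below using the DOM parser shipped with the JDK. The class and method names (GraphMlToMapSketch, parse) are hypothetical, and the sketch makes simplifying assumptions: the outer key is the id attribute of each node element, nested graphs and edges are ignored, and surrounding whitespace in data values is trimmed.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical sketch: GraphML text -> outer map (node id) -> inner map (data key -> value).
public class GraphMlToMapSketch {
    static Map<String, Map<String, String>> parse(String graphMl) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(graphMl.getBytes(StandardCharsets.UTF_8)));
            Map<String, Map<String, String>> graph = new HashMap<>();
            NodeList nodes = doc.getElementsByTagName("node");
            for (int i = 0; i < nodes.getLength(); i++) {
                Element node = (Element) nodes.item(i);
                Map<String, String> properties = new HashMap<>();
                NodeList data = node.getElementsByTagName("data");
                for (int j = 0; j < data.getLength(); j++) {
                    Element element = (Element) data.item(j);
                    // An empty element such as <data key="d9"/> yields an empty string.
                    properties.put(element.getAttribute("key"),
                                   element.getTextContent().trim());
                }
                graph.put(node.getAttribute("id"), properties);
            }
            return graph;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse GraphML input", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<graphml><graph>"
            + "<node id=\"1\"><data key=\"d0\">BitIOCommonTest</data>"
            + "<data key=\"d1\">public</data><data key=\"d9\"/></node>"
            + "</graph></graphml>";
        System.out.println(parse(xml).get("1").get("d1")); // prints public
    }
}
```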
Figure 7.2: Nested hashmap representation of graph nodes and their properties.
Algorithm 2 describes how identical entities and fine-grained change types are identified. While iterating over the nested hashmap representations of the previous and current versions of the graph, beforeMap and afterMap respectively, the inner hashmaps inside both are compared. Inner hashmaps with the same number of keys are checked for matching values of D0 keys. D0 keys stand for the name property of graph nodes, and since nodes with the same name and number of properties are considered identical, matching nodes are established by searching for hashmaps with the same D0 values. In case the values of all other keys are identical, the graph node is unchanged. If the values of other keys mismatch, the node has been edited. Such keys are added to a list of edited entities. All matching keys are added to a list of matching entities. Should the values of D0 keys be different, the inner hashmaps do not match, indicating an add change. The same applies to inner hashmaps with different numbers of keys.
Deleted nodes are identified by differencing the sets of keys of beforeMap and matching keys.
Added nodes can be obtained by differencing the sets of keys of afterMap and matching keys.
Edited nodes can be established based on the list of edited entities.
A challenge revealed during implementation is differentiating rename changes from additions. Rename operations involve the modification of the D0 property. However, at the same time, any other property may be modified. When comparing a node from the previous version with a node in the subsequent version, it cannot be stated with certainty whether the investigated nodes are the same in case the actual change was a rename. This is due to the fact that any two nodes can have the same number and type of properties, and their values are also subject to changes. For this reason, for any node where a match was not found, the node in the new graph is labelled as an addition.
Some modifications affect intra trace links, which are represented by edges in the GraphML files. To update intra links, they are re-generated following the identification of changed entities and change types. The approach for updating inter trace links is discussed separately in Section 7.3.
Algorithm 2: Change identification.
Input:
  Nested hashmap representation of before and after graphs:
    a, outer hashmaps beforeMap and afterMap
    b, inner hashmaps beforeValue and afterValue
  Artefact change type: were the artefacts added, deleted or edited?
Output: List of edited, added, deleted nodes
begin
  For each key-value mapping in beforeValue and afterValue
    If beforeValue.keySet size = afterValue.keySet size
      If values corresponding to the D0 key are equal in beforeValue and afterValue
        beforeValue and afterValue represent the same graph entity
        If beforeValue and afterValue are equal, the entity did not change.
          Add beforeValue key and afterValue value to map of visited key-value pairs.
        Else
          The entity was edited. Add beforeValue key and afterValue value to
          map of visited key-value pairs. Add beforeValue key and
          afterValue to map of edited key-value pairs.
      Else beforeValue and afterValue are not the same graph entity. No action.
    Else
      beforeValue and afterValue are not the same graph entity. No action.
  End for
  Get added entities by differencing the keys of afterMap and visited key-value pairs.
  Get removed entities by differencing the keys of beforeMap and visited key-value pairs.
  Get edited entities from edited map keyset.
End
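A possible Java rendering of Algorithm 2 is sketched below. It is an interpretation rather than the framework's implementation: the names ChangeIdentificationSketch, Result and identifyChanges are hypothetical, the map of visited pairs is assumed to link the outer key of a node in beforeMap to the outer key of its match in afterMap, and the inner maps are assumed to hold only properties that take part in differencing (node identifiers stored under d8 are left out or normalised beforehand).

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch of Algorithm 2 using Java collections.
public class ChangeIdentificationSketch {
    static final String NAME_KEY = "d0"; // name property of a graph node

    static class Result {
        final Set<String> added = new TreeSet<>();   // outer keys from afterMap
        final Set<String> deleted = new TreeSet<>(); // outer keys from beforeMap
        final Set<String> edited = new TreeSet<>();  // outer keys from beforeMap
    }

    static Result identifyChanges(Map<String, Map<String, String>> beforeMap,
                                  Map<String, Map<String, String>> afterMap) {
        Map<String, String> visited = new HashMap<>(); // before key -> after key
        Result result = new Result();
        for (Map.Entry<String, Map<String, String>> before : beforeMap.entrySet()) {
            Map<String, String> beforeValue = before.getValue();
            for (Map.Entry<String, Map<String, String>> after : afterMap.entrySet()) {
                if (visited.containsValue(after.getKey())) {
                    continue; // already matched to an earlier node
                }
                Map<String, String> afterValue = after.getValue();
                boolean sameSize = beforeValue.size() == afterValue.size();
                boolean sameName = Objects.equals(beforeValue.get(NAME_KEY),
                                                  afterValue.get(NAME_KEY));
                if (sameSize && sameName) {
                    // Same graph entity: record the match, then check for edits.
                    visited.put(before.getKey(), after.getKey());
                    if (!beforeValue.equals(afterValue)) {
                        result.edited.add(before.getKey());
                    }
                    break;
                }
            }
        }
        // Added: present in afterMap but never matched.
        result.added.addAll(afterMap.keySet());
        result.added.removeAll(visited.values());
        // Deleted: present in beforeMap but never matched.
        result.deleted.addAll(beforeMap.keySet());
        result.deleted.removeAll(visited.keySet());
        return result;
    }

    public static void main(String[] args) {
        Map<String, Map<String, String>> before = Map.of(
            "1", Map.of("d0", "BitIOCommonTest", "d1", "public", "d6", "class"),
            "2", Map.of("d0", "rnd", "d1", "public", "d2", "Random", "d6", "Field"));
        Map<String, Map<String, String>> after = Map.of(
            "7", Map.of("d0", "BitIOCommonTest", "d1", "protected", "d6", "class"),
            "8", Map.of("d0", "rnd", "d1", "public", "d2", "Random", "d6", "Field"));
        Result r = identifyChanges(before, after);
        System.out.println("edited=" + r.edited + " added=" + r.added + " deleted=" + r.deleted);
    }
}
```

As in the algorithm, a renamed node finds no match under this rule and is therefore reported as one deletion plus one addition.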
Element Identification Problem
A pivotal aspect of change detection is the identification of artefact elements across two subsequent versions of the given artefact. That is, how can it be established that artefact element 1 in version n is identical to artefact element 1 in version n+1, regardless of whether the element was edited or remained the same across the two versions? This issue exists regardless of the models selected to represent artefact elements and changes, and can be translated to the graph-based representation as follows: what methodology can be adopted to establish that N1 on graph G is identical to N1 on graph G’? The answer is straightforward in case the node was not modified, since the key-value pairs are identical. Using the above example, it can be concluded that N2 on graph G is the same as N2 on graph G’. However, in case the node was edited, the current implementation utilises domain-specific knowledge based on the artefact type. For example, in a Java class, there cannot be two methods with the same name and signature.
Therefore, when comparing two nodes in two versions of a graph describing a Java source code artefact, if the two nodes have the same number of data elements and their name (D0 data key) is the same, it can be concluded that they are identical even if other data key values have been modified. This rule can also be applied to UML class diagrams and JUnit test cases. Further rules can be associated with other artefact types. For example, if node N stands for a requirement in a requirement specification artefact, it can be concluded that node N is the same across two versions of the file if its title properties (D11) are the same.
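The artefact-specific rules can be captured as a small lookup of identifying properties. The sketch below is illustrative: the enum ArtefactType and the method sameElement are hypothetical names, d0 is taken as the name key as in Listing 7.1, and it is assumed that the property-count check applies to source code, UML and JUnit artefacts but not to requirements.

```java
import java.util.Map;

// Hypothetical sketch of artefact-type-specific element identification rules.
public class ElementIdentitySketch {
    enum ArtefactType {
        JAVA_SOURCE("d0", true),
        UML_CLASS_DIAGRAM("d0", true),
        JUNIT_TEST("d0", true),
        REQUIREMENT_SPEC("d11", false); // identified by title only (assumption)

        final String identifyingKey;
        final boolean requireSamePropertyCount;

        ArtefactType(String identifyingKey, boolean requireSamePropertyCount) {
            this.identifyingKey = identifyingKey;
            this.requireSamePropertyCount = requireSamePropertyCount;
        }
    }

    // True if the two property maps are taken to describe the same element.
    static boolean sameElement(ArtefactType type,
                               Map<String, String> before,
                               Map<String, String> after) {
        if (type.requireSamePropertyCount && before.size() != after.size()) {
            return false;
        }
        String identity = before.get(type.identifyingKey);
        return identity != null && identity.equals(after.get(type.identifyingKey));
    }

    public static void main(String[] args) {
        Map<String, String> before = Map.of("d0", "BitIOCommonTest", "d1", "public");
        Map<String, String> after = Map.of("d0", "BitIOCommonTest", "d1", "protected");
        System.out.println(sameElement(ArtefactType.JAVA_SOURCE, before, after)); // prints true
    }
}
```

Keeping the rule per artefact type in one place means that a further artefact type only requires a new enum constant.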
One specific case of this problem is the rename operation, which is also a type of edit change and may be interpreted either as a rename or as a composite change consisting of a delete and an add operation. Another case is specific to Java source code, JUnit test and UML class diagram artefacts, where multiple artefact elements with the same name and type may exist. For example, there may be multiple Java methods with the same name and different parameters. Similarly, a Java class that contains multiple constructors poses the same challenge in identifying whether elements are the same across two subsequent versions of the artefact.