
6.2 Discussion

6.2.1 Weaknesses

A fundamental issue with our approach is that sub-graph isomorphism is NP-Complete (more specifically, it is NP-Complete when there are no labels or when repeated labels are present). That is, a solution is difficult to discover but, once found, easy to verify. The hardness follows from the clique problem and the Hamiltonian cycle problem, both of which are special cases of sub-graph isomorphism (Cook 1971; Karp 1972). As a result, finding matches can require an exponential amount of work in the worst case, per historical trace. Our framework currently does nothing to reduce this. Because our graphs are stored in Neo4j, labels are indexed within the database engine. It would therefore be more prudent to narrow down the number of candidate graphs before passing them over the network. As discussed in Section 5.7, we need only involve graphs that are no larger than the new graph G, use a subset of the rules that were used in G, and have no more instances of those rules than G has. More formally, given a new trace $G = \langle V, E, \ell_V, \ell_E, \iota \rangle$ and a historical trace $h = \langle V', E', \ell'_V, \ell'_E, \iota' \rangle$, we only require h for comparison if $|V_G| \geq |V_h|$, $\ell'_V \setminus \ell_V = \emptyset$ and $\forall l \in \ell'_V,\ |\{v \mid v \in V, \ell_V(v) = l\}| \geq |\{v \mid v \in V', \ell'_V(v) = l\}|$. Applying this test to every historical trace in the system could substantially reduce the number of candidates delivered from the database engine, preventing unnecessary comparisons and network traffic.
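The three conditions above can be sketched as a simple pre-filter over label counts. This is a minimal illustration rather than part of our framework: it assumes each trace has already been reduced to a list of vertex labels (one entry per vertex), and the names `is_candidate`, `filter_candidates` and the rule labels in the usage example are hypothetical. In practice the same test would ideally run inside the database engine so that rejected traces never cross the network.

```python
from collections import Counter


def is_candidate(new_labels, hist_labels):
    """Return True if a historical trace h could still match the new trace G.

    Both arguments are iterables of vertex labels, one entry per vertex.
    (Hypothetical helper; the representation is an assumption.)
    """
    new_counts = Counter(new_labels)
    hist_counts = Counter(hist_labels)

    # |V_G| >= |V_h|: h must not have more vertices than G.
    if sum(hist_counts.values()) > sum(new_counts.values()):
        return False

    # Each label of h must occur in G at least as often as it does in h.
    # A label missing from G has count 0, so this also enforces the
    # label-subset condition.
    return all(new_counts[label] >= n for label, n in hist_counts.items())


def filter_candidates(new_labels, historical):
    """Keep only the historical traces that pass the pre-filter.

    `historical` maps a trace identifier to its list of vertex labels.
    """
    new_labels = list(new_labels)
    return {
        trace_id: labels
        for trace_id, labels in historical.items()
        if is_candidate(new_labels, labels)
    }


if __name__ == "__main__":
    # Hypothetical rule names, used only for illustration.
    new_trace = ["Table2Family", "Column2Qualifier", "Column2Qualifier"]
    history = {
        "h1": ["Table2Family", "Column2Qualifier"],  # passes
        "h2": ["Table2Family", "Table2Family"],      # too many Table2Family
        "h3": ["Key2Row"],                           # rule not used in G
    }
    print(sorted(filter_candidates(new_trace, history)))
```

The filter is necessary but not sufficient: a trace that passes still has to go through the full sub-graph isomorphism check.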

Additionally, performance increases may be achievable using other sub-graph isomorphism techniques. Bonnici et al. (2013) introduced a method that creates a search tree, not unlike VF2, but reorders it using fast and straightforward heuristics so that options are pruned earlier. The process reorders the search tree such that the most constrained nodes fail first (Bonnici and Giugno 2017). Using the fail-first principle stops the traversal earlier, as opposed to general brute-force or look-ahead methods. We were unable to test this approach as there was no viable way to use it within our framework; to do so, we would need to implement the method in Java or Python to work with our current toolset. However, performance increases in sub-graph isomorphism may be negligible for small transformations, as shown by Carletti, Foggia, and Vento (2013).
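To give a flavour of the fail-first idea, the sketch below orders the vertices of a pattern graph so that the most constrained ones are tried first. It is not the algorithm of Bonnici et al.; it is a deliberately simplified heuristic that we assume for illustration, ranking vertices by how few target vertices share their label and then by degree. The function name and the dictionary-based graph representation are hypothetical.

```python
def fail_first_order(pattern_labels, pattern_edges, target_labels):
    """Order pattern vertices so that the most constrained come first.

    pattern_labels: dict mapping pattern vertex -> label
    pattern_edges:  iterable of (source, destination) pairs in the pattern
    target_labels:  dict mapping target vertex -> label

    Simplified heuristic (an assumption, not the published method):
    fewest label-compatible target vertices first, ties broken by
    higher degree, then by vertex name for determinism.
    """
    # How many target vertices carry each label.
    label_freq = {}
    for label in target_labels.values():
        label_freq[label] = label_freq.get(label, 0) + 1

    # Degree of each pattern vertex, ignoring edge direction.
    degree = {v: 0 for v in pattern_labels}
    for src, dst in pattern_edges:
        degree[src] += 1
        degree[dst] += 1

    return sorted(
        pattern_labels,
        key=lambda v: (label_freq.get(pattern_labels[v], 0), -degree[v], str(v)),
    )


if __name__ == "__main__":
    pattern = {"a": "R1", "b": "R2", "c": "R2"}
    edges = [("a", "b"), ("b", "c")]
    target = {1: "R1", 2: "R1", 3: "R1", 4: "R2"}
    print(fail_first_order(pattern, edges, target))
```

A backtracking matcher that visits pattern vertices in this order reaches dead ends sooner, because vertices with few or no compatible targets are tested before the search tree has branched widely.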

Another key area of weakness is orphan capture, as described in Section 3.2.2. The cost of the approach used by our engine, SiTra, increases exponentially as the input grows. The effects are manageable for smaller transformations; however, the more invocations that occur, the more time will become an issue. This problem comes from intercepting all accessor and mutator methods to proxy their inputs and outputs in order to capture what we have not seen before. We are increasing the workload of simple getters and setters quite substantially. The engine needs to know whether the input of a setter came from any previous rule invocation or from the orphans those invocations created. By this point, the trace could be massive. One way around this would be, once an orphan has been found, to keep a mapping from it to the invocation that created it; but where do we stop caching? The more we cache, the more memory the engine needs. Although ETL has longer response times, they are at least linear when capturing orphans. The executable abstract syntax tree that ETL uses captures the new keyword and only needs to retain the object. Because the object is an orphan at this point (that is, the engine did not create it), it can be preserved blindly. A framework-specific version of SiTra would be able to use meta-modelling observer patterns. For example, with EMF we could use the notification adapters that it implements to track setters and find orphans. It would also enable us to traverse with ease into new objects via their tree iterable.
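One possible answer to the question of where to stop caching is to bound the orphan-to-invocation mapping and evict the least recently used entries. The sketch below illustrates that trade-off only; it is not how SiTra works today, and the class name `OrphanCache`, its methods and the invocation identifiers are all hypothetical. It is written in Python for brevity, although the engine itself would need an equivalent in its own language.

```python
from collections import OrderedDict


class OrphanCache:
    """Bounded mapping from orphan objects to the invocation that created them.

    Hypothetical sketch: least recently used entries are evicted once the
    capacity is exceeded, which caps the memory the engine has to spend.
    """

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self._capacity = capacity
        # Keyed by object identity. The orphan itself is stored alongside the
        # invocation so that its id() cannot be reused while it is cached.
        self._entries = OrderedDict()

    def record(self, orphan, invocation_id):
        """Remember which invocation created the given orphan."""
        key = id(orphan)
        self._entries[key] = (orphan, invocation_id)
        self._entries.move_to_end(key)
        if len(self._entries) > self._capacity:
            self._entries.popitem(last=False)  # drop the least recently used

    def lookup(self, orphan):
        """Return the creating invocation, or None if it is not cached."""
        key = id(orphan)
        entry = self._entries.get(key)
        if entry is None:
            return None
        self._entries.move_to_end(key)  # mark as recently used
        return entry[1]


if __name__ == "__main__":
    cache = OrphanCache(capacity=2)
    first, second, third = object(), object(), object()
    cache.record(first, "invocation-1")
    cache.record(second, "invocation-2")
    cache.lookup(first)                   # touch `first`, so `second` is oldest
    cache.record(third, "invocation-3")   # evicts `second`
    print(cache.lookup(second), cache.lookup(first))
```

A cache miss does not mean the object is not an orphan; the engine would still have to fall back to searching the trace, so the capacity trades memory against how often that slower path is taken.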

Our approach also only looks at good trace elements. That is, the knowledge base contains only transformation traces that are deemed to have been successful in deployment; it does not include occasions where they did not work. An extension of our framework could include anti-trace patterns: a way of negating the cumulative effect of a transformation trace when that trace has not completed its task correctly. However, for this to work, rules would have to be versioned, as it would be unrealistic to expect brand new transformation rules for every bug discovered. Including a version number in the graph labelling would enable the engine to distinguish between versions of a rule. An argument against this is that it increases the number of graphs within the knowledge base, and with it the number of candidate graphs to analyse.

Another drawback is that our approach does not take into account rule maintenance or the fact that software progresses. We currently label each rule using its name; it would be more prudent to add a version string to the label. Previous traces that include older versions would then be ignored automatically during comparison, because their labels would differ, thereby pruning the list of traces that could apply to the new transformation. A further idea is to allow a range of versions to be specified for comparison, although the matching technique would become more complicated due to the additional pruning code required. Additionally, as software progresses rapidly, traces may quickly become numerous and irrelevant. One way to address this would be a sliding window that makes traces obsolete after a period. The length of that period may depend on the frequency of transformations and on the development of the rules required, and would thus depend on the domain in question.
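The version string and the sliding window can be sketched together as follows. Everything here is an assumption made for illustration: the `name@major.minor` label format, the dictionary layout of a trace (`recorded`, `labels`), and the helper names are hypothetical, and a single version range is applied to every rule for simplicity, whereas a real system would probably need a range per rule.

```python
from datetime import datetime, timedelta


def rule_label(name, version):
    """Build a vertex label from a rule name and a version tuple, e.g. (1, 2)."""
    return f"{name}@{'.'.join(str(part) for part in version)}"


def parse_label(label):
    """Split a label of the assumed form 'name@1.2' into (name, (1, 2))."""
    name, _, version = label.rpartition("@")
    return name, tuple(int(part) for part in version.split("."))


def is_current(trace, now, max_age, min_version, max_version):
    """Decide whether a historical trace should still be compared.

    trace: dict with 'recorded' (datetime) and 'labels' (list of labels)
    max_age: timedelta giving the width of the sliding window
    min_version, max_version: inclusive range of acceptable rule versions
    """
    # Sliding window: traces older than max_age are treated as obsolete.
    if now - trace["recorded"] > max_age:
        return False
    # Version range: every rule in the trace must fall inside the range.
    return all(
        min_version <= parse_label(label)[1] <= max_version
        for label in trace["labels"]
    )


if __name__ == "__main__":
    now = datetime(2020, 1, 31)
    window = timedelta(days=30)
    traces = [
        {"recorded": datetime(2020, 1, 20), "labels": [rule_label("Table2Family", (1, 2))]},
        {"recorded": datetime(2019, 6, 1), "labels": [rule_label("Table2Family", (1, 2))]},
        {"recorded": datetime(2020, 1, 25), "labels": [rule_label("Table2Family", (0, 9))]},
    ]
    print([is_current(t, now, window, (1, 0), (1, 9)) for t in traces])
```

With exact label matching alone the version filter comes for free, since labels of different versions simply never match; the explicit range check is only needed once several versions are meant to be comparable.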

Our work was evaluated using only one simulated domain: the transformation of a relational database into a non-relational database, specifically Apache HBase. However, this transformation had all of the hallmarks that we defined in Section 3.1 and, since the move to big data is becoming prominent, we felt that it was a good demonstration of both capturing trace data and making sure that the data is in good shape afterwards. A continuation of this work should revisit our assumptions about transforming relationships and possibly create a library that enables others to migrate their data, so that our framework can be tested in a real setting.

Additionally, other domains should be investigated. In our view, this work could be extended in a few ways:

• Investigation of new matching techniques: this may include new heuristics to reduce the number of graphs to check, or other ways to compare the graphs.

• The use within other domains: initial efforts were on transforming a DSL that describes the symptomatic behaviour of malware into C code to find them; however, that transformation, albeit not simple with regard to the binding phase, was simple with regard to the invocation of the transformation. It was also very orphan intensive, which started our research into that area.

• Investigate anti-trace patterns: currently we know what worked well, but if something ceased to work, what should we do next?

• How to interact with this information: currently we have a bubble graph to show the values for the rules in question. Depending on the size of the transformation, a navigable graph might be better placed to allow us to see exactly where the cold spots are.

• Parsing and code generation: we have the information for many levels of M2M transformations; however, in the event of text-to-model, we have no knowledge of what part of the AST caused the source to exist. Likewise, with model-to-text, what source becomes part of the final AST?

• Handle evolving transformation rules and software systems: our process currently uses all historic traces and does not take into account rule versioning. Naturally, as software progresses, legacy traces should be pruned and, as rules are modified, previous versions should be deprecated.