Related Work - Graph database management systems: storage, management and query processing

In this section, we discuss several categories of existing work that are closely related to the versioning of code dependencies. Relevant work originates from diverse research areas pertaining to graph evolution, source code analysis for software evolution, and projects from industry.

4.3.1 Evolving Graphs

There are many works from the graph domain addressing the general problem of evolving graph sequences [106]. Frameworks and systems have been proposed [164,102,101] focusing on efficient snapshot retrieval and analytics. Recent work [111,187] presents distributed frameworks designed for data management of large-scale temporal graphs. LLAMA [124] focuses on the storage of an evolving graph, with an emphasis on a data layout that augments the compressed sparse row representation to store mutating graphs. G* [111] takes advantage of the commonalities in successive snapshots and stores them in compact form. None of these works focuses specifically on the evolution of code dependency graphs, which presents a set of unique challenges.

Several works stem from the field of social networks [37, 32, 105], where temporal aspects are introduced to the graphs to enable interesting social queries over the time dimension (e.g. historical queries). Chen et al. [37] proposed a storage model for temporal social networks and indexes to speed up temporal queries on users, relationships and their activities. Proximity networks are modelled in [32], where nodes represent users and edges represent timed interactions between users originating from wearable sensors. Semertzidis and Pitoura [177] discuss alternative methods of storing generic graph snapshots in a native graph database, and present historical graph queries on them. They discuss the possibility of representing time either as an attribute on a node/edge or as a different edge type corresponding to each time-point (2012, 2013, etc.). Some studies attempt to index the graph in a way that helps specific graph queries, such as historical reachability [178] and shortest path [81,3] queries.
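The two time representations mentioned above can be contrasted with a small in-memory sketch. This is only an illustration under our own assumptions: the edge data, the `KNOWS` relationship name and both helper functions are hypothetical and are not taken from [177] or from any graph database API.

```python
# Encoding A: a single edge record whose attribute lists the
# time-points at which the edge exists.
edges_attr = [
    {"src": "alice", "dst": "bob", "type": "KNOWS", "years": {2012, 2013}},
    {"src": "bob", "dst": "carol", "type": "KNOWS", "years": {2013}},
]

# Encoding B: one edge per time-point, with the time-point folded
# into the edge type.
edges_typed = [
    ("alice", "KNOWS_2012", "bob"),
    ("alice", "KNOWS_2013", "bob"),
    ("bob", "KNOWS_2013", "carol"),
]


def neighbours_attr(node, year):
    """Historical neighbour query that filters on the time attribute."""
    return sorted(
        e["dst"] for e in edges_attr
        if e["src"] == node and year in e["years"]
    )


def neighbours_typed(node, year):
    """Same query, answered by selecting the edge type for that year."""
    wanted = f"KNOWS_{year}"
    return sorted(
        dst for src, etype, dst in edges_typed
        if src == node and etype == wanted
    )
```

Both encodings answer the same historical query. The attribute form keeps one edge per relationship, while the typed form multiplies edges per time-point but lets a query select a time-point by edge type alone.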

Instead of consuming individual snapshots, some research [101,105] processes the graph delta. Queries require reconstructing the graph by applying the correct delta to the current snapshot, and [105] shows how the performance of historical queries can be improved by materialising more than one snapshot, partial reconstruction and indexing deltas. DeltaGraph [101] proposes efficient ways to retrieve single or multiple snapshots using a hierarchical index structure over the deltas.
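The idea of reconstructing a snapshot from deltas, and of materialising more than one snapshot to shorten the reconstruction, can be sketched as follows. This is a minimal illustration and not the algorithm of [101] or [105]: the `Delta` type and the `reconstruct` function are hypothetical, graphs are reduced to edge sets, and each delta is assumed to be exact (it only adds edges that were absent and only removes edges that were present), so that it can also be undone.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Delta:
    """Edge changes that turn version i into version i + 1."""
    added: frozenset = frozenset()
    removed: frozenset = frozenset()


def apply_forward(edges, delta):
    """Move one version forward."""
    return (edges - delta.removed) | delta.added


def apply_backward(edges, delta):
    """Undo a delta (valid only under the exactness assumption)."""
    return (edges - delta.added) | delta.removed


def reconstruct(version, materialised, deltas):
    """Rebuild the edge set of `version`.

    materialised: dict mapping a version number to its stored edge set
    deltas:       list where deltas[i] turns version i into version i + 1
    """
    # Start from the closest materialised snapshot to keep the
    # number of applied deltas small.
    base = min(materialised, key=lambda v: abs(v - version))
    edges = set(materialised[base])
    if base <= version:
        for i in range(base, version):
            edges = apply_forward(edges, deltas[i])
    else:
        for i in range(base - 1, version - 1, -1):
            edges = apply_backward(edges, deltas[i])
    return frozenset(edges)
```

With snapshots materialised at both ends of a history, a query on an intermediate version applies at most half of the deltas, which is the intuition behind materialising more than one snapshot.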

Evolving graphs are the most relevant body of work, although to the best of our knowledge there has been no work specifically focused on the evolution of code dependency graphs. It must also be noted that all of the above-mentioned approaches assume a known implicit association between the entities across versions (perhaps through a persistent id), whereas in code dependency graphs we need to compute it.
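To illustrate what computing this association involves, the sketch below matches nodes of two independently extracted snapshots whose ids are unrelated. It is an assumption-laden illustration rather than the resolution procedure described later in this chapter: the node fields (`id`, `kind`, `name`), the matching key and the `resolve_nodes` function are all hypothetical, and the key is assumed to be unique within a snapshot.

```python
def resolve_nodes(old_nodes, new_nodes, key=lambda n: (n["kind"], n["name"])):
    """Associate nodes of two snapshots that share no persistent ids.

    Returns (matched, removed, added), where `matched` maps an old id
    to the new id of the entity considered equivalent.
    """
    old_by_key = {key(n): n["id"] for n in old_nodes}
    new_by_key = {key(n): n["id"] for n in new_nodes}

    common = old_by_key.keys() & new_by_key.keys()
    matched = {old_by_key[k]: new_by_key[k] for k in common}
    removed = sorted(old_by_key[k] for k in old_by_key.keys() - common)
    added = sorted(new_by_key[k] for k in new_by_key.keys() - common)
    return matched, removed, added


# Two extractions of the same (hypothetical) codebase: ids differ
# between runs, so only the derived key links the entities.
old_snapshot = [
    {"id": 1, "kind": "function", "name": "main"},
    {"id": 2, "kind": "function", "name": "parse"},
    {"id": 3, "kind": "struct", "name": "Config"},
]
new_snapshot = [
    {"id": 10, "kind": "struct", "name": "Config"},
    {"id": 11, "kind": "function", "name": "main"},
    {"id": 12, "kind": "function", "name": "lex"},
]
matched, removed, added = resolve_nodes(old_snapshot, new_snapshot)
```

A key this simple would not distinguish, for example, identically named static functions in different files; a realistic key would need more context, which is part of what makes the problem harder than in graphs with persistent ids.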

4.3.2 Industry Projects

Projects have originated from industry for advanced code querying, analysis and comprehension. In addition to text-based tools such as CScope [46], several more advanced tools have been designed through lexing and parsing the source code at different levels. IDEs are one such example, providing developers with features for basic code navigation, class browsing and refactoring. With no support for incremental indexing built into these tools, management of multiple revisions of the source is up to the user; each code repository must be analysed individually and the query results must be compared manually.

Prototype tools such as Wiggle [198] represent the code as a graph in a graph database and run queries at a mixture of syntax-tree, type, control-flow-graph or data-flow levels. The AST is stored and queried for several use cases, but there is no discussion about the storage cost or query performance on individual repositories. OpenGrok [148] and Google Kythe [68] are larger-scale projects capturing code structure at varying granularity and complexity with different goals in mind. OpenGrok [148] is a project initiated at Oracle with the aim of providing developers with a tool for searching and cross-referencing in various source code repositories, with support for different program file formats. OpenGrok does not build a dependency graph in the backend; it includes parsers for several languages, maintains a text index and uses regular expressions for search tasks. It provides support for histories from version control systems such as Mercurial and Git, allowing the user to select the version of the source to be indexed.

A closer match to Frappé is Google Kythe [68]. The core of the Kythe project is in defining language-agnostic protocols and data formats for representing, accessing and querying source code information as data. This standardisation effort provides protocols for inter-operable developer tools. Extractors in Kythe pull compilation information from the build system, and index the retrieved information in a language-agnostic graph. The graph is then used to answer queries related to code browsing, review and document generation. The Kythe graph schema captures more information than the model in Frappé, thus making the graph more complex and querying more difficult. For each new revision of the code, the repository needs a complete re-index, possibly in parallel. For Kythe, indexing every version is an adequate solution, considering that its focus is on interoperability with cross-referencing in each version. Many of these tools from the Software Engineering domain allow code comprehension in multiple versions in the sense that search results may be provided as an aggregated view of results in individual repositories. In this work, our objective is to allow cross-version querying, which involves deduplicating entities and efficient storage.

4.3.3 Source code analysis and other program meta-models

Source code analysis and manipulation has a long-standing history in software engineering research. Some early work built tools to understand a single software repository and to query it using declarative [153,75,104] and natural languages [103]. Several approaches in the literature have analysed software repositories for the purpose of understanding their evolution over time [48]. Data is collected from different sources, including versioning and bug tracking systems; the retrieved data is modelled, stored and analysed for use cases such as change impact propagation, hotspot analysis, developer effort and fault prediction, to name a few [48]. Program meta-models are built [112,169,51], adding a layer of indirection to the software at hand, with the objective of performing different types of analysis regarding its evolution. The meta-model CHA-Q [169] in particular persists the elements of the model in Neo4j. Since the Frappé meta-model already builds a storage-efficient dependency graph that scales to very large codebases, we investigated versioning the existing graph model with an emphasis on maintaining this storage scalability without compromising the existing querying capabilities.

Many forms of query languages have also been developed to enable querying a versioned software project in a declarative manner. In QWALKEKO [189], a git repository is directly viewed as a graph and queried using a combination of regular path expressions and logic query languages. ABSINTHE [98] is a general-purpose tool for querying versioned software, in which the history is modelled as a directed acyclic graph. SysEdMiner [137] is another tool that uses mining algorithms on change histories with the specific goal of finding unknown systematic edits. These approaches share a similar goal of enabling querying of versioned software. Our objective is to build versioned graphs that can be queried with a general-purpose language, whereas these approaches use domain-specific languages and change information from the SCM.

4.3.4 Syntactic and Semantic Differencing

Literature on syntactic and semantic differencing is also relevant, since we need to determine the delta of successive snapshots by identifying equivalent entities (Section 4.5).

[80], parse trees [220] and fine-grained Program Dependency Graphs (PDG) [107] (for duplicate code detection), thus producing varying levels of accuracy and semantics of the differences. The coarse-grained model in Frappé introduced a storage-efficient graph model for a different purpose: code comprehension on very large codebases. The performance and scalability of differencing algorithms on large codebases is uncertain. For example, Dex [160] employs a graph matching algorithm operating on ASTs, and the algorithm reports a worst-case complexity of O(n^4): it will be expensive when dealing with codebases with millions of LOC.

What sets us apart from alternative approaches?

In Table 4.1 we compare our proposed graph-based solution (Section 4.4.1.4) with alternative approaches on the following features: (a) complexity of the model; (b) scalability to millions of LOC; (c) granularity of the model and storage efficiency for a source code comprehension use case; (d) support for C/C++ codebases; and finally, (e) the ability to capture code changes precisely.

Table 4.1: Feature-based comparison with alternative approaches

Approach                  Complexity  Scalability  Granularity  C/C++ support  Precision
Evolving graphs           simple      n/a          n/a          n/a            n/a
Industry tools            complex     ✓            ✗            ✗              ✓
Source code analysis      complex     ?            ✓            ✗              ✓
Differencing algorithms   moderate    ✗            ✗            ✓              varied
Proposed solution         moderate    ✓            ✓            ✓              varied

The essence of our solution is to incorporate the principles of evolving graphs for versioning code dependencies. For the reasons mentioned above, we cannot employ an approach that assumes the availability of a graph delta. Instead, our approach is a variation where the delta is calculated by means of node and edge resolution (details in Section 4.5). Most of the projects from industry, although scalable, use a model that becomes too complex or a level of granularity that is not suitable for the types of code comprehension queries that we deal with. Also, in existing approaches we have not seen particular support for C/C++ codebases. The approaches in source code analysis evaluate much smaller codebases (100-800k entities) compared to the systems supported by Frappé (5M nodes, 80M edges). Due to the inherent complexity of the algorithms, differencing algorithms do not scale to large graphs with millions of LOC.
