3.8 Discussion
3.8.4 Processing keyword search on graphs
In our survey, particularly in Section 3.4.3, we encountered many studies focusing on tweet
search. In the Twittersphere, the tweet text (including its terms and hashtags) is an ideal candidate for keyword search which requires maintaining an inverted index for efficient retrieval. The inverted index is a simple data model to locate relevant documents on the basis of user input terms. As mentioned before, for full-text search features, Neo4j provides access to an external Lucene index while Sparksee’s capabilities only go as far as searching text via regular expressions.
The studies we found in the survey maintained ad-hoc approaches to maintain and query the corpus of tweets. Many of the existing works focused only on either tweet text or the social graph but never both. We observe great potential in combining these two dimensions and expressing interesting queries on both the graph and the text dimensions. More importantly, we want to explore how well graph database systems are able to handle this type of combined
Summary 63
3.9
Summary
In this chapter, we first highlighted the need for new data management and query language frameworks for Twitter. We reviewed the tweet analytics space by exploring mechanisms pri- marily for data collection, data management and languages for querying and analyzing tweets. In this first part, we outlined the research issues and challenges associated with integrated solutions and propose a graph-based model for the Twittersphere.
In the second part of this chapter we investigated how well graph database systems are able to drive the data management goals of a Twitter framework. Then we used a graph- based model of Twitter and proposed a set of queries relevant for a microblogging applications scenario. We chose two representative systems Neo4j and Sparksee and shared our experiences, noting the limitations in running the microblogging queries on these data management systems. Our contributions of this work can be summarised as follows:
• Extensive Survey: We conducted the first extensive review on existing approaches to primarily collect, represent, manage and query Twitter data. Armed with these observa- tions we consolidated the requirements of an integrated data management framework for
Twitter: Section 3.2—Section 3.4.
• Data Model and Queries: We proposed a data model for the Twittersphere that pro-
actively captures Twitter specific interactions and properties Section 3.5. In this model,
we suggested microblogging queries that is useful in a variety of application scenarios, such as recommendation, co-occurrence and influence detection.
• Experiments: We conducted experiments on a large Twitter dataset, and examined how queries performed on existing GDBMS that use graph structures to represent data
Section 3.6—Section 3.7.3.
• Lessons Learned: We shared our introspection on working with these graph database
Chapter 4
Evolving Dependency Graphs for
Multi-versioned Codebases
In Chapter 3 we explored how graph database systems can be used to model a large scale social network application such as Twitter. On this proposed model we implemented a series of microblogging queries and observed the ability of graph systems to drive data management goals in a social network setting. Dependencies in software source code repositories can also be modeled as an attributed multi-graph and graph databases become a good conceptual fit to manage and query this type of data. In this chapter we study software dependency graphs exhibiting characteristics different to that of social networks, with hundreds of node and edge types representing software entities. A software dependency graph can be captured for several reasons, one useful objective is code comprehension.
Frappé, is a code comprehension tool developed by Oracle Labs that extracts the code dependencies from a codebase and stores them in a graph database, enabling advanced code comprehension tasks. We study this established project from industry that captures the code dependencies and extend it to create, manage and query versioned graphs when the underlying codebase evolves over time. Unique challenges associated with versioned graph construction in multiple code revisions were addressed by leveraging efficient entity resolution strategies. In this chapter we explore how a graph database system addresses these challenges and facilitate representation, construction and querying of versioned code repositories.
Introduction 65
4.1
Introduction
As the size and scope of a codebase grows, code querying and comprehension tools become crucial in understanding and navigating tens of millions of lines of code. This is especially true for C/C++ codebases with their complex language features and custom-built systems. Text
editors in combination with text-based tools, such as Cscope[46] and Grep, support searching
for references of a particular symbol, navigating from definition to a declaration and fuzzy text searches. However, these text-based searches and fuzzy parsing results lack context sensitivity and have limited knowledge of the semantics of the search symbol. Consequently, the user needs to navigate through a large number of results to manually filter out appropriate answers. Integrated Development Environments (IDEs) such as Eclipse have more complete compilation- level support and also have access to a richer set of language-dependent structural information. They are not favoured on large C/C++ codebases for reasons of build integration complexity, performance and tradition.
It is often useful to show the relationships connecting the query symbols to better understand the context of the symbol for purposes of debugging or further analysis. As such, capturing the different dependencies within the source code is an important aspect of building context- sensitive comprehension tools. Code dependencies can be naturally modelled as a dependency
graph representing call graphs, type graphs and inheritance hierarchies. Frappé [77] is a source
code querying tool developed by Oracle Labs that supports code comprehension tasks for large C/C++ codebases. Frappé extracts graph-structured data from underlying codebases and in addition to symbol search, is able to answer navigational queries in the form of:
• Does function X or something it calls, write to global variable Y? and, • How much code could be affected if I change macro M?
The extracted dependency information is stored in a Neo4j [140] graph database (Detailed
in Section 2.4.3.1). Graph databases provide a platform that allows native support for code comprehension use cases expressed as efficient graph-based queries. In Neo4j, comprehension queries are expressed in their declarative query language Cypher.
In a collaborative software development environment, the codebases are constantly changing. Developers are continuously adding new features, fixing bugs and refactoring code generating new versions of the code. Frappé captures the dependency graph based on the most recent snapshot of the codebase. Each new version of the codebase results in a modified version of the dependency graph. Existing comprehension queries can be issued against a specific version of
Introduction 66
the dependency graph in isolation. In this chapter, we extend Frappé, focusing on strategies to enable advanced code comprehension when the underlying codebase evolves over time, and queries may span multiple versions. Any tool that captures a codebase dependency graph can benefit from our experiences in building versioned graphs for multiple revisions of the codebases. Storing dependency graphs corresponding to each version of the source code is not an ideal solution. In larger software projects, a significant proportion of the graph remains unchanged and thus can have redundant information. Separate graphs also present a complex challenge for implementing user queries that span multiple versions. Existing graph database systems do not have in-built support for efficient storage and management of versioned graphs. As such, end-users of graph databases need to either make copies of each different version of the data or investigate user-defined representations of storing the deltas. A graph delta is a history of graph differences over time, e.g. node additions and removals. In this chapter, we sought an efficient representation of the dependency graph to store and query multiple revisions. We also study in detail how the version graph can be built, dealing with the inherent traits of C/C++ codebases. The proposed model must be scalable and performant and be able to seamlessly integrate the current Frappé workloads.
Existing work in generic graph evolution and management [101, 187, 177] is relevant to
our work. In many of these studies, the changeset between two graph snapshots (i.e. delta) can be easily determined perhaps by the use of a consistent ID in all successive snapshots.
Due to the peculiarities of software entities that we discuss later in Section 4.5, determining
the equivalence of entities is a challenging task. As such, versioning of codebase dependency graphs have unique building characteristics and pose new management and query challenges, giving rise to new research directions in graph database systems. In studies from software
engineering research, different program meta-models [112,169,51] are constructed from source
code and, additionally, meta-information extracted from Source Control Management (SCM) repositories (such as git). Our goal with extending Frappé has been to encode and version all supported semantic information in the graph, not restricting the change information to that
directly available from the SCM. Some projects [68,160] extract dependency graphs in varying
levels of granularity, dependent on their use case. For the current use cases of Frappé, the dependency graph is more coarse-grained than a full Abstract Syntax Tree (AST), enabling a storage-efficient graph-based comprehension query platform that scales to very large codebases. In this work, we propose a model of the dependency graph representing n versions of the codebase. On this versioned graph we are able to perform code comprehension tasks on a version specified by the user with the added benefit of facilitating queries across versions. For
Frappé background 67
example, we are able to view how a function has changed over the last five revisions of the code, and thus identify and flag changes that have a wider dependency impact. We also present a systematic study on how graph databases can facilitate versioning code dependencies in terms of its representation, construction and query processing. Our contribution in this chapter can be summarised as follows:
• Presents methods of conducting resolutions of entities across versions, with and without location information.
• Using the resolutions, propose a scalable model to represent a versioned code dependency graph capturing code evolution.
• Evaluates a large codebase (≈13 million lines of code) and show the rate of growth and the storage benefit with a versioned graph over maintaining individual snapshots. • Recommends new comprehension queries that can be performed as a result of the proposed
versioned graph.