Frappé background - Graph database management systems: storage, management and query processing

This section introduces the background to Frappé starting with a brief description about the architecture and how the dependency information are extracted, stored and visualized. A graph model is explained to understand what is represented as the nodes, edges and properties of the dependency graph. Finally, representative code comprehension queries are explained.

4.2.1 Architecture

The architecture of Frappé is shown in Figure 4.1. The wrapper scripts in the extractor run

Frappé background 68

Figure 4.1: Frappé architecture

the source code and generate a set of intermediate .fo files for each unit of compilation. The extracted dependencies are then imported into a graph repository. The zoomable 2D spatial visualization available in the web UI, known as the ‘codemap’ of the query results may include individual source entities, display paths through the code or transitive closures in the call graph. In a code comprehension task, all this additional information gives the end-user a clear idea of the location, locality, structure and quantity of the results which help immensely in filtering

out irrelevant results [77].

4.2.2 Graph Model

The dependency information in a codebase is encoded as a combination of nodes and edges (col- lectively known as entities) in the Frappé graph model. A code segment and its corresponding

dependency graph is shown in Figure 4.2. The nodes of the dependency graph represent enti-

ties within the code, such as variables, functions and macros. The directed edges represent the associations among these entities, such as calls edges between function nodes and has_param

edges between a function and parameter nodes. As shown inFigure 4.2b, the nodes and edges

originate from various data sources and at different steps of compilation. For example, mod- ules, files and the linking information between them come from the build system; file inclusions macro definitions, expansions and interrogation links are a result of the pre-processor ; and other directories, source files from the file system and general symbols within the code.

In addition to the name and type attributes of the entity shown inFigure 4.2b, the database

stores further attributes on the nodes and the edges. A node of type enumerator would

have an attribute ‘value’ to denote the integer representation of the enumerator. An edge type has_param would contain an ‘index’ attribute to indicate the position of parameter in

Frappé background 69

int bar(int); foo.h #include“foo.h”

int bar(int *input){

return *input * 2;

}

foo.c

gcc foo.c -c -o foo.o gcc main.c foo.o -o prog

build

#include“foo.h”

int main(int argc, char **argv){

return bar(&argc); } main.c 12 25 26 27 28 48 49 51 52

(a) Code Segment

foo.o module prog module foo.h source_file bar function_decl bar function input parameter int primitive main.c file main function argv parameter foo.c source_file argc parameter char primitive linked_from compiled_from compiled_from compiled_from includes includes file_contains file_contains file_contains declares calls has_param has_param has_param is_type is_type is_type (b) Dependency graph

Figure 4.2: Example of a code dependency graph

the function signature. All file_contains and reference edge types describe the location information by a set of attributes to precisely identify the position at which an entity appears in the underlying code. The location information stored on the edges is an important part of the model as this information is used later for cross-referencing and visualizations in the ‘codemap’. This location information is also crucial in distinguishing different instances of entities.

foo.c

source_file

bar

function file_contains

file_id: 4 | start: 26 | end: 28 nameStart: 26 | nameCol: 100

Frappé background 70

Each reference edge captures the ‘use’ and ‘spelling’ location (used in Clang terminology,

sometimes referred to as ‘expansion’ and ‘name’ location) of adjacent nodes (Figure 4.3). To

illustrate with the code segment in Figure 4.2a, the file_contains edge between foo.c and

bar has the following properties: (1) File Id of foo.c, (2) use location: the start and end line numbers at which the bar function begins and ends, and (3) spelling location: the start line number (nameStart) and column number (nameCol) which the bar name token appears in. As

we see inSection 4.5.1, the location information becomes an important modeling consideration

when the code evolves. Note that the schema of the graph does not have a one-to-one mapping from the nodes represented in an AST – it is at a more coarse-grained level of granularity and does not reproduce all the detailed syntactic structures in an AST.

4.2.3 Code Comprehension Queries

Using the graph model described, high-level code comprehension questions can be translated into graph queries. Queries can range from simple index lookups to complex pattern-matching queries that require traversing a significant portion of the dependency graph. The queries are written using Neo4j’s query language – Cypher. A few representative categories of queries are discussed below.

Code Search. The most common feature of any comprehension tool is its ability to quickly search for a given symbol. Any simple text editor allows a user to search for a symbol, but that may produce a large number of results. A more advanced IDE will allow a user to filter the symbol by its type or the location in which it is defined given that entity types are identified in advance. External index can be built for fast retrieval of symbols and fuzzy searches. A user can provide additional constraints on type, attributes or the location in which a symbol is defined so that results which are more relevant can be retrieved.

For example, to return only fields named ‘id’ present in the module preprocess.elf, any fields that are not reachable from the node representing that module via a sequence of

file_contains, compiled_from and linked_from edges can be eliminated [77]. The code snippet

below shows this query in Cypher.

START m = node:node_auto_index(‘name: preprocess.elf’)

MATCH m -[:compiled_from|linked_from*]-> f WITH DISTINCT f

Frappé background 71

Code Navigation. Another useful feature in a comprehension tool is its ability to move between source files. Go-to-definition allows navigation from a symbol to where it is defined. This can be done either by entering the symbol along with any conditions to filter on or by clicking a symbol hyper-link in the visualisation. In contrast, find-references retrieves all locations from which a given symbol is referenced and presents the user with a list to filter further. The references are searched by returning all the incoming reference locations to a given symbol. These navigation features facilitate the general purpose of debugging.

Both these search and navigation functionalities are staples of modern IDEs. However Frappé can provide these functionalities for large C/C++ codebase environments where an IDE is not available or is impractical.

Code Path Comprehension. First, shortest path queries in this category enable better understanding of how two nodes in the graph are connected by a given edge type or that they are not reachable at all. Second, developers are often interested in exploring how a seed statement

or a region in the code affects the rest of the code (known as a program slice [208,182, 215]).

For example, given a seed function, the call graph can be traversed transitively to show the impact of the function either by incoming (forward slice) or outgoing (backward slice) calls edges. Moreover, the same affected regions can be found for source file inclusions or macro expansions both features particularly useful in debugging.

While the above queries facilitate comprehension of large codebases, providing more context to the user, the current model is unable to perform queries over a history of changes. Some motivating use cases for extending Frappé with multi-versioned functionality are discussed next. Motivating use cases for multi-versioned querying. Augmenting the current graph model with versions enables a series of new and interesting queries over these versions. A standard code review typically involves careful and detailed investigation of the changes that have been made to the code over revisions. A reviewer’s job is greatly facilitated if, as a part of the code comprehension tool, the changes, such as the function dependencies, are highlighted in advance. Further analysis on multiple revisions, such as which areas of the code are prone to change, how often they change, and which areas tend to change together, can all yield more insight into the codebases.

Related Work 72

In document Graph database management systems: storage, management and query processing (Page 83-88)