
Graph Database Systems for Microblogging Queries

In the previous sections, we highlighted the need for efficient querying and management of large collections of Twitter data modeled as graphs. In the second part of this study, we model the basic elements of the Twittersphere as graphs, determine the feasibility of running a set of microblogging queries in graph database systems, and present our observations. As shown in the graph schema in Figure 3.3, Twitter can be modeled as a labeled, directed, attributed multigraph. Graph database systems support the management of property graphs, which consolidate the above features (detailed in Section 2.4.1), and are thus a good conceptual fit to test our model.

In existing work on the topic, analyses of general data management queries on graph database systems have been widely reported [8, 200, 88], but none has demonstrated the feasibility of analyzing microblogging queries using such databases. Most prior studies have focused either on executing MLDM (machine learning and data mining) algorithms over large graphs [132] or on performing graph data management queries using relational databases [123]. In many of these studies, the goal is to create benchmarks for graph database management systems in terms of computational [132] or data management query workloads [200, 88, 125]. In addition, a large body of existing research has focused on using RDF stores or relational databases [123, 200] to store Twitter data [70]. Most of the relational queries are written with self-joins and require many optimizations to achieve acceptable performance. In contrast to these approaches, we study the feasibility of using a graph database system to query Twitter data. We believe that graph data management systems are better suited to the particular type of microblogging data workloads used in this chapter. We define queries relevant to microblogging and share our observations on executing them using graph management systems, thereby complementing those prior works.

For our analysis, we have carefully chosen queries pertinent to several applications of microblogging data. For example, our queries are relevant to applications such as providing friend recommendations, analyzing user influence, and finding co-occurrences and shortest paths between graph nodes. In addition, we have analyzed fundamental atomic operations like selection and


retrieving the neighbourhood of a node. For executing the aforementioned queries, we have chosen two popular open-source graph database systems: Neo4j [140] and Sparksee [131]. Such systems are typically able to answer data management queries concerning attributes and relationships efficiently by exploiting the structure of the graph. In particular, we want to answer the following questions:

• How efficiently can graph systems ingest a large graph dataset?

• Can graph systems model the Twittersphere with all the required properties?

• Can microblogging workloads be effectively translated to graph queries?

• How efficient are the queries when running them in a declarative and procedural fashion?

• What are the limitations of graph database systems and future research directions?

The goal of this work is not to perform a full benchmark of the two systems or recommend one over the other. Instead, our objective is to report our experiences working on these two graph database systems, as a way forward for us to understand the capabilities of graph database systems for data management.

3.6.1 Database Schema

The data model we proposed in Section 3.5.3 is what we use in this study. Here, in Figure 3.4, we describe it further with attributes and discuss a few alternative data modeling options. The figure shows only a few of the attributes attached to each of the nodes and edges; User and Tweet nodes in particular have many more properties. Many of the edges may carry a timestamp as an attribute. A Twitter dataset collected from an API requires pre-processing to create many of these relationships: the follows relationship may be returned directly by Twitter's REST API, while a Tweet may have to be processed to extract the hashtag nodes and retweet relationships. Although we specify multiplicity on the edges, these constraints are generally enforced at the application level, since many graph database systems cannot define such constraints on the schema.
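The pre-processing step mentioned above can be illustrated with a minimal sketch that pulls hashtag and mention candidates out of raw tweet text. The class name TweetEntityExtractor, the simplified regular expressions, and the "RT @" prefix test for retweets are all assumptions made for illustration; they are not taken from the Twitter API or from the pipeline used in this study, and real tweets would need more careful handling.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical pre-processing helper; not part of the Twitter API or of any
// graph database system discussed in this chapter.
class TweetEntityExtractor {
    // Simplified patterns (assumption): '#' or '@' followed by word characters.
    private static final Pattern HASHTAG = Pattern.compile("#(\\w+)");
    private static final Pattern MENTION = Pattern.compile("@(\\w+)");

    // Collects the first capture group of every match, lower-cased.
    private static List<String> extract(Pattern pattern, String text) {
        List<String> found = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            found.add(m.group(1).toLowerCase(Locale.ROOT));
        }
        return found;
    }

    // Candidate keywords for hashtag nodes and tags edges.
    static List<String> hashtags(String text) {
        return extract(HASHTAG, text);
    }

    // Candidate usernames for mentions edges.
    static List<String> mentions(String text) {
        return extract(MENTION, text);
    }

    // Assumed convention: a retweet starts with "RT @".
    static boolean isRetweet(String text) {
        return text.startsWith("RT @");
    }

    public static void main(String[] args) {
        String text = "RT @alice: Graph databases for #Twitter data #neo4j";
        System.out.println(hashtags(text));
        System.out.println(mentions(text));
        System.out.println(isRetweet(text));
    }
}
```

Each extracted keyword would then become (or be matched to) a hashtag node, with a tags edge from the tweet that contains it.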

[Figure 3.4 placeholder. Node types: user (userId, username, location), tweet (tweetID, text, timestamp), hashtag (tagId, keyword). Edge types: posts, follows, retweets, mentions, tags. Edge multiplicities as extracted: 1:m, m:m, 1:m, m:m, m:m.]

Figure 3.4: Data model of the schema with properties and multiplicity of edges.

Some applications would require tweet text to be tokenized and stored in an inverted index, in order to be able to efficiently search keywords or hashtags within tweet text. If a keyword search is conducted on any of these attributes, it is necessary to create a separate text index. Next, let us consider a few more alternative modeling options. Depending on the analysis, we may or may not model the hashtags as a separate vertex (and tags edge) in the graph schema. Hashtags could simply be modeled as an attribute on the Tweet node itself. On the other hand, modeling hashtags as separate vertices enables us to express queries on co-occurrence efficiently, as discussed in Section 3.7.3.
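To make the modeling trade-off concrete, the following is a minimal in-memory sketch of why separate hashtag vertices help with co-occurrence: the query becomes a two-hop walk from a hashtag to the tweets that tag it, and from those tweets to their other hashtags. The class HashtagGraph and its methods are hypothetical names introduced here, and the adjacency maps stand in for a graph database; this is not the query formulation used in Section 3.7.3.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical in-memory stand-in for tweet and hashtag vertices joined by tags edges.
class HashtagGraph {
    private final Map<Long, Set<String>> tagsOut = new HashMap<>(); // tweet -> hashtags
    private final Map<String, Set<Long>> tagsIn = new HashMap<>();  // hashtag -> tweets

    // Adds one tags edge from a tweet vertex to a hashtag vertex.
    void addTagsEdge(long tweetId, String keyword) {
        tagsOut.computeIfAbsent(tweetId, k -> new HashSet<>()).add(keyword);
        tagsIn.computeIfAbsent(keyword, k -> new HashSet<>()).add(tweetId);
    }

    // Two-hop walk: hashtag <- tweet -> other hashtag, counting shared tweets.
    Map<String, Integer> coOccurring(String keyword) {
        Map<String, Integer> counts = new TreeMap<>();
        Set<Long> tweets = tagsIn.getOrDefault(keyword, Collections.emptySet());
        for (Long tweetId : tweets) {
            for (String other : tagsOut.get(tweetId)) {
                if (!other.equals(keyword)) {
                    counts.merge(other, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        HashtagGraph g = new HashtagGraph();
        g.addTagsEdge(1L, "graphdb");
        g.addTagsEdge(1L, "neo4j");
        g.addTagsEdge(2L, "graphdb");
        g.addTagsEdge(2L, "twitter");
        System.out.println(g.coOccurring("graphdb"));
    }
}
```

With hashtags stored only as a tweet attribute, the same question would require scanning or text-indexing every tweet rather than following edges from a single hashtag vertex.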

3.6.2 Graph Databases

For our analysis we chose two leading open-source graph management systems, namely Neo4j and Sparksee. These systems not only support all the features needed for analyzing Twitter data, but also provide declarative query languages and APIs to interact with the property graphs. Neo4j, as introduced in Section 2.4.3.1, is a fully transactional graph management system implemented in Java. It supports a declarative query language called Cypher. Using the above schema, a query that retrieves the tweets of a given user with id 531 can be written in Cypher as:

MATCH (u:USER {uid: 531})-[:POSTS]->(t:TWEET)
RETURN t.text;

Another method of interaction is by using its core API. The core API offers more flexibility through a traversal framework, which allows the user to express exactly how to retrieve the query results. Cypher supports caching the query execution plans. This reduces the cost of re-compilation at run-time when a query with a similar execution plan is executed more than once. We have often used Cypher’s profiler to observe the execution plan and determine which query plan results in the least number of database hits (db hits) and have rephrased the query for better performance. It is noteworthy that all the queries can be alternatively written using the Java API exploiting the traversal framework. However, as with any imperative approach, the performance is dependent on how the query is translated into a series of API calls.
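The dependence of imperative performance on the chosen translation can be illustrated with a small self-contained sketch. It answers the same question (the tweets posted by one user) in two ways over plain in-memory maps: by scanning every tweet and filtering on its author, and by starting at the user and expanding only that user's posts adjacency. The class PostsQuerySketch, its fields, and the visited counter are hypothetical; the counter is only a rough stand-in for the idea of database hits and does not reproduce how Neo4j counts them.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory store used to compare two imperative translations.
class PostsQuerySketch {
    final Map<Long, String> tweetText = new LinkedHashMap<>();   // tweetId -> text
    final Map<Long, Long> tweetAuthor = new HashMap<>();         // tweetId -> uid
    final Map<Long, List<Long>> postsByUser = new HashMap<>();   // uid -> tweetIds
    int visited; // tweet records touched by the most recent query

    void addTweet(long uid, long tweetId, String text) {
        tweetText.put(tweetId, text);
        tweetAuthor.put(tweetId, uid);
        postsByUser.computeIfAbsent(uid, k -> new ArrayList<>()).add(tweetId);
    }

    // Translation A: scan all tweets and keep those written by the user.
    List<String> tweetsByScan(long uid) {
        visited = 0;
        List<String> out = new ArrayList<>();
        for (Map.Entry<Long, String> e : tweetText.entrySet()) {
            visited++;
            if (tweetAuthor.get(e.getKey()).longValue() == uid) {
                out.add(e.getValue());
            }
        }
        return out;
    }

    // Translation B: start at the user and follow only its posts edges.
    List<String> tweetsByTraversal(long uid) {
        visited = 0;
        List<String> out = new ArrayList<>();
        List<Long> posted = postsByUser.getOrDefault(uid, Collections.emptyList());
        for (Long tweetId : posted) {
            visited++;
            out.add(tweetText.get(tweetId));
        }
        return out;
    }

    public static void main(String[] args) {
        PostsQuerySketch db = new PostsQuerySketch();
        db.addTweet(531L, 1L, "first");
        db.addTweet(7L, 2L, "other");
        db.addTweet(531L, 3L, "second");
        System.out.println(db.tweetsByScan(531L) + " visited=" + db.visited);
        System.out.println(db.tweetsByTraversal(531L) + " visited=" + db.visited);
    }
}
```

Both methods return the same tweets, but the scan touches every tweet in the store while the traversal touches only those adjacent to the starting user.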

Sparksee, as introduced in Section 2.4.3.2, is a graph database management system implemented in C++ that provides APIs in many languages. We chose the Java API for our experiments. As an example, the query that retrieves the tweets of a given user 531 can be written using Sparksee's API as:

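// Look up the USER node type and its "uid" attribute.
int nodetype = g.findType("USER");
int attrID = g.findAttribute(nodetype, "uid");
// Find the node whose uid equals 531.
Value attrVal = new Value();
attrVal.setInteger(531);
long input = g.findObject(attrID, attrVal);
// Follow outgoing POSTS edges to reach the user's tweets.
int edgeType = g.findType("POSTS");
Objects userTweets = g.neighbors(input, edgeType, EdgesDirection.Outgoing);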

Sparksee queries have two primary navigation operations, neighbors and explode, which return an unordered set of unique node identifiers and edge identifiers, respectively, adjacent to a given node. When translating the queries to Sparksee's API, we made use of most of the constructs provided by the developers.
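The difference between the two navigation operations is easiest to see on a multigraph, where two nodes can be joined by more than one edge. The sketch below is a hypothetical in-memory illustration, not Sparksee's implementation or API: the class TinyMultigraph and the methods neighborsOutgoing and explodeOutgoing are names introduced here, and only the outgoing direction is modeled.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical in-memory multigraph; illustrates the two operations only.
class TinyMultigraph {
    private static final class Edge {
        final long id;
        final String type;
        final long tail;
        final long head;

        Edge(long id, String type, long tail, long head) {
            this.id = id;
            this.type = type;
            this.tail = tail;
            this.head = head;
        }
    }

    private final List<Edge> edges = new ArrayList<>();
    private long nextEdgeId = 1000L;

    // Adds a directed edge and returns its identifier.
    long addEdge(String type, long tail, long head) {
        long id = nextEdgeId++;
        edges.add(new Edge(id, type, tail, head));
        return id;
    }

    // explode-style result: identifiers of the matching outgoing edges.
    Set<Long> explodeOutgoing(long node, String type) {
        Set<Long> result = new HashSet<>();
        for (Edge e : edges) {
            if (e.tail == node && e.type.equals(type)) {
                result.add(e.id);
            }
        }
        return result;
    }

    // neighbors-style result: identifiers of the nodes those edges lead to.
    Set<Long> neighborsOutgoing(long node, String type) {
        Set<Long> result = new HashSet<>();
        for (Edge e : edges) {
            if (e.tail == node && e.type.equals(type)) {
                result.add(e.head);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        TinyMultigraph g = new TinyMultigraph();
        g.addEdge("MENTIONS", 1L, 2L);
        g.addEdge("MENTIONS", 1L, 2L); // parallel edge to the same node
        g.addEdge("MENTIONS", 1L, 3L);
        System.out.println(g.neighborsOutgoing(1L, "MENTIONS").size());
        System.out.println(g.explodeOutgoing(1L, "MENTIONS").size());
    }
}
```

Because both results are sets, parallel edges collapse to a single neighbor in the first operation but remain distinct in the second, which matters when edge attributes such as timestamps are needed.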

For this study, with the objective of understanding the diverse functionality of different graph database systems, we opted to run our queries through the declarative interface for Neo4j and through the core API for Sparksee.