In this chapter we first introduced graph preliminaries related to our thesis, including graph types, graph properties and different graph representations. We discussed types of real-world graphs and their applications, such as social networks. We introduced the property graph model, which describes graphs not only with nodes and edges of a single type, but also with different types of nodes and edges and with attributes on both. We comprehensively reviewed graph data management systems that can model, store and query property graphs. Together, these form the basis and background for the work described next.
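As a minimal illustration of the property graph model summarised above, the following Python sketch stores typed nodes and typed edges, each with its own attributes. The class names, labels (`User`, `Tweet`, `FOLLOWS`, `POSTS`) and sample values are hypothetical and chosen only for illustration; they are not the schema or the storage layout of any system reviewed in this thesis.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Node:
    node_id: int
    label: str  # node type, e.g. "User" or "Tweet" (illustrative labels)
    props: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Edge:
    src: int
    dst: int
    label: str  # edge type, e.g. "FOLLOWS" or "POSTS" (illustrative labels)
    props: Dict[str, Any] = field(default_factory=dict)


class PropertyGraph:
    """Toy in-memory property graph: typed nodes and edges with attributes."""

    def __init__(self) -> None:
        self.nodes: Dict[int, Node] = {}
        self.edges: List[Edge] = []

    def add_node(self, node_id: int, label: str, **props: Any) -> Node:
        node = Node(node_id, label, props)
        self.nodes[node_id] = node
        return node

    def add_edge(self, src: int, dst: int, label: str, **props: Any) -> Edge:
        # Both endpoints must already exist in the graph.
        if src not in self.nodes or dst not in self.nodes:
            raise KeyError("both endpoints must exist before adding an edge")
        edge = Edge(src, dst, label, props)
        self.edges.append(edge)
        return edge

    def neighbours(self, node_id: int, edge_label: str) -> List[Node]:
        # Follow outgoing edges of one type only (linear scan; fine for a toy).
        return [
            self.nodes[e.dst]
            for e in self.edges
            if e.src == node_id and e.label == edge_label
        ]


# Hypothetical sample data.
g = PropertyGraph()
g.add_node(1, "User", screen_name="alice")
g.add_node(2, "User", screen_name="bob")
g.add_node(3, "Tweet", text="hello graphs")
g.add_edge(1, 2, "FOLLOWS", since=2013)
g.add_edge(1, 3, "POSTS")
```

Calling `g.neighbours(1, "FOLLOWS")` returns only the followed user, while `g.neighbours(1, "POSTS")` returns only the tweet, which shows how edge types separate different relationships over the same set of nodes.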

In Chapter 3 we explore how graph database systems can be used in a social network setting and study a series of queries relevant for a microblogging scenario. In Chapter 4 we examine a code comprehension tool that captures dependency graphs and models them in a native graph database. We extend the capabilities of a system to enable versioning of dependency graphs when the underlying codebase changes over time. In Chapter 5 we investigate issues around storage of graph systems and propose edge re-labeling techniques to increase disk locality and thus improve query performance. In Chapter 6 we investigate how a textual search can be combined with graph traversals, integrating these dimensions in a generic graph database system.

Chapter 3

Graph Database Systems for Microblogging Analytics

With the inception of different types of social networks, a growing number of applications consume data collected from various microblogging platforms. Twitter is one such platform, and a myriad of research efforts have emerged studying different aspects of the Twittersphere. Each study exploits its own tools and mechanisms to capture, store, query and analyse Twitter data. Inevitably, frameworks have been developed to replace this ad-hoc exploration with a more structured and methodological form of querying and analysis. An analysis framework typically involves the following major components: data collection, pre-processing, data modeling and a language for querying tweets.

In this chapter we highlight the need for graph-based data models for microblogging analytics by reviewing existing approaches. Addressing limitations of existing models, we propose a data model for the Twittersphere that captures different kinds of Twitter-specific interactions. We examine the feasibility of running analytical queries using graph database systems and offer an empirical analysis of the performance of the proposed approach. Accordingly, we observe how well graph database systems are able to drive the overall data management goals of a Twitter framework. In particular, we share our experiences of executing a wide variety of microblogging queries on two popular graph databases: Neo4j and Sparksee. The queries are executed on a large, real Twitter graph data set comprising nearly 50 million nodes and 326 million edges.

3.1 Introduction

The massive growth of data generated from social media sources has resulted in a growing interest in efficient and effective means of collecting, analysing and querying large volumes of social data. In particular, the online social networking and microblogging platform Twitter has seen exponential growth in its user base since its inception in 2006, with now over 200 million monthly active users producing 500 million tweets daily. A wide research community has been established since then with the hope of understanding interactions on Twitter. For example, studies have been conducted in many domains exploring different perspectives of understanding human behaviour. Prior research has focused on a variety of topics including opinion mining [15, 18, 84], event detection [113, 171, 222], spread of pandemics [40, 152, 181], celebrity engagement [212] and analysis of political discourse [45, 89, 196]. These types of efforts have enabled researchers to understand interactions on Twitter related to fields such as journalism, education, marketing and disaster relief.

[Figure 3.1: An abstraction of a Twitter data management platform. Blocks shown: focused crawling, pre-processing, information extraction, data modelling, graph store, query processor (labelled together as data management) and data analytics.]

The systems that perform analysis in the context of these interactions typically involve the following major components: data collection, data management and data analytics. Here, data management comprises information extraction, pre-processing, data modeling and query processing components. Figure 3.1 shows a block diagram of such a system and depicts interactions among the various components. There has been a significant amount of prior research on improving each of the components shown in Figure 3.1 but, to the best of our knowledge, no framework proposes a unified approach to Twitter data management that seamlessly integrates all these components. Following these observations, in the first part of this chapter we extensively survey the techniques that have been proposed for realising each of the components shown in Figure 3.1, summarise their drawbacks, and describe the need for, and the challenges of, a unified platform for managing Twitter data. In our survey of existing literature, we observe ways in which researchers have tried to develop general platforms to provide a repeatable foundation for Twitter data analytics. We show the elements of our survey in Figure 3.2, primarily focusing on the following key elements.

• Data Collection. In Section 3.2 we describe mechanisms and tools that focus primarily on facilitating the initial data acquisition phase. These tools systematically capture the data using any of Twitter's publicly accessible APIs.

• Data management frameworks. In addition to providing a module for crawling tweets, these frameworks provide support for pre-processing, information extraction and/or visualization capabilities. In Section 3.3 we review existing data management frameworks.

• Languages for querying tweets. A growing body of literature proposes declarative query languages as a mechanism for extracting structured information from tweets. Languages present end-users with a set of primitives beneficial in exploring the Twittersphere in different dimensions. In Section 3.4 we investigate declarative languages and similar systems developed for querying a variety of tweet properties.

As shown in Figure 3.2, for each of the components we make note of the data model and storage systems in use, the dimensions explored and the types of analysis conducted with Twitter data. Armed with these observations, in Section 3.5 we consolidate the requirements of a data management platform for Twitter and highlight the importance of a graph-based approach to data management. As a graph database management system is a good conceptual fit for our proposed data model, we conduct experiments to test the feasibility of running a series of interesting microblogging queries on such systems. Section 3.6 discusses preliminaries on the graph schema, the query abilities of the tested graph systems and the pre-processing of the data. For the databases we conduct a feasibility analysis (Section 3.7), reporting on data ingestion and query processing. Finally, in Section 3.8 we discuss our findings on these two graph databases and

[Figure 3.2: Elements of the survey on Twitter analytics. Twitter Analytics is divided into Data Collection (Twitter APIs, data resellers, focused crawlers), Data Management Specific Frameworks (pre-processing, information extraction, generic platforms, application-specific platforms, visualization interfaces) and Querying Tweets (Twitter query languages, generic languages for social networks, Twitter search). Features noted: data model and storage (flat files, relational, RDF, key-value, graph); dimensions explored (text, location, time, interactions); analysis types (offline vs. online (real-time) exploration).]

The contributions of this work can be summarised as follows.

• Extensive Survey: We conduct the first extensive review of existing approaches to collecting, representing, managing and querying Twitter data. With these observations we consolidate the requirements of an integrated data management framework for Twitter.

• Data Model and Queries: We propose a data model for the Twittersphere that proactively captures Twitter-specific interactions and properties. On this model, we suggest microblogging queries useful in a variety of application scenarios such as recommendation, co-occurrence and influence detection.

• Experiments: We conduct experiments on a large Twitter dataset and examine how queries perform on existing graph database management systems (GDBMSs), which use graph structures to represent data.

• Lessons Learned: We share our reflections on working with these graph database systems and discuss open problems and opportunities for future research.
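To give a flavour of the co-occurrence queries mentioned in the contributions above, the following Python sketch counts which hashtags appear in the same tweets as a given hashtag. It is a toy in-memory stand-in for a two-hop graph traversal (hashtag to tweets, tweets to other hashtags), not the data model or the Neo4j/Sparksee queries evaluated in this chapter; the edge list `TAGS` and the function name `co_occurring` are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical (tweet_id, hashtag) pairs standing in for tweet-to-hashtag edges.
TAGS = [
    (1, "graphs"), (1, "neo4j"),
    (2, "graphs"), (2, "neo4j"), (2, "nosql"),
    (3, "graphs"), (3, "nosql"),
    (4, "python"),
]


def co_occurring(tag_edges, hashtag):
    """Count hashtags that share at least one tweet with `hashtag`."""
    # Hop 1: group hashtags by the tweet they are attached to.
    tags_by_tweet = defaultdict(set)
    for tweet_id, tag in tag_edges:
        tags_by_tweet[tweet_id].add(tag)

    # Hop 2: for every tweet carrying `hashtag`, count its other hashtags.
    counts = Counter()
    for tags in tags_by_tweet.values():
        if hashtag in tags:
            counts.update(tags - {hashtag})
    return counts


result = co_occurring(TAGS, "graphs")
print(result.most_common())
```

A graph database would express the same pattern declaratively or through a traversal API rather than by scanning an edge list, but the shape of the computation (expand from a hashtag node to its tweets, then to their other hashtags, then aggregate) is the same.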