Big Graph Data Management

(1)

Big Graph Data Management

Antonio

Maccioni

he ld -b_y w he re Rome 14 May 2015

Big Data Course

email locatedIn topi c whe n

[email protected]

affil iate d is -a

(2)

this talk is about

●

Graph Databases: models, languages, use cases

●

Graph Database Management Systems

●

Graph Processing Systems (intro)

●

Open Problems in Graph Data Management

(3)

scenario

is going towards graph search capabilities (Tao, Unicorn, Open Graph

protocol, etc.)

is working on PGX (Parallel Graph Analytics), has launched

Green-Marl and has implemented an RDF layer on top of

Oracle NoSQL

developing a layer, VERTEXICA, for graph mining on top

of the analytical database

launched the language interface SQL-GR to run graph

analysis on top of its analytical database

has been working on the “Knowledge Graph”

for a while, has created Pregel and has launched Cayley

has just aquired Aurelius TitanDB

has open-sourced its graph database FlockDB

(4)

“Over 25 percent of enterprises will use graph

databases by 2017”

- Enterprise DBMS, Q1 2014. Forrester Research

scenario

“graph databases are catching on commercially”

(5)

nosql scenario

Simple data models

“Graph Databases are an odd fish in the NoSQL pond”

- P.J. Sadalage, M. Fowler - NoSQL Distilled

Simple data models

(6)

But if we want to represent

connections

we may opt for a

graph database management systems:

nosql scenario

(7)

a graph

1

6

3

4

5

2 [1]->2->4->5

[2]->3

[3]->5

[4]->5->6

[5]->6

from\to 1 2 3 4 5 6 1 0 1 0 1 1 0 2 0 0 1 0 0 0 3 0 0 0 0 1 0 4 0 0 0 0 1 1 5 0 0 0 0 0 1 6 0 0 0 0 0 0

(8)

admin

works

belongs

admin

friends

married

belongs

likes

worked

friends

likes

follows

(9)

why graph databases?

●

More natural modeling

●

Manage connections explicitly

(10)

use case 1: semantic web

●

A Web-scale architecture

●

for metadata and data

management (together)

●

for interoperarbility of data and services

●

Web of Data

●

Compatible with other Web

technologies

●

Based a set of W3C standards

(HTTP, IRI/URI,RDF, SPARQL, OWL)

●

Web 3.0

(11)

●

HTTP request of data by URI

●

You can follow links (the edges of the graph)

(12)

Semantic Web + Open Data = Linked Open Data

(13)

(14)

sparql

Which are the names of the people married

with people born in Honolulu?

SELECT ?name

{

?person1 married ?person2 .

?person2 born ?city .

?city name “Honolulu” .

?person1 name ?name1 .

}

?person1

?person2

married

?city

Honolulu name

name

?name1

name

(15)

SELECT ?name

{

?person1 married ?person2 .

?person2 born ?city .

?city name “Honolulu” .

?person1 name ?name1 .

}

sparql

?person1

?person2

married

?city

Honolulu name

name

?name1

Pattern matching

style of querying

p2

n1

c2

p1

c1

name

Michelle

Obama

married

Barack

Obama

name

born

locatedIn

Chicago

Honolulu

USA

locatedIn

name

born

name

(16)

Pattern matching

style of querying

SELECT ?name

{

?person1 married ?person2 .

?person2 born ?city .

?city name “Honolulu” .

?person1 name ?name1 .

}

p2

n1

c2

p1

c1

name

Michelle

Obama

married

Barack

Obama

name

born

locatedIn

Chicago

Honolulu

USA

locatedIn

name

born

sparql

?person1

?person2

married

?city

Honolulu name

name

?name1

name

(17)

triple stores

s

p

o

p1

married

p2

p1

name

Michelle

Obama

p2

born

c2

name

Honolulu

us

name

USA

...

Indexes on:

(s,p,o), (s,o,p), (p,s,o), (p,o,s), (o,s,p), (o,p,s)

(18)

query processing

G(P=name and O=Honolulu) G(P=married) G(P=name) G(P=name)

?person1

?person2

married

?city

Honolulu

name name

?name

?person1 ?person1 ?person2 ?person2 ?city ?city

?person1

?name

name name

?person1

?person2

married

?person2

?city

name

?city

Honolulu

(19)

use case 2: social networks

admin

works

belongs

admin

friends

married

belongs

likes

worked

friends

likes

follows

(20)

Find common friends for every

single profile visit

~ 1.2 billions tuples

* average number of

friends per person

FRIEND 1

FROM_DAY

FRIEND 2

(21)

Find common friends for every

single profile visit

~ 1.2 billions tuples

* average number of

friends per person

FRIEND 1

FROM_DAY

FRIEND 2

(22)

graph database management systems

Three main properties:

1. Property Graph

(as data model)

2. Index-free Adjacency

(as physical

level organization)

(23)

It's a

schema-less

data model

A property graph is a directed multigraph

g =

(N, E)

where every node

n N

∈

and every edge

e E

∈

is associated with a set of pairs

<key,

value>

, called properties.

(24)

We say that a (graph) database

g

satisfies

the index-free adjacency if the existence of an

edge

between two nodes

n

₁

and

n

₂

in

g

can

be tested on those nodes and does not

require to access an external, global, index.

index-free adjacency

GOAL: make the cost of a basic traversal

independent of the size of the database, in case

keeping O(1)

(25)

index-free adjacency

(26)

index-free adjacency

(27)

neo4j physical layer

I. Robinson, J. Webber, E. Eifrem – Graph Databases, 2013.

●

Store files for different parts of the graph

●

Node store

●

Relationship store

●

Property store

●

Each record contains 4 properties

●

The properties of an element may use more records

●

Property's values can be either stored in the property store or stored in a dynamic

(28)

titan, infinitegraph, levelgraph

LevelGraph

Built above the extensible column store

Apache Cassandra

Built above the Object Oriented

Database Objectivity/DB

Built on node.js above the key-value

store LevelDB but pluggable to different

stores

(29)

Server Mode

1. Go to

http://www.neo4j.org/

, download

Neo4J Server and unzip it

2. Run the command

./bin/neo4j start

(use

bin\Neo4j.bat on Windows)

3. Find a graphical dashboard at

http://localhost:7474/

4. You can also use it with REST API:

http://localhost:7474/db/data/

(30)

Embedded mode

1. Import in your java project the library

neo4j-kernel--*-.jar

and its

classes:

2. Create the database:

3. Create nodes and edges:

4. Set the properties:

import org.neo4j.graphdb.*; import org.neo4j.graphdb.factory.GraphDatabaseFactory; GraphDatabaseService gdb = new GraphDatabaseFactory(). newEmbeddedDatabase("/home/..."); Node n1 = gdb.createNode(); Node n2 = gdb.createNode(); Relationship e12 = n1.createRelationshipTo(n2, EdgeType.TYPE); n1.setProperty(“name”, “Rome”); n2.setProperty(“name”, “Italy”); e12.setProperty(“type”, “locatedIn”);

Enum implementing RelationshipType

(31)

BLUEPRINTS

GREMLIN

PIPES

Blueprints is a property graph model interface with provided implementations.

Gremlin is a domain specific language for traversing property graphs

FRAMES

Frames exposes the elements of a Blueprints graph as Java objects: software is written in terms of domain objects and their relationships to each other.

REXTER

FURNACE

Furnace is a property graph algorithms package

http://www.tinkerpop.com/

(32)

building a graph db with blueprints

https://github.com/tinkerpop/blueprints/wiki

1. Import in your java project the libraries

blueprints-core-.*..jar

and blueprints-neo4j-graph-.*..jar

with their classes:

2. Create the database:

3. Create nodes and edges:

4. Set the properties:

import com.tinkerpop.blueprints.*; import com.tinkerpop.blueprints.impls.neo4j.Neo4jGraph; Graph gdb = new Neo4jGraph("/home/..."); Vertex n1 = gdb.addVertex(null); Vertex n2 = gdb.addVertex(null); Edge e12 = gdb.addEdge(null, n1, n2, “locatedIn”); n1.setProperty(“name”, “Rome”); n2.setProperty(“name”, “Italy”); e12.setProperty(“type”, “locatedIn”);

(33)

querying a graph database

●

Gremlin:

●

Imperative query language

●

Descendant of languages such as XPATH

●

Cypher

:

●

Declarative query language

(34)

gremlin> g = new Neo4jGraph("/home/...");

gremlin> g.V.outE.filter{it.edgeid == 'e1'}

==>e[2][1EDGE>4]

==>e[1][1EDGE>3]

==>e[0][1EDGE>2]

==>e[7][6EDGE>2]

(35)

gremlin> g.V.outE.filter{it.edgeid == 'e1'}.inV.outE.

filter{it.edgeid == 'e2'}.inV.nodeid

==>F

==>E

(36)

cypher: a pattern matching QL

START: starting point in the graph

MATCH: the pattern to match, bound to the

starting point

WHERE: filtering criteria

RETURN: what to return

(37)

MATCH: the pattern to match, bound to the

starting point

cypher: a pattern matching QL

node1edge1>node2edge2>node3

node1[?]>node2[?]>node3

node1[*]>node3

Live hands-on a graph database about

beers at

(38)

START n = node(*)

MATCH n[r1:EDGE]>x[r2:EDGE]>m

WHERE (r1.edgeid = 'e1') and (r2.edgeid = 'e2')

RETURN m.nodeid

==>”F”

==>”E”

(39)

●

Secondary

Indexes

: defined on properties

●

Transactions

: graph databases usually support

ACID properties. In Neo4J all operations have to

be performed in a transaction:

●

Other programming language wrappers:

try ( Transaction tx = gdb.beginTx() )

{

tx.success();

}

(40)

graph processing systems

Pregel/Giraph

●

Frameworks to compute (distributed) graph

analysis on large graphs:

●

have similar motivations of Hadoop, Spark, etc.

●

help programmers to focus on the algorithm rather

than on the implementation

●

support different types of graphs

●

provide a variety of algorithms already implemented

(41)

google pregel/apache giraph

●

User specifies a vertex program

●

Computation runs a sequence of supersteps

●

In each superstep the program is executed over all the

vertexes

●

The program can use messages received in a previous

superstep from the neighbors and can send messages

to them for the next superstep

●

A vertex can deactivate itself and the computation halts

when all the vertexes are deactivated.

(42)

●

How to model a graph database?

●

How to migrate data and queries from

existing databases?

●

How to scale queries over large graphs?

(43)

Compact: Sparse: Dense:

Reduces the

number of data

accesses

Can violate

property graph

constraints

Accesses and

updates can be

inefficient

Reduces

number of joins

Needs human

intervention for

a semantic

enrichment

(44)

Orienting the ER:

modeling graph databases

ENTITY 1 ENTITY 2 RELATIONSHIP (0:1) RELATIONSHIP : 0 (0:1) ENTITY 1 ENTITY 2 ENTITY 1 ENTITY 2 RELATIONSHIP (0:N) RELATIONSHIP : 1 (0:1) ENTITY 1 ENTITY 2 ENTITY 1 ENTITY 2 RELATIONSHIP (0:N) RELATIONSHIP : 2 (0:N) ENTITY 1 ENTITY 2

R. De Virgilio, A. Maccioni, R. Torlone – Model-driven design of graph databases, ER International Conference on Conceptual Modeling, 2014.

(45)

Oriented-ER:

modeling graph databases

(46)

Partitioning:

modeling graph databases

Rule 1:

if a node

n

is disconnected

then it forms a group by itself.

Rule 2:

if a node

n

has

w

−

(n)>1

and

w

+

(n)>0

then

n

forms a group by itself.

Rule 3:

if a node

n

has

w

−

_(n)<2

_and

_w

+

_(n)<2

_then

_n

is added to the group of a node

m

such that there

exists the edge

(m, n)

in the O-ER diagram.

(47)

Partitioning:

modeling graph databases

(48)

Graph Database

Template

modeling graph databases

(49)

SQL

select * from T

where T.A1 = v1

R. De Virgilio, A. Maccioni, R. Torlone – R2G: a Tool for Migrating Relations to Graphs – EDBT International Conference on Extending Database Technology, 2014

(50)

R2G: unifiability of data values

• Joinable

tuples t1 R

∈

1 and t2

∈

R2:

–

there is a foreign key constraint between R1.A and R2.B and

t1[A] = t2[B].

• Unifiability

of data values t1[A] and t2[B]:

–

(i) t1=t2 and both A and B do not belong to a multi-attribute

key;

–

(ii) t1 and t2 are joinable and A belongs to a multi-attribute key;

–

(iii) t1 and t2 are joinable, A and B do not belong to a

multi-attribute key and there is no other tuple t3 that is joinable

with t2.

(51)

R2G: schema graph

Full Schema Paths:

FR.fuser US.uid US.uname

→

FR.fuser FR.fblog BG.bid BG.bname

→

FR.fuser FR.fblog BG.bid BG.admin US.uid US.uname

→

...

(52)

R2G: data migration

(53)

R2G: query migration

(54)

scalability over real-world graphs

Graph Databases are hard to scale

(55)

power-law graphs

scale-free graphs

preferential attachment

Graph Databases are hard to scale

10% of the users follow

the same five users

(56)

Replication

Partitioning

Graph Databases are

very hard

to scale

10% of the users follow

the same five users

(57)

real-world graphs

(58)

follows

follow_s

follo ws

real-world graphs

(59)

real-world graphs

(60)

any redundancy?

(61)

high-degree node

low-degree node compressor node

SPARSIFICATION

sparsification

(62)

G

REE

DY

SPA

RSI

FICA

TIO

N

2 B

1 C

A

3

5

4

6

2 B

1 C

A

3

5

4 BC

AB

ABC

6 compression via sparsification

2 B

1 C

A

3

5

4 ABC

6 SP

_ACE-A

WARE

SPAR

SIFICA

_TION

(63)

with

Space-aware

Compressed

Graphs

graph pattern matching

with Greedy

Compressed

Graphs

(64)

●

How to shard/partition a graph database?

●

How to visualize large graphs?

●

Specialized startups are addressing the problem

●

Standardization of a query language

●

Graph processing with GPUs

●

Medusa-gpu, MapGraph

●

What is the best way to implements graph layer on

top of SQL/NoSQL systems?

●

IBM, Oracle, Teradata, HP, ...

(65)

●

Graphs are used in many fields

●

Social Networks, Bioinformatics, Semantic Web, Geo-informatics, ...

●

Graphs are more complicated to manage than other types

of data

●

and we need different considerations

●

When we need to store a big graph

●

we have many options, each one with both advantages and disadvantages

●

Scaling queries over graph databases is still an infant

area of database industry and research

(66)