Big Graph Data Management
Antonio
Maccioni
he ld -by w he re Rome 14 May 2015Big Data Course
email locatedIn topi c whe n
[email protected]
affil iate d is -athis talk is about
●
Graph Databases: models, languages, use cases
●Graph Database Management Systems
●
Graph Processing Systems (intro)
●
Open Problems in Graph Data Management
scenario
is going towards graph search capabilities (Tao, Unicorn, Open Graph
protocol, etc.)
is working on PGX (Parallel Graph Analytics), has launched
Green-Marl and has implemented an RDF layer on top of
Oracle NoSQL
developing a layer, VERTEXICA, for graph mining on top
of the analytical database
launched the language interface SQL-GR to run graph
analysis on top of its analytical database
has been working on the “Knowledge Graph”
for a while, has created Pregel and has launched Cayley
has just aquired Aurelius TitanDB
has open-sourced its graph database FlockDB
“Over 25 percent of enterprises will use graph
databases by 2017”
- Enterprise DBMS, Q1 2014. Forrester Research
scenario
“graph databases are catching on commercially”
nosql scenario
Simple data models
“Graph Databases are an odd fish in the NoSQL pond”
- P.J. Sadalage, M. Fowler - NoSQL Distilled
Simple data models
But if we want to represent
connections
we may opt for a
graph database management systems:
nosql scenario
a graph
1
6
3
4
5
2
[1]->2->4->5
[2]->3
[3]->5
[4]->5->6
[5]->6
from\to 1 2 3 4 5 6 1 0 1 0 1 1 0 2 0 0 1 0 0 0 3 0 0 0 0 1 0 4 0 0 0 0 1 1 5 0 0 0 0 0 1 6 0 0 0 0 0 0admin
works
belongs
belongs
admin
friends
married
belongs
belongs
likes
worked
friends
likes
follows
why graph databases?
●
More natural modeling
●
Manage connections explicitly
use case 1: semantic web
●
A Web-scale architecture
●
for metadata and data
management (together)
●
for interoperarbility of data and services
●
Web of Data
●
Compatible with other Web
technologies
●
Based a set of W3C standards
(HTTP, IRI/URI,RDF, SPARQL, OWL)
●
Web 3.0
●
HTTP request of data by URI
●
You can follow links (the edges of the graph)
Semantic Web + Open Data = Linked Open Data
sparql
Which are the names of the people married
with people born in Honolulu?
SELECT ?name
{
?person1 married ?person2 .
?person2 born ?city .
?city name “Honolulu” .
?person1 name ?name1 .
}
?person1
?person2
married
?city
Honolulu name
name
?name1
name
SELECT ?name
{
?person1 married ?person2 .
?person2 born ?city .
?city name “Honolulu” .
?person1 name ?name1 .
}
sparql
?person1
?person2
married
?city
Honolulu name
name
?name1
Pattern matching
style of querying
p2
n1
c2
p1
c1
name
Michelle
Obama
married
Barack
Obama
name
born
locatedIn
Chicago
Honolulu
USA
locatedIn
name
name
name
born
name
Pattern matching
style of querying
SELECT ?name
{
?person1 married ?person2 .
?person2 born ?city .
?city name “Honolulu” .
?person1 name ?name1 .
}
p2
n1
c2
p1
c1
name
Michelle
Obama
married
Barack
Obama
name
born
locatedIn
Chicago
Honolulu
USA
locatedIn
name
name
name
born
sparql
?person1
?person2
married
?city
Honolulu name
name
?name1
name
triple stores
s
p
o
p1
married
p2
p1
name
Michelle
Obama
p2
born
c2
c2
name
Honolulu
us
name
USA
...
...
...
Indexes on:
(s,p,o), (s,o,p), (p,s,o), (p,o,s), (o,s,p), (o,p,s)
query processing
G(P=name and O=Honolulu) G(P=married) G(P=name) G(P=name)?person1
?person2
married?city
Honolulu
name name?name
?person1 ?person1 ?person2 ?person2 ?city ?city?person1
?name
name name?person1
?person2
married?person2
?city
name?city
Honolulu
use case 2: social networks
admin
works
belongs
belongs
admin
friends
married
belongs
belongs
likes
worked
friends
likes
follows
Find common friends for every
single profile visit
~ 1.2 billions tuples
* average number of
friends per person
FRIEND 1
FROM_DAY
FRIEND 2
Find common friends for every
single profile visit
~ 1.2 billions tuples
* average number of
friends per person
FRIEND 1
FROM_DAY
FRIEND 2
graph database management systems
Three main properties:
1.
Property Graph
(as data model)
2.
Index-free Adjacency
(as physical
level organization)
It's a
schema-less
data model
A property graph is a directed multigraph
g =
(N, E)
where every node
n N
∈
and every edge
e E
∈
is associated with a set of pairs
<key,
value>
, called properties.
We say that a (graph) database
g
satisfies
the index-free adjacency if the existence of an
edge
between two nodes
n
1
and
n
2
in
g
can
be tested on those nodes and does not
require to access an external, global, index.
index-free adjacency
GOAL: make the cost of a basic traversal
independent of the size of the database, in case
keeping O(1)
index-free adjacency
index-free adjacency
neo4j physical layer
I. Robinson, J. Webber, E. Eifrem – Graph Databases, 2013.
●
Store files for different parts of the graph
●Node store
●
Relationship store
●
Property store
●
Each record contains 4 properties
●
The properties of an element may use more records
●
Property's values can be either stored in the property store or stored in a dynamic
titan, infinitegraph, levelgraph
LevelGraph
Built above the extensible column store
Apache Cassandra
Built above the Object Oriented
Database Objectivity/DB
Built on node.js above the key-value
store LevelDB but pluggable to different
stores
Server Mode
1. Go to
http://www.neo4j.org/
, download
Neo4J Server and unzip it
2. Run the command
./bin/neo4j start
(use
bin\Neo4j.bat on Windows)
3. Find a graphical dashboard at
http://localhost:7474/
4. You can also use it with REST API:
http://localhost:7474/db/data/
Embedded mode
1. Import in your java project the library
neo4j-kernel-*-*-*.jar
and its
classes:
2. Create the database:
3. Create nodes and edges:
4. Set the properties:
import org.neo4j.graphdb.*; import org.neo4j.graphdb.factory.GraphDatabaseFactory; GraphDatabaseService gdb = new GraphDatabaseFactory(). newEmbeddedDatabase("/home/..."); Node n1 = gdb.createNode(); Node n2 = gdb.createNode(); Relationship e12 = n1.createRelationshipTo(n2, EdgeType.TYPE); n1.setProperty(“name”, “Rome”); n2.setProperty(“name”, “Italy”); e12.setProperty(“type”, “locatedIn”);
Enum implementing RelationshipType
BLUEPRINTS
GREMLIN
PIPES
Blueprints is a property graph model interface with provided implementations.
Gremlin is a domain specific language for traversing property graphs
FRAMES
Frames exposes the elements of a Blueprints graph as Java objects: software is written in terms of domain objects and their relationships to each other.
REXTER
FURNACE
Furnace is a property graph algorithms package
http://www.tinkerpop.com/
building a graph db with blueprints
https://github.com/tinkerpop/blueprints/wiki
1. Import in your java project the libraries
blueprints-core-*.*.*.jar
and blueprints-neo4j-graph-*.*.*.jar
with their classes:
2. Create the database:
3. Create nodes and edges:
4. Set the properties:
import com.tinkerpop.blueprints.*; import com.tinkerpop.blueprints.impls.neo4j.Neo4jGraph; Graph gdb = new Neo4jGraph("/home/..."); Vertex n1 = gdb.addVertex(null); Vertex n2 = gdb.addVertex(null); Edge e12 = gdb.addEdge(null, n1, n2, “locatedIn”); n1.setProperty(“name”, “Rome”); n2.setProperty(“name”, “Italy”); e12.setProperty(“type”, “locatedIn”);
querying a graph database
●
Gremlin:
●
Imperative query language
●
Descendant of languages such as XPATH
●
Cypher
:
●
Declarative query language
gremlin> g = new Neo4jGraph("/home/...");
gremlin> g.V.outE.filter{it.edgeid == 'e1'}
==>e[2][1EDGE>4]
==>e[1][1EDGE>3]
==>e[0][1EDGE>2]
==>e[7][6EDGE>2]
gremlin> g.V.outE.filter{it.edgeid == 'e1'}.inV.outE.
filter{it.edgeid == 'e2'}.inV.nodeid
==>F
==>E
==>E
cypher: a pattern matching QL
START: starting point in the graph
MATCH: the pattern to match, bound to the
starting point
WHERE: filtering criteria
RETURN: what to return
MATCH: the pattern to match, bound to the
starting point
cypher: a pattern matching QL
node1edge1>node2edge2>node3
node1[?]>node2[?]>node3
node1[*]>node3
Live hands-on a graph database about
beers at
START n = node(*)
MATCH n[r1:EDGE]>x[r2:EDGE]>m
WHERE (r1.edgeid = 'e1') and (r2.edgeid = 'e2')
RETURN m.nodeid
==>”F”
==>”E”
==>”E”
●
Secondary
Indexes
: defined on properties
●
Transactions
: graph databases usually support
ACID properties. In Neo4J all operations have to
be performed in a transaction:
●
Other programming language wrappers:
try ( Transaction tx = gdb.beginTx() )
{
tx.success();
}
graph processing systems
Pregel/Giraph
●
Frameworks to compute (distributed) graph
analysis on large graphs:
●
have similar motivations of Hadoop, Spark, etc.
●
help programmers to focus on the algorithm rather
than on the implementation
●
support different types of graphs
●
provide a variety of algorithms already implemented
google pregel/apache giraph
●
User specifies a vertex program
●
Computation runs a sequence of supersteps
●
In each superstep the program is executed over all the
vertexes
●
The program can use messages received in a previous
superstep from the neighbors and can send messages
to them for the next superstep
●
A vertex can deactivate itself and the computation halts
when all the vertexes are deactivated.
●
How to model a graph database?
●
How to migrate data and queries from
existing databases?
●
How to scale queries over large graphs?
Compact: Sparse: Dense:
Reduces the
number of data
accesses
Can violate
property graph
constraints
Accesses and
updates can be
inefficient
Reduces
number of joins
Needs human
intervention for
a semantic
enrichment
Orienting the ER:
modeling graph databases
ENTITY 1 ENTITY 2 RELATIONSHIP (0:1) RELATIONSHIP : 0 (0:1) ENTITY 1 ENTITY 2 ENTITY 1 ENTITY 2 RELATIONSHIP (0:N) RELATIONSHIP : 1 (0:1) ENTITY 1 ENTITY 2 ENTITY 1 ENTITY 2 RELATIONSHIP (0:N) RELATIONSHIP : 2 (0:N) ENTITY 1 ENTITY 2
R. De Virgilio, A. Maccioni, R. Torlone – Model-driven design of graph databases, ER International Conference on Conceptual Modeling, 2014.
Oriented-ER:
modeling graph databases
R. De Virgilio, A. Maccioni, R. Torlone – Model-driven design of graph databases, ER International Conference on Conceptual Modeling, 2014.
Partitioning:
modeling graph databases
Rule 1:
if a node
n
is disconnected
then it forms a group by itself.
Rule 2:
if a node
n
has
w
−(n)>1
and
w
+(n)>0
then
n
forms a group by itself.
Rule 3:
if a node
n
has
w
−(n)<2
and
w
+(n)<2
then
n
is added to the group of a node
m
such that there
exists the edge
(m, n)
in the O-ER diagram.
R. De Virgilio, A. Maccioni, R. Torlone – Model-driven design of graph databases, ER International Conference on Conceptual Modeling, 2014.
Partitioning:
modeling graph databases
R. De Virgilio, A. Maccioni, R. Torlone – Model-driven design of graph databases, ER International Conference on Conceptual Modeling, 2014.
Graph Database
Template
modeling graph databases
R. De Virgilio, A. Maccioni, R. Torlone – Model-driven design of graph databases, ER International Conference on Conceptual Modeling, 2014.
SQL
select * from Twhere T.A1 = v1
R. De Virgilio, A. Maccioni, R. Torlone – R2G: a Tool for Migrating Relations to Graphs – EDBT International Conference on Extending Database Technology, 2014
R2G: unifiability of data values
•
Joinable
tuples t1 R
∈
1 and t2
∈
R2:
–
there is a foreign key constraint between R1.A and R2.B and
t1[A] = t2[B].
•
Unifiability
of data values t1[A] and t2[B]:
–
(i) t1=t2 and both A and B do not belong to a multi-attribute
key;
–
(ii) t1 and t2 are joinable and A belongs to a multi-attribute key;
–
(iii) t1 and t2 are joinable, A and B do not belong to a
multi-attribute key and there is no other tuple t3 that is joinable
with t2.
R. De Virgilio, A. Maccioni, R. Torlone – R2G: a Tool for Migrating Relations to Graphs – EDBT International Conference on Extending Database Technology, 2014
R2G: schema graph
Full Schema Paths:
FR.fuser US.uid US.uname
→
→
FR.fuser FR.fblog BG.bid BG.bname
→
→
→
FR.fuser FR.fblog BG.bid BG.admin US.uid US.uname
→
→
→
→
→
...
R. De Virgilio, A. Maccioni, R. Torlone – R2G: a Tool for Migrating Relations to Graphs – EDBT International Conference on Extending Database Technology, 2014
R2G: data migration
R. De Virgilio, A. Maccioni, R. Torlone – R2G: a Tool for Migrating Relations to Graphs – EDBT International Conference on Extending Database Technology, 2014
R2G: query migration
R. De Virgilio, A. Maccioni, R. Torlone – R2G: a Tool for Migrating Relations to Graphs – EDBT International Conference on Extending Database Technology, 2014
scalability over real-world graphs
Graph Databases are hard to scale
power-law graphs
scale-free graphs
preferential attachment
Graph Databases are hard to scale
10% of the users follow
the same five users
Replication
Partitioning
Graph Databases are
very hard
to scale
10% of the users follow
the same five users
real-world graphs
follows
follows
follo ws
real-world graphs
real-world graphs
any redundancy?
high-degree node
low-degree node compressor node
SPARSIFICATION
sparsification
G
REE
DY
SPA
RSI
FICA
TIO
N
2
B
1
C
A
3
5
4
6
2
B
1
C
A
3
5
4
BC
AB
ABC
6
compression via sparsification
2
B
1
C
A
3
5
4
ABC
6
SP
ACE-A
WARE
SPAR
SIFICA
TION
with
Space-aware
Compressed
Graphs
graph pattern matching
with Greedy
Compressed
Graphs
●
How to shard/partition a graph database?
●How to visualize large graphs?
●
Specialized startups are addressing the problem
●
Standardization of a query language
●Graph processing with GPUs
●
Medusa-gpu, MapGraph
●
What is the best way to implements graph layer on
top of SQL/NoSQL systems?
●
IBM, Oracle, Teradata, HP, ...
●
Graphs are used in many fields
●
Social Networks, Bioinformatics, Semantic Web, Geo-informatics, ...
●
Graphs are more complicated to manage than other types
of data
●
and we need different considerations
●
When we need to store a big graph
●
we have many options, each one with both advantages and disadvantages
●