Big Data Management
and NoSQL Databases
Lecture 8. Graph databases
Doc. RNDr. Irena Holubova, Ph.D.
http://www.ksi.mff.cuni.cz/~holubova/NDBI040/
Graph Databases
Basic Characteristics
To store entities and relationships between these entities
Node is an instance of an object
Nodes have properties
e.g., name
Edges have directional significance
Edges have types
e.g., likes, friend, …
Nodes are organized by relationships
Allow to find interesting patterns
e.g., “Get all nodes employed by Big Co that like NoSQL Distilled”
Graph Databases
RDBMS vs. Graph Databases
When we store a graph-like structure in RDBMS, it is for a single type of relationship
“Who is my manager”
Adding another relationship usually means schema changes, data movement etc.
In graph databases relationships can be dynamically created / deleted
There is no limit for number and kind
In RDBMS we model the graph beforehand based on the Traversal
we want
If the Traversal changes, the data will have to change
We usually need a lot of join operations
In graph databases the relationships are not calculated at query time but persisted
Shift the bulk of the work of navigating the graph to inserts, leaving queries as fast as possible
Graph Databases
Representatives
Graph Databases
Basic Characteristics
Nodes can have different types of relationships between
them
To represent relationships between the domain entities
To have secondary relationships
Category, path, time-trees, quad-trees for spatial indexing, linked
lists for sorted access, …
There is no limit to the number and kind of relationships
a node can have
Relationships have type, start node, end node, own
properties
Example: Neo4J
We have to create a relationship between the nodes in
both directions
Nodes know about INCOMING and OUTGOING relationships
Node martin = graphDb.createNode(); martin.setProperty("name", "Martin"); Node pramod = graphDb.createNode(); pramod.setProperty("name", "Pramod");
martin.createRelationshipTo(pramod, FRIEND); pramod.createRelationshipTo(martin, FRIEND);
Graph Databases
Query
Properties of a node/edge can be indexed
Indexes are queried to find the starting node to begin a
traversal
Transaction transaction = graphDb.beginTx(); try {
Index<Node> nodeIndex = graphDb.index().forNodes("nodes");
nodeIndex.add(martin, "name", martin.getProperty("name"));
nodeIndex.add(pramod, "name", pramod.getProperty("name")); transaction.success(); }
finally {
transaction.finish(); }
Node martin = nodeIndex.get("name", "Martin").getSingle(); allRelationships = martin.getRelationships();
adding nodes
creating index
retrieving a node
Graph Databases
Query –
finding paths
We are interested in determining if there are
multiple paths, finding all of the paths, the
shortest path, …
Node barbara = nodeIndex.get("name", "Barbara").getSingle(); Node jill = nodeIndex.get("name", "Jill").getSingle();
PathFinder<Path> finder1 = GraphAlgoFactory.allPaths(
Traversal.expanderForTypes(FRIEND,Direction.OUTGOING), MAX_DEPTH);
Iterable<Path> paths = finder1.findAllPaths(barbara, jill); PathFinder<Path> finder2 = GraphAlgoFactory.shortestPath(
Traversal.expanderForTypes(FRIEND,Direction.OUTGOING), MAX_DEPTH);
Graph Databases
Suitable Use Cases
Connected Data
Social networks
Any link-rich domain is well suited for graph databases
Routing, Dispatch, and Location-Based Services
Node = location or address that has a delivery Graph = nodes where a delivery has to be made
Relationships = distance
Recommendation Engines
“your friends also bought this product”
Graph Databases
When Not to Use
When we want to update all or a subset of entities
Changing a property on all the nodes is not a straightforward operation
e.g., analytics solution where all entities may need to be updated with a changed property
Some graph databases may be unable to handle lots of
data
Graph Databases
A bit of theory
Data: a set of entities and their relationships
e.g., social networks, travelling routes, …
We need to efficiently represent graphs
Basic operations: finding the neighbours of a node,
checking if two nodes are connected by an edge,
updating the graph structure, …
We need efficient graph operations
G =
(
V, E
)
is commonly modelled as
set of nodes (vertices) V
set of edges E
n = |V|, m = |E|
Adjacency Matrix
Bi-dimensional array
A
of
n
x
n
Boolean values
Indexes of the array = node identifiers of the graph
The Boolean junction
A
ijof the two indices indicates
whether the two nodes are connected
Variants:
Directed graphs
Weighted graphs
Adjacency Matrix
Pros:
Adding/removing edges
Checking if two nodes are
connected
Cons:
Quadratic space with respect to
n
We usually have sparse graphs
lots of 0 values
Addition of nodes is expensive
Retrieval of all the neighbouring
nodes takes linear time with
Adjacency List
A set of lists where each accounts for the
neighbours of one node
A vector of
n
pointers to adjacency lists
Undirected graph:
An edge connects nodes
i
and
j
=> the list of
neighbours of
i
contains the node
j
and vice versa
Often compressed
Exploitation of regularities in graphs, difference from
other nodes, …
Adjacency List
Pros:
Obtaining the neighbours of a
node
Cheap addition of nodes to the
structure
More compact representation of
sparse matrices
Cons:
Checking if there is an edge
between two nodes
Optimization: sorted lists =>
logarithmic scan, but also logarithmic insertion
Incidence Matrix
Bi-dimensional Boolean matrix of
n
rows
and
m
columns
A column represents an edge
Nodes that are connected by a certain edge
A row represents a node
Incidence Matrix
pros:
For representing
hypergraphs, where one
edge connects an arbitrary
number of nodes
Cons:
Laplacian Matrix
Bi-dimensional array of
n
x
n
integers
Diagonal of the Laplacian matrix indicates the
degree of the node
The rest of positions are set to
-1
if the two
Laplacian Matrix
Pros:
Allows analyzing the graph
structure by means of
spectral analysis
Graph Traversals
Single Step
Single step traversal
from element
i
to element
j
, where
i
,
j
(
V
E
)
Expose explicit adjacencies in the graph
eout : traverse to the outgoing edges of the vertices
ein : traverse to the incoming edges of the vertices
vout : traverse to the outgoing vertices of the edges
vin : traverse to the incoming vertices of the edges
elab : allow (or filter) all edges with the label
: get element property values for key r
ep : allow (or filter) all elements with the property s for key r
Graph Traversals
Composition
Single step traversals can compose
complex
traversals
of arbitrary length
e.g., find all friends of Alberto
„Traverse to the outgoing edges of vertex
i
(representing Alberto), then only allow those edges
with the label
friend
, then traverse to the incoming
(i.e. head) vertices on those
friend
-labeled edges.
Finally, of those vertices, return their
name
property.“
Improving Data Locality
Idea: take into account computer architecture in the data
structures to reach a good performance
The way data is laid out physically in memory determines the locality to be obtained
Spatial locality = once a certain data item has been accessed, the nearby data items are likely to be accessed in the following computations
e.g., graph traversal
Strategy: in graph adjacency matrix representation,
exchange rows and columns to improve the cache hit
ratio
Breadth First Search Layout
(BFSL)
Trivial algorithm
Input: sequence of vertices of a graph
Output: a permutation of the vertices which obtains
better cache performance for graph traversals
BFSL algorithm
:
1. Selects a node (at random) that is the origin of the traversal 2. Traverses the graph following a breadth first search algorithm,
generating a list of vertex identifiers in the order they are visited
3. Takes the generated list and assigns the node identifiers
sequentially
Pros: optimal when starting from the selected node
Bandwidth of a Matrix
Graphs
matrices
Locality problem = minimum bandwidth problem
Bandwidth of a row in a matrix = the maximum distance between nonzero elements, with the condition that one is on the left of the diagonal and the other on the right of the diagonal
Bandwidth of a matrix = maximum of the bandwidth of its rows
Matrices with low bandwidths are more cache friendly
Non zero elements (edges) are clustered across the diagonal
Bandwidth minimization problem
(BMP) is NP hard
Cuthill-McKee
(1969)
Popular bandwidth minimization
technique for sparse matrices
Re-labels the vertices of a matrix according to a
sequence, with the aim of a heuristically guided
traversal
Algorithm:
1. Node with the first identifier (where the traversal starts) is the
node with the smallest degree in the whole graph
2. Other nodes are labeled sequentially as they are visited by BFS
traversal
In addition, the heuristic prefers those nodes that have the
Graph Partitioning
Some graphs are
too large
to be fully loaded into
the main memory of a single computer
Usage of secondary storage degrades the
performance of graph applications
Scalable solution distributes the graph on multiple
computers
We need to partition the graph reasonably
Usually for particular (set of) operation(s)
The shortest path, finding frequent patterns, BFS,
spanning tree search, …
One and Two Dimensional
Graph Partitioning
Aim: partitioning the graph to solve BFS more
efficiently
Distributed into shared-nothing parallel system
Partitioning of the adjacency matrix
1D partitioning
Matrix rows are randomly assigned to the
P
nodes
(processors) in the system
Each vertex and the edges emanating from it are
owned by one processor
One and Two Dimensional
Graph Partitioning
BSF with 1D partitioning
Input: starting node s having level 0
Output: every vertex v becomes labeled with its level, denoting its
distance from the starting node
1. Each processor has a set of frontier vertices F
At the beginning it is node s where the BFS starts
2. The edge lists of the vertices in F are merged to form a set of
neighbouring vertices N
Some owned by the current processor, some by others
3. Messages are sent to all other processors to (potentially) add
these vertices to their frontier set F for the next level
A processor may have marked some vertices in a previous
One and Two Dimensional
Graph Partitioning
2D partitioning
Processors are logically arranged in an R x C processor mesh
Adjacency matrix is divided C block columns and R x C block rows
Each processor owns C blocks
Note: 1D partitioning = 2D partitioning with
C = 1
(or
R = 1
)
Consequence: each node communicates with at most
R +
C
nodes instead of all
P
nodes
In step 2 a message is sent to all processors in the same row
= block owned by processor (i,j)
Partitioning of vertices:
Processor (i, j) owns vertices corresponding to block row
Types of Graphs
Single-relational
Edges are homogeneous in meaning
e.g., all edges represent friendship
Multi-relational (property) graphs
Edges are typed or labeled
e.g., friendship, business, communication
Vertices and edges in a property graph maintain a set
of key/value pairs
Representation of non-graphical data (properties) e.g., name of a vertex, the weight of an edge
Graph Databases
A graph database = a set of graphs
Types of graphs:
Directed-labeled graphs
e.g., XML, RDF, traffic networks
Undirected-labeled graphs
e.g., social networks, chemical compounds
Types of graph databases:
Non-transactional = few numbers of very large graphs
e.g., Web graph, social networks, …
Transactional = large set of small graphs
e.g., chemical compounds, biological pathways, linguistic trees each
Transactional Graph Databases
Types of Queries
Sub-graph queries
Searches for a specific pattern in the graph database
A small graph or a graph, where some parts are uncertain
e.g., vertices with wildcard labels
More general type: sub-graph isomorphism
Super-graph queries
Searches for the graph database members of which their whole structures are contained in the input query
Similarity (approximate matching) queries
Finds graphs which are similar, but not necessarily isomorphic to a given query graph
sub-graph: q1: g1, g2 q2: super-graph: q1: q2: g3
Sub-graph
Query Processing
Non Mining-Based
Graph Indexing Techniques
Focus on
indexing whole constructs
of the graph
database
Instead of indexing only some selected features
Cons:
Can be less effective in their pruning (filtering) power
May need to conduct expensive structure comparisons in the filtering process
Pros:
Can handle graph updates with less cost
Do not rely on the effectiveness of the selected features Do not need to rebuild whole indexes
Non Mining-Based Techniques
GraphGrep
(2002)
For each graph, it enumerates
all paths up to a certain
maximum length
and records the number of
occurrences of each path
Path index table:
Row = path
Column = graph
Entry in the table = number of occurrences of the path in the graph
Query processing:
1. Path index is used to find candidate graphs G1, … GN which
1. contain paths in the query q and
2. counts of such paths are beyond threshold specified in query q
2. Each G1, … GN is examined by sub-graph isomorphism to
obtain the final results
candidate search
Non Mining-Based Techniques
GraphGrep
Pros:
The indexing process of paths with limited lengths is
usually fast
Cons:
The size of the indexed paths could drastically
increase with the size of graph database
The filtering power of paths is limited
Verification cost can be very high due to the large size of the
Non Mining-Based Techniques
GString
(2007)
Considers the semantics of the graph
Models graph objects in the context of organic chemistry
using basic structures
Line = series of vertices connected end to end
Cycle = series of vertices that form a closed loop
Star = core vertex directly connected to several vertexes
Order of identification of objects: cycles, stars, lines
Graphs and queries are represented as string
sequences
Sub-graph search problem = subsequence string-matching domain
Suffix tree-based index structure for the string representations is created
Non Mining-Based Techniques
GString
Basic structure
has:
Type Size Line/Cycle = number of vertices Star = fan-out ofthe central vertex
Edits = set of annotations
Non Mining-Based Techniques
GString
Cons:
Converting can be inefficient for large graphs
Suitable for, e.g., chemical compounds but not in
general
Sub-graph Query Processing
Mining-Based
Graph Indexing Techniques
Idea: if features of query graph q do not exist in data graph G,
then G cannot contain q as its sub-graph
Graph-mining methods extract selected features (sub-structures) from the graph database members
An inverted index is created for each feature
Answering a sub-graph query q:
1. Identifying the set of features of q
2. Using the inverted index to retrieve all graphs that contain the same
features of q
Cons:
Effectiveness depends on the quality of mining techniques to effectively identify the set of features
Quality of the selected features may degrade over time (after lots of insertions and deletions)
Mining-Based Techniques
GIndex
(2004)
Indexing unit = sub-graphs (of labeled undirected graph)
Pros: improves query performance (higher filtering power)
Cons: number of graph structures can be large
GIndex considers only frequent discriminative sub-graphs
Frequent: its support > threshold
Support threshold is progressively increased as the graph grows
Discriminative (distinctive): presence cannot be predicted by the set of all their frequently occurring subgraphs
Among similar fragments with the same support only the smallest common
fragment is indexed
Query processing: given a query graph q
1. If q is a frequent sub-graph, the exact set of query answers containing
q can be retrieved directly
q is indexed
2. Otherwise, q probably has a frequent sub-graph f whose support
maybe close to the support threshold
q is infrequent => the sub-graph is only contained in a small number of
(1) Carbon chains up-to length 3 are present
path-based approach cannot prune (a) and (b)
Motivation from indexing text: Vocabulary size is small index phrases rather than individual words
Mining-Based Techniques
TreePI
(2007)
Indexing unit = frequent sub-trees Observations:
Tree data structures are more complex patterns than paths
Trees can preserve almost equivalent amount of structural information as arbitrary sub-graph patterns
The frequent sub-tree mining process is easier than general frequent sub-graph mining process.
Mining:
1. Mining a frequent tree on the graph database
2. Selecting a set of frequent trees as index patterns
Query processing: given a query graph q
1. Frequent sub-trees in q are identified and matched with the set of
indexing features to obtain a candidate set
Super-graph
Query Processing
cIndex
(2007)
Idea: If a feature
f
(sub-graph) is not in
q
, then the
graphs having
f
are pruned
The saving will be significant when
f
is a sub-graph of
many graphs in the database.
Indexing unit = contrast sub-graphs (features) = sub-graphs contained by a lot of graphs in database, but unlikely contained by a query graph
Extracted from the database based on their rare occurrence in historical query graphs
Pros: index is small
Cons: query logs may frequently change
index maybe
outdated often
Needs to be recomputed to stay effective
graph database: features: graph-feature matrix: query graph:
- ga and gb can be pruned based on f2
Super-graph Query Processing
GPTree
(2009)
Similar idea, different approach to identification of
frequent sub-graphs
Based on approach for compact organization of graph
database members (GPTree)
All of the graph database members are stored into one graph
Common sub-graphs are stored only once
Complexity:
Optimal construction is NP-complete => approximation algorithm
Indexing unit =
frequent sub-graphs
Determination is based on the containment feature from the representation
Graph Similarity Queries
Find sub-graphs in the database that are
similar
to query
q
Allows for node mismatches, node gaps, structural
differences, …
Usage: when graph databases are noisy or
incomplete
Approximate graph matching query-processing
techniques can be more useful and effective than
exact matching
Graph Similarity Queries
Grafil
(2005)
Feature-based structural filtering algorithm
Models each query graph as a set of features
Feature = can be any structure indexed in a graph database
e.g., elementary structures, paths, frequent structures, …
Feature-graph matrix
– to compute the difference in the
number of features between
q
and graphs in the
database
Column = graph in the graph database
Row = feature being indexed
query graph q: features: feature-graph matrix: edge-feature matrix:
for some graphs Gi in the database
for q and features fa, fb, fc
Example: no match for q in our database which has 7 embeddings of fa, fb,
fc => we relax one edge (e1, e2, or e3), we still have at least 3 embeddings (we may miss at most 4 embeddings) we can discard graphs that do not involve at least 3 embeddings
Graph Similarity Queries
Grafil
Query processing:
1. Feature-graph matrix is used to calculate the difference in
number of features between each graph database member and q
2. If the difference is greater than a threshold T, then it is discarded
The remaining graphs constitute a candidate answer set
3. Substructure similarity (verification) is calculated for each
candidate to prune the false positives candidates
Many existing approaches – not considered in Grafil
Extension:
Edge-feature matrix
– used to compute the
maximum allowed feature misses (
T
) based on a query
relaxation ratio (
r
)
Row = an edge
Column = embedding of a feature
Graph Similarity Queries
Grafil
Graph relaxation = allowing
feature misses
=
approximation
Feature miss in q = edge deletion/relabeling
Inspiration: n-grams in approximate string matching
Feature miss estimation problem: “Given a query graph
q
and a set of features contained in
q
, if the relaxation
ratio is
r
, what is the maximum number of features that
can be missed (
T
)?”
i.e., the maximum number of columns that can be hit by k = r.|G| rows in the edge-feature matrix
Maximum coverage problem NP-complete
Approximated by a greedy algorithm
Percentage of edges we relax
Graph Similarity Queries
Closure Tree (CTree)
(2006)
For both similarity and sub-graph queries Similar to the R-tree indexing mechanism
Extended to support graph-matching queries
Each node in the tree contains information about its descendants
To facilitate effective pruning (filtering)
Closure of a set of vertices = generalized vertex whose attribute is
the union of the attribute values of the vertices
Closure of a set of edges = generalized edge whose attribute is the union of the attribute values of the edges
Closure of two graphs g1 and g2 under a mapping M = generalized
graph (V, E)
V is the set of vertex closures of the corresponding vertices
Graph Similarity Queries
CTree
CTree:
Each node is a graph closure of its children
The children of an internal node are nodes
The children of a leaf node are database graphs
Each node has at least m children unless it is root, m ≥ 2
Each node has at most M children, (M+1)/2 ≥ m
Insertion/splitting/deletion take polynomial time
Query evaluation:
1. Traverse the CTree, pruning (filtering) nodes based on pseudo
sub-graph isomorphism
A candidate answer set is returned
Graph Query Languages
Idea: need for a suitable language to query and
manipulate graph data structures
Some common standard
Like SQL, XQuery, OQL, …
Classification:
General
Special-purpose (= special types of graphs)
PQL
= Pathway Query Language
(2005)
Special purpose graph query language for biological
networks
Targets retrieving specific parts of large, complex networks
Declarative
Syntax is similar to SQL
Queries match arbitrary sub-graphs in the database
based on node properties and paths between nodes
SELECT sub-graph-specification FROM node-variables
PQL
Data model:
Set of nodes and
directed edges
A node is either an
interaction
or a
molecule
The graphs need not
be connected
PQL
Basic Structures
Constants (in double quotes “…”)
Operators – prefix, binary infix, and function calls with arguments in
parentheses
and, union, =, <>, like, …
Logical quantifiers (for all, exists)
SELECT * FROM A, B
WHERE A.name = “3-Isopropylmalate” AND B.name = “EC1.1.1.85”
SELECT * FROM B, C, D
WHERE D.name = “L-Lactaldehyde” AND B ISA “Enzyme” AND
B[-*]C[-*]D AND
C.name = “Lactaldehyde” a path exists
has this type
PQL
Query Evaluation
Node variables are bound to nodes of the database
graph s.t. all node-conditions in the WHERE clause
evaluate to TRUE
All possible assignments of the variable to nodes of the graph are determined
Node variables are equally assigned to molecules and interactions
Query result is constructed from these variable bindings
according to the sub-graph specification of the SELECT
clause
SQL works on tables and returns tables
PQL works
GraphQL
(2008)
General graph query and manipulation language
Supports arbitrary attributes on nodes, edges, and graphs
Represented as a tuple
Graphs are considered as the basic unit of information
Query manipulates one or more collections of graphs
Graph pattern
= graph motif and a predicate on attributes
of the graph
Simple vs. complex graph motifs
Concatenation, disjunction, repetition
Predicate = combination of Boolean or arithmetic comparison expressions
edges are unified if their nodes are unified
examples of repetition (Kleene star)
simple path = edge
recursion = path itself + new edge
declare a new node and unify with a nested one
a root node v0
arbitrary repetition as a child node of v0
predicate
FLWR expression
pairs of authors
co-authorship graph
extract pairs of co-authors construct the graph
Other Graph Query Languages
GraphLog
(1990) – graphical query language for graph
databases
GOQL
(1999) and
GOOD
(1994) – extensions of OQL
Object-oriented graph data model
BPMN-Q
(2007) – determining whether a given business
process model graph is structurally similar to a query
graph
Nodes, edges, wildcards, paths, negative paths
SPARQL
(2008) – W3C recommendation for querying
RDF graph data
Describes a directed labeled graph by a set of triples
References
Pramod J. Sadalage - Martin Fowler:
NoSQL
Distilled: A Brief Guide to the Emerging
World of Polyglot Persistence
Eric Redmond - Jim R. Wilson:
Seven
Databases in Seven Weeks: A Guide to
Modern Databases and the NoSQL Movement