• No results found

Big Data Management and NoSQL Databases

N/A
N/A
Protected

Academic year: 2021

Share "Big Data Management and NoSQL Databases"

Copied!
79
0
0

Loading.... (view fulltext now)

Full text

(1)

Big Data Management

and NoSQL Databases

Lecture 8. Graph databases

Doc. RNDr. Irena Holubova, Ph.D.

[email protected]

http://www.ksi.mff.cuni.cz/~holubova/NDBI040/

(2)

Graph Databases

Basic Characteristics

To store entities and relationships between these entities

 Node is an instance of an object

 Nodes have properties

 e.g., name

 Edges have directional significance

 Edges have types

 e.g., likes, friend, …

Nodes are organized by relationships

 Allow to find interesting patterns

 e.g., “Get all nodes employed by Big Co that like NoSQL Distilled”

(3)
(4)

Graph Databases

RDBMS vs. Graph Databases

 When we store a graph-like structure in RDBMS, it is for a single type of relationship

 “Who is my manager”

 Adding another relationship usually means schema changes, data movement etc.

 In graph databases relationships can be dynamically created / deleted

 There is no limit for number and kind

 In RDBMS we model the graph beforehand based on the Traversal

we want

 If the Traversal changes, the data will have to change

 We usually need a lot of join operations

 In graph databases the relationships are not calculated at query time but persisted

 Shift the bulk of the work of navigating the graph to inserts, leaving queries as fast as possible

(5)

Graph Databases

Representatives

(6)

Graph Databases

Basic Characteristics

Nodes can have different types of relationships between

them

 To represent relationships between the domain entities

 To have secondary relationships

 Category, path, time-trees, quad-trees for spatial indexing, linked

lists for sorted access, …

There is no limit to the number and kind of relationships

a node can have

Relationships have type, start node, end node, own

properties

(7)
(8)

Example: Neo4J

We have to create a relationship between the nodes in

both directions

 Nodes know about INCOMING and OUTGOING relationships

Node martin = graphDb.createNode(); martin.setProperty("name", "Martin"); Node pramod = graphDb.createNode(); pramod.setProperty("name", "Pramod");

martin.createRelationshipTo(pramod, FRIEND); pramod.createRelationshipTo(martin, FRIEND);

(9)

Graph Databases

Query

Properties of a node/edge can be indexed

Indexes are queried to find the starting node to begin a

traversal

Transaction transaction = graphDb.beginTx(); try {

Index<Node> nodeIndex = graphDb.index().forNodes("nodes");

nodeIndex.add(martin, "name", martin.getProperty("name"));

nodeIndex.add(pramod, "name", pramod.getProperty("name")); transaction.success(); }

finally {

transaction.finish(); }

Node martin = nodeIndex.get("name", "Martin").getSingle(); allRelationships = martin.getRelationships();

adding nodes

creating index

retrieving a node

(10)

Graph Databases

Query –

finding paths

We are interested in determining if there are

multiple paths, finding all of the paths, the

shortest path, …

Node barbara = nodeIndex.get("name", "Barbara").getSingle(); Node jill = nodeIndex.get("name", "Jill").getSingle();

PathFinder<Path> finder1 = GraphAlgoFactory.allPaths(

Traversal.expanderForTypes(FRIEND,Direction.OUTGOING), MAX_DEPTH);

Iterable<Path> paths = finder1.findAllPaths(barbara, jill); PathFinder<Path> finder2 = GraphAlgoFactory.shortestPath(

Traversal.expanderForTypes(FRIEND,Direction.OUTGOING), MAX_DEPTH);

(11)

Graph Databases

Suitable Use Cases

Connected Data

Social networks

 Any link-rich domain is well suited for graph databases

Routing, Dispatch, and Location-Based Services

Node = location or address that has a delivery  Graph = nodes where a delivery has to be made

Relationships = distance

Recommendation Engines

 “your friends also bought this product”

(12)

Graph Databases

When Not to Use

When we want to update all or a subset of entities

 Changing a property on all the nodes is not a straightforward operation

 e.g., analytics solution where all entities may need to be updated with a changed property

Some graph databases may be unable to handle lots of

data

(13)

Graph Databases

A bit of theory

Data: a set of entities and their relationships

 e.g., social networks, travelling routes, …

 We need to efficiently represent graphs

Basic operations: finding the neighbours of a node,

checking if two nodes are connected by an edge,

updating the graph structure, …

 We need efficient graph operations

G =

(

V, E

)

is commonly modelled as

 set of nodes (vertices) V

 set of edges E

 n = |V|, m = |E|

(14)

Adjacency Matrix

Bi-dimensional array

A

of

n

x

n

Boolean values

Indexes of the array = node identifiers of the graph

The Boolean junction

A

ij

of the two indices indicates

whether the two nodes are connected

Variants:

Directed graphs

Weighted graphs

(15)

Adjacency Matrix

Pros:

Adding/removing edges

Checking if two nodes are

connected

Cons:

Quadratic space with respect to

n

 We usually have sparse graphs 

lots of 0 values

Addition of nodes is expensive

Retrieval of all the neighbouring

nodes takes linear time with

(16)

Adjacency List

A set of lists where each accounts for the

neighbours of one node

A vector of

n

pointers to adjacency lists

Undirected graph:

An edge connects nodes

i

and

j

=> the list of

neighbours of

i

contains the node

j

and vice versa

Often compressed

Exploitation of regularities in graphs, difference from

other nodes, …

(17)

Adjacency List

Pros:

Obtaining the neighbours of a

node

Cheap addition of nodes to the

structure

More compact representation of

sparse matrices

Cons:

Checking if there is an edge

between two nodes

 Optimization: sorted lists =>

logarithmic scan, but also logarithmic insertion

(18)

Incidence Matrix

Bi-dimensional Boolean matrix of

n

rows

and

m

columns

A column represents an edge

Nodes that are connected by a certain edge

A row represents a node

(19)

Incidence Matrix

pros:

For representing

hypergraphs, where one

edge connects an arbitrary

number of nodes

Cons:

(20)

Laplacian Matrix

Bi-dimensional array of

n

x

n

integers

Diagonal of the Laplacian matrix indicates the

degree of the node

The rest of positions are set to

-1

if the two

(21)

Laplacian Matrix

Pros:

Allows analyzing the graph

structure by means of

spectral analysis

(22)

Graph Traversals

Single Step

Single step traversal

from element

i

to element

j

, where

i

,

j

(

V

E

)

Expose explicit adjacencies in the graph

eout : traverse to the outgoing edges of the vertices

ein : traverse to the incoming edges of the vertices

vout : traverse to the outgoing vertices of the edges

vin : traverse to the incoming vertices of the edges

elab : allow (or filter) all edges with the label

  : get element property values for key r

ep : allow (or filter) all elements with the property s for key r

(23)

Graph Traversals

Composition

Single step traversals can compose

complex

traversals

of arbitrary length

e.g., find all friends of Alberto

„Traverse to the outgoing edges of vertex

i

(representing Alberto), then only allow those edges

with the label

friend

, then traverse to the incoming

(i.e. head) vertices on those

friend

-labeled edges.

Finally, of those vertices, return their

name

property.“

(24)

Improving Data Locality

Idea: take into account computer architecture in the data

structures to reach a good performance

 The way data is laid out physically in memory determines the locality to be obtained

 Spatial locality = once a certain data item has been accessed, the nearby data items are likely to be accessed in the following computations

 e.g., graph traversal

Strategy: in graph adjacency matrix representation,

exchange rows and columns to improve the cache hit

ratio

(25)

Breadth First Search Layout

(BFSL)

Trivial algorithm

Input: sequence of vertices of a graph

Output: a permutation of the vertices which obtains

better cache performance for graph traversals

BFSL algorithm

:

1. Selects a node (at random) that is the origin of the traversal 2. Traverses the graph following a breadth first search algorithm,

generating a list of vertex identifiers in the order they are visited

3. Takes the generated list and assigns the node identifiers

sequentially

Pros: optimal when starting from the selected node

(26)

Bandwidth of a Matrix

Graphs

matrices

Locality problem = minimum bandwidth problem

 Bandwidth of a row in a matrix = the maximum distance between nonzero elements, with the condition that one is on the left of the diagonal and the other on the right of the diagonal

 Bandwidth of a matrix = maximum of the bandwidth of its rows

Matrices with low bandwidths are more cache friendly

 Non zero elements (edges) are clustered across the diagonal

Bandwidth minimization problem

(BMP) is NP hard

(27)
(28)

Cuthill-McKee

(1969)

Popular bandwidth minimization

technique for sparse matrices

Re-labels the vertices of a matrix according to a

sequence, with the aim of a heuristically guided

traversal

Algorithm:

1. Node with the first identifier (where the traversal starts) is the

node with the smallest degree in the whole graph

2. Other nodes are labeled sequentially as they are visited by BFS

traversal

 In addition, the heuristic prefers those nodes that have the

(29)

Graph Partitioning

Some graphs are

too large

to be fully loaded into

the main memory of a single computer

Usage of secondary storage degrades the

performance of graph applications

Scalable solution distributes the graph on multiple

computers

We need to partition the graph reasonably

Usually for particular (set of) operation(s)

The shortest path, finding frequent patterns, BFS,

spanning tree search, …

(30)

One and Two Dimensional

Graph Partitioning

Aim: partitioning the graph to solve BFS more

efficiently

Distributed into shared-nothing parallel system

Partitioning of the adjacency matrix

1D partitioning

Matrix rows are randomly assigned to the

P

nodes

(processors) in the system

Each vertex and the edges emanating from it are

owned by one processor

(31)
(32)

One and Two Dimensional

Graph Partitioning

BSF with 1D partitioning

 Input: starting node s having level 0

 Output: every vertex v becomes labeled with its level, denoting its

distance from the starting node

1. Each processor has a set of frontier vertices F

 At the beginning it is node s where the BFS starts

2. The edge lists of the vertices in F are merged to form a set of

neighbouring vertices N

 Some owned by the current processor, some by others

3. Messages are sent to all other processors to (potentially) add

these vertices to their frontier set F for the next level

 A processor may have marked some vertices in a previous

(33)

One and Two Dimensional

Graph Partitioning

2D partitioning

 Processors are logically arranged in an R x C processor mesh

 Adjacency matrix is divided C block columns and R x C block rows

 Each processor owns C blocks

Note: 1D partitioning = 2D partitioning with

C = 1

(or

R = 1

)

Consequence: each node communicates with at most

R +

C

nodes instead of all

P

nodes

 In step 2 a message is sent to all processors in the same row

(34)

= block owned by processor (i,j)

Partitioning of vertices:

Processor (i, j) owns vertices corresponding to block row

(35)

Types of Graphs

Single-relational

Edges are homogeneous in meaning

 e.g., all edges represent friendship

Multi-relational (property) graphs

Edges are typed or labeled

 e.g., friendship, business, communication

Vertices and edges in a property graph maintain a set

of key/value pairs

 Representation of non-graphical data (properties)  e.g., name of a vertex, the weight of an edge

(36)
(37)

Graph Databases

A graph database = a set of graphs

Types of graphs:

 Directed-labeled graphs

 e.g., XML, RDF, traffic networks

 Undirected-labeled graphs

 e.g., social networks, chemical compounds

Types of graph databases:

 Non-transactional = few numbers of very large graphs

 e.g., Web graph, social networks, …

 Transactional = large set of small graphs

 e.g., chemical compounds, biological pathways, linguistic trees each

(38)

Transactional Graph Databases

Types of Queries

Sub-graph queries

 Searches for a specific pattern in the graph database

 A small graph or a graph, where some parts are uncertain

 e.g., vertices with wildcard labels

 More general type: sub-graph isomorphism

Super-graph queries

 Searches for the graph database members of which their whole structures are contained in the input query

Similarity (approximate matching) queries

 Finds graphs which are similar, but not necessarily isomorphic to a given query graph

(39)
(40)

sub-graph: q1: g1, g2 q2:  super-graph: q1:  q2: g3

(41)

Sub-graph

Query Processing

Non Mining-Based

Graph Indexing Techniques

Focus on

indexing whole constructs

of the graph

database

 Instead of indexing only some selected features

Cons:

 Can be less effective in their pruning (filtering) power

 May need to conduct expensive structure comparisons in the filtering process

Pros:

 Can handle graph updates with less cost

 Do not rely on the effectiveness of the selected features  Do not need to rebuild whole indexes

(42)

Non Mining-Based Techniques

GraphGrep

(2002)

For each graph, it enumerates

all paths up to a certain

maximum length

and records the number of

occurrences of each path

Path index table:

 Row = path

 Column = graph

 Entry in the table = number of occurrences of the path in the graph

Query processing:

1. Path index is used to find candidate graphs G1, … GN which

1. contain paths in the query q and

2. counts of such paths are beyond threshold specified in query q

2. Each G1, … GN is examined by sub-graph isomorphism to

obtain the final results

candidate search

(43)

Non Mining-Based Techniques

GraphGrep

Pros:

The indexing process of paths with limited lengths is

usually fast

Cons:

The size of the indexed paths could drastically

increase with the size of graph database

The filtering power of paths is limited

 Verification cost can be very high due to the large size of the

(44)

Non Mining-Based Techniques

GString

(2007)

Considers the semantics of the graph

Models graph objects in the context of organic chemistry

using basic structures

 Line = series of vertices connected end to end

 Cycle = series of vertices that form a closed loop

 Star = core vertex directly connected to several vertexes

Order of identification of objects: cycles, stars, lines

Graphs and queries are represented as string

sequences

 Sub-graph search problem = subsequence string-matching domain

 Suffix tree-based index structure for the string representations is created

(45)
(46)

Non Mining-Based Techniques

GString

Basic structure

has:

 Type  Size  Line/Cycle = number of vertices  Star = fan-out of

the central vertex

 Edits = set of annotations

(47)

Non Mining-Based Techniques

GString

Cons:

Converting can be inefficient for large graphs

Suitable for, e.g., chemical compounds but not in

general

(48)

Sub-graph Query Processing

Mining-Based

Graph Indexing Techniques

Idea: if features of query graph q do not exist in data graph G,

then G cannot contain q as its sub-graph

 Graph-mining methods extract selected features (sub-structures) from the graph database members

 An inverted index is created for each feature

Answering a sub-graph query q:

1. Identifying the set of features of q

2. Using the inverted index to retrieve all graphs that contain the same

features of q

 Cons:

 Effectiveness depends on the quality of mining techniques to effectively identify the set of features

 Quality of the selected features may degrade over time (after lots of insertions and deletions)

(49)

Mining-Based Techniques

GIndex

(2004)

Indexing unit = sub-graphs (of labeled undirected graph)

 Pros: improves query performance (higher filtering power)

 Cons: number of graph structures can be large

 GIndex considers only frequent discriminative sub-graphs

 Frequent: its support > threshold

 Support threshold is progressively increased as the graph grows

 Discriminative (distinctive): presence cannot be predicted by the set of all their frequently occurring subgraphs

 Among similar fragments with the same support only the smallest common

fragment is indexed

Query processing: given a query graph q

1. If q is a frequent sub-graph, the exact set of query answers containing

q can be retrieved directly

q is indexed

2. Otherwise, q probably has a frequent sub-graph f whose support

maybe close to the support threshold

q is infrequent => the sub-graph is only contained in a small number of

(50)

(1) Carbon chains up-to length 3 are present

 path-based approach cannot prune (a) and (b)

Motivation from indexing text: Vocabulary size is small  index phrases rather than individual words

(51)

Mining-Based Techniques

TreePI

(2007)

Indexing unit = frequent sub-treesObservations:

 Tree data structures are more complex patterns than paths

 Trees can preserve almost equivalent amount of structural information as arbitrary sub-graph patterns

 The frequent sub-tree mining process is easier than general frequent sub-graph mining process.

Mining:

1. Mining a frequent tree on the graph database

2. Selecting a set of frequent trees as index patterns

Query processing: given a query graph q

1. Frequent sub-trees in q are identified and matched with the set of

indexing features to obtain a candidate set

(52)

Super-graph

Query Processing

cIndex

(2007)

Idea: If a feature

f

(sub-graph) is not in

q

, then the

graphs having

f

are pruned

The saving will be significant when

f

is a sub-graph of

many graphs in the database.

 Indexing unit = contrast sub-graphs (features) = sub-graphs contained by a lot of graphs in database, but unlikely contained by a query graph

 Extracted from the database based on their rare occurrence in historical query graphs

Pros: index is small

Cons: query logs may frequently change

index maybe

outdated often

 Needs to be recomputed to stay effective

(53)

graph database: features: graph-feature matrix: query graph:

- ga and gb can be pruned based on f2

(54)

Super-graph Query Processing

GPTree

(2009)

Similar idea, different approach to identification of

frequent sub-graphs

Based on approach for compact organization of graph

database members (GPTree)

 All of the graph database members are stored into one graph

 Common sub-graphs are stored only once

 Complexity:

 Optimal construction is NP-complete => approximation algorithm

Indexing unit =

frequent sub-graphs

 Determination is based on the containment feature from the representation

(55)
(56)

Graph Similarity Queries

Find sub-graphs in the database that are

similar

to query

q

Allows for node mismatches, node gaps, structural

differences, …

Usage: when graph databases are noisy or

incomplete

Approximate graph matching query-processing

techniques can be more useful and effective than

exact matching

(57)

Graph Similarity Queries

Grafil

(2005)

Feature-based structural filtering algorithm

 Models each query graph as a set of features

 Feature = can be any structure indexed in a graph database

 e.g., elementary structures, paths, frequent structures, …

Feature-graph matrix

– to compute the difference in the

number of features between

q

and graphs in the

database

 Column = graph in the graph database

 Row = feature being indexed

(58)

query graph q: features: feature-graph matrix: edge-feature matrix:

for some graphs Gi in the database

for q and features fa, fb, fc

Example: no match for q in our database which has 7 embeddings of fa, fb,

fc => we relax one edge (e1, e2, or e3), we still have at least 3 embeddings (we may miss at most 4 embeddings)  we can discard graphs that do not involve at least 3 embeddings

(59)

Graph Similarity Queries

Grafil

Query processing:

1. Feature-graph matrix is used to calculate the difference in

number of features between each graph database member and q

2. If the difference is greater than a threshold T, then it is discarded

 The remaining graphs constitute a candidate answer set

3. Substructure similarity (verification) is calculated for each

candidate to prune the false positives candidates

 Many existing approaches – not considered in Grafil

Extension:

Edge-feature matrix

– used to compute the

maximum allowed feature misses (

T

) based on a query

relaxation ratio (

r

)

 Row = an edge

 Column = embedding of a feature

(60)

Graph Similarity Queries

Grafil

Graph relaxation = allowing

feature misses

=

approximation

 Feature miss in q = edge deletion/relabeling

 Inspiration: n-grams in approximate string matching

Feature miss estimation problem: “Given a query graph

q

and a set of features contained in

q

, if the relaxation

ratio is

r

, what is the maximum number of features that

can be missed (

T

)?”

 i.e., the maximum number of columns that can be hit by k = r.|G| rows in the edge-feature matrix

 Maximum coverage problem  NP-complete

 Approximated by a greedy algorithm

Percentage of edges we relax

(61)

Graph Similarity Queries

Closure Tree (CTree)

(2006)

For both similarity and sub-graph queriesSimilar to the R-tree indexing mechanism

 Extended to support graph-matching queries

 Each node in the tree contains information about its descendants

 To facilitate effective pruning (filtering)

Closure of a set of vertices = generalized vertex whose attribute is

the union of the attribute values of the vertices

 Closure of a set of edges = generalized edge whose attribute is the union of the attribute values of the edges

 Closure of two graphs g1 and g2 under a mapping M = generalized

graph (V, E)

V is the set of vertex closures of the corresponding vertices

(62)
(63)

Graph Similarity Queries

CTree

CTree:

 Each node is a graph closure of its children

 The children of an internal node are nodes

 The children of a leaf node are database graphs

 Each node has at least m children unless it is root, m ≥ 2

 Each node has at most M children, (M+1)/2 m

Insertion/splitting/deletion take polynomial time

Query evaluation:

1. Traverse the CTree, pruning (filtering) nodes based on pseudo

sub-graph isomorphism

 A candidate answer set is returned

(64)

Graph Query Languages

Idea: need for a suitable language to query and

manipulate graph data structures

Some common standard

 Like SQL, XQuery, OQL, …

Classification:

General

Special-purpose (= special types of graphs)

(65)

PQL

= Pathway Query Language

(2005)

Special purpose graph query language for biological

networks

 Targets retrieving specific parts of large, complex networks

Declarative

Syntax is similar to SQL

Queries match arbitrary sub-graphs in the database

based on node properties and paths between nodes

SELECT sub-graph-specification FROM node-variables

(66)

PQL

Data model:

Set of nodes and

directed edges

A node is either an

interaction

or a

molecule

The graphs need not

be connected

(67)

PQL

Basic Structures

Constants (in double quotes “…”)

Operators – prefix, binary infix, and function calls with arguments in

parentheses

 and, union, =, <>, like, …

Logical quantifiers (for all, exists)

SELECT * FROM A, B

WHERE A.name = “3-Isopropylmalate” AND B.name = “EC1.1.1.85”

SELECT * FROM B, C, D

WHERE D.name = “L-Lactaldehyde” AND B ISA “Enzyme” AND

B[-*]C[-*]D AND

C.name = “Lactaldehyde” a path exists

has this type

(68)

PQL

Query Evaluation

Node variables are bound to nodes of the database

graph s.t. all node-conditions in the WHERE clause

evaluate to TRUE

 All possible assignments of the variable to nodes of the graph are determined

 Node variables are equally assigned to molecules and interactions

Query result is constructed from these variable bindings

according to the sub-graph specification of the SELECT

clause

SQL works on tables and returns tables

PQL works

(69)

GraphQL

(2008)

General graph query and manipulation language

 Supports arbitrary attributes on nodes, edges, and graphs

 Represented as a tuple

Graphs are considered as the basic unit of information

 Query manipulates one or more collections of graphs

Graph pattern

= graph motif and a predicate on attributes

of the graph

 Simple vs. complex graph motifs

 Concatenation, disjunction, repetition

 Predicate = combination of Boolean or arithmetic comparison expressions

(70)
(71)

edges are unified if their nodes are unified

(72)

examples of repetition (Kleene star)

simple path = edge

recursion = path itself + new edge

declare a new node and unify with a nested one

(73)

a root node v0

arbitrary repetition as a child node of v0

(74)

predicate

FLWR expression

pairs of authors

co-authorship graph

extract pairs of co-authors construct the graph

(75)

Other Graph Query Languages

GraphLog

(1990) – graphical query language for graph

databases

GOQL

(1999) and

GOOD

(1994) – extensions of OQL

 Object-oriented graph data model

BPMN-Q

(2007) – determining whether a given business

process model graph is structurally similar to a query

graph

 Nodes, edges, wildcards, paths, negative paths

SPARQL

(2008) – W3C recommendation for querying

RDF graph data

 Describes a directed labeled graph by a set of triples

(76)
(77)
(78)
(79)

References

Pramod J. Sadalage - Martin Fowler:

NoSQL

Distilled: A Brief Guide to the Emerging

World of Polyglot Persistence

Eric Redmond - Jim R. Wilson:

Seven

Databases in Seven Weeks: A Guide to

Modern Databases and the NoSQL Movement

Sherif Sakr - Eric Pardede:

Graph Data

References

Related documents

The purpose of the archaeogeophysics website is to serve as the platform for the tutorial through which users will learn how to process data after it is collected from various

У статі «Грошові кошти та їх еквіваленти на початок року» звіту про рух грошових коштів банк відображає залишок грошових коштів та їх еквівалентів на початок року, який

Imatinib (Gleevec ® ) may be considered medically necessary for treatment of newly diagnosed adult and pediatric patients with Philadelphia chromosome positive chronic

The key maintained assumption in this analysis is that municipal green procurement policies in far-away cities increase the supply of LEED Accredited Professionals in nearby

Finally, interpreting students are often exposed to intense student competition during their education and experience an overly critical perspective of the field during

In Middle Musselshell River and Cottonwood Creek watersheds, mean monthly streamflows for all three future periods are projected either to increase, remain unchanged or decrease

In a hospitalist system, when a patient leaves the hospital, he or she will return to a primary care pro- vider (PCP) for follow-up and continuing care.. The hand-off after

The M270/M274 family of four-cylinder engines is optimally equipped thanks to the flexible consumption technologies Camtronic, lean-burn combustion and natural gas capability.