Graph databases - Enabling Model-Driven Live Analytics For Cyber-Physical Systems: The Case of

Closely related to graph processing frameworks are graph databases.

Neo4j [47] is one of the first and most well-known graph databases today. It offers native graph storage, processing, and querying, and is fully ACID compliant. Neo4j is written in Java and can be accessed from applications written in other languages using Neo4j’s query language Cypher [23]. The data model of Neo4j consists of nodes and relationships (edges). Both nodes and relationships can contain properties. In addition to properties and relationships, nodes can be labelled. Nodes have unique conceptual identities and are typically used to represent the entities of a domain. Every relationship must have a start node and an end node. Relationships are defined between node instances, not node classes. Labels are named graph constructs that are used to group several nodes into sets. This allows queries to operate on smaller sets instead of the whole graph, making queries more efficient. Neo4j supports replication but, so far, does not offer support for distribution. The community edition of Neo4j is open source under a GPL license. An enterprise edition is developed by Neo Technology Inc. under a commercial as well as a free AGPL license.
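The labelled property graph model described above can be illustrated with a minimal in-memory sketch. It is not Neo4j's implementation or API: the class names (`Node`, `Relationship`, `PropertyGraph`), the `label_index` structure, and the example data are hypothetical and serve only to show how labels restrict a query to a subset of nodes.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: int                                   # unique identity of the node
    labels: set = field(default_factory=set)
    properties: dict = field(default_factory=dict)


@dataclass
class Relationship:
    rel_type: str
    start: Node                                    # every relationship has a start node ...
    end: Node                                      # ... and an end node (instances, not classes)
    properties: dict = field(default_factory=dict)


class PropertyGraph:
    """Hypothetical toy graph; illustrates the data model only."""

    def __init__(self):
        self.nodes = {}                            # node id -> Node
        self.relationships = []
        self.label_index = defaultdict(set)        # label -> ids of nodes with that label
        self._next_id = 0

    def add_node(self, labels=(), **properties):
        node = Node(self._next_id, set(labels), dict(properties))
        self._next_id += 1
        self.nodes[node.node_id] = node
        for label in node.labels:
            self.label_index[label].add(node.node_id)
        return node

    def relate(self, start, rel_type, end, **properties):
        rel = Relationship(rel_type, start, end, dict(properties))
        self.relationships.append(rel)
        return rel

    def find(self, label, **criteria):
        # Only nodes carrying the label are inspected, not the whole graph.
        candidates = (self.nodes[i] for i in sorted(self.label_index.get(label, ())))
        return [n for n in candidates
                if all(n.properties.get(k) == v for k, v in criteria.items())]


graph = PropertyGraph()
meter = graph.add_node(labels=["SmartMeter"], name="m1")
hub = graph.add_node(labels=["Concentrator"], name="c1")
graph.relate(meter, "CONNECTED_TO", hub, since=2016)
print([n.properties["name"] for n in graph.find("SmartMeter")])
```

In this sketch, `find` consults the label index first, so the cost of a lookup depends on the number of nodes with the given label rather than on the size of the graph.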

DEX [234], [233] (recently rebranded to Sparksee) is a high-performance graph database management system based on bitmaps and additional secondary structures. DEX uses bitmaps as primary structures to store and manipulate the graph data, since they represent information compactly and can be operated on efficiently with logic operations. The logical data model of DEX defines a labeled and attributed multigraph as G = {L, N, E, A}, where L is a collection of labels, N the collection of nodes, E the collection of edges (directed or undirected), and A the collection of attributes. Labeled graphs in DEX provide a label for nodes and edges, defining their types. DEX distinguishes two types of graphs: the DbGraph is the persistent graph that contains the whole database, whereas RGraphs are used to temporarily hold query results. The design of DEX follows four goals: 1) it must be possible to split the graph into smaller structures for improved caching and memory usage, 2) object identifiers for nodes and edges should be used to speed up graph operations, 3) specific structures must be used to accelerate the navigation and traversal of edges, and 4) attributes should be indexed to allow queries over nodes and edges based on value filters [234]. DEX uses bitmaps to define which objects (nodes or edges) are selected or related to other objects. As auxiliary structures, DEX uses maps with key values associated to bitmaps or data values. These two structures are combined to build links: binary associations between unique identifiers and data values. Given an identifier, a link returns the associated value and, conversely, given a value, it returns the associated identifiers. A graph is then built out of links, maps, and bitmaps. The original DEX papers [234], [233] contain no information about transaction support, but the current version [61] is fully ACID compliant. DEX was initially developed by the Data Management group at the Polytechnic University of Catalonia (DAMA-UPC) and later by a spin-off called Sparsity Technologies. It comes with a dual license: a free one for evaluation and academic purposes and one for commercial usage. The source code is not available as open source.
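The idea of a link, a two-way association built from a map and bitmaps, can be sketched as follows. This is an illustrative assumption and not DEX's actual data structures: the `Link` class, the `oids` helper, and the example attributes are hypothetical, and plain Python integers stand in for compressed bitmaps (bit i set means that the object with identifier i is included).

```python
class Link:
    """Hypothetical two-way association between object identifiers and values."""

    def __init__(self):
        self.value_of = {}     # identifier -> value
        self.bitmap_of = {}    # value -> bitmap of identifiers (Python int)

    def set(self, oid, value):
        if oid in self.value_of:
            # Clear the identifier's bit in the bitmap of its previous value.
            self.bitmap_of[self.value_of[oid]] &= ~(1 << oid)
        self.value_of[oid] = value
        self.bitmap_of[value] = self.bitmap_of.get(value, 0) | (1 << oid)

    def value(self, oid):
        """Identifier -> value."""
        return self.value_of[oid]

    def objects(self, value):
        """Value -> bitmap of all identifiers that carry this value."""
        return self.bitmap_of.get(value, 0)


def oids(bitmap):
    """Decode a bitmap into the sorted list of identifiers it contains."""
    result, oid = [], 0
    while bitmap:
        if bitmap & 1:
            result.append(oid)
        bitmap >>= 1
        oid += 1
    return result


label = Link()      # object type (label) per identifier
voltage = Link()    # one attribute per identifier
rows = [("meter", 230), ("meter", 110), ("cable", 230)]
for oid, (typ, volts) in enumerate(rows, start=1):
    label.set(oid, typ)
    voltage.set(oid, volts)

# A value filter combined with a type filter is a single bitwise AND.
selected = label.objects("meter") & voltage.objects(230)
print(oids(selected))  # [1]
```

The point of the sketch is that selections are sets encoded as bitmaps, so combining filters reduces to logic operations on them, as the paragraph above describes.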

HyperGraphDB [187] is a graph database that was specifically designed for artificial intelligence and Semantic Web projects. It is based on generalised hypergraphs (e.g., as proposed in [99], [159]) where hyperedges can contain other hyperedges. Hypergraphs are extensions of standard graphs that allow edges to point to more than two nodes. In HyperGraphDB, edges can also point to other edges. It is transactional and embeddable. As stated by Iordanov [187], the representational power of higher-order n-ary relationships was the main motivation behind the development of HyperGraphDB. The basic structural unit in the HyperGraphDB data model is called an atom. Each atom has an associated tuple of atoms. This tuple is called the target set, and its size is referred to as the atom’s arity. Atoms of arity 0 are nodes, while atoms of arity > 0 are links. The set of links pointing to an atom a is called the incidence set of atom a. Each atom in HyperGraphDB has a value, which is strongly typed. HyperGraphDB uses a two-layered architecture: a hypergraph storage layer and a model layer. For the storage layer, HyperGraphDB suggests, like we do in our approach, the use of key-value stores. The only requirement imposed by HyperGraphDB is that indices support multiple ordered values per single key. The model layer contains the hypergraph atom abstraction, the type system, caching, indexing, and querying. HyperGraphDB supports data distribution using an agent-based peer-to-peer framework. Activities are asynchronous, and incoming messages are dispatched using a scheduler and processed in a thread pool. HyperGraphDB is open source under the LGPL license.
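The atom abstraction can be made concrete with a small sketch. It is not HyperGraphDB's Java API: the `Atom` class and the example values are hypothetical, and the strong typing of atom values is deliberately left out. The sketch only shows target sets, arity, incidence sets, and a link that points to another link.

```python
from dataclasses import dataclass, field
from typing import Any, Tuple


@dataclass(eq=False)  # identity-based equality keeps atoms hashable
class Atom:
    value: Any
    target_set: Tuple["Atom", ...] = ()
    incidence_set: set = field(default_factory=set, repr=False)

    def __post_init__(self):
        # Register this atom in the incidence set of every atom it points to.
        for target in self.target_set:
            target.incidence_set.add(self)

    @property
    def arity(self):
        return len(self.target_set)

    @property
    def is_link(self):
        # Atoms of arity 0 are nodes, atoms of arity > 0 are links.
        return self.arity > 0


substation = Atom("substation-1")
meter_a = Atom("meter-a")
meter_b = Atom("meter-b")

# An n-ary link: one hyperedge connecting three atoms.
feeds = Atom("feeds", (substation, meter_a, meter_b))

# A link whose target set contains another link.
measured = Atom("measured-on-2016-01-01", (feeds,))

print(feeds.arity, measured.is_link, feeds in meter_a.incidence_set)
```

Because links are atoms themselves, statements about relationships (here, `measured` about `feeds`) need no separate construct.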

Titan [65] is a distributed graph database. It supports ACID transactions and eventual consistency, and is designed to be used with different data backends, e.g., Apache Cassandra, Apache HBase, and Oracle BerkeleyDB. Titan supports different processing frameworks, among others Apache Spark, Apache Giraph, and Apache Hadoop. It supports TinkerPop Gremlin [15] queries. Titan is open source under the Apache 2 license.

OrientDB [52] is another distributed database. A distinctive characteristic of OrientDB is its multi-model nature: besides being a graph database, it also supports document and key-value data models. OrientDB supports schema-less, schema-full, and schema-mixed modes and allows data to be queried with a slightly extended SQL variant (OrientDB SQL) and with TinkerPop Gremlin. It supports ACID transactions and sharding, and provides encryption models for data security. OrientDB implements several indexing strategies based on B-trees and extendible hashing. This allows fast traversal (O(1) complexity) of one-to-many relationships and fast add/remove link operations. OrientDB comes in two versions: a free community edition licensed under Apache 2 and a commercial enterprise edition with professional support. The community edition is available as open source.

Table 3.3: Summary and comparison of important graph databases

| | distributed | transactional | query language (ql) | source available |
|---|---|---|---|---|
| Neo4j | no | fully ACID | Cypher (declarative) | yes |
| DEX | no | fully ACID (current version) | API for graph traversal; no explicit ql | no |
| HyperGraphDB | yes (depends on underlying k/v store) | transactional (depends on underlying k/v store) | API for graph operations; relational-style queries | yes |
| Titan | yes | fully ACID and eventual consistency | TinkerPop Gremlin; API for graph traversal | yes |
| OrientDB | yes | fully ACID and eventual consistency | TinkerPop Gremlin; OrientDB SQL (slightly extended SQL); API for graph traversal | yes |
| InfiniteGraph | yes | fully ACID and eventual consistency | TinkerPop Gremlin; API for graph traversal | no |

InfiniteGraph [36] is a distributed graph database, which has since been integrated into the ThingSpan [37] analytics stack. The InfiniteGraph graph model is a labeled, directed multigraph. Edges in InfiniteGraph are first-class entities with their own identity. InfiniteGraph provides ACID transactions and supports a schema-full model. InfiniteGraph has been developed by Objectivity Inc. and is not available as open source.

Table 3.3 summarises and compares the discussed graph databases. For each database, it shows whether distribution is supported, which transaction model is offered, which query language is used, and whether the source code is available.

As is the case for graph processing frameworks, there also exists a wide variety of additional graph databases, which are conceptually similar to the ones discussed but which provide optimisations for specific use cases or are alternative implementations. ArangoDB [17] is among the most well-known ones and supports multiple data models: graph, key-value, and document. Another one is FlockDB [30], which has been developed by Twitter in order to store social graphs. AllegroGraph [2] is a triple store designed to store RDF triples. Stardog [63], GraphDB [35], Dgraph [25], InfoGrid [40], blazegraph [21], GraphBase [33], and VelocityDB [68] are other examples of mostly commercial, generic graph databases.

The boundary between graph databases and graph processing frameworks is not sharp: on the one hand, some graph processing frameworks offer persistence and, on the other hand, some graph databases only provide weak consistency models, like BASE. Nonetheless, most graph databases offer stronger consistency and transaction models than graph processing frameworks, whereas the latter have a stronger focus on graph traversal and distributed processing. The contribution of this dissertation lies somewhere between these two categories. Although the main focus is a multi-dimensional graph data model for near real-time analytics, we also put a strong emphasis on efficient storage concepts for this graph model. None of the mentioned graph databases can natively represent time or many different hypothetical worlds. Although Neo4j does not support time natively, there are some discussions and patterns on how to best model time-dependent data [55], [32]. These are discussed in more detail in Chapter 3.6.
