
5. Graph-based implementation of AQL

5.5. Graph storages

5.5.1. Adjacency list

The AdjacencyListStorage class implements a graph storage using a set of unique edge entries, sorted by the IDs of the source and target nodes. It is currently the only graph storage that implements the WriteableGraphStorage interface and thus allows the content of the graph storage to be changed dynamically. Each edge entry in the sorted set is a pair consisting of the ID of the source node and the ID of the target node. Using a set (instead of an array containing lists of outgoing edges) allows checking for the uniqueness of an edge with an upper complexity bound of O(log n), because the set is implemented with a B-tree (Cormen et al. 2009, p. 492). It also allows lookups for all outgoing edges of any node using

• an initial lookup of the first edge of a given node with a complexity of O(log n),
• a lookup of the successor of the current edge inside the set, which is the potential next edge for the same node, with the same complexity (Cormen et al. 2009, p. 292).

Thus, getting all outgoing edges for a single node in a graph with n nodes and a maximum degree of d has a complexity of O(d log n). In addition to the set of edges, the class also contains a set of inverted edges (where the source and target nodes are switched). Currently, this set is only used to make the deletion of nodes more efficient, because a deleted node has to be removed from all edges in which it is either the source or the target node. In the future, the inverted edge set would also allow traversing the graph efficiently in the inverse direction (see Section 7.1.2 for a discussion of why this is useful).
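The lookup scheme described above can be illustrated with a minimal Python sketch. It is not the actual AdjacencyListStorage code: the class name EdgeSet and its methods are hypothetical, and a sorted Python list searched with bisect stands in for the B-tree based set. The search for the first edge of a node is logarithmic as in the text, but inserting into or deleting from a list is linear, which a tree-based set avoids.

```python
from bisect import bisect_left


class EdgeSet:
    """Hypothetical stand-in: sorted, duplicate-free (source, target) pairs."""

    def __init__(self):
        self.edges = []    # sorted by (source, target)
        self.inverse = []  # sorted by (target, source)

    @staticmethod
    def _insert_unique(entries, entry):
        i = bisect_left(entries, entry)       # binary search for the position
        if i == len(entries) or entries[i] != entry:
            entries.insert(i, entry)          # linear here, logarithmic in a tree

    @staticmethod
    def _remove(entries, entry):
        i = bisect_left(entries, entry)
        if i < len(entries) and entries[i] == entry:
            del entries[i]

    @staticmethod
    def _neighbours(entries, node):
        # (node,) sorts before every (node, x), so this finds the first
        # entry of `node`; afterwards only successors are inspected.
        i = bisect_left(entries, (node,))
        while i < len(entries) and entries[i][0] == node:
            yield entries[i][1]
            i += 1

    def add_edge(self, source, target):
        self._insert_unique(self.edges, (source, target))
        self._insert_unique(self.inverse, (target, source))

    def outgoing(self, node):
        return list(self._neighbours(self.edges, node))

    def delete_node(self, node):
        # Edges where `node` is the source.
        for target in list(self._neighbours(self.edges, node)):
            self._remove(self.edges, (node, target))
            self._remove(self.inverse, (target, node))
        # Edges where `node` is the target, found via the inverted set
        # instead of scanning all edges.
        for source in list(self._neighbours(self.inverse, node)):
            self._remove(self.edges, (source, node))
            self._remove(self.inverse, (node, source))
```

In this sketch, the inverted list plays the role described in the text: without it, finding all edges that end in the deleted node would require a scan over the complete edge list.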

AdjacencyListStorage uses graph traversal to implement reachability queries. In particular, the findConnected(…) function is implemented with a depth-first


Algorithm 5.2 Implementation of a Depth-First-Search (DFS) for adjacency lists which outputs each found node only once.

 1: procedure uniqueDFS(node, distmin, distmax)
 2:     output ← ∅                                  ▷ init with empty set
 3:     stack ← emptyStack                          ▷ init with empty stack
 4:     push(stack, (node, 0))                      ▷ add initial node with distance 0 to stack
 5:     while r ← next(stack, distmin, distmax, output) do
 6:         emit r
 7: function next(stack, distmin, distmax, output)
 8:     found ← false
 9:     while found = false ∧ ¬empty(stack) do
10:         (node, dist) ← top(stack)
11:         found ← enterNode(node, dist, stack, distmin, distmax, output)
12:     if found then
13:         return (node, dist)
14:     else
15:         return false
16: function enterNode(node, dist, stack, distmin, distmax, output)
17:     found ← false
18:     pop(stack)
19:     if dist ≥ distmin ∧ dist ≤ distmax then     ▷ check if distance in valid range
20:         found ← true
21:     if dist ≤ distmax then                      ▷ add all child nodes to the stack
22:         for all child ∈ outgoingEdges(node) do
23:             push(stack, (child, dist + 1))
24:     if found then
25:         if node ∉ output then                   ▷ only output node once
26:             output ← output ∪ {node}
27:             return true

function              time complexity
findConnected(…)      O(|Vr| + (|Er| · log|E|))
distance(…)           O(|Vr| + (|Er| · log|E|))
isConnected(…)        O(|Vr| + (|Er| · log|E|))

Table 5.1.: Complexity for the reachability functions of AdjacencyListStorage. |Vr| is the number of nodes reachable by the source node, |Er| the number of all edges that belong to a path that connects the source node with a reachable node. |E| is the number of all edges in the graph.

traversal that additionally checks if each node is used as output only once, which adds a computational overhead.¹⁷ See Algorithm 5.2 for a listing of the pseudo-code. The traversal stops as soon as the maximum allowed distance is reached. If a node is reachable via more than one path with a valid length, it is still output only once (Algorithm 5.2, line 25). The findConnected(…) function does not test for cycles, both to avoid the additional overhead of testing each traversed node and because most components are cycle-free anyway. The traversal still always terminates, because a finite maximum distance must be given as an argument. Currently, only the LEFT_TOKEN and RIGHT_TOKEN components contain cycles (see Section 4.2.3 for a description), and these components are typically used by AQL operators that query them with a maximum distance of 1.
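The behavior described above can be sketched in Python as a generator that follows the structure of Algorithm 5.2. This is an illustrative sketch only: the function name unique_dfs and the outgoing callback (assumed to return the target nodes of all outgoing edges of a node) are hypothetical, and a Python set replaces the tree-based result set.

```python
def unique_dfs(outgoing, start, dist_min, dist_max):
    """Yield (node, dist) for each node in the distance range, each node once."""
    output = set()           # nodes that were already emitted
    stack = [(start, 0)]     # initial node with distance 0
    while stack:
        node, dist = stack.pop()
        # As in Algorithm 5.2, children are pushed while dist <= dist_max;
        # entries beyond the range are popped later without being expanded.
        if dist <= dist_max:
            for child in outgoing(node):
                stack.append((child, dist + 1))
        # The uniqueness check only runs for nodes inside the valid range.
        if dist_min <= dist <= dist_max and node not in output:
            output.add(node)
            yield node, dist
```

No cycle test is performed in this sketch either. On a cyclic graph the loop still ends, because no entry with a distance greater than dist_max is expanded.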

The distance(…) function also uses a depth-first traversal, but stops exploring a node if it is part of a cycle. Its pseudo-code is given in Algorithm 5.3. The same cycle-safe depth-first traversal implementation is used in the isConnected(…) function. Initializing the traversal has a constant computational overhead. For the special (but frequent) case that the requested distance for isConnected(…) is exactly one, a simplified implementation is used that just checks for the existence of the specific edge in the edge set and thereby avoids this overhead.
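A minimal Python sketch of the cycle-safe traversal and of the distance-one shortcut is given below. All names (cycle_safe_dfs, is_connected, the outgoing callback) are hypothetical. Two simplifications are assumptions of this sketch and not taken from Algorithm 5.3: the stack entry is always popped before the cycle test, and the path is truncated unconditionally instead of tracking the last distance. A Python set of pairs stands in for the sorted edge set.

```python
def cycle_safe_dfs(outgoing, start, dist_min, dist_max):
    """Yield (node, dist) in the distance range, never re-entering the current path."""
    path = []                # path[i] is the node at depth i of the current path
    stack = [(start, 0)]
    while stack:
        node, dist = stack.pop()
        del path[dist:]      # a finished subgraph is removed from the path
        if node in path:
            continue         # cycle detected: do not explore this node
        path.append(node)
        if dist <= dist_max:
            for child in outgoing(node):
                stack.append((child, dist + 1))
        if dist_min <= dist <= dist_max:
            yield node, dist


def is_connected(edges, source, target, dist_min, dist_max):
    """edges: set of (source, target) pairs (stand-in for the edge set)."""
    if dist_min == 1 and dist_max == 1:
        # Shortcut: a single membership test, no traversal is initialized.
        return (source, target) in edges
    # Building the adjacency map per call is only done to keep the sketch short.
    adjacency = {}
    for s, t in sorted(edges):
        adjacency.setdefault(s, []).append(t)
    found = cycle_safe_dfs(lambda n: adjacency.get(n, []), source, dist_min, dist_max)
    return any(node == target for node, _ in found)
```

Unlike the traversal used for findConnected(…), this variant may report the same node more than once if it is reachable via several cycle-free paths, but it never follows an edge back into the path that is currently being explored.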

In general, the running time of a depth-first search is O(|V| + |E|), where |V| is the number of nodes and |E| is the number of edges (Cormen et al. 2009, p. 606). Each node can be visited only once, which adds a linear complexity relative to the number of nodes to the overall complexity. For each visited node, all outgoing edges are retrieved in a loop, but each edge is also visited only once in total. An ideal adjacency list needs constant time to look up the list of all outgoing edges of a single node, whereas AdjacencyListStorage needs logarithmic time for this task. Also, not all nodes and edges are actually searched, but only the reachable ones. The worst-case running times for the reachability functions of AdjacencyListStorage are given in Table 5.1 and reflect these considerations. While the worst-case times are the same, the cycle-safe implementation needs an additional check of whether a searched node is part of the current path in order to detect cycles. The uniqueDFS procedure in Algorithm 5.2 also checks for unique results, but this check is only executed if a node is part

¹⁷ Currently, a red-black tree (Cormen et al. 2009, pp. 308 ff.) based set from the default C++ STL is used to store the unique result. Thus, an additional lookup cost of O(log₂ n) is needed for each […]


Algorithm 5.3 Implementation of a cycle-safe DFS for adjacency lists.

 1: procedure cycleSafeDFS(node, distmin, distmax)
 2:     distlast ← 0
 3:     path ← emptyList                            ▷ init with empty list
 4:     stack ← emptyStack                          ▷ init with empty stack
 5:     push(stack, (node, 0))                      ▷ add initial node with distance 0 to stack
 6:     while r ← next(stack, distmin, distmax, distlast, path) do
 7:         emit r
 8: function next(stack, distmin, distmax, distlast, path)
 9:     found ← false
10:     while found = false ∧ ¬empty(stack) do
11:         (node, dist) ← top(stack)
12:         found ← enterNode(node, dist, stack, distmin, distmax, distlast, path)
13:     if found then
14:         return (node, dist)
15:     else
16:         return false
17: function enterNode(node, dist, stack, distmin, distmax, distlast, path)
18:     if distlast ≥ dist then                     ▷ test if subgraph was completed
19:         removeIdx(path, [dist, length(path)])   ▷ remove all below the parent
20:     if contains(path, node) then                ▷ test for cycle
21:         distlast ← dist
22:         return false
23:     else
24:         insert(path, node)
25:         distlast ← dist
26:         found ← false
27:         pop(stack)
28:         if dist ≥ distmin ∧ dist ≤ distmax then ▷ check if distance in valid range
29:             found ← true
30:         if dist ≤ distmax then                  ▷ add all child nodes to the stack
31:             for all child ∈ outgoingEdges(node) do
32:                 push(stack, (child, dist + 1))

of the output. This can give a slight performance advantage for queries that have a minimum distance greater than one, because the number of output nodes is smaller than the number of reachable nodes.
