GraphANNIS data model and AQL - Graph-based data model for AQL searches

4. Graph-based data model for AQL searches


[Figure 4.5: constituent tree over the token nodes t1 to t7 ("That is a Category 3 storm ."), which are chained by ORDERING edges. The structure nodes s1 (cat=ROOT), s2 (cat=S), s3 (cat=NP), s4 (cat=VP), and s5 (cat=NP) are connected by DOMINANCE edges and carry LEFT_TOKEN and RIGHT_TOKEN edges.]

Figure 4.5.: Example for a constituent tree represented in graphANNIS. The red edges are of the type DOMINANCE. In this example, the edges do not have additional labels, but it is possible to add them if needed. Since these edges imply text coverage, the left and right token edges are inherited from child to parent nodes. For example, the node s3 has the first token as the left-most covered token, and thus its parent node s2 has the same left-most covered token node.

4.2.5. Pointing relations

For relations that do not imply text coverage, edges of type POINTING can be used. These have the same semantics as the SPointingRelation of Salt and can carry additional labels that express edge annotations. The type of a pointing relation in AQL corresponds to the name of the edge component in graphANNIS.

4.3. GraphANNIS data model and AQL

As stressed at the beginning of this section, graphANNIS is meant to be a model that makes it easy and efficient to implement AQL queries. Node annotation search is comparatively limited in what it can express and is relatively straightforward to implement. In contrast, implementing the different operators described in Section 3.2.3 is more challenging. There are numerous operators with different semantics, and it is not desirable to keep specially encoded and optimized information for each operator separately in the database. Instead, an operator should be implemented by combining several more basic graph component types.

Also, these operators have something in common: in addition to finding direct child nodes, they often need to find all nodes reachable from a given start node, subject to additional constraints such as path length or valid edge types and labels. This means it must be possible to implement a reachability query for a component very efficiently. That is why relANNIS uses the pre-/post-order encoding to query for reachable nodes without having to traverse the graph with recursive SQL.
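The general idea behind a pre-/post-order encoding can be sketched as follows. This is a minimal illustration for a tree-shaped component, not the actual relANNIS schema or SQL: the function names and the edge list (a simplified, assumed excerpt loosely following Figure 4.5) are hypothetical. Each node receives a pre-order number when the traversal enters it and a post-order number when it leaves, so a descendant test becomes two integer comparisons.

```python
from collections import defaultdict


def pre_post_order(edges, root):
    """Assign (pre, post) numbers to every node of a tree via depth-first traversal."""
    children = defaultdict(list)
    for source, target in edges:
        children[source].append(target)

    pre = {}
    order = {}
    counter = 0
    # Iterative DFS: each node is pushed once to enter it and once to leave it.
    stack = [(root, False)]
    while stack:
        node, leaving = stack.pop()
        if leaving:
            order[node] = (pre[node], counter)
        else:
            pre[node] = counter
            stack.append((node, True))
            for child in reversed(children[node]):
                stack.append((child, False))
        counter += 1
    return order


def is_reachable(order, source, target):
    """True if target is a proper descendant of source (no graph traversal needed)."""
    pre_s, post_s = order[source]
    pre_t, post_t = order[target]
    return pre_s < pre_t and post_t < post_s


# Hypothetical, simplified dominance edges (assumption, not the full figure).
edges = [("s1", "s2"), ("s2", "s3"), ("s2", "s4"),
         ("s3", "t1"), ("s4", "t2"), ("s4", "s5"), ("s5", "t3")]
order = pre_post_order(edges, "s1")
print(is_reachable(order, "s2", "t3"))
print(is_reachable(order, "s3", "t2"))
```

Because the interval of a descendant is nested inside the interval of its ancestor, the check needs only the stored numbers, which is what makes the encoding attractive for a relational database.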

But the components that represent the different aspects of linguistic annotation are very different from each other. The COVERAGE component only has paths of length 1 because spans do not form hierarchies. In contrast, ORDERING components can have very long paths (depending on the text length), but a node always has at most one outgoing edge. These different graph structures for different linguistic annotations have different optimal implementations for reachability queries. Separating graphs into components of different types and names allows exploiting these differences and providing a more efficient implementation for each component. Still, these implementations are based on graphs and do not need a translation into a different model. For example, relANNIS expresses the token coverage of nodes as attributes of the node, which extends the data model of nodes. In graphANNIS, edges are used to encode the same information.
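To illustrate how the structure of a component can be exploited, the following sketch contrasts two reachability implementations. The class names `LinearChain` and `DirectEdges` are hypothetical and chosen for this example only; they are not taken from the graphANNIS code base. The sketch assumes that an ORDERING-like component is an acyclic chain with at most one outgoing edge per node, and that a COVERAGE-like component only has paths of length 1.

```python
class LinearChain:
    """Reachability index for an acyclic component where each node has at most one outgoing edge."""

    def __init__(self, edges):
        successor = dict(edges)
        targets = set(successor.values())
        self.position = {}
        # Every node without an incoming edge starts a new chain.
        starts = [node for node in successor if node not in targets]
        for chain_id, start in enumerate(starts):
            node, pos = start, 0
            while node is not None:
                self.position[node] = (chain_id, pos)
                node = successor.get(node)
                pos += 1

    def is_reachable(self, source, target, min_dist=1, max_dist=None):
        if source not in self.position or target not in self.position:
            return False
        chain_s, pos_s = self.position[source]
        chain_t, pos_t = self.position[target]
        if chain_s != chain_t:
            return False
        # The path length is just the difference of the chain positions.
        dist = pos_t - pos_s
        return dist >= min_dist and (max_dist is None or dist <= max_dist)


class DirectEdges:
    """Reachability for a component whose paths all have length 1."""

    def __init__(self, edges):
        self.edges = set(edges)

    def is_reachable(self, source, target, min_dist=1, max_dist=None):
        length_one_allowed = min_dist <= 1 and (max_dist is None or max_dist >= 1)
        return length_one_allowed and (source, target) in self.edges


ordering = LinearChain([("t1", "t2"), ("t2", "t3"), ("t3", "t4")])
coverage = DirectEdges([("span1", "t1"), ("span1", "t2")])
print(ordering.is_reachable("t1", "t4"))
print(ordering.is_reachable("t1", "t4", max_dist=2))
print(coverage.is_reachable("span1", "t2"))
```

Both classes answer the same question through the same method signature, so an operator could be written against that interface while each component picks the representation that suits its shape.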

4.4. Extensions to the relational algebra to model AQL queries

GraphANNIS is based on a labeled directed graph, and thus it is not trivial to apply relational algebra to the graph itself. It is, however, possible to describe the results of a query as relations. For example, the openCypher query language also operates on a very similar graph structure called property graph (Rodriguez and Neubauer 2010). In “Formalising openCypher Graph Queries in Relational Algebra”, the gap between graphs and relational algebra is bridged by stating that “openCypher queries take a property graph as their input, however the result of a query is not a graph, but a graph relation” (Marton et al. 2017, p. 184). OpenCypher adds new operators to the relational algebra, which use the graph as input and produce relations. Using relations to describe the results of a query is very similar to the definition of a result in graphANNIS, where each match is defined by a tuple of nodes and the matching node annotation. Since the edge information is not part of a graphANNIS result, it is sufficient to describe only the node labels as a relation. More formally, a node label relation is a set of tuples

$$N = \{(id, ns, name, val) \in \mathbb{N} \times DS \times DS \times DS\} \tag{4.1}$$

where $DS$ is the domain of all strings, $id$ is the global ID of the node, $ns$ and $name$ represent the qualified label name of the match, and $val$ is its value. The relation can also be expressed as a relational schema definition:

$$N(id : \mathbb{N},\ ns : DS,\ name : DS,\ val : DS) \tag{4.2}$$

A simple AQL query like anno_ns:anno_name="value" on a node label can be expressed as a selection on such a node relation:
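As a minimal sketch of what such a selection does, the node label relation can be modelled as a set of Python tuples and filtered on the three label attributes. The names `NodeLabel` and `select`, as well as the sample rows, are hypothetical and serve only as an illustration; this is not how graphANNIS stores or queries its data.

```python
from typing import NamedTuple


class NodeLabel(NamedTuple):
    """One tuple of the node label relation with schema (id, ns, name, val)."""
    id: int
    ns: str
    name: str
    val: str


def select(relation, ns, name, val):
    """Keep only the tuples whose qualified label name and value match."""
    return {row for row in relation
            if row.ns == ns and row.name == name and row.val == val}


# Made-up sample relation (assumption for illustration only).
relation = {
    NodeLabel(1, "anno_ns", "anno_name", "value"),
    NodeLabel(2, "anno_ns", "anno_name", "other"),
    NodeLabel(3, "other_ns", "anno_name", "value"),
}

matches = select(relation, "anno_ns", "anno_name", "value")
print(sorted(row.id for row in matches))
```

The result is again a set of tuples with the same schema, so further relational operators could be applied to it.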


In document ANNIS: A graph-based query system for deeply annotated text corpora (Page 51-53)