Modeling corpora in a relational database

3. Existing graph-based data models for representing and querying lin-

3.3. Modeling corpora in a relational database

(a) Additional edge between node c

and b. (b) Decomposed tree with pre-/post-order values.

Figure 3.8.: Example of how relANNIS encodes pre- and post-order for DAGs. An additional edge has been added to the example tree of Figure 3.7 between the nodes c and b (a). The DAG is decomposed into a tree by copying the descendants of node c (b). Copied nodes are drawn in red. As a result, the DAG has multiple order entries for the nodes that are reachable by several paths.

might have several pre- and post-order values instead of only one pair (Rosenfeld 2012, pp. 21 ff.). The basic idea is to copy sub-graphs, which are reachable by more than one path so that each sub-graph becomes a tree. Pre- and post-order entries are calculated for each of the generated tree copies, which leads to duplicated order entries. For example, if there is even only one additional edge in the graph of Figure 3.7 between node c and b, its descendant nodes b, d, and e are copied and assigned additional orders (see Figure 3.8). Depending on how many edges are added in comparison to the spanning tree of the DAG and where they are added, the duplication can be substantial. These duplicated entries must be filtered out after execution of the join because AQL only outputs each node annotation once in its result set. This adds additional processing overhead for corpora with duplicated entries, even if a specific result might not contain any duplicates.

3.3.2. Representing the graph in the relational database

When a user enters a query in AQL in the legacy ANNIS application, this query is translated to a SQL query, executed on the PostgreSQL database and the results are mapped to Salt. Thus, the original graph structure must be represented in the relational database, and it must be modeled in a way that allows efficient query execution. In the normalized schema, there are tables for the nodes, node annotations, edges and edge annotations. Figure 3.9 gives a graphical representation of this normalized version of the schema.

In order to describe the corpus graph structure and metadata, the corpus and corpus_annotation tables are used. An entry in the corpus table can be either a top-level corpus, a sub-corpus or a document. The hierarchic tree of documents is expressed by using a pre-/post-order encoding. Metadata annotations for documents and corpora are added to the corpus_annotation table.

component id type layer name node_annotation node_ref namespace name value n 1 rank id pre post node_ref component_ref parent level node id text_ref corpus_ref layer name left right token_index left_token right_token seg_index seg_name span root n 1 n 1 edge_annotation rank_ref namespace name value corpus id name type version pre post top_level corpus_annotation corpus_ref namespace name value text corpus_ref id name text 1 n 1 n 1 n 1 n 1 n

Figure 3.9.: Diagram of the relANNIS schema in normalized form. Only tables that represent the annotation and corpus graph are included. There are several more tables in the actual implementation, which are used for for example for configuration and references to binary data like media files.

The node table stores information about each annotation node. It is connected to the corpus and text tables which contain entries about each (sub-) corpus and each text. Thus, a node always belongs to exactly one document and one text. In order to speed up text coverage searches, the covered text range is explicitly encoded using the left/right and left_token/right_token columns. While the former ones describe the range of covered characters in the text, the latter ones refer to the covered tokens indexes. If a node is a token by itself, the token_index column is set to the index of the token in the chain of tokens for a specific textual data source (otherwise it is NULL). For nodes that are not tokens but segmentation nodes, the seg_index column is used instead. Additionally, the name of the segmentation chain for the node is given in the seg_name column. For either token or segmentation nodes, the span column is set to the spanned text value to make it possible to execute the tok searches without joining the text table. Annotations on nodes are encoded in the node_annotation table and consist of the textual representations of the namespace, name, and value of the annotation.

For expressing relations between nodes, the component, rank and edge_annota tion tables are used. The component table lists all connected components of the annotation graph. A component has a type column which is used to distinguish pointing, dominance and spanning relations from each other. Additionally, the layer of the component is encoded in the layer column. Only edges belonging to the same type and layer can be part of the same component. The rank table is used to represent the edges of these components. Each rank entry contains an explicit reference to its parent rank entry via the parent column. Additionally, the connectivity is expressed

3.3. Modeling corpora in a relational database with the help of a pre- and post-order encoding.

3.3.3. Challenges of mapping graphs to a relational database

Using the pre-/post-order encoding approach allows to efficiently querying reachable nodes with only one join on two rank tables per operator. Unfortunately, a lot of information needed to execute an AQL query is encoded over multiple tables. This can result in joins of 9 tables even for simple queries, for example, queries which involve only two annotation nodes and one edge operator with a constraint on the edge annotation. In practice, these joins turned out to be very costly and thus a pre-joined materialized table named facts was created which combines the node, node_annotation, component, rank and edge_annotation tables into one large table. While this approach reduces the number of needed joins, it increases the size of the table and thus also the indexes. Joining the tables also results in almost duplicated rows which differ only in a few columns and are equal otherwise.

In the original schema, all corpora that are part of the database installation are inserted into the same facts table. This makes the size problem even worse since large corpora in the database can influence much smaller corpora. In order to make it easier for the DBMS to process the large facts table, it is partitioned by the top-level corpus a row belongs to. This is implemented by using a common parent table named facts and a child-table named facts_<corpus ID>, which inherits from the general facts table.6 _{For queries that involve only one of the corpora, the SQL generation}

directly uses the child table name. Otherwise, PostgreSQL would use the statistics over all corpora instead of the more accurate statistics for a specific corpus. Depending on how different the corpora in the database are, this can make a huge difference. Whenever a query is executed over more than one corpus, the more general parent facts table is used, and PostgreSQL will determine which child tables to use on its own.

Another problem introduced by the materialized facts table are duplicated rows. This duplication requires filtering for distinct rows when only a subset of columns is queried. For queries that produce many results this filtering for unique rows is costly. Queries that should be fast to execute, because they do not have any operator, can take more time to execute compared to the normalized schema. In order to approach this problem, special columns where added to the facts table. These two boolean-typed columns n_sample and na_sample are only true for one row belonging to the same node or node annotation. Thus, it is possible to filter by these indexed columns instead of applying a unique filter on the output.

While the previous problems introduced by the facts table can be handled by various optimizations, there are inherent problems of mapping the graph-based data to a relational database. Statistical dependencies, which are introduced by columns that express a range (like the pre-/post-order columns and the ones for token coverage), are such a problem. These range columns are not statistically independent but the PostgreSQL query planner will assume this. Thus, the intermediate result size

6_{See https://www.postgresql.org/docs/9.6/static/ddl-partitioning.html (last accessed}

2017-10-25) for a description how partitioning is implemented using inheritance in PostgreSQL 9.6.

estimation will be skewed, and PostgreSQL might tend to underestimate the number of tuples involved in a join and thus choose wrong join orders and join implementations. PostgreSQL has support for a proper range data type7_{, but there is no actual support}

for statistics on these data-types at this time. As a countermeasure, a custom operator ^=^, that replaces the “equals” operator =, was introduced for same coverage queries. It has the same semantics as the “equals” operator but has a constant selectivity of 0.995. In same coverage queries it is necessary to search for rows where the left_token and right_token columns of both sides are equal.

facts1.left_token = facts2.left_token AND facts1.right_token = facts2.right_token

Since PostgreSQL assumes statistical independence the (accurate) selectivity for both column joins is multiplied, and as a result, the assumed result size is largely underestimated. By using the custom operator ^=^ for one of the sides, PostgreSQL will assume a constant selectivity for the right_token column join and make much better estimates.

facts1.left_token = facts2.left_token AND facts1.right_token ^=^ facts2.right_token

Unfortunately, this approach is not generalizable for queries that use non-equal operators like < or >. Until the PostgreSQL query planner properly supports columns that are statistically dependent on each other, it will underestimate result sizes leading to bad join performance.

Another duplication problem is the indexing of reachability by using pre-/post-order encoding. This is the only graph indexing available for relANNIS, but it only works well in case if the annotation graph components are trees or have very few additional edges compared to the spanning tree of the component. Any extra edge will lead to duplicate entries in the rank table, and since the facts table is joined with rank, it can lead to a much larger number of rows per node in the facts table. There is no easy solution to this, other than allowing multiple ways of indexing graphs in relANNIS, which is not possible in the current SQL schema as it is and would also lead to a much more complicated SQL generation.

4. Graph-based data model for AQL

In document ANNIS: A graph-based query system for deeply annotated text corpora (Page 43-47)