Combining terms with operators - ANNIS Query Language (AQL)

3. Existing graph-based data models for representing and querying lin-

3.2. ANNIS Query Language (AQL)

3.2.3. Combining terms with operators

It is possible to combine multiple terms in one query by separating these terms with the symbol “&”. For example,

cat="S" & "storm" & #1 >* #2

searches for the terms cat="S" and "storm", generates the Cartesian product of both results and then filters it by using the predicate #1 >* #2. This example predicate is composed of the binary operator >* and its two operands #1 and #2. Binary operators have two operands, a left-hand side (LHS) and a right-hand side (RHS). In this example #1 is the LHS and #2 is the RHS. Each annotation search term is implicitly numbered and thus cat="S" can be referenced as first term #1 and "storm" as second term #2. The operator >* only includes pairs of annotations that are connected by a path of dominance edges of arbitrary length. It is possible to use a more convenient AQL syntax and abbreviate queries by directly writing operators between two terms. For example, the above query could also be written as

cat="S" >* "storm"
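The evaluation order described above (search each term, build the Cartesian product, filter with the operator predicate) can be illustrated with a minimal Python sketch. The toy graph, the node names and the helper functions `search` and `dominates_transitively` are assumptions made for this illustration and are not part of ANNIS or graphANNIS. The sketch also assumes that >* requires a path of at least one dominance edge.

```python
from itertools import product

# Hypothetical toy annotation graph: node id -> annotations.
nodes = {
    "s1": {"cat": "S"},
    "s3": {"cat": "NP"},
    "s4": {"cat": "VP"},
    "t1": {"tok": "That"},
    "t2": {"tok": "is"},
    "t6": {"tok": "storm"},
}

# Hypothetical dominance edges: parent -> children.
dominance = {
    "s1": ["s3", "s4"],
    "s3": ["t1"],
    "s4": ["t2", "t6"],
}


def search(key, value):
    """Return all nodes that carry the annotation key=value."""
    return [n for n, annos in nodes.items() if annos.get(key) == value]


def dominates_transitively(src, dst):
    """True if dst is reachable from src via one or more dominance edges."""
    stack = list(dominance.get(src, []))
    seen = set()
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(dominance.get(node, []))
    return False


def query():
    """Naive evaluation of: cat="S" & "storm" & #1 >* #2"""
    first = search("cat", "S")       # term #1
    second = search("tok", "storm")  # term #2
    # Cartesian product of both result lists, filtered by the predicate.
    return [(a, b) for a, b in product(first, second)
            if dominates_transitively(a, b)]


matches = query()
```

This naive strategy only mirrors the description in the text. A real query engine would typically avoid materializing the full Cartesian product.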

AQL contains numerous binary and unary operators that express constraints on linguistic phenomena. These operators are defined on a graph-based data model, but their semantics are primarily defined by linguistic annotation concepts rather than by graph operators. New operators can be added to AQL when needed by users. An up-to-date list of available operators can be found in the most recent version of the ANNIS user guide (Zeldes 2016a). In the following, some operators that are currently available in AQL are described. This list is not complete and only covers operators that are supported in graphANNIS.

Pointing relation operator “->type”

Pointing relations are relations between any type of annotations and correspond to edges of type SPointingRelation in Salt. They are explicitly typed. Thus, a relation must have a name. The pointing relation operator is written as ->type, where “type” is the name of its type. This form is without a range parameter and corresponds to a single edge in the annotation graph. A range parameter can be added to the operator after the type name (possibly delimited with either a space or comma character). In this case, the operator does not define a single edge but a path of edges of the same given type. A range definition can be either:

• * for a path of edges with this type of any length (for example pos=/P.*/ & pos=/V.FIN/ & #1 ->dep* #2) or

• m,n where m is the minimal length of the path and n is the maximal length (for example pos=/P.*/ & pos=/V.FIN/ & #1 ->dep,2,5 #2).

Pointing relation components of a named type are not allowed to have cycles. The combination of several pointing relations with different named types can contain cycles, and that is why the name of the type has to be given in the operator definition. For pointing relation queries without a range parameter, the relation can additionally be constrained to carry a specific edge annotation. The edge annotation definition is written in square brackets after the type, and the syntax is the same as for node annotations, for example, pos=/P.*/ & pos=/V.FIN/ & #1 ->dep[func="sbj"] #2.
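The difference between the single-edge form, the ranged form and the edge annotation constraint can be sketched in Python as follows. The edge list, the node names and the functions `connected` and `direct_edge` are hypothetical and only serve as an illustration. They do not reproduce the graphANNIS implementation. The sketch relies on the property stated above that the component of one named type is cycle-free, so a path can never be longer than the number of edges.

```python
# Hypothetical typed pointing-relation edges:
# (source, type, target, edge annotations)
edges = [
    ("v1", "dep", "n1", {"func": "sbj"}),
    ("n1", "dep", "d1", {"func": "det"}),
    ("d1", "dep", "a1", {"func": "mod"}),
    ("v1", "coref", "a1", {}),
]


def connected(source, target, edge_type, min_len=1, max_len=1):
    """Check for a path of `edge_type` edges whose length lies in
    [min_len, max_len]. max_len=None stands for the * range.
    The default (1, 1) corresponds to the single-edge form ->type."""
    # Assumption: components are acyclic, so len(edges) bounds any path.
    limit = len(edges) if max_len is None else max_len
    frontier = {source}
    for length in range(1, limit + 1):
        # All nodes reachable with exactly `length` edges of this type.
        frontier = {t for (s, ty, t, _) in edges
                    if ty == edge_type and s in frontier}
        if not frontier:
            return False
        if length >= min_len and target in frontier:
            return True
    return False


def direct_edge(source, target, edge_type, edge_anno=None):
    """Single-edge form with an optional edge annotation constraint,
    as in ->dep[func="sbj"]."""
    wanted = edge_anno or {}
    return any(
        s == source and t == target and ty == edge_type
        and all(annos.get(k) == v for k, v in wanted.items())
        for (s, ty, t, annos) in edges
    )
```

In this sketch, `connected("v1", "a1", "dep", 1, None)` plays the role of #1 ->dep* #2, `connected("v1", "a1", "dep", 2, 5)` the role of #1 ->dep,2,5 #2, and `direct_edge("v1", "n1", "dep", {"func": "sbj"})` the role of #1 ->dep[func="sbj"] #2. Tracking the set of nodes per path length, instead of a single visited set, keeps longer paths to an already reached node available for the range check.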

Dominance operator “>”

The dominance operator is similar to the pointing relation operator, but it corresponds to relations of the Salt type SDominanceRelation. Dominance relations are typed too, but the declaration of the type is optional for the dominance operator. Thus, the combination of all named components of dominance relations must still be cycle-free. As with the pointing relation operator, paths of unspecified length can be expressed with >* (or >type* for selecting a dominance relation of a specific type), and a path with a ranged length can be specified with >m,n. Again, edge annotations can be given in square brackets for the single-edge variant of the operator: >[func="SB"].

Precedence operator “.”

A precedence relationship is defined over the stream of tokens. Two tokens are in a precedence relation if the token on the LHS of the operator is located directly before the token on the RHS. For example, in the tokenized sentence “[That] [is] [a] [Category] [3] [storm]” the token “is” precedes the token “a” and the corresponding AQL query would be "is" . "a". The precedence operator allows the same range argument as the pointing relation operator: .* marks an arbitrarily long distance between two tokens and .m,n marks a specified range m..n. Due to performance concerns, the legacy ANNIS implementations limited the maximal distance for the .* operator to 50 (while still allowing explicit ranges that define a larger distance). If a corpus uses multiple segmentations (for example by adding explicit SOrderRelation edges), the typed form .type of the operator can be used to specify the name of the segmentation.

Precedence is not only defined for tokens directly but also for all other annotation nodes since they are always directly or indirectly connected to a set of covered tokens. For such non-tokens, the right-most covered token is used as anchor point if the annotation is the LHS of the precedence operator and the left-most token for annotations on the RHS of the operator. For example, in Figure 3.3 the node s3 with the NP annotation is precedent with distance 1 to the node s4 with the VP annotation because the right-most covered token of s3 (“That”) is precedent to the left-most covered token of s4 (“is”). The typed form of the precedence operator can only be used to compare tokens which are part of the SOrderRelation directly, not to compare non-token nodes.

(a) same text _=_ (b) overlap _o_ (c) inclusion _i_

Figure 3.6.: Example spans for the different text coverage operators. The top span is the LHS, and the bottom span is the RHS.
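The anchor rule for the precedence operator described above can be made concrete with a small Python sketch. The token identifiers t1 to t6, the `covered` mapping and the function `precedes` are assumptions for this illustration. In particular, the set of tokens covered by s4 is assumed here and is not taken from Figure 3.3. The legacy limit of 50 for .* is not modeled.

```python
tokens = ["That", "is", "a", "Category", "3", "storm"]

# Hypothetical coverage: node id -> set of covered token positions.
covered = {
    "s3": {0},              # NP node covering "That"
    "s4": {1, 2, 3, 4, 5},  # VP node, assumed to cover the rest
}
# Each token t1..t6 covers exactly its own position.
for position in range(len(tokens)):
    covered[f"t{position + 1}"] = {position}


def precedes(lhs, rhs, min_dist=1, max_dist=1):
    """Untyped precedence check between two nodes.

    The LHS is anchored at its right-most covered token and the RHS
    at its left-most covered token. The default (1, 1) is direct
    precedence (.), max_dist=None stands for .* and other values
    for the ranged form .m,n.
    """
    distance = min(covered[rhs]) - max(covered[lhs])
    if distance < min_dist:
        return False
    return max_dist is None or distance <= max_dist
```

With these definitions, `precedes("t2", "t3")` corresponds to the query "is" . "a", and `precedes("s3", "s4")` reproduces the NP/VP example with distance 1.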

Text coverage operators “_=_”, “_o_” and “_i_”

As described above, each annotation node covers a specific set of tokens (which in turn cover a specific part of the original textual data source). The text coverage operators allow comparing these sets with each other. For the same text operator _=_ the sets of covered tokens must be equal, for the overlap operator _o_ it is sufficient that the two sets have a non-empty intersection, and the inclusion operator _i_ filters annotation nodes where all covered tokens of the RHS are contained in the set of the LHS. An example for the different text coverage operators is given in Figure 3.6. These operators cannot be parameterized and are always defined on the untyped token layer and not on any named segmentation.
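Since the three text coverage operators are defined on sets of covered tokens, they map directly onto set operations. The following Python sketch illustrates this. The function names and the example spans are hypothetical and chosen only for this illustration.

```python
def same_text(lhs, rhs):
    """_=_ : both nodes cover exactly the same tokens."""
    return lhs == rhs


def overlap(lhs, rhs):
    """_o_ : the covered token sets share at least one token."""
    return bool(lhs & rhs)


def inclusion(lhs, rhs):
    """_i_ : every token covered by the RHS is also covered by the LHS."""
    return rhs <= lhs


# Hypothetical spans, given as sets of covered token positions.
span_a = frozenset({1, 2, 3})
span_b = frozenset({1, 2, 3})
span_c = frozenset({3, 4})
span_d = frozenset({2, 3})
```

For these spans, `span_a` and `span_b` satisfy all three operators, `span_a` and `span_c` only overlap, and `span_d` is included in `span_a` but not the other way round.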

In document ANNIS: A graph-based query system for deeply annotated text corpora (Page 39-41)