
8. Conclusion and future work

8.4. Future work

This section gives an outlook on what future enhancements and studies could be performed with graphANNIS. First, additional possibilities of enhancing the query execution speed with more optimizations and parallelization are discussed. Then, functional enhancements are proposed.

8.4.1. Additional optimizations

Despite the better performance of graphANNIS compared to relANNIS, there are more optimizations that could be implemented or existing optimizations that could be enhanced. In order to get a better understanding of where the focus of solving these optimization problems should be, more experiments and measurements are needed.

Footnote 7: To the best of the author’s knowledge, this is the largest set of realistic corpus queries that has been made available. Publishing the query sets for published benchmarks is not a matter of course. Sometimes the query set is only vaguely described (for example in Rábara et al. (2017)), only the collection and selection process and the total number of queries is given (for example in Vanroy et al. (2017)), or a small manually designed set of queries is used (like in Meurer (2012) and Rosenfeld (2010)). A larger public data set is given in the appendix of Rosenfeld (2012). It contains 224 queries for the tiger2 corpus, which have been collected from users and which were used in the benchmark.

This could also give more insight into the general problems of main memory-based graph search, and not only into those specific to graphANNIS.

A clear result of the comparison with relANNIS in Section 7.1.2 was that there is a problematic class of queries that has a large influence on the overall performance. These are queries that contain both high- and low-selectivity annotation searches, but where the AQL operator is defined to use the low-selectivity search as its left-hand side (LHS). This can result in query plans where the source iterator yields a large number of results, each of which needs to be checked against the AQL operator condition. A solution to this problem is to provide an inverse operator for each non-commutative AQL operator and to store inverse edges in the graph storages. The results of the benchmark suggest that the speedup relative to relANNIS could be almost doubled if this feature were supported.
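The effect of inverse edges on such a join can be illustrated with a minimal sketch. It is not the graphANNIS implementation: the class `GraphStorage`, the functions `join_forward` and `join_inverse`, and the chain-shaped example graph are all hypothetical names and data chosen for this illustration. The sketch only counts how many candidate edges each join direction has to check.

```python
from collections import defaultdict


class GraphStorage:
    """Toy edge storage (hypothetical); the inverse index is the proposed addition."""

    def __init__(self, edges, with_inverse=False):
        self.outgoing = defaultdict(set)
        self.incoming = defaultdict(set) if with_inverse else None
        for source, target in edges:
            self.outgoing[source].add(target)
            if self.incoming is not None:
                self.incoming[target].add(source)


def join_forward(storage, lhs_nodes, rhs_nodes):
    """Iterate over the LHS matches and test each outgoing edge against the RHS."""
    rhs = set(rhs_nodes)
    checked, result = 0, []
    for source in lhs_nodes:
        for target in storage.outgoing[source]:
            checked += 1
            if target in rhs:
                result.append((source, target))
    return sorted(result), checked


def join_inverse(storage, lhs_nodes, rhs_nodes):
    """Iterate over the RHS matches and follow inverse edges back to the LHS."""
    lhs = set(lhs_nodes)
    checked, result = 0, []
    for target in rhs_nodes:
        for source in storage.incoming[target]:
            checked += 1
            if source in lhs:
                result.append((source, target))
    return sorted(result), checked


# Assumed example data: 1000 nodes in a chain, each pointing to its successor.
edges = [(i, i + 1) for i in range(1000)]
storage = GraphStorage(edges, with_inverse=True)
lhs_matches = list(range(1000))  # annotation search with many matches
rhs_matches = [500]              # annotation search with a single match

forward_result, forward_checks = join_forward(storage, lhs_matches, rhs_matches)
inverse_result, inverse_checks = join_inverse(storage, lhs_matches, rhs_matches)
```

Both functions return the same match pairs, but the forward join inspects one edge per LHS match while the inverse join only inspects the edges that end in the single RHS match. The price is the additional memory for the `incoming` index, which is why the sketch makes it optional.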

8.4.2. Parallelization

One of the areas in the evaluation with mixed results was the impact of parallelization. While thread-based parallelization of joins had a notable effect, the SIMD-based approach was less helpful. It would be interesting to compare the parallelization approaches of more recent versions of PostgreSQL with those of graphANNIS, both to find more possibilities for thread-based parallelization in graphANNIS and to directly compare their efficiency in a benchmark. Also, the current benchmarks used only very few cores. Since modern server systems have more CPUs available, it should be tested which parallelization techniques scale with the number of CPUs. In addition, the behavior of the CPUs should be measured in more detail and more systematically. While there have been ad-hoc measurements during development to find performance-critical parts of the application, more systematic measurements of, for example, cache misses could provide more insight into the effects of parallelization and into how to better optimize such a query system, as was done for the cache-sensitive skip list implementation in Sprenger et al. (2017).

Another approach to parallelization could be to exploit the partitioning of the data into graph components. Currently, only different annotation layers are partitioned into separate graph storages. If a graph inside a graph storage consists of multiple strongly connected components (Cormen et al. 2009, p. 1171), these could be stored separately and queried in parallel. For example, if a graph storage represents a syntax tree, each sentence will be a strongly connected component. Treebank query systems like TGrep2 (Rohde 2005) or TIGERSearch (Lezius 2002) are already able to use this partitioning into sentences. If implemented in graphANNIS based on strongly connected components, this optimization could be applied to all annotation types with a similar structure.
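A minimal sketch of this idea, under several assumptions, could look as follows. The functions `connected_partitions` and `count_dominance_pairs` and the two example sentences are hypothetical. The sketch groups edges into components by ignoring the edge direction (using a union-find structure), which is an assumption of this example about how a sentence tree would be kept together in one partition. Each partition is then queried independently in a thread pool. It only shows the structure of the approach; in Python, threads would not be expected to speed up such CPU-bound work.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def connected_partitions(edges):
    """Group edges by component, ignoring edge direction (union-find)."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for source, target in edges:
        root_s, root_t = find(source), find(target)
        if root_s != root_t:
            parent[root_s] = root_t

    partitions = defaultdict(list)
    for source, target in edges:
        partitions[find(source)].append((source, target))
    return list(partitions.values())


def count_dominance_pairs(partition):
    """Hypothetical per-partition query: count (ancestor, descendant) pairs."""
    children = defaultdict(list)
    for source, target in partition:
        children[source].append(target)

    def descendants(node):
        return sum(1 + descendants(child) for child in children.get(node, []))

    return sum(descendants(node) for node in list(children))


# Assumed example data: two sentences, each a small tree of dominance edges.
edges = [
    ("s1", "np1"), ("s1", "vp1"), ("np1", "tok1"), ("vp1", "tok2"),
    ("s2", "np2"), ("np2", "tok3"),
]
partitions = connected_partitions(edges)
with ThreadPoolExecutor() as pool:
    per_partition = list(pool.map(count_dominance_pairs, partitions))
total_pairs = sum(per_partition)
```

Because no edge crosses a sentence boundary in this example, the per-partition results can simply be added up. An operator whose matches can span several partitions would need a different way of combining the partial results.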

8.4.3. Query language support

The design of graphANNIS allows adding new operator implementations, which could be used to add new features to AQL but also to provide alternative implementations for existing AQL operators. These could be specialized in certain types of corpora, similar to the specialized graph storages. Another possibility would be the support for different query languages. For example, CQP is a popular query language that users might already know (Evert and Hardie 2011). Also, there is a movement to design a generic query language as an ISO standard (Banski et al. 2016). Support for more queries could be achieved by mapping the queries to the same internal data structures that are described in Section 5.2. Since KorAP also supports multiple query languages (Diewald and Margaretha 2016), including a subset of AQL, this would make it easier to compare both systems in a benchmark using the same set of queries. Given the modularity of both the ANNIS and KorAP architectures, support for additional query languages could also make it easier to integrate graphANNIS as a back-end for different existing user interfaces and to combine it with other query execution systems behind the same user interface of a web-service.
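The idea of mapping several query languages to one internal representation can be sketched as follows. The classes `NodeSearch`, `Precedence` and `Query` are hypothetical stand-ins and do not reproduce the data structures of Section 5.2. The two parsers accept only a tiny subset of each language, namely chains of attribute-value searches connected by direct precedence, and they ignore differences in how the languages interpret the quoted values.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeSearch:
    """Hypothetical internal representation of one annotation search."""
    name: str
    value: str


@dataclass(frozen=True)
class Precedence:
    """Hypothetical internal representation of 'left directly precedes right'."""
    left: int
    right: int


@dataclass(frozen=True)
class Query:
    nodes: tuple
    operators: tuple


def _chain(searches):
    """Link consecutive node searches with direct-precedence operators."""
    operators = tuple(Precedence(i, i + 1) for i in range(len(searches) - 1))
    return Query(tuple(searches), operators)


def parse_aql_subset(text):
    """Accepts only chains such as: pos="ART" . pos="NN"."""
    searches = []
    for part in text.split(" . "):
        match = re.fullmatch(r'\s*(\w+)="([^"]*)"\s*', part)
        if match is None:
            raise ValueError(f"unsupported AQL fragment: {part!r}")
        searches.append(NodeSearch(match.group(1), match.group(2)))
    return _chain(searches)


def parse_cqp_subset(text):
    """Accepts only token sequences such as: [pos="ART"] [pos="NN"]."""
    pattern = r'\[(\w+)="([^"]*)"\]'
    if re.fullmatch(rf'\s*(?:{pattern}\s*)+', text) is None:
        raise ValueError(f"unsupported CQP fragment: {text!r}")
    searches = [NodeSearch(n, v) for n, v in re.findall(pattern, text)]
    return _chain(searches)


aql_query = parse_aql_subset('pos="ART" . pos="NN"')
cqp_query = parse_cqp_subset('[pos="ART"] [pos="NN"]')
same_representation = aql_query == cqp_query
```

Since both front ends produce the same `Query` value, everything after parsing (planning, optimization, execution) could be shared, which is also what would make a benchmark with identical query sets across systems straightforward.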

8.4.4. Support for more domains

Another possible extension of graphANNIS is to explore how its design can be used to create a graph query system for other domains that have graphs of similar size and where users also typically query a read-only graph. This would need more flexible support for components and a different set of domain-specific operators. GraphANNIS is much better suited than the original relANNIS implementation for such an extension because its data model is more generic and extensible. Also, partitioning by edge components instead of documents allows much more flexible document structures, which can be useful for other domains as well.

One domain where such flexible structures are needed is the study of text reuse phenomena. Text reuse can have multiple forms, “that range from quotations to allusions and translation” (Berti et al. 2014, p. 1). To study such reuse, text fragments need to be linked with other texts, text fragments or external information like named entity or geographical databases. The fragments themselves often do not belong to just one document but to several, and the documents also have more complex relationships than in the traditional corpus/sub-corpus model. In Berti et al. (2014), RDF is used to model these connections. Representing these links as part of graphANNIS would allow adding additional kinds of annotations and performing analyses based on the combination of these annotations. The design of graphANNIS should allow querying this kind of corpora as fast as more conservatively structured ones.

An open issue in such a scenario of highly linked texts is whether all data needs to be located on the same server. The current single-server system might be well suited to store an even larger number of texts, but copying all the linked information from external databases as explicit annotations, for example, might be impossible due to the recursive nature of the links or legal limitations. A federated search infrastructure like the one proposed as part of the “Common Language Resources and Technology Infrastructure” (CLARIN) organization (Stehouwer et al. 2012) could be used to identify relevant documents and to fetch more complete data sets on demand for a specific query.

In document ANNIS: A graph-based query system for deeply annotated text corpora (Page 132-134)