8. Conclusion and future work
8.2. Comparison to existing solutions
Execution speed could be improved by a factor of 23 compared to the initial port by changing the generated SQL considerably and by improving the implementation of MonetDB itself (Rosenfeld 2012, p. 81). The most influential change in the SQL was that parts of the results were pushed down in the execution plan manually with the help of Common Table Expressions (CTEs), effectively overriding the query planner's decisions and optimizing the query plan from the application (Rosenfeld 2012, pp. 52 ff.). Compared to PostgreSQL, the improved version of MonetDB executed the queries on the tiger2 corpus in less than a third of the time on a server system and about ten times faster on a desktop system (Rosenfeld 2012, p. 79). While the results of this benchmark support the assumption that a main memory-based system can also query larger corpora like tiger2, they also show that extensive adjustments to the COTS system were needed to make use of the specifics of the domain of linguistic annotation graphs. Also, the benchmarks were performed on a corpus with syntax trees only, a structure that is very well supported by the pre-/post-order encoding used to encode the graph. But this does not solve the problem of mapping different types of annotation graphs efficiently inside the same database.
Using an actual graph database could solve problems introduced by mapping the data to a relational database, but existing solutions like Neo4j do not offer the flexibility to adapt to different graph structures. In the case of Neo4j, finding reachable nodes is implemented by graph traversal², which is an unsuitable approach for several types of annotation graphs, such as those with long paths (see Section 7.2 for an experiment where only graph traversal was used). While it might be possible to work around such limitations, such custom solutions would also need to be adjusted to changes in the underlying platform and thus require continued development effort. Also, a specialized implementation can be a testbed for advanced technologies under active research. Since the domain is more limited but still offers a non-artificial use-case scenario, a linguistic query system could give valuable insight into techniques like graph indexes such as Grail (Yildirim et al. 2010) or Ferrari (Seufert et al. 2013) in future studies.
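The difference between a pre-/post-order encoding and plain graph traversal can be illustrated with a minimal Python sketch. It is not taken from relANNIS, the MonetDB port, or Neo4j; the toy syntax tree and all function names are assumptions made purely for illustration. For a tree, the encoding answers a reachability question with two integer comparisons, while traversal has to follow edges, so its cost grows with the length of the paths.

```python
from collections import deque

# Hypothetical toy syntax tree: parent -> list of children (illustrative only).
TREE = {
    "S": ["NP", "VP"],
    "NP": ["Det", "N"],
    "VP": ["V"],
    "Det": [],
    "N": [],
    "V": [],
}


def pre_post_order(tree, root):
    """Assign a (pre, post) rank pair to every node in one depth-first pass."""
    order = {}
    pre_counter = 0
    post_counter = 0

    def visit(node):
        nonlocal pre_counter, post_counter
        pre = pre_counter
        pre_counter += 1
        for child in tree[node]:
            visit(child)
        order[node] = (pre, post_counter)
        post_counter += 1

    visit(root)
    return order


def reachable_by_order(order, ancestor, descendant):
    """Descendant test with two integer comparisons, no edges are followed."""
    a_pre, a_post = order[ancestor]
    d_pre, d_post = order[descendant]
    return a_pre < d_pre and d_post < a_post


def reachable_by_traversal(tree, source, target):
    """The same test via breadth-first traversal over the outgoing edges."""
    queue = deque(tree[source])
    seen = set()
    while queue:
        node = queue.popleft()
        if node == target:
            return True
        if node not in seen:
            seen.add(node)
            queue.extend(tree[node])
    return False


ORDER = pre_post_order(TREE, "S")
```

Both functions agree on every pair of nodes in this tree, for example `reachable_by_order(ORDER, "NP", "N")` and `reachable_by_traversal(TREE, "NP", "N")` are both true. The sketch only covers trees; for annotation graphs that are not trees, the simple interval test no longer applies directly, which is the limitation the text points to.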
8.2.2. GraphANNIS as open-source community project
The increased development effort that comes with not relying on a COTS solution could be reduced by developing such a tool as an open-source community project. ANNIS is already part of a set of tools around the common data model Salt named “corpus-tools.org”, which are developed in such a community effort (Druskat et al. 2016). Its general data model makes ANNIS applicable to a broad range of corpora from different fields of linguistics and even from other fields that use annotations on texts, for example literary studies. Such broad applicability leads to a larger user base compared to more restricted tools, such as those that support only one type of annotation.
In order to actually allow distributed development, the entry barrier for contributing should be as low as possible. GraphANNIS is written in C++ (the other tools of “corpus-tools.org” are written in the Java programming language), and this can be a problem given that C++ is poorly suited as a programming language for beginners. It should be evaluated whether other programming languages like Rust³ can provide the same memory control and optimization possibilities as C++ while being safer to use for entry-level programmers, who are common in the corpus linguistic community.
² In Robinson et al. (2013, p. 20), querying a graph is equated with graph traversal and the whole
8.2.3. Embeddability of graphANNIS
Some COTS systems like PostgreSQL are based on a client-server infrastructure. Thus, users must either use a central web-based service from an infrastructure provider or install complex server software on their own computers. GraphANNIS does not require such server systems, and end users will be able to install it more easily than relANNIS. Not using server software also allows integrating graphANNIS into other linguistic corpus tools more easily. For instance, by integrating a comprehensive query language into an annotation tool, it is possible to implement an agile corpus creation workflow where annotations are constantly checked for consistency and annotation schemes can be changed more easily (Voormann and Gut 2008). GraphANNIS has already been integrated into the Salt-based Atomic annotation editor (Druskat et al. 2014) as a Java library to support such an agile workflow (Druskat et al. 2017). The query system CWB is equally embeddable, and different customized front-ends make use of it as a query engine (Evert and Hardie 2011). Such front-ends could use graphANNIS, or both systems in parallel, in the future. They could allow accessing corpora in a way that is tailored to a specific use case, instead of the general-purpose approach of the current ANNIS web interface. An example is the CALLIDUS project⁴, which will study how to support the teaching of Latin in schools by providing access to Latin corpora to teachers and pupils. ANNIS will be used as a back-end, but the front-end will be highly customized with the possibility to adapt predefined AQL queries and generate exercises from the results, providing simpler access to the corpora than writing AQL queries directly.
8.2.4. Support for more complex and larger corpora
Another paradigm shift in graphANNIS was to use main memory exclusively for accessing the data. All other corpus query systems described in Section 2.3 are disk-based (except the MonetDB-based implementation of AQL). As shown in Section 7.3, even the largest corpora currently available in relANNIS should be supported on current desktop and notebook hardware. For larger corpora, central server systems which provide a web interface and a REST API could be used. Using main memory directly avoids typical disk-related problems like caching and allows concentrating the optimization efforts on other areas. Partitioning the corpora into edge components also allows applying simpler “caching” strategies, like loading only the components relevant for a query into main memory. Other systems like KorAP have a more conservative disk-based design because they are explicitly designed to handle “very large corpora”
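The idea of loading only the components relevant for a query can be sketched in a few lines of Python. This is a hedged illustration, not the graphANNIS implementation: the class `ComponentStore`, the loader function, the component names and the dictionary that stands in for the disk are all hypothetical.

```python
class ComponentStore:
    """Holds only those edge components in main memory that queries asked for."""

    def __init__(self, loader):
        # Hypothetical callable: component name -> list of edges.
        self._loader = loader
        # Components currently held in main memory.
        self._loaded = {}

    def ensure_loaded(self, needed):
        """Load every needed component that is not yet in memory."""
        for name in needed:
            if name not in self._loaded:
                self._loaded[name] = self._loader(name)

    def edges(self, name):
        """Return the edges of a component that was loaded before."""
        return self._loaded[name]

    def loaded_names(self):
        """Return the names of all components currently in memory."""
        return sorted(self._loaded)


# Hypothetical on-disk corpus, simulated here by a plain dictionary.
FAKE_DISK = {
    "dominance": [("S", "NP"), ("S", "VP")],
    "pointing": [("he", "John")],
    "coverage": [("NP", "tok1")],
}

# Records which components were actually read from the simulated disk.
LOAD_LOG = []


def load_component(name):
    LOAD_LOG.append(name)
    return FAKE_DISK[name]


store = ComponentStore(load_component)
store.ensure_loaded(["dominance"])              # first query needs one component
store.ensure_loaded(["dominance", "coverage"])  # second query reuses it
```

In this sketch, each component is read at most once and the component that no query asked for never enters main memory. A real system would additionally need a policy for unloading components when memory runs short, which is left out here.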
³ https://www.rust-lang.org/ (last accessed 2017-12-18)
8.3. Representativity of the workload