NLP TOOLS AND FRAMEWORKS - Semantic Indexing via Knowledge Organization Systems: Applying the C

There is a plethora of available NLP tools and frameworks, written in a range of computing languages and platforms and distributed under a range of proprietary and general public licences. Java-based tools such as OpenNLP (http://opennlp.sourceforge.net) and the Stanford NLP tools (http://nlp.stanford.edu) are described as statistical NLP tools based on maximum entropy models, delivering a range of NLP components such as a sentence detector, tokenizer and part-of-speech tagger. Such tools can be deployed standalone or combined into larger NLP frameworks that contribute to larger scale NLP applications. A detailed discussion of such NLP tools is beyond the scope of this thesis; however, the following paragraphs briefly discuss popular NLP frameworks which could be used to support the IE and semantic annotation aims of the thesis.

2.7.1 GATE

General Architecture for Text Engineering (GATE) is an NLP framework that provides the architecture and the development environment for developing and deploying natural language software components (Cunningham et al. 2002). The architecture distinguishes two basic kinds of resources: Language Resources and Processing Resources. Language Resources can be text documents in a wide range of formats (HTML, XML, plain text, MS Word, Open Office, RTF and PDF), while ontologies in OWL-Lite format and lexicons such as WordNet are also regarded as Language Resources. Text documents can be loaded individually as GATE documents or as a collection of documents described as a GATE corpus.

Processing Resources are NLP components made available by the architecture, such as the Tokenizer, Part-of-Speech tagger and Sentence Splitter, as well as Gazetteers, Export modules, specialised taggers etc. The architecture is equipped with a repository of Processing Resources, containing a large variety of plug-ins known as the Collection of Reusable Objects for Language Engineering (CREOLE). Being open source, the architecture is flexible enough to integrate a range of Java-based Processing Resources, which are made available via the CREOLE repository. The GATE community also delivers new plug-ins which support a wide range of NLP needs.

A collection of processing resources organised in a cascading processing order is known as the GATE pipeline or GATE Application. The architecture enables users to name and save applications which can be quickly reloaded into GATE with the associated Language and Processing resources. Offering a rich graphical user interface, the architecture also provides easy access to language, processing, and visual resources that help scientists and developers produce GATE applications.
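The cascading arrangement described above can be illustrated with a minimal sketch. It is not GATE code and does not use the GATE API; the names Document, Annotation, tokenizer, sentence_splitter and run_pipeline are hypothetical and chosen only to mirror the terminology of this section. The sketch assumes that a language resource is raw text with stand-off annotations and that each processing resource is a function applied to the document in turn.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Annotation:
    kind: str          # e.g. "Token" or "Sentence"
    start: int         # character offset where the span begins
    end: int           # character offset just past the span
    features: dict = field(default_factory=dict)


@dataclass
class Document:
    """A toy language resource: raw text plus stand-off annotations."""
    text: str
    annotations: list = field(default_factory=list)

    def of_kind(self, kind):
        return [a for a in self.annotations if a.kind == kind]


def tokenizer(doc):
    # Mark every run of word characters and every single punctuation mark.
    for m in re.finditer(r"\w+|[^\w\s]", doc.text):
        doc.annotations.append(
            Annotation("Token", m.start(), m.end(), {"string": m.group()}))


def sentence_splitter(doc):
    # Naive assumption: a sentence ends at '.', '!' or '?'.
    for m in re.finditer(r"[^.!?]+[.!?]", doc.text):
        leading_space = len(m.group()) - len(m.group().lstrip())
        doc.annotations.append(
            Annotation("Sentence", m.start() + leading_space, m.end()))


def run_pipeline(doc, resources):
    # Apply each processing resource in order; later resources may read
    # the annotations that earlier ones added.
    for resource in resources:
        resource(doc)
    return doc


doc = run_pipeline(Document("A pit was found. It cut the ditch."),
                   [tokenizer, sentence_splitter])
```

Because every resource reads from and writes to the same document, the order of the list passed to run_pipeline plays the role of the cascading processing order.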

The architecture supports a Lucene-based searchable data-store and a Serial data-store for storing Language Resources. In addition, it includes ANNIE (A Nearly-New Information Extraction System), a ready-to-run Information Extraction system. ANNIE consists of processing resources such as a Tokenizer, Sentence Splitter, Part-of-Speech Tagger, Gazetteer and the ANNIE Named Entity transducer, which together provide a fundamental and adaptable framework for Information Extraction. The ANNIE transducer utilises a set of rules in combination with the available gazetteer listings in order to deliver the named entity result.
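The interplay of gazetteer listings and rules can be sketched as follows. This is a simplified illustration and not ANNIE itself: the gazetteer entries, the major types "period" and "feature", the entity type "DatedFeature" and both function names are assumptions made for the example.

```python
import re

# Hypothetical gazetteer listings: entry -> major type.
GAZETTEER = {
    "roman": "period",
    "iron age": "period",
    "ditch": "feature",
    "post hole": "feature",
}


def gazetteer_lookup(text):
    """Return (start, end, major_type) for every gazetteer entry in the text."""
    lookups = []
    lowered = text.lower()
    for entry, major_type in GAZETTEER.items():
        for m in re.finditer(r"\b" + re.escape(entry) + r"\b", lowered):
            lookups.append((m.start(), m.end(), major_type))
    return sorted(lookups)


def entity_rule(text, lookups):
    """Toy rule: a 'period' lookup directly followed by a 'feature' lookup
    becomes one 'DatedFeature' entity spanning both."""
    entities = []
    for (s1, e1, t1), (s2, e2, t2) in zip(lookups, lookups[1:]):
        only_space_between = text[e1:s2].strip() == ""
        if t1 == "period" and t2 == "feature" and only_space_between:
            entities.append(("DatedFeature", text[s1:e2]))
    return entities


text = "An Iron Age ditch and a Roman post hole were recorded."
lookups = gazetteer_lookup(text)
entities = entity_rule(text, lookups)
```

The lookup step only marks where listed terms occur; it is the rule that decides which combinations of lookups count as an entity.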

The language that supports the definition of such IE rules is JAPE (Java Annotation Patterns Engine). A JAPE grammar acts as a finite state transducer, which uses regular expressions over annotations for handling pattern-matching rules (Cunningham, Maynard, and Tablan, 2000). Such expressions are at the core of every rule-based IE system aimed at recognising textual snippets that conform to particular patterns, while the rules enable a cascading mechanism of matching conditions that is usually referred to as the IE pipeline.

JAPE grammars consist of two parts: the LHS (Left Hand Side), which handles the regular expressions, and the RHS (Right Hand Side), which manipulates the results of the matching conditions and defines the semantic annotation output. The architecture allows the integration of user-defined JAPE rules, customised to extract information snippets that satisfy user-specific IE goals.
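The division of labour between the two sides of a rule can be mimicked in a few lines. The sketch below is written in Python rather than JAPE and is only an analogy: the LHS is reduced to a fixed sequence of annotation types (real JAPE patterns are regular expressions over annotations), and the annotation types "Period", "Find" and "DatedFind", like the function names, are hypothetical.

```python
def match_lhs(annotations, pattern):
    """LHS: yield each run of consecutive annotations whose types equal `pattern`."""
    n = len(pattern)
    for i in range(len(annotations) - n + 1):
        window = annotations[i:i + n]
        if [a["type"] for a in window] == pattern:
            yield window


def dated_find(window):
    """RHS: build one output annotation spanning the matched annotations."""
    period, find = window
    return {
        "type": "DatedFind",
        "start": period["start"],
        "end": find["end"],
        "period": period["string"],
        "find": find["string"],
    }


def apply_rule(annotations, lhs, rhs):
    # The LHS decides where the rule fires; the RHS defines what is output.
    return [rhs(window) for window in match_lhs(annotations, lhs)]


# Input annotations for "Roman pottery and medieval tile", ordered by offset.
annotations = [
    {"type": "Period", "start": 0, "end": 5, "string": "Roman"},
    {"type": "Find", "start": 6, "end": 13, "string": "pottery"},
    {"type": "Token", "start": 14, "end": 17, "string": "and"},
    {"type": "Period", "start": 18, "end": 26, "string": "medieval"},
    {"type": "Find", "start": 27, "end": 31, "string": "tile"},
]
results = apply_rule(annotations, ["Period", "Find"], dated_find)
```

Keeping the matching condition separate from the output action means the same pattern can be reused with a different RHS, which is the property that makes user-defined rules easy to customise.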

2.7.2 UIMA

Unstructured Information Management Architecture (UIMA) (Ferrucci and Lally 2004) is a language processing framework aimed at analysing large amounts of text and other forms of unstructured information. The framework concentrates on performance and scalability, with an emphasis on standards. UIMA originated at IBM but has since become an open source project incubated by the Apache Software Foundation, while its technical specifications are developed by the Organization for the Advancement of Structured Information Standards (OASIS).

The architecture enables document processing applications (Analysis Engines) which encapsulate components (annotators). UIMA components can be written in different programming languages, currently Java and C++, while the architecture allows installation of components from repositories. A standard data structure, the Common Analysis System (CAS), is operated on by the Analysis Engines. The CAS includes both text and annotations and supports interoperability by using the XML Metadata Interchange (XMI) standard.

An important architectural principle of UIMA is the strong typing of annotations and their features. Each Analysis Engine must declare which annotation types it supports, which features each annotation type has, and what type of value each feature may take, e.g. primitive, array or reference to another annotation type. Strong typing enables the architecture to check that the output of one component carries the right annotation types for input to the next component.
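The effect of such declarations can be shown with a small sketch. It does not use the UIMA API or its type system descriptors; TYPE_SYSTEM, Component, check_annotation and check_pipeline are hypothetical names, and the two declared annotation types are assumptions made for the example. The sketch checks two things: that an annotation only carries declared features with values of the declared type, and that every component finds the annotation types it needs among those produced upstream.

```python
# Hypothetical type system: annotation type -> {feature name: value type}.
TYPE_SYSTEM = {
    "Token": {"pos": str},
    "Entity": {"label": str, "confidence": float},
}


class Component:
    def __init__(self, name, inputs, outputs):
        self.name = name
        self.inputs = set(inputs)    # annotation types the component requires
        self.outputs = set(outputs)  # annotation types the component produces


def check_annotation(ann_type, features):
    """Raise TypeError if the annotation does not conform to TYPE_SYSTEM."""
    if ann_type not in TYPE_SYSTEM:
        raise TypeError(f"undeclared annotation type: {ann_type}")
    declared = TYPE_SYSTEM[ann_type]
    for name, value in features.items():
        if name not in declared:
            raise TypeError(f"{ann_type} has no feature '{name}'")
        if not isinstance(value, declared[name]):
            raise TypeError(
                f"{ann_type}.{name} expects {declared[name].__name__}")


def check_pipeline(components):
    """Return (component name, missing types) for every unmet input."""
    available, problems = set(), []
    for component in components:
        missing = component.inputs - available
        if missing:
            problems.append((component.name, sorted(missing)))
        available |= component.outputs
    return problems


tokenizer = Component("tokenizer", inputs=[], outputs=["Token"])
ner = Component("ner", inputs=["Token"], outputs=["Entity"])

ok = check_pipeline([tokenizer, ner])          # inputs are satisfied
broken = check_pipeline([ner, tokenizer])      # "ner" runs before its input exists
```

The checks reject a misconfigured pipeline before any text is processed, which also illustrates why a strongly typed approach requires more up-front declaration than a loosely typed one.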

2.7.3 SProUT

Shallow Processing with Unification and Typed feature structures (SProUT) is a platform for the development of multilingual text processing systems (Drozdzynski et al. 2004). The platform is not as popular as GATE and UIMA, but it has been adopted as the core IE component in several EU-funded and industrial projects, mainly originating from Germany and Poland. SProUT is developed by the German Research Centre for Artificial Intelligence (DFKI) and, while not open source, can be used for research purposes free of charge. The motivation behind the development of SProUT relates to the trade-off between processing efficiency and expressiveness of grammar rules.

The platform utilises unification-based grammar formalisms which are designed to capture fine-grained syntactic and semantic details. Such formalisms use as their informational domain a system based on features and values. The main characteristic of the platform is that it allows the use of rich descriptive rules over linguistic structures, which enable information sharing in the form of features among rule elements.
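The basic operation behind such formalisms, unification of feature structures, can be sketched as follows. This is a deliberately reduced illustration and not SProUT's implementation: feature structures are represented as nested Python dictionaries, and the type hierarchy and shared (co-referenced) values of typed feature structures are left out. The function name unify and the example features are assumptions.

```python
def unify(a, b):
    """Unify two feature structures given as nested dicts.

    Returns the merged structure, or None if two atomic values clash.
    """
    result = dict(a)
    for key, b_val in b.items():
        if key not in result:
            # Only b constrains this feature: copy it over.
            result[key] = b_val
            continue
        a_val = result[key]
        if isinstance(a_val, dict) and isinstance(b_val, dict):
            # Both sides are complex values: unify them recursively.
            sub = unify(a_val, b_val)
            if sub is None:
                return None
            result[key] = sub
        elif a_val != b_val:
            # Conflicting values: unification fails.
            return None
    return result


noun_phrase = {"cat": "np", "agr": {"num": "sg"}}
constraint = {"agr": {"num": "sg", "per": 3}}
merged = unify(noun_phrase, constraint)

clash = unify({"agr": {"num": "sg"}}, {"agr": {"num": "pl"}})
```

Successful unification accumulates the information of both structures, while a clash on any feature rejects the combination; this is how features are shared and checked among rule elements.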

2.7.4 The Adopted Framework

All three frameworks discussed above have their merits but also their weak points. SProUT enables the definition of fine-grained rules, but it is not a popular platform and has limited documentation and community support. UIMA, on the other hand, might be a robust and scalable framework, but its strongly typed approach does not lend itself easily to prototype, exploratory or rapid development. In addition, it has a steep learning curve, since it relies on the Eclipse integrated development environment (IDE) for GUI support and on third party NLP tools for delivering language processing tasks. GATE supports rapid development via the ready-to-run ANNIE system and a unified GUI environment which controls all aspects of the development (language, processing and data-store resources); however, performance and scalability are not its strongest points.

Considering the merits and weak points of the above frameworks, this research study adopts GATE as the core IE platform of the project. In detail, GATE supports rapid prototype and exploratory development, allowing the use of loosely typed annotation types and features while making available a vast range of NLP plug-ins, including ontology and terminology (gazetteer) components. Thus, it fits well with the exploratory nature of the project and the requirement to deliver semantic annotation with respect to ontologies using terminological resources. In addition, the PhD work, being a research and not a commercial project, does not present any significant performance requirements. GATE is therefore suitable for handling the volume of grey literature documents, since processing time is not a top priority.

Furthermore, the platform has been in development for more than 10 years and has matured through use in a range of projects. It is also supported by a strong community and by available online documentation (tutorials, user forums, mailing lists etc.). Regarding training, the GATE team organises annual summer schools which help developers to obtain new skills and discuss issues with their applications. The author participated in two GATE summer schools, in 2009 and 2010, which significantly helped to improve skills and to develop the final Semantic Annotation application.

In document Semantic Indexing via Knowledge Organization Systems: Applying the CIDOC-CRM to Archaeological Grey Literature (Page 50-54)