
Natural Language Processing

In document Perspectives on Ontology Learning (Page 67-85)

Diana MAYNARD and Kalina BONTCHEVA Dept. of Computer Science, University of Sheffield, UK


Abstract. This chapter provides a high-level overview of the various NLP processes typically required for an ontology learning system, ranging from low-level linguistic pre-processing, through parsing, term recognition and information extraction. Since ontology learning research tends to reuse many existing NLP tools, this chapter also discusses some of the most widely used, open-source ones, providing references to further reading materials.

Keywords. Natural Language Processing, term recognition

Introduction

Natural Language Processing (NLP) is concerned with building algorithms that understand and generate natural language text (also often referred to as unstructured content).

Since ontology learning methods automatically derive (parts of) an ontology from textual sources, they tend to make use of NLP tools and techniques in order to help with the text analysis. The aim of this chapter is to provide a high-level overview of the various NLP tasks, ranging from low-level linguistic pre-processing, through parsing, term recognition and information extraction. Since ontology learning research tends to reuse many existing NLP tools, this chapter also discusses some of the most widely used, open-source ones, providing references to further reading materials. We have, however, focused primarily on the GATE architecture [23] because, in our opinion, it offers one of the most flexible infrastructures for incorporating NLP components, and provides the widest variety of plugins for different NLP tasks.

As the chapter progresses, the NLP tasks and algorithms become progressively more complex and, consequently, more error-prone. For instance, tokenisation, sentence splitting and part of speech (POS) tagging are typically performed with 95-98% accuracy, entity recognition between 90-95%, and even less for parsing. Domain adaptation is also a major issue, and again performance in different domains varies more widely as the tasks become more complex. Consequently, there is a trade-off between the sophistication of the linguistic analysis and its accuracy, especially on unseen, noisier types of text. As a result, it is advisable, as an integral part of system development, to carry out rigorous quantitative evaluation against a gold standard dataset. It is through such experiments that it is possible to establish the usefulness of each of the NLP processing steps for the ontology learning results. On large datasets, computational complexity and implementation efficiency would also need to be considered.
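As a minimal sketch of what evaluation against a gold standard involves, the following Python function compares system annotations with gold annotations using exact matching and reports precision, recall and F-measure. The annotation format (start offset, end offset, type) and the example data are assumptions made for illustration only; infrastructures such as GATE provide their own evaluation tools.

```python
def precision_recall_f1(gold, predicted):
    """Score predicted annotations against gold ones using exact matching.

    Each annotation is a (start, end, type) tuple; this format is an
    assumption made for the sketch, not a GATE or NLTK data structure.
    """
    gold, predicted = set(gold), set(predicted)
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical annotations: character offsets plus entity type.
gold = {(0, 10, "Person"), (15, 18, "Organisation"), (22, 28, "Location")}
predicted = {
    (0, 10, "Person"),
    (15, 18, "Location"),   # right span, wrong type: counted as an error
    (22, 28, "Location"),
    (30, 34, "Date"),       # spurious annotation
}

p, r, f = precision_recall_f1(gold, predicted)
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

Running the same function after each change to a processing step gives a simple way to check whether that step helps or harms the final result.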

In general, the easiest way to carry out such quantitative experimentation is to build easily reconfigurable NLP pipelines. A typical NLP pipeline consists of a number of tools applied in sequence (tokenisation, sentence splitting, part-of-speech tagging, entity recognition, etc.). To help with the pipeline building and quantitative evaluation tasks, researchers typically use a general purpose NLP infrastructure. Some of the most popular ones are open-source and come with a large number of already implemented NLP tools (e.g. GATE [23], NLTK [46], OpenNLP¹, UIMA [32]). The advantages of reusing NLP tools from such an infrastructure are several:

• They support a variety of text formats including HTML, XML, RTF, email, plain text and in some cases Word, PDF and Excel files. When a document is processed, the format is analysed transparently to the user and converted into a single unified model of annotations. Multilingual support is well-tested and based on Unicode.

• All low-level components within an infrastructure are designed to be interoperable and, consequently, there is no overhead in putting them together into a single pipeline.

• Results storage, evaluation tools, and other facilities are taken care of by the infrastructure. In the case of GATE, this also comes with a graphical development environment, called GATE Developer, which makes it easy to build and test pipelines visually.
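The idea of a reconfigurable pipeline, in which tools are applied in sequence over a shared document model, can be sketched in plain Python as follows. The dictionary-based document and the two deliberately naive stages are assumptions for illustration; they do not reproduce the API of GATE, NLTK, OpenNLP or UIMA.

```python
import re


def tokenise(doc):
    # Naive tokeniser: runs of word characters, or single punctuation marks.
    doc["tokens"] = re.findall(r"\w+|[^\w\s]", doc["text"])
    return doc


def split_sentences(doc):
    # Naive splitter: a sentence ends at ".", "!" or "?".
    sentences, current = [], []
    for token in doc["tokens"]:
        current.append(token)
        if token in {".", "!", "?"}:
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    doc["sentences"] = sentences
    return doc


def run_pipeline(text, stages):
    # Every stage reads from and adds to the same document dictionary, so
    # stages can be swapped, removed or reordered without further changes.
    doc = {"text": text}
    for stage in stages:
        doc = stage(doc)
    return doc


doc = run_pipeline("GATE is open source. So is NLTK.", [tokenise, split_sentences])
print(doc["sentences"])
```

Replacing one stage with an alternative implementation and re-running the evaluation is then a one-line change to the list of stages.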

Since GATE integrates OpenNLP low-level processing components, it is also possible to mix and match GATE modules and OpenNLP modules, as well as visually and quantitatively (i.e. using precision, recall, and F-measure) compare the performance of these alternative implementations, from within the GATE Developer user interface or programmatically via the GATE API.

NLTK [46] is suitable when Python is preferred as a programming language. Given that NLTK was primarily developed for the purpose of teaching NLP, the efficiency of some of the implementations might not be suitable for the processing of very large datasets.

1. Linguistic Pre-Processing Tasks

There are a number of low-level linguistic tasks which form the basis of more complex language processing and ontology learning algorithms. This section will provide an overview and point to some existing open-source implementations which can easily be reused and, in some cases, are easily adaptable as well.

1.1. Tokenisation

Tokenisation is the task of splitting the input text into very simple units, called tokens. Tokenisation is a required step in any linguistic processing application, since more complex algorithms typically work on tokens as their input, rather than using the raw text. Consequently, it is important to use a high-quality tokeniser, as errors are likely to affect the results of all subsequent NLP algorithms.

Commonly distinguished types of tokens are numbers, symbols (e.g., $, %), punctuation and words of different kinds, e.g., uppercase, lowercase, mixed case. Tokenising well-written text is generally reliable and reusable, since it tends to be domain-independent. However, such general purpose tokenisers typically need to be adapted to work correctly with, for example, chemical formulae, Twitter messages, and other more specific text types.

¹ http://incubator.apache.org/opennlp/

One widely used tokeniser is bundled in the open-source ANNIE system in GATE [24]. Similar to the way in which programming languages are tokenised, the ANNIE Tokeniser relies on a set of regular expression rules which are then compiled into a finite-state machine. This differs from most other tokenisers in that it maximises efficiency by doing only very light processing, and enables greater flexibility by placing the burden of deeper processing on the grammar rules, which are more adaptable (see Section 3). For example, there is a specialised set of rules for tokenisation of English, which deals with expressions such as “don’t”, which would by default be tokenised as 3 tokens, whereas for correct POS tagging and syntactic processing they need to be tokenised as “do” and “n’t” as the short form of not. If Python is preferred, NLTK has several similar tokenisers, including one based on regular expressions.
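The following sketch shows how a small set of regular expressions can assign a kind to each token and split “don’t” into “do” and “n’t”. It uses only Python's standard `re` module; the patterns and the kind names are assumptions chosen for illustration and are much simpler than the ANNIE or NLTK rule sets.

```python
import re

# Token kinds follow the types named in the text; the patterns are assumptions.
TOKEN_SPEC = [
    ("number", r"\d+(?:\.\d+)?"),
    # Split negative contractions as "do" + "n't"; otherwise match a plain word.
    ("word", r"[A-Za-z]+(?=n['’]t\b)|n['’]t\b|[A-Za-z]+"),
    ("symbol", r"[$%&+=]"),
    ("punctuation", r"[^\w\s]"),
]
TOKEN_RE = re.compile(
    "|".join(f"(?P<{kind}>{pattern})" for kind, pattern in TOKEN_SPEC)
)


def tokenise(text):
    # lastgroup holds the name of the alternative that matched, i.e. the kind.
    # Whitespace and any character not covered by a pattern are skipped.
    return [(m.group(), m.lastgroup) for m in TOKEN_RE.finditer(text)]


for token, kind in tokenise("I don't owe $5."):
    print(f"{token!r:8} {kind}")
```

Because the rules are ordinary regular expressions, adapting the tokeniser to a new text type means editing or adding a pattern rather than collecting new training data.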

Another freely-available tokeniser is the OpenNLP TokenizerME², which is a trainable maximum entropy tokeniser. It uses a statistical model, based on a training corpus. There is also a method for re-training the tokeniser on new data. However, this dependency on training data means that tokeniser adaptation to new types of text, e.g., Twitter messages, is likely to be expensive, as it would require a substantial amount of training data.

In our experience, human readable tokenisation rules such as those used in the ANNIE and OpenNLP tokenisers tend to be easier to adapt to new languages and text types than those built from statistical models. For example, the following tokeniser rule is from the ANNIE Tokeniser:

"UPPERCASE_LETTER" "LOWERCASE_LETTER"* > Token;orth=upperInitial;kind=word;

It states that the sequence must begin with an uppercase letter, followed by zero or more lowercase letters. This sequence will then be annotated as type ‘Token’. The attribute ‘orth’ (orthography) has the value ‘upperInitial’; the attribute ‘kind’ has the value ‘word’.
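To make the effect of this rule concrete, the sketch below reproduces it with a Python regular expression and emits annotations carrying the same type and attribute values. It is an approximation under stated assumptions: it covers ASCII letters only, uses word boundaries in place of the tokeniser's other rules, and the dictionary layout of an annotation is hypothetical rather than GATE's annotation model.

```python
import re

# One uppercase letter followed by zero or more lowercase letters.
# The \b anchors stand in for the other tokeniser rules, so that strings
# such as "GATE" or "McDonald" are not matched piecewise.
UPPER_INITIAL = re.compile(r"\b[A-Z][a-z]*\b")


def annotate_upper_initial(text):
    annotations = []
    for match in UPPER_INITIAL.finditer(text):
        annotations.append({
            "type": "Token",
            "start": match.start(),
            "end": match.end(),
            "features": {"orth": "upperInitial", "kind": "word"},
        })
    return annotations


text = "Diana works at GATE in Sheffield"
for ann in annotate_upper_initial(text):
    print(text[ann["start"]:ann["end"]], ann["features"])
```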

1.2. Sentence Splitting

Sentence detection (or sentence splitting) is the task of separating text into its constituent sentences. For plain text, sentence splitting is concerned almost exclusively with determining whether a punctuation token (e.g. “.”, “?”, “:”) marks the end of a sentence or not. More complex cases arise when the text being processed contains tables, titles, formulae, or other formatting markup (e.g. HTML tags, hashtags in tweets).

The GATE sentence splitter is a cascade of finite-state transducers which segment the text into sentences, based on a set of rules. This module is required for the POS tagger in GATE, and is often necessary for other low-level processing. It is domain- and application-independent, although some adaptation to text formats might be required. Each sentence is annotated with the type ‘Sentence’ and a separate sentence break annotation is also produced.

² http://incubator.apache.org/opennlp/documentation/manual/opennlp.html


GATE also offers a RegEx splitter, which is based on regular expressions, using the default Java implementation. This has the advantage of being easily customisable by Java programmers. It has three sets of patterns for:

• sentence splits that are part of the sentence, such as sentence-ending punctuation;

• sentence splits that are not part of the sentence, such as 2 consecutive new lines;

• text fragments that might be seen as splits but which should be ignored (such as full stops occurring inside abbreviations).
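The three sets of patterns listed above can be illustrated with a small Python sketch. The concrete patterns, the abbreviation list and the function name are assumptions made for this example; they are not the patterns shipped with the GATE RegEx splitter.

```python
import re

# 1. Splits that are part of the sentence: sentence-ending punctuation.
INTERNAL_SPLIT = re.compile(r"[.!?]+")
# 2. Splits that are not part of the sentence: two consecutive new lines.
EXTERNAL_SPLIT = re.compile(r"\n\s*\n")
# 3. Fragments that look like splits but must be ignored: abbreviations.
NON_SPLIT = re.compile(r"\b(?:Dr|Mr|Mrs|Prof|e\.g|i\.e|etc)\.")


def split_sentences(text):
    protected = [m.span() for m in NON_SPLIT.finditer(text)]

    def is_protected(position):
        return any(start <= position < end for start, end in protected)

    # Each boundary is (end of the sentence, where the next sentence may start).
    boundaries = []
    for m in INTERNAL_SPLIT.finditer(text):
        if not is_protected(m.start()):
            boundaries.append((m.end(), m.end()))    # punctuation is kept
    for m in EXTERNAL_SPLIT.finditer(text):
        boundaries.append((m.start(), m.end()))      # blank line is dropped
    boundaries.sort()

    sentences, start = [], 0
    for end, resume in boundaries:
        sentence = text[start:end].strip()
        if sentence:
            sentences.append(sentence)
        start = max(start, resume)
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


sample = ("Dr. Smith arrived. He spoke, e.g. about GATE.\n\n"
          "A title without punctuation\n\nThe end.")
for sentence in split_sentences(sample):
    print(repr(sentence))
```

A splitter of this kind is adapted by editing the three pattern sets directly, for instance by extending the abbreviation list for a new domain.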

The OpenNLP Sentence Detector is a trainable component designed to run prior to tokenisation. Punctuation marks are required as indicators for sentence boundaries. Consequently, it cannot identify sentence boundaries based on new lines, markup tags, or sentence content, e.g. titles, whereas the GATE ones are more flexible in that respect. The Sentence Detector takes a trained model file as input and produces an array of sentences. Similar to the OpenNLP tokeniser, this might make it harder to adapt to new types of text, compared with the regular expression and rule-based ones, which can be changed directly.

While generally a simple problem for humans, automatic sentence splitting is not without challenges. For instance, abbreviations need to be recognised and dealt with properly, as well as carriage returns and newlines. Some splitters ignore these completely, requiring a punctuation mark as a sentence boundary. Others use two consecutive newlines/carriage returns as an indication of a sentence end, while there are also cases when even a single newline/carriage return character would indicate end of a sentence (e.g. comments in software code or lists which have one entry per line). HTML formatting tags, Twitter hashtags, wiki syntax, and other such special text types are also somewhat problematic for general-purpose sentence splitters which have been trained on well-written corpora, typically newspaper texts.

1.3. POS Tagging

Part-of-Speech (POS) tagging is concerned with tagging words with their part of speech, by taking into account the word itself, as well as the context in which it appears. A key part of this task is the tagset used and the distinctions that it makes. The main categories are verb, noun, adjective, adverb, preposition, etc. However, tagsets tend to be much more specific, e.g. distinguishing between singular and plural nouns. One commonly used tagset is the Penn treebank one [47].

In terms of approaches, researchers have achieved excellent results with Hidden Markov models, rule-based approaches, maximum entropy, and many other methods.

GATE’s English POS tagger [40] is a modified version of the Brill transformational rule-based tagger [7], which produces a part-of-speech tag as an annotation on each word or symbol, using the Penn treebank tagset. The tagger uses a default lexicon and ruleset (the result of training on a large corpus taken from the Wall Street Journal). Both of these can be modified manually if necessary.
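The interplay of a lexicon and a ruleset in a transformational tagger can be sketched as follows. This is a toy illustration of the general Brill-style technique, not GATE's tagger: the lexicon entries are hypothetical and the single contextual rule is a commonly quoted example of this kind of transformation.

```python
# Toy lexicon: the most likely Penn Treebank tag for each known word.
# All entries are hypothetical and chosen only for this example.
LEXICON = {
    "the": "DT", "to": "TO", "we": "PRP", "want": "VBP",
    "race": "NN", "is": "VBZ", "long": "JJ",
}

# Contextual transformation rules: (from_tag, to_tag, required_previous_tag).
RULES = [
    ("NN", "VB", "TO"),   # a noun reading directly after "to" becomes a verb
]


def tag(tokens):
    # Stage 1: initial tags from the lexicon; unknown words default to
    # NNP if capitalised and NN otherwise.
    tags = [
        LEXICON.get(token.lower(), "NNP" if token[0].isupper() else "NN")
        for token in tokens
    ]
    # Stage 2: apply each transformation rule in order, left to right.
    for from_tag, to_tag, previous_tag in RULES:
        for i in range(1, len(tags)):
            if tags[i] == from_tag and tags[i - 1] == previous_tag:
                tags[i] = to_tag
    return list(zip(tokens, tags))


print(tag("We want to race".split()))
print(tag("the race is long".split()))
```

Both the lexicon and the rule list are plain data, which is what makes manual modification of such a tagger feasible.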

Similarly, the OpenNLP POS tagger uses a model learnt from a training corpus to predict the correct POS tag from the Penn treebank tagset. During training, it is possible to build either a maximum entropy or a perceptron-based model.

For Python, NLTK also has an implementation of the Brill tagger, as well as the TnT statistical tagger [6] and the Stanford POS tagger [65].


The accuracy of these general purpose, reusable taggers is typically excellent (97-98%) on texts similar to those on which the taggers have been trained (mostly news articles). However, when presented with new text types or noisier data, the accuracy declines. In some cases, changes to the tagger rules and/or re-training might be required. For instance, Hearst patterns (see Section 6), which are widely used in ontology learning, need reliable POS tags in order to produce high-quality results.

1.4. Stemming and Morphological Analysis

Another set of useful low-level processing components are stemmers and morphological analysers. Stemmers produce the stem form of each word, e.g. “driving” and “drivers” have the stem “drive”, whereas morphological analysis tends to produce the root/lemma forms of the words and their affixes, e.g. “drive” and “driver” for the above examples, with affixes “ing” and “s” respectively.

GATE provides a wrapper for the widely used, open-source Snowball stemmers, which cover 12 European languages (Danish, Dutch, English, Finnish, French, German, Italian, Norwegian, Portuguese, Russian, Spanish and Swedish) and makes them straightforward to combine with the other low-level linguistic components. The stemmers are rule-based [59] and easy to modify, following the suffix-stripping approach of Porter.

NLTK also provides an implementation of the Snowball stemmers for Python.

The English morphological analyser in GATE is also rule-based, with the rule language supporting rules and variables that can be used in regular expressions in the rules. POS tags can be taken into account if desired, depending on a configuration parameter. At the time of writing, OpenNLP and NLTK do not provide morphological analysers.

2. Named entity recognition

Named entity recognition (NER) is used to automatically derive the semantics from textual content, using linguistic and/or statistical knowledge. It consists of the identification of proper names in texts, and their classification into a set of predefined categories of interest. The core set of traditional named entities are Person, Organisation, Location and Date and Time expressions, such as "John Smith", "IBM", "London", "4th August 2011" etc. respectively. Various other types of named entity are frequently included, as appropriate to the application, e.g. newspapers, ships, monetary amounts, percentages etc. NER provides a foundation from which to build more complex IE systems. For example, extracting the relations between entities can provide the means for entity tracking (finding co-references), ontological information (e.g. distinguishing between “Athens, Georgia” and “Athens, Greece”), and scenario building.

Approaches can be divided into pattern-based and statistical extraction [16], although quite often the two techniques are mixed (see e.g. [58][17][63]). Most information extraction (IE) techniques rely on some form of human supervision, with the exception of purely structural IE techniques performing unsupervised machine learning on unannotated documents, e.g. [22]. A survey of information extraction methods from web data is presented in [12]. Language engineering platforms such as GATE enable the modular implementation of techniques and algorithms for information extraction, and allow repeatable experimentation and evaluation of their results.


Linguistic rule-based methods for NER, such as those used in ANNIE, GATE’s information extraction system, comprise a combination of gazetteer lists and hand-coded pattern-matching rules which use contextual information to help determine whether candidate entities are valid, or to extend the set of candidates. The gazetteer lists act as a starting point from which to establish, reject, or refine the final entity to be extracted. A typical processing pipeline consists of linguistic pre-processing (tokenisation, sentence splitting, POS tagging), entity finding (using gazetteers and grammars), co-reference, and finally some kind of export of the results to a database or ontology. Section 3 discusses in more detail some of the techniques for pattern-based rule writing used for NER.

Learning methods for NER can be classified broadly into two main categories: rule learning and statistical learning. The former methods induce a set of rules from training examples, e.g. SRV [34], RAPIER [9], WHISK [64], BWI [35], and LP2 [18]. Statistical systems learn statistical models or classifiers, such as HMMs [55], Maximum Entropy [15], SVM [42] [48] [44] and Perceptron [10][45]. Methods differ widely in the NLP features that they use, including simple features such as token string and capitalisation information, linguistic features such as part-of-speech, semantic information from gazetteer lists, and genre-specific information such as document structure.
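The kinds of features mentioned above can be made concrete with a short sketch that builds a feature dictionary for a single token. The feature names, the context window of one token and the tiny gazetteer are assumptions for illustration; a real system would pass such features, suitably encoded, to a learner such as an SVM or a perceptron.

```python
def token_features(tokens, pos_tags, index, gazetteer):
    """Build a feature dictionary for the token at the given index."""
    token = tokens[index]
    last = len(tokens) - 1
    return {
        # Simple features: token string and capitalisation.
        "string": token.lower(),
        "is_capitalised": token[:1].isupper(),
        "is_all_caps": token.isupper(),
        # Linguistic feature: part of speech.
        "pos": pos_tags[index],
        # Semantic feature: membership of a gazetteer list.
        "in_gazetteer": token in gazetteer,
        # Context features: neighbouring token strings.
        "previous": tokens[index - 1].lower() if index > 0 else "<START>",
        "next": tokens[index + 1].lower() if index < last else "<END>",
    }


# Hypothetical pre-processed input.
tokens = ["John", "Smith", "joined", "IBM"]
pos_tags = ["NNP", "NNP", "VBD", "NNP"]
first_names = {"John"}

for i in range(len(tokens)):
    print(token_features(tokens, pos_tags, i, first_names))
```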

The general approach consists of three stages: linguistic pre-processing to obtain the feature vectors, training or applying classifiers, and finally post-processing the results to tag the documents. There are advantages and disadvantages to the Machine Learning approach compared with a knowledge engineering, rule-based approach. First, large quantities of training data are required, which can be problematic, especially as these need to be relevant to the domain and the set of entities required. If any criteria change (such as a new entity type), then the whole training set may need to be reannotated. On the other hand, ML techniques have the advantage of not requiring specialist language engineers to develop hand-coded rules, which can be time-consuming to develop.

GATE’s general purpose named entity recognition system is ANNIE, which was designed for traditional NER on news texts, but which, being easily adaptable, can form the starting point for new NER applications in other languages and for other domains.

Other well known systems are UIMA³, developed by IBM, which focuses more on architectural support and processing speed, and offers a number of similar resources to GATE; OpenCalais⁴, which provides a web service for semantic annotation of text for traditional named entity types; and LingPipe⁵, which provides a (limited) set of Machine Learning models for various tasks and domains: while these are very accurate, they are not easily adaptable to new applications. Components from all these tools are actually included in GATE, so that a user can mix and match various resources as needed, or compare different algorithms on the same corpus.

3. Pattern-based rule writing

Using pattern matching for Named Entity Recognition (NER) requires the development of patterns over multi-faceted structures that consider many different token properties (e.g. orthography, morphology, part of speech information etc.). Traditional pattern-matching languages such as PERL get “hopelessly long-winded and error prone” [4] when used for such complex tasks. Therefore, attribute-value notations are normally used, which allow for conditions to refer to token attributes arising from multiple analysis levels. Examples of these are the NEA notation [4] and JAPE [23], both of which are declarative notations that allow for context-sensitive rules to be written and for non-deterministic pattern matching to be performed. NEA was used in FACILE, a named entity recognition tool used in the early MUC evaluations, and was then adapted to the needs of the CONCERTO project [3,56], while JAPE is the standard rule-writing mechanism used in GATE.

³ http://uima.apache.org

⁴ http://www.opencalais.com/

⁵ http://alias-i.com/lingpipe/index.html

Traditional rule-based NER is based on a set of linguistic patterns which aim to identify the relevant entities in text. These rely largely on gazetteer lists which provide all or part of the entity, or clues to its existence, in combination with linguistic patterns. For example, a typical rule to identify a person’s name consists of matching the first name of the person via a gazetteer entry (e.g. John), followed by an unknown proper noun (e.g. Smith, which is POS-tagged as a proper noun). In this section we introduce the concept of pattern-based rule writing, using the example of the JAPE language.

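The person-name rule described above (a first name found in a gazetteer, followed by an unknown proper noun) can be approximated in Python as shown below. This is only an illustration of the matching logic, not JAPE syntax; the gazetteer contents, the input format of (token, POS tag) pairs and the reading of “unknown” as “absent from the gazetteer” are assumptions.

```python
# Hypothetical gazetteer mapping known strings to their list type.
GAZETTEER = {
    "John": "first_name",
    "Diana": "first_name",
    "London": "location",
    "IBM": "organisation",
}


def find_persons(tagged_tokens):
    """Return (start index, end index, text) for each matched person name.

    tagged_tokens is a list of (token, Penn Treebank POS tag) pairs.
    """
    persons = []
    i = 0
    while i < len(tagged_tokens) - 1:
        first, _ = tagged_tokens[i]
        second, second_pos = tagged_tokens[i + 1]
        is_first_name = GAZETTEER.get(first) == "first_name"
        # "Unknown" is taken to mean: a proper noun with no gazetteer entry.
        is_unknown_proper_noun = second_pos == "NNP" and second not in GAZETTEER
        if is_first_name and is_unknown_proper_noun:
            persons.append((i, i + 2, f"{first} {second}"))
            i += 2
        else:
            i += 1
    return persons


sentence = [("John", "NNP"), ("Smith", "NNP"), ("visited", "VBD"), ("London", "NNP")]
print(find_persons(sentence))
```

Writing the same rule against raw strings would require encoding the gazetteer lookup and the POS condition by hand for every pattern, which is the motivation for attribute-value notations such as JAPE.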
