Chapter 2: Natural Language Processing and Knowledge Engineering Background
2.1 An Overview of Natural Language Processing
2.1.2 Natural Language Processing Tools
NLP is a range of computational techniques for analysing and representing naturally occurring
texts at one or more levels of linguistic analysis for the purpose of achieving human-like
language processing for a range of tasks or applications (Liddy, 2001). Most NLP tools are
influenced by relatively new areas such as Machine Learning, Computational Statistics and
Cognitive Science, and annotate words and terms in order to identify real-world objects and
their relationships in the text (Madnani and Dorr, 2010). Many open source NLP tools are
available for the semantic annotation of textual documents, such as OpenNLP (Baldridge, 2005),
NLTK (Loper and Bird, 2002), GATE (Cunningham et al., 2002), and Stanford CoreNLP
(Manning et al., 2014). The following paragraphs describe some of these tools in more detail.
2.1.2.1 OpenNLP
OpenNLP is a free toolkit that creates linguistic annotations of text using statistical models
based on maximum entropy (Baldridge, 2005). It integrates many NLP tasks, including
tokenization, sentence segmentation, part-of-speech (PoS) tagging, named entity recognition
(NER), chunking and co-reference resolution.
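To make the idea of a maximum entropy model concrete, the following minimal Python sketch computes the probability of each candidate PoS tag as a normalised exponential of summed feature weights. It is an illustration only and is not OpenNLP code: the function name, the feature names and the weights are all hypothetical, and the weights are chosen by hand rather than learned from data.

```python
import math


def maxent_probabilities(active_features, weights, labels):
    """Return p(label | context) under a maximum entropy (log-linear) model.

    active_features: feature names observed in the context (hypothetical names).
    weights: dict mapping (feature, label) to a weight; missing pairs count as 0.
    """
    scores = {
        label: sum(weights.get((feat, label), 0.0) for feat in active_features)
        for label in labels
    }
    # Subtract the largest score before exponentiating, for numerical stability.
    top = max(scores.values())
    exp_scores = {label: math.exp(score - top) for label, score in scores.items()}
    z = sum(exp_scores.values())  # normalisation constant for this context
    return {label: value / z for label, value in exp_scores.items()}


# Hand-picked toy weights for tagging the word "book" when it follows "the".
WEIGHTS = {
    ("prev=the", "NN"): 2.0,
    ("prev=the", "VB"): -1.0,
    ("word=book", "NN"): 0.5,
    ("word=book", "VB"): 0.5,
}

probs = maxent_probabilities(["prev=the", "word=book"], WEIGHTS, ["NN", "VB"])
best_tag = max(probs, key=probs.get)
```

With these toy weights the noun reading receives the higher probability, because the feature for the preceding determiner favours it. In a trained model the weights would be estimated from an annotated corpus.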
The OpenNLP toolkit has been used in different studies and domains. For example, Kang et al.
(2011) used OpenNLP to develop a biomedical corpus for noun phrase (NP) chunking.
Separately, Ek et al. (2011) described an approach to named entity detection in a Swedish
SMS corpus. They
used the OpenNLP toolkit to annotate the SMS corpus through applying part-of-speech (PoS)
is provided as a full package, making it hard to use in our system if only a few components
are needed.
2.1.2.2 NLTK
The Natural Language Toolkit (NLTK) is a Python package for natural language processing
(Loper and Bird, 2002). NLTK provides interfaces for text processing, linguistic structure
analysis and access to large corpus collections. It includes libraries and programs for NLP
components such as tokenization, PoS tagging, parsing, chunking, semantic analysis,
classification and clustering.
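As an illustration of one of these components, the sketch below groups PoS-tagged tokens into noun-phrase chunks using the pattern "optional determiner, any adjectives, one or more nouns". It is written in plain Python rather than with NLTK itself (NLTK offers a regular-expression chunker, `nltk.RegexpParser`, for this purpose); the function name, the example sentence and its hand-assigned tags are assumptions made for the example.

```python
def np_chunk(tagged_tokens):
    """Group (word, tag) pairs into NP chunks: optional DT, any JJ*, one or more NN*.

    Tokens that do not start a noun phrase are returned as single-token
    chunks labelled with their own PoS tag.
    """
    chunks = []
    i, n = 0, len(tagged_tokens)
    while i < n:
        j = i
        if tagged_tokens[j][1] == "DT":  # optional determiner
            j += 1
        while j < n and tagged_tokens[j][1].startswith("JJ"):  # adjectives
            j += 1
        k = j
        while k < n and tagged_tokens[k][1].startswith("NN"):  # nouns
            k += 1
        if k > j:  # at least one noun was found, so emit a noun phrase
            chunks.append(("NP", [word for word, _ in tagged_tokens[i:k]]))
            i = k
        else:  # no noun phrase starts here; emit the token on its own
            chunks.append((tagged_tokens[i][1], [tagged_tokens[i][0]]))
            i += 1
    return chunks


# Hypothetical sentence with hand-assigned Penn Treebank style tags.
SENTENCE = [
    ("the", "DT"), ("main", "JJ"), ("rotor", "NN"), ("blade", "NN"),
    ("was", "VBD"), ("replaced", "VBN"),
]

chunks = np_chunk(SENTENCE)
```

Here the first four tokens are grouped into a single NP chunk, while the two verbs remain individual tokens. Hierarchical chunking would apply further patterns of this kind to the output of earlier ones.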
Many researchers have used NLTK for natural language analysis and knowledge extraction. For
example, Mckenzie et al. (2010) introduced a novel information extraction application that
extracts data from helicopter maintenance records to populate a database. They used NLTK to
implement partial parsing of the text by means of hierarchical chunking. Additionally,
Stoyanchev et al. (2008) developed a question answering system that employed NLTK to
analyse questions linguistically. Moreover, Sætre (2006) presented an approach for finding
biologically relevant information on protein interactions on the internet. It was developed
using NLTK components for data selection, tokenization, PoS tagging and stemming. The main
disadvantage of NLTK (Loper and Bird, 2002) is that it is an open source package implemented
in Python: although Python has most of the functionality needed to perform simple NLP tasks,
it is not powerful enough for most standard NLP tasks (Madnani, 2007).
2.1.2.3 GATE
The General Architecture for Text Engineering (GATE) (Bontcheva et al., 2004, Cunningham
et al., 2009) is a publicly available system for processing natural language. It is
implemented in Java (Gosling et al., 2005) and was developed at the University of Sheffield.
It is an
them for semantic annotation. The GATE architecture includes an IE component set called
ANNIE (A Nearly-New Information Extraction). ANNIE contains a set of processing resources
that apply algorithms for extracting information from unstructured text. The major processing
resources provided by the ANNIE plug-in are: English tokenizer, Gazetteer, sentence splitter,
part-of-speech (PoS) tagger, named entity (NE) transducer, Java Annotations Pattern Engine
(JAPE) transducer, and orthographic co-reference (Cunningham et al., 2009).
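The way such processing resources build up annotations over a text can be sketched as follows. This is a minimal Python illustration of three of the stages (tokenizer, sentence splitter and gazetteer lookup) that records each annotation as a typed span of character offsets. It is not GATE or ANNIE code, which is written in Java: the `Annotation` tuple, the function names, the naive splitting rules and the two-entry gazetteer are all assumptions made for the example.

```python
import re
from collections import namedtuple

# A typed span over the text: kind, start offset (inclusive), end offset (exclusive), label.
Annotation = namedtuple("Annotation", ["kind", "start", "end", "label"], defaults=[""])


def tokenize(text):
    """Annotate each word and each punctuation mark as a Token."""
    return [Annotation("Token", m.start(), m.end())
            for m in re.finditer(r"\w+|[^\w\s]", text)]


def split_sentences(text):
    """Naive rule: a sentence ends at '.', '!' or '?' followed by whitespace or end of text."""
    sentences, start = [], 0
    for m in re.finditer(r"[.!?](?=\s|$)", text):
        sentences.append(Annotation("Sentence", start, m.end()))
        start = m.end()
        while start < len(text) and text[start].isspace():
            start += 1  # the next sentence begins after the whitespace
    if start < len(text):  # trailing text without final punctuation
        sentences.append(Annotation("Sentence", start, len(text)))
    return sentences


def gazetteer_lookup(text, gazetteer):
    """Annotate every whole-word occurrence of a gazetteer phrase as a Lookup."""
    found = []
    for phrase, label in gazetteer.items():
        for m in re.finditer(r"\b" + re.escape(phrase) + r"\b", text):
            found.append(Annotation("Lookup", m.start(), m.end(), label))
    return sorted(found, key=lambda a: a.start)


TEXT = "GATE was developed at the University of Sheffield. It is written in Java."
GAZETTEER = {"University of Sheffield": "organization", "Java": "language"}  # toy list

annotations = tokenize(TEXT) + split_sentences(TEXT) + gazetteer_lookup(TEXT, GAZETTEER)
lookups = [(TEXT[a.start:a.end], a.label) for a in annotations if a.kind == "Lookup"]
```

Storing annotations as offsets rather than modifying the text lets later stages, such as an NE transducer, combine the output of earlier ones.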
The advantages of using the GATE architecture have been shown by a number of studies.
Feilmayr et al. (2009) proposed that a rule/ontology-based IE system built on GATE can be
used for analysing tourism websites and extracting structured data from accommodation
webpages. Saggion et al. (2007) created an ontology-based IE system for the business domain
using standard and adapted processing resources from GATE. Joshi et al. (2012) used GATE for
IE to show the advantages of using social networking sites such as Twitter in the marketing
domain. The study also created a domain-specific NER for classifying named entities in
Twitter posts from buyers and sellers. These posts (a collection of tweets) are analysed and
processed using GATE components (e.g. English Tokenizer, Sentence Splitter, PoS tagger,
Gazetteer and NE transducer) to acquire data that can give farmers and merchants useful
ideas. However, the main drawback of GATE (Bontcheva et al., 2004, Cunningham et al., 2009)
is that it can only be used as a full package; it is therefore not suitable for our system,
since only some of its components need to be implemented in APELS.
2.1.2.4 Stanford CoreNLP
Stanford CoreNLP is an open source toolkit composed of a set of NLP tools for processing
English texts (Manning et al., 2014). It can give the base forms of words,
their parts of speech, whether they are names of companies, people, etc., normalize dates, times,
dependencies, indicate which noun phrases refer to the same entities, indicate sentiment, extract
particular or open-class relations between mentions, etc.
Stanford CoreNLP is designed to be highly flexible and extensible. A single option controls
which tools are enabled and which are disabled, so it is easy to apply a set of linguistic
analysis tools to a piece of text.
Stanford CoreNLP integrates many of Stanford’s NLP tools, including the part-of-speech (PoS) tagger, the named entity recognizer (NER), the parser, the coreference resolution system,
sentiment analysis, bootstrapped pattern learning and the open information extraction tools.
Moreover, an annotator pipeline can include additional custom or third-party annotators.
CoreNLP’s analyses provide the foundational building blocks for higher-level and domain-specific text understanding applications.
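The design of a pipeline whose stages are selected by a single option, and which accepts custom annotators, can be sketched in a few lines. In CoreNLP itself the selection is made through the `annotators` property (for example `tokenize,ssplit,pos,lemma,ner,parse`). The Python code below is only an analogy and does not use the CoreNLP API: the registry, the function names and the trivial stand-in annotators are hypothetical.

```python
def tokenize(doc):
    """Split the text on whitespace (a stand-in for a real tokenizer)."""
    doc["tokens"] = doc["text"].split()


def lowercase(doc):
    """Stand-in for a real lemmatiser: it only lower-cases each token."""
    doc["lemmas"] = [token.lower() for token in doc["tokens"]]


def count_tokens(doc):
    """Example of a custom annotator supplied by the user."""
    doc["n_tokens"] = len(doc["tokens"])


# Built-in annotators, keyed by the name used in the option string.
REGISTRY = {"tokenize": tokenize, "lemma": lowercase}


def run_pipeline(text, annotators, extra=None):
    """Run the comma-separated annotators, in order, over a shared document dict."""
    registry = dict(REGISTRY)
    registry.update(extra or {})  # custom or third-party annotators
    doc = {"text": text}
    for name in (part.strip() for part in annotators.split(",")):
        registry[name](doc)
    return doc


SAMPLE = "Stanford CoreNLP integrates many tools"
full = run_pipeline(SAMPLE, "tokenize,lemma")
custom = run_pipeline(SAMPLE, "tokenize,count", extra={"count": count_tokens})
```

Changing the option string is enough to switch a stage on or off: the second call omits the `lemma` stage and adds the custom `count` stage without any change to the pipeline code.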
A variety of studies have used the Stanford CoreNLP tool. For instance, Ahmed et al. (2009)
proposed the BioEve system to extract bio-molecular events from text. They used the Stanford
parser to provide typed-dependency relationships between words in the form of a dependency
parse. In addition, Poria et al. (2014) developed an algorithm that exploits dependency
parsing to obtain the semantic relationships between words. The Stanford Chunker component
is used as the first step of the algorithm to chunk the input text. Moreover, Trupti and
Deshmukh (2013) presented an approach for building an ontology from heterogeneous text
documents using the Stanford CoreNLP parser. They parsed the text file using the Stanford
parser, which generates an XML file that tags words as nouns, verbs, adjectives, pronouns,
etc. An OWL ontology containing classes and concepts is then generated by converting the
PoS-tagged words identified in the XML file. Likewise, Siddharthan (2011) presented a system
for text regeneration tasks such as text simplification, style modification or paraphrase.
The system applied transformation rules specified in XML
Pal et al. (2010) introduced a system to automatically classify the semantic relations between
nominals. The system achieves its best performance using lexical features such as
nominalization from WordNet and syntactic information such as dependency relations from the
Stanford Dependency Parser. Likewise, Kern et al. (2010) built a Word Sense Induction and
Discrimination (WSID) system that exploits syntactic and semantic features based on the
results of a natural language parser component. They applied the Stanford Parser to provide a
context-free phrase structure grammar representation and a list of grammatical relations
(typed dependencies) for a given sentence. Moreover, Uryupina (2010) presented Corry, a
system for co-reference resolution in English. The system relied on the Stanford NLP toolkit
to extract named entities and parse trees for each sentence. Corry showed the best
performance among four well-known co-reference resolution systems. Finally, Berend and
Farkas (2010) introduced a novel approach that includes a set of features for supervised
learning in order to extract key phrases from scientific papers. They applied syntactic
tagging to each sentence using the Stanford parser. Taken together, these studies show that
Stanford CoreNLP has been widely and effectively used for text processing and information
extraction; it was therefore used in our system for the same purpose.