Chapter 2 : Natural Language Processing and Knowledge Engineering Background

2.1 An Overview of Natural Language Processing

2.1.2 Natural Language Processing Tools

NLP is a range of computational techniques for analysing and representing naturally occurring
texts at one or more levels of linguistic analysis for the purpose of achieving human-like
language processing for a range of tasks or applications (Liddy, 2001). Most NLP tools are
influenced by relatively new areas such as Machine Learning, Computational Statistics and
Cognitive Science, and annotate words and terms in order to identify real-world objects and
their relationships in the text (Madnani and Dorr, 2010). Many open source NLP tools are
available for the semantic annotation of textual documents, such as OpenNLP (Baldridge, 2005),
NLTK (Loper and Bird, 2002), GATE (Cunningham et al., 2002) and Stanford CoreNLP
(Manning et al., 2014). The following subsections describe these tools in more detail.

2.1.2.1 OpenNLP

OpenNLP is a free toolkit that creates linguistic annotations of text using statistical models
based on maximum entropy (Baldridge, 2005). It integrates many NLP tasks, including
tokenization, sentence segmentation, PoS tagging, NER, chunking and co-reference resolution.
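
To make the idea of a maximum entropy model concrete, the following is a minimal sketch of how such a model could score candidate PoS tags for a word. It is not OpenNLP code: the feature names, the two tags and all weight values are hypothetical and chosen only for illustration, whereas a real model would learn its weights from annotated data.

```python
import math

# Hypothetical feature weights for a toy PoS decision (not taken from OpenNLP).
# Keys are (feature, tag) pairs; a trained model would learn these values.
WEIGHTS = {
    ("suffix=ing", "VERB"): 1.5,
    ("suffix=ing", "NOUN"): 0.3,
    ("prev=the", "NOUN"): 2.0,
    ("prev=the", "VERB"): -1.0,
}
TAGS = ["NOUN", "VERB"]


def maxent_probs(features, tags=TAGS, weights=WEIGHTS):
    """Return p(tag | features) under a maximum entropy (log-linear) model."""
    # Score each tag by summing the weights of its active features.
    scores = {t: sum(weights.get((f, t), 0.0) for f in features) for t in tags}
    # Normalise with a softmax so that the probabilities sum to one.
    z = sum(math.exp(s) for s in scores.values())
    return {t: math.exp(s) / z for t, s in scores.items()}


# A word ending in "-ing" that follows "the" (e.g. "the building").
probs = maxent_probs(["suffix=ing", "prev=the"])
best_tag = max(probs, key=probs.get)
```

With these illustrative weights the preceding determiner outweighs the "-ing" suffix, so the noun reading receives the higher probability. With no active features the model falls back to a uniform distribution over the tags.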

The OpenNLP toolkit has been used in different studies and domains. For example, Kang et al.
(2011) used the OpenNLP system in developing a biomedical corpus for noun phrase (NP)
chunking. Separately, Ek et al. (2011) described an approach to named entity detection in a
Swedish-language SMS corpus. They used the OpenNLP toolkit to annotate the SMS corpus by
applying part-of-speech (PoS)

is provided as a full package, making it hard to use in our system if only a few components
are needed.

2.1.2.2 NLTK

The Natural Language Toolkit (NLTK) is a Python package for natural language processing
(Loper and Bird, 2002). NLTK provides interfaces for text processing, linguistic structure
analysis and access to large corpus collections. It includes libraries and programs for NLP
components such as tokenization, PoS tagging, parsing, chunking, semantic analysis,
classification and clustering.
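
As an illustration of the chunking component listed above, the following minimal sketch groups PoS-tagged tokens into noun phrase (NP) chunks. It deliberately uses only the Python standard library rather than NLTK itself, and the chunk rule (an optional determiner, any adjectives, then one or more nouns), the function name `np_chunk` and the hand-tagged example sentence are assumptions made for illustration.

```python
def np_chunk(tagged):
    """Group (word, tag) pairs into NP chunks: optional DT, any JJ, then one or more NN*."""
    chunks, i, n = [], 0, len(tagged)
    while i < n:
        j = i
        if tagged[j][1] == "DT":                 # optional determiner
            j += 1
        while j < n and tagged[j][1] == "JJ":    # any number of adjectives
            j += 1
        k = j
        while k < n and tagged[k][1].startswith("NN"):  # one or more nouns
            k += 1
        if k > j:
            # At least one noun was found: emit the whole span as an NP chunk.
            chunks.append(("NP", [word for word, _ in tagged[i:k]]))
            i = k
        else:
            # No noun follows: leave the current token outside any chunk.
            chunks.append((tagged[i][1], [tagged[i][0]]))
            i += 1
    return chunks


# Hand-tagged example sentence (tags follow the Penn Treebank convention).
sentence = [("the", "DT"), ("main", "JJ"), ("rotor", "NN"), ("blade", "NN"),
            ("was", "VBD"), ("replaced", "VBN")]
chunks = np_chunk(sentence)
```

Here the first four tokens are grouped into a single NP chunk, while the two verb tokens remain unchunked. A hierarchical chunker would apply further rules of this kind to the output of earlier ones.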

Many researchers have utilised NLTK for natural language analysis and knowledge extraction.
For example, Mckenzie et al. (2010) introduced a novel information extraction application that
extracts data from helicopter maintenance records to populate a database. They used NLTK to
implement partial parsing of the text by means of hierarchical chunking. Additionally,
Stoyanchev et al. (2008) developed a question answering system that employed NLTK to
analyse questions linguistically. Moreover, Sætre (2006) presented an approach for finding
biologically relevant information on protein interactions on the internet. This was developed
using NLTK components for data selection, tokenization, PoS tagging and stemming. The main
disadvantage of NLTK (Loper and Bird, 2002) is that it is an open source package implemented
in Python, which, despite offering most of the functionality needed for simple NLP tasks, is
not powerful enough for most standard NLP tasks (Madnani, 2007).

2.1.2.3 GATE

The General Architecture for Text Engineering (GATE) (Bontcheva et al., 2004; Cunningham
et al., 2009) is a publicly available system for processing natural language, developed at the
University of Sheffield and implemented in Java (Gosling et al., 2005). It is an

them for semantic annotation. GATE is distributed with an IE component set called ANNIE
(A Nearly-New Information Extraction system). ANNIE contains a set of processing resources
that apply algorithms for extracting information from unstructured text. The major processing
resources provided by the ANNIE plug-in are: English tokenizer, Gazetteer, sentence splitter,
part-of-speech (PoS) tagger, named entity (NE) transducer, Java Annotation Patterns Engine
(JAPE) transducer, and orthographic co-reference (Cunningham et al., 2009).
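
The way such processing resources build on one another can be sketched as a small pipeline. The example below is a toy Python illustration of the first three resources (tokenizer, gazetteer and sentence splitter) and does not use GATE or ANNIE; the function names, the gazetteer entries and the annotation layout are all hypothetical.

```python
import re

# Hypothetical gazetteer list mapping lower-cased entries to an entity type.
GAZETTEER = {"sheffield": "Location", "london": "Location"}


def tokenize(text):
    """Naive tokenizer: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def gazetteer_lookup(tokens):
    """Return (token index, token, type) for every token found in the gazetteer."""
    return [(i, tok, GAZETTEER[tok.lower()])
            for i, tok in enumerate(tokens) if tok.lower() in GAZETTEER]


def split_sentences(tokens):
    """Group tokens into sentences, closing a sentence at each ., ! or ?."""
    sentences, current = [], []
    for tok in tokens:
        current.append(tok)
        if tok in {".", "!", "?"}:
            sentences.append(current)
            current = []
    if current:  # keep trailing tokens that lack final punctuation
        sentences.append(current)
    return sentences


def run_pipeline(text):
    """Apply the resources in order; later resources reuse the token annotations."""
    tokens = tokenize(text)
    return {
        "tokens": tokens,
        "lookups": gazetteer_lookup(tokens),
        "sentences": split_sentences(tokens),
    }


doc = run_pipeline("GATE was developed in Sheffield. It is written in Java.")
```

In this sketch every later step consumes the tokens produced by the first, which mirrors the general idea of resources adding annotations to a shared document. The PoS tagger, NE transducer, JAPE rules and co-reference steps are omitted.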

The advantages of using the GATE architecture have been shown by a number of studies.
Feilmayr et al. (2009) proposed that a rule/ontology-based IE system built on GATE can be
used to analyse tourism websites and extract structured data from accommodation webpages.
Saggion et al. (2007) created an ontology-based IE system for the business domain using
standard and adapted processing resources from GATE. Joshi et al. (2012) used GATE for IE to
show the advantages of using social networking sites such as Twitter in the marketing domain.
The study also created a domain-specific NER component for classifying named entities in
Twitter posts from buyers and sellers. These posts (a collection of tweets) are analysed and
processed using GATE components (e.g. English Tokenizer, Sentence Splitter, PoS tagger,
Gazetteer and NE transducer) to acquire data that gives farmers and merchants useful ideas.
However, the main drawback of the GATE tool (Bontcheva et al., 2004; Cunningham et al.,
2009) is that it can only be used as a full package. It is therefore not suitable for our
system, as only some of its components need to be implemented in APELS.

2.1.2.4 Stanford CoreNLP

Stanford CoreNLP is an open source toolkit composed of a set of NLP tools for processing
English texts (Manning et al., 2014). It can give the base forms of words, their parts of
speech, whether they are names of companies, people, etc., normalize dates, times,

dependencies, indicate which noun phrases refer to the same entities, indicate sentiment, extract

particular or open-class relations between mentions, etc.

Stanford CoreNLP is designed to be highly flexible and extensible. A single option determines
which tools are enabled and which are disabled, so it is very easy to apply a set of linguistic
analysis tools to a piece of text.

Stanford CoreNLP integrates many of Stanford’s NLP tools, including the part-of-speech (PoS)
tagger, the named entity recognizer (NER), the parser, the coreference resolution system,
sentiment analysis, bootstrapped pattern learning and the open information extraction tools.
Moreover, an annotator pipeline can include additional custom or third-party annotators.
CoreNLP’s analyses provide the foundational building blocks for higher-level and
domain-specific text understanding applications.
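
The idea of selecting tools with a single option and extending the pipeline with custom annotators can be sketched as follows. This is a toy Python illustration and not the CoreNLP API: CoreNLP itself is a Java library that is configured through a comma-separated annotators property, whereas the registry, the function names and the `lower` annotator below are hypothetical.

```python
import re


def tokenize(doc):
    """Split the raw text into word and punctuation tokens."""
    doc["tokens"] = re.findall(r"\w+|[^\w\s]", doc["text"])


def ssplit(doc):
    """Group tokens into sentences, closing a sentence at each ., ! or ?."""
    sentences, current = [], []
    for tok in doc["tokens"]:
        current.append(tok)
        if tok in {".", "!", "?"}:
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    doc["sentences"] = sentences


def lowercase(doc):
    """Hypothetical custom annotator: add a lower-cased copy of the tokens."""
    doc["lower"] = [tok.lower() for tok in doc["tokens"]]


# Registry of available annotators; "lower" stands in for a third-party add-on.
REGISTRY = {"tokenize": tokenize, "ssplit": ssplit, "lower": lowercase}


def build_pipeline(annotators, registry=REGISTRY):
    """Build a pipeline from a comma-separated option string of annotator names."""
    names = [name.strip() for name in annotators.split(",") if name.strip()]
    unknown = [name for name in names if name not in registry]
    if unknown:
        raise ValueError(f"unknown annotators: {unknown}")
    steps = [registry[name] for name in names]

    def annotate(text):
        doc = {"text": text}
        for step in steps:  # run only the annotators that were enabled
            step(doc)
        return doc

    return annotate


with_sentences = build_pipeline("tokenize, ssplit")("CoreNLP is flexible. It is extensible.")
custom_only = build_pipeline("tokenize, lower")("Stanford CoreNLP")
```

Changing the option string is enough to switch annotators on or off, and registering a new function under a new name is enough to add a custom one. The sketch does not check dependencies between annotators, so the caller must list `tokenize` before any annotator that reads the tokens.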

A variety of studies have used the Stanford CoreNLP tool. For instance, Ahmed et al. (2009)
proposed the BioEve system to extract bio-molecular events from text. They used the Stanford
parser to provide typed-dependency relationships between words in the form of a dependency
parse. In addition, Poria et al. (2014) developed an algorithm that exploits dependency
parsing to obtain the semantic relationships between words. The Stanford Chunker component
is used as the first step of the algorithm to chunk the input text. Moreover, Trupti and
Deshmukh (2013) presented an approach for building an ontology from heterogeneous text
documents using the Stanford CoreNLP parser. They parsed each text file using the Stanford
parser, which generates an XML file that tags words as nouns, verbs, adjectives, pronouns,
etc. An OWL ontology, which contains classes and concepts, is then generated by converting
the PoS-tagged words identified in the XML file. Likewise, Siddharthan (2011) presented a
system for text regeneration tasks such as text simplification, style modification or
paraphrase. The system applied transformation rules specified in XML

Pal et al. (2010) introduced a system to automatically classify the semantic relations between
nominals. The system achieves its best performance using lexical features such as
nominalization from WordNet and syntactic information such as dependency relations from the
Stanford Dependency Parser. Likewise, Kern et al. (2010) built a Word Sense Induction and
Discrimination (WSID) system that exploits syntactic and semantic features based on the
results of a natural language parser component. They applied the Stanford Parser to provide a
context-free phrase structure grammar representation and a list of grammatical relations
(typed dependencies) for a given sentence. Moreover, Uryupina (2010) presented Corry, a
system for co-reference resolution in English. The system relied on the Stanford NLP toolkit
for extracting named entities and parse trees for each sentence. The Corry system has shown
the best performance level among four well-known co-reference resolution systems. Finally,
Berend and Farkas (2010) introduced a novel approach that includes a set of features for
supervised learning in order to extract key phrases from scientific papers. They applied
syntactic tagging using the Stanford parser on each sentence. Taken together, these studies
show that Stanford CoreNLP has been widely and effectively used for text processing and
information extraction; it was therefore used in our system for the same purpose.