Chapter 2: Literature Review
2.2 Natural Language Processing and Information Extraction
2.2.1 Natural Language Processing
Natural language is often ambiguous: many common words share the same spelling but carry different meanings, and their interpretation depends on the context in which they occur. The rapid growth of information and technologies on the Internet has increased the amount of unstructured data expressed in natural language. In addition, many knowledge resources available online, such as blogs, surveys, articles, web pages, documents and corpora (collections of documents), are expressed in free (unstructured) text written in natural language. This has increased the demand for software that analyses text of all forms and resolves this ambiguity.
Automatic methods in natural language processing (NLP) (Manning and Schütze, 1999; Jackson and Moulinier, 2007; Jurafsky and Martin, 2009; Collobert et al., 2011) are normally applied to handle the ambiguity in natural language and free textual data in order to extract knowledge, summarise the available content and transform unstructured textual data into structured textual information. In general, NLP is a field of artificial intelligence (AI) (Russell and Norvig, 2010) which is concerned with the interactions between computers and human natural language. It uses computer software and state-of-the-art systems to analyse linguistic patterns and to generate useful rules from texts (e.g. learning grammar rules which can indicate whether a sentence is well formed). Modern NLP systems are based on machine learning (Mitchell, 1997; Alpaydin, 2010) and statistical techniques for processing language (machine learning and its methods are surveyed in more detail in Section 2.3).
In the domain of NLP, machine learning is utilised to derive significant rules by analysing large text corpora of typical real-world data. In general, most machine learning methods in NLP utilise external knowledge resources to provide useful information which helps in processing the textual data. These knowledge resources include the following:
Machine-readable dictionaries, which are commonly used in NLP to provide useful information (e.g. meanings of words, grammatical categories of words, relations etc.) (Manning and Schütze, 1999). The WordNet lexicon (Miller et al., 1990; Fellbaum, 1998) (described in more detail in Section 2.2.3) and the Longman Dictionary of Contemporary English (LDOCE) (Proctor, 1978) are examples of machine-readable dictionaries.
Corpora, which consist of sets of texts used for learning concepts and language models. The text provided in a corpus can be raw or annotated with word senses (i.e. semantics) and part of speech (POS) tags. Both kinds of corpora represent a knowledge resource and can be used by NLP approaches (Manning and Schütze, 1999). The Brown corpus (Kucera and Francis, 1967), the Wall Street Journal corpus (Charniak et al., 2000) and the American National corpus (Ide and Suderman, 2006) are examples of raw corpora. The SemCor corpus (Miller et al., 1993) (i.e. a subset of the Brown corpus that is manually tagged with senses from WordNet) and the interest corpus (Bruce and Wiebe, 1994) are examples of corpora annotated with word senses and POS tags.
Thesauri, which provide information about relationships between words (e.g. synonymy (words with the same meaning) and antonymy (words with opposite meanings)) or about word meanings (Kilgarriff and Yallop, 2000). Roget's International Thesaurus (Roget, 1911) is an example of a well-known thesaurus which has been extensively used in the NLP field.
Ontologies, which are used in AI and NLP as knowledge resources and include definitions of objects and semantic relations for a set of concepts within a specific domain (Gruber, 1995).
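To make the kind of information held in these resources more concrete, the following minimal Python sketch represents a machine-readable dictionary and a thesaurus as simple in-memory structures. All class names, function names and entries here (DictionaryEntry, ThesaurusEntry, count_senses, are_antonyms and the sample words) are hypothetical and written by hand for illustration; they are not taken from WordNet, LDOCE or Roget's Thesaurus, and they do not reflect the actual format of those resources.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class DictionaryEntry:
    """One headword of a toy machine-readable dictionary."""
    word: str
    pos: str            # grammatical category, e.g. "noun"
    senses: List[str]   # one short gloss per meaning


@dataclass
class ThesaurusEntry:
    """One headword of a toy thesaurus."""
    word: str
    synonyms: Set[str] = field(default_factory=set)
    antonyms: Set[str] = field(default_factory=set)


# Hand-written illustrative entries (assumptions, not real resource content).
DICTIONARY: Dict[str, DictionaryEntry] = {
    "bank": DictionaryEntry(
        "bank", "noun",
        ["a financial institution", "the sloping land beside a river"],
    ),
    "boy": DictionaryEntry("boy", "noun", ["a male child"]),
}

THESAURUS: Dict[str, ThesaurusEntry] = {
    "happy": ThesaurusEntry("happy", {"glad", "cheerful"}, {"sad"}),
}


def count_senses(word: str) -> int:
    """Number of meanings listed for a word; more than one signals ambiguity."""
    entry = DICTIONARY.get(word)
    return len(entry.senses) if entry else 0


def are_antonyms(a: str, b: str) -> bool:
    """True if either word lists the other as an antonym."""
    entry_a, entry_b = THESAURUS.get(a), THESAURUS.get(b)
    return bool(
        (entry_a and b in entry_a.antonyms)
        or (entry_b and a in entry_b.antonyms)
    )
```

In this sketch, a word such as "bank" with more than one listed sense is exactly the kind of ambiguous word that the disambiguation tasks discussed later in this chapter aim to resolve.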
NLP includes tasks which are usually performed on unstructured text in order to transform it into a structured format. These tasks represent the pre-processing steps which are considered the key procedures in NLP for linguistic analysis. The pre-processing tasks extract relevant features from text, and these features can be used to describe syntactic information (e.g. grammatical structure, parts of speech etc.) or semantic information (e.g. meanings). In other words, pre-processing tasks can be applied to perform both the syntactic and the semantic analysis of a given text (Manning and Schütze, 1999; Jackson and Moulinier, 2007; Jurafsky and Martin, 2009; Collobert et al., 2011). Some NLP tasks are grouped into subfields of NLP such as information extraction (IE) (Appelt, 1999; Moens, 2006), while others have direct and separate real-world applications such as information retrieval (IR) (Goker and Davies, 2009) (IE and IR are discussed in Section 2.2.2). The main NLP tasks include the following pre-processing steps: tokenization, part of speech (POS) tagging, lemmatization, named entity recognition (NER), chunking, parsing, co-reference, semantic class disambiguation (SCD) and word sense disambiguation (WSD) (Manning and Schütze, 1999; Jackson and Moulinier, 2007; Jurafsky and Martin, 2009; Collobert et al., 2011). Each of these tasks is briefly described in the following points (SCD and WSD are described later in Sections 2.2.3 and 2.2.4 respectively):
Tokenization, which splits the text into a set of tokens. For example, documents are broken into paragraphs, paragraphs into sentences, and sentences into individual words.
Part of speech (POS) tagging, which assigns a grammatical category to each word in the text (e.g. ‘eat’: verb, ‘boy’: noun, ‘happy’: adjective, ‘widely’: adverb, ‘in’: preposition etc.). It has been proposed by many authors (e.g. Brill, 1992; Hepple, 2000) as a main task for analysing text syntactically at the word level. This task is useful for linguistic analysis and is normally performed on the tokenized text.
Lemmatization (morphological analysis), which groups together the different forms of a given word under a single form that represents the lexical root of this word (e.g. is → be, ran → run).
Text chunking (shallow parsing), whose aim is to label segments of a sentence with syntactic constituents, which are called chunks. These chunks can be, for instance, verb phrases (e.g. have eaten) or noun phrases (e.g. the girl).
Parsing, which generates a parse tree in order to determine the syntactic structure of a sentence. This process specifies the rules over POS tags that can generate a well-formed phrase.
Named entity recognition (NER), a main task in IE, which classifies entity names in text into predefined categories. This task is able to identify expressions that refer to people (e.g. ‘John’), locations (e.g. ‘France’), organisations (e.g. ‘Microsoft’) etc. It uses heuristic rules that rely on the syntactic structures of the surrounding context. For example, an NER system would classify the word “John” into the “person” category because it is a proper noun that starts with a capital letter and is used in a way that indicates the name of a person.
Co-reference, an IE task, which links multiple expressions in text that refer to the same entity. For example, in the sentence “John went to the school yesterday, he was in a holiday for three days”, the co-reference task determines that the word “he” refers to “John”.
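Several of the tasks described above can be illustrated with a deliberately simplified Python sketch that uses only the standard library. It covers tokenization, lemmatization, a capital-letter heuristic for finding candidate names, and a naive co-reference rule that links a pronoun to the most recent preceding name. The lookup tables (LEMMAS, PRONOUNS, COMMON_WORDS) and the function names are assumptions introduced for this illustration only; they are far cruder than the methods cited in this section and should not be read as the approach of any particular system.

```python
import re

# Hypothetical lookup tables for illustration; real systems rely on
# lexicons, morphological rules or trained models instead.
LEMMAS = {"is": "be", "was": "be", "ran": "run", "went": "go"}
PRONOUNS = {"he", "him", "his"}
COMMON_WORDS = {"the", "a", "he", "she", "it", "in", "to", "for"}


def tokenize(text):
    """Split text into sentences, then each sentence into word tokens."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return [re.findall(r"[A-Za-z]+", s) for s in sentences]


def lemmatize(token):
    """Map a word form to its lemma via the toy table (default: lowercase)."""
    return LEMMAS.get(token.lower(), token.lower())


def is_candidate_name(token):
    """Crude NER heuristic: capitalised and not a common function word."""
    return token[0].isupper() and token.lower() not in COMMON_WORDS


def find_names(tokens):
    """Return the tokens that the heuristic treats as candidate names."""
    return [t for t in tokens if is_candidate_name(t)]


def resolve_pronouns(tokens):
    """Naive co-reference: link each pronoun to the latest preceding name."""
    links = {}
    last_name = None
    for index, token in enumerate(tokens):
        if is_candidate_name(token):
            last_name = token
        elif token.lower() in PRONOUNS and last_name is not None:
            links[index] = last_name
    return links


sentence = "John went to the school yesterday, he was in a holiday for three days."
tokens = tokenize(sentence)[0]
lemmas = [lemmatize(t) for t in tokens]
names = find_names(tokens)
links = resolve_pronouns(tokens)
```

Applied to the example sentence used above, the sketch marks “John” as the only candidate name and links the token “he” to it. The heuristics fail easily (for instance, any capitalised sentence-initial word outside COMMON_WORDS is treated as a name), which is one reason why practical systems combine such cues with the knowledge resources and machine learning methods discussed in this chapter.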