• No results found

Approaches to building Named Entity Recognition Systems

CHAPTER 2: BACKGROUND

2.3 Approaches to building Named Entity Recognition Systems

The first systems for NER were rule-based, based on pattern matching rules and pre- compiled lists of information i.e. gazetteers, the research community has since moved towards machine learning methods for NER. For example, in the MUC-7

competition13 five NER systems out of eight were rule-based. In the absence of training examples, handcrafted rules remain the preferred technique for NER (e.g. Sekine and Nobata, 2004 developed a NER system for 200 named entities). In the biomedical domain, rule-based approaches are also used to identify named entities in biomedical literature (see Ananiadou and McNaught, 2006 for more details). The major setback of rule-based approaches is the issue of portability as these approaches are difficult to adapt to different domains. There are three machine learning approaches to build NER systems.

 Supervised Learning Approach  Semi-Supervised Learning Approach  Unsupervised Learning Approach

2

.3.1 Supervised Learning Approach

Supervised learning is the most dominant technique employed to solve the problem of NER. Supervised learning approach studies the features of positive and negative examples of Named Entities (NEs) over a large collection of annotated documents and learns rules that capture instances of a given type.

Supervised learning techniques include Hidden Markov Models (HMMs) (Bikel et al., 1998; Borkar et al., 2001; Agichtein and Ganti, 2004; Finkel et al., 2005), Decision Trees (Sekine, 1998), Maximum Entropy Models (ME) (Borthwick et al., 1998; Chieu and Ng, 2003; Florian et al., 2007), Maximum Entropy Morkov Models (MEMMs) (McCallum et al., 2000), Support Vector Machines (SVM) (Asahara and Matsumoto, 2003; Mayfield et al., 2003), boosting (Carreras et al., 2003), memory-based learning (MBL) (Meulder and Daelemans, 2003) and Conditional Random Fields (CRF) (McCallum and Li, 2003). All the abovementioned techniques usually consist of a system which reads a large annotated corpus, memorises lists of entities and creates disambiguation rules based on discriminative features. CRFs (McCallum and Li, 2003) are considered as the state-of-the-art method for label assignment to token sequences (words) as it has a more flexible and dominant mechanism for exploiting

arbitrary feature sets along with dependency in the labels of neighbouring words (Sarawagi, 2008). Apart from IE, supervised learning approaches are frequently used by many other fields of NLP (e.g. Mehdi et al., 2010 used the supervised learning approach for the summarisation of legal documents).

The major shortcoming of the supervised learning approach is the requirement of a large annotated corpus which is sometimes difficult to obtain.

2.3.2 Semi-supervised Learning Approach

As mentioned earlier, semi-supervised approaches rely on the process of bootstrapping. There are many systems which have used this bootstrapping technique for NER.

Brin (1998) used regular expressions in order to generate lists of book titles paired with book authors from the Web. The system started with a few seed examples and learned new ones. The main idea of this algorithm is that many websites conform to a reasonably uniform format across the site.

Collins and Singer (1999) used a parsing technique to search for NE patterns. A pattern is a proper name followed by a noun phrase in apposition. In this system, patterns are kept in pairs {spelling, context} where spelling refers to the proper name and context refers to the noun phrase in its context. The system starts with an initial seed of spelling rules and a candidate which satisfies a spelling rule and they are classified according to how their contexts are accumulated. The most frequent contexts are then turned into a set of contextual rules and later on these rules are used to find further spelling rules and so on. Riloff and Jones (1999) introduced manual bootstrapping technique which consists of a set of entities and a set of contexts. They found out in their experiments that performance of their algorithm deteriorates with the introduction of noise. Cucchiarelli and Velardi (2001) presented a NER system based on Riloff and Jones (1999) manual bootstrapping that used syntactic relations (e.g. subject-object) to discover contextual evidence around named entities.

Pasca et al. (2006) presented a semi-supervised approach for NER by employing Lin’s (1998) distributional similarity measure to generate synonyms (e.g. words which are the members of the same semantic class) for pattern generalisation. They conducted their experiments on a huge corpus (100 million web documents) starting with only 10 seed examples and demonstrated that it is possible to generate one million named entities with a precision of about 88%.

Data selection also plays an important role in the learning process. Heng and Grishman (2006) noted that selection of documents using information retrieval-like relevance measures brought out the best results in their experiments rather than relying on a huge collection of documents.

2.3.3 Unsupervised Learning Approach

Clustering is a typical approach used for unsupervised learning. This approach relies on lexical resources (WordNet), lexical patterns and statistics computed on a large unannotated corpus.

Alfonseca and Manandhar (2002) presented an approach to address the problem of assigning a label to an input word with the appropriate NE type. They made use of WordNet synset and the surrounding context of an input word. Evans (2003) presented an NER system based on the idea of hypernyms described by Hearst (1992) in order to identify named entities. Shinyama and Sekine (2004) presented an approach based on an observation that NEs often appear synchronously in several news articles, whereas common nouns do not. This approach allows identification of rare NEs in an unsupervised manner and can be useful in combination with other NER methods.

Nadeau et al. (2006) presented an unsupervised approach for NER. Their approach made use of simple heuristics based on the work of Mikheev (1999), Petasis et al. (2001) and Palmer and Day (1997) to perform NE disambiguation. Their approach can be divided into two stages. In the first stage, a large gazetteer of entities (list of entities) was created and in the second stage heuristics were used to identify and

classify NEs in the given context of a document. They evaluated their system performance against the basic supervised system using the MUC-7 NER corpus (Chinchor, 1998). The supervised system was able to achieve high precision but low recall while the unsupervised system achieved higher recall at the cost of lower precision.

Semi-supervised and unsupervised approaches are useful when a large amount of training data is unavailable or difficult to obtain. There is a lot of research being done in the area of NER spreading across various languages, domains and textual genres (Nadeau and Sekine, 2007). A supervised learning approach gives good performance in the presence of huge collections of annotated data while semi-supervised and unsupervised approaches promise fast deployment of many NE types without the prerequisite of an annotated corpus (Nadeau and Sekine, 2007).