
We describe our system as unsupervised, but the distinction between supervised and unsupervised systems is not always clear. In some systems that are apparently unsupervised, it could be argued that the human labour involved in generating labelled training data has merely been shifted to embedding clever rules and heuristics in the system.

In our gazetteer generator (Section 3.1), the supervision is limited to a seed of four entities per list, a primitive noise filter (Section 3.1.4.5), the knowledge that month-person ambiguity is particularly problematic in MUC-7 (Section 3.3, Table 6) and three heuristics (Section 3.2) for handling entity ambiguity and adjusting entity boundaries. In our ambiguity resolver (Section 3.2), we attempt to minimize the use of domain knowledge of specific entity types. Our system exploits human-generated HTML mark-up in Web pages to generate gazetteers. However, because Web pages are available in such quantity, and because the creation of Web pages is now intrinsic to the workflow of most organizations and individuals, we believe this annotated data comes at a negligible cost. For these reasons, we believe it is reasonable to describe our system as unsupervised.

3.6 Conclusion

In this chapter, we presented a named-entity recognition system that advances the NER state-of-the-art by avoiding the need for supervision and by handling novel NE types. In a comparison on the MUC corpus, our system outperforms a baseline supervised system, but it is still not competitive with more complex supervised systems. There are fortunately many ways to improve our model. One interesting way would be to generate gazetteers for a multitude of NE types (e.g., all 200 of Sekine’s types), and use list intersection as an indicator of ambiguity. This idea would not resolve the ambiguity itself, but it would clearly identify where to invest further efforts.
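The list-intersection idea can be illustrated with a minimal sketch. It assumes that each gazetteer is held in memory as a set of strings keyed by its NE type; the function name `ambiguous_entries` and the toy lists are hypothetical and are not part of the system described in this chapter.

```python
from collections import defaultdict


def ambiguous_entries(gazetteers):
    """Map each entity found in two or more lists to the sorted NE types listing it."""
    types_by_entity = defaultdict(set)
    for ne_type, entities in gazetteers.items():
        for entity in entities:
            types_by_entity[entity].add(ne_type)
    # An entity is flagged as ambiguous when more than one list contains it.
    return {e: sorted(t) for e, t in types_by_entity.items() if len(t) > 1}


# Toy gazetteers (illustrative data only).
gazetteers = {
    "month": {"April", "May", "June"},
    "person": {"April", "June", "Alice"},
    "city": {"Paris", "Ottawa"},
}
overlaps = ambiguous_entries(gazetteers)
```

In this toy case the flagged entries are exactly those shared by the month and person lists. As stated above, the intersection only locates ambiguity; it does not resolve it.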

Chapter 4

Noise-Filtering Techniques for Generating NE Lists

In this chapter, we present a first improvement to BaLIE. It comes from the observation that entities of a given type tend to be lexically similar, in that they are comparable in length, they are made up of characters from a given character set, they often have common prefixes and suffixes, and so forth. We therefore formulated the hypothesis that lexical features are useful in identifying valid instances of an NE type. Our contributions are the following:

• The design of a noise filter for NE list generation based on lexical features;

• First experiments in using statistical semantics as a noise filter.
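To make the notion of lexical features concrete, the following sketch computes a few features of the kinds mentioned above (length, character set, prefixes and suffixes). The particular feature set, the three-character affix length and the function name `lexical_features` are assumptions chosen for illustration; they are not necessarily the features used by our filter.

```python
import string


def lexical_features(entity):
    """Return a small dictionary of surface features for a candidate entity string."""
    tokens = entity.split()
    n_chars = max(len(entity), 1)  # guard against division by zero on empty input
    return {
        "length": len(entity),
        "num_tokens": len(tokens),
        "prefix3": entity[:3].lower(),
        "suffix3": entity[-3:].lower(),
        "frac_alpha": sum(c.isalpha() for c in entity) / n_chars,
        "frac_digit": sum(c.isdigit() for c in entity) / n_chars,
        "has_punct": any(c in string.punctuation for c in entity),
        "all_capitalized": bool(tokens) and all(t[0].isupper() for t in tokens),
    }


valid = lexical_features("Ottawa")
noise = lexical_features("Click here!")
```

Under the hypothesis of this chapter, a valid city name such as "Ottawa" and a piece of page residue such as "Click here!" should differ on several of these features, which is what a classifier trained on them could exploit.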

This chapter covers the “Noise filter”, which is an improvement to the “List creator” module of Figure 1. The noise filter works on the output of the Web page wrapper module of BaLIE in order to generate NE lists of greater quality, as shown in Figure 4. Both the noise filter we present in this chapter and the Web page wrapper presented in the previous chapter are instances of the problem of learning from positive and unlabelled examples. In both cases, we use an algorithm inspired by SMOTE (Chawla et al. 2002) to solve the problem. SMOTE is reviewed in Section 4.2.2.
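For readers unfamiliar with SMOTE, the sketch below shows its core step: a synthetic example is created by interpolating between a known example and one of its k nearest neighbours. This is a simplified rendering of the general technique of Chawla et al. (2002), with base examples drawn at random, and it is not the adapted algorithm used in BaLIE. The function name `smote`, its parameters and the toy feature vectors are assumptions.

```python
import random


def smote(samples, n_synthetic, k=5, seed=0):
    """Create n_synthetic points by interpolating between samples and near neighbours.

    samples: list of equal-length numeric feature vectors (at least two of them).
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(samples)
        # Rank the other samples by squared Euclidean distance to the base point.
        others = [s for s in samples if s is not base]
        others.sort(key=lambda s: sum((a - b) ** 2 for a, b in zip(base, s)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Toy two-dimensional feature vectors (illustrative data only).
positives = [[1.0, 1.0], [2.0, 1.5], [1.5, 2.0], [3.0, 3.0]]
new_points = smote(positives, n_synthetic=10, k=2)
```

Each synthetic point lies on the segment joining two existing examples, so the generated data stays within the region already covered by the known examples.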

NE lists (also called dictionaries, lexicons, or gazetteers) are a typical component of NER systems. Lists are either an explicit system component (e.g., Cunningham et al. 2002), or they are derived from an annotated training data set (e.g., Bikel et al. 1999). For instance, a typical NER system that recognizes city names will refer to a list of cities and apply a mechanism to resolve entity boundary and type ambiguity. However, lists are rarely exhaustive and they require ongoing maintenance to stay up-to-date. This is particularly true with NE types such as “company,” which are very volatile. Moreover, the initial cost of creating a list of NEs is usually high because it requires either manual NE harvesting or manual annotation of a large collection of documents.
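The city-name example can be sketched as an exact-match list lookup over a tokenized sentence. The greedy longest-match strategy used here is only one simple way of settling entity boundaries and is an assumption made for illustration; the function name `list_lookup`, the `max_len` parameter and the toy list are likewise hypothetical.

```python
def list_lookup(tokens, gazetteer, ne_type, max_len=4):
    """Greedy longest-match lookup; returns (start, end, type) spans, end exclusive."""
    entries = {tuple(entry.split()) for entry in gazetteer}
    spans = []
    i = 0
    while i < len(tokens):
        match_end = None
        # Try the longest candidate first so "New York" wins over "York".
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            if tuple(tokens[i:i + n]) in entries:
                match_end = i + n
                break
        if match_end is None:
            i += 1
        else:
            spans.append((i, match_end, ne_type))
            i = match_end
    return spans


cities = {"Ottawa", "New York", "York"}
tokens = "She moved from New York to Ottawa".split()
spans = list_lookup(tokens, cities, "city")
```

A lookup of this kind can only find entities that are already in the list, which is why the coverage and upkeep of the list matter so much.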

Figure 4: Details of noise filtering as a post-process for the Web page wrapper

[Figure 4 diagram. Training the system (semi-supervised learning), List creator (from Figure 1): input of seed examples (see Appendix); information retrieval using a Web search engine (Yahoo! API); Web page wrapper (learning from positive and unlabelled examples); noise filtering (learning from positive and unlabelled examples); output of generated lists of named entities. Testing the system (actual use and evaluation): input of an unannotated document; from training, the generated lists of named entities (noise filtered); delimit by exact match list lookup; output of an annotated document (ambiguity not resolved).]

Recently, many techniques have been proposed to generate large NE lists starting from an initial seed of a few examples (e.g., Etzioni et al. 2005). Techniques have also been proposed to autonomously maintain an existing NER system (e.g., Heng and Grishman 2006) by increasing its underlying training data set. These semi-supervised learning techniques are based on bootstrapping lexical knowledge from a large collection of unannotated documents (e.g., the Web). An early example of a bootstrapping algorithm is provided by Riloff and Jones (1999).

In Section 3.2 of the previous chapter, we proposed our own technique for NE list generation based on a bootstrapping algorithm. For efficiency, we kept this algorithm simple. The penalty for simplicity is noise in the generated NE list, but even the most sophisticated algorithm will generate noise. Most of our research focuses on the problem of noise. In Section 4.1, we summarize our NE list generation technique and explain the role of noise filtering. Our main contribution, detailed in Section 4.2, is a new noise-filtering technique, based on lexical NE features. In Section 4.3, we compare our technique to an existing noise-filtering technique, based on information redundancy, and we also examine the combination of our lexical filter with the information redundancy filter. In Section 4.4, we show that the combination of the two noise filters is better than either filter taken individually. In Section 4.5, we demonstrate the use of a third noise filter, based on statistical semantics techniques. Because of the computational complexity of this filter, we report the results of its use as a post-processing step, after the list generation process. Section 4.6 summarizes and concludes.
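As a rough illustration of how two noise filters might be combined, the sketch below assumes that information redundancy means a candidate is kept only when it was extracted from at least `min_sources` distinct Web pages, and that the combination keeps the candidates accepted by both filters. Both assumptions are ours for the purpose of the example; the actual filters and their combination are defined in Sections 4.2 to 4.4. The lexical test used here (title case) is a deliberately crude stand-in, and the page identifiers are hypothetical.

```python
from collections import defaultdict


def redundancy_filter(extractions, min_sources=2):
    """Keep candidates extracted from at least min_sources distinct pages.

    extractions: iterable of (candidate, page_id) pairs.
    """
    pages_by_candidate = defaultdict(set)
    for candidate, page_id in extractions:
        pages_by_candidate[candidate].add(page_id)
    return {c for c, pages in pages_by_candidate.items() if len(pages) >= min_sources}


def combined_filter(extractions, lexical_ok, min_sources=2):
    """Keep candidates that pass both the redundancy test and a lexical predicate."""
    return {c for c in redundancy_filter(extractions, min_sources) if lexical_ok(c)}


# Toy extractions with hypothetical page identifiers.
extractions = [
    ("Ottawa", "page1"), ("Ottawa", "page2"),
    ("Click here", "page1"), ("Click here", "page2"),
    ("Toronto", "page3"),
]
redundant = redundancy_filter(extractions)
kept = combined_filter(extractions, lexical_ok=str.istitle)
```

In this toy case, the redundancy test alone keeps the repeated noise string "Click here", while the lexical predicate removes it. This is the kind of complementarity that motivates combining filters.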