
The acronym detection module described in this chapter can be integrated with BaLIE’s alias resolution algorithm (Section 3.2.3, Figure 3). When a definition is added to a set of aliases “a_i,” the corresponding acronym is also added. Moreover, a side-effect of identifying an acronym definition is identifying the exact boundary of the potential entity. For instance, let’s look at this sentence containing one “organization” type entity:

“The court convicted the head of the South <ENAMEX TYPE="ORG">Lebanon Army</ENAMEX> (SLA) of collaborating with Israel.”

In this sentence, the acronym detection module recognizes the acronym SLA:

SLA, [S]outh [L]ebanon [A]rmy

On the one hand, it corrects the organization boundary, and on the other hand, it associates “SLA” with the “South Lebanon Army” alias set. The annotations are then corrected accordingly:

“The court convicted the head of the <ENAMEX TYPE="ORG">South Lebanon Army</ENAMEX> (<ENAMEX TYPE="ORG">SLA</ENAMEX>) of collaborating with Israel.”
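The correction above can be sketched in a few lines of Python. This is only an illustration under strong assumptions: the acronym is matched on word initials alone (one word per letter), the sentence holds a single ENAMEX entity, and the function names `match_definition` and `correct_annotation` are hypothetical. The module described in this chapter relies on supervised learning and can also match letters inside words, which this sketch does not attempt.

```python
import re


def match_definition(acronym, preceding_words):
    """Return the trailing words whose initials spell `acronym`, else None.

    Assumption: one word per acronym letter, matched on first letters only.
    """
    n = len(acronym)
    if n == 0 or len(preceding_words) < n:
        return None
    candidate = preceding_words[-n:]
    initials = "".join(word[0] for word in candidate)
    return candidate if initials.upper() == acronym.upper() else None


def correct_annotation(sentence):
    """Return (corrected_sentence, alias_set) for a 'Definition (ACRONYM)' pattern.

    Simplification: the sentence is assumed to hold a single ENAMEX entity,
    so all tags are stripped and rebuilt around the definition and acronym.
    """
    type_match = re.search(r'<ENAMEX TYPE="([A-Z]+)">', sentence)
    if type_match is None:
        return sentence, set()
    open_tag = '<ENAMEX TYPE="%s">' % type_match.group(1)
    close_tag = "</ENAMEX>"

    plain = re.sub(r"</?ENAMEX[^>]*>", "", sentence)
    paren = re.search(r"\(([A-Z]{2,})\)", plain)
    if paren is None:
        return sentence, set()
    acronym = paren.group(1)

    definition_words = match_definition(acronym, plain[:paren.start()].split())
    if definition_words is None:
        return sentence, set()
    definition = " ".join(definition_words)
    start = plain.rfind(definition, 0, paren.start())
    if start == -1:  # e.g. irregular whitespace between the words
        return sentence, set()
    end = start + len(definition)

    corrected = (
        plain[:start] + open_tag + definition + close_tag
        + plain[end:paren.start()]
        + "(" + open_tag + acronym + close_tag + ")"
        + plain[paren.end():]
    )
    # The definition and its acronym end up in the same alias set.
    return corrected, {definition, acronym}


before = ('The court convicted the head of the South <ENAMEX TYPE="ORG">Lebanon '
          'Army</ENAMEX> (SLA) of collaborating with Israel.')
after, aliases = correct_annotation(before)
print(after)
print(sorted(aliases))
```

Applied to the example sentence, the sketch moves the opening tag to “South,” tags “SLA” with the same type, and returns both strings as one alias set.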

In our experiments, we measured no significant improvements on the MUC-7 and the BBN corpora. In fact, only four acronyms are identified in MUC-7, and no acronyms are found in BBN. However, the CONLL corpus is rich in acronyms, and organization recall improves from 51.27 to 52.43, as shown in Table 29. We identified 19 acronyms in the CONLL corpus, one of which was a false positive (<New, [N]orm H[e][w]itt>).

Table 29: BaLIE's performance on the CONLL corpus with acronym detection

                 Without acronym detection          With acronym detection
Type             Precision   Recall   F-measure     Precision   Recall   F-measure
Person           49.5        52.10    50.77         49.69       52.16    50.90
Location         65.49       72.71    68.91         65.52       72.71    68.92
Organization     43.26       51.27    46.93         44.60       52.43    48.20
Miscellaneous    61.37       52.35    56.50         61.59       52.35    56.60
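As a quick consistency check on Table 29, the F-measure column can be recomputed from precision and recall. The sketch below assumes the balanced F-measure (the harmonic mean of precision and recall); the function name and the 0.02 tolerance are illustrative choices. Because the published precision and recall values are already rounded, the recomputed figures can differ from the table in the last digit.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F-measure) copied from Table 29.
table_29 = {
    "Person":        {"without": (49.5, 52.10, 50.77), "with": (49.69, 52.16, 50.90)},
    "Location":      {"without": (65.49, 72.71, 68.91), "with": (65.52, 72.71, 68.92)},
    "Organization":  {"without": (43.26, 51.27, 46.93), "with": (44.60, 52.43, 48.20)},
    "Miscellaneous": {"without": (61.37, 52.35, 56.50), "with": (61.59, 52.35, 56.60)},
}

for entity_type, settings in table_29.items():
    for setting, (p, r, reported) in settings.items():
        recomputed = f_measure(p, r)
        # Allow for rounding of the published precision and recall values.
        agrees = abs(recomputed - reported) < 0.02
        print(f"{entity_type:14s} {setting:8s} reported={reported:.2f} "
              f"recomputed={recomputed:.3f} agrees={agrees}")

# Gain in organization recall attributable to acronym detection.
org_recall_gain = table_29["Organization"]["with"][1] - table_29["Organization"]["without"][1]
print(f"Organization recall gain: {org_recall_gain:.2f} points")
```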

6.7 Conclusion

In this chapter, we described a supervised learning approach to the task of identifying acronyms. The approach consists in using a few hand-coded constraints to reduce the search space, and then using supervised learning to impose more constraints. The advantage of this approach is that the system can easily be retrained for a new corpus when the previously learned constraints no longer apply. The hand-coded constraints reduce the set of acronym-definition pair candidates that must be classified by the supervised learning system, yet they are weak enough to be transferable to a new corpus with little or no change.
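The two-stage design can be sketched as follows. The constraints shown here (a parenthesized token of 2 to 10 characters containing an uppercase letter, and a window of at most min(|A| + 5, 2|A|) preceding words) are illustrative assumptions rather than the constraints used in this chapter, and `stand_in_classifier` is a hypothetical hand-written placeholder for the trained model.

```python
import re

MAX_EXTRA_WORDS = 5  # hypothetical slack added to the acronym length


def generate_candidates(sentence):
    """Stage 1: weak hand-coded constraints produce acronym-definition pairs."""
    candidates = []
    for m in re.finditer(r"\(([A-Za-z0-9.\-]{2,10})\)", sentence):
        acronym = m.group(1)
        if not any(c.isupper() for c in acronym):
            continue
        words = sentence[:m.start()].split()
        window = min(len(acronym) + MAX_EXTRA_WORDS, 2 * len(acronym))
        # Every suffix of the window is a candidate definition.
        for size in range(1, min(window, len(words)) + 1):
            candidates.append((acronym, " ".join(words[-size:])))
    return candidates


def initials_feature(acronym, definition):
    """Fraction of acronym letters matched, in order, by word initials."""
    initials = [w[0].lower() for w in definition.split()]
    pos = 0
    matched = 0
    for ch in acronym.lower():
        while pos < len(initials) and initials[pos] != ch:
            pos += 1
        if pos < len(initials):
            matched += 1
            pos += 1
    return matched / len(acronym)


def stand_in_classifier(acronym, definition):
    """Stage 2 placeholder: a trained classifier would make this decision."""
    same_length = len(definition.split()) == len(acronym)
    return same_length and initials_feature(acronym, definition) == 1.0


sentence = ("The court convicted the head of the South Lebanon Army (SLA) "
            "of collaborating with Israel.")
candidates = generate_candidates(sentence)
accepted = [pair for pair in candidates if stand_in_classifier(*pair)]
print(len(candidates), "candidates;", "accepted:", accepted)
```

The point of the split is that the first stage stays deliberately permissive, so only the second stage needs retraining when the corpus changes.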

In our experiments, we tested various learning algorithms and found that an SVM is comparable in performance to the rigorously designed handcrafted systems presented in the literature. We reproduced experiments by Schwartz and Hearst (2003) and showed that our testing framework was comparable to theirs.

We integrated the acronym detection module with BaLIE’s alias resolution and demonstrated that it improves NER performance, particularly organization recall in an acronym-rich corpus.

Our future work will consist of applying the supervised learning approach to different corpora, especially corpora in which acronyms or definitions are not always indicated by parentheses.

Chapter 7

Discussion and Conclusion

This thesis is about creating a semi-supervised NER system. It has the desirable property of requiring, as input, only that an expert linguist list a dozen examples of each supported entity type. This contrasts with the annotation of thousands of documents with hundreds of entity types, which is required for supervised learning. It also contrasts with manually harvesting NE lists and designing a complex rule system, which are usually required for handmade systems. The NER system we present in this thesis therefore requires very little supervision, and we have included this human input in the Appendix.

The system presented in this thesis falls in the new category of semi-supervised and unsupervised systems. Work in this category is relatively rare and recent, and we believe ours to be the first that is devoted exclusively to the autonomous creation of an NER system.

Our overall goal is to create proof-of-concept software. In completing this system, we claim four major contributions that impact the NER field, and also have the potential to be used in other domains. First, we designed the first semi-supervised NER system that performs at a comparable level to that of a simple supervised learning-based NER system (Chapter 3). Second, we present a noise filter for generating NE lists based on computational linguistics and statistical semantic techniques (Chapter 4). This noise filter outperforms previous systems devoted to the same task. Third, we demonstrate a simple technique based on set intersections that can identify unambiguous examples for a given NE type (Chapter 5). Unambiguous NEs are a requirement for creating semi-supervised disambiguation rules. Finally, our fourth contribution is an acronym detection algorithm, part of an alias resolution system, that outperforms previous systems and allows improvement in NER for a “less common and very difficult problem” (Chapter 6).
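The set-intersection idea mentioned above can be illustrated with a toy sketch. It assumes that an entry is ambiguous when it also appears in the list of another entity type; the function name, the type names, and the list contents are all invented for illustration and are not the lists or the exact procedure of Chapter 5.

```python
def unambiguous_entries(lists_by_type, target_type):
    """Keep entries of `target_type` that occur in no other type's list."""
    target = lists_by_type[target_type]
    ambiguous = set()
    for other_type, entries in lists_by_type.items():
        if other_type != target_type:
            # Intersection with another type's list marks shared entries.
            ambiguous |= target & entries
    return target - ambiguous


# Toy NE lists (hypothetical data).
toy_lists = {
    "city": {"Paris", "Ottawa", "Washington"},
    "first_name": {"Paris", "Norm"},
    "state": {"Washington", "Oregon"},
}

print(sorted(unambiguous_entries(toy_lists, "city")))
```

In this toy setting, “Paris” and “Washington” are discarded for the city type because each intersects another list, which leaves only entries that are safe to use as seeds for disambiguation rules.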

These contributions are crucial components of a successful semi-supervised NER system, and they are explained in the context of the whole system, whose architecture is detailed in Figure 1. In the course of completing this system, however, we met many limitations and difficulties, which we discuss in Section 7.1. We conclude this thesis by presenting our future work and some general long-term research ideas.

We believe the resulting system, which requires little supervision, has two important advantages over past systems put forth in the literature, and these advantages support a shift towards semi-supervised and unsupervised techniques in the machine learning community. First, our system is extensible to new entity types. The design we adopted is free of linguistic knowledge or type-dependent heuristics. Therefore, we can modify the hierarchy or add new types, and let the system generate lists and rules. Second, the system is easily maintained over time. While supervised learning-based systems get most of their knowledge from large static training corpora, the system we present gets most of its knowledge from the Web. Recrawling the Web and periodically verifying the Web pages from which lists were extracted is a straightforward approach to maintenance.