Information Extraction

Information extraction (IE) systems are given a collection of documents and a set of templates with slots to be filled. The templates determine what information we are looking for in the collection. Each document is analyzed using shallow parsing and other techniques, such as finite-state machines, in order to carry out entity extraction (determining which entities are named or referred to in the text), link extraction (determining how the extracted entities are connected or related) and event extraction (determining what events are being described by the entities and links uncovered). The final goal is to fill in as many slots in the templates as possible using the information discovered.
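As a minimal sketch of what a template with slots might look like as a data structure, consider the following. The class name `Template`, its methods and the example slot names are assumptions made for illustration; they are not taken from any particular IE system.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class Template:
    """Hypothetical template: a name plus a fixed set of slots to be filled."""
    name: str
    slot_names: Tuple[str, ...]
    slots: Dict[str, Optional[str]] = field(init=False)

    def __post_init__(self) -> None:
        # Every slot starts empty (None) until extraction finds a value.
        self.slots = dict.fromkeys(self.slot_names)

    def fill(self, slot: str, value: str) -> None:
        if slot not in self.slots:
            raise KeyError(f"template {self.name!r} has no slot {slot!r}")
        self.slots[slot] = value

    def fill_ratio(self) -> float:
        # Fraction of slots that currently hold a value.
        filled = sum(v is not None for v in self.slots.values())
        return filled / len(self.slots)


# Illustrative two-slot template; the values are made up.
ceo = Template("company_ceo", ("company", "person"))
ceo.fill("company", "Acme Corp")
print(ceo.slots)         # {'company': 'Acme Corp', 'person': None}
print(ceo.fill_ratio())  # 0.5
```

Representing an unfilled slot as `None` keeps partially filled templates explicit, which matters because the goal is to fill as many slots as possible rather than all of them.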

An IE system is composed of a tokenizer, a morphological and lexical processor, a syntax analyzer and a domain or semantic analyzer (Patienza, 1997). The tokenizer breaks down the text into tokens (words); this is called word segmentation and is similar to the initial step taken by IR systems. The morphological and lexical analyzer (MLA) does part-of-speech tagging and word sense tagging: The first annotates each word with its grammatical function in a phrase (for instance, noun, adjective or verb) and thus requires shallow or light parsing. The second determines, when a word has different senses (and possibly grammatical functions), which one is being used, and assigns a category to nouns: IE systems need not only to identify names, but also to classify them as members of a certain class (e.g., company, product or organization). Usually a lexicon is the starting point: This is a simple thesaurus, a list of words with some simple information per word (inflections, possible syntactic categories, senses and real-world categories). The MLA looks up the words of the text in the lexicon and tags them; at this point some ambiguity is dealt with. Part-of-speech taggers can be statistical or rule-based; the best are correct about 95% of the time (Israel & Appelt, 1999). The MLA also has to give lexical features to lexical items that are not simple (composed of more than one word), like dates, times, numbers, locations and proper names. Proper names and spatial and temporal expressions are particularly important and difficult: These are large, open-ended classes in almost any language, and there are no simple rules as to what characterizes an item in each of the classes. To recognize names, hidden Markov models (HMMs), as well as finite-state rules, have been proposed. An HMM is essentially a finite-state model whose transition function incorporates probabilities derived from a training corpus. The underlying finite-state machine is followed and the path with the highest probability is chosen. For instance, the word “John” followed by the word “Smith” has a high probability of being a person’s name, but followed by “Deere” it has a high probability of denoting a company (Israel & Appelt, 1999).
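The idea of choosing the highest-probability path can be illustrated with a small Viterbi decoder over a toy HMM. This is only a sketch: the three states (`PER`, `ORG`, `O`) and all probabilities below are invented for the example, whereas a real system would estimate them from a training corpus.

```python
import math

# Toy model. All numbers are made up for illustration, not corpus estimates.
STATES = ("PER", "ORG", "O")
START = {"PER": 0.3, "ORG": 0.3, "O": 0.4}
TRANS = {
    "PER": {"PER": 0.6, "ORG": 0.05, "O": 0.35},
    "ORG": {"PER": 0.05, "ORG": 0.6, "O": 0.35},
    "O": {"PER": 0.2, "ORG": 0.2, "O": 0.6},
}
EMIT = {
    "PER": {"John": 0.5, "Smith": 0.4, "Deere": 0.1},
    "ORG": {"John": 0.3, "Deere": 0.6, "Smith": 0.1},
    "O": {"sells": 0.5, "tractors": 0.5},
}
UNSEEN = 1e-6  # small probability for words a state was never seen to emit


def viterbi(tokens):
    """Return the most probable state sequence for a list of tokens."""
    if not tokens:
        return []

    def emit(state, word):
        return math.log(EMIT[state].get(word, UNSEEN))

    # best[s] = (log-probability of the best path ending in s, that path)
    best = {s: (math.log(START[s]) + emit(s, tokens[0]), [s]) for s in STATES}
    for word in tokens[1:]:
        step = {}
        for s in STATES:
            # Pick the predecessor state that maximizes the path score into s.
            prev = max(STATES, key=lambda p: best[p][0] + math.log(TRANS[p][s]))
            score = best[prev][0] + math.log(TRANS[prev][s]) + emit(s, word)
            step[s] = (score, best[prev][1] + [s])
        best = step
    final = max(STATES, key=lambda s: best[s][0])
    return best[final][1]


print(viterbi(["John", "Smith"]))  # ['PER', 'PER']
print(viterbi(["John", "Deere"]))  # ['ORG', 'ORG']
```

Note that the label for “John” changes with the following word, because the decoder scores whole paths rather than tagging each word in isolation.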

Even after part-of-speech tagging, word sense disambiguation is still necessary; usually part-of-speech tagging simply reduces the possible readings but does not disambiguate. This is due to the multiple meanings that some nouns (e.g., interest) and verbs (e.g., run) can have. Most techniques for word sense disambiguation are based on examining terms near the target term, their position and part-of-speech tag. Supervised learning algorithms can be used to try to distinguish word senses from such features.
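A supervised approach of this kind can be sketched with a tiny naive Bayes classifier over the words surrounding the target term. This is an illustrative simplification: the sense labels and training sentences are invented, and only nearby words are used as features (position and part-of-speech tags are left out).

```python
import math
from collections import Counter, defaultdict

# Invented training examples: (sense label, sentence containing "interest").
TRAIN = [
    ("finance", "the bank raised the interest rate on the loan"),
    ("finance", "interest on the mortgage is paid monthly to the bank"),
    ("curiosity", "she showed great interest in modern music"),
    ("curiosity", "his interest in chess began at school"),
]


def context(sentence, target="interest", window=3):
    """Words within `window` positions of the first occurrence of the target."""
    words = sentence.lower().split()
    i = words.index(target)
    return words[max(0, i - window):i] + words[i + 1:i + 1 + window]


def train(examples):
    priors = Counter()
    counts = defaultdict(Counter)
    for sense, sentence in examples:
        priors[sense] += 1
        counts[sense].update(context(sentence))
    return priors, counts


def classify(sentence, priors, counts):
    vocab = {w for c in counts.values() for w in c}

    def score(sense):
        total = sum(counts[sense].values())
        s = math.log(priors[sense] / sum(priors.values()))
        for w in context(sentence):
            # Add-one smoothing so unseen context words do not zero the score.
            s += math.log((counts[sense][w] + 1) / (total + len(vocab)))
        return s

    return max(priors, key=score)


priors, counts = train(TRAIN)
print(classify("the interest rate at my bank fell", priors, counts))  # finance
print(classify("he has a keen interest in chess", priors, counts))    # curiosity
```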

The syntax analyzer tries to do shallow parsing: it only parses for main constituents of simple fragments, and only in fragments of interest; typically, it finds the verb and noun phrases of a sentence that contain recognized entities. Most prepositional phrases (starting with prepositions, conjunctions or relative pronouns) are discarded unless they fit some pattern. However, some of them (starting with the prepositions “of” and “for,” for instance) are usually relevant, and a second-phase analysis is sometimes used to handle these more complex cases. Relative clauses (starting with “that”) attached to the subject of a clause may or may not be analyzed. Coordination is hard to handle and therefore ignored most of the time: “The company invested $2 million and set up a joint venture” (VP conjunction), or “The company and the investors agreed to set up a joint venture” (NP conjunction) are simple cases and can be handled, but most IE systems do not handle disjunction. There is no attempt to organize these parts into higher-order structures (i.e., how they make up sentences and paragraphs). Usually grammars are also finite-state machines. This is robust and efficient; full parsing has been tried and found slow, error prone and, many times, useless. Even though shallow parsing overlooks some parts of the text, may err and sometimes produces partial results, all systems use it instead of full parsing (Israel & Appelt, 1999).
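A finite-state grammar for shallow parsing can be approximated with a regular expression over part-of-speech tags. The sketch below assumes the input is already tagged with Penn Treebank-style tags (assigned by hand here), and the two chunk patterns are deliberately minimal illustrations rather than a realistic grammar.

```python
import re

# Regular (finite-state) patterns over a space-terminated tag sequence.
# NP: optional determiner, any adjectives, one or more nouns.
# VP: optional modal, one or more verbs.
CHUNK = re.compile(
    r"(?<!\S)(?:"
    r"(?P<NP>(?:DT )?(?:JJ )*(?:NNP?S? )+)"
    r"|(?P<VP>(?:MD )?(?:VB[DGNPZ]? )+)"
    r")"
)


def chunk(tagged):
    """Return (label, text) pairs for the NP and VP chunks in a tagged sentence."""
    tags = "".join(tag + " " for _, tag in tagged)
    chunks = []
    for m in CHUNK.finditer(tags):
        # Each tag ends with one space, so counting spaces gives token indices.
        start = tags[:m.start()].count(" ")
        end = tags[:m.end()].count(" ")
        words = " ".join(word for word, _ in tagged[start:end])
        chunks.append((m.lastgroup, words))
    return chunks


# Hand-tagged version of the VP-conjunction example sentence.
sentence = [
    ("The", "DT"), ("company", "NN"), ("invested", "VBD"), ("$", "$"),
    ("2", "CD"), ("million", "CD"), ("and", "CC"), ("set", "VBD"),
    ("up", "RP"), ("a", "DT"), ("joint", "JJ"), ("venture", "NN"),
]
print(chunk(sentence))
# [('NP', 'The company'), ('VP', 'invested'), ('VP', 'set'), ('NP', 'a joint venture')]
```

Tokens that match no pattern (the money amount, the conjunction, the particle) are simply skipped, which mirrors how shallow parsing ignores parts of the text instead of building a full parse tree.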

The domain analyzer deals with some of the hardest problems in IE systems. It uses the information from previous phases to extract application-specific relationships and event descriptions and fill in the templates mentioned earlier to create the output. This is a very hard task that depends on the informational content of the documents (usually, documents created to present factual information, such as manuals or news, work much better than others), the complexity of the templates and the amount of knowledge available in the lexicon and other resources. This phase, like the others described before, can be attacked through a knowledge-engineering or a statistical approach. However, success on this task has been very limited so far. Rule-based strategies and learning-based strategies perform more or less the same, and neither one performs very well. Part of the problem may be that this part requires more in-depth analysis, and no good indicators are readily available with the light analysis that IE carries out. For instance, one of the hard problems the domain analyzer must deal with is that of co-reference: two or more expressions that refer to the same entity, or whose referents are related somehow. This is very common in natural language, especially in multi-sentence narrative. In the following example, the expressions “George Bush,” “The president,” “his” and “he” all refer to the same person: “George Bush left the White House this morning. The president went to his ranch in Crawford, Texas, where he will spend the weekend.” Co-reference is extremely difficult because of things like acronyms (IBM for International Business Machines), aliases (Big Blue), definite noun phrases (“the big computer company”) and pronouns (“it”). Also, co-reference may apply to all kinds of entities, not just names but groups or collections (“The jury deliberated for only 15 minutes. It quickly decided Microsoft was a monopoly”), events (“the testimony was confusing; it didn’t add to anything”), or abstract entities (“the case was interesting; it made headlines in all newspapers”) (Israel & Appelt, 1999).
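To make the easier end of the co-reference problem concrete, the sketch below resolves only acronyms and known aliases against a list of already-recognized entity names. The alias table and function names are hypothetical, and the approach deliberately leaves definite noun phrases and pronouns unresolved, since those require the kind of context analysis discussed above.

```python
# Hypothetical alias table; a real system would need a much larger resource.
ALIASES = {"big blue": "International Business Machines"}


def acronym(name):
    """Initial letters of the capitalized words in a name."""
    return "".join(w[0] for w in name.split() if w[0].isupper())


def resolve(mention, known_entities):
    """Return the known entity a mention refers to, or None if unresolved."""
    alias = ALIASES.get(mention.lower())
    if alias in known_entities:
        return alias
    for entity in known_entities:
        if mention == entity or mention == acronym(entity):
            return entity
    return None  # pronouns and definite noun phrases end up here


entities = ["International Business Machines", "Microsoft"]
print(resolve("IBM", entities))       # International Business Machines
print(resolve("Big Blue", entities))  # International Business Machines
print(resolve("it", entities))        # None
```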

Another complex issue is how to use the information extracted to fill in the templates. The difficulty of this task is directly related to the structure of the templates: The simplest templates have one slot to fill and simply ask for some value for a property (for instance, a template that simply requests company names), but most templates ask for at least two values that must stand in a certain relationship; for instance, a template may ask for two values, one of them a company name, another a person name, where the person is the CEO of the company. Yet other templates involve n different values (n > 2), all of them related by being part of a specific event: for instance, as in our example, a template may request, for all “acquisition” events, information as to which company is the bidder, which company is the target, when the acquisition took place and what was the price paid. Note that some of the values may be temporal (like the one asking when the acquisition took place); as stated above, these may be very hard to obtain. In general, the larger the number of slots, the harder the task: Filling a template with a large number of slots may involve analyzing whole paragraphs, as the information may be disseminated among several sentences. Also, some information may simply not be present. Thus, IE systems need to get co-references right and deal with partial information.
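The following sketch shows how a multi-slot “acquisition” template might be filled from several sentences while tolerating missing information. The regular-expression patterns are hypothetical hand-written rules, the sample sentences are invented, and the code assumes that all sentences passed in describe the same event, which sidesteps the co-reference problem.

```python
import re

SLOTS = ("bidder", "target", "date", "price")

# Hypothetical extraction patterns; each named group maps to a slot.
PATTERNS = [
    re.compile(r"(?P<bidder>[A-Z]\w+) (?:acquired|bought) (?P<target>[A-Z]\w+)"),
    re.compile(r"for (?P<price>\$[\d.]+ (?:million|billion))"),
    re.compile(r"on (?P<date>[A-Z][a-z]+ \d{1,2}, \d{4})"),
]


def fill_template(sentences):
    """Merge slot values found across sentences; unfilled slots stay None."""
    template = dict.fromkeys(SLOTS)
    for sentence in sentences:
        for pattern in PATTERNS:
            match = pattern.search(sentence)
            if not match:
                continue
            for slot, value in match.groupdict().items():
                if template[slot] is None:  # keep the first value found
                    template[slot] = value
    return template


text = [
    "Acme acquired Widgetco last week.",
    "The deal closed for $2 million.",
]
print(fill_template(text))
# {'bidder': 'Acme', 'target': 'Widgetco', 'date': None, 'price': '$2 million'}
```

Here the bidder and target come from one sentence and the price from another, while the date slot stays empty because no sentence mentions it.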

IE does more than IR, since it pays attention to order (syntax) and other characteristics disregarded by IR. However, even though it uses some basic linguistic knowledge, IE is not full-fledged natural language processing (NLP). The difference between IE and full-fledged NLP is that in IE the task is to extract some predefined type of information from text. This implies that we can disregard all text that is determined not to be related to the target information. However, some NLP is needed because the type of information desired almost always involves a relational component, that is, two or more entities connected by some action or activity in which the entities play a certain role (for instance, in a crime there is a victim and a perpetrator). Thus, it is not enough to detect the entity; it is also necessary to look at the context to determine the role that the entity plays in a given relation (its relationship to other entities). For instance, it is relatively easy to identify people by searching for proper names, titles (Mr., etc.) and personal pronouns; thus, this search needs no context. But to identify a person as a victim of a crime (as opposed to a perpetrator, or to something else unrelated) it is necessary to pay attention to the context (Belew, 2000). Finally, IE is applied to large document collections; because of performance requirements, IE must use fast and robust techniques that perform well and deal with errors and incomplete information, and many NLP techniques do not possess these characteristics.

As in the case of IR, it is very difficult to evaluate the performance of an IE system: This requires estimating, for a given collection of documents and set of templates, the maximum amount of information about the templates that can be obtained from the documents (that is, how many slots can be filled and in how many different ways). Even humans have difficulty with this task, and building a benchmark is difficult and time consuming (there may be no agreement among humans, e.g., some may consider the Red Cross a company, some may not). The state of the art seems to have stalled at around 60-80% of human performance. This assumes clean, grammatical text; text with noise (transcripts, etc.) has its own problems and may lower performance.

Combining Databases and IE