Automatic Indexing Information Extraction
3.4 Information Extraction
There are two processes associated with information extraction: determination of facts to go into structured fields in a database and extraction of text that can be used to summarize an item. In the first case only a subset of the important facts in an item may be identified and extracted. In summarization all of the major concepts in the item should be represented in the summary.
The process of extracting facts to go into indexes is called Automatic File Build in Chapter 1. Its goal is to process incoming items and extract index terms that will go into a structured database. This differs from indexing in that its objective is to extract specific types of information rather than to understand all of the text of the document. An Information Retrieval System’s goal is to provide an in-depth representation of the total contents of an item (Sundheim-92). An Information Extraction system only analyzes those portions of a document that potentially contain information relevant to the extraction criteria. The objective of the data extraction is in most cases to update a structured database with additional facts. The updates may be from a controlled vocabulary or substrings from the item as defined by the extraction rules. The term “slot” is used to define a particular category of information to be extracted. Slots are organized into templates or semantic frames. Information extraction requires multiple levels of analysis of the text of an item. The system must understand the words and their context (discourse analysis). The processing is very similar to the natural language processing described under indexing.
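The relationship between slots, templates and extraction rules can be illustrated with a minimal sketch. The template name, the slot names and the regular-expression rules below are hypothetical, chosen only for illustration; an actual extraction system applies the deeper linguistic and discourse analysis described above rather than simple pattern matching.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Template:
    """A template (semantic frame): a named collection of slots."""
    name: str
    slots: dict = field(default_factory=dict)  # slot name -> filler, or None


# Hypothetical extraction rules: one regular expression per slot.
# Group 1 of each pattern is the substring copied into the slot.
RULES = {
    "company": re.compile(r"\b([A-Z][A-Za-z]+ (?:Corp|Inc|Ltd))\b"),
    "amount": re.compile(r"\$(\d+(?:\.\d+)?) (?:million|billion)"),
}


def fill_template(text, rules=RULES):
    """Fill each slot with the first substring of the item matching its rule."""
    template = Template(name="acquisition", slots={slot: None for slot in rules})
    for slot, pattern in rules.items():
        match = pattern.search(text)
        if match:
            template.slots[slot] = match.group(1)
    return template


item = "Acme Corp agreed to buy the unit for $12.5 million."
print(fill_template(item).slots)
```

Slots for which no rule fires remain empty (None), which reflects the point that only a subset of the facts in an item may be identified and extracted.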
In establishing metrics to compare information extraction, the previously defined measures of precision and recall are applied with slight modifications to their meaning. Recall refers to how much information was extracted from an item versus how much should have been extracted from the item. It shows the amount of correct and relevant data extracted versus the correct and relevant data in the item. Precision refers to how much information was extracted accurately versus the total information extracted.
Additional metrics used are overgeneration and fallout. Overgeneration measures the amount of irrelevant information that is extracted. This could be caused by templates filled on topics that are not intended to be extracted or slots that get filled with non-relevant data. Fallout measures how much a system assigns incorrect slot fillers as the number of potential incorrect slot fillers increases (Lehnert-91).
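The measures above can be made concrete with a small sketch. It assumes a simplified scoring model in which a slot filler is either exactly right or wrong, and in which a filled slot that has no counterpart in the answer key counts as irrelevant; the function name and the sample slots are hypothetical. Fallout is omitted because it also requires a count of the potential incorrect fillers, which this toy answer key does not provide.

```python
def extraction_metrics(extracted, expected):
    """Score one filled template against an answer key.

    extracted, expected: dicts mapping slot name -> filler.
    Assumption: a filler is correct only if it exactly matches the key.
    """
    correct = sum(1 for slot, value in extracted.items()
                  if slot in expected and expected[slot] == value)
    # Slots that were filled although the answer key has no such slot.
    spurious = sum(1 for slot in extracted if slot not in expected)

    recall = correct / len(expected) if expected else 0.0
    precision = correct / len(extracted) if extracted else 0.0
    overgeneration = spurious / len(extracted) if extracted else 0.0
    return {"recall": recall,
            "precision": precision,
            "overgeneration": overgeneration}


expected = {"company": "Acme Corp", "amount": "12.5",
            "date": "1993", "buyer": "Widget Inc"}
extracted = {"company": "Acme Corp", "amount": "12.5",
             "date": "1992", "location": "Boston"}
print(extraction_metrics(extracted, expected))
```

In this example two of the four expected fillers are found (recall 0.5), two of the four extracted fillers are right (precision 0.5), and one of the four extracted fillers is irrelevant (overgeneration 0.25).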
These measures are applicable to both human and automated extraction processes. Human beings fall short of perfection in data extraction, just as automated systems do. The best source of analysis of data extraction is the Message Understanding Conference (MUC) Proceedings. Conferences (similar to TREC) were held in 1991, 1992, 1993 and 1995. The conferences are sponsored by the Advanced Research Projects Agency/Software and Intelligent Systems Technology Office of the Department of Defense. Large test databases are made available to any organization interested in participating in evaluation of their algorithms. In MUC-5 (1993), four experienced human analysts performed detailed extraction against 120 documents and their performance was compared against the top three information extraction systems. The humans achieved 79 per cent recall with 82 per cent precision. That is, they extracted 79 per cent of the data they could have found, and 18 per cent of what they extracted was erroneous. The automated programs achieved 53 per cent recall and 57 per cent precision. The other factor to weigh is the cost of information extraction. The humans required between 15 and 60 minutes to process a single item versus the 30 seconds to three minutes required by the computers. Thus the existing algorithms are not operating close to what a human can achieve, but they are significantly cheaper. A combination of the two in a computer-assisted information extraction system appears to be the most reasonable solution for the foreseeable future.
Another related information technology is document summarization. Rather than trying to determine specific facts, the goal of document summarization is to extract a summary of an item, maintaining the most important ideas while significantly reducing the size. Examples of summaries that are often part of an item are titles, tables of contents, and abstracts, with the abstract being the closest to a summary. The abstract can be used to represent the item for search purposes or as a way for a user to determine the utility of an item without having to read the complete item. It is not feasible to automatically generate a coherent narrative summary of an item with proper discourse, abstraction and language usage (Sparck Jones-93). Restricting the domain of the item can significantly improve the quality of the output (Paice-93, Reimer-88). The more restricted goal for much of the research is to find subsets of the item that can be extracted and concatenated (usually extracting at the sentence level) and that represent the most important concepts in the item. There is no guarantee of readability as a narrative abstract and it is seldom achieved. It has been shown that extracts of approximately 20 per cent of the complete item can represent the majority of significant concepts (Morris-92). Different algorithms produce different summaries. Just as different humans create different abstracts for the same item, the fact that automated techniques generate different summaries does not intrinsically imply major deficiencies in any of the summaries.
Most automated algorithms approach summarization by calculating a score for each sentence and then extracting the sentences with the highest scores. Some examples of the scoring techniques are use of rhetorical relations (e.g., reason, direction, contrast: see Miike-94 for experiments in Japanese), contextual inference and syntactic coherence using cue words (Rush-71), term location (Salton-83), and statistical weighting properties discussed in Chapter 5. There is no overall theoretical basis for the approaches, which has led to many heuristic algorithms. Kupiec et al. are pursuing a statistical classification approach based upon a training set, reducing the heuristics by focusing on a weighted combination of criteria to produce an “optimal” scoring scheme (Kupiec-95). They selected the following five feature sets as a basis for their algorithm:
Sentence Length Feature that requires a sentence to be over five words in length
Fixed Phrase Feature that looks for the existence of phrase “cues” (e.g., “in conclusion”)
Paragraph Feature that places emphasis on the first ten and last five paragraphs in an item and also the location of the sentences within the paragraph
Thematic Word Feature that uses word frequency
Uppercase Word Feature that places emphasis on proper names and acronyms.
As with previous experiments by Edmundson, Kupiec et al. discovered that location-based heuristics give better results than the frequency-based features (Edmundson-69).
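The general score-and-extract approach, using the five feature sets listed above, can be sketched as follows. This is a simplified illustration and not the trained classifier of Kupiec et al.: the weights are hand-picked rather than learned from a training set, the cue phrase list and stopword list are hypothetical, and treating the first or last sentence of a paragraph as the favored location is an assumption made here.

```python
import re
from collections import Counter

CUE_PHRASES = ("in conclusion", "in summary")   # hypothetical cue list
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "it", "that", "for"}

# Hand-picked illustrative weights; a trained system would estimate these.
WEIGHTS = {"length": 1.0, "cue": 2.0, "position": 2.0,
           "thematic": 1.0, "uppercase": 1.0}


def split_sentences(paragraph):
    """Naive sentence splitter: break after '.', '!' or '?'."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", paragraph) if s.strip()]


def summarize(paragraphs, n_sentences=2, n_thematic=5):
    """Return the n highest-scoring sentences, in document order."""
    words = [w for p in paragraphs for w in re.findall(r"[a-z]+", p.lower())
             if w not in STOPWORDS]
    thematic = {w for w, _ in Counter(words).most_common(n_thematic)}

    scored = []
    order = 0
    for p_idx, paragraph in enumerate(paragraphs):
        sentences = split_sentences(paragraph)
        for s_idx, sentence in enumerate(sentences):
            tokens = sentence.split()
            lower = sentence.lower()
            features = {
                # Sentence must be over five words long.
                "length": len(tokens) > 5,
                # Sentence contains a fixed cue phrase.
                "cue": any(cue in lower for cue in CUE_PHRASES),
                # First ten or last five paragraphs, first or last sentence.
                "position": (p_idx < 10 or p_idx >= len(paragraphs) - 5)
                            and s_idx in (0, len(sentences) - 1),
                # Sentence contains one of the most frequent words.
                "thematic": any(w in thematic
                                for w in re.findall(r"[a-z]+", lower)),
                # Capitalized word that is not sentence-initial.
                "uppercase": any(t[0].isupper() for t in tokens[1:]),
            }
            score = sum(WEIGHTS[name] for name, on in features.items() if on)
            scored.append((score, order, sentence))
            order += 1

    top = sorted(scored, key=lambda t: (-t[0], t[1]))[:n_sentences]
    return [sentence for _, _, sentence in sorted(top, key=lambda t: t[1])]


document = [
    "Information extraction fills database slots. It is hard.",
    "In conclusion, the MUC evaluations show that extraction systems "
    "trail human analysts.",
]
print(summarize(document, n_sentences=1))
```

Because the selected sentences are simply concatenated in their original order, the output carries no guarantee of readability as a narrative, which matches the limitation noted earlier in this section.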
Although there is significant overlap in the algorithms and techniques for information extraction and indexing items for information retrieval, this text does not present more detail on information extraction. For additional information, the MUC proceedings from Morgan Kaufmann Publishers, Inc. in San Francisco are one source of the latest detailed information on information extraction.
3.5 Summary
This chapter introduces the concepts behind indexing. Historically, term indexing was applied to a human-generated set of terms that could be used to locate an item. With the advent of computers and the availability of text in electronic form, alternatives to human indexing are available. There is too much information in electronic form for human indexing of each item to be feasible. Thus automated indexing techniques are essential. When humans performed the indexing, there were guidelines on the scope of the indexing process. They were needed to ensure that the human indexers achieved the objectives of a particular indexing effort. The guidelines defined the level of detail to which the indexing was to be applied (i.e., exhaustivity and specificity). In automated systems there is no reason not to index to the lowest level of detail. The strength in manual indexing was the associative powers of the human indexer in consolidating many similar ideas into a small number of representative index terms and knowing when certain concepts were of such low value as to not warrant indexing. Automated indexing systems try to achieve these by using weighted and natural language systems and by concept indexing. Automated systems that rely on statistical information alone never achieve totally accurate assignment of importance weights to the concepts being indexed. The power of language is not only in the use of words but also in the elegance of their combinations.
The goal of automatic indexing is not to achieve equivalency to human processing, but to achieve sufficient interpretation of items to allow users to locate needed information with the minimum amount of wasted effort. Even the human indexing process has left much to be desired and has required significant effort from the user to locate all of the needed information.
As difficult as determining index terms is, text summarization encounters an even higher level of complexity. The focus of text summarization is still on just the location of text segments that adequately represent an item. The combining of these segments into a readable “abstract” is still an unachievable goal. In the near term, a summarization that may not be grammatically correct but adequately covers the concepts in an item can be used by a user to determine if the complete item should be read in detail.
The importance of the algorithms being developed for automatic indexing cannot be overstated. The original text of items is not being searched. The extracted index information is realistically the only way to find information. The weaker the theory and implementation of the indexing algorithms are, the greater the impact on the user in wasted energy to find needed information. The Global Information Infrastructure (e.g., the Internet) is touching every part of our lives, from academic instruction to shopping and getting news. The indexing and search algorithms drive the success of this new aspect of everyday life.
EXERCISES
1. Under what circumstances is manual indexing not required to ensure finding information? Postulate an example where this is true.
2. Does high specificity always imply high exhaustivity? Justify your answer.
3. Trade off the use of precoordination versus postcoordination.
4. What are the problems with Luhn’s concept of “resolving power”?
5. How does the process of information extraction differ from the process of document indexing?