9. Natural Language Processing (NLP) Approaches

9.4 Proper Noun Recognition, Categorization, Normalization, and Matching

The presence of specified proper nouns is often a necessary, though not necessarily sufficient, condition for a document to be relevant to a specified topic. If the topic is the Japanese stock market, then some form of the proper noun “Japan” is clearly essential, although by itself hardly sufficient, since documents might deal with many other aspects of Japan. Names of persons, companies, government agencies, religions, chemicals, and many other entities may be essential to the specification of a topic, and to the recognition of documents relevant to the given topic. Recognition, extraction, and matching of proper nouns are considerably more complex than they might at first seem. A variety of factors complicate the process. Many proper nouns consist of more than one noun, e.g., “Wall Street Journal.” Many proper nouns include a preposition, e.g., “Department of Defense,” or a conjunction, e.g., “John Wiley and Sons.” Many proper nouns can be specified in multiple forms, e.g., “MCI Communications Corp.,” “MCI Communications,” and “MCI.” Many proper nouns are group nouns, which may result in references either to the group as a whole, or to the individual entities making up the group, e.g., “European Community,” “Latin America.” Common nouns and noun phrases may also group individual entities that have proper noun names, e.g., western nations, socialist countries, third world, agricultural chemicals.

Borgman et al. [JASIS] discuss at length the particularly difficult case of names of persons. Conventions for assigning names vary with the culture and historical period. In ancient times, single names were normal. The practice of assigning multiple names, e.g., first, middle, and last names, is more recent. Some cultures use compound surnames, but the conventions vary from one culture to another, e.g., “[h]ispanic children receive a combination of their parents’ surnames, and wives acquire a combination of their maiden surnames, and their husbands’ surnames.” Order of names also varies with culture, e.g., “[a]sians traditionally place the surname first, although asians living in Western nations often report their names with surnames last.” “Personal names may be translated from one language to another, retaining meaning,... or be transliterated from one alphabet or character set to another.” Multiple transliteration schemes exist. People change their names over their lifetime, as a result of marriage, divorce, adoption, or movement from one country to another. People adopt or receive nicknames and diminutives, e.g., “Dick” for “Richard,” “Bob” for “Robert.” A person may use one form of her name on a driver’s license, but another form for publication as an author. On top of all this, errors are common: not only typographical errors, which affect any typed input, but also phonetic errors, e.g., a person from one cultural or linguistic background transcribing a spoken name from another cultural or linguistic background is especially likely to err.

Paik et al. [ARPA Workshop] [Corpus Proc] have developed a sophisticated series of procedures for proper noun recognition and matching in their DR-LINK (Document Retrieval through LINguistic Knowledge) and KNOW-IT (KNOWledge base Information Tools) IR engines. The proper noun recognition system described here was developed through corpus analysis of newspaper texts. First they assign parts of speech to all the words in the document; then they execute a general-purpose noun phrase bracketer, and a special-purpose proper noun phrase boundary identifier. Next the system categorizes all the proper nouns; this is consistent with the DR-LINK emphasis on capturing the conceptual level of a document, as well as the actual keywords and phrases. Topic requests may often be stated at the conceptual level. As Liddy et al. note, “queries about government regulations of use of agrochemicals on produce from abroad, require presence of the following proper noun categories: government agency, chemical, and foreign country.” Note that a document that contains proper nouns in those categories may not actually contain the words “government,” “agrochemicals,” “produce,” or “abroad.”

DR-LINK attempts to recognize eight categories: Geographic Entity, Affiliation, Organization, Human, Document, Equipment, Scientific, and Temporal; within each of these categories, DR-LINK recognizes two or more sub-categories, for a total of 29 meaningful sub-categories. (A more recent version, embodied in both DR-LINK and another commercial tool, KNOW-IT, recognizes over 60 sub-categories.) “Affiliation” includes “religion” and “nationality.” “Human” includes “person” and “title.” “Scientific” includes “disease,” “drug,” and “chemical.” And so on. DR-LINK performs this categorization using such clues as known prefixes, infixes, and suffixes for each category, e.g., Dr., Mr., Ms., and Jr. for persons, Inc. and Ltd. for companies, etc. DR-LINK also uses a database of aliases for alternate names of some proper nouns, and knowledge bases such as gazetteers, the CIA World Factbook, etc. Contextual clues are also used, e.g., if the pattern proper noun, comma, proper noun is encountered, and the second noun has been identified as a state, the first noun (if not otherwise categorized) will be categorized as a city.
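The clue-based categorization just described can be sketched roughly as follows. This is only an illustration, not DR-LINK's actual implementation: the clue tables, the tiny state list standing in for a gazetteer, and the function names `categorize` and `categorize_pair` are all assumptions made for the example.

```python
# Hypothetical clue tables; DR-LINK's real lists are not given in the text.
PERSON_PREFIXES = {"Dr.", "Mr.", "Ms."}
PERSON_SUFFIXES = {"Jr."}
COMPANY_SUFFIXES = {"Inc.", "Ltd."}
KNOWN_STATES = {"Montana", "Georgia", "Texas"}  # stand-in for a gazetteer


def categorize(name):
    """Return a category for one proper noun using affix clues, or None."""
    tokens = name.split()
    if not tokens:
        return None
    if tokens[0] in PERSON_PREFIXES or tokens[-1] in PERSON_SUFFIXES:
        return "person"
    if tokens[-1] in COMPANY_SUFFIXES:
        return "company"
    if name in KNOWN_STATES:
        return "state"
    return None


def categorize_pair(first, second):
    """Apply the 'proper noun, comma, proper noun' contextual clue."""
    first_cat, second_cat = categorize(first), categorize(second)
    # An otherwise uncategorized noun followed by a state is taken to be a city.
    if first_cat is None and second_cat == "state":
        first_cat = "city"
    return first_cat, second_cat
```

Under these assumptions, `categorize_pair("Helena", "Montana")` labels the first noun a city only because the second was recognized as a state and the first had no category of its own.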

Since a given proper noun may take multiple forms, DR-LINK standardizes proper nouns as they are being categorized. That is, all forms of the same proper noun are mapped into a single standard form, to simplify subsequent matching. This is analogous to stemming of ordinary words, which reduces all variants to a common form. However, whereas stemming (at least in English!) largely involves processing of multiple suffixes, standardizing of proper nouns involves standardizing of prefixes, infixes, suffixes, and variant forms of proper nouns, e.g., “Dick” to “Richard.” Note that two variant forms of the same proper noun, referring to the same entity, may occur not only in two different documents, or in a document and a topic request, but also within a single document. In particular, an entity may be named in full on its first reference, and mentioned in a more abbreviated form on subsequent references. It is an important instance of reference resolution for a Natural Language Processing (NLP) based IR system to recognize that these are references to the same entity.
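A minimal sketch of such standardization follows, assuming a small alias table, a nickname table, and a list of affixes to strip. The tables and the function name `normalize` are hypothetical; the text gives only the “Dick” to “Richard” and MCI examples, not DR-LINK's actual rules.

```python
# Hypothetical tables, seeded with the examples mentioned in the text.
NICKNAMES = {"Dick": "Richard", "Bob": "Robert"}
ALIASES = {"MCI": "MCI Communications"}
DROPPABLE_AFFIXES = {"Dr.", "Mr.", "Ms.", "Jr.", "Inc.", "Ltd.", "Corp."}


def normalize(name):
    """Map a proper-noun variant to a single standard form."""
    name = " ".join(name.split())          # collapse runs of whitespace
    if name in ALIASES:                    # whole-name alias lookup
        return ALIASES[name]
    tokens = [t.rstrip(",") for t in name.split()]
    tokens = [t for t in tokens if t not in DROPPABLE_AFFIXES]
    tokens = [NICKNAMES.get(t, t) for t in tokens]
    return " ".join(tokens)
```

With these tables, “MCI,” “MCI Communications,” and “MCI Communications Corp.” all map to the same string, so later matching can compare standard forms directly.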

DR-LINK also expands group proper and common nouns, so that a topic request can match a document on either the group name or its constituents. For example [Feldman, ONLINE], a request for documents about “African countries which have had civil wars, insurrections, coups, or rebellions” will return not only documents that contain some form of the proper noun “Africa,” but also documents containing references to countries within Africa. DR-LINK uses proper noun and common noun expansion databases.
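Group expansion of this kind can be illustrated with a toy expansion database. The `GROUP_MEMBERS` table and both function names are assumptions for the sketch; the contents of DR-LINK's real expansion databases are not described in the text.

```python
# Hypothetical expansion database; a real one would be far larger.
GROUP_MEMBERS = {
    "Africa": {"Angola", "Liberia", "Rwanda"},
    "Latin America": {"Brazil", "Chile", "Mexico"},
}


def expand_query_terms(terms):
    """Add the constituents of every group noun found in the query terms."""
    expanded = set(terms)
    for term in terms:
        expanded |= GROUP_MEMBERS.get(term, set())
    return expanded


def matches(query_terms, document_terms):
    """True if the document shares any term with the expanded query."""
    return bool(expand_query_terms(query_terms) & set(document_terms))
```

A query naming only “Africa” would then match a document that mentions “Rwanda” but never the continent itself.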

Note that, in a system like DR-LINK, proper nouns can provide several levels of evidence for topic-document similarity computations. First, there is the obvious matching on the names themselves. Second (as noted earlier), there is matching on categories assigned to the names. This category matching is similar to, but supplements, the matching on subject categories (SFC’s) described in an earlier section. Third, expansion of group nouns can result in matching a document on proper nouns not actually mentioned in the topic statement. This can work in the other direction too, e.g., if a document mentions Montana or Atlanta, then these references may be used to match the document against a topic that only speaks about “American” companies. Fourth, proper nouns naming geographical entities can provide relationship information, e.g., they can “reveal the location of a company or the nationality of an individual.” Subject information can be combined with proper noun category information for more refined topic-document matching. A report of a merger should involve (at least) two proper nouns of category “company,” while a report of an invasion is likely to specify two geographic entities, most likely at the level of country or province.
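The idea of combining a subject with required proper noun categories can be sketched as a simple count check. The requirements table and the name `satisfies` are hypothetical, and the invasion rule is simplified to countries only, whereas the text allows country or province.

```python
from collections import Counter

# Hypothetical requirements: topic -> minimum count of each proper-noun category.
TOPIC_REQUIREMENTS = {
    "merger": {"company": 2},
    "invasion": {"country": 2},  # simplification: provinces are ignored here
}


def satisfies(topic, entity_categories):
    """True if the document's categorized proper nouns meet the topic's minimums."""
    counts = Counter(entity_categories)
    required = TOPIC_REQUIREMENTS[topic]
    return all(counts[category] >= minimum for category, minimum in required.items())
```

In a real system such a check would presumably contribute to a similarity score rather than act as a hard filter; the boolean form is used here only to keep the example short.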

A significantly different approach to proper noun recognition is taken by Mani et al. [Corpus Proc, 1996]. Their approach differs both in goals and in methods. They focus on a much smaller set of subject categories: people, products, organizations, and locations. Within large text corpora, they seek (like Paik) to categorize previously unknown names automatically. However, they attempt to go further than Paik, extracting from the text appropriate semantic attributes for each named entity, e.g., the occupation and gender of a person. A given entity may be mentioned more than once in a given document, and each mention may employ a different variation of the entity’s name, e.g., “President Clinton,” “Bill Clinton,” “Clinton,” “the president,” etc. They seek to “unify” these mentions, i.e., to recognize all mentions of the same entity, and to combine the attributes associated with these varied mentions into one common schema describing the given entity. This is called “coreference resolution” for proper nouns. When two mentions (and their associated attributes) are successfully unified as referring to the same entity, they are said to be “coanchored.” Note that this goes considerably beyond (although it includes) the normalization of proper nouns performed by Paik.

Coreference resolution is closely tied to attribute extraction. On the one hand, attributes extracted from one mention of a given entity can be combined with attributes extracted from another mention, to fill out as many of the “slots” associated with the given type of entity as possible. For example, one mention may indicate that Clinton’s occupation or title is “president.” Another mention may indicate that his gender is “male.” On the other hand, extracted attributes can serve as evidence to determine whether two mentions refer to the same entity, or to two distinct entities. For example, if “President Clinton” has been associated with the gender attribute value “male,” and “Hillary Clinton” has been associated with the gender attribute value “female,” this is evidence that these mentions do not refer to the same entity. But “President Clinton” and “Mr. Clinton” will match on gender, and hence will be coreference candidates unless additional evidence indicates a contradiction. Moreover, attributes may also serve to indicate whether two mentions refer to distinct but related entities. For example, “Bill Clinton,” “Hillary Clinton,” and “the Clintons” are distinct but related entities. As a further refinement, Mani distinguishes between “discourse pegs,” i.e., entities that are distinct in a given discourse, and entities that are distinct in the real world. For example, President Clinton and ex-Governor Clinton may be two distinct discourse pegs for purposes of analyzing a given document, although they refer to the same real-world object in the world model or belief system of an external knowledge base.

As Mani encounters new proper noun mentions in the text of a given document, he naturally wants to limit the number of earlier mentions that must be evaluated as possible candidates for coreference. He does this by indexing each mention by normalized name (a standardized form, analogous to Paik’s), by name elements in its name (individual words within the name), and by its abbreviations. Only mentions that match on at least one of these indexes are coreference candidates. Abbreviations are generated by rule, or retrieved from a lexicon; hence, a full name in one mention can be matched against an abbreviated name in another.
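The three-way indexing can be sketched as an inverted index from keys to mention ids. This is an illustration only: the class name `MentionIndex`, the use of lowercasing as the “normalized name,” and the initialism rule for abbreviations are all assumptions, since the text does not specify Mani's actual normalization or abbreviation rules.

```python
from collections import defaultdict


class MentionIndex:
    """Index mentions by normalized name, name elements, and abbreviations."""

    def __init__(self):
        self._index = defaultdict(set)   # key -> ids of mentions carrying that key
        self._count = 0

    @staticmethod
    def keys_for(name):
        words = name.split()
        keys = {name.lower()}                    # normalized name (assumed: lowercase)
        keys.update(w.lower() for w in words)    # individual name elements
        if len(words) > 1:
            # Assumed abbreviation rule: initialism of the words in the name.
            keys.add("".join(w[0] for w in words).lower())
        return keys

    def candidates(self, name):
        """Return ids of earlier mentions sharing at least one index key."""
        found = set()
        for key in self.keys_for(name):
            found |= self._index.get(key, set())
        return found

    def add(self, name):
        """Register a mention and return its id."""
        mention_id = self._count
        self._count += 1
        for key in self.keys_for(name):
            self._index[key].add(mention_id)
        return mention_id
```

A new mention is compared only against the ids returned by `candidates`, rather than against every earlier mention in the document.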

Another difference between the Mani and Paik approaches is that Mani makes greater use of the context surrounding a proper noun, and of the discourse structure of successive mentions. In particular, Mani makes use of both honorifics and “appositive phrases,” phrases adjoining and identifying a proper noun. It is a widely used convention, especially in news stories, to attach an honorific or an appositive phrase to the first mention of a given name, e.g., “Anthony Lake, Clinton’s national security advisor,” or “Osamu Nagayama, 33, senior vice president and chief financial officer of Chugai,” or “German Chancellor Gerhard Schroeder.” Such appositives and honorifics are generally employed whenever the named entity is not a “household name,” and is not sufficiently identified by title. The convention also applies to entities other than persons, especially organizations and locations, e.g., “X, a small Bay Area town.” (Paik indicates that one of their intended research directions is the use of appositive phrases. However, in one knowledge base derived by KNOW-IT from New York Times articles, Anthony Lake was erroneously categorized as a body of water, presumably because the appositive phrase was ignored or misinterpreted.) Mani identifies candidate appositive phrases by pattern matching based on left and right delimiters such as commas and certain parts of speech. Syntactic analysis is then used to extract key elements, e.g., a head or premodifier, from the given phrase. In the “Nagayama” example above, “senior vice president” would be extracted, and looked up in a semantic lexicon ontology, which identifies the title as a “corporate officer.” Plainly, the value of such appositive phrases for categorization depends on the availability of lexicons that enable one to interpret their semantic content.
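A much-simplified sketch of appositive detection follows. It uses only comma delimiters and a substring lookup in a toy lexicon, whereas the approach described above also relies on parts of speech and syntactic analysis. The regular expression, the `TITLE_LEXICON` entries (including the category label “government official”), and the function name are assumptions made for the example.

```python
import re

# Hypothetical semantic lexicon mapping title phrases to categories.
TITLE_LEXICON = {
    "senior vice president": "corporate officer",
    "national security advisor": "government official",
}

# Capitalized multi-word name, comma, optional age, then text up to the next comma or period.
APPOSITIVE = re.compile(
    r"(?P<name>[A-Z][a-z]+(?: [A-Z][a-z]+)+), (?:\d{1,3}, )?(?P<phrase>[^,.]+)"
)


def extract_appositive(sentence):
    """Return (name, phrase, category) for the first appositive found, else None."""
    match = APPOSITIVE.search(sentence)
    if match is None:
        return None
    phrase = match.group("phrase").strip()
    category = None
    for title, title_category in TITLE_LEXICON.items():
        if title in phrase.lower():
            category = title_category
            break
    return match.group("name"), phrase, category
```

Without a lexicon entry the phrase is still extracted but the category stays `None`, which mirrors the point that appositives help categorization only when a lexicon can interpret them.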

Another distinctive feature of the Mani methodology, closely related to the gathering of evidence over multiple mentions of an entity, is the explicit handling of uncertainty. Evidence gathered in one mention can reinforce or contradict evidence gathered in another mention. Mani employs a variety of Knowledge Sources (KS’s). KS’s are small rule-based programs that attempt to categorize (“tag”) entities. Many of the rules employed by Mani’s KS’s are similar to the rules employed by Paik’s system, e.g., one KS attempts to identify organizations by using suffixes such as “Inc.” and “Ltd.” Another tries to identify persons by looking for titles and honorifics, e.g., “Mr.”, “Lt. Col.”, “Ms.”, etc. Other KS’s use lexicons, e.g., organization lexicons, gazetteers as geographic lexicons, etc. On the basis of the evidence it collects, a KS can generate multiple hypotheses with different confidences. Mani offers the example that “General Electric Co.” may generate one hypothesis that the entity named is a person, with “General” as a title, while other hypotheses may be that it is an organization or a county, based on the abbreviated suffix “Co.”. On the other hand, multiple KS’s may generate the same hypothesis based on different evidence, e.g., one KS may hypothesize that the given mention is an organization based on the “Co.” suffix; another KS may generate the same hypothesis based on the presence of the name in an organization lexicon. A “Combine-Confidence” function computes the confidence of a given hypothesis about a given mention as the weighted sum of the probabilities assigned to the hypothesis by each KS that contributed to it, each probability weighted by the reliability of the KS that generated it.
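The weighted sum described for “Combine-Confidence” can be written out directly. In the sketch below the KS names, probabilities, and reliability weights are invented for illustration, and no normalization of the sum is applied because the text does not say whether one is used.

```python
from collections import defaultdict


def combine_confidence(proposals, reliability):
    """Combine KS proposals into one confidence per hypothesis.

    proposals:   iterable of (ks_name, hypothesis, probability)
    reliability: dict mapping ks_name -> reliability weight
    """
    confidence = defaultdict(float)
    for ks_name, hypothesis, probability in proposals:
        # Each probability is weighted by the reliability of the KS that produced it.
        confidence[hypothesis] += reliability[ks_name] * probability
    return dict(confidence)


# Invented numbers for the "General Electric Co." example.
reliability = {"suffix_ks": 0.5, "lexicon_ks": 0.4, "title_ks": 0.1}
proposals = [
    ("suffix_ks", "organization", 0.6),   # from the "Co." suffix
    ("lexicon_ks", "organization", 0.9),  # name found in an organization lexicon
    ("title_ks", "person", 0.3),          # "General" read as a title
]
scores = combine_confidence(proposals, reliability)
```

Here two independent KS’s reinforce the organization hypothesis, so its combined confidence ends up well above that of the person hypothesis.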

The confidence values associated with hypotheses play an important role in mention unification. If two person mentions have conflicting hypotheses about the occupation slot, but one hypothesis has a much lower confidence than the other, unification may succeed. On the other hand, if two mentions have conflicting gender hypotheses, and these hypotheses both have high confidence values (e.g., based on the honorifics “Mr.” and “Mrs.” respectively), the unification will fail.
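One way to picture confidence-sensitive unification is sketched below. The rule used here, that unification fails only when both conflicting hypotheses exceed a fixed threshold, is an assumption chosen to reproduce the two cases just described; the threshold value and the function name `unify` are likewise hypothetical.

```python
HIGH_CONFIDENCE = 0.8  # hypothetical threshold for a "high" confidence value


def unify(mention_a, mention_b, threshold=HIGH_CONFIDENCE):
    """Merge two mentions' slots, or return None if they conflict strongly.

    Each mention maps slot name -> (value, confidence).
    """
    merged = dict(mention_a)
    for slot, (value_b, conf_b) in mention_b.items():
        if slot not in merged:
            merged[slot] = (value_b, conf_b)        # fill an empty slot
            continue
        value_a, conf_a = merged[slot]
        if value_a == value_b:
            merged[slot] = (value_a, max(conf_a, conf_b))
        elif conf_a >= threshold and conf_b >= threshold:
            return None                             # two confident, contradictory hypotheses
        else:
            # Keep the more confident of the two conflicting values.
            merged[slot] = (value_a, conf_a) if conf_a >= conf_b else (value_b, conf_b)
    return merged
```

A low-confidence occupation conflict is resolved in favor of the stronger hypothesis, while a gender conflict backed by “Mr.” and “Mrs.” on both sides blocks the merge.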

In document Information Retrieval: A Survey - Free Computer, Programming, Mathematics, Technical Books, Lecture Notes and Tutorials (Page 102-106)