
9. Natural Language Processing (NLP) Approaches

9.3 Concept Identification and Matching

9.3.2 Concepts and Discourse Structure

Many researchers have recognized that a document, especially a large document, may not be the ideal unit for matching against queries or topics. A document may deal with multiple topics. The matters of concern to a given user, or the key words that identify her interests, may be localized to a small portion of a document. Hence, a variety of research efforts, some of them described in this report, attempt to break documents into segments, often called “passages.” Sometimes, the boundaries of these segments are determined orthographically, e.g., on the basis of paragraph or section or sentence boundaries. In other cases, documents are segmented arbitrarily, e.g., by overlapping windows N characters long. The former approach takes semantics into account, but only indirectly, by assuming that sentence, paragraph, or section boundaries specified by the author accurately reflect her intended semantic structure. The latter approach ignores semantics in favor of locality. Of course, it is likely that the words or sentences that occupy a local passage have some semantic relationship, but it is impossible to say a priori what that relationship will be.
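The two segmentation strategies described above can be sketched as follows. This is a minimal illustration, not any particular system's implementation: the function names, the blank-line paragraph convention, and the `size` and `step` parameters are assumptions introduced here.

```python
import re

def paragraph_passages(text):
    """Orthographic segmentation: split on blank lines (paragraph boundaries)."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

def window_passages(text, size, step):
    """Arbitrary segmentation: overlapping windows of `size` characters,
    advancing `step` characters each time (overlap when step < size)."""
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")
    last_start = max(len(text) - size, 0)
    starts = list(range(0, last_start + 1, step))
    if starts[-1] != last_start:
        starts.append(last_start)  # make sure the tail of the text is covered
    return [text[s:s + size] for s in starts]

sample = "Trade surplus narrowed.\n\nAnalysts expect further declines."
paragraphs = paragraph_passages(sample)
windows = window_passages("abcdefghij", size=4, step=2)
print(paragraphs)
print(windows)
```

The paragraph splitter trusts the author's boundaries, while the window splitter guarantees locality but may cut through the middle of a sentence or word.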

Liddy et al. [Proc RIAO Conf., 1994] have taken a more principled approach by studying the discourse structure (based on “discourse linguistic theory”) of various types of documents, e.g., newspaper articles [TREC-2, 1994], or abstracts of empirical technical documents [Liddy, ASIS ‘87] [Liddy, 1988]. A coherent, well-written document has a semantic structure that represents the way the author has organized the ideas or story she wants to tell. Moreover, textual documents of a particular type will have a predictable, standard structure. The elements of this structure are called “discourse components.” Liddy has extended an earlier model due to van Dijk [Hillsdale, 1988] for the text type “newspaper article.” She has identified 38 discourse components in her extended model. Each clause or sentence in a given article can be tagged as one of these components. These tags “instantiate” the model. Assigning a tag to a clause says that the given clause belongs in the corresponding component of the model. Each component will contain certain kinds of information relative to the story told by the entire article. Examples of component tags for the newspaper article model are: MAIN EVENT, VERBAL REACTION, EVALUATION, FUTURE CONSEQUENCE, and PREVIOUS EVENT. Components may be nested, corresponding to nesting in the sentence structure. Linguistic clues are used to identify the components. For example, Liddy [TREC-2] offers the following example of nested, tagged discourse components.

<LEAD-FUT> South Korea’s trade surplus, <LEAD-HIST> which more than doubled in 1987 to $6.55 billion, </LEAD-HIST> is expected to narrow this year to above $4 billion. </LEAD-FUT>

Plainly, this is a LEAD-FUTURE component about the expected future trade surplus of South Korea, as indicated by linguistic clues such as the phrase “is expected to,” containing a nested LEAD-HISTORY component about South Korea’s past trade surplus, as indicated by linguistic clues such as the past tense “doubled” and the “1987” date.
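Nested component tags of this kind can be read into a tree with a simple stack. The sketch below is an illustration only and does not reproduce Liddy's tagger: the tag syntax is taken from the example, while the function names and the dictionary layout are assumptions made here.

```python
import re

TAG = re.compile(r"<(/?)([A-Z-]+)>")

def parse_components(tagged):
    """Parse nested discourse tags into a list of nodes. Each node is a dict
    {'tag': name, 'children': [...]} whose children are text strings or nodes."""
    root = {"tag": "ROOT", "children": []}
    stack = [root]
    pos = 0
    for m in TAG.finditer(tagged):
        text = tagged[pos:m.start()].strip()
        if text:
            stack[-1]["children"].append(text)
        closing, name = m.group(1), m.group(2)
        if closing:
            if len(stack) == 1 or stack[-1]["tag"] != name:
                raise ValueError(f"unbalanced closing tag </{name}>")
            stack.pop()
        else:
            node = {"tag": name, "children": []}
            stack[-1]["children"].append(node)
            stack.append(node)
        pos = m.end()
    tail = tagged[pos:].strip()
    if tail:
        stack[-1]["children"].append(tail)
    if len(stack) != 1:
        raise ValueError(f"unclosed tag <{stack[-1]['tag']}>")
    return root["children"]

def own_text(node):
    """Text that belongs directly to a component, excluding nested components."""
    return " ".join(c for c in node["children"] if isinstance(c, str))

tagged = ("<LEAD-FUT> South Korea's trade surplus, <LEAD-HIST> which more than "
          "doubled in 1987 to $6.55 billion, </LEAD-HIST> is expected to narrow "
          "this year to above $4 billion. </LEAD-FUT>")
lead_fut = parse_components(tagged)[0]
lead_hist = next(c for c in lead_fut["children"] if isinstance(c, dict))
print(lead_fut["tag"], "->", own_text(lead_fut))
print(lead_hist["tag"], "->", own_text(lead_hist))
```

Separating a component's own text from the text of its nested components is what lets the future-oriented clause and the historical clause be indexed independently.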

Tagging the clauses and sentences of a document by discourse component allows Liddy to generate multiple SFC vectors, one for each component. This means that one can not only match the subjects found in a topic against the subjects found in a document; one can also determine whether they are in the correct discourse component. For example, if the topic required that a document discuss future trade surpluses in South Korea, it would be important not only that the subject appear in a given document, but that it appear in a FUTURE EVENT or LEAD-FUTURE discourse component. A document that has the right subject in the right discourse component should receive a higher relevance ranking score than a document that has the right subject in the wrong component. Although the newspaper article model distinguishes 38 discourse components, Liddy has found that topic requests usually do not have so fine a discourse grain. Hence, she has improved the performance of DR-LINK by mapping the 38 components into seven meta-components for the purpose of topic-document matching and ranking: LEAD-MAIN, HISTORY, FUTURE, CONSEQUENCE, EVALUATION, ONGOING, and OTHERS. These seven meta-components yield eight SFC vectors, one for each meta-component and one for the combination of all seven together. The resulting module that matches topics against documents using these eight SFC vectors is called the “V-8 SFC Matcher.”
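The idea of eight vectors per document can be sketched as follows. This is a hedged illustration, not the DR-LINK implementation: the partial component-to-meta-component mapping, the subject codes, the use of cosine similarity, and the equal weighting in `score` are all assumptions introduced here, since the report gives none of these details.

```python
from collections import Counter
from math import sqrt

META = ["LEAD-MAIN", "HISTORY", "FUTURE", "CONSEQUENCE",
        "EVALUATION", "ONGOING", "OTHERS"]

# Hypothetical fragment of the 38-to-7 mapping; the real mapping is not given.
TO_META = {"MAIN EVENT": "LEAD-MAIN", "LEAD-HISTORY": "HISTORY",
           "PREVIOUS EVENT": "HISTORY", "LEAD-FUTURE": "FUTURE",
           "FUTURE EVENT": "FUTURE", "EVALUATION": "EVALUATION"}

def eight_vectors(tagged_clauses):
    """tagged_clauses: list of (component_tag, [subject codes]).
    Returns seven per-meta-component vectors plus one combined 'ALL' vector."""
    vectors = {m: Counter() for m in META}
    vectors["ALL"] = Counter()
    for tag, codes in tagged_clauses:
        meta = TO_META.get(tag, "OTHERS")
        vectors[meta].update(codes)
        vectors["ALL"].update(codes)
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def score(topic_codes, required_meta, doc_vectors, weight=0.5):
    """Blend the match in the required component with the whole-document match.
    The blending scheme is an assumption made for this sketch."""
    topic = Counter(topic_codes)
    return (weight * cosine(topic, doc_vectors[required_meta])
            + (1 - weight) * cosine(topic, doc_vectors["ALL"]))

# Same subjects overall, but "trade" sits in different components.
doc_a = eight_vectors([("LEAD-FUTURE", ["trade", "economics"]),
                       ("LEAD-HISTORY", ["politics"])])
doc_b = eight_vectors([("LEAD-HISTORY", ["trade", "economics"]),
                       ("LEAD-FUTURE", ["politics"])])
score_a = score(["trade"], "FUTURE", doc_a)
score_b = score(["trade"], "FUTURE", doc_b)
print(score_a > score_b)
```

Both documents match the topic equally well on the combined vector, so only the per-component vector distinguishes the document that discusses trade in a future-oriented component.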

Mann et al. [Text, 1988] have developed an alternative method of discourse analysis called Rhetorical Structure Theory (RST) [Mann et al., Text]. RST can be used for the automated markup and parsing of natural language texts [Marcu, AAAI, 96] [Marcu, PC, 1999]. Both Marcu and Eklund et al. [proposal, 1999] are exploring possible applications of RST to automated textual information extraction. Marcu is studying the application of RST to document summarization and machine translation of natural languages. Eklund et al. have considered its application to text data mining, knowledge base construction, and knowledge fusion across documents.

Mann et al. claim that RST “provides a general way to describe the relations [called rhetorical relations] among clauses in a text, whether or not they are grammatically or lexically signalled.”

Of course, automatic parsing, as developed by Marcu, depends on recognizing just such grammatical or lexical signals, and using them to drive the actions of the parser. Marcu has taken two approaches to automated RST parsing. First, he has developed a set of manual rules. Second, he has applied a machine learning tool to a large text corpus to “learn” a set of parsing rules. Clearly, the success of such automated parsing depends on the text possessing the coherence and clarity typical of news articles and of well-written scientific and legal papers. The techniques might be much less successful if applied to informal text, e.g., e-mail; they have never been applied to such informal texts. [Marcu, PC, 1999]
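To make the idea of marker-driven rules concrete, the toy sketch below guesses relations from two surface signals. It is not Marcu's rule set: the two cues, the function name, and the rule that the clause after a semicolon elaborates on the clause before it are illustrative assumptions, and real markers are ambiguous in ways this ignores.

```python
import re

def guess_relations(clauses):
    """Return (relation, nucleus_index, satellite_index) guesses, 1-based.
    Hypothetical rules: a clause opening with 'although' is the satellite of a
    concession whose nucleus is the next clause; a clause following a
    semicolon is the satellite of an elaboration on the preceding clause."""
    guesses = []
    for i, clause in enumerate(clauses, start=1):
        if i < len(clauses) and re.match(r"\s*although\b", clause, re.IGNORECASE):
            guesses.append(("concession", i + 1, i))
        if i > 1 and clauses[i - 2].rstrip().endswith(";"):
            guesses.append(("elaboration", i - 1, i))
    return guesses

clauses = [
    "Although discourse markers are ambiguous,",
    "one can use them to build discourse trees for unrestricted texts;",
    "this may lead to many new applications in text data mining.",
]
relations = guess_relations(clauses)
print(relations)
```

A learned rule set would replace the hand-written conditions with conditions induced from an annotated corpus, but the parser would still be driven by the same kind of surface evidence.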

A rhetorical relation links two clauses (non-overlapping spans), called the nucleus and the satellite. The significance of these terms is that most (though not all) of the relations are asymmetric. One clause, the nucleus, is usually more essential than the other. The less essential clause, the satellite, is sometimes incomprehensible without the nucleus to which it is related. Even where the satellite is comprehensible by itself, the nucleus is generally more essential for the writer’s purposes. Marcu capitalizes on this fact to generate automatic document summaries. The nuclear clauses, taken by themselves, form a coherent summary of the document’s essence. The satellites do not. Another indication that the satellites are less essential is that they are often substitutable, i.e., in a given relation one satellite can often be substituted for another while retaining the same nucleus. Mann and Thompson define 24 rhetorical relations, but stress that the set is open-ended. Marcu has discovered a considerably larger set of relations, but anticipates that other researchers may discover yet more relations, as they explore other classes of text. [Marcu, PC, 1999]

Marcu is also considering the possible application of RST parsing to machine translation (MT). One of the difficulties with existing MT is that even when words and phrases and even clauses are properly translated, the overall translation of the text may be awkward or incorrect. This is because the discourse structure in the RST sense may vary from one language to another, especially at the lower levels of the RST trees. Translating this structure may substantially improve the quality of the translation.

The parsing of a text according to RST identifies and marks up all the rhetorical relations, and the clauses participating in those relations. The rhetorical structure is defined recursively, i.e., one of the spans participating in a rhetorical relation can itself be composed of rhetorical relations. Hence, the rhetorical structure of a text is a tree. Only the leaves of the tree are necessarily “simple” clauses. A given text can be parsed in multiple ways. Hence, a given text may be represented by multiple rhetorical parse trees. However, Marcu has defined principled rules for “legal” parse trees. If one adheres to Marcu’s rules, one can still generate multiple trees for a given text document, but it becomes possible to reject some candidate parses as ill-formed while accepting others as well-formed.

The following passage illustrates two asymmetric rhetorical relations: “concession” and “elaboration.”

[Although discourse markers are ambiguous,]1 [one can use them to build discourse trees for unrestricted texts;]2 [this may lead to many new applications in text data mining.]3

A concession relation exists between the nucleus, either 2 or 3, and the satellite, 1. The nucleus is asserted to be true despite the contradictory “concession” of the satellite. An elaboration relation exists between the satellite 3 and the nucleus, either 1 or 2. The satellite elaborates on the assertion made by the nucleus. Note that the same clause, e.g., 3, can be a nucleus in one instance of one relation, and a satellite in one instance of another relation.
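One of the possible parses of the three-clause passage above can be written as a small recursive tree, from which a nucleus-only summary follows directly. This is a simplified sketch under stated assumptions: the class names, the choice of clause 2 as nucleus of both relations, and the scoring by number of satellite links are introduced here for illustration and are not Marcu's published summarization algorithm.

```python
from dataclasses import dataclass
from typing import Union

@dataclass(frozen=True)
class Clause:
    position: int  # order of the clause in the original text
    text: str

@dataclass
class Relation:
    name: str
    nucleus: "Span"
    satellite: "Span"

Span = Union[Clause, Relation]

def satellite_depths(span, depth=0):
    """Map each leaf clause to the number of satellite links between it and
    the root. Zero means the clause is reached through nuclei only."""
    if isinstance(span, Clause):
        return {span: depth}
    result = satellite_depths(span.nucleus, depth)
    result.update(satellite_depths(span.satellite, depth + 1))
    return result

def summary(span, max_satellite_links=0):
    """Keep clauses with at most `max_satellite_links` satellite links,
    returned in their original text order."""
    depths = satellite_depths(span)
    kept = [c for c, d in depths.items() if d <= max_satellite_links]
    return sorted(kept, key=lambda c: c.position)

c1 = Clause(1, "Although discourse markers are ambiguous,")
c2 = Clause(2, "one can use them to build discourse trees for unrestricted texts;")
c3 = Clause(3, "this may lead to many new applications in text data mining.")

# The nucleus of the elaboration is itself a relation, so the tree is recursive.
tree = Relation("elaboration",
                nucleus=Relation("concession", nucleus=c2, satellite=c1),
                satellite=c3)
core = summary(tree)
print([c.text for c in core])
```

Raising `max_satellite_links` lengthens the summary by admitting clauses that are progressively less central to the writer's purpose.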

The word “although” is a lexical marker for the concession relation in the example given above. Similarly, the semicolon is a marker for the elaboration relation in this example. However, Mann et al. stress that “the definitions [of the rhetorical relations] do not depend on morphological or syntactic signals … We have found no reliable, unambiguous signals for any of the relations.” Marcu’s relative success indicates that for well-structured text, reliable markers can be found for a fairly high proportion of instances of the relations. But Marcu’s studies have been limited to well-structured text types, e.g., Scientific American articles. Mann et al., on the other hand, have studied a wide variety of types including “administrative memos, magazine articles, advertisements, personal letters, political essays, scientific abstracts, and more.” They claim that an RST analysis is possible for all of these diverse types. However, not surprisingly, they have found no simple linguistic markers that work for recognizing or delimiting these relations across all of the diverse text types. They also find that certain text types do not have RST analyses, including “laws, contracts, reports ‘for the record’ and various kinds of language-as-art, including some poetry.”

A common characteristic of the text types studied with some success by Marcu is that they are types of expository writing. Hence, Marcu’s approach might work well with legal (judicial) opinions, which are typically expository, although according to Mann et al. as quoted above, it would probably not work for laws and contracts, which are typically not expository.

Marcu measured the success of his automated parsers in terms of the classical precision and recall measures. Recall measured the proportion of the rhetorical relations in a text that were identified, i.e., the “coverage.” Precision measured the proportion of identified relations that were identified correctly. These values were computed by comparing the results of the automated parses against parses performed manually by human judges. It was found that human judges could achieve a high degree of agreement in their respective parses, thereby justifying the claims of RST for the text types studied. However, it was also found that the human judges required a considerable amount of training and practice before they could achieve this consistency. Parsing according to the rules of RST is far from trivial.
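The two measures can be computed by comparing the set of relations produced by a parser with the set produced by a human judge. The sketch below is a generic illustration: representing each relation as a (name, nucleus, satellite) triple and the sample data are assumptions made here, not Marcu's evaluation data.

```python
def precision_recall(predicted, gold):
    """Precision: share of predicted relations that are correct.
    Recall: share of gold (manually identified) relations that were found."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall

# Hypothetical data: relations as (name, nucleus clause, satellite clause).
gold = {("concession", 2, 1), ("elaboration", 2, 3)}
predicted = {("concession", 2, 1), ("elaboration", 1, 3), ("contrast", 2, 3)}

precision, recall = precision_recall(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here one of three predicted relations is correct and one of two gold relations is found, so the parser is penalized both for proposing spurious relations and for missing a real one.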

Eklund et al. have proposed combining RST with FCA by generating formal contexts whose objects are larger entities like clauses, sentences, and even documents. Correspondingly, the relations would be rhetorical relations. If a simple formal context based on RST is converted to a simple conceptual graph (CG), its nucleus (in the RST sense) would become the “head” of the CG. Other text spans having RST relations to the nucleus could then be used to build more complete CGs. In this way, a CG knowledge base could be constructed, extending over whole documents and fusing the knowledge of multiple documents.