An Exercise in Reuse of Resources: Adapting General Discourse Coreference
Resolution for Detecting Lexical Chains in Patent Documentation
Nadjet Bouayad-Agha
1, Alicia Burga
1, Gerard Casamayor
1,
Joan Codina
1, Rogelio Nazar
1, Leo Wanner
1,21NLP Group, Department of Information and Communication Technologies, Universitat Pompeu Fabra, 2Catalan Institute for Research and Advanced Studies (ICREA)
1C/ Roc Boronat, 138, 08018 Barcelona, Spain
{firstname.lastname}@upf.edu
Abstract
The Stanford Coreference Resolution System (StCR) is a multi-pass, rule-based system that scored best in the CoNLL 2011 shared task on general discourse coreference resolution. We describe how the StCR has been adapted to the specific domain of patents and give some cues on how it can be adapted to other domains. We present a linguistic analysis of the patent domain and how we were able to adapt the rules to the domain and to expand coreferences with some lexical chains. A comparative evaluation shows an improvement of the coreference resolution system, denoting that (i) StCR is a valuable tool across different text genres; (ii) specialized discourse NLP may significantly benefit from general discourse NLP research.
Keywords:Coreference resolution, lexical chain, Stanford Coreference Resolution System, patents, domain adaptation
1.
Introduction
The development of high quality large scale resources for NLP (e.g., treebanks, lexical databases/dictionaries, or tools) is a very time consuming and laborious work. There-fore, it is not surprising that an increasing number of such resources is available off-the-shelf for free use by the com-munity. One of these resources is theStanford Determinis-tic Coreference Resolution System1(henceforth, StCR). It is the last stage of a pipeline of linguistic processing modules in the Stanford CoreNLP platform, and uses the results of the preceding modules (e.g., tokenization, (named) entity recognition and syntactic dependency analysis).
In our work, we adapt StCR to the patent domain and ex-pand its functionality to detect chains instantiated by lex-ical relations that go beyond strict identity. We have cho-sen StCR for two reasons. First, because of its quality: it scored best in the CoNLL 2011 Shared Task on general discourse coreference resolution; see (Raghunathan et al., 2010; Lee et al., 2011; Lee et al., 2013). Second, because it is rule-based rather than statistical, i.e., machine learning-based. The effort for the annotation of patents for training a machine learning-based lexical chain-recognition model is very high due to the very complex linguistic structures of the patent material, considerably higher than the creation of a set of rules by a linguist experienced in patent analysis. Due to their linguistic idiosyncrasies, patents represent a real endurance test for any general discourse technique. The success of the adaptation of StCR to patent processing shows that (i) StCR is a valuable tool across different text genres, and, more generally, that (ii) specialized discourse NLP may significantly benefit from general discourse NLP research.
As in general discourse (Halliday and Hasan, 1976; Mor-ris and Hirst, 1991), a lexical chain in a patent is defined as a sequence of lexical entities between which relations
1
http://nlp.stanford.edu/software/dcoref.shtml
of identity, near-identity or non-identity hold (Recasens et al., 2010). However, the complex syntactic structures of patents and the potential distribution of the mentions of an entity across the entire document make the detection of these sequences significantly more complex. In this paper, we show how this can be done.
The remainder of the paper is structured as follows. In Sec-tion 2., we summarize an empirical study we carried out on linguistic phenomena in patents related to coreference and other lexical chains. Section 3 presents the original StCR and analyzes its application to patent documentation. Sec-tion 4 describes the adaptaSec-tion of StCR to the patent do-main. Section 5 sketches some general guidelines for the adaptation of StCR to a specific domain and lists the phe-nomena that should be taken into account during the adap-tation. Section 6, finally, provides some conclusions and outlines the directions of our future work in the area of lex-ical chain identification in the patent domain.
2.
An Empirical Study of Patents
Patent texts are notoriously difficult to read and compre-hend due to their abstract vocabulary and very complex lin-guistic constructions. In this section, we present the results of an empirical study on the linguistic phenomena in patent texts related to coreference and other frequent types of lex-ical chains.
2.1.
Coreferences
Let us briefly summarize the most important idiosyncrasies of patent with respect to coreferences.
each claim must be rendered as a single sentence. As a con-sequence, sentences often exceed 250 words; some of them reach 500 or even 900 words. These long sentences often contain multiple references to the same element of the in-vention or method in question that in other domains are typ-ically distributed among sevearl separate sentences. This needs to be taken into account for coreference resolution since sentence boundaries and ordering of candidates are determining factors in many coreference resolution strate-gies.
NP repetition. Due to the legal nature of patents, writ-ers of patents tend to avoid ambiguity as much as possible. This often leads to the repetition of full nominal phrases (NPs) across the text instead of pronouns once the referent has been introduced. Modifiers of the introduced referents are also kept. The frequent use of multiple modifiers in NPs, as well as the observed nominal nature of the claims, result in complex NPs with multiple levels of embedded NPs; cf. for illustration the citation (a) below. Thus, it is crucial to use the right level of NP embeddedness as the base for coreference detection. Even NPs that do not con-tain prepositional phrases as modifiers can be very complex and contain nouns as pre-nominal modifiers (as in (b) be-low), making it difficult to accurately detect the head. NPs in the patent domain can also contain punctuation sym-bols other than commas (as illustrated in (c)), for which lin-guistic processing tools should be adapted.
(a) [[a control means]NPfor rotating [the first motor]NP
in [a given rotating direction]NPwith [a given ratio of
[rotating speed]NP]NP]NP
(b) a recording [media]N[storage]Nand player [unit]N
(c) [A device comprising: B; C; D; E; F]NP.
Due to NP repetition, pronouns occur less often than in general discourse. In particular, personal pronouns differ-ent fromit are inexistent, and the use of referentialits is very limited. On the other hand, adverbial pronouns such aswhereinare by far more frequent in patents than in gen-eral discourse and must be taken into account as candidates for coreference resolution:
[The multi-recording media storage and player unit of claim 21]i, [wherein]i said expanded
in-formation comprises links to web addresses . . .
NP definiteness. In general discourse, the definiteness of an NP marked by a determiner constitutes a clear hint dur-ing the detection of the appropriate antecedent: an NP in-troduced by a definite determiner can corefer with a previ-ous NP introduced by an indefinite NP that shares the same head. Opposite to that, two indefinite NPs cannot corefer, even if they are identical or share the same head. In the patent domain, however, our study yielded numerous ex-amples of coreferring indefinite NPs:
(cl.1) [Abattery remaining capacity indicating apparatus for detecting a remaining battery capacity of a charge and discharge battery]i[. . . ]
(cl.2) [A battery remaining capacity indicating apparatus]i
further comprising: [. . . ]
(cl.3) [A battery remaining capacity indicating apparatus]i
further comprising: [. . . ]
Sense idiosyncrasy. Another important trait of patent texts is that some words do not share the same meaning and grammatical function (and consequently the same PoS) with their instances in general discourse. Notably, the word
means, which in general discourse can be either a verb or a noun, depending on the context, always behaves as a noun in patents, and therefore should be considered as a poten-tial candidate for coreference resolution. Another promi-nent example is the wordsaid, which in general discourse behaves as a verb yet in patents stands for a definite deter-miner.
Furthermore, some words that act as nominal heads in patent texts have a very abstract meaning and do not actu-ally corefer with other expressions in the text, as is the case of the nounsclaim,apparatus, orunit. In contrast, bare NPs (nominal groups not introduced by an article) and some nu-merical NPs may corefer, and therefore cannot be excluded as candidates by a coreference resolution strategy. Thus, in the example below, the bare nounbatteriesis generic and, in principle, should not corefer. However, given that there is a definite NP (the batteries) with the same head later in the text, and both NPs are embedded within the same larger NP,batteriesandthe batteriesneed to corefer:
[A battery charging device (. . . ) for continu-ing supply of air to [batteries]bare NP,
com-prising: a charge current controlling portion for judging whether an abnormal condition is stored in the memory means incorporated in [the batteries]NP]NP
Coreferences with multiple antecedents. In patents, it is very common to find coreferences with more than one antecedent. The last element of this type of coreference can be a relational pronoun (as illustrated by (a) below), or an NP (as illustrated by (b)):
(a) if [said battery-side end voltage] and [said device-side end voltage] do not correspond to [each other]
(b) The electric circuit wherein each of the DC-to-AC converters comprises [a first switch (. . . )]i−1 and [a
second switch (. . . )]i−2. The electric circuit wherein
[the first and second switches]i are field effect
tran-sistors.
Position of the coreference elements. The position of each coreference element within the document structure is also very relevant in the patent domain, and should be taken into account. The claims section, for instance, contains a set of claims that are related to each other through depen-dency relations. A claim is dependent to another claim if it makes explicit reference to the latter. The dependencies be-tween claims form a directed graph, which can be explored to limit the scope of coreference candidates for a given ex-pression, so that only candidates in claims that are super-ordinated to the claim in which the expression in question occurs are considered.
2.2.
Lexical Chains
For various patent processing applications (among them, e.g., entity extraction or summarization), it is essential to capture not only coreferences, but also other types of lexi-cal chains. In what follows, we summarize the set of lexilexi-cal relations that are, according to our study, the most frequent relatins in patents: ‘part-whole’, ‘entity in process’, ‘set-member’ and ‘class’.
Part-whole: This lexical relation is very important in patents because it relates the patented object (or method) with its different components (or steps). Thus, it is present in every single patent. The composition of a device/method has to appear in the claims, but can be repeated in the de-scription.
This relation is a1 : n- relation since in patents, first the patented object/method (or one of its components) is in-troduced in terms of a single NP and then its components (steps or (sub)components) asnfurther NPs. The compo-nents/steps are most of the times coordinated among them-selves (by a semicolon or the conjunctionand); cf. an ex-ample:
[each]head having [a DC-side arranged to
con-nect across positive and negative terminals of cells received by the circuit]comp 1 and [an
AC-side for carrying an AC voltage converted from the DC-side]comp 2
Entity in process: This lexical chain marks the relation between references to a single entity, but in different phases of a process. In order to detect this lexical chain, it is neces-sary to go beyond the NP itself, and also look at the verbal information around (which is somehow included in the last mention). Although this relation is, in principle, a1 : 1 re-lation, due to its potential recursion it may become1 : 1 : 1
. . . :
[. . . ] a temperature detection device for detecting [a current temperature of the battery]entity process1; a temperature rise output
device for obtaining the temperature rise from [the temperature detected by said tempera-ture detection device]entity process2[. . . ]
Set-member: This semantic relation links singular NPs (members) to plural NPs (set). This relation is a1 : n re-lation. The single NP is the one that represents the entire
collection andnthe different NPs that represent the mem-bers. It is possible that more than one member is explic-itly mentioned (in this case, adjectives such asfirst,second, etc. are commonly used). However, often there are just two NPs connected, given that the singular NP represents any member of the collection. Consider an illustration of the ‘set-member’ relation:
a plurality of DC-to-AC converters each having a DC-side arranged to connect across positive and negative terminals of cells received by the circuit and [an AC-side for carrying an AC voltage converted from the DC-side]member
and an inductive coupling between [the AC-sides]collection of the DC-to-AC converters to
provide . . .
Class: The ‘class’ lexical relation links an NP that refers to a specific object with another NP that refers to the class of this object (as a generic unit). Therefore, it is a 1 : 1
relation. Both NPs can appear either in singular or in plural; the NP that refers to the class appears with no article (if in plural) or with an indefinite article that is NOT followed by a definite NP (if in singular).
[An electric circuit for receiving a battery of cells in series]COREF/object, comprising : a
plu-rality of DC-to-AC converters (24A-24H) [. . . ] [The electric circuit]COREFwherein the
induc-tive coupling comprises transformer windings (26A-26H) .
An electrically powered device including [an electric circuit]class. . .
3.
Stanford Coreference Resolution System
Let us first give an overview of the StCR and then briefly analyze the applicability of its original configuration to patents.
3.1.
Overview of StCR
The central idea behind the StCR’s deterministic approach to coreference resolution is the application of successive in-dependent coreference models (sieves) of decreasing preci-sion, so that coreference matches for which the system has greater confidence are detected first, and further matches are detected on the basis of the former. Instead of basing the decision on whether two mentions corefer on the whole set of features extracted from the text, the system chooses to separate lower precision features from higher precision features into different sieves.
The general architecture of StCR consists of three stages:
(1) Candidate detection: Detection of linguistic expres-sions that are candidates for coreference matching, us-ing a high-recall algorithm. The initial list is filtered out to exclude undesirable expressions such as imper-sonal pronouns, partitives, numericals, bare NPs, etc.
(3) Post-processing: Elimination of singleton clusters.
A sieve in (2) starts with the first mention of the first sen-tence and, moving forward one mention at a time, assigns the current mentionmi an antecedentmi−1from a list of
candidates. These are ordered (i) from left-to-right, travers-ing breadth-first the syntactic dependency tree whenmi−1
is in the same sentence asmi(thus favoring subjects), (ii)
from right-to-left, breadth-first when the candidate is an NP in a sentence preceding that ofmi(thus favoring syntactic
salience and proximity), and (iii) from left-to-right if the candidate is in pronominal form in a sentence preceding that ofmi.
Each sieve traverses the candidate list until a corefering an-tecedent is detected, or the end of the list is reached. In case of a match, the mention mi, its antecedent and all
their coreferring mentions are grouped into the same cluster which shares the common features of all mentions. Only first mentions in textual order of the clusters are consid-ered when searching for new matches, following the intu-ition that earlier mentions are more informative and have less potential candidates (and are therefore less likely to be mismatched).
The original StCR system applies 12 sieves for coreference resolution; cf. (Raghunathan et al., 2010) and (Lee et al., 2011) for detailed presentations:
1. Discourse Processing, 2. Exact String Match, 3. Relaxed String Match, 4. Precise Constructs, 5.–7. Strict Head Match, 8. Proper HeadWord Match, 9. Alias, 10. Relaxed Head Match, 11. Lexical Chain, 12. Pronouns.
The features that these sieves use include the mention string, shallow linguistic traits, deep linguistic analysis and semantic features. Of particular interest to our work is the Lexical Chain sieve, which in the original StCR uses Word-Net to detect hypernymy or synonymy relations and which we replace with an implementation tailored to detect some of the types of lexical chains in patents discussed in Section 2.2.
3.2.
Analysis of the application of the original
StCR
Due to patent idiosyncrasies, the original constellation of the StCR has indeed a limited performance on patents. In particular:
•NPs are wrongly included or excluded as mentions. For instance, many impersonal pronouns are included because the filter for excluding impersonal pronouns fails:
Itis then judged whether the current value is not more than a specified value.
•Bare NPs are always excluded, even if they are needed as antecedents for a posterior NP:
A battery charging device [. . . ] for continuing supply of air to [batteries] [. . . ] comprising: a charge current controlling portion for judging whether an abnormal condition is stored in the memory means incorporated in [the batteries]
• Smaller mentions within larger mentions with the same head (a typical phenomenon in patents where multiple lev-els of NP-embedding is common) are excluded:
[[ a charge controlling portion]mention excluded
for charging the batteries at the current value that has been retrieved by the current value retrieving portion]]mention included
• The ordering of candidate antecedents of nominal men-tions within the same sentence (left-to-right) is not adequate in the patent domain because claim sentences are often too long. In general, a right-to-left order is more adequate, not only within the same sentence, but also across the text, as there is a lot of repetition of terms that may actually refer to different objects:
Fig. 1 is a perspective view of the battery charg-ing device accordcharg-ing to one form of embodiment of the present invention.
Fig. 2 is a perspective view of a battery package according to the one form of embodiment of the present invention [. . . ] Fig. 15 is an explanatory view showing a theory for charging of the battery charging device of the third embodimenti. [. . . ]
In the illustrated embodimenti[. . . ]
•The hypernymy and synonymy lexical relations detected by the original StCR system are only of limited use for the highly specialized terminology of patent texts. Further-more, truly relevant lexical chains such as those described in Section 2.2. are not detected:
[A battery charging device (. . . )]Head,
com-prising : [a judging portion for (. . . )]comp 1,
and [an abnormality indicating portion for (. . . )]comp 2
4.
Adapting StCR to Patents
To adapt StCR to patent material, we (1) substituted the Stanford CoreNLP pipeline by our own pipeline, and (2) tuned all three stages of the StCR.
4.1.
Substitution of the Stanford CoreNLP
Pipeline
Stanford’s original general discourse CoreNLP pipeline performed poorly on patent material. The pipeline includes the Stanford Parser, on which the coreference detection sieves rely heavily to extract features from both its phrase structures and, to a lesser degree, from its converted depen-dency trees. Unfortunately, the Stanford Parser was unable to cope with the long sentences found in patent documents, and, therefore, a parser retraining did not make sense. For this reason we decided to replace it with Bohnet (2010)’s dependency parser, which was better suited to handle very long sentences (Burga et al., 2013).2
Our pipeline is composed of a modified version of GATE’s Annie tokenizer (Cunningham, 2011), a PoS tagger, a
2To the best of our knowledge, no off-the-shelf NLP
lemmatizer and a parser, the last three components taken from Bohnets parsing environment. In order to replace the phrase-structure input to the coreference engine, we developed a hierarchical chunker specifically designed for the patent domain and the peculiar syntactic structures of patent sentences. The chunker output contains multiple level of embedded annotations from which we derive the constituency structures used then by StCR. We decided to include only NPs, a decision which is justified by the fact that StCR operates mostly on NPs when determining the coreference chain elements (Lee et al., 2013). The result-ing phrase-based structure is thus a flat sequence of hierar-chally embedded NPs.
Finally, a GATE (Cunningham, 2011) plug-in was created as a wrapper around the StCR code. The plug-in integrates the StCR into our patent processing pipeline by convert-ing our annotations into a format accepted by StCR and the lexical chains delivered by StCR into GATE annotations.
4.2.
The adaptation procedure
The following adaptations have been performed in the indi-vidual stages of the StCR.
In the candidate detection stage, we introduced patent-specific filters to exclude, e.g., simple NPs with common abstract head nouns (such as claim, part, method), inad-equate mentions that contain “:” or “;”, or mentions with over 30 tokens (to palliate chunking errors). Unlike in the original StCR, we kept smaller mentions within larger men-tions with the same head and bare NPs.
In the coreference resolution stage, nominal and pronomi-nal antecedent ordering was adapted to fit patent idiosyn-crasies. From the twelve sieves that compose the StCR, ex-cept the sieve for detecting the antecedent of deicticI/you, all sieves were included in the same order, albeit with some modifications. In addition, a new sieve was created that detects elements related via copula. Furthermore, the fol-lowing specific adaptations have been implemented:
•Sieve 1 (Discourse Processing Sieve): This sieve was re-moved, as patents do not include the conversational text that is the target of this sieve.
•Sieve 2 (Exact String Match): This sieve showed a high precision and was left as it is in the original configuration.
•Sieve 3 (Relaxed String Match: Bare NPs are linked when they occur at the beginning of sentences.
• Sieve 4 (Precise Constructs): The detection of apposi-tives has been disabled; in addition, relative pronouns are assumed to refer back to the nearest antecedent mention, whilst relational pronouns such as one another or each otherare assumed to refer back to the nearest plural an-tecedent mention.
•Sieve 5 (Strict Head Match): This sieve is kept as it was, but the list of stop words was adapted to the patent domain.
•Sieves 5, 6, and 7 (Variants of Strict Head Match): The mention and its antecedent are not required to match in number so as to allow set-member relations.
• Sieve 8 and 9 (Proper Head Word Match and Alias Match): These sieves were disabled, given that neither proper nouns nor aliases appear in patents.
•Sieve 10 (Relaxed Head Matching: Is kept as it was.
• Sieve 11 (Lexical Chain): The lexical chain sieve was modified so as to use elaborate patterns for detection of ‘part-whole’ relations and ‘instance-of’ relations between mentions within the same sentence.
•Sieve 12 (Pronominal Coreference resolution): This sieve was modified according to the occurrence of the pronouns in patents. Most personal pronouns were excluded and ad-verbial pronouns were included.
The new copula sieve distinguishes between attributive re-lations where the copulative construction indicates a prop-erty of the subject (see (a) below for illustration) and iden-tity relations (see (b) for illustration):
(a) [the present invention] is [a wind turbine genera-tor]attribute
(b) [the difference of the wind direction deviation]iis[the
error of the anemoscope 6 due to drift wind]i
Unlike StCR, in which copulative constructions (<NP1>
is<NP2>) are detected in the precise pattern-based con-struct, in patents, this construct needed its own sieve, ap-plied towards the end of the sieve application sequence. This is because both mentions NP1 and NP2 can be
in-volved in separate lexical chain relations such that an early merge into a single chain would make the subject NP1the
antecedent of the chain while NP2would become
unavail-able for further linking.
The post-processing stage was adapted to post-process the obtained clusters. First, clusters that contain a mixture of reference relations have been divided into different chains. Second, if compatible mentions (i.e., same head, same at-tributes, and one smaller mention included in the other larger one) are used to establish different cohesion rela-tions, their clusters are merged into one.
4.3.
Performance of the adapted StCR
A Multi-Pass Sieve for Coreference Resolution. In Pro-ceedings of the EMNLP-2010 Conference.