Complex Lexico syntactic Reformulation of Sentences Using Typed Dependency Representations

(1)

Complex Lexico-Syntactic Reformulation of Sentences using Typed

Dependency Representations

Advaith Siddharthan Department of Computing Science

University of Aberdeen [email protected]

Abstract

We present a framework for reformulat-ing sentences by applyreformulat-ing transfer rules on a typed dependency representation. We specify a list of operations that the frame-work needs to support and argue that typed dependency structures are currently the most suitable formalism for complex lexico-syntactic paraphrasing. We demon-strate our approach by reformulating sen-tences expressing the discourse relation of

causationusing four lexico-syntactic dis-course markers – “cause” as a verb and as a noun, “because” as a conjunction and “because of” as a preposition.

1 Introduction

There are many reasons why a writer might want to choose one formulation of a discourse relation over another; for example, maintaining thread of discourse, avoiding shifts in focus and issues of salience and end weight. There are also reasons to use different formulations for different audiences; for example, to account for differences in reading skills and domain knowledge. In recent work, Sid-dharthan and Katsos (2010) demonstrated through psycholinguistic experiments that domain experts and lay readers show significant differences in which formulations ofcausationthey find accept-able. They further showed that the most appropri-ate formulation depends both on the domain ex-pertise of the user and the propositional content of the sentence, and that these preferences can be learnt in a supervised machine learning frame-work. That work, as does much of the related comprehension and literacy literature, used man-ually reformulated sentences. In this paper, we present an approach to automate such complex re-formulation. We consider the four lexico-syntactic discourse markers for causation studied by Sid-dharthan and Katsos (2010); consider 1a.–d. be-low (from their corpus, but simplified to aid pre-sentation):

(1) a. An incendiary device caused the explosion. [A-CAUSE-B]

b. The explosion occurredbecause ofan incen-diary device. [B-BECAUSEOF-A]

c. The explosion occurredbecausethere was an incendiary device. [B-BECAUSE-A]

d. Thecauseof the explosion was an incendiary device. [CAUSEOF-B-A]

These differ in terms of the lexico-syntactic prop-erties of the discourse marker (shown in bold font). Indeed the discourse markers here are verbs, prepositions, conjunctions and nouns. As a conse-quence, the propositional content is expressed ei-ther as a clause or a noun phrase (“The explosion occurred” vs “the explosion”, etc.). Additionally, the order of presentation of propositional content can be varied to give four more lexico-syntactic paraphrases:

(1) e. The explosionwas caused byan incendiary device. [B-CAUSEBY-A]

f. Because ofan incendiary device, the explo-sion occurred. [BECAUSEOF-A-B]

g. Becausethere was an incendiary device, the explosion occurred. [BECAUSE-A-B] h. An incendiary device was thecauseof the

ex-plosion. [A-CAUSEOF-B]

(2)

For the purpose of this paper, our main reason is that the 8 formulations selected are different in-formation orderings of 4 different lexico-syntactic constructs. Thus, we explore a broad range of con-structions and are confident that the framework we develop covers the range of operations required for text regeneration in general. Of less relevance to this paper, but equally important to our broad goals of reformulating technical writing for lay readers, causal relations are pervasive in science writing and are integral to how humans conceptualise the world. We have a particular interest in scientific writing – reformulating such texts for lay audi-ences is a highly relevant task today and many news agencies perform this service; e.g., Reuters Health summarises medical literature for lay audi-ences and BBC online has a Science/Nature sec-tion that reports on science. These services rely either on press releases by scientists and universi-ties or on specialist scientific reporters, thus lim-iting coverage of a growing volume of scientific literature in a digital economy.

In Section 2, we relate our research to the exist-ing lexist-inguistic and computational literature. Then in Section 3, we compare three different linguistic representations with respect to their suitability for lexico-syntactic reformulation. We found typed dependency structures to be the most promising and present an evaluation in Section 4.

2 Related Work

2.1 Discourse Connectives and Comprehension Previous work has shown that when texts have been manually rewritten to make the language more accessible (L’Allier, 1980), or to make the content more transparent (Beck et al., 1991), stu-dents’ reading comprehension shows significant improvements. An example of a revision choice that might be applied differentially depending on the literacy skills of the reader involves connec-tives such as because. Connectives that permit pre-posed adverbial clauses have been found to be difficult for third to fifth grade readers, even when the order of mention coincides with the causal (and temporal) order (Anderson and Davi-son, 1988); this experimental result is consistent with the observed order of emergence of connec-tives in children’s narraconnec-tives (Levy, 2003).

Thus the b) version of the following example would be preferred for children who can grasp causation, but who have not yet become comfort-able with alternative clause orders (example from Anderson and Davison (1988), p. 35):

(2) a. Because Mexico allowed slavery, many Amer-icans and their slaves moved to Mexico during that time.

b. Many Americans and their slaves moved to Mexico during that time, because Mexico al-lowed slavery.

Such studies show that comprehension can be improved by reformulating text for readers with low reading skills (Linderholm et al., 2000; Beck et al., 1991) and for readers with low levels of do-main expertise (Noordman and Vonk, 1992). Fur-ther, specific information orderings were found to be facilitatory by Anderson and Davison (1988). All these studies suggest that the automatic lexico-syntactic reformulation of causation can benefit various categories of readers.

2.2 Connectives and Text (Re)Generation

Much of the work regarding (re)generation of text based on discourse connectives aims to simplify text in certain ways, to make it more accessible to particular classes of readers. The PSET project (Carroll et al., 1998) considered simplifying news reports for aphasics. The PSET project focused mainly on lexical simplification (replacing diffi-cult words with easier ones), but there has been work on syntactic simplification and, in particu-lar, the way syntactic rewrites interact with dis-course structure and text cohesion (Siddharthan, 2006). These were restricted to string substitution and sentence splitting based on pattern matching over chunked text. Our work aims to extend these strands of research by allowing for more sophis-ticated insertion, deletion and substitution oper-ations that can involve substantial reorganisation and modification of content within a sentence.

Elsewhere, there has been interest in paraphras-ing, including the replacement of words (espe-cially verbs) with their dictionary definitions (Kaji et al., 2002) and the replacement of idiomatic or otherwise troublesome expressions with simpler ones. The emphasis has been on automatically learning paraphrases from comparable or aligned corpora (Barzilay and Lee, 2003; Ibrahim et al., 2003). The text simplification and paraphrasing literature does not address paraphrasing that re-quires syntactic alterations such as those in Exam-ple 1 or the question of appropriateness of differ-ent formulations of a discourse relation.

(3)

2003) are two contemporary generation systems that allow for specifying aspects of style such as choice of discourse marker, clause order, repeti-tion and sentence and paragraph lengths in the form of constraints that can be optimised. How-ever, to date, these systems do not consider syn-tactic reformulations of the type we are interested in. Our research is directly relevant to such gen-eration systems as it can help such systems make decisions in a principled manner.

Williams et al. (2003) examined the impact of discourse level choices on readability in the do-main of reporting the results of literacy assessment tests, using the results of the test to control both the content and the realisation of the generated re-port. Our research aims to facilitate the transfer of such user-driven generation research to text regen-eration areas.

2.3 Sentence Compression

Sentence compression is a research area that aims to shorten sentences for the purpose of summaris-ing the main content. There are similarities be-tween our interest in reformulation and existing work in sentence compression. Sentence com-pression has usually been addressed in a gener-ative framework, where transformation rules are learnt from parsed corpora of sentences aligned with manually compressed versions. The com-pression rules learnt are therefore tree-tree trans-formations (Knight and Marcu, 2000; Galley and McKeown, 2007; Riezler et al., 2003) of some va-riety. These approaches focus on deletion oper-ations, mostly performed low down in the parse tree to remove modifiers. Further they make as-sumptions about isomorphism between the aligned tree, which means they cannot be readily applied to more complex reformulation operations such as insertion and reordering that are essential to perform reformulations such as those in Example 1. Cohn and Lapata (2009) provide an approach based on Synchronous Tree Substitution Grammar (STSG) that in principle can handle the range of reformulation operations. However, given their fo-cus on sentence compression, they restricted them-selves to local transformations near the bottom of the parse tree. In this paper, we explore whether this framework could prove useful to more in-volved reformulation tasks. Our experience (see Section 3.2) suggests that parse trees are the wrong representation for learning complex transforma-tion rules and that dependency structures are more suited for complex lexico-syntactic reformulation.

3 Regeneration using Transfer Rules

We experimented with three representations – phrasal parse trees, typed dependencies and Min-imal Recursion Semantics (MRS). In this section, we first describe our data, and then report our ex-perience with performing text reformulation using these representations.

3.1 Data

We use the corpus described in Siddharthan and Katsos (2010). This corpus contains examples of complex lexico-syntactic reformulations such as those in Example 1a–f; each example consists of 8 formulations, 7 of which are manual reformu-lations. The corpus contains 144 such examples from three genres, giving 1152 sentences in to-tal. The manual reformulation is formulaic and Example 1 is indicative of the process. To make a clause out of a noun phrase, either the copula or the verb “occur” is introduced, based on a subjec-tive judgement of whether this is an event or a con-tinuous phenomenon. Conversely, to create a noun phrase from a clause, a possessive and gerund are used; for example (from Siddharthan and Katsos (2010)):

(3) a. Irwin had triumphed because he was so good a man.

b. The cause of Irwin’s having triumphed was his being so good a man.

The corpus contains equal numbers of sentences from three different genres: PubMed Abstracts1 (technical writing from the Biomedical domain), and articles from the British National Corpus2 tagged as World News or Natural Science (popular science writing in the mainstream media).

3.2 Reformulation using Phrasal Parse Trees

As described above, we have access to a cor-pus that contains aligned sentences for each pair of types (a type is a combination of a discourse marker and an information order; thus we have 8 types). In principle it should be easy to learn trans-fer rules between parse trees of aligned sentences. Figure 1 shows parse trees ( using the RASP parser (Briscoe et al., 2006)) for the active and the pas-sive voice with “cause” as a verb. A transfer rule is derived by aligning nodes between two parse trees so that the rule only contains the differences in structure between the trees. In the represen-tation in Figure 1, the variable ??X0[NP] maps

1

PubMed URL: http://www.ncbi.nlm.nih.gov/pubmed/ 2

(4)

The explosion was caused by an incendiary device.

(S

(NP (AT The) (NN1 explosion)) (VP (VBDZ be+ed)

(VP (VVN cause+ed) (PP (II by)

(NP (AT1 an) (JJ incendiary) (NN1 device))))))

An incendiary device caused the explosion.

(S

(NP (AT1 An) (JJ incendiary) (NN1 device)) (VP (VVD cause+ed)

(NP (AT the) (NN1 explosion))))

Derived Rule:

(S

(??X0[NP]) (VP (VBZ be+s)

(VP(VVN cause+ed) (PP(II by+) (??X1[NP])))))

↓

(S

(??X1[NP])

[image:4.595.68.294.60.306.2]

(VP (VVZ cause+s) (??X0[NP])))

Figure 1:Example of a transfer rule derived from two parse trees.

onto any node (subtree) with label NP. RASP per-forms a morphological analysis of words (shown as lemma+suffix in the figure). Thus such rules can be used to account for changes in morphology, as in example 3a.–b. above.

In practise however, the parse tree representa-tion is too dependent on the grammar rules em-ployed by the parser. For instance, the parse tree for the sentence:

The explosion was presumed to be caused by an incendiary device.

(S

(NP (AT The) (NN1 explosion)) (VP (VBDZ be+ed)

(VP (VVN presume+ed) (VP (TO to)

(VP (VB0 be) (VP (VVN cause+ed)

(PP (II by) (NP (AT1 an)

(JJ incendiary) (NN1 device)))))))))

looks very different and does not match the rule in figure 1. With longer sentences, further prob-lems arise when similar strings are parsed differ-ently in the two aligned sentences (for example, different PP attachment) – these lead to very com-plicated rules, often with more than 20 variables. We split our data into development/training (96 stances of passive to active) and test sets (48 in-stances of passive). Using the top parse for each sentence, we derived 92 rules, including the one shown in Figure 1. However, coverage of these rules over the test corpus was poor (less than 10% recall). By learning rules using the top 20 parses for each sentence rather than just the top parse,

we could improve coverage to around 70%, but this involved the acquisition of over 4000 differ-ent rules – just to change voice. The situation was even worse for reformulations that change syntac-tic categories, such as “because” to “cause”, and we obtained more than 20,000 rules that still gave us a coverage of only around 15% for the test set.

We concluded that this was not a sensible rep-resentation for general text reformulation. In other words, while substitution grammars for parse trees have been shown to be useful for sentence com-pression tasks (e.g., Cohn and Lapata (2009)), they are less useful for more complex lexico-syntactic reformulation tasks.

3.3 Reformulation using MRS

Another option is to use a bi-directional grammar and perform the transforms at a semantic level. We now briefly discuss the use of Minimal Recursion Semantics (MRS) as a representation for transfer rules. Consider a very short example for ease of illustration:

Tom ate because of his hunger.

This can be analysed by a deep grammar to give a compositional semantic representation which captures the information that is available from the syntax and inflectional morphology. We show this sentence below in the Minimal Recursion Seman-tics (MRS) (Copestake et al., 2005) representation, as produced by the English Resource Grammar (ERG3) (Flickinger, 2000), but considerably sim-plified for ease of exposition and to save space:

named(x5,Tom), _eat_v_1(e2,x5), _because_of(e2,x11), poss(x11,x16), pron(x16), _hunger_n(x11)

The main part of the MRS structure is a list of elementary predications (EPs), which may have predicates derived from lexemes (e.g., _eat_v_1; these are indicated by the leading underscore) or supplied by the grammar (e.g., poss). The ERG treatsbecause ofas a multiword expression and assigns it a semantics comparable to a preposition. Paraphrase rules map between se-mantic representations; for our application, a pos-sible rule is the following:

_because_of(e,x), P(e,y) <->

_cause_v_1(e10,x,y,l1), l1:P(e,y)

Here ‘P’ is to be understood as a general predi-cate. The left hand side of the rule will match the preposition-like ‘because of’ relation when it has an event as an argument, where the event is the

(5)

characteristic event of an underspecified verbal EP. The right hand side indicates that the ‘because of’ can be substituted by a verbal relation correspond-ing to cause, with the verbal EP being a scopal argument. This rule matches the MRS above and maps it to the following (with P=_eat_v_1):

named(x5,Tom), l1:_eat_v_1(e2,x5),

_cause_v_1(e10,x11,x5,l1), poss(x11,x16), pron(x16), _hunger_n(x11), x5 aeq x16

This can be input to the realiser, giving:

His hunger caused Tom to eat.

Writing transfer rules is intuitive and easy in MRS. Further, the use of a bi-directional grammar for generation ensures that the generated sentence is grammatical. An infrastructure of writing para-phrase rules exists in this framework and semantic transfer has also been explored for machine trans-lation (e.g., Copestake et al. (1995)).

The problem we encountered, however, is that bidirectional grammars such as the ERG fail to parse ill-formed input and will also fail to anal-yse some well-formed input because of limitations in coverage of unusual constructions. Although the DELPH-IN parsing technology allows for un-known words, missing lexical items can also cause parse failure and even more problems for gener-ation. The ERG gives an acceptable parse ‘out of the box’ for only around 50-60% of sentences from scientific papers. Further, the generator can get slow and memory intensive for long sentences and many of our sentences are around 30 words long. Much of this processing effort during gen-eration is redundant as the input sentence can be used to narrow down generation choices, but as of now, the infrastructure does not exist to support this. Thus, while using a bi-directional grammar and semantic transfer might indeed be the most in-tuitive approach to complex lexico-syntactic refor-mulation, it is not quite feasible yet.

3.4 Reformulation using Typed Dependencies

Having had mixed success with transforming phrasal parse trees and semantic representations, we turned our attention to typed dependency struc-tures. We used the RASP toolkit (Briscoe et al., 2006) for finding grammatical relations (GRs) be-tween words in the text. GRs are triplets con-sisting of a relation-type and arguments and also encode morphology (stem + suffix), word posi-tion (after colon) and part-of-speech (after under-score); GRs produced for the sentence:

The explosion was caused by an incendiary device.

are:

(|ncsubj| |cause+ed:4 VVN| |explosion:2 NN1| ) (|aux| |cause+ed:4 VVN| |be+ed:3 VBDZ|) (|passive| |cause+ed:4 VVN|)

(|ncmod| |device:8 NN1| |incendiary:7 JJ|) (|det| |explosion:2 NN1| |the:1 AT|)

This representation shares aspects of phrasal parse trees and MRS. Note that the sets of de-pendencies (such as those above) represent a tree.4 While phrase structure trees such as those in Sec-tion 3.2 represent the nesting of constituents with the actual words at the leaf nodes, dependency trees have words at every node:

cause+ed:4

XX XX_X

X

explosion:2

the:1

be+ed:3 by:5

device:8

b bb " " "

an:6 incendiary:7

To generate from a dependency tree, we need to know the order in which to process nodes - in general tree traversal will be “inorder”; i.e, left subtrees will be processed before the root and right subtrees after. These are generation deci-sions that would usually be guided by the type of dependency and statistical preferences for word and phrase order. However, we can simply use the word positions (1–8) from the original sentence.

While typed dependencies share characteris-tics with parse trees, the flat structure repre-sents dependencies between words, and we can write transformation rules for this representation in fairly compact form. For instance, a transfor-mation rule to convert the above to active voice would require five deletions and two insertions:

1. Match and Delete: (a) (|passive| |??X0|)

(b) (|iobj| |??X0| |??X1(by II)|) (c) (|dobj| |??X1| |??X2|) (d) (|ncsubj| |??X0| |??X3| ) (e) (|aux| |??X0| |??X4|)

2. Insert:

(a) (|ncsubj| |??X0| |??X2| ) (b) (|dobj| |??X0| |??X3|)

4_{In fact, the GR scheme is only ‘almost’ acyclic. There}

(6)

Thus far, the rule looks very similar to rules written for MRS: one list of predicates is replaced by another. Applying this transformation to the GR set above creates a new dependency tree:

cause+ed:4

P P

P_P

device:8

b bb " " "

an:6 incendiary:7

explosion:2

the:1

However, unlike the case with MRS, where a statistical generator decides issues of morphology and ordering, we have to specify the consequences of the rule application for generation. Note that we can no longer rely on the original word or-der to determine the oror-der in which to traverse the tree for generation. Thus our transformation rules, in addition to Deletion and Insertion oper-ations, also need to provide rules for tree traver-sal order. These only need to be provided for nodes where the transform has reordered subtrees (“??X0”, which instantiates to “cause+ed:4” in the trees). Our rule would thus include:

3. Traversal Order Specifications:

(a) Node ??X0: [??X2, ??X0, ??X3]

This states that for node ??X0, the traversal or-der should be subtree ??X2 followed by current node ??X0 followed by subtree ??X3. Using this specification would allow us to traverse the tree using the original word order for nodes with no order specification, and the specified order where a specification exist. In the above instance, this would lead us to generate:

An incendiary device caused the explosion.

Our transfer rule is still incomplete and there is one further issue that needs to be addressed – operations to be performed on nodes rather than relations. There are two node-level operations that might be required for sentence reformulation:

1. Lexical substitution: In our example above, we still need to ensure number agreement for the verb “cause” (??X0). By changing voice, ??X0 now has to agree with ??X2 rather than ??X3. Fur-ther the tense of ??X0 was encoded in the auxiliary verb ??X4 that has been deleted from the GRs. We thus need the transfer rule to encode the lexical substitution required for node ??X0:

4. Lexical substitution:

(a) Node ??X0: IF (??X4 is Present Tense) THEN{

IF (??X2 is Plural) THEN{SET ??X0:SUFFIX =“”}ELSE{SET ??X0:SUFFIX =“s”} }

Other lexical substitutions are easier to spec-ify; for instance to reformulate “John ran be-cause David shouted.” as “David’s shouting caused John to run”, the following lexical substi-tution rule is required for node ??Xn representing “shout” that replaces its suffix “ed” with “ing”:

Lexical substitution: Node ??Xn: Suffix=“ing”

2. Node deletion: This is an operation that re-moves a node from the tree. Any subtrees are moved to the parent node. If a root node is deleted, one of the children adopts the rest. By default, the right-most child takes the rest as dependents, but we allow the rule to specify the new parent. In the above example, we want to remove the nodes ??X1(“by”) and ??X4 (“was”) (note that deleting a relation does not necessarily remove a node – there might be other nodes connected to ??X1 or ??X4). We would like to move these to the node ??X0 (“cause”):

5. Node Deletion:

(a) Node ??X1: Target=??X0 (b) Node ??X4: Target=??X0

Node deletion is easily implemented using search and replace on sets of GRs. It is central to reformulations that alter syntactic categories of discourse markers; for instance, to reformulate “The cause of X is Y” as “Y causes X”, we need to delete the verb “is” and move its dependents to the new verb “causes”.

To summarise, we propose a framework for lexico-syntactic reformulation based on typed de-pendency structures and have discussed the form of a transformation. We now specify the structure of transfer rules and tree nodes more formally.

Specification for Transfer Rules

Our proposal is based on applying transfer rules to lists of grammatical relations (GRs). Our transfer rules take the form of five lists:

1. CONTEXT: Transform only proceeds if this list of GRs can be unified with the input GRs.

2. DELETE: List of GRs to delete from input.

3. INSERT: List of GRs to insert into input.

4. ORDERING: List of nodes with subtree order specified

5. NODE-OPERATIONS: List of lexical substitutions and deletion operations on nodes.

(7)

correspond to the CONTEXT, INPUT and OUT-PUT lists used to specify transform in the MRS framework. However, because we do not use a for-mal grammar for generation, we need two further lists that capture changes in morphology or con-stituent ordering. The list ORDERING is used to traverse the dependency tree constructed from the transformed GRs. Again, the lexical substitution lists are prescriptions for generation. We restrict our lexical substitutions to change of suffix and part of speech (for instance, “X is afrequentcause of Y” to “Xfrequentlycauses Y”), but in general this can be an arbitrary string substitution (for in-stance, “X and Y aretwocauses of Z” to “X and Ybothcause Z”).

In this paper, we have tried to do away with a generator altogether by encoding generation deci-sions within the transfer rule. A case can be made, particularly for the issue of agreement, for such issues to be handled by a generator. This would make the transfer rules simpler to write, and easier to learn automatically in a supervised setting.

Specification for Dependency Tree

Applying a transfer rule specified above results in a new set of GRs. To generate a sentence, we need to create a dependency tree from these GRs. As described earlier, a dependency tree needs to be traversed “inorder” to generate a sentence. This means that at each node, the order in which to visit the daughters and the current node needs to be specified. To enable this, we propose that each node in the tree have the following features:

1. VALUE: stem, suffix and part-of-speech of the word;

2. PARENT: parent node;

3. CHILDREN: list of daughters;

4. ORDER: list specifying order in which to visit children and current node.

The parent node is required for DELETE oper-ations and to find the root of the tree (node with no parent). Further, if there is more than one node with no parent, the GRs do not form a tree and generation will result in multiple fragments.

The dependency tree is constructed using the following algorithm:

1. For each word in the list of GRs:

(a) Create a Node and instantiate the VALUE field.

2. For each GR (relation word1 word2):

(a) If GR is one that introduces a cycle, remove it from list, else add the node created for word2 to the CHILDREN list of node for word1 and set PARENT of word 2 to word1.

3. After Step 2, the tree is created. Now for each Node:

(a) If an ORDERING specification is introduced for this node by the transformation rule, copy that list to the ORDER field, else add the daughter nodes to the ORDER list in increasing order of word position.

The reformulated sentence is generated by traversing the tree “inorder”, outputting the word at each node visited (the stem, suffix and part-of-speech tag are fed to the RASP morphological generator, which returns the correct word).

4 Evaluating Transformation Rules

In this paper we have proposed a framework for complex lexico-syntactic reformulations. We want to evaluate our framework for (a) how easy it is to write transformation rules, (b) how many are re-quired for intuitive lexico-syntactic reformulations and (c) how robust the transformation is to parsing errors. With this intended purpose, we evaluate hand-written transformation rules that have been developed looking at one third of the corpus (48 sentences) and tested on the remaining two thirds (96 sentences). We report results using:

• Recall:The proportion of sentences in the test set for which a transform was performed; i.e., (a) the DELETE pattern matched the input GRs and (b) there was ex-actly one root node in the transformed GRs resulting in exactly one sentence being output

• Precision: The proportion of transformed sentence that were accurate; i.e., grammatical with (a) cor-rect verb agreement and inflexion and (b) modi-fiers/complements appearing in acceptable orders.

Note that we are merely evaluating the frame-work and not evaluating the utility of these trans-formations for text simplification – that would re-quire an evaluation using test subjects drawn from our intended users. Table 1 provides some exam-ples of accurate and inaccurate transformations.

The rule for converting passives to actives de-scribed in Section 3.4 already achieves a recall of 42% and precision of 83%. Writing 6 addi-tional rules to handle reduced relative clauses (1a-b, Table 1) etc., we could boost recall to 71% with precision dropping marginally to 82%. We hand-crafted rules to implement three other reformula-tions. These were selected based on results from the Siddharthan and Katsos (2010) study that sug-gested:

1. cause as a noun (either information ordering), passive voice, “because of” and “because a, b” formulations (versions b,d,e,f,g and h in Example 1, Section 1) are dispreferred by lay readers. Moreover, these are com-mon constructs in scientific writing.

(8)

Accurate Transformations

1a. Apart from occasional problems of ensemble caused by the complex rhythms of the outer movements, the orchestra gave an animated and committed reading of the work. [B-CAUSEBY-A→A-CAUSE-B]

b. Apart from occasional problems of ensemble the complex rhythms of the outer movements caused, the orchestra gave an animated and committed reading of the work.

2a. Because of transvection, the expression of a gene can be sensitive to the proximity of a homolog. [BEC-OF-A-B→A-CAUSE-B]

b. Transvection can cause the expression of a gene to be sensitive to the proximity of a homolog.

3a. Because each myosin is expressed in Drosophila indirect flight muscle, in the absence of other myosin isoforms, this allows for muscle mechanical and whole organism locomotion assays. [BEC-A-B→B-BEC-A]

b. In the absence of other myosin isoforms, this allows for muscle mechanical and whole organism locomotion assays because each myosin is expressed in Drosophila indirect flight muscle.

4a. Almost certainly, however, the underlying cause of the war was the problem of Aquitaine. [CAUSEOF-B-A→A-CAUSE-B] b. Almost certainly, however, the underlying problem of Aquitaine caused the war.

Inaccurate Transformation

5a. Moreover, main road traffic has scarcely been slowed and concern should be caused by the rising number of cyclist casualties. [B-CAUSEBY-A→A-CAUSE-B]

b. Moreover, the rising number of cyclist casualties should cause main road traffic has scarcely been slowed and concern. 6a. Because of the risk of injury and the need to kill prey quickly, predators usually predate animals smaller than themselves.

[BEC-OF-A-B→A-CAUSE-B]

[image:8.595.72.529.62.311.2]

b The risk of injury and the need cause kill to prey quickly predators usually predate animals smaller than themselves.

Table 1: Examples of automatic reformulations (version a. is the original and b. the reformulation).

Handcrafted rules n P R F

B-CAUSEBY-A→A-CAUSE-B 7 .82 (1.00) .71 (.75) .76 (.86) BEC-OF-A-B→A-CAUSE-B 9 .75 (.92) .70 (1.00) .72 (.97) BEC-A-B→B-BEC-A 8 .85 (.92) .83 (.87) .84 (.89) CAUSEOF→A-CAUSE-B 6 .97 (.90) .78 (1.00) .86 (.95)

Table 2: Number of Rules (n), Precision, Recall and F-Measure for lexico-syntactic reformulation using hand-crafted rules over GRs. Numbers in brackets are over the subset of the corpus that con-tains only the original sentences from PubMed and the BNC.

We summarise our results in Table 2. Most of the sentences in the corpus are manual reformu-lations and some of them are quite stilted. The numbers in brackets show performance over the smaller set of original sentences from PubMed and the BNC. These are more indicative of how the rules will perform on real data. Our results sug-gests that the framework we propose is adequate for a range of lexico-syntactic reformulations and a fairly small number of rules is required to cap-ture a reformulation.

Loss of recall was usually from parsing error (either misparses, in which case our rules don’t match the GRs, or partial parses, where a full tree can’t be formed because of missing GRs).

Loss of precision was a more worrying issue as it often resulted in badly corrupted output. This was usually the result of either bad parser deci-sions regarding attachment or scope or just mis-parsing (e.g., wide scoping of “and” in 5a-b and parsing “prey” as a verb in 6a-b, Table 1). It might be possible to trade-off recall for improved preci-sion by identifying sentences where ambiguity is a problem (by looking at multiple parses).

5 Conclusions and Future Work

In this paper we have reported our experience with using different linguistic formalisms as represen-tations for applying transform rules to generate complex lexico-syntactic reformulations of sen-tences expressing the discourse relation of causa-tion. We find typed dependency structures to be the most suited for this task and report that hand-crafted transformation rules generalise well to sen-tences in an unseen test corpus. We believe that the framework we have described is adequate for a range of regeneration tasks focused on text simpli-fication. While in this paper we focus on the dis-course relation of causation, other disdis-course rela-tions commonly used in scientific writing can also be realised using markers with different lexico-syntactic properties; for instance,contrast can be expressed using markers such as “while”, “un-like”, “but”, “compared to”, “in contrast to” and “the difference between”. Our rules for voice con-version and information reordering for subordina-tion are already general enough to be applied to non-causal constructs. We also plan to use our framework to explore sentence simplification and sentence shortening applications.

(9)

Acknowledgements

This work was supported by the Economic and Social Research Council (Grant Number RES-000-22-3272). We would also like to thank Dan Flickinger and Ann Copestake for many discus-sions on the topic of paraphrase and for help with using the ERG.

References

R.C. Anderson and A. Davison. 1988. Conceptual and empirical bases of readibility formulas. In Alice Davison and G. M. Green, editors,Linguistic Com-plexity and Text Comprehension: Readability Issues Reconsidered. Lawrence Erlbaum Associates, Hills-dale, NJ.

R. Barzilay and L. Lee. 2003. Learning to paraphrase: An unsupervised approach using multiple-sequence alignment. In HLT-NAACL 2003: Main Proceed-ings, pages 16–23.

I.L. Beck, M.G. McKeown, G.M. Sinatra, and J.A. Loxterman. 1991. Revising social studies text from a text-processing perspective: Evidence of im-proved comprehensibility. Reading Research Quar-terly, pages 251–276.

T. Briscoe, J. Carroll, and R. Watson. 2006. The sec-ond release of the RASP system. InProceedings of the COLING/ACL, volume 6.

J. Carroll, G. Minnen, Y. Canning, S. Devlin, and J. Tait. 1998. Practical simplification of English news-paper text to assist aphasic readers. InProceedings of AAAI98 Workshop on Integrating Artificial Intelli-gence and Assistive Technology, pages 7–10, Madi-son, Wisconsin.

T. Cohn and M. Lapata. 2009. Sentence compres-sion as tree transduction. Journal of Artificial In-telligence Research, 34(1):637–674.

A. Copestake, D. Flickinger, R. Malouf, S. Riehemann, and I. Sag. 1995. Translation using minimal recur-sion semantics. InProceedings of the Sixth Interna-tional Conference on Theoretical and Methodologi-cal Issues in Machine Translation, pages 15–32.

A. Copestake, D. Flickinger, I. Sag, and C. Pollard. 2005. Minimal recursion semantics: An intro-duction. Research in Language and Computation, 3:281–332.

D. Flickinger. 2000. On building a more efficient grammar by exploiting types. Natural Language Engineering, 6(1):15–28.

M. Galley and K. McKeown. 2007. Lexicalized Markov grammars for sentence compression. In

HLT-NAACL 2007: Main Proceedings, pages 180– 187, Rochester, New York, April. Association for Computational Linguistics.

A. Ibrahim, B. Katz, and J. Lin. 2003. Extracting para-phrases from aligned corpora. InProceedings of The Second International Workshop on Paraphrasing.

N. Kaji, D. Kawahara, S. Kurohash, and S. Sato. 2002. Verb paraphrase based on case frame alignment. In

Proceedings of the 40th Annual Meeting of the As-sociation for Computational Linguistics (ACL’02), pages 215–222, Philadelphia, USA.

K. Knight and D. Marcu. 2000. Statistics-based sum-marization — step one: Sentence compression. In

Proceeding of The American Association for Arti-ficial Intelligence Conference (AAAI-2000), pages 703–710.

J.J. L’Allier. 1980. An evaluation study of a computer-based lesson that adjusts reading level by monitor-ing on task reader characteristics. Ph.D. thesis, University of Minnesota, Minneapolis, MN.

E.T. Levy. 2003. The roots of coherence in discourse.

Human Development, pages 169–88.

T. Linderholm, M.G. Everson, P. van den Broek, M. Mischinski, A. Crittenden, and J. Samuels. 2000. Effects of Causal Text Revisions on More-and Less-Skilled Readers’ Comprehension of Easy and Dif-ficult Texts. Cognition and Instruction, 18(4):525– 556.

L. G. M. Noordman and W. Vonk. 1992. Reader’s knowledge and the control of inferences in reading.

Language and Cognitive Processes, 7:373–391.

R. Power, D. Scott, and N. Bouayad-Agha. 2003. Gen-erating texts with style. Proceedings of the 4 thInter-national Conference on Intelligent Texts Processing and Computational Linguistics.

S. Riezler, T.H. King, R. Crouch, and A. Zaenen. 2003. Statistical sentence condensation using ambiguity packing and stochastic disambiguation methods for lexical-functional grammar. In HLT-NAACL 2003: Main Proceedings, Edmonton, Canada.

A. Siddharthan and N. Katsos. 2010. Reformulat-ing discourse connectives for non-expert readers. In Proceedings of the 11th Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT 2010), Los Angeles, CA.

A. Siddharthan. 2006. Syntactic simplification and text cohesion. Research on Language and Compu-tation, 4(1):77–109.

S. Williams and E. Reiter. 2008. Generating basic skills reports for low-skilled readers. Natural Lan-guage Engineering, 14(04):495–525.