2.4.1 Analysis of Results on Quality Questions

With the exception of question 3, the results on the quality questions followed a consistent pattern of Human/LexRank performing best, followed by Simplification, Disaggregation, and Compression, in that order. It is not surprising that the human-written summaries perform well on this portion of the evaluation; if anything it is surprising that they did not perform even better than they did.

The lack of a significant difference between LexRank and Human conditions on any of the quality evaluations appears impressive, although we must be cautious not to read too much into it. Because the Human condition applied only to the legal cases, not to the biomedical articles, the sample size of the Human group was smaller than those of the other conditions (N_Human = 8; N_LexRank = 18). A larger study might find significant results where this one did not. In addition, it is possible that by showing evaluators summaries from the much lower quality Simplified, Compressed, or Disaggregated conditions alongside LexRank and Human summaries, we may have distorted their perception of the quality scale. A truly fair test of LexRank’s performance would require a larger number of summaries and only the Human and LexRank conditions.
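The effect of the small Human group on the chance of detecting a difference can be illustrated with a rough simulation. The sketch below is not the analysis used in this study: it swaps in a simple two-sided permutation test on group means, assumes normally distributed scores with unit variance, and uses a hypothetical true gap of 0.8 between conditions. Only the group sizes 8 and 18 come from the text; the function names and all other parameters are illustrative assumptions.

```python
import random


def mean(xs):
    return sum(xs) / len(xs)


def permutation_p_value(a, b, n_perm, rng):
    """Two-sided permutation test on the absolute difference of group means."""
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly above zero.
    return (hits + 1) / (n_perm + 1)


def estimated_power(n_a, n_b, true_gap, trials=100, n_perm=300,
                    alpha=0.05, seed=0):
    """Fraction of simulated studies in which the test rejects at level alpha.

    ASSUMPTION: scores are drawn from unit-variance normal distributions whose
    means differ by `true_gap`; real rating data are ordinal and bounded.
    """
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        a = [rng.gauss(0.0, 1.0) for _ in range(n_a)]
        b = [rng.gauss(true_gap, 1.0) for _ in range(n_b)]
        if permutation_p_value(a, b, n_perm, rng) < alpha:
            rejections += 1
    return rejections / trials


if __name__ == "__main__":
    # Group sizes 8 and 18 come from the text; the gap of 0.8 is hypothetical.
    print("8 vs 18: ", estimated_power(8, 18, true_gap=0.8))
    print("18 vs 18:", estimated_power(18, 18, true_gap=0.8))
```

Comparing the two printed estimates gives a sense of how much a balanced design might help under these assumptions; the numbers say nothing about the true size of any Human versus LexRank difference.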

We can say with confidence, though, that the LexRank summaries outperformed the Compressed and Disaggregated ones on all of the quality questions and the Simplified system on most of them. On some quality questions, this is unsurprising. The extra step of simplifying, compressing, or disaggregating a sentence adds an opportunity for a previously grammatical sentence to become ungrammatical. In addition, removing part of a sentence increases the likelihood of having unclear referents.

However, our hypothesis suggested that on quality question 2, which asks how much useless or confusing material ought to be removed from the summary, we could expect Simplified, Compressed, and Disaggregated conditions to perform well. After all, their purpose was to automatically remove useless material. Yet all three performed significantly worse on this question than the Human condition, and Compressed also performed significantly worse than LexRank.

One possible explanation is that the compound question has blurred the issue. Perhaps the evaluators felt that they would need to edit a great deal of confusing material from the Simplified, Compressed, and Disaggregated summaries, but were neutral or even positive as to the amount of useless material. Another possibility is that the experimental systems rendered some sentences so confusing as to be useless.

A review of the actual summaries supports this second view. Table 2.7 contains the first hundred words from the LexRank, Simplified, Compressed, and Disaggregated versions of the same summary.9 In the Compressed and Disaggregated summaries, and to a lesser extent the Simplified summary, sentences have been altered and taken out of context in ways that make them quite confusing. For instance, in the Simplified summary, the heading “Austin Bank—Troup, Texas,” which in the original document introduced a recitation of the facts regarding the robbery of that bank, was transformed into the rather bewildering sentence “Austin Bank is Troup, Texas.” This type of change explains why, even if some useless material was excluded by the Simplified, Compressed, and Disaggregated conditions, evaluators would still score these conditions as needing to have more useless or confusing material trimmed.

9. The summaries in their entirety are available online at http://www-personal.umich.edu/

a: (2010) UNITED STATES of America, Plaintiff-Appellee, v. Paul Edward THOMAS; Derrick Van Hodges, Defendants-Appellants. Each robbery was completed within two minutes. The basis for the warrant was DNA evidence linking Hodges to a glove dropped during a bank robbery in Henderson, Texas. Thomas and Hodges were named in an 18-count indictment charging them with conspiracy, bank robbery, and weapons offenses related to the following bank robberies: 1. DISCUSSION I. Sufficiency of the Evidence Thomas and Hodges argue the government presented insufficient evidence identifying them as the bank robbers. Thomas claims the evidence against Hodges was much stronger and implies that Thomas was found guilty by association.

b: Each robbery was completed within two minutes. Thomas and Hodges were named in an 18-count indictment charging them with conspiracy , bank robbery , and weapons offenses related to the following bank robberies : 1. Austin Bank is Troup, Texas. DISCUSSION I. Sufficiency of the proof Thomas and Hodges argue the government presented insufficient evidence identifying them as the bank robbers. Thomas claims and the proof against Hodges implies that association found Thomas guilty. The proof was much stronger. Hodges was ‘ weak in some instances. This happened after a bank robbery was also ‘ happenstance. United States v. Clayton . We apply this standard of review to direct and circumstantial evidence.

c: (2010) UNITED STATES of America, Plaintiff-Appellee, v. Paul Edward THOMAS; Derrick Van Hodges, Defendants-Appellants. Ireland, Carroll , Kelley for Thomas is P.C. Ireland, Carroll , Kelley for Thomas is Tyler. Ireland, Carroll , Kelley for Thomas is TX. Hodges had in his possession a $ 10 bait bill taken a week earlier during the robbery of a bank in Crockett, Texas. At that time, arrested. America Henderson, is Texas ; 3. Austin Bank is Troup, Texas. Thomas, Hodges , were jointly tried before a jury. Thomas claims.the evidence against Hodges implies that Thomas was found guilty by association. The evidence against Hodges was much stronger.

d: LESLIE SOUTHWICK Circuit Judge Paul Edward Thomas and Derrick Van Hodges were convicted of counts of conspiracy bank robbery and weapons possession . Bank robberies were executed in manner . Each robbery was completed within two minutes . Derrick Van Hodges was arrested in Tyler Texas on state warrant . The basis for the warrant was evidence linking Hodges to glove dropped during a bank robbery in Henderson Texas . Bait bill was found in child ’s bedroom . DISCUSSION Sufficiency of Evidence Thomas and Hodges argue government presented evidence identifying them as bank robbers . Thomas claims the evidence against Hodges was stronger and implies that Thomas was found guilty by association .

Table 2.7: Segments of summaries output by (a) LexRank Only, (b) Simplified, (c) Disaggregated, and (d) Compressed conditions.

Additionally, we note that question 3, which asks whether the summary is repetitive, is the only quality question that does not follow the pattern of the others, as can be seen in Figure 2.5. For question 3 only, the Simplification system performed approximately as well as the LexRank Only and Human conditions. Because the differences in question 3 were not statistically significant, this apparent deviation from the pattern may be illusory. But this result could also suggest that the simplifier’s use of determiners instead of repeated noun phrases helped avoid overly repetitive summaries.

Another notable result was that Simplified and Disaggregated were not significantly different on any quality measures except for question 6, which asks about ungrammatical sentences. Given the engineering described above to avoid introducing ambiguous determiners and to repeat modifiers in the split sentences, we had expected that the Disaggregated system would perform better on question 4, which asks how difficult it was to identify referents of noun phrases.

An important question is why Disaggregated summaries included more ungrammatical sentences than Simplified summaries. The most likely cause seems to be overfitting. The modifications and the additional rules described above were based on a small number of gold standard sentences. Changes to the system intended only to allow modifiers to be repeated generated some side effects in initial testing; for example, some disaggregated sentences contained long strings of repeated conjunctions. These problems were fixed, in that they no longer occurred when tested on the gold standard sentences and basic modifications to them, before the summaries in this study were generated. However, perhaps the gold standard sentences and modifications to them did not expose a broad enough range of possible problems, and other side effects remained that could only have been discovered by a larger system test. Similarly, the new rules were developed to work on the sentences from the gold standard collection and variations on those. It is possible broader testing might reveal sentences that match the dependency patterns we found in those sentences, but are grammatically different enough that the rule application no longer makes sense.

of the Compression system. The Clarke and Lapata ILP-based sentence compression algorithm that we used is widely considered state of the art in sentence compression. We suspect that the problem may relate to the language model that the algorithm incorporates in its objective function. The trigram model built from the LDC Gigaword corpus of newswire articles may do a poor job of representing n-grams that show up in legal cases and biomedical articles. Biomedical articles in particular are likely to contain a great many out-of-vocabulary terms. A simple follow-up study could test this hypothesis by building a language model using a corpus of sophisticated documents and checking whether performance improved. Such a study would not require evaluating entire summaries; the compressions could be evaluated on a sentence-by-sentence basis.
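A cheap first check of the domain-mismatch hypothesis, before retraining any language model, would be to measure how much of the target-domain text the newswire model has never seen. The sketch below is a minimal illustration under stated assumptions: it uses whitespace tokenization, raw vocabulary and trigram sets rather than a smoothed trigram model, and hypothetical function names. It is not part of the Clarke and Lapata algorithm or of the system evaluated here.

```python
def tokenize(text):
    """ASSUMPTION: lowercase whitespace tokenization is good enough for a rough check."""
    return text.lower().split()


def trigrams(tokens):
    """Return the consecutive token triples of a sentence."""
    return list(zip(tokens, tokens[1:], tokens[2:]))


def coverage_report(train_sentences, target_sentences):
    """Measure how poorly the training corpus covers the target-domain sentences.

    Returns the share of target tokens absent from the training vocabulary and
    the share of target trigrams never observed in the training sentences.
    """
    vocab = set()
    seen_trigrams = set()
    for sentence in train_sentences:
        tokens = tokenize(sentence)
        vocab.update(tokens)
        seen_trigrams.update(trigrams(tokens))

    n_tokens = n_oov = n_tri = n_unseen = 0
    for sentence in target_sentences:
        tokens = tokenize(sentence)
        n_tokens += len(tokens)
        n_oov += sum(tok not in vocab for tok in tokens)
        tris = trigrams(tokens)
        n_tri += len(tris)
        n_unseen += sum(tri not in seen_trigrams for tri in tris)

    return {
        "oov_rate": n_oov / n_tokens if n_tokens else 0.0,
        "unseen_trigram_rate": n_unseen / n_tri if n_tri else 0.0,
    }
```

In use, `train_sentences` would hold sentences from the corpus behind the language model and `target_sentences` would hold sentences from the legal cases or biomedical articles. Markedly higher rates for the target domains than for held-out newswire would be consistent with the hypothesis, though they would not by themselves show that the language model caused the poor compressions.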

In document Selecting and Generating Computational Meaning Representations for Short Texts (Page 71-75)