
Using the Reddit Summarization Corpus

One of the goals of the work described in this chapter is to compare the performance of the contextualist-inspired, change-based approach on extractive summarization to its previous performance on MRE sentence detection. We therefore use the same development, seed, tuning, and testing split of the 476 annotated personal narratives as in our previous experiments.

4.2.1 Reference Summary Construction

The Reddit personal narrative summarization corpus provides either two or four human-written abstractive summaries for each narrative and up to six crowdsourced or aggregated extractive summaries for each abstractive summary. Thus, our first task is to construct a single extractive reference summary for each narrative in the corpus. Choosing a single extractive summary for each abstractive summary is relatively straightforward: we use the majority aggregation scheme, which includes any sentence selected by at least two out of three Amazon Mechanical Turk workers. Of the other two aggregation schemes, intersect, which requires sentences to be selected by all three workers, is too strict (in about 12% of narratives, the intersect-aggregated summary is empty), while union, which combines all sentences selected by any worker, is likely to include redundant sentences. Further, the average length of a majority-aggregated extractive summary is 2.69 sentences, very close to the average of 2.5 MRE sentences per narrative in our previous experiments.
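The three aggregation schemes can be illustrated with a minimal Python sketch. The function name `aggregate`, the representation of each worker's selection as a set of sentence indices, and the example data are assumptions made for illustration; they are not part of the corpus release.

```python
from collections import Counter


def aggregate(selections, scheme="majority"):
    """Combine per-worker sentence selections into one extractive summary.

    selections: list of sets of sentence indices, one set per worker
    (assumed here to be the three workers assigned to one abstractive summary).
    """
    # Count how many workers selected each sentence.
    counts = Counter(i for chosen in selections for i in chosen)
    if scheme == "union":
        needed = 1                 # selected by any worker
    elif scheme == "majority":
        needed = 2                 # selected by at least two of three workers
    elif scheme == "intersect":
        needed = len(selections)   # selected by every worker
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    return sorted(i for i, n in counts.items() if n >= needed)


# Hypothetical selections from three workers.
workers = [{0, 3, 7}, {3, 7, 9}, {3, 5}]
print(aggregate(workers, "intersect"))  # [3]
print(aggregate(workers, "majority"))   # [3, 7]
print(aggregate(workers, "union"))      # [0, 3, 5, 7, 9]
```

In this toy case the intersect summary shrinks to a single sentence and the union summary grows to five, which mirrors the trade-off that motivates choosing majority.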

How best to combine the majority summaries corresponding to two different annotators’ abstractive summaries into a single extractive summary for each narrative is less clear. Differences among the majority summaries reflect, to a large extent, differences in the content of their corresponding abstractive summaries. In the previous chapter, we used crowdsourcing to examine differences in content in the abstractive summaries in our corpus, finding three possible cases:

1. The abstractive summaries are about completely different main events. In this case, the majority summaries are completely disjoint sets of sentences.

2. The abstractive summaries are all about the same main event and cover exactly the same information. In this case, the majority summaries are ideally identical, although in practice, there are multiple sets of sentences that can reconstruct the information in the abstractive summaries.

In both of these cases, neither of the abstractive summaries is any more or less “correct” than the other, so we take the union of the sentences in the majority summaries.

3. The abstractive summaries are about the same event, but one or both of them covers information not in the other. In this case, there is likely to be a shared subset of sentences between the two majority summaries, with some extra sentences in either one or both.

In this last case, there is some piece of information that one annotator considered essential to a reader’s understanding of the summary, but that another annotator considered unimportant. As we have no way to tell which annotator is “correct,” we experiment with two different methods for handling extra information: keeping it or dropping it.

• Keep: We take the union of the sentences in the majority summaries, thus retaining any extra information present in either summary.

• Drop: We do not include any sentences present only in the more-informative majority summary. If both majority summaries contain information not present in the other, we take the intersection of the sentences in the two summaries.
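The merging rules for a pair of majority summaries can be sketched as follows. This is only an illustration: the function `merge_majority` and the case labels are hypothetical names, and the sketch assumes the case for each pair is already known from the crowdsourced comparison of the abstractive summaries, since it cannot be inferred reliably from the sentence sets alone.

```python
def merge_majority(a, b, case, method="keep"):
    """Merge two majority summaries, given as collections of sentence indices.

    case (hypothetical labels):
      "different_events"  - case 1, summaries describe different main events
      "same_information"  - case 2, same event and same information
      "extra_in_a"        - case 3, only summary a has extra information
      "extra_in_b"        - case 3, only summary b has extra information
      "extra_in_both"     - case 3, each summary has extra information
    method: "keep" or "drop" (only matters in case 3).
    """
    a, b = set(a), set(b)
    if method not in ("keep", "drop"):
        raise ValueError(f"unknown method: {method}")
    # Cases 1 and 2 always take the union; so does case 3 under keep.
    if case in ("different_events", "same_information") or method == "keep":
        return sorted(a | b)
    # Case 3 under drop: discard sentences found only in the
    # more-informative summary, or intersect if both have extras.
    if case == "extra_in_a":
        return sorted(b)
    if case == "extra_in_b":
        return sorted(a)
    if case == "extra_in_both":
        return sorted(a & b)
    raise ValueError(f"unknown case: {case}")


print(merge_majority({1, 2, 5}, {1, 2}, "extra_in_a", "keep"))        # [1, 2, 5]
print(merge_majority({1, 2, 5}, {1, 2}, "extra_in_a", "drop"))        # [1, 2]
print(merge_majority({1, 2, 5}, {1, 2, 8}, "extra_in_both", "drop"))  # [1, 2]
print(merge_majority({1, 2}, {6, 7}, "different_events", "drop"))     # [1, 2, 6, 7]
```

When only one summary carries extra information, removing the sentences unique to it from the union leaves exactly the other summary, which is why the sketch returns `b` (or `a`) directly.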

For the 68 narratives with four different abstractive summaries, we first build the extractive reference summary from two of the majority summaries as described above and then work in the other two, adding or removing sentences as needed, following the same rules. The average length of a completed extractive reference summary is six sentences under the keep method of handling extra information and four sentences under drop.

4.2.2 Pyramid Weighting

Because the extractive reference summaries are constructed by merging multiple majority-aggregated Turker summaries, it is not necessarily the case that all sentences in a reference summary are equally important. Up to twelve Turkers contributed to each reference summary, but we only require a sentence to be selected by two Turkers to be included. A natural next step to ensure the quality of our reference summaries is to apply the pyramid method of Nenkova and Passonneau (2004).

The pyramid method for summarization evaluation was designed to address the possibility of there being multiple sentences in a document that express overlapping or identical information; in other words, there may be multiple, equally correct extractive summaries that express the same information using different sets of sentences. The pyramid method addresses this concern by breaking each summary into summary content units (SCUs), roughly clause-level units of meaning. A system summary is evaluated based on the SCUs it contains, where each SCU is weighted according to the number of human summaries that contain it. The set of weighted SCUs taken from all human summaries forms the pyramid, the gold standard reference summary.

While we do not have SCU annotations for our data, we can still use the pyramid method to weight the sentences in our extractive reference summaries. A sentence selected by all twelve Turkers, for example, likely expresses crucial information not present in any other sentence in the document; a system-generated summary should be penalized more heavily for failing to include such a sentence than for omitting another sentence selected by only two Turkers, the information in which might well be present in some other sentence that the system did extract.

In the development, seed, tuning, and testing sets, we assign each sentence a pyramid weight equal to the number of Turkers who selected it; because we use majority aggregation to construct the extractive reference summaries, the pyramid weight for all sentences in the extractive reference summary is at least 2, while the weight of the other sentences is 1.
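A minimal sketch of this weighting follows, under one reading of the description: sentences in the reference summary receive the number of Turkers who selected them, and every other sentence receives a fixed weight of 1. The function name `pyramid_weights` and the example data are hypothetical.

```python
from collections import Counter


def pyramid_weights(num_sentences, turker_selections, reference):
    """Return one pyramid weight per sentence in a narrative.

    turker_selections: list of sets of sentence indices, one per Turker
    (up to twelve per narrative).
    reference: sentence indices in the extractive reference summary.
    """
    counts = Counter(i for chosen in turker_selections for i in chosen)
    reference = set(reference)
    # Reference sentences: number of selecting Turkers (at least 2 under
    # majority aggregation). All other sentences: assumed weight of 1.
    return [counts[i] if i in reference else 1 for i in range(num_sentences)]


# Hypothetical narrative with five sentences and six Turkers
# (three for each of two abstractive summaries).
selections = [{0, 2}, {0, 2}, {0}, {0, 3}, {0, 2}, {4}]
reference = {0, 2}  # union of the two majority summaries {0, 2} and {0}
print(pyramid_weights(5, selections, reference))  # [5, 1, 3, 1, 1]
```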

4.2.3 Heuristic Labeling

We use a new linear combination of the same similarity metrics from our previous experiments (semantic similarity to comment, tl;dr, and prompt) to label each sentence as extracted or not, using a threshold tuned on our pyramid-weighted development set:

$$h_{\text{comment}} + h_{\text{tldr}} + h_{\text{prompt}} \stackrel{?}{>} 1.544$$

The 67,954 training sentences labeled using this heuristic are assigned a pyramid weight of 1.
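The labeling rule can be sketched as below. The sketch assumes the combination is the unweighted sum shown in the formula and that the three similarity scores are already computed for each sentence; the function names and the example scores are hypothetical.

```python
THRESHOLD = 1.544  # tuned on the pyramid-weighted development set


def heuristic_label(h_comment, h_tldr, h_prompt, threshold=THRESHOLD):
    """Label a sentence as extracted if its summed similarity exceeds the threshold."""
    return (h_comment + h_tldr + h_prompt) > threshold


def label_training_sentences(scores):
    """Return (label, pyramid_weight) pairs for heuristically labeled sentences.

    scores: iterable of (h_comment, h_tldr, h_prompt) tuples.
    Every heuristically labeled training sentence gets pyramid weight 1.
    """
    return [(heuristic_label(*s), 1) for s in scores]


# Hypothetical similarity scores for two sentences.
print(label_training_sentences([(0.7, 0.6, 0.3), (0.5, 0.5, 0.5)]))
# [(True, 1), (False, 1)]
```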