Experiments - Adapting Automatic Summarization to New Sources of Information

Using this weakly-labeled Reddit dataset and the change-based view of narration described in Section 2.2, we conducted two experiments on automatically extracting MRE sentences. We compared our results with three baselines: random, our extraction heuristic, and the last sentence of the narrative (the next-best heuristic).

As described in the previous section, we labeled our training set in blocks of three consecutive MRE sentences, centered on the sentence from each narrative that was selected by our heuristic. To account for this, in our experiments and baselines, we predicted the presence of an MRE sentence in a three-sentence block: in testing, we considered a predicted block to be correct if it contained at least one human-extracted MRE.
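The block-level criterion can be illustrated with a minimal sketch. The function names are hypothetical, sentences are indexed from zero, and clipping a block at the edges of the narrative is an assumption; the text does not say how blocks at the first or last sentence were handled.

```python
def block_around(center, n_sentences, radius=1):
    """Return the sentence indices of a block centered on `center`.

    Assumption: blocks are clipped at the narrative boundaries.
    """
    start = max(0, center - radius)
    end = min(n_sentences - 1, center + radius)
    return list(range(start, end + 1))


def block_is_correct(block, gold_mre_indices):
    """A predicted block is correct if it contains at least one
    human-extracted MRE sentence."""
    return any(i in gold_mre_indices for i in block)


# Toy narrative of 10 sentences in which annotators marked sentence 7.
gold = {7}
predicted = block_around(center=8, n_sentences=10)  # covers sentences 7, 8, 9
missed = block_around(center=2, n_sentences=10)     # covers sentences 1, 2, 3
print(block_is_correct(predicted, gold), block_is_correct(missed, gold))
```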

2.4.1 Features

Stylistic Features. For each sentence in a narrative, we generated 176 sentence-level features capturing changes in the narration. We first scored each sentence using each of the sixteen metrics shown in Table 2.3.³ The semantic metrics, cossimilarity and lssimilarity, refer to the bag-of-words cosine similarity and latent semantic similarity of a sentence to the preceding sentence.

| Type | Metric Names |
|---|---|
| Syntactic | sentlength, vplength, lengthratio, sentdepth, vpdepth, depthratio, wordlength, structcomplexity, wordformality, wordcomplexity |
| Semantic | cossimilarity, lssimilarity |
| Affectual | pleasantness, activation, imagery, subjectivity |

Table 2.3: The sixteen narration-style metrics.

We then smoothed the scores across sentences in a narrative by applying a Gaussian filter. We also tried weighted and exponential moving averages, as well as a Hamming window, but the Gaussian performed best in experiments on our tuning set. Finally, we generated eleven features for each metric at each sentence: the sentence score; whether or not the sentence is a local maximum or minimum; the sentence’s distance from the global maximum and minimum; the difference in score between the sentence and the preceding sentence, the difference between the sentence and the following sentence, and the average of these differences (approximating the incoming, outgoing, and self slopes for the metric); and the incoming, outgoing, and self differences of differences (approximating the second derivative).
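The smoothing and feature-generation steps can be sketched as follows for a single metric. This is an illustrative sketch under several assumptions that the text does not settle: the kernel width `sigma`, edge handling by renormalizing the truncated kernel, zero-valued differences at the first and last sentence, distance to the global extremes measured in sentences, and second differences taken over the "self" slope sequence. All function and feature names are hypothetical.

```python
import math


def gaussian_smooth(scores, sigma=1.0):
    """Smooth scores with a truncated Gaussian kernel, renormalized at the edges."""
    radius = max(1, int(3 * sigma))
    smoothed = []
    for i in range(len(scores)):
        lo, hi = max(0, i - radius), min(len(scores), i + radius + 1)
        weights = [math.exp(-((i - j) ** 2) / (2 * sigma ** 2)) for j in range(lo, hi)]
        total = sum(w * scores[j] for w, j in zip(weights, range(lo, hi)))
        smoothed.append(total / sum(weights))
    return smoothed


def three_way_diffs(seq):
    """Incoming, outgoing, and self (mean of the two) differences; 0.0 at the edges."""
    n = len(seq)
    incoming = [seq[i] - seq[i - 1] if i > 0 else 0.0 for i in range(n)]
    outgoing = [seq[i + 1] - seq[i] if i < n - 1 else 0.0 for i in range(n)]
    self_ = [(a + b) / 2 for a, b in zip(incoming, outgoing)]
    return incoming, outgoing, self_


def change_features(s):
    """Eleven features per sentence for one (smoothed) metric sequence s."""
    n = len(s)
    g_max = max(range(n), key=lambda i: s[i])  # index of the global maximum
    g_min = min(range(n), key=lambda i: s[i])  # index of the global minimum
    inc, out, slf = three_way_diffs(s)
    inc2, out2, slf2 = three_way_diffs(slf)    # assumption: diffs of the self slope
    rows = []
    for i in range(n):
        neighbors = [s[j] for j in (i - 1, i + 1) if 0 <= j < n]
        rows.append({
            "score": s[i],
            "is_local_max": bool(neighbors) and all(s[i] > x for x in neighbors),
            "is_local_min": bool(neighbors) and all(s[i] < x for x in neighbors),
            "dist_global_max": abs(i - g_max),
            "dist_global_min": abs(i - g_min),
            "slope_in": inc[i], "slope_out": out[i], "slope_self": slf[i],
            "curve_in": inc2[i], "curve_out": out2[i], "curve_self": slf2[i],
        })
    return rows


# Example: features for one metric over a five-sentence narrative.
features = change_features(gaussian_smooth([0.2, 0.4, 0.9, 0.3, 0.1]))
```

Repeating this for each of the sixteen metrics yields 16 × 11 = 176 features per sentence, matching the count given above.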

³ While lexical formality and complexity scores are not properly features of the syntax of a sentence, we considered them part of the same category as the truly syntactic features, whose goal was to capture the formality and complexity of the sentence.

Other Features. We included an additional ten features inspired by Labov’s theory of narrative structure:

• The tense of the main verb and whether or not there is a shift from the previous sentence. Labov (2013) suggests a shift between the past and the historical present near the most reportable event.

• The position of the sentence in the narrative; the MRE usually appears near the end. We implemented position as four binary features by dividing the narrative into four sections.

• The bag-of-words cosine similarity and latent semantic similarity between the sentence and the first and second sentences in the narrative. While the MRE sentence usually appears near the end of the narrative, Labov (2013) notes that the abstract, a short introduction that occurs in some narratives, often refers to the MRE.

It is important to note that we did not use any lexical features in our experiments – the system we trained does not in any way model what a narrative is about. This was a deliberate choice in the design of our system. Because our Reddit narratives were collected from just 39 different prompts, a system that did use lexical features might learn a set of words topically related to each prompt and classify MRE sentences based on those words. Such a system would perform poorly on narratives about previously unseen topics. We designed our features to capture only the similarities and differences between a sentence in a narrative and its neighbors, keeping them independent of the actual content – the deep structure – of the narrative. Our hope was that the narration alone contained enough

2.4.2 Distant Supervision

Our first experiment used distant supervision with our weakly-labeled training set: the heuristically extracted MRE sentences were treated as if they were gold standard labels. We classified blocks of three sentences as containing an MRE sentence or not. The two classes, MRE and no-MRE, were weighted inversely to their frequencies in the weakly-labeled training set, and all features were normalized to the range [0, 1]. We trained a support vector machine with soft-margin penalty C = 1 and an RBF kernel with γ = 0.001 (these parameters were tuned using grid search on our human-annotated tuning set).
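The two preprocessing steps described above, inverse-frequency class weighting and normalization to [0, 1], can be sketched without reference to any particular SVM library (the text does not name the implementation used). The weighting formula below, n / (k × count), is one common way to realize "inversely to their frequencies" and is an assumption, as are the function names.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by n / (k * count), so rarer classes weigh more."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {label: n / (k * count) for label, count in counts.items()}


def min_max_normalize(rows):
    """Rescale each feature column to [0, 1]; constant columns map to 0.0."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


# Toy data: one MRE block for every three no-MRE blocks.
labels = ["MRE", "no-MRE", "no-MRE", "no-MRE"]
weights = inverse_frequency_weights(labels)
scaled = min_max_normalize([[1, 10], [3, 10], [2, 10]])
```

The resulting weights and scaled features would then be passed to the SVM trainer together with C = 1, an RBF kernel, and γ = 0.001. In a full pipeline the column minima and maxima would normally be computed on the training set and reused for the test set.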

| Trial | Precision | Recall | F-Score |
|---|---|---|---|
| Last sentence baseline | 0.208 | 0.112 | 0.146 |
| Heuristic baseline | 0.107 | 0.333 | 0.162 |
| No change-based features* | 0.146 | 0.378 | 0.211 |
| Random baseline | 0.185 | 0.586 | 0.281 |
| Change-based features only* | 0.351 | 0.685 | 0.466 |
| All features* | 0.398 | 0.745 | 0.519 |

Table 2.4: Distant supervision experiment results (* indicates significant difference from baselines, p < 0.01).
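As a consistency check on the table, the F-scores can be recomputed from the reported precision and recall. The sketch assumes the F-score is the balanced harmonic mean of precision and recall, which the text does not state explicitly; under that assumption the recomputed values agree with the reported ones to within about 0.002.

```python
def f_score(precision, recall):
    """Balanced F-score: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F-score), copied from Table 2.4.
table_2_4 = {
    "Last sentence baseline": (0.208, 0.112, 0.146),
    "Heuristic baseline": (0.107, 0.333, 0.162),
    "No change-based features": (0.146, 0.378, 0.211),
    "Random baseline": (0.185, 0.586, 0.281),
    "Change-based features only": (0.351, 0.685, 0.466),
    "All features": (0.398, 0.745, 0.519),
}

recomputed = {name: f_score(p, r) for name, (p, r, _) in table_2_4.items()}
for name, value in recomputed.items():
    print(f"{name}: reported {table_2_4[name][2]:.3f}, recomputed {value:.3f}")
```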

The results of the distant supervision experiment are shown in Table 2.4. We trained three different systems using different sets of features: the change-based features included all 176 stylistic features except for the metric scores themselves; the non-change-based features included the other Labov-inspired features and the metric scores, but none of the other stylistic features, such as slopes and distance from global extremes; and the third system used all features together.

Our best results use both sets of features, but notably, using the change-based features alone achieved significant improvement over the three baselines (p < 0.00005). The no-change feature set was outperformed by the random baseline (p < 0.0024), supporting our hypothesis that it is change in a stylistic metric, rather than the metric score itself, that predicts MRE sentences.

2.4.3 Self-Training

The distant supervision approach treats heuristically-labeled data as if it were human-labeled, gold standard data. The hope is that a large amount of noisy data would allow one to train a more general model than would a very small amount of clean data. However, there is the risk that some of these noisy training labels are so egregiously bad that they could lower the overall performance of the trained model.

Our second experiment used a self-training approach, where a classifier uses a small, labeled seed set to label a larger training set. In pure self-training, there is the risk that these predicted labels are incorrect or reflect strange outliers or biases in the seed set. We addressed the risks of both the distant supervision and self-training approaches by adding a quality control step to self-training: for a predicted label to be added to the training set, we required that it agree with the heuristic weak label for that sentence.

With the same parameter settings as in the distant supervision experiment, we trained an SVM on our hand-labeled seed set of 958 sentences. We used this initial model to predict labels for the training set. All sentences whose predicted labels agreed with the heuristic labels were added to the seed set and used to train a new model, which was in turn used to label the remaining unused sentences (i.e., sentences whose predicted labels from the previous round had disagreed with the heuristic labels). We repeated this process until none of the current model's labels agreed with any of the remaining heuristic labels. Figure 2.6 shows the learning curve for the self-training experiment, along with the growth of the self-training set.

[Figure 2.6 plot. x-axis: Training Round (0 to 10). y-axes: F-Score (0.46 to 0.64) for the Performance curve; Sentences in thousands (0 to 60) for the Training Size curve.]

Figure 2.6: Learning and training set size curves for self-training.
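The quality-controlled self-training loop can be sketched as follows. To keep the example self-contained, a toy nearest-centroid classifier over a single feature stands in for the SVM; that substitution, the toy data, and all names are assumptions made for illustration only.

```python
def train_centroids(examples):
    """Toy stand-in for SVM training: the mean feature value of each class."""
    sums, counts = {}, {}
    for x, y in examples:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}


def predict(model, x):
    """Assign x to the class with the nearest centroid."""
    return min(model, key=lambda y: abs(x - model[y]))


def quality_controlled_self_training(seed, weakly_labeled):
    """Grow the training set with weakly-labeled items whose heuristic label
    agrees with the current model's prediction; stop when none agree."""
    train = list(seed)
    remaining = list(weakly_labeled)  # (feature, heuristic_label) pairs
    rounds = 0
    while remaining:
        model = train_centroids(train)
        agreed, disagreed = [], []
        for x, heuristic_label in remaining:
            target = agreed if predict(model, x) == heuristic_label else disagreed
            target.append((x, heuristic_label))
        if not agreed:
            break  # no prediction matches any remaining heuristic label
        train.extend(agreed)
        remaining = disagreed
        rounds += 1
    return train_centroids(train), train, remaining, rounds


seed = [(0.0, 0), (1.0, 1)]  # hand-labeled seed set
weak = [(-0.4, 0), (-0.3, 0), (0.9, 1), (0.45, 1), (0.95, 0)]  # heuristic labels
model, train, remaining, rounds = quality_controlled_self_training(seed, weak)
```

In this toy run the item (0.45, 1) is rejected in the first round but accepted in the second, once the centroids have shifted, while (0.95, 0) is never accepted because its heuristic label keeps disagreeing with the model.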

The results of the self-training experiment are shown in Table 2.5. We achieved the best performance, an F-measure of 0.635, after 9 rounds of self-training; self-training terminated after 10 rounds, but the 10th round had no effect on model performance.

| Trial | Precision | Recall | F-Score |
|---|---|---|---|
| Random baseline | 0.185 | 0.586 | 0.281 |
| Seed training only* | 0.374 | 0.617 | 0.466 |
| Distant supervision* | 0.398 | 0.745 | 0.519 |
| Self-training* | 0.478 | 0.946 | 0.635 |

Table 2.5: Self-training experiment results (* indicates significant improvement over the baseline, p < 0.01).

The initial model, trained only on the seed set, performed nearly as well as our distant supervision experiment. This illustrates that sheer quantity of data may not outweigh the value of accurate manual labels on a small dataset. As described in Section 2.2, the distant supervision labels were based on a linear combination of three heuristics that achieved at best an RMSE of 5.1 sentences. However, with quality-controlled self-training, we were better able to exploit the noisy heuristic labels by using only those that agreed with the seed-trained model, thus reducing the amount of noise. 52,147 of the 67,954 total heuristically-labeled sentences were used in our quality-controlled self-training experiment – the remaining 15,807, roughly 23% of our heuristic labels, were too noisy to use.
