Synthetic Training Corpora - Automatic Structured Text Summarization with Concept Maps

In the following sections, we discuss several solutions to these issues, including methods to generate synthetic training data, pre-summarizing inputs, linearizing target graphs and using our novel sequence-to-graph architecture. We study the effectiveness of these ideas in a range of experiments.

7.2 Synthetic Training Corpora

To successfully train neural networks to perform complex structured prediction tasks, a large number of training examples is necessary. For neural SDS, existing work relies on the 460k pairs of news articles and summaries in the Annotated New York Times corpus (Sandhaus, 2008), the 290k article-summary pairs of the CNN/DailyMail corpus (Hermann et al., 2015) or the 10 million article-headline pairs of the Gigaword corpus (Napoles et al., 2012). Very recently, an additional corpus with 1.3 million pairs of news articles and summaries from various sources has been published by Grusky et al. (2018). As we already mentioned in Section 2.3.2, no datasets of these sizes exist for MDS. Similarly, for our CM-MDS task, we also lack suitable training data for neural models, as Educ and Wiki have only 30 and 38 examples, respectively, and Biology only 183. Note that while the corpus creation method presented in Section 4.3.2 has been specifically designed to make the annotation scale to the size of the benchmark corpus, the number of pairs required for neural models is far beyond its scalability: with an average crowdsourcing cost of $150 per document set, creating a dataset of 100k examples with the approach would cost at least 15 million dollars.
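The cost estimate above is a simple product of the two figures quoted in the text. A minimal sketch of that arithmetic follows; the function and constant names are illustrative and do not come from the original work.

```python
# Back-of-the-envelope annotation cost, using the figures quoted in the text.
COST_PER_DOCUMENT_SET_USD = 150   # average crowdsourcing cost per document set
TARGET_EXAMPLES = 100_000         # dataset size considered for neural training


def annotation_cost(num_examples: int,
                    cost_per_example: int = COST_PER_DOCUMENT_SET_USD) -> int:
    """Total cost if every example requires one manually annotated document set."""
    return num_examples * cost_per_example


total = annotation_cost(TARGET_EXAMPLES)
print(f"${total:,}")  # $15,000,000
```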

To be able to carry out experiments with neural models despite the lack of training data, we pursue two directions to obtain synthetic training examples automatically:

Extending CM-MDS Corpora The idea of our first data generation approach is to use the concept maps of the training part of Educ and pair them with alternative inputs to obtain additional pairs. As we discussed during mention extraction (see Chapter 5), there are many different ways to express the same concepts and relations, so multiple alternative inputs can lead to the same summary concept map.

Specifically, we process each of the 15 training topics and create additional examples. Given the summary concept map with concepts 𝐶 and relations 𝑅, we first collect alternative expressions for them in addition to their labels. Towards that end, we use data collected during the corpus creation (see Section 4.3.2) in which alternative mentions of concepts have been merged and further rely on PPDB 2.0 (Pavlick et al., 2015), a database of paraphrases. Looking up our concepts and relations in it, we can find alternatives such as "manifestations" for "symptoms" or "is not necessary" for "is not required". As a result of this step, we have concepts 𝐶 and relations 𝑅 with a set of alternative referring expressions for each of them. In the second step, we then build all possible triples (𝑐₁, 𝑟, 𝑐₂) out of 𝐶 and 𝑅, sample a triple of expressions (𝑚_{𝑐₁}, 𝑜_{𝑟}, 𝑚_{𝑐₂}) from their corresponding expression sets and search in a recent English Wikipedia dump for sentences mentioning all three parts. Given such triples and retrieved sentences with corresponding mentions, we then build concept maps in the final step. We sample out of all triples for which sentences were found until the sampled subset forms a graph of more than 20 nodes and concatenate the corresponding sentences to form the source text. That process yields a single new pair of input text and output graph, which we repeat until we run out of triples.

Chapter 7. End-to-End Modeling Approaches

Dataset   Pairs     Concept Map                                            Source
                    Concepts     Tokens      Relations    Tokens           Tokens
Educ      30        25.0 ± 0.0   3.2 ± 0.5   25.2 ± 1.3   3.2 ± 0.5        97880.0 ± 50086.2
EducSyn   49,950    21.2 ± 2.3   2.8 ± 2.0   20.6 ± 2.3   2.1 ± 1.5        623.3 ± 83.6
DMSyn     209,525   25.0 ± 0.2   2.9 ± 2.0   20.6 ± 0.9   3.3 ± 2.8        302.1 ± 35.3

Table 7.1: Comparison of Educ with the two synthetic training datasets.

With this approach, we obtained the EducSyn dataset of roughly 50k new pairs. All of these pairs have concept maps similar to the original 15 maps from Educ, but use different mentions, different subsets of propositions and different sentences on the input side.
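The second and third steps of the EducSyn procedure can be sketched roughly as follows. This is a simplified illustration, not the implementation used for the thesis: the function names are hypothetical, the Wikipedia search is replaced by a naive substring match over an in-memory list of sentences, mention order within a sentence is not checked, and the sketch does not verify that the sampled graph is connected. The alternative expressions (from the merged mentions and PPDB 2.0) are assumed to be given as plain dictionaries.

```python
import itertools
import random


def find_sentence(mentions, sentences):
    """Return the first sentence containing all mentions (naive substring match)."""
    for sentence in sentences:
        lowered = sentence.lower()
        if all(m.lower() in lowered for m in mentions):
            return sentence
    return None


def ground_triples(concepts, relations, sentences, rng):
    """Step 2: for every (c1, r, c2), sample one expression per part and
    keep the triple if some sentence mentions all three expressions.

    concepts / relations map an identifier to its list of alternative expressions.
    """
    grounded = []
    for c1, c2 in itertools.permutations(concepts, 2):
        for r in relations:
            mentions = (rng.choice(concepts[c1]),
                        rng.choice(relations[r]),
                        rng.choice(concepts[c2]))
            sentence = find_sentence(mentions, sentences)
            if sentence is not None:
                grounded.append(((c1, r, c2), sentence))
    return grounded


def build_pairs(grounded, rng, min_nodes=20):
    """Step 3: sample grounded triples without replacement until the graph has
    more than min_nodes concepts, then emit one (source text, triples) pair.
    Repeats until the pool is exhausted; an incomplete final graph is dropped."""
    pool = list(grounded)
    rng.shuffle(pool)
    pairs = []
    while pool:
        triples, texts, nodes = [], [], set()
        while pool and len(nodes) <= min_nodes:
            (c1, r, c2), sentence = pool.pop()
            triples.append((c1, r, c2))
            texts.append(sentence)
            nodes.update((c1, c2))
        if len(nodes) > min_nodes:
            pairs.append((" ".join(texts), triples))
    return pairs
```

Calling `ground_triples` with the expression dictionaries of one training topic and then `build_pairs` on its result would produce the synthetic pairs for that topic. A `random.Random` instance is passed explicitly so that the sampling can be made reproducible.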

Pipeline-based Corpus Creation As the second strategy, we use a corpus of plain documents and apply existing CM-MDS machinery (the pipeline introduced in Section 6.5) to automatically create corresponding concept maps. While this obviously yields noisy data, as the pipeline cannot handle CM-MDS perfectly, similar approaches that combine large noisy datasets with smaller high-quality datasets have previously been used successfully to bootstrap models for other tasks, e.g. by Konstas et al. (2017) for AMR parsing.

We use the DailyMail part of the corpus by Hermann et al. (2015) and process the news articles with our pipeline to extract triples of concepts and relations. We do not use the summaries the corpus provides. We then randomly sample subsets of these triples and combine them into graphs of 25 concepts. These graphs are paired with the sentences from which the triples were extracted. Note that this process does not make use of the importance scoring and subgraph selection steps of the pipeline, since all concepts and relations are used to create new pairs. As a result, we obtained DMSyn, a dataset of almost 210k pairs of input texts and corresponding concept maps.
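The sampling step for DMSyn can be illustrated with the following sketch. It assumes the pipeline output is already available as a list of `(concept, relation, concept, sentence)` tuples; the function name and the tuple layout are hypothetical. How triples that would push a graph beyond 25 concepts are handled is not stated in the text, so skipping them, as done here, is an assumption.

```python
import random


def build_dmsyn_pairs(extracted, rng, num_concepts=25):
    """Combine randomly ordered extracted triples into graphs with exactly
    num_concepts concepts and pair each graph with its source sentences.

    extracted: list of (c1, r, c2, sentence) tuples produced by an extraction
    pipeline (assumed input format).
    """
    pool = list(extracted)
    rng.shuffle(pool)
    pairs = []
    triples, sentences, concepts = [], [], set()
    for c1, r, c2, sentence in pool:
        new_concepts = {c1, c2} - concepts
        if len(concepts) + len(new_concepts) > num_concepts:
            continue  # assumption: skip triples that would overshoot the target size
        concepts |= new_concepts
        triples.append((c1, r, c2))
        if sentence not in sentences:
            sentences.append(sentence)  # each source sentence is used once per pair
        if len(concepts) == num_concepts:
            pairs.append((" ".join(sentences), triples))
            triples, sentences, concepts = [], [], set()
    return pairs
```

No importance scoring or subgraph selection appears in this sketch, mirroring the description above: every extracted triple is a candidate for some synthetic graph.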

Table 7.1 compares the two synthetic datasets with Educ. Note that they are orders of magnitude larger, with sizes comparable to the aforementioned corpora used to train neural SDS models. The length of the input, on the other hand, is much smaller for the synthetic datasets, which is part of our approach to handle the large input problem. More details on that will follow. The main difference between the two synthetic datasets is that DMSyn
