4.2
Automatic Corpus Creation
To create benchmark corpora automatically, we use two different techniques. First, we take a collection of manually created concept maps and match them with corresponding documents, resulting in the Biology dataset. Second, we use documents with existing concept annotations and extend them to create the Wiki and ACL datasets.
4.2.1
Using Existing Concept Maps
As the starting point for Biology, we use a collection of 464 concept maps (footnote 25) that were manually created by experts as teaching materials for biology. They were previously used by Olney et al. (2011) to evaluate their concept map mining technique. Since the maps were created independently of any specific text, we have to add appropriate source documents to obtain the desired pairs. We automatically match each map with a corresponding Wikipedia article. Every map is a star-like graph centered around a central concept, e.g. protein, and hence has a topical focus similar to that of an encyclopedic article. We manually correct wrong assignments and make sure that no article is assigned twice.
To ensure that the concept maps indeed cover the same content as the matched article, we apply several additional measures. We prune all concepts from the maps that are not mentioned in the article and completely ignore maps that have fewer than 4 concepts after this step. This reduces the average number of concepts per map from 8.8 to 7.0. On the text side, we remove all sections from the articles that do not mention a concept other than the central concept of the map. Thereby, we remove aspects from the articles that are not covered by the maps, yet, by working on the section level, we still maintain coherent text. Unfortunately, a similar approach is not possible to ensure extractiveness of the relation labels, as that would have left us with almost no data. Instead, we manually relabel the relations with mentions that occur together with their concepts in a sentence of the article. Out of 1083 relations in the remaining maps, we assigned a new label to 618 (57%) relations, but could not find appropriate mentions for the remaining 465 (43%).
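The pruning steps described above can be sketched roughly as follows. This is a minimal illustration, not the implementation used for the dataset: the `ConceptMap` class, the `mentions` helper (a naive case-insensitive substring test) and the decision to drop relations whose concepts were pruned are all assumptions made for the example. Only the threshold of 4 concepts and the section-level filtering come from the text.

```python
from dataclasses import dataclass


@dataclass
class ConceptMap:
    # Hypothetical container; the original data format is not specified.
    central: str
    concepts: list
    relations: list  # (source concept, label, target concept)


def mentions(text, concept):
    """Naive case-insensitive substring match (an assumption, not the thesis' matcher)."""
    return concept.lower() in text.lower()


def prune_pair(cmap, sections, min_concepts=4):
    """Prune a map against its article; return None if the map gets too small."""
    article = " ".join(sections)
    # Keep only concepts that are mentioned somewhere in the article.
    kept = [c for c in cmap.concepts if mentions(article, c)]
    if len(kept) < min_concepts:
        return None
    kept_set = set(kept)
    # Assumption: relations touching a pruned concept are dropped as well.
    relations = [r for r in cmap.relations if r[0] in kept_set and r[2] in kept_set]
    # Keep only sections that mention a concept other than the central one.
    others = [c for c in kept if c != cmap.central]
    kept_sections = [s for s in sections if any(mentions(s, c) for c in others)]
    return ConceptMap(cmap.central, kept, relations), kept_sections
```

Working on whole sections rather than sentences mirrors the motivation given in the text: the remaining article stays coherent even though uncovered aspects are removed.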
The resulting dataset has 183 pairs of a concept map and a Wikipedia article. Figure 4.1 shows one of the maps, which has been paired with the article Atom. Note that in this dataset, maps are centered around a focus concept and concept labels are rather short. The example also shows that relations are sometimes not expressed in the most natural way, which is a consequence of ensuring that all labels are mentions in the paired article.
25 Available at http://web.archive.org/web/20120106232123/http://www.biologylessons.sdsu.edu/ta/toc.html
Chapter 4. Creation of Benchmark Corpora
[Figure 4.1 graph, layout lost in extraction. Concepts: atom, molecule, element, matter, isotope, neutron, electron, ion, atomic number, proton. Relation labels: form, belongs to, is constituent unit of, are in, is composed of, is called, has, has.]
Figure 4.1: Summary concept map from Biology on the topic “atom”.
[Concept map figure, layout lost in extraction. Concepts: Quebec agreement, Winston Churchill, University of Birmingham, Rudolf Peierls, Tube Alloys, MAUD report, atomic bomb, Klaus Fuchs, World War II, uranium 235, uranium, Manhattan Project, Franklin Roosevelt, Soviet, critical mass, Otto Frisch. Relation labels include: signed by, worked at, galvanized their American counterparts with …, assessed the chances of developing, emphatically concluded that, had worked closely on, was a joint effort to build, fled to England to work on, was made by.]
4.2.2
Using Existing Concept Annotations
As the starting point for Wiki, we use a recently published MDS corpus created by Zopf et al. (2016b). They took the introductory sections of featured Wikipedia articles, which tend to be good summaries of the topic due to the Wikipedia guidelines for featured articles, and matched them with web pages that described different aspects covered by the summary in detail. Their corpus consists of 91 pairs of documents D and a textual summary S.
For our corpus, we make use of the fact that these summaries S, being Wikipedia articles, contain many links to other Wikipedia pages. For each S, we create a set of concepts C by collecting the names of the linked articles and the main article itself. Since they are linked in the summary, they tend to be important concepts for the topic. Further, we run an existing OIE system (footnote 26) over the source documents D to extract binary propositions. To construct a summary concept map G for a document set D, we then iterate over all extractions and identify those that mention a concept out of C in both of their arguments. We apply a range of rules to filter out spurious matches, e.g. concept mentions that are just a small part of a very long argument or extractions containing unresolved pronouns. If an extraction passes all tests, it is added to a set R, forming the concept map G = (C, R).
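The matching of extractions against the concept set can be sketched as below. The actual filtering rules are not spelled out in the text, so everything concrete here is an assumption chosen for illustration: the `min_coverage` threshold, the small `PRONOUNS` list, whitespace tokenization, and the exclusion of self-loops. Extractions are assumed to arrive as `(argument 1, predicate, argument 2)` string triples.

```python
# Hypothetical pronoun list standing in for "unresolved pronouns".
PRONOUNS = {"he", "she", "it", "they", "him", "her", "them", "this", "these", "those"}


def find_concept(argument, concepts, min_coverage=0.5):
    """Return a concept mentioned in the argument, or None.

    A mention is rejected if it covers less than `min_coverage` of the
    argument's tokens (assumed stand-in for the "small part of a very
    long argument" rule).
    """
    arg_tokens = argument.lower().split()
    for concept in sorted(concepts, key=len, reverse=True):
        c_tokens = concept.lower().split()
        n = len(c_tokens)
        found = any(
            arg_tokens[i:i + n] == c_tokens
            for i in range(len(arg_tokens) - n + 1)
        )
        if found and n / len(arg_tokens) >= min_coverage:
            return concept
    return None


def build_relations(extractions, concepts):
    """Collect relations whose two arguments both mention a known concept."""
    relations = set()
    for arg1, predicate, arg2 in extractions:
        tokens = set((arg1 + " " + arg2).lower().split())
        if tokens & PRONOUNS:
            continue  # skip extractions with (possibly unresolved) pronouns
        source = find_concept(arg1, concepts)
        target = find_concept(arg2, concepts)
        # Assumption: a relation must connect two different concepts.
        if source and target and source != target:
            relations.add((source, predicate, target))
    return relations
```

For instance, an extraction such as `("The Manhattan Project", "was a joint effort to build", "the atomic bomb")` would yield one relation if both "Manhattan Project" and "atomic bomb" are in the concept set, whereas an extraction starting with "It" would be discarded.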
Since the map G has been created based on the set C derived from S, it covers similar content as the summary and is thus an adequate summary for D. To ensure that it is also connected, as required by Definition 5, we reduce the obtained graph to its biggest connected component. Finally, we remove all pairs where the resulting concept map has fewer than 7 concepts. After these steps, Wiki consists of 38 pairs of documents and a summary concept map. One of them is shown in Figure 4.2.
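Reducing a map to its biggest connected component and applying the size threshold could look like the following sketch. It treats relations as undirected edges when testing connectivity, which is an assumption (Definition 5 is not reproduced here), and the function names are hypothetical. Only the threshold of 7 concepts is taken from the text.

```python
from collections import defaultdict, deque


def largest_component(concepts, relations):
    """Restrict a concept map to its largest connected component.

    `relations` holds (source, label, target) triples; direction is
    ignored for connectivity (an assumption).
    """
    neighbours = defaultdict(set)
    for source, _label, target in relations:
        neighbours[source].add(target)
        neighbours[target].add(source)

    seen, best = set(), set()
    for start in sorted(concepts):
        if start in seen:
            continue
        # Breadth-first search from an unvisited concept.
        component, queue = {start}, deque([start])
        while queue:
            node = queue.popleft()
            for nxt in neighbours[node] - component:
                component.add(nxt)
                queue.append(nxt)
        seen |= component
        if len(component) > len(best):
            best = component

    kept = {r for r in relations if r[0] in best and r[2] in best}
    return best, kept


def keep_pair(concepts, relations, min_concepts=7):
    """Return the reduced map, or None if it has fewer than `min_concepts` concepts."""
    component, kept = largest_component(concepts, relations)
    return (component, kept) if len(component) >= min_concepts else None
```

Concepts that take part in no relation form components of size one, so they are dropped whenever any relation exists.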
For the third dataset, ACL, we use the ACL RD-TEC 2.0 corpus (QasemiZadeh and Schumann, 2016). It consists of 300 abstracts taken from papers in the ACL Anthology in which two annotators marked concepts. As abstracts are good summaries of a paper, these concepts tend to be the central concepts discussed in the papers. We use Apache Tika (footnote 27) to extract the full texts, excluding the abstracts, from the PDF version of the corresponding papers. We filter out papers where the extraction fails. These texts are then paired with the annotated concepts as the gold concepts. We obtain 255 pairs. Note that we cannot use this corpus to evaluate relation extraction, as such annotations are not available.
4.2.3
Comparison and Limitations
All three datasets, compared in Table 4.2, could be created mostly automatically with minimal manual effort, circumventing the challenges of manual annotation that were discussed at the beginning of this chapter. Moreover, compared to most of the existing datasets presented in
26 OpenIE4 (Mausam, 2016), a state-of-the-art system according to Stanovsky and Dagan (2016b).
27 https://tika.apache.org/
Dataset   Pairs   Concept Map                                          Source
                  Concepts     Tokens      Relations    Tokens      Documents    Tokens
Biology   183     6.9 ± 4.0    1.2 ± 0.4   3.5 ± 3.0    1.9 ± 1.2   1.0 ± 0.0    2620.9
Wiki      38      11.3 ± 5.2   1.9 ± 0.4   13.8 ± 8.4   5.0 ± 1.2   14.6 ± 3.1   27065.6
ACL       255     10.9 ± 5.5   1.9 ± 0.9   –            –           1.0 ± 0.0    4987.5
Table 4.2: Corpus statistics for automatically created benchmark corpora. All values are averages over pairs with their standard deviation indicated by ±. ACL does not contain relations. (footnote 28)
Table 4.1, all three are substantially bigger. But Table 4.2 also reveals that the datasets do not yet satisfy all requirements.
Both Biology and ACL only provide summaries for single documents, whereas we want summaries for document sets. In addition, ACL does not have real concept maps, but only concepts, and thus can only be used to evaluate concept mention extraction, not the full task. Biology, while having relations, provides only very small concept maps with especially few relations. Since a connected graph with n concepts needs at least n - 1 relations, the average number of relations (3.5) and concepts (6.9) per map also shows that the graphs are typically disconnected.
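The connectivity argument rests on a simple counting bound: every relation can merge at most two components, so a graph with n concepts and m relations has at least n - m connected components and can only be connected if m >= n - 1. The short sketch below makes this explicit. The function names are hypothetical, and applying the bound to the averages from Table 4.2 is only indicative, since the bound strictly holds per map.

```python
def min_components(num_concepts, num_relations):
    """Lower bound on the number of connected components (for num_concepts >= 1)."""
    return max(1, num_concepts - num_relations)


def can_be_connected(num_concepts, num_relations):
    """Necessary (not sufficient) condition for a connected graph."""
    return num_relations >= num_concepts - 1


# Averages from Table 4.2, used here only as a rough indication.
biology_ok = can_be_connected(6.9, 3.5)
wiki_ok = can_be_connected(11.3, 13.8)
```

With the Biology averages the condition fails (3.5 < 5.9), whereas the Wiki averages satisfy it, in line with Wiki maps having been reduced to a single connected component.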
Wiki comes closest to our requirements because it provides multi-document summaries in the form of bigger, connected concept maps. Its main weakness is the relations, which have been obtained fully automatically. Since no annotator was involved, we have no guarantee that they express relationships in the same way a human would. The large size of their labels compared to Biology (5.0 vs. 1.9 tokens) indicates that they follow a different style. The example in Figure 4.2 also reveals that some relations are rather complex clauses. During evaluations, this dataset might also unfairly favor CM-MDS approaches that use similar OIE-based techniques for relation extraction.
In light of these limitations, we explore other techniques to create higher-quality datasets with reasonable effort in the next part of this chapter. That being said, we want to emphasize that the automatically created datasets can still be of use in experiments where their limitations are less relevant or where they are taken into account when interpreting quantitative results. In this thesis, we will use the Biology and ACL datasets to evaluate concept and relation extraction approaches in Section 5.2 and the Wiki dataset as a second corpus to evaluate pipelines for the full task in Chapter 6.
28 The values reported here differ slightly from those in (Falke and Gurevych, 2017c, Table 1) where statistics