3.5 Evaluation
3.5.2 Proposed Evaluation Methods for CM-MDS
Given the advantages and limitations of each of the evaluation types, we propose to also use a combination of all of them for CM-MDS. In fact, examples for all three categories have previously been used to assess concept map mining techniques. Villalon (2012) and Aguiar et al. (2016) use automatic comparisons against references (type 1), Kowata et al. (2010), Qasim et al. (2013) and Zubrinic et al. (2015) let experts judge the quality of concept maps (type 2) and Rajaraman and Tan (2002) and Valerio et al. (2012) evaluate them in task-based user studies (type 3). The main problem limiting the comparability of this existing work is that the exact procedures and data differ from work to work.
We hope to reduce that problem in the future by proposing standardized evaluation procedures for CM-MDS. In the following, we outline these procedures for evaluations of type (1) and (2). When results of these evaluations are published, subsequent work can simply run the same evaluations following the description here to compare their approach. For evaluations of type (3), comparisons can typically only be made within one experiment, and we therefore do not give specific recommendations here.
3.5.2.1 Automatic Metrics
In Section 2.2.2, we mentioned the study of McClure et al. (1999), which analyzed different manual scoring methods for concept maps in the context of student testing. They compared
holistically scoring a map, scoring its structure and scoring each proposition independently, all with a master concept map as the reference. The last method, scoring each proposition independently, was found to be most reliable and showed the highest correlation with the true assessment of the student maps (obtained through a more laborious manual analysis). Inspired by these findings, we also designed proposition-level metrics to score concept maps.
Given input documents $D$, a reference summary concept map $G_R = (C_R, R_R)$ and a system-generated summary concept map $G_S = (C_S, R_S)$, we create sets of propositions
$$P_x = \{\, l(\mathit{source}(r_i)) \;\, l(r_i) \;\, l(\mathit{target}(r_i)) \mid r_i \in R_x \,\} \tag{3.3}$$
where $x$ indicates either the reference or system map and the functions $\mathit{source}$ and $\mathit{target}$ assign a relation to its source and target concepts. To score a generated concept map, we then calculate the overlap between the sets $P_S$ and $P_R$.
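The construction of the proposition sets can be illustrated with a minimal Python sketch. It assumes a simplified data model that is not part of the thesis: each relation directly stores the label strings of its source concept, of the relation itself and of its target concept. The class `Relation`, the function `proposition_set` and the example maps are hypothetical names and data chosen for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    """Hypothetical, simplified relation: stores labels instead of concept objects."""
    source: str  # label of the source concept
    label: str   # label of the relation itself
    target: str  # label of the target concept


def proposition_set(relations):
    """Build one 'source-label relation-label target-label' string per relation."""
    return {f"{r.source} {r.label} {r.target}" for r in relations}


# Toy reference and system maps (invented example data).
reference_relations = [
    Relation("concept maps", "are used for", "summarization"),
    Relation("summarization", "supports", "exploratory search"),
]
system_relations = [
    Relation("concept maps", "are used for", "summarization"),
    Relation("concept maps", "consist of", "propositions"),
]

P_R = proposition_set(reference_relations)
P_S = proposition_set(system_relations)

# Exact string overlap is the crudest possible comparison of the two sets;
# the metrics in this section use softer matching instead.
exact_overlap = P_R & P_S
```

Exact set intersection is shown only to make the notion of overlap concrete; it would give no credit to paraphrased propositions.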
For evaluation purposes, we distinguish between two different task settings: one where the size of the summary concept map is restricted by both $\mathcal{L}_C$ and $\mathcal{L}_R$ and a second where only $\mathcal{L}_C$ is given. In the first setting, similar to the traditional MDS setting, the summary size is bounded and most systems will try to fully use that budget. It is therefore sufficient to compute only recall when comparing $P_S$ and $P_R$. In the second setting, while the number of concepts is still constrained, the number of relations and thus the size of $P_S$ can differ. By reporting both precision and recall, a fair comparison between systems can be made in this setting. In fact, we think that the second setting is more interesting, as it gives models the flexibility to either focus on high precision or high recall.²⁰
METEOR Our first metric is based on METEOR (Denkowski and Lavie, 2014), which has the advantage that it takes synonyms and paraphrases into account²¹ and does not rely solely on lexical matches (as is the case for ROUGE). For each pair of propositions $p_s \in P_S$ and $p_r \in P_R$ we use METEOR²² to calculate a match score $\mathit{meteor}(p_s, p_r) \in [0, 1]$. Then, precision and recall per map are computed as:
$$\mathit{Pr} = \frac{1}{|P_S|} \sum_{p \in P_S} \max\{\mathit{meteor}(p, p_r) \mid p_r \in P_R\} \tag{3.4}$$
$$\mathit{Re} = \frac{1}{|P_R|} \sum_{p \in P_R} \max\{\mathit{meteor}(p, p_s) \mid p_s \in P_S\} \tag{3.5}$$
²⁰ Note that the size restrictions $\mathcal{L}_C$ and $\mathcal{L}_R$ only limit the size of the graph, but not the size of its content. A possible adversarial attack against the metric would be to create a concept map with very long labels, in the extreme case containing all of $D$. The recall would be 100%. Obviously, such a concept map is not a summary and the labels would not be concise. It illustrates that automatic evaluation metrics typically have limitations. A comprehensive evaluation should therefore also rely on manual inspections. Note also that this attack does not work in the second setting, where the high recall would come with low precision.
²¹ As determined using WordNet (Miller et al., 1990) and PPDB (Pavlick et al., 2015).
²² We use METEOR version 1.5 with default settings.
Chapter 3. Structured Summarization with Concept Maps
The F1-score is the equally weighted harmonic mean of precision and recall. Scores per map are macro-averaged over all instances of a dataset.
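The following Python sketch shows how per-map precision, recall and F1 and their macro-average could be computed from two proposition sets. It is only an illustration under stated assumptions: the pairwise match function is pluggable, and the default `token_overlap` is a simple Jaccard stand-in, not METEOR (which is a separate tool). The function names and the convention of returning zeros for an empty map are assumptions made here, not taken from the thesis.

```python
def token_overlap(p, q):
    """Stand-in match score in [0, 1]: Jaccard overlap of lowercased tokens.

    This is NOT METEOR; it only plays the role of the pairwise match function.
    """
    a, b = set(p.lower().split()), set(q.lower().split())
    return len(a & b) / len(a | b) if (a | b) else 0.0


def map_scores(P_S, P_R, match=token_overlap):
    """Precision, recall and F1 for one system map against one reference map."""
    if not P_S or not P_R:
        # Assumed convention for empty maps; the text does not specify one.
        return 0.0, 0.0, 0.0
    # Precision: each system proposition is credited with its best reference match.
    precision = sum(max(match(p, r) for r in P_R) for p in P_S) / len(P_S)
    # Recall: each reference proposition is credited with its best system match.
    recall = sum(max(match(p, s) for s in P_S) for p in P_R) / len(P_R)
    total = precision + recall
    f1 = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f1


def macro_average(per_map_scores):
    """Average (precision, recall, F1) triples over all maps of a dataset."""
    n = len(per_map_scores)
    return tuple(sum(scores[i] for scores in per_map_scores) / n for i in range(3))


example = map_scores({"a b c"}, {"a b c", "d e f"})
```

Passing a different `match` callable (for instance a wrapper around an actual METEOR scorer) leaves the aggregation logic unchanged.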
ROUGE As a second metric, we compute ROUGE (Lin, 2004). We concatenate all propositions of each map into a single string, $s_S$ and $s_R$ respectively, and separate propositions with a dot to ensure that no bigrams span across propositions and the metric is therefore independent of how propositions are ordered. We run ROUGE 1.5.5²³ with $s_S$ as the peer summary and $s_R$ as a single model summary to compute the ROUGE-2 score, the most commonly used variant of the ROUGE metric family.
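To make the effect of the dot separator concrete, here is a simplified Python sketch of bigram overlap between two concatenated proposition strings. It is not the ROUGE 1.5.5 toolkit: it uses whitespace tokenization, no stemming and no stopword handling, and it treats a standalone dot token as a hard boundary. The function names `concatenate`, `bigram_counts` and `bigram_overlap` are hypothetical.

```python
from collections import Counter


def concatenate(propositions):
    """Join propositions into one string, separated (and terminated) by a dot."""
    return " . ".join(propositions) + " ."


def bigram_counts(text):
    """Count bigrams, never letting a bigram cross a dot token."""
    counts = Counter()
    previous = None
    for token in text.lower().split():
        if token == ".":
            previous = None  # the dot ends the current proposition
            continue
        if previous is not None:
            counts[(previous, token)] += 1
        previous = token
    return counts


def bigram_overlap(peer, model):
    """Bigram precision and recall of the peer string against the model string."""
    peer_counts, model_counts = bigram_counts(peer), bigram_counts(model)
    overlap = sum((peer_counts & model_counts).values())  # clipped matches
    precision = overlap / sum(peer_counts.values()) if peer_counts else 0.0
    recall = overlap / sum(model_counts.values()) if model_counts else 0.0
    return precision, recall


s_S = concatenate(["a b c"])
s_R = concatenate(["a b c", "d e f"])
precision, recall = bigram_overlap(s_S, s_R)
```

Because no bigram crosses a boundary, reordering the propositions before concatenation yields the same bigram counts in this sketch.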
Significance Testing An important step when using automatic metrics is to determine whether a difference in the average scores that two methods A and B achieve on a dataset is meaningful or only due to chance. The smaller the difference is, the more relevant this question becomes. Statistical tests can be used to determine if a difference is in fact significant. Following suggestions for other NLP tasks, we propose to rely on sampling-based tests, which, in contrast to commonly used parametric tests like Student's t-test, do not make any assumptions on the distribution of the evaluation scores or evaluation data (Dror et al., 2018). Since such assumptions do not hold for many NLP metrics, such as precision, recall and F-scores, using parametric tests is problematic (Yeh, 2000, Dror et al., 2018).
We propose to use a permutation or randomization test (Noreen, 1989), as also suggested for machine translation by Riezler and Maxwell (2005). Let A and B be two alternative methods with evaluation scores of $E_A = (a_1, a_2, \cdots, a_n)$ and $E_B = (b_1, b_2, \cdots, b_n)$ over $n$ instances of an evaluation dataset. The average performance difference is then
$$d(E_A, E_B) = \left| \frac{1}{n} \sum_{a_i \in E_A} a_i - \frac{1}{n} \sum_{b_i \in E_B} b_i \right| . \tag{3.6}$$
For the randomization test, we create $k$ samples by swapping evaluation scores between A and B at each position with probability 0.5, yielding $k$ new pairs of score lists such as $E'_A = (b_1, a_2, a_3, \cdots)$ and $E'_B = (a_1, b_2, b_3, \cdots)$. The p-value is then defined as
$$p = \frac{1 + \#\,\text{samples with } d(E'_A, E'_B) \geq d(E_A, E_B)}{1 + k} . \tag{3.7}$$
If $p$ is sufficiently small, we reject the null hypothesis that there is no difference between A and B and conclude that A and B differ significantly. The permutation test, in contrast, checks all $2^n$ possible ways of swapping scores between $E_A$ and $E_B$ rather than just drawing $k$ samples out of them. While being more accurate, it can be prohibitively expensive to compute for large $n$, such that in practice one mostly relies on the approximate randomization test. In this thesis, we make use of the permutation test when possible and otherwise rely on the randomization test. We use thresholds of 0.01 and 0.05 to determine significance, but also report the actual $p$-values for greater transparency.
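Both tests can be sketched in a few lines of Python. The sketch below is an illustration under assumptions, not the implementation used in the thesis: the function names, the parameter name `k` for the number of samples, the fixed random seed and the small tolerance for floating-point ties are choices made here. For the exhaustive permutation test, the p-value is assumed to be the fraction of all swap assignments whose difference is at least the observed one, since the text gives an explicit formula only for the sampled variant.

```python
import itertools
import random

TOLERANCE = 1e-12  # assumed guard against floating-point ties


def mean_difference(E_A, E_B):
    """Absolute difference between the average scores of two methods."""
    n = len(E_A)
    return abs(sum(E_A) / n - sum(E_B) / n)


def swapped(E_A, E_B, swaps):
    """Exchange the paired scores at every position where swaps is True."""
    new_A = [b if s else a for a, b, s in zip(E_A, E_B, swaps)]
    new_B = [a if s else b for a, b, s in zip(E_A, E_B, swaps)]
    return new_A, new_B


def randomization_test(E_A, E_B, k=10000, seed=0):
    """Approximate test: draw k random swap assignments (probability 0.5 each)."""
    rng = random.Random(seed)
    observed = mean_difference(E_A, E_B)
    hits = 0
    for _ in range(k):
        swaps = [rng.random() < 0.5 for _ in E_A]
        if mean_difference(*swapped(E_A, E_B, swaps)) >= observed - TOLERANCE:
            hits += 1
    return (1 + hits) / (1 + k)


def permutation_test(E_A, E_B):
    """Exact test: enumerate all 2^n swap assignments (feasible for small n only)."""
    observed = mean_difference(E_A, E_B)
    n = len(E_A)
    hits = 0
    for swaps in itertools.product((False, True), repeat=n):
        if mean_difference(*swapped(E_A, E_B, swaps)) >= observed - TOLERANCE:
            hits += 1
    return hits / 2 ** n


p_exact = permutation_test([1.0, 1.0, 1.0], [0.0, 0.0, 0.0])
p_approx = randomization_test([1.0, 1.0, 1.0], [0.0, 0.0, 0.0], k=1000)
```

With three instances, only the two assignments that keep or swap all positions reach the observed difference, which shows why very small datasets cannot yield small p-values regardless of how different the methods are.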
3.5.2.2 Manual Quality Dimensions
For manual evaluations of type (2), we experimented with several setups and recommend using pairwise comparisons of different summary concept maps rather than assessing their quality in isolation. Most recent works on MDS, for instance Celikyilmaz et al. (2018), use this pairwise approach, as it makes it easier to discover differences between summaries.
A difficulty that arises is the layout of concept maps. Different ways of presenting the concept map graph can greatly influence a rater's perception of the map's quality, and since computing a layout is a non-trivial issue, it is difficult to automatically create layouts for different maps that are "equally good". As the goal is to evaluate the output of CM-MDS approaches, we do not want the layout quality to influence the evaluation. As a solution, we propose to not present the concept maps in a graphical form at all, but instead as lists of propositions. Such lists can be easily shown side by side and allow a fair comparison of the content. In addition, when showing propositions, even raters unfamiliar with concept maps can easily perform the evaluation, as a proposition is essentially just a short sentence.
Given two summary concept maps, we show their propositions side by side and ask raters to pick their preferred list according to the following dimensions:
Meaning The sentences should be understandable and express meaningful facts.
Grammaticality The sentences should be grammatical, without missing or unrelated fragments or other grammar-related problems that make them difficult to understand.
Non-Redundancy There should be no unnecessary repetition within the list, neither of whole sentences nor of partial facts.
Focus The sentences should be focused; sentences should only contain information that is related to the given topic description.
The dimensions are inspired by the (more fine-grained) list used during the DUC competitions (Dang, 2005). The first two dimensions assess the quality of the extraction subtasks by checking whether concept and relation labels form meaningful and grammatical propositions. The third dimension focuses on non-redundancy, assessing whether the grouping subtasks were handled successfully. Finally, the fourth dimension evaluates the summarization aspect, i.e. which content has been selected for the summary concept map.
We note that the automatic evaluation described before mainly focuses on whether a summary contains the same content as a reference summary, whereas the procedure described here captures more aspects of summary quality. However, the evaluation of content selection is more limited in this setup: Since raters have no access to the full source documents or a reference summary, but see only a topic description, they can only assess how much of a map's content is relevant, but do not know if there is other more relevant content. Nevertheless, we believe that the two evaluation techniques complement each other well, so that using them in combination compensates for the shortcomings of each.
3.6 Chapter Summary
In this chapter, we introduced the central problem studied in this thesis. Based on the review of user requirements in the previous chapter, we argued that concept maps are a useful text representation to support users during exploratory search. Given the limited amount of existing work on extracting concept maps from text, more research in this direction is necessary. As a result, we proposed concept map–based multi-document summarization (CM-MDS), a reformulation of concept map mining with a focus on summarization.
We formally defined the different subtasks of CM-MDS, which are mention extraction, grouping and labeling for concepts and relations followed by importance estimation and concept map construction. Moreover, we presented a comprehensive example and illustrated the challenges for computational models for each subtask. For extraction and grouping, these include handling the variety of ways in which mentions can be expressed, finding extraction approaches that generalize to many text types and that successfully trade off precision and recall, determining appropriate concept granularities, ensuring assertedness, scaling to large inputs and finding concise labels. Further challenges are potentially large numbers of extracted concepts and relations, external and user-specific factors influencing importance, the combinatorial complexity of the subgraph selection, in particular due to its connectedness requirement, and resolving contradictions between propositions.
In addition, we discussed relations to existing work, pointing out that the task is essentially a combination of text summarization with information extraction. The task is also related to work on knowledge graphs, but clear differences exist. In order to evaluate computational models for the task, we proposed two automatic metrics based on METEOR and ROUGE. To account for the limitations of automatic methods and to better cover aspects beyond content selection, we further suggest to also manually compare different automatically created concept maps against each other with regard to their meaningfulness, grammaticality, non-redundancy and focus and to ultimately also evaluate the usefulness of the created maps in downstream applications by conducting user studies.
Chapter 4
Creation of Benchmark Corpora
In this chapter, we will look at the data that is needed to train and evaluate computational methods for CM-MDS. We discuss requirements for suitable corpora and show that corresponding data does not yet exist. Therefore, we describe two different strategies to obtain data (automatically extending partial annotations, and creating annotations from scratch with scalable methods) and the corpora that we obtained using these strategies.
4.1 Motivation and Challenges
In order to evaluate methods for CM-MDS with the automatic methods discussed in the previous chapter, datasets with appropriate annotations are necessary. As we argued before, being able to run such automatic evaluations is not only important to compare to other work, but also to guide the development of new methods. In addition, such data is also essential to develop any kind of supervised model for the task or to tune hyper-parameters of supervised and unsupervised models.
Specifically, we need datasets of pairs $(D, G)$ where $D$ is a collection of documents and $G$ is a high-quality reference concept map that summarizes $D$ according to Definition 4 and Definition 5. The pairs should satisfy the following requirements:
• $G$ should be a valid concept map and an appropriate summary of $D$.
• Using the information in $D$, it should be possible to create a map close to $G$.
• The dataset should have enough pairs to observe statistically significant differences between evaluated approaches.
• Ideally, to deal with the inherent subjectiveness of summarization, $G$ should be a consensus of several annotators, or multiple alternative maps $G$ should be used.

Table 4.1 shows the datasets used in existing work on concept map mining that made use of reference annotations. Unfortunately, none of these datasets meets all of our requirements. In fact, most of them violate several: Some are of very small size (Qasim et al., 2013, Zouaq et al., 2011, Aguiar et al., 2016), some provide only concept annotations but no relations (Valerio and Leake, 2006, Qasim et al., 2013) and some contain single-document rather than multi-document summaries (Valerio and Leake, 2006, Aguiar et al., 2016, Villalon, 2012). In addition, the corpus of Villalon (2012), which comes closest to our requirements, is not available for other researchers to use.²⁴ Thus, to enable automatic evaluations for work on CM-MDS, we have to create new corpora.

                                   Per Pair
Author                     Pairs   Maps   Concepts   Relations   Docs   Tokens
Valerio and Leake (2006)      80      1          1           –      1     8821
Qasim et al. (2013)            1      1        370           –     65        ?
Zouaq et al. (2011)            1      1       1286        1797     36    36008
Aguiar et al. (2016)           1     10          ?           ?      1      617
Villalon (2012)               42      1         21          12      1      468

Table 4.1: Datasets with reference annotations used for automatic evaluations of concept map mining. Values in the last five columns are averages over all pairs, ? indicates values not reported.
The manual creation of such a dataset is very time-consuming, as the annotation includes many laborious subtasks. Essentially, an annotator would need to perform exactly the subtasks outlined in Section 3.3 to create a reference concept map for a set of source documents. For a full dataset with many pairs, the corresponding effort would be huge. An additional challenge lies in the nature of summarization: Summaries should be much smaller than their source documents and contain only the most important parts. However, in order to create the summary concept map $G$, a human annotator has to read all documents $D$ and find and group mentions of concepts and relations in them to obtain the building blocks for $G$. In the final annotation product $G$, only a fraction of the identified mentions are used, such that most of the mention annotation work is in a sense "wasted" effort but cannot be avoided. The bigger the size difference between $D$ and $G$ is, i.e. the more we want to summarize, the bigger this problem becomes.
To overcome these challenges, we explore two different directions in this thesis. First, we make use of existing datasets that provide annotations for parts of the task and explore different ways of automatically extending them to a full benchmark corpus for CM-MDS. Due to the automation, no manual effort is needed. As a second option, we develop a corpus creation method that effectively combines automatic preprocessing, scalable crowdsourcing and high-quality expert annotations to make the annotation more efficient. We describe both directions and the resulting datasets in the remainder of this chapter.
²⁴ It is not available for download anywhere right now and, in personal communication, the authors of the