6.4.1 Integer Linear Programming Approach
We focus here on the optimization problem from Equation 3.2 which expresses the selection solely in terms of concepts and their estimated importance. While one could also include the selection of relations into the ILP in order to solve the problem defined in Equation 3.1, we leave that extension for future work.
6.4. Concept Map Construction
Let $x_i$ be a binary decision variable that represents whether concept $c_i \in C$ is part of the selected subgraph. Then, with $s(c_i)$ denoting the estimated importance of $c_i$, the objective function can be written as

$$\max \sum_{i=1}^{|C|} x_i \, s(c_i) \tag{6.13}$$

$$x_i \in \{0, 1\} \quad \forall i \in C \tag{6.14}$$

while the following constraint ensures that the subgraph obeys the size limit:

$$\sum_{i=1}^{|C|} x_i \le L \tag{6.15}$$
Ensuring that the selected subgraph is also connected is a bit more intricate. A common approach to express such a constraint in an ILP is to use so-called commodity flow variables and, more specifically, the single commodity flow formulation for the minimum spanning tree problem proposed in the operations research community by Magnanti and Wolsey (1994). It has been successfully used in ILPs addressing dependency parsing (Martins et al., 2009), sentence compression (Thadani and McKeown, 2013) and abstractive summarization (Liu et al., 2015; Li et al., 2016a). Let $f_{ij}$ be a non-negative integer variable capturing the flow from concept $c_i$ to $c_j$. We introduce flow variables for concept pairs with a relation in $R$. The constraints

$$f_{ij} \le x_i \cdot |C| \quad \forall (i, j) \in R \tag{6.16}$$

$$f_{ij} \le x_j \cdot |C| \quad \forall (i, j) \in R \tag{6.17}$$

$$\sum_{i} f_{ij} - \sum_{k} f_{jk} - x_j = 0 \quad \forall j \in C \tag{6.18}$$

$$f_{ij} \in \mathbb{N} \quad \forall (i, j) \in R \tag{6.19}$$
enforce that flow can only move between concepts that are selected (6.16 and 6.17) and that a selected concept consumes one unit of flow (6.18). Further, let index $0$ denote a virtual root node and $e_{0i}$ a virtual edge from the root to each concept, with $f_{0i}$ being the flow along that edge. The additional constraints

$$|C| \cdot e_{0i} - f_{0i} \ge 0 \quad \forall i \in C \tag{6.20}$$

$$\sum_{i=1}^{|C|} e_{0i} = 1 \tag{6.21}$$

$$\sum_{i=1}^{|C|} f_{0i} - \sum_{i=1}^{|C|} x_i = 0 \tag{6.22}$$

$$e_{0i} \in \{0, 1\} \quad \forall i \in C \tag{6.23}$$

$$f_{0i} \in \mathbb{N}_0 \quad \forall i \in C \tag{6.24}$$
ensure that only one virtual edge can be active (6.21), that the virtual node can only send flow over this active edge (6.20) and that the total amount of flow sent from the root cannot
Chapter 6. Pipeline-based Approaches
exceed the number of selected concepts (6.22). As a consequence, if $n$ concepts are selected, $n$ units of flow are sent from the virtual root over the edges of the graph and each selected concept consumes one of them. This is only possible if the selected subgraph is connected. Equivalently, one can think of it as the edges with flow larger than zero forming a spanning tree of the selected subgraph that is rooted in the additional virtual node.
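The spanning-tree view of the flow constraints can be made concrete with a small sketch. It is not part of the ILP itself: it only constructs, for a given selection, one flow assignment that sends as many units from the virtual root (node 0) as there are selected concepts, and it reports failure when the selection is disconnected. The function names, the use of positive integers as concept identifiers and the treatment of relations as undirected are assumptions made for this illustration.

```python
from collections import deque


def tree_flow(selected, relations):
    """Construct a flow for `selected`, or return None if it is disconnected.

    Assumptions of this sketch: concepts are positive integers, 0 is the
    virtual root, and relations may carry flow in either direction.
    Returns (root, flow) with flow[(i, j)] = units sent from i to j.
    """
    selected = set(selected)
    if not selected:
        return None
    neighbors = {c: set() for c in selected}
    for i, j in relations:
        if i in selected and j in selected:
            neighbors[i].add(j)
            neighbors[j].add(i)

    # Breadth-first search from one selected concept yields a spanning tree.
    root = min(selected)
    parent = {root: 0}
    order = []
    queue = deque([root])
    while queue:
        node = queue.popleft()
        order.append(node)
        for nxt in sorted(neighbors[node]):
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    if len(order) < len(selected):
        return None  # some selected concept cannot be reached by any flow

    # Every tree edge carries one unit per concept in the subtree below it.
    subtree = {c: 1 for c in selected}
    for node in reversed(order):
        if parent[node] != 0:
            subtree[parent[node]] += subtree[node]
    flow = {(parent[c], c): subtree[c] for c in selected}
    return root, flow


def conserves_flow(selected, flow):
    """Check (6.18): inflow - outflow - 1 = 0 for every selected concept."""
    for c in selected:
        inflow = sum(v for (i, j), v in flow.items() if j == c)
        outflow = sum(v for (i, j), v in flow.items() if i == c)
        if inflow - outflow - 1 != 0:
            return False
    return True
```

For a chain of three related concepts, the virtual edge carries three units and each concept keeps one; for a selection spanning two unrelated parts of the graph, no such assignment exists.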
An important detail for the optimization is the range of the importance estimates. If some concepts receive negative scores, the objective can be improved by excluding them from the subgraph. As a result, some part of the size budget might remain unused although additional connected concepts would be available. To avoid that, we can simply shift all importance scores into the non-negative range, formally, by deriving $s'$ as

$$s'(c_i) = s(c_i) - \min \{\, s(c_j) \mid c_j \in C \,\} \tag{6.25}$$

and then using $s'$ in the ILP. However, if negative scores are only assigned to concepts that should in no case be part of the summary, the default behavior might actually be desired.
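As a minimal sketch of the shift in Equation 6.25, assuming the scores are held in a dictionary keyed by concept (the function name and data layout are illustrative, not taken from the original implementation):

```python
def shift_scores(scores):
    """Subtract the smallest score from every score, following Equation 6.25."""
    lowest = min(scores.values())
    return {concept: value - lowest for concept, value in scores.items()}
```

After the shift, the lowest-scoring concept has a score of zero and all other concepts have positive scores, so adding a connected concept never decreases the objective.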
We take several measures to ensure that the ILP can be efficiently solved for the problem instances of CM-MDS. First, the above ILP formulation is already much more efficient than the one proposed by Li et al. (2016a) for MDS, which is the most similar ILP in related work. While ours requires $\mathcal{O}(|C| + |R|)$ variables and constraints, their formulation uses two variables per pair of nodes for the connectivity constraint, resulting in $\mathcal{O}(|C|^2)$ variables and constraints. For sparse graphs, where $|R| \ll |C|^2$, this leads to much smaller ILPs.
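The linear size can be checked with a rough tally. The count below assumes exactly one flow variable per relation and leaves out the variable domain declarations (6.14, 6.19, 6.23 and 6.24); an actual implementation may count slightly differently.

$$
\begin{aligned}
\text{variables:} \quad & \underbrace{|C|}_{\text{concept selection}} + \underbrace{|R|}_{\text{flow on relations}} + \underbrace{|C|}_{\text{virtual edges}} + \underbrace{|C|}_{\text{flow on virtual edges}} = 3|C| + |R| \\
\text{constraints:} \quad & \underbrace{1}_{(6.15)} + \underbrace{2|R|}_{(6.16),\,(6.17)} + \underbrace{|C|}_{(6.18)} + \underbrace{|C|}_{(6.20)} + \underbrace{2}_{(6.21),\,(6.22)} = 2|C| + 2|R| + 3
\end{aligned}
$$

Both totals grow linearly in the number of concepts and relations, whereas a formulation with variables for every pair of nodes grows quadratically in the number of concepts.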
Second, we leverage the fact that $G$ is typically disconnected. Since a connected subgraph has to be completely in one of the connected components of $G$, we first identify these components and solve separate ILPs for each of them. These smaller ILPs can usually be solved faster than a single large one. Third, processing $G$ component by component also allows us to completely skip some of them. Starting with the biggest component, we keep track of the best objective function value so far. If the next component has a total concept score less than that value, none of its subgraphs can be a better solution. And if the component consists of fewer concepts than the limit, we can directly use the whole component instead of selecting a subset. With these measures, as we show in the experiments, the ILP can be efficiently solved for the problem sizes in the Educ corpus.
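The component-wise processing can be sketched as follows. This is an illustration under stated assumptions rather than the original implementation: scores are assumed to be non-negative (otherwise neither shortcut is safe), relations are treated as undirected, and an exhaustive search named `brute_force` stands in for the ILP solver so that the example runs without external libraries. All function names are hypothetical.

```python
from itertools import combinations


def components(concepts, relations):
    """Connected components of the graph, treating relations as undirected."""
    neighbors = {c: set() for c in concepts}
    for i, j in relations:
        neighbors[i].add(j)
        neighbors[j].add(i)
    seen, result = set(), []
    for start in concepts:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            node = stack.pop()
            if node in comp:
                continue
            comp.add(node)
            stack.extend(neighbors[node] - comp)
        seen |= comp
        result.append(comp)
    return result


def brute_force(component, relations, scores, limit):
    """Stand-in for the ILP: try every connected subset with at most `limit` concepts."""
    best, best_value = set(), float("-inf")
    for size in range(1, limit + 1):
        for subset in combinations(sorted(component), size):
            inner = [(i, j) for i, j in relations if i in subset and j in subset]
            if len(components(subset, inner)) == 1:
                value = sum(scores[c] for c in subset)
                if value > best_value:
                    best, best_value = set(subset), value
    return best, best_value


def select_by_component(concepts, relations, scores, limit, solve=brute_force):
    """Solve each component separately, largest first, skipping hopeless ones."""
    best, best_value = set(), float("-inf")
    for comp in sorted(components(concepts, relations), key=len, reverse=True):
        total = sum(scores[c] for c in comp)
        if total < best_value:
            continue  # even the whole component cannot beat the current best
        if len(comp) <= limit:
            candidate, value = comp, total  # the whole component fits the budget
        else:
            inner = [(i, j) for i, j in relations if i in comp and j in comp]
            candidate, value = solve(comp, inner, scores, limit)
        if value > best_value:
            best, best_value = set(candidate), value
    return best, best_value
```

Passing a different `solve` callable with the same signature would allow an actual ILP solver to be plugged in where the exhaustive search is used here.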
6.4.2 Experiments
To verify the effectiveness and efficiency of our proposed subgraph selection, we conduct an experiment that compares it against heuristic selection and alternative ILP formulations.
Experimental Setup We use the same data as for the concept importance estimation experiment (see Section 6.3.3), namely concepts extracted and grouped from the training topics of Educ. To evaluate subgraph selection independent of importance estimation, we do not use a trained model but the gold scores derived as training labels in Section 6.3.3. We create a second dataset based on Wiki with the same approach. On both datasets, we evaluate the selected subgraphs by comparing them against the reference concept maps with the metrics proposed for CM-MDS in Section 3.5.2. ILPs are solved with CPLEX on a compute server with 500 GB of memory and 24 Intel Xeon E5-2620 2.1GHz cores.

| Dataset | Approach | METEOR Pr | METEOR Re | METEOR F1 | METEOR p | ROUGE Pr | ROUGE Re | ROUGE F1 | ROUGE p |
|---|---|---|---|---|---|---|---|---|---|
| Educ | ILP | 23.32 | 27.52 | 25.16 | | 26.09 | 23.93 | 24.74 | |
| Educ | Heuristic | 18.28 | 25.15 | 21.13 | .0003 | 17.52 | 21.97 | 19.34 | .0014 |
| Wiki | ILP | 29.04 | 26.76 | 27.73 | | 29.08 | 18.79 | 22.54 | |
| Wiki | Heuristic | 24.45 | 24.46 | 24.83 | .0051 | 24.06 | 17.39 | 19.57 | .0093 |

Table 6.7: Evaluation of summary concept maps obtained with the proposed ILP and heuristic selection. Inputs are graphs created by automatic extraction and grouping in combination with gold importance scores. P-values are computed with a permutation test comparing F1-scores.
As a baseline for our proposed approach, we implement a greedy heuristic similar to Zubrinic et al. (2015): Given the graph of scored concepts, it starts with the most important one and selects the best neighbor (by score, breaking ties by the node's degree) until the size limit is reached. While this procedure ensures that the selected subgraph is valid, i.e. not too big and connected, it is not necessarily, in contrast to the ILP, the best subgraph with regard to our objective function. As a second baseline, we include an alternative formulation of the subgraph selection ILP obtained by transferring Li et al. (2016a)'s ILP for MDS to our task. The main difference is that it uses a quadratic number of variables to represent the presence or absence of all possible edges and the flow along them. While that has implications for its efficiency, it does of course also find an optimal subgraph.
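A minimal sketch of such a greedy baseline is given below. The description above leaves some details open, so the following are assumptions of this sketch: a "neighbor" is any concept related to an already selected concept, relations are treated as undirected, and the degree tie-break is also applied when picking the start concept. The function name is hypothetical.

```python
def greedy_select(scores, relations, limit):
    """Grow a connected selection from the top-scored concept (limit >= 1)."""
    neighbors = {c: set() for c in scores}
    for i, j in relations:
        neighbors[i].add(j)
        neighbors[j].add(i)

    def rank(c):
        # Higher score first; ties go to the concept with more relations.
        return (scores[c], len(neighbors[c]))

    selected = [max(scores, key=rank)]
    while len(selected) < limit:
        frontier = set().union(*(neighbors[c] for c in selected)) - set(selected)
        if not frontier:
            break  # the component of the start concept is exhausted
        selected.append(max(frontier, key=rank))
    return selected
```

On a chain of four concepts with scores 5, 1, 4 and 3 and a limit of two, this procedure selects the first two concepts (total 6), although the last two (total 7) form a better connected subgraph. This is the kind of suboptimal choice that an exact method avoids.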
Results Table 6.7 shows the results of this experiment. As expected, our ILP approach selects better subgraphs as summaries, and the results on both datasets and in both metrics show that the differences between them and the summaries obtained with the heuristic are substantial and significant. Note that while the ILP finds the best solution to the optimization problem by definition and is in that sense already known to be superior to the heuristic, this experiment verifies that the best solution to the optimization problem is also indeed a good solution for the CM-MDS task in terms of being closer to the reference map.
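The exact variant of the permutation test is not spelled out here. One common choice for comparing two systems on the same topics is a paired test that randomly swaps the two systems' scores per topic, sketched below as an assumption rather than as the procedure actually used; the function name, the number of rounds and the two-sided formulation are illustrative.

```python
import random


def paired_permutation_test(scores_a, scores_b, rounds=10000, seed=0):
    """Approximate two-sided p-value for the mean difference of paired scores."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs) / len(diffs))
    hits = 0
    for _ in range(rounds):
        # Swapping the two systems on a topic flips the sign of its difference.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(sum(flipped) / len(flipped)) >= observed:
            hits += 1
    # Add-one smoothing keeps the estimate strictly above zero.
    return (hits + 1) / (rounds + 1)
```

The inputs would be the per-topic F1-scores of the two selection methods, listed in the same topic order.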
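| Method | ILP Size: Variables | ILP Size: Constraints | Runtime (sec) |
|---|---|---|---|
| (Li et al., 2016a) | 37,273,062 | 74,530,095 | 2670.61 |
| (Li et al., 2016a), by component | 25,810,465 | 51,607,172 | 999.25 |
| Our ILP | 21,596 | 31,129 | 7.31 |
| Our ILP, by component | 17,973 | 26,484 | 5.61 |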
Table 6.8: Comparison of ILP sizes and runtimes on average per topic for subgraph selection on Educ with our ILP and the alternative formulation of Li et al. (2016a).
In Table 6.8, we compare ILP sizes and the time required to solve them. Although the differences between our ILP formulation and the one by Li et al. (2016a) are small, they have a large effect in practice, resulting in problems that are orders of magnitude smaller and correspondingly faster to solve. Identifying connected components and selecting subgraphs for each of them separately further improves the efficiency of both ILP approaches. On the document sets of Educ, with on average over 100,000 tokens, that allows us to select a summary subgraph in just a few seconds, which is not possible with Li et al. (2016a)'s formulation.
Conclusion Based on these experimental results, we conclude that selecting summary subgraphs with our proposed ILP is effective and can also be done efficiently on our corpora. We will therefore include it in our CM-MDS pipeline described in the next section and assess it in an end-to-end task-level evaluation.