Integer Linear Programming Approach - Motivation and Challenges

6.4.1 Integer Linear Programming Approach

We focus here on the optimization problem from Equation 3.2, which expresses the selection solely in terms of concepts and their estimated importance. While one could also include the selection of relations in the ILP in order to solve the problem defined in Equation 3.1, we leave that extension for future work.

6.4. Concept Map Construction

Let 𝑥_𝑖 be a binary decision variable that represents whether concept 𝑐_𝑖 ∈ 𝐶 is part of the selected subgraph. Then, the objective function can be written as

\[ \max \sum_{i=1}^{|C|} x_i \, \nu(c_i) \tag{6.13} \]

\[ x_i \in \{0, 1\} \quad \forall\, i \in C \tag{6.14} \]

while the following constraint ensures that the subgraph obeys the size limit:

\[ \sum_{i=1}^{|C|} x_i \le L \tag{6.15} \]

Ensuring that the selected subgraph is also connected is a bit more intricate. A common approach to expressing such a constraint in an ILP is to use so-called commodity flow variables and, more specifically, the single commodity flow formulation for the minimum spanning tree problem proposed in the operations research community by Magnanti and Wolsey (1994). It has been successfully used in ILPs addressing dependency parsing (Martins et al., 2009), sentence compression (Thadani and McKeown, 2013) and abstractive summarization (Liu et al., 2015; Li et al., 2016a). Let 𝑓_𝑖𝑗 be a non-negative integer variable capturing the flow from concept 𝑐_𝑖 to 𝑐_𝑗. We introduce flow variables for concept pairs with a relation in 𝑅. The constraints

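\[ f_{ij} \le x_i \cdot |C| \quad \forall\, (i, j) \in R \tag{6.16} \]

\[ f_{ij} \le x_j \cdot |C| \quad \forall\, (i, j) \in R \tag{6.17} \]

\[ \sum_{i} f_{ij} - \sum_{k} f_{jk} - x_j = 0 \quad \forall\, j \in C \tag{6.18} \]

\[ f_{ij} \in \mathbb{N}_0 \quad \forall\, (i, j) \in R \tag{6.19} \]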

enforce that flow can only move between concepts that are selected (6.16 and 6.17) and that each selected concept consumes one unit of flow (6.18). Further, let 𝑖 = 0 be a virtual root node and 𝑒_0𝑖 a binary variable for a virtual edge from the root to concept 𝑐_𝑖. The additional constraints

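\[ |C| \cdot e_{0i} - f_{0i} \ge 0 \quad \forall\, i \in C \tag{6.20} \]

\[ \sum_{i=1}^{|C|} e_{0i} = 1 \tag{6.21} \]

\[ \sum_{i=1}^{|C|} f_{0i} - \sum_{i=1}^{|C|} x_i = 0 \tag{6.22} \]

\[ e_{0i} \in \{0, 1\} \quad \forall\, i \in C \tag{6.23} \]

\[ f_{0i} \in \mathbb{N}_0 \quad \forall\, i \in C \tag{6.24} \]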

ensure that only one virtual edge can be active (6.21), that the virtual node can only send flow over this active edge (6.20), and that the total amount of flow sent from the root equals the number of selected concepts (6.22). As a consequence, if 𝑛 concepts are selected, 𝑛 units of flow are sent from the virtual root over the edges of the graph and each selected concept consumes one of them. This is only possible if the selected subgraph is connected. Equivalently, one can think of the edges with flow larger than zero as forming a spanning tree of the selected subgraph that is rooted in the additional virtual node.
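The flow argument can be made concrete with a small sketch. For a given selection, it builds a spanning tree by breadth-first search and assigns each tree edge as many units of flow as there are concepts below it, then checks flow conservation (6.18) and the root constraint (6.22). The function names, the use of positive integers as concept ids, and the assumption that a relation can carry flow in both directions are choices made for this illustration only, not part of the formulation above.

```python
from collections import deque

ROOT = 0  # virtual root node; concept ids are assumed to be positive integers


def flow_witness(selected, relations):
    """Build flows satisfying (6.18) and (6.22) for `selected`, or return None.

    Assumption of this sketch: a relation (i, j) can carry flow in both directions.
    """
    selected = set(selected)
    if not selected:
        return {}
    neighbors = {c: set() for c in selected}
    for i, j in relations:
        if i in selected and j in selected:
            neighbors[i].add(j)
            neighbors[j].add(i)

    # Breadth-first search from one selected concept yields a spanning tree.
    start = min(selected)
    parent = {start: ROOT}
    order = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        order.append(node)
        for nxt in sorted(neighbors[node]):
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    if len(order) < len(selected):
        return None  # some selected concept cannot be reached from the root edge

    # Each tree edge carries one unit of flow per concept in the subtree below it.
    subtree = {c: 1 for c in selected}
    for node in reversed(order):  # children are processed before their parents
        if parent[node] != ROOT:
            subtree[parent[node]] += subtree[node]
    return {(parent[c], c): subtree[c] for c in selected}


def satisfies_flow_constraints(selected, flow):
    """Check flow conservation (6.18) and the root constraint (6.22)."""
    selected = set(selected)
    root_out = sum(v for (i, _), v in flow.items() if i == ROOT)
    if root_out != len(selected):
        return False
    for j in selected:
        inflow = sum(v for (_, t), v in flow.items() if t == j)
        outflow = sum(v for (s, _), v in flow.items() if s == j)
        if inflow - outflow - 1 != 0:
            return False
    return True
```

With the relations `[(1, 2), (2, 3), (4, 5)]`, the selection `{1, 2, 3}` receives three units on the root edge, two on the edge from 1 to 2 and one on the edge from 2 to 3, whereas `{1, 2, 4}` yields `None` because concept 4 is not connected to the others.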

An important detail for the optimization is the range of the importance estimates. If some concepts receive negative scores, the objective can be improved by excluding them from the subgraph. As a result, some part of the size budget might remain unused although additional connected concepts would be available. In order to avoid that, we can simply shift all importance scores into the positive range, formally, by deriving 𝜈′ as

\[ \nu'(c_i) = \nu(c_i) - \min \{\, \nu(c_j) \mid c_j \in C \,\} \tag{6.25} \]

and then using 𝜈′ in the ILP. However, if negative scores are only assigned to concepts that should in no case be part of the summary, the default behavior might actually be desired.
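As a minimal sketch of the shift in Equation 6.25, assuming the importance estimates are held in a dictionary from concept to score (the name `shift_scores` and the data layout are hypothetical):

```python
def shift_scores(scores):
    """Return nu'(c) = nu(c) - min over all concepts, as in Equation 6.25."""
    if not scores:
        return {}
    lowest = min(scores.values())
    return {concept: value - lowest for concept, value in scores.items()}


shifted = shift_scores({"a": -2.0, "b": 0.5, "c": 3.0})
# shifted == {"a": 0.0, "b": 2.5, "c": 5.0}
```

The shift preserves the ranking of the concepts. The lowest-scoring concept ends up with a score of exactly zero, so no concept lowers the objective when it is included.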

We take several measures to ensure that the ILP can be efficiently solved for the problem instances of CM-MDS. First, the above ILP formulation is already much more efficient than the one proposed by Li et al. (2016a) for MDS, which is the most similar ILP in related work. While ours requires 𝒪(|𝐶| + |𝑅|) variables and constraints, their formulation uses two variables per pair of nodes for the connectivity constraint, resulting in 𝒪(|𝐶|²) variables and constraints. For sparse graphs, where |𝑅| ≪ |𝐶|², this leads to much smaller ILPs.

Second, we leverage the fact that 𝐺 is typically disconnected. Since a connected subgraph has to lie completely within one of the connected components of 𝐺, we first identify these components and solve a separate ILP for each of them. These smaller ILPs can usually be solved faster than a single large one. Third, processing 𝐺 component by component also allows us to skip some of them completely. Starting with the biggest component, we keep track of the best objective function value found so far. If the next component has a total concept score less than that value, none of its subgraphs can be a better solution, provided that all scores are non-negative. And if a component consists of fewer concepts than the limit, we can directly use the whole component instead of selecting a subset. With these measures, as we show in the experiments, the ILP can be efficiently solved for the problem sizes in the Educ corpus.
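The component-wise processing can be sketched as follows. This is an illustration under stated assumptions, not the implementation used in the experiments: scores are assumed to be non-negative, relations are treated as undirected for connectivity, all function names are hypothetical, and `solve_exactly` is a brute-force stand-in for the per-component ILP that is only practical for tiny components. The sketch also uses a whole component when its size equals the limit, which is valid as well.

```python
from itertools import combinations


def connected_components(concepts, relations):
    """Return the connected components of the graph and its adjacency sets."""
    neighbors = {c: set() for c in concepts}
    for i, j in relations:
        neighbors[i].add(j)
        neighbors[j].add(i)
    seen, components = set(), []
    for c in concepts:
        if c in seen:
            continue
        stack, comp = [c], set()
        while stack:
            node = stack.pop()
            if node in comp:
                continue
            comp.add(node)
            stack.extend(neighbors[node] - comp)
        seen |= comp
        components.append(comp)
    return components, neighbors


def is_connected(nodes, neighbors):
    """Check whether `nodes` induce a connected subgraph."""
    nodes = set(nodes)
    start = next(iter(nodes))
    stack, seen = [start], {start}
    while stack:
        for nxt in neighbors[stack.pop()] & nodes:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen == nodes


def solve_exactly(component, neighbors, scores, limit):
    """Brute-force stand-in for the ILP: best connected subset of size <= limit."""
    best, best_value = set(), float("-inf")
    for size in range(1, limit + 1):
        for subset in combinations(sorted(component), size):
            if is_connected(subset, neighbors):
                value = sum(scores[c] for c in subset)
                if value > best_value:
                    best, best_value = set(subset), value
    return best, best_value


def select_subgraph(concepts, relations, scores, limit):
    """Process components from largest to smallest, pruning where possible."""
    components, neighbors = connected_components(concepts, relations)
    best, best_value = set(), float("-inf")
    for comp in sorted(components, key=len, reverse=True):
        total = sum(scores[c] for c in comp)
        if total < best_value:
            continue  # even the whole component cannot beat the best so far
        if len(comp) <= limit:
            candidate, value = set(comp), total  # component fits, use it directly
        else:
            candidate, value = solve_exactly(comp, neighbors, scores, limit)
        if value > best_value:
            best, best_value = candidate, value
    return best, best_value
```

For the path 1-2-3-4 with scores 5, 1, 1 and 4, a second component 5-6 with scores 3 and 2, an isolated concept 7 with score 2 and a limit of two concepts, only the largest component needs the exact solver. It yields the concepts 1 and 2 with a value of 6, and both remaining components are skipped because their total scores of 5 and 2 are below that value.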

6.4.2 Experiments

To verify the effectiveness and efficiency of our proposed subgraph selection, we conduct an experiment that compares it against heuristic selection and alternative ILP formulations.

Experimental Setup We use the same data as for the concept importance estimation experiment (see Section 6.3.3), namely concepts extracted and grouped from the training topics of Educ. To evaluate subgraph selection independent of importance estimation, we do not use a trained model but the gold scores derived as training labels in Section 6.3.3. We create a second dataset based on Wiki with the same approach. On both datasets, we evaluate the selected subgraphs by comparing them against the reference concept maps with the metrics proposed for CM-MDS in Section 3.5.2. ILPs are solved with CPLEX on a compute server with 500 GB of memory and 24 Intel Xeon E5-2620 2.1GHz cores.

                     METEOR                           ROUGE
                     Pr      Re      F1      p        Pr      Re      F1      p
Educ   ILP           23.32   27.52   25.16            26.09   23.93   24.74
       Heuristic     18.28   25.15   21.13   .0003    17.52   21.97   19.34   .0014
Wiki   ILP           29.04   26.76   27.73            29.08   18.79   22.54
       Heuristic     24.45   24.46   24.83   .0051    24.06   17.39   19.57   .0093

Table 6.7: Evaluation of summary concept maps obtained with the proposed ILP and heuristic selection. Inputs are graphs created by automatic extraction and grouping in combination with gold importance scores. P-values are computed with a permutation test comparing F1-scores.

As a baseline for our proposed approach, we implement a greedy heuristic similar to Zubrinic et al. (2015): Given the graph of scored concepts, it starts with the most important one and selects the best neighbor (by score, breaking ties by the node’s degree) until the size limit is reached. While this procedure ensures that the selected subgraph is valid, i.e. connected and within the size limit, it does not, in contrast to the ILP, necessarily find the best subgraph with regard to our objective function. As a second baseline, we include an alternative formulation of the subgraph selection ILP obtained by transferring the ILP of Li et al. (2016a) for MDS to our task. The main difference is that it uses a quadratic number of variables to represent the presence or absence of all possible edges and the flow along them. While that has implications for its efficiency, it does of course also find an optimal subgraph.
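A minimal sketch of such a greedy baseline is given below. The description leaves some details open, so the sketch makes two assumptions that are labeled in the code: a "neighbor" is any unselected concept adjacent to the subgraph selected so far, and ties in score are broken in favor of the higher degree. The function name and data layout are hypothetical.

```python
def greedy_select(neighbors, scores, limit):
    """Grow a connected subgraph greedily from the highest-scoring concept.

    neighbors: dict mapping each concept to the set of adjacent concepts
    scores:    dict mapping each concept to its importance score
    """
    if not scores or limit < 1:
        return []

    def rank(concept):
        # Assumption: higher score first, ties broken by higher degree.
        return (scores[concept], len(neighbors[concept]))

    selected = [max(scores, key=rank)]
    chosen = set(selected)
    while len(selected) < limit:
        # Assumption: candidates are all concepts adjacent to the current subgraph.
        frontier = {n for c in chosen for n in neighbors[c]} - chosen
        if not frontier:
            break  # the component is exhausted before the limit is reached
        best = max(frontier, key=rank)
        selected.append(best)
        chosen.add(best)
    return selected
```

On the path 1-2-3-4 with scores 5, 1, 4 and 4 and a limit of two concepts, the heuristic starts at concept 1 and can then only add concept 2, for a total of 6, although the connected pair 3 and 4 has a total of 8. This is the kind of case in which an exact method finds a better subgraph.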

Results Table 6.7 shows the results of this experiment. As expected, our ILP approach selects better subgraphs as summaries, and the results on both datasets and in both metrics show that the differences between them and the summaries obtained with the heuristic are substantial and significant. Note that while the ILP finds the best solution to the optimization problem by definition and is in that sense already known to be superior to the heuristic, this experiment verifies that the best solution to the optimization problem is indeed also a good solution for the CM-MDS task in terms of being closer to the reference map.

Chapter 6. Pipeline-based Approaches

Method                  ILP Size                     Runtime
                        Variables     Constraints    (sec)
(Li et al., 2016a)      37,273,062    74,530,095     2670.61
  by component          25,810,465    51,607,172      999.25
Our ILP                     21,596        31,129        7.31
  by component              17,973        26,484        5.61

Table 6.8: Comparison of ILP sizes and runtimes on average per topic for subgraph selection on Educ with our ILP and the alternative formulation of Li et al. (2016a).

In Table 6.8, we compare ILP sizes and the time required to solve them. Although the differences between our ILP formulation and the one by Li et al. (2016a) are small, they have a large effect in practice, resulting in problems that are orders of magnitude smaller and in much faster runtimes. Identifying connected components and selecting subgraphs for each of them separately further improves the efficiency of both ILP approaches. On the document sets of Educ, with on average over 100,000 tokens, that allows us to select a summary subgraph in just a few seconds, which is not possible with the formulation of Li et al. (2016a).

Conclusion Based on these experimental results, we conclude that selecting summary subgraphs with our proposed ILP is effective and can also be done efficiently on our corpora. We will therefore include it in our CM-MDS pipeline described in the next section and assess it in an end-to-end task-level evaluation.

In document Automatic Structured Text Summarization with Concept Maps (Page 138-142)