6.5 Full Pipeline Experiments
6.5.5 Runtime Complexity and Optimizations
At various points in this thesis, we pointed out the challenge and importance of scalability. In Section 6.2, we compared runtimes of semantic similarity measures, introduced a hashing-based pre-grouping to reduce problem sizes and developed efficient partitioning algorithms as alternatives to expensive ILP formulations. In Section 6.4, we introduced more efficient ILP formulations for subset selection and further improved runtimes by breaking the selection problem down into connected components.
While all of these measures are necessary to apply our proposed methods to problem sizes such as those in the Educ corpus, the scalability of the pipeline as described in this chapter is still limited. One bottleneck is concept mention grouping. As we showed in Section 6.2.3, finding a good partitioning already takes more than 2.5 hours on the largest document set. Computing the features for all pairs of concept mentions takes even longer. Due to the inherent quadratic runtime behavior of pairwise comparisons, the computation time for this step quickly explodes with larger inputs. For sets of several thousand or more documents, applying the pipeline becomes infeasible. In this section, we therefore describe a few possible directions to improve the runtime behavior of that step and report preliminary experiments.
As an alternative to our greedy partitioning search, an efficient graph clustering algorithm proposed by Biemann (2006), called Chinese Whispers, can be used. It initially assigns every node in a graph to its own cluster and then repeatedly iterates over all nodes, changing the cluster assignment of each node to the majority cluster of its neighbors. While not formally guaranteed, this process typically converges to a stable clustering. When run for $k$ iterations, it takes $\mathcal{O}(k|M|^2)$ time, with even better average-case runtime for sparse graphs. It can be applied to our mention grouping problem by setting a cut-off threshold for coreference probabilities and then running Chinese Whispers on the graph that has a node for every mention and an edge for every pair with a probability above the threshold. In preliminary experiments, we saw runtimes similar to greedy search on our smallest topic but much faster ones on the big topics, while finding partitionings that score equally well. However, since this algorithm does not actively optimize our objective function, obtaining good partitionings amounts to carefully tuning the cut-off threshold. More importantly, it still requires the computation of all pairwise features and predictions before it can be applied, and thus does not fully solve the scalability problem.
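The procedure described above can be sketched in a few lines of Python. This is only an illustrative sketch and not the implementation used in our experiments: the function names `chinese_whispers` and `group_mentions`, the default threshold of 0.5 and the deterministic tie-breaking rule are assumptions made for the example (Biemann's algorithm breaks ties randomly), and coreference probabilities are assumed to be given as a dictionary over mention index pairs.

```python
import random
from collections import defaultdict


def chinese_whispers(num_nodes, edges, iterations=20, seed=0):
    """Cluster nodes 0..num_nodes-1 given weighted undirected edges (u, v, w)."""
    neighbors = defaultdict(list)
    for u, v, w in edges:
        neighbors[u].append((v, w))
        neighbors[v].append((u, w))

    rng = random.Random(seed)
    labels = list(range(num_nodes))  # every node starts in its own cluster
    order = list(range(num_nodes))
    for _ in range(iterations):
        rng.shuffle(order)  # visit the nodes in random order
        changed = False
        for node in order:
            votes = defaultdict(float)
            for other, weight in neighbors.get(node, ()):
                votes[labels[other]] += weight
            if not votes:
                continue  # an isolated node keeps its own cluster
            # Adopt the label with the highest summed edge weight; ties go to
            # the smaller label (an assumption made here for determinism).
            best = min(votes, key=lambda label: (-votes[label], label))
            if best != labels[node]:
                labels[node] = best
                changed = True
        if not changed:
            break  # stable clustering reached
    return labels


def group_mentions(num_mentions, coref_prob, threshold=0.5):
    """Keep only pairs above the cut-off threshold, then cluster the graph."""
    edges = [(i, j, p) for (i, j), p in coref_prob.items() if p > threshold]
    return chinese_whispers(num_mentions, edges)


# Hypothetical toy input: two tightly connected groups and one weak link.
toy_probs = {(0, 1): 0.9, (0, 2): 0.9, (1, 2): 0.9,
             (3, 4): 0.9, (3, 5): 0.9, (4, 5): 0.9, (2, 3): 0.1}
toy_labels = group_mentions(6, toy_probs)
```

Note that the sketch takes the thresholded pairwise probabilities as input, which reflects the limitation discussed above: all pairwise predictions must exist before the clustering can start.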
To fully overcome the need to make all pairwise comparisons of mentions, one possibility is to rely on locality-sensitive hashing (LSH). Given points in a vector space, the central idea of LSH is that one can compute binary hash representations of such vectors which approximately preserve the distance between the vectors. More specifically, the Hamming distance between the binary hashes of two points can be used to approximate their cosine similarity in the vector space (Charikar, 2002). If we represent concept mentions by vectors in a vector space, e.g. using word embeddings, instead of computing pairwise features, the idea of LSH can be applied to our problem in at least two different ways.
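To make the relation between Hamming distance and cosine similarity concrete, the following sketch assumes the random-hyperplane scheme: each hash bit records on which side of a random hyperplane a vector lies, so two vectors with angle $\theta$ differ in any given bit with probability $\theta/\pi$, and a Hamming distance $d$ between $b$-bit hashes yields the estimate $\cos(\pi d / b)$. All function names and parameter values are chosen for illustration only.

```python
import math
import random


def random_hyperplanes(num_bits, dim, seed=0):
    """Draw one random Gaussian normal vector per hash bit."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def lsh_signature(vector, hyperplanes):
    """One bit per hyperplane: 1 if the vector lies on its non-negative side."""
    return tuple(
        1 if sum(r * x for r, x in zip(plane, vector)) >= 0 else 0
        for plane in hyperplanes
    )


def hamming(sig_a, sig_b):
    """Number of bit positions in which two signatures differ."""
    return sum(a != b for a, b in zip(sig_a, sig_b))


def estimated_cosine(sig_a, sig_b):
    """Estimate cosine similarity from the fraction of differing bits."""
    return math.cos(math.pi * hamming(sig_a, sig_b) / len(sig_a))


def cosine(u, v):
    """Exact cosine similarity, used here only for comparison."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)
```

The estimate becomes more accurate with more bits, which is the trade-off between hash length and precision that both options discussed next have to deal with.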
The first option is to rely on the fact that the shorter the binary hashes computed for LSH are, the more vectors get mapped to the same hash, automatically clustering them into groups. As explained above, the similarity of hashes directly corresponds to the cosine similarity between vectors, and the vectors that get mapped to the exact same hash are therefore the most similar ones. The mention grouping approach is thus simple: we represent every mention by its word embedding, compute hashes for the embeddings and group together those mentions that receive the same hash. The whole process can be performed in $\mathcal{O}(|M|)$ time. However, despite being conceptually appealing, this approach is difficult to use in practice. The length of the computed hashes has to be carefully tuned to control the size of the created groups and has to be adapted when the number of mentions becomes substantially smaller or larger.
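A minimal sketch of this first option, again assuming random-hyperplane hashes, is given below. The function name `group_by_hash` and the representation of embeddings as a dictionary from mention strings to vectors are hypothetical. For a fixed hash length and embedding dimension, a single pass over the mentions suffices.

```python
import random
from collections import defaultdict


def group_by_hash(embeddings, num_bits, seed=0):
    """Group mentions whose embeddings receive the same num_bits-bit hash.

    embeddings: dict mapping each mention to its embedding vector.
    Returns a list of groups, each a list of mentions.
    """
    dim = len(next(iter(embeddings.values())))
    rng = random.Random(seed)
    planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]

    buckets = defaultdict(list)
    for mention, vector in embeddings.items():  # one pass over all mentions
        key = tuple(
            sum(r * x for r, x in zip(plane, vector)) >= 0 for plane in planes
        )
        buckets[key].append(mention)  # identical hash -> same group
    return list(buckets.values())
```

With `num_bits=0`, all mentions end up in a single group, and each additional bit can at most double the number of groups. This illustrates why the hash length has to be tuned to the number of mentions.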
As a second option, one can use LSH together with fast Hamming search (Charikar, 2002), a probabilistic algorithm that finds the objects closest to a given object by comparing its hash (computed from its vector) to the hashes of only a subset of all other objects. It has been applied by Ravichandran et al. (2005) to the NLP problem of efficiently finding, for a given noun, the most similar other nouns in a large list. Applied to our problem, we can use
it to find, for each concept mention, all other mentions that have a cosine similarity above a certain threshold in $\mathcal{O}(|M| \log |M|)$ time. While this approach is therefore an efficient alternative to pairwise mention classification, it still leaves the partitioning problem to be solved. Since the optimization problem for finding globally consistent partitionings defined in Section 6.2.2 relies on scores for all pairs, including low-scoring ones, it cannot be directly applied in this case. Nevertheless, developing partitioning approaches for this case seems to be a worthwhile direction for future work on improving scalability.
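The idea behind fast Hamming search can be sketched as follows. This is a simplified illustration and not a faithful reimplementation of the cited work: the bit positions of all hashes are permuted several times, the permuted hashes are sorted, and every hash is compared only to a small window of successors in the sorted order. The function name `fast_hamming_search` and the parameters `num_permutations` and `beam` are assumptions made for the example.

```python
import random


def hamming(sig_a, sig_b):
    """Number of bit positions in which two signatures differ."""
    return sum(a != b for a, b in zip(sig_a, sig_b))


def fast_hamming_search(signatures, max_distance, num_permutations=10,
                        beam=5, seed=0):
    """Find pairs (i, j), i < j, whose hashes differ in <= max_distance bits.

    signatures: list of equal-length bit tuples, one per mention.
    The search is approximate: every returned pair is a true match,
    but some true matches may be missed.
    """
    if not signatures:
        return set()
    rng = random.Random(seed)
    positions = list(range(len(signatures[0])))
    found = set()
    for _ in range(num_permutations):
        rng.shuffle(positions)  # random permutation of the bit positions
        # Sorting dominates the cost of one round: O(|M| log |M|).
        permuted = sorted(
            (tuple(sig[p] for p in positions), index)
            for index, sig in enumerate(signatures)
        )
        for rank, (_, i) in enumerate(permuted):
            # Compare only against the next `beam` entries in sorted order.
            for _, j in permuted[rank + 1: rank + 1 + beam]:
                if hamming(signatures[i], signatures[j]) <= max_distance:
                    found.add((min(i, j), max(i, j)))
    return found
```

Hashes that share a long prefix under some permutation end up close together in the sorted list, so near neighbors are likely to be compared in at least one round. The result is a sparse set of similar pairs rather than scores for all pairs, which is why the partitioning approach of Section 6.2.2 cannot be applied directly.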
6.6 Chapter Summary
In this chapter, we focused on the CM-MDS subtasks for concept mention grouping, importance estimation and concept map construction. Our contributions include improved techniques for these subtasks and a new state-of-the-art pipeline for the task as a whole.
For concept mention grouping, we proposed a solution based on pairwise mention classification and a subsequent partitioning step. Compared to previous work on concept map mining, this approach can capture more types of coreferences as it relies on a variety of different semantic similarity metrics. Although its inherent quadratic runtime behavior makes the approach challenging to apply to large corpora, we carefully analyzed several trade-offs between computation time and grouping performance to design an approach that can be successfully applied to the problem sizes in our evaluation corpora.
With regard to importance estimation, we studied a broad set of more and less commonly used features as well as different machine learning approaches to determine the importance of concepts. While we could not observe clear advantages of modeling the problem as regression, classification or ranking, we did design supervised models that clearly outperform the exclusively unsupervised methods suggested in previous work on concept maps, therefore contributing to an overall improved performance on CM-MDS.
For concept map construction, we proposed an ILP formulation that allows us to find an optimal solution to the subgraph selection problem of CM-MDS. We experimentally demonstrated that these optimal subgraphs are superior to heuristically selected ones on our evaluation corpus. In addition, we designed the subgraph selection process to be particularly efficient such that it can scale to problem instances of a reasonable size.
Based on the proposed methods for the three subtasks and the mention extraction methods explored in the previous chapter, we then presented a full pipeline for CM-MDS. The pipeline was shown to be superior to a range of baselines based on previous work in both automatic and manual evaluations on two corpora, establishing a new state-of-the-art for the task of CM-MDS. We finally pointed out directions for further improvements of the pipeline, including in particular the need for better importance estimation models, and sketched possibilities to improve its scalability.
Chapter 7
End-to-End Modeling Approaches
In this chapter, we focus on alternative models for CM-MDS that try to approach the task end-to-end rather than with a pipeline of multiple steps. For various tasks in NLP, such approaches have recently been very successful. We first discuss how sequence transduction models can be applied to CM-MDS. Then, we propose a new alternative architecture which we name a sequence-to-graph network. We evaluate both approaches experimentally to assess the applicability of end-to-end modeling for CM-MDS.
7.1 Motivation and Challenges
In recent years, the use of neural network models, mostly under the name deep learning, has led to substantial performance improvements in many NLP tasks. Prominent examples are the pioneering work by Collobert et al. (2011) and subsequent work on language modeling (Mikolov, 2012), text classification (Socher et al., 2013; Kim, 2014) and machine translation (Sutskever et al., 2014; Cho et al., 2014). Far more work has been done in that area than we can reference here, with large fractions of the papers published in NLP venues in the last three to four years focusing on the development and analysis of neural models. Goldberg (2017) provides an overview of these efforts, whereas Goodfellow et al. (2016), LeCun et al. (2015) and Schmidhuber (2015) give a more general overview of types and applications of deep learning across different fields.
That large body of work on neural models has repeatedly confirmed several advantages that neural networks have over traditional, feature-based machine learning approaches. First, they make it possible to model many complex tasks, e.g. machine translation, as a single end-to-end problem. That allows learning powerful task-specific representations, such as word vectors, directly from the task-level supervision signal and in addition avoids error propagation problems that typically occur in multi-step pipelines where individual models are not trained jointly. Second, distributed word representations and architectures
such as CNNs or RNNs that can combine word vectors into higher-level representations have made it possible to learn models directly from raw text. The design and selection of features, as done in traditional machine learning approaches, is no longer necessary. And third, as a consequence of the previous, many tasks in NLP can nowadays be approached with the same kind of model and no task-specific knowledge or techniques are needed to reach competitive performance.
In light of these advantages, we are interested in also applying such models to CM-MDS, where we expect several benefits. First, the end-to-end modeling and training can overcome the error propagation problems of the pipeline approach observed in Section 6.5. Second, task-specific word representations learned in this end-to-end fashion can improve performance on the difficult concept coreference subtask, where we already found generically pre-trained embeddings to be helpful in Section 6.2.3. And third, parts of our pipeline rely on off-the-shelf software such as OIE systems which have not been explicitly trained to extract concepts and relations. With an end-to-end model, we would be able to directly learn extraction approaches as needed for CM-MDS.
However, this direction also poses the following new challenges:
Large Inputs. In CM-MDS, and in particular on our Educ corpus, the size of the input is large, consisting of multiple full documents with a total of about 100,000 tokens. Most other NLP tasks to which neural models have been successfully applied work with only single documents, sentences or individual words as inputs. For the common sequence processing architecture, RNNs, long sequences can be challenging in terms of training time, memory requirements and successful backpropagation of the training signal.
Graph Outputs. The output in CM-MDS is a labeled, directed graph. Thus, it is a structured prediction rather than a simpler classification or regression task. While much work exists on the most prominent structured prediction task in NLP, the prediction of word sequences as in machine translation or text generation, to the best of our knowledge, no neural network components have been proposed to directly predict a labeled, directed graph.⁶⁹
Little Data. Due to the structure of the output and the size of the input, manually creating reference concept maps is expensive, as we showed in Chapter 4. Only with a complex pipeline of preprocessing, crowdsourcing and manual annotations could we create the high-quality Educ benchmark corpus of 30 document-summary pairs. While that size is useful to evaluate CM-MDS approaches, it is far less than what is needed to train neural models. The work on neural SDS, for instance, relies on hundreds of thousands of pairs.
⁶⁹ Note that tasks such as dependency parsing or semantic parsing to AMR also produce graphs (or trees), but