4.3 Oracles and Ambiguity
4.6.3 Dynamic Oracles and Training With Exploration
Instead of trying to extend the scope of the search algorithm, one way to improve greedy parsers is to try to reduce the impact of error propagation. This has motivated the work on dynamic oracles (Goldberg and Nivre, 2012, 2013). Dynamic oracles should be regarded in the context of the static and non-deterministic oracles we have previously seen in this chapter. Recall that a static oracle provides a single transition sequence to derive a given dependency tree for a sentence. Non-deterministic oracles cover the spurious ambiguities and allow for all (or a subset of) the possible transition sequences. Dynamic oracles take this one step further, and provide a set of possible transitions given a state which has already deviated from the paths defined by a non-deterministic oracle. For the set of transitions, the dynamic oracle selects those transitions that could lead to a final tree with the minimal number of attachment errors, subject to the current state.
Dynamic oracles were first proposed by Goldberg and Nivre (2012) for the ArcEager system. They also proposed the standard way of exploiting dynamic oracles for training locally normalized greedy parsers, known as training with exploration. Here, the idea is that the parser sometimes follows erroneous transitions that it predicts during training. The dynamic oracle then comes into play by guiding the model towards the best possible tree (in terms of LAS), subject to the mistakes that have already been made. Similarly to the latent transition sequences used in our experiments, the next transition given the current state is left latent, and the perceptron model under training gets to choose the best transition given the current state. These parsers thus retain their highly efficient runtime complexity, but improve the parsing results compared to when they are trained with a static oracle.
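The decision made in a single training state under training with exploration can be sketched as follows. This is a minimal illustration and not the implementation of Goldberg and Nivre (2012): the function name exploration_step, the representation of scores as a dictionary, and the boolean explore flag are assumptions made for the example, and the dynamic oracle itself is assumed to be given in the form of the list zero_cost. How often the flag is set is left open.

```python
def exploration_step(scores, legal, zero_cost, explore):
    """Decide the update and the transition to follow in one training state.

    scores    -- dict mapping each transition name to the model's score
    legal     -- list of transitions permitted in the current state
    zero_cost -- legal transitions that the (assumed) dynamic oracle marks
                 as leading to the best tree still reachable
    explore   -- if True, follow the model's prediction even when it is wrong
    """
    predicted = max(legal, key=lambda t: scores[t])
    # The "correct" transition is latent: the model picks its own favourite
    # among the zero-cost transitions offered by the oracle.
    best_correct = max(zero_cost, key=lambda t: scores[t])
    if predicted in zero_cost:
        return None, predicted  # no mistake, hence no update
    update = (best_correct, predicted)  # promote the first, demote the second
    follow = predicted if explore else best_correct
    return update, follow


# Hypothetical scores for an ArcEager-style state.
scores = {"SHIFT": 0.2, "LEFT-ARC": 1.5, "RIGHT-ARC": 0.7, "REDUCE": -0.3}
legal = ["SHIFT", "LEFT-ARC", "RIGHT-ARC", "REDUCE"]
zero_cost = ["SHIFT", "RIGHT-ARC"]
update, follow = exploration_step(scores, legal, zero_cost, explore=True)
```

In this example the model prefers LEFT-ARC, which is not among the zero-cost transitions, so the update promotes RIGHT-ARC (the highest-scoring zero-cost transition) and demotes LEFT-ARC. With explore=True the parser nevertheless continues from the erroneous state, which is where the dynamic oracle is needed in the next step.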
More recent work on dynamic oracles has primarily been concerned with developing dynamic oracles for other transition systems. Goldberg et al. (2014) present dynamic oracles for the ArcStandard system and the LR-spine parser (Sartorio et al., 2013). Dynamic oracles have also been defined for some transition systems that allow for a restricted amount of non-projectivity (Gómez-Rodríguez et al., 2014; Gómez-Rodríguez et al., 2018).
Gómez-Rodríguez and Fernández-González (2015) present a non-deterministic oracle for Covington's (2001) unrestricted parsing algorithm, which could be argued to be transition-based (Nivre, 2008) but is strictly slower, with an O(n^2) time complexity with regard to the input.
Finally, as an alternative to dynamic oracles, recent work has also focused on developing approximate dynamic oracles using machine learning techniques. The basic idea is to use machine learning to try to decide what the latent transitions should be. This involves ideas such as searching for the best transitions in the presence of mistakes (Straka et al., 2015), or trying to directly learn a function that returns the highest possible score in the presence of mistakes (Le and Fokkens, 2017). Recently, Yu et al. (2018) combined these ideas using reinforcement learning, where a function is learned using the gold standard tree as features.
4.7 Conclusion
In this chapter we have studied the SwapStandard transition system and analyzed its spurious ambiguities. This has enabled us to create new oracles that could be used for training parsers. We developed two non-deterministic oracles that can be used for learning latent transition sequences. One of these oracles provides the full space of all transition sequences, whereas the other is restricted to the specific ambiguity between Swap and Shift. Additionally, we solved the open problem of creating a static oracle that minimizes the number of Swap transitions required for non-projective parsing.
In terms of the framework from Chapter 2, the primary functionality applied in this chapter is the ability to use latent structures for learning. We abstained from an in-depth analysis of the different update methods, a topic that we saw in the previous chapter and will return to in the next one. However, at this point we can briefly mention that the update methods do not play an important role for sequences of the length seen in this chapter.
The primary question for the empirical evaluation was whether latent transition sequences can be used to improve the performance of a transition-based parser. We considered this question using both a greedy, classifier-based parser, which has previously been studied with positive results for other systems, and a beam search parser, for which this question has not previously been considered. The experimental results show that the answer to this question depends on whether greedy or beam search is used: in the greedy case, the latent transitions help, whereas for beam search the performance is roughly the same as when using a static oracle. In a broad sense, the conclusion is that the non-deterministic oracles are never harmful compared to their static counterparts, although they sometimes do not yield any improvements either.
A secondary result of our experiments is a thorough comparison of all static and non-deterministic oracles. The general result could be summed up by saying that fewer swaps in the training sequences tend to improve the performance. This is apparent from the comparison between the static oracles EAGER and LAZY, corroborating and extending Nivre et al.'s (2009) results. While the static MINIMAL oracle theoretically reduces the number of swaps even more, the reduction is in practice rather small due to the nature of the treebanks, since the difference in swaps when moving from LAZY to MINIMAL is rather minor. The results from the experiments with non-deterministic oracles also point to the fact that fewer swaps lead to better performance. We saw this in the analysis of Hungarian, where the ND-ALL oracle had a tendency to overswap greatly, yielding more swaps and also worse results than the EAGER oracle.
Chapter 5
Joint Sentence Segmentation and Dependency Parsing
5.1 Introduction
In the previous chapter we studied the utility of non-deterministic oracles for transition-based dependency parsing. The novelty with respect to previous work was that we used the idea of latent structures for learning search-based transition-based dependency parsers. In this chapter we will look at another aspect that the framework from Chapter 2 is concerned with: the update methods required and their importance vis-à-vis the length of the sequences that need to be learned. We will extend the dependency parsing task to not just parse single sentences, but to parse a sequence of tokens (i.e., a document), where the beginnings and ends of the sentences are not known.
The default approach to parsing documents is to build a pipeline of NLP components, solving a number of sub-tasks sequentially. Such a pipeline would start with a sentence boundary detector which splits the input document into sentences. Then, each sentence would be fed through a tokenizer followed by a part-of-speech tagger and morphological analyzer, and only then would the parser step in. When working with carefully copy-edited text documents, sentence boundary detection can be viewed as a minor preprocessing task in such a pipeline, solvable with very high accuracy. However, when dealing with the output of automatic speech recognition or "noisier" texts such as blogs and emails, non-trivial sentence segmentation issues do occur. Dridan and Oepen (2013), for example, show that fully automatic preprocessing can result in considerable drops in parsing quality when moving from well-edited to less-edited text.
Two possible strategies for addressing this problem are (i) to exploit other cues for sentence boundaries, such as prosodic phrasing and intonation in speech (Kolář et al., 2006) or formatting cues in text documents (Read et al., 2012), and (ii) to emulate the human ability to exploit syntactic competence for segmentation. By coupling the prediction of sentence boundaries with syntax we will aim for the latter. The basic intuition is that segmentations that would give rise to suboptimal syntactic structures will also be more difficult to parse. Therefore, erroneous segmentations can be caught early during search and discarded in favor of segmentations where the syntactic structure receives a high score by the parsing model.
Our technical approach will be to extend the transition system from the previous chapter to predict sentence boundaries and syntax jointly. We will add a dedicated transition that marks sentence boundaries and augment the states to keep track of this information. We also characterize the preconditions on the transitions that are necessary to keep the resulting output well-formed.
Although this joint system and, consequently, the machine-learning problem are by and large similar to what we saw in the previous chapter, they differ strongly in terms of the length of the transition sequences. We instantiate the framework from Chapter 2 similarly as in the previous chapter, using both a greedy, classifier-based model as a baseline and a beam search parser as the contrastive system. We evaluate the update methods for the approximate search setting and find, similarly to the results on coreference resolution, that the update methods that discard training data are inadequate for this problem as they fail to outperform the baseline. However, when we apply DLaSO we find that the beam search parser outperforms the baseline.
From the computational linguistics perspective, the joint system allows us to test the utility of syntax when predicting sentence boundaries. Specifically, it lets us measure the influence of syntactic information on the prediction of sentence boundaries against a pipeline baseline where both tasks are performed independently of each other. With a thoughtful selection of data sets and baselines for the sentence segmentation problem, we demonstrate empirically that syntactic information is helpful for sentence segmentation and can compensate to a large extent for missing or unreliable cues from punctuation.
For our analysis, we use the Wall Street Journal as the standard benchmark set and as a representative for copy-edited text. We also use the Switchboard corpus of transcribed dialogues as a representative for data where punctuation cannot give clues to a sentence boundary predictor. Other types of data that may exhibit this property to varying degrees are web content data, e.g. forum posts or chat protocols, or (especially historical) manuscripts. While the Switchboard corpus gives us a realistic scenario for a setting with unreliable punctuation, the syntactic complexity of telephone conversations is rather low compared to the Wall Street Journal. Therefore, as a controlled experiment for assessing how far syntactic competence alone can take us if we stop trusting punctuation and capitalization entirely, we also perform joint sentence boundary detection/parsing on a lower-cased, no-punctuation version of the Wall Street Journal.
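To make the controlled setting concrete, the following sketch shows one way a lower-cased, no-punctuation token stream could be derived from segmented text. It is an illustration under stated assumptions and not the preprocessing actually used for the experiments: the function name strip_surface_cues is hypothetical, punctuation is identified purely by surface form (tokens consisting only of ASCII punctuation characters, which also removes symbols such as "$"), and the dependency annotation, whose head indices would have to be renumbered after removing tokens, is ignored.

```python
import string


def strip_surface_cues(sentences):
    """Flatten segmented text into one token stream without surface cues.

    sentences -- list of sentences, each a list of token strings
    Returns (tokens, starts): the lower-cased, punctuation-free token stream
    and the indices in that stream at which a gold sentence begins.
    """
    tokens, starts = [], []
    for sentence in sentences:
        # Assumption: a token is punctuation if every character is ASCII
        # punctuation; a treebank's part-of-speech tags could be used instead.
        kept = [tok.lower() for tok in sentence
                if not all(ch in string.punctuation for ch in tok)]
        if kept:  # a sentence consisting only of punctuation disappears
            starts.append(len(tokens))
            tokens.extend(kept)
    return tokens, starts


doc = [["The", "cat", "sat", "."], ["``", "It", "purred", ",", "''", "she", "said", "."]]
tokens, starts = strip_surface_cues(doc)
```

The list starts holds the gold sentence boundaries that a joint system would have to recover from the token stream alone, without help from capitalization or punctuation.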