7.4.2 Joint Models for Transition-based Dependency Parsers
Joint models for dependency parsing were pioneered by Hatori et al. (2011) for Chinese. They design a transition-based parser that performs part-of-speech tagging and parsing jointly. In this parser, the shift operation of the arc-standard decoder (Nivre 2008) is defined such that every time a token is pushed onto the stack, the parser also selects a part-of-speech tag for it. Instead of predicting a single shift transition, the parser now chooses among one shift transition per tag in the part-of-speech tag set. This increase in the number of transitions adds only a constant factor to the overall complexity of the parser, so the parser is not significantly slower, especially since part-of-speech tag sets for languages with no morphology are usually rather small. A problem that they encounter is that, due to the joint modeling, the parser has no information about the part-of-speech tags of tokens that it has not yet shifted. To deal with this, they introduce delayed features into the feature model, which postpone the evaluation of some features until all necessary information is available.
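The tag-specific shift transitions can be illustrated with a minimal sketch. It is not the implementation of Hatori et al. (2011): the tag set, the `State` class, and the function names are hypothetical, arcs are unlabeled, and the scoring model and delayed features are left out. It only shows that the transition inventory grows from three actions to one shift per tag plus the two arc actions.

```python
from dataclasses import dataclass, field

TAGSET = ["PN", "VV", "NN"]  # hypothetical toy tag set


def transitions(tagset):
    """Unlabeled arc-standard transitions with SHIFT split into one action per tag."""
    return [("SHIFT", tag) for tag in tagset] + [("LEFT-ARC", None), ("RIGHT-ARC", None)]


@dataclass
class State:
    words: list
    stack: list = field(default_factory=list)  # indices of tokens on the stack
    buffer_pos: int = 0                        # index of the next unshifted token
    tags: dict = field(default_factory=dict)   # token index -> tag, fixed at shift time
    heads: dict = field(default_factory=dict)  # dependent index -> head index


def apply(state, action):
    """Apply one transition in place and return the state."""
    name, tag = action
    if name == "SHIFT":
        # Shifting and tagging happen in the same step.
        state.tags[state.buffer_pos] = tag
        state.stack.append(state.buffer_pos)
        state.buffer_pos += 1
    elif name == "LEFT-ARC":
        # The second-topmost token becomes a dependent of the topmost token.
        dependent = state.stack.pop(-2)
        state.heads[dependent] = state.stack[-1]
    elif name == "RIGHT-ARC":
        # The topmost token becomes a dependent of the second-topmost token.
        dependent = state.stack.pop()
        state.heads[dependent] = state.stack[-1]
    return state


state = State(words=["she", "sleeps"])
for action in [("SHIFT", "PN"), ("SHIFT", "VV"), ("LEFT-ARC", None)]:
    apply(state, action)
print(state.tags, state.heads)
```

Tokens that are still in the buffer have no entry in `tags`, which is the situation that motivates delayed features.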
This parser was soon extended to integrate word segmentation into the model, thus performing the full joint task of segmentation, part-of-speech tagging, and dependency parsing (Hatori et al. 2012, Li and Zhou 2012). The full joint parsers operate on the character level and form words by means of an append action, an additional operation in the transition system that concatenates characters. The first models worked with an ad-hoc representation of the inner structure of words. Zhang et al. (2014a) showed that linguistically motivated word-internal structures yield even better results.
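How an append action builds words from characters can be sketched as follows. This is a simplified illustration, not the transition system of the cited parsers: it assumes that a shift starts a new word from the next character and assigns its tag, that an append attaches the next character to the most recently started word, and it omits the arc transitions entirely. The function name `segment` is hypothetical.

```python
def segment(chars, actions):
    """Replay SHIFT(tag)/APPEND actions over a character sequence.

    Returns a list of (word, tag) pairs. Arc transitions are omitted.
    """
    words = []
    pos = 0
    for name, tag in actions:
        if name == "SHIFT":
            # Start a new word with the next character and fix its tag.
            words.append([chars[pos], tag])
        elif name == "APPEND":
            if not words:
                raise ValueError("APPEND needs a word that has already been started")
            # Concatenate the next character onto the most recent word.
            words[-1][0] += chars[pos]
        else:
            raise ValueError(f"unknown action: {name}")
        pos += 1
    return [tuple(word) for word in words]


print(segment("我喜欢", [("SHIFT", "PN"), ("SHIFT", "VV"), ("APPEND", None)]))
```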
The joint transition-based parsers that were developed for Chinese were quickly adapted to other languages as well. As we have argued in this dissertation, joint models are interesting for languages with rich morphology because they can model the interaction between morphology and syntax. Bohnet and Nivre (2012) define a parser similar to the one in Hatori et al. (2011) but extend it to non-projective parsing in order to handle the free word order in morphologically rich languages.
Bohnet et al. (2013) extend the parser further to include the prediction of morphological features in the model. However, including morphological features is not as straightforward as including the part-of-speech tags was for the Chinese parsers because morphological tag sets are much larger than part-of-speech tag sets. Simply having the parser choose for each shifted word a morphological tag out of a set of potentially more than 1,000 tags has a big impact on the run-time of the parser. Bohnet et al. (2013) therefore provide the parser with an n-best list of possible tags for each word, which they predict with a standard sequence model. They find that keeping at most two tags for each word already gives them the best results. It may be surprising that two tags are enough, but today’s sequence models for predicting part-of-speech tags or morphology are quite good. Keeping a list of the two best tags essentially means that one trusts the sequence model to rank the correct candidates high but wants to keep some options open so that the parser can make the final decision when it sees more context. That the preprocessing has to be good at ranking the correct alternatives high was already found by Cohen and Smith (2007) for their Hebrew constituency parser (see above).
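The effect of the n-best list on the transition inventory can be sketched as follows. The function `candidate_shifts`, the tag strings, and the scores are hypothetical and stand in for the output of whatever sequence model is used; the sketch only shows that the parser considers k shift transitions per word instead of one per tag in the full tag set.

```python
def candidate_shifts(tag_scores, k=2):
    """Return SHIFT actions only for the k best-scoring tags of one word.

    tag_scores maps each candidate tag to the score that a sequence
    model (assumed to exist) assigned to it for this word.
    """
    ranked = sorted(tag_scores.items(), key=lambda item: item[1], reverse=True)
    return [("SHIFT", tag) for tag, _ in ranked[:k]]


# Hypothetical tagger output for a single word.
scores = {"N|nom|sg": 0.70, "N|acc|sg": 0.25, "N|dat|sg": 0.05}
print(candidate_shifts(scores))
```

With k = 2, the parser chooses between two shift transitions for this word, regardless of how large the full morphological tag set is.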
Bohnet et al. (2013) encounter another problem related to rich morphology in their beam-search decoder: the beam quickly loses variants that differ with respect to the part-of-speech tags or morphological features and only keeps structural variants around. They avoid this effect by reserving a portion of the beam to be filled by morphological variants exclusively. The problem seems related to one encountered by Zhang et al. (2014a), who weight the features for segmentation and part-of-speech tagging four times as high as the parsing features because otherwise the parsing features dominate the feature model due to their larger number.
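One way to picture a beam with reserved slots is the sketch below. The exact scheme of Bohnet et al. (2013) is not described here, so this is only one plausible reading: the unreserved part of the beam is filled purely by score, and the reserved slots go to the best remaining candidates whose tag sequence is not yet represented in the beam. The function `prune` and the candidate dictionaries are hypothetical.

```python
def prune(candidates, beam_size, reserved):
    """Keep beam_size candidates, holding `reserved` slots for new tag sequences.

    Each candidate is a dict with a numeric "score" and a hashable "tags"
    entry (the tag sequence chosen so far). If there are not enough
    distinct tag sequences, the reserved slots stay empty.
    """
    ranked = sorted(candidates, key=lambda c: c["score"], reverse=True)
    split = beam_size - reserved
    beam = ranked[:split]                 # filled by score alone
    seen_tags = {c["tags"] for c in beam}
    for cand in ranked[split:]:
        if len(beam) == beam_size:
            break
        if cand["tags"] not in seen_tags:  # admit only new morphological variants
            beam.append(cand)
            seen_tags.add(cand["tags"])
    return beam


candidates = [
    {"score": 5.0, "tags": ("N", "V"), "arcs": "a1"},
    {"score": 4.0, "tags": ("N", "V"), "arcs": "a2"},
    {"score": 3.0, "tags": ("N", "V"), "arcs": "a3"},
    {"score": 2.0, "tags": ("V", "N"), "arcs": "a1"},
]
print([c["score"] for c in prune(candidates, beam_size=3, reserved=1)])
```

Without reserved slots, all three surviving candidates in this toy example share the same tag sequence and differ only in their arcs, which is the effect described above.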
The parsers discussed in this section were developed for languages that have rich morphology but do not have a word segmentation problem as in Hebrew, Arabic, Turkish, or Chinese. However, it is straightforward to adapt them. In essence, the append action that is used in Chinese to form words from characters can be used to split words into smaller parts. A parser that uses this idea is described in Tratz (2013) for parsing Arabic. The parser is based on the easy-first decoding algorithm (Shen and Joshi 2008, Goldberg and Elhadad 2010a) and differs from standard transition-based algorithms in that it can operate on any pair of words in the sentence. The parser defines additional operations like part-of-speech tagging, morphological tagging, and affix splitting. The operations are ordered such that the parser can only link two tokens with a dependency relation if both have already received a part-of-speech tag. The same idea was implemented in a parser for Chinese for joint part-of-speech tagging and parsing (Ma et al. 2012).
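The ordering constraint between tagging and attachment can be sketched as a precondition check on the available actions. This is not the parser of Tratz (2013): the action names and the function `valid_actions` are hypothetical, morphological tagging and affix splitting are omitted, and the sketch makes the simplifying assumption that attachments are proposed between neighboring pending tokens.

```python
def valid_actions(tokens):
    """List the actions that are currently allowed on the pending tokens.

    tokens is a list of dicts with a "form" and a "tag" entry, where
    "tag" is None until a TAG action has been applied to the token.
    """
    actions = []
    for i, token in enumerate(tokens):
        if token["tag"] is None:
            actions.append(("TAG", i))
    for i in range(len(tokens) - 1):
        left, right = tokens[i], tokens[i + 1]
        # Attachment is only allowed once both tokens have been tagged.
        if left["tag"] is not None and right["tag"] is not None:
            actions.append(("ATTACH-LEFT", i))
            actions.append(("ATTACH-RIGHT", i))
    return actions


pending = [
    {"form": "the", "tag": "DT"},
    {"form": "old", "tag": None},
    {"form": "house", "tag": "NN"},
]
print(valid_actions(pending))
```

In this example only the tagging action for the middle token is available, because every neighboring pair still contains an untagged token.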
There is some work on standard dependency parsing for Hebrew (Goldberg and Elhadad 2009, 2010a, Goldberg 2011), but little work has been done on lattices. De La Clergerie (2013) reports on experiments with a transition-based dependency parser for lattices. The results for the joint model are, however, slightly behind those for the pipeline baseline. Köhn et al. (2014) conduct experiments with TurboParser on n-best paths that they predict from Hebrew lattices. Although this is technically not lattice parsing, they show that the parser is able to select better paths from the n-best list than a ranking model without syntactic features.