3.4 Modeling Choices
3.4.3 Joint Modeling of Morphology and Syntax
As we have seen, the division of tokenization, morphological and syntactic analysis into separate steps creates a number of problems when applied to languages with a rich morphological system. Error propagation in the pipeline is aggravated by the lower performance of the early steps like tokenization and morphological analysis, which in turn is caused by a lack of syntactic information. A natural consequence of this observation is then to abandon this separation and to develop models that allow an exchange of information between the different levels. Joint models can resolve the mutual dependency between morphology and syntax elegantly by allowing both to influence the outcome of the other.
The hypothesis that we will be exploring and arguing for in the following chapters is that models that jointly predict morphology and syntax are better than models that separate them. To qualify this, we argue that the availability of the syntactic structure and the possibility to directly model interaction between morphology and syntax, e.g., agreement, allows a joint model to avoid errors caused by syncretism, unreliable sequential context due to free word order, and segmentation ambiguity. Since these problems occur frequently in languages with rich morphology, this hypothesis entails that joint models are better suited for parsing such languages.
Joint models for parsing have been explored before both for constituency parsing and dependency parsing (see Chapter 7 for a detailed account). These parsers predict the morphological features of words (and, e.g., the segmentation into tokens) while simultaneously predicting the syntactic structure of the sentence. The typical challenge is to provide the additional information in an efficient way since simply combining the two tasks directly quickly results in an intractable model.
In the remainder of this dissertation, we approach the topic of joint models for morphologically rich languages in four steps. We start by having a closer look at parsing with a pipeline model. In Chapter 4, we experiment with a state-of-the-art dependency parser in a pipeline setup, analyzing the parser’s output for three morphologically rich languages: Czech, German, and Hungarian. The analysis demonstrates that syncretism in the morphological system of a language directly causes parsing errors in a pipeline setup. In Chapter 5, we continue by showing that direct access to the syntactic structure of a sentence can improve the prediction of morphological features. Furthermore, we find that the syntactic information from the parser complements the information that is provided by language-specific lexicons. In Chapter 6, we then design a joint parser that models interaction between morphology and syntax explicitly by imposing constraints on the syntactic structure. The constraints implement morphosyntactic rules and act as a filter on the search space of the parser. The constrained model outperforms its unconstrained baseline as well as a state-of-the-art pipeline parser. In Chapter 7, we address the segmentation problem by designing an efficient graph-based dependency parser for morphological lattices that performs segmentation, morphological analysis, and parsing jointly. We test the parser on Turkish and Hebrew and show that it outperforms three state-of-the-art pipeline systems.
Chapter 4
Error Propagation between Morphology and Syntax
The standard architecture for parsing systems is the pipeline, where parsing is broken down into tokenization, morphological analysis, and the actual parsing. The individual steps are applied one after the other, with each step feeding its output as information to the following one. Pipelines are efficient but may suffer from error propagation because errors introduced in an early step cannot be corrected later on, though the problem might be alleviated by using jackknifing.
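To make the jackknifing idea concrete, the following is a minimal sketch, assuming the common usage of the term: the training data for a later pipeline step is annotated with predicted rather than gold tags, where each sentence is tagged by a model that was not trained on it. The toy most-frequent-tag tagger and the names `train_tagger` and `jackknife` are illustrative assumptions and not the setup used in the experiments.

```python
from collections import Counter, defaultdict


def train_tagger(sentences):
    """Toy most-frequent-tag tagger; a hypothetical stand-in for a real
    morphological tagger. Sentences are lists of (form, tag) pairs."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        for form, tag in sentence:
            counts[form][tag] += 1
    lexicon = {form: c.most_common(1)[0][0] for form, c in counts.items()}
    # Unseen forms receive the placeholder tag "UNK".
    return lambda forms: [lexicon.get(form, "UNK") for form in forms]


def jackknife(sentences, k):
    """Predict tags for every sentence with a tagger trained on the other
    k-1 folds, so that no sentence is tagged by a model that has seen it."""
    predicted = [None] * len(sentences)
    for fold in range(k):
        # Sentence i belongs to fold i % k.
        train = [s for i, s in enumerate(sentences) if i % k != fold]
        tagger = train_tagger(train)
        for i in range(fold, len(sentences), k):
            forms = [form for form, _ in sentences[i]]
            predicted[i] = tagger(forms)
    return predicted
```

A parser trained on the output of `jackknife` sees tagging errors of roughly the kind it will encounter at test time, instead of the unrealistically clean gold annotation.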
In Chapter 3, we motivate joint models by claiming that error propagation in pipelines becomes a more serious problem when parsing morphologically rich languages. This is because the morphological analysis makes more mistakes due to syncretism and insufficient information, and these mistakes then cause follow-up mistakes in the parsing step. In this chapter, we support this claim empirically by showing that certain mistakes of the morphological analysis correlate with parsing errors and that these mistakes originate in the syncretism in the morphological system of the language that is being parsed.
For the analysis, we run a state-of-the-art pipeline system on data from three different languages: Czech, German, and Hungarian. All three languages belong to the broad category of morphologically rich languages. Czech and German are both Indo-European languages, Czech from the Slavonic branch and German from the Germanic branch. Hungarian, on the other hand, is a Finno-Ugric language of the Ugric branch. Syntactically, they all use a case system to mark the function of verbal arguments. However, the morphological realization of case systems in the three languages shows important differences. Both Czech and German are fusional languages, where multiple morphological categories are fused into one inflection suffix. For example, nominal inflection suffixes in Czech and German signal gender, number, and case values simultaneously. Hungarian, on the other hand, is an agglutinative language. Every morphological feature is signaled by its own morpheme and the different morphemes are chained at the end of the word.
For us, however, the important difference between the morphological systems of these languages is that Czech and German show widespread syncretism in their inflection paradigms whereas Hungarian inflection suffixes are for the most part unambiguous. With German and Czech on one side and Hungarian on the other, we can observe the effect of syncretism on the parsing system in Czech and German and use Hungarian as a control experiment. Because case is the prime example of a morphosyntactic feature, the analysis focuses on case and the grammatical functions that are marked by it.
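As an illustration of what syncretism means for a morphological analyzer, the sketch below inverts the paradigm of the German definite article, mapping each surface form back to the feature combinations it can express. The example and the names `PARADIGM`, `analyses_by_form`, and `possible_cases` are our own illustrative choices and are not taken from the experimental setup; the paradigm conflates gender and number into a single value for brevity.

```python
from collections import defaultdict

# German definite article: (gender or plural, case) -> surface form.
PARADIGM = {
    ("masc", "nom"): "der", ("masc", "acc"): "den",
    ("masc", "dat"): "dem", ("masc", "gen"): "des",
    ("fem", "nom"): "die", ("fem", "acc"): "die",
    ("fem", "dat"): "der", ("fem", "gen"): "der",
    ("neut", "nom"): "das", ("neut", "acc"): "das",
    ("neut", "dat"): "dem", ("neut", "gen"): "des",
    ("pl", "nom"): "die", ("pl", "acc"): "die",
    ("pl", "dat"): "den", ("pl", "gen"): "der",
}


def analyses_by_form(paradigm):
    """Invert the paradigm: surface form -> list of feature combinations."""
    inverse = defaultdict(list)
    for features, form in paradigm.items():
        inverse[form].append(features)
    return dict(inverse)


def possible_cases(form, paradigm=PARADIGM):
    """Return the sorted case values that a surface form is compatible with."""
    return sorted({case for (_, case), f in paradigm.items() if f == form})


if __name__ == "__main__":
    for form, analyses in sorted(analyses_by_form(PARADIGM).items()):
        print(form, len(analyses), possible_cases(form))
```

Six surface forms cover sixteen paradigm cells here, so a form such as "die" cannot be assigned a case from the word form alone: it is compatible with both nominative and accusative, which is the distinction between subject and direct object. In a paradigm without syncretism, every form would map to exactly one cell.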
The chapter starts in Section 4.1 by describing the experimental setup. In Sections 4.2 and 4.3 we then analyze the quality of the morphological and syntactic annotation and demonstrate the error propagation in the parser. Section 4.4 concludes with a discussion of the results of the analysis.