
correct case value must be accusative case as this is the case assigned to the direct object of the verb. The valency of the verb thus helps in disambiguating the morphological features of the ambiguous word form.

můj bratr navštíví měst-a

my.NOM.SG.M brother.NOM.SG.M visit-3SG.PRS city-{NOM,ACC}.PL.N

{nsubj,dobj} nsubj

nmod

Figure 3.4: Verb valency helps to disambiguate morphologically ambiguous word forms.
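The interaction shown in Figure 3.4 can be sketched in a few lines of Python. This is a minimal illustration, not an actual parser: the candidate analyses, the valency frame, and the function name `disambiguate` are hand-written assumptions that only mirror the glosses in the figure.

```python
# Toy illustration: a verb's valency frame filters ambiguous case analyses.
# All entries below are hypothetical, hand-written data mirroring Figure 3.4.

# Candidate morphological analyses per word form.
CANDIDATES = {
    "bratr": [{"case": "NOM", "num": "SG"}],
    "města": [{"case": "NOM", "num": "PL"}, {"case": "ACC", "num": "PL"}],
}

# Assumed valency frame: nominative subject, accusative direct object.
VALENCY = {"navštíví": {"nsubj": "NOM", "dobj": "ACC"}}


def disambiguate(verb, role, word):
    """Keep only the analyses of `word` whose case matches the case
    that `verb` assigns to its dependent in `role`."""
    required = VALENCY[verb][role]
    return [a for a in CANDIDATES[word] if a["case"] == required]


result = disambiguate("navštíví", "dobj", "města")
print(result)  # [{'case': 'ACC', 'num': 'PL'}]
```

Once the syntactic role of the word is fixed to direct object, only the accusative reading survives. If the role were still undecided, both readings would remain, which is the interdependency the figure illustrates.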

3.3 Ambiguity in Word Segmentation

Syncretism is a particular form of morphological ambiguity that is restricted to the word forms within the inflection paradigm of a single lemma. But word forms are often ambiguous across lemmas as well. Take, for example, the English verb bear (to carry/support) and the English noun bear (an animal, e.g., a polar bear). The two words are homonyms, i.e., they are pronounced and written the same way but have different (and unrelated) meanings.4

In languages with rich morphology, homonymy also arises frequently from composition and derivation processes. Take, for example, the Turkish word çekti in Examples 3.3 and 3.4. The word can be segmented into different combinations of basic morphemes, and each distinct segmentation gives rise to an entirely different interpretation of the word. Note that this ambiguity is orthogonal to the ambiguity introduced by syncretism: each of these forms can also be (and in Turkish often is) syncretic within its inflection paradigm.

(3.3) çekti
pull.3SG.PAST
‘it pulled’

4 In text-based natural language processing, homography, i.e., being written the same way, is usually enough to create problems even if the words are pronounced differently (cf. access as a verb and access as a noun).

40 3 Motivation

(3.4) çek -ti
cheque.NOM exist.3SG.PAST
‘it was a cheque’

Turkish has a very productive morphology and often forms complex words that involve multiple steps of derivation interlaced with syntactic structure. In order to make the underlying syntactic relations visible, the Turkish treebank (Oflazer et al. 2003a) annotates dependency structure not over words but over sub-units of words. Figure 3.5 shows an example from Eryiğit et al. (2008) that demonstrates the syntactic annotation in the Turkish treebank. Words are shown within solid frames. The dependency arcs in the example connect sub-units of words rather than the words themselves. Oflazer et al. (2003a) call these sub-units of a word Inflectional Groups (IGs), which are separated by Derivational Boundaries. The semantic root and the derivational morphemes in a word are represented by different IGs, but an IG can contain additional inflection morphemes.

Figure 3.5: A Turkish sentence annotated in the style of the Turkish Dependency Treebank. Links are annotated on a sub-lexical level. The example is taken from Eryiğit et al. (2008). Note that this graphic follows a different convention than ours and draws dependency arcs pointing from the dependent to the head.

As an example, consider the second word in Figure 3.5, which contains two IGs. The first one, okul+da, consists of the stem okul and the inflectional suffix da, whereas the second IG, ki, is a derivational suffix that turns the word into an adjective. The first word, Bu, a determiner, depends syntactically on the first IG of the second word, which is a noun root (although the entire word is an adjective). As another example, take the last word in the sentence: it is a verb that was formed from a noun meaning girl. The suffix dır is the copula suffix that turns the noun into a verb. However, the word before the last means little and modifies the noun root of the verb rather than the verb itself.
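One way to picture this sub-lexical annotation is as a data structure in which dependency heads point at an IG inside a word rather than at the word as a whole. The following is a minimal sketch under our own assumptions: the class names `IG` and `Word`, the helper `head_of`, and the arc label `det` are hypothetical and are not taken from the treebank's actual file format. Only the first two words of the sentence are encoded.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class IG:
    """One inflectional group: a form, a part of speech, and an optional
    head given as (word index, IG index within that word)."""
    form: str
    pos: str
    head: Optional[Tuple[int, int]] = None
    label: Optional[str] = None  # arc label; the values used here are assumptions


@dataclass
class Word:
    igs: List[IG]

    @property
    def surface(self) -> str:
        # The surface word is the concatenation of its IGs.
        return "".join(ig.form for ig in self.igs)


# "Bu okuldaki ...": the determiner attaches to the noun IG of the
# second word, not to the adjective that the whole word forms.
sentence = [
    Word([IG("Bu", "Det", head=(1, 0), label="det")]),
    Word([IG("okulda", "Noun", head=(1, 1), label="deriv"),
          IG("ki", "Adj")]),  # head omitted: rest of the sentence not encoded
]


def head_of(sent, w, i):
    """Return (surface form of the head word, form of the head IG), or None."""
    h = sent[w].igs[i].head
    if h is None:
        return None
    hw, hi = h
    return sent[hw].surface, sent[hw].igs[hi].form


print(head_of(sentence, 0, 0))  # ('okuldaki', 'okulda')
```

Addressing heads by a pair of indices keeps the word boundary visible while still allowing an arc to target the noun root inside a derived adjective.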

Splitting words into IGs and annotating syntax between them would not be remarkable if the decision of how to split a word were always straightforward. But Turkish words can be highly ambiguous, and context is often needed to resolve a segmentation ambiguity. Figure 3.6 shows the word çekti from above in a sentence. The word açık is an adjective and means blank. In the Turkish treebank, adjectives normally modify nouns but not verbs. The presence of açık in the sentence therefore makes the interpretation in Example 3.4 more likely than the one in Example 3.3. The syntactic context thus helps to arrive at the correct segmentation of a word. However, the correct syntactic structure can be found only if the segmentation of the words is correct.

Açık çek -ti

blank cheque.NOM exist.3SG.PAST

amod deriv

Figure 3.6: Syntactic context disambiguates the segmentation of çekti.
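The effect of context in Figure 3.6 can be imitated with a toy filter over candidate segmentations. The sketch below is a simplification under stated assumptions: the lexicon `SEGMENTATIONS`, the tag names, and the functions `compatible` and `segment` are hypothetical, and the preference described in the text ("more likely") is hardened into a strict constraint for the sake of brevity.

```python
# Hypothetical toy lexicon: each segmentation is a list of (form, tag) units.
SEGMENTATIONS = {
    "çekti": [
        [("çekti", "Verb")],                # 'it pulled' (Example 3.3)
        [("çek", "Noun"), ("ti", "Verb")],  # 'it was a cheque' (Example 3.4)
    ],
}


def compatible(prev_tag, segmentation):
    """Simplifying assumption: an adjective modifies a noun but not a verb,
    so directly after an adjective the first unit must be a noun."""
    first_tag = segmentation[0][1]
    return prev_tag != "Adj" or first_tag == "Noun"


def segment(word, prev_tag=None):
    """Return the segmentations of `word` that fit the preceding tag."""
    return [s for s in SEGMENTATIONS[word] if compatible(prev_tag, s)]


print(len(segment("çekti")))    # 2: both readings survive without context
print(segment("çekti", "Adj"))  # [[('çek', 'Noun'), ('ti', 'Verb')]]
```

Without context, both segmentations remain. After an adjective such as açık, only the noun-based segmentation is kept. A realistic system would score the alternatives jointly with the syntactic structure rather than filter them by a hard rule.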

Since the basic units in the Turkish treebank are separated by derivational boundaries, ambiguous segmentation of a given word means that there is more than one derivational structure for this word. But ambiguity in word segmentation can also arise from other sources, e.g., from orthography. In Modern Hebrew, written words can be ambiguous with respect to their segmentation into meaningful units because there are eight common prepositions, articles, and conjunctions that are always attached to the following word (Goldberg and Elhadad 2013). This process is recursive, so that several such affixes can be attached in sequence. However, it is not always immediately obvious from a word form whether there is such affixation or not. Goldberg and Elhadad (2013) give Example 3.5 to demonstrate the ambiguity. The displayed word can either mean onion, or it can mean in the shadow if the first character (read from right to left) is interpreted as a preposition affix.

(3.5) בצל ‘onion’ or ב-צל ‘in the shadow’

(Goldberg and Elhadad 2013: 123)
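The recursive attachment of affixes means that a written form has to be checked against every possible sequence of stripped prefixes. The following sketch enumerates the candidates for Example 3.5. It is an illustration under explicit assumptions: `PREFIXES` is a small toy subset of attachable particles rather than the full inventory, `LEXICON` contains only the two stems needed here, and the function name `segmentations` is our own. Hebrew strings are written as Unicode escapes to avoid display-order problems.

```python
# Toy subset of attachable one-letter particles (assumption, not the full list):
# bet 'in', he 'the', vav 'and', shin 'that'.
PREFIXES = {"\u05d1", "\u05d4", "\u05d5", "\u05e9"}

# Toy lexicon (assumption): 'onion' (bet-tsadi-lamed) and 'shadow' (tsadi-lamed).
ONION = "\u05d1\u05e6\u05dc"
SHADOW = "\u05e6\u05dc"
LEXICON = {ONION, SHADOW}


def segmentations(word):
    """Return every split of `word` into zero or more prefixes followed by
    a lexicon entry. Prefix stripping is applied recursively, so several
    prefixes in sequence are handled as well."""
    results = []
    if word in LEXICON:
        results.append([word])
    if len(word) > 1 and word[0] in PREFIXES:
        for rest in segmentations(word[1:]):
            results.append([word[0]] + rest)
    return results


analyses = segmentations(ONION)
print(len(analyses))  # 2: the unsegmented noun and prefix + 'shadow'
```

The surface string alone yields both analyses, so the choice between them has to come from the morphological and syntactic context of the word.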

To illustrate the interaction between segmentation, morphology, and syntax in Hebrew, Cohen and Smith (2007: 2) give an example of a Hebrew sentence that can be interpreted in different ways depending on the segmentation of a specific word. The two interpretations of this sentence are repeated in Examples 3.6 and 3.7. The word in question is the sixth word of the sentence (counted from right to left). While the segmentation of the word is decided locally, the morphological and syntactic interpretation of the first and the last word depends on this decision.

(3.6) (words listed in reading order, one per line; segments are separated by +)
ה+רועה    the+shepherd    MASC
ב+ה+אחו    in+the+meadow
ה+ירוק    the+green
ה+גדול    the+big
ו+ה+מרוחק    and+the+distant
ש+רועה    that+shepherds    VB+MASC
שם    there
יפה    is-beautiful    ADJ+MASC

‘The shepherd in the big green distant meadow who shepherds there is beautiful.’

(Cohen and Smith 2007: 2)

(3.7) (words listed in reading order, one per line; segments are separated by +)
ה+רועה    the+shepherd    FEM
ב+ה+אחו    in+the+meadow
ה+ירוק    the+green
ה+גדול    the+big
ו+ה+מרוחק    and+the+distant
שרועה    is-lying    VB+FEM
שם    there
יפה    nicely    ADV

‘The shepherdess in the big green distant meadow is lying there nicely.’

(Cohen and Smith 2007: 2)

The difference between Turkish and Hebrew with respect to segmentation ambiguity is that in Turkish, the segments of a word all still belong to the same syntactic unit, whereas in Hebrew, the different segments of a word may belong to entirely different syntactic contexts. In Hebrew, the eight affixes mentioned above are always attached to the following word, regardless of what this word is and whether the two belong together syntactically. For example, the attached affix could be the subordinating conjunction of a subordinate clause, yet be written as part of a word outside of that clause. In Turkish, an inflectional group always belongs morphologically and syntactically to the word it is part of, because segmentation in the Turkish treebank represents the derivational genesis of this word. In both cases, however, there is an interdependency between the morphological and the syntactic interpretation of a word and its parts.