5.1.2 Mapping the Irish Dependency Scheme to the Universal Dependency Scheme (UD13)
The departure point for the design of the Universal Dependency (UD13) Treebanks (McDonald et al., 2013) was the Stanford typed dependency scheme (de Marneffe and Manning, 2008), which was adapted based on a cross-lingual analysis of six languages: English, French, German, Korean, Spanish and Swedish. As a result of this study, universal dependency treebanks were developed initially for these six languages, and subsequently for five more (Brazilian Portuguese, Finnish1, Indonesian, Italian and Japanese).2 Approaches to the development of these treebanks varied: the existing English and Swedish treebanks were automatically mapped to the new universal scheme, while the rest were developed manually to ensure consistency in annotation. The study also reports some structural changes (e.g. to coordination structures in the Swedish treebank).3
1The Finnish data was not available at the time of our experiments.
2Version 2 data sets downloaded from https://code.google.com/p/uni-dep-tb/
3There are two versions of the annotation scheme: the standard version (where copulas and ad-
There are 41 dependency relation labels to choose from in the universal annotation scheme.4 McDonald et al. (2013) use all labels in the annotation of the German and English treebanks. The remaining languages use varying subsets of the label set. In our study we map the Irish dependency annotation scheme to 30 of the universal labels. The mappings are given in Table 5.2.
UD13 Dependency Label Mappings

Universal    Irish
---------    -----
root         top
acomp        adjpred, advpred, ppred
adpcomp      comp
adpmod       padjunct, obl, obl2, obl_ag
adpobj       pobj
advcl        comp
advmod       adjunct, advadjunct, quant, advadjunct_q
amod         adjadjunct
appos        app
attr         npred
aux          toinfinitive
cc           NEW
ccomp        comp
compmod      nadjunct
conj         coord
csubj        csubj
dep          for
det          det, det2, dem
dobj         obj, vnobj, obj_q
mark         subadjunct
nmod         addr, nadjunct
nsubj        subj, subj_q
num          quant
p            punctuation
parataxis    comp
poss         poss
prt          particle, vparticle, nparticle, advparticle, vocparticle, particlehead, cleftparticle, qparticle, aug
rcmod        relmod
rel          relparticle
xcomp        xcomp
Table 5.2: Mapping of Irish Dependency Annotation Scheme to UD13 Annotation Scheme
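The mapping in Table 5.2 can be read as a lookup table. The following Python sketch is illustrative only and is not the conversion tooling used for the treebank. It transcribes the table into a dictionary, inverts it, and separates the Irish labels that map directly from those that need a case-by-case decision. The names UD13_TO_IDT, invert and map_label are hypothetical, and the underscore spelling of labels such as obl_ag is an assumption.

```python
from collections import defaultdict

# Table 5.2 transcribed as universal label -> Irish (IDT) labels.
# "cc" has no IDT counterpart (marked NEW in the table), hence the empty list.
UD13_TO_IDT = {
    "root": ["top"],
    "acomp": ["adjpred", "advpred", "ppred"],
    "adpcomp": ["comp"],
    "adpmod": ["padjunct", "obl", "obl2", "obl_ag"],
    "adpobj": ["pobj"],
    "advcl": ["comp"],
    "advmod": ["adjunct", "advadjunct", "quant", "advadjunct_q"],
    "amod": ["adjadjunct"],
    "appos": ["app"],
    "attr": ["npred"],
    "aux": ["toinfinitive"],
    "cc": [],
    "ccomp": ["comp"],
    "compmod": ["nadjunct"],
    "conj": ["coord"],
    "csubj": ["csubj"],
    "dep": ["for"],
    "det": ["det", "det2", "dem"],
    "dobj": ["obj", "vnobj", "obj_q"],
    "mark": ["subadjunct"],
    "nmod": ["addr", "nadjunct"],
    "nsubj": ["subj", "subj_q"],
    "num": ["quant"],
    "p": ["punctuation"],
    "parataxis": ["comp"],
    "poss": ["poss"],
    "prt": ["particle", "vparticle", "nparticle", "advparticle", "vocparticle",
            "particlehead", "cleftparticle", "qparticle", "aug"],
    "rcmod": ["relmod"],
    "rel": ["relparticle"],
    "xcomp": ["xcomp"],
}


def invert(mapping):
    """Turn universal -> [Irish] into Irish -> sorted [universal]."""
    inverse = defaultdict(set)
    for universal, irish_labels in mapping.items():
        for irish in irish_labels:
            inverse[irish].add(universal)
    return {irish: sorted(targets) for irish, targets in inverse.items()}


IDT_TO_UD13 = invert(UD13_TO_IDT)

# Irish labels with more than one possible universal target.
ONE_TO_MANY = {k: v for k, v in IDT_TO_UD13.items() if len(v) > 1}

# Universal labels that absorb several Irish labels (information is lost here).
MANY_TO_ONE = {k: v for k, v in UD13_TO_IDT.items() if len(v) > 1}


def map_label(irish_label):
    """Map an Irish label that has exactly one universal target."""
    targets = IDT_TO_UD13[irish_label]
    if len(targets) != 1:
        raise ValueError(f"{irish_label} needs a contextual decision: {targets}")
    return targets[0]
```

Under this transcription, comp, quant and nadjunct are the only Irish labels with more than one universal target, which matches the splits discussed in Section 5.1.2.2.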
As with the POS mapping discussed in Section 5.1.1, mapping the Irish dependency scheme to the universal scheme was relatively straightforward, due in part, perhaps, to a similar level of granularity suggested by the similar label set sizes (Irish 47; standard universal 41). That said, there were significant considerations made in the mapping process, which involved some structural change in the treebank and the introduction of more specific analyses in the labelling scheme. These are discussed below.
heads. We are using the standard version.
5.1.2.1 Structural Differences
The following structural changes were made manually before the dependency labels were mapped to the universal scheme.
coordination: The most significant structural change made to the Irish treebank was an adjustment to the analysis of coordination. The original Irish Dependency Treebank subscribes to the LFG coordination analysis, where the coordinating conjunction (e.g. agus ‘and’) is the head, with the coordinates as its dependents, labelled coord (see Figure 5.1 and refer to Section 3.1.2 for further discussion). The Universal Dependency Annotation scheme, on the other hand, uses right-adjunction, where the first coordinate is the head of the coordination, and the rest of the phrase is adjoined to the right, labelling coordinating conjunctions as cc and the following coordinates as conj (Figure 5.2).
coord det subj advpred top coord det subj advpred obl det pobj
Bhí an lá an-te agus bhí gach duine stiúgtha leis an tart
Be-PAST the day very-hot and be-PAST every person parched with the thirst
‘The day was very hot and everyone was parched with the thirst’
Figure 5.1: LFG-style coordination of original Irish Dependency Treebank.
top det subj advpred cc conj det subj advpred obl det pobj
Bhí an lá an-te agus bhí gach duine stiúgtha leis an tart
Be-PAST the day very-hot and be-PAST every person parched with the thirst
‘The day was very hot and everyone was parched with the thirst’
Figure 5.2: Stanford-style coordination changes to original Irish Dependency Treebank.
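The change from Figure 5.1 to Figure 5.2 was made manually in the treebank. Purely as an illustration, the same rule could be expressed as a small tree rewrite. In the Python sketch below, the Token class and the function to_right_adjunction are hypothetical, the head indices of the example sentence are assumed (the arcs of the figures are not recoverable from the text), and the treatment of any other dependents of the conjunction is also an assumption.

```python
from dataclasses import dataclass, replace


@dataclass
class Token:
    idx: int      # 1-based position in the sentence
    form: str
    head: int     # index of the head token, 0 for the root
    label: str


def to_right_adjunction(tokens):
    """Return a copy of the tree in which conjunction-headed coordination
    is rewritten so that the first coordinate is the head."""
    new = [replace(t) for t in tokens]  # copies; the input is left untouched
    for conjunction in new:
        coords = [t for t in new
                  if t.head == conjunction.idx and t.label == "coord"]
        if not coords:
            continue
        first = coords[0]
        later = {t.idx for t in coords[1:]}
        # The first coordinate takes over the conjunction's head and label.
        first.head, first.label = conjunction.head, conjunction.label
        # Assumption: every other dependent of the conjunction moves to the
        # first coordinate; the later coordinates are relabelled conj.
        for t in new:
            if t.head == conjunction.idx and t is not first:
                t.head = first.idx
                if t.idx in later:
                    t.label = "conj"
        # The conjunction itself becomes a cc dependent of the first coordinate.
        conjunction.head, conjunction.label = first.idx, "cc"
    return new


# Sentence of Figure 5.1 with assumed head indices.
lfg_style = [
    Token(1, "Bhí", 5, "coord"),
    Token(2, "an", 3, "det"),
    Token(3, "lá", 1, "subj"),
    Token(4, "an-te", 1, "advpred"),
    Token(5, "agus", 0, "top"),
    Token(6, "bhí", 5, "coord"),
    Token(7, "gach", 8, "det"),
    Token(8, "duine", 6, "subj"),
    Token(9, "stiúgtha", 6, "advpred"),
    Token(10, "leis", 6, "obl"),
    Token(11, "an", 12, "det"),
    Token(12, "tart", 10, "pobj"),
]

stanford_style = to_right_adjunction(lfg_style)
for tok in stanford_style:
    print(tok.idx, tok.form, tok.head, tok.label)
```

With these assumed heads, the first Bhí becomes the root (top), agus attaches to it as cc and the second bhí as conj, giving the label sequence shown in Figure 5.2.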
subordinate clauses: In the Irish Dependency Treebank, the link between a matrix clause and its subordinate clause is similar to that of LFG: the subordinating conjunction (e.g. mar ‘because’, nuair ‘when’) is a subadjunct dependent of the matrix verb, and the head of the subordinate clause is a comp dependent of the subordinating conjunction (Figure 5.3). In contrast, the universal scheme is in line with the Stanford analysis of subordinate clauses, where the head of the clause is dependent on the matrix verb, and the subordinating conjunction is a dependent of the clause head (Figure 5.4).
top subj xcomp obl subadjunct comp subj ppred pobj num
Caithfidh tú brath orthu nuair atá tú i Roinn 1
Have-to-FUT you rely on-them when REL-be you in Division 1
‘You have to rely on them when you are in Division 1’
Figure 5.3: LFG-style subordinate clause analysis (with IDT labels)
top subj xcomp obl subadjunct comp subj ppred pobj num
Caithfidh tú brath orthu nuair atá tú i Roinn 1
Have-to-FUT you rely on-them when REL-be you in Division 1
‘You have to rely on them when you are in Division 1’
Figure 5.4: Stanford-style subordinate clause analysis (with IDT labels)
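The difference between Figures 5.3 and 5.4 lies only in the attachments, not in the labels. As a hedged illustration (the treebank itself was changed manually), the reattachment can be sketched as follows. The Token class and the function reattach_subordinate_clauses are hypothetical, and the head indices of the example are assumed from the prose description rather than read off the figures.

```python
from dataclasses import dataclass, replace


@dataclass
class Token:
    idx: int      # 1-based position in the sentence
    form: str
    head: int     # index of the head token, 0 for the root
    label: str


def reattach_subordinate_clauses(tokens):
    """Return a copy of the tree in which each subordinate clause head
    depends on the matrix verb and the subordinating conjunction depends
    on the clause head. Labels are left unchanged."""
    new = [replace(t) for t in tokens]
    for subordinator in new:
        if subordinator.label != "subadjunct":
            continue
        clause_heads = [t for t in new
                        if t.head == subordinator.idx and t.label == "comp"]
        if not clause_heads:
            continue
        clause = clause_heads[0]
        clause.head = subordinator.head   # clause head now hangs off the matrix verb
        subordinator.head = clause.idx    # conjunction now hangs off the clause head
    return new


# Sentence of Figure 5.3 with assumed head indices.
lfg_style = [
    Token(1, "Caithfidh", 0, "top"),
    Token(2, "tú", 1, "subj"),
    Token(3, "brath", 1, "xcomp"),
    Token(4, "orthu", 3, "obl"),
    Token(5, "nuair", 1, "subadjunct"),
    Token(6, "atá", 5, "comp"),
    Token(7, "tú", 6, "subj"),
    Token(8, "i", 6, "ppred"),
    Token(9, "Roinn", 8, "pobj"),
    Token(10, "1", 9, "num"),
]

stanford_style = reattach_subordinate_clauses(lfg_style)
for tok in stanford_style:
    print(tok.idx, tok.form, tok.head, tok.label)
```

After the rewrite, atá depends on Caithfidh and nuair depends on atá. The IDT labels subadjunct and comp are kept at this stage and are only replaced in the later label-mapping step.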
5.1.2.2 Differences between dependency types
We found that the original Irish scheme makes distinctions that the universal scheme does not. This finer-grained information takes the form of the following Irish-specific dependency types: advpred, ppred, subj_q, obj_q, advadjunct_q, obl, obl2. In producing the universal version of the treebank, these Irish-specific dependency types are mapped to less informative universal ones (see Table 5.2). Conversely, we found that the universal scheme makes distinctions that the Irish scheme does not. Some of these dependency types are not needed for Irish. For example, there is no indirect object iobj in Irish, nor is there a passive construction that would require the labels nsubjpass, csubjpass or auxpass. Also, in the Irish Dependency Treebank, the copula is usually the root (top) or the head of a subordinate clause (e.g. comp), which renders the universal type cop redundant. Others that are not used are adp, expl, infmod, mwe, neg, partmod. However, we did identify some dependency relationships in the universal scheme that we introduce to the UD13 Irish Dependency Treebank (adpcomp, adposition, advcl, num, parataxis). These are explained below.
comp → adpcomp, advcl, parataxis, ccomp: The following new mappings were previously subsumed by the IDT label comp (complement clause). The mapping for comp has thus been split between adpcomp, advcl, parataxis and ccomp.
• adpcomp is a clausal complement of an adposition. An example from the English data is ‘some understanding of what the company’s long-term horizon should begin to look like’, where ‘begin’, as the head of the clause, is a dependent of the preposition ‘of’. An example of how we use this label in Irish is in Figure 5.5.
top dobj xcomp adpmod det adpobj adpmod adpmod adpcomp nsubj
Éileofar orthu taisteal chuig an Ionad de réir mar is gá
Demand-AUTO on-them travel to the Centre according as COP need
‘They will have to travel to the Centre when it is necessary’
Figure 5.5: UD13 adpcomp complement clause
• advcl is used to identify adverbial clause modifiers. In the English data, they are often introduced by subordinating conjunctions such as ‘when’, ‘because’, ‘although’, ‘after’, ‘however’, etc. An example is ‘However, because the guaranteed circulation base is being lowered, ad rates will be higher’. Here, ‘lowered’ is an advcl dependent of ‘will’. Equivalent subordinating conjunctions in Irish are mar ‘because’, nuair ‘when’, cé ‘although’, for example. An example of Irish usage is given in Figure 5.6.
top nsubj amod acomp adpobj mark prt advcl nsubj compmod acomp
Tá truailliú mór san áit mar nach bhfuil córas séarchais ann
Be pollution much in-the area because not be system sewerage there
‘There is a lot of pollution in the area because there is no sewerage system’
Figure 5.6: UD13 advcl adverbial clause modifier
• parataxis labels clausal structures that are separated from the previous clause with punctuation such as – ... : () ; and so on. See Figure 5.7 for an example.
top nsubj xcomp adpobj adpmod p parataxis ccomp acomp adpobj
Tá siad ag éirí leo – meastar gur in Éirinn ....
Be they at succeeding with-them – think-AUTO COP in Ireland ....
‘They are succeeding - it is believed that in Ireland...’
Figure 5.7: UD13 parataxis clauses
• ccomp covers all other types of clausal complements. For example, in English, ‘Mr. Amos says the Show-Crier team will probably do two live interviews a day’. The head of the complement clause here is ‘do’, which is a ccomp dependent of the matrix verb ‘says’. An Irish example is given in Figure 5.8.
top nsubj prt ccomp nsubj det dobj det prt advmod
Dúirt siad nach bhfeiceann siad an cineál seo chomh minic
Said they that-not see they the type this very often
‘They said that they do not see this type of thing very often’
Figure 5.8: UD13 ccomp clausal complements
quant → num, advmod: The IDT scheme uses one dependency label (quant) to cover all types of numerals and quantifiers. We now use two labels from the universal scheme to differentiate between quantifiers such as mórán ‘many’ (advmod) and numerals such as fiche ‘twenty’ (num).
nadjunct → nmod, compmod: The IDT label nadjunct accounts for all nominal modifiers. However, in order to map to the universal scheme, we discriminate two kinds: (i) nouns that modify clauses are mapped to nmod (e.g. bliain ó shin ‘a year ago’) and (ii) nouns that modify nouns (usually genitive case in Irish) are mapped to compmod (e.g. plean margaíochta ‘marketing plan’).
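The three one-to-many splits described above (comp, quant and nadjunct) depend on the context of each token. The Python sketch below restates the criteria from the text as a decision function. It is not the procedure used to build the treebank: the function refine_label, its parameters, the use of the universal POS tags ADP, NUM and NOUN as inputs, and the order in which the comp tests are applied are all assumptions made for illustration.

```python
def refine_label(irish_label, *, pos, head_pos,
                 has_subordinator=False, set_off_by_punct=False):
    """Pick a universal label for an Irish label with several targets.

    pos               universal POS tag of the token itself (assumed input)
    head_pos          universal POS tag of the token's head (assumed input)
    has_subordinator  True if the clause is introduced by a subordinating
                      conjunction such as mar or nuair
    set_off_by_punct  True if the clause is separated from the previous
                      clause by punctuation such as a dash or colon
    """
    if irish_label == "comp":
        if head_pos == "ADP":          # clausal complement of an adposition
            return "adpcomp"
        if has_subordinator:           # adverbial clause modifier
            return "advcl"
        if set_off_by_punct:           # loosely attached clause
            return "parataxis"
        return "ccomp"                 # all other clausal complements
    if irish_label == "quant":
        # Numerals become num; other quantifiers become advmod.
        return "num" if pos == "NUM" else "advmod"
    if irish_label == "nadjunct":
        # Nouns modifying nouns become compmod; nouns modifying clauses, nmod.
        return "compmod" if head_pos == "NOUN" else "nmod"
    raise ValueError(f"{irish_label} has a single universal target")


# Examples modelled on Figures 5.5 to 5.8 and the phrases cited in the text.
# The POS tags passed in are illustrative guesses, not treebank annotations.
print(refine_label("comp", pos="VERB", head_pos="ADP"))                          # is (Fig. 5.5)
print(refine_label("comp", pos="VERB", head_pos="VERB", has_subordinator=True))  # bhfuil (Fig. 5.6)
print(refine_label("comp", pos="VERB", head_pos="VERB", set_off_by_punct=True))  # meastar (Fig. 5.7)
print(refine_label("comp", pos="VERB", head_pos="VERB"))                         # bhfeiceann (Fig. 5.8)
print(refine_label("quant", pos="NUM", head_pos="NOUN"))                         # fiche
print(refine_label("nadjunct", pos="NOUN", head_pos="NOUN"))                     # margaíochta
```

A real conversion would need these features to be derived from the tree and the POS annotation, and borderline cases would still call for manual review.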
As with the POS tag mapping, information is also lost through the dependency label mapping process. This is because some UD labels are too general to describe fully the nature of the dependency relation between tokens. This is clear from the numerous ‘many-to-one’ mappings shown in Table 5.2. Some of the lost information is explained in more detail here:
(i) (adjpred, advpred, ppred) → acomp; npred → attr: all of these predicate arguments are used in a similar pattern. By mapping the nominal predicate to a separate label, the common behaviour and linguistic patterns across all predicates are lost.
(ii) obl_ag → adpmod: by subsuming the oblique agent label under a general modifier label, the ‘passive’ function of these oblique modifiers and the nature of the structures they describe are lost.
(iii) prt: the universal ‘particle’ label subsumes all of the fine-grained Irish particles (apart from relative particles). Particles are a significant feature of the Irish language, carrying out many functions, and distinguishing them can help semantic disambiguation of some forms (e.g. a can be a vocative, quantifier, cleft or time particle).
It is worth noting, however, that in the more recent Universal Dependency Scheme (see Section 5.2.2), it is possible to account for the loss of information in these (and other) mappings through the use of language-specific sub-labels.