Universität Stuttgart ⓒ Achim Stein 2013
Diachronic syntax based on
constituency and dependency annotated corpora:
theoretical and methodological issues
Achim Stein (ILR, Universität Stuttgart)
This talk is based on collaborative research with Sophie Prévost
(CNRS LaTTiCE) and the members of the ANR/DFG project SRCMF.
Special Session on Romance Parsed Corpora
43rd Linguistic Symposium on Romance Languages
Syntactic Reference Corpus of Medieval French
‣ Principal investigators: Sophie Prévost, Achim Stein
‣ Funding: 2009 – 2012
  Agence Nationale de la Recherche ANR (France)
  Deutsche Forschungsgemeinschaft DFG (Germany)
‣ Institutions and staff:
  ‣ Paris: UMR 8094-LaTTiCe (CNRS/ENS Paris)
    ‣ Sophie Prévost, Julie Glikman
  ‣ Lyon: ENS de Lyon
    ‣ Céline Guillot, Serge Heiden, Alexei Lavrentiev, Tom Rainsford
  ‣ Stuttgart: Institut für Linguistik/Romanistik (ILR)
    ‣ Achim Stein, Beatrice Bischof, Nicolas Mazziotta
The annotation workflow
Phases: preparation, work, use
‣ Corpora: Base de français médiéval (BFM); Nouveau Corpus d'Amsterdam (NCA)
‣ Dependency model, annotation principles
‣ Forum: discussion of grammar and annotation principles
‣ Manual annotation with the Notabene tool (Mazziotta 2010): syntactic structures (RDF graphs)
‣ Correction 1: compare parallel annotations
‣ Correction 2: review of compared versions
‣ Formats: XML, CoNLL
‣ Queries with TigerSearch (local) or TXM (web)
‣ Training of dependency parsers
‣ CoNLL based query tools
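The workflow names CoNLL as the format used for parser training and query tools. As a rough illustration only, the sketch below reads a tab-separated, one-token-per-line file into Python objects. It assumes the CoNLL-X column order (the slides do not say which CoNLL variant is used), and the two-token sample with its labels `det` and `root` is invented for the example, not taken from the corpus.

```python
from typing import NamedTuple


class Token(NamedTuple):
    idx: int      # 1-based position in the sentence
    form: str     # word form
    head: int     # index of the governing token, 0 for the root
    deprel: str   # dependency label


def read_conll(text: str) -> list[list[Token]]:
    """Parse CoNLL-X style text: one token per line, blank line between sentences."""
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        cols = line.split("\t")
        # Assumed CoNLL-X columns: ID FORM LEMMA CPOSTAG POSTAG FEATS HEAD DEPREL ...
        current.append(Token(int(cols[0]), cols[1], int(cols[6]), cols[7]))
    if current:
        sentences.append(current)
    return sentences


# Hypothetical sample; "_" marks unused columns, labels are placeholders.
SAMPLE = (
    "1\tla\t_\t_\t_\t_\t2\tdet\t_\t_\n"
    "2\tmer\t_\t_\t_\t_\t0\troot\t_\t_\n"
)
sentences = read_conll(SAMPLE)
```

Only the four fields needed for dependency queries are kept; lemma and part-of-speech columns could be added to `Token` in the same way.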
Penn style constituent structure
‣ Tresqu'en la mer cunquist la tere altaigne. (Chanson de Roland)
  Until the sea he conquered the high land.
‣ The noun phrase (NP) consists of "la" and "mer".
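To make the constituency view concrete, here is a minimal sketch that stores the Roland example as a nested tuple and looks up the words dominated by the first NP. The bracketing and the labels (`S`, `PP`, `P`, `D`, `N`, `V`, `ADJ`) are simplified assumptions for illustration; they are not the tree shown on the slide and not the tag set of any Penn-style corpus.

```python
# A phrase is (label, child, child, ...); a leaf is (label, word).
# Bracketing and labels are illustrative assumptions.
TREE = (
    "S",
    ("PP",
        ("P", "Tresqu'en"),
        ("NP", ("D", "la"), ("N", "mer"))),
    ("V", "cunquist"),
    ("NP", ("D", "la"), ("N", "tere"), ("ADJ", "altaigne")),
)


def is_leaf(node):
    """A leaf pairs a label with a word string."""
    return len(node) == 2 and isinstance(node[1], str)


def leaves(node):
    """Return the words dominated by a node, left to right."""
    if is_leaf(node):
        return [node[1]]
    words = []
    for child in node[1:]:
        words.extend(leaves(child))
    return words


def phrases(node, label):
    """Yield every subtree whose label equals `label`, in pre-order."""
    if node[0] == label:
        yield node
    if not is_leaf(node):
        for child in node[1:]:
            yield from phrases(child, label)


first_np = next(phrases(TREE, "NP"))
np_words = leaves(first_np)
```

In this representation a phrase is defined by what it contains, which is the point of the callout: the NP is the unit made up of "la" and "mer".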
Dependency structure
‣ Tresqu'en la mer cunquist la tere altaigne.
  Until the sea he conquered the high land.
  (SRCMF, Prévost/Stein 2013)
‣ "Tresqu", "en", and "la" depend on "mer".
‣ "mer" depends on "cunquist" (compare with constituency).
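The same sentence can be sketched as a dependency structure, where each token simply points to its head. The heads of "Tresqu", "en", "la" and "mer" follow the slide; the root status of "cunquist" and the heads inside "la tere altaigne" are assumptions added only to complete the example.

```python
# Token index -> (form, head index); head 0 marks the root.
SENTENCE = {
    1: ("Tresqu", 4),    # from the slide
    2: ("en", 4),        # from the slide
    3: ("la", 4),        # from the slide
    4: ("mer", 5),       # from the slide
    5: ("cunquist", 0),  # assumed root
    6: ("la", 7),        # assumed
    7: ("tere", 5),      # assumed
    8: ("altaigne", 7),  # assumed
}


def dependents(sentence, head_idx):
    """Forms of all tokens governed directly by the token at head_idx."""
    return [form for _, (form, head) in sorted(sentence.items()) if head == head_idx]


def path_to_root(sentence, idx):
    """Forms on the chain from a token up to the root."""
    path = []
    while idx != 0:
        form, idx = sentence[idx]
        path.append(form)
    return path


mer_dependents = dependents(SENTENCE, 4)
tresqu_path = path_to_root(SENTENCE, 1)
```

Unlike the constituency sketch, there are no phrase nodes here: the "phrase" around "mer" exists only as the set of tokens that reach it through head links.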
The class hierarchy of SRCMF categories (syntactic entities)
‣ Upper levels: classe, unité syntaxique, fonction, structure, noeud, groupe, satellite, parenthèse, modifieur, relateur
‣ Structures: structure maximale, structure non-maximale, noeud verbal, noeud non-verbal [nV], coordonné [Coo], coordination [GpCoo], phrase [Snt], non-phrase [nSnt]
‣ Verbal nodes: noeud verbal personnel [VFin], noeud verbal infinitif [VInf], noeud verbal participial [VPar]
‣ Functions: actant, circonstant [Circ], négation [Ng], forclusif [NgPrt], auxilié [Aux], sujet, attribut, régime [Regim], auxilié actif [AuxA], auxilié passif [AuxP], apostrophe [Apst], interjection [Int], insertion [Insrt]
‣ Subjects: sujet personnel [SjPer], sujet impersonnel [SjImp]
‣ Modifiers: modifieur attaché [ModA], modifieur détaché [ModD]
‣ Predicative attributes: attribut de sujet [AtSj], attribut d'objet [AtObj], attribut du réfléchi [AtRfc]
‣ Objects and complements: objet [Obj], complément [Cmpl]
‣ Reflexives: réfléchi [Rfc], réflexif renforcé [Rfx]
‣ Relators: relateur coordonnant [RelC], relateur non-coordonnant [RelNC]
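A class hierarchy like this lets queries target a general category and still match its subtypes. The sketch below is a partial, assumed reconstruction: it only encodes parent links that can be read off the label names themselves (for example, SjPer and SjImp under sujet) and makes no claim about the rest of the SRCMF ontology.

```python
# Child -> parent, limited to subtypes whose parent is evident from the names.
# This is an assumed, partial reading of the hierarchy, not the full ontology.
PARENT = {
    "SjPer": "sujet", "SjImp": "sujet",
    "AtSj": "attribut", "AtObj": "attribut", "AtRfc": "attribut",
    "ModA": "modifieur", "ModD": "modifieur",
    "RelC": "relateur", "RelNC": "relateur",
    "AuxA": "Aux", "AuxP": "Aux",
    "VFin": "noeud verbal", "VInf": "noeud verbal", "VPar": "noeud verbal",
}


def ancestors(tag):
    """Return the chain of parents of a tag, nearest first."""
    chain = []
    while tag in PARENT:
        tag = PARENT[tag]
        chain.append(tag)
    return chain


def is_a(tag, category):
    """True if tag equals category or is one of its subtypes."""
    return tag == category or category in ancestors(tag)


subject_like = [t for t in ("SjPer", "Obj", "SjImp", "Circ") if is_a(t, "sujet")]
```

With such a lookup, a search for "sujet" would return both personal and impersonal subjects without listing every tag by hand.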
SRCMF grammar: heads and "functional categories"
‣ Turin University Treebank (TUT, Bosco 2004)
  ‣ Functional categories = heads
  ‣ e.g. prepositional phrase: in > quei > giorni (in > these > days)
  ‣ head > dependent: preposition > noun
‣ SRCMF
  ‣ Lexical categories = heads
  ‣ e.g. prepositional phrase: mer > outre (sea > over)
  ‣ head > dependent: noun > preposition, verb > conjunction
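The difference between the two head conventions amounts to reversing one arc. As a minimal sketch, the hypothetical helper `flip_pp_head` below converts a preposition-headed phrase into a noun-headed one. It identifies tokens by their word form and ignores determiners, both simplifications made for the example; it is not a converter used by either project.

```python
def flip_pp_head(arcs, preposition, noun):
    """Turn a preposition-headed PP into a noun-headed one.

    `arcs` maps each dependent to its head (None for the top of the fragment).
    """
    if arcs.get(noun) != preposition:
        raise ValueError("expected the noun to depend on the preposition")
    flipped = dict(arcs)
    flipped[noun] = arcs.get(preposition)  # the noun takes over the preposition's head
    flipped[preposition] = noun            # the preposition now depends on the noun
    return flipped


# Functional head (TUT style): preposition > noun
functional = {"outre": None, "mer": "outre"}
# Lexical head (SRCMF style): noun > preposition
lexical = flip_pp_head(functional, "outre", "mer")
```

The input is left unchanged, so both analyses of the same phrase can be compared side by side.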
SRCMF grammar: duplicates
‣ A duplicate is a double reference to a node (not two forms).
‣ Duplicates allow for the assignment of a second relation to the node.
‣ Duplicates are used in relative clauses and contracted forms.
‣ Examples:
  ‣ In (1) the relative pronoun qui is a non-coordinating relator (RelNC). Its duplicate is a subject (SjPer).
  ‣ In (2) the contracted form nes (= ne + les) is a negation (Ng). Its duplicate is an object (Obj).
(1) Souffrance si est semblable a esmeraude qui toz jorz est vert.
    Sufferance such is like an emerald which all day is green.
(2) sovent dit / Qu'or veut morir s'il nes ocit.
    often says / that now wants die if he not+them kills
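One way to picture a duplicate is a single node that carries two relations. The sketch below models example (2) this way. The `Node` class is hypothetical, and the choice of "ocit" as the head of both relations is an assumption, since the slide gives only the labels.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    form: str
    # Each relation is a (label, head form) pair; a second entry is the duplicate.
    relations: list = field(default_factory=list)

    def add_relation(self, label, head):
        self.relations.append((label, head))

    @property
    def has_duplicate(self):
        return len(self.relations) > 1

    @property
    def labels(self):
        return [label for label, _ in self.relations]


# Example (2): nes = ne + les; heads are assumed for illustration.
nes = Node("nes")
nes.add_relation("Ng", "ocit")    # negation
nes.add_relation("Obj", "ocit")   # duplicate: object

ocit = Node("ocit")
```

The form is stored once, which matches the definition above: a duplicate is a second reference to the node, not a second token.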
SRCMF: a first parsing experiment
‣ Parser: mate tools (Bohnet 2010; Björkelund, Bohnet et al. 2010)
  ‣ training on three different texts (90% of 6508 sentences)
  ‣ evaluation on 10% (650 sentences)
‣ Encouraging results, considering that
  ‣ the SRCMF grammar is linguistically motivated
  ‣ no compromise was made to facilitate automatic parsing
‣ Difficulty guessing the right label: the price for a very explicit annotation model?
‣ Main error: Cmpl-Circ
‣ Too few exact matches: a small number of
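The remarks on labels and exact matches correspond to standard dependency evaluation measures. The slides do not state which scores were computed, so the sketch below is only an assumed illustration of three common ones: unlabeled attachment (correct head), labeled attachment (correct head and label), and exact match (whole sentence correct). The toy data is invented and reproduces one Cmpl/Circ confusion; the label `root` is a placeholder.

```python
def evaluate(gold, predicted):
    """Score predicted parses against gold parses.

    Both arguments are lists of sentences; a sentence is a list of
    (head, label) pairs, one per token.
    """
    tokens = uas_hits = las_hits = exact = 0
    for g_sent, p_sent in zip(gold, predicted):
        if len(g_sent) != len(p_sent):
            raise ValueError("sentences must have the same number of tokens")
        tokens += len(g_sent)
        uas_hits += sum(g[0] == p[0] for g, p in zip(g_sent, p_sent))
        las_hits += sum(g == p for g, p in zip(g_sent, p_sent))
        exact += g_sent == p_sent
    return {
        "UAS": uas_hits / tokens,
        "LAS": las_hits / tokens,
        "exact_match": exact / len(gold),
    }


# Invented toy data: the head is right, but Cmpl is predicted as Circ.
gold = [[(2, "SjPer"), (0, "root"), (2, "Cmpl")], [(0, "root")]]
predicted = [[(2, "SjPer"), (0, "root"), (2, "Circ")], [(0, "root")]]
scores = evaluate(gold, predicted)
```

A single wrong label leaves the unlabeled score untouched but lowers the labeled score and removes the whole sentence from the exact matches, which is how a fine-grained label set can depress both figures.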
Results
‣ See the SRCMF homepage: http://srcmf.org
‣ Publication is ongoing:
  ‣ 15 Old French texts, > 250,000 words, > 23,000 sentences
  ‣ online access (via TXM web, ENS Lyon)
  ‣ download formats for local queries
  ‣ documentation
‣ Re-usable tools
  ‣ Notabene annotation tool
    ‣ http://sourceforge.net/projects/notabene/
• Bechhofer, Sean; van Harmelen, Frank; Hendler, Jim; Horrocks, Ian; McGuinness, Deborah L.; Patel-Schneider, Peter F.; Stein, Lynn Andrea (2004): OWL Web Ontology Language Reference. W3C Recommendation 10 February 2004.
• Björkelund, Anders; Bohnet, Bernd; Hafdell, Love; Nugues, Pierre (2010): A high-performance syntactic and semantic dependency parser. Proceedings of the 23rd International Conference on Computational Linguistics: Demonstrations, Stroudsburg, PA, USA: Association for Computational Linguistics, 33–36.
• Bohnet, Bernd (2010): Top Accuracy and Fast Dependency Parsing is not a Contradiction. Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010), Beijing, China: Coling 2010 Organizing Committee, 89–97.
• Bosco, Cristina (2004): A Grammatical Relation System for Treebank Annotation. PhD Thesis, Università degli Studi di Torino.
• Guillot, Céline; Marchello-Nizia, Christiane; Lavrentiev, Alexei (2007): La Base de Français Médiéval (BFM) : états et perspectives. – Kunstmann, Pierre; Stein, Achim (ed.): Le Nouveau Corpus d'Amsterdam. Actes de l'atelier de Lauterbad, 23-26 février 2006, Stuttgart: Steiner.
• Martineau, France (2009): Le corpus MCVF. Modéliser le changement: les voies du français. Ottawa: Université d'Ottawa.
• Mazziotta, Nicolas (2010): Logiciel NotaBene pour l'annotation linguistique. Annotations et conceptualisations multiples. Recherches qualitatives. Hors-série 'Les actes', 9, 83–94.
• Stein, Achim et al. (2006): Nouveau Corpus d'Amsterdam. Corpus informatique de textes littéraires d'ancien français (ca 1150-1350), établi par Anthonij Dees (Amsterdam 1987), remanié par Achim Stein, Pierre Kunstmann et Martin-D. Gleßgen. Stuttgart: Institut für Linguistik/Romanistik.
• Stein, Achim; Prévost, Sophie (2013): Syntactic annotation of medieval texts: the Syntactic Reference Corpus of Medieval French (SRCMF). – Bennett, Paul; Durrell, Martin; Scheible, Silke; Whitt, Richard (ed.): New Methods in Historical Corpus Linguistics, Tübingen: Narr.