Annotation
in Language Documentation
Univ. Hamburg Workshop Annotation SEBASTIAN DRUDE 2015-10-29
Topics
1. Language Documentation
2. Data and Annotation (theory)
3. Types and interdependencies of Annotations
4. Annotation Tools (overview and comparison)
5. Annotation data formats (TRS, EAF, SF)
6. Standing challenges
1) Language Documentation
• New subfield of linguistics (Himmelmann
1998): “documentary linguistics”, with language documentations as results
• Triggered by language endangerment,
enabled by technical / digital revolution
• Different from language description:
in addition to the “Boasian triad”
(grammar, dictionary, text collection), it yields corpora of annotated multimedia data
• A modern Language Documentation (LD)
consists of a corpus of primary data (audio & video recordings) of utterances and texts
of a broad spectrum of genres and domains
• Annotation accompanies the utterances
• A LD is digital and sustainable (metadata,
open standards, archiving, maintenance)
2) Data and Annotation
Primary data: video recording, audio recording
Secondary data: annotation
• Transcription: orthographic / phonological ...
• Word-by-word / idiomatic translation ...
• (Morpheme glosses ...)
• (Linguistic / ethnographic comment ...)
• ...
Metadata
(describe the event and the respective
data)
Data
Data is always data FOR something, or at least
OF something. Usually it is a systematic
representation of physical states and events (‘facts’ used FOR a scientific argument)
In LD, primary data is a direct rendering or result
of communicative (speech) events, for
instance a written text or, in particular, an
audio or video recording
Annotation
Annotation of data is a symbolic representation
of properties of the state/event represented in the data
In LD, the most common and basic types of
(primary) annotation are a transcription and
a translation of the expressions represented in primary data (e.g., an a/v recording)
[Diagram: three linked elements]
REALITY (communicative events)
Primary data: direct measurement / rendering / result of reality
Annotation = secondary data: represents symbolically properties of reality, as represented in the primary data
Global vs. unit-oriented Annotation
Global or holistic annotation represents
properties of the event as a whole and is in LD part of the metadata
Unit-oriented annotation refers to specific parts
of the data, in particular, utterances of
individual sentences or words or sounds etc.
Secondary and derived data
If unit-oriented annotation is directly based on primary data (such as a written text or an audio
or video recording), then it is secondary data
Annotation commenting on previous annotation is tertiary data, and so forth recursively
In sum, all unit-oriented annotation is derived data
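The recursive definition above can be sketched in a few lines of Python. This is only an illustration: the class `Annotation`, its field `based_on` and the function `derivation_order` are hypothetical names, not part of any annotation tool.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Annotation:
    label: str
    # None means: based directly on primary data (e.g. a recording).
    based_on: Optional["Annotation"] = None


def derivation_order(annotation: Annotation) -> int:
    """Return 2 for secondary data, 3 for tertiary data, and so on."""
    order = 2
    while annotation.based_on is not None:
        order += 1
        annotation = annotation.based_on
    return order


transcription = Annotation("transcription")                    # secondary
note = Annotation("note on the transcription", transcription)  # tertiary
remark = Annotation("remark on the note", note)                # one step further
```

Whatever its order, each of these objects counts as derived data in the sense used above.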
Time-aligned annotation
Annotation of a media file is time-aligned
annotation if each piece of annotation is
explicitly associated with the corresponding
chunk (time-span, segment) of the media file
Time-linking is the activity and result of
specifying the time-alignment of each
annotation associated with a certain chunk in the media file
Time-linking is usually done by using time marks
Time marks: the start/end times of each chunk
Segmenting (of a media file): identification
of relevant chunks and their time marks
Workflow: first segmenting, then adding annotation
Older unit-oriented annotation can later be
time-aligned, but this is very labour-intensive (but now see WebMAUS from CLARIN/BAS)
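A minimal sketch of the workflow "first segmenting, then adding annotation" follows. The names `Segment`, `segment` and `is_time_aligned`, the time marks and the placeholder texts are all invented for illustration. The sketch also assumes adjacent chunks without gaps, which real recordings need not satisfy.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start_ms: int   # time mark: start of the chunk
    end_ms: int     # time mark: end of the chunk
    text: str = ""  # annotation, added after segmenting


def segment(time_marks_ms):
    """Turn a sorted list of time marks into empty, adjacent chunks."""
    return [Segment(s, e) for s, e in zip(time_marks_ms, time_marks_ms[1:])]


def is_time_aligned(segments):
    """Every annotation must carry a valid, non-empty time span."""
    return all(0 <= s.start_ms < s.end_ms for s in segments)


# Step 1: segmenting (hypothetical time marks in milliseconds)
chunks = segment([0, 1200, 2750, 4100])

# Step 2: adding annotation to each chunk
for chunk, text in zip(chunks, ["first utterance", "second utterance", "third utterance"]):
    chunk.text = text
```

Time-aligning older annotation afterwards amounts to filling in `start_ms` and `end_ms` for text that already exists, which is the labour-intensive part.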
3) Types and Interdependencies
of Annotations
Linguistic types of annotations
Annotations differ according to the types
of properties of the speech event that are
represented in the annotation
Annotations can be phonetic, phonological,
morphological, syntactic, semantic, pragmatic,
(possibly others), and on each level they can focus on the units, or on structures of units,
Coverage of annotation
Basic annotation: only transcriptions, translations
and optionally notes, on a sentence / clause / intonational unit level
Basic glossing: additionally information on
individual morphs: a gloss (indication of meaning
or function) and perhaps a part-of-speech tag
Advanced glossing: one or several additional
levels, from phonetic to pragmatic (for instance,
a prosodic transcription, or annotation of the
syntactic structure, of grammatical relations, etc.)
Most often used format in lang. description:
• Interlinear Morpheme Translations / Glossing
(standard “glossing”)
• C. Lehmann: Interlinear Morphemic Glossing.
In Morphology (2004, first version 1982)
• Leipzig Glossing Rules: Dept. of Linguistics,
MPI-EVA (B. Comrie, M. Haspelmath) and Dept. of Linguistics, Univ. Leipzig (B. Bickel), 2008
Example:
time-o     ne       veni-a-t
fear-1.SG  NEG.VOL  come-SBJV.PRES-3.SG
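The two lines of the example can be paired morph by morph. The sketch below assumes the Leipzig convention that a word and its gloss contain the same number of hyphens. The function name `align_gloss` is hypothetical.

```python
def align_gloss(morph_line: str, gloss_line: str):
    """Pair each morph with its gloss label, word by word."""
    words = morph_line.split()
    glosses = gloss_line.split()
    if len(words) != len(glosses):
        raise ValueError("different number of words and glosses")
    aligned = []
    for word, gloss in zip(words, glosses):
        morphs = word.split("-")
        labels = gloss.split("-")
        # Same number of hyphens in the word and in its gloss.
        if len(morphs) != len(labels):
            raise ValueError(f"hyphen mismatch in {word!r} / {gloss!r}")
        aligned.append(list(zip(morphs, labels)))
    return aligned


aligned = align_gloss(
    "time-o ne veni-a-t",
    "fear-1.SG NEG.VOL come-SBJV.PRES-3.SG",
)
```

The check only verifies the segmentation. It says nothing about what a label such as SBJV.PRES designates.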
Problems:
• Theory-specific (item-and-arrangement, not
item-and-process nor word-and-paradigm)
• Mixes morphology and syntax
• Problems with synthetic word forms
timeo: 1P, SG, IND, ACT, PRES – where PRES? (Ø)
• Analytical word forms (esp. discontinuous)
• What do the labels designate? Meaning?
Hans-Heinrich Lieb & Sebastian Drude
Advanced Glossing:
A Language Documentation Format
(DOBES Working Paper, November 2000)
http://dobes.mpi.nl/documents/Advanced-Glossing1.pdf
Advanced Glossing (AG):
[Figure: a glossing table, with the labels “a cell”, “a line”, “is a list” and “a holistic line”]
AG: A Glossing
[Figure: a glossing, made up of a glossing table and a comment (general comment, ...)]
AG: Syntactic and Morphological
Glossings of a sentence
[Figure: the syntactic glossing of a sentence alongside the morphological glossing of the sentence, which comprises the morphological glossings of word 1, word 2, word 3, ...]
AG: Glossing of a Text
[Figure: a glossing of a text is a glossing of the raw data. It comprises the glossings of the sentences (the syntactic and morphological glossings of sentence 1, sentence 2, sentence 3, ...), a general comment on the text, and other components. Each glossing consists of a glossing table (a list of lines made up of cells) and a comment.]
• Time-linked annotations for sentence-utterances
• Other dependent sentence-annotations
• Subdivision into annotations for syntactic units
(can be internally time-aligned or not)
• Dependent syntactic-unit-annotations
• Further subdivision into annotations for morphs
(hardly possible to time-align internally)
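The dependencies listed above can be modelled as a tree in which only some levels carry their own time span. In this sketch the class `Unit`, the function `effective_span` and the millisecond values are hypothetical. The glosses are taken from the Latin example earlier in this section.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Unit:
    form: str
    span_ms: Optional[Tuple[int, int]] = None                 # None: not time-aligned
    dependent: Dict[str, str] = field(default_factory=dict)   # dependent annotations
    parts: List["Unit"] = field(default_factory=list)         # subdivision


def effective_span(path: List[Unit]) -> Optional[Tuple[int, int]]:
    """Return the innermost explicit time span along a sentence-to-morph path."""
    for unit in reversed(path):
        if unit.span_ms is not None:
            return unit.span_ms
    return None


# Morphs: no time-alignment of their own
time_ = Unit("time", dependent={"gloss": "fear"})
veni = Unit("veni", dependent={"gloss": "come"})

# Syntactic units: may or may not be internally time-aligned
timeo = Unit("timeo", parts=[time_])
veniat = Unit("veniat", span_ms=(900, 1800), parts=[veni])

# Sentence-utterance: time-linked
sentence = Unit("timeo ne veniat", span_ms=(0, 1800), parts=[timeo, veniat])
```

A morph without its own time marks still resolves to the span of the nearest time-aligned unit above it, here the word `veniat` or the whole sentence.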
4) Annotation Tools
Transcriber
Tool for the segmentation and transcription of audio files
Pros: Compatible with Mac, Windows & Linux;
very easy to use; produces simple XML files
Cons: No Unicode input possible; only one line
of annotation; no video; no lexicon; outdated (new version not tested)
ELAN
Tool for the complex annotation of audio and video files
Pros: Compatible with Mac, Windows & Linux;
audio and multiple video files; unlimited tiers for different speakers; state-of-the-art; wide user community; XML output (but complex)
Cons: Complex tool for beginners (but now:
Toolbox
Text-oriented general database tool for linguistic fieldwork with lexicon and texts
Pros: Flexible and powerful; export to different
formats (incl. XML), therefore easy to integrate with other tools; many users
Cons: Too flexible; poor data format (“Standard
Format”); complex to set up; tricky on Mac/Linux; no video and no time-aligning; at end of
FLEX
Extensive linguistic database tool for linguistic fieldwork with lexicon and texts
Pros: Powerful and well-designed; inbuilt ontology
and analysis tools; growing user community
Cons: Not flexible (8 tiers); one huge XML database
with no good import or export function for texts; Windows only; difficult to configure; no audio, no video, no time-alignment; produced by SIL
Other tools
Praat for segmenting, best for phonetic annotation.
CLAN does audio and video annotation, in the CHAT or CA (Conversation Analysis) formats, for child language data (CHILDES project).
ANVIL seems to be similar to ELAN (not tested).
The EXMARaLDA Partitur-Editor (U. Hamburg)
is widely used for discourse analysis.
Audiamus and Eopas (N. Thieberger) organize
(not create) annotation.
Poio (developed in the context of CLARIN, API)
There are several others.
Feature                              | Transcriber         | ELAN                       | Toolbox              | FLEX
Complexity                           | Easy                | Complex, with easier modes | Complex to configure | Complex
Audio                                | Yes                 | Yes                        | No (can play)        | No
Video                                | No                  | Yes                        | No                   | No
Tiers                                | 1 per speaker       | Unlimited                  | Unlimited            | Fixed: 8
Lexicon interop., automatic glossing | No                  | No (is planned)            | Yes                  | Yes
Unicode                              | No input            | Yes                        | Yes                  | Yes
Data format                          | Simple XML          | Complex XML                | Faulty TXT           | XML database
Interoperability                     | Good                | Fair                       | Good                 | Bad
User community / support             | Small?, no support? | Large, good support        | Large, fair support  | Small, good support
Life cycle                           | Old (but new
5) Annotation data formats
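As an illustration of one of the formats named in the section title, the sketch below reads time-aligned annotations from a heavily simplified EAF fragment (ELAN's XML format). The fragment is not a complete EAF file: it omits the header, the linguistic types and everything else a real file contains. The tier name, time values and the function name `read_aligned` are invented for the example.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical fragment; a real EAF file contains much more.
EAF_FRAGMENT = """<ANNOTATION_DOCUMENT>
  <TIME_ORDER>
    <TIME_SLOT TIME_SLOT_ID="ts1" TIME_VALUE="0"/>
    <TIME_SLOT TIME_SLOT_ID="ts2" TIME_VALUE="1800"/>
  </TIME_ORDER>
  <TIER TIER_ID="transcription">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a1"
                            TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
        <ANNOTATION_VALUE>timeo ne veniat</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
</ANNOTATION_DOCUMENT>"""


def read_aligned(xml_text):
    """Return (tier, start_ms, end_ms, value) for each time-aligned annotation."""
    root = ET.fromstring(xml_text)
    # Time marks are stored once and referenced by ID from the annotations.
    slots = {
        slot.get("TIME_SLOT_ID"): int(slot.get("TIME_VALUE"))
        for slot in root.iter("TIME_SLOT")
    }
    rows = []
    for tier in root.iter("TIER"):
        for ann in tier.iter("ALIGNABLE_ANNOTATION"):
            rows.append((
                tier.get("TIER_ID"),
                slots[ann.get("TIME_SLOT_REF1")],
                slots[ann.get("TIME_SLOT_REF2")],
                ann.findtext("ANNOTATION_VALUE", default=""),
            ))
    return rows


rows = read_aligned(EAF_FRAGMENT)
```

The sketch assumes every time slot has a `TIME_VALUE`. Files in which some slots lack one would need extra handling.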
6) Standing challenges
• No standardized conventions for layers
of linguistic annotation
• Problems with interlinear morpheme glosses
• Unclear status / interpretation of labels
• Different labels for ‘same’ categories
• Different definitions for ‘same’ categories
based on different theories
EUROTYP:
ca. 550 “abbreviations of terms”:
Morphological categories 246
Lexical word classes 114
Syntactic relations 56
Syntactic constituent categories 27
Semantic roles 16
Word order 16
Sentence types 2
Varieties and other 6+2
Inflection: analytical word forms
Where is PLUPERFECT to be annotated?
moni  -t             -us
ask   -PART.PF.PASS  -NOM.SG.M
er    -a             -m
PASS  -IND.PAST      -1.SG.ACT
monitus eram (analytical form):
1P, Sg, Ind, Pass, Plpf, NomV, MascV