
Annotation in Language Documentation

Univ. Hamburg Workshop Annotation, Sebastian Drude, 2015-10-29

Topics

1. Language Documentation
2. Data and Annotation (theory)
3. Types and interdependencies of Annotations
4. Annotation Tools (overview and comparison)
5. Annotation data formats (TRS, EAF, SF)
6. Standing challenges


1) Language Documentation

• New subfield of linguistics (Himmelmann 1998): “documentary linguistics”, with language documentations as results
• Triggered by language endangerment, enabled by the technical / digital revolution
• Different from language description: in addition to the “Boasian triad” (grammar, dictionary, text collection), it includes corpora of annotated multimedia data

• A modern Language Documentation (LD) consists of a corpus of primary data (audio and video recordings) of utterances and texts from a broad spectrum of genres and domains
• Annotation accompanies the utterances
• An LD is digital and sustainable (metadata, open standards, archiving, maintenance)


2) Data and Annotation

Primary data: video recording, audio recording

Secondary data (annotation):
• Transcription: orthographic / phonological ...
• Translation: word-by-word / idiomatic ...
• (Morpheme glosses ...)
• (Linguistic / ethnographic comment ...)
• ...

Metadata (describe the event and the respective data)

Data

Data is always data FOR something, or at least OF something. Usually it is a systematic representation of physical states and events (‘facts’ used FOR a scientific argument).

In LD, primary data is a direct rendering or result of communicative (speech) events, for instance a written text or, in particular, an ...

Annotation

Annotation of data is a symbolic representation of properties of the state/event represented in the data.

In LD, the most common and basic types of (primary) annotation are a transcription and a translation of the expressions represented in primary data (e.g., an a/v recording).

• Reality: communicative events
• Primary data: direct measurement / rendering / result of reality
• Annotation = secondary data: symbolically represents properties of reality as represented in the primary data

Global vs. unit-oriented annotation

Global or holistic annotation represents properties of the event as a whole and is, in LD, part of the metadata.

Unit-oriented annotation refers to specific parts of the data, in particular utterances of individual sentences or words or sounds etc.

Secondary and derived data

If unit-oriented annotation is directly based on primary data (such as a written text or an audio or video recording), then it is secondary data.

Annotation commenting on previous annotation is tertiary data, and so forth recursively.

In sum, all unit-oriented annotation is derived data.
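
The recursive definition above can be made concrete with a small Python sketch. It is only an illustration: the class `Annotation`, the field `based_on`, and the function `derivation_order` are hypothetical names, and the three sample annotations are invented for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Annotation:
    """A unit-oriented annotation; `based_on` is None when it rests on primary data."""
    label: str
    based_on: Optional["Annotation"] = None


def derivation_order(annotation: Annotation) -> int:
    """Return 2 for secondary data, 3 for tertiary data, and so on."""
    if annotation.based_on is None:
        return 2  # directly based on primary data (which counts as order 1)
    return derivation_order(annotation.based_on) + 1


# Hypothetical chain: each annotation comments on the previous one.
transcription = Annotation("transcription")          # based on the recording
gloss = Annotation("morpheme gloss", transcription)  # based on the transcription
note = Annotation("note on the gloss", gloss)        # based on the gloss

print([derivation_order(a) for a in (transcription, gloss, note)])
# prints [2, 3, 4]
```

Whatever the depth of the chain, every annotation here has an order of at least 2, which mirrors the statement that all unit-oriented annotation is derived data.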

Time-aligned annotation

Annotation of a media file is time-aligned annotation if each piece of annotation is explicitly associated with the corresponding chunk (time span, segment) of the media file.

Time-linking is the activity and result of specifying the time-alignment of each annotation associated with a certain chunk in the media file.

Time-linking is usually done by using time marks.

• Time marks: the start and end times of each chunk
• Segmenting (of a media file): identification of relevant chunks and their time marks
• Workflow: segmenting, then adding annotation

Older unit-oriented annotation can later be time-aligned, but this is very labour-intensive (but now see WebMAUS from CLARIN/BAS).
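
The workflow of segmenting first and adding annotation afterwards can be sketched in Python. This is a minimal illustration and not the data model of any particular tool; the names `Segment` and `segment_media`, the millisecond unit, and the boundary values are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    """A chunk of a media file, delimited by two time marks (in milliseconds)."""
    start_ms: int
    end_ms: int
    annotations: dict = field(default_factory=dict)  # e.g. {"transcription": "..."}


def segment_media(time_marks):
    """Turn an ascending list of boundary time marks into consecutive segments."""
    if sorted(time_marks) != list(time_marks):
        raise ValueError("time marks must be in ascending order")
    return [Segment(start, end) for start, end in zip(time_marks, time_marks[1:])]


# Step 1: segmenting (hypothetical chunk boundaries in a recording).
segments = segment_media([0, 1800, 4200, 6100])

# Step 2: adding annotation; each piece is tied to its chunk of the media file.
segments[0].annotations["transcription"] = "(first utterance)"
segments[0].annotations["translation"] = "(its translation)"

for seg in segments:
    print(seg.start_ms, seg.end_ms, seg.annotations)
```

Because every annotation is stored on a segment that carries its own start and end time marks, the annotation is time-aligned by construction.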


3) Types and Interdependencies of Annotations

Linguistic types of annotations

Annotations differ according to the types of properties of the speech event that are represented in the annotation.

Annotations can be phonetic, phonological, morphological, syntactic, semantic, pragmatic (possibly others), and on each level they can focus on the units, or on structures of units, ...

Coverage of annotation

• Basic annotation: only transcriptions, translations and optionally notes, on a sentence / clause / intonational unit level
• Basic glossing: additionally, information on individual morphs: a gloss (indication of meaning or function) and perhaps a part-of-speech tag
• Advanced glossing: one or several additional levels, from phonetic to pragmatic (for instance, a prosodic transcription, or annotation of the syntactic structure, of grammatical relations, etc.)

Most often used format in language description: Interlinear Morpheme Translations / Glossing (standard “glossing”)

• C. Lehmann: Interlinear Morphemic Glossing. In Morphology (2004, first version 1982)
• Leipzig Glossing Rules: Linguistics @ MPI-EVA (B. Comrie, M. Haspelmath) & Linguistics @ Univ. Leipzig (B. Bickel), 2008

Example:

time-o     ne       veni-a-t
fear-1.SG  NEG.VOL  come-SBJV.PRES-3.SG
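
Interlinear glosses of this kind follow two mechanical conventions from the Leipzig Glossing Rules: the object line and the gloss line are aligned word by word, and each word has the same number of hyphens as its gloss. The following Python sketch checks both; the function name `check_gloss_alignment` is hypothetical, and the check says nothing about whether the glosses themselves are linguistically right.

```python
def check_gloss_alignment(object_line: str, gloss_line: str) -> list:
    """Return a list of problems; an empty list means the two lines correspond."""
    words = object_line.split()
    glosses = gloss_line.split()
    if len(words) != len(glosses):
        return [f"{len(words)} words but {len(glosses)} glosses"]
    problems = []
    for word, gloss in zip(words, glosses):
        # Segmentable morphemes are separated by hyphens in both lines.
        if word.count("-") != gloss.count("-"):
            problems.append(f"{word!r} vs {gloss!r}: different number of hyphens")
    return problems


print(check_gloss_alignment("time-o ne veni-a-t",
                            "fear-1.SG NEG.VOL come-SBJV.PRES-3.SG"))
# prints []
```

Periods inside a gloss (as in `SBJV.PRES`) are deliberately not counted, since they join several categories expressed by one morph rather than separating morphs.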

Problems:

• Theory-specific (item-and-arrangement, not item-and-process nor word-and-paradigm)
• Mixes morphology and syntax
• Problems with synthetic word forms, e.g. timeo: 1P, SG, IND, ACT, PRES. Where is PRES? (Ø)
• Analytical word forms (esp. discontinuous)
• What do the labels designate? Meaning?

Hans-Heinrich Lieb & Sebastian Drude: Advanced Glossing: A Language Documentation Format (DOBES Working Paper, November 2000)

http://dobes.mpi.nl/documents/Advanced-Glossing1.pdf

Advanced Glossing (AG): Glossing table

(Figure; labels: a cell; a line, which is a list; a holistic line)

AG: A Glossing

• Glossing table
• Comment (general comment, ...)

AG: Syntactic and Morphological Glossings of a sentence

• Syntactic glossing of a sentence
• Morphological glossing of a sentence: M. glossing of word 1, M. glossing of word 2, M. glossing of word 3, ...

AG: Glossing of a Text

• Glossings of the sentences: syntactic and morphological glossings of sentence 1, of sentence 2, of sentence 3, ...
• General comment on the text
• Raw data
• (Other components)

(Relation label in the figure: “is a glossing of”)


3) Types and Interdependencies of Annotations

• Time-linked annotations for sentence utterances
• Other dependent sentence annotations
• Subdivision into annotations for syntactic units (can be internally time-aligned or not)
• Dependent syntactic-unit annotations
• Further subdivision into annotations for morphs (hardly possible to time-align internally)
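
The dependencies listed above form a hierarchy, which the following Python sketch models under some assumptions: the class `Unit`, its fields, and the rule that a unit without its own time marks falls back on its nearest time-aligned ancestor are illustrative choices, not the design of a specific tool. The time span and the dependent values are invented.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Unit:
    """An annotated unit: a sentence, a syntactic unit, or a morph."""
    text: str
    span_ms: Optional[Tuple[int, int]] = None      # own time-alignment, if any
    dependent: dict = field(default_factory=dict)  # e.g. translation, gloss
    parts: List["Unit"] = field(default_factory=list)
    parent: Optional["Unit"] = field(default=None, repr=False, compare=False)

    def add_part(self, part: "Unit") -> "Unit":
        """Attach a subdivision (e.g. a word to a sentence) and return it."""
        part.parent = self
        self.parts.append(part)
        return part

    def effective_span(self) -> Optional[Tuple[int, int]]:
        """Own time span, or else that of the nearest time-aligned ancestor."""
        if self.span_ms is not None:
            return self.span_ms
        return self.parent.effective_span() if self.parent else None


# Sentence level: time-linked, with a dependent annotation.
sentence = Unit("timeo ne veniat", span_ms=(0, 1800),
                dependent={"translation": "(free translation)"})
# Syntactic-unit level: here not internally time-aligned.
word = sentence.add_part(Unit("timeo", dependent={"pos": "V"}))
# Morph level: hardly possible to time-align internally.
morph = word.add_part(Unit("time", dependent={"gloss": "fear"}))

print(morph.effective_span())  # prints (0, 1800)
```

Only the sentence carries time marks in this example; the word and the morph are located in the recording through the sentence they belong to.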


4) Annotation Tools

Transcriber

Tool for the segmentation and transcription of audio files

Pros: Compatible with Mac, Windows & Linux; very easy to use; produces simple XML files

Cons: No Unicode input possible; only one line of annotation; no video; no lexicon; outdated (new version not tested)

ELAN

Tool for the complex annotation of audio and video files

Pros: Compatible with Mac, Windows & Linux; audio and multiple video files; unlimited tiers for different speakers; state-of-the-art; wide user community; XML output (but complex)

Cons: Complex tool for beginners (but now: ...

Toolbox

Text-oriented general database tool for linguistic fieldwork with lexicon and texts

Pros: Flexible and powerful; export to different formats (incl. XML), therefore easy to integrate with other tools; many users

Cons: Too flexible; poor data format (“Standard Format”); complex to set up; tricky on Mac/Linux; no video and no time-aligning; at end of ...

FLEX

Extensive linguistic database tool for linguistic fieldwork with lexicon and texts

Pros: Powerful and well-designed; inbuilt ontology and analysis tools; growing user community

Cons: Not flexible (8 tiers); one huge XML database with no good import or export function for texts; Windows only; difficult to configure; no audio, no video, no time-alignment; produced by SIL

Other tools

• Praat: for segmenting; best for phonetic annotation
• CLAN: audio and video annotation in the CHAT or CA (Conversation Analysis) formats, for child language data (CHILDES project)
• ANVIL: seems to be similar to ELAN (not tested)
• EXMARaLDA Partitur-Editor (U. Hamburg): widely used for discourse analysis
• Audiamus and Eopas (N. Thieberger): organize (not create) annotation
• Poio (developed in the context of CLARIN, API)

There are several others.

                                      Transcriber          ELAN                        Toolbox               FLEX
Complexity                            Easy                 Complex, with easier modes  Complex to configure  Complex
Audio                                 Yes                  Yes                         No (can play)         No
Video                                 No                   Yes                         No                    No
Tiers                                 1 per speaker        Unlimited                   Unlimited             Fixed: 8
Lexicon interop., automatic glossing  No                   No (is planned)             Yes                   Yes
Unicode                               No input             Yes                         Yes                   Yes
Data format                           Simple XML           Complex XML                 Faulty TXT            XML database
Interoperability                      Good                 Fair                        Good                  Bad
User community / support              Small?, no support?  Large, good support         Large, fair support   Small, good support
Life cycle                            Old (but new ...


5) Annotation data formats
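
As a rough illustration of what a time-aligned XML format looks like, the Python sketch below reads a heavily reduced, hand-written fragment modelled on EAF, the format written by ELAN. The element and attribute names follow EAF; the sample content, the tier id `transcription`, and the function name `read_aligned_annotations` are assumptions made for the example, and the fragment is not a complete, valid EAF file.

```python
import xml.etree.ElementTree as ET

# Hand-written, reduced EAF-like fragment (hypothetical content).
EAF_SNIPPET = """
<ANNOTATION_DOCUMENT>
  <TIME_ORDER>
    <TIME_SLOT TIME_SLOT_ID="ts1" TIME_VALUE="0"/>
    <TIME_SLOT TIME_SLOT_ID="ts2" TIME_VALUE="1800"/>
  </TIME_ORDER>
  <TIER TIER_ID="transcription">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a1"
                            TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
        <ANNOTATION_VALUE>timeo ne veniat</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
</ANNOTATION_DOCUMENT>
"""


def read_aligned_annotations(xml_text):
    """Return (tier id, start ms, end ms, value) for each time-aligned annotation.

    Assumes every referenced time slot carries a TIME_VALUE, which real
    files do not guarantee.
    """
    root = ET.fromstring(xml_text.strip())
    # Time marks are stored once and referenced by id from the annotations.
    slots = {slot.get("TIME_SLOT_ID"): int(slot.get("TIME_VALUE"))
             for slot in root.iter("TIME_SLOT")}
    rows = []
    for tier in root.iter("TIER"):
        for ann in tier.iter("ALIGNABLE_ANNOTATION"):
            rows.append((tier.get("TIER_ID"),
                         slots[ann.get("TIME_SLOT_REF1")],
                         slots[ann.get("TIME_SLOT_REF2")],
                         ann.findtext("ANNOTATION_VALUE")))
    return rows


print(read_aligned_annotations(EAF_SNIPPET))
```

The indirection through time slots is the point of interest: annotations do not store times themselves but refer to shared time marks, so several tiers can be aligned to the same boundaries.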


6) Standing challenges

• No standardized conventions for layers of linguistic annotation
• Problems with interlinear morpheme glosses
• Unclear status / interpretation of labels
• Different labels for ‘same’ categories
• Different definitions for ‘same’ categories based on different theories


EUROTYP: ca. 550 “abbreviations of terms”:

Morphological categories            246
Lexical word classes                114
Syntactic relations                  56
Syntactic constituent categories     27
Semantic roles                       16
Word order                           16
Sentence types                        2
Varieties and other                 6+2

Inflection: analytical word forms

Where is the PLUPERFECT to be annotated?

moni  -t             -us
ask   -PART.PF.PASS  -NOM.SG.M

er    -a         -m
PASS  -IND.PAST  -1.SG.ACT

monitus eram (analytical form): 1P, Sg, Ind, Pass, Plpf, NomV, MascV

http://dobes.mpi.nl/documents/Advanced-Glossing1.pdf
