• No results found

FoLiA: Format for Linguistic Annotation

N/A
N/A
Protected

Academic year: 2021

Share "FoLiA: Format for Linguistic Annotation"

Copied!
23
0
0

Loading.... (view fulltext now)

Full text

(1)

Introduction Features Paradigm Format Tools Conclusion The End Extra

FoLiA: Format for Linguistic Annotation

Maarten van Gompel Radboud University Nijmegen

(2)

Introduction Features Paradigm Format Tools Conclusion The End Extra

Introduction

Introduction

What is FoLiA?

Generalised XML-based format for a wide variety of linguistic annotation

Characteristics

Generalisedparadigm – Single universal paradigm applicable to all

kinds of annotations; as few ad-hoc provisions as possible. Not commited to any label set.

Extensible– Unsupported annotation types can be added fairly

easily.

Expressive– Verbose expression of annotations, their annotators,

timestamps, etc... Moreover, support foralternativeannotations.

Formalised– Validation on two levels: shallow and deep. The latter

validates the used label set and allows for links with for instance ISOcat.

(3)

Introduction Features Paradigm Format Tools Conclusion The End Extra

Features

Intended Applications

as a corpus storage format

as a language resource exchange format Properties

One document, one text, one XML file – containing all annotations. Annotation types and label sets must be declared in the document header

Document metadata can be either included in the file (limited), or by reference to external CMDI or IMDI (preferred)

(4)

Introduction Features Paradigm Format Tools Conclusion The End Extra

Motivation & Dissemination

Why (yet) another format?

Many ad-hoc and legacy annotation formats (CGN, Tadpole column format)

Many theoretic and specialised annotation formats with limited scope (LAF, SynAF, MAF, TEI)

Bottom-up rather than top-down development: FoLiA arose from practical need, immediately developed alongside practical

programming libraries and applications. De-facto-standard: D-COI XML Dissemination SoNaR TTNWW DutchSemCor Valkuil.net Frog & Ucto

(5)

Introduction Features Paradigm Format Tools Conclusion The End Extra

Annotation Categories

Paradigm

Paradigm: Annotation Categories Four categories of annotation:

Structure Annotation- Elements denoting document structure

E.g: Divisions, Header, Paragraphs, Sentences, Lists, Figures, Gaps, Quote

Token Annotation- Linguistic Annotations pertaining to a single

token (inline annotation)

E.g: Part of Speech Annotation, Lemma Annotation, Lexical Semantic Sense Annotation

Span Annotation- Linguistic Annotations spanning over multiple

tokens (standoff annotation)

E.g: Syntactic Parses, Dependency Relations, Entities/Multi-word Units

Subtoken Annotation - Linguistic Annotations pertaining to a

subpart of a token (standoff annotation)

(6)

Introduction Features Paradigm Format Tools Conclusion The End Extra

(7)

Introduction Features Paradigm Format Tools Conclusion The End Extra

Document skeleton

Format

<?xml v e r s i o n=" 1.0 " e n c o d i n g=" utf -8 "?> <FoLiA x m l n s=" h t t p : // ilk . uvt . nl / F o L i A "

x m l n s : x s i=" h t t p : // www . w3 . org / 2 0 0 1 / X M L S c h e m a - i n s t a n c e " x m l : i d=" e x a m p l e "> <m e t a d a t a t y p e=" c m d i " s r c=" e x a m p l e . c m d i "> <a n n o t a t i o n s> . . . </ a n n o t a t i o n s> </ m e t a d a t a> <t e x t x m l : i d=" e x a m p l e . t e x t "> . . . </ t e x t> </ FoLiA>

(8)

Introduction Features Paradigm Format Tools Conclusion The End Extra Structure Annotation Example <p x m l : i d=" T E S T . p .1 "> <t>T h i s i s a t e s t. I t h a s two s e n t e n c e s.</ t> <s x m l : i d=" T E S T . p .1. s .1 "> <t>T h i s i s a t e s t.</ t> <w x m l : i d=" T E S T . p .1. s .1. w .1 "><t>T h i s</ t></w> <w x m l : i d=" T E S T . p .1. s .1. w .2 "><t>i s</ t></w> . . </ s><s x m l : i d=" T E S T . p .1. s .2 ">. . .</ s> </ p>

Characteristics of basic structure

1 Structure Elements: Paragraphs, Sentences, Words/Tokens

2 More: Division, Head, List, ListItem, Figure, Gap...

3 Unique identifiers

(9)

Introduction Features Paradigm Format Tools Conclusion The End Extra

Structure Annotation

Token Annotation

Token annotation occurs within the scope of a word/token (w) element.

Example

PoS and Lemma Annotation:

<w x m l : i d=" e x a m p l e . p .1. s .1. w .2 "> <t>b o o t</ t> <p o s s e t=" cgn " c l a s s =" n " a n n o t a t o r=" M a a r t e n van G o m p e l " a n n o t a t o r t y p e=" m a n u a l " /> <lemma s e t=" english - l e m m a s " c l a s s =" b o o t " /> <s e n s e s e t=" c o r n e t t o " c l a s s =" D _ N 1 2 3 4 5 " a n n o t a t o r=" s u p w s d 1 " a n n o t a t o r t y p e=" a u t o " c o n f i d e n c e=" 0 . 6 5 " /> </w>

(10)

Introduction Features Paradigm Format Tools Conclusion The End Extra

Structure Annotation

Token Annotations with subsets

For all annotation types;subsets can be used for more refined

annotations. Example

<w x m l : i d=" e x a m p l e . p .1. s .1. w .2 "> <t>b o o t</ t>

<p o s s e t=" cgn " c l a s s =" N ( soort , ev , basis , zijn , s t a n ) "> <f e a t s u b s e t=" h e a d " c l a s s =" N " /> <f e a t s u b s e t=" n t y p e " c l a s s =" s o o r t " /> <f e a t s u b s e t=" n u m b e r " c l a s s =" ev " /> <f e a t s u b s e t=" d e g r e e " c l a s s =" b a s i s " /> <f e a t s u b s e t=" g e n d e r " c l a s s =" z i j d " /> <f e a t s u b s e t=" c a s e " c l a s s =" s t a n " /> </ p o s> </w>

(11)

Introduction Features Paradigm Format Tools Conclusion The End Extra

Span Annotation

Span Annotation

Token Annotation is not sufficient, some annotations span over multiple tokens (not necessarily consecutive)

Spanning multiple tokens can produce nesting problems (e.g

A(BC)D andAB(CD))

Solution: Span Annotation using standoff notation

Properties

Applications: Syntactic Parses, Chunking, Dependency Relations,

Entities/Multi-Word Units

Layers: Each type of span annotation is placed within anannotation

layer, annotation layers are usually embedded withinsentences(s))

(12)

Introduction Features Paradigm Format Tools Conclusion The End Extra

Span Annotation

<s x m l : i d=" e x a m p l e . p .1. s .1 ">

<t>The D a l a i Lama g r e e t e d him.</ t>

<w x m l : i d=" e x a m p l e . p .1. s .1. w .1 "><t>The</ t></w> <w x m l : i d=" e x a m p l e . p .1. s .1. w .2 "><t>D a l a i</ t></w> <w x m l : i d=" e x a m p l e . p .1. s .1. w .3 "><t>Lama</ t></w> <w x m l : i d=" e x a m p l e . p .1. s .1. w .4 "><t>g r e e t e d</ t></w> <w x m l : i d=" e x a m p l e . p .1. s .1. w .5 "><t>him</ t></w> <w x m l : i d=" e x a m p l e . p .1. s .1. w .6 "><t>.</ t></w> <e n t i t i e s> <e n t i t y x m l : i d=" e x a m p l e . p .1. s .1. e n t i t y .1 " c l a s s =" p e r s o n "> <w r e f x m l : i d=" e x a m p l e . p .1. s .1. w .2 " /> <w r e f x m l : i d=" e x a m p l e . p .1. s .1. w .3 " /> </ e n t i t y> </ e n t i t i e s> </ s>

(13)

Introduction Features Paradigm Format Tools Conclusion The End Extra Span Annotation <s y n t a x> <s u x m l : i d=" e x a m p l e . p .1. s .1. su .1 " c l a s s =" s "> <s u x m l : i d=" e x a m p l e . p .1. s .1. su .1 _1 " c l a s s =" np "> <s u x m l : i d=" e x a m p l e . p .1. s .1. su .1 _ 1 _ 1 " c l a s s =" det "> <w r e f x m l : i d=" e x a m p l e . p .1. s .1. w .1 " /> </ s u> <s u x m l : i d=" e x a m p l e . p .1. s .1. su .1 _ 1 _ 2 " c l a s s =" pn "> <w r e f x m l : i d=" e x a m p l e . p .1. s .1. w .2 " /> <w r e f x m l : i d=" e x a m p l e . p .1. s .1. w .3 " /> </ s u> </ s u> </ s u> <s u x m l : i d=" e x a m p l e . p .1. s .1. su .1 _2 " c l a s s =" vp "> <s u x m l : i d=" e x a m p l e . p .1. s .1. su .1 _ 1 _ 1 " c l a s s =" v "> <w r e f x m l : i d=" e x a m p l e . p .1. s .1. w .4 " /> </ s u> <s u x m l : i d=" e x a m p l e . p .1. s .1. su .1 _ 1 _ 2 " c l a s s =" p r o n "> <w r e f x m l : i d=" e x a m p l e . p .1. s .1. w .5 " /> </ s u> </ s u> </ s u> </ s y n t a x>

(14)

Introduction Features Paradigm Format Tools Conclusion The End Extra

Tools for working with FoLiA

StandardXML facilities: XSLT, XPath

Pythonlibrary: pynlpl.formats.folia

C++library: libfolia (Ko van der Sloot)

Applications

Frog– tagger/lemmatisaion/parser suite: FoLiA output (input in

later stage).

ucto– tokeniser: FoLiA input and output.

Converters

DCOI ←→FoLiA

(15)

Introduction Features Paradigm Format Tools Conclusion The End Extra

Conclusion

Uniformity: generic framework with simple paradigm, XML based

Expressiveness: Ability to encode many kinds of linguistic

annotation, including structural annotation, alternatives, and corrections

Extensibility: easy to add new annotations with the same paradigm

A variety of tools and converters already available! URLs

http://ilk.uvt.nl/folia

(16)

Introduction Features Paradigm Format Tools Conclusion The End Extra

(17)

Introduction Features Paradigm Format Tools Conclusion The End Extra Trade-off

Trade-off: Expressivity versus Computing Efficiency

FoLiA aims at expressivity rather than computing efficiency.

XML and FoLiA overhead: Not ideal for real-time or resource-constrained applications

(18)

Introduction Features Paradigm Format Tools Conclusion The End Extra Annotations

Supported Annotations (1/2)

FoLiA supports the following linguistic annotations: Part-of-Speech tags (with features)

Lemmatisation Domain tagging

Lexical semantic sense annotation (used in DutchSemCor) Named Entities / Multi-word units (used in SoNaR) Syntactic Parses

(19)

Introduction Features Paradigm Format Tools Conclusion The End Extra Annotations

Supported Annotations (2/2)

FoLiA supports the following linguistic annotations: Chunking

Corrections (used in valkuil.net) Morphology

Event/Time annotation Phonetic annotation

(20)

Introduction Features Paradigm Format Tools Conclusion The End Extra Declarations

Token Annotation

All annotations need to be declared in the metadata:

Default sets and annotator maybe predefined at this level

Example <m e t a d a t a> <a n n o t a t i o n s> <t o k e n−a n n o t a t i o n /> <pos−a n n o t a t i o n s e t=" b r o w n " a n n o t a t o r=" M a a r t e n van G o m p e l " a n n o t a t o r t y p e=" m a n u a l "/> <lemma−a n n o t a t i o n /> </ a n n o t a t i o n s> </ m e t a d a t a>

(21)

Introduction Features Paradigm Format Tools Conclusion The End Extra Alternatives

Alternative Token Annotations

Annotations of the same type, but different sets neednotbe alternatives.

<w x m l : i d=" e x a m p l e . p .1. s .1. w .2 "> <t>l u i d</ t>

<p o s s e t=" b r o w n " c l a s s =" jj " /> <p o s s e t=" cgn " c l a s s =" adj " /> </w>

There can be only one of the same set though, this is illegal and requires usage of alternatives instead:

<w x m l : i d=" e x a m p l e . p .1. s .1. w .2 "> <t>l u i d</ t>

<p o s s e t=" cgn " c l a s s =" adj " /> <p o s s e t=" cgn " c l a s s =" adv " /> </w>

(22)

Introduction Features Paradigm Format Tools Conclusion The End Extra Alternatives

Alternative Token Annotations

Encodes mutually exclusive alternative annotations. Any annotations that are not alternatives are considered “selected”.

<w x m l : i d=" e x a m p l e . p .1. s .1. w .2 "> <t>bank</ t> <s e n s e s e t=" w o r d n e t 3 .0 " c l a s s =" b a n k %1 : 1 7 : 0 1 : " a n n o t a t o r=" M a a r t e n van G o m p e l " a n n o t a t o r t y p e=" m a n u a l " c o n f i d e n c e=" 0.8 "> s l o p i n g g r o u n d n e a r w a t e r</ s e n s e> <a l t x m l : i d=" e x a m p l e . p .1. s .1. w .2. alt .1 "> <s e n s e s e t=" w o r d n e t 3 .0 " c l a s s =" b a n k %1 : 1 4 : 0 1 : " a n n o t a t o r=" W S D s y s t e m " a n n o t a t o r t y p e=" a u t o " c o n f i d e n c e=" 0.6 "> f i n a n c i a l i n s t i t u t i o n</ s e n s e> </ a l t> </w>

(23)

Introduction Features Paradigm Format Tools Conclusion The End Extra Alternatives

Alternative Token Annotations

All token annotations grouped as one alternative are considered dependent. Multiple alternatives are always independent:

<w x m l : i d=" e x a m p l e . p .1. s .1. w .2 "> <t>v l i e g</ t> <p o s c l a s s =" N " /> <lemma c l a s s =" v l i e g " /> <a l t x m l : i d=" e x a m p l e . p .1. s .1. w .2. alt .1 "> <p o s c l a s s =" V " /> <lemma c l a s s =" v l i e g e n " /> </ a l t> </w>

References

Related documents

zostericola were sampled during both shallow (20cm) and deep (1 m) water depths, in both the intertidal and subtidal zones of a single large Zostera capricorni seagrass bed

26 Austria www.share-project.org /at Belgium www.share-project.org /be Czech Republic www.share-project.org /cz Denmark www.share-project.org /dk Estonia www.share-project.org

The volume of filtrate produced by the glomeruli is huge; a result of the large surface area, a highly permeable filter, forces favouring filtration and the high cortical blood

Rivara and the Thompsons recommend readers to consult James Hedlund’s article in Injury Prevention entitled “Risky busi- ness: safety regulations, risk compensation, and

60 ( a ) Department of Modern Physics and State Key Laboratory of Particle Detection and Electronics, University of Science and Technology of China, Hefei; ( b ) Institute of

The Community Pediatrics Training Initiative is de- veloping models of community pediatric education, and additional evaluation may show if residents’ perceptions change

Interestingly, because of the key role the dollar exchange rate plays in the valuation and trade channels of adjustment, it is possible to use measures of external imbalances of