Standards for Linguistic Markup - Shallow Processing and XML Markup

4.2 Shallow Processing and XML Markup

4.2.8 Standards for Linguistic Markup

There are a number of standards and proposals for linguistic annotation, some going back to the time before XML was invented. We will briefly discuss some of them. Although the processing of annotated corpora is not the focus of this thesis, standards, and especially XML, play an important role because NLP may have corpora at both ends: NLP components may use them as input (e.g. for training statistical models) and produce markup as output, e.g. automatically annotated corpora. Thus, corpus annotation frameworks and NLP-generated markup are closely bound up with each other.

However, because of the limitations of NLP components, the markup produced by a specific NLP component typically covers only a subset of a corpus annotation scheme, which is often designed to cover a broader variety of linguistic phenomena. Generally speaking, the criteria that are crucial for corpus annotation are also important for NLP component output. Ide and Romary (2002), for example, name consistency of tag set and encoding schema, recoverability of the source text, validatability, processability, extensibility, compactness and readability.

4.2.8.1 Text Encoding Initiative (TEI)

In the early days of linguistic corpus annotation, SGML was proposed and used for corpus markup. As early as 1987, the separation of text and layout for content markup, and independence from system-, hardware- and software-specific data formats, motivated the Text Encoding Initiative (TEI), a consortium of institutions and projects related to history, literature, linguistics, philology etc., to set up a standard for content-oriented document annotation.

‘The Text Encoding Initiative (TEI) Guidelines are an international and interdisciplinary standard that facilitates libraries, museums, publishers, and individual scholars represent a variety of literary and linguistic texts for online research, teaching, and preservation.’

(Sperberg-McQueen and Burnard, 1994)

Text structures (DTDs) have been defined for, e.g., prose, verse, drama, speech transcription, dictionaries and terminology, but also for linguistic information such as part-of-speech tags or inflection, and even feature structures. The proposed tag sets are very comprehensive and organized in modular DTDs. The first series of guidelines was published in 1990 as TEI P1. In 1998, TEI adopted XML as an additional markup syntax.

Elements for linguistic markup are e.g.

• <s> for sentence-like division of a text
• <cl> for grammatical clause
• <phr> for grammatical phrase
• <w> for grammatical (not necessarily orthographic) words
• <m> for grammatical morpheme
• <c> for character
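To make the role of these elements concrete, the following minimal sketch wraps a tokenized sentence in <s> and <w> elements and then strips the markup again, which illustrates the recoverability criterion mentioned above. The function names and the token format (word plus trailing characters) are our own assumptions for illustration; they are not part of TEI.

```python
import xml.etree.ElementTree as ET


def mark_up_sentence(tokens):
    """Wrap a tokenized sentence in <s>, with one <w> per word.

    Each token is a (word, trailing) pair; `trailing` holds the whitespace
    or punctuation that follows the word in the source text, so that the
    original character sequence stays recoverable.
    """
    s = ET.Element("s")
    for word, trailing in tokens:
        w = ET.SubElement(s, "w")
        w.text = word
        w.tail = trailing  # text between words lives outside <w>
    return s


def recover_text(element):
    """Concatenate all character data, i.e. strip the markup again."""
    return "".join(element.itertext())


# Hypothetical tokenizer output for one sentence.
tokens = [("Segregation", " "), ("was", " "), ("outlawed", ".")]
s = mark_up_sentence(tokens)
print(ET.tostring(s, encoding="unicode"))
print(recover_text(s))
```

Keeping inter-word material in the tail of each <w> is one possible design choice; it keeps the element content limited to the word itself while still allowing the source text to be reconstructed.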

An XML example taken from the TEI P5 Guidelines (Sperberg-McQueen and Burnard, 1994):

<p>
<s>
<cl type="finite declarative" function="independent">
<phr type="NP" function="subject">Nineteen fifty-four,
<phr type="VP" function="predicate">was eighteen years old</phr>
</cl>
</phr>,
<phr type="NP" function="predicate nom.">a crucial turning point
<phr type="PP" function="postmodifier">in
<phr type="NP" function="prep.obj.">the history
<phr type="PP" function="postmodifier">of the Afro-American</phr>
</phr>
</phr> -
<phr type="PP" function="appositive postmodifier">for
<phr type="NP" function="prep.obj.">the U.S.A.
<phr type="PP" function="postmodifier">as a whole</phr>
</phr>
</phr>
</phr> -
<phr type="NP" function="subject">segregation</phr>
<phr type="VP" function="predicate">
<phr type="V" function="main verb">was outlawed</phr>
<phr type="PP" function="postmodifier">by the U.S. Supreme Court</phr>
</phr>
</cl>
</phr>
</cl>
</phr>
</phr>.
</cl>
</s>
</p>
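As a rough illustration of how an NLP component might consume such markup, the sketch below parses one well-formed noun phrase subtree of the example with Python's standard xml.etree.ElementTree module and lists every <phr> element with its nesting depth, type, function and text. The helper name list_phrases is hypothetical, and the fragment is only an excerpt of the example, not the complete sentence.

```python
import xml.etree.ElementTree as ET

# A well-formed excerpt of the example above (one noun phrase subtree).
FRAGMENT = (
    '<phr type="NP" function="predicate nom.">a crucial turning point '
    '<phr type="PP" function="postmodifier">in '
    '<phr type="NP" function="prep.obj.">the history '
    '<phr type="PP" function="postmodifier">of the Afro-American</phr>'
    '</phr></phr></phr>'
)


def list_phrases(root):
    """Return (depth, type, function, text) per <phr>, outermost first."""
    rows = []

    def visit(node, depth):
        if node.tag == "phr":
            # Join all character data below the node and normalize spaces.
            text = " ".join("".join(node.itertext()).split())
            rows.append((depth, node.get("type"), node.get("function"), text))
        for child in node:
            visit(child, depth + 1)

    visit(root, 0)
    return rows


root = ET.fromstring(FRAGMENT)
for depth, ptype, function, text in list_phrases(root):
    print("  " * depth + f"{ptype} ({function}): {text}")
```

Because the phrase text is obtained by concatenating character data, every enclosing phrase contains the text of the phrases nested inside it.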

Although TEI is frequently referenced by other approaches and annotation schemata and is one of the oldest annotation standardization efforts, many corpora do not use the TEI schema, but rather simpler ad-hoc annotation schemata designed for their actual, specific needs. The main reason is that TEI suffers from the SGML sickness: in aiming to describe any phenomenon and to foresee every case and feature, the schema becomes complex and confusing (cf. Witt 1998).

At the same time, TEI leaves room for more specific extensions (hence the term ‘guidelines’) and is organized in a modular way. However, there are also aspects that TEI does not cover at all, such as semantic annotation, and that are not easy to make conform to the guidelines. Although it is possible to add extensions to a TEI schema, people often end up defining their own, TEI-independent schema, taking TEI as a starting point. In Chapter 5, we will argue why this does not do much harm from a technical perspective. However, the question remains what the value is of a standard that is too general on the one hand and too inflexible on the other.

4.2.8.2 CES and XCES

CES (Corpus Encoding Standard; Ide 1998) has been developed as an application of TEI (initially in SGML) and as part of the EAGLES (Expert Advisory Group on Language Engineering Standards) guidelines. As such, CES puts a much stronger focus on (linguistic) corpus annotation than TEI did. CES extends the TEI specifications and makes them more specific where appropriate, and on the other hand restricts the TEI scheme to the subset of tags that is relevant for corpus annotation.

Like TEI, CES has migrated to XML under the name XCES (Ide et al., 2000a; Ide and Romary, 2001, 2002). The approach is clearly top-down-oriented, and its basic concepts have also influenced the ISO standardization efforts for linguistic annotation that we will describe in the next section. XCES defines an abstract Structural Skeleton for syntactic structures that are common to all annotation schemes possible in the corpus world, and a Data Category Registry that defines general categories such as phrase types in a hierarchy using RDF (cf. Section 4.2.9).

Both the Structural Skeleton and the Data Categories are instantiated for a specific annotation scheme (called AML, the Annotation Markup Language), which fixes, e.g., whether the noun phrase category is expressed as an attribute value or as the name of an element, and likewise for the rest of the syntactic structure. This AML then corresponds to, and can be written as, an XML DTD.

[Figure labels: Universal Resources; Concrete Resources; Structural Skeleton; Data Category; Data Category Registry; Concrete XML Encoding; Abstract Markup Language Specification]

Figure 4.3: XCES annotation framework (simplified)
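The instantiation step described above, where one abstract structure is mapped to different concrete encodings, can be sketched as follows. The same abstract tree is serialized once with the category as element name and once with the category as attribute value on a generic element. The tree format, the category labels, and the element and attribute names struct and type are our own illustrative assumptions, not taken from the XCES specification.

```python
import xml.etree.ElementTree as ET

# Abstract structure: (category, children-or-text). The category labels stand
# in for entries of a data category registry; they are illustrative only.
TREE = ("NP", [("DET", "the"), ("N", "history")])


def encode_as_elements(node):
    """Concrete encoding 1: the category becomes the element name."""
    category, content = node
    element = ET.Element(category)
    if isinstance(content, str):
        element.text = content
    else:
        element.extend([encode_as_elements(child) for child in content])
    return element


def encode_as_attributes(node, tag="struct", attr="type"):
    """Concrete encoding 2: generic element, category as attribute value."""
    category, content = node
    element = ET.Element(tag, {attr: category})
    if isinstance(content, str):
        element.text = content
    else:
        element.extend([encode_as_attributes(c, tag, attr) for c in content])
    return element


print(ET.tostring(encode_as_elements(TREE), encoding="unicode"))
print(ET.tostring(encode_as_attributes(TREE), encoding="unicode"))
```

Both serializations carry the same information; only the concrete XML vocabulary differs, which is why each of them could be described by its own DTD.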

What makes XCES interesting for our needs in deep-shallow integration is its abstract top-down view of annotation, its concrete realization, and its clear adoption of XML transformation to implement the top-down approach, including concepts such as stand-off annotation and linking of annotation. These points will be discussed later.
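As a first impression of the stand-off idea, the following minimal sketch keeps the source text untouched and stores annotations separately as labelled character ranges that point into it. The Span class, the helper functions and the labels are hypothetical and only serve the illustration; XCES itself realizes stand-off annotation with XML linking rather than with Python objects.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """One stand-off annotation: a label over a character range."""
    start: int  # inclusive character offset into the source text
    end: int    # exclusive character offset into the source text
    label: str


TEXT = "Segregation was outlawed by the U.S. Supreme Court."


def annotate(text, phrase, label):
    """Create a Span for the first occurrence of `phrase` in `text`."""
    start = text.index(phrase)
    return Span(start, start + len(phrase), label)


def resolve(text, span):
    """Look up the stretch of source text that a Span points to."""
    return text[span.start:span.end]


# The annotations live outside the text; the text itself is never modified.
ANNOTATIONS = [
    annotate(TEXT, "Segregation", "NP"),
    annotate(TEXT, "was outlawed", "V"),
    annotate(TEXT, "by the U.S. Supreme Court", "PP"),
]

for span in ANNOTATIONS:
    print(span.label, span.start, span.end, repr(resolve(TEXT, span)))
```

Since the annotations only refer to offsets, several independent annotation layers, e.g. from different NLP components, can point into the same unmodified text.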

4.2.8.3 ISO

A recent development is the ISO standardization of linguistic markup for computational linguistics, computerized lexicography, and language engineering, defining ‘standards by specifying principles and methods for creating, coding, processing and managing language resources, such as written corpora, lexical corpora, speech corpora, dictionary compiling and classification schemes. These standards will also cover the information produced by natural language processing components in these various domains.’

This claim makes ISO fit into our goal of using XML for NLP component integration, and it turns out that the ISO working group mainly complements, and partly overlaps with, existing TEI approaches. TEI, for example, is not specific enough on morphology, and although feature structure markup is defined by TEI, ISO tries to cover it in a more principled and elaborate way (Lee et al., 2004).

Moreover, ISO also aims to put more focus on multilingual, multimedia and multimodal aspects than TEI has done so far. However, the standardization process by the joint ISO/TEI working group (TC 37 SC 4) is still ongoing, and its result may become the ISO DIS 24610 standard only at some (not so near) future date. Another focus of ISO is the standardization of non-textual linguistic resources such as lexica, which are also less well covered by TEI.
