Analysis of Lexical Structures from Field Linguistics and Language Engineering

(1)

Analysis of Lexical Structures from Field Linguistics and Language Engineering

P. Wittenburg, W. Peters

+

, S. Drude

++

Max-Planck-Institute for Psycholinguistics Wundtlaan 1, 6525 XD Nijmegen, The Netherlands

[email protected]

+

University of Sheffield

++

Free University of Berlin

Abstract

Lexica play an important role in every linguistic discipline. We are confronted with many types of lexica. Depending on the type of lexicon and the language we are currently faced with a large variety of structures from very simple tables to complex graphs, as was indicated by a recent overview of structures found in dictionaries from field linguistics and language engineering. It is important to assess these differences and aim at the integration of lexical resources in order to improve lexicon creation, exchange and reuse. This paper describes the first step towards the integration of existing structures and standards into a flexible abstract model.

1. Introduction

Lexica play an utterly important role in all linguistic sub disciplines ranging from Language Engineering to Field-Linguistics. The former generally deal with the main languages whereas the latter record minority and endangered languages. Lexica form an essential component in describing all relevant information about a language that can be associated with a structural unit of that language, e.g. a word, a morpheme, or even a whole sentence.

Lexica contain a wide range of linguistic information according to their nature and function. They vary from simple lists to complex resources with many types of linguistic information associated with the entries or elements. In general they can be of various types (the following list is not meant to be exhaustive): word list, machine readable dictionary, thesaurus, ontology, glossary, concordance, term bank, phonetic transcriptions, picture set, video shots, sound bits

Lexical resources are widely used for language and knowledge engineering. In both monolingual and multilingual environments, language resources play a crucial role in preparing, processing and managing the information and knowledge needed by computers as well as humans. In field-linguistics they also play a central role since they are focusing on basic linguistic units such as words, affixes and fixed expressions. The variety of lexical requirements in field linguistics is greater, since the language types differ widely.

Language technology components aiming at carrying out automatic parsing involve even more complex resources including dictionaries. In addition, multilingual dictionaries contain translation equivalents and concordances, and ontologies describe semantic relations between important concepts.

2. Formats and Structure Types

This large variety of available information and the linguistic differences between languages are the main reasons that there is a huge amount of different lexical structures and formats. Almost every lexicon comes along with its own specification that is defined by project and

task requirements. The two terms “format” and “structure” cannot always be separated clearly. The term “structure” mostly refers to the internal organization of a document, while the term “format” addresses information which also has to do with the way information is presented to the user or stored by a computer program, which includes questions of data structure

Computer-based lexica come in various formats such as relational database format (which also implies the ER type of structure, see below), plain-text files in some proprietary format such as SHOEBOX1 (which also has a typical structure, see below), MS WORD document formats and many others.

There are various ways in which textual and lexical data can be annotated and structured, depending on theoretical convictions and associated tools. The most widely used standards for the representation of structures are SGML, XML2 and RDF [1]. But especially in field-linguistics we also meet special structure (and format) definitions such as from Shoebox, which basically has a feature-value pairs which can be embedded in tree structures. Since most of these field linguistic lexica are not meant to be processed automatically, but traditionally are meant to be put on paper, many of them are written in text processors such as MS WORD where the researchers are guided from the traditional structure (and format) principles of written lexica.

Data structures can take the form of typed feature structures such as Comlex3 ([2]; see figure 1), relational tables, e.g. Celex4 ([3] see figure 2), flat files (unnormalized relational format) or resource specific formats such as WordNet5 [4] and EuroWordNet6 [5]. The

1

http://www.sil.org

2

For introductions to SGML and XML see

http://msdn.microsoft.com/library/default.asp?url=/library /en-us/xmlsdk30/htm/xmtutxmltutorial.asp ,

http://www.projectcool.com/developer/xmlz/xmldtd/

,

http://www.oasis-open.org/cover/xml.html

3

http://cs.nyu.edu/cs/faculty/grishman/comlex.html

4

see http://www.kun.nl/celex/

5

see http://www.hum.uva.nl/~ewn

6

(2)

(3)

offset-The most complex lexicon is set up by the Aweti team implemented as a complex hierarchy of Shoebox feature value pairs. The lexicon makes at high level a difference between 4 types of entries: entry-type = [stem | idiom | lexical word], entry-type = [auxiliary | inflectional affix], entry-type = [derivational word | derivational affix] or entry-type = [word form | allomorph]. For each type sub-structures exist. In the following example only an extraction of the first type is shown.

3.2. Lexica from Language Engineering

Beyond what was briefly indicated in chapter 2 the structural properties of a few other well-known lexica from language engineering were analyzed.

To be mentioned here is the GENELEX work the title of which claims to be generic. However, it was a concrete proposal for an exhaustive lexicon with definitions of structure and tag-sets. Its SGML structure consists of a huge DTD with specifications of three main layers (morphology, syntax, semantics) and many lexical elements integrated in tree-structures. GENELEX was used as a base line for the definition of the lexica from the PAROLE and SIMPLE10 projects. These were an attempt to encode multilingual lexica in a uniform way with 12 fairly small sized example lexica as a result (see figure 7).

<MuS id="V01015"

%% morphological unit identifier%% gramcat="VERB"

gramsubcat="MAIN"

synulist="verb-cons-001V01015" %%link to the syntactic units describing the syntactic behavior of the entry%%

autonomy="yes" combuf="UF1"> <Gmu naming="destroy"

InP="Vinfl0"> %%inflectional code%%

<spelling>destroy</spelling> </Gmu>

</MuS>

Figure 7: PAROLE morphological entry

MULTILEX11 was another project focusing on the implementation of 15 concrete lexica applying a structure derived from the EAGLES model of morphosyntactic annotation. Its data structure consists of three columns: wordform, lemma and morphosyntactic label. The latter provides a label for a number of classes. An example is:

adversities adversity

Ncnp-where adversities is a plural, neuter, countable noun.

The MILE (Multilingual Computational Lexicon) project recently started within ISLE has the task of standardizing multilingual lexica.

The early CELEX work was already described. It is realized as a rich set of relational tables for three

10

http://www.ub.es/gilcub/SIMPLE/simple.html

11

http://www.ilc.pi.cnr.it/EAGLES96/lexarch

languages where word form and lemma related information was separated.

3.3. Written Lexica

Also, examples from written dictionaries as analyzed by Bell&Bird [7] and Ide [8] were included to get a broad coverage. Bell&Bird studied more than 50 written lexica and found a number of characteristic organization principles and differences. The study showed mainly how the lexica differ with respect to

• the headword used and its characteristics • the way senses are included

3.4. Other Lexica

Interesting proposals were made by two field researcher who focus on semantic relations between elements of lexical information. Schultze-Berndt [9] and colleagues implemented a lexicon by using the Hypercard mechanisms from Apple. She makes heavy use of semantic classes and also can create links from elements (words, set of words) in comment fields to other entries or elements within entries. In doing so she can realize complex semantic networks. Also Manning [10] stresses the relevance of supporting many different types of semantic relations between entries and attributes of entries. In his KirrKirr lexicon implementation he put much effort in visualizing these relations.

Although we did not find concrete lexica which make use of inheritance mechanisms, it is often reported that inheritance is a very important feature for computer-based lexica. So it is a structural requirement.

3.5. Summary

The analysis was in this stage not yet extended to lexica purely dedicated to cover semantic relations such as ontologies, thesauri etc., although some of the lexica discussed offer possibilities to use their structural possibilities to include such semantic relations.

As discussed above, the structure of the observed lexica varies considerably depending on the languages studied and the research interests. Simply structured dictionaries existing of a single table contrast with relational databases covering a large set of related tables. Also, many differences could be noticed with respect to the microstructure in dictionaries, i.e. the elements used to describe linguistic content and their underlying structural relations. This was supported by the observations found by Bell/Bird who showed, for example, that headwords and sense descriptions diverge.

The lexical structures found within the domains of language engineering and field linguistics diverge considerably. Between the two domains many similarities with respect to the requirements could be shown. Those attempts which use the term “ generic” are not generic in the true sense. What GENELEX for example provides is an exhaustive list of tag sets which are embedded in a fixed hierarchical structure. This is not generic since the tag sets people are using differ largely, but especially since linguists differ largely with respect to the structural embedding of certain tags such as sense descriptions.

(4)

(5)

[2] Grishman, Ralph, Catherine Macleod and Adam Meyers (1994). COMLEX Syntax: Building a Computational Lexicon, Coling94, Kyoto

[3] Burnage (1990), Celex , a Guide for Users, Nijmegen, the Netherlands

[4] Fellbaum, Christiane (ed.) (1998), WordNet: An Electronic Lexical Database.

Cambridge, Mass.: MIT Press.

[5] Vossen, P., Introduction to EuroWordNet.

In: Nancy Ide, N., Greenstein, D. and Vossen, P. (eds), Special Issue on EuroWordNet. Computers and the Humanities, Volume 32, Nos. 2-3 1998. 73-89. [6] Wittenburg, P. (2001) Lexical Structures. MPI Technical Report. MPI Nijmegen

[7] J. Bell, S. Bird (2000) A Preliminary Study of the Structure of Lexicon Entries. Paper presented at the Workshop on Web-Based Language Documentation and Description. Philadelphia.

[8] Ide, N., Le Maitre, J., and Veronis, J.(1991), Outline for a Model of Lexical Databases. RIAO91, Barcelona [9] Schultze-Berndt, E. (2001) Unpublished Manuscript of a contribution to a lexicon workshop. MPI Nijmegen [10] KirrKirr Lexicon:

www.sultry.arts.usyd.edu.au/kirrkirr

[11] Ide, N., Kilgarriff, A. and Romary, L. (2000), A Formal Model of Dictionary Structure and Content, Euralex, Stuttgart

[12] Erjavec, T., Evans, R., Ide, N., Kigarriff, A. (2000), The Concede Model for Lexical Databases, LREC, Granada

[13] Booch, G., Rumbaugh, J. and Jacobson, I. (1999), The Unified Modelling Language User Guide. Addison Wesley Longman