• No results found

A Knowledge Acquisition and Management System for Morphological Dictionaries

N/A
N/A
Protected

Academic year: 2020

Share "A Knowledge Acquisition and Management System for Morphological Dictionaries"

Copied!
5
0
0

Loading.... (view fulltext now)

Full text

(1)

A Knowledge Acquisition and Management

System for Morphological Dictionaries

Pius ten Hacken °, Stephan Bopp*, Marc Domenig °, Dieter Holz °, Alain Hsiung °, Sandro Pedrazzini'Jl

°Universit~it Basel, IFI *Vri.je Universiteit Amsterdam 'qDSIA Petersgraben 51 SR Lexicologie Corso Helvezia 36 CH-4051 Basel NL-1081 HV Amsterdam CH-6900 Lugano

Abstract

A system for the acquisition and management of reusable morphological dictionaries is clearly a useful tool for NLP. As such, most currently popular finite-state morphology systems have a number of drawbacks. In the development of Word Manager, these problems have been taken into account. As a result, its knowledge acquisi- tion component is well-developed, and its knowledge representation enables more flexible use than typical finite-state systems.

1. Introduction

It has long been recognized that there is a need for reusable dictionaries for NLP-systems. A system for the devel- opment and maintenance of such dictio- naries can be viewed as a knowledge acquisition and management tool. The major objectives from this perspective are support for efficient dictionary con- struction, and maintenance of consis- tency within the database.

As a starting point, it makes sense to take morphological dictionaries of a type that can at least solve the m a p p i n g p r o b l e m between text words and dictio- nary entries. This problem directly in- volves inflection, (graphic) clitics (e.g. German z u m = zu dent), and multi-word

units, while wordformation is necessary for the recognition of newly coined words. Concentrating on these morpho- logical dictionaries recommends itself because the information they contain is relatively well-understood (compared to e.g. semantics), can be isolated fairly well from more application-specific types of knowledge, and is necessary for almost every NLP-application.

In this project note, we will first con- sider finite-state morphology systems from the perspective of k n o w l e d g e acquisition and management, and ob- serve some shortcomings (section 2). Then Word Manager (WM) is presented, and it is shown how WM avoids these drawbacks (section 3). In section 4, an overview of the present state of the WM-project is given.

2. Finite-State Morphology

Systems

(2)

String alternation

rules Formative

lexicon]

I

~ Compilation '1 Compilation

(Lexicon FSA) ( R u l e F S T s )

Lexical ransducel2, j

Figure l: Architecture of Two-Level Morphology.

Compared to the original systems of Koskenniemi (1983) and Karttunen (1983), the major improvements made during the past ten years were in the compilation of the finite-state transition tables, and the switching from analyzers to transducers. By pushing finite-state technology to its limits, the resulting finite-state machines are extremely t2qst.

If we consider current finite-state mor- phology systems as a potential solution to the problem sketched in section 1, i.e. the acquisition and maintenance of mor- phological and lexical knowledge, they have a number of shortcomings. First, knowledge acquisition is not supported very well: the editing of the formative lexicon and the string alternation rules must be done with text editors. Hence, it is a major difficulty for the expert to conceive the lexicon's structure and the string alternation rules. In particular, the system provides little support for the visualization of the structuring and in- teraction between the entities specified.

specifically for the purpose of mapping from text words to their analysis and vice versa. Flexibility in this respect is lacking. Thirdly, the mapping problem as defined above is only partially cov- ered: neither multi-word units, nor compounding and prefixation can be handled appropriately with the basic systems.

3. Word Manager

The three drawbacks of finite-state mor- phology mentioned have served as a starting point in the development of Word Manager (WM). The resulting properties of WM are presented in the three following subsections.

3.1. C o v e r a g e

A strictly finite-state mechanism has a number of problems in covering natural language morphology, as has been recognized earlier (e.g. Kay (1987)).

[image:2.612.175.425.85.239.2]
(3)

tion, the rules for combining formatives in WM are context-free. Another differ- ence with two-level morphology is the basic distinction between inflection and wordformation in WM. Inflection is treated as the paradigmatic realization of certain features on a lexeme, whereas wordformation is the application of a rule to a lexeme, resulting in a new lexeme. This type of distinction is lin- guistically motivated and pragmatically elaborated by ten Hacken (1993).

The rules for string alternation in WM are similar in function to two-level rules, but two additional ways are offered to restrict their domain of appli- cation. Whereas two-level rules can only see and change strings, WM string alter- nation rules can also see (but not change) features. Besides they can be defined for individual classes of lexemes or for individual entries, for the treatment of exceptions. The entire formalism of this part of WM is described in Domenig & ten Hacken (1992).

The subsystem Phrase Manager, de- scribed in Pedrazzini (1994), covers all cases where the mapping between text words and lexemes is not one-to-one. This includes (graphic) clitics and multi- word units. The clitics mechanism may split up text words and rearrange the parts before further analysis. Multi- word units are recognized and assigned a structure on the basis of the string, and treated as possible analyses alongside the literal ones.

Together, the mechanisms included in WM cover the entire mapping problem.

3.2. Compilation

The WM formalism can be compiled in a variety of ways. As opposed to compi- lation of current finite-state systems, the output is not restricted to an analyzer or transducer.

The basic compilation converts rules and entries into a network structure record- ing links between rules, formatives, and lexeme entries. On the basis of this structure, WM can offer d y n a m i c knowledge aggregation options, to be used for browsing, d e b u g g i n g (rule application tracing) etc.. Another use of this structure is in the access to the data by client applications, where any type of view on the data is supported.

By compiling the wordformation rules into a unification-based grammar, WM can analyze u n k n o w n words, and, drawing from the information gathered in the parse tree, generate their lexeme entries (including all inflectional forms of the new entry). Thus, the word- formation rules can be used for semi- automatic construction of the dictionary.

(4)

3.3. Knowledge Acquisition

In the construction of a WM dictionary, linguistic and lexicographic knowledge are separated. As opposed to coding in finite-state morphology, it is not a single person who has to keep track of all the rules, know all special symbols and write all the entries. Instead, tailor-made interfaces for linguistic and lexico- graphic experts have been developed.

The l i n g u i s t ' s interface supports the description of the lnorphological rule system ol' a language in the WM-formal- ism. Besides formulating the rules them- selves, the linguist has to give at least one example per rule, and to describe all exceptions.

The compilation step checks syntactic correctness and signals a number of semantic errors or probable errors. The compiled database can be inspected to see if the rules yield the desired results. The interface offers an incremental compilation facility, so that it is possible to get an inspectable version of the database without entirely recompiling it. This makes a gradual expansion of the rule system with frequent testing of intermediate results feasible. Thus, the way compilation has been implemented in WM contributes to a higher degree of consistency.

In the lexicographer's interface, the rule-base coded by the linguist is pre- supposed and two tasks are supported:

° New formatives (stems) can be specified and linked to existing regu- lar inflection rules.

• Existing l'ormatives can be combined in novel ways, according to existing rules lot wordformation and multi- word units.

Obviously, the interface Call show the consequences ot' any lexicographic deci- sion for the l'orms generated by the rules concerned. More active support is also provided, so that the system proposes an analysis interactively, t;or the first task, an inflection rule is proposed on the basis of forms given by the lexicogra~ phcr. In the second task, the support consists of parsing the input on the basis of the rules in the database. The lexico- grapher can choose and inspect one of the proposals, confirm it as it is, or go back and choose a dill'trent proposal. The resulting entry is incrementally added to the lexical database.

The lexicographer's interface helps the lexicographer concentrate oil lexico- graphic decisions, rather than problems of encoding them. Since wordformation

a c c o u n t s for a major portion of a lan-

guage's vocabulary, and this part of a WM-database is constructed by rule application a high degree of consistency is guaranteed.

4. Present State of the Project

(5)

rently runs on Macintosh only. The i m p l e m e n t a t i o n i n c l u d e s P h r a s e Manager, the compiler, and the linguist's and l e x i c o g r a p h e r ' s i n t e r f a c e s as described above. A comprehensively documented version has been published on an ftp server ([email protected]).

A full rule database, including morpho- logical and phrasal rules, has been implemented for German, and complete morphological rule databases for Italian (described by Bopp (1993)) and English. The development of these databases has shown that the expressive power of the WM-formalism is sufficiently rich for at least some of the languages most fre- quently used in NLP-systems currently under development.

5. C o n c l u s i o n

From the perspective of knowledge acquisition and maintenance, WM has a number of properties that distinguish it from current finite-state morphology systems in a positive way. It has a larger coverage, including compounding, pre- fixation, and multi-word units; compila- tion is much more generic in that it enables flexible use of information in the database; and knowledge acquisition is supported by tailor-made interfaces for linguists and lexicographers. Therefore, WM is a tool that may be seen as an ideal supplement for today's finite-state analyzers and transducers and thus a further step towards the development of truly reusable dictionaries.

R e f e r e n c e s

Bopp, Stephan (1993), Computerimplementation der italienischen Flexions- und Wortbil- dungsmorphoIogie, Olms, Hildesheim.

Domenig, Marc & ten Hacken, Pius (1992), Word Manager: A System Jor Morphological Dictionaries, Olms, Hildesheim.

ten Hacken, Pius (1993), De/~ning Morphology: A Principled Approach to Determining the Boundaries c)f Compounding, Derivation, and Inflection, Ph.D. Dissertation, Univer- sity of Basel, to appear at Olms Verlag, Hildesheim.

Karttunen, Lauri (ed.) (1983), 'KIMMO: A Two Level Morphological Analyzer', in Texas Linguistic Forum 22, Department of Linguis- tics, University of Texas, Austin, p. 163- 278.

Karttunen, Lauri, Kaplan, Ronald M. & Zaenen, Annie (1992), 'Two-Level Morphology with Composition', Proceedings of Coling-92, p. 141-148.

Kay, Martin (1987), 'Nonconcatenative Finite- State Morphology', in Proceedings of the Third Conference of the European Cahpter o./" the Assocation Jor Computational Linguis- tics, Copenhagen, p. 2-10.

Koskenniemi, Kimmo (1983), Two-Level Morphology." A General Computational Model for Word-Form Recognition and Production, University of Helsinki.

Figure

Figure l: Architecture of Two-Level Morphology.

References

Related documents

GENERAL CONDITIONS: No responsibility is assumed for: (a) entries that contain inaccurate information or do not comply with or violate the Official Rules; (b) entries, prize

you can watch our 7:00 am Spanish Mass or 8:30 am English Mass via Zoom or on our Facebook page, then come to receive communion.. in the Plaza from 2:00pm-3:00pm in front of

THE REVIEWS ARE GROUPED BY TYPE OF ALLOWANCE (SECTION A: COST-OF- LIVING; SECTION B: OVERSEAS HOUSING ALLOWANCE; SECTION C: EXCHANGE RATES, SECTION D: NO CHANGES BASED ON

The level of total protein and albumin was depleted in the group treated with diclofenac sodium (toxin control) and they were significantly decreased ( P < 0. 001 ) when compared

The American Association for Thoracic Surgery guidelines for lung cancer screening using low-dose computed tomography scans for lung cancer survivors and other high-risk groups...

Infelizmente não tivemos a participação de todos os alunos agregados aos grupos das turmas no aplicativo, mas na verdade já estávamos preparados para essa possível realidade, uma vez

The Ministry of Health (MOH) in Myanmar provides wider access of contraceptive methods in the community to improve quality of life on people in the community by legalizing

Considering the fact that the combination therapy is highly effective in inhibiting stem cell population in primary tumors as well as chemo-resistant colon cancer cells, our