Human Language Technology in Digital Libraries


Werner Winiwarter

Software Competence Center Hagenberg, Austria


Abstract

In this paper we focus on the important role of human language technology as one of the key technologies for the universal access to worldwide digital libraries. In particular, linguistic barriers caused by the multilingual nature of the global information pool require solutions from cross-language information retrieval and machine translation. In our research we developed a machine translation environment for the automatic translation of Japanese documents into German. An important point regarding the implementation of the translation environment is that it is completely embedded in the widely used text processing program Word to ensure its easy use by any potential end user.

1. Introduction

The past decades have witnessed a rapidly growing interest in human language technology research, e.g. natural language interfaces, text retrieval and summarization, word sense disambiguation and document categorization, information filtering and extraction, or machine translation (for a good survey see [1]). For all these domains several useful systems have already been developed and there exist realistic expectations of future developments. Human language technology has reached a level of maturity that makes it feasible to solve many of the urgent needs of the coming information age. By learning from past failures and successes we are now ready to apply the new technology to real-life applications.

Whereas the first wave of research on digital libraries was mainly concerned with advancing the infrastructure for organizing and accessing the vast stores of archival and emerging information abundant in all media [2], current and future research focuses on facilitating knowledge transfer from the source to the user. Today’s digital libraries increasingly incorporate proactive services aimed at assisting in the interpretation and application of information to fulfill user information requirements [3].

Some of the key tasks in this context covered by human language technology are to provide improved access to unstructured data, dynamic ontologies for improved information access, the presentation of information through summarization and visualization, and the bridging of the language barrier [4].

The task of multilingual information access in particular has become increasingly important with the steady growth of the proportion of non-English documents on the Web. The research area of cross-language information retrieval (for a recent overview see [5]) addresses the problem of retrieving documents across language barriers. For users of digital libraries it is essential to be able to query large multilingual collections using a single language [6]. There have been many international research initiatives in this field in the last few years, most notably within the EU-NSF Working Group on Multilingual Information Access for Digital Libraries [7].

If the user cannot understand the meaning of a retrieved document, machine translation technology is needed to perform an automatic translation of the document text. The research on machine translation has a long tradition and a somewhat disputed reputation (for a good survey we refer to [8]). As in many other disciplines of computational linguistics, the euphoric mood of the early days gave way to a pessimistic period of stagnation. The interest in machine translation systems faded away after they could not fulfill the unrealistic promises of the first hype.

However, within the last few years this situation has changed considerably. In particular in Japan there have been extensive efforts regarding the automatic translation from Japanese into English and vice versa. One of the most prominent research directions in Japan has been example-based machine translation [9, 10, 11], which relies on massive bilingual corpora to build a knowledge base of translation examples. New sentences are then translated by finding the most similar example. Unfortunately, this promising approach can only be applied successfully to language pairs for which enough bilingual data is available.
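The retrieval step of example-based translation can be illustrated with a minimal sketch. The tiny romanized example base, the function names, and the token-level similarity from Python's difflib module are assumptions made for illustration only; the systems in [9, 10, 11] rely on much larger corpora and more refined similarity measures.

```python
from difflib import SequenceMatcher

# Hypothetical example base: romanized Japanese source -> German translation.
EXAMPLES = {
    "kore wa hon desu": "Das ist ein Buch.",
    "watashi wa gakusei desu": "Ich bin Student.",
}


def similarity(a: str, b: str) -> float:
    """Token-level similarity in [0, 1] between two sentences."""
    return SequenceMatcher(None, a.split(), b.split()).ratio()


def translate_by_example(sentence: str) -> str:
    """Return the stored translation of the most similar source example."""
    best = max(EXAMPLES, key=lambda source: similarity(sentence, source))
    return EXAMPLES[best]


# "sore wa hon desu" is not in the example base; the nearest example is used.
print(translate_by_example("sore wa hon desu"))
```

With only two examples the result is necessarily an approximation, which reflects the dependence of the approach on the amount of available bilingual data.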

Another popular approach, often incorporated in commercial products, is transfer-based translation (see [12]). Transfer-based systems divide the translation problem into three parts: analysis, transfer, and generation. The analysis part parses the source sentence by means of a source grammar to create a structured representation. The transfer part applies a comparative grammar to map every source representation onto a target representation. Finally, the generation part produces the target sentence by using a target grammar.

The main disadvantage of this approach is the fact that the transfer component can only be used for one language pair. This makes transfer-based translation infeasible for multilingual environments with many languages. In addition, most transfer-based systems analyze sentences only at the syntactic level, i.e. they do not offer any means of language understanding at the semantic or pragmatic level. Therefore, the quality of the output remains limited since the system does not cover the intended meaning of the source text.

The main goal for our research was to develop a Machine Translation Environment (MTE) for the translation of Japanese documents into German. In consideration of the shortcomings of the above-mentioned approaches, we followed the direction of knowledge-based machine translation [13, 14, 15, 16]. In this approach a semantic representation of the meaning of a sentence is used as an interlingua, i.e. a language-independent intermediate representation. Therefore, in translation systems using an interlingua the transfer component vanishes. The main advantages are that the analysis and generation components can be freely reused for the translation of other languages as well as for many other language processing tasks, e.g. within natural language interfaces.
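The reuse argument can be made concrete with a small calculation. The sketch below assumes that a transfer-based design needs one transfer component per ordered language pair (in addition to its analysis and generation components, which are not counted here), whereas an interlingua design needs only one analysis and one generation component per language. The function names are hypothetical.

```python
def transfer_components(languages: int) -> int:
    """One transfer component per ordered (source, target) language pair."""
    return languages * (languages - 1)


def interlingua_components(languages: int) -> int:
    """One analysis and one generation component per language."""
    return 2 * languages


for n in (2, 5, 10):
    print(n, transfer_components(n), interlingua_components(n))
```

For two languages the interlingua design offers no saving, but the number of transfer components grows quadratically with the number of languages while the interlingua design grows only linearly.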

Regarding the implementation of our machine translation environment we embedded the translation program in the widely used text processing program Word. All modules for the translation are realized as macros in Visual Basic. To input and display Japanese characters we use Global IME. Thus, the user can work with the documents in a familiar environment without the need to install any additional software.

Our aim was to make the machine translation environment as user-friendly as possible. The user does not have to get used to a new application program but can concentrate on working and experimenting with the text. The system not only outputs the final translation; the user can also ask for the display of intermediate results such as transcriptions, token lists, syntax trees, and semantic representations. This makes MTE useful not only for readers who are merely interested in the content of a document but even more for language students with some knowledge of Japanese who want to improve their language skills by reading and translating Japanese documents. For the application in computer-assisted language learning, MTE represents a flexible companion that assists the language student only to the extent that is really necessary.

The rest of this paper is organized as follows. First, we give a brief overview of the system architecture. After this we present the analysis component in detail, which computes the Roman transcription of the Japanese input, the token list, the syntax tree, and the semantic representation. Finally, we discuss the generation component. Starting from the semantic representation it derives the syntax tree for the target language, the token list, and the surface representation of the target sentence.

2. System Architecture

The system architecture of MTE is displayed in Fig. 1. The two main components are the analysis component and the generation component.

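Figure 1. System architecture

[Figure: the ANALYSIS component passes the source sentence through lexical analysis (yielding the source token list and, as a side result, the transcription), syntactic analysis (source syntax tree), and semantic analysis (semantic representation, the INTERLINGUA). The GENERATION component passes the semantic representation through syntactic generation (target syntax tree), lexical generation (target token list), and surface generation (target sentence).]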


The analysis component takes the source sentence as input and produces the semantic representation, which is used as the interlingua for the translation. The analysis process is divided into three steps. The lexical analysis performs the tokenization and lemmatization of the input. Its output is a token list, which indicates the dictionary form, category, and subcategory for each word token. As a side result, the Roman transcription of the Japanese input is computed. The syntactic analysis derives the structural dependencies in the sentence by building a syntax tree. The semantic analysis then interprets the meaning of the sentence by mapping it onto a semantic representation.

The generation component has the opposite task to the analysis component, i.e. it calculates the surface form of the target sentence starting from the semantic representation. The generation component also consists of three modules. The syntactic generation finds a corresponding sentence structure in the target language to convey the intended meaning. The lexical generation creates the token list for the target language and determines the correct values for all token features. Finally, the surface generation module transforms the token list into the surface representation of the sentence by generating the necessary morphological variations.
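The data flow described above can be summarized in a short sketch. It is not the Visual Basic implementation of MTE; the stage names follow Fig. 1, while the Trace class, the translate function, and the placeholder stages are assumptions introduced only to show how each intermediate result can be kept available for display.

```python
from dataclasses import dataclass, field
from typing import Any, Callable

# Stage names follow Fig. 1; each stage maps the previous result to the next.
ANALYSIS = ("lexical_analysis", "syntactic_analysis", "semantic_analysis")
GENERATION = ("syntactic_generation", "lexical_generation", "surface_generation")


@dataclass
class Trace:
    """Keeps every intermediate result so that the user can display it."""
    results: dict[str, Any] = field(default_factory=dict)


def translate(source: str, stages: dict[str, Callable[[Any], Any]]) -> Trace:
    """Run the six stages in order and record each intermediate result."""
    trace = Trace()
    value: Any = source
    for name in ANALYSIS + GENERATION:
        value = stages[name](value)
        trace.results[name] = value
    return trace


# Placeholder stages that only tag their input; real stages would do the work.
stubs = {name: (lambda v, n=name: f"{n}({v})") for name in ANALYSIS + GENERATION}
trace = translate("source sentence", stubs)
print(trace.results["semantic_analysis"])   # hand-over point (interlingua)
print(trace.results["surface_generation"])  # final result of the pipeline
```

Because the only interface between the two halves is the result of the semantic analysis, either half could in principle be exchanged for another language without touching the other.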

3. Analysis

3.1. Lexical Analysis

The most obvious problem with the lexical analysis of written Japanese is its complex writing system. Japanese writing consists of three different subsystems: the two syllable writings hiragana and katakana as well as the (often modified) Chinese characters called kanji. An even more severe difficulty is that there are no spaces between Japanese words so that the tokenization, i.e. the segmentation into individual word tokens, is no longer a trivial task (see also [17]).
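To illustrate why segmentation requires a lexicon, the following sketch applies greedy longest-match segmentation, a common baseline technique. The paper does not state which segmentation method MTE uses, so the method, the mini-lexicon, and the function name are assumptions. Each lexicon entry also carries a reading, so that the transcription falls out as a side result.

```python
# Hypothetical mini-lexicon: surface form -> (romaji reading, category).
LEXICON = {
    "昔": ("mukashi", "noun"),
    "の": ("no", "particle"),
    "物": ("mono", "noun"),
    "巻物": ("makimono", "noun"),
    "で": ("de", "particle"),
}
MAX_LEN = max(len(form) for form in LEXICON)


def tokenize(text: str) -> list[tuple[str, str, str]]:
    """Greedy longest-match segmentation into (form, romaji, category)."""
    tokens, pos = [], 0
    while pos < len(text):
        # Try the longest candidate first so that "巻物" wins over "物".
        for length in range(min(MAX_LEN, len(text) - pos), 0, -1):
            form = text[pos:pos + length]
            if form in LEXICON:
                tokens.append((form, *LEXICON[form]))
                pos += length
                break
        else:
            raise ValueError(f"no lexicon entry at position {pos}: {text[pos]!r}")
    return tokens


tokens = tokenize("昔の巻物")
print([form for form, _, _ in tokens])
print(" ".join(romaji for _, romaji, _ in tokens))
```

Greedy matching can choose a wrong segmentation when a longer lexicon entry overlaps the correct word boundary, which is one reason why tokenization of Japanese is not trivial.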

Japanese has no inflection to indicate case, number, or gender, but it has a complex system of conjugation. Accordingly, we divide Japanese words into two groups [18]:

• non-conjugative words: particles, nouns, pronouns, counters, copular nouns, non-conjugative adjectives, adverbs, conjunctions, and interjections;

• conjugative words: verbs (vowel-stem verbs, consonant-stem verbs, and irregular verbs), the copula (corresponds to the English verb “to be”), and adjectives.

The conjugation of adjectives and vowel-stem verbs follows uniform conjugation rules, whereas consonant-stem verbs are divided into 9 subclasses based on their final syllables.

For the lexicon used for the lexical analysis we adapt the lexical approach [19] in that we store the information about word stems and word endings separately [20]. Since the number of irregular verbs is very limited in Japanese, we store their conjugated forms explicitly. We do the same for the conjugation of the copula, which is also highly irregular.
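The separation of stems and endings can be sketched as follows. The entries are given in romaji for readability, and the data layout and the function name are assumptions; the actual format of the MTE lexicon files is not described here. Only one vowel-stem verb and one adjective are covered, and consonant-stem verbs are omitted.

```python
# Hypothetical lexicon fragments in romaji.
# Stem -> (dictionary form, category, conjugation class)
STEMS = {
    "tabe": ("taberu", "verb", "vowel-stem"),
    "chika": ("chikai", "adjective", "adjective"),
}
# Conjugation class -> {ending: subcategory}
ENDINGS = {
    "vowel-stem": {"ru": "dictionary form", "ta": "ta-form"},
    "adjective": {"i": "dictionary form", "katta": "ta-form"},
}
# Irregular verbs and the copula are stored with their conjugated forms.
IRREGULAR = {
    "kuru": ("kuru", "verb", "dictionary form"),
    "kita": ("kuru", "verb", "ta-form"),
    "da": ("da", "copula", "dictionary form"),
    "datta": ("da", "copula", "ta-form"),
}


def lemmatize(word: str) -> tuple[str, str, str]:
    """Return (dictionary form, category, subcategory) for a word form."""
    if word in IRREGULAR:
        return IRREGULAR[word]
    for split in range(len(word) - 1, 0, -1):  # try the longest stem first
        stem, ending = word[:split], word[split:]
        if stem in STEMS:
            base, category, conj_class = STEMS[stem]
            subcategory = ENDINGS[conj_class].get(ending)
            if subcategory is not None:
                return base, category, subcategory
    raise ValueError(f"unknown word form: {word!r}")


for form in ("tabeta", "chikai", "datta"):
    print(form, "->", lemmatize(form))
```

Adding a new regular word then requires only a new stem entry, since the endings of its conjugation class are shared.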

The information about words and word endings is stored in Word files, which are imported during the initialization of the system. Therefore, it is very easy for the user to add new lexical items or to correct existing ones. Figure 2 shows an example of the output of the lexical analysis. Each word is reduced to its base form. The ta-form indicates informal perfective usage in Japanese grammar.

Figure 2. Example of source token list

[Table: one row per token of the example sentence, with the columns Dictionary form, Category, and Subcategory. The Japanese entries are not recoverable from the extracted text. The categories shown include noun, particle, adverb, adjective, counter, verb, copula, and punctuation (comma, left and right parenthesis, period); the subcategories include dictionary form and ta-form.]


An important side result of the lexical analysis is the transcription in Roman alphabet (romaji). Users who are not yet at ease with reading Japanese writing can display the transcription to assist them in reading the source text. In particular, this is a vital feature concerning the pronunciation of kanji. Most kanji have at least two different pronunciations or readings and the correct reading always depends on the context in the document.

Figure 3 presents a screenshot of MTE. It shows an educational document about the history of books. In this example the user asked for the transcription of the second sentence.

3.2. Syntactic Analysis

The syntactic analysis module parses the source sentence to compute a hierarchical representation of the sentence structure. We use an efficient object-oriented parsing algorithm to build the source syntax tree. The parsing algorithm first transforms the source token list into a linear list of constituents. Starting from this list the syntax tree is constructed by following a bottom-up strategy, i.e. in each parsing step a new constituent is derived and a corresponding manipulation is performed on the tree. The process is repeated until no more successful derivation of a new constituent can be performed. For more details about the parsing of Japanese text see [21].

Figure 4 gives the source syntax tree for the sentence from Fig. 2. Each complex constituent is implemented as an object with references to its subconstituents. The example shows the four most commonly used complex constituent types: verb phrases (JapVP), noun phrases (JapNP), adjectival phrases (JapAP), and counter phrases (JapCP). For noun phrases we indicate the head category: noun (N) or copular noun (CN).
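The bottom-up strategy and the object representation of constituents can be sketched for the fragment "mukashi no mono de" of the example sentence. The two reduction rules, the Constituent class, and the function names are simplifying assumptions and do not reproduce the actual rule set of MTE (see [21]); the role labels HEAD, POST, and NOUP are those used in Fig. 4.

```python
from dataclasses import dataclass, field


@dataclass
class Constituent:
    kind: str        # e.g. "N", "POST", "JapNP"
    text: str = ""   # word form for simple constituents
    # Role label (HEAD, POST, NOUP, ...) -> referenced subconstituent
    parts: dict[str, "Constituent"] = field(default_factory=dict)


def reduce_once(items: list[Constituent]) -> bool:
    """Apply the first applicable rule; return False if none applies."""
    for i in range(len(items) - 1):
        left, right = items[i], items[i + 1]
        # Rule 1 (assumed): noun + postposition -> noun phrase.
        if left.kind == "N" and right.kind == "POST":
            items[i:i + 2] = [Constituent("JapNP", parts={"HEAD": left, "POST": right})]
            return True
        # Rule 2 (assumed): a noun phrase marked by "no" modifies the next one.
        if (left.kind == "JapNP" and right.kind == "JapNP"
                and left.parts["POST"].text == "no" and "NOUP" not in right.parts):
            right.parts["NOUP"] = left
            del items[i]
            return True
    return False


def parse(tokens: list[tuple[str, str]]) -> list[Constituent]:
    """Turn (word, category) tokens into constituents and reduce bottom-up."""
    items = [Constituent(kind, text) for text, kind in tokens]
    while reduce_once(items):  # repeat until no new constituent is derived
        pass
    return items


tree = parse([("mukashi", "N"), ("no", "POST"), ("mono", "N"), ("de", "POST")])
root = tree[0]
print(root.kind, root.parts["HEAD"].text, root.parts["NOUP"].parts["HEAD"].text)
```

The loop ends with a single noun phrase headed by "mono", whose NOUP role refers to the noun phrase headed by "mukashi".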


Figure 4. Example of source syntax tree

[Figure: tree diagram of verb phrases (JapVP), noun phrases (JapNP), adjectival phrases (JapAP), and counter phrases (JapCP); the layout is not recoverable from the extracted text. Legend: PUNC punctuation, HEAD head constituent, ADVP adverbial phrase, SUBJ subject, COMP complement, POST postposition, NOUP noun phrase, ADJP adjectival phrase, ADVE adverb, IOBJ indirect object, VERP verb phrase, COUP counter phrase, DOBJ direct object, QUAN quantity, PREF prefix, PARE parenthesis.]


3.3. Semantic Analysis

The final step of the analysis process is the mapping of the syntactic representation of the input sentence onto a corresponding deep structure, which covers the intended meaning (for a good background on semantic representation see [22]). We again use an object-oriented representation for the language-independent interlingua (see Fig. 5 for the semantic representation of the sentence from Fig. 4).

For the semantic concepts we employ English notations in contrast to completely artificial interlingual languages, e.g. Lojban [23]. The semantic representation abstracts from the syntactic structure of the source sentence, which is of special importance for the successful translation of languages with strongly divergent grammars like Japanese and German.
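A possible object representation of such concepts is sketched below. The attribute labels (TYPE, QUAN, MATE, SYNO) and the concept names are taken from Fig. 5, whereas the Concept class, its render method, and the exact nesting of the two objects are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Union


@dataclass
class Concept:
    category: str  # e.g. "Object", "Activity", or "Time"
    type: str      # English concept name (TYPE), e.g. "Sheet"
    # Attribute label -> plain value or nested concept
    attributes: dict[str, Union[str, "Concept"]] = field(default_factory=dict)

    def render(self, indent: int = 0) -> str:
        """Return an indented text view of the concept and its attributes."""
        pad = " " * indent
        lines = [f"{pad}{self.category} TYPE {self.type}"]
        for label, value in self.attributes.items():
            if isinstance(value, Concept):
                lines.append(f"{pad}  {label}")
                lines.append(value.render(indent + 4))
            else:
                lines.append(f"{pad}  {label} {value}")
        return "\n".join(lines)


papyrus = Concept("Object", "Papyrus", {"SYNO": "ReedFiber"})
sheet = Concept("Object", "Sheet", {"QUAN": "Numerous", "MATE": papyrus})
print(sheet.render())
```

Nothing in this structure refers to Japanese word order or particles, which is what allows the same representation to serve as the starting point for generation in another language.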

4. Generation

4.1. Syntactic Generation

The syntactic generation module builds the syntax tree for the target language German. One serious problem with the translation from Japanese into German is the inherent ambiguity of the Japanese language. In Japanese there is no grammatical distinction between singular and plural, and there are no articles to distinguish between the definite and indefinite use of a noun. As mentioned before, Japanese has no inflection to convey information about case or gender, and even conjugated forms do not identify the person of the subject. Moreover, the subject of a sentence is often omitted.

In contrast, German is a highly inflective language in which the word forms indicate number, person, case, gender, tense, mood, and voice. Therefore, more detailed information is needed to derive the required syntactic features for the generation of the German sentence. This information must be extracted from the input by using indirect indicators in the sentence or contextual information from preceding sentences. There exists only little theoretical work regarding this important aspect, e.g. see [24] for experiments with heuristic rules to determine the referential property and number of nouns for the translation into English.

Figure 6 shows the target syntax tree for the semantic representation from Fig. 5. The default value for the number of a noun phrase is singular; plural usage is indicated explicitly. The inversion attribute indicates that the normal word order is reversed by extraposing the adverbial phrase.

[Figure 5: semantic representation of the example sentence; the layout is not recoverable from the extracted text. The diagram contains Activity concepts (TYPE Resemble, TYPE Connect), Object concepts (TYPE BookRoll, Sheet, Papyrus, Book, Thing), and Time concepts, with attribute values such as Most, Before, AncientEgypt, Numerous, ReedFiber, and Antiquity. Legend: TYPE concept type, REST restriction, DEGR degree, PATI patient, ACTO actor, TINF temporal information, EVSP event speech sequence, LOCA location, CREA creation, QUAN quantity, MATE material, SYNO synonym, EXIS existence, EVTI time of event.]


4.2. Lexical Generation

Lexical generation traverses the target syntax tree and creates a linear list of tokens for the individual target words by observing the rules for German word order. The correct values are partly derived from the information in the syntax tree. Additional information is supplied from specialized German lexicons for the different word categories, e.g. the gender of nouns.

Missing values are calculated by applying rules of syntactic agreement between the head constituent of a phrase and modifying constituents, e.g. number and gender of attributive adjectives.
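As an illustration of such agreement rules, the following sketch derives the form of the definite article from the features of its noun phrase. The article table reflects standard German grammar; the small gender lexicon, the feature names, and the function names are assumptions and not the actual data structures of MTE.

```python
# Definite article forms indexed by case, then by gender or "plural".
DEFINITE_ARTICLE = {
    "nominative": {"masculine": "der", "feminine": "die", "neuter": "das", "plural": "die"},
    "accusative": {"masculine": "den", "feminine": "die", "neuter": "das", "plural": "die"},
    "dative": {"masculine": "dem", "feminine": "der", "neuter": "dem", "plural": "den"},
    "genitive": {"masculine": "des", "feminine": "der", "neuter": "des", "plural": "der"},
}
# Hypothetical noun lexicon supplying the gender of each head noun.
NOUN_GENDER = {
    "Buchrolle": "feminine",
    "Buch": "neuter",
    "Ding": "neuter",
    "Altertum": "neuter",
}


def article_features(head: str, number: str, case: str) -> dict[str, str]:
    """Copy number and case from the phrase and gender from the head noun."""
    return {"gender": NOUN_GENDER[head], "number": number, "case": case}


def definite_article(head: str, number: str = "singular", case: str = "nominative") -> str:
    """Select the article form that agrees with the head noun."""
    features = article_features(head, number, case)
    key = "plural" if features["number"] == "plural" else features["gender"]
    return DEFINITE_ARTICLE[features["case"]][key]


print(definite_article("Buchrolle"))
print(definite_article("Buch", case="dative"))
print(definite_article("Ding", number="plural", case="dative"))
print(definite_article("Altertum", case="genitive"))
```

The four calls correspond to the article forms that appear in the translation of the example sentence in Fig. 7.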

4.3. Surface Generation

The final step of the generation process is to produce the surface representation of the target sentence (see Fig. 7). The generation of the correct inflected forms involves not only the concatenation of endings but also complex morphological variations, including ablaut, umlaut, elision, and other phenomena (see [25]).

The exact word form is computed based on the values for the syntactic features and on information from the German lexicons. As in the case of the source lexicon (see Sect. 3.1), we store the information about word stems and word endings separately, so that only irregular word forms have to be stored explicitly.
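A minimal sketch of this division of labor is given below for noun plurals. The lexicon layout, the feature names, and the function names are assumptions; singular case endings, elision, and regular verb conjugation are omitted, and the actual morphological component is described in [25].

```python
UMLAUT = {"a": "ä", "o": "ö", "u": "ü"}


def umlaut(stem: str) -> str:
    """Mutate the last a, o, or u of the stem ("au" becomes "äu")."""
    for i in range(len(stem) - 1, -1, -1):
        if stem[i] in UMLAUT:
            if stem[i] == "u" and i > 0 and stem[i - 1] == "a":
                i -= 1  # in the diphthong "au" the "a" is mutated
            return stem[:i] + UMLAUT[stem[i]] + stem[i + 1:]
    return stem


# Hypothetical noun lexicon: lemma -> (plural takes umlaut?, plural ending).
NOUNS = {"Blatt": (True, "er"), "Buch": (True, "er"), "Ding": (False, "e")}
# Irregular forms such as ablaut past tenses are stored explicitly.
IRREGULAR = {("kommen", "PastTense"): "kam"}


def plural(lemma: str, case: str = "Nominative") -> str:
    """Build the plural from the stem, optional umlaut, and the ending."""
    mutate, ending = NOUNS[lemma]
    form = (umlaut(lemma) if mutate else lemma) + ending
    # The dative plural adds -n unless the form already ends in -n or -s.
    if case == "Dative" and not form.endswith(("n", "s")):
        form += "n"
    return form


print(plural("Blatt", "Dative"), plural("Ding", "Dative"), plural("Buch"))
print(IRREGULAR[("kommen", "PastTense")])
```

The forms "Blättern" and "Dingen" occur in the translation shown in Fig. 7, as does the ablaut form "kam", which is looked up rather than computed.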

Figure 6. Example of target syntax tree

[Figure: tree diagram of German verb phrases (GerVP), noun phrases (GerNP), and adjectival phrases (GerAP); the layout is not recoverable from the extracted text. The heads shown are kommen/PastTense, nahe/Superlative, Buchrolle, altägyptisch/Positive, zusammenfügen/PastParticiple, Blatt/Plural, zahlreich/Positive, Papyrus, Schilffaser, Buch, Ding/Plural, and Altertum. Legend: HEAD head constituent, INVE inversion, ADVP adverbial phrase, DATO dative object, SUBJ subject, PRAP predicative adjectival phrase, ARTI article, ATPA attributive participial phrase, ATAP attributive adjectival phrase, POBJ prepositional object, PREP preposition, ATPR attributive prepositional phrase, PARE parenthesis, ATGP attributive genitive phrase.]

Von den Dingen des Altertums kam die aus zahlreichen Blättern aus Papyrus (Schilffaser) zusammengefügte, altägyptische Buchrolle dem Buch am nächsten.

Figure 7. Example of translation


5. Conclusion

In this paper we have presented a system that aims at facilitating global access to Japanese documents in worldwide digital libraries. We address the problem of cross-language information retrieval in multilingual environments by providing a machine translation environment that can be fully embedded in common word processor technology.

MTE applies knowledge-based machine translation with the aim of high-quality automatic translation from Japanese into German. Furthermore, the interlingua-based design is a prerequisite for efficiently porting the system to other languages or even to other language engineering tasks.

After finishing the implementation we started a first case study with language students from the University of Vienna. First reactions by the students were very positive. However, a detailed evaluation study to measure the performance of MTE is still the topic of ongoing research.

Future work will concentrate on extending the coverage of our system, long-term empirical studies, and the extension of MTE for use with other languages. A final important point is the adaptation of the linguistic knowledge by the user. Although in its current state the system already allows the user to freely adapt the transformation rules for the individual steps of the translation process, we still have to make the front-end for this adaptation more user-friendly. The final goal is that the user can simply correct or improve the target sentence, which causes the system to run a reverse translation leading to an automatic update of the linguistic rule base.

References

[1] R. Cole et al. (eds), Survey of the State of the Art in Human Language Technology, Cambridge University Press, New York, 1997.

[2] H. D. Wactlar, “The Next Generation Electronic Library – Capturing the Experience”, ACM Computing Surveys 28(4), 1996.

[3] A. Brewer et al., “The Role of Intermediary Services in Emerging Digital Libraries”, Proc. of the ACM Intl. Conf. on Digital Libraries, 1996, pp. 29-35.

[4] J. L. Klavans, “Data Bases in Digital Libraries: Where Computer Science and Information Management Meet”, Proc. of the ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, 1998, pp. 224-226.

[5] G. Grefenstette (ed), Cross-Language Information Retrieval, Kluwer Academic Publishers, Boston, 1998.

[6] D. W. Oard, “Serving Users in Many Languages: Cross-Language Information Retrieval for Digital Libraries”, D-Lib Magazine, December 1997.

[7] J. L. Klavans and P. Schäuble, “Report on the EU-NSF Working Group on Multilingual Information Access”, D-Lib Magazine, December 1997.

[8] D. Arnold et al., Machine Translation: An Introductory Guide, Blackwells-NCC, London, 1994.

[9] M. A. Nagao, “A Framework of a Mechanical Translation between Japanese and English by Analogy Principle”, A. Elithorn and R. Banerji (eds), Artificial and Human Intelligence, North-Holland, Amsterdam, 1984.

[10] S. Sato, Example-Based Machine Translation, Ph.D. thesis, Kyoto University, 1991.

[11] E. Sumita and H. Iida, “Experiments and Prospects of Example-Based Machine Translation”, Proc. of the Annual Meeting of the Association for Computational Linguistics, 1991, pp. 185-192.

[12] W. J. Hutchins and H. L. Somers, An Introduction to Machine Translation, Academic Press, London, 1982.

[13] S. Nirenburg, V. Raskin, and A. Tucker, “The Structure of Interlingua in TRANSLATOR”, S. Nirenburg (ed), Machine Translation: Theoretical and Methodological Issues, Cambridge University Press, New York, 1987.

[14] S. Nirenburg et al., Machine Translation: A Knowledge-Based Approach, Morgan Kaufmann Publishers, San Mateo, 1992.

[15] B. Onyshkevych and S. Nirenburg, “A Lexicon for Knowledge-Based MT”, Machine Translation 10(1-2), 1995.

[16] J. R. R. Leavitt, D. W. Lonsdale, and A. M. Franz, “A Reasoned Interlingua for Knowledge-Based Machine Translation”, Proc. of the Biennial Conf. of the Canadian Society for Computational Studies of Intelligence, 1994.

[17] H. Fujii and W. B. Croft, “A Comparison of Indexing Techniques for Japanese Text Retrieval”, Proc. of the Annual Intl. ACM SIGIR Conf. on Research and Development in Information Retrieval, 1993, pp. 237-246.

[18] Y. M. McClain, Handbook of Modern Japanese Grammar, Hokuseido Press, Tokyo, 1981.

[19] P. Steffens, “Machine Translation and the Lexicon”, Proc. of the Intl. EAMT Workshop, 1994.

[20] W. Winiwarter, “Adaptive Natural Language Interface Design in a DOOD Framework”, Proc. of the IPSJ Intl. Symposium on Information Systems and Technologies for Network Society, 1997, pp. 215-222.

[21] W. Winiwarter, O. Kagawa, and Y. Kambayashi, “Syntactic Disambiguation by Using Categorial Parsing in a DOOD Framework”, Proc. of the German Annual Conf. on Artificial Intelligence, 1996, pp. 363-375.

[22] J. Allen, Natural Language Understanding, Benjamin/Cummings, San Francisco, 1994.

[23] J. W. Cowan, The Complete Lojban Language, Logical Language Group, Fairfax, 1997.

[24] M. Murata and M. Nagao, “Determination of Referential Property and Number of Nouns in Japanese for Machine Translation into English”, Proc. of the Intl. Conf. on Theoretical and Methodological Issues in Machine Translation, 1993.

[25] W. Winiwarter, “MIDAS - the Morphological Component of the IDA System for Efficient Natural Language Interface Design”, Proc. of the Intl. Conf. on Database