SUPER-FUNCTION BASED MACHINE TRANSLATION SYSTEM FOR BUSINESS USER. Xin Zhao


SUPER-FUNCTION BASED MACHINE TRANSLATION SYSTEM FOR BUSINESS USER

by

Xin Zhao

A Dissertation Submitted to the Faculty of Engineering at Tokushima University in conformity with the requirements for the degree of Doctor of Engineering.

Doctoral Course for Information Science and Intelligent System

Graduate School of Engineering Department of Information Science and Intelligent Systems Tokushima, Japan


Abstract

In today’s increasingly networked world, there is an increased need for language translation. Attempts at automating language translation are almost as old as computers themselves. Machine Translation (MT) is the attempt to automate all, or part, of the process of translating between human languages, and it is one of the oldest large-scale applications of computer science. The goal of MT systems is to produce accurate, good-quality translations between human languages.

Machine translation between closely related languages is easier than between language pairs that are not related to each other. The present work reports my attempt at developing a Super-Function (SF) based machine translation system which translates Japanese into Chinese, and proposes a shallow method to translate Japanese compound nouns by using a word-level translation dictionary and a target-language monolingual corpus. The machine translation system consists of three main parts, responsible for analysis, SF matching and translation generation. An SF is a functional relation mapping sentences from one language to another. The core of the system uses the SF approach to translate without going through the syntactic and semantic analysis that many MT systems perform. This work focuses on business users, for whom MT is often a great help when they need an immediate idea of the content of texts such as e-mail messages, reports, web pages, or business letters.

The translation of compound nouns is a major issue in MT due to their frequency of occurrence and high productivity. Several aspects of compound nouns make them particularly difficult to handle in a system performing automatic translation. In this work, I also discuss the challenges of automatically translating Japanese compound nouns into Chinese in the Super-Function Based Machine Translation (SFBMT) system, and propose a method to address this issue. A major design goal of this method is that it can act as a standalone module and can also be integrated well with the machine translation system for Japanese sentences.

My experiments are not yet complete, but they have already produced interesting results. Using SFs to translate, without the syntactic and semantic analysis that many MT systems perform, can enhance the translation quality and reduce the glossary size. Although the method for handling compound nouns is still shallow, my experimental results show that, if improved, it can perform the task of automatic compound noun translation quite well.


Acknowledgements

I gratefully acknowledge, with deepest appreciation, the many people who supported me. I would like to take this opportunity to express my gratitude to my supervisor, Professor Fuji Ren, who introduced me to the fascinating world of natural language processing, for suggesting the research topic and for his generous guidance and patience. His constant support and encouragement, as well as his inspiring advice, were essential and invaluable to my research papers and this thesis.

I would like to express my deepest gratitude to Professor Junichi Aoe and Professor Kenji Kita, who gave me the opportunity to enter the Department of Information Science and Intelligent Systems, Faculty of Engineering, Tokushima University, and introduced me to the A-1 group, to work on the topic of machine translation and to find my own way as a researcher.

I cordially thank my other supervisor, Associate Professor Shingo Kuroiwa, for his advice, patience, and linguistic corrections during my entire work. Without his effort, I would not have been able to strengthen and improve my research project and papers.

I would also like to show my gratitude to the Department of Information Science and Intelligent Systems, for the provision of the best equipment and pleasant office environment required for high quality research.

Special thanks are due to Professor Dr. Stefan Voss, who has given me valuable suggestions, encouragement and support. I cannot list the names of all the people to whom I am indebted, but I would particularly like to thank my research group members, HaiQing Hu, Manabu Sasayama, Yuhsuke Konishi, Ippei Fukuda, Kazuyuki Matsumoto, Keisuke Ueta, DaPeng Yin, Mohamed Fattah, Tomohiro Yagi, Takahiro Kuroda, Jiajun Yan, Min Shao, Yu Zhang, and Jing Wang. They have given me support and a joyful and wonderful university life.

I was also blessed with wonderful friends; in many ways, my successes are theirs, too. They gave me a lot of support and encouragement through the hard times. Thank you, Yoshinori Matsuura, Tomoyo Matsuura, Qing Liu, Xin Lu, and ZhenYu Jin. Furthermore, I would like to thank my parents and my husband for all their support, unconditional love, sacrifice and consideration throughout my Ph.D.


3 Super-Function Based Machine Translation . . . 52

3.1 How to Do Translation? . . . 53

3.1.1 Pre-Editing . . . 53

3.1.2 Translation Process . . . 54

3.1.3 Post-Editing . . . 56

3.2 Super-Function . . . 57

3.2.1 SF Definition . . . 58

3.2.2 SF Format . . . 60

3.2.3 SF Architecture . . . 62

3.2.3.1 Directional Graph . . . 62

3.2.3.2 Transformation Table . . . 63

3.2.4 Robust Super-Function . . . 63

3.3 Process of SFBMT . . . 64

3.3.1 Morphological Analysis . . . 65

3.3.2 Super-Function Matching . . . 66

3.3.3 Morphological Agreement . . . 66

3.4 Comparing with Other Related Research . . . 67

3.4.1 Pattern-Based Machine Translation . . . 67

3.4.2 Glossary-Based Machine Translation . . . 69

3.4.3 Example-Based Machine Translation . . . 70

3.5 Summary . . . 72

4 Japanese-Chinese SFBMT for Business Users . . . 73

4.1 Proposed MT System . . . 73

4.1.1 Review of Business Letter . . . 75

4.1.2 Outline of The System . . . 76

4.2 Experiment and Evaluation . . . 79

4.2.1 Examples . . . 80

4.2.2 Evaluation Experiment . . . 84

4.3 Summary . . . 90

5 Translating Compound Nouns in SFBMT . . . 91

5.1 Overview . . . 91

5.2 The Problems and Proposed Solution . . . 92

5.2.1 Intractable Problems of Compound Nouns . . . 92


5.2.2.1 Generation Stage . . . 96

5.2.2.2 Selection Stage . . . 98

5.3 Used Resources and Evaluating Results . . . 100

5.3.1 Used Resources . . . 100

5.3.2 Evaluating Results . . . 102

5.4 Related Works . . . 107

5.5 Summary . . . 110

6 Conclusions and Future Works . . . 111

6.1 Conclusions . . . 111

6.2 Future Works . . . 114


List of Tables

3.1 Example of a Node Table . . . 63

3.2 Example of an Edge Table . . . 63

3.3 Examples of Translation Pattern . . . 67

4.1 United Table . . . 80

4.2 Result of the Evaluation Experiment . . . 86

5.1 Example of Translation Templates . . . 101

5.2 Results of Alignment . . . 103

5.3 Result of Aligned Case . . . 104

5.4 Examples of Aligned Case . . . 104

5.5 Examples of Partially Aligned Case . . . 105

5.6 Examples of Unaligned Case . . . 106


List of Figures

2.1 Human-Aided Machine Translation . . . 25

2.2 Translation Tripod . . . 28

2.3 Kind of Translation Approaches . . . 35

2.4 Syntax Structure Translation and Semantic Structure Translation . . . 36

2.5 Traditional Approaches of MT . . . 37

2.6 Direct Approach . . . 37

2.7 Transformer Architecture . . . 38

2.8 Components of Transfer system . . . 39

2.9 Interlingual Architecture . . . 42

2.10 Component of Interlingual System . . . 43

3.1 Flow of Translation . . . 53

3.2 Flowchart of Translation process . . . 55

3.3 Example of DG . . . 62

3.4 Architecture of PBMT . . . 68

3.5 Process of GBMT . . . 70

3.6 Architecture of EBMT . . . 71

4.1 Outline of The System . . . 76

4.2 Translation Interface . . . 79

5.1 Flowchart of The Processing . . . 97


Chapter 1

Introduction

Machine Translation, or Automatic Translation, is the attempt to automate all, or part, of the process of translating from one human language to another. This chapter briefly sketches some background on machine translation, a brief history of machine translation, and some basic concepts of natural language processing. The idea is to give a clear basic understanding of the state of the art. Then the aim of the current work and the structure of this thesis are summarized.

1.1 Overview

Machine translation of natural languages, commonly known as MT, has multiple personalities. First of all, it is a venerable scientific enterprise, a component of the larger area of studies concerned with the human capacity for language understanding. Indeed, computer modeling of thought processes, memory and knowledge is an important component of certain areas of linguistics, philosophy, psychology, neuroscience and the field of Artificial Intelligence (AI) within computer science. MT promises the practitioners of these sciences empirical results that could be used for corroboration or refutation of a variety of hypotheses and theories. But MT is also a technological challenge of the first order. It offers software designers and engineers an opportunity to dabble in constructing very complex and large-scale non-numerical systems, and it offers field computational linguists an opportunity to test their understanding of the syntax and semantics of a variety of languages by encoding this vast, though rarely comprehensive, knowledge into a form suitable for processing by computer programs.

The study of natural language has been an important area of artificial intelligence almost since the beginning of the field. Two main goals motivate AI work on natural language. One is the theoretical goal, close to that of the linguist, namely, to discover how we use language to communicate. The other is the technological goal, namely, to enable the intelligent computer interfaces of the future, where natural language becomes an important means for man-machine interaction. Luckily, progress toward one of these goals often is progress toward the other – a better theoretical understanding leads to more robust systems, and a better understanding of processing issues in actual applications suggests new goals and techniques of theoretical interest. The ultimate solution to language understanding must wait until we can effectively model almost all aspects of human intelligence. Many applications, however, do not require full conversational capabilities or encyclopedic knowledge. For instance, a natural language interface that serves as a query language to a database need only focus on questions and can limit the language it understands to concepts that arise in the database. MT, by contrast, is the attempt to automate all, or part, of the process of translating from one human language to another. It is one of the oldest large-scale applications of computer science.

1.2 Machine Translation

People who need documents translated often ask themselves whether they could use a computer to do the job. When a computer translates an entire document automatically and then presents it to a human, the process is called Machine Translation. When a human composes a translation, perhaps calling on a computer for assistance in specific tasks such as looking up specialized words and expressions in a dictionary, the process is called Human Translation. There is a gray area between human and machine translation, in which the computer may retrieve whole sentences of previously translated text and make minor adjustments as needed. However, even in this gray area, each sentence was originally the result of either human translation or machine translation. We will reserve the label “machine translation” for the case when both the initial translation of the sentences and subsequent manipulations are performed by a computer [42].
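The retrieval step of the gray area described above can be sketched as follows. This is a minimal illustration, not a method from this thesis: the stored sentence pairs are toy English-French data, the names `TRANSLATION_MEMORY` and `retrieve` are hypothetical, and the similarity measure is simply the one provided by Python's standard `difflib` module.

```python
import difflib

# Hypothetical store of previously translated sentence pairs (toy data).
TRANSLATION_MEMORY = {
    "Please confirm the delivery date.": "Veuillez confirmer la date de livraison.",
    "Thank you for your order.": "Merci pour votre commande.",
    "The invoice is attached.": "La facture est jointe.",
}


def retrieve(sentence, memory, cutoff=0.8):
    """Return (stored_source, stored_translation) for the closest stored
    sentence, or None when no stored sentence is similar enough."""
    matches = difflib.get_close_matches(sentence, list(memory), n=1, cutoff=cutoff)
    if not matches:
        return None
    return matches[0], memory[matches[0]]
```

A retrieved translation would still need the minor adjustments mentioned above wherever the new sentence differs from the stored one; the sketch covers only the lookup.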

The original proposals covered only the making of a straightforward dictionary translation from the Source Language (SL) to the Target Language (TL). It is convenient to start by seeing how this simple objective may be achieved on a machine that offers only the most rudimentary functions needed to perform machine translation:

1. The machine has a large memory.

2. [...] subtraction facilities, the accumulator register.

3. The machine contains a conditional transfer order which enables the machine to select between alternative courses of action according to the sign of the number held in the accumulator register.

4. The contents of the accumulator can be typed at the output.

The reader familiar with modern automatic digital computers will see that all of the above functions are present in every such computer, with the exception, in many cases, of the large memory.
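How a dictionary lookup might run on such a machine can be sketched as follows. This is only an illustration under stated assumptions: words are encoded as integers, the dictionary is held in memory as (source code, target code) pairs, and equality is tested with two subtractions followed by sign tests, mirroring the conditional transfer on the sign of the accumulator. The word codes and the names `MEMORY`, `lookup` and `translate` are hypothetical.

```python
# Toy simulation of dictionary lookup using only the primitives listed above:
# a memory, an accumulator with subtraction, a branch on the accumulator's
# sign, and output. Word codes and entries are made-up illustrative values.

# Memory: (source word code, target word code) pairs.
MEMORY = [(3, 30), (7, 70), (12, 120), (25, 250)]


def lookup(word_code, memory):
    """Return the target code stored for word_code, or None if absent."""
    for source_code, target_code in memory:
        acc = source_code - word_code      # subtract into the accumulator
        if acc < 0:                        # conditional transfer on the sign
            continue                       # entry is smaller: try the next one
        acc = word_code - source_code      # second subtraction
        if acc < 0:
            continue                       # entry is larger: no match here
        return target_code                 # neither negative: the codes are equal
    return None


def translate(word_codes, memory):
    """Translate a sequence of word codes, keeping unknown codes unchanged."""
    output = []
    for code in word_codes:
        target = lookup(code, memory)
        output.append(code if target is None else target)  # "type" the result
    return output
```

The linear scan makes the role of the large memory visible: every dictionary entry must be stored and, in the worst case, compared against each input word.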

There are some popular misconceptions about MT; we will discuss them in turn:

• “MT is a waste of time because you will never make a machine that can translate Shakespeare.”

The criticism that MT systems cannot, and will never, produce translations of great literature of any great merit is probably correct, but quite beside the point. It certainly does not show that MT is impossible. First, translating literature requires special literary skill – it is not the kind of thing that the average professional translator normally attempts. So accepting the criticism does not show that automatic translation of non-literary texts is impossible. Second, literary translation is a small proportion of the translation that has to be done, so accepting the criticism does not mean that MT is useless. Finally, one may wonder who would ever want to translate Shakespeare by machine – it is a job that human translators find challenging and rewarding, and it is not a job that MT systems have been designed for. The criticism that MT systems cannot translate Shakespeare is a bit like criticism of industrial robots for not being able to dance Swan Lake.

• “Generally, the quality of translation you can get from an MT system is very low. This makes them useless in practice.”

Far from being useless, several MT systems are in day-to-day use around the world. Examples include METEO (in daily use since 1977 at the Canadian Meteorological Center in Dorval, Montreal), SYSTRAN (in use at the CEC, and elsewhere), LOGOS, ALPS, ENGSPAN (and SPANAM), METAL, and GLOBALINK. It is true that the number of organizations that use MT on a daily basis is relatively small, but those that do use it benefit considerably. For example, as of 1990, METEO was regularly translating around 45000 words of weather bulletins every day, from English into French, for transmission to press, radio, and television. In the 1980s, the diesel engine manufacturer Perkins Engines was saving around £4000 on each diesel engine manual translated (using a PC version of the WEIDNER system). Moreover, overall translation time per manual was more than halved, from around 26 weeks to 9-12 weeks – this time saving can be very significant commercially, because a product like an engine cannot easily be marketed without user manuals.

Of course, it is true that the quality of many MT systems is low, and probably no existing system can produce really perfect translations.¹ However, this does not make MT useless.

First, not every translation has to be perfect. Imagine you have in front of you a Chinese newspaper which you suspect may contain some information of crucial importance to you or your company. Even a very rough translation would help you. Apart from anything else, you would be able to work out which, if any, parts of the paper would be worth getting translated properly. Second, a human translator normally does not immediately produce a perfect translation. It is normal to divide the job of translating a document into two stages. The first stage is to produce a draft translation, i.e. a piece of running text in the target language, which has the most obvious translation problems solved (e.g. choice of terminology, etc.), but which is not necessarily perfect. This is then revised – either by the same translator, or in some large organizations by another translator – with a view to producing something that is up to standard for the job in hand. This might involve no more than checking, or it might involve quite radical revision aimed at producing something that reads as though written originally in the target language. For the most part, the aim of MT is only to automate the first, draft translation process.²

¹ In fact, one can get perfect translations from one kind of system, but at the cost of radically restricting what an author can say, so one should perhaps think of such systems as (multilingual) text creation aids, rather than MT systems. The basic idea is similar to that of a phrase book, which provides the user with a collection of “canned” phrases to use. This is fine, provided the canned text contains what the user wants to say. Fortunately, there are some situations where this is the case.

² Of course, the sorts of errors one finds in draft translations produced by a human translator will be rather different from those that one finds in translations produced by machine.

• “MT threatens the jobs of translators.”

The quality of translation that is currently possible with MT is one reason why it is wrong to think of MT systems as dehumanizing monsters which will eliminate human translators, or enslave them. It will not eliminate them, simply because the volume of translation to be performed is so huge, and constantly growing, and because of the limitations of current and foreseeable MT systems. While not an immediate prospect, it could, of course, turn out that MT enslaves human translators, by controlling the translation process, and forcing them to work on the problems it throws up, at its speed. There are no doubt examples of this happening to other professions. However, there are not many such examples, and it is not likely to happen with MT. What is more likely is that the process of producing draft translations, along with the often tedious business of looking up unknown words in dictionaries, and ensuring terminological consistency, will become automated, leaving human translators free to spend time on increasing clarity and improving style, and to translate more important and interesting documents – editorials rather than weather reports, for example. This idea is borne out in practice: the job satisfaction of the human translators in the Canadian Meteorological Center improved when METEO was installed, and their job became one of checking and trying to find ways to improve the system output, rather than translating the weather bulletins by hand (the concrete effect of this was a greatly reduced turnover in translation staff at the Center).

• “The Japanese have developed a system that you can talk to on the phone. It translates what you say into Japanese, and translates the other speaker’s replies into English.”

The claim that the Japanese have a speech to speech translation system, of the kind described above, is pure science fiction. It is true that speech-to-speech translation is a topic of current research, and there are laboratory prototypes that can deal with a very restricted range of questions. But this research is mainly aimed at investigating how the various technologies involved in speech and language processing can be integrated, and is limited to very restricted domains (hotel bookings, for example), and messages (offering little more than a phrase book in these domains). It will be several years before even this sort of system will be in any sort of real use. This is partly because of the limitations of speech systems, which are currently fine for recognizing isolated words, uttered by a single speaker, for which the system has been specially trained, in quiet conditions, but which do not go far beyond this. However, it is also because of the limitations of the MT system.

Against these misconceptions, we should place the genuine facts about MT.

• MT is useful. The METEO system has been in daily use since 1977. As of 1990, it was regularly translating around 45000 words daily. In the 1980s, the diesel engine manufacturer Perkins Engines was saving around £4000 and up to 15 weeks on each manual translated.

• While MT systems sometimes produce howlers, there are many situations where the ability of MT systems to produce reliable, if less than perfect, translations at high speed is valuable.

• In some circumstances, MT systems can produce good quality output: less than 4% of METEO output requires any correction by human translators at all (and most of these corrections are due to transmission errors in the original texts). Even where the quality is lower, it is often easier and cheaper to revise ‘draft quality’ MT output than to translate entirely by hand.

• MT does not threaten translators’ jobs. The need for translation is vast and unlikely to diminish, and the limitations of current MT systems are too great. However, MT systems can take over some of the boring, repetitive translation jobs and allow human translators to concentrate on more interesting tasks, where their specialist skills are really needed.

• Speech-to-Speech MT is still a research topic. In general, there are many open research problems to be solved before MT systems will come close to the abilities of human translators.

• Not only are there many open research problems in MT, but building an MT system is an arduous and time-consuming job, involving the construction of grammars and very large monolingual and bilingual dictionaries. There is no ‘magic solution’ to this.

• In practice, before an MT system becomes really useful, a user will typically have to invest a considerable amount of effort in customizing it.

The correct conclusion is that MT, although imperfect, is not only a possibility, but an actuality. But it is important to see the product in a proper perspective, to be aware of its strong points and shortcomings.

While there have been many variants, most MT systems, and certainly those that have found practical application, have parts that can be named for the chapters in a linguistic textbook. They have lexical, morphological, syntactic, and possibly semantic components, one for each of the two languages, for treating basic words, complex words, sentences and meanings. Each feeds into the next until the last one in the chain produces a very abstract representation of the sentence. There is also a ‘transfer’ component, the only one that is specialized for a particular pair of languages, which converts the most abstract source representation that can be achieved into a corresponding abstract target representation. The target sentence is produced from this essentially by reversing the analysis process. Some systems make use of a so-called ‘interlingua’ or intermediate language, in which case the transfer stage is divided into two steps, one translating a source sentence into the interlingua and the other translating the result of this into an abstract representation in the target language.

Machine Translation started out with the hope and expectation that most of the work of translation could be handled by a system which contained all the information we find in a standard paper bilingual dictionary. Source language words would be replaced with their target language translational equivalents, as determined by the built-in dictionary, and where necessary the order of the words in the input sentences would be rearranged by special rules into something more characteristic of the target language. In effect, correct translations suitable for immediate use would be manufactured in two simple steps. This corresponds to the view that translation is nothing more than word substitution (determined by the dictionary) and reordering (determined by reordering rules).
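The two-step view just described can be made concrete with a small sketch. It is an illustration only, not a system from this thesis or the literature: the English-French dictionary, the word classes, and the function names `substitute`, `reorder` and `translate` are all hypothetical toy choices, and the single reordering rule moves an adjective behind the noun it directly precedes.

```python
# Toy English-to-French direct translation: dictionary substitution followed
# by one local reordering rule. Dictionary and rule are illustrative only.

DICTIONARY = {
    "the": "le",
    "black": "noir",
    "cat": "chat",
    "sleeps": "dort",
}

# Hypothetical word classes for the target words, used by the reordering rule.
ADJECTIVES = {"noir"}
NOUNS = {"chat"}


def substitute(words, dictionary):
    """Step 1: replace each source word with its dictionary equivalent."""
    return [dictionary.get(word, word) for word in words]


def reorder(words):
    """Step 2: move an adjective after the noun it directly precedes."""
    result = list(words)
    i = 0
    while i < len(result) - 1:
        if result[i] in ADJECTIVES and result[i + 1] in NOUNS:
            result[i], result[i + 1] = result[i + 1], result[i]
            i += 2  # skip the pair that was just swapped
        else:
            i += 1
    return result


def translate(sentence):
    """Apply substitution, then reordering, to a whitespace-separated sentence."""
    words = sentence.lower().split()
    return " ".join(reorder(substitute(words, DICTIONARY)))
```

The sketch holds a single equivalent per word and has no notion of agreement or ambiguity, so it breaks down quickly on real text.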

Reason and experience show that “good” MT cannot be produced by such delightfully simple means. As all translators know, word for word translation does not produce a satisfying target language text, not even when some local reordering rules (e.g. for the position of the adjective with regard to the noun which it modifies) have been included in the system. Translating a text requires not only a good knowledge of the vocabulary of both source and target language, but also of their grammar – the system of rules which specifies which sentences are well-formed in a particular language and which are not. Additionally it requires some element of real world knowledge – knowledge of the nature of things out in the world and how they work together – and technical knowledge of the text’s subject area. Researchers certainly believe that much can be done to satisfy these requirements, but producing systems which actually do so is far from easy. Most effort in the past 10 years or so has gone into increasing the subtlety, breadth and depth of the linguistic or grammatical knowledge available to systems.

In growing into some sort of maturity, the MT world has also come to realize that the ‘text in → translation out’ assumption – the assumption that MT is solely a matter of switching on the machine and watching a faultless translation come flying out – was rather too naive. A translation process starts with providing the MT system with usable input. It is quite common that texts which are submitted for translation need to be adapted (for example, typographically, or in terms of format) before the system can deal with them. And when a text can actually be submitted to an MT system, and the system produces a translation, the output is almost invariably deemed to be grammatically and translationally imperfect. Despite the increased complexity of MT systems they will never – within the foreseeable future – be able to handle all types of text reliably and accurately. This normally means that the translation will have to be corrected (post-edited) and usually the person best equipped to do this is a translator.


MT will only be profitable in environments that can exploit the strong points to the full. As a consequence, we see that the main impact of MT in the immediate future will be in large corporate environments where substantial amounts of translation are performed. The implication of this is that MT is not (yet) for the individual self-employed translator working from home, or the untrained lay-person who has the occasional letter to write. This is not a matter of cost: MT systems sell at anywhere between a few hundred pounds and over £100000. It is a matter of effective use. The aim of MT is to achieve faster, and thus cheaper, translation. The lay-person or self-employed translator would probably have to spend so much time on dictionary updating and/or post-editing that MT would not be worthwhile. There is also the problem of getting input texts in machine readable form, otherwise the effort of typing will outweigh any gains of automation. The real gains come from integrating the MT system into the whole document processing environment, and they are greatest when several users can share, for example, the effort of updating dictionaries, efficiencies of avoiding unnecessary retranslation, and the benefits of terminological consistency.

1.3 The Need for Translation Technology

Advances in Information Technology (IT) have combined with modern communication requirements to foster translation automation. The history of the relationship between technology and translation goes back to the beginnings of the Cold War: in the 1950s, competition between the United States and the Soviet Union was so intense at every level that thousands of documents were translated from Russian to English and vice versa. However, such high demand revealed the inefficiency of the translation process, above all in specialized areas of knowledge, increasing interest in the idea of a translation machine. Although the Cold War has now ended, and despite the importance of globalization, which tends to break down cultural, economic and linguistic barriers, translation has not become obsolete, because of the desire on the part of nations to retain their independence and cultural identity, especially as expressed through their own language. This phenomenon can clearly be seen within the European Union, where translation remains a crucial activity. The Internet, with its universal access to information and instant communication between users, has created a physical and geographical freedom for translators that was inconceivable in the past.

IT has produced a screen culture that tends to replace the print culture, with printed documents being dispensed with and information being accessed and relayed directly through computers (e-mail, databases and other stored information). These computer documents are instantly available and can be opened and processed with far greater flexibility than printed matter, with the result that the status of information itself has changed, becoming either temporary or permanent according to need. Over the last two decades we have witnessed the enormous growth of information technology with the accompanying advantages of speed, visual impact, ease of use, convenience, and cost-effectiveness. At the same time, with the development of the global market, industry and commerce function more than ever on an international scale, with increasing freedom and flexibility in terms of exchange of products and services. The nature and function of translation is inevitably affected by these changes. There is a need for countries to cooperate in many spheres, such as ecological (Greenpeace), economic (free trade agreements), humanitarian (Doctors without Borders) and educational (exchange programs). Despite the importance of English, there is the commonly-held belief that people have the right to use their own language, yet the diversity of languages should not be an obstacle to mutual understanding. Solutions to linguistic problems must be found in order to allow information to circulate freely and to facilitate bilateral and multilateral relationships.

Thus different aspects of modern life have led to the need for more efficient methods of translation. At the present time the demand for translations is not satisfied because there are not enough human translators, or because individuals and organizations do not recognize translation as a complex activity requiring a high level of skill, and are therefore not prepared to pay what it is worth. In other words, translation is sometimes avoided because it is considered to be too expensive. In part, human translation is expensive because the productivity of a human being is essentially limited. Statistics vary, but in general to produce a good translation of a difficult text a translator cannot process more than 4-6 pages or 2,000 words per day. The economic necessity of finding a cheaper solution to international exchange has resulted in continuing technological progress in terms of translation tools designed to respond to the translator’s need for immediately-available information and non-sequential access to extensive databases.

The social or political importance of MT arises from the socio-political importance of translation in communities where more than one language is generally spoken. Here the only viable alternative to rather widespread use of translation is the adoption of a single common “lingua franca”, which (despite what one might first think) is not a particularly attractive alternative, because it involves the dominance of the chosen language, to the disadvantage of speakers of the other languages, and raises the prospect of the other languages becoming second-class, and ultimately disappearing. Since the loss of a language often involves the disappearance of a distinctive culture, and a way of thinking, this is a loss that should matter to everyone. So translation is necessary for communication – for ordinary human interaction, and for gathering the information one needs to play a full part in society. Being allowed to express yourself in your own language, and to receive information that directly affects you in the same medium, seems to be an important, if often violated, right. And it is one that depends on the availability of translation. The problem is that the demand for translation in the modern world far outstrips any possible supply. Part of the problem is that there are too few human translators, and that there is a limit on how far their productivity can be increased without automation. In short, it seems as though automation of translation is a social and political necessity for modern societies which do not wish to impose a common language on their members.

The commercial importance of MT is a result of related factors. First, translation itself is commercially important: faced with a choice between a product with an instruction manual in English, and one whose manual is written in Japanese, most English speakers will buy the former – and in the case of a repair manual for a piece of manufacturing machinery or the manual for a safety critical system, this is not just a matter of taste. Secondly, translation is expensive. Translation is a highly skilled job, requiring much more than mere knowledge of a number of languages, and in some countries at least, translators’ salaries are comparable to those of other highly trained professionals. Moreover, delays in translation are costly. Estimates vary, but when producing high quality translations of difficult material, a professional translator may average no more than about 4-6 pages of translation (perhaps 2000 words) per day, and it is quite easy for delays in translating product documentation to erode the market lead time of a new product. It has been estimated that some 40-45% of the running costs of European Community institutions are “language costs”, of which translation and interpreting are the main element. This would give a cost of something like £300 million per annum. This figure relates to translations actually done, and is a tiny fraction of the cost that would be involved in doing all the translations that could, or should, be done.

Scientifically, MT is interesting, because it is an obvious application and testing ground for many ideas in Computer Science, Artificial Intelligence, and Linguistics, and some of the most important developments in these fields have begun in MT. To illustrate this: the origins of Prolog, the first widely available logic programming language, which formed a key part of the Japanese “Fifth Generation” programme of research in the late 1980s, can be found in the “Q-Systems” language, originally developed for MT.

Philosophically, MT is interesting, because it represents an attempt to automate an activity that can require the full range of human knowledge – that is, for any piece of human knowledge, it is possible to think of a context where the knowledge is required. For example, getting the correct translation of “negatively charged electrons and protons” into French depends on knowing that protons are positively charged, so the interpretation cannot be something like “negatively charged electrons and negatively charged protons”. In this sense, the extent to which one can automate translation is an indication of the extent to which one can automate ‘thinking’.

Despite this, very few people, even those who are involved in producing or commissioning translations, have much idea of what is involved in MT today, either at the practical level of what it means to have and use an MT system, or at the level of what is technically feasible, and what is science fiction. In the whole of the UK there are perhaps five companies who use MT for making commercial translations on a day-to-day basis. In continental Europe, where the need for commercial translation is for historical reasons greater, the number is larger, but it still represents an extremely small proportion of the overall translation effort that is actually undertaken. In Japan, where there is an enormous need for translation of Japanese into English, MT is just beginning to become established on a commercial scale, and some familiarity with MT is becoming a standard part of the training of a professional translator.

1.4 History of Machine Translation

Machine translation has recently celebrated its 50th birthday. This is a short life span for a science, but in that period remarkable progress has been made, mirroring the advances in the contributing disciplines of computer science and linguistics.

It is possible to trace ideas about mechanizing translation processes back to the seventeenth century, but realistic possibilities came only in the 20th century. In the mid 1930s, a French-Armenian, Georges Artsrouni, and a Russian, Petr Troyanskii, applied for patents for “translating machines”. Of the two, Troyanskii’s was the more significant, proposing not only a method for an automatic bilingual dictionary, but also a scheme for coding interlingual grammatical roles (based on Esperanto) and an outline of how analysis and synthesis might work. However, Troyanskii’s ideas were not known about until the end of the 1950s. Before then, the computer had been born.

Soon after the first appearance of “electronic calculators”, research began on using computers as aids for translating natural languages. The beginning may be dated to a letter in March 1947 from Warren Weaver of the Rockefeller Foundation to cyberneticist Norbert Wiener. Two years later, Weaver wrote a memorandum (July 1949), putting forward various proposals, based on the wartime successes in code breaking, the developments by Claude Shannon in information theory and speculations about universal principles underlying natural languages. Within a few years research on MT had begun at many US universities, and in 1954 the first public demonstration of the feasibility of machine translation was given.⁴ Although using a very restricted vocabulary and grammar, it was sufficiently impressive to stimulate massive funding of MT in the United States and to inspire the establishment of MT projects throughout the world.

The earliest systems consisted primarily of large bilingual dictionaries where entries for words of the source language gave one or more equivalents in the target language, and some rules for producing the correct word order in the output. It was soon recognized that specific dictionary-driven rules for syntactic ordering were too complex and increasingly ad hoc, and the need for more systematic methods of syntactic analysis became evident. A number of projects were inspired by contemporary developments in linguistics, particularly in models of formal grammar, and they seemed to offer the prospect of greatly improved translation.

Optimism remained at a high level for the first decade of research, with many predictions of imminent “breakthroughs”. However, disillusion grew as researchers encountered “semantic barriers” for which they saw no straightforward solutions. There were some operational systems – the Mark II system installed at the USAF Foreign Technology Division, and the Georgetown University system at the US Atomic Energy Authority and at Euratom in Italy – but the quality of output was disappointing. By 1964, the US government sponsors had become increasingly concerned at the lack of progress; they set up the Automatic Language Processing Advisory Committee (ALPAC), which concluded in a famous 1966 report that MT was slower, less accurate and twice as expensive as human translation and that “there is no immediate or predictable prospect of useful machine translation.” It saw no need for further investment in MT research; instead it recommended the development of machine aids for translators, such as automatic dictionaries, and the continued support of basic research in computational linguistics.

Although widely condemned as biased and short-sighted, the ALPAC report brought a virtual end to MT research in the United States for over a decade, and it had a great impact elsewhere, in the Soviet Union and in Europe. However, research did continue in Canada, in France and in Germany. Within a few years the Systran system was installed for use by the USAF (1970), and shortly afterwards by the Commission of the European Communities for translating its rapidly growing volumes of documentation (1976). In the same year, another successful operational system appeared in Canada, the Meteo system for translating weather reports, developed at Montreal University.

⁴ A collaboration by IBM and Georgetown University.

In the 1960s in the US and the Soviet Union, MT activity had concentrated on Russian-English and English-Russian translation of scientific and technical documents for a relatively small number of potential users, who would accept the crude unrevised output for the sake of rapid access to information. From the mid-1970s onwards the demand for MT came from quite different sources with different needs and different languages. The administrative and commercial demands of multilingual communities and multinational trade stimulated the demand for translation in Europe, Canada and Japan beyond the capacity of the traditional translation services. The demand was now for cost-effective machine-aided translation systems that could deal with commercial and technical documentation in the principal languages of international commerce.

The 1980s witnessed the emergence of a wide variety of MT system types, and from a widening number of countries. First there were a number of mainframe systems, whose use continues to the present day. Apart from Systran, now operating in many pairs of languages, there was Logos (German-English and English-French); the internally developed systems at the Pan American Health Organization (Spanish-English and English-Spanish); the Metal system (German-English); and major systems for English-Japanese and Japanese-English translation from Japanese computer companies.

The wide availability of microcomputers and of text-processing software created a market for cheaper MT systems, exploited in North America and Europe by companies such as ALPS, Weidner, Linguistic Products, and Globalink, and by many Japanese companies, e.g. Sharp, NEC, Oki, Mitsubishi, Sanyo. Other microcomputer-based systems appeared from China, Taiwan, Korea, Eastern Europe, the Soviet Union, etc.

Throughout the 1980s research on more advanced methods and techniques continued. For most of the decade, the dominant strategy was that of ‘indirect’ translation via intermediary representations, sometimes interlingual in nature, involving semantic as well as morphological and syntactic analysis and sometimes non-linguistic “knowledge bases”. The most notable projects of the period were the GETA-Ariane (Grenoble), SUSY (Saarbrücken), Mu (Kyoto), DLT (Utrecht), Rosetta (Eindhoven), the knowledge-based project at Carnegie-Mellon University (Pittsburgh), and two international multilingual projects: Eurotra, supported by the European Communities, and the Japanese CICC project with participants in China, Indonesia and Thailand.

The end of the 1980s and the early 1990s marked a major turning point. Firstly, a group from IBM published the results of experiments on a system (Candide) based purely on statistical methods. Secondly, certain Japanese groups began to use methods based on corpora of translation examples, i.e. using the approach now called “example-based” translation. In both approaches the distinctive feature was that no syntactic or semantic rules were used in the analysis of texts or in the selection of lexical equivalents; both approaches differed from earlier “rule-based” methods in the exploitation of large text corpora.

A third innovation was the start of research on speech translation, involving the integration of speech recognition, speech synthesis, and translation modules – the latter mixing rule-based and corpus-based approaches. The major projects are at ATR (Nara, Japan), the collaborative JANUS project (ATR, Carnegie-Mellon University and the University of Karlsruhe), and in Germany the government-funded Verbmobil project. However, traditional rule-based projects have continued, e.g. the Catalyst project at Carnegie-Mellon University, the project at the University of Maryland, and the ARPA-funded research (Pangloss) at three US universities.

Another feature of the early 1990s was the changing focus of MT activity from ”pure” research to practical applications, to the development of translator workstations for professional translators, to work on controlled language and domain-restricted systems, and to the application of translation components in multilingual information systems.

These trends have continued into the later 1990s. In particular, the use of MT and translation aids (translator workstations) by large corporations has grown rapidly – a particularly impressive increase is seen in the area of software localisation (i.e. the adaptation and translation of equipment and documentation for new markets). There has been a huge growth in sales of MT software for personal computers (primarily for use by non-translators) and, even more significantly, in the availability of MT from on-line networked services (e.g. AltaVista, and many others). The demand has been met not just by new systems but also by “downsized” and improved versions of previous mainframe systems. While in these applications the need may be for reasonably good quality translation (particularly if the results are intended for publication), there has been even more rapid growth of automatic translation for direct Internet applications (electronic mail, Web pages, etc.), where the need is for fast real-time response with less importance attached to quality. With these developments, MT software is becoming a mass-market product, as familiar as word processing and desktop publishing.

1.5 Natural Language Processing

Natural Language Processing (NLP) is both a modern computational technology and a method of investigating and evaluating claims about human language itself. Some prefer the term Computational Linguistics in order to capture this latter function, but NLP is a term that links back into the history of Artificial Intelligence, the general study of cognitive function by computational processes, normally with an emphasis on the role of knowledge representations, that is to say the need for representations of our knowledge of the world in order to understand human language with computers.

NLP is the use of computers to process written and spoken language for some practical, useful purpose: to translate languages, to get information from the web or from text data banks so as to answer questions, to carry on conversations with machines so as to get advice about, say, pensions, and so on. These are only examples of major types of NLP, and there is also a huge range of lesser but interesting applications, e.g. getting a computer to decide if one newspaper story has been rewritten from another or not. NLP is not simply applications but also the core technical methods and theories that the major tasks above divide up into, such as Machine Learning techniques for automating the construction and adaptation of machine dictionaries, modeling human agents’ beliefs and desires, etc. This last is closer to Artificial Intelligence, and is an essential component of NLP if computers are to engage in realistic conversations: they must have an internal model of the humans they converse with.

Natural language processing can be defined, in a very general way, as the discipline having as its ultimate, very ambitious goal that of enabling people to interact with machines using their “natural” faculties and skills. This means, in practice, that machines should be able to understand spoken or written sentences constructed according to the rules of some natural language, and should be capable of generating in reply meaningful sentences in this language. NLP includes:

• Speech Synthesis:

Although this may not at first sight appear very ‘intelligent’, the synthesis of natural-sounding speech is technically complex and almost certainly requires some ‘understanding’ of what is being spoken to ensure, for example, correct intonation.

• Speech Recognition:

Basically the reduction of continuous sound waves to discrete words.

• Natural Language Understanding:

Here treated as moving from isolated words (either written or determined via speech recognition) to ‘meaning’. This may involve complete model systems or ‘front-ends’, driving other programs by natural language commands.

• Natural Language Generation:

Generating appropriate natural language responses to unpredictable inputs.

• Machine Translation:

Translating one natural language into another.

The task of NLP is to accept inputs in a human natural language and to transform them into some sort of formal statements that are “meaningful” for a computer. The computer will therefore be able to react correctly to the given input; sometimes the reaction will take the form of an NL “answer”, i.e., the computer will use the formal representation corresponding to the analysis of the input to generate, in turn, statements in natural language. NLP is characterized by the presence of some form, however primitive and idiosyncratic, of “understanding” of the “meaning” of a given statement. As a consequence, we will exclude from the description of the NLP domain some trivial and purely passive forms of processing of NL inputs. Examples are the simple recording of a spoken input on magnetic media by a voice recorder, or the handling of inputs formed by single words, e.g., all sorts of commands entered by a keyboard or spoken through a voice recognition system. We consider, in fact, that a real problem of “meaning” begins only when several words combine inside a written string or an utterance.

1.6 Standard Paradigm for NLP

While there have been many variants, the structure of most MT systems has parts that can be named after the chapters of a linguistics textbook. They have lexical, morphological, syntactic, and possibly semantic components, one for each of the two languages, for treating basic words, complex words, sentences and meanings. Each feeds into the next until the last one in the chain produces a very abstract representation of the sentence.

1.6.1 The Source Language Analysis

An important issue in natural language analysis is the resolution of structural ambiguity. A sentence is said to be structurally ambiguous when it can be assigned more than one syntactic structure. Natural language analysis is the process of mapping between a natural language text and a representation of its form and/or content. This representation can be a syntactic structure representation, a representation of the text’s propositional meaning, a comprehensive interlingua text (consisting of unambiguous semantic propositions and discourse/pragmatic information) or some specialized representation geared at a particular application. In knowledge-based machine translation, the analysis stage is expected to produce a complete interlingua text. In essence, the quality of the translation depends upon the depth and quality of the analysis. Most transfer-based MT systems stop at a syntactic representation, often augmented with semantic markers (such as case markers for verb arguments), although the trend is toward ever-deeper semantic analysis.
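The notion of structural ambiguity can be made concrete with a small sketch. The grammar, the lexicon and the function name `count_parses` below are hypothetical and chosen only for illustration; they are not part of any system discussed in this thesis. The sketch uses the standard CYK chart-parsing technique to count how many distinct syntactic structures a toy grammar assigns to a sentence; a count greater than one means the sentence is structurally ambiguous under that grammar.

```python
from collections import defaultdict

# Toy grammar in Chomsky normal form (hypothetical, for illustration only).
# Each rule is (parent, left child, right child).
BINARY_RULES = [
    ("S", "NP", "VP"),
    ("VP", "V", "NP"),
    ("VP", "VP", "PP"),   # prepositional phrase attached to the verb phrase
    ("NP", "Det", "N"),
    ("NP", "NP", "PP"),   # prepositional phrase attached to the noun phrase
    ("PP", "P", "NP"),
]

# Toy lexicon mapping each word to its possible categories.
LEXICON = {
    "I": ["NP"],
    "saw": ["V"],
    "the": ["Det"],
    "man": ["N"],
    "telescope": ["N"],
    "with": ["P"],
}


def count_parses(words, start="S"):
    """Count the distinct parse trees for `words` with a CYK chart."""
    n = len(words)
    # chart[(i, j)][symbol] = number of trees deriving words[i:j] from symbol
    chart = defaultdict(lambda: defaultdict(int))
    for i, word in enumerate(words):
        for symbol in LEXICON.get(word, []):
            chart[(i, i + 1)][symbol] += 1
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):  # split point between the two children
                for parent, left, right in BINARY_RULES:
                    left_count = chart[(i, k)].get(left, 0)
                    right_count = chart[(k, j)].get(right, 0)
                    if left_count and right_count:
                        chart[(i, j)][parent] += left_count * right_count
    return chart[(0, n)].get(start, 0)


print(count_parses("I saw the man".split()))                     # 1
print(count_parses("I saw the man with the telescope".split()))  # 2
```

The second sentence receives two structures because the prepositional phrase can attach either to the verb phrase or to the noun phrase; resolving such cases is the task referred to above.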

A comprehensive system of natural language analysis, such as an analysis module of a knowledge-based machine translation system, must include the following basic components:

• Morphological Analysis:

The decomposition of words into their uninflected root forms, performed at the word level. There are many morphological phenomena: almost all languages have inflectional morphology, and the majority have some form of derivational morphology. A number of general models of morphological processing have been investigated. At the theoretical level, the most popular approach to morphology is the so-called two-level approach (Koskenniemi 1983 [36]; Karttunen 1983 [30]). In practical systems many other, less general and more language- and task-specific approaches have been used.

• Syntactic Analysis:

The extraction of all well-formed syntactic structures and dependencies for a source text, performed at the sentence level. In the MT environment, a grammar must be written for each source language, in one of the many current grammar formalisms, such as, for instance, Lexical Functional Grammar, Generalized Phrase Structure Grammar, Head-driven Phrase Structure Grammar, Definite Clause Grammar, Tree-Adjoining Grammar or Government-and-Binding-related grammars. The use of a “canonical” formalism facilitates the use of a single grammar interpreter applicable to any language whose grammar is defined in the selected formalism.

• Semantic Analysis:

The creation of the knowledge structures in a text-meaning representation language (interlingua in MT) that reflect the meanings of lexical units in the source text and semantic dependencies among them, performed at the sentence level but often having to take into account suprasentential contexts. Semantic analysis procedures are typically developed for a particular domain (e.g. medicine, finance, and computers), though general, “common sense” semantic knowledge is also used. The existence of canonical formalisms for encoding world knowledge and text meaning enables the use of a single universal semantic interpreter with a different knowledge source for each domain.

• Pragmatic or Discourse Analysis:

Suprasentential analysis leading to the resolution of anaphors, elided phrases and deixis, as well as the attribution of intent and speech acts. In its full form, discourse analysis leads to the creation of a text-meaning structure in a representation language with the various domain-oriented and rhetorical relations among the elements of a text, including coreference of noun phrases and anaphors, causal and temporal relations, topic/comment structure and so forth. The state of the art in pragmatic and discourse analysis is not as well developed as that of the other three phases of language analysis.

1.6.2 Target Language Generation

The process of natural language generation, in its unconstrained form, starts with the specification of the “need to communicate”, the propositional goals for a target language text, and a pragmatic profile of the speech situation: knowledge about the speaker/author (or, more generally, text producer), the hearer/reader (text consumer), the style of communication and so on. A generator then must perform the following tasks:

• Content Delimitation:

The system must select which of the active propositional and rhetorical goals should be overtly realized in text and which should be left for the human consumer to infer.

• Text Structuring:

The system must determine the order of propositions and the boundaries of sentences in the target language text.

• Lexical Selection:

The system must select open class lexical units to be used in the target language text.

• Syntactic Selection:

The system must select syntactic structures for the target language clauses and perform closed class lexical selection according to syntactic structure decisions.

• Constituent Ordering:


• Realization:

The system must map from syntactic representation with lexical insertion into surface string.

1.7 Aim of The Work

In today’s increasingly networked world, the need for systems to translate documents to and from a variety of languages is expanding, for applications as diverse as: multilingual e-mail, browsing (e.g., on the World Wide Web) and searching texts in other languages, high-quality translation of business letters and reports, translation of technical documents and articles, speech-to-speech translation for business and travel. While useful MT technology is currently available, it is not yet capable of providing both high-quality and wide-domain performance simultaneously [25, 53]. For higher quality the domain may be limited and human assistance required while for a wider domain output quality may be sacrificed. MT research continues to push the boundaries of this automation quality-scope continuum. Techniques such as Statistical MT and Example-Based MT add new capabilities and possibilities to the older tried-and-true methods and theories of MT. Yet, comparing systems and measuring MT quality can be challenging.

MT technology is rich in history, dating back thirty or more years. Over the past decade, the number and diversity of experiments in Rule-Based MT, Knowledge-Based MT, and Example-Based MT have been growing significantly [61]. Most of these methods are aimed at building an automatic high quality translation system. However, there are many difficulties in building such MT systems. A simpler alternative is a Glossary-Based Machine Translation (GBMT) engine, which provides an automatic translation for various language pairs by using a bilingual phrasal dictionary (glossary) to produce a phrase-by-phrase translation [79]. Translation (based on phrase pattern-matching) is fast and accurate with regard to the content of the document, and browsed documents can be translated almost in real time. A GBMT system is also extremely simple, inexpensive and fast to develop. Moreover, all language resources used by the system are entirely under the control of the user.
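The phrase-by-phrase idea behind a glossary-based engine can be sketched in a few lines. This is only an illustrative sketch, not the GBMT engine of [79]: the glossary entries, the function name `translate_phrase_by_phrase` and the greedy longest-match strategy are assumptions made here, and the input is assumed to be already segmented into tokens.

```python
# Hypothetical Japanese-to-Chinese glossary keyed by token sequences.
# Entries are toy examples chosen for illustration only.
GLOSSARY = {
    ("機械", "翻訳"): "机器翻译",
    ("機械",): "机器",
    ("翻訳",): "翻译",
    ("システム",): "系统",
}


def translate_phrase_by_phrase(tokens, glossary):
    """Translate a token list by greedy longest-match glossary lookup.

    Assumption: at each position the longest glossary phrase wins;
    tokens with no glossary entry are copied through unchanged.
    """
    max_len = max((len(key) for key in glossary), default=1)
    output = []
    i = 0
    while i < len(tokens):
        # Try the longest possible phrase first, then shorter ones.
        for length in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = tuple(tokens[i:i + length])
            if phrase in glossary:
                output.append(glossary[phrase])
                i += length
                break
        else:
            # No glossary entry starts here: pass the token through.
            output.append(tokens[i])
            i += 1
    return output


print(translate_phrase_by_phrase(["機械", "翻訳", "システム"], GLOSSARY))
# ['机器翻译', '系统']
```

The sketch also shows why the approach depends on glossary size: every phrase that should be translated as a unit needs its own entry, which is the cost that SFBMT aims to reduce.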

The aim of the current work is to develop a machine translation system which translates Japanese into Chinese. I apply a new MT paradigm called Super-Function Based Machine Translation (SFBMT) to improve on GBMT. SFBMT uses a specific function in the translation engine to enhance the translation quality and to reduce the glossary size. In the system the SF itself is used to translate without the syntactic and semantic analysis that many MT systems usually perform. Furthermore, compound nouns are very frequently used in languages such as Japanese, Chinese and English, and are often important words which determine the semantic content of the document. These compound nouns are too large in number to be contained in a manually-created dictionary, thus automatic acquisition of their translations is highly desirable. The translation of compound nouns has become a major issue in machine translation due to their frequency of occurrence and high productivity. Compound words pose well-known problems for linguistic description in general, and also some additional ones for natural language processing, such as the problems of identification, segmentation, disambiguation, interpretation, and so on. All of those problems make them particularly difficult to handle in a system performing automatic translation, such as a machine translation system or a system for cross-language information retrieval. In this thesis, I also discuss the challenges in automatically translating Japanese compound nouns into Chinese in the SFBMT system, by using a word-level translation dictionary and a target language monolingual corpus.
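One way in which a word-level dictionary and a target-language monolingual corpus might be combined is sketched below. This is a rough illustration under stated assumptions, not the method proposed in Chapter 5: the dictionary entries, the three-line corpus and the names `corpus_frequency` and `translate_compound` are all hypothetical. Candidate translations are formed by combining the word-level translations of each constituent, and the candidate seen most often in the target-language corpus is selected.

```python
from itertools import product

# Hypothetical word-level Japanese-to-Chinese dictionary (toy entries).
WORD_DICTIONARY = {
    "情報": ["信息", "情报"],
    "検索": ["检索", "搜索"],
}

# Hypothetical Chinese monolingual corpus (toy sentences, not real data).
TARGET_CORPUS = [
    "信息检索系统的研究",
    "信息检索技术",
    "情报检索",
]


def corpus_frequency(candidate, corpus):
    """Count how often `candidate` occurs as a substring in the corpus."""
    return sum(sentence.count(candidate) for sentence in corpus)


def translate_compound(constituents, dictionary, corpus):
    """Pick the candidate translation that is most frequent in the corpus.

    Assumption: constituents keep their source order in the target
    language; words missing from the dictionary are copied unchanged.
    """
    options = [dictionary.get(word, [word]) for word in constituents]
    candidates = ["".join(combo) for combo in product(*options)]
    return max(candidates, key=lambda c: corpus_frequency(c, corpus))


print(translate_compound(["情報", "検索"], WORD_DICTIONARY, TARGET_CORPUS))
# 信息检索
```

The sketch requires no syntactic or semantic analysis of the compound, which is what makes such an approach shallow; its obvious limitation is that a correct translation absent from the corpus cannot be preferred over an incorrect one.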

1.8 Structure of The Thesis

This thesis is organized as follows. Chapter 2 reviews the field of machine translation, discussing its difficulties and the different approaches to it.

In Chapter 3, the core algorithms of the basic approach of Super-Function based MT are presented. First, I define the approach more formally. Second, I provide an overview of the SFBMT process: how it works and how the system is implemented. Finally, I relate it to other engines such as GBMT and EBMT.

Chapter 4 gives a detailed description of the user requirements and presents my detailed experimental results and the methodology used to evaluate the system.

In the studies on SFBMT I have found that there are many problems caused by compound nouns. In Chapter 5, I describe those problems of compound nouns in SFBMT and the proposed shallow method. Finally, I outline the resources used and present the evaluation results.

Chapter 6 concludes the thesis with a summary and discussion of the main advantages of this approach and what it has succeeded in showing. Furthermore, the principal limitations are highlighted. In the light of this information, feasible future directions of research are proposed.


Chapter 2

Machine Translation

In exploring the field of MT, it helps to look separately at different related fields. In this chapter, I first give a breakdown of automatic translation, which covers many activities. I then look inside the most common approaches to the non-human component of MT, the so-called translation engine.

2.1 Automatic Translation

During the first years of research in MT, a considerable amount of progress was made which sufficed to convince many people, who originally were highly skeptical, that MT was not just a wild idea. It did more than that. It created among many of the workers actively engaged in this field the strong feeling that a working system was just around the corner. Though it is understandable that such an illusion should have been formed at the time, it was an illusion. It was created, among other causes, by the fact that a large number of problems were rather readily solved, and that the output of machine-simulated “translations” of various texts from Russian, German or French into English was often of a form which an intelligent and expert reader could make good sense and use of. It was not sufficiently realized that the gap between such an output, for which only with difficulty the term “translation” could be used at all, and high quality translation proper, i.e., a translation of the quality produced by an experienced human translator, was still enormous, and that the problems solved until then were indeed many but just the simplest ones, whereas the “few” remaining problems were the harder ones – very hard indeed.

Many groups engaged in MT research still regard fully automatic, high quality translation (FAHQT) as an aim towards which it is reasonable to work. Most groups, however, seem to have realized, sometimes very reluctantly, that FAHQT will not be attained in the near future. Two consequences can be drawn from this realization. One can go on working with FAHQT in mind, in the hope that the pursuit of this aim will yield interesting theoretical insights which will justify this endeavor, whether or not these insights will ever be exploited for some practical purpose. Or one gives up the ideal of FAHQT in favor of some less ambitious aim with a better chance of attainability in the near future. Both consequences are equally reasonable but should lead to rather different approaches. Lack of clarity in this respect, and vague hopes that somehow or other both aims can be attained simultaneously and by the use of the same methods, must lead to confusion and result in waste of effort, time and money. Those who are interested in MT as a primarily practical device must realize that full automation of the translation process is incompatible with high quality. There are two possible directions in which a compromise could be struck: one could sacrifice quality or one could reduce the self-sufficiency of the machine output. There are very many situations where less than high quality machine output is satisfactory. For instance, some scientists are prepared to be satisfied with less than the present average standard of human translation. Many, however, regard this standard as too low for their purposes, and in that case the machine output will have to be post-edited, thereby turning, strictly speaking, machine translation into machine aids to translation.

As soon as the aim of MT is lowered to that of high quality translation by a machine-post-editor partnership, the decisive problem becomes to determine the region of optimality in the continuum of possible divisions of labor. It is clear that the exact position of this region will be a function of, among other things, the state of linguistic analysis to which the languages involved have been submitted. It may be safely assumed that, with machine-time/efficiency becoming cheaper and human time becoming more expensive, continuous efforts will be made to push this region in the direction of reducing the human element. However, there is no good reason to assume that this region can be pushed to the end of the line, certainly not in the near future.

It seems that with the state of linguistic analysis achieved today, and with the kind of electronic computers already in existence or under construction, especially with the kind of large-capacity, low-cost and low-access-time internal memory devices that will be available within a few years, a point has been reached where commercial, partly mechanized translation centers stand a serious chance of becoming a practical reality. However, various developments are still pending and certain decisions will have to be made.

The term Machine-Assisted Translation (MAT) can be taken as covering all techniques for automating the translation activity. Human-Aided Machine Translation (HAMT) is the style of translation in which a computer system does most of the translation, appealing in case of difficulty to a monolingual or bilingual human for help. Machine-Aided Human Translation (MAHT) is the style of translation in which a human does most of the work but uses one or more computer systems as assistants, mainly as resources such as dictionaries and spelling checkers. Fully-Automated Machine Translation (FAMT) means that MT is performed without the intervention of a human being during the process.

2.1.1 Machine-Assisted Translation

Machine-Assisted Translation (MAT), also known as Machine-Aided Translation and sometimes called Computer-Assisted Translation, is human translation supported by a computer system. In MAT, the computer program supports the translator, who translates the text himself or herself and makes all the essential decisions involved, whereas in machine translation the translator supports the machine: the computer or program translates the text, which is then edited by the translator or not edited at all. The difficulties with such unedited output belong to the discussion of machine translation itself. Support is provided in the form of lexical data, grammatical help, translation memory, domain information and organizational support. MAT is a broad and imprecise term covering a range of tools, from the fairly simple to the more complicated. These can include:

• Spell checkers, either built into word processing software, or add-on programs.

• Grammar checkers, again either built into word processing software, or add-on programs.

• Terminology managers, allowing the translator to manage his own terminology bank in an electronic form. This can range from a simple table created in the translator’s word processing software or spreadsheet, a database created in a program such as FileMaker Pro (almost a standard in the translation industry) or Microsoft Access, or, for more robust (and more expensive) solutions, specialized software packages such as LogiTerm, MultiTerm, Termex, etc.

• Dictionaries on CD-ROM, either unilingual or bilingual.

• Terminology databases, either on CD-ROM or accessible through the Internet (such as TERMIUM or Le Grand dictionnaire terminologique from the Office québécois de la langue française).

• Full-text searches (or indexers), which allow the user to query already translated texts or reference documents of various kinds. In the translation industry one finds such indexers as Natural, ISYS and dtSearch.

• Concordancers, which are programs that retrieve instances of a word or an expression in a monolingual, bilingual or multilingual corpus.

• Bitexts, a fairly recent development, the result of merging a source text and its translation, which can then be consulted using a full-text search tool.

• Translation memory managers (TMM), tools consisting of a database of text segments in a source language and their translations in one or more target languages.
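The last item in the list above, the translation memory manager, can be illustrated with a minimal sketch. It assumes that the memory is a simple in-memory mapping from source segments to target segments and that similarity between segments is measured with the character-based ratio of Python's standard `difflib.SequenceMatcher`. The class name `TranslationMemory`, the 0.75 threshold and the English-French sample pair are hypothetical choices made for illustration only; commercial tools use their own storage formats and matching schemes.

```python
from difflib import SequenceMatcher


class TranslationMemory:
    """Toy translation memory: stored source segments mapped to their translations."""

    def __init__(self):
        self.segments = {}  # source segment -> target segment

    def add(self, source, target):
        """Store one translated segment pair."""
        self.segments[source] = target

    def lookup(self, query, threshold=0.75):
        """Return (source, target, score) for the most similar stored segment,
        or None if no stored segment reaches the threshold (an assumed value)."""
        best = None
        for source, target in self.segments.items():
            score = SequenceMatcher(None, query.lower(), source.lower()).ratio()
            if score >= threshold and (best is None or score > best[2]):
                best = (source, target, score)
        return best


tm = TranslationMemory()
tm.add("Please confirm the order.", "Veuillez confirmer la commande.")

exact = tm.lookup("Please confirm the order.")       # identical segment
fuzzy = tm.lookup("Please confirm your order.")      # similar segment
missing = tm.lookup("The weather is nice today.")    # unrelated segment
```

In this sketch an identical segment is returned with a score of 1.0, a near match is offered to the translator for revision, and an unrelated segment yields no proposal, so the translator works from scratch.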

2.1.2 Fully-Automated Machine Translation

This is what most people have in mind when they think of using computers to translate. The idea is simple: feed in the text in one language and get the text out in another language. Unfortunately, the implementation of this idea presents obstacles that have yet to be completely overcome. The main problem is the complexity of language. Consider, for instance, the meanings of the word “can”. Besides its use as a modal auxiliary verb, “can” has several standard and slang meanings as a noun (container, depth charge, jail, and toilet), as well as an archaic verb meaning “to know or to understand”. Assuming that the target language has a separate term for each of these meanings, how is the computer to know which one to choose? Advances have been made in teaching computers to interpret language based on context, and more recent research focuses on probabilistic methods for analyzing texts. But fully automated machine translation covering a broad range of subject areas is still a distant goal.
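The idea of choosing a meaning of “can” from its context can be sketched as follows. This is only an illustration of context-based sense selection by word overlap, not the method of any particular MT system: the sense inventory, the cue words in `SENSE_CUES` and the function name `choose_sense` are all invented for this example, and a realistic system would derive such evidence from dictionaries or corpora.

```python
import re

# Hypothetical cue words for three senses of "can" (illustrative only).
SENSE_CUES = {
    "modal verb": {"you", "i", "we", "help", "do", "see"},
    "container": {"tin", "soup", "open", "opener", "beans", "paint"},
    "jail": {"police", "arrested", "locked", "sentence"},
}


def choose_sense(sentence, cues=SENSE_CUES, default="modal verb"):
    """Pick the sense whose cue words overlap most with the sentence.

    Falls back to `default` when no cue word occurs at all.
    """
    words = set(re.findall(r"[a-z]+", sentence.lower()))
    scores = {sense: len(words & cue_words) for sense, cue_words in cues.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else default


sense_a = choose_sense("He opened a can of soup with the opener.")
sense_b = choose_sense("The police locked him in the can.")
sense_c = choose_sense("Can you help me?")
```

Each sense would then be linked to its own target-language term. The sketch also shows why the problem is hard: the choice is only as good as the cue words, and sentences with no cue or with conflicting cues are decided arbitrarily.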

Like translation memory tools, fully automated machine translation within restricted domains is starting to come of age, with successful implementations being reported at various sites. The success of these implementations is due to various factors, including the suitability of the subject domain and text types and the customizability of the system. The quality of the source text is one of the most vital criteria for successful use of MT at present. The more restricted the source language, the less room for inaccuracy in translation. This is where controlled authoring comes in. Controlled authoring is not a new concept, having been discussed, and used, for a number of years, both as a means of improving English texts and as a means of gaining more control over the output of MT. It has also gone through some rough times, with many people questioning whether it is worth the investment.
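What restricting the source language means in practice can be sketched with a small checker. The sketch assumes two illustrative rules, an approved vocabulary and a maximum sentence length. The function name `check_controlled`, the limit of 20 words and the sample vocabulary are hypothetical, and real controlled-language specifications define considerably richer rule sets.

```python
import re


def check_controlled(text, approved, max_words=20):
    """Report sentences that break two assumed controlled-language rules.

    Returns a list of (sentence_number, problem, detail) tuples.
    """
    issues = []
    # Naive sentence split on ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    for number, sentence in enumerate(sentences, start=1):
        words = re.findall(r"[A-Za-z']+", sentence.lower())
        if len(words) > max_words:
            issues.append((number, "too long", len(words)))
        unknown = sorted(set(words) - approved)
        if unknown:
            issues.append((number, "unapproved words", unknown))
    return issues


approved_words = {"press", "the", "red", "button", "to", "stop", "machine"}
sample = "Press the red button. Actuate the button to stop the machine."
report = check_controlled(sample, approved_words)
```

Here the second sentence is flagged because “actuate” is not in the approved vocabulary. An author who replaces it with the approved “press” gives the MT system a source text with less room for misinterpretation.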