translation
Edited by
JONATHAN SLOCUM
Microelectronics and Computer Technology Corporation, Austin, Texas
CAMBRIDGE UNIVERSITY PRESS
CAMBRIDGE NEW YORK NEW ROCHELLE MELBOURNE SYDNEY
32 East 57th Street, New York, NY 10022, USA
10 Stamford Road, Oakleigh, Melbourne 3166, Australia
© Association for Computational Linguistics 1985, 1988
First published as Computational Linguistics vol. 11 (1985) nos. 1-3 (special issues on machine translation)
Reissued (with revisions) in book form by Cambridge University Press 1988
Printed in Great Britain by Redwood Burn Ltd., Trowbridge
British Library cataloguing in publication data
Machine translation systems - (Studies in natural language processing). 1. Machine translating
I. Slocum, Jonathan II. Series
418'.02 P308
Library of Congress cataloguing in publication data
Machine translation systems.
(Studies in natural language processing).
1. Machine translating. I. Slocum, Jonathan. II. Series.
P308.M35 1981 418'.02 87-15111
ISBN 0 521 35166 9 hard covers
Machine Translation at the Pan American Health Organization
Muriel Vasconcellos and Marjorie León
1 Project History and Current Status
1.1 Overview
Machine translation has been supporting the activities of the Pan American Health Organization (PAHO) since 1980. SPANAM, the system that translates from Spanish into English, began to produce texts for requesting offices within the Organization in 1980, and the English-Spanish system, ENGSPAN, became operational in 1984. As of December 1985, SPANAM had produced 2,614,779 words of translation (10,459 pages) in response to 872 job orders, while ENGSPAN had translated 612,973 words (2,452 pages), of which 594,293 words (2,377 pages) were production texts, requested under 146 orders.
The translation programs run on PAHO's mainframe computer (now an IBM 4381 running under DOS/VSE/SP), which is used for many other purposes as well. A version of the program is also available for the IBM 3081 (OS/MVS). Texts are submitted and retrieved using the ordinary word-processing workstation (Wang OIS/140) as a remote job-entry terminal. Production is in batch mode only.
The input texts come from the regular flow of documentation in the Organization, and there are no restrictions as to field of discourse or type of syntax. Specially trained translators, working at the word-processing screen, produce polished output of standard professional quality at a rate between two and three times as fast as traditional translation (4,000-10,000 words a day for post-editing vs. 1,500-3,000 for human translation). The output is ready for delivery to the requesting office with no further preparation required.
The translation programs and their supporting software are currently written in PL/I.
SPANAM's speed on the mainframe has been registered at 1,500 words per minute in clock time (495,000 words an hour in CPU time), while that of ENGSPAN has reached 836 wpm in clock time
(LOZ,OOOwph cPU).
They run with size parameters of 215K and 800K, respectively. As of December 1985, the dictionaries for SPANAM had 61,282 source entries and 58,485 target entries, while those of ENGSPAN had 45,614 and 47,545, respectively. These reside on permanently mounted disks and occupy from 7 to 10 MB each.
SPANAM and ENGSPAN use essentially the same modular system architecture. ENGSPAN is more advanced linguistically, however, since it has an ATN parser and separate lexical and syntactic transfer modules. The overall policy is to regularly upgrade SPANAM as breakthroughs become available in the more sophisticated ENGSPAN. In this way it has been possible to maintain ongoing production with SPANAM while its capabilities are gradually enhanced and expanded. Because of this dynamic mode of development, information about the theoretical status of either SPANAM or ENGSPAN is necessarily short-lived.
1.2 Early History: 1976-1979
The Pan American Health Organization, with headquarters in Washington, D.C., is the specialized international agency in the Americas that has responsibility for action in the field of public health. It comes under the umbrellas of both the Inter-American System and the United Nations family, serving in the latter instance as Regional Office of the World Health Organization.
In addition to its headquarters staff of 533 in Washington, PAHO has a field staff of 657 that supports both the operational programs in its 10 Pan American centers and 29 other offices in
the field serving the 38 member countries.
Business may be conducted in any of the four official languages: Spanish, English, Portuguese, and French. The translation demand is greatest into Spanish, which accounts for more than half the total workload (average 57 percent), and, after that, into English. The demand for Portuguese is considerably smaller, and there is only an occasional requirement for French.
In 1975 the Organization's administrators undertook a feasibility study and determined that MT might be a means of reducing the expenditure for translation. There was already a mainframe computer, then an IBM 360, at the headquarters site, and the decision was made to develop an MT system that would run on this installation on a time-sharing basis. Work was to focus on the Spanish-English and English-Spanish combinations. The effort was to be supported under the Organization's regular budget.
The intention from the outset was that MT should articulate with the routine flow of text in PAHO. Post-editing was considered to be unavoidable, since the system would have to deal with free syntax, with any vocabulary normally used in the Organization, and, in time, with a large range of subjects and different genres of discourse. No serious thought was given to a mode of operation that would require pre-editing.
Initial efforts began in 1976. A team of three part-time consultants worked under contract for the Organization for two years, and one of these consultants remained with the project on a part-time basis for a third year.
In the beginning the translation strategy reflected some of the principles that had been applied at Georgetown University in the late 1950's and early 1960's in the course of work on the Russian-English system. It should be stressed, however, that the project's theoretical orientation has evolved significantly, and that the algorithm now reflects up-to-date linguistic and computational approaches.
The consultants chose to address the Spanish-English direction first because they recognized that results could be available earlier than if they had started with English as the source. Parallel efforts were concentrated on the architecture of the system itself and the extensive supporting software. The period 1976-1978 saw the mounting of this architecture and the writing of a basic algorithm for the translation of Spanish into English.
At the end of three years the Spanish-English algorithm was in place, as well as a set of PL/I
source dictionary had been built to a level of 48,000 entries, with corresponding English glosses in the separate target dictionary. Work on the dictionaries was supported by mnemonic, user-friendly software developed in 1978-1979 to facilitate the operations of updating, side-by-side printing, and retrieval of individual records.
A corpus of about 50,000 words had been translated from Spanish into
English.
Human resources during the period 1976-1978 consisted of the three part-time consultants together with PAHO's contribution in the form of dictionary manpower (total 24 staff-months in the three-year period) and, starting in 1977, half-time participation of the staff terminologist, who assumed the responsibility of coordination. A computational linguist was recruited and assigned to the project at the end of
1979.
The year 1980 was a turning-point for MT at PAHO. Advances came together which made it possible to move into a production mode. With the computational linguist on board, the programs were greatly improved, and the operational problems of text input, the most serious impediment to production, were resolved.
An interface established between the IBM mainframe and the Organization's word-processing facility (then a Wang System 30) enabled MT to take its place in the text-processing chain and tap into a large body of text that had been made machine-readable for other purposes. A program was written to convert the Wang text for processing on the IBM mainframe, and thereafter any Spanish text that had been keyed into the word-processor, regardless of the purpose for which it
was entered, was available for machine translation.
1.3 Operational Phase: 1980-present
Today the machine translation activity functions as a component of the Organization's newly merged Language Services, and the human translators in both languages post-edit machine output directly on the screen. However, progress toward this end has been gradual.
In 1981, the increasingly steady flow of requests for translation justified the assignment of a full-time post-editor.
The rising
originally in machine-readable form. Whereas word processing had previously been restricted to special services provided by a typing pool, the installation of word-processing hardware (Wang OIS/140) throughout the headquarters building brought all the program units into the text-processing chain. Furthermore, an optical character reader (then a Compuscan Alphaword II) was interfaced with the word-processing system; thus, existing typewriters could also be used as input devices, and this meant that texts could be prepared in the field, and scanned in Washington for subsequent input to SPANAM.
With accelerated production, improvements to SPANAM followed in tandem.
From the beginning it was the policy, and continues to be so today with both SPANAM and ENGSPAN, that the output from production serves not only to meet the purpose for which it was requested but also to provide feedback for further development of the algorithm and dictionaries. As post-editing proceeds, the translator makes note of recurring problems at all levels on a side-by-side version of the output (for every translation two outputs are generated, the side-by-side hard copy and the word-processing document on the screen). The messages noted by the translators serve as a basis both for updating the dictionaries and for making enhancements, as feasible, in the algorithm. The capture of this information at the time of post-editing saves much work later on.
This mode of operation has brought the dictionaries
to their current size. The totals correspond largely to base, or stem, entries, rather than fully inflected forms. For example, in SPANAM's Spanish source dictionary of 61,282 entries, 94% were stems and 6% were full forms. For both systems the incidence of not-found words in random text is well under 1 percent, limited usually to proper names, scientific names, new acronyms, and nonce formations. Through coordination with the terminology side of the program, the glosses have been increasingly tailored to the specific requirements of PAHO. In addition, microglossaries have been established for various users, so that specialized glosses can be elicited.
Further details about the working environment are given
in
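The stem-based dictionary and the user microglossaries described in this section can be pictured with a small lookup sketch. It is only an illustration in Python, not the PL/I implementation: the dictionary entries, the two plural endings, the microglossary name `meetings`, and the function names `candidate_stems`, `lookup`, and `not_found_rate` are all assumptions made for the example.

```python
# Illustrative sketch only: every entry, ending, and name below is hypothetical.
SOURCE_DICT = {"proyecto": "project", "esperar": "hope", "salud": "health"}

# A microglossary holds alternative glosses preferred by one user or subject area.
MICROGLOSSARIES = {"meetings": {"proyecto": "draft"}}

# Only two plural endings are modeled; a real analyzer would cover many more.
ENDINGS = ("es", "s")


def candidate_stems(word):
    """Yield the word itself, then the word with each known ending removed."""
    word = word.lower()
    yield word
    for ending in ENDINGS:
        if word.endswith(ending) and len(word) > len(ending):
            yield word[: -len(ending)]


def lookup(word, microglossaries=()):
    """Return the gloss for word, or None if it is not found.

    Requested microglossaries are consulted before the main dictionary.
    """
    for stem in candidate_stems(word):
        for name in microglossaries:
            gloss = MICROGLOSSARIES.get(name, {}).get(stem)
            if gloss is not None:
                return gloss
        if stem in SOURCE_DICT:
            return SOURCE_DICT[stem]
    return None


def not_found_rate(words, microglossaries=()):
    """Return the fraction of words with no gloss, plus the words themselves."""
    if not words:
        return 0.0, []
    missing = [w for w in words if lookup(w, microglossaries) is None]
    return len(missing) / len(words), missing
```

Storing stems rather than inflected forms keeps the dictionary small, at the cost of an ending-stripping step before each lookup; checking the microglossary first is one plausible way to let a user's preferred gloss override the general one.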
1.4 SPANAM/ENGSPAN Development: 1981-present
In early 1981 a long-range strategy was decided on for the continued improvement of SPANAM and the development of a parallel system from English into Spanish. Two consultants from Georgetown University, Professors R. Ross Macdonald and Michael Zarechnak, undertook separate evaluations of SPANAM at that time. Their recommendations led to the adoption of a combined working mode in which improvements were to be introduced in SPANAM according to a predetermined schedule while at the same time development began on the other system, ENGSPAN. Recognizing that each language combination imposed a different set of linguistic priorities, the consultants nevertheless emphasized that greatly expanded parsing was needed in both cases, especially in the analysis of English as a source language. Such parsing, in turn, called for revision of the dictionary record in order to allow for a broader range of syntactic and semantic coding.
It was felt that the basic modular architecture of SPANAM, as well as the dictionary record in its essential format, should be used for ENGSPAN as well. A common architecture for the two systems meant that they could continue to share the same supporting software. Thus, improvements could migrate readily from one system to the other; it would be easy for them to cross-fertilize.
Having adopted
this approach to development, with each side to benefit systematically from the work being done on the other, the project addressed its attention in 1981 to the enhancements that had been recommended for SPANAM. Then, as the SPANAM effort tapered off, time was devoted increasingly to ENGSPAN. By the end of 1982 the ENGSPAN program and dictionaries (about 40,000 source entries, most of them with acceptable glosses in the Spanish target) were in place.
Translation from English
into Spanish has special importance for public health in the developing countries, and this fact provided the incentive for seeking extrabudgetary support from the U.S. Agency for International Development (AID). In August 1983 AID gave the Organization a two-year grant for the accelerated development of ENGSPAN. This funding made it possible to have a second computational linguist for the grant period, as well as consultants and part-time dictionary assistants who undertook specific tasks within the approved plan of work.
With the added manpower, the project made significant progress on the English-Spanish algorithm. Particular focus was placed on the development of a parser using an augmented transition network (ATN), which as of April 1984 was integrated into the rest of the ENGSPAN program. The dictionary record was modified without any increase in its overall size, so that it can now accommodate 211 fields, as compared with 82 in the 1980 version of SPANAM. Introduction of the new syntactic and semantic codes in the English dictionary was completed in
1985.
2 Application Environment
2.1 Pre-editing Policy
As indicated above, it has always been expected that the output of SPANAM and ENGSPAN would have to be post-edited. There was no application of MT at PAHO for which a customized language would be feasible. Nor was it to be expected in the near future that either of its source languages would be the front end of other language pairs. In these circumstances, there were few advantages to be gained from pre-treatment of the input. Since post-editing was inevitable, it was felt that an additional pre-editing step would be uneconomic: the advantages would not be sufficient to offset the added cost of a second human pass. Moreover, in order for pre-editing to be worthwhile, the process would have to draw on a high degree of linguistic sophistication, and adequate manpower for this purpose was scarce.
Thus, pre-editing in the linguistic sense has been ruled out for SPANAM and ENGSPAN.
In
theory, a document can be sent for execution by SPANAM without being seen by any human eyes. If the operator has keyed in the original Spanish document using normal in-house typing conventions, no adjustments should be required. With inexperienced operators, the precaution is taken to check the format, particularly the line-spacing and page width, since deviations from the standard at that level can disrupt the work of the algorithm. OCR input also needs to be reviewed for accuracy.
Production texts are usually run only once. For long projects, particularly with ENGSPAN, the translation is run in
several parts, additions or changes being made to the dictionaries after each section is processed. Demonstrations are always performed on random text.
2.2 Post-editing Policy
2.2.1 General Position
It is a firm rule at PAHO that post-editing is done by professional translators who have been specially trained in the techniques of handling machine output. The strictness of this rule is balanced against a flexible standard for the degree of post-editing required.
The decision as to the extensiveness of post-editing takes into account:
- the purpose of the translation,
- the user's own resources for editing,
- the time frame, and
- structural linguistic considerations in the text itself.
A text may be needed for information only, for publication, or for a variety of uses between these two extremes. If it is to be edited by the requesting office, only the most glaring problems are dealt with by the translator. On the other hand, if it is to be published without much further review, the translator devotes careful attention to the quality of the text. These factors are determined in consultation with the user at the time the job is submitted. As to time constraints, it may happen that the work has to be delivered under considerable pressure: information-only translations of 20-25 pages may have to be delivered within a couple of hours, and once a 40-page proposal for funding was delivered in polished form the same day it was requested.
With
Contrary to what might be deduced from the nature of PAHO's mission, SPANAM and ENGSPAN are asked to cope with a wide range of subject areas and types of text. There have been: documents for meetings, international agreements, technical and administrative reports, proposals for funding, summaries and protocols for international data bases, journal articles and abstracts, published proceedings of scientific meetings, training manuals, letters, lists of equipment, material for newsletters, even film scripts. They are certainly what Lawson (1982:5) would call "try-anything" systems.
The fact that SPANAM and ENGSPAN have such highly varied applications makes it particularly necessary to use trained professional translators as post-editors. Whereas Martin Kay (1982:74) suggests that the person who interprets machine output "would not have to be a translator and could quite possibly be drawn from a much larger segment of the labor pool," experience in the PAHO environment suggests that this conclusion would be valid only for technical experts working on a text for information purposes, a circumstance that is rare at PAHO.
Only an experienced translator
will be aware of the words whose variable meanings are dependent on extralinguistic context. For example, proyecto in Spanish can be translated as "project," "proposal," or "draft," and the choice depends on full knowledge of the situation to which the text refers. Esperar can be translated as "hope" or "expect," and the difference is significant in English, sometimes even crucial. Such ambiguities require the attention of a translator with training, experience, good knowledge of the subject-matter vocabulary in both languages, and a technical understanding of what is meant by the text. Only a person with this combined background is in a position to make the choices that will fully reflect the intention of the original author.
Another area in which the translator's role is important is in interpretation of the degree of intensity associated with relative terms. For example, trascendente in Spanish can have much less force than its English cognate, and the entire tone of a message may be over- or underdrawn depending on the interpretation given to a key term of this nature. Indeed, it has been the experience of SPANAM that users, even technical experts, can misinterpret the glosses appearing in the machine output and assign an altogether incorrect meaning in the process of "correcting" the text. The role of the experienced translator is not to be underestimated.
It
was these considerations that led PAHO to form a combined language service, effective as of November 1984, under which all translators would ultimately be expected to devote part of their time to the post-editing of MT output. Fortunately, at the time of the merger there were six translator vacancies, and for four of these posts arrangements were made to include post-editing, and also dictionary maintenance, in the job description and in the announcement that listed the duties of the staff being recruited. This policy will be continued with future vacancies. Thus the incoming translators are aware from the outset that their duties will
include post-editing.
2.2.2 Linguistic Strategies
In addition to experience in the interpretation of nuances, the post-editor needs a strong linguistic background in order to master the particular strategies that have proven to be effective in the recasting of certain unwieldy constructions that frequently recur: with SPANAM, for example, the result of translating a verb that was in sentence-initial position in Spanish. For the translator untrained in post-editing, the most time-consuming task is the recasting of such constructions. For problems of this kind a series of "quick-fix" post-editing expedients (QFP) have been developed. At the same time, there is a series of word-processing aids that help to speed up the physical process of editing and to deal with pragmatic decisions in the output which are not handled by syntactic rules.
To follow through
with the example mentioned above, certain maneuvers are suggested as being useful with fronted verb constructions in Spanish, which occur frequently and present difficulties for the SVO pattern typically required for English. The purpose of the QFP is to minimize the number of steps that are required in order to make the sentence work. Since it was a V(S)O construction that triggered the problem in the first place, any solution that avoids reordering will
the example of the fronted verb, one might try to see if the opening phrase, which will be a discourse adjunct, a cognitive adjunct (terms from Halliday 1967), or the main verb itself, could be nominalized so that it can serve as the subject of the sentence in English. Such an approach manages to preserve in the theme position (Halliday 1967) the cognitive material which had been thematic in the source text, usually with a parallel effect on the focus position as well (Vasconcellos 1985b). For this reason, the result is often quite satisfactory, even compared with a translation that is syntactically more "faithful"; see Section 6 below, and (Vasconcellos 1986). The examples below compare QFPs with solutions that were actually proposed by translators new to post-editing (THT: traditional human translation).
In example (1) the semantic content of the fronted verb is reworked into a noun phrase that can serve as the subject of the sentence. Time is saved by leaving the rheme of the sentence untouched; only a few characters, highlighted inside the box, were changed. Moreover, additional speed was gained by making changes from left to right, in the same direction in which the text is being reviewed.
(1) Durante 1983 se inició ya la transformación paulatina de estos planteamientos en acciones.
MT: During 1983 [was initiated already] the gradual transformation of these proposals into actions.
THT: During 1983 these proposals already began to be gradually transformed into actions. (62 keystrokes)
QFP: During 1983 the gradual transformation of these proposals into actions. (27 keystrokes)
In example (2), on the other hand, the adjunct itself is nominalized, again with a significant saving of time and keystrokes:
(2) En este estudio se buscará contestar dos preguntas fundamentales:
MT: [In] this study it will be sought to answer two fundamental questions:
THT: In this study answers to two fundamental questions will be sought: (53 keystrokes)
QFP: This study [seeks] to answer two fundamental questions: (14 keystrokes)
Use of
approach, wherever feasibleo as well as onesthat
have been devised for other troublesome differences between the source andtarget
languages, addsup to
substantial economy,with
apparentlylittle
or no deteriorationin
the quality of the translation (see Section6 below).
However, knowing when and howto
make such changes requires considerableskill.
This
is
one more reasonwhy the
post-editor should have a strong background in translation and, if possible,
in
linguistics as well.It
has always been emphasized in the post-editing of spaxnM and ENGSPANthat
editorial changes should be keptto
the minimumthat
is needed
in
orderto
make theoutput
intelligible and acceptable forits
intended purpose.
2.2.3 Word-processing Strategies
The post-editors work directly on-screen. Experience has shown that post-editing on hard copy, with the changes entered by a "word-processing operator," is not a highly efficient mode. Accordingly, attention has also been given to speeding up the post-edit by automating as many of the recurring operations as possible.
The SEARCH-and-REPLACE function on the word processor is heavily used in post-editing. In addition, SPANAM and ENGSPAN have a series of special macros that have been developed for purposes of MT. Besides a full set of possible word switches (1 x 1, 1 x 2, 2 x 1, 2 x 2, 1 x 3, 3 x 1, 3 x 3, etc.), there are macros that deal with
the particular ... pragmatic reasons. For example,
with a single "glossary" keystroke it is possible to perform the following editorial operations in the English output:
SEARCH-and-DELETE: the, of, there, to, in order to
SEARCH-and-REPLACE: from replaces of, for/of, for/by, in order to V/for Ving, of the, which/that, who/that, every/each, among/between, such as/as, some of the/some
The inventory can be changed or expanded at will.
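The word switches and the search-and-delete and search-and-replace operations listed above can be sketched in a few lines. This is a hedged illustration in Python of what such macros do to a sentence, not a description of the Wang word-processor macros themselves: the function names `switch_words`, `find_phrase`, and `replace_next`, the word-list representation, and the sample sentences in the usage note are assumptions made for the example.

```python
def switch_words(words, start, n, m):
    """Swap the n words beginning at index start with the m words after them.

    An "n x m word switch" in the sense used above, on a list of words.
    """
    first = words[start:start + n]
    second = words[start + n:start + n + m]
    if len(first) < n or len(second) < m:
        raise ValueError("not enough words to switch")
    return words[:start] + second + first + words[start + n + m:]


def find_phrase(words, cursor, phrase):
    """Return the index of the next occurrence of phrase at or after cursor, or -1."""
    size = len(phrase)
    for i in range(cursor, len(words) - size + 1):
        if [w.lower() for w in words[i:i + size]] == phrase:
            return i
    return -1


def replace_next(words, cursor, old, new=""):
    """Replace the next occurrence of old with new; an empty new deletes it."""
    old_words = old.lower().split()
    index = find_phrase(words, cursor, old_words)
    if index < 0:
        return list(words)  # nothing found: leave the text unchanged
    return words[:index] + new.split() + words[index + len(old_words):]
```

For instance, `switch_words("the system new of translation".split(), 1, 1, 1)` reorders the second and third words, and `replace_next("in order to answer two questions".split(), 0, "in order to", "to")` shortens the opening phrase. Searching forward from the cursor mirrors the left-to-right direction in which the post-editor reviews the text.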
2.2.4 Other Time-savers
From the discussion above it can be seen that speed in post-editing is achieved by a combination of strategies. Some of the points made may appear at first sight to be unimportant, yet they can add up to a significant difference. One example of an apparently trivial factor is the method of positioning the cursor under the string to be modified. Delays at this point can add up to a surprising proportion of total time spent on post-editing, since they will occur with every change that is made. Informal experiments suggest that the most efficient approach for positioning the cursor is to always use the SEARCH key. (The "mouse" and light pencil remain to be fully investigated.) The slowest method, unfortunately, seems to be the one that is most often used, namely simple manual striking of the directional keys. Since people tend to rely on the directional keys unless otherwise trained, this point is emphasized with the post-editors who work on SPANAM and ENGSPAN.
The staff of the project are constantly on the lookout for new ways of saving time. All tasks are streamlined as much as possible. A series of programs has been developed on the word processor for automating the housekeeping support that has to be done apart from post-editing, and recently some of this work was made even more efficient by passing it on to
the mainframe computer.
2.3 Post-editing vis-à-vis Other Aspects of the System
In the PAHO MT environment there is a close link between post-editing and the other aspects of the system. The translators are trained to update the dictionaries, and required changes in the dictionaries are proposed at the time of post-editing. This saves going through the text a second time, and it also captures ideas about the glosses and coding of a particular item when the translator has the entire text fresh in mind. This person, if adequately trained in updating, is in an excellent position to decide how to deal with the specific constructions that tend to recur in production translations.
The translators also alert the computational linguist to areas where the algorithm needs improvement.
2.4 Integration into a Complete Translation Environment
As of November 1985 the human and machine translation activities at PAHO, together with simultaneous interpretation, were merged into a single Language Services Unit. There is now centralized screening of incoming jobs. In addition, requestors are free to specifically request a machine translation if they wish. The triage makes it possible to maximize the effectiveness and efficiency of the respective modes.
There is now a more rational utilization of the manpower available at any given time, with the staff being assigned to different duties depending on both needs and skills. Also, it is hoped to be able to ultimately reduce a given person's day in front of the word-processing screen from eight hours to six through the rotation of assignments. To date, however, there have been no complaints of eyestrain or other discomfort with the VDU.
In the area of management, the SPANAM/ENGSPAN programs on the mainframe computer are also helping with labor-intensive operations for which the human translation service is responsible: it is already performing automatic word counts, and spelling-check systems are being developed to reflect the Organization's highly technical vocabulary.
SPANAM/ENGSPAN can also help to lighten the load in the human context.
Considered from the standpoint of term retrieval, MT makes for an efficient lexical data base. With an ordinary LDB, the translator has to go to the terminal (which is not usually for his/her use alone), sign on, and initiate a search. After all the mechanical steps have been performed, there is still the possibility that the term is not in the data base at all, and that the effort will have been wasted. When this happens repeatedly over time, frustration builds up. With SPANAM/ENGSPAN, on the other hand, not only does the translator know immediately what translation has been assigned to the term, but he also knows its degree of reliability and whether or not a full terminological entry is available in the WHOTERM data base, which can be accessed from the same workstation. The status of a term is indicated by small superscript symbols which can be requested at the time the text is sent for
translation.
WHOTERM's files normally contain: a definition of each term in English, translational equivalents in up to four languages besides English, synonyms if there are any, a reliability code for the primary term in each language, scope notes, and a subject code. In addition to the general file, it has files with: names of organizational entities, full equivalents for abbreviations, scientific names of pathogens, generic names of drugs in three languages, and chemical names of pesticides with trade names cross-referenced to them (Ahlroth
and Armstrong-Lowe, 1983).
SPANAM/ENGSPAN can also aid the human translation process with the system of microglossaries for specialized subject areas (see Section 4.3 below). When a text is known to deal with a certain subject, the translator can request a corresponding microglossary which will contain alternative glosses. One or more of these microglossaries can be specified at the time the job is submitted. The post-editing translator can also have a microglossary of his/her own in which can be stored special glosses that he/she prefers to use.
It is possible for SPANAM/ENGSPAN to provide "slashed entries," i.e., alternative choices in the output entry such as project/proposal/draft, hope/expect, time/weather, etc., although this is not the regular policy. These alternatives can be stored in a microglossary.
In
the single keystroke.
If the translator provides feedback in the form of suggested or requested changes in the dictionaries, the updating can be done immediately. Some of the requesting offices have developed the habit of providing regular feedback, and this means that their translations become increasingly tailored to their specific requirements.
While there is no doubt that SPANAM/ENGSPAN reach their maximum efficiency when post-edited on-screen, at the same time studies are being done on ways in which a translator can dictate the changes orally and then have a word-processing operator enter them, working from the audio tape. The human translation service stands to benefit, also, from the macros that have been developed on the word processor in editing a text and from the word count and other support programs.
3 General Translation Approach
Since the bulk of the Organization's translation work involves only Spanish and English, the machine translation system was developed specifically
for this pair
of languages. No consideration was given tousing
the
interlingua approach. The broad range of subject areas to be dealtwith
madeit
impractical to consider using a knowledge-based approach or one based on a representation of the meaning of the text. Although the systems are currently language-specific, significantpor-tions of the algorithm could be adapted for use
in
a system involving Portugueseor
French,the
other official languagesof the
Organiza-tion.
Because SPANAM and ENGSPAN were developed separately, they reflect different theoretical orientations and utilize different computational techniques. At the same time, they have many features in common.

SPANAM was originally designed as a direct translation system. The translation is produced through a series of operations which analyze the Spanish source string, transform the surface structure to produce a syntactic frame for the English target string, substitute the English glosses indicated by the results of the analysis, and insert required endings on the English words. The principal stages involved in the translation algorithm are: morphological analysis and single-word lookup, gap analysis, multi-word unit lookup, homograph resolution, subject identification, treatment of prepositions, object pronoun movement, verb string analysis, subject insertion, do-insertion, noun phrase rearrangement, target lookup, and target synthesis.

ENGSPAN is a lexical and syntactic transfer system based on the slot-and-filler approach to language structure. It performs a separate analysis of the English source string, applies transfer routines based on the contrastive analysis of English and Spanish, and then synthesizes the Spanish target string. The principal stages of this algorithm are: morphological analysis and single-word lookup, gap analysis, substitution and analysis unit lookup, sentence-level parse, lexical transfer, target lookup, syntactic transfer, and target synthesis. The program includes a backup module for homograph resolution, and backup strategies for verb string and noun phrase analysis, that are invoked if the sentence-level parse is unsuccessful.

4 Linguistic Techniques

4.1 Morphological Analysis
SPANAM's morphological lookup procedure makes it possible to find most Spanish words in their stem forms. The algorithm recognizes plural and feminine endings for nouns, pronouns, determiners, quantifiers, and adjectives; person, number, and tense endings for verbs; and derivational endings such as -mente/-ly. Bound clitic pronouns are separated from verb forms and any accent mark related to the presence of the clitic is removed. Another subroutine adds missing accent marks when the source word is written with an initial capital or in all capital letters.
The components of compounds formed with hyphens or slashes are looked up as separate words. A few prefixes are also removed from words without a hyphen.

ENGSPAN's morphological analysis procedure, known as LEMMA, is called if the full form is not found in the dictionary and the word consists of at least four alphabetic characters. This procedure checks [...] Each time an ending is removed, the new form of the word is looked up. LEMMA uses morphological and spelling rules and short lists of exceptions in order to determine when to remove or add a final -e, when the word ends in a double consonant, etc. If a lemmatized form of the word is found in the dictionary, its record is checked to make sure that its part of speech corresponds to the ending that was removed. If LEMMA exhausts all its possibilities, the word is checked against a list of prefixes and rules for British spelling. If these efforts are unsuccessful, a dummy record is created for the word and a gap analysis routine is called.
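The ending-removal loop described for LEMMA can be illustrated with a minimal sketch. It is not the PAHO implementation: the toy dictionary, the list of endings, and the function names are all assumptions made for illustration, and only three of the behaviors mentioned above are modeled (lookup after each removal, restoring a final -e or undoing a doubled consonant, and checking that the part of speech agrees with the ending removed).

```python
# Hypothetical toy dictionary: dictionary form -> possible parts of speech.
DICTIONARY = {
    "expect": {"verb"},
    "make": {"verb"},
    "stop": {"verb", "noun"},
    "health": {"noun"},
}

# Hypothetical endings, each with the parts of speech it is compatible with.
ENDINGS = [
    ("ing", {"verb"}),
    ("ed", {"verb"}),
    ("s", {"verb", "noun"}),
]


def candidate_stems(word, ending):
    """Yield possible base forms after removing `ending` from `word`."""
    stem = word[: -len(ending)]
    yield stem                                   # expected -> expect
    yield stem + "e"                             # making   -> make
    if len(stem) > 2 and stem[-1] == stem[-2]:
        yield stem[:-1]                          # stopped  -> stop


def lookup(word):
    """Return (dictionary form, parts of speech), or None if nothing is found."""
    word = word.lower()
    if word in DICTIONARY:                       # full form found: no analysis
        return word, DICTIONARY[word]
    if len(word) < 4 or not word.isalpha():      # too short for ending removal
        return None
    for ending, compatible in ENDINGS:
        if not word.endswith(ending):
            continue
        for stem in candidate_stems(word, ending):
            # The new form is looked up each time an ending is removed, and
            # its part of speech must agree with the ending that was removed.
            matching = DICTIONARY.get(stem, set()) & compatible
            if matching:
                return stem, matching
    return None  # a caller would go on to prefix and spelling checks
```

In this sketch `lookup("stopped")` finds the entry `stop` as a verb, while `lookup("healthing")` fails because `health` is coded only as a noun and the ending `-ing` requires a verb.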
Not-found words are initially considered to be common nouns (and proper nouns if capitalized) and given the possibility of also functioning as verbs and adjectives. Information from both LEMMA and derivational suffixes is used in order to confirm or reassign the main part of speech, as well as to confirm, remove, or add possibilities for ambiguities.

The lookup strategy used in both SPANAM and ENGSPAN keeps down the size of the dictionary while at the same time allowing a good deal of flexibility. The dictionary coder has the option of entering a word in its full form, in one or more of its inflected forms, or in its stem form. With irregular forms and homographs, the full form must be used. For example, in the Spanish source dictionary the only entries for the verb esperar are the stem esper and the verb/noun homograph espera. The English source dictionary contains an entry for expect and unexpected, but not for expects, expected, expecting, or unexpectedly.

4.2 Homograph Resolution
SPANAM deals with part-of-speech homographs at several different stages of the program. Ambiguities that can be resolved by morphological clues or capitalization are handled by the lookup procedure. Proper names are also identified at this stage. One-character words are distinguished from letters of the alphabet after the lookup has been completed. The homograph resolution module handles other types of homographs by examining the surrounding context.

The possible parts of speech for a word are indicated in the dictionary record in a series of bit fields which include: verb, noun, adjective, pronoun, determiner, numerative, preposition, modifier, adverb, conjunction, auxiliary, and prefix. Any combination of two or more bits may be coded. Other sequences of bit codes are used to distinguish between different types of pronouns, adverbs, and conjunctions: relative, interrogative, nominal, adverbial, connector, interruptive, compound, and coordinate.
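The idea of coding parts of speech as combinable bit fields can be sketched as follows. The twelve names come from the list above, but the bit positions, the class name, and the helper functions are assumptions; the actual record layout is not given in the text.

```python
from enum import IntFlag, auto


class PartOfSpeech(IntFlag):
    """One bit per part of speech (bit order is an assumption)."""
    VERB = auto()
    NOUN = auto()
    ADJECTIVE = auto()
    PRONOUN = auto()
    DETERMINER = auto()
    NUMERATIVE = auto()
    PREPOSITION = auto()
    MODIFIER = auto()
    ADVERB = auto()
    CONJUNCTION = auto()
    AUXILIARY = auto()
    PREFIX = auto()


def is_homograph(coded):
    """A word is a part-of-speech homograph if two or more bits are set."""
    return bin(int(coded)).count("1") >= 2


def possible_parts(coded):
    """List the names of all parts of speech coded for a word."""
    return [pos.name for pos in PartOfSpeech if pos & coded]


# The Spanish form "espera" is described in the text as a verb/noun homograph.
espera = PartOfSpeech.VERB | PartOfSpeech.NOUN
```

Here `possible_parts(espera)` lists both `VERB` and `NOUN`, and `is_homograph(espera)` is true, which is the signal that a later stage must choose between the readings.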
The use of multiple-word substitution units reduces the number of lexical ambiguities which must be resolved by the algorithm. Analysis units may also be used to selectively specify the part of speech of any or all of the words covered by the unit.
ENGSPAN's front-line approach to homograph resolution is embodied in the ATN parser which is described below in Section 5.3. The English words can be coded for the same possible parts of speech as in SPANAM. Determination of the function of each word depends on the path taken through the network. The sequence of parts of speech which leads to the first successful parse is used as the basis for the transfer stage.
There are three ways in which lexical information from the dictionary is used to help the parser arrive at the correct analysis. Substitution units compress idioms into one record with a single part of speech. Analysis units can be used to indicate that a group of words can be expected to occur in collocation with a particular function. This information may be overridden, whenever necessary, by the parser. An individual word may also be coded to indicate which of its possible parts of speech is statistically most frequent. Again, the final decision is made by the parser based on the results of the sentence-level analysis.

4.3 Polysemy

SPANAM/ENGSPAN have two principal tools for dealing with polysemy: microglossaries and transfer units. Substitution units and analysis units are also used when common collocations are involved.

A microglossary is a subset of dictionary entries which can be selected for a particular subject area, discourse register, or user. [...] dictionaries.
Glosses pertaining to the subject area of international public health form part of the main dictionary. Microglossaries are in use for special translations of terms in the fields of law, finance, sanitary engineering, agriculture, computer science, and biomedical research. The system may have up to 99 microglossaries with any number of entries in each one. The microglossaries to be consulted during the translation of a particular text are specified at run time. The existence of a specific microglossary entry is indicated in the main record for the word.
Thus no time is wasted looking for special entries under every word. More than one microglossary may be activated for the same translation, in which event they are listed and consulted in order of priority.

The transfer unit is a lexical transfer rule which is stored in the source dictionary. The existence of a transfer unit is indicated in the record corresponding to the individual source word. A transfer unit specifies a condition to be tested and an action to be performed. Examples of conditions are:

- The subject of this verb has X feature(s) or is word W;
- The object of this verb (or preposition) has X feature(s) or is word W;
- This word modifies a word with X feature(s) or modifies word W;
- This word is modified by a word with X feature(s) or is modified by word W;
- This verb has N object(s);
- This verb governs preposition P and the object of P has X feature(s) or is word W;
- This verb has a complement of type T.

Transfer units are explicitly ordered in the dictionary. The action may either select an alternative translation, insert a word such as a [...] whether or not additional transfer entries should be sought for the same word.

4.4 Syntactic and Semantic Features
The dictionary record for each lexical item (including substitution units) contains bit fields for storing information about its syntactic and semantic features. These features are used in all stages of the translation. For example, verbs and deverbal nouns are specified as occurring with one or more of the following possibilities: no object, one object, two objects, complement, locative, marked infinitive, unmarked infinitive, declarative clause, imperative clause, interrogative clause, gerund, adjunct, bound preposition, object followed by marked infinitive, object followed by clause, and object followed by bound preposition. Subject and object preferences can be specified as ±Human, ±Animate, and ±Concrete.
Other fields can be made available for case frames at such time as they may be introduced. Features which can be coded for nouns include count, bulk, concrete, human, animate, feminine, proper, collective, device, location, time, quantity, scale, color, nationality, material, apposition (noun, infinitive, or clause), body part, condition, and treatment. Adjectives are coded for many of the same features mentioned above. In addition, they can be coded as inflectable, optionally inflectable, comparable, general, temporary, resultative, and tough-movement. Adverbs can be coded as time, place, manner, motive, interruptive, and connector. One of the references used in developing the coding scheme for the English entries was (Sager, 1981).
4.5 Annotated Surface Structure Nodes

In ENGSPAN, the structure produced by the parser consists of a graph of nodes corresponding to each clause and phrase. Each node has a table indicating its constituents, their roles, and their locations. If the constituent is a lexical item, the location is a word number; if it is a phrase or a clause, the location is the pointer to the appropriate phrase or clause involved. [...] These features include Type, Mood, Person, Number, Tense, Aspect, and Voice.
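A node of this kind might be modeled as in the sketch below. The class names, the role labels ("Subject", "Head", and so on), and the choice to keep features in a dictionary on each node are assumptions for illustration; the text specifies only that each constituent has a role and a location, and that a location is either a word number or a pointer to another node.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union


@dataclass
class Constituent:
    role: str                     # hypothetical label, e.g. "Subject" or "Head"
    location: Union[int, "Node"]  # word number, or pointer to a phrase/clause


@dataclass
class Node:
    kind: str                                       # e.g. "clause", "noun phrase"
    constituents: List[Constituent] = field(default_factory=list)
    features: Dict[str, str] = field(default_factory=dict)


def word_numbers(node):
    """Collect the word numbers covered by a node, following pointers."""
    numbers = []
    for constituent in node.constituents:
        if isinstance(constituent.location, Node):
            numbers.extend(word_numbers(constituent.location))
        else:
            numbers.append(constituent.location)
    return numbers


# "the patient sleeps": words are numbered 1, 2, 3.
noun_phrase = Node(
    "noun phrase",
    [Constituent("Determiner", 1), Constituent("Head", 2)],
    {"Number": "singular", "Person": "third"},
)
clause = Node(
    "clause",
    [Constituent("Subject", noun_phrase), Constituent("Predicator", 3)],
    {"Type": "declarative", "Tense": "present", "Voice": "active"},
)
```

Because phrases are referenced by pointer rather than copied, `word_numbers(clause)` can walk from the clause into the noun phrase and recover the word positions 1, 2, and 3.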
Both the ATN formalism and the structural representation used in ENGSPAN draw heavily on the presentation of ATN parsers and systemic grammar in Winograd (1983). Winograd's discussion, in turn, is based on the work of Woods (1969 and 1971) and Kaplan (1973). Of course, the ATN parser has necessarily had to be adapted to the needs and computational environment of the PAHO project.

4.6 Spanish Verb Synthesis
The procedure for the synthesis of Spanish verb forms is based on principles of generative morphology and phonology. The program synthesizes all regular and most of the irregular verbs, in all tenses and moods except the future subjunctive, and in all persons except the second person plural. The verb is entered in the target dictionary in its stem form. Binary codes are used to specify the conjugation class and the 11 exception features that govern the synthesis of irregular forms. Only one dictionary entry is needed for each verb.
A small number of highly irregular stems and full forms (74 in all) are listed in a table. The majority of "stem-changing" verbs require no special synthesis coding. The procedure consists of a series of morphological spellout rules; raising, lowering, diphthongization, and deletion rules based on phonological processes; stress assignment rules; and orthographic rules to handle predictable spelling changes.

Spanish verbs are also coded to permit the synthesis of the nominalization corresponding to "the action or process of." The verb synthesis procedure is invoked to produce nominalized forms which correspond to the first or third person present indicative or present subjunctive, or to the past participle or infinitival form of the verb. Other forms are produced by a combination of stem alterations and derivational affixes. Irregular forms require a separate target dictionary entry.

5 Computational Techniques

5.1 Dictionaries
The SPANAM/ENGSPAN dictionaries are VSAM files stored on a permanently mounted disk. The source and target dictionaries are separate files. The basic record has a fixed length of 160 bytes. The source entry is linked to its target gloss by means of a 12-digit lexical number (LEX). The first six digits of the LEX are the unique identification number which is assigned to each pair when it is added to the dictionary. The second half of the LEX is used to specify alternate target glosses associated with the same source entry. The main or default target gloss for each pair has zeroes in these positions.

5.1.1 Single-word Entries
The key for a source entry is the lexical item itself, which may be up to 30 characters in length. The source dictionary is arranged alphabetically. The key for a target entry is the LEX, and the target dictionary is arranged in numerical order.

Words may be entered in the source dictionary either with or without inflectional endings. Most nouns are entered only in the singular and adjectives only in the masculine singular. Verbs are entered as stems. Full-form entries are required for auxiliary verbs, words with highly irregular morphology, and homographs.

Several source items may be linked to the same target gloss by assigning them the same LEX. For example, irregular forms of the same verb or alternative spellings of a word require only one entry in the target dictionary. Likewise, more than one target gloss can be linked to the same source word through the lexical number. Each alternate gloss is distinguished by coding in the second half of the LEX.
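The two-part structure of the lexical number can be made concrete with a short sketch. The function names and the sample LEX values are invented for illustration, and an in-memory dictionary stands in for the VSAM target file; only the 12-digit length, the six-digit split, and the all-zero default come from the description above.

```python
from typing import Dict, List, NamedTuple


class Lex(NamedTuple):
    pair_id: str    # first six digits: identifies the source/target pair
    alternate: str  # second six digits: selects an alternate target gloss


def parse_lex(lex: str) -> Lex:
    """Split a 12-digit lexical number into its two halves."""
    if len(lex) != 12 or not (lex.isascii() and lex.isdigit()):
        raise ValueError(f"not a 12-digit lexical number: {lex!r}")
    return Lex(lex[:6], lex[6:])


def is_default(lex: str) -> bool:
    """The main (default) gloss has zeroes in the second half."""
    return parse_lex(lex).alternate == "000000"


def glosses_for(pair_id: str, target: Dict[str, str]) -> List[str]:
    """Return all glosses for one pair, in numerical order of their keys."""
    return [gloss for lex, gloss in sorted(target.items())
            if parse_lex(lex).pair_id == pair_id]


# Invented LEX values, for illustration only.
TARGET = {
    "004217000001": "conocer",   # an alternate gloss
    "004217000000": "saber",     # the default gloss
    "009930000000": "agua",
}
```

Since all keys have the same length, sorting them as strings reproduces the numerical order of the target dictionary, so the default gloss for a pair always comes before its alternates.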
Two positions are used to designate terms belonging to microglossaries, two for glosses corresponding to different parts of speech, and two for [...]

5.1.2 Multiple-word Entries
The dictionaries contain four types of multiple-word entries: substitution units (SU), analysis units (AU), delayed substitution units (DSU), and transfer units (TU). The key for a multiple-word entry in the source dictionary is a string consisting of the first six digits of the LEX for each word in the unit. With an SU or an AU, the words must occur consecutively in the sentence in order for the unit to be activated. A DSU or a TU can cover either a continuous or discontinuous string.

The basic SU contains from two to five words. A different record structure is used for longer entries, such as names of organizations and titles of publications.
When an SU is retrieved, the dictionary records corresponding to the individual words are replaced with one record corresponding to the entire sequence. The gloss for the unit is also found in a single entry in the target dictionary. This type of unit is used in order to obtain the correct translation of names of organizations, titles of publications, slogans, etc., and is an efficient way of handling some fixed idioms, phrasal prepositions, and certain technical terminology.

An SU record has the same format as a single-word entry. In addition, it contains a character string which indicates the part of speech of each of its members. This information can be used by the parser if it is unable to parse the sentence using the single part of speech specified for the unit.
Examples of phrases which are entered as SUs are: by leaps and bounds, International Drinking Water Supply and Sanitation Decade, and Health for All by the Year 2000.

The AU, which also contains from two to five words, has several functions. At the very least, it alerts the analysis routines to the possible presence of a common phrase and provides information on its length and function. It can also be used to resolve the part-of-speech ambiguity of any of its members. Finally, it can specify an alternate translation for one or more of its parts.
The AU is an entry in the source dictionary but has no counterpart in the target dictionary. The record for each source word is retained in the representation of the sentence, but the last two digits of its lexical number are modified if [...] lookup is performed, the gloss for each word is retrieved separately. This ensures that the rules for analysis and synthesis of conjoined modifiers will be able to access information about the individual words of the phrase. It also makes it possible for the parser to determine whether or not the individual words are being used as a unit in the given context.
Examples of phrases entered as AUs are drinking water and patient care. The algorithm is still able to correctly analyze sequences such as the children have been drinking water with a high fluoride content and it is essential that the patient care for himself.

The DSU is used to handle lexical items such as phrasal verbs which are likely to occur as noncontiguous words in the input. The existence of a DSU is indicated in the source record of the first word of the unit. The unit is retrieved from the dictionary during the sentence-level parse. The decision of whether or not to accept the unit is based on both syntactic and semantic requirements of the parser. If the unit is accepted, it replaces the individual records and causes a different target gloss to be retrieved. Examples of DSUs are look up, put on, and carry out.

The TU is used to specify an alternate translation of a word or words which depends on the occurrence of a specific word or set of features in one of its arguments or in a specified environment. These entries are stored in the source dictionary only and are retrieved after the analysis has been completed.
If the conditions specified in the transfer entry are met, the corresponding lexical numbers are modified so that the desired target gloss is selected during the target lookup. For example, if the object of know is coded as +Human, the verb is translated as conocer instead of saber. If female and male modify a noun coded as -Human, they are translated as hembra and macho instead of mujer and hombre.

5.2 Grammar Rules
SPANAM until recently used two basic types of grammar rules: pattern matching and transformations. Pattern matching is used for the recognition and reordering of noun phrases. The grammatical patterns are stored in a file which can be updated without recompiling the program. The patterns are applied by searching for the longest match first.
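Longest-match-first application of noun phrase patterns can be sketched as below. The patterns, the part-of-speech tags, and the reordering notation are hypothetical; the format of the actual pattern file is not described in the text. The table is kept as plain data to mirror the point that patterns can be changed without touching the program.

```python
# Hypothetical pattern table: a sequence of source parts of speech mapped to
# the order in which the matched words should appear in the target frame.
PATTERNS = {
    ("DET", "NOUN", "ADJ"): (0, 2, 1),
    ("NOUN", "ADJ"): (1, 0),
    ("DET", "NOUN"): (0, 1),
}


def reorder(tagged):
    """Reorder (word, part of speech) pairs, trying the longest pattern first."""
    by_length = sorted(PATTERNS, key=len, reverse=True)
    output, i = [], 0
    while i < len(tagged):
        for pattern in by_length:
            window = tagged[i : i + len(pattern)]
            if tuple(pos for _, pos in window) == pattern:
                output.extend(window[j][0] for j in PATTERNS[pattern])
                i += len(pattern)
                break
        else:
            # No pattern starts here: copy the word through unchanged.
            output.append(tagged[i][0])
            i += 1
    return output
```

For the Spanish phrase "el agua potable", tagged as determiner, noun, adjective, the three-word pattern is tried before the two-word ones, so the result is `["el", "potable", "agua"]`, the word order needed for "the drinking water".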
Transformations are used to identify and synthesize the verb phrase and clitic pronouns. The rules are expressed in PL/I code and are grouped in modules according to the part of speech of the head word. Each group of rules is tested once for each sentence. The structural description of each rule is compared with the input string. The description may require a match of parts of speech, syntactic features, or specific lexical items. If a match is found, the rule is applied.
The rule may permute, add, delete, or substitute lexical items or features associated with them.

ENGSPAN's analysis component takes the form of an augmented transition network. The network configuration indicates possible sequences of constituents. The rules governing the acceptability of any specific input string are contained in the conditions attached to the various arcs of the network. The building of the nodes of the structural representation and the assignment of features and roles is determined by actions associated with each arc. The conditions and actions are contained in separate modules which are part of the compiled program. The configuration of states and arcs is specified in a file which is updated on-line. The contents of this file also determine which of the conditions and actions are actually attached to specific arcs for a particular run.
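The separation between compiled conditions and actions on the one hand and a network configuration held as data on the other can be illustrated with a minimal sketch. Everything named here is hypothetical: the state names, the condition and action names, and the tiny noun phrase network. The sketch handles only arcs that consume one word by matching its part of speech, and it is far simpler than the ENGSPAN grammar.

```python
import copy

# "Compiled" conditions and actions, looked up by name (names are hypothetical).
CONDITIONS = {
    "always": lambda word, regs: True,
    "max_two_modifiers": lambda word, regs: len(regs.get("modifiers", [])) < 2,
}
ACTIONS = {
    "set_determiner": lambda word, regs: regs.update(determiner=word[0]),
    "add_modifier": lambda word, regs: regs.setdefault("modifiers", []).append(word[0]),
    "set_head": lambda word, regs: regs.update(head=word[0]),
}

# Network configuration kept as data, as it might be read from a file:
# state -> list of (part of speech, condition name, action name, next state).
NOUN_PHRASE = {
    "start": [("DET", "always", "set_determiner", "after_det"),
              ("ADJ", "max_two_modifiers", "add_modifier", "after_det"),
              ("NOUN", "always", "set_head", "end")],
    "after_det": [("ADJ", "max_two_modifiers", "add_modifier", "after_det"),
                  ("NOUN", "always", "set_head", "end")],
    "end": [],
}
FINAL_STATES = {"end"}


def parse(network, tagged, state="start", position=0, regs=None):
    """Return the registers of the first successful path, or None."""
    regs = {} if regs is None else regs
    if position == len(tagged):
        return regs if state in FINAL_STATES else None
    word = tagged[position]
    for category, condition, action, next_state in network[state]:
        if word[1] == category and CONDITIONS[condition](word, regs):
            new_regs = copy.deepcopy(regs)       # keep alternatives independent
            ACTIONS[action](word, new_regs)
            result = parse(network, tagged, next_state, position + 1, new_regs)
            if result is not None:
                return result
    return None
```

Changing which condition or action name appears on an arc in `NOUN_PHRASE` alters the behavior of the parser without modifying the functions themselves, which is the design point made in the paragraph above.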
As of December 1985, the ATN grammar had 11 networks: sentence, clause, noun phrase, verb phrase, sentence nominalization, hyphenated compound, prepositional phrase, comparative phrase, adverbial phrase, relative clause, and dependent clause. Each network consists of a set of states connected by arcs. Four types of arcs are used: CATEGORY arcs, which can be taken if the part of speech matches that of the input word; JUMP arcs, which can be taken without matching a word of the input; SEEK arcs, which initiate recursive calls to a network; and SEND arcs, which return control to the calling network