terminology
Sayori Shimohata
Research &DevelopmentGroup,
Oki Electric Industry Co., Ltd.
Crystal Tower1-2-27 Shiromi,
Chuo-ku,Osaka 540-6025 Japan
Abstract
This pap er describ es a method for retrieving
patterns of words and expressions frequently
usedinasp ecic domainandbuilding a
dictio-naryformachinetranslation(MT).Themetho d
usesanuntaggedtextcorpus inretrievingword
sequences and simplied part-of-sp eech
tem-plates in identifying their syntactic categories.
Thepap er presentsexp erimentalresultsfor
ap-plying the wordsand expressions toa
pattern-basedmachine translationsystem.
1 Intro duction
Therehasb eenacontinuousinterestin
corpus-based approacheswhichretrieve words and
ex-pressions in connection with a sp ecic domain
(we callthem technicalterms hereafter). They
maycorrespondtosyntacticphrases orcomp
o-nents of syntactic relationships and have b een
found useful in various application areas,
in-cluding information extraction, text
summa-rization, and machine translation. Among
oth-ers,a knowledge oftechnicalterminology is
in-disp ensableformachinetranslation b ecause
us-age and meaning of technical terms are often
quitedierentfrom theirliteralinterpretation.
Oneapproach foridentifyingtechnical
termi-nology is a rule-based approach which learns
lo cal syntactic patterns from a training
cor-pus. A varietyof metho dshaveb eendeveloped
withinthisframework,(Ramshaw,1995)
(Arga-monetal.,1999)(CardieandPierce,1999)and
achieved go od results for the considered task.
Surprisingly, though, little work has b een
de-voted to learning lo cal syntactic patterns
be-sides noun phrases. Another drawback of this
approachisthatitrequiressubstantialtraining
corp ora,inmanycaseswithpart-of-speechtags.
An alternative approach is a statistical one
collocations (Smadja, 1993)(Haruno et al.,
1996)(Shimohata et al., 1997). This approach
is robust and practical b ecause it uses plain
text corpora without any information
depen-dent on a language. Unlike the former
ap-proach, this approach extracts varioustypes of
local patterns at the same time. Therefore,
post-pro cessing, such aspart of sp eechtagging
and syntactic category identication, is
neces-sarywhen weapplythem toNLPapplications.
This pap er presents a method for
identify-ing technical terms from a corpus and
apply-ing themtoamachine translationsystem. The
proposedmetho dretrieveslo calpatternsby
uti-lizing the n-gram statistics and identiestheir
syntactic categories with simple part-of-sp eech
templates. Wemakeamachinetranslation
dic-tionary from the retrieved patterns and
trans-late documentsinthesamedomainasthe
orig-inal corpus.
In the next section, we briey describe a
pattern-based machinetranslation. The
follow-ing section explains how the proposed metho d
works indetail. We then present exp erimental
results andconclude withadiscussion.
2 Pattern-based MT system
A pattern-based MTsystem usesa setof
bilin-gual patterns(CFG rules) (Ab eille et al., 1990)
(Takeda,1996)(Shimohataetal.,1999). Inthe
parsing pro cess, the engine p erforms a
CFG-parsingforaninputsentenceandrewritestrees
by applying the source patterns. Terminals
andnon-terminalsareprocessedunderthesame
frameworkbutlexicalizedpatternshavepriority
oversymbolized patterns 1
. A plausible parse
1
Wedene a symbolizedpatternas apattern
with-out a terminal and a lexicalized pattern as that with
by the number of patterns applied. Then the
parsetreeistransferredintotargetlanguageby
using target patterns which corresp ond to the
sourcepatterns.
Figure 1 shows an example of translation
patterns between English and Japanese. Each
English pattern(a left-half CFG rule) has
cor-resp onding Japanese pattern(a right-half CFG
rule). Non-terminals are bracketed with
in-dex numb ers which represents corresp ondence
of non-terminalsbetweenthesourceand target
pattern.
The pattern format is simple but highly
de-scriptive. It can representcomplicated
linguis-tic phenomena and even corresp ondences b
e-tween the languages with quite dierent
struc-tures. Furthermore,alltheknowledgenecessary
forthetranslation,whethersyntacticorlexical,
are compiledin the samepattern format.
Ow-ing to these features, we can easily apply the
retrievedtechnicalterms toa real MTsystem.
3 Algorithm
Figure 2 shows an outline of the proposed
metho d. Theinputisanuntaggedmonolingual
corpus,whiletheoutputisadomaindictionary
for machine translation. The pro cess is
com-prisedof3phases: retrievinglo calpatterns,
as-signing their syntactic categories with
part-of-sp eech(POS)templates,andmakingtranslation
patterns. The dictionary is used when an MT
systemtranslates a textinthesamedomainas
thecorpus.
We assume that the input is an English
cor-pus and the dictionary is used for an
English-Japanese MT system. Inthe remainder of this
section,wewillexplaineachphaseindetailwith
Englishand Japanese examples.
Wehavealreadyprop osedamethodfor
retriev-ing word sequences (Shimohata et al., 1997).
This metho d generates all n-character (or
n-word) strings app earing in a text and lters
out fragmental strings with the distribution of
words adjacent to the strings. This is based
on theidea thatadjacent wordsarewidely
dis-tributed if thestring ismeaningful, and are
lo-calized ifthestringisasubstringofa
meaning-ful string.
Themetho dintroducesentropyvalueto
mea-sure the word distribution. Let the string b e
str, the adjacent words w
1 :::w
n
, and the
fre-quencyofstrfr eq(str ). Theprobabilityofeach
possibleadjacent wordp(w
At that time, the entropy of str H(str ) is
de-ned as:
Calculating the entropyof b oth sides of str,
the lower one is used as H(str ). Then the
strings whose entropy is larger than a given
threshold areretrievedaslo cal patterns.
3.2 Identifying syntacticcategories
Since the strings are just word sequences, the
process gives them syntactic categories. For
eachstr str ,
1. assign part-of-sp eech tags t
1
2. matchtag sequence t
1
3. give str corresponding syntactic category
SC
i
,ifitmatches T
i
3.2.1 Assigning part-of-speech tags
The pro cess usesa simpliedpart-of-sp eechset
shown intable 1. Functionwords are assigned
as they are, while contentwordsexcept for
ad-verb are fallen into only one part of speech
word. Four kinds of words \be", \do", \not",
and \to" are assigned to special tags b e, do,
not, and toresp ectively.
P O S
t e m p l a t e s
R e t r i e v e l o c a l p a t t e r n s
I d e n t i f y s y n t a c t i c c a t e g o r i e s
M a k e t r a n s l a t i o n p a t t e r n s
D o m a i n
D i c t i o n a r y
M T s y s t e m
T e x t
t r a n s l a t i o n
M o n o l i n g u a l
c o r p u s
Figure2: outline
POStag partof sp eech
art article
adv adverb
aux auxiliaryverb
conj conjunction
det determiner
prep prep osition
prn pronoun
punc punctuation
b e \be"
do \do"
not \not"
to \to"
word the others
Table 1: part-of-sp eechtags
it may sometimes be dicult to identify
preciseparts of sp eechinsuch alocal
pat-tern.
wordsareoftenusedbeyondpartsofspeech
intechnicalterminology
itisempiricallyfound thatwordsequences
retrieved through n-gram statistics have
distributionalconcentrationonseveral
syn-tactic categories.
Therefore, wethinkthesimpliedPOStagsare
sucienttoidentifysyntacticcategories.
The word sequence w
1 ,... w
n
is represented
for apart-of-speech tagsequence t
1 ,... t
i .
Fig-ure 3 shows examples of POS tagging. Italic
t h e f u e l t a n k
a r t w o r d w o r d
d o t h i s s t e p :
d o d e t , p r n w o r d p u n c
t o o p r n t h e
t o w o r d a r t
Figure 3: examples ofPOStagging
lines are given word sequences and b old lines
are POStagsequences. Ifa wordfalls intotwo
or more parts of sp eech, all p ossible POSs will
be assignedlike \this"inthesecond example.
3.2.2 Matching POS templates
The process identiesa syntactic category(SC)
of str by checking if str's tag sequence t
1 , ...
i
corresp onding to T
i
. Table 2 shows examples
of POStemplates and corresp ondingSCs 2
Table2: POStemplatesandcorresp ondingSCs
The templates are describ ed in the form of
regular expressions(RE) 3
. The rst template
intable 2,forexample, matchesastring whose
tag sequence beginswith an article, contains0
ormorerep etitionsofcontentwordsor
conjunc-tions, and endswith a content word. \the fuel
tank"ingure3isappliedtothistemplatesand
givenaSC \N".
3.3 Making translation patterns
The process converts the strings into
transla-tionpatterns. Theproblemhereisthatweneed
to generate bilingual translation patterns from
monolingual strings. We use heuristic rules on
b orrowingwordsfrom foreignlanguages 4
.
Figure4 isan exampleofconversionrulesfor
generating English-Japanese translation
pat-terns. To give an example, \to op en the" in
gure 3,whose SCis VT,is convertedinto the
following patterns in accordance with the
sec-ond rulein gure4.
V P
o p e n [ 1 : N P ]
V P
[ 1 : N P ] ( d o b j ) o p e n
( d o )
2
NotethatthePOStemplatesarestronglydep endent
onthefeaturesofn-gramstrings.
3
\3"causestheresultingREtomatch0ormore
rep-etitions of the precedingRE. \+"causes the resulting
REtomatch1ormorerepetitionsoftheprecedingRE.
\j"createsaREexpressionthatwillmatcheitherright
or left of \j". \(:::)" indicates the start and end of a
group.
4
In Japanese, foreign words, esp ecially in technical
terminology,areoftenusedastheyareinkatakana(the
phoneticsp ellingforforeignwords)followedbyfunction
wordswhichindicatetheirpartsofsp eechForexample,
Englishverbsarefollowedby\suru",averbwhichmeans
I f S C i s N , d e l e t e
a r t
a n d g e n e r a t e :
Figure 4: conversionrulesforgenerating
trans-lation patterns
4 Evaluation
We have tested our algorithm in building a
domain dictionary and making a translation
with it. A corpus used in the experiment is a
computer manualcomprising 167,023words(in
22,041 sentences).
The corpus contains 24,137 n-grams which
appear more than twice. Among them, 7,616
strings areextractedovertheentropythreshold
1. Table 3 is a list of top 20 strings (except
for single words and function word sequences)
retrieved from thetest corpus.
Thesestrings arecategorizedinto1,239 POS
patterns. Table 4 is a list of top 10 POS
pat-terns and the numbers of strings classiedinto
them. In thisexp eriment, thetop10 POS
pat-ternsaccountfor49.4%ofallPOSpatterns. It
substantiatesthefactthattheretrievedstrings
tend toconcentrateincertainPOS patterns.
fr eq POS
5.51 247 seealso 3.55 552 theclient
4.48 1499 theserver 3.54 209 usethe
4.46 100 clickOK . 3.46 209 theuser
3.92 106 usethis function 3.46 168 clickthe
3.79 163 thefunction 3.44 172 thecatalog agent
3.76 309 thefollowing 3.36 192 therequest
3.67 297 thele 3.29 132 on page
3.58 36 intheServerManager , 3.23 213 a sp ecied
3.56 169 using the 3.22 71 ifyou want to
3.55 180 CGIprograms 3.22 575 yourserver
Table3: top20 strings
Inthematchingpro cess,weprepared15
tem-plates and 6 SCs. Table 5 is a result of SC
identication. 2,462 strings(32.3 %) are not
matchedto anytemplates. The table indicates
that most strings retrieved in this method are
identied as N and NP. It is quite reasonable
b ecausethe majorityof thetechnicalterms are
supp osedto b enounsand noun phrases.
SC numb erof patterns
NP 722
N+prep 200
VP 32
VP+prep 10
VT 177
V 78
Table 5: result of SCidentication
The retrieved translation patterns total
1,219. Figure 5 shows an example of
transla-tion patternsretrieved byourmetho d.
We, then, converted them to an MT
dictio-nary and made a translation with and without
it. Table 6 summarizes the evaluation results
translating randomly selected 1,000 sentences
fromthetestcorpus. Comparedwiththe
trans-lations without the dictionary,the translations
withthedictionaryimproved571inparsingand
word selection.
Figure 6 illustrates changes in translations.
Each column consists of an input sentence, a
translationwithout thedictionary,anda
trans-improvedinparsing 104
improvedinword selection 467
ab out thesame 160
same 212
notimproved 57
total 1000
Table 6: Translation evaluationresults
correspond tounderlined Japanese.
First two examples show improvement in
wordselection. Thetranslationsof"map(verb)"
and "exec" are changed from word-for-word
translations tonon-translationword sequences.
Although "to makea map" and "exective"are
not wrong translations, they are irrelevant in
thecomputermanualcontext. Onthecontrary,
thedomaindictionaryreducesconfusioncaused
bythe wrongwordselection.
Wrong parsing and incomplete parsing are
also reduced as shown in the next two
exam-ples. Inthe third example, "Next" should b e a
noun,whileitisusuallyusedasanadverb. The
domain dictionary solved the syntactic
ambi-guity prop erlybecauseit hasexclusivepriority
oversystemdictionaries. In theforth example,
"double-click"is anunknownwordwhichcould
cause incomplete parsing. But the phrase was
parsed asaverb correctly.
ThelastoneisanwrongexampleofJapanese
verb selection. That was a main cause of
un-N P f u l l y - q u a l i f i e d d o m a i n n a m e
N P
t e x t s e a r c h e n g i n e
N P
a c c e s s l o g f o r [ 1 : N P ]
V P
s a v e [ 1 : N P ]
V
d e a l l o c a t e
N P f u l l y - q u a l i f i e d d o m a i n n a m e
N P
t e x t s e a r c h e n g i n e
N P
[ 1 : N P ]
( o f ) a c c e s s l o g
V P
[ 1 : N P ] ( d o b j ) s a v e
( d o )
V
d e a l l o c a t e
( d o )
Figure5: theretrieved translation patterns
T y p e t h e U R L p r e f i x y o u w a n t t o
m a p
.
( " m a p " ) ( d o b j )
( " m a k e " )
U R L
m a p ( p e r f o r m m a p p i n g )
U R L p r e f i x
T h e
e x e c t a g
a l l o w s a n H T M L f i l e t o e x e c u t e a n a r b i t r a r y p r o g r a m o n t h e s e r v e r ;
( e x e c t i v e s t a g )
H T M L
e x e c t a g
H T M L
s e r v e r
T y p e t h e f u l l n a m e o f y o u r s e r v e r , a n d t h e n c l i c k
N e x t
.
( " t h e n " )
s e r v e r
N e x t c l i c k
G o t o t h e C o n t r o l P a n e l a n d
d o u b l e - c l i c k
t h e S e r v i c e s i c o n .
C o n t r o l P a n e l
( " d o u b l e - " )
S e r v i c e s
( " c l i c k " )
C o n t r o l P a n e l
S e r v i c e s i c o n
d o u b l e - c l i c k ( " d o u b l e - c l i c k " )
S e t t i n g
a d d i t i o n a l d o c u m e n t d i r e c t o r i e s
( p u t , p l a c e )
d o c u m e n t
d i r e c t o r y
( " a s s i g n , i m p o s e " )
Figure6: examplesentencesinthetest corpus
themetho daddeddefaultsemanticinformation
to the retrieved nouns and noun phrases. We
hop e to overcome it by a mo del that classies
noun phrases, for example using verb-noun or
adjective-nounrelations.
5 Related work
As mentioned in section 1, there are two
ap-proaches in corpus-based technical term
re-trieval: a rule-based approach and a statistical
approach. Major dierences b etween the two
are:
the former uses a tagged corpus while the
theformerretrieveswordsandphraseswith
a designated syntactic category while the
latter retrievesthat with varioussyntactic
categories atthesametime.
Our metho d uses the latter approach because
wethinkitmorepracticalb othinresourcesand
in applications.
For comparison, we refer here to Smadja's
method (1993) b ecause this metho d and the
proposed metho d have much in common. In
both cases, technical terms are retrieved from
an untagged corpus with n-gram statistics and
weusePOStemplates. Aparsermayaddmore
precisesyntacticcategory thanPOStemplates.
However,weconsideritnottobecritical under
the sp ecic condition that the variety of input
patterns is very small. In terms of portability,
the prop osed method has an advantage.
Actu-ally,adding POS templates is notso time
con-sumingasdevelopinga parser.
We haveapplied the translation patterns
re-trieved by this method to a real MT system.
As a result, 57.1 % of translations were
im-proved with1,219 translation patterns. To our
knowledge, little work has gone into
quantify-ing its eectiveness to NLP applications. We
recognize that the metho d leaves room for
im-provementinmaking translationpatterns. We,
therefore,plantointro ducetechniquesfor
nd-ing translational equivalent from bilingual
cor-p ora (Melamed,1998) toourmetho d.
6 Conclusion
We have presented a metho d for identifying
technical terminology and building a domain
dictionary for MT. Applying the metho d to a
technicalmanualinEnglishyielded positive
re-sults. Wehavefoundthattheprop osedmetho d
woulddramaticallyimprovethep erformanceof
translation. In the future work, we plan to
in-vestigatetheavailabilityofPOSpatternswhich
arenotcategorized intoanySCs.
References
Ab eille A., Schab es Y., and Joshi A. K.
1990. \Using Lexicalized Tags for Machine
Translation". In Proceedings of the
Interna-tionalConferenceonComputational
Linguis-tics(COLING), pages 1{6.
Argamon, S., Dagan, I., and Krymolowski,
Yuval. 1999. A Memory-Based Approach
to Learning Shallow Natural Language
Pat-terns. In Proceedings of the 17th COLING
and the 36th Annual Meeting of ACL,pages
67{73.
Cardie, C. and Pierce, D. 1999. The Role of
Lexicalization and Pruning for Base Noun
Phrase Grammars In Proceedings of the
16th National Conference onArticial
Intel-ligence,pages 423{430.
Haruno, M., Ikehara, S., and Yamazaki, T.
16thCOLING, pages525{530.
Melamed,I.D. 1998. EmpiricalMetho dsforMT
LexiconDevelopment In Gerb er,L.and F
ar-well, D. Eds. Machine Translation and the
InformationSoup, Springer-Verlag.
Ramshaw, L.A., and Marcus, M.P. 1995.
Text Chunking using Transformation-Based
Learning InProceedingsofthe 3rdWorkshop
onVeryLargeCorpora,pages 82{94.
Shimohata,S., Sugio,T., and Nagata,J. 1997.
Retrieving Collo cations by Co-o ccurrences
and WordOrder Constraints. In Proceedings
of the 35th Annual Meeting of ACL, pages
476{481.
Shimohata, S. et al. 1999. \Machine
Trans-lation System PENSEE: System Design and
Implementation," InProceedings of Machine
TranslationSummit VII, pp.380{384.
Smadja,F.A. 1993. Retrieving Collo cations
from Text: Xtract. In Computational
Lin-guistics,19(1), pages143{177.
TakedaK. 1996. \Pattern-BasedContext-Free
GrammarsforMachineTranslation". In
Pro-ceedingsof the34th AnnualMeetingof ACL,