• No results found

An empirical method for identifying and translating technical terminology

N/A
N/A
Protected

Academic year: 2020

Share "An empirical method for identifying and translating technical terminology"

Copied!
7
0
0

Loading.... (view fulltext now)

Full text

(1)

terminology

Sayori Shimohata

Research &DevelopmentGroup,

Oki Electric Industry Co., Ltd.

Crystal Tower1-2-27 Shiromi,

Chuo-ku,Osaka 540-6025 Japan

[email protected]

Abstract

This pap er describ es a method for retrieving

patterns of words and expressions frequently

usedinasp ecic domainandbuilding a

dictio-naryformachinetranslation(MT).Themetho d

usesanuntaggedtextcorpus inretrievingword

sequences and simplied part-of-sp eech

tem-plates in identifying their syntactic categories.

Thepap er presentsexp erimentalresultsfor

ap-plying the wordsand expressions toa

pattern-basedmachine translationsystem.

1 Intro duction

Therehasb eenacontinuousinterestin

corpus-based approacheswhichretrieve words and

ex-pressions in connection with a sp ecic domain

(we callthem technicalterms hereafter). They

maycorrespondtosyntacticphrases orcomp

o-nents of syntactic relationships and have b een

found useful in various application areas,

in-cluding information extraction, text

summa-rization, and machine translation. Among

oth-ers,a knowledge oftechnicalterminology is

in-disp ensableformachinetranslation b ecause

us-age and meaning of technical terms are often

quitedierentfrom theirliteralinterpretation.

Oneapproach foridentifyingtechnical

termi-nology is a rule-based approach which learns

lo cal syntactic patterns from a training

cor-pus. A varietyof metho dshaveb eendeveloped

withinthisframework,(Ramshaw,1995)

(Arga-monetal.,1999)(CardieandPierce,1999)and

achieved go od results for the considered task.

Surprisingly, though, little work has b een

de-voted to learning lo cal syntactic patterns

be-sides noun phrases. Another drawback of this

approachisthatitrequiressubstantialtraining

corp ora,inmanycaseswithpart-of-speechtags.

An alternative approach is a statistical one

collocations (Smadja, 1993)(Haruno et al.,

1996)(Shimohata et al., 1997). This approach

is robust and practical b ecause it uses plain

text corpora without any information

depen-dent on a language. Unlike the former

ap-proach, this approach extracts varioustypes of

local patterns at the same time. Therefore,

post-pro cessing, such aspart of sp eechtagging

and syntactic category identication, is

neces-sarywhen weapplythem toNLPapplications.

This pap er presents a method for

identify-ing technical terms from a corpus and

apply-ing themtoamachine translationsystem. The

proposedmetho dretrieveslo calpatternsby

uti-lizing the n-gram statistics and identiestheir

syntactic categories with simple part-of-sp eech

templates. Wemakeamachinetranslation

dic-tionary from the retrieved patterns and

trans-late documentsinthesamedomainasthe

orig-inal corpus.

In the next section, we briey describe a

pattern-based machinetranslation. The

follow-ing section explains how the proposed metho d

works indetail. We then present exp erimental

results andconclude withadiscussion.

2 Pattern-based MT system

A pattern-based MTsystem usesa setof

bilin-gual patterns(CFG rules) (Ab eille et al., 1990)

(Takeda,1996)(Shimohataetal.,1999). Inthe

parsing pro cess, the engine p erforms a

CFG-parsingforaninputsentenceandrewritestrees

by applying the source patterns. Terminals

andnon-terminalsareprocessedunderthesame

frameworkbutlexicalizedpatternshavepriority

oversymbolized patterns 1

. A plausible parse

1

Wedene a symbolizedpatternas apattern

with-out a terminal and a lexicalized pattern as that with

(2)

by the number of patterns applied. Then the

parsetreeistransferredintotargetlanguageby

using target patterns which corresp ond to the

sourcepatterns.

Figure 1 shows an example of translation

patterns between English and Japanese. Each

English pattern(a left-half CFG rule) has

cor-resp onding Japanese pattern(a right-half CFG

rule). Non-terminals are bracketed with

in-dex numb ers which represents corresp ondence

of non-terminalsbetweenthesourceand target

pattern.

The pattern format is simple but highly

de-scriptive. It can representcomplicated

linguis-tic phenomena and even corresp ondences b

e-tween the languages with quite dierent

struc-tures. Furthermore,alltheknowledgenecessary

forthetranslation,whethersyntacticorlexical,

are compiledin the samepattern format.

Ow-ing to these features, we can easily apply the

retrievedtechnicalterms toa real MTsystem.

3 Algorithm

Figure 2 shows an outline of the proposed

metho d. Theinputisanuntaggedmonolingual

corpus,whiletheoutputisadomaindictionary

for machine translation. The pro cess is

com-prisedof3phases: retrievinglo calpatterns,

as-signing their syntactic categories with

part-of-sp eech(POS)templates,andmakingtranslation

patterns. The dictionary is used when an MT

systemtranslates a textinthesamedomainas

thecorpus.

We assume that the input is an English

cor-pus and the dictionary is used for an

English-Japanese MT system. Inthe remainder of this

section,wewillexplaineachphaseindetailwith

Englishand Japanese examples.

Wehavealreadyprop osedamethodfor

retriev-ing word sequences (Shimohata et al., 1997).

This metho d generates all n-character (or

n-word) strings app earing in a text and lters

out fragmental strings with the distribution of

words adjacent to the strings. This is based

on theidea thatadjacent wordsarewidely

dis-tributed if thestring ismeaningful, and are

lo-calized ifthestringisasubstringofa

meaning-ful string.

Themetho dintroducesentropyvalueto

mea-sure the word distribution. Let the string b e

str, the adjacent words w

1 :::w

n

, and the

fre-quencyofstrfr eq(str ). Theprobabilityofeach

possibleadjacent wordp(w

At that time, the entropy of str H(str ) is

de-ned as:

Calculating the entropyof b oth sides of str,

the lower one is used as H(str ). Then the

strings whose entropy is larger than a given

threshold areretrievedaslo cal patterns.

3.2 Identifying syntacticcategories

Since the strings are just word sequences, the

process gives them syntactic categories. For

eachstr str ,

1. assign part-of-sp eech tags t

1

2. matchtag sequence t

1

3. give str corresponding syntactic category

SC

i

,ifitmatches T

i

3.2.1 Assigning part-of-speech tags

The pro cess usesa simpliedpart-of-sp eechset

shown intable 1. Functionwords are assigned

as they are, while contentwordsexcept for

ad-verb are fallen into only one part of speech

word. Four kinds of words \be", \do", \not",

and \to" are assigned to special tags b e, do,

not, and toresp ectively.

(3)

P O S

t e m p l a t e s

R e t r i e v e l o c a l p a t t e r n s

I d e n t i f y s y n t a c t i c c a t e g o r i e s

M a k e t r a n s l a t i o n p a t t e r n s

D o m a i n

D i c t i o n a r y

M T s y s t e m

T e x t

t r a n s l a t i o n

M o n o l i n g u a l

c o r p u s

Figure2: outline

POStag partof sp eech

art article

adv adverb

aux auxiliaryverb

conj conjunction

det determiner

prep prep osition

prn pronoun

punc punctuation

b e \be"

do \do"

not \not"

to \to"

word the others

Table 1: part-of-sp eechtags

it may sometimes be dicult to identify

preciseparts of sp eechinsuch alocal

pat-tern.

wordsareoftenusedbeyondpartsofspeech

intechnicalterminology

itisempiricallyfound thatwordsequences

retrieved through n-gram statistics have

distributionalconcentrationonseveral

syn-tactic categories.

Therefore, wethinkthesimpliedPOStagsare

sucienttoidentifysyntacticcategories.

The word sequence w

1 ,... w

n

is represented

for apart-of-speech tagsequence t

1 ,... t

i .

Fig-ure 3 shows examples of POS tagging. Italic

t h e f u e l t a n k

a r t w o r d w o r d

d o t h i s s t e p :

d o d e t , p r n w o r d p u n c

t o o p r n t h e

t o w o r d a r t

Figure 3: examples ofPOStagging

lines are given word sequences and b old lines

are POStagsequences. Ifa wordfalls intotwo

or more parts of sp eech, all p ossible POSs will

be assignedlike \this"inthesecond example.

3.2.2 Matching POS templates

The process identiesa syntactic category(SC)

of str by checking if str's tag sequence t

1 , ...

(4)

i

corresp onding to T

i

. Table 2 shows examples

of POStemplates and corresp ondingSCs 2

Table2: POStemplatesandcorresp ondingSCs

The templates are describ ed in the form of

regular expressions(RE) 3

. The rst template

intable 2,forexample, matchesastring whose

tag sequence beginswith an article, contains0

ormorerep etitionsofcontentwordsor

conjunc-tions, and endswith a content word. \the fuel

tank"ingure3isappliedtothistemplatesand

givenaSC \N".

3.3 Making translation patterns

The process converts the strings into

transla-tionpatterns. Theproblemhereisthatweneed

to generate bilingual translation patterns from

monolingual strings. We use heuristic rules on

b orrowingwordsfrom foreignlanguages 4

.

Figure4 isan exampleofconversionrulesfor

generating English-Japanese translation

pat-terns. To give an example, \to op en the" in

gure 3,whose SCis VT,is convertedinto the

following patterns in accordance with the

sec-ond rulein gure4.

V P

o p e n [ 1 : N P ]

V P

[ 1 : N P ] ( d o b j ) o p e n

( “ d o ” )

2

NotethatthePOStemplatesarestronglydep endent

onthefeaturesofn-gramstrings.

3

\3"causestheresultingREtomatch0ormore

rep-etitions of the precedingRE. \+"causes the resulting

REtomatch1ormorerepetitionsoftheprecedingRE.

\j"createsaREexpressionthatwillmatcheitherright

or left of \j". \(:::)" indicates the start and end of a

group.

4

In Japanese, foreign words, esp ecially in technical

terminology,areoftenusedastheyareinkatakana(the

phoneticsp ellingforforeignwords)followedbyfunction

wordswhichindicatetheirpartsofsp eechForexample,

Englishverbsarefollowedby\suru",averbwhichmeans

I f S C i s N , d e l e t e

a r t

a n d g e n e r a t e :

Figure 4: conversionrulesforgenerating

trans-lation patterns

4 Evaluation

We have tested our algorithm in building a

domain dictionary and making a translation

with it. A corpus used in the experiment is a

computer manualcomprising 167,023words(in

22,041 sentences).

The corpus contains 24,137 n-grams which

appear more than twice. Among them, 7,616

strings areextractedovertheentropythreshold

1. Table 3 is a list of top 20 strings (except

for single words and function word sequences)

retrieved from thetest corpus.

Thesestrings arecategorizedinto1,239 POS

patterns. Table 4 is a list of top 10 POS

pat-terns and the numbers of strings classiedinto

them. In thisexp eriment, thetop10 POS

pat-ternsaccountfor49.4%ofallPOSpatterns. It

substantiatesthefactthattheretrievedstrings

tend toconcentrateincertainPOS patterns.

fr eq POS

(5)

5.51 247 seealso 3.55 552 theclient

4.48 1499 theserver 3.54 209 usethe

4.46 100 clickOK . 3.46 209 theuser

3.92 106 usethis function 3.46 168 clickthe

3.79 163 thefunction 3.44 172 thecatalog agent

3.76 309 thefollowing 3.36 192 therequest

3.67 297 thele 3.29 132 on page

3.58 36 intheServerManager , 3.23 213 a sp ecied

3.56 169 using the 3.22 71 ifyou want to

3.55 180 CGIprograms 3.22 575 yourserver

Table3: top20 strings

Inthematchingpro cess,weprepared15

tem-plates and 6 SCs. Table 5 is a result of SC

identication. 2,462 strings(32.3 %) are not

matchedto anytemplates. The table indicates

that most strings retrieved in this method are

identied as N and NP. It is quite reasonable

b ecausethe majorityof thetechnicalterms are

supp osedto b enounsand noun phrases.

SC numb erof patterns

NP 722

N+prep 200

VP 32

VP+prep 10

VT 177

V 78

Table 5: result of SCidentication

The retrieved translation patterns total

1,219. Figure 5 shows an example of

transla-tion patternsretrieved byourmetho d.

We, then, converted them to an MT

dictio-nary and made a translation with and without

it. Table 6 summarizes the evaluation results

translating randomly selected 1,000 sentences

fromthetestcorpus. Comparedwiththe

trans-lations without the dictionary,the translations

withthedictionaryimproved571inparsingand

word selection.

Figure 6 illustrates changes in translations.

Each column consists of an input sentence, a

translationwithout thedictionary,anda

trans-improvedinparsing 104

improvedinword selection 467

ab out thesame 160

same 212

notimproved 57

total 1000

Table 6: Translation evaluationresults

correspond tounderlined Japanese.

First two examples show improvement in

wordselection. Thetranslationsof"map(verb)"

and "exec" are changed from word-for-word

translations tonon-translationword sequences.

Although "to makea map" and "exective"are

not wrong translations, they are irrelevant in

thecomputermanualcontext. Onthecontrary,

thedomaindictionaryreducesconfusioncaused

bythe wrongwordselection.

Wrong parsing and incomplete parsing are

also reduced as shown in the next two

exam-ples. Inthe third example, "Next" should b e a

noun,whileitisusuallyusedasanadverb. The

domain dictionary solved the syntactic

ambi-guity prop erlybecauseit hasexclusivepriority

oversystemdictionaries. In theforth example,

"double-click"is anunknownwordwhichcould

cause incomplete parsing. But the phrase was

parsed asaverb correctly.

ThelastoneisanwrongexampleofJapanese

verb selection. That was a main cause of

(6)

un-N P f u l l y - q u a l i f i e d d o m a i n n a m e

N P

t e x t s e a r c h e n g i n e

N P

a c c e s s l o g f o r [ 1 : N P ]

V P

s a v e [ 1 : N P ]

V

d e a l l o c a t e

N P f u l l y - q u a l i f i e d d o m a i n n a m e

N P

t e x t s e a r c h e n g i n e

N P

[ 1 : N P ]

( “ o f ” ) a c c e s s l o g

V P

[ 1 : N P ] ( d o b j ) s a v e

( “ d o ” )

V

d e a l l o c a t e

( “ d o ” )

Figure5: theretrieved translation patterns

T y p e t h e U R L p r e f i x y o u w a n t t o

m a p

.

( " m a p " ) ( d o b j )

( " m a k e " )

U R L

m a p ( “ p e r f o r m m a p p i n g ” )

U R L p r e f i x

T h e

e x e c t a g

a l l o w s a n H T M L f i l e t o e x e c u t e a n a r b i t r a r y p r o g r a m o n t h e s e r v e r ;

( “ e x e c t i v e ‘ s t a g ” )

H T M L

e x e c t a g

H T M L

s e r v e r

T y p e t h e f u l l n a m e o f y o u r s e r v e r , a n d t h e n c l i c k

N e x t

.

( " t h e n " )

s e r v e r

N e x t c l i c k

G o t o t h e C o n t r o l P a n e l a n d

d o u b l e - c l i c k

t h e S e r v i c e s i c o n .

C o n t r o l P a n e l

( " d o u b l e - " )

S e r v i c e s

( " c l i c k " )

C o n t r o l P a n e l

S e r v i c e s i c o n

d o u b l e - c l i c k ( " d o u b l e - c l i c k " )

S e t t i n g

a d d i t i o n a l d o c u m e n t d i r e c t o r i e s

( “ p u t , p l a c e ” )

d o c u m e n t

d i r e c t o r y

( " a s s i g n , i m p o s e " )

Figure6: examplesentencesinthetest corpus

themetho daddeddefaultsemanticinformation

to the retrieved nouns and noun phrases. We

hop e to overcome it by a mo del that classies

noun phrases, for example using verb-noun or

adjective-nounrelations.

5 Related work

As mentioned in section 1, there are two

ap-proaches in corpus-based technical term

re-trieval: a rule-based approach and a statistical

approach. Major dierences b etween the two

are:

the former uses a tagged corpus while the

theformerretrieveswordsandphraseswith

a designated syntactic category while the

latter retrievesthat with varioussyntactic

categories atthesametime.

Our metho d uses the latter approach because

wethinkitmorepracticalb othinresourcesand

in applications.

For comparison, we refer here to Smadja's

method (1993) b ecause this metho d and the

proposed metho d have much in common. In

both cases, technical terms are retrieved from

an untagged corpus with n-gram statistics and

(7)

weusePOStemplates. Aparsermayaddmore

precisesyntacticcategory thanPOStemplates.

However,weconsideritnottobecritical under

the sp ecic condition that the variety of input

patterns is very small. In terms of portability,

the prop osed method has an advantage.

Actu-ally,adding POS templates is notso time

con-sumingasdevelopinga parser.

We haveapplied the translation patterns

re-trieved by this method to a real MT system.

As a result, 57.1 % of translations were

im-proved with1,219 translation patterns. To our

knowledge, little work has gone into

quantify-ing its eectiveness to NLP applications. We

recognize that the metho d leaves room for

im-provementinmaking translationpatterns. We,

therefore,plantointro ducetechniquesfor

nd-ing translational equivalent from bilingual

cor-p ora (Melamed,1998) toourmetho d.

6 Conclusion

We have presented a metho d for identifying

technical terminology and building a domain

dictionary for MT. Applying the metho d to a

technicalmanualinEnglishyielded positive

re-sults. Wehavefoundthattheprop osedmetho d

woulddramaticallyimprovethep erformanceof

translation. In the future work, we plan to

in-vestigatetheavailabilityofPOSpatternswhich

arenotcategorized intoanySCs.

References

Ab eille A., Schab es Y., and Joshi A. K.

1990. \Using Lexicalized Tags for Machine

Translation". In Proceedings of the

Interna-tionalConferenceonComputational

Linguis-tics(COLING), pages 1{6.

Argamon, S., Dagan, I., and Krymolowski,

Yuval. 1999. A Memory-Based Approach

to Learning Shallow Natural Language

Pat-terns. In Proceedings of the 17th COLING

and the 36th Annual Meeting of ACL,pages

67{73.

Cardie, C. and Pierce, D. 1999. The Role of

Lexicalization and Pruning for Base Noun

Phrase Grammars In Proceedings of the

16th National Conference onArticial

Intel-ligence,pages 423{430.

Haruno, M., Ikehara, S., and Yamazaki, T.

16thCOLING, pages525{530.

Melamed,I.D. 1998. EmpiricalMetho dsforMT

LexiconDevelopment In Gerb er,L.and F

ar-well, D. Eds. Machine Translation and the

InformationSoup, Springer-Verlag.

Ramshaw, L.A., and Marcus, M.P. 1995.

Text Chunking using Transformation-Based

Learning InProceedingsofthe 3rdWorkshop

onVeryLargeCorpora,pages 82{94.

Shimohata,S., Sugio,T., and Nagata,J. 1997.

Retrieving Collo cations by Co-o ccurrences

and WordOrder Constraints. In Proceedings

of the 35th Annual Meeting of ACL, pages

476{481.

Shimohata, S. et al. 1999. \Machine

Trans-lation System PENSEE: System Design and

Implementation," InProceedings of Machine

TranslationSummit VII, pp.380{384.

Smadja,F.A. 1993. Retrieving Collo cations

from Text: Xtract. In Computational

Lin-guistics,19(1), pages143{177.

TakedaK. 1996. \Pattern-BasedContext-Free

GrammarsforMachineTranslation". In

Pro-ceedingsof the34th AnnualMeetingof ACL,

References

Related documents