Freq.  examples of matching part in front  written  spoken  examples of matching part behind
182  IPALno keiyoushi , keiyou-doushi no
72  wo bokokugo no (of) kiiwaado de kouritsu yoku kensaku
43  patapata . batabata
49  no kakutyou de aru e (eh) shikibetsuteki tokutyou
56  LR hyou he no hukusuu no ee (eh) setsuzoku seiyaku
54  hyakuman-en ni naru " to yogen shita toka
39  ni zokusuru bekutoru no (of) wa no kyori ni
28  honbunkan no haipaarinku wo (obj.) jidou seisei
19  shuushoku youso wo (obj.) torikomu
21  shiborikomigo wa \ beieiga
22  meishi no kurikaeshi no baai , ee (eh) sentou no meishi nomi
20  waiwa ) x to jijou ga chigau
21  sonshitsu wa zengakushuu deeta (data) deetaa (data) ni taisuru sonshitsu no
11  oyobi yougen no ga(katakana) (sub.) ga(hiragana) (sub.) kakujouhou wo huyo
19  gakkai happyou ronbun <C> 25534 hyoudaitsui kara naru
13  tairyou no koupasu wo mochiite , e (eh) kikai hon-yaku ni yori
15  mado kansuu ni (by) yori kiridasareta
12  gengengo to mokuhyou gengo to (and) no aida no douji shinkou sei ga
10  shimesu kotoba wo oginatte 1 (one) ichi(kanji) (one) tsu no bun wo
10  zatsuon supekutoru no ('s) wo (obj.) genzan
14  sono kekka wo hitode de (by) shuusei shiteiku
20  sukunai toiu (that is) koto kara
10  bangumi wo (obj.) no (of) saisho kara saigo made
10  goukei kyoudo ga N f "
11  kiji no sono (its) kouzou to tokutyou
16  bangumi jidou jimakuka no tame(hiragana) (for) tame(kanji) (for) no onsei ninshiki shisutemu wo
10  warichikai kurasuta (cluster) kurasutaa (cluster) ga imi
6  bekutoru wo kotonaru k K kono shuugou ni
8  VQ koudobukku no 2 (two) ni(kanji) (two) shurui no washa moderu
7  tekigouritsu = (equal) wa (be) honyaku kekka ga
8  sono buntyuu de n ("n") wadai to natteiru youso
9  juubun yoi seido de wa (be) suitei dekinai toiu mondai ga
6  renketsu gakushuu to o ("oh") kongousuu wo baizou suru
11  picchi no joushou toiumono (that is) ga yuuseion
7  kisoteki na (-tive) kentou wo
5  seiki seigen no (of) ga (sub.) kibishii keitaiso mo
5  kyoutaiiki / koutaiiki CELP
7  hukasa D no oo ("oh") joui gainen ni tyuushouka
7  wo kontekisuto ni motsu youna (such that) gengo moderu ni
14  shuhou toiuno wo teian
Table 10: Example of ellipsis
Matching part in front | Written data | Spoken data | Matching part behind
sumuujingu (smoothing) | shori (process) | | wo (obj. case-particle)
kaku (each) | C(V){k} | | sohen (piece)
supoutsu nyuusu (sports news) | ni okeru (in) | no (of) | kaiwa bubun wo (conversation)
heikin jikan ga (average time) | 11.25 | 11.3 | hun made (minutes)
Table 11: Example of complementation
Matching part in front | Written data | Spoken data | Matching part behind
sonshitsu no (loss) | | atai no (values) | heikin toshite (in average)
kaiwa ni kanshimashite wa (about conversation) | | zenzen | hugen wa nai (no inconvenience)
on-atsu reberu (sound level) | 70dB | nanajuugogo deshiberu (70.55dB) | de teiji (shown)
Table 8: Example of synonyms
Written data | Spoken data
oyobi (and) | to (and)
ya (or) | toka (or)
ronbun (paper) | kenkyu (study)
, kotonari (differences) | kotonari-de (differences)
kaku (each) | sorezore (each)
i-banme no taamu (i-th term) | taamu I (term I)
jutsugo (predicate) | doushi (verb)
shikibetsu (discrimination) | ninshiki (recognition)
kotonareba, (different) | chigaeba (different)
4. Colloquial style
We show some examples of this style in Table 9.
We extracted many colloquial-style Japanese expressions. "toiu", which is "that is" in English, was extracted; "toiu" is a colloquial expression. Such expressions were extracted by matching written and spoken data. Both "shita" and "itashimashita" mean "did"; however, "itashimashita" is much more polite than "shita". In Japanese, we use polite expressions in presentations and we
natural language processing. We used the method described in this study for different dictionaries and obtained many synonyms (Murata and Isahara, 2001b).
Table 9: Example of colloquial style
Written data | Spoken data
 | toiu (that is)
shita. (did) | itashimashita (did, polite expression)
 | desu (polite expression)
rareru. (be done) | raremasu (be done, polite exp.)
 | tteiu (that is)
 | kou (this)
use neutral expressions in papers. Hence, we extracted these pairs by matching written and spoken data. "kou" (this) was extracted. In spoken language, demonstrative expressions such as "this" are often used; "kou" is one example of them.
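Conceptually, such difference pairs come from aligning two tokenized texts with the same meaning and collecting the non-matching regions together with their surrounding context. The following is a minimal sketch of that idea, not the authors' exact algorithm: it aligns two (invented, romanized) token sequences with Python's difflib and emits (front context, written part, spoken part, behind context) tuples.

```python
# Sketch: extract written/spoken difference pairs by matching two
# tokenized texts with the same meaning. The example tokens are
# illustrative assumptions, not data from the paper.
from difflib import SequenceMatcher

def extract_differences(written, spoken, k=1):
    """Return (front context, written part, spoken part, behind context)
    tuples for every non-matching region of the two token lists,
    with up to k context morphemes on each side."""
    diffs = []
    sm = SequenceMatcher(a=written, b=spoken, autojunk=False)
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag == "equal":
            continue
        front = written[max(0, i1 - k):i1]    # k morphemes in front
        behind = written[i2:i2 + k]           # k morphemes behind
        diffs.append((tuple(front), tuple(written[i1:i2]),
                      tuple(spoken[j1:j2]), tuple(behind)))
    return diffs

written = ["kisoteki", "na", "kentou", "wo", "okonau"]
spoken = ["kisoteki", "kentou", "wo", "itashimashita"]
for pair in extract_differences(written, spoken):
    print(pair)
```

Counting how often the same tuple recurs across a large aligned corpus would then yield frequency-ranked pairs like those in the tables above.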
5. Ellipsis (decrease of information)
We show some examples of ellipsis in Table 10. "process" was omitted in the spoken data. "11.25" in the written data was changed into the lighter expression "11.3" in the spoken data.
6. Complementation (increase of information)
We show some examples of complementation in Table 11. This is the inverse of the above item "Ellipsis". "loss in average" in the written data was changed into the richer expression "the value of loss in average". "70 dB" in the written data was also changed into "70.55 dB" in the spoken data.
Table 13: Examples of transformation from written to spoken language (the case of k=1)
kin-nen chishiki kakutoku no kenkyu ga juuyoushi saretsutsu aru.
(recently) (knowledge) (extraction) (of) (study) (became important)
e (eh)  wo (obj.)
(Eh, study on knowledge (obj.) extraction became important recently.)
hon kou dewa , dougi no tekisuto wo shougou shi ,
(this) (paper) (in) (,) (same meaning) (texts) (match) (,)
kenkyu (study)
(In this paper (study), we matched texts having the same meaning,)
sono shougou kekka wo mochiite chishiki wo kakutoku shita.
(the) (matching) (result) (use) (knowledge) (extracted)
(and extracted knowledge using the matching (results).)
Table 14: Examples of transformation from written to spoken language (the case of k=2)
sono teigi wo riyou suru toiu koto ga kangaerareru
(its) (definition) (obj.) (use) (do) (that) (sub.) (think)
ma (filler)
(We can think (ma) that we use its definition.)
dougi hyougen wo tyuushutsu suru koto wo kokoromiru.
(same-meaning) (expression) (obj.) (extract) (do) (that) (obj.) (try)
toiu (that is / such)
(We try (such) that we extract the same-meaning expressions.)
hindo de souto shita kekka wo hyou ni shimesu.
(frequency) (by) (sort) (did) (result) (obj.) (table) (in) (show)
toiu (that is) no (those)
(We show (those that is) the results sorted by frequency in the table.)
We used sentences in our Japanese paper (Murata and Isahara, 2001a) as input and transformed the written language to spoken language by using the above procedure. We show the result for k=1 in Table 13 and the result for k=2 in Table 14. The underlined part is the part that was removed in the transformation, and the lower strings are the transformed ones. Since the algorithm was not so strong, for k=1 the context was short and the precision was low ("wo" was inserted, but it was a wrong transformation, and there were many other wrong transformations in the experiments). However, good spoken-language-like results were obtained: "e" (eh) was inserted, and "this
probability of occurrence of x in the corpora when the given input data is used as the context. Although our procedures use the fixed k×2 morphemes "in front" and "behind" as the context, we must calculate the probabilities by using a variable-length context and more global information, such as syntactic information and tense information, in a powerful probability estimator such as the maximum entropy method.
paper" was transformed into "this study". For k=2, the precision was very good and there were few errors: "toiu" and "ma" are Japanese colloquial expressions and were correctly inserted. The results were very good. However, the number of transformed expressions was very small, and the recall rate was very low. We feel obliged to improve the method by using the approach described in Footnote 4. In our future work, we will improve the system by changing the calculation of frequencies into the calculation of probabilities and by making the information used for the context richer.
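The transformation step described above can be sketched roughly as follows. This is a simplified, assumed reading of the procedure, not the authors' implementation: each extracted difference is kept as a rule (front context, written part, spoken part, behind context, frequency), and at each position the most frequent rule whose written part and both contexts match is applied. The rules and tokens below are invented for illustration.

```python
# Sketch of frequency-based written-to-spoken transformation.
# Rules are (front, written, spoken, behind, freq) tuples of morphemes;
# an empty written part means a pure insertion (e.g. a filler).

def matches(tokens, i, rule):
    """Check whether a rule's written part and both contexts match at i."""
    front, written, _spoken, behind, _freq = rule
    w = len(written)
    if tuple(tokens[i:i + w]) != written:
        return False
    if tuple(tokens[max(0, i - len(front)):i]) != front:
        return False
    return tuple(tokens[i + w:i + w + len(behind)]) == behind

def transform(tokens, rules):
    """Greedily rewrite a written-language token list into spoken style,
    preferring the most frequent applicable rule at each position."""
    out, i = [], 0
    while i < len(tokens):
        applicable = [r for r in rules if matches(tokens, i, r)]
        if applicable:
            _front, written, spoken, _behind, _freq = max(
                applicable, key=lambda r: r[4])
            out.extend(spoken)
            if written:           # substitution or deletion: skip it
                i += len(written)
            else:                 # pure insertion: keep the current token
                out.append(tokens[i])
                i += 1
        else:
            out.append(tokens[i])
            i += 1
    return out

# Hypothetical rules: replace "shita" (did) with the polite
# "itashimashita", and insert the filler "e" (eh) in one context.
rules = [
    ((), ("shita",), ("itashimashita",), (), 10),
    (("dewa",), (), ("e",), ("dougino",), 5),
]
tokens = ["hon", "kou", "dewa", "dougino", "tekisuto", "wo", "shita"]
print(transform(tokens, rules))
```

Replacing the raw frequencies with conditional probabilities, as suggested in Footnote 4, would only change the `key` used to rank the applicable rules.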
5. Conclusion
In this study, we extracted differences between spoken and written language and examined the extracted differences by using spoken and written data constructed by the Communications Research Laboratory and the National Institute for Japanese Language. We also tried transforming written language into spoken language by using the extracted differences as transformation rules. Although previous studies of the differences between spoken and written languages were insufficient, our approach of using computational processing for studies of differences between spoken and written languages has demonstrated its efficacy.
We also constructed a basic system for transforming written language into spoken language. This system outputted spoken-language-like expressions. The results of our study will be further applied in our future work.
6. References
Sadao Kurohashi and Makoto Nagao, 1998. Japanese Morphological Analysis System JUMAN version 3.5. Department of Informatics, Kyoto University. (in Japanese).
Masaki Murata and Hitoshi Isahara. 2001a. Automatic paraphrase acquisition based on matching of two texts with the same meaning. Information Processing Society of Japan, WGNL 142-18. (in Japanese).
Masaki Murata and Hitoshi Isahara. 2001b. Universal model for paraphrasing: using transformation based on a defined criteria. In NLPRS'2001 Workshop on Automatic Paraphrasing: Theories