Freq.  examples of matching part in front  written  spoken  examples of matching part behind
182  IPALno keiyoushi , keiyou-doushi no
72  wo bokokugo no (of) kiiwaado de kouritsu yoku kensaku
43  patapata . batabata
49  no kakutyou de aru e (eh) shikibetsuteki tokutyou
56  LR hyou he no hukusuu no ee (eh) setsuzoku seiyaku
54  hyakuman-en ni naru " to yogen shita toka
39  ni zokusuru bekutoru no (of) wa no kyori ni
28  honbunkan no haipaarinku wo (obj.) jidou seisei
19  shuushoku youso wo (obj.) torikomu
21  shiborikomigo wa \ beieiga
22  meishi no kurikaeshi no baai , ee (eh) sentou no meishi nomi
20  waiwa ) x to jijou ga chigau
21  sonshitsu wa zengakushuu deeta (data) deetaa (data) ni taisuru sonshitsu no
11  oyobi yougen no ga(katakana) (sub.) ga(hiragana) (sub.) kakujouhou wo huyo
19  gakkai happyou ronbun <C> 25534 hyoudaitsui kara naru
13  tairyou no koupasu wo mochiite , e (eh) kikai hon-yaku ni yori
15  mado kansuu ni (by) yori kiridasareta
12  gengengo to mokuhyou gengo to (and) no aida no douji shinkou sei ga
10  shimesu kotoba wo oginatte 1 (one) ichi(kanji) (one) tsu no bun wo
10  zatsuon supekutoru no ('s) wo (obj.) genzan
14  sono kekka wo hitode de (by) shuusei shiteiku
20  sukunai toiu (that is) koto kara
10  bangumi wo (obj.) no (of) saisho kara saigo made
10  goukei kyoudo ga N f "
11  kiji no sono (its) kouzou to tokutyou
16  bangumi jidou jimakuka no tame(hiragana) (for) tame(kanji) (for) no onsei ninshiki shisutemu wo
10  warichikai kurasuta (cluster) kurasutaa (cluster) ga imi
6  bekutoru wo kotonaru k K kono shuugou ni
8  VQ koudobukku no 2 (two) ni(kanji) (two) shurui no washa moderu
7  tekigouritsu = (equal) wa (be) honyaku kekka ga
8  sono buntyuu de n ("n") wadai to natteiru youso
9  juubun yoi seido de wa (be) suitei dekinai toiu mondai ga
6  renketsu gakushuu to o ("oh") kongousuu wo baizou suru
11  picchi no joushou toiumono (that is) ga yuuseion
7  kisoteki na (-tive) kentou wo
5  seiki seigen no (of) ga (sub.) kibishii keitaiso mo
5  kyoutaiiki / koutaiiki CELP
7  hukasa D no oo ("oh") joui gainen ni tyuushouka
7  wo kontekisuto ni motsu youna (such that) gengo moderu ni
14  shuhou toiuno wo teian
Table 10: Example of ellipsis
Matching part in front | Written data | Spoken data | Matching part behind
sumuujingu (smoothing) | shori (process) | | wo (obj. case-particle)
kaku (each) | C(V){k} | | sohen (piece)
supoutsu nyuusu (sports news) | ni okeru (in) | no (of) | kaiwa bubun wo (conversation)
heikin jikan ga (average time) | 11.25 | 11.3 | hun made (minutes)
Table 11: Example of complementation
Matching part in front | Written data | Spoken data | Matching part behind
sonshitsu no (loss) | | atai no (values) | heikin toshite (in average)
kaiwa ni kanshimashite wa (about conversation) | | zenzen | hugen wa nai (no inconvenience)
on-atsu reberu (sound level) | 70dB | nanajuugogo deshiberu (70.55dB) | de teiji (shown)
Table 8: Example of synonyms
Written data | Spoken data
oyobi (and) | to (and)
ya (or) | toka (or)
ronbun (paper) | kenkyu (study)
, kotonari (differences) | kotonari-de (differences)
kaku (each) | sorezore (each)
i-banme no taamu (i-th term) | taamu I (term I)
jutsugo (predicate) | doushi (verb)
shikibetsu (discrimination) | ninshiki (recognition)
kotonareba, (different) | chigaeba (different)
4. Colloquial style
We show some examples of this style in Table 9.
We extracted many colloquial-style Japanese expressions. "toiu", which is "that is" in English, was extracted; "toiu" is a colloquial expression. Such expressions were extracted by matching written and spoken data. Both "shita" and "itashimashita" mean "did"; however, "itashimashita" is much more polite than "shita". In Japanese, we use polite expressions in presentations and we
natural language processing. We used the method described in this study for different dictionaries and obtained many synonyms (Murata and Isahara, 2001b).
Table 9: Example of colloquial style
Written data | Spoken data
 | toiu (that is)
shita. (did) | itashimashita (did, polite expression)
 | desu (polite expression)
rareru. (be done) | raremasu (be done, polite exp.)
 | tteiu (that is)
 | kou (this)
use neutral expressions in papers. Hence, we extracted these pairs by matching written and spoken data. "kou" (this) was extracted. In spoken language, demonstrative expressions such as "this" are often used; "kou" is one example of them.
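Conceptually, such difference pairs come from aligning two tokenized texts with the same meaning and collecting the non-matching regions together with their surrounding context. The following is a minimal sketch of that idea, not the authors' exact algorithm: it aligns two (invented, romanized) token sequences with Python's difflib and emits (front context, written part, spoken part, behind context) tuples.

```python
# Sketch: extract written/spoken difference pairs by matching two
# tokenized texts with the same meaning. The example tokens are
# illustrative assumptions, not data from the paper.
from difflib import SequenceMatcher

def extract_differences(written, spoken, k=1):
    """Return (front context, written part, spoken part, behind context)
    tuples for every non-matching region of the two token lists,
    with up to k context morphemes on each side."""
    diffs = []
    sm = SequenceMatcher(a=written, b=spoken, autojunk=False)
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag == "equal":
            continue
        front = written[max(0, i1 - k):i1]    # k morphemes in front
        behind = written[i2:i2 + k]           # k morphemes behind
        diffs.append((tuple(front), tuple(written[i1:i2]),
                      tuple(spoken[j1:j2]), tuple(behind)))
    return diffs

written = ["kisoteki", "na", "kentou", "wo", "okonau"]
spoken = ["kisoteki", "kentou", "wo", "itashimashita"]
for pair in extract_differences(written, spoken):
    print(pair)
```

Counting how often the same tuple recurs across a large aligned corpus would then yield frequency-ranked pairs like those in the tables above.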
5. Ellipsis (decrease of information)
We show some examples of ellipsis in Table 10. "process" was omitted in the spoken data. "11.25" in the written data was changed into the lighter expression "11.3" in the spoken data.
6. Complementation (increase of information)
We show some examples of complementation in Table 11. This is the inverse of the above item "Ellipsis". "loss in average" in the written data was changed into the richer expression "the value of loss in average". "70 dB" in the written data was also changed into "70.55 dB" in the spoken data.
Table 13: Examples of transformation from written to spoken language (the case of k=1)
kin-nen chishiki kakutoku no kenkyu ga juuyoushi saretsutsu aru.
(recently) (knowledge) (extraction) (of) (study) (became important)
e (eh)  wo (obj.)
(Eh, study on knowledge (obj.) extraction became important recently.)
hon kou dewa , dougi no tekisuto wo shougou shi ,
(this) (paper) (in) (,) (same meaning) (texts) (match) (,)
kenkyu (study)
(In this paper (study), we matched texts having the same meaning,)
sono shougou kekka wo mochiite chishiki wo kakutoku shita.
(the) (matching) (result) (use) (knowledge) (extracted)
(and extracted knowledge using the matching (results).)
Table 14: Examples of transformation from written to spoken language (the case of k=2)
sono teigi wo riyou suru toiu koto ga kangaerareru
(its) (definition) (obj.) (use) (do) (that) (sub.) (think)
ma (filler)
(We can think (ma) that we use its definition.)
dougi hyougen wo tyuushutsu suru koto wo kokoromiru.
(same-meaning) (expression) (obj.) (extract) (do) (that) (obj.) (try)
toiu (that is / such)
(We try (such) that we extract the same-meaning expressions.)
hindo de souto shita kekka wo hyou ni shimesu.
(frequency) (by) (sort) (did) (result) (obj.) (table) (in) (show)
toiu (that is) no (those)
(We show (those that is) the results sorted by frequency in the table.)
We used sentences in our Japanese paper (Murata and Isahara, 2001a) as input and transformed the written language to spoken language by using the above procedure. We show the result for k=1 in Table 13 and the result for k=2 in Table 14. The underlined part is the part that was removed in the transformation, and the lower strings are the transformed ones. Since the algorithm was not so strong, for k=1 the context was short and the precision was low ("wo" was inserted, but it was a wrong transformation, and there were many other wrong transformations in the experiments). However, good spoken-language-like results were obtained: "e" (eh) was inserted, and "this
probability of occurrence of x in the corpora when the given input data is used as the context. Although our procedures use the fixed k×2 morphemes "in front" and "behind" as the context, we must calculate the probabilities by using a variable-length context and more global information, such as syntactic information and tense information, in a powerful probability estimator such as the maximum entropy method.
paper" was transformed into "this study". For k=2, the precision was very good and there were few errors: "toiu" and "ma" are Japanese colloquial expressions and were correctly inserted. The results were very good. However, the number of transformed expressions was very small, and the recall rate was very low. We feel obliged to improve the method by using the approach described in Footnote 4. In our future work, we will improve the system by changing the calculation of frequencies into the calculation of probabilities and by making the information used for the context richer.
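The transformation step described above can be sketched roughly as follows. This is a simplified, assumed reading of the procedure, not the authors' implementation: each extracted difference is kept as a rule (front context, written part, spoken part, behind context, frequency), and at each position the most frequent rule whose written part and both contexts match is applied. The rules and tokens below are invented for illustration.

```python
# Sketch of frequency-based written-to-spoken transformation.
# Rules are (front, written, spoken, behind, freq) tuples of morphemes;
# an empty written part means a pure insertion (e.g. a filler).

def matches(tokens, i, rule):
    """Check whether a rule's written part and both contexts match at i."""
    front, written, _spoken, behind, _freq = rule
    w = len(written)
    if tuple(tokens[i:i + w]) != written:
        return False
    if tuple(tokens[max(0, i - len(front)):i]) != front:
        return False
    return tuple(tokens[i + w:i + w + len(behind)]) == behind

def transform(tokens, rules):
    """Greedily rewrite a written-language token list into spoken style,
    preferring the most frequent applicable rule at each position."""
    out, i = [], 0
    while i < len(tokens):
        applicable = [r for r in rules if matches(tokens, i, r)]
        if applicable:
            _front, written, spoken, _behind, _freq = max(
                applicable, key=lambda r: r[4])
            out.extend(spoken)
            if written:           # substitution or deletion: skip it
                i += len(written)
            else:                 # pure insertion: keep the current token
                out.append(tokens[i])
                i += 1
        else:
            out.append(tokens[i])
            i += 1
    return out

# Hypothetical rules: replace "shita" (did) with the polite
# "itashimashita", and insert the filler "e" (eh) in one context.
rules = [
    ((), ("shita",), ("itashimashita",), (), 10),
    (("dewa",), (), ("e",), ("dougino",), 5),
]
tokens = ["hon", "kou", "dewa", "dougino", "tekisuto", "wo", "shita"]
print(transform(tokens, rules))
```

Replacing the raw frequencies with conditional probabilities, as suggested in Footnote 4, would only change the `key` used to rank the applicable rules.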
5. Conclusion
In this study, we extracted differences between spoken and written language and examined the extracted differences by using spoken and written data constructed by the Communications Research Laboratory and the National Institute for Japanese Language. We also tried transforming written language into spoken language by using the extracted differences as transformation rules. Although previous studies of the differences between spoken and written languages were insufficient, our approach of using computational processing for studies of differences between spoken and written languages has demonstrated its efficacy.
We also constructed a basic system for transforming written language into spoken language. This system outputted spoken-language-like expressions. The results of our study will be further applied in our future work.
6. References
Sadao Kurohashi and Makoto Nagao, 1998. Japanese Morphological Analysis System JUMAN version 3.5. Department of Informatics, Kyoto University. (in Japanese).
Masaki Murata and Hitoshi Isahara. 2001a. Automatic paraphrase acquisition based on matching of two texts with the same meaning. Information Processing Society of Japan, WGNL 142-18. (in Japanese).
Masaki Murata and Hitoshi Isahara. 2001b. Universal model for paraphrasing: using transformation based on a defined criteria. In NLPRS'2001 Workshop on Automatic Paraphrasing: Theories