Introduction of a Probabilistic Language Model to Non-Factoid Question Answering Using Example Q&A Pairs

(1)

Question-AnsweringUsingExample Q&APairs

KyosukeYoshida,TaroUeda,MadokaIshioroshi,HideyukiShibuki,andTatsunoriMori

GraduateShoolofEnvironmentandInformationSienes,YokohamaNationalUniversity

79-7Tokiwadai,Hodogaya-ku,Yokohama240-8501,Japan

fkyoshida, kks, ishioroshi, shib, morigforest.eis.ynu.a.jp

Abstrat

Inthispaper,weproposeamethodwhih uti-lizes aprobabilisti language modelin non-fatoidtypequestion-answeringsystemin or-dertoimproveitsauray. Themodelis a mixtureprobabilistilanguagemodelof part-of-speehandsurfaeexpressions. We intro-duedthemodelintotwosub-proesseswhih alulatesimilarityoftextsintermsofwriting style.Therstproessolletsexample ques-tionssimilartoasubmittedquestion.The se-ondonemeasures similarity betweenan an-swer andidate and example answers paired withtheolletedexamplequestions. Experi-mentalresultsshowedthattheaurayofthe systemwasimprovedbyintroduingthe pro-posedmethod.

1 Introdution

In reent years, the amount of data available on theWebisinreasing bygrowingomputer

perfor-maneand networktraf. Therefore,tehnologies

thatgiveus aesstoneessaryinformation inthe largeamountofdataarerequired. Oneofsuh teh-nologiesisquestion-answering(QA),whihisto ex-tratananswerforaquestionwritteninnatural

lan-guagefromsoure douments. Ingeneral,QA

sys-tems are ategorized into the following two types:

fatoidandnon-fatoid(Fukumoto,2007).Wefous

on the non-fatoid type QAin this paper. Table 1 showssome typial types ofnon-fatoidquestions. Theappropriatenessoftheanswerandidates is of-tenestimatedonthebasisoffollowingtwomeasures (Hanetal.,2006).

Measure1:Relevanetothetopiofthequestion, how relevant is the andidate tothe topi of thequestion?

Measure2:Appropriatenessofwritingstyle, howwelldoestheandidatesatisfythewriting stylethatisappropriateforanswersofthelass ofthegivenquestion?

Here, by the term writing style, we refer to the styleofexpressions peuliar toa lassofquestions and their answers, asshown in Table 1. Although

these two measures depend on eah other tosome

extent,weassumethattheyareindependent inthis study.

Non-fatoidtypeQAsystemsareategorizedinto

the following two types aording to how to

han-dle Measure 2. The rst type lassies

submit-tedquestions intoseveralpredenedquestiontypes

suh as denition-type, why-type, how-type, and

so on, in order to separately handle eah type of questions by different methodologies. Han et al.

(2006) alulated the above-mentioned two

mea-sures fordenition-type questions based on proba-bilisti models built from orpora. The model for

Measure 1 isalulated from retrieved douments.

Themodel forMeasure 2isalulated froma

or-pus of denitions. However, this type of systems

hassome difulties as follows. Sine the lasses

of non-fatoid questions are not well dened, it is

difult to distinguish and dene all lasses

om-prehensively. Moreover, theauray ofaquestion

lassier affets the overall auray of

question-answering, beause mislassied questions are

in-orretlyrouted toanansweringmodule for differ-entlasses.

The seond type of systems handles submitted

questions based on a unied framework without

question lassiation. Mizuno et al. (2009)

pro-poseda methodthat isable toalulateMeasure 2

withoutlassiation ofquestions. Usingexample

(2)

Typeofquestion Examplesoftypialwritingstyle

Question Answer

Denition-type tte-nani(Whatis) towa dearu(is)

Why-type Naze(Why) tame(Beause)

How-type suru-niwadou-shitaraii(HowanIdo) suru-niwamazu (Inordertodo,)

Othertypes X-toY-nohigaiwanani X-wa-daga,Y-wa

(WhatisthedifferenebetweenXandY) (WhileXis,Yis)

of a given answer andidate is onsistent with the lassofa submittedquestion. Byusingthis lassi-er,Measure2isrealized withoutquestion lassi-ation. Soriutetal. (2006)alsoproposed asystem

without question lassiation. They introdued a

statistial translation model between questions and theorrespondinganswersinordertobridgethe lex-ial gapbetween thequestions and theanswers. A set of example Q&Apairs from FAQ sites on the Webisusedfortheestimationofthemodel.

In these methods, the length of answers should bepredetermined. Thelengthofanswersannotbe hangeddynamiallyandisneessarytobeestimate fromthelengthofthequestion.

Therefore, Morietal.(2008)proposed amethod of the seond type approah that is able to adap-tively determine the lengthofan answer andidate aordingtoasubmittedquestion. Theyuse exam-pleQ&Apairsona Q&Aommunitysiteinorder tondappropriatewritingstylestoanswersfor sub-mittedquestions. Theyutilizesimplen-grammodel asfeatures toretrieve examplequestions similar to a submitted question in terms of writing style and tond appropriate writingstyles to answer. How-ever, thesimple n-grammodelisnotappropriateto

modeldependeny among words thatappear inthe

distant positionsbeause it onlyaptures linguisti phenomenathatappearwithinthen-wordswindow. Therefore,sometimestheseletionofexample ques-tionsisnotarried outorretly. Thereexist some inorretly retrieved examplequestions thatarenot similartothewholesubmittedquestionintermsof writingstyle,whilethosen-gramshappentobevery similartothen-gramsofthequestion. Theirmethod ofsoringanswerandidateisbasedonanaive

fre-queny model of word 2-grams as feature

expres-sions. Therefore, ungrammatial sentenes, whih

oftenappearinWebdoumentsandarenotsuitable

to answer andidates, happen to have high sores

whentheyhavethefeatureexpressions. Itderease

theaurayofthesystem.

Inthispaper,weemploythemethodofMorietal. (2008)asabaseline method. Weintrodue a prob-abilisti language model tothe baseline method in ordertosolve theabove problems and improve the

methodintermsofauray.

Ourmethodhasthefollowingthreefeatureparts. Firstly,aprobabilistilanguagemodelisusedto re-trieve examples similar to an submitted question. Seondly, another probabilisti language model is

onstruted from the retrieved example answers,

whihisusedtomeasuretheappropriatenessof an-swer andidates for submitted questions. Finally,

the answer andidates are lustered into several

groups,andtheandidatesthathaveunsuitable writ-ingstylesasanswersforthesubmittedquestionare removed.

Therest ofthis paperisorganized asfollows. In Setion2, weexplaintherelatedworks. InSetion 3,weexplaintheoutlineofthebaselinemethod. In Setion4, we disuss theproblems of thebaseline method. InSetion5, wedesribe thedetailofthe

proposed method. InSetion6, weondut

exam-inations ofourQAsystem, and disuss theresults. InSetion7,weprovideouronlusion.

2 RelatedWorks

The methods whih utilize probabilisti language

(3)

(4)

(5)

(6)

(7)

Reduce the number

of example questions

by using 7-gram

Q

A

Q

A

Q

A

Q

A

Q

A

Q

A

Q

A

Q

A

Q

A

Q

A

Q

A

Q

A

Input Question

Qin

Retrieving

1500 example

questions

Select the combination

of clusters whose

language model gives

the highest generation

probability to Qin

Selected

Combination

of Clusters

Retrieved question

examples along with

their paired answer

examples

Clustering

A

Q

A

Q

A

Q

A

Q&A Corpus

Figure4: Retrievingquestionexamples(alongwith an-swerexamples)similartoansubmittedquestioninterms ofwritingstyle

plequestions retrieved inthis step threetimes oftargetnumber,inourexperiment.

3. Applyalusteringalgorithmtoexample ques-tionsextrated intheabovestep 2,and obtain severallusters.

4. ObtainombinationsoflustersreatedinStep

3. Generate a probabilisti language model

fromthe example questions ineah

ombina-tionoflusters.Calulatethegeneration proba-bilityofthesubmittedquestionforeahmodel. Obtaintheombination oflusterswhose lan-guagemodelgivesthehighestprobabilitytothe submittedquestion.

The reason why we divide examples into some

lustersistoshortentheproessing time ompared toalulating forall sebsets ofexample questions. TheoutlineofthisproessingisshowninFigure4.

5.2.1 ClusteringExampleQ&APairs

AsdesribedlaterinSetion5.3,wenallyneed toobtainexampleanswerspairedwiththeexample questionsthataresimilartothesubmittedquestion.

The lustering proess desribed above is for not

only example questions but also example answers, namely,forexampleQ&Apairs. Inorderto alu-late similarity between sentenes for lustering by

taking aount of word o-ourrene in distane

positionsofa sentene, weuseword skip2-grams assentene features for lustering. A skip 2-gram isanypairofwordsintheirsentene order. Itmay havesomegapsbetweentwowords. Bothquestion

examples and answerexamples are generalized for lustering (notfor obtaining probabilisti language models) as follows: 1) the following surfae ex-pressionsareusedastheyare: thefuntionalwords (e.g.interrogativespartilesandauxiliaryverbs)and somepredeterminedontentwordsdesribedbelow, 2) the other words are replaed with their

part-of-speeh tags. The predetermined ontext words

in-ludes a) words that tend to express the fous of question(e.g.riyuu(reason),houhou(method), imi(meaning),higai(differene)),andb)verbs andadjetives that frequentlyappearinorpus. As the words expressing the fouses ofquestions, we olletnounsXthatfrequentlyappearinthe follow-ingontexts oforpus: ...X-wanan-desuka (What isX of ...), ... X-wo oshiete (Tell me Xof ...), andsoon.

There are, at least, following three hoies for

similarity alulation when we luster example

questionsandanswersintosomelusters.

Similarity1

SimilaritybetweenQ&Apairsintermsofskip 2-grams. Wetakeaount ofboththequestion partand theanswerpartofaQ&Apair simul-taneously.

Similarity2

Similarity between example questions only in termsofskip2-grams.

Similarity3

Similarity between example answers only in

termsofskip2-grams.

Inthe alulation of Similarity 1, we alulate the similarity of the question parts and that of the an-swerpartsseparately, then mix thevalues into one similarity,beause thefeatureexpressions fromthe answerpartsshouldbetreatedindependentofthose ofthequestionparts,andvieversa.Asthe luster-ingalgorithm,weemployedthek-meansmethod.

5.2.2 ObtaintheOptimalCombinationsof

Clusters

We employed a simple hill limbing method to

retrieve the optimal ombination of lusters whose language model of question parts gives the maxi-mal generation probability to the submitted ques-tion.WeuseEquation(5)toalulatethegeneration probabilityandtheombinationisgreedilysearhed throughthefollowingsteps.

1. LetthelustersetCLbethegivenlusterset, andlettheandidatesetCAbeanemptyset.

(8)

(9)

Proposedmethod (=0:7)

Baseline (=0:5)

TypeofQuestion MRR

Numberof

CorretResponse MRR

Numberof

CorretResponse MRR

Numberof

CorretResponse MRR

Numberof CorretResponse

Denition 0.458 6/10 0.483 6/10 0.500 6/10 0.425 6/10

Why 0.332 9/17 0.345 8/17 0.345 9/17 0.240 6/17

How 0.400 3/3 0.611 3/3 0.511 3/3 0.111 1/3

Other 0.527 13/20 0.543 14/20 0.502 14/20 0.412 14/20

All 0.439 31/50 0.464 31/50 0.437 32/50 0.338 27/50

ofMeasure1andMeasure2inEquation(7)and2) thesimilarity alulation methods inthelustering desribedinSetion5.2.

6.1 ExperimentalSettings

Asthequestionset,weusethelatterhalfofJapanese question set of NTCIR-6 QAC formal run test set (Fukumotoetal.,2007).

AsaWebsearhengineforinformationsoureof

QA,we adopted Yahoo! Japan API

1

. With regard toQ&Aexamples,weusedaorpus of0.9 million Q&ApairsthatomesfromYahoo! Chiebukuro, whih isa Q&A ommunity site and the Japanese

version of Yahoo! answers. Let the

parame-ter target numberdesribed inSetion 5.2 be 500.

The systems output ve answers for eah

submit-tedquestioninthedesendingorderofsore. Judg-mentwhether ananswer andidateisorret ornot isperformedby one assessor. Theassessor judged

an output answer andidate orret, when the

an-didate inludes orret answer for the question as

itspart. WeuseMeanReiproal Rank(MRR

2 )as theevaluation metris.InadditiontoMRR,wealso investigate the number of the questions for whih the system anreturn, at least, one orret answer inthetopveanswerandidates(numberoforret responses,hereafter).

6.2 ExperimentalResults

ExperimentalresultsareshownintheTable2,3,and 4.

Withregardtothebaselinemethod,weemployed 0.5fortheparameter,beauseitgivesthebest

per-formane in terms ofMRR.On the other hand, as

forthe proposed method, the results areshown for the three settings, =0.7,0.8,and 0.9, whih give thebetterperformanethanothersettings.

1

http://developer.yahoo.o.jp/ 2

ReiproalRank(RR)istheinverseoftherankoftherst orretanswerandidate.MRRistheaverageofRRsoverthe questionset.

Although the proposed method and the baseline

methoddo notperform anyquestion lassiation,

theresults areshownonatype-by-type basisin or-dertoinvestigatetheeffetivenessofthemethodfor eahtypialquestiontypedesribedinTable1.

6.3 Disussion

All of Table 2, 3, and 4 show that the proposed methodoutperformsthebaselinemethod.

With regard to the numberof orret responses,

theproposed method givesmore orret responses

thanthebaselinemethodexeptfortheaseofuse ofSimilarity1(usingbothquestionpartandanswer part)and =0:7.

WithregardtoMRR,theproposed methodgives betterperformanethanthebaselinemethodfornot onlythe average of allquestions but also the aver-age of eah type of question. One of the reasons for thegood performane may be the fat that the proposemethodanappropriatelylterout ungram-matialexpressions inanswerandidates,whilethe baselinemethodsometimesemploythemasanswer response. It means that the introdued probabilis-tilanguage modelontributetoremoving

ungram-matial text fromanswer andidates. Another one

ofthereasonsforthegoodperformanemaybethe

fatthat theproposed methodanredue the

num-berofexampleQ&Apairswhihinludeunsuitable expressions for answers of the submitted question when the system retrieves example Q&Apairs. It means that more example answers suitable to the submitted questionan be retrieved byintroduing thelusteringandtheprobabilistiestimationtothe proessofretrievingexamplequestions,andasa re-sult,byreningthelanguagemodelofanswers.The followingshowsanexampleforwhihthebaseline methodannotgiveorretanswer,buttheproposed

(10)

Question(submitted)

WhatisrequiredtoeffetuatetheKyoto Proto-ol?(OriginallyinJapanese)

Answer(Baseline)

After deposit of instrument of ratiation of

Kyoto protool by the Russian goverment, a

ondition for ratiation is satised, it is ef-fetuated onFebruary16, 2005. (Originally in Japanese)

Answer(Proposedmethod)

In order to effetuate the Kyoto Protool, the ratiationbymorethan50signatoryountries and ontrieswhosearbon-dioxide emissionis

more than 55% of advaned industrial

oun-tries'areneeded.(OriginallyinJapanese)

Withregardtothemethodsof similarity alula-tioninlustering exampleQ&Apairs,Similarity 1 (usingbothquestionpartandanswerpart)generally givesbetterperformanethanothersimilarity alu-lationmethods intermsofboththenumberof

or-retresponse andMRR.Thefollowingreason may

besupposed.

The features from question parts of retrieved Q&Aexamplesseemnottobesuitablefor lus-teringtheQ&Aexamples beause the writing stylesofquestionpartsareverysimilartoeah other on aount of the method for retrieving Q&Aexamples. Inorder to retrieve example questionssimilartothesubmittedquestion,we

use the 7-gram in eah question part whose

enterwordisaninterrogative.

Sineanswerpartshavelonger textthan ques-tion parts in Q&A examples and are onse-quently desribed invarious writing styles, it

may be possible to nd subgroups of answer

parts aording to the variations of writing

styles.

Forthesereasons,theuseofanswerpartsofQ&A examples ismore efientforlustering the exam-ples. Although thereisnosigniantdifferene be-tween Similarity 1 and Similarity 3 (using answer partonly)asshowninTable2and4,thesystemwith Similarity1 (=0.9)stably outperforms thesystem withSimilarity 3interms ofthenumberof orret response. Moreover,MRRofthesystemwith Simi-larity1,0.482,isalmostthesameasthebest perfor-mane,0.483,amongallsettings.

7 Conlusion

Inthis study, we poposed a method to introdue a probabilistilanguagemodelintonon-fatoid ques-tionanswering inorder toimprovethe auray ot thesystemproposedbyMorietal.(2008)

Weintrodued themodelintotwosub-proesses

whihalulate similarityintermsof writingstyle. Therstproessolletsexamplequestionssimilar toansubmittedquestion. Theseondonemeasures similarity between an answer andidate and exam-pleanswerspairedwiththeolletedexample ques-tions.Theexperimentalresultsshowedthatthe sys-temwiththepropose methodoutperforms the base-linesystem.

Referenes

Fukumoto,J. 2007. Question AnsweringSystem for Non-fatoidType Questions and Automati Evalua-tion basedonBE Method. Proeedingsofthe Sixth NTCIRWorkshopMeeting,Tokyo,Japan,441447. Fukumoto,J.,T.Kato,F.MasuiandT.Mori. 2007. An

Overviewof the 4th Question AnsweringChallenge (QAC-4)atNTCIRWorkshop6. Proeedingsofthe SixthNTCIRWorkshopMeeting,433440.

Han,K.-S.,Y.-I.SongandH.-C.Rim.2006. Probabilis-ti modelfor denitional question answering. Pro-eedingsofthe29thannualinternationalACMSIGIR onfereneonResearhanddevelopmentin informa-tionretrieval,SIGIR'06,NewYork,212219. Heie,M.H.,E.W.D.WhittakerandS.Furui. 2012.

Ques-tion answeringusing statistial language modelling. ComputerSpeehandLanguage26,193209. Mizuno,J.,T.Akiba,A.FujiiandK.Itou. 2009.

Non-fatoidQuestionAnsweringExperimentsat

NTCIR-6:Towards Answer Type Detetion for Realworld

Questions.ProeedingsoftheSixthNTCIRWorkshop, 487492.

Mori,T.,M.SatoandM.Ishioroshi. 2008. Answering anylass ofJapanese non-fatoidquestionby using theWebandexampleQ&ApairsfromasoialQ&A website. Proeedingsofthe2008IEEE/WIC/ACM In-ternationalConfereneonWebIntelligeneand Intel-ligentAgentTehnology,5965.

Soriut, R., T.Akiba and E. Brill. 2006. Automati QuestionAnsweringUsingtheWeb:Beyondthe Fa-toid. JournalofInformationRetrieval-SpeialIssue onWebInformationRetrieval,vol.9,191206.

Takahashi, A., A. Takatsu and J. Adahi. 2010.