Scaling Question Answering to the Web

(1)

Scaling Question Answering to the Web

Cody C. T. Kwok

University of Washington

Seattle, WA, USA

ctkwok@cs.

washington.edu

Oren Etzioni

University of Washington

Seattle, WA, USA

etzioni@cs.

washington.edu

Daniel S. Weld

University of Washington

Seattle, WA, USA

[email protected]

ABSTRACT

Thewealthof informationonthe webmakesitan at-tractive resource for seeking quick answers to simple, factualquestionssuchas\whowastherstAmericanin space?" or\whatisthesecondtallestmountaininthe world?" Yettoday'smostadvancedwebsearchservices (e.g., Google and AskJeeves) make it surprisingly te-dioustolocateanswerstosuchquestions. Inthispaper, weextendquestion-answeringtechniques, rststudied inthe information retrievalliterature, to the web and experimentallyevaluatetheirperformance.

First we introduce Mulder, which we believe to bethe rstgeneral-purpose, fully-automated question-answeringsystemavailableontheweb. Second,we de-scribeMulder'sarchitecture,whichreliesonmultiple search-enginequeries, natural-language parsing,and a novelvotingproceduretoyieldreliableanswerscoupled with high recall. Finally, we compare Mulder's per-formancetothatofGoogleandAskJeevesonquestions drawnfrom theTREC-8 questiontrack. Wend that Mulder'srecall is more thana factorof threehigher thanthatofAskJeeves. Inaddition,wendthatGoogle requires 6.6 times as much user eort to achieve the samelevelofrecallasMulder.

1. INTRODUCTION

Whilewebsearchenginesanddirectories havemade important strides in recent years, the problem of ef-ciently locating information on the web is far from solved. This paper empirically investigates the prob-lem of scaling question-answering techniques, studied intheinformationretrievalandinformationextraction literature [1, 16], to the web. Can these techniques, previouslyexploredinthecontextofmuchsmallerand more carefully structured corpora, be extended to the web? Will a question-answering system yield benet

Copyright is held by the author/owner.

WWW10, May 1-5, 2001, Hong Kong.

ACM 1-58113-348-0/01/0005.

whencompared tocutting-edgecommercial search en-ginessuchasGoogle

1 ?

Ourfocusisonfactualquestionssuchas\whatisthe secondtallestmountainintheworld?" or\whowasthe rstAmericaninspace?"Whilethisspeciesofquestions represents only a limited subset of the queries posed to web search engines, it presents the opportunity to substantiallyreducetheamountofusereortrequired to nd the answer. Instead of forcingthe userto sift throughalistsearchengine\snippets"andreadthrough multiple web pages potentially containing the answer, ourgoal is todevelop asystem that conciselyanswers \K-2" or \AlanShepard" (along withthe appropriate justication)inresponsetotheabovequestions.

As a step towards this goal, this paper intro-duces the Mulder web question-answering system. We believe that Mulder is the rst general-purpose, fully-automatedquestion-answeringsystemavailableon the web.

2

The commercial search engine known as AskJeeves

3

respondstonatural-languagequestionsbut itsrecallisverylimited(seeSection4)suggestingthat itisnotfullyautomated.

To achieve its broad scope, Mulderautomatically submitsmultiplequeriesontheuser'sbehalftoasearch engine (Google) and extracts itsanswers from the en-gine's output. Thus, Mulderis an information car-nivore inthe sensedescribedin[13]. Mulderutilizes severalnatural-languageparsersandheuristicsinorder toreturnhigh-qualityanswers.

The remainderofthis paperis organized asfollows. Section2providesbackgroundinformationon question-answering systemsand considersthe requirementsfor scalingthis classofsystemsto the web. Section3 de-scribes Mulderand itsarchitecture. Section 4 com-paresMulder'sperformanceexperimentallywiththat of AskJeeves and Google, and also analyzes the con-tribution of each of Mulder's keycomponents to its performance. Weconcludewithadiscussionofrelated andfuturework.

2. BACKGROUND

Aquestionanswering(QA)systemprovidesdirect an-swerstouserquestionsbyconsultingitsknowledgebase. Sincetheearlydaysofarticialintelligenceinthe60's, 1

http://www.google.com 2

Seehttp://mulder.cx foraprototype 3

(2)

language questions. However, the diÆculty of natural languageprocessing(NLP)haslimitedthescopeofQA todomain-specicexpertsystems.

Inrecent years,the combinationofwebgrowth, im-provementsininformationtechnology,andtheexplosive demandfor betterinformationaccesshasreignitedthe interest inQA systems. The availability of huge doc-umentcollections (e.g.,the webitself), combinedwith improvements in information retrieval (IR) and NLP techniques,has attractedthe developmentofa special classofQAsystemsthatanswersnaturallanguage ques-tionsbyconsultingarepositoryofdocuments[16]. The Holy Grail for these systems is, of course, answering questionsovertheweb. AQAsystemutilizingthis re-sourcehasthe potentialtoanswerquestionsofawide varietyoftopics,andwillconstantlybekeptup-to-date withthewebitself. Therefore,it makessensetobuild QAsystemsthatcanscaleuptotheweb.

Inthe following sections,welookat theanatomy of QA systems, and what challenges the web poses for them. Wethendene whatisrequiredofaweb-based QAsystem.

2.1 Components of a QA system

AnautomatedQAsystembasedonadocument col-lectiontypicallyhasthreemaincomponents. Therst is a retrieval enginethat sits on top of the document collection and handles retrieval requests. In the con-text of web, this is a search engine that indexes web pages. The second is a query formulation mechanism that translates natural-languagequestionsinto queries fortheIRengineinordertoretrieverelevantdocuments fromthecollection,i.e.,documentsthatcanpotentially answerthequestion. Thethirdcomponent,answer ex-traction,analyzesthesedocumentsandextractsanswers fromthem.

2.2 Challenges of the Web

Although the web is full of information, nding the facts sometimes resembles picking needles from a haystack. Thefollowing are specic challenges of the webtoanautomatedQAsystem:

Forming the right queries. Givenaquestion, translatingit into asearchengine queryis nota trivialtask. Ifthequeryistoogeneral,toomany documentsmayberetrievedandthesystemwould nothaveenough resourcesto scanthroughall of them. If the query is too specic, nopages are retrieved. Gettingtherightsetofpagesiscrucial indealingwiththeweb.

Noise.Sometimesevenwiththerightsetof key-words,thesystemmaystillretrievealotofpages that talk about something else. For example, weissuedthequeryfirst american in spaceto Google,andreceivedresultsabouttherst Amer-ican woman in space and the rst American to votefromspace. TheQAsystemhastobe toler-ant ofsuch noiseor it will becomeconfused and giveincorrectanswers.

Factoids. By far the worst scenario for a QA systemisndingfalsehoods. AdvancedNLP

tech-90% of the time, but if the knowledge is incor-rect then the answer is still invalid. Sometimes suchfactoidsarehumanerrorsorlies,butnot al-ways. For example, oneof the sites returnedby Google stated \John Glenn was the rst Ameri-caninspace",whichisfalse. However,thesiteis labeled\CommonmisconceptionsinAstronomy". Such sentences can cause a lot of headaches for QAsystems.

Resourcelimitations. Searchengineshavebeen improvingsteadilyandarefasterthanever,butit is still very taxing to issue a few dozen queries in orderto answereachquestion. A QA system hastomakeintelligentdecisionsonwhichqueries to try instead of blindly issuingcombinations of words. Online interactive QA systems also have to considerthetimespent bythe userinwaiting for ananswer. Veryfewusersare willingtowait morethanafewminutes.

Inthenextsection,wedescribeafullyimplemented prototypewebQAsystem,Mulder,whichtakesthese factorsintoconsideration.

3. THE MULDER WEB QA SYSTEM

Figure1showsthearchitectureofMulder. Theuser beginsbyaskingthequestioninnaturallanguagefrom Mulder'swebinterface. Mulderthenparsesthe ques-tionusinganaturallanguageparser(section3.1),which constructs a tree of the question's phrasal structure. The parse tree is given to the classier (section 3.2) whichdeterminesthe type ofanswertoexpect. Next, the query formulator (section 3.3) uses the parse tree to translatethe questionintoaseries of search engine queries.Thesequeriesareissuedinparalleltothesearch engine(section 3.4), whichfetches web pagesfor each query. Theanswer extractionmodule (section3.5) ex-tractsrelevantsnippetscalledsummariesfromtheweb pages,andgeneratesalistofpossiblecandidateanswers fromthesesnippets. Thesecandidatesaregiventothe answer selector (section 3.6) for scoring and ranking; the sortedlist of answers (presented inthe context of theirrespectivesummaries)isdisplayedtotheuser.

Thefollowingsectionsdescribeeachcomponentin de-tail.

3.1 Question Parsing

Incontrastto traditionalsearch engineswhichtreat queriesasasetofkeywords,Mulderparsestheuser's question to determine its syntactic structure. As we shallsee, knowingthe structureofthe questionallows Mulderto formulate the question into several dier-entkeywordquerieswhichare submittedtothe search engineinparallel, thusincreasingrecall. Furthermore, Mulderusesthequestion'ssyntacticstructureto bet-ter extractplausible answers from the pages returned bythesearchengine.

Natural language parsing is a mature eld with a decades-longhistory,sowechosetoadoptthebest ex-istingparsingtechnologyratherthanrollourown. T o-day'sbestparsersemploystatisticaltechniques[12,8]. Theseparsersaretrainedonataggedcorpus,andthey

(3)

user

Search

Engine

?

Question Classification

Link

Parser

Classification

Rules

+

Query Formulation

Query

Formulation

Rules

(only rules applicable

to the question will

be triggered)

WordNet

Transformational

Grammar

Quote Noun

Phrases

PC-Kimmo

web

pages

?

search

engine

queries

parse

tree

parse

tree

links

answer

type

Answer Extraction

Find

phrases

with

Matching

Type

Summary

Extraction

+

Scoring

summaries

parse

trees

Answer Selection (Voting)

Final

Ballot

Scoring

answers

?

Clustering

?

clusters

?

candidate

answers

?

PC-Kimmo

MEI

Parser

+

Final

answer

Question

?

Original

Question

IDF

score

database

NLP

Parser

NLP

Parser

NLP

Parser

NLP

Parser

NLP

Parser

Legend

rectangles - components implemented by us

rounded rectangles - third party components

Figure 1: Architecture of Mulder

uselearnedprobabilitiesofwordrelationshipstoguide thesearchfor thebestparse. TheMaximum Entropy-Inspired(MEI)parser[9]isoneofthebestofthisbreed, and soweintegratedit into Mulder. Theparserwas trainedonthePennWall StreetJournal tree-bank[21] sections2 21.

Ouroverallexperiencewiththeparserhasbeen pos-itive, but initially we discovered two problems. First, duetoitslimitedtrainingsettheparserhasarestricted vocabulary. Second, when the parser does not know a word, it attemptsto make educatedguesses for the word'spart-of-speech. This increases itssearchspace, andslows downparsing.

4

Inaddition,theguessescan be quite bad. In one instance, the parser tagged the word\tungsten"asacardinalnumber. Toremedyboth oftheseproblems,weintegratedalexicalanalyzercalled Pc-Kimmo[3]intotheMEIparser. Lexicalanalysis al-lows Muldertotagpreviouslyunseenwords. For ex-ample,Mulderdoesnotknowtheword\neurological", butby breakingdownthewordinto theroot \neurol-ogy"and the suÆx\-al", it knows the word is an ad-jective. When the MEIparser doesnotknowa word, MulderusesPc-Kimmotorecognizeandtagthe part-of-speechoftheword. IfPc-Kimmoalsodoesnotknow the word either, weadoptthe heuristic of tagging the wordasapropernoun.

4

Whilespeedisnotaconcernforshortquestions,longer questionscantakeafewsecondstoparse. Wealsouse theparserextensivelyintheanswerextractionphaseto parselongersentences,wherespeedisaveryimportant issue.

Mulder'scombinationofMEI/Pc-Kimmoandnoun defaulting is able to parse all Trec-8 questions cor-rectly.

3.2 Question Classifier

QuestionclassicationallowsMuldertonarrowthe number of candidate answers during the extraction phase.Ourclassierrecognizesthreetypesofquestions: nominal,numerical andtemporal,whichcorrespondto nounphrase,number,anddateanswers. Thissimple hi-erarchymakesclassicationaneasytaskandoersthe potentialofveryhighaccuracy. Italsoavoidsthe diÆ-cultprocessofdesigning acompletequestion ontology withmanually-tunedanswerrecognizers.

Most natural questions contain a wh-phrase, which consists of the interrogative word and the subsequent wordsassociatedwithit. TheMEIparserdistinguishes fourmaintypesofwh-phrases,andMuldermakesuse of thistype informationto classify thequestion. Wh-adjectivephrases,suchas\howmany"and\howtall", containanadjectiveinadditiontothequestionwords. Thesequestionsinquireaboutthedegreeofsomething or someevent, andMulderclassiesthem as numer-ical. Wh-adverb phrases begin with single question words suchas \where" or \when". We use the inter-rogative word to nd the question type, e.g., \when" istemporaland\where"isnominal. Wh-nounphrases usuallybeginwith\what", andthequestionscanlook for any type of answer. For example, \what height" looksforanumber,\whatyear"adate,and\whatcar" anoun. Weuse theobjectof theverb inthe question

(4)

Mulder uses the Link parser [14]. The Link parser isdierentfrom theMEIparserinthatit outputsthe sentenceparseintheformoftherelationshipsbetween words (calledlinks) instead ofa tree structure. Mul-derusestheverb-objectlinkstondtheobjectinthe question. MulderthenconsultsWordNet[22]to deter-minethetypeoftheobject. WordNetisasemantic net-workcontainingwordsgroupedintosetscalledsynsets. Synsetsare linkedto eachother bydierent relations, suchassynonyms,hypernymsandmeronyms

5

To clas-sifytheobjectword,wetraversethehypernymsofthe objectuntilwendthesynsetsof\measure"or\time". Intheformercase,weclassifythe questionas numeri-cal,andinthelattercasetemporal. Sometimesaword canhavemultiplesenses,andeachsensemayleadtoa dierent classication, e.g., \capital" can be a city or asumofmoney. Insuchcasesweassumethequestion is nominal. Finally, ifthe questiondoesnot containa wh-phrase(e.g.,\NamealmthathaswontheGolden Bearaward:::"),weclassifyitasnominal.

3.3 Query Formulation

The query formulation module converts the ques-tion into a set of keyword queries that will be sent to the search engine for parallel evaluation. Mul-der has a good chanceof success if it canguess how the answer may appear in a target sentence, i.e. a sentence that contains the answer. We believe this strategy works especially well for questions related to common facts, because these facts are replicated across the web on many dierent sites, which makes it highly probable that some web page will contain our formulated phrase. To illustrate, one can nd theanswerto ourAmerican-in-spacequestion by issu-ingthephrasequery\"The first American in space was"".

6

When given this query, Google returns 34 pages, and most of them say \The first American in space was Alan Shepard,"withvariationsof Shep-ard'sname.

AsecondinsightbehindMulder'squeryformulation methodistheideaofvaryingthespecicityofthequery. Themostgeneralquerycontainsonlythemost impor-tantkeywordsfromtheuser'squestion,whilethemost specic queries are quoted partial sentences. Longer phrases placeadditional constraints onthe the query, increasingprecision,butreducingthenumberofpages returned. Ifthe constraintis too great, thennopages will bereturnedat all. A prioriit is diÆculttoknow theoptimalpointonthegenerality/specicitytradeo, henceourdecisiontotrymultiplepointsinparallel;ifa longphrasequeryfailstoreturn,Muldercanfallback onthemorerelaxedqueries. Muldercurrently imple-mentsthefollowingreformulationstrategies,inorderof generaltospecic:

Verb Conversion. For questionswith an aux-illiary do-verb and a main verb, the target sen-5

Ahypernymisawordwhosemeaningdenotesa super-ordinate,e.g.,animalisahypernymofdog. Ameronym isaconceptthatisapartofanotherconcept,e.g.,knob isameronymofdoor.

6

The actual query is \"+the first American +in space +was"",followingGoogle'sstopwordsyntax.

formratherthanseparateverbs. Forexample,in \When did Nixon visit China?", we expect the target sentenceto be \Nixon visited China :::" ratherthan\NixondidvisitChina:::". Toform theverb\visited",weusePc-Kimmo'sword syn-thesis feature. It generates words by combining the innitive ofa word withdierent suÆxes or prexes.

7

Mulderformsaquerywiththe auxil-liaryandthemainverbreplacedbytheconjugated verb.

Query Expansion. Extendingthequery vocab-ulary via a form of query expansion [5, 29] im-proves Mulder's chances of nding pages with ananswer. For example,withwh-adjective ques-tions suchas\HowtallisMt. Everest?",the an-swersmayoccurinsentencessuchas\theheightof Everestis:::".Forthesequestions,weuse Word-Net [22] to nd the attribute noun of the adjec-tives. Muldercomposesanewquery by replac-ing the adjective with its attribute noun. Tight boundsareemployedtoensurethatMulderwill notoodthesearchenginewithtoomanyqueries.

Noun PhraseFormation. Manynounphrases, such as proper names, are atomic and should not be broken down. Mulder gets more rel-evant web pages if it quotes these entities in its query. For example, if one wants to nd web pages on question-answering research, the query \"question answering"" performs better than the unquoted query. Mulder implements this ideabycomposingaquerywithquoted non-recursivenounphrases,i.e.,nounphrasesthatdo notcontainothernounphrases.

Transformation. Transforming the question intoequivalentassertions(asillustratedinthe be-ginning ofthis section) is apowerful methodfor increasing thechanceof gettingtargetsentences. TransformationalGrammar[11,2]allowsMulder to dothisinaprincipledmanner. Thegrammar denes conversion rules applicable to a sentence, andtheireectsonthesemanticsofthesentence. Thefollowingaresomesampletransformationswe haveimplementedinMulder:

1. Subject-Aux Movements. For ques-tions with a single auxilliary and a trail-ing phrase such as the American-in-space question, Muldercan removethe interrog-ative and formtwo queries with the phrase placed in front and after the auxilliary, respectively, i.e.\"was the first American in space"" and \"the first American in space was"".

2. Subject-Verb Movements. If there is a single verb in the question, Mulder forms a query by stripping the interrogative, e.g., from \who shot JFK?" Mulder creates \"shot JFK"".

7

Irregular verbs are handled specially by Pc-Kimmo, butareonlyslightlytrickierthantheregulars.

(5)

ordertohandlecomplexsentences,buttheessence isthesame.

Sinceonlyasubsetofthesetechniquesisapplicableto eachquestion, for eachuserquestion Mulder'squery formulation module issues approximately 4 search en-gine queriesonaverage. Wethink thisis areasonable load on the search engine. Section 4.4 details exper-iments that show that reformulation signicantly im-provesMulder'sperformance.

3.4 Search Engine

SinceMulderdependsonthewebasitsknowledge base,thechoiceofsearchengineessentiallydetermines our scope of knowledge, and the methodof retrieving thatknowledge. Theidiosyncrasiesofthesearchengine alsoaectmorethanjustthesyntaxofthequeries. For example,withBooleansearchenginesonecouldissuea queryA AND ( B OR C ),butforsearchengineswhich do not support disjunction, one must decompose the originalqueryintotwo,A BandA C,inordertogetthe samesetofpages.

We considered a few search engine candidates, but eventually chose Google. Google has two overwhelm-ing advantages over others: it has the widest cover-ageamongsearchengines(asof October2000, Google hasindexed1.2 billion webpages), and itspage rank-ing function is unrivaled. Wider coverage means that Mulder has a larger knowledge base and access to more information. Moreover, with a larger collection we have a higher probability of nding target sen-tences. Google'srankingfunctionhelpsMulderintwo ways. First,Google'srankingfunctionisbasedonthe PageRank [17] and the Hits algorithms [19], which use random walk and linkanalysis techniques respec-tivelytodeterminesiteswithhigherinformationvalue. ThisprovidesMulderwithaselectionofhigherquality pages. Second,Google's rankingfunctionisalsobased ontheproximityofwords,i.e.,ifthedocumenthas key-words closertogether, itwill be rankedhigher. While thishasnoeectonquerieswithlongphrases,ithelps signicantlyincaseswhenwehavetorelyonkeywords.

3.5 Answer Extraction

Theanswer extractionmodule is responsible for ac-quiringcandidateanswersfromtheretrievedwebpages. In Mulder, answer extraction is a two-step process. First,weextractsnippetsoftextthatarelikelyto con-tain the answer from the pages. These snippets are calledsummaries; Mulderranksthemand selectsthe N best. NextMulderparsesthesummariesusingthe MEIparserandobtainsphrasesoftheexpectedanswer type. For example, after Mulderhasretrievedpages from the American-in-space queries, the summary ex-tractorobtainsthesnippet\TherstAmericaninspace was Alan B. Shepard. He made a 15 minute subor-bitalight intheMercury capsuleFreedom7 onMay 5,1961." WhenMulderparsesthissnippet,itobtains thefollowingnounphrases,whichbecomecandidate an-swers: \TherstAmerican",\AlanB.Shepard", \sub-orbitalight"and\MercurycapsuleFreedom7".

We implemented a summary mechanism instead of extractinganswersdirectlyfromthe webpagebecause

tweenahalfsecondforshortsentencestothreeormore forlongersentences,makingitimpracticaltoparse ev-ery sentence on every web page retrieved. Moreover, regions that are not close to any query keywords are unlikelytocontaintheanswer.

Mulder's summary extractor operates by locating textualregionscontainingkeywordsfromthesearch en-ginequeries. Inordertoboundsubsequentparsertime and improve eÆciency, theseregions are limited to at most 40 words and are delimited by sentence bound-aries whenever possible. We tried limiting summaries to a single sentence, but this heuristic reduced recall signicantly. Thefollowingsnippetillustrateswhy mul-tiplesentencesareoftenuseful: \Manypeopleaskwho was the rst American in space. The answer is Alan Shepardwhosesuborbitalightmadehistoryin1961." After Mulder has extracted summaries from each webpage, itscores andranksthem. Thescoring func-tion prefers summaries that contain more important keywords whichare close toeachother. Todetermine theimportanceofakeyword,weuseacommonIR met-ric knownasinverse document frequency(IDF),which isdenedby

N df

,whereNisthesizeofadocument col-lection anddf is the numberof documentscontaining theword. Theideaisthatunimportantwords,suchas \the" and\is", are inalmost all documentswhile im-portantwordsoccursparingly. ToobtainIDFscoresfor words,wecollectedabout100,000documentsfrom var-iousonlineencyclopedias andcomputedtheIDFscore ofeverywordinthiscollection. Forunknownwords,we assumedf is 1. Tomeasurehowclose keywordsare to eachother,Muldercalculatesthesquare-root-meanof thedistancesbetweenkeywords. For example,suppose d1;:::;dn

1

arethedistancesbetweenthen keywords insummarys,thenthedistanceis

D(s)= q d 2 1 +:::+d 2 n 1 (n 1)

If s has n keywords, each with weight wi, then its scoreisgivenby:

S(s)= P n i=1 wi D(s)

Mulder selects the best N summaries and parses themusingtheMEIparser. Toreducetheoverallparse time,Mulderrunsmanyinstancesofthe MEIparser ondierentmachinessothatparsingcanbedonein par-allel. We alsocreated a quality switch for the parser, sothatMuldercantradequalityfor time. Theparse results are thenscannedfor theexpectedanswer type (noun phrases, numbers, or dates), and the candidate answersarecollected.

3.6 Answer Selection

The answer selection module picks the best answer amongthecandidateswithavotingprocedure. Mulder rstranks thecandidates according to how close they are to keywords. It thenperformsclustering togroup similaranswerstogether,andanalballotiscastforall clusters;theonewiththehighestscorewins. Thenal

(6)

ter. Forinstance,supposewehave3answercandidates, \AlanB.Shepard",\Shepard",and\JohnGlenn",each withscores2,1and2respectively. Theclustering will resultintwogroups, onewith\AlanB.Shepard" and \Shepard"andtheotherwith\JohnGlenn". The for-mer is scored3 and thelatter 2, thusMulderwould picktheShepardclusterasthewinnerandselect\Alan B.Shepard"asthenalanswer.

Mulder scores a candidate answer by its distance from the keywords in the neighborhood, weighted by how importantthosekeywords are. Let k

i

representa keyword,aiananswerwordandciawordthatdoesnot belong to either category. Furthermore,let wi be the weightofkeywordk

i

. Supposewehaveastringof con-secutivekeywordsontheleftside oftheanswer candi-datebeginningata1,separatedbymunrelatedwords, i.e., k 1 :::k n c 1 :::c m a 1 :::a p

, Mulder scores the an-swerby KL= w1+:::+wn m (theLinK L

standsfor\Left".) Wecomputeasimilar score, KR, with thesame formula, butwith keywords on the right side of the answer. The score for a can-didate is max(K

L ;K

R

). For certain typesof queries, we modify this score with multipliers. The answers for temporalandwh-adverb(when,where)queriesare likelytooccurinsideprepositional phrases,so weraise the scorefor answers that occur underaprepositional phrase. Some transformations expect the answers on a certain side of the query keywords (left or right). Forexample,thesearchenginequery\"was the first American in space"" expects the answer on the left sideofthisphrase,soMulderrejectspotentialanswers ontheright side.

Afterwehaveassigned scores toeachcandidate an-swer, wecluster theminto groups. Thisprocedure ef-fectsseveralcorrections:

Reducing noise. Random phrases that occur bychancearelikelytobeeliminatedfromfurther considerationbyclustering.

Allowing alternative answers. Manyanswers have alternative acceptable forms. For example, \Shepard", \Alan B. Shepard", \Alan Shepard" are all variations of the same name. Clustering collectsall of thesedissimilar entries together so thattheyhavecollectivebargainingpowerandget selectedasthenalanswer.

Separating facts from ction. The web con-tains a lot of misinformation. Mulderassumes thetruth will prevail and occur more often, and clusteringembedsthisideal.

Our clustering mechanism is a simpliedversion of suÆxtree clustering [31]. Muldersimplyassigns an-swersthatshare thesamewordsintothesamecluster. WhileacompleteimplementationofsuÆxtree cluster-ing that handles phrasesmay givebetter performance thanoursimplealgorithm,ourexperienceshows word-basedclusteringworksreasonablywell.

ing tothe sumof thescores of their memberanswers. The memberanswer withthe highest individual score ischosenastherepresentativeofeachcluster.

Thesortedlistofrankedclustersandtheir represen-tatives are displayed as the nal results to the user. Figure2showsMulder'sresponsetothe American-in-spacequestion.

4. EVALUATION

Inthissection,weevaluatetheperformanceof Mul-der by comparing it with two popular and highly-regardedwebservices,Google(theweb'spremiersearch engine) and AskJeeves (the web's premier QA sys-tem). In addition, we evaluate how dierent compo-nents contribute to Mulder's performance. Speci-cally, we studythe eectivenessof queryformulation, answerextractionandthevotingselectionmechanism.

AskJeeves allows users to ask questions in natural language. It looks up the user's question in its own databaseandreturnsalistofmatchingquestionswhich itknowshowtoanswer. Theuserselects themost ap-propriate entryinthe list, andis takentoa webpage where theanswerwill be found. Its databaseof ques-tionsappearstobeconstructedmanually,thereforethe correspondingwebpagesreliablyprovidethedesired an-swer. WealsocompareMulderwithGoogle. Google suppliesthe web pagesthat Mulder usesinour cur-rentimplementationandprovidesauseful baselinefor comparison.

We begin by dening various metrics of our experi-mentinsection4.1,theninsection4.2wedescribeour test question set. Insection 4.3, we comparethe per-formanceofMulderwiththetwowebservicesonthe Trec-8question set. Finally,insection 4.4we investi-gatetheeÆcacyofvarioussystemcomponents.

4.1 Experimental Methodology

Themaingoalofourexperimentsistoquantitatively assess Mulder's performance relative to the baseline providedby Google and AskJeeves. Traditional infor-mationretrievalsystemsuserecallandprecisionto mea-suresystemperformance. Supposewehaveadocument

Figure2: Mulderoutputforthequestion\Who wasthe rstAmerican inspace?"

(7)

is relevant to aquery Q. Theretrieval systemfetches asubsetF ofD for Q. Recallisdenedas

jF\Rj jRj and precision jF\Rj jFj

. In other words, recall measures how thorough the system is in locating R . Precision, on theother hand,measureshowmanyrelevant retrieved documentsare among the retrievedset, whichcan be viewed as a measure of the eort the userhas to ex-pendto ndthe relevant documents. For example, if precisionishighthen,all otherthingsbeingequal,the userspendslesstimesiftingthroughtheirrelevant doc-umentstondrelevantones.

Inquestionanswering,wecandenerecallasthe per-centage of questions answered correctly from the test set. Precision, onthe otherhand, is inappropriate to measureinthiscontextbecauseaquestionisanswered eithercorrectlyorincorrectly.

8

Nonetheless,wewould stillliketomeasurehowmuchusereortisrequiredto nd the answer, not merelywhatfraction of the time ananswerisavailable.

To measure user eort objectively across the dier-entsystems,weneedametricthatcapturesor approx-imates how much work it takes for the user to reach ananswer while readingthroughthe results presented by each system. While the true \cognitive load" for eachuserisdiÆculttocapture,anaturalcandidatefor measuring user eort is time; the longer it takes the user to reachthe answer, the moreeort is expended ontheuser's part. However, reading timedependson numerousfactors,suchashowfastapersonreads,how well the web pages are laid out, how experienced the user is with the subject, etc. In addition, obtaining data for reading time wouldrequire an extensiveuser study. Thus,inourrstinvestigation ofthefeasibility ofwebQAsystems,wechosetomeasureusereortby thenumberofwordstheuserhastoreadfromthestart of each system'sresultpageuntiltheylocate the rst correctanswer.

Naturally,ourmeasureofusereortisapproximate. Itdoesnottakeintoaccountpagelayout,users'ability toskiptextinsteadofreadingitsequentially,dierences inwordlength,etc. However,ithastwoimportant ad-vantages. First,themetriccanbecomputed automati-callywithoutrequiringanexpensiveuserstudy.Second, itprovidesacommonwaytomeasureusereortacross multiplesystemswithverydierentuserinterfaces.

We now dene our metricformally. The word dis-tancemetriccountshowmanywords ittakesan ideal-izedusertoreachtherstcorrectanswerstartingfrom thetopofthesystem'sresultpageandreading sequen-tially towards the bottom. Suppose s

1 ;s 2 ;:::;s n are thesummaryentriesintheresultspagereturnedbythe system,anddiisthewebpagelinkedtobysi. We de-nejtjto bethe numberofwords inthetext tbefore the answer. If tdoes not have the answer, thenjtj is 8

Considerasearchenginethatreturnsonehundred re-sults. Inonecase,thecorrectansweristhetop-ranked result and all otheranswers are incorrect. In the sec-ond case, the rst fty answers are incorrect and the last fty are correct. While precision is muchhigher inthesecondcase, usereortis minimizedintherst case. Unlikethecase ofcollectingrelevant documents, thevalue ofrepeatingthecorrectanswerisminimal.

indexofthe entrywiththeanswer, thuss a

andd a

are the rst snippet and document containing the answer respectively.jdaj=0iftheanswerisfoundinsa,since we donot needto view d

a if s

a

already contains the answer.

WenowdeneW(C ;P),theworddistanceoftherst correctanswerConaresultspageP,asfollows:

W(C ;P)= a X i=1 jsij ! +jdaj

NotethatourworddistancemetricfavorsGoogleand AskJeevesby assuming that the useris able to deter-minewhichdocumentd

i

containstheanswerbyreading snippetsexclusively,allowinghertoskipd1:::di 1. On the otherhand,it doesassume that peopleread snip-petsandpagessequentiallywithoutskippingor employ-inganykindof\speedreading"techniques. Overall,we feelthatthisafair,thoughapproximate,metricfor com-paringusereortacrossthethreesystemsinvestigated. Itisinteresting tonotethatanaveragepersonreads 280words per minute [27], thus we canuse word dis-tance toprovidea roughestimateof thetimeit takes tondtheanswer.

4.2 Data Set

For our test set, we use the Trec-8 question track whichconsists of 200questionsof varyingtype,topic, complexityanddiÆculty. Some examplequestionsare showninTable1.

In the original question track, each system is pro-videdwithapproximately500,000documentsfromthe Trec-8documentcollectionasitsknowledgebase,and isrequiredtosupplyarankedlistofcandidateanswers foreachquestion. Thescoregiventoeachparticipantis basedonwhereanswersoccurinitslistsofcandidates. The scoring procedure is done by trained individuals. Allthe questionsare guaranteed tohaveananswerin thecollection.

We chose Trec-8 as our test set because its ques-tions are verywell-selected, and it allows our work to be put into context with other systems which partic-ipated in the track. However, our experiments dier from the Trec-8 question track in four ways. First, Mulderis aQA system basedon the web, therefore wedonot usethe Trec-8document collectionas our knowledge base. Second, because we do not use the original document collection, it is possible that some questionswillnothaveanswers. Forexample,wespent about30minuteswithGooglelookingfortheanswerto \HowmuchdidMercuryspendonadvertisingin1993?" and couldnot ndananswer. Third, since we donot have the resourcesto score all of our test runs manu-ally,weconstructedalistofanswersbasedonTrec-8's relevance judgment le

9

and our manual eort. Fur-thermore,someofTrec-8'sanswershadtobeslightly modied or updated. For example, \How tall is Mt. Everest?" can be answered either in meters or feet; \Donald Kennedy" was Trec-8's answer to \Who is 9

http://trec.nist.gov/data/qre ls eng/ qrels.trec8.qa.gz

(8)

WhoisthePresidentofGhana? Rawlings What is the name of the rare neurological disease with symptoms such as:

involuntarymovements(tics),swearing,andincoherentvocalizations(grunts, shouts,etc.)?

Tourette'sSyndrome

Howfar isYaroslavlfromMoscow? 280miles Whatcountryisthebiggestproduceroftungsten? China

WhendidtheJurassicPeriodend? 144millionyearsago

Table 1: Some samplequestions fromTrec-8withanswers providedby Mulder.

thepresidentofStanfordUniversity",butitisGerhard Casperatthetimeofwriting.

10

Finally,whereas Trec-8 participantsare allowedto returna snippetofxed length text as the answer, Mulder returns the most preciseanswerwheneverpossible.

4.3 Comparison with Other Systems

Inour rst experiment, we compare the user eort requiredtondtheanswersinTrec-8byusing Mul-der, Google and AskJeeves. User eort is measured using word distanceas denedinsection 4.1. At each worddistancelevel,wecomputetherecallforeach sys-tem. The maximum user eort is bounded at 5000, whichrepresentsmorethan15minutesofreadingbya normalperson. Tocomparethesystemsquantitatively, wecomputetotaleortatrecallr%bysummingupthe usereortrequiredfrom0% recallupto r%. Figure3 showstheresultofthisexperiment.

The experiment shows that Mulder outperforms bothAskJeevesand Google. Mulderconsistently re-quires less user eort to reach an answer than either systemat everylevelofrecall. AskJeeveshasthe low-estlevelofrecall;itsmaximumisroughly20%. Google 10

The updated set of questions and answers used in our experiments is publicly available at http://longinus.cs.washington .edu /muld er/t rec8t .qa.

0

10

20

30

40

50

60

70

0

500 1000

1500

2000

2500

3000

3500

4000

4500

5000

%of questions answered (recall)

user effort (word distance)

Mulder

Google

AskJeevesJ

Figure 3: Recall vs. user eort: Mulder versus Google versus AskJeeves. On average, Google requires6.6 times thetotal eort of Mulder, to achievethesamelevelofrecall. AskJeeves' per-formanceissubstantiallyworse.

requires 6.6 times more total eort than Mulder at 64.5%recall(Google'smaximumrecall).

Mulder displayed an impressive ability to answer questions eectively. As gure 3 shows, 34% of the questions have the correct answer as their top-ranked result (ascomparedwith only 1% for Google.) Mul-der's recall rises to about 60% at 1000 words. After this point, Mulder's recall per unit eort grows lin-earlyataratesimilartoGoogle. Thequestionsinthis region are harder to answer, and some of the formu-latedqueriesfrom transformationsdonotretrieve any documents,henceMulder'sperformance is similarto thatofGoogle. MulderstillhasanedgeoverGoogle, however,duetophrasedqueriesandanswerextraction. The latter ltersout irrelevant text so that users can nd answers with Mulderwhile reading far less text thantheywouldusingGoogle. Finally,Mulder'scurve extendsbeyondthisgraphtoabout10,000words, end-ingat75%. Basedonourlimitedtesting, wespeculate thatofthe25%unansweredquestions,about10% can-notbe answered withpages indexedby Googleat the time of writing, and the remaining 15% require more sophisticatedsearchstrategies.

Fromourresults,wecanalsoseethelimitedrecallof AskJeeves. Thissuggests that a searchengine with a goodrankingfunctionandwidecoverageisbetterthan anon-automatedQAsysteminansweringquestions.

4.4 Contributions by Component

Oursecondexperimentinvestigates thecontribution ofeachcomponentof Muldertoitsperformance. We compare Mulder to three dierent variants of itself, eachwithasinglemissingcomponent:

Mulderwithoutqueryformulation(M-QF). Wetakeawaythe queryformulation module and issue onlythequestiontothesearchengine.

Mulder without answer extraction (M-X) Without answerextraction,welocate answersin this systemusingthesameapproachweusedfor Google in the previous experiment. To linearize theresultsretrievedfrommultiplequeries,we or-derqueriesinorderofgenerality,startingfromthe mostspecicquery.Thenresultsfromeachquery are selected ina breadth-rst manner. To illus-trate, suppose r

i;j

is the j th

result from the i th query, and we have n queries. The orderwould ber 1;1 ;r 2;1 ;:::;r n;1 ;r 1;2 ;:::;r n;2 ;::: andso on. r n;i would bethe i th

resultfromthe original un-formulatedquery,sinceitisthemostgeneral.

(9)

0

10

20

30

40

50

60

70

0

500 1000

1500

2000

2500

3000

3500

4000

4500

5000

%of questions answered (recall)

user effort (word distance)

Mulder

M - X

M - V

M - QF

Google

Figure 4: Recall vs. user eort: Mulder versus 3 dierent congurations of itself - without query formulation (M-QF), without answer extraction (M-X), and without voting (M-V). Google is the baseline.

System

totaleort totaleortMulder

Mulder 1.0

M-V 2.3

M-QF 3.0

M-X 3.8

Google 6.6

Table 2: Performances of Mulder variants and Google as multiples of total eort required by Mulder at 64.5%recall. Allthree Mulder vari-antsperformworsethanMulderbutbetterthan Google. The experiment suggests that answer extraction (M-X) is the most important, fol-lowed by query formulation (M-QF)and nally voting (M-V).

Mulderwithoutvoting(M-V)Inthissystem, we replace votingby asimplescheme: itassigns candidate answers the same score given to the summariesfromwhichtheywereextractedfrom. Inaddition,weincludedGoogleasabaselinefor com-parison.Asintherstexperiment,wecomparethe sys-tems' performance onrecall and user eort. Figure 4 illustratestheresults.

Figure4showsthat atallrecalllevels, systemswith missingcomponentsperformworsethanMulder,but betterthanGoogle. Amongthem,M-Vappearsto per-formbetter thanM-QF,andM-Xtrailsboth. Table2 showstotalusereortforeachsystem asamultipleof Mulder's at 64.5% recall, Google's maximum. The numbers suggest that answer extraction is the most important, followed by query formulation, and voting

least. Answerextractionplaysamajorroleindistilling textwhichsavesusersalotoftime.Queryformulation helpsintwoways:rst,byformingquerieswith alterna-tivewordsthat improvethe chanceofndinganswers; second, by getting to the relevant pages sooner. The positive eects of query formulation can also be seen withthe improvement inM-XoverGoogle; we expect thiseecttobemuchmorepronouncedwithotherless impressivesearchengines,suchasAltaVista

11

and Ink-tomi.

12

Finally,votingisthelastelementthatenhances answerextraction. Withoutvoting, M-Vonlyreaches 17%recall ateort0;withvoting, thenumberis dou-bled. Itis,however,lesseectiveafter1500 words, be-causeitbecomesdiÆculttondanycorrect answerat thatpoint.

5. RELATED WORK

Althoughquestionansweringsystemshavealong his-tory in Articial Intelligence, previous systems have used processed or highly-editedknowledge bases(e.g., subject-relation-object tuples [18], edited lists of fre-quently asked questions [6], sets of newspaper arti-cles [16], and an encyclopedia [20]) as their founda-tion.Asfarasweknow,Mulderistherstautomated question-answeringsystemthatusesthefullwebasits knowledge base. AskJeeves

13

is a commercial service that provides anatural language question interface to theweb,butit reliesonhundredsof humaneditorsto mapbetweenquestiontemplatesandauthoritativesites. 11

http://www.altavista.com 12

http://www.inktomi.com. Inktomi powers web sites such as http://www.hotbot.com and http://www.snap.com.

13

(10)

TheWebclopediaproject appearstobeattackingthe same problem as Mulder, but we are unable to nd anypublicationsfor thissystem,andfromtheabsence ofafunctioning webinterface, itappearsto bestillin activedevelopment.

START[18]isoneoftherstQAsystemswithaweb interface, having been available since 1993. Focussed on questions about geography and the MIT InfoLab, STARTusesaprecompiledknowledgebaseintheform ofsubject-relation-objecttuples,andretrievesthese tu-ples at runtime to answerquestions. Our experience with the system suggests that its knowledge is rather limited,e.g.,itfailsonsimplequeriessuchas\Whatis thethirdhighestmountainintheworld?"

The seminal MURAX [20] uses an encyclopedia as a knowledge base inorderto answertrivia-type ques-tions. MURAXcombinesaBooleansearchenginewith ashallowparsertoretrieverelevant sectionsofthe en-cyclopedia and toextractpotentialanswers. Thenal answerisdeterminedbyinvestigatingthephrasal rela-tionships between words that appear inthe question. According to its author MURAX's weak performance stemsfromthe factthatthevectorspace modelof re-trievalisinsuÆcientforQA,atleastwhenappliedtoa largecorpuslikeanencyclopedia.

RecentlyUsenet FAQ les (collections of frequently asked questions and answers, constructed by human experts) have become a popular choice of knowledge base, because they are small, cover a wide variety of topics, and have high content-to-noise ratio. Auto-FAQ [30] was an early system, limited to a few FAQ les. FAQFinder [6] is a more comprehensive system which uses a vector-space IR engine to retrieve a list ofrelevant FAQlesfromthequestion. Aftertheuser selectsaFAQlefromthelist,thesystemproceeds to use question keywords to nd a question-answer pair inthe FAQ lethat bestmatches theuser's question. QuestionmatchingisimprovedbyusingWordNet[22] hypernymsto expandthe possiblekeywords. Another system[25] usespriority keyword matchingto improve retrievalperformance. Keywords are divided into dif-ferent classes depending on their importance and are scored dierently. Various morphological and lexical transformations are also applied to words to improve matching.MulderusesasimilarideabyapplyingIDF scorestoestimatetheimportanceofwords.

TherearealsomanynewQAsystemsspawningfrom the Trec-8QAcompetition[28], the1999 AAAI Fall Symposium onQuestionAnswering [10] and the Mes-sageUnderstandingConferences[1]. SincetheTrec-8 competitionprovides arelatively controlled corpusfor information extraction, the objectives and techniques usedbythesesystemsaresomewhatdierentfromthose of Mulder. Indeed,mostTrec-8systemsusesimple, keyword-basedretrievalmechanisms, insteadfocussing ontechniquestoimprovetheaccuracyofanswer extrac-tion. Whileeective inTrec-8, such mechanismsdo notscale welltothe web. Furthermore,thesesystems have not addressedthe problemsof noise or incorrect information whichplaguethe web. Alot ofcommon techniquesaresharedamongtheTrec-8systems,most 14

http://www.isi.edu/natural- langu age/ proje cts/ webclopedia/

notablyaquestionclassicationmechanismand name-entitytagging.

Questionclassicationmethodsanalyzeaquestionin ordertodeterminewhattypeofinformationisbeing re-questedsothatthesystemmaybetterrecognizean an-swer. Manysystemsdeneahierarchyofanswertypes, andusetherstfewwordsofthequestiontodetermine whattypeofansweritexpects(suchas [23,26]). How-ever, building a complete question taxonomy requires excessive manual eort and will not coverall possible cases. For instance, what questions can be associated withmanynounsubjects,suchaswhatheight,what cur-rencyandwhatmembers. Thesesubjectscanalsooccur inmanypositionsinthequestion. Incontrast,Mulder usesasimplequestion wh-phrasematchingmechanism withcombinedlexicalandsemanticprocessing to clas-sifythesubjectofthequestionwithhighaccuracy.

Name-entity (NE) tagging associates a phrase or a wordwithitssemantics. Forexample,\Africa"is asso-ciatedwith\country",\John"with\person" and\1.2 inches" with \length". Most of the NE taggers are trained on a tagged corpus using statistical language processing techniques,and reportrelativelyhigh accu-racy[4]. QAsystemsuse theNEtaggertotagphrases inthequestionaswellastheretrieveddocuments.The numberofcandidateanswerscanbegreatlyreducedif thesystemonlyselects answers thathavetherequired tag type. Many systemsreport favorable results with NEtagging,e.g.[26]. WebelieveaNEtaggerwouldbe ausefuladditiontoMulder,butwealsobelieveaQA system should not relytoo heavily onNE tagging, as thenumberofnewtermschangesandgrowsrapidlyon theweb.

Relationshipsbetweenwordscanbeapowerful mech-anism to recover answers. As mentioned previously, the START system uses subject-relation-object rela-tionships. CLResearch'sQAsystem[24]parsesall sen-tences from retrieved documents and forms semantic relationtripleswhichconsistofadiscourseentity,a se-mantic relation that characterizes the entity's role in thesentence,andagoverningwordtowhichtheentity stands inthe semantic relation. Another system, de-scribedby Harabagiu [15], achievesrespectable results for Trec-8byreasoning withlinkagesbetweenwords, whichareobtainedfromitsdependency-based statisti-calparser[12]. Harabagiu'ssystemrstparsesthe ques-tionandtheretrieveddocumentsintowordlinkages. It thentransforms theselinkages into logicalforms. The answer is selected using an inference procedure, sup-plementedbyanexternalknowledgebase. We believe mechanismsofthistypewouldbeveryusefulin reduc-ingthenumberofanswercandidatesinMulder.

6. FUTURE WORK

Clearly, we need substantially more experimental evaluationtovalidateMulder'sstrongperformancein ourinitial experiments. Weplan tocompare Mulder against additional search services, and against tradi-tional encyclopedias. Inaddition, we plan to evaluate Mulderusingadditional question sets andadditional performance metrics. We plan to conduct controlled user studies to gauge the utility of Mulder to web users. Finally,as wehavedone withpreviousresearch

(11)

ofthedeployedversionof Mulderinpractice. Manychallengesremainaheadof Mulder. Our cur-rent implementationutilizes little of the semantic and syntactic information available during answer extrac-tion. Webelieveourrecallwillbemuchhigherifthese factorsaretakenintoconsideration.

There are many elements that can determine the truthofananswer. For example, thelast-modied in-formation from web pages can tell us how recent the pageanditsansweris. Theauthorityofapageisalso important, since authoritative sites are likely to have better contentsand more accurateanswers. In future versionsofMulderwemayincorporatetheseelements intoouranswerselection mechanism.

Wearealsoimplementingourownsearchengine,and plan to integrate Mulderwith its index. We expect thatanevaluationofthebestdatastructuresand inte-gratedalgorithmswill providemanychallenges. There aremanyadvantagestoalocalsearchengine,themost obvious being reduced network latencies for querying andretrievingwebpages. Wemayalsoimprovethe eÆ-ciencyandaccuracyforQAwithanalysesofalarge doc-ument collection. One disadvantage of a home-grown searchengine is oursmaller coverage, since we donot havetheresourcesofacommercialsearchengine. How-ever,withfocussedwebcrawlingtechniques[7],wemay beabletoobtainaveryhighqualityQAknowledgebase withourlimitedresources.

7. CONCLUSION

Inthis paper we set out to investigate whether the question-answeringtechniquesstudiedinclassical infor-mationretrievalcanbescaledtotheweb. While addi-tionalworkisnecessarytooptimizethesystemsothat it can supporthigh workloads withfast response, our initial experiments(reportedinSection4) are encour-aging. Ourcentralcontributionsarethefollowing:

We elded http://mulder.cx, the rst general-purpose,fully-automatedquestion-answering sys-temavailableontheweb.

Mulderisbasedonanovelarchitecturethat com-binesinformationretrievalideaswiththoseof sta-tisticalnaturallanguageprocessing,lexical analy-sis,queryformulation,answerextraction,and vot-ing.

We performed an ablation experiment which showed that each major component of Mulder (query formulation, answer extraction, and vot-ing)contributedto thesystem'soverall eective-ness.

We conducted empirical experiments which showed that Mulder reduces user eort by a factor of 6.6 when compared with Google. We also demonstrated a three-fold advantage in re-callwhencomparingourfully-automatedMulder systemwiththemanually-constructedAskJeeves.

Ourworkisasteptowardstheultimategoalofusing the web as a comprehensive, self-updating knowledge

a wide range of questions with muchless eort than is required from users interacting with today's search engines.

8. ACKNOWLEDGEMENTS

We thankDavidKoandSam Revitchfor their con-tributionstotheproject. WealsothankErikSneiders, Tessa Lau and Yasuro Kawata for their helpful com-ments. Thisresearch was funded in part by OÆce of Naval Research Grant 98-1-0177, OÆce of Naval Re-search Grant N00014-98-1-0147, and by National Sci-enceFoundationGrantDL-9874759.

9. REFERENCES

[1] ProceedingsoftheSeventhMessageUnderstanding Conference (MUC-7).MorganKaufmann,1998. [2] AdrianAkmajianandFrankHeny.An

Introduction tothePrinciples ofTransformational Syntax.MITPress,Cambridge, Massachusetts, 1975.

[3] EvanL.Antworth.PC-KIMMO:atwo-level processor formorphologicalanalysis.Dallas,TX: SummerInstituteofLinguistics,1990.

[4] D.Bikel,S.Miller, R.Schwartz,and R.Weischedel.Nymble: AHigh-performance LearningNameFinder.InProceedings oftheFifth Conference onAppliedNaturalLanguage

Processing,pages194{201,1997.

[5] C.Buckley,G.Salton,J.Allan,andA.Singhal. AutomaticqueryexpansionusingSMART:TREC 3.InNISTSpecialPublication500-225: The ThirdText REtrievalConference(TREC-3), pages69{80.DepartmentofCommerce,National InstituteofStandardsandTechnology,1995. [6] RobinBurke,KristianHammond,Vladimir

Kulyukin,StevenLytinen,NorikoTomuro,and ScottSchoenberg.QuestionAnsweringfrom Frequently-AskedQuestionFiles: Experiences withtheFAQFinderSystem.TechnicalReport TR-97-05,UniversityofChicago,Departmentof ComputerScience,1997.

[7] S.Chakrabarti,M.vanderBerg,andB.Dom. Focusedcrawling: anewapproachto

topic-specicWebresourcediscovery.In

Proceedingsof8thInternationalWorldWideWeb Conference (WWW8),1999.

[8] E.Charniak.Statistical techniquesfornatural languageparsing.AIMagazine,18(4),Winter 1997.

[9] E.Charniak.AMaximum-Entropy-Inspired Parser.TechnicalReportCS-99-12,Brown University,ComputerScienceDepartment, August1999.

[10] VinayChaudhriandCo-chairsRichardFikes. QuestionAnsweringSystems: Papersfromthe 1999FallSymposium.TechnicalReportFS-98-04, AAAI,November1999.

[11] NoamChomsky.Aspectsof aTheory ofSyntax. MITPress, Cambridge,Mass.,1965.

[12] Michael JohnCollins.ANewStatisticalParser BasedonBigramLexicalDependencies.In

(12)

ACL,SantaCruz,1996.

[13] O.Etzioni.Movinguptheinformationfoodchain: softbotsasinformationcarnivores.InProceedings oftheThirteenthNational Conferenceon ArticialIntelligence,1996.Revisedversion reprintedinAIMagazinespecialissue,Summer '97.

[14] DennisGrinberg,JohnLaerty,andDaniel Sleator.ARobustParsingAlgorithmforLink Grammars.InProceedings oftheFourth

InternationalWorkshoponParsingTechnologies, Prague,September1995.

[15] SandaHarabagiu,MariusPasca,andSteven Maiorano.ExperimentswithOpen-Domain TextualQuestionAnswering.InProceedingsof COLING-2000,Saarbruken Germany,August 2000.

[16] DavidHawking,EllenVoorhees,PeterBailey,and NickCraswell. OverviewofTREC-8Question AnsweringTrack.InE.M.VoorheesandD.K. Harman,editors,Proceedingsof theEighthText RetrievalConference-TREC-8,November1999. [17] MonikaR.Henzinger,AllanHeydon,Michael

Mitzenmacher,andMarcNajork.Measuring indexqualityusingrandomwalks ontheWeb. 31(11{16):1291{1303, May1999.

[18] B.Katz.FromSentenceProcessingtoInformation AccessontheWorldWideWeb.InNatural LanguageProcessingfortheWorldWideWeb: Papersfromthe1997 AAAISpringSymposium, pages77{94,1997.

[19] J.Kleinberg.Authoritativesourcesina hyperlinkedenvironment.InProc.9th

ACM-SIAMSymposiumonDiscreteAlgorithms, 1998.

[20] JulianKupiec.MURAX:ARobustLinguistic ApproachforQuestionAnsweringUsingan On-LineEncyclopedia.InRobertKorfhage, EdieM.Rasmussen,andPeterWillett,editors, Proceedingsofthe16thAnnualInternational ACM-SIGIRConference onResearchand

DevelopmentinInformationRetrieval.Pittsburgh, PA,USA,June27-July1,pages181{190.ACM, 1993.

[21] M.P.Marcus,B.Santorini,andM.A. Marcinkiewicz. BuildingaLargeAnnotated CorpusofEnglish: ThePennTreebank. Computational Linguistics,19:313{330, 1993.

InternationalJournalofLexicography, 3(4):235{312,1991.

[23] DragomirR.Radev,JohnPrager,andValerie Samn.TheUseofPredictiveAnnotationfor QuestionAnsweringinTREC8.InProceedingsof the8thTextRetrievalConference (TREC-8). National InstituteofStandardsandTechnology, Gaithersburg,MD,pages399{411,1999. [24] K.C.Litkowski(CLResearch).

Question-AnsweringUsingSemanticRelation Triples.InProceedings ofthe8thText Retrieval Conference (TREC-8).National Instituteof StandardsandTechnology, Gaithersburg,MD, pages349{356,1999.

[25] E.Sneiders.AutomatedFAQAnswering: ContinuedExperiencewithShallowLanguage Understanding.InQuestionAnsweringSystems. Papersfromthe1999AAAIFallSymposium. TechnicalReport FS-99-02,1999.

[26] RohiniSrihariandWeiLi(CymfonyInc.). InformationExtractionSupportedQuestion Answering.InProceedings ofthe8thText Retrieval Conference(TREC-8).National Institute ofStandardsandTechnology, Gaithersburg,MD,pages185{196,1999. [27] S.E.Taylor,H.Franckenpohl,andJ.L.Pette.

Gradelevelnormsforthecomponentofthe fundamentalreadingskill.EDLInformationand Research BulletinNo.3.Huntington,N.Y.,1960. [28] E.VoorheesandD.Tice.TheTREC-8 Question

AnsweringTrackEvaluation,pages77{82. DepartmentofCommerce,NationalInstituteof StandardsandTechnology,1999.

[29] EllenVoorhees.Queryexpansionusing

lexical-semanticrelations.InProceedingsofACM SIGIR,Dublin,Ireland,1994.

[30] StevenD.Whitehead.Auto-FAQ:Anexperiment incyberspaceleveraging.Computer Networksand ISDNSystems,28(1{2):137{146, December1995. [31] O.ZamirandO.Etzioni.ADynamicClustering

InterfacetoWebSearchResults.InProceedings of theEighthInt.WWWConference,1999.