[Figure 4: The Lassen Retrieval Process. Flowchart steps: retrieve using the kwsearch function call; normalize word form using WordNet morphing and get termid; calculate probability of relevance using the staged logistic regression formula; append entries to kw_retrieval and kw_query.]
same processing steps as documents in the indexing process. The individual words of the query are extracted and located in the wn_index dictionary (after removing common words or "stopwords"). The termids for matching words from wn_index are then used to retrieve all the tuples in kw_term_doc_rel that contain the term. For each unique document identifier in this list of tuples, the matching kw_doc_index tuple is retrieved. With the frequency information contained in kw_term_doc_rel and kw_doc_index, the estimated probability of relevance is calculated for each document that contains at least one term in common with the query. The formulae used in the calculation are based on experiments with full-text retrieval. The basic equation for the probabilistic model used in Lassen states the following: The probability of the event that a document is relevant $R$, given that there is a set of $N$ "clues" associated with that document, $A_i$ for $i = 1, 2, \ldots, N$, is
$$\log O(R \mid A_1, \ldots, A_N) = \log O(R) + \sum_{i=1}^{N} \left[ \log O(R \mid A_i) - \log O(R) \right], \tag{1}$$
Digital Technical Journal Vol. 7 No. 3 1995
where for any events $E$ and $E'$, the odds $O(E \mid E')$ is $P(E \mid E')/P(\bar{E} \mid E')$, i.e., a simple transformation of the probabilities. Because there is not enough information to compute the exact probability of relevance for any user and any document, an estimation is derived based on logistic regression of a set of clues (usually terms or words) contained in some sample of queries and the documents previously judged to be relevant to those queries. For a set of $M$ terms that occur in both a query and a given document, the regression equation is of the form
$$\log O(R \mid A_1, \ldots, A_M) = c_0 + c_1 \frac{1}{\sqrt{M+1}} \sum_{m=1}^{M} X_{m,1} + \cdots + c_K \frac{1}{\sqrt{M+1}} \sum_{m=1}^{M} X_{m,K} + c_{K+1} M + c_{K+2} M^2, \tag{2}$$
where there are $K$ retrieval variables $X_{m,k}$ used to characterize each term or clue, and the $c_i$ coefficients are constant for a given training set of queries and documents. The coefficients used in the prototype were derived from analysis of full-text documents and queries (with relevance judgments) from the TIPSTER information retrieval test collection. The derivation of this formula is given in "Probabilistic Retrieval Based on Staged Logistic Regression." The full retrieval equation used for the prototype version of retrieval described in this section is
$$\log O(R \mid A_1, \ldots, A_M) \approx -3.51 + \frac{1}{\sqrt{M+1}} \left[ 37.4 \sum_{m=1}^{M} X_{m,1} + 0.330 \sum_{m=1}^{M} X_{m,2} - 0.1937 \sum_{m=1}^{M} X_{m,3} \right] + 0.0929 M, \tag{3}$$
where
$X_{m,1}$ is the quotient of the number of times the $m$th term occurs in the query and the sum of the total number of terms in the query plus 35;
$X_{m,2}$ is the logarithm of the quotient arrived at by dividing the number of times the $m$th term occurs in the document by the sum of the total number of terms in the document plus 80;
$X_{m,3}$ is the logarithm of the quotient arrived at by dividing the number of times the $m$th term occurs in the database (i.e., in all documents) by the total number of terms in the collection;
$M$ is the number of terms held in common by the query and the document.
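The calculation defined by Equation 3 and the variable definitions above can be sketched in a few lines of Python. This is an illustrative sketch only, not the Lassen implementation: the function and variable names are hypothetical, the logarithms are assumed to be natural logarithms, the normalizing factor is assumed to be $1/\sqrt{M+1}$, and the term lists are assumed to be already stopword-filtered and normalized. The conversion from log odds to a probability follows directly from the definition of odds given earlier.

```python
import math
from collections import Counter

# Constants of Equation 3 as given in the text.
C0, C1, C2, C3, C4 = -3.51, 37.4, 0.330, -0.1937, 0.0929
QUERY_OFFSET = 35   # added to the query length in X_{m,1}
DOC_OFFSET = 80     # added to the document length in X_{m,2}


def log_odds_of_relevance(query_terms, doc_terms, coll_freq, coll_size):
    """Estimate log O(R | A_1..A_M) for one query/document pair.

    query_terms, doc_terms: lists of normalized terms (hypothetical inputs).
    coll_freq: mapping term -> occurrences in the whole collection.
    coll_size: total number of term occurrences in the collection.
    Returns None when the query and document share no term.
    """
    q_tf = Counter(query_terms)
    d_tf = Counter(doc_terms)
    shared = [t for t in q_tf if t in d_tf]
    m = len(shared)
    if m == 0:
        return None

    q_len = len(query_terms)
    d_len = len(doc_terms)
    # Sums over the M shared terms of the three retrieval variables.
    x1 = sum(q_tf[t] / (q_len + QUERY_OFFSET) for t in shared)
    x2 = sum(math.log(d_tf[t] / (d_len + DOC_OFFSET)) for t in shared)
    x3 = sum(math.log(coll_freq[t] / coll_size) for t in shared)  # natural log assumed

    weighted = C1 * x1 + C2 * x2 + C3 * x3
    return C0 + weighted / math.sqrt(m + 1) + C4 * m


def probability_of_relevance(log_odds):
    """Convert log odds to a probability: P = O / (1 + O)."""
    return 1.0 / (1.0 + math.exp(-log_odds))


if __name__ == "__main__":
    # Toy data, invented purely for illustration.
    docs = {
        "d1": ["snow", "cover", "data", "sierra", "snow"],
        "d2": ["ocean", "temperature"],
    }
    coll_freq = Counter(t for terms in docs.values() for t in terms)
    coll_size = sum(coll_freq.values())
    query = ["snow", "cover", "sierra"]
    for doc_id, terms in docs.items():
        score = log_odds_of_relevance(query, terms, coll_freq, coll_size)
        if score is not None:
            print(doc_id, probability_of_relevance(score))
```

Documents that share no term with the query are skipped rather than scored, which mirrors the statement that the probability is estimated only for documents containing at least one query term.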
Note that the $M^2$ term called for in Equation 2 was not found to provide any significant difference in the results and was omitted from Equation 3. The constants 35 and 80, which were used in $X_{m,1}$ and $X_{m,2}$, are arbitrary but appear to offer the best results when set to the average size of a query and the average size of a document for the particular database. The sequence of operations performed to calculate the probability of relevance is shown in Figure 5. Note that in the figure, k1, ..., k5 represent the constants of Equation 3. The probability of relevance is calculated