Digital Technical Journal Vol 7 No 3 1995

[Flowchart: retrieve using kwsearch function call; normalize word form using WordNet morphing and get termid; calculate probability of relevance using staged logistic regression formula; append entries to kw_retrieval and kw_query]

Figure 4

The Lassen Retrieval Process

same processing steps as documents in the indexing process. The individual words of the query are extracted and located in the wn_index dictionary (after removing common words or "stopwords"). The termids for matching words from wn_index are then used to retrieve all the tuples in kw_term_doc_rel that contain the term. For each unique document identifier in this list of tuples, the matching kw_doc_index tuple is retrieved. With the frequency information contained in kw_term_doc_rel and kw_doc_index, the estimated probability of relevance is calculated for each document that contains at least one term in common with the query. The formulae used in the calculation are based on experiments with full-text retrieval. The basic equation for the probabilistic model used in Lassen states the following: The probability of the event that a document is relevant R, given that there is a set of N "clues" associated with that document, $A_i$ for $i = 1, 2, \ldots, N$, is

$$\log O(R \mid A_1, \ldots, A_N) = \log O(R) + \sum_{i=1}^{N} \bigl[ \log O(R \mid A_i) - \log O(R) \bigr], \tag{1}$$

where for any events $E$ and $E'$ the odds $O(E \mid E')$ is $P(E \mid E') / P(\bar{E} \mid E')$, i.e., a simple transformation of the probabilities. Because there is not enough information to compute the exact probability of relevance for any user and any document, an estimation is derived based on logistic regression of a set of clues (usually terms or words) contained in some sample of queries and the documents previously judged to be relevant to those queries. For a set of M terms that occur in both a query and a given document, the regression equation is of the form

$$\log O(R \mid A_1, \ldots, A_M) = c_0 + c_1 f(M) \sum_{m=1}^{M} X_{m,1} + \cdots + c_K f(M) \sum_{m=1}^{M} X_{m,K} + c_{K+1} M + c_{K+2} M^2, \tag{2}$$

where there are K retrieval variables $X_{m,k}$ used to characterize each term or clue, and the $c_i$ coefficients are constant for a given training set of queries and documents. The coefficients used in the prototype were derived from analysis of full-text documents and queries (with relevance judgments) from the TIPSTER information retrieval test collection. The derivation of this formula is given in "Probabilistic Retrieval Based on Staged Logistic Regression." The full retrieval equation used for the prototype version of retrieval described in this section is

$$\log O(R \mid A_1, \ldots, A_M) \approx -3.51 + \frac{1}{\sqrt{M + 1}} \left[ 37.4 \sum_{m=1}^{M} X_{m,1} + 0.330 \sum_{m=1}^{M} X_{m,2} - 0.1937 \sum_{m=1}^{M} X_{m,3} \right] + 0.0929\,M, \tag{3}$$

where

$X_{m,1}$ is the quotient of the number of times the mth term occurs in the query and the sum of the total number of terms in the query plus 35;

$X_{m,2}$ is the logarithm of the quotient arrived at by dividing the number of times the mth term occurs in the document by the sum of the total number of terms in the document plus 80;

$X_{m,3}$ is the logarithm of the quotient arrived at by dividing the number of times the mth term occurs in the database (i.e., in all documents) by the total number of terms in the collection;

$M$ is the number of terms held in common by the query and the document.
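As a rough illustration of how Equation 3 might be evaluated for one document, consider the following minimal Python sketch. It is not the Lassen code: the function names `log_odds` and `probability`, the argument layout, and the use of the natural logarithm are assumptions made for this example. Only the constants and the variable definitions come from the text.

```python
import math

# Constants of Equation 3 and the length offsets 35 and 80 from the text.
K1, K2, K3, K4, K5 = -3.51, 37.4, 0.330, -0.1937, 0.0929
QUERY_OFFSET, DOC_OFFSET = 35, 80


def log_odds(matches, query_len, doc_len, collection_len):
    """Evaluate Equation 3 for a single document (hypothetical helper).

    matches holds one (query_tf, doc_tf, collection_tf) tuple for each
    term shared by the query and the document; all counts must be > 0.
    Natural logarithms are assumed here.
    """
    m = len(matches)
    if m == 0:
        raise ValueError("document shares no term with the query")
    x1 = sum(q / (query_len + QUERY_OFFSET) for q, _, _ in matches)
    x2 = sum(math.log(d / (doc_len + DOC_OFFSET)) for _, d, _ in matches)
    x3 = sum(math.log(c / collection_len) for _, _, c in matches)
    return K1 + (K2 * x1 + K3 * x2 + K4 * x3) / math.sqrt(m + 1) + K5 * m


def probability(log_o):
    """Convert logarithmic odds to a probability."""
    return 1.0 / (1.0 + math.exp(-log_o))
```

Because the coefficient on the collection-frequency sum is negative, a matching term that is rare in the collection raises the estimate, while a higher within-document frequency raises it through the second sum.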

Note that the $M^2$ term called for in Equation 2 was not found to provide any significant difference in the results and was omitted from Equation 3. The constants 35 and 80, which were used in $X_{m,1}$ and $X_{m,2}$, are arbitrary but appear to offer the best results when set to the average size of a query and the average size of a document for the particular database. The sequence of operations performed to calculate the probability of relevance is shown in Figure 5. Note that in the figure, $k_1, \ldots, k_5$ represent the constants of Equation 3.
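The lookup steps that feed this calculation (stopword removal, termid lookup, and tuple retrieval grouped by document) might be sketched as follows. This is a hedged illustration only: the in-memory dictionary and tuple list stand in for the wn_index and kw_term_doc_rel classes, the function name `candidate_stats` and the stopword list are hypothetical, and the WordNet morphing step is omitted.

```python
from collections import defaultdict

# Illustrative stopword list; the list used by Lassen is not given in the text.
STOPWORDS = {"the", "of", "a", "in", "and"}


def candidate_stats(query, wn_index, kw_term_doc_rel):
    """Collect per-document term frequencies for a query (hypothetical helper).

    wn_index: dict mapping a word to its termid (stand-in for wn_index).
    kw_term_doc_rel: iterable of (termid, docid, doc_tf) tuples.
    Returns {docid: [(termid, query_tf, doc_tf), ...]} for every document
    that shares at least one term with the query.
    """
    words = [w for w in query.lower().split() if w not in STOPWORDS]

    # Count how often each known term occurs in the query.
    query_tf = defaultdict(int)
    for word in words:
        termid = wn_index.get(word)
        if termid is not None:
            query_tf[termid] += 1

    # Group the matching tuples by document identifier.
    candidates = defaultdict(list)
    for termid, docid, doc_tf in kw_term_doc_rel:
        if termid in query_tf:
            candidates[docid].append((termid, query_tf[termid], doc_tf))
    return dict(candidates)
```

The length of each per-document list corresponds to M, the number of terms held in common by the query and that document.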

The probability of relevance is calculated for each document (by converting the logarithmic odds to a probability) and is stored along with a unique query identifier, the document identifier, and some location information in the kw_retrieval class. The query itself

