4.2 Query Driven Retrieval Approaches
4.2.2 Multi-field Based Retrieval
When allowing users to query over multi-field documents, two approaches can be taken. In the first approach, the user specifies the query terms to be used for each field. A retrieval system then performs a field-by-field search of each document, using the query terms provided for each field. This is referred to as ‘structured search’, described in Section 4.2.2.1. In the second approach, the user specifies all the query terms to be used in the search, without providing any indication as to the field each query term should be targeted at. A retrieval system then performs a search across all document fields using the provided query terms. We refer to this as ‘flat search’, described in Section 4.2.2.2. Figure 4.1 shows the difference between a structured query and a flat query to a retrieval system.

EXAMPLE 1 - structured query:
Content field query terms: content context lifelogs
Title field query terms: PhD thesis
Extension field query terms: pdf

EXAMPLE 2 - flat query:
Query terms: content context lifelogs PhD thesis pdf

Figure 4.1: Sample structured and flat queries.
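The two query styles of Figure 4.1 can be illustrated as data structures. The following is a minimal sketch only; the type names and the `flatten` helper are hypothetical and are not part of any system described in this thesis.

```python
from typing import Dict, List

# Hypothetical representations of the two query styles in Figure 4.1.
StructuredQuery = Dict[str, List[str]]  # field name -> terms aimed at that field
FlatQuery = List[str]                   # terms carrying no field information

structured_query: StructuredQuery = {
    "content": ["content", "context", "lifelogs"],
    "title": ["PhD", "thesis"],
    "extension": ["pdf"],
}


def flatten(query: StructuredQuery) -> FlatQuery:
    """Drop the field labels, keeping the terms in field order."""
    return [term for terms in query.values() for term in terms]


flat_query = flatten(structured_query)
print(flat_query)
```

Flattening is lossy: the flat query cannot be turned back into the structured one without some means of mapping terms to fields.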
4.2.2.1 Structured Content+Context Search
A simple approach to scoring in structured search is simple data fusion [Belkin et al., 1995], whereby each field of a document is queried separately using the query terms for that field, and a linear combination of the individual field scores is taken, as shown in Equation 4.14, where $f$ is a field in document $d$, $w_f$ is the weight assigned to field $f$, and $ms(q_f, d_f)$ is a matching function which computes a matching score for field $f$ of document $d$, given the query terms $q_f$ used for this field. The $ms(q_f, d_f)$ score can be calculated using any query-document matching algorithm. For example, Equation 4.15 shows the use of BM25 (Equation 4.12, described in Section 4.2.1.1) in this process, where $w_{t_f,f}$ is the application of Equation 4.12 to the field $f$ of document $d$ for a query term $t_f$ in $q_f$. Indeed, any matching score function can be used to calculate $w_{t_f,f}$ in Equation 4.15.
\[ ms(q, d) = \sum_{f \in d} w_f \cdot ms(q_f, d_f) \tag{4.14} \]

\[ ms(q, d) = \sum_{f \in d} \Big( w_f \cdot \sum_{t_f \in q_f \cap f} w_{t_f,f} \Big) \tag{4.15} \]
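The following is a minimal sketch of simple data fusion for a structured query. Since Equation 4.12 is not reproduced in this section, the function `bm25_weight` is an assumed textbook BM25 form (with a non-negative idf variant so that the toy collection behaves sensibly), and the documents, field weights and parameter values are invented for illustration.

```python
import math
from collections import Counter
from typing import Dict, List

Doc = Dict[str, List[str]]  # field name -> tokens of that field


def field_stats(docs: List[Doc], field: str):
    """Document frequencies and average length for one field."""
    df = Counter()
    total_len = 0
    for doc in docs:
        tokens = doc.get(field, [])
        df.update(set(tokens))
        total_len += len(tokens)
    return df, total_len / len(docs)


def bm25_weight(tf, df, n_docs, length, avg_len, k1=1.2, b=0.75):
    """Textbook BM25 term weight; an assumed stand-in for Equation 4.12."""
    idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))  # non-negative variant
    return idf * tf * (k1 + 1) / (tf + k1 * ((1 - b) + b * length / avg_len))


def fused_score(query: Dict[str, List[str]], doc: Doc, docs: List[Doc],
                field_weights: Dict[str, float]) -> float:
    """Weighted sum over fields of per-field BM25 scores (cf. Equation 4.15)."""
    score = 0.0
    for field, terms in query.items():
        tokens = doc.get(field, [])
        if not tokens:
            continue  # a missing or empty field contributes nothing
        counts = Counter(tokens)
        df, avg_len = field_stats(docs, field)
        field_score = sum(
            bm25_weight(counts[t], df[t], len(docs), len(tokens), avg_len)
            for t in set(terms) if counts[t] > 0  # only terms in q_f and in f
        )
        score += field_weights.get(field, 1.0) * field_score
    return score


docs = [
    {"title": ["phd", "thesis"], "content": ["context", "lifelogs", "retrieval"],
     "extension": ["pdf"]},
    {"title": ["holiday", "photos"], "content": ["beach", "context"],
     "extension": ["jpg"]},
    {"title": ["thesis", "notes"], "content": ["meeting", "agenda"],
     "extension": ["doc"]},
]
query = {"content": ["context", "lifelogs"], "title": ["phd", "thesis"],
         "extension": ["pdf"]}
weights = {"content": 1.0, "title": 2.0, "extension": 0.5}  # illustrative only
scores = [fused_score(query, d, docs, weights) for d in docs]
print(scores)
```

Each query term is only ever matched against the field it was entered for, so a term such as "pdf" appearing in a content field would not contribute to the score.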
In Chapter 5.3 we investigate the performance of this simple data fusion approach for structured content+context-based search on PL collections. This allows us to examine the effect of allowing the PL owner to form structured queries based on their recalled content and context associated with required items, relative to content only querying. Since we use BM25 for our content only retrieval investigations in Chapter 5, the BM25 term weighting function is also used in the simple data fusion approach of Equation 4.15 when exploring structured content+context-based retrieval.
4.2.2.2 Flat Content+Context Search
As highlighted in Section 4.1, structured search can limit the scope for using the context of the query, insofar as it requires explicit association of query terms with specific fields. Flat search represents a more flexible alternative by enabling a searcher to enter all content and context associated with the required information in a single simple flat query. This suggests that in the development of IR methods for PLs, we should focus on the need to support effective search of multi-field items using simple flat queries.
Given a simple flat query for search, the key research challenge is how to score the fields, individually or in combination, to generate the most effective overall score for retrieval. This issue is analyzed in detail in [Ogilvie and Callan, 2003, Robertson et al., 2004, Wilkinson, 1994]. The simplest approach is to index all item fields as one field, reducing the collection to a single field collection on which queries can be processed using a content only retrieval approach, such as the VSM or BM25, described in Section 4.2.1. This approach acts as a flat query based retrieval baseline for the PL retrieval investigations in Chapter 5. However, by reducing multi-field documents to single field documents, the rich information in structured documents is lost. For example, consider a news article archive with fields such as content, title and author, and a query looking for articles on a given topic by a given author. In a structured document, a match on the author field would help narrow the search space; in a flat document this author information is essentially lost, or not deemed significant, amongst the many terms in the flat representation of the document. Similarly, the significance of terms in a document’s title field is lost when a structured document is converted to a flat representation.
Hence, it is desirable to maintain the structure of documents.
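The loss of information can be made concrete with a small sketch. The two news articles below are invented for illustration, and `flatten_document` is a hypothetical helper that indexes all fields as one.

```python
from typing import Dict, List

Doc = Dict[str, List[str]]  # field name -> tokens of that field


def flatten_document(doc: Doc) -> List[str]:
    """Concatenate all fields into one bag of tokens, discarding field labels."""
    return [tok for tokens in doc.values() for tok in tokens]


# Invented example: one article written by "smith", one that merely mentions "smith".
by_smith = {
    "title": ["budget", "vote"],
    "author": ["smith"],
    "content": ["parliament", "passed", "the", "budget"],
}
about_smith = {
    "title": ["minister", "resigns"],
    "author": ["jones"],
    "content": ["smith", "resigned", "after", "the", "vote"],
}

flat_a = flatten_document(by_smith)
flat_b = flatten_document(about_smith)

# In the flat view both documents contain "smith" exactly once.
print(flat_a.count("smith"), flat_b.count("smith"))  # 1 1

# The structured view still separates the author from a mention in the content.
print("smith" in by_smith["author"], "smith" in about_smith["author"])  # True False
```

A content only retrieval function applied to the flat representations has no means of preferring the article written by the queried author.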
Much of the research exploring the challenge of retrieval from structured documents using flat queries has focused on using content only retrieval algorithms (described in Section 4.2.1) to determine a score for each field (where each field is queried against all the terms in the flat query), and then using various approaches to combine the individual field scores [Chowdhury et al., 2003, Savoy and Rasolofo, 2003, Xu et al., 2003], the simplest of these being the linear sum of individual field scores, described in the context of structured queries in Section 4.2.2.1. However, as highlighted in [Robertson et al., 2004], for flat queries simple data fusion over multi-field documents suffers from a number of weaknesses³. In particular, linearly combining scores across fields can greatly over-estimate the importance of a term occurring in multiple fields at the expense of a term occurring in only one field. They also highlight that careful exploitation of field structure is important for optimal retrieval performance. Their proposed solution, BM25F, is to weight terms at the field level and linearly combine these weights. This overall term weight is then applied to the BM25 saturating function. In subsequent work, they refined their BM25F term scoring approach to also consider field length at the term scoring level [Zaragoza et al., 2004], to account for fields of extremely different length, e.g. title and content fields. The BM25F term scoring approach [Zaragoza et al., 2004] for calculating the weight $\bar{w}_{t,d}$ of term $t$ in document $d$ is presented in Equation 4.16. The term weight $\bar{w}_{t,d}$ is then applied to the BM25 saturating function, presented in Equation 4.17.
\[ \bar{w}_{t,d} = \sum_{f \in d} \frac{tf_{d,f,t} \cdot w_f}{(1 - b_f) + b_f \cdot \frac{l_f}{avl_f}} \tag{4.16} \]
where,
$tf_{d,f,t}$ is the frequency of term $t$ in field $f$ of document $d$.
$l_f$ is the length of $f$ in $d$.
$avl_f$ is the average length of field $f$.
$w_f$ is the field weight assigned to $f$.
$b_f$ is a length normalising parameter for $f$.
³ This makes no claim as to the utility, or lack thereof, of using a simple data fusion approach for structured queries.
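\[ ms(q, d) = \sum_{t \in q \cap d} \frac{\bar{w}_{t,d}}{k_1 + \bar{w}_{t,d}} \cdot idf(t) \tag{4.17} \]
where $k_1$ is the saturation parameter and $idf(t)$ is calculated as $\log \frac{N - df(t) + 0.5}{df(t) + 0.5}$ in the BM25F implementation [Robertson et al., 2004, Zaragoza et al., 2004], where $N$ is the number of documents in the collection and $df(t)$ is the number of documents containing term $t$. Of course, other approaches can be used to calculate the idf, as discussed in Section 4.2.1.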
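The BM25F computation can be sketched as follows, assuming the term weight of Equation 4.16 is passed through a saturation of the form w / (k1 + w) and multiplied by the idf described above. All field weights, b values, average lengths, document frequencies and the value of k1 are invented for illustration and are not the settings used in our experiments.

```python
import math
from typing import Dict, List

Doc = Dict[str, List[str]]  # field name -> tokens of that field


def bm25f_term_weight(term: str, doc: Doc, field_weights: Dict[str, float],
                      b: Dict[str, float], avg_len: Dict[str, float]) -> float:
    """Field-weighted, length-normalised term frequency (cf. Equation 4.16)."""
    total = 0.0
    for field, tokens in doc.items():
        tf = tokens.count(term)
        if tf == 0:
            continue
        norm = (1 - b[field]) + b[field] * len(tokens) / avg_len[field]
        total += tf * field_weights[field] / norm
    return total


def bm25f_score(query: List[str], doc: Doc, field_weights: Dict[str, float],
                b: Dict[str, float], avg_len: Dict[str, float],
                df: Dict[str, int], n_docs: int, k1: float = 1.2) -> float:
    """Saturate each combined term weight once, then weight it by idf."""
    score = 0.0
    for term in set(query):
        w = bm25f_term_weight(term, doc, field_weights, b, avg_len)
        if w == 0.0:
            continue  # the term does not occur in any field of the document
        idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5))
        score += w / (k1 + w) * idf
    return score


# Invented toy values.
doc = {"title": ["phd", "thesis"],
       "content": ["lifelogs", "thesis", "context", "lifelogs"]}
field_weights = {"title": 3.0, "content": 1.0}
b = {"title": 0.5, "content": 0.5}
avg_len = {"title": 2.0, "content": 4.0}
df = {"thesis": 50, "lifelogs": 10}
n_docs = 1000

w_thesis = bm25f_term_weight("thesis", doc, field_weights, b, avg_len)
score = bm25f_score(["thesis", "lifelogs"], doc, field_weights, b, avg_len,
                    df, n_docs)
print(w_thesis, score)
```

Because the field contributions are combined before saturation, a term occurring in both the title and the content is saturated once, rather than receiving a separately saturated score from each field as it would under a linear sum of field scores.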
BM25F provides a simple but effective state-of-the-art solution to flat querying on multi-field documents, and we investigate its utility in PL retrieval in Chapter 5.4.
However, it should be noted that BM25F was developed for flat querying on collections where query terms may match any number of fields, e.g. abstract, title and main body of text fields, and that its utility was shown for web retrieval [Zaragoza et al., 2004] and for email content retrieval, combined with other query independent measures [Craswell et al., 2005b]. Our multi-field lifelog collections are different in that, from a querying point of view, the fields are independent of each other: different fields do not have common meaning (with the exception of the title and content fields). Mapping of query terms to their target field may therefore be important in flat content+context retrieval approaches for the lifelogging domain. In related work, Kim et al. [Kim et al., 2009, Kim and Croft, 2009] argue that the importance of individual terms to individual fields should be captured in the term’s weight when searching semi-structured documents. Clearly, an approach using structured queries, where the user enters search terms for fields separately, represents one extreme, where the term is only searched for in this field and its presence in other fields makes no contribution to the score. However, as noted previously, users generally prefer to enter simple flat queries; while users know the expected importance of terms to fields, e.g. that the name of a place belongs in the location field, this knowledge is not captured in a flat query.
In their work, Kim et al. [Kim et al., 2009, Kim and Croft, 2009] explore the mapping of flat query terms to semi-structured movie database and desktop collections, where fields from a querying viewpoint have separate meaning. Their retrieval technique for known-item desktop search uses the content field, context fields related to the content of the item (specifically title and abstract), and item specific context data (specifically to, from, URL and modified date). The focus of their investigations was how to combine fields for scoring within the language modelling IR framework. They found it beneficial to adapt term scoring so that the term score for each field is weighted according to the frequency of the term in the given field relative to its frequency across all fields in the document, with the expectation that this maps query terms to their target field.
In so doing, they form a type of structured query where the presence of a query term is treated individually for each field. This process is referred to in the literature as query transformation [Croft, 2009], and is attractive since it introduces some of the strengths of structured queries, by mapping query terms to their target fields, without requiring users to understand or engage in the process of creating such queries. In Chapter 5.5 we examine the application of Kim et al.’s [Kim et al., 2009, Kim and Croft, 2009] query transformation approach to PL retrieval, and based on the findings of this study develop a novel approach to query transformation for PL retrieval.
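The term-to-field weighting described above can be sketched as follows. This is only an illustration of the idea as summarised in this section, not Kim et al.’s published language modelling formulation: the function names are hypothetical, the per-field scorer is a deliberately simple relative frequency, and the document is invented.

```python
from typing import Callable, Dict, List

Doc = Dict[str, List[str]]  # field name -> tokens of that field


def field_mapping_weights(term: str, doc: Doc) -> Dict[str, float]:
    """Share of the term's occurrences in the document that fall in each field."""
    counts = {field: tokens.count(term) for field, tokens in doc.items()}
    total = sum(counts.values())
    if total == 0:
        return {field: 0.0 for field in doc}
    return {field: c / total for field, c in counts.items()}


def relative_frequency(term: str, field: str, doc: Doc) -> float:
    """Placeholder per-field term score: frequency of the term within the field."""
    tokens = doc[field]
    return tokens.count(term) / len(tokens)


def mapped_score(query: List[str], doc: Doc,
                 field_term_score: Callable[[str, str, Doc], float]) -> float:
    """Sum per-field term scores, each weighted by its term-to-field mapping weight."""
    score = 0.0
    for term in query:
        for field, weight in field_mapping_weights(term, doc).items():
            if weight > 0:
                score += weight * field_term_score(term, field, doc)
    return score


# Invented lifelog-style item with a location context field.
doc = {
    "title": ["trip", "photos"],
    "content": ["dublin", "trip", "with", "friends", "trip"],
    "location": ["dublin"],
}
dublin_weights = field_mapping_weights("dublin", doc)
dublin_score = mapped_score(["dublin"], doc, relative_frequency)
print(dublin_weights, dublin_score)
```

The user enters only the flat term "dublin", yet the score is assembled field by field, which is the sense in which the flat query is transformed into a type of structured query.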