Federated Search and Distributed Information Retrieval

2.2 Result Organization and Aggregation

2.2.3 Federated Search and Distributed Information Retrieval

user interface for searching information using multiple search engines [Avrahami et al., 2006]. The name federated search arose in the database research community; in the information retrieval (IR) research community the problem was usually described as distributed information retrieval (DIR) [Avrahami et al., 2006]. Federated search is therefore also referred to as distributed information retrieval [Si and Callan, 2005; Shokouhi et al., 2007].

[Craswell, May, 2000] describes the problem of distributed information retrieval as a situation in which documents are spread across many document servers, and an effective information retrieval system is required to access these distributed documents to satisfy users' information needs. In a distributed information retrieval scenario, an information retrieval system available across the network is called a search server, and it is accessed using a search client.

In DIR, a search broker is a sophisticated search client which, when given a query and a set of search servers, selects the servers that are likely to provide relevant documents in response to the query. The broker then sends the query to the selected servers, and finally produces a merged list of ranked documents from the sets of documents returned by each server. [Craswell, May, 2000] defines the three main tasks performed by the broker as selection, retrieval and merging. The information flow during these three tasks is shown in Figure 2.3.

During server selection the broker selects a subset $S'$ of the servers $S$ that are best for answering the user's query $q$. Next, during retrieval the broker applies the query $q$ at the servers $S'$ to obtain results lists $R_1, R_2, \ldots, R_{|S'|}$. Finally, during results merging the broker combines the results $R_1, R_2, \ldots, R_{|S'|}$ into a merged results list $R_M = \langle D_M, o_M \rangle$, such that $D_M = D_1 \cup \ldots \cup D_{|S'|}$ and $o_M$ is an effective ranking.
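The three broker tasks can be illustrated with a minimal Python sketch. It is a toy under stated assumptions, not the architecture of [Craswell, May, 2000]: the `Server` class, the term-count scoring, the coverage-based selection rule and the max-normalised merge are all hypothetical choices made only for illustration. In particular, the selection step peeks at each server's documents as a stand-in for a real resource description.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# A results list is a ranked list of (doc_id, score) pairs, best first.
ResultList = List[Tuple[str, float]]


@dataclass
class Server:
    """Hypothetical search server holding a small document collection."""
    name: str
    docs: Dict[str, str]  # doc_id -> text

    def search(self, query: str) -> ResultList:
        """Retrieval: score each document by its count of query terms."""
        terms = query.lower().split()
        scored = []
        for doc_id, text in self.docs.items():
            words = text.lower().split()
            score = float(sum(words.count(t) for t in terms))
            if score > 0:
                scored.append((doc_id, score))
        return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


def select_servers(servers: List[Server], query: str, k: int) -> List[Server]:
    """Selection: choose S' as the k servers with the most matching documents.

    Assumption: the broker can see the documents directly. A real broker
    would consult a resource description instead.
    """
    terms = set(query.lower().split())

    def coverage(server: Server) -> int:
        return sum(1 for text in server.docs.values()
                   if terms & set(text.lower().split()))

    return sorted(servers, key=lambda s: (-coverage(s), s.name))[:k]


def merge(result_lists: List[ResultList]) -> ResultList:
    """Merging: D_M is the union of all returned documents; the ordering o_M
    ranks them by score divided by the top score of their own list."""
    merged: Dict[str, float] = {}
    for results in result_lists:
        if not results:
            continue
        top = max(score for _, score in results)
        if top <= 0:
            top = 1.0
        for doc_id, score in results:
            merged[doc_id] = max(merged.get(doc_id, 0.0), score / top)
    return sorted(merged.items(), key=lambda pair: (-pair[1], pair[0]))


def broker(servers: List[Server], query: str, k: int = 2) -> ResultList:
    """Run selection, retrieval and merging in sequence."""
    selected = select_servers(servers, query, k)
    result_lists = [server.search(query) for server in selected]
    return merge(result_lists)


# Illustrative data only.
servers = [
    Server("a", {"a1": "federated search systems", "a2": "search engines search"}),
    Server("b", {"b1": "cooking recipes"}),
    Server("c", {"c1": "distributed search"}),
]
merged = broker(servers, "search", k=2)
```

With this toy data the broker selects servers `a` and `c`, and the merged list ranks `a2` and `c1` above `a1`. Normalising by each list's top score is only one way to make scores from different servers comparable; other merging strategies fit the same three-stage structure.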

[Si and Callan, 2005] divides the problem of federated search and DIR into three sub-problems: resource description, resource selection and result merging, where resources correspond to the servers defined by [Craswell, May, 2000]. Dedicated bodies of research have focused either on individual sub-problems of DIR and federated search (e.g., Luo-05, Shokouhi-07), or on the overall problem of DIR (e.g., [Craswell, May, 2000]) and federated search [Avrahami et al., 2006].

For instance, [Si and Callan, 2005] proposes a federated search technique that uses utility maximization to model the retrieval effectiveness of each search engine in a federated search environment, while [Avrahami et al., 2006] discusses a prototype federated search system developed for the U.S. government's FedStats Web portal and the issues addressed in adapting research solutions to this operational environment. A collection-selection method based on the ranking of downloaded sample documents was proposed by [Shokouhi, 2007]. Furthermore, [Shokouhi et al., 2007] focused on the problem of maintaining representation sets for dynamically changing, uncooperative, distributed collections, suggesting that as collections evolve over time, collection representations should also be updated to reflect any change.
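The general idea of selecting collections from the ranking of sampled documents can be sketched as follows. This is a simplified illustration and not a reproduction of the method in [Shokouhi, 2007]: the term-count scoring, the linear rank weights and the names `rank_samples`, `select_collections`, `top_n` and `k` are all assumptions introduced here.

```python
from typing import Dict, List, Tuple

# Each sample is (collection_name, text) and stands for a document
# previously downloaded from that collection.
Sample = Tuple[str, str]


def rank_samples(samples: List[Sample], query: str) -> List[str]:
    """Rank sampled documents by query-term count and return the name of the
    collection that owns each matching sample, best first."""
    terms = query.lower().split()
    scored = []
    for index, (collection, text) in enumerate(samples):
        words = text.lower().split()
        score = sum(words.count(t) for t in terms)
        if score > 0:
            scored.append((score, index, collection))
    # Higher score first; ties keep the original sample order.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [collection for _, _, collection in scored]


def select_collections(samples: List[Sample], query: str,
                       top_n: int = 3, k: int = 1) -> List[str]:
    """Score each collection by the positions of its samples in the top_n
    ranked samples (earlier positions weigh more) and return the best k."""
    owners = rank_samples(samples, query)[:top_n]
    votes: Dict[str, int] = {}
    for rank, collection in enumerate(owners):
        votes[collection] = votes.get(collection, 0) + (top_n - rank)
    return sorted(votes, key=lambda c: (-votes[c], c))[:k]


# Illustrative data only.
samples = [
    ("news", "search engines index the web"),
    ("news", "stock market report"),
    ("library", "federated search across library catalogues"),
    ("library", "search search interfaces for digital libraries"),
    ("recipes", "bread recipe"),
]
best = select_collections(samples, "search", top_n=3, k=2)
```

For the query `search`, the `library` collection is ranked ahead of `news` because two of its samples appear in the top three. If the collections change over time, the `samples` list would need to be refreshed for the scores to stay meaningful, which is the maintenance concern raised by [Shokouhi et al., 2007].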


FIGURE 2.3: Search broker information flow. Given $S$ and $q$, the broker selects a subset $S' \subset S$, retrieves $R_1, R_2, R_3$ from those servers and builds the merged list $R_M$. The query $q$ usually guides each stage of the process, although in certain cases it may be ignored, for example if the broker's policy is to always select all its servers, $S' = S$ (figure taken from [Craswell, May, 2000]).

2.2.3.1 Metasearch

Metasearch can be considered an application of DIR and federated search. Building on the same underlying fundamentals, metasearch also aims to provide unified access to information stored in the databases of multiple search engines. When a metasearch engine receives a query from a user, it invokes the underlying search engines to retrieve useful information for the user [Meng et al., 2002]. The similarity between the components of a metasearch architecture and a federated search architecture can be seen in Figure 2.4 and Figure 2.5.

Metasearch primarily focuses on web collections and aggregates results from various search engines, whereas some DIR and federated search systems may focus on a specific database collection (e.g., FedLemur, which was developed for the US government's FedStats Web portal by [Avrahami et al., 2006]).

A metasearch engine sends a user query to several other search engines and/or databases and either aggregates the results into a single list or displays them according to their source. Metasearch engines enable users to enter search criteria once and access several search engines simultaneously. They operate on the premise that the web is too large for any one search engine to index it all, and that more comprehensive search results can be obtained by combining results from several search engines. This may also save the user from having to use multiple search engines separately.
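Aggregating results into a single list can be illustrated with a Borda-style rank fusion, one well-known way of combining ranked lists. The sketch below is an assumption-laden toy and is not claimed to be the method of any work cited in this section: the function name `borda_fuse`, the engine names and the rule that unranked documents receive zero points are illustrative simplifications. It also assumes that each document appears at most once in each engine's list.

```python
from collections import defaultdict
from typing import Dict, List


def borda_fuse(rankings: Dict[str, List[str]]) -> List[str]:
    """Fuse the ranked lists returned by several engines into a single list.

    With n distinct documents in the pool, a document at 0-based position r
    in an engine's list earns n - r points from that engine. Documents an
    engine did not return earn nothing from it (a simplification).
    """
    pool = {doc for ranked in rankings.values() for doc in ranked}
    n = len(pool)
    points: Dict[str, int] = defaultdict(int)
    for ranked in rankings.values():
        for position, doc in enumerate(ranked):
            points[doc] += n - position
    # Highest total first; ties are broken alphabetically for determinism.
    return sorted(pool, key=lambda doc: (-points[doc], doc))


# Illustrative data only: three hypothetical engines and their ranked results.
rankings = {
    "engine_a": ["u1", "u2", "u3"],
    "engine_b": ["u2", "u4"],
    "engine_c": ["u2", "u1"],
}
fused = borda_fuse(rankings)
```

Here `u2` is ranked first because all three engines return it near the top, followed by `u1`, `u4` and `u3`. Because the fusion uses only rank positions, it does not require the engines to expose comparable relevance scores.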

Dedicated bodies of work have addressed various aspects of metasearch (e.g., [Aslam and Montague, 2001; Wu et al., 2001; Meng et al., 2001, 2002; Thomas and Hawking, 2009; Thomas et al., 2010]). For instance, [Thomas et al., 2010] investigated which user interfaces might be appropriate for presenting results from more than one source. Efficient ways of selecting search engines (servers) were suggested by [Dreilinger and Howe, 1997; Desai et al., 2006; Thomas and Hawking, 2009]. Furthermore, different models and frameworks for metasearch were proposed by [Glover et al., 1999; Aslam and Montague, 2001; Aslam et al., 2003a].

The architecture proposed by [Glover et al., 1999] was designed to consider users’ information need as well. [Aslam et al., 2003b] proposed a unified framework for simultaneously solving both the pooling problem (the construction of efficient document pools for the evaluation of retrieval systems) and metasearch (the fusion of ranked lists returned by retrieval systems in order to increase performance).

Techniques like clustering and distributed information retrieval have facilitated easier access to more information, but these approaches have been mostly limited to ‘text’ sources of similar genres or single media type (for example, image collections in faceted browsing).

In document Study of result presentation and interaction for aggregated search (Page 48-51)