Source selection and representation - Approaches to implement and evaluate aggregated search

Some aggregated search systems use all the sources at their disposal for every query, while others are source selective, i.e. they use only the sources considered useful for the query. Because most of the major current approaches are source selective, deciding which sources are useful (relevant) becomes one of the primary issues in cross-vertical aggregated search. In research, this corresponds to the vertical selection problem [14], which we will call source selection so that the term covers both Web search and vertical search. Source selection consists in predicting whether a source is relevant or not for a given query or information need.

Typically, source selection aims to avoid the delays that would occur if many sources were queried and the system had to wait for all their results. Sources are therefore typically selected before results are retrieved. To enable efficient selection, each source has an internal representation in the system. This representation can be as simple as a textual description of the source, but in general it contains representative terms and features for the source, extracted through sampling or other mining techniques. We list some of these techniques below, starting from the perspective of federated search.

The source selection and source representation problems were first encountered in federated search, where the term resource is used as a synonym of source. In federated search, it is common to distinguish between cooperative and non-cooperative sources. In the case of cooperative sources, we can access the collection of each source and thus compute useful statistics on the collection terms. In federated search, however, sources are mostly non-cooperative, so we cannot access the entire collection. In that case, the source representation can be a manually written description of the source [65], although this is generally not enough. Typically, a sample of representative documents is obtained by issuing queries to the source. There are two main approaches. In the first approach [38], an initial representation is built using the top search results for one seed query. The representation is then used to generate new queries, whose results in turn update the representation. Alternatively, Shokouhi et al. [160] show that better results are obtained when the most frequent queries are used to generate the sample. In the case of cooperative sources, we can use the whole collection to select a set of representative terms and features for the source.
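The sampling loop of the first approach can be illustrated with a minimal sketch. It assumes a hypothetical `search` callable standing in for the source's query interface, whitespace tokenization, and a simple heuristic for choosing the next probe query (the most frequent sampled term not yet issued). None of these choices is taken from [38]; they only illustrate the general idea of query-based sampling.

```python
from collections import Counter
from typing import Callable, List


def sample_source(search: Callable[[str], List[str]], seed_query: str,
                  rounds: int = 5, top_k: int = 3) -> Counter:
    """Build a term-frequency representation of a source by repeated probing."""
    representation: Counter = Counter()
    seen_docs = set()
    issued = set()
    query = seed_query
    for _ in range(rounds):
        issued.add(query)
        # Only the top results of each probe query are added to the sample.
        for doc in search(query)[:top_k]:
            if doc not in seen_docs:
                seen_docs.add(doc)
                representation.update(doc.lower().split())
        # Assumed heuristic: probe next with the most frequent sampled term
        # that has not been issued yet.
        candidates = [t for t, _ in representation.most_common() if t not in issued]
        if not candidates:
            break
        query = candidates[0]
    return representation


# Toy non-cooperative source: it can be queried, but its collection is hidden.
DOCS = ["used car prices", "car engine repair", "engine oil change", "oil prices rise"]


def toy_search(query: str) -> List[str]:
    return [d for d in DOCS if query in d.split()]


rep = sample_source(toy_search, "car", rounds=3, top_k=2)
```

In this toy run the sample never reaches the document "engine oil change", which shows why the choice of probe queries matters for the quality of the representation.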

In cross-vertical aggregated search, there are also approaches where the source is selected after retrieval, i.e. given some of the search results the source has to offer, it is decided whether the source is useful or not. Thus, when we speak of the internal representation of sources in terms of features, we distinguish between pre-retrieval features and post-retrieval features. All these features are also presumed to be useful for ranking and assembling results.

Table 5.1 lists common pre-retrieval features used in the literature, together with the works in which they appear. In [12, 139, 14], the authors rely on terms that indicate vertical intent. Some of them are obvious and can be entered manually, such as the terms “photo”, “image”, “video” and “clip”. Others are derived from the source representation or from external sources. In [14], the authors propose mapping verticals to Wikipedia articles, which tend to be uniform and verbose. For instance, a vertical search engine on autos can be mapped to the Wikipedia articles of the “Automobile”, “Car” and “Vehicle” categories. This is particularly useful for sources that contain little textual data.
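As an illustration of manually entered intent terms, the following minimal sketch turns them into binary pre-retrieval features. The grouping of the four terms into an "images" and a "video" vertical, as well as the names `INTENT_TERMS` and `vertical_intent_features`, are assumptions made for the example and do not come from the cited works.

```python
from typing import Dict, Set

# Hypothetical hand-written intent terms per vertical (the grouping is assumed).
INTENT_TERMS: Dict[str, Set[str]] = {
    "images": {"photo", "image"},
    "video": {"video", "clip"},
}


def vertical_intent_features(query: str,
                             intent_terms: Dict[str, Set[str]] = INTENT_TERMS) -> Dict[str, int]:
    """Return 1 for each vertical whose intent terms occur in the query, else 0."""
    tokens = set(query.lower().split())
    return {vertical: int(bool(tokens & terms))
            for vertical, terms in intent_terms.items()}


features = vertical_intent_features("funny cat video")
```

Learned associations, such as terms mined from Wikipedia articles mapped to a vertical, could be added to the same dictionary without changing the feature function.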

Queries are also associated with predefined categories such as “sports”, “arts” and “technology”. This is done in [139, 12, 167]. In [12], the authors map queries to 30 topical categories derived from the Open Directory Project (ODP). Another important source of evidence is found in the query logs of the sources. If a query has already been issued to one of the verticals through the vertical's own search interface, this is a strong indicator of vertical intent. This approach is used in [12, 14].

Table 5.2 lists common post-retrieval features used in the literature, together with the works in which they appear. This set of features has been shown to be very useful for scoring sources and blocks of results [12]. In the case of cooperative sources, we may have access to the relevance scores of the source itself or to the source's confidence in the utility of these results [139]. Nevertheless, in general we need to compute features that are uniform across sources in order for scores to be comparable. Ponnuswami et al. [139] compute the BM25 weighting function, while Arguello et al. [12] combine four different scores: (1) the cosine similarity between the query and the document/title representation, (2) the maximum number of query terms appearing consecutively in the document/title representation, (3) the percentage of query terms appearing in the document/title representation, and (4) the percentage of the document/title representation that matches query terms.
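The four query-result match scores listed above can be sketched as follows. This is only one possible reading of the description, not the implementation of [12]: it assumes lowercased whitespace tokenization, raw term-frequency vectors for the cosine, and interprets score (2) as the longest run of consecutive document tokens that are all query terms. All function names are hypothetical.

```python
import math
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    return text.lower().split()


def cosine(query: List[str], doc: List[str]) -> float:
    """(1) Cosine similarity between raw term-frequency vectors."""
    qv, dv = Counter(query), Counter(doc)
    dot = sum(qv[t] * dv[t] for t in qv)
    norm = (math.sqrt(sum(c * c for c in qv.values()))
            * math.sqrt(sum(c * c for c in dv.values())))
    return dot / norm if norm else 0.0


def max_consecutive_query_terms(query: List[str], doc: List[str]) -> int:
    """(2) Longest run of adjacent document tokens that are query terms."""
    terms = set(query)
    best = run = 0
    for token in doc:
        run = run + 1 if token in terms else 0
        best = max(best, run)
    return best


def query_coverage(query: List[str], doc: List[str]) -> float:
    """(3) Fraction of distinct query terms that appear in the document."""
    terms, doc_terms = set(query), set(doc)
    return len(terms & doc_terms) / len(terms) if terms else 0.0


def doc_coverage(query: List[str], doc: List[str]) -> float:
    """(4) Fraction of document tokens that are query terms."""
    terms = set(query)
    return sum(1 for t in doc if t in terms) / len(doc) if doc else 0.0


q = tokenize("cheap used car")
title = tokenize("used car dealers sell cheap car parts")
scores = (cosine(q, title), max_consecutive_query_terms(q, title),
          query_coverage(q, title), doc_coverage(q, title))
```

Because all four scores are computed from the query and the returned text alone, they are comparable across sources regardless of how each source scores its own results.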

The number of results returned by a source can also be a source of evidence. For some sources, the freshness of the returned results is more indicative. For instance, microblog posts are usually preferred when they have been published recently.

Source selection approaches generally correspond to supervised classifiers [104, 53, 14] that are trained on 3-tuples (query, source, relevance). The relevance assessments have been collected through human assessments on queries and intents [14] and through implicit feedback derived from click-through analysis [53]. These techniques will be described in detail in the next section.
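To make the training setup concrete, here is a minimal sketch of a classifier trained on (query, source, relevance) triples. It assumes a logistic regression trained by stochastic gradient descent and two toy pre-retrieval features (an intent-term match and a query-log hit). The cited works do not necessarily use this model or these features, and the toy `INTENT_TERMS`, `QUERY_LOGS` and training triples are invented for illustration.

```python
import math
from typing import List, Tuple

# Hypothetical toy resources used by the feature extractor.
INTENT_TERMS = {"video": {"video", "clip"}, "images": {"photo", "image"}}
QUERY_LOGS = {"video": {"cat video"}, "images": set()}


def featurize(query: str, source: str) -> List[float]:
    """Two pre-retrieval features: intent-term match and query-log hit."""
    tokens = set(query.lower().split())
    has_intent_term = float(bool(tokens & INTENT_TERMS[source]))
    seen_in_log = float(query.lower() in QUERY_LOGS[source])
    return [has_intent_term, seen_in_log]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def train(triples: List[Tuple[str, str, int]], epochs: int = 200,
          lr: float = 0.5) -> Tuple[List[float], float]:
    """Fit logistic-regression weights with plain stochastic gradient descent."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for query, source, relevance in triples:
            x = featurize(query, source)
            p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            error = p - relevance
            weights = [w - lr * error * xi for w, xi in zip(weights, x)]
            bias -= lr * error
    return weights, bias


def predict(model: Tuple[List[float], float], query: str, source: str) -> float:
    """Estimated probability that the source is relevant for the query."""
    weights, bias = model
    x = featurize(query, source)
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


triples = [
    ("cat video", "video", 1), ("funny clip", "video", 1),
    ("weather paris", "video", 0), ("tax forms", "video", 0),
    ("paris photo", "images", 1), ("tax forms", "images", 0),
]
model = train(triples)
p_relevant = predict(model, "dog video", "video")
p_irrelevant = predict(model, "tax return", "video")
```

A source would then be selected for a query when the predicted probability exceeds a threshold, which can be tuned to trade coverage against the cost of showing irrelevant vertical results.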

Table 5.1: Pre-retrieval features

| Feature | Based on | Description |
|---|---|---|
| Vertical intent terms | [12, 139, 14] | Some terms indicate vertical intent, such as image, video, photo. Some others can be related. This feature combines hard-coded and learned association rules for queries and sources. |
| Query logs | [12, 14] | These features indicate if the query has been met in a source query log. |
| Recent popularity of the query | [53] | This feature indicates how often the query has been met in a source query log recently. |
| Click-through | [12, 139, 104] | These features are generated from the documents that have been clicked for the query. The click is considered implicit feedback. |
| Query domain | [14] | This feature is generated through classification of the query into predefined domains such as sport, arts, technology. |
| IsNavigational | [139] | This feature indicates the chances of the query to come from navigational needs. |
| Query length | [139] | This feature corresponds to the length of the query in terms. |
| Relevance feedback | [54] | This feature is computed based on explicit and implicit feedback on the query and its intent. |
| Named-entity type | [12] | These features indicate the presence of named entities of some type in the query. |

Table 5.2: Post-retrieval features

| Feature | Based on | Description |
|---|---|---|
| Vertical relevance score | [53] | This feature corresponds to the relevance score of the vertical source itself for a result or a block of results. |
| Query-results match | [12, 139, 104] | These features correspond to a match score computed on one or more search results from the source. This score can be returned by the source itself or computed from scratch on the result. |
| Number of results | [12] | This feature corresponds to the results count of a source. |
| Freshness of documents | [12, 53] | This feature indicates how fresh the search results are for the query within a source. |
| Contextual score of results | | These features indicate the relatedness of search results with respect to some context. |
| Geographic-context score of results | | These features indicate the relatedness of search results with respect to some geographic context. |

5.4 Result aggregation

There are different ways to assemble results in cross-vertical aggregated search. Some approaches rely on source selection [104, 53, 14]: they integrate search results from a source on top of the Web search results only when the source is deemed relevant. Others go beyond source selection and rank results from different sources against each other. There are two main approaches in this direction: some rank blocks of results from the same source [139, 12], while others rank results one by one [108].

Diaz [53] studies the integration of news search results within Web search. The approach estimates the newsworthiness of a query to decide whether to introduce news results on top of the Web search results, and uses click-through feedback to recover from system errors. Liu et al. [104] also rely on click-through data to extract implicit feedback for source selection. They represent queries and documents as a bipartite graph and propagate implicit feedback through this graph. Their approach is evaluated on the integration of product search and job search results.

Arguello et al. [14] list various sources of evidence that can be used to tackle source selection, such as query-log features, vertical intent terms and corpus features. In later work [54], they show how implicit and explicit feedback can be integrated for vertical selection. From a set of 25,195 labeled queries, they propagate implicit feedback across about ten million queries. In [15], the same authors show how to adapt source selection methods to new unlabeled verticals.

In [139] and [12], search results are ranked in blocks of vertical content (e.g. 3 images versus 3 Web search results). Ponnuswami et al. [139] use pairwise preferences between a block of Web search results and a block of vertical search results as training data. Similarly, Arguello et al. [12] rely on pairwise preferences to train their ranking functions. They experiment with three ranking techniques: one derived from classification, a voting approach, and learning-to-rank techniques. Learning-to-rank techniques are shown to work best.
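To illustrate how pairwise preferences between blocks can yield a block ranking, the following minimal sketch orders blocks by the number of pairwise comparisons they win. This simple win-counting scheme is an assumption chosen for illustration and is not necessarily the voting approach evaluated in [12]. The block names and preferences are invented.

```python
from collections import Counter
from typing import List, Tuple


def rank_blocks(blocks: List[str], preferences: List[Tuple[str, str]]) -> List[str]:
    """Order blocks by number of pairwise wins; ties keep the input order."""
    wins = Counter({block: 0 for block in blocks})
    for winner, _loser in preferences:
        wins[winner] += 1
    return sorted(blocks, key=lambda block: (-wins[block], blocks.index(block)))


# Each pair (a, b) means "block a is preferred over block b" for the query.
preferences = [("images", "web"), ("web", "news"), ("images", "news")]
ranking = rank_blocks(["web", "news", "images"], preferences)
```

In a learned setting, the preference pairs would come from a trained pairwise model rather than being given directly.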

In [108], Liu et al. define a probabilistic model that enables ranking search results from different sources. In contrast with other approaches, they rank single items (search results) instead of blocks of search results. Although the probabilistic model is interesting, the probability estimates are not convincing.

Current result aggregation approaches are inspired by the approaches taken by major search engines. We need a more flexible framework for result aggregation, one that enables different ways of putting results together and preferably requires little or no training.
