Evaluation - Approaches to implement and evaluate aggregated search

To date, several evaluation methodologies have been used to measure the effectiveness of cross-vertical aggregated search. Because existing techniques were designed with different goals, they are quite heterogeneous. We can classify them with respect to their target. In [14, 104, 108], the main goal is to evaluate source selection. In [166, 168, 173], the main goal is to compare cross-vertical aggregated search interfaces. In [13], the authors target result ranking. We describe the different evaluation approaches here.

One common protocol for evaluating source selection is to ask human participants to choose the relevant sources for a query [14, 104, 108]. Liu et al. [108] performed this kind of relevance assessment on 2153 generic Web queries. In [14], human judges classified 25195 queries. This kind of assessment is fast, but not necessarily accurate. The human judge might not guess the real information need or might neglect some interpretations of the query, and some queries might demand specific knowledge.

In [166, 168], Sushmita et al. compare the effectiveness of different interfaces for cross-vertical aggregated search. They show that users find more relevant results when vertical results are placed together with Web results. They also show that placing vertical results at the top, at the bottom or in the middle of the search results can affect the number of vertical search results accessed by users. In both studies, participants are shown concrete search results from the considered sources. They are also given the information need behind the query. They have to click on results and bookmark the ones that are relevant. This approach is closer to traditional IR evaluation: the information need is not ambiguous and assessors can access real search results.

Instead of relying on human participants, relevance assessments have been simulated using click-through logs [167, 53, 168]. In [53], Diaz shows that queries which obtain a high click-through rate within news results are likely to be newsworthy. Click-through logs are also used in [168], where it is shown that click-through behavior is different for some sources, such as video. Although click-through logs enable large-scale automatic evaluation, they cannot be as realistic as human-based evaluation.
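As a rough illustration of how click-through logs can stand in for human judgments, the sketch below computes, per query, the click-through rate on a vertical's results and labels the query as relevant to that vertical when the rate reaches a threshold. The log format, the function names and the threshold value are assumptions made for illustration only; they are not taken from [53] or [168].

```python
from collections import defaultdict

def click_through_rates(impressions):
    """Return the click-through rate per query.

    `impressions` is an iterable of (query, was_clicked) pairs, one pair per
    time the vertical's results were shown for that query (hypothetical format).
    """
    shown = defaultdict(int)
    clicked = defaultdict(int)
    for query, was_clicked in impressions:
        shown[query] += 1
        if was_clicked:
            clicked[query] += 1
    return {q: clicked[q] / shown[q] for q in shown}

def simulated_judgments(impressions, threshold=0.5):
    """Label a query as relevant to the vertical when its CTR >= threshold.

    The threshold is an arbitrary illustrative value, not a published one.
    """
    rates = click_through_rates(impressions)
    return {q: rate >= threshold for q, rate in rates.items()}

# Toy log, invented for illustration.
log = [
    ("earthquake", True),
    ("earthquake", True),
    ("earthquake", False),
    ("python tutorial", False),
    ("python tutorial", False),
]
rates = click_through_rates(log)
judgments = simulated_judgments(log)
```

In this toy log the first query is clicked in two out of three impressions and would be labeled relevant to the vertical, while the second is never clicked and would not.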

Recently, Arguello et al. [13] proposed a methodology to evaluate result ranking. Relevance assessments are pair-wise preferences between result sets, where each result set contains results from one source. This work does not focus on the notion of source relevance, but rather on the relative effectiveness of ranking.

Zhou et al. [196] propose building an evaluation benchmark for cross-vertical aggregated search through the re-use of existing evaluation benchmarks. They make use of the ClueWeb track in TREC [46]. They artificially build vertical collections using classification. Then, they choose topics that cover many verticals. We can see this work as a step towards evaluation benchmarks, although a more substantial effort is needed in this direction to make the distribution of topics, sources and assessments more realistic.

In general, evaluation of cross-vertical aggregated search remains an open problem. There are different types of relevance assessment and different measures, and there is no common agreement yet. In particular, it is not clear which advantages of this approach should be prioritized and how they should be assessed. We know that cross-vertical aggregated search can provide focus and diversity, but we do not know why and to what extent this improves information retrieval. More research is needed on the interest and evaluation of cross-vertical aggregated search.

5.7 Conclusions

In this chapter we provided an overview of cross-vertical aggregated search. We have organized related work around some main issues: source selection, result aggregation, result presentation and evaluation. Some of the related work is inspired by federated search, while some is quite novel. Source selection has to take into account the specificities of vertical search and Web search. Result aggregation and presentation approaches go beyond the uniform ranked list. Evaluation techniques are also specific to the targeted issue (e.g. source selection, result aggregation, result presentation).

However, research in this direction remains quite heterogeneous and should be brought to converge. We need to clearly identify the novel issues and advantages of cross-vertical aggregated search. The advantages of combining diverse sources remain to be explored. We have also seen that there are different ways to assemble and present results, which can affect user satisfaction. Existing evaluation techniques rely on binary relevance assessments but use different setups. To better capture the utility of cross-vertical aggregated search, we need to investigate further the notion of relevance for cross-vertical aggregated search and the ways to capture it realistically. This will also be the goal of our contribution in this research direction.

Moreover, vertical search engine results can be combined in tasks and uses other than Web search. We believe that the potential of this domain is yet to be discovered and that research output will be prolific in the coming years.

Part 3: Relational aggregated search: Attribute retrieval, result aggregation, instance lists and applications

Attribute retrieval

6.1 Introduction

Semi-structured HTML data (tables and lists) are probably the largest source of relational data on the Web [37]. Because we target high recall for relational aggregated search, both in terms of the queries that can be answered and the information that can be retrieved, we propose approaches that rely on this kind of data. To do so, we rely neither on patterns for text extraction [9, 58] nor on precise wrapper induction techniques [41]. Our work is mostly inspired by the work of Cafarella et al. [37]. Their work is, to our knowledge, the largest mining of HTML tables and HTML lists for search purposes. They show how to identify relational tables and lists, although they neither specifically extract attributes or instances from them nor perform search for them. Our work is different because we explicitly extract attributes and instances and perform query matching, i.e. we rank by relevance.

In this chapter, we propose an approach for attribute retrieval, i.e. given a query (an instance or a set of instances), our approach extracts candidate attributes and ranks them by relevance to the query. We will show in the next chapter how these attributes can be used to build aggregated information retrieval answers. In chapter 8, we present our research on HTML lists to extract sets of instances. In chapter 9, we apply our research to build relational aggregated search prototypes.

As mentioned above, our attribute retrieval approach relies on relational HTML tables. Although tables are probably the largest source of relational data [37], the task is not easy for the following reasons. Many HTML tables are not in a relational form, especially the ones used for layout design and navigation. Some tables are useful but not uniformly relational (see table 2 in figure 6.1). In addition, relevant tables are not easy to detect: a table can be relevant for an instance even when the instance is not present within the table text.

Our approach takes the above issues into account and is designed to work at large scale with high recall. It is composed of three main steps:

∙ retrieval of a seed of potentially relevant tables

∙ filtering of useless tables and attributes

∙ ranking of attributes by relevance features
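The three steps above can be sketched as a small pipeline. This is only a minimal illustration under strong assumptions: the `Table` structure, the substring-based seed retrieval, the size-based filter and the frequency-based ranking are all placeholders chosen for brevity, not the retrieval, filtering or relevance features developed in this chapter.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class Table:
    page_text: str  # text of the page that holds the table (hypothetical field)
    header: list    # candidate attribute names
    rows: list      # data rows, each a list of cell strings

def retrieve_seed(tables, instance):
    """Step 1: keep tables whose surrounding page mentions the instance."""
    needle = instance.lower()
    return [t for t in tables if needle in t.page_text.lower()]

def filter_tables(tables, min_cols=2, min_rows=1):
    """Step 2: drop empty or numeric headers, then drop tables that are too small."""
    kept = []
    for t in tables:
        header = [h.strip() for h in t.header
                  if h.strip() and not h.strip().isdigit()]
        if len(header) >= min_cols and len(t.rows) >= min_rows:
            kept.append(Table(t.page_text, header, t.rows))
    return kept

def rank_attributes(tables):
    """Step 3: rank attributes by the number of tables they appear in."""
    counts = Counter()
    for t in tables:
        counts.update({h.lower() for h in t.header})
    # Sort by decreasing count, then alphabetically to break ties.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [attr for attr, _ in ordered]

# Toy tables, invented for illustration.
tables = [
    Table("University of Strathclyde facts",
          ["Founded", "Location", "Students"], [["a", "b", "c"]]),
    Table("University of Strathclyde overview",
          ["Location", "Motto"], [["b", "d"]]),
    Table("University of Strathclyde site menu", ["Home"], []),
    Table("Unrelated page", ["Price", "Colour"], [["10", "red"]]),
]
seed = retrieve_seed(tables, "University of Strathclyde")
ranked = rank_attributes(filter_tables(seed))
```

Note that a substring match on the page text is a simplification: it lets a table be retrieved even when the instance does not occur in the table itself, but it says nothing about how the seed is actually obtained at large scale.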

To test our approach, we use three search situations. First, we retrieve relevant attributes for one given instance (e.g. “University of Strathclyde”). Second, we retrieve attributes for a given class (e.g. “universities”) represented as a set of instances. Third, we retrieve attributes for one instance when some other similar instances (from the same class) are given. We use the term instance attribute retrieval when we retrieve attributes for one instance. By analogy, we use the term class attribute retrieval when we retrieve attributes for a class. The last search situation corresponds to reinforced instance attribute retrieval. However, instance attribute retrieval is the core problem in all cases.
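One way to picture how the three situations relate is to treat instance-level attribute scores as the building block and derive the other two from them. The sketch below is purely hypothetical: averaging scores over the instances of a class and linearly mixing the target instance's scores with those of similar instances are assumptions made for illustration, and the function names and the `weight` parameter are not part of the approach described here.

```python
def class_attribute_scores(instance_scores):
    """Average attribute scores over a set of instances (assumed aggregation).

    `instance_scores` maps instance -> {attribute: score}.
    """
    if not instance_scores:
        return {}
    totals = {}
    for scores in instance_scores.values():
        for attr, score in scores.items():
            totals[attr] = totals.get(attr, 0.0) + score
    n = len(instance_scores)
    return {attr: total / n for attr, total in totals.items()}

def reinforced_scores(target_scores, similar_scores, weight=0.5):
    """Mix the target instance's scores with those of similar instances.

    `weight` is an illustrative mixing parameter, not a tuned value.
    """
    class_scores = class_attribute_scores(similar_scores)
    attrs = set(target_scores) | set(class_scores)
    return {
        attr: weight * target_scores.get(attr, 0.0)
        + (1 - weight) * class_scores.get(attr, 0.0)
        for attr in attrs
    }

# Toy scores, invented for illustration.
similar = {
    "University of Strathclyde": {"location": 1.0, "motto": 0.5},
    "University of Glasgow": {"location": 1.0, "students": 0.5},
}
class_scores = class_attribute_scores(similar)
reinforced = reinforced_scores({"location": 0.5, "rector": 0.5}, similar)
```

Under these assumptions, an attribute that is weakly supported for the target instance but frequent among similar instances is pushed up in the ranking, which is the intuition behind the term "reinforced".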

Moreover, we compare attribute retrieval from HTML tables with other existing approaches from the state of the art. Concretely, we use the lexico-syntactic rules used in [9] and a combination of DBpedia and Wikipedia.

In the next section, we describe our general approach. Then we describe our experimental setup (section 6.3) and the corresponding results (section 6.4).
