

1.3 The Web of Data

1.3.4 From raw to semi-structured data

Most Web documents currently contain no semantic annotations and are commonly modeled as a simple bag of words. One of the main approaches for enhancing search effectiveness on these documents consists in automatically enriching them with the most relevant related entities [115]. If the queries submitted by users are enriched in the same way, query processing can exploit the semantic relations among the entities involved.

In the literature, this automatic enrichment is known as Entity Linking: given a plain text, the entity linking task aims to identify all the small fragments of text (in the following interchangeably called spots or mentions) referring to any named entity that is listed in a given knowledge base, e.g., Wikipedia. The ambiguity of natural language mentions makes this a non-trivial task. On the one hand, the same entity can be mentioned with different text fragments, e.g., “President Obama” and “Barack Obama”. On the other hand, the same mention may refer to different entities, e.g., “President” may refer to the U.S. president or to Alain Chesnais, the president of the Association for Computing Machinery.

5 See http://www.rottentomatoes.com/m/tron/
6 Available at http://data.sindice.com/trec2011/

38 CHAPTER 1. BACKGROUND

A typical entity linking system performs this task in two steps: spotting and disambiguation. The spotting process identifies a set of candidate spots in the input document, and produces a list of candidate entities for each spot. Then, the disambiguation process selects the most relevant spots and the most likely entities among the candidates. The spotting step exploits a given catalog of entities, or some knowledge base, to devise the possible mentions of entities occurring in the input. One common approach is to resort to Wikipedia [110, 89]: each Wikipedia article can be considered a named entity, and the anchor texts associated with Wikipedia links are a rich source of possible mentions of the linked entity. The spotter can thus scan the input text for any fragment matching one of the Wikipedia mentions, and therefore potentially referring to an entity. Ideally, the spotter should detect all the mentions and find all the possible entities associated with them. The coverage of the source knowledge base and the accuracy of the spotter have a strong impact on the recall of the entity linking system: every entity related to a mention should, if possible, be detected and returned [47].
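The two steps described above can be illustrated with a minimal sketch. It assumes a tiny, hand-written mention dictionary (`ANCHORS`) standing in for the anchor-text statistics that would be mined from Wikipedia; the counts, the entity identifiers, and the function names are invented for illustration. The disambiguation shown is only a simple baseline that picks the entity most frequently linked from each mention. It is not the method of any of the cited systems, and it does not attempt to select the most relevant spots.

```python
# Hypothetical anchor-text statistics: mention -> {entity: link count}.
# The values are invented; a real system would mine them from Wikipedia.
ANCHORS = {
    "barack obama": {"Barack_Obama": 950},
    "president obama": {"Barack_Obama": 400},
    "president": {
        "President_of_the_United_States": 300,
        "President_(corporate_title)": 120,
    },
    "obama": {"Barack_Obama": 800, "Obama,_Fukui": 15},
}
MAX_SPOT_LEN = 3  # longest mention, in tokens, that the spotter tries


def spot(text):
    """Return (start, end, mention, candidates) for each n-gram found in ANCHORS."""
    tokens = text.lower().split()
    spots = []
    for start in range(len(tokens)):
        last_end = min(start + MAX_SPOT_LEN, len(tokens))
        for end in range(start + 1, last_end + 1):
            mention = " ".join(tokens[start:end])
            if mention in ANCHORS:
                spots.append((start, end, mention, ANCHORS[mention]))
    return spots


def disambiguate(spots):
    """Baseline: link each spot to its most frequently linked candidate entity."""
    return [
        (mention, max(candidates, key=candidates.get))
        for _, _, mention, candidates in spots
    ]


if __name__ == "__main__":
    for mention, entity in disambiguate(spot("President Obama visited Pisa")):
        print(mention, "->", entity)
```

Note that the spotter deliberately returns overlapping spots ("president", "president obama", "obama"): favoring recall at this stage leaves the choice among them to the disambiguation step.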

In NLP research, Named Entity Recognition (NER) is a similar and extensively studied problem. The main difference from Entity Linking is that the entities discovered by NER are represented as labeled phrases and are not uniquely identified by a reference to a knowledge base. Bunescu and Pasca [39] propose to link spots to Wikipedia articles (entities): they exploit the Wikipedia categories associated with each article to perform disambiguation. A similar approach is proposed by Cucerzan [50], who creates for each article a vector containing the closest entities and categories. Section 4.3 provides a more detailed description of these approaches [110, 66, 68, 105, 57].

Besides entity linking for documents, an even more challenging issue is entity linking for queries. Annotating keyword queries with entities is difficult mainly for two reasons: i) query terms can have multiple meanings (polysemy) or, dually, the same information need can be expressed with different words (synonymy); ii) while in documents one can exploit the text close to the spot to perform disambiguation, a keyword query usually does not have an exploitable context. Several approaches to query-entity linking have been proposed. Huurnink et al. [72] exploit clicks to map queries to an in-house thesaurus. Other proposals [70, 106] perform approximate matching between the query and the label of the entity. Meij et al. [102] match the query and its n-grams against the entity labels and use machine learning to select the most related entities. Furthermore, in [103] they propose a method, still based on machine learning, that maps queries to DBpedia entities. Interestingly, their feature set includes the user history, which is important because it provides a context for disambiguation (we also rely on that feature in the approach presented in Section 3.6.1).

Entity Retrieval (ER) is a well-known problem, strongly related to entity linking: the task consists in ranking the entities in a knowledge base given a keyword query. Many approaches for ER were proposed and evaluated in several tracks held at the TREC [14] and INEX [52] conferences. These tracks address different entity ranking problems (e.g., rank only entities of a certain type, rank the entities that best represent a query, etc.). The research on ER is extensive [83, 147, 148, 151, 161]. For a complete survey of the methods we refer the reader to Adafre et al. [2]. An interesting approach was proposed by Zwol et al. [148]. They rank facets (i.e., related entities) by learning a model through Gradient Boosted Decision Trees [58], and they evaluate the quality of their model with two different click models: i) click-through rate, and ii) click over expected clicks.
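The two click-based measures mentioned above can be sketched as follows. The formulation is the one commonly used in the click-modeling literature and is an assumption here: the exact definitions adopted in [148] may differ. Click-through rate divides the clicks by the number of impressions, whereas click over expected clicks divides them by the clicks that would be expected from the display position alone, which compensates for position bias. The function names, the `(position, clicked)` log format, and the per-position rates are hypothetical.

```python
def click_through_rate(impressions):
    """impressions: list of (position, clicked) pairs logged for one facet."""
    if not impressions:
        return 0.0
    clicks = sum(1 for _, clicked in impressions if clicked)
    return clicks / len(impressions)


def clicks_over_expected_clicks(impressions, position_ctr):
    """Observed clicks divided by the clicks expected from position alone.

    position_ctr maps a display position to the average click-through
    rate observed at that position over all facets (assumed to be given).
    """
    clicks = sum(1 for _, clicked in impressions if clicked)
    expected = sum(position_ctr[position] for position, _ in impressions)
    return clicks / expected if expected > 0 else 0.0


if __name__ == "__main__":
    # Invented log: the facet was shown twice at position 1 and twice at position 3.
    log = [(1, True), (1, False), (3, True), (3, False)]
    rates = {1: 0.5, 3: 0.25}
    print(click_through_rate(log))
    print(clicks_over_expected_clicks(log, rates))
```

A value above 1 for the second measure indicates that the facet attracts more clicks than its positions alone would explain.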

Chapter 2

Query Biased Snippets

2.1 Introduction

Nowadays a Web Search Engine (WSE) is a very complex software system [8]. It is well known that the goal of a WSE is to answer users’ queries both effectively, with relevant results, and efficiently, in a very short time frame. After a user issues a query, the WSE runs a chain of complex processing phases that produces the Search Engine Results Page (SERP). A SERP contains a short list of results (usually 10). Each result corresponds to a Web page, and contains the title of the page, its URL, and a text snippet, i.e., a brief text summarizing the content of the page.

The real challenge is to scale with the growth of Web documents and users. The services offered by a WSE consume a significant amount of storage and computing resources, and resource demand must be kept as small as possible in order to achieve scalability. During query processing, the document indexes are first accessed to retrieve the list of identifiers of the documents most relevant to the query. Second, a view of each document in the list is rendered. Given a document identifier, the WSE accesses the document repository, which permanently stores on disk the content of the corresponding Web page, and extracts from it a summary, i.e., the snippet. The snippet is usually query-dependent, and shows a few fragments of the Web page that are most relevant to the issued query. The snippets, page URLs, and page titles are finally returned to the user.
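To make the notion of a query-dependent snippet concrete, the following is a minimal sketch of an extractor. It assumes a deliberately naive scoring rule, namely the number of distinct query terms occurring in a sentence, and a regular-expression sentence splitter; production systems use richer signals, and the function names and the sample document are invented for illustration.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def query_biased_snippet(document, query, max_sentences=2):
    """Return the sentences that share the most distinct terms with the query."""
    query_terms = set(query.lower().split())
    sentences = split_sentences(document)
    scored = []
    for index, sentence in enumerate(sentences):
        words = set(re.findall(r"\w+", sentence.lower()))
        score = len(query_terms & words)
        if score > 0:
            scored.append((score, index))
    # Highest score first; ties are broken by position in the document.
    best = sorted(scored, key=lambda pair: (-pair[0], pair[1]))[:max_sentences]
    # Show the selected sentences in their original order.
    ordered = sorted(index for _, index in best)
    return " ... ".join(sentences[index] for index in ordered)


if __name__ == "__main__":
    page = (
        "Pisa is a city in Tuscany. "
        "The leaning tower is its best known landmark. "
        "Many tourists visit the tower of Pisa every year. "
        "The river Arno crosses the city."
    )
    print(query_biased_snippet(page, "tower pisa"))
```

Even in this toy form, the extractor has to read and scan the full text of every result, which is why the document repository access discussed above weighs on query processing.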

Snippets are fundamental for users to estimate the relevance of the returned results: high-quality snippets greatly help users in selecting and accessing the most interesting Web pages. It is also known that snippet quality depends on the ability to produce a query-biased summary of each document [143] that tries to capture the most important passages related to the user query. Since most user queries cannot be forecast in advance, these snippets cannot be produced off-line, and their on-line construction is a heavy load for modern WSEs, which have to process hundreds of millions of queries per day. In particular, the cost of accessing several different files (containing the Web pages) for each query, retrieved among terabytes of data, under heavy and unpredictable load conditions, may be beyond the capability of traditional filesystems and may require special-purpose filesystems [146].

In this Chapter we study the performance aspects of the snippet extraction phase, and we devise techniques that can increase query processing throughput and reduce the average query response time by speeding up snippet extraction. In particular, we leverage a popular technique used to enhance the performance of computing systems, namely caching [117]. Caching techniques are already widely exploited by WSEs at various system levels to improve query processing, mainly for storing past queries and the associated sets of results along with the query-biased snippets [55]. The basic technique adopted for Web search caching is to store the whole result page for each cached query. When a cache hit occurs, the result page is immediately sent back to the user. This approach perfectly fits the queries in the “head” of the power law characterizing the query topic distribution. On the other hand, SERP caching is likely to fail in the presence of personalization, that is, when the engine produces two different SERPs for the same query submitted by two different users. Furthermore, it fails when a query has not been seen before or is a singleton query, i.e., it will not be submitted again in the future. Baeza-Yates et al. [9] report that approximately 50% of the queries received by a commercial WSE are singleton queries.
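The behavior of result-page caching described above can be sketched with a small cache keyed by the query string. The least-recently-used eviction policy is an assumption made for the example, since the text does not prescribe one, and the class name, the `compute_serp` callback, and the query stream are hypothetical.

```python
from collections import OrderedDict


class ResultPageCache:
    """Stores whole result pages, keyed by query, with LRU eviction (assumed)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = OrderedDict()  # query -> cached result page
        self.hits = 0
        self.misses = 0

    def get(self, query, compute_serp):
        """Return the cached page, or build and cache it on a miss."""
        if query in self.pages:
            self.hits += 1
            self.pages.move_to_end(query)  # mark as most recently used
            return self.pages[query]
        self.misses += 1
        page = compute_serp(query)  # full query processing, snippets included
        self.pages[query] = page
        if len(self.pages) > self.capacity:
            self.pages.popitem(last=False)  # evict the least recently used
        return page


if __name__ == "__main__":
    cache = ResultPageCache(capacity=2)
    stream = ["weather", "news", "weather", "rare query one",
              "weather", "rare query two"]
    for query in stream:
        cache.get(query, lambda q: "SERP for " + q)
    print(cache.hits, cache.misses)
```

In this invented stream only the repeated query produces hits, while the two queries submitted once always miss and, in addition, displace other entries from the cache. Handling personalization would require extending the key beyond the query string, which further lowers the hit ratio.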

Unlike query-results caching, snippet caching, which is the focus of this Chapter, has received relatively little attention. The two research efforts closest to ours are those by Turpin et al. and Tsegay et al. [146, 144]. The authors investigate the effect of lossless and lossy compression techniques to generate document surrogates that are statically cached in memory. They argue that complex compression algorithms can effectively shrink large collections of texts and accommodate more surrogates within the cache. As a trade-off, complex decompression increases the time needed by the cache to serve a hit, and is thus unlikely to decrease the average snippet generation time [146]. Lossy compression techniques produce surrogates by reordering and pruning sentences from the original documents. They reduce the size of the cached documents while retaining the ability to produce high-quality snippets. Tsegay et al. measured that in 80% of the cases the snippets generated from surrogates whose size is 50% of the original collection are identical to the ones generated from the non-compressed documents [144].

To the best of our knowledge, this is the first research studying techniques for dynamically generating document surrogates to be managed by a snippet