Entity search has been gaining increasing attention in the research community, as recognised by various world-wide evaluation campaigns. The TREC Question Answering track focused on entities with factoid questions and list questions (asking for entities that meet certain constraints) [119]. The TREC 2005–2008 Enterprise track [10] featured an expert finding task: given a topic, return a ranked list of experts on the topic. The TREC Entity search track ran from 2009 to 2011 [12], with the goal of finding entity-related information on the web, and introduced the related entity finding (REF) task: return a ranked list of entities (of a specified type) that engage in a given relationship with a given source entity. Between 2007 and 2009, INEX also featured an Entity Ranking track [32]. There, entities are represented by their Wikipedia page, and queries ask for typed entities (that is, entities that belong to certain Wikipedia categories) and may come with examples. Most recently, the Semantic Search Challenge (SemSearch) ran a campaign in 2010 [46] and 2011 [20] to evaluate the ad-hoc entity search task over structured data. The main task here is ad-hoc entity retrieval: “answering arbitrary information needs related to particular aspects of objects [entities], expressed in unconstrained natural language and resolved using a collection of structured data” [92].
Commonly, the ad-hoc entity retrieval task is approached by adapting standard document retrieval methods. A textual representation (“pseudo document”) is built for each entity, and these representations can then be ranked using conventional IR models. The main challenge, of course, is how to obtain these textual representations from structured data. WoD conceptually forms a large, directed, labelled graph with nodes corresponding to entities and edges denoting relationships, and is described in the form of subject-predicate-object (SPO) triples of the RDF data model; Figure 1.3 shows a small excerpt from an RDF graph centred around a given entity.
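As a minimal sketch of the pseudo-document idea, the Python snippet below collects all triples whose subject is a given entity and flattens the text of their objects into a single bag of words. The toy triples, the `ex:` prefix, the `local_name` helper, and the tokenisation rule are illustrative assumptions, not the representation used later in this thesis.

```python
import re
from collections import Counter

# Hypothetical toy triples (subject, predicate, object); objects may be
# URI-like terms or plain literals.
TRIPLES = [
    ("ex:NTNU", "ex:hasLocation", "ex:Trondheim"),
    ("ex:NTNU", "rdfs:label", "Norwegian University of Science and Technology"),
    ("ex:Trondheim", "rdfs:label", "Trondheim"),
]

def local_name(term):
    """Return the part of a term after the last '/', '#' or ':'.

    Simplification: a literal that itself contains one of these
    characters would be truncated as well.
    """
    return re.split(r"[/#:]", term)[-1]

def tokenize(text):
    """Lowercase and split on non-alphanumerics, also breaking camelCase."""
    spaced = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", text)
    return [t for t in re.split(r"[^0-9a-z]+", spaced.lower()) if t]

def pseudo_document(entity, triples):
    """Bag of words built from the objects of all triples about `entity`."""
    terms = Counter()
    for s, p, o in triples:
        if s == entity:
            terms.update(tokenize(local_name(o)))
    return terms

doc = pseudo_document("ex:NTNU", TRIPLES)
```

The resulting term counts could be handed to any conventional bag-of-words ranking model. Note that this flat representation discards which predicate each term came from.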
A natural solution would be to represent each entity using a fielded structure, where fields correspond to predicates (i.e., arrows in Figure 1.3) and associated nodes (or rather, the text extracted from them) are used as field values. These representations can then be ranked using any fielded document retrieval model, such as BM25F [95] or the mixture of language models (MLM) [86]. However, with this approach the number of document fields soon becomes computationally prohibitive, making the estimation of field weights intractable. A commonly used workaround is to group predicates together into a small set of predefined categories, and thus create documents comprising only a handful of fields. This grouping (or “predicate folding”) can be based on, for example, the type of predicates (attributes, in/out-relations, etc.) [90] or on their (manually determined) importance [21]. This leads to a data model where the optimisation of field weights is easily tractable, even using exhaustive search over the parameter space. While this approach seems to work well in practice, it seriously limits the semantic expressiveness of entity models, as it is no longer possible to access the content of individual predicates, and the grouping might be too dependent on the data collection.
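The following Python sketch illustrates predicate folding together with a simple fielded scoring function in the spirit of a mixture of language models. The `FOLDING` table, the field names, the smoothing parameter `lam`, and the fallback probability are all assumptions made for illustration; this is not the exact formulation of [86], nor the model developed in later chapters.

```python
import math
from collections import Counter

# Hypothetical mapping from predicates to a handful of folded fields.
FOLDING = {
    "rdfs:label": "name",
    "foaf:name": "name",
    "ex:hasLocation": "out_relations",
}
DEFAULT_FIELD = "attributes"  # used for any predicate not listed above

def fold(triples_of_entity):
    """Group (predicate, object_tokens) pairs into a few fields."""
    fields = {}
    for predicate, tokens in triples_of_entity:
        field = FOLDING.get(predicate, DEFAULT_FIELD)
        fields.setdefault(field, Counter()).update(tokens)
    return fields

def mlm_score(query, fields, weights, collection, lam=0.1):
    """Log query likelihood under a weighted mixture of field models.

    Each field model is linearly interpolated with a collection model
    (Jelinek-Mercer smoothing); `weights` is assumed to sum to 1 and
    `collection` maps terms to collection probabilities.
    """
    score = 0.0
    for term in query:
        p = 0.0
        for field, w in weights.items():
            bag = fields.get(field, Counter())
            length = sum(bag.values())
            p_field = bag[term] / length if length else 0.0
            p += w * ((1 - lam) * p_field + lam * collection.get(term, 1e-6))
        score += math.log(p)
    return score
```

With only a few folded fields, the `weights` dictionary has few entries, which is what makes an exhaustive search over weight combinations feasible. The cost is visible in `fold`: once tokens are merged into a field, the predicate they came from can no longer be recovered.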
Part II
Entity Search
Having covered the basics, we continue by introducing the main focus of this thesis in Chapter 3, which is the ad-hoc entity search task (i.e., search for entities in the WoD). Therein, we give a thorough introduction to the problems we are concerned with and describe the basic techniques the later chapters will build on. In Chapter 4, we continue to describe more advanced models and incorporate semantic aspects in terms of structuring predicates by their types to improve retrieval efficiency.
Chapter 3
Entity Search in the Web of Data
In this chapter we give an overview of the task of entity search, and our participation in benchmarking initiatives. Our investigations in the field of entity retrieval started with our participation in the 2010 edition of the Semantic Search Challenge for which we studied the entity search and list search tasks. We first introduce the entity search task, based on an extension of [13]. This line of research was followed further in [27]. Further, our observations gave rise to specific research questions related to entity modelling which will be picked up in Chapter 4.
3.1
Introduction
Search for entities has become the second most popular type of web search, after navigational queries [92]. As such, it has attracted considerable research interest. We introduce approaches to both the classical entity search task and the list search task (e.g., the “Arab states of the Persian gulf”, introduced in 1.3, is a typical list search query). With respect to the list search task, we attempt to model human user behaviour when searching in Wikipedia. Both methods were evaluated in the Semantic Search Challenge of the Semantic Search 2011 Workshop. We then put our results in context with the other teams’ results for the challenge tasks. Entity search denotes searches targeting entities instead of documents. In contrast to search in the Web of Documents, we search for entities, and do so in RDF (Resource Description Framework) data or other types of structured data representation. Such structured representations, which provide the directions and types of links between entities, are often referred to as the “Semantic Web”, as envisioned by Tim Berners-Lee [19].
At the same time, an increasing amount of information is published as Linked Data that is inherently organised around entities; each entity is identified by a unique URI and is described using a set of subject-predicate-object RDF triples. Querying these structured data sources by means of simple keyword search (as opposed to SPARQL-like languages) emerged as a genuine user need and has recently become an active topic of research [20, 21, 26, 90, 92]. The tasks we study in this chapter are ad-hoc entity retrieval (often referred to as semantic search), which we introduced in Section 2.6: “answering arbitrary information needs related to particular aspects of objects [entities], expressed in unconstrained natural language and resolved using a collection of structured data” [92], and list search, i.e., finding a set of relevant answers for a given query.
In the context of semantic search this means that classical information retrieval keyword search is extended by using RDF input data in the form of (subject, predicate, object) triples, where each component is described by a URI (Uniform Resource Identifier). Entities are represented by subjects and occur together with predicates and objects that further identify the entity. For example, the RDF triple (example.org/NTNU, example.org/hasLocation, example.org/Trondheim) implies that NTNU is located in Trondheim.
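To make the triple structure concrete, the short Python sketch below stores triples as named tuples and looks up everything stated about one subject. The `Triple` class, the `describe` helper, and the second triple (with its `locatedIn` predicate) are made up for illustration; only the first triple comes from the example above.

```python
from typing import NamedTuple

class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str

triples = [
    # The example triple from the text.
    Triple("example.org/NTNU", "example.org/hasLocation", "example.org/Trondheim"),
    # A hypothetical additional triple, added only to show filtering.
    Triple("example.org/Trondheim", "example.org/locatedIn", "example.org/Norway"),
]

def describe(entity, triples):
    """Return the (predicate, object) pairs of all triples with `entity` as subject."""
    return [(t.predicate, t.obj) for t in triples if t.subject == entity]

ntnu_description = describe("example.org/NTNU", triples)
```

Here `ntnu_description` holds the single pair stating that NTNU has the location Trondheim, which is the kind of per-entity description a keyword query would be matched against.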
In this chapter we present our approaches to entity search and summarise our participation in, and the results of, the Semantic Search Challenge of the Semantic Search 2011 Workshop, providing a comprehensive evaluation of our approaches using both the Billion Triple Challenge (BTC) and DBpedia data sets. Overall, we achieved third place for entity search and first place for the list search task.
Our main emphasis for entity search was on combining evidence from multiple knowledge sources, where each source is queried using a retrieval method tailored to its specific properties. With respect to list search, our goal was to mimic the behaviour of humans searching in Wikipedia, as we believe that many of the answers to list queries are available there, albeit not directly accessible. Finally, for both tasks, we exploited “sameAs” links extracted from DBpedia.
In the remainder of this chapter we first survey related work in Section 3.2. Next, we introduce the Semantic Search Challenge in Section 3.3. Then, in two largely independent sections, we discuss our approaches to both entity search and list search in Sections 3.4 and 3.5, respectively. Further, we give an overview of the results of the challenge in Section 3.6, putting our approaches in context with other submissions. We conclude and outline future directions in Section 3.7.