3.3. Related Work

Our work on ranking the results to triple-pattern queries on RDF knowledge bases is closely related to work on IR over structured data in general. Based on the types of data and queries handled, we classify prior work on ranking as follows: i) keyword queries on unstructured data (documents), ii) structured queries on structured data, iii) keyword queries on structured data, and iv) keyword-augmented structured queries on structured data.

3.3.1. Keyword Queries on Unstructured Data

The main technique that we use from the standard IR literature is that of language models and KL-divergence for result ranking [52]. An overview of language models for IR was given in Subsection 3.2.2. In recent years, keyword querying has been carried over to the extended setting of entity search and ranking, also referred to as expert finding [68, 73]. Here, results are named entities (e.g., companies, products, publications, authors), but the queries are still keyword-based. In most of these approaches, entities are assumed to be embedded in textual form in Web pages and other traditional kinds of documents. For the approaches that treat entities as first-class citizens [66], see Subsection 3.3.3 below. Extended forms of language models and PageRank-inspired spectral analyses are used to rank the entities that qualify for a keyword query. The key differences in our setting are that our corpus is a single redundancy-free RDF knowledge base instead of a set of documents, our queries consist of triple patterns rather than keywords, and the output is a ranked list of tuples of triples instead of documents. We have described how to adapt language modeling techniques for this new setting.
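As a reminder of the standard document-level technique referred to above, the following minimal sketch ranks documents by the KL-divergence between a query language model and a smoothed document language model. It illustrates the classical keyword-over-documents setting only, not the triple-pattern model of this chapter. The function names, the toy corpus, the use of linear interpolation with the collection model for smoothing, and the value of `lam` are all assumptions made for illustration.

```python
import math
from collections import Counter


def language_model(tokens):
    """Maximum-likelihood unigram model: term -> relative frequency."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()}


def smooth(doc_lm, collection_lm, lam=0.5):
    """Linear interpolation with the collection model (lam is an assumed value)."""
    return {
        term: lam * doc_lm.get(term, 0.0) + (1 - lam) * p_coll
        for term, p_coll in collection_lm.items()
    }


def kl_divergence(query_lm, doc_lm):
    """KL(query || document); every query term must occur in the collection."""
    return sum(p * math.log(p / doc_lm[term]) for term, p in query_lm.items())


def rank(query, docs):
    """Return document ids ordered by increasing KL-divergence from the query."""
    collection_lm = language_model([t for tokens in docs.values() for t in tokens])
    query_lm = language_model(query)
    scores = {
        doc_id: kl_divergence(query_lm, smooth(language_model(tokens), collection_lm))
        for doc_id, tokens in docs.items()
    }
    return sorted(scores, key=scores.get)


# Hypothetical toy corpus, for illustration only.
DOCS = {
    "d1": "rdf ranking language model".split(),
    "d2": "xml keyword search".split(),
}
RANKING = rank("rdf ranking".split(), DOCS)
```

A smaller divergence means that the document model explains the query better, so `d1`, which contains both query terms, is ranked ahead of `d2`.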

3.3.2. Structured Queries on Structured Data

Ranking for structured queries has been investigated for restricted forms of SQL queries. Ranking models have been developed for selection-join queries, using either tf-idf-based models [16] or probabilistic-IR models [11] that leverage attribute-value statistics in both the database and the workload. Because they depend on such statistics, these models cannot be applied directly to schema-less and redundancy-free RDF data.

The closest work to ours is the ranking model in the NAGA system [48]. NAGA introduced a query language similar to SPARQL triple patterns and used a (simpler) LM for computing a notion of informativeness. But NAGA can rank only exact matches to a given query, so the ranking is helpful only for the too-many-answers case but not for the too-few-answers problem. In contrast, our triple-pattern search framework goes beyond this limited setting by supporting query relaxation and introducing the new notion of keyword-augmented queries.

3.3.3. Keyword Queries on Structured Data

The work on keyword search over structured data falls into two classes. The first class aims at mapping the keyword query into one or more structured queries [82, 58]. In this chapter, we assumed that structured triple-pattern queries were given and we were interested in ranking the results to such queries. Inferring a structured query from a given keyword query is a different problem.

The second class of work on keyword search over structured data tries to directly retrieve structured results for a given keyword query. The work on keyword search over XML data, for instance, falls into this category. XKSearch [88] returns a set of nodes that contain the query keywords either in their labels or in the labels of their descendant nodes and have no descendant node that also contains all keywords. Similarly, XRank [34] returns the set of elements that contain at least one occurrence of all the query keywords, after excluding the occurrences of the keywords in sub-elements that already contain all the query keywords. However, all these techniques assume a tree structure and thus cannot be directly applied to graph-structured data such as RDF data.
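The XKSearch result semantics described above can be sketched on a toy tree as follows. This is a naive recursive illustration of the stated condition (a node qualifies if its subtree covers all keywords and no descendant's subtree does), not the algorithm of [88]. The `Node` class, the function name, and the example data are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A node of a toy XML tree; `label` holds its whitespace-separated text."""
    name: str
    label: str = ""
    children: list["Node"] = field(default_factory=list)


def smallest_covering_nodes(root: Node, keywords: set[str]) -> list[str]:
    """Names of nodes whose subtree covers all (lowercase) keywords while no
    descendant's subtree does."""
    results: list[str] = []

    def visit(node: Node) -> tuple[set[str], bool]:
        # Returns the keywords found in this subtree and whether some node
        # in the subtree has already been reported as covering all of them.
        found = keywords & set(node.label.lower().split())
        covered_below = False
        for child in node.children:
            child_found, child_covered = visit(child)
            found |= child_found
            covered_below = covered_below or child_covered
        if found == keywords and not covered_below:
            results.append(node.name)
            return found, True
        return found, covered_below

    visit(root)
    return results


# Hypothetical bibliography tree, for illustration only.
TREE = Node("bib", children=[
    Node("paper1", children=[Node("title1", "RDF ranking"), Node("author1", "Smith")]),
    Node("paper2", children=[Node("title2", "XML ranking"), Node("author2", "Smith")]),
])
ANSWER = smallest_covering_nodes(TREE, {"rdf", "smith"})
```

For the query keywords `rdf` and `smith`, only `paper1` is returned: the root `bib` also covers both keywords, but it is excluded because its descendant `paper1` already does. The sketch relies on each node having a single parent, which is the tree-structure assumption that does not hold for RDF graphs.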

Also closely related to our work is the language-modeling approach for keyword search over XML data proposed in [49]. Their ranking is based on the hierarchical language models proposed in [67]. However, the setting of XML data is quite different from that of RDF, since in XML the retrieval unit is an XML document (or a subtree). In an RDF setting, we are interested in ranking tuples of triples that match the user’s query. These tuples are not present in advance and are computed on the fly at retrieval time, and thus most of the prior work on XML IR would not apply.

Keyword search over graphs, which returns a ranked list of Steiner trees [41, 46, 39, 32] (the exception is [55], which returns graphs), addresses the latter problem, namely the lack of a predefined retrieval unit. However, the result ranking in each of the above is based on the structure of the results [41, 47] (usually by aggregating the number or weights of nodes and edges), or on a combination of these properties with content-based measures such as tf-idf [14, 39, 55] or language models [66].

For instance, the BANKS system [41] enables keyword search on graph databases. Given a keyword query, an answer is a subgraph connecting some set of nodes that “cover” the keywords (i.e., match the query keywords). The relevance of an answer is determined based on a combination of edge weights and node weights in the answer graph. The importance of an edge depends upon the type of the edge, i.e., its relationship. Node weights, on the other hand, represent the static authority or importance of nodes and are set as a function of the in-degree of the node. We adapt the BANKS system to the RDF setting in the experiments section, and compare it to our model.
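The general shape of such a structure-based score can be sketched as below. This is not the BANKS scoring formula: the logarithmic in-degree weight, the per-type edge costs, the inverse-sum edge score, and the mixing parameter `lam` are all assumptions chosen only to illustrate how edge-type weights and in-degree-based node weights might be combined. The graph data and all names are hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy graph given as (source, edge_type, target) triples.
EDGES = [
    ("paper1", "writtenBy", "alice"),
    ("paper2", "writtenBy", "alice"),
    ("paper2", "cites", "paper1"),
    ("paper3", "cites", "paper1"),
]

# Assumed cost per edge type (lower cost = stronger relationship).
EDGE_COST = {"writtenBy": 1.0, "cites": 2.0}

# In-degree of every node that is the target of at least one edge.
IN_DEGREE = Counter(target for _, _, target in EDGES)


def node_weight(node):
    """Static importance as a function of in-degree (assumed: log(1 + in-degree))."""
    return math.log(1 + IN_DEGREE[node])


def answer_score(answer_edges, lam=0.5):
    """Score a non-empty answer subgraph, given as a list of edges, by mixing
    an edge-based and a node-based component (illustrative combination only)."""
    nodes = {n for source, _, target in answer_edges for n in (source, target)}
    node_score = sum(node_weight(n) for n in nodes) / len(nodes)
    edge_score = 1.0 / (1.0 + sum(EDGE_COST[kind] for _, kind, _ in answer_edges))
    return (1 - lam) * edge_score + lam * node_score


ANSWER_A = [("paper2", "writtenBy", "alice")]
ANSWER_B = [("paper3", "cites", "paper1")]
SCORE_A = answer_score(ANSWER_A)
SCORE_B = answer_score(ANSWER_B)
```

Both toy answers contain one node with in-degree 2 and one with in-degree 0, so their node components coincide. `ANSWER_A` scores higher only because its edge type is assumed to be cheaper, which mirrors the statement that the importance of an edge depends on its relationship type.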

A closely related work that combines structure and content for ranking is the language-model-based ranking model in [66] for ranking objects (resources in an RDF setting). The model assumes that each resource is associated with a set of records extracted from Web sources. In turn, each record is associated with a “document”. The relevance of each such “document” (and correspondingly, the resource associated with it) to a keyword query is estimated using language models. This model, however, assumes that resources are the only retrieval unit, while our ranking model goes beyond this to treat triples in a holistic manner by taking into account the relationships between the resources. In addition, it assumes the presence of a document associated with each Web object or resource, something that we lack in the case of RDF data in general.

3.3.4. Keyword-Augmented Structured Queries on Structured Data

XML IR, such as XPath Full-Text search, falls into this category [37, 2]. XPath forms the tree-structured part of the query, while keyword conditions can be specified at each branch of the tree-pattern query. An important difference between XML IR and our setting is that in the former, it is possible to have results of different sizes, while in our case, the results all have a fixed structure. Consequently, the structure-based aspects are not as relevant to our setting as the content-based ones. The content-based ranking is again based either on tf-idf scores [2] or language models [37].