6.5.2. Retrieval Models

We compared our ranking model, which we refer to as the Structured LM approach, to three competitors: 1) a baseline language-modeling approach (Baseline LM), 2) the Web Object Retrieval Model (WOR) [66] and 3) the BANKS system [41]. We chose these three competitors because they represent the family of approaches applicable to our setting, namely keyword search over structured data. The remaining approaches sketched in Section 6.4 do not directly apply to our setting and were therefore omitted from our evaluation.

Information need | Query

LibraryThing
Historians who wrote memoir books | historian memoir book
The author of a classic fantasy funny book | classic fantasy funny author
Authors of non-fiction books that won the Pulitzer prize | author non-fiction pulitzer
A crime fiction that was tagged as favorite by the users | crime fiction favorite
Children’s writers who wrote books about werewolves | children writer werewolves

IMDB
Movies with genre Musical that were produced in Italy | musical italy
Actors from New York City that have won the Academy Award for Best Actor | new york academy award best actor
Movies with genre War in which Anthony Quinn acted | anthony quinn war
Movies with genre Comedy that have won the Academy Award | comedy academy award
Movies that Mel Gibson directed | mel gibson director

Table 6.7: A subset of the evaluation queries

Structured LM Approach. The Structured LM approach ranks a set of tuples of triples retrieved using the retrieval algorithm in Section 6.2. It takes into consideration the structure of the triples, which is represented by the predicates of the triples as described in Section 6.3. The Structured LM approach weights the probability of a query term in the language model of each triple by the probability of the triple’s predicate being intended by the query term. This model involves the single parameter β, which must be learnt (see Equation 6.5), and we explain how to do this in Subsection 6.5.4.

Baseline LM Approach. Similar to the Structured LM approach, the Baseline LM approach also ranks a set of tuples of triples that are retrieved using the retrieval algorithm described in Section 6.2. It also uses the query likelihood of the tuples to rank them. However, this approach completely ignores the structure of the triples and treats all the triples as bags-of-words. The Baseline LM approach is a special case of our Structured LM approach and can be obtained by setting the value of the parameter β in Equation 6.5 to 0.

Web Object Retrieval. The Web Object Retrieval model proposed by Nie et al. [66] is a language-model-based approach for ranking objects, which correspond to resources in an RDF setting. The model assumes that each resource is associated with a set of records extracted from Web sources. In turn, each record is associated with a “document”. The relevance of each such “document” (and correspondingly, of the resource associated with it) to a keyword query is estimated using language models.

We adapted the WOR model to work on RDF data as follows. We treated triples as records and, for a given resource X, we created a language model for X using all its triples {t_1, t_2, ..., t_n}. Given a keyword query Q = {q_1, q_2, ..., q_m}, we then ranked the resources according to their probabilities of generating the query, which are computed as follows:

\[
P(Q \mid X) = \prod_{i=1}^{m} \sum_{j=1}^{n} \frac{1}{n} P(q_i \mid D_j) \tag{6.9}
\]

where P(q_i|D_j) is the probability of generating the term q_i given triple t_j’s document, which was estimated using a maximum-likelihood estimator as described in Equation 6.6 in Section 6.3.
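As an illustration of Equation 6.9, the following minimal sketch scores a resource from the token lists of its triples. The function names `term_prob` and `wor_score` are hypothetical, and the term probability is assumed to be a plain, unsmoothed maximum-likelihood estimate (term count divided by document length); Equation 6.6 is not reproduced here and may differ, for example by including smoothing.

```python
from collections import Counter
from math import prod


def term_prob(term, doc_tokens):
    """Assumed unsmoothed maximum-likelihood estimate of P(term | D)."""
    if not doc_tokens:
        return 0.0
    return Counter(doc_tokens)[term] / len(doc_tokens)


def wor_score(query_terms, triple_docs):
    """P(Q|X) as in Eq. 6.9: for each query term, average P(q_i | D_j)
    over the n triple documents of the resource, then multiply over terms."""
    n = len(triple_docs)
    if n == 0:
        return 0.0
    return prod(
        sum(term_prob(q, doc) for doc in triple_docs) / n
        for q in query_terms
    )


# Toy resource with two triples, each reduced to a bag of words.
docs = [["mel", "gibson"], ["director", "braveheart"]]
print(wor_score(["gibson", "director"], docs))  # 0.0625
```

Under this unsmoothed assumption, a single query term that occurs in none of the resource's triples drives the whole product to zero.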

BANKS. The BANKS system enables keyword search on graph databases. Given a keyword query, it returns answers in the form of subgraphs whose nodes “cover” the keywords (i.e., match the query keywords). The relevance of an answer is determined based on a combination of edge weights and node weights in the answer graph. The importance of an edge depends upon the type of the edge, i.e., its relationship. Node weights, on the other hand, represent the static authority or importance of nodes and are set as a function of the in-degrees of the nodes.

This model applies directly to our setting. Given a keyword query, we retrieved all subgraphs that matched the query using the technique described in [41]. We then ranked the subgraphs based on a combination of edge weights and node weights as proposed in the BANKS model. That is, the score of a subgraph G = {t_1, t_2, ..., t_n} with nodes N = {n_1, n_2, ..., n_k} is defined as follows:

\[
\mathrm{score}(G) = \lambda \sum_{i=1}^{n} \mathrm{score}(t_i) + (1 - \lambda) \sum_{i=1}^{k} \frac{1}{k}\, \mathrm{score}(n_i) \tag{6.10}
\]

where score(t_i) is the score of triple (or edge) t_i, which is computed using the probability that the predicate of t_i is intended by any of the query keywords, as is done in the Structured LM approach. Note that if a triple matches more than one keyword, it is counted as many times as the number of keywords it matches. The score(n_i), on the other hand, is the score of node n_i, which is set to the in-degree of n_i. We used log scaling for both scores, as advised in [41]. Finally, the parameter λ controls the relative influence of the two scores, and we explain how it is set when we discuss the evaluation results in Subsection 6.5.4.
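The combination in Equation 6.10 can be sketched as follows. This is an illustrative sketch only: the function name `banks_score` and its `scale` parameter are hypothetical, the triple scores are assumed to be precomputed, and the exact form of the log scaling from [41] is not given in this section, so `math.log1p` is shown merely as one possible choice.

```python
import math


def banks_score(triple_scores, node_in_degrees, lam, scale=lambda x: x):
    """Eq. 6.10: lam * sum_i score(t_i) + (1 - lam) * (1/k) * sum_i score(n_i).

    triple_scores: one entry per (triple, matched keyword) pair, so a triple
        that matches two keywords contributes two entries.
    node_in_degrees: in-degree of each of the k nodes in the subgraph.
    scale: hypothetical hook for the scaling applied to both kinds of score
        (identity by default; the source uses log scaling as advised in [41]).
    """
    edge_part = sum(scale(s) for s in triple_scores)
    k = len(node_in_degrees)
    node_part = sum(scale(d) for d in node_in_degrees) / k if k else 0.0
    return lam * edge_part + (1.0 - lam) * node_part


# Toy subgraph: two triple scores and two nodes with in-degrees 2 and 4.
print(banks_score([0.5, 0.25], [2, 4], lam=0.5))  # 1.875
print(banks_score([0.5, 0.25], [2, 4], lam=0.5, scale=math.log1p))
```

With the identity scaling, the toy call evaluates to 0.5 · 0.75 + 0.5 · 3 = 1.875; the node term dominates here because raw in-degrees are much larger than the probability-based triple scores, which suggests why some form of scaling is applied.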