Query Reformulation for Triple-Pattern Search
4.2. Query Reformulation Framework
4.2.4. Executing Reformulated Queries
The reformulated queries generated by our reformulation algorithm can be presented to the user as query suggestions, or they can be executed and their results merged and ranked. We present two modes of execution: an incremental mode and a batch mode. Both modes use the ranking model described in Chapter 3 to rank the results of triple-pattern queries.
Incremental Execution. Let L = {Q_0, Q_1, ..., Q_k} be the list of query reformulations obtained using the reformulation algorithm described in the previous subsection, where Q_0 is the original query. In the incremental execution mode, we execute the queries Q_j ∈ L in order of their scores δ(Q_0, Q_j), which are defined according to Equation 4.12. That is, we start by executing the original query Q_0, retrieve all its results, and rank them according to the ranking model described in Chapter 3. Next, we execute query Q_1, retrieve all its results, and rank them according to the same ranking model, and so on. The final result list contains all unique results of the queries Q_0, Q_1, ..., Q_k, such that the results of query Q_j are all ranked above the results of query Q_{j+1}, and the results of each query Q_j are ordered by the ranking model described in Chapter 3.
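The incremental mode can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `execute` and `rank_key` are hypothetical callables standing in for the query engine and for the Chapter 3 ranking model (smaller key means better rank), queries are represented as plain strings, and queries are visited in ascending order of δ so that the original query (δ = 0) comes first.

```python
from typing import Callable, Hashable, Iterable, List, Sequence


def incremental_execution(
    queries: Sequence[str],
    deltas: Sequence[float],
    execute: Callable[[str], Iterable[Hashable]],
    rank_key: Callable[[Hashable, str], float],
) -> List[Hashable]:
    """Concatenate per-query result blocks, closest reformulation first.

    queries[j] is Q_j and deltas[j] is delta(Q_0, Q_j). Both `execute` and
    `rank_key` are placeholders (assumptions), not the thesis implementation.
    """
    # Visit queries by increasing delta; the original query has delta 0.
    order = sorted(range(len(queries)), key=lambda j: deltas[j])
    seen = set()
    final: List[Hashable] = []
    for j in order:
        query = queries[j]
        # Drop duplicates within this query and results already emitted
        # by a closer query, preserving first-seen order.
        fresh = [r for r in dict.fromkeys(execute(query)) if r not in seen]
        # Rank the block with the (placeholder) ranking model: lower is better.
        fresh.sort(key=lambda r: rank_key(r, query))
        final.extend(fresh)
        seen.update(fresh)
    return final
```

Because each block is appended in full before the next query is executed, a result of a closer query can never be outranked by a result that only a more distant reformulation returns.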
Batch Execution. In batch execution mode, we execute the original query and all its reformulations and merge their results into one unified result set, eliminating duplicates. We then rank the unified result set using the ranking model described in Chapter 3. Our ranking model assumes that there exists a language model for the query, defined as follows. Given a query Q = (q_1, q_2, ..., q_n), where each q_i is a triple pattern, the language model of Q is a probability distribution over all tuples of n triples of the form T = (t_1, t_2, ..., t_n), where each t_i is a triple. The probability P(T|Q) of a tuple T in the query language model is then estimated as follows (assuming independence between the triples):
$$P(T \mid Q) = \prod_{i=1}^{n} P(t_i \mid q_i) \qquad (4.13)$$
Now, assume that triple pattern q_i has the list of reformulations {q_i^0, q_i^1, ..., q_i^{m_i}}, where q_i^j is a reformulated triple pattern obtained by our reformulation algorithm (q_i^0 is the original triple pattern). The probability P(t_i|q_i) is then estimated as a weighted sum of the following m_i + 1 probabilities:
$$P(t_i \mid q_i) = \lambda_0 P(t_i \mid q_i^0) + \lambda_1 P(t_i \mid q_i^1) + \cdots + \lambda_{m_i} P(t_i \mid q_i^{m_i}) \qquad (4.14)$$
where P(t_i|q_i^j) is the probability of triple t_i in the language model of triple pattern q_i^j, estimated as described in Chapter 3. The parameters λ_j weight the contribution of each reformulated triple pattern, and we set them as a function of the reformulation score δ(q_i, q_i^j) as follows:
$$\lambda_j = \frac{1 - \delta(q_i, q_i^j)}{\sum_{j'=0}^{m_i} \bigl(1 - \delta(q_i, q_i^{j'})\bigr)} \qquad (4.15)$$
Recall that the smaller δ(q_i, q_i^j) is, the closer q_i^j is to the original triple pattern q_i, and that δ(q_i, q_i^0) is equal to 0. This weighting scheme therefore gives higher weights to triples that instantiate reformulated patterns closer to the original triple pattern q_i. However, a triple that instantiates a lower-ranked reformulated triple pattern can still receive a higher probability than one that instantiates a higher-ranked triple pattern, provided its probability under the lower-ranked pattern's language model is sufficiently high.
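The weighting scheme of Equations 4.14 and 4.15 can be illustrated with a short Python sketch. The function names are hypothetical, the scores and probabilities in the usage note are made up for illustration, and we assume that the scores δ lie in [0, 1] with at least one score below 1 so that the normalizer is positive; the text does not state the range of δ.

```python
from math import prod
from typing import List, Sequence


def mixture_weights(deltas: Sequence[float]) -> List[float]:
    """Weights lambda_j of Equation 4.15 for one triple pattern.

    deltas[j] is delta(q_i, q_i^j); deltas[0] is 0 for the original pattern.
    Assumes (not stated in the text) that every delta lies in [0, 1].
    """
    closeness = [1.0 - d for d in deltas]
    total = sum(closeness)
    if total <= 0.0:
        raise ValueError("at least one delta must be below 1")
    return [c / total for c in closeness]


def pattern_probability(deltas: Sequence[float], probs: Sequence[float]) -> float:
    """P(t_i | q_i) of Equation 4.14.

    probs[j] is P(t_i | q_i^j), taken as given (estimated as in Chapter 3).
    """
    weights = mixture_weights(deltas)
    return sum(w * p for w, p in zip(weights, probs))


def tuple_probability(per_pattern: Sequence[float]) -> float:
    """P(T | Q) as the product of the per-pattern probabilities."""
    return prod(per_pattern)
```

For example, with illustrative scores `[0.0, 0.2, 0.5]`, a triple that matches only the third pattern with probability 0.30 obtains a larger `pattern_probability` than a triple that matches only the second pattern with probability 0.05, even though the second pattern carries the larger weight.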
Once the language model of the triple pattern qi has been estimated, we use it to compute the probability of a tuple in the query language model. Finally, to rank the overall results of the original query and all its reformulations, we assume there exists a language model for each result (the way we estimate the result language models was explained in Chapter 3). The results are then ranked in ascending order based on the KL divergence between the query language model and the result language models.
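The final ranking step can be sketched as follows. This is a minimal illustration, not the Chapter 3 model itself: language models are represented as plain dictionaries over a finite set of events, the divergence is taken in the direction KL(query || result), and a small probability floor stands in for whatever smoothing the result language models use. All three choices are assumptions made for the example.

```python
import math
from typing import Dict, Hashable, List

Distribution = Dict[Hashable, float]


def kl_divergence(query_lm: Distribution, result_lm: Distribution,
                  floor: float = 1e-12) -> float:
    """KL(query_lm || result_lm) over the support of the query model.

    `floor` is an assumed stand-in for smoothing: it keeps the logarithm
    finite when the result model gives an event zero probability.
    """
    total = 0.0
    for event, p in query_lm.items():
        if p > 0.0:
            q = max(result_lm.get(event, 0.0), floor)
            total += p * math.log(p / q)
    return total


def rank_results(query_lm: Distribution,
                 result_lms: Dict[str, Distribution]) -> List[str]:
    """Return result identifiers in ascending order of KL divergence."""
    return sorted(result_lms,
                  key=lambda name: kl_divergence(query_lm, result_lms[name]))
```

A result whose language model coincides with the query language model has divergence 0 and is therefore ranked first.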
Note that the incremental and batch modes are only two of the ways in which we can execute the reformulated queries and present their results to the user. Other modes are possible, for instance a mixture of the two. The result ranking model described in Chapter 3 is general enough to support such finer-grained execution modes with minimal changes.
4.3. Related Work
In this chapter we presented a framework for reformulating triple-pattern queries, possibly augmented with keywords. Our framework reformulates queries by substituting resources that appear in a given query, whether entities or relations, with similar ones or with variables. Measuring the similarity between resources is somewhat related to both record linkage [65] and ontology matching [77]. A key difference, however, is that we are only interested in finding candidate resources that are close in spirit to a given resource; we are not trying to solve the resource disambiguation problem.
Query reformulation in general has been studied in other contexts such as keyword queries [17] (more generally called query expansion), XML [3, 54], and SQL [11, 92]. Our setting of RDF and triple patterns differs in being schema-less (as opposed to relational data) and graph-structured (as opposed to XML, which is mainly tree-structured and supports navigational predicates). For RDF triple-pattern queries, query reformulation has been addressed to some extent in [92, 42, 20, 40, 23]. With the exception of [20, 42], the types of reformulations considered in previous work are limited. For example, [92] considers substituting relations only, while [40, 23] consider both entity and relation substitutions. The work in [23], in particular, considers a very limited form of reformulation: relaxing queries by replacing entities or relations specified in the triple patterns with variables. Our approach, on the other hand, considers a comprehensive set of reformulations and, in contrast to most previous approaches, weights the reformulated queries by the quality of the reformulation (i.e., how close the reformulated queries are to the original query) rather than by the number of substitutions that produced them.
In addition, our framework stands out in the way reformulated queries are generated and executed. While [20, 42, 40] make use of rule-based rewriting, the approach in [92] and our own approach use the data itself to generate query reformulations. Note that rule-based rewriting requires human input, while our approach is completely automatic. Also note that our framework is the only approach that merges the results of the original query and its reformulations in a holistic manner. This allows us to rank results based on both the relevance of the results and the closeness of the reformulated query to the original query.
4.4. Experimental Evaluation
| Dataset | #entities | Example entity types | #triples | Example relations |
|---|---|---|---|---|
| LibraryThing | 48,000 | book, author, user, tag | 700,000 | wrote, hasFriend, hasTag, type |
| IMDB | 59,000 | movie, actor, director, producer, country, language | 600,000 | actedIn, directed, won, isMarriedTo, produced, hasGenre |

Table 4.5: Overview of the datasets
We evaluated our query reformulation framework using three experiments. The first one evaluated the quality of the individual resource substitutions and the second one evaluated the quality of the reformulated queries overall. The third experiment evaluated the quality of the final query results obtained from the original query and its reformulations.
4.4.1. Setup
All experiments were conducted over two datasets using the Amazon Mechanical Turk service. The first dataset was derived from the LibraryThing community, which is an online catalog and forum about books. The second dataset was derived from a subset of the Internet Movie Database (IMDB). The data from both sources was automatically parsed and converted into RDF triples. In addition, each triple was augmented with keywords derived from the data source it was extracted from. In particular, for the IMDB dataset, all the terms in the plot, tag-line, and keyword fields were extracted, stemmed, and stored with each triple. For the LibraryThing dataset, since we did not have enough textual information about the entities present, we retrieved the books' Amazon descriptions and the authors' Wikipedia pages and used them as textual context for the triples. Table 4.5 gives an overview of the datasets.
LibraryThing Dataset
?b type Nonfiction
?b hasTag Greek
?w type Historian
?w wrote ?b
?b hasTag Memoir
?w wrote ?b
?b hasTag Non-fiction
?b hasTag Pulitzer
?w wrote ?b nobel prize
?b hasTag British Literature
?w wrote ?b civil war
?b hasTag Film

IMDB Dataset
?m hasGenre Comedy
?m hasWonPrize Academy Award
?a hasWonPrize Academy Award for Best Actor
?a originatesFrom New York
?m1 hasGenre Mystery
?m1 hasPredecessor ?m2
?d1 directed ?m1
?d2 directed ?m2
?d directed ?m true story
?d hasWonPrize Academy Award for Best Director
?a actedIn ?m school friends
?a type singer

Table 4.6: Subset of the evaluation queries
We constructed 40 evaluation queries for each dataset and converted them into triple-pattern queries. In addition, we constructed 15 keyword-augmented queries for each dataset, where one or more triple-patterns were augmented with one or more keywords. Some example queries are shown in Table 4.6. The complete set of evaluation queries used is listed in Appendix C.
| | LibraryThing | LibraryThing | IMDB | IMDB |
|---|---|---|---|---|
| Resource | Egypt | Non-fiction | France | Titanic (1997) |
| 1 | Ancient Egypt | Politics | Italy | Atlantic (1929) |
| 2 | Mummies | American History | Switzerland | The Abyss |
| 3 | Egyptian | Sociology | Spain | Titanic (1953) |
| 4 | Cairo | Essays | West Germany | Top Gun |
| 5 | Egyptology | History | Germany | Britannic |

Table 4.7: Example resources and their top-5 substitutions