Strategies for Data Surfacing

5.2 Case Study

5.2.1 Strategies for Data Surfacing

Data surfacing involves selecting the next query to execute from those available in the materialization query queue. The selection follows a materialization strategy, i.e., a logic that aims to maximize a given set of metrics so as to optimize the query selection task. We next describe some materialization strategies for the SPSS materialization scenario, whose performance is evaluated in Section 5.2.4.

Let us consider a single service $s$ described by an access pattern $AP$; $AP$ has a set of input attributes $I_i$, each associated with a domain $d_i$, with $i = 1..n$, and a set of output attributes $O_j$, each associated with a domain $d_j$, with $j = 1..m$. In order to show the reseeding, we assume that $d_i = d_j$ for some $i, j$, i.e., that some input and output attributes share the same domain. Let $d = d_1 \times \dots \times d_n$ be the cross product of the input domains, and let $k \in d$ be a combination of input values for the $AP$. A paginated query $q_k^p$ is a query addressed to the service interface (service) $si$ using the combination $k$ of input values, where $1 \le p \le MaxNumPages$ indicates the page currently queried. We define $r_k^p \subseteq R$ as the set of tuples in the source that satisfy the query $q_k^p$, where $R$ represents all the items of the source to be materialized. The input discovery step of the materialization process builds, at materialization set-up time, the initial query queue $initQ$, e.g., by retrieving input combinations from a dictionary $initDict$; then, new combinations for $Q$ can be found by using the values in the results $r_k^p$ of the queries that are progressively executed. The materialization $R_M$ is built as the union of the $r_k^p$; note that $R_M \subseteq R$, and in general $R_M$ is much smaller than $R$ due to the access limitations of the data source. With a single service, the union operation is sufficient for duplicate elimination.
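The definitions above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy source `R`, the attribute names, the chunk size `PAGE_SIZE`, and the helper names `query` and `materialize` do not come from the case study. The sketch only shows how a materialization built as a union of result pages, seeded from a dictionary and reseeded from query results, ends up being a subset of the source.

```python
from collections import deque

PAGE_SIZE = 2        # assumed chunk size of the toy service
MAX_NUM_PAGES = 2    # plays the role of MaxNumPages

# Hypothetical source R: tuples (item_id, postcode, nearby_postcode).
# "postcode" is the only input attribute; "nearby_postcode" is an output
# attribute over the same domain, which is what makes reseeding possible.
R = {
    (1, "A", "B"), (2, "A", "B"), (3, "A", "C"), (4, "A", "A"), (5, "A", "A"),
    (6, "B", "A"), (7, "C", "D"), (8, "D", "C"), (9, "E", "E"),
}

def query(k, p):
    """Return r_k^p: page p (1-based) of the tuples whose postcode is k."""
    matches = sorted(t for t in R if t[1] == k)
    start = (p - 1) * PAGE_SIZE
    return set(matches[start:start + PAGE_SIZE])

def materialize(init_dict):
    """Build R_M as the union of all retrieved pages, reseeding from results."""
    queue = deque(init_dict)     # initQ, taken from the dictionary
    known = set(init_dict)       # combinations already placed in the queue
    rm = set()                   # R_M; set union also removes duplicates
    while queue:
        k = queue.popleft()
        for p in range(1, MAX_NUM_PAGES + 1):
            page = query(k, p)
            if not page:
                break            # no further chunk for this combination
            rm |= page
            for _, _, nearby in sorted(page):
                if nearby not in known:   # reseeding with a new input value
                    known.add(nearby)
                    queue.append(nearby)
    return rm

rm = materialize(["A"])
print(len(rm), "of", len(R), "items materialized")
```

In this toy run, item 5 is lost to the page limit and item 9 is never reached because no result mentions postcode "E", so the materialization is a proper subset of the source.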

The outcome of a materialization strategy is influenced by the chunking of query answers, which requires multiple service calls to fully collect a query's result, and by the distribution of values for the input attributes, as distinct inputs produce different numbers of returned results, thus introducing skew in the materialized result set. These factors call for data surfacing strategies that are capable of balancing coverage and diversity.
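The effect of chunking can be made concrete with a small arithmetic sketch. The chunk size and page limit used below are assumed values chosen for illustration, and the function names are hypothetical; the point is only that the number of service calls, and the number of tuples one input contributes, depend on how many results that input matches.

```python
import math

def calls_needed(total_results, chunk_size, max_num_pages):
    """Service calls spent on one input combination, capped by the page limit.

    At least one call is always needed, even when the answer is empty.
    """
    pages = max(1, math.ceil(total_results / chunk_size))
    return min(pages, max_num_pages)

def results_collected(total_results, chunk_size, max_num_pages):
    """Tuples that one input combination can contribute to the materialization."""
    return min(total_results, chunk_size * max_num_pages)

# Assumed setting: chunks of 100 results, at most 5 pages per query.
for total in (2000, 35):
    print(total, calls_needed(total, 100, 5), results_collected(total, 100, 5))
```

Under these assumed values, an input matching 2000 results costs 5 calls and contributes 500 tuples, while an input matching 35 results costs 1 call and contributes 35 tuples, which is the kind of skew described above.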

As illustrated in the example in Section 5.2, issuing queries to the TradeMe service retrieves many results relevant to the query, thus ensuring diversity, but to the detriment of coverage due to the limited result set size. In contrast, the RealEstate service may return up to 2000 results per issued query, delivered in chunks. In this case, some queries and their chunks might return many new cities for the queried postcode. At the same time, if, for instance, the postcode belongs to a suburb in a major metropolitan area, such a query might return results with very little variety in new cities and postcodes, because many results exactly match the posed query.

In order to define a few simple data surfacing strategies, let us model the sequence of queries produced by a data surfacing strategy as an undirected graph, as described in Chapter 4. In this case, the graph $QRT$ is explored by a tree walking algorithm, where all the nodes except the root correspond to queries; the root is directly connected to the queries $q_k^1$ with $k \in C$, and we do not further consider how the nodes $q_k^1$ are ordered. In this context, a materialization strategy consists of interleaving tree generation and tree traversal steps. Tree generation occurs as follows:

• If the current query $q_k^p$ has not retrieved all the available chunks and a new chunk can be retrieved, then $q_k^{p+1}$ is generated as a child of $q_k^p$.
• If the current query $q_k^p$ has generated new query combinations $h$ which are not present in $Q$, then new nodes $q_h^1$ are generated and $Q$ is updated to $Q \cup h$; the insertion of the nodes $q_h^1$ in the tree may occur according to two insertion policies:
    • Child insertion policy: the nodes $q_h^1$ are created as children ($C_q$) of $q_k^p$, possibly on the left of $q_k^{p+1}$ (Figure 5.2, left).
    • Sibling insertion policy: the nodes $q_h^1$ are left-appended as children of the root (Figure 5.2, right).
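The two tree generation rules and the two insertion policies can be sketched as a single expansion step. This is an illustrative reading, not the implementation used in the case study: the `Node` class, the `expand` function, its parameters, and the node labels are all hypothetical, and new nodes are always placed on the left, which the text only states as a possibility for the child policy.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    label: str                                  # e.g. "q_A^1", or "root"
    children: list = field(default_factory=list)

def expand(root, node, k, p, has_next_page, new_combinations, known, policy):
    """Apply one tree generation step after executing q_k^p (held by `node`)."""
    # Rule 1: another chunk is available, so q_k^{p+1} becomes a child of q_k^p.
    if has_next_page:
        node.children.append(Node(f"q_{k}^{p + 1}"))
    # Rule 2: combinations not yet in Q produce new first-page nodes q_h^1.
    fresh = []
    for h in new_combinations:
        if h not in known:
            known.add(h)                        # Q grows by the new combination
            fresh.append(h)
    new_nodes = [Node(f"q_{h}^1") for h in fresh]
    if policy == "child":
        node.children[0:0] = new_nodes          # left of q_k^{p+1}
    elif policy == "sibling":
        root.children[0:0] = new_nodes          # left-appended under the root
    else:
        raise ValueError(f"unknown policy: {policy}")
    return new_nodes

def build(policy):
    """Expand q_A^1 once, assuming it found values B, A, C and has a second page."""
    root = Node("root")
    q_a1 = Node("q_A^1")
    root.children.append(q_a1)
    known = {"A"}
    expand(root, q_a1, "A", 1, True, ["B", "A", "C"], known, policy)
    return root, q_a1, known

for policy in ("child", "sibling"):
    root, q_a1, _ = build(policy)
    print(policy, [c.label for c in root.children], [c.label for c in q_a1.children])
```

With the child policy the new nodes hang under the query that discovered them, next to its following page; with the sibling policy they sit directly under the root, and the discovering query keeps only its following page as a child.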


Figure 5.2: Left - Child Insertion Policy; Right - Sibling Insertion Policy.

Tree traversal, in turn, selects the next query in the tree to execute. Related work [Wu 2006] performs a similar selection process by exploiting a cost model that associates a weight with each edge in the tree, so as to find an optimal selection of queries that minimizes the total cost of traversal (a Weighted Minimum Dominating Set problem). In this chapter, we instead exploit classical breadth-first and depth-first tree traversal algorithms. We apply them to the two insertion policies, thus obtaining four materialization strategies, which yield different performance in terms of coverage and diversity. An analysis of the performance of the proposed materialization strategies is provided in the following sections.
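The combination of two traversals and two insertion policies can be sketched on a toy source. This is one possible reading of the interleaving, under explicit assumptions: the `PAGES` table, the `Node` and `Materializer` classes, and the way each traversal handles a tree that grows while it is walked are all hypothetical. Here depth-first is a recursive walk that keeps descending into the leftmost pending child, and breadth-first repeatedly executes the shallowest pending node.

```python
from collections import deque

# Hypothetical toy source: PAGES[k][p - 1] lists the input values that appear
# in chunk p of the answer for combination k (values usable for reseeding).
PAGES = {"A": [["B"], ["C"]], "B": [["D"]], "C": [[]], "D": [[]], "X": [[]]}

class Node:
    def __init__(self, k=None, p=None):
        self.k, self.p = k, p
        self.children = []
        self.done = k is None                  # the root carries no query

class Materializer:
    def __init__(self, init_q, policy):
        self.root = Node()
        self.root.children = [Node(k, 1) for k in init_q]
        self.known = set(init_q)               # Q
        self.policy = policy                   # "child" or "sibling"
        self.order = []                        # executed queries, "A1" = q_A^1

    def execute(self, node):
        """Run q_k^p, then apply the two tree generation rules."""
        node.done = True
        self.order.append(f"{node.k}{node.p}")
        pages = PAGES[node.k]
        if node.p < len(pages):                # another chunk is available
            node.children.append(Node(node.k, node.p + 1))
        fresh = []
        for h in pages[node.p - 1]:
            if h not in self.known:
                self.known.add(h)
                fresh.append(h)
        target = node if self.policy == "child" else self.root
        target.children[0:0] = [Node(h, 1) for h in fresh]   # insert on the left

    def depth_first(self, node=None):
        node = node or self.root
        if not node.done:
            self.execute(node)
        # Descend into the leftmost pending child; the child list may grow.
        while (child := next((c for c in node.children if not c.done), None)):
            self.depth_first(child)
        return self.order

    def breadth_first(self):
        while (node := self._shallowest_pending()) is not None:
            self.execute(node)
        return self.order

    def _shallowest_pending(self):
        queue = deque([self.root])             # plain level-order scan
        while queue:
            node = queue.popleft()
            if not node.done:
                return node
            queue.extend(node.children)
        return None

for traversal in ("breadth_first", "depth_first"):
    for policy in ("child", "sibling"):
        m = Materializer(["A", "X"], policy)
        print(traversal, policy, getattr(m, traversal)())
```

On this toy input, the four combinations execute the same six queries in four different orders. Under the assumptions above, depth-first with sibling insertion exhausts the pages of a combination before moving to newly discovered values, while breadth-first with sibling insertion visits the first page of every discovered value before any second page.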
