Increasing Diversity in the Top k - Learning To Scale Up Search-Driven Data Integration

In active learning, the goal is to select query results such that the system maximizes its ability to learn (in this case, from user feedback). The uncertainty measure developed previously identifies the single sample (query result) with lowest confidence, in isolation. Similarly, the ranking schemes discussed effectively consider answers to be independent of one another. However, in the Q system, the user labels derived query results as positive or negative, and the feedback is converted into a modification of weights on individual features on edges or nodes. Some of these features may be shared with other queries and their results.

Clearly, this can lead to an issue: multiple closely related answers might have similar ranks by the metrics from the previous section. Yet feedback on any one of the answers might give most of the benefit of feedback on all of the answers.

Hence, in this section, we consider two approaches to increasing the diversity of the top-k answers, in order to get more beneficial feedback. We note that the diversity scheme should increase the number of different features present in the top-k answer set, such that the user has an opportunity to provide greater feedback. Yet (sometimes at odds with the previous goal) it should not suppress the best answers (particularly the most-likely-relevant answers) from appearing in the answer set. Finally, there should be some mechanism whereby the diversity scheme can be dampened as further feedback is given.

To address these goals, we consider two different schemes. In Section 3.4.1, we propose a clustering scheme, whereby we generate a large number of candidate results (in our experiments of Section 3.6, we generate 4k), then we collect them into k clusters and choose a representative result from each cluster. In Section 3.4.2, we try an alternative approach, based on incorporating a diversity component into our ranking scheme, by extending one of the most effective techniques from the information retrieval literature.

3.4.1 Clustering Queries

Ideally, we can find a few “representative” query results to return in the top-k results, and learn about many other results’ scores from these representatives.

Our clustering strategy targets this problem. It presents to the user the results of a query (Steiner tree) that shares some highly uncertain edges with other, also highly uncertain, queries, such that feedback given on results from the first query (tree) can also reduce the uncertainty of the other queries. To achieve this, we must estimate the common uncertain information between two Steiner trees and how informative a given Steiner tree is with respect to a keyword query. We use these to cluster overlapping queries and choose a representative query per cluster.

Clustering Algorithm. Our clustering algorithm takes a set of query trees $\{T_1, T_2, \ldots\}$ as input and clusters them into k groups, one associated with each top-k answer, similar to the k-means algorithm. (Note that in our domain it is intractable to compute the entire set of query results and then perform hierarchical agglomerative clustering; instead we can only produce some partial set of results and return k of them. Hence k-clustering makes sense.)

We define the center tree T of a set of trees to be a tree where each edge e has an associated appearance frequency $r_T(e)$, which is the ratio of the number of trees containing e to the size of the set. For any ordinary Steiner tree T, $r_T(e)$ is defined as the indicator $\mathbf{1}(e \in E(T))$. Now, we define the similarity between two trees as follows.

$$S(T_1, T_2) = \sum_{e \in E(T_1) \cap E(T_2)} \cdots$$

40 CHAPTER 3. ACTIVE LEARNING IN THE Q SYSTEM

This similarity roughly estimates the amount of common uncertainty of the two trees by summing up uncertainty values on their common edges. It can estimate the similarity between two Steiner trees as well as the similarity between a Steiner tree and a “center” tree of a set. We use standard k-means clustering over this similarity function to build clusters.
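The clustering loop described above can be sketched as follows. This is a minimal illustration, not the system's implementation. It assumes that trees are represented as frozensets of edge identifiers, that `U` maps each edge to an uncertainty value, and that the similarity summand is $r_{T_1}(e)\, r_{T_2}(e)\, U(e)$. That summand is an assumption made here for illustration, chosen because the text says the function sums uncertainty over common edges and must also apply to center trees. The function names and the naive seeding are likewise hypothetical.

```python
from collections import Counter

def center(trees):
    """Center tree of a set: edge -> fraction of trees containing that edge."""
    counts = Counter(e for t in trees for e in t)
    return {e: c / len(trees) for e, c in counts.items()}

def as_freq(tree):
    """An ordinary Steiner tree has r_T(e) = 1 for each of its edges."""
    return {e: 1.0 for e in tree}

def similarity(r1, r2, U):
    """ASSUMED summand r1(e) * r2(e) * U(e), summed over shared edges."""
    return sum(r1[e] * r2[e] * U[e] for e in r1.keys() & r2.keys())

def cluster_trees(trees, k, U, rounds=10):
    """k-means-style loop: assign each tree to its most similar center,
    then recompute each center from its members."""
    centers = [as_freq(t) for t in trees[:k]]  # naive seeding (assumption)
    assignment = None
    for _ in range(rounds):
        new_assignment = [
            max(range(len(centers)),
                key=lambda j: similarity(as_freq(t), centers[j], U))
            for t in trees
        ]
        if new_assignment == assignment:
            break  # assignments stable: converged
        assignment = new_assignment
        for j in range(len(centers)):
            members = [t for t, a in zip(trees, assignment) if a == j]
            if members:  # keep the old center if a cluster becomes empty
                centers[j] = center(members)
    return assignment
```

Because the first k trees serve as seeds, the result depends on input order; a real system would likely seed more carefully.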

Choosing Cluster Representatives. Once we have a cluster of similar queries, the next key question is how to choose one of them for feedback, such that the Q system can learn as much information as possible. We determine such a representative based on a notion of informativeness. Intuitively, the informativeness of a query with respect to a cluster of queries measures how much uncertainty this particular query shares with other trees in the cluster. Formally, given a cluster of trees C and a tree T ∈ C, we define the informativeness as follows:

$$I_C(T) = \sum_{T' \in C,\; T' \neq T} \;\; \sum_{e \in E(T) \cap E(T')} U(e) \qquad (3.7)$$

where U(X) is any uncertainty measurement over a random variable X. Thus, we vote on a representative according to the above informativeness formalism, and choose the most informative tree to represent the cluster.
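As a small sketch of Equation 3.7, assume each tree is a frozenset of edge identifiers and `U` is a dictionary of per-edge uncertainty values. Both representations, and the function names, are illustrative choices rather than part of the Q system.

```python
def informativeness(tree, cluster, U):
    """I_C(T): uncertainty summed over the edges T shares with every
    other tree T' in the cluster C."""
    return sum(U[e]
               for other in cluster if other != tree
               for e in tree & other)

def representative(cluster, U):
    """Pick the tree with the highest informativeness in its cluster."""
    return max(cluster, key=lambda t: informativeness(t, cluster, U))
```

An edge shared with several other trees contributes once per sharing tree, so a tree that overlaps many cluster members on uncertain edges scores highest.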

3.4.2 Directly Incorporating Diversity in the Score

The clustering approach of the previous section is well-aligned with our intuition that we would like to choose the “best” result from each group of related results. Yet there are some potential pitfalls to the approach. First, our choice of a representative result for each cluster may be poor: in this case, the user might see (and provide feedback on) a bad result, when other members of the cluster were good. Unfortunately the results of this feedback may down-rank all of the answers in the cluster. Second, it is not obvious how to gradually dampen the clustering scheme, such that initially we select with a heavy bias towards diversity, and after a few rounds of feedback we progressively de-emphasize the diversity component.

The information retrieval literature considers an alternative to clustering: there, the approach is to incorporate diversity directly into the scoring function [23, 27, 35]. Section 3.3 described how we can incorporate two different optimization criteria, namely predicted relevance and uncertainty, into the scoring function. Here we extend this even further, to consider diversity as a third optimization criterion.

A commonly used family of approaches in information retrieval comes from Gollapudi et al. [35], who proposed a series of functions to combine an existing score with a diversity measure. Of the proposed measures, we favor the one that chooses a set of results that maximizes the overall combination of relevance and diversification. Thus we adopt a scheme called max-sum diversification, which maximizes a linear combination of the utilities of the selected query answers and their pairwise distances. More specifically, we are given a universe U of objects to choose from (in our case, U is the set of all Steiner trees). The scoring function $Y_{\mathrm{div}}$ is defined over a subset S of k items selected from U as a weighted linear combination of any standard score metric from Section 3.3, plus a diversity component:

$$Y_{\mathrm{div}}(S) = (k-1) \sum_{T \in S} \mathrm{Score}(T) + 2\lambda \sum_{T_1, T_2 \in S} \hat{D}(T_1, T_2). \qquad (3.8)$$

We can use the size of the symmetric difference of the two trees' edge sets as the distance function $\hat{D}$, that is, the number of edges in the union of the two trees that are not shared by both of them.
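To make Equation 3.8 concrete, the following sketch evaluates the objective for a given selection. It assumes that trees are frozensets of edge identifiers, that `score` is a dictionary holding any score from Section 3.3, and that the pairwise sum ranges over unordered pairs of distinct trees. That last reading is an assumption about the notation; the names are illustrative.

```python
from itertools import combinations

def sym_diff_distance(t1, t2):
    """Number of edges that appear in exactly one of the two trees."""
    return len(t1 ^ t2)

def y_div(selected, score, lam):
    """Max-sum diversification objective for a selected set S of size k."""
    k = len(selected)
    relevance = (k - 1) * sum(score[t] for t in selected)
    diversity = sum(sym_diff_distance(a, b)
                    for a, b in combinations(selected, 2))
    return relevance + 2 * lam * diversity
```

With `lam = 0` the objective reduces to a scaled sum of scores, so the selection is driven by the base ranking alone.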

Maximizing the above objective function is NP-hard [35]. Instead, consider the following equivalent reformulation [35]:

$$Y_{\mathrm{div}}(S) = \sum_{T_1, T_2 \in S} D'(T_1, T_2) = \sum_{T_1, T_2 \in S} \bigl( \mathrm{Score}(T_1) + \mathrm{Score}(T_2) + 2\lambda \hat{D}(T_1, T_2) \bigr)$$
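The equivalence can be checked in one step, assuming that both pairwise sums range over unordered pairs of distinct trees in S (an assumption about the notation). Each tree belongs to exactly k − 1 such pairs, so its score is counted k − 1 times:

```latex
\begin{align*}
\sum_{\{T_1, T_2\} \subseteq S} D'(T_1, T_2)
  &= \sum_{\{T_1, T_2\} \subseteq S}
       \bigl( \mathrm{Score}(T_1) + \mathrm{Score}(T_2) \bigr)
     + 2\lambda \sum_{\{T_1, T_2\} \subseteq S} \hat{D}(T_1, T_2) \\
  &= (k-1) \sum_{T \in S} \mathrm{Score}(T)
     + 2\lambda \sum_{\{T_1, T_2\} \subseteq S} \hat{D}(T_1, T_2)
   = Y_{\mathrm{div}}(S).
\end{align*}
```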

This results in the Max-Sum Dispersion problem, which has a 2-approximation [20]. We use this reformulation, but with the following adaptations in the Q System.

1. Unlike in many IR applications where the scoring function is mostly about relevance, we adopt our proposed ranking functions (for example, expected model change) so as to consider relevance, uncertainty and diversity together.

2. We also adaptively change λ. At the beginning, the system actively explores in order to gather more feedback. However, after the system has gathered a certain amount of information, it switches to an exploitation strategy. To enable this, we gradually decrease the value of λ to reduce the effect of the pairwise distances (and hence that of diversification).
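One way to picture selection under the reformulated pairwise objective is a greedy pairing heuristic: repeatedly take the remaining pair of trees with the largest $D'$ value. Greedy pairing of this kind is the usual route to a 2-approximation for max-sum dispersion under metric distances, but whether it matches the algorithm of [20] exactly is an assumption here. The tie-breaking for odd k, the exponential decay schedule for λ, and all names below are likewise hypothetical; the text only states that λ is decreased gradually.

```python
from itertools import combinations

def pair_weight(a, b, score, lam):
    """D'(T1, T2) = Score(T1) + Score(T2) + 2 * lam * |E(T1) symdiff E(T2)|."""
    return score[a] + score[b] + 2 * lam * len(a ^ b)

def greedy_top_k(trees, k, score, lam):
    """Greedily add the heaviest remaining pair until k trees are chosen."""
    remaining = list(trees)
    chosen = []
    while len(chosen) + 1 < k and len(remaining) >= 2:
        a, b = max(combinations(remaining, 2),
                   key=lambda p: pair_weight(p[0], p[1], score, lam))
        chosen += [a, b]
        remaining.remove(a)
        remaining.remove(b)
    if len(chosen) < k and remaining:
        # Odd k: fill the last slot by score (an illustrative choice).
        chosen.append(max(remaining, key=lambda t: score[t]))
    return chosen

def decayed_lambda(lam0, feedback_round, decay=0.5):
    """Hypothetical schedule: shrink lambda geometrically per feedback round."""
    return lam0 * decay ** feedback_round
```

Calling `greedy_top_k` with `decayed_lambda(lam0, r)` in feedback round `r` shifts the selection from diverse trees in early rounds toward the highest-scoring trees later.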

