3.2 Notations and Definitions
3.2.3 Problem Definition
Current query refinement techniques (e.g., [79, 14, 123, 78]) often provide efficient and effective solutions to the Aggregate-based Query Refinement problem. Specifically, these techniques aim to quickly navigate a large search space of possible refined queries and return one refined query whose aggregate is very close to a specified aggregate constraint; that is, they minimize the deviation between the aggregate constraint and the achieved aggregate. Formally, given an aggregate constraint G, the aim of these techniques is to find a refined query R such that ∆GR is minimized:
\[ \Delta^{G}_{R} = \frac{1}{z}\,\lvert G - G_{R} \rvert \tag{3.2.4} \]
where z is a normalization factor and GR is the aggregate value of R.
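As a minimal sketch of Eq. 3.2.4, the aggregate deviation can be computed as below. The function name and the choice z = G in the usage line are assumptions made for illustration; the text only requires z to be a normalization factor.

```python
def aggregate_deviation(target: float, achieved: float, z: float) -> float:
    """Normalized absolute gap between the constraint G and the achieved aggregate G_R (Eq. 3.2.4)."""
    if z <= 0:
        raise ValueError("normalization factor z must be positive")
    return abs(target - achieved) / z


# Hypothetical numbers: the constraint asks for an aggregate of 100,
# the refined query achieves 80, and z is assumed to equal the constraint.
dev = aggregate_deviation(target=100, achieved=80, z=100)
print(dev)
```

A deviation of zero means the refined query meets the constraint exactly; larger values mean a larger miss, regardless of whether the query overshoots or undershoots.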
Clearly, while the user might be satisfied that the refined query’s aggregate value GR is very close to the constraint G, they would also expect the refined query R to be very close (i.e., similar) to their input query I. A refined query that is very different from the input one offers limited benefit to the end user and is often useless.
Motivated by that, we propose the Similarity-aware, Aggregate-based Query Refinement problem, in which user satisfaction is measured in terms of both: 1) meeting some user-specified aggregate constraint on R, and 2) maximizing the similarity between R and I. Formally:
Definition 3.1. Similarity-aware, Aggregate-based Query Refinement problem: Given a database B, an input conjunctive query I, a distance function D(), and an aggregate constraint G over the result I(B), the goal in the Similarity-aware, Aggregate-based Query Refinement problem is to find R that satisfies the aggregate constraint G while minimizing D(R, I).
Ideally, the distance between R and I (i.e., D(R, I)) should be equal to zero (i.e., maximum similarity such that R ≡ I). In reality, however, achieving that extreme case of exact similarity is unrealistic unless query I already satisfies the aggregate constraint G, i.e., G = GI, in which case no further refinement is required. Hence, in this work, we adopt a hybrid metric that captures and quantifies the success of meeting the user’s expectations for both the similarity and aggregate constraints. In particular, we capture the user’s (dis)satisfaction in terms of the overall deviation (in both the aggregate and similarity constraints) from the user’s expectations:
\[ \Delta_{R} = \alpha\,\Delta^{S}_{R} + (1 - \alpha)\,\Delta^{G}_{R} \tag{3.2.5} \]
where ∆SR is the deviation in similarity, which is captured by means of a distance function D(R, I) as described earlier in Section 3.2.2, and ∆GR is the aggregate deviation defined above in Eq. 3.2.4.
The parameter α specifies the weight assigned to the deviation in similarity, and in turn, (1 − α) is the weight assigned to the aggregate deviation. Parameter α can be user-defined so as to reflect the user’s preference between satisfying the aggregate and similarity constraints.
On the one hand, setting α = 0 is equivalent to the Aggregate-based Query Refinement (AQR) problem. On the other hand, setting α = 1 is equivalent to the extreme case described above, in which R ≡ I. In the general case, in which 0 < α < 1, both the aggregate and similarity constraints are considered according to their respective weights and the overall deviation is captured by ∆R. Hence, a small value of ∆R indicates a small deviation in meeting the constraints, and greater user satisfaction with the refined query R.
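The effect of α on Eq. 3.2.5 can be illustrated with a short sketch. The candidate names and their (similarity deviation, aggregate deviation) pairs below are hypothetical values chosen for illustration, not results from this work, and both deviations are assumed to be already normalized to [0, 1].

```python
def overall_deviation(alpha: float, sim_dev: float, agg_dev: float) -> float:
    """Weighted combination of similarity and aggregate deviation (Eq. 3.2.5)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * sim_dev + (1.0 - alpha) * agg_dev


# Hypothetical candidate refinements: name -> (similarity deviation, aggregate deviation).
candidates = {
    "R1": (0.05, 0.60),  # stays close to the input query, misses the constraint
    "R2": (0.30, 0.10),  # moderate on both counts
    "R3": (0.70, 0.00),  # meets the constraint exactly, far from the input query
}


def best_candidate(alpha: float) -> str:
    """Return the candidate with the smallest overall deviation for a given alpha."""
    return min(candidates, key=lambda name: overall_deviation(alpha, *candidates[name]))


for alpha in (0.0, 0.5, 1.0):
    print(alpha, best_candidate(alpha))
```

With these numbers, α = 0 selects the candidate that only honors the aggregate constraint (R3), α = 1 selects the one closest to the input query (R1), and α = 0.5 selects the compromise (R2).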
Interestingly, the similarity and aggregate constraints are typically at odds with each other. That is, maximizing similarity (i.e., minimizing ∆SR) and minimizing aggregate deviation (i.e., ∆GR) are two objectives that are typically in conflict, and α specifies how much each of them contributes to the overall deviation ∆R. For instance, assume that the input query I in Figure 3.1 does not satisfy an aggregate constraint G. In order to satisfy G, query I has to be refined by expanding or contracting its predicates. Hence, any refined query R that reduces ∆GR (i.e., closes in on the constraint G) will have to increase ∆SR (i.e., move away from I by expanding or contracting its predicates).
Lemma 1. Minimizing ∆S always conflicts with minimizing ∆G, provided that 0 < α < 1 and the input query does not satisfy the aggregate constraint.
Proof. It is trivial to prove the above lemma using the example in Figure 3.1. Assume that I does not satisfy an aggregate constraint G, e.g., the cardinality of I is less than a cardinality constraint G. Hence, to satisfy G (i.e., minimize ∆G), query I has to be refined by expanding its predicates. This will always result in increasing ∆S.
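The conflict stated in the lemma can be seen on a small numeric example. This is only an illustration under stated assumptions, not a substitute for the proof: the data values are made up, the input query is assumed to be a single range predicate `value <= 15`, the similarity deviation is assumed to be the shift of the predicate bound normalized by the attribute's domain width (a stand-in for the distance function of Section 3.2.2), and z is assumed to equal the cardinality constraint.

```python
# Hypothetical attribute values stored in the database.
values = [3, 8, 12, 15, 21, 27, 34, 40, 46, 55]

input_hi = 15  # input query I: value <= 15, which returns 4 rows
target = 8     # cardinality constraint G


def cardinality(hi: int) -> int:
    """Number of rows returned by the predicate value <= hi."""
    return sum(v <= hi for v in values)


domain_width = max(values) - min(values)

rows = []
for hi in (15, 21, 27, 34, 40):  # progressively expanded predicate bounds
    sim_dev = (hi - input_hi) / domain_width           # assumed similarity deviation
    agg_dev = abs(target - cardinality(hi)) / target   # Eq. 3.2.4 with z = G (assumed)
    rows.append((hi, sim_dev, agg_dev))

for hi, sim_dev, agg_dev in rows:
    print(f"hi={hi:2d}  sim_dev={sim_dev:.3f}  agg_dev={agg_dev:.3f}")
```

In this example, each expansion step brings the cardinality closer to the constraint while moving the predicate further from the input query, so the two deviations move in opposite directions.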
As an alternative method for deciding the appropriate α in Eq. 3.2.5, user feedback can be used to infer a suitable α [77, 111]. For instance, users are asked to label the results of a small sample of refined queries as acceptable or unacceptable. These refined queries are obtained by refining a query with an aggregate constraint using different values of α, e.g., 0.1, 0.4, 0.6, 0.9. Then, the preferred α can be inferred from these labeled queries. The larger the sample, the better the inferred value of α, but the more queries users have to label.
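One simple way to turn such labels into a value of α is sketched below. The selection rule (pick the α whose refined queries were most often labeled acceptable) is an assumption made for illustration and is not necessarily the inference method of [77, 111]; the labels are made-up data.

```python
# Hypothetical feedback: alpha -> labels given by users to the results of the
# refined queries produced with that alpha (True = acceptable).
labels = {
    0.1: [False, False, True, False],
    0.4: [True, False, True, False],
    0.6: [True, True, True, False],
    0.9: [False, True, False, False],
}


def infer_alpha(feedback: dict[float, list[bool]]) -> float:
    """Return the alpha with the highest fraction of acceptable labels.

    Ties are broken in favor of the alpha that appears first in the dict.
    """
    rates = {alpha: sum(marks) / len(marks) for alpha, marks in feedback.items() if marks}
    if not rates:
        raise ValueError("no labeled queries available")
    return max(rates, key=rates.get)


print(infer_alpha(labels))
```

With only a handful of labels per α the acceptance rates are coarse, which reflects the trade-off between labeling effort and the quality of the inferred value.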
objectives that are defined by an application. For instance, consider an application that shows exactly K flights for any user-selected departure and destination within a user-selected time slot. If no flights are found for the user-selected options (i.e., the input query returns an empty result), it is common practice to show similar alternative flights rather than an empty screen, which also meets the application’s goal. Hence, the input query can be refined with a suitable value of α so that the refined query returns close to K flights that are very similar to those requested by the input query, filling the application’s screen with suggested flights for the user.
Before presenting our proposed techniques for solving the problem defined in Def. 3.1, we present a declarative query model which encapsulates all the essential parameters that are used by our proposed techniques.