3.6.5 Multi-result Query Processing

So far, we have only discussed work that is related to DrillBeyond with respect to its functionality or intent. However, a large part of our contribution also consists of methods for planning, processing, and optimizing the specific type of SQL query processed by DrillBeyond.

One of the characteristics of DrillBeyond is that it produces top-k query results. Superficially, one could therefore assume that the large body of work on processing top-k queries in databases, as surveyed for example in (Ilyas et al., 2008), would be related. However, there is a conceptual distinction. All approaches that are usually grouped under the term top-k query processing are concerned with efficiently identifying the k most relevant result tuples for a given query, where relevance is defined by user-defined criteria or scoring functions. These techniques are therefore mainly aimed at increasing efficiency and at making results easier to handle by limiting their size, since in many applications a user is interested only in the most relevant answers. Our approach, however, produces k variants of a single, complete query result, which differ only in the set of external data sources used to resolve the augmentation attributes. In other words, our approach does not produce a subset of the available data, but a superset. Therefore, existing techniques for “top-k” query processing do not apply.

However, this does not mean that there is no related work with respect to our form of top-k processing. The most closely related system in this respect is MCDB (Jampani et al., 2008), which deals with Monte Carlo simulations in an RDBMS context. MCDB processes queries over databases that contain a mixture of regular single-valued attributes and probabilistic attributes whose values are drawn from some distribution. More specifically, MCDB allows arbitrary so-called VG functions, or variable generating functions, to be stored in the database together with their parametrizations. These functions are used to pseudo-randomly generate concrete values for probabilistic attributes at query processing time. This enables MCDB to generate multiple independent but identically distributed instances of the database, i.e., possible worlds, process the query of interest over all of them, and then summarize the effects of the uncertainty modeled in the VG functions. This method makes it possible to bring complex statistical models into the database and enables approximate evaluation of arbitrary queries over them, yielding a highly expressive system.
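To make the possible-worlds idea concrete, the following minimal Python sketch evaluates one query naively over N sampled worlds. It is an illustration under our own assumptions, not MCDB's implementation: the toy relation `ROWS`, the normal-distribution stand-in `vg_normal`, and all other names are hypothetical.

```python
import random
import statistics

# Hypothetical toy relation: one deterministic attribute (region) and the
# parameters of one probabilistic attribute (revenue ~ Normal(mean, std)).
ROWS = [
    {"region": "north", "mean": 100.0, "std": 10.0},
    {"region": "south", "mean": 80.0, "std": 5.0},
    {"region": "north", "mean": 50.0, "std": 20.0},
]


def vg_normal(rng, mean, std):
    """Stand-in for a VG function: draw one concrete value."""
    return rng.gauss(mean, std)


def instantiate_world(rows, seed):
    """Materialize one possible world by sampling every probabilistic attribute."""
    rng = random.Random(seed)
    return [
        {"region": r["region"], "revenue": vg_normal(rng, r["mean"], r["std"])}
        for r in rows
    ]


def query(world):
    """Query of interest: SUM(revenue) WHERE region = 'north'."""
    return sum(t["revenue"] for t in world if t["region"] == "north")


def monte_carlo(rows, n_worlds, base_seed=0):
    """Naive evaluation: run the complete query once per possible world."""
    results = [query(instantiate_world(rows, base_seed + i)) for i in range(n_worlds)]
    return statistics.mean(results), statistics.stdev(results)


mean_sum, std_sum = monte_carlo(ROWS, n_worlds=1000)
```

Note that the deterministic predicate on `region` is re-evaluated in every world, which is exactly the redundancy discussed next.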

However, this model has performance implications similar to those of the DrillBeyond model: the query, including all of its deterministic parts, is processed k times, leading to unacceptable performance with a naïve implementation. MCDB uses a different solution than DrillBeyond: it employs a technique called “tuple bundling”, in which the variants of a tuple are bundled into one tuple when the common attributes need to be accessed, and unbundled into its variants when varying attributes need to be processed. Efficient bundling is enabled by the pseudo-random nature of the possible worlds embodied in the tuple variants: tuple variants can be bundled by reducing them to the deterministic tuple they were generated from, representing probabilistic attributes by storing only the seed that was used to generate their variants. In addition, a special isPresent attribute stores whether a tuple is actually included in each possible world. This makes it possible to represent the fact that regular relational operations, such as selection, may remove tuple instances in some worlds, which then do not need to be regenerated on unbundling. From just these two pieces of information, an instantiate operator can produce the required tuple instances from a tuple bundle when the probabilistic attributes need to be accessed individually.
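The bundling idea can be sketched as follows. This is a simplified illustration under our own assumptions and not MCDB's actual data layout: the class `TupleBundle`, the helper names, and the single normally distributed attribute `revenue` are hypothetical.

```python
import random
from dataclasses import dataclass, field

N_WORLDS = 5  # hypothetical number of possible worlds


@dataclass
class TupleBundle:
    det: dict    # deterministic attributes, stored once for all variants
    mean: float  # parameters of the probabilistic attribute
    std: float
    seed: int    # seed from which all variants can be regenerated
    is_present: list = field(default_factory=lambda: [True] * N_WORLDS)


def variant_values(bundle):
    """Regenerate the probabilistic value of every variant from the seed."""
    rng = random.Random(bundle.seed)
    return [rng.gauss(bundle.mean, bundle.std) for _ in range(N_WORLDS)]


def select_deterministic(bundles, pred):
    """Selection on common attributes: evaluated once per bundle."""
    return [b for b in bundles if pred(b.det)]


def select_probabilistic(bundles, pred):
    """Selection on the varying attribute: only the presence flags change."""
    out = []
    for b in bundles:
        flags = [p and pred(v) for p, v in zip(b.is_present, variant_values(b))]
        if any(flags):  # drop bundles that exist in no world
            out.append(TupleBundle(b.det, b.mean, b.std, b.seed, flags))
    return out


def instantiate(bundle):
    """Unbundle: emit (world, tuple) pairs for the variants still present."""
    return [
        (w, {**bundle.det, "revenue": v})
        for w, v in enumerate(variant_values(bundle))
        if bundle.is_present[w]
    ]


bundles = [
    TupleBundle({"id": 1, "region": "north"}, 100.0, 10.0, seed=11),
    TupleBundle({"id": 2, "region": "south"}, 80.0, 5.0, seed=22),
]
north = select_deterministic(bundles, lambda d: d["region"] == "north")
filtered = select_probabilistic(north, lambda v: v > 100.0)
tuples = [t for b in filtered for t in instantiate(b)]
```

In this sketch, only the seed and the presence flags are needed to recover the surviving variants. Every operator, however, has to be written against `TupleBundle` rather than against plain tuples.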

In some respects, this method is similar to our idea of the ω/Ω splitting of the DrillBeyond operator. In our solution, a set of tuple instances is likewise represented by a single tuple for as long as possible, while the possible values of the augmentation attributes are represented by placeholder values. These placeholders refer to a set of concrete augmentation values, not unlike the way seed values represent a set of possible worlds in MCDB. However, there are some crucial differences between the systems. First, MCDB aims at coping with a much higher number of tuple instances for Monte Carlo simulation, for example N = 1000, while DrillBeyond aims at human-inspectable result lists, for example of length k = 10. To cope with this much larger overhead, MCDB requires modified relational operators, such as selection or join, that work directly on bundled tuples. While this enables the higher number of tuple instances, it also introduces a much higher implementation overhead and diminishes the generality of the resulting RDBMS. A production RDBMS includes many variants of operators, e.g., several join implementations and different aggregation operators, but also user-defined functions or extensions. None of these can be reused unchanged; in fact, some cannot even be implemented efficiently on top of tuple bundles. Furthermore, other components of the RDBMS, such as the optimizer, cannot be easily adapted to this completely new query processing approach, but need to be rewritten largely from scratch. In fact, an early version of DrillBeyond was also built on a tuple bundling approach with customized operators. However, we quickly discovered that even after reimplementing many RDBMS operators for tuple bundles, edge cases in query processing remained, e.g., the inability to efficiently sort tuple bundles on bundled attributes.
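The placeholder idea can be illustrated with a small Python sketch. It is a deliberately simplified analogy and not the ω/Ω operator itself: the names `Placeholder`, `add_placeholders`, `resolve`, and the toy data are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Placeholder:
    key: str  # hypothetical lookup key of the entity to be augmented


# Hypothetical augmentation values: one mapping per result variant (k = 3).
VARIANTS = [
    {"A": 1.0, "B": 2.0, "C": 3.0},
    {"A": 1.5, "B": 2.5, "C": 3.5},
    {"A": 0.5, "B": 1.5, "C": 2.5},
]


def add_placeholders(rows, key_attr, aug_attr):
    """Attach a placeholder instead of a concrete augmentation value."""
    return [{**r, aug_attr: Placeholder(r[key_attr])} for r in rows]


def resolve(rows, aug_attr, variants):
    """Expand the single placeholder result into one result per variant."""
    return [
        [{**r, aug_attr: mapping[r[aug_attr].key]} for r in rows]
        for mapping in variants
    ]


rows = [
    {"city": "A", "country": "X"},
    {"city": "B", "country": "Y"},
    {"city": "C", "country": "X"},
]
with_placeholders = add_placeholders(rows, "city", "population")

# An ordinary, unmodified operator on deterministic attributes runs once,
# because each row is still a plain tuple.
filtered = [r for r in with_placeholders if r["country"] == "X"]

results = resolve(filtered, "population", VARIANTS)
```

The deterministic selection runs a single time on ordinary rows. Only the final step multiplies the result into k variants.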

We therefore used the minimally invasive approach described in Sections 3.3 and 3.4, which allows unchanged reuse of all operators and requires only minimal changes to RDBMS components. It yielded a system that does not lose any of the generality of the original DBMS we extended, while still providing acceptable performance for the number of result variants that we aim at in our use case.

In document Query-Time Data Integration (Page 108-110)