Accessing Remote Sources - Centralised Adaptive Query Processing Techniques

4.4 Centralised Adaptive Query Processing Techniques

4.4.3 Accessing Remote Sources

In this section, AQP systems that target problems specific to centralised query processing over remote sources are discussed (Table 4.3 provides a summary).

In the light of data arrival delays, a common approach is to minimise idle time by performing other useful operations, thus attenuating the effect of such delays. Query Scrambling [AFT98, AFTU96, UFA98] and its variants [BFMV00a, BFMV00b] are

two representative examples in this area.

Query scrambling focuses on connections to remote sources and more specifically tries to address problems incurred by delays in receiving the first tuples from a remote data source that result in idle CPU. The default response form is to reorder operators so that the system performs other useful work in the hope that the problem will eventually be resolved and the requested data will arrive at or near the expected rate from then on. If changes in the execution order to avoid idling (which is always beneficial) do not solve the problem, new operations are inserted in the query plan, which is risky. In query scrambling, there is a trade-off between the potential benefit of modifying the query plan on the fly and the risk of increasing the total response time instead of reducing it.

[BFMV00a, BFMV00b] also deal with the problem of unpredictable data arrival rates, and, additionally, with the problem of memory limitation in the context of data integration systems. A general hierarchical dynamic query processing architecture is proposed. Planning and execution phases are interleaved in order to respond in a timely manner to delays. The query plan is adjusted to the remote source connections and memory consumption. Materialisation points are dynamically inserted into pipelines to reduce the memory requirements of the query plan, in case of memory shortage [BKV98]. In a sub-optimal query plan, the operator ordering can be modified or the whole plan can be re-optimised, by introducing new operators and/or altering the tree shape. In general, operators that may incur large CPU idle times are pushed down the tree. Scheduling such operators as soon as possible increases the possibility that some other work is available to schedule concurrently if they experience delays. During the construction of query plans, bushy trees are preferred because they offer the best opportunities to minimise the size of intermediate results. Thus, in case of partial materialisation, the overhead remains low. The query response time is reduced by running several query fragments concurrently (with selection and ordering being based on heuristics) and partial materialisation is used, as mentioned above. Because material- isation may increase the total response time, a benefit materialisation indicator (bmi) and a benefit materialisation threshold (bmt) are used. A bmi gives an approximate indication of the profitability of materialisation and the bmt is its minimum acceptable value.

The Tukwila [IFF+99, ILW+00] project has developed a data integration system in which queries are posed across multiple, autonomous and heterogeneous sources. Tuk- wila attempts to address the challenges of generating and executing plans efficiently

with little knowledge and variable network conditions. The adaptivity of the system lies in the fact that it interleaves planning and execution and uses adaptive operators. The operators in the query plan are annotated with their expected output cardinality. The system monitors the availability of remote data sources, the time an operator spends waiting for input tuples to infer the resource connection speed, the number of tuples each operator produces and the amount of memory it requires, along with the operator cost. The assessment and response are achieved by event-condition-action rules. An example of such a rule is: if at the end of the execution of a pipeline, the output cardinality is half of the expected one, then reoptimise the remainder of the query plan. Possible actions include operator reconfiguration to change the amount of memory allocated or the remote data source, operator reordering, operator replacement, and re-optimisation of the remainder of a query plan. The Tukwila system integrates adaptive techniques proposed in [KD98, UFA98, BFMV00a]. Re-optimisation is based on pipelined units of execution. At the boundaries of such units, the pipeline is broken and partial results are materialised. The main difference from [KD98] is that materialisation points can be dynamically chosen by the optimiser (e.g., when the system runs out of memory). Tukwila’s adaptive operators adjust their behaviour to data transfer rates and memory requirements. A collector operator is also used, which is activated when a data source fails so as to switch to an alternative data source.

The Tukwila adaptive query engine has also been used for XML query processing [IHW02, Ive02]. As in the case in which data is structured, the monitoring information includes aspects such as data cardinalities and rates of tuple production. In [Ive02, IHW04] an extension to [IFF+99] is presented, which incorporates the notion of eddies to route tuples through operators in order to allow tuples to be routed in different query plans running in parallel. As many different plans can be used during evaluation, a final clean-up phase is required to ensure result correctness.

BindJoins have also the capability to change the data sources accessed on the fly, like the Tukwila’s collector operator, in a self-reconfiguration manner [Man01]. The difference is that (i) BindJoins focus on operator cost rather than on resource connections, and (ii) they adapt when the operator cost can be reduced by a better machine allocation, rather than when performance expectations are not met.

Technique Monitoring Assessment Response Architecture

Focus Freq-

uency

Issue Response Form Impact Data

Local. QP Local. Adapt. Local. CACQ [MSHR02], psoup [CF03] data cardinality, operator cost intra- operator suboptimal operator scheduling operator rescheduling

query plan stream central central

dQUOB [PS01] data charac- teristics intra- operator subopt. oper. scheduling operator rescheduling

query plan stream central central chain [BBDM03] data cardinality intra- operator subopt. oper. scheduling operator rescheduling

query plan stream central central stream [MWA+03, BMM+04] data cardinality, resource memory intra- operator subopt. oper. scheduling, in- suf. memory operator rescheduling, reconfig.

query plan stream central central

Table 4.4: Summarising table of AQP proposals over streams, according to the classi- fications of Section 4.3.

In document RESOURCE AWARE QUERY PROCESSING ON THE GRID (Page 102-105)