Chapter 2: Background: Data Integration and XML
3.3 The Tukwila Data Integration System: An Adaptive Query Processor
In order to validate the techniques proposed in my thesis, both experimentally and for real-world applications, I have built an adaptive query processor as the core of the Tukwila data integration system. The current Tukwila implementation is a complete data integration environment except that it is not typically coupled with a query refor- mulator module: it accepts reformulated XQueries over XML data sources, it requests data from these sources via HTTP or NFS requests, and it combines the source XML to form new results. In other work, my co-authors and I have also combined the Tukwila system with the MiniCon reformulator algorithm of [PL00] to form a complete system with a mediated schema, which works for conjunctive relational queries. (Tukwila has also been the query processor for a number of other projects at the University of Washington, as I discuss in Chapter7.)
3.3.1 Novel Architectural Features
The focus of my thesis work has been on the query processing component of Tukwila, illustrated in Figure3.1. So far as I am aware, the Tukwila query processor is the first system to be engineered with the express goal of building an adaptive query processor. Key architectural aspects of Tukwila include the following:
• An architecture that emphasizes pipelined query execution, in order to produce early first results. Our system typically begins with a single pipelined query plan, and it can read and process XML as it streams across the network. Since a single pipeline may run out of resources, my architecture include capabilities for handling memory overflow and for adaptively splitting a single pipeline into multiple stages once a significant portion of computation has completed.
• Support for multithreaded execution of queries, in order to facilitate adaptive scheduling. Unlike most traditional, iterator-based query processors, my system is not forced to rely on deterministic scheduling, and instead it can selectively attempt to overlap I/O and computation to more effectively utilize available data and resources.
• Novel algorithms for network-based and pipelined query processing, including: (1) a variant of the pipelined hash join [RS86] that supports overflow resolution and the dynamic collector, (2) a specialized version of the relational union that, upon data source failure, can switch from the original data source to its mirrors or alternates, (3) an XPath pattern-matching operator, x-scan, which produces variable bindings in pipelined fashion over streaming XML data, and (4) an op- erator, indirect-scan that generalizes the relational dependent join operator for a web- and XML-oriented environment.
• Mechanisms for efficiently switching query plans in mid-execution of a pipeline, as well as at the interface between two separate pipelines. One of the key chal- lenges in building the Tukwila execution engine was providing mechanisms for efficiently passing data along pipelines (using techniques such as sharing space in the buffer pool between operators), but also enabling a query plan to be modi- fied in mid-execution.
• Close coupling between the query optimization and execution components. The optimizer and execution engine expose their data structures to a status monitor that periodically reads query execution cost and selectivity information and uses this to update the query optimizer’s cost model. If the actual values diverge sig- nificantly from the estimates, a query re-optimization operation may be triggered in the background.
• An event handler for responding to important runtime events. The event handler provides a way of responding to data source failures, network delays, and memory overflow. The event handler is capable of initiating a complete re-optimization of the query. Note that the event handler operates at a lower granularity than a handler for traditional database triggers or active rules: it responds to events at the physical level rather than the logical level, and it can also modify the behavior of query operators or initiate a change in query plans.
• An extended version of the System-R cost-based optimizer algorithm ([SAC+79])
that can be used not only to create initial query plans, but also to periodically re- optimize an executing query plan when better statistical information is available. • Support within both the cost-based query optimizer and the query execution sys- tem for sharing of data structures across different query plans within an exe- cution sequence. These capabilities are used to avoid repeating computation of results in a query execution sequence that consists of multiple plans.
Adaptive behavior during query execution is key in situations where I/O costs are variable and unpredictable. When data sizes are also unpredictable, it is unlikely that the query optimizer will produce a good query plan, so it is important to be able to modify the query plan being executed. As a result, Tukwila supports incremental re- optimization of queries during particular plan execution points.
In the following chapters, I provide a detailed discussion of each of the novel aspects of my work, and supplement this discussion with an experimental evaluation of the techniques within the Tukwila system.