Several sensor database query systems, such as Cougar [56] and TinyDB [57], have been developed by database researchers. These works aim to extend SQL-like systems to sensor networks, focusing on reducing power consumption during query processing. The Cougar approach to tasking sensor networks through declarative queries is introduced in [56]. A set of challenging research problems, including distributed in-network processing, query optimization, query languages, catalog management and multi-query optimization, is described and discussed as well. TinyDB, a query processor for sensor networks that incorporates acquisitional techniques, is presented in [57]. TinyDB is a distributed query processor that runs on each of the nodes in a sensor network. It has many of the features of a traditional query processor (e.g., the ability to select, join, project and aggregate data), but also incorporates a number of other features designed to minimize power consumption via acquisitional techniques. These techniques, taken in aggregate, lead to significant improvements in power consumption and increased accuracy of query results over non-acquisitional systems.
In addition to these two pioneering systems, a large number of studies have addressed other aspects of query processing for sensor networks [58][8][9][59]. An energy-efficient routing scheme for data collection from all nodes in a sensor network is proposed in [58]. The scheme exploits suppression, both spatial and temporal, to reduce the energy cost of sensor data collection. The suppression of spatial and temporal redundancy is modeled by monitoring node and edge constraints. A monitored node triggers a report if its value changes. A monitored edge triggers a report if the difference in values between its nodes changes. The set of reports collected at the base station is used to derive all node values. The routing scheme, constraint chaining, builds a network of constraints that are maintained locally but allow a global view of values to be maintained at minimal cost. In-network processing and data aggregation are presented in [8][9]. To answer aggregated queries such as “average” or “min”, sensor data can be aggregated at intermediate sensor nodes to reduce the amount of data transferred in the network during processing. This approach, referred to as “in-network aggregation”, can significantly reduce bandwidth consumption compared with approaches where data is collected and aggregated centrally. The operator placement problem, which deals with how to place the filter operators of a query at the “best” sensor node in the network based on their selectivity and cost so that the total cost of computation and communication is minimized, is addressed in [9]. It is shown that the problem is tractable; however, greedy algorithms can be suboptimal. An optimal algorithm is then presented for uncorrelated filters, correlated filters and multiway stream joins, respectively. The work in [59] complements sensors with statistical data models to provide more meaningful query results and reduce the number of message transmissions during data collection. Models can help provide more robust interpretations of sensor readings in the presence of inaccurate or even faulty readings, and can also extrapolate the values of missing sensors or of sensor readings at geographic locations where sensors are no longer operational. Furthermore, models provide new opportunities for optimizing the acquisition of sensor readings, because sensors are only used to acquire data when the model itself is not sufficiently rich to answer the query with acceptable confidence.
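The saving obtained by in-network aggregation can be illustrated with a minimal sketch. It is not the implementation of [8] or [9]; the routing tree, the readings and the names `Partial`, `aggregate_up` and `centralized_messages` are hypothetical and chosen only for illustration. Each node merges the partial state (sum, count) of its children with its own reading and forwards a single message to its parent, whereas the centralized alternative forwards every raw reading hop by hop to the root.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Partial:
    """Partial state for AVERAGE: a running sum and a count."""
    total: float
    count: int

    def merge(self, other: "Partial") -> "Partial":
        return Partial(self.total + other.total, self.count + other.count)


def aggregate_up(node, children, readings):
    """Return (partial state of the subtree rooted at node, messages sent inside it)."""
    state = Partial(readings[node], 1)
    messages = 0
    for child in children.get(node, []):
        child_state, child_msgs = aggregate_up(child, children, readings)
        state = state.merge(child_state)
        messages += child_msgs + 1  # one message from the child to this node
    return state, messages


def centralized_messages(node, children, depth=0):
    """Count transmissions when each raw reading is relayed hop by hop to the root."""
    return depth + sum(
        centralized_messages(c, children, depth + 1) for c in children.get(node, [])
    )


# Hypothetical routing tree (node 0 is the root) and hypothetical readings.
children = {0: [1, 2], 1: [3, 4], 2: [5]}
readings = {0: 20.0, 1: 22.0, 2: 21.0, 3: 25.0, 4: 23.0, 5: 21.0}

state, in_network_msgs = aggregate_up(0, children, readings)
average = state.total / state.count
central_msgs = centralized_messages(0, children)
print(average, in_network_msgs, central_msgs)
```

In this small tree the aggregated scheme sends one message per edge, while the centralized scheme pays one transmission per hop for every reading, so the gap widens as the tree becomes deeper.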
Multiple query processing, in particular the multiple query optimization (MQO) problem, has been studied by database researchers [96]. The focus is on finding common sub-expressions in a single complex query or in multiple such queries run as a batch. By identifying and evaluating the common sub-expressions only once, the overall evaluation cost of multiple queries can be reduced. Greedy and heuristic search algorithms have been designed for this purpose. The focus of MQO in sensor networks, however, is different, since data in sensor networks is spread over all sensors and aggregated results are usually returned to a base station during query processing. The sensor data, once aggregated, is difficult to reuse at the base station for multiple query optimization. New schemes are therefore needed to address these challenges.
The MQO problem has recently been addressed by several researchers [97][98][99][100]. The scheme presented in [97] uses spatial query information for multi-query optimization. An equivalence class (EC) is defined as the union of all regions covered by the same set of queries. A query is then expressed as the set of ECs intersecting its query region. In this approach, the EC becomes the unit of processing, and the results from these ECs are used to derive the results of the queries. Experimental results show that a large amount of energy can be saved using this optimization technique.
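The equivalence-class idea can be sketched as follows. This is only an illustration under stated assumptions, not the algorithm of [97]: the rectangular query regions, the sensor positions and readings, the choice of SUM as the aggregate, and the names `covering`, `ec_sum` and `answer_sum` are all hypothetical. Sensors are grouped by the exact set of queries covering them, one partial sum is computed per group, and each query is answered by combining the partial sums of the groups it belongs to.

```python
from collections import defaultdict

# Hypothetical rectangular query regions: (x_min, y_min, x_max, y_max).
queries = {
    "q1": (0, 0, 4, 4),
    "q2": (2, 2, 6, 6),
}

# Hypothetical sensors: position -> reading.
sensors = {
    (1, 1): 10.0,
    (3, 3): 12.0,
    (3, 1): 11.0,
    (5, 5): 14.0,
    (7, 7): 9.0,  # covered by no query, so it never reports
}


def covering(pos):
    """Return the set of queries whose region contains the position."""
    x, y = pos
    return frozenset(
        name
        for name, (x0, y0, x1, y1) in queries.items()
        if x0 <= x <= x1 and y0 <= y <= y1
    )


# One partial SUM per equivalence class (keyed by the covering query set).
ec_sum = defaultdict(float)
for pos, value in sensors.items():
    key = covering(pos)
    if key:
        ec_sum[key] += value


def answer_sum(query):
    """Combine the partial sums of every equivalence class the query belongs to."""
    return sum(total for key, total in ec_sum.items() if query in key)


print(len(ec_sum), answer_sum("q1"), answer_sum("q2"))
```

The reading at (3, 3) lies in the overlap of both regions, so it is aggregated once in its own class and reused by both queries instead of being collected twice.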
The impact of MQO is analyzed in [98]. A cost model is developed to study the benefit of exploiting common subexpressions in queries. The authors also propose several optimization algorithms, for both data acquisition queries and aggregation queries, that intelligently rewrite multiple sensor data queries (at the base station) into “synthetic” queries to eliminate redundancy among them before they are injected into the wireless sensor network. The set of running synthetic queries is dynamically updated upon the arrival of new queries as well as the termination of existing queries. A synthetic query is rewritten from a set of queries and essentially collects data for all of them. The idea of a synthetic query works for queries collecting raw data, but not for aggregated queries. Data aggregation, such as summation, is like a one-way function: the aggregated value can be derived from the raw data, but not vice versa. Similarly, the aggregated results of the queries from which a synthetic query is rewritten cannot be derived from the aggregated result of the synthetic query, even if the raw sensor data of these queries are covered by the synthetic query.
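The one-way nature of aggregation can be made concrete with a small sketch. It does not reproduce the rewriting algorithms of [98]; the node sets, the two reading assignments and the helpers `run_raw` and `run_sum` are hypothetical, and the synthetic query is assumed to be simply the union of the node sets of the original queries. With raw data, each original query is recovered by projection. With SUM, two different reading assignments give the same synthetic result but different results for the original query, so the latter cannot be derived from the former.

```python
# Hypothetical queries, each described by the set of sensor ids it reads.
q1_nodes = {1, 2, 3}
q2_nodes = {2, 3, 4}

# Assumed rewriting: the synthetic query covers the union of both node sets.
synthetic_nodes = q1_nodes | q2_nodes


def run_raw(nodes, readings):
    """Acquisition query: return the raw reading of every requested node."""
    return {n: readings[n] for n in nodes}


def run_sum(nodes, readings):
    """Aggregation query: return only the sum over the requested nodes."""
    return sum(readings[n] for n in nodes)


# Two hypothetical reading assignments that differ only at nodes 1 and 4.
readings_a = {1: 5, 2: 7, 3: 9, 4: 1}
readings_b = {1: 1, 2: 7, 3: 9, 4: 5}

# Raw data: the answer to q1 is a projection of the synthetic answer.
raw = run_raw(synthetic_nodes, readings_a)
q1_from_raw = {n: raw[n] for n in q1_nodes}

# Aggregated data: identical synthetic sums, different sums for q1.
synthetic_a = run_sum(synthetic_nodes, readings_a)
synthetic_b = run_sum(synthetic_nodes, readings_b)
q1_a = run_sum(q1_nodes, readings_a)
q1_b = run_sum(q1_nodes, readings_b)
print(synthetic_a, synthetic_b, q1_a, q1_b)
```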
The scheme is then extended into a Two-Tier Multiple Query Optimization (TTMQO) scheme [99]. The first tier, called base station optimization, adopts a cost-based approach to rewrite a set of queries into an optimized set that shares the commonalities and eliminates the redundancy among the queries in the original set. The optimized queries are then injected into the wireless sensor network. The second tier, called in-network optimization, efficiently delivers query results by taking advantage of the broadcast nature of the radio channel and sharing the sensor readings among similar queries over time and space at a finer granularity.
These proposed schemes for MQO [97][99] exploit spatial or temporal information among queries to reduce the transmission costs of multi-query processing. In this thesis, we propose to investigate a finer granularity, the semantic correlation among multiple queries, to further optimize multiple query processing in sensor networks.
The problem of “Many-to-Many aggregation” in sensor networks is addressed in [100], where destinations require data from multiple sensors while sensor data are also needed by multiple destinations. The ideas of multicast and in-network aggregation are combined to reduce communication costs. It is natural to use multicast to send source readings from sensors to multiple destinations. At the same time, in-network aggregation is used to reduce the data communication costs for each destination. However, a sensor reading, once aggregated, becomes specific to one destination and cannot be reused by other destinations. The goal is to determine when in-network aggregation should be performed during multicast to minimize the overall communication costs for all destinations.
A similar problem of computing multiple aggregations in stream processing is studied in [101]. In stream processing, many users run different, but often similar, queries against the stream. Several techniques have been developed to find commonalities among aggregated queries with the same or different predicates and windows. For aggregated queries with the same predicate but different windows, the stream is chopped into slices. For queries with the same window but different predicates, the predicates of the queries are used to divide the tuples at the stream source into fragments. The tuples in each fragment can be aggregated to form partial fragment aggregates, which can in turn be processed to produce the results for the various queries. These two techniques are combined to process queries with different predicates and windows. The proposed approach is particularly effective in handling query updates in a streaming system.
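The slicing technique for queries with the same predicate but different windows can be sketched as follows. This is a simplification rather than the method of [101]: it assumes non-overlapping (tumbling) windows, SUM as the aggregate and a finite stream, and the names `slice_edges`, `slice_sums` and `window_results` are hypothetical. The stream is cut at every boundary of every window, one partial sum is kept per slice, and each window is answered by adding the slices it contains.

```python
def slice_edges(window_lengths, horizon):
    """Cut points: every multiple of every window length, plus both stream ends."""
    edges = {0, horizon}
    for w in window_lengths:
        edges.update(range(w, horizon, w))
    return sorted(edges)


def slice_sums(stream, edges):
    """Partial SUM for each slice [lo, hi) between consecutive cut points."""
    return {(lo, hi): sum(stream[lo:hi]) for lo, hi in zip(edges, edges[1:])}


def window_results(w, horizon, partials):
    """Answer each complete tumbling window of length w from the slice sums."""
    results = []
    for start in range(0, horizon - w + 1, w):
        end = start + w
        results.append(
            sum(v for (lo, hi), v in partials.items() if start <= lo and hi <= end)
        )
    return results


# Hypothetical finite stream and two queries with window lengths 4 and 6.
stream = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8]
horizon = len(stream)

edges = slice_edges([4, 6], horizon)
partials = slice_sums(stream, edges)
sums_w4 = window_results(4, horizon, partials)
sums_w6 = window_results(6, horizon, partials)
print(edges, sums_w4, sums_w6)
```

Because the cut points include every window boundary, each slice lies entirely inside or entirely outside any given window, so the four slice sums are computed once and shared by both queries.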