
Stream processing in data-driven computational science

Ying Liu, Nithya N. Vijayakumar and Beth Plale

Computer Science Department, Indiana University

Bloomington, IN, USA

{yingliu, nvijayak, plale}@cs.indiana.edu

Abstract—The use of real-time data streams in data-driven computational science is driving the need for stream processing tools that work within the architectural framework of the larger application. Data stream processing systems are beginning to emerge in the commercial space, but these systems fail to address the needs of large-scale scientific applications. In this paper we illustrate the unique needs of large-scale data-driven computational science through an example taken from weather prediction and forecasting. We apply a realistic workload from this application against our Calder stream processing system to determine effective throughput, event processing latency, data access scalability, and deployment latency.

This work is supported in part by NSF grants EIA-0202048 and CDA-0116050, and DOE DE-FG02-04ER25600.

I. INTRODUCTION

The same technology advancements that have driven down the price of handhelds, cameras, phones and other devices have enabled affordable commodity sensors, wireless networks, and other devices for scientific use. As a result, scientific computing that was previously static, such as weather forecast prediction models, can now be envisioned as dynamic, with models triggered in response to changes in the environment. The cyberinfrastructure needed to bring about these dynamic capabilities is still evolving.

Stream processing in scientific applications differs from stream processing in other domains in important ways. We define a stream S as a sequence of events, S = {e_i}, where i is a monotonically increasing index and 0 < i < ∞. Events are often timestamped. Depending on the source, event flow rates in a stream can range from an event per microsecond to an event per day, and events can range in size from a few bytes to megabytes or gigabytes. The contents of an event could be, for instance, a new reading of a stock value, or could mark a state change in an application.
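To make the definition concrete, the following minimal Python sketch models a stream as a sequence of monotonically indexed, timestamped events. The Event class and its field names are illustrative only, not part of Calder.

import time
from dataclasses import dataclass
from typing import Any, Iterable, Iterator

@dataclass
class Event:
    index: int        # monotonically increasing i, with 0 < i < infinity
    timestamp: float  # time the observation was taken
    payload: Any      # e.g., a stock quote or an application state change

def stream(source: Iterable[Any]) -> Iterator[Event]:
    # Wrap raw readings from any source into a stream S = {e_i}.
    for i, reading in enumerate(source, start=1):
        yield Event(index=i, timestamp=time.time(), payload=reading)

for e in stream([10.2, 10.5, 10.1]):
    print(e)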

Stream processing falls into three general categories: stream management systems, rule engines, and stream processing engines [1].

In stream management systems, stream processing is similar to a traditional database management system, which could be relational [2] [3] or object-relational [4]. The interface is through a declarative SQL-style query language that has been augmented with operations over time-based tables [5]. A client invokes pre-built operations or can code his own in a procedural language that is then stored as a stored procedure [4].

Rule engines date from the early 1970's. Clients write rules in a declarative programming language in which patterns of events can be described [6] [7]. The rule language supports relational and temporal operators, as well as subtyping, parallelization, etc. [8]. When events arrive, selected rules in the rule base are fired, causing an action to result. Rule engines include Message Oriented Middleware (MOM) technologies. The latter hold a collection of user profiles, for instance in the form of XPath expressions, as rules [9] [10]. Arriving events are matched against the profiles, with the corresponding action being to forward the event to the user indicated in the profile.
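As a rough illustration of this MOM-style matching, the sketch below checks an arriving XML event against stored user profiles expressed as simple path expressions. It uses Python's standard ElementTree (which supports only a limited XPath subset), and the profile data and event are hypothetical; it only approximates what production systems such as those in [9] [10] do.

import xml.etree.ElementTree as ET

# Hypothetical profiles: user name -> limited XPath expression.
profiles = {
    "alice": ".//observation[@type='temperature']",
    "bob": ".//observation[@station='KMSY']",
}

def route(event_xml: str) -> None:
    # Match one arriving event against every stored profile and
    # forward it to each user whose expression selects something.
    doc = ET.fromstring(event_xml)
    for user, expr in profiles.items():
        if doc.findall(expr):
            print("forwarding event to " + user)

route('<event><observation type="temperature" station="KMSY">21.3</observation></event>')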

Stream processing engines (SPEs) are designed specifically for processing data flows on the fly. In many systems described in the literature and available commercially, engines execute queries continuously over arriving streams of data [11] [12] [13]. Clients describe their filtering and processing needs through a declarative query language, or through a graphical user interface (GUI) [14] [15] that is converted internally into queries. Events are processed on the fly, without necessarily storing them. Queries can be deployed dynamically [13], and can have their operators reordered on the fly [11]. The SPE uses constructs such as the time window to deal with the unbounded nature of the streams. The size of the sliding window determines the history over which a query operator can execute. Optimizations have been applied to yield memory savings, for instance in [13] [14] [16]. The SPE architecture uses an underlying storage and/or transport medium that can be files [12] [15], a publish-subscribe system [17], or sockets [18].
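The time-window construct mentioned above can be sketched generically as follows. This is an illustrative sliding-window buffer, not Calder's implementation; the window width and eviction policy are assumptions.

from collections import deque

class SlidingTimeWindow:
    """Keep only the events whose timestamps fall within the last `width` seconds."""

    def __init__(self, width_seconds: float):
        self.width = width_seconds
        self.events = deque()   # (timestamp, payload) pairs, oldest first

    def insert(self, timestamp: float, payload) -> None:
        self.events.append((timestamp, payload))
        self._evict(timestamp)

    def _evict(self, now: float) -> None:
        # Drop events that have fallen out of the window.
        while self.events and now - self.events[0][0] > self.width:
            self.events.popleft()

    def contents(self):
        return list(self.events)

# A join or aggregate operator would then execute only over window.contents().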


The contributions of this paper are as follows. Through our extensive study of stream processing in the context of scientific computing, we have come to understand what we believe to be fundamental differences between stream processing in scientific computing and stream processing elsewhere; we list these requirements here. Having worked with meteorology researchers over the past several years, we understand their needs more clearly than those of other domains. Hence we have developed a realistic stream workload and stream processing scenario for dynamic weather forecasting and use it to illustrate features of stream processing in data-driven scientific computing, through the Calder system developed at Indiana University. In [13] we evaluated throughput and deployment latency of single queries on a synthetic workload. In this paper we extend that work to encompass distributed collections of queries and users under synthetic and realistic workloads. Specifically, we measure effective throughput, event processing latency, data access scalability, and deployment latency. Our results show that good performance and excellent scalability can be achieved by a service that fits within the context of a data-driven, workflow-orchestrated computational science application.

The remainder of the paper is organized as follows. In Section II, we list and discuss unique features of data streams in data-driven science and the requirements of stream processing systems in the scientific domain. In Section III, we describe a dynamic data stream example from weather prediction and forecasting. In Section IV, we briefly describe the Calder stream processing architecture and show how it fits in the framework of meteorology forecasting. In Section V, we experimentally evaluate our system under a realistic meteorological workload. Conclusions and future work are discussed in Section VI.

II. STREAM PROCESSING IN COMPUTATIONAL SCIENCE

Stream processing in computational science introduces challenges not always fully present in domains such as finance, media streaming, and business (such as RFID tags). We characterize the unique requirements of data-driven computational science as follows, and argue that most of the data-driven applications we have observed have these requirements.

A) Heterogeneous data formats. Science applications use and generate data in many different data formats, including netCDF, HDF5, FITS, JPG, XML, and ASCII. The binary formats can have complex access and retrieval APIs.

B) Asynchronous streams. Stream generation rates can be highly asynchronous. One event stream might generate an event once every millisecond, while another might generate an event only once every 12 hours. Some SPEs fuse or join streams based on the assumption of relatively synchronous streams.

C) Wide variance in event sizes. Events generated by a sensor are only a few bytes in size, while events generated by large-scale instruments or regularly run models can be in the 10's of megabytes in size.

D) Timeliness is relative. One application may want to be notified the instant a condition occurs, whereas for a second application a condition may only emerge over days or weeks.

E) Streaming is part of a larger system. Stream processing in data-driven computational science can be one small part of a much larger system. Its architecture must be compatible with the overall system architecture.

TABLE I
OBSERVATIONAL DATA SOURCES USED IN MESOSCALE METEOROLOGY. SHOWS THE RATES AND SIZES OF DATA PRODUCTS OVER NEW ORLEANS.

Data Source             | No. sources | Ev. Size (KB) | Ev. Rate (ev/hr) | Cum. Rate (ev/hr) | Cum. BW (Kbps)
Metars (1st order)      | 27          | 1-5           | 3                | 81                | 0.9
Metars (2nd order)      | 100         | 1-5           | 1                | 100               | 1.1
Rawinsondes (buoy data) | 9           | 2125          | 0.08             | 0.75              | 0.04
Acars                   | 30          | 100-700       | 10               | 300               | 466.67
NexRad II               | 5           | 163-1700      | 6-12             | 60                | 222.2
NexRad III              | 5           | 2-20          | 6-12             | 60                | 2.67
GOES (model data)       | 1           | 4400          | 2                | 2                 | 19.6
Eta (model data)        | 4           | 41500         | 0.17             | 0.67              | 615
CAPS (sensors)          | 10          | 62.5-15.6     | 12-60            | 600               | 20800

Fig. 1. Data sources around New Orleans.

F) Scientists need changes as an experiment progresses. One could envision a dynamic weather prediction workflow that data mines a region of the atmosphere looking for tornado signatures, then kicks off a prediction model. The region over which data mining is carried out will change as a storm moves across the Midwest, for instance. As the storm moves, the filtering criteria (e.g., the spatial region) must adapt (see the sketch following requirement G).

G) Domain-specific processing. Much stream processing in computational science is domain specific. For instance, a mesoscale detection algorithm classifies vortices detected in Doppler radar data. Thus, a stream processing system needs to be extensible; that is, it needs to provide mechanisms for scientists to extend stream and query processing with their own operators.
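Requirement F can be pictured with a small sketch: as the storm centroid moves, a new bounding box is computed and the spatial filter predicate is rebuilt. The function names, the box half-width, and the idea of resubmitting a query with an updated WHERE clause are illustrative assumptions; the paper states only that the filtering criteria must adapt.

def bounding_box(lat: float, lon: float, half_width_deg: float = 1.0) -> dict:
    # Square region centered on the storm's current centroid.
    return {
        "southBound": lat - half_width_deg,
        "northBound": lat + half_width_deg,
        "westBound": lon - half_width_deg,
        "eastBound": lon + half_width_deg,
    }

def spatial_predicate(box: dict) -> str:
    # Rebuild the filter clause each time the region of interest moves.
    return ('southBound >= "%.2f" and northBound <= "%.2f" and '
            'westBound >= "%.2f" and eastBound <= "%.2f"'
            % (box["southBound"], box["northBound"],
               box["westBound"], box["eastBound"]))

# As the storm moves north-eastward from New Orleans, redeploy with a new box:
print(spatial_predicate(bounding_box(29.98, -90.25)))
print(spatial_predicate(bounding_box(31.10, -89.40)))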

III. METEOROLOGY EXAMPLE

Meteorology is a rich application domain for illustrating the uniqueness of stream processing in scientific domains. Atmospheric scientists have a considerable number and variety of weather observational instruments available to them, due in large part to over 100 years of history in observing the atmosphere. Tools such as the Unidata Internet Data Dissemination (IDD) [19] system distribute many of the data products to interested universities for research purposes. The data products range considerably in their sizes and generation rates.


Table I lists nine of the most common data products. These products are moved to the location where the weather forecast model is to run, then ingested into the model at runtime.

To illustrate the use of stream processing engines in this context, suppose that an atmospheric science student is studying Fall severe weather in the region around New Orleans, Louisiana (see Figure 1) and wants to kick off a regional 1 km forecast when a storm cell emerges. Figure 1 shows the region around New Orleans (approximately 29.98 degrees North latitude and 90.25 degrees West longitude). The inner box in Figure 1 marks an area 2 degrees of latitude high and 2 degrees of longitude wide around New Orleans, where one degree of latitude is approximately 70 statute miles and one degree of longitude is approximately 60 statute miles. The figure is taken from the GeoGUI in the LEAD portal [20].

The number of data products, and their sizes and rates, for the sensors that overlap the 80-mile radius around New Orleans are given in Table I. We call this the New Orleans Workload. The table shows nine data products and, for each type, gives the number of sources. The event rate is the rate at which events are generated at the source. The cumulative rate and bandwidth are calculated over all data sources within a data type and under storm mode. An event is a timestamped observation from a data source. For the NexRad Level II Doppler radar, for instance, an event corresponds to a scan, where one scan consists of fourteen 360-degree sweeps of the radar. A scan completes in 5-7 minutes. The range given in the event size column of the table is bimodal: the small event size occurs during clear skies, and the large event size occurs during storm conditions.
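The cumulative columns in Table I follow directly from the per-source figures. A small worked check for the first-order Metars row, assuming the upper 5 KB event size, is sketched below.

def cumulative(num_sources: int, events_per_hour: float, event_kb: float):
    # Cumulative event rate over all sources of one data type (events/hr),
    # and the corresponding bandwidth in Kbps (KB -> Kb is x8, hours -> seconds is /3600).
    rate = num_sources * events_per_hour
    kbps = rate * event_kb * 8 / 3600
    return rate, kbps

rate, kbps = cumulative(num_sources=27, events_per_hour=3, event_kb=5)
print(rate, round(kbps, 1))   # 81 events/hr and roughly 0.9 Kbps, matching Table I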

The variability in event rates in Table I, from 0.08 ev/hr to 1 ev/min, and the variability in event sizes, from 1 KB to 41 MB, clearly demonstrate several stream processing requirements of Section II, specifically asynchronous streams (requirement B) and wide variance in event sizes (requirement C). This collection of data products also demonstrates a common requirement of stream processing in scientific domains, that of heterogeneous data products (requirement A). The product formats shown in Table I alone include text, raw radar format, model-specific binary format, images, and netCDF data.

IV. CALDER ARCHITECTURE

Calder, developed at Indiana University, falls into the category of a stream processing engine (SPE). Its purpose is to provide timely access to data streams. Additional details of the system architecture can be found in [13]. In this section, we provide a brief overview of the system architecture and show how a stream processing engine fits into a larger data-driven computational science application. In particular, we discuss a scenario in the context of the mesoscale meteorology forecasting example of Section III.

We view data streams as a virtual data repository that, while constantly changing, has many similarities to a database [21]. Like a database, a collection of streams is bound by coherence, in that the streams belonging to a collection are related to one another, and possesses meaning, in that a collection of streams can be described. We call such a collection of streams a Virtual Stream Store. Calder manages multiple virtual stream stores simultaneously and provides users with query access to one or more virtual stream stores.

Fig. 2. Calder architecture.

Calder uses a publish-subscribe system, dQUOBEC [22], as its underlying transport layer. How sensors and instruments are pub-sub enabled is outside the scope of our research, but solutions exist, such as [23], which takes an XML approach. This pub-sub enabling is shown in Figure 2 as a single point of presence; however, other approaches exist.

In the simplified diagram of Figure 2, the data streams flow to a query execution engine where they are received by handlers. The runtime acts on each incoming event by triggering one or more queries. A query executes on the event, and generates zero, one, or more events that either trigger other queries in the system or flow to the Rowset Service where they are stored to a ring buffer for user access.
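This event-driven behavior can be caricatured in a few lines. The registry, the query callables, and the ring-buffer size below are illustrative assumptions, not Calder internals.

from collections import deque

ring_buffer = deque(maxlen=1024)      # results held for later user access
queries = {}                          # event type -> list of query functions

def register(event_type: str, query) -> None:
    queries.setdefault(event_type, []).append(query)

def on_event(event_type: str, event) -> None:
    # Each incoming event triggers every query registered for its type;
    # a query returns zero or more derived events.
    for query in queries.get(event_type, []):
        for result in query(event):
            ring_buffer.append(result)   # or re-dispatch to trigger other queries

# Example: a pass-through query on NexRad Level II events.
register("NexRadII", lambda ev: [ev])
on_event("NexRadII", {"scan": 1})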

User interaction with Calder follows the Globus OGSI model of service interaction, where a grid data service (GDS) is created on behalf of a user to serve an interaction with the virtual stream store. The user submits SQL-like queries through the GDS. Details of the extended GDS interface are given in [24]. The query planner service optimizes and distributes queries and query fragments based on local and global optimization criteria. The query planner service initiates a request to the rowset service to create a new ring buffer for the query.

Calder supports monotonic time-sequenced SQL Select-From-Where queries. The operators supported are select/project/join operators, where the join operator is an equi-join over the logical or physical time fields; the boolean operators are AND and OR; and the relational operators are =, ≠, ≤, ≥, <, and >.



Fig. 3. Stream processing to detect vortices in Doppler radar data (below) as part of a larger workflow (above).

We do not currently support aggregate operations like GROUP BY, but are working towards it. In addition, our language supports START and EXPIRE clauses for specifying the lifetime of the query, and the RANGE clause for specifying a user's approximation of the divergence in stream rates that the query will experience. RANGE is an optional clause that is required only for queries that include join operations. The EXEC_FUNC clause specifies a user-defined function to be executed on the resulting events.
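For illustration, the sketch below assembles a query string with the START/EXPIRE lifetime clauses described above and submits it through a stand-in client. The client class, the placement of the RANGE clause, and the stream name are assumptions made for this sketch; the exact clause syntax should be taken from the Calder documentation rather than from here.

class CalderClient:
    # Stand-in for the GDS interaction of Section IV; a real submission goes
    # through the extended OGSA-DAI grid data service, not this stub.
    def submit(self, query: str) -> None:
        print("submitting:\n" + query)

def lifetime_query(stream, start, expire, range_secs=None):
    clauses = ["SELECT *", "FROM " + stream,
               'START "' + start + '"', 'EXPIRE "' + expire + '"']
    if range_secs is not None:
        # RANGE applies only to join queries; its placement here is an assumption.
        clauses.insert(2, "RANGE " + str(range_secs))
    return "\n".join(clauses)

CalderClient().submit(lifetime_query("NexRad Level II",
                                     "2006-03-24T00:00:00.000-05:00",
                                     "2006-03-25T00:00:00.000-05:00"))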

As we indicated in requirement E of Section II, an SPE must often operate as part of a larger system. In applications where it makes sense to treat a collection of streams as a coherent and meaningful data resource, Calder provides continuous query access to the resource. Figure 3 illustrates how Calder works in a real application, in this case from mesoscale meteorology. Suppose a storm front is moving across the U.S. Midwest, threatening to spawn tornadoes. A user wants to deploy a data mining agent that can detect precursor conditions for a tornado and, when they are detected, spawn a weather prediction model.

A scientist creates an experiment by interacting with an experiment builder [25] accessed through a science gateway. The specification is handed off to a workflow engine, which interacts with component pieces through a notification system. The workflow engine interacts with Calder by passing it a declarative query, similar to how it would interact with a database management system. Calder optimizes the query and deploys it (which includes the data mining classification components [26]) at a computational node located, for instance, on the TeraGrid [27]. The query, when instantiated at the computational node, executes the filtering/data mining loop depicted at the bottom of Figure 3 for every incoming Doppler radar scan. When the classification algorithm detects a vortex pattern that exceeds the threshold, a response trigger is issued to the response channel. The workflow engine reads the response channel and acts on the message to wake the dormant prediction simulation.

V. EXPERIMENTS

As discussed in Section II, data-driven computational science imposes unique demands on a stream processing engine (SPE). While some of these requirements are future work (see Section VI), Calder already addresses several important requirements. One of these is the requirement that the engine adapt to the changing needs of the experiment (requirement F). Calder addresses this through dynamic deployment of queries at runtime. We also experimentally evaluate the scalability of the rowset service because, while not unique to scientific computing, it is important nonetheless. Finally, we examine throughput and event processing latency of a query execution engine for the scenario given in Section III.

A. Experimental Setup

We developed a workload simulator that simulates the instrument types common to mesoscale meteorology. The simulator generates events at realistic sizes and rates, as shown in Table I. Our workload generator is a set of highly configurable, parallelized processes. Each process takes a channel name, data type, rates, sizes, and modes (clear or storm) for one instrument and produces a stream of events of the required size and rate with pre-set metadata. The streams publish events onto the dQUOBEC publish-subscribe system, one stream per channel. In our experimental setup, each query execution engine registers through the pub-sub system to receive all data products from the workload simulator.
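A stripped-down version of one simulator process might look like the following. The channel name, the publish callback, and the clear/storm size split are assumptions made for illustration; the real simulator publishes onto dQUOBEC channels rather than printing.

import time

def instrument_process(channel: str, events_per_hour: float,
                       size_kb_clear: float, size_kb_storm: float,
                       mode: str = "clear", publish=print) -> None:
    # One simulated instrument: emit events of the configured size and rate
    # onto its channel until interrupted.
    interval = 3600.0 / events_per_hour
    size_kb = size_kb_storm if mode == "storm" else size_kb_clear
    while True:
        payload = bytes(int(size_kb * 1024))          # dummy body of the right size
        publish({"channel": channel, "bytes": len(payload),
                 "metadata": {"mode": mode, "t": time.time()}})
        time.sleep(interval)

# e.g. one NexRad Level II radar in storm mode (values from Table I):
# instrument_process("NexRadII-1", events_per_hour=12,
#                    size_kb_clear=163, size_kb_storm=1700, mode="storm")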

The experiment is executed on a 128-node cluster where each node runs RHEL WS release 4 and has dual 2.0 GHz 64-bit AMD Opteron processors with 4 GB of memory. The Opteron nodes are interconnected by a 1 Gbps LAN. The simulator processes execute on 9 cluster nodes.

B. Query Deployment Latency

In this first experiment, we examine the time taken to deploy a continuous query into the Calder system while performing under the New Orleans workload. We used a set of select-project queries that filter the data products on temporal and spatial aspects. Currently, Calder supports only falls-within (boundary check) spatial queries. Users submit queries through the Grid Data Service. The query planner service creates a query execution plan, and then deploys the query to the query execution engine. Microbenchmarks of the steps of deployment are presented in [13]. Here we examine the scalability of query deployment latency by submitting 1000 queries across 25 query processing engines using 25 nodes of the Opteron cluster. The number of simultaneous users submitting queries is set at 50, based on a study [28] that estimates the number of users running canonical workflows in LEAD simultaneously at 50.

Query deployment latency includes query plan generation, query distribution, and installation time, plus the overhead of XML and SOAP communication and processing between the different components of Calder. Figure 4 shows the average query deployment latency as seen by 50 users for 20 queries; latency is given in milliseconds. The deployment latency of the nth query in the figure was computed by taking the average of the deployment latency of the nth query over all 50 users.

One can see from Figure 4 that latency is high for the first query and low thereafter. The initial high latency can be attributed to the large user proxy creation (GDS setup) time of approximately 1200 ms. Figure 4 shows the average latency; after the first query, the query deployment latency seen by the user is almost constant at around 300 to 400 ms. The table embedded at the top right of Figure 4 shows the overall distribution of deployment latency for all 1000 queries. From this table, it can be observed that the majority of queries fall in the 200-400 ms range.

Fig. 4. Average query deployment latency and frequency distribution for 1000 queries. (X axis: number of queries; Y axis: query deployment latency in ms.) The embedded frequency table reads:

Time (ms)   | Count
0-200       | 2
200-400     | 795
600-800     | 4
1400-1600   | 41
1600-1800   | 9
1800-2000   | 1

Fig. 5. Response time of the rowset service as seen by a single user in the presence of n users. (X axis: number of users; Y axis: average response time in ms; plotted: average RT and trend.)

C. Data Access Scalability

The rowset service provides users and programs with flexibility in data access by synchronizing data generation between the query execution engine and requests by the users. Users request their results through an OGSA-DAI v6 GDS (OGSI) that has been extended to support a stream data resource [24]. The GDS maintains a persistent connection to the rowset service, and thus a user can submit any number of rowset requests using a single GDS. In this experiment we study the scalability of this service by measuring the response time as seen by a single user in the presence of multiple other users and the resultant data streams from the New Orleans workload. Each user instantiates a GDS to connect to the rowset service. The rowset service response time is defined as the period from the instant a request is submitted to the rowset service until the instant the user receives the first result.

The scalability experiment consists of two simultaneous tasks. The rowset service is fed with the New Orleans workload data products defined in Table I. The user request workload is simulated as many users sending requests to the rowset service simultaneously. We measure a single user's response time while gradually increasing the number of users sending requests to the rowset service from 0 to 800.
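The measurement itself can be pictured as below. The request function is only a placeholder for a GDS rowset request (its 5 ms sleep is arbitrary), and the background-user count is what the experiment sweeps from 0 to 800.

import time
import threading

def rowset_request() -> None:
    # Placeholder for a GDS request to the rowset service; returns on first result.
    time.sleep(0.005)

def measured_response_time(background_users: int, runs: int = 50) -> float:
    # Keep `background_users` threads issuing requests while timing one user's requests.
    stop = threading.Event()
    def load():
        while not stop.is_set():
            rowset_request()
    threads = [threading.Thread(target=load, daemon=True) for _ in range(background_users)]
    for t in threads:
        t.start()
    samples = []
    for _ in range(runs):
        t0 = time.perf_counter()
        rowset_request()
        samples.append((time.perf_counter() - t0) * 1000.0)   # milliseconds
    stop.set()
    return sum(samples) / len(samples)

print(measured_response_time(background_users=10))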

The results appearing in Figure 5 are average response times calculated over 50 runs. Scaling beyond 800 users encountered a limit on the maximum number of open sockets for a process, because each GDS maintains an active connection to the rowset service. We can see from the figure that the response time increases in proportion to the number of users, but the response time with 800 users in the system is still within a reasonable range, at about 20 ms. The best-fit plot shows the trend of increase. The variation in response time is caused by variations in the rates of the input streams, which in turn influence the rates at which streams arrive at the rowset service.

D. Throughput of Query Execution Engine

Fig. 7. Input and output network bandwidth at a query execution engine under increasing query load, with pass-through queries running over the New Orleans workload. (X axis: queries, 0-100; Y axis: bandwidth (BW) in Mbps; plotted: input BW, output BW scatter, mean output BW, and trend.)


Fig. 6. Data types and bandwidth produced by the workload simulator (left) operating in storm mode, and collective output bandwidth from the queries' output streams (right). Raw data on the input channels reach a maximum collective bandwidth of 22.1 Mbps under storm mode; output data on the channels reach a maximum collective bandwidth of 263.9 Mbps under storm mode.

The purpose of this third experiment is to compute throughput for a single query execution engine on a large number of queries and realistic data streams with high throughput (output bandwidth). The output bandwidth of a single query is the product of the rate and size of the output events produced by the query. The overall output bandwidth of a quoblet is the sum of the output bandwidth produced by all the queries deployed in the system at that time. Figure 6 shows data products reflecting the New Orleans workload given in Table I. The cumulative input bandwidth shown in Figure 6 is calculated by adding the bandwidth of each data product under storm mode.
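The definition above amounts to a simple sum; a sketch with made-up per-query rates and sizes is shown here for concreteness.

def output_bandwidth_mbps(query_outputs) -> float:
    # Each query contributes (output events/sec) x (event size in KB); the
    # quoblet's overall output bandwidth is the sum over all deployed queries.
    total_kb_per_sec = sum(rate_per_sec * size_kb for rate_per_sec, size_kb in query_outputs)
    return total_kb_per_sec * 8 / 1000        # KB/s -> Mbps

# Three hypothetical pass-through queries, each forwarding one product unchanged:
print(output_bandwidth_mbps([(0.5, 20), (0.2, 100), (1.0, 5)]))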

In this experiment, we capture the scalability of a query execution engine on a single computational node as a function of the output bandwidth of the engine. We gradually increase the number of queries deployed and measure the overall output bandwidth at different times. We use a suite of metadata filtering queries, each of which executes on one of the 9 data products. For the purposes of this experiment, our queries are pass-through (select all) queries that act on one data product at a time. Pass-through queries remove the query bias on the output bandwidth. Each additional query produces a specific output stream.

An example of a pass-through query is as follows:

SELECT *

FROM NexRad Level II

START "2006-03-24T00:00:00.000-05:00" EXPIRE "2006-03-25T00:00:00.000-05:00";

The query execution engine (the quoblet) is hosted on a single computational node. The output streams generated by the queries are fed to client processes listening on corresponding output channels. The streams are mapped one-to-one to channels in the underlying pub-sub system. The workload simulator was configured to generate the New Orleans workload under clear-sky mode. We increased the rates of a few data products while correspondingly decreasing their sizes to maintain a smooth, continuous input. The queries were submitted at 10-second intervals and the throughput was measured at 5-second intervals.

Figure 7 shows the input and output bandwidth (Y axis) measured for 100 queries (X axis). We can see that the input bandwidth is steady at around 0.5 Mbps (clear mode), which is less than the cumulative input bandwidth shown in Figure 6 (storm mode). Figure 7 also shows that the output bandwidth of the engine increases with the number of queries. The scatter plot shows all of the output bandwidth measurements taken; the average output bandwidth values are connected, and a trend line is superimposed to show the increasing nature of output bandwidth with the number of queries. From Figure 7, we can see that the output bandwidth keeps increasing linearly with an increasing number of queries. We tested up to 400 queries and this increasing trend continues, providing an average throughput of 38 Mbps for 400 queries. This shows the ability of the quoblet to scale well to hundreds of queries under clear mode. We are currently working on throughput measurements for storm mode as well.

The maximum number of queries supported by a query execution engine is influenced by several factors, including the arrival rate of the input streams and the complexity of the queries. The current measurements were taken in a cluster where the stream providers and the query processing node were on the same LAN and were connected by a 1 Gbps Ethernet connection. In a wide area network, the output bandwidth may be restricted by the maximum bandwidth of the network links.


TABLE II
SERVICE TIME FOR EXECUTING FILTER QUERY AND DATA MINING ALGORITHM ON NEXRAD LEVEL II DATA.

Description          | Average (ms) | Std. Deviation
Query Execution Time | 0.343712     | 0.042686
MDA Execution Time   | 624.807      | 9.37785
Total Service Time   | 626.407      | 9.38881

Fig. 8. Execution of filter query and data mining algorithm on NexRad II data


E. Event Processing Latency

In this final experiment we measure event processing latency (service time) for a "typical" query in the context of the motivating example of Section III; that is, the query portion filters out all data products that are not NexRad Level II data or that fall outside the geospatial region of interest. The data mining portion of the query is a Mesoscale Detection classifier algorithm [26] that operates over the NexRad Level II data. Figure 8 shows the relationship between the two pieces, and a representative query is as follows:

SELECT *

FROM NexRad Level II

WHERE southBound >= "28.00" and eastBound <= "-89.00"
  and northBound <= "31.00" and westBound >= "-91.00"
EXEC_FUNC MDA_Algorithm
START "2006-03-24T00:00:00.000-05:00"
EXPIRE "2006-03-25T00:00:00.000-05:00"

Table II shows the service time distributed across the filtering and mining parts of the query execution. We can see that query execution consumes a small fraction of the total service time. More complex queries may consume more execution time, but this confirms earlier results [29], and also confirms our earlier result that service time is dependent on the rates of the input streams when joins are involved [17].

VI. CONCLUSION AND FUTURE WORK

In this paper we have distinguished the major categories of stream processing systems, and have argued that data-driven science imposes unique demands on stream processing systems. The stream processing needs of meteorology researchers are evidence of the unique requirements of stream processing systems in data-driven applications. We illustrated this point by describing a use scenario. The scenario motivates the experimental evaluation carried out on the Calder stream processing engine. Specifically, the experiments apply a realistic workload from the meteorology application against the Calder system to experimentally determine its effective throughput, event processing latency, data access scalability, and deployment latency.

The primary focus of our ongoing work is to support querying over XML data streams, because Calder, though it currently supports various data formats, lacks the ability to dynamically add new data formats and user-defined functions (needed to satisfy requirement A). XML-based language support will allow users to dynamically define new data formats. Our second focus is stream resource discovery. As discussed earlier, a collection of active streams forms a Virtual Stream Store; the store must be describable in a way that allows clients to discover it, understand the data it contains, and issue suitably formatted queries. Capturing metadata about the streams, the collection of streams, queries, and other details such as data format is key to enabling discovery. Third, optimal query placement depends upon information about the computational mesh in which the queries exist, so the metadata must include performance monitoring information collected about streams in real time. We are also planning to migrate our GDS to OGSA-DAI WSRF 1.0, which is compatible with Globus Toolkit 4.0. Finally, we are examining issues of user privacy, approximate query processing in the presence of missing data and missing streams, and dynamic deployment of user-specified data mining. The latter is needed to satisfy requirement G.

REFERENCES

[1] M. Stonebraker, U. Cetintemel, and S. Zdonik, "The 8 requirements of real-time stream processing," SIGMOD Record, vol. 34, no. 4, 2005.
[2] A. Arasu, B. Babcock, S. Babu, J. Cieslewicz, M. Datar, K. Ito, R. Motwani, U. Srivastava, and J. Widom, "Stream: The Stanford data stream management system," in Data Stream Management, 2004.
[3] T. Johnson, C. D. Cranor, and O. Spatscheck, "Gigascope: A stream database for network applications," in ACM SIGMOD International Conference on Management of Data, 2003.
[4] M. G. Koparanova and T. Risch, "High-performance stream-oriented grid database manager for scientific data," in 1st European Across Grids Conference, 2003.
[5] A. Arasu, S. Babu, and J. Widom, "The CQL continuous query language: Semantic foundations and query execution," Very Large Database (VLDB) Journal, vol. 14, no. 1, 2005.
[6] D. Luckham, The Power of Events. Addison Wesley, 2002.
[7] L. Brownston, R. Farrell, E. Kant, and N. Martin, Programming Expert Systems in OPS5. Addison Wesley, 1985.
[8] D. C. Luckham and J. Vera, "An event-based architecture definition language," IEEE Transactions on Software Engineering, vol. 21, no. 9, pp. 717-734, 1995.
[9] M. Altinel and M. J. Franklin, "Efficient filtering of XML documents for selective dissemination of information," in The Very Large Database (VLDB) Conference, 2000.
[10] B. Nguyen, S. Abiteboul, G. Cobena, and M. Preda, "Monitoring XML data on the Web," SIGMOD Record, vol. 30, no. 2, pp. 437-448, 2001.
[11] R. Avnur and J. M. Hellerstein, "Eddies: Continuously adaptive query processing," in ACM SIGMOD International Conference on Management of Data, 2000.
[12] S. Chandrasekaran, O. Cooper, A. Deshpande, M. J. Franklin, J. M. Hellerstein, W. Hong, S. Krishnamurthy, S. R. Madden, F. Reiss, and M. A. Shah, "TelegraphCQ: Continuous dataflow processing for an uncertain world," in Conference on Innovative Data Systems Research (CIDR), 2003.
[13] N. Vijayakumar, Y. Liu, and B. Plale, "Calder query grid service: Insights and experimental evaluations," in CCGrid Conference, 2006.
[14] D. J. Abadi, Y. Ahmad, M. Balazinska, U. Cetintemel, M. Cherniack, J.-H. Hwang, W. Lindner, A. S. Maskey, A. Rasin, E. Ryvkina, N. Tatbul, Y. Xing, and S. Zdonik, "The design of the Borealis stream processing engine," in Second Biennial Conference on Innovative Data Systems Research (CIDR), 2005.
[15] U. V. Catalyurek, "Supporting large scale data driven science in distributed environments," in Minisymposium on Distributed Data Management Infrastructures for Scalable Computational Science and Engineering Applications, SIAM Conference on Computational Science and Engineering (SIAM CSE '05), 2005.
[16] B. Plale, "Leveraging run time knowledge about event rates to improve memory utilization in wide area data stream filtering," in IEEE International Symposium on High Performance Distributed Computing, 2002.
[17] B. Plale and N. Vijayakumar, "Evaluation of rate-based adaptivity in joining asynchronous data streams," in 19th IEEE International Parallel and Distributed Processing Symposium, April 2005.
[18] D. Abadi, D. Carney, U. Cetintemel, M. Cherniack, C. Convey, S. Lee, M. Stonebraker, N. Tatbul, and S. Zdonik, "Aurora: A new model and architecture for data stream management," Very Large Database (VLDB) Journal, vol. 12, no. 2, pp. 120-139, 2003.
[19] B. Domenico, "Unidata Internet Data Distribution: Real-time data on the desktop," in Science Information Systems Interoperability Conference (SISIC), 2005.
[20] K. K. Droegemeier, V. Chandrasekar, R. Clark, D. Gannon, S. Graves, E. Joseph, M. Ramamurthy, R. Wilhelmson, K. Brewster, B. Domenico, T. Leyton, V. Morris, D. Murray, B. Plale, R. Ramachandran, D. Reed, J. Rushing, D. Weber, A. Wilson, M. Xue, and S. Yalda, "Linked environments for atmospheric discovery (LEAD): A cyberinfrastructure for mesoscale meteorology research and education," in 20th Conf. on Interactive Information Processing Systems for Meteorology, Oceanography, and Hydrology, Seattle, WA, 2004.
[21] B. Plale, "Using global snapshots to access data streams on the grid," in 2nd European Across Grids Conference (AxGrids), Lecture Notes in Computer Science, vol. 3165, Springer Verlag, 2004.
[22] N. Vijayakumar and B. Plale, "dQUOBEC event channel communication system," Computer Science Department, Indiana University, Tech. Rep. TR614, 2005.
[23] D. McMullen et al., "Instruments and sensors on the grid: Issues and challenges," in GlobusWorld, 2005.
[24] Y. Liu, B. Plale, and N. Vijayakumar, "Realization of GGF DAIS data service interface for grid access to data streams," Computer Science Department, Indiana University, Tech. Rep. TR613, 2005.
[25] B. Plale, D. Gannon, Y. Huang, G. Kandaswamy, S. L. Pallickara, and A. Slominski, "Cooperating services for data-driven computational experimentation," Computing in Science and Engineering (CiSE), 2005.
[26] J. Rushing, R. Ramachandran, U. Nair, S. Graves, R. Welch, and A. Lin, "ADaM: A data mining toolkit for scientists and engineers," Computers and Geosciences, vol. 31, 2005.
[27] TeraGrid, 2005. [Online]. Available: http://www.teragrid.org
[28] B. Plale, "Usage study for data storage repository in LEAD," LEAD Tech. Rep. TR001, 2005.
[29] B. Plale and K. Schwan, "Dynamic querying of streaming data with the dQUOB system," IEEE Transactions on Parallel and Distributed Systems, vol. 14, no. 4, April 2003.
