Combining Sequence Databases and Data Stream Management Systems

Technical Report

Philipp Bichsel ETH Zurich, 2-12-2007

Abstract

This technical report explains the differences and similarities between the research areas of sequence databases and data stream processing systems. The first part gives an overview of both topics. The remaining sections examine the differences and similarities and discuss ways to combine the two approaches.

1. Sequence Databases

Ordered data appears in many scientific and commercial applications. Stock market prices and data from continuous satellite observation are examples of application data where the underlying order is important. In the following, the term sequences will be used instead of ordered data.

In order to store and query large sets of sequences efficiently, the support of a database management system is needed. Traditional DBMS, however, provide no support for sequences. The relational model works with sets as its only data structure, and sets are not suited to represent an ordering over tuples. It is important to note that the “order by” clause in SQL only specifies the order in which the final result is presented to the user. The following two sections provide an overview of two different sequence database approaches, discussed in two separate papers.

1.1 SEQ

The paper “The Design and Implementation of a Sequence Database System”, written in 1996, describes the architecture and implementation of SEQ, a database system that supports sequences. Earlier approaches modeled sequences as an abstract data type. In the SEQ system, relations and sequences are both modeled as an abstract data type, called E-ADT (Enhanced Abstract Data Type). Each E-ADT provides its own physical storage implementation as well as its own query language. This allows sequences to be a top-level type of the database system. SEQUIN is the query language used for querying sequence data in the SEQ system. A fundamentally new concept of SEQUIN compared to SQL is the moving aggregate, which is an aggregation over a fixed-size sliding window. This concept only makes sense for database systems that support sequences.
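
As a minimal sketch of a moving aggregate, the following Python function computes an average over a fixed-size sliding window. It is not SEQUIN syntax; the function name moving_average and the sample prices are assumptions made only for illustration.

```python
from collections import deque
from typing import Iterable, Iterator


def moving_average(values: Iterable[float], window: int) -> Iterator[float]:
    """Yield the average of the last `window` values once the window is full."""
    if window <= 0:
        raise ValueError("window must be positive")
    buf = deque()   # values currently inside the sliding window
    total = 0.0     # running sum of the values in `buf`
    for v in values:
        buf.append(v)
        total += v
        if len(buf) > window:
            total -= buf.popleft()  # slide the window by one position
        if len(buf) == window:
            yield total / window


# Hypothetical stock prices, already ordered by time.
prices = [10.0, 12.0, 11.0, 13.0, 15.0]
print(list(moving_average(prices, 3)))  # [11.0, 12.0, 13.0]
```

The result depends on the order of the input, which is why such an operator has no natural counterpart in a purely set-based model.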

1.2 SRQL

The approach described above has several drawbacks when querying a combination of relational and sequence data. The paper “SRQL: Sorted Relational Query Language” presents an improvement over the E-ADT approach of the SEQ system. SRQL is an extension of the SQL query language to support queries over sequences. Queries over a combination of sequence and relational data can be expressed much more easily in SRQL than in SEQUIN. Furthermore, SRQL queries allow for more integrated query optimization.

The SRQL paper is a language paper with a simple but powerful idea to support sequences: sequences are treated as sorted relations. An efficient implementation of a database system requires an underlying algebra. The existing relational algebra is therefore extended with four basic operators.

One of these algebraic operators enables window aggregation, a concept that was already implemented in the SEQ system. A fundamentally new idea of this paper is to align tuples with other tuples at a given offset in the sequence. One could for example align each tuple with all tuples that appear two steps later in the sequence. This concept is called shifting.
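
The shifting idea can be sketched in Python as follows. This is an illustration under stated assumptions, not SRQL syntax: the function name shift, its parameters and the sample rows are hypothetical, and the sketch pairs each row with the single row found at the given positional offset in the sorted order.

```python
from typing import Any, Callable, List, Optional, Tuple

Row = dict


def shift(rows: List[Row], key: Callable[[Row], Any],
          offset: int) -> List[Tuple[Row, Optional[Row]]]:
    """Sort `rows` by `key`, then pair each row with the row `offset` steps later.

    Rows without a partner at that offset are paired with None.
    """
    ordered = sorted(rows, key=key)  # a sequence is treated as a sorted relation
    paired = []
    for i, row in enumerate(ordered):
        j = i + offset
        partner = ordered[j] if 0 <= j < len(ordered) else None
        paired.append((row, partner))
    return paired


# Hypothetical daily quotes, deliberately given out of order.
quotes = [
    {"day": 3, "price": 11},
    {"day": 1, "price": 10},
    {"day": 4, "price": 15},
    {"day": 2, "price": 12},
]
# Price change between each day and the day two steps later.
changes = [
    (cur["day"], later["price"] - cur["price"])
    for cur, later in shift(quotes, key=lambda r: r["day"], offset=2)
    if later is not None
]
print(changes)  # [(1, 1), (2, 3)]
```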

SRQL is an extension of the SQL query language and allows expressing the algebraic operators described above. Overall, this approach leads to minimal extensions to a traditional database system.

The earlier paper “The Design and Implementation of a Sequence Database System” describes several optimizations that are possible in sequence queries and treats the different storage issues. Based on these results, it was possible to implement an efficient and scalable sequence database system that supports the SRQL query language.

2. Data Stream Management

Traditional relational DBMS have been designed to support business applications. In this model, the database system is used as a persistent storage of a usually large collection of data. Humans initiate queries and updates on the data set and therefore play an important role in such systems.

Monitoring applications, in contrast, deal with a completely different computation model. An example of a monitoring application would be a military system that processes sensor data from a group of soldiers. The sensors could, for example, periodically transmit the soldiers’ positions and heartbeats as a stream of data.

Monitoring applications therefore have to deal with continuous data streams from many different sources. Since these applications usually have to process a high volume of data, it would be desirable to have the support of a database management system. Monitoring applications, however, are difficult to implement using existing database technologies. It is therefore necessary to completely rethink the architectural issues for a data stream management system.

2.1 Aurora

Aurora is a system that was designed to support the development of monitoring applications. The paper “Aurora: a new model and architecture for data stream management” describes the system model of Aurora and the algebra that is used to process data streams.

Aurora is basically a data flow system that is built as a network of boxes. The topology of one specific network is designed by an application administrator, who can connect the output of one box with the input of another box. Each box performs certain processing on data streams, which flow from one box to another.
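
The box-and-arrow model can be sketched with Python generators. This is only an illustration of the data flow idea, not Aurora's interface: the helper names filter_box, map_box and connect are hypothetical, the sample readings are invented, and the sketch covers only a linear chain of boxes, not a general network.

```python
from typing import Callable, Iterable, Iterator

# A box consumes a stream of tuples (dicts here) and produces a new stream.
Box = Callable[[Iterable[dict]], Iterator[dict]]


def filter_box(pred: Callable[[dict], bool]) -> Box:
    """Box that passes on only the tuples satisfying `pred`."""
    def run(stream: Iterable[dict]) -> Iterator[dict]:
        return (t for t in stream if pred(t))
    return run


def map_box(fn: Callable[[dict], dict]) -> Box:
    """Box that transforms every tuple with `fn`."""
    def run(stream: Iterable[dict]) -> Iterator[dict]:
        return (fn(t) for t in stream)
    return run


def connect(*boxes: Box) -> Box:
    """Wire the output of each box to the input of the next one."""
    def run(stream: Iterable[dict]) -> Iterator[dict]:
        for box in boxes:
            stream = box(stream)
        return iter(stream)
    return run


network = connect(
    filter_box(lambda t: t["heart_rate"] > 150),
    map_box(lambda t: {"soldier": t["soldier"], "alert": "high heart rate"}),
)
readings = [
    {"soldier": "A", "heart_rate": 80},
    {"soldier": "B", "heart_rate": 170},
]
print(list(network(readings)))  # [{'soldier': 'B', 'alert': 'high heart rate'}]
```

Because generators are lazy, each tuple flows through the whole chain as soon as it arrives, which mirrors the push-style processing of a data flow system.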

Some dedicated boxes are called connection points and play an important role in the Aurora system. Data elements that enter a connection point are cached for a certain amount of time, specified by the application administrator. So-called ad hoc queries can be attached to connection points; they make it possible, for example, to pose window queries.
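
A connection point can be pictured as a time-bounded cache. The sketch below is an assumption-laden illustration, not Aurora's implementation: the class name ConnectionPoint, its methods and the retention parameter are hypothetical, and timestamps are assumed to arrive in non-decreasing order.

```python
from collections import deque
from typing import Any, List


class ConnectionPoint:
    """Caches (timestamp, value) pairs for `retention` time units."""

    def __init__(self, retention: float) -> None:
        self.retention = retention
        self._cache = deque()  # pairs in arrival order, oldest first

    def push(self, timestamp: float, value: Any) -> None:
        """Store a new element and evict everything older than the retention."""
        self._cache.append((timestamp, value))
        while self._cache and self._cache[0][0] < timestamp - self.retention:
            self._cache.popleft()

    def window(self, start: float, end: float) -> List[Any]:
        """Ad hoc window query over the cached history (inclusive bounds)."""
        return [v for ts, v in self._cache if start <= ts <= end]


cp = ConnectionPoint(retention=10.0)
cp.push(0.0, "a")
cp.push(5.0, "b")
cp.push(12.0, "c")          # "a" is now older than 10 time units and is evicted
print(cp.window(0.0, 12.0))  # ['b', 'c']
```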

Aurora has to deal with continuously arriving data. The system therefore uses a Quality of Service (QoS) monitor that aims to prevent congestion in the network. If necessary, the QoS monitor invokes the load shedder, which will then drop certain data elements in the network.
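
The effect of load shedding can be illustrated with a bounded input queue. The sketch is a deliberately crude stand-in: the class name SheddingQueue and the rule "drop arriving elements while the backlog is full" are assumptions of this example, whereas the policy of a real load shedder would be driven by the QoS specification.

```python
from collections import deque
from typing import Any


class SheddingQueue:
    """Input queue that drops arriving elements while the backlog is full."""

    def __init__(self, max_backlog: int) -> None:
        self.max_backlog = max_backlog
        self.items = deque()
        self.dropped = 0  # number of elements shed so far

    def offer(self, item: Any) -> bool:
        """Try to enqueue `item`; return False if it was shed."""
        if len(self.items) >= self.max_backlog:
            self.dropped += 1
            return False
        self.items.append(item)
        return True

    def take(self) -> Any:
        """Remove and return the oldest queued element."""
        return self.items.popleft()


q = SheddingQueue(max_backlog=3)
accepted = [q.offer(i) for i in range(5)]  # a burst of five arrivals
print(accepted)                  # [True, True, True, False, False]
print(list(q.items), q.dropped)  # [0, 1, 2] 2
```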

At runtime, Aurora gathers performance statistics such as the execution time of boxes. This information is then used to optimize the running system, for example by combining several boxes into one.

3. Similarities and Differences

3.1 Need for DBMS support

A relational database management system basically returns data elements that fulfill certain predicates. Its task is to extract relevant information from a large collection of data.

What is the motivation for extending such an RDBMS to a sequence database? The motivation is to perform data analysis on certain data elements of a database. Data analysis only makes sense if the analyzed data elements can somehow be related to each other, and ordering data into a sequence is one way of relating data elements so that meaningful analysis becomes possible. Data analysis, however, is not a concept of a traditional DBMS, and one can argue that it should be performed entirely outside the database so that the database can be optimized for its core competence. The main reason for doing data analysis inside the database is the usually huge collection of data: to answer sequence queries efficiently on large data sets, the support of a DBMS is a significant benefit.

Data stream management has little in common with the idea of a traditional database. The major task of a data stream management system can be abstracted as data processing. Streams of data flow into such a system, where they are processed and transformed and may activate triggers. Storage of data is a minor task and is performed only to answer queries about the recent history. The most important part of a data stream system is the newly arriving data, whereas the most important part of a traditional database is the huge collection of data that has evolved over a long period of time.

Both sequence databases and data stream systems introduce new ideas that are not part of traditional relational database systems. What they have in common is the challenge of dealing efficiently with high volumes of data. In order to fulfill these performance requirements, both kinds of systems use the fundamental ideas of traditional database management systems.

3.2 Triggers

The importance of triggers is a main difference between sequence databases and data stream systems. Triggers are not a core concept of traditional databases, and since sequence databases more or less build on top of RDBMS, triggers do not play an important role in sequence databases either. Dealing with ordered data has little in common with the concept of triggers.

In traditional database systems, triggers are mainly used to guarantee integrity constraints and to detect abnormalities in the changes of the data set.

Since triggers are not a core concept of traditional databases, most RDBMS lack a scalable implementation for triggers. Many business applications therefore encode their triggers in the middleware, which leads to very poor performance.

Monitoring applications, in contrast, are trigger oriented. Such applications use triggers to identify certain patterns in data streams, to detect unusual or abnormal behavior, or to recognize events in the streamed data. These tasks are all part of the data processing objective of monitoring applications. Triggers play a central role in data stream systems, and a scalable implementation of triggers is therefore essential in such a system. This is one of the reasons why traditional database systems are inadequate for building monitoring applications.

3.3 HADP versus DAHP

One of the fundamental differences between sequence databases and data stream systems is the way in which they are operated. The operational model of sequence databases as well as of traditional relational databases is called human active – database passive (HADP). In this model, humans actively modify the data set and pose queries of their interest. The database has a passive role and simply executes the queries and commands on the data set. Traditional business applications process their data based on the HADP model.

Monitoring applications deal with data that come from other applications or from sensors at various sources. The data stream system processes these data streams and reports notable events and conditions to the user. Humans therefore play a passive role, whereas the database is the active part. Such systems are called database active – human passive (DAHP) and lead to different requirements compared to HADP systems. One of these differences is described in the next section.

3.4 System models

Aurora, as an example of a data stream management system, is built as a network of boxes, and streams of data flow from one box to another. This is basically a data flow system. Sequence databases and traditional RDBMS, in contrast, are designed as data centered systems.

While real-time requirements are not an issue for data centered systems, timing requirements are essential for data flow systems with continuous input streams. Data flow systems, unlike data centered systems, cannot delay the processing of a data element, since this could cause congestion in the flow network.

The Aurora system specifies real-time requirements and uses a Quality of Service monitor to meet these requirements. A high load would force Aurora to drop some data elements of the input streams. Such real-time requirements are hard to implement using a traditional database system. This is another reason why the system architecture has to be completely rethought for data stream management systems.

3.5 Approximation

As we have seen in the previous section, the Aurora system may drop data elements of the input streams in order to meet the real-time specifications of the system. Dropping arriving data would be unimaginable in traditional database systems. Sequence databases and RDBMS have to deliver exact results and guarantee the durability of committed changes.

Real-time requirements, however, can only be guaranteed at the cost of some possible information loss. Data stream management systems therefore can only deliver approximate results. Approximation is characteristic of data stream processing: information on data streams arrives asynchronously and may be incomplete, since data can be lost or delayed. The idea of computing approximate answers is a new concept that was not a requirement in RDBMS.

An example of such approximation in the Aurora system is BSort. A complete sort over an infinite data stream is not possible. BSort is therefore an approximate sort operator that operates over a finite window on the data stream.
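
The idea of an approximate sort over a finite window can be sketched with a bounded buffer. The code below is written in the spirit of BSort and is not Aurora's operator: the function name buffered_sort and the buffer_size parameter are assumptions of this sketch. It keeps at most buffer_size elements and emits the smallest one whenever the buffer overflows.

```python
import heapq
from typing import Iterable, Iterator


def buffered_sort(stream: Iterable[int], buffer_size: int) -> Iterator[int]:
    """Approximately sort `stream` using a buffer of at most `buffer_size` items."""
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")
    heap = []  # min-heap holding the buffered elements
    for item in stream:
        heapq.heappush(heap, item)
        if len(heap) > buffer_size:
            yield heapq.heappop(heap)  # emit the smallest buffered element
    while heap:  # only reached if the stream is finite
        yield heapq.heappop(heap)


arrivals = [3, 1, 2, 6, 4, 5, 0]
print(list(buffered_sort(arrivals, 2)))  # [1, 2, 3, 4, 0, 5, 6]
print(list(buffered_sort(arrivals, 7)))  # [0, 1, 2, 3, 4, 5, 6]
```

With a buffer of two elements the late arrival 0 is emitted out of order, which shows why the result is only approximately sorted. A buffer as large as the whole input gives an exact sort, but that option does not exist for an infinite stream.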

In the overview of sequence databases, we saw that the SRQL system, for example, uses a shift operator that aligns tuples with all tuples that occur at a given offset in the sequence. The question that comes to mind is whether it makes sense to have such an operator in data stream management systems, where only approximate sorting is possible. This question will be discussed in the next section.

3.6 Operators

Relational DBMS have been built to work with the most recent state of the database. It is therefore difficult to examine the history of the data set in such systems.

The history of data values plays a central role in sequence databases. It is the nature of such systems to store historic information in order to pose sequence queries and perform data analysis on this information. Sequence queries only make sense if the queried data items are somehow related to each other. Relating data items by time is one of the most natural ways of doing so in sequence queries.

Historic information is even more important for data stream management systems. The Aurora system for example uses connection points to cache historic information and allow exploiting the history.

In the last section, the question was raised whether it would make sense to have a shifting operator, as found in the SRQL system, in a data stream management system as well.

Aurora actually has a join operator that works in the same way as the shift operator of the SRQL system. The shift operator, however, works on a static set of data, while the join operator of Aurora works on streams of data.

Aurora caches historic information at the connection points, but it is not possible to apply the join operator to parts of such a history. Aurora therefore limits the possibilities for data analysis on historic data at the connection points.

Both the Aurora and the SRQL system provide an operator for window aggregation. The SRQL system allows window aggregation on parts of the data set, while Aurora only allows window aggregation on data streams. In particular, it is not possible to calculate a window aggregate on parts of the history that is cached at connection points. Whether such extensions to Aurora would make sense will be discussed in the conclusions of this document.

4. Event Stream Processing

The Aurora system addresses several needs that are relevant in event stream processing. A central task of event stream processing is to recognize patterns and detect events on continuous data streams. Research on data stream processing has produced results on real-time processing and on the computation of approximate answers. These results can be directly applied to event stream processing systems.

In my view, sequence databases play a minor role for event stream processing. Elements of an event stream, however, usually have an attached timestamp and can therefore be seen as an ordered sequence of data.

5. Conclusions

We have seen the benefits of extending traditional databases to sequence databases. Sequence databases basically allow data analysis to be performed at the database level. I think it is still questionable whether such functionality should be implemented as part of a database system. The performance gain of doing this analysis at the database level, however, shows that this approach is quite beneficial.

We have also seen that data stream management systems require rethinking the architecture and concepts of a conventional database system. The question remains whether the ideas of sequence databases could also be applied to data stream systems.

Streaming data naturally forms a sequence, since each data element is assigned a timestamp. I think it would therefore make sense to apply the ideas of sequence databases to data stream systems as well. In the case of Aurora, I think it would make sense to implement sequence operators at the connection points and thereby allow sequence queries on subsets of the cached history. In the same way as sequence databases extend relational databases, one could also extend data stream systems to sequence data stream management systems.

6. References

[1] P. Seshadri, M. Livny, R. Ramakrishnan. The Design and Implementation of a Sequence Database System. In Proceedings of the 22nd VLDB Conference, Mumbai, India, 1996.

[2] R. Ramakrishnan, D. Donjerkovic, A. Ranganathan, M. Krishnaprasad, K. S. Beyer. SRQL: Sorted Relational Query Language. SSDBM, 1998.

[3] D. J. Abadi, D. Carney, U. Cetintemel, M. Cherniack, C. Convey, S. Lee, M. Stonebraker, N. Tatbul, S. Zdonik. Aurora: a new model and architecture for data stream management. VLDB Journal, 2003.

[4] SearchSOA.com: event stream processing, http://searchsoa.techtarget.com/sDefinition/0,,sid26_gci1274435,00.html

[5] Wikipedia: Event Stream Processing, http://en.wikipedia.org/wiki/Event_Stream_Processing