Effective Parameters on Response Time of Data Stream Management Systems


Abstract – Because data streams are rapid and time-varying (bursty), the data they carry quickly loses its value as time passes. Results delivered with a high response time are therefore not reliable in Data Stream Management Systems (DSMSs). In other words, one of the most important factors in a DSMS is the response time, i.e., the time from the arrival of a data stream tuple into the system until it exits as output after being processed by a query.

In this paper, the parameters that most affect the response time of DSMSs are considered, categorized and analyzed. Static and dynamic system properties, the properties of input and output data streams, and the properties of queries and query processing algorithms all influence a DSMS’s response time. Experimental results are presented to illustrate the impact of each parameter on the response time metric.

Keywords: DSMS, Query processing, Response time, Effective parameters

IKE'11 - 10th Int'l Conference on Information and Knowledge Engineering

I. INTRODUCTION

Traditional database management systems operate on finite sets of stored data and, at best, are able to answer ad-hoc queries. Many new applications, however, must process data streams, which are infinite, continuous sequences of data [1, 2]. These applications need a new class of systems called data stream management systems (DSMSs), which provide the requirements of such applications. Continuous data streams are infinite, rapid and time-varying. DSMSs are able to evaluate queries over data streams. Because the input is received continuously, these queries execute as long-running processes and are called continuous queries [2]. Such long-running queries need to be evaluated by the system until they finish [3]. One of the important factors in evaluating a DSMS is the response time of the system.

Response time, or tuple latency, is defined as the average time that a tuple needs to be processed by a query, including all waiting times in buffers [3]. In other words, the response time for an output tuple is the time period from the moment all the information required to derive the output tuple is available until the output tuple is actually generated [4].
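As a minimal sketch of the first definition, the average tuple latency can be computed from per-tuple arrival and exit timestamps. The class name `TupleTrace`, its fields and the sample timestamps are hypothetical and are used only for illustration; they are not taken from the prototype.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class TupleTrace:
    """Arrival and exit timestamps (in milliseconds) of one tuple (hypothetical record)."""
    arrival_ms: float
    exit_ms: float

    @property
    def latency_ms(self) -> float:
        # Latency covers processing time plus all time spent waiting in buffers.
        return self.exit_ms - self.arrival_ms


def average_response_time(traces):
    """Mean tuple latency over the tuples that produced output."""
    return mean(t.latency_ms for t in traces)


# Illustrative timestamps only: latencies are 4, 6 and 8 ms.
traces = [TupleTrace(0.0, 4.0), TupleTrace(1.0, 7.0), TupleTrace(2.0, 10.0)]
print(average_response_time(traces))  # 6.0
```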

Figure 1 represents the general DSMS architecture. Incoming data streams on the left produce data indefinitely and drive query processing. In many applications, stream data may also be copied to an archive for preservation and for possible offline processing of expensive analysis or mining queries; here we are primarily concerned with the online processing of continuous queries. Processing such queries requires intermediate state, denoted as Scratch Store in the figure, which can be stored and accessed on disk or in memory. Applications or users register their Continuous Queries (CQs), which remain active until they are explicitly deregistered. Results are generally transmitted as output streams of data, although they could also be relational results that are updated over time [1].

Figure 1- DSMS Architecture [1]

The remaining parts of this paper are structured as follows: related work is reviewed in section 2. The parameters that affect a DSMS’s response time are categorized and analyzed in section 3. In section 4, experimental results are presented to illustrate the impact of these parameters on the response time. Finally, we conclude the paper in section 5.

II. RELATED WORKS

A great deal of research has been done on DSMSs [13]. Several early DSMS prototypes, such as STREAM [1,2], Aurora [5] and TelegraphCQ [14], have also been presented. Operator scheduling strategies for processing continuous queries on data streams vary from simple ones such as Round Robin [6], Chain [15] and Greedy [6] to more complex ones [16,17]. Some of them are designed to optimize a single performance factor [18, 19], while others try to optimize multiple factors or a compound one [20, 21]. Overall, most of these methods aim to minimize tuple latency, i.e., the response time. The determination and analysis of the parameters that affect the response time of DSMSs are explicitly studied in only a few references, of which [22] is one.

Shirin Mohammadi¹, Ali A. Safaei¹, Mostafa S. Hagjhoo¹ and Fatemeh Abdi²

¹ Department of Computer Engineering, Iran University of Science and Technology, Tehran, Iran

² Department of Science, Babol-Branch, Islamic Azad University, Babol, Iran

III. EFFECTIVE PARAMETERS ON RESPONSE TIME

A. Categorizing the parameters based on DSMS architecture

The response time of a query in a data stream management system depends on several parameters, some with less influence and some with more. As shown in figure 2, the parameters that affect the response time of DSMSs fall into the following categories: data stream properties, query properties, query execution properties, output properties, system properties (static conditions) and system condition (dynamic).

Figure 2 - categorizing the effective parameters on response time

B. Data stream properties

A data stream consists of data elements generated in an infinite, continuous and rapid manner that varies over time. In other words, a data stream S is a set of elements s, each with a timestamp τ, which arrive at the system in timestamp order. The timestamp specifies the logical time at which a tuple enters the data stream. Using a discrete and regular time domain T, a data stream can be defined as follows [1]:

$$S = \{ (s, \tau) \mid \tau \in T \} \qquad (1)$$

1) Type of elements in a data stream: Data stream S includes data elements s which are divided into three categories of well-structured data, semi-structured data and unstructured data [3].

2) Domain of attributes: The elements of a tuple take their values from an attribute set Att, whose members are the m types a_1, a_2, …, a_m:

$$s = (e_1, e_2, \ldots, e_n), \quad \forall e_i \in Att, \quad Att = \{a_1, a_2, \ldots, a_m\} \qquad (2)$$

3) Number of attributes: Each tuple s is represented by an ordered list of elements (e1, e2, …, en), where n is the number of attributes of the tuple.

4) Data stream distribution: DSMSs usually have no control over the order, rate and distribution of input streams [2]. The data stream distribution parameter describes how stream arrivals into the DSMS are distributed over time. The arrival distribution can be uniform or bursty, such as the Pareto distribution.

5) Arrival rate into the system: A data stream is an ordered set of tuples that arrive at the system continuously. The rate at which data arrive at the system is called the arrival rate. The arrival pattern of a data stream usually cannot be controlled or predicted, and the arrival rate of the produced tuples is often high and fluctuating [3].
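The stream properties above can be illustrated with a small generator of timestamped elements (s, τ). The following is only a sketch under stated assumptions: the function names, the Pareto shape value of 1.5 and the placeholder element contents are hypothetical, and uniform or Pareto inter-arrival gaps are just one way to model uniform or bursty arrival with the same mean rate.

```python
import random
from itertools import accumulate

PARETO_SHAPE = 1.5  # assumed shape parameter; smaller values give heavier tails


def arrival_times(n, mean_gap_ms, bursty, seed=0):
    """Return n non-decreasing arrival timestamps (ms) with the given mean gap.

    bursty=False draws gaps uniformly from [0, 2 * mean_gap_ms].
    bursty=True draws gaps from a Pareto distribution scaled so that its
    mean is also mean_gap_ms.
    """
    rng = random.Random(seed)
    if bursty:
        # paretovariate(alpha) has minimum 1 and mean alpha / (alpha - 1).
        scale = mean_gap_ms * (PARETO_SHAPE - 1) / PARETO_SHAPE
        gaps = [scale * rng.paretovariate(PARETO_SHAPE) for _ in range(n)]
    else:
        gaps = [rng.uniform(0, 2 * mean_gap_ms) for _ in range(n)]
    return list(accumulate(gaps))


def make_stream(timestamps):
    """Pair each timestamp tau with a placeholder element s, as in (s, tau)."""
    return [({"seq": i}, tau) for i, tau in enumerate(timestamps)]


uniform_stream = make_stream(arrival_times(5, 10.0, bursty=False))
bursty_stream = make_stream(arrival_times(5, 10.0, bursty=True))
```

Both streams have the same expected arrival rate; only the distribution of the gaps between consecutive tuples differs.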

C. Query properties

Continuous queries repeatedly process newly arrived data to generate new results. They run for a long time and generate results continuously [3].

1) Type of query: The queries registered in a DSMS can be categorized as follows [2]:

-One-time query, which is evaluated once at a specific moment in time.

-Continuous query, which is evaluated continuously as tuples arrive at the system. The output is generated continuously. It can be stored as a relation that is updated as new tuples are processed, or it can be sent out as a data stream.

2) Number of operators in a query: Each query consists of a set of operators. If Q represents a query and Op_i represents its ith operator, then n represents the number of operators in the query:

$$Q = (Op_1, Op_2, \ldots, Op_n) \qquad (3)$$

3) Arrangement of operators in query plan: The arrangement of operators determines the query plan and its execution procedure. Changing the arrangement and execution order of the operators, which is done during query optimization, can decrease the cost of query execution and the tuple latency.

4) Type of operators: The operators of a query are drawn from a set of operator types denoted by O. If o_j is the jth operator type and m is the number of operator types in the system, then we have:

$$Q = (Op_1, Op_2, \ldots, Op_n), \quad \forall Op_i \in O, \quad O = \{o_1, o_2, \ldots, o_m\} \qquad (4)$$

Each Op_i gets one or more streams as input and generates an output. The output stream of each operator may provide the input for one or more other operators.
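The view of a query as an ordered sequence of operators, each consuming the output of its predecessor, can be sketched as follows. This is an illustrative sketch only: the helper names (`selection`, `running_max`, `build_query`) and the packet fields are hypothetical, and only a linear chain of single-input operators is shown, not a general plan with operators that take several input streams.

```python
from functools import reduce


def selection(predicate):
    """Operator that forwards only the tuples satisfying the predicate."""
    def op(stream):
        return (t for t in stream if predicate(t))
    return op


def running_max(key):
    """Operator that emits, per input tuple, the maximum of `key` seen so far."""
    def op(stream):
        best = None
        for t in stream:
            value = t[key]
            if best is None or value > best:
                best = value
            yield best
    return op


def build_query(*operators):
    """Compose Q = (Op_1, ..., Op_n): the output of Op_i feeds Op_(i+1)."""
    def query(stream):
        return reduce(lambda s, op: op(s), operators, stream)
    return query


# Hypothetical packet tuples; the UDP tuple carries no length attribute.
packets = [
    {"proto": "TCP", "length": 40},
    {"proto": "UDP"},
    {"proto": "TCP", "length": 1500},
    {"proto": "TCP", "length": 60},
]
q = build_query(selection(lambda t: t["proto"] == "TCP"), running_max("length"))
print(list(q(packets)))  # [40, 1500, 1500]
```

Swapping or adding operators in the call to `build_query` changes both n and the arrangement of the plan, which are the two query properties discussed above.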

D. Execution of Queries

To execute multiple queries concurrently in a DSMS, the query to run next must first be selected. Second, the exact manner in which the selected query is executed has to be specified. These two steps are discussed below as scheduling queries and scheduling query operators.

1) Scheduling queries: In non-continuous systems, algorithms such as FIFO can be used to schedule queries, but they cannot be applied in DSMSs because of the continuous nature of queries and streams. We therefore have to use circular scheduling algorithms such as RR and WRR (weighted RR). The quantum length in such scheduling algorithms is an important factor in the response time of the system.

2) Scheduling query operators: A DSMS contains several query plans, which are connected as a DAG of operators linked by connector queues and which transform the input streams into the output streams. Considering the high and variable input rate of data streams, as well as the limitations on resources such as processor and memory, operator scheduling algorithms are an important factor in answering queries [5]. Round-Robin, FIFO, Greedy, Chain and Two-Phase are among the algorithms proposed for scheduling query operators [6, 7].
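The circular scheduling of queries can be sketched as a weighted round robin over per-query input queues. This is a simplified illustration, not the scheduler of any particular DSMS: the function name and the sample queues are hypothetical, and the quantum is assumed to be counted in tuples rather than in time units. Plain RR corresponds to all weights being 1.

```python
from collections import deque


def weighted_round_robin(queues, weights, quantum):
    """Return the order in which pending tuples are served.

    queues  : dict query_name -> list of pending tuples
    weights : dict query_name -> positive integer weight
    quantum : number of tuples a query with weight 1 may process per turn
              (assumption: the quantum is measured in tuples, not in time)
    """
    pending = {name: deque(items) for name, items in queues.items()}
    order = []
    while any(pending.values()):
        for name, queue in pending.items():
            budget = quantum * weights[name]
            while budget > 0 and queue:
                order.append((name, queue.popleft()))
                budget -= 1
    return order


served = weighted_round_robin(
    {"q1": ["a1", "a2", "a3"], "q2": ["b1", "b2", "b3"]},
    {"q1": 2, "q2": 1},
    quantum=1,
)
print(served)
# [('q1', 'a1'), ('q1', 'a2'), ('q2', 'b1'), ('q1', 'a3'), ('q2', 'b2'), ('q2', 'b3')]
```

In this sketch a larger quantum lets each query process more tuples per turn, so the tuples of the other queries wait longer before their query is served again.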

E. Output properties

1) Type of output: Four types of output are considered for a DSMS: output as a data stream (continuous), output as a relation, output as an announcement and one-time output.

2) Number of outputs: The number of outputs affects the DSMS’s response time because, for example, preparing only a stream as the output takes less time than preparing a relation in addition to the output stream.

F. System properties (static conditions)

Depending on the overall, static conditions of the DSMS, such as the number of queries, the amount of available memory, the number of processors and their processing capacity, the parameters in this category that affect the response time are as follows:

1) Number of registered queries: Continuous queries are stored in the DSMS and are applied permanently to the incoming data. The data that the DSMS receives from its sources is processed by each of the registered queries.

2) Query registration time: With respect to registration time, queries fall into the following two categories:

-Predefined queries: the query is registered before the data stream(s) start to arrive. This kind of query is usually continuous.

-Ad-hoc queries: the query is registered after the data streams have started to arrive. It can be continuous or one-time, and it may need to process previously arrived data.

3) Number of processes (logical machines): In parallel processing of a query, N_p identical processes (logical machines) cooperate. These machines can be physical machines (such as the processors of a multi-processor system or the nodes of a cluster computer) or virtual machines (such as threads executed on the cores of a multi-core processor) [9].

If a proper architecture for parallel processing of queries in DSMSs is provided, an increase in the number of processes can greatly improve the response time.

4) Processing capacity of processor(s): The time a computer needs to complete a job depends on several factors, the first of which is the processor speed. The processing capacity of the processors has an inverse relation with the response time.

5) System architecture: Several types of architectures exist for parallel machines. The leading ones are shared memory, shared disk, shared nothing and hierarchical [10].

6) The amount of available memory: If the arrival rate of the input stream is greater than the rate of the output stream, or if sliding windows are used heavily in the operators of a query, memory consumption increases. If the amount of required memory exceeds the amount of available memory, the system is forced either to discard the excess load or to use secondary storage. Discarding the excess load affects the validity of the results, whereas buffering on secondary storage increases the response time because of the I/O operations.

G. System condition (Dynamic)

The dynamic status of the system covers changes that may occur during system execution, such as the amount of memory assigned to a specific query while several processes are running, or the occurrence of a deadlock.

1) Allocated processing capacity of processor(s) for query execution: Because the processor(s) assigned to the DSMS are simultaneously used to execute other processes, only a part of the processing capacity is available for executing the query at any time.

2) Memory usage: When memory consumption exceeds the available memory, an overload occurs and the system is forced to discard some of the data. As discussed for the static conditions of the system, this affects the quality of the output.

3) Overload in DSMS: An overload occurs when the requested system resources exceed the available capacity [8,11]. In this situation, most of the data accumulate in the system’s queues. If enough main memory is available, this increases the response time; if main memory is insufficient and data must be transferred to and from secondary storage, the query results are generated with a very large delay.

4) Occurrence of deadlock: If the necessary conditions for a deadlock hold in the DSMS, a failure occurs and the response time grows without bound.

IV. EVALUATION

For the evaluation, we developed a prototype implemented in Java with JDK 6.0, running on a machine with an Intel Core i7 2930 processor and 6GB of RAM in a Linux environment. The input data set consists of IP packet monitoring data from the Internet Traffic Archive (ITA) [12]. One of its traces, “DEC-PKT”, contains an hour's worth of all wide-area traffic between Digital Equipment Corporation and the rest of the world. This real-world data set is used in our experiments. Two types of monitored packets, TCP and UDP, are selected as input streams. Each TCP packet contains five items: source address, destination address, source port, destination port and length. UDP packets are the same as TCP packets but without the length. The elements of the stream are therefore well-structured data (tuples). The registered queries are continuous and predefined, i.e., registered in the system before the streams arrive. The RR algorithm is used to schedule queries and the FIFO algorithm is used to schedule query operators. Also, the type of relations is assumed. Each experiment runs for 60000 milliseconds and is repeated 10 times, and the average over these runs is taken as the final experimental result.

The effects of the following parameters are evaluated: input rate, processing capacity of the processor, number of query operators, type of operators, quantum length for scheduling between queries, number of registered queries in the system, number of processes and size of the buffer memory.

1) Effect of the stream arrival rate: Recall that the response time is defined as "the time period from the arrival of a tuple until it exits as an output tuple". As shown in figure 3, if the stream arrival rate increases to the point where the system is unable to execute the desired queries on the input tuples in time, the system buffers soon fill with tuples and the waiting time of tuples in the queues grows, which increases the response time. Conversely, if the arrival rate of tuples is lower than the rate at which the system can process them, the response time decreases. Consequently, the input rate of the data stream has a direct relation with the response time.

Figure 3- response time vs input rate of data stream
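The relation between arrival rate and response time can be reproduced qualitatively with a very small queueing model. The sketch below is not the prototype used in the experiments: it assumes a single FIFO server with a fixed per-tuple service time, perfectly periodic arrivals and an unbounded buffer, and the rates and the 5 ms service time are illustrative values only.

```python
def mean_latency(arrivals_ms, service_ms):
    """Mean time from arrival to exit for a single FIFO server."""
    free_at = 0.0  # time at which the server finishes its current tuple
    total = 0.0
    for arrival in arrivals_ms:
        start = max(arrival, free_at)  # wait if the server is still busy
        free_at = start + service_ms
        total += free_at - arrival     # waiting time plus processing time
    return total / len(arrivals_ms)


def periodic_arrivals(rate_per_s, n):
    """n arrivals with a constant gap of 1000 / rate_per_s milliseconds."""
    gap = 1000.0 / rate_per_s
    return [i * gap for i in range(n)]


# With a 5 ms service time the server can sustain at most 200 tuples/s.
for rate in (100, 200, 400):
    print(rate, mean_latency(periodic_arrivals(rate, 1000), service_ms=5.0))
```

As long as the arrival rate stays at or below the service capacity, the latency in this model equals the service time; above it, the queue grows steadily and the mean latency rises sharply.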

2) Effect of the number of query operators: As shown in figure 4, the more query operators are registered in the system, the more time an arriving tuple needs to be processed. That is, the response time increases as the number of operators increases, so the number of operators has a direct effect on the response time.

Figure 4- response time vs number of query operators

3) Effect of the type of operators: The type of operator has a great influence on the response time. For example, an operator such as Join is more time consuming than the Selection operator. The higher the response time of the operators used in a query, the higher the response time of the DSMS. In figure 5, the response times of six operators (Selection, Count, Max, Min, Union and Join) are evaluated.

Figure 5 – response time vs type of operator

4) Effect of the quantum length in scheduling queries: Given the continuous nature of queries, a circular scheduling algorithm such as RR is appropriate, and the quantum length used in the algorithm is an important parameter for the response time. As shown in figure 6, the response time is high for short quanta. Beyond the optimal quantum length, increasing the quantum length further increases the response time again.

Figure 6 – response time vs quantum length of the RR algorithm

5) Effect of the number of registered queries: The queries executed on the input tuples compete for system resources such as the processor and memory. While resources are allocated to the query under execution, the tuples of the other queries are kept in a waiting queue, and the response time increases. As shown in figure 7, the number of queries has a direct effect on the response time.

Figure 7- response time vs number of registered queries

6) Effect of the processing capacity of the processor(s): In figure 8, the effect of the number of queries is studied on a dual-core and a quad-core processor. The processing capacity of the processors has an inverse relation with the response time: as the capacity increases, the response time of each operator, and consequently the total response time of the system, decreases.


Figure 8 – response time vs processing capacity of processors

7) Effect of the number of processes: As we can see in figure 9, if a proper architecture is used for parallel processing of queries in a DSMS, increasing the number of processes can greatly improve the response time. The number of processes, N_p, therefore has a direct effect on the response time.

Figure 9 – response time vs number of processes

8) Effect of the memory buffer size: According to the evaluation result in figure 10, as the buffer size increases, tuples wait longer in the waiting queues and the response time of the system increases. On the other hand, when the buffer size is small, the system is forced to discard some tuples, so the accuracy of the results decreases.

Figure 10 – response time vs memory buffer size
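The trade-off between waiting time and discarded tuples can be illustrated with a bounded-buffer variant of a single-server queue. This is a sketch under stated assumptions rather than the experimental setup: one FIFO server, a fixed service time, and a tail-drop policy in which a tuple arriving at a full buffer is discarded. The function name and the sample values are hypothetical.

```python
from collections import deque


def simulate(arrivals_ms, service_ms, buffer_size):
    """Single FIFO server with a bounded waiting buffer.

    Returns (mean latency of processed tuples, number of discarded tuples).
    A tuple that arrives while `buffer_size` tuples are already waiting
    is discarded (tail drop, an assumed policy).
    """
    departures = deque()  # exit times of admitted tuples still in the system
    free_at = 0.0
    latencies = []
    dropped = 0
    for arrival in arrivals_ms:
        # Remove tuples that have already left the system.
        while departures and departures[0] <= arrival:
            departures.popleft()
        # One tuple is in service; the remaining ones occupy the buffer.
        if departures and len(departures) - 1 >= buffer_size:
            dropped += 1
            continue
        start = max(arrival, free_at)
        free_at = start + service_ms
        departures.append(free_at)
        latencies.append(free_at - arrival)
    mean = sum(latencies) / len(latencies) if latencies else 0.0
    return mean, dropped


# Overloaded input: one tuple every 2.5 ms, 5 ms of service per tuple.
arrivals = [i * 2.5 for i in range(1000)]
for size in (2, 50):
    print(size, simulate(arrivals, service_ms=5.0, buffer_size=size))
```

Under this overloaded input, the larger buffer discards fewer tuples but lets them wait longer, while the smaller buffer keeps latency low at the cost of more discarded tuples.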

V. CONCLUSION AND FUTURE WORKS

The rapid and bursty arrival of input data streams makes speed a major challenge for a DSMS. We considered the time period from the arrival of a tuple at the DSMS until it exits as an output tuple as the response time of the system. Many different factors affect the response time of the system. In this paper, these factors were studied in six categories: data stream properties, query properties, query execution properties, output properties, static system properties and the dynamic status of the system. Experimental results were presented to illustrate the impact of each parameter on the response time.

As future work, the following directions can be pursued: dynamically setting the tunable parameters based on machine learning techniques, determining a function that computes the response time w.r.t. the values of the effective parameters, and using proper mechanisms to estimate the response time in a DSMS.

REFERENCES

[1] A. Arasu, et al., “STREAM: The Stanford Stream Data Manager”, in Proc. of ACM SIGMOD, USA, 2003.

[2] B. Shivnath, “Adaptive Query Processing in Data Stream Management Systems”, Ph.D. thesis, Department of Computer Science, Stanford University, USA, September 2005.

[3] S. Chakravarthy, et al., “Stream data processing: a quality of service perspective: modeling, scheduling, load shedding, and complex event processing”, book, Springer, USA, 2009.

[4] Y. Bai, et al., “Minimizing Latency and Memory in DSMS-a Unified Approach to Quasi-Optimal Scheduling”, Proceedings of the 2nd international workshop on Scalable stream processing system, University of California, Los Angeles, 2008.

[5] D. Abadi, et al., “Aurora: A New Model and Architecture for Data Stream Management”, VLDB Journal, 12(2):120-139, August 2003.

[6] B. Babcock, et al., “Operator Scheduling in Data Stream Systems”, VLDB Journal, 13(4):333–353, 2004.

[7] D. Carney, et al., “Operator Scheduling in a Data Stream Manager”, in Proceedings of the 29th international conference on Very large data bases, Germany, pp. 838-849, 2003.

[8] N. Tatbul, et al., “Load Shedding Techniques for Data Stream Management Systems”, Ph.D. thesis, Brown University, May 2007.

[9] A.A. Safaei, et al., “Parallel Processing of Continuous Queries over Data Streams”, Distributed and Parallel Databases, Volume 28, Numbers 2-3, 93-118, 2010.

[10] A. Silberschatz, “Database System Concepts”, book, 5th edition, 2005.

[11] F. Reiss, et al., “Data Triage: An Adaptive Architecture for Load Shedding in TelegraphCQ”, U.C. Berkeley Department of Electrical Engineering and Computer Science, and Intel Research Berkeley, Proceedings of the 21st International Conference on Data Engineering, ICDE 2005, 5-8 April 2005, Tokyo, Japan, 2005.

[12] Internet Traffic Archive, http://www.acm.org/sigcomm/ITA/

[13] B. Babcock, et al., “Models and Issues in Data Stream Systems”, Invited paper in Proc. of PODS 2002, June 2002.

[14] S. Chandrasekaran, et al., “TelegraphCQ: Continuous Dataflow Processing”, in ACM SIGMOD, 2003.

[15] B. Babcock, et al., “Chain: Operator Scheduling for Memory Minimization in Data Stream Systems”, Proceedings of the ACM SIGMOD International conference, 2003.

[16] M. A. Sharaf, “Preemptive Rate-Based Operator Scheduling in a Data Stream Management System”, in IEEE/AICCSA, 2005.

[17] M. S. Soliman, G. Tan, “Operator-scheduling using dynamic chain for continuous-query processing”, IEEE Int. Conference on Computer Science and Software Engineering, 2008.

[18] S. Chakravarthy, et al., “Scheduling Strategies and Their Evaluation in a Data Stream Management System”, Springer LNCS 4042, 2006.

[19] M. A. Sharaf, et al., “Scheduling Continuous Queries in Data Stream Management Systems”, in PVLDB, 2008.

[20] B. Srivastava, et al., “Exploiting k-Constraints to Reduce Memory Overhead in Continuous Queries over Data Streams”, Technical Report, November 2002.

[21] M. Ghalambor, et al., “DSMS scheduling regarding complex QoS metrics”, IEEE/ACS International Conference on Computer Systems and Applications (AICCSA), 10-13 May 2009.

[22] S. Chakravarthy, et al., “Stream data processing: a quality of service perspective: modeling, scheduling, load shedding, and complex event processing”, book, Springer, USA, 2009.