D4.2 Prototype of warehouse application and adaptation over new IO stack design; design and implementation of middleware prototype

(1)

IOLanes

Advancing the Scalability and Performance

of I/O Subsystems in Multicore Platforms

Contract number:

Contract Start Date:

Duration: Project coordinator: Partners: FP7-‐248615 01-‐Jan-‐10 36 months FORTH (Greece)

UPM (Spain), BSC (Spain), IBM (Israel), Intel (Ireland), Neurocom (Greece)

D4.2

Prototype of warehouse application and adaptation

over new IO stack design; design and

implementation of middleware prototype

Draft Date: 29-‐Feb-‐12, 13:22 Delivery Date: M24

Due Date: M24 Workpackage: WP4

Dissemination Level: Public

Authors: Ricardo Jimenez-‐Peris, UPM, Marta Patiño-‐Martínez, UPM, Valerio Vianello, UPM, Luis Rodero, UPM, Grigoris Prasinos, Apostolis Hatzimanikatis.

Status: 5

(Choose one, implies all previous) 1. Internal draft preparation 2. Under internal project review 3. Release draft preparation 4. Passed internal project review 5. Released

Project funded by the

European Commission under the

Embedded Systems Unit – G3

Directorate General Information Society

7th_{Framework Programme}

(2)

History of Changes

Date By Description

11 Dec 2011 Ricardo Jimenez TOC

Dec 2011 Ricardo Jimenez First draft

Jan 2012 Ricardo Jimenez Second draft

Jan 2012 Grigorios Parisinos Third draft

Jan 2012 Valerio Vianello Fourth draft

Jan 2012 Ricardo Jimenez Fifth draft

Feb 2012 Grigorios Prasinos Sixth draft (added TA experiments)

Feb 2012 Apostolis

Hazailmanikatis

7th draft

Feb 2012 Martina Naughton 8th draft (peer review)

Feb 2012 Angelos Bilas 9th draft (coordinator review)

Feb 2012 Alex Landau 10th draft (peer review)

(3)

TABLE OF CONTENTS

TABLE OF CONTENTS ... 3

EXECUTIVE SUMMARY ... 5

1 OVERVIEW OF DELIVERABLE ... 7

1.1 MAIN CONTRIBUTIONS AND ACHIEVEMENTS ... 7

1.2 PROGRESS BEYOND THE STATE-‐OF-‐THE-‐ART ... 8

1.3 ALIGNMENT WITH PROJECT GOALS AND TIME-‐PLAN FOR THIS REPORTING PERIOD ... 8

1.4 PLAN FOR NEXT REPORTING PERIOD ... 8

1.5 STRUCTURE OF DELIVERABLE ... 8

2 MIDDLEWARE SYSTEMS ... 9

2.1 INTRODUCTION ... 9

2.2 DATABASE REPLICATION MIDDLEWARE ... 9

2.2.1 MOTIVATION ... 9

2.2.2 ARCHITECTURAL CHOICES AND RATIONALE OF THE SELECTED ARCHITECTURE ... 9

2.2.3 DESIGN ... 11

2.2.4 IMPLEMENTATION ISSUES ... 13

2.3 PARALLEL DATA STREAMING MIDDLEWARE ... 14

2.3.2 ARCHITECTURAL CHOICES AND RATIONALE OF THE SELECTED ARCHITECTURE ... 15

2.3.3 DESIGN ... 16

3 PROTOTYPE OF WAREHOUSE APPLICATION AND ADAPTATION OVER NEW IO STACK DESIGN ... 18

3.2 THE CURRENT STATE OF TARIFF ADVISOR ... 18

3.3 POST PROCESSING AND AGGREGATIONS NEEDED ... 19

3.3.1 BASIC AGGREGATE ... 19

3.3.2 TOTAL INVOICE AMOUNT COMPUTATION ... 19

3.3.3 SORT RESULTS AND COMPUTE DISTANCES ... 20

3.3.4 DROP RESULTS ABOVE A SPECIFIC THRESHOLD ... 20

3.3.5 KEEP RESULTS IN A SPECIFIC DISTANCE FROM CURRENT OR BEST PLAN ... 20

3.3.6 KEEP RESULTS IN SPECIFIC STEPS (STEP-‐UP, STEP-‐DOWN) FROM CURRENT OR BEST PLANS ... 20

3.3.7 TAKING INTO ACCOUNT TRAFFIC OF MORE PERIODS ... 21

3.3.8 COMPUTATION OF DISCOUNTS ... 21

3.3.9 PROCESSING BASED ON MARGINS ... 21

3.4 IMPLEMENTATION: DESCRIPTION OF THE INTEGRATION WITH STREAMCLOUD AND OTHER IMPROVEMENTS MADE DURING THE PROJECT ... 21

3.4.1 CHANGES TO TARIFFADVISOR ... 22

3.4.2 INITIAL PROTOTYPE ... 22

3.4.3 THE FORMAT OF DATA OF TARIFF ADVISOR ... 22

3.4.4 IMPLEMENTATION OF STREAMCLOUD OPERATORS ... 23

4 OTHER BENCHMARKING APPLICATIONS ... 24

4.2 TPC-‐W BENCHMARK ... 24

4.2.2 METRICS ... 24

4.2.4 SUMMARY OF BASELINE RESULTS ... 25

4.2.5 EXPECTED IMPROVEMENT WITH IOLANES ... 26

4.3 LINEAR ROAD ... 26

4.3.2 METRICS ... 26

(4)

4.4 TARIFFADVISOR ... 29

4.4.2 METRICS ... 29

4.5 SPEC JENTERPRISE 2010 ... 31

4.5.2 METRICS ... 31

5 PLAN FOR THE NEXT REPORTING PERIOD ... 32

6 CONCLUSIONS ... 33

7 BIBLIOGRAPHY ... 34

(5)

Executive Summary

This deliverable presents the progress during the second period of the project in the applications and middleware developed to evaluate the IOLanes prototype with realistic workloads. The progress during this period is two-‐fold:

(1) A main goal of Period 2 is to create a realistic application stack for evaluation purposes that includes all layers of modern applications stacks. Our evaluation aims to not only show limitations and improvements at the system level, but also to provide a better understanding of various application trends. For this purpose we use TariffAdvisor, which we extend in two ways.

First, we build a separate rating component, based on TariffAdvisor that can make use of a generic stream-‐processing engine. This is important since it allows the rating engine to focus on the policies to be evaluated rather than event processing itself. Integration of TariffAdvisor and the middleware has progressed and the different points, where modifications are needed have been identified.

Second, we design a middleware system that allows the new rating engine to scale to multiple nodes. Replication is a main technique for scaling performance, however, due to consistency requirements, it significantly increases I/O requirements. The database replication middleware that has been architected and developed aims at replicating a database transparently to clients for scalability and availability purposes. The approach taken places special emphasis on transparency: Applications should be able to use the replication middleware transparently, without changes. During this period we provide an initial implementation of the middleware, as per the original plan. In addition, during this period, the database component of TariffAdvisor was migrated from Oracle to PostgreSQL. This way TariffAdvisor is automatically integrated with the database replication component, and uses it for the storage of input CDRs as well as metadata and outputted results.

(2) With respect to the existing application stacks:

Regarding TariffAdvisor, during the experimental analysis and profiling we were able to build a base version that eliminates many of the inefficiencies of the original application: Avoiding character-‐based processing of input and output data when possible, introducing large I/O requests, and custom buffer copying have all resulted in significant improvement of application-‐level throughput. In addition, we introduce asynchronous I/O at the application level, since asynchronous I/O facilities are not available in Java 6, by using additional I/O threads, which however does not improve I/O performance further. The current deployment cases for TariffAdvisor in production are either on a single server (bare metal) or under VMs. Our I/O optimizations during Period 2 result in proper scaling of TariffAdvisor in non-‐virtualized environments, however, our evaluation shows significant (up to 70% compared to the non-‐VM case) performance penalty and lack of scaling in the case of VMs. We currently use the optimized version of TariffAdvisor for evaluating the new I/O stack.

For OLTP applications, following the preliminary evaluation of TPC-‐W during Period 1, two main bottlenecks found and removed by re-‐architecting the TPC-‐W implementation. These two bottlenecks were the database connection pool and the load generator.

For generic data streaming applications, a preliminary evaluation of LinearRoad unveiled many performance issues at the application and middleware level:

• At the middleware level, we find that that database operators are CPU expensive, WaitFor

synchronization is inefficient, and aggregate operators performed poor memory management. New versions of the database operators, WaitFor operators and aggregate operators were designed and developed to solve these issues. We also added persistence of data streams for fault-‐ tolerance purposes.

• Generating large loads necessary to stress I/O in modern systems was not possible because the

load generator would hang after a few days. Additionally, generating each new load was taking longer and longer (i.e. weeks). We designed and implemented a new load generator that allows us to generate incrementally large loads.

• The granularity of the partitioning into subqueries is too coarse to enable an even load distribution

across cores. A series of optimizations were made till achieving an adequate balancing across cores.

Overall, at the starting point, LinearRoad was not able to sustain a 1-‐expressway load in our testbed, and now it is able to bear a load an order of magnitude higher, and at least 10 expressways. During Period 3 we will complete the TariffAdvisor+Replication Middleware stack and use the new and existing application stacks for evaluating the IOLanes stack prototype.

(6)

(7)

1 Overview of Deliverable

1.1 Main contributions and achievements

During the reporting period, WP has progressed with:

• The design and implementation of the database replication middleware.

o The database middleware has been architected and developed. The middleware provides scalable replication of the database in a transparent manner to applications.

• The extension of TariffAdvisor and its integration with the middleware systems.

o TariffAdvisor has been enhanced to enable aggregation via the data streaming middleware. This extension enables to perform aggregation queries in a declarative manner and parallelize automatically the aggregation of rate plans.

o TariffAdvisor has migrated from Oracle to PostgreSQL. With this migration the integration with the database replication middleware is automatic. This extension will enable to enable declarative queries over the input data, and parallel reading of input files.

• In order to stress the IO path, modifications were made to the middleware systems and

benchmark applications. o Parallel data streaming.

§ A new version of the database operators has been designed and implemented to use composite keys and avoid the costly scans performed by previous versions of the operators.

§ A new version of the WaitFor synchronization operator has been designed and developed to avoid traversals of long linked lists during high loads that burned too many CPU cycles. The current version uses hash tables to perform indexing and provide efficient access.

§ A new version of the aggregate operator has been designed and developed with efficient memory management and new non-‐intrusive garbage collection functionality to avoid the high memory usage of the previous version.

• Evaluation of bare-‐metal and virtualised baselines with TPC-‐W and LinearRoad. o Bare-‐metal baseline evaluations.

§ A baseline for TPC-‐W has been completed. The findings show that one of the limitations of current IO architectures lies in the synchronous write that is one of the bottlenecks of transactional systems.

§ A baseline for LinearRoad has been completed. The results have found the IO path does not scale with high rates of small reads and writes.

o Virtualized environment evaluations.

§ Baselines for TPC-‐W and LinearRoad have been obtained.

§ In both cases, it has been found out that there is a substantial loss of performance due to virtualization, which is in turn cause by the number of context changes invoked by IO operations.

• Ongoing evaluation of SplitX, K0fs and deduplication.

o During the last few months, SplitX and K0fs have been executed with LinearRoad. However, there are still issues to be solved in order to attain results that will allow us to draw conclusions regarding its performance.

o Evaluation with deduplication has recently started with the database replication middleware and is still ongoing.

(8)

1.2 Progress beyond the state-‐of-‐the-‐art

• Scalable database replication middleware. The main innovation with respect SOTA lies in

the avoidance of read/write conflicts by using as isolation level snapshot isolation (SI for short).

• Scalable parallel data streaming. The main innovation with respect to SOTA lies in parallelizing data streaming operators such that the increasing multi-‐core power currently available is exploited.

• In addition, during this period WP4 has produced an optimized version of the TariffAdvisor rating application, improving significantly the performance going beyond state of the art. Neurocom has executed a large number of experiments with TariffAdvisor with different workloads, in different environments and different parameters.

1.3 Alignment with project goals and time-‐plan for this reporting period

The main objective of WP4 is to provide middleware systems and benchmarking applications to evaluate (with realistic software stacks and applications) the improvements made along the IO path in WP1-‐3. Additionally, it uses the monitoring framework developed by WP5. In period 2, the goal was to establish a baseline evaluation that would enable the evaluation of improvements made by the contributions in WP1-‐3. Therefore, the whole activity of the WP has been totally aligned with the goals of the project, and the timeframe for the second period of the project.

1.4 Plan for next reporting period

During the last and third period, the improvements made by WP1 to 3 will be continuously evaluated to provide feedback to WP1-‐3 partners to enable them to perform further improvements and quantify them. Changes in the benchmarks will be made as needed to fulfil the goals of the evaluations. Middleware systems will be improved to better exploit the developments from WP1 to 3.

1.5 Structure of deliverable

The core of the technical contents is organized into 3 sections. Section 3 presents the middleware systems developed for the project. Section 4 presents the integration of TariffAdvisor with the middleware systems. Section 5 presents the remaining benchmarking applications developed to evaluate the results of the project. Finally, section 6 presents the conclusions.

(9)

2 Middleware systems 2.1 Introduction

The initial scope of the project targeted a database replication middleware as the middleware system. This scope has been extended to handle a variety of workloads and target different limitations of the IO path. The scope was extended to include three middleware systems: database replication middleware, parallel-‐distributed data streaming (StreamCloud), and clustered Java EE. In the second year, it became clear that the clustering aspect of JavaEE was not adding any additional benefit, therefore; a regular Java EE middleware will be used instead. On the application/benchmarking side the original scope considered an enhanced version of TariffAdvisor, whilst the extended scope considers three additional benchmarking applications, TPC-‐W, LinearRoad and SPEC jEnterprise.

TariffAdvisor is a typical OLAP application that performs analytical queries over large datasets. In this particular case, it computes hundreds/thousands of alternative plans for the clients of a phone company during period of several months. It will be used standalone, and integrated with the data streaming middleware and the database replication middleware. The data streaming middleware will be able to perform declaratively aggregate queries and parallelize them automatically. The database replication middleware will be able to perform declaratively database queries over the input data and parallelize them automatically the reading.

TPC-‐W is a benchmark for databases that be used in conjunction with a database management system and the database replication middleware. TPC-‐W aims at evaluating OLTP workloads. SPEC jEnterprise is the benchmark for Java EE technology. It will be used in conjunction with Java EE servers and databases.

2.2 Database replication middleware 2.2.1 Motivation

Databases are among the most used software servers in industry and are used pervasively in almost every application. Two crucial requirements for databases are scalability and availability. Both can be attained via a technique known as database replication. In the context of the project, the database replication middleware will be used for two different purposes. The first purpose is to integrate it with TariffAdvisor as originally proposed in the DoW to enable declarative queries over input data and automatic parallelization over a many core system. This scenario will allow us the evaluation of the benefits of IOLanes for OLAP applications that are characterized by massive reads. The second purpose is to evaluate how a replicated database service deployed in a cloud environment (e.g. such as the Amazon Relational Database Service) can benefit from IOLanes. In this context, each node runs replicas from different clients on different VMs. This scenario will stress the IO path in virtualized environments, typically found in cloud environments.

2.2.2 Architectural choices and rationale of the selected architecture

Database replication middleware can follow four alternative architectures (see Figure 1). They differ with respect to where the replication logic is stored. In the first, kernel-‐based replication, the database is modified to contain the replication logic (see Figure 1 (a)). In the second, middleware-‐based replication, a middleware component encapsulates the replication logic, and acts as a kind of router between client requests and database replicas (see Figure 1 (b)). It avoids the main complication of the kernel-‐based approach that is to modify extensively the internals of the database. The third, replicated middleware, is an extension of the second approach (see Figure 1 (c)). Within replicated middleware, the middleware component has one or more replicas to avoid single-‐points of failures that is the main

(10)

shortcoming of the aforementioned alternatives. The fourth and final architecture alternative,

decentralized middleware (see Figure 1 (d)) has a middleware component collocated with each database replica. This solution avoids the single-‐node bottleneck of the previous solution.

The selected architectural choice for locating the replication logic has been, decentralized middleware, since it avoids the three aforementioned issues, extensive modification of database internals, single-‐point of failure and single-‐node bottleneck.

Figure 1: Architectural choices for locating the database replication logic.

We also need to consider how the middleware extracts updates from the database system. This update extraction process can happen in three different ways. The first, is known as

white-‐box. In this technique, the database is considered a white box in which all the functionality is available and can be used. Since all functionality is available a very efficient implementation is possible. Its main shortcoming is the complexity of modifying the database internals. The second approach is referred to as black-‐box. In this case, database internals are not exposed, only its public interface can be used. The main shortcoming of this approach is the high overhead associated with it. The third approach, the grey-‐box approach, tries to exploit the advantages of the two former approaches whilst minimizing the disadvantages. In this approach only two minimal services are implemented within the database to extract and install the updates of a transaction. On one hand, the middleware has little dependencies on the database, simply the availability of these two services. On the other hand, the extraction of updates is extremely efficient, almost as efficient as the white box approach. For these reasons, the gray-‐box alternative has been selected to deal with the manner in which updates are extracted.

(11)

2.2.3 Design

The DB replication middleware consists of several components (see Figure 2), namely, a transparent JDBC driver, a group communication subsystem, an extended database server and DB replication middleware. The JDBC driver enables clients to contact the replicated database in a transparent manner. That is, clients connect to the database in the same way they do with a regular non-‐replicated database. This transparency is syntactic. This JDBC driver has extended functionality (with respect to a regular JDBC driver), in that it provides replica discovery, dynamic reconfiguration and fail-‐over functionality. Replica discovery enables the discovery of available replicas via IP multicast. After replica discovery is complete, the JDBC driver establishes a connection to a particular replica, the one currently with the lowest load, in order to balance the load. The dynamic balancing reconfiguration changes the connection of a JDBC driver from one replica to another in order to balance the load across replicas. In the advent of the failure of the replica to which a JDBC driver is connected, the JDBC driver performs failover connecting to one of the available replicas.

The group communication subsystem provides two basic functionalities: reliable multicast and group membership. The reliable multicast enables the database replicas to communicate among them with ordering and reliability guarantees. The replication protocol exploits ordering to guarantee that all transactions are validated in the same order and guarantee the determinism across decisions in all replicas. The reliability also plays a crucial role guaranteeing that communication is atomic, that is, a multicast message is received by all replicas or by none. This ensures that a transaction is committed at all available replicas or at none and therefore, prevents state divergence across them in the advent of failures. Group membership provides another important service: it enables to have an agreed notion of which replicas are up and connected and detect failures in a coherent manner. When there is a failure all replicas receive a view change message indicating the new view with the replicas still alive and connected. Similarly, when a replica joins the group, a view change reports the new extended view. In the implementation used, the Ensemble group communication system implemented in CAML has been used with its interface for C programs.

(12)

The extended database server provides two additional services (with respect to a regular database server) to extract and install the updates of a transaction. In this implementation, PosgreSQL has been used due to its clean architecture. The two services have been exported as custom database functions so they can be invoked through the regular database interface from the DB replication middleware.

The most involved component is the DB replication middleware. This component consists of several elements:

• Client connection manager.

• DB connection pool.

• Local transaction manager. • Replica control manager. • Remote transaction manager.

The client connection manager accepts connections from the client JDBC drivers. It also performs the server side of the replica discovery, dynamic load balancing and failover protocols. The connection manager accepts a connection from a client, and registers it in the client connection table. It associates a particular DB connection to this client. This sticky connection (the connection is kept for the whole session) provides session consistency guarantees. Requests from clients are sent from the connection manager to the local transaction manager via the request_queue.

The database connection pool is essentially a pool of processes, where each process maintains a connection with PostgreSQL. PostgreSQL only permits one connection per process. This is why the system needs to have a process pool instead of a thread pool. Connections are materialized via a socket that keeps the connections with the corresponding PosgreSQL postmaster process. Each connection has an associated connection object, which contains information relating to the connection (e.g. client id, status of the connection, status of the associated transaction)

The local transaction manager executes SQL statements originating from local clients. The local transaction manager has a number of worker threads that take the client requests stored in the request_queue, and forwards them to the corresponding DB connection process in the DB pool. The request result is sent from the database pool to the connection thread in the connection manager associated to the client that forwards the reply to the client. When a commit request is sent by a client, the extract updates function on the DB connection of the client is invoked in the associated DB pool process. If the updates are empty, the transaction was read only. In this case, the transaction is committed locally and the reply is sent to the client. If the updates are not empty, the set of transaction updates are sent to the replica control manager. The replica control manager uses the group communication subsystem to multicast this update to all replicas (including the sender).

The remote transaction manager consists of a certifier thread, and a set of committer threads. It receives update propagation messages from other replicas. These messages are stored in the to_apply queue. The certifier thread processes the update propagation messages sequentially. It keeps track of previously committed transactions and performs the validation process. The validation process checks for every transaction being validated whether there

(13)

was a previously committed transaction that is concurrent to it and has a write-‐write conflict on any modified item. In case of conflict the transaction being validated is aborted. This decision is taken by all replicas for each update transaction. Since the validation is performed in the same order (the total order of the multicast) the result is deterministic. In case of positive validation the update propagation message is stored in the to_complete queue. The committer threads takes update propagation messages from this queue. If the transaction is local, it will be committed by the local transaction manager. If the transaction is remote, the update set is applied and committed as an independent transaction.

2.2.4 Implementation issues

There are several implementation issues with the remote and local transaction managers. The first, relates to the semantic transparency of the replication protocol. The replication protocol provides 1-‐copy equivalence. This means that the replicated database should behave in the same way as a non-‐replicated centralised one. In order to achieve this, it is necessary to coordinate the start of new transactions with the commitment of update transactions. Essentially, the start of new transactions should be made only when a gap-‐free prefix of the transactions are committed. This means that either transactions are committed sequentially, or if they are committed in parallel, then no transactions can be started. By committing transactions sequentially, there is a liveness issue since a deadlock can be created between the replication logic and the local database. This is due to the fact that PostgreSQL uses locking to deal with write-‐write conflicts, and a remote update transaction can be stopped by a local transaction that will be serialized after it, thus creating a waiting cycle and the corresponding deadlock. To avoid it, it is necessary to enable parallel commit processing. However, parallel commit processing results in losing 1-‐copy equivalence if transactions are started at this stage. The adopted solution has been to alternate between the parallel commit processing at the start of new transactions. When the system switches from parallel commit processing to sequential commit processing, the start of new transactions is enabled, and then switches back to parallel commit processing.

Figure 3: Fraction of capacity used for local vs. remote txns with different number of replicas

A second implementation issue is that the higher the number of replicas is used, the higher is the capacity used by each replica to execute remote transactions (see Figure 3). This is an inherent limiting factor of scalability of fully replicated systems. The main issue with respect IO comes from the fact that with configurations with a large number of replicas, the amount of remote transactions that are pure updates increases. This overhead has been reduced by grouping consecutive remote non-‐conflictive transactions into one large transaction instead of using individual transactions. This does not affect 1-‐copy equivalence since transactions are committed in the same order, and being consecutive, they do not create gaps.

LOCAL LOCAL REMOTE REMOTE 2 Sites DB 3 Sites DB

(14)

Another implementation issue relates to how many threads are devoted to local transactions and remote transactions. One might think that a single thread should be enough for remote transactions. However, since parallel commit processing is needed multiple threads are needed. There is interplay between threads executing local transactions and remote transactions. Some tuning has been performed, which adjusts the size of the thread pools. This has allowed increasing the performance to some extent.

There are also a couple of configuration issues that affect the IO behaviour. The first, relates to the configuration of the commit function. In principle, it should log synchronously to guarantee the durability of transactions. However, this is typically quite detrimental with respect to performance. PostgreSQL allows asynchronous commits to be configured. This additional configuration increases performance, but at the cost of losing durability.

Another performance issue relates to how PostgreSQL manages multi-‐versioning. Each tuple, whenever updated, it is not updated in-‐place, instead a new version is generated. This means that the number of versions of each tuple grows over time, and a garbage collection mechanism is required. This is needed not only due to free unnecessary occupied space, but mainly for performance reasons. Database pages end up containing just a few active tuples overtime, making IO access inefficient. PostgreSQL has a vacuum process that performs the garbage collection process by removing obsolete versions. The vacuum process performs synchronous writes and since it is executed periodically it has a severe effect on performance. There is a configuration parameter in PostgreSQL that determines whether to force vacuum process periodically or on demand.

2.3 Parallel data streaming middleware 2.3.1 Motivation

There are an increasing number of emerging applications for which databases and the underlying store and process model are not a good solution. Some new technologies and techniques are being proposed to address more adequately these emerging applications such as data streaming technologies. Data streaming applications aims to process data and events in an online manner as they flow. Data streaming enables the deployment of continuous queries that are evaluated over the streaming data in an online manner with in-‐memory processing.

Data streaming also has strong IO requirements due to two features. The first, is its ability to access persistent states via database operators that allow reads, updates and inserts on databases. The second, is the need for availability. Availability requires persisting streaming data to re-‐inject it in the advent of failures. This is quite challenging since it requires persisting data at streaming rates.

Another recent requirement demands scalability with respect to the number of cores and the number of nodes in the system. This requirement has been addressed by the FP7 project

Stream, which was coordinated by UPM, and included FORTH as a partner. This project ended officially on January 2011. In order to extend the scope of IO workloads, this new middleware will be used in IOLanes to exercise data streaming workloads, and in particular determine how database operators and fault-‐tolerance will stress the IO path. The data streaming middleware will be used to exploit the multi-‐core power by using different middleware

(15)

instances for each available core. This will require an underlying IO system able to scale with the number of cores. The middleware will run each different middleware instance on a different VM, thus helping to stress the virtualized IO path.

2.3.2 Architectural choices and rationale of the selected architecture

In this section, we briefly summarize the architecture of the parallel data streaming middleware, and describe the architectural choices that were adopted. The middleware extends an underlying streaming processing engine with distribution capabilities, however, lacks capabilities for parallel or parallel-‐distributed processing. Parallelization transparency requires distributing tuples before each stateful operator across all the instances. The overhead of parallelization can be measured according to two factors. The first factor is the number of hops made by each tuple from instance to instance. Each hop between nodes involves serializing, deserializing, sending and receiving the tuple, all of which consume CPU cycles. The second factor is the fan-‐out of the communication. That is, to how many instances should send tuples, and therefore keep open connections with them.

The parallelization of a query can be made following different strategies that exhibit different tradeoffs. They can be seen in a single spectrum with two extremes. On one extreme of this spectrum, we have a strategy, whole-‐query strategy, in which the whole query is not split, and is instantiated at every core. In the whole-‐query strategy the overhead factors are as follows. The number of hops will be as many as stateful operators present in the original query that is the minimum one that can be attained, since before each stateful operator tuple redistribution is required. The fan-‐out factor, however, reaches its maximum, that is the total number of cores used for parallelizing the query.

On the other extreme of the spectrum, we find the single-‐operator strategy in which the original query is split into as many subqueries as individual operators it has. Each operator can then be instantiated into as many cores as needed. The number of hops overhead factor in this case is maximal since each tuple performs as many hops as possible, the number of individual operators. On the other hand, the other overhead factor, that is the fan-‐out, reaches its minimum value, since each operator will have the smallest possible number of instances.

In the middle of these two extremes, there is the stateful-‐subquery strategy in which the original subquery is split into as many subqueries as stateful operators it has. That is, the original query is split just before each stateful operator it contains. In this case, the number of hops overhead factor reaches its minimum possible value, the number of stateful operators. While the fan-‐out overhead factor reaches a value between the maximum and minimum possible.

In a multi-‐core system there are other factors that affect parallelization. In a complex query, there might be more subqueries and/or operators than available cores. Since having an additional instance, it is only beneficial to take advantage of available cores, this means that some subqueries will be merged despite they could be split if more cores were available. Another factor lies in load balancing. The most CPU intensive subqueries need to be allocated to more cores. This can result in having to merge less CPU intensive subqueries to allocate them in the remaining cores.

As a conclusion, from the three parallelization strategies, the subquery parallelization strategy will be adopted. However, it requires an additional optimization phase where resulting

(16)

subqueries are distributed among the remaining cores to balance the CPU utilization such that only one data streaming engine instance is used per core.

2.3.3 Design

In order to extend the data streaming engine without having to modify it, parallelization has been enabled by developing parallelization operators as new custom data streaming operators. These parallelization operators are load balancers and input mergers. Basically, the query to be parallelized is split into subqueries and each subquery is instantiated into as many cores/nodes as needed. Between each pair of consecutive subqueries, say A and B, the parallelization operators are introduced (see Figure 4). At the end of each instance of subquery A (origin subquery), a load balancer (LB) is located. At the beginning of each instance of subquery B (destination subquery), an input merger (IM) is located.

Figure 4: Query parallelization by means of parallelization operators

The attain parallelization transparency, tuples that need to be processed together (for aggregation or correlation purposes), should be sent to the same node. Otherwise, incorrect results would be produced (since tuples would be partially aggregated and/or correlated). Tuples are grouped into buckets based on a hash of their distribution key (the key used for their aggregation and/or correlation in the subquery B). The load balancer is aware of the semantics of the destination subquery and distributes each tuple to the destination instance in charge of the bucket to which the tuple belongs. Each input merger at destination query instances takes the tuples from all incoming load balancers and provides a single merged stream. Simply merging the incoming streams does not guarantee transparent parallelization. It is required that the multiple incoming physical streams are merged into a single coherent stream. In order to attain this goal, the timestamp of tuples is used to make a merge sort of the incoming tuples according their timestamp. That is, it takes the tuple with the smallest timestamp out of all the incoming streams.

2.3.4 Implementation issues

Borealis, the underlying data streaming engine used, employs a particular threading model. It uses two threads for communicating with other instances, a receiver thread and a sender thread. Internally it uses a processor thread that processes each incoming tuple. In practical terms, this means, that a Borealis instance is almost single-‐threaded. This means that the parallelization middleware facilitates its execution on a multi-‐core system in a scalable manner.

The transparent input merging has a liveness issue. In order to take the tuple with the smallest timestamp it should wait to have a tuple from every incoming stream. It might be the

(17)

case that one of the incoming streams is not producing tuples what would cause the blocking of the input merging and the delay of processing of the available tuples in the other streams. In order to solve this liveness issue, a special kind of tuple called dummy tuple was used. Load balancers periodically check whether they have sent any tuple on each of the outgoing streams. If not, they send a dummy tuple with the smallest timestamp that the load balancer can generate. This dummy tuples enables the input merger to unblock. If the dummy tuple is the one taken by the input merger it is simply discarded. If not, it enables to take tuples from the other incoming streams and avoid the blocking of the operator.

Another implementation issue was related to the efficient persistence of data streams for fault-‐tolerance. The first approach relied on a file per data bucket what resulted in a high number of files that were not manageable by the file system. The second, used a single file per subquery instance, which resulted in more complex recovery, but avoided the bottleneck associated with the management of thousands of files. The fault tolerance of aggregate operators was especially challenging due to the high number of active sliding windows. This issue was solved by developing new efficient garbage collection functionality.

The performance analysis conducted in T3.2 revealed that aggregate operators in the data streaming engine had memory management issues. One issue relates to the sliding window paradigm, more concretely, when to clean windows of aggregate operators which have not received tuples recently. In many applications (for instance in telephony), there is a high distribution of keys (i.e. phone number in call detail records). Aggregate operators devote a sliding window to each key according the group-‐by clause. Due to the high distribution, many different phone numbers receive calls, resulting in many different sliding windows. The issue is that most phone numbers receive seldom calls, and the sliding window does not slide till receiving a call with a temporal distance to one of the calls in the sliding window bigger than the window size. For instance, if the window size is 24 hours, and a phone number makes a call in day 1, and then it receives the next call a week later. According to the sliding window model, when receiving the call the week later, the tuple from the day 1 call would be removed when the window slides. However, as a result, huge amounts of memory are unnecessarily occupied. This has also an additional impact since the fault-‐tolerance is checkpointing the state of the aggregate operators and this state is much larger than needed.

The solution that was given was to reprogram the aggregate operators to manage memory efficiently by introducing a garbage collection process. However, performing garbage collection periodically is disruptive, since regular processing needs to be stopped/delayed as a result. For this reason, garbage collection was performed in batches. Since parallel processing requires buckets of keys, we used this concept for batching the garbage collection. As soon as a window in a bucket required sliding a window, all windows in the same bucket were adjusted. This was possible since the parallel data streaming middleware guarantees that tuples arrive timestamp ordered to each bucket. Windows that remained empty during one period (the time between two window slides) were garbage collected. This solution has largely diminished the memory usage by aggregate operators for those workloads with sparse tuples across the aggregation keys that is the case in most applications (phone numbers, credit card numbers, car plates, RFIDs, etc.). Additionally, the batched approach to garbage collection was non-‐intrusive for regular processing.

(18)

3 Prototype of warehouse application and adaptation over new IO stack design

3.1 Introduction

In the context of the IOLanes project it is expected that the Tariff Advisor (TA) rating engine and tariff simulation tool will be integrated with Middle-‐R database replication system and the parallel data streaming engine to create a new type of rating application that is both flexible in terms of functionality as well as that scales in I/O performance with the number of cores.

The integration of the parallel data streaming engine with TariffAdvisor means that the results of TA, that is, the re-‐rated Call Detail Records (CDRs) will be fed to the parallel data streaming engine, which in turn will execute a number of post-‐processing steps on them. This approach allows applications users (e.g. sales and marketing people in Telco providers) to experiment with new plans of rating at acceptable response times and use results to make business decisions. In addition, we foresee that this flexibility will allow the TariffAdvisor rating engine to be applied to other areas of rating as well, such as energy or more generally, utility rating.

This document describes the post-‐processing steps that are needed, based on actual requirements from existing deployments and customers that already use TA. It also reports the status of their implementation on a StreamCloud/TariffAdvisor prototype.

3.2 The current state of Tariff Advisor

The Tariff Advisor architecture was presented in previous deliverable 4.1. We include a brief description in the following paragraphs.

Tariff Advisor is based on a fast rating engine and re-‐rating engine. It stores metadata (rate plans, add-‐ons, tariffs, free unit bundles, etc.) in a database schema. Metadata is required to ensure that calls are rated correctly. Customer contracts, rate plans, as well as the scenarios that should be applied are maintained in the same database. Scenarios define the combinations of base rate plans and add-‐ons that should be used for the re-‐rating of calls. Scenarios can be complex, and can include substitution conditions and rules that are matched using the original rate plan combination of each contract.

TA receives input as flat files of calls. In every file it is expected that calls are ordered based on the call date and time, in order to use the free units (minutes, SMSs, MBs) in the correct order. It is also possible that the inputs to TA are calls stored in a database.

The basic output of TA is rated call records (also called CDRs – call detail records), stored normally in files. For each call record in the input, TA outputs as many call records as the number of scenarios that were applied to the input. Each output call record is marked with the corresponding scenario ID that created it. In most of the cases(depending on the configuration), the calls are also re-‐rated against the current plan of the customer, and this is marked with a special scenario ID.

It is also possible that aggregated data is produced as output. In this case, the output call records are usually aggregated per contract, scenario ID, service, and destination zone, but other aggregations may be desirable.

TariffAdvisor outputs can be used in different ways. The basic case is to output the total charge amount for each scenario for all customers. However, it is more common to have requirements that require more elaborate processing of the results. One such requirement is to find the best rate plan for a customer. Another one might be to find all the plans in a specific distance (that is, total charge amount range) from the current plan, even though the best plan is not included in this distance. Distance could be expressed either as an absolute amount or