
D4.2 Prototype of warehouse application and adaptation over new IO stack design; design and implementation of middleware prototype


Academic year: 2021


 

IOLanes

Advancing the Scalability and Performance of I/O Subsystems in Multicore Platforms

Contract number: FP7-248615
Contract Start Date: 01-Jan-10
Duration: 36 months
Project coordinator: FORTH (Greece)
Partners: UPM (Spain), BSC (Spain), IBM (Israel), Intel (Ireland), Neurocom (Greece)

D4.2

Prototype of warehouse application and adaptation over new IO stack design; design and implementation of middleware prototype

Draft Date: 29-Feb-12, 13:22
Delivery Date: M24
Due Date: M24
Workpackage: WP4
Dissemination Level: Public
Authors: Ricardo Jimenez-Peris, UPM; Marta Patiño-Martínez, UPM; Valerio Vianello, UPM; Luis Rodero, UPM; Grigoris Prasinos; Apostolis Hatzimanikatis.
Status: 5
(Choose one, implies all previous) 1. Internal draft preparation; 2. Under internal project review; 3. Release draft preparation; 4. Passed internal project review; 5. Released

Project funded by the European Commission under the Embedded Systems Unit – G3, Directorate General Information Society, 7th Framework Programme

   


 

History of Changes

Date          By                         Description
11 Dec 2011   Ricardo Jimenez            TOC
Dec 2011      Ricardo Jimenez            First draft
Jan 2012      Ricardo Jimenez            Second draft
Jan 2012      Grigorios Prasinos         Third draft
Jan 2012      Valerio Vianello           Fourth draft
Jan 2012      Ricardo Jimenez            Fifth draft
Feb 2012      Grigorios Prasinos         Sixth draft (added TA experiments)
Feb 2012      Apostolis Hatzimanikatis   7th draft
Feb 2012      Martina Naughton           8th draft (peer review)
Feb 2012      Angelos Bilas              9th draft (coordinator review)
Feb 2012      Alex Landau                10th draft (peer review)

 

TABLE OF CONTENTS

EXECUTIVE SUMMARY
1 OVERVIEW OF DELIVERABLE
1.1 MAIN CONTRIBUTIONS AND ACHIEVEMENTS
1.2 PROGRESS BEYOND THE STATE-OF-THE-ART
1.3 ALIGNMENT WITH PROJECT GOALS AND TIME-PLAN FOR THIS REPORTING PERIOD
1.4 PLAN FOR NEXT REPORTING PERIOD
1.5 STRUCTURE OF DELIVERABLE
2 MIDDLEWARE SYSTEMS
2.1 INTRODUCTION
2.2 DATABASE REPLICATION MIDDLEWARE
2.2.1 MOTIVATION
2.2.2 ARCHITECTURAL CHOICES AND RATIONALE OF THE SELECTED ARCHITECTURE
2.2.3 DESIGN
2.2.4 IMPLEMENTATION ISSUES
2.3 PARALLEL DATA STREAMING MIDDLEWARE
2.3.1 MOTIVATION
2.3.2 ARCHITECTURAL CHOICES AND RATIONALE OF THE SELECTED ARCHITECTURE
2.3.3 DESIGN
2.3.4 IMPLEMENTATION ISSUES
3 PROTOTYPE OF WAREHOUSE APPLICATION AND ADAPTATION OVER NEW IO STACK DESIGN
3.1 INTRODUCTION
3.2 THE CURRENT STATE OF TARIFF ADVISOR
3.3 POST PROCESSING AND AGGREGATIONS NEEDED
3.3.1 BASIC AGGREGATE
3.3.2 TOTAL INVOICE AMOUNT COMPUTATION
3.3.3 SORT RESULTS AND COMPUTE DISTANCES
3.3.4 DROP RESULTS ABOVE A SPECIFIC THRESHOLD
3.3.5 KEEP RESULTS IN A SPECIFIC DISTANCE FROM CURRENT OR BEST PLAN
3.3.6 KEEP RESULTS IN SPECIFIC STEPS (STEP-UP, STEP-DOWN) FROM CURRENT OR BEST PLANS
3.3.7 TAKING INTO ACCOUNT TRAFFIC OF MORE PERIODS
3.3.8 COMPUTATION OF DISCOUNTS
3.3.9 PROCESSING BASED ON MARGINS
3.4 IMPLEMENTATION: DESCRIPTION OF THE INTEGRATION WITH STREAMCLOUD AND OTHER IMPROVEMENTS MADE DURING THE PROJECT
3.4.1 CHANGES TO TARIFFADVISOR
3.4.2 INITIAL PROTOTYPE
3.4.3 THE FORMAT OF DATA OF TARIFF ADVISOR
3.4.4 IMPLEMENTATION OF STREAMCLOUD OPERATORS
4 OTHER BENCHMARKING APPLICATIONS
4.1 INTRODUCTION
4.2 TPC-W BENCHMARK
4.2.1 MOTIVATION
4.2.2 METRICS
4.2.3 IMPLEMENTATION ISSUES
4.2.4 SUMMARY OF BASELINE RESULTS
4.2.5 EXPECTED IMPROVEMENT WITH IOLANES
4.3 LINEAR ROAD
4.3.1 MOTIVATION
4.3.2 METRICS
4.3.4 SUMMARY OF BASELINE RESULTS
4.3.5 EXPECTED IMPROVEMENT WITH IOLANES
4.4 TARIFFADVISOR
4.4.1 MOTIVATION
4.4.2 METRICS
4.4.3 IMPLEMENTATION ISSUES
4.4.4 SUMMARY OF BASELINE RESULTS
4.4.5 EXPECTED IMPROVEMENT WITH IOLANES
4.5 SPEC JENTERPRISE 2010
4.5.1 MOTIVATION
4.5.2 METRICS
4.5.3 IMPLEMENTATION ISSUES
4.5.4 SUMMARY OF BASELINE RESULTS
5 PLAN FOR THE NEXT REPORTING PERIOD
6 CONCLUSIONS
7 BIBLIOGRAPHY

   


Executive Summary

This deliverable presents the progress made during the second period of the project on the applications and middleware developed to evaluate the IOLanes prototype with realistic workloads. The progress during this period is two-fold:

(1) A main goal of Period 2 is to create a realistic application stack for evaluation purposes that includes all layers of modern application stacks. Our evaluation aims not only to show limitations and improvements at the system level, but also to provide a better understanding of various application trends. For this purpose we use TariffAdvisor, which we extend in two ways.

First, we build a separate rating component, based on TariffAdvisor, that can make use of a generic stream-processing engine. This is important since it allows the rating engine to focus on the policies to be evaluated rather than on event processing itself. Integration of TariffAdvisor and the middleware has progressed, and the points where modifications are needed have been identified.

Second, we design a middleware system that allows the new rating engine to scale to multiple nodes. Replication is a main technique for scaling performance; however, due to consistency requirements, it significantly increases I/O requirements. The database replication middleware that has been architected and developed aims at replicating a database transparently to clients for scalability and availability purposes. The approach taken places special emphasis on transparency: applications should be able to use the replication middleware without changes. During this period we provide an initial implementation of the middleware, as per the original plan. In addition, during this period, the database component of TariffAdvisor was migrated from Oracle to PostgreSQL. This way TariffAdvisor is automatically integrated with the database replication component, and uses it for the storage of input CDRs as well as metadata and output results.

(2) With respect to the existing application stacks:

Regarding TariffAdvisor, during the experimental analysis and profiling we were able to build a base version that eliminates many of the inefficiencies of the original application: avoiding character-based processing of input and output data when possible, introducing large I/O requests, and custom buffer copying have all resulted in significant improvement of application-level throughput. In addition, since asynchronous I/O facilities are not available in Java 6, we introduce asynchronous I/O at the application level by using additional I/O threads; this, however, does not improve I/O performance further. The current deployment cases for TariffAdvisor in production are either on a single server (bare metal) or under VMs. Our I/O optimizations during Period 2 result in proper scaling of TariffAdvisor in non-virtualized environments; however, our evaluation shows a significant performance penalty (up to 70% compared to the non-VM case) and lack of scaling in the case of VMs. We currently use the optimized version of TariffAdvisor for evaluating the new I/O stack.

For OLTP applications, following the preliminary evaluation of TPC-W during Period 1, two main bottlenecks were found and removed by re-architecting the TPC-W implementation. These two bottlenecks were the database connection pool and the load generator.

For generic data streaming applications, a preliminary evaluation of LinearRoad unveiled many performance issues at the application and middleware level:

• At the middleware level, we found that database operators were CPU expensive, WaitFor synchronization was inefficient, and aggregate operators managed memory poorly. New versions of the database operators, WaitFor operators and aggregate operators were designed and developed to solve these issues. We also added persistence of data streams for fault-tolerance purposes.

• Generating the large loads necessary to stress I/O in modern systems was not possible because the load generator would hang after a few days. Additionally, generating each new load was taking longer and longer (i.e., weeks). We designed and implemented a new load generator that allows us to generate large loads incrementally.

• The granularity of the partitioning into subqueries was too coarse to enable an even load distribution across cores. A series of optimizations were made until an adequate balance across cores was achieved.

Overall, at the starting point LinearRoad was not able to sustain a 1-expressway load in our testbed; it is now able to bear a load an order of magnitude higher, of at least 10 expressways. During Period 3 we will complete the TariffAdvisor+Replication Middleware stack and use the new and existing application stacks for evaluating the IOLanes stack prototype.


1 Overview of Deliverable

1.1 Main contributions and achievements

During the reporting period, WP4 has progressed with:

• The design and implementation of the database replication middleware.
    o The database middleware has been architected and developed. The middleware provides scalable replication of the database in a manner that is transparent to applications.

• The extension of TariffAdvisor and its integration with the middleware systems.
    o TariffAdvisor has been enhanced to enable aggregation via the data streaming middleware. This extension makes it possible to express aggregation queries declaratively and to parallelize the aggregation of rate plans automatically.
    o TariffAdvisor has been migrated from Oracle to PostgreSQL. With this migration the integration with the database replication middleware is automatic. This extension will enable declarative queries over the input data and parallel reading of input files.

• Modifications to the middleware systems and benchmark applications in order to stress the IO path.
    o Parallel data streaming.
        ▪ A new version of the database operators has been designed and implemented to use composite keys and avoid the costly scans performed by previous versions of the operators.
        ▪ A new version of the WaitFor synchronization operator has been designed and developed to avoid traversals of long linked lists during high loads, which burned too many CPU cycles. The current version uses hash tables for indexing to provide efficient access.
        ▪ A new version of the aggregate operator has been designed and developed with efficient memory management and new non-intrusive garbage collection functionality to avoid the high memory usage of the previous version.

• Evaluation of bare-metal and virtualized baselines with TPC-W and LinearRoad.
    o Bare-metal baseline evaluations.
        ▪ A baseline for TPC-W has been completed. The findings show that one of the limitations of current IO architectures lies in synchronous writes, which are one of the bottlenecks of transactional systems.
        ▪ A baseline for LinearRoad has been completed. The results show that the IO path does not scale with high rates of small reads and writes.
    o Virtualized environment evaluations.
        ▪ Baselines for TPC-W and LinearRoad have been obtained.
        ▪ In both cases, a substantial loss of performance due to virtualization has been found, which is in turn caused by the number of context switches triggered by IO operations.

• Ongoing evaluation of SplitX, K0fs and deduplication.
    o During the last few months, SplitX and K0fs have been executed with LinearRoad. However, there are still issues to be solved in order to attain results that will allow us to draw conclusions regarding their performance.
    o Evaluation with deduplication has recently started with the database replication middleware and is still ongoing.


1.2 Progress beyond the state-of-the-art

• Scalable database replication middleware. The main innovation with respect to the state of the art (SOTA) lies in the avoidance of read/write conflicts by using snapshot isolation (SI for short) as the isolation level.

• Scalable parallel data streaming. The main innovation with respect to SOTA lies in parallelizing data streaming operators so that the increasing multi-core power currently available is exploited.

• In addition, during this period WP4 has produced an optimized version of the TariffAdvisor rating application, significantly improving its performance beyond the state of the art. Neurocom has executed a large number of experiments with TariffAdvisor with different workloads, in different environments and with different parameters.

 

1.3 Alignment with project goals and time-plan for this reporting period

The main objective of WP4 is to provide middleware systems and benchmarking applications to evaluate, with realistic software stacks and applications, the improvements made along the IO path in WP1-3. Additionally, it uses the monitoring framework developed by WP5. In Period 2, the goal was to establish a baseline against which the improvements contributed by WP1-3 can be evaluated. The activity of the WP has therefore been aligned with the goals of the project and with the timeframe for the second period of the project.

 

1.4 Plan for next reporting period

During the third and final period, the improvements made by WP1-3 will be evaluated continuously to provide feedback to the WP1-3 partners, enabling them to perform further improvements and quantify them. Changes in the benchmarks will be made as needed to fulfil the goals of the evaluations. Middleware systems will be improved to better exploit the developments from WP1-3.

 

1.5 Structure of deliverable

The core of the technical contents is organized into 3 sections. Section 2 presents the middleware systems developed for the project. Section 3 presents the integration of TariffAdvisor with the middleware systems. Section 4 presents the remaining benchmarking applications developed to evaluate the results of the project. Section 5 outlines the plan for the next reporting period. Finally, Section 6 presents the conclusions.


2 Middleware systems

2.1 Introduction

The initial scope of the project targeted a database replication middleware as the middleware system. This scope has been extended to handle a variety of workloads and target different limitations of the IO path. The scope was extended to include three middleware systems: database replication middleware, parallel-distributed data streaming (StreamCloud), and clustered Java EE. In the second year, it became clear that the clustering aspect of Java EE was not adding any benefit; therefore, a regular Java EE middleware will be used instead. On the application/benchmarking side the original scope considered an enhanced version of TariffAdvisor, whilst the extended scope considers three additional benchmarking applications: TPC-W, LinearRoad and SPEC jEnterprise.

TariffAdvisor is a typical OLAP application that performs analytical queries over large datasets. In this particular case, it computes hundreds or thousands of alternative plans for the clients of a phone company over a period of several months. It will be used standalone, and integrated with the data streaming middleware and the database replication middleware. The data streaming middleware will make it possible to express aggregate queries declaratively and parallelize them automatically. The database replication middleware will make it possible to express database queries over the input data declaratively and to parallelize the reading automatically.

TPC-W is a benchmark for databases that will be used in conjunction with a database management system and the database replication middleware. TPC-W aims at evaluating OLTP workloads. SPEC jEnterprise is the benchmark for Java EE technology. It will be used in conjunction with Java EE servers and databases.

 

2.2 Database replication middleware

2.2.1 Motivation

Databases are among the most used software servers in industry and are used pervasively in almost every application. Two crucial requirements for databases are scalability and availability. Both can be attained via a technique known as database replication. In the context of the project, the database replication middleware will be used for two different purposes. The first purpose is to integrate it with TariffAdvisor, as originally proposed in the DoW, to enable declarative queries over input data and automatic parallelization over a many-core system. This scenario will allow us to evaluate the benefits of IOLanes for OLAP applications, which are characterized by massive reads. The second purpose is to evaluate how a replicated database service deployed in a cloud environment (e.g. the Amazon Relational Database Service) can benefit from IOLanes. In this context, each node runs replicas from different clients on different VMs. This scenario will stress the IO path in virtualized environments, typically found in cloud environments.

 

2.2.2 Architectural choices and rationale of the selected architecture

Database replication middleware can follow four alternative architectures (see Figure 1). They differ with respect to where the replication logic is located. In the first, kernel-based replication, the database is modified to contain the replication logic (see Figure 1 (a)). In the second, middleware-based replication, a middleware component encapsulates the replication logic and acts as a kind of router between client requests and database replicas (see Figure 1 (b)). It avoids the main complication of the kernel-based approach, which is the need to modify the internals of the database extensively. The third, replicated middleware, is an extension of the second approach (see Figure 1 (c)). Within replicated middleware, the middleware component has one or more replicas to avoid the single point of failure that is the main shortcoming of the aforementioned alternatives. The fourth and final architecture alternative, decentralized middleware (see Figure 1 (d)), has a middleware component collocated with each database replica. This solution avoids the single-node bottleneck of the previous solution.

The selected architectural choice for locating the replication logic is decentralized middleware, since it avoids the three aforementioned issues: extensive modification of database internals, single point of failure and single-node bottleneck.

 

 

Figure  1:  Architectural  choices  for  locating  the  database  replication  logic.  

 

We also need to consider how the middleware extracts updates from the database system. This update extraction process can happen in three different ways. The first is known as white-box. In this technique, the database is considered a white box in which all the functionality is available and can be used. Since all functionality is available, a very efficient implementation is possible. Its main shortcoming is the complexity of modifying the database internals. The second approach is referred to as black-box. In this case, database internals are not exposed and only the public interface can be used. The main shortcoming of this approach is the high overhead associated with it. The third approach, the grey-box approach, tries to exploit the advantages of the two former approaches whilst minimizing their disadvantages. In this approach only two minimal services are implemented within the database: one to extract and one to install the updates of a transaction. On the one hand, the middleware has few dependencies on the database, simply the availability of these two services. On the other hand, the extraction of updates is extremely efficient, almost as efficient as in the white-box approach. For these reasons, the grey-box alternative has been selected as the way in which updates are extracted.
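The grey-box contract described above can be illustrated with a minimal sketch. It is not the PostgreSQL extension used in the project: `ToyReplica` is a hypothetical in-memory stand-in, and the method names `extract_updates` and `install_updates` are assumptions chosen to mirror the two services named in the text.

```python
class ToyReplica:
    """In-memory stand-in for one database replica (hypothetical, not PostgreSQL)."""

    def __init__(self):
        self.committed = {}   # key -> committed value
        self.pending = {}     # transaction id -> {key: new value}

    def write(self, txn_id, key, value):
        # Buffer the write privately until the transaction commits.
        self.pending.setdefault(txn_id, {})[key] = value

    def extract_updates(self, txn_id):
        # Service 1: return the update set of a transaction (empty if read-only).
        return dict(self.pending.get(txn_id, {}))

    def install_updates(self, updates):
        # Service 2: apply an update set that was produced at another replica.
        self.committed.update(updates)

    def commit(self, txn_id):
        # Make the buffered writes of a local transaction visible.
        self.committed.update(self.pending.pop(txn_id, {}))


origin, remote = ToyReplica(), ToyReplica()
origin.write("t1", "plan:42", "flat-rate")
updates = origin.extract_updates("t1")   # the middleware only needs this call ...
origin.commit("t1")
remote.install_updates(updates)          # ... and this one on the other replicas
```

The middleware in this sketch never looks inside the replica; it depends only on the two calls, which is the limited coupling the grey-box approach aims for.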


2.2.3 Design

The DB replication middleware consists of several components (see Figure 2), namely, a transparent JDBC driver, a group communication subsystem, an extended database server and the DB replication middleware itself. The JDBC driver enables clients to contact the replicated database in a transparent manner. That is, clients connect to the database in the same way they do with a regular non-replicated database. This transparency is syntactic. This JDBC driver has extended functionality (with respect to a regular JDBC driver), in that it provides replica discovery, dynamic reconfiguration and fail-over functionality. Replica discovery enables the discovery of available replicas via IP multicast. After replica discovery is complete, the JDBC driver establishes a connection to a particular replica, the one that currently has the lowest load, in order to balance the load. Dynamic reconfiguration changes the connection of a JDBC driver from one replica to another in order to balance the load across replicas. In the event of a failure of the replica to which a JDBC driver is connected, the JDBC driver performs failover by connecting to one of the other available replicas.
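The replica selection and failover behaviour of the driver can be sketched as follows. This is an illustration only: the function `pick_replica`, the replica names and the load values are hypothetical, and the sketch assumes that discovery has already produced a map from replica address to reported load.

```python
def pick_replica(loads, exclude=()):
    """Return the address of the least-loaded replica that is not in `exclude`."""
    candidates = {addr: load for addr, load in loads.items() if addr not in exclude}
    if not candidates:
        raise RuntimeError("no replica available")
    return min(candidates, key=candidates.get)


# Loads as they might be reported during replica discovery (made-up values).
loads = {"replica-a": 0.70, "replica-b": 0.20, "replica-c": 0.45}

# Initial connection: choose the replica with the lowest load.
current = pick_replica(loads)

# Failover: the current replica has failed, so reconnect to the best remaining one.
fallback = pick_replica(loads, exclude={current})
```

Dynamic reconfiguration could reuse the same selection function whenever fresh load figures arrive, moving the connection if a clearly less loaded replica exists.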

The group communication subsystem provides two basic functionalities: reliable multicast and group membership. The reliable multicast enables the database replicas to communicate among themselves with ordering and reliability guarantees. The replication protocol exploits ordering to guarantee that all transactions are validated in the same order, which makes the decisions deterministic across all replicas. Reliability also plays a crucial role by guaranteeing that communication is atomic, that is, a multicast message is received by all replicas or by none. This ensures that a transaction is committed at all available replicas or at none and therefore prevents state divergence across them in the event of failures. Group membership provides another important service: it gives replicas an agreed notion of which replicas are up and connected, and lets them detect failures in a coherent manner. When there is a failure, all replicas receive a view change message indicating the new view with the replicas that are still alive and connected. Similarly, when a replica joins the group, a view change reports the new extended view. The implementation uses the Ensemble group communication system, which is written in OCaml, through its interface for C programs.

 

 


 

The extended database server provides two additional services (with respect to a regular database server) to extract and install the updates of a transaction. In this implementation, PostgreSQL has been used due to its clean architecture. The two services have been exported as custom database functions so they can be invoked through the regular database interface from the DB replication middleware.

 

The most involved component is the DB replication middleware. This component consists of several elements:

• Client connection manager.

• DB connection pool.

• Local transaction manager.

• Replica control manager.

• Remote transaction manager.

 

The client connection manager accepts connections from the client JDBC drivers. It also performs the server side of the replica discovery, dynamic load balancing and failover protocols. The connection manager accepts a connection from a client and registers it in the client connection table. It associates a particular DB connection with this client. This sticky connection (the connection is kept for the whole session) provides session consistency guarantees. Requests from clients are sent from the connection manager to the local transaction manager via the request_queue.

 

The database connection pool is essentially a pool of processes, where each process maintains a connection with PostgreSQL. PostgreSQL only permits one connection per process. This is why the system needs to have a process pool instead of a thread pool. Connections are materialized via a socket that keeps the connection with the corresponding PostgreSQL postmaster process. Each connection has an associated connection object, which contains information relating to the connection (e.g. client id, status of the connection, status of the associated transaction).

 

The local transaction manager executes SQL statements originating from local clients. The local transaction manager has a number of worker threads that take the client requests stored in the request_queue and forward them to the corresponding DB connection process in the DB pool. The result of the request is sent from the database pool to the connection thread in the connection manager associated with the client, which forwards the reply to the client. When a commit request is sent by a client, the extract updates function is invoked on the DB connection of the client in the associated DB pool process. If the updates are empty, the transaction was read-only. In this case, the transaction is committed locally and the reply is sent to the client. If the updates are not empty, the set of transaction updates is sent to the replica control manager. The replica control manager uses the group communication subsystem to multicast this update to all replicas (including the sender).
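The commit path just described can be summarised in a short sketch. The function `handle_commit`, the message layout and the return strings are hypothetical; `multicast` stands in for the group communication subsystem and is modelled here as a plain callable.

```python
def handle_commit(txn_id, updates, multicast):
    """Decide how to commit a local transaction from its extracted update set."""
    if not updates:
        # Empty update set: the transaction was read-only and commits locally.
        return "committed-locally"
    # Update transaction: propagate the update set to all replicas (sender included)
    # and wait for the outcome of validation before answering the client.
    multicast({"txn": txn_id, "updates": updates})
    return "awaiting-validation"


sent = []  # records what would have been multicast
read_only_outcome = handle_commit("t1", {}, sent.append)
update_outcome = handle_commit("t2", {"x": 1}, sent.append)
```

Only update transactions generate network traffic in this sketch, which matches the description that read-only transactions are answered by the local replica alone.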

 

The remote transaction manager consists of a certifier thread and a set of committer threads. It receives update propagation messages from other replicas. These messages are stored in the to_apply queue. The certifier thread processes the update propagation messages sequentially. It keeps track of previously committed transactions and performs the validation process. The validation process checks, for every transaction being validated, whether there was a previously committed transaction that is concurrent to it and has a write-write conflict on any modified item. In case of conflict, the transaction being validated is aborted. This decision is taken by all replicas for each update transaction. Since the validation is performed in the same order (the total order of the multicast), the result is deterministic. In case of positive validation, the update propagation message is stored in the to_complete queue. The committer threads take update propagation messages from this queue. If the transaction is local, it will be committed by the local transaction manager. If the transaction is remote, the update set is applied and committed as an independent transaction.
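The validation step can be sketched as below. This is a simplified illustration, not the project's certifier: the `Certifier` class is hypothetical, and it assumes that each update message carries the commit sequence number of the snapshot the transaction read (`snapshot_seq`), so that "concurrent" means "committed after that snapshot". The source does not state how concurrency is tracked.

```python
class Certifier:
    """Validates update transactions in the total order of the multicast."""

    def __init__(self):
        self.commit_seq = 0   # sequence number of the last validated transaction
        self.history = []     # (commit sequence number, frozenset of written keys)

    def validate(self, snapshot_seq, write_keys):
        write_keys = frozenset(write_keys)
        for seq, keys in self.history:
            # A committed transaction is concurrent if it committed after the
            # snapshot read by the transaction under validation.
            if seq > snapshot_seq and keys & write_keys:
                return False  # write-write conflict: abort
        self.commit_seq += 1
        self.history.append((self.commit_seq, write_keys))
        return True


def run(messages):
    certifier = Certifier()
    return [certifier.validate(seq, keys) for seq, keys in messages]


# Every replica receives the same messages in the same order.
messages = [(0, {"a"}), (0, {"a", "b"}), (1, {"b"}), (0, {"c"})]
replica_1 = run(messages)
replica_2 = run(messages)
```

Because each replica feeds the same sequence into the same deterministic check, both runs reach identical commit and abort decisions without exchanging any further messages.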

 

2.2.4 Implementation  issues  

There are several implementation issues with the remote and local transaction managers. The first relates to the semantic transparency of the replication protocol. The replication protocol provides 1-copy equivalence. This means that the replicated database should behave in the same way as a non-replicated centralised one. In order to achieve this, it is necessary to coordinate the start of new transactions with the commitment of update transactions. Essentially, new transactions should be started only when a gap-free prefix of the transactions has been committed. This means that either transactions are committed sequentially or, if they are committed in parallel, no transactions can be started in the meantime. Committing transactions sequentially raises a liveness issue, since a deadlock can be created between the replication logic and the local database. This is due to the fact that PostgreSQL uses locking to deal with write-write conflicts, and a remote update transaction can be blocked by a local transaction that will be serialized after it, thus creating a waiting cycle and the corresponding deadlock. To avoid this, it is necessary to enable parallel commit processing. However, parallel commit processing results in losing 1-copy equivalence if transactions are started while it is in progress. The adopted solution has been to alternate between parallel commit processing and the start of new transactions. When the system switches from parallel commit processing to sequential commit processing, the start of new transactions is enabled, and then the system switches back to parallel commit processing.
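The gap-free prefix condition can be made concrete with a small sketch. The class and method names are hypothetical, and the sketch only tracks the condition; it does not implement the switching between commit modes. Positions refer to the total order of validated update transactions.

```python
class CommitTracker:
    """Tracks which positions of the total order have been committed."""

    def __init__(self):
        self.prefix = 0            # positions 1..prefix are all committed
        self.out_of_order = set()  # committed positions lying beyond a gap

    def mark_committed(self, seq: int) -> None:
        if seq <= self.prefix:
            return  # already part of the contiguous prefix
        self.out_of_order.add(seq)
        # Absorb positions that now extend the contiguous prefix.
        while self.prefix + 1 in self.out_of_order:
            self.prefix += 1
            self.out_of_order.remove(self.prefix)

    def can_start_transactions(self) -> bool:
        # New transactions may start only if no commit lies beyond a gap.
        return not self.out_of_order


tracker = CommitTracker()
tracker.mark_committed(1)
tracker.mark_committed(3)            # position 2 is still committing: gap
gap_state = tracker.can_start_transactions()
tracker.mark_committed(2)            # gap closed, the prefix is now 1..3
closed_state = tracker.can_start_transactions()
```

During parallel commit processing the committed set may temporarily contain gaps, which is why the start of new transactions has to wait until the prefix is contiguous again.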

 

   

Figure  3:  Fraction  of  capacity  used  for  local  vs.  remote  txns  with  different  number  of   replicas  

 

A second implementation issue is that the higher the number of replicas, the higher the capacity used by each replica to execute remote transactions (see Figure 3). This is an inherent limiting factor for the scalability of fully replicated systems. The main issue with respect to IO comes from the fact that, in configurations with a large number of replicas, the number of remote transactions that are pure updates increases. This overhead has been reduced by grouping consecutive non-conflicting remote transactions into one large transaction instead of using individual transactions. This does not affect 1-copy equivalence since transactions are committed in the same order and, being consecutive, they do not create gaps.
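One possible way to form such groups is a greedy pass over the remote update sets in delivery order, closing a group as soon as the next transaction overlaps with it. The sketch below is an assumption about how the grouping could be done, not the actual implementation; write sets are modelled as plain Python sets of item identifiers.

```python
def group_remote_transactions(write_sets):
    """Greedily group consecutive write sets that do not overlap.

    Returns lists of positions; each list would be applied and committed
    as one large transaction, preserving the delivery order.
    """
    groups = []
    current, current_items = [], set()
    for index, write_set in enumerate(write_sets):
        if current and current_items & write_set:
            # Conflict with the open group: close it and start a new one.
            groups.append(current)
            current, current_items = [], set()
        current.append(index)
        current_items |= write_set
    if current:
        groups.append(current)
    return groups


remote = [{"a"}, {"b"}, {"a", "c"}, {"d"}, {"c"}]
groups = group_remote_transactions(remote)
```

In this example five remote transactions are committed as three database transactions, which reduces the number of commit operations without reordering any update.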

[Figure 3 labels: LOCAL and REMOTE shares of DB capacity, shown for 2 sites and for 3 sites]


 

Another implementation issue relates to how many threads are devoted to local transactions and to remote transactions. One might think that a single thread should be enough for remote transactions. However, since parallel commit processing is required, multiple threads are needed. There is an interplay between the threads executing local transactions and those executing remote transactions. The sizes of the thread pools have been tuned, which has increased performance to some extent.

 

There are also a couple of configuration issues that affect the IO behaviour. The first relates to the configuration of the commit function. In principle, it should log synchronously to guarantee the durability of transactions. However, this is typically quite detrimental to performance. PostgreSQL allows asynchronous commits to be configured. This configuration increases performance, but at the cost of losing durability.

 

Another performance issue relates to how PostgreSQL manages multi-versioning. Whenever a tuple is updated, it is not updated in place; instead, a new version is generated. This means that the number of versions of each tuple grows over time, and a garbage collection mechanism is required. This is needed not only to free unnecessarily occupied space, but mainly for performance reasons: over time, database pages end up containing just a few active tuples, making IO access inefficient. PostgreSQL has a vacuum process that performs the garbage collection by removing obsolete versions. The vacuum process performs synchronous writes and, since it is executed periodically, it has a severe effect on performance. There is a configuration parameter in PostgreSQL that determines whether the vacuum process is run periodically or on demand.

 

2.3 Parallel data streaming middleware

2.3.1 Motivation

There is an increasing number of emerging applications for which databases and the underlying store-and-process model are not a good solution. New technologies and techniques, such as data streaming, are being proposed to address these applications more adequately. Data streaming applications aim to process data and events in an online manner as they flow. Data streaming enables the deployment of continuous queries that are evaluated over the streaming data in an online manner with in-memory processing.

 

Data streaming also has strong IO requirements due to two features. The first is its ability to access persistent state via database operators that allow reads, updates and inserts on databases. The second is the need for availability. Availability requires persisting streaming data in order to re-inject it in the event of failures. This is quite challenging since it requires persisting data at streaming rates.

 

Another recent requirement is scalability with respect to the number of cores and the number of nodes in the system. This requirement has been addressed by the FP7 project Stream, which was coordinated by UPM and included FORTH as a partner. This project ended officially in January 2011. In order to extend the scope of IO workloads, this new middleware will be used in IOLanes to exercise data streaming workloads and, in particular, to determine how database operators and fault-tolerance stress the IO path. The data streaming middleware will be used to exploit the multi-core power by using a different middleware instance for each available core. This will require an underlying IO system able to scale with the number of cores. The middleware will run each middleware instance on a different VM, thus helping to stress the virtualized IO path.

 

2.3.2 Architectural  choices  and  rationale  of  the  selected  architecture  

In this section, we briefly summarize the architecture of the parallel data streaming middleware and describe the architectural choices that were adopted. The middleware extends an underlying stream processing engine that has distribution capabilities but lacks capabilities for parallel or parallel-distributed processing. Parallelization transparency requires distributing tuples before each stateful operator across all the instances. The overhead of parallelization can be measured according to two factors. The first factor is the number of hops made by each tuple from instance to instance. Each hop between nodes involves serializing, deserializing, sending and receiving the tuple, all of which consume CPU cycles. The second factor is the fan-out of the communication, that is, the number of instances to which each instance must send tuples and with which it must therefore keep open connections.

 

The parallelization of a query can follow different strategies that exhibit different tradeoffs. They can be seen as points on a single spectrum with two extremes. On one extreme of this spectrum is the whole-query strategy, in which the query is not split and the whole query is instantiated at every core. In the whole-query strategy the overhead factors are as follows. The number of hops equals the number of stateful operators present in the original query, which is the minimum that can be attained, since tuple redistribution is required before each stateful operator. The fan-out factor, however, reaches its maximum, that is, the total number of cores used for parallelizing the query.

 

On the other extreme of the spectrum, we find the single-operator strategy, in which the original query is split into as many subqueries as it has individual operators. Each operator can then be instantiated on as many cores as needed. The number of hops in this case is maximal, since each tuple performs one hop per individual operator. On the other hand, the fan-out reaches its minimum value, since each operator will have the smallest possible number of instances.

Between these two extremes lies the stateful-subquery strategy, in which the original query is split into as many subqueries as it has stateful operators. That is, the original query is split just before each stateful operator it contains. In this case, the number of hops reaches its minimum possible value, the number of stateful operators, while the fan-out takes a value between the maximum and the minimum possible.
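The splitting rule and the hop counts of the three strategies can be sketched as follows. This is an illustration under simplifying assumptions: the query is modelled as a linear chain of (name, is_stateful) pairs, and the function names are hypothetical.

```python
def split_before_stateful(operators):
    """Split a linear query just before each stateful operator.

    `operators` is a list of (name, is_stateful) pairs. In this sketch a
    leading run of stateless operators would form a subquery of its own.
    """
    subqueries = []
    for name, is_stateful in operators:
        if is_stateful or not subqueries:
            subqueries.append([])
        subqueries[-1].append(name)
    return subqueries


def hops_per_tuple(operators, strategy):
    """Number of hops per tuple, following the counts given in the text."""
    if strategy == "single-operator":
        return len(operators)  # one hop per individual operator
    # whole-query and stateful-subquery redistribute only before stateful operators
    return sum(1 for _, is_stateful in operators if is_stateful)


query = [("aggregate", True), ("map", False), ("join", True), ("filter", False)]
subqueries = split_before_stateful(query)
hops = {s: hops_per_tuple(query, s)
        for s in ("whole-query", "stateful-subquery", "single-operator")}
```

For this four-operator query with two stateful operators, the stateful-subquery strategy yields two subqueries and two hops per tuple, whereas the single-operator strategy requires four hops.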

 

In a multi-core system there are other factors that affect parallelization. In a complex query, there might be more subqueries and/or operators than available cores. Since an additional instance is only beneficial if it can take advantage of an available core, some subqueries will be merged even though they could be split if more cores were available. Another factor is load balancing. The most CPU-intensive subqueries need to be allocated to more cores. This can result in having to merge less CPU-intensive subqueries in order to allocate them to the remaining cores.

 

In conclusion, of the three parallelization strategies, the stateful-subquery parallelization strategy will be adopted. However, it requires an additional optimization phase in which the resulting subqueries are distributed among the cores to balance the CPU utilization, such that only one data streaming engine instance is used per core.

 

2.3.3 Design    

In  order  to  extend  the  data  streaming  engine  without  having  to  modify  it,  parallelization  has   been   enabled   by   developing   parallelization   operators   as   new   custom   data   streaming   operators.     These   parallelization   operators   are   load   balancers   and   input   mergers.   Basically,   the  query  to  be  parallelized  is  split  into  subqueries  and  each  subquery  is  instantiated  into  as   many  cores/nodes  as  needed.  Between  each  pair  of  consecutive  subqueries,  say  A  and  B,  the   parallelization   operators   are   introduced   (see   Figure   4).   At   the   end   of   each   instance   of   subquery   A   (origin   subquery),   a   load   balancer   (LB)   is   located.   At   the   beginning   of   each   instance  of  subquery  B  (destination  subquery),  an  input  merger  (IM)  is  located.    

 

 

Figure  4:  Query  parallelization  by  means  of  parallelization  operators  

 

To attain parallelization transparency, tuples that need to be processed together (for aggregation or correlation purposes) must be sent to the same node. Otherwise, incorrect results would be produced (since tuples would be only partially aggregated and/or correlated). Tuples are grouped into buckets based on a hash of their distribution key (the key used for their aggregation and/or correlation in subquery B). The load balancer is aware of the semantics of the destination subquery and distributes each tuple to the destination instance in charge of the bucket to which the tuple belongs. Each input merger at the destination subquery instances takes the tuples from all incoming load balancers and provides a single merged stream. Simply interleaving the incoming streams does not guarantee transparent parallelization: the multiple incoming physical streams must be merged into a single coherent stream. To attain this goal, the input merger performs a merge sort of the incoming tuples according to their timestamps. That is, it always takes the tuple with the smallest timestamp among all the incoming streams.
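The two parallelization operators can be illustrated with a short sketch. It is not the operator code of the middleware: the tuple layout (timestamp, key, value), the number of buckets, the static bucket-to-instance assignment and the use of CRC32 as hash function are all assumptions chosen for the example.

```python
import heapq
import zlib

NUM_BUCKETS = 8    # illustrative value
NUM_INSTANCES = 2  # instances of the destination subquery B


def bucket_of(key: str) -> int:
    # crc32 is used only because it is deterministic across runs.
    return zlib.crc32(key.encode("utf-8")) % NUM_BUCKETS


def instance_of(bucket: int) -> int:
    # Static bucket-to-instance assignment (an assumption of this sketch).
    return bucket % NUM_INSTANCES


def load_balance(tuples):
    """Route (timestamp, key, value) tuples to one output stream per instance."""
    outputs = [[] for _ in range(NUM_INSTANCES)]
    for tup in tuples:
        outputs[instance_of(bucket_of(tup[1]))].append(tup)
    return outputs


def input_merge(streams):
    """Merge timestamp-ordered streams into one timestamp-ordered stream."""
    return list(heapq.merge(*streams, key=lambda tup: tup[0]))


# Two load balancers (instances of subquery A) feed every instance of B.
lb1 = load_balance([(1, "alice", 10), (4, "bob", 20), (6, "alice", 5)])
lb2 = load_balance([(2, "bob", 7), (3, "carol", 1), (5, "alice", 2)])
merged = [input_merge([lb1[i], lb2[i]]) for i in range(NUM_INSTANCES)]
```

Each entry of merged is the coherent input stream of one instance of subquery B: all tuples with the same key end up at the same instance, and they arrive in timestamp order.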

 

2.3.4 Implementation  issues  

Borealis, the underlying data streaming engine, employs a particular threading model. It uses two threads for communicating with other instances, a receiver thread and a sender thread. Internally, it uses a processor thread that processes each incoming tuple. In practical terms, this means that a Borealis instance is almost single-threaded. It is therefore the parallelization middleware that enables Borealis to run on a multi-core system in a scalable manner.

 

The transparent input merging has a liveness issue. In order to take the tuple with the smallest timestamp, the input merger must wait until it has a tuple from every incoming stream. It might be the case that one of the incoming streams is not producing tuples, which would block the input merger and delay the processing of the tuples available in the other streams. To solve this liveness issue, a special kind of tuple called a dummy tuple is used. Load balancers periodically check whether they have sent any tuple on each of their outgoing streams. If not, they send a dummy tuple with the smallest timestamp that the load balancer can generate. This dummy tuple enables the input merger to unblock. If the dummy tuple is the one taken by the input merger, it is simply discarded. If not, it allows the input merger to take tuples from the other incoming streams, which avoids blocking the operator.
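The effect of dummy tuples on the input merger can be sketched as follows. The class, the DUMMY marker and the push interface are hypothetical, and the periodic check performed by the load balancers is not modelled; the sketch only shows how a dummy tuple on an idle stream releases tuples waiting on the other streams.

```python
from collections import deque

DUMMY = object()  # payload marker for dummy tuples (hypothetical representation)


class InputMerger:
    def __init__(self, num_streams: int):
        self.queues = [deque() for _ in range(num_streams)]

    def push(self, stream: int, timestamp: int, payload) -> list:
        """Enqueue one tuple and return the real tuples that can now be emitted."""
        self.queues[stream].append((timestamp, payload))
        emitted = []
        # A tuple can only be taken while every stream has a head tuple.
        while all(self.queues):
            source = min(range(len(self.queues)),
                         key=lambda i: self.queues[i][0][0])
            timestamp, payload = self.queues[source].popleft()
            if payload is not DUMMY:  # dummy tuples are simply discarded
                emitted.append((timestamp, payload))
        return emitted


merger = InputMerger(2)
first = merger.push(0, 10, "a")         # stream 1 is silent: nothing can be emitted
second = merger.push(0, 12, "b")        # still blocked
unblocked = merger.push(1, 11, DUMMY)   # releases "a"; "b" is newer than the dummy
later = merger.push(1, 13, DUMMY)       # releases "b"
```

Note that the dummy tuple with timestamp 11 releases only the tuple with timestamp 10: the tuple with timestamp 12 must wait, since stream 1 could still deliver a tuple with a timestamp between 11 and 12.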

 

Another implementation issue was related to the efficient persistence of data streams for fault-tolerance. The first approach relied on one file per data bucket, which resulted in a number of files too high for the file system to manage. The second approach used a single file per subquery instance, which made recovery more complex but avoided the bottleneck associated with the management of thousands of files. The fault tolerance of aggregate operators was especially challenging due to the high number of active sliding windows. This issue was solved by developing new, efficient garbage collection functionality.

 

The performance analysis conducted in T3.2 revealed that aggregate operators in the data streaming engine had memory management issues. One issue relates to the sliding window paradigm and, more concretely, to when to clean the windows of aggregate operators that have not received tuples recently. In many applications (for instance in telephony), there is a large number of distinct keys (e.g. phone numbers in call detail records). Aggregate operators devote a sliding window to each key according to the group-by clause. Because of the large number of distinct keys, many different phone numbers receive calls, resulting in many different sliding windows. The issue is that most phone numbers seldom receive calls, and a sliding window does not slide until it receives a call whose temporal distance to one of the calls in the window is bigger than the window size. For instance, suppose that the window size is 24 hours and that a phone number receives a call on day 1 and the next call a week later. According to the sliding window model, the tuple from the day 1 call is only removed when the call a week later arrives and the window slides. As a result, huge amounts of memory are unnecessarily occupied. This has an additional impact, since fault-tolerance checkpoints the state of the aggregate operators and this state is much larger than needed.

 

The solution was to reprogram the aggregate operators to manage memory efficiently by introducing a garbage collection process. However, performing garbage collection periodically is disruptive, since regular processing needs to be stopped or delayed as a result. For this reason, garbage collection was performed in batches. Since parallel processing requires buckets of keys, we used this concept for batching the garbage collection. As soon as one window in a bucket needed to slide, all windows in the same bucket were adjusted. This was possible because the parallel data streaming middleware guarantees that tuples arrive at each bucket in timestamp order. Windows that remained empty during one period (the time between two window slides) were garbage collected. This solution has largely diminished the memory usage of aggregate operators for workloads with tuples sparsely spread across the aggregation keys, which is the case in most applications (phone numbers, credit card numbers, car plates, RFIDs, etc.). Additionally, the batched approach to garbage collection was non-intrusive for regular processing.
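The bucket-based garbage collection can be sketched as follows. This is a simplified illustration, not the reprogrammed Borealis aggregate operator: windows are time-based and hold only timestamps, the class and method names are hypothetical, and an empty window is removed as soon as it is found empty rather than after remaining empty for one period.

```python
from collections import defaultdict, deque


class Bucket:
    """Sliding windows of all the keys hashed to one bucket."""

    def __init__(self, window_size: int):
        self.window_size = window_size
        self.windows = defaultdict(deque)  # key -> timestamps inside the window

    def insert(self, key: str, timestamp: int) -> None:
        window = self.windows[key]
        needs_slide = bool(window) and timestamp - window[0] > self.window_size
        window.append(timestamp)
        if needs_slide:
            self._slide_all(timestamp)

    def _slide_all(self, now: int) -> None:
        # Tuples reach a bucket in timestamp order, so `now` applies to every
        # key: anything older than now - window_size has expired.
        for key in list(self.windows):
            window = self.windows[key]
            while window and now - window[0] > self.window_size:
                window.popleft()
            if not window:
                del self.windows[key]  # garbage collect the empty window


bucket = Bucket(window_size=24)
bucket.insert("A", 0)
bucket.insert("B", 1)
bucket.insert("C", 20)
bucket.insert("B", 30)  # B's window must slide, so the whole bucket is adjusted
remaining = sorted(bucket.windows.keys())
```

In the example, the window of key A is reclaimed when B's window slides, even though A never receives another tuple, which is exactly the case that a purely per-key sliding model leaves in memory.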

   


3 Prototype  of  warehouse  application  and  adaptation  over  new  IO  stack  design  

 

3.1 Introduction  

In the context of the IOLanes project, it is expected that the Tariff Advisor (TA) rating engine and tariff simulation tool will be integrated with the Middle-R database replication system and the parallel data streaming engine to create a new type of rating application that is flexible in terms of functionality and whose I/O performance scales with the number of cores.

The integration of the parallel data streaming engine with TariffAdvisor means that the results of TA, that is, the re-rated Call Detail Records (CDRs), will be fed to the parallel data streaming engine, which in turn will execute a number of post-processing steps on them. This approach allows application users (e.g. sales and marketing people at Telco providers) to experiment with new rating plans at acceptable response times and to use the results to make business decisions. In addition, we foresee that this flexibility will allow the TariffAdvisor rating engine to be applied to other areas of rating as well, such as energy or, more generally, utility rating.

This   document   describes   the   post-­‐processing   steps   that   are   needed,   based   on   actual   requirements  from  existing  deployments  and  customers  that  already  use  TA.  It  also  reports   the  status  of  their  implementation  on  a  StreamCloud/TariffAdvisor  prototype.  

 

3.2 The  current  state  of  Tariff  Advisor  

The Tariff Advisor architecture was presented in the previous deliverable, D4.1. We include a brief description in the following paragraphs.

Tariff  Advisor  is  based  on  a  fast  rating  engine  and  re-­‐rating  engine.  It  stores  metadata  (rate   plans,  add-­‐ons,  tariffs,  free  unit  bundles,  etc.)  in  a  database  schema.  Metadata  is  required  to   ensure  that  calls  are  rated  correctly.  Customer  contracts,  rate  plans,  as  well  as  the  scenarios   that   should   be   applied   are   maintained   in   the   same   database.   Scenarios   define   the   combinations   of   base   rate   plans   and   add-­‐ons   that   should   be   used   for   the   re-­‐rating   of   calls.   Scenarios  can  be  complex,  and  can  include  substitution  conditions  and  rules  that  are  matched   using  the  original  rate  plan  combination  of  each  contract.  

TA  receives  input  as  flat  files  of  calls.  In  every  file  it  is  expected  that  calls  are  ordered  based  on   the  call  date  and  time,  in  order  to  use  the  free  units  (minutes,  SMSs,  MBs)  in  the  correct  order.   It  is  also  possible  that  the  inputs  to  TA  are  calls  stored  in  a  database.  

The basic output of TA is rated call records (also called CDRs, call detail records), normally stored in files. For each call record in the input, TA outputs as many call records as the number of scenarios that were applied to the input. Each output call record is marked with the ID of the scenario that created it. In most cases (depending on the configuration), the calls are also re-rated against the current plan of the customer, and this is marked with a special scenario ID.

It   is   also   possible   that   aggregated   data   is   produced   as   output.   In   this   case,   the   output   call   records   are   usually   aggregated   per   contract,   scenario   ID,   service,   and   destination   zone,   but   other  aggregations  may  be  desirable.  
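As an illustration of this kind of aggregation, the following sketch sums the charges of rated call records per contract, scenario ID, service and destination zone. The record layout and field names are assumptions made for the example and do not reflect the actual TA output format.

```python
from collections import defaultdict
from decimal import Decimal


def aggregate_charges(records):
    """Sum charges per (contract, scenario ID, service, destination zone)."""
    totals = defaultdict(Decimal)  # Decimal() is zero, so missing keys start at 0
    for rec in records:
        key = (rec["contract"], rec["scenario_id"], rec["service"], rec["zone"])
        totals[key] += rec["charge"]
    return dict(totals)


# Hypothetical rated call records: one record per input call and scenario.
records = [
    {"contract": "C1", "scenario_id": 1, "service": "voice",
     "zone": "national", "charge": Decimal("0.10")},
    {"contract": "C1", "scenario_id": 1, "service": "voice",
     "zone": "national", "charge": Decimal("0.25")},
    {"contract": "C1", "scenario_id": 2, "service": "voice",
     "zone": "national", "charge": Decimal("0.20")},
    {"contract": "C1", "scenario_id": 1, "service": "sms",
     "zone": "national", "charge": Decimal("0.05")},
]
totals = aggregate_charges(records)
```

Decimal is used instead of floating point so that monetary amounts add up exactly.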

TariffAdvisor outputs can be used in different ways. The basic case is to output the total charge amount for each scenario for all customers. However, requirements that call for more elaborate processing of the results are more common. One such requirement is to find the best rate plan for a customer. Another might be to find all the plans within a specific distance (that is, total charge amount range) of the current plan, even if the best plan is not within this distance. Distance could be expressed either as an absolute amount or
