Towards SLA-oriented
Cloud Computing
Sara Bouchenak
INSA Lyon, Sara.Bouchenak@insa-lyon.fr
3rd Franco-American Workshop on CyberSecurity, December 8-10, 2014, Lyon, France
Cloud computing
The cloud as a pool of shared hardware and software resources:
cpu, memory, disk; operating systems; virtual machines; middleware layers (e.g. application servers); …
The cloud as a means for distributed applications to:
pick up required resources
access “infinite” remote resources
access on-demand resources (pay-as-you-go)
transparent resource management
December 10, 2014 2
© S. Bouchenak
QoS, SLOs, SLA
• QoS
Many quality-of-service criteria
Performance (e.g. service response time, service throughput); Dependability; Availability (e.g. service abandon rate); Reliability; Security; etc.
• Costs
Energy costs, financial costs
• SLOs
Service Level Objective for a given QoS criterion/metric
Examples:
a target service level, e.g. a minimum service throughput, a maximum service abandon rate
a service level interval
service level maximization/minimization
• SLA (Service Level Agreement)
A contract between the service provider and the service customer
Ideally, a combination of a set of SLOs and cost constraints
Example: “99% of service requests are processed within 1s with a minimal energetic cost”
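The SLA example above can be sketched in code. This is an illustrative representation (the class and function names are mine, not from the talk): an SLO pairs a metric with a target level and the fraction of requests that must meet it.

```python
# Illustrative sketch of an SLO and a compliance check (names are assumptions,
# not from the talk or any SLA standard).
from dataclasses import dataclass

@dataclass
class SLO:
    metric: str        # e.g. "response time"
    percentile: float  # fraction of requests that must meet the target
    target: float      # e.g. 1.0 second

def slo_satisfied(slo: SLO, samples: list[float]) -> bool:
    """True if at least `percentile` of the measured samples meet the target."""
    if not samples:
        return True
    ok = sum(1 for s in samples if s <= slo.target)
    return ok / len(samples) >= slo.percentile

# SLO from the SLA example: "99% of service requests are processed within 1s"
latency_slo = SLO(metric="response time", percentile=0.99, target=1.0)
```

A full SLA would combine several such SLOs with cost constraints, as the slide notes.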
QoS and SLA in clouds?
• Some initiatives
Amazon EC2, Rackspace, 3tera clouds
• Restricted to a single QoS criterion
Service unavailability due to computer failures
Other QoS aspects not tackled (performance, security, energy, financial costs, etc.)
• Ad-hoc and incomplete approaches
Is the SLA guaranteed/violated by the cloud?
E.g. Amazon EC2 customers must provide proof of cloud service unavailability: capture the failure, document it, and send the “proof” to Amazon within 30 days
This goes against one of the main motivations of cloud computing: “hide the complexity of resource management and provide simple access to cloud services for the customer”
Open Challenges & Perspectives
• Multicriteria SLA: dependability, performance, cost, etc.
• Towards scalable and distributed SLA control
• Consider different applications and Big Data services
Towards SLA-aware clouds
• Objective 1: Define a new cloud model, the SLAaaS (SLA-aware Service)
Orthogonal to other cloud models (IaaS, PaaS, SaaS)
A cloud presents, along with its service interface, a non-functional SLA interface
Allows a customer to compare different cloud service providers with regard to the provided SLOs
• Objective 2: Autonomic reconfiguration of cloud services
From cloud provider perspective
Multi-objective SLA between cloud provider and cloud customer
Fully elastic cloud via dynamic resource re-allocation and reconfiguration
Handle cloud dynamics and workload variations
• Objective 3: SLA governance in the cloud
From cloud customer point of view
Automatically notify the customer about SLA violation, energy footprint, etc.
• Objective 4: Big Data services, Benchmarking tools
Stress/evaluate dependability and scalability of Big Data cloud services, real workloads
AMADEOS project: http://amadeos.imag.fr/
MyCloud project: http://mycloud.inrialpes.fr/
Towards SLA-aware clouds
• Objective 1: Define a new cloud model, the SLAaaS
• Objective 2: Autonomic reconfiguration of cloud services
Challenges
Towards a control-theoretic approach [ACM OSR 2013]
ConSer: Control of server systems [IEEE Trans. Comp. 2011]
MoKa: Control of multi-tier distributed web systems [IGI Global, 2011]
MoMap: Control of MapReduce systems [ACM CCGrid 2013]
• Objective 3: SLA governance in the cloud
• Objective 4: Big Data services, benchmarking tools
MRBS: Benchmarking framework for Hadoop MapReduce [IEEE SRDS 2012]
Challenges in autonomic reconfiguration of cloud services
1) Complex SLOs
Multiple service level objectives (SLOs): performance, availability, dependability, security, etc.
Trade-off between antagonistic SLOs, e.g. “at least 99% of client requests are admitted and processed within 1s, with a minimal financial cost”
2) From SLOs to resource allocation/configuration
Nontrivial mapping from SLOs to resource allocations
Example SLOs for a cloud application: availability level (99% of requests are processed), performance level (requests processed within 1s), cost constraint (minimal cost)
Example resources on the Amazon EC2 cloud: small instance $0.085 per hour, large instance $0.34 per hour, extra large instance $0.68 per hour; cost = number of instances × unitary price
3) Time-varying and nonlinear behavior
Workload amount (# concurrent client requests) varies over time, e.g. the workload of the soccer World Cup web site [Arlitt et al., HP 99]
A control-theoretic approach
Feedback control loop: the Controller sets the control knobs of the Target system so that the measured service levels and service costs meet the SLOs, despite exogenous inputs.
Control knobs (i.e. resource allocations): cache size, server admission control, server provisioning, content quality level, …
Exogenous inputs: workload amount, workload mix
System outputs: cache hit ratio, service QoS, resource utilization, service differentiation ratio, …
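The loop can be sketched as follows. This is a toy illustration only: the integral-style update, the gain, and the linear latency model are my assumptions, not the controllers developed in this work.

```python
# A toy feedback control loop: the controller reads a measured service level
# and moves one control knob toward the SLO. The update rule and the linear
# latency model below are illustrative assumptions.
def feedback_loop(target, measure, knob=10.0, gain=50.0, steps=30):
    """Iteratively adjust the knob (e.g. an admission limit) so that the
    measured output converges to the target service level."""
    for _ in range(steps):
        error = target - measure(knob)       # SLO minus measured service level
        knob = max(1.0, knob + gain * error) # reconfigure the target system
    return knob

# Toy target system: latency grows linearly with the admitted load.
measured_latency = lambda k: 0.01 * k
knob = feedback_loop(target=1.0, measure=measured_latency)  # settles near 100
```

Real server systems are nonlinear and time-varying, which is exactly why the following slides replace this naive linear picture with fluid models.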
Followed approach
(1) Utility: State the objective and capture the trade-off
Multicriteria utility function
(2) Model: Describe the system behavior
Relationship between allocated resources and service levels and costs
System model: inputs are exogenous variables and resource allocations; outputs are predicted service levels and service costs
(3) Control: Solve the system
Calculate the (optimal) resource allocation that maximizes the utility function, based on the model
(4) Implement the solution
Translate the theoretical optimal solution into a concrete implementation
Not trivial: automatically (re)determine the model’s parameters
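As a concrete (and purely illustrative) instance of step (1), a multicriteria utility might combine a latency SLO, the abandon rate, and cost; the form and weights below are assumptions, not the utility functions defined in the cited work.

```python
# Illustrative multicriteria utility function (form and weights are
# assumptions): reward meeting the latency SLO, penalize abandoned
# requests and resource cost.
def utility(latency, abandon_rate, cost, l_max=1.0, w_cost=0.1):
    """Higher is better; captures the trade-off between SLOs and cost."""
    meets_slo = 1.0 if latency <= l_max else 0.0
    return meets_slo - abandon_rate - w_cost * cost
```

The controller in step (3) would then pick the resource allocation whose model-predicted outputs maximize this function.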
Towards SLA-aware clouds
• Objective 1: Define a new cloud model, the SLAaaS
• Objective 2: Autonomic reconfiguration of cloud services
Challenges
Towards a control-theoretic approach [ACM OSR 2013]
ConSer: Control of server systems [IEEE Trans. Comp. 2011], PhD L. Malrait
MoKa: Control of multi-tier distributed web systems [IGI Global, 2011], PhD J. Arnaud
MoMap: Control of MapReduce systems [ACM CCGrid 2013], PhD M. Berekmeri
• Objective 3: SLA governance in the cloud
• Objective 4: Big Data services, Benchmarking tools
MRBS: Benchmarking framework for Hadoop MapReduce [IEEE SRDS 2012], PhD A. Sangroya
Control of server systems
Server admission control: at most MPL clients are served concurrently, further clients are rejected
Prevents server thrashing and denial of service
The Multi-Programming Level (MPL) is a classical configuration parameter in server systems, e.g. the Apache web server’s MaxClients and the MySQL database server’s max_connections
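MPL-based admission control can be sketched as follows (an illustrative toy, not actual server code; the class and method names are mine):

```python
# Minimal MPL-style admission controller: at most `mpl` clients are served
# concurrently, like Apache's MaxClients or MySQL's max_connections.
class AdmissionController:
    def __init__(self, mpl: int):
        self.mpl = mpl     # Multi-Programming Level
        self.active = 0    # clients currently being served

    def try_admit(self) -> bool:
        """Admit a client if the server is below its MPL, else reject it."""
        if self.active < self.mpl:
            self.active += 1
            return True
        return False       # rejection contributes to the abandon rate

    def release(self):
        """A served client leaves the server, freeing a slot."""
        self.active -= 1
```

The hard question, addressed next, is how to choose `mpl`: too low and clients are needlessly rejected, too high and the server thrashes.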
Trade-off between server performance and availability
Experiments conducted with the PostgreSQL database server running the TPC-C benchmark
Performance (client request latency) vs. availability (client request abandon rate)
How to configure the server’s MPL to trade off performance and availability?
Related work – Server admission control
• Ad-hoc techniques, heuristics [Menascé et al., EC’01]
Best-effort behavior
• Linear models [Diao et al., NOMS’02] [Parekh et al., RTS’02]
Cannot render the whole nonlinear behavior of server systems
• Nonlinear models based on queueing theory [Robertsson et al., CDC’04] [Tipper et al., JSAC’90] [Wang et al., INFOCOM’96]
Multiple model parameters, hard to calibrate
Do not tackle full dynamics (workload types)
Restricted to a single QoS aspect, SLO
ConSer*: Control of server systems
(1) Utility: State the objective and capture the trade-off
AM-C (availability-maximizing objective):
(P1) the average client request latency does not exceed a given Lmax
(P2) and the abandon rate is made as small as possible
Other objectives: PM-C, PA-AM-C, AA-PM-C
Feedback loop: given the workload amount N and mix M, the AM-C controller sets the server’s MPL from the measured latency L and abandon rate α, with the SLOs L ≤ Lmax and α minimized.
* L. Malrait, S. Bouchenak, N. Marchand. Experience with ConSer: A System for Server Control Through Fluid Modeling. IEEE Transactions on Computers, 60(7), 2011. In collaboration with the NeCS INRIA research group on control theory.
ConSer modeling
(2) Nonlinear fluid modeling
Server model:
Exogenous inputs: workload amount N, workload mix M, incoming throughput Ti
Control input: MPL
State variables and outputs: request latency L, request abandon rate α, throughput of processed requests, number of admitted requests
ConSer control
(3) Control the server’s MPL
• AM-C (availability-maximizing control)
(P1) the average client request latency does not exceed a given Lmax
(P2) and the abandon rate is made as small as possible
– If L > Lmax: ¬(P1), so the MPL is set to a decreased value of Ne
– If L < Lmax: (P1) holds, and possibly ¬(P2), so the MPL is set to an increased value of Ne (with gain γ > 0)
• Efficient control: O(1)
ConSer AM-C control evaluation
Experiments conducted with the PostgreSQL database server running the TPC-C benchmark, AM-C control law, Lmax = 8s
Performance improved by up to 30%
Towards SLA-aware clouds
• Objective 4: Big Data services, Benchmarking tools
MRBS: Benchmarking framework for Hadoop MapReduce [IEEE SRDS 2012], PhD A. Sangroya
Big Data Systems – MapReduce
• MapReduce, for Big Data applications: a popular programming model and a runtime environment on clusters of commodity computers
• Automatic data partitioning, data replication, task scheduling, fault tolerance
• A wide range of applications: log analysis, data mining, web search engines, scientific computing, business intelligence, etc.
• Big companies use it: Amazon, eBay, Facebook, LinkedIn, Twitter, Yahoo!, etc.
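The programming model reduces to two user-supplied functions. This plain-Python word count illustrates the idea only; in Hadoop the runtime distributes map tasks across the cluster, shuffles pairs by key, replicates data, and retries failed tasks.

```python
# Minimal word-count in the MapReduce style (single-process illustration;
# the distribution, shuffle, and fault tolerance are what Hadoop provides).
from itertools import groupby
from operator import itemgetter

def map_fn(line):
    return [(word, 1) for word in line.split()]   # map: emit (key, value) pairs

def reduce_fn(word, counts):
    return (word, sum(counts))                    # reduce: aggregate per key

def map_reduce(lines):
    pairs = sorted(p for line in lines for p in map_fn(line))         # map + shuffle
    return dict(reduce_fn(key, [v for _, v in group])
                for key, group in groupby(pairs, key=itemgetter(0)))  # reduce

print(map_reduce(["a b a", "b a"]))  # → {'a': 3, 'b': 2}
```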
Motivation
• Lots of work to improve MapReduce dependability and performance
New fault-tolerance models [Costa, CloudCom 11]
Replication and partitioning policies [Ananthanarayanan, EuroSys 11] [Eltabakh, VLDB 11]
Scheduling policies [Zaharia, OSDI 08] [Isard, SOSP 09] [Zaharia, EuroSys 10]
Cost-based optimization [Herodotou, VLDB 11]
Resource provisioning [Verma, Middleware 11]
• Most evaluations use micro-benchmarks
Not representative of full distributed, concurrent applications
Not representative of realistic workloads
No dependability benchmarking
MRBS* objectives
• Empirical evaluation of the dependability and performance of MapReduce: fault tolerance, scalability
• Variety of application domains, workloads and dataloads: compute-oriented vs. data-oriented applications, batch vs. real-time applications
• Variety of Big Data workloads and faultloads: various workloads and dataloads, different fault models, different fault rates
• Portable and easy to use on a wide range of clouds, different cloud infrastructures
* A. Sangroya, D. Serrano, S. Bouchenak. Benchmarking Dependability of MapReduce Systems. The 31st IEEE Int. Symp. on Reliable Distributed Systems (SRDS 2012), Irvine, CA, Oct. 2012.
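The combination of workload, faultload, and dependability metric can be sketched as a toy benchmark run. This is purely illustrative (the function, its parameters, and the metric definition are my assumptions, not MRBS’s actual interfaces):

```python
# Toy dependability benchmark with an injected faultload: run synthetic jobs,
# fail each with a given probability, and report a reliability metric.
import random

def run_benchmark(jobs: int, fault_rate: float, seed: int = 42) -> float:
    """Reliability = ratio of jobs that complete successfully when each job
    fails independently with probability `fault_rate`."""
    rng = random.Random(seed)  # fixed seed keeps the run reproducible
    successes = sum(1 for _ in range(jobs) if rng.random() >= fault_rate)
    return successes / jobs

reliability = run_benchmark(jobs=1000, fault_rate=0.05)
```

A real MapReduce dependability benchmark injects faults into a running cluster (e.g. node crashes or task failures) rather than simulating them, and measures throughput and response time alongside reliability.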
MRBS characteristics
Use-case: Comparing two MapReduce frameworks w.r.t. performance & dependability
How does Hadoop 1.0 compare to Hadoop 0.20 w.r.t. performance?
Experiments conducted on a ten-node Hadoop cluster
Response time with Hadoop 1.0: up to +40%
Throughput with Hadoop 1.0: up to -42%
Use-case: Comparing two MapReduce frameworks w.r.t. performance & dependability
How does Hadoop 1.0 compare to Hadoop 0.20 w.r.t. dependability?
Experiments conducted on a ten-node Hadoop cluster
Fewer failed jobs with Hadoop 1.0
Use-case: Comparing two MapReduce frameworks w.r.t. performance & dependability
How does Hadoop 1.0 compare to Hadoop 0.20 w.r.t. dependability?
Experiments conducted on a ten-node Hadoop cluster
Fewer I/O failures with Hadoop 1.0
Conclusion & Perspectives
• Multicriteria SLA by design
• Different applications