Towards SLA-oriented
Cloud Computing
Sara Bouchenak
INSA Lyon, Sara.Bouchenak@insa-lyon.fr
3rd Franco-American Workshop on CyberSecurity, December 8-10, 2014, Lyon, France
Cloud computing
The cloud as a pool of shared hardware and software resources:
cpu, memory, disk; operating systems; virtual machines; middleware layers (e.g. application servers); …
The cloud as a means for distributed applications to:
pick up required resources
access “infinite” remote resources
access on-demand resources (pay-as-you-go)
transparent resource management
December 10, 2014 2
© S. Bouchenak
QoS, SLOs, SLA
• QoS
Many quality-of-service criteria
Performance (e.g. service response time, service throughput); Dependability; Availability (e.g. service abandon rate); Reliability; Security; etc.
• Costs
Energy costs, financial costs
• SLOs
Service Level Objective for a given QoS criterion/metric
Examples:
a target service level, e.g. a minimum service throughput, a maximum service abandon rate
a service level interval
service level maximization/minimization
• SLA (Service Level Agreement)
A contract between the service provider and the service customer
Ideally, a combination of a set of SLOs and cost constraints
Example: “99% of service requests are processed within 1s with a minimal energetic cost”
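The SLA example above can be sketched in code. This is an illustrative representation (the class and function names are mine, not from the talk): an SLO pairs a metric with a target level and the fraction of requests that must meet it.

```python
# Illustrative sketch of an SLO and a compliance check (names are assumptions,
# not from the talk or any SLA standard).
from dataclasses import dataclass

@dataclass
class SLO:
    metric: str        # e.g. "response time"
    percentile: float  # fraction of requests that must meet the target
    target: float      # e.g. 1.0 second

def slo_satisfied(slo: SLO, samples: list[float]) -> bool:
    """True if at least `percentile` of the measured samples meet the target."""
    if not samples:
        return True
    ok = sum(1 for s in samples if s <= slo.target)
    return ok / len(samples) >= slo.percentile

# SLO from the SLA example: "99% of service requests are processed within 1s"
latency_slo = SLO(metric="response time", percentile=0.99, target=1.0)
```

A full SLA would combine several such SLOs with cost constraints, as the slide notes.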
QoS and SLA in clouds?
• Some initiatives
Amazon EC2, Rackspace, 3tera clouds
• Restricted to a single QoS criterion
Service unavailability due to computer failures
Other QoS aspects not tackled (performance, security, energy, financial costs, etc.)
• Ad-hoc and incomplete approaches
Is the SLA guaranteed/violated by the cloud?
E.g. Amazon EC2 customers must provide proof of cloud service unavailability: capture the failure, document it, and send the “proof” to Amazon within 30 days
This goes against one of the main motivations of cloud computing: “hide the complexity of resource management and provide simple access to cloud services for the customer”
Open Challenges & Perspectives
• Multicriteria SLA: dependability, performance, cost, etc.
• Towards scalable and distributed SLA control
• Consider different applications and Big Data services
Towards SLA-aware clouds
• Objective 1: Define a new cloud model, the SLAaaS (SLA-aware Service)
Orthogonal to other cloud models (IaaS, PaaS, SaaS)
A cloud presents, along with its service interface, a non-functional SLA interface
Allows a customer to compare different cloud service providers with regard to the provided SLOs
• Objective 2: Autonomic reconfiguration of cloud services
From cloud provider perspective
Multi-objective SLA between cloud provider and cloud customer
Fully elastic cloud via dynamic resource re-allocation and reconfiguration
Handle cloud dynamics and workload variations
• Objective 3: SLA governance in the cloud
From cloud customer point of view
Automatically notify the customer about SLA violation, energy footprint, etc.
• Objective 4: Big Data services, Benchmarking tools
Stress/evaluate dependability and scalability of Big Data cloud services, real workloads
AMADEOS project: http://amadeos.imag.fr/
MyCloud project: http://mycloud.inrialpes.fr/
Towards SLA-aware clouds
• Objective 1: Define a new cloud model, the SLAaaS
• Objective 2: Autonomic reconfiguration of cloud services
Challenges
Towards a control-theoretic approach [ACM OSR 2013]
ConSer: Control of server systems [IEEE Trans. Comp. 2011]
MoKa: Control of multi-tier distributed web systems [IGI Global, 2011]
MoMap: Control of MapReduce systems [ACM CCGrid 2013]
• Objective 3: SLA governance in the cloud
• Objective 4: Big Data services, benchmarking tools
MRBS: Benchmarking framework for Hadoop MapReduce [IEEE SRDS 2012]
Challenges in autonomic reconfiguration of cloud services
1) Complex SLOs
Multiple service level objectives (SLOs): performance, availability, dependability, security, etc.
Trade-off between antagonistic SLOs, e.g. “at least 99% of client requests are admitted and processed within 1s, with a minimal financial cost”
2) From SLOs to resource allocation/configuration
Nontrivial mapping from SLOs to resource allocations
Example SLOs for a cloud application: availability level (99% of requests are processed), performance level (requests processed within 1s), cost constraint (minimal cost)
Example resources on the Amazon EC2 cloud: small instance $0.085 per hour, large instance $0.34 per hour, extra large instance $0.68 per hour; cost = number of instances × unitary price
3) Time-varying and nonlinear behavior
Workload amount (# concurrent client requests) varies over time, e.g. the workload of the soccer World Cup web site [Arlitt et al., HP 99]
A control-theoretic approach
Feedback control loop: the Controller sets the control knobs of the Target system so that the measured service levels and service costs meet the SLOs, despite exogenous inputs.
Control knobs (i.e. resource allocations): cache size, server admission control, server provisioning, content quality level, …
Exogenous inputs: workload amount, workload mix
System outputs: cache hit ratio, service QoS, resource utilization, service differentiation ratio, …
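The loop can be sketched as follows. This is a toy illustration only: the integral-style update, the gain, and the linear latency model are my assumptions, not the controllers developed in this work.

```python
# A toy feedback control loop: the controller reads a measured service level
# and moves one control knob toward the SLO. The update rule and the linear
# latency model below are illustrative assumptions.
def feedback_loop(target, measure, knob=10.0, gain=50.0, steps=30):
    """Iteratively adjust the knob (e.g. an admission limit) so that the
    measured output converges to the target service level."""
    for _ in range(steps):
        error = target - measure(knob)       # SLO minus measured service level
        knob = max(1.0, knob + gain * error) # reconfigure the target system
    return knob

# Toy target system: latency grows linearly with the admitted load.
measured_latency = lambda k: 0.01 * k
knob = feedback_loop(target=1.0, measure=measured_latency)  # settles near 100
```

Real server systems are nonlinear and time-varying, which is exactly why the following slides replace this naive linear picture with fluid models.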
Followed approach
(1) Utility: State the objective and capture the trade-off
Multicriteria utility function
(2) Model: Describe the system behavior
Relationship between allocated resources and service levels and costs
System model: inputs are exogenous variables and resource allocations; outputs are predicted service levels and service costs
(3) Control: Solve the system
Calculate the (optimal) resource allocation that maximizes the utility function, based on the model
(4) Implement the solution
Translate the theoretical optimal solution into a concrete implementation
Not trivial: automatically (re)determine the model’s parameters
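As a concrete (and purely illustrative) instance of step (1), a multicriteria utility might combine a latency SLO, the abandon rate, and cost; the form and weights below are assumptions, not the utility functions defined in the cited work.

```python
# Illustrative multicriteria utility function (form and weights are
# assumptions): reward meeting the latency SLO, penalize abandoned
# requests and resource cost.
def utility(latency, abandon_rate, cost, l_max=1.0, w_cost=0.1):
    """Higher is better; captures the trade-off between SLOs and cost."""
    meets_slo = 1.0 if latency <= l_max else 0.0
    return meets_slo - abandon_rate - w_cost * cost
```

The controller in step (3) would then pick the resource allocation whose model-predicted outputs maximize this function.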
Towards SLA-aware clouds
• Objective 1: Define a new cloud model, the SLAaaS
• Objective 2: Autonomic reconfiguration of cloud services
Challenges
Towards a control-theoretic approach [ACM OSR 2013]
ConSer: Control of server systems [IEEE Trans. Comp. 2011], PhD L. Malrait
MoKa: Control of multi-tier distributed web systems [IGI Global, 2011], PhD J. Arnaud
MoMap: Control of MapReduce systems [ACM CCGrid 2013], PhD M. Berekmeri
• Objective 3: SLA governance in the cloud
• Objective 4: Big Data services, Benchmarking tools
MRBS: Benchmarking framework for Hadoop MapReduce [IEEE SRDS 2012], PhD A. Sangroya
Control of server systems
Server admission control: at most MPL clients are served concurrently, further clients are rejected
Prevents server thrashing and denial of service
The Multi-Programming Level (MPL) is a classical configuration parameter in server systems, e.g. the Apache web server’s MaxClients and the MySQL database server’s max_connections
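MPL-based admission control can be sketched as follows (an illustrative toy, not actual server code; the class and method names are mine):

```python
# Minimal MPL-style admission controller: at most `mpl` clients are served
# concurrently, like Apache's MaxClients or MySQL's max_connections.
class AdmissionController:
    def __init__(self, mpl: int):
        self.mpl = mpl     # Multi-Programming Level
        self.active = 0    # clients currently being served

    def try_admit(self) -> bool:
        """Admit a client if the server is below its MPL, else reject it."""
        if self.active < self.mpl:
            self.active += 1
            return True
        return False       # rejection contributes to the abandon rate

    def release(self):
        """A served client leaves the server, freeing a slot."""
        self.active -= 1
```

The hard question, addressed next, is how to choose `mpl`: too low and clients are needlessly rejected, too high and the server thrashes.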
Trade-off between server performance and availability
Experiments conducted with the PostgreSQL database server running the TPC-C benchmark
Performance (client request latency) vs. availability (client request abandon rate)
How to configure the server’s MPL to trade off performance and availability?
Related work – Server admission control
• Ad-hoc techniques, heuristics [Menascé et al., EC’01]
Best-effort behavior
• Linear models [Diao et al., NOMS’02] [Parekh et al., RTS’02]
Cannot render the whole nonlinear behavior of server systems
• Nonlinear models based on queueing theory [Robertsson et al., CDC’04] [Tipper et al., JSAC’90] [Wang et al., INFOCOM’96]
Multiple model parameters, hard to calibrate
Do not tackle full dynamics (workload types)
Restricted to a single QoS aspect, SLO
ConSer*: Control of server systems
(1) Utility: State the objective and capture the trade-off
AM-C (availability-maximizing objective):
(P1) the average client request latency does not exceed a given Lmax
(P2) and the abandon rate is made as small as possible
Other objectives: PM-C, PA-AM-C, AA-PM-C
Feedback loop: given the workload amount N and mix M, the AM-C controller sets the server’s MPL from the measured latency L and abandon rate α, with the SLOs L ≤ Lmax and α minimized.
* L. Malrait, S. Bouchenak, N. Marchand. Experience with ConSer: A System for Server Control Through Fluid Modeling. IEEE Transactions on Computers, 60(7), 2011. In collaboration with the NeCS INRIA research group on control theory.
ConSer modeling
(2) Nonlinear fluid modeling
Server model:
Exogenous inputs: workload amount N, workload mix M, incoming throughput Ti
Control input: MPL
State variables and outputs: request latency L, request abandon rate α, throughput of processed requests, number of admitted requests
ConSer control
(3) Control the server’s MPL
• AM-C (availability-maximizing control)
(P1) the average client request latency does not exceed a given Lmax
(P2) and the abandon rate is made as small as possible
– If L > Lmax: ¬(P1), so the MPL is set to a decreased value of Ne
– If L < Lmax: (P1) holds, and possibly ¬(P2), so the MPL is set to an increased value of Ne (with gain γ > 0)
• Efficient control: O(1)
ConSer AM-C control evaluation
Experiments conducted with the PostgreSQL database server running the TPC-C benchmark, AM-C control law, Lmax = 8s
Performance improved by up to 30%
Towards SLA-aware clouds
• Objective 4: Big Data services, Benchmarking tools
MRBS: Benchmarking framework for Hadoop MapReduce [IEEE SRDS 2012], PhD A. Sangroya
Big Data Systems – MapReduce
• MapReduce, for Big Data applications: a popular programming model and a runtime environment on clusters of commodity computers
• Automatic data partitioning, data replication, task scheduling, fault tolerance
• A wide range of applications: log analysis, data mining, web search engines, scientific computing, business intelligence, etc.
• Big companies use it: Amazon, eBay, Facebook, LinkedIn, Twitter, Yahoo!, etc.
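The programming model reduces to two user-supplied functions. This plain-Python word count illustrates the idea only; in Hadoop the runtime distributes map tasks across the cluster, shuffles pairs by key, replicates data, and retries failed tasks.

```python
# Minimal word-count in the MapReduce style (single-process illustration;
# the distribution, shuffle, and fault tolerance are what Hadoop provides).
from itertools import groupby
from operator import itemgetter

def map_fn(line):
    return [(word, 1) for word in line.split()]   # map: emit (key, value) pairs

def reduce_fn(word, counts):
    return (word, sum(counts))                    # reduce: aggregate per key

def map_reduce(lines):
    pairs = sorted(p for line in lines for p in map_fn(line))         # map + shuffle
    return dict(reduce_fn(key, [v for _, v in group])
                for key, group in groupby(pairs, key=itemgetter(0)))  # reduce

print(map_reduce(["a b a", "b a"]))  # → {'a': 3, 'b': 2}
```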
Motivation
• Lots of work to improve MapReduce dependability and performance
New fault-tolerance models [Costa, CloudCom 11]
Replication and partitioning policies [Ananthanarayanan, EuroSys 11] [Eltabakh, VLDB 11]
Scheduling policies [Zaharia, OSDI 08] [Isard, SOSP 09] [Zaharia, EuroSys 10]
Cost-based optimization [Herodotou, VLDB 11]
Resource provisioning [Verma, Middleware 11]
• Most evaluations use micro-benchmarks
Not representative of full distributed, concurrent applications
Not representative of realistic workloads
No dependability benchmarking
MRBS* objectives
• Empirical evaluation of the dependability and performance of MapReduce: fault tolerance, scalability
• Variety of application domains, workloads and dataloads: compute-oriented vs. data-oriented applications, batch vs. real-time applications
• Variety of Big Data workloads and faultloads: various workloads and dataloads, different fault models, different fault rates
• Portable and easy to use on a wide range of clouds, different cloud infrastructures
* A. Sangroya, D. Serrano, S. Bouchenak. Benchmarking Dependability of MapReduce Systems. The 31st IEEE Int. Symp. on Reliable Distributed Systems (SRDS 2012), Irvine, CA, Oct. 2012.
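The combination of workload, faultload, and dependability metric can be sketched as a toy benchmark run. This is purely illustrative (the function, its parameters, and the metric definition are my assumptions, not MRBS’s actual interfaces):

```python
# Toy dependability benchmark with an injected faultload: run synthetic jobs,
# fail each with a given probability, and report a reliability metric.
import random

def run_benchmark(jobs: int, fault_rate: float, seed: int = 42) -> float:
    """Reliability = ratio of jobs that complete successfully when each job
    fails independently with probability `fault_rate`."""
    rng = random.Random(seed)  # fixed seed keeps the run reproducible
    successes = sum(1 for _ in range(jobs) if rng.random() >= fault_rate)
    return successes / jobs

reliability = run_benchmark(jobs=1000, fault_rate=0.05)
```

A real MapReduce dependability benchmark injects faults into a running cluster (e.g. node crashes or task failures) rather than simulating them, and measures throughput and response time alongside reliability.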
MRBS characteristics
Use-case: Comparing two MapReduce frameworks w.r.t. performance & dependability
How does Hadoop 1.0 compare to Hadoop 0.20 w.r.t. performance?
Experiments conducted on a ten-node Hadoop cluster
Response time with Hadoop 1.0: up to +40%
Throughput with Hadoop 1.0: up to -42%
Use-case: Comparing two MapReduce frameworks w.r.t. performance & dependability
How does Hadoop 1.0 compare to Hadoop 0.20 w.r.t. dependability?
Experiments conducted on a ten-node Hadoop cluster
Fewer failed jobs with Hadoop 1.0
Use-case: Comparing two MapReduce frameworks w.r.t. performance & dependability
How does Hadoop 1.0 compare to Hadoop 0.20 w.r.t. dependability?
Experiments conducted on a ten-node Hadoop cluster
Fewer I/O failures with Hadoop 1.0
Conclusion & Perspectives
• Multicriteria SLA by design
• Different applications