Cloud Application Development #2
Johan Tordsson
Department of Computing Science
Last time
• Introduction to Science
– Computer science in particular
• Experimental computer science in particular
• Methods in experimental computer science
– I: Performance evaluation
• Benchmarking
– II: Hypothesis testing
• Experiment design
– And also a bit about statistics
Today
1. Common (Web) application architectures
– N-tier applications
• Load Balancers
• Application Servers
• Databases
2. Cloud application guidelines
– Scalability
– Fault-tolerance
– Some best practices
– A few Amazon examples
System Architecture
• The architecture of a computer system is the high-level (most general) design on which the system is based
• Architectural features include:
– Components
– Collaborations (how components interact)
– Connectors (how components communicate)
• Common examples
– Client-Server
– Layered
– Peer-to-peer
– Etc.
Client-server
• Roles
– Client: a component that makes requests
clients are active initiators of transactions
– Server: a component that satisfies requests
servers are passive and react to client requests
• The client-server architecture can be thought of as a middle ground between
– Centralized processing: computation is performed on a central platform, which is accessed using “dumb” terminals
– Distributed processing: computation is performed on platforms located with the user
[Figure: Centralized, Client / Server, Distributed]
Tiered Web Architectures
• Web applications are usually implemented with 2-tier, 3-tier, or multitier (N-tier) architectures
• Each tier is a platform (client or server) with a unique responsibility
1-Tier server Architecture
• (Tier 0): Client platform, hosting a web browser
• Tier 1: server platform, hosting all server software components
1-Tier Characteristics
• Advantage:
– Inexpensive (single platform)
• Disadvantages
– Interdependency (coupling) of components
– No redundancy
– Limited scalability
• Typical application
– 10-100 users
– Small company or organization, e.g., law office, medical practice, local non-profit
2-Tier C-S Architecture
• Tier 2 takes over part of the server function from Tier 1, typically data management
2-Tier Characteristics
• Advantages
– Improved performance, from specialized hardware
– Decreased coupling of software components
– Improved scalability
• Disadvantages
– No redundancy
• Typical Application
– 100-1000 users
– Small business or regional organization, e.g., specialty retailer, small college
Multitier C-S Architecture
• A multitier (N-tier) architecture is an expansion of the 2-tier architecture, in one of several different possible ways
– Replication of the function of a tier
– Specialization of function within a tier
Replication
• Application and data servers are replicated
• Servers share the total workload
Specialization
• Servers are specialized
• Each server handles a designated part of the workload, by function
Multi-Tier Characteristics
• Advantages
– Decoupling of software components
– Flexibility to add/remove platforms in response to load
– Scalability
– Redundancy
• Disadvantages
– Higher costs (maintenance, design, electrical load, cooling)
• Typical Application
– 1000+ users
– Large business or organization
Characteristics Summary
• 1-Tier: 10+ users; local business or organization
• 2-Tier: 100+ users; small e-commerce, regional business or organization
• N-Tier: 1000+ users; large e-commerce, business, or organization
• Capacity, scalability, redundancy, and cost all increase from 1-Tier to N-Tier
Inside the N-tier application
1. Load Balancers
2. Application Servers
Web Server Load Balancing
• Request enters a router
• Load balancing server determines which web server should serve the request
• Sends the request to the appropriate web server
[Figure: Internet, router, load-balancing server, and web servers, with request and response paths]
Traditional Web Cluster
How do we split up information?
[Figure: content to be placed on a server farm]
Information Strategies
• Replication
• Partition
Load Balancing Approaches
• File Distribution
– Content/Locality Aware
– Size Aware
– Workload Aware
• Routing
– DNS Server
– Centralized Router
– Distributed Dispatcher
Issues
• Efficiently processing requests with optimizations for load balancing
– Send requests to a web server that has the requested files in cache
– Send requests to the web server with the fewest requests
– Send requests to a web server determined by the size of the request
• Sessions complicate load balancing
– Need to ensure that subsequent user requests are handled by the same backend
• Stickiness
• Makes fail-over more complicated
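Two of the ideas on this slide, sending a request to the server with the fewest requests and keeping a session on the same backend, can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the behaviour of any particular load balancer; the class StickyLeastRequestsBalancer, its methods, and the backend names are hypothetical.

```python
class StickyLeastRequestsBalancer:
    """Toy balancer: fewest outstanding requests, with session stickiness."""

    def __init__(self, backends):
        self.active = {b: 0 for b in backends}  # outstanding requests per backend
        self.sessions = {}                      # session id -> pinned backend

    def route(self, session_id=None):
        backend = self.sessions.get(session_id)
        if backend is None or backend not in self.active:
            # No pinned backend (or it has failed): pick the least-loaded one.
            backend = min(self.active, key=self.active.get)
            if session_id is not None:
                self.sessions[session_id] = backend
        self.active[backend] += 1
        return backend

    def finish(self, backend):
        # Called when a backend has answered a request.
        if backend in self.active:
            self.active[backend] -= 1

    def fail(self, backend):
        # Fail-over: sessions pinned to this backend are re-pinned on their
        # next request; session state held only on that backend is lost.
        self.active.pop(backend, None)


lb = StickyLeastRequestsBalancer(["web1", "web2"])
first = lb.route("alice")    # least-loaded backend
second = lb.route("bob")     # the other backend, which is now least loaded
again = lb.route("alice")    # same backend as the first request
```

The fail method hints at why stickiness complicates fail-over: the session can be moved, but whatever state lived only on the failed backend cannot.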
FLEX
• Locality aware load-balancing strategy based on two factors:
– Accessed files, memory requirements
– Access rates (working set), load requirements
• Partitions all servers into equally balanced groups
• Each server transfers the response to the browser to reduce bottleneck through the router (TCP Handoff)
Flex Diagram
[Figure: the DNS server forwards requests to servers S1 to S6, which respond to the client browser; W(S1) ≈ W(S2) … ≈ W(S6)]
FLEX Cont.
• Advantages:
– Highly scalable
– Reduces bottleneck by the load balancer
– No software is required
– Reduces number of cache misses
• Disadvantages:
– Not dynamic, routing table must be recreated
– Only compared to Round Robin
– Responsibility of load-balancing and transferring response is given to web servers – unorganized responsibility
WARD
• Workload-Aware Request Distribution Strategy
• The server core is the set of essential files that represents the majority of expected requests
• The server core is replicated at every server
• Compute the core size from workload access patterns
– Number of nodes
– Node RAM
– TCP handoff overhead
– Disk access overhead
WARD Diagram
[Figure: requests pass through a switch to servers S1 to S6, each with its own queue]
• Each computer is a distributor and a dispatcher
WARD Cont. II
• Similar to FLEX, sends response directly to client
• Minimizes forwarding overhead from handoffs for the most frequent files
• Optimizes the overall cluster RAM usage
• “By mapping a small set of most frequent files to be served by multiple number of nodes, we can improve both locality of accesses and the cluster performance significantly”
WARD Cont. III
• Advantages:
– No decision making, core files are replicated on every server
– Minimizes transfer of requests and disk reads, both are “equally bad”
– Outperforms Round Robin
– Efficient use of RAM
– Performance gain with increased number of nodes
• Disadvantages:
– Core files are selected from the previous day’s data
• Could decrease performance by up to 15%
– Distributed dispatcher increases the number of TCP request transfers
– If core files are not selected correctly, higher cache miss rate and increased disk accesses
EquiLoad
• Chooses the server that will process a request based on the size of the requested file
• Splits the content on each server by file size, forcing the queue sizes to be consistent
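Size-based dispatch of this kind can be sketched as a lookup in a sorted list of size boundaries. This is a minimal illustration only: the boundary values, server names, and the function pick_server are hypothetical, and the boundaries are fixed here rather than recomputed from the observed workload.

```python
import bisect

# Hypothetical upper bounds (in bytes) of the size range served by each server;
# the last server takes everything larger than the final bound.
BOUNDARIES = [2_000, 3_000, 10_000, 20_000, 100_000]
SERVERS = ["S1", "S2", "S3", "S4", "S5", "S6"]


def pick_server(file_size):
    """Return the server whose size range contains file_size."""
    return SERVERS[bisect.bisect_left(BOUNDARIES, file_size)]


small = pick_server(1_500)
medium = pick_server(50_000)
large = pick_server(1_000_000)
```

Because each server only sees requests of similar size, a single very large request no longer delays a queue of small ones.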
EquiLoad Solves Queue Length Problems
[Figure: two sets of example server queues, labeled “This is bad” and “This is better”, holding requests of sizes from 1k to 1000k]
EquiLoad Diagram
[Figure: a distributor forwards requests to servers S1 to S6, labeled with the size ranges 1k-2k, 2k-3k, 3k-10k, 10k-20k, 20k-100k, >100k; servers respond to the client browser; a dispatcher periodically calculates partitions]
EquiLoad
• Advantages
– Dynamic repartitioning
– Can be implemented at various levels
• DNS
• Dispatcher
• Server
– Minimum queue buildup
– Performs well under variable workload and high system load
• Disadvantages
– Cache affinity is neglected
– Requires a front end dispatcher
– Distributor must communicate with servers
– Thresholds of parameter adjustment
EquiLoad
AdaptLoad
• AdaptLoad improves upon EquiLoad using “fuzzy boundaries”
– Allows for multiple servers to process a request
– Behaves better in situations where server partitions are very close in size
AdaptLoad Diagram
[Figure: a dispatcher forwards requests to servers S1 to S6, labeled with the overlapping size ranges 1k-3k, 2k-4k, 3k-10k, 8k-20k, 15k-100k, >80k; servers respond to the client browser; a distributor periodically calculates partitions]
Summary
• File Distribution: Content/Locality Aware, Size Aware, Workload Aware
• Routing: DNS Server, Centralized Router, Distributed Dispatcher
• FLEX: content/locality aware
• EquiLoad, AdaptLoad: size aware
• WARD: workload aware
Load balancing conclusions
• There is no “best” way to distribute content among servers
• There is no optimal policy for all website applications
• Certain strategies are geared towards a particular website application
Sample Load balancers
• Amazon Elastic Load Balancing
– Balances load across EC2 VMs
– Performs VM instance health checks
– Balancing metrics include request count and request latency
– Supports sticky sessions
• HAProxy
– Robust & high-performance load balancer for TCP/HTTP
– Multiple load balancing algorithms
– Open source
– Not specific to EC2
Application Servers - Motivation
• Key observation made by application server vendors
– Most web applications require similar features such as database access, security, scalability, etc.
– Provide these features that are fully tested in a container to be leveraged by application developers
• Similar to programming language libraries
– Allows application programmers to focus on business logic instead of writing all features from scratch
Services Provided by Most Application Servers
• Web Services
• RMI for distributed applications
• Clustering (for load balancing)
• Database integration
• System management
• Message-oriented middleware
• Security
• Dynamic redeployment
• Etc.
More examples
• Java
– Commercial
• IBM WebSphere Application Server
• BEA WebLogic
• Oracle OC4J
– Open Source
• JBoss Application Server
• Apache Geronimo
• Sun Glassfish
• Apache Tomcat (only a web container)
• Microsoft .Net
• Ruby on Rails
• There are (too) many others…
Pros and Cons
• Advantages
– Many, many features provided to the application developer
– Shorter development cycle
– Low cost of entry, especially when using open source application servers
• Disadvantages
– Less flexibility on architecture – Debugging harder
• Problem may be in app server…
– Sometimes your application does not need all features
• This could hinder performance
• Various open source tools target composability
Databases - Relational DB Characteristics
• Data stored in columns and tables
• Relationships represented by data
• Data Manipulation Language (SQL)
• Data Definition Language (SQL)
• Transactions
• Abstraction from physical layer
Data Manipulation Language
• Data manipulated with Select, Insert, Update, & Delete statements
– Select T1.Column1, T2.Column2 … From Table1, Table2 … Where T1.Column1 = T2.Column1 …
• Data Aggregation
• Compound statements
• Functions and Procedures
• Explicit transaction control
Data Definition Language
• Schema defined at the start
• Create Table (Column1 Datatype1, Column2 Datatype2, …)
• Constraints to define and enforce relationships
– Primary Key
– Foreign Key
– Etc.
• Triggers to respond to Insert, Update, and Delete
• Alter …
• Drop …
• Security and Access Control
Transaction: An Execution of a DB Program
• Key concept is transaction, which is an atomic sequence of database actions (reads/writes).
• Each transaction, executed completely, must leave the DB in a consistent state if DB is consistent when the transaction begins.
– Users can specify some simple integrity constraints on the data, and the DBMS will enforce these constraints.
– Beyond this, the DBMS does not really understand the semantics of the data.
– Thus, ensuring that a transaction (run alone) preserves consistency is ultimately the user’s responsibility!
Transactions – ACID Properties
• Atomic – All of the work in a transaction completes (commit) or none of it completes
• Consistent – A transaction transforms the database from one consistent state to another consistent state. Consistency is defined in terms of constraints.
• Isolated – The results of any changes made during a transaction are not visible until the transaction has committed.
• Durable – The results of a committed transaction survive failures
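Atomicity and constraint-based consistency can be illustrated with Python's built-in sqlite3 module. This is a small sketch under stated assumptions: the account table, its CHECK constraint, and the transfer function are invented for illustration and are not part of the lecture material.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account (name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"   # integrity constraint
)
conn.executemany("INSERT INTO account VALUES (?, ?)", [("a", 100), ("b", 0)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move amount from src to dst; either both updates commit or neither."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


ok = transfer(conn, "a", "b", 60)        # both updates are committed
failed = transfer(conn, "a", "b", 60)    # second update violates the CHECK
balances = dict(conn.execute("SELECT name, balance FROM account"))
```

In the failed transfer the credit to b has already been executed when the debit from a violates the constraint; the rollback undoes it, so the database never shows money created out of nothing.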
CAP Theorem
• Three properties of a system
– Consistency (all copies have same value)
– Availability (system can run even if parts have failed)
– Partitions (network can break into two or more parts, each with active systems that cannot talk to other parts)
• Brewer’s CAP “Theorem”: You can have at most two of these three properties for any system
• Very large systems will partition at some point
– -> Choose one of consistency or availability
– Traditional databases choose consistency
– Most Web applications choose availability
NoSQL – A Scalable Web alternative
• Stands for Not Only SQL
• Class of non-relational data storage systems
• Usually do not require a fixed table schema nor do they use the concept of joins
• All NoSQL offerings relax one or more of the ACID properties
• Not a backlash/rebellion against RDBMS
• SQL is a rich query language that cannot be rivaled by the current list of NoSQL offerings
Distributed Key-Value Data Stores
• Distributed key-value data storage systems allow key-value pairs to be stored (and retrieved on key) in a massively parallel system
– E.g. Google BigTable, Yahoo! Sherpa/PNUTS, Amazon Dynamo, ..
– Recall Ahmed’s lecture
• Partitioning, high availability, etc. completely transparent to application
• Sharding systems (partitioned DBs) and key-value stores do not support many relational features
– No join operations
– No group by, order by, etc
– No ACID transactions
– No SQL
Typical NoSQL API
• Basic API access:
– get(key)
• Extract the value given a key
– put(key, value)
• Create or update the value given its key
– delete(key)
• Remove the key and its associated value
– execute(key, operation, parameters)
• Invoke an operation on the value (given its key), which is a special data structure (e.g. List, Set, Map, etc.).
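The four calls above can be mimicked by a small in-memory stand-in. This is only a sketch of the interface shape: the class KeyValueStore is hypothetical, holds everything in one Python dict, and leaves out the distribution, replication, and persistence that real key-value stores provide.

```python
class KeyValueStore:
    """In-memory stand-in for a get/put/delete/execute style API."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        # Extract the value given a key (None if the key is absent).
        return self._data.get(key)

    def put(self, key, value):
        # Create or update the value for the key.
        self._data[key] = value

    def delete(self, key):
        # Remove the key and its associated value, if present.
        self._data.pop(key, None)

    def execute(self, key, operation, *parameters):
        # Invoke a method of the stored data structure, e.g. list.append.
        return getattr(self._data[key], operation)(*parameters)


store = KeyValueStore()
store.put("cart:42", ["book"])
store.execute("cart:42", "append", "pen")
cart = store.get("cart:42")
store.delete("cart:42")
```

Every operation is addressed by a single key, which is what makes it straightforward to spread such a store over many machines.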
Databases - Summary
• SQL Databases
– Predefined schema
– Standard definition and interface language
– Tight consistency
– Well defined semantics
• NoSQL Databases
– No predefined schema
– Per-product definition and interface language
– Getting an answer quickly is more important
2. Cloud application guidelines
Scalable N-tier applications
• Basic 3-Tier architecture
– Dedicated server for each tier
– Simple
– Non-redundant
– For testing/development
– Not suitable for production
Redundant 3-Tier architecture
• Redundancy in each tier
• Replicated DB
- May use striped volumes
• Fault tolerance
• But fixed size
• No auto-scaling
Scaling Relational DBs – Master/Slave
• Master-Slave
– All writes are written to the master
– All reads performed against the replicated slave databases
– Good for mostly read, very few update applications
– Critical reads may be incorrect as writes may not have been propagated down
– Large data sets can pose problems as master needs to duplicate data to slaves
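The stale-read problem mentioned above can be made concrete with a toy model in which replication is an explicit step. Everything here is an illustrative assumption: the class ReplicatedDB, the critical flag that routes a read to the master (one common workaround, not something the slide prescribes), and the use of plain dicts as databases.

```python
import random


class ReplicatedDB:
    """Toy master/slave setup with explicit, delayed replication."""

    def __init__(self, n_replicas):
        self.master = {}
        self.replicas = [{} for _ in range(n_replicas)]
        self.pending = []  # writes not yet propagated to the replicas

    def write(self, key, value):
        # All writes go to the master.
        self.master[key] = value
        self.pending.append((key, value))

    def read(self, key, critical=False):
        # Normal reads go to a replica; critical reads go to the master.
        source = self.master if critical else random.choice(self.replicas)
        return source.get(key)

    def replicate(self):
        # Propagate all pending writes to every replica.
        for key, value in self.pending:
            for replica in self.replicas:
                replica[key] = value
        self.pending.clear()


db = ReplicatedDB(n_replicas=2)
db.write("x", 1)
stale = db.read("x")                  # replicas have not seen the write yet
fresh = db.read("x", critical=True)   # the master has
db.replicate()
after = db.read("x")
```

Sending every critical read to the master brings back part of the load that replication was supposed to remove, so this only pays off when such reads are rare.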
Scaling Relational DBs - Partitioning
• Partitioning
– Divide the database across many machines
• E.g. hash or range partitioning
• Handled transparently by parallel databases
– These are expensive
• “Sharding”
– Divide data amongst many cheap databases (MySQL/PostgreSQL)
– Manage parallel access in the application
– Scales well for both reads and writes
– Not transparent, application needs to be partition-aware
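Hash partitioning handled in the application might look roughly like the sketch below. The class ShardedStore is hypothetical and each shard is a plain dict standing in for a separate database server.

```python
import hashlib


class ShardedStore:
    """Application-level sharding: the client decides which shard owns a key."""

    def __init__(self, n_shards):
        # Each dict stands in for one cheap database server.
        self.shards = [{} for _ in range(n_shards)]

    def _shard_for(self, key):
        # Use a stable hash so every client maps a key to the same shard.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % len(self.shards)

    def put(self, key, value):
        self.shards[self._shard_for(key)][key] = value

    def get(self, key):
        return self.shards[self._shard_for(key)].get(key)


store = ShardedStore(n_shards=4)
for user_id in range(100):
    store.put(f"user:{user_id}", {"id": user_id})
total = sum(len(shard) for shard in store.shards)
```

A stable hash is used instead of Python's built-in hash() because string hashes are randomized per process by default. Note that changing the number of shards remaps most keys, and a query spanning several keys has to be assembled by the application, which is the "not transparent" part.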
Multi-center architecture
• Similar to redundant architecture
• Uses multiple Data Centers
- Ex. EC2 availability zones
• Protects against Data Center failure
• Redundant architecture protects against server failure…
• Fixed size
Autoscaling Architecture
• Redundant & fault tolerant server tiers
• Change number of servers dynamically
- Commonly only in Application server tier
• Database tier the bottleneck…
Autoscaling Architecture 2
• For read-heavy (static) DBs
• Add read-only caches, e.g., Memcached
• Does not solve
Autoscaling Architecture 3
• Replace Relational DB with NoSQL data store
• Replicate datastore across multiple servers
• Adds Scalability
• At the expense of SQL, consistency, …
• Common in social networks, etc.
Scalable batch processing
• For compute-intense workloads
• Auto-scale application server + worker nodes
• Message queues for robustness and scaling
More about Message Queues
• Asynchronous calls between components
• Enables loose coupling
• Buffering of load spikes
• Enables fault tolerance
• Wanted: Admission control mechanisms
- Traffic engineering à la routers, drop requests, etc.
• Example: Amazon SQS
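The decoupling and buffering role of a queue can be sketched with Python's standard queue and threading modules. This is an in-process illustration only, not the Amazon SQS API; the bounded queue size, the submit function, and the squaring "job" are assumptions made for the example.

```python
import queue
import threading

jobs = queue.Queue(maxsize=100)   # bounded buffer between producer and workers
results = []
results_lock = threading.Lock()


def submit(job):
    """Admission control: reject work instead of blocking when the queue is full."""
    try:
        jobs.put_nowait(job)
        return True
    except queue.Full:
        return False


def worker():
    while True:
        job = jobs.get()
        if job is None:               # sentinel value: shut this worker down
            jobs.task_done()
            break
        with results_lock:
            results.append(job * job)  # stand-in for real processing
        jobs.task_done()


threads = [threading.Thread(target=worker) for _ in range(3)]
for t in threads:
    t.start()
accepted = [n for n in range(20) if submit(n)]
jobs.join()                           # wait until all accepted jobs are processed
for _ in threads:
    jobs.put(None)
for t in threads:
    t.join()
```

The producer never calls a worker directly, so workers can be added or removed without the producer noticing, and a burst of jobs is absorbed by the queue up to its bound.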
Auto-scaling, summary
• Auto-scaling of 3-tier architecture well understood
• Database layer is bottleneck
– NoSQL may be an alternative
• Scaling up is easy
– Scalable architectures can make use of more servers
• Scaling down is harder
– Which server instance to shut down?
– Long-running sessions make it hard to evacuate workloads
• Auto-scaling metrics
– Generic: Server CPU/memory utilization, etc.
– Specific: Number of transactions, # users, etc.
• When to scale?
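One simple answer to "when to scale?" is a threshold rule on a generic metric such as average CPU utilization. The sketch below is an assumption-laden illustration: the function desired_servers, the 30%/70% thresholds, the one-server step, and the server limits are all invented values, not recommendations from the lecture.

```python
def desired_servers(current, cpu_utilization, low=0.30, high=0.70,
                    min_servers=2, max_servers=20):
    """Return the new server count for an average CPU utilization in [0, 1]."""
    if cpu_utilization > high:
        target = current + 1      # scale up under high load
    elif cpu_utilization < low:
        target = current - 1      # scale down when mostly idle
    else:
        target = current          # inside the band: do nothing
    return max(min_servers, min(max_servers, target))


scale_up = desired_servers(4, 0.90)
scale_down = desired_servers(4, 0.10)
steady = desired_servers(4, 0.50)
```

The gap between the two thresholds keeps the system from oscillating between adding and removing a server when the load sits near a single cut-off.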
A few words on faults
• Services fail in operation
– Disks break, power outages, application logic errors, overload failures, etc.
• Well-designed services handle faults
• How often does a disk fail?
Designing robust services
• Quick service health checks
– Detect problems early
• Develop + test in full environment
– Unit tests are not enough
• Do not trust underlying systems
– Rare failures are not rare at large scale
• Isolate failures
– Avoid cross-component fault propagation
• Allow human intervention
– Emergency cases
– Through scripts, not manual processes
– Fixing problems under stress is hard…
• Keep things simple and robust
– Optimize only if factor 10 better, not for a few %
Robust services (cont.)
• Automate management and provisioning
– Deployment, updates, restarts, etc. should be simple
– Make everything configurable
• Recompiling the system during downtime is hard
• Intentionally fail the system
– Conventional shutdown does not test the fault handling
– Gremlin-style testing
– Purposely kill components/services at random
• Soft deletes only
– Mark data as removed, but keep it
– No backup can save a bad SQL delete query
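A soft delete can be as simple as a nullable timestamp column that queries filter on. The sketch below uses Python's built-in sqlite3 module; the customer table, the deleted_at column, and the helper functions are hypothetical names chosen for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, deleted_at TEXT)"
)
conn.executemany("INSERT INTO customer (name) VALUES (?)", [("Ada",), ("Bob",)])


def soft_delete(conn, customer_id):
    # Mark the row as removed instead of issuing a DELETE.
    conn.execute(
        "UPDATE customer SET deleted_at = datetime('now') WHERE id = ?",
        (customer_id,),
    )


def restore(conn, customer_id):
    # Undo a soft delete; impossible after a real DELETE.
    conn.execute(
        "UPDATE customer SET deleted_at = NULL WHERE id = ?", (customer_id,)
    )


def active_names(conn):
    rows = conn.execute(
        "SELECT name FROM customer WHERE deleted_at IS NULL ORDER BY id"
    )
    return [name for (name,) in rows]


soft_delete(conn, 1)
after_delete = active_names(conn)
restore(conn, 1)
after_restore = active_names(conn)
```

The price is that every query over live data must remember the deleted_at filter, and the table keeps growing until old rows are purged deliberately.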
Robust services (cont.)
• Independent components
– Asynchronous design: expect latencies
– Expect failures: backoffs, retries, etc.
– Fail fast: release resources unless successful
• Coupling
– Careful design to minimize cross-component coupling
– No replicated functionality across components
• Graceful degradation
– Extend beyond “working” or “failure”
– Cut non-critical load in emergency cases
– Store non-processed work in queues
• Admission control at component borders
– Do not bring more work to overloaded system
– Meter and control admission
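The "expect failures: backoffs, retries" advice is often realized as retrying with exponential backoff. The following is a minimal sketch; the function call_with_retries, its default delays, and the random jitter are assumptions for illustration rather than a prescribed policy.

```python
import random
import time


def call_with_retries(operation, max_attempts=5, base_delay=0.1, max_delay=2.0,
                      sleep=time.sleep):
    """Call operation(); on failure wait with exponential backoff and retry."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise                         # retry budget spent: give up
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, delay))   # jitter spreads out the retries


calls = []


def flaky():
    # Hypothetical operation that fails twice before succeeding.
    calls.append(1)
    if len(calls) < 3:
        raise ConnectionError("transient failure")
    return "ok"


result = call_with_retries(flaky, sleep=lambda seconds: None)
```

Growing the delay and capping the number of attempts matter as much as the retry itself: clients that retry immediately and forever add load to a component that is already struggling.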
Understand your service
• Instrument and monitor as much as possible
– Adjustable logging levels, e.g., log4j
• Data mining and visualization help in understanding your application behaviour
– Where are bottlenecks, etc.?
• Workload analysis (trace-based) valuable for stress-testing updated systems
– Will this actually work in full-scale production?
• Monitor everything, but rarely raise alarms
– Don’t cry wolf
AWS examples - Web hosting
1. Route 53 DNS
2. ELB
3. EC2 Web server
4. Auto scaling group
5. S3 for static content
6. Cloudfront for edge location
7. Availability zones
AWS content streaming
1. S3 for static content
2. Cloudfront for low latency (edge caching)
3. Alt.: Host content in EC2
4. Deliver data stream with EC2 + Cloudfront
AWS Batch Processing
1. Job manager as EC2 VM
2. Input data in S3
3. Jobs inserted to SQS input queue
4. Job processing in EC2 nodes. Auto-scaling group for elasticity + fault tolerance
5. Output in S3
6. Stats in RDS or SimpleDB
AWS High Availability
1. ELB for load balancing across Availability Zones
2. Use of >1 Availability Zones
3. Elastic IP to bind IP to EC2 VMs, can remap upon failures
4. No critical data on VM instance storage, use EBS or S3.
AWS Big Data
1. Parallel data upload to S3, or use Amazon Import to send whole storage device
2. Parallel processing in EC2, may use S3 as /scratch through FUSE
3. Results written to S3
4. Alt. Use EBS for input and/or output data sets. Tune TCP streams or use UDP
General Conclusions
• Cloud auto-scaling, fault tolerance, etc. do not happen automagically
• Many tools/services out there to help
• Careful architecture design is essential
Assignment 3 - published
• Course project
• Free form assignment
– Implement something using the cloud
• Should be suitable for the cloud
– Use at least 3 existing services
• Should be doable
– Make a list of features/requirements + prioritize
• Deadlines and Deliverables:
– Wednesday 16/5:
• Project plan: what will you do + how?
– Thursday 31/5:
• Presentation + demo
– Monday 4/6
Next time….
• Assignment 2 to be graded this week
– Will be returned this Wednesday 13-15 (in D420)
• No class this Thursday
– Public holiday (once again…)
• Next lecture Monday 21/4: