virtual techdays
INDIA
│
28-30 September 2011
Building Highly Available Services on the Windows Azure Platform
Pooja Singh │ Technical Architect, Accenture
Aakash Sharma │ Technical Lead, Accenture
Laxmikant Bhole │ Senior Architect, Accenture
Session Agenda
Assumptions
• You know the basics of
– Web/Worker roles
– SQL Azure
– Windows Azure Storage
– Windows Azure Diagnostics
Topics
• Understand Availability
• Causes for unavailability
• What you get with Azure
• What you do on your own
• Guiding Principles
Audience
• Developers & Architects community
• People who need highly available services
Takeaway
• Windows Azure inherent attributes for building highly available services
• Architectural expectations for building highly available services
How do “you” define “Availability”?
• What is acceptable downtime?
• What happens in case of failure?
– All functionality required to be available?
– Degraded functionality to be available?
– Failsafe?
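One way to make "acceptable downtime" concrete is to convert an availability target into a downtime budget. The sketch below is illustrative only: the function name and the list of targets are assumptions, not figures from the session.

```python
# Convert an availability target (in percent) into an allowed-downtime budget.
# The targets printed below are illustrative, not SLA figures from the session.

MINUTES_PER_YEAR = 365 * 24 * 60  # ignores leap years


def allowed_downtime_minutes(availability_percent: float,
                             period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Return how many minutes of downtime fit inside the availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100 percent")
    return period_minutes * (1 - availability_percent / 100)


for target in (99.0, 99.9, 99.95, 99.99):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% availability allows about {minutes:.1f} minutes of downtime per year")
```

Each extra "nine" cuts the budget by a factor of ten, which is why the acceptable downtime has to be agreed before the architecture is chosen.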
Cost of building highly available services
• Unavailability vs. High Availability
[Figure: chart relating availability to cost]
Implementation costs for a new project
Implementation cost for a startup company that offers its software as a service with a hosting company.
[Figure: implementation cost chart; the only surviving label is “Traditional”]
Causes for unavailability
• Increase in workload
– Non-scalable architecture
– Poor performance
• Platform Failures
• Upgrades
• Failure
– Hardware
– Network
– Transient conditions
Azure to the “rescue”
Azure out-of-box features
• Elasticity
– Scale up/down compute resources on-demand
• Self Service Management
– Self recovery for nodes
• Fault Domains
• Storage Resilience
– 3 copies of storage
– Geo-replication
Design for Increased Load
Is this Scalable?
[Diagram: Load Balancer, Web Role Instances 1 to 4, SQL Azure]
Is this Scalable?
[Diagram: Load Balancer, Web Role Instances 1 and 2, Worker Role Instances 1 and 2, Queue, Table storage, Blob storage, SQL Azure]
Design for Scalability
• Use loosely coupled nodes
• Design for redundancy
• Scale “OUT” everything
– Better to have fifty 1 GB databases than one 50 GB database
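Scaling out the data tier means every request must be routed to the right small database. The following is a minimal sketch of hash-based routing; the `customers_NN` naming scheme, the shard count, and the function names are hypothetical and not part of any SQL Azure API.

```python
import hashlib

SHARD_COUNT = 50  # assumption: fifty small databases, as in the bullet above


def shard_for(key: str, shard_count: int = SHARD_COUNT) -> int:
    """Map a key to a shard index using a stable hash.

    Python's built-in hash() is randomised per process, so a
    cryptographic digest is used to keep the mapping stable.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def database_name(customer_id: str) -> str:
    """Return the (hypothetical) database that holds this customer's rows."""
    return f"customers_{shard_for(customer_id):02d}"


print(database_name("customer-1001"))
print(database_name("customer-1002"))
```

Simple modulo routing moves most keys when the shard count changes, so a real design would also need a resharding plan.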
Design for Performance
• Service and data closer to user
– Same data center to avoid network latency
– CDN
– Caching
• Be mindful of the throughput and transaction thresholds
• Auto-scaling
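An auto-scaling rule can be as simple as sizing the worker pool from the queue backlog. This is a sketch under stated assumptions: the parameter names, the drain-time target, and the instance limits are all hypothetical, and acting on the result (changing the instance count) is left out.

```python
import math


def desired_instances(queue_length: int,
                      msgs_per_instance_per_min: float,
                      target_drain_minutes: float,
                      min_instances: int = 2,
                      max_instances: int = 20) -> int:
    """Return how many instances are needed to drain the backlog in time.

    min_instances defaults to 2 so that a single instance failure
    never takes the role offline.
    """
    if msgs_per_instance_per_min <= 0 or target_drain_minutes <= 0:
        raise ValueError("throughput and drain time must be positive")
    capacity_per_instance = msgs_per_instance_per_min * target_drain_minutes
    needed = math.ceil(queue_length / capacity_per_instance)
    # Clamp to the configured bounds to cap cost and keep redundancy.
    return max(min_instances, min(max_instances, needed))


print(desired_instances(queue_length=3000,
                        msgs_per_instance_per_min=100,
                        target_drain_minutes=5))
```

The upper bound matters as much as the lower one: it keeps a runaway backlog from scaling the service past storage or database throughput thresholds.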
How “CDN” works
• Contents closer to end-users
• 24 physical nodes globally
• CDN works for web apps & public blobs
[Diagram: Blob A in Azure Storage is copied to CDN Region A and CDN Region B; users in Europe and users in Asia read the copies from the CDN]
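The pull-through behaviour in the diagram can be sketched as an edge node that fetches a blob from origin storage on the first request and serves its own copy afterwards. The classes below are a toy model, not the Windows Azure CDN API; a real CDN also expires cached copies, which is omitted here.

```python
class Origin:
    """Stands in for Azure Storage holding the original blobs."""

    def __init__(self, blobs):
        self.blobs = dict(blobs)
        self.reads = 0  # counts how often the origin is actually hit

    def fetch(self, name):
        self.reads += 1
        return self.blobs[name]


class EdgeNode:
    """Stands in for a CDN node close to a group of users."""

    def __init__(self, region, origin):
        self.region = region
        self.origin = origin
        self.cache = {}

    def get(self, name):
        # First request in this region goes to the origin; later ones are local.
        if name not in self.cache:
            self.cache[name] = self.origin.fetch(name)
        return self.cache[name]


origin = Origin({"blob-a": b"public content"})
region_a = EdgeNode("Region A", origin)
region_b = EdgeNode("Region B", origin)
for node in (region_a, region_a, region_b, region_b):
    node.get("blob-a")
print("origin reads:", origin.reads)
```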
Decide Upgrade Strategies
Upgrade Strategies
• VIP Swap
• New Service and DNS swap
• Upgrade Domains
How does upgrade domain work?
[Diagram: DNS (Myservice.Cloudapp.net) and Load Balancer in front of three Myservice v1 instances and three Myservice v2 instances]
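The idea behind upgrade domains is that only one group of instances is taken offline at a time while the others keep serving. The sketch below simulates that walk; the `Instance` class, the three-domain layout, and the instance names are assumptions for illustration, not the fabric controller's actual behaviour.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    name: str
    upgrade_domain: int
    version: str = "v1"


def rolling_upgrade(instances, new_version):
    """Upgrade one upgrade domain at a time.

    Returns a list of (domain, serving_names) steps, where serving_names
    are the instances still taking traffic while that domain is offline.
    """
    steps = []
    for domain in sorted({i.upgrade_domain for i in instances}):
        serving = [i.name for i in instances if i.upgrade_domain != domain]
        for inst in instances:
            if inst.upgrade_domain == domain:
                inst.version = new_version  # this group is offline during the step
        steps.append((domain, serving))
    return steps


# Six instances spread over three hypothetical upgrade domains.
fleet = [Instance(f"Myservice_IN_{n}", upgrade_domain=n % 3) for n in range(6)]
for domain, serving in rolling_upgrade(fleet, "v2"):
    print(f"upgrading domain {domain}; still serving: {serving}")
```

During the walk the load balancer sees a mix of v1 and v2 instances, so the two versions have to tolerate each other's data and message formats.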
Handle Failure
Fault Tolerance
• Self recovery
– Can your Service fix itself?
• Transaction & Recovery
– Loosely coupled
– Transaction rollback and recovery
• Network Failures
What is “Retry” logic?
• When - Network failure or transient conditions
– Service is temporarily unavailable, e.g. SQL Azure Error 40501:
“The service is currently busy. Retry request after 10 seconds.”
• What - Retry for any external connections
– SQL Azure
– Windows Azure Storage
– Service Bus
– Any external service
• How - Use RetryPolicy class or Transient Fault Handling Framework
– NoRetry
– Retry
“Retry” Code Example
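A minimal sketch of retry logic, written in Python rather than with the .NET RetryPolicy class or the Transient Fault Handling Framework named earlier. The `TransientError` type, the `retry` helper, and its default values are hypothetical; the 10-second cap mirrors the wait suggested in the SQL Azure 40501 message.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def retry(operation, max_attempts=4, base_delay=0.5, max_delay=10.0,
          sleep=time.sleep):
    """Call operation(); on TransientError wait and try again.

    The delay doubles after each failure (exponential backoff), is capped
    at max_delay, and gets a little random jitter so that many clients
    do not retry in lockstep. Other exceptions are not retried.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: let the caller handle the failure
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(delay + random.uniform(0, delay / 10))
```

A caller would wrap each external call, for example `retry(lambda: run_query(connection))`, where `run_query` is a placeholder that raises `TransientError` when the service reports it is busy. Only idempotent operations are safe to retry blindly.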
Disaster Recovery
• Backups
– Fault Domain
– Geo-replication
• Traffic Manager
– Performance
– Round Robin
– Failover
How “Traffic Manager” works
• Policies
– Performance
• Use for geo-distributed services
– Round Robin
• Large user base
– Failover
• Small user base
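The three policies differ only in how an endpoint is picked from the healthy ones. The sketch below models that choice; the `TrafficManager` class is a toy, not the Windows Azure Traffic Manager API, and the latency figures are made up for illustration.

```python
from itertools import count


class TrafficManager:
    """Toy model of the three routing policies."""

    def __init__(self, endpoints):
        self.endpoints = list(endpoints)  # list order doubles as failover priority
        self.healthy = {e: True for e in self.endpoints}
        self._counter = count()

    def _candidates(self):
        alive = [e for e in self.endpoints if self.healthy[e]]
        if not alive:
            raise RuntimeError("no healthy endpoint available")
        return alive

    def performance(self, latency_ms):
        """Pick the healthy endpoint with the lowest latency for this user."""
        return min(self._candidates(), key=lambda e: latency_ms[e])

    def round_robin(self):
        """Spread requests evenly over the healthy endpoints."""
        alive = self._candidates()
        return alive[next(self._counter) % len(alive)]

    def failover(self):
        """Always pick the highest-priority endpoint that is still healthy."""
        return self._candidates()[0]


tm = TrafficManager(["myservice-ea.cloudapp.net",
                     "myservice-ne.cloudapp.net",
                     "myservice-ncus.cloudapp.net"])
# Hypothetical latencies as seen by one user.
latencies = {"myservice-ea.cloudapp.net": 210,
             "myservice-ne.cloudapp.net": 35,
             "myservice-ncus.cloudapp.net": 120}
print("performance:", tm.performance(latencies))
tm.healthy["myservice-ea.cloudapp.net"] = False
print("failover after outage:", tm.failover())
```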
Traffic Manager – Performance Policy
[Diagram: DNS lookups for myservice.com go to Traffic Manager, which decides which data center to connect to: East Asia DC (Myservice-ea.cloudapp.net), North Europe DC (Myservice-ne.cloudapp.net), or North Central U.S. DC (Myservice-ncus.cloudapp.net), each hosting Myservice]
Load Test, Diagnostics & Monitoring
• Load test your service
– Visual Studio 2010 Ultimate Load Tests
• Diagnostics
– Windows Azure Diagnostics
– Service Management APIs
– Storage Management APIs
– CSS SQL Azure Diagnostics
• Monitoring
– Visual Studio profiling tools
Guiding Principles
• Use loosely coupled roles
– Use of queues promotes loose coupling
• Handling fault tolerance
– Recover from fault
• Handling scalability in architecture
– Design for scalability
• Run multiple instances of each role
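The first and last principles above can be sketched together: a front end that only enqueues work, and several worker instances that drain the queue independently. Python's in-process `queue.Queue` stands in for a Windows Azure Queue here, and the role functions and the upper-casing "work" are placeholders.

```python
import queue
import threading


def web_role(work_queue, messages):
    """Front end: accepts requests and only enqueues them."""
    for message in messages:
        work_queue.put(message)


def worker_role(work_queue, results):
    """Back end: processes messages until it receives the stop signal (None)."""
    while True:
        message = work_queue.get()
        if message is None:
            break
        results.append(message.upper())  # placeholder for real processing


def run_demo(messages, worker_count=2):
    work_queue = queue.Queue()
    results = []
    workers = [threading.Thread(target=worker_role, args=(work_queue, results))
               for _ in range(worker_count)]
    for worker in workers:
        worker.start()
    web_role(work_queue, messages)
    for _ in workers:
        work_queue.put(None)  # one stop signal per worker
    for worker in workers:
        worker.join()
    return results


print(sorted(run_demo(["order-1", "order-2", "order-3"])))
```

Because the front end never calls a worker directly, workers can be added, removed, or restarted without the front end noticing.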
Guiding Principles
• Design and code for instance failure
– Build in redundancy
• Monitor everything
– Use feedback to recover fast