Experiences with Building Disaster Recovery for
Enterprise-Class Clouds
University of Illinois, ECE 542 / CS 536, Spring 2015
Hari Ramasamy, Ph.D.
Manager and Research Staff Member, IBM Research
Member, IBM Academy of Technology
Outline
What is Cloud?
What is Disaster Recovery?
Core concepts behind Enterprise-Class Cloud DR
Challenges in Enterprise-Class Cloud DR
DR Life Cycle and Use Cases
Reference Architecture
• DR Solutions for an enterprise cloud platform (IBM’s Cloud Managed Services)
Lessons Learned
What is Cloud Computing?
Essential characteristics [NIST, 2009]:
–On-demand self-service
–Broad network access of cloud services
–Resource pooling and sharing across apps/tenants
–Rapid/automated provisioning and (later) release of services
–Resource utilization tracking and Pay-as-you-go
Building blocks of Cloud Computing
–Standardization
–Virtualization
–Automation
Cloud Terminology
Cloud – The actual resources (HW, SW, Building, etc.) that enable cloud services
Cloud Service – What users can buy or request on a Cloud
Types of Clouds
Based on service models
– Infrastructure as a Service (IaaS)
– Platform as a Service (PaaS)
– Software as a Service (SaaS)
Based on ownership or deployment models
– Public Clouds
– Private Clouds
– Hybrid Clouds
Based on who manages
– Managed Clouds
• cloud provider manages IT services such as monitoring, patching, security, load balancing, and even certain applications on behalf of cloud clients
– Unmanaged Clouds
Cloud Computing As A Service
[Figure: cloud service layers]
– Infrastructure as a Service (IaaS): servers, networking, storage, data center fabric; shared, virtualized, dynamically provisioned
– Platform as a Service (PaaS): middleware, database, Web 2.0 application runtime, development tooling, high-volume transactions
– Software as a Service (SaaS): collaboration, business processes, CRM/ERP/HR, industry applications; provides an application (Google search, email, and other applications)
A Quick Comparison of Cloud Types
[Figure: stack of SaaS (Application), PaaS (Platform), and IaaS (HW + OS), with one axis labeled "Quicker to Value (Less Work)" and the other labeled "Fewer Constraints (Increasing Flexibility)"]
What is Disaster Recovery?
According to Wikipedia, Disaster Recovery (DR) is "the policies and procedures . . . for recovery or continuation . . . of vital technology infrastructure and systems . . . following a natural or human-induced disaster."
Disaster Types
Floods
Hurricanes
Volcanoes
Earthquakes
Fires
Terrorist Attacks
Hacker Attacks
Alien monsters
…….
IT Infrastructure and Systems
Servers
Storage
Network
Software
Configuration
……
Policies and Procedures for Recovery
Geographic Dispersion
Recovery Orchestration
Recovery Automation
Detailed Plans
Data copies
DR Drills
Periodic Testing
Detection
Disaster Recovery vs. High Availability
Disaster Recovery (DR): process and procedures that enable the
continuation or recovery of technology infrastructure or systems after a
natural or human-induced disaster causes an interruption
High Availability: ability of a system to continue being accessible despite
failures of system component(s)
                   Disaster Recovery (DR)             High Availability (HA)
Recovery Target    Entire technology infrastructure   Individual components or functions
Failure Type       Site-wide disasters                Failures of individual computing components
Triggering Event   Executive decision                 Failure detection or administrator action
Disaster Recovery for Enterprise-class Clients
Enterprise-class clients
– Examples: banks, financial institutions, hospitals, governments, utility companies, etc.
Many are regulation-bound to have DR coverage
DR requirements are very stringent
– Aggressive Recovery Time Objective (RTO) and Recovery Point Objective (RPO)
Most large companies spend between 2% and 4% of their IT budget on DR planning
Business impact of loss of IT infrastructure and data can be huge
– Cost of downtime could dissolve business
– Ubiquitous nature of IT on Business
– Irreparable brand damage
– Loss of customer data and reputation
Market opportunity for Business Continuity/Disaster Recovery around $32 Billion in 2015 [Source: IBM]
Disaster Recovery for Enterprise-class Clients on the Cloud
Potential Benefits to Customers (Cloud Users):
– Self-service Model
• On-demand DR protection activation
• On-demand, non-intrusive DR tests
– Resiliency made cheaper
• Pay only for workloads that need to be DR-protected
• No upfront capital expenses
– Improved agility to outages
Challenges to Cloud Providers
– More Aggressive SLAs
– Scale & Diversity
– Inter-dependencies and Coordination of Server DR and App DR
– DR of Management Capabilities
– Regulatory Requirements (e.g., location)
DR Life Cycle and Basic DR Use Cases
[Figure: DR life cycle with the stages DR Deployment, DR Steady State, DR Test, Failover, and Failback, plus a "DR declaration" label]
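The life-cycle stages above can be read as a small state machine. The Python sketch below is one possible interpretation, not the speaker's model: the transition table is an assumption inferred from the stage names (for example, that a DR declaration moves a workload from steady state to failover), and the names `DRState`, `TRANSITIONS`, and `next_state` as well as the event strings are hypothetical.

```python
from enum import Enum


class DRState(Enum):
    DEPLOYMENT = "DR Deployment"
    STEADY_STATE = "DR Steady State"
    TEST = "DR Test"
    FAILOVER = "Failover"
    FAILBACK = "Failback"


# Assumed transitions; the slides name the stages but not the exact arrows.
TRANSITIONS = {
    (DRState.DEPLOYMENT, "deployed"): DRState.STEADY_STATE,
    (DRState.STEADY_STATE, "start_test"): DRState.TEST,
    (DRState.TEST, "end_test"): DRState.STEADY_STATE,
    (DRState.STEADY_STATE, "dr_declaration"): DRState.FAILOVER,
    (DRState.FAILOVER, "primary_restored"): DRState.FAILBACK,
    (DRState.FAILBACK, "failback_complete"): DRState.STEADY_STATE,
}


def next_state(state: DRState, event: str) -> DRState:
    """Return the state reached from `state` on `event`, or raise ValueError."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(
            f"event {event!r} is not allowed in state {state.value!r}"
        ) from None
```

Keeping the transitions in a single table makes it easy to reject operations that make no sense in the current stage, such as a failback request while a DR test is still running.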
Reference Architecture for Cloud DR
At DR site
– VMs and applications/appliances may or may not exist before failover
Reference Architecture for Cloud DR: Replication
Replication Method         Recovery Time Objectives   Recovery Point Objectives   Cost
Synchronous Replication    Seconds-minutes            Seconds-minutes             $$$
Asynchronous Replication   Minutes-few days           Minutes-few days            $$
Backup-Restore             Days-weeks                 Days-weeks                  $
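The trade-off in the table above can be sketched as a lookup that returns the cheapest method able to meet a required RTO and RPO. This is an illustration only: the numeric bounds (1 second, 1 minute, 1 day) are assumed stand-ins for the lower end of each range in the table, and the names `Method`, `METHODS`, and `cheapest_method` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

MINUTE, DAY = 60, 86_400  # seconds


@dataclass(frozen=True)
class Method:
    name: str
    best_rto_s: float  # assumed best achievable RTO, in seconds
    best_rpo_s: float  # assumed best achievable RPO, in seconds
    cost_rank: int     # 1 = "$", 2 = "$$", 3 = "$$$"


# Numeric values are illustrative assumptions, not figures from the slides.
METHODS = [
    Method("Backup-Restore", DAY, DAY, 1),
    Method("Asynchronous Replication", MINUTE, MINUTE, 2),
    Method("Synchronous Replication", 1, 1, 3),
]


def cheapest_method(rto_s: float, rpo_s: float) -> Optional[Method]:
    """Return the lowest-cost method whose best case meets both objectives."""
    candidates = [
        m for m in METHODS if m.best_rto_s <= rto_s and m.best_rpo_s <= rpo_s
    ]
    return min(candidates, key=lambda m: m.cost_rank, default=None)
```

Under these assumed bounds, a requirement of a 4-hour RTO and a 15-minute RPO (`cheapest_method(4 * 3600, 15 * MINUTE)`) selects asynchronous replication, since backup-restore cannot meet it and synchronous replication costs more.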
Replication Levels
Storage-level replication
– any updates to the VM's state at the primary site's storage are mirrored to the DR site's storage
Host-level replication
– requires installation of agent in each host
– different agents for different OSes
App-level replication
– may be required for certain apps even if other options are technically possible
Replication Modes
Active-active (live/live, hot DR, warm DR)
Active-passive (cold DR, warm DR)
Reference Architecture for Cloud DR: Networking
Physical WAN network link between sites must have adequate bandwidth
Network design should
– support multiple replication streams
– support secure segregation of data streams (e.g., VPNs, VLANs)
– provide secure access channels for cloud admins and clients
– support adequate segregation between accounts within the same client
Client network environment may need to be pre-staged at the DR site
Network management capabilities may need to be pre-staged at the DR site
– switching/routing configurations
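What "adequate bandwidth" means depends on how much data changes at the primary site. The back-of-the-envelope sketch below estimates the sustained link rate needed to carry several replication streams. The per-client change rates, the 20% protocol overhead, and the names `required_mbps` and `streams` are all assumptions chosen for illustration, not figures from the talk.

```python
def required_mbps(changed_gb_per_hour: float, overhead: float = 1.2) -> float:
    """Sustained rate in megabits/s needed to replicate the given change rate.

    `overhead` is an assumed multiplier for protocol and encryption overhead.
    """
    bits_per_hour = changed_gb_per_hour * 1e9 * 8
    return bits_per_hour / 3600 / 1e6 * overhead


# Hypothetical replication streams: client name -> changed data in GB per hour.
streams = {"client-a": 45.0, "client-b": 9.0}

total_mbps = sum(required_mbps(gb) for gb in streams.values())
print(f"WAN link must sustain about {total_mbps:.0f} Mbit/s")
```

With these assumed inputs the two streams need roughly 120 and 24 Mbit/s, about 144 Mbit/s in total. Bursts above the average change rate would increase replication lag, and therefore the effective RPO, unless the link has headroom.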
© 2015 IBM Corporation
Reference Architecture for Cloud DR: Management
Management Services for Enterprise Workloads include
– virus scanning
– patching
– directory services
– monitoring
– backup/restore
– load balancing
– network security
– compliance, ……
At DR site
– VMs and applications/appliances may or may not exist before failover
Reference Architecture for Cloud DR: Control
Orchestration and Automation
– Overall coordination of steps in DR lifecycle, particularly failover
• workload recovery
• environment recovery
• management recovery
– Drive automatic steps in DR lifecycle
Administration
– self-service portal(s) that allow clients and admins to launch DR operations such as
• selecting which VMs should be replicated
• initiating DR test or failover
• viewing replication/recovery status
• defining user roles
• specifying access permissions
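The coordination role of the control layer can be sketched as a small orchestrator that runs recovery steps in sequence and records a status for each one, which a portal could then display. This is a minimal sketch, not the CMS implementation: the step order, the stop-on-first-failure policy, and the names `run_failover` and `Step` are assumptions.

```python
from typing import Callable, Dict, List, Tuple

# A step is a (name, action) pair; the action returns True on success.
Step = Tuple[str, Callable[[], bool]]


def run_failover(steps: List[Step]) -> Dict[str, str]:
    """Run steps in order, stop at the first failure, return per-step status."""
    status = {name: "pending" for name, _ in steps}
    for name, action in steps:
        status[name] = "running"
        try:
            ok = action()
        except Exception:
            ok = False  # treat an exception in a step as a failed step
        status[name] = "done" if ok else "failed"
        if not ok:
            break  # later steps stay "pending" for an operator to review
    return status


# Placeholder actions; real ones would call provisioning and network tooling.
steps = [
    ("workload recovery", lambda: True),
    ("environment recovery", lambda: True),
    ("management recovery", lambda: True),
]
print(run_failover(steps))
```

Stopping at the first failure keeps the recorded status honest about what has and has not been recovered; a production orchestrator would likely add retries, timeouts, and parallel execution of independent steps to keep the RTO low.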
IBM: Business Continuity and Resiliency Services
Broad experience; broad solution capabilities; industry-specific, globally available expertise; credibility
• More than 50 years of business continuity and disaster recovery experience
• More than 7,800 Business Continuity & Resiliency Services contracts with 5,000+ clients
• Unique insights based on the work of 30,000 industry specialists worldwide
• Global resiliency centers designed for multivendor environments, with over 200 hardware and software vendors supported, including HP, Oracle, Cisco and our own IBM products
• Business process and technology expertise to help you design and implement the right solution for your business
• 150 resiliency centers across 50 countries
• Five million square feet of floor space for disaster recovery, with 41,000 work area recovery seats
• Knowledge of local, regional and global regulations
• Over 1,800 professionals dedicated to business continuity
• Track record of 100 percent success in meeting commitments to clients who have declared a disaster
• External validation by analysts that have reported favorably on IBM’s breadth of offerings and
DR Solutions for an Enterprise-class IaaS-PaaS Managed Cloud
(IBM Cloud Managed Services)
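Three options: Cloud-to-Cloud, Cloud-to-Dedicated DR Site, Cloud-to-Repurposed Customer Site
Description
– Cloud-to-Cloud: Failover to similar cloud site
– Cloud-to-Dedicated DR Site: Failover to dedicated DR site
– Cloud-to-Repurposed Customer Site: Failover to custom site
RPO/RTO Specifications
– Cloud-to-Cloud: 15min/4 hours
– Cloud-to-Dedicated DR Site: 15min/4 hours
– Cloud-to-Repurposed Customer Site: 15min/4 hours
Regional Availability
– Cloud-to-Cloud: Another cloud site in same region
– Cloud-to-Dedicated DR Site: No cloud site but a purpose-built DR site in same region
– Cloud-to-Repurposed Customer Site: No cloud site but a customer-owned site in same region
VM Provisioning at DR site
– Cloud-to-Cloud: cloud’s VM provisioning
– Cloud-to-Dedicated DR Site: Dedicated DR site’s VM recovery mechanism
– Cloud-to-Repurposed Customer Site: VMware vCenter
Replication Type
– Cloud-to-Cloud: Storage-based, rsync, application-based
– Cloud-to-Dedicated DR Site: Host-based and application-based
– Cloud-to-Repurposed Customer Site: Storage-based
Replication Mode
– Cloud-to-Cloud: Active-active
– Cloud-to-Dedicated DR Site: Active-passive and Active-active
– Cloud-to-Repurposed Customer Site: Active-passive
Post-failover Management
– Cloud-to-Cloud: Full management
– Cloud-to-Dedicated DR Site: Limited management
– Cloud-to-Repurposed Customer Site: Limited management
Networking
– Cloud-to-Cloud: VLANs over dedicated Fiber link
– Cloud-to-Dedicated DR Site: VPN, MPLS or Point-to-Point with Layer 2 or Layer 3 routing
– Cloud-to-Repurposed Customer Site: VLANs over dedicated link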
Cloud to Cloud Disaster Recovery (IBM CMS)
[Figure: CMS Cloud A (Primary) and CMS Cloud B (Secondary), each with VMs, DR Control, and Storage. Storage-level replication, file-level, app-level, or host-level replication, and DR metadata flow between the sites. DR VMs are pre-provisioned (maybe suspended) and DR failover is automated. A second version of the figure labels the two sites Primary / Secondary and Secondary / Primary.]
• Failover site can be leveraged for other workloads (e.g. dev/test)
• 4 hour recovery time objective (RTO), 15 minute recovery point objective (RPO)
• Full CMS Management capabilities at recovery site
• IBM makes disaster declaration
• Single annual DR test included
–Option to purchase additional tests
–Individual Workload(s) can be tested
• DR services can be ordered anytime after initial onboarding
• Primary Focus on Infrastructure DR (as of 1Q 2015)
–IBM-managed SAP and Oracle Services have DR options that were defined leveraging base CMS capabilities
–Enhancements to include middleware and database services within DR scope are planned for future release
IBM CMS Cloud-to-Cloud Disaster Recovery Overview
[Figure: CMS Site-to-Site Disaster Recovery, showing Fail Over and Fail Back between CMS data centers. Sites named in the figure: Raleigh, Boulder, Lisbon, Barcelona, Ehningen, Montpellier, Portsmouth, Makuhari, Sydney, Winterthur, Toronto.]
Cloud to Dedicated DR Site (IBM CMS)
[Figure: CMS Cloud A (Primary) and a dedicated DR site in the same region (Secondary), each with VMs, DR Control, and Storage. File-level, app-level, or host-level replication and DR metadata flow between the sites.]
• Pre-provisioned DR servers for Managed Applications.
• Other VMs provisioned during failover.
Lessons Learned in Enterprise-Cloud DR
DR should cover workloads and management
Standardization vs. customizability
Data management is central to DR design
Regulations and regional requirements may trump technology in DR
Find acceptable balance between cost and risk mitigation
Automation is a must to achieve low RTOs