Experiences with Building Disaster Recovery for Enterprise-Class Clouds

(1)

Experiences with Building Disaster Recovery for Enterprise-Class Clouds

University of Illinois, ECE 542 / CS 536, Spring 2015

Hari Ramasamy, Ph.D.

Manager and Research Staff Member, IBM Research

Member, IBM Academy of Technology

[email protected]

(2)

Outline

What is Cloud?

What is Disaster Recovery?

Core concepts behind Enterprise-Class Cloud DR

Challenges in Enterprise-Class Cloud DR

DR Life Cycle and Use Cases

Reference Architecture

• DR Solutions for an enterprise cloud platform (IBM’s Cloud Managed Services)

Lessons Learned

(3)

What is Cloud Computing?

Essential characteristics [NIST, 2009]:

– On-demand self-service
– Broad network access of cloud services
– Resource pooling and sharing across apps/tenants
– Rapid/automated provisioning and (later) release of services
– Resource utilization tracking and Pay-as-you-go

Building blocks of Cloud Computing

–Standardization

–Virtualization

–Automation

(4)

Cloud Terminology

Cloud – The actual resources (HW, SW, Building, etc.) that enable cloud services

Cloud Service – What users can buy or request on a Cloud

(5)

Types of Clouds

Based on service models

– Infrastructure as a Service (IaaS)
– Platform as a Service (PaaS)
– Software as a Service (SaaS)

Based on ownership or deployment models
– Public Clouds
– Private Clouds
– Hybrid Clouds

Based on who manages
– Managed Clouds
• cloud provider manages IT services such as monitoring, patching, security, load balancing, and even certain applications on behalf of cloud clients
– Unmanaged Clouds

(6)

Cloud Computing As A Service

[Figure: service-model layers]
Software as a Service: Collaboration, Business Processes, CRM/ERP/HR, Industry Applications
Platform as a Service: Middleware, Database, Web 2.0 Application Runtime, Development Tooling, High Volume Transactions
Infrastructure as a Service: Servers, Networking, Storage, Data Center Fabric (shared virtualized, dynamic provisioning)

(7)

Infrastructure as a Service (IaaS)

(8)

Platform as a Service (PaaS)

(9)

Software as a Service (SaaS)

Provides an application

Google search, email, and other applications

(10)

A Quick Comparison of Cloud Types

[Figure: stack of cloud types]
SaaS (Application): quicker to value (less work)
PaaS (Platform)
IaaS (HW + OS): fewer constraints (increasing flexibility)

(11)

What is Disaster Recovery?

According to Wikipedia, Disaster Recovery (DR) is "the policies and procedures . . . for recovery or continuation . . . of vital technology infrastructure and systems . . . following a natural or human-induced disaster."

Disaster Types: Floods, Hurricanes, Volcanoes, Earthquakes, Fires, Terrorist Attacks, Hacker Attacks, Alien monsters, ...

IT Infrastructure and Systems: Servers, Storage, Network, Software, Configuration, ...

Policies and Procedures for Recovery: Geographic Dispersion, Recovery Orchestration, Recovery Automation, Detailed Plans, Data copies, DR Drills, Periodic Testing, Detection

(12)

Disaster Recovery vs. High Availability

Disaster Recovery (DR): process and procedures that enable the continuation or recovery of technology infrastructure or systems after a natural or human-induced disaster causes an interruption

High Availability: ability of a system to continue being accessible despite failures of system component(s)

                 | Disaster Recovery (DR)           | High Availability (HA)
Recovery Target  | Entire technology infrastructure | Individual components or functions
Failure type     | Site-wide disasters              | Failures of individual computing components
Triggering Event | Executive decision               | Failure detection or administrator action

(13)

Disaster Recovery for Enterprise-class Clients

Enterprise-class clients

– Examples: banks, financial institutions, hospitals, governments, utility companies, etc.
– Many are regulation-bound to have DR coverage

DR requirements are very stringent
– Aggressive Recovery Time Objective (RTO) and Recovery Point Objective (RPO)

Most large companies spend between 2% and 4% of their IT budget on DR planning

Business impact of loss of IT infrastructure and data can be huge
– Cost of downtime could dissolve business
– Ubiquitous nature of IT on Business
– Irreparable brand damage
– Loss of customer data and reputation

Market opportunity for Business Continuity/Disaster Recovery around $32 Billion in 2015 [Source: IBM]

(14)
(15)

Disaster Recovery for Enterprise-class Clients on the Cloud

Potential Benefits to Customers (Cloud Users):

– Self-service Model
• On-demand DR protection activation
• On-demand, non-intrusive DR tests
– Resiliency made cheaper
• Pay only for workloads that need to be DR-protected
• No upfront capital expenses
– Improved agility in responding to outages

Challenges to Cloud Providers
– More Aggressive SLAs
– Scale & Diversity
– Inter-dependencies and Coordination of Server DR and App DR
– DR of Management Capabilities
– Regulatory Requirements (e.g., location)

(16)

DR Life Cycle and Basic DR Use Cases

[Figure: DR life cycle stages: DR Deployment, DR Steady State, DR Test, Failover (on DR declaration), Failback]
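The life cycle stages named on this slide can be read as a small state machine. The sketch below is one possible reading, not the deck's own design: the allowed transitions and the rule that failover needs a DR declaration are assumptions inferred from the stage names, and `DRState`, `DRLifecycle`, and `advance` are hypothetical names.

```python
from enum import Enum


class DRState(Enum):
    DEPLOYMENT = "DR Deployment"
    STEADY_STATE = "DR Steady State"
    DR_TEST = "DR Test"
    FAILOVER = "Failover"
    FAILBACK = "Failback"


# Allowed transitions. ASSUMPTION: inferred from the stage names on the
# slide; the original diagram's arrows did not survive extraction.
TRANSITIONS = {
    DRState.DEPLOYMENT: {DRState.STEADY_STATE},
    DRState.STEADY_STATE: {DRState.DR_TEST, DRState.FAILOVER},
    DRState.DR_TEST: {DRState.STEADY_STATE},
    DRState.FAILOVER: {DRState.FAILBACK},
    DRState.FAILBACK: {DRState.STEADY_STATE},
}


class DRLifecycle:
    """Tracks which DR life cycle stage a protected workload is in."""

    def __init__(self):
        self.state = DRState.DEPLOYMENT
        self.history = [self.state]

    def advance(self, target, disaster_declared=False):
        """Move to `target` if the transition is allowed."""
        if target not in TRANSITIONS[self.state]:
            raise ValueError(
                f"cannot go from {self.state.value} to {target.value}"
            )
        # ASSUMPTION: a real failover is gated on an explicit DR declaration,
        # while a DR test is not.
        if target is DRState.FAILOVER and not disaster_declared:
            raise PermissionError("failover requires a DR declaration")
        self.state = target
        self.history.append(target)
        return self.state
```

Keeping the transition table as data makes it easy to adjust if the actual life cycle allows other paths, for example a test launched directly after deployment.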

(17)

Reference Architecture for Cloud DR

At DR site

– VMs and applications/appliances may or may not exist before failover

(18)

Reference Architecture for Cloud DR: Replication

Replication Method       | Recovery Time Objectives | Recovery Point Objectives | Cost
Synchronous Replication  | Seconds-minutes          | Seconds-minutes           | $$$
Asynchronous Replication | Minutes-few days         | Minutes-few days          | $$
Backup-Restore           | Days-weeks               | Days-weeks                | $

Replication Levels

Storage-level replication
– any updates to the VM's state at the primary site's storage are mirrored to the DR site's storage

Host-level replication
– requires installation of agent in each host
– different agents for different OSes

App-level replication
– may be required for certain apps even if other options are technically possible

Replication Modes

Active-active (live/live, hot DR, warm DR)
Active-passive (cold DR, warm DR)
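The RTO/RPO/cost trade-off in the replication table can be illustrated with a small selection routine: pick the cheapest method whose range can still reach the targets. This is only a sketch under stated assumptions. The table gives ranges, not numbers, so the concrete `best_rto`/`best_rpo` values below (1 second, 1 minute, 1 day) are placeholders for the low end of each range, and `ReplicationMethod` and `cheapest_method` are hypothetical names.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class ReplicationMethod:
    name: str
    best_rto: timedelta  # ASSUMED low end of the RTO range in the table
    best_rpo: timedelta  # ASSUMED low end of the RPO range in the table
    cost_rank: int       # 1 = "$", 2 = "$$", 3 = "$$$"


# Placeholder numbers standing in for "seconds", "minutes", and "days".
METHODS = [
    ReplicationMethod("Synchronous Replication",
                      timedelta(seconds=1), timedelta(seconds=1), 3),
    ReplicationMethod("Asynchronous Replication",
                      timedelta(minutes=1), timedelta(minutes=1), 2),
    ReplicationMethod("Backup-Restore",
                      timedelta(days=1), timedelta(days=1), 1),
]


def cheapest_method(rto_target, rpo_target, methods=METHODS):
    """Return the lowest-cost method that can reach both targets, or None."""
    candidates = [
        m for m in methods
        if m.best_rto <= rto_target and m.best_rpo <= rpo_target
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda m: m.cost_rank)
```

With these placeholder values, a 4-hour RTO and 15-minute RPO target selects asynchronous replication, while sub-minute targets leave only synchronous replication. Whether a method meets a target in practice depends on link bandwidth, change rate, and recovery automation, which this sketch ignores.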

(19)

Reference Architecture for Cloud DR: Networking

Physical WAN network link between sites must have adequate bandwidth

Network design should
– support multiple replication streams
– support secure segregation of data streams (e.g., VPNs, VLANs)
– provide secure access channels for cloud admins and clients
– support adequate segregation between accounts within the same client

Client network environment may need to be pre-staged at the DR site

Network management capabilities may need to be pre-staged at the DR site
– switching/routing configurations
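A rough way to reason about "adequate bandwidth" is to compare the link rate with the sustained rate at which replicated data changes across all streams. The sketch below assumes a uniform change rate over the day and a hypothetical 30% headroom factor; neither figure comes from the slides, and `required_mbps` and `link_is_adequate` are illustrative names.

```python
SECONDS_PER_DAY = 86_400


def required_mbps(stream_change_gb_per_day, headroom=1.3):
    """Sustained Mbit/s needed to keep up with all replication streams.

    stream_change_gb_per_day: changed data per stream, in decimal GB/day.
    headroom: ASSUMED safety factor for bursts and protocol overhead.
    """
    total_bits = sum(stream_change_gb_per_day) * 1e9 * 8
    return total_bits / SECONDS_PER_DAY / 1e6 * headroom


def link_is_adequate(link_mbps, stream_change_gb_per_day, headroom=1.3):
    """True if the WAN link can carry the combined replication traffic."""
    return link_mbps >= required_mbps(stream_change_gb_per_day, headroom)
```

For example, a single stream changing 500 GB per day needs about 46.3 Mbit/s before headroom (500e9 × 8 bits / 86,400 s). Real change rates are bursty, so peak-hour measurements would give a safer estimate than a daily average.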

(20)

© 2015 IBM Corporation

Reference Architecture for Cloud DR: Management

Management Services for Enterprise Workloads include
– virus scanning
– patching
– directory services
– monitoring
– backup/restore
– load balancing
– network security
– compliance, ……

At DR site
– VMs and applications/appliances may or may not exist before failover

(21)

Reference Architecture for Cloud DR: Control

Orchestration and Automation
– Overall coordination of steps in DR lifecycle, particularly failover
• workload recovery
• environment recovery
• management recovery
– Drive automatic steps in DR lifecycle

Administration
– self-service portal(s) that allow clients and admins to launch DR operations such as
• selecting which VMs should be replicated
• initiating DR test or failover
• viewing replication/recovery status
• defining user roles
• specifying access permissions
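The coordination role described above can be sketched as a runner that executes recovery steps in a fixed order, stops at the first failure, and keeps a status map that a portal could display. This is a minimal illustration, not IBM's orchestrator: `run_failover` is a hypothetical name, and the order in which environment, management, and workload recovery run is left to the caller because the slides do not specify it.

```python
from typing import Callable, Dict, List, Tuple

# A step is a (name, action) pair; the action returns True on success.
Step = Tuple[str, Callable[[], bool]]


def run_failover(steps: List[Step]) -> Dict[str, str]:
    """Run recovery steps in order and report per-step status.

    Status values: "pending", "done", or "failed". Steps after the first
    failure are left "pending" so an operator can see where recovery stopped.
    """
    status = {name: "pending" for name, _ in steps}
    for name, action in steps:
        try:
            ok = bool(action())
        except Exception:
            # Treat any error raised by a step as a failed step.
            ok = False
        status[name] = "done" if ok else "failed"
        if not ok:
            break
    return status
```

A caller might pass `[("environment recovery", recover_network), ("management recovery", recover_mgmt), ("workload recovery", recover_vms)]`, where the three functions are placeholders for site-specific automation. Stopping at the first failure is a conservative choice; a production orchestrator would likely add retries and parallel steps to meet low RTOs.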

(22)

IBM: Business Continuity and Resiliency Services

Broad experience; Broad solution capabilities; Industry-specific, globally available expertise; Credibility

– More than 50 years of business continuity and disaster recovery experience
– More than 7,800 Business Continuity & Resiliency Services contracts with 5000+ clients
– Unique insights based on the work of 30,000 industry specialists worldwide
– Global resiliency centers designed for multivendor environments, with over 200 hardware and software vendors supported, including HP, Oracle, Cisco and our own IBM products
– Business process and technology expertise to help you design and implement the right solution for your business
– 150 resiliency centers across 50 countries
– Five million square feet of floor space for disaster recovery, with 41000 work area recovery seats
– Knowledge of local, regional and global regulations
– Over 1800 professionals dedicated to business continuity
– Track record of 100 percent success in meeting commitments to clients who have declared a disaster
– External validation by analysts that have reported favorably on IBM’s breadth of offerings and

(23)

DR Solutions for an Enterprise-class IaaS-PaaS Managed Cloud

(IBM Cloud Managed Services)

                           | Cloud-to-Cloud | Cloud-to-Dedicated DR Site | Cloud-to-Repurposed Customer Site
Description                | Failover to similar cloud site | Failover to dedicated DR site | Failover to custom site
RPO/RTO Specifications     | 15min/4 hours | 15min/4 hours | 15min/4 hours
Regional Availability      | Another cloud site in same region | No cloud site but a purpose-built DR site in same region | No cloud site but a customer-owned site in same region
VM Provisioning at DR site | cloud’s VM provisioning | Dedicated DR site’s VM recovery mechanism | VMware vCenter
Replication Type           | Storage-based, rsync, application-based | Host-based and application-based | Storage-based
Replication Mode           | Active-active | Active-passive and Active-active | Active-passive
Post-failover Management   | Full management | Limited management | Limited management
Networking                 | VLANs over dedicated Fiber link | VPN, MPLS or Point-to-Point with Layer 2 or Layer 3 routing | VLANs over dedicated link

(24)

Cloud to Cloud Disaster Recovery (IBM CMS)

[Figure: cloud-to-cloud DR between CMS Cloud A (Primary) and CMS Cloud B (Secondary). Each site has VMs, DR Control, and Storage. Labels: storage-level replication; file-level, app-level, or host-level replication; DR metadata; pre-provisioned DR VMs (maybe suspended); automated DR failover.]

(25)

Cloud to Cloud Disaster Recovery (IBM CMS)

[Figure: same layout as the previous slide, with each site labelled in both roles (CMS Cloud A: Primary / Secondary; CMS Cloud B: Secondary / Primary). Each site has VMs, DR Control, and Storage. Labels: storage-level replication; file-level, app-level, or host-level replication; DR metadata; pre-provisioned DR VMs (maybe suspended); automated DR failover.]

(26)

IBM CMS Cloud-to-Cloud Disaster Recovery Overview

• Failover site can be leveraged for other workloads (e.g. dev/test)
• 4 hour recovery time objective (RTO), 15 minute recovery point objective (RPO)
• Full CMS Management capabilities at recovery site
• IBM makes disaster declaration
• Single annual DR test included
– Option to purchase additional tests
– Individual Workload(s) can be tested
• DR services can be ordered anytime after initial onboarding
• Primary Focus on Infrastructure DR (as of 1Q 2015)
– IBM-managed SAP and Oracle Services have DR options that were defined leveraging base CMS capabilities
– Enhancements to include middleware and database services within DR scope are planned for future release

[Figure: CMS Site-to-Site Disaster Recovery, showing Fail Over and Fail Back between CMS data centers. Site labels, in the order extracted: Raleigh, Boulder, Lisbon, Barcelona, Ehningen, Montpellier, Portsmouth, Lisbon, Makuhari, Sydney, Winterthur, Ehningen, Toronto, Boulder.]

(27)
(28)
(29)

Cloud to Dedicated DR Site (IBM CMS)

[Figure: DR between CMS Cloud A (Primary) and a Dedicated DR Site in the same region (Secondary). Each site has VMs, DR Control, and Storage. Labels: file-level, app-level, or host-level replication; DR metadata.]

• Pre-provisioned DR servers for Managed Applications.
• Other VMs provisioned during failover.

(30)
(31)

Lessons Learned in Enterprise-Cloud DR

DR should cover workloads and management

Standardization vs. customizability

Data management is central to DR design

Regulations and regional requirements may trump technology in DR

Find acceptable balance between cost and risk mitigation

Automation is a must to achieve low RTOs

(32)

Summary and Takeaways

Cloud DR is the ability to recover the cloud infrastructure and workloads hosted on it

Cloud-based DR-as-a-Service has many benefits for enterprise-class cloud users

Cloud-based DR-as-a-Service raises many technical challenges for cloud providers

Many trade-offs to be considered

– Standardization vs. customizability

– Cost vs. risk mitigation

– Regulations vs. technical aspects

Automation is Key

Share of enterprises considering cloud-based DR expected to grow from 17% in 2014 to 50% in 2018 [Evolve IP Survey, 2015]
