© 2014 VMware Inc. All rights reserved.
Implementing a Holistic BC/DR Strategy with VMware
What’s on the agenda?
•
Defining the problem
•
Definitions
•
VMware technologies that provide BC and DR
– vSphere HA and App HA
– vSphere FT
– vSphere Data Protection / Advanced
– vCenter Availability
– vSphere Replication
– vCenter Site Recovery Manager (SRM)
– vCenter Infrastructure Navigator (VIN)
Disaster Recovery vs. Business Continuity
Example: Tuesday, August 23, 2011 at 1:51 PM EDT - Magnitude 5.8
earthquake near Mineral, Virginia
Disaster recovery required? No
Fault Tolerance vs. High Availability
•
Fault tolerance
– Ability to recover from component loss
– Example: Hard drive failure
•
High availability
Uptime percentage in one year    Downtime in one year
99                               3.65 days
99.9                             8.76 hours
99.99                            52 minutes
99.999 (“five nines”)            5 minutes
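The downtime figures follow directly from the uptime percentage. A minimal sketch of the arithmetic, assuming a 365-day year (the function name is illustrative, not from any VMware tool):

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def downtime_hours_per_year(uptime_percent: float) -> float:
    """Return the yearly downtime, in hours, implied by an uptime percentage."""
    if not 0 <= uptime_percent <= 100:
        raise ValueError("uptime_percent must be between 0 and 100")
    return HOURS_PER_YEAR * (1 - uptime_percent / 100)


if __name__ == "__main__":
    for pct in (99, 99.9, 99.99, 99.999):
        hours = downtime_hours_per_year(pct)
        # Print hours and minutes so each row can be compared with the slide.
        print(f"{pct}% uptime: {hours:.2f} hours ({hours * 60:.1f} minutes) per year")
```

Each additional nine cuts the allowed downtime by a factor of ten, which is why the jump from 99.9 to 99.999 is so expensive to engineer.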
RTO, RPO, and MTD
•
Recovery Time Objective (RTO)
– How long it should take to recover
•
Recovery Point Objective (RPO)
– Amount of data loss that can be incurred
•
Maximum Tolerable Downtime (MTD)
– Downtime that can occur before significant loss is incurred
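One way to keep the three objectives apart is to check an incident (or a DR test) against each of them separately. The sketch below is illustrative only; the class, function, and field names are hypothetical and not part of any VMware product.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryObjectives:
    rto: timedelta  # how long recovery should take
    rpo: timedelta  # how much data loss (measured in time) can be incurred
    mtd: timedelta  # downtime that can occur before significant loss


def evaluate(objectives: RecoveryObjectives,
             downtime: timedelta,
             data_loss_window: timedelta) -> dict:
    """Compare a measured outage against each objective independently."""
    return {
        "rto_met": downtime <= objectives.rto,
        "rpo_met": data_loss_window <= objectives.rpo,
        "mtd_exceeded": downtime > objectives.mtd,
    }


if __name__ == "__main__":
    objectives = RecoveryObjectives(
        rto=timedelta(hours=4), rpo=timedelta(minutes=15), mtd=timedelta(hours=24)
    )
    # Hypothetical outage: 6 hours down, last good copy was 10 minutes old.
    print(evaluate(objectives, timedelta(hours=6), timedelta(minutes=10)))
```

In this made-up example the RPO is met and the MTD is not exceeded, but the RTO is missed, which is the kind of gap a DR test should surface.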
Making an Application Service Highly Available
•
vSphere HA
VMware vFabric™ tc Server
vSphere App HA
New
Policy-based
vSphere App HA
vSphere HA Cluster
vFabric Hyperic Virtual Appliance
vSphere App HA Virtual Appliance
Hyperic Agents Running in VMs
vCenter Server
vSphere
vSphere App HA – Key notes
New
•
Available only in vSphere Enterprise Plus
•
Based on VMware vCenter Hyperic
What’s new in App HA 1.1
Edit policy
• Create • Duplicate • View • Delete
Custom service
• Add a new service • Shell script
Level 3
• Support 5 new languages
5.1 support
• vSphere 5.1 U2 • ESXi 5.1
vSphere HA – Keep In Mind…
•
RTO – measured in minutes (not seconds)
•
Requires shared storage
•
Best practices
– Use admission control – percentage policy
– Test post-failure performance with host maintenance mode
– Isolation response – leave powered on
– Network and storage redundancy
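For the percentage-based admission control policy, a common starting point is to reserve the share of cluster capacity that the tolerated host failures represent. The sketch below assumes equally sized hosts; the function name is hypothetical, and real clusters with mixed host sizes need a more careful calculation.

```python
import math


def failover_capacity_percent(host_count: int, host_failures_to_tolerate: int = 1) -> int:
    """Percentage of CPU and memory to reserve, assuming equally sized hosts."""
    if host_count < 2:
        raise ValueError("an HA cluster needs at least two hosts")
    if not 0 < host_failures_to_tolerate < host_count:
        raise ValueError("failures to tolerate must be between 1 and host_count - 1")
    # Round up so the reservation never falls short of a full host.
    return math.ceil(100 * host_failures_to_tolerate / host_count)


if __name__ == "__main__":
    for hosts in (3, 4, 8):
        print(f"{hosts} hosts, 1 failure tolerated: reserve "
              f"{failover_capacity_percent(hosts)}% of CPU and memory")
```

The reserved percentage has to be revisited whenever hosts are added or removed, which is one reason to test post-failure performance with maintenance mode.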
vSphere Fault Tolerance (FT)
•
Zero recovery time, zero data loss
– Host hardware failure only
– Does not protect against OS and application failure
•
Works fine with HA, App HA
•
Why not FT?
– Resource requirements – does workload really need it?
– VM has multiple CPUs – see BCO5065 ☺
Data Protection (Backup and Restore)
•
Agents? No Agents? – Both!
– No agents for majority of workloads – keep it simple
– Agents for certain apps
•
vSphere Data Protection (VDP) Advanced
– Backup and recovery for VMware, from VMware
– Based on proven, mature EMC Avamar™
– Agent-less VM backup and restore
VDP Advanced – Keep In Mind…
•
Engineered for SMB environments
•
Uses VADP – VM snapshots, CBT
•
Utilizes Windows VSS in VMware Tools
•
Works fine with HA, not with FT
•
RDM – virtual yes, physical no
•
Is it DR?
– Maybe – depends on RTO, RPO
VDP Advanced – Keep In Mind…
•
Best Practices
– Prepopulate DNS, always use FQDN
– Manage VM snapshots
– Avoid deploying to slow storage
– Do not power-off, always shut down gracefully
– Do not schedule backups during maintenance window
vCenter Availability
•
Run vCenter Server application in a VM
•
Run vCenter Server database in a VM
•
Run both in same VM?
•
Protect with vSphere HA
– vCenter and DB VM restart priority set to High
– Enable guest OS and App monitoring
vCenter Availability
•
Back up vCenter Server VM and database
– Image-level backup for vCenter Server VM
– App-level backup using agent for database backup
•
Why not FT for vCenter Server?
– vCenter Server requires minimum of 2 vCPUs
– FT does not protect against application failure
vSphere Replication – DR
•
Native tool built into the platform
•
Per-VM hypervisor replication, managed in VC
Replication Across Sites
[Diagram: vCenter Server, ESXi hosts (NFC, VRA), VR Appliance, and storage holding VMDK1 at the source and target sites]
Four Steps for Full Recovery
– Right-click, select “Recover”
– Select a target folder
– Select a target resource
– Click Finish
New Feature – Retain Historical Replicas
vSphere
VR Agent
MPIT Presented as VM Snapshots after Failover
vSphere Replication – Interoperability
Fault tolerance – Doesn’t work with VR
•
FT conflicts at the vSCSI disk filter level.
VDP
•
Mostly no problem!
•
If using VSS… ensure you are using 5.5!!
HA, vMotion, DRS
Storage vMotion and Storage DRS
vSphere Replication – Best Practices
•
RPO
– Only what is necessary!
– Just because you can…
•
RTO
– Don’t set one! No testing, no automation, manual process.
•
VSS – Only if necessary!
•
What about bandwidth?
– Very hard to determine. Do a local loopback first.
•
RDMs?
– Don’t use them. If you must, use virtual compatibility mode.
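Bandwidth is hard to predict, but a rough lower bound can be sketched from the RPO: whatever changes within one RPO window has to cross the link within that window. This is illustrative arithmetic only. The function name and inputs are hypothetical, the changed-data figure must come from your own measurements, and the estimate ignores bursts, protocol overhead, and initial full syncs.

```python
def min_replication_mbps(changed_gb_per_window: float, rpo_minutes: float) -> float:
    """Rough lower bound on sustained link speed, in megabits per second.

    changed_gb_per_window: unique data (decimal GB) changed during one RPO window.
    rpo_minutes: the configured recovery point objective, in minutes.
    """
    if changed_gb_per_window < 0 or rpo_minutes <= 0:
        raise ValueError("changed data must be >= 0 and RPO must be > 0")
    bits_to_send = changed_gb_per_window * 1e9 * 8
    seconds_available = rpo_minutes * 60
    return bits_to_send / seconds_available / 1e6


if __name__ == "__main__":
    # Hypothetical workload: 10 GB of changed blocks per 60-minute RPO window.
    print(f"{min_replication_mbps(10, 60):.1f} Mbps minimum sustained throughput")
```

Tightening the RPO shrinks the window but usually not the change rate proportionally, so the estimate should be recomputed for each RPO you are considering.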
SRM
What is it?
• A disaster recovery engine
• A tool that uses externally replicated data (VR or array-based) to shorten the RTO of a business continuity plan (BCP)
• A product that allows DR to be tested, automated, planned, repeatable, and customizable
What is it not?
• A replication engine
• A tool for systems that need near-instant RPO
• A disaster avoidance stretched cluster
Key Components of SRM
Replication
vCenter Server
– One vCenter Server (Windows or VCVA) per site, same versions
SRM Server
– One SRM Server per site, same versions
vSphere hosts
– Recommend same versions per site (pre-vSphere 5.x only if using array replication)
vSphere Essentials Plus and higher editions supported
SRM Replication Options
•
SRM can utilize BOTH array-based AND vSphere Replication
•
SRM will “see” existing standalone vSphere Replication-protected VMs
Recovery Workflows
Failover Automation
• User-defined recovery plan
• Minimize errors
Non-disruptive Failover Testing
• Isolated test environment
• Increase confidence in DR process
Planned Migration
• Zero data loss
• Operational migration
• Re-protect VMs, migrate back
SRM Interoperability
•
Works with VR and ABR (array-based replication)
•
Backups, VADP or other, are fine
•
HA is no problem at all
•
vMotion and DRS are fine
•
Storage vMotion and
Storage DRS – Sort of…
–
Replication Dependent
•
FT is “yellow”
–
Array replicated only and the
FT status is not recovered
SRM – A Few Best Practices
Not exhaustive
How long is VMworld?
Big ones:
– Storage layout
– Test network configuration
– Test often!
– Size vCenter correctly
Biggest one:
– Do a Business Impact Analysis
– RPO, RTO, cost of downtime, interdependencies, criticality of applications, priorities, units of failover, overlooked
Protection Groups (PGs)
•
More PGs = more granular testing/failover
– DR testing is easier – fewer resource requirements
– Fail-over only what is needed
– More configuration/complexity
•
Fewer protection groups = less complex
– Fewer LUNs, PGs, recovery plans
– Less flexibility
•
Find a good balance between flexibility and simplicity
Test Network
– Use VLAN or isolated network for test environment
• Default “Auto” setting does not allow VM communication between hosts
– Different vSwitch can be specified in SRM for test versus run
VMware – Multiple Levels of Protection
SQL