Recovering from a Disaster in the Primary Data Center

Whether you are using public, private, or hybrid IaaS solutions, there is a standard set of best practices for recovering the database in the unfortunate event that the database or the data center is in a disaster state. Following are four common methods of recovery that leverage a secondary data center (physical or virtual) to protect against disasters within the primary data center:

1. Classic backup and restore method

2. Redundant data centers: Active-Passive Cold

3. Redundant data centers: Active-Passive Warm

4. Redundant data centers: Active-Active Hot

Classic Backup and Restore Method

In this method (see Figure 13.1), full backups are created daily and incremental backups are created during the day, and both are stored to a disk service provided by the cloud vendor. The backups are also copied to the secondary data center and possibly to some other third-party vendor just to be extra safe.

If the database goes offline, gets corrupted, or encounters some other issue, we can restore the last good full backup and apply the latest incremental backups on top of that. If those are unavailable, we can go to the secondary site and pull the last good full backup and the incremental backups dated after the full backup.
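The restore order described above can be sketched in a few lines of Python. This is only an illustration: the `Backup` record, its `verified` flag, and the `restore_chain` function are hypothetical names, and the sketch assumes each incremental backup depends on the ones before it, so the chain stops at the first incremental that fails verification.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass(frozen=True)
class Backup:
    kind: str            # "full" or "incremental"
    taken_at: datetime
    verified: bool       # hypothetical flag: the backup passed an integrity check


def restore_chain(backups: List[Backup]) -> Optional[List[Backup]]:
    """Return the last good full backup followed by the good incrementals dated after it."""
    good_fulls = [b for b in backups if b.kind == "full" and b.verified]
    if not good_fulls:
        return None  # nothing usable at this site; try the secondary site's copies
    base = max(good_fulls, key=lambda b: b.taken_at)
    later_incrementals = sorted(
        (b for b in backups
         if b.kind == "incremental" and b.taken_at > base.taken_at),
        key=lambda b: b.taken_at,
    )
    chain = [base]
    for inc in later_incrementals:
        if not inc.verified:
            break  # assumption: later incrementals depend on this one
        chain.append(inc)
    return chain


# Illustrative catalog: two daily fulls, each followed by incrementals.
catalog = [
    Backup("full", datetime(2024, 1, 1, 0, 0), True),
    Backup("incremental", datetime(2024, 1, 1, 6, 0), True),
    Backup("full", datetime(2024, 1, 2, 0, 0), True),
    Backup("incremental", datetime(2024, 1, 2, 6, 0), True),
    Backup("incremental", datetime(2024, 1, 2, 12, 0), True),
]
chain = restore_chain(catalog)
print([b.kind for b in chain])  # ['full', 'incremental', 'incremental']
```

If `restore_chain` returns `None` for the primary site's catalog, the same function can be run against the catalog of copies held at the secondary site.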

This method is the cheapest solution because there are no redundant servers running. The downside of this approach is that the RTO is very long because the database cannot be brought back online until all of the relevant backups have been restored and the data quality has been verified. This method has been used for years in our brick-and-mortar data centers.

Figure 13.1 Classic Backup and Recovery Method (diagram labels: Main Data Center, Secondary Data Center, Web, API, Offsite)

Active-Passive Cold

In this model (see Figure 13.2), the secondary data center is prepared to take over the duties of the primary data center if the primary is in a disaster state. The term cold means that the redundant servers are not on and running. Instead, a set of scripts is ready to be executed in case of an emergency; these scripts provision a set of servers configured exactly the same as the resources that run at the primary data center. When a disaster has been declared, the team runs these automated scripts, which create database servers and restore the latest backups. The scripts also provision all of the other servers (web servers, application servers, etc.) and essentially establish a duplicate environment in the secondary data center. This method is a cost-effective way to deal with outages because the cold servers do not cost anything until they are provisioned; however, if the RTO for an outage is less than a few minutes, it will not be an acceptable plan. Restoring databases from tape or disk is a time-consuming task that could take several minutes to several hours, depending on the size of the database. The Active-Passive Cold approach is for high-RTO recoveries.
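A rough back-of-the-envelope check shows why the cold approach only suits a high RTO. The sketch below assumes that servers are provisioned first and the database is then restored at a steady throughput; the function names and all of the numbers are illustrative, not measurements.

```python
def estimated_recovery_minutes(db_size_gb, restore_gb_per_minute, provisioning_minutes):
    """Rough recovery time: provision the servers, then restore the database from backup."""
    if restore_gb_per_minute <= 0:
        raise ValueError("restore_gb_per_minute must be positive")
    return provisioning_minutes + db_size_gb / restore_gb_per_minute


def meets_rto(recovery_minutes, rto_minutes):
    """True when the estimated recovery time fits within the RTO."""
    return recovery_minutes <= rto_minutes


# Illustrative numbers only: a 500 GB database restored at 5 GB per minute,
# plus 15 minutes to provision the cold servers.
cold = estimated_recovery_minutes(500, 5, 15)
print(cold)                  # 115.0
print(meets_rto(cold, 10))   # False: a 10-minute RTO rules out a cold standby
print(meets_rto(cold, 240))  # True: a 4-hour RTO leaves room for the restore
```

Because the restore term grows with the size of the database, the same plan that is acceptable for a small database may miss the RTO as the data grows.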

Active-Passive Warm

The warm method (see Figure 13.3) runs the database server hot, meaning that it is always on and always in sync with the master data center. The other servers are cold, or off, and are only provisioned when the disaster recovery plan is executed. This method costs more than the Active-Passive Cold method because the hot database servers are always on and running, but it greatly reduces the amount of downtime if an outage occurs because no database restore is required. The recovery time is the time it takes to provision all of the nondatabase servers, which can usually be accomplished in a few minutes if the process is scripted. Another advantage of this approach is that the hot database at the secondary data center can be put to use as opposed to sitting idle, waiting for a disaster to be declared. For example, ad hoc and business intelligence workloads could be pointed to the secondary data center's database instances, segregating reporting workloads from online transaction processing workloads and thus improving the overall efficiency of the master database.

For systems with a low RPO, running a live and in-sync database at a secondary data center is a great way to reduce the loss of data while speeding up the time to recovery.
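The effect on RPO can be illustrated with a simple comparison. The sketch assumes that, with backups alone, the worst case loses everything written since the last backup (roughly one backup interval), while with a synchronized standby the worst case is the replication lag. The strategy names and figures are hypothetical.

```python
from datetime import timedelta


def worst_case_data_loss(strategy, backup_interval, replication_lag):
    """Approximate the most data (measured in time) that could be lost in a disaster."""
    if strategy == "backup_restore":
        return backup_interval   # failure just before the next backup runs
    if strategy == "replicated":
        return replication_lag   # changes not yet applied at the secondary
    raise ValueError(f"unknown strategy: {strategy}")


def meets_rpo(loss, rpo):
    """True when the worst-case loss fits within the RPO."""
    return loss <= rpo


# Illustrative numbers only.
interval = timedelta(hours=4)   # incremental backup every 4 hours
lag = timedelta(seconds=5)      # replication lag to the secondary database
rpo = timedelta(minutes=1)

print(meets_rpo(worst_case_data_loss("backup_restore", interval, lag), rpo))  # False
print(meets_rpo(worst_case_data_loss("replicated", interval, lag), rpo))      # True
```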

Figure 13.2 Redundant Data Centers: Active-Passive Cold (diagram labels: Main Data Center, Secondary Data Center, Web, API, Offsite, "Cold resources are off," "Restore from backup," "Resources are on," "New Master")

Figure 13.3 Redundant Data Centers: Active-Passive Warm (diagram labels: Main Data Center, Secondary Data Center, Web, API, "DB Sync," "Hot resource is on," "Cold resources are off," "Sync," "Resync when back online," "Resources are on," "New Master")

Active-Active Hot

The most expensive but most resilient method is to run fully redundant data centers at all times (see Figure 13.4). The beauty of this model is that all of the compute resources are being used at all times and in many cases a complete failure of one data center may not cause any downtime at all. This is the model that we used for the digital incentives platform. That platform has survived every AWS outage without ever missing a transaction, while many major websites were down. We had a low tolerance for lost data and downtime, and the value of recovery to the business was extremely high because of the risk of impacting our customers' point-of-sale systems.

In this model, the database uses master–slave replication across data centers. When the primary data center fails, the database at the secondary data center becomes the new master. When the failed data center recovers, the databases that were down start to sync up. Once all data in all data centers is back in sync, control can be given back to the primary data center to become the master again. Active-Active Hot is the way to go when the value of recovery is extremely high and failure is not an option.
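The failover and failback sequence can be modeled as a small state machine. This is a toy sketch, not the platform's actual implementation: the `ReplicatedPair` class, the `Role` names, and the two-site layout are assumptions made for illustration, and the code uses "replica" for the non-master role.

```python
from enum import Enum


class Role(Enum):
    MASTER = "master"
    REPLICA = "replica"        # in sync with the master
    DOWN = "down"
    RESYNCING = "resyncing"    # back online but still catching up


class ReplicatedPair:
    """Toy model of one database replicated across two data centers."""

    def __init__(self):
        self.roles = {"primary": Role.MASTER, "secondary": Role.REPLICA}

    def _other(self, site):
        return "secondary" if site == "primary" else "primary"

    def fail(self, site):
        """Mark a site as failed; promote the other site if the master was lost."""
        other = self._other(site)
        if self.roles[site] is Role.MASTER:
            if self.roles[other] is not Role.REPLICA:
                raise RuntimeError("no in-sync replica available to promote")
            self.roles[other] = Role.MASTER
        self.roles[site] = Role.DOWN

    def recover(self, site):
        """A failed site comes back online and starts to sync up."""
        if self.roles[site] is not Role.DOWN:
            raise ValueError(f"{site} is not down")
        self.roles[site] = Role.RESYNCING

    def resync_complete(self, site):
        if self.roles[site] is not Role.RESYNCING:
            raise ValueError(f"{site} is not resyncing")
        self.roles[site] = Role.REPLICA

    def fail_back(self):
        """Give control back to the primary, but only once it is fully in sync."""
        if (self.roles["primary"] is not Role.REPLICA
                or self.roles["secondary"] is not Role.MASTER):
            raise RuntimeError("primary must be an in-sync replica before failback")
        self.roles["primary"] = Role.MASTER
        self.roles["secondary"] = Role.REPLICA


pair = ReplicatedPair()
pair.fail("primary")             # secondary becomes the new master
pair.recover("primary")          # primary is back and starts to sync up
pair.resync_complete("primary")  # all data is back in sync
pair.fail_back()                 # primary becomes the master again
print(pair.roles["primary"].value, pair.roles["secondary"].value)  # master replica
```

The guard in `fail_back` reflects the rule in the text: control returns to the primary only after all data is back in sync.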
