Publication Date: April 2007


BOOST BLOCK-LEVEL REPLICATION

WITH GOLDENGATE’S TDM TECHNOLOGY

FOR 100% BUSINESS CONTINUITY


ABSTRACT:

Recent natural disasters and government regulations have created new demand for business continuity solutions. Many vendors offer disaster recovery technologies that focus on recovering data after a disaster (that is, a physical disruption), but these do not necessarily prevent an interruption to the business when a logical disruption such as a human error occurs. To keep critical business applications running continuously, companies need solutions that can handle interruptions caused by both physical and logical failures.

This paper describes the two leading business continuity solutions: block level data replication and Transactional Data Management (TDM). It evaluates how they complement each other in creating a solid and complete business continuity solution that enables not only disaster recovery, but also high availability (HA) of mission critical systems.

WHITE PAPER

BUSINESS CONTINUITY VS. DISASTER RECOVERY

The disruptive events companies have experienced in recent years, including power blackouts, hurricanes, and tsunamis, along with new government regulations such as Sarbanes-Oxley, have compelled businesses to make business continuity planning a priority. Many companies have deployed disaster recovery solutions that focus on recovering their data after a disaster or a disruptive event occurs. Business continuity, however, is a broader term that includes strict requirements for high availability of systems and data files, i.e. the ability to operate within acceptable criteria under a wide variety of damaging conditions or interruptions. While disaster recovery technologies aim to decrease mean time to recover (MTTR) in case of a physical failure, high availability solutions aim to maximize mean time to fail (MTTF). Since approximately fifty percent [1] of the interruptions a typical business experiences are due to human errors rather than large disasters, it is critical for companies to have a solid high availability technology in order to implement a complete business continuity solution.

A high availability solution should be able to support the business in any of the three possible availability issues from the perspective of end-users:

1. Unplanned outages: These are the interruptions caused by system failures such as a power outage that stops all systems. Disaster recovery solutions focus on these unplanned outages, where the company can only react after the outage happens.

2. Planned outages: Planned outages happen when moving from one system to another, maintaining an existing system or upgrading it to the next version. During this time, end-users are not able to use the system and it interrupts the business operations that rely on this system. Most of the time businesses select times when the outage will cause minimum impact on the company. However, for mission-critical systems there is never an ideal time in which a business can afford not having the system available. For example, hospital medical systems that patients depend on for sustaining life are mission-critical systems and the hospital cannot afford to allow them to go down for even ten minutes on a weekend night.

3. Active state performance issues: Systems that are overloaded with a large number of users or stressed with high processing demands can experience performance issues even though all the underlying components are up and running. In such instances, end users may experience significant delays in response times or even receive error messages. These types of interruptions not only slow the business but also leave end users dissatisfied, resulting in retention issues and possible revenue loss when customers are the end users. For example, if the ATM system of a retail bank is slow and users receive error messages saying the system cannot process the request at that time, it will negatively affect the customer’s personal experience with the bank and may lead them to switch banks if the issues recur.

Business continuity plans should address all three of these availability issues to make certain that end users have access to the critical applications under any circumstances, including maintenance and upgrades, as well as when transaction volumes or processing requirements increase.

[1] D. Patterson et al., Recovery Oriented Computing (ROC): Motivation, Definition, Techniques, and Case Studies, UC Berkeley Technical Report

BUILDING A COMPLETE BUSINESS CONTINUITY SOLUTION

There are various technologies available to support business continuity of key applications, each with different levels of expected availability, recovery time objectives [2] (RTO) and recovery point objectives [3] (RPO). When building a complete business continuity solution, IT organizations should evaluate not only the technical fit of the solution but also how it supports the business requirements.

1) Business considerations

Not every application and its data is important enough to justify investing in maximum availability and the fastest recovery times. For example, an online retail company can afford to have its human resources systems down for a day or two, or to lose employee contact information, which can be easily replaced. However, it cannot afford to have its order management system down. Similarly, it cannot afford to lose its customers’ order data, since that data would be almost impossible to replace and its loss would cause significant revenue loss. For many mission-critical applications, recovery time and recovery point objectives are very strict because of their crucial impact on day-to-day, revenue-generating operations. IT organizations should work with business groups to identify which applications need the greatest investment to ensure high availability and very granular data recovery points, and which applications can use technologies with less strict availability and recovery objectives.

As shown in Chart 1, there are four categories of importance that an application and its related data can fall into: critical, essential, important and normal. Working with business owners, IT organizations should assign recovery time and recovery point objectives to each application to help select the right technology. In addition, IT teams should determine the cost of downtime and data loss for each application. With that information they can estimate the expected return on investment by comparing the cost of different business continuity technologies with the cost of possible downtime scenarios when these technologies are not deployed. Another critical business consideration is service level agreements (SLAs). If the availability of certain applications is directly or indirectly linked to meeting SLAs, those requirements should be considered when choosing a business continuity solution for those applications.

Chart 1: Matching Data Classification to Recovery Objectives

Classification   RTO
Critical         0 – 3 Sec
Essential        3 Sec – 10 Min
Important        10 Min – 1 Day
Normal           1 Day – 1 Week

RPO values shown in the chart: 10 Min – 1 Hr, 1 Hr – 1 Day, 1 Day, 1 Day

Application and data labels legible in the chart include: Financial Services, Healthcare Systems, Call Centers, Mail Services, Billing, Online Trading, E-Commerce, Supply Chain, Website, Sales Analysis, Historical Data

[2] Recovery time objective defines the acceptable duration of an outage

[3] Recovery point objective defines the maximum age of acceptable data after recovery
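The classification and return-on-investment reasoning described above can be illustrated with a small Python sketch. The RTO boundaries are taken from the ranges in Chart 1; the function names, the inclusive treatment of each upper bound, and the simple ROI formula are assumptions made for this example, not part of the original analysis.

```python
# Upper RTO bound (in seconds) for each class, from the ranges in Chart 1.
# Treating each upper bound as inclusive is an assumption of this sketch.
RTO_CLASSES = [
    ("critical", 3),
    ("essential", 10 * 60),
    ("important", 24 * 60 * 60),
    ("normal", 7 * 24 * 60 * 60),
]


def classify_by_rto(rto_seconds: float) -> str:
    """Return the Chart 1 class whose RTO range contains rto_seconds."""
    if rto_seconds < 0:
        raise ValueError("RTO cannot be negative")
    for name, upper_bound in RTO_CLASSES:
        if rto_seconds <= upper_bound:
            return name
    raise ValueError("RTO exceeds the one-week range covered by Chart 1")


def expected_downtime_cost(hours_down_per_year: float, cost_per_hour: float) -> float:
    """Expected yearly cost of downtime without a continuity solution."""
    return hours_down_per_year * cost_per_hour


def simple_roi(avoided_cost: float, solution_cost: float) -> float:
    """Hypothetical ROI: net benefit divided by the cost of the solution."""
    return (avoided_cost - solution_cost) / solution_cost
```

For instance, an application that must be back within five minutes falls into the essential class, and a solution costing 100,000 per year that avoids 300,000 of expected downtime cost yields a simple ROI of 2.0 under this formula.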


2) Technical requirements

After identifying the business requirements, IT organizations should evaluate the applications at hand from a technical perspective. These considerations include:

› Source system performance: Does the HA solution create any overhead on the source system?
› Application I/O: What are the peak times for the source system? How do they impact other systems?
› Network limitations: Does the solution require additional bandwidth?
› Target database preferences: Does the solution support data movement only between similar systems, and can the target database be actively used?
› Fail back strategy: How easily and quickly does it fail back?
› Geographical distance: Are there any physical distance requirements for the back-up system? If so, what is the minimum distance requirement?

BLOCK LEVEL REPLICATION

Block level replication technology is offered by a wide selection of vendors to address disaster recovery requirements. These solutions are called block level replication because they replicate each block of data on the hard drive to a remote copy. The category is further divided into controller-based replication [4] and host-based replication [5]:

Controller-based solutions create in-box clones, or snapshots of data, without using storage area network (SAN) or server resources. This technology creates copies of the source database byte-by-byte over limited or extended distances using synchronous and asynchronous modes of replication, respectively.

Host-based solutions replicate data from one server to another, using various methodologies, including file-write-based replication, disk-volume-based replication and database-log-based replication. In general, server-based replication requires near-identical software on each server.

Block level replication tools have several advantages that have made them popular in the disaster recovery space:
› Available in many different flavors, from a wide selection of vendors
› Provide fast recovery from hardware failures
› Replicate non-structured data and files as well
› Require minimal configuration, depending on implementation
› Can be on cheaper hardware
› Create low overhead on the database

[4] Also called “array-based replication”

[5] Also called “server-based replication”

However, block level replication solutions also have disadvantages, including:
› High total cost of ownership (TCO) due to vendor dependencies
› Target machine cannot be accessed with controller-based replication [6]
› Distance constraints
› Cannot guarantee transactional data integrity
› Cannot provide granular RPO/point-in-time recovery
› Replicates the failure cause (a block-level problem or data corruption) to the target database as well
› Requires significantly more bandwidth due to replicating all blocks
› Has to be implemented across identical systems
› Requires labor-intensive and time-consuming failover and recovery

TRANSACTIONAL DATA MANAGEMENT

Transactional data management (TDM) technology moves high volumes of transactional data across heterogeneous IT environments with sub-second latency, guaranteed accuracy, reliability, and integrity. TDM technology is not intrusive. The software reads the transaction logs of the source database and moves the changed data only after the transaction is committed. Because TDM moves data at the business transaction level rather than copying disk blocks, and has dynamic rollback capabilities, it enables high availability in the event of logical failures such as human errors.
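The idea of moving changed data only after the transaction is committed can be sketched in a few lines of Python. This is an illustrative model only: the log record layout (`txn`, `kind`, `op`) and the function name are assumptions for the example and do not describe the format of any real database log or of GoldenGate's software.

```python
from collections import defaultdict


def committed_changes(log_records):
    """Yield change operations only for transactions that reach a commit.

    Each record is a dict with a hypothetical layout:
      {"txn": id, "kind": "change" | "commit" | "rollback", "op": ...}
    """
    pending = defaultdict(list)  # operations buffered per open transaction
    for record in log_records:
        txn, kind = record["txn"], record["kind"]
        if kind == "change":
            pending[txn].append(record["op"])
        elif kind == "commit":
            # Only now are the buffered operations released downstream.
            for op in pending.pop(txn, []):
                yield op
        elif kind == "rollback":
            # Rolled-back work is discarded and never leaves the source.
            pending.pop(txn, None)


example_log = [
    {"txn": 1, "kind": "change", "op": "INSERT order 100"},
    {"txn": 2, "kind": "change", "op": "DELETE order 7"},
    {"txn": 1, "kind": "commit"},
    {"txn": 2, "kind": "rollback"},
    {"txn": 3, "kind": "change", "op": "UPDATE order 5"},  # still open
]
replicated = list(committed_changes(example_log))
```

In this example only the insert from transaction 1 is passed on; the rolled-back delete and the still-open update stay on the source, which is the property that distinguishes this approach from copying every disk block.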

TDM is well suited to improving the availability of mission-critical data because of several unique strengths:
› Enables immediate failover to a secondary system. Also provides dual-active HA solutions
› Allows active use of the target database for writing (via bi-directional data movement) or for read-only activities, including real-time reporting
› Creates no overhead on source systems
› Does not have any distance constraints
› Moves transactional data in real-time
› Supports heterogeneous database systems, which allows secondary systems to be on lower-cost platforms from different vendors
› Provides very granular point-in-time recovery
› Preserves transactional integrity by moving only committed transactions

In addition to improving availability for mission-critical systems, TDM provides data integration capabilities that support data warehousing or real-time reporting.

[6] Applies to the majority of the technologies in this category

COMPARISON OF TECHNOLOGIES IN HANDLING DIFFERENT FAILURES

When we compare these two technologies and their combination based on their effectiveness in enabling high availability and disaster recovery for different types of failure and interruption scenarios, we see that each has its own strengths and that they complement each other in creating a complete business continuity solution.

Data storage hardware failure without data corruption

As shown in the comparison table, block level replication tools are effective in handling physical disruptions such as disk corruption. Because these solutions replicate each byte on a disk to another disk, if the first disk cannot be read, the secondary one will be available to users. If there is physical damage to the primary site, a secondary, nearby site can be used to continue operations after the secondary database is brought up.

Block level replication solutions are also an important investment for recovering unstructured data such as image files. Because they can copy any byte on a disk, users can have a mirrored copy of their unstructured data in another storage location.

While TDM solutions do not support unstructured data, in the event of physical damage to one site, a secondary site with up-to-date transactional data will be immediately available for failover.

Data storage hardware failure with data corruption

If a physical disruption such as a disk failure creates data corruption and shuts down the database management system, it can take anywhere from 30 minutes to hours to bring the system back online. Because an improper shutdown puts the database management system into recovery mode, it cannot be mounted immediately. During this recovery state, users cannot access it. In addition, because a block level replication tool copies each byte on the disk, it will also copy the faulty data to the target database. Hence, recovering the corrupted data will require significant effort and time. In many cases recovery teams will lose the data since the last snapshot or archive file.

Because TDM does not replicate byte by byte, but only the committed transactions, the data corruption on one of the disks would not be transferred to the other database. The recovery team can switch over to the secondary database immediately since its data is not corrupt. They can then focus on understanding the failure and data corruption in the primary database. The TDM solution also allows two active databases to support a mission-critical system with bi-directional, real-time data movement between heterogeneous databases. It detects loops automatically and prevents the bi-directional movement from creating a continuous loop of the same data between databases. If one database goes down due to the data corruption, the other one can continue to support the system without any downtime.
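One common way to prevent a bi-directional loop is to tag every change with the site where it originated and to capture only locally originated changes. The Python sketch below models that idea; the `Site` and `Change` classes and the origin-tag approach are assumptions chosen for illustration, and the source does not state how the TDM product itself detects loops.

```python
from dataclasses import dataclass, field


@dataclass
class Change:
    key: str
    value: str
    origin: str  # name of the site where the change was first made


@dataclass
class Site:
    name: str
    data: dict = field(default_factory=dict)
    log: list = field(default_factory=list)  # every write, local or applied
    _pos: int = 0                            # how far capture has read the log

    def local_write(self, key, value):
        self.data[key] = value
        self.log.append(Change(key, value, origin=self.name))

    def apply_remote(self, change):
        # The applied change is logged like any other write,
        # but it keeps its original origin tag.
        self.data[change.key] = change.value
        self.log.append(change)

    def capture(self):
        """Return unread log entries that originated at this site only."""
        unread = self.log[self._pos:]
        self._pos = len(self.log)
        return [c for c in unread if c.origin == self.name]


def sync(source, target):
    """Move captured changes from source to target; return how many moved."""
    changes = source.capture()
    for change in changes:
        target.apply_remote(change)
    return len(changes)


a, b = Site("A"), Site("B")
a.local_write("order-1", "shipped")
moved_a_to_b = sync(a, b)  # the local write on A is delivered to B
moved_b_to_a = sync(b, a)  # B's log holds only A's change, so nothing returns
```

Without the origin filter in `capture`, the change applied on B would be captured again and sent back to A, then back to B, and so on indefinitely.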

System failure without data corruption

When the database shuts down improperly, restarting can be time consuming even if there is no data corruption. As mentioned above, the database system will be in recovery mode and will not be accessible. If the user wants to fail over to the target system, that process will also be time consuming. After the secondary (target) system is set as the primary, it will come up slowly because it lacks the required files in its cache.

With the TDM solution, the recovery team can switch over to the secondary database immediately. After the first database restarts, TDM’s internal checkpoints will show which transactions were not applied to the first database. Using the TDM solution’s trail files, which store all the changed data in a transportable format, users can deliver new transactions to the first database and update it to a current state.
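The interaction between checkpoints and trail files can be pictured as replaying every trail record that comes after the last record the restarted database is known to have applied. The following Python sketch assumes a trail modeled as an ordered list of `(sequence_number, operation)` pairs and a checkpoint stored as a single sequence number; both are simplifications made for this example rather than a description of the actual file formats.

```python
def resync(trail, checkpoint, apply):
    """Replay trail records newer than `checkpoint` and return the new checkpoint.

    trail      -- list of (sequence_number, operation), in ascending order
    checkpoint -- sequence number of the last operation already applied
    apply      -- callable that applies one operation to the database
    """
    for sequence_number, operation in trail:
        if sequence_number > checkpoint:
            apply(operation)
            checkpoint = sequence_number  # advance only after a successful apply
    return checkpoint


trail = [
    (1, "INSERT customer 1"),
    (2, "UPDATE customer 1"),
    (3, "INSERT customer 2"),
    (4, "DELETE customer 1"),
]
applied = []
# The restarted database had applied everything up to sequence number 2.
new_checkpoint = resync(trail, 2, applied.append)
```

Running `resync` again with the returned checkpoint applies nothing, so the catch-up step can be repeated safely after an interruption.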

Table: Comparison of Technologies in Handling Different Failures or Interruptions

Technologies compared:
Block level: Block level replication (such as EMC SRDF, Hitachi TrueCopy, NetApp SnapMirror, Veritas Volume Replicator)
TDM: Transactional Data Management (GoldenGate Software)
Combined: Combined solution of block level replication and TDM

Failure/interruption type                      Scope   Block level   TDM     Combined
Database storage hardware failure              HA      ●
(structured data, no data corruption)          DR                    ● (1)   ● (2)
Database storage hardware failure              HA      ●             °       ●
(unstructured data, no data corruption)        DR      ●             °       ●
Database storage hardware failure              HA      °             ●
(with structured data corruption)              DR      ° (3)         ● (1)   ● (2)
Database system/OS failure                     HA      °             ●
(without data corruption)                      DR      ◗ (4)         ● (1)   ● (2)
Database system/OS failure                     HA      °             ●
(with data corruption)                         DR      ° (3)                 ● (2)
Human error/application error                  HA      °             ●
(e.g. deleting data)                           DR      °             ●
Planned outage (migration, upgrade,            HA      °             ●
maintenance)
Performance problems due to overhead           HA      ◗             ●
on production system

● = Very effective in handling the failure scenario
◗ = Somewhat effective in handling the failure scenario
° = Not effective in handling the failure scenario

(1) In-flight transactions may be lost
(2) If block level replication stores the trail files of GoldenGate’s TDM solution, there will be no loss of data when solutions are combined.
(3) May permanently lose data since the last snapshot
(4) During the restart, a few minutes of data may be lost

Blank cells could not be recovered from the original table layout.

System failure with data corruption

Replicating every byte of the database can also be detrimental to availability and recovery if a system failure causes data corruption. When both instances of the database are corrupt, a failover to the target system will require a significant amount of time and effort. In many cases, IT teams will have to use an outdated snapshot of the database to recover from the disruption. As a result, the company will lose the data created since the last snapshot or archive file.

As described above, TDM captures transactions from transaction logs. Because it looks for committed transactions it will not move a transaction to a secondary database if it contains corrupted data. After a database crash, the TDM solution provides a target database to which the application can switch. Once the primary database is up and running, TDM’s internal checkpoints show the last transaction the primary database received and the delivery mechanism closes the gap between two databases using TDM’s trail files.

Human error or application error creating data corruption

If a user or faulty application logic (e.g. a software bug) creates data corruption, block level replication tools will copy the faulty data to the target database. Because both database copies will have incorrect data, failover to the target database will be futile. Recovery teams will need to use other tools to fix the data corruption and restore the database. In this scenario, the database may lose the data created since the last snapshot or archive file.

On the other hand, TDM technology’s dynamic rollback capability enables users to reverse any unwanted action via the trail files, which store all the change data operations. For example, if the data corruption happened because a user deleted the customer list from the database, the TDM solution would first move that changed data (i.e. the delete command for those customers) to the target database. As soon as the error is noticed, the user can find the erroneous delete command in the TDM solution’s trail files. The user can change that command right in the trail file so that it fixes the data corruption problem, i.e. by changing the delete operation back to an insert of those customer names. The bi-directional TDM solution can reverse specific commands and apply them to the source database to remove the related issue. As a result, the business can continue to use the source database with complete and correct data.
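Turning an erroneous delete back into an insert can be sketched as generating the inverse of each recorded operation. The record layout below (`op`, `table`, `before`, `after`) is hypothetical, and the sketch assumes the trail keeps the row image from before each change; it illustrates the principle rather than the product's actual rollback mechanism.

```python
# Inverse operation for each kind of recorded change.
INVERSE = {"insert": "delete", "delete": "insert", "update": "update"}


def reverse_operation(record):
    """Build the operation that undoes `record` by swapping row images."""
    if record["op"] not in INVERSE:
        raise ValueError(f"unsupported operation: {record['op']}")
    return {
        "op": INVERSE[record["op"]],
        "table": record["table"],
        "before": record.get("after"),   # state to move away from
        "after": record.get("before"),   # state to restore
    }


def undo(records):
    """Reverse a sequence of operations, newest first."""
    return [reverse_operation(r) for r in reversed(records)]


erroneous_delete = {
    "op": "delete",
    "table": "customers",
    "before": {"id": 7, "name": "Acme"},
    "after": None,
}
fix = reverse_operation(erroneous_delete)
```

Here `fix` is an insert that restores the deleted customer row. Reversing several operations in newest-first order matters when later changes depend on earlier ones.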

Planned outages

In addition to handling logical disruptions, TDM enables high availability when a system migration, upgrade, or maintenance is needed. For example, a database migration team can install a new database and then associate it as the target of the primary database so that the TDM software keeps it up to date with the changes that happen to the primary one. The team can then switch the system to the target (new) database, while the primary (old) database becomes the target for the new database in case a problem occurs that requires the team to switch back to the old one. Keeping these two heterogeneous databases in sync enables a migration to a new database with zero downtime.

Block level replication solutions cannot recover from unplanned outages without data loss, because they do not support bi-directional data movement.

Performance problems due to overhead on production system

Because block level replication tools are designed specifically for disaster recovery purposes, the secondary system cannot be accessed by end-users unless snapshots are mounted as point-in-time images of the data. However, TDM solutions allow the secondary system to be actively utilized for both read and write activities. Many businesses with TDM solutions use the target database for load balancing in addition to real-time reporting.

When the production system is overloaded with many read-only queries, a secondary database with real-time updates from the primary system can be assigned for read-only activities, offloading the primary system and significantly improving performance.

All these scenarios show that the two technologies each have their pros and cons for different types of disruptions. While block level replication has advantages in handling unstructured data and recoveries related to physical disruptions, TDM solutions are highly effective in handling logical corruptions. TDM solutions allow for immediate failover and HA, keeping business applications running while repair is done on the primary systems. In addition, TDM technology offers solutions beyond HA: its real-time data integration capabilities can be used for real-time business intelligence in data warehousing and reporting. The solution enables businesses to access mission-critical data from multiple, heterogeneous systems in real time.

When combined, these complementary technologies form a very strong and complete business continuity solution that enables high availability during physical or logical failures for both structured and unstructured data.

CONCLUSION

Disaster recovery investments are only a component of business continuity planning. High availability solutions for mission-critical transactional data play a very critical role in enabling continuous operations. Business continuity planning requires a thorough analysis of existing data. Identifying requirements for acceptable down times and acceptable data loss enables a business to find the right type of solution for each application.

With many mature products available in the market today, block level replication solutions are a good fit for recovering non-mission-critical structured or unstructured data from physical disruptions. For mission-critical transactional data that affects day-to-day operations, high availability and transactional integrity are vital for business health. Companies should therefore deploy a TDM solution for the systems that contain mission-critical transactional data. Because their protection strengths are complementary across different types of failures, block level replication and TDM together provide a complete business continuity solution that enables high availability in all major failure or interruption cases.

ABOUT GOLDENGATE SOFTWARE

GoldenGate Software, Inc. is the market leader in Transactional Data Management (TDM). GoldenGate’s solutions enable customers to effectively maximize the performance, accessibility, and availability of the transactional data that drives their mission-critical business applications.

Providing technologies for capturing, transforming, routing, delivering, and verifying data transactions in real-time and across major databases, GoldenGate helps organizations mitigate risk, reduce costs and increase revenues. Fortune 500 companies leverage GoldenGate solutions to manage critical initiatives for high availability/disaster recovery and real-time data integration.

A privately held company, GoldenGate’s more than 300 customers worldwide include Visa, Bank of America, US Bank, UBS, Sabre Holdings, DirecTV, Comcast, Federated Investors, Mayo Foundation and Overstock.com. GoldenGate broadens its global market reach through key relationships with leading application and infrastructure vendors including ACI Worldwide, Business Objects, Amdocs, Eclipsys, GE Healthcare, HP, IBM, Ingres, Oracle, Teradata, and others.

GoldenGate Software, Inc.

301 Howard Street

San Francisco, CA 94105 USA

Tel: +1 415-777-0200

Email: [email protected]

For a list of worldwide offices and for more company information, please visit:

www.goldengate.com
