Linux: Beyond Backups An overview of restore and recovery technology, and Double-Take Availability


Introduction

Linux, once an operating system for computer enthusiasts and contrarians, is now installed on primary production servers in business computing environments around the globe. The deployment of Linux saw a dramatic uptick in 2008, in part because of macroeconomic events that took place in primary trade centers around the world. Today, Linux, in its various forms, is going strong. Even first-tier computer hardware manufacturers like IBM have embraced specific distros of Linux for their System x hardware.

Regardless of whether you’re running critical business applications on Linux, Windows, UNIX or another mainstream operating system, every organization faces critical intervals when system downtime is unwelcome, whether it’s planned or unplanned. Increasingly, shops that were able to accommodate modest periods of downtime for backups and system maintenance are finding that increased server demands are closing backup windows. Globalization and expanding online business opportunities have been big contributors to the unwelcome contraction of periodic backup opportunities.

Since nearly all organizations need to keep their systems available for increasing amounts of time, they are now realizing that a system outage of even a few hours will result in disruption, chaos and wasted capital. For many companies, exposure to anything more than an hour or two of downtime has become unpalatable.

Current backup processes that include tape or removable disk media fall short of serving the availability and recovery needs of businesses. Shops that thought they weren’t candidates for a high availability solution are now feeling an urgent need to start looking at high availability options.


The purpose of this white paper is to explore current disaster recovery strategies for Linux servers and provide an introduction to the costs and benefits of a Linux-based high availability strategy. Along with this overview of the different Linux protection strategies, we will also provide a more in-depth overview of how the Double-Take Availability for Linux product provides a high availability solution for Linux servers.


Backing Up to Tape or Removable Disk

For nearly 80% of companies in the small to medium range (SMB), their entire disaster recovery strategy consists of performing regular saves to tape or removable disk. This usually includes periodic saves of their entire system, daily incremental tape saves of changed or otherwise critical data, and then storing these tapes safely offsite. Unfortunately, if a failure occurs requiring a bare-metal restore that includes applications and system state settings, it is not unusual for the data recovery time to be up to 48 hours or longer, depending on the time it takes to repair or replace hardware, restore data from tape, and manually recreate all transactions since the last good save. And keep in mind that it is not unusual to run into media errors when restoring from tape. Once the system is running again you can expect a lot of fine tuning to follow and that could take days.

A growing number of companies have calculated the real cost of this downtime and, as a result, have introduced additional layers of protection to reduce data recovery time. Disaster recovery and high availability software is well suited for this kind of task. For Linux, there are open source, freeware, do-it-yourself, and commercial options at your disposal.

Open Source Disaster Recovery, DIY and Other Variants

For better or worse, the notion of open source software is inextricably linked to the word free. If the axiom “you get what you pay for” is even somewhat true, then the use of open source software would seem to put a “for profit” enterprise at a disadvantage since, it would seem, functionality, reliability and support would be noticeably limited. Furthermore, the absence of operating capital for providers of open source code makes the long range survivability of the organizations that support these software solutions somewhat uncertain. Yet, open source software plays an important role in the grand scheme of things for several reasons:

• It may save you time if you’re developing a system of your own and need a script to plug in
• It increases your number of options
• It creates an environment for innovation

While open source disaster recovery and high availability solutions exist, disaster recovery is an area where it’s best to know exactly what you’re getting into. There are instances where open source offerings are embedded within productized solutions and hence, are enhanced, maintained and supported like commercial solutions. If you’re getting into a large-scale software development project involving HA and DR functionality then there’s a lot to consider.


DIY Linux Disaster Recovery

Linux coders will tell you that part of the beauty of Linux lies in its modularity. Adroit technologists often consider this an invitation to write their own plug-ins or entire applications to accommodate the functionality their organization needs. Given the complexity of good automated disaster recovery systems and the huge collection of scripts that would be needed to mimic some of the functionality in commercial solutions, developing these scripts from scratch almost never shakes out to be a cost-effective approach. And long term maintenance often becomes a problem because people move on in the IT business, and external and internal documentation is often sketchy.

Freeware DR

While many of the points in the previous text on open source software apply to freeware, there are a few important distinctions with freeware that should be noted. First, with open source software you have access to source code, and with freeware you do not. If you have a penchant to fiddle with source code, then freeware is not the way to go. Freeware is offered on an “as is” basis. That said, there are several freeware DR solutions available and inevitably, some of them offer the features you need. Yet, you’ll find that support, customization and documentation are also offered on an “as is” basis.

Commercial Disaster Recovery Solutions for Linux Environments

Because the for-profit model drives commercial software organizations, a productized solution such as Double-Take Availability for Linux is available with full feature sets and around-the-clock support. Double-Take Availability for Linux replicates full system state information, application data, and transactions at the byte-level to a physical, virtual or cloud-based target server in real time.

Commercially available HA software is advantageous in the following ways:

• You have a binding agreement that details what the software is supposed to do.
• You have a support contract: If you have a problem, your vendor is being paid to fix it.
• Someone is available to support you during normal business hours or on a 24/7 basis if


Double-Take Availability for Linux

One such option, Double-Take Availability for Linux, is a commercially available disaster recovery and high availability solution. It continuously captures changes as they happen and replicates those changes to one or more servers at any location, over any distance, so you always have access to a current copy of your data, applications and operating system. You can replicate to a disaster recovery site as far away as necessary, over standard IP networks, for maximum protection against data loss, and you can improve performance by compressing the protected data before sending. The likelihood of data loss when compatible databases such as Oracle and MySQL are deployed is also dramatically reduced. Double-Take Availability for Linux also allows you to implement failover without shared storage or geographic limitations, eliminating the single point of failure and providing the freedom to locate servers anywhere.

Technical Overview of Double-Take Availability for Linux

The following content provides a high-level overview of the basic theory of operations for Double-Take Availability for Linux. This solution employs a replication engine that works with physical or virtual servers running Red Hat Enterprise Linux, Novell SUSE Linux Enterprise Server, CentOS Linux or Oracle Enterprise Linux.

Origination of Data

For the purposes of this discussion, the term “data” will refer to any digital information that can be visualized by a server as a series of one or more input/output (I/O) operations. Data is generated by the Linux operating system itself, in the form of day-to-day updates to various system files. Data is more frequently generated by various applications that run on the server. This data may be non-transactional (such as read and write operations on flat files) or transaction-dependent (such as read and write operations to database systems). In either case, the data will be translated into a stream of I/O operations that will be processed by the Linux file system and committed to disk. Data may be created by end-users, applications, or any other system that can read and write information to the server.

Double-Take Availability for Linux File System Filter Driver

Double-Take Availability for Linux uses a file system filter driver to integrate itself within the I/O stack on a server. This filter driver allows the Double-Take Availability process to see each I/O operation that flows through the stack to the file system. It should be noted the filter driver is a pseudo-file system mounted on top of the Linux native file systems such


The Double-Take Availability filter driver is a kernel-mode driver. This means it runs as a kernel module in kernel space, the portion of system memory reserved for privileged operations that cannot be accessed directly by general applications or end-users. This permits the Double-Take Availability filter driver to act with minimal impact on the server itself, while still ensuring that each I/O operation can be examined. Since filters exist within the server’s OS kernel, several safety features have been implemented to ensure they do not adversely impact the system in the event of an error.

First, the filter itself only examines data and allows for a copy of required I/O operations to be moved out of the data stream. Double-Take Availability’s filter driver is non-blocking, so it will not halt data flow in the event of an error. Instead, should an abnormal condition be created on the server, the Double-Take Availability filter will allow data to pass through unmodified, producing an error event to alert administrators, but not stopping applications and end-users from accessing data. Secondly, even when operating under normal conditions, the Double-Take Availability filter allows other applications (such as anti-virus scanners) to operate without hindrance, so that non-Double-Take Availability operations may continue as they did before installation of Double-Take Availability.
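The non-blocking behaviour described above can be illustrated with a small user-space sketch. This is not the kernel driver and does not reflect its internals; `filtered_write`, `write` and `capture` are hypothetical names chosen for illustration. The point it shows is only the ordering: the real write always completes, and a failure while copying the operation is logged rather than propagated to the application.

```python
import logging
from typing import Callable

log = logging.getLogger("filter_sketch")  # hypothetical logger name


def filtered_write(
    write: Callable[[bytes], int],
    capture: Callable[[bytes], None],
    data: bytes,
) -> int:
    """Perform the real write, then try to copy the operation for replication.

    Assumption for illustration: a failure in `capture` must never stop or
    alter the application's write; it only raises an error event (a log entry).
    """
    result = write(data)  # the application's I/O always proceeds unmodified
    try:
        capture(data)  # copy the operation out of the data stream
    except Exception:
        # Alert administrators, but let the data flow continue.
        log.exception("capture failed; data passed through unmodified")
    return result
```

In this sketch the caller sees the same return value whether or not the capture step succeeded, which is the property the paragraph attributes to the filter.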

Selecting the Data

The replication set is the fundamental unit of data protection for Double-Take Availability on a server. It defines which areas of a file system will be protected, and which will be ignored by Double-Take Availability during normal operations. These definitions are made using either the management console GUI, or via one of the interfaces supplied by Vision Solutions. Replication sets are defined as a series of rules which indicate to Double-Take Availability a logical unit of files within the server that should be protected. The replication set may contain a group of files, a directory, a group of directories, a volume or a group of volumes. In addition to the basic definitions, replication sets may also contain inclusion and exclusion information by wildcard (*) definition. Administrators may define extremely specific replication sets to protect only what a specific application would require for resumption of services, or may instead protect a much more general set of data, up to and including an entire server. The replication set information is stored in a configuration file called DblTake.db.
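To make the idea of include and exclude rules with wildcards concrete, here is a minimal sketch. The rule format, the `Rule` class, the "later rule wins" precedence and the example paths are all assumptions made for illustration; the actual rule syntax and the layout of DblTake.db are not described in this paper.

```python
from dataclasses import dataclass
from fnmatch import fnmatch


@dataclass(frozen=True)
class Rule:
    pattern: str   # path or wildcard pattern (hypothetical syntax)
    include: bool  # True = protect, False = ignore


def in_replication_set(path: str, rules: list[Rule]) -> bool:
    """Return True if `path` is protected by the rules.

    Assumption: rules are evaluated in order and the last matching rule
    decides, so a later exclusion can carve files out of an earlier inclusion.
    """
    protected = False
    for rule in rules:
        if fnmatch(path, rule.pattern):
            protected = rule.include
    return protected


# Example: protect a database directory but skip temporary files.
RULES = [
    Rule("/var/lib/mysql/*", include=True),
    Rule("/var/lib/mysql/*.tmp", include=False),
]
```

With these example rules, `/var/lib/mysql/db1/table.ibd` is protected, while `/var/lib/mysql/scratch.tmp` and `/etc/passwd` are not. Note that `fnmatch` lets `*` match across directory separators, which is convenient here but is a property of this sketch, not a statement about the product.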

The Double-Take Availability service will use the replication set rules to determine which replication operations from the driver should be sent to the target. If the Double-Take Availability service sees replication activity from a file that is not within the replication set rules, then the service will notify the driver to not send any more information regarding that


Initialization of a Connection

Once the replication set is defined, it can be connected to a target server via the management console or through one of the Double-Take Availability supplied connection wizards. This will cause a series of events to occur based on the option selected during connection. Generally they are:

1. Double-Take Availability ensures that the target system is responding, that the proper version of Double-Take Availability is installed on the target and that the Double-Take Availability systems on that target are ready to receive information.

2. The connection is established using the replication set rules previously defined and the target path that was provided in the connection information. If the target path doesn’t exist on the target, it will be created.

3. Once the connection is established, a replication start command will be sent to the filter driver, which will cause the driver to start sending change data. At this time a mirror is also started. The mirror will enumerate the files within the replication set, read the data from disk and send it to the target. The file structure will be created on the target underneath the target path provided by the connection information. The administrator may choose either a full mirror (over-writing any data on the target) or differences only (by file-system block checksum) when creating the connection, and the chosen form of mirror operation will initiate on connection. While the mirror is occurring, any changes to the data within a replication set will be replicated to the target.

4. After all the mirror data has been received by the target, the mirror process will update the attributes of the target directories to ensure that the attributes and last modify times are correct. The mirror operation will then complete and the connection statistics will report that the mirror is idle.

5. Replication will continue as long as the connection is established.
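The "differences only" mirror in step 3 compares data by block checksum. A minimal sketch of that general technique follows; the block size, the use of SHA-256 and the function names are assumptions for illustration, since the paper does not state which checksum or block size the product uses.

```python
import hashlib

BLOCK_SIZE = 4  # deliberately tiny for illustration; real block sizes are far larger


def block_checksums(data: bytes, block_size: int = BLOCK_SIZE) -> list[str]:
    """Checksum each fixed-size block of `data` (SHA-256 is an assumed choice)."""
    return [
        hashlib.sha256(data[i:i + block_size]).hexdigest()
        for i in range(0, len(data), block_size)
    ]


def blocks_to_send(source: bytes, target: bytes,
                   block_size: int = BLOCK_SIZE) -> list[int]:
    """Return indexes of source blocks that are missing or different on the target."""
    src = block_checksums(source, block_size)
    tgt = block_checksums(target, block_size)
    return [
        i for i, digest in enumerate(src)
        if i >= len(tgt) or tgt[i] != digest
    ]
```

For example, comparing `b"AAAABBBBCCCC"` on the source with `b"AAAAXXXXCCCC"` on the target yields only block 1, so a difference mirror would transmit one block instead of three. This sketch does not handle a target file that is longer than the source, which a real mirror would also need to truncate.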

Replication Mechanics

Replication of changed data is a continuous process that only ends when the source and target lose communication to the point where queuing is no longer possible, a failover event occurs, or the administrator specifically directs Double-Take Availability to stop protection. Outside of those events, any new data written to any area of the file system within the replication set will be replicated and written to the target.

During replication, I/O operations for files that are being protected are processed through the source’s file system. As these I/O operations are processed, they must pass through Double-Take Availability’s filter driver as they make their way down the I/O stack toward the file system. If the I/O operation is a read (meaning that it will not change the data), then the operation is passed through with no copy made. If the operation is a write, rename, delete


notify the application that made the change that it successfully completed the operation. This completion is passed back up the I/O stack, and at this point the Double-Take Availability filter driver will package the data into a buffer and send it to the Double-Take Availability service as replication data.

The Double-Take Availability service will parse the data being sent from the driver and determine if it is part of the replication set. If the data is not within the replication set rules, meaning that it’s not part of the protection set, then the service will notify the driver to not send any further data for that file. If the data is part of the replication set, then the service will generate a Double-Take Availability file operation that is queued for transmission to the target. This operation is placed in the send queue, which is generally in RAM but, depending on the system load and network latency, could be written to an extended queue location on disk.
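A queue that normally lives in RAM but extends to disk under load can be sketched as follows. The class name, the JSON file format, the `max_in_memory` threshold and the spill directory are all assumptions for illustration, not the product's actual queue format. The one property the sketch is careful about is that operations always come out in the order they went in.

```python
import json
import os
import tempfile
from collections import deque
from typing import Optional


class SendQueue:
    """FIFO queue kept in memory, overflowing to files on disk when full."""

    def __init__(self, max_in_memory: int, spill_dir: Optional[str] = None) -> None:
        self.max_in_memory = max_in_memory
        self.spill_dir = spill_dir or tempfile.mkdtemp(prefix="sendq-")
        self.memory: deque = deque()
        self.spilled: deque = deque()  # file paths, oldest first
        self._next_id = 0

    def put(self, op: dict) -> None:
        # Once anything is on disk, keep spilling so ordering is preserved.
        if self.spilled or len(self.memory) >= self.max_in_memory:
            path = os.path.join(self.spill_dir, f"op-{self._next_id:012d}.json")
            with open(path, "w", encoding="utf-8") as fh:
                json.dump(op, fh)
            self.spilled.append(path)
            self._next_id += 1
        else:
            self.memory.append(op)

    def get(self) -> dict:
        # In-memory operations are always older than spilled ones.
        if self.memory:
            return self.memory.popleft()
        if self.spilled:
            path = self.spilled.popleft()
            with open(path, encoding="utf-8") as fh:
                op = json.load(fh)
            os.remove(path)
            return op
        raise IndexError("send queue is empty")
```

The design choice worth noting is in `put`: after the first spill, new operations continue to go to disk until the disk portion drains, because appending to memory at that point would let a newer operation overtake an older one.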

Data Transmission

The Double-Take Availability file operations are temporarily stored within the send queue on the source system. As always, the order of the operations is maintained within the queue and is consistent with the order in which data was written to disk. The send queue acts as a buffer to the network: in the event that the source cannot send data to the target as fast as it’s being written to disk, the send queue will store the Double-Take Availability operations that need to be transmitted. Data will technically always flow through the queuing systems to allow for instantaneous queuing when an emergency bottleneck emerges, but barring such slow-downs or disconnections, file operations remain in the queues for only a brief period of time before being transmitted. The send queue will hand off file operations to the native Linux TCP/IP systems for packetization and transmission across the IP network between source and target. Before transmission, a copy of the operation is placed in the acknowledgement queue on the source, so if there are errors within the transmission process, the operation can be re-transmitted.
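The relationship between the send queue and the acknowledgement queue can be sketched in a few lines. The `Sender` class, its sequence numbers and the `transmit` callback are hypothetical; the paper does not describe the wire protocol. The sketch shows only the pattern the paragraph describes: keep a copy before transmitting, drop it on acknowledgement, and resend from the copy on error.

```python
from collections import OrderedDict, deque
from typing import Any, Callable


class Sender:
    """Source-side sketch: ordered send queue plus an acknowledgement queue."""

    def __init__(self, transmit: Callable[[int, Any], None]) -> None:
        self.transmit = transmit  # hypothetical network hook: transmit(seq, op)
        self.send_queue: deque = deque()
        self.unacked: "OrderedDict[int, Any]" = OrderedDict()  # seq -> op
        self._seq = 0

    def enqueue(self, op: Any) -> None:
        """Queue an operation, tagging it with the next sequence number."""
        self.send_queue.append((self._seq, op))
        self._seq += 1

    def flush(self) -> None:
        """Transmit queued operations in order, keeping a copy of each first."""
        while self.send_queue:
            seq, op = self.send_queue.popleft()
            self.unacked[seq] = op  # copy held until the target acknowledges
            self.transmit(seq, op)

    def acknowledge(self, seq: int) -> None:
        """The target confirmed this operation; the copy can be discarded."""
        self.unacked.pop(seq, None)

    def retransmit(self, seq: int) -> None:
        """Resend an operation the target reported as missing or damaged."""
        self.transmit(seq, self.unacked[seq])
```

Usage is straightforward: call `enqueue` for each file operation, `flush` when the network is available, and `acknowledge` or `retransmit` according to what the target reports.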

The Linux TCP/IP stack or networking system transmits the operations as a series of TCP/IP datagrams over administrator-specified ports, where the target system then receives the information via the target’s TCP/IP stack. The Double-Take Availability service running on the target will take data from the network, where sub-systems examine the I/O and perform several integrity checks. Included among these checks are tests to ensure the I/O has been received in its entirety, in the correct I/O transactional order, and without errors. If any test fails, the target will request a re-transmission of any necessary operations from the


Since each file operation is written in the exact same order as it was written on the source, transactional integrity for any application writing through the file system will be preserved explicitly. This is true across multiple files or within a single file. This is how Double-Take Availability for Linux can safely protect MySQL, Oracle and other forms of transaction-dependent data sets.
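The target-side behaviour (check integrity, apply strictly in source order, otherwise ask for a resend) can be sketched as below. The packet layout, the use of CRC-32 and the `TargetApplier` class are assumptions for illustration only; the actual integrity checks are not specified in this paper.

```python
import zlib
from typing import Optional


def make_packet(seq: int, payload: bytes) -> tuple[int, int, bytes]:
    """Build a (sequence number, CRC-32, payload) packet; a hypothetical format."""
    return seq, zlib.crc32(payload), payload


class TargetApplier:
    """Apply operations only if intact and next in sequence."""

    def __init__(self) -> None:
        self.next_seq = 0
        self.applied: list[bytes] = []

    def receive(self, seq: int, crc: int, payload: bytes) -> Optional[int]:
        """Apply the operation, or return the sequence number to re-request.

        Returns None when the operation was applied. An out-of-order or
        corrupted packet is rejected, so writes land in source order.
        """
        if seq != self.next_seq or zlib.crc32(payload) != crc:
            return self.next_seq
        self.applied.append(payload)
        self.next_seq += 1
        return None
```

Because nothing is applied out of sequence, a database on the target never sees a later write without the earlier writes it depends on, which is the write-order property the paragraph relies on.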

Conclusion

When it comes to maintaining the availability of Linux-based servers, recovering from minor or major disasters, and ensuring data loss is kept to an absolute minimum, IT professionals have several options at their disposal. Some of these options involve little or no investment, while others are licensed by development houses that develop DR packages for a living. There are advantages and disadvantages posed by all of these options that run the gamut of functionality, flexibility and cost of operation. This paper endeavors to help you choose wisely and also offers functional details on a commercial productized option from Vision Solutions called Double-Take Availability for Linux. Through its process of write-order intact, byte-level I/O replication, Double-Take Availability for Linux can safely protect nearly any server running a mainstream distribution of Linux. This includes numerous release levels of Red Hat Enterprise Linux, Novell SUSE Linux Enterprise Server, CentOS Linux and Oracle Enterprise Linux.

About Vision Solutions

With over 25,000 customers, Vision Solutions, Inc. is the world’s leading provider of information availability software and services for Windows, Linux, IBM Power Systems and Cloud Computing markets. Vision’s trusted Double-Take®, MIMIX® and iTERA™ high availability and disaster recovery brands support business continuity, satisfy compliance requirements and increase productivity. Affordable and easy-to-use, Vision products are backed by certified worldwide 24x7 customer support centers and a global partner network that includes IBM (NYSE:IBM), HP (NYSE:HPQ), Microsoft (NASDAQ:MSFT), VMware (NYSE:VMW) and Dell (NASDAQ:DELL). Privately held by Thoma Bravo, Vision Solutions is headquartered in Irvine, California, USA with offices worldwide. For more information, visit visionsolutions.com, or call 1.800.957.4511 (toll-free U.S. and Canada) or 801.799.0300.