THE NEED TO PROTECT SQL SERVER AND RELATED APPLICATIONS
Microsoft SQL Server is the preferred database management system for a large number of Windows-based applications. Often, these applications and their associated data are critical to the operation of the business. Typical SQL Server applications include order entry, manufacturing control, inventory management and purchasing, along with many industry-specific applications.
Protecting SQL Server and its databases is a top priority for many businesses, but it is generally not enough. If the application itself is not also protected, business operations will be disrupted. All but the simplest database applications include a separate application element that includes application logic, user interface control and sometimes communication with other applications or networks. Applications may include in-memory data or file system data, in addition to SQL Server data, all adding to the complexity of the protection challenge.
DEFINE THE PROBLEM
Improving the availability of SQL Server and related applications involves reducing or eliminating many possible downtime causes. At the highest level, downtime can be separated into two categories: planned downtime and unplanned downtime. Planned downtime is less disruptive since it can be scheduled for nights or weekends when user activity is much lower. Unplanned downtime, on the other hand, tends to occur at the worst possible times and can impact the business severely. Unplanned downtime can have many causes including hardware failures, software failures, operator errors, data loss or corruption, and site outages. This paper discusses the different causes of both planned and unplanned downtime along with some important considerations for evaluating solutions to address them.
Protecting SQL Server In Physical And
Virtual Environments
SOLVE THE PROBLEM
Most availability solutions today fall into one of three categories: traditional failover clusters, virtualization clusters and data replication. Some solutions combine elements of both clustering and data replication; however, there is no single solution that can address all possible causes of downtime. Traditional and virtualization clusters both rely on shared storage and the ability to run applications on an alternate server if the primary server fails or requires maintenance. Data replication solutions maintain a second copy of the application data, at either a local or remote site, and support either manual or automated failover to handle planned or unplanned server failures.
All of these solutions rely on redundant servers to provide availability. Applications can be moved to an alternate server if a primary server fails or requires maintenance. It is also possible to add redundant components within a server to reduce the chances of server failure.
ELIMINATE DOWNTIME
Most availability solutions rely on a recovery process called “failover” that begins after a failure occurs. A failover moves application processing to an alternate host after an unplanned failure occurs or by operator command to accommodate planned maintenance activity. Failovers are effective in bringing applications back online reasonably quickly, but they do result in application downtime and the loss of in-process transactions and in-memory application data, and they expose the application to possible data corruption. Even a routine failover will result in minutes or tens of minutes of downtime, including the time required for application restart and data recovery after an unplanned failure. In the worst case, software bugs or errors in scripts or operational procedures can result in failovers that do not work properly, extending downtime to hours or even days. Reducing the number of failovers, shortening their duration, and ensuring that the failover process is completely reliable all contribute to the elimination of SQL Server downtime.
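The effect of failover count and failover duration on yearly downtime can be illustrated with simple arithmetic. The following is a minimal sketch in Python; the function names and the scenario values are hypothetical and chosen only for illustration, and it assumes every failover costs the same number of minutes.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def annual_downtime_minutes(failovers_per_year: int, minutes_per_failover: float) -> float:
    """Total yearly downtime if every failover costs the same number of minutes."""
    return failovers_per_year * minutes_per_failover


def availability_percent(downtime_minutes: float) -> float:
    """Share of the year the application is up, as a percentage."""
    return 100.0 * (1.0 - downtime_minutes / MINUTES_PER_YEAR)


if __name__ == "__main__":
    # Hypothetical scenarios: (failovers per year, minutes per failover).
    for count, minutes in [(12, 10), (4, 10), (4, 2)]:
        down = annual_downtime_minutes(count, minutes)
        print(f"{count} failovers x {minutes} min = {down:.0f} min/year, "
              f"availability {availability_percent(down):.4f}%")
```

Both levers matter: in this sketch, cutting the number of failovers and cutting the minutes per failover each reduce the yearly total in direct proportion.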
cause much longer outages and require additional solution elements to properly address.
EVALUATE UNPLANNED DOWNTIME CAUSES
Unplanned downtime can be caused by a number of different events:
• Catastrophic server failures caused by memory, processor or motherboard failures
• Server component failures including power supplies, fans, internal disks, disk controllers, host bus adapters and network adapters
• Software failures of the operating system, middleware or application
• Site problems such as power failures, network disruptions, fire, flooding or natural disasters
There are also the problems of data loss and corruption that require solutions beyond hardware redundancy and failover. Each category of unplanned downtime is addressed in more detail below.
AVOID SERVER HARDWARE FAILURES
Server core components include power supplies, fans, memory, CPUs and main logic boards. Purchasing robust, name-brand servers, performing recommended preventive maintenance, and monitoring server errors for signs of future problems can all help reduce the chances of failover due to catastrophic server failure.
Downtime caused by server component failures can be significantly reduced by adding redundancy at the component level. Examples include redundant power and cooling, ECC memory (which can correct single-bit memory errors), teaming of Ethernet cards, and RAID arrays.
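The benefit of component-level redundancy can be estimated with a standard probability calculation. The sketch below is illustrative only: the function name is hypothetical, and it assumes the redundant copies fail independently, which shared power, firmware or environmental problems can violate in practice.

```python
def redundant_availability(component_availability: float, copies: int = 2) -> float:
    """Availability of N redundant copies when any one working copy is enough.

    Assumes failures are independent, so the set is down only when every
    copy is down at the same time.
    """
    if not 0.0 <= component_availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return 1.0 - (1.0 - component_availability) ** copies


if __name__ == "__main__":
    # Hypothetical component that is up 99% of the time.
    for n in (1, 2, 3):
        print(f"{n} copy/copies: {redundant_availability(0.99, n):.6f}")
```

Under the independence assumption, doubling a component that is available 99% of the time gives roughly 99.99% for the pair, which is why a small amount of redundancy removes most component-level outages.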
REDUCE STORAGE HARDWARE FAILURES
Storage protection relies on device redundancy combined with RAID storage algorithms to protect data access and data integrity from hardware failures. There are distinct issues for both local disk storage and for shared network storage.
For local storage, it is quite easy to add extra disks configured with RAID protection. A second disk controller is also required if you want to protect against controller failures.
SAY GOODBYE TO NETWORKING FAILURES
The network infrastructure itself must be fault-tolerant, consisting of redundant network paths, switches, routers and other network elements. Server connections can also be duplicated to eliminate failovers caused by the failure of a single server or network component. Take care to ensure that the physical network hardware does not share common components. For example, dual-ported network cards share common hardware logic, and a single card failure can disable both ports. Full redundancy requires either two separate adapters or the combination of a built-in network port along with a separate network adapter.
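The shared-hardware check described above can be expressed as a small audit routine. This is a minimal sketch, assuming an administrator has recorded which physical adapter each teamed port belongs to; the port and adapter names are hypothetical labels, not values read from any real system.

```python
from collections import defaultdict
from typing import Dict, List


def single_points_of_failure(connections: Dict[str, str]) -> Dict[str, List[str]]:
    """Return adapters that carry more than one port of the redundant set.

    `connections` maps a port name to the physical adapter it lives on.
    """
    ports_by_adapter: Dict[str, List[str]] = defaultdict(list)
    for port, adapter in connections.items():
        ports_by_adapter[adapter].append(port)
    return {a: sorted(p) for a, p in ports_by_adapter.items() if len(p) > 1}


def is_fully_redundant(connections: Dict[str, str]) -> bool:
    """True when there are at least two ports and no two share an adapter."""
    return len(connections) >= 2 and not single_points_of_failure(connections)


if __name__ == "__main__":
    dual_port = {"port0": "dual-port-card", "port1": "dual-port-card"}
    separate = {"onboard0": "motherboard-nic", "pci0": "pci-adapter"}
    print(is_fully_redundant(dual_port), single_points_of_failure(dual_port))
    print(is_fully_redundant(separate), single_points_of_failure(separate))
```

The same idea extends to switches, power feeds and cable paths: list what each redundant path depends on and flag anything that appears in more than one path.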
MINIMIZE SOFTWARE FAILURES
Software failures can occur at the operating system level or at the SQL Server and application level. In virtualization environments, the hypervisor itself or virtual machines can fail. In addition to hard failures, performance or functional problems can seriously impact SQL Server users even while all of the software components continue to operate. Beyond proper software installation and configuration along with the timely installation of hot fixes, the best way to improve software reliability is the use of effective monitoring tools. Fortunately, there is a wide choice of monitoring and management tools for SQL Server available from Microsoft as well as from third parties.
REDUCE OPERATOR ERRORS
Operator errors are a major cause of downtime. Proven, well-documented procedures and properly skilled and trained IT staff will greatly reduce the chance for operator errors. But some availability solutions can actually increase the chance of operator errors by requiring specialized staff skills and training, by introducing the need for complex failover script development and maintenance, or by requiring the precise coordination of configuration changes across multiple servers.
SECURE YOURSELF FROM SITE-WIDE OUTAGES
Site failures can range from an air conditioning failure or leaking roof that affects a single building, to a power failure that affects a limited local area, to a major hurricane that affects a large geographic area. Site disruptions can last anywhere from a few hours to days or even weeks.
misses the last few updates. In the latter case, asynchronous data replication is used to keep a backup copy of the data.
Data replication is combined with error detection and failover tools to help get a disaster recovery site up and running in minutes or hours, rather than days.
PROTECT AGAINST DATA LOSS AND CORRUPTION
Data loss and corruption cannot be eliminated through hardware redundancy alone. Errors in application logic or mistakes by users or IT staff can result in accidentally deleted files or records, incorrect data changes and other data loss or integrity problems. Certain types of hardware or software failures can lead to data corruption. Site problems or natural disasters can result in loss of access to data or the complete loss of data. Beyond the need to protect current data, both business and regulatory requirements add the need to archive and retrieve historical data, often spanning several years and multiple types of data. Full protection against data loss and corruption requires a comprehensive backup and recovery strategy along with a disaster recovery plan.
In the past, backup and recovery strategies have been based on writing data to tape media that can be stored off-site. However, this approach has several drawbacks:
• Backup operations require storage and processing resources that can interfere with production operation and may require some applications to be stopped during the backup window
• Backup intervals typically range from a few hours to a full day, with the risk of losing several hours of data updates that occur between backups
• Using tape backup for disaster recovery results in recovery times measured in days, an unacceptable level of downtime for many organizations
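The data-loss exposure created by the backup interval can be quantified with a short calculation. The sketch below is illustrative: the function names, the update rate and the intervals are hypothetical, and it assumes a steady update rate and failures that are equally likely at any moment between copies.

```python
def worst_case_loss_hours(protection_interval_hours: float) -> float:
    """Worst case: the failure strikes just before the next copy is taken."""
    return protection_interval_hours


def average_loss_hours(protection_interval_hours: float) -> float:
    """Average case if failures are equally likely at any moment in the interval."""
    return protection_interval_hours / 2.0


def updates_at_risk(protection_interval_hours: float, updates_per_hour: float) -> float:
    """Worst-case number of updates lost, assuming a steady update rate."""
    return worst_case_loss_hours(protection_interval_hours) * updates_per_hour


if __name__ == "__main__":
    rate = 500  # hypothetical updates per hour
    for label, hours in [("nightly backup", 24.0),
                         ("4-hour backup", 4.0),
                         ("30-second copy lag", 30 / 3600)]:
        print(f"{label}: up to {updates_at_risk(hours, rate):.1f} updates at risk")
```

The shorter the interval between protected copies, the smaller the window of updates that can be lost, which is the main argument for capturing changes continuously rather than once per backup window.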
Data replication is a better solution for both data protection and disaster recovery. Data replication solutions capture data changes from the primary production system and send them, in real time, to a backup system at a remote disaster site, at the local site, or both. There is still the chance that a system failure can occur before data changes have been replicated, but the exposure is in seconds or minutes rather than hours or days. Data replication can be combined with error detection and failover tools to help get a disaster recovery site up and running in minutes or hours rather than days. Local data copies can be used to reduce tape backup requirements and to separate archival tape backup from production system operation to eliminate
CONSIDER ISSUES THAT CAUSE PLANNED DOWNTIME
Hardware and software reconfiguration, hardware upgrades, software hot fixes and service packs, and new software releases can all require planned downtime. Planned downtime can be scheduled for nights and weekends, when system activity is lower, but there are still issues to consider. IT staff morale can suffer if off-hour activity is too frequent. Companies may need to pay overtime costs for this work. And application downtime, even on nights and weekends, can still be a problem for many companies that use their systems on a 24/7 basis.
Using redundant servers in any availability solution allows reconfiguration and upgrades to be applied to one server while SQL Server and applications can continue to run on a different server. After the reconfiguration or upgrade is completed, SQL Server and applications can be moved to the upgraded server with minimal downtime. Most of the work can be done during normal hours. Solutions based on virtualization, which can move applications from one server to another with no downtime, can reduce planned downtime even further.
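The one-server-at-a-time upgrade sequence described above can be sketched as a small simulation. This is a hypothetical model, not the procedure of any particular product: the `Server` class, the server names and the version labels are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Server:
    name: str
    version: str
    active: bool  # True if SQL Server and the applications currently run here


def rolling_upgrade(pair: List[Server], new_version: str) -> List[str]:
    """Upgrade a two-server pair one server at a time and return the step log."""
    if len(pair) != 2 or sum(s.active for s in pair) != 1:
        raise ValueError("expected exactly two servers with exactly one active")
    active = next(s for s in pair if s.active)
    standby = next(s for s in pair if not s.active)
    log = []
    # Step 1: upgrade the idle server during normal hours.
    standby.version = new_version
    log.append(f"upgrade {standby.name} while {active.name} serves users")
    # Step 2: move the workload; this is the only step users may notice.
    active.active, standby.active = False, True
    log.append(f"move workload from {active.name} to {standby.name}")
    # Step 3: upgrade the server that has just been freed.
    active.version = new_version
    log.append(f"upgrade {active.name} while {standby.name} serves users")
    return log


if __name__ == "__main__":
    servers = [Server("A", "v1", True), Server("B", "v1", False)]
    for step in rolling_upgrade(servers, "v2"):
        print(step)
```

In this model only the second step touches the running workload, so the length of that move determines how much planned downtime remains.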
ADDED BENEFITS OF VIRTUALIZATION
The latest server virtualization technologies, while not required for protecting SQL Server, do offer some unique benefits that can make SQL Server protection both easier and more effective.
Virtualization makes it very easy to set up evaluation, test, and development environments without the need for additional, dedicated hardware. Many companies cannot afford the additional hardware required for testing SQL Server and database applications in a traditional, physical environment, but effective testing is one of the keys to avoiding problems when making configuration changes, installing hot fixes or moving to a new update release.
Virtualization offers a single solution for protecting both SQL Server and the related database applications. SQL Server and each application can be configured as individual virtual machines and run on any available host within a resource pool. Availability characteristics can be matched to each virtual machine as dictated by application characteristics and business needs.
Virtualization allows resources to be adjusted dynamically to accommodate growth or peak loads. The alternative is to buy enough extra capacity upfront to handle
lead to poor performance and ultimately to the disruption associated with upgrading or replacing production hardware.
Infrastructure components that support the SQL Server environment, including Active Directory, DNS and DHCP, have traditionally required separate servers and distinct availability solutions. They can now be implemented as virtual machines in a common resource pool and leverage the common availability solution that is used to address the entire virtualization environment.
Virtualization also makes disaster recovery easier to implement, more effective, and less costly. Virtual machines separate the software configuration from the underlying hardware. This provides total flexibility in the hardware required for the disaster site. One set of hardware can provide disaster backup for multiple applications, and cost-effective configurations can be chosen strictly based on their disaster recovery role. Software configurations change over time, and changes must be duplicated at the disaster site to ensure proper operation. This can be extremely time-consuming and error-prone in a physical environment. In a virtual environment, the configuration is contained within the virtual machine definition file. Simply copying this file to the disaster site is all that is needed to maintain configuration compatibility.
everRun BENEFITS
everRun software offers a unique set of features and benefits that make everRun a great choice for protecting Microsoft SQL Server and related database applications:
• Higher levels of availability than competing failover solutions
• Selectable fault-tolerance using a range of everRun products along with the “availability dial” of everRun VM
• Simple to install and manage with no need for specialized staff experience or training
• Solutions for both virtual and physical environments
• Solutions for local, near distance and long distance availability and data protection
• Choice of either locally-attached or networked storage
• Built-in failover policy eliminates complex policy definition and the potential for policy errors
• Sophisticated error detection is faster and more reliable than simple heartbeats
• Cost-effective for a wide range of SQL Server environments
Providing availability for SQL Servers and database applications running in remote locations can present unique challenges compared to operating in a corporate data center. Remote sites often have less skilled IT staff or lack local IT staff entirely. A simple availability solution like everRun is ideal for these sites. The ability to use local storage, without the requirement for a SAN, makes everRun better suited to environments without a data center infrastructure.
everRun HA
everRun HA turns two commodity servers into a high-availability server that presents a single image to software, administrators and users. Component failures of disks, storage interfaces and network interfaces are handled transparently with no application disruption. Unlike with everRun FT, system failures require an application restart on the surviving system.
everRun HA provides Component Fault Tolerance that uses disk and network components across the pair of servers for redundancy. Unlike most clustering solutions, everRun HA does not require redundant components within each server and does not use multipath IO and NIC teaming for failover, saving both cost and complexity while providing a solution that is compatible with existing storage and network infrastructure.
everRun FT
everRun FT turns two commodity servers into a fault-tolerant server that presents a single image to software, administrators and users. SQL Server and applications continue to run through a full range of possible errors including power supplies, fans, storage devices and interfaces, network interfaces, memory, processors and even motherboards. There is no downtime, no loss of application state or in-memory data; the everRun software transparently utilizes redundant components to handle processing tasks in the event of component or complete system failures. The individual servers within the redundant pair can be taken offline for hardware or software maintenance and reintegrated online for most maintenance activities.
everRun VM
everRun VM integrates Marathon’s availability solutions with the Citrix XenServer virtualization platform. everRun VM offers a range of availability options – an availability “dial” – within a single product, allowing users to choose the most
Applications that do not need any protection at all can run in the standard XenServer environment along with everRun protected applications. Availability options include standard failover (similar to the VMware HA feature), Component Fault-Tolerance (availability features equivalent to everRun HA), and System Fault-Tolerance (availability features equivalent to everRun FT). Availability selections are made for each protected virtual machine. Virtual machines at different availability levels, along with unprotected virtual machines, can coexist within the same XenServer pool.
everRun SPLITSITE
everRun SplitSite is an option for the VM, HA and FT products that allows the two systems in a server pair (or two hosts running a protected virtual machine) to be geographically separated by distances up to 100 miles. The operating characteristics of SplitSite systems are identical to VM, HA and FT systems running side by side. With SplitSite, protection from many local site problems can be combined with best-in-class local availability for a single, integrated availability solution.
everRun DR
everRun DR is a data replication and failover software solution for long-distance disaster recovery protection. everRun DR uses sophisticated data compression, coalescing and de-duplication techniques to make optimum use of WAN bandwidth. Application-aware solutions for SQL Server 2000, SQL Server 2005 and SQL Server 2008 can restore business operation within minutes.
everRun CDP
ARCHITECTURE EXAMPLES
The following diagrams show how everRun software can be used in several different scenarios to protect SQL Server.
Diagram 1
Diagram 2
Diagram 3
Diagram 3 shows SQL Server, along with an ERP application, in a disaster recovery scenario using everRun DR across two extended distance sites. everRun VM is used for local protection and to simplify management and reduce cost of the disaster recovery solution. Data is replicated across the two sites in real time using everRun DR. SQL Server failover is also managed by everRun DR.
PROTECTING SQL SERVER WITH everRun – CASE STUDIES
solution provides continuous availability for both the application and database and protects against a range of hardware, software and site failures. The hospital installs application upgrades, a major cause of planned downtime, on one server at a time, virtually eliminating downtime associated with this process. Commenting on their experience with Marathon everRun, the hospital’s system administrator says, “The systems just stay up and run.”
A major U.S. retailer uses a sophisticated warehouse control system, built on a SQL Server database, to operate a major distribution center. Any failure of the SQL Server or the warehouse control application means that the distribution center operation and the resulting restocking of hundreds of retail stores comes to a complete stop. The solution recommended by the warehouse system integrator included Marathon everRun HA. Using a pair of HP DL380 servers and internal storage mirrored across the servers, the company has encountered no problems with the Marathon system in almost two years of operation.
Rich Products, a major U.S. frozen foods manufacturer, operates a large manufacturing facility in Arlington, Tennessee on a 24 hours per day, 7 days per week schedule. The facility produces its products in large batches using Wonderware Historian and SQL Server 2005 to collect a wide range of critical data throughout the process. Continuous data collection is a necessity for Rich Products to ensure that its products are manufactured to the highest levels of quality and are in full compliance with all FDA and other regulatory requirements. Failure of SQL Server or Wonderware Historian can shut down the entire production line with direct financial consequences. Rich Products considered a cluster solution but determined that cluster failover introduced unacceptable levels of downtime and data loss for their environment. They instead chose Marathon everRun FT with its ability to continue processing through a variety of hardware and software faults. Since installing their Marathon solution, Rich Products has experienced no downtime or data loss from their SQL Server and Wonderware Historian applications.
CONCLUSION
Protecting SQL Server and related applications requires addressing many different possible causes of both planned and unplanned downtime. Marathon everRun solutions provide the most comprehensive, effective and affordable options that can address the full range of SQL Server and database application downtime risks.
Want to keep Microsoft SQL Server up and running through failures and disasters? Contact us for more information or to take a test drive: marathontechnologies.com
WORLDWIDE HEADQUARTERS
Marathon Technologies Corporation
295 Foster Street, Littleton, MA 01460
Tel 1.800.884.6425 / 1.978.489.1100
Fax 1.978.489.1101
Web: www.marathontechnologies.com
EMEA HEADQUARTERS
Marathon Technologies UK Ltd
Regus House, Trinity Court
Wokingham Road, Bracknell
Berkshire, RG42 1PL
Tel +44 (0) 1344.706.241
Fax +44 (0) 1344.706.242