Following this introductory section, this chapter is divided into two main parts. The first discusses local high availability within the datacenter. The second discusses the disaster recovery plan and business continuity when a production site no longer functions following a major event. This chapter describes advanced features (such as vSphere HA [High Availability], vSphere FT [Fault Tolerance], and SRM5 [Site Recovery Manager 5]) and how they interact with the infrastructure’s components.
Recovery Point Objective/Recovery Time Objective
In the area of data protection, the key factors are the recovery point objective (RPO) and the recovery time objective (RTO), which determine the various possible solutions and choices for post-failure recovery.
RPO corresponds to the maximum amount of data that can acceptably be lost when a breakdown occurs. Daily backup is the technique generally used for a 24-hour RPO. For an RPO of a few hours, snapshots and asynchronous replication are used. An RPO of 0 requires synchronous replication and corresponds to a request for “no data loss.”
RTO corresponds to the maximum acceptable duration of the interruption. The time required to restart applications and put them back into service determines the RTO. Tapes located at a protected remote site can be used for a 48-hour RTO. For a 24-hour RTO, restoration can be performed from tapes at a local site. For an RTO of four hours or less, several complementary techniques must be implemented—for example, clustering, replication, VMware HA-specific techniques, FT, SRM, and storage virtualization.
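The tiers described above can be read as a simple lookup from a target objective to a candidate technique. The following is a minimal sketch in Python, not a sizing tool: the function names are hypothetical, and the handling of values that fall between the tiers named in the text (for example, an RTO between 4 and 24 hours) is an assumption made for illustration.

```python
def rpo_technique(rpo_hours: float) -> str:
    """Suggest a protection technique for a target RPO, given in hours."""
    if rpo_hours < 0:
        raise ValueError("RPO cannot be negative")
    if rpo_hours == 0:
        # "No data loss" requires synchronous replication.
        return "synchronous replication"
    if rpo_hours < 24:
        # Assumption: any RPO below 24 hours is treated as "a few hours".
        return "snapshots and asynchronous replication"
    return "daily backup"


def rto_technique(rto_hours: float) -> str:
    """Suggest a recovery technique for a target RTO, given in hours."""
    if rto_hours < 0:
        raise ValueError("RTO cannot be negative")
    if rto_hours >= 48:
        return "restore from tapes at a protected remote site"
    if rto_hours >= 24:
        return "restore from tapes at a local site"
    # Assumption: targets between 4 and 24 hours are grouped with the
    # "four hours or less" tier, because tape restoration would be too slow.
    return "clustering, replication, HA, FT, SRM, and storage virtualization"


if __name__ == "__main__":
    for hours in (0, 4, 24):
        print(f"RPO {hours} h: {rpo_technique(hours)}")
    for hours in (4, 24, 48):
        print(f"RTO {hours} h: {rto_technique(hours)}")
```

In practice, the thresholds would come from the business requirements agreed with stakeholders rather than from fixed constants.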
Virtualization simplifies some processes that allow reduced RTOs. RTO depends on the techniques implemented and strongly hinges on the restart of applications and application consistency when the production site comes to a grinding halt. If application consistency is not guaranteed, RTO varies and is difficult to predict.
Of course, every company wants solutions that ensure no data loss and the restart of production as quickly as possible when a problem occurs. Yet it is no secret: The lower the RPO and RTO times, the more costly the solutions are to implement. This is why it is crucial to involve managers and executives to determine the desired RTO and RPO based on business needs and constraints.
Another consideration is business impact analysis (BIA), which quantifies the actual value of data for the company. Often, investments are made to protect the wrong data (which may be important for the administrator but not necessarily for the company), and crucial data is insufficiently protected. A collective decision by stakeholders will determine the level of risk taken based on the solutions chosen.
The service-level agreement (SLA) is a contract that specifies service levels. It is a formal and binding agreement concluded between a service provider and its client. When such an agreement is entered into, RTO and RPO factors come into play.
Information Availability
The information system is essential to company operations. It allows users to be productive and have efficient means of communication (for example, mailbox, collaborative tools, and social networks) at their disposal. The information system provides the business applications that enable business activities. The unavailability—even partial unavailability—of a service can lead to significant, even unrecoverable, income loss for a company.
Statistic
According to the National Archives and Records Administration (NARA), 93% of companies that lost their datacenter for 10 days or more following a major breakdown went bankrupt within the following year.
To protect the company, it is crucial to implement measures that reduce service interruptions and allow the return to a functioning system in the event of a major incident on the production site.
As shown in Figure 5.1, the largest share of unavailability (79%) comes from planned maintenance for backup operations, hardware addition, migration, and data extraction. These events are predictable but can, nevertheless, cause service unavailability. The remaining unavailability relates to unpredictable events, whose consequences can be dramatic if reliable measures and procedures are not implemented.
Figure 5.1. Predictable and unpredictable causes of system unavailability.
Information availability (IA) is calculated as follows: IA = MTBF / (MTBF + MTTR).
• MTBF: Mean time between failures, the average time before a system or component experiences a failure.
• MTTR: Mean time to repair, the average time required to repair an out-of-service component. (MTTR includes the time spent detecting the bad component, planning a technician’s intervention, diagnosing the component, obtaining a spare component, and repairing the system before putting it back into production.)
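As a worked example of the formula IA = MTBF / (MTBF + MTTR), the short Python sketch below computes availability from two input values. The function name and the sample figures (10,000 hours between failures, 4 hours to repair) are illustrative assumptions, not measurements from any real system.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Return information availability as a fraction between 0 and 1.

    Implements IA = MTBF / (MTBF + MTTR); both inputs use the same unit.
    """
    if mtbf_hours < 0 or mttr_hours < 0:
        raise ValueError("MTBF and MTTR must be non-negative")
    total = mtbf_hours + mttr_hours
    if total == 0:
        raise ValueError("MTBF and MTTR cannot both be zero")
    return mtbf_hours / total


if __name__ == "__main__":
    # Hypothetical component: fails every 10,000 hours on average
    # and takes 4 hours to detect, diagnose, and repair.
    ia = availability(10_000, 4)
    print(f"Availability: {ia:.4%}")
```

The formula shows that availability can be improved either by making failures rarer (raising MTBF) or by shortening repairs (lowering MTTR).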
Information availability is measured as a percentage over a set period and reflects how well business needs are met during that period. The greater the number of nines in the percentage value, the higher the availability. In general, high availability starts at 99.999%.
Table 5.1 shows the relation between the availability (in the number of nines) and the unavailability period it represents per year.
Table 5.1. Unavailability Durations Based on Availability Percentages
A system with an availability of 99.999% represents a maximum downtime of about 5.26 minutes annually. Note that this is a very short period, representing less than the time required to reboot a physical server!
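The relation between an availability percentage and yearly downtime is plain arithmetic: the unavailable fraction multiplied by the number of minutes in a year. The Python sketch below assumes a 365-day year (525,600 minutes); the function name and the list of percentages are chosen for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a 365-day year


def annual_downtime_minutes(availability_percent: float) -> float:
    """Return the maximum yearly downtime, in minutes, for an availability percentage."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    return (1 - availability_percent / 100) * MINUTES_PER_YEAR


if __name__ == "__main__":
    for pct in (99.0, 99.9, 99.99, 99.999):
        minutes = annual_downtime_minutes(pct)
        print(f"{pct}% available: {minutes:,.2f} minutes of downtime per year")
```

Each additional nine divides the permitted downtime by ten, which is why the cost of a solution rises sharply at the upper end of the scale.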
Infrastructure Protection
Protection of the information system (see Figure 5.2) falls into two categories: local availability on a site, and defined business continuity processes that are followed when a major incident occurs on the production site.
Figure 5.2. Infrastructure components.
To sustain local high availability, you can use the following to prevent service interruptions when a component breaks down:
• Hardware redundancy to eliminate a single point of failure (SPOF)
• Clustering systems to return applications to productivity in case of a server failure
• Data securing through backup to prevent data loss, or replication mechanisms to compensate for the loss of a storage bay
• Snapshots, used to quickly return to a healthy state if an application becomes corrupted
For business continuity, a disaster recovery plan (DRP) can be implemented to quickly return a production site to a functioning state when a major event occurs. By having in place a DRP, a business continuity plan (BCP), and an IT contingency plan on a backup site, you can handle such events when they occur within the production datacenter.