Designing HPE Backup Solutions eBook
Aleksandar Miljković
HPE Press
660 4th Street, #802
San Francisco, CA 94107
Designing HPE Backup Solutions eBook (Exam HPE0-J77)
Aleksandar Miljković
© 2016 Hewlett Packard Enterprise Development LP.
Published by: Hewlett Packard Enterprise Press, 660 4th Street, #802, San Francisco, CA 94107
All rights reserved. No part of this book may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or by any information storage and retrieval system, without written permission from the publisher, except for the inclusion of brief quotations in a review.
WARNING AND DISCLAIMER
This book provides information about the topics covered in the Designing HPE Backup Solutions (HPE0-J77) certification exam. Every effort has been made to make this book as complete and as accurate as possible, but no warranty or fitness is implied. The information is provided on an “as is” basis. The author, and Hewlett Packard Enterprise Press, shall have neither liability nor responsibility to any person or entity with respect to any loss or damages arising from the information contained in this book or from the use of the discs or programs that may accompany it. The opinions expressed in this book belong to the author and are not necessarily those of Hewlett Packard Enterprise Press.
TRADEMARK ACKNOWLEDGEMENTS
All third-party trademarks contained herein are the property of their respective owner(s).
GOVERNMENT AND EDUCATION SALES
The publisher offers discounts on this book when ordered in quantity for bulk purchases, which may include electronic versions. For more information, please contact U.S. Government and Education Sales at 1-855-447-2665 or email email@example.com.
At HPE Press, our goal is to create in-depth reference books of the best quality and value. Each book is crafted with care and precision, undergoing rigorous development that involves the expertise of members from the professional technical community.
Readers’ feedback is a continuation of the process. If you have any comments regarding how we could improve the quality of this book, or otherwise alter it to better suit your needs, you can contact us through email at firstname.lastname@example.org. Please make sure to include the book title and ISBN in your message. We appreciate your feedback.
Publisher: Hewlett Packard Enterprise Press
HPE Contributors: Ralph Luchs, Ian Selway, Chuck Roman
HPE Press Program Manager: Michael Bishop
About the Author
Aleksandar Miljković has over sixteen years of IT experience with a strong background in UNIX systems and network security. His specialties include IT security, OpenStack-based cloud computing, and HPE Helion CloudSystem, a fully integrated IaaS and PaaS cloud solution for the enterprise. Aleksandar is currently involved in developing and delivering sales and technical training in areas such as virtualization, hybrid cloud, and backup and recovery systems and solutions.
This study guide helps you prepare for the Designing HPE Backup Solutions exam (HPE0-J77), which is a requirement for the HPE ASE - Storage Solutions Architect V2 certification. The exam tests your ability to define and recommend an effective enterprise backup and recovery solution based on customer needs. Designed to help you identify the best components, this guide describes HPE's Backup, Recovery, and Archive (BURA) solutions for data protection and outlines recommended configurations for HPE StoreOnce Backup Systems.
Certification and Learning
Hewlett Packard Enterprise Partner Ready Certification and Learning provides end-to-end continuous learning programs and professional certifications that can help you open doors and succeed in the New Style of Business. We provide continuous learning activities and job-role based learning plans to help you keep pace with the demands of the dynamic, fast paced IT industry; professional sales and technical training and certifications to give you the critical skills needed to design, manage and implement the most sought-after IT disciplines; and training to help you navigate and seize opportunities within the top IT transformation areas that enable business advantage today.
As a Partner Ready Certification and Learning certified member, your skills, knowledge, and real-world experience are recognized and valued in the marketplace. To continue your professional and career growth, you have access to our large HPE community of world-class IT professionals, trend-makers and decision-makers. Share ideas, best practices, business insights, and challenges as you gain professional connections globally. To learn more about HPE Partner Ready Certification and Learning certifications and continuous learning programs, please visit http://certification-learning.hpe.com
Audience
This book is designed for presales and solutions architects involved in supporting the sale of enterprise backup solutions. It is assumed that you have a minimum of one year of experience in storage technologies and have a need to identify and define the right components for a backup and recovery solution based on customer needs.
To understand the concepts and strategies covered in this guide, storage professionals should have at least six months to one year of on the job experience. The associated training course, which includes numerous practical examples, provides a good foundation for the exam, but learners are also expected to have at least one year’s experience in storage technologies.
After you pass these exams, your achievement may be applicable toward more than one certification. To determine which certifications can be credited with this achievement, log in to The Learning Center and view the certifications listed on the exam’s More Details tab. You might be on your way to achieving additional certifications.
Preparing for Exam HPE0-J77
This self-study guide does not guarantee that you will have all the knowledge you need to pass the exam. It is expected that you will also draw on real-world experience designing and implementing storage solutions. We also recommend taking the Designing HPE Backup Solutions instructor-led one-day training course (Course ID 01059534).
Recommended HPE Training
Recommended training to prepare for each exam is accessible from the exam’s page in The Learning Center. See the exam attachment, “Supporting courses,” to view and register for the courses.
Obtain Hands-on Experience
You are not required to take the recommended, supported courses, and completion of training does not guarantee that you will pass the exams. Hewlett Packard Enterprise strongly recommends a combination of training, thorough review of courseware and additional study references, and sufficient on-the-job experience prior to taking an exam.
Exam Registration
To register for an exam, go to
CHAPTER OBJECTIVES
In this chapter, you will learn to:
• Describe and differentiate the terms downtime and availability
• Explain the business impact of system downtime
• Describe and calculate the Mean Time Between Failures (MTBF) and estimate its impact on the IT environment
• Describe the Mean Time To Repair (MTTR) and explain its importance in the overall recovery
• List and describe technologies that address availability, such as Redundant Array of Inexpensive (or Independent) Disks (RAID), snapshots, and replication, and contrast them with backup
• Explain and use Recovery Point Objective (RPO) and Recovery Time Objective (RTO) as parameters for backup planning
When it comes to downtime, availability, fault tolerance, backup, recovery, and data archiving, several topics come to mind. Almost everyone knows something about them, usually firsthand. Who has not faced a system failure, a loss of data, or some type of a disaster that affected normal operations of a network, the Internet, a point-of-sale system, or even your own computer? All these terms deal with protecting business or personal assets—either systems, applications, processes, data, or the ability to conduct business. Each varies, however, in its approach and methodology. This chapter lays the foundation for this course and the subsequent chapters by defining and describing the key concepts behind the quest for protecting IT resources. It begins by describing and differentiating between downtime and availability. It then explains how a component or a system failure, which leads to downtime, is measured and prevented. Lastly, it takes the opposite view, one of system availability, and how it can be achieved. Within the context of both downtime and availability, it positions backup and recovery.
Let us begin by focusing on what prevents a system from operating normally and what the consequences might be.
Downtime is a period of time during which a system, an application, a process, a service, or data is unavailable. During this time, a business entity cannot conduct sales operations, support a customer, provide a service, or conduct transactions, that is, if this business entity solely relies upon what just became unavailable.
These events cause loss of productivity, inability to communicate with clients, or even loss of sensitive or important information. Downtime of an IT system can cause loss of productivity of a single user, a workgroup, or even the entire company. Whenever downtime impairs business, the outage carries serious consequences. To prevent or minimize the impact of downtime, you must understand what causes downtime in the first place.
Describing causes of downtime
Figure 1-1 Causes of downtime and data loss or unavailability
Figure 1.1 lists the most typical causes of downtime and data loss or unavailability. Note that the largest cause of downtime is hardware failure, followed by human error. Other causes, not shown in this chart, include the following:
• Power outages
• Data center cooling issues
• Software bugs
• Cyber attacks
• Database inconsistencies or design flaws
• _____________________________________ (add your own)
• _____________________________________ (add your own)
• _____________________________________ (add your own)
Disasters, without a doubt, affect computer operations and data availability. A disaster in the IT world is any event that disrupts a company’s computer operations and causes downtime. Disasters are not controllable, but precautions against them make the recovery much faster and easier. Disasters can be categorized according to the affected area:
• Building-level incidents
• Metropolitan area disasters
• Regional events
Furthermore, downtime can be planned as well as unplanned.
Building-level incidents
Figure 1-2 Disasters affecting a building
Disasters affecting a building (Figure 1.2) usually impact computer operations in that building as well. There may not be direct damage to the IT systems, but these incidents may prevent access to them or to the data they host, or they may interrupt operations.
Metropolitan area disasters
Figure 1-3 Metropolitan area disasters
Floods, fires, large chemical incidents, moderate earthquakes, severe winter storms, and blackouts usually affect entire cities, impacting their infrastructure and disrupting IT systems (Figure 1.3).
Regional events
Figure 1-4 Natural disasters
Computer operations may be interrupted by natural disasters that affect an entire region within a radius of hundreds to tens of thousands of miles/kilometers. These disasters include large floods, hurricanes, earthquakes, political instability, and wars.
Planned vs. unplanned downtime
Planned downtime occurs when system administrators intentionally restrict or stop system operations to implement upgrades, updates, repairs, or other changes. During planned downtime, a particular time period is set aside for these operations, which are carefully planned, prepared, executed, and validated. On the contrary, unplanned downtime occurs when an unintentional intervention restricts or stops system availability.
Planned downtime means downtime as well.
Return, for a moment, to Figure 1.1, which shows the major causes of downtime and data loss/unavailability. Are these causes intentional/planned or unintentional/unplanned? If you answered
unintentional/unplanned, you answered correctly. Overall, the majority of system and data unavailability is because of planned downtime due to required maintenance and upgrades. In fact, unplanned downtime accounts for only about 10% of all downtime, but its unexpected nature means that any single downtime incident may be more damaging to the enterprise, both physically and financially, than many occurrences of planned downtime. Thus, understanding the cost of downtime is critical in either case.
Estimating the cost and impact of downtime
Quantifying downtime is not an easy task because the impact of downtime varies from one case to another. Losing a second of time in an air traffic control system or in a hospital life-support environment can have dire consequences, while losing hours in a billing system may not have a significant impact at all if these billing transactions were queued and committed only when the system became available again. Before you can calculate downtime, you must know its root cause. And not all root causes are strictly IT issues. To begin with, it is important to identify and understand both internal and external downtime threats—you need to know what and who has the potential to take your business down. You also need to know how.
IT-related outages, planned or unplanned, can unleash a procession of costs and consequences that are direct and indirect, tangible and intangible, short term and long term, and immediate and far-reaching.
Tangible and direct costs
Tangible and direct costs refer to expenses that can easily be measured and documented, are incurred up front, and are tracked in the business general ledger. These costs can be “touched,” “felt,” or “easily determined.” Tangible and direct costs related to downtime include:
• Loss of transaction revenue
• Loss of wages due to employees’ idle time
• Loss of inventory
• Remedial labor costs
• Marketing costs
• Bank fees
• Legal penalties from inability to deliver on service-level agreements (SLAs)
Intangible and indirect costs
Intangible and indirect costs refer to business impact that is more difficult to measure, is often felt or incurred at a later date, and is not related to a physical substance or intrinsic productive value.
These costs are nonetheless real and important to a business’ success or failure. They can be more important and greater than tangible costs. Intangible and indirect costs related to downtime include the following:
• Loss of business opportunities
• Loss of employees and/or employee morale
• Loss of goodwill in the community
• Decrease in stock value
• Loss of customers and/or departure of business partners
• Brand damage
• Shift of market share to competitors
• Bad publicity and press
Outage impact to business
The cost that can be assigned to a measurable period of downtime varies widely depending upon the nature of the business, the size of the company, and the criticality of the IT system related to the primary revenue-generating processes. For example, a global financial services firm may lose millions of dollars for every hour of downtime, whereas a small manufacturer using IT as an administrative tool would lose only a margin of productivity. According to a Gartner document titled How Much Does an Hour of Downtime Cost?, for a conventional brick-and-mortar business, estimating the cost of an outage is relatively simple, compared to, let us say, a global financial services firm. In either case, such estimations are never trivial. Consider this example: Assume that a firm conducts business in Western Europe and North America during regular business hours. This firm needs its systems and services to be available 40 hours per week, or 2000 hours per year (accounting for holidays, vacations, and weekends).
Therefore, the first order of approximation for the cost of an outage would be to distribute the firm’s revenue uniformly across those 2000 hours. Thus, if the firm’s annual revenue is $100 million, an average hour would represent $50,000 of its revenue. Consequently, one-hour outage would cost this firm $50,000.
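This first-order approximation can be sketched in a few lines of Python, using the chapter's example figures ($100 million in annual revenue spread over 2,000 business hours):

```python
# First-order cost-of-outage estimate: spread annual revenue
# uniformly across the hours the business needs its systems up.
annual_revenue = 100_000_000  # $100 million, the chapter's example
hours_per_year = 2_000        # 40 hours/week over ~50 working weeks

cost_per_hour = annual_revenue / hours_per_year
print(f"Average cost of a one-hour outage: ${cost_per_hour:,.0f}")
# Average cost of a one-hour outage: $50,000
```

This is a deliberately naive baseline; a realistic model would weight each hour by its share of revenue.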
Two immediate objections arise to this assessment and both lead to important refinements. First, revenue is almost never distributed uniformly across all working hours. Second, most businesses experience seasonal fluctuations. Many retail organizations make 40% of their revenue and 100% of their profits in the last eight weeks of the year. One-hour outage on December 23 will have a much greater impact on the firm’s financial performance as compared to the same outage in late June, for instance. Therefore, the cost of an outage must reflect its potential impact at a particular time in the business cycle.
Table 1.1 shows the cost of downtime, estimated in 1998 by the Gartner Group. You can see that the cost of downtime is company-specific and related intangible costs are very difficult to estimate.
Table 1-1 The cost of downtime, estimated in 1998 by the Gartner Group

Industry        Application             Average cost (per hour of downtime)
Financial       Brokerage operations    $6,500,000
Financial       Credit card sales       $2,600,000
Media           Pay-per-view            $1,150,000
Retail          Home shopping (TV)      $113,000
Retail          Catalog sales           $90,000
Transportation  Airline reservations    $89,500
Consequences of downtime
What happens to an organization when its system goes down?
Before you can predict downtime and plan its prevention, you must have a clear understanding of what happens when a system goes down. Specifically, what and who is impacted and how.
You should consider the following:
• Processes: Vital business processes such as order management, inventories, financial reporting, transactions, manufacturing, and human resources may be interrupted, corrupted, or even lost.
• Programs: Revenue can be affected, and key employee or customer activities might be missed or lost.
• Business: If customers cannot access a website, they might purchase from someone else. You might lose a customer now and forever.
• People: Salaries might not be paid, and even lives could be lost due to downtime of life-sustaining medical systems.
• Projects: Thousands of person-hours of work can be lost and deadlines can be missed, resulting in failure-to-perform fees and noncompliance penalties.
• Operations: Those who manage daily activities of an organization may find themselves without the data they need to make informed decisions.
The Gartner Group recommends estimating the cost of an outage to your firm by calculating lost revenue, lost profit, and staff cost for an average hour—and for a worst-case hour—of downtime for each critical business process. Not-for-profit organizations cannot calculate revenues or profits, so they should focus on staff productivity and a qualitative assessment of the outage to their user community. An outage can weaken customer perception of the firm, harm the wider community, and derail a firm’s strategic initiatives, but these impacts may be difficult to quantify and should, in most cases, be left unquantified. Any outage assessment based on raw, generic industry averages alone is misleading.
Predicting downtime
Can you predict downtime and the events that lead to it? If you could, would it mean that you could then better prepare for such events, or even prevent them from happening? Some causes of downtime cannot be predicted, such as natural disasters or fire. You just have to determine the best prevention or recovery mechanism if they occur. On the other hand, failure rates of computer system components can be predicted with a level of certainty. To determine and express failure rates of computer components, you can calculate the MTBF.
MTBF is the measure of expected failure rates. To understand MTBF, it is best to start with something else—something for which it is easier to develop an intuitive feel.
Let us take a look at a generalized MTBF measure for a computer component, such as a hard drive, a memory DIMM, or a cooling fan. Suppose this component has an MTBF of 200,000 hours. Since there are approximately 9,000 hours in a year, 200,000 hours is about 22 years. In other words, if the MTBF of this component is 200,000 hours, it is expected that the component fails every 22 years. Now, take a sample of 10,000 units of this particular component and determine how many units fail every day, over a test period of 12 months. You may determine that:
• In the first 24 hours, two components fail.
• In the second 24 hours, zero components fail.
• In the third 24 hours, one component fails, and so on.
You then ask: If five units fail in 12 months, how long would it take for all 10,000 units to fail at this rate? If all units fail in regular intervals over this period, how long is this interval?
If a failure is the termination of the component’s ability to perform its intended function, what is then MTBF? MTBF is an interval of time used to express the expected failure rate of a given component. It does not indicate the expected lifetime of that component and says nothing about the failure likelihood of a single unit.
Figure 1-5 Reliability bell curve: number of failures vs. time distribution
This reliability bell curve (Figure 1.5) is also known as a normal distribution, where the highest point of the curve (or the top of the bell) represents the most probable event (at time µ). All possible events are then distributed around the most probable event, which creates a downward slope on each side of the peak. Given the reliability bell curve and a sample of components, most of them will fail around time µ. Some components will fail early in the cycle, whereas others will last much longer. But all of them will fail at some point on this curve, with a high probability that the majority of failures will be distributed according to the bell curve.
Note
Trent Hamm wrote a good explanation of the reliability bell curve in August 2014, using the analogy of low-end and more reliable washing machines. You can find this article at:
Calculating MTBF
How do you calculate an MTBF for an environment that consists of a number of such components? What if all these components were identical? What if they were all different? Once again, let us use an example: If you have 2,000 identical units in your environment and each unit has an MTBF of 200,000 hours, what is the associated total MTBF? The formula for the total MTBF is as follows:

MTBF_total = MTBF_unit / number of units

Therefore:

MTBF_total = 200,000 hours / 2,000 units = 100 hours = ~4 days

Within your particular environment of 2,000 identical units, you can expect a failure approximately every 4 days. It may be easy to understand the MTBF of a single unit, but in reality, systems are complex and consist of a number of different components. Within a complex environment, you can expect a failure of any particular component, a failure between components, or a failure of the entire system.
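The total-MTBF calculation for a population of identical units can be sketched in Python, using the chapter's example values (2,000 units at 200,000 hours each):

```python
# MTBF of a population of identical units: with n units in service,
# failures occur n times as often as for a single unit.
mtbf_unit = 200_000  # hours per unit (chapter example)
units = 2_000

mtbf_total = mtbf_unit / units        # population-wide failure interval
days_between_failures = mtbf_total / 24
print(f"Expected failure interval: {mtbf_total:.0f} hours "
      f"(~{days_between_failures:.0f} days)")
# Expected failure interval: 100 hours (~4 days)
```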
The measure of a failure rate for a system consisting of one to n components, which may not necessarily be identical, is expressed as the MTBF for a complex system. Its formula is as follows:
1 / MTBF_complex_system = 1/MTBF_1 + 1/MTBF_2 + … + 1/MTBF_n

that is,

MTBF_complex_system = 1 / (1/MTBF_1 + 1/MTBF_2 + … + 1/MTBF_n)
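For a system of components in series (any single component failure brings the system down), component failure rates add, and the system MTBF is the reciprocal of the summed rates. A minimal numerical sketch follows; the component MTBF values are illustrative assumptions, not figures from the text:

```python
# Complex-system MTBF: failure rates (1/MTBF) of the components add,
# so the system MTBF is the reciprocal of the summed rates.
component_mtbfs = [200_000, 150_000, 500_000]  # hours; hypothetical values

system_failure_rate = sum(1 / m for m in component_mtbfs)
mtbf_system = 1 / system_failure_rate
print(f"System MTBF: {mtbf_system:,.0f} hours")
```

Note that the result (roughly 73,000 hours here) is always lower than the MTBF of the weakest individual component, which is why complex environments remain vulnerable even as individual component MTBFs improve.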
Even though individual MTBF rates for a single component have been improving, the entire environment is still vulnerable to component failures due to the interdependent nature of complex and multivendor solutions. Consider these examples:
• Hardware: Storage systems may fail to provide adequate service due to MTBF problems or due to corrupted data (resulting from viruses or data integrity issues). Computing systems may lose power or can have various other problems, such as memory or processor failures. One cornerstone of a highly available environment is a good network. High-speed connections for transaction processing and robust connections for client/server backup are critical.
• Software: The basis of any system is a stable operating system, because a failure at the OS level may lead to vulnerability and data loss in everything that is at higher levels of the stack, such as applications, databases, and processes.
Defining and calculating MTTR
You are looking at a component or system failure rate in order to predict, prevent, and minimize downtime. There is another variable which has not yet been discussed: recovery. As soon as a unit fails, how long does it take to repair or replace it? To answer this question, another term is used: MTTR (Mean Time To Repair, or Mean Time To Recovery).
MTTR represents the average time it takes to repair or replace the defective component and return the system to its full operation. It includes the time needed to identify the failure, to diagnose it and determine its root cause, and to rectify it.
MTTR can be a major component of downtime, especially if you are unprepared.
Planning for and preventing downtime
While planning for downtime, you have to be aware of the level of problem you are planning for. While you cannot influence metropolitan or regional incidents, because they are usually naturally occurring disasters, you can create an appropriate disaster recovery plan and implement a disaster recovery solution once you know what it is you are planning for (also called the local problem area). Many times, you can also put in place a solution that may reduce the impact of a disaster or even prevent it altogether. Lastly, regardless of the cause, location, or level of downtime, some type of administrative intervention is usually required to quickly recover from the problem. This intervention always depends on the exact cause of the problem and differs from one case to another. To increase your chances of success, make sure that you have:
• Adequate problem detection in place
• A recovery plan that is validated and tested
• Protection technologies that ensure desired levels of availability
• Monitoring and diagnostic tools and processes
• Skilled staff educated in the adopted technologies and recovery processes
• _____________________________________ (add your own)
• _____________________________________ (add your own)
• _____________________________________ (add your own)
Learning check questions
Reinforce your knowledge and understanding of the topics just covered by completing this learning check:
1. What measures the time needed to repair a component after it fails?
a) MTBF
b) MTTR
c) RTO
d) RPO

2. What measures the MTBFs of a complex system?
a) MTBF
b) MTTR
c) RTO
d) RPO

3. Floods, wildfires, tornados, and severe snowstorms are examples of what type of disasters?
a) Building-level incidents
b) Metropolitan area disasters
c) Regional events
d) Localized incidents

4. What is the major cause of system and data unavailability?
a) User errors
b) Metropolitan area disasters
c) Planned downtime
d) Inadequate backup strategy
Learning check answers
This section contains answers to the learning check questions.

1. What measures the time needed to repair a component after it fails?
Answer: b) MTTR

2. What measures the MTBFs of a complex system?
Answer: a) MTBF

3. Floods, wildfires, tornados, and severe snowstorms are examples of what type of disasters?
Answer: b) Metropolitan area disasters

4. What is the major cause of system and data unavailability?
Answer: c) Planned downtime
Availability
Up to now, this module covered the effects of downtime of a computer system or unavailability of data due to a failure or a disaster. The opposite of downtime is availability (or uptime).
This section discusses the terminology and methods of expressing and calculating the availability times and requirements.
Before continuing with this section, see if you can define the terms below and answer the questions. Search the Internet if you need to.
• In your own words, define these terms:
o High availability: __________________________________________________
o Fault tolerance: __________________________________________________
o RPO: __________________________________________________
o RTO: __________________________________________________
• Given the following type of failure or downtime, what would be the best prevention technology in your opinion?
o Power outage: ____
o Hardware component failure: ____
o System-wide failure: ____
o Administrative or maintenance downtime: ____
o Regional or metropolitan disaster: ____
Availability is a measure of the uptime of a computer system. It is a degree to which a system, a subsystem, or a component of a system continues running and being fully operational. High availability is something which is generally acceptable and desirable. Higher levels of availability, all the way to continuous or mission-critical availability, are usually expensive and complex to achieve.
When a computer system plays a crucial role, availability requirements are dictated and enforced by law or technical regulations. Examples of such systems include hospital life-support equipment, air traffic systems, and nuclear power plant control systems. The Harvard Research Group defines five Availability Environments (AEs) in terms of the impact on the business and the end user/customer. Each successive level inherits the availability and functionality of the previous (lower) level. The minimum requirement for a system to be considered highly available is a backup copy of data on a redundant disk and a log-based or journaled file system for identification and recovery of incomplete (“in flight”) transactions that have not been committed to a permanent medium. This environment corresponds to AE-1.
Figure 1-6 AEs as defined by the Harvard Research Group
Figure 1.6 summarizes these five AEs. They are defined as follows:
AE-4 (fault tolerance)—AE-4 covers business functions that demand continuous computing and environments where any failure is transparent to the user. This means no interruption of work, no lost transactions, no degradation in performance, and continuous 24×7 operation.
AE-3 (fault resilience)—AE-3 covers business functions that require uninterrupted computing services, either during essential periods or during most hours of the day and most days of the week throughout the year. The users stay online, but their transactions may need restarting, and they may experience performance degradation.
AE-2 (high availability)—AE-2 covers business functions that allow minimally interrupted computing services, either during essential periods or during most hours of the day and most days of the week throughout the year. The users may be interrupted but can quickly log back in. They may have to rerun some transactions from journal files, and they may experience performance degradation.
AE-1 (high reliability)—AE-1 covers business functions that can be interrupted as long as the integrity of the data is ensured. The user work stops and an uncontrolled shutdown occurs, but the data integrity is preserved.
AE-0 (conventional)—AE-0 covers business functions that can be interrupted and environments where data integrity is not essential. The user work stops and uncontrolled shutdown occurs, and data may be lost or corrupted.
High levels of availability are not only expected but also required. Designers, engineers, and manufacturers build resilience, availability, and fault tolerance into components, subsystems, and entire solutions depending on what’s required and what the customer is willing to pay for. Such solutions are often built around these three goals:
1. Reduce or eliminate single points of failure by introducing and employing redundancy. This means that a failure of a single component does not cause the entire system to fail.
2. Include a failover mechanism from one component to its redundant counterpart. Depending on the solution design, the failover may range from near instantaneous to a few seconds or minutes.
3. Provide adequate monitoring and failure detection, which then leads to quick recovery.

As you might suspect, protection against MTBF-/MTTR-related problems plays a key role in achieving high levels of availability. To determine which highly available solution is right for you, you should:

1. Determine the desired or required level of availability.
2. Determine the cost of downtime.
3. Understand the recovery process, time, and procedures.
4. Focus on events that have negative aspects (events that can bring your system down as well as events that may prevent or delay its recovery).
5. Develop a cost-consideration model, which weighs the cost of downtime, as defined earlier in this chapter, against the cost of the high-availability solution and its alternatives.
6. Obtain the necessary approvals, funding, and support, and then design and implement the solution.

For a distributed system, if a particular part of the system is unavailable due to a communication or other failure, the system as a whole may still be deemed to be in the up state even though certain applications may be down. When running multiple applications, the system may be fully functional and available even if certain parts of the system are down.
Note
The type of availability at the high end of the availability scale is called continuous availability or nonstop computing and corresponds to 99.999% ("five nines") of uptime or higher. This level of availability allows only 5.26 minutes of downtime per year (25.9 seconds per month, 6.05 seconds per week, or 864 milliseconds per day), which is often not necessary. Many businesses are satisfied with the requirement that the system does not go down during normal business hours.
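The downtime figures quoted in the note can be checked with a few lines of arithmetic. The following sketch converts an availability percentage into allowed downtime per period (the period lengths assume a non-leap 365-day year):

```python
# Allowed downtime per period for a given availability level.
# Verifies the "five nines" figures quoted above.

def downtime_seconds(availability_pct: float, period_seconds: float) -> float:
    """Seconds of allowed downtime in a period at the given availability."""
    return (1 - availability_pct / 100.0) * period_seconds

YEAR = 365 * 24 * 3600  # non-leap year

for label, period in [("year", YEAR), ("week", 7 * 24 * 3600), ("day", 24 * 3600)]:
    print(f"99.999% over one {label}: {downtime_seconds(99.999, period):.3f} s")
```

Running this yields about 315.36 s (5.26 min) per year, 6.05 s per week, and 0.864 s per day, matching the figures in the note.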
Note
For additional details regarding high-availability definitions, requirements, and parameters, see: https://en.wikipedia.org/wiki/High_availability.
RPO and RTO
Disaster recovery specialists examine the impact possibilities and the needs for availability in terms of recovery point (also referred to as RPO) and recovery time (also referred to as RTO; Figure 1-7).
Figure 1-7 RPO vs. RTO
Recovery Point Objective
RPO is the acceptable amount of data loss, specified as a maximum percentage of data or a maximum length of time, due to an event or an incident. This is the point or state of data to which the recovery process of a given solution must be capable of returning. In terms of backups, the recovery point provides the information necessary to determine the frequency of backup operations. RPO provides answers to these questions:

How much data is acceptable to be unprotected?
To what point in time do you need to recover (e.g., 24 hours, 1 hour, or several seconds)?
Recovery Time Objective
RTO defines the length of time necessary to restore system operations after a disaster or a disruption of service. RTO includes, at minimum, the time to correct the situation ("break fix") and to restore any data. It can, however, include factors such as detection, troubleshooting, testing, and communication to the users.
Recovery time = environment break fix + restore from an alternate source (backup disk or tape)
RTO provides answers to these questions:
How long is the customer or user willing to wait for the data or application?
How long can you tolerate downtime?
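The recovery relationship above (environment break fix plus restore time) can be sketched numerically. In this sketch, the break-fix time and restore throughput are purely hypothetical example figures, not vendor specifications:

```python
# Rough RTO estimate: environment break fix + restore from an alternate source.
# All input figures below are hypothetical examples.

def estimate_rto_hours(break_fix_hours: float,
                       data_to_restore_gb: float,
                       restore_throughput_gb_per_hour: float) -> float:
    """RTO = time to fix the environment + time to restore the data."""
    restore_hours = data_to_restore_gb / restore_throughput_gb_per_hour
    return break_fix_hours + restore_hours

# Example: 2 h to repair hardware, then 1500 GB restored at 500 GB/h
print(estimate_rto_hours(2.0, 1500, 500))  # 5.0 hours
```

Even this toy model shows why restore throughput, not just backup speed, dominates the achievable RTO for large data sets.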
Both RPO and RTO represent recovery objectives, or goals. These are not mandates and are usually not treated as absolute requirements. Most disaster recovery specialists and businesses select availability and recovery tactics that do not necessarily (or always) meet RPO and RTO, but get close. The availability requirements of a given solution or environment dictate the RTO and provide the basis for determining these backup parameters:

• The type of backup
• The number of restores that are necessary
• The location of the backup data
• The speed and type of access to the backup data

Here are a few questions that you should ask your customers (and collect answers for) when defining the appropriate protection and recovery strategy:

In terms of the operational impact of downtime:
o What is more important, a fast recovery or a recovery to an exact state prior to the failure? Or are both important?
o What is the impact on operations relative to the recovery point? If you do not resume where you left off, will the loss be inconvenient, damaging, or catastrophic?
o _________________________________________________________________ (add your own)
o _________________________________________________________________ (add your own)
o _________________________________________________________________ (add your own)

In terms of the business impact of downtime:
o What is the business impact of the recovery time? If you do not resume processing within seconds/minutes/hours/days, will it be inconvenient, damaging, or catastrophic to the business?
o What is the most effective and efficient method to recover your information?
o _________________________________________________________________ (add your own)
o _________________________________________________________________ (add your own)
o _________________________________________________________________ (add your own)

Careful and complete assessment of the RPO and the RTO helps you define the protection and recovery strategy, and decide which availability technologies, products, and solutions to recommend and use.
As you can imagine, the list of availability approaches, technologies, products, and solutions is long. There are hardware solutions, software solutions, and combinations of both. There are replication, clustering, snapshots, clones, and RAID, to name just a few. Some protect against single points of failure; others protect against a wide range of incidents and disasters.
Table 1.2 provides a quick overview of some options that may be used to achieve the required levels of availability.
Table 1-2 Options that may be used to achieve availability
To protect against | Technology that can be used
Power outages | UPS, generators
Component failures | Fault tolerance, RAID
Failure of the entire system, operating system, or interconnects | Fault tolerance, clusters
Administrative, planned downtime | Clusters
Regional or metropolitan downtimes | Remote copy
Redundant Array of Inexpensive (or Independent) Disks
The first priority when working with data is to protect it against disk failures. Considering the high price of hard drives a few decades ago, companies struggled with two basic problems—disk reliability and disk size (or the available storage space). As a result, a technology called Redundant Array of Inexpensive (or Independent) Disks (RAID) was created. The purpose of RAID is to combine smaller disk drives into larger ones, to introduce redundancy into a system to protect it against disk drive failures, or both. Several types of RAID configurations, called RAID levels, were created. These are the most common:
RAID 0 (striping)—is used to create larger virtual disks from smaller physical ones. Data blocks are spread (striped) across the physical member disks with no redundancy. RAID 0 requires at least two physical disk drives. An advantage of RAID 0 (Figure 1-8) is its performance—it increases the number of spindles used to service I/O requests. However, because it provides no data protection, you should never use RAID 0 in a business-critical system.
Figure 1-8 RAID 0
RAID 1 (mirroring)—is used to protect the data by making writes to two identical disk drives at the same time, thus keeping an exact copy in case of a disk failure. RAID 1 (Figure 1-9) requires an even number of physical disk drives, with a minimum of two. It provides good performance and redundancy (it can withstand the failure of multiple disk drives, as long as they are not mirrored to each other). Its drawback is that the usable disk space is reduced by 50% due to the redundant copy.

Figure 1-9 RAID 1

RAID 5 (striping with parity)—was created as a cheaper alternative to mirroring, requiring fewer disk drives in the array while still being able to sustain the loss of a single disk drive without losing data. RAID 5 requires at least three disk drives, provides good performance thanks to its striped data blocks, and good redundancy (using distributed parity). It also provides 67%–93% of usable disk storage. RAID 5 (Figure 1-10) is a great, cost-effective choice for predominantly read-intensive environments, but suffers from slow write operations.
Figure 1-10 RAID 5
RAID 6 (striping with double parity)—is similar to RAID 5, but it can sustain the loss of two disk drives while keeping the data safe. It uses block-level striping for good performance and creates two parity blocks for each data stripe. It requires at least four disk drives. RAID 6 (Figure 1-11) is also more complicated to implement in the RAID controller or in software.
Figure 1-11 RAID 6
RAID 10 (combining mirroring and striping)—is a hybrid of RAID 1 and 0. It protects data by maintaining a mirror on the secondary set of disk drives while using striping across each set to improve data transfers. RAID 10 (Figure 1.12) requires at least four disk drives and an even number of them. It provides excellent performance and redundancy, which is ideal for mission-critical applications, at the cost of losing 50% of usable disk drive space.
Figure 1-12 RAID 10
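The usable-capacity figures quoted for each RAID level above follow directly from the standard capacity formulas. Here is a small sketch that computes the usable fraction of raw capacity for n identical drives, enforcing each level's minimum drive count:

```python
# Usable fraction of raw capacity for the RAID levels described above,
# given n identical drives. Standard textbook formulas.

def usable_fraction(level: int, n: int) -> float:
    if level == 0:
        if n < 2: raise ValueError("RAID 0 needs >= 2 drives")
        return 1.0                      # striping only, no redundancy
    if level == 1:
        if n < 2 or n % 2: raise ValueError("RAID 1 needs an even number >= 2")
        return 0.5                      # every block is mirrored
    if level == 5:
        if n < 3: raise ValueError("RAID 5 needs >= 3 drives")
        return (n - 1) / n              # one drive's worth of parity
    if level == 6:
        if n < 4: raise ValueError("RAID 6 needs >= 4 drives")
        return (n - 2) / n              # two drives' worth of parity
    if level == 10:
        if n < 4 or n % 2: raise ValueError("RAID 10 needs an even number >= 4")
        return 0.5                      # mirrored stripes
    raise ValueError("unsupported RAID level")

# RAID 5 usable share ranges from 67% (3 drives) toward 93% (14 drives),
# matching the 67%-93% range quoted above.
print(round(usable_fraction(5, 3), 2), round(usable_fraction(5, 14), 2))
```

This also makes the RAID 1/RAID 10 tradeoff explicit: both return 0.5, the 50% capacity loss mentioned in their descriptions.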
For more information about standard RAID levels, go to:
RAID technology is necessary for system uptime and data protection against disk drive failure, but it does not replace backups. It does not provide a point-in-time copy of the data, so the information is not protected in case of a major hardware failure, user error, logical corruption, viruses, or cyberattacks. It is also not designed to keep offline or offsite copies of your data, which is the primary purpose of backups.
Snapshot
A snapshot is a point-in-time copy of data created from a set of markers pointing to stored data. Snapshots became possible with advances in Storage Area Networking (SAN) technologies, and they complement the traditional RAID technology by adding backup functionality.
Most snapshot implementations use a technique called copy-on-write. An initial snapshot is taken, and as data changes, the original blocks are preserved before being overwritten, so the snapshot continues to represent the data as it was at the moment the snapshot was taken. Restoring to a specific point in time is possible as long as all iterations of the data are kept. For that reason, snapshots can protect data against corruption (unlike replication, which cannot).

Another common snapshot variant is the clone, or split-mirror, where reference pointers are made to the entire content of a mirrored set of drives, a file system, or a LUN every time the snapshot is created. Clones take longer to create than copy-on-write snapshots because all data is physically copied when the clone is created. There is also an impact on production performance when the clone is created, since the copy process has to access the primary data at the same time as the host.
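The copy-on-write idea can be illustrated with a minimal sketch (a toy model, not any vendor's implementation): when a block on the live volume is about to be overwritten for the first time after the snapshot, the original block is saved first, so the snapshot view stays frozen at the moment it was taken.

```python
# Minimal copy-on-write snapshot sketch. The "volume" is a list of blocks;
# the snapshot saves an original block only on the first write to it.

class CowSnapshot:
    def __init__(self, volume: list):
        self.volume = volume          # live data, still used by the host
        self.saved = {}               # block index -> original contents

    def write(self, index: int, data):
        if index not in self.saved:   # first write since snapshot: preserve original
            self.saved[index] = self.volume[index]
        self.volume[index] = data     # then update the live volume

    def read_snapshot(self, index: int):
        # Snapshot view: saved original if the block changed, else live data
        return self.saved.get(index, self.volume[index])

vol = ["a", "b", "c"]
snap = CowSnapshot(vol)
snap.write(1, "B")                    # live volume changes...
print(vol, snap.read_snapshot(1))     # ...but the snapshot still sees "b"
```

Note the space cost: only changed blocks are duplicated, which is why copy-on-write snapshots are fast to create, while a clone copies every block up front.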
For more information about snapshots, go to:
Replication
Replication is a technique that involves making a point-in-time copy of data on a storage device, which can be physically located in a remote location. Replication is sensitive to RTO and RPO and comes in synchronous and asynchronous forms, where the data transfer to the remote copy happens either immediately or with a short time delay. Both methods create a secondary copy of the data that is identical to the primary copy, with synchronous replication solutions achieving this in real time.
With replication, any data corruption or user file deletion is immediately (or very quickly) replicated to the secondary copy, making it ineffective as a backup method.
Another point to remember with replication is that only one copy of the data is kept at the secondary location. The replicated copy does not include historical versions of the data from preceding days, weeks, or months, as is the case with backup.
For more information about replication, go to: https://en.wikipedia.org/wiki/Replication_(computing).
Calculating a system's availability
To calculate the overall availability of a system, you must know the system's overall MTBF figure (based on the MTBFs of its individual components) and its MTTR. Then, you can use this formula:

Availability = MTBF / (MTBF + MTTR)
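Availability is computed as MTBF / (MTBF + MTTR). A quick sketch, expressed as a percentage (the MTBF and MTTR figures in the example are hypothetical):

```python
# Availability = MTBF / (MTBF + MTTR), expressed as a percentage.

def availability_pct(mtbf_hours: float, mttr_hours: float) -> float:
    return 100.0 * mtbf_hours / (mtbf_hours + mttr_hours)

# Example (hypothetical figures): MTBF of 10,000 h with a 1 h MTTR
print(round(availability_pct(10_000, 1), 3))  # 99.99
```

The formula makes the two levers of availability explicit: increase MTBF (more reliable components) or decrease MTTR (faster detection and recovery), the same two metrics discussed earlier in the chapter.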
Planning for availability
It is often challenging to determine which level of availability is best for a given business environment, for two reasons:

1. It is difficult to estimate the actual levels of availability that are required to meet given service levels while also meeting budgetary and implementation scheduling requirements.
2. Even if you are successful with the initial estimates, the system usage patterns often change and shift from one area to another, distorting the original assumptions and parameters.

When planning for availability, consider seeking answers to these questions:

• Who are the customers or the users of the system, and what are their expectations and requirements?
• How much downtime are they willing to accept? How much uptime are they willing to pay for?
• What and who depends on the service your system provides?
• What is the impact of downtime?
• How often do you expect the system to be down? How quickly can you recover from failures?
• What are the alternatives and their cost/benefit analysis?
• What is the budget and implementation timeframe?
• What is the required skillset for its implementation, operation, and maintenance?
• _____________________________________ (add your own)
• _____________________________________ (add your own)
• _____________________________________ (add your own)
Planning and designing for a specific availability level pose challenges and require tradeoffs. There is a wide selection of technologies, products, and solutions on the market today, each with a different methodology, architecture, requirements, benefits, and cost. No single availability solution fits every situation. Therefore, a good solution always depends on a combination of business requirements, application-specific parameters, implementation timetable, and available budget.
Learning check questions
Reinforce your knowledge and understanding of the topics just covered by completing this learning check:

1. Match the availability level with its correct description:
2. What is a typical expected downtime of a fault-resilient system?
a) None
b) 1 hour per year
c) 8.5 hours per year
d) 3.5 days per year
3. Which RTO goal should be expected from a highly available system?
a) Less than one second
b) Seconds to minutes
c) Minutes to hours
d) Hours to days
4. What does the following formula calculate?
a) RTO
b) RPO
c) System availability
d) Expected downtime
5. Which RAID levels can withstand a failure of two disk drives? (Select all that apply.)
a) RAID 0
b) RAID 1
c) RAID 5
d) RAID 6
e) RAID 10
6. Which data protection technique creates reference pointers to the entire content of duplicate drives, file systems, or LUNs every time a snapshot is made?
a) Archiving
b) Copy-on-write
c) Replication
d) Split-mirror
e) Deduplication
Learning check answers
This section contains answers to the learning check questions.

1. Match the availability level with its correct description:
2. What is a typical expected downtime of a fault-resilient system?
a) None
b) 1 hour per year
c) 8.5 hours per year
d) 3.5 days per year
3. Which RTO goal should be expected from a highly available system?
a) Less than one second
b) Seconds to minutes
c) Minutes to hours
d) Hours to days
4. What does the following formula calculate?
a) RTO
b) RPO
c) System availability
d) Expected downtime
5. Which RAID levels can withstand a failure of two disk drives? (Select all that apply.)
a) RAID 0
b) RAID 1
c) RAID 5
d) RAID 6
e) RAID 10
6. Which data protection technique creates reference pointers to the entire content of duplicate drives, file systems, or LUNs every time a snapshot is made?
a) Archiving
b) Copy-on-write
c) Replication
d) Split-mirror
e) Deduplication
Human errors, software glitches, hardware failures, and human-induced or natural disasters all affect IT operations. Downtime due to these disruptions can result in lost productivity, revenue, market share, customers, and employee satisfaction, and in some extreme situations, even in loss of life.

MTBF and MTTR are two metrics that deal with downtime prediction and recovery.
Availability is the opposite of downtime. It measures the time during which a system runs and delivers data, applications, or services. System availability varies from conventional systems to fully fault-tolerant systems. So does their cost.
Backups provide secondary copies of your data in case it is lost or corrupted, whereas RAID (except for RAID 0) provides some level of redundancy in case of a disk drive failure.
Various RAID technologies exist, ranging from RAID 0 to RAID 10, each with different levels of protection and performance characteristics.
RPO and RTO define the boundaries under which a system and its data are brought back into service after downtime.
Redundant and fault-tolerant components address high-availability demands. Backup protects your data against multiple hardware-component failures, human errors, viruses, cyber-attacks, and data corruption.
CHAPTER OBJECTIVES
In this chapter, you will learn to:

• Define a backup, explain its purpose, and describe the basic backup terminology
• List and describe backup types and explain which backup type is appropriate for a given situation
• List and describe the most common backup tape rotation schemes and related topics such as the overwrite protection and append period
• Explain the restore operations
• Define archiving and contrast it with backing up
• Describe data tiering and its benefits
• Explain the purpose of a backup and recovery plan and define the recommended steps in a sample backup and recovery plan
• List and describe common tape media drives and technologies
In the previous chapter, you learned about downtime and availability: what causes downtime, what impact it has on IT operations and on the business, and how it translates to cost. You also learned how to achieve availability, how to calculate it, and how to plan for it. The previous chapter also made a clear distinction between availability technologies, such as RAID and replication, and explained why backups are critical in protecting business applications and data. This chapter focuses on backup strategies and related topics such as restore, archiving, and data tiering. As you may already know or suspect, backup is a wide topic, which must be understood to properly plan your backup strategies and fit them into your overall protection implementation. Why? Because your backup is only as good as the data you can restore from it.
Learner Assessment
Before continuing with this chapter, take a few minutes to assess what you might already know by defining these topics:
Note
The purpose of this exercise is to get you thinking about these topics. Do not worry if you do not know the answers; just put in your best effort. Each topic will be explained later in this chapter.

In your own terms, define backup and its purpose:
___________________________________________________________
___________________________________________________________
___________________________________________________________
___________________________________________________________
___________________________________________________________

In your own terms, define these concepts:

Backup consistency
___________________________________________________________
___________________________________________________________

Backup application
___________________________________________________________
___________________________________________________________

Backup device
___________________________________________________________
___________________________________________________________

Deduplication
___________________________________________________________
___________________________________________________________
What is a backup?
Now, consider these questions regarding backup:

• Is backup a protection against hardware failures?
• Does backup address disk-related problems?
• Is backup the best protection against loss of data due to hardware failures?
• Does backup always have the latest version of the data?
• Is backup still vulnerable to some data loss?

Everyone would agree that a backup is an insurance policy against data loss. A company depends on its information to stay in business, whether the information takes the form of email messages, a customer database, research and development projects, or even unstructured data. An individual depends on his or her information for work-related tasks. Even in our daily personal lives, we rely on information that we cannot (or do not want to) lose.
Figure 2-1 Backup
A backup (Figure 2.1) is the process of creating a copy of data, applications, or operating systems on a secondary storage medium. This copy is created, stored, and kept for future use in case the original is destroyed or corrupted.
According to the Dictionary of Computer Terms:
“Backup is the activity of copying files or databases so that they will be preserved in case of equipment failure or other catastrophes. Backup is usually a routine part of the operation of large businesses with servers as well as the administrators of smaller business companies. For personal computer users, backup is also necessary but often neglected.”
In most cases, the source is data on a disk, such as files, directories, databases, and applications. The target is another disk or a tape in some type of tape drive or tape library.
Since the backup is expected to be used for disaster recovery, it must be consistent. Consistency means that the data copy is the same as the original and, when restored, is identical to it. Consistency is verified by checks such as byte-by-byte comparison, cyclic redundancy checks, or reduced verification.
The software that copies the data to its destination is called the backup application. The destination is called the backup device, such as a tape drive, with a medium onto which the copy of the data is written. The backup device can be physical or virtual. There are many benefits to virtualizing the target backup device, one of the most significant being the potential to take advantage of deduplication functionality, and another being the ability to overcome the limitations of sequential physical devices, such as speed and size.

A customer's (or your own company's) recovery requirements help you decide on the best way to back up the data, and there is rarely a case where all data should be treated the same way. This means that retention requirements (how long to keep the backup) vary for different types of data.
Backup purpose
As already mentioned, backup is an insurance policy against data loss. It plays a critical role in:

• Disaster recovery (enables data to be restored after a disaster)
• Data archival (consists of files or records that have been selected for permanent or long-term preservation)
• Operational backup (enables a restore of a small or selected number of objects, such as files or emails, after they have been accidentally deleted or corrupted)
• Recovery from data corruption (such as databases)
• Compliance with corporate guidelines and/or local, state, and federal requirements for data retention
What should a backup consist of?
There is no rule on what must be backed up and what can be left unprotected. Some good candidates to be included in the backup sets are the following:

• Operating environments
o Laptop operating systems
o Desktop operating systems
o Server operating systems (physical and virtual)
• Applications
o Enterprise Resource Planning (ERP) applications such as SAP, Oracle Applications, and PeopleSoft
o Customer Relationship Management (CRM) applications such as SAP Digital, Siebel, Salesforce, and Microsoft Dynamics CRM
o Databases, such as Oracle, UDP, and Microsoft SQL Server
o Messaging applications, such as Microsoft Exchange
• User and application data (for all of the above operating environments and applications)
• Logs and journals
o Application transaction logs
o Database journals
o File system journals

Always consider including in your backup sets everything that you cannot afford to lose.
Backup types
This section defines the archive bit, an important property of operating system files that plays a key role in performing backups, and then lists and describes various types of backups.
Archive bit
When a file is created or changed, the operating system maintains a flag called the archive bit. The backup software uses this bit to determine whether the file has been backed up before. As soon as the file is backed up using either a full or an incremental backup, this bit is turned off, indicating to the system that the file was saved to a secondary medium. If the file is changed again, the bit is turned back on, and the file is flagged to be backed up again by the next full or incremental backup.

Differential backups include only files that were created or modified since the last full backup. When a differential backup is performed, no changes are made to the archive bit.
Archive bit on: the file must be backed up
Archive bit off: the file has been backed up
A copy is an option in many backup applications. Its function is similar to a full backup; however, the archive bit is not modified.
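The archive-bit rules described above can be summarized in a short sketch: full and copy backups select every file; incremental and differential backups select only files with the bit set; and only full and incremental backups clear the bit afterward.

```python
# Sketch of how the archive bit drives each backup type:
# full/incremental clear the bit; differential/copy leave it untouched.

def run_backup(files: dict, backup_type: str) -> list:
    """files maps name -> archive_bit; returns the names written to the backup set."""
    if backup_type in ("full", "copy"):
        selected = list(files)                             # everything, regardless of bit
    elif backup_type in ("incremental", "differential"):
        selected = [f for f, bit in files.items() if bit]  # only changed files
    else:
        raise ValueError("unknown backup type")
    if backup_type in ("full", "incremental"):             # these two clear the bit
        for f in selected:
            files[f] = False
    return selected

files = {"a.doc": True, "b.doc": False}
print(run_backup(files, "differential"))  # ['a.doc']
print(files["a.doc"])                     # True: differential left the bit set
```

Because a differential backup never clears the bit, each differential keeps growing until the next full backup resets all the bits, which is exactly the behavior tape rotation schemes rely on.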
Learner assessment
You may already be familiar with many types of backups. For fun, and to test your preexisting knowledge, see if you can answer these questions:
Note
The purpose of this exercise is for you to assess what you already know. Do not worry if you do not know the answers; just put in your best effort. Each topic will be explained later in this chapter.

Which backup type is described by these statements?
• It is also called a normal backup.
• It saves the entire content of the source disk on a file-by-file basis.
• After backing up individual files, it turns the archive bit off.
a) Online backup
b) Offline backup
c) File-based backup
d) Image-based backup
e) Full backup

Which backup type is described by these statements?
• It enables restoring individual files to their original locations.
• It provides random access to individual files for a quick restore.
• It requires a full path to be saved in the backup set.
• It can create significant operating system overhead with a large number of small files.
a) Online backup
b) Offline backup
c) File-based backup
d) Image-based backup
e) Full backup

Which backup type is described by these statements?
• It supports 24x7 operations because it does not require a system shutdown.
• It can create a degradation in server and network performance.
• It may create data integrity and inconsistency issues.
a) Online backup
b) Offline backup
c) File-based backup
d) Image-based backup
e) Full backup

Which backup type is described by these statements?
• It saves the entire disk at the block level.
• It is often called physical backup because it dumps the complete file system to a single file.
• It provides the fastest method to save or recover a complete file system.
• It requires a restore to an identical disk.
a) User-defined backup
b) Offline backup
c) File-based backup
d) Image-based backup
e) Full backup

Which backup type is described by these statements?
• It is performed upon a special request, which defines a set of files to back up.
• It is performed in addition to and outside of the standard tape rotation scheme.
• It is usually performed before a system or software upgrade or when necessary to save certain files.
a) User-defined backup
b) Offline backup
c) File-based backup
d) Image-based backup
e) Full backup
Learner Assessment Answers
This section contains answers to this chapter's learner assessment questions.

Which backup type is described by these statements?
• It is also called a normal backup.
• It saves the entire content of the source disk on a file-by-file basis.
• After backing up individual files, it turns the archive bit off.
a) Online backup
b) Offline backup
c) File-based backup
d) Image-based backup
e) Full backup

Which backup type is described by these statements?
• It enables restoring individual files to their original locations.
• It provides random access to individual files for a quick restore.
• It requires a full path to be saved in the backup set.
• It can create significant operating system overhead with a large number of small files.
a) Online backup
b) Offline backup
c) File-based backup
d) Image-based backup
e) Full backup

Which backup type is described by these statements?
• It supports 24x7 operations because it does not require a system shutdown.
• It can create a degradation in server and network performance.
• It may create data integrity and inconsistency issues.
a) Online backup
b) Offline backup
c) File-based backup
d) Image-based backup
e) Full backup

Which backup type is described by these statements?
• It saves the entire disk at the block level.
• It is often called physical backup because it dumps the complete file system to a single file.
• It provides the fastest method to save or recover a complete file system.
• It requires a restore to an identical disk.
a) User-defined backup
b) Offline backup
c) File-based backup
d) Image-based backup
e) Full backup

Which backup type is described by these statements?
• It is performed upon a special request, which defines a set of files to back up.
• It is performed in addition to and outside of the standard tape rotation scheme.
• It is usually performed before a system or a software upgrade or when necessary to save certain files.
a) User-defined backup
b) Offline backup
c) File-based backup
d) Image-based backup
e) Full backup
Offline backup
For an offline backup, the administrator takes the applications and their data offline for the time required to do the backup. During the backup window, the host (such as a server or a storage array) is not available to its users, which is why this type of backup is usually done at times when user demand is at its lowest (such as at night or on weekends).
You must accurately calculate the overall system performance and the volume of backup data to determine whether the time needed for the operation fits into the allotted backup window, avoiding extended offline periods for the host.
From the backup perspective, offline backup is the easiest and most secure form of performing backups, because the backup software usually does not have to deal with open files or performance issues. An offline backup can be complete or partial.
Online backup
As more companies move toward 24-hour, 7-days-per-week operations, no clear backup window for offline backups exists. Online backup is the alternative; it is performed during normal operating hours with the host fully available to its users. Usually both the users and the backup environment see a degradation in overall network and host performance during the backup window, which affects productivity. Because user and system files can be open during online backups, there is a risk to data integrity, which must be handled appropriately by the backup software.
Inconsistent files can pose a serious threat to data integrity when restored from the backup medium. Incompletely backed-up files (open files with ongoing write operations while the backup software is copying them to tape) may contain inconsistent data, which could be spread across various databases in the enterprise, depending on how the databases are linked to each other. An example of inconsistency is database indices that no longer reflect the actual content of the database. Your backup software should at least offer options for handling open files: copy them anyway, mark them as suspect in a log file, or skip them and inform the administrator via a log file or an alert. Some backup programs automatically retry open files after a certain time, when there is a chance that the file has been closed in the meantime. An online backup can also be complete or partial.
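The open-file policies just described (copy anyway, skip and log, or retry later) can be sketched as follows. This is a minimal illustration, not how any real backup product is implemented: detecting an open file is reduced here to an injected `is_open` predicate, whereas real software uses OS-specific locking and snapshot APIs.

```python
import shutil
import time

def backup_file(src, dst, is_open, retries=3, delay=0.1, log=None):
    """Try to copy src to dst, retrying while the file appears open.

    Appends skip/retry decisions to `log`, mirroring the log-file or
    alert behavior described in the text. Returns True on success.
    """
    log = log if log is not None else []
    for attempt in range(retries):
        if not is_open(src):
            shutil.copy2(src, dst)  # copy the file with its metadata
            return True
        log.append(f"attempt {attempt + 1}: {src} open, will retry")
        time.sleep(delay)
    log.append(f"skipped {src}: still open after {retries} attempts")
    return False
```

A "copy anyway" policy would simply ignore `is_open`; the retry loop implements the automatic-retry behavior mentioned above.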
File-by-file backup (file-based backup)
With file-by-file backups, the information needed to retrieve a single file from the backup medium is retained, and it is possible to restore files to their correct locations on the disk. The full path information for every file must be saved on the backup medium, which, in an environment with a large number of small files, can cause significant overhead for the operating system. The operating system must access the file allocation table many times, which can reduce backup throughput by as much as 90%. Performance can also be influenced by other factors such as disk fragmentation. However, file-by-file backup provides random access to individual files for a quick restore. Furthermore, this type of backup is used by all tape rotation schemes to back up selected files.
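The defining property above (a full path stored per file, enabling random-access restore of a single file) can be sketched in a few lines. This in-memory catalog is purely illustrative; real file-based backup software also records ownership, permissions, timestamps, and writes to tape or disk media rather than a dictionary.

```python
import os

def build_catalog(root):
    """Walk the tree and record every file's content under its full path."""
    catalog = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            full_path = os.path.join(dirpath, name)
            with open(full_path, "rb") as f:
                catalog[full_path] = f.read()  # full path is the restore key
    return catalog

def restore_file(catalog, full_path):
    """Random access: restore a single file to its original location."""
    os.makedirs(os.path.dirname(full_path), exist_ok=True)
    with open(full_path, "wb") as f:
        f.write(catalog[full_path])
```

The per-file walk also shows where the overhead comes from: every small file costs separate metadata lookups and open/close operations.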
Image backup (image-based backup)
An image backup:
• Saves the entire disk at the block level.
• Provides the highest backup performance due to large sequential I/Os (the tape drive is typically kept streaming).
• Creates a snapshot of the disk at the time of the backup.
• Requires a restore to an identical disk.
• Is used for archiving because it does not restore individual files.
Image backups may also be referred to as physical backups, a term used to describe a dump of the complete file system to a single backup image file. Consequently, only the complete image can be restored.
When the image backup is performed locally, the tape drive typically keeps streaming. If the image backup is done over the network to a distant tape drive, other activities on the network may influence the tape drive’s performance.
An image backup is usually the fastest method to save or to recover a complete file system. In environments with an exceptionally high number of small files, this could be your best choice of backup methods.
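The block-level character of an image backup can be sketched as a plain sequential copy: the source is read as a stream of fixed-size blocks with no knowledge of individual files, which is why only the complete image can be restored. The block size and file paths below are illustrative assumptions; real tools operate on raw device paths with OS-level access.

```python
BLOCK_SIZE = 64 * 1024  # large sequential I/Os keep a tape drive streaming

def image_copy(src_path, dst_path, block_size=BLOCK_SIZE):
    """Copy src to dst block by block; return the number of bytes copied."""
    copied = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            block = src.read(block_size)
            if not block:  # end of source stream
                break
            dst.write(block)
            copied += len(block)
    return copied
```

Note that nothing in the loop knows about files or directories: the same code copies a disk image containing millions of small files at full sequential speed, which is exactly why image backup excels in that scenario.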
User-defined backupA user-defined backup is Performed upon a special request from a user who defines which set of files to back up Performed in addition to and outside of the standard tape rotation scheme Defined by the backup set which includes user files Performed before a system or a software upgrade A user-defined backup usually means a special backup request where the user defines which files need to be copied (e.g., for a critical project). This can be the complete user account with all applications, configurations, and customizations, which can be especially useful when the user plans to do a major system change like operating system or software upgrades. Another request could consist of just the information base of the user, which includes certain applications and their data, or all information created only within a certain timeframe.
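The essence of a user-defined backup is that the user, not the rotation scheme, supplies the file set. A minimal sketch, assuming the request arrives as a plain list of paths and is archived ad hoc (the file names and archive name are illustrative):

```python
import tarfile

def user_defined_backup(file_list, archive_path):
    """Archive exactly the files the user requested, outside any rotation scheme."""
    with tarfile.open(archive_path, "w:gz") as tar:
        for path in file_list:
            tar.add(path)  # each requested file goes into the ad hoc archive
    return archive_path
```

A typical use, per the text above, would be archiving a project's files immediately before a software upgrade, independent of the nightly schedule.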
Full backup
A full backup is also called a complete or normal backup. It saves the entire content of the source disk to the backup medium on a file-by-file basis. In networked environments, this source disk could be the disk