Creating a Robust Testing Program
Beverly Schulz, CBCP
Agenda
• Overview of Testing
• Notification Tests
• Tabletop or Walk-through Tests
• Simulations
– Technology Outage Tests
– Third Party Outage Tests
– Workplace Outage Tests
– Workforce Outage Tests
• Reporting
Overview of Testing
• Purpose of Testing is to:
– Answer the question: “Can the business recover?”
– Reinforce training of recovery team members
– Test the performance of the recovered systems, people, etc.
– Expose issues which may prevent recovery
– Get participants excited about fixing issues
– Demonstrate improvement year after year
• Test Types:
– Notification - A test of the phone numbers within a business continuity plan
– Tabletop or Walk-through Exercises - A test of the business continuity plan where participants discuss their response to a simulated disaster scenario
– Simulation Exercises - A test of IT, business, and/or vendor recovery strategies, where participants perform recovery activities
Notification Tests
• Preparation, Execution, and Follow-up:
– Alert team to upcoming exercise*
– Develop script for notification calls*; set up in automated notification tool
– Conduct exercise and record results
– Follow up on any incorrect contact information
• Things to Consider:
– Set goals / objectives (e.g. 75% respond within a certain number of hours) and increase those over time
– Include just business area recovery team contacts, or expand to cover all employees in the business area
– Group by geography, contacting all employees in a city/state/region
– Group by business area to allow ease in reporting
– Decrease the time allowed for response if multiple modalities are used
• Scoring:
– % of participants responding
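The scoring rule above is a single ratio. A minimal sketch in Python, assuming the exercise results are available as a mapping from participant to whether they acknowledged in time; the function name, the sample data, and the 75% target are illustrative assumptions, not features of any particular notification tool:

```python
def response_rate(responses: dict[str, bool]) -> float:
    """Return the percentage of participants who acknowledged the notification."""
    if not responses:
        return 0.0
    responded = sum(1 for acknowledged in responses.values() if acknowledged)
    return 100.0 * responded / len(responses)


# Hypothetical exercise results: True means the person acknowledged in time.
results = {"alice": True, "bob": True, "carol": False, "dave": True}
target = 75.0  # example goal; raise it in later exercises

rate = response_rate(results)
print(f"{rate:.0f}% responded; target met: {rate >= target}")
```

Grouping the input by business area or geography before calling the function gives the per-group figures mentioned under Things to Consider.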
Tabletop Tests
• Preparation, Execution, and Follow-up:
– Read the plan!
– Write a scenario that will test weaknesses within the plan
• Things to Consider:
– Use scenario injects (i.e. monkey wrenches)*
– Include multiple, inter-dependent groups with differing RTOs
– Conduct executive level exercises
– Conduct surprise exercises
• Scoring:
– Measure success based on percentage of answers known
– Measure success based on improvement opportunities identified
* - see Appendix
Simulation Tests
• Types of Simulation Tests:
– Loss of technology or telephony
– Loss of critical vendor
– Loss of workplace
– Loss of workforce or key resource
• Things to Consider:
– Test the “workplace unavailable” scenario for all business areas within a building at the same time
– Ensure the duration allows problems to surface
– Test a variety of scenarios, not just workplace outage
• Scoring:
– Measure success based on the percentage of functions or applications recovered within their Recovery Time Objective
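The RTO-based score above can be computed from a simple log of recovery times. A minimal sketch in Python, assuming each function or application is recorded with its Recovery Time Objective and the time actually taken during the test; the class, field names, and sample values are hypothetical:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RecoveryResult:
    name: str
    rto_hours: float               # Recovery Time Objective
    actual_hours: Optional[float]  # None if not recovered during the test


def pct_within_rto(results: list[RecoveryResult]) -> float:
    """Percentage of items recovered within their Recovery Time Objective."""
    if not results:
        return 0.0
    met = sum(
        1
        for r in results
        if r.actual_hours is not None and r.actual_hours <= r.rto_hours
    )
    return 100.0 * met / len(results)


# Hypothetical results from one simulation test.
test_results = [
    RecoveryResult("Payments", rto_hours=4, actual_hours=3.5),
    RecoveryResult("Call center", rto_hours=8, actual_hours=9),
    RecoveryResult("Payroll", rto_hours=24, actual_hours=20),
    RecoveryResult("Reporting", rto_hours=48, actual_hours=None),
]
print(f"{pct_within_rto(test_results):.0f}% recovered within RTO")
```

Items that were never recovered count against the score, which is why the actual time is optional rather than defaulted to zero.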
Simulation – Loss of Technology
a.k.a. Disaster Recovery Tests
• Things to Consider:
– Business function / process validation
– Application performance statistics
– Crisis Management involvement
– Third Party participation
– Measurements of success
– Schedule notification and tabletop exercises right before the Disaster Recovery exercise
– Conduct tests for each major data center
– Gradually increase scope so that more than just the most critical applications are tested
Disaster Recovery Tests, continued
Business function / process validation
[Diagram: business functions (Customer Purchase, Customer Statement) shown with Applications #1 through #5]
Disaster Recovery Tests, continued
Application performance statistics
[Table columns: Volume / Utilization Metric; (T) Production; (T) Test; Comparison]
Disaster Recovery Tests, continued
• Setting Success Rate Targets
– Use exercise sponsor or BC Committee to set target success rates
– Revisit targets on a regular basis (raise the bar)
• Example Measurements of Success for IT
– % of Applications meeting Recovery Time Objectives
– % of Applications recovered before end of exercise (even if late)
– % of Applications meeting Recovery Point Objectives
• Measurements of Success for the Business
– % of Functions meeting Recovery Time Objectives
– Impact Rating* - Allow business to rate their own success using a pre-defined impact scale:
    No impact to business (better results than expected)
    Low impact to business
    Moderate impact to business
    High impact to business
    Very high impact to business
* - see Appendix
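The three example IT measurements can be derived from one set of exercise records. A minimal sketch in Python, assuming each application is logged with its objectives, its recovery time, and the amount of data lost; all names and sample values are hypothetical:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AppResult:
    name: str
    rto_hours: float                  # Recovery Time Objective
    rpo_hours: float                  # Recovery Point Objective
    recovery_hours: Optional[float]   # None = not recovered before exercise end
    data_loss_hours: Optional[float]  # None = could not be measured


def pct(count: int, total: int) -> float:
    return 100.0 * count / total if total else 0.0


def it_measurements(apps: list[AppResult]) -> dict[str, float]:
    """Compute the three IT success percentages for one DR exercise."""
    total = len(apps)
    met_rto = sum(
        1 for a in apps
        if a.recovery_hours is not None and a.recovery_hours <= a.rto_hours
    )
    # Recovered at all before the exercise ended, even if later than the RTO.
    recovered = sum(1 for a in apps if a.recovery_hours is not None)
    met_rpo = sum(
        1 for a in apps
        if a.data_loss_hours is not None and a.data_loss_hours <= a.rpo_hours
    )
    return {
        "met_rto_pct": pct(met_rto, total),
        "recovered_pct": pct(recovered, total),
        "met_rpo_pct": pct(met_rpo, total),
    }


# Hypothetical exercise records.
apps = [
    AppResult("Billing", 4, 1, recovery_hours=3, data_loss_hours=0.5),
    AppResult("Statements", 8, 4, recovery_hours=10, data_loss_hours=2),
    AppResult("CRM", 24, 24, recovery_hours=None, data_loss_hours=None),
    AppResult("Portal", 4, 1, recovery_hours=4, data_loss_hours=2),
]
print(it_measurements(apps))
```

Comparing each percentage against the target set by the exercise sponsor or BC Committee shows whether the bar was cleared for that exercise.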
Simulation – Loss of Third Party
• Things to Consider:
– Test the ability of the business to respond to the loss of the third party, OR test the third party’s ability to recover from their own disaster, OR both!
– Start with a simple “ping” test or a tabletop test
– Expand over time to involve logging in to vendor systems during their DR tests, file exchanges, etc.
• Scoring:
– Measure success based on the percentage of Third Party applications meeting Recovery Time and Recovery Point objectives
– Measure success based on the percentage of Third Party functions meeting Recovery Time Objective
– Give extra credit for business participation in third party’s test
Simulation – Loss of Workplace
• Things to Consider:
– Allow real events to count as test credit
– Scope should align to plans, i.e. if plans are built by building, then tests should be by building
• Scoring:
– Measure success based on the percentage of the business area’s people testing the strategy
– Measure success based on the percentage of the functions recovered within Recovery Time Objectives
Simulation – Loss of Workforce
• Things to Consider:
– Allow real events to count as test credit
– Ensure the scope of the test allows problems to surface (e.g. require a minimum of 25% workforce loss)
• Scoring:
– Measure success based on the percentage of the business area’s people testing the strategy
– Measure success based on the percentage of the functions recovered within Recovery Time Objectives
Reporting
• Identify issues resulting from test, assignments for resolutions, and target completion dates
• Include the following within the post test report:
– Scenario summary
– Objectives
– Results
– Lessons learned and recommendations for future tests
– Issues tracking and summary
• Develop a process to track actions to confirm closure
• Include results of testing in Business Continuity metrics
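Tracking actions to closure can start as a short list of records holding the issue, its assignment, and its target completion date. A minimal sketch in Python; the record fields and the sample issues are hypothetical and stand in for whatever issue-tracking tool is actually used:

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Issue:
    description: str
    owner: str            # person assigned to resolve the issue
    target_date: date     # target completion date
    closed: bool = False


def open_overdue(issues: list[Issue], today: date) -> list[Issue]:
    """Return issues that are still open past their target completion date."""
    return [i for i in issues if not i.closed and i.target_date < today]


# Hypothetical issues raised by a test.
issues = [
    Issue("Call tree has outdated numbers", "J. Smith", date(2024, 3, 1)),
    Issue("Alternate site lacks printers", "A. Lee", date(2024, 3, 15), closed=True),
    Issue("Vendor contact list missing", "R. Patel", date(2024, 9, 30)),
]

for issue in open_overdue(issues, today=date(2024, 6, 1)):
    print(f"OVERDUE: {issue.description} (owner: {issue.owner})")
```

Reviewing the overdue list on a regular schedule is one way to confirm closure rather than assuming it.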
Reporting, cont’d
Consider developing metrics for Executive Management to show their ability to recover, based on testing results. Each metric is scored from 1 to 5:
• BC Issues from Testing and Events
– 1: All issues documented appropriately, including level of risk, and actively remediated and updated
– 2: Reflects mixed results between a 1 and a 3
– 3: Issues not registered or not being actively remediated
– 4: Reflects mixed results between a 3 and a 5
– 5: Issues not registered and not being actively remediated
• Notification Testing
– 1: 100-95% response rate
– 2: 94-85% response rate
– 3: 84-75% response rate
– 4: 74-50% response rate
– 5: <50% response rate
• Workforce Strategy Testing
– 1: 100% of Mission Critical plans and >75% of remaining plans scored 2 or 1
– 2: 99-75% of plans scored 2 or 1
– 3: 74-51% of plans scored 2 or 1, or insufficient testing has occurred (i.e. risk is unknown)
– 4: 50-25% of plans scored 2 or 1
– 5: <25% of plans scored 2 or 1
• Workplace Strategy Testing
– 1: 100% of Mission Critical plans and >75% of remaining plans scored 2 or 1
– 2: 99-75% of plans scored 2 or 1
– 3: 74-51% of plans scored 2 or 1, or insufficient testing has occurred (i.e. risk is unknown)
– 4: 50-25% of plans scored 2 or 1
– 5: <25% of plans scored 2 or 1
• Third Party Strategy Testing
– 1: 100% of Mission Critical Third Parties met recovery time and recovery point and >75% of remaining third parties met recovery time and recovery point
– 2: 99-75% of Third Parties met recovery time and recovery point
– 3: 74-51% of plans scored 2 or higher, or insufficient testing has occurred (i.e. risk is unknown)
– 4: 50-25% of Third Parties met recovery time and recovery point
– 5: <25% of Third Parties met recovery time and recovery point
• Disaster Recovery Testing
– 1: 100% of business plans with needs met by app DR exercise results
– 2: 99-90% of business plans with needs met by app DR exercise results
– 3: 89-80% of business plans with needs met by app DR exercise results
– 4: 79-70% of business plans with needs met by app DR exercise results
– 5: <70% of business plans with needs met by app DR exercise results
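The Notification Testing bands on this scorecard translate directly into a lookup from response rate to score. A minimal sketch in Python; the function name is hypothetical, and treating each band's lower bound as inclusive for fractional rates (for example, 94.5% scores a 2) is an assumption, since the slide lists whole percentages only:

```python
def notification_score(response_rate_pct: float) -> int:
    """Map a notification response rate (0-100) to a scorecard value, 1 to 5."""
    if not 0 <= response_rate_pct <= 100:
        raise ValueError("response rate must be between 0 and 100")
    # (lower bound of band, score), checked from the highest band down.
    bands = [(95, 1), (85, 2), (75, 3), (50, 4)]
    for lower_bound, score in bands:
        if response_rate_pct >= lower_bound:
            return score
    return 5  # below 50% response rate


for rate in (100, 94, 84, 74, 49):
    print(f"{rate}% response rate -> score {notification_score(rate)}")
```

The other rows of the scorecard could be handled the same way by swapping in their own band boundaries.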
For more information…
• [email protected]
• Various internet sites:
– www.continuityinsights.com
– www.drii.org
– www.thebci.org
– www.fema.gov
Appendix 1 – Notification Alert
Sample text to use when alerting a group about an upcoming
notification exercise:
Subject: Notification Exercise Required Annually
Per Business Continuity Standards, we are required to perform a Notification Exercise annually. This is performed without advance notice to test the accuracy of contact information, as well as the accessibility of the Recovery Team Members. We will be conducting this exercise before [month/day].
The attached Business Continuity Plan has all of the required information for a successful exercise. During the exercise, you will be contacted via phone and email. The automated system will call all phone numbers you have currently listed in the HR system (main, work, work cell, personal cell, home, etc.) in an effort to reach you. Note: when responding to the phone notification, please wait to hear that your response has been accepted before hanging up; otherwise your response will not be registered. These contact attempts will repeat several times within a 2-hour timeframe until contact is made and a response has been received from the participant.
Please let me know if you have any questions or if you would like additional information about the Notification Exercise.
Appendix 2 – Notification Template
Sample text to use when conducting a notification exercise:
This is the Business Continuity Management team, conducting your annual
Notification Exercise in partnership with your [insert name] business area.
This mandatory exercise is required for all business areas per Business
Continuity Policy. In order for this exercise to be successful, please
acknowledge receipt of this notification by entering 1 on your phone or in the
body of the e-mail.
Thank You.
Appendix 3 - Tabletop Tests
Scenarios and Injects – Natural Disasters and Accidents:
Main Scenario (•) and Injects (–)
• Earthquake, hurricane, flood, blizzard, or tornado
– Roof collapse
– Area roads blocked and local / state travel restrictions are being enforced
– IT is wondering how many computers you will need and what applications you will need loaded on them. They are also inquiring as to any other equipment (fax, printer, copier, phones) you will need. Please respond.
• Fire
– Determined to be arson
– Mold grows due to water used in fire suppression, causing health issues for half of the employees so far. Which functions can be delayed and which can be transferred?
• Sink hole or impassable facility access
– Area roads blocked and local / state travel restrictions are being enforced
• Plane crash or mass transit accident
– Multiple executives on board
– The designated business area decision maker was injured. Who is next in command?
Appendix 3 - Tabletop Tests, cont’d
Infrastructure:
Main Scenario (•) and Injects (–)
• Generator failure
– Fuel supply vendor can’t deliver
• Heating / air conditioning failure
– Associates report health issues
• Network or telecommunications failure
– Determined to be malicious code
• Facility access disruption
– All doors failed open
Natural Disasters and Accidents:
Main Scenario (•) and Injects (–)
• Loss of personnel due to illness
– Determined to be food poisoning from on-site cafeteria
• Third party bankruptcy / hostile takeover
– Choose a single-source vendor
• Internet or cyber incident
– Business critical data being released on Internet
– Time-released cyber attack
• Protests block building access
– Police blockade, tear gas, or injury to customer/employee
Appendix 4 – Impact Rating Samples
• What was the impact to CUSTOMERS?
– No impact = we do not work with customers
– Low impact = minimal inconvenience to customers
– Moderate impact = inconvenienced and irate customers
– High impact = dissatisfied customers, escalating high % of complaints to managers
– Very high impact = customers are closing accounts at an unacceptable rate
• What REGULATORY impacts may have been caused?
– No impact
– Low impact = minor, isolated compliance issues
– Moderate impact = Regulators require issue resolution
– High impact = Regulators publicly warn company
– Very high impact = Regulators take action against company