Presented by
Steve Carroll
Building a Disaster Recovery
Testing Program
Email: [email protected] Phone: 717-256-1865
About Our Speaker
Steve Carroll is a Senior Consultant with Abound Resources. With more than 25 years’ experience as a community financial
institution executive, Steve has worked in a variety of capacities in financial institutions, including consultant and CEO.
Since 1996, Steve has worked as a lead consultant on more than 100 financial institution consulting engagements across the country. His areas of expertise include business continuity planning, risk management, strategic business planning, and strategic
technology planning.
Steve has developed software applications to assist financial institutions in improving their risk management positions, including Abound Resources’ bPLAN Web-based Business Continuity Planning system.
Steve has completed Institute of Financial Education courses at the University of Texas at Austin, the University of Georgia, and the University of Connecticut.
Who We Are
• Management consulting firm for the
Community Financial Institution (CFI) industry
• We empower CFIs to achieve their goals.
“Goals achieved. Guaranteed.”™
• Based in Austin, TX; clients in 40+ states
• Founded in 1997 by industry execs and Big 5
consultants
• 500+ software evaluations
• Vendor Neutral
• Advisors average 25+ years in CFI
management; lending, cash management, compliance, operations and IT
What We Do
Sales & Marketing
Presentation Highlights
• Regulatory Issues & Terminology
• Building a Testing Program
• Conducting Tests – Examples
• High Availability Environments
• A Simple Pandemic Exercise
Regulatory Background
• FFIEC Guidance – March, 2008
– The Board must approve the Testing Program
& review test results
– IT is responsible for DR Testing
– The Crisis Management Team should be
involved in the testing process
– Those responsible for Facilities should be
involved in the process
– Test results must be subjected to an
Common Regulatory and Audit Findings
• There is no Comprehensive Testing Program in
place
• Testing activities show an over-reliance on a
single testing methodology (Table-top)
• Test activities do not involve departments/users
in a meaningful way
• Test documentation is inadequate
• No “step-by-step” restoration procedures
• No “order of restoration” defined
Terminology
• Testing Program – A schedule of test events spanning
a complete testing cycle.
– What? When? Who? How? Where?
• DR Test – An event that demonstrates that a given resource can be restored to a production state within a target time frame using a documented restoration method.
• BCP Test – An event that demonstrates that a Business Function can be completed using a
Budgeting Your BCP Effort
[Pie chart: BCP effort by category (Business Impact Analysis, Risk Assessment, Documentation, Emergency Response, Testing); slice values 36%, 5%, 18%, 9%, 32%, label order not preserved]
Testing Methodologies
Methodology              Type of Test/Administrator
Tabletop Exercise        BCP/BCP Coordinator
Due Diligence            BCP/BCP Coordinator
Vendor Service Levels    BCP/BCP Coordinator
Independent Review       BCP/BCP Coordinator
Incident Tracking        BCP/BCP Coordinator
Compatibility Testing    Disaster Recovery/IT
Simulation               Disaster Recovery/IT
Getting started
• Testing Team
– Technical representation
– Operations
– Department Staff
• Inventory of Resources
– Software Applications (Core and Network)
– Critical Services (Data Communications, Internet)
– Outsourced Applications & Services
• List of Business Functions
– By Department
Build a Database
Resource Name*        Critical Level  Test?  RTO (hours)  RPO (hours)  MAD (hours)  Control Group
Core System           1               Yes    8            2            72           Core
Data Communications   1               Yes    1            --           24           Network
Loan Prospector       3               Yes    48           72           96           Loans
Fedline               1               Yes    4            24           48           Fed
Network Files         3               Yes    24           72           72           Network
Branch Capture        1               Yes    4            8            48           Item Proc
Internet Access       1               Yes    1            --           24           Internet
Acrobat Reader        5               No     96           --           120          --
EMail                 1               Yes    12           24           48           Email
Internet Banking      1               Yes    4            8            24           Core
Laser Pro             3               Yes    48           72           96           Loans
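One way to hold this database outside a spreadsheet is sketched below in Python. The `Resource` class, its field names, and the two helper functions are illustrative assumptions rather than part of the presentation; only the row values come from the table above. The RTO-versus-MAD check is likewise an assumed sanity rule, not a stated requirement.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Resource:
    """One row of the resource database (field names are assumptions)."""
    name: str
    critical_level: int
    test: bool
    rto_hours: int
    rpo_hours: Optional[int]      # None where the table shows "--"
    mad_hours: int
    control_group: Optional[str]  # None where the table shows "--"


# A few rows copied from the table above.
RESOURCES = [
    Resource("Core System", 1, True, 8, 2, 72, "Core"),
    Resource("Data Communications", 1, True, 1, None, 24, "Network"),
    Resource("Loan Prospector", 3, True, 48, 72, 96, "Loans"),
    Resource("Fedline", 1, True, 4, 24, 48, "Fed"),
    Resource("Acrobat Reader", 5, False, 96, None, 120, None),
]


def resources_to_test(resources):
    """Return resources flagged for testing, most critical first, then shortest RTO."""
    flagged = [r for r in resources if r.test]
    return sorted(flagged, key=lambda r: (r.critical_level, r.rto_hours))


def rto_exceeds_mad(resources):
    """Assumed sanity check: list resources whose RTO is longer than their MAD."""
    return [r.name for r in resources if r.rto_hours > r.mad_hours]


if __name__ == "__main__":
    for r in resources_to_test(RESOURCES):
        print(f"Level {r.critical_level}  RTO {r.rto_hours:>3}h  {r.name}")
    print("RTO longer than MAD:", rto_exceeds_mad(RESOURCES))
```

Sorting by criticality level and then RTO gives a rough candidate for an order of restoration, which is one of the items auditors commonly find missing.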
Assign Criticality Levels
• Criticality is assigned to both Resources and Business
Functions
– Sometimes called “Mission Critical” or “Business
Critical”
• Better to use multiple criticality “Levels” for flexibility
– Three or five levels, matched to a time frame
• Example: Level 1 = 1 to 24 hours, Level 2 = 24
to 48 hours, etc.
• Test Flag
– Will we test this? (yes or no)
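The level-to-time-frame example can be written down as a small lookup. The sketch below assumes five levels and that the "etc." continues in 24-hour bands; both are assumptions, as are the function names. The fit check (an RTO should not run past the end of its level's window) is also an assumed rule for illustration.

```python
BAND_HOURS = 24  # assumption: every level spans one 24-hour band
LEVELS = 5       # assumption: five levels (the slide suggests three or five)


def level_window(level):
    """Return the (start, end) time frame in hours for a criticality level."""
    if not 1 <= level <= LEVELS:
        raise ValueError(f"level must be between 1 and {LEVELS}")
    # The slide's example starts Level 1 at 1 hour, not 0.
    start = 1 if level == 1 else (level - 1) * BAND_HOURS
    return start, level * BAND_HOURS


def rto_fits_level(level, rto_hours):
    """Assumed check: the RTO should not exceed the end of the level's window."""
    _, end = level_window(level)
    return rto_hours <= end


if __name__ == "__main__":
    for level in range(1, LEVELS + 1):
        start, end = level_window(level)
        print(f"Level {level} = {start} to {end} hours")
```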
Assign Target Timeframes
• Recovery Time Objective (RTO)
– Target time frame for resource restoration
• Recovery Point Objective (RPO)
– The maximum capacity for data loss of a given
information system, measured in time.
– Can be assigned to any application, but should be
applied at a minimum to Transaction Interfaces.
– RPOs should be supplemented with a description of
how lost data could be reconstructed.
• Maximum Allowable Downtime (MAD)
– Estimated maximum downtime for a given
Assign Control Groups
• Control Groups
– Create a Control Group for resources that should
be tested together.
• Examples:
– Core System
– Loan Systems
– Internet (Web sites)
• Examples
– By Server
– By Criticality Level
– By Application type
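Grouping resources into Control Groups is a simple bucketing step. The sketch below is a minimal illustration: the resource and group names are taken from the sample database, while the function name and the tuple layout are assumptions.

```python
from collections import defaultdict

# (resource name, control group) pairs taken from the sample database.
RESOURCES = [
    ("Core System", "Core"),
    ("Internet Banking", "Core"),
    ("Loan Prospector", "Loans"),
    ("Laser Pro", "Loans"),
    ("Data Communications", "Network"),
    ("Network Files", "Network"),
]


def build_control_groups(resources):
    """Map each control group to the resources that should be tested together."""
    groups = defaultdict(list)
    for name, group in resources:
        groups[group].append(name)
    return dict(groups)


if __name__ == "__main__":
    for group, members in build_control_groups(RESOURCES).items():
        print(f"{group}: {', '.join(members)}")
```

Grouping by server, criticality level, or application type works the same way; only the second element of each pair changes.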
Create Test Events
• Build a Control Document for each Test Event:
– Statement of Objective – be clear and concise
• Example: “Show that the [software application] can be restored onto new hardware from backup media. Users will log in and verify that the system has been returned to a production state.”
– Description of Test Environment
• How will hardware be replaced?
• Preinstalled software
• External connections needed
– Most likely test methodology
– Test Date
– Who is responsible?
– Who will be present?
– What documentation (evidence) will be retained?
• Write a Test Script
Example
Test Script
Step #  Start Time  Activity                                                            Expected Results  Actual Results
1                   Set up server hardware, workstation & test LAN
2                   Install O/S & backup/restore utility
3                   Install application from original media (d/l from vendor Web site)
4                   Locate backup image for application data & restore to server
5                   Install client onto workstation
6                   Have user login and verify that work can resume
7                   User runs samples of typical transactions
8                   Print screens and reports – retain for documentation
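A test script like this can also be kept as structured data, so that start times and actual results are captured as the test runs and the elapsed time can be compared with the RTO. The sketch below is one possible shape, not a prescribed format: the `Step` class, `record`, and `met_rto` are hypothetical names, and the dates and results in the usage section are made up for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Step:
    """One row of a test script; columns mirror the example above."""
    number: int
    activity: str
    expected: str = ""
    start_time: Optional[datetime] = None
    actual: str = ""


def record(step, start_time, actual):
    """Fill in the Start Time and Actual Results columns for a step."""
    step.start_time = start_time
    step.actual = actual


def met_rto(steps, finished_at, rto_hours):
    """True if the time from the first recorded step to finish is within the RTO."""
    started = min(s.start_time for s in steps if s.start_time is not None)
    return finished_at - started <= timedelta(hours=rto_hours)


if __name__ == "__main__":
    script = [
        Step(1, "Set up server hardware, workstation & test LAN"),
        Step(2, "Install O/S & backup/restore utility"),
        Step(3, "Install application from original media"),
    ]
    t0 = datetime(2024, 1, 15, 9, 0)  # hypothetical test day
    record(script[0], t0, "done")
    record(script[1], t0 + timedelta(hours=1), "done")
    record(script[2], t0 + timedelta(hours=3), "done")
    print("Met 8-hour RTO:", met_rto(script, t0 + timedelta(hours=6), 8))
```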
Build a Testing Timeline
• Test Cycle – 12 months
• Assign a target test date to each Test
Event/Control Group
• Strategically space test events across the entire
Test Cycle
– “Easy” tests can happen more quickly
– Allow more time for complex tests
– Consider likely unplanned outages (Incident
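Spacing test events across a 12-month cycle can be roughed out in a few lines. This sketch assumes each event carries a complexity rating (a hypothetical 1 to 3 scale), schedules easier events first, and spreads events evenly over the cycle. The function name and the ratings are assumptions, and real dates would still be adjusted by hand.

```python
def build_timeline(events, cycle_months=12):
    """Spread test events across the cycle, easiest first.

    events: list of (name, complexity) pairs, where a lower complexity
    means an easier test. Returns a list of (target_month, name) pairs
    with months numbered from 1.
    """
    ordered = sorted(events, key=lambda e: e[1])  # stable: ties keep input order
    count = len(ordered)
    timeline = []
    for i, (name, _complexity) in enumerate(ordered):
        month = 1 + (i * cycle_months) // count
        timeline.append((month, name))
    return timeline


if __name__ == "__main__":
    # Control group names from the slides; complexity ratings are made up.
    events = [
        ("Core System", 3),
        ("Internet", 1),
        ("Loan Systems", 2),
        ("Network", 2),
    ]
    for month, name in build_timeline(events):
        print(f"Month {month:>2}: {name}")
```

Putting the complex tests late in the cycle leaves the most preparation time for them, which is one reading of "allow more time for complex tests".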
Resource Restoration Methods
• Applications & Data
– Restore from backup
– Reinstall from original media
– Installed in multiple locations (redundant)
– “High Availability” System – failover
• Hardware
– Backup Equipment
– Replace from available market sources (add time
to RTO!)
• Services
Test Day
• Print the appropriate Test Control Documents and
Scripts, or open the documents on a laptop
• Line up the Test Participants
– Test Administrator
– Technical support
– Observers/documenter
– Department users when appropriate
• Execute test script – note start time and results for
each step. Complete the document as you go.
• Testing – 80% preparation & documentation, 20%
Reporting
• Electronic files are better than physical files
– Create a folder structure on your network – folders
and test events with the same name
– Scan completed Test Scripts and Control
Documents; attach to Test Event
• Keep a schedule of all Test Events, past and future
– Be able to sort by Date and Status (Pending, Complete)
• When you’re ready to distribute -- copy or Zip the
folder structure for emailing or copy to media
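The reporting steps above can be sketched with the Python standard library. Everything named here is an assumption for illustration: the schedule entries, the folder and archive names, and the helper functions. The sketch creates one folder per test event, sorts the schedule by date or status, and zips the folder structure for distribution.

```python
import shutil
import tempfile
from datetime import date
from pathlib import Path

# Hypothetical schedule entries: (test event name, test date, status).
SCHEDULE = [
    ("Loan Systems", date(2024, 6, 10), "Pending"),
    ("Core System", date(2024, 3, 4), "Complete"),
    ("Internet", date(2024, 9, 16), "Pending"),
]


def create_folders(root, schedule):
    """Create one folder per test event, named the same as the event."""
    for name, _when, _status in schedule:
        (Path(root) / name).mkdir(parents=True, exist_ok=True)


def sorted_schedule(schedule, by="date"):
    """Sort the schedule by 'date' or by 'status'."""
    index = {"date": 1, "status": 2}[by]
    return sorted(schedule, key=lambda row: row[index])


def zip_for_distribution(root, archive_base):
    """Zip the folder structure and return the path of the .zip file."""
    return shutil.make_archive(str(archive_base), "zip", root_dir=str(root))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp) / "DR Tests"
        create_folders(root, SCHEDULE)
        print([row[0] for row in sorted_schedule(SCHEDULE, by="date")])
        print(zip_for_distribution(root, Path(tmp) / "dr-tests"))
```

Scanned Test Scripts and Control Documents would be saved into the matching event folder before the archive is built.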
High Availability Environments
• Virtual servers
– Pro: can immediately cut RTOs in half
– Con
• Testing challenges
• Bandwidth
• Licensing
• Staffing
• Core Synchronization
– Synch how often?
– Is more always better?
A Simple Pandemic Exercise
• Preparation – BCP Coordinator
– Create an Excel Spreadsheet with 2 columns
• Column A = Employee Name
• Column B = Department
– Use a random selection formula to select 40% of
the employee records
• Make sure you can reference which
departments become impacted (use a “count record” formula or a pivot table)
A Simple Pandemic Exercise
• Pull your team together for the exercise meeting
• Use the Pandemic Simulator (Excel spreadsheet) to determine which employees are absent with the flu
• Determine which Departments have the highest level
of absenteeism and are most affected
• Review the Business Functions for that Department
and develop a strategy for dealing with the incoming work
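For teams that prefer a script to a spreadsheet, the same simulator can be sketched in Python. This is an illustration under stated assumptions: the staff list is invented, the function names are hypothetical, and "highest level of absenteeism" is read here as the highest absence rate per department rather than the highest head count.

```python
import random
from collections import Counter

# Hypothetical staff list: (employee name, department).
EMPLOYEES = [
    ("Employee 1", "Lending"), ("Employee 2", "Lending"),
    ("Employee 3", "Lending"), ("Employee 4", "Operations"),
    ("Employee 5", "Operations"), ("Employee 6", "Operations"),
    ("Employee 7", "Operations"), ("Employee 8", "IT"),
    ("Employee 9", "IT"), ("Employee 10", "Compliance"),
]


def simulate_absences(employees, rate=0.40, seed=None):
    """Randomly select the given share of employee records as absent."""
    rng = random.Random(seed)
    count = round(len(employees) * rate)
    return rng.sample(employees, count)


def absenteeism_by_department(employees, absent):
    """Return the share of each department's staff that is absent."""
    totals = Counter(dept for _name, dept in employees)
    out = Counter(dept for _name, dept in absent)
    return {dept: out[dept] / totals[dept] for dept in totals}


def most_affected(rates):
    """Return the department(s) with the highest absence rate."""
    worst = max(rates.values())
    return sorted(dept for dept, rate in rates.items() if rate == worst)


if __name__ == "__main__":
    absent = simulate_absences(EMPLOYEES)
    rates = absenteeism_by_department(EMPLOYEES, absent)
    for dept, rate in sorted(rates.items()):
        print(f"{dept}: {rate:.0%} absent")
    print("Most affected:", ", ".join(most_affected(rates)))
```

Passing a fixed `seed` makes a run repeatable, which helps if the exercise results need to be documented afterward.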
Top Five Testing Mistakes
5. Procrastination/Cramming
4. Hiding Failed Tests
3. Reliance on a single methodology
2. Failure to leverage “real life”
Steve Carroll
Abound Resources, Inc. Senior Consultant
Cell: 717-256-1865
E.Mail: [email protected] Twitter: @bankbcp