Training Technical Personnel for Disaster
Recovery Testing – A Case Study
• Frank Ioli, Manager of Disaster Recovery Operations, Northrop Grumman Corporation
Session Abstract
• Northrop Grumman Corporation has moved to an internal, multi-site structure for IT Disaster Recovery. In order to validate this solution, it is necessary to test recoverability on a regular basis. Tests of IT recovery are only as effective as the personnel running them and their understanding of their roles and responsibilities in the process. This session will allow attendees to follow the steps Northrop Grumman took in developing a philosophy of testing based on multiple data centers, different replication and backup media, a range of RTOs and staff in operations, systems administration and application development who had never tested recoverability in this manner. Specifically, Northrop Grumman developed procedures and a training program for its technical staff. Attendees will learn about the opportunities and potential pitfalls in creating an educated IT staff, ready to respond to data center
Who is Northrop Grumman?
Northrop Grumman is a leading global
security company providing innovative
systems, products and solutions in
aerospace, electronics, information
systems, and technical services to
government and commercial customers
worldwide.
• $26.4 billion sales in 2011
• Leading capabilities in:
– Unmanned Systems
– Cybersecurity
– C4ISR
IT Disaster Recovery Background
Business requirements for both cost and risk reduction have driven strategic requirements for Northrop Grumman's data center operations.
• The company has implemented an Enterprise Data Center (EDC)
Transformation project that incorporated several strategic IT initiatives.
• Data center consolidation
• Virtualization
• Standardization (technology and process)
• Partner with strategic vendors
• The overall objectives of EDC Transformation were cost and risk reduction.
• EDC Transformation changed Northrop Grumman’s IT risk portfolio as data
centers are consolidated and applications are migrated to a standard platform.
• EDC Transformation required a new DR strategy that established an
internal, triangular data center recovery approach.
• The Transformation also formalized a DR Operations function dedicated to
The Triangular Strategy
Application recovery is being targeted to alternate EDCs depending on deployment to Northrop Grumman’s standard operating platform.
• Applications on the standard platform are to be recovered between the East and West EDCs.
• Any application hosted on a unique (non-standard) platform is to be recovered at the
The DR Testing Strategy
Walkthroughs, hot-site rehearsals and tests are all held every year. Tests are unannounced; rehearsals are announced in advance.
• Unlike in an actual disaster, the rehearsals and tests only deal with the recovery of a limited number of applications.
• The strategy calls for responsibility to be shared among a variety of Recovery Teams.
– Technical and operations personnel are assigned to a team…or maybe
more than one.
– They must still carry out their current, day-to-day responsibilities.
The Challenge of DR Test Training
Data center technical and operations personnel needed to approach DR from an EDC perspective rather than an application-by-application perspective.
• Northrop Grumman had a long history of live testing of the recoverability of its critical applications… one at a time.
• Personnel needed to re-think DR testing involving all applications at the same time.
– In sequence by application priority (tiers).
– The most critical applications have recovery timeframes of 8 hours or less.
– Information System Contingency Plans needed to be re-written to address recovery in large-scale shared environments (EDCs).
• Recovery work instructions (procedures) needed to be written to address the specific requirements of testing DR in the triangular strategy.
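The sequencing by application tier described above can be pictured with a minimal sketch. The application names, the tier numbering and every RTO value other than the 8-hour ceiling for the most critical applications are hypothetical, chosen only for illustration; this is not a description of Northrop Grumman's actual tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Application:
    name: str        # hypothetical application name
    tier: int        # 1 = most critical (assumed numbering)
    rto_hours: int   # recovery timeframe in hours


def recovery_sequence(apps):
    """Order applications by tier, then by shortest RTO within a tier."""
    return sorted(apps, key=lambda a: (a.tier, a.rto_hours, a.name))


# Illustrative inventory only; the only figure taken from the case study
# is the 8-hour-or-less timeframe for the most critical applications.
inventory = [
    Application("payroll", tier=2, rto_hours=24),
    Application("email", tier=1, rto_hours=8),
    Application("intranet-wiki", tier=3, rto_hours=72),
    Application("order-entry", tier=1, rto_hours=4),
]

for app in recovery_sequence(inventory):
    print(f"Tier {app.tier}: {app.name} (RTO {app.rto_hours} h)")
```

Sorting on the tier first and the RTO second keeps the most critical applications at the head of the queue while still giving a deterministic order inside each tier.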
The DR Testing Transformation
[Figure: ISCP Appl. 1, ISCP Appl. 2, … ISCP Appl. n; EDC Recovery: Planning, Coordination, Logistics, Servers, Network, Storage, Facilities, Infrastructure, Applications]
The Training Logistics Challenge
Northrop Grumman’s IT Services personnel are widely distributed, work remotely and often have never met colleagues they are not directly involved with day-to-day.
• The EDCs themselves are lightly staffed.
• Operations and Technical Support run
– Around the clock
– On three shifts
– In four time zones
– With overlapping work schedules
• How could all personnel learn the new approach to testing DR in the same way, at the same time?
The Training Logistics Solution
Face-to-face classroom training was considered but rejected, for reasons of cost and logistics.
• The training was presented in a “virtual classroom” using LiveMeeting and conference bridges.
• It was presented in weekly sessions:
– One hour per week.
– Three different time slots to accommodate all shifts and time zones.
• Northrop Grumman retained Risk Masters to develop and facilitate the training curriculum.
The Course Material
The training needed to address all the subject matter in the DR Testing Work Instructions.
Pre-Test
• Pre-Test Meetings
• Test Scheduling
• Preventive Equipment Maintenance for Test Specific Equipment
• Structured Walk-Through
• Third-Party Vendor Activation
• Equipment Inventory
• Installation of Equipment Required for Testing Purposes
• Equipment Configuration
• Testing Resources
Test
• Media and Data Recovery
• Network Support
• Problem Management
• Disk Sanitization
Post-test
The Recovery Teams
The EDC Recovery strategy calls for the creation of eight Recovery Teams.
– DR Test Steering Committee
– DR Test Coordination Team
– Site Management Team
– Site Information System Recovery Team
– Administrative Support Team
– EDC Facilities Team
– Infrastructure Service Team
– Application Service Team
• Their roles and responsibilities are defined for recovery from an actual disaster.
• Their roles in DR testing are, naturally, parallel to those in an actual disaster, but are subtly different, given the nature of a test.
Developing the Training Materials
Inasmuch as the sponsor of the training was DR Operations and the attendees were EDC staff, oversight of the training development was split between DR Operations and EDC operations.
• The oversight committee grew as the scope of the training became better understood.
• Eventually it included managers from Technical Support, Facilities and Application Development.
• The first step was to develop an outline for the weekly sessions.
– The first eight weeks alternated between lectures and exercises.
The Training Outline
The initial outline anticipated face-to-face training and
needed to be restructured for the virtual classroom.
• The outline was modified many times as the actual development of the
material did not fit earlier expectations…
• And as Northrop Grumman gained a greater understanding of its philosophy of DR testing.
Outline (minutes per item in parentheses):
Module 1: Northrop Grumman's DR Approach
1. Introduction (15)
2. Northrop Grumman’s Disaster Recovery Strategy (30)
3. Test Planning
   a. Premises and Assumptions (15)
4. Crisis Response Exercise
   a. Notification and reaction (30)
   b. On-site and remote recovery (30)
Module 2: Pre-Testing Activities (Part 1)
5. Northrop Grumman’s Disaster Recovery Testing Strategy
   a. Overview of the Current Strategy (15)
   b. Structured Walkthrough (10)
   c. Test Preparation (35)
      i. Pre-test Meetings
      ii. Test Scheduling
      iii. Test Staffing
      iv. Testing Resources
6. Pre-Test Exercise 1
   a. Test Planning (30)
The Philosophy of DR Testing
Above all, do no harm – Hippocrates
• Production processing should be completely isolated from DR testing.
• There are too many applications to test the recovery of all of them at one time.
• DR testing should focus on the recovery of the most critical applications.
• DR testing should simulate an actual disaster as much as possible within those constraints.
• Unannounced testing is preferable, but at the outset rehearsals are to be announced in advance.
The Lectures
Module 1: Northrop Grumman's DR Approach
1. Introduction
2. Northrop Grumman’s Disaster Recovery Strategy
3. Disaster Recovery Test Planning
4. Crisis Management
Module 3: Pre-Testing Activities
1. Overview of Northrop Grumman’s DR Testing Strategy
2. Structured Walkthroughs
3. DR Test Preparation
Module 5: DR Test Setup Activities
1. DR Test-Specific Equipment
2. 3rd Party Activation
3. Equipment Preparation
Module 7: DR Test Execution
1. Media and Data Recovery
2. Network Support
3. Disk Sanitization
Module 9: Problem Management and Follow-up
1. Problem Management
2. Post-Test Analysis and Resolution
Keeping the Lectures Light
DR and DR testing are necessarily heavy topics; it is important to keep attendees interested and alert.
• Cartoon graphics were used as a running motif through the classes.
• Where applicable, humor was sprinkled in.
• The training was delivered in March, April and May, so Final Four and Stanley Cup analogies were used.
The Exercises
Adults learn by listening, questioning, applying what they have learned and reviewing what they have applied.
• The exercises had several objectives:
– To involve the participants in consideration of the impact of the new DR testing strategy.
– To familiarize them with their roles in DR testing.
– To challenge them to be creative in applying what they already knew to changed circumstances.
– To foster communications among people who do not usually work together…
– And may not even know one another…
– Or understand each other’s roles.
• A side benefit was achieved as participants came up with answers that improved the DR testing program.
The Exercise Work Groups
Small teams were assembled to work on the exercises.
• The work groups were led by experienced managers from the different Recovery Teams.
• The teams were intentionally a blend of Recovery Team members with different responsibilities and from different locations.
– A side benefit was to improve recognition of how the different technical specialties fit together.
• Facilitators from Northrop Grumman and Risk Masters joined the work groups for a few minutes at a time.
• All work groups were required to submit “homework” (answers to the exercises), which Risk Masters reviewed and sent back to the work groups with sample answers.
– Another side benefit: some of the homework responses also improved the testing program.
The Introductions
It proved important to get all participants on the same page at the beginning of each session.
• Stress was placed on:
– Communicating the objectives of the training.
– Facilitating class logistics.
– Communicating with both DR Operations and Risk Masters.
• Each exercise session was preceded by a general introduction before the attendees broke into work groups.
The Preparation for the Classes
A number of preparatory meetings were held to maximize the value and minimize the glitches.
• A virtual classroom specialist from Northrop Grumman provided valuable insights.
• A complete face-to-face run-through was held with the core of the Steering Committee.
– This was the only face-to-face meeting in the five months from the start of the project to the end of the training.
– Only minor changes were made.
• A dress rehearsal was held to make sure that the content was communicated well in the virtual classroom.
• A technical dress rehearsal was held to make sure that the meeting software and conference bridges worked appropriately.
The Potholes in the Road
There were lessons to be learned in presenting in a virtual classroom.
• A lot of conference bridges are needed for seven simultaneous work groups.
• It is possible, but not easy, for facilitators of the exercises to go from group to group.
• Internal and external meeting systems had to be jury-rigged together.
• It’s hard to be on your game at 7:00 am and at 7:00 pm.
• When training is spread over nine weeks, some participants will have other plans.
The Results of the Training
DR Testing Training was completed on May 9, 2012.
• Assumptions
– Each participant had to attend at least 5 out of 9 sessions to complete the course.
– Participants who attended fewer than 5 sessions received credit for the sessions they missed if they read the material from those classes. To receive credit, they needed to let DR Operations know that they had reviewed the slides from the missed session(s).
• Training Summary
– 127 total participants
• 122 participants successfully completed the training
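The completion rule and the summary figures above can be checked with a short sketch. The function names are hypothetical, and the rule is one reading of the stated assumptions (at least 5 of 9 sessions attended, or missed material reviewed and reported to DR Operations); it is not taken from any actual tracking system.

```python
SESSIONS_TOTAL = 9       # nine weekly sessions, per the course description
SESSIONS_REQUIRED = 5    # attend at least 5 of 9 to complete the course


def completed_course(attended, reviewed_missed_material=False):
    """Return True if a participant earns credit for the course.

    attended: number of live sessions attended (0 to 9).
    reviewed_missed_material: True if the participant told DR Operations
    that they had reviewed the slides from the sessions they missed.
    """
    if not 0 <= attended <= SESSIONS_TOTAL:
        raise ValueError("attended must be between 0 and 9")
    return attended >= SESSIONS_REQUIRED or reviewed_missed_material


def completion_rate(completed, total):
    """Completion rate as a percentage, rounded to one decimal place."""
    return round(100 * completed / total, 1)


print(completed_course(6))                                 # attended enough sessions
print(completed_course(3, reviewed_missed_material=True))  # credit via review
print(completed_course(3))                                 # no credit
print(completion_rate(122, 127))                           # figures from the summary
```

With 122 of 127 participants completing, the rate works out to roughly 96 percent.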
The Benefits of the Training
The primary benefit, of course, is a better-educated workforce, able to execute a DR test in the triangular strategy.
• Beyond that, the training created greater cohesion within the technical workforce.
• It enabled the first (announced) test held shortly after the classes to proceed smoothly.
• Test objectives included:
– Infrastructure Service support for application recovery
– Applications’ recovery parameters met
– Resources were adequately assigned