Business Continuity Program Review


PROJECT VS. PROGRAM:

THE BUSINESS CONTINUITY MANAGEMENT CHALLENGE

We now do very well what we set out to do 20 years ago: develop emergency response plans, run tests at alternate sites where we recover computer platforms, and sometimes networks. We can implement high availability solutions using cutting edge technologies. But we have done less well developing our profession as an integral and structured component of the business.

It was 20 years ago as well that we saw the emergence of tools that organized the art of programming information systems into steps, stages, life cycle phases, all of which defined activities and prescribed certain procedures and documentation. We need to do this now for our own profession. What are our best practices for creating and managing our programs? Where is our trade literature that speaks to the issue of how to move from recovery plan development to a strong ongoing program, and perhaps, more importantly, how to keep that program healthy and growing over time?

Developing a methodology for structured, formal continuity programs is now the single most important challenge facing our profession. Many organizations (if not most) are grappling with these issues, and many are stuck. We have demonstrated that we can develop recovery solutions for new technologies, but to make significant progress we need above all to get our management house in order. No matter how complex our recovery strategies may become, we need to become an integrated part of the business.

The following case study deals with the development of a Business Continuity Program over 5 years at a large multinational firm. The Program is now a well-regarded part of the company. It started as just another recovery plan development project. The difference, however, is that we began early to look forward to the ongoing program, and have developed a set of tools that has helped us to formalize our activities. These tools have helped us to get to where we are now: we have a logo, we have an identity, we offer a very specific, detailed set of services that make up the ongoing recovery capability for participating business units.

Case Study: The Road Map to Recovery Program

Five years ago, in the beginning of 1994, we were faced with a challenge: how to move a large company, with many applications running on multiple vendor platforms, toward development of a basic recovery capability. A global study regarding the impact of the loss of critical applications had just been completed at great expense. There were reams of data, but no direction for moving forward to an action plan.

Within 18 months, for as many of the defined critical systems as management had decided were truly important (46), we had implemented backup and recovery strategies. These were documented in recovery plans and other documentation. Our profession is good at this kind of project: there was nothing new here, other than perhaps the size of the project.

But then what? The glory of the development project was fading, and the sponsors of the original project were no longer in the organization. We were pretty much looking out over the cliff. We were faced with the


We did gain one advantage from the Development Project. Since the project was so large, we needed to work in phases, which we called waves. As Wave 1 systems completed their implementation, they needed to transition to an ongoing support mode. At the same time, the Wave 2 systems would go through their development phase. And so, midway through the development project, we were already involved in an ongoing program.

Therefore, as part of the development project, we had already begun to embed elements that facilitated the longer term program. We have been developing other structural elements all along the way. One of the most important of these, System Complexity Classification, was developed only last year. The program is still evolving, and will continue to change to meet the emerging needs of our participating business units.

The following set of structural elements has helped us to move from an ad-hoc mode to a systematic approach that is effective both for the Program and for its participants:

1. System Complexity Classification. When a new system or business area wishes to join the program, it is classified according to its complexity. This translates to a resource requirement, and ultimately to a fee which can then be billed back to the requesting department. This is important, because at the current time, all costs must be billed back to each business area by the service department. The following table represents how this classification works. The deliverables are the same for each phase, but the amount of resource varies according to the complexity of the system/business area. The 80/20 rule applies: most participating systems or business areas fall into one of the boxes. Of course, sometimes customization is necessary – but this is the rare exception.

Complexity/Phase        Small    Medium    Large
Development             a        b         c
Year 1 maintenance      d        e         f
Year 2+ maintenance     g        h         i

2. Recovery Class Criteria. This tool assigns the application/platform/business area a specific recovery class. There are 4 different classes, each based on the primary recovery and backup strategy drivers: RTO (Recovery Time Objective) and RPO (Recovery Point Objective). Each class then specifies the following:

• Alternate site characteristics

• Backup requirements, including type, frequency, monitoring, offsite storage characteristics and procedures.

• Data recovery procedures

• Disaster recovery plan characteristics: organization, form, content, testing, coordination with other emergency plans

• Validation: procedures for ensuring the integrity of software, data, and hardware of the re-created systems.


3. Vital Records Backups: Requirements and Recommended Procedures. This document has two parts:

A) A detailed discussion of disaster backup theory

Ø Types, frequency, advantages and disadvantages

Ø Why disaster backups must be treated differently from file backups used to restore faulty components (i.e., disk) or re-create deleted or damaged files.

Ø Procedures for transport to offsite locations; characteristics of effective offsite locations

Ø Retention of disaster backup files: minimum number of full cycles to be retained

B) An explanation of the required content of a Vital Records Backup Technical Operating Procedure

Ø Disaster Backup Creation

• Frequency and Type

• Storage Device Names

• Creation Journal

• File State at time of backup creation

Ø Disaster Backup Storage

Ø Disaster Backup Retention

Ø Creating Hardware Characteristics

Ø Disaster Backup Transport

Ø Removal of Disaster Backups from Offsite Storage. Detailed procedures for:

• Normal retrieval: expiration of retention date

• Exceptional retrieval:

§ Need for file recovery

§ Need for disaster recovery testing

§ Need for disaster recovery

4. Vital Records Backup Technical Operating Procedure. A separate document is created and maintained for the named critical resources associated with each system/business area that participates in the program. This document follows the requirements defined in the Vital Records Backup Requirements document above, and is subject to quarterly review and updating to reflect environmental or other changes.

5. The Business Continuity Program Test Protocol. This very detailed document is the road map leading participants through the various testing phases and activities. It specifies the content and procedures for the two pre-test meetings, the post-test review meeting, and defines the content of required test documentation and its distribution. As an example, this is the list of items for the second pre-test meeting:

• Timing: 30 days prior to the test

• Participants: Leader of the meeting, team members, other required participation depending on the type of test

• Preparation for the meeting: who does the following:

Ø Review the documentation from the previous meeting


Ø Fax test information to vendor

Ø Document any action items open from the previous meeting

Ø Document any new contract requirements to alternate site vendor; obtain new contract proposal if necessary

Ø Make copies of all relevant information for each meeting participant

• Meeting agenda

Ø Discuss action items from previous meeting

Ø Review system configuration; review new alternate site contract if necessary

Ø Review test objectives and alternate site configuration in detail

Ø Discuss any equipment that will be brought to the alternate site

Ø Review and finalize test staffing and roles

Ø Review information for the shipment of backup tapes: who will receive the notification call that tapes have been received

Ø Review existing recovery documentation. If any changes are necessary, make new copies and take to the test

Ø Make conference call to alternate site vendor

Ø Determine lodging and catering requirements

• Post-test Planning Activities

Ø Meeting record is written and distributed within 3 business days of the meeting

Ø Activities for Business Continuity Program participant and technical participants, including timing

6. The Continuity Documentation Maintenance and Distribution Technical Operating Procedures. This document is created for each system/business area participating in the Program. It defines critical responsibilities for review and updating of the various pieces of continuity documentation, and defines the quarterly cycle of update, publication, and distribution of new documentation.

7. Program Testing Rules. Each system/business area participating in the program MUST perform four test exercises per year. Two of these are relocation tests; two are paper or notification exercises. We have determined, based on experience, that a continuity exercise must occur at least once each quarter in order to allow participants to begin to integrate business continuity concepts into their normal workday.

8. The Three-Year Maintenance Strategy. For each system/business area participating in the Program, primary and secondary objectives are defined for each of the coming three years. These objectives are supported by testing activity, and are harmonized with test exercises being conducted for other systems/business areas participating in the Program. We begin first with component certification, including personnel cross-training, and progress to multiple-platform recoveries and user participation that reflect ever more closely the realities that would be encountered at the time of a disaster. These documents are updated annually with a rolling 3-year window.
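The Program Testing Rules lend themselves to a simple automated check when a year's exercises are being planned. The following is only a minimal sketch, not a tool used in the Program: the names `Exercise` and `check_year` are hypothetical, the counts are read as minimums, and "quarter" is assumed to mean calendar quarter, which the rules themselves do not specify.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Exercise:
    day: date
    kind: str  # "relocation" or "paper" (paper/notification exercise)


def quarter(d: date) -> int:
    """Calendar quarter (1-4) of a date; calendar quarters are an assumption."""
    return (d.month - 1) // 3 + 1


def check_year(exercises: list[Exercise]) -> list[str]:
    """Return a list of rule violations for one system's yearly test plan.

    Rules as stated in the Program Testing Rules, read as minimums:
    four exercises per year, two relocation tests, two paper/notification
    exercises, and at least one exercise in every quarter.
    """
    problems = []
    kinds = Counter(e.kind for e in exercises)
    if len(exercises) < 4:
        problems.append(f"only {len(exercises)} exercises planned; 4 required")
    if kinds["relocation"] < 2:
        problems.append("fewer than 2 relocation tests")
    if kinds["paper"] < 2:
        problems.append("fewer than 2 paper/notification exercises")
    missing = sorted({1, 2, 3, 4} - {quarter(e.day) for e in exercises})
    if missing:
        problems.append(f"no exercise in quarter(s) {missing}")
    return problems
```

An empty result means the plan satisfies all four rules; each string in a non-empty result names one rule the plan breaks.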


The Impact of All of These Tools?

• Those business areas participating in the Business Continuity Program know what they will receive, and how much to budget for these services. When a business area comes to the Program and asks for assistance in recovering a computer platform, this is what happens:

1. An initial analysis is performed, which results in definition of the recovery class, and a classification of system complexity.

2. A project plan is created which defines the timing for the deliverables which are standard to all development projects:

• Definition of the backup strategy and the recovery strategy: implementation of these in the most cost-effective manner

• Creation of the Vital Records Backup TOP

• Creation of the Continuity Documentation Maintenance and Distribution TOP

• The Business Continuity Plan, including any technical system restoration information

• The integration of information regarding this project with the existing sitewide logistic support plans

• The performance of a basic test of the backup and recovery strategies, generally involving re-creation of a system at the alternate site

3. A cost is applied based on the system complexity definition.

4. The business unit approves the project plan and the development phase is initiated.

In order for the system to move into the ongoing Program, the business unit must approve the cost for First-Year Maintenance. This is an annual fee, based on the system complexity classification. Deliverables are standard for all projects in first-year maintenance:

1. Design and documentation of a 3-year Maintenance Strategy. This includes primary and secondary objectives for each of the next three years. Design of test exercises to include at least two system re-creation or user relocation tests per year, as well as two paper/notification tests per year. These tests are designed specifically to support the yearly objectives. They take into account potential multi-platform tests involving other systems, and Maintenance Strategies for other Program participants.

2. Execution of the formal Test Protocol for each of the test exercises planned for the year.

3. Review, physical update and distribution of the standard continuity documentation for each participating system/business area: Continuity Plan, Vital Records Backup TOP, Continuity Documentation Maintenance and Distribution TOP.

In order to continue to participate in the Program, the business unit must then approve the cost for each subsequent year of maintenance. This is an annual fee, and is based on the system complexity classification. Standard deliverables are the same as for first year maintenance, except that the Maintenance Strategy is now simply updated, rather than being written for the first time. The cost based on the system complexity classification is therefore somewhat less than for year 1 maintenance.
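The way the complexity classification drives billing can be sketched as a small lookup. This is an illustration only: the cell labels a through i are the placeholders from the classification table, no fee amounts are given in this paper and none are invented here, and the names `fee_cell` and `year2_cheaper_than_year1` are hypothetical.

```python
from enum import Enum


class Complexity(Enum):
    SMALL = "Small"
    MEDIUM = "Medium"
    LARGE = "Large"


class Phase(Enum):
    DEVELOPMENT = "Development"
    YEAR_1 = "Year 1 maintenance"
    YEAR_2_PLUS = "Year 2+ maintenance"


# Cell labels exactly as in the classification table; real fee amounts
# are not published in the paper, so the labels stand in for them.
FEE_CELLS = {
    Phase.DEVELOPMENT: {
        Complexity.SMALL: "a", Complexity.MEDIUM: "b", Complexity.LARGE: "c",
    },
    Phase.YEAR_1: {
        Complexity.SMALL: "d", Complexity.MEDIUM: "e", Complexity.LARGE: "f",
    },
    Phase.YEAR_2_PLUS: {
        Complexity.SMALL: "g", Complexity.MEDIUM: "h", Complexity.LARGE: "i",
    },
}


def fee_cell(phase: Phase, complexity: Complexity) -> str:
    """Return the table cell that prices this phase and complexity."""
    return FEE_CELLS[phase][complexity]


def year2_cheaper_than_year1(fees: dict[str, float]) -> bool:
    """Check a caller-supplied fee schedule (cell label -> amount) against the
    stated rule that Year 2+ maintenance costs less than Year 1 maintenance
    for the same complexity class."""
    return all(
        fees[fee_cell(Phase.YEAR_2_PLUS, c)] < fees[fee_cell(Phase.YEAR_1, c)]
        for c in Complexity
    )
```

Keeping the schedule as data rather than logic means a customized engagement, the rare exception noted earlier, can be priced outside the table without touching the standard cells.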


• The Business Continuity Program has a name, a logo, and an identity: this comes as a direct result of the structured, recurrent, defined nature of the Program’s activities. We can market the program because it has a defined shape.

• Another effect is that this year, we have planned for 76 separate test exercises with knowledge of how much resource we need and when. We can bring new Program personnel up to speed very quickly: they have a reference point for all Program activities.

• Because we know our committed resource requirements for the Program, we can then identify and focus on strategic issues as they arise, dealing with the challenges of new technology implementation and the larger issues of interlocking systems and networks. The systematic nature of our ongoing tactical program frees us to focus scarce resources where they can best be deployed, supplementing with less senior resources to deliver our well-defined standard services.

Conclusion

It took a long time and almost 20 years of experience in the business for me to truly understand the primordial importance of management and political issues to the success of any disaster recovery or business continuity effort. About 12 years ago, while I was a consultant for a French consulting firm in Paris, I had a knockdown drag-out with my manager, who insisted on the importance of developing and sustaining a personal relationship with the client, and on the regular use of formal project management reporting. I argued that the project could stand on its own technical merits. I could not have been more wrong: it was a hard lesson.

Perhaps it was because I came to the business as a technologist; perhaps it was also because many of the major commercial services providers and plan development software providers in our industry have concentrated so much on technical solutions. I can’t help thinking and hoping that our industry has now come to a critical watershed, where we have reached a sufficient level of maturity to look beyond today’s newest backup or recovery solution. And that this new maturity will help us to understand and master our role as a critical business function, whether that business manufactures widgets or computers or supplies electricity or telephone service.

I know that for a long time, I looked primarily at the technology and at the newest technology recovery solution. I still tend to look at the role of the technology in supporting business requirements. But I consciously try now to look first at the business, and then at its critical processes and resources, some of which are technology-based, some of which are not. This attitude helps to keep me focused on the basic raison-d’être of our entire industry: ensure the ability of the business to stay in business during a crisis. In order to do this, the business continuity program must function effectively and well in and of itself. I know too that this attitude is of critical importance to the very scary challenge now facing our industry: contingency planning for Year 2000 projects.

Ultimately, management and political issues are always more important than the implementation of any particular technical recovery solution. As we move toward the new challenges facing our industry, we need to remind ourselves – often – of this reality. Will we move forward, armed with new attitudes and tools, or will we adopt the ostrich attitude that we have so often observed in regard to our own industry – in business managers, in technology developers? We must become effective leaders in our own realm, and to do so we need solid logistics and management tools supporting us. The program I have discussed in this paper is the beginning of a new understanding of what we are about as an industry. Let’s get on with it!