Information Technology Services (ITS)


Disaster Recovery Plan


DOCUMENT VERSION CONTROL

Version  Date        Description/Notes                             Author/s
V2.00    02/06/2014  New plan based on updated Standby plan.       Craig Collis
V2.10    10/03/2015  Updated CMT roles to align with CIMS roles.   Jonathan Godfrey
V2.11    10/11/2015  Minor corrections and update to

FOREWORD

The design of this Disaster Recovery Plan (DRP), technical analysis and advice included is the proprietary information of Standby Consulting Ltd (Standby™) and is confidential to Standby and Massey University (Massey).

Standby has prepared the DRP exclusively for use in connection with the ITS operations of Massey, who shall not disclose or release the plan to any third party without the prior consent of Standby. Notwithstanding the foregoing, Massey remains the owner of the applications, schematics and corporate information set out in the plan and shall not be constrained from dealing with such material as it thinks fit.

This version of the DRP has been significantly modified by Massey University, and is based on the DRP supplied by Standby Consulting Ltd in 2013. The Standby DRP is a much lengthier document and is appended for reference.

DISCLAIMER

This plan is based on information supplied to Standby by Massey. Standby shall accept no liability for losses occurring by reason of errors or omissions in the information supplied.

Care has been taken to ensure that the procedures set out in this manual represent best practice in business continuity at the time of writing. However, because of the unpredictability and diversity of events that may cause a disaster, there is no guarantee that this plan will cover every situation.

The plan needs to be reviewed regularly and updated to take into account changes and developments. Standby accepts no responsibility for losses arising by reason of the failure by Massey to carry out regular reviews and, where necessary, updates of the manual.

TABLE OF CONTENTS

DOCUMENT VERSION CONTROL
FOREWORD
DISCLAIMER
DISASTER RECOVERY PLAN
    Definition of a Disaster
    Scope of the Disaster Recovery Plan
    Location of Disaster Recovery Plan
    Disaster Recovery Capability Overview
    Meeting Location
    Disaster Recovery Process
    Crisis Management Team

DISASTER RECOVERY PLAN

This Disaster Recovery Plan (DRP) documents the actions that Information Technology Services (ITS) will take following a disaster that affects core University IT services.

DEFINITION OF A DISASTER

A Disaster is defined as any significant event that impacts IT services sufficiently that the core business activities of the University cannot continue.

The declaration of a disaster is a deliberate, conscious decision which will result in the initiation of this DRP. Once a disaster is declared, the response to and recovery from the situation will be managed via the structures and processes outlined in this document. There are three levels of disaster which can potentially affect IT services:

1. No Damage: All IT facilities and services are operating as normal; however, there has been a disaster which impacts ITS’ ability to continue operating services fully, e.g. pandemic, lack of access to ITS workspaces, etc.

2. Logical Damage: There has been no physical damage or change to IT facilities or systems, but the data or system environments have been impacted, e.g. virus outbreak, malicious action, accidental deletion, unexpected data corruption, etc.

3. Physical Damage: IT facilities or equipment have been physically damaged, e.g. fire, earthquake, explosion.

This DRP is intended to cover all of these scenarios.

SCOPE OF THE DISASTER RECOVERY PLAN

This Disaster Recovery Plan is intended to:

• Outline the capabilities in place to recover from a disaster.
• Describe the process to be followed to carry out the recovery of services.
• Outline the roles and responsibilities of various staff and governing bodies in the event of a disaster.

This Disaster Recovery Plan is not intended to:

• Describe the process for managing the incident itself.
• Act as a repository for all technical and process documentation associated with disaster recovery systems.
• Act as a Business Continuity Plan for the ongoing long-term activities of the ITS department itself.

The focus of the DRP and IT disaster systems is the restoration of core business services within an acceptable timeframe. The current Recovery Time Objective (RTO) required by the University for all core services is 72 hours, with a Recovery Point Objective (RPO) of 30 minutes. The core services included within that scope are as follows:

Core Services

Stream
File service
Print service
Email service
Building Management system (BMS)
Student Management system (SMS)
Finance system (Technology1, FlexiPurchase, KoFax)
Payroll system (PSE)
Timetabling system (FCMIS)
Application licensing service
Knowledgebase service
Research Information Management system (RIMS)
ITS Service Desk system (Marval)
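
As a rough illustration of how the 72-hour RTO and 30-minute RPO stated above might be checked after a recovery exercise, the following Python sketch compares three timestamps against the two objectives. The function name, argument names and sample timestamps are assumptions made for illustration only; they do not describe any existing ITS tooling.

```python
from datetime import datetime, timedelta

# Objectives stated in this plan for all core services.
RTO = timedelta(hours=72)    # maximum acceptable time to restore a service
RPO = timedelta(minutes=30)  # maximum acceptable window of data loss


def meets_objectives(outage_start, last_replica, restored_at):
    """Return (rto_met, rpo_met) for a single service.

    All arguments are hypothetical datetime values:
      outage_start - when the service became unavailable
      last_replica - timestamp of the last good replicated copy
      restored_at  - when the service was operational again
    """
    downtime = restored_at - outage_start
    data_loss_window = outage_start - last_replica
    return downtime <= RTO, data_loss_window <= RPO


# Illustrative values only.
outage = datetime(2015, 11, 10, 9, 0)
replica = datetime(2015, 11, 10, 8, 40)   # 20 minutes before the outage
restored = datetime(2015, 11, 12, 15, 0)  # 54 hours after the outage
print(meets_objectives(outage, replica, restored))  # (True, True)
```

A service restored after 73 hours, or whose last replica is 45 minutes old at the time of the outage, would fail the corresponding check.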

There are several other documents which should be used in conjunction with this DRP. These are:

• The ITS Incident Response Plan
• The ITS Business Continuity Plan
• The University Emergency Response Plan
• Disaster Recovery, system, and application documentation

LOCATION OF DISASTER RECOVERY PLAN

The soft copy of this plan will be held at http://dr.massey.ac.nz hosted at the Albany National Data Centre. Hard copies of the Plan are located at the data centre on each campus.

The following individuals will hold hard copies of the plan, as well as a soft copy on their individual cell phones:

Copy  Name              Title
1     Clive Martis      CIO
2     Mareen Watts      Associate Director Service Delivery
3     Alistair James    Associate Director Planning & Transformation
4     Harry Faas        Associate Director Business Systems Support
5     Jodie Banner      University Risk Manager
6     Barbie Yerkovich  Albany Service Manager
7     Kevin Reynolds    Wellington Service Manager


DISASTER RECOVERY CAPABILITY OVERVIEW

The ITS Data Centre in Turitea is the prime data centre in which most applications run and data is stored. Data and images of the servers and systems are replicated to Albany.

The Albany Campus is also a prime site with similar equipment to Turitea. Data from Turitea is stored in the Rittal modular data centre (MDC), which is the main data centre at Albany.

Data generated in Albany is replicated to the Turitea Data Centre. The Wellington Data Centre replicates its data to Turitea.

In the event of a disaster at Turitea, systems can be activated in Albany and all campuses will run from Albany. In the event of a disaster at Albany, all services will run from Turitea. If Wellington fails, services will continue to be provided by Turitea.

The initiation of all of these recovery mechanisms can be carried out remotely at Turitea or Albany.

This Disaster Recovery capability has been designed to recover services from a single major incident. For example, should the Turitea data centre fail, services will then run from Albany. Should Albany also fail, there is no further failover or recovery capability. In some instances the services may be able to tolerate more than a single incident, but this has not been specifically incorporated into their design.
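
The failover relationships described in this overview can be summarised as a simple lookup. The Python sketch below is illustrative only: the site names come from this plan, but the data structure and the function name are assumptions, and it models nothing beyond the single-incident design described above.

```python
# Failover targets as described in the capability overview.
# The dictionary and function are illustrative, not an ITS configuration.
FAILOVER_TARGET = {
    "Turitea": "Albany",
    "Albany": "Turitea",
    "Wellington": "Turitea",
}


def recovery_site(failed_site, other_failed=()):
    """Return the site that takes over when failed_site fails.

    Returns None if the takeover site has also failed, because the
    capability is designed to recover from a single major incident.
    """
    target = FAILOVER_TARGET[failed_site]
    if target in other_failed:
        return None
    return target


print(recovery_site("Turitea"))                           # Albany
print(recovery_site("Wellington"))                        # Turitea
print(recovery_site("Turitea", other_failed=["Albany"]))  # None
```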


MEETING LOCATION

An initial meeting should be held without delay, as the situation will need urgent review and the Disaster Recovery process will need to be initiated.

The primary space for all meetings associated with the situation will be room 1.29, the large meeting space on Level 1 of the Turitea ITS building, inside the main doors.

If this room is not available, or more than one meeting space is required, the secondary meeting space will be Room TA 2.14, in the Te Ara building on the Hokowhitu campus.

Location    Description
Location 1  Main Meeting room, first floor (1.29), ITS Building, Turitea
Location 2  Room TA2.14, Te Ara Building, Hokowhitu Campus, ITS Department

DISASTER RECOVERY PROCESS

The following process outlines the high-level steps to be taken in the event of a disaster. While this process describes a best-practice approach to the management of the recovery, it is likely to be modified as necessary within the context of a particular scenario.

PROCESS STEPS

1. Event: An event occurs which results in a significant disruption to IT services.

2. Discovery: The event is discovered and reported to the relevant points of contact. This may be through automated alerting and/or reporting by users.

3. Decision Point: is the ITS building safe and fit for occupation? If so, ITS staff and the Crisis Management Team can be located in the building. If not, the Crisis Management Team decamps to Location 2, and accommodation arrangements are made for non-essential ITS personnel.

4. Triage: An initial assessment of the cause, impact and severity of the event must be made. This involves gathering enough information to allow an informed decision as to whether or not the situation warrants the declaration of a disaster.

5. Decision Point: declare a disaster? The decision to declare a disaster is a formal step taken by the Incident Manager in consultation with relevant staff. Not all significant service disruptions require the formal declaration of a disaster and implementation of this DR plan. The decision to declare a disaster should be made in collaboration with the CIO and senior university stakeholders. Guidelines for the appropriate declaration of a disaster are as follows:

a. Where IT services are disrupted to a point that will cause the University to close, or significantly enough that financial and/or reputational damage will be caused.


b. Where the University’s key Teaching, Learning, Research and Administrative functions are unable to continue at a satisfactory level.

c. Where the situation requires the dedicated attention of a large group of people.

d. Where the management of the situation requires specialist roles to be appointed for the duration of the outage, e.g. convening the Crisis Management Team.

6. Disaster not declared: service restoration. If a disaster is not declared, the individual restoration of a service or services begins. In this situation the formal activation of the Disaster Recovery Plan is not required, and services can be restored using standard recovery methods without invoking a failover of services to Albany.

7. Disaster declared: activate the Disaster Recovery Plan.

8. Investigation: When a disaster is declared, a more extensive investigation of the situation and potential recovery scenarios must occur. This builds on the initial information gathered during the triage step.

9. Develop the Recovery Action Plan: The Recovery Action Plan is a critical document which is constructed once sufficient information is known. It outlines the intended steps to manage the situation and recover from the disaster. This plan will form the basis of all activities from this point, but is likely to evolve and be modified as further information comes to light and circumstances change.

10. Decision Point: Failover to Albany? Once sufficient information is available, a decision needs to be made as to whether or not to fail over core services to the Albany data centre. Again, this is a formal decision and is made in collaboration with the CIO and senior university leaders, based on the recommendations of the Crisis Management Team. The declaration of a disaster does not necessarily require a decision to fail over to Albany (it may be possible to recover services without failover), but a fail over should never occur without a disaster first having been declared and this DRP enacted. In making a decision to fail over, the following points need to be considered:

a. When failing over to Albany, services will then continue to run from Albany for a period of several months. There is no easy method to fail back to Turitea and this step has not been incorporated into disaster recovery planning.

b. Whilst core services can be restored at Albany, non-core services will only be restored after all core services are operational. Non-core services are likely to be restored outside of the 72 hour RTO timeframe.

c. Non-core services are not replicated to Albany with the same frequency as core services. Therefore a decision to fail over to Albany may result in a greater data loss for non-core services.

d. Some services cannot be easily failed over to Albany. There are proprietary and specialist systems running in the Turitea data centre which cannot be replicated to Albany (e.g. systems still running on physical server hardware). In the event of a fail over these systems will remain inoperable until such time as they can be rebuilt at Albany.

11. Restore Services at Turitea: Where the decision is made to restore core services at Turitea, a fail over to Albany is not required. Some aspects of the DR capability (e.g. data snapshots, redundant components) may be utilised to achieve the recovery, but the intention is to restore services at the Turitea data centre within 72 hours. The management structures and processes outlined in this DRP will still be in effect, as a disaster has been declared.

12. Restore Core Services at Albany: Where the decision is made to fail over to Albany the focus is on restoring core services within 72 hours. This is achieved by initiating Site Recovery Manager to stand up core services at Albany.

13. Restore Other Services: Only once all core services have been recovered can non-core services be restored. As some of these services will not have been replicated to Albany it may be necessary to arrange for them to be moved from Turitea, or to install additional capacity at Albany to allow the services to be re-established there.

14. Close Disaster: Once services have been restored the disaster can be formally closed. Control of the situation and any ongoing activities associated with the disaster would then revert to the normal operational management structures of ITS and the Crisis Management Team would be disbanded.

15. Plan Failback: After the closure of the disaster, planning must take place to migrate services back to the Turitea data centre (if a fail over to Albany has occurred). This is a substantial activity which will require reconfiguration of equipment and systems, and may not occur for some months after services have been restored.
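
The decision points in the steps above can be sketched as a small function that returns the sequence of activities for a given set of decisions. This is a simplified illustration, not an operational tool: the function name, parameter names and step labels are assumptions, and real decisions are made by the Incident Manager, the CIO and senior university stakeholders as described above.

```python
def recovery_path(building_safe, disaster_declared, fail_over=False):
    """Return the ordered list of process steps for the given decisions.

    building_safe     - outcome of the building decision point
    disaster_declared - outcome of the "declare a disaster?" decision point
    fail_over         - outcome of the "failover to Albany?" decision point
    """
    # A fail over should never occur without a disaster being declared.
    if fail_over and not disaster_declared:
        raise ValueError("fail over requires a declared disaster")

    steps = ["Event", "Discovery"]
    steps.append("Occupy ITS building" if building_safe
                 else "Decamp to Location 2")
    steps.append("Triage")

    if not disaster_declared:
        steps.append("Service restoration (standard recovery methods)")
        return steps

    steps += ["Activate DRP", "Investigation", "Develop Recovery Action Plan"]
    if fail_over:
        steps += ["Restore core services at Albany", "Restore other services"]
    else:
        steps.append("Restore services at Turitea")
    steps.append("Close disaster")
    if fail_over:
        steps.append("Plan failback")
    return steps


for step in recovery_path(building_safe=False, disaster_declared=True,
                          fail_over=True):
    print(step)
```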

[Figure: Disaster Recovery process flowchart. Nodes: 1. Event; 2. Discovery and Alerting; 3. ITS Building Usable? (No: 4. Decamp to alternate location); 5. Triage: gather info and assess situation; 6. Declare Disaster? (No: 7. Initiate Service Restoration; Yes: 8. Enact Disaster Recovery Plan); 9. Investigation; 10. Develop Recovery Action Plan; 11. Failover to Albany? (No: 12. Restore Services at Turitea; Yes: 13. Restore Core Services at Albany, then 14. Restore Other Services); 15. Close Disaster; 16. Plan Failback.]

CRISIS MANAGEMENT TEAM

The Crisis Management Team (CMT) is the primary organisational structure responsible for determining the response to a disaster and managing the restoration of services. The CMT is based on a set of roles created expressly for the management of an emergency, and which are different to business-as-usual (BAU) roles. This is because roles in the CMT should be filled by people with the appropriate competencies, experience and personal attributes rather than being based on their BAU role. This also provides for a greater depth of contingency, should the primary person filling a Team role be unavailable.

Incident Manager

The Incident Manager holds overall responsibility for managing the disaster response, is in charge during an incident, and has authority to act. This person leads and controls the activities of the Crisis Management Team and subsidiary groups.

• Assess the overall situation.
• Assume control of all disaster recovery activities.
• Lead the decision to declare a disaster or not.
• Allocate roles to staff, including the rest of the Crisis Management Team.
• Activate the Disaster Recovery Plan.
• Determine priorities for responding to the situation.
• Initiate the development of the Recovery Action Plan.
• Provide a focal point for decision making.
• Report to the ITS Executive.

Information Officer

Communication of the situation, the ongoing response, and the likely recovery of services, is one of the most critical activities in a disaster. The Information Officer is responsible for keeping all stakeholders up to date with the status of the disaster and recovery. This will include structured communications to the internal ITS teams, and external communications to the wider University and public via the External Relations office.

• Ensure all relevant information is communicated consistently both internally and externally.
• Develop and implement a communications plan to ensure regular updates.
• Inform and liaise with University External Relations.
• Provide regular updates to internal staff and external communications channels (e.g. Service Desk, External Relations).
• Ensure emergency communications channels are operational (e.g. remote web site).
• Coordinate all formal communications regarding the situation.

Document Officer

The Document Officer is responsible for recording all discussions and decision points related to the disaster and recovery, as well as documenting the remedial actions (both planned and unplanned) as they are taken. Documentation will include meeting minutes, technical documentation, and photographs of affected areas and equipment.

• Record all meetings and discussions associated with the situation.
• Author the Recovery Action Plan based on discussions and decisions made
• Ensure that any changes made to systems or environments as a result of restoration activities are documented and that existing technical documentation is updated.
• In the event of a disaster which results in physical damage to equipment or facilities, photograph this damage for documentation purposes.

Staff Safety and Welfare Officer

In a disaster situation it is essential to actively monitor and manage the welfare of the staff involved. The Staff Safety and Welfare Officer is responsible for ensuring the safety of staff at all times during the recovery. This will include physical safety in hazardous situations, as well as ensuring staff welfare is carefully managed (e.g. adequate rest breaks, food, stress management).

• Ensure that all physical locations are safe for staff occupation.
• Monitor staff stress and workload levels to ensure that any issues are identified and dealt with.
• Ensure that staff take regular breaks, and that where necessary they are rostered in shifts to continue critical work.
• Ensure that food, accommodation and other requirements are in place to enable staff to focus on the recovery of services.
• Alert the Incident Manager to any developing issues regarding staff welfare or safety, and ensure that these are dealt with.

Operations Manager

The Operations Manager is responsible for carrying out the activities described in the Recovery Action Plan. This person co-ordinates the activities of the technical teams that will be involved in the restoration of services. The technical teams will be assembled, as appropriate, from the areas of systems infrastructure, network infrastructure, business applications, desktop delivery, security, and IT facilities. The teams themselves will be led by people who are capable of representing their relevant technical areas during planning discussions, and can ensure that the actions identified in the Recovery Action Plan are carried out.

• Provide specialist advice to the CMT when making decisions regarding the management of the situation and construction of the Recovery Action Plan.
• Direct and manage the teams which will be actively engaged in carrying out the restoration of services, including external resources.

Logistics Manager

The Logistics Manager is responsible for procuring and arranging all facilities, equipment, materials, services, and resources required to carry out the restoration of services. This person is responsible for all aspects of ordering, obtaining and managing materials, equipment, licenses and physical locations. The Logistics Manager must have financial authority to carry out these duties.

• Procure all items of equipment, facilities, software, consumables and other material required by staff to carry out the recovery.
• Record procurement activity for later reconciliation.
• Ensure that any damaged equipment or material is catalogued and recorded for future insurance claims or disposal.
• Liaise directly with external vendors to expedite the rapid procurement of

Planning Manager

The Planning Manager is responsible for gathering, analysing and publishing information about the event. This person leads the task of preparing and documenting the Recovery Action Plan, including the activities that will be undertaken, the order they will be done in, the period of time they will take, and the resources required.

• Provide detailed information to the CMT regarding the nature of the disaster.
• Carry out investigations into possible recovery options, and provide the CMT