IT Disaster Recovery Policy
Use in the event of Individual system outages or total loss of Infrastructure.
Total Loss - resulting from Natural Disasters, Fire, intentional damage or Equipment/Hardware failure.
The college is housed over 4 sites; in the event of a disaster the 2 main sites Vernon Street & Blenheim Walk can accommodate additional IT services.
The main bulk of the IT infrastructure is housed in the Blenheim Walk Server room.
If the Blenheim Walk (BW) site were lost, we would still have some functionality based at Vernon Street (VS) to set up a base infrastructure.
Blenheim Court (BC) is linked via fibre to BW; if BW were lost then BC would also be non-functional and therefore should not be considered for a DR base.
Brodrick Building is a rented building connected via a fibre optic laser; this building should not be considered for a DR base.
Areas of cover, Supplier info
Network Infrastructure
Staff Applications/Hardware
Student Applications/Hardware
Server & Network Topology
Backup Procedure
Insight – Technology Building, Insight Campus, Terry Street, Sheffield, S9 2BU. Tel: 0870 706 7265, Fax: 0870 706 8265, Email: shaun.brooks@uk.insight.com
Misco – Darby Close, Park Farm Industrial Estate, Wellingborough, Northants, NN8 6GS. Tel: 0870 720 8720, Fax: 0870 720 8686, Email: salesdesk@misco.co.uk, Mhill@misco.co.uk
Intrinsic Technology – Cisco sales – 01942 528100
Network Infrastructure
The base/backbone of the network is controlled by a centralised firewall/router. This is provided by ECSC and currently runs on 2 Dell PowerEdge servers.
ECSC Ltd - 01274 513266
ECSC holds backups of the configuration along with a copy of Leeds-art.ac.uk DNS and Academic.lcad DNS zones.
ECSC will need to arrange on-site consultancy to install and configure the new firewalls. A master password list is currently contained in the server room Safe at Blenheim Walk.
Switch requirements
The college uses Cisco switches which run 7 VLANs: 2 staff side, 2 student side, 2 wireless and 1 management.
Switches needed – Cisco Catalyst 3750 series.
Internet Connectivity
The Internet/WAN connection is provided by Janet UK – www.ja.net - 0870 850 2212
In the event of WAN loss Janet will need to install a new router and fibre link from the college to the Janet network.
If necessary, the new location's WAN link (internet) can be used to temporarily route email etc.
Communications
The telephone system is an IP system using Alcatel switches. In the event of a disaster the main college line can be diverted to a temporary BT line or equivalent.
Freedom Communications can provide support, together with the hardware and configuration needed to rebuild the phone system.
Freedom Communications (UK) Ltd
Telephone: 01923 654321, Facsimile: 01924 233070, Service: 0870 055 6900 www.freedomcomms.com
Server Racking & Air Conditioning
There will need to be enough rack space and cooling to accommodate at least 35 servers. If only critical systems are in use then facilities for 10 servers will be sufficient.
Backup system
A server and tape device will be required in order to restore the data.
The tape drive requirements are – HP StorageWorks Ultrium 920 400/800 LTO SAS. VERITAS Backup Exec software will also be required to perform the restores.
Current server name – Oscar – HP DL380 G5 with 2 SATA drive enclosures.
Domain Infrastructure
Leeds-art.ac.uk – staff domain
Academic.lcad – student domain
Staff Applications
Critical:
Unit 4 Agresso system.
This runs on 7 servers, 2 physical & 5 virtual.
Resource 32000 – Finance system. Use the IT Wiki for restoration instructions.
Corero – 01923 695137
Server name – BEAKER Windows 2008 HP DL380 G6
Exchange 2010 SP2 – Email system. Restore the server, system state and Exchange mail store.
Server name – Chef Windows 2008 HP DL380 G8
Chris21 – Human Resources system, stores all employee data. Use the same procedure as above.
Frontier Software – 01925 852284, 01925 852242
Server name – Ernie Windows 2003 HP DL360 G4
Server name – GONZO Windows 2003 HP DL380 G5
Non Critical
Liberty – Library system, restore from tape.
Softlink Europe Ltd – Tel: +44 (01993) 883401, Fax: +44 (01993) 883799
Server name – Bunsen HP DL360 G5
Cellcat – Timetabling software. Restore SQL database. Cellcat.com
Server name – ERNIE Windows 2003 HP DL360 G4
Agent – EBS Reporting software. Contact Tribal for installation.
Tribal Technology Ltd – Tel: 0114 281 6100, Fax: 0114 2816021
Server name – AGENT Windows 2003 HP DL360 G5
Alcatel 4760 – Telephone system software. Restore from tape.
Server name – ROWLF Windows 2003 HP DL360 G5
Efficient 5 – Door entry system, database held on tape.
Kronos Systems Ltd – Tel: 01302 381505
Server name – Clifford (virtual) & TMS Camilla Windows 2003 HP DL360 G5
Staff Intranet – Staff internal documents, department info. Restore from tape.
Server name – KERMIT Windows 2008 HP DL380 G7
IT application – Asset tracker and print system etc.
Server name – ANIMAL (virtual)
IT Network monitoring – In-house tools and WSUS
Server name – STRANGEPORK
Student Applications
File Server(s) – Hold students' work. Currently backed up once per week (to disk), but too large to house on tape or off site.
Server name – MENDEZ HP DL585 & GAUDI Windows 2008 HP X1800 G2
PaperCut – Student printing system, holds records of students' print credit balances.
Server name – RAPHAEL HP DL360 G6P & KAHLO Windows 2003 HP DL360 G4
Student Web server – Allows students to create their own websites.
Server name – STUD_WWW UNIX HP DL360 G5
Student Licensing Servers – Not backed up.
Server name – BLAKE Apple Mac Server
Server name – KIRBY Apple Mac Server
Server name – HART Windows 2008 server – AutoCAD/ghosting etc.
Backup Process
The current full backup size is 6.0 TB of data. Restore times for this volume of data vary depending on the server and network being restored onto.
Restores of this size are usually run throughout the night.
Certain systems' data will therefore take longer to restore from tape than others, depending on total size.
Tape Retention
One full Friday backup each month end – 12 tapes, possible to restore up to 1 year previous.
16 incremental weekday tapes – these are held for 4 weeks.
4 full Friday tapes – 4 weeks of data.
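As a rough cross-check of the figures above, the number of physical tapes in circulation can be tallied as in the sketch below. It assumes that the retention list counts backup sets, that each full backup set (Friday or month end) spans 3 tapes as stated under Current Backup Procedure, and that each weekday job fits on 1 tape. The constant names are illustrative only.

```python
# Tally of physical tapes implied by the retention scheme (illustrative sketch).
WEEKS_IN_ROTATION = 4
WEEKDAY_JOBS_PER_WEEK = 4   # Monday to Thursday
TAPES_PER_WEEKDAY_JOB = 1   # assumption: one tape per weekday job
TAPES_PER_FULL_BACKUP = 3   # assumption: taken from the Friday / Month End figures
MONTH_END_SETS = 12         # one set per month, kept for a year

weekday_tapes = WEEKS_IN_ROTATION * WEEKDAY_JOBS_PER_WEEK * TAPES_PER_WEEKDAY_JOB
friday_tapes = WEEKS_IN_ROTATION * TAPES_PER_FULL_BACKUP
month_end_tapes = MONTH_END_SETS * TAPES_PER_FULL_BACKUP
total_tapes = weekday_tapes + friday_tapes + month_end_tapes

print("Weekday tapes:  ", weekday_tapes)
print("Friday tapes:   ", friday_tapes)
print("Month end tapes:", month_end_tapes)
print("Total tapes:    ", total_tapes)
```

The weekday figure agrees with the 16 tapes listed above; the Friday and month end figures are higher than the listed set counts only because of the 3-tapes-per-set assumption.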
Current Backup Procedure.
Backup Server – Oscar, DL380 G5 with HP 800/1600 SAS Autoloader.
Backup regime – currently covers a 4 week rotation (disk and tape).
Backup to disk is performed first, and backup to tape is performed afterwards when the backup window has diminished.
Monday – Thursday – Differential – files changed since the last full backup, covers 1 tape. (Total backup around 300 GB)
Friday – Full backup – 3 tapes
Month End Tape – 3 tapes to be used instead of the Friday tape on the last Friday of every month.
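The weekly regime above can be expressed as a small lookup by date. This is a minimal sketch, not part of the Backup Exec configuration: the function names are hypothetical, and it assumes that no job runs at the weekend, which the procedure does not state.

```python
from datetime import date, timedelta

def is_last_friday_of_month(day: date) -> bool:
    """True if `day` is a Friday and the following Friday falls in another month."""
    return day.weekday() == 4 and (day + timedelta(days=7)).month != day.month

def backup_job_for(day: date):
    """Return (job type, number of tapes) for a given date."""
    weekday = day.weekday()          # Monday = 0 ... Sunday = 6
    if weekday <= 3:                 # Monday to Thursday
        return ("differential", 1)
    if weekday == 4:                 # Friday
        if is_last_friday_of_month(day):
            return ("month end full", 3)
        return ("full", 3)
    return ("none", 0)               # assumption: no weekend jobs

print(backup_job_for(date(2013, 3, 25)))  # a Monday
print(backup_job_for(date(2013, 3, 22)))  # a Friday
print(backup_job_for(date(2013, 3, 29)))  # last Friday of March 2013
```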
Tape Storage.
The daily tapes and the following week's tapes are housed in the fireproof section of the safe, which is located in the IT Server room.
All sets of Friday tapes (except current week) are to be housed at Blenheim Court (off-site).
Tape Procedure.
Every Friday the following week's weekday tapes are loaded and an inventory is set to run.
The following day the backup logs should be checked for errors or any problems which may have been encountered, e.g. faulty tapes.
Tapes should then be removed and placed back in the safe.
Tape Labelling.
Tapes are labelled according to the 4 week rotation.
Monday 1 - 4
Tuesday 1 - 4
Wednesday 1 - 4
Thursday 1 - 4
The number shows the week of the rotation: week 1 tapes are labelled 1, week 2 is 2, week 3 is 3 and week 4 is 4.
Friday tapes are also labelled over 4 weeks in the same way: 1.1, 1.2, 1.3, 2.1, 2.2 etc.
Month End tapes are labelled according to the month, e.g. March 1 (2) etc.
Data is held on Tape for a maximum duration of 12 months.
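The labelling scheme can be sketched as a function that returns the tape labels expected for a given date. Several things here are assumptions rather than documented practice: ROTATION_START is a hypothetical Monday on which week 1 of the rotation began, each full backup is taken to use 3 tapes, and month end tapes are numbered "March 1", "March 2", "March 3" as one reading of the "March 1 (2)" example.

```python
import calendar
from datetime import date, timedelta

ROTATION_START = date(2013, 1, 7)   # hypothetical: a Monday that began week 1
TAPES_PER_FULL_BACKUP = 3           # assumption: 3 tapes per Friday / month end set

def rotation_week(day: date, start: date = ROTATION_START) -> int:
    """Week of the 4 week rotation (1 to 4) that `day` falls in."""
    return ((day - start).days // 7) % 4 + 1

def tape_labels(day: date, start: date = ROTATION_START) -> list:
    """Labels of the tapes expected to be used on `day`."""
    weekday = day.weekday()              # Monday = 0 ... Sunday = 6
    tape_numbers = range(1, TAPES_PER_FULL_BACKUP + 1)
    if weekday <= 3:                     # Monday to Thursday: one tape
        return [f"{calendar.day_name[weekday]} {rotation_week(day, start)}"]
    if weekday == 4:                     # Friday
        last_friday = (day + timedelta(days=7)).month != day.month
        if last_friday:                  # month end set replaces the Friday set
            month = calendar.month_name[day.month]
            return [f"{month} {n}" for n in tape_numbers]
        week = rotation_week(day, start)
        return [f"{week}.{n}" for n in tape_numbers]
    return []                            # assumption: no weekend tapes

print(tape_labels(date(2013, 1, 7)))    # Monday of week 1
print(tape_labels(date(2013, 1, 18)))   # Friday of week 2
print(tape_labels(date(2013, 1, 25)))   # last Friday of January 2013
```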
VEEAM
Veeam is used to back up our virtual servers. The virtual infrastructure consists of one ESX4i server and a cluster of 3 ESX5i servers in a VSA. See the IT wiki for detailed setup.
Backup Exec V2I Imaging Software.
This software is used to image an entire server and its drives. If total failure occurs, a new server can be imaged from the V2i file and can be up and running in hours. Currently we only have 9 licences, which are used to back up critical servers.
Virtual – VMware (VEEAM)
With the growing use of VMware and the introduction of the VSA, we are using Veeam to back up the virtual machines; this takes a snapshot of the servers.
Restores
Restores are carried out using Backup Exec: the required file is selected in the software, and the system then restores it from disk or asks for the corresponding tape.
Restores are sent to a new location for checking before being placed back in their original location.
Tape Drive cleaning
When the cleaning light flashes on the drive (around every 28 days), the loader should clean the heads.
The cleaning tape only lasts for a set number of cleans.
The drive should always have a current cleaning tape, not an expired one. This should help keep the backup drive functioning normally.
Issues
Any issues that could stop a backup job from running should be reported to the Infrastructure team immediately.
Restore Order
Network Infrastructure
External Links – Internet & Telecommunications
Servers (priority)
1. Backup Exec Server/Tape drive
2. Finance System
3. Agresso System
4. HR System
5. File Server
6. Email Server
7. Other systems
However it is noted that depending on the time of year the HR and Agresso systems could take precedence over the Finance system.
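The server priority list and its seasonal exception can be written down as a small helper, which may be useful when scripting or documenting a restore plan. This is a sketch only: the function name and the hr_and_agresso_first flag are hypothetical, and the policy does not say at which times of year the exception applies, so that decision is left to the caller.

```python
# Default server restore order, as listed in the policy.
BASE_ORDER = [
    "Backup Exec Server/Tape drive",
    "Finance System",
    "Agresso System",
    "HR System",
    "File Server",
    "Email Server",
    "Other systems",
]

def restore_order(hr_and_agresso_first: bool = False) -> list:
    """Return the restore order, optionally moving Finance after Agresso and HR."""
    order = list(BASE_ORDER)
    if hr_and_agresso_first:
        order.remove("Finance System")
        # Re-insert Finance immediately after the HR System.
        order.insert(order.index("HR System") + 1, "Finance System")
    return order

for position, system in enumerate(restore_order(hr_and_agresso_first=True), start=1):
    print(f"{position}. {system}")
```

The Backup Exec server stays first in both cases, since every other restore depends on it.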
Build Times
Source and purchase new hardware – 1 week
Rapid Response from ECSC to install new firewall/router – 1 week
Liaise with BT for new telecommunications lines – BT dependent
Arrange for Freedom consultancy to rebuild the phone system at new location – 1-2 weeks
Install Internet connection – Janet/BT dependent
Install and configure new Backup Exec server – 2 days
Install and configure new switching (basic) – 1 week
Liaise with 3rd party software vendors.
Install Servers
Agresso – 1 to 2 weeks
Email – 1 week
Finance System – 1 week
HR System – 1 week
Initial Restore Times:
1 day - Tape recovery and catalogue.
1 week - Data recovery from tape.
1 week - Catalogue software stores.
Roles:
Chris Dodds – Responsible for the configuration and maintenance of backup jobs, technical support for the backup process and fault resolution. In charge of tape rotation, log analysis & ad-hoc data restores.
Ian Atkinson – Responsible for scripting backup jobs for Linux/Ubuntu servers. Technical support for application and server restores.
Dan Hill – DR policy and supplier & 3rd party relations. Technical support for application and server restores.
Restore Roles:
Chris Dodds – Retrieve appropriate tapes from safe. Help to rebuild PCs and servers if required.
All – Setup of Backup Exec server, retrieval of data from tapes, restoration of servers in the order set by the business impact policy.
All – Setup of network infrastructure and switches, restoration of non-Windows servers and workstations. Aid in setting up external links, e.g. phones and internet.
Dan Hill – Procurement of necessary IT equipment and services. Liaise with the business on recovery timescales and procedures. Carry out processes of restore according to the DR policy and the business impact analysis.
Disaster Recovery Testing Schedule
All systems are to be restored from tape, using the DR tape drive and server, into a VMware virtual environment.
The testing phase will then lead to updating the system documentation to either add or amend server configurations on the IT WIKI.
Any problems, new servers or additional info will also be entered into the WIKI.
Once updated, the pages from the WIKI are to be printed out and placed in the off-site safe.
DR Testing Scenarios 2011
These were completed individually as separate DR scenarios.
Time taken = total time to restore the server to a working state, no time gaps from start to finish (real world)
Build Virtual Environment
Installation of ESX Server from DVD: 20 mins
Installation of Backup Exec: 40 mins
Cataloguing of media: 10 mins, BESR: 20 mins
Install & configure Windows Server (200X)
Installation of 2008R2 with default settings: 20 mins
Installation of 2003R2 with default settings: 30 mins
Configuration and installation of services, roles and features (i.e. IIS, .Net, File, Print): 10 mins
Install & Restore Exchange
Installation of Exchange with disaster recovery option: 20 mins
Restoration of Exchange database from tape: 31 mins
Bridget requested calendar items from Outlook via helpdesk. The DR server and data were used. She was able to log in without errors, the data was present and no permissions errors were recorded.
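Adding up the individual timings recorded above gives a rough end-to-end figure for rebuilding the Exchange server from nothing. The sketch assumes that the steps run one after another with no overlap, that Exchange is installed on Windows Server 2008 R2, and that the BESR step is needed; none of this is stated in the test notes.

```python
# Recorded 2011 test timings, in minutes, for the steps assumed to be
# needed for a full Exchange rebuild (sequential, no overlap).
steps_minutes = {
    "Install ESX Server from DVD": 20,
    "Install Backup Exec": 40,
    "Catalogue media": 10,
    "BESR": 20,
    "Install Windows Server 2008R2": 20,
    "Configure services, roles and features": 10,
    "Install Exchange (disaster recovery option)": 20,
    "Restore Exchange database from tape": 31,
}

total_minutes = sum(steps_minutes.values())
hours, minutes = divmod(total_minutes, 60)

for step, mins in steps_minutes.items():
    print(f"{step}: {mins} mins")
print(f"Total: {total_minutes} mins ({hours} h {minutes} min)")
```

On these assumptions the total comes to 171 minutes, just under 3 hours, excluding hardware sourcing and any waiting time between steps.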
Install & Restore Resource
Restoration of files from tape: 40mins
Restoration of SQL database from tape: 20mins
JB tested: All data was in, but reports couldn't be run due to problems with IQ Objects.
Staff File Server – restore a selection of files
Estimated time for full restore of all data (1400 GB): 35 hrs
Tested whether permissions had copied over; all were correct and working. Logged in to a test PC and the folder mapping (S:) was created.
HR – Chris21 Install & Restore
Installation of C21 software: 10mins
Restoration of data files: 5mins
Amanda N tested. Data was correct and could be amended, saved and reloaded.
EBS – restore Databases & files for Tribal to use
Installation of Oracle: 20mins
Restoration of Database:
No Oracle licence, Tribal will not restore without it.
Problems Encountered
Backup Exec System Recovery
Some of the images would not import directly to the ESX server. However, we could deploy onto a base image and extract data directly.
Network Traffic
Running the network at 100Mbps slowed down the transfer rate and may have exaggerated restore times. This will not be a problem in a real disaster as all new switches run at 1000Mbps.
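The effect of the 100Mbps network can be checked against the staff file server estimate of 1400 GB in 35 hours. The sketch below converts that estimate into an effective transfer rate and compares it with the theoretical ceiling of each link speed. It uses decimal units (1 GB = 1000 MB) and ignores protocol overhead and tape drive speed, so the figures are indicative only; the function names are illustrative.

```python
def gb_per_hour_to_mbps(gb_per_hour: float) -> float:
    """Convert GB per hour to megabits per second (decimal units)."""
    return gb_per_hour * 8 * 1000 / 3600

def mbps_to_gb_per_hour(mbps: float) -> float:
    """Convert megabits per second to GB per hour (decimal units)."""
    return mbps * 3600 / (8 * 1000)

# Staff file server test: estimated 35 hours for 1400 GB.
observed_gb_per_hour = 1400 / 35
observed_mbps = gb_per_hour_to_mbps(observed_gb_per_hour)

# Theoretical ceilings for the test network and for the new switches.
ceiling_100 = mbps_to_gb_per_hour(100)
ceiling_1000 = mbps_to_gb_per_hour(1000)

# Time to restore the 6.0 TB full backup at the observed rate.
full_backup_hours = 6000 / observed_gb_per_hour

print(f"Observed rate: {observed_gb_per_hour:.0f} GB/h ({observed_mbps:.1f} Mbps)")
print(f"100Mbps ceiling: {ceiling_100:.0f} GB/h")
print(f"1000Mbps ceiling: {ceiling_1000:.0f} GB/h")
print(f"6.0 TB at observed rate: {full_backup_hours:.0f} hours")
```

The observed rate of about 89 Mbps is close to the 100Mbps ceiling, which supports the view that the network was the limiting factor in the test. At that rate the full 6.0 TB backup would take about 150 hours, in line with the 1 week allowed for data recovery from tape under Initial Restore Times.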
Server 2008 R2
Mouse performance was very choppy within the ESX console, but was fine when using Remote Desktop. VMware expect a fix within the next driver update package.
More tracking is needed to ensure that the correct installation is installed first time.
Oracle
Could not restore the DB from Backup Exec (BE); the reason is unknown. Tribal to investigate.
However, dumps were restored and imported.
Resource
DR Testing Scenarios 2012
This year we focussed on scenario testing. The following were performed:
Migration and restore of Resource 32000 to a new OS and server. Fully document the transfer and add to the IT wiki. (mirrors total failure)
Create and move Portal (intranet) from Linux to Windows 2008. (mirrors total failure)
IT Helpdesk – restore and migrate into a virtual environment. (mirrors total failure)
Test restoration of new Agresso servers into another virtual environment.
All scenarios were successful and new documentation has been added to the IT Wiki.
DR Testing Scenarios 2013
Restoration and migration of Exchange email server.
Restore from backup
Create working replica of Exchange 2003
Create new Exchange 2010 server and upgrade domain
Migrate users to new server