• No results found

perspective of adopting computing systems

Chapter 6. Technical deployment for a major OSM Case Study

6.2 The non-Cloud Solution

The non-Cloud Storage Area Network (SAN) at St Thomas’ Hospital served the entire GSTT including medical researchers based at Guy’s Hospital. The non-Cloud SAN is formed of four HP storage systems containing a total of 32 terabytes (TB) of storage. Four additional storage servers were added and the total disk storage was expanded to serve up to 64 TB after 2011. This is not classified as a Cloud system, for the following reasons:

• It does not use any virtualisation technologies.

• The deployment is not entirely delivered by distributed technologies. Users based at Guy’s Hospital need to get access to a local virtual private network (VPN) server, which then authenticates and connects users to the SAN at St Thomas’ Hospital. The distance between Guy’s and St Thomas’ is 2 miles for network cabling, which has an impact on the network performance (see Section 6.2.2 for details).

In addition, the system was initially intended for general use and was not particularly designed for scientific research; each time additional work was required before the start of a backup process.

6.2.1 The deployment and architecture of the non-Cloud system In the non-Cloud system, there is a control centre at the SAN to execute

commands for backup services. Backup files take a large amount of disk space up to 10 TB, but only the most recent two backup versions are kept. The aim of this case study is to demonstrate the improvement in efficiency in using a Cloud Storage system comparing to a non-Cloud system. Dependent on the user

114

request, data backup took place at least once a week. Figure 6-1 shows the architecture for backup deployment via VPN.

Figure 6-1: The architecture for backup deployment via VPN

The term ‘job’ is explained in Section 6.1.2, and one job is required to complete the backup process for each dataset. Before beginning to run a batch of jobs, the control centre of the SAN needs to calculate the expected time of completion. A simulation is performed in which all jobs are completed without any failures or incomplete jobs to calculate the expected time of completion. A failed or incomplete job means that the command to send data across the network was unsuccessful and requires another command (job) to send the data again. In other words, it is a simulation to calculate expected completion time when there are no risks involved. It is important information for the system architect and IT Director to know prior to making any the comparisons under the ‘best possible case’. In this work the formulae that calculate the expected execution time are explained in Section 6.6.

When dealing with comparisons of systems involving a large number of jobs, it is common to estimate the incomplete or failed jobs using a simulation method (Perrow, 1999; Brotby, 2009). For a robust system, the backup process should continue without interruptions while keeping the risks (failed or incomplete jobs) to an acceptable rate (Chandy and Sauer, 1978; Gray and Siewiorek, 1991; Perrow, 1999). Indeed, the backup process in this case study allows thousands of jobs to

Non-Cloud SAN, 32 TB backup service and 100 TB for archiving (2007-2011)

St Thomas’ Hospital Guy’s Hospital

Users

VPN

Connects Authenticates Grant access

Access granted 2 miles

115

be completed at once while keeping the risk-control rate under 5% all times as described in Chapter 5. The main code syntax associated with system

administration is presented in Appendix E.

After the authentication, the backup process runs until the completion of the jobs.

This backup is a single-direction data transfer, moving data from Guy’s to St Thomas’ Hospital creating an archive of backed up files. Figure 6-2 shows the case where the user is authenticated to complete the backup process.

Figure 6-2: The single direction of the backup process after the authentication

6.2.2 Issues about network performance

The way that the network is set up can contribute to the network performance, which can influence the performance of the backup process. This explains why the speed of the network, bandwidth and any factors that affect the backup speed should be considered prior to the comparisons beginning. The network speed is 1 GBps (1 GB per second). The distance between St Thomas’ and Guy’s Hospital is about 2 miles, and network is expected to lose some speed during the

transmission, so that actual network speed is lower than 1 GBps. Each SAN has the network analytics tool to determine the average actual network speed. There are two types of network speeds: the download and upload speeds. The backup

process is dependent on the upload speed, since it sends all the files to the secure SAN storage, to the right storage space, and then archives all the files.

System measurement on actual network speed began in 2008, based on the system measurement over one year: the average actual download speed was

Non-Cloud SAN

St Thomas’ Hospital Guy’s Hospital

Users

Backup process after authentication 2 miles

116

reduced to 750 MBps during an off-peak period and 550 MBps during a peak period (9-10am, 11am-1pm and between 4-6pm). The average actual upload time is 400 MBps in the off-peak period and 200 MBps in the peak period. 200 MBps means that 200 MB of data can be moved through the network in 1 second.

All the backup processes are performed automatically in the off-peak period, either 7am or 7pm, to avoid fighting for network resources with users. During the backup process, occasionally some users required more network bandwidth (e.g.

if they had a deadline for medical projects – although all users received warning about the backup), which may have caused the lower upload speed being

experienced. When such situations happen, the upload network speed, expected execution time and actual execution time for backup completion are recorded.

6.2.3 OSM Metrics

This section aims to explain how the parameters used by non-Cloud Storage and also Cloud Storage systems are relevant to OSM metrics. At the end of the backup process, the system administrator is notified of the result of the jobs, including statistics such as the execution time, number of files backed up and the number of files for which the backup process failed, and the percentage of successful and failed jobs. All these parameters are used in the OSM model and are explained as follows:

• Actual execution time: This is the actual time taken to complete the backup process while keeping the rate of failed/incomplete jobs under 5%.

Another round of re-running failed/incomplete jobs will be carried out. The actual execution time is the sum of the first round of the backup process and the completion of re-running the backup jobs.

• Expected execution time: This is calculated by the network simulator. This is the execution time under the ideal situation where there are no failed jobs. Expected execution time can be calculated (See Section 6.6 for details).

• Risk-control rate: This is the controlled risk of running comparisons. As discussed in Chapter 5, the rate of failed or incomplete jobs must be kept under 5%. If the rate is less than 5% then job failures will be reported but

117

will not interrupt the backup process. However, when the rate rises above 5% the entire backup process is terminated and will be restarted on both systems.

6.2.4 Summary of the non-Cloud solution

This section describes the technical setup of the non-Cloud SAN solution. There is only one network route for sending data from Guy’s to St Thomas’ Hospital during the backup process. The distance of 2 miles means that the average network speed is 400 MBps for backup. Backup operations and OSM metrics have been explained including expected execution time, actual execution time and risk-control rate.