Finding a Cure for Downtime

7 Tips for Reducing Downtime in Healthcare Information Systems

EXECUTIVE SUMMARY

As their adoption continues, electronic medical records, PACS and other health information systems are becoming an increasingly integral part of the delivery of patient care. With vital patient information residing in electronic health records and images, tolerance for system downtime is approaching zero. In addition, if patient data is lost or corrupted, compliance with HIPAA and JCAHO data integrity and data protection requirements could be at risk.

Whether you're a local community hospital or national health network, this paper outlines seven key tips that every healthcare organization should consider to protect the availability of healthcare information systems. From reducing human error, to understanding the key differences between high availability and disaster recovery, to selecting the right hardware and storage components, this paper provides an overview of the key steps necessary to ensure the availability and integrity of your healthcare information systems.

THE COST OF DOWNTIME IN HEALTHCARE

According to research by Healthcare Informatics:

• Every minute of downtime to healthcare information systems costs more than $264 for a typical 500-bed hospital.

• Each 1% of downtime per year costs more than $1.4 million for that same 500-bed hospital in additional operating costs.

That's not including the potential costs of medical errors made during downtime.

SEVEN TIPS FOR REDUCING DOWNTIME

TIP #1: REDUCE HUMAN ERROR

Human error is one of the primary sources of unplanned downtime. There are two main types of human factors that cause or worsen downtime:

1. The initial human errors that cause the downtime

2. Additional errors after an outage has occurred that delay uptime restoration

Niel Nickolaisen, IT process management expert and co-author of the new book Stand Back and Deliver: Accelerating Business Agility, says that in his experience working with best-practice companies, system downtime can be reduced by at least 70% by defining and implementing a simple IT change management process. Such a process includes identifying and researching the change, performing risk analysis on the change, and consistently communicating the change.

All changes should be reviewed by a cross-functional team that always includes the operational people responsible for maintaining the changed system. A cross-functional review is essential to avoiding unexpected problems that can lead to unplanned downtime, because different teams may be aware of interdependencies between systems that would otherwise go unnoticed.

Errors made after an outage can be just as costly as the ones that cause it. When a problem occurs, someone must identify the problem correctly and take steps to address it. To restore service or ensure continued availability, operations teams may need to replace damaged server hardware, restore lost data, replace redundant network connections, and/or reload applications or an operating system.

The next step in reducing human error is to simplify IT processes, including change management processes, wherever possible. As Nickolaisen puts it in his law of inverse entropy: "Left to ourselves, we humans (even IT people) will complicate all processes and guidelines."

To guard against excess complexity, he suggests revisiting and reviewing change management processes regularly, and making changes if:

• Change management processes exceed two pages

• A change review form exceeds one page in length

• A special "emergency" change process emerges because the ordinary process is too time-consuming or complex

The same general principles apply to ordinary operational processes. Managing and maintaining availability in the IT organization should be simple, reducing the strain on expert employees. For example, IT operations staff should be able to:

• Roll back changes easily

• Easily maintain, upgrade or replace an IT infrastructure component without complex processes for switching components in and out of production

Many technologies for high availability actually introduce complexity into the IT environment. For example, clustering technologies may require administrators to painstakingly maintain each server in the cluster to support successful failover. IT organizations instead should find and embrace those technologies that reduce complexity for operational staff—thereby eliminating potential sources of human error.

TIP #2: UNDERSTAND THE DIFFERENCE BETWEEN HA AND DR

Don't confuse high availability (HA) and disaster recovery (DR), or expect a DR solution to meet your availability needs. Put simply, HA is about preventing downtime and disasters, while DR is about recovering from problems once they occur. Disaster recovery solutions are designed to help you recover from true disasters (floods, fires), not minor problems (network card failure, storage corruption).

To identify your needs, you should evaluate two key factors:

Recovery Time Objective (RTO): How long can you afford to have your system down – if at all? Electronic healthcare record systems or patient identification systems, for example, may not tolerate any downtime.

Recovery Point Objective (RPO): How out-of-date can you afford your data to be once the system is up and running again? HIPAA regulations require that you protect patient records from loss.
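The two factors above can be turned into a simple pass/fail check for any protection approach you are considering. The sketch below is illustrative only: the class names, the `meets_objectives` helper and all of the durations are assumptions made for the example, not figures from this paper or from any product.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryObjectives:
    """What the business can tolerate for one application."""
    rto: timedelta  # longest acceptable outage
    rpo: timedelta  # largest acceptable window of lost data


@dataclass(frozen=True)
class ProtectionPlan:
    """What a given protection approach is expected to deliver."""
    name: str
    expected_recovery_time: timedelta
    worst_case_data_loss: timedelta


def meets_objectives(plan: ProtectionPlan, objectives: RecoveryObjectives) -> bool:
    """Return True only if the plan satisfies both the RTO and the RPO."""
    return (plan.expected_recovery_time <= objectives.rto
            and plan.worst_case_data_loss <= objectives.rpo)


if __name__ == "__main__":
    # Hypothetical objectives for a patient record system: almost no
    # downtime and no lost updates.
    ehr = RecoveryObjectives(rto=timedelta(minutes=5), rpo=timedelta(0))

    # Hypothetical plans with made-up durations, for comparison only.
    plans = [
        ProtectionPlan("DR site restored from nightly backup",
                       timedelta(hours=8), timedelta(hours=24)),
        ProtectionPlan("Local HA with no data loss",
                       timedelta(0), timedelta(0)),
    ]
    for plan in plans:
        verdict = "meets" if meets_objectives(plan, ehr) else "misses"
        print(f"{plan.name}: {verdict} the objectives")
```

Writing the objectives down per application, rather than once for the whole data center, makes it easier to see which systems need local HA and which can rely on DR alone.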

DR failover is sometimes necessary, but is usually a lengthy and complex process. Outages of hours or days are costly and dangerous in the healthcare environment. Your systems should be able to keep operating through minor problems and disruptions that fall short of site-wide disasters. Local HA protection can keep your systems running during the common, minor outages and failures that plague complex computer systems. And for complete protection, you should combine local HA with DR as protection from major, site-wide failures.

TIP #3: REDUCE SERVER FAILURES WITH QUALITY HARDWARE AND COMPONENT REDUNDANCY

Server hardware problems can cause unplanned downtime in several ways:

• Catastrophic server failures caused by memory, processor or motherboard failures

• Failures of server components, including power supplies, fans, internal disks, disk controllers, host bus adapters and network adapters

To reduce the chances of catastrophic server failure, purchase robust name brand servers, perform recommended preventative maintenance, and monitor server errors for signs of future problems.

To reduce downtime caused by server component failures, add redundancy at the component level. Examples include redundant power and cooling, ECC memory that can correct single-bit memory errors, and combining Ethernet cards with RAID.
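The benefit of component redundancy can be estimated with the standard parallel-availability formula, 1 - (1 - a)^n, where a is the availability of one component and n is the number of redundant copies. The sketch below assumes that the copies fail independently and that any single surviving copy is enough to keep the server running; real components often share failure causes, so treat the result as an upper bound. The function names and the 99% example figure are assumptions for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def redundant_availability(component_availability: float, copies: int) -> float:
    """Availability of `copies` independent components in parallel.

    The group is unavailable only when every copy is down at once.
    """
    if not 0.0 <= component_availability <= 1.0:
        raise ValueError("component_availability must be between 0 and 1")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return 1.0 - (1.0 - component_availability) ** copies


def expected_downtime_minutes(availability: float) -> float:
    """Expected minutes of downtime per year at the given availability."""
    return (1.0 - availability) * MINUTES_PER_YEAR


if __name__ == "__main__":
    single = 0.99  # hypothetical availability of one power supply
    for copies in (1, 2):
        a = redundant_availability(single, copies)
        print(f"{copies} copy/copies: availability {a:.4%}, "
              f"about {expected_downtime_minutes(a):.0f} minutes down per year")
```

Under these assumptions, a second copy of a component cuts its expected contribution to downtime by two orders of magnitude, which is why redundancy at the component level is usually cheaper than chasing a more reliable single part.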

TIP #4: PROTECT AGAINST STORAGE FAILURES WITH STORAGE DEVICE REDUNDANCY AND RAID

Government regulations require that you protect patient data from loss. Use device redundancy combined with RAID storage algorithms to protect data access and data integrity from hardware failures.

For local storage, add extra disks configured with RAID protection. Use a second disk controller to prevent the controller itself from being a single point of failure.

Access to shared storage relies on either a fibre channel or Ethernet storage network. To assure uninterrupted access to shared storage, these networks must be designed to eliminate all single points of failure. This requires redundancy of storage network paths, network switches, and network connections to each storage array.

TIP #5: USE REDUNDANT NETWORK PATHS, SWITCHES AND ROUTERS TO PROTECT AGAINST NETWORK FAILURES

The network is another possible source of downtime. The network infrastructure itself must be fault-tolerant, with redundant network paths, switches, routers and other network elements. Use redundant server connections to eliminate failovers caused by the failure of a single server or network component.

Make sure that the physical network hardware does not share common components. For example, dual-ported network cards share common hardware logic, and a single card failure can disable both ports. For full redundancy, you need either two separate adapters or a built-in network port combined with a separate network adapter.

TIP #6: PROTECT AGAINST SITE FAILURES WITH DATA REPLICATION TO ANOTHER SITE

Many factors can cause site-wide failures, including an air conditioning failure or leaking roof, a power failure, or a major hurricane. Site disruptions can last anywhere from a few hours to days or even weeks.

There are two methods for replicating data across sites. One method is to tightly couple redundant servers across high speed/low latency links, to provide zero data loss and zero downtime. The other method is to loosely couple redundant servers over medium speed/higher latency/greater distance lines. This provides a disaster recovery capability where a remote server can be restarted with a copy of the application database missing only the last few updates. In the latter case, asynchronous data replication maintains a backup copy of the database.

Combine data replication with error detection and failover tools to help get a disaster recovery site up and running in minutes or hours, rather than days – remaining compliant with industry regulations while preserving access to patient data.
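The difference between the two replication methods can be seen in a toy model. The sketch below is an in-memory illustration only, not a description of how any replication product works: the `ReplicatedStore` class and its methods are hypothetical, and network transfer is reduced to moving records between two lists. In synchronous mode a write reaches the replica before it is acknowledged; in asynchronous mode writes are queued and shipped later, so a site failure can lose whatever is still in the queue.

```python
from collections import deque
from typing import Deque, List, Optional


class ReplicatedStore:
    """Toy primary/replica pair used to compare replication modes."""

    def __init__(self, synchronous: bool) -> None:
        self.synchronous = synchronous
        self.primary: List[str] = []
        self.replica: List[str] = []
        self._pending: Deque[str] = deque()  # written locally, not yet shipped

    def write(self, record: str) -> None:
        """Apply a write at the primary site."""
        self.primary.append(record)
        if self.synchronous:
            # Tightly coupled: the replica has the record before we return.
            self.replica.append(record)
        else:
            # Loosely coupled: the record is shipped some time later.
            self._pending.append(record)

    def ship_pending(self, max_records: Optional[int] = None) -> None:
        """Send queued records to the replica, oldest first."""
        count = len(self._pending)
        if max_records is not None:
            count = min(count, max_records)
        for _ in range(count):
            self.replica.append(self._pending.popleft())

    def lost_on_site_failure(self) -> List[str]:
        """Records the replica would be missing if the primary site failed now."""
        return list(self._pending)


if __name__ == "__main__":
    for mode in (True, False):
        store = ReplicatedStore(synchronous=mode)
        for i in range(5):
            store.write(f"update-{i}")
        store.ship_pending(max_records=3)  # the link only kept up with 3 updates
        label = "synchronous" if mode else "asynchronous"
        print(label, "would lose:", store.lost_on_site_failure())
```

The queue in the asynchronous case corresponds to "the last few updates" mentioned above; its length at the moment of failure is, in effect, the recovery point you actually achieve.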

TIP #7: USE HA SOFTWARE TO AUTOMATE PROTECTION OF SERVERS, STORAGE AND NETWORKS

Implementing tips 3-6 above requires a significant investment of time and resources, particularly if you have no automated way to monitor and react to failures in the application infrastructure.

Automated application availability software can reduce or entirely eliminate downtime and data loss without adding a lot of overhead. A software solution can monitor and react to failures in systems, storage and networks, automatically reconfiguring resources to keep the application operating with little or no interruption or loss of client connectivity.
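To make "monitor and react" concrete, the sketch below shows the general shape of an automated health check with failover. It is a generic illustration under stated assumptions, not the mechanism used by any particular product: the `FailoverMonitor` class, the `probe` callback and the consecutive-failure threshold are all hypothetical names and choices made for the example.

```python
from typing import Callable, List


class FailoverMonitor:
    """Tracks one active node and switches to a healthy standby when needed."""

    def __init__(self, nodes: List[str], probe: Callable[[str], bool],
                 threshold: int = 3) -> None:
        if not nodes:
            raise ValueError("at least one node is required")
        if threshold < 1:
            raise ValueError("threshold must be at least 1")
        self.nodes = list(nodes)
        self.probe = probe          # returns True when the named node is healthy
        self.threshold = threshold  # consecutive failed probes before failover
        self.active = self.nodes[0]
        self._consecutive_failures = 0

    def poll(self) -> str:
        """Run one health check and return the node that is active afterwards."""
        if self.probe(self.active):
            self._consecutive_failures = 0
            return self.active

        self._consecutive_failures += 1
        if self._consecutive_failures >= self.threshold:
            # Look for a healthy standby; stay put if none is available.
            for candidate in self.nodes:
                if candidate != self.active and self.probe(candidate):
                    self.active = candidate
                    self._consecutive_failures = 0
                    break
        return self.active


if __name__ == "__main__":
    health = {"server-a": True, "server-b": True}
    monitor = FailoverMonitor(["server-a", "server-b"],
                              probe=lambda name: health[name], threshold=2)
    print(monitor.poll())       # server-a is healthy and stays active
    health["server-a"] = False  # simulate a server failure
    print(monitor.poll())       # first failed probe, no switch yet
    print(monitor.poll())       # second failed probe triggers failover
```

Requiring several consecutive failed probes before switching is a common way to avoid failing over on a single dropped health check, at the cost of a slightly longer detection time.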

The next section of this paper describes the automated application availability software developed by Marathon Technologies and its use in the healthcare industry to protect healthcare information systems.

everRUN PROTECTS HEALTHCARE APPLICATIONS FROM DOWNTIME

Marathon Technologies' everRun® software provides automated application availability that is easy to deploy, maintain and afford. everRun eliminates the complexities and hidden costs of traditional high availability and data protection products to give you reliable, affordable application protection. Delivered in software, everRun makes application availability practical for a wide range of healthcare applications, including imaging systems, electronic medical records, building security systems and others.

Leading healthcare providers around the globe rely on everRun to protect applications critical to patient care. everRun offers:

High availability: everRun's ComputeThru™ technology means that EHR and PACS applications keep running around the clock – helping you meet HIPAA and JCAHO data integrity and data protection requirements.

Affordability: everRun fits seamlessly with existing infrastructure, supporting commodity, low-cost hardware and local, shared-disk or SAN-based storage. It has a lower acquisition cost than other solutions, with a much lower total cost of ownership than cluster solutions.

Automation: With automated fault detection and policy management, everRun monitors systems continuously and automatically addresses failures. everRun does not require any specialized training.

Easy integration: everRun supports any Windows application, even special or custom applications, without requiring custom scripts or application changes.

BELFAST CITY HOSPITAL CASE STUDY

Opened in March 2006, Belfast City Hospital’s Oncology Centre is the Regional Centre of Excellence in the treatment of cancer across Northern Ireland.

REQUIREMENTS

Belfast City Hospital needed a system that would be accessible by all cancer units, 24x7, throughout Northern Ireland, and that could be easily updated in real time. This system ensures that all the information on a patient, their diagnosis, treatment and progress can be accessed from any of the cancer units throughout Northern Ireland. The hospital turned to Real Time Systems Ltd (RTSL), an IT solutions provider, to design and implement a state-of-the-art IT solution for patient care. RTSL chose everRun to protect the critical patient systems.

THE SOLUTION

RTSL designed the IT system, using everRun to work within the existing infrastructure while providing 24x7 availability. They then implemented a 'split site' solution from Marathon to protect the system from a site-wide disruption.

THE RESULTS

The resulting system has not experienced any unscheduled downtime since implementation. Philip Leighton, Senior Systems Specialist, Belfast City Hospital says, “It is unthinkable what we would be up against without this system in place. We transferred all the users on to the system over a weekend period; in the space of a week, the everRun system proved its worth.”

Belfast City Hospital has plans to move more of its systems over to everRun in the long term. The main beneficiaries are the patients, as patient care was at the core of the system's design.

SCOTT & WHITE HOSPITAL AND CLINIC CASE STUDY

Scott & White Memorial Hospital and Clinic is the largest multi-specialty practice in Texas, with more than 500 physicians who provide care in Temple and at 15 regional clinics throughout Central Texas. In addition to accepting most major insurance plans, they deliver care to members of the Scott & White Health Plan, one of the highest rated plans in the nation. Scott & White is also the clinical educational site for The Texas A&M Health Science Center College of Medicine.

REQUIREMENTS

Scott & White’s Center for Advanced Medicine (CAM) is technologically integrated, interdisciplinary and designed for the most important people in a hospital – the patients. This new facility has a Siemens SiPass access control system installed for security. The access control for other clinics in surrounding areas is also controlled from the same server, installed in the CAM.

Siemens Building Technologies was chosen to provide a single SiPass system to control access to all of these important facilities, which operate 24x7. A solution was required to protect the SiPass security system from any failures. Marathon’s everRun was chosen because it is the qualified fault-tolerant solution for Siemens’ SiPass and offers simple and superior application and system availability.

THE SOLUTION

everRun was set up in a very short time and passed the hospital’s required tests immediately. everRun synchronizes two standard Windows servers to create a virtual application environment that runs a single license of SiPass on both servers simultaneously. If a device or even an entire server fails, SiPass continues to operate uninterrupted. All redundancies and failures are completely transparent to the applications and users, with no interruption or downtime.

Applications are installed, managed and accessed through a single Windows environment, eliminating the need to license, install and manage multiple copies as required in clustering and failover situations.

THE RESULTS

“The Marathon solution for SiPass offers us the extremely high level of availability we need for our critical control system at 24 by 7 facilities, where interruptions in the access control security system cannot be tolerated,” said Royce Cox, Senior Project Manager of Siemens Building Technologies.

SUMMARY

No matter what size your healthcare organization, you cannot afford downtime or data loss for critical, patient-facing systems. The seven tips outlined in this paper highlight processes and strategies that you can adopt today to reduce the risks of downtime, while containing the cost of IT management.

More than 2,500 organizations around the world trust their application availability to Marathon’s everRun software. Find out how Marathon can help your organization ensure regulatory compliance, contain costs and protect patient safety with automated availability.

To see everRun in action, watch our product demo videos or download a free 30-day evaluation license, visit www.marathontechnologies.com.

The Marathon logo, SplitSite and everRun are trademarks or registered trademarks of Marathon Technologies Corporation. Microsoft and Windows are registered trademarks of Microsoft Corporation. All other trademarks and registered trademarks are the property of their respective owners. Copyright 2010 Marathon Technologies Corporation. All rights reserved. Marathon Technologies Corporation reserves the right to make changes to this document at any time and without further notice. Marathon Technologies Corporation assumes no responsibility for any errors that may appear in this document.