A DATA CENTER OPERATIONS GUIDE FOR MAXIMUM RELIABILITY

November, 2013

Robert D. McClary

Senior Vice President and General Manager

In its most basic description, a data center provides a place to house information technology (IT) and network equipment. However, a data center’s mission should be to create reliability, mitigate risk, and provide uptime for the technology and applications that it enables. This eBook examines operations and the most likely causes of outages in a data center along with organizational strategies that can eliminate or minimize the potential for unplanned downtime in a data center.

If you have any questions about this eBook or FORTRUST, please visit us at www.FTDC.com

INTRODUCTION

The data center industry spends significant time discussing infrastructure design and the Tier rating of a data center. So much focus on the critical systems infrastructure design (specifically the electrical and mechanical systems) leaves the impression that design is the key to predicting the reliability of a data center. This way of thinking resembles a “break-fix” philosophy for data center operations, yet the reason to have data centers is that utilities are generally “break-fix” operations. The Tier rating of a data center is an indicator of the distribution paths, capacities, redundancies, and approximately how many planned maintenance windows the IT end-user can expect, assuming none of the maintenance is deferred. However, as the primary factor or predictor of reliability, the Tier rating alone falls short. Data center reliability is the combination of many factors, of which infrastructure design is only one.

People, processes, operations, maintenance, lifecycle, and risk mitigation strategies are also necessary in creating reliability. The strategies within this eBook work with any Tier rating and decrease the likelihood of unplanned downtime or outages in a data center.

I have been charged with the oversight and direct supervision of the business and data center operations for FORTRUST and the FORTRUST Denver data center since opening for operations in 2001. FORTRUST is a data center and colocation services provider located in Denver, Colorado. For the majority of my adult life, I have been associated with the design, construction, engineering, and operations of mission critical assets, having served over 16 years in the United States Navy and some 12 years in the data center industry.

Recently FORTRUST surpassed its 12th year of continuous critical systems uptime without a single instance of unplanned downtime in our Denver data center. Over the past several years we have been asked on many occasions, how do we do it? I do not believe there is anything secretive or special about what we do. However, I do believe there is something to how we do things, and the mindset behind why we do it that way. As data center professionals, we are keenly aware of the math that provides the inevitable conclusion. Nothing is 100%. Because the data center industry is charged with the quest for continuous uptime, data center professionals are somewhat on-edge, or even superstitious. As a result, speaking about how long something has been operating without an outage or unplanned downtime is not always a comfortable conversation.

When we are asked about our long track record of continuous uptime, we usually deflect the conversation, look around for wood to knock on, or just change the subject, like most data center professionals. Some say luck factors into it somewhat, but I am not so sure. In fact, I believe luck, like design, has little to do with it. I do believe that the right people with the right structure and an operational mindset in which attention to detail, process discipline, and procedural compliance emanate from every aspect of operations can deliver continuous uptime over a long period of time. I just happen to be lucky enough to work with such a team of data center professionals. In fact, to use a football analogy, I would line up the FORTRUST team in a “Data Center Super Bowl” without hesitation and order the championship rings in advance.

At the end of the day, the measuring stick for a data center is the number of unplanned downtime events and the track record of continuous uptime it has delivered. To borrow a quote from Hall of Fame football coach Bill Parcells, “you are what your record says you are.” This eBook is about how to prevent unplanned downtime, not about trying to predict it.

ABOUT THE AUTHOR

Rob McClary joined FORTRUST in 2001, and continues to hold the critical role of building the company into the premier data center services provider and colocation facility that it is today.

I will agree that the design of a data center’s critical systems infrastructure is important to reliability and uptime. However, it is my experience that the design is only one small factor in the equation that results in continuous uptime. The larger factors contributing to high-availability and uptime are specific to people, process, operations, maintenance, lifecycle, and risk mitigation strategies. In a nutshell, a data center’s design, organizational, and operational strategies are essential to prevent unplanned downtime in a data center. We define “High-Availability Service Delivery” as the manner and means by which the data center organization and operations deliver the expected availability or uptime to the end-user.

In this eBook we will examine the most likely causes of data center outages and some proven strategies to prevent them. This is about organizational strategy and an operational mindset. The strategies will be presented, but ultimately the results can only truly be measured by a track record of continuous uptime, not by a rating, certification, or checklist.

THE MOST LIKELY CAUSES OF UNPLANNED OUTAGES OR DOWNTIME

In its most basic description, a data center provides a place to house information technology (IT) and network equipment. However, a data center’s mission should be to create reliability, mitigate risk, and provide uptime for the technology and applications that it enables. Companies rely on data centers or colocation service providers to support mission critical information technology (IT) infrastructure and maintain business continuity. Unplanned downtime or outages affect every company differently. Some companies can absorb unplanned downtime in their IT environments without any impact. For other companies, unplanned downtime can have an undesirable impact on productivity, finances, or business continuity.

Reliability or uptime stems from a combination of many factors. The key to preventing unplanned outages or downtime in the critical systems infrastructure, specifically, the Electrical and Mechanical (cooling) systems that support the IT equipment and telecommunication areas in a data center, is to focus the greatest amount of attention and effort on the most likely causes of outages.

Easy enough, right? Hardly! In business, we prefer to take the path of least resistance. This usually means doing what has the least upfront costs or doing what we perceive to be the easiest to control. The most likely causes of unplanned outages are often perceived to be the most expensive or hardest to control. I will concede that some may be more difficult than others, but I believe the costs are nowhere near the perceptions.

In my opinion, the root causes of data center unplanned downtime in the critical systems infrastructure are summarized into four general categories ranked from the “most likely” to the “least likely” causes:

1. Human error and poor infrastructure capacity management

2. Poor maintenance and lifecycle strategy

3. Poor data center site selection and risk mitigation measures

4. All other causes

Visually, a chart depicting the “most likely” to “least likely” causes looks like this:

[Figure: Causes of Unplanned Downtime in Data Centers. From “most likely” to “least likely”: Human Error and Infrastructure Capacity Management; Maintenance and Lifecycle Strategy; Data Center Site Selection and Risk Mitigation Measures; All Other Causes.]

For the purpose of this eBook we will focus on the three most likely causes as they are specific to organizational and operations strategy.

1. HUMAN ERROR AND INFRASTRUCTURE CAPACITY MANAGEMENT

I chose to combine human error and infrastructure capacity management into the same category because the root cause of an unplanned outage due to the improper management of capacity is human error. Likewise, any causes that are attributed to poor management decisions or a failure of management to act in accordance with the mission should also be classified as human error. The following are strategies that can be employed in a data center’s operations and, for that matter, any mission critical organization.

STRATEGY: Develop documented operational process controls for the entire organization and operations of the data center.

Process control and the comprehensive documentation of processes are critical because many unplanned downtime events are the result of human error. Documented, validated, and repeatable processes create a standardized approach to operations, service delivery, and maintenance while mitigating or eliminating the risk associated with human error. All organizational and data center operations should be structured and conducted through the use of documented processes commonly referred to as procedures.

The purpose of any procedure is to:

• Define and document a process

• Provide step by step direction for critical and routine processes

• Provide for operational control and consistency

• Mitigate or eliminate errors

Simply stated, procedures are used to document a process and produce a consistent desired result.

Many frameworks can be used to develop a structured Procedures Library. The example below is one of the more common frameworks. Collectively, procedures can be categorized into the following types of documents:

• Standard Operating Procedure (SOP): An SOP is used to define and document recurring routine and critical processes.

• Maintenance Procedure (MP): An MP is used to define and document preventive/predictive maintenance processes for recurring routine maintenance actions.

• Method of Procedure (MOP): An MOP is used to define and document complex or non-routine maintenance actions, non-recurring corrective maintenance and processes that may require a back-out plan.

• Contingency Procedure (CP): A CP is used to define and document processes outside the course of normal or routine operations and provide procedural direction for critical processes that may be needed in response to a random, hazardous, or catastrophic event that may affect critical systems, service delivery, or business continuity.

• Guideline (GL): A guideline defines and documents organizational and management policies that may be needed to provide for consistency of operations and an understanding of specific responsibilities while conducting operations.

The very first procedure written (possibly an SOP) should explain in detail how the structure, organization, and management of the Procedures Library will be accomplished. Yes, have a Standard Operating Procedure on how to write a procedure!

This procedure should:

1. Define which of the above types of procedures apply to which processes.

2. Detail the information to be included in each type of procedure, such as references to other procedures, safety items, tools, materials, and revision history.

3. Provide a uniform template to be used for each type of procedure (uniformity and consistency of procedures is key).

4. Establish a library and designate a custodian(s).

5. Define the review, validation, revision, and approval process.

6. Provide a revision control process to ensure only the most current and approved version of the document is used.

7. Define a document numbering and identification scheme (e.g., SOP OPS 001).
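As a minimal sketch of items 6 and 7, the numbering scheme and the "most current and approved version" rule could be expressed in a few lines of Python. The ID layout (type, department code, three-digit sequence) is an assumption inferred from the single example "SOP OPS 001", and the `Revision` record and function names are hypothetical, not part of any particular document management tool.

```python
import re
from dataclasses import dataclass

# Document types taken from the framework described above.
PROCEDURE_TYPES = {"SOP", "MP", "MOP", "CP", "GL"}

# Assumed layout: TYPE DEPT NNN, for example "SOP OPS 001".
ID_PATTERN = re.compile(r"^(?P<type>[A-Z]{2,3}) (?P<dept>[A-Z]{2,5}) (?P<seq>\d{3})$")


@dataclass(frozen=True)
class Revision:
    procedure_id: str
    number: int      # revision number, higher is newer
    approved: bool   # True once review, validation, and approval are complete


def parse_procedure_id(procedure_id: str) -> tuple[str, str, int]:
    """Split an ID into (type, department, sequence) or raise ValueError."""
    match = ID_PATTERN.match(procedure_id)
    if match is None or match["type"] not in PROCEDURE_TYPES:
        raise ValueError(f"not a valid procedure ID: {procedure_id!r}")
    return match["type"], match["dept"], int(match["seq"])


def current_revision(revisions: list[Revision], procedure_id: str) -> Revision:
    """Return the newest approved revision; drafts are never returned."""
    approved = [r for r in revisions if r.procedure_id == procedure_id and r.approved]
    if not approved:
        raise LookupError(f"no approved revision of {procedure_id}")
    return max(approved, key=lambda r: r.number)
```

The point of the design is that a draft revision can exist in the library without ever being handed to a technician: only `current_revision` decides what is used on the floor.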

STRATEGY: Demand process discipline and procedural compliance throughout the organization and data center operations.

It is critical that the organization has a primary focus on implementing a procedural structure and adheres to it throughout its operations and service delivery functions. All personnel involved with the processes must be trained on the standardized and routine use of procedures. The organization must provide continuous training on their use and have a review process to ensure that the procedures are followed.

At minimum, a structure of process discipline and procedural compliance must include:

• A multiple personnel review, validation, and approval process for each original procedure and every revision thereafter

• A formal dissemination process for all procedures and revisions to all applicable personnel

• An orientation and training program for each new employee, including required reading of all assigned procedures and a formal acknowledgment of understanding and compliance

• A formal acknowledgment of understanding and compliance for the initial training and revision of all required procedures from each applicable individual

• A requirement for the procedure to be “in hand,” meaning the procedure should not be read once and attempted to be repeated from memory thereafter

• A mandate for all personnel whose job requires them to use and maintain procedures, along with a measurement of their accountability.

Documented, validated, and repeatable processes are the cornerstones of an effective organization’s operations and form the foundation of an effective training program.

CHALLENGES: “Procedures are difficult and take a lot of time to write.”

This is one that I hear often and, to be honest, it has some basis. Documentation and validation of processes is difficult at first and can be time consuming. However, there are many significant benefits:

1. Procedures will reduce mistakes and human error.

2. Procedures will create process control and consistent results throughout the organization.

3. Documenting and validating the process will lead to efficiency and increased productivity throughout the organization.

4. Procedures will become the foundation of a continuous training program.

5. Procedures create a knowledge base and reduce the impact of someone leaving the organization with “the only key,” that is, knowledge of a critical process held by only one person.

A strict program of structured process control and the required use of documented procedures will increase the likelihood that they will be followed. This will reduce the likelihood that human error will cause a disruption in service or unplanned downtime. The structure must be all inclusive, integrated, and adhered to by the entire organization.
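One way to make the formal acknowledgment requirement measurable is to record, for each procedure revision, who is required to read it and who has acknowledged it. The sketch below is illustrative only; the class name, field names, and the people in the usage note are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ProcedureRevision:
    procedure_id: str                  # for example "SOP OPS 001"
    revision: int
    required_readers: set[str]         # personnel the procedure applies to
    acknowledged_by: set[str] = field(default_factory=set)

    def acknowledge(self, person: str) -> None:
        """Record a formal acknowledgment of understanding and compliance."""
        if person not in self.required_readers:
            raise ValueError(f"{person} is not assigned to {self.procedure_id}")
        self.acknowledged_by.add(person)

    def outstanding(self) -> set[str]:
        """People who still owe an acknowledgment for this revision."""
        return self.required_readers - self.acknowledged_by
```

Because each revision carries its own acknowledgment set, issuing revision 2 of a procedure starts everyone at "outstanding" again, which matches the requirement that every revision be acknowledged and not only the original.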

Process discipline and procedural compliance need to be the standard, the tone, and the mindset for the entire organization.

CHALLENGES: “I’ve done this a hundred times,” “I have this memorized,” or “I don’t need a procedure to tell me how to do my job.”

This attitude is not acceptable; following the procedures is the job. As an example, one of our longest tenured operations technicians told me one day, “Even when I’m 99.99% sure, I still use the procedure step by step, and you want to know why? Because we are not in the mistake business.” This is the right mindset!

“Procedures do not apply to our department.” This attitude is also not acceptable. All organizational personnel even remotely associated with the data center need to understand and accept that their job is to use and follow procedures. This is not exclusive to operations, data center, facilities, and IT/Network personnel, but should be inclusive of the entire organization. No department or any function that has anything to do with the data center should be excluded; yes, that means Executive, Accounting, Administration, Vendors, and Sales/Marketing (if applicable).

STRATEGY: Ensure the data center operations team has the right skill sets.

The question I hear frequently at industry conferences and other events is “Where do I find skilled and experienced data center professionals?” Relatively speaking, the data center industry is fairly young and has evolved significantly in its short existence. I am not one who believes that experience in a data center readily translates to success. It is my experience that traditional facilities management or telecommunications skill sets do not always translate to the ideal data center operations skill sets. This is an area of repeated discussion and debate.

So what are the right skill sets?

• Basic electrical and electronics skills

• Basic mechanical and HVAC skills

• Drawing/print reading and interpretation skills

• Systems interrelationship and cause-and-effect skills

• Electrical and mechanical systems troubleshooting skills

• Analytical problem-solving skills

You can most certainly add to this list. For the sake of brevity, we will use this as a base.

Running CAT 5/6 and fiber throughout a data center is a much needed skill set, as is operating and maintaining a 30 Megawatt electrical distribution system. However, they are not the same skill set. A data center is the combination of electrical and mechanical systems that have a direct impact on each other: the interrelationship between the two systems has a direct cause and effect. Conversely, traditional facilities technician skills are mostly trade specific, and in a traditional facility the cause and effect between the electrical and mechanical systems is much weaker.

Long story short, data center operations personnel need to have a wide range of skill sets and a keen sense of the relationships between multiple systems.

CHALLENGES: “The best data center operators do not necessarily come from like industries.”

This is a statement I hear often and it has some basis. So what is a similar industry? If not telecommunications or traditional facilities management, where do you find people who have all of the training and skills above? Admittedly, I am biased, but the US Military just happens to produce people with these skills and much more. There are certainly many benefits that an organization can garner by looking at this pool of talented and skilled people.

Attention to detail, process discipline, and a procedural-based structure: do these traits sound like a good fit for a data center?

STRATEGY: Train to achieve excellence and a consistent level of knowledge throughout the organization and data center operations.

Training, training, training. Everyone agrees that it is necessary, but very few organizations do it well or on a consistent basis. It is normally a one-time, short-lived strategy, or there is an expectation of “on the job” training. I prefer the phrase “for the job” training. There must be a commitment and focus on training throughout the organization!

This statement has been made many times without producing results. An organizational training program’s primary objective is:

• To continually increase the level of knowledge, skills, and competency of the organization’s workforce.

In order for this to occur, you must have two basic elements happening simultaneously and with equal weight.

1. The continuous delivery of relevant information and training to the workforce.

2. Each member of the workforce must take individual responsibility for his or her level of knowledge, skills, and competency.

Sounds simple enough, but you would be surprised how many times it fails. Training programs will vary in effectiveness based on those two elements. I have seen significant effort in delivering the relevant information, but due to a lack of individual responsibility to learn and apply the information, the program failed to meet its objectives. On the other hand, when both elements are strong an organization can achieve excellence.

Things to consider:

• A structure of well-documented processes in the form of procedures will serve as a basis for your training program.

• Training does not have to be in the form of sending people to courses, programs, etc. These are used to augment and complement the organization’s program, not substitute for it.

• You will have people in the workforce that will not take responsibility for their own learning and level of knowledge, if you allow it. Insist on individual accountability for learning.

• Scheduled training frequently canceled due to other priorities is a sign of a failing program and other organizational issues.

• Create qualification programs that have specific criteria for the demonstration of knowledge. This can be in the form of testing, producing drawings, and walkthroughs.

• Much like the use of procedures, no department or any function that has anything to do with the data center should be excluded.

Training on all the major critical systems in the data center should be a primary focus for the operations team and throughout the organization. It is important that your training program have a plan to provide an understanding of how the electrical and mechanical systems are distributed throughout the data center, as this is a major part of the service delivery function. One of the best ways to provide orientation and familiarization on a specific system is to require a “hand over hand” tracing exercise. As an example, have the trainee trace the electrical distribution system from a specific rack back through the distribution system to its external source. The trainee should provide a block diagram and be able to explain the distribution path(s) with all the major components’ purposes and functions. The same should be done for the mechanical and other systems. More detailed training can include:

• Walkthroughs

• Testing

• Practical demonstrations of procedures

• Drills
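The “hand over hand” tracing exercise can also be mirrored in a simple data model, which is useful for checking a trainee's block diagram against the documented distribution path. The sketch below assumes a single-path system in which every component has exactly one upstream feed; the component names are invented for illustration and do not describe any real facility.

```python
# Hypothetical single-path model: each component maps to the component feeding it.
UPSTREAM = {
    "Rack R12": "PDU-A3",
    "PDU-A3": "UPS-A",
    "UPS-A": "Switchgear-A",
    "Switchgear-A": "Utility Feed A",
}


def trace_to_source(component: str, upstream: dict[str, str]) -> list[str]:
    """Walk from a component back to its external source, one hop at a time."""
    path = [component]
    seen = {component}
    while path[-1] in upstream:
        feeder = upstream[path[-1]]
        if feeder in seen:
            # A loop means the documentation is wrong, not the plant.
            raise ValueError(f"loop detected at {feeder}")
        path.append(feeder)
        seen.add(feeder)
    return path
```

Calling `trace_to_source("Rack R12", UPSTREAM)` walks the same path the trainee is asked to trace by hand, from the rack to the utility feed. A redundant (A/B) design would need one such mapping per side.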

Creating an operational mindset requires continuous training. Training and the responsibility for learning must be the routine or “standard operating procedure.”

CHALLENGES: “Training is taking time away from me being able to get my job done.”

My reply to this is, “If you increase your level of knowledge, you will be more productive at your job and the results will benefit you and the organization.”

STRATEGY: Develop and implement infrastructure identification and management standards.

The use of infrastructure standards for the uniform identification of equipment throughout the electrical distribution, mechanical, connectivity, access, and fire/life safety systems is fundamental to an operational environment that has strict process control.

Implement standards and policies that describe:

• Proper installation and management of cabling in the data center

• Uniform numeration, color coding, and labeling throughout the critical systems infrastructure

• Visible, accurate, and easily referenced in-place diagrams and drawings

• Accurate electrical branch circuit panel schedules (which also play a role in capacity management)

• Mimic bus on all switchgear and electrical distribution components

• Mechanical systems labeling and accurate valve identification lists

TIP: There is no such thing as too much labeling or signage in a data center. Count the signage and labels on your next drive in to work. We are accustomed to signage providing us with direction.

CHALLENGES: “We will go back and label everything later.”

Does this always get done? Enough said!

STRATEGY: Develop a formal procedure for the implementation of change.

A formal change management process and procedure that fully integrates into all aspects of operations and service delivery is a requirement. Change management should be a routine part of the operations and service delivery functions of a data center.

It should consist of a documented process (SOP or GL) that includes:

• A requirement for participation by all levels of management on a scheduled basis.

• Establishment of a formal Change Management Board for multiple levels of review and approval to ensure participation and compliance throughout an organization.

• A defined method used to implement change to critical systems.

• A process for the identification of potential impacts to end-users along with a process to mitigate the possible impact of the change.

• A defined communication and dissemination process for the organization to the end-user.

CHALLENGES: “It is a small change and following the change management process is more work than the change itself.”

An effective change management process will create a forum. This forum can be used to uncover details that may not be considered by someone who is intricately involved in the process. I have seen many times when a change management process or procedure is the only function in place to mitigate human error. A change management process is only one of many processes in a high-availability environment.

Formal change management processes should be the norm and one of many functions to mitigate human error in a mission critical environment.

STRATEGY: Require frequent inspections of the entire data center.

Walkthroughs and inspections of the data center need to be constant and conducted by multiple personnel with varying frequencies. Start with a documented procedure detailing:

• A list of items to be checked

• A list of parameters to be recorded

• A process and requirement for recording and dissemination of the results

TIP: “You will get what you inspect, not what you expect!”

I cannot count the times this statement has been proven true.

The process and a disciplined approach to frequent inspections of the data center create attention to detail and a sense of ownership. Inspection as a maintenance item is somewhat different. Inspections play an important role in maintenance and lifecycle strategy, which will be discussed in greater detail in the next section.

CHALLENGES: “Repeated inspections and tours take a lot of time.”

Agreed, but in my experience small things identified on a tour or inspection are easily corrected and thus prevented from evolving into major issues.
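The documented inspection procedure described in this section (items to check, parameters to record, results to disseminate) lends itself to a small structured checklist. The following is a sketch under stated assumptions: the item names and acceptable ranges are placeholders chosen for illustration, and real limits would come from the equipment manufacturer and the site's own procedures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckItem:
    name: str
    unit: str
    low: float    # lowest acceptable reading (placeholder value)
    high: float   # highest acceptable reading (placeholder value)


# Hypothetical inspection round; the ranges are illustrative only.
GENERATOR_ROUND = [
    CheckItem("Battery voltage", "V", 24.0, 28.0),
    CheckItem("Fuel level", "%", 75.0, 100.0),
    CheckItem("Jacket water temperature", "F", 100.0, 140.0),
]


def review_round(items: list[CheckItem], readings: dict[str, float]) -> list[str]:
    """Return one finding per reading that is missing or out of range."""
    findings = []
    for item in items:
        value = readings.get(item.name)
        if value is None:
            findings.append(f"{item.name}: not recorded")
        elif not item.low <= value <= item.high:
            findings.append(
                f"{item.name}: {value} {item.unit} outside {item.low} to {item.high}"
            )
    return findings
```

A missing reading is treated as a finding in its own right, so a skipped item cannot pass silently. The returned list is what would be recorded and disseminated after the walkthrough.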

STRATEGY: Require all critical infrastructure systems to have a comprehensive monitoring system(s).

It is refreshing to see that more people in our industry are taking a proactive look at the importance of monitoring and control systems. However, I find it humorous that many people are positioning it as a “new thing.” Data Center Infrastructure Management (DCIM) may be new to other data centers, but I would not operate a data center that did not have a comprehensive monitoring system(s) in place throughout the critical systems infrastructure. Monitoring and control systems have been around for as long as I can remember, although apparently not in the data center industry. Only the acronym DCIM is new. Comprehensive DCIM in the data center must be the norm!

In an operations organization focused on uptime, monitoring systems become as important as the critical equipment within the systems. Data Center Infrastructure Management and monitoring systems alert personnel to changes or issues often before they become critical or possibly impact availability. Properly configured monitoring systems can provide immediate notification of changes in the critical systems infrastructure before they become potential issues and can be used to track and report on adherence to Service Level Agreements (SLA). I cannot tell you how many times our monitoring systems have alerted us to small changes that allowed us to analyze the cause and take action long before they became an issue.

TIP: Once the DCIM system(s) are configured and working, adopt a policy of “believe the indications until you can prove otherwise.”

I have seen many times when a monitoring system or device will alert personnel to an issue only to have it misinterpreted as a bad device, or false positive, leading to an undesired event.

CHALLENGES: “DCIM system(s) are an additional cost for a data center.”

My response: “So is non-predicted failure or unplanned downtime!” DCIM has so many uses that it easily justifies the cost.

A comprehensive and accurate monitoring system(s) is core to other strategies that will impact continuous uptime, specifically:

1. Infrastructure capacity management

2. Testing, trend analysis, and predictive maintenance

We will look at these two items and the relationship between them and DCIM in detail.

STRATEGY: Develop a method for infrastructure capacity management.

Often in data centers the capacity of individual components and systems is not measured or managed effectively. Frequently, loads are not balanced or redundant capacities are encroached upon, which creates the potential for cascading events in a failure scenario. Improper capacity management is a common cause of outages in a data center and a leading cause of cascading events. There are many ways to track and manage capacities in the Electrical and Mechanical distribution systems, ranging from a spreadsheet to fairly extensive methods. The point is to have something that has the following basic elements:

• Accurate measurement of all electrical and mechanical loads

• As close to real-time measurement as possible

• A model to be followed when provisioning capacity
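To illustrate what “encroaching on redundant capacity” means in practice, consider a pair of redundant feeds (A and B) where either side must be able to carry the combined load if the other fails. The sketch below assumes that arrangement; the 80% usable fraction and the 20% imbalance limit are illustrative defaults, not values from this eBook, and the function name is hypothetical.

```python
def check_redundant_pair(
    load_a_kw: float,
    load_b_kw: float,
    rated_kw: float,
    usable_fraction: float = 0.8,   # assumed derating of each side's rating
    max_imbalance: float = 0.2,     # assumed tolerance between the two sides
) -> list[str]:
    """Flag loads that would overload the surviving side or are unbalanced."""
    issues = []
    usable_kw = rated_kw * usable_fraction
    combined_kw = load_a_kw + load_b_kw

    # If one side fails, the other must carry everything.
    if combined_kw > usable_kw:
        issues.append(
            f"redundancy encroached: combined {combined_kw:.1f} kW exceeds "
            f"usable {usable_kw:.1f} kW on a single side"
        )

    larger_kw = max(load_a_kw, load_b_kw)
    if larger_kw > 0 and abs(load_a_kw - load_b_kw) / larger_kw > max_imbalance:
        issues.append("loads unbalanced between A and B")
    return issues
```

With two 100 kW sides, loads of 50 kW and 45 kW look comfortable on each side individually, yet the combined 95 kW exceeds the 80 kW a single side is assumed to carry, which is exactly the cascading-failure setup described above. Running this check before provisioning new load is one form of the "model to be followed."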

When we first designed and built the FORTRUST data center in Denver in 2001, we required Branch Circuit Monitoring at every breaker panel that fed a critical IT load within the data center equipment space. We integrated it into a centralized monitoring system along with all the critical systems components allowing us to monitor all the points in the critical infrastructure that we could. At the time the DCIM acronym did not exist, or at least we did not know of it. Over the years we have evolved our DCIM system(s) to provide information that has been useful in many things including:

• Individual component and system capacity monitoring and management

• Threshold alerting and alarming

• Automatic escalations

• Real-time Power Usage Effectiveness (PUE) measurement

• Real-time branch circuit power usage

• Real-time delivery temperature and humidity measurement

• Dashboard views

• Integrated panel schedule management

• Predictive maintenance and trend analysis (discussed in detail in the next section)

DCIM can play a key role in infrastructure capacity management by providing real-time information across the critical systems infrastructure. It can assist in the creation of templates used to provision capacity to the end-user without infringing on redundancies or having unbalanced loads.
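Two of the items in the list above are simple enough to show directly. PUE is conventionally defined as total facility power divided by IT equipment power, and threshold alerting amounts to comparing a reading against warning and critical limits. The sketch below is not a description of any particular DCIM product; the function names and the threshold values in the usage note are assumptions.

```python
def pue(total_facility_kw: float, it_load_kw: float) -> float:
    """Power Usage Effectiveness: total facility power / IT equipment power."""
    if it_load_kw <= 0:
        raise ValueError("IT load must be positive")
    return total_facility_kw / it_load_kw


def alert_level(value: float, warning: float, critical: float) -> str:
    """Classify a reading against rising warning and critical thresholds."""
    if value >= critical:
        return "critical"
    if value >= warning:
        return "warning"
    return "normal"
```

For example, `pue(1500, 1000)` gives 1.5, and a branch circuit reading of 17 A checked with `alert_level(17.0, warning=16.0, critical=18.0)` returns "warning". An automatic escalation policy could then map each level to the people who must be notified.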

SUMMARY

To minimize or eliminate the chances of human error, the concepts of reliability and high-availability service delivery must be facilitated through an operational mindset in which attention to detail, process discipline, and procedural compliance emanate from every aspect of data center operations and service delivery.

2. MAINTENANCE AND LIFECYCLE STRATEGY

The designed reliability of a data center does not make up for poor maintenance and lifecycle strategies. Maintenance and lifecycle strategies are core to a data center’s ability to continuously provide high-availability service delivery and uptime over a long period of time. A comprehensive maintenance and lifecycle strategy will include inspections, preventive maintenance, predictive maintenance, testing, and corrective maintenance. I will go into each component in detail, demonstrating their differences and offering examples.

Inspections are a point assessment or verification of equipment operating conditions and systems alignment. Examples would be:

• Daily inspections of the generator(s) alignment

• Checking and recording jacket water temperature, battery voltages, fuel levels, general conditions, etc.

• Measuring and recording plenum pressures, system parameters, etc.

Preventive Maintenance is an action intended to keep a piece of equipment or component operating at its optimum level or an action that prolongs its lifecycle. Examples would be:

• Filter changes, oil changes, greasing bearings

• Cleaning heat exchangers

• Electrical systems clean and torque, etc.

Predictive Maintenance is using direct measurements, data collection, and trending to identify changes, patterns, or abnormalities indicating potential failure of a component prior to its actual failure. Examples would be:

• Fuel, lube oil, coolant sampling, testing, and analysis

• Vibration analysis

• Trend analysis

Testing is a validation of operations within a set of parameters and conditions. Examples would be:

• Infrared testing

• Load testing

• Fail-over testing, etc.

TIP: Infrared testing should be a validation of your preventive maintenance and capacity management, not in itself the sole maintenance action for the electrical systems.

Corrective Maintenance is the repair or replacement of a component or system. Examples would be:

• Repairing a leak

• Replacing a bearing

• Repairing/Replacing a valve

STRATEGY: Develop and implement formal testing, trend analysis, and predictive maintenance processes and procedures.

Do not be an organization that waits for failure before it takes action. I believe it is neither impossible nor difficult to predict issues in equipment before failure. In fact, I believe that if you have a strong maintenance and lifecycle strategy, unpredicted failure becomes, at the very least, a random event. In my experience, most equipment will provide you with advance indications of a potential failure if you monitor and trend the information over a period of time.


Trend analysis starts with a baseline. Ideally, this happens concurrently with the commissioning of equipment and its initial validation and testing. The key is to conduct testing in a manner that replicates the same conditions. Establish a set of key parameters and have the readings recorded at or near the same conditions. Graph the readings over time and analyze the data for trends. You can use this information to identify changes, patterns, or abnormalities. Trending this information can provide you with a graphic view of rising/falling parameters from baseline. The analysis can lead to the identification of patterns and indications that may be preliminary signs of failure.

Load testing a generator provides a good example. Conduct a load test at a predetermined load (kW), e.g., 500, 1000, 1500, etc. A record sheet, spreadsheet, or your DCIM can be used to document multiple readings on jacket water temperatures, cylinder temps, lube oil pressure/temp, and exhaust temps, etc. The information is graphed over multiple tests across a period of months, years, etc.

Similar processes can be used for different components and do not necessarily require a load test. Recording operating conditions and parameters on equipment and analyzing the trends is one of the best ways to predict issues prior to failure. Data Center Infrastructure Management (DCIM) expedites trend analysis and predictive maintenance by reducing the manual process that would normally be involved. The predictive maintenance benefit alone could justify the cost of the DCIM system(s).
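The baseline-and-trend process described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the readings are hypothetical (not measured data), they are assumed to have been taken at the same load step each time, and a simple least-squares slope stands in for whatever analysis your spreadsheet or DCIM provides.

```python
def percent_from_baseline(readings, baseline):
    """Percent deviation of each reading from the commissioning baseline."""
    return [(r - baseline) / baseline * 100.0 for r in readings]


def trend_slope(xs, ys):
    """Least-squares slope of ys against xs (units of y per unit of x)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two paired readings")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    if den == 0:
        raise ValueError("readings must span more than one point in time")
    return num / den


# Hypothetical jacket water temperatures (deg F), each recorded at the same
# 1000 kW load-test step, from commissioning (month 0) through month 12.
months = [0, 3, 6, 9, 12]
jacket_water_temp = [180.0, 181.0, 182.0, 183.0, 184.0]

baseline = jacket_water_temp[0]
deviation = percent_from_baseline(jacket_water_temp, baseline)
slope = trend_slope(months, jacket_water_temp)

print(f"Latest deviation from baseline: {deviation[-1]:.1f}%")
print(f"Trend: {slope:.2f} deg F per month")
```

A steadily rising slope at an unchanged load is the kind of pattern that warrants a closer look before it becomes a failure.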

The frequencies and periodicities of the preventive and predictive maintenance should be reviewed to ensure that OEM recommendations are being met or, in many cases, exceeded. A strong lifecycle strategy will involve preventive maintenance actions conducted with defined periodicities for each component or piece of equipment. The appropriate daily, weekly, monthly, quarterly, and annual preventive maintenance actions must be conducted on a routine basis to keep the equipment operating at the optimum level of performance. A strong commitment to preventive maintenance will increase the lifecycle of equipment and reduce the amount of corrective maintenance. Develop your own maintenance program based on your environment, equipment, and systems.
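As a rough sketch of how defined periodicities can be tracked, the Python below computes the next due date for each action and lists anything overdue. The task names echo examples from this section, but the dates, the day counts chosen for each periodicity, and the function names are assumptions for illustration; use your own program and OEM documentation for the real values.

```python
from datetime import date, timedelta

# Assumed day counts per periodicity (approximations for illustration).
INTERVAL_DAYS = {
    "daily": 1,
    "weekly": 7,
    "monthly": 30,
    "quarterly": 91,
    "annual": 365,
}


def next_due(last_done: date, periodicity: str) -> date:
    """Date on which the action is next due."""
    return last_done + timedelta(days=INTERVAL_DAYS[periodicity])


def overdue_tasks(tasks, today: date):
    """Names of tasks whose due date has already passed."""
    return [
        name
        for name, (last_done, periodicity) in tasks.items()
        if next_due(last_done, periodicity) < today
    ]


# Hypothetical maintenance log: task -> (date last performed, periodicity).
tasks = {
    "Generator alignment inspection": (date(2024, 5, 31), "daily"),
    "Filter change": (date(2024, 2, 1), "quarterly"),
    "Electrical clean and torque": (date(2023, 9, 15), "annual"),
}

overdue = overdue_tasks(tasks, today=date(2024, 6, 1))
print("Overdue:", overdue)
```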

STRATEGY: Acknowledge that Original Equipment Manufacturer (OEM) recommendations for maintenance are the minimums, and minimums are not part of a data center’s mission.

CHALLENGES: “Graphing and analyzing data across a long period of time is tedious work.”

I agree. Testing and trend analysis of data is something that requires time and a disciplined approach, but the benefits are clearly there. Once the processes are established, it will become second nature.

STRATEGY: Understand that the cost of corrective maintenance is much greater over the long term than that of comprehensive preventive and predictive maintenance.

Ask any classic car buff, and they will tell you the same thing! Regular preventive maintenance will save you money over the long haul. When I was in the military, I reviewed multiple studies showing that the cost of corrective maintenance was up to ten times the cost of a preventive/predictive maintenance program over the lifecycle of the equipment, and that is before counting the “impact costs” of an unplanned or unforeseen failure. In many cases, failure of equipment or components due to a lack of preventive maintenance will occur with increasing frequency over time. Component failures can also cause collateral damage to equipment and increase the cost of the corrective maintenance required.

A clean data center is the foundation for a lifecycle strategy and for prolonging the life of equipment. The mechanical, electrical, and plumbing (MEP) areas should be as clean as, or cleaner than, the IT equipment areas. A data center operations team should pride itself on maintaining a meticulous, “cleanroom”-type environment throughout the data center. Tacky mats, air filtration, extensive cleaning standards, and the data center operations staff should control and mitigate airborne particulate throughout the IT equipment areas. The cleanroom approach should start in the MEP areas and extend equally to the IT equipment areas.

A documented Standard Operating Procedure (SOP) for maintenance, installation, repair, construction, and cleanliness should exist. This document needs to describe:

• The process for the authorization and performance of any maintenance, installation, repair, or other action that occurs in or around a controlled area (e.g., IT equipment areas, raised floor, and MEP)

• Controlled areas (which should also be designated by signage throughout the data center)

• Preventive actions and control measures to be employed in order to minimize possible adverse effects on equipment within these areas

• Cleanliness expectations and standards applicable to the controlled areas

STRATEGY: Require a “cleanroom” approach to data center operations.

CHALLENGES: “If we do just the minimums to keep the warranties intact, should that suffice as a preventive maintenance program?”

Sure, if you view the data center as a five-year asset, go ahead. (As long as you’re not working at my data center!)

STRATEGY: Understand what a lifecycle strategy is and have one!

A lifecycle strategy encompasses a preventive and predictive maintenance program combined with other strategies that are all focused on increasing the lifecycle and prolonging mean time between failure (MTBF) of the systems, equipment, components, and the data center as a whole. It will involve processes and strategies that include:

• A “replace before fail” strategy. Many components within a piece of equipment have a useful life and are intended to be replaced at regular intervals. Waiting for the component to fail before replacing it can come with additional or collateral damage to the piece of equipment. The useful life information of components is normally available from the manufacturer and should be used to develop a schedule for replacement of these components.

• An equipment rotation strategy. Equipment should be rotated on a schedule that allows for runtime hours to be balanced or spread across like equipment or systems.

• An equipment replacement strategy. A plan for replacement of equipment and components as part of the lifecycle strategy for the data center is necessary to maintain optimum operating conditions and availability.

• Additional strategies and policies, as discussed further in this section.
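The “replace before fail” and equipment rotation strategies above lend themselves to a simple sketch. The Python below is a minimal illustration, not a prescribed method: the unit names, runtime hours, useful-life figures, and the 90% replacement margin are all hypothetical, and real useful-life values should come from the manufacturer.

```python
def next_lead_unit(runtime_hours):
    """Rotation: pick the like unit with the fewest runtime hours to run next."""
    return min(runtime_hours, key=runtime_hours.get)


def due_for_replacement(component_hours, useful_life_hours, margin=0.9):
    """Replace before fail: list components at or past `margin` of useful life.

    The 0.9 default margin is an assumption for illustration.
    """
    return sorted(
        name
        for name, hours in component_hours.items()
        if hours >= margin * useful_life_hours[name]
    )


# Hypothetical runtime hours for three like units.
runtime_hours = {"CRAH-1": 12400, "CRAH-2": 11850, "CRAH-3": 12910}

# Hypothetical component hours and useful-life figures.
component_hours = {"fan belt": 7600, "UPS capacitor": 30000}
useful_life_hours = {"fan belt": 8000, "UPS capacitor": 60000}

lead = next_lead_unit(runtime_hours)
to_replace = due_for_replacement(component_hours, useful_life_hours)

print("Next lead unit:", lead)
print("Schedule replacement for:", to_replace)
```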


The Standard Operating Procedure (SOP) for maintenance, installation, repair, construction, and cleanliness should govern all personnel and third parties involved in any work inside the data center.

CHALLENGES: “Cleaning shouldn’t be part of my job.”

Cleaning is part of preventive maintenance and lifecycle strategy. The operations team should conduct and be responsible for the majority of cleaning in the data center. The only recurring cleaning that should be outsourced to a vendor is that of the office areas. This is a simple equation: if a person is responsible for cleaning up his or her own mess instead of leaving it to someone else, the level of ownership is greater.

STRATEGY: Make-up air and air filtration are not just for airborne particulates!

What if I told you that by adding a custom set of filters to your make-up air handlers, you can increase the lifecycle of the IT equipment and electrical/electronic equipment in your data center? Air quality varies by geographic location. The quality of the air being admitted to the data center should be tested. Much like water quality and the effect it can have on heat exchangers, boilers, etc., there are constituents in the air that can corrode copper components over time. There are companies that can assess the air quality and specify a custom set of filters that will remove the harmful elements from the air supply that reduce the lifecycle of electrical and electronic components over time.

CHALLENGES: “The cost of custom filter banks in addition to particulate filters is another expense.”

Is it worth it if you could add several months of life to every server and prolong the life of the electrical equipment in a data center?

STRATEGY: Seismic enhancements to the critical systems infrastructure are not just for possible earthquake areas!

Seismic enhancements like resilient mounts, spring-loaded piping hangers, piping rollers, strong backs, flexible braided piping, and flexible and expanding conduit are just as much a preventive maintenance and lifecycle strategy as they are risk mitigation features. Seismic protection features can be integrated throughout the critical systems infrastructure. The transmission of vibration and torque through rigid piping systems to couplings, bearings, flanges, and welds can cause increased failure rates. Seismic enhancements serve both risk mitigation and lifecycle strategy.


CHALLENGES: “The additional costs of seismic augmentation and enhancements are not necessary because we are not in a geographic location that has a high probability of seismic activity.”

The Washington Monument was not in an area with a high probability of seismic activity, either. Nevertheless, these types of seismic protections and enhancements can be integrated into the critical systems design and have a huge impact on preventive maintenance and the lifecycle of the systems and equipment. They will be well worth the additional upfront costs over the long term.

STRATEGY: Understand that if you outsource the management or the majority of your maintenance program to a third party, you do not have a data center operations team; you have someone else’s!

High-margin maintenance programs from vendors are usually a series of simple inspection sheets, the occasional filter change, and “oh by the way, here is your invoice.” Most of what we have discussed in this paper is about creating an “Operational Mindset” and “Ownership,” and those can rarely be outsourced. Does this mean that all maintenance has to be in-house? Of course not, but be very selective about what, who, and how much is done by vendors. Having OEM and OEM-certified vendors perform specific levels and types of maintenance on their equipment as part of your maintenance and lifecycle strategy is essential to a well-rounded program. However, in my experience this will normally be less than 20%. The overall management and direct supervision of the program should always remain with the operations team. The vendor(s) need to adapt to your procedures and not the other way around.

A data center’s mission or business is to create reliability, mitigate risk, and provide uptime for the technology and applications that it enables.

STRATEGY: Place a great deal of importance on the selection of critical equipment.

When the organization purchases new or replacement equipment, the selection process becomes extremely important to the performance and reliability of the data center’s critical systems and its lifecycle strategy. Some organizations select equipment for the data center based on the lowest bid they receive, while others take a best-in-class approach. Best-in-class equipment selection protects the end-user and the data center’s mission. Many organizations do not consider the long-term costs associated with lower-end equipment over the course of time. Reliability, performance, efficiency, and cost should all be considered when making equipment choices. Equipment selection on the basis of low bid alone is a questionable practice. Just as with bad management decisions that cause unplanned downtime, CEOs and CFOs do not get a pass on business decisions made solely on cost. If the decision may possibly lead to failure, this too is human error.

The tradeoff between cost and reliability is something every data center organization must consider. You must ask yourself: what do I want my data center to achieve, and for how long?

CHALLENGES: “We do not have the skill sets in house.”

Then get them! A skilled operations team is by far the largest single factor in high-availability service delivery in a data center. This is true for a number of reasons, and especially so when trying to accomplish the following strategy.

STRATEGY: The data center operations team needs to be involved in, and have influence over, the design and construction of the data center.

A colleague once told me, “Engineers design; contractors build; operators operate.” Over the lifecycle of a data center, who impacts it more?

Data centers are long-term, high-value assets. Many times owners throw money at engineering firms, developers, and general contractors to build data centers. More often than not, somewhere near or upon completion, the data center gets handed off to an operations team with a “here you go, make it do what it’s supposed to do!”

A skilled operations team can inject lifecycle strategies into the design phase and frequently find opportunities to reduce unnecessary costs. Ownership begins here with the operations team for the data center. The operations team should be involved in producing an Owner’s Project Requirements (OPR) document to govern both design and construction. The operations team should also be involved in the construction process itself and have a quality assurance role. They should witness all testing evolutions and inspect every piece of equipment and panel before it is buttoned up. They should be involved in every start-up evolution for every piece of equipment. And they should direct and participate in the commissioning and integrated systems testing of the data center.

TIP: Have your operations team in place at the design phase of any data center build or expansion. The training, ownership, and lifecycle opportunities cannot be replicated!

SUMMARY

A skilled operations team with ownership of the maintenance and lifecycle strategies is core to the ability of a data center’s critical systems infrastructure to continuously provide high-availability service delivery and uptime over a long period of time. Maintenance and lifecycle strategy must be routine. Attention to detail and ownership are contagious if the tone is set and emphasized at every level in the organization!


3. DATA CENTER SITE SELECTION AND RISK MITIGATION MEASURES

If I learned one thing in my years in the United States Navy, it is that Mother Nature always wins. Those who take her for granted or do nothing to defend against her fury are playing a waiting game. The same lesson can be applied to data centers. There are no perfect locations or perfect situations, but there are

• better places than others

• risks that can be mitigated or eliminated by site selection

• risks that should be considered unacceptable no matter how convenient or tempting due to costs

Risk Mitigation integrated into the data center’s design and operations is a key factor in high-availability service delivery and continuous uptime.

STRATEGY: Make the geographic location of the data center and the potential for natural disasters a key factor in site selection.

Have you noticed when you turn on the evening news, about two or three times a week the lead story has something to do with weather or a natural disaster? The locations of many data centers are selected on what is readily available, convenient, or what can be re-purposed inexpensively. These are not always the best priorities for data center site selection.

Investigate whether the proposed data center site is in a zone or region that is at risk of natural disasters. Earthquakes, hurricanes, floods, and tornadoes can all damage, if not destroy, even the most reliably built data center, or render the surrounding areas unable to support the prolonged operations of the data center.

TIP: Do not try to mitigate the risk of floods, as it rarely works. Just stay out of the way. Building a data center near a large body of water, river, or ocean, or within the 100- or 500-year flood plain, is tempting fate and Mother Nature too much.

The data center’s proximity to other resources or potential hazards is important to evaluate in site selection. Obviously, data centers located close to or on designated emergency routes, the local fire department, etc. will benefit from faster response during emergencies, priority snow removal, and general access. Proximity to airports and railways may or may not present potential risks. Airport risk varies with location relative to flight paths. Railways do not always present a risk, depending on their function, purpose, or traffic. The Federal Railroad Administration provides readily available information on all railways. Proximity of the data center to other high-risk elements such as nuclear power plants, refineries, dams, etc. should also be considered and reviewed. Long story short, do the due diligence on your surrounding area. Some risks are easily mitigated; some are not.

Another factor to consider is the data center’s construction. I am not a big fan of calling a telecommunications hotel a data center. Data centers need to be purposefully designed and built specifically to be data centers. Renovated shells are perfectly fine, so long as the shell provides the structural integrity for its geographic location and other attributes specific for use as a data center. Multi-use commercial property structures or re-purposed telecommunications hotels may be missing key structural elements and may be more susceptible to storms and problems with ongoing sustainability. They also may lack the necessary physical security and risk mitigation measures. The data center’s construction needs to provide structural integrity and the ability to withstand weather extremes for the geographic location and potential natural disasters. A reliable critical systems infrastructure design without thought for risk mitigation invites significant downtime.

SUMMARY

For a data center designed and operated with little or no consideration of geographic risk or risk mitigation measures, unplanned downtime is no longer an “if” scenario, but rather a “when” scenario.

CONCLUSION

It is important to understand that not all data centers are designed the same, built the same, managed alike or operated alike. Predicting the reliability of a data center based solely on the data center’s critical systems infrastructure design is just that, a guess.

The key to preventing unplanned outages or downtime in the critical systems infrastructure is to focus the greatest amount of attention and effort on the “most likely” causes of outages.

1. Human Error and Infrastructure Capacity Management

2. Maintenance and Lifecycle Strategies

3. Data Center Site Selection and Risk Mitigation Measures

The measuring stick for a data center is the number of unplanned downtime events and the track record of continuous uptime it has delivered. Focus on how to prevent unplanned downtime, not simply on how to predict it. The larger factors of high-availability and uptime are specific to people, process, operations, maintenance, lifecycle, risk mitigation strategies, and the operational mindset that facilitates it.


LEARN MORE ABOUT FORTRUST

FORTRUST is the premier high availability data center service provider in North America offering services in Denver, Colorado; Phoenix, Arizona; and Edison, New Jersey. FORTRUST offers agile, reliable, sustainable and secure raised floor or modular data center capacity for any-size enterprise supported by optimal power infrastructure and connectivity to safeguard mission-critical business services. Leading companies choose FORTRUST to gain a trusted partner who will preserve and protect their IT infrastructure as well as serve as an essential extension of their operations. More information is available by visiting www.FTDC.com.

ACKNOWLEDGMENTS

Anyone can have a good plan or follow best practices. The challenge is always in the execution. Excellence is the result of having the discipline to execute time and time again; for this I thank the entire FORTRUST organization for their repeated excellence. Thanks many times over to Madison Knudson, Nealy Bernard and Kara Miller for their help with editing and their patience.
