Increasing Data Center Resilience While
Lowering PUE
Nandini Mouli, Ph.D., President/Founder, eSai LLC
[email protected] www.esai.technology
Introduction – eSai LLC
• eSai LLC is a disadvantaged, woman-owned minority business focused on providing energy management solutions for federal and state government agencies
• Core Competencies:
• Technical/Business Feasibility Studies
• Energy Audits, Commissioning
• Energy Conservation Measures
• Evaluation, Validation and Measurement
• Utility, Federal and State Grants
• Technologies:
• Dynamic Pricing, Demand Response
• Distributed Energy Services, Combined Heat and Power
• Microgrid Integration
• Building Management Systems
Experience in consulting and implementing clean energy programs to meet DOE, EPA and FEMP policies and programs.
Currently leading multiple projects to bring resiliency and energy conservation to federal agencies and private corporations.
Topics for Discussion
• What is Resilience?
• Why is resilience critical for data centers?
• Dynamics of treating resilience
• Challenges to achieving data center resilience
• Some tools for achieving resilience
• What is DCIM?
• How is DCIM a resilience platform for:
• Planning and implementation
• Monitoring
• Data Collection
• Dashboard Visualization
• Getting the most out of DCIM tools
• Key Take-aways
What is Resilience?
• TechTarget’s Definition of Resilience: “the ability of a server, network, storage system, or an entire data center, to recover quickly and
continue operating even when there has been an equipment failure, power outage or other disruption.”
• In the context of cyber security: “Resilience is the ability of a system to resist illegitimate activity and its ability to effect a speedy
Why is Resilience Critical for Data Centers?
• Forrester Research: resilience is the #2 priority for facility directors:
• Carrier availability and density – 82%
• Availability, resilience – 80%
• Control over facility – 78%
• Access to cloud and other partners – 75%
• Lack of resilience is costly:
• IBM Reputational Risk and IT Study: system outage is one of the top two IT risks that can harm an organization’s reputation.
• 91% of data centers have experienced an unplanned data center outage in the past 24 months.
• The average cost per minute of data center downtime has increased 38% from $7,908 in 2013 to $11,000 in 2015
• Organizations which improve from “Laggard” to “Industry Average” levels of downtime can reduce losses ~$3 million/year.
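To make the per-minute figures above concrete, here is a minimal sketch of the arithmetic, assuming outage cost scales linearly with duration. The function name `downtime_cost` and the 90-minute outage are hypothetical, chosen only for illustration; the per-minute costs are the ones quoted on the slide.

```python
def downtime_cost(minutes: float, cost_per_minute: float) -> float:
    """Estimated outage cost: duration multiplied by cost per minute."""
    if minutes < 0 or cost_per_minute < 0:
        raise ValueError("inputs must be non-negative")
    return minutes * cost_per_minute


# Per-minute figures quoted on the slide (USD)
COST_PER_MINUTE_2013 = 7_908
COST_PER_MINUTE_2015 = 11_000

# Hypothetical outage length, for illustration only
outage_minutes = 90

print(downtime_cost(outage_minutes, COST_PER_MINUTE_2013))  # 711720
print(downtime_cost(outage_minutes, COST_PER_MINUTE_2015))  # 990000
```

Real outage costs are rarely linear (reputational damage and recovery effort add to them), so this is a lower-bound style estimate rather than a forecast.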
Dynamics in Treating Resilience
• Achieving resilience used to mean redundancy:
• Two (or more) of everything – servers, power supplies, generators, and even whole data centers
• But most of this duplicate equipment was never utilized.
• Waste of space and energy = increased PUE
• Now, the trend: increase resilience without waste by selecting software instead of hardware
• Fault tolerance built right into software
• Improve resilience through load balancing, virtualization, prediction and other techniques.
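As one illustration of fault tolerance built into software, the sketch below shows a round-robin load balancer that skips servers marked unhealthy. It is a simplified, assumed design, not a description of any particular product: the `Server` and `RoundRobinBalancer` names are hypothetical, and health is a flag set by hand rather than by real health checks.

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    healthy: bool = True


class RoundRobinBalancer:
    """Rotate requests across servers, skipping any marked unhealthy."""

    def __init__(self, servers: list[Server]) -> None:
        if not servers:
            raise ValueError("at least one server is required")
        self._servers = servers
        self._next = 0

    def pick(self) -> Server:
        # Try each server at most once, starting from the rotation pointer.
        for _ in range(len(self._servers)):
            candidate = self._servers[self._next]
            self._next = (self._next + 1) % len(self._servers)
            if candidate.healthy:
                return candidate
        raise RuntimeError("no healthy servers available")


pool = [Server("a"), Server("b"), Server("c")]
balancer = RoundRobinBalancer(pool)
pool[1].healthy = False  # simulate a failure of server "b"
print([balancer.pick().name for _ in range(4)])  # ['a', 'c', 'a', 'c']
```

The point of the design is that the surviving servers absorb the load without a dedicated idle duplicate standing by.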
Challenges To Achieving Data Center Resilience
Measuring how vulnerable the data center system is to failure, and fixing the potential problems, leads to increased uptime. However:
• Increase in the number of applications to be managed and backed up
• Organizations getting larger and more geographically dispersed
• Infrastructural ecosystems are more complex
• Decreasing costs of hardware encouraging organizations to maintain backup and recovery in house, incompatible with other network software to mitigate problems
• Increasing use of virtualization
• Increasing frequency and intensity of natural disasters, increasing risks
What Are Some Traditional Ways of Achieving Resilience?
Current Methodologies
A conventional data center relies on manual response plans and human teams
• Design Failure: competent design firm, integration firm, construction companies and commissioning team
• Catastrophic Failure: comprehensive maintenance and operation program
• Compounding Failure: paying more attention to the details of each and every possible failure mode
• Human-error Failure: experienced staff and training for all responsible; continuous training and execution with a pilot/co-pilot approach for operation
Modern Tools for Achieving Resilience
• A modern data center needs the II dashboard. Due to the complexity of the operations, IT and Facility management cannot rely on just the human component to combat failures occurring from a combination of two or three faults
• IT/Facility Management have to align themselves in using predictive ways of disaster mitigation: DCIM
What is Data Center Infrastructure Management - DCIM?
• It is a software platform that helps operators safely manage the physical infrastructure and controls, with higher visibility and transparency of the IT and facilities operations and quick identification and resolution of problems before they happen
• Maximizes the efficient use of power, cooling, and space capacities now and in the future.
• Two core building blocks:
• Asset Management
• Monitoring
DCIM - A Resiliency Platform: Physical Infrastructure/Controls
From Device Level Monitoring in a traditional data center system to Context-Aware Monitoring, so actions can be performed to mitigate a risk!
DCIM-Planning and Implementation Platform
Planning tools and functions:
• Display impact of pending moves on power capacity and cooling distribution
• Graphical representations of IT equipment and its location in the rack
• Proactively manage within rack and floor tile weight limits
• Correlate data between CRAC units, the PDUs, and the UPSs. The entire chain is monitored.
• Simulate consequences of power and cooling device failure on IT equipment through “What If?” scenarios
• Generate recommended installation locations for rack-mount IT equipment. The selection will be based on available power, cooling, space capacity, and network ports
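The placement recommendation in the last bullet can be sketched as a filter followed by a ranking. This is an assumed, simplified model: the `Rack` and `Equipment` fields and the ranking rule (prefer the rack with the most power headroom left after installation) are hypothetical, and a real DCIM tool will weigh more factors.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Rack:
    name: str
    free_power_kw: float
    free_cooling_kw: float
    free_units: int   # free rack units (U)
    free_ports: int   # free network ports


@dataclass
class Equipment:
    power_kw: float
    heat_kw: float
    units: int
    ports: int


def fits(rack: Rack, eq: Equipment) -> bool:
    """True when the rack has enough power, cooling, space and ports."""
    return (rack.free_power_kw >= eq.power_kw
            and rack.free_cooling_kw >= eq.heat_kw
            and rack.free_units >= eq.units
            and rack.free_ports >= eq.ports)


def recommend(racks: list[Rack], eq: Equipment) -> Optional[Rack]:
    candidates = [r for r in racks if fits(r, eq)]
    if not candidates:
        return None
    # Hypothetical ranking rule: keep the most power headroom after install.
    return max(candidates, key=lambda r: r.free_power_kw - eq.power_kw)


racks = [
    Rack("R1", 2.0, 2.0, 4, 8),
    Rack("R2", 5.0, 4.0, 10, 2),   # plenty of power, too few ports
    Rack("R3", 3.5, 3.0, 6, 12),
]
server = Equipment(power_kw=1.5, heat_kw=1.5, units=2, ports=4)
print(recommend(racks, server).name)  # R3
```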
DCIM – Monitoring and Automation Platform
• Alarming/Notification: DCIM sends out an alarm from the rack prior to a breaker tripping. This gives the operator the opportunity to make adjustments before shut-down
• Status: Notes are generated for minimum, maximum, and average usage over time for each rack
• Control: If a rack gets close to an overcapacity threshold, a predictive simulation can be triggered to determine the best way to alleviate the situation.
• Reports and graphs are generated to help
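A minimal sketch of the alarming and status functions described above, assuming the rack reports current draw in amps. The function names and the 80% warning fraction are hypothetical; actual thresholds depend on the breaker, the site, and the DCIM product.

```python
from statistics import mean


def usage_status(samples_amps: list[float]) -> dict[str, float]:
    """Minimum, maximum and average draw over the sampled period."""
    if not samples_amps:
        raise ValueError("at least one sample is required")
    return {
        "min": min(samples_amps),
        "max": max(samples_amps),
        "avg": mean(samples_amps),
    }


def should_alarm(current_amps: float,
                 breaker_rating_amps: float,
                 warn_fraction: float = 0.8) -> bool:
    """Alarm before the breaker trips, once draw reaches the warning level.

    warn_fraction is an assumed value, not taken from any standard or product.
    """
    return current_amps >= warn_fraction * breaker_rating_amps


samples = [10.0, 12.5, 14.0, 17.0]  # hypothetical readings for one rack
print(usage_status(samples))        # {'min': 10.0, 'max': 17.0, 'avg': 13.375}
print(should_alarm(samples[-1], breaker_rating_amps=20.0))  # True
```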
DCIM – Monitoring and Automation Platform (contd.)
Comparison of Primary and Secondary Functions
Certain DCIM applications will take certain data center features as primary or secondary functions. Depending on the facility and need, care must be taken to select the right ones to include in the integrated platform suite.
DCIM- Data Collection Platform
The data collection subset represents devices such as meters, power protection devices, embedded cards, programmable logic controllers (PLCs), sensors and other such devices.
The devices perform the fundamental function of gathering data and forwarding it to management software for processing.
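The gather-and-forward role of the data collection layer can be sketched as below. Everything here is illustrative: `Reading`, `ManagementStore` and `poll` are hypothetical names, and the sensors are stand-in callables rather than a real meter or PLC protocol.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Reading:
    device_id: str
    metric: str
    value: float


class ManagementStore:
    """Stand-in for the management software that receives readings."""

    def __init__(self) -> None:
        self._by_device: dict[str, list[Reading]] = defaultdict(list)

    def ingest(self, reading: Reading) -> None:
        self._by_device[reading.device_id].append(reading)

    def latest(self, device_id: str, metric: str) -> Optional[float]:
        # Walk backwards to find the most recent value for this metric.
        for r in reversed(self._by_device.get(device_id, [])):
            if r.metric == metric:
                return r.value
        return None


def poll(device_id: str,
         sensors: dict[str, Callable[[], float]],
         store: ManagementStore) -> None:
    """Gather one value per sensor and forward it to the store."""
    for metric, read in sensors.items():
        store.ingest(Reading(device_id, metric, read()))


store = ManagementStore()
poll("pdu-1", {"power_kw": lambda: 3.2, "temp_c": lambda: 24.5}, store)
print(store.latest("pdu-1", "power_kw"))  # 3.2
```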
DCIM- Dash Board Platform
Key performance indicators are at the operators’ fingertips with DCIM:
• When will I run out of power and what is the current cooling capacity?
• What is my current server utilization?
• Do I have any servers that can be retired and if so what are they?
The dashboard is the key centerpiece for aggregation of actionable data that can be shared quickly with decision-makers.
Sample dashboard collects data across OT subsets and centralizes information anytime, anywhere and on any user interface: mobile, laptop, PC
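Two of the dashboard questions above lend themselves to simple calculations. The sketch below assumes linear load growth for the power question and a fixed utilization cut-off for retirement candidates; both function names, the growth model, and the 5% threshold are assumptions for illustration.

```python
from typing import Optional


def months_until_power_exhausted(capacity_kw: float,
                                 current_kw: float,
                                 growth_kw_per_month: float) -> Optional[float]:
    """Linear projection of when load reaches capacity (None if never)."""
    if current_kw >= capacity_kw:
        return 0.0
    if growth_kw_per_month <= 0:
        return None  # no growth: capacity is never reached under this model
    return (capacity_kw - current_kw) / growth_kw_per_month


def retirement_candidates(utilization: dict[str, float],
                          threshold: float = 0.05) -> list[str]:
    """Servers whose average utilization is below the assumed threshold."""
    return sorted(name for name, u in utilization.items() if u < threshold)


print(months_until_power_exhausted(1000.0, 760.0, 20.0))  # 12.0
print(retirement_candidates({"srv-a": 0.42, "srv-b": 0.01, "srv-c": 0.03}))
# ['srv-b', 'srv-c']
```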
DCIM- Dash Board Platform (Contd.) – Another view
DCIM – Energy and Power Saving Platform
• DCIM provides an overview of facility energy use and cost, and a complete breakdown of kW per device
• Cost savings realized from the servers to the rack, row, room, building and beyond
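The per-device breakdown and its roll-up from server to rack, row and room can be sketched as a simple aggregation over location paths. The device names, kW values and the electricity price are all hypothetical.

```python
from collections import defaultdict

# (room, row, rack, device) -> average draw in kW; hypothetical values
DEVICE_KW = {
    ("room1", "rowA", "rack1", "srv1"): 0.5,
    ("room1", "rowA", "rack1", "srv2"): 0.25,
    ("room1", "rowA", "rack2", "srv3"): 0.75,
    ("room1", "rowB", "rack3", "srv4"): 1.0,
}


def rollup(device_kw: dict[tuple, float], level: int) -> dict[tuple, float]:
    """Sum kW at a given depth: 1 = room, 2 = row, 3 = rack."""
    totals: dict[tuple, float] = defaultdict(float)
    for path, kw in device_kw.items():
        totals[path[:level]] += kw
    return dict(totals)


def energy_cost(kw: float, hours: float, price_per_kwh: float) -> float:
    """Cost of running a constant load for the given number of hours."""
    return kw * hours * price_per_kwh


per_rack = rollup(DEVICE_KW, level=3)
per_room = rollup(DEVICE_KW, level=1)
print(per_rack[("room1", "rowA", "rack1")])  # 0.75
print(per_room[("room1",)])                  # 2.5
# Daily cost for the room at an assumed price of 0.125 per kWh
print(energy_cost(per_room[("room1",)], hours=24, price_per_kwh=0.125))  # 7.5
```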
DCIM Offerings in the Market: Suite and Non-Suite Providers
DCIM Market Trends
• Market is growing
• From $240 million in 2011 to $1.2 Billion in 2016
• Growth in data centers is very high, as facilities and IT come together to think about the business
• Inhibitors to adoption:
• Cost and functionality issues
• Difficulty of creating and maintaining asset databases
• Blind belief that it is possible to manage a data center without software solutions
• Energy Savings from well-managed data centers
• Reduce operating expenses by 20%
How To Get The Highest Benefit From DCIM?
• There is quite a variety of options. Care must be taken to ensure the best fit
• Scalable, modular, standardized, pre-engineered, open communication architecture with a strong vendor support structure
• Agreement between facilities, IT, and management on operating parameters, metrics, and goals for the data center power and cooling systems and their management
• A review of existing processes and comparison to DCIM requirements
• New processes should be formally defined and resources committed and
Case Study Conclusions:
Data centers are complex systems, changing constantly over time
Monitoring and measurement of capacity is not enough
Much lost capacity can be reclaimed using predictive modeling and state of the art tools with support of DCIM measurements
Key Take-Aways - DCIM Benefits
DCIM provides higher visibility, more control and improved automation
Decision Support and Information Management
• Asset Planning and Implementation
Monitoring, Measuring and Alerting
Management and Control
• Fault-tolerant (fail-over)
Software Services
Final outcome: a more reliable and efficient data center, with higher resilience and decreased PUE.
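For reference, PUE (Power Usage Effectiveness) is conventionally defined as total facility power divided by IT equipment power, so a value closer to 1.0 means less overhead. The sketch below uses hypothetical load figures to show how trimming non-IT overhead at a constant IT load lowers the ratio.

```python
def pue(total_facility_kw: float, it_equipment_kw: float) -> float:
    """Power Usage Effectiveness: total facility power / IT equipment power."""
    if it_equipment_kw <= 0:
        raise ValueError("IT equipment power must be positive")
    if total_facility_kw < it_equipment_kw:
        raise ValueError("total facility power cannot be below IT power")
    return total_facility_kw / it_equipment_kw


# Hypothetical loads in kW, for illustration only
it_load = 500.0
overhead_before = 400.0  # cooling, power distribution losses, idle redundancy
overhead_after = 300.0   # same IT load, less wasted overhead

before = pue(it_load + overhead_before, it_load)
after = pue(it_load + overhead_after, it_load)
print(f"PUE before: {before:.2f}, after: {after:.2f}")
# PUE before: 1.80, after: 1.60
```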