Defining and Improving IT Utilisation Efficiency through Holistic Data Centre Monitoring

(1)


Michael Rudgyard, CEO

(3)

• A spin-out company of a well-established UK SI

• Technology was developed for High Performance Computing

– Management of HPC resources needs to be ‘system-wide’

– Scalability (of both the architecture and the GUI) is paramount

• New company formed in March 2010

– Took on the product IP and existing HPC customer base

– Notable investment from the UK Carbon Trust

• Currently in ‘semi-stealth’ mode

– Have developed new features for the Data Centre market

(4)

How Efficient is your Data Centre?

(5)

• Most new data centres are being designed against PUE targets

– For a given IT hardware capacity, PUE is a good planning metric

– However, it is usually a poor operational metric

• Most importantly: what if the servers are not doing any useful work?

– The data centre may still have a ‘good’ PUE, but it would be very inefficient by any business metric

• We really need a measure of IT Usage Effectiveness (ITUE)

– i.e. how effectively the power is being used to deliver necessary IT
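The contrast between PUE and a usage-based measure can be illustrated with a small sketch. The two halls, their power figures and the `work_per_watt` helper are hypothetical; only the PUE ratio itself (total facility power divided by IT power) is the standard definition.

```python
def pue(total_facility_kw, it_kw):
    """Power Usage Effectiveness: total facility power / IT equipment power."""
    return total_facility_kw / it_kw


def work_per_watt(work_rate, it_kw):
    """A generic ITUE-style ratio: useful work per second per watt of IT power."""
    return work_rate / (it_kw * 1000.0)


# Hypothetical figures: both halls draw the same power, but one is mostly idle.
busy_hall = {"facility_kw": 1500.0, "it_kw": 1000.0, "transactions_per_s": 40000.0}
idle_hall = {"facility_kw": 1500.0, "it_kw": 1000.0, "transactions_per_s": 400.0}

for name, hall in (("busy", busy_hall), ("idle", idle_hall)):
    print(
        name,
        "PUE =", pue(hall["facility_kw"], hall["it_kw"]),
        "| transactions/s per watt =",
        work_per_watt(hall["transactions_per_s"], hall["it_kw"]),
    )
```

Both halls report the same PUE, yet the work delivered per watt differs by two orders of magnitude, which is the gap an ITUE-style measure is meant to expose.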

(6)

• Unlike PUE, the concept of ITUE encompasses a family of performance metrics

• Some metrics may provide useful generic ITUE measurement

– MIPS/watt or CPU Utilisation/watt (for compute-bound tasks)

– IOPS/watt or Bytes/watt (when I/O is predominant)

• Some end-users may be interested in application-related metrics:

– Database transactions/watt

– Page refreshes/watt

– Searches/watt

• Some may be business related:

– £s of products sold / watt; or £s of products sold / integrated IT cost

(7)

• With few exceptions, the most successful methodology for improving energy conservation across all sectors is:

– Step 1: Identify who/what is responsible for significant energy waste

– Step 2: Drive behaviour to ‘encourage’ change

• What is the implication for the Data Centre ?

• Need to monitor and report ITUE metrics by customer, department or end-user

– Who or what applications/services are the worst offenders?

– Management can use data to drive better practice (charge-back ?)
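A per-department report of this kind could be sketched as follows. The record layout, department names and figures are invented for illustration, and transactions per kWh is just one possible ITUE metric.

```python
from collections import defaultdict

# Hypothetical per-server records: (department, energy in kWh, transactions served).
records = [
    ("finance", 120.0, 600_000),
    ("web", 300.0, 4_500_000),
    ("research", 250.0, 50_000),
    ("finance", 80.0, 400_000),
]


def itue_by_department(rows):
    """Aggregate energy and work per department; least efficient first."""
    energy = defaultdict(float)
    work = defaultdict(float)
    for department, kwh, transactions in rows:
        energy[department] += kwh
        work[department] += transactions
    report = [
        {
            "department": d,
            "energy_kwh": energy[d],
            "tx_per_kwh": work[d] / energy[d],
        }
        for d in energy
        if energy[d] > 0  # skip departments with no metered energy
    ]
    return sorted(report, key=lambda row: row["tx_per_kwh"])


report = itue_by_department(records)
for row in report:
    print(row)
```

Sorting by the metric puts the worst offenders at the top of the list, and the `energy_kwh` column is the figure a charge-back scheme would bill against.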

(9)

• Efficient DCs will need to monitor & manage both IT and Facilities systems in a coherent manner:

– Environmental systems (temperature, humidity, air-conditioning..)

– Power (at the distribution board, rack PDU and server PSU level …)

– IT equipment (using standard protocols such as IPMI and SNMP…)

– Operating systems & Virtual Machines (integrating with IT systems)

– ..and perhaps applications themselves

• Software tools will need to integrate with multiple systems from multiple vendors (both hardware and software) in an agnostic manner

(10)

• Optimised environmental management to improve PUE (& ITUE)

• Identification of unused, under-used, inefficient or over-spec’ed IT equipment

• Using active power management during low utilisation periods

• Dynamic orchestration of virtual machines based on environmental, power and IT usage constraints

• Non-trivial energy savings through simple changes (20-25%)

• Opportunity for very significant savings in most DCs (25-75%)
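Identifying unused or under-used equipment amounts to combining utilisation and power readings per server. A minimal sketch, assuming CPU utilisation (in %) and power (in watts) have been sampled over some period; the thresholds are arbitrary placeholders rather than recommended values.

```python
from statistics import mean


def classify_server(cpu_samples, power_samples_w, idle_cpu=2.0, low_cpu=15.0):
    """Label a server from its average CPU utilisation (%) and power draw (W).

    The idle_cpu and low_cpu thresholds are illustrative assumptions.
    """
    avg_cpu = mean(cpu_samples)
    avg_power_w = mean(power_samples_w)
    if avg_cpu < idle_cpu:
        status = "unused"
    elif avg_cpu < low_cpu:
        status = "under-used"
    else:
        status = "in use"
    return status, avg_cpu, avg_power_w


# Hypothetical server that draws around 180 W while doing almost nothing.
print(classify_server([1.0, 0.5, 1.5], [180.0, 182.0, 181.0]))
```

A server flagged as "unused" while still drawing significant power is a candidate for decommissioning, consolidation or active power management.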

(12)

• Consolidation of Data Centres is already happening

– Driven by economies of scale and the ‘Cloud’

– The trend is only likely to accelerate…

• Conversely, as Data Centres become bigger, energy management will become even more important

• The winners in the race for the clouds will be those who can operate the most efficiently …

– .. but few know how efficient they are now !!

(13)

• The largest data centres are owned by a handful of IT giants:

– Google, Amazon, Microsoft, Yahoo etc…

• These giants are very aware of Data Centre Efficiency

– Some have turned common perceptions on their heads

– Some even design their own servers

– All have developed their own systems and software

(14)

• Imagine a data centre with 50,000-100,000 servers (cf. Google)

– i.e. 1,500-3,000 racks and a similar number of PDUs and sensors

– and up to (say) 16 VMs per server

• You might want to monitor (derive reports from & orchestrate…)

– 1,500-12,000 environmental sensors

– 20-30 data-points per server (IPMI, Power) = 1-3M points

– 20-100 data-points per OS/VM (e.g. SNMP, WMI) = 16-160M points

– … as well as user and application data.

• That’s a hell of a lot of information!

– But even scaling this back by an order of magnitude presents a challenge for software.
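The totals quoted above follow from simple multiplication, reproduced here as a sketch. The server, VM and data-point counts come from the slide; the 60-second polling interval is an added assumption used only to show how quickly samples accumulate.

```python
def scale(servers, points_each, vms_per_server=1):
    """Number of monitored data points for a given server count."""
    return servers * vms_per_server * points_each


low_servers, high_servers = 50_000, 100_000

# 20-30 data points per physical server (IPMI, power).
hardware_points = (scale(low_servers, 20), scale(high_servers, 30))

# 20-100 data points per OS/VM, with up to 16 VMs per server.
vm_points = (scale(low_servers, 20, 16), scale(high_servers, 100, 16))

total_points = (
    hardware_points[0] + vm_points[0],
    hardware_points[1] + vm_points[1],
)

POLL_INTERVAL_S = 60  # assumption: every point is sampled once a minute
samples_per_day = tuple(p * (86_400 // POLL_INTERVAL_S) for p in total_points)

print("hardware points:", hardware_points)
print("VM points:      ", vm_points)
print("samples per day:", samples_per_day)
```

Under that polling assumption, even the lower bound produces tens of billions of samples per day, which is why scaling back by an order of magnitude still leaves a hard problem.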

(15)

Things that won’t work:

• Using a ‘single-instance’ software architecture

– Information will need to be processed in a distributed manner

• Putting unrefined data in a standard SQL database

– or you’ll need another data centre to store, process & retrieve the data!

• Expecting simple GUIs (eg. lists and trees) to be effective

– Visualisation becomes a key aspect to usability
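One common way to avoid storing unrefined data is to roll raw samples up into fixed-interval summaries close to where they are collected, and store only the summaries. The sketch below shows that general technique; it is not a description of any particular product, and the 5-minute window is an arbitrary choice.

```python
from statistics import mean


def roll_up(samples, window_s=300):
    """Summarise (timestamp_s, value) samples into fixed windows.

    Returns one dict per window with min, mean, max and sample count,
    ordered by window start time.
    """
    buckets = {}
    for timestamp_s, value in samples:
        start = int(timestamp_s // window_s) * window_s
        buckets.setdefault(start, []).append(value)
    return [
        {
            "start": start,
            "min": min(values),
            "mean": mean(values),
            "max": max(values),
            "count": len(values),
        }
        for start, values in sorted(buckets.items())
    ]


# Hypothetical power readings (seconds since start, watts).
raw = [(0, 200.0), (60, 210.0), (120, 220.0), (300, 400.0)]
summaries = roll_up(raw)
for row in summaries:
    print(row)
```

Because each collector can summarise its own devices independently, this kind of reduction also fits a distributed rather than single-instance architecture.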

(16)

Concurrent Thinking’s Products

(17)

‘Command & Control’ Architecture

• 1U appliance

• Collates information from concurrentCONTROL devices

• Delivers highly polished, Web 2.0 GUI

– Manage anywhere from mobile, iPhone, PDA, PC etc…

• Built to be a scalable interface

• ‘Zero’ U, low-power appliance

• Monitors data from devices associated with local ‘racks’

– Power control and monitoring, Environmental information

• concurrentCOMMAND provides full system management GUI

• concurrentCONTROL devices act as slaves, and are designed to enable scalable and fault tolerant system monitoring and imaging

(18)

• Monitoring

– Power from power clamps, third party PDUs & PMBus PSUs

– Environmental sensors: wired (5V) and wireless (866 MHz)

– Server hardware - IPMI, DCMI and Intel Node Manager support

– SNMP & WMI support for OS and VM monitoring; optional in-band ‘daemon’

• Reporting

– Power charge-back and ITUE metrics by group/customer/user/application

– Scalable, ‘real-time’ data-centre views

– Extensive reporting of historical data

– Breach monitoring and reporting; Event data-base and visualisation

• Management

– Data Centre Inventory

– Power management (PDU and IPMI support)

– Event scheduling

– Serial-over-LAN & SSH terminals
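The breach monitoring listed under Reporting can be pictured as a threshold check over a time series. This is a generic sketch and not the product's implementation; the function name, the 27.0 limit and the temperature readings are all hypothetical.

```python
def find_breaches(readings, limit):
    """Return (start, end, peak) for each contiguous run of values above limit.

    readings must be (timestamp, value) pairs in time order.
    """
    breaches = []
    current = None  # [start, end, peak] of the breach in progress, if any
    for timestamp, value in readings:
        if value > limit:
            if current is None:
                current = [timestamp, timestamp, value]
            else:
                current[1] = timestamp
                current[2] = max(current[2], value)
        elif current is not None:
            breaches.append(tuple(current))
            current = None
    if current is not None:  # series ended while still in breach
        breaches.append(tuple(current))
    return breaches


# Hypothetical inlet temperatures (seconds, degrees C) against a 27.0 limit.
inlet_temps = [(0, 24.0), (60, 28.5), (120, 29.0), (180, 25.0), (240, 27.5)]
breaches = find_breaches(inlet_temps, limit=27.0)
print(breaches)
```

Grouping consecutive readings into a single event keeps an event log readable when a sensor stays above its limit for many samples.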

(20)

Visualisation of real-time metrics - data centre view

(26)

Hardware repository: PDU association to servers

(27)

• Integration with third party VMs (VMware, KVM, Hyper-V ..)

– Dynamic orchestration of virtual machines
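As an illustration of what orchestration under power constraints might involve, the sketch below packs VMs onto as few hosts as possible using a first-fit decreasing heuristic, subject to a CPU capacity and a per-host power cap. The linear power model (idle power plus watts per CPU unit), the function name and all figures are assumptions made for the example; real orchestration would go through the hypervisor's own management interface.

```python
def consolidate(vms, host_cpu_capacity, host_power_cap_w, idle_w, watts_per_cpu):
    """First-fit decreasing placement of VMs onto hosts.

    vms maps VM name -> CPU demand. A VM joins the first host where both the
    CPU capacity and the estimated power cap still hold; otherwise a new host
    is opened (even if the VM alone would exceed the limits).
    """
    hosts = []  # each host: {"vms": [names], "cpu": total CPU demand}
    for name, cpu in sorted(vms.items(), key=lambda kv: kv[1], reverse=True):
        for host in hosts:
            new_cpu = host["cpu"] + cpu
            est_power_w = idle_w + watts_per_cpu * new_cpu
            if new_cpu <= host_cpu_capacity and est_power_w <= host_power_cap_w:
                host["vms"].append(name)
                host["cpu"] = new_cpu
                break
        else:
            hosts.append({"vms": [name], "cpu": cpu})
    return hosts


# Hypothetical demands in CPU cores; the power cap allows at most 7 cores per host.
placement = consolidate(
    {"a": 6, "b": 5, "c": 3, "d": 2},
    host_cpu_capacity=8,
    host_power_cap_w=275,
    idle_w=100,
    watts_per_cpu=25,
)
for host in placement:
    print(host)
```

Here the power cap, not the CPU capacity, is the binding constraint: VM `c` would fit beside `b` on CPU alone but is placed on its own host to stay under the cap. Hosts left empty by such a placement could then be powered down.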
