Dimensioning Data Centre Resiliency
HPC Advisory Council Switzerland Conference 2015 Sadaf Alam, CSCS
Extended Welcome
New and returning attendees, presenters & sponsors
From interconnects for systems to interconnected systems
Continuous adaptation of the program to reflect community needs
History of HPC Advisory Council Events, Lugano, Switzerland (source: hpcadvisorycouncil.com)
March 15-17, 2010
March 21-23, 2011
March 13-15, 2012
March 13-15, 2013
March 31 - April 3, 2014
Swiss National Supercomputing Centre (CSCS)
Provides, develops, and promotes technical and scientific services for the Swiss research community in HPC (cscs.ch)
Highlights since 2010—1st Swiss HPC Advisory Council Event
High Performance High Productivity (HP2C) project
Introduction to hybrid CPU & GPU systems (clusters & Cray XK7)
Move to new building in Lugano from Manno
Launch of data and storage services
Piz Daint—Hybrid Cray XC30
All hybrid CPU and GPU systems GPUDirect enabled
Fully non-blocking IB FDR cluster
Extension of Piz Daint with Cray XC40 (Piz Dora) & consolidation of services
Acknowledgements
Several members of staff at CSCS, specifically:
Network (Chris Gamboni)
Storage (Roberto Aielli, Stefano Gorini)
Regression suite (Tim Robinson, Gabriella Ceci)
User lab (Maria Grazia Giuffreda)
CSCS on call service (Carmelo Ponti)
Review (Nick Cardo)
Provisioning & Consolidation of Services at CSCS
High performance computing services
Piz Daint—Hybrid Cray XC30
Piz Dora—Cray XC40 (User Lab, UZH and EPFL MARVEL)
Data analysis and visualization services
Visualization on Piz Daint
Large memory nodes on Piz Dora
Storage and data services
Site-wide GPFS for the User Lab projects
Additional GPFS and offline storage for customers
Services for customers and partners
MeteoSwiss Cray XE6 system
Swiss Tier-2 Infrastructure for LHC community
Fully non-blocking FDR cluster for the PASC community
EPFL Blue Brain 4—IBM Blue Gene/Q & additional resources
Resiliency
Resilience according to merriam-webster.com
the ability to become strong, healthy, or successful again after
something bad happens
the ability of something to return to its original shape after it has been
pulled, stretched, pressed, bent, etc.
Terminology in a dynamic HPC data center environment
Bad = failure or degradation of facility OR hardware OR software
Pulled = cabling incidents, user-induced errors
Stretched = power issues esp. in GPU systems, extensions
Pressed = oversubscription in a shared environment
Bent = repurposed resources, mandatory patches
… = constantly evolving requirements due to service consolidation in a shared environment
Typical Solutions for Resiliency
Redundancy: two or more Piz Daints?
Failover: high availability solutions, for example, MeteoSwiss
Fault tolerance: component-level vs. vertically integrated stacks
On-call services: monitoring and interventions
Service migration: many dependencies
CSCS On Call Setup for Systems with Criticality Levels
Red
Defined by Service Level and Contractual elements
Highest availability requirements
Requirement for an operator
Orange
Central services
Systems with specific production requirements
Moderate coverage during on-call period
Green
No specific requirements for high availability
Mainly R&D systems
Dimensions of Data Center Resiliency
Perspective
Users
Operational staff
Stakeholders
Hierarchical building blocks
Holistic approach = perspective + hierarchical building blocks
[Figure: resilient services at the intersection of the three perspectives (users, stakeholders, operational staff), mapped onto the hierarchical building blocks ordered by rate of change: facility & infrastructure; central services and systems (e.g. network, authentication, central database); site-wide shared resources (e.g. storage systems, InfiniBand); computing systems (e.g. Cray and clusters); programming and execution environment; user applications]
Resiliency Dimensions @ CSCS (Perspectives)
[Figure: resilient services shown at the intersection of users, stakeholders, and operational staff, next to a growth chart (y-axis 0-300) spanning 2010-2014]
HPC Advisory Council & Resiliency: [an unscientific] Survey
2009 (China workshop)
Topics: Networking, cloud, applications, storage & file systems, system mgmt. & administration, GPU
2010 (Switzerland, ISC’10 & SC10)
Topics: Networking, cloud, applications, storage and file systems, system management and administration, MPI
2011 (Switzerland, China, USA/Stanford, ISC’11, SC11)
Topics: Networking, cloud, applications, storage & file systems, system mgmt. & administration, MPI, GPU
2012 (Israel, Switzerland, China, Spain, ISC’12)
Topics: Networking, cloud, applications, storage & file systems, system mgmt. & administration, security
2013 (USA/Stanford, Switzerland, Spain, China, ISC’13)
Topics: Networking, cloud, applications, storage & file systems, system mgmt. & administration, MPI, GPU, MIC, Big Data
2014 (USA/Stanford, Switzerland, Brazil, Spain, Singapore, China, South Africa, ISC’14)
Topics: Networking, cloud, applications, storage & file systems, system mgmt. & administration, MPI, GPU, MIC, Big Data
2015 (USA/Stanford, Switzerland …)
Topics: Networking, cloud, applications, storage and file systems, system management and administration, MPI, GPU, MIC, Big Data, [Resiliency/Data Center resiliency]
Site-wide GPFS Storage Configuration @ CSCS
Frank Schmuck and Roger Haskin. 2002. GPFS: A Shared-Disk File System for Large Computing Clusters. In
Proceedings of the 1st USENIX Conference on File and Storage Technologies (FAST '02). USENIX Association, Berkeley, CA, USA.
[Diagram: site-wide GPFS storage configuration: RamSan and FS840 flash for metadata, VNX8K01/VNX8K02 arrays for data, TSM for backup, Mönch CNFS and general CNFS servers, and GPFS clients on Mönch and various systems]
Bottom Up vs. Top Down Solutions
Bottom up (customized)
Regression testing tools
Monitors, log file analysis and alerts
Top down
Standards driven
Community effort proposal: Robust505
Holistic approach = bottom up + top down
ANSI/TIA-942 standard
Source: http://www.colocationamerica.com/data-center/tier-standards-overview.htm
CSCS Regression Suite Design Goals
Contract between customers & service providers
[Figure: the hierarchical building blocks (facility & infrastructure; central services and systems, e.g. network, authentication, central database; site-wide shared resources, e.g. storage systems, InfiniBand; computing systems, e.g. Cray and clusters; programming and execution environment; user applications), ordered by rate of change and annotated with customer/service-provider boundaries and the fine-grain regression tests of the CSCS regression suite for Piz Daint & Piz Dora]
CSCS Regression Suite Workflow
The regression driver dispatches two families of checks:
Admin checks: machine load, file systems, network, GOM, node status, SLURM status, disk status, ...
Application checks, by TESTID range (see the matching sketch after this list):
Compile CUDA tools on a specified node (TESTID: 8[0-7]00)
Basic compilation and library linking + modules env (TESTID: 5[0-6][0-6][0-6][0-9])
Compile and run libsci_acc apps (TESTID: 600[0-5])
Compile and run basic CUDA tools (TESTID: 7[0-7]0[0-3])
Run scientific applications (TESTID: 900[1-4])
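The TESTID ranges above are regular expressions over test numbers. A minimal Python sketch of how the driver could map a TESTID to its check family; only the patterns come from this slide, the family names and the classify helper are illustrative assumptions:

import re

# TESTID patterns from the slide; each family of checks owns a numeric ID range.
TESTID_PATTERNS = {
    "compile_cuda_tools": r"8[0-7]00",               # compile CUDA tools on a specified node
    "basic_compile_link": r"5[0-6][0-6][0-6][0-9]",  # basic compilation, linking, modules env
    "libsci_acc":         r"600[0-5]",               # compile and run libsci_acc apps
    "run_cuda_tools":     r"7[0-7]0[0-3]",           # compile and run basic CUDA tools
    "scientific_apps":    r"900[1-4]",               # run scientific applications
}

def classify(testid):
    """Return the check family that owns a given TESTID."""
    for family, pattern in TESTID_PATTERNS.items():
        if re.fullmatch(pattern, testid):
            return family
    raise ValueError("unknown TESTID: %s" % testid)

print(classify("8300"))  # -> compile_cuda_tools
print(classify("9002"))  # -> scientific_apps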
Custom options (a dispatch sketch follows this list):
-m: To run after system downtime (planned or unplanned).
Requires a reservation to check all nodes that are up.
Runs Admin + App checks
-p: To run after installation of a monthly PE release and/or any patches obtained from Cray. It tests the development environment.
-c: To run a special suite of tests on the node suspected to be in a bad state
-a: To run a special suite of production applications to check for performance regressions.
Applications are run concurrently to simulate a production environment. Performance is measured.
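A sketch of how the driver's entry point could dispatch these options; the argparse wiring and the check functions are hypothetical stand-ins, not the actual CSCS implementation:

import argparse

# Hypothetical stubs for the check families described on the previous slides.
def run_admin_checks():       print("admin checks: load, file systems, network, SLURM ...")
def run_application_checks(): print("application checks: compile/run tests by TESTID")
def run_devenv_checks():      print("development environment checks after a PE release")
def run_node_diagnostics(n):  print("targeted diagnostics on suspect node %s" % n)
def run_performance_apps():   print("production apps run concurrently, performance measured")

def main():
    parser = argparse.ArgumentParser(description="regression driver (sketch)")
    parser.add_argument("-m", action="store_true",
                        help="after downtime: admin + app checks (needs a reservation)")
    parser.add_argument("-p", action="store_true",
                        help="after a monthly PE release or Cray patches")
    parser.add_argument("-c", metavar="NODE",
                        help="special suite on a node suspected to be in a bad state")
    parser.add_argument("-a", action="store_true",
                        help="production applications, checking for performance regressions")
    args = parser.parse_args()

    if args.m:
        run_admin_checks()
        run_application_checks()
    if args.p:
        run_devenv_checks()
    if args.c:
        run_node_diagnostics(args.c)
    if args.a:
        run_performance_apps()

if __name__ == "__main__":
    main()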
Sample Output (I)
[Screenshot: regression run with admin checks PASSED; annotations point out the user who submitted the regression, the start and end dates, the login node where the suite was launched, the location where the output and the log are stored, and the test-passed status]
Sample Output (II)
Possible next steps depend on the type of failed test(s), and the decision to resume services depends on the criticality of the failure(s)
LogStash—An Open Source Log Management Tool
(http://logstash.net)
Systems Alerts & Monitoring—homegrown examples
Network snapshot (switches, link info, subnet manager)
Change logs and alerts
Error monitoring, for instance, symbol errors indicate cable issues (a monitoring sketch follows this list)
Pros: effective and instantaneous detection, and possibly correction, of errors
Cons: level of abstraction, for example, GPFS may have an issue while network logs are clean
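As a concrete, hypothetical instance of such a homegrown monitor, a Python sketch that scans a switch log for InfiniBand symbol-error counters and alerts when a counter grows between scans; the log line format, counter name, and threshold are assumptions:

import re

# Assumed log line shape: "... port <id> ... SymbolErrorCounter = <n> ..."
SYMBOL_ERR = re.compile(r"port (\S+).*SymbolErrorCounter\s*=\s*(\d+)")
THRESHOLD = 10  # assumed growth per scan interval that warrants an alert

def scan(logfile, last_counts):
    """Compare per-port symbol-error counters with the previous scan."""
    alerts = []
    with open(logfile) as log:
        for line in log:
            match = SYMBOL_ERR.search(line)
            if not match:
                continue
            port, count = match.group(1), int(match.group(2))
            if count - last_counts.get(port, 0) >= THRESHOLD:
                # rising symbol errors typically point at a bad cable or connector
                alerts.append("port %s: symbol errors at %d, check cabling" % (port, count))
            last_counts[port] = count
    return alerts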
Holistic Approach to Resiliency
HPC-focused list of systems to incentivize robust design solutions
Top500 5 best
Proposed Guidelines
Zero to minimal overhead for making a submission
Metrics (TBD; a computation sketch follows this list):
Data collection and reporting for Top500 runs
Uptime
Failures classification (known vs. unknown)
Self-healing vs. intervention, i.e. unscheduled maintenance
Known errors database (KEDB) and/or checklist
Faster workaround & resumption of service to users
Knowledge sharing
Ganglia and Nagios integration and completeness (main system, ecosystem, file system)
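To make the proposed metrics concrete, a sketch of how uptime and failure classification could be computed; the record layout, the numbers, and the KEDB contents are invented for illustration:

from datetime import timedelta

# Hypothetical failure records: (description, outage duration, self-healed?)
failures = [
    ("GPFS quorum loss",    timedelta(hours=2),   False),
    ("IB switch link flap", timedelta(minutes=5), True),
]
kedb = {"IB switch link flap"}  # known errors database (checklist of known issues)

period   = timedelta(days=30)
downtime = sum((outage for _, outage, _ in failures), timedelta())
uptime   = 100.0 * (1 - downtime / period)

known       = [f for f in failures if f[0] in kedb]  # known vs. unknown failures
self_healed = [f for f in failures if f[2]]          # self-healing vs. intervention

print("uptime %.2f%%, known %d/%d, self-healed %d/%d"
      % (uptime, len(known), len(failures), len(self_healed), len(failures)))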
Leveraging Other Approaches & Solutions
Resilience for OpenStack Deployments
[Diagram: OpenStack components: Horizon (dashboard & portal), Nova (cloud compute), Neutron (cloud network), Cinder (block storage), Keystone (identity management), Glance (image management)]
High Availability (HA) presentations and papers on OpenStack (mainly at
the OpenStack Summits)
Resiliency and Performance Engineering for OpenStack at Enterprise Scale by Mirantis
Checklist ~ regression (see the probe sketch after this list)
Wrecking crew
More Reliable, More Resilient, More Redundant: High Availability Update for Grizzly and Beyond by Hastexo
Layer-by-layer & component-by-component analysis ~ hierarchical approach & customer & service provider model for regression
Nova resiliency
Component-specific draft
Ubuntu OpenStack HA guideline
...
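In the spirit of the checklist ~ regression analogy above, a sketch that probes each OpenStack service API layer by layer; the endpoint map is an assumption (in a real deployment the URLs come from the Keystone service catalog), with the usual OpenStack default ports:

import requests

# Assumed endpoint map for one deployment; in practice discovered via Keystone.
SERVICES = {
    "keystone": "http://controller:5000/v3",
    "nova":     "http://controller:8774/v2.1",
    "neutron":  "http://controller:9696/v2.0",
    "glance":   "http://controller:9292/v2",
    "cinder":   "http://controller:8776/v3",
}

def check_services():
    """Probe each API root; an unreachable or erroring service marks that layer failed."""
    for name, url in SERVICES.items():
        try:
            response = requests.get(url, timeout=5)
            status = "up" if response.status_code < 500 else "error %d" % response.status_code
        except requests.RequestException as exc:
            status = "DOWN (%s)" % exc.__class__.__name__
        print("%-10s %s" % (name, status))

check_services()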
"Everything fails, all the time."
Key Considerations and Next Steps
Commissioning resilient services at a data center
Cultural issues
Overhead or productive work by users & operational staff
Focus on documentation and knowledge sharing
Putting best practices to work, e.g. change management tools
Investments & cost effectiveness
Staffing considerations
Metrics for cost effectiveness and SLAs to be confirmed by stakeholders
Community driven efforts for HPC data center resiliency
Share your stories (happy endings plus horror)
Feedback on Robust505