Dimensioning Data Centre Resiliency
HPC Advisory Council Switzerland Conference 2015 Sadaf Alam, CSCS
Extended Welcome
New and returning attendees, presenters & sponsors
From interconnects for systems to interconnected systems
Continuous adaptation of the program to reflect community needs
History of HPC Advisory Council Events, Lugano, Switzerland (source: hpcadvisorycouncil.com)
March 15-17, 2010
March 21-23, 2011
March 13-15, 2012
March 13-15, 2013
March 31 - April 3, 2014
Swiss National Supercomputing Centre (CSCS)
Provides, develops, and promotes technical and scientific services for the Swiss research community in HPC (cscs.ch)
Highlights since 2010—1st Swiss HPC Advisory Council Event
High Performance High Productivity (HP2C) project
Introduction to hybrid CPU & GPU systems (clusters & Cray XK7)
Move to new building in Lugano from Manno
Launch of data and storage services
Piz Daint—Hybrid Cray XC30
All hybrid CPU and GPU systems GPUDirect enabled
Fully non-blocking IB FDR cluster
Extension of Piz Daint with Cray XC40 (Piz Dora) & consolidation of services
Acknowledgements
Several members of staff at CSCS, specifically:
Network (Chris Gamboni)
Storage (Roberto Aielli, Stefano Gorini)
Regression suite (Tim Robinson, Gabriella Ceci)
User lab (Maria Grazia Giuffreda)
CSCS on call service (Carmelo Ponti)
Review (Nick Cardo)
Provisioning & Consolidation of Services at CSCS
High performance computing services
Piz Daint—Hybrid Cray XC30
Piz Dora—Cray XC40 (User Lab, UZH and EPFL MARVEL)
Data analysis and visualization services
Visualization on Piz Daint
Large memory nodes on Piz Dora
Storage and data services
Site-wide GPFS for the User Lab projects
Additional GPFS and offline storage for customers
Services for customers and partners
MeteoSwiss Cray XE6 system
Swiss Tier-2 Infrastructure for LHC community
Fully non-blocking FDR cluster for the PASC community
EPFL Blue Brain 4—IBM Blue Gene/Q & additional resources
Resiliency
Resilience according to merriam-webster.com
the ability to become strong, healthy, or successful again after
something bad happens
the ability of something to return to its original shape after it has been
pulled, stretched, pressed, bent, etc.
Terminology in a dynamic HPC data center environment
Bad = failure or degradation of facility OR hardware OR software
Pulled = cabling incidents, user-induced errors
Stretched = power issues esp. in GPU systems, extensions
Pressed = oversubscription in a shared environment
Bent = repurposed resources, mandatory patches
… = constantly evolving requirements due to service consolidation in a shared environment
Typical Solutions for Resiliency
Redundancy: two or more Piz Daints?
Failover: high availability solutions, for example, MeteoSwiss
Fault tolerance: component-level vs. vertically integrated stacks
On-call services: monitoring and interventions
Service migration: many dependencies
CSCS On Call Setup for Systems with Criticality Levels
Red
Defined by Service Level and Contractual elements
Highest availability requirements
Requirement for an operator
Orange
Central services
Systems with specific production requirements
Moderate coverage during on-call period
Green
No specific requirements for high availability
Mainly R&D systems
Dimensions of Data Center Resiliency
Perspective
Users
Operational staff
Stakeholders
Hierarchical building blocks
Holistic approach = perspective + hierarchical building blocks
[Figure: resilient services at the intersection of the three perspectives (users, stakeholders, operational staff), mapped onto the hierarchical building blocks ordered by rate of change: facility & infrastructure; central services and systems (e.g. network, authentication, central database); site-wide shared resources (e.g. storage systems, InfiniBand); computing systems (e.g. Cray and clusters); programming and execution environment; user applications]
Resiliency Dimensions @ CSCS (Perspectives)
[Figure: resilient services shown at the intersection of users, stakeholders, and operational staff, next to a growth chart (y-axis 0-300) spanning 2010-2014]
HPC Advisory Council & Resiliency: [an unscientific] Survey
2009 (China workshop)
Topics: Networking, cloud, applications, storage & file systems, system mgmt. & administration, GPU
2010 (Switzerland, ISC’10 & SC10)
Topics: Networking, cloud, applications, storage and file systems, system management and administration, MPI
2011 (Switzerland, China, USA/Stanford, ISC’11, SC11)
Topics: Networking, cloud, applications, storage & file systems, system mgmt. & administration, MPI, GPU
2012 (Israel, Switzerland, China, Spain, ISC’12)
Topics: Networking, cloud, applications, storage & file systems, system mgmt. & administration, security
2013 (USA/Stanford, Switzerland, Spain, China, ISC’13)
Topics: Networking, cloud, applications, storage & file systems, system mgmt. & administration, MPI, GPU, MIC, Big Data
2014 (USA/Stanford, Switzerland, Brazil, Spain, Singapore, China, South Africa, ISC’14)
Topics: Networking, cloud, applications, storage & file systems, system mgmt. & administration, MPI, GPU, MIC, Big Data
2015 (USA/Stanford, Switzerland …)
Topics: Networking, cloud, applications, storage and file systems, system management and administration, MPI, GPU, MIC, Big Data, [Resiliency/Data Center resiliency]
Site-wide GPFS Storage Configuration @ CSCS
Frank Schmuck and Roger Haskin. 2002. GPFS: A Shared-Disk File System for Large Computing Clusters. In
Proceedings of the 1st USENIX Conference on File and Storage Technologies (FAST '02). USENIX Association, Berkeley, CA, USA.
[Diagram: site-wide GPFS storage configuration: RamSan and FS840 flash for metadata, VNX8K01/VNX8K02 arrays for data, TSM for backup, Mönch CNFS and general CNFS servers, and GPFS clients on Mönch and various systems]
Bottom Up vs. Top Down Solutions
Bottom up (customized)
Regression testing tools
Monitors, log file analysis and alerts
Top down
Standards driven
Community effort proposal: Robust505
Holistic approach = bottom up + top down
ANSI/TIA-942 standard
Source: http://www.colocationamerica.com/data-center/tier-standards-overview.htm
CSCS Regression Suite Design Goals
Contract between customers & service providers
[Figure: the hierarchical building blocks (facility & infrastructure; central services and systems, e.g. network, authentication, central database; site-wide shared resources, e.g. storage systems, InfiniBand; computing systems, e.g. Cray and clusters; programming and execution environment; user applications), ordered by rate of change and annotated with customer/service-provider boundaries and the fine-grain regression tests of the CSCS regression suite for Piz Daint & Piz Dora]
CSCS Regression Suite Workflow
The regression driver dispatches two families of checks:
Admin checks: machine load, file systems, network, GOM, node status, SLURM status, disk status, ...
Application checks, by TESTID range (see the matching sketch after this list):
Compile CUDA tools on a specified node (TESTID: 8[0-7]00)
Basic compilation and library linking + modules env (TESTID: 5[0-6][0-6][0-6][0-9])
Compile and run libsci_acc apps (TESTID: 600[0-5])
Compile and run basic CUDA tools (TESTID: 7[0-7]0[0-3])
Run scientific applications (TESTID: 900[1-4])
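The TESTID ranges above are regular expressions over test numbers. A minimal Python sketch of how the driver could map a TESTID to its check family; only the patterns come from this slide, the family names and the classify helper are illustrative assumptions:

import re

# TESTID patterns from the slide; each family of checks owns a numeric ID range.
TESTID_PATTERNS = {
    "compile_cuda_tools": r"8[0-7]00",               # compile CUDA tools on a specified node
    "basic_compile_link": r"5[0-6][0-6][0-6][0-9]",  # basic compilation, linking, modules env
    "libsci_acc":         r"600[0-5]",               # compile and run libsci_acc apps
    "run_cuda_tools":     r"7[0-7]0[0-3]",           # compile and run basic CUDA tools
    "scientific_apps":    r"900[1-4]",               # run scientific applications
}

def classify(testid):
    """Return the check family that owns a given TESTID."""
    for family, pattern in TESTID_PATTERNS.items():
        if re.fullmatch(pattern, testid):
            return family
    raise ValueError("unknown TESTID: %s" % testid)

print(classify("8300"))  # -> compile_cuda_tools
print(classify("9002"))  # -> scientific_apps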
Custom options (a dispatch sketch follows this list):
-m: To run after system downtime (planned or unplanned).
Requires a reservation to check all nodes that are up.
Runs Admin + App checks
-p: To run after installation of a monthly PE release and/or any patches obtained from Cray. It tests the development environment.
-c: To run a special suite of tests on the node suspected to be in a bad state
-a: To run a special suite of production applications to check for performance regressions.
Applications are run concurrently to simulate a production environment. Performance is measured.
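A sketch of how the driver's entry point could dispatch these options; the argparse wiring and the check functions are hypothetical stand-ins, not the actual CSCS implementation:

import argparse

# Hypothetical stubs for the check families described on the previous slides.
def run_admin_checks():       print("admin checks: load, file systems, network, SLURM ...")
def run_application_checks(): print("application checks: compile/run tests by TESTID")
def run_devenv_checks():      print("development environment checks after a PE release")
def run_node_diagnostics(n):  print("targeted diagnostics on suspect node %s" % n)
def run_performance_apps():   print("production apps run concurrently, performance measured")

def main():
    parser = argparse.ArgumentParser(description="regression driver (sketch)")
    parser.add_argument("-m", action="store_true",
                        help="after downtime: admin + app checks (needs a reservation)")
    parser.add_argument("-p", action="store_true",
                        help="after a monthly PE release or Cray patches")
    parser.add_argument("-c", metavar="NODE",
                        help="special suite on a node suspected to be in a bad state")
    parser.add_argument("-a", action="store_true",
                        help="production applications, checking for performance regressions")
    args = parser.parse_args()

    if args.m:
        run_admin_checks()
        run_application_checks()
    if args.p:
        run_devenv_checks()
    if args.c:
        run_node_diagnostics(args.c)
    if args.a:
        run_performance_apps()

if __name__ == "__main__":
    main()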
Sample Output (I)
[Screenshot: regression run with admin checks PASSED; annotations point out the user who submitted the regression, the start and end dates, the login node where the suite was launched, the location where the output and the log are stored, and the test-passed status]
Sample Output (II)
Possible next steps depend on the type of failed test(s), and the decision to resume services depends on the criticality of the failure(s)
LogStash—An Open Source Log Management Tool
(http://logstash.net)
Systems Alerts & Monitoring—homegrown examples
Network snapshot (switches, link info, subnet manager)
Change logs and alerts
Error monitoring, for instance, symbol errors indicate cable issues (a monitoring sketch follows this list)
Pros: effective and instantaneous detection, and possibly correction, of errors
Cons: level of abstraction, for example, GPFS may have an issue while network logs are clean
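As a concrete, hypothetical instance of such a homegrown monitor, a Python sketch that scans a switch log for InfiniBand symbol-error counters and alerts when a counter grows between scans; the log line format, counter name, and threshold are assumptions:

import re

# Assumed log line shape: "... port <id> ... SymbolErrorCounter = <n> ..."
SYMBOL_ERR = re.compile(r"port (\S+).*SymbolErrorCounter\s*=\s*(\d+)")
THRESHOLD = 10  # assumed growth per scan interval that warrants an alert

def scan(logfile, last_counts):
    """Compare per-port symbol-error counters with the previous scan."""
    alerts = []
    with open(logfile) as log:
        for line in log:
            match = SYMBOL_ERR.search(line)
            if not match:
                continue
            port, count = match.group(1), int(match.group(2))
            if count - last_counts.get(port, 0) >= THRESHOLD:
                # rising symbol errors typically point at a bad cable or connector
                alerts.append("port %s: symbol errors at %d, check cabling" % (port, count))
            last_counts[port] = count
    return alerts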
Holistic Approach to Resiliency
HPC-focused list of systems to incentivize robust design solutions
Top500 5 best
Proposed Guidelines
Zero to minimal overhead for making a submission
Metrics (TBD; a computation sketch follows this list):
Data collection and reporting for Top500 runs
Uptime
Failures classification (known vs. unknown)
Self-healing vs. intervention, i.e. unscheduled maintenance
Known errors database (KEDB) and/or checklist
Faster workaround & resumption of service to users
Knowledge sharing
Ganglia and Nagios integration and completeness (main system, ecosystem, file system)
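To make the proposed metrics concrete, a sketch of how uptime and failure classification could be computed; the record layout, the numbers, and the KEDB contents are invented for illustration:

from datetime import timedelta

# Hypothetical failure records: (description, outage duration, self-healed?)
failures = [
    ("GPFS quorum loss",    timedelta(hours=2),   False),
    ("IB switch link flap", timedelta(minutes=5), True),
]
kedb = {"IB switch link flap"}  # known errors database (checklist of known issues)

period   = timedelta(days=30)
downtime = sum((outage for _, outage, _ in failures), timedelta())
uptime   = 100.0 * (1 - downtime / period)

known       = [f for f in failures if f[0] in kedb]  # known vs. unknown failures
self_healed = [f for f in failures if f[2]]          # self-healing vs. intervention

print("uptime %.2f%%, known %d/%d, self-healed %d/%d"
      % (uptime, len(known), len(failures), len(self_healed), len(failures)))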
Leveraging Other Approaches & Solutions
Resilience for OpenStack Deployments
[Diagram: OpenStack components: Horizon (dashboard & portal), Nova (cloud compute), Neutron (cloud network), Cinder (block storage), Keystone (identity management), Glance (image management)]
High Availability (HA) presentations and papers on OpenStack (mainly at
the OpenStack Summits)
Resiliency and Performance Engineering for OpenStack at Enterprise Scale by Mirantis
Checklist ~ regression (see the probe sketch after this list)
Wrecking crew
More Reliable, More Resilient, More Redundant: High Availability Update for Grizzly and Beyond by Hastexo
Layer-by-layer & component-by-component analysis ~ hierarchical approach & customer & service provider model for regression
Nova resiliency
Component-specific draft
Ubuntu OpenStack HA guideline
...
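In the spirit of the checklist ~ regression analogy above, a sketch that probes each OpenStack service API layer by layer; the endpoint map is an assumption (in a real deployment the URLs come from the Keystone service catalog), with the usual OpenStack default ports:

import requests

# Assumed endpoint map for one deployment; in practice discovered via Keystone.
SERVICES = {
    "keystone": "http://controller:5000/v3",
    "nova":     "http://controller:8774/v2.1",
    "neutron":  "http://controller:9696/v2.0",
    "glance":   "http://controller:9292/v2",
    "cinder":   "http://controller:8776/v3",
}

def check_services():
    """Probe each API root; an unreachable or erroring service marks that layer failed."""
    for name, url in SERVICES.items():
        try:
            response = requests.get(url, timeout=5)
            status = "up" if response.status_code < 500 else "error %d" % response.status_code
        except requests.RequestException as exc:
            status = "DOWN (%s)" % exc.__class__.__name__
        print("%-10s %s" % (name, status))

check_services()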
"Everything fails, all the time."
Key Considerations and Next Steps
Commissioning resilient services at a data center
Cultural issues
Overhead or productive work by users & operational staff
Focus on documentation and knowledge sharing
Putting best practices to work, e.g. change management tools
Investments & cost effectiveness
Staffing considerations
Metrics for cost effectiveness and SLAs to be confirmed by stakeholders
Community driven efforts for HPC data center resiliency
Share your stories (happy endings plus horror)
Feedback on Robust505