(1)

Dimensioning Data Centre Resiliency

HPC Advisory Council Switzerland Conference 2015

Sadaf Alam, CSCS

(2)

Extended Welcome

New and returning attendees, presenters & sponsors

 Interconnects for systems → interconnected systems

 Continuous adaptation of the program to reflect community needs

History of HPC Advisory Council Events, Lugano, Switzerland (source: hpcadvisorycouncil.com)

 March 15-17, 2010

 March 21-23, 2011

 March 13-15, 2012

 March 13-15, 2013

 March 31 - April 3, 2014

(3)

Swiss National Supercomputing Centre (CSCS)

Provides, develops, and promotes technical and scientific services for the Swiss research community in HPC (cscs.ch)

Highlights since 2010—1st Swiss HPC Advisory Council Event

 High Performance High Productivity (HP2C) project

 Introduction to hybrid CPU & GPU systems (clusters & Cray XK7)

 Move from Manno to the new building in Lugano

 Launch of data and storage services

 Piz Daint—Hybrid Cray XC30

 All hybrid CPU and GPU systems GPUDirect enabled

 Fully non-blocking IB FDR cluster

 Extension of Piz Daint with Cray XC40 (Piz Dora) & consolidation of services

(4)

Acknowledgements

Several members of staff at CSCS, specifically:

 Network (Chris Gamboni)

 Storage (Roberto Aielli, Stefano Gorini)

 Regression suite (Tim Robinson, Gabriella Ceci)

 User lab (Maria Grazia Giuffreda)

 CSCS on call service (Carmelo Ponti)

 Review (Nick Cardo)

(5)

Provisioning & Consolidation of Services at CSCS

High performance computing services

 Piz Daint—Hybrid Cray XC30

 Piz Dora—Cray XC40 (User Lab, UZH and EPFL MARVEL)

Data analysis and visualization services

 Visualization on Piz Daint

 Large memory nodes on Piz Dora

Storage and data services

 Site-wide GPFS for the User Lab projects

 Additional GPFS and offline storage for customers

Services for customers and partners

 MeteoSwiss Cray XE6 system

 Swiss Tier-2 Infrastructure for the LHC community

 Fully non-blocking FDR cluster for the PASC community

 EPFL Blue Brain 4—IBM Blue Gene/Q & additional resources

(6)

Resiliency

Resilience according to merriam-webster.com

 the ability to become strong, healthy, or successful again after something bad happens

 the ability of something to return to its original shape after it has been pulled, stretched, pressed, bent, etc.

Terminology in a dynamic HPC data center environment

 Bad = failure or degradation of facility OR hardware OR software

 Pulled = cabling incidents, user-induced errors

 Stretched = power issues esp. in GPU systems, extensions

 Pressed = oversubscription in a shared environment

 Bent = repurposed resources, mandatory patches

 … = constantly evolving requirements due to service consolidation in a shared environment

(7)

Typical Solutions for Resiliency

 Redundancy → two or more Piz Daint?

 Failover → high availability solutions, for example, MeteoSwiss

 Fault tolerance → component vs. vertically integrated stacks

 On call services → monitoring and interventions

 Service migration → many dependencies

(8)

CSCS On Call Setup for Systems with Criticality Levels

 Red

 Defined by Service Level and Contractual elements

 Highest availability requirements

 Requirement for an operator

 Orange

 Central services

 Systems with specific production requirements

 Moderate coverage during on-call period

 Green

 No specific requirements for high availability

 Mainly R&D systems

(9)

Dimensions of Data Center Resiliency

Perspective

 Users

 Operational staff

 Stakeholders

Hierarchical building blocks

Holistic approach = perspective + hierarchical building blocks

[Diagram: resilient services at the intersection of users, stakeholders, and operational staff, layered over the hierarchical building blocks, ordered from lowest to highest rate of change:]

 Facility & infrastructure

 Central services and systems (e.g. network, authentication, central database, etc.)

 Site-wide shared resources (e.g. storage systems, InfiniBand, etc.)

 Computing systems (e.g. Cray and clusters)

 Programming and execution environment

 User applications

(10)

Resiliency Dimensions @ CSCS (Perspectives)

[Diagram: resilient services at the intersection of users, stakeholders, and operational staff. Chart: growth over 2010-2014 (scale 0-300).]

(11)

HPC Advisory Council & Resiliency—[an unscientific] Survey

 2009 (China workshop)

 Topics: Networking, cloud, applications, storage & file systems, system mgmt. & administration, GPU

 2010 (Switzerland, ISC’10 & SC10)

 Topics: Networking, cloud, applications, storage and file systems, system management and administration, MPI

 2011 (Switzerland, China, USA/Stanford, ISC’11, SC11)

 Topics: Networking, cloud, applications, storage & file systems, system mgmt. & administration, MPI, GPU

 2012 (Israel, Switzerland, China, Spain, ISC’12)

 Topics: Networking, cloud, applications, storage & file systems, system mgmt. & administration, security

 2013 (USA/Stanford, Switzerland, Spain, China, ISC’13)

 Topics: Networking, cloud, applications, storage & file systems, system mgmt. & administration, MPI, GPU, MIC, Big Data

 2014 (USA/Stanford, Switzerland, Brazil, Spain, Singapore, China, South Africa, ISC’14)

 Topics: Networking, cloud, applications, storage & file systems, system mgmt. & administration, MPI, GPU, MIC, Big Data

 2015 (USA/Stanford, Switzerland …)

 Topics: Networking, cloud, applications, storage and file systems, system management and administration, MPI, GPU, MIC, Big Data, [Resiliency/Data Center resiliency]

(12)
(13)
(14)

Site-wide GPFS Storage Configuration @ CSCS

Frank Schmuck and Roger Haskin. 2002. GPFS: A Shared-Disk File System for Large Computing Clusters. In Proceedings of the 1st USENIX Conference on File and Storage Technologies (FAST '02). USENIX Association, Berkeley, CA, USA.

[Diagram: site-wide GPFS configuration: VNX8K01 and VNX8K02 (data), RAMSAN (metadata), FS840, TSM (backup), CNFS servers (Monch and general), GPFS clients (Monch and various systems)]

(15)

Bottom Up vs. Top Down Solutions

Bottom up (customized)

 Regression testing tools

 Monitors, log file analysis and alerts

Top down

 Standards driven

 Community effort proposal: Robust505

Holistic approach = bottom up + top down

[Figure: ANSI/TIA-942 data center tier standards. Source: http://www.colocationamerica.com/data-center/tier-standards-overview.htm]

(16)

CSCS Regression Suite Design Goals

Contract between customers & service providers

[Diagram: the hierarchical building blocks, ordered from lowest to highest rate of change: facility & infrastructure; central services and systems (e.g. network, authentication, central database, etc.); site-wide shared resources (e.g. storage systems, InfiniBand, etc.); computing systems (e.g. Cray and clusters); programming and execution environment; user applications. Fine-grain regression tests form the customer/service provider contract at each layer; the CSCS regression suite for Piz Daint & Piz Dora spans the upper layers.]

(17)

CSCS Regression Suite Workflow

The regression driver launches admin checks and application checks; a minimal sketch of the TESTID dispatch follows the list.

Admin checks

 Machine load, file systems, network, GOM, node status, SLURM status, disk status, …

Application checks

 Basic compilation and library linking + modules env (TESTID: 5[0-6][0-6][0-6][0-9])

 Compile and run libsci_acc apps (TESTID: 600[0-5])

 Compile and run basic CUDA tools (TESTID: 7[0-7]0[0-3])

 Compile CUDA tools on a specified node (TESTID: 8[0-7]00)

 Run scientific applications (TESTID: 900[1-4])
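To make the TESTID scheme concrete, the following is a minimal sketch of how a driver might dispatch a TESTID to its check category. Only the TESTID patterns come from the workflow above; the driver structure, function names, and example IDs are illustrative assumptions, not the actual CSCS implementation.

import re

# TESTID ranges from the workflow above; everything else in this sketch
# is hypothetical, not the CSCS code.
TEST_CATEGORIES = [
    (r"5[0-6][0-6][0-6][0-9]", "basic compilation and library linking + modules env"),
    (r"600[0-5]", "compile and run libsci_acc apps"),
    (r"7[0-7]0[0-3]", "compile and run basic CUDA tools"),
    (r"8[0-7]00", "compile CUDA tools on a specified node"),
    (r"900[1-4]", "run scientific applications"),
]

def categorize(testid: str) -> str:
    """Return the check category for a TESTID, or raise if it matches none."""
    for pattern, category in TEST_CATEGORIES:
        if re.fullmatch(pattern, testid):
            return category
    raise ValueError(f"unknown TESTID: {testid}")

if __name__ == "__main__":
    for tid in ("50000", "6003", "7101", "8200", "9004"):
        print(tid, "->", categorize(tid))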

(18)

Custom options

-m: Run after system downtime (planned or unplanned). Requires a reservation to check all nodes that are up. Runs admin + application checks.

-p: Run after installation of a monthly PE release and/or any patches obtained from Cray. Tests the development environment.

-c: Run a special suite of tests on a node suspected to be in a bad state.

-a: Run a special suite of production applications to check for performance regressions. Applications are run concurrently to simulate a production environment, and performance is measured.

An illustrative command-line sketch of these options follows.
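These options map naturally onto a small command-line interface. This is a hypothetical argparse sketch: the option letters and their semantics are from the slide, while the program structure and help strings are assumptions.

import argparse

# Hypothetical CLI skeleton for the options described above; the actual
# CSCS regression suite may be structured quite differently.
parser = argparse.ArgumentParser(description="regression suite driver (sketch)")
parser.add_argument("-m", action="store_true",
                    help="after downtime: reserve and check all up nodes, "
                         "run admin + application checks")
parser.add_argument("-p", action="store_true",
                    help="after a monthly PE release and/or Cray patches: "
                         "test the development environment")
parser.add_argument("-c", metavar="NODE",
                    help="run a special suite of tests on a node suspected "
                         "to be in a bad state")
parser.add_argument("-a", action="store_true",
                    help="run production applications concurrently and "
                         "measure performance")
args = parser.parse_args()
print(args)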

(19)


Sample Output (1)

Running admin checks (PASSED)

[Screenshot: regression output annotated with the user who submitted the regression, the start and end dates, the login node where the regression suite was launched, the location where the output and the log are stored, and the test result (passed)]

(20)

Sample Output (2)

Possible next steps depend on the type of failed test(s), and the decision to resume services depends on the criticality of the failure(s).

(21)

LogStash—An Open Source Log Management Tool (http://logstash.net)

(22)

System Alerts & Monitoring—Homegrown Examples

 Network snapshot (switches, link info, subnet manager)

 Change logs and alerts

 Error monitoring, for instance, symbol errors indicate cable issues (a monitoring sketch follows this list)

 Pros: effective and instantaneous detection, and possible correction, of errors

 Cons: level of abstraction, for example, GPFS may have an issue while network logs are clean
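As an illustration of the symbol-error monitoring idea, here is a minimal sketch that compares two snapshots of per-port counters and flags any port whose SymbolErrorCounter grew, a typical sign of a failing cable. The snapshot format and file names are assumptions; a real deployment would parse the output of the InfiniBand diagnostic tools instead.

# Minimal sketch: compare two snapshots of per-port counters and flag
# ports whose symbol error count increased. Format and paths are assumed.

def read_counters(path: str) -> dict[str, int]:
    """Parse lines of the assumed form '<switch>/<port> <symbol_errors>'."""
    counters = {}
    with open(path) as f:
        for line in f:
            port, value = line.split()
            counters[port] = int(value)
    return counters

def symbol_error_alerts(previous: dict[str, int],
                        current: dict[str, int]) -> list[str]:
    """Report every port whose symbol error count grew between snapshots."""
    return [
        f"ALERT {port}: symbol errors {previous.get(port, 0)} -> {count}"
        for port, count in sorted(current.items())
        if count > previous.get(port, 0)
    ]

if __name__ == "__main__":
    alerts = symbol_error_alerts(read_counters("counters.prev"),
                                 read_counters("counters.curr"))
    for alert in alerts:
        print(alert)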

(23)

Holistic Approach to Resiliency

HPC-focused list of systems to incentivize robust design solutions

[Graphic: Top500 / 5 best]

(24)

Proposed Guidelines

 Zero to minimal overhead for making a submission

 Metrics (TBD; a sketch of two of these metrics follows this list):

 Data collection and reporting for Top500 runs

 Uptime

 Failure classification (known vs. unknown)

 Self-healing vs. intervention, i.e. unscheduled maintenance

 Known errors database (KEDB) and/or checklist

 Faster workaround & resumption of service to users

 Knowledge sharing

 Ganglia and Nagios integration and completeness (main system, ecosystem, file system)
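Two of the proposed metrics, uptime and known-vs-unknown failure classification against a KEDB, can be made concrete with a small sketch. The data structures, the sample KEDB entries, and the numbers are all hypothetical; the metrics themselves remain TBD, as stated above.

from dataclasses import dataclass

# Hypothetical data model for the metrics discussed above; neither the
# KEDB entries nor the numbers below are real CSCS data.
@dataclass
class Failure:
    description: str
    downtime_hours: float
    self_healed: bool

KEDB = {"GPFS quorum loss", "IB symbol errors", "node PSU failure"}  # invented entries

def uptime_percent(period_hours: float, failures: list[Failure]) -> float:
    """Uptime = (period - total downtime) / period, as a percentage."""
    down = sum(f.downtime_hours for f in failures)
    return 100.0 * (period_hours - down) / period_hours

def classify(failures: list[Failure]) -> dict[str, int]:
    """Count known vs. unknown failures and self-healing vs. intervention."""
    counts = {"known": 0, "unknown": 0, "self-healed": 0, "intervention": 0}
    for f in failures:
        counts["known" if f.description in KEDB else "unknown"] += 1
        counts["self-healed" if f.self_healed else "intervention"] += 1
    return counts

if __name__ == "__main__":
    log = [Failure("GPFS quorum loss", 2.5, False),
           Failure("unexplained node hang", 0.5, True)]
    print(f"uptime over 30 days: {uptime_percent(30 * 24, log):.3f}%")
    print(classify(log))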

(25)

Leveraging Other Approaches & Solutions

(26)

Leveraging Other Approaches & Solutions

 Horizon: dashboard & portal

 Nova: cloud compute

 Neutron: cloud network

 Cinder: block storage

 Keystone: identity management

 Glance: image management

(27)

Resilience for OpenStack Deployments

 High Availability (HA) presentations and papers on OpenStack (mainly at the OpenStack Summits)

 Resiliency and Performance Engineering for OpenStack at Enterprise Scale by Mirantis

 Checklist ~ regression (a health-probe sketch follows at the end of this slide)

 Wrecking crew

 More Reliable, More Resilient, More Redundant: High Availability Update for Grizzly and Beyond by Hastexo

 Layer-by-layer & component-by-component analysis ~ hierarchical approach & customer & service provider model for regression

 NovaResiliency

 Component specific draft

 Ubuntu OpenStack HA guideline

 ...


"Everything fails, all the time."

(28)

Key Considerations and Next Steps

Commissioning resilient services at a data center

 Cultural issues

 Overhead or productive work by users & operational staff

 Focus on documentation and knowledge sharing

 Putting best practices to work, e.g. change management tools

 Investments & cost effectiveness

 Staffing considerations

 Metrics for cost effectiveness and SLAs to be confirmed by stakeholders

Community driven efforts for HPC data center resiliency

 Share your stories (happy endings plus horror stories)

 Feedback on Robust505

(29)

Acknowledgements

Several members of staff at CSCS, specifically:

 Network (Chris Gamboni)

 Storage (Roberto Aielli, Stefano Gorini)

 Regression suite (Tim Robinson, Gabriella Ceci)

 User lab (Maria Grazia Giuffreda)

 CSCS on call (Carmelo Ponti)

 Review (Nick Cardo)

(30)

References
