• No results found

Comet - High performance virtual clusters to support the long-tail of science.

N/A
N/A
Protected

Academic year: 2021

Share "Comet - High performance virtual clusters to support the long-tail of science."

Copied!
44
0
0

Loading.... (view fulltext now)

Full text

(1)

Comet - High performance virtual clusters to

support the long-tail of science.

Philip M. Papadopoulos, Ph.D.

San Diego Supercomputer Center

California Institute for telecommunications and Information

Technologies (Calit2)

(2)
(3)

Comet is funded by U.S. National

Science Foundation

“… expand the use of high end resources to a

much larger and more diverse community

… support the entire spectrum of NSF

communities

... promote a more comprehensive and

balanced portfolio

… include research communities that are not

users of traditional HPC systems.“

(4)

Jobs and SUs at various scales across

NSF resources

0

500

1000

1500

2000

2500

3000

0%

10%

20%

30%

40%

50%

60%

70%

80%

90%

100%

1

2

4

8

16 32 64 128 256 512 1K 2K 4K 8K 16K

Millio

n

s

of XD

SUs Ch

ar

ge

d

Fr

actio

n

of All

Jo

b

s

Char

ge

d

in

2012

Job Size (Cores)

Percentage of Jobs (Left Axis)

SUs Charged (Right Axis)

One node

99% of jobs run on

NSF’s HPC

resources in 2012

used < 2048 cores

And consumed

~50% of the total

core-hours across

NSF resources

Job Size (Cores)

Cum

ulat

iv

e

(5)
(6)

Comet: System Characteristics

Available 1Q 2015

Total flops ~1.9 PF (AVX2)

Dell primary integrator

• Intel next-gen processors, former

codename Haswell, with AVX2

• Aeon storage vendor

• Mellanox FDR InfiniBand

Standard compute nodes

• Dual 12-core Haswell processors

• 128 GB DDR4 DRAM (64

GB/socket!)

• 320 GB SSD (local scratch, VMs)

GPU nodes

• Four NVIDIA GPUs/node

Large-memory nodes

(2Q 2015)

Hybrid fat-tree topology

• FDR (56 Gbps) InfiniBand

• Rack-level (72 nodes) full bisection

bandwidth

• 4:1 oversubscription inter-rack

Performance Storage

• 7 PB, 200 GB/s

• Scratch & Persistent Storage

Durable Storage (reliability)

• 6 PB, 100 GB/s

Gateway hosting nodes and VM

image repository

100 Gbps external connectivity to

Internet2 & ESNet

(7)

Comet Architecture

Juniper 100 Gbps Arista 40GbE (2x) Data Mover (4x)

R&E Network Access Data Movers

Internet 2

7x 36-port FDR in each rack wired as full fat-tree. 4:1 over

subscription between racks.

72 HSWL 320 GB IB Core (2x) N GPU 4 Large-Memory Bridge (4x)

Performance Storage Durable Storage

Arista 40GbE (2x) N racks FDR 36p FDR 36p 64 128 18 72 HSWL 320 GB 72 HSWL 72 72 18 Mid-tier

Additional Support Components

(not shown for clarity)

NFS Servers, Virtual Image Repository, Gateway/Portal Hosting Nodes, Login Nodes, Ethernet Management Network, Rocks Management Nodes

Node-Local Storage 18 72 FDR FDR FDR 40GbE 40GbE 10GbE

(8)

Design Decision - Network

Each Rack

• 72 nodes (144 CPUs, 1728 Cores)

• Fully-connected FDR IB (2-level Clos-Topology)

• 144 In-rack Cables

• 4:1 oversubscription between racks

• 18 inter-rack cables

Supports the large majority of jobs with no

performance degradation

3-level network for complete cluster

• worst-case latency similar to a much smaller cluster

(9)

SSDs – building on Gordon success

Based on our experiences with Gordon, a number of

applications will benefits from continued access to flash

Applications that generate large numbers of temp files

• Computational finance – analysis of multiple markets (NASDAQ, etc.)

• Text analytics – word correlations in Google Ngram data

Computational chemistry codes that write one- and

two-electron integral files to scratch

Structural mechanics codes (e.g. Abaqus), which

(10)

Large memory nodes

While most user applications will run well on the standard

compute nodes, a few domains will benefit from the large

memory (1.5 TB nodes)

De novo genome assembly: ALLPATHS-LG,

SOAPdenovo, Velvet

Finite-element calculations: Abaqus

(11)

GPU nodes

Comet’s GPU nodes will serve a number of domains

Molecular dynamics applications have been one of the

biggest GPU success stories. Packages include Amber,

CHARMM, Gromacs and NAMD

Applications that depend heavily on linear algebra

(12)

Key Comet Strategies

Target modest-scale users and new users/communities:

goal of

10,000 users/year

!

Support

capacity computing

, with a system optimized for

small/modest-scale jobs and quicker resource response

using allocation/scheduling policies

Build upon and expand efforts with

Science Gateways

,

encouraging gateway usage and hosting via software

and operating policies

Provide a

virtualized environment

to support

development of customized software stacks, virtual

environments, and project control of workspaces

(13)

Comet will serve a large number of users,

including new communities/disciplines

Allocations/scheduling policies to optimize for high throughput of

many modest-scale jobs (leveraging Trestles experience)

• Optimized for rack-level jobs but cross-rack jobs feasible

• Optimized for throughput (ala Trestles)

• Per-project allocations caps to ensure large numbers of users

• Rapid access for start-ups with one-day account generation

• Limits on job sizes, with possibility of exceptions

Gateway-friendly environment: Science gateways reach large

communities w/ easy user access

• e.g. CIPRES gateway alone currently accounts for ~25% of all users of NSF

resources, with 3,000 new users/year and ~5,000 users/year

(14)

Changing the face of XSEDE HPC users

System design and policies

• Allocations, scheduling and security policies which favor gateways

• Support gateway middleware and gateway hosting machines

• Customized environments with high-performance virtualization

• Flexible allocations for bursty usage patterns

• Shared node runs for small jobs, user-settable reservations

• Third party apps

Leverage and augment investments elsewhere

• FutureGrid experience, image packaging, training, on-ramp

• XSEDE (ECSS NIP & Gateways, TEOS, Campus Champions)

• Build off established successes supporting new communities

• Example-based documentation in Comet focus areas

(15)

Virtualization Environment

Leveraging expertise of Indiana U/ FutureGrid team

VM jobs scheduled just like batch jobs (not conventional

cloud environment with immediate elastic access)

VMs will be easy on-ramp for new users/communities,

including low porting time

Flexible software environments for new communities and

apps

VM repository/library

Virtual HPC cluster (multi-node) with near-native IB

latency and minimal overhead (SRIOV)

(16)

Single Root I/O Virtualization in HPC

Problem

: complex workflows demand

increasing flexibility from HPC platforms

• Pro: Virtualization

flexibility

• Con: Virtualization

IO performance loss

(e.g., excessive DMA interrupts)

Solution

: SR-IOV and Mellanox ConnectX-3

InfiniBand HCAs

• One physical function (PF)

multiple

virtual functions (VF), each with own DMA

streams, memory space, interrupts

(17)

More on Single Root IO Virtualization

PCIe is the I/O bus of modern x86 servers

• The I/O controller is integrated into microprocessors

• The I/O complex is called single-root if there is only one

controller for the bus

I/O Virtualization

• Allow a single I/O device (e.g. Network, Disk Controller) to

appear as multiple

independent

I/O devices

• These are called virtual functions

(18)

Benchmark comparisons of SR-IOV Cluster v AWS

(pre-Haswell) Hardware/Software Configuration

Native, SR-IOV

Amazon EC2

Platform

Rocks 6.1 (EL6)

Virtualization via

kvm

Amazon Linux 2013.03 (EL6)

cc2.8xlarge

Instances

CPUs

2x Xeon E5-2660 (2.2GHz)

16 cores per node

2x Xeon E5-2670 (2.6GHz)

16 cores per node

RAM

64 GB DDR3 DRAM

60.5 DDR3 DRAM

Interconnect

QDR4X InfiniBand

Mellanox ConnectX-3 (MT27500)

Intel VT-d, SR-IOV enabled in

firmware, kernel, drivers

mlx4_core 1.1

Mellanox OFED 2.0

HCA firmware 2.11.1192

10 GbE

(19)

SRIOV Latency approaches Native HW

19

SR-IOV

< 30% overhead for

Messages < 128 bytes

< 10% overhead for

eager send/recv

Overhead

0% for

bandwidth-limited

regime

Amazon EC2

> 5000% worse latency

Time dependent (noisy)

(20)

Bandwidth Unimpaired Native vs.

SRIOV

20

SR-IOV

< 2% bandwidth loss

over entire range

> 95% peak bandwidth

Amazon EC2

< 35% peak bandwidth

900% to 2500% worse

bandwidth than

virtualized InfiniBand

(21)

Weather Modeling – 15% Overhead

96-core (6-node)

calculation

Nearest-neighbor

communication

Scalable algorithms

SR-IOV incurs modest

(15%) performance hit

...but still still 20%

faster

***

than Amazon

(22)

Quantum ESPRESSO: 28% overhead

48-core (3 node)

calculation

CG matrix inversion

(irregular comm.)

3D FFT matrix

transposes (All-to-all

communication)

28% slower w/ SR-IOV

SR-IOV still > 500%

faster

***

than EC2

Quantum Espresso 5.0.2 – DEISA AUSURF112 benchmark *** 20% faster despite SR-IOV cluster having 20% slower CPUs

(23)

If bandwidth unimpaired why Falloff for

App performance

Latency.

• Measured microbenchmark for latency shows 10-30%

overhead.

• However, this is variable. Can be as much as 100%.

• Why?

• Hypervisor/Physical node scheduling.

(24)

SR-IOV is a huge step forward in

high-performance virtualization

Shows substantial improvement in latency over Amazon

EC2, and it provides nearly zero bandwidth overhead

Benchmark application performance confirms significant

improvement over EC2

SR-IOV lowers performance barrier to virtualizing the

interconnect and makes fully virtualized HPC clusters

viable

Comet will deliver virtualized HPC to new/non-traditional

communities that need flexibility without major loss of

performance

(25)

High-Performance Virtualization on Comet

Mellanox FDR InfiniBand HCAs with SR-IOV

Rocks and KVM to manage Virtual Machines and

Clusters

Flexibility to support complex science gateways and

web-based workflow engines

• Custom compute appliances and virtual clusters developed with

FutureGrid and their existing expertise

(26)

Virtual Clusters – Overlay Physical Cluster with

User-Owned High Performance Clusters

Virtual Cluster 1

(27)

Virtual Cluster Characteristics

User-Owned and Defined

• Looks like bare metal cluster to user

Low overhead (latency and bandwidth) for a

virtualized Infiniband interface

• Single Root IO Virtualization

Schedule compatibility with standard HPC batch

jobs

• A node of virtual cluster is a virtual machine. That VM

looks like one element of a parallel job to the scheduler

Persistence of Disk State of virtual nodes across

(28)

Some Interesting Logistics for Virtual

Clusters

Scheduling

• Can we co-schedule virtual cluster nodes with regular

HPC jobs?

How do we efficiently handle disk images the

(29)
(30)

Virtual Cluster Anatomy

Public

network

Virtual

frontend

container

Virtual

Frontend

Generic

compute

nodes

Private

network

10GB

Virtual

Frontend

Virtual

Compute

Private network segmentation:

10GB using VLAN

Infiniband PKEY

Infiniband

Virtual

Compute

Virtual

Compute

VLAN NNN VLAN NNN VLAN NNN VLAN MMM VLAN MMM PKEY JJJJ PKEY JJJJ PKEY JJJJ PKEY LLLL PKEY LLLL

(31)

VM Disk Management

Each VM gets a 36 GB disk (Small SCSI)

Disk images are persistent through reboots

Two central NASes store all disk images

VMs can be allocated on all compute nodes

dependent on availability (scheduler)

Two solutions:

o

iSCSI (Network mounted disk)

(32)

Virtual compute-x

VM Disk Management iSCSI

NAS

Compute

nodes

Targets

iqn.2001-04.com.nas-0-0-vm-compute-x

This is what OpenStack Supports

Big Issue: Bandwidth Bottleneck at

(33)

A hybrid solution via replication

I

nitial boot of any cluster node uses an iSCSI

disk(Call this a node disk) on the NAS

During normal operation, Comet moves a node disk

to the physical host that is running the node VM.

And then disconnects from the NAS

o

All Node disk operation is local to the physical host

o

Fundamentally enables scale out w/o a $1M NAS

At Shutdown, any changes made to the node disk

(now on the physical host) are migrated back to the

NAS, ready for next boot

(34)

1.a Init Disk

NAS

Compute

nodes

Virtual

compute-x

Targets

iqn.2001-04.com.nas-0-0-vm-compute-x Replicate Disk

iSCSI mount on NAS enables virtual

compute node to boot immediately.

Read operations from NAS

(35)

1.b Init Disk

NAS

Compute

nodes

Virtual compute-x

Targets

During boot, the disk image on the

NAS is migrated to the physical

host.

Read-only and read/write are

then merged into one local

disk

(36)

2. Steady State

NAS

Compute

nodes

Virtual

compute-x

Targets

During normal operation

Node disk is snapshot

Incremental snapshots sent to NAS

(replicate back to NAS)

Timing/load/experiment will tell us how

often we can do this

(37)

3. Release Disk

NAS

Compute

nodes

Virtual

compute-x

Targets Power off

At shutdown, any unsynched

changes are send back to NAS

When the last snapshot is

sent, the Virtual compute

node can be rebooted on

another system

(38)

Software to implement is under

development

Rocks Roll so that it can be a part of any

physical Rocks-defined cluster

https://github.com/rocksclusters/img-storage-roll

• Uses ZFS for disk image storage on NAS and hosting

nodes

http://zfsonlinux.org

• RabbitMQ – AMQP (Asynchronous Message Q

Protocol)

http://www.rabbitmq.com/

• Pika - library for communication with RabbitMQ from

Python

(39)

Full Virtualization isn’t the only

choice: Containers

Comet supports fully-virtualized clusters

• OS of cluster can be almost anything: Windows,

Linux, Solaris, (not Mac-OS)

Containers are a different way to virtualize

the file system and some other elements

• Network, inter-process communication

• In Solaris for more than a decade

• Newly popular in Linux with the

docker

project

• Containers must run the same kernel as the host

operating system

(40)

Full vs. Container Virtualization

Physical Host

Kernel

HW

Virtual Host

Virtual Host

Physical Network

Physical Host

Kernel

partial /proc

Container 1

Container 2

Physical Network

N

etw

o

rk

B

ri

d

g

e

Full Virtualization (KVM)

Container Virtualization

(Docker)

Con

tain

ers

Inh

erits

th

e

Host

kernel

and

element

s

of

its

hardw

are.

Need

cgrou

ps

to

limit

con

tain

er

mem

ory

/cpu

usag

e

Fu

lly

-V

ir

tu

al

iz

ed

hav

e

ind

epen

den

t

kernels/OS.

Hardw

are

is

“un

iv

ersa

l”

.

Mem

ory

/CPU

de

fin

ed

by

de

f

of

v

irtu

al

HW

(41)

Still need to Configure Virtual Systems

Why Docker (container) virtualization popular

• Very space efficient if the changes to the base OS file

system are small. Changes can be 100’s of megabytes

instead of Gbytes

• Network (latency) performance and/or network topology is

less important (topology needed for virtual clusters)

• Given how quickly (and relatively lightweight) Docker brings up

virtual environments, this will be addressed

A system is DEFINED by the contents of the file

system

• System libraries, application code, configuration

(42)

Running: Almost the same

Rocks-created server

running in Docker

Container

Notice: uptime, processor

and date created

rocks-created server

running

~

(43)

Summary

Comet: HPC for the 99%

Expand capability to enable virtual clusters

• Not generic cloud computing

Advances in virtualization techniques continue

• Comet will support fully-virtualized clusters

• Container-based virtualized (Docker) become quite

popular

(44)

References

Related documents