Comet - High performance virtual clusters to support the long-tail of science.

(1)


Philip M. Papadopoulos, Ph.D.

San Diego Supercomputer Center

California Institute for Telecommunications and Information Technology (Calit2)

(2)

(3)

Comet is funded by the U.S. National Science Foundation

• “… expand the use of high end resources to a much larger and more diverse community
• … support the entire spectrum of NSF communities
• ... promote a more comprehensive and balanced portfolio
• … include research communities that are not users of traditional HPC systems.”

(4)

Jobs and SUs at various scales across NSF resources

[Figure: Percentage of Jobs (left axis, cumulative fraction of all jobs charged in 2012, 0% to 100%) and SUs Charged (right axis, millions of XD SUs charged, 0 to 3000) versus Job Size (Cores), from 1 to 16K. The "One node" job size is marked.]

• 99% of jobs run on NSF’s HPC resources in 2012 used < 2048 cores
• And consumed ~50% of the total core-hours across NSF resources

(5)

(6)

Comet: System Characteristics

• Available 1Q 2015
• Total flops ~1.9 PF (AVX2)
• Dell primary integrator
• Intel next-gen processors, former codename Haswell, with AVX2
• Aeon storage vendor
• Mellanox FDR InfiniBand
• Standard compute nodes
  • Dual 12-core Haswell processors
  • 128 GB DDR4 DRAM (64 GB/socket!)
  • 320 GB SSD (local scratch, VMs)
• GPU nodes
  • Four NVIDIA GPUs/node
• Large-memory nodes (2Q 2015)
• Hybrid fat-tree topology
  • FDR (56 Gbps) InfiniBand
  • Rack-level (72 nodes) full bisection bandwidth
  • 4:1 oversubscription inter-rack
• Performance Storage
  • 7 PB, 200 GB/s
  • Scratch & Persistent Storage
• Durable Storage (reliability)
  • 6 PB, 100 GB/s
• Gateway hosting nodes and VM image repository
• 100 Gbps external connectivity to Internet2 & ESNet

(7)

Comet Architecture

[Architecture diagram. Recoverable labels: Internet 2; R&E Network Access; Juniper 100 Gbps; Arista 40GbE (2x); Data Movers (4x); IB Core (2x); Bridge (4x); Mid-tier; FDR 36-port switches; N racks of 72 HSWL nodes, each with 320 GB Node-Local Storage; N GPU; 4 Large-Memory; Performance Storage; Durable Storage; link types FDR, 40GbE, 10GbE.]

7x 36-port FDR in each rack wired as full fat-tree. 4:1 oversubscription between racks.

Additional Support Components (not shown for clarity): NFS Servers, Virtual Image Repository, Gateway/Portal Hosting Nodes, Login Nodes, Ethernet Management Network, Rocks Management Nodes

(8)

Design Decision - Network

• Each Rack

• 72 nodes (144 CPUs, 1728 Cores)

• Fully-connected FDR IB (2-level Clos-Topology)

• 144 In-rack Cables

• 4:1 oversubscription between racks

• 18 inter-rack cables

• Supports the large majority of jobs with no performance degradation

• 3-level network for complete cluster

• worst-case latency similar to a much smaller cluster
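The per-rack figures on this slide are mutually consistent, which a short calculation can confirm. The sketch below uses only numbers quoted in the deck (72 nodes per rack, dual 12-core sockets, 18 inter-rack cables). The cable breakdown is our assumption, not something the slide states: one node-to-leaf link per node plus an equal number of leaf-to-spine uplinks, which is what full bisection inside the rack would require.

```python
# Sanity check of the per-rack numbers quoted on the slide.
NODES_PER_RACK = 72
SOCKETS_PER_NODE = 2
CORES_PER_SOCKET = 12
INTER_RACK_CABLES = 18


def rack_summary():
    """Derive CPU, core, cable and oversubscription figures for one rack."""
    cpus = NODES_PER_RACK * SOCKETS_PER_NODE
    cores = cpus * CORES_PER_SOCKET
    # Assumption: full bisection in the rack means one leaf uplink per node link.
    node_links = NODES_PER_RACK
    leaf_uplinks = NODES_PER_RACK
    in_rack_cables = node_links + leaf_uplinks
    # Ratio of node-facing links to links leaving the rack.
    oversubscription = NODES_PER_RACK / INTER_RACK_CABLES
    return {
        "cpus": cpus,
        "cores": cores,
        "in_rack_cables": in_rack_cables,
        "oversubscription": oversubscription,
    }


if __name__ == "__main__":
    for key, value in rack_summary().items():
        print(f"{key}: {value}")
```

Under that assumption the 144 in-rack cables and the 4:1 inter-rack ratio both follow directly from 72 nodes and 18 inter-rack cables.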

(9)

SSDs – building on Gordon success

Based on our experiences with Gordon, a number of applications will benefit from continued access to flash

• Applications that generate large numbers of temp files
• Computational finance – analysis of multiple markets (NASDAQ, etc.)
• Text analytics – word correlations in Google Ngram data
• Computational chemistry codes that write one- and two-electron integral files to scratch

• Structural mechanics codes (e.g. Abaqus), which

(10)

Large memory nodes

While most user applications will run well on the standard compute nodes, a few domains will benefit from the large-memory (1.5 TB) nodes

• De novo genome assembly: ALLPATHS-LG, SOAPdenovo, Velvet

• Finite-element calculations: Abaqus

(11)

GPU nodes

Comet’s GPU nodes will serve a number of domains

• Molecular dynamics applications have been one of the biggest GPU success stories. Packages include Amber, CHARMM, Gromacs and NAMD

• Applications that depend heavily on linear algebra

(12)

Key Comet Strategies

• Target modest-scale users and new users/communities: goal of 10,000 users/year!
• Support capacity computing, with a system optimized for small/modest-scale jobs and quicker resource response using allocation/scheduling policies
• Build upon and expand efforts with Science Gateways, encouraging gateway usage and hosting via software and operating policies
• Provide a virtualized environment to support development of customized software stacks, virtual environments, and project control of workspaces

(13)

Comet will serve a large number of users, including new communities/disciplines

• Allocations/scheduling policies to optimize for high throughput of many modest-scale jobs (leveraging Trestles experience)
• Optimized for rack-level jobs but cross-rack jobs feasible
• Optimized for throughput (à la Trestles)
• Per-project allocation caps to ensure large numbers of users
• Rapid access for start-ups with one-day account generation
• Limits on job sizes, with possibility of exceptions
• Gateway-friendly environment: Science gateways reach large communities w/ easy user access
• e.g. CIPRES gateway alone currently accounts for ~25% of all users of NSF resources, with 3,000 new users/year and ~5,000 users/year

(14)

Changing the face of XSEDE HPC users

• System design and policies

• Allocations, scheduling and security policies which favor gateways

• Support gateway middleware and gateway hosting machines

• Customized environments with high-performance virtualization

• Flexible allocations for bursty usage patterns

• Shared node runs for small jobs, user-settable reservations

• Third party apps

• Leverage and augment investments elsewhere

• FutureGrid experience, image packaging, training, on-ramp

• XSEDE (ECSS NIP & Gateways, TEOS, Campus Champions)

• Build off established successes supporting new communities

• Example-based documentation in Comet focus areas

(15)

Virtualization Environment

• Leveraging expertise of Indiana U/FutureGrid team
• VM jobs scheduled just like batch jobs (not conventional cloud environment with immediate elastic access)
• VMs will be easy on-ramp for new users/communities, including low porting time
• Flexible software environments for new communities and apps
• VM repository/library
• Virtual HPC cluster (multi-node) with near-native IB latency and minimal overhead (SR-IOV)

(16)

Single Root I/O Virtualization in HPC

• Problem: complex workflows demand increasing flexibility from HPC platforms
  • Pro: Virtualization → flexibility
  • Con: Virtualization → IO performance loss (e.g., excessive DMA interrupts)
• Solution: SR-IOV and Mellanox ConnectX-3 InfiniBand HCAs
  • One physical function (PF) → multiple virtual functions (VF), each with own DMA streams, memory space, interrupts

(17)

More on Single Root IO Virtualization

• PCIe is the I/O bus of modern x86 servers
• The I/O controller is integrated into microprocessors
• The I/O complex is called single-root if there is only one controller for the bus
• I/O Virtualization
  • Allow a single I/O device (e.g. Network, Disk Controller) to appear as multiple independent I/O devices
  • These are called virtual functions
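On Linux, the physical-function/virtual-function split described above is visible through the generic PCI sysfs attributes `sriov_totalvfs` (maximum VFs a device supports) and `sriov_numvfs` (VFs currently enabled). The sketch below only reads those attributes. The function name and the idea of passing the sysfs root as a parameter are ours, for illustration, and nothing here is taken from Comet's actual configuration; some drivers, including older `mlx4_core` releases, set the VF count through module parameters instead.

```python
from pathlib import Path


def list_sriov_devices(sysfs_root="/sys/bus/pci/devices"):
    """Return {pci_address: (enabled_vfs, max_vfs)} for SR-IOV capable devices.

    Only devices that expose both sriov_numvfs and sriov_totalvfs are
    reported; everything else is skipped.
    """
    result = {}
    root = Path(sysfs_root)
    if not root.is_dir():
        return result
    for dev in sorted(root.iterdir()):
        total = dev / "sriov_totalvfs"
        num = dev / "sriov_numvfs"
        if total.is_file() and num.is_file():
            result[dev.name] = (int(num.read_text()), int(total.read_text()))
    return result


if __name__ == "__main__":
    for address, (enabled, maximum) in list_sriov_devices().items():
        print(f"{address}: {enabled} of {maximum} virtual functions enabled")
```

On a host without SR-IOV capable hardware the function simply returns an empty dictionary.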

(18)

Benchmark comparisons of SR-IOV Cluster vs. AWS (pre-Haswell): Hardware/Software Configuration

Platform
• Native, SR-IOV: Rocks 6.1 (EL6); virtualization via KVM
• Amazon EC2: Amazon Linux 2013.03 (EL6); cc2.8xlarge instances

CPUs
• Native, SR-IOV: 2x Xeon E5-2660 (2.2 GHz); 16 cores per node
• Amazon EC2: 2x Xeon E5-2670 (2.6 GHz); 16 cores per node

RAM
• Native, SR-IOV: 64 GB DDR3 DRAM
• Amazon EC2: 60.5 GB DDR3 DRAM

Interconnect
• Native, SR-IOV: QDR 4X InfiniBand; Mellanox ConnectX-3 (MT27500); Intel VT-d, SR-IOV enabled in firmware, kernel, drivers; mlx4_core 1.1; Mellanox OFED 2.0; HCA firmware 2.11.1192
• Amazon EC2: 10 GbE

(19)

SR-IOV Latency approaches Native HW

• SR-IOV
  • < 30% overhead for messages < 128 bytes
  • < 10% overhead for eager send/recv
  • Overhead → 0% for bandwidth-limited regime
• Amazon EC2
  • > 5000% worse latency
  • Time dependent (noisy)
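One way to see why the overhead shrinks as messages grow is a simple "latency plus size over bandwidth" cost model. This model is our assumption, not the methodology behind the measurements on the slide. The 30% latency penalty and 2% bandwidth loss are taken from the upper bounds quoted in the deck; the native latency and bandwidth defaults are illustrative placeholders, not measured values.

```python
def transfer_time(nbytes, latency_s, bandwidth_bytes_per_s):
    """Time to move one message under a latency + size/bandwidth model."""
    return latency_s + nbytes / bandwidth_bytes_per_s


def sriov_overhead(nbytes,
                   native_latency_s=1.5e-6,   # placeholder, not a measurement
                   native_bandwidth=3.0e9,    # placeholder, bytes per second
                   latency_penalty=0.30,      # upper bound quoted for small messages
                   bandwidth_loss=0.02):      # upper bound quoted for bandwidth
    """Relative slowdown of the virtualized path versus native for one message."""
    native = transfer_time(nbytes, native_latency_s, native_bandwidth)
    virtual = transfer_time(nbytes,
                            native_latency_s * (1.0 + latency_penalty),
                            native_bandwidth * (1.0 - bandwidth_loss))
    return (virtual - native) / native


if __name__ == "__main__":
    for size in (8, 128, 4096, 1 << 20, 1 << 26):
        print(f"{size:>10d} bytes: {100 * sriov_overhead(size):5.1f}% overhead")
```

In this model the overhead starts at the latency penalty for tiny messages and falls toward the (much smaller) bandwidth loss as the size term dominates.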

(20)

Bandwidth Unimpaired: Native vs. SR-IOV

• SR-IOV
  • < 2% bandwidth loss over entire range
  • > 95% peak bandwidth
• Amazon EC2
  • < 35% peak bandwidth
  • 900% to 2500% worse bandwidth than virtualized InfiniBand

(21)

Weather Modeling – 15% Overhead

• 96-core (6-node) calculation
• Nearest-neighbor communication
• Scalable algorithms
• SR-IOV incurs modest (15%) performance hit
• ...but still 20% faster*** than Amazon

(22)

Quantum ESPRESSO: 28% overhead

• 48-core (3-node) calculation
• CG matrix inversion (irregular comm.)
• 3D FFT matrix transposes (all-to-all communication)
• 28% slower w/ SR-IOV
• SR-IOV still > 500% faster*** than EC2

Quantum ESPRESSO 5.0.2 – DEISA AUSURF112 benchmark
*** 20% faster despite SR-IOV cluster having 20% slower CPUs

(23)

If bandwidth is unimpaired, why the falloff in app performance?

• Latency.
  • Measured microbenchmark for latency shows 10-30% overhead.
  • However, this is variable. Can be as much as 100%.
• Why?
  • Hypervisor/physical node scheduling.

(24)

SR-IOV is a huge step forward in high-performance virtualization

• Shows substantial improvement in latency over Amazon EC2, and it provides nearly zero bandwidth overhead
• Benchmark application performance confirms significant improvement over EC2
• SR-IOV lowers performance barrier to virtualizing the interconnect and makes fully virtualized HPC clusters viable
• Comet will deliver virtualized HPC to new/non-traditional communities that need flexibility without major loss of performance

(25)

High-Performance Virtualization on Comet

• Mellanox FDR InfiniBand HCAs with SR-IOV
• Rocks and KVM to manage Virtual Machines and Clusters
• Flexibility to support complex science gateways and web-based workflow engines
• Custom compute appliances and virtual clusters developed with FutureGrid and their existing expertise

(26)

Virtual Clusters – Overlay Physical Cluster with User-Owned High Performance Clusters

Virtual Cluster 1

(27)

Virtual Cluster Characteristics

• User-Owned and Defined
• Looks like bare metal cluster to user
• Low overhead (latency and bandwidth) for a virtualized InfiniBand interface
• Single Root IO Virtualization
• Schedule compatibility with standard HPC batch jobs
• A node of a virtual cluster is a virtual machine. That VM looks like one element of a parallel job to the scheduler

• Persistence of Disk State of virtual nodes across

(28)

Some Interesting Logistics for Virtual Clusters

• Scheduling
• Can we co-schedule virtual cluster nodes with regular HPC jobs?

• How do we efficiently handle disk images the

(29)

(30)

Virtual Cluster Anatomy

[Diagram. Recoverable labels: Public network; Virtual frontend container holding Virtual Frontends; Generic compute nodes holding Virtual Compute nodes; Private network (10 GbE); InfiniBand. Each virtual cluster's nodes share one VLAN tag (VLAN NNN, VLAN MMM) and one InfiniBand partition key (PKEY JJJJ, PKEY LLLL).]

Private network segmentation:
● 10 GbE using VLAN
● InfiniBand PKEY
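The segmentation scheme in the diagram gives every virtual cluster its own VLAN on the Ethernet side and its own partition key on the InfiniBand side. A minimal sketch of that bookkeeping follows. The class name, starting values, and allocation order are hypothetical and are not taken from the Comet software. The numeric limits are standard: 802.1Q VLAN IDs run from 1 to 4094, and an InfiniBand PKEY is a 16-bit value whose top bit marks full membership, with 0x7FFF reserved as the default partition.

```python
class SegmentAllocator:
    """Hand out one (VLAN ID, InfiniBand PKEY) pair per virtual cluster."""

    MAX_VLAN = 4094          # highest usable 802.1Q VLAN ID
    DEFAULT_PKEY = 0x7FFF    # default partition, never handed out here
    FULL_MEMBER = 0x8000     # membership bit of a 16-bit PKEY

    def __init__(self, first_vlan=100, first_pkey=0x0001):
        self._next_vlan = first_vlan
        self._next_pkey = first_pkey
        self._assigned = {}

    def assign(self, cluster_name):
        """Return the cluster's (vlan, pkey), allocating on first request."""
        if cluster_name in self._assigned:
            return self._assigned[cluster_name]
        if self._next_vlan > self.MAX_VLAN:
            raise RuntimeError("no VLAN IDs left")
        if self._next_pkey >= self.DEFAULT_PKEY:
            raise RuntimeError("no partition keys left")
        pair = (self._next_vlan, self._next_pkey | self.FULL_MEMBER)
        self._assigned[cluster_name] = pair
        self._next_vlan += 1
        self._next_pkey += 1
        return pair


if __name__ == "__main__":
    allocator = SegmentAllocator()
    for name in ("virtual-cluster-1", "virtual-cluster-2"):
        vlan, pkey = allocator.assign(name)
        print(f"{name}: VLAN {vlan}, PKEY {pkey:#06x}")
```

Because both identifiers are allocated together, traffic from one virtual cluster stays isolated from another on both networks.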

(31)

VM Disk Management

● Each VM gets a 36 GB disk (Small SCSI)
● Disk images are persistent through reboots
● Two central NASes store all disk images
● VMs can be allocated on all compute nodes dependent on availability (scheduler)
● Two solutions:
  o iSCSI (Network mounted disk)

(32)

VM Disk Management iSCSI

[Diagram: Virtual compute-x on the compute nodes; iSCSI Targets on the NAS, e.g. iqn.2001-04.com.nas-0-0-vm-compute-x]

This is what OpenStack Supports

Big Issue: Bandwidth Bottleneck at

(33)

A hybrid solution via replication

● Initial boot of any cluster node uses an iSCSI disk (call this a node disk) on the NAS
● During normal operation, Comet moves a node disk to the physical host that is running the node VM, and then disconnects from the NAS
  o All node disk operations are local to the physical host
  o Fundamentally enables scale-out w/o a $1M NAS
● At shutdown, any changes made to the node disk (now on the physical host) are migrated back to the NAS, ready for next boot

(34)

1.a Init Disk

[Diagram: Virtual compute-x on the compute nodes; Targets on the NAS (iqn.2001-04.com.nas-0-0-vm-compute-x); "Replicate Disk" step]

iSCSI mount on NAS enables virtual compute node to boot immediately.
● Read operations from NAS

(35)

1.b Init Disk

[Diagram: Virtual compute-x on the compute nodes; Targets on the NAS]

During boot, the disk image on the NAS is migrated to the physical host.
● Read-only and read/write are then merged into one local disk

(36)

2. Steady State

[Diagram: Virtual compute-x on the compute nodes; Targets on the NAS]

During normal operation
● Node disk is snapshotted
● Incremental snapshots sent to NAS (replicate back to NAS)
● Timing/load/experiment will tell us how often we can do this

(37)

3. Release Disk

[Diagram: Virtual compute-x on the compute nodes; Targets on the NAS; "Power off" step]

At shutdown, any unsynced changes are sent back to NAS
● When the last snapshot is sent, the virtual compute node can be rebooted on another system
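The four disk phases (1.a, 1.b, 2, 3) form a small cycle, which can be written down as a state machine. This is a reading aid only: the state and event names below are hypothetical labels of ours, not identifiers from the img-storage-roll code.

```python
from enum import Enum


class DiskState(Enum):
    AT_REST_ON_NAS = "image stored on the NAS, VM powered off"
    ISCSI_BOOT = "1.a: VM boots from an iSCSI target on the NAS"
    MIGRATING = "1.b: image is copied to the physical host during boot"
    LOCAL = "2: disk is local, incremental snapshots go to the NAS"
    RELEASING = "3: unsynced changes are sent back at shutdown"


# Allowed (state, event) -> next state transitions, following the slides.
_NEXT = {
    (DiskState.AT_REST_ON_NAS, "boot"): DiskState.ISCSI_BOOT,
    (DiskState.ISCSI_BOOT, "start_migration"): DiskState.MIGRATING,
    (DiskState.MIGRATING, "merge_complete"): DiskState.LOCAL,
    (DiskState.LOCAL, "snapshot"): DiskState.LOCAL,
    (DiskState.LOCAL, "shutdown"): DiskState.RELEASING,
    (DiskState.RELEASING, "last_snapshot_sent"): DiskState.AT_REST_ON_NAS,
}


def step(state, event):
    """Return the next disk state, or raise ValueError for an illegal event."""
    try:
        return _NEXT[(state, event)]
    except KeyError:
        raise ValueError(f"event {event!r} not allowed in {state.name}") from None


if __name__ == "__main__":
    state = DiskState.AT_REST_ON_NAS
    for event in ("boot", "start_migration", "merge_complete",
                  "snapshot", "shutdown", "last_snapshot_sent"):
        state = step(state, event)
        print(f"{event:>18s} -> {state.name}")
```

Only when the cycle returns to the at-rest state is the image on the NAS complete, which matches the slide's point that the node can then be rebooted on another system.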

(38)

Software to implement is under development

• Rocks Roll so that it can be a part of any physical Rocks-defined cluster
• https://github.com/rocksclusters/img-storage-roll
• Uses ZFS for disk image storage on NAS and hosting nodes
• http://zfsonlinux.org
• RabbitMQ – AMQP (Advanced Message Queuing Protocol) http://www.rabbitmq.com/
• Pika - library for communication with RabbitMQ from Python
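Since the image store is ZFS, the incremental replication described for steady state maps naturally onto ZFS snapshots with `zfs send -i` and `zfs receive`. The sketch below only builds the command lines and runs nothing. Dataset names, snapshot names, the NAS host name, and the use of `ssh` as transport are all hypothetical; the deck does not say how the roll actually moves the stream.

```python
def incremental_sync_commands(dataset, prev_snap, new_snap, nas_host, nas_dataset):
    """Build the three commands for one incremental replication round.

    Returns (snapshot_cmd, send_cmd, receive_cmd) as argument lists.
    The output of send_cmd is meant to be piped into receive_cmd.
    """
    # Take a new point-in-time snapshot of the node disk.
    snapshot_cmd = ["zfs", "snapshot", f"{dataset}@{new_snap}"]
    # Stream only the blocks that changed since the previous snapshot.
    send_cmd = ["zfs", "send", "-i",
                f"{dataset}@{prev_snap}", f"{dataset}@{new_snap}"]
    # Apply that stream to the copy held on the NAS (transport is assumed).
    receive_cmd = ["ssh", nas_host, "zfs", "receive", nas_dataset]
    return snapshot_cmd, send_cmd, receive_cmd


if __name__ == "__main__":
    commands = incremental_sync_commands(
        dataset="tank/vm-compute-x",        # hypothetical local dataset
        prev_snap="sync-0001",
        new_snap="sync-0002",
        nas_host="nas-0-0",                 # hypothetical host name
        nas_dataset="tank/vm-compute-x",
    )
    for command in commands:
        print(" ".join(command))
```

To execute a round, the snapshot command would run first, then the send command's standard output would be connected to the receive command's standard input, for example with two `subprocess.Popen` objects.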

(39)

Full Virtualization isn’t the only choice: Containers

• Comet supports fully-virtualized clusters
  • OS of cluster can be almost anything: Windows, Linux, Solaris (not Mac OS)
• Containers are a different way to virtualize the file system and some other elements
  • Network, inter-process communication
  • In Solaris for more than a decade
  • Newly popular in Linux with the docker project
• Containers must run the same kernel as the host operating system

(40)

Full vs. Container Virtualization

[Diagram, two panels. Full Virtualization (KVM): Physical Host, Kernel, HW, Virtual Host, Physical Network. Container Virtualization (Docker): Physical Host, Kernel, partial /proc, Container 1, Container 2, Physical Network. Also labeled: Network Bridge.]

Containers inherit the host kernel and elements of its hardware. Need cgroups to limit container memory/CPU usage.

Fully-virtualized hosts have independent kernels/OS. Hardware is “universal”. Memory/CPU defined by the definition of the virtual HW.
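The cgroup limits mentioned for containers are exposed by the current Docker CLI as options on `docker run`, notably `--memory` and `--cpus`. The sketch below only assembles such a command line. The image name and limit values are hypothetical, and these flags describe today's Docker rather than whatever tooling was used for the deck.

```python
def docker_run_command(image, command, memory="4g", cpus=2.0):
    """Build a `docker run` argument list with cgroup-backed resource limits."""
    return [
        "docker", "run", "--rm",
        "--memory", memory,     # cap on container memory
        "--cpus", str(cpus),    # cap on CPU time, in units of cores
        image,
        *command,
    ]


if __name__ == "__main__":
    # "example/rocks-compute" is a made-up image name for illustration.
    print(" ".join(docker_run_command("example/rocks-compute", ["uptime"])))
```

A fully virtualized guest needs no such flags because its memory and CPU count are fixed by the definition of the virtual hardware.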

(41)

Still need to Configure Virtual Systems

• Why is Docker (container) virtualization popular?
  • Very space efficient if the changes to the base OS file system are small. Changes can be 100s of megabytes instead of gigabytes
  • Network (latency) performance and/or network topology is less important (topology is needed for virtual clusters)
  • Given how quickly (and with how little overhead) Docker brings up virtual environments, this will be addressed
• A system is DEFINED by the contents of the file system
  • System libraries, application code, configuration

(42)

Running: Almost the same

Rocks-created server running in Docker Container

Notice: uptime, processor and date created

rocks-created server

running

~

(43)

Summary

• Comet: HPC for the 99%

• Expand capability to enable virtual clusters

• Not generic cloud computing

• Advances in virtualization techniques continue

• Comet will support fully-virtualized clusters

• Container-based virtualized (Docker) become quite