Comet - High performance virtual clusters to
support the long-tail of science.
Philip M. Papadopoulos, Ph.D.
San Diego Supercomputer Center
California Institute for telecommunications and Information
Technologies (Calit2)
Comet is funded by U.S. National
Science Foundation
•
“… expand the use of high end resources to a
much larger and more diverse community
•
… support the entire spectrum of NSF
communities
•
... promote a more comprehensive and
balanced portfolio
•
… include research communities that are not
users of traditional HPC systems.“
Jobs and SUs at various scales across
NSF resources
0
500
1000
1500
2000
2500
3000
0%
10%
20%
30%
40%
50%
60%
70%
80%
90%
100%
1
2
4
8
16 32 64 128 256 512 1K 2K 4K 8K 16K
Millio
n
s
of XD
SUs Ch
ar
ge
d
Fr
actio
n
of All
Jo
b
s
Char
ge
d
in
2012
Job Size (Cores)
Percentage of Jobs (Left Axis)
SUs Charged (Right Axis)
One node
•
99% of jobs run on
NSF’s HPC
resources in 2012
used < 2048 cores
•
And consumed
~50% of the total
core-hours across
NSF resources
Job Size (Cores)
Cum
ulat
iv
e
Comet: System Characteristics
•
Available 1Q 2015
•
Total flops ~1.9 PF (AVX2)
•
Dell primary integrator
• Intel next-gen processors, former
codename Haswell, with AVX2
• Aeon storage vendor
• Mellanox FDR InfiniBand
•
Standard compute nodes
• Dual 12-core Haswell processors
• 128 GB DDR4 DRAM (64
GB/socket!)
• 320 GB SSD (local scratch, VMs)
•
GPU nodes
• Four NVIDIA GPUs/node
•
Large-memory nodes
(2Q 2015)
•
Hybrid fat-tree topology
• FDR (56 Gbps) InfiniBand
• Rack-level (72 nodes) full bisection
bandwidth
• 4:1 oversubscription inter-rack
•
Performance Storage
• 7 PB, 200 GB/s
• Scratch & Persistent Storage
•
Durable Storage (reliability)
• 6 PB, 100 GB/s
•
Gateway hosting nodes and VM
image repository
•
100 Gbps external connectivity to
Internet2 & ESNet
Comet Architecture
Juniper 100 Gbps Arista 40GbE (2x) Data Mover (4x)R&E Network Access Data Movers
Internet 2
7x 36-port FDR in each rack wired as full fat-tree. 4:1 over
subscription between racks.
72 HSWL 320 GB IB Core (2x) N GPU 4 Large-Memory Bridge (4x)
Performance Storage Durable Storage
Arista 40GbE (2x) N racks FDR 36p FDR 36p 64 128 18 72 HSWL 320 GB 72 HSWL 72 72 18 Mid-tier
Additional Support Components
(not shown for clarity)
NFS Servers, Virtual Image Repository, Gateway/Portal Hosting Nodes, Login Nodes, Ethernet Management Network, Rocks Management Nodes
Node-Local Storage 18 72 FDR FDR FDR 40GbE 40GbE 10GbE
Design Decision - Network
•
Each Rack
• 72 nodes (144 CPUs, 1728 Cores)
• Fully-connected FDR IB (2-level Clos-Topology)
• 144 In-rack Cables
• 4:1 oversubscription between racks
• 18 inter-rack cables
•
Supports the large majority of jobs with no
performance degradation
•
3-level network for complete cluster
• worst-case latency similar to a much smaller cluster
SSDs – building on Gordon success
Based on our experiences with Gordon, a number of
applications will benefits from continued access to flash
•
Applications that generate large numbers of temp files
• Computational finance – analysis of multiple markets (NASDAQ, etc.)
• Text analytics – word correlations in Google Ngram data
•
Computational chemistry codes that write one- and
two-electron integral files to scratch
•
Structural mechanics codes (e.g. Abaqus), which
Large memory nodes
While most user applications will run well on the standard
compute nodes, a few domains will benefit from the large
memory (1.5 TB nodes)
•
De novo genome assembly: ALLPATHS-LG,
SOAPdenovo, Velvet
•
Finite-element calculations: Abaqus
GPU nodes
Comet’s GPU nodes will serve a number of domains
•
Molecular dynamics applications have been one of the
biggest GPU success stories. Packages include Amber,
CHARMM, Gromacs and NAMD
•
Applications that depend heavily on linear algebra
Key Comet Strategies
•
Target modest-scale users and new users/communities:
goal of
10,000 users/year
!
•
Support
capacity computing
, with a system optimized for
small/modest-scale jobs and quicker resource response
using allocation/scheduling policies
•
Build upon and expand efforts with
Science Gateways
,
encouraging gateway usage and hosting via software
and operating policies
•
Provide a
virtualized environment
to support
development of customized software stacks, virtual
environments, and project control of workspaces
Comet will serve a large number of users,
including new communities/disciplines
•
Allocations/scheduling policies to optimize for high throughput of
many modest-scale jobs (leveraging Trestles experience)
• Optimized for rack-level jobs but cross-rack jobs feasible
• Optimized for throughput (ala Trestles)
• Per-project allocations caps to ensure large numbers of users
• Rapid access for start-ups with one-day account generation
• Limits on job sizes, with possibility of exceptions
•
Gateway-friendly environment: Science gateways reach large
communities w/ easy user access
• e.g. CIPRES gateway alone currently accounts for ~25% of all users of NSF
resources, with 3,000 new users/year and ~5,000 users/year
Changing the face of XSEDE HPC users
•
System design and policies
• Allocations, scheduling and security policies which favor gateways
• Support gateway middleware and gateway hosting machines
• Customized environments with high-performance virtualization
• Flexible allocations for bursty usage patterns
• Shared node runs for small jobs, user-settable reservations
• Third party apps
•
Leverage and augment investments elsewhere
• FutureGrid experience, image packaging, training, on-ramp
• XSEDE (ECSS NIP & Gateways, TEOS, Campus Champions)
• Build off established successes supporting new communities
• Example-based documentation in Comet focus areas
Virtualization Environment
•
Leveraging expertise of Indiana U/ FutureGrid team
•
VM jobs scheduled just like batch jobs (not conventional
cloud environment with immediate elastic access)
•
VMs will be easy on-ramp for new users/communities,
including low porting time
•
Flexible software environments for new communities and
apps
•
VM repository/library
•
Virtual HPC cluster (multi-node) with near-native IB
latency and minimal overhead (SRIOV)
Single Root I/O Virtualization in HPC
•
Problem
: complex workflows demand
increasing flexibility from HPC platforms
• Pro: Virtualization
flexibility
• Con: Virtualization
IO performance loss
(e.g., excessive DMA interrupts)
•
Solution
: SR-IOV and Mellanox ConnectX-3
InfiniBand HCAs
• One physical function (PF)
multiple
virtual functions (VF), each with own DMA
streams, memory space, interrupts
More on Single Root IO Virtualization
•
PCIe is the I/O bus of modern x86 servers
• The I/O controller is integrated into microprocessors
• The I/O complex is called single-root if there is only one
controller for the bus
•
I/O Virtualization
• Allow a single I/O device (e.g. Network, Disk Controller) to
appear as multiple
independent
I/O devices
• These are called virtual functions
Benchmark comparisons of SR-IOV Cluster v AWS
(pre-Haswell) Hardware/Software Configuration
Native, SR-IOV
Amazon EC2
Platform
•
Rocks 6.1 (EL6)
•
Virtualization via
kvm
•
Amazon Linux 2013.03 (EL6)
•
cc2.8xlarge
Instances
CPUs
•
2x Xeon E5-2660 (2.2GHz)
•
16 cores per node
•
2x Xeon E5-2670 (2.6GHz)
•
16 cores per node
RAM
•
64 GB DDR3 DRAM
•
60.5 DDR3 DRAM
Interconnect
•
QDR4X InfiniBand
•
Mellanox ConnectX-3 (MT27500)
•
Intel VT-d, SR-IOV enabled in
firmware, kernel, drivers
•
mlx4_core 1.1
•
Mellanox OFED 2.0
•
HCA firmware 2.11.1192
•
10 GbE
SRIOV Latency approaches Native HW
19•
SR-IOV
•
< 30% overhead for
Messages < 128 bytes
•
< 10% overhead for
eager send/recv
•
Overhead
0% for
bandwidth-limited
regime
•
Amazon EC2
•
> 5000% worse latency
•
Time dependent (noisy)
Bandwidth Unimpaired Native vs.
SRIOV
20
•
SR-IOV
•
< 2% bandwidth loss
over entire range
•
> 95% peak bandwidth
•
Amazon EC2
•
< 35% peak bandwidth
•
900% to 2500% worse
bandwidth than
virtualized InfiniBand
Weather Modeling – 15% Overhead
•
96-core (6-node)
calculation
•
Nearest-neighbor
communication
•
Scalable algorithms
•
SR-IOV incurs modest
(15%) performance hit
•
...but still still 20%
faster
***
than Amazon
Quantum ESPRESSO: 28% overhead
•
48-core (3 node)
calculation
•
CG matrix inversion
(irregular comm.)
•
3D FFT matrix
transposes (All-to-all
communication)
•
28% slower w/ SR-IOV
•
SR-IOV still > 500%
faster
***
than EC2
Quantum Espresso 5.0.2 – DEISA AUSURF112 benchmark *** 20% faster despite SR-IOV cluster having 20% slower CPUs
If bandwidth unimpaired why Falloff for
App performance
•
Latency.
• Measured microbenchmark for latency shows 10-30%
overhead.
• However, this is variable. Can be as much as 100%.
• Why?
• Hypervisor/Physical node scheduling.
SR-IOV is a huge step forward in
high-performance virtualization
•
Shows substantial improvement in latency over Amazon
EC2, and it provides nearly zero bandwidth overhead
•
Benchmark application performance confirms significant
improvement over EC2
•
SR-IOV lowers performance barrier to virtualizing the
interconnect and makes fully virtualized HPC clusters
viable
•
Comet will deliver virtualized HPC to new/non-traditional
communities that need flexibility without major loss of
performance
High-Performance Virtualization on Comet
•
Mellanox FDR InfiniBand HCAs with SR-IOV
•
Rocks and KVM to manage Virtual Machines and
Clusters
•
Flexibility to support complex science gateways and
web-based workflow engines
• Custom compute appliances and virtual clusters developed with
FutureGrid and their existing expertise
Virtual Clusters – Overlay Physical Cluster with
User-Owned High Performance Clusters
Virtual Cluster 1
Virtual Cluster Characteristics
•
User-Owned and Defined
• Looks like bare metal cluster to user
•
Low overhead (latency and bandwidth) for a
virtualized Infiniband interface
• Single Root IO Virtualization
•
Schedule compatibility with standard HPC batch
jobs
• A node of virtual cluster is a virtual machine. That VM
looks like one element of a parallel job to the scheduler
•
Persistence of Disk State of virtual nodes across
Some Interesting Logistics for Virtual
Clusters
•
Scheduling
• Can we co-schedule virtual cluster nodes with regular
HPC jobs?
•
How do we efficiently handle disk images the
Virtual Cluster Anatomy
Public
network
Virtual
frontend
container
Virtual
Frontend
Generic
compute
nodes
Private
network
10GB
Virtual
Frontend
Virtual
Compute
Private network segmentation:
●
10GB using VLAN
●
Infiniband PKEY
Infiniband
Virtual
Compute
Virtual
Compute
VLAN NNN VLAN NNN VLAN NNN VLAN MMM VLAN MMM PKEY JJJJ PKEY JJJJ PKEY JJJJ PKEY LLLL PKEY LLLLVM Disk Management
●
Each VM gets a 36 GB disk (Small SCSI)
●
Disk images are persistent through reboots
●
Two central NASes store all disk images
●
VMs can be allocated on all compute nodes
dependent on availability (scheduler)
●
Two solutions:
o
iSCSI (Network mounted disk)
Virtual compute-x
VM Disk Management iSCSI
NAS
Compute
nodes
Targets
iqn.2001-04.com.nas-0-0-vm-compute-x
This is what OpenStack Supports
Big Issue: Bandwidth Bottleneck at
A hybrid solution via replication
●
I
nitial boot of any cluster node uses an iSCSI
disk(Call this a node disk) on the NAS
●
During normal operation, Comet moves a node disk
to the physical host that is running the node VM.
And then disconnects from the NAS
o
All Node disk operation is local to the physical host
o
Fundamentally enables scale out w/o a $1M NAS
●
At Shutdown, any changes made to the node disk
(now on the physical host) are migrated back to the
NAS, ready for next boot
1.a Init Disk
NAS
Compute
nodes
Virtual
compute-x
Targetsiqn.2001-04.com.nas-0-0-vm-compute-x Replicate Disk
iSCSI mount on NAS enables virtual
compute node to boot immediately.
●
Read operations from NAS
1.b Init Disk
NAS
Compute
nodes
Virtual compute-x
TargetsDuring boot, the disk image on the
NAS is migrated to the physical
host.
●
Read-only and read/write are
then merged into one local
disk
2. Steady State
NAS
Compute
nodes
Virtual
compute-x
TargetsDuring normal operation
●
Node disk is snapshot
●
Incremental snapshots sent to NAS
(replicate back to NAS)
●
Timing/load/experiment will tell us how
often we can do this
3. Release Disk
NAS
Compute
nodes
Virtual
compute-x
Targets Power offAt shutdown, any unsynched
changes are send back to NAS
●
When the last snapshot is
sent, the Virtual compute
node can be rebooted on
another system
Software to implement is under
development
•
Rocks Roll so that it can be a part of any
physical Rocks-defined cluster
•
https://github.com/rocksclusters/img-storage-roll
• Uses ZFS for disk image storage on NAS and hosting
nodes
•
http://zfsonlinux.org
• RabbitMQ – AMQP (Asynchronous Message Q
Protocol)
http://www.rabbitmq.com/
• Pika - library for communication with RabbitMQ from
Python
Full Virtualization isn’t the only
choice: Containers
•
Comet supports fully-virtualized clusters
• OS of cluster can be almost anything: Windows,
Linux, Solaris, (not Mac-OS)
•
Containers are a different way to virtualize
the file system and some other elements
• Network, inter-process communication
• In Solaris for more than a decade
• Newly popular in Linux with the
docker
project
• Containers must run the same kernel as the host
operating system
Full vs. Container Virtualization
Physical Host
Kernel
HW
Virtual Host
Virtual Host
Physical Network
Physical Host
Kernel
partial /proc
Container 1
Container 2
Physical Network
N
etw
o
rk
B
ri
d
g
e
Full Virtualization (KVM)
Container Virtualization
(Docker)
Con
tain
ers
Inh
erits
th
e
Host
kernel
and
element
s
of
its
hardw
are.
Need
cgrou
ps
to
limit
con
tain
er
mem
ory
/cpu
usag
e
Fu
lly
-V
ir
tu
al
iz
ed
hav
e
ind
epen
den
t
kernels/OS.
Hardw
are
is
“un
iv
ersa
l”
.
Mem
ory
/CPU
de
fin
ed
by
de
f
of
v
irtu
al
HW
Still need to Configure Virtual Systems
•
Why Docker (container) virtualization popular
• Very space efficient if the changes to the base OS file
system are small. Changes can be 100’s of megabytes
instead of Gbytes
• Network (latency) performance and/or network topology is
less important (topology needed for virtual clusters)
• Given how quickly (and relatively lightweight) Docker brings up
virtual environments, this will be addressed
•
A system is DEFINED by the contents of the file
system
• System libraries, application code, configuration
Running: Almost the same
Rocks-created server
running in Docker
Container
Notice: uptime, processor
and date created
rocks-created server
running
~