Training session of the Club des Affiliés du LAAS-CNRS, Toulouse, March 22, 2016
Frédéric Parienté, Tesla Accelerated Computing, NVIDIA
ENTERPRISE | AUTO | GAMING | PRO VISUALIZATION | DATA CENTER
FIVE THINGS TO REMEMBER
The time of accelerators has come
NVIDIA is focused on co-design from top-to-bottom
Accelerators are surging in supercomputing
Machine learning is the next killer application for HPC
“It’s time to start planning for the end of Moore’s Law, and it’s worth pondering how it will end, not just when.”
Robert Colwell
TESLA ACCELERATED COMPUTING PLATFORM
Focused on Co-Design from Top to Bottom
Productive Programming Model & Tools
Expert Co-Design
Accessibility
APPLICATION | MIDDLEWARE | SYS SW | LARGE SYSTEMS | PROCESSOR
Fast GPU
Engineered for High Throughput
[Chart: TFLOPS (0.0 to 3.0) by year, 2008 to 2014; NVIDIA GPU (M1060, M2090, K20, K40, K80) vs. x86 CPU]
Fast GPU + Strong CPU
ACCELERATORS SURGE IN WORLD’S TOP SUPERCOMPUTERS
100+ accelerated systems now on Top500 list
1/3 of total FLOPS powered by accelerators
NVIDIA Tesla GPUs sweep 23 of 24 new accelerated supercomputers
Tesla supercomputers growing at 50% CAGR over past five years
[Chart: Top500: # of Accelerated Supercomputers]
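To put the 50% CAGR figure in cumulative terms, the sketch below compounds a constant annual growth rate over a number of years. The function name `compound_growth` is hypothetical, and a constant rate applied once per year is an assumption; the slide gives only the average rate.

```c
#include <assert.h>

/* Cumulative growth multiple after `years` at a constant annual rate.
 * Assumption: growth compounds once per year at the same rate each year. */
double compound_growth(double annual_rate, int years)
{
    assert(annual_rate > -1.0 && years >= 0);
    double multiple = 1.0;
    for (int i = 0; i < years; ++i) {
        multiple *= 1.0 + annual_rate;
    }
    return multiple;
}
```

Under that assumption, `compound_growth(0.5, 5)` evaluates to about 7.6, that is, roughly a 7.6x increase over five years.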
70% OF TOP HPC APPS ACCELERATED
INTERSECT360 SURVEY OF TOP APPS
Top 10 HPC Apps: 90% Accelerated
Top 50 HPC Apps: 70% Accelerated
Source: Intersect360, Nov 2015, “HPC Application Support for GPU Computing”
TOP 25 APPS IN SURVEY
GROMACS, SIMULIA Abaqus, NAMD, AMBER, ANSYS Mechanical, MSC NASTRAN, SPECFEM3D, ANSYS Fluent, WRF, VASP, OpenFOAM, CHARMM, Quantum Espresso, LAMMPS, NWChem, LS-DYNA, Schrodinger, Gaussian, GAMESS, ANSYS CFX, Star-CD, CCSM, COMSOL, Star-CCM+, BLAST
Legend: All popular functions accelerated | Some popular functions accelerated | In development
370 GPU-Accelerated Applications
TESLA BOOSTS DATACENTER THROUGHPUT
$500M Datacenter, 4x increase in ROI
100% CPU Nodes: 1000 Jobs Per Day
70% GPU-Accelerated Nodes + 30% CPU Nodes: 3800 Jobs Per Day
70% of Applications 5x Faster with GPU
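The step from 1000 to 3800 jobs per day can be reproduced with a simple weighted sum. The sketch below assumes (this is not stated on the slide) that each GPU-accelerated node completes jobs 5x faster than a CPU node and that the remaining nodes run at baseline speed; the function name `cluster_jobs_per_day` is hypothetical.

```c
#include <assert.h>

/* Jobs per day for a cluster in which a fraction of the nodes is accelerated.
 * Assumption: an accelerated node finishes jobs `gpu_speedup` times faster
 * than a CPU node; all other nodes keep the baseline rate. */
double cluster_jobs_per_day(double baseline_jobs, double gpu_fraction,
                            double gpu_speedup)
{
    assert(gpu_fraction >= 0.0 && gpu_fraction <= 1.0);
    assert(gpu_speedup > 0.0);
    return baseline_jobs * (gpu_fraction * gpu_speedup + (1.0 - gpu_fraction));
}
```

With the slide's figures, `cluster_jobs_per_day(1000.0, 0.7, 5.0)` gives 1000 x (0.7 x 5 + 0.3), or about 3800 jobs per day.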
NEXT-GEN SUPERCOMPUTERS ARE GPU-ACCELERATED
U.S. Dept. of Energy (SUMMIT, SIERRA): Pre-Exascale Supercomputers for Science
NOAA: New Supercomputer for Next-Gen Weather Forecasting
IBM Watson: Breakthrough Natural Language Processing for Cognitive Computing
MACHINE LEARNING: HPC’S 1ST CONSUMER KILLER-APP
MICROSOFT OPEN-SOURCE DMTK | GOOGLE OPEN-SOURCE TENSORFLOW | YOUTUBE CLICK-TO-BUY ADS | GOOGLE PHOTO
TESLA PLATFORM LEADS IN EVERY WAY
PROCESSOR
INTERCONNECT
ECOSYSTEM
SOFTWARE
“Approximately a third of HPC systems operating today are equipped with accelerators and nearly half of all newly deployed systems have them.”
TESLA FOR SIMULATION
LIBRARIES
TESLA ACCELERATED COMPUTING
LANGUAGES
DIRECTIVES
Tesla Accelerates Discoveries
Using a supercomputer powered by the Tesla Platform with over 3,000 Tesla accelerators, University of Illinois scientists performed the first all-atom simulation of the HIV virus and discovered the chemical structure of its capsid, “the perfect target for fighting the infection.”
Without GPU, the supercomputer would need to be 5x larger for similar performance.
TESLA K80
World’s Fastest Accelerator for HPC & Data Analytics
CUDA Cores: 4992
Peak DP: 1.9 TFLOPS
Peak DP w/ Boost: 2.9 TFLOPS
GDDR5 Memory: 24 GB
Bandwidth: 480 GB/s
Power: 300 W
GPU Boost: Dynamic
5x Faster
Simulation Time from 1 Month to 1 Week
[Chart: # of Days (0 to 30), Tesla K80 Server vs. Dual CPU Server]
AMBER Benchmark: PME-JAC-NVE Simulation for 1 microsecond. CPU: E5-2698v3 @ 2.3GHz, 64GB System Memory, CentOS 6.2
TESLA K80 BOOSTS DATA CENTER THROUGHPUT
1/3 OF NODES ACCELERATED, 2X SYSTEM THROUGHPUT
CPU-only System: 100 Jobs Per Day
Accelerated System: 220 Jobs Per Day
ACCELERATING KEY APPS
[Chart: K80 vs. CPU speed-up (0x to 15x) for QMCPACK, LAMMPS, CHROMA, NAMD, AMBER]
CPU: Dual E5-2698 v3 @ 2.3GHz (3.6GHz Turbo), 64GB System Memory, CentOS 6.2
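The two throughput figures on this slide (100 and 220 jobs per day with one third of the nodes accelerated) imply an average gain per accelerated node. The sketch below solves for it, assuming the unaccelerated nodes keep their baseline rate and every accelerated node gains the same factor; the function name `implied_node_speedup` is hypothetical and the resulting number is a derived estimate, not a figure from the slide.

```c
#include <assert.h>

/* Per-node speedup implied by an overall system gain when only a fraction
 * of the nodes is accelerated.
 * Model: system_gain = gpu_fraction * s + (1 - gpu_fraction), solved for s. */
double implied_node_speedup(double system_gain, double gpu_fraction)
{
    assert(gpu_fraction > 0.0 && gpu_fraction <= 1.0);
    return (system_gain - (1.0 - gpu_fraction)) / gpu_fraction;
}
```

For example, `implied_node_speedup(220.0 / 100.0, 1.0 / 3.0)` comes out at about 4.6 under these assumptions.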
TESLA FOR VISUALIZATION
IRAY
TESLA ACCELERATED COMPUTING
INDEX
OPTIX
VISUALIZE DATA INSTANTLY FOR FASTER SCIENCE
Traditional: Slower Time to Discovery
CPU Supercomputer, Viz Cluster
Simulation: 1 Week, Viz: 1 Day
Multiple Iterations
Time to Discovery = Months
Tesla Platform: Faster Time to Discovery
GPU-Accelerated Supercomputer
Visualize while you simulate, without data transfers
Restart Simulation Instantly
Multiple Iterations
Time to Discovery = Weeks
Flexible | Scalable | Interactive
VISUALIZATION-ENABLED SUPERCOMPUTERS
Simulation + Visualization
GROWING ADOPTION IN CLIMATE & WEATHER
MeteoSwiss Deploys World’s First Accelerated Weather Supercomputer
2x higher resolution for daily forecasts
14x more simulation with ensemble approach for medium-range forecasts
NOAA Chooses Tesla To Improve Weather Forecast Research
Develop global model with 3km resolution, five-fold increase from today’s resolution
Improved resolution requires 100x computational complexity
U.S. TO BUILD TWO FLAGSHIP SUPERCOMPUTERS
Powered by the Tesla Platform
100-300 PFLOPS Peak
10x in Scientific App Performance
IBM POWER9 CPU + NVIDIA Volta GPU
NVLink High Speed Interconnect
40 TFLOPS per Node, >3,400 Nodes
IBM POWER CPU: Most Powerful Serial Processor
NVIDIA NVLink: Fastest CPU-GPU Interconnect, 80-200 GB/s
NVIDIA Volta GPU: Most Powerful Parallel Processor
ACCELERATED COMPUTING DELIVERS 5X HIGHER ENERGY EFFICIENCY
CORAL: BUILT FOR GRAND SCIENTIFIC CHALLENGES
Fusion Energy
Role of material disorder, statistics, and fluctuations in nanoscale materials and systems
Combustion
Combustion simulations to enable the next gen diesel/bio-fuels to burn more efficiently
Climate Change
Study climate change adaptation and mitigation scenarios; realistically represent detailed features
Nuclear Energy
Unprecedented high-fidelity radiation transport calculations for nuclear energy applications
Biofuels
Search for renewable and more efficient energy sources
Astrophysics
Radiation transport, critical to astrophysics, laser fusion, atmospheric dynamics, and medical imaging
THE BIG BANG IN MACHINE LEARNING
“Google’s AI engine also reflects how the world of computer hardware is changing. (It) depends on machines equipped with GPUs… And it depends on these chips more
Tesla Revolutionizes Machine Learning
GOOGLE BRAIN APPLICATION: DEEP LEARNING
            BEFORE TESLA     AFTER TESLA
Cost        $5,000K          $200K
Servers     1,000 Servers    16 Tesla Servers
Energy      600 kW           4 kW
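The before/after figures for the Google Brain application can be turned into reduction factors with one division each. The sketch below uses the slide's numbers; the struct, field, and function names are hypothetical.

```c
#include <assert.h>

/* Figures from the slide; the struct and field names are hypothetical. */
typedef struct {
    double cost_kusd;  /* cost in thousands of US dollars */
    double servers;    /* number of servers */
    double power_kw;   /* power draw in kilowatts */
} Deployment;

static const Deployment before_tesla = { 5000.0, 1000.0, 600.0 };
static const Deployment after_tesla  = {  200.0,   16.0,   4.0 };

/* How many times smaller `after` is than `before`. */
double reduction_factor(double before, double after)
{
    assert(after > 0.0);
    return before / after;
}
```

Applied to the two deployments, this gives a 25x reduction in cost, 62.5x in server count, and 150x in power.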
NVIDIA GPU THE ENGINE OF DEEP LEARNING
NVIDIA CUDA
ACCELERATED COMPUTING PLATFORM
WATSON
CHAINER
THEANO
MATCONVNET
CUDA BOOSTS DEEP LEARNING 5X IN 2 YEARS
[Chart: Caffe Performance (0 to 6): K40 (11/2013), K40+cuDNN1 (9/2014), M40+cuDNN3 (7/2015), M40+cuDNN4 (12/2015)]
AMAZING RATE OF IMPROVEMENT
[Chart: Image Recognition, IMAGENET. ImageNet Accuracy with NVIDIA GPU, 2010 to 2015: 72%, 74%, 84%, 88%, 93%, 96%]
[Chart: Pedestrian Detection, CALTECH. Accuracy, CV-based vs. DNN-based, 11/2013 to 1/2016]
[Chart: Object Detection, KITTI. Top Score: NVIDIA DRIVENet]
CUDA FOR DEEP LEARNING DEVELOPMENT
TITAN X
DEVBOX
GPU CLOUD
DEEP LEARNING SDK
cuSPARSE cuBLAS
FACEBOOK’S DEEP LEARNING MACHINE
Purpose-Built for Deep Learning Training
2x Faster Training for Faster Deployment
2x Larger Networks for Higher Accuracy
Powered by Eight Tesla M40 GPUs
Open Rack Compliant
“Most of the major advances in machine learning and AI in the past few years have been contingent on tapping into powerful GPUs and huge data sets to build and train advanced models.”
Serkan Piantino, Engineering Director of Facebook AI Research
DESIGNED FOR AI COMPUTING AT LARGE SCALE
Built on the NVIDIA Tesla Platform
• 8 Tesla M40s deliver aggregate 96 GB GDDR5 memory and 56 teraflops of SP performance
• Leverages world’s leading deep learning platform to tap into frameworks such as Torch and libraries such as cuDNN
Operational Efficiency and Serviceability
• Free-air cooled design optimizes thermal and power efficiency
• Components swappable without tools
TESLA M40
World’s Fastest Accelerator for Deep Learning Training
CUDA Cores: 3072
Peak SP: 7 TFLOPS
GDDR5 Memory: 12 GB
Bandwidth: 288 GB/s
Power: 250W
13x Faster Training
Reduce Training Time from 5 Days to less than 10 Hours
[Chart: Caffe, Number of Days (0 to 5), GPU Server with 4x TESLA M40 vs. Dual CPU Server]
Note: Caffe benchmark with AlexNet, training 1.3M images with 90 epochs. CPU server uses 2x Xeon E5-2699v3 CPU, 128GB System Memory, Ubuntu 14.04
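The "5 days to less than 10 hours" claim follows from the 13x figure by simple division. The sketch below shows that arithmetic; the function name `accelerated_hours` is hypothetical, and applying the speedup uniformly to the whole training run is an assumption.

```c
#include <assert.h>

/* Hours needed on the accelerated server, given the CPU baseline in days.
 * Assumption: the speedup applies uniformly to the whole training run. */
double accelerated_hours(double baseline_days, double speedup)
{
    assert(baseline_days >= 0.0 && speedup > 0.0);
    return baseline_days * 24.0 / speedup;
}
```

With the slide's values, `accelerated_hours(5.0, 13.0)` is 120 / 13, or about 9.2 hours.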
TESLA M4
Highest Throughput Hyperscale Workload Acceleration
CUDA Cores: 1024
Peak SP: 2.2 TFLOPS
GDDR5 Memory: 4 GB
Bandwidth: 88 GB/s
Form Factor: PCIe Low Profile
Video Processing: 4x
Image Processing: 5x
Video Transcode: 2x
Machine Learning Inference: 2x
H.264 & H.265, SD & HD
Stabilization and
10X GROWTH IN ACCELERATED COMPUTING
                        2008       2015
CUDA Downloads          150,000    3 Million
Academic Papers         4,000      60,000
Universities Teaching   60         800
Tesla GPUs              6,000      450,000
CUDA Apps               27         370
HOW GPU ACCELERATION WORKS
Application Code
GPU: Compute-Intensive Functions (5% of Code)
+
CPU: Rest of Sequential CPU Code
COMMON PROGRAMMING MODELS ACROSS MULTIPLE CPUS
x86
Libraries
Programming
Compiler Directives
AmgX cuBLAS/
GPU ACCELERATED LIBRARIES
“Drop-in” Acceleration for Your Applications
Domain-specific: Deep Learning, GIS, EDA, Bioinformatics, Fluids
Visual Processing: Image & Video
Linear Algebra: Dense, Sparse, Matrix
Math Algorithms: AMG, Templates, Solvers
NVIDIA cuRAND
NVIDIA NPP | NVIDIA CODEC SDK
NVBIO | Triton Ocean SDK
NVIDIA cuBLAS, cuSPARSE
AmgX | cuSOLVER
OpenACC
Simple | Powerful | Portable
Fueling the Next Wave of Scientific Discoveries in HPC
University of Illinois, PowerGrid (MRI Reconstruction): 70x Speed-Up, 2 Days of Effort
RIKEN Japan, NICAM (Climate Modeling): 7-8x Speed-Up
main()
{
  <serial code>
  #pragma acc kernels
  // automatically runs on GPU
  {
    <parallel code>
  }
}
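The schematic above can be made concrete with a small loop. The sketch below is a minimal SAXPY (y = a*x + y) in C using the same `kernels` directive; the function name `saxpy` and the data clauses are illustrative choices, not taken from the slide. Built with an OpenACC-capable compiler the region may be offloaded to the GPU; a compiler without OpenACC support ignores the pragma and runs the same loop on the CPU.

```c
#include <assert.h>
#include <stddef.h>

/* y[i] = a * x[i] + y[i] for i in [0, n). */
void saxpy(int n, float a, const float *x, float *y)
{
    assert(n >= 0 && x != NULL && y != NULL);
    /* copyin: x is only read on the device; copy: y is read and written back. */
    #pragma acc kernels copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

The serial part of the program stays unchanged; only the compute-intensive loop carries a directive, which is why a single source can serve both CPU and GPU builds.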
8000+ Developers using OpenACC
“OpenACC makes GPU computing approachable for domain scientists. Initial OpenACC implementation required only minor effort, and more importantly, no modifications of our existing CPU implementation.”
Janus Juul Eriksen, PhD Fellow, qLEAP Center for Theoretical Chemistry, Aarhus University
Minimal Effort
Lines of Code Modified: <100 Lines
# of Weeks Required: 1 Week
# of Codes to Maintain: 1 Source
Big Performance
[Chart: LS-DALTON CCSD(T) Module, Speedup vs CPU (0.0x to 12.0x) for Alanine-1 (13 Atoms), Alanine-2 (23 Atoms), Alanine-3 (33 Atoms)]
Benchmarked on Titan Supercomputer (AMD CPU vs Tesla K20X)
LS-DALTON
Large-scale Application for Calculating High-accuracy
OPENACC DELIVERS TRUE PERFORMANCE PORTABILITY
Paving the Path Forward: Single Code for All HPC Processors
[Chart: Speedup vs Single CPU Core (5x to 35x) for CPU: MPI + OpenMP, CPU: MPI + OpenACC, and CPU + GPU: MPI + OpenACC; data labels: 4.1x, 5.2x, 7.1x, 4.3x, 5.3x, 7.1x, 7.6x, 11.9x, 30.3x]
CUDA
Super Simplified Memory Management Code
/* CPU code: buffer allocated with malloc, sorted on the CPU */
void sortfile(FILE *fp, int N) {
  char *data;
  data = (char *)malloc(N);
  fread(data, 1, N, fp);
  qsort(data, N, 1, compare);
  use_data(data);
  free(data);
}

/* CUDA code: cudaMallocManaged returns one pointer usable from both CPU and GPU */
void sortfile(FILE *fp, int N) {
  char *data;
  cudaMallocManaged(&data, N);
  fread(data, 1, N, fp);
  qsort<<<...>>>(data,N,1,compare);
  cudaDeviceSynchronize();
  use_data(data);
  cudaFree(data);
}
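The slide's CPU version leaves `compare` and `use_data` undefined. A minimal, self-contained C sketch could fill them in as follows; both helper bodies are hypothetical (byte-wise ordering, printing to stdout), and the return code and error handling are additions that are not on the slide.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical comparator: orders bytes by unsigned value. */
static int compare(const void *a, const void *b)
{
    unsigned char x = *(const unsigned char *)a;
    unsigned char y = *(const unsigned char *)b;
    return (x > y) - (x < y);
}

/* Hypothetical consumer: prints the sorted bytes followed by a newline. */
static void use_data(const char *data, size_t n)
{
    fwrite(data, 1, n, stdout);
    fputc('\n', stdout);
}

/* Reads n bytes from fp, sorts them, and passes them to use_data.
 * Returns 0 on success, -1 on allocation or read failure. */
int sortfile(FILE *fp, size_t n)
{
    assert(fp != NULL);
    if (n == 0) {
        return 0;
    }
    char *data = malloc(n);
    if (data == NULL) {
        return -1;
    }
    if (fread(data, 1, n, fp) != n) {
        free(data);
        return -1;
    }
    qsort(data, n, 1, compare);
    use_data(data, n);
    free(data);
    return 0;
}
```

In the managed-memory variant on the slide, only the allocation, the kernel launch, and the synchronization change; the pointer passed to `fread` and `use_data` stays the same, which is the simplification the slide is pointing at.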
GPU DEVELOPER ECO-SYSTEM
Consultants & Training
OEM Solution Providers
Debuggers & Profilers: CUDA-GDB, NV Visual Profiler, NVIDIA Nsight, Visual Studio, Allinea, TotalView
Numerical Packages: MATLAB, Mathematica, LabView
Cluster Tools: GPUDirect RDMA, Datacenter GPU Manager
Libraries: FFT, BLAS, SPARSE, LAPACK, NPP, Video, Imaging
C, C++, Fortran, Java, Python, OpenACC, OpenMP
DEVELOP ON GEFORCE, DEPLOY ON TESLA
Designed for Developers & Gamers
Available Everywhere
developer.nvidia.com/cuda-gpus developer.nvidia.com/devbox
Designed for the Data Center
ECC | 24x7 Runtime | GPU Monitoring | Cluster Management | GPUDirect-RDMA | Hyper-Q for MPI | 3 Year Warranty
DEEP LEARNING & ARTIFICIAL INTELLIGENCE
Sep 28-29, 2016 | Amsterdam
www.gputechconf.eu #GTC16
SELF-DRIVING CARS
VIRTUAL REALITY & AUGMENTED REALITY
SUPERCOMPUTING & HPC
GTC Europe is a two-day conference designed to expose the innovative ways developers, businesses and academics are using parallel computing to transform our world.