GPUs IN SCIENTIFIC COMPUTING

(1)

Training session of the LAAS-CNRS Affiliates Club, Toulouse, March 22, 2016

Frédéric Parienté, Tesla Accelerated Computing, NVIDIA

(2)

ENTERPRISE AUTO

GAMING PRO VISUALIZATION DATA CENTER

(3)

FIVE THINGS TO REMEMBER

Time of accelerators has come

NVIDIA is focused on co-design from top-to-bottom

Accelerators are surging in supercomputing

Machine learning is the next killer application for HPC

(4)

“It’s time to start planning for the end of Moore’s Law, and it’s worth pondering how it will end, not just when.”

Robert Colwell

(5)

TESLA ACCELERATED COMPUTING PLATFORM

Focused on Co-Design from Top to Bottom

Productive Programming Model & Tools

Expert Co-Design

Accessibility

APPLICATION MIDDLEWARE SYS SW LARGE SYSTEMS PROCESSOR

Fast GPU

Engineered for High Throughput

[Chart: peak TFLOPS (0.0 to 3.0), NVIDIA GPU vs. x86 CPU, 2008 to 2014; GPUs plotted: M1060, M2090, K20, K40, K80]

Fast GPU + Strong CPU

(6)


ACCELERATORS SURGE IN WORLD’S TOP SUPERCOMPUTERS

100+ accelerated systems now on Top500 list

1/3 of total FLOPS powered by accelerators

NVIDIA Tesla GPUs sweep 23 of 24 new accelerated supercomputers

Tesla supercomputers growing at 50% CAGR over past five years

Top500: # of Accelerated Supercomputers
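A 50% compound annual growth rate is easier to judge as a total growth factor. The C sketch below compounds the rate year by year. `compound_growth` is a hypothetical helper name introduced here for illustration; the only inputs taken from the slide are the 50% rate and the five-year span.

```c
#include <assert.h>

/* Total growth factor after `years` of compounding at annual rate `cagr`
 * (0.50 means 50% per year). Hypothetical helper, not from the slides. */
double compound_growth(double cagr, int years)
{
    assert(years >= 0);
    double factor = 1.0;
    for (int y = 0; y < years; ++y) {
        factor *= 1.0 + cagr;   /* one more year of growth */
    }
    return factor;
}
```

Calling `compound_growth(0.50, 5)` gives about 7.6, so a 50% CAGR sustained for five years corresponds to roughly a 7.6x increase over the period.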

(7)

70% OF TOP HPC APPS ACCELERATED

TOP 25 APPS IN SURVEY

INTERSECT360 SURVEY OF TOP APPS

GROMACS SIMULIA Abaqus NAMD AMBER ANSYS Mechanical MSC NASTRAN SPECFEM3D ANSYS Fluent WRF VASP OpenFOAM CHARMM Quantum Espresso LAMMPS NWChem LS-DYNA Schrodinger Gaussian GAMESS ANSYS CFX Star-CD CCSM COMSOL Star-CCM+ BLAST

Top 10 HPC Apps: 90% Accelerated

Top 50 HPC Apps: 70% Accelerated

Source: Intersect360, Nov 2015, “HPC Application Support for GPU Computing”

Legend: all popular functions accelerated; some popular functions accelerated; in development

(8)

370 GPU-Accelerated Applications

(9)

TESLA BOOSTS DATACENTER THROUGHPUT

$500M Datacenter, 4x increase in ROI

100% CPU Nodes: 1000 Jobs Per Day

70% GPU-Accelerated Nodes + 30% CPU Nodes: 3800 Jobs Per Day

70% of Applications 5x Faster with GPU
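The jump from 1000 to 3800 jobs per day is consistent with a simple weighted model. This is a sketch under the assumption that every job placed on a GPU-accelerated node runs 5x faster and that work can always be routed to the matching node type. The slide does not state its model, and `relative_throughput` is a hypothetical name.

```c
#include <assert.h>

/* Throughput of a mixed cluster relative to a CPU-only cluster of the same
 * size. `gpu_share` is the fraction of nodes that are GPU-accelerated and
 * `speedup` is how much faster such a node completes jobs (assumed uniform). */
double relative_throughput(double gpu_share, double speedup)
{
    assert(gpu_share >= 0.0 && gpu_share <= 1.0 && speedup > 0.0);
    return (1.0 - gpu_share) * 1.0 + gpu_share * speedup;
}
```

With the slide's figures, `relative_throughput(0.70, 5.0)` is 0.3 + 3.5 = 3.8, which scales the 1000 jobs per day baseline to 3800.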

(10)

NEXT-GEN SUPERCOMPUTERS ARE GPU-ACCELERATED

U.S. Dept. of Energy: Pre-Exascale Supercomputers for Science (SUMMIT, SIERRA)

NOAA: New Supercomputer for Next-Gen Weather Forecasting

IBM Watson: Breakthrough Natural Language Processing for Cognitive Computing

(11)

MACHINE LEARNING

HPC’S 1ST CONSUMER KILLER-APP

MICROSOFT OPEN-SOURCE DMTK GOOGLE OPEN-SOURCE TENSORFLOW

YOUTUBE CLICK-TO-BUY ADS GOOGLE PHOTO

(12)

TESLA PLATFORM LEADS IN EVERY WAY

PROCESSOR

INTERCONNECT

ECOSYSTEM

SOFTWARE

(13)

(14)

“Approximately a third of HPC systems operating today are equipped with accelerators and nearly half of all newly deployed systems have them.”

(15)

TESLA FOR SIMULATION

LIBRARIES

TESLA ACCELERATED COMPUTING

LANGUAGES

DIRECTIVES

(16)

Tesla Accelerates Discoveries

Using a supercomputer powered by the Tesla Platform with over 3,000 Tesla accelerators, University of Illinois scientists performed the first all-atom simulation of the HIV virus and discovered the chemical structure of its capsid: “the perfect target for fighting the infection.”

Without GPU, the supercomputer would need to be 5x larger for similar performance.

(17)

TESLA K80

World’s Fastest Accelerator for HPC & Data Analytics

[Chart: # of days (0 to 30), Tesla K80 Server vs. Dual CPU Server]

AMBER Benchmark: PME-JAC-NVE Simulation for 1 microsecond. CPU: E5-2698v3 @ 2.3GHz, 64GB System Memory, CentOS 6.2

CUDA Cores 4992

Peak DP 1.9 TFLOPS

Peak DP w/ Boost 2.9 TFLOPS

GDDR5 Memory 24 GB

Bandwidth 480 GB/s

Power 300 W

GPU Boost Dynamic

5x Faster: Simulation Time from 1 Month to 1 Week

(18)

[Chart: K80 vs. CPU speedup, 0x to 15x]

(19)

TESLA K80 BOOSTS DATA CENTER THROUGHPUT

ACCELERATING KEY APPS

[Chart: K80 vs. CPU speedup (0x to 15x) for QMCPACK, LAMMPS, CHROMA, NAMD, AMBER]

1/3 OF NODES ACCELERATED, 2X SYSTEM THROUGHPUT

CPU-only System: 100 Jobs Per Day

Accelerated System: 220 Jobs Per Day

CPU: Dual E5-2698 v3 @ 3.6GHz, 64GB System Memory, CentOS 6.2

(20)

TESLA FOR VISUALIZATION

IRAY

TESLA ACCELERATED COMPUTING

INDEX

OPTIX

(21)

VISUALIZE DATA INSTANTLY FOR FASTER SCIENCE

Traditional: Slower Time to Discovery

CPU Supercomputer (Simulation: 1 Week), then Viz Cluster (Viz: 1 Day)

Multiple Iterations

Time to Discovery = Months

Tesla Platform: Faster Time to Discovery

GPU-Accelerated Supercomputer

Visualize while you simulate, without data transfers

Restart Simulation Instantly

Multiple Iterations

Time to Discovery = Weeks

Flexible

Scalable

Interactive

Days

(22)

VISUALIZATION-ENABLED SUPERCOMPUTERS

Simulation + Visualization

(23)

GROWING ADOPTION IN CLIMATE & WEATHER

MeteoSwiss Deploys World’s First Accelerated Weather Supercomputer

2x higher resolution for daily forecasts

14x more simulation with ensemble approach for medium-range forecasts

NOAA Chooses Tesla To Improve Weather Forecast Research

Develop global model with 3km resolution, five-fold increase from today’s resolution

Improved resolution requires 100x computational complexity

(24)

U.S. TO BUILD TWO FLAGSHIP SUPERCOMPUTERS

Powered by the Tesla Platform

100-300 PFLOPS Peak

10x in Scientific App Performance

IBM POWER9 CPU + NVIDIA Volta GPU

NVLink High Speed Interconnect

40 TFLOPS per Node, >3,400 Nodes

(25)

ACCELERATED COMPUTING DELIVERS 5X HIGHER ENERGY EFFICIENCY

IBM POWER CPU: Most Powerful Serial Processor

NVIDIA NVLink: Fastest CPU-GPU Interconnect, 80-200 GB/s

NVIDIA Volta GPU: Most Powerful Parallel Processor

(26)

CORAL: BUILT FOR GRAND SCIENTIFIC CHALLENGES

Fusion Energy

Role of material disorder, statistics, and fluctuations in nanoscale materials and systems

Combustion

Combustion simulations to enable the next gen diesel/bio-fuels to burn more efficiently

Climate Change

Study climate change adaptation and mitigation scenarios; realistically represent detailed features

Nuclear Energy

Unprecedented high-fidelity radiation transport calculations for nuclear energy applications

Biofuels

Search for renewable and more efficient energy sources

Astrophysics

Radiation transport, critical to astrophysics, laser fusion, atmospheric dynamics, and medical imaging

(27)

(28)

THE BIG BANG IN MACHINE LEARNING

“Google’s AI engine also reflects how the world of computer hardware is changing. (It) depends on machines equipped with GPUs… And it depends on these chips more

(29)

Tesla Revolutionizes Machine Learning

GOOGLE BRAIN APPLICATION: DEEP LEARNING

           BEFORE TESLA     AFTER TESLA
Cost       $5,000K          $200K
Servers    1,000 Servers    16 Tesla Servers
Energy     600 KW           4 KW

(30)

(31)

NVIDIA GPU THE ENGINE OF DEEP LEARNING

NVIDIA CUDA

ACCELERATED COMPUTING PLATFORM

WATSON

CHAINER

THEANO

MATCONVNET

(32)

CUDA BOOSTS DEEP LEARNING 5X IN 2 YEARS

[Chart: Caffe performance (relative, axis 0 to 6) for K40, K40+cuDNN1, M40+cuDNN3, M40+cuDNN4; dates 11/2013, 9/2014, 7/2015, 12/2015]

(33)

AMAZING RATE OF IMPROVEMENT

Image Recognition: IMAGENET

[Chart: ImageNet accuracy, 2010 to 2015: 72%, 74%, 84%, 88%, 93%, 96%; NVIDIA GPU]

Pedestrian Detection: CALTECH

Object Detection: KITTI

[Charts: accuracy (65% to 100%) from 11/2013 to 1/2016, CV-based vs. DNN-based; Top Score, NVIDIA DRIVENet; plotted values 39%, 45%, 55%, 62%, 66%, 72%, 75%, 79%, 83%, 86%, 87.5%]

(34)

CUDA FOR DEEP LEARNING DEVELOPMENT

TITAN X

DEVBOX

GPU CLOUD

DEEP LEARNING SDK

cuSPARSE cuBLAS

(35)


FACEBOOK’S DEEP LEARNING MACHINE

Purpose-Built for Deep Learning Training

2x Faster Training for Faster Deployment

2x Larger Networks for Higher Accuracy

Powered by Eight Tesla M40 GPUs

Open Rack Compliant

“Most of the major advances in machine learning and AI in the past few years have been contingent on tapping into powerful GPUs and huge data sets to build and train advanced models.”

Serkan Piantino, Engineering Director of Facebook AI Research

(36)

DESIGNED FOR AI COMPUTING AT LARGE SCALE

Built on the NVIDIA Tesla Platform

• 8 Tesla M40s deliver aggregate 96 GB GDDR5 memory and 56 teraflops of SP performance

• Leverages world’s leading deep learning platform to tap into frameworks such as Torch and libraries such as cuDNN

Operational Efficiency and Serviceability

• Free-air Cooled Design Optimizes Thermal and Power Efficiency

• Components swappable without tools

(37)

TESLA M40

World’s Fastest Accelerator for Deep Learning Training

[Chart: number of days to train with Caffe (0 to 5), GPU Server with 4x TESLA M40 vs. Dual CPU Server]

13x Faster Training

CUDA Cores 3072

Peak SP 7 TFLOPS

GDDR5 Memory 12 GB

Bandwidth 288 GB/s

Power 250 W

Reduce Training Time from 5 Days to less than 10 Hours

Note: Caffe benchmark with AlexNet, training 1.3M images with 90 epochs. CPU server uses 2x Xeon E5-2699v3 CPU, 128GB System Memory, Ubuntu 14.04
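The two headline figures on this slide, "13x faster" and "5 days to less than 10 hours", can be cross-checked with one division. A minimal sketch, in which `accelerated_hours` is a hypothetical helper name and the inputs are the slide's own numbers:

```c
#include <assert.h>

/* Run time in hours after applying `speedup` to a baseline given in days. */
double accelerated_hours(double baseline_days, double speedup)
{
    assert(baseline_days >= 0.0 && speedup > 0.0);
    return baseline_days * 24.0 / speedup;
}
```

`accelerated_hours(5.0, 13.0)` is 120 / 13, about 9.2 hours, which agrees with the "less than 10 hours" claim.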

(38)

TESLA M4

Highest Throughput Hyperscale Workload Acceleration

CUDA Cores 1024

Peak SP 2.2 TFLOPS

GDDR5 Memory 4 GB

Bandwidth 88 GB/s

Form Factor PCIe Low Profile

Video Processing 4x, Image Processing 5x, Video Transcode 2x, Machine Learning Inference 2x

H.264 & H.265, SD & HD

Stabilization and

(39)

(40)

10X GROWTH IN ACCELERATED COMPUTING

                         2008       2015
CUDA Downloads           150,000    3 Million
Academic Papers          4,000      60,000
Universities Teaching    60         800
Tesla GPUs               6,000      450,000
CUDA Apps                27         370

(41)


HOW GPU ACCELERATION WORKS

Application Code

GPU: Compute-Intensive Functions (5% of Code)

+

CPU: Rest of Sequential CPU Code
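The slide says that about 5% of the code holds the compute-intensive functions, but not what share of the run time they represent. Amdahl's law shows why that share matters: if a fraction p of the original run time is accelerated by a factor s, the overall speedup is 1 / ((1 - p) + p / s). The sketch below implements that formula. The function name and the example values of p and s are illustrative assumptions, not figures from the presentation.

```c
#include <assert.h>

/* Amdahl's law: overall speedup when a fraction `p` of the original run time
 * is accelerated by a factor `s` and the remaining (1 - p) is unchanged. */
double amdahl_speedup(double p, double s)
{
    assert(p >= 0.0 && p <= 1.0 && s > 0.0);
    return 1.0 / ((1.0 - p) + p / s);
}
```

For example, if the offloaded functions took 90% of the run time (an assumed value) and ran 10x faster on the GPU, `amdahl_speedup(0.9, 10.0)` gives about 5.3x. However fast the GPU portion becomes, the speedup in that case cannot exceed 1 / (1 - 0.9) = 10x, because the sequential CPU code remains.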

(42)

COMMON PROGRAMMING MODELS ACROSS MULTIPLE CPUS

x86

Libraries

Programming

Compiler Directives

AmgX, cuBLAS

(43)


GPU ACCELERATED LIBRARIES

“Drop-in” Acceleration for Your Applications

Domain-specific: Deep Learning, GIS, EDA, Bioinformatics, Fluids

Visual Processing: Image & Video

Linear Algebra: Dense, Sparse, Matrix

Math Algorithms: AMG, Templates, Solvers

NVIDIA cuRAND

NVIDIA NPP NVIDIA CODEC SDK

NVBIO Triton Ocean SDK

NVIDIA cuBLAS, cuSPARSE

AmgX cuSOLVER

(44)

OpenACC

Simple | Powerful | Portable

Fueling the Next Wave of Scientific Discoveries in HPC

University of Illinois, PowerGrid (MRI Reconstruction): 70x Speed-Up, 2 Days of Effort

RIKEN Japan, NICAM (Climate Modeling): 7-8x Speed-Up

main()
{
    #pragma acc kernels   // automatically runs on GPU
    {
        <parallel code>
    }
}

8000+ Developers using OpenACC
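The schematic `main()` on this slide can be made concrete. A minimal sketch follows, assuming a SAXPY loop as the parallel code; SAXPY is chosen here for illustration and does not come from the slides. The `copyin` and `copy` data clauses are standard OpenACC and tell the compiler which arrays to move to and from the accelerator.

```c
#include <assert.h>
#include <stddef.h>

/* SAXPY: y[i] = a * x[i] + y[i] for i in [0, n).
 * With an OpenACC-enabled compiler the loop is offloaded to the accelerator.
 * A compiler without OpenACC support ignores the pragma and runs the same
 * loop sequentially on the CPU, so one source serves both targets. */
void saxpy(size_t n, float a, const float *restrict x, float *restrict y)
{
    assert(x != NULL && y != NULL);
    #pragma acc kernels copyin(x[0:n]) copy(y[0:n])
    for (size_t i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

The `restrict` qualifiers state that `x` and `y` do not overlap, which helps the compiler prove that the loop iterations are independent. Because the directive is only a hint, removing it leaves a valid CPU implementation, which is the "single source" property the OpenACC slides emphasize.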

(45)

“OpenACC makes GPU computing approachable for domain scientists. Initial OpenACC implementation required only minor effort, and more importantly, no modifications of our existing CPU implementation.”

Janus Juul Eriksen, PhD Fellow, qLEAP Center for Theoretical Chemistry, Aarhus University

Minimal Effort

Lines of Code Modified: <100 Lines

# of Weeks Required: 1 Week

# of Codes to Maintain: 1 Source

Big Performance

[Chart: LS-DALTON CCSD(T) Module, speedup vs. CPU (0.0x to 12.0x) for Alanine-1 (13 Atoms), Alanine-2 (23 Atoms), Alanine-3 (33 Atoms)]

Benchmarked on Titan Supercomputer (AMD CPU vs Tesla K20X)

LS-DALTON: Large-scale Application for Calculating High-accuracy

OPENACC DELIVERS TRUE PERFORMANCE PORTABILITY

Paving the Path Forward: Single Code for All HPC Processors

[Chart: speedup vs. single CPU core (5x to 35x) for CPU: MPI + OpenMP, CPU: MPI + OpenACC, and CPU + GPU: MPI + OpenACC; plotted values 4.1x, 5.2x, 7.1x, 4.3x, 5.3x, 7.1x, 7.6x, 11.9x, 30.3x]

(47)

CUDA

Super Simplified Memory Management Code

/* Standard C version */
void sortfile(FILE *fp, int N) {
    char *data;
    data = (char *)malloc(N);

    fread(data, 1, N, fp);

    qsort(data, N, 1, compare);

    use_data(data);

    free(data);
}

/* Unified Memory version: one managed allocation is visible to CPU and GPU */
void sortfile(FILE *fp, int N) {
    char *data;
    cudaMallocManaged(&data, N);

    fread(data, 1, N, fp);

    qsort<<<...>>>(data, N, 1, compare);
    cudaDeviceSynchronize();

    use_data(data);

    cudaFree(data);
}

(48)

GPU DEVELOPER ECO-SYSTEM

Consultants & Training

OEM Solution Providers

Debuggers & Profilers: CUDA-GDB, NV Visual Profiler, NVIDIA Nsight, Visual Studio, Allinea, TotalView

Numerical Packages: MATLAB, Mathematica, LabView

Cluster Tools: GPUDirect RDMA, Datacenter GPU Manager

Libraries: FFT, BLAS, SPARSE, LAPACK, NPP, Video, Imaging

C, C++, Fortran, Java, Python, OpenACC, OpenMP


DEVELOP ON GEFORCE, DEPLOY ON TESLA

Designed for Developers & Gamers

Available Everywhere

developer.nvidia.com/cuda-gpus developer.nvidia.com/devbox

Designed for the Data Center

ECC, 24x7 Runtime, GPU Monitoring, Cluster Management, GPUDirect-RDMA, Hyper-Q for MPI, 3 Year Warranty

(50)

(51)

DEEP LEARNING & ARTIFICIAL INTELLIGENCE

Sep 28-29, 2016 | Amsterdam

www.gputechconf.eu #GTC16

SELF-DRIVING CARS

VIRTUAL REALITY & AUGMENTED REALITY

SUPERCOMPUTING & HPC

GTC Europe is a two-day conference designed to expose the innovative ways developers, businesses and academics are using parallel computing to transform our world.

EUROPE’S BRIGHTEST MINDS & BEST IDEAS

Galaxy Formation

Molecular Dynamics

Cosmology

developer.nvidia.com/gpu-accelerated-libraries

http://www.cray.com/sites/default/files/resources/OpenACC_213462.12_OpenACC_Cosmo_CS_FNL.pdf

http://www.hpcwire.com/off-the-wire/first-round-of-2015-hackathons-gets-underway

http://on-demand.gputechconf.com/gtc/2015/presentation/S5297-Hisashi-Yashiro.pdf

http://www.openacc.org/content/experiences-porting-molecular-dynamics-code-gpus-cray-xk7
