NVIDIA Tesla® K20-K20X GPU Accelerators
Benchmarks
Teraflop
Single Precision Performance 3.0
2.0
1.0
0.0 0.35
0.89 (SGEMM)
2.9
Xeon E5-2687W Tesla M2090 Tesla K20X (Fermi)
5x
0x
CP2K- Quantum Chemistry
20x�15x�
10x�
K20 with Hyper-Q K20 without Hyper-Q
2.5x
S p e e d u p v s. D u a l K 2 0 �0 5 10 15 20
Number of GPUs
NVIDIA® changed the high performance computing (HPC) landscape by introducing its Fermi based GPUs that delivered an impressive leap in performance and energy-efficiency while offering a parallel programming model, CUDA ®, as an extension to industry standard languages like C, C++, and Fortran.
Now NVIDIA has introduced the new Tesla® K20-K20X GPU Accelerators that further advance the HPC industry. Built on the revolutionary Kepler™ architecture, these accelerators redefine the standard for energy-efficient computing and feature innovative technologies like SMX, Hyper-Q, and Dynamic Parallelism to boost application performance by up to 10x.
SMX: 3x More Performance P
er Watt
The new SMX (Next Generation Streaming Multiprocessor) is an architectural innovation
designed from t he ground-up to deliver high efficiency performance. With SMX at its core,
Tesla K20/K20X accelerators deliver the industry’s highest single and double precision performance- 3.95 teraflops and 1.31 teraflops respectively for Tesla K20X- at an unprecedented 93% computational efficiency.
1.5
1.0
0.5
0.0 0.17
0.43
1.22
Xeon E5-2687WTesla M2090 (Fermi) Tesla K20X
Teraflop Double Precision Performance
(DGEMM)
Hyper-Q: Easy S
peed-up for Legacy MPI
Codes
Many legacy MPI codes don’t generate enough work to fully occupy the GPU because they were written for CPU cores. Rather than requiring developers to refactor their codes to put more
workload per MPI process, the Hyper-Q f eature
reduces efforts considerably because developers can now throw up to 32 MPI processes with small and medium-sized workloads at a shared GPU. To illustrate the power of Hyper-Q, we picked a traditionally difficult code for GPUs called CP2K, a
popular MPI-based quantum c hemistry code,
showing more than 2x performance improvement when 16 MPI ranks are run on the shared GPU.
Tesla K20X Performance Advantage over Sandy Bridge CPUs
Quicksort
4x 3x 2x 1xWithout Dynamic Parallelism
2x
0x R e la ti v e S o rt in g P e rf o rm a n ceWith Dynamic Parallelism
0 5 10
Dynamic Parallelism: Simplifying
arallel P
rogramming
P
Dynamic Parallelism allows the GPU to operate more autonomously from the CPU by generating new work for itself at run time, from inside a kernel. The concept is simple, but the impact is powerful: it can make programming easier, particularly for algorithms traditionally considered difficult such as divide-and-conquer problems. To showcase its potential, we used Dynamic Parallelism on Quicksort, a
well-Problem Size (Million of Elements)
known algorithm for sorting methods, to
reduce the lines of CUDA code in half while improving performance by 2x.
K20X Benchmarks by Application
Accelerating Key Scientific Applications by up to 10x
Today, hundreds of applications take advantage of GPU acceleration, spanning all scientific disciplines and engineering domains, and the number of applications continues to grow. In the past year alone, the number of CUDA-accelerated applications has grown by over 60%. When Tesla K20 GPU Accelerators are added to servers with Sandy Bridge CPUs, CUDA-enabled applications are typically accelerated up to 10x. The K20X benchmark below shows single node performance of leading applications in various science domains.
CPU results: Dual socket E5-2687w; GPU results: Dual socket E5-2687w + 2 Tesla K20X GPUs *MATLAB results comparing one i7-2600K CPU vs. Tesla K20 GPU
24
3x128 Lattice Volume, BiCSTAB s
olver
20x
18.0x
�
U
16x
14.6x
P
C
x
2
12x
o
9.4x
t
e
7.5x
vi
8x
t
al
e
R
4x
1.0x
0x
2xCPU 2xCPU+ 2xCPU+ 2xCPU+ 2xCPU+
1xM2090 2xM2090 1xK20X 2xK20X
CPU r esults: Dual socket E5-2687w; GPU r esults: Dual socket E5-2687w + 2 Tesla K20X GPUs, 64GB per node
Here’s a closer look at three of the applications listed above.
Chroma: High Energy &
Nuclear Physics
Chroma is used by scientists to test alternatives to the Standard Model of physics in order to develop a deeper understanding of the fundamental properties of matter and energy. The benchmark below show significant performance boost when adding one or two Tesla K20X GPU Accelerators to dual socket CPUs.
SPECFEM3D: Earth Sciences
Researchers use SPECFEM3D for a deeper understanding of the phenomena that shape seismic activity, thereby allowing engineers to assess potential hazards and to build structural countermeasures. This code was a Gordon Bell winner in 2008 prior to the GPU work, so it was a highly tuned code even before the recent work to accelerate on GPUs.
256x128 3D Spatial Discretization
16.0x
12.8x
�
U
P
12.0x
C
x
9.3x
2
8.8x
o t
8.0x
�
e
vi
5.4x
t
al
e
4.0x
�
R
1.0x
0.0x
�
2xCPU 2xCPU+ 2xCPU+ 2xCPU+ 2xCPU+
1xM2090 2xM2090 1xK20X 2xK20X
CPU r esults: Dual socket E5-2687w; GPU r esults: Dual socket E5-2687w + 2 Tesla K20X GPUs, 64GB per node
SPFP-JAC_production_NVE
10.0x
1.0x
3.4x
4.6x
7.1x
8.2x
R
e
la
ti
v
e
to
2
x
C
P
U
�
8.0x
6.0x
4.0x
2.0x
0.0x
2xCPU 2xCPU+ 2xCPU+ 2xCPU+ 2xCPU+
1xM2090 2xM2090 1xK20X 2xK20X
CPU results: Dual socket E5-2687w; GPU results: Dual socket E5-2687w + 2 Tesla K20X GPUs, 64GB per node
AMBER: Molecular Dynamics
Molecular dynamics (MD) allows the study of biological and chemical systems at the atomistic
level on timescales from f emtoseconds to milliseconds. Numerous software packages exist for
conducting MD simulations of which one of the most widely used is AMBER. With the Tesla K20X GPU Accelerators, acceleration by up to 80% is observed on AMBER, over compared to Tesla M2090 GPUs.
32 Atom Emsemble of Iron (Fe)
40x30x
20x
6
10x
0x
# Sockets
R
e
lat
iv
e
to
a
S
in
g
le
C
P
U
o
n
ly
N
o
d
e
Cray XK7 -Tesla K20X Cray XK7- CPU
0 20 40 60 80 100 120 140 160 180 200
Large Cluster Scaling for GPU-Accelerated Applications
Large clusters around the world are increasingly deploying GPUs to accelerate their workload.
In the Top500 list, from J une 2011 to June 2012, number of GPU-accelerated systems grew by
over 400%.
Whether an application is GPU-accelerated or not, designing code to scale well across many nodes has never been an easy task. However, many GPU-accelerated applications are scaling well as developers work on extracting more parallelism by allocating more parallel work within a node and use MPI to distribute GPU workloads.
WL-LSMS: Material Science
WL-LSMS simulates the magnetic behavior of materials at the atomic and nanoscale in order to design lightweight, powerful magnetic components for highly efficient electric motors,
generators, and magnetic storage devices. WL-LSMS was a Gordon Bell winner in 2009, so it was already a highly-tuned CPU code prior to the recent work to accelerate on GPUs.
x
QMCPACK: Material S
cience
QMCPACK is used by researchers to develop new insights into condensed matter physics, materials science, and chemistry by using more physically accurate methods. For large
systems on the order of 200 electrons or more, simulations that traditionally would take upwards of 12 hours or more can now run on the order of 3 hours per simulation.
Compute
3x3x1
Graphite
Efficiency1500000
CrayXK7TeslaK20X CrayXK7CPU
1000000
4x
500000
0
0 500 1000 1500 2000 2500
#ofCompute Nodes
ns/day
100x
STMV
2.0
1.5
1.0
0.5
0.0
CrayXK7 K20X
CrayXK7 CPU
4x
128 256 512 768
#ofComputeNodes
NAMD: Molecular Dynamics
NAMD (Not (just) Another Molecular Dynamics program) is a free-of-charge molecular dynamics simulation package written using the Charm++ parallel programming model, noted for its parallel efficiency and often used to simulate large systems (millions of atoms). NAMD also scales well across many nodes accelerated with GPUs.
Notice
ALL INFORMATION PROVIDED IN THIS TECHNICAL BRIEF, INCLUDING COMMENTARY, OPINION, NIVIDA DESIGN SPECIFICATIONS, REFERENCE BOARDS, FILES, DRAWINGS, DIAGNOSTICS, LISTS, AND OTHER DOCUMENTS (TOGETHER AND SEPARATELY, “MATERIALS”) ARE BEING PROVIDED “AS IS.” NVIDIA MAKES NO WARRANTIES, EXPRESSED, IMPLIED, STATUTORY, OR OTHERWISE WITH RESPECT TO THE MATERIALS, AND EXPRESSLY DISCLAIMS ALL IMPLIED WARRANTIES OF
NONINFRINGEMEN, MERCHANTABILITY, AND FITNESS FOR A PARTICULAR PURPOSE. Information furnished is believed to be accurate and reliable. However, NVIDIA Corporation assumes no responsibility for the consequences of use of such information or for any
infringement of patents or other rights of third parties that may result from its use. No license is granted by implication of otherwise under any patent rights of NVIDIA Corporation. This
publication supersedes and replaces all other information previously supplied. NVIDIA
Corporation products are not authorized as critical components in life support devices or system without express written approval of NVIDIA Corporation.
Copyright
© 2012 NVIDIA Corporation. All rights reserved. NVIDIA, the NVIDIA logo, Tesla, Kepler, and CUDA are trademarks and / or registered trademarks of NVIDIA Corporation. All company and product names are trademarks or registered trademarks of the respective owners with which they are associated. Features, pricing, availability, and specifications are all subject to change without notice. NOV 2012