Does Your Software Scale ?
Multi-GPU Scaling for Large Data Visualization
Agenda
Agenda
•
Multi-GPU Scaling for Large Data Visualization (Thomas Ruge, NVIDIA)
– Multi GPU why ?Multi GPU why ?
– Different Multi-GPU rendering methods, focus on database decomposition
– System Analysis of depth compositing and its impact on multi-GPU rendering
– Performance results of multi-GPU rendering with NVSG and OSG
•
NVIDIA’s Multi-GPU SDK (Thomas Volk, NVIDIA)
– How to use NVIDIA’s MGPU SDK to get scalability in your application
•
NVSG-Scale (Subu Krishnamoorthy, NVIDIA)
– Multi-GPU implemented in NVSG
•
Technical Demos (with Horn Thorolf Tonjum, Stormfjord)
•
Q&A
Multi-GPU: Why ?
Multi GPU: Why ?
•
Demand for Large Data visualization beyond capacity of 1 GPU
•
Demand for Large Data visualization beyond capacity of 1 GPU
–
Boeing 777 (350 MTriangles) (between 4.2 – 29 GB of graphics data)
–
Visible Human 16 GB of 3D Textures
•
Single GPU technology approaches cliff (like CPU’s do)
–
Power consumption is growing
Power consumption is growing
–
Cooling problems with high heat output
–
High production costs per chip of dies with large footprints
G80 : 681 M transistors, 90nm, 484 mm2
GT200 1 b
i
65
576
2
GT200: 1 bn transistors, 65 nm, 576 mm2
•
Higher Performance than fastest single GPU on the market
•
Higher Performance than fastest single GPU on the market
– MGPU can give you a time advantage, get tomorrow’s single GPU performance
Multi-GPU: Goal
Multi GPU: Goal
G l G t li
l bilit ith
b f GPU’ i
•
Goal: Get linear scalability with number of GPU’s in
–
Rendering performance
•
Triangle throughput
g
g p
•
Fill rate
–
Data size
Image resolution
–
Image resolution
(not topic of this presentation)•
Steps to get scalability:
–
Distribution of the workload/rendering tasks to all GPU’s
–
Collection of rendering results from all GPU’s
–
Assemble the rendering results in final image
Multi-GPU: Distributing the Load
g
•
Pixel Decomposition (e.g. SLI Antialiasing )
– Assigns different pixels or subpixels to each
GPU
– helps with fill-rate bound apps, good
approach for AA
•
Decomposition in time (e.g. SLI AFR)
– Different GPUs render different frames
– scales well for vertex-processing and fill rate
Multi-GPU: Screen Decompositing
p
g
S
Fi
S
D
i i
(
•
Sort-First or Screen Decomposition (e.g.
SLI SFR or SLI Mosaic)
– Different GPUs are rendering different
ti f th
GPU 1 GPU 2
portions of the screen
– fill rate bound applications scale well
– Good method for Displays with very high
resolutions (e g mersive Displays or Sony
Final Image resolutions (e.g. mersive Displays or Sony
4k) Compositing Schemes: GPU 4 GPU 3 Display 1 Display 2 Compositing Schemes:
None, use (Tiled walls or multi-input Displays)
Simple stitching (e g SLI SFR) Display 3 Display 4
Simple stitching (e.g. SLI SFR)
Multi-GPU: Database Decomposition
p
S
L
D
b
D
i i
GPU 1 GPU 2•
Sort-Last or Database Decomposition
– Different GPUs render different portions of
the dataset
S l ll i V t i fill t
GPU 1 GPU 2
– Scales well in Vertex processing, fill rate
bound and graphics card memory bound applications
Final Image
GPU 3 GPU 4
•
Compositing Schemes
– Depth or z-Buffer compositing (needs
RGB and Depth Buffer from rendering GPU’s
Final Image
GPU 3 GPU 4
GPU 1 GPU 2
RGB and Depth Buffer from rendering GPU s
– Alpha-compositing (needs only RGBA buffer)
GPU 1 GPU 2
=> Database decomposition is the method
that gives us more graphics card
memory
GPU 3 GPU 4Final Image
Compositing: Performance
Compositing: Performance
All D iti S h i C iti t t th fi l i
All Decomposition Schemes require Compositing to create the final image This can be done:
• with special purpose hardware
– very fast, typically very low additional latency and performance impact
Ti ht t ifi hi h d ( ’t i / t h diff t d )
– Tight to specific graphics hardware (can’t mix/match different cards)
– only subset of all possible composition schemes possible (e.g. SLI doesn’t support z- or
alpha-compositing)
– no general purpose compositing hardware available (but several attempts in the past)g p p p g ( p p )
• with commodity hardware and some software
– Adds additional latency, performance impact depends on final image size and compositing
mode mode
– not very graphic card dependent (mix/match possible) – can cover all known compositing schemes
– New HW developments (e.g. PCIe Gen 2) reduces the impact of SW-compositing
© 2008 NVIDIA Corporation.
A System Analysis of Compositing with
commodity hardware
commodity hardware
S t
A l i f d t b
d
iti
ith
System Analysis of database decomposition with
depth compositing
• Goal: • Goal:
– Understand the impact of depth compositing on multi-GPU
rendering
– Means to estimate performance benefits – Valid on today’s HWy
• Assumptions:
– single shared memory system (no clusters) – only one displaying GPU (no powerwalls)
GPU 1
GPU 2 Displaying GPU
– buffer transports go through host memory (no specific HW
tricks) • Tasks: Di ib kl d d i GPU GPU n p y g Host
– Distribute workload to rendering GPUs
– Download resulting 2D-buffers (from GPUs framebuffer) to
host memory
– Upload 2D-buffers to displaying GPU – composite 2D-buffers to final imagecomposite 2D buffers to final image
Compositing with 2 GPUs
Compositing with 2 GPUs
St t A l i ith 2 GPU
Start Analysis with 2 GPUs
System description:
– 2 rendering GPUs, 1 displaying GPU
– Optimization: displaying GPU is also a rendering
GPU GPU
=> reduced number of buffer transports
Depth compositing needs RGB and Z-buffer
GPU 2 GPU 1
E.g. 1920x1200 pixels :
sx = 1920, sy = 1200, bpp = 8 (BGRA+DEPTH_STENCIL)
18,432,000 Bytes per frame buffer
73 728 000 Bytes transported through system
Host Memory Host Memory
73,728,000 Bytes transported through system per frame
@60 Hz = 4,423,680,000 > 4 GB per second !
=> High load on system Bus
CPU
Depth Compositing Performance
p
p
g
on 790i Ultra
S t
790i Ult
2 FX 3700 (PCI 2 0)
System: 790i Ultra, 2 FX 3700 (PCIe 2.0)
Specs relevant for Compositing:
Specs relevant for Compositing:
•
Buffer downloads:
2.8 GB/s
Ù
360 ms/GB
•
host memory copy:
4.1 GB/s
Ù
240 ms/GB
•
Buffer uploads:
3.7 GB/s
Ù
270 ms/GB
•
Execution time of the actual compositing (fragment shader) is
negligible
GPU 2 GPU 1 GPU 2 GPU 1 Color : 2.8 GB/s Depth: 2.8 GB/s 4.1 GB/s 3.7 GB/s Landing Zone Host Memory Launching Pad Host MemoryDepth Compositing Performance
p
p
g
Time to depth-composite 1 MPixel (worst – best case)
buffer size = s ·s ·bpp = 1024·1024·8 = 8 MByte buffer size = sx·sy·bpp = 1024·1024·8 = 8 MByte
• Compositing performance for sequential execution: 870 ms/GB Ù 143 fps
worst case (no concurrency) ttotal = tdn+ tmc+ tup
• Compositing performance for parallel execution: 360 ms/GB Ù 350 fps
best case (full concurrency) ttotal = max(tdn, tmc, tup )
• Compositing performance measured: 550 ms/GB Ù 227 fps
(some concurrency)
td = Time to download BGRA buffer [s/GB]
GPU 2 GPU 1
1.8 GB/s
© 2008 NVIDIA Corporation.
tdc Time to download BGRA buffer [s/GB]
tdc= Time to download DEPTH_STENCIL buffer [s/GB] tdn= (tdc+ tdc)/2
tmc= Time to copy from host mem to host mem [s/GB] tup= Time to upload [s/GB]
Host Memory Host Memory
Single GPU Performance:
T i
l th
h t
d d t i
Triangle throughput and datasize
R d i f
d i
Rendering performance measured in
triangles/sec depending on data size on
graphics card with limited memory
o
rmance
Single GPU cliff
1/tin Rende ring perf o [ triangles / s] 1/tout l d f d Data size [ triangles]
Example, Rendering performance on Quadro FX 5600:
bytes per triangle = 3 vertices(36 bytes) + 3 normals( 36 bytes) = 72 bytes / triangle NVSG-test program renders individual triangles in VBO’s:
– pin = 143 Mtriangles/s for in-memory data => tin = 7 ns
– pout = 25 Mtriangles / s for out-of memory data => tout = 40 ns
n 20 Mtriangles
Multi GPU Rendering: 2 GPUs
Multi GPU Rendering: 2 GPUs
– Compositing impact is known R d i P f – Compositing impact is known
– Rendering performance on single GPU is known
Let’s estimate what performance gain we can expect over a growing dataset Assumptions: Rendering Performance 1.50E+08 2.00E+08 2.50E+08 3.00E+08 p er fo rm an ce a ng le s/ s ]
Dual GPU Cliff
Assumptions:
- Execution of rendering and compositing happens sequentially (will be changed in the future):
- Rendering happens concurrently w/o overhead on both GPU’s - Fixed buffer size = 1k x 1k = 1Mpixel => tcompositing= 4.4 ms
i l GPU t t l ti f 0.00E+00 5.00E+07 1.00E+08 0 10 20 30 40 50 60 70 triangles [millions] re n d e r p [t ri a
triangle throughput 1 GPU
Single GPU Cliff
single GPU total time per frame:
- ttotal,1 GPU= trender(n) = n0·tin+ (n-n0) ·tout
dual GPU total time per frame:
- ttotal, 2 GPU= trender(n/2) + tcompositing
f i t / t
triangle throughput 1 GPU triangle throughpout no compositing triangle throughput 2 GPU compositing
2 GPU relative to 1 GPU
Area of super linear scalability
⇒ performance gain s2GPU= ttotal,1 GPU/ ttotal,2 GPU
peak performance to be expected at n = 2 n0:
s(n=2·n0) = (n0·tin+ n0·tout)/(n0·tin+tc) Ù
(tin+ tout)/(tin+tc/n0)
improve peak performance by: 3
4 5 6 7 8 to r o ver si n g le G P U
reduce tc (e.g. optimize buffer transports)
increase n0 (e.g. pack more triangles in card like tristrips instead
individual triangles)
- Theoretical peak max smax = (tin+ tout)/tin - applied to FX5600 s = (7ns +40ns)/7ns = 6 7 0 1 2 0 10 20 30 40 50 60 70 triangles [millions] scal e f a c t
scale factor 2 GPUs no compositing scale factor 2 GPU compositing
© 2008 NVIDIA Corporation.
Multi-GPU Rendering: 4 GPUs
Multi GPU Rendering: 4 GPUs
Generalize 2 GPU formula to k Generalize 2-GPU formula to k
GPUs:
Assumption:
buffer transports time grows linearly with number of GPUs
Rendering Performance 4.00E+08 5.00E+08 6.00E+08 for m a nc e es/ s] p g y
k GPUs total time per frame:
- ttotal, k GPU= trender(n/k) + (k-1)·tcompositing
⇒ performance gain skGPU= ttotal,1 GPU/ ttotal,k GPU
0.00E+00 1.00E+08 2.00E+08 3.00E+08 0 10 20 30 40 50 60 70 re nde r pe rf [t ri a ngl
multi-GPU relative to 1 GPU kGPU total,1 GPU total,k GPU
peak performance to be expected at n = k n0: s(n=k·n0) = (n0·tin + (k-1) ·n0·tout)/(n0·tin+(k-1)·tc) Ù
smax = (tin + (k-1) ·tout)/(tin+(k-1) ·tc/n0)
triangles [millions]
triangle throughput 1 GPU triangle throughput 2 GPU compositing
triangle throughput 4 GPUs
6 8 10 12 14 16 18 o r ov er s ingl e G P U
- Theoretical peak max smax= (tin + (k-1) · tout)/tin = 1 + (k-1) · tout/tin 0 2 4 6 0 20 40 60 80 100 120 triangles [millions] sca le f a ct o f G f G E.g. FX 5600 - k = 4 FX 5600 smax = (7ns +3 · 40ns)/7ns = 18.1 - k = 8 FX 5600 smax = (7ns +7 · 40ns)/7ns = 41
Frame Rates
90
Frame Rates
Application classes by frame rate
40 50 60 70 80 90 m er at e [ 1 /s ]
Application classes by frame rate
>= 60 Hz, Visual Simulation (e.g. professional flight simulators) >= 30 Hz, < 60 Hz, Entertainment (gaming) > 60 Hz 30 – 60 Hz 0 10 20 30 0 20 40 60 80 100 120 triangles [millions] fr a m >= 15 Hz, < 30 Hz VR, Design Review
>= 5 Hz, < 15 Hz Modeling, Large CAD Data Visualization, Seismic Interpretation,…
15 – 30 Hz
5 – 15 Hz
triangles [millions]
framerate 1 GPU framerate 2 GPUs framerate 4 GPUs
20 Best scalability for Large Models (low impact by compositor)
=> very good fit for Large Data visualization like Seismic interpretation 6 8 10 12 14 16 18 fr am e rate [1 /s ] 5 – 15 Hz 15 – 30 Hz 0 2 4 0 20 40 60 80 100 120 triangles [millions]
framerate 1 GPU framerate 2 GPUs framerate 4 GPUs
Results System Analysis
Results System Analysis
System Analysis shows: System Analysis shows:
• Performance scales with number of GPUs
=> Higher framerate => Higher framerate
• Graphics card memory scales linearly with number of GPUs
=> Larger models
• Small models don’t scale very well
=> Multi-GPU rendering (database decomposition) not useful for small models
• Peak scalability far beyond linear (Larger Cache effect)
Multi GPU rendering for Large models works !
Multi-GPU rendering for Large models works !
Results Multi GPU rendering: OSG
g
OSG = Open Scene Graph
Open Source Scene Graph widely used in Academia, Simulation, Oil&Gas and other industries We extended OSG to render in multiple threads (one thread per GPU) and added the MGPU SDK In this test OSG used around 140 bytes per triangle
Performance multi GPU vs. Standalone
Triangle Performance QP IV 600 0% 800.0% 1000.0% 1200.0% r e l. S A [% ] 2.00E+08 4.00E+08 6.00E+08 er fo rm a n ce tr ia n g les / s] 0.0% 200.0% 400.0% 600.0% P e rf or m a nc e 0.00E+00
0 5000000 1E+07 1.5E+07 2E+07 2.5E+07 3E+07
Data Size [primitives]
P [t
Standalone 2 GPU no Compositing
© 2008 NVIDIA Corporation.
0 5000000 10000000 15000000 20000000 25000000 Data Size [Primitives]
2 GPU's 4 GPU's
4 GPU no Compositing 2 GPU with Compositing
Results Multi GPU rendering: NVSG
g
NVSG = NVIDIA’s Scene Graph
Very efficient Scene Graph used in automotive simulation and academia Very efficient Scene Graph used in automotive, simulation and academia NVSG used 72 bytes per triangle
Triangle Performance QP IV
Performance multi GPU vs. Standalone
1200 0% 2.00E+08 3.00E+08 4.00E+08 5.00E+08 a n ce [ tr ian g les ] 600.0% 800.0% 1000.0% 1200.0% m an ce rel S A [ % ] 0.00E+00 1.00E+08
0.00E+00 1.00E+07 2.00E+07 3.00E+07 4.00E+07 5.00E+07
Data Size [primitives]
Pe rf o rm a 0.0% 200.0% 400.0% 0.00E+ 00 5.00E+ 06 1.00E+ 07 1.50E+ 07 2.00E+ 07 2.50E+ 07 3.00E+ 07 3.50E+ 07 4.00E+ 07 4.50E+ 07 Pe rf o rm
Standalone 2 GPU with Compositing 4 GPU with Compositing
Data size [primitives]
MGPU – SDK Introduction
Thomas Volk
Agenda
•
Scope of the SDK
•
Scope of the SDK
•
SDK features
Scope of the MGPU SDK
•
Multi-GPU set-up and addressing
• Affinity context/screen per GPU
• SDK utility class for context handling and off-screen rendering
•
Application parallelism
• Multi-threading of rendering loop: already solved in lots of scene Multi threading of rendering loop: already solved in lots of scene
graph implementations
• Multi-process and out of core/cluster solutions: highly specialized
solutions already availabley
•
Load decomposition and balancing: needs to be
addressed in SG for sort last (see NVSGScale)
•
Sample/reference implementations
•
Sample/reference implementations
•
Utility classes
•
Image compositing with huge data transfer (5GB/s) and
low latency: SDK compositor
SDK features
• OpenGL based
• Professional applications
• QuadroPlex only
• With current systems up to 2-8 GPUs, e.g. 2xD2s, 2xD4s • Up to 16 GB of addressable video memory
• Platforms: win32/64 linux 64
• Platforms: win32/64, linux 64
• In comparison to SLI non-transparent to application
• Utility functions to create MGPU aware GL contexts and drawables
• Platform independentPlatform independent • Stereo
• AA
• Image compositor for sort first and sort last based applications
b i f f i i
• Easy to use abstract interface for compositing
• Compositor implementation based on latest technologies, no migration effort for
applications (next gen hardware will provide faster transport)
• Multi-threaded, shared memory
C fi bl 1 1 1 hi hi l h b id
• Configurable: 1-1, n-1, hierarchical, hybrid
Compositor Types and Configurations
1
2
1
2
1
2
+
+
Screen tiling Alpha compositing
1
2
12
+
+
+
Depth compositing+
Hierarchical compositing © 2008 NVIDIA Corporation.System Overview
Compositor MGPU SDK Application thread Scene graph Application thread Scene graph Application thread Scene graphDst node Src node Src node
GL context GL context GL contextGL-context S
GL-context Dst drawable GL-context Src drawable GL-contextGL-context Src drawable
SDK Classes
ICCompositor
+setLayout() +createsICFactory
y () +getSrcNode() +getDstNode()RenderArea
+initialize()ICFactory
+createAlphaCompositor() +createDepthCompositor() +createDepthCompositor() n +initialize() +terminate() +resize() +makeCurrent() +releaseCurrent()ICNode
+intialize() +terminate() i () +showFBContents() +hideFBContents() +composite()ICSrcNode
ICDstNode
© 2008 NVIDIA Corporation.Code Example
p
//init in control/main threadICDepthCompositor* c = CompositorFactory::getInstance()->getDepthCompositor(); c->setLayout(IC_ALLNODES, IC_RECT(width, height));
ICNode* myCompositorNode[MAX NODES]; ICNode* myCompositorNode[MAX_NODES]; myCompositorNode[0] = c->getSrcNode(idx);
//init in every thread and context //create affinity context and drawable; RenderArea ra;
ra->initialize(); ra->makeCurrent();
//initialize GL resources of compositor myCompositornode[i]->initialize();
//in every render-thread void drawFrame(int myID) {
drawPartialScene(myID); drawPartialScene(myID);
//trigger image composition
myCompositornode[myID]->composite(); }
Sample Sequence Diagram
AppThread Compositor RenderThread SrcNode RenderThread DstNode
i it setLayout resize init init init init setLayout paint drawFrame drawFrame
compositecomposite composite
composite
MGPU Threadingg
A th d draw ic d i simul App thread Render src thread Render src thread simul draw d draw ic draw ic Render src thread Render dst thread draw d i simul App thread Render src thread simul d simul draw ic draw ic draw ic Render src thread Render src thread Render dst thread draw draw draw draw ic drawWrap up
p p
•
Current status
•
Beta release next week
•
No stereo and AA, linux only for beta
F
l
il bl
•
Freely available
•
Release mid Oct
•
Future features
•
Future features
•
Better hardware support
•
Access to frame-buffers
Access to frame buffers
•
Utility classes for load balancing
•
Performance feedback
•
Extensible shader compositor
•
Additional color formats
Multi-GPU Rendering with NVIDIA Scene Graph
The World’s Fastest Scene Graph On Steroids
Multi GPU Rendering with NVIDIA Scene Graph
Subu Krishnamoorthy Subu Krishnamoorthy
Distributing Scene Objects
g
j
•
Audience – specifies which GPUs are responsible for
processing a scene graph object
•
DistributionTraverser – assigns Audiences to balance
l d
h GPU
load on each GPU
•
GLDistributedTraverser – respects Audiences and
t
l h
it GPU i
l d d
© 2008 NVIDIA Corporation.
Distribution Scheme
•
Opaque objects
Opaque objects
Least loaded GPU included, others
excluded
excluded
•
Translucent and overlay objects
•
Translucent and overlay objects
Primary GPU included, others excluded
•
Implies use of Depth Compositor
Image composited before translucent
and overlay objects are drawn
The Components
p
•
GLAffinityArea creates a thread that drives an associated
GLAffinityArea creates a thread that drives an associated
GLDistributedTraverser
•
GLDistributedRenderArea manages the collection of N-1
© 2008 NVIDIA Corporation.
GLDistributedRenderArea manages the collection of N 1
GLAffinityAreas
Ease of Use
•
Derive client RenderArea from
GLDistributedRenderArea
GLDistributedRenderArea
(instead of nvui::RenderArea)
class wxGLDistributedRenderArea : public wxGLCanvas,p
public nvui::GLDistributedRenderArea {
... };
•
Override GLDistributedRenderArea::init()
•
Create Primary GL context (affinity context) and make it current
I
k b
i
l
i
•
Invoke base implementation
bool wxGLDistributedRenderArea::init(nvui::RenderArea *shareArea) {
...
m_glContext = new wxGLAffinityContext(this, shareArea); wxGLCanvas::SetCurrent(*m_glContext); ... GLDistributedRenderArea::init(shareArea); © 2008 NVIDIA Corporation. ... }
Depth Composited Image
p
p
g
Spatial Distribution
p
•
Audience assigned based object
g
j
bounding box
I
li
f Al h C
it
Summaryy
•
Powerful new feature of NVSG
Powerful new feature of NVSG
•
Visualize large models at interactive
g
frame rates
Visual Computing In The Oil Sector, Horn Thorolf Tonjum
Demos
Dinosaurs in the Museum
Dinosaurs in the Museum
Model characteristics:
• Model designed in Maya, AutoDesk • 217 Million Triangles
Rendering Hardware:
HP 8600 ith 32 GB S t M
• HP xw8600 with 32 GB System Memory
• 2xD2’s (4xGT200 with 4 GB FB memory) with 16 GB
total FB memory Rendering Performance: Rendering Performance: 1 GPU = 0.7 fps
2 GPU’s = 3 ½ fps = 5 times faster than 1 GPU 4 GPU’s = 6 fps = 8 ½ times faster than 1 GPU => Total max. Triangle throughput = 1.3 GTri/s
Multi-GPU rendering of
Multi GPU rendering of
Kristin,
Model characteristics:
a detailed off-shore
l tf
Model characteristics:
• Kristin is an off-shore platform in the North Sea
• Model designed in MicroStation, Bentley
• 230 Million Triangles (individual triangles, no triangle
platform
g ( g , g
strips)
Rendering Hardware:
HP 8600 i h 32 GB S M
• HP xw8600 with 32 GB System Memory
• 2xD2’s (4xGT200 with 4 GB FB memory) with 16 GB
total FB memory Rendering Performance: 1 GPU = 0.4 fps
2 GPU’s = 2 ½ fps = 6 times faster than 1 GPUp
4 GPU’s = 4 ½ fps = 11 times faster than 1 GPU => Total max. Triangle throughput = 1 GTri/s
© 2008 NVIDIA Corporation.
Summary
Summary
S
l i
f
l i GPU d i
•
System analysis of multi-GPU rendering
•
How to do multi-GPU rendering with NVIDIA’s Multi-GPU SDK
How to do multi GPU rendering with NVIDIA s Multi GPU SDK
•
Ease of use of multi-GPU rendering with NVSG
•
Large CAD data visualization in the Oil&Gas industry
=> Database decomposition together with
depth compositing is a strong tool to give
depth compositing is a strong tool to give
you tomorrows graphics performance today
The End
The End
Questions ?
Questions ?