Multi-GPU Scaling for Large Data Visualization (Thomas Ruge, NVIDIA)

(1)

Does Your Software Scale ?

Multi-GPU Scaling for Large Data Visualization

(2)

Agenda

•

Multi-GPU Scaling for Large Data Visualization (Thomas Ruge, NVIDIA)

– Multi GPU why ?Multi GPU why ?

– Different Multi-GPU rendering methods, focus on database decomposition

– System Analysis of depth compositing and its impact on multi-GPU rendering

– Performance results of multi-GPU rendering with NVSG and OSG

•

NVIDIA’s Multi-GPU SDK (Thomas Volk, NVIDIA)

– How to use NVIDIA’s MGPU SDK to get scalability in your application

•

NVSG-Scale (Subu Krishnamoorthy, NVIDIA)

– Multi-GPU implemented in NVSG

•

Technical Demos (with Horn Thorolf Tonjum, Stormfjord)

•

Q&A

(3)

Multi-GPU: Why ?

Multi GPU: Why ?

•

Demand for Large Data visualization beyond capacity of 1 GPU

•

Demand for Large Data visualization beyond capacity of 1 GPU

–

Boeing 777 (350 MTriangles) (between 4.2 – 29 GB of graphics data)

–

Visible Human 16 GB of 3D Textures

•

Single GPU technology approaches cliff (like CPU’s do)

–

Power consumption is growing

–

Cooling problems with high heat output

–

High production costs per chip of dies with large footprints

G80 : 681 M transistors, 90nm, 484 mm2

GT200 1 b

i

65

576

2

GT200: 1 bn transistors, 65 nm, 576 mm2

•

Higher Performance than fastest single GPU on the market

•

Higher Performance than fastest single GPU on the market

– MGPU can give you a time advantage, get tomorrow’s single GPU performance

(4)

Multi-GPU: Goal

Multi GPU: Goal

G l G t li

l bilit ith

b f GPU’ i

•

Goal: Get linear scalability with number of GPU’s in

–

Rendering performance

•

Triangle throughput

g

g p

•

Fill rate

–

Data size

Image resolution

–

Image resolution

(not topic of this presentation)

•

Steps to get scalability:

–

Distribution of the workload/rendering tasks to all GPU’s

–

Collection of rendering results from all GPU’s

–

Assemble the rendering results in final image

(5)

Multi-GPU: Distributing the Load

g

•

Pixel Decomposition (e.g. SLI Antialiasing )

– Assigns different pixels or subpixels to each

GPU

– helps with fill-rate bound apps, good

approach for AA

•

Decomposition in time (e.g. SLI AFR)

– Different GPUs render different frames

– scales well for vertex-processing and fill rate

(6)

Multi-GPU: Screen Decompositing

p

g

S

Fi

S

D

i i

(

•

Sort-First or Screen Decomposition (e.g.

SLI SFR or SLI Mosaic)

– Different GPUs are rendering different

ti f th

GPU 1 GPU 2

portions of the screen

– fill rate bound applications scale well

– Good method for Displays with very high

resolutions (e g mersive Displays or Sony

Final Image resolutions (e.g. mersive Displays or Sony

4k) Compositing Schemes: GPU 4 GPU 3 Display 1 Display 2 Compositing Schemes:

None, use (Tiled walls or multi-input Displays)

Simple stitching (e g SLI SFR) Display 3 Display 4

Simple stitching (e.g. SLI SFR)

(7)

Multi-GPU: Database Decomposition

p

S

L

D

b

D

i i

GPU 1 GPU 2

•

Sort-Last or Database Decomposition

– Different GPUs render different portions of

the dataset

S l ll i V t i fill t

GPU 1 GPU 2

– Scales well in Vertex processing, fill rate

bound and graphics card memory bound applications

Final Image

GPU 3 GPU 4

•

Compositing Schemes

– Depth or z-Buffer compositing (needs

RGB and Depth Buffer from rendering GPU’s

Final Image

GPU 3 GPU 4

GPU 1 GPU 2

RGB and Depth Buffer from rendering GPU s

– Alpha-compositing (needs only RGBA buffer)

GPU 1 GPU 2

=> Database decomposition is the method

that gives us more graphics card

memory

GPU 3 GPU 4

Final Image

(8)

Compositing: Performance

All D iti S h i C iti t t th fi l i

All Decomposition Schemes require Compositing to create the final image This can be done:

• with special purpose hardware

– very fast, typically very low additional latency and performance impact

Ti ht t ifi hi h d ( ’t i / t h diff t d )

– Tight to specific graphics hardware (can’t mix/match different cards)

– only subset of all possible composition schemes possible (e.g. SLI doesn’t support z- or

alpha-compositing)

– no general purpose compositing hardware available (but several attempts in the past)g p p p g ( p p )

• with commodity hardware and some software

– Adds additional latency, performance impact depends on final image size and compositing

mode mode

– not very graphic card dependent (mix/match possible) – can cover all known compositing schemes

– New HW developments (e.g. PCIe Gen 2) reduces the impact of SW-compositing

(9)

A System Analysis of Compositing with

commodity hardware

S t

A l i f d t b

d

iti

ith

System Analysis of database decomposition with

depth compositing

• Goal: • Goal:

– Understand the impact of depth compositing on multi-GPU

rendering

– Means to estimate performance benefits – Valid on today’s HWy

• Assumptions:

– single shared memory system (no clusters) – only one displaying GPU (no powerwalls)

GPU 1

GPU 2 Displaying GPU

– buffer transports go through host memory (no specific HW

tricks) • Tasks: Di ib kl d d i GPU GPU n p y g Host

– Distribute workload to rendering GPUs

– Download resulting 2D-buffers (from GPUs framebuffer) to

host memory

– Upload 2D-buffers to displaying GPU – composite 2D-buffers to final imagecomposite 2D buffers to final image

(10)

Compositing with 2 GPUs

St t A l i ith 2 GPU

Start Analysis with 2 GPUs

System description:

– 2 rendering GPUs, 1 displaying GPU

– Optimization: displaying GPU is also a rendering

GPU GPU

=> reduced number of buffer transports

Depth compositing needs RGB and Z-buffer

GPU 2 GPU 1

E.g. 1920x1200 pixels :

s_x= 1920, s_y= 1200, bpp = 8 (BGRA+DEPTH_STENCIL)

18,432,000 Bytes per frame buffer

73 728 000 Bytes transported through system

Host Memory Host Memory

73,728,000 Bytes transported through system per frame

@60 Hz = 4,423,680,000 > 4 GB per second !

=> High load on system Bus

CPU

(11)

Depth Compositing Performance

p

g

on 790i Ultra

S t

790i Ult

2 FX 3700 (PCI 2 0)

System: 790i Ultra, 2 FX 3700 (PCIe 2.0)

Specs relevant for Compositing:

•

Buffer downloads:

2.8 GB/s

Ù

360 ms/GB

•

host memory copy:

4.1 GB/s

Ù

240 ms/GB

•

Buffer uploads:

3.7 GB/s

Ù

270 ms/GB

•

Execution time of the actual compositing (fragment shader) is

negligible

GPU 2 GPU 1 GPU 2 GPU 1 Color : 2.8 GB/s Depth: 2.8 GB/s 4.1 GB/s 3.7 GB/s Landing Zone Host Memory Launching Pad Host Memory

(12)

Depth Compositing Performance

p

g

Time to depth-composite 1 MPixel (worst – best case)

buffer size = s ·s ·bpp = 1024·1024·8 = 8 MByte buffer size = s_x·s_y·bpp = 1024·1024·8 = 8 MByte

• Compositing performance for sequential execution: 870 ms/GB Ù 143 fps

worst case (no concurrency) t_total= t_dn+ t_mc+ t_up

• Compositing performance for parallel execution: 360 ms/GB Ù 350 fps

best case (full concurrency) t_total= max(t_dn, t_mc, t_up)

• Compositing performance measured: 550 ms/GB Ù 227 fps

(some concurrency)

td = Time to download BGRA buffer [s/GB]

GPU 2 GPU 1

1.8 GB/s

tdc Time to download BGRA buffer [s/GB]

t_dc= Time to download DEPTH_STENCIL buffer [s/GB] t_dn= (t_dc+ t_dc)/2

t_mc= Time to copy from host mem to host mem [s/GB] t_up= Time to upload [s/GB]

Host Memory Host Memory

(13)

Single GPU Performance:

T i

l th

h t

d d t i

Triangle throughput and datasize

R d i f

d i

Rendering performance measured in

triangles/sec depending on data size on

graphics card with limited memory

o

rmance

Single GPU cliff

1/t_in Rende ring perf o [ triangles / s] 1/t_out l d f d Data size [ triangles]

Example, Rendering performance on Quadro FX 5600:

bytes per triangle = 3 vertices(36 bytes) + 3 normals( 36 bytes) = 72 bytes / triangle NVSG-test program renders individual triangles in VBO’s:

– p_in = 143 Mtriangles/s for in-memory data => t_in = 7 ns

– p_out = 25 Mtriangles / s for out-of memory data => t_out = 40 ns

n 20 Mtriangles

(14)

Multi GPU Rendering: 2 GPUs

– Compositing impact is known _R _d _i _P _f – Compositing impact is known

– Rendering performance on single GPU is known

Let’s estimate what performance gain we can expect over a growing dataset Assumptions: Rendering Performance 1.50E+08 2.00E+08 2.50E+08 3.00E+08 p er fo rm an ce a ng le s/ s ]

Dual GPU Cliff

Assumptions:

- Execution of rendering and compositing happens sequentially (will be changed in the future):

- Rendering happens concurrently w/o overhead on both GPU’s - Fixed buffer size = 1k x 1k = 1Mpixel => tcompositing= 4.4 ms

i l GPU t t l ti f 0.00E+00 5.00E+07 1.00E+08 0 10 20 30 40 50 60 70 triangles [millions] re n d e r p [t ri a

triangle throughput 1 GPU

Single GPU Cliff

single GPU total time per frame:

- ttotal,1 GPU= trender(n) = n0·tin+ (n-n0) ·tout

dual GPU total time per frame:

- ttotal, 2 GPU= trender(n/2) + tcompositing

f i t / t

triangle throughput 1 GPU triangle throughpout no compositing triangle throughput 2 GPU compositing

2 GPU relative to 1 GPU

Area of super linear scalability

⇒ performance gain s_2GPU= t_{total,1 GPU}/ t_{total,2 GPU}

peak performance to be expected at n = 2 n0:

s(n=2·n0) = (n0·tin+ n0·tout)/(n0·tin+tc) Ù

(tin+ tout)/(tin+tc/n0)

improve peak performance by: 3

4 5 6 7 8 to r o ver si n g le G P U

reduce tc (e.g. optimize buffer transports)

increase n0 (e.g. pack more triangles in card like tristrips instead

individual triangles)

- Theoretical peak max s_max = (t_in+ t_out)/t_in - applied to FX5600 s = (7ns +40ns)/7ns = 6 7 0 1 2 0 10 20 30 40 50 60 70 triangles [millions] scal e f a c t

scale factor 2 GPUs no compositing scale factor 2 GPU compositing

(15)

Multi-GPU Rendering: 4 GPUs

Multi GPU Rendering: 4 GPUs

Generalize 2 GPU formula to k Generalize 2-GPU formula to k

GPUs:

Assumption:

buffer transports time grows linearly with number of GPUs

Rendering Performance 4.00E+08 5.00E+08 6.00E+08 for m a nc e es/ s] p g y

k GPUs total time per frame:

- t_{total, k GPU}= t_render(n/k) + (k-1)·t_compositing

⇒ performance gain s_kGPU= t_{total,1 GPU}/ t_{total,k GPU}

0.00E+00 1.00E+08 2.00E+08 3.00E+08 0 10 20 30 40 50 60 70 re nde r pe rf [t ri a ngl

multi-GPU relative to 1 GPU kGPU total,1 GPU total,k GPU

peak performance to be expected at n = k n₀: s(n=k·n₀) = (n₀·t_in + (k-1) ·n₀·t_out)/(n₀·t_in+(k-1)·t_c) Ù

s_max= (t_in + (k-1) ·t_out)/(t_in+(k-1) ·t_c/n₀)

triangles [millions]

triangle throughput 1 GPU triangle throughput 2 GPU compositing

triangle throughput 4 GPUs

6 8 10 12 14 16 18 o r ov er s ingl e G P U

- Theoretical peak max s_max= (t_in + (k-1) · t_out)/t_in = 1 + (k-1) · t_out/t_in 0 2 4 6 0 20 40 60 80 100 120 triangles [millions] sca le f a ct o f G f G E.g. FX 5600 - k = 4 FX 5600 s_max= (7ns +3 · 40ns)/7ns = 18.1 - k = 8 FX 5600 s_max= (7ns +7 · 40ns)/7ns = 41

(16)

Frame Rates

90

Frame Rates

Application classes by frame rate

40 50 60 70 80 90 m er at e [ 1 /s ]

Application classes by frame rate

>= 60 Hz, Visual Simulation (e.g. professional flight simulators) >= 30 Hz, < 60 Hz, Entertainment (gaming) > 60 Hz 30 – 60 Hz 0 10 20 30 0 20 40 60 80 100 120 triangles [millions] fr a m >= 15 Hz, < 30 Hz VR, Design Review

>= 5 Hz, < 15 Hz Modeling, Large CAD Data Visualization, Seismic Interpretation,…

15 – 30 Hz

5 – 15 Hz

triangles [millions]

framerate 1 GPU framerate 2 GPUs framerate 4 GPUs

20 Best scalability for Large Models (low impact by compositor)

=> very good fit for Large Data visualization like Seismic interpretation 6 8 10 12 14 16 18 fr am e rate [1 /s ] 5 – 15 Hz 15 – 30 Hz 0 2 4 0 20 40 60 80 100 120 triangles [millions]

framerate 1 GPU framerate 2 GPUs framerate 4 GPUs

(17)

Results System Analysis

System Analysis shows: System Analysis shows:

• Performance scales with number of GPUs

=> Higher framerate => Higher framerate

• Graphics card memory scales linearly with number of GPUs

=> Larger models

• Small models don’t scale very well

=> Multi-GPU rendering (database decomposition) not useful for small models

• Peak scalability far beyond linear (Larger Cache effect)

Multi GPU rendering for Large models works !

Multi-GPU rendering for Large models works !

(18)

Results Multi GPU rendering: OSG

g

OSG = Open Scene Graph

Open Source Scene Graph widely used in Academia, Simulation, Oil&Gas and other industries We extended OSG to render in multiple threads (one thread per GPU) and added the MGPU SDK In this test OSG used around 140 bytes per triangle

Performance multi GPU vs. Standalone

Triangle Performance QP IV 600 0% 800.0% 1000.0% 1200.0% r e l. S A [% ] 2.00E+08 4.00E+08 6.00E+08 er fo rm a n ce tr ia n g les / s] 0.0% 200.0% 400.0% 600.0% P e rf or m a nc e 0.00E+00

0 5000000 1E+07 1.5E+07 2E+07 2.5E+07 3E+07

Data Size [primitives]

P [t

Standalone 2 GPU no Compositing

0 5000000 10000000 15000000 20000000 25000000 Data Size [Primitives]

2 GPU's 4 GPU's

4 GPU no Compositing 2 GPU with Compositing

(19)

Results Multi GPU rendering: NVSG

g

NVSG = NVIDIA’s Scene Graph

Very efficient Scene Graph used in automotive simulation and academia Very efficient Scene Graph used in automotive, simulation and academia NVSG used 72 bytes per triangle

Triangle Performance QP IV

Performance multi GPU vs. Standalone

1200 0% 2.00E+08 3.00E+08 4.00E+08 5.00E+08 a n ce [ tr ian g les ] 600.0% 800.0% 1000.0% 1200.0% m an ce rel S A [ % ] 0.00E+00 1.00E+08

0.00E+00 1.00E+07 2.00E+07 3.00E+07 4.00E+07 5.00E+07

Data Size [primitives]

Pe rf o rm a 0.0% 200.0% 400.0% 0.00E+ 00 5.00E+ 06 1.00E+ 07 1.50E+ 07 2.00E+ 07 2.50E+ 07 3.00E+ 07 3.50E+ 07 4.00E+ 07 4.50E+ 07 Pe rf o rm

Standalone 2 GPU with Compositing 4 GPU with Compositing

Data size [primitives]

(20)

MGPU – SDK Introduction

Thomas Volk

(21)

Agenda

•

Scope of the SDK

•

Scope of the SDK

•

SDK features

(22)

Scope of the MGPU SDK

•

Multi-GPU set-up and addressing

• Affinity context/screen per GPU

• SDK utility class for context handling and off-screen rendering

•

Application parallelism

• Multi-threading of rendering loop: already solved in lots of scene Multi threading of rendering loop: already solved in lots of scene

graph implementations

• Multi-process and out of core/cluster solutions: highly specialized

solutions already availabley

•

Load decomposition and balancing: needs to be

addressed in SG for sort last (see NVSGScale)

•

Sample/reference implementations

•

Sample/reference implementations

•

Utility classes

•

Image compositing with huge data transfer (5GB/s) and

low latency: SDK compositor

(23)

SDK features

• OpenGL based

• Professional applications

• QuadroPlex only

• With current systems up to 2-8 GPUs, e.g. 2xD2s, 2xD4s • Up to 16 GB of addressable video memory

• Platforms: win32/64 linux 64

• Platforms: win32/64, linux 64

• In comparison to SLI non-transparent to application

• Utility functions to create MGPU aware GL contexts and drawables

• Platform independentPlatform independent • Stereo

• AA

• Image compositor for sort first and sort last based applications

b i f f i i

• Easy to use abstract interface for compositing

• Compositor implementation based on latest technologies, no migration effort for

applications (next gen hardware will provide faster transport)

• Multi-threaded, shared memory

C fi bl 1 1 1 hi hi l h b id

• Configurable: 1-1, n-1, hierarchical, hybrid

(24)

Compositor Types and Configurations

1

2

1

2

1

2

+

Screen tiling Alpha compositing

1

2

12

+

Depth compositing

+

(25)

System Overview

Compositor MGPU SDK Application thread Scene graph Application thread Scene graph Application thread Scene graph

Dst node Src node Src node

GL context GL context _{GL context}GL-context _S

GL-context _Dst drawable GL-context _Src drawable GL-contextGL-context _Src drawable

(26)

SDK Classes

ICCompositor

+setLayout() +creates

ICFactory

y () +getSrcNode() +getDstNode()

RenderArea

+initialize()

ICFactory

+createAlphaCompositor() +createDepthCompositor() +createDepthCompositor() _n +initialize() +terminate() +resize() +makeCurrent() +releaseCurrent()

ICNode

+intialize() +terminate() i () +showFBContents() +hideFBContents() +composite()

ICSrcNode

ICDstNode

(27)

Code Example

p

//init in control/main thread

ICDepthCompositor* c = CompositorFactory::getInstance()->getDepthCompositor(); c->setLayout(IC_ALLNODES, IC_RECT(width, height));

ICNode* myCompositorNode[MAX NODES]; ICNode* myCompositorNode[MAX_NODES]; myCompositorNode[0] = c->getSrcNode(idx);

//init in every thread and context //create affinity context and drawable; RenderArea ra;

ra->initialize(); ra->makeCurrent();

//initialize GL resources of compositor myCompositornode[i]->initialize();

//in every render-thread void drawFrame(int myID) {

drawPartialScene(myID); drawPartialScene(myID);

//trigger image composition

myCompositornode[myID]->composite(); }

(28)

Sample Sequence Diagram

AppThread Compositor RenderThread SrcNode RenderThread DstNode

i it setLayout resize init init init _init setLayout paint drawFrame drawFrame

compositecomposite composite

composite

(29)

MGPU Threadingg

A th d draw ic d i simul App thread Render src thread Render src thread simul draw d draw ic draw ic Render src thread Render dst thread draw d i simul App thread Render src thread simul d simul draw ic draw ic draw ic Render src thread Render src thread Render dst thread draw draw draw draw ic draw

(30)

Wrap up

p p

•

Current status

•

Beta release next week

•

No stereo and AA, linux only for beta

F

l

il bl

•

Freely available

•

Release mid Oct

•

Future features

•

Future features

•

Better hardware support

•

Access to frame-buffers

Access to frame buffers

•

Utility classes for load balancing

•

Performance feedback

•

Extensible shader compositor

•

Additional color formats

(31)

Multi-GPU Rendering with NVIDIA Scene Graph

The World’s Fastest Scene Graph On Steroids

Multi GPU Rendering with NVIDIA Scene Graph

Subu Krishnamoorthy Subu Krishnamoorthy

(32)

Distributing Scene Objects

g

j

•

Audience – specifies which GPUs are responsible for

processing a scene graph object

•

DistributionTraverser – assigns Audiences to balance

l d

h GPU

load on each GPU

•

GLDistributedTraverser – respects Audiences and

t

l h

it GPU i

l d d

(33)

Distribution Scheme

•

Opaque objects

Least loaded GPU included, others

excluded

•

Translucent and overlay objects

•

Translucent and overlay objects

Primary GPU included, others excluded

•

Implies use of Depth Compositor

Image composited before translucent

and overlay objects are drawn

(34)

The Components

p

•

GLAffinityArea creates a thread that drives an associated

GLDistributedTraverser

•

GLDistributedRenderArea manages the collection of N-1

GLDistributedRenderArea manages the collection of N 1

GLAffinityAreas

(35)

(36)

Ease of Use

•

Derive client RenderArea from

GLDistributedRenderArea

(instead of nvui::RenderArea)

class wxGLDistributedRenderArea : public wxGLCanvas,p

public nvui::GLDistributedRenderArea {

... };

•

Override GLDistributedRenderArea::init()

•

Create Primary GL context (affinity context) and make it current

I

k b

i

l

i

•

Invoke base implementation

bool wxGLDistributedRenderArea::init(nvui::RenderArea *shareArea) {

...

(37)

(38)

Depth Composited Image

p

g

(39)

Spatial Distribution

p

•

Audience assigned based object

g

j

bounding box

I

li

f Al h C

it

(40)

Summaryy

•

Powerful new feature of NVSG

•

Visualize large models at interactive

g

frame rates

(41)

Visual Computing In The Oil Sector, Horn Thorolf Tonjum

(42)

Demos

(43)

Dinosaurs in the Museum

Model characteristics:

• Model designed in Maya, AutoDesk • 217 Million Triangles

Rendering Hardware:

HP 8600 ith 32 GB S t M

• HP xw8600 with 32 GB System Memory

• 2xD2’s (4xGT200 with 4 GB FB memory) with 16 GB

total FB memory Rendering Performance: Rendering Performance: 1 GPU = 0.7 fps

2 GPU’s = 3 ½ fps = 5 times faster than 1 GPU 4 GPU’s = 6 fps = 8 ½ times faster than 1 GPU => Total max. Triangle throughput = 1.3 GTri/s

(44)

Multi-GPU rendering of

Multi GPU rendering of

Kristin,

a detailed off-shore

l tf

• Kristin is an off-shore platform in the North Sea

• Model designed in MicroStation, Bentley

• 230 Million Triangles (individual triangles, no triangle

platform

g ( g , g

strips)

Rendering Hardware:

HP 8600 i h 32 GB S M

• HP xw8600 with 32 GB System Memory

• 2xD2’s (4xGT200 with 4 GB FB memory) with 16 GB

total FB memory Rendering Performance: 1 GPU = 0.4 fps

2 GPU’s = 2 ½ fps = 6 times faster than 1 GPUp

4 GPU’s = 4 ½ fps = 11 times faster than 1 GPU => Total max. Triangle throughput = 1 GTri/s

(45)

Summary

S

l i

f

l i GPU d i

•

System analysis of multi-GPU rendering

•

How to do multi-GPU rendering with NVIDIA’s Multi-GPU SDK

How to do multi GPU rendering with NVIDIA s Multi GPU SDK

•

Ease of use of multi-GPU rendering with NVSG

•

Large CAD data visualization in the Oil&Gas industry

=> Database decomposition together with

depth compositing is a strong tool to give

you tomorrows graphics performance today

(46)

The End

Questions ?