• No results found

Multi-GPU Scaling for Large Data Visualization (Thomas Ruge, NVIDIA)

N/A
N/A
Protected

Academic year: 2021

Share "Multi-GPU Scaling for Large Data Visualization (Thomas Ruge, NVIDIA)"

Copied!
46
0
0

Loading.... (view fulltext now)

Full text

(1)

Does Your Software Scale ?

Multi-GPU Scaling for Large Data Visualization

(2)

Agenda

Agenda

Multi-GPU Scaling for Large Data Visualization (Thomas Ruge, NVIDIA)

– Multi GPU why ?Multi GPU why ?

– Different Multi-GPU rendering methods, focus on database decomposition

– System Analysis of depth compositing and its impact on multi-GPU rendering

– Performance results of multi-GPU rendering with NVSG and OSG

NVIDIA’s Multi-GPU SDK (Thomas Volk, NVIDIA)

– How to use NVIDIA’s MGPU SDK to get scalability in your application

NVSG-Scale (Subu Krishnamoorthy, NVIDIA)

– Multi-GPU implemented in NVSG

Technical Demos (with Horn Thorolf Tonjum, Stormfjord)

Q&A

(3)

Multi-GPU: Why ?

Multi GPU: Why ?

Demand for Large Data visualization beyond capacity of 1 GPU

Demand for Large Data visualization beyond capacity of 1 GPU

Boeing 777 (350 MTriangles) (between 4.2 – 29 GB of graphics data)

Visible Human 16 GB of 3D Textures

Single GPU technology approaches cliff (like CPU’s do)

Power consumption is growing

Power consumption is growing

Cooling problems with high heat output

High production costs per chip of dies with large footprints

G80 : 681 M transistors, 90nm, 484 mm2

GT200 1 b

i

65

576

2

GT200: 1 bn transistors, 65 nm, 576 mm2

Higher Performance than fastest single GPU on the market

Higher Performance than fastest single GPU on the market

– MGPU can give you a time advantage, get tomorrow’s single GPU performance

(4)

Multi-GPU: Goal

Multi GPU: Goal

G l G t li

l bilit ith

b f GPU’ i

Goal: Get linear scalability with number of GPU’s in

Rendering performance

Triangle throughput

g

g p

Fill rate

Data size

Image resolution

Image resolution

(not topic of this presentation)

Steps to get scalability:

Distribution of the workload/rendering tasks to all GPU’s

Collection of rendering results from all GPU’s

Assemble the rendering results in final image

(5)

Multi-GPU: Distributing the Load

g

Pixel Decomposition (e.g. SLI Antialiasing )

– Assigns different pixels or subpixels to each

GPU

– helps with fill-rate bound apps, good

approach for AA

Decomposition in time (e.g. SLI AFR)

– Different GPUs render different frames

– scales well for vertex-processing and fill rate

(6)

Multi-GPU: Screen Decompositing

p

g

S

Fi

S

D

i i

(

Sort-First or Screen Decomposition (e.g.

SLI SFR or SLI Mosaic)

– Different GPUs are rendering different

ti f th

GPU 1 GPU 2

portions of the screen

– fill rate bound applications scale well

– Good method for Displays with very high

resolutions (e g mersive Displays or Sony

Final Image resolutions (e.g. mersive Displays or Sony

4k) Compositing Schemes: GPU 4 GPU 3 Display 1 Display 2 Compositing Schemes:

None, use (Tiled walls or multi-input Displays)

Simple stitching (e g SLI SFR) Display 3 Display 4

Simple stitching (e.g. SLI SFR)

(7)

Multi-GPU: Database Decomposition

p

S

L

D

b

D

i i

GPU 1 GPU 2

Sort-Last or Database Decomposition

– Different GPUs render different portions of

the dataset

S l ll i V t i fill t

GPU 1 GPU 2

– Scales well in Vertex processing, fill rate

bound and graphics card memory bound applications

Final Image

GPU 3 GPU 4

Compositing Schemes

– Depth or z-Buffer compositing (needs

RGB and Depth Buffer from rendering GPU’s

Final Image

GPU 3 GPU 4

GPU 1 GPU 2

RGB and Depth Buffer from rendering GPU s

– Alpha-compositing (needs only RGBA buffer)

GPU 1 GPU 2

=> Database decomposition is the method

that gives us more graphics card

memory

GPU 3 GPU 4

Final Image

(8)

Compositing: Performance

Compositing: Performance

All D iti S h i C iti t t th fi l i

All Decomposition Schemes require Compositing to create the final image This can be done:

• with special purpose hardware

– very fast, typically very low additional latency and performance impact

Ti ht t ifi hi h d ( ’t i / t h diff t d )

– Tight to specific graphics hardware (can’t mix/match different cards)

– only subset of all possible composition schemes possible (e.g. SLI doesn’t support z- or

alpha-compositing)

– no general purpose compositing hardware available (but several attempts in the past)g p p p g ( p p )

• with commodity hardware and some software

– Adds additional latency, performance impact depends on final image size and compositing

mode mode

– not very graphic card dependent (mix/match possible) – can cover all known compositing schemes

– New HW developments (e.g. PCIe Gen 2) reduces the impact of SW-compositing

© 2008 NVIDIA Corporation.

(9)

A System Analysis of Compositing with

commodity hardware

commodity hardware

S t

A l i f d t b

d

iti

ith

System Analysis of database decomposition with

depth compositing

• Goal: • Goal:

– Understand the impact of depth compositing on multi-GPU

rendering

– Means to estimate performance benefits – Valid on today’s HWy

• Assumptions:

– single shared memory system (no clusters) – only one displaying GPU (no powerwalls)

GPU 1

GPU 2 Displaying GPU

– buffer transports go through host memory (no specific HW

tricks) • Tasks: Di ib kl d d i GPU GPU n p y g Host

– Distribute workload to rendering GPUs

– Download resulting 2D-buffers (from GPUs framebuffer) to

host memory

– Upload 2D-buffers to displaying GPU – composite 2D-buffers to final imagecomposite 2D buffers to final image

(10)

Compositing with 2 GPUs

Compositing with 2 GPUs

St t A l i ith 2 GPU

Start Analysis with 2 GPUs

System description:

– 2 rendering GPUs, 1 displaying GPU

– Optimization: displaying GPU is also a rendering

GPU GPU

=> reduced number of buffer transports

Depth compositing needs RGB and Z-buffer

GPU 2 GPU 1

E.g. 1920x1200 pixels :

sx = 1920, sy = 1200, bpp = 8 (BGRA+DEPTH_STENCIL)

18,432,000 Bytes per frame buffer

73 728 000 Bytes transported through system

Host Memory Host Memory

73,728,000 Bytes transported through system per frame

@60 Hz = 4,423,680,000 > 4 GB per second !

=> High load on system Bus

CPU

(11)

Depth Compositing Performance

p

p

g

on 790i Ultra

S t

790i Ult

2 FX 3700 (PCI 2 0)

System: 790i Ultra, 2 FX 3700 (PCIe 2.0)

Specs relevant for Compositing:

Specs relevant for Compositing:

Buffer downloads:

2.8 GB/s

Ù

360 ms/GB

host memory copy:

4.1 GB/s

Ù

240 ms/GB

Buffer uploads:

3.7 GB/s

Ù

270 ms/GB

Execution time of the actual compositing (fragment shader) is

negligible

GPU 2 GPU 1 GPU 2 GPU 1 Color : 2.8 GB/s Depth: 2.8 GB/s 4.1 GB/s 3.7 GB/s Landing Zone Host Memory Launching Pad Host Memory
(12)

Depth Compositing Performance

p

p

g

Time to depth-composite 1 MPixel (worst – best case)

buffer size = s ·s ·bpp = 1024·1024·8 = 8 MByte buffer size = sx·sy·bpp = 1024·1024·8 = 8 MByte

• Compositing performance for sequential execution: 870 ms/GB Ù 143 fps

worst case (no concurrency) ttotal = tdn+ tmc+ tup

• Compositing performance for parallel execution: 360 ms/GB Ù 350 fps

best case (full concurrency) ttotal = max(tdn, tmc, tup )

• Compositing performance measured: 550 ms/GB Ù 227 fps

(some concurrency)

td = Time to download BGRA buffer [s/GB]

GPU 2 GPU 1

1.8 GB/s

© 2008 NVIDIA Corporation.

tdc Time to download BGRA buffer [s/GB]

tdc= Time to download DEPTH_STENCIL buffer [s/GB] tdn= (tdc+ tdc)/2

tmc= Time to copy from host mem to host mem [s/GB] tup= Time to upload [s/GB]

Host Memory Host Memory

(13)

Single GPU Performance:

T i

l th

h t

d d t i

Triangle throughput and datasize

R d i f

d i

Rendering performance measured in

triangles/sec depending on data size on

graphics card with limited memory

o

rmance

Single GPU cliff

1/tin Rende ring perf o [ triangles / s] 1/tout l d f d Data size [ triangles]

Example, Rendering performance on Quadro FX 5600:

bytes per triangle = 3 vertices(36 bytes) + 3 normals( 36 bytes) = 72 bytes / triangle NVSG-test program renders individual triangles in VBO’s:

pin = 143 Mtriangles/s for in-memory data => tin = 7 ns

pout = 25 Mtriangles / s for out-of memory data => tout = 40 ns

n 20 Mtriangles

(14)

Multi GPU Rendering: 2 GPUs

Multi GPU Rendering: 2 GPUs

Compositing impact is known R d i P fCompositing impact is known

Rendering performance on single GPU is known

Let’s estimate what performance gain we can expect over a growing dataset Assumptions: Rendering Performance 1.50E+08 2.00E+08 2.50E+08 3.00E+08 p er fo rm an ce a ng le s/ s ]

Dual GPU Cliff

Assumptions:

- Execution of rendering and compositing happens sequentially (will be changed in the future):

- Rendering happens concurrently w/o overhead on both GPU’s - Fixed buffer size = 1k x 1k = 1Mpixel => tcompositing= 4.4 ms

i l GPU t t l ti f 0.00E+00 5.00E+07 1.00E+08 0 10 20 30 40 50 60 70 triangles [millions] re n d e r p [t ri a

triangle throughput 1 GPU

Single GPU Cliff

single GPU total time per frame:

- ttotal,1 GPU= trender(n) = n0·tin+ (n-n0) ·tout

dual GPU total time per frame:

- ttotal, 2 GPU= trender(n/2) + tcompositing

f i t / t

triangle throughput 1 GPU triangle throughpout no compositing triangle throughput 2 GPU compositing

2 GPU relative to 1 GPU

Area of super linear scalability

performance gain s2GPU= ttotal,1 GPU/ ttotal,2 GPU

peak performance to be expected at n = 2 n0:

s(n=2·n0) = (n0·tin+ n0·tout)/(n0·tin+tc) Ù

(tin+ tout)/(tin+tc/n0)

improve peak performance by: 3

4 5 6 7 8 to r o ver si n g le G P U

reduce tc (e.g. optimize buffer transports)

increase n0 (e.g. pack more triangles in card like tristrips instead

individual triangles)

- Theoretical peak max smax = (tin+ tout)/tin - applied to FX5600 s = (7ns +40ns)/7ns = 6 7 0 1 2 0 10 20 30 40 50 60 70 triangles [millions] scal e f a c t

scale factor 2 GPUs no compositing scale factor 2 GPU compositing

© 2008 NVIDIA Corporation.

(15)

Multi-GPU Rendering: 4 GPUs

Multi GPU Rendering: 4 GPUs

Generalize 2 GPU formula to k Generalize 2-GPU formula to k

GPUs:

Assumption:

buffer transports time grows linearly with number of GPUs

Rendering Performance 4.00E+08 5.00E+08 6.00E+08 for m a nc e es/ s] p g y

k GPUs total time per frame:

- ttotal, k GPU= trender(n/k) + (k-1)·tcompositing

performance gain skGPU= ttotal,1 GPU/ ttotal,k GPU

0.00E+00 1.00E+08 2.00E+08 3.00E+08 0 10 20 30 40 50 60 70 re nde r pe rf [t ri a ngl

multi-GPU relative to 1 GPU kGPU total,1 GPU total,k GPU

peak performance to be expected at n = k n0: s(n=k·n0) = (n0·tin + (k-1) ·n0·tout)/(n0·tin+(k-1)·tc) Ù

smax = (tin + (k-1) ·tout)/(tin+(k-1) ·tc/n0)

triangles [millions]

triangle throughput 1 GPU triangle throughput 2 GPU compositing

triangle throughput 4 GPUs

6 8 10 12 14 16 18 o r ov er s ingl e G P U

- Theoretical peak max smax= (tin + (k-1) · tout)/tin = 1 + (k-1) · tout/tin 0 2 4 6 0 20 40 60 80 100 120 triangles [millions] sca le f a ct o f G f G E.g. FX 5600 - k = 4 FX 5600 smax = (7ns +3 · 40ns)/7ns = 18.1 - k = 8 FX 5600 smax = (7ns +7 · 40ns)/7ns = 41

(16)

Frame Rates

90

Frame Rates

Application classes by frame rate

40 50 60 70 80 90 m er at e [ 1 /s ]

Application classes by frame rate

>= 60 Hz, Visual Simulation (e.g. professional flight simulators) >= 30 Hz, < 60 Hz, Entertainment (gaming) > 60 Hz 30 – 60 Hz 0 10 20 30 0 20 40 60 80 100 120 triangles [millions] fr a m >= 15 Hz, < 30 Hz VR, Design Review

>= 5 Hz, < 15 Hz Modeling, Large CAD Data Visualization, Seismic Interpretation,…

15 – 30 Hz

5 – 15 Hz

triangles [millions]

framerate 1 GPU framerate 2 GPUs framerate 4 GPUs

20 Best scalability for Large Models (low impact by compositor)

=> very good fit for Large Data visualization like Seismic interpretation 6 8 10 12 14 16 18 fr am e rate [1 /s ] 5 – 15 Hz 15 – 30 Hz 0 2 4 0 20 40 60 80 100 120 triangles [millions]

framerate 1 GPU framerate 2 GPUs framerate 4 GPUs

(17)

Results System Analysis

Results System Analysis

System Analysis shows: System Analysis shows:

• Performance scales with number of GPUs

=> Higher framerate => Higher framerate

• Graphics card memory scales linearly with number of GPUs

=> Larger models

• Small models don’t scale very well

=> Multi-GPU rendering (database decomposition) not useful for small models

• Peak scalability far beyond linear (Larger Cache effect)

Multi GPU rendering for Large models works !

Multi-GPU rendering for Large models works !

(18)

Results Multi GPU rendering: OSG

g

OSG = Open Scene Graph

Open Source Scene Graph widely used in Academia, Simulation, Oil&Gas and other industries We extended OSG to render in multiple threads (one thread per GPU) and added the MGPU SDK In this test OSG used around 140 bytes per triangle

Performance multi GPU vs. Standalone

Triangle Performance QP IV 600 0% 800.0% 1000.0% 1200.0% r e l. S A [% ] 2.00E+08 4.00E+08 6.00E+08 er fo rm a n ce tr ia n g les / s] 0.0% 200.0% 400.0% 600.0% P e rf or m a nc e 0.00E+00

0 5000000 1E+07 1.5E+07 2E+07 2.5E+07 3E+07

Data Size [primitives]

P [t

Standalone 2 GPU no Compositing

© 2008 NVIDIA Corporation.

0 5000000 10000000 15000000 20000000 25000000 Data Size [Primitives]

2 GPU's 4 GPU's

4 GPU no Compositing 2 GPU with Compositing

(19)

Results Multi GPU rendering: NVSG

g

NVSG = NVIDIA’s Scene Graph

Very efficient Scene Graph used in automotive simulation and academia Very efficient Scene Graph used in automotive, simulation and academia NVSG used 72 bytes per triangle

Triangle Performance QP IV

Performance multi GPU vs. Standalone

1200 0% 2.00E+08 3.00E+08 4.00E+08 5.00E+08 a n ce [ tr ian g les ] 600.0% 800.0% 1000.0% 1200.0% m an ce rel S A [ % ] 0.00E+00 1.00E+08

0.00E+00 1.00E+07 2.00E+07 3.00E+07 4.00E+07 5.00E+07

Data Size [primitives]

Pe rf o rm a 0.0% 200.0% 400.0% 0.00E+ 00 5.00E+ 06 1.00E+ 07 1.50E+ 07 2.00E+ 07 2.50E+ 07 3.00E+ 07 3.50E+ 07 4.00E+ 07 4.50E+ 07 Pe rf o rm

Standalone 2 GPU with Compositing 4 GPU with Compositing

Data size [primitives]

(20)

MGPU – SDK Introduction

Thomas Volk

(21)

Agenda

Scope of the SDK

Scope of the SDK

SDK features

(22)

Scope of the MGPU SDK

Multi-GPU set-up and addressing

• Affinity context/screen per GPU

• SDK utility class for context handling and off-screen rendering

Application parallelism

• Multi-threading of rendering loop: already solved in lots of scene Multi threading of rendering loop: already solved in lots of scene

graph implementations

• Multi-process and out of core/cluster solutions: highly specialized

solutions already availabley

Load decomposition and balancing: needs to be

addressed in SG for sort last (see NVSGScale)

Sample/reference implementations

Sample/reference implementations

Utility classes

Image compositing with huge data transfer (5GB/s) and

low latency: SDK compositor

(23)

SDK features

• OpenGL based

• Professional applications

• QuadroPlex only

• With current systems up to 2-8 GPUs, e.g. 2xD2s, 2xD4s • Up to 16 GB of addressable video memory

• Platforms: win32/64 linux 64

• Platforms: win32/64, linux 64

• In comparison to SLI non-transparent to application

• Utility functions to create MGPU aware GL contexts and drawables

• Platform independentPlatform independent • Stereo

• AA

• Image compositor for sort first and sort last based applications

b i f f i i

• Easy to use abstract interface for compositing

• Compositor implementation based on latest technologies, no migration effort for

applications (next gen hardware will provide faster transport)

• Multi-threaded, shared memory

C fi bl 1 1 1 hi hi l h b id

• Configurable: 1-1, n-1, hierarchical, hybrid

(24)

Compositor Types and Configurations

1

2

1

2

1

2

+

+

Screen tiling Alpha compositing

1

2

12

+

+

+

Depth compositing

+

Hierarchical compositing © 2008 NVIDIA Corporation.
(25)

System Overview

Compositor MGPU SDK Application thread Scene graph Application thread Scene graph Application thread Scene graph

Dst node Src node Src node

GL context GL context GL contextGL-context S

GL-context Dst drawable GL-context Src drawable GL-contextGL-context Src drawable

(26)

SDK Classes

ICCompositor

+setLayout() +creates

ICFactory

y () +getSrcNode() +getDstNode()

RenderArea

+initialize()

ICFactory

+createAlphaCompositor() +createDepthCompositor() +createDepthCompositor() n +initialize() +terminate() +resize() +makeCurrent() +releaseCurrent()

ICNode

+intialize() +terminate() i () +showFBContents() +hideFBContents() +composite()

ICSrcNode

ICDstNode

© 2008 NVIDIA Corporation.
(27)

Code Example

p

//init in control/main thread

ICDepthCompositor* c = CompositorFactory::getInstance()->getDepthCompositor(); c->setLayout(IC_ALLNODES, IC_RECT(width, height));

ICNode* myCompositorNode[MAX NODES]; ICNode* myCompositorNode[MAX_NODES]; myCompositorNode[0] = c->getSrcNode(idx);

//init in every thread and context //create affinity context and drawable; RenderArea ra;

ra->initialize(); ra->makeCurrent();

//initialize GL resources of compositor myCompositornode[i]->initialize();

//in every render-thread void drawFrame(int myID) {

drawPartialScene(myID); drawPartialScene(myID);

//trigger image composition

myCompositornode[myID]->composite(); }

(28)

Sample Sequence Diagram

AppThread Compositor RenderThread SrcNode RenderThread DstNode

i it setLayout resize init init init init setLayout paint drawFrame drawFrame

compositecomposite composite

composite

(29)

MGPU Threadingg

A th d draw ic d i simul App thread Render src thread Render src thread simul draw d draw ic draw ic Render src thread Render dst thread draw d i simul App thread Render src thread simul d simul draw ic draw ic draw ic Render src thread Render src thread Render dst thread draw draw draw draw ic draw
(30)

Wrap up

p p

Current status

Beta release next week

No stereo and AA, linux only for beta

F

l

il bl

Freely available

Release mid Oct

Future features

Future features

Better hardware support

Access to frame-buffers

Access to frame buffers

Utility classes for load balancing

Performance feedback

Extensible shader compositor

Additional color formats

(31)

Multi-GPU Rendering with NVIDIA Scene Graph

The World’s Fastest Scene Graph On Steroids

Multi GPU Rendering with NVIDIA Scene Graph

Subu Krishnamoorthy Subu Krishnamoorthy

(32)

Distributing Scene Objects

g

j

Audience – specifies which GPUs are responsible for

processing a scene graph object

DistributionTraverser – assigns Audiences to balance

l d

h GPU

load on each GPU

GLDistributedTraverser – respects Audiences and

t

l h

it GPU i

l d d

© 2008 NVIDIA Corporation.

(33)

Distribution Scheme

Opaque objects

Opaque objects

Least loaded GPU included, others

excluded

excluded

Translucent and overlay objects

Translucent and overlay objects

Primary GPU included, others excluded

Implies use of Depth Compositor

Image composited before translucent

and overlay objects are drawn

(34)

The Components

p

GLAffinityArea creates a thread that drives an associated

GLAffinityArea creates a thread that drives an associated

GLDistributedTraverser

GLDistributedRenderArea manages the collection of N-1

© 2008 NVIDIA Corporation.

GLDistributedRenderArea manages the collection of N 1

GLAffinityAreas

(35)
(36)

Ease of Use

Derive client RenderArea from

GLDistributedRenderArea

GLDistributedRenderArea

(instead of nvui::RenderArea)

class wxGLDistributedRenderArea : public wxGLCanvas,p

public nvui::GLDistributedRenderArea {

... };

Override GLDistributedRenderArea::init()

Create Primary GL context (affinity context) and make it current

I

k b

i

l

i

Invoke base implementation

bool wxGLDistributedRenderArea::init(nvui::RenderArea *shareArea) {

...

m_glContext = new wxGLAffinityContext(this, shareArea); wxGLCanvas::SetCurrent(*m_glContext); ... GLDistributedRenderArea::init(shareArea); © 2008 NVIDIA Corporation. ... }

(37)
(38)

Depth Composited Image

p

p

g

(39)

Spatial Distribution

p

Audience assigned based object

g

j

bounding box

I

li

f Al h C

it

(40)

Summaryy

Powerful new feature of NVSG

Powerful new feature of NVSG

Visualize large models at interactive

g

frame rates

(41)

Visual Computing In The Oil Sector, Horn Thorolf Tonjum

(42)

Demos

(43)

Dinosaurs in the Museum

Dinosaurs in the Museum

Model characteristics:

• Model designed in Maya, AutoDesk • 217 Million Triangles

Rendering Hardware:

HP 8600 ith 32 GB S t M

• HP xw8600 with 32 GB System Memory

• 2xD2’s (4xGT200 with 4 GB FB memory) with 16 GB

total FB memory Rendering Performance: Rendering Performance: 1 GPU = 0.7 fps

2 GPU’s = 3 ½ fps = 5 times faster than 1 GPU 4 GPU’s = 6 fps = 8 ½ times faster than 1 GPU => Total max. Triangle throughput = 1.3 GTri/s

(44)

Multi-GPU rendering of

Multi GPU rendering of

Kristin,

Model characteristics:

a detailed off-shore

l tf

Model characteristics:

• Kristin is an off-shore platform in the North Sea

• Model designed in MicroStation, Bentley

• 230 Million Triangles (individual triangles, no triangle

platform

g ( g , g

strips)

Rendering Hardware:

HP 8600 i h 32 GB S M

• HP xw8600 with 32 GB System Memory

• 2xD2’s (4xGT200 with 4 GB FB memory) with 16 GB

total FB memory Rendering Performance: 1 GPU = 0.4 fps

2 GPU’s = 2 ½ fps = 6 times faster than 1 GPUp

4 GPU’s = 4 ½ fps = 11 times faster than 1 GPU => Total max. Triangle throughput = 1 GTri/s

© 2008 NVIDIA Corporation.

(45)

Summary

Summary

S

l i

f

l i GPU d i

System analysis of multi-GPU rendering

How to do multi-GPU rendering with NVIDIA’s Multi-GPU SDK

How to do multi GPU rendering with NVIDIA s Multi GPU SDK

Ease of use of multi-GPU rendering with NVSG

Large CAD data visualization in the Oil&Gas industry

=> Database decomposition together with

depth compositing is a strong tool to give

depth compositing is a strong tool to give

you tomorrows graphics performance today

(46)

The End

The End

Questions ?

Questions ?

References

Related documents