• No results found

Hierarchical N-Body problem on graphics processor unit

N/A
N/A
Protected

Academic year: 2019

Share "Hierarchical N-Body problem on graphics processor unit"

Copied!
101
0
0

Loading.... (view fulltext now)

Full text

(1)

Rochester Institute of Technology

RIT Scholar Works

Theses

Thesis/Dissertation Collections

3-15-2006

Hierarchical N-Body problem on graphics

processor unit

Mohammad Faisal Siddiqui

Follow this and additional works at:

http://scholarworks.rit.edu/theses

This Thesis is brought to you for free and open access by the Thesis/Dissertation Collections at RIT Scholar Works. It has been accepted for inclusion in Theses by an authorized administrator of RIT Scholar Works. For more information, please [email protected].

Recommended Citation

(2)

Hierarchical N

-Body Problem on Graphics Processor Unit

by

Mohammad Faisal Siddiqui

A Thesis Submitted in Partial Fulfillment

of

the Requirements for the Degree of

Master of Science in

Computer Engineering

Supervised by

Dr. Muhammad Shaaban, Associate Professor, Computer Engineering

Department of

Computer

Engineering

Kate Gleason

College of

Engineering

Rochester Institute

of

Technology

Rochester

,

New York

March 15

,

2006

Approved

By:

M.Shaaban

Dr. Muhammad Shaaban

Associate Professor

, Computer

Engineering

Primary Adviser

Lawrence Ray

Dr

.

Lawrence Ray

Adjunct Professor,

Computer

Engineering

Juan Cockburn

Dr

.

Juan Cockburn

(3)

Thesis Release Permission Form

Rochester Institute of Technology

Kate Gleason College of Engineering

Title: Hierarchical

N

-Body Problem on Graphics Processor Unit

I, Mohammad Faisal Siddiqui

,

hereby

grant

permission to the Wallac

e

Memorial

Li-brary reporduce my thesis in

whole or part.

Mohammad Siddiqui

Mohammad Faisal Siddiqui

(4)

Copyright 2006

by

Mohammad Faisal Siddiqui

(5)

Dedication

(6)

Acknowledgments

Having

Dr. Shaaban as my primary adviserhas been an amazing experience. His unique and exhaustive

teaching

style spurred in me thepassion forcomputer architecture. I am overwhelmedwith gratitudefor his guidance, motivation and support. Hegave meenough

freedomtoexplore new

ideas,

and appliedpressure when seemed appropriate. Ialso admire

andrespecthis

integrity

bothas a professor and aperson.

I'd like to thankmy committeemembers, Dr. Lawrence

Ray

andDr. Juan Cockburn for giving importantsuggestionsandfeedbackwhennecessary. Ialso appreciatetheir

help

and concernthroughoutmythesis.

I'dlike to thank Dr. Stefan Harfstwith the Astrophysics

Group

at Rochester Institute

of

Technology

whohelpedmethrough theinitial stages ofmyresearch

during

thesummer

of2005.

And,

finally

Iwouldliketo thank theentire staff ofDepartmentofComputer Engineer

ing

atRochester Institute of

Technology

fortheir

help

and assistance, and makingmefeel
(7)

Abstract

Galactic simulation is an importantcosmological computation, and represents a classical

N-body

problem suitablefor implementation on vector processors. Barnes-Hut algorithm

is ahierarchical

N-Body

methodusedto simulatesuch galactic evolutionsystems.

Stream processing architectures exposedata

locality

andconcurrencyavailableinmul

timediaapplications. On theother

hand,

there are numerous compute-intensive scientific orengineering applicationsthatcan potentially benefit fromsuch computational and com munication models. Theseapplications are

traditionally

implementedon vector processors.

Stream architecture based graphics processor units

(GPUs)

present a novel computa

tionalalternativefor efficiently

implementing

suchhigh-performanceapplications. Render

ing

onastream architecture sustainshighperformance, whileuser-programmablemodules allow

implementing

complex algorithms efficiently. GPUs have evolved over the years, from

being

fixed-function pipelinesto user programmableprocessors.

In this thesis, we focus on the implementation of Barnes-Hut algorithm on typical

current-generation programmableGPUs. We exploitcomputation andcommunication re quirements presentin Barnes-Hutalgorithmtoexposetheirsuitability foruser-programmable

(8)

Contents

Dedication iv

Acknowledgments v

Abstract vi

Glossary

xii

1 Introduction 1

2

N-Body

5

2.1

N-Body

Problem 5

2.2

N-Body

Methods 5

2.2.1 Particle-Particle Method 6

2.2.2 Particle-Mesh Method 6

2.2.3 Particle-Particle/Particle-Mesh Method 7

2.2.4 Hierarchical Methods 7

2.3 Barnes-Hut Algorithm 8

3 Stream Architecture and

Programming

Model 13

3.1 RAW Machines 14

3.2 Imagine Stream Processor 16

4 Graphics Processor Architecture 19

4.1 Graphics

Rendering

Pipeline 19

4.2 Graphics Hardware 21

4.2.1

Geometry

Engine 22

4.2.2 Channel Processor 23

4.2.3

Polygon-Rendering Chip

23
(9)

4.2.5 Indigo Extreme 25

4.2.6 PixelFlow 27

4.2.7 RealityEngine 28

4.2.8

InfiniteReality

29

4.2.9 Graphics Software Libraries 31

4.2.10

Streaming

CPU Extensions 32

4.3 Programmable Graphics Hardware 33

4.3.1 Vertex Processor 33

4.3.2 Fragment Processor 37

4.3.3 Graphics

Memory

38

4.3.4 GPU

Programming

Model 40

4.4 GPU for Non-Graphics Operations 42

5 Previous Work 44

5.1 Fast Matrix Multiplies using Graphics Hardware 45

5.2 Physically-based Visual Simulationon Graphics Hardware 46

5.3 GROMACS 47

5.4 Particle-Mesh

N-Body

MethodonGraphics Hardware 48

5.5

Radiosity

onGraphics Hardware 48

5.6

Ray

Tracing

onGPU 49

5.7 Octree Textureson theGPU 50

6 Implementation 52

6.1 Algorithm Overview 53

6.2 Octree Construction 55

6.3 A^-tree memory mapontheGPU 57

6.4 Kernels 60

6.4.1 Traverse 60

6.4.2 Position 64

6.4.3

Velocity

64

7 Resultsand Discussion 66

7.1 Results 66

7.2 Discussion 69

7.3 Difficulties Encountered 70

(10)

8 ConclusionandFuture Work 76

8.1 Conclusion 76

8.2 Future Work .77

Bibliography

78

A Octree Data Structureson CPU 84

(11)

List

of

Figures

2. 1 Barnes-Hutmultipoleaccessibilitycriterion 8

2.2 Therelation s/dfor different levelsofthetree 10

3.1 RAWprocessor. 15

3.2 Imagineprocessor. 17

4.1 Graphics renderingpipeline 20

4.2 RealityEngine block diagram 28

4.3

InflniteReality

Engine block diagram 30

4.4 Overallsystem architecture 34

4.5 GeForce 6 Series Vertexshaderblock diagram 35

4.6 GeForce 6 Series fragment processor. 37

4.7 GPU memory hierarchy. 39

5.1 Matrix Multiplication 45

5.2 CML Performance Comparison 47

6.1 Algorithm 54

6.2 Octree 55

6.3 Octreemapped to2Dtexture 57

6.4 2Dtextureson theGPU 59

6.5 Treetraversal 61

6.6 Position kernel 65

6.7

Velocity

kernel 65

7.1 Positionand

Velocity

kernel

timing

graph 68

7.2 Treetraversalkernel

timing

graph 68

7.3

(Ideal)

Positionand

Velocity

kernel

timing

graph 72
(12)

List

of

Tables

7.1 Minimum 2Dtextures neededto store octreeinformation 67

7.2 Branchpenalties onNvidia6 Series GPU 70

(13)

Glossary

arithmetic

intensity

Ratio ofarithmetic computationsto communication.

cell Anode of an octree withmorethan oneparticle.

F

fragment processor Fragmentprocessors are

fully

programmableprocessors, operat

ing

in SLMD-parallel fashionon inputelements: processing four-elementvec torsinparallel.

framebuffer Framebufferisa partofGPU memorywhichholds finalcolored pixelsfor

display

onthe screen.

gather Gatheroperation occurs whenthekernel processinga streamelementrequests

information from otherelements in the stream: it

"gathers"

information from

otherpartsofmemory.

GPU A graphics processor unit or GPU is a graphics rendering pipeline used for

(14)

K

kernel

Terminology

used in stream architecture for

describing

a computing unit or

programthatoperates onthe data.

leaf Anodeofan octree withonlyone particle.

M

MIMD Mulitple instructionmultiple data letsmultipleinstructions to operate on mul

tipledataitemsat anygiventime.

o

octree Atree datastructure with eight children.

scatter Scatteroperation occurs when the kernel processinga stream element distrib

utes information to other stream elements: it

"scatters"

information to other

parts of memory.

SIMD Single instruction multiple data lets one instruction operate on multiple data

itemssimultaneously.

Streams

Terminology

usedinstream architectureforadatasetthatneeds similar com
(15)

texture

memory

Texture memory istheonlypart oftheGPUthatis randomly accessi

ble

by

the fragmentprocessors.

vertex processor Vertexprocessors are

fully

programmable and operateineitherSIMD
(16)

Chapter 1

Introduction

The classical

A^-Body

problem simulates the evolution of a system, comprising of n dis

crete bodies underthe influence exertedon each

body by

all otherbodies. In thegalactic

simulation,we study the evolution of a system ofparticles, where each particle represents a star or sampledto represent collection ofstars, solarsystem, galaxies,cluster of galaxies

and other large-scale structures of the universe bounded together

by

Newtonian gravita

tionalpotential.

The hierarchical method proposed

by

Barnes andHut

[5]

builds atree structuredrep

resentation ofthe physical domain ofthe problem (in ourcase, space), andthen compute

interactions

by

traversing

the tree. Such hierarchicaltree-basedmethods reducethecompu

tational complexity to

0(nlogn)

fornon-uniform

distributions,

or even

0(n)

for uniform

distributions,

with parallelism proportional to n and without

introducing

any significant

ambiguity in the results. Applications that use Barnes-Hutmethod arechallenging to ob tain effective parallel performance, owingto their non-uniformdistribution ofbodies over thephysical

domain,

distributionofbodies across thephysicaldomainchanges overtime,

and the need for

long

range communication.

However,

we are helped

by

two properties:

first,

the system evolves slowly overtime, and second, the gravitation interaction falls off

quickly with distance.

Also,

the force computation on any particle can be computed in parallel with otherparticles,andindependentofotheriterations. We mapthesimulationof
(17)

Modern semiconductor

technology

makes arithmetic inexpensive and memory band

width expensive. Both global on-chipandoff-chip communicationthus becomes the per

formance

limiting

factor.

Increasing

ratio ofarithmetictomemory bandwidthcalled arith

metic

intensity intensity

and parallelism to use large number ofavailable arithmetic units

can offsetthisshiftincost. Modernmultimedia applications emphasize on stream process

ing

programmingmodel. Stream programmingmodelrenderssupportformulti-level paral

lelism and exposesinherent data

locality by

partitioningthecommunication anddatastor

age structures. Stream programming model represents all data as streams and expresses

all computationalunits askernels. Applicationsmappedto the streamprogrammingmodel

are constructed

by

chaining kernelstogether. Thesekernels operatingonentire streamsex

poses multi-levelparallelism while local data storage within kernel execution unitselimi

natesglobal communicationcosts. Computationunits are calledaskernelsand streamsare

made ofidentical dataelementsthatrequiredataprocessing.

Thecore ofthe graphics architecture, more commonlyreferred to asthe graphics ren

dering

pipeline orgraphics pipelinedefinesa projection-basedrenderingalgorithm(inreal

time) that supports surface-based, volume based or image-based representation, given a

virtual camera setting, mathematically modelled 3D objects, position and specification of

light sources,

lighting

models, textures, and more [4]. The graphics rendering pipeline is

split into several distinct substages, and more recently vertex and fragment shaders have

been introducedas user-programmable stages to improve the

functionality

and

flexibility

to perform any task [32]. This conceptual model ofgraphics hardware andthe addition

of user-programmable substages makes the graphics hardware rendering pipeline a good

matchforstreamprogrammaing model. Thevertexandfragmentshaders withinthe GPU

canberepresentedas kernelsor computationalblocks whilethe

highly

localized dataflow

between these substages or kernels can be abstracted as streams. The data flow between

stages in the graphics pipeline is

highly

localized with the data produced

by

one stage

cosumed

by

the nextstage. Some ofthe latest GPUsprovideIEEE-standard single preci
(18)

GeForce 6 Series orNvidia GeForce 7800. The user-definedprograms (shaderprograms)

written eitherin assembly

(ARB_fragment_program)

orhigh-level languages

[34];

HLSL (High-level shading

language);

Sh

[36]

can be loaded into vertex and fragment units at

run-time. The applicationdata is storedon the GPUas texture imagesand can be

ID,

2D

or3D. The fragmentor vertex shaders actonthisdataandwriteresultseitherto the

frame-buffer orframebuffer attached objects (framebufferobjects) which can be again used via

afeedback loop.

Furthermore,

with the addition of dynamic flowcontrol

(branching)

and

loop

constructs, present generation GPUs can be used for more than one particular with

possiblegaininperformance.

The developmentofprogrammingcomponents inthe GPU has allowed researchers to

use them for other applications thanjust simply rendering polygons. With the advent of programmable

GPUs,

efforts have been made to also map high-performance computing applicationsto the

GPU,

eg. scientificcomputations like

FFTs,

linear algebrasolvers, dif ferential equations solvers, multi-grid solvers and applications to fluid

dynamics,

visual

simulation,icecrystals;geometriccomputationslike Voronoi

diagrams,

distancecomputa tion, motionplanning, collision

detection,

object reconstruction, visibility; advanced ren

dering

like ray-tracing, radiosity, photon mapping, sub-surface scattering, shadows; and databaseoperations onaggregates, predicates, Booleancombinations, selectionqueriesfor large databases [1].

We have developed a completeBarnes-Huttreecodeimplementation onthe GPU. The

motivation behind attempting

A^-Body

problem on the GPU is the recent successful im plementationofradiosityalgorithm on GPUs. The hierarchicalsubdivisionradiosity algo

rithm is based on the hierarchical

A^-Body

problem. The

A"-Body

problemshares many

similarities with the radiosity problem which suggest that these ideas can lead to an in crease in performance on the GPU. In both the

A^-Body

andthe radiosity problem, there

are n(n l)/2 pairs ofinteractions.

Moreover,

just as the magnitude of the form-factor

betweentwopatches falls off as

1/r2,

the gravitational or electromagnetic forces alsofall
(19)

major difference between the two problems is the manner in which the hierarchical data structures areformed. The radiosityalgorithmbeginswithafew largepolygons andsubdi

videsthemintosmaller and smaller patches. Our

A^-Body

algorithmbeginswith nparticles

andclustertheminto largerand largergroups. Anotherdifference isthatradiosityproblem

is

inherently

non-linearbecauseof occlusion where

intervening

opaque surfacescanblock

the transportoflight between the twosurfaces. The

A^-Body

problem ontheotherhand is

linearandtakesadvantageoflinearsuperposition, theprinciple of superposition statesthat

thepotential due to a cluster of particles isthe sum ofthepotentials ofthe individual par

ticles.

Finally,

the radiosityproblemis basedon anintegral equation,whereasthe

Af-Body

problemis basedon adifferentialequation.

The remainderofthisdocument is organizedin the

following

manner:

Chapter 2presents an overview ofdifferent

A"-Body

methods and explains

Barnes-Hutmethodin detail.

Chapter 3givesabackgroundoverview onthestream architecture andprogramming

model, anddetailstwo major stream architectures.

Chapter 4 presents a detailed account of landmark graphics architectures

belong

ing

to differentgenerations, and also presents adetailed explanation on the current

generationof programmable graphicshardware.

Chapter 5 detailssome backgroundinformation in differentresearchareasthatpro

vided motivation fortheBarnes-Huttreecode implementationonthe GPU.

Chapter 6 discusses the actual implementation ofthe Barnes-Hut treecode on the

GPU.

Chapter 7 discussestheresultsand providesthediscussiononthe generated results.

Chapter

8,

finally

provides the conclusion and suggests some extensions for future
(20)

Chapter 2

N-Body

The main objective of our thesis is to implement Barnes-Hut treecode algorithm on the

graphicsprocessor unit. Inthischapter we explainthe

A^-Body

problemandpresentabrief

overview ofimportant

A^-Body

methods used in astrophysical simulations.

Furthermore,

we explainBarnes-Hutalgorithmingreatdetail.

2.1

TV-Body

Problem

The

Af-Body

problem isstated asfollows: Giventhe initial states,position andvelocityof

n

bodies,

computethestates attime T. Themost widely acceptedapproach istoperform

simulation over atime period. Thesimulation time period is discretized into hundreds or

thousands of time-steps, each

time-step

computes the forces acting on each particle and

updatesthe position, velocityand other attributes associatedwiththatparticle.

2.2

TV-Body

Methods

We

briefly

describe here different

A^-Body

methodscommonly usedin astrophysical sim
(21)

2.2.1

Particle-Particle Method

The particle-particle orthe directsummation methodisthe most straight-forwardmethod

ofthe

A^-Body

methods.Thisdirect integrationapproach is flexible butthecost of compu

tation

linearly

increases with n, the number ofparticles and mostly used with collisional

systems. In the case of collisional systems, the evolution of the system is driven

by

mi croscopic exchange ofthermal energies between different particles. In this case it is not

easyto usefastand approximate algorithmsto determinethe interaction betweenparticles and the cost per

timestep

is

0{n2),

and the total cost of the simulation is

0(n3)

[33]. There aretworeasonswhytheuse of approximate algorithmsfor force computationisnot

plausible. Theforemostreasonis theneedfor relatively higher accuracy andsecondly, the wide difference intheorbital timescalesofparticles whichmeans that the particles in the

core require much smaller time-steps than particles in the halo. In order to achieve such

highcomputationalcost, specializedhardware has beenconstructedforgravitationalinter

action, GRAPE

(GRAvity

piPE) [17]. AGRAPE hardware has specialized pipelines for

gravitational force calculation, whichis the most expensive part ofmost

A^-Body

simula

tion algorithms. Allothercalculations, suchastime integrationoforbits,areperformedon

astandardhostcomputer connectedtoGRAPE.

Ontheother

hand,

collisionless systems canbeapproximatedusingalgorithms suchas

particle-mesh

(PM)

method

[8],

particle-particle/particle-mesh

(P3M)

method

[8],

Barnes-Huttreealgorithm

[5],

fastmultipolemethod

[18]

ortreecodeparticle mesh

(TPM)

meth

ods [51].

2.2.2

Particle-Mesh

Method

Theparticle-meshmethodtreats theforceasafield quantity

by

approximating iton amesh.

The main advantage ofthe PM methodis speed. The number of computations is of order 0(n + 7iglogng) where

ng

is the number of grid points on the mesh.

However,

the PM
(22)

have difficulties

handling

non-uniform particle distributions thus offering limited resolu

tion.

2.2.3

Particle-Particle/Particle-Mesh Method

Theparticle-particle/particle-meshmethod solvesthemajorshortcomingofthePMmethod

- low

resolution forcescomputed forparticles near each other. The P3M method creates a

combined force function forall n bodies that is evaluated at regularly spaced points on a

3D mesh. The interaction ofp with other bodies in thesystem may be decomposed into

a number of direct interactions between p and nearby

bodies,

and interaction betweenp

andthecombinedforce interpolatedon a mesh. The totalnumberof operations isof order

0(n + ng), where ng is the number of grid points on the mesh. The only disadvantage

oftheP3M methodisthat it becomes dominated

by

theinteraction forces between nearby

particles.

2.2.4

Hierarchical

Methods

Hierarchical basedmethods like Barnes-Huttree algorithmandfast multipole method are

the best known methods for solving classical

A^-Body

problems, such as those in astro

physics, electrostatics and moleculardynamics. Allthe hierarchicalmethodsforclassical

problems first builds a tree-structured, hierarchical representation of physical space, and

then compute force interactions

by

traversing

the tree. The root ofthe tree represents a

space cell containing all the particles in the system. Thetree is built

by

recursively sub

dividing

space cells until some termination condition is met. In three

dimensions,

every

subdivisionof a cell results inthecreation ofeight equally-sized childrencells,

leading

to

an octree representation ofspace, whereas, intwo

dimensions,

each spatial decompostion

leadsto fourequally-sizedchildren.

Treecodeparticle-meshmethods usestreecodehierarchicaldecompostionforparticles

(23)

usto usebesttechniques

depending

onthe spatialdecompostionofthe space.

2.3

Barnes-Hut Algorithm

In the Barnes-Hut tree algorithm, the cubical root node encompassing the total mass is

repeatedly divided intoeight orfewer daughternodesofhalfthe sidelengtheach, untilone

ends upwith '

leaf nodescontaining single particles orbodies. Forces arethen computed

on each particle

by

traversing

the tree starting with the root node. A decision is made

whetherthenode provides anaccurateanoughforce actingontheparticle. Iftheansweris

affirmative, themultipoleforce isusedtocomputetheforceandthewalkalongthisbranch

ofthe tree is terminated,otherwise, the node is traversedfurther down where we evaluate

forces ontheparticle

by

considering its daughternodes.

d

j^k

[image:23.549.115.484.362.515.2]

particle

Figure 2.1: Barnes-Hut multipole accessibility criterion.

V

denotes the length ofthe

sideofthecell representingthepseudoparticle. 'd! denotes theeuclideandistancebetween

the particle and the center-of-mass ofthe pseudoparticle. The tree traversal is terminated

(24)

The tree hierarchical structure provides the means to distinguish between close parti

cles anddistantparticles without actually computing the distance between every particle.

The force betweenparticles that are close enough are computed

directly,

whereas distant

particles are groupedtogetherand are accounted as one giant node or apseudoparticle. In

order to compute forces from these pseudoparticles, several 'multipole acceptance crite

rion'

(MACs)

are available whichdecideswhetherto accept as it

is,

or subdivideit into its

daughternodes. Thesimplestofthe criterion istheoriginal s/d introduced

by

Barnesand

Hut Ref.

[5]

and shown in Figure 2.1.

Starting

with theroot node, the size ofthe current

node s (thesidelengthofthe cellcomprisingthepseudoparticle) is compared with its dis

tancefromthe particle,d. Iftheratios/dissmallerthansome predefinedvalue,

9,

then the

nodeistaken asisand furthertraversalalongthatnode isterminated.

Otherwise,

the node

is 'opened'

into its daughter nodes, each of which is recursively examined according to

s/d and, ifnecessary, subdivided. 9 is definedastheMAC parameterand can bevariedto

achieve any desired level of approximation for computing forceson theparticles.

Usually

theMAC provestobeintherange of9= 0.3

1.0,

depending

onthe accuracyneeded

by

the application. Thesimulation oftheevolution of a system of nbodies overtimeis based

onthe Newtonian equations of motion. The pairwiseinteraction potential

f(i,j)

between

body

i and

j

canbeexpressed as:

G.nii.nii

f(t.3)

= h^

(2-1)

where Gis gravitationalforceconstant, m*andrrij representthemassesofthe two

bodies,

and

rfj

representtheeuclideandistance betweenthe twodistinct bodies. Thetotal forceon

a particleican besummed as:

/()

=

/(*,

J)-(2-2)

The Barnes-Hutalgorithm isone ofthe mostwidely implemented astrophysical simu

lationalgorithm. Effortshave been made to parallelizethe algorithmto exploitthe paral

lelismavailableinthe computation partofthealgorithm. Thecore oftheforce calculation

(25)

4

it

it itit it &

it it it it A s V it

**

\ *

**

it it it a & * it it

s/d=0.3 s/d=0.7

A V ft it # it it * & ft

[image:25.549.131.421.86.392.2]

s/d=1.0

Figure 2.2: Therelation s/dfor different levelsofthe tree. The figure illustratesthe tree

traversalfor differentvalues of9. When 9 =

1.0,

theforceexerted on theremoteparticle

is due to the influence of the entire cell with the side length given

by

s, and computing

body-cell force interaction. If 9 =

0.7,

is

assumed, we traversefurther deeper within that

cell and dividethe space into equal spaces, with s again representing the cell side

length,

and againcomputingbody-cell force interaction. If9 = 0.3 is

assumed,wefurther divide

the space evenly and eventually we reach the leafofthe tree, thus computing

body-body

force interaction.

Algorithm 1 StepSystem

MakeTree()

while

loop

overallbodies do
(26)

Algorithm 2GravCalc l 2 3 4 5 6 7 8 9 10 11

if

Q

is aleafthen

body-body

force interaction else

if P is distantenoughfrom

Q

then

body-cellforce interation

else

while

loop

over alldaughternodes do

GravCalc(P,

daughternode)

end while

endif

endif

StepSystem( ), as in Algorithm 1 is the routine used for stepping through the

simulation. The routine StepSystem ( ) calls MakeTreeO which constucts a new

octree based on the current particle positions, and then we compute force on each par

ticle/body

using the routine GravCalc () , as shown in Algorithm 2. The routine

GravCalc ( ) computestheforceexertedontheparticleP

by

all othernodes,Q. Depend

ing

whetherthenode

Q

is a particle orapseudo-particle,wecomputetheforce exerted

by

thatnodeontheparticle P.

In our

implementation,

we used3 different MACs.

First,

aminimalcriterion isused wherenocell

touching

theparticle underconsider

ationis accepted.

Second,

theoriginal MAC introduced

by

BarnesandHutas mentioned above.

Third,

theMAC introduced

by

Salmon andWarren to eliminate pathological errors [image:26.549.63.494.60.228.2]

thatmayoccur whenusing Barnes-Hut MAC [44].

Figure 2.2 shows the effect of9 on the opening angle criterion. As 9 is taken closerto

0,

wetraverse deeper down thetreefor every node withtheforcecomputation more exact as

comparedwiththe forcecomputation when9 istakencloserto 1.

We also eliminated computing quadrupole momemts associated with each cell node.

(27)

mapping oftheoctree on the GPU in amemory efficientway. Also it has been observed

thatusingquadrupole orhighercell moments may increasetheaccuracy ofthe forcecom

putations, but this advantage is compensated

by

the increased complexity per evaluation and the highertree construction and the tree storage overhead [48]. Another advantage ofusing monopole moments arisefrom thefact that thedynamical treeupdates

during

the

simulation canbeperformed inaveryefficient manner.

Barnes-Hutalgorithm has been successfullyparallelized toexploitdataparallelismin

volvedintheforcecomputation,thusmakingthemidealfor high-performancemultiproces sors or vectorprocessors.

However,

suchapplication

typically

hascharacteristicsthatmake it challengingtopartition and scheduleit foreffective parallel performance. As clearfrom

thealgorithm GravCalc,theforcecomputations on each particle are

totally

independent ofprevious simulation steps.

Also,

we can computeforce interactionsondifferentparticles

simultaneously because the particle positions are only updated at the end of the current simulation step.

Thus,

the force computations are SIMD and

they

can be parallelize to achieve faster computation which results in the reduction ofthe overall simulation time.

Recently,

stream processors have been proposedthat may eventually takeplace of vector
(28)

Chapter 3

Stream

Architecture

and

Programming

Model

Inthischapter wedelve intothedetailsofthestreamarchitecture andits programingmodel.

We study two ofthe ground-breaking stream architectures and their applications to wide

rangingareas.

Modern semiconductor

technology

makes arithmetic inexpensive and memory band

width expensive. Both global on-chip andoff-chip communicationthus becomes the per

formance

limiting

factor.

Increasing

ratio of arithmeticto memory bandwidthcalled arith

metic

intensity

and parallelismtouse largenumber ofavailable arithmetic unitscan offset

this shift in cost. Modern multimedia applications emphasize on stream processing pro

grammingmodel.

Stream architecture still in its

infancy

converges these underpinning issueseffectively.

Stream architecture dividesthe application to exposethe inherent

locality

and concurren

ciesinapplications

by

partitioningthecommunication and storage structures andprovides

explicit interconnect

(communication)

between functional units and registerfiles. Stream

processing architecture achieves support for multi-level parallelism also known as con

currency

by

exposing instruction-level parallelism(instructionoverlap inexecution across

multiple functional units), dataparallelism

(exposing

SIMD within same

kernel)

and task
(29)

different

(explicit)

levels of storage

forming

adata-bandwidth

hierarchy

(register hierar

chy), thus reducing the distancetraveled

by

the operands and

limiting

thecommunication

cost. The higher the arithmetic

intensity,

the better the application is suited to stream

processing architecture. Mediaapplications

including

3Dcomputer graphics, image com

pression and signal processing have abundant parallelism to take advantage of stream ar

chitecture. Follows below are some of the representative architectures based on stream

processing.

3.1

RAW Machines

The MIT RAWmachines(Figure

3.1)

provided efficientparallel,pipelined supportformul

timedia applications. The RAW microprocessor executes 16 different

load,

store,

integer,

orfloating-point instructions every clockcycle,controls 25 Gbytes/sofI/O bandwidth and

has 2 Mbytes of on-chip LI SRAM providing 57 Gbytes/s of memory bandwidth [50].

The RAW microprocessor addresses the issue of wire delays and stream-based multime

diacomputations

by

exposingthehardwarearchitecture'slow-level details to thecompiler.

The RAWmicroprocessorrenderssupportinits software

(ISA)

forexplicit communication

to transferdatavalues betweenthe computational portions ofthe tileson aswitched, point

topoint

interconnect,

andthushelps modelingwiredelays. The RAWmicroprocessors also

differ fromcurrent superscalar architectures

by distributing

theregister

files,

and memory

ports, thusrendering support for data locality. Much ofthe hardware support forregister

renaming,instruction scheduling, and

dependency

checking has beenrenderedinsoftware,

hence making room formore memory

(distributed)

and computational logic to beplaced

on chip. Thisresults in the increase in the arithmetic

intensity

rendering support for data

parallelism, maximizing the number of tiles that can be fit on a chip and

increasing

the

chip's achievable clock speed. Multipleprocesses can run simultaneouslywith operations

being

assigned to tiles in a manner that minimizes congestion. The RAW microproces
(30)

V /*" >

J

% J f f r A

=c

/^ '".

3

^ S r .f

,A

^

j f [image:30.549.78.465.42.254.2]

/"

=<

}

/"

)

( IMEM DMEM IE Registers

\aiu7

CL / r SMEM

\

r

|

E3

i Switch

Figure 3.1: RAWprocessor.A RAWprocessorisconstructedusingmultipleidenticaltiles. Each tile contains instruction memory

(IMEM),

data memories

(DMEM),

an arithmetic logic unit

(ALU),

registers, configurable logic

(CL),

and a programmable switch with its

associatedinstruction memory(SMEM).

low-latency

communicationforefficient execution of scientific compute-intensive andme

dia applications. The applications were written in aportable, high-level language called

Streamlt to execute ontheRAWmicroprocessor architecture. Thebasic unit of computa

tionin Streamlt isthe filter. Work formsthecore ofthe filterunit and performs most fine

grained execution step inthe steady state, thus exposing parallelism. In order to support

explicitcommunicationbetween differentcomputationblocksor

filters,

Streamltprovides

high-bandwidth I/O channels (FIFO queues), which can operate in pop, peek and push

modes.

RAWprocessorsnot onlyrunconventional scalar codes,butalsoword-levelcomputa

tions thatrequireso muchperformace that

they

have been assignedtocustomASIC solu

tions. RAWprocessorshave beenusedtoimplement Gigabit Internetprotocols, video and

audioprocessing,filtersandmodems,I/Oprotocols

(RAID,

SCSI,

Fire

Wire)

and communi

cation protocolsforcellularphones,multiple channel cellularbasestations,high-definition

(31)

3.2

Imagine Stream Processor

The Imagine processor (Figure

3.2) [24]

represents a programmable processor designed

to operate

directly

onstreams. The Imaginestream-programming modelmakes communi

cation explicit, and exposesinherent dataparallelism availableinmultimedia applications.

The Imaginearchitecture proposes novel register-filearchitecture andsupports48

floating

point arithmetic unitsineight arithmetic clusters. Eachcluster canbemappedtooperate on

adifferentprocess

by

thestream-programmingmodelusing VLIWinstruction(SIMDfash

ion). Thestream program organizesdataas streamsand expresses computations askernels

(inmediaapplicationseachkernel iscomposedofhundredsofcomputations), thusexploits

dataparallelism through streams and explicit communication (between streams) between

kernels. A kernel consumes a set ofinput streams of data and generates a set of output

streams ofdata. In order to capturedata

locality,

Imagine

(architecture)

divides the regis

ter

hierarchy

into local-register files

(LRFs),

andstream-registerfiles (SRFs). Datastreams

flow between differentcomputationkernelsisachievedthroughSRFsandintra-kernelstor

age/retrieval ofresults is done throughLRFs

(allowing

explicit communication). As each kernelperformsmany computations(hundredsofcomputations),LRFs

help

maintainhigh

data bandwidth needed for reading/writing data

by

different ALUs and handles the bulk

ofthe data

during

kernel execution. StreamC and KernelC (both uses C-like syntax) are

used for writing kernels and applications for Imagineprocessor. StreamC code compiled

by

stream scheduler on the host-CPU sends stream instructions to Imagine's

(processor)

stream controllerusing static anddynamicruntime mechanism. KernelC code running on

Imagineprocessoriscompiledstatically using VLIW kerneland sperformskermelcompu

tationsonthesetofinputstreamsandusescommunicationscheduling forexplicit commu

nicationbetweenoperands [35].

Imagine is an on-going researchproject at Stanford

University

and

they

havesuccess

fully

used it for varied range of applications.

They

have used it for audio and image

compression and decompression. More recently, Imagine processors have been utilized

(32)

advanced communication interconnection networks to give an order of magnitude more

performance per unit costthancluster-based scientificcomputers [13].

Imagineprocessor

Hcsl J Stream

j

conkolter SDRAM c 5 E r

|

i

SDRAM SDRAM 25 Netarork

j

devices t (oSier J MiorocDrrtrc ar 8

1

A'Ihmetic duster Arthrritic

duster

\

831

A':hrrr.ic

:!us:er

Imagines I

h Neararfc Interface

JET]

***

orI/O) | i

2.67 Sbytes/:

Lom

register He

[image:32.549.99.460.99.310.2]

32Gbytes/s 544Sbytesfe

Figure 3.2: Imagine processor. An Imagine stream processor consists of a 128-Kbyte

stream register file

(SRF),

48 floating-point arithmetic units in 8 arithmetic clusters con

trolled

by

a microcontroller, a network

interface,

a streaming memory system with 4

SDRAMchannels, and a stream controller.

However,

the Imagineprocessor is controlled

by

ahostprocessor which sendsinstructionstoit.

Wehaveseen above theneedsfor

designing

new architectures andprogrammingmod

els to meet the performance and

flexibility

requirements ofmedia applications. We also

saw thatstreams provide a powerful programming abstractionthathave been suitable for

efficiently

implementing

complex andhigh-performancemediaapplications. Inthelast few

years, someimportantworkhas been donetoemploystream architectureandprogramming

modelforcomputergraphics application

[40];

[23].

[40]

focusedonthedesignandimple

mentation, and analysis of a computer graphicspipeline on a stream architectureusingthe

streamprogrammingmodel. Imaginewas used asthestreamarchitecture,andan

OpenGL-like pipeline wascompletely implemented on itto illustratethat the streamprocessing ar

chitectures are well suited for graphics applications.

[23]

constructed scalable graphics
(33)

of a clusterofoff-the-shelfworkstations. Chromium usesthe notion ofstream processing

to achieve scalability while retainingmaximum flexibility. Bothworks arguedthatstream

architecturescan providethenecessary scalabilityand

flexibilty

needed

by

futuregraphics

and multimedia applicationsto achievehigherperformance rate.

Today'sgraphics processor unit orGPU

(VPU)

can process overtens ofmilliontrian

gles and over a billion of pixels per second, owing primarily to their stream architecture

which enables all additional transistors to bedevoted to

increasing

computational power.

On a positive side note, this also has led to the growing use of graphics processor as a

general-purpose stream processing engine. Researchers are exploring the idea and port

ing

more and more applications to use GPUas a general-purpose co-processorto achieve

higher performance than on a CPU. In the next chapter we study some of the landmark

graphics architecturesfrom differentgenerations, and alsodetailtheirrecent evolution as a

(34)

Chapter

4

Graphics

Processor Architecture

Inthe last chapterwe studied aboutthe stream architecture andprogramming model, and

presentedthat the current generation ofgraphics processor are modelled after stream

ar-chitectture. In this chapter, we study some ofthe important graphics architectures from

different generations, and also introducethe latest generation of graphics processor, more

commonly referred asGPU orVPU. We begin with an overview ontheclassical graphics

renderingpipeline.

4.1

Graphics

Rendering

Pipeline

Theclassic computergraphics-renderingpipeline enumerates aprocessthatgeometricob

jects must experience on their way to

becoming

pixels on the screen. These objects can

be in the form of points, lines or primitives in the shape of polygons and expressed in

homogeneous coordinate system. The operations performed on the primitives consist of

transformations, divisions

(w-division),

andclippingasin Figure 4.1. An object processed

through thepipeline seesthecoordinate spacesinthe

following

order:

Object Space

Universe Space orWorld Space

(35)

Venexstream

Vertexprogram

Transformed vertexstream

Triangleasse-moty

Trianglestream

Screen-space

mappectiange stream

ClipVCuHrViewport

Texture

Coarsepi*e^or

fragmentstream

'

Fragmertprogram

Processedpixel FinalImageto

[image:35.549.89.454.55.328.2]

stream display

Figure 4. 1: Graphicsrenderingpipeline.

Perspective Space

Clip

Space

Normalized Device Coordinates Space

PixelSpace

The transformations form the object space to the universe space is a combination of

translation,rotation,andscalingoperations. This is donetofittheprimitivesintheuniverse

space. Universetoeyespaceconversionconsistsof puretranslationandrotation sothat the

viewing camera or user is placed at the origin and the direction of viewing

being

set to

+z axis. Then eye to perspective space conversion is done in order to provide mapping

from 3-D space to 2-D space.

However,

the conversion results in

technically

3-D space

representation with z-axis values stored in the Z-buffer and used later in the pipeline to

enable visible and non-visible effects to the object

being

rendered on the screen. The
(36)

outside the

boundary

coordinates of this volume are rendered invisible whereas objects seen to lie within the

boundary

coordinates are made visible on the screen. This space

is chosen to make the clipping arithmetic simpler. It maps the desired visible chunk of

perspective spaceto afixedregion.

Later,

we performcliptonormalizeddevicecoordinate space transformation in order to represent the objects on an internal pixel independent dimensions. Fromthis internalpixelindependentrepresentation, wetransform theobjects in the hardware pixel space from where

they

are rendered onto the screen. This space

represents the screen real state. The whole rendering process can be performed through

fourmajor steps:

Transformto clippingspace

Clip

Divide

by

w (realclipping spaceas dictated

by

the screen)

Transformto pixel space

4.2

Graphics Hardware

According

to one classification of computer graphics architectures

[2],

various genera tionscanbeclassified

by

thecapabilitiesforwhichthearchitecturewasprimarily designed

(target capabilities with maximized performance) and not

by

the scope of the capabili

ties. Anotherclassification categorizesthemonthebasis of

technology (ASIC, DSP,

RISC

CPU,

and CPU extensions), arrangement (SIMD or

MIMD),

and programmability

(end-userprogrammablity and the relative ease with which

they

were programmed) [32]. We

usethelatterclassification methodtodiscussthemotivationbehindvarious

(representative)

graphics architectures.

(37)

down on system costs. We describe some of the landmark graphics architectures, and

presentthelatest generation ofGPU.

Geometry

Engine,

1982

Channel Processoror

CHAP,

1984

Polygon-Rendering Chip

or

PRC,

1986

High Performance Polygon

Rendering

Architecture,

1988

PixelFlow,

1992

Indigo

Extreme,

1993

RealityEngine,

1993

InfiniteReality,

1997

Programmable Graphics

Architecture,

2001

4.2.1

Geometry

Engine

One of the earlist design effort was the development of the

Geometry

Engine

[10]

for

Geometry

Graphics System (IRIS Graphics System). The

Geometry

Enginewas a

special-purpose, four component vector, floating-point processor for accomplishing three basic

floating-point operations; matrix transformations, clipping and viewport transformations

in computergraphics applications. The

Geometry

Engine provided limited

flexibility

and

couldbeconfiguredtoact as a partofthematrixsubsystem,clippingsubsystem andscaling

subsystem inthegraphicsrenderingpipeline. The

Geometry

Systemwas a slave processor

and could only accept or yield one format of data and instructions. The graphical primi

tives were stripped into individual coordinates or vertices and each component processed

(38)

in a pipeline orderto achieve performance ofthe order of 5 million floating-point opera

tions per second. The programmingmodel ofthegraphicssystem wasused asahardware

subroutineto theIRIS processor/memory system andthus wasspecificto theplatformand architecture ofthe

host/controlling

processor.

4.2.2

Channel Processor

The system called Lucasfilm Compositor used the channel processor or CHAP

[30],

a

special-purpose, fourcomponent vector, programmable pixel processorto provide digital

processing capabilities for special-effects filmproduction. The CHAP performed all the

computations and controlledtheflow ofpixels(madeoffourcomponents; red, green,blue and alpha) in the Compositor. Scratchpad memories provided for program data storage

couldbereferencedfor fouror one operand

by

any arithmetic operation. Vectorarithmetic

units connectedto thescratchpadmemoriesthrougha crossbar networkthus allowediden

tical or different operations to be performed on all the fourpixel components in parallel.

TheCHAP's datapath wasoptimized for 16-bit integeroperations. Inorder to render rich

images for film applications, for the first time framebuffer memory banks were config

ured with 12-bits per channel. The CHAP performed single

instruction,

fourcomponent

processing (SIMD). One of the unique features of CHAP was the support for

disabling

some subset of processors over a rangeof operations likeclamping. Applicationsrequired skilledprogrammingefforts and couldbecomplicated

by

thepipelinedelays built into the

hardware modules. Pixar's CHAP could provide interactive usage on limited resolution

images while acceptable performance for high resolution, movie quality images required

offline rendering.

4.2.3

Polygon-Rendering

Chip

Image rendering becauseofitscompute-intensive nature representsthemajorperformance

(39)

necessary components needed forreal-time image rendering were put onone-chip VLSI

implementation. The chip knownaspolygon-rendering chiporPRCwasdesignedprimar

ily

formechanicalengineering CAD applications [49]. CAD applications requireinterac

tive manipulation of solids (shaded polygons). In order to render real-time performance,

PRC facilitated pipelining and internal bandwidth for computation and communication.

The performance was maximized

by

pipelining differentparts ofthe fill and shading al

gorithms and running different functional blocks in parallel. This made the architecture

unique in the sense thatdatacouldbemovedfromone functional blockto another in one

clock cyclevia register-registertransfers. The architecture also used pixel cache so

frame-bufferwrites andZ-bufferaccesses couldbe done inparallel.

4.2.4

High

Performance

Polygon

Rendering

A novel graphics subsystem was developed to meet the demands for

displaying

realis

tic solid objects and support for real-time image rendering and independent windows.

The graphics subsystem was partitioned into four special-purpose pipelined subsystems;

geometry, scan-conversion,raster and

display

subsystem with eachperformingadedicated

task [3]. Each ofthese subsystems were pipelined into various functional blocks. The

Geometry

subsystem comprising of

Geometry

Engines was organized as asinglepipeline

of floating-point processors with each executing a specific subset of geometric transfor

mations.

Geometry

Engines operated

independently,

thus rendering support for SIMD.

When takentogether, goemetrysubsystem achieves an efficiency of50% notseen

by

vec

torprocessors used in other graphics systems present at the same time. Scan-conversion

subsystem responsibleforconvertingvertexdata (from geometry subsystem)to individual

pixels comprised ofPolygon processor, Edge processorand five Span processors (operat

ing

parallel) arranged in pipeline with each composed offixed-point engines. Five Span

processors againoperating in SIMDacheived anaggregatefillrateof40million pixels per

second. Rastersubsytemwaspipelinedinto

twenty

Image

Engines,

each of which operated
(40)

operation).

They

resultedin extremely powerful andflexible pixelfill operations not seen

by

other graphics systems atthe same time. And

finally

the

Display

subsytem interpreted

pixels from the frame-buffer memory and routed it to the DAC for display. The graphics

subsystem produced somebrilliant operatingnumbers

(rendering

100,000Gouraudshaded

and Z-bufferedpolygons per second) and represented a unique graphics system skeleton

thatwasfollowed

by

subsequent graphics architectures. Someofthespecial features

supp-portedwere pan and zoom, window ID masking, real-time video, alpha

blending

and

an-tialiasedlines.

4.2.5

Indigo Extreme

One of the

limiting

factor of the High Performance Polygon

Rendering

design was its

choice forthehostprocessor. This choice made the design only suitedforone particular

kindof platformand thusrenderedlimited programmability and portabilityto otherhard

wares. Other disadvantages include the throughputof various subsystems

being

dictated

by

the speed ofthe slowestprocessing step. This results ina non-linear performance gain.

Considerableoverheadisadded witheach processorinsubsystem stagesintheformofse

quencer, microcodestore, globals data storeandinterface logic between differentpipeline

stagesresulting in thelossof performance.

In order to eliminate the above mentioned design limitations and disadvantages and

to circumvent the bottlenecks seen in the floating-point compute-intensive units used in

the geometry subsystem, compute-intensive arithmetic units used in the scan-conversion

subsytem, the limited fill rate andthe pixel bandwidth as seen into the

frame-buffer,

SGI

introducedaunique computer graphicsarchitecture oftheIndigo Extreme aimed at maxi

mizing performance while minimizing size [20]. The decision was made to combine the

per-vertexand slope calculations into one unit. The slope calculations were implemented

in amicrocoded processorbecauseofrelativelycomplex algorithms involved. The advan

tagesachievedthroughthisapproachresultedintheincreaseinvertex and slopeprocessing

(41)

floating-pointfunctional units were implemented.

Geometry

Engine design approachfol

lowed the fundamental design principle of maximizing the performance of most expen

sive functional units. This was achieved

by

providing custom design data paths to the

operands and support for multiple threads of the same algorithm active simultaneously.

Multi-portregisters were used to render support formultiple simultaneous threads ofcal

culation. More than one

Geometry

Engines were usedin SEMD parallel approach instead

ofsingle pipeline ordertoattainalinearperformnace gain with minimal overhead required

to implement it. This resulted in the processing of eight polygons

(primitives)

in paral

lel.

Hyper-pipelining

was used for

designing

Raster Engines instead of using multiple

copies in an SIMD parallel approach. This was done to minimize the area requirements

while achieving the desired performance.

Memory

bandwidth was increased

by

using an

interleaved frame-buffer across multiple memory banks. The resulting architecture was

capable of over 500,000 Gouraud shaded Z-buffered polygons per second and a fill rate

of 80 million pixels per second.

However,

the design was particularly suited to one kind

of application (CAD programs).

Any

deviation from this fixed-function implementation

resultedin significant performanceloss.

So farwehave mentioned architecturesthatusedspecial-purpose, fixed-functionchips

(ASICs)

forgeometrictransformationsandimage renderingarrangedineitherfine-grained

or course-grained SIMD arrangement. These graphics architectures processed graphical

primitives as independent vertices or coordinates. Some ofthe disadvantages have been

outlined withthedescriptionofeach architecture. SIMDarchitecturesdonot scale

linearly

with the number of geometric engines or image Tenderers. Several important architec

tures basedon MIMD (multiple instructionmultiple

data)

arrangementwere introducedto

resolve these issues. MIMD arrangements processed the vertex stream as a part of a geo

metric primitivethusenablingoperationslike cullingandreducingtheprocessingtimeand

increasing

parallelism. Here we discusssome ofthese importantarchitectures and outline
(42)

4.2.6

PixelFlow

PixelFlowarchitecture

[38]

overcamethescalinglimitationaswellasthegeometrictrans

formations and frame-buffer bottlenecks

by

using the technique of image composition.

Imagecompositiondesigned forreal-time

(30Hz)

3Dgraphics algorithmsand applications

achievedhigh renderingperformance

by

applyingobject parallelismthroughoutthegraph

ics renderingpipeline. The rasterization processors

(renderers,

shaders and

frame-buffers)

were associated with aportion ofthe primitives

(occupying

128x128 pixel regions ofthe

screen), rather than a portion of the screen which is termed as image parallelism. Each

Tenderercomputes pixel values for its primitives regardless oftheir viewport association.

Thecomputedpixelvalues arethen transmittedover a networktocompositingprocessors,

whichresolvethe visibilityofpixelsfromeach Tenderer. Someoftheunique architectural

featuresoffered

by

PixelFlow were intheformofsupersampling antialiasingand deferred

shading mechanism. This distribution mechanism of primitives is alsoclassified as

sort-last mechanism [37]. Sort-last mechanism issues someoutstanding features in the form

of linear scalability and

being

less prone to load imbalances. PixelFlow can be scaled

linearly

to achieve tens of millions of polygons per second. PixelFlow could be config

ured to operate as a part ofworkstation or parallel computer. One distinct feature was

that all the components of PixelFlow were general-purpose, programmable and offered

a simple programming model. PixelFlow rendered complex primitives such as spheres,

quadrics and supported real-timeillumination models,procedural andimage-basedtextur

ing,

environmentmappingand restrictive antialiasing. Somemajorlimitationswerethat the

image-composition network mustsupportvery high bandwidthcommunicationto transfer

128x128pixel region of pixeldata. PixelFlowsuffered significant performanceloss incase

ofload imbalances. PixelFlow could also render the final image once all the primitives

corresponding to the logical tiles hadbeen transformedand sorted. This limitation could

(43)

4.2.7

RealityEngine

RealityEngine is a massively parallel computer graphics architecture developed

by

SGI.

SGI classified RealityEngine (Figure

4.2)

as their third-generation design primarily for

real-time(15-30

Hz)

rendering of textured mapped, antialiased traingles. The

Geometry

lV>l'ni dus. Command Process

Engfees

~

Fragment w

dwserators;

ii

ii

ii ii

ii

nil ii n n

frcage_

Engfcnes

tfi frifli-M*+t -[*

riEin i.-J!W3rjr^rd

tlHlltll *-iS--iHrl !

nrfteic^onKvy bj*i

! Gfepljy^GHE r;tbl :-*ii.nJ;

Figure 4.2: RealityEngine block diagram. Board-levelblock diagramof anintermediate

configuration consisting of 8

Geometry

Engines,

2 raster memory

boards,

and a

display

generatorboard.

Engineperformed multiple

instruction,

multiple componentprocessing andrequired

hand-codedassemblycodetoextractthegreatest performance. Texture mappingrenderesimages

that are

highly

visually

interesting

and lookedmore realistic. PixelFlow rendered texture

mapped images but RealityEngine provided expansive support in thehardware to gener

ate such detailed images. Fragment Generators incorporatedwere fixed-function pipeline

unit and performed the initial generation of

fragments,

generation ofthe coverage mask,

texture addressgeneration, texture

lookup,

texturesample

filtering,

texture modulation of

the fragment color and

fog

computation and blending. RealityEngine also supported

1-and 3-dimension texture maps and thus could be used torender volumetric images. An

[image:43.549.168.381.170.357.2]
(44)

beperformedeither

by

single-pass (PixelFlow didnot supportthis menthod)or multi-pass

antialiasing methods. Unlike

PixelFlow,

RealityEngine separated the transformation and

fragmentgenerationrates. Thismade possiblethe fine

tuning

ofthe architectureforunbal

ancedrendering requirements. Image Engines were assigned a fixed subset ofthe pixels

intheframebuffer. Unlike

PixelFlow,

allthe different functional blocks running inparallel

could complete the final image as soon as the last primitive was rendered

by

the Image

Engine. Thecolorcomponentresolution was increased from 8-bits to 12-bitsto eliminate

visible

banding

and support for substantial framebuffer composition. RealityEngine used

bothobjectparallelismandimageparallelismto achievevery highperformance rates. Ob

ject parallelism implies partitioning either the geometric description of the scene where

vertices of a geometric primitive are processed

independently by

different processors in

parallel or theassociated object spacewhere different geometric primitives areprocessed

independently

by

differentprocessors inparallel. Imageparallelismsplitstheimage space

as fragments where each fragment is processed

independently by

different processor in

parallel.

4.2.8

InfiniteReality

The

InfiniteReality [39]

graphics system architecture shown in Figure 4.3 was designed

with the primary goal of real-time application performance. The architecture was devel

oped as a superset ofRealityEngine graphics system. Unlike previous mentioneddesigns

where

display

list processing had been handled

by

thehostprocessor,

InfiniteReality

intro

ducedlocal

display

listprocessing

by

the graphics subsystemto eliminatehostto graphics

I/O bottlenecks. RealityEngine and previous architectures distributed primitives among

different

Geometry

Engines usinground-robinmechanism. As the

Geometry

Engines op

eratedinanMIMDarrangement,

InfiniteReality

extendedthesupportfor

least-busy

distrib

ution mechanism which offered performanceadvantage over ubiquitous round-robinmech

(45)

nfevffyrtraiSue V-fi**L-i tniaiVUK'

t

#|jkHliOifli***4*ii!i '

1

tga 1 1feaJBw 1 1Saga* ||fogfcMii |

5T=*

|giiiHSiwar-fcui-HM

]

tn*i

r

r

L

f

'

]

11 imp<rr*'..-*Jui 1tiqpmm torn*ma II^lM^M-M^ f"i|p'*liirl"ii>IMTtiri- 1

I

4m* III

r

I 1III lllllll ! tjlUi llMMf>dMi.ii

E

|

l^|4a^

j

T

t**4MWi'

l,

f

1.

1

:,

k.

f

i

,jl

3

r

\

i

i

i

^

it

Figure4.3:

InfiniteReality

Engine block diagram. Board-level block diagramof

Infinite-Reality

Engine withmaximum supportfor4

Geometry

Engines,

4 Raster

Memory

baords,

and a

Display

Generator boardwith8outputchannels.

with four stage pipelined arithmetic units executing pixel-components in SIMD arrange

ment. The screen-space primitives were then input to the Raster

Memory

board which

comprised of a Fragment Generatorandeighty Image Engines. Unlike previous architec

tures, theFragmentGenerators in

InfinteReality

assembled screen-space primitives (trian

gle setup) and computed parameter slopes. The performance of

InfiniteReality

graphics

system made possible theuse ofvery large texturedatabases

by

introducing

a representa

tion called clip-map, and multipass rendering techniques to implement reflections, shad

[image:45.549.132.420.54.404.2]
(46)

for very large textures while multipass rendering techniques can enhance visual realism.

The

display

generator board provided support for flexible and software controlled video

architecture thatcoulddrive upto eight autonomous

display

output channels. To counter

actfill-rate

bottlenecks,

InfiniteReality

graphics systemintroduced dynamicvideoresizing

where the scene is rendered to the frame-buffer at a potentially reduced resolution such

that primitives are drawn in less than one frame time. Despite the success ofMIMD ar

chitectures, SIMD systems offered a simpler programming model. This is owing to the

fact thatMIMD systems imposes aburden on applications and operating systems, which

must be able to cope with the arrival of data at unpredictable intervals and in arbitrary

order. This often resulted in the design of complex and cumbersome communication and

buffering

protocols rendering software overheads. SIMD systems operating in the

lock-step fashion virtually eliminates these overheads.

Also,

it is often possible to structure

algorithms as several distinct functional

blocks,

each of which operates on a uniformdata

type. The rendering pipeline maps naturally onto this structure, and the regularity of the

datastructures within eachphaseleadsto uniformoperations,providingagoodfitwiththe

SIMD programming paradigm.

Finally,

SIMD architectures usually contain thousands of

simpleprocessing elements. Becauseoftheirsheer numbers, goodperformance can often

beachieved even thoughprocessors may notbe

fully

utilized [12].

4.2.9

Graphics Software Libraries

The realm of computer graphics applications was growing but there was no real coordi

nated effortto

develop

hardwareindependentgraphicssoftwarelibrariesthatcouldbeused

to render simple or complex scenes effeciently. All the above mentioned graphics sys

temdesignsweredeveloped

independently

withoutanycoersiontowards

industry

standard

compliance. These graphics systemscould notbe implemented on avariety of platforms

with a range of graphical capabilities. There existed several independent standards most

(47)

(PHIGS extensionto X window system)

[15]

thatcould render 2D or3D objects on lim

itedgraphics systems withlimitedgraphical capabilitiesmakingprogramportability

prob-lemetic. Also we have mentioned that it required considerable knowledge and expertise

to program graphical applications effeciently and was mostly carried out

by

the

systems'

designers. OpenGL developed

by

SGI is an emerging graphics standard. Itis alow-level

graphics

library

specification thatprovides advancedrenderingfeatures whilemaintaining

a simple andflexible programmingmodel. OpenGL is rendering-only,so it is independent

ofthemethods

by

whichtheuserinputand other window or non-window systemfunctions

are achieved, making the rendering portions of the graphical program thatuses OpenGL

platformandoperatingsystemindependent.

4.2.10

Streaming

CPU Extensions

PixelFlow, RealityEngine, InfiniteReality,

etc represented massively parallel graphics ar

chitectures with variedrange of capabilities.

However,

the above mentioned graphics ar

chitectures were accessibleatvery highcostforspecial-purposes. Technological advance

mentsintheVLSI

technology

made possible visual realism experiencefor

desktop

personal

computers

by

incorporating

SIMD arrangement inthe general-purpose processor designs.

In

Figure

Figure 2.1:thealong Barnes-Hut multipole accessibility criterion. V denotes the length of theside of the cell representing the pseudoparticle
Figure 2.2: The relation s/d for different levels of the tree. The figure illustrates the treetraversal for different values of 9
Figure 2.2 shows the effect of 9 on the opening angle criterion. As 9 is taken closer to 0,
Figure 3.1: RAW processor. A RAW processor is constructed using multiple identical tiles.Each tile contains instruction memory (IMEM), data memories (DMEM), an arithmeticlogic unit (ALU), registers, configurable logic (CL), and a programmable switch with itsassociated instruction memory (SMEM).
+7

References

Related documents