Hierarchical N-Body problem on graphics processor unit

(1)

Rochester Institute of Technology

RIT Scholar Works

Theses

Thesis/Dissertation Collections

3-15-2006

Hierarchical N-Body problem on graphics

processor unit

Mohammad Faisal Siddiqui

Follow this and additional works at:

http://scholarworks.rit.edu/theses

This Thesis is brought to you for free and open access by the Thesis/Dissertation Collections at RIT Scholar Works. It has been accepted for inclusion in Theses by an authorized administrator of RIT Scholar Works. For more information, please [email protected].

Recommended Citation

(2)

Hierarchical N

-Body Problem on Graphics Processor Unit

by

A Thesis Submitted in Partial Fulfillment

of

the Requirements for the Degree of

Master of Science in

Computer Engineering

Supervised by

Dr. Muhammad Shaaban, Associate Professor, Computer Engineering

Department of

Computer

Engineering

Kate Gleason

College of

Engineering

Rochester Institute

of

Technology

Rochester

,

New York

March 15

,

2006

Approved

By:

M.Shaaban

Dr. Muhammad Shaaban

Associate Professor

, Computer

Engineering

Primary Adviser

Lawrence Ray

Dr

.

Lawrence Ray

Adjunct Professor,

Computer

Engineering

Juan Cockburn

Dr

.

Juan Cockburn

(3)

Thesis Release Permission Form

Rochester Institute of Technology

Kate Gleason College of Engineering

Title: Hierarchical

N

-Body Problem on Graphics Processor Unit

I, Mohammad Faisal Siddiqui

,

hereby

grant

permission to the Wallac

e

Memorial

Li-brary reporduce my thesis in

whole or part.

Mohammad Siddiqui

(4)

Copyright 2006

by

(5)

Dedication

(6)

Acknowledgments

Having

Dr. Shaaban as _{my primary} adviserhas been an _amazing experience. His unique and exhaustive

teaching

style spurred in me thepassion forcomputer architecture. I am overwhelmedwith gratitudefor his _guidance, motivation and support. Hegave meenough

freedomtoexplore new

ideas,

and appliedpressure when seemed appropriate. Ialso admire

andrespecthis

integrity

bothas a professor and aperson.

I'd like to thank_my committee_members, Dr. Lawrence

Ray

andDr. Juan Cockburn for giving importantsuggestionsandfeedbackwhennecessary. Ialso appreciatetheir

help

and concernthroughout_mythesis.

I'dlike to thank Dr. Stefan Harfstwith the Astrophysics

Group

at Rochester Institute

of

Technology

whohelpedmethrough theinitial stages of_myresearch

during

thesummer

of2005.

And,

finally

Iwouldliketo thank theentire staff ofDepartmentofComputer Engineer

ing

atRochester Institute of

Technology

fortheir

help

and _assistance, and _makingmefeel

(7)

Abstract

Galactic simulation is an importantcosmological _computation, and represents a classical

N-body

problem suitablefor implementation on vector processors. Barnes-Hut algorithm

is ahierarchical

N-Body

methodusedto simulatesuch galactic evolutionsystems.

Stream processing architectures exposedata

locality

and_concurrencyavailableinmul

timediaapplications. On theother

hand,

there are numerous compute-intensive scientific or_engineering applicationsthatcan _{potentially benefit from}such computational and com munication models. Theseapplications are

traditionally

implementedon vector processors.

Stream architecture based graphics processor units

(GPUs)

present a novel computa

tionalalternativefor efficiently

implementing

suchhigh-performanceapplications. Render

ing

onastream architecture sustainshigh_performance, whileuser-programmablemodules allow

implementing

complex algorithms efficiently. GPUs have evolved over the years, from

being

fixed-function pipelinesto user programmableprocessors.

In this _thesis, we focus on the implementation of Barnes-Hut algorithm on typical

current-generation programmableGPUs. We exploitcomputation andcommunication re quirements presentin Barnes-Hutalgorithmtoexposetheir_{suitability for}user-programmable

(8)

Contents

Dedication iv

Acknowledgments v

Abstract vi

Glossary

xii

1 Introduction 1

2

N-Body

5

2.1

N-Body

Problem 5

2.2

N-Body

Methods 5

2.2.1 Particle-Particle Method 6

2.2.2 Particle-Mesh Method 6

2.2.3 Particle-Particle/Particle-Mesh Method 7

2.2.4 Hierarchical Methods 7

2.3 Barnes-Hut Algorithm 8

3 Stream Architecture and

Programming

Model 13

3.1 RAW Machines 14

3.2 Imagine Stream Processor 16

4 Graphics Processor Architecture 19

4.1 Graphics

Rendering

Pipeline 19

4.2 Graphics Hardware ₂₁

4.2.1

Geometry

Engine ₂₂

4.2.2 Channel Processor ₂₃

4.2.3

Polygon-Rendering Chip

23

(9)

4.2.5 Indigo Extreme 25

4.2.6 PixelFlow 27

4.2.7 RealityEngine 28

4.2.8

InfiniteReality

29

4.2.9 Graphics Software Libraries 31

4.2.10

Streaming

CPU Extensions 32

4.3 Programmable Graphics Hardware 33

4.3.1 Vertex Processor 33

4.3.2 Fragment Processor 37

4.3.3 Graphics

Memory

38

4.3.4 GPU

Programming

Model 40

4.4 GPU for Non-Graphics Operations 42

5 Previous Work 44

5.1 _{Fast Matrix Multiplies using Graphics Hardware} 45

5.2 Physically-based Visual Simulationon Graphics Hardware 46

5.3 GROMACS 47

5.4 Particle-Mesh

N-Body

MethodonGraphics Hardware 48

5.5

Radiosity

onGraphics Hardware 48

5.6

Ray

Tracing

onGPU 49

5.7 Octree Textureson theGPU 50

6 Implementation 52

6.1 Algorithm Overview 53

6.2 Octree Construction 55

6.3 _{A^-tree memory map}ontheGPU 57

6.4 Kernels 60

6.4.1 Traverse 60

6.4.2 Position 64

6.4.3

Velocity

64

7 Resultsand Discussion 66

7.1 Results ₆₆

7.2 Discussion ₆₉

7.3 Difficulties Encountered 70

(10)

8 ConclusionandFuture Work 76

8.1 Conclusion 76

8.2 Future Work .77

Bibliography

78

A Octree Data Structureson CPU 84

(11)

List

of

Figures

2. 1 Barnes-Hutmultipole_{accessibility}criterion 8

2.2 Therelation s/dfor different levelsofthetree 10

3.1 RAWprocessor. 15

3.2 Imagineprocessor. 17

4.1 _{Graphics rendering}pipeline 20

4.2 RealityEngine block diagram 28

4.3

InflniteReality

Engine block diagram 30

4.4 Overallsystem architecture 34

4.5 GeForce 6 Series Vertexshaderblock diagram 35

4.6 GeForce 6 Series fragment processor. 37

4.7 _{GPU memory hierarchy.} 39

5.1 Matrix Multiplication 45

5.2 CML Performance Comparison 47

6.1 Algorithm 54

6.2 Octree 55

6.3 Octreemapped to2Dtexture 57

6.4 2Dtextureson theGPU 59

6.5 Treetraversal 61

6.6 Position kernel 65

6.7

Velocity

kernel 65

7.1 Positionand

Velocity

kernel

timing

graph ₆₈

7.2 Treetraversalkernel

timing

graph 68

7.3

(Ideal)

Positionand

Velocity

kernel

timing

graph 72

(12)

List

of

Tables

7.1 Minimum 2Dtextures neededto store octreeinformation 67

7.2 Branchpenalties onNvidia6 Series GPU 70

(13)

Glossary

arithmetic

intensity

Ratio ofarithmetic computationsto communication.

cell Anode of an octree withmorethan oneparticle.

F

fragment processor Fragmentprocessors are

fully

programmable_processors, operat

ing

in SLMD-parallel fashionon inputelements: _{processing four-element}vec torsinparallel.

framebuffer Framebufferisa partofGPU memorywhichholds finalcolored pixelsfor

display

onthe screen.

gather Gatheroperation occurs whenthe_{kernel processing}a streamelementrequests

information from otherelements in the stream: it

"gathers"

information from

otherpartsofmemory.

GPU A graphics processor unit or GPU is a graphics _rendering pipeline used for

(14)

K

kernel

Terminology

used in stream architecture for

describing

a _computing unit or

programthatoperates onthe data.

leaf Anodeofan octree with_onlyone particle.

M

MIMD Mulitple instructionmultiple data letsmultipleinstructions to operate on mul

tipledataitemsat _anygiventime.

o

octree Atree datastructure with eight children.

scatter Scatteroperation occurs when the _{kernel processing}a stream element distrib

utes information to other stream elements: it

"scatters"

information to other

parts of memory.

SIMD Single instruction multiple data lets one instruction operate on multiple data

itemssimultaneously.

Streams

Terminology

usedinstream architectureforadatasetthatneeds similar com

(15)

texture

memory

_{Texture memory is}the_onlypart oftheGPUthat_{is randomly} accessi

ble

by

the fragmentprocessors.

vertex processor Vertexprocessors are

fully

programmable and operateineitherSIMD

(16)

Chapter 1

Introduction

The classical

A^-Body

problem simulates the evolution of a system, comprising of n dis

crete bodies underthe influence exertedon each

body by

all otherbodies. In thegalactic

simulation,we _study the evolution of a system of_particles, where each particle represents a star or sampledto represent collection ofstars, solarsystem, galaxies,cluster of galaxies

and other large-scale structures of the universe bounded together

by

Newtonian gravita

tionalpotential.

The hierarchical method proposed

by

Barnes andHut

[5]

builds atree structured_rep

resentation ofthe physical domain ofthe problem (in our_{case, space),} andthen compute

interactions

by

traversing

the tree. Such hierarchicaltree-basedmethods reducethecompu

tational _complexity to

0(nlogn)

fornon-uniform

distributions,

or even

0(n)

for uniform

distributions,

with parallelism proportional to n and without

introducing

_any significant

ambiguity in the results. Applications that use Barnes-Hutmethod are_challenging to ob tain effective parallel performance, owingto their non-uniformdistribution ofbodies over thephysical

domain,

distributionofbodies across thephysicaldomainchanges over_time,

and the need for

long

range communication.

However,

we are helped

by

two properties:

first,

the system evolves _slowly over_time, and _second, the gravitation interaction falls off

quickly with distance.

Also,

the force computation on _any particle can be computed in parallel with otherparticles,andindependentofother_{iterations. We map}thesimulationof

(17)

Modern semiconductor

technology

makes arithmetic inexpensive and _{memory band}

width expensive. Both global _on-chipand_off-chip communicationthus becomes the per

formance

limiting

factor.

Increasing

ratio ofarithmeticto_{memory bandwidth}called arith

metic

intensity intensity

and parallelism to use large number ofavailable arithmetic units

can offsetthisshiftincost. Modernmultimedia applications emphasize on stream process

ing

programmingmodel. _{Stream programming}modelrenderssupportformulti-level paral

lelism and exposesinherent data

locality by

_partitioningthecommunication anddatastor

age structures. _{Stream programming} model represents all data as streams and expresses

all computationalunits askernels. Applicationsmappedto the stream_programmingmodel

are constructed

by

_{chaining kernels}together. These_{kernels operating}onentire streamsex

poses multi-levelparallelism while local data storage within kernel execution unitselimi

natesglobal communicationcosts. Computationunits are calledaskernelsand streamsare

made ofidentical dataelementsthatrequiredataprocessing.

Thecore ofthe graphics _{architecture,} more _commonlyreferred to asthe graphics ren

dering

pipeline orgraphics pipelinedefinesa projection-based_renderingalgorithm(inreal

time) that supports _{surface-based,} volume based or image-based _{representation,} given a

virtual camera _setting, _{mathematically} modelled 3D objects, position and specification of

light sources,

lighting

models, textures, and more [4]. The graphics _rendering pipeline is

split into several distinct substages, and more _recently vertex and fragment shaders have

been introducedas user-programmable stages to improve the

functionality

and

flexibility

to perform _any task [32]. This conceptual model ofgraphics hardware andthe addition

of user-programmable substages makes the graphics hardware rendering pipeline a good

matchforstream_programmaing model. Thevertexandfragmentshaders withinthe GPU

canberepresentedas kernelsor computationalblocks whilethe

highly

localized dataflow

between these substages or kernels can be abstracted as streams. The data flow between

stages in the graphics pipeline is

highly

localized with the data produced

by

one stage

cosumed

by

the nextstage. Some ofthe latest GPUsprovideIEEE-standard single preci

(18)

GeForce 6 Series orNvidia GeForce 7800. The user-definedprograms (shader_programs)

written either_{in assembly}

(ARB_fragment_program)

orhigh-level languages

[34];

HLSL (High-level shading

language);

Sh

[36]

can be loaded into vertex and fragment units at

run-time. The applicationdata is storedon the GPUas texture imagesand can be

ID,

2D

or3D. The fragmentor vertex shaders actonthisdataandwriteresultseitherto the

frame-buffer orframebuffer attached objects (framebufferobjects) which can be again used via

afeedback loop.

Furthermore,

with the addition of dynamic flowcontrol

(branching)

and

loop

constructs, present generation GPUs can be used for more than one particular with

possiblegaininperformance.

The developmentof_programmingcomponents inthe GPU has allowed researchers to

use them for other applications than_{just simply rendering} polygons. With the advent of programmable

GPUs,

efforts have been made to also _{map high-performance computing} applicationsto the

GPU,

eg. scientificcomputations like

FFTs,

linear algebrasolvers, dif ferential equations _solvers, multi-grid solvers and applications to fluid

dynamics,

visual

simulation,icecrystals;geometriccomputationslike Voronoi

diagrams,

distancecomputa tion, motionplanning, collision

detection,

object reconstruction, visibility; advanced ren

dering

like ray-tracing, radiosity, photon mapping, sub-surface _scattering, _shadows; and databaseoperations on_{aggregates, predicates,} Boolean_{combinations,} selectionqueriesfor large databases [1].

We have developed a completeBarnes-Huttreecodeimplementation onthe GPU. The

motivation behind attempting

A^-Body

problem on the GPU is the recent successful im plementationof_radiosityalgorithm on GPUs. The hierarchicalsubdivision_radiosity algo

rithm is based on the hierarchical

A^-Body

problem. The

A"-Body

problemshares _many

similarities with the _radiosity problem which suggest that these ideas can lead to an in crease in performance on the GPU. In both the

A^-Body

andthe _radiosity _{problem, there}

are n(n l)/2 pairs ofinteractions.

Moreover,

just as the magnitude of the form-factor

betweentwopatches falls off as

1/r2,

the gravitational or electromagnetic forces alsofall

(19)

major difference between the two problems is the manner in which the hierarchical data structures are_{formed. The radiosity}algorithmbeginswithafew largepolygons andsubdi

videsthemintosmaller and smaller patches. Our

A^-Body

algorithmbeginswith nparticles

andclustertheminto largerand largergroups. Anotherdifference isthat_radiosityproblem

is

inherently

non-linearbecauseof occlusion where

intervening

opaque surfacescanblock

the transportoflight between the twosurfaces. The

A^-Body

problem ontheotherhand is

linearandtakesadvantageoflinearsuperposition, theprinciple of superposition statesthat

thepotential due to a cluster of particles isthe sum ofthepotentials ofthe individual par

ticles.

Finally,

the _radiosityproblemis basedon anintegral equation,whereasthe

Af-Body

problemis basedon adifferentialequation.

The remainderofthisdocument is organizedin the

following

manner:

Chapter 2presents an overview ofdifferent

A"-Body

methods and explains

Barnes-Hutmethodin detail.

Chapter 3givesabackgroundoverview onthestream architecture and_programming

model, anddetailstwo major stream architectures.

Chapter 4 presents a detailed account of landmark graphics architectures

belong

ing

to differentgenerations, and also presents adetailed explanation on the current

generationof programmable graphicshardware.

Chapter 5 detailssome backgroundinformation in differentresearchareasthatpro

vided motivation fortheBarnes-Huttreecode implementationonthe GPU.

Chapter 6 discusses the actual implementation ofthe Barnes-Hut treecode on the

GPU.

Chapter 7 discussestheresultsand providesthediscussiononthe generated results.

Chapter

8,

finally

provides the conclusion and suggests some extensions for future

(20)

Chapter 2

N-Body

The main objective of our thesis is to implement Barnes-Hut treecode algorithm on the

graphicsprocessor unit. Inthischapter we explainthe

A^-Body

problemandpresentabrief

overview ofimportant

A^-Body

methods used in astrophysical simulations.

Furthermore,

we explainBarnes-Hutalgorithmingreatdetail.

2.1

TV-Body

Problem

The

Af-Body

problem isstated asfollows: Giventhe initial _states,position and_velocityof

n

bodies,

computethestates attime T. Themost _widely acceptedapproach istoperform

simulation over atime period. Thesimulation time period is discretized into hundreds or

thousands of _time-steps, each

time-step

computes the _{forces acting} on each particle and

updates_{the position,} _velocityand other attributes associatedwiththatparticle.

2.2

TV-Body

Methods

We

briefly

describe here different

A^-Body

methods_commonly usedin astrophysical sim

(21)

2.2.1

Particle-Particle Method

The particle-particle orthe directsummation methodisthe most straight-forwardmethod

ofthe

A^-Body

methods.Thisdirect integrationapproach is flexible butthecost of compu

tation

linearly

increases with _n, the number ofparticles and _mostly used with collisional

systems. In the case of collisional _systems, the evolution of the system is driven

by

mi croscopic exchange ofthermal energies between different particles. In this case it is not

easyto usefastand approximate algorithmsto determinethe interaction betweenparticles and the cost per

timestep

is

0{n2),

and the total cost of the simulation is

0(n3)

[33]. There aretworeasons_whytheuse of approximate algorithmsfor force computationisnot

plausible. Theforemostreasonis theneed_{for relatively higher accuracy} and_{secondly, the} wide difference intheorbital timescalesofparticles whichmeans that the particles in the

core require much smaller time-steps than particles in the halo. In order to achieve such

highcomputationalcost, specializedhardware has beenconstructedforgravitationalinter

action, GRAPE

(GRAvity

piPE) [17]. AGRAPE hardware has specialized pipelines for

gravitational force _calculation, whichis the most expensive part ofmost

A^-Body

simula

tion algorithms. Allother_{calculations,} suchastime integrationof_orbits,areperformedon

astandardhostcomputer connectedtoGRAPE.

Ontheother

hand,

collisionless systems canbeapproximated_usingalgorithms suchas

particle-mesh

(PM)

method

[8],

particle-particle/particle-mesh

(P3M)

method

[8],

Barnes-Huttreealgorithm

[5],

fastmultipolemethod

[18]

ortreecodeparticle mesh

(TPM)

meth

ods [51].

2.2.2

Particle-Mesh

Method

Theparticle-meshmethodtreats theforceasa_{field quantity}

by

_{approximating it}on amesh.

The main advantage ofthe PM methodis speed. The number of computations is of order 0(n + _7iglogng) where

ng

is the number of grid points on the mesh.

However,

the PM

(22)

have difficulties

handling

non-uniform particle distributions thus _{offering limited} resolu

tion.

2.2.3

Particle-Particle/Particle-Mesh Method

Theparticle-particle/particle-meshmethod solvesthemajor_shortcomingofthePMmethod

- low

resolution forcescomputed forparticles near each other. The P3M method creates a

combined force function forall n bodies that is evaluated at _regularly spaced points on a

3D mesh. The interaction of_p with other bodies in thesystem _{may be decomposed into}

a number of direct interactions between _p and _nearby

bodies,

and interaction between_p

andthecombinedforce interpolatedon a mesh. The totalnumberof operations isof order

0(n + ng), where _ng is the number of grid points on the mesh. The only disadvantage

oftheP3M methodisthat it becomes dominated

by

theinteraction _{forces between nearby}

particles.

2.2.4

Hierarchical

Methods

Hierarchical basedmethods like Barnes-Huttree algorithmandfast multipole method are

the best known methods for solving classical

A^-Body

_problems, such as those in astro

physics, electrostatics and moleculardynamics. Allthe hierarchicalmethodsforclassical

problems first builds a tree-structured, hierarchical representation of physical space, and

then compute force interactions

by

traversing

the tree. The root ofthe tree represents a

space cell _containing all the particles in the system. Thetree is built

by

_recursively sub

dividing

space cells until some termination condition is met. In three

dimensions,

_every

subdivisionof a cell results inthecreation ofeight equally-sized children_cells,

leading

to

an octree representation ofspace, whereas, intwo

dimensions,

each spatial decompostion

leadsto fourequally-sizedchildren.

Treecodeparticle-meshmethods usestreecodehierarchicaldecompostionforparticles

(23)

usto usebesttechniques

depending

onthe spatialdecompostionofthe space.

2.3

Barnes-Hut Algorithm

In the Barnes-Hut tree algorithm, the cubical root node _encompassing the total mass is

repeatedly divided intoeight orfewer daughternodesofhalfthe sidelengtheach, untilone

ends _upwith '

leaf nodes_containing single particles orbodies. Forces arethen computed

on each particle

by

traversing

the tree _starting with the root node. A decision is made

whetherthenode provides anaccurateanoughforce actingontheparticle. Iftheansweris

affirmative, themultipoleforce isusedtocomputetheforceandthewalk_alongthisbranch

ofthe tree _{is terminated,}_{otherwise, the} node is traversedfurther down where we evaluate

forces ontheparticle

by

_{considering its daughter}nodes.

d

j^k

[image:23.549.115.484.362.515.2]

particle

Figure 2.1: Barnes-Hut multipole _{accessibility} criterion.

V

denotes the length ofthe

sideofthecell _representingthepseudoparticle. 'd! denotes theeuclideandistancebetween

the particle and the center-of-mass ofthe pseudoparticle. The tree traversal is terminated

(24)

The tree hierarchical structure provides the means to distinguish between close parti

cles anddistantparticles without _{actually computing} the _{distance between every} particle.

The force betweenparticles that are close enough are computed

directly,

whereas distant

particles are groupedtogetherand are accounted as one giant node or apseudoparticle. In

order to compute forces from _{these pseudoparticles,} several 'multipole acceptance crite

rion'

(MACs)

are available whichdecideswhetherto accept as it

is,

or subdivideit into its

daughternodes. Thesimplestofthe criterion istheoriginal s/d introduced

by

Barnesand

Hut Ref.

[5]

and shown in Figure 2.1.

Starting

with theroot node, the size ofthe current

node s (thesidelengthofthe cell_comprisingthe_{pseudoparticle)} is compared with its dis

tancefromthe particle,d. Iftheratios/dissmallerthansome predefinedvalue,

9,

then the

nodeistaken asisand furthertraversal_alongthatnode isterminated.

Otherwise,

the node

is 'opened'

into its daughter _nodes, each of which is recursively examined _according to

s/d _and, if_necessary, subdivided. 9 is definedastheMAC parameterand can bevariedto

achieve _{any desired level} of approximation for computing forceson theparticles.

Usually

theMAC provestobeintherange of9= 0.3

1.0,

depending

onthe _accuracyneeded

by

the application. Thesimulation oftheevolution of a system of nbodies overtimeis based

onthe Newtonian equations of motion. The pairwiseinteraction potential

f(i,j)

between

body

i and

j

canbeexpressed as:

G.nii.nii

f(t.3)

= h^

(2-1)

where Gis gravitationalforceconstant, m*and_rrij representthemassesofthe two

bodies,

and

rfj

representtheeuclideandistance betweenthe twodistinct bodies. Thetotal forceon

a particleican besummed as:

/()

=

/(*,

J)-(2-2)

The Barnes-Hutalgorithm isone ofthe most_{widely implemented} astrophysical simu

lationalgorithm. Effortshave been made to parallelizethe algorithmto exploitthe paral

lelismavailableinthe computation partofthealgorithm. Thecore oftheforce calculation

(25)

4

it

it itit it &

it it it it A s V it

**

\ *

**

it it it a & * it it

s/d=_0.3 s/d=0.7

A V ft it # it it * & ft

[image:25.549.131.421.86.392.2]

s/d=1.0

Figure 2.2: Therelation s/dfor different levelsofthe tree. The figure illustratesthe tree

traversalfor differentvalues of9. When 9 =

1.0,

_theforce_{exerted on} _the_remote_particle

is due to the influence of the entire cell with the side length given

by

s, and _computing

body-cell force interaction. If 9 =

0.7,

is

assumed, we traversefurther deeper within that

cell and dividethe space into equal spaces, with s again _representing the cell side

length,

and again_computingbody-cell force interaction. If9 = _{0.3 is}

assumed,wefurther divide

the space _evenly and _eventually we reach the leafofthe _tree, thus _computing

body-body

force interaction.

Algorithm 1 StepSystem

MakeTree()

while

loop

overallbodies do

(26)

Algorithm 2GravCalc l 2 3 4 5 6 7 8 9 10 11

if

Q

is aleafthen

body-body

force interaction else

if P is distantenoughfrom

Q

then

body-cellforce interation

else

while

loop

over alldaughternodes do

GravCalc(P,

daughter_node)

end while

endif

StepSystem( ), as in Algorithm 1 is the routine used for stepping through the

simulation. The routine StepSystem ( ) calls _MakeTreeO which constucts a new

octree based on the current particle positions, and then we compute force on each par

ticle/body

using the routine GravCalc () , as shown in Algorithm 2. The routine

GravCalc ( ) computestheforceexertedontheparticleP

by

all other_nodes,Q. Depend

ing

whetherthenode

Q

is a particle orapseudo-particle,wecomputetheforce exerted

by

thatnodeontheparticle P.

In our

implementation,

we used3 different MACs.

First,

aminimalcriterion isused wherenocell

touching

theparticle underconsider

ationis accepted.

Second,

theoriginal MAC introduced

by

BarnesandHutas mentioned above.

Third,

theMAC introduced

by

Salmon andWarren to eliminate pathological errors [image:26.549.63.494.60.228.2]

that_mayoccur when_{using Barnes-Hut MAC [44].}

Figure 2.2 shows the effect of9 on the _opening angle criterion. As 9 is taken closerto

0,

wetraverse deeper down thetree_{for every} node withtheforcecomputation more exact as

comparedwiththe forcecomputation when9 istakencloserto 1.

We also eliminated _computing quadrupole momemts associated with each cell node.

(27)

mapping oftheoctree on the GPU in a_memory efficientway. Also it has been observed

that_usingquadrupole orhighercell moments _{may increase}the_accuracy ofthe forcecom

putations, but this advantage is compensated

by

the _{increased complexity} per evaluation and the highertree construction and the tree storage overhead [48]. Another advantage of_using monopole moments arisefrom thefact that thedynamical treeupdates

during

the

simulation canbeperformed ina_veryefficient manner.

Barnes-Hutalgorithm _{has been successfully}parallelized toexploitdataparallelismin

volvedintheforcecomputation,thus_makingthemidealfor high-performancemultiproces sors or vectorprocessors.

However,

suchapplication

typically

hascharacteristicsthatmake it challengingtopartition and scheduleit foreffective parallel performance. As clearfrom

thealgorithm _GravCalc,theforcecomputations on each particle are

totally

independent ofprevious simulation steps.

Also,

we can computeforce interactionsondifferentparticles

simultaneously because the particle positions are _only updated at the end of the current simulation step.

Thus,

the force computations are SIMD and

they

can be parallelize to achieve faster computation which results in the reduction ofthe overall simulation time.

Recently,

stream processors have been proposedthat _may _eventually takeplace of vector

(28)

Chapter 3

Stream

Architecture

and

Programming

Model

Inthischapter wedelve intothedetailsofthestreamarchitecture andits programingmodel.

We study two ofthe _{ground-breaking} stream architectures and their applications to wide

rangingareas.

Modern semiconductor

technology

makes arithmetic inexpensive and _{memory band}

width expensive. Both global _on-chip and_off-chip communicationthus becomes the per

formance

limiting

factor.

Increasing

ratio of arithmeticto _{memory bandwidth}called arith

metic

intensity

and parallelismtouse largenumber ofavailable arithmetic unitscan offset

this shift in cost. Modern multimedia applications emphasize on stream _processing pro

grammingmodel.

Stream architecture still in its

infancy

converges these _{underpinning issues}effectively.

Stream architecture dividesthe application to exposethe inherent

locality

and concurren

ciesinapplications

by

_partitioningthecommunication and storage structures andprovides

explicit interconnect

(communication)

between functional units and registerfiles. Stream

processing architecture achieves support for multi-level parallelism also known as con

currency

by

exposing instruction-level parallelism(instruction_overlap inexecution across

multiple functional units), dataparallelism

(exposing

SIMD within same

kernel)

and task

(29)

different

(explicit)

levels of storage

forming

adata-bandwidth

hierarchy

(register hierar

chy), thus _reducing the distancetraveled

by

the operands and

limiting

thecommunication

cost. The higher the arithmetic

intensity,

the better the application is suited to stream

processing architecture. Mediaapplications

including

3Dcomputer graphics, image com

pression and signal _{processing have} abundant parallelism to take advantage of stream ar

chitecture. Follows below are some of the representative architectures based on stream

processing.

3.1

RAW Machines

The MIT RAWmachines(Figure

3.1)

provided efficientparallel,pipelined supportformul

timedia applications. The RAW microprocessor executes 16 different

load,

store,

integer,

orfloating-point instructions every clock_cycle,controls 25 Gbytes/sofI/O bandwidth and

has 2 Mbytes of _{on-chip LI SRAM} _providing 57 Gbytes/s of _{memory bandwidth} [50].

The RAW microprocessor addresses the issue of wire delays and stream-based multime

diacomputations

by

_exposingthehardwarearchitecture'slow-level details to thecompiler.

The RAWmicroprocessorrenderssupportinits software

(ISA)

forexplicit communication

to transferdatavalues betweenthe computational portions ofthe tileson a_switched, point

topoint

interconnect,

andthus_{helps modeling}wiredelays. The RAWmicroprocessors also

differ fromcurrent superscalar architectures

by distributing

theregister

files,

and _memory

ports, thusrendering support for data locality. Much ofthe hardware support forregister

renaming,instruction scheduling, and

dependency

_{checking has been}renderedinsoftware,

hence making room formore _memory

(distributed)

and computational logic to beplaced

on chip. Thisresults in the increase in the arithmetic

intensity

_rendering support for data

parallelism, maximizing the number of tiles that can be fit on a _chip and

increasing

the

chip's achievable clock speed. Multipleprocesses can run _{simultaneously}with operations

being

assigned to tiles in a manner that minimizes congestion. The RAW microproces

(30)

V /*" >

J

% J f f r A

=c

/^ '".

3

^ S r .f

,A

^

j f [image:30.549.78.465.42.254.2]

/"

=<

}

/"

)

( IMEM DMEM IE Registers

\aiu7

CL / r SMEM

\

r

|

E3

i Switch

Figure 3.1: RAWprocessor.A RAWprocessorisconstructed_usingmultipleidenticaltiles. Each tile contains instruction memory

(IMEM),

data memories

(DMEM),

an arithmetic logic unit

(ALU),

_registers, configurable logic

(CL),

and a programmable switch with its

associated_{instruction memory}(SMEM).

low-latency

communicationforefficient execution of scientific compute-intensive andme

dia applications. The applications were written in a_portable, high-level language called

Streamlt to execute ontheRAWmicroprocessor architecture. Thebasic unit of computa

tionin Streamlt isthe filter. Work formsthecore ofthe filterunit and performs most fine

grained execution _{step in}the _steady state, thus exposing parallelism. In order to support

explicitcommunicationbetween differentcomputationblocksor

filters,

Streamltprovides

high-bandwidth I/O channels (FIFO queues), which can operate in pop, peek and push

modes.

RAWprocessorsnot _onlyrunconventional scalar _codes,butalsoword-levelcomputa

tions thatrequireso muchperformace that

they

have been assignedtocustomASIC solu

tions. RAWprocessorshave beenusedtoimplement Gigabit Internetprotocols, video and

audioprocessing,filtersand_modems,I/Oprotocols

(RAID,

SCSI,

Fire

Wire)

and communi

cation protocolsforcellularphones,multiple channel cellularbasestations,high-definition

(31)

3.2

Imagine Stream Processor

The Imagine processor (Figure

3.2) [24]

represents a programmable processor designed

to operate

directly

onstreams. The Imagine_{stream-programming} modelmakes communi

cation explicit, and exposesinherent dataparallelism availableinmultimedia applications.

The Imaginearchitecture proposes novel register-filearchitecture andsupports48

floating

point arithmetic unitsineight arithmetic clusters. Eachcluster canbemappedtooperate on

adifferentprocess

by

the_{stream-programming}model_{using VLIW}instruction(SIMDfash

ion). Thestream program organizesdataas streamsand expresses computations askernels

(inmediaapplicationseachkernel iscomposedofhundredsofcomputations), thusexploits

dataparallelism through streams and explicit communication (between streams) between

kernels. A kernel consumes a set ofinput streams of data and generates a set of output

streams ofdata. In order to capturedata

locality,

Imagine

(architecture)

divides the regis

ter

hierarchy

into local-register files

(LRFs),

andstream-registerfiles (SRFs). Datastreams

flow between differentcomputationkernelsisachievedthroughSRFsandintra-kernelstor

age/retrieval ofresults is done throughLRFs

(allowing

explicit communication). As each kernelperforms_many computations(hundredsof_{computations),}LRFs

help

maintainhigh

data bandwidth needed _{for reading/writing data}

by

different ALUs and handles the bulk

ofthe data

during

kernel execution. StreamC and KernelC (both uses C-like _syntax) are

used for writing kernels and applications for Imagineprocessor. StreamC code compiled

by

stream scheduler on the host-CPU sends stream instructions to Imagine's

(processor)

stream controller_using static anddynamicruntime mechanism. KernelC code _running on

Imagineprocessoriscompiled_{statically using VLIW kernel}and sperformskermelcompu

tationsonthesetofinputstreamsandusescommunication_{scheduling for}_{explicit commu}

nicationbetweenoperands [35].

Imagine is an _on-going researchproject at Stanford

University

and

they

havesuccess

fully

used it for varied range of applications.

They

have used it for audio and image

compression and decompression. More recently, Imagine processors have been utilized

(32)

advanced communication interconnection networks to give an order of magnitude more

performance per unit costthancluster-based scientificcomputers [13].

Imagineprocessor

Hcsl J Stream

j

conkolter SDRAM c 5 E r

|

i

SDRAM SDRAM 25 Netarork

j

devices t (oSier J MiorocDrrtrc ar 8

1

A'_Ihmetic duster Arthrritic

duster

\

831

A':hrrr.ic

:!us:er

Imagines I

h Neararfc Interface

JET]

***

or_I/O) | i

2.67 Sbytes/:

Lom

register He

[image:32.549.99.460.99.310.2]

32Gbytes/s 544Sbytesfe

Figure 3.2: Imagine processor. An Imagine stream processor consists of a 128-Kbyte

stream register file

(SRF),

48 floating-point arithmetic units in 8 arithmetic clusters con

trolled

by

a _{microcontroller,} a network

interface,

a _{streaming memory} system with 4

SDRAMchannels, and a stream controller.

However,

the Imagineprocessor is controlled

by

ahostprocessor which sendsinstructionstoit.

Wehaveseen above theneedsfor

designing

new architectures and_programmingmod

els to meet the performance and

flexibility

requirements ofmedia applications. We also

saw thatstreams provide a powerful _programming abstractionthathave been suitable for

efficiently

implementing

complex andhigh-performancemediaapplications. Inthelast few

years, someimportantworkhas been doneto_employstream architectureand_programming

modelforcomputergraphics application

[40];

[23].

[40]

focusedonthedesignandimple

mentation, and analysis of a computer graphicspipeline on a stream architecture_usingthe

stream_programmingmodel. Imaginewas used asthestream_{architecture,}andan

OpenGL-like pipeline was_{completely implemented} on itto illustratethat the stream_processing ar

chitectures are well suited for graphics applications.

[23]

constructed scalable graphics

(33)

of a clusterofoff-the-shelfworkstations. Chromium usesthe notion ofstream _processing

to achieve _scalability while _retainingmaximum flexibility. Bothworks arguedthatstream

architecturescan providethe_{necessary scalability}and

flexibilty

needed

by

futuregraphics

and multimedia applicationsto achievehigherperformance rate.

Today'sgraphics processor unit orGPU

(VPU)

can process overtens ofmilliontrian

gles and over a billion of pixels per _second, _{owing primarily} to their stream architecture

which enables all additional transistors to bedevoted to

increasing

computational power.

On a positive side _{note, this} also has led to the _growing use of graphics processor as a

general-purpose stream _processing engine. Researchers are _exploring the idea and port

ing

more and more applications to use GPUas a general-purpose co-processorto achieve

higher performance than on a CPU. In the next chapter we _study some of the landmark

graphics architecturesfrom differentgenerations, and alsodetailtheirrecent evolution as a

(34)

Chapter

4

Graphics

Processor Architecture

Inthe last chapterwe studied aboutthe stream architecture and_programming model, and

presentedthat the current generation ofgraphics processor are modelled after stream

ar-chitectture. In this chapter, we _study some ofthe important graphics architectures from

different generations, and also introducethe latest generation of graphics processor, more

commonly referred asGPU orVPU. We begin with an overview ontheclassical graphics

renderingpipeline.

4.1

Graphics

Rendering

Pipeline

Theclassic computer_{graphics-rendering}pipeline enumerates aprocessthatgeometricob

jects must experience on their _way to

becoming

pixels on the screen. These objects can

be in the form of points, lines or primitives in the shape of polygons and expressed in

homogeneous coordinate system. The operations performed on the primitives consist of

transformations, divisions

(w-division),

and_clippingasin Figure 4.1. An object processed

through thepipeline seesthecoordinate spacesinthe

following

order:

Object Space

Universe Space orWorld Space

(35)

Venexstream

Vertexprogram

Transformed vertexstream

Triangleasse-moty

Trianglestream

Screen-space

mappectiange stream

ClipVCuHrViewport

Texture

Coarsepi*e^or

fragmentstream

'

Fragmertprogram

Processedpixel FinalImageto

[image:35.549.89.454.55.328.2]

stream display

Figure 4. 1: Graphics_renderingpipeline.

Perspective Space

Clip

Space

Normalized Device Coordinates Space

PixelSpace

The transformations form the object space to the universe space is a combination of

translation,rotation,and_scalingoperations. This is donetofittheprimitivesintheuniverse

space. Universetoeyespaceconversionconsistsof puretranslationandrotation sothat the

viewing camera or user is placed at the origin and the direction of _viewing

being

set to

+z axis. Then eye to perspective space conversion is done in order to provide _mapping

from 3-D space to 2-D space.

However,

the conversion results in

technically

3-D space

representation with z-axis values stored in the Z-buffer and used later in the pipeline to

enable visible and non-visible effects to the object

being

rendered on the screen. The

(36)

outside the

boundary

coordinates of this volume are rendered invisible whereas objects seen to lie within the

boundary

coordinates are made visible on the screen. This space

is chosen to make the _clipping arithmetic simpler. It maps the desired visible chunk of

perspective spaceto afixedregion.

Later,

we perform_cliptonormalizeddevicecoordinate space transformation in order to represent the objects on an internal pixel independent dimensions. Fromthis internalpixelindependent_{representation,} wetransform theobjects in the hardware pixel space from where

they

are rendered onto the screen. This space

represents the screen real state. The whole _rendering process can be performed through

fourmajor steps:

Transformto _clippingspace

Clip

Divide

by

w (real_clipping spaceas dictated

by

the _screen)

Transformto pixel space

4.2

Graphics Hardware

According

to one classification of computer graphics architectures

[2],

various genera tionscanbeclassified

by

thecapabilitiesforwhichthearchitecturewas_{primarily designed}

(target capabilities with maximized performance) and not

by

the scope of the capabili

ties. Anotherclassification categorizesthemonthebasis of

technology (ASIC, DSP,

RISC

CPU,

and CPU extensions), arrangement (SIMD or

MIMD),

and _{programmability}

(end-user_{programmablity} and the relative ease with which

they

were _programmed) [32]. We

usethelatterclassification methodtodiscussthemotivationbehindvarious

(representative)

graphics architectures.

(37)

down on system costs. We describe some of the landmark graphics _{architectures,} and

presentthelatest generation ofGPU.

Geometry

Engine,

1982

Channel Processoror

CHAP,

1984

Polygon-Rendering Chip

or

PRC,

1986

High Performance Polygon

Rendering

Architecture,

1988

PixelFlow,

1992

Indigo

Extreme,

1993

RealityEngine,

1993

InfiniteReality,

1997

Programmable Graphics

Architecture,

2001

4.2.1

Geometry

Engine

One of the earlist design effort was the development of the

Geometry

Engine

[10]

for

Geometry

Graphics System (IRIS Graphics System). The

Geometry

Enginewas a

special-purpose, four component _vector, floating-point processor for accomplishing three basic

floating-point operations; matrix _{transformations, clipping} and viewport transformations

in computergraphics applications. The

Geometry

Engine provided limited

flexibility

and

couldbeconfiguredtoact as a partofthematrix_subsystem,_clippingsubsystem and_scaling

subsystem inthegraphics_renderingpipeline. The

Geometry

Systemwas a slave processor

and could _only accept or yield one format of data and instructions. The graphical primi

tives were stripped into individual coordinates or vertices and each component processed

(38)

in a pipeline orderto achieve performance ofthe order of 5 million floating-point opera

tions per second. _{The programming}model ofthegraphicssystem wasused asahardware

subroutineto the_{IRIS processor/memory} system andthus wasspecificto theplatformand architecture ofthe

host/controlling

processor.

4.2.2

Channel Processor

The system called Lucasfilm Compositor used the channel processor or CHAP

[30],

a

special-purpose, fourcomponent _vector, programmable pixel processorto provide digital

processing capabilities for special-effects filmproduction. The CHAP performed all the

computations and controlledtheflow ofpixels(madeoffour_{components; red, green,}blue and _alpha) in the Compositor. Scratchpad memories provided for program data storage

couldbereferencedfor fouror one operand

by

_any arithmetic operation. Vectorarithmetic

units connectedto thescratchpadmemoriesthrougha crossbar networkthus allowediden

tical or different operations to be performed on all the fourpixel components in parallel.

TheCHAP's datapath wasoptimized for 16-bit integeroperations. Inorder to render rich

images for film applications, for the first time _{framebuffer memory banks} were _config

ured with 12-bits per channel. The CHAP performed single

instruction,

fourcomponent

processing (SIMD). One of the unique features of CHAP was the support for

disabling

some subset of processors over a rangeof operations likeclamping. Applicationsrequired skilled_programmingefforts and couldbecomplicated

by

thepipelinedelays built into the

hardware modules. Pixar's CHAP could provide interactive usage on limited resolution

images while acceptable performance for high resolution, movie _{quality images} required

offline rendering.

4.2.3

Polygon-Rendering

Chip

Image rendering becauseofitscompute-intensive nature representsthemajorperformance

(39)

necessary components needed forreal-time image _rendering were put on_{one-chip VLSI}

implementation. _{The chip known}as_{polygon-rendering chip}orPRCwasdesignedprimar

ily

formechanical_{engineering CAD} applications [49]. CAD applications requireinterac

tive manipulation of solids (shaded polygons). In order to render real-time performance,

PRC facilitated pipelining and internal bandwidth for computation and communication.

The performance was maximized

by

_{pipelining different}parts ofthe fill and _shading al

gorithms and _{running different functional blocks in} parallel. This made the architecture

unique in the sense thatdatacouldbemovedfromone functional blockto another in one

clock cyclevia register-registertransfers. The architecture also used pixel cache so

frame-bufferwrites andZ-bufferaccesses couldbe done inparallel.

4.2.4

High

Performance

Polygon

Rendering

A novel graphics subsystem was developed to meet the demands for

displaying

realis

tic solid objects and support for real-time image rendering and independent windows.

The graphics subsystem was partitioned into four special-purpose pipelined subsystems;

geometry, scan-conversion,raster and

display

subsystem with each_performingadedicated

task [3]. Each ofthese subsystems were pipelined into various functional blocks. The

Geometry

subsystem _comprising of

Geometry

Engines was organized as asinglepipeline

of floating-point processors with each _executing a specific subset of geometric transfor

mations.

Geometry

Engines operated

independently,

thus _rendering support for SIMD.

When taken_{together, goemetry}subsystem achieves an _efficiency of50% notseen

by

vec

torprocessors used in other graphics systems present at the same time. Scan-conversion

subsystem responsiblefor_convertingvertex_{data (from geometry subsystem)}to individual

pixels comprised ofPolygon processor, Edge processorand five Span processors (operat

ing

parallel) arranged in pipeline with each composed offixed-point engines. Five Span

processors again_{operating in SIMD}acheived anaggregatefillrateof40million pixels per

second. Rastersubsytemwaspipelinedinto

twenty

Image

Engines,

each of which operated

(40)

operation).

They

resultedin extremely powerful andflexible pixelfill operations not seen

by

other graphics systems atthe same time. And

finally

the

Display

subsytem interpreted

pixels from the frame-buffer _memory and routed it to the DAC for display. The graphics

subsystem produced some_{brilliant operating}numbers

(rendering

100,000Gouraudshaded

and Z-bufferedpolygons per _second) and represented a unique graphics system skeleton

thatwasfollowed

by

subsequent graphics architectures. Someofthespecial features

supp-portedwere pan and _zoom, window ID _masking, real-time video, alpha

blending

and

an-tialiasedlines.

4.2.5

Indigo Extreme

One of the

limiting

factor of the High Performance Polygon

Rendering

design was its

choice forthehostprocessor. This choice made the _{design only} suitedforone particular

kindof platformand thusrenderedlimited programmability and _portabilityto otherhard

wares. Other disadvantages include the throughputof various subsystems

being

dictated

by

the speed ofthe slowest_processing step. This results ina non-linear performance gain.

Considerableoverheadisadded witheach processorinsubsystem stagesintheformofse

quencer, microcodestore, globals data storeandinterface logic between differentpipeline

stages_{resulting in} thelossof performance.

In order to eliminate the above mentioned design limitations and disadvantages and

to circumvent the bottlenecks seen in the floating-point compute-intensive units used in

the _geometry subsystem, compute-intensive arithmetic units used in the scan-conversion

subsytem, the limited fill rate andthe pixel bandwidth as seen into the

frame-buffer,

SGI

introducedaunique computer graphicsarchitecture oftheIndigo Extreme aimed at maxi

mizing performance while _minimizing size [20]. The decision was made to combine the

per-vertexand slope calculations into one unit. The slope calculations were implemented

in amicrocoded processorbecauseof_relativelycomplex algorithms involved. The advan

tagesachievedthroughthisapproachresultedintheincreaseinvertex and slope_processing

(41)

floating-pointfunctional units were implemented.

Geometry

Engine design approachfol

lowed the fundamental design principle of _maximizing the performance of most expen

sive functional units. This was achieved

by

_providing custom design data paths to the

operands and support for multiple threads of the same algorithm active simultaneously.

Multi-portregisters were used to render support formultiple simultaneous threads ofcal

culation. More than one

Geometry

Engines were usedin SEMD parallel approach instead

ofsingle pipeline ordertoattainalinearperformnace gain with minimal overhead required

to implement it. This resulted in the _processing of eight polygons

(primitives)

in paral

lel.

Hyper-pipelining

was used for

designing

Raster Engines instead of _using multiple

copies in an SIMD parallel approach. This was done to minimize the area requirements

while _achieving the desired performance.

Memory

bandwidth was increased

by

_using an

interleaved frame-buffer across multiple _{memory banks.} _{The resulting} architecture was

capable of over 500,000 Gouraud shaded Z-buffered polygons per second and a fill rate

of 80 million pixels per second.

However,

the design was _particularly suited to one kind

of application (CAD programs).

Any

deviation from this fixed-function implementation

resultedin significant performanceloss.

So farwehave mentioned architecturesthatusedspecial-purpose, fixed-functionchips

(ASICs)

forgeometrictransformationsandimage renderingarrangedineitherfine-grained

or course-grained SIMD arrangement. These graphics architectures processed graphical

primitives as independent vertices or coordinates. Some ofthe disadvantages have been

outlined withthedescriptionofeach architecture. SIMDarchitecturesdonot scale

linearly

with the number of geometric engines or image Tenderers. Several important architec

tures basedon MIMD (multiple instructionmultiple

data)

arrangementwere introducedto

resolve these issues. MIMD arrangements processed the vertex stream as a part of a geo

metric primitivethus_enablingoperationslike cullingand_reducingthe_processingtimeand

increasing

parallelism. Here we discusssome ofthese importantarchitectures and outline

(42)

4.2.6

PixelFlow

PixelFlowarchitecture

[38]

overcamethe_scalinglimitationaswellasthegeometrictrans

formations and frame-buffer bottlenecks

by

_using the technique of image composition.

Imagecompositiondesigned forreal-time

(30Hz)

3Dgraphics algorithmsand applications

achieved_{high rendering}performance

by

_applyingobject parallelismthroughoutthegraph

ics renderingpipeline. The rasterization processors

(renderers,

shaders and

frame-buffers)

were associated with aportion ofthe primitives

(occupying

128x128 pixel regions ofthe

screen), rather than a portion of the screen which is termed as image parallelism. Each

Tenderercomputes pixel values for its primitives regardless oftheir viewport association.

Thecomputedpixelvalues arethen transmittedover a networkto_compositingprocessors,

whichresolvethe _visibilityofpixelsfromeach Tenderer. Someoftheunique architectural

featuresoffered

by

PixelFlow were intheformof_{supersampling antialiasing}and deferred

shading mechanism. This distribution mechanism of primitives is alsoclassified as

sort-last mechanism [37]. Sort-last mechanism issues some_{outstanding features in} the form

of _{linear scalability} and

being

less prone to load imbalances. PixelFlow can be scaled

linearly

to achieve tens of millions of polygons per second. PixelFlow could be config

ured to operate as a part ofworkstation or parallel computer. One distinct feature was

that all the components of PixelFlow were general-purpose, programmable and offered

a simple _programming model. PixelFlow rendered complex primitives such as spheres,

quadrics and supported real-timeillumination models,procedural andimage-basedtextur

ing,

environment_mappingand restrictive antialiasing. Somemajorlimitationswerethat the

image-composition network mustsupport_{very high bandwidth}communicationto transfer

128x128pixel region of pixeldata. PixelFlowsuffered significant performanceloss incase

ofload imbalances. PixelFlow could also render the final image once all the primitives

corresponding to the logical tiles hadbeen transformedand sorted. This limitation could

(43)

4.2.7

RealityEngine

RealityEngine is a _massively parallel computer graphics architecture developed

by

SGI.

SGI classified RealityEngine (Figure

4.2)

as their third-generation _{design primarily for}

real-time(15-30

Hz)

rendering of textured mapped, antialiased traingles. The

Geometry

lV>l'ni dus. Command Process

Engfees

~

Fragment w

dwserators;

ii

ii ii

ii

nil ii n n

frcage_

Engfcnes

tfi frifli-M*+t -[*

riEin i.-J!W3rjr^rd

tlHlltll *-iS--iHrl !

nrfteic^onKvy bj*i

! Gfepljy^GHE r;tbl :-*ii.nJ;

Figure 4.2: RealityEngine block diagram. Board-levelblock diagramof anintermediate

configuration _consisting of 8

Geometry

Engines,

2 raster _memory

boards,

and a

display

generatorboard.

Engineperformed multiple

instruction,

multiple component_processing andrequired

hand-coded_assemblycodetoextractthegreatest performance. Texture mappingrenderesimages

that are

highly

_visually

interesting

and lookedmore realistic. PixelFlow rendered texture

mapped images but RealityEngine provided expansive support in thehardware to gener

ate such detailed images. Fragment Generators incorporatedwere fixed-function pipeline

unit and performed the initial generation of

fragments,

generation ofthe coverage _mask,

texture addressgeneration, texture

lookup,

texturesample

filtering,

texture modulation of

the fragment color and

fog

computation and blending. RealityEngine also supported

1-and 3-dimension texture maps and thus could be used torender volumetric images. An

[image:43.549.168.381.170.357.2]

(44)

beperformedeither

by

single-pass (PixelFlow didnot supportthis _menthod)or multi-pass

antialiasing methods. Unlike

PixelFlow,

RealityEngine separated the transformation and

fragmentgenerationrates. Thismade possiblethe fine

tuning

ofthe architectureforunbal

anced_rendering requirements. Image Engines were assigned a fixed subset ofthe pixels

intheframebuffer. Unlike

PixelFlow,

allthe different functional blocks running inparallel

could complete the final image as soon as the last primitive was rendered

by

the Image

Engine. Thecolorcomponentresolution was increased from 8-bits to 12-bitsto eliminate

visible

banding

and support for substantial framebuffer composition. RealityEngine used

bothobjectparallelismandimageparallelismto achieve_{very high}performance rates. Ob

ject parallelism _{implies partitioning} either the geometric description of the scene where

vertices of a geometric primitive are processed

independently by

different processors in

parallel or theassociated object spacewhere different geometric primitives areprocessed

independently

by

differentprocessors inparallel. Imageparallelismsplitstheimage space

as fragments where each fragment is processed

independently by

different processor in

parallel.

4.2.8

InfiniteReality

The

InfiniteReality [39]

graphics system architecture shown in Figure 4.3 was designed

with the _primary goal of real-time application performance. The architecture was devel

oped as a superset ofRealityEngine graphics system. Unlike previous mentioneddesigns

where

display

list processing had been handled

by

thehostprocessor,

InfiniteReality

intro

ducedlocal

display

list_processing

by

the graphics subsystemto eliminatehostto graphics

I/O bottlenecks. RealityEngine and previous architectures distributed primitives _among

different

Geometry

_{Engines using}round-robinmechanism. As the

Geometry

_{Engines op}

eratedinanMIMDarrangement,

InfiniteReality

extendedthesupportfor

least-busy

distrib

ution mechanism which offered performanceadvantage over ubiquitous round-robinmech

(45)

nfevffyrtraiSue V-fi**L-i tniaiVUK'

t

#|jkHliOifli***4*ii!i '

1

tga _{1 1}feaJBw _{1 1}Saga* _||fogfcMii _|

5T=*

|giiiHSiwar-fcui-HM

]

tn*i

r

L

f

'

]

11 imp<rr*'..-*Jui 1tiqpmm torn*ma II^lM^M-M^ f"i|p'*liirl"ii>IMTtiri- ₁

I

4m* III

r

I 1III lllllll ! tjlUi llMMf>dMi.ii

E

|

l^|4a^

j

T

t**4MWi'

l,

f

1.

1

:,

k.

f

i

,jl

3

r

\

i

^

it

Figure4.3:

InfiniteReality

Engine block diagram. Board-level block diagramof

Infinite-Reality

Engine withmaximum supportfor4

Geometry

Engines,

4 Raster

Memory

baords,

and a

Display

Generator boardwith8outputchannels.

with four stage pipelined arithmetic units _executing pixel-components in SIMD arrange

ment. The screen-space primitives were then input to the Raster

Memory

board which

comprised of a Fragment Generatorand_{eighty Image Engines. Unlike} previous architec

tures, theFragmentGenerators in

InfinteReality

assembled screen-space primitives (trian

gle _setup) and computed parameter slopes. The performance of

InfiniteReality

graphics

system made possible theuse of_{very large} texturedatabases

by

introducing

a representa

tion called clip-map, and multipass _rendering techniques to implement reflections, shad

[image:45.549.132.420.54.404.2]

(46)

for very large textures while multipass _rendering techniques can enhance visual realism.

The

display

generator board provided support for flexible and software controlled video

architecture thatcoulddrive upto eight autonomous

display

output channels. To counter

actfill-rate

bottlenecks,

InfiniteReality

graphics systemintroduced dynamicvideo_resizing

where the scene is rendered to the frame-buffer at a _potentially reduced resolution such

that primitives are drawn in less than one frame time. Despite the success ofMIMD ar

chitectures, SIMD systems offered a simpler _programming model. This is owing to the

fact thatMIMD systems imposes aburden on applications and _operating systems, which

must be able to cope with the arrival of data at unpredictable intervals and in arbitrary

order. This often resulted in the design of complex and cumbersome communication and

buffering

protocols _rendering software overheads. SIMD systems _{operating in} the

lock-step fashion virtually eliminates these overheads.

Also,

it is often possible to structure

algorithms as several distinct functional

blocks,

each of which operates on a uniformdata

type. _{The rendering} pipeline maps _naturally onto _{this structure,} and the _regularity of the

datastructures within eachphaseleadsto uniformoperations,providingagoodfitwiththe

SIMD programming paradigm.

Finally,

SIMD architectures _usually contain thousands of

simple_processing elements. Becauseoftheirsheer numbers, goodperformance can often

beachieved even thoughprocessors _may notbe

fully

utilized [12].

4.2.9

Graphics Software Libraries

The realm of computer graphics applications was _{growing but} there was no real coordi

nated effortto

develop

hardwareindependentgraphicssoftwarelibrariesthatcouldbeused

to render simple or complex scenes effeciently. All the above mentioned graphics sys

temdesignsweredeveloped

independently

without_anycoersiontowards

industry

standard

compliance. These graphics systemscould notbe implemented on a_variety of platforms

with a range of graphical capabilities. There existed several independent standards most

(47)

(PHIGS extensionto X window _system)

[15]

thatcould render 2D or3D objects on lim

itedgraphics systems withlimitedgraphical capabilities_makingprogram_portability

prob-lemetic. Also we have mentioned that it required considerable knowledge and expertise

to program graphical applications _effeciently and was _mostly carried out

by

the

systems'

designers. OpenGL developed

by

SGI is an _emerging graphics standard. Itis alow-level

graphics

library

specification thatprovides advanced_renderingfeatures while_maintaining

a simple and_{flexible programming}model. OpenGL is rendering-only,so it is independent

ofthemethods

by

whichtheuserinputand other window or non-window systemfunctions

are _achieved, _making the _rendering portions of the graphical program thatuses OpenGL

platformand_operatingsystemindependent.

4.2.10

Streaming

CPU Extensions

PixelFlow, RealityEngine, InfiniteReality,

etc represented _massively parallel graphics ar

chitectures with variedrange of capabilities.

However,

the above mentioned graphics ar

chitectures were accessibleat_{very high}costforspecial-purposes. Technological advance

mentsintheVLSI

technology

made possible visual realism experiencefor

desktop

personal

computers

by

incorporating

SIMD arrangement inthe general-purpose processor designs.

In