Parallel Rendering - High fidelity rendering on shared computational resources

The large computational complexity exhibited by high-fidelity rendering has en- couraged researchers to tackle the rendering problem with parallel computing. An early survey of parallel rendering techniques and related issues is provided by Crockett [Cro97]. A detailed survey of these methods especially in context of global illumination and ray tracing is presented by Chalmers et al. [CDR02]. Ray tracing algorithms are relatively easy to parallelise if the entire scene description can be duplicated on each processor, as each processor can be designated to independently work on a part of the image-space. However, load balancing can be challenging.

Parallel rendering tasks can be constructed by subdividing either in image- space or object-space. The former is suitable if the scene data can be replicated on each processor and the latter is generally used when the scene data is larger than the memory size of an individual processor. A uniform load balancing is easier to achieve with image-space subdivision (for example [Woo84, PB85, LS91]) but it leads to poor memory usage. In contrast, a scene can be distributed evenly using object-space subdivision (for example [DS84, PB89, Pit93]) but it results in complex load balancing strategies. Many algorithms have been proposed in both categories taking advantage of hardware and application environments. A hybrid task subdivision strategy was developed independently by Salmon and Goldsmith [SG88], and Scherson and Caspary [SC88], where object data is hi- erarchically subdivided. The upper portion of this hierarchy is replicated on all processors while the lower part with most of the scene data is distributed allowing dissociation of image-space and object-space.

Load balancing can be static or dynamic based on the scheduling policy adopted by the algorithm. Heirich and Arvo [HA98] provided a good comparison of various load balancing techniques and showed that any static task subdivision strategy is affected by load imbalances. In their experiments, they observed that a hybrid method reduced the load imbalance cost such that it was smaller than other costs of parallelisation. Badouel and Priol [BP89] described a dynamic demand-driven load balancing based on the master-worker paradigm whereby each worker is assigned a 3×3 tile of pixels when it becomes idle. But this approach suffered from poor scalability and therefore, Green and Paddon [GP90] used a hierarchical approach to overcome this. They connected the distributed processors in a tree structure such that the master process was at the root of

2. High-fidelity Rendering 29

the tree and all the other workers would communicate with their parents only, limiting the number of requests being sent across the network. This resulted in better scalability. Reisman et al. [RGS00] devised a dynamic load balancing strategy for progressive ray tracing on distributed clusters exploiting temporal coherence for obtaining interactive rates. The image was subdivided into regions which were assigned to separate processors and during run-time the regions were dynamically adjusted to rectify load imbalances.

One of the first parallel architectures designed for interactive ray tracing using a 96-processor supercomputer was proposed by Muuss [Muu95]. He needed interactive rendering for radar systems which had been modelled using large number of Constructive Solid Geometry (CSG) primitives. These could not be rasterised as the tessellation of the primitives would have resulted in several mil- lion polygons. Using traditional ray tracing the author was able to render the CSG primitives directly, preventing the memory and time costs of tessellation. Keates and Hubbold [KH95] presented an interactive ray tracing system based on a virtual shared-memory machine which took advantage of cache coherence by using regular gridcells. Dynamic load balancing was achieved with thread synchronisation mechanisms to achieve interactivity on 64 processors.

Another such rendering system was proposed by Parker et al. [PMS∗99] using an optimised static load balancing approach. This supported image-based rendering with realistic shadows which also ran on a shared-memory supercomputer. Interactivity was achieved by designing the jobs to match the caching granularity and exploiting fast synchronisation available on the target hardware. They were also able to perform volume rendering [PPL∗99] and iso-surface visualisa- tion [PSL∗98] using their system which supported frameless rendering in addition to synchronous operation.

Wald et al. [WSBW01] enhanced the methods presented in [PMS∗99] by using object dataflow strategy to run on a distributed cluster. Using a highly optimised SIMD code they were able to achieve interactivity. Wald et al. [WKB∗02,BWS03] further enhanced this by adding global illumination and handling a few dynamic scene changes by modifying instant radiosity (see Section 2.6.4).

Purcell et al. [PBMH02] exploited the programmability of the modern GPU and used it as an efficient and massively parallel stream processor for ray tracing, demonstrating similar performance to the optimised CPU version of Wald et al. [WKB∗02]. They moulded the ray tracing algorithm into a streaming process by breaking it down into a set of kernels representing different aspects of the

2. High-fidelity Rendering 30

computation.

Yao et al. [YPZ10] presented a system for parallel animation rendering on distributed resources. In addition to rendering frames in parallel, their system used tile-based image subdivision for finer granularity to achieve dynamic load balancing. Their system was designed to take advantage of the spatial and temporal coherence between animation frames while scheduling parallel tasks.

2.9.1 Parallel Irradiance Cache

There have been several attempts at parallelising the irradiance cache. The stan- dard Radiance light simulation package [LS98], supports a parallel renderer which uses the Network File System (NFS) to manage simultaneous access to irradiance cache between parallel processes on a distributed system. This may lead to file contention while using inefficient file lock managers resulting in poor performance. Koholka et al. [KMG99] shared the irradiance values between processes on the slaves using MPI after every 50 irradiance samples were calculated at each slave. Robertson et al. [RCLL99] proposed a master slave model of Radiance where each slave calculates and deposits the irradiance cache values to the cen- tral master after a threshold, and then gathers the irradiance values calculated by other slaves from the master after a threshold. Debattista et al. [DSC06] followed the irradiance caching component-based philosophy by subdividing the computation of the indirect diffuse on a separate dedicated set of irradiance cache nodes, while the remainder of the nodes computed the rest of the rendering. This en- abled cached samples to be shared quickly and efficiently amongst the dedicated irradiance cache nodes.

In document High fidelity rendering on shared computational resources (Page 42-44)