The non-uniform sampling of final gather ray directions that arises from importance sampling not only reduces the variance of the computed reflected radiance but also changes the spatial distribution of the photon gathers that must be performed. This has a direct impact on bandwidth requirements. In this section we examine the magnitude of this effect and its interaction with both query reordering and irradiance caching.
To clearly examine the effect of a glossy surface on these other bandwidth-reducing techniques, a modified version of the Cornell box scene was used, with a viewpoint focused on a glossy wooden floor. The BRDF in this experiment is composed of two separate terms: the first is a simple uniform Lambertian term, and the second is a single Phong cosine lobe. The resulting images are shown in Figure 6.16 for several values of n, the glossy exponent of the Phong illumination model.
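As a rough illustration of why a sharper lobe concentrates the final gather rays, the following sketch draws directions with density proportional to cos^n(α) about the lobe axis, using the standard analytical inversion for a Phong cosine lobe. The function names are hypothetical and are not taken from the renderer used in these experiments; the sketch also ignores the Lambertian term and any clipping of the lobe against the surface.

```python
import math
import random

def sample_phong_lobe(n, u1, u2):
    """Map two uniform numbers in [0, 1) to (cos_alpha, phi), where alpha is
    the angle from the lobe axis and the density is proportional to cos^n."""
    cos_alpha = u1 ** (1.0 / (n + 1.0))
    phi = 2.0 * math.pi * u2
    return cos_alpha, phi

def phong_lobe_pdf(n, cos_alpha):
    """Solid-angle density of the sampled direction: (n + 1) / (2*pi) * cos^n."""
    return (n + 1.0) / (2.0 * math.pi) * cos_alpha ** n

def mean_angle_from_axis(n, samples=20000, seed=1):
    """Estimate the mean angle (in degrees) between sampled rays and the axis."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        cos_alpha, _ = sample_phong_lobe(n, rng.random(), rng.random())
        total += math.acos(min(1.0, cos_alpha))
    return math.degrees(total / samples)

if __name__ == "__main__":
    for exponent in (1, 2, 4, 8, 16, 32, 64, 128):
        print(exponent, round(mean_angle_from_axis(exponent), 1))
```

Under these assumptions the mean angle from the lobe axis shrinks steadily as the exponent grows, so the rays sampled for neighbouring pixels tend to land in a smaller region of the scene.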
[Figure 6.17 plots: (a) Without Irradiance Caching; (b) With Irradiance Caching. Horizontal axis: Glossy Exponent (1 to 128). Vertical axis: Bandwidth (GB, 0 to 180). Series: Tiled; Tiled DirBin; Tiled DirBin Hash; Tiled Combined Importance (DirBin/Hash).]
Figure 6.17: As the exponent of the glossy lobe increases and the lobe sharpens, the bandwidth required for naive photon mapping is reduced. This effect is minor or non-existent for reordering techniques that are already successfully exploiting coherence. For a fair comparison, the naive and direction-binned techniques are shown here with analytical BRDF sampling. (Modified Cornell Scene, 16×16 tiles, kd-tree 128KB cache, 128B cachelines, max = 0.025)
Surfaces with a higher glossy exponent were found to require less bandwidth for the naive ordering than those that are more diffuse (Figure 6.17a). This is because the sampled final gather rays are clustered in a small portion of the hemisphere, corresponding to the expected contribution.
The effect is less pronounced, or even non-existent, for the higher quality reordering algorithms: when the photon queries have already been reordered, there is little or no further reduction in bandwidth as the glossy exponent increases. This is because the reordering has already captured most of the available coherence. The pattern holds both with and without irradiance caching (Figure 6.17b).
Although glossy surfaces require less photon gather bandwidth than uniform diffuse surfaces, the improvement due to importance sampling is minor compared to that obtained from query reordering. This experiment shows that query reordering is compatible with importance sampling and irradiance caching, exploiting the additional potential coherence among the photon gathers.
6.6 Conclusion
Combined importance sampling is a simple global importance sampling algorithm that reduces variance in the generated images, at a low cost, for the scenes demonstrated. It is compatible with photon gather reordering and irradiance caching, and can therefore be incorporated into a system that uses those techniques to reduce memory bandwidth.
The low cost is due to a preprocessing stage in which multiple p.d.f.s are combined, leaving a single sampling strategy to be used during final gather ray generation. Although combining p.d.f.s is not always possible, it is easy in the application presented in this chapter because the p.d.f. of incident radiance, when captured by a kNN search in a photon map, is naturally expressed in a table. Not all functions are well represented this way, however; sharp BRDFs are one example. If two more general p.d.f.s were expressed in different ways, for example with differing basis functions, that could prevent the application of combined importance sampling.
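The idea of combining tabulated p.d.f.s ahead of time can be sketched as follows. This is a minimal illustration, assuming that the two tables share the same bins and that they are combined by a bin-wise product followed by renormalization; the function names and the uniform fallback are hypothetical and are not taken from this chapter.

```python
import bisect

def combine_tables(pdf_a, pdf_b):
    """Multiply two tabulated p.d.f.s bin by bin and renormalize.
    Assumption: both tables use identical bins in the same order."""
    product = [a * b for a, b in zip(pdf_a, pdf_b)]
    total = sum(product)
    if total <= 0.0:
        # Hypothetical fallback when the tables do not overlap: uniform.
        return [1.0 / len(product)] * len(product)
    return [p / total for p in product]

def build_cdf(pdf):
    """Running sum of the table, with the last entry pinned to 1.0."""
    cdf, running = [], 0.0
    for p in pdf:
        running += p
        cdf.append(running)
    cdf[-1] = 1.0  # guard against floating point round-off
    return cdf

def sample_bin(cdf, u):
    """Return the first bin whose cumulative probability exceeds u in [0, 1)."""
    return bisect.bisect_right(cdf, u)

if __name__ == "__main__":
    incident = [0.5, 0.5, 0.0, 0.0]   # toy stand-in for a photon-derived table
    brdf = [0.0, 0.5, 0.5, 0.0]       # toy stand-in for a tabulated BRDF
    combined = combine_tables(incident, brdf)
    print(combined, sample_bin(build_cdf(combined), 0.3))
```

Once the combined table and its cumulative distribution have been built, drawing a final gather direction costs a single table lookup per ray, which is consistent with the low run-time cost described above.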
CHAPTER 7 ARCHITECTURE
The most significant impediment to interactive image generation using the photon mapping algorithm is the memory bandwidth requirement of photon gathering. The previous chapters of this dissertation have presented several techniques that were shown to be successful at reducing the memory bandwidth requirements: photon gather reordering, irradiance caching and combined importance sampling. Irradiance caching reduces the number of final gathers that must be performed. Importance sampling reduces the number of final gather rays and subsequent photon gathers that must be generated. Finally, photon gather reordering increases the coherence of the memory request stream generated while servicing those photon gathers that remain to be performed. Together these techniques will reduce the memory bandwidth requirements of image generation by almost two orders of magnitude.
The algorithms and data structures to accomplish these reductions were designed with care to neither impose significant additional computation nor hide the natural parallelism of the original photon map algorithm. Further, it was shown that the techniques are compatible; not only can they be implemented in a single rendering system but their benefits build on each other. In this chapter, the techniques are combined into a single proposed hardware architecture that could be practically implemented by 2010. An implementation small enough to fit in a standard desktop PC will be shown via simulation to have sufficient performance to support the interactive rendering of scenes with difficult indirect illumination at 30 frames per second.
constraints imposed by current semiconductor technology are revisited. The architecture is then described at a high level before a careful examination of each component. The required bandwidths, computations and internal storage are shown to be well within the expected performance of the targeted implementation technology. A functional simulation of the architecture is presented and used both to verify the correctness of the architecture and to evaluate the expected performance under a variety of conditions. Finally, scalability, deadlock and load balancing are addressed along with limitations and potential design alternatives.
7.1 Goals and constraints
It is the goal of this dissertation to present a feasible hardware architecture for interactive image generation using the photon mapping algorithm. Chapter 2 established an overall vision for the system, based on reducing overall system complexity and cost. From this vision, the following goals are obtained: 1) the entire system will be packaged as one or more expansion cards that can be inserted into a standard commodity workstation; 2) no more than one major custom ASIC design will be required; 3) no inter-chip communication will be allowed; and 4) only semiconductor technologies that are expected to be commercially available in the next three years will be used.
A standard commodity workstation will be used as a host. It will handle all the generic tasks such as program loading and execution, input/output, and management of the hardware system. This frees the hardware architecture to focus on the tasks directly related to image generation. This arrangement is typical for both consumer graphics cards, such as the GeForce 6800 (Montrym and Moreton, 2005), and also large high-end research systems, such as PixelFlow (Eyles et al., 1997). Limiting the implementation to a small number of expansion cards reduces the overall system complexity. Whole classes of subsystems such as a power supply, chassis, and host interface do not have to be designed or tested.
Each custom ASIC that must be designed for the system represents a large development and testing cost. This is particularly true of high performance designs that use recent semiconductor technologies. To limit this cost, the architecture will use only a single custom chip design. Although the development and testing of a design is expensive, the per unit cost is relatively low. Therefore, multiple instances of the chip will be used together for higher performance. This provides greater flexibility in deployment. Models offering different performances can be constructed from the same components at different price points.
There are several advantages to prohibiting inter-chip communication. The first is that the majority of the rendering chip’s communication resources can be dedicated to high speed memory links. The second advantage is that the expansion cards do not need to support a bus or routing system between the chips. Instead, a simple tree communication structure with the host will suffice. System testing is also easier, as each chip performs its duties in isolation from the others. Finally, this lack of communication removes one of the largest barriers to system scalability in terms of rendering chips. This last aspect is addressed in more detail in Section 7.4.3.
It will be shown in Section 7.3.4 that the bulk of the computations in this architecture are highly parallel, regular, and exhibit few control flow dependencies, thus permitting deep pipelining. This allows the resources on the chip to be allocated to processing elements instead of control, routing and large memories. These are also the characteristics of commodity GPUs which enable them to perform so much useful computation on a single chip, as compared to a CPU. In Section 2.2 it was shown to be a conservative estimate that GPUs will be performing 500 GFLOPS by 2010.¹
Even without photon gather reordering, irradiance caching, or combined importance sampling, generating 512×512 images of Sponza Atrium at 30 FPS will be shown, in Section 7.4, to require 7.4 TFLOPS. This burden can be handled by fewer than sixteen chips, which can fit on a small number of expansion boards.
¹ Recall from Chapter 2 that the total number of FLoating point OPerations to perform an operation
It has long been lamented that the memory bandwidth available to a single chip is increasing at a much slower rate than the computational performance of that same chip (Dally and Poulton, 1998). High-end commodity chips such as GPUs are expected in 2010 to have a usable bandwidth of at least 90 GB/s (see Section 2.2). The naive photon mapping algorithm, as originally described in Chapter 2, has been shown to require 11 TB/s to generate 512×512 images of the Sponza Atrium at 30 FPS. Dividing that bandwidth across twelve dozen chips is simply not feasible for a workstation-sized machine. The techniques presented in this dissertation are therefore required for a feasible architecture.
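The contrast between the compute and bandwidth constraints can be checked with a short calculation. The sketch below uses only the figures quoted in this section (500 GFLOPS and 90 GB/s per chip, 7.4 TFLOPS and 11 TB/s for the naive algorithm) and assumes decimal units (1 TB = 10^12 bytes); the constant and function names are illustrative.

```python
import math

# Per-chip estimates for 2010 quoted from Section 2.2.
CHIP_FLOPS = 500e9          # 500 GFLOPS
CHIP_BANDWIDTH = 90e9       # 90 GB/s, decimal units assumed

# Naive photon mapping demand, 512x512 Sponza Atrium at 30 FPS.
NAIVE_FLOPS = 7.4e12        # 7.4 TFLOPS
NAIVE_BANDWIDTH = 11e12     # 11 TB/s, decimal units assumed

def chips_needed(demand, per_chip):
    """Smallest whole number of chips whose combined capacity meets demand."""
    return math.ceil(demand / per_chip)

if __name__ == "__main__":
    print("chips for computation:", chips_needed(NAIVE_FLOPS, CHIP_FLOPS))
    print("chips for bandwidth:  ", chips_needed(NAIVE_BANDWIDTH, CHIP_BANDWIDTH))
```

Under these assumptions the computation fits in 15 chips, while the bandwidth alone would call for well over one hundred, which is why the remainder of the analysis treats bandwidth as the limiting resource.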
Because memory bandwidth is the limiting factor, the description and initial analysis of the architecture concentrate on it. Once it has been established how many tiles per second can be processed on a single chip using the 90 GB/s permitted, it is then verified that the corresponding computation and internal storage for the ray casting, irradiance caching, importance sampling, and the photon gathers themselves are reasonable. During initial discussions, this chapter will assume that the target output is 30 FPS for 512×512 images with a single sample per pixel.
Performance is, of course, a function of many variables. Some are set by the users, some fixed by the architecture as described here, and some depend on the exact scene. Table 7.1 defines the notation used throughout the rest of the chapter to discuss these costs. Nominal values are shown with references to where they are discussed. Those values that are observed characteristics of certain scenes, such as the proportion of pixels that can use irradiance caching, are given as an expected range with a conservative value used during calculations unless otherwise noted.
Notation    Description                               Nominal Value    Reference
P           Number of pixels in a tile                16×16            Chapter 3
T           Number of tiles in (512×512) image        1024             Chapter 1
F           Number of frames per second               30               Chapter 1
N_FG        Number of final gather rays               33–100–200       Chapter 2
k           Number of photons searched for            100              Chapter 2
N_S         Number of shadow rays                     8                Chapter 2
C_J         Number of importance sampling bins        4×8              Chapter 6
N_PM        Number of photons in indirect map         0.5–6×10^6       Chapter 2
R_IC        Ratio of pixels using irradiance cache    0–60–95%         Chapter 5
R_FG        Ratio of pixels using full final gather   100% − R_IC      Chapter 5
L_IC        IC records interpolated per pixel         4                Section 7.3.3
C_B1        Partial evaluation of f_r (Stage 1)       12 FLOPs         Section 7.3.2
C_B2        Partial evaluation of f_r (Stage 2)       10 FLOPs         Section 7.3.2
C_RayCast   Intersecting ray with scene               330 FLOPs        Section 7.3.2
C_G         Gathering the k nearest photons           2,389 FLOPs      Section 7.3.4
S_B         Size of partially evaluated BRDF          13 bytes         Section 7.4
Table 7.1: The notation used throughout the rest of the chapter to discuss the cost of photon mapping in terms of both bandwidth and FLOPs. Nominal values are shown with references to where they are discussed. Those values that are observed characteristics of certain scenes, such as the proportion of pixels that can use irradiance caching, are given as an expected range with a conservative value in bold that is used during calculations unless otherwise noted.
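To give a sense of scale for the quantities in Table 7.1, the following sketch derives a few workload rates from the nominal values. It is an illustration only and not the cost model developed later in the chapter: it assumes one final gather per pixel, takes N_FG = 100 and R_IC = 60% as example values from the tabulated ranges, and uses hypothetical function names.

```python
IMAGE_SIDE = 512      # pixels per image side
TILE_SIDE = 16        # pixels per tile side

P = TILE_SIDE * TILE_SIDE              # pixels in a tile
T = (IMAGE_SIDE // TILE_SIDE) ** 2     # tiles in a 512x512 image
F = 30                                 # frames per second

def tiles_per_second():
    """Tiles that must be completed each second across the whole system."""
    return T * F

def pixels_per_second():
    """Pixels shaded each second at one sample per pixel."""
    return P * T * F

def final_gather_rays_per_second(n_fg=100, r_ic=0.0):
    """Final gather rays per second, assuming pixels served by the irradiance
    cache (fraction r_ic) cast no final gather rays of their own."""
    return pixels_per_second() * n_fg * (1.0 - r_ic)

if __name__ == "__main__":
    print("tiles/s: ", tiles_per_second())
    print("pixels/s:", pixels_per_second())
    print("FG rays/s without IC:", final_gather_rays_per_second())
    print("FG rays/s, R_IC=60%: ", final_gather_rays_per_second(r_ic=0.6))
```

The tile count computed here agrees with the tabulated T = 1024, and each final gather ray in turn triggers a photon gather of k photons, which is where the bulk of the memory traffic arises.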