Performance Analysis - Hybrid Cell Projection / Ray Casting

6.1 Hybrid Cell Projection / Ray Casting

6.1.3 Performance Analysis

In Figure 6.10 we plot millions of tetrahedra against frames per second for our variants of cell projection / ray casting algorithms based on trivariate splines. The numbers correspond to a worst case scenario where the surface is filling the entire screen and all tetrahedra are within the view port. Frame rates significantly increase for smaller view ports as well as for close inspection of the surface, since in the latter case, not all tetrahedra need to be processed. The largest isosurface in these tests consists of approximately 7.5 million tetrahedra within 650 000 relevant cubes. Since we use a 4 byte index to encode the position for each T in Tactive,i we need about 29 MByte of

GPU memory for the geometry encoding in this case. For the variants using precomputed determining sets we further need to store 20 float values per relevant cube for quadratic super splines, which corresponds to circa 50 MByte in the above example, and for the cubic C1-spline we need 56 float values per relevant cube, i.e., circa 139 MByte. If the spline coefficients are computed on-the-fly then the data set itself needs to be stored on the GPU.

As a rough estimate of shader performance we report that NVidia’s tool ShaderPerf gives us a throughput of 143 million fragments per second for the most complex shader (cubic splines with on-the-fly coefficients) and 778 million fragments per second for the simplest shader (quadratic splines and precomputed determining sets). In our tests we achieve a peak performance of about 30 million Bézier tetrahedra per second for the variants using precomputed determining sets. In comparison, in [SWBG06] the authors achieve a rendering performance of about 11 million quadratic surfaces on similar hard- ware, but restrict themselves to only spheres, ellipsoids, and cylinders. In our own tests based on the approach in [LB06] we were able to render about 500 000 quadratic Bézier tetrahedra per second, where the main problem is that the screen space projections of the tetrahedra are done on the CPU. In addition, both approaches do not consider splines, but the much simpler case of unconnected polynomials. The GPU approach based on trivariate splines given in [SGS06] has a peak performance of only 850 000 quadratic Bézier tetrahedra. Further, since the authors ignore the spline struc-

0 10 20 30 40 50 60 0· 106 ₁_{· 10}6 ₂_{· 10}6 ₃_{· 10}6 ₄_{· 10}6 ₅_{· 10}6 ₆_{· 10}6 ₇_{· 10}6 Tetrahedra F P S r r r r r _r _r u u u u _u u u b b b b b _b b b b b b b b b b

b quadratic splines &

precomp. DS

r cubic splines & pre-

comp. DS quadratic splines & on-the-fly coeff.

cubic splines & on- the-fly coeff.

Figure 6.10: A plot of millions of tetrahedra against frames per second for four variants of trivariate

splines rendering methods (quadratic splines and cubic C1-splines with precomputed determining sets and with on-the-fly computation of coefficients). All timings were done on a NVIDIA GTX285 GPU and a 1000 × 1000 view port.

ture and encode the geometry in a naive way, this approach suffers from severe memory limitations.

The peak rate of 30 million tetrahedra corresponds to about 60 million triangles (the back facing triangles are culled in the triangle setup stage). This can be compared with the peak rate of modern GPUs which are capable of processing up to 300 million triangles in optimal conditions. Here, optimal refers to the size of the screen space projections of triangles as well as the overdraw, i.e., the number of fragments that fall onto one single pixel. In the rasterizer, the GPU processes batches usually consisting of 2 × 2 pixels in parallel. If the screen-space projection of a triangle is below this value then threads on the GPU diverge with the effect of a dropping performance. High overdraw affects performance since many computations are done for fragments which later get overwritten by fragments closer to the viewer. The raster operations unit (ROP) is the GPU stage that finally writes a pixel to the frame buffer. High overdraw is also a problem in the ROP since the number of units is limited (32 for a GTX 280 GPU) and frame buffer writes need to be serialized. The frame rates about linearly depend on the number of tetrahedra contributing to the surface indicating that vertex processing is one of the main performance bottlenecks. Consider an isosurface with five million tetrahedra, where for each tetrahedron six vertices need to be processed. On a 1000 × 1000 view port, this corresponds to 30 vertices per pixel. The performance of our shaders using precomputed determining sets is limited by these factors: a huge amount of small triangles and high overdraw. This is also reflected in Figure 6.10 since the red and blue curves almost coincide even though cubic splines have a shader complexity which is about twice as high as for quadratics.

We have lower performance using on-the-fly computation of spline coefficients, since in this case, more arithmetic is needed to obtain the coefficients from the data stencil.

0ms 100ms 200ms 300ms 400ms Chmutov ρ = 0.21 2.3· 105_tetrahedra M 8.3MB, Tactive0.9MB 57ms 64ms 40ms 28ms Tooth ρ = 0.45 7.3· 105_tetrahedra M 13.5MB, Tactive2.8MB 93ms 108ms 70ms 52ms Vismale ρ = 0.19 3.2· 106_tetrahedra M 56.1MB, Tactive12.5MB 120ms 177ms 312ms 196ms Asian Dragon ρ = 0.5 3.5· 106_tetrahedra M 59.3MB, Tactive13.4MB 143ms 201ms 323ms 208ms MRI Head ρ = 0.36 4.8· 106_tetrahedra M 85.7MB, Tactive18.3MB 141ms 211ms 466ms 357ms Reconstruction on-the-fly Reconstruction determining sets Frame time on-the-fly Frame time determining sets

Figure 6.11: Reconstruction and frame times of cubic C1_{-splines for a selection of data sets (see Fig-}

ures 6.20, 1.1, 6.16). For each data set, the isovalue ρ, the number of active tetrahedra, the size of the determining set M, and the size of Tactive, are given.

We furthermore need more texture fetches, which is 27 (instead of 5 for precomputed determining sets) in the case of quadratic super splines, 28 for quadratic C1-splines on truncated octahedral partitions, and 23 texture fetches for cubic C1-splines (14 for precomputed determining sets). Again, the frame rates of quadratic and cubic splines are very similar (the black and green curves in Figure 6.10), indicating that the higher arithmetic complexity of cubics is almost compensated for by fewer texture fetches.

By using instancing we need |T_active| 4-byte words to implicitly encode the geometry of the tetrahedra on the surface. The memory requirements for the precomputed deter- mining sets are determined by the number of active cubes |Qactive|. Since on average

12 tetrahedra are relevant on each Q ∈ Q_active the memory requirements are then given by |T_active| ·|MQ|/12, where |MQ| is 20 for quadratic super splines, and 56 for cubic C1- splines, respectively. Storing the precomputed determining set needs less memory than the complete data set of n3 values if |T_active| < n3_·₁₂_/_|M

Q|. For example, for a data set of size 2563 _{and quadratic super splines this corresponds to about 10 million tetrahedra,}

and for cubic C1-splines we have 3.59 million tetrahedra. Above these values it is more memory efficient to compute the spline coefficients on-the-fly from the volume data.

The table in Figure 6.11 exemplarily summarizes the performance of cubic C1-splines for a selection of data sets and lists the isovalues ρ, the number of active tetrahedra, as well as the size of the determining set M and the geometry encoding T_active. Timings were recorded on a GeForce GTX 285 (240 unified shader units) and CUDA 2.2. For each data set, the first two bars show the timings of our GPU kernels (see Section 6.1.1), which are invoked when the surface needs to be reconstructed. The on-the-fly computation

of coefficients is slightly faster than preparing the determining sets, since less data is written. The most expensive part in the reconstruction is the classification, i.e., the determination of the array Q_class. This could be improved by using appropriate spatial data structures, e.g., min-max octrees, but the recursive nature of these data structures makes an efficient GPU implementation challenging. In addition, the data structures have to be rebuilt if the data itself changes over time, which is not necessary in our simpler approach. Still, for our largest data sets the reconstruction times are in a range of a few hundred ms and in most cases even below the rendering times of a single frame. Compared to optimized CPU reconstruction based on octrees we achieve significant speed-ups of up to two orders of magnitude. The per-frame rendering times are also given for a 1000 × 1000 view port with the surface filling the entire screen.

In document New Models for High-Quality Surface Reconstruction and Rendering (Page 100-103)