Algorithm Analysis - GPU Acceleration - Multi-view dynamic scene modeling

3.5 GPU Acceleration

3.5.1 Algorithm Analysis

Foreground Inference

A Gaussian RGB background color model must be trained in advance for every pixel in each camera view. Assuming that the background stays the same during capture, the silhouette probability of a pixel can be computed at every time instant from the background model and the new image frame. Using this information from all views, the posterior probability that a 3D voxel is occupied by a dynamic object can be inferred. In fact, the occupancy probability of a voxel can be reliably computed from its corresponding pixels along the camera viewing rays. Therefore, assuming that neighboring voxel occupancies are independent, the inference procedure is the same for every voxel, so the voxels can be processed independently of one another.
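The per-pixel and per-voxel steps described above can be illustrated with a minimal Python sketch. It is a simplification under stated assumptions, not the sensor model used in this work: the Gaussian is assumed to have independent color channels, the foreground color is assumed uniform over the RGB cube, priors are assumed uniform, and the views are fused with a plain naive-Bayes product. The function names `background_likelihood`, `silhouette_probability` and `fuse_views` are hypothetical.

```python
import math


def background_likelihood(pixel, mean, var):
    """Density of an RGB pixel under a per-pixel Gaussian background model.

    Assumption: the three color channels are independent (diagonal covariance).
    """
    p = 1.0
    for c, m, v in zip(pixel, mean, var):
        p *= math.exp(-0.5 * (c - m) ** 2 / v) / math.sqrt(2.0 * math.pi * v)
    return p


def silhouette_probability(pixel, mean, var, fg_density=1.0 / 256 ** 3):
    """Probability that the pixel shows foreground rather than background.

    Assumptions: uniform foreground color density and equal priors.
    """
    bg = background_likelihood(pixel, mean, var)
    return fg_density / (fg_density + bg)


def fuse_views(silhouette_probs, eps=1e-6):
    """Fuse the silhouette probabilities of the pixels one voxel projects to.

    Assumption: views are conditionally independent and the prior is uniform,
    so the posterior is a normalized product (naive Bayes).
    """
    occupied, empty = 1.0, 1.0
    for p in silhouette_probs:
        p = min(max(p, eps), 1.0 - eps)  # keep the products away from zero
        occupied *= p
        empty *= 1.0 - p
    return occupied / (occupied + empty)


# Hypothetical background model for one pixel: mean color and channel variances.
mean, var = (50.0, 50.0, 50.0), (25.0, 25.0, 25.0)
p_same = silhouette_probability((50, 50, 50), mean, var)      # matches background
p_diff = silhouette_probability((200, 200, 200), mean, var)   # far from background
```

Because `fuse_views` only reads the pixels that a single voxel projects to, the same function can be evaluated for every voxel independently, which is the property that the GPU implementation relies on.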

Occluder Inference

After the dynamic shape is computed, the occlusion events at every voxel position are examined by looking for inconsistencies between the computed dynamic shape volume and the silhouette information from the background models in the image views. As discussed before, such an inconsistency arises when a dynamic object is occluded in one view while other views still give positive evidence for dynamic shape occupancy. For a voxel in the occluder volume, again, only the corresponding camera viewing rays need to be examined. The major difference is that one needs to know the maximum values of the dynamic shape occupancy probability along the viewing ray, both in front of and behind the voxel being examined. This examination requires view-dependent ordering, which is the most challenging part to parallelize. Once the peak information is computed, the rest is almost the same information fusion process as in the dynamic shape volume inference. Finally, the occluder estimates computed at each time instant must be merged into an accumulated final grid, which is again very easy to parallelize.

Algorithm Complexity Analysis

In Section 3.3.5, a term $R$ is introduced for every occluder voxel to model how reliable its value already is, given the inference up to the current time instant. The complexity chart of the CPU implementation is given in Fig. 3.9. The current GPU version does not implement this term yet. According to the chart, it does not affect the total complexity of the algorithm, and it can be added easily.

Figure 3.9: CPU occupancy grid algorithm complexity analysis, including both dynamic shape and static occluder computation.

Here, $n$ is the number of cameras and $N$ is the side length of the volume in voxels. Most of the computations are performed on the voxels, which makes GPU parallelization feasible. It is unlikely that all temporary volumes can be stored in memory, which means that the data flow may need to be redesigned for the GPU implementation. The most time-consuming process is the “peak-finding” in the occluder grid computation step, which has $O(2nN^3)$ time complexity for every time instant.

Peak Finding—Brute force method

For every voxel, the brute-force algorithm traverses the viewing rays of all $n$ camera views, which takes $O(nN)$ time per voxel. Over all $N^3$ voxels, the whole algorithm therefore takes $O(nN^4)$.

This algorithm is very slow, but because it operates on a per-voxel basis, it lends itself to parallelization. It can also be combined with the occluder probability inference in a single function, which reduces the data transfer time between the CPU and the GPU. The algorithm takes the camera projection matrices, the foreground volume, and the pre-computed background probability images from all cameras as input, and computes the two accumulated values of Eq. 3.10 and the final occluder probability as output. However, the implementation takes more than 4 minutes to compute one time instant, which is much slower than the optimized CPU version. It is therefore not acceptable for a real-time solution.
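To make the cost of the brute-force search concrete, the following Python sketch handles a single viewing ray. It assumes that the occupancy probabilities along the ray have already been sampled into a list ordered from the camera outwards, and that the voxel itself is excluded from both maxima. The name `peaks_brute_force` is hypothetical, and the projection of voxels onto rays is left out.

```python
def peaks_brute_force(ray):
    """Return (front, behind) peak lists for one viewing ray.

    ray[i] is the dynamic shape occupancy probability of the i-th sample,
    with index 0 closest to the camera (an assumption of this sketch).
    front[i]  = maximum over samples between the camera and sample i
    behind[i] = maximum over samples beyond sample i
    Each sample rescans the whole ray, so one ray of length N costs O(N^2).
    """
    front, behind = [], []
    for i in range(len(ray)):
        front.append(max(ray[:i], default=0.0))
        behind.append(max(ray[i + 1:], default=0.0))
    return front, behind


ray = [0.1, 0.7, 0.2, 0.9, 0.3]
front, behind = peaks_brute_force(ray)
# front  == [0.0, 0.1, 0.7, 0.7, 0.9]
# behind == [0.9, 0.9, 0.9, 0.3, 0.0]
```

Every sample is processed without reference to the results of its neighbors, which is why this variant parallelizes trivially even though it repeats the same scans many times.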

Peak Finding—Divide and conquer method

The final GPU version instead splits the peak-finding process from the occluder inference process. More specifically, for one camera direction at a time, it pre-computes two volumes storing the “peak-in-the-front” and “peak-behind” values for each voxel, computes the intermediate marginalization probability in a temporary volume, and then moves on to the next camera direction. For each direction, a 2D image stores the maximum value found so far along each viewing ray, and the 2D slices of the 3D volume along that direction are tested sequentially against this image. While this reduces the time complexity to $O(N^3)$ per camera direction, two 2D images have to be kept to store the current “peak-in-the-front” and “peak-behind” values as the sweeping plane travels through the 3D volume in “front-to-back” and “back-to-front” order with respect to the camera direction. Four more volumes are also needed to store the temporary probability results, since the algorithm processes every camera view separately first and merges the results in the final step.
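The plane sweep can be sketched in a few lines of Python. This is an illustration under a strong simplifying assumption: the camera is taken to look along the z axis of the grid, so that the slices perpendicular to the view direction are simply `volume[z]` and each viewing ray is one (y, x) column. A real camera needs a view-dependent slice ordering and projection, which are omitted here. The name `sweep_peaks` is hypothetical.

```python
def sweep_peaks(volume):
    """Compute "peak-in-the-front" and "peak-behind" volumes in two sweeps.

    volume[z][y][x] holds occupancy probabilities, with z = 0 closest to the
    camera (axis-aligned view assumed). A 2D running-maximum image is carried
    through the slices, so each sweep touches every voxel exactly once.
    """
    depth, height, width = len(volume), len(volume[0]), len(volume[0][0])

    def sweep(z_order):
        running = [[0.0] * width for _ in range(height)]  # 2D max image
        out = [None] * depth
        for z in z_order:
            # Record the maximum seen before this slice, then fold the slice in.
            out[z] = [row[:] for row in running]
            for y in range(height):
                for x in range(width):
                    if volume[z][y][x] > running[y][x]:
                        running[y][x] = volume[z][y][x]
        return out

    front = sweep(range(depth))               # front-to-back
    behind = sweep(range(depth - 1, -1, -1))  # back-to-front
    return front, behind


# A 5 x 1 x 1 volume, i.e. a single viewing ray along z.
volume = [[[v]] for v in [0.1, 0.7, 0.2, 0.9, 0.3]]
front, behind = sweep_peaks(volume)
front_ray = [s[0][0] for s in front]    # [0.0, 0.1, 0.7, 0.7, 0.9]
behind_ray = [s[0][0] for s in behind]  # [0.9, 0.9, 0.9, 0.3, 0.0]
```

Within one slice, all pixels of the running-maximum image are updated independently, so the parallelism lies inside each slice while the slices themselves must be visited in order.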

Peak Finding—Cache-friendly divide and conquer method

Since the plane-sweeping direction depends on the camera view orientation, the sequential accesses to the values of a plane may not be local in memory at all for a given camera view, so that data is constantly moved in and out of the cache. This has a large impact on the speed of peak finding. From the “Peak finding analysis” in Section 3.5.3, one can see that, in the CPU implementation, peak finding may take about twice as long for a cache-unfriendly direction as for a cache-friendly one. However, since the cache-friendly CPU version requires ordered traversal, which prevents parallelization, the GPU version cannot really benefit from it. The final GPU implementation therefore uses the cache-unfriendly version described in the previous section, although there may still be room for further speedup in this direction.
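Why the sweep direction affects memory locality can be seen from the addresses that one slice touches. The Python sketch below assumes a flat, row-major volume with x as the fastest-varying index, which is a common layout but is not stated in the text. The helper names `flat_index`, `slice_indices` and `gaps` are hypothetical.

```python
def flat_index(x, y, z, n):
    """Offset of voxel (x, y, z) in a flat n*n*n array, x varying fastest."""
    return (z * n + y) * n + x


def slice_indices(axis, k, n):
    """Flat offsets read when visiting the k-th slice perpendicular to axis."""
    if axis == "z":
        return [flat_index(x, y, k, n) for y in range(n) for x in range(n)]
    if axis == "y":
        return [flat_index(x, k, z, n) for z in range(n) for x in range(n)]
    if axis == "x":
        return [flat_index(k, y, z, n) for z in range(n) for y in range(n)]
    raise ValueError("axis must be 'x', 'y' or 'z'")


def gaps(indices):
    """Distances between consecutively accessed offsets."""
    return [b - a for a, b in zip(indices, indices[1:])]


n = 4
z_gaps = gaps(slice_indices("z", 0, n))  # all 1: the slice is contiguous
x_gaps = gaps(slice_indices("x", 0, n))  # all n: every access jumps a full row
```

Under this layout, a sweep along z reads each slice as one contiguous block, whereas a sweep along x never reads two neighboring memory locations in a row, which is the kind of access pattern the paragraph above calls cache-unfriendly.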
