Evaluation of parallel implementation - Improving the performance of video based reconstruction

5.7 Evaluation

5.8.3 Evaluation of parallel implementation

The acceleration achieved by each step of the parallel partitioning can be seen in the following graphs for three processing scenarios: CPU sequential, CPU parallel, and GPU. Note that the implementation is not currently optimised for GPU execution; such optimisation could significantly improve performance on the GPU.

Step 1 - Create viewing edges

Figure 5.12 shows the execution time for step 1 of the parallel scheme. As the number of input contour edges increases, the sequential CPU execution time rises more sharply than that of GPU parallel execution or CPU parallel execution, which is the fastest. The CPU and GPU parallel executions both provide an acceleration compared to sequential execution on the CPU.

CHAPTER 5. IMPROVING PERFORMANCE OF VBR 137

Figure 5.10: Parallel latency reduction compared to the distributed approach - Dashed lines show the corresponding latency reduction of distributed EPVH from Franco et al. [41]

Step 2 - Find Connections

In Figure 5.13, the execution times are plotted against the number of viewing edges, since this is the data entity over which the step is parallelised. This step and the following step (Figure 5.14) tell a different story from step 1. As the camera count is increased, more triple points are found by step 2, which leads to more connections being sought in step 3.

Step 3 - Triplepoint Connections

Triplepoint connection seeking acceleration can be found in Figure 5.14. Similarly to step 2, triplepoint connection seeking performs poorly on the GPU, worse than the sequential execution on the CPU.

Figure 5.11: Parallel acceleration of the implementation compared to the distributed approach - Dashed lines show the corresponding acceleration of distributed EPVH from Franco et al. [41]

Figure 5.12: Step 1 - Create Viewing Edges execution time graph.

Steps 4 and 5 - Edges and Faces

Figure 5.15 shows the combined time taken to extract edges and faces, including texturing. Very little gain is made over sequential CPU processing by either parallel CPU or GPU execution.

Overall performance

Overall, it can be seen from Figure 5.16 that CPU parallel processing is the quickest way to execute the algorithm. CPU sequential and GPU parallel execution appear to follow a very similar line in the graph; however, analysis of the graphs for each parallelisation step shows that the GPU can process contour edges into viewing edges more quickly than the sequential CPU.

Figure 5.13: Step 2 - Find Connections execution time graph.

Figure 5.14: Step 3 - Find Triple Point Connections execution time graph.

Acceleration achieved per step

Figure 5.17 shows the acceleration achieved for each step of the algorithm when executed in parallel on an 8-core CPU.
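As a minimal sketch of how such an acceleration figure can be derived, the following assumes that "acceleration" means the usual speedup ratio of sequential to parallel execution time; the text does not state the definition explicitly. The function names and the timings are hypothetical and are not measurements from this work.

```python
def speedup(t_sequential: float, t_parallel: float) -> float:
    """Acceleration factor: how many times faster the parallel run is."""
    if t_sequential <= 0 or t_parallel <= 0:
        raise ValueError("execution times must be positive")
    return t_sequential / t_parallel


def parallel_efficiency(t_sequential: float, t_parallel: float, cores: int) -> float:
    """Fraction of ideal linear speedup achieved on `cores` cores."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return speedup(t_sequential, t_parallel) / cores


# Hypothetical timings in milliseconds, NOT measurements from this work.
example_sequential_ms = 80.0
example_parallel_ms = 20.0

example_speedup = speedup(example_sequential_ms, example_parallel_ms)
example_efficiency = parallel_efficiency(
    example_sequential_ms, example_parallel_ms, cores=8
)
print(f"speedup {example_speedup:.1f}x, efficiency {example_efficiency:.0%}")
```

Dividing the speedup by the core count gives a rough indication of how far a step falls short of ideal linear scaling on the 8-core CPU.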

Branching analysis

GPUs are batch stream processors and as such are not efficient at executing code containing conditional statements (branches). This is because execution threads are grouped together to execute code in single instruction multiple data (SIMD) groups, which saves on the number of instruction fetches and decodes required. If threads encounter a branch they may diverge, resulting in some threads in the group executing the branch whilst the others are stalled or executing a different code path. Effectively, for each branch reached the execution group is split into two, which reduces the efficiency of both thread groups due to the loss of SIMD parallelism. For example, in a 4-thread SIMD group encountering a branch where 1 thread follows the branch and the remaining 3 do not, the original SIMD group will be split into 2 new groups, one of 1 thread and one of 3 threads. The 1-thread group will run at 25% efficiency and the 3-thread group at 75% efficiency compared with the original 4-thread group. Nested branches, where conditional code is contained within conditional code, further exacerbate the problem, and SIMD execution of deeply nested branches can lead to very poor performance.

Figure 5.15: Steps 4 and 5 - Extract edges and faces execution time.

Figure 5.16: Overall execution time.

CPUs, on the other hand, are not designed as batch stream processors. They have long instruction pipelines compared with GPUs, with the result that branching generally incurs a small penalty for SIMD code execution. Furthermore, their instruction and data cache size and logic are much deeper than those of their GPU counterparts, and they can be flushed and reloaded in fewer cycles. Hence code containing many nested branches will execute much more efficiently on multi-core CPUs than on GPUs.

Figure 5.17: Acceleration of each algorithm step achieved on 8 core CPU

The implementation of the EPVH algorithm for local processing comprises the aforementioned 5 steps, each of which has a different number of branches and levels of branch nesting. In order to evaluate the differences in the branching penalty that may be incurred on the GPU and CPU, each step is analysed for the number of branches and the nesting level at which these branches occur. A simple strategy is employed to roughly assess the branching cost of each step of the algorithm: the number of branches at each nesting level is multiplied by the nesting depth, so that more deeply nested branches carry a greater weighting in the overall branching cost for that algorithm step. For example:

    if (a)                  // level 1 branch 1
    {
        if (b)              // level 2 branch 1
        {
            if (c) {}       // level 3 branch 1
            if (d) {}       // level 3 branch 2
        }
    }

The branching cost would be (1 x 1) + (2 x 1) + (3 x 2) = 9. Table 5.2 shows the corresponding calculation for each step of the implemented algorithm.
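The weighting scheme can be sketched in a few lines. This is an illustrative helper, not code from the implementation; the name `branching_cost` and the list-based input (branch count at level 1, then level 2, and so on) are assumptions made for the example.

```python
def branching_cost(branches_per_level):
    """Sum of (nesting depth x number of branches at that depth).

    branches_per_level[0] is the branch count at nesting level 1,
    branches_per_level[1] the count at level 2, and so on.
    """
    return sum(
        depth * count
        for depth, count in enumerate(branches_per_level, start=1)
    )


# The nested-if example: 1 branch at level 1, 1 at level 2, 2 at level 3.
example_cost = branching_cost([1, 1, 2])
print(example_cost)  # 9
```

Applying the same function to the per-level counts listed for a step in Table 5.2 reproduces that step's branching cost; for instance, the Step 1 counts (9, 13, 2, 1, 1) give 50.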

Algorithm step   level 1   level 2   level 3   level 4   level 5   branching cost
Step 1                 9        13         2         1         1               50
Step 2                 3         8         2         -         -               25
Step 3                 5         2         -         -         -                9
Step 4                 1         4         1         -         -               12
Step 5                 1         2         2         2         2               29

Table 5.2: Branching cost calculation for each step of the algorithm
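To relate these costs back to the SIMD divergence argument made earlier in this section, the 4-thread example (one thread taking the branch, three not) can be sketched as follows. This is a simplified model under the stated assumption that each sub-group still occupies the full SIMD width while it runs; the function name `split_efficiencies` is hypothetical, and real GPU schedulers are more complex than this.

```python
def split_efficiencies(group_width: int, threads_taking_branch: int):
    """Return the lane utilisation of the two sub-groups formed at a branch.

    Each sub-group is assumed to occupy the full SIMD width while it runs,
    so its efficiency is the fraction of lanes doing useful work.
    """
    if group_width < 1:
        raise ValueError("group_width must be at least 1")
    if not 0 <= threads_taking_branch <= group_width:
        raise ValueError("threads_taking_branch must lie in [0, group_width]")
    taken = threads_taking_branch
    not_taken = group_width - taken
    return taken / group_width, not_taken / group_width


# The example from the text: a 4-thread group, 1 thread follows the branch.
taken_eff, not_taken_eff = split_efficiencies(4, 1)
print(f"{taken_eff:.0%} {not_taken_eff:.0%}")  # 25% 75%
```

Under this model, every further level of nesting can split a sub-group again, which is consistent with weighting deeper branches more heavily in the branching cost.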

In document Improving the performance of video based reconstruction and validating it within a Telepresence context (Page 149-155)