6.1.2 GPU implementation

Figure 6.3 illustrates a system that was built for debugging and testing the functionality of the proposed GPU-accelerated computing system.

To evaluate the functionality of the proposed GPU-accelerated computing system, all of the CUDA kernels and the host PC code are integrated into the test system. The test system uses the same video datasets that were used for debugging and verifying the FPGA-based system implementation. The GPU receives the video and continuously executes all of the CUDA kernels (segmentation, edge filter, and circle detection algorithms) to obtain the circle center candidates, which represent the coordinate candidates of the robots’ locations. Subsequently, all of the coordinate candidates are loaded into the CPU’s memory. Then, the host PC uses the graph clustering algorithm to obtain the true robot marker coordinates. Finally, the system verifies the detection performance of these results (coordinates) and displays the output on the host PC.
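The host-side clustering step can be illustrated with a minimal sketch. The clustering algorithm itself is not detailed in this section, so the following assumes a simple proximity graph: candidates closer than a hypothetical `max_dist` are linked, and each connected component is reduced to its centroid. The function name, the distance threshold, and the sample coordinates are illustrative assumptions, not the implementation used in this work.

```python
import math


def cluster_candidates(points, max_dist):
    """Merge circle-center candidates that lie within max_dist of each other.

    Assumption: clusters are the connected components of a proximity graph,
    and each cluster is reported as the centroid of its members.
    """
    n = len(points)
    parent = list(range(n))  # union-find forest over candidate indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Link every pair of candidates that are close enough.
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(points[i], points[j]) <= max_dist:
                parent[find(i)] = find(j)

    # Collect the members of each connected component.
    groups = {}
    for i, p in enumerate(points):
        groups.setdefault(find(i), []).append(p)

    # One centroid per component, sorted for a stable output order.
    return sorted(
        (sum(x for x, _ in g) / len(g), sum(y for _, y in g) / len(g))
        for g in groups.values()
    )


# Hypothetical candidates: two tight groups around two markers.
candidates = [(100, 100), (101, 100), (100, 101), (300, 200), (301, 201)]
markers = cluster_candidates(candidates, max_dist=3.0)
```

The pairwise loop is quadratic in the number of candidates, which is acceptable only when the GPU stage returns a small candidate list per frame.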

[Figure 6.3 block diagram. Block labels: GPU Hardware Environment (GPU device; CUDA video processing kernels: object segmentation, edge filter, circle detection); Software Environment (Host PC) (video data set; display and verification); PCIe; NVIDIA CUDA; OpenCV.]

Figure 6.3: Testing system for debugging and detection evaluation of GPU-accelerated vision-based multi-robot tracking.

In line with the algorithms and implemented GPU kernels that were previously described in section 5.2, the experiments for evaluating the detection performance of the proposed design considered three main aspects: the circle detection method, which used either the circular Hough transform (CHT) or the circle scanning window (CSW); whether a downscaling method was used; and the method for computing the gradient magnitude in the Sobel kernel, which applied either Pythagoras’ theorem or an approximation technique. With two options for each of the three aspects, this work examined eight configurations to study the detection performance.
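The two ways of computing the gradient magnitude can be compared with a small sketch. The approximation technique is not specified in this section; the code assumes the common form |Gx| + |Gy|, which avoids the square root, and uses the standard 3 × 3 Sobel kernels. The function names and the sample patches are illustrative.

```python
import math

# Standard 3x3 Sobel kernels for the horizontal and vertical derivatives.
SOBEL_X = ((-1, 0, 1), (-2, 0, 2), (-1, 0, 1))
SOBEL_Y = ((-1, -2, -1), (0, 0, 0), (1, 2, 1))


def sobel_gradients(patch):
    """Return (gx, gy) for a 3x3 grayscale patch given as nested sequences."""
    gx = sum(SOBEL_X[r][c] * patch[r][c] for r in range(3) for c in range(3))
    gy = sum(SOBEL_Y[r][c] * patch[r][c] for r in range(3) for c in range(3))
    return gx, gy


def magnitude_pythagoras(gx, gy):
    """Exact magnitude: sqrt(gx^2 + gy^2)."""
    return math.sqrt(gx * gx + gy * gy)


def magnitude_approx(gx, gy):
    """Assumed approximation: |gx| + |gy| (no multiplication or square root)."""
    return abs(gx) + abs(gy)


# A vertical edge: both methods agree because gy is zero.
vertical = ((0, 0, 255), (0, 0, 255), (0, 0, 255))
gx_v, gy_v = sobel_gradients(vertical)

# A diagonal edge: the approximation overestimates the exact magnitude.
diagonal = ((0, 0, 0), (0, 0, 255), (0, 255, 255))
gx_d, gy_d = sobel_gradients(diagonal)
exact_d = magnitude_pythagoras(gx_d, gy_d)
approx_d = magnitude_approx(gx_d, gy_d)
```

Under this assumption the approximate value is never smaller than the exact one and exceeds it by at most a factor of √2, so an edge threshold tuned for one method may need adjusting for the other.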

Here are the four configurations that were built based on the CHT:

• CHT-Full-Pyth configuration: the proposed design uses the CHT with the full video frame size and Pythagoras’ theorem.

• CHT-Full-Approx configuration: the proposed design uses the CHT with the full video frame size and an approximation technique.

• CHT-Resize-Pyth configuration: the proposed design uses the CHT with a resized video frame and Pythagoras’ theorem.

• CHT-Resize-Approx configuration: the proposed design uses the CHT with a resized video frame and an approximation technique.

In addition, four configurations were developed based on the CSW:

• CSW-Full-Pyth configuration: the proposed design uses the CSW with the full video frame size and Pythagoras’ theorem.

• CSW-Full-Approx configuration: the proposed design uses the CSW with the full video frame size and an approximation technique.

• CSW-Resize-Pyth configuration: the proposed design uses the CSW with a resized video frame and Pythagoras’ theorem.

• CSW-Resize-Approx configuration: the proposed design uses the CSW with a resized video frame and an approximation technique.
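The eight configuration names are the Cartesian product of the three two-way choices. A short sketch, in which the tuple names are illustrative, enumerates them in the same order as the lists above:

```python
from itertools import product

# The three design choices, each with two options.
DETECTORS = ("CHT", "CSW")        # circle detection method
FRAME_MODES = ("Full", "Resize")  # full-size or downscaled video frame
MAGNITUDES = ("Pyth", "Approx")   # gradient magnitude method in the Sobel kernel

# 2 x 2 x 2 = 8 configuration names, e.g. "CHT-Full-Pyth".
configs = ["-".join(parts) for parts in product(DETECTORS, FRAME_MODES, MAGNITUDES)]

for name in configs:
    print(name)
```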

A detection performance evaluation was performed to investigate the precision and recall of the proposed design. The radius parameter (r) was set to r = 13 for the full-frame method and r = 6 for the resized (downscaled) method. The threshold for generating the circle center candidates in the CHT or CSW was set to at least 62.5% of the vote-sampling value.
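The 62.5% threshold translates into a minimum vote count of 10 for S = 16 and 20 for S = 32. The sketch below shows this arithmetic together with one possible way to count votes, assuming that a candidate center is tested by sampling S equally spaced points on a circle of radius r and counting how many of them are edge pixels. That sampling scheme, the function names, and the test coordinates are assumptions for illustration, not the kernels described in section 5.2.

```python
import math

THRESHOLD_RATIO = 0.625  # 62.5% of the vote-sampling value


def min_votes(num_samples, ratio=THRESHOLD_RATIO):
    """Smallest vote count that satisfies the threshold."""
    return math.ceil(num_samples * ratio)


def circle_offsets(radius, num_samples):
    """Integer (dx, dy) offsets of num_samples equally spaced circle points."""
    return [
        (
            round(radius * math.cos(2 * math.pi * k / num_samples)),
            round(radius * math.sin(2 * math.pi * k / num_samples)),
        )
        for k in range(num_samples)
    ]


def count_votes(edge_pixels, cx, cy, radius, num_samples):
    """Count sampled circle points around (cx, cy) that are edge pixels."""
    return sum(
        (cx + dx, cy + dy) in edge_pixels
        for dx, dy in circle_offsets(radius, num_samples)
    )


def is_center_candidate(votes, num_samples):
    return votes >= min_votes(num_samples)


# Hypothetical edge map: a complete circle of radius 13 centered at (50, 50).
full_circle = {(50 + dx, 50 + dy) for dx, dy in circle_offsets(13, 16)}
votes_full = count_votes(full_circle, 50, 50, radius=13, num_samples=16)

# The same circle with only half of its sampled points present.
half_circle = {(50 + dx, 50 + dy) for dx, dy in circle_offsets(13, 16)[:8]}
votes_half = count_votes(half_circle, 50, 50, radius=13, num_samples=16)
```

With these inputs the complete circle passes the threshold, while the half circle, with 8 of 16 votes, does not.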

Table 6.3 lists the precision and recall of the proposed system based on the CHT algorithm for different numbers of robots (N_R) and CHT vote samples S (16 and 32).

Based on the experiments, the proposed design can handle multi-robot localization with a typical precision and recall of about 99%: across all CHT configurations in Table 6.3, the precision ranges from 98.83% to 99.93% and the recall from 99.18% to 99.92%. This means that the design and its algorithm detect the robots’ locations reliably.
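For reference, precision is TP / (TP + FP) and recall is TP / (TP + FN), where TP, FP, and FN are the numbers of true positives, false positives, and false negatives. The sketch below assumes that a detection counts as a true positive when it lies within a hypothetical pixel tolerance `tol` of a ground-truth marker that has not been matched yet. The matching rule and the sample coordinates are illustrative and are not taken from the experiments.

```python
import math


def count_matches(detections, truths, tol):
    """Greedily match detections to ground-truth points within tol pixels.

    Returns (tp, fp, fn). Each ground-truth point is matched at most once.
    """
    unmatched = list(truths)
    tp = 0
    for d in detections:
        hit = next((t for t in unmatched if math.dist(d, t) <= tol), None)
        if hit is not None:
            unmatched.remove(hit)
            tp += 1
    fp = len(detections) - tp  # detections without a ground-truth partner
    fn = len(unmatched)        # ground-truth markers that were missed
    return tp, fp, fn


def precision_recall(tp, fp, fn):
    """Return (precision, recall) as fractions in [0, 1]."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up coordinates for illustration only.
detections = [(10, 10), (50, 52), (200, 200)]
truths = [(10, 11), (50, 50), (90, 90)]
tp, fp, fn = count_matches(detections, truths, tol=3.0)
precision, recall = precision_recall(tp, fp, fn)
```

In this toy case two of the three detections match a marker and one marker is missed, so both precision and recall equal 2/3.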

Table 6.3: Precision and recall values of the proposed system based on the CHT algorithm (N_R: number of robots; S₁₆ and S₃₂: 16 and 32 vote samples).

| Configuration | N_R | Precision S₁₆ (%) | Precision S₃₂ (%) | Recall S₁₆ (%) | Recall S₃₂ (%) |
|---|---|---|---|---|---|
| CHT-Full-Pyth | 4 | 99.70 | 99.78 | 99.57 | 99.76 |
| CHT-Full-Pyth | 8 | 99.70 | 99.85 | 99.57 | 99.83 |
| CHT-Full-Pyth | 16 | 99.60 | 99.85 | 99.37 | 99.82 |
| CHT-Full-Pyth | 32 | 99.72 | 99.93 | 99.63 | 99.92 |
| CHT-Full-Pyth | 64 | 99.61 | 99.62 | 99.24 | 99.61 |
| CHT-Full-Approx | 4 | 99.66 | 99.79 | 99.56 | 99.77 |
| CHT-Full-Approx | 8 | 99.72 | 99.85 | 99.59 | 99.84 |
| CHT-Full-Approx | 16 | 99.61 | 99.84 | 99.36 | 99.81 |
| CHT-Full-Approx | 32 | 99.72 | 99.92 | 99.59 | 99.91 |
| CHT-Full-Approx | 64 | 99.61 | 99.62 | 99.25 | 99.62 |
| CHT-Resize-Pyth | 4 | 98.88 | 99.39 | 99.40 | 99.53 |
| CHT-Resize-Pyth | 8 | 99.44 | 99.77 | 99.48 | 99.65 |
| CHT-Resize-Pyth | 16 | 99.45 | 99.39 | 99.58 | 99.72 |
| CHT-Resize-Pyth | 32 | 99.51 | 99.75 | 99.38 | 99.50 |
| CHT-Resize-Pyth | 64 | 99.05 | 99.59 | 99.19 | 99.52 |
| CHT-Resize-Approx | 4 | 98.83 | 99.40 | 99.33 | 99.55 |
| CHT-Resize-Approx | 8 | 99.43 | 99.80 | 99.49 | 99.66 |
| CHT-Resize-Approx | 16 | 99.44 | 99.40 | 99.56 | 99.73 |
| CHT-Resize-Approx | 32 | 99.51 | 99.74 | 99.38 | 99.49 |
| CHT-Resize-Approx | 64 | 99.05 | 99.59 | 99.18 | 99.53 |

With a higher number of vote samples (S), the system generally produces a higher precision and recall; the only exceptions in Table 6.3 are the precision values of the two resized configurations at N_R = 16. A configuration that uses the full video frame size generally has a slightly higher detection performance than one that works on resized video frames. This is probably because the downscaling process for the image frame affects the circle shape. However, the difference in detection performance between the full and resized image configurations is relatively small. Configurations that use Pythagoras’ theorem and those that use an approximation technique to calculate the gradient magnitude in the Sobel kernel produce nearly identical detection results. In other words, the method can be selected based on the processing time alone. The processing time evaluations are presented in section 6.2.2.

Table 6.4 lists the precision and recall values of the proposed system using the CSW technique. Typically, the proposed design using the CSW algorithm provides a detection performance similar to that obtained with the CHT algorithm. With a higher number of vote samples (S), the system generally produces a higher precision and recall; the exception in Table 6.4 is the recall of the two full-frame configurations at N_R = 4.

A configuration that uses the full video frame size generally has a slightly higher detection performance than one that works on resized video frames. The design and its algorithm provide an excellent precision and recall of about 99% for detecting the robots’ locations.

The CSW technique with the full frame size configuration provides almost the same detection performance as the CHT algorithm with the corresponding configuration. For the resized (downscaled) image configuration, the CSW technique obtains a slightly higher detection performance than the CHT approach with the same configuration. This is probably because the CSW algorithm is more robust than the CHT when robots collide.

Both the CHT and CSW configurations using 16 and 32 vote samples (S₁₆ and S₃₂) provide high detection performance for different numbers of robots (4, 8, 16, 32, and 64). This means that the proposed design and its algorithm are sufficiently robust for multiple robot tracking. As a result, the computing performance (processing time) for executing the CHT and CSW algorithms on the GPU is the main factor when deciding on the best method. Section 6.2.2 presents a detailed evaluation of the computing performance of the proposed design.

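Table 6.4: Precision and recall values of the proposed system based on the CSW algorithm (N_R: number of robots; S₁₆ and S₃₂: 16 and 32 vote samples).

| Configuration | N_R | Precision S₁₆ (%) | Precision S₃₂ (%) | Recall S₁₆ (%) | Recall S₃₂ (%) |
|---|---|---|---|---|---|
| CSW-Full-Pyth | 4 | 99.76 | 99.79 | 99.67 | 99.42 |
| CSW-Full-Pyth | 8 | 99.83 | 99.87 | 99.77 | 99.82 |
| CSW-Full-Pyth | 16 | 99.79 | 99.88 | 99.73 | 99.81 |
| CSW-Full-Pyth | 32 | 99.91 | 99.94 | 99.89 | 99.94 |
| CSW-Full-Pyth | 64 | 99.57 | 99.63 | 99.55 | 99.62 |
| CSW-Full-Approx | 4 | 99.73 | 99.79 | 99.66 | 99.43 |
| CSW-Full-Approx | 8 | 99.83 | 99.86 | 99.80 | 99.82 |
| CSW-Full-Approx | 16 | 99.78 | 99.88 | 99.76 | 99.81 |
| CSW-Full-Approx | 32 | 99.91 | 99.95 | 99.90 | 99.93 |
| CSW-Full-Approx | 64 | 99.56 | 99.64 | 99.55 | 99.63 |
| CSW-Resize-Pyth | 4 | 99.23 | 99.57 | 99.28 | 99.68 |
| CSW-Resize-Pyth | 8 | 99.66 | 99.75 | 99.69 | 99.75 |
| CSW-Resize-Pyth | 16 | 99.16 | 99.44 | 99.71 | 99.77 |
| CSW-Resize-Pyth | 32 | 99.86 | 99.87 | 99.37 | 99.85 |
| CSW-Resize-Pyth | 64 | 99.55 | 99.63 | 99.60 | 99.63 |
| CSW-Resize-Approx | 4 | 99.22 | 99.57 | 99.30 | 99.68 |
| CSW-Resize-Approx | 8 | 99.65 | 99.77 | 99.67 | 99.77 |
| CSW-Resize-Approx | 16 | 99.09 | 99.47 | 99.72 | 99.78 |
| CSW-Resize-Approx | 32 | 99.83 | 99.88 | 99.35 | 99.86 |
| CSW-Resize-Approx | 64 | 99.53 | 99.63 | 99.60 | 99.63 |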

In document Heterogeneous computing systems for vision-based multi-robot tracking (Page 137-142)