1.4.1 Fast ME/DE
To accelerate block matching in ME/DE, many sub-optimal CPU-based techniques have been proposed for AVC/H.264 that are generally applicable to HEVC/H.265. An iterative hexagon search was proposed in [8] to improve on the RD performance of the earlier diamond search [9]. More sophisticated search algorithms combine multiple simple search patterns with local correlations to further improve RD performance. The technique in [10] proposes the unsymmetrical-cross multi-hexagon-grid search (UMHexagonS). The work in [11] gains further improvement through the enhanced predictive zonal search (EPZS). There are also exploratory implementations of ME/DE algorithms on the MPA platform of graphics processing units (GPUs) [12], [13], [14], [15], [16], [17]. All of these implementations except [15] target single-view video; they have not been integrated into a complex platform such as JMVC and have therefore not been compared with TZsearch, which is the gold standard for MVC. These algorithms, while demonstrating the potential of MPA, do not exhibit impressive performance compared with the CPU-based full search. The best speedups with respect to sequential full search reported in these references range from 9 to 17.
The main reason for the low performance of these algorithms is that they fail to fully exploit the massive parallelism afforded by the MPA programming environment of the GPU, performing many parallelizable tasks sequentially [12], [17]. The works in [12] and [13] do not give due consideration to algorithmic scalability within the capabilities of the underlying hardware, resulting in unnecessary resource saturation and scaling limitations. Some of these algorithms also gain speedup at the expense of a significant compromise in bit-rate and PSNR quality, because they fail to properly handle inter-partition MV/DV dependencies in MVC [12], [13], [16] (see Section 2.3). The best implementation of MVC to date, in terms of speed, bit-rate, and PSNR quality, is the JMVC reference software with the TZsearch mode for ME/DE. Therefore, all performance evaluations of the proposed algorithm in this paper are made with respect to the JMVC reference software.
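To make the pattern-based block matching surveyed above more concrete, the following is a minimal sketch of a hexagon-pattern search with a sum-of-absolute-differences (SAD) cost. It is an illustration under stated assumptions, not the algorithm of [8] as published: the function names, the six-point pattern offsets, the four-point refinement, and the frame representation (lists of rows of integers) are all hypothetical choices made here.

```python
# Illustrative sketch only. Names, pattern offsets and the refinement step
# are assumptions made for this example, not details taken from [8] or [9].

LARGE_HEXAGON = [(-2, 0), (2, 0), (-1, -2), (1, -2), (-1, 2), (1, 2)]  # (dx, dy)
SMALL_PATTERN = [(-1, 0), (1, 0), (0, -1), (0, 1)]                      # (dx, dy)


def sad(cur, ref, bx, by, mx, my, n):
    """SAD between the n x n block of `cur` at (bx, by) and the block of
    `ref` displaced by the candidate vector (mx, my)."""
    total = 0
    for y in range(n):
        for x in range(n):
            total += abs(cur[by + y][bx + x] - ref[by + my + y][bx + mx + x])
    return total


def hexagon_search(cur, ref, bx, by, n, search_range):
    """Return (best_vector, best_cost) for the block at (bx, by)."""
    height, width = len(ref), len(ref[0])

    def valid(mx, my):
        # Candidate must stay inside the search window and inside the frame.
        return (abs(mx) <= search_range and abs(my) <= search_range
                and 0 <= bx + mx and bx + mx + n <= width
                and 0 <= by + my and by + my + n <= height)

    best = (0, 0)
    best_cost = sad(cur, ref, bx, by, 0, 0, n)

    # Coarse stage: re-centre the large hexagon until the centre itself wins.
    while True:
        centre = best
        for dx, dy in LARGE_HEXAGON:
            mx, my = centre[0] + dx, centre[1] + dy
            if valid(mx, my):
                cost = sad(cur, ref, bx, by, mx, my, n)
                if cost < best_cost:
                    best, best_cost = (mx, my), cost
        if best == centre:
            break

    # Fine stage: a single pass of the small pattern around the final centre.
    centre = best
    for dx, dy in SMALL_PATTERN:
        mx, my = centre[0] + dx, centre[1] + dy
        if valid(mx, my):
            cost = sad(cur, ref, bx, by, mx, my, n)
            if cost < best_cost:
                best, best_cost = (mx, my), cost

    return best, best_cost
```

Each coarse iteration evaluates at most six candidates, which is why such patterns are far cheaper than a full search over the whole window. The price is that the greedy descent can stop in a local minimum of the SAD surface, which is the sub-optimality referred to above.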
1.4.2 Fast Mode Decision
A rich choice of hierarchical partitioning modes within the CTU is the main reason for higher coding efficiency as well as the high computational complexity in HEVC. To
has been selected. An early termination scheme in which the partitioning mode with the largest PU size is checked first was proposed in [19]. If the PU in this mode produces a coded-block-flag equal to zero, the processing of all remaining PUs within this CU is skipped. The work in [20] improves upon this termination scheme by halting the processing of all other PUs if both the MV difference and the coded-block-flag turn out to be equal to zero. Further, the latest related work on HEVC coding in [21] proposes an ME technique that skips the processing of all CUs of size 8 × 8. For the remaining larger CUs, all 17 possible symmetric partitioning modes are evaluated. The three modes with the lowest costs collectively determine the early termination decision for the processing of CTU subtrees.
The above algorithms are estimated to yield a speedup factor of about 1.6 to 3, with varying loss in RD performance. Fast mode decisions have also been proposed for intra-prediction [22]. However, intra-prediction consumes very little time in comparison with ME/DE processing. These efforts reveal the potential for reducing encoding time by appropriately skipping some CUs and PUs. However, the increasing number of views in MV-HEVC brings significantly more inter-prediction for each PU within a CTU, potentially slowing down the processing of PUs in the existing algorithms.
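The early-termination rules of [19] and [20] described above can be sketched as follows. This is a simplified illustration rather than either published method: the `PUResult` record, the `evaluate` callback, the mode names, and the decision to apply the test only after the first (largest) mode are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class PUResult:
    """Hypothetical outcome of coding a CU with one partitioning mode."""
    mode: str
    rd_cost: float
    cbf: int                 # coded-block-flag: 0 means no coded residual
    mvd: Tuple[int, int]     # motion-vector difference


def choose_pu_mode(modes: List[str],
                   evaluate: Callable[[str], PUResult],
                   require_zero_mvd: bool = False):
    """Evaluate `modes` in order (largest PU first) with early termination.

    require_zero_mvd=False: stop when the first mode has cbf == 0
                            (the rule attributed to [19] in the text).
    require_zero_mvd=True:  additionally require a zero MV difference
                            (the rule attributed to [20] in the text).
    Returns (best_result, list_of_modes_actually_evaluated).
    """
    best = None
    evaluated = []
    for index, mode in enumerate(modes):
        result = evaluate(mode)
        evaluated.append(mode)
        if best is None or result.rd_cost < best.rd_cost:
            best = result
        # Assumption: the termination test is applied to the first mode only.
        if index == 0 and result.cbf == 0:
            if not require_zero_mvd or result.mvd == (0, 0):
                break
    return best, evaluated
```

The saving comes from the modes that are never passed to `evaluate`. The RD loss mentioned above arises whenever one of the skipped modes would have had a lower cost than the mode that triggered termination.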
1.4.3 Multiview Coding Scheduling
To maximize computational resource usage and reduce unnecessary stalls in multiview coding due to inter-view and temporal dependencies, the works in [23] and [24] present scheduling algorithms for parallel MVC encoding at the frame level on a multi-processor system for a given prediction structure. In these works, the prediction structure is used to build a directed acyclic graph in which frames across the temporal and inter-view domains form the vertices and the coding dependencies form the edges. Starting with the I-frame vertices in view 0 as a pair of roots, the encoding scheduler inspects all the neighboring vertices and assigns each of them to one of the available CPU cores. Then, for each of those neighbor vertices in turn, it inspects their neighbor vertices that are not yet coded, and so on. The process continues until all the GOPs across all the views are encoded. This requires a complex scheduler that traverses the graph to discover and schedule the frames that are ready to be dispatched for execution on one of the many identical CPU cores. Further, in this scheme the workload, in terms of the number of frames ready to be scheduled, varies greatly across the coding stages. Depending on the coding stage, this results in either under-utilization of the CPU cores or too few cores for efficient parallelism. To alleviate this problem by creating enough workload to keep all the CPU cores busy, the work in [23] proposes processing multiple GOPs across all the views in parallel, further complicating the scheduler. The scheduler's task becomes even more cumbersome because, at each stage of encoding, frame vertices take widely different execution times depending on the number of their immediate descendants in the graph and the nature of the edge dependencies (temporal or inter-view).
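The stage-by-stage behavior of such a graph-driven scheduler can be illustrated with a small sketch. It is not the scheduler of [23] or [24]: the dependency pattern built by `grid_dependencies` (each frame references the previous frame of its own view and the same-time frame of the previous view) is a hypothetical stand-in for a real prediction structure, and every frame is assumed to take one time unit, which the text above notes is not true in practice.

```python
def grid_dependencies(num_views, num_frames):
    """Hypothetical dependency graph: frame (view, t) references
    (view, t - 1) and (view - 1, t) when those frames exist."""
    deps = {}
    for view in range(num_views):
        for t in range(num_frames):
            refs = set()
            if t > 0:
                refs.add((view, t - 1))
            if view > 0:
                refs.add((view - 1, t))
            deps[(view, t)] = refs
    return deps


def schedule_stages(deps, num_cores):
    """Greedy stage-wise scheduling of a dependency DAG.

    At every stage, all frames whose references are already coded are
    'ready'; at most `num_cores` of them are dispatched. Assumes every
    frame takes exactly one stage to encode (a simplification).
    Returns a list of stages, each a list of (view, t) frames.
    """
    remaining = {frame: set(refs) for frame, refs in deps.items()}
    coded = set()
    stages = []
    while remaining:
        ready = sorted(f for f, refs in remaining.items() if refs <= coded)
        if not ready:
            raise ValueError("dependency graph contains a cycle")
        batch = ready[:num_cores]
        stages.append(batch)
        coded.update(batch)
        for frame in batch:
            del remaining[frame]
    return stages
```

With 3 views, 4 frames per view, and 4 cores, `[len(s) for s in schedule_stages(grid_dependencies(3, 4), 4)]` evaluates to `[1, 2, 3, 3, 2, 1]`. The number of ready frames ramps up and down across stages, so a fixed number of cores is idle at some stages and would be insufficient at others for larger graphs.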
In contrast, this work presents a simple scheduling scheme in which the number of frames to be encoded does not change across the coding steps. As will be described in Section 3, this is achieved through a simple time shift of the encoding steps. The simplicity of the proposed parallel scheduling scheme allows the more complex IPP prediction structure to have a more efficient parallel implementation than IBP.