Slice-Level and Coding Unit Parallelism - High Performance Multiview Video Coding

View-parallel and multiple-IGOPs processing schemes for IBP and IPP can be easily extended to lower slice-level parallelism. It can be accommodated by multiple partial reference frame transfers. This is possible as CUs are coded in raster-scan order.

It is also worthwhile to briefly discuss the influence of implementation platform on slice level parallelism. From the discussion so far it is clear that the limit to parallel processing on a node is four cores where CPU-concurrency loses its effectiveness. It is observed that as a frame is divided into slices and processed in parallel on a node, the computing resources saturate quickly, and beyond four slices, coding schedule become sequential on a node. Therefore, it is easy to conclude that with four (or multiple thereof) slices, the only factor contributing to the performance is, platform independent, prediction structure workload distribution from Table 3.3. This concludes that provided the number of slices are sufficiently large, it is possible to analyze the performance of MVC on computing cluster purely from its prediction structure and workload distribution across the compute nodes.

The provision of wavefront processing of coding units (CU) in HEVC presents an opportunity for parallelization at a finer level of granularity. This, however, requires resources to process multiple coding units in parallel. Similar to the discussion on slice level parallelism, a 4-core CPU limits parallel processing of CUs to four. From the discussion in Section 3.6 it should be noted that to implement the coding of CUs using a fast fine grain massively parallel technique requires allocation of multiple MPAs resources to a compute node to accelerate the wavefront processing.

3.9 Conclusion

By exploiting group of pictures (GOP) parallelism in the multiview coding, in this chapter, the contribution is a multiple-view-parallel, multiple-interleaved GOP (multiple-IGOP) scheduling scheme that will produce a balanced workload. In addition, the resources of multicore CPUs and multi-GPUs are leveraged to extract the maximum performance from a computer cluster. A substantial decrease of overall

encoding time by a factor of 12 is observed with no cost to the bitrate or video qual- ity when compared with the implementation on a single node. The improvement factor of 12 is more than the number of nodes (eight). Furthermore, this strategy decouples the optimizations at the lower levels from those at higher levels, allowing the deployment of any search algorithm for the ME/DE processing.

Chapter 4 Multi Level MVC Encoder

Optimization

1

Standardized in 2014, multiview extension of high efficiency video coding (MV- HEVC) offers significantly better compression performance of around 50% for multiview and 3D videos compared to multiple independent single-view HEVC coding. However, the extreme high computational complexity of MV-HEVC, demands significant optimization of the encoder. Novel optimization techniques at various levels of abstraction. Non-aggregation massively parallel motion estimation (ME) and disparity estimation (DE) in prediction unit (PU), fractional DE and bi-directional ME/DE, quantization parameter (QP)-based early termination of coding tree unit (CTU), and optimized resource-scheduled wave-front parallel processing for CTU, are proposed in this chapter. When evaluated over three views for all MV-HEVC available test se- quences, proposed optimization outperforms the anchor encoder by average factor of

0.12 dB.

4.1 Introduction

High efficiency video coding (HEVC)/H.265 [6] is the latest state-of-the-art video coding standard, providing up to 50% better compression over its predecessor advance video coding (AVC)/H.264 [3] to satisfy ever growing demands for higher resolution in videos such as 4K and 8K [25]. The multiview extension to HEVC (MV-HEVC) was defined in the Annex G of HEVC/H.265 [6] [73] [5] in 2014. Common applications of multiview coding (MVC) are free view TV, 3D movies/TV, and immersive tele- conferencing [25] [26]. In these applications, multiple cameras commonly arranged in linear, grid or arc formation are deployed to capture the same scene simultane- ously. In addition to exploiting temporal similarity using motion estimation (ME) and motion compensation, MVC exploits inter-view similarity using disparity estimation (DE) and disparity compensation, achieving notably higher coding efficiency compared with coding of multiple views as separate video streams. The ME algo- rithms designed for temporal prediction can, with little or no modifications, be applied to DE for inter-view prediction.

The profiling of AVC/H.264 MVC in our previous work [50] showed that 99% of execution time is spent on integer ME/DE and its sole optimization is enough to gain significant speedup. However, as the profiling result for MV-HEVC/H.265 in Table 4.1 shows, the execution time is spread across many functional modules where integer ME/DE contribution to the overall coding execution time is only 47%. Frac- tional ME/DE and bi-directional ME/DE also make a significant 15% contribution to the execution time. A significant 38% of execution time is contribution from the other functional modules such as intra-prediction, discrete cosine transform (DCT)

Table 4.1

Execution Time Profiling of MV-HEVC/H.265 (HTM) 16.2 at QP=32 for multiview video sequence Shark

Integer ME Fractional ME Bi-directional ME Other

47% 5% 10% 38%

and context-adaptive binary arithmetic coding (CABAC) . Therefore, sole optimization of integer ME/DE in MV-HEVC/H.265 while still needed, is obviously not enough. This chapter proposes several low level techniques that reduce execution time of integer, fractional and bi-directional ME/DE. Further, several high level optimization techniques that affect all or a significant number of functional modules in MV-HEVC/H.265 is considered. For example by selective evaluation of coding units execution of all required functional modules are skipped. Another available tool is, of course, concurrent coding of functional modules.

There have been efforts to improve the execution performance of HEVC single view coding [21] [74] (and references thereof). However, these efforts only deal with single view HEVC, and thus do not take the characteristics of inter-view correlation into account, and are, therefore, inadequate for MV-HEVC. This chapter presents three major contributions that reduce the computational complexity of MV-HEVC. While MV-HEVC has been the focus of complexity reduction in this chapter, the contributions in this chapter provide significant benefit to the single view HEVC as well. In this chapter, a methodology for designing a novel massively parallel architecture (MPA) based ME/DE algorithm from a completely new perspective is explored. A single instruction multiple data (SIMD) approach is proposed for sub-pixel and bi- directional ME/DE. Next, quantization parameter (QP)-based early termination of coding tree unit (CTU) is proposed, where the coding units (CU) below a certain

platform, a resource-optimized multi-threaded execution scheduling for the wave front processing (WPP) for the implementation on a multicore processor is proposed.

This chapter is organized as follows. Section 4.2 briefly presents the relevant concepts in HEVC and MV-HEVC. Section 4.3 presents a survey of related works in improving the runtime performance of HEVC video coding. Section 4.4 presents our efforts on scalable massive parallelism and the opportunity that it provides for fast ME/DE algorithm from a new perspective. Section 4.5 presents our proposal for QP-based early termination of CTU. Section 4.6 discusses the proposal for an optimized WPP scheduling and implementation. Section 4.7 concludes the chapter.

In document High Performance Multiview Video Coding (Page 134-141)