Investigating Host-Device communication in a GPU-based H.264 encoder.


Department of Informatics


Master thesis

Kristoffer Egil Bonarjee

May 16, 2012


3 nVidia Graphic Processing Units and the Compute Unified Device Architecture
3.1 Introduction
3.2 History
3.3 Hardware overview
3.3.1 Stream Multiprocessors
3.3.2 Compute capability
3.3.3 Memory hierarchy
3.3.4 Programming model
3.3.5 Software stack
3.4 Vector addition; a trivial example
3.5 Summary

4 Software basis and testbed
4.1 Introduction
4.1.1 Design
4.1.2 Work in progress
4.2 Evaluation
4.3 Summary

5 Readahead and Writeback
5.1 Introduction
5.2 Design
5.2.1 Rationale
5.2.2 Implementation
5.2.3 Hypothesis
5.3 Evaluation
5.4 Write buffering
5.5 Readahead Window
5.6 Lessons learned
5.7 Summary

6 Memory
6.1 Introduction
6.2 Design
6.2.1 Rationale
6.3 Implementation
6.3.1 Necessary groundwork
6.3.2 Marshalling implementation
6.4 Evaluation
6.5 Lessons learned
6.6 Summary

7 CUDA Streams
7.1 Introduction
7.2 Design
7.2.1 Implementation
7.2.2 Hypothesis
7.3 Evaluation
7.4 Lessons learned
7.5 Summary

8 Deblocking
8.1 Introduction
8.2 Design
8.2.1 Implementation
8.3 Evaluation
8.4 Lessons learned
8.5 Summary

9 Discussion
9.1 Introduction
9.2 State of GPU hardware
9.3 GPU Multitasking
9.4 Design emphasis for offloaded applications
9.5 Our findings in a broader scope

10 Conclusion
10.1 Summary
10.2 Further work
10.3 Conclusion

A Code Examples
A.1 CUDA Vector Addition Example

References


List of Figures

2.1 4:2:0 Sub sampling. For each 4 Luma samples, only one pair of Chroma samples are transmitted.

2.2 I-, P- and B-frames.

2.3 A selection of the available 4x4 intra-modes [1].

2.4 Half- and Quarter-pixels [1].

2.5 Octa-pixel interpolation [2].

2.6 The current motion vector E is predicted from its neighbors A, B and C.

2.7 A JPEG image showing clear block artefacts.

2.8 ZigZag pattern of a 4x4 luma block [2].

2.9 Plotted pixel values of an edge showing typical signs of blocking artefacts. As the difference in pixel value between the edges of adjacent blocks p0 and q0 is much higher than the differences between p0−p4 and q0−q4, it is likely a result of quantization [9].

2.10 Flowchart of the CAVLC coding process [3].

3.1 CPU transistor usage compared to GPU [4].

3.2 Diagram of a pre-DX10 GPU pipeline. [5] The Vertex processor was programmable from DX8, while the Fragment processor was programmable from DX9.

3.3 Fermi Stream Multiprocessor overview. [39]

3.4 Feature support by compute capability. [40] and slightly modified.

3.5 Fermi memory hierarchy: Local, global, constant and texture memory all reside in DRAM. [39], slightly modified.

3.6 Example execution grid [40].


3.7 An application built on top of the CUDA stack [40].

4.1 Modified Motion vector prediction in cuve264b [6].

5.1 Serial encoding as currently done in the encoder.

5.2 Threaded encoding as planned, where the separate threads will pipeline the input, encoding and output steps, reducing the elapsed time.

5.3 Improvement in encoding time due to readahead and writeback.

5.4 Writeback performance over various GOP levels.

5.5 Writeback performance over various QP levels.

5.6 IO buffering from application memory to storage device. Based on [41].

5.7 Writeback performance with buffer flushing.

5.8 Readahead performance under different resource constraints.

6.1 Encoder state shared between host and device.

6.2 Device memory accesses for the direct port.

6.3 Device memory accesses for the optimized port.

6.4 Performance of the two marshalling implementations.

7.1 Improvement in encoding time due to streams and pinned memory.

7.2 Improvement in encoding time due to reordering on device.

8.1 Wavefront deblock filter [7].

8.2 Limited error propagation with DFI [8].

8.3 Padding of a deblocked frame prior to referencing.

8.4 Column flatting of raster order.


List of Tables

2.1 Determining Boundary strength. Data from [9].

4.1 Hardware specifications for the machine Kennedy.

6.1 Selected profiling data from our marshalling implementations.

6.2 Instruction and data accesses of host marshalling implementations.

7.1 Time used to copy and reorder input frames with and without overlapped transfer.

7.2 Time used to re-order input frames on host and device, including transfer.

7.3 Time used to re-order input frames on host and device, excluding transfer.

8.1 Relative runtime of deblock helper functions and kernels.


List of Abbreviations

ALU Arithmetic Logic Unit

APU Accelerated Processing Unit

ASMP Asymmetric Multiprocess[ing/or]

B-slice Bi-predicted slice

Bs Boundary Strength

CABAC Context-based Adaptive Binary Arithmetic Coding

CAVLC Context Adaptive Variable Length Coding

CPU Central Processing Unit

CUDA Compute Unified Device Architecture

Cb Chroma blue color channel

Cr Chroma red color channel

DCT Discrete Cosine Transform

DFI Deblocking Filter Independency

DVB Digital Video Broadcasting

EIB Element Interconnect Bus

FIR Finite Impulse Response

FMO Flexible Macroblock Ordering

GOP Group of Pictures

GPGPU General-Purpose computing on Graphics Processing Units

GPU Graphical Processing Unit

I-slice Independent slice


MFP Macroblock Filter Partition

MVD Motion Vector Difference

MVP Motion Vector Prediction

MVp Predicted Motion vector

P-slice Predicted slice

QP Quantization Parameter

RGB Red Green Blue

SAD Sum of Absolute Differences

SAE Sum of Absolute Error

SFU Special Function Unit

SIMT Single Instruction, Multiple Threads

SMP Symmetric Multiprocess[ing/or]

SM Stream Multiprocessor


List of Code Snippets

2.1 ExpGolomb code example.

3.1 CUDA kernel call

3.2 Thread array lookup.

3.3 __syncthreads() example.

5.1 Queue header

6.1 Block Metadata.

9.1 Branch-free variable assignment

9.2 Branching variable assignment

A.1 Vector addition; a trivial CUDA example


Preface

Modern graphical processing units (GPUs) are powerful parallel processors, capable of running thousands of concurrent threads. While originally limited to graphics processing, newer generations can be used for general computing (GPGPU). Through frameworks such as nVidia Compute Unified Device Architecture (CUDA) and OpenCL, GPU programs can be written using established programming languages (with minor extensions) such as C and C++. The extensiveness of GPU deployment, low cost of entry and high performance make GPUs an attractive target for workloads formerly reserved for supercomputers or special hardware. While the programming language is similar, the hardware architecture itself is significantly different from that of a CPU. In addition, the GPU is connected through a comparably slow interconnect, the PCI Express bus. Hence, it is easy to fall into performance pitfalls if these characteristics are not taken into account.

In this thesis, we have investigated the performance pitfalls of an H.264 encoder written for nVidia GPUs. More specifically, we looked into the interaction between the host CPU and the GPU. We did not focus on optimizing GPU code, but rather on how execution and communication were handled by the CPU code. As much manual labour is required to optimize GPU code, it is easy to neglect the CPU part of accelerated applications.

Through our experiments, we have looked into multiple issues in the host application that can affect performance. By moving IO operations into separate host threads, we masked the latencies associated with reading input from secondary storage. By analyzing the state shared between the host and the device, we were able to reduce the time spent synchronizing data by only transferring actual changes.

Using CUDA streams, we further enhanced our work on input prefetching by transferring input frames to device memory in parallel with the encoding. We also experimented with concurrent kernel execution to perform preprocessing of future frames in parallel with encoding. While we only touched upon the possibilities in concurrent kernel execution, the results were promising.

Our results show that a significant improvement can be achieved by focusing optimization effort on the host part of a GPU application. To reach peak performance, the host code must be designed for low latency in job dispatching and GPU memory management. Otherwise the GPU will idle while waiting for more work. With the rapid advancement of GPU technology, this trend is likely to escalate.


Acknowledgements

I would like to thank my advisors Håkon Kvale Stensland, Pål Halvorsen and Carsten Griwodz for their valuable feedback and discussion. I would also like to thank Mei Wen and the National University of Defense Technology in China for source code access to their H.264 encoder research. This thesis would not have come to fruition without their help.

Thanks to Håvard Espeland for helping me restart the test machine after I trashed it on numerous occasions.

Thanks to my fellow students at the Simula lab and PING for providing a motivating working environment with many a great conversation and countless cups of coffee. Thanks to my father, Vernon Bonarjee, for his support and help with proofreading. Thanks to my mother-in-law, Solveig Synnøve Gjestang Søndberg, for relieving me of household chores in the final sprint of thesis work.

Finally, I would like to thank my wife Camilla and our children Maria and Victor for encouragement, motivation and precious moments of joy and relaxation.

Oslo, May 16, 2012 Kristoffer Egil Bonarjee


Chapter 1 Introduction

1.1 Background and motivation

Since the announcement of ENIAC, the first electronic general-purpose computer, in 1946 [42], there has been a great demand for ever increasing processing power. In 1965, Intel co-founder Gordon Moore predicted that the number of components, e.g. transistors, in an integrated circuit would double approximately every year [43], a rate he later revised to every two years. While Moore anticipated this development would hold for at least ten years, the revised trend still holds today, and has become known as "Moore’s law".

For decades, the advancement of processors was driven by ever increasing clock speeds. However, increasing the clock speed results in higher power requirements and heat emission. While this trend has resulted in processors with multi-gigahertz clock frequencies, it has approached the limit of sustainable power density, also known as the power wall.

To further increase the processing power of a single processor beyond the power wall, processor manufacturers have focused on putting multiple processing cores on a single die, referred to as multi-core processors. By placing multiple processor cores on a single unit, enhanced processing power can be achieved with each core running at a lower clock speed. These individual cores are then programmed in a similar fashion to multiple identical processors, known as Symmetric Multiprocessing (SMP).


While multi-core processors increase the total computational power of the processors, they do not improve performance for programs designed for single-core processors. To take advantage of the increased computing power, it is crucial that algorithms are designed to run in parallel. While eight-core processors are a commodity today, the maximum performance of an algorithm is limited by the sum of its serial parts [10]. Thus, programmers cannot exploit modern processors without focusing on parallel workloads.
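The limit imposed by the serial parts is commonly stated as Amdahl's law. The following Python sketch illustrates the bound; the function name and the 90% parallel fraction are illustrative assumptions, not measurements from this thesis.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Upper bound on speedup when only `parallel_fraction` of the
    runtime can be spread across `cores` (Amdahl's law)."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be within [0, 1]")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


# Hypothetical workload where 90% of the runtime is parallelizable.
for cores in (1, 2, 8, 1536):
    print(cores, round(amdahl_speedup(0.9, cores), 2))
```

Even with an unbounded number of cores, the speedup of this hypothetical workload stays below 10, since the serial 10% still has to run on a single core.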

SMP allows for some scalability by increasing the number of independent cores on each processor as well as the number of processors per machine. However, each additional processor or core further constrains access to shared resources such as system memory and the data bus. The symmetry of the processors also means that each core must be of identical design, which may not yield optimal usage of die space for parallel workloads.

Asymmetric Multiprocessing (ASMP) relaxes the symmetry requirement. Hence, the independent cores of an asymmetric processor can trade the versatility of SMP for greater computing power in a more limited scope. Thus, the limited die space can be used more efficiently for the intended tasks. For instance, each x86 processing core includes support for instruction flow control such as branch prediction. However, if we take careful steps when designing our algorithms, we can make certain our code does not branch. While branch prediction is crucial for the performance of a general purpose processor, we can take the necessary steps to make it unnecessary for special purpose processing cores. Thus, we free up die space that can be used more efficiently.

ASMP systems usually consist of a traditional "fat" CPU core that manages a number of simpler cores designed for high computing performance rather than versatility. The main core and the computing cores are connected through a high speed interconnect along with memory and IO interfaces. One example of such a heterogeneous ASMP architecture is the Cell Broadband Engine [11], which consists of a general purpose PowerPC core and eight computing cores called Synergistic Processing Elements, connected through the Element Interconnect Bus (EIB).


counterparts, and cannot be considered a commodity. While the first generation of Sony Playstation 3 gaming consoles could be used as a general purpose Cell computing device, the option to boot non-gaming operating systems was removed in later firmware updates [44].

Modern GPUs, on the other hand, can be found in virtually any relatively new computer. They are massively parallel processors designed to render millions of pixel values in a fraction of a second. Frameworks such as nVidia Compute Unified Device Architecture (CUDA) (see chapter 3) and Open Computing Language (OpenCL) allow supported GPUs to be used for general purpose programming. Combined with a general purpose processor such as an Intel Core i7, a modern GPU allows us to perform massively parallel computations on commodity hardware similar to ASMP processors. However, a notable limitation compared to a full-blown ASMP processor is interconnect bandwidth. While for instance the EIB in a Cell processor has a peak bandwidth of 204.8GB/s [11], a GPU is usually connected via the PCI Express bus. While the bandwidth of the PCI Express bus has doubled twice, with versions 2 [45] and 3 [46], the gap between available bandwidth and computational power is widening. Compared to the quadrupled bandwidth, the number of CUDA cores has increased from 128 [47] in the first CUDA compatible Geforce 8800 GTX (using PCI Express 1.0) to 1536 [48] in the Geforce GTX 680 using PCI Express 3. This is an increase of 12 times, without taking increased clock speed, improved memory bandwidth etc. into consideration.

In the case of our test machine (see table 4.1), the GPU is connected through a PCI Express v2 x16 port with an aggregate bandwidth in both directions combined of approximately 16GB/s [45]. Hence, GPU applications must be designed with this limitation in mind to fully utilize the hardware. This is especially the case for high-volume, data driven workloads.
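To put these figures in perspective, the arithmetic can be sketched in Python. The per-lane rate of roughly 500 MB/s in each direction for PCI Express 2.0, the 8-bit 1920x1080 4:2:0 frame, and all function names are assumptions made for illustration; the results are theoretical peaks, not measured throughput.

```python
# Assumption: PCI Express 2.0 delivers roughly 500 MB/s per lane in each direction.
PCIE2_MB_PER_LANE = 500


def pcie_peak_gb_per_s(lanes: int, mb_per_lane: int, directions: int = 2) -> float:
    """Theoretical peak bandwidth in GB/s for a PCI Express link."""
    return lanes * mb_per_lane * directions / 1000.0


def yuv420_frame_bytes(width: int, height: int) -> int:
    """Size of one 8-bit 4:2:0 frame: a full luma plane plus two quarter-size chroma planes."""
    return width * height * 3 // 2


def transfer_ms(num_bytes: int, gb_per_s: float) -> float:
    """Ideal transfer time in milliseconds, ignoring latency and protocol overhead."""
    return num_bytes / (gb_per_s * 1e9) * 1e3


aggregate = pcie_peak_gb_per_s(16, PCIE2_MB_PER_LANE)              # both directions
one_way = pcie_peak_gb_per_s(16, PCIE2_MB_PER_LANE, directions=1)  # host to device only
frame = yuv420_frame_bytes(1920, 1080)

core_growth = 1536 / 128   # Geforce 8800 GTX to Geforce GTX 680
bandwidth_growth = 2 * 2   # bandwidth doubled with PCI Express 2 and again with 3

print(f"aggregate x16 bandwidth: {aggregate} GB/s")
print(f"one 1080p frame: {frame} bytes, {transfer_ms(frame, one_way):.4f} ms one way")
print(f"cores grew {core_growth:.0f}x while bandwidth grew {bandwidth_growth}x")
```

Under these assumptions the core count has grown three times faster than the interconnect bandwidth, which is why transfers that were once negligible can come to dominate.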


1.2 Problem statement

One task requiring substantial computational power is video encoding. As the computing power of processors has increased, video standards have evolved to take advantage of it. By using advanced techniques with higher processing requirements, the same picture quality can be achieved with a lower bitrate. However, due to data dependencies, not all parts of the video encoding process run efficiently on a GPU. This can lead to the GPU idling for prolonged periods unless care is taken to design the application efficiently.

In this thesis we will investigate such performance pitfalls in a CUDA-based H.264 encoder. By cooperating with researchers from the National University of Defense Technology in China, we will work to increase GPU efficiency by analyzing and improving the on-device memory management, host-device communication and job dispatching. We will take an in-depth approach to the memory transfers to identify data that can be kept on the GPU, utilize CUDA Streams to overlap execution and transfer, and use conventional multi-threading on the host to perform input/output operations in the background while keeping the GPU busy.

1.2.1 Limitations

We limit our focus to the host-device relationship, so we will not go into topics such as device kernel algorithms or design unless specified otherwise.

1.3 Research Method

For this thesis, we will design, implement and evaluate various improvements on a working H.264 encoder. Our approach is based on the Design methodology as specified by the ACM Task Force on the Core of Computer Science [12].


1.4 Main Contributions

Through the work on this master thesis, we have designed and implemented a number of proposed improvements to the cuve264b encoder. These proposals include asynchronous IO through separate readahead and writeback threads, more efficient state synchronization by only sending relevant data, removal of redundant frame transfers, and usage of CUDA Streams to perform memory transfers and preprocessing work in parallel.

With the exception of our deblocking work, we have reduced the encoding time by 23.3% and GPU idle time by 27.3% for the tractor video sequence. Note that the reduction in idle time is probably even higher, as the CUDA Visual Profiler [49] does not currently support concurrent kernels [50]. We have also identified additional work to further reduce the GPU idle time and, consequently, the runtime. While we performed our experiments on a video encoder, our findings may apply to any GPU-offloaded application, especially for high-volume, data driven workloads.

In addition to the work covered by our experiments, we initially ported the encoder to Unix. In connection with our porting efforts, we also added getopt support and a progress bar. Additionally, we made the encoder choose the best GPU available in multi-GPU setups, so it could be debugged with the CUDA debugger.

1.5 Outline

The rest of this thesis is organized as follows. In chapter 2, we give an introduction to video coding and the H.264/MPEG4-AVC Standard. In chapter 3, we introduce GPGPU programming and the nVidia CUDA architecture, used in our evaluations. Chapter 4 introduces the cuve264b encoder, the software basis for our experiments, as well as our evaluation method and testbed specifications. In chapter 5, we start our experiments with our investigation of readahead and writeback. Chapter 6 continues with our analysis and proposals to reduce redundancy in host-device memory transfers. In chapter 7, we investigate how we can utilize CUDA Streams to extend our Readahead work to on-device memory. Chapter 8 completes our experiments with our work on a GPU-based deblocking filter. In chapter 9 we discuss our results and lessons learned. Chapter 10 concludes our thesis.


Chapter 2 Video Coding and the H.264/MPEG4-AVC standard

2.1 Introduction

H.264/MPEG4-AVC is the newest video coding standard developed by the Joint Video Team (JVT), comprising experts from the ITU Telecommunication Standardization Sector (ITU-T) Study Group 16 (VCEG) and the ISO/IEC JTC 1 SC 29 / WG 11 Moving Picture Experts Group (MPEG). The standard was ratified in 2003 as ITU-T H.264 [51] and MPEG-4 Part 10 [52]. In the rest of the thesis, we will refer to it simply as H.264.

Older standards such as ITU-T H.262/MPEG-2 [53] have been at the core of video content such as digital television (both SD and HD) and media such as DVDs. However, an increasing amount of HD content requires more bandwidth on television broadcast networks as well as on storage media. While broadcast networks can handle a limited number of MPEG-2 coded HD streams, more efficient video coding is necessary to scale. At the same time, emerging trends such as video playback on mobile devices require acceptable picture quality and robustness against transmission errors under bandwidth constraints. H.264 allows for higher video quality per bit rate than older standards and has been broadly adopted. It is used in video media such as Blu-ray, broadcasting such as the Digital Video Broadcasting (DVB) standards, as well as online streaming services like YouTube.

The H.264 standard improves the video quality by enhancing the coding efficiency. To support a broad range of applications from low powered hand held devices to digital cinemas, it has been divided into different profiles. The profiles support different features of the standard, from the simplest constrained baseline to the most advanced High 4:4:4 P. This allows implementors to utilize the standard without carrying the cost of unwanted features. For instance, the constrained baseline profile lacks many features such as B-slices (see section 2.2.4), CABAC (see section 2.5.3), and interlaced coding. While this profile cannot code video as efficiently as the more advanced profiles, the lack of features makes it easier to implement. This makes it well suited for low-margin applications such as cellular phones.

A modern video coding standard uses a multitude of approaches to achieve optimal compression performance, such as:

• Removing unnecessary data without influencing subjective video quality. By removing image properties that cannot be seen by the human eye, there is less data to compress without perceptible quality loss. This will be explained in the following subsection.

• Removing data that influences subjective video quality on a cost/benefit basis. By removing fine-grained details from the images, we can reduce the amount of data stored. However, as this impacts the video quality, it is a matter of data size versus picture quality. This step, known as quantization, will be covered in section 2.3.

• Reduce spatial redundancy across a frame. As neighboring pixels in a picture often contain similar values, we can save space by storing differences between neighbors instead of the full values. This is known as intra-frame prediction, and will be elaborated in section 2.2.2.

• Reduce temporal redundancy across series of frames. Comparable to the relationship between spatially neighboring pixels, there often exists the same kind of relationship between pixels in different frames of a video sequence. By referring to pixels of a previously encoded frame, we store the differences in pixel value instead. This is known as inter-frame prediction, and will be detailed in section 2.2.3.

• Use knowledge of the data to efficiently entropy-code the result. When all the previous steps have been completed, the nearly-encoded video will contain certain patterns in the bitstream. The last part of the encoding process is to take advantage of these patterns to further compress the bitstream. Entropy-coding will be covered in section 2.5.

In the rest of this chapter, we will explain the techniques mentioned above, and how they are used in the H.264 standard.

2.1.1 Color Spaces and human perception

The Red Green Blue Color space

Most video devices today, both input devices such as cameras and display equipment such as HDTVs, use the Red Green Blue (RGB) color space. RGB is named after its three color channels, red, green and blue, and conveys the information of each pixel by a triplet giving the amount of each color. For instance, using 8 bits per channel, the color red will be (255, 0, 0), green (0, 255, 0), blue (0, 0, 255), and cyan (0, 255, 255).

RGB is well suited for both capturing and displaying video. For instance, the pixel values can be mapped to the lighting sources in displays, such as phosphor dots in CRT monitors or sub-pixels in LCD panels. However, the three RGB channels carry more information than the human vision can absorb.

Limitations of the human vision

The human vision system is actually two distinct systems, from the cells in the retina to the processing layers in the primary visual cortex. The first one is found in all mammals. The second is a complementary system we share with other primates. The mammal system is responsible for our ability to register motion, depth and position, as well as our overall field of vision. It can distinguish acute variation of brightness, but it does not detect color. The primate system is responsible for detecting objects, such as facial recognition, and is able to detect color. However, it has a lower sensitivity to luminance and is less acute [13]. As a result, our ability to detect color is at a lower spatial resolution compared to our detection of brightness and contrast.

Knowing this, we can reduce the amount of color information in the video accordingly. We lose information, but because of the limits of human vision, the subjective video quality experienced will be the same.

Y’CbCr

To take advantage of the human vision in terms of video coding, we need a way to reduce the resolution of the color information while keeping the brightness and contrast intact. As we noted above, this is not possible with RGB, where the pixel values are given solely by their color. However, we can use the derived Y’CbCr color space. It works in an additive fashion similar to RGB, and transforming from one to the other involves few computations.

Instead of identifying a pixel value by its composition of red, green and blue, it is identified by its brightness and color difference. Color difference is the difference between brightness (Luma¹) and the RGB colors. Only Chroma blue (Cb) and Chroma red (Cr) are transmitted, as Chroma green (Cg) = 1 − (Cb + Cr). With brightness separated from color, we can treat them separately, and provide them in different resolutions to save space.
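As an illustration of how few computations the transformation involves, the following Python sketch converts one 8-bit RGB pixel to Y’CbCr. The weights are the commonly used ITU-R BT.601 full-range coefficients, which is an assumption on our part, as is the function name; the encoder discussed in this thesis may use a different variant.

```python
def rgb_to_ycbcr(r: int, g: int, b: int) -> tuple:
    """Convert one 8-bit RGB pixel to 8-bit Y'CbCr.

    Assumption: BT.601 luma weights with full-range (0-255) output
    and chroma centered on 128.
    """
    y = 0.299 * r + 0.587 * g + 0.114 * b   # luma: weighted sum of R, G and B
    cb = 128 + 0.564 * (b - y)              # blue color difference
    cr = 128 + 0.713 * (r - y)              # red color difference
    # Round and clamp each component to the valid 8-bit range.
    return tuple(max(0, min(255, round(v))) for v in (y, cb, cr))


print(rgb_to_ycbcr(255, 255, 255))  # white: full luma, neutral chroma
print(rgb_to_ycbcr(255, 0, 0))      # red: low Cb, high Cr
```

Note that gray pixels, where red, green and blue are equal, map to neutral chroma values of 128, so all of their information is carried by the luma channel.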

Chroma sub-sampling

Using a lower resolution for the chroma components is called chroma sub-sampling. The default form of sub-sampling in the H.264 standard is 4:2:0. The first number, 4, is reminiscent of the legacy NTSC and PAL standards, and represents the Luma sample rate. The second number, 2, indicates that Cb and Cr will be sampled at half the horizontal sample rate of Luma. Originally, the second and third digits denoted the horizontal subsample rate of Cb and Cr respectively, as the notation predates vertical sub-sampling. Today, however, a third digit of zero indicates half the vertical sample rate for both Cb and Cr. (For a more thorough explanation, the reader is referred to [14].)

¹ Note that the term Luma must not be confused with Luminance, as the nonlinear Luma is only an

[Figure: sample grids for Luma (Y) and Chroma (Cb and Cr) under 4:2:0 sub-sampling]

Figure 2.1: 4:2:0 Sub sampling. For each 4 Luma samples, only one pair of Chroma samples are transmitted.
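The 4:2:0 scheme can be sketched in Python as follows. Averaging each 2x2 block is one simple down-sampling filter chosen for illustration; the filter, the function names and the sample values are assumptions and not taken from the encoder discussed in this thesis.

```python
def subsample_420(plane):
    """Halve a chroma plane horizontally and vertically by averaging
    each 2x2 block (with rounding). `plane` is a list of rows and
    must have even width and height."""
    height, width = len(plane), len(plane[0])
    if height % 2 or width % 2:
        raise ValueError("plane dimensions must be even")
    return [
        [
            (plane[y][x] + plane[y][x + 1]
             + plane[y + 1][x] + plane[y + 1][x + 1] + 2) // 4
            for x in range(0, width, 2)
        ]
        for y in range(0, height, 2)
    ]


def sample_counts(width, height):
    """Return (luma samples, total Cb + Cr samples) for a 4:2:0 frame."""
    luma = width * height
    chroma = 2 * (width // 2) * (height // 2)
    return luma, chroma


cb_plane = [
    [10, 20, 30, 40],
    [10, 20, 30, 40],
    [50, 50, 60, 60],
    [50, 50, 60, 60],
]
print(subsample_420(cb_plane))
print(sample_counts(16, 16))
```

For a 16x16 macroblock, the two chroma planes together hold 128 samples against 256 luma samples, which is the "half of the luma sample size" relationship of 4:2:0.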

Using 4:2:0, we only use half of the luma sample size to store both chroma components. As shown in figure 2.1, the chroma samples are only stored for every fourth luma sample. H.264 supports richer chroma sampling as well, through the Hi422 and Hi444P profiles, which support 4:2:2 and 4:4:4 respectively, as well as higher


2.2 Frames, Slices and Macroblocks

In H.264, each picture in the video to be encoded can be divided into multiple independent units, called slices. This can make the resulting video more robust, as packet loss will only affect the slices that lose information, while keeping the others intact. It also makes it possible to process different slices in parallel [15]. The number of slices to use is left for the encoder to decide, but it is a trade-off between robustness against transmission errors and picture quality. Slicing reduces the efficiency of prediction, as redundancy over slice boundaries cannot be exploited. This reduces the picture area available for reference, and may reduce the efficiency of both spatial and temporal prediction. When macroblocks must be predicted from sub-optimal prediction blocks, the difference in pixel values, the residual, will increase. This results in an increased bitrate to achieve identical objective picture quality [16], as the larger residuals need additional storage in the bitstream. On the other hand, independent slices increase the robustness of the coded video, as errors in one slice cannot propagate across slice boundaries.

Slices can be grouped together in groups called slice groups, referred to in earlier versions of the draft standard and in [1] as Flexible Macroblock Ordering (FMO). When using only one slice group per frame, the macroblocks are (de)coded in raster order. By using multiple slice groups, the encoder is free to map each macroblock to a slice group as deemed fit. There are six predefined maps, but it is also possible to explicitly define the slice group corresponding to each macroblock. For frames with only one slice, the terms can be interchanged, but we will continue to use the term slice in the rest of the chapter. In inter-prediction, other slices refer to the same slice in another frame, not a different slice in the same frame.

2.2.1 Macroblocks

After a frame has been divided into one or more slices, each slice is further divided into macroblocks. They are non-overlapping groups of pixels similar to the pieces of a jigsaw puzzle, and they form the basic work units of the encoding process.

Introduced in the H.261 standard [54], macroblocks were always 16x16 pixels in size, with 8x8 for the chroma channels. In H.264, however, they can be divided into a number of partition sizes: 16x8, 8x16, 8x8, 8x4, 4x8 and 4x4, depending on the prediction mode. This partitioning makes it possible for the encoder to adapt the prediction block size depending on the spatial and temporal properties of the video. For instance, 16x16 blocks might be used to predict large homogeneous areas, while 4x4 sub-blocks will be applicable for heterogeneous areas with rapid motion.
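To give a sense of scale, the number of 16x16 macroblocks in a frame can be computed as in the Python sketch below. It assumes that frames whose dimensions are not multiples of 16 are padded up to the next multiple, which is common practice but not something stated above; the function name is hypothetical.

```python
def macroblock_grid(width: int, height: int, mb_size: int = 16) -> tuple:
    """Return (columns, rows) of macroblocks needed to cover a frame.

    Assumption: partial macroblocks at the right and bottom edges are
    padded to full size, so both dimensions are rounded up.
    """
    cols = -(-width // mb_size)   # ceiling division
    rows = -(-height // mb_size)
    return cols, rows


for width, height in ((1280, 720), (1920, 1080)):
    cols, rows = macroblock_grid(width, height)
    print(f"{width}x{height}: {cols} x {rows} = {cols * rows} macroblocks")
```

A 1920x1080 frame needs 68 macroblock rows, as 1080 is not a multiple of 16, giving 8160 macroblocks per frame under this assumption.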

H.264 uses two types of macroblocks depending on their prediction mode. Intracoded macroblocks reference neighboring blocks inside the current slice, thereby exploiting spatial redundancy by only storing the residual pixel differences. Instead of storing the whole block, only the difference between the block and its most similar neighbor must be transmitted. Intercoded macroblocks reference blocks in other slices, exploiting temporal redundancy. This allows us to take advantage of similar blocks in both space and time.

Slices made up of only intra-coded macroblocks are referred to as independent slices, as they do not reference any data from other slices. We will elaborate on independent slices in the following subsection.

Slices containing inter-coded macroblocks are known as predicted slices. As inter-coded macroblocks refer to similar macroblocks in previously encoded slices, a predicted slice cannot be used as a starting point for the decoding process. Predicted slices will be discussed in subsection 2.2.3.

An extension of predicted slices, Bi-predicted slices, supports predicting each block from two different reference frames, known as bi-prediction. They are only available in the extended or more advanced profiles, and are further explained in subsection 2.2.4. Lastly, the extended profile also supports two special switching slices, briefly explained in section 2.2.5.

(32)

2.2.2 Independent slices

Independent (I) slices are the foundation of the encoded video, as they are the initial reference point for the motion vectors in Predicted and Bi-predicted slices. They form random access points for the decoder to start decoding the video, as well as a natural point in the stream for fast forward/rewind.

Figure 2.2: I-, P- and B-frames.

An I-slice together with the P- and B-slices that reference it is known as a group of pictures (GOP); see figure 2.2.

H.264 uses intra-slice prediction to reduce the amount of data needed for each macroblock. If a macroblock is similar to one of its neighbors, we only need to store the residuals needed to predict the pixel values from it.

In prior standards, such as H.262/MPEG2 [53], I-slices were encoded without any prediction. Thus, they did not take advantage of the spatial redundancy in the slices. Intra slices also support a special mode called I_PCM, where the image samples are given directly, with neither prediction nor quantization and transform coding (described in detail in section 2.3).

Intra-slice prediction in H.264 is implemented as intra-coded macroblocks with a set of predefined modes. Depending on the channel to be predicted and the spatial correlations in the slice, the encoder chooses the mode resulting in the minimal residual data. For instance, it might sum the absolute differences in pixel values for each of the modes, and select the mode with the lowest sum. This measure is often referred to as the Sum of Absolute Errors (SAE) or Sum of Absolute Differences (SAD) [2].

For the luma channel, a macroblock is predicted either as a whole 16x16 block or as individual 4x4 sub-blocks. Chroma blocks, on the other hand, are always coded as 8x8. As the prediction mode for a chroma block is only signaled once for both channels, Cb and Cr always share the same prediction mode.

Figure 2.3: A selection of the available 4x4 intra-modes [1].

4x4 sub-blocks can be predicted in nine different ways, some of which are shown in figure 2.3. Eight of these represent a direction, such as mode 0 (vertical) and mode 1 (horizontal), where the arrows show how the block will be predicted. Mode 2 (DC) is a special case where all the pixels in the block are predicted from the mean value of the upper and left-hand samples. As an example, a mode 0 (vertical) prediction uses the last row of pixels from the neighboring block directly above. The residual block is then formed by calculating the difference between the pixel values in the reference row and every row in the current block. At slice boundaries, the modes crossing the boundary are disabled, and the input for the DC mode is similarly reduced.

For smooth areas, 16x16 prediction might yield less signaling overhead than sixteen 4x4 blocks combined. There are four 16x16 modes available, of which the first three are similar to the 4x4 ones, namely mode 0 (vertical), mode 1 (horizontal) and mode 2 (DC). Mode 3 (plane) uses the upper and left pixels as input to a linear plane function, resulting in a prediction block with a smooth transition between them [2].
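The 4x4 mode decision described above can be sketched as follows. This is a minimal illustration, not the encoder's actual implementation: it only covers modes 0 to 2, the function names are hypothetical, and the DC rounding `(sum + 4) // 8` assumes that all eight neighboring samples are available.

```python
def predict_4x4(mode, above, left):
    """Build a 4x4 prediction block from the four pixels above and to the left."""
    if mode == 0:   # vertical: repeat the row above in every row
        return [list(above) for _ in range(4)]
    if mode == 1:   # horizontal: repeat the left column in every column
        return [[left[y]] * 4 for y in range(4)]
    if mode == 2:   # DC: rounded mean of the eight neighboring samples
        dc = (sum(above) + sum(left) + 4) // 8
        return [[dc] * 4 for _ in range(4)]
    raise ValueError("only modes 0-2 are sketched here")


def sad(block, prediction):
    """Sum of Absolute Differences between two 4x4 blocks."""
    return sum(abs(b - p)
               for brow, prow in zip(block, prediction)
               for b, p in zip(brow, prow))


def best_intra_mode(block, above, left):
    """Pick the mode with the lowest SAD and return it with its residual block."""
    costs = {m: sad(block, predict_4x4(m, above, left)) for m in (0, 1, 2)}
    mode = min(costs, key=costs.get)
    pred = predict_4x4(mode, above, left)
    residual = [[b - p for b, p in zip(brow, prow)]
                for brow, prow in zip(block, pred)]
    return mode, residual
```

A block made of vertical stripes that continue the row above is predicted perfectly by mode 0, so its residual is all zeroes.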

The chroma blocks have similar prediction modes as the 16x16 luma blocks. However, the ordering is different: Mode 0 is DC, followed by horizontal, vertical and plane. The chosen mode for each (sub)block is signaled in the bit stream along with the prediction residuals. However, as there often is a relationship among neighboring blocks, the standard supports an implicit prediction mode called most_probable_mode. The most probable mode is the lower-numbered of the modes used by the block above and the block to the left of the current block, so if both are predicted using the same mode, that mode is the most probable mode for the current block. If a neighbor is unavailable, its mode is taken to be DC. The implicit mode selection makes it possible to signal the most probable mode by only setting the use_most_probable_mode flag. Otherwise, use_most_probable_mode is cleared, and the remaining_mode_selector variable signals the new mode.

2.2.3 Predicted slices

Similar to the spatial redundancy exploited in I-slices, there often also exists a correlation between macroblocks of different slices. Using a reference slice as a starting point, we may further reduce the space requirements by storing the pixel difference between a macroblock and a similar macroblock in another slice. To accomplish this, the H.264 standard uses motion vectors. A motion vector allows a macroblock to be predicted from any block in any slice available for reference in memory. Instead of restricting the prediction options to a fixed set of modes, a motion vector explicitly points to the coordinates of the reference block. This gives the encoder great flexibility to find the best possible residual block.

Akin to the different prediction block sizes for intra-coded macroblocks, inter-predicted blocks support different sizes. Depending on the amount, speed and extensiveness of motion in the sequence, the best prediction block size for a given macroblock can vary. Picking the best prediction block size to code the motion of a block is known as motion-compensation.

To facilitate efficient motion-compensation for both rapid and slower movement, H.264 supports macroblock partitioning. Each macroblock can be divided into partitions of either the default 16x16, two 16x8, two 8x16 or four 8x8 sub-blocks. The 8x8 partitions can each be kept as 8x8 or further divided into 8x4, 4x8 or 4x4 blocks. These sub-blocks are then motion-compensated separately, each requiring a separate motion vector in the resulting bitstream. However, each sub-block might yield a smaller residual. On the other hand, homogeneous regions can be represented by larger blocks with fewer motion vectors. Depending on the amount and speed of movement in the input video, the encoder can then find the combination of block sizes that minimizes the combined signaling of motion vectors and residuals.

Motion vector search

For each (sub)macroblock to be inter-predicted, the encoder must find a similarly sized block in a prior coded reference slice. However, as H.264 decouples coding and display order, the encoder is free to code a slice suitable for motion compensation earlier than it is actually displayed on screen. It also supports pinning slices as long-term reference slices that might be used for motion compensation far longer than display purposes would suggest.

Similar to earlier video coding standards from ITU-T and ISO/IEC, the scope of the H.264 standard only covers the bit stream syntax and the decoding process. An encoder is valid as long as it outputs a bit stream that can be decoded properly by a conformant decoder; the actual motion search is not defined in the standard. In fact, the H.264 syntax supports unrestricted motion vectors, as the boundary pixels will be repeated for vectors pointing outside the slice [17].

The extensiveness of the motion vector search depends on the use case of the encoder. For instance, a real-time encoder might use a small search area to keep deadlines, combined with multiple slices to make the stream robust. Afterward, the video can be encoded offline with a thorough full-slice motion vector search without real-time requirements.

While the procedure of motion vector search is left to the encoder implementation, the general approach is to search a grid surrounding the identical position in the reference slice. Each block with its potential motion vector in the search window is evaluated. The optimal solution is the one which gives the lowest possible residuals. For instance, an encoder might calculate the SAD over all the potential blocks and select the one yielding the lowest sum.
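The grid search outlined above can be illustrated with a minimal full-search sketch at integer-pixel accuracy. The function names, the default block size and the search range are assumptions made for the example; candidates outside the reference frame are simply skipped instead of padding the frame edges.

```python
def block_sad(cur, ref, bx, by, rx, ry, size):
    """SAD between the block at (bx, by) in cur and the block at (rx, ry) in ref."""
    return sum(abs(cur[by + y][bx + x] - ref[ry + y][rx + x])
               for y in range(size) for x in range(size))


def full_search(cur, ref, bx, by, size=4, search_range=2):
    """Return (motion_vector, sad) of the best match within the search window."""
    height, width = len(ref), len(ref[0])
    best_mv, best_cost = None, None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            rx, ry = bx + dx, by + dy
            # Skip candidates that fall outside the reference frame.
            if rx < 0 or ry < 0 or rx + size > width or ry + size > height:
                continue
            cost = block_sad(cur, ref, bx, by, rx, ry, size)
            if best_cost is None or cost < best_cost:
                best_mv, best_cost = (dx, dy), cost
    return best_mv, best_cost
```

The cost of this approach grows with the square of the search range, which is why the faster search patterns mentioned later in this section are attractive.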

[…] motion vector or prediction residual is transmitted [15]. Instead, the macroblock is predicted as a 16x16 macroblock with a default motion vector pointing to the same position (i.e. 0,0) in the first available prediction slice. Macroblocks without motion can then be transmitted by only signaling the metadata of macroblock type, while skipping the actual data.

Half- and Quarter-pixels

The motion vectors in H.264 are given with quarter-pixel accuracy. To support this within the finite resolution of a sampled video frame, the quarter-pixels (quarter-pels) must be interpolated. For the luma channel, this is done in a two-step procedure: we first interpolate half-pixels (half-pels), which are then interpolated again to form quarter-pels.

Figure 2.4: Half- and Quarter-pixels [1].

To interpolate a half-pel, a six-tap Finite Impulse Response (FIR) filter [2] is used, in which three pixels on either side are weighted to calculate the half-pel value. Figure 2.4 shows a block, a grid of pixels (in gray) with a selection of interpolated half-pels, and quarter-pels (white). To determine the value of half-pel b (shown between pixels G and H), the FIR filter would be applied as follows [2]:

$$b = \frac{E - 5F + 20G + 20H - 5I + J}{32} \tag{2.1}$$

After the half-pels have been calculated, the quarter-pels are calculated by means of an unweighted linear interpolation. Depending on position, they are interpolated either horizontally, vertically or diagonally. In figure 2.4, a would be horizontally interpolated between G and b, d vertically between G and h, while e would be interpolated diagonally between G and j.
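As a minimal sketch, the two interpolation steps can be written with integer arithmetic. The rounding offsets (`+ 16` and `+ 1`) and the clipping to the 8-bit range are assumptions added here; equation 2.1 only gives the filter weights. The function names are hypothetical.

```python
def half_pel(e, f, g, h, i, j):
    """Six-tap FIR filter with weights (1, -5, 20, 20, -5, 1) / 32."""
    value = (e - 5 * f + 20 * g + 20 * h - 5 * i + j + 16) >> 5
    return max(0, min(255, value))  # clip to the 8-bit sample range


def quarter_pel(s0, s1):
    """Unweighted linear interpolation (rounded mean) of two neighboring samples."""
    return (s0 + s1 + 1) >> 1
```

For half-pel b in figure 2.4, the arguments would be the six pixels E to J on the same row; quarter-pel a would then be `quarter_pel(G, b)`.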

Figure 2.5: Octa-pixel interpolation [2].

Due to the chroma subsampling, the resolution of the chroma channels is halved in both dimensions. To support the same motion vector accuracy as the luma channel, we must calculate octa-pixels (octa-pels) for Cb and Cr. This is done in one step, where each octa-pel to be calculated is linearly interpolated between 4 chroma pixels. Each chroma pixel is weighted according to its distance from the octa-pel. For instance, the octa-pel a in figure 2.5 is calculated by [2]:

$$a = \mathrm{round}\left(\frac{(8-d_x)(8-d_y)A + d_x(8-d_y)B + (8-d_x)d_y C + d_x d_y D}{64}\right) \tag{2.2}$$

We substitute $d_x$ with 2 and $d_y$ with 3, which gives us

$$a = \mathrm{round}\left(\frac{30A + 10B + 18C + 6D}{64}\right) \tag{2.3}$$

The higher-resolution motion vectors allow us to represent the motion between the slices more precisely. For instance, if two adjacent candidate blocks yield the fewest residuals, but each is slightly off in its direction, the higher resolution enables us to generate a prediction block representing their mean values.
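Equation 2.2 translates directly into integer arithmetic. The sketch below assumes that round() means rounding half up, implemented by adding 32 before dividing by 64; the function name and argument order are hypothetical.

```python
def chroma_octa_pel(a, b, c, d, dx, dy):
    """Interpolate between chroma pixels A, B, C and D.

    dx and dy are the horizontal and vertical offsets from A,
    given in eighths of a chroma pixel (0..7).
    """
    total = ((8 - dx) * (8 - dy) * a + dx * (8 - dy) * b
             + (8 - dx) * dy * c + dx * dy * d)
    return (total + 32) >> 6  # divide by 64 with rounding
```

With `dx = 2` and `dy = 3` the four weights become 30, 10, 18 and 6, matching equation 2.3.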

Given the large number of potential motion vectors, multiplied further by the quarter-pixel resolution, an exhaustive search for motion vectors will require much processing time. Finding more efficient motion search algorithms has been the subject of numerous research efforts. For instance, the popular open source encoder x264 supports a range of different algorithms, including Diamond search [18] and Uneven MultiHexagon search [55].

Motion vector prediction

After a motion vector search over the slice, the individual motion vectors are often highly correlated, similar to the spatial redundancy exploited in intra-coded macroblocks.

To take advantage of this, H.264 uses Motion Vector Prediction (MVP) to predict the current motion vector. A predicted motion vector (MVp) is formed from certain neighboring motion vectors, and the difference between the actual motion vector and MVp is the residual motion vector difference (MVD). As motion vector prediction is carried out as defined in the standard, only the MVD needs to be encoded in the bitstream.

To calculate the MVp for a macroblock, the median of the motion vectors of the block above, the block to the left, and the block above and to the right is calculated, as shown in figure 2.6 for the block E. If any of these are of smaller partition size than the current block, the topmost of the ones to the left, and the leftmost of the ones above, are used. If any of the blocks are missing, for instance if the current macroblock is on a slice border, the median is calculated for the remaining blocks.

Figure 2.6: The current motion vector E is predicted from its neighbors A, B and C.

A special case is partitions of size 16x8 or 8x16. In the case of 16x8, the upper partition is predicted from the block above, while the lower partition is predicted from the block to the left. Similarly, the leftmost partition of an 8x16 partitioned macroblock is predicted from the block to the left, while the rightmost partition is predicted from the block above and to the right.
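The general case of the median prediction can be sketched as below. It assumes that the median is taken separately for the horizontal and vertical components, that all three neighbors are available, and that A, B and C denote the left, upper and upper-right neighbors; the 16x8 and 8x16 special cases are not handled. The function names are hypothetical.

```python
def median3(a, b, c):
    """Median of three numbers."""
    return sorted((a, b, c))[1]


def predict_mv(mv_a, mv_b, mv_c):
    """Component-wise median of the neighboring motion vectors (x, y)."""
    return (median3(mv_a[0], mv_b[0], mv_c[0]),
            median3(mv_a[1], mv_b[1], mv_c[1]))


def mv_difference(mv, mvp):
    """The MVD that would be written to the bitstream."""
    return (mv[0] - mvp[0], mv[1] - mvp[1])
```

The decoder repeats the same prediction and adds the MVD to recover the motion vector, which is why only the MVD has to be transmitted.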

2.2.4 Bi-predicted slices

Bi-predicted slices (B-slices) are part of the Main profile and extend P-slices with the ability to reference two slices; macroblocks might use motion vectors pointing to either slice, broadening the potential to exploit temporal redundancy with two directions. In addition, B-slices support bi-prediction by using two independent motion vectors to predict each block. The weight of each prediction can be specified by the encoder. Weighted prediction is also supported in P-slices, where it can be used to better code certain special cases such as fade to black. By default, the prediction samples are evenly averaged. H.264 also supports implicit weighted prediction, whereby the weighting factor is calculated based on the temporal distance between each reference slice and the current slice [2].
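A simplified sketch of how two prediction blocks could be combined is shown below. It uses floating point weights for readability, which is an assumption made for the example and not the integer weight and offset arithmetic of the standard; the function name is hypothetical.

```python
import math


def bi_predict(block0, block1, w0=0.5, w1=0.5):
    """Blend two motion-compensated prediction blocks sample by sample.

    The default weights give the evenly averaged prediction; other weights
    illustrate explicit weighted prediction. Results are rounded half up
    and clipped to the 8-bit sample range.
    """
    return [[min(255, max(0, math.floor(w0 * p0 + w1 * p1 + 0.5)))
             for p0, p1 in zip(row0, row1)]
            for row0, row1 in zip(block0, block1)]
```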

B-slice macroblocks can also be encoded in Direct mode, in which no motion vector is transmitted. Instead, the motion vector is calculated on the fly by the decoder. A direct bi-predicted macroblock without prediction residuals is also referred to as a B_Skip macroblock, similar to the P_Skip macroblock we detailed in section 2.2.3.

2.2.5 Switching I and Switching P slices

In addition to the already mentioned slice types, H.264 also supports two special purpose slices, Switching Independent and Switching Predicted [19], as part of the extended profile. Using switching slices, it is possible to switch to another bitstream, such as the same content at a different resolution, without synchronizing on the next or previous I-slice. For instance, a video stream displayed on a mobile phone might switch to a lower resolution if 3G connectivity is lost. The standard has also been extended with the Scalable Video Coding extension to improve its support for such scenarios [20].

2.3 Transform coding and Quantization

The intra- and inter-predictions detailed in previous sections greatly reduce the redundancy in the information, but to achieve a high compression ratio, we need to actually remove some data from the residuals.

By using transform coding with its basis in Fourier theory, we may transform the residual blocks to the frequency domain. Instead of representing the residual for each pixel, a transform to the frequency domain gives us the data as a sum of coefficients to a transform-dependent continuous function. Instead of spatial pixel differences, we have transformed the values to a range of coefficients, ranging from low to high-frequency information. The first coefficient is the mean value of the transformed signal, known as the DC coefficient.

By removing high-frequency coefficients from the transformed result, we may represent the data in a more compact form with a controlled loss of information. The transform to and from the frequency domain is by itself lossless [21]. However, as the transforms often operate on real numbers (and even complex numbers for the Fourier transform), rounding the result to integer values may lead to inaccurate reconstruction.

The best-known transforms in this area are the Fourier transform, the Discrete cosine transform (DCT), the Walsh-Hadamard transform and the Karhunen-Loève transform. The DCT is not the most efficient at packing information. However, it gives the best ratio between computational cost and information packing. This property has established DCT as an international standard for transform coding [22].

In H.264, DCT is not used directly. Instead, it uses a purpose-built 4x4 integer transform that approximates the DCT while providing key properties essential to the efficiency and robustness of the transform [23]. Most notably, it gives integer results. As the DCT produces real numbers, floating point operations and subsequent rounding may produce slightly different results on different hardware, and the inverse-transformed residuals will then be incorrect. This is known as drifting. As both inter- and intra-coded slices depend on the correctness of the residuals, the inverse transform must provide exact results.

In earlier standards without intra-prediction, this was solved by periodic I-frame refreshes. However, with intra-prediction, drifting may occur and propagate within the I-slice itself. By using an integer transform without risk of drifting, the standard guarantees that the results will be reliable for reference. Another important property of the transform is that it requires less computation than DCT, as it only requires additions and binary shifts [23].
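To illustrate why the transform stays within integer arithmetic, the sketch below applies the commonly documented H.264 forward core transform matrix to a 4x4 residual block. The matrix is not given in the text above and is included here as background; the scaling that completes the DCT approximation is assumed to be folded into the quantization step and is left out. Plain matrix multiplication is used for clarity instead of the add-and-shift butterfly.

```python
# Forward core transform matrix; every entry is 1 or 2 in magnitude, so the
# multiplications can be replaced by additions and one-bit shifts.
CF = [[1, 1, 1, 1],
      [2, 1, -1, -2],
      [1, -1, -1, 1],
      [1, -2, 2, -1]]


def matmul(a, b):
    """Multiply two 4x4 integer matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def transpose(m):
    return [list(col) for col in zip(*m)]


def forward_core_transform(block):
    """Compute CF * block * CF^T using integers only."""
    return matmul(matmul(CF, block), transpose(CF))
```

A perfectly flat residual block ends up with all its energy in the DC coefficient at position (0, 0).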

While H.264 uses the described integer transform for most of the residual blocks, it uses the Walsh-Hadamard transform on the 4x4 array of luma DC coefficients in a 16x16 intra-coded macroblock, as well as on the 2x2 chroma DC coefficients of any macroblock. This second transform of DC coefficients often improves the compression of very smooth regions typically found in 16x16-mode intra-coded macroblocks and the sub-sampled chroma blocks [15].

Quantization

After transforming the residuals into the frequency domain, the next step is quantization. Depending on the desired compression level, a number of high-frequency coefficients will be removed. By removing the frequencies, less information must be stored in the bitstream, which results in very efficient compression. On the other hand, the lost data cannot be restored by the decoder, thereby degrading the output. Hence, efficient quantization does not solely depend on the compression efficiency, but also on fine-grained control over the process.

In H.264, quantization is controlled with the Quantization Parameter (QP). It can range from 0 to 51 for luma, and from 0 to 39 for the chroma channels. For every sixth step of QP, the level of quantization doubles. Compared to earlier standards, H.264 allows for better control of near-lossless quantization. For instance, a zeroed QP in H.263+ corresponds to 6 in H.264, giving more control over the information loss [23]. The fine granularity of QP allows for encoding video for a range of different scenarios, depending on the trade-off between perceived video quality and the cost of data transmission. For instance, neither the expected video quality nor the associated data rate cost will be the same for the offline playback of a Blu-ray movie as for the streaming of a live soccer match over a 3G connection.

Depending on the QP, multiplying the coefficients with the quantization table will result in coefficients clipped to zero. When the decoder decodes the slice, it will in turn invert the multiplication. However, the zeroed coefficients will not be restored, and some details in the picture will be lost. Depending on the grade of quantization, errors will be introduced in the decoded picture. Figure 2.7 shows similar quantization artefacts in a JPEG image. When the inverse block transform is performed with fewer coefficients, details will be lost. Thus, the text in the upper left corner is not readable. As the quantization is performed per block, the edges between the restored blocks will degrade. For instance, the lines in the wall look jagged, as there is not enough information in the image to restore the transition between the transformed blocks. We will return to how H.264 reduces the extensiveness of these errors when we explain the in-loop deblocking filter in section 2.4.
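The doubling for every sixth QP step can be illustrated with the quantizer step sizes commonly tabulated for H.264, where the six base values for QP 0 to 5 repeat at twice the size for each further six steps. The base values are background knowledge rather than part of the text above, and the plain rounding division is a simplification of the integer multiplication tables used by real encoders. The function names are hypothetical.

```python
# Step sizes for QP 0..5; each further six QP steps double them.
BASE_QSTEP = [0.625, 0.6875, 0.8125, 0.875, 1.0, 1.125]


def qstep(qp):
    """Quantizer step size for a luma QP in the range 0..51."""
    if not 0 <= qp <= 51:
        raise ValueError("QP must be in the range 0..51")
    return BASE_QSTEP[qp % 6] * 2 ** (qp // 6)


def quantize(coeff, qp):
    """Simplified scalar quantization: divide by the step size and round."""
    sign = -1 if coeff < 0 else 1
    return sign * int(abs(coeff) / qstep(qp) + 0.5)


def dequantize(level, qp):
    """Rescale a quantized level; coefficients clipped to zero stay zero."""
    return level * qstep(qp)
```

At QP 28 the step size is 16, so a coefficient of 3 is clipped to zero, while a coefficient of 100 is stored as level 6 and restored as 96.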

After quantization, the coefficients clipped to zero will introduce redundancy that will be exploited in the coding step explained in section 2.5. Before that, however, we will need to re-arrange the data from raster order to a pattern known as "zig-zag", as shown in figure 2.8.

Figure 2.7: A JPEG image showing clear block artefacts.

Figure 2.8: ZigZag pattern of a 4x4 luma block [2].

This will order the coefficients so that the zeroed values are grouped together. Having reordered the coefficients, we can compress them more efficiently as continuous runs of zeroes, instead of individual values scattered across the raster order.
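A minimal sketch of the reordering, assuming the standard frame-scan zig-zag order for a 4x4 block (taken to be the pattern in figure 2.8), expressed as raster indices:

```python
# Raster index (row * 4 + column) of each coefficient in zig-zag order.
ZIGZAG_4X4 = [0, 1, 4, 8, 5, 2, 3, 6, 9, 12, 13, 10, 7, 11, 14, 15]


def zigzag(block):
    """Flatten a 4x4 block of quantized coefficients into zig-zag order."""
    flat = [value for row in block for value in row]
    return [flat[i] for i in ZIGZAG_4X4]
```

For a block whose nonzero values sit in the upper left corner, the nonzero coefficients end up at the front of the list, followed by one long run of zeroes.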


2.4 Deblocking

To minimize the blocking artefacts introduced by the quantization step, H.264 mandates an in-loop deblocking filter. As such, the deblocking filter is a part of both the encoding and decoding process, as opposed to deblocking as a post-processing step done by the decoder. This ensures a level of quality, as the encoder can guarantee the quality delivered to the end user by a conforming decoder. It also works more efficiently than a post-processing filter, as it greatly reduces the propagation of blocking artefacts through motion vectors [9]. It also reduces the residual size, as the smoothing of artefacts results in closer resemblance to the original pixels.

Figure 2.9: Plotted pixel values of an edge showing typical signs of blocking artefacts. As the difference in pixel value between the edges of adjacent blocks, p₀ and q₀, is much higher than the differences within p₀ to p₄ and q₀ to q₄, it is likely a result of quantization [9].

The purpose of the deblocking filter is to evaluate the edges between the 4x4 luma transformation blocks and 2x2 chroma blocks to determine whether there is a blocking artefact between them, as opposed to any other form of edge due to actual picture content. A synthetic edge from blocking artefacts is identified by a pronounced spike in pixel values between the edge pixels that does not continue across the interior samples, as seen in figure 2.9. The sharp spike between pixel values p₀ and q₀ does not propagate into the interior pixels p₁ to p₃ and q₁ to q₃. Edges that fall into such a pattern are smoothed over by interpolation with interior pixels.

| Condition | Bs |
|---|---|
| One of the blocks is intra-coded and on a macroblock edge | 4 |
| One of the blocks is intra-coded | 3 |
| One of the blocks has coded residuals | 2 |
| Difference of block motion ≥ 1 luma sample distance | 1 |
| Motion compensation from different reference slices | 1 |
| Else | 0 |

Table 2.1: Determining Boundary strength. Data from [9].

The deblocking filter in H.264 works in a two-step process. First, each edge is given a score, known as Boundary strength (Bs). It is based on the type of macroblock and how it has been predicted, and takes a value between 0 and 4 that determines the expected amount of filtering necessary. For edges with a Bs of 0, deblocking will be disabled. Edges with a Bs between 1 and 3 might be filtered with the normal filter, while macroblock edges where one of the blocks is intra-coded get a score of 4 and receive extra strong filtering. The conditions for the assignment of Bs are given in table 2.1. Each condition is tested as ordered in the table, and the corresponding boundary strength is chosen from the first matching condition.
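The first-match evaluation of table 2.1 can be sketched as a chain of tests. The block description (a dictionary with the keys `intra`, `coded_residual`, `mv` and `ref`) is a hypothetical structure made up for this example, and motion vectors are assumed to be stored in quarter-pel units, so that a difference of 4 corresponds to one luma sample.

```python
def boundary_strength(p, q, on_macroblock_edge):
    """Return Bs for the edge between blocks p and q, following table 2.1."""
    if (p["intra"] or q["intra"]) and on_macroblock_edge:
        return 4
    if p["intra"] or q["intra"]:
        return 3
    if p["coded_residual"] or q["coded_residual"]:
        return 2
    # Motion vectors are (x, y) tuples in quarter-pel units (assumption).
    mv_diff = max(abs(p["mv"][0] - q["mv"][0]), abs(p["mv"][1] - q["mv"][1]))
    if mv_diff >= 4:
        return 1
    if p["ref"] != q["ref"]:
        return 1
    return 0
```

Because the tests are ordered from the strongest to the weakest condition, the first match directly gives the boundary strength.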

After each edge has been assigned a Bs, the pixel values for those with Bs ≥ 1 are analyzed to detect possible blocking artefacts. As the amount of blocking artefacts depends on the amount of quantization, the tested threshold values, α(Index_A) and β(Index_B), depend on the current QP. Index_A and Index_B are derived from QP, with optional influence from the encoder. The possible values for α and β are predefined in the standard, and originate from empirical data that should ensure good results for a broad range of different content. For instance, QP values close to zero will result in very low data loss, so the deblocking filter can safely be disabled.

The pixel values are then tested, and the edge is filtered if the following conditions hold:

$$|p_0 - q_0| < \alpha(\mathrm{Index}_A) \tag{2.4}$$

$$|p_1 - p_0| < \beta(\mathrm{Index}_B) \tag{2.5}$$

$$|q_1 - q_0| < \beta(\mathrm{Index}_B) \tag{2.6}$$

where β(Index_B) generally is significantly smaller than α(Index_A).

If the above conditions hold, the edge is filtered according to its Bs. For edges with 1 ≤ Bs ≤ 3, the edge pixels and up to one pixel on either side of the edge might be filtered. For Bs = 4, the edge pixels and up to two interior pixels on either side might be filtered. Thus, the filter smooths out artefacts while keeping the original image content sharp. Compared to non-filtered video, the deblocking filter reduces the bit rate by 5-10% while keeping the same quality.

2.5 Entropy coding

Before the quantized prediction residuals and associated parameters such as prediction mode or motion vectors are written to the output bitstream, they are further compressed with entropy coding. The prior steps we introduced earlier worked in either the spatial or the frequency domain of the input frame, while entropy coding takes advantage of patterns in the resulting bitstream. Instead of using natural binary coding to write the end result of the prior encoding steps, we use our knowledge of the data to assign shorter codewords to frequent data. For instance, the quantization step in section 2.3 does not reach its compression potential unless we represent the runs of zeroes more densely than just a zero word for each occurrence. H.264 uses context-adaptive entropy coding, which means that the assignment of codewords adapts to the content. Depending on recently coded macroblocks, the encoder will choose the code tables that it estimates will yield the shortest codes for the current block.

H.264 supports two entropy coding methods, either Context Adaptive Variable Length Coding (CAVLC) or Context-based Adaptive Binary Arithmetic Coding (CABAC). While CAVLC is available in all profiles, the latter is only available in the Main profile or higher. The choice of encoding is signaled via the entropy_coding_mode flag [2].

2.5.1 CAVLC

When the entropy_coding_mode flag is cleared, CAVLC is used to code the residual blocks. After a block has been reordered in a zigzag pattern, it is coded as shown in figure 2.10:

Figure 2.10: Flowchart of the CAVLC coding process [3].

First, a variable coeff_token is coded and written to the bitstream. coeff_token holds two values: the number of nonzero coefficients and how many of these are trailing ±1 values. As coeff_token only counts up to 3 trailing ones, any further ones are stored along with the other nonzero coefficients. On the other hand, if there are fewer than 3 trailing ones, we know that the first of the remaining nonzero coefficients cannot be ±1. CAVLC takes advantage of this by storing that coefficient with its magnitude decremented by 1.

The standard defines four lookup tables for coding coeff_token, and the table in current use depends on the number of nonzero coefficients stored in the neighboring blocks above and to the left of the current block. If both the upper and left blocks have been coded, the table is chosen from the mean of their values, while the value of a single neighbor is used directly if only one is available. If the block is the first to be encoded, it defaults to the first table.

The tables are constructed so that the first table most efficiently codes low values, while the second and third tables code increasingly larger values. The fourth table uses a 6-bit fixed length and is used for values not available in the other tables. Thus, the choice of lookup table for coeff_token is one of the properties that makes the coding context adaptive.

After coding coeff_token, the sign of each trailing ±1 is coded as a 1-bit trailing_one_sign_flag, following the convention of two's complement negative numbers with the bit set for minus and cleared for plus. Note that the signs are given in reverse order.

Following the special case ones, the rest of the nonzero coefficients are coded, again in reverse order. Each one is coded as a two-tuple, level_prefix and level_suffix.

The level is divided into two code words to more efficiently code the variety of coefficient values by adapting the size of level_suffix. level_suffix uses between 0 and 6 bits, and the number of bits used by the currently coded coefficient depends on the magnitude of prior coded coefficients. The number of bits used by level_suffix is another property that makes CAVLC context adaptive.

Subsequent to coding all the nonzero coefficients, the total number of zeroes between the start of the block and the highest nonzero coefficient is coded in total_zeroes. Finally, each run of zeroes before a nonzero coefficient is coded in a run_before code. As with level_suffix and trailing_one_sign_flag, this is done in reverse order. However, this is not done for the first coefficient, as its run follows from total_zeroes and the runs already coded.
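The values that CAVLC codes for one block can be derived as in the sketch below. It only extracts the syntax element values from a zig-zag ordered coefficient list; the table lookups that turn them into actual codewords, as well as the decrement of the first non-trailing level, are left out. The function name and the dictionary layout are hypothetical.

```python
def cavlc_symbols(coeffs):
    """Derive the CAVLC element values for a zig-zag ordered block."""
    nonzero = [(i, c) for i, c in enumerate(coeffs) if c != 0]
    # CAVLC works in reverse order, from the highest frequency downwards.
    positions = [i for i, _ in reversed(nonzero)]
    values = [c for _, c in reversed(nonzero)]

    # At most three consecutive +/-1 values count as trailing ones.
    trailing_ones = 0
    for c in values:
        if abs(c) != 1 or trailing_ones == 3:
            break
        trailing_ones += 1

    # Zeroes located before the last nonzero coefficient.
    total_zeroes = positions[0] + 1 - len(nonzero) if nonzero else 0

    # Run of zeroes before each coefficient; the run of the first
    # coefficient is implied, and coding stops when no zeroes are left.
    run_before = []
    zeros_left = total_zeroes
    for cur, nxt in zip(positions, positions[1:]):
        if zeros_left == 0:
            break
        run = cur - nxt - 1
        run_before.append(run)
        zeros_left -= run

    return {
        "total_coeffs": len(nonzero),
        "trailing_ones": trailing_ones,
        "trailing_one_sign_flags": [int(c < 0) for c in values[:trailing_ones]],
        "levels": values[trailing_ones:],
        "total_zeroes": total_zeroes,
        "run_before": run_before,
    }
```

For the coefficient list 0, 3, 0, 1, −1, −1, 0, 1 followed by zeroes, there are four ±1 values at the end, but only three of them are counted as trailing ones; the fourth is coded as an ordinary level.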

2.5.2 Exponential Golomb codes

While CAVLC is used for residual block data, another approach, called Exponential Golomb (Exp-Golomb) codes, is used for other information such as headers, macroblock type and motion vector difference. In contrast to the tailored approach of CAVLC, Exp-Golomb coding is a general scheme that follows a fixed pattern. Each codeword consists of M leading zeroes separated from M bits of information (the INFO field) by a 1, except for the first codeword, which is coded as simply 1. The second codeword is 010, the third 011, the fourth 00100 and so forth. Hence, the smaller the code number, the shorter the codeword will be.

For a given code number n, the corresponding Exp Golomb code can easily be calcu-lated as shown in listing 2.1 (which prints the code as text for readability).

```python
def ExpGolomb_code(code_num):
    """Return the Exp-Golomb codeword for code_num as a string of '0' and '1'."""
    if code_num == 0:
        return '1'
    # M = floor(log2(code_num + 1)), computed exactly with integer arithmetic
    M = (code_num + 1).bit_length() - 1
    INFO = bin(code_num + 1 - 2**M)[2:]  # skip the '0b' prefix
    pad = M - len(INFO)  # number of bits to pad the INFO field
    return M * '0' + '1' + pad * '0' + INFO
```

Listing 2.1: ExpGolomb code example.

Each codeword is decoded by counting the leading zeroes until a 1 is read, and then reading the same number of bits as the INFO field. The code number is then found by calculating

$$\mathrm{code\_num} = 2^M + \mathrm{INFO} - 1 \tag{2.7}$$
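The decoding procedure can be sketched in the same text-based style as listing 2.1. The function name and the convention of returning the unread remainder of the bit string are choices made for this example.

```python
def exp_golomb_decode(bits):
    """Decode one codeword from the start of a string of '0' and '1'.

    Returns the code number and the remaining, unread part of the string.
    """
    m = 0
    while m < len(bits) and bits[m] == '0':  # count the leading zeroes
        m += 1
    if len(bits) < 2 * m + 1:
        raise ValueError("truncated codeword")
    info_bits = bits[m + 1:2 * m + 1]        # the M bits after the separating 1
    info = int(info_bits, 2) if m else 0
    return 2 ** m + info - 1, bits[2 * m + 1:]
```

Since the number of leading zeroes gives the length of the codeword, consecutive codewords can be read from a stream without any separator between them.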

As Exp-Golomb coding is a general coding scheme, its efficiency lies in the lookup tables matching code words with actual data. To achieve the best compression, the shorter code words must be used for the most frequent data. As Exp-Golomb coding is used for a range of parameters, four different types of mappings are used:

• me, mapped symbols, are coding tables defined in the standard. They are used for, among others, inter macroblock types, and are specifically crafted to map the most frequent values to the shortest code numbers.

• ue, unsigned direct mapping, where the code number is directly mapped to the value. This is used for, inter alia, the reference frame index used in inter-prediction.

• te, truncated direct mapping, is a version of ue where short codewords are truncated. If the value to be coded cannot be greater than 1, it is coded as a single bit b, where b = !code_num [24].

• se, signed mapping, is a mapping interleaving positive and negative numbers. A positive number p is mapped as 2|p| −1, while a negative or zero number n is mapped as 2|n|.
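The signed mapping in the list above can be sketched as a pair of functions; the names are hypothetical. For the unsigned mapping ue no conversion is needed, as the code number is the value itself.

```python
def se_to_code_num(value):
    """Signed mapping: positive p maps to 2|p| - 1, zero or negative n to 2|n|."""
    return 2 * value - 1 if value > 0 else -2 * value


def code_num_to_se(code_num):
    """Inverse mapping: odd code numbers are positive, even ones zero or negative."""
    magnitude = (code_num + 1) // 2
    return magnitude if code_num % 2 == 1 else -magnitude
```

Values close to zero, which are the most frequent for parameters such as motion vector differences, thereby receive the smallest code numbers and the shortest codewords.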

2.5.3 CABAC

When the entropy_coding_mode flag is set, CABAC is used. It uses arithmetic coding for both residual data and parameters, and achieves a higher compression ratio at the price of higher computational cost. In test sequences for typical broadcast scenarios, CABAC has improved the compression by 9-14% on average [25]. The process of encoding with CABAC is threefold:

1. First, the syntax element to be encoded must go through binarization, by which means that a non-binary value must be converted to a binary sequence, as the arithmetic coder only works with binary values. The resulting binary sequence is referred to as a bin string, and each binary digit in the string as a bin. If the value to be coded already has a binary representation, this step can naturally be omitted.

The binarization step represents the first part of the encoding process, as the binarization scheme employed by CABAC assigns shorter codes to the most frequent values. It uses four basic types, unary codes, truncated unary codes, k'th order Exp-Golomb codes and fixed length codes, and combines these types to form binarization schemes depending on the parameter to be coded. In addition, there are five specifically crafted binary trees constructed for the binarization of macroblock and sub-block types. The binarization step also makes the arithmetic coder significantly less complex, as each value is either 0 or 1.

2. The next step is context modeling, in which a context model, i.e. a probability distribution, is assigned to one or more bins of each bin string. The selection of context model depends on the type of syntax element to be encoded, the current slice type, and statistics from recently coded values. The choice of context model for the current bin is influenced by neighboring values above and to the left. CABAC uses nearly 400 context models for all the syntax elements to be coded. CABAC also includes a simplified mode where this step is skipped for syntax values with a near-uniform probability distribution.

The probability states in the context models are initialized for each slice based on the slice QP, as the amount of quantization has a direct impact on the occurrence of various syntax values. For each encoded bin, the probability estimates are updated to adapt to the data. As the context model gains more knowledge of the data, recent observations are given more weight, because the frequency counts are scaled down after a certain threshold.

3. Following the selection of context model, both the bin to be coded and its context model are passed on to the arithmetic coder, which encodes the bin according to the model. The frequency count of the corresponding bin in the context model is then updated. The model continues to adapt based on the encoded data until a new slice is to be encoded and the models are reset.
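
As a rough illustration of the first two steps, the sketch below shows unary and truncated unary binarization together with a toy adaptive context model. The model keeps plain frequency counts that are halved once their sum passes a threshold, mirroring the description above. The standard itself specifies a table-driven probability state machine, so this is a deliberate simplification. All names, and the threshold of 64, are assumptions made for the example.

```python
import math


def unary(value: int) -> str:
    """Unary bin string: `value` ones followed by a terminating zero."""
    return "1" * value + "0"


def truncated_unary(value: int, c_max: int) -> str:
    """Like unary, but the terminating zero is dropped when value == c_max."""
    return "1" * value if value == c_max else unary(value)


class AdaptiveBinModel:
    """Toy context model: frequency counts for bin values 0 and 1 (assumed design)."""

    def __init__(self, limit: int = 64):
        self.counts = [1, 1]   # start from a uniform estimate
        self.limit = limit     # hypothetical rescaling threshold

    def probability(self, bin_value: int) -> float:
        return self.counts[bin_value] / sum(self.counts)

    def update(self, bin_value: int) -> None:
        self.counts[bin_value] += 1
        if sum(self.counts) > self.limit:
            # Halving the counts gives recent observations more weight.
            self.counts = [max(1, c // 2) for c in self.counts]

    def ideal_cost(self, bin_value: int) -> float:
        """Bits an ideal arithmetic coder would spend on this bin."""
        return -math.log2(self.probability(bin_value))


if __name__ == "__main__":
    model = AdaptiveBinModel()
    total_bits = 0.0
    for bin_char in unary(5) * 20:       # a skewed stream of bins
        bin_value = int(bin_char)
        total_bits += model.ideal_cost(bin_value)
        model.update(bin_value)
    print(f"{total_bits:.1f} bits for {len(unary(5)) * 20} bins")
```

Because the model adapts towards the skewed input, the estimated cost per bin drops below one bit, which is the gain the arithmetic coder exploits.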

2.6 Related work

Many of the features that make H.264 efficient depend on exploiting the relationship and similarities between neighboring macroblocks. A prime example of this is the motion vector prediction illustrated in figure 2.6. To efficiently compress the motion vector E, it is stored as a residual based on the median of its neighbors A, B and C. In terms of data dependencies, this means that E cannot be predicted until A, B and C have been processed. While this allows the video to be compressed more efficiently, it also complicates parallelism, as the interdependent macroblocks cannot be processed in parallel by default. However, research shows that various steps of the encoding process can be parallelized by novel algorithms or by relaxing requirements while minimizing the associated drawbacks.
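
The dependency on the neighbors can be made concrete with a small sketch of median prediction. It assumes that motion vectors are (x, y) integer pairs and that the median is taken per component. The function names are hypothetical, and special cases such as unavailable neighbors are ignored.

```python
def median_mv(a, b, c):
    """Component-wise median of the three neighboring motion vectors."""
    return tuple(sorted(components)[1] for components in zip(a, b, c))


def mv_residual(e, a, b, c):
    """Residual stored for E: its difference from the median prediction."""
    pred_x, pred_y = median_mv(a, b, c)
    return (e[0] - pred_x, e[1] - pred_y)


if __name__ == "__main__":
    A, B, C = (1, 5), (3, 2), (2, 9)
    E = (3, 4)
    print(median_mv(A, B, C), mv_residual(E, A, B, C))
```

Since `mv_residual` needs the final vectors of A, B and C as input, E cannot be handled before all three neighbors are done, which is exactly the ordering constraint that limits parallel processing.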

Parallelizing of H.264 can be broadly categorized as either strategies to parallelize the overall encoding workflow, or parallelized solutions to specific parts of the encoding process. Of the overall strategies, there are two independent methods; slice-based and I-frame/GOP-based.

The slice-based approach takes advantage of slices (see section 2.2) to divide each frame into self-contained slices that can be processed in parallel. While slices add robustness by being independently decodable, the same independence reduces the efficiency of prediction. Measurements by Yen-Kuang Chen et al. have shown that using 9 slices increases the bitrate by 15-20% to maintain the same picture quality [16], with supporting findings by A. Rodríguez et al. [26].

One advantage of the slice-based approach is that it supports real-time applications, as it does not add extra latency or dependence on buffered input. Examples of encoders using the slice-based approach are the cuve264b encoder we introduce in chapter 4 and its origin, the Streaming HD H.264 Encoder [6]. Slice-based parallelism was also the original threading model for the free software x264 encoder [56].

The GOP-based approach uses the interval between I-frames to encode different GOPs in parallel [26]. By predefining each GOP, the video can be divided into working units handled by different threads. An important drawback of the GOP-based strategy is that it is unsuitable for real-time applications, as it depends on enough consecutive frames being available to form a working unit. It also adds latency, as each GOP in itself is encoded sequentially.

As the two approaches work independently of each other, they can also be combined to form hierarchical parallelization. By further dividing each frame of a GOP into slices, the work within each GOP can also be carried out in parallel [26].
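
The two partitioning strategies can be sketched as follows. This is a minimal illustration under stated assumptions: frames are represented by plain integers, `encode_gop` is a hypothetical placeholder that only labels frames instead of encoding them, and a thread pool stands in for whatever workers a real encoder would use.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_gops(frames, gop_size):
    """GOP-based partitioning: consecutive runs of gop_size frames."""
    return [frames[i:i + gop_size] for i in range(0, len(frames), gop_size)]


def split_into_slices(mb_rows, num_slices):
    """Slice-based partitioning: contiguous, near-equal ranges of macroblock rows."""
    base, extra = divmod(mb_rows, num_slices)
    slices, start = [], 0
    for i in range(num_slices):
        rows = base + (1 if i < extra else 0)
        slices.append(range(start, start + rows))
        start += rows
    return slices


def encode_gop(gop):
    """Placeholder encoder: frames inside one GOP are handled sequentially."""
    return [("I" if i == 0 else "P", frame) for i, frame in enumerate(gop)]


def encode_parallel(frames, gop_size, workers=4):
    """Encode independent GOPs concurrently; the result order follows the input."""
    gops = split_into_gops(frames, gop_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(encode_gop, gops))


if __name__ == "__main__":
    print(encode_parallel(list(range(5)), gop_size=2))
    print([len(s) for s in split_into_slices(68, 9)])
```

The sketch also shows why the GOP-based strategy needs buffered input: `split_into_gops` cannot hand out a working unit before all of its frames are available, whereas `split_into_slices` only needs the current frame.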

In addition to the approaches mentioned above, there also exist different strategies to parallelize certain parts of the encoding process. However, the algorithmic details of such designs are outside the scope of this thesis.

2.7 Summary

In this chapter, we introduced the H.264 video coding standard, and gave an overview of its main means of achieving effective video compression. We also laid out related work in parallelization of video encoding. In the next chapter, we will introduce GPUs in general, and the nVidia CUDA framework in particular.
