A **decoder** test chip was implemented in [4] with a core size of 1.77 mm² in 40 nm CMOS, comprising 715K logic gates and 124 KB of on-chip SRAM. Fig. 22 shows the micrograph of the test chip. It is compliant with **HEVC** Test Model (HM) 4.0, and the supported decoding tools from **HEVC** Working Draft (WD) 4 are listed in Table 14 along with the main specs. The main differences from the final version of **HEVC** are that SAO is absent and Context-Adaptive Variable Length Coding (CAVLC) is used in place of CABAC in the Entropy **Decoder**. The chip achieves a decoding throughput of 249 Mpixels/s for 4K Ultra HD video at 200 MHz, with the target DDR3 SDRAM operating at 400 MHz. The core power was measured for six different configurations, as shown in Fig. 23. The average core power consumption for 4K Ultra HD decoding at 30 fps is 76 mW at 0.9 V, which corresponds to 0.31 nJ/pixel. The logic and SRAM breakdown of the chip is shown in Fig. 24. As in H.264/AVC decoders, prediction has the most significant resource utilization. However, inverse transform is now also significant due to the larger transform units, while the deblocking filter is relatively small due to simplifications in the standard. The power breakdown from post-layout power simulations with a bi-prediction bitstream is shown in Fig. 25. The MC cache takes up a significant portion of the total power; however, the DRAM power saving due to the cache is about six times the cache's own power consumption.


A Deeply Pipelined CABAC **Decoder** for **HEVC** Supporting Level 6.2 High-tier Applications
Yu-Hsin Chen, Student Member, IEEE, and Vivienne Sze, Member, IEEE
Abstract—High Efficiency Video Coding (**HEVC**) is the latest video coding standard; it specifies video resolutions up to 8K Ultra-HD (UHD) at 120 fps to support the next decade of video applications. This results in high-throughput requirements for the context adaptive binary arithmetic coding (CABAC) entropy **decoder**, which was already a well-known bottleneck in H.264/AVC. To address the throughput challenges, several modifications were made to CABAC during the standardization of **HEVC**. This work leverages these improvements in the design of a high-throughput **HEVC** CABAC **decoder**. It also supports the high-level parallel processing tools introduced by **HEVC**, including tile and wavefront parallel processing. The proposed design uses a deeply pipelined **architecture** to achieve a high clock rate. Additional techniques such as state prefetch logic, a latch-based context memory, and separate finite state machines are applied to minimize stall cycles, while multi-bypass-bin decoding is used to further increase the throughput. The design is implemented in an IBM 45 nm SOI process. After place-and-route, its operating frequency reaches 1.6 GHz, and the corresponding throughput reaches up to 1696 and 2314 Mbin/s under common and theoretical worst-case test conditions, respectively. The results show that the design is sufficient to decode, in real time, high-tier video bitstreams at level 6.2 (8K UHD at 120 fps), or main-tier bitstreams at level 5.1 (4K UHD at 60 fps) for applications requiring sub-frame latency, such as video conferencing.


quarter) accurate variable block size motion estimation is applied in both H.264/AVC and **HEVC**. The H.264/AVC standard uses a six-tap finite impulse response (FIR) luma filter at half-pixel positions followed by a linear interpolation at quarter-pixel positions; chroma samples are computed by the weighted interpolation of the four closest integer-pixel samples. In the **HEVC** standard, three different FIR filters (one eight-tap for half-pixel positions and two seven-tap for quarter-pixel positions) are used for luma interpolation, and chroma samples are computed using four-tap filters. Sub-pixel interpolation is one of the most computationally intensive parts of the **HEVC** video encoder and decoder: in the high-efficiency and low-complexity configurations of the **HEVC** **decoder**, it accounts for 37% and 50% of the **decoder** complexity on average, respectively [4]. Moreover, compared with the six-tap filters used in the H.264/AVC standard, the seven-tap and eight-tap filters cost more area in **hardware** implementation and occupy 37~50% of the total complexity for their DRAM access and filtering. Therefore, it is necessary to design a dedicated **hardware**
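The luma filtering described above can be sketched in a few lines. The tap values below are the half-pel (8-tap) and one of the quarter-pel (7-tap) luma filters as I recall them from the HEVC standard; treat them as assumptions to be checked against the spec:

```python
# Assumed HEVC luma interpolation filter taps (both sets sum to 64)
HALF = (-1, 4, -11, 40, 40, -11, 4, -1)   # 8-tap, half-pixel position
QUARTER = (-1, 4, -10, 58, 17, -5, 1)     # 7-tap, quarter-pixel position

def fir(samples, taps):
    """Interpolate one sub-pixel value from neighbouring integer samples."""
    acc = sum(c * s for c, s in zip(taps, samples))
    return (acc + 32) >> 6  # taps sum to 64: normalise by 1/64 with rounding
```

In a flat region the filter reproduces the input value, which is a quick sanity check on the taps.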


In this thesis, we proposed a low-complexity SPME technique for the **HEVC** encoder. The proposed technique reduces the amount of computation significantly with a slight decrease in PSNR. We designed and implemented a high-performance **HEVC** SPME **hardware** implementing the proposed low-complexity technique. We also designed and implemented an **HEVC** fractional interpolation **hardware** using memory-based constant multiplication for all PU sizes, for both the **HEVC** encoder and **decoder**; it implements multiplications with constant coefficients using a memory-based technique. We proposed three different high-performance FVC 2D transform **hardware** designs for 4x4 and 8x8 TU sizes: the first two use adders and shifters to implement the FVC transform algorithm, while the third uses DSP blocks in a Xilinx Virtex 6 FPGA. The proposed **hardware** is verified to work correctly on an FPGA board.


The key idea of SAO is to reduce sample distortion by first classifying reconstructed samples into different categories, obtaining an offset for each category, and then adding the offset to each sample of the category. The offset of each category is calculated at the encoder and explicitly signaled to the **decoder** so that sample distortion is reduced effectively, while the classification of each sample is performed at both the encoder and the **decoder** so that significant side information is saved.
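The classify-then-offset idea can be illustrated with the edge-offset variant of SAO. The four category conditions below follow my understanding of the HEVC edge-offset classes (local minimum, two corner cases, local maximum); the function names are illustrative, not from any codec source:

```python
def sao_edge_category(left, cur, right):
    # Classify a sample by comparing it with its two neighbours along
    # the chosen direction (assumed HEVC-style edge-offset classes).
    if cur < left and cur < right:
        return 1  # local minimum
    if (cur < left and cur == right) or (cur == left and cur < right):
        return 2  # concave corner
    if (cur > left and cur == right) or (cur == left and cur > right):
        return 3  # convex corner
    if cur > left and cur > right:
        return 4  # local maximum
    return 0      # flat or monotone: no offset applied

def sao_apply(samples, offsets):
    # offsets[k-1] is the encoder-signalled offset for category k
    out = list(samples)
    for i in range(1, len(samples) - 1):
        k = sao_edge_category(samples[i - 1], samples[i], samples[i + 1])
        if k:
            out[i] = samples[i] + offsets[k - 1]
    return out
```

Only the four offsets travel in the bitstream; the per-sample classification is recomputed at the decoder, which is exactly the side-information saving the paragraph describes.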

Abstract—Low-density parity-check (LDPC) codes and convolutional Turbo codes are two of the most powerful error correcting codes that are widely used in modern communication systems. In a multi-mode baseband receiver, both LDPC and Turbo decoders may be required. However, the different decoding approaches for LDPC and Turbo codes usually lead to different **hardware** architectures. In this paper we propose a unified message passing algorithm for LDPC and Turbo codes and introduce a flexible soft-input soft-output (SISO) module to handle LDPC/Turbo decoding. We employ the trellis-based maximum a posteriori (MAP) algorithm as a bridge between LDPC and Turbo code decoding. We view the LDPC code as a concatenation of n super-codes, where each super-code has a simpler trellis structure so that the MAP algorithm can easily be applied to it. We propose a flexible functional unit (FFU) for MAP processing of LDPC and Turbo codes with a low **hardware** overhead (about 15% area and timing overhead). Based on the FFU, we propose an area-efficient flexible SISO **decoder** **architecture** to support LDPC/Turbo decoding. Multiple such SISO modules can be embedded into a parallel **decoder** for higher decoding throughput. As a case study, a flexible LDPC/Turbo **decoder** has been synthesized on a TSMC 90 nm CMOS technology with a core area of 3.2 mm². The **decoder** can support IEEE 802.16e LDPC codes, IEEE 802.11n LDPC codes, and 3GPP LTE


Abstract: This paper presents an efficient high-level synthesis (HLS) **hardware** design that implements Inverse Quantization and Transform (IQ/IT) for a High Efficiency Video Coding (**HEVC**) **decoder**. Using the Xilinx Vivado HLS tool, different directives are applied to the IQ/IT C code to select the **hardware** **architecture** optimized in terms of area and clock cycles. This **architecture** is implemented in a SW/HW context for verification: it is connected to an ARM Cortex-A9 processor using an AXI stream interface and integrated on a Xilinx Zynq ZC702 platform. The experimental results show that the SW/HW design can only decode 240p@15fps, with a gain of 8% in throughput and 74% in power consumption compared to the SW implementation.

and rate estimation in a fully parallel manner. The proposed intra encoder consists of two parts: efficient **HEVC** algorithm adaptations and a highly parallel **hardware** **architecture** design. The former aims to reduce the computational complexity at the algorithm level, while the latter maximizes the potential of parallelism to improve the overall throughput of the intra encoder. The proposed intra encoder supports all CU/PU/TU sizes and 35 prediction modes. Compared with HM-15.0, the proposed algorithm adaptations lead to a 27% computation reduction, with average BD-Rate and BD-PSNR losses of 4.39% and 0.21 dB, respectively. To address the bottleneck of data/timing dependency, a fully parallel intra encoder **architecture** utilizing 4-parallelism in intra prediction is proposed. Intra prediction of four different-size PUs, from 4×4 to 32×32, is performed simultaneously in four prediction engines (PEs) to greatly improve prediction throughput. Highly pipelined computational schemes are designed and employed in each PE to maximize RDO throughput. Moreover, the proposed high-throughput table-based CABAC rate estimator of chapter 3 is incorporated into the proposed intra encoder to further increase RDO performance. Experimental results show the proposed intra encoder is capable of real-time video compression for 4K video at 30 fps.


With new wireless communication standards and new MIMO decoding algorithms emerging every few years, existing systems need to be redesigned and upgraded not only to meet the newly defined standards, but also to allow integration of multiple standards onto the same platform and to improve performance via more advanced decoding algorithms. This fact serves as the main motivation for this solution. A programmable **hardware** solution focused on the unique MIMO decoding operations of a MIMO system can help drive down nonrecurring engineering costs, can facilitate system upgrades to take advantage of emerging algorithms, and can help minimize **hardware** duplication in systems-on-a-chip that support multiple standards.

computationally intensive parts of the High Efficiency Video Coding (**HEVC**) video encoder and **decoder**. In this paper, an **HEVC** fractional interpolation **hardware** using memory-based constant multiplication is proposed. The proposed **hardware** implements multiplications with constant coefficients using a memory-based constant multiplication technique: pre-computed products of an input pixel with multiple constant coefficients are stored in memory. Several optimizations are proposed to reduce the memory size. The proposed **HEVC** fractional interpolation **hardware** can, in the worst case, process 35 quad full HD (3840x2160) video frames per second, and it consumes up to 31% less energy than the original **HEVC** fractional interpolation **hardware**.
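The memory-based idea can be sketched as follows: for each filter coefficient, pre-compute its product with every possible 8-bit sample value and store the results in a table, so filtering reduces to lookups and additions. The tap values are assumed to be the HEVC half-pel luma filter; the table layout is illustrative, not the paper's actual memory organization:

```python
# Assumed HEVC half-pel luma filter taps (sum to 64)
COEFFS = (-1, 4, -11, 40, 40, -11, 4, -1)

# One 256-entry product table per coefficient, indexed by the pixel value
TABLES = [[c * p for p in range(256)] for c in COEFFS]

def interp_half(pixels):
    # pixels: the 8 neighbouring integer-position luma samples (0..255)
    acc = sum(TABLES[k][pixels[k]] for k in range(8))
    return (acc + 32) >> 6  # taps sum to 64: normalise with rounding
```

The repeated tap magnitudes (|−1|, 4, 11, 40 each appear twice) suggest one way tables could be shared to cut memory, which may be the kind of size optimization the paper alludes to.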

Multiprocessor System on Chip (MPSoC) technology presents an interesting solution for reducing the computational time of complex applications such as multimedia applications. Implementing the new High Efficiency Video Coding (**HEVC**/H.265) codec on an MPSoC **architecture** is therefore an interesting research direction that can reduce its algorithmic complexity and meet real-time constraints. The implementation consists of a set of steps that compose the co-design flow of an embedded system design process. One of the first and key steps of a co-design flow is the modeling phase, which allows designers to make the best architectural choices in order to meet user requirements and platform constraints. Multimedia applications such as the **HEVC** **decoder** are complex applications that demand increasing degrees of agility and flexibility, and they are usually modeled with dataflow techniques. Several extensions of dataflow models of computation, with various scheduling techniques, have been proposed to support dynamic behavior changes while preserving static analyzability. In this paper, the **HEVC**/H.265 video **decoder** is modeled with FSM-based SADF in order to solve the placement and scheduling problems of this application on an embedded **architecture**. In the modeling step, a high-level performance analysis is performed to find an optimal balance between decoding efficiency and implementation cost, thereby reducing the complexity of the system. The case study uses the **HEVC**/H.265 **decoder** running on the Xilinx Zedboard platform, which offers a real experimentation environment.


structure and multiplierless implementation were adopted to save **hardware** cost. The work presented a transform **architecture** that uses the canonical signed digit (CSD) representation and the common sub-expression elimination technique to perform multiplication with shift-add operations. Based on these optimizations, the transform **architecture** is greatly simplified for practical application. However, with the increasing adoption of high-definition (HD) and ultra-HD video coding, higher processing capacity is required of codecs. Thus, all modules in a video codec, including the transform, need to be further improved for real-time coding with low complexity.
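The shift-add substitution behind CSD is easy to show on a toy constant. In plain binary, 7 = 4 + 2 + 1 needs two additions; its CSD form 7 = 8 − 1 needs a single subtraction, which is the adder saving the paragraph refers to (the constant here is illustrative, not one of the codec's actual coefficients):

```python
def mul7_binary(x):
    # plain binary decomposition: 7 = 4 + 2 + 1 -> two adders
    return (x << 2) + (x << 1) + x

def mul7_csd(x):
    # canonical signed digit: 7 = 8 - 1 -> a single subtractor
    return (x << 3) - x
```

Common sub-expression elimination goes one step further: when several coefficients share a bit pattern (e.g. both contain `1001`), the corresponding shift-add term is computed once and reused across all of them.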

efficiency compared with the new High Efficiency Video Coding (**HEVC**) standard. On the other hand, 3D **HEVC**-based techniques have a high coding efficiency, but are not supported by H.264/AVC decoders. Therefore, **HEVC**-based systems cannot immediately be incorporated into the network without the high cost of upgrading the existing network infrastructure (such as encoders, streaming servers, transcoders, etc.) and the **decoder** installed base. In order to enable a system which offers 3D functionality, a low overall bit rate, and compatibility with currently existing H.264/AVC-based systems, a multiview H.264/AVC and **HEVC** hybrid **architecture** was proposed in the context of 3D applications and standardized in [23]. The standardization of this hybrid **architecture** was aligned with the **HEVC** extensions by MPEG. The **architecture** is hybrid in the sense that the base view and the other views apply different encoding standards: H.264/AVC encoding for the base view is combined with **HEVC** encoding for the other views. This **architecture** reduces the bandwidth by exploiting redundancy with the base view stream (which is decodable by already existing systems), while the functionality of those systems is maintained in the mid-term. Note that depth maps are not used for the purposes of this paper: since the aim is to maintain interoperability, a device that cannot decode the **HEVC** views will very likely not be able to decode the depth maps either, as H.264/AVC did not include a specification for texture views plus depth maps [15].

An asynchronous pipeline is faster because the completion of every stage is determined individually, instead of by the worst-case delay of the slowest stage under a global clock. An asynchronous pipeline exhibits two different types of stall, both of which are governed by the clock in the synchronous version. The first occurs when the BMU result moves from stage 2 to stage 3: the stage-2 **hardware** can accept new data, but the ACS has yet to complete its function in stage 1. This period is called starvation, since the data is not available and the **hardware** has to wait for it. The second type of stall occurs when the SMU starts to move from stage 1 to stage 2, but stage 2 is not ready to accept new data because the BMU is still processing. Blocking then occurs, since the data is readily available but has to wait for the **hardware** to become free. When few data elements are present in the pipeline, starvation occurs and throughput is low; when many data elements are present, blocking occurs and latency is high. Hence a balanced pipeline will have low latency and high throughput.

This paper presents the memory requirement and the corresponding **architecture**. We have presented a reorganized decode decision engine with look-ahead ctxIdx calculation logic to improve performance. With this optimal memory requirement, the data required for the next update is stored in a cache in order to improve the speed of operation. This method increases processing speed by 14 to 22% and reduces memory size by 50%. It demonstrates the benefits of accounting for implementation cost when designing video coding algorithms. We recommend that this approach be extended to the rest of the video codec to maximize processing speed and minimize area cost, while


but, aside from refractory-period suppression of positive movement classifications, each movement decision is independent of any other classification. Additionally, the computations performed in each classifier within a single movement period are independent of one another. It was decided that this independence should be exploited to create a **hardware** **architecture** in which all classifiers and multipliers are implemented in parallel. This fully parallel approach closely mimics the theoretical algorithm structure, which is advantageous because it is easy to understand and to compare with both the theoretical behavior and the software implementation. Each **hardware** component can be easily compared with its theoretical counterpart, allowing for easy verification of proper functionality. Additionally, a fully parallel implementation should have a short critical-path delay, making the 50 Hz target clock rate easily achievable. Since the products of intermediate computations do not need to be stored when every computation happens simultaneously, the entire system can be clocked at the 50 Hz rate, which also closely matches the theoretical algorithm structure. Optimizations undoubtedly exist, but since this work represents a first attempt at implementing this algorithm in **hardware**, design simplicity was favored over speed and size.


Pixel equality and similarity based techniques are proposed for reducing the amount of computation performed by the H.264 intra prediction algorithm in [9, 10, 22]. In this thesis, we propose using a pixel-equality-based computation reduction (PECR) technique for the intra prediction algorithm in the **HEVC** **decoder**. The PECR technique compares the pixels used in the prediction equations of the intra prediction modes. If the pixels used in a prediction equation are equal, the pixel predicted by this equation is equal to them; the prediction equation therefore simplifies to a constant value, and evaluating it becomes unnecessary. The simulation results obtained with the **HEVC** Test Model HM 5.2 **decoder** software [12] for several benchmark videos showed that using this technique after data reuse achieved more than 40% computation reduction with a small comparison overhead.
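The collapse-to-a-constant idea can be sketched on a single HEVC-style angular prediction equation, which blends two reference samples with a 5-bit fractional weight. The function below is an illustrative simplification of one such equation, not the thesis's implementation:

```python
def angular_predict(a, b, w):
    # Two-tap weighted interpolation between reference samples a and b,
    # with fractional weight w in [0, 32) (HEVC-style angular prediction).
    if a == b:
        # PECR shortcut: equal inputs make the equation collapse to a
        # constant, so both multiplications can be skipped entirely.
        return a
    return ((32 - w) * a + w * b + 16) >> 5
```

Since neighbouring reference samples are frequently equal in smooth image regions, this single comparison can eliminate many multiply-accumulate operations.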


throughput, low-complexity **decoder** **architecture** and design technique to implement successive-cancellation (SC) polar decoding. A novel merged processing element with a one's complement scheme, a main frame with optimal internal word length, and an optimized feedback-part **architecture** are proposed. Generally, a polar **decoder** uses a two's complement scheme in its merged processing elements, in which a conversion between two's complement and sign-magnitude requires an adder; the novel merged processing elements do not require this adder. Moreover, in order to reduce **hardware** complexity, optimized main frame and feedback-part approaches are also presented. A (1024, SC polar **decoder** was designed and implemented using 40-nm CMOS standard cell technology. Synthesis results show that the proposed SC polar **decoder** achieves a 13% reduction in **hardware** complexity and a higher clock speed compared to conventional decoders.
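For context, the processing elements being merged compute the two standard min-sum SC node functions on log-likelihood ratios, sketched below with plain integers (the hardware subtlety the abstract targets is that negating a two's complement word needs an extra increment, whereas a one's complement bitwise inversion does not, at the cost of a one-LSB bias; that detail is not modeled here):

```python
def f_node(a, b):
    # min-sum 'f' function: sign(a) * sign(b) * min(|a|, |b|)
    s = -1 if (a < 0) != (b < 0) else 1
    return s * min(abs(a), abs(b))

def g_node(a, b, u):
    # 'g' function: combine LLRs given the already-decoded partial sum u
    return b - a if u else b + a
```

The `f` node is where sign-magnitude arithmetic is natural, which is why the two's-complement-to-sign-magnitude conversion (and its adder) appears in conventional merged processing elements.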

Therefore, several **hardware** architectures to compute the variable-size DCT in **HEVC** have been proposed in recent years. Dias et al. [5] exploited a 2D systolic array to implement the DCT as a matrix-vector multiplication, thus supporting multiple standards. Meher et al. [6], on the other hand, designed an efficient integer DCT **architecture** for **HEVC** by relying on the odd-even decomposition of the DCT matrix and by reusing the core N/2-point DCT for the even computation of the N-point DCT. Moreover, to achieve high throughput, the **architecture** includes an additional N/2-point DCT unit, so that it computes 32/N N-point DCTs concurrently. However, these approaches require substantial **hardware** resources, as they implement exactly the DCT matrix specified by the **HEVC** standard [3]. For this reason, approximation has been introduced as a new paradigm to efficiently compute the DCT in video coding applications, trading complexity for rate-distortion performance loss [7]. Several approximations of the 8-point DCT have been derived by manipulating the coefficients and simplifying the DCT matrix; a collection of these methods is available in [8]. To extend the transform size from 8 to 32, Jridi et al. [9] proposed a generalized
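The odd-even decomposition used by Meher et al. is easiest to see at the smallest size. For the 4-point HEVC core transform, even-indexed outputs depend only on sums of mirrored inputs and odd-indexed outputs only on their differences; the coefficients 64, 83, 36 below are the ones I recall from the HEVC 4-point integer DCT matrix, stated here as an assumption:

```python
def dct4_hevc(x):
    # 4-point HEVC core transform (unscaled) via even-odd decomposition.
    e0, e1 = x[0] + x[3], x[1] + x[2]   # even part: mirrored sums
    o0, o1 = x[0] - x[3], x[1] - x[2]   # odd part: mirrored differences
    return [64 * (e0 + e1),             # row 0 (even)
            83 * o0 + 36 * o1,          # row 1 (odd)
            64 * (e0 - e1),             # row 2 (even)
            36 * o0 - 83 * o1]          # row 3 (odd)
```

The even half is itself a 2-point transform of `(e0, e1)`, which is exactly the N/2-point reuse that the decomposition enables at larger sizes.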

In the work carried out by [23], a complete profiling of the encoder was performed. The aim of that work is to present the different functions of the encoder, their execution times, and the types of operations carried out, in order to identify the functions that are candidates for **hardware** migration. The results are presented in terms of the types of assembly-level instructions in each encoder function. In [24], the authors propose a hybrid parallel decoding strategy for **HEVC** that combines task-level parallelism and data-level parallelism. In [25], the authors use a performance estimation analysis to validate a power model, based on bit derivations, that estimates the energy required to decode a given **HEVC**-coded bitstream. In [26], the authors propose a method to improve H.265/**HEVC** encoding performance for 8K UHDTV moving pictures by detecting the amount or complexity of object motion.

4.1. Functions of the Test Model
