A decoder test chip was implemented with a core size of 1.77 mm² in 40 nm CMOS, comprising 715K logic gates and 124 KB of on-chip SRAM. Fig. 22 shows the micrograph of the test chip. It is compliant with HEVC Test Model (HM) 4.0, and the decoding tools supported from HEVC Working Draft (WD) 4 are listed in Table 14 along with the main specs. The main differences from the final version of HEVC are that SAO is absent and Context-Adaptive Variable Length Coding (CAVLC) is used in place of CABAC in the entropy decoder. The chip achieves a decoding throughput of 249 Mpixels/s for 4K Ultra HD video at 200 MHz, with the target DDR3 SDRAM operating at 400 MHz. The core power was measured for six different configurations, as shown in Fig. 23. The average core power consumption for 4K Ultra HD decoding at 30 fps is 76 mW at 0.9 V, which corresponds to 0.31 nJ/pixel. The logic and SRAM breakdown of the chip is shown in Fig. 24. As in H.264/AVC decoders, prediction has the most significant resource utilization. However, the inverse transform is now also significant due to the larger transform units, while the deblocking filter is relatively small due to simplifications in the standard. The power breakdown from post-layout power simulations with a bi-prediction bitstream is shown in Fig. 25. The MC cache takes up a significant portion of the total power; however, the DRAM power saving due to the cache is about six times the cache's own power consumption.
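The reported figures are mutually consistent; a quick arithmetic check (illustrative only):

```python
# Sanity check of the reported energy efficiency: 76 mW core power
# while decoding 4K Ultra HD (3840x2160) at 30 fps.
POWER_W = 0.076
PIXELS_PER_SEC = 3840 * 2160 * 30  # ~249 Mpixels/s, matching the stated throughput

energy_per_pixel_nj = POWER_W / PIXELS_PER_SEC * 1e9
print(round(energy_per_pixel_nj, 2))  # ~0.31 nJ/pixel, as reported
```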
A Deeply Pipelined CABAC Decoder for HEVC Supporting Level 6.2 High-tier Applications
Yu-Hsin Chen, Student Member, IEEE, and Vivienne Sze, Member, IEEE
Abstract—High Efficiency Video Coding (HEVC) is the latest video coding standard; it specifies video resolutions up to 8K Ultra-HD (UHD) at 120 fps to support the next decade of video applications. This results in high throughput requirements for the context adaptive binary arithmetic coding (CABAC) entropy decoder, which was already a well-known bottleneck in H.264/AVC. To address the throughput challenges, several modifications were made to CABAC during the standardization of HEVC. This work leverages these improvements in the design of a high-throughput HEVC CABAC decoder. It also supports the high-level parallel processing tools introduced by HEVC, including tiles and wavefront parallel processing. The proposed design uses a deeply pipelined architecture to achieve a high clock rate. Additional techniques such as state prefetch logic, a latch-based context memory, and separate finite state machines are applied to minimize stall cycles, while multi-bypass-bin decoding is used to further increase the throughput. The design is implemented in an IBM 45 nm SOI process. After place-and-route, its operating frequency reaches 1.6 GHz. The corresponding throughputs reach up to 1696 and 2314 Mbin/s under common and theoretical worst-case test conditions, respectively. The results show that the design is sufficient to decode high-tier video bitstreams at level 6.2 (8K UHD at 120 fps) in real time, or main-tier bitstreams at level 5.1 (4K UHD at 60 fps) for applications requiring sub-frame latency, such as video conferencing.
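A minimal sketch of the multi-bypass-bin idea, assuming the arithmetic-coder range is aligned to 256 (as in HEVC's bypass-alignment mode); the bit-level interface is simplified and this is not the paper's design:

```python
# Sketch of multi-bypass-bin decoding. With the range aligned to 256, the
# 8-bit offset register acts as a sliding window over the bitstream, and n
# bypass bins can be decoded with a single n-bit shift per step.

def decode_bypass_bins(bits, n_bins):
    """Decode n_bins bypass bins from a list of 0/1 bits."""
    offset = 0
    pos = 0
    for _ in range(8):                      # seed the 8-bit offset register
        offset = (offset << 1) | bits[pos]
        pos += 1
    out = []
    while len(out) < n_bins:
        n = min(4, n_bins - len(out))       # decode up to 4 bins per step
        for _ in range(n):
            offset = (offset << 1) | bits[pos]
            pos += 1
        bins = offset >> 8                  # top n bits are the decoded bins
        offset &= 0xFF
        out.extend((bins >> (n - 1 - i)) & 1 for i in range(n))
    return out
```

In the general (unaligned) case the decoder must compare the offset against the current range for every bin; aligning the range to a power of two is what reduces multi-bin decoding to the pure shift shown here.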
quarter) accurate variable block size motion estimation is applied in both H.264/AVC and HEVC. The H.264/AVC standard uses a six-tap finite impulse response (FIR) filter for luma interpolation at half-pixel positions, followed by linear interpolation at quarter-pixel positions; chroma samples are computed by weighted interpolation of the four closest integer-pixel samples. In the HEVC standard, three different FIR filters are used for luma interpolation: an eight-tap filter for half-pixel positions and two seven-tap filters for quarter-pixel positions. Chroma samples are computed using four-tap filters. Sub-pixel interpolation is one of the most computationally intensive parts of the HEVC video encoder and decoder. In the high-efficiency and low-complexity configurations of the HEVC decoder, 37% and 50% of the decoder complexity, respectively, is caused by sub-pixel interpolation on average . On the other hand, compared with the six-tap filters used in the H.264/AVC standard, the seven-tap and eight-tap filters cost more area in hardware implementation and occupy 37~50% of the total complexity for DRAM access and filtering. Therefore, it is necessary to design a dedicated hardware
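The filtering step itself is a plain FIR evaluation; a sketch using the HEVC luma coefficients (clipping and the intermediate-precision handling of the 2D case are omitted here):

```python
# HEVC luma interpolation taps: 8-tap for the half-pel position and 7-tap for
# a quarter-pel position. Both coefficient sets sum to 64, so results are
# normalized with a rounding shift by 6.
HALF_PEL = (-1, 4, -11, 40, 40, -11, 4, -1)
QUARTER_PEL = (-1, 4, -10, 58, 17, -5, 1)

def interpolate(samples, taps):
    """Apply a FIR filter to integer samples; normalize by 64 with rounding."""
    acc = sum(c * s for c, s in zip(taps, samples))
    return (acc + 32) >> 6
```

A handy self-check: on a flat region every filter reproduces the input value, since the taps sum to 64.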
In this thesis, we proposed a low-complexity HEVC SPME technique for the HEVC encoder. The proposed technique reduces the amount of computation significantly with only a slight decrease in PSNR. We designed and implemented a high-performance HEVC SPME hardware implementing the proposed low-complexity technique. We also designed and implemented an HEVC fractional interpolation hardware using memory-based constant multiplication for all PU sizes, for both the HEVC encoder and decoder; it implements multiplications with constant coefficients via a memory-based constant multiplication technique. We proposed three different high-performance FVC 2D transform hardware designs for 4x4 and 8x8 TU sizes. The first two designs use adders and shifters to implement the FVC transform algorithm, while the third uses DSP blocks in a Xilinx Virtex-6 FPGA. The proposed hardware is verified to work correctly on an FPGA board.
The key idea of SAO is to reduce sample distortion by first classifying reconstructed samples into different categories, obtaining an offset for each category, and then adding the offset to each sample of the category. The offset of each category is calculated at the encoder and explicitly signaled to the decoder, which reduces sample distortion effectively, while the classification of each sample is performed at both the encoder and the decoder, which saves side information significantly.
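The band-offset variant of this idea can be sketched as follows (an illustrative model, not the normative process):

```python
# SAO band-offset sketch for 8-bit samples: the sample range is split into
# 32 bands of width 8; samples falling into a band with a signaled offset
# have that offset added, and the result is clipped to the valid range.

def sao_band_offset(sample, offsets, bit_depth=8):
    band = sample >> (bit_depth - 5)                   # one of 32 bands
    out = sample + offsets.get(band, 0)                # signaled offsets only
    return max(0, min((1 << bit_depth) - 1, out))      # clip to [0, 255]
```

For example, sample value 100 lies in band 12 (values 96..103); with a signaled offset of +3 for that band it becomes 103, while samples in unsignaled bands pass through unchanged.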
Abstract Low-density parity-check (LDPC) codes and convolutional Turbo codes are two of the most powerful error correcting codes that are widely used in modern communication systems. In a multi-mode baseband receiver, both LDPC and Turbo decoders may be required. However, the different decoding approaches for LDPC and Turbo codes usually lead to different hardware architectures. In this paper we propose a unified message passing algorithm for LDPC and Turbo codes and introduce a flexible soft-input soft-output (SISO) module to handle LDPC/Turbo decoding. We employ the trellis-based maximum a posteriori (MAP) algorithm as a bridge between LDPC and Turbo code decoding. We view the LDPC code as a concatenation of n super-codes, where each super-code has a simpler trellis structure so that the MAP algorithm can be easily applied to it. We propose a flexible functional unit (FFU) for MAP processing of LDPC and Turbo codes with a low hardware overhead (about 15% area and timing overhead). Based on the FFU, we propose an area-efficient flexible SISO decoder architecture to support LDPC/Turbo code decoding. Multiple such SISO modules can be embedded into a parallel decoder for higher decoding throughput. As a case study, a flexible LDPC/Turbo decoder has been synthesized in a TSMC 90 nm CMOS technology with a core area of 3.2 mm². The decoder can support IEEE 802.16e LDPC codes, IEEE 802.11n LDPC codes, and 3GPP LTE
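The MAP recursions referenced above are built from the max* (Jacobian logarithm) operation; hardware functional units typically approximate its correction term with a small lookup table. Its exact form, for orientation:

```python
import math

# max*(a, b) = ln(e^a + e^b) = max(a, b) + ln(1 + e^-|a-b|).
# The second term is the correction a log-MAP hardware unit approximates;
# dropping it entirely gives the max-log-MAP simplification.

def max_star(a, b):
    return max(a, b) + math.log1p(math.exp(-abs(a - b)))
```

The correction term is at most ln 2 (when a == b) and vanishes as |a - b| grows, which is why a few LUT entries suffice in practice.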
Abstract: This paper presents an efficient high-level synthesis (HLS) hardware design implementing the inverse quantization and inverse transform (IQ/IT) for a High Efficiency Video Coding (HEVC) decoder. Using the Xilinx Vivado HLS tool, different directives are applied to the IQ/IT C code to select the hardware architecture optimized in terms of area and clock cycles. This architecture is implemented in a SW/HW context for verification: it is connected to an ARM Cortex-A9 processor through an AXI-Stream interface and integrated on a Xilinx Zynq ZC702 platform. The experimental results show that the SW/HW design can only decode 240p@15fps, with a gain of 8% in throughput and 74% in power consumption compared to the SW implementation.
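A simplified scalar model of the HEVC inverse quantization step (flat scaling list, clipping omitted; the shift follows the form of the standard's bdShift) might look like:

```python
# HEVC dequantization level scales, indexed by qp % 6; each +6 in QP roughly
# doubles the quantization step via the (qp // 6) shift.
LEVEL_SCALE = (40, 45, 51, 57, 64, 72)

def dequant(level, qp, log2_tr_size, bit_depth=8):
    """Simplified scalar dequantization (flat scaling list, no clipping)."""
    shift = bit_depth + log2_tr_size - 5
    scale = LEVEL_SCALE[qp % 6] << (qp // 6)
    return (level * scale + (1 << (shift - 1))) >> shift
```

The normative process additionally multiplies by a per-coefficient scaling-matrix entry and clips the result to the transform coefficient range; those details are left out of this sketch.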
and rate estimation in a fully parallel manner. The proposed intra encoder consists of two parts: efficient HEVC algorithm adaptations and a highly parallel hardware architecture design. The former aims to reduce computational complexity at the algorithm level, while the latter maximizes the potential for parallelism to improve the overall throughput of the intra encoder. The proposed intra encoder supports all CU/PU/TU sizes and 35 prediction modes. Compared with HM-15.0, the proposed algorithm adaptations lead to a 27% computation reduction, with average BD-Rate and BD-PSNR losses of 4.39% and 0.21 dB, respectively. To address the bottleneck of data/timing dependency, a fully parallel intra encoder architecture with 4-way parallelism in intra prediction is proposed. Intra prediction of four different-size PUs, from 4×4 to 32×32, is performed simultaneously in four prediction engines (PEs) to greatly improve prediction throughput. Highly pipelined computational schemes are designed and employed in each PE to maximize RDO throughput. Moreover, the high-throughput table-based CABAC rate estimator proposed in Chapter 3 is incorporated into the intra encoder to further increase RDO performance. Experimental results show the proposed intra encoder is capable of real-time video compression for 4K video at 30 fps.
With new wireless communication standards and new MIMO decoding algorithms emerging every few years, existing systems need to be redesigned and upgraded not only to meet the newly defined standards, but also to allow integration of multiple standards onto the same platform and improve performance via more advanced decoding algorithms. This fact serves as the main motivation for this solution. A programmable hardware solution focused on the unique MIMO decoding operations of a MIMO system can help drive down nonrecurring engineering costs, can facilitate system upgrades to take advantage of emerging algorithms and can help minimize hardware duplications in system-on-a-chips that support multiple standards.
computationally intensive parts of the High Efficiency Video Coding (HEVC) video encoder and decoder. In this paper, an HEVC fractional interpolation hardware using memory-based constant multiplication is proposed. The proposed hardware uses a memory-based constant multiplication technique for implementing multiplications with constant coefficients: it stores pre-computed products of an input pixel with multiple constant coefficients in memory. Several optimizations are proposed to reduce the memory size. The proposed HEVC fractional interpolation hardware can, in the worst case, process 35 quad full HD (3840x2160) video frames per second. It has up to 31% less energy consumption than the original HEVC fractional interpolation hardware.
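The stored-product idea can be sketched as follows (the table layout is illustrative; the coefficients shown are the HEVC half-pel luma taps):

```python
# Memory-based constant multiplication: for an 8-bit pixel, the products with
# all filter coefficients are precomputed once and read from a table at run
# time instead of being multiplied.
COEFFS = (-1, 4, -11, 40, 40, -11, 4, -1)

# One ROM row per possible pixel value; each row holds pixel * coef per tap.
ROM = [tuple(p * c for c in COEFFS) for p in range(256)]

def filter_with_rom(pixels):
    """FIR filtering where every product is a table lookup."""
    acc = sum(ROM[p][i] for i, p in enumerate(pixels))
    return (acc + 32) >> 6
```

The memory-size optimizations mentioned in the text would then shrink this naive 256-row table, e.g. by exploiting repeated or sign-symmetric coefficients.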
Multiprocessor System on Chip (MPSoC) technology presents an interesting solution for reducing the computational time of complex applications such as multimedia applications. Implementing the new High Efficiency Video Coding (HEVC/H.265) codec on an MPSoC architecture has therefore become an interesting research direction that can address its algorithmic complexity and meet real-time constraints. The implementation consists of a set of steps that compose the co-design flow of an embedded system design process. One of the first and key steps of a co-design flow is the modeling phase, which allows designers to make the best architectural choices in order to meet user requirements and platform constraints. Multimedia applications such as the HEVC decoder are complex applications that demand increasing degrees of agility and flexibility. These applications are usually modeled using dataflow techniques. Several extensions of dataflow models of computation, together with various scheduling techniques, have been proposed to support dynamic behavior changes while preserving static analyzability. In this paper, the HEVC/H.265 video decoder is modeled with FSM-based SADF in order to solve the problems of placing and scheduling this application on an embedded architecture. In the modeling step, a high-level performance analysis is performed to find an optimal balance between decoding efficiency and implementation cost, thereby reducing the complexity of the system. The case study runs the HEVC/H.265 decoder on the Xilinx Zedboard platform, which offers a real experimental environment.
structure and multiplierless implementation, were adopted to save hardware cost. That work presented a transform architecture that uses the canonical signed digit (CSD) representation and the common sub-expression elimination technique to perform each multiplication with shift-add operations. Based on these optimizations, the transform architecture is greatly simplified for practical application. However, with the increasing use of high definition (HD) and ultra-HD video coding, higher processing capacity is required of codecs. Thus, all modules in the video codec, including the transform, need to be further improved for real-time coding with low complexity.
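For example, the HEVC 4-point transform coefficients 83 and 36 can be realized with shifts and adds alone:

```python
# Shift-add realization of constant multiplication, as used in multiplierless
# transform hardware.

def mul83(x):
    # 83 = 64 + 16 + 2 + 1
    return (x << 6) + (x << 4) + (x << 1) + x

def mul36(x):
    # 36 = 32 + 4
    return (x << 5) + (x << 2)
```

CSD goes further by allowing negative digits, which reduces the adder count for constants with long runs of ones (e.g. 31 = 32 - 1 costs one subtractor instead of four adders), and common sub-expression elimination then shares identical shift-add terms across coefficients.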
efficiency compared with the new High Efficiency Video Coding (HEVC) standard. On the other hand, 3D HEVC-based techniques have a high coding efficiency, but are not supported by H.264/AVC decoders. Therefore, HEVC-based systems cannot immediately be incorporated into the network without the high cost of upgrading the existing network infrastructure (such as encoders, streaming servers, transcoders, etc.) and the decoder install base. In order to enable a system which offers 3D functionality, a low overall bit rate, and compatibility with currently existing H.264/AVC-based systems, a multiview H.264/AVC and HEVC hybrid architecture was proposed in the context of 3D applications and standardized in . The standardization of this hybrid architecture was aligned with the HEVC extensions by MPEG. The architecture is hybrid in the sense that the base view and the other views apply different encoding standards: H.264/AVC encoding for the base view is combined with HEVC encoding for the other views. This architecture reduces the bandwidth by exploiting redundancy with the base-view stream (which is decodable by already existing systems), while the functionality of those systems is maintained in the mid term. Note that depth maps are not used for the purposes of this paper: since the aim is to maintain interoperability, a device that cannot decode the HEVC views will very likely not be able to decode the depth maps either, as H.264/AVC did not include a specification for texture views plus depth maps .
An asynchronous pipeline is faster because the completion of every stage is determined individually by the data, instead of every stage waiting for the worst-case delay of the slowest stage under a global clock. An asynchronous pipeline exhibits two different types of stall, both of which are hidden by the clock in the synchronous version. In the first, the BMU result moves from stage 2 to stage 3: stage 2 is ready to accept new data, but the ACS in stage 1 has not yet completed. This period is called starvation, since the hardware is available but has to wait for data. In the second type of stall, the SMU attempts to move from stage 1 to stage 2, but stage 2 is not ready to accept new data because it is still processing the BMU output. This is blocking: the data is ready but has to wait for the hardware to become available. When few data elements are present in the pipeline, starvation occurs and throughput is low; when many data elements are present, blocking occurs and latency is high. Hence a balanced pipeline achieves both low latency and high throughput.
This paper presents the memory requirement and the corresponding architecture. We have presented a reorganized decode decision engine with look-ahead ctxIdx calculation logic to improve performance. With this optimized memory organization, the data required for the next update is held in a cache in order to improve the speed of operation. This method increases processing speed by 14 to 22% and reduces memory size by 50%. It demonstrates the benefits of accounting for implementation cost when designing video coding algorithms. We recommend that this approach be extended to the rest of the video codec to maximize processing speed and minimize area cost, while
but, aside from refractory-period suppression of positive movement classifications, each movement decision is independent of any other classification. Additionally, the computations performed in each classifier within a single movement period are independent of one another. It was decided that this independence should be exploited to create a hardware architecture in which all classifiers and multipliers are implemented in parallel. This fully parallel approach closely mimics the theoretical algorithm structure, which is advantageous because it is easy to understand and to compare against both the theoretical behavior and the software implementation. Each hardware component can be directly compared to its theoretical counterpart, allowing easy verification of proper functionality. Additionally, a fully parallel implementation should have a short critical path delay, making the 50 Hz target clock rate easily achievable. Since the products of intermediate computations do not need to be stored when every computation happens simultaneously, the entire system can be clocked at the 50 Hz rate, which also closely matches the theoretical algorithm structure. Optimizations undoubtedly exist, but since this work represents a first attempt at implementing this algorithm in hardware, design simplicity was favored over speed and size.
Pixel equality and similarity based techniques are proposed for reducing the amount of computation performed by the H.264 intra prediction algorithm in [9, 10, 22]. In this thesis, we propose using a pixel-equality based computation reduction (PECR) technique for the intra prediction algorithm in the HEVC decoder. The PECR technique compares the pixels used in the prediction equations of the intra prediction modes. If the pixels used in a prediction equation are equal, the pixel predicted by this equation is equal to those pixels. Therefore, the prediction equation simplifies to a constant value and the prediction calculation for this equation becomes unnecessary. Simulation results obtained with the HEVC Test Model HM 5.2 decoder software  for several benchmark videos showed that using this technique after data reuse achieves more than 40% computation reduction with a small comparison overhead.
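The shortcut can be sketched on the two-tap weighted-interpolation form used in HEVC angular prediction (function and variable names here are hypothetical):

```python
# PECR idea on the angular-prediction interpolation equation
#     pred = ((32 - w) * a + w * b + 16) >> 5
# where a, b are reference pixels and w is the fractional weight (0..31).
# If a == b the equation collapses to ((32*a + 16) >> 5) == a, so the
# multiply-add can be skipped entirely.

def predict(a, b, w):
    if a == b:                       # PECR shortcut: result is exactly a
        return a
    return ((32 - w) * a + w * b + 16) >> 5
```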
throughput low-complexity decoder architecture and design technique to implement successive-cancellation (SC) polar decoding. A novel merged processing element with a one's complement scheme, a main frame with optimal internal word length, and an optimized feedback-part architecture are proposed. Generally, a polar decoder uses a two's complement scheme in its merged processing elements, in which each conversion between two's complement and sign-magnitude requires an adder. The novel merged processing elements, however, do not require an adder. Moreover, in order to reduce hardware complexity, optimized main-frame and feedback-part approaches are also presented. A length-1024 SC polar decoder was designed and implemented using a 40-nm CMOS standard cell technology. Synthesis results show that the proposed SC polar decoder achieves a 13% reduction in hardware complexity and a higher clock speed compared to conventional decoders.
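The processing elements in question evaluate the two standard SC update functions; their common min-sum forms are sketched below. The sign-magnitude structure of f is why the number representation matters for hardware cost:

```python
# LLR update functions of a successive-cancellation polar decoder's
# processing elements (min-sum approximation for f).

def f(a, b):
    """Upper-branch update: sign(a) * sign(b) * min(|a|, |b|)."""
    s = -1.0 if (a < 0) != (b < 0) else 1.0
    return s * min(abs(a), abs(b))

def g(a, b, u):
    """Lower-branch update given the already-decoded partial-sum bit u."""
    return b + (1 - 2 * u) * a
```

Since f works on signs and magnitudes while g is an add/subtract, a two's complement datapath needs adders just to convert between the two forms; avoiding that conversion is the motivation for the one's complement scheme described above.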
Therefore, several hardware architectures for computing the variable-size DCT in HEVC have been proposed in recent years. Dias et al.  exploited a 2D systolic array to implement the DCT as a matrix-vector multiplication, thus supporting multiple standards. Meher et al. , on the other hand, designed an efficient integer DCT architecture for HEVC by relying on the odd-even decomposition of the DCT matrix and by reusing the core N/2-point DCT for the even part of the N-point DCT. Moreover, to achieve high throughput, this architecture includes an additional N/2-point DCT unit, so that it computes 32/N N-point DCTs concurrently. However, these approaches require substantial hardware resources, as they implement exactly the DCT matrix specified by the HEVC standard . For this reason, approximation has been introduced as a new paradigm to efficiently compute the DCT in video coding applications, trading complexity for a rate-distortion performance loss . Several approximations of the 8-point DCT have been derived by manipulating the coefficients and simplifying the DCT matrix; a collection of these methods is available in . To extend the transform size from 8 to 32, Jridi et al.  proposed a generalized
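The odd-even decomposition can be illustrated on the 4-point HEVC core transform (a sketch, checked against the direct matrix form):

```python
# Odd-even (butterfly) decomposition of the 4-point HEVC forward transform:
# the even outputs reuse a half-size transform of the input sums, the odd
# outputs use the differences, halving the multiplier count.

def dct4_butterfly(x):
    e0, e1 = x[0] + x[3], x[1] + x[2]       # even part (sums)
    o0, o1 = x[0] - x[3], x[1] - x[2]       # odd part (differences)
    return (64 * (e0 + e1),
            83 * o0 + 36 * o1,
            64 * (e0 - e1),
            36 * o0 - 83 * o1)

# Direct matrix form of the same transform, for comparison.
M = ((64, 64, 64, 64),
     (83, 36, -36, -83),
     (64, -64, -64, 64),
     (36, -83, 83, -36))

def dct4_direct(x):
    return tuple(sum(m * xi for m, xi in zip(row, x)) for row in M)
```

The same pattern recurses for larger sizes: an N-point transform splits into an N/2-point transform on the sums plus an odd part on the differences, which is exactly the structural reuse exploited by the architecture described above.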
In the work carried out by , a complete profiling of the encoder was performed. The aim of that work is to present the different functions of the encoder, their execution times, and the types of operations carried out, in order to identify the functions that are candidates for hardware migration. The results are presented in terms of the types of assembly-level instructions in each encoder function. In , the authors propose a hybrid parallel decoding strategy for HEVC which combines task-level parallelism and data-level parallelism. In , the authors use a performance estimation analysis to propose a power model, based on bit derivations, that estimates the energy required to decode a given HEVC coded bitstream. In , the authors proposed a method to improve H.265/HEVC encoding performance for 8K UHDTV moving pictures by detecting the amount or complexity of object motion.

4.1. Functions of the Test Model