3.3 High Dynamic Range Content Storage
3.3.3 High Dynamic Range Video Coding
Compression of HDR videos is of special importance, as they are significantly larger than LDR videos and in RAW form can take up a sizable portion of a current hard drive. Transferring them over current network infrastructure or on available media would be difficult, and even reading and playing such uncompressed data from a local hard drive is challenging. This section presents video coding algorithms and discusses backwards compatible algorithms in more detail.
Backward Compatible HDR MPEG Video Coding
Mantiuk, Efremov, Myszkowski & Seidel (2006) proposed a backwards compatible method for storing HDR videos. The technique was similar to JPEG-HDR image compression. The input video stream was split into two streams: a tone mapped (TM) stream and a residual stream. The well-established MPEG-4 video encoder processed each of them separately. The TM stream was displayed on LDR devices, while HDR devices used the residual data to generate an HDR image. The main differences from JPEG-HDR were the use of an inverse TMO to restore the original HDR, the calculation of residuals as a difference, and the filtering of the residual stream to remove noise invisible to the HVS. The full pipeline of the encoder is shown in Figure 3.12.
[Figure 3.12 diagram. Blocks: Colour Space Transform (×2), MPEG Encode (×2), MPEG Decode, Find Reconstruction Function, Compute Residual, Quantise Residual, Filter Invisible Noise. Signals: HDR Frame, HDR Luminance, LDR Frame, LDR Luminance. Outputs: LDR Stream, Residual Stream, Auxiliary Stream.]
Figure 3.12: The coding process transforms images into the common colour space, allowing residuals to be calculated and subsequently filtered. Both the LDR stream and the residual stream are MPEG encoded.
The pipeline started by MPEG encoding the TM video which was stored - without further processing - as the final backwards compatible LDR stream to be displayed on an LDR device.
To calculate the residuals, the LDR video was decoded, which exposed all the errors introduced by compression. The decoded tone mapped stream and the original HDR stream were then transformed into a common colour space, which decorrelated the RGB (or XYZ) representations and allowed the two to be compared. The chroma of both was converted to the CIE 1976 uniform chromaticity scale ($u'$, $v'$, similar to LogLuv), which could represent the full visible gamut. The luma of the TM frame was nonlinearly transformed using the sRGB nonlinearity, which has a linear part and a power function part. The HDR frame used a different luma coding, as the sRGB nonlinearity could not be used for values spanning the $10^{-5}$ to $10^{10}$ cd/m$^2$ range. The authors applied an encoding based on contrast detection measurements for the luminance range visible to the HVS, which ensured that the quantisation errors were invisible. The transformation is a piecewise function defined separately across three dynamic ranges. For exact values please refer to the original work (Mantiuk, Efremov, Myszkowski & Seidel, 2006).
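The two standard conversions named above can be sketched as follows. This is a minimal illustration using the published CIE 1976 and sRGB formulas, not the authors' implementation; the function names and the fallback of (0, 0) for black pixels are assumptions made here, and the HDR luma encoding of the original paper is deliberately not reproduced.

```python
import numpy as np

def xyz_to_uv_prime(xyz):
    """CIE 1976 chromaticity (u', v') from XYZ tristimulus values."""
    xyz = np.asarray(xyz, dtype=np.float64)
    x, y, z = xyz[..., 0], xyz[..., 1], xyz[..., 2]
    denom = x + 15.0 * y + 3.0 * z
    # Assumption: chromaticity is undefined for black pixels, so return 0 there.
    safe = np.where(denom > 0.0, denom, 1.0)
    u = np.where(denom > 0.0, 4.0 * x / safe, 0.0)
    v = np.where(denom > 0.0, 9.0 * y / safe, 0.0)
    return np.stack([u, v], axis=-1)

def srgb_encode(linear):
    """sRGB nonlinearity: linear segment near black, power segment elsewhere."""
    linear = np.clip(np.asarray(linear, dtype=np.float64), 0.0, 1.0)
    return np.where(linear <= 0.0031308,
                    12.92 * linear,
                    1.055 * np.power(linear, 1.0 / 2.4) - 0.055)
```

Applying the same chromaticity conversion to both streams is what makes the LDR and HDR frames directly comparable; only the luma coding differs between them.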
Once both images were in perceptually similar colour spaces, the authors could approximate the reconstruction function $RF(\cdot)$, which expanded the TM luma values ($l_d$) back to the original HDR luma ($l_W$). It was assumed that the original TMO was unknown. The HDR values were put into 256 bins based on their TM counterparts. The reconstruction function was then calculated by finding the arithmetic mean of all pixels in each bin $\Omega_i$:
$$RF(i) = \frac{1}{|\Omega_i|} \sum_{x \in \Omega_i} l_W(x) \qquad (3.18)$$
where $\Omega_i = \{x \mid l_d(x) = i\}$ was the set of pixels in bin $i$ and $i \in [0, 255]$ was the bin index. The chromaticity reconstruction function was approximated by $(u'_d, v'_d) = (u'_W, v'_W)$. The reconstruction function data was computed for all frames, stored in the auxiliary stream and Huffman encoded.
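The binning in Equation 3.18 can be sketched with NumPy as below. This is an illustrative sketch under stated assumptions rather than the authors' code: the function name is hypothetical, the LDR luma is assumed to be already quantised to integers in [0, 255], and the linear interpolation used to fill empty bins is an assumption made here, since the text does not say how empty bins were handled.

```python
import numpy as np

def reconstruction_function(ldr_luma, hdr_luma, bins=256):
    """Mean HDR luma of all pixels sharing the same LDR luma value (Eq. 3.18)."""
    ld = np.asarray(ldr_luma).ravel().astype(np.intp)
    lw = np.asarray(hdr_luma, dtype=np.float64).ravel()
    counts = np.bincount(ld, minlength=bins)             # |Omega_i|
    sums = np.bincount(ld, weights=lw, minlength=bins)   # sum of l_W over Omega_i
    rf = np.full(bins, np.nan)
    filled = counts > 0
    rf[filled] = sums[filled] / counts[filled]
    # Assumption: bins with no pixels are interpolated from their neighbours.
    if filled.any() and not filled.all():
        idx = np.arange(bins)
        rf[~filled] = np.interp(idx[~filled], idx[filled], rf[filled])
    return rf
```

The result is a 256-entry lookup table, so expanding a TM frame is a single indexing operation, `rf[ldr_luma]`.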
After expanding the TM image, the residual frame (rl) was calculated by a simple subtraction:
$$r_l(x) = l_W(x) - RF(l_d(x)) \qquad (3.19)$$
The obtained values could range from -4095 to 4095, which required a 12-bit magnitude plus a sign bit if left uncompressed. As MPEG encoded 8-bit data most efficiently, the results needed to be scaled and quantised or clamped. A solution was proposed which allowed a trade-off between the errors due to clamping and the errors due to quantisation. To enable even more control, a quantisation factor (i.e. a scaling value) was set for each bin. The final residual image was therefore computed as follows:
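$$\hat{r}_l(x) = \left[ \frac{r_l(x)}{q(m)} \right]_{-127}^{127}, \quad \text{where } m = k \Leftrightarrow x \in \Omega_k \qquad (3.20)$$
where $[\cdot]_{-127}^{127}$ was the rounding operator that clamped values below -127 and above 127, and the quantisation factor, $q(m)$, was selected separately for each bin $\Omega_m$:
$$q(m) = \max\left( q_{min}, \frac{\max_{x \in \Omega_m} |r_l(x)|}{127} \right) \qquad (3.21)$$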
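Equations 3.19 to 3.21 can be combined into one short routine. The sketch below is only an illustration: the function name and the default `q_min=1.0` are assumptions, the reconstruction function is passed in as a 256-entry lookup table `rf`, and the LDR luma is assumed to hold integer bin indices.

```python
import numpy as np

def quantise_residual(ldr_luma, hdr_luma, rf, q_min=1.0):
    """Residual frame and per-bin quantisation factors (Eqs. 3.19 to 3.21)."""
    ld = np.asarray(ldr_luma).astype(np.intp)
    lw = np.asarray(hdr_luma, dtype=np.float64)
    rf = np.asarray(rf, dtype=np.float64)
    residual = lw - rf[ld]                              # Eq. 3.19
    # Largest absolute residual in each bin, needed for Eq. 3.21.
    max_abs = np.zeros(rf.shape[0])
    np.maximum.at(max_abs, ld.ravel(), np.abs(residual).ravel())
    q = np.maximum(q_min, max_abs / 127.0)              # Eq. 3.21
    scaled = np.rint(residual / q[ld])                  # Eq. 3.20, rounding
    # With q taken from Eq. 3.21 the clip acts as a safety net only.
    r_hat = np.clip(scaled, -127, 127).astype(np.int8)
    return r_hat, q
```

A larger `q_min` coarsens the quantisation in bins with small residuals, which is the single knob this sketch exposes for trading residual precision against bit rate.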
The scaled and quantised $\hat{r}_l$ contained high frequency values which hindered compression but could be removed without losing perceptual quality. Hence, the residual frame was filtered using the original HDR frame as a guide. The operation was performed in the wavelet domain and was applied to the three finest scales, as filtering at coarser scales could lead to noticeable artefacts.
The performance of HDR-MPEG was measured using three metrics: HDR-VDP (Mantiuk, Myszkowski & Seidel, 2004; Mantiuk et al., 2005), the universal image quality index, UQI (Bovik, 2002), and signal-to-noise ratio (SNR). The first study evaluated the influence of the chosen TMO on quality and bit rate. Five operators were tested: time-dependent visual adaptation (Pattanaik et al., 2000), fast bilateral filtering (Durand & Dorsey, 2002), photographic tone reproduction (Reinhard et al., 2002), gradient domain tone mapping (Fattal et al., 2002), and adaptive logarithmic mapping (Drago et al., 2003). The operators were modified to preserve temporal coherence and were run with their default parameters. All of them exhibited similar performance except the gradient domain operator, which produced a larger output stream. Still, the latter generated images which appeared better during LDR playback and were more suitable when backward compatibility was important. The second test compared the proposed method against HDRV and JPEG-HDR using the photographic tone reproduction TMO. HDR-MPEG performed better than JPEG-HDR but similarly to HDRV.
Rate-Distortion Optimised HDR Video Coding
Lee & Kim (2008) proposed another method which separated the HDR video into a TM stream and a residual stream, making it backwards compatible. They made two new contributions: temporal coherence was imposed, which reduced flickering, and bits were allocated between the TM and residual frames in a way that optimised the appearance of both the TM and the restored HDR frames. The diagram of the encoder is presented in Figure 3.13.
[Figure 3.13 diagram. Blocks: Temporal Gradient TMO, H.264 Encoding (×2), H.264 Decoding, Compute Ratio Frame (÷), Cross Bilateral Filter. Signals: Input HDR Frame, Decoded TM Stream. Outputs: TM Stream, Ratio Stream.]
Figure 3.13: The pipeline showing encoding of HDR video using rate-distortion optimised coding. The key components are tone mapping that operates in both the spatial and temporal domains, and a technique for improving the quality of the TM image.
First, the input HDR stream was tone mapped using a temporally coherent version of the gradient domain TMO (Lee & Kim, 2007). The TMO reduced flickering by estimating a motion vector field between two consecutive HDR frames and using it when generating the LDR pixel values. The TM stream was then encoded using the standard H.264 encoder, which provided backwards compatibility.
The compressed TM stream was then decoded so a ratio image could be calculated using the equation:
$$R(x) = \log_2 \frac{L_W(x)}{L_d(x) + \varepsilon} \qquad (3.22)$$
where $L_W$ and $L_d$ were the luminances of the HDR and LDR frames respectively and $\varepsilon$ was a small constant which prevented division by zero. The residual values were normalised to the range from 0 to 255. The residuals contained high frequency components due to noise introduced by tone mapping and quantisation. To improve coding efficiency, the edge preserving cross bilateral filter (Eisemann & Durand, 2004) was applied to the residuals with the luminance of the HDR image as a guide. H.264 was used to encode the residual stream as well.
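The ratio frame and its normalisation to 8 bits can be sketched as follows. Several details are assumptions made for illustration: the small constant is taken to be added to the LDR luminance in the denominator, its default value `eps=1e-6` is arbitrary, the normalisation uses the per-frame minimum and maximum (the text only states the target range of 0 to 255), and the HDR luminance is assumed to be strictly positive. The cross bilateral filtering step is not included.

```python
import numpy as np

def ratio_frame(hdr_lum, ldr_lum, eps=1e-6):
    """Log ratio of HDR to decoded LDR luminance, normalised to 0..255."""
    lw = np.asarray(hdr_lum, dtype=np.float64)
    ld = np.asarray(ldr_lum, dtype=np.float64)
    ratio = np.log2(lw / (ld + eps))          # Eq. 3.22, eps avoids division by zero
    lo, hi = float(ratio.min()), float(ratio.max())
    # Assumption: per-frame min/max scaling; (lo, hi) must be kept for decoding.
    if hi > lo:
        norm = (ratio - lo) / (hi - lo) * 255.0
    else:
        norm = np.zeros_like(ratio)
    return np.rint(norm).astype(np.uint8), (lo, hi)
```

A decoder under the same assumptions would invert the scaling with the stored `(lo, hi)` pair and recover the HDR luminance as `(ldr_lum + eps) * 2 ** ratio`.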
While previous methods sought to maximise the quality of the reconstructed image, here the authors were concerned with the quality of the TM image as well. To reduce distortions of both the TM and reconstructed HDR sequences ($D_d$ and $D_W$), they controlled the quantisation parameters of the LDR and ratio sequences ($QP_d$ and $QP_{ratio}$). The optimisation problem was solved by minimising the Lagrangian cost function:
$$J = D_d + \mu D_W + \lambda (R_d + R_{ratio}) \qquad (3.23)$$
where $R_d$ and $R_{ratio}$ were the bit rates for the TM and ratio sequences, $\mu$ controlled the importance of the HDR sequence and $\lambda$ determined the trade-off between the bit rates and the distortion. The authors analysed $J$ and suggested an equation for controlling the quality of both the TM and ratio streams using only a single parameter $QP_d$: $QP_{ratio} = 0.77\,QP_d + 13.42$.
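The cost function and the linear relation between the two quantisation parameters translate directly into code. The sketch below is illustrative only: the function names are hypothetical, and rounding the result and clamping it to the 0 to 51 range used for 8-bit H.264 are assumptions, as the text gives just the linear relation.

```python
def ratio_qp(qp_d):
    """Ratio-stream QP derived from the TM-stream QP (relation quoted in the text)."""
    qp = round(0.77 * qp_d + 13.42)
    # Assumption: clamp to the valid QP range of 8-bit H.264 (0..51).
    return max(0, min(51, qp))

def lagrangian_cost(d_d, d_w, r_d, r_ratio, mu, lam):
    """Joint rate-distortion cost J of Eq. 3.23."""
    return d_d + mu * d_w + lam * (r_d + r_ratio)
```

Because the slope is below one, the ratio stream is quantised more coarsely than the TM stream at low $QP_d$ and less coarsely at high $QP_d$; under the rounding assumed here the two coincide around $QP_d = 58$, which lies outside the valid range, so in practice the ratio stream always receives the larger QP.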
The method was evaluated against HDR-MPEG. Peak signal-to-noise ratio (PSNR) measured the quality of the tone mapped frames, while the HDR visual difference predictor (HDR-VDP) compared the reconstructed frames. For TM frames, the proposed technique achieved better quality, with an average gain of more than 10 dB. The generated HDR frames were better at low bit rates, with a 10% smaller VDP error, but at rates above 1 bpp the quality was worse, with a 2 to 5% larger VDP error. In a final test, the authors concluded that the ratio stream constituted 10 to 30% of the total file size.
The authors later extended the method (Lee & Kim, 2012) to provide more effective rate-distortion optimisation at the macro-block level in order to maximise the quality of both the LDR and HDR streams given the limited bit depth.
Other Video Compression Techniques
Mantiuk, Krawczyk, Myszkowski & Seidel (2004) suggested one of the first methods for compressing HDR videos, termed HDR video (HDRV). They used the well-established capabilities of the MPEG-4 video codec and extended it to work with HDR video data. The main characteristic of the proposed algorithm was a quantisation of luminance in which errors were kept below the just noticeable threshold values of the HVS. To accommodate HDR data, MPEG-4 data structures were expanded from 8 to 11 bits and an efficient coding scheme for DCT blocks was introduced. Three captured video sequences together with rendered videos were used to evaluate the approach. The achieved compression rates were between 0.09 and 0.53 bpp. This was compared to the performance of MPEG encoded TM data only, which was approximately half the size. OpenEXR, on the other hand, required between 16 and 28 bpp for the same sequences.
Mantiuk, Myszkowski & Seidel (2006) also suggested a novel colour space which allowed compression of HDR data while keeping the error below the visibility threshold of the HVS. This space was capable of representing the complete luminance range and the full colour gamut visible to the human eye. Existing coding algorithms required only minor changes to support the proposed colour space. To validate the approach, the authors developed two lossy HDR compression algorithms (for static images and video). They claimed that image compression was “efficient and fast”, but did not provide any results.
Adaptive bit-depth transformation of HDR data was explored by Motra & Thoma (2010) and Zhang et al. (2011). Motra & Thoma (2010) transformed HDR images to the LogLuv format, which they optimised for 16-bit floating point numbers. Quantisation errors were then minimised by adaptively utilising levels which were left unused after the transformation. Three video sequences were employed to test the approach. The non-adaptive and adaptive techniques were compared to the ground truth (GT) using the VDP metric, where the percentage of detected errors was significantly lower for the adaptive case. For example, at a bit rate of 11,200 for one of the sequences, the VDP error percentage was 8.5 for the non-adaptive and 0.01 for the adaptive method. Zhang et al. (2011) extended the method by optimising bit-depth quantisation via the Lloyd-Max algorithm (Max, 1960; Lloyd, 1982). In addition, invisible high frequency noise was reduced by transforming frames into the wavelet domain, where a contrast sensitivity function weighted the wavelet sub-bands. The proposed technique improved on that of Motra & Thoma (2010), achieving VDP results which were between 18% and 65% better.
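For readers unfamiliar with it, the Lloyd-Max algorithm alternates between placing decision thresholds midway between reconstruction levels and moving each level to the mean of the samples it represents. The sketch below is a generic scalar version on empirical samples, not the formulation of Zhang et al. (2011); the function names, the uniform initialisation and the iteration limit are assumptions.

```python
import numpy as np

def lloyd_max(samples, levels=16, iterations=50):
    """Reconstruction levels of a scalar Lloyd-Max quantiser fitted to samples."""
    x = np.sort(np.asarray(samples, dtype=np.float64).ravel())
    # Assumption: start from uniformly spaced reconstruction levels.
    recon = np.linspace(x[0], x[-1], levels)
    for _ in range(iterations):
        edges = 0.5 * (recon[:-1] + recon[1:])     # thresholds at midpoints
        cell = np.searchsorted(edges, x)           # cell index of every sample
        counts = np.bincount(cell, minlength=levels)
        sums = np.bincount(cell, weights=x, minlength=levels)
        # Move each level to the mean of its cell; empty cells keep their level.
        updated = np.where(counts > 0, sums / np.maximum(counts, 1), recon)
        if np.allclose(updated, recon):
            break
        recon = updated
    return recon

def quantise(samples, recon):
    """Map every sample to its nearest reconstruction level."""
    edges = 0.5 * (recon[:-1] + recon[1:])
    return recon[np.searchsorted(edges, np.asarray(samples, dtype=np.float64))]
```

Starting from the uniform quantiser, each iteration can only keep or lower the mean squared error on the fitted samples, which is why the levels concentrate where the data is dense.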