Objective Quality Metrics - Using machine learning to select and optimise multiple objectives i

The purpose of using an objective quality metric is to have a quick, deterministic and reliable way to measure quality degradation of the compressed image or video in comparison with the original.

Depending on its practical purpose, a compressed image or video may be intended either for subsequent algorithmic processing or for human viewing. Consequently, the quality metrics used to estimate the amount of introduced noise may differ. If the image or video is subject to further machine analysis, the standard mean squared error is often considered sufficient. However, if the compressed material will be viewed by people, there is no solid consensus on which metric should be used as a replacement for subjective quality perception. Many quality metric investigations, such as [36], suggest that this is still an open question.

There are two kinds of objective quality metrics. The first kind is calculated by comparing the compressed image with the original. The second kind, referenceless metrics, estimates the image quality without reference to any other source. Referenceless metrics are usually aimed at detecting explicit compression artefacts such as blur or blockiness. For example, Tong et al. [37] propose measuring the amount of blur using the spectrum coefficients of the discrete Haar wavelet transform, while Chen et al. [38] use a conceptually similar method based on calculating the gradient at different image resolutions. To detect blockiness, Gunawan et al. [39] propose using a Sobel operator while taking into account the entropy of the image area. There seem to be no established standards among the referenceless metrics.

This research uses only metrics calculated with respect to the original image or video (full-reference metrics). Probably the most popular quality estimation methods today are PSNR (peak signal-to-noise ratio) and SSIM (structural similarity index).

The PSNR metric is the simplest one and is based on the mean squared error. It uses a logarithmic scale in decibels to facilitate interpretation of the values. For two images (discrete signals) $x$ and $y$ the PSNR metric is calculated as:

$$\mathrm{PSNR} = 10 \cdot \log_{10}\left(\frac{L^2}{\frac{1}{N}\sum_{i=1}^{N}(x_i - y_i)^2}\right),$$

where $N$ is the number of discrete samples (pixels in the image); $L = 255$ is the dynamic range of the brightness levels, or the maximum possible pixel value; $x_i$ and $y_i$ are the values of the corresponding pixels.
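The PSNR definition above can be illustrated with a minimal sketch in plain Python. The function name `psnr`, the `peak` parameter and the sample pixel values are illustrative assumptions, not part of the thesis. Returning infinity for identical inputs is one common convention, chosen here only to avoid a division by zero.

```python
import math


def psnr(x, y, peak=255.0):
    """PSNR in decibels between two equally sized sequences of pixel values.

    `peak` plays the role of L in the formula (assumed 255 for 8-bit data).
    """
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error: (1/N) * sum((x_i - y_i)^2)
    mse = sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)
    if mse == 0:
        # Identical signals: the ratio is unbounded (a convention, not a rule).
        return math.inf
    return 10.0 * math.log10(peak ** 2 / mse)


# Hypothetical 4-pixel example: MSE = (4 + 0 + 1 + 4) / 4 = 2.25
original = [52, 55, 61, 59]
compressed = [54, 55, 60, 57]
print(round(psnr(original, compressed), 2))  # 44.61
```

For a two-dimensional image the same function can be applied to the flattened list of pixels, since PSNR ignores pixel positions.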

Although PSNR is still widely used today, it has been criticised for insufficient correlation with the human visual system. For example, Huynh-Thu et al. [40] empirically established that PSNR can reliably indicate quality levels only within a single image or video. Across different images and videos, equal values of this metric do not correspond to equal amounts of visible quality degradation.

The SSIM metric and, more importantly, its windowed version, MSSIM (mean structural similarity), were designed from scratch by Wang et al. [2] based on several basic assumptions about the specifics and sensitivity of human vision. MSSIM was proposed as a more sophisticated alternative to the widely used PSNR. The authors' aim was to improve the approximation of human quality perception.

Between two images, MSSIM is calculated as the average of the SSIM metric values over all positions of an 11×11 sliding window. According to Wang et al., this approach is usually more reliable than simply calculating SSIM for the entire image because of higher attention to the details in each window.

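The SSIM in an 11×11 pixel window is calculated as follows:

$$\mathrm{SSIM} = \frac{(2\mu_x\mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)},$$

$$\mu_x = \sum_{i=1}^{N} w_i x_i, \qquad \sigma_x = \sqrt{\sum_{i=1}^{N} w_i (x_i - \mu_x)^2}, \qquad \sigma_{xy} = \sum_{i=1}^{N} w_i (x_i - \mu_x)(y_i - \mu_y),$$

where $N = 121$ is the number of pixels in the window; $C_1 = (0.01 \cdot L)^2$; $C_2 = (0.03 \cdot L)^2$; $L = 255$ is the maximum luminance; $x_i$ and $y_i$ are the corresponding pixel values; $w_i$ is a value from the 11×11 matrix of samples from the 2D Gaussian with $\sigma = 1.5$ in the range [–5; +5], normalised to unit sum. The quantities $\mu_y$ and $\sigma_y$ are defined in the same way for $y$.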
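The window formula and the sliding-window averaging described above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the thesis. The function names, the nested-list image representation and the synthetic test images are hypothetical. The window is moved one pixel at a time and only positions that fit fully inside the image are used.

```python
import math


def gaussian_window(size=11, sigma=1.5):
    """Flattened size x size Gaussian weights, normalised to sum to 1."""
    half = size // 2
    w = [math.exp(-(dx * dx + dy * dy) / (2.0 * sigma * sigma))
         for dy in range(-half, half + 1)
         for dx in range(-half, half + 1)]
    total = sum(w)
    return [v / total for v in w]


def ssim_window(x, y, w, peak=255.0):
    """SSIM of two flattened windows x and y with weights w."""
    c1 = (0.01 * peak) ** 2
    c2 = (0.03 * peak) ** 2
    mu_x = sum(wi * xi for wi, xi in zip(w, x))
    mu_y = sum(wi * yi for wi, yi in zip(w, y))
    # Weighted variances (sigma_x^2, sigma_y^2) and covariance (sigma_xy).
    var_x = sum(wi * (xi - mu_x) ** 2 for wi, xi in zip(w, x))
    var_y = sum(wi * (yi - mu_y) ** 2 for wi, yi in zip(w, y))
    cov = sum(wi * (xi - mu_x) * (yi - mu_y) for wi, xi, yi in zip(w, x, y))
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


def mssim(img_x, img_y, size=11, sigma=1.5, peak=255.0):
    """Mean SSIM over all positions of a size x size sliding window.

    Images are lists of rows of luminance values with equal dimensions.
    """
    height, width = len(img_x), len(img_x[0])
    if height < size or width < size:
        raise ValueError("image is smaller than the window")
    weights = gaussian_window(size, sigma)
    scores = []
    for top in range(height - size + 1):
        for left in range(width - size + 1):
            rows = range(top, top + size)
            cols = range(left, left + size)
            win_x = [img_x[r][c] for r in rows for c in cols]
            win_y = [img_y[r][c] for r in rows for c in cols]
            scores.append(ssim_window(win_x, win_y, weights, peak))
    return sum(scores) / len(scores)


# Synthetic 16x16 test pattern and a coarsely quantised copy (made-up data).
side = 16
original = [[(r * 13 + c * 7) % 256 for c in range(side)] for r in range(side)]
quantised = [[(v // 32) * 32 for v in row] for row in original]

print(mssim(original, original))   # identical images: 1.0 up to rounding
print(mssim(original, quantised))  # distorted copy: below 1.0
```

The pure-Python loops are slow for real frames. They are written this way only to keep each term of the formula visible.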

The range of theoretically possible values of the SSIM metric is [–1; +1], where 1.0 means identical images. However, in practice the metric values do not drop below zero.

Various comparisons between the PSNR and SSIM metrics, including the one made by Wang et al. [2] in the original paper, indicate that SSIM correlates better with subjective quality perception. For example, Hore et al. [41] compared the relative sensitivity of these two metrics on images compressed into the JPEG and JPEG 2000 formats. The metrics are closely correlated in the case of blurred images, but SSIM is more sensitive to JPEG blockiness than PSNR. Kotevski et al. [42] compared the PSNR and SSIM metrics for compressed videos and experimentally established that SSIM is considerably more adequate than PSNR for measuring quality degradation in video sequences. Kotevski et al. also note that SSIM is not a perfect metric and has some issues of its own, mainly due to reduced sensitivity to changes in brightness and contrast.

Butteraugli is a new image quality metric proposed by Google. It is defined by its reference implementation [43] and unfortunately lacks a comprehensive explanation of its mechanisms and basic principles. According to the information available, the Butteraugli project aims to reach a sufficient approximation of the way the human visual system reacts to minor quality differences. It is a full-reference metric that uses information from all colour channels to calculate a difference value, whereas the PSNR and SSIM metrics are typically calculated only for image luminance. The reference implementation of Butteraugli is relatively slow: it takes approximately 10 seconds for a Full HD frame, compared with about 1 second for the MSSIM metric and only a few milliseconds for PSNR.
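Because PSNR and SSIM are typically computed on luminance only, a colour image first has to be reduced to a single channel. The text does not say which conversion is used, so the sketch below assumes the widely used ITU-R BT.601 luma weights. The function names and the RGB-tuple image layout are illustrative.

```python
def luma(rgb):
    """Approximate luminance of one (R, G, B) pixel.

    Uses the BT.601 weights; this choice is an assumption, not taken
    from the thesis.
    """
    r, g, b = rgb
    return 0.299 * r + 0.587 * g + 0.114 * b


def to_luminance(image):
    """Convert a list of rows of (R, G, B) tuples to rows of luma values."""
    return [[luma(pixel) for pixel in row] for row in image]


# Hypothetical 1x3 image: pure red, pure green, pure blue.
rgb_image = [[(255, 0, 0), (0, 255, 0), (0, 0, 255)]]
print(to_luminance(rgb_image))
```

The example shows that the three primaries map to very different luma values, which is why a luminance-only metric cannot see distortions confined to chrominance.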

The most reliable method to estimate quality degradation in images and videos is to ask real people to rank them and calculate the mean opinion score (MOS). However, this approach is not feasible in many research projects, including this thesis. When the quality of the compressed material is calculated as one of the final stages of an experiment, obtaining MOS provides solid evidence for the result. However, if multiple quality measurements need to be conducted in real time to make certain decisions, using an objective quality metric is the only practical choice.
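As a small illustration of the MOS calculation mentioned above, the sketch below averages per-viewer ratings for each clip. The five-point rating scale, the clip names and the ratings themselves are made-up assumptions used only for the example.

```python
from statistics import mean


def mean_opinion_score(ratings):
    """Arithmetic mean of individual viewer ratings for one item."""
    if not ratings:
        raise ValueError("at least one rating is required")
    return mean(ratings)


# Hypothetical ratings on an assumed scale of 1 (bad) to 5 (excellent).
ratings = {
    "clip_a": [4, 5, 4, 3, 4],
    "clip_b": [2, 3, 2, 2, 1],
}
mos = {name: mean_opinion_score(values) for name, values in ratings.items()}
# Order the clips from best to worst perceived quality.
ranked = sorted(mos, key=mos.get, reverse=True)
print(mos, ranked)
```

The calculation is trivial. The cost of MOS lies in collecting the ratings, which is why it cannot be used for repeated real-time measurements.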

Although new alternative quality metrics continue to be introduced, only PSNR and SSIM/MSSIM remain widely used by the multimedia research community. The problem seems to lie in the balance between a metric's complexity and its correlation with human perception. When another, more complex quality metric is introduced, it becomes necessary to understand how reliable it is in different use cases. Even if the new metric demonstrates better correlation with the mean subjective opinion than SSIM in one scenario, this may not hold in another. For example, the authors of Butteraugli admit that it may not be a reliable measure of significant image distortions.

Consequently, researchers often prefer slightly less efficient metrics with relatively simple and transparent implementations to more complicated ones with lower reliability for general use.

Almost all quality measurements for compressed images and videos in this thesis were made using the MSSIM metric. One of the advantages is that due to its

resolution as well as length of the video.

It is important to emphasise that MSSIM is not an ideal quality metric. However, this thesis proposes methods which are independent of any particular metric by design. The logic behind this decision is that, although the results may differ slightly when a different metric is used, the methodology remains the same.

In document Using machine learning to select and optimise multiple objectives in media compression (Page 34-37)