Predicting the characteristics of compression results is an integral part
of this research. Machine learning methods require a considerable amount
of diverse training material to achieve reasonable accuracy in their statistical estimates.
To provide training samples for the regression models, a dataset of 2500 video segments was constructed. The segments were obtained from 47 YouTube videos, which were chosen more or less randomly but with the deliberate intention of diversifying the content of the whole dataset. The videos include amateur and professional footage containing urban and natural landscapes, people and animals, as well as some aerial drone footage.
The resolution of all original video clips was 4K (3840×2160 pixels). However, due to the relatively poor quality of the source material, the effective level of frame detail was increased by downsampling all videos to Full HD resolution (1920×1080 pixels). Between processing steps the videos were stored in the lossless *.y4m file format [77], which holds individual frames as raw pixel values in the YCbCr colour space. The FFmpeg tool [71] was used to decode and resize videos at all stages of the experiments.
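The decode-and-resize step described above can be sketched as follows. This is only an illustration under stated assumptions: the file names and the helper `build_downscale_command` are hypothetical, and the text does not say which scaling filter or options were actually passed to FFmpeg, so the default scaler is used here.

```python
from pathlib import Path


def build_downscale_command(src: str, dst: str,
                            width: int = 1920, height: int = 1080) -> list[str]:
    """Build an FFmpeg command that decodes `src` and writes a raw *.y4m file."""
    if Path(dst).suffix != ".y4m":
        raise ValueError("destination must be a *.y4m file")
    return [
        "ffmpeg",
        "-i", src,                         # compressed 2160p source (hypothetical name)
        "-vf", f"scale={width}:{height}",  # resample to Full HD
        "-pix_fmt", "yuv420p",             # 8-bit YCbCr with 4:2:0 subsampling
        "-an",                             # drop audio; y4m stores video only
        dst,
    ]


command = build_downscale_command("source_2160p.mp4", "source_1080p.y4m")
print(" ".join(command))
```

The command list can be passed to `subprocess.run(command, check=True)` on a machine where FFmpeg is installed.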
The optimal positions of key frames in each video were extracted from a compressed version of the 1080p video using FFprobe, another program from the FFmpeg toolset. The x265 codec with default settings was used to perform this compression and thereby define the key frame positions (fig. 4.6). The lossless videos were subsequently split into segments using a specially created program for frame-precise raw video sampling.

Figure 4.6: Video segments extraction procedure. (Diagram blocks: 2160p original compressed video; FFmpeg; 1080p raw original video; x265; 1080p compressed original video; FFprobe; key frame positions; *.y4m sampling program; raw video segments.)
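A minimal sketch of how key frame positions could be read with FFprobe and turned into segment boundaries is given below. The exact FFprobe invocation used in this work is not stated, so the command is an assumption; the helper names and the sample output are hypothetical, and frames are assumed to be listed in display order.

```python
def build_keyframe_probe_command(video: str) -> list[str]:
    """FFprobe command printing one line per frame: 1 for a key frame, 0 otherwise."""
    return [
        "ffprobe", "-v", "error",
        "-select_streams", "v:0",
        "-show_entries", "frame=key_frame",
        "-of", "csv=p=0",
        video,
    ]


def parse_key_frames(probe_output: str) -> list[int]:
    """Return zero-based indices of frames flagged as key frames."""
    positions = []
    index = 0
    for line in probe_output.splitlines():
        field = line.strip().split(",")[0]
        if field not in ("0", "1"):
            continue  # skip blank or unrelated lines
        if field == "1":
            positions.append(index)
        index += 1
    return positions


def segments_from_key_frames(key_frames: list[int], total_frames: int) -> list[tuple[int, int]]:
    """Turn key frame positions into (start, length) pairs covering the video."""
    bounds = list(key_frames) + [total_frames]
    return [(start, end - start) for start, end in zip(bounds, bounds[1:])]


# Hypothetical FFprobe output for a nine-frame clip.
sample_output = "1\n0\n0\n1\n0\n1\n0\n0\n0\n"
key_frames = parse_key_frames(sample_output)
segments = segments_from_key_frames(key_frames, total_frames=9)
print(key_frames)  # [0, 3, 5]
print(segments)    # [(0, 3), (3, 2), (5, 4)]
```

Each (start, length) pair identifies the frame range that a frame-precise sampling program would copy from the raw video.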
The length of the obtained video segments varied between 1 and 250 frames (fig. 4.7). By default, x265 limits the segment size to a maximum of 250 frames; any longer scene was therefore cut into smaller parts, which makes 250 the most common segment length.
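The effect of the 250-frame limit on long scenes can be illustrated with a simplified model. The function `split_scene` is hypothetical and captures only the upper bound on segment length (the default of the x265 `--keyint` option); the real encoder also applies scene-cut detection and a minimum key frame interval, which are not modelled here.

```python
def split_scene(scene_length: int, max_segment: int = 250) -> list[int]:
    """Split one scene into consecutive parts of at most `max_segment` frames."""
    if scene_length <= 0:
        raise ValueError("scene length must be positive")
    full_parts, remainder = divmod(scene_length, max_segment)
    return [max_segment] * full_parts + ([remainder] if remainder else [])


print(split_scene(600))  # [250, 250, 100]
print(split_scene(120))  # [120]
```

Under this model every scene longer than 250 frames contributes at least one segment of exactly 250 frames, which is consistent with 250 being the most common length.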
All raw video material was stored in the lossless *.y4m container format using 4:2:0 chroma subsampling (the Cb and Cr colour components have half the resolution in each dimension, 960×540 pixels). It was assumed that colour space transformation should follow the ITU-R BT.709 standard [78], although conversion to RGB was not explicitly used in the experiments. The quality metric was calculated for the luminance component Y under the assumption that x265 balances the chromatic components reasonably in accordance with the degradation of luminance quality. This assumption is based on the fact that the codec by default uses the same quantization parameter for all colour components (see the x265 manual [75], section Quality Control, parameters --cbqpoffs <int> and --crqpoffs <int>).
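To make the 4:2:0 layout concrete, the sketch below computes the plane sizes of a frame and extracts the luminance plane of each frame from a *.y4m byte stream. It is an assumption-laden illustration, not the sampling program used in this work: it handles only 8-bit 4:2:0 streams, and the function names and the tiny demo stream are hypothetical.

```python
def plane_sizes(width: int, height: int) -> tuple[int, int]:
    """Return (luma_bytes, bytes_per_chroma_plane) for one 8-bit 4:2:0 frame."""
    return width * height, ((width + 1) // 2) * ((height + 1) // 2)


def read_y4m_luma(data: bytes):
    """Yield the Y plane of every frame in an 8-bit 4:2:0 y4m byte stream."""
    header_end = data.index(b"\n")
    tokens = data[:header_end].split(b" ")
    if tokens[0] != b"YUV4MPEG2":
        raise ValueError("not a y4m stream")
    params = {token[:1]: token[1:] for token in tokens[1:] if token}
    width, height = int(params[b"W"]), int(params[b"H"])
    colour = params.get(b"C", b"420")
    if colour not in (b"420", b"420jpeg", b"420mpeg2", b"420paldv"):
        raise ValueError("only 8-bit 4:2:0 is handled in this sketch")
    luma, chroma = plane_sizes(width, height)
    pos = header_end + 1
    while pos < len(data):
        line_end = data.index(b"\n", pos)  # end of the per-frame header line
        if not data[pos:line_end].startswith(b"FRAME"):
            raise ValueError("missing FRAME marker")
        start = line_end + 1
        yield data[start:start + luma]     # Y plane comes first
        pos = start + luma + 2 * chroma    # skip the Cb and Cr planes


# Hypothetical two-frame stream with 4x2 pixel frames.
demo = b"YUV4MPEG2 W4 H2 F25:1 Ip A1:1 C420jpeg\n"
demo += b"FRAME\n" + bytes([16] * 8) + bytes([128] * 4)
demo += b"FRAME\n" + bytes([235] * 8) + bytes([128] * 4)
luma_planes = list(read_y4m_luma(demo))
print(len(luma_planes), luma_planes[1][0])
print(plane_sizes(1920, 1080))
```

For a 1920×1080 frame the luminance plane occupies 2073600 bytes and each chroma plane 518400 bytes, so a metric computed on Y alone reads two thirds of the frame data.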
Figure 4.7: Video length distribution in the dataset of 2500 segments. (Histogram; horizontal axis: video segment length, frames; vertical axis: number of segments.)
Figure 4.8: Dataset structure. (Diagram: the data set used to obtain regression models is divided into a training set of 1250 segments, 50%, and a development set of 1250 segments, 50%; the four test videos 1 to 4 contain 123, 24, 66 and 20 segments respectively.)
The dataset was divided equally into two sets: training and development (fig. 4.8). The training set was used in gradient calculations at each iteration of learning the model. The development set was employed to control overfitting during training and to choose the best model among several alternatives by the lowest error. The term “development set” is used by Andrew Ng in his recommendations on machine learning strategy [79]. This term reflects the purpose more accurately than the traditional “validation set”, although the meanings are similar. The models can evidently overfit to the development set as well; nevertheless, it can be regarded to a certain extent as an intermediate test set when creating the statistical models.
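The equal split and the selection of a model by lowest development error can be sketched as follows. The text does not state how segments were assigned to each set, so the seeded random shuffle is an assumption; the function names, model names and error values are hypothetical.

```python
import random


def split_train_dev(n_segments: int, dev_fraction: float = 0.5,
                    seed: int = 0) -> tuple[list[int], list[int]]:
    """Randomly assign segment indices to a training and a development set."""
    indices = list(range(n_segments))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    n_dev = int(round(n_segments * dev_fraction))
    return sorted(indices[n_dev:]), sorted(indices[:n_dev])


def select_best_model(dev_errors: dict[str, float]) -> str:
    """Pick the candidate with the lowest error on the development set."""
    return min(dev_errors, key=dev_errors.get)


train_ids, dev_ids = split_train_dev(2500)
print(len(train_ids), len(dev_ids))  # 1250 1250

# Hypothetical development-set errors of three candidate models.
best = select_best_model({"model_a": 0.21, "model_b": 0.17, "model_c": 0.19})
print(best)  # model_b
```

Because the development set drives model selection, its error is an optimistic estimate, which is why the final evaluation uses separate videos.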
There was no explicit test set allocated among the 2500 segments. Instead, the models were tested on a separate group of four videos. The reason is that this research aims to optimise real-world use cases that involve complete videos rather than individual segments. Table 4.1 provides more detailed information about these videos. The first video contains a large number of various short action scenes. The second test video contains long seamless scenes of gameplay from a modern 3D computer game. The third video mostly consists of highly detailed scenes with complex motion patterns. The last video contains images with smooth colour transitions and has a relatively small amount of motion.

Table 4.1: Short summary of four test videos.

| # | Video name | Content description | Frames | Segments | Source address |
|---|------------|---------------------|--------|----------|----------------|
| 1 | GoPro HERO5 + Karma: The Launch in 4K | Action scenes | 7255 | 123 | https://youtu.be/vlDzYIIOYmM |
| 2 | Horizon Zero Dawn PS4 Pro 4K Showcase | Video game recording | 5257* | 24 | https://www.digitalfoundry.net/2017-02-23-free-download-horizon-zero-dawn-ps4-pro-4k-showcase |
| 3 | New York in 4K | City landscapes, complex motion | 8421 | 66 | https://youtu.be/TmDKbUrSYxQ |
| 4 | Sony Glass Blowing Demo | Colourful scenes on a dark background | 2492 | 20 | https://youtu.be/74SZXCQb44s |

* The original frame rate was reduced by half to remove frame duplicates.
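The frame and segment counts in Table 4.1 already hint at how different the four test videos are. The short calculation below derives the mean segment length of each video from those two columns; it is added only as an illustration, and the variable names are hypothetical.

```python
# (frames, segments) per test video, copied from Table 4.1.
test_videos = {
    "GoPro HERO5 + Karma: The Launch in 4K": (7255, 123),
    "Horizon Zero Dawn PS4 Pro 4K Showcase": (5257, 24),
    "New York in 4K": (8421, 66),
    "Sony Glass Blowing Demo": (2492, 20),
}

mean_lengths = {
    name: frames / segments
    for name, (frames, segments) in test_videos.items()
}

for name, value in mean_lengths.items():
    print(f"{name}: {value:.1f} frames per segment")

total_frames = sum(frames for frames, _ in test_videos.values())
total_segments = sum(segments for _, segments in test_videos.values())
print(total_frames, total_segments)  # 23425 233
```

The first video averages about 59 frames per segment, in line with its many short action scenes, whereas the game recording averages about 219 frames, close to the 250-frame limit.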