Study of Compression Statistics and Prediction of Rate-Distortion Curves for Video Texture
Angeliki V. Katsenou*, Mariana Afonso and David Bull
Office 1.23, Visual Information Lab, One Cathedral Square, BS1 5DD, Bristol, UK
Keywords: Video Texture, Video Compression, Rate-Distortion Curves, HEVC
Abstract
Encoding textural content remains a challenge for current standardised video codecs. It is therefore beneficial to understand video textures in terms of both their spatio-temporal characteristics and their encoding statistics in order to optimize encoding performance. In this paper, we analyse the spatio-temporal features and statistics of video textures, explore the rate-quality performance of different texture types and investigate models to mathematically describe them. For all considered theoretical models, we employ machine-learning regression to predict the rate-quality curves based solely on selected spatio-temporal features extracted from uncompressed content. All experiments were performed on homogeneous video textures to ensure validity of the observations. The results of the regression indicate that, using an exponential model, we can more accurately predict the expected rate-quality curve (with a mean Bjøntegaard Delta rate of .46% over the considered dataset), while maintaining a low relative complexity. This is expected to be adopted by in-loop processes for faster encoding decisions, such as rate-distortion optimisation, adaptive quantization, partitioning, etc.
1. Introduction
Although recent video coding standards such as High Efficiency Video Coding (HEVC) (1), VP9 (2) and AV1 (3) have achieved impressive compression gains with significantly better rate-quality performance compared to their predecessors, they are all challenged by certain types of content, in particular complex dynamic textures (see examples in (4)). The Versatile Video Coding (VVC) (5) standard under development adopts a similar coding architecture and, while offering overall coding gains, still exhibits the same limitations. A recent statistical analysis of HEVC reference software (HM) performance has shown that the codec handles various types of texture very differently in terms of coding modes and bit rate (6). For example, for the homogeneous video texture patches in the HomTex (7) dataset, HM requires, on average, twice the number of bits per pixel (bpp) for dynamic discrete textures (e.g. falling leaves) compared to dynamic continuous textures (e.g. flowing water), and five times more than for static textures (e.g. a camera panning over grass). This work also showed that there is correlation between encoding statistics and texture types. For example, dynamic continuous textures tend to use more intra modes.

Knowledge about the compression characteristics of video content prior to encoding can be exploited in various situations, including: off-line rate-quality optimisation of video-on-demand streamed content, multi-pass encoding, rate control, statistical multiplexing, and in-loop rate-distortion optimisation (8; 9; 10; 11; 12). One way of obtaining such knowledge is through multi-pass encoding, where encoder settings are adjusted according to post-encoding statistics (8; 11). Another approach is that embodied in the Netflix Dynamic Optimiser (9; 10). The algorithm consists of encoding video shots multiple times with different parameters (e.g.
at different quantisation levels and/or different spatial resolutions); it then constructs a convex hull in rate-quality space and combines points from the convex hull to create an encode profile for the entire video sequence.

Other methods invest in using content features to build more efficient (13; 14; 15) or fast (16; 17; 18; 19) encoding mechanisms. In (13; 14; 15) the frames are segmented into textural areas, and the depth decision/synthesis method/synthesis mode (respectively) is based on models built on the explored correlations of textural features to

[Footnote: All authors were with the Bristol Vision Institute and the Visual Information Lab, Department of Electrical and Electronic Engineering, University of Bristol, UK. The work presented was supported by the "Marie Skłodowska-Curie Actions - The EU Framework Programme for Research" project PROVISION, the Engineering and Physical Sciences Research Council (EPSRC), EP/M000885/1, and the Leverhulme Early Career Fellowship (ECF-2017-413). Corresponding author: [email protected] (A.V. Katsenou).]
compression or the content-tailored methodologies. In (16; 17; 18; 19) the authors explore the correlation of textural features to encoding statistics in order to make faster decisions, e.g. for the prediction mode or the partitioning, and, subsequently, to reduce the encoding complexity.

For video streaming applications, which do not have the time constraints of a live transmission and have the advantage and flexibility of offline processing, approaches such as dynamic optimisation have many advantages. However, this is not a generic coding solution; it is computationally intensive and comes with a financial cost (12). In such cases, reducing the computational load would be beneficial. On the other hand, in cases where multi-pass methods are not appropriate, or where computing resources are limited, it would be helpful to have data extracted from uncompressed content that characterise the video's likely rate-quality performance in the context of the selected encoding method. Such a method could inform encoding decisions much more precisely and more efficiently than using multiple encodings, and statistics would only need to be extracted once from the original sequence.

The above challenges and observations provide the motivation to extract knowledge from uncompressed video content by analysing statistics of its spatio-temporal features, with the aim of characterising the relationship between content and compression performance. Firstly, we build on the works already presented in (20; 6; 21) and further extend them, as explained next. In (21), we worked on identifying and defining the types of textures, recognising features that act as their descriptors in the spatio-temporal or frequency domain, and used those to classify them. Here, we provide more details of the feature statistics per texture type and we further demonstrate that such features correlate with the encoding statistics that were extracted and analysed in (6).
We then explore the Rate-Distortion (RD) curves for the different texture types and show that it is beneficial to extract RDs per group of pictures. Based on this, we revisit the mathematical model previously proposed in (20) and develop suitable new mathematical models to represent these RD curves. Moreover, we explore the relations between the RD parameters in order to reduce the number of required predictions. For the training of the machine-learning based regression, we use a Random Forest (RF) based method to select the optimal feature set, which varies slightly across the different tested model parameters. We then perform a comparative analysis of the different models in terms of their prediction performance and computational complexity. Finally, we recommend a model that balances the trade-off between relative computational complexity and prediction performance.

The structure of the paper is as follows: Section 2 presents an analysis of video texture, firstly as uncompressed content and secondly as compressed video, through their statistics; Section 3 studies and models the RD curves of the homogeneous textures; Section 4 presents the results on the prediction of the RD curves for all considered mathematical models on homogeneous content. Finally, conclusions and future work are outlined in Section 5.
2. Analysing Video Content for Compression Purposes
The literature is rich in contributions relating to texture analysis (22; 23; 24), but most of these are in the context of computer vision-based recognition systems (25) and most deal only with still images. Our focus here, however, is the study of video compression performance. Video textures, particularly dynamic textures, are recognised as presenting significant challenges for video encoders (6). In this section, firstly we explore the relationships between various spatio-temporal features extracted from uncompressed videos in order to identify common characteristics. Secondly, we encode a dataset of homogeneous video textures and independently study the associated encoding decisions and compression performance. Finally, we correlate the spatio-temporal features with the encoding decisions and performance.

In order to perform meaningful video analysis, we have adopted the HomTex (7; 6) dataset; HomTex comprises homogeneous video annotated by experts. We also adopt the definition of video texture categories used in our previous work (21):

• Static: rigid texture that exhibits perspective motion, typically a moving solid object or a static background shot with camera motion, e.g. camera panning over a carpet.

• Dynamic Continuous: spatially irregular texture, with no clear structure, moving as a continuum, e.g. water, deformable surfaces or smoke.

• Dynamic Discrete: spatially regular or irregular texture that consists of perspectively moving, independent, discernible parts or structures, e.g. leaves moving in a blowing wind.

The aforementioned categorisation has been validated in our previous work (6; 21), both from the perspective of the spatio-temporal characteristics of the uncompressed video content, as well as from the perspective of the encoding
Figure 1: Examples of sequences from HomTex for the three different texture types that indicate the content variety. (a) Static video textures: Bricks-Tilting-Wall1, Painting, PinkFlowers, TreeTrunk, BlueCarpet (from left to right). (b) Dynamic continuous video textures: CalmingWater, Steam, Waterfall, Flag, CalmSea (from left to right). (c) Dynamic discrete video textures: TreeLeaves, RiceField, MovingField, ThinBranches, YellowFlowers (from left to right).

Table 1
List of features and statistics (20; 21).
Feature | Statistics
Gray Level Co-occurrence Matrix (GLCM) (23) | F1. meanGLCM_con, F2. stdGLCM_con, F3. meanGLCM_cor, F4. stdGLCM_cor, F5. meanGLCM_hom, F6. stdGLCM_hom, F7. meanGLCM_enr, F8. stdGLCM_enr, F9. meanGLCM_ent, F10. stdGLCM_ent
Normalised Cross-Correlation (NCC) (26; 20) | F11. NCC_mean, F12. NCC_std, F13. NCC_skw, F14. NCC_kur, F15. NCC_ent
Average Local Peak Distance (ALPD) (20) | F16. ALPD_mean, F17. ALPD_std
Normalised Laplacian Pyramids (NLP) (27) | F18. NLP_mean, F19. NLP_std, F20. NLP_skw, F21. NLP_kur
Temporal Coherence (TC) (20) | F22. meanTC_mean, F23. stdTC_mean, F24. meanTC_std, F25. stdTC_std, F26. meanTC_skw, F27. stdTC_skw, F28. meanTC_kur, F29. stdTC_kur, F30. meanTC_entr, F31. stdTC_entr
Optical Flow (OF) (28) | F32. meanOF_mag, F33. stdOF_mag, F34. meanOF_or, F35. stdOF_or, F36. meanOF_curl, F37. stdOF_curl, F38. meanOF_ang, F39. stdOF_ang, F40. stdOF_covVx, F41. meanOF_covVy, F42. stdOF_covVy, F43. meanOF_covVxVy, F44. stdOF_covVxVy

decisions, statistics and performance. Examples of videos in these categories are illustrated in Fig. 1. For example, sequence Bricks-Tilting-Wall1 is a static texture where the camera pans over the wall. In the dynamic continuous texture CalmingWater, the camera is static but the water is moving continuously, as it is triggered by a stream. In the dynamic discrete texture, TreeLeaves, a static camera captures the leaves moving irregularly in the wind.

Textural features are conventionally defined with the purpose of facilitating similarity, browsing, retrieval and classification applications (22; 23; 24; 29; 30; 14; 15; 31). Additionally, most works have only considered static textures, namely images (23; 24; 22; 29; 30). Hence, most textural features do not capture the dynamic characteristics that texture exhibits in videos. Some of these features have previously been used for spatial segmentation in video synthesis and coding (13; 14; 15).

(Figure 2, panels (a)-(d): boxplots of meanGLCM_con, meanGLCM_cor, meanGLCM_hom and NCC_mean per texture type.)

An effort to synergise spatial and temporal texture features in video (but only
for classification purposes) is reported in (31). Additionally, there are features in the literature that try to model the dynamic nature of texture, such as the motion co-occurrence matrix (32), adopted also in (33).

Figure 2: Distribution of several extracted spatio-temporal features and their statistics for the static (orange), dynamic continuous (blue) and dynamic discrete (green) video textures. (Further panels: NCC_std, ALPD_mean, ALPD_std, NLP_std, meanTC_mean, meanTC_kur, stdTC_kur, meanOF_mag, meanOF_or, stdOF_or.)

The spatio-temporal features employed in this work have been selected by investigating the vast variety of features in the literature and by modifying some, so that they cover the basic characteristics of video texture that relate to encoding difficulty, i.e. spatial diversity, coarseness and motion, as shown in (21) (and as will be shown in Section 2.3). Note that these features are not a unique set and could be replaced by others that similarly capture the video texture characteristics. However, we adopt those that have been successfully used in our previous work for compression-related tasks, such as the recommendation of the perceptually optimal frame rate (34), intelligent adaptation of spatial resolution prior to video encoding (35), and the prediction of a content-driven bitrate ladder for adaptive video streaming (36).

Following the analysis used in the aforementioned works, six spatio-temporal features and their statistics (49 in total) were extracted, as listed in Table 1. The GLCM (23) is a traditional spatial textural feature that expresses the intensity contrast of neighbouring pixels in an image, thus capturing the degree of coarseness and directionality of the texture. For the present frame $I_t$, let $G$ be the GLCM, whose element $G_{ij}$ is the number of occurrences of the pixel pair $ij$ with intensity values $Y_i$, $Y_j$. The probability that a pixel pair $ij$ assumes the values $Y_i$, $Y_j$ is $p_{ij} = G_{ij}/K$, where $K$ is the total number of occurrences.
GLCM has five main descriptors: contrast (con), correlation (cor), energy (enr) (or uniformity), homogeneity (hom) and entropy (ent), which are formally defined in the equations below:

$\mathrm{GLCM}_{con} = \sum_{i=1}^{M} \sum_{j=1}^{N} (i-j)^2 \, p_{ij},$ (1)

$\mathrm{GLCM}_{cor} = \sum_{i=1}^{M} \sum_{j=1}^{N} \frac{(i-m_r)(j-m_c) \, p_{ij}}{\sigma_r \sigma_c},$ (2)

$\mathrm{GLCM}_{enr} = \sum_{i=1}^{M} \sum_{j=1}^{N} p_{ij}^2,$ (3)

$\mathrm{GLCM}_{hom} = \sum_{i=1}^{M} \sum_{j=1}^{N} \frac{p_{ij}}{1 + |i-j|},$ (4)

$\mathrm{GLCM}_{ent} = -\sum_{i=1}^{M} \sum_{j=1}^{N} p_{ij} \log p_{ij},$ (5)

where $M$, $N$ are the row and column dimensions respectively, $m_r$, $m_c$ the means, and $\sigma_r$, $\sigma_c$ the standard deviations along the rows and columns of both $I_t$ and $G$. We computed the mean values of the GLCM descriptors on a frame level, and the mean and standard deviation of all these descriptors over the whole sequence, as seen in Table 1.

The NCC (26), which is commonly used in image processing applications for spatial similarity purposes, is used as in (20) as a spatio-temporal feature by capturing the peaks of cross-correlation between successive frames. It assumes values within the range $[-1, 1]$, with its maximum value indicating maximum correlation and vice versa. In this paper, NCC examines the spatial similarity of two successive frames, $I_{t-1}$ and $I_t$, using a sliding matching template window $T$ of $w \times w$ size from the reference frame $I_{t-1}$:

$\mathrm{NCC} = \dfrac{\sum_{i=1}^{M} \sum_{j=1}^{N} \bigl[I_t(i,j) - \bar{I}_t(u,v)\bigr]\bigl[T(i-u, j-v) - \bar{T}\bigr]}{\sqrt{\sum_{i=1}^{M} \sum_{j=1}^{N} \bigl[I_t(i,j) - \bar{I}_t(u,v)\bigr]^2 \, \sum_{i=1}^{M} \sum_{j=1}^{N} \bigl[T(i-u, j-v) - \bar{T}\bigr]^2}}$ (6)

where $u$, $v$ define the area covered by the window $T$. By using an overlapping template between successive frames, we computed the mean, standard deviation, skewness, kurtosis and entropy of the highest peaks between successive frames. All these statistics were averaged over the sequence length.

As an additional measure of coarseness, we employed the ALPD in the third level of the Discrete Wavelet Transform (DWT), inspired by (22; 30).
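As a concrete illustration, the descriptors of Eqs. (1)-(5) can be computed from a small co-occurrence count. The sketch below is our own illustrative code, not the paper's implementation; the horizontal-adjacency pairing and the number of quantisation bins are assumptions not specified above:

```python
import numpy as np

def glcm_descriptors(frame, levels=8):
    """Build a horizontal-adjacency GLCM and return its five descriptors.

    `frame` is a 2-D array of grey levels; intensities are quantised to
    `levels` bins before counting co-occurrences (an assumed choice).
    """
    q = (frame.astype(float) / (frame.max() + 1e-9) * (levels - 1)).round().astype(int)
    G = np.zeros((levels, levels))
    for a, b in zip(q[:, :-1].ravel(), q[:, 1:].ravel()):
        G[a, b] += 1                      # count horizontally adjacent pairs
    p = G / G.sum()                       # p_ij of Eqs. (1)-(5)
    i, j = np.indices(p.shape)
    con = ((i - j) ** 2 * p).sum()        # contrast, Eq. (1)
    enr = (p ** 2).sum()                  # energy, Eq. (3)
    hom = (p / (1 + np.abs(i - j))).sum() # homogeneity, Eq. (4)
    nz = p[p > 0]
    ent = -(nz * np.log2(nz)).sum()       # entropy, Eq. (5)
    mr, mc = (i * p).sum(), (j * p).sum()
    sr = np.sqrt(((i - mr) ** 2 * p).sum())
    sc = np.sqrt(((j - mc) ** 2 * p).sum())
    cor = (((i - mr) * (j - mc) * p).sum() / (sr * sc)) if sr * sc > 0 else 0.0
    return dict(con=con, cor=cor, enr=enr, hom=hom, ent=ent)
```

For a constant frame this yields zero contrast and entropy and unit energy and homogeneity, matching the intuition that smooth static patches are "easy" content.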
For all sequences, we computed the average local peak distance on a frame level, as below:

$\mathrm{ALPD} = \frac{1}{N} \sum_{n=1}^{N} \frac{\sum_{k=2}^{K} \bigl( D_{n,p_k} - D_{n,p_{k-1}} \bigr)}{K - 1},$ (7)

where $k \in \{1, 2, \dots, K\}$ indexes the peaks and $D_{n,p_k}$ denotes the position of peak $p_k$ in frame $n$ in the DWT domain. Then, we computed the mean and standard deviation over all frames.

If we assume that each frame $I_t$ is a distorted version of its previous neighbour $I_{t-1}$, we could use the NLPs (27) (or any other metric) to express this level of "distortion" at different scales, as follows:

$\mathrm{NLP} = \frac{1}{N} \sum_{k=1}^{N} \frac{1}{N_s^{(k)}} \bigl\| I_t^{(k)} - I_{t-1}^{(k)} \bigr\|,$ (8)

where $I_t^{(k)}$ denotes the NLP coefficients of $I_t$ at scale $k$, $N_s^{(k)}$ is the number of coefficients at scale $k$, and $N$ is the number of scales. Thus, NLPs attempt to capture hierarchical frame relationships. We computed the mean, standard deviation, skewness, kurtosis and entropy of the NLP between successive frames, and then averaged these statistics over the total number of frames.

In order to express how easy or difficult it is to predict one frame $I_t$ from its previous temporal neighbour $I_{t-1}$, we used TC, as in (20). It is computed using the Fast Fourier Transform (FFT) (37) and is defined as follows:

$\mathrm{TC} = \frac{P_{I_{t-1} I_t}^2}{P_{I_{t-1} I_{t-1}} \, P_{I_t I_t}},$ (9)

where $P_{I_{t-1} I_{t-1}}$ is the auto-spectral density of $I_{t-1}$ and $P_{I_{t-1} I_t}$ the cross-spectral density of frames $I_{t-1}$ and $I_t$. TC is normalised within the range $[0, 1]$ and assumes its maximum value for static or purely translational motion between two successive frames. We computed the mean, standard deviation, skewness, kurtosis and entropy of this ratio between successive frames. Then, we took the mean and standard deviation of all these statistics over the length of the sequence.

The list of spatio-temporal features is completed with the OF (28), which is based on a polynomial expansion.
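Note that Eq. (9) only discriminates motion types if the spectral densities are estimated with some averaging; a single periodogram makes the per-bin ratio trivially one. The sketch below is an illustrative reading of TC under a block-averaged (Welch-style) estimate, which is our assumption rather than a detail given in the paper:

```python
import numpy as np

def temporal_coherence(prev, curr, block=8, eps=1e-12):
    """Mean spectral coherence between two frames, in [0, 1].

    Auto- and cross-spectral densities are estimated by averaging
    periodograms over non-overlapping block x block tiles (our choice;
    without such averaging the per-bin coherence is trivially 1).
    Returns 1 for identical frames and decays as the second frame
    becomes harder to predict from the first.
    """
    H, W = prev.shape
    Pxy = np.zeros((block, block), complex)
    Pxx = np.zeros((block, block))
    Pyy = np.zeros((block, block))
    for y in range(0, H - block + 1, block):
        for x in range(0, W - block + 1, block):
            X = np.fft.fft2(prev[y:y + block, x:x + block])
            Y = np.fft.fft2(curr[y:y + block, x:x + block])
            Pxy += X * np.conj(Y)        # cross-spectral density estimate
            Pxx += np.abs(X) ** 2        # auto-spectral densities
            Pyy += np.abs(Y) ** 2
    coh = np.abs(Pxy) ** 2 / (Pxx * Pyy + eps)
    return float(coh.mean())
```

Two independent noise frames give a value far below one, while a frame compared against itself gives one, consistent with the behaviour described for Eq. (9).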
The OF descriptors and statistics are very important for the characterisation of the dynamic textures, since the dynamic continuous textures exhibit different OF patterns compared to the dynamic discrete textures. We extracted the OF fields along with the following statistics: the mean and standard deviation of the magnitude, of the orientation, of the curl, of the angular velocity, of the covariance of the horizontal OF vectors, of the covariance of the vertical OF vectors, and of the covariance of the horizontal and vertical OF vectors.

Figure 2 depicts boxplots of the extracted features and their statistics, demonstrating the feature distributions for the three types of video texture. Overall, many features demonstrate good selectivity between the different types of texture, notwithstanding some overlapping distributions. As expected, dynamic continuous textures express lower meanGLCM_con values compared to static and dynamic discrete textures (see Fig. 2 (a)). This can be attributed to the lower density of edges (high spatial frequencies) in these types of textures. Because of this, the meanGLCM_cor and meanGLCM_hom (see Fig. 2 (b)-(c)) are higher compared to the other two types. We also note that the distributions of meanGLCM_con and meanGLCM_cor are wider for the dynamic discrete textures than for the static cases.

The mean and standard deviation of NCC (see Fig. 2 (d)-(e)) are, as expected, higher and lower, respectively, for the static textures compared to the dynamic types. This is because, for static video textures, the differences across successive frames are usually small. Regarding the dynamic textures, the mean NCC distribution ranges over higher values for discrete textures than for continuous ones. The distribution of ALPD_mean/std (see Fig. 2 (f)-(g)) is narrow and in a low value range for dynamic discrete textures, as anticipated.
This is attributed to the fact that these video textures exhibit dense edges, which leads to a low peak distance. On the contrary, ALPD_mean/std is higher for dynamic continuous textures and, for the same reasons, NLP_std (see Fig. 2 (h)) is lower compared to the other two types. A good way to discriminate dynamic continuous texture from static and discrete is by the meanTC_mean/kur values (see Fig. 2 (i)-(j)), as they are significantly lower/higher, respectively. This is attributed to the low density of edges and the deformable nature of this type of content, which results in lower cross-spectral density. Lastly, although the OF statistics (see Fig. 2 (l)-(n)) appear less selective compared to the other features, they show a clear difference in the span of the magnitude and orientation of the OF vectors. In particular, for dynamic discrete textures, which are characterised by very fine local motion of usually fine-grained content, the distribution is narrow and in the lower range of values. On the other hand, meanOF_mag is wider for static and even more so for dynamic continuous textures. These two can be well differentiated by mean/stdOF_or, which are significantly different: the OF orientation has a higher but quite uniform value for static textures, while it is smaller on average but more variable across frames for the dynamic continuous textures.

HEVC encoding statistics were extracted using the test model version HM16.20 and the encoding analyser software, Harp (38). All the sequences from the HomTex dataset were encoded using the Main profile and three configurations: Random Access, Low Delay and All Intra. The initial quantization parameter (QP) was set to five commonly used values, QP = {22, 25, 27, 32, 37}, that capture the rate-distortion curve. A total of 39 statistics were obtained from the encoding process at the Coding Tree Unit (CTU) level.
These were then post-processed to obtain the statistics from the encoding decisions and performance per sequence for the different frame types (I, B and P). Table 2 summarises the encoding decisions and statistics that were extracted. As a measure of correlation between the original and the residual frames, the 2-D Pearson product-moment correlation coefficient was used, considering only the luminance component.

Figure 3 depicts the distributions of a subset of the statistics for the B frames of the Random Access configuration, using a QP value of 25, for all three types of texture (as annotated by experts). The prediction modes selected per Coding Unit (CU) vary significantly for different texture types (see Fig. 3 (a)-(d)). As expected, static textures are associated with a high percentage of Skip mode, similarly low percentages of Merge and Inter, and almost zero usage of Intra mode, due to the simplicity of the motion present (camera panning or zooming). Dynamic continuous textures rely mostly on Intra mode, with fewer Inter, Skip or Merge modes. This implies that motion prediction frequently fails for these textures. Discrete dynamic textures, on the other hand, rely mostly on Inter and Merge modes, as they exhibit distinct motion that can be effectively predicted, with reduced use of Intra mode.

Since different texture types exhibit different spatial patterns, it is expected that the number of partitions in a CTU will also differ. In particular, when the content has fine texture (high granularity), the CTUs are usually highly divided. Thus, as seen in Fig. 3 (e), the highest average number of partitions per CTU is observed for discrete dynamic textures (with a median equal to 35), then for continuous dynamic textures (median of nine), and the lowest number is recorded for static textures (with a median equal to four).

The RD performance also varies with texture type, as demonstrated in Fig. 3 (f)-(g). As expected, static textures
Table 2
Statistics extracted from HM during the encoding process (6).
Category | Statistics
Prediction modes | intra (%), stdIntra, Skip (%), stdSkip, merge (%), stdMerge, inter (%), stdInter
Reference indexes | ref0 (%), ref1 (%), ref2 (%), ref3 (%)
Partitioning | avgPart, stdPart
Bits | avgBits, stdBits
Distortion | avgDist, stdDist
Bit allocation | bitsModeSignal (%), bitsPart (%), bitsIntraDir (%), bitsMergeIdx (%), bitsMotionPred (%), bitsResidual (%), bitsOthers (%)
Residual statistics | avgMSEresi, stdMSEresi, avgMSERecError, stdMSERecError, avgCorrResi, stdCorrResi, avgCorrCodedResi, stdCorrCodedResi
Intra mode | DCIntra, PlanarIntra, avgIntraDir, stdIntraDir
Motion vectors | avgLengthMV, stdDistMV

(Figure 3, panels (a)-(n): boxplots per texture type of Intra (%), Skip (%), Merge (%), Inter (%), avgPart, avgBits, avgDist, avgCorrResi, bitsModeSignal (%), bitsPart (%), bitsMotPred (%), bitsResidual (%), avgLengthMV, stdDistMV.)

Figure 3:
Distribution of a subset of the extracted encoding statistics for static (red), dynamic continuous (blue) and dynamic discrete (green) video textures.

require a smaller number of bits (a median value of 0.0043 bits per pixel (bpp)) to encode compared to dynamic textures, which require 30 times more bpp in the dynamic continuous case and 80 times more in the dynamic discrete case, while exhibiting lower distortion (SAD of the residual) for the same QP. It is also interesting to observe that, although dynamic discrete textures require on average 1.8 times the bitrate of continuous ones, they also increase the distortion by, on average, 1.97 times.

The bit allocation for both types of dynamic textures in Fig. 3 (l) shows that the majority of the total bits are spent on residual coding: on average, 77% of the generated bits for dynamic continuous and 84% for dynamic discrete video texture. Taking into account that the other bit categories, used for coding additional information such as motion vectors, mode signalling, etc. (see Fig. 3 (i)-(k)), contribute only a small percentage to the total encoding bitrate, we can also see that the residual for dynamic textures typically exhibits very high energy, which explains the high RD statistics. This is further confirmed by the high distortion and the high correlation between the original frame and the residual, as illustrated in Fig. 3 (h). Contrary to dynamic textures, static textures exhibit a more uniform bit allocation among the different categories, as well as a low correlation of the residual signal to the reference.

Motion is another important characteristic. Static textures are associated with small-magnitude motion vectors that show directional consistency. As expected, dynamic discrete textures generally have small-magnitude motion vectors with high values in the distribution of directional irregularity.
On the other hand, continuous textures are associated with a wide range of motion vector magnitudes, with a slightly lower median value of irregularity in the direction.
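The per-QP encoding sweep that produced these statistics (one encode per QP in {22, 25, 27, 32, 37}) is easy to script. In the sketch below the HM binary name, configuration file and flag spellings follow TAppEncoder's conventional interface but are placeholders, not taken from the paper; they should be checked against the HM version in use:

```python
import subprocess

QPS = [22, 25, 27, 32, 37]  # the five QP values used throughout the paper

def hm_commands(sequence, cfg="encoder_randomaccess_main.cfg",
                encoder="./TAppEncoderStatic"):
    """Build one HM command line per QP for a rate-distortion sweep.

    The -c/-i/-b/-q flags follow TAppEncoder's usual interface; the
    encoder binary and config file name are placeholders.
    """
    return [[encoder, "-c", cfg,
             "-i", sequence,
             "-b", f"{sequence}.qp{qp}.bin",  # one bitstream per RD point
             "-q", str(qp)]
            for qp in QPS]

def run_sweep(sequence):
    """Run all five encodes sequentially."""
    for cmd in hm_commands(sequence):
        subprocess.run(cmd, check=True)
```

Each resulting bitstream contributes one (rate, distortion) point; collecting the per-CTU statistics from the encoder log is then a separate parsing step.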
In this subsection, we validate the use of spatio-temporal features from uncompressed content to predict encoding behaviour and performance. Figure 4 depicts a visualisation of the linear correlation matrix computed between the spatio-temporal features and the encoding decisions and statistics. The colour bar indicates the range of values, where blue colours indicate positive correlation and red colours indicate negative correlation. Higher linear correlation is represented by darker blue or red colours (for values close to 1 or -1, respectively). This matrix can be seen as a way of finding the "strongest" candidate features with which to build prediction models for the encoding performance using spatio-temporal features.

Figure 4:
Pearson correlation matrix of the extracted spatio-temporal features and the encoding decisions and statistics.
As can be seen from the correlation matrix, not all of the features demonstrate strong correlation with all of the encoding decisions and statistics. For some of the features and their statistics, however, the correlation matrix shows a strong linear relationship, indicating that the spatio-temporal features could be utilised to predict encoding performance. In particular, the GLCM descriptor statistics are highly correlated with the partitioning statistics (avg/stdPart), the number of bits required for encoding (avg/stdBits), the resulting quality (expressed either as avg/stdDist or as avg/stdMSERecError) and the residual statistics (avg/stdMSEresi). The TC statistics are highly correlated with certain prediction modes, i.e. Intra (%), Skip (%) and Inter (%), as can be seen in Fig. 4. The prediction modes are also highly correlated with the NCC statistics. Furthermore, the NCC statistics are correlated with the allocated bits statistics and the residual signal reconstruction error. Compared to the aforementioned features, the OF and NLP statistics show lower linear correlation values with the encoding decisions and statistics. The highest correlation values for OF reported in the matrix concern the standard deviation of the OF vectors' orientation, the partitioning statistics and the reconstruction error.
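A correlation matrix of the kind shown in Fig. 4 is straightforward to reproduce for one's own data. A minimal numpy sketch (ours, with hypothetical array shapes; rows are sequences, columns are features or encoding statistics):

```python
import numpy as np

def cross_correlation_matrix(features, stats):
    """Pearson correlations between features and encoding statistics.

    features: (n_sequences, n_features) array of spatio-temporal features.
    stats:    (n_sequences, n_stats) array of encoding statistics.
    Returns the (n_features, n_stats) off-diagonal block of the joint
    correlation matrix: one coefficient per feature/statistic pair.
    """
    n_f = features.shape[1]
    joint = np.corrcoef(np.hstack([features, stats]), rowvar=False)
    return joint[:n_f, n_f:]
```

Values near +1 or -1 mark the "strongest" candidate predictors; values near zero mark features with little linear relationship to that encoding statistic.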
3. Predicting RD Curves
In this section, we explore the links between RD curves and video texture characteristics, study their behaviour for different texture classes, and fit them using mathematical models. One of the few approaches to understanding the coding performance of video textures was in (39), where the authors categorise sequences as static, dynamic and mixed, and study the RD curves of the compressed sequences using the HEVC and Advanced Video Coding (AVC) (40) reference software. To the best of our knowledge, there exists no other reported work that studies the RD properties of homogeneous video textures.

For the purposes of our study, we require a large amount of homogeneous video texture data. We extract the RD curves per Group of Pictures (GoP) per sequence, thus expanding the utility of the dataset. As can be seen from Fig. 5 (a), where the mean and per-GoP curves are plotted, using the per-GoP data can help cover a wider range of curves. This is useful both for analysis and regression purposes. Furthermore, in order to assess how much the curves within a GoP might vary, we computed the Bjøntegaard Delta PSNR (BDPSNR) (41) of the per-GoP curves over the mean. In Fig. 5 (b), we present the cumulative histogram of BDPSNR values of the RD curves per GoP over the RD of each sequence. As can be observed, in many cases the RD curves per GoP are very close in terms of quality to the overall RD curve of the sequence, evidenced by the mean value of the BDPSNR histogram being less than 0.01 dB. However, the standard deviation of 1.81 dB indicates a significant difference in the vertical shift of the RD curves. The difference between the mean and per-GoP RDs becomes more significant when we inspect the difference in terms of bit rate in Fig. 5 (c), where the mean BDRate is 16.41% and the standard deviation is 61.39%. Finally, to better understand how the BD metrics are distributed in terms of texture type, we scattered BDPSNR against BDRate, coloured according to expert annotations, in Fig. 5 (d).
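The BDPSNR and BDRate figures quoted here follow the standard Bjøntegaard procedure: fit a cubic in the log-rate domain, integrate over the overlapping quality interval, and average. A numpy sketch of the BDRate side (our illustrative implementation, not the paper's code):

```python
import numpy as np

def bd_rate(rate_ref, psnr_ref, rate_test, psnr_test):
    """Bjoentegaard Delta rate (%) between two RD curves.

    Fits a cubic of log10(rate) as a function of PSNR for each curve,
    integrates both over the overlapping PSNR interval, and converts
    the mean log-rate gap back to a percentage bitrate difference.
    """
    lr_ref, lr_test = np.log10(rate_ref), np.log10(rate_test)
    p_ref = np.polyfit(psnr_ref, lr_ref, 3)
    p_test = np.polyfit(psnr_test, lr_test, 3)
    lo = max(min(psnr_ref), min(psnr_test))   # overlapping PSNR interval
    hi = min(max(psnr_ref), max(psnr_test))
    int_ref = np.polyval(np.polyint(p_ref), hi) - np.polyval(np.polyint(p_ref), lo)
    int_test = np.polyval(np.polyint(p_test), hi) - np.polyval(np.polyint(p_test), lo)
    avg_diff = (int_test - int_ref) / (hi - lo)
    return (10 ** avg_diff - 1) * 100
```

A test curve needing exactly double the rate at every quality level comes out at +100%, and a curve compared against itself at 0%, as expected.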
As can be seen, for most of the static sequences, although the BDPSNR is close to zero, the distribution of BDRates is very wide. On the other hand, it is noticeable that the variation of BDRate and BDPSNR follows a similar curve for the dynamic textures.

In Fig. 6 (a), we illustrate the RD curves for all HomTex sequences coloured according to the expert annotations (red for static, blue for dynamic continuous and green for dynamic discrete textures) and, in (b), the average RD curves per texture type. As can be seen from Fig. 6 (a), the sequences are generally naturally clustered. There are, however, cases where certain sequences that were classified by the experts, based on context, as one type of texture behave similarly to another type. In Fig. 6 (b), the vertical ranges denote the standard deviation of the PSNR, while the horizontal ranges denote the bit rate standard deviation. This figure shows the average RD performance per texture type and confirms, through the overlapping standard deviation ranges, that there are indeed sequences that perform similarly to other texture types. For example, CalmSea is annotated by the experts as a dynamic continuous texture. True to its name, the scene depicts sea water that is not moving fast; it is therefore easy to predict and thus compress.

First, using ordinary least squares, we explore several different families of models, such as polynomial, exponential and power models, to find the best fit for the log(R)-PSNR curves of video textures. The linear model was proposed in our previous work (20) and serves as a basis for comparison. The 3rd-order polynomial is generally used to compute BD metrics (41) and is also explored here. Table 3 reports the explored models along with their goodness-of-fit values, namely the Root MSE (RMSE) and the coefficient of determination R².
It is clear from this Table that the two best candidate models are the 3rd-order polynomial (Poly3) and the 2nd-order polynomial (Poly2). Both have low RMSE values and a very high R² (.99 for both). Poly2 has the advantage of having only three parameters that need to be computed, while Poly3 has four. This means that training and prediction of the RD parameters with machine-learning methods (Section 4) will require more computation.

(a) Mean (black solid curves) and per-GoP (blue dashed curves) RDs. (b) BDPSNR (dB) of the per-GoP RDs over the mean RD. (c) BDRate (%) of the per-GoP RDs over the mean RD. (d) BDPSNR against BDRate (%) for the different types of texture. Figure 5:
Comparison of the mean and per-GoP RD curves for the HomTex dataset. The colours in subfigure (d) are used as follows: red for static, blue for dynamic continuous and green for dynamic discrete. (a) RDs per GoP for the different texture types. (b) Average RD curves with standard deviation ranges per texture type. Figure 6:
HomTex RDs per GoP and average RD curves for the different types of video texture. The different colours correspond to the sequence annotation by the experts: red for static, blue for dynamic continuous and green for dynamic discrete textures.

In the above context, we need to understand the trade-off in RD fitting performance if one of the models with three parameters is selected. To select the most appropriate model, we computed the BD metrics between each fitted RD curve and its real version. The mean and standard deviation (std) of the BDPSNR and BDRate values are reported in Table 4. The BD metrics confirm that Poly3 and Poly2 are the two best performing models, with lower absolute mean values and stds. The performance of Poly2 and Poly3 is very similar in terms of BDPSNR. In terms of BDRate, Poly2 seems to perform better on average. Although the Poly3 model does not demonstrate the lowest absolute mean BDRate value, it has the lowest std compared to the other models. In terms of BDPSNR, Poly3 has both the lowest mean and the lowest std. Regarding the other two models, Lin and Exp, they perform similarly, with Exp achieving better BDPSNR statistics and Lin better BDRate statistics.

As the goodness-of-fit and BD metrics are, respectively, very high and very low for all the explored models and occupy a similar range, we further investigated the significance of their differences. We computed the empirical Cumulative
Table 3
Explored mathematical models for RD curves (where αᵢ, βᵢ, γᵢ, δᵢ ∈ ℝ, with i ∈ {1, …, 4}) and goodness-of-fit assessment.

Model                  Acro.   Formula                                            R²    RMSE
Linear                 Lin     Q(R) = α₁ log(R) + β₁                              .96   .5477
2nd-order Polynomial   Poly2   Q(R) = α₂ log²(R) + β₂ log(R) + γ₂                 .99   .1247
3rd-order Polynomial   Poly3   Q(R) = α₃ log³(R) + β₃ log²(R) + γ₃ log(R) + δ₃    .99   .0402
Exponential            Exp     Q(R) = α₄ e^(β₄ log(R))                            .96   .4300

Table 4
BD metrics for the explored mathematical models for RD curves. The ± represent the mean value and the standard deviation.

Model   BDPSNR (mean ± std)    BDRate (mean ± std)
Lin     -.0885 ± .1552 dB      -.0075 ± … %
Poly2       …  ± .1274 dB      -.0037 ± .2985%
Poly3   -.0004 ± .0028 dB       .0043 ± .0542%
Exp     -.0364 ± .1302 dB      -.1534 ± … %

(a) CDF distributions of BDPSNR for the explored models. (b) CDF distributions of BDRate for the explored models.
Figure 7:
Comparison of the performance of the fitted RD models.
Distribution Functions (CDF) and the histograms of the BD metrics, and plot these in Fig. 7. As can be seen in Fig. 7 (a) and (b), Poly3 has the best performance, as its variance around 0 is almost negligible. We note that Poly2 has a very similar CDF curve to Poly3 for both BDPSNR and BDRate. Lastly, the CDF curves for Lin and Exp are very similar.

From the discussion above, we conclude that Poly3 best fits the video texture RD curves. However, in order to recommend which model to use for the prediction of RD curves based on spatio-temporal features, we need a comparative analysis of the accuracy of the predicted RDs against the real curves for each tested model, as well as of their relative complexity.
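The ordinary-least-squares fitting described above can be sketched as follows, with an illustrative five-point QP sweep rather than HomTex data. The exponential model Q(R) = α e^(β log(R)) is fitted here via linearisation in the log domain, which is one possible choice:

```python
# Sketch of fitting the four candidate models of Table 3 to a log(R)-PSNR
# curve and scoring each fit with RMSE and R^2. The data points below are
# illustrative, not measurements from the paper.
import numpy as np

x = np.log(np.array([150., 400., 900., 2000., 4500.]))   # log bit rate
q = np.array([31.2, 34.8, 37.9, 40.3, 42.1])             # PSNR (dB)

def fit_poly(deg):
    """Lin (deg=1), Poly2 (deg=2) or Poly3 (deg=3) least-squares fit."""
    c = np.polyfit(x, q, deg)
    resid = q - np.polyval(c, x)
    rmse = np.sqrt(np.mean(resid ** 2))
    r2 = 1 - np.sum(resid ** 2) / np.sum((q - q.mean()) ** 2)
    return c, rmse, r2

def fit_exp():
    """Exp model Q = a*exp(b*x), linearised as log Q = log a + b*x."""
    b, log_a = np.polyfit(x, np.log(q), 1)
    resid = q - np.exp(log_a) * np.exp(b * x)
    rmse = np.sqrt(np.mean(resid ** 2))
    r2 = 1 - np.sum(resid ** 2) / np.sum((q - q.mean()) ** 2)
    return (np.exp(log_a), b), rmse, r2

for name, (c, rmse, r2) in [("Lin", fit_poly(1)), ("Poly2", fit_poly(2)),
                            ("Poly3", fit_poly(3)), ("Exp", fit_exp())]:
    print(f"{name}: RMSE={rmse:.4f}  R2={r2:.4f}")
```

As in Table 3, the higher-order polynomials achieve the lowest RMSE on any given curve, which is the behaviour the BD comparison above quantifies across the whole dataset.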
In this Section, we study the RD curve model parameters and inspect their correlation with the characteristic shapes and parameters of the RD curves. For example, for the Poly2 model in Eq. (??), parameter α is related to the curvature of the RD curve. Parameter β is related to the slope of the RD curve, which expresses the "cost" in bit rate, in the logarithmic domain, for a given increase in video quality. Lastly, parameter γ is related to the horizontal shift of the RD curve, which approximates the minimum bit budget required for encoding the video.

From the RD curves plotted in Fig. 5, we can observe that the curvature is generally positive for lower bit rates, while

(a) Parameter α against β. (b) Parameter γ against β. Figure 8:
Example of RD model parameter relations: Poly2 parameters.
Table 5
Relation of the fitted RD model parameters.
Model   RD Parameter Model                   PCC      SROCC    R²
Lin     β̂ (fitted polynomial of α)          -.9825   -.9846   .97
Poly2   α̂ (fitted polynomial of β)          -.9939   -.9979   .99
Poly2   γ̂ (fitted polynomial of β)          -.9908   -.9944   .99
Poly3   α̂ (fitted polynomial)               -.9829   -.9973   .99
Poly3   δ̂ (fitted polynomial)               -.9840   -.9968   .99
Exp     β̂ (fitted function of α)            -.9440   -.9919   .98

it becomes negative for higher bit rates. This is confirmed by the Poly2 parameter α and its correlation with parameter γ; the Pearson Correlation Coefficient (PCC) is .9706 and the Spearman Rank Correlation Coefficient (SROCC) is .9867. A similar observation is made for parameters α and β. In particular, when the slope is flatter, the RD curve usually has a positive curvature, and the inverse is observed for RD curves with steeper slopes. This is confirmed in Fig. 8 (a), where parameter α is scattered against β, and also by the correlation coefficients between the two parameters: the PCC is equal to -.9939 and the SROCC is equal to -.9979. Regarding the other parameters of Poly2, we notice in Fig. 5 that, in general, the smaller the slope is, the more the curve is shifted towards lower bit rates on the log(R) axis. This indicates a strong, potentially linear, relation between those two parameters, which is confirmed in Fig. 8 (b), where parameter γ is scattered against β. The PCC is very high, -.9908, as is the SROCC, -.9944. Similar correlations can be observed, and conclusions drawn, for the parameters of the other considered RD models.

The linear and rank correlation values of the RD parameter models are reported in Table 5. In order to take advantage of the high correlations between the RD model parameters, we explored several mathematical models; the ones that achieved the highest goodness-of-fit values (e.g. R²) are given in Table 5. Modelling the RD parameters provides the benefit of training fewer models, only for the parameters that are required to estimate the remaining ones.
For example, for the Lin model, we can build a machine-learning-based regression model to predict parameter β and then, with Eqs. (??)-(??), estimate the other parameter α.
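The practical benefit of such parameter relations can be sketched as follows: once a relation like α̂ = f(β) has been fitted offline, only one parameter needs a learned regressor at prediction time. The α/β pairs below are synthetic stand-ins with a strong linear relation, not fitted HomTex parameters:

```python
# Illustration of exploiting RD-parameter correlation (as in Table 5): fit
# alpha = f(beta) once by least squares, so only beta needs a learned
# regressor and alpha follows from the fitted relation. Synthetic data.
import numpy as np

rng = np.random.default_rng(0)
beta = rng.uniform(-20, 20, 200)                   # slopes (synthetic)
alpha = -0.04 * beta + rng.normal(0, 0.05, 200)    # anti-correlated curvature

# Pearson correlation confirms the near-linear relation.
pcc = np.corrcoef(alpha, beta)[0, 1]

# One offline least-squares fit replaces a second learned regressor.
coef = np.polyfit(beta, alpha, 1)
alpha_hat = np.polyval(coef, beta)
mae = np.mean(np.abs(alpha_hat - alpha))
print(f"PCC={pcc:.3f}  MAE={mae:.4f}")
```

With correlations as high as those in Table 5, the estimation error introduced by the fitted relation is small compared to the error of a second feature-driven regressor.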
4. Predicting RD Curves using Textural Features from Uncompressed Content
Previous work relating textural features to video compression efficiency includes (42; 43; 30). In (30), Subedar et al. define a no-reference metric of granularity in static textured images and discuss its relation to compression efficiency, but with no clear association with RD curves. In (43), elementary statistics of the prediction error for texture and motion are obtained from an H.264/AVC encoder and used to build rate variability-distortion models. In (42), a block-based spatial correlation model is defined and used to predict the RD bounds within an H.264/AVC encoder. This work was extended in (44) to consider the block-based spatial correlation between two successive frames within the HEVC HM. In our work, we predict RD curves for homogeneous video textures by building models driven by spatio-temporal features extracted from the uncompressed source sequences. We consider RD parameter correlations and reduce the
Table 6
Validation metrics of predicted parameters and the resulting BD metrics for the tested RD models. The ± represent the mean value and the standard deviation.

Model   Param.   Features/Eqs.                                           PCC     SROCC   R²    MAE        NRMSE
Lin     α̂       F1, F4, F15, F22, F24, F26, F28, F30, F32-F35, F37      .9303   .9247   .86   .2173      .0817
Lin     β̂       Eq. of β̂, Table 5                                       .9215   .9182   .85   5.9199     .0886
Poly2   α̂       Eq. of α̂, Table 5                                       .9311   .9475   .87   .0632      .0521
Poly2   β̂       F1, F4, F15, F22, F24, F26, F28, F30, F32-F35, F37      .9327   .9481   .87   2.1676     .0557
Poly2   γ̂       Eq. of γ̂, Table 5                                       .9206   .9420   .85   19.7760    .0544
Poly3   α̂       F1, F3, F4, F15, F22, F24, F26, F28, F30, F32-F35       .6508   .7922   .44   .0273      .0224
Poly3   β̂       F1, F4, F15, F22, F24, F26, F28, F30, F32, F33          .6359   .7982   .53   1.4056     .0227
Poly3   γ̂       F1, F4, F5, F15, F22, F24, F26, F28, F30, F32-F35, F37  .7518   .8077   .58   23.6200    .0277
Poly3   δ̂       Eq. of δ̂, Table 5                                       .7884   .8083   .62   141.1246   .0332
Exp     α̂       F1, F4, F15, F22, F24, F26, F28, F30, F32-F35, F37      .9440   .9346   .89   1.7671     .0816
Exp     β̂       Eq. of β̂, Table 5                                       .9124   .9228   .83   .0077      .1062

prediction complexity by using the models explored in Section 3. This extends our previous work (20) through a full characterisation and comparison of RD models (previous Section), by exploring machine-learning-based regression methods and by recommending an RD model that takes into account the complexity-accuracy trade-off.

For the training and testing of the proposed method, the first 240 frames of the HomTex (7) sequences were used (some of the sequences have 250 frames at a 25 fps frame rate and others 300 frames at a 60 fps frame rate). The RD curves were obtained using the HM 16.20 reference software in the Random Access configuration for five different quantization scales, QP = {22, …}, with a GoP length equal to 8 and an Intra Period of 32. The RD curves were constructed per GoP, resulting in 3600 RD curves. For all RD curves, we fitted the RD models of Table 3 and computed their parameters as explained in Section 3.3.2.
These fitted parameters serve as the ground truth for the regression tasks. Furthermore, we extracted the spatio-temporal features from the uncompressed source sequences and computed their statistics per GoP. Finally, we normalised the features.

Before building and testing regression models for all considered RD models, we perform feature selection in order to reduce the computational cost of feature extraction. We have observed previously (21) that many of the features are correlated; thus, different combinations of features may achieve the same prediction accuracy. We perform a feature selection and elimination process based on RF models, namely Recursive Feature Elimination (RFE) (45). RFE uses feature ranking, iteratively selects subsets of features of different cardinality and returns the optimal feature subset. We note that we perform feature selection for each of the predicted RD parameters, as the subset of selected features varies slightly.

To avoid overfitting, a random 10-fold cross-validation process was adopted. A random split of the extracted features and the fitted RD parameters was performed, with 80% of the data used for model configuration and training and the remaining 20% for performance evaluation. The machine-learning regression models explored included Support Vector Machines (SVMs), Random Forests (RFs), Ensemble Trees and Gaussian Processes (GPs), using different kernels and optimising the associated parameters.

The evaluation of the performance of the proposed approach takes place in two steps. First, the accuracy of predicting the RD parameters is assessed. Next, the predicted RD curves are validated against the real ones using BD metrics. Finally, the results are discussed in terms of prediction accuracy and relative complexity.
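A condensed sketch of this pipeline (RF-based RFE followed by cross-validated regression) using scikit-learn is given below. The feature matrix is synthetic, with the target depending on only two of ten candidate features, standing in for the per-GoP spatio-temporal feature statistics:

```python
# Sketch of Random-Forest-based Recursive Feature Elimination followed by
# 10-fold cross-validated regression, on synthetic data. The target depends
# only on features 0 and 3, which RFE should retain.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 10))                               # 10 candidate features
y = 3 * X[:, 0] - 2 * X[:, 3] + 0.1 * rng.normal(size=300)   # depends on two of them

rf = RandomForestRegressor(n_estimators=50, random_state=0)
rfe = RFE(rf, n_features_to_select=4).fit(X, y)              # rank and keep 4 features
selected = np.flatnonzero(rfe.support_)

# 10-fold cross-validated R^2 on the reduced feature set.
scores = cross_val_score(RandomForestRegressor(n_estimators=50, random_state=0),
                         X[:, selected], y, cv=10, scoring="r2")
print("selected:", selected, " mean R2:", round(scores.mean(), 3))
```

In the paper's setting the selection is run once per predicted RD parameter, since the optimal feature subset differs slightly between parameters.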
Table 7
Resulting BD metrics for the tested RD models. The ± represent the mean value and the standard deviation.

Model   BDPSNR (mean ± std)   BDRate (mean ± std)   RCR
Lin      .0973 ± … dB          …                     …
Poly2    .3811 ± … dB          …                     …
Poly3       …                  …                     …
Exp     -.0958 ± … dB          .4552 ± … %           …

(a) Predicted α̂ against true α. (b)-(c) Predicted parameters against their true values. (d) Predicted α̂ against true α. Figure 9:
Examples of predicted RD parameters against the true (fitted) values for the four different RD models.

In Table 6, a summary of the best results achieved per RD model is presented. In particular, for each predicted RD parameter the following evaluation metrics are reported:
• the selected features, or the equation (if that produced better results);
• the correlation metrics, PCC and SROCC, between the predicted and the fitted RD parameter;
• the coefficient of determination R², the Mean Absolute Error (MAE), and the NRMSE.
It is remarkable that, although Poly3 was the best-fitted model (see Section 3.3.1), the prediction of the parameters of the lower-order models, Lin, Exp and Poly2, is of higher accuracy, as the correlation metrics are greater than .9 and the coefficient of determination is higher than .83. The improved prediction of the RD parameters for the lower-order models is also noticeable in the example plots of Fig. 9. In this figure, we illustrate examples of the predicted RD parameter values against their true (fitted) values for the different tested models. As can be seen, the distribution around the main diagonal is tighter for the Lin, Poly2 and Exp RD models, explaining the high validation metric values in Table 6. On the other hand, Fig. 9 (c) justifies the low PCC and SROCC values for the corresponding Poly3 parameter, as the predicted and true values are not tightly distributed around the diagonal.
To fully validate the effectiveness of the proposed method, we computed and report in Table 7:
• the mean and standard deviation of the BDPSNR (dB) and BDRate (%) of the predicted RD curves over the real ones;
• the Relative Complexity Ratio (RCR), i.e. the ratio of the execution time of each model over the minimum execution time recorded.
As anticipated from the validation metric values associated with the predicted parameters, the mean BDPSNR and BDRate are higher for Poly3. The predicted curves are closest to the real ones in terms of the mean and standard deviation of BDRate for the Exp model. Exp and Lin have very similar mean BDPSNR values; however, Exp has a lower standard deviation. Poly2 has the lowest standard deviation, but a higher mean value.

Regarding complexity, this depends on the number of features that need to be extracted (dictated by the offline feature selection process) and on the number of machine-learning-based predictions that need to be performed. For some models, it has been shown that the fitted equations can provide a reliable and accurate RD parameter estimation. The low-order models benefit from having only two parameters that need to be determined, and for one of them a fitted equation can be used. A lower number of predicted parameters is directly related to a smaller accumulated prediction error. Indeed, the models with only two parameters, Lin and Exp, demonstrate the lowest complexity and highest accuracy, while for the four-parameter model, Poly3, the opposite is observed.

Taking into account all the above points, we conclude that Poly3 is theoretically the best RD model. Nevertheless, the Exp model emerges as the prominent solution that balances accuracy and complexity, as it achieves the highest prediction accuracy in terms of the BDRate metric and the second lowest RCR amongst the compared models.
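The final validation step can be sketched as follows: reconstruct an RD curve from predicted Exp-model parameters, Q(R) = α e^(β log(R)), and score it against the measured curve with the BD-rate (average rate difference at equal quality). The parameter values and the measured curve below are illustrative:

```python
# Sketch of validating a predicted RD curve with the Bjontegaard Delta rate.
# The measured PSNR values and predicted Exp parameters are illustrative.
import numpy as np

def bd_rate(rates_ref, psnr_ref, rates_test, psnr_test):
    """Average rate difference (%) of the test curve over the reference at equal PSNR."""
    lr_ref, lr_test = np.log10(rates_ref), np.log10(rates_test)
    p_ref = np.polyfit(psnr_ref, lr_ref, 3)       # log-rate as a cubic of PSNR
    p_test = np.polyfit(psnr_test, lr_test, 3)
    lo = max(psnr_ref.min(), psnr_test.min())     # overlapping PSNR interval
    hi = min(psnr_ref.max(), psnr_test.max())
    int_ref = np.polyval(np.polyint(p_ref), hi) - np.polyval(np.polyint(p_ref), lo)
    int_test = np.polyval(np.polyint(p_test), hi) - np.polyval(np.polyint(p_test), lo)
    return (10 ** ((int_test - int_ref) / (hi - lo)) - 1) * 100

rates = np.array([150., 400., 900., 2000., 4500.])        # kbps, illustrative
measured = np.array([31.2, 34.8, 37.9, 40.3, 42.1])       # measured PSNR (dB)
a_hat, b_hat = 19.5, 0.095                                # predicted Exp parameters
predicted = a_hat * np.exp(b_hat * np.log(rates))         # reconstructed RD curve
print(f"BD-Rate: {bd_rate(rates, measured, rates, predicted):.2f}%")
```

A BD-rate near zero, as reported for the Exp model in Table 7, means the curve reconstructed from predicted parameters requires almost the same bit rate as the measured curve at equal quality.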
5. Conclusion
Predicated on the challenges of compressing video textures, we have characterised their spatio-temporal features and their encoding statistics using homogeneous video sequences. We have employed a homogeneous dataset with 120 sequences, HomTex. We have investigated mathematical models capable of representing the corresponding RD curves and characterised their accuracy. For all considered theoretical models, we employed machine-learning regression to predict the RD curves based exclusively on selected spatio-temporal feature statistics extracted from the uncompressed video content. Taking into account the trade-off between RD prediction accuracy and relative computational complexity, we conclude that the exponential model performs best. The proposed method can be modified to effectively predict other encoding decisions or statistics, which remains part of our future work.

References

[1] G. J. Sullivan, J. R. Ohm, W. J. Han, and T. Wiegand, "Overview of the High Efficiency Video Coding (HEVC) Standard," IEEE Trans. on Circuits and Systems for Video Technology, vol. 22, no. 12, pp. 1649–1668, Dec 2012.
[2] D. Mukherjee, J. Bankoski, A. Grange, J. Han, J. Koleszar, P. Wilkins, Y. Xu, and R. Bultje, "The latest open-source video codec VP9 - an overview and preliminary results," in
Picture Coding Symposium, Dec 2013, pp. 390–393.[3] Y. Chen, D. Mukherjee, J. Han, A. Grange, Y. Xu, Z. Liu, S. Parker, C. Chen, H. Su, U. Joshi, C.-H. Chiang, Y. Wang, P. Wilkins, J. Bankoski, L. Trudeau, N. Egge, J.-M. Valin, T. Davies, S. Midtskogen, A. Norkin, and P. de Rivaz, "An Overview of Core Coding Tools in the AV1 Video Codec," in
Picture Coding Symposium, Jun 2018.[4] A. V. Katsenou, F. Zhang, M. Afonso, and D. R. Bull, "A subjective comparison of AV1 and HEVC for adaptive video streaming," in
IEEE International Conference on Image Processing, Sep. 2019, pp. 4145–4149.[5] B. Bross, J. Chen, and S. Liu, "Versatile Video Coding (Draft 4)," in the JVET meeting. ITU-T and ISO/IEC, 2019, number JVET-M1001.[6] M. Afonso, A. Katsenou, F. Zhang, D. Agrafiotis, and D. R. Bull, "Video Texture Analysis based on HEVC Encoding Statistics," in
Picture Coding Symposium, Dec 2016.[7] M. Afonso, A. Katsenou, F. Zhang, D. Agrafiotis, and D. R. Bull, "Homogeneous Video Texture Dataset (HomTex)," 2016, https://data.bris.ac.uk/data/dataset/1h2kpxmxdhccf1gbi2pmvga6qp/.[8] Y. C. Lin, H. Denman, and A. Kokaram, "Multipass encoding for reducing pulsing artifacts in cloud based video transcoding," in
IEEE International Conference on Image Processing, Sept 2015, pp. 907–911.[9] J. De Cock, Z. Li, M. Manohara, and A. Aaron, "Complexity-based consistent-quality encoding in the cloud," in , Sept 2016, pp. 1484–1488.[10] Ioannis Katsavounidis, "Dynamic optimizer - a perceptual video encoding optimization framework," .[11] I. Zupancic, M. Naccari, M. Mrak, and E. Izquierdo, "Two-Pass Rate Control for Improved Quality of Experience in UHDTV Delivery,"
IEEE Journal of Selected Topics in Signal Processing , vol. 11, no. 1, pp. 167–179, Feb 2017.
[12] J. Ozer, "Per-Title Encoding Comparison: Crunch Video Optimization Technology compared to: Brightcove CAE, Capped CRF, Capella Systems SABL, JWPlayer," https://streaminglearningcenter.com/wp-content/uploads/2018/07/Report_final.pdf.[13] T. Li, L. Yu, S. Wang, and H. Wang, "Simplified Depth Intra Coding Based on Texture Feature and Spatial Correlation in 3D-HEVC," in , March 2018, pp. 421–421.[14] M. Bosch, F. Zhu, and E. J. Delp, "Segmentation-Based Video Compression Using Texture and Motion Models,"
IEEE Journal of Selected Topics in Signal Processing, vol. 5, no. 7, pp. 1366–1377, Nov 2011.[15] F. Zhang and D. R. Bull, "A Parametric Framework for Video Compression Using Region-Based Texture Models,"
IEEE Journal of Selected Topics in Signal Processing, vol. 5, no. 7, Nov 2011.[16] M. Wang, W. Xie, X. Meng, H. Zeng, and K. N. Ngan, "UHD video coding: A light-weight learning-based fast super-block approach,"
IEEE Transactions on Circuits and Systems for Video Technology, vol. 29, no. 10, pp. 3083–3094, Oct 2019.[17] M. Jamali and S. Coulombe, "Fast HEVC Intra Mode Decision Based on RDO Cost Prediction,"
IEEE Transactions on Broadcasting , vol.65, no. 1, pp. 109–122, March 2019.[18] H. Hamout and A. Elyousfi, “Fast Texture Intra Size Coding Based On Big Data Clustering for 3D-HEVC,” in
IEEE International Conference on Acoustics, Speech and Signal Processing, April 2018, pp. 1728–1732.[19] W. Zhu, X. Tian, F. Zhou, and Y. Chen, "Fast inter mode decision based on textural segmentation and correlations for multiview video coding,"
IEEE Transactions on Consumer Electronics , vol. 56, no. 3, pp. 1696–1704, Aug 2010.[20] A. Katsenou, M. Afonso, D. Agrafiotis, and D. R. Bull, “Predicting Video Rate-Distortion Curves using Textural Features,” in
Picture Coding Symposium, Dec 2016.[21] A. V. Katsenou, T. Ntasios, M. Afonso, D. Agrafiotis, and D. R. Bull, "Understanding Video Texture - a Basis for Video Compression," in
IEEE 19th International Workshop on Multimedia Signal Processing , Oct 2017.[22] P. Salembier and T. Sikora,
Introduction to MPEG-7: Multimedia Content Description Interface , John Wiley and Sons, Inc., New York, NY,USA, 2002.[23] R. M. Haralick, K. Shanmugam, and I. Dinstein, “Textural features for image classification,”
IEEE Transactions on Systems, Man, and Cybernetics, vol. SMC-3, no. 6, pp. 610–621, Nov 1973.[24] R. M. Haralick, "Statistical and structural approaches to texture,"
Proceedings of the IEEE , vol. 67, no. 5, pp. 786–804, May 1979.[25] Renaud Péteri, Sándor Fazekas, and Mark J Huiskes, “Dyntex: A comprehensive database of dynamic textures,”
Pattern Recognition Letters ,vol. 31, no. 12, pp. 1627–1632, 2010.[26] J. P. Lewis, “Fast template matching,” in
Vision interface , 1995, vol. 95, pp. 15–19.[27] V. Laparra, J. Ballé, A. Berardino, and E.P. Simoncelli, “Perceptual image quality assessment using a normalized laplacian pyramid,” in
Electronic Imaging 2016 . SPIE, 2016, pp. 1–6.[28] G. Farnebäck, “Two-frame motion estimation based on polynomial expansion,” in
Scandinavian Conference on Image Analysis. Springer, 2003, pp. 363–370.[29] J. Zujovic, T. N. Pappas, and D. L. Neuhoff, "Structural Texture Similarity Metrics for Image Analysis and Retrieval," IEEE Transactions on Image Processing, vol. 22, no. 7, pp. 2545–2558, July 2013.[30] M. M. Subedar and L. J. Karam, "A no reference texture granularity index and application to visual media compression," in
IEEE InternationalConference on Image Processing , Sept 2015, pp. 760–764.[31] C. H. Peh and L. F. Cheong, “Synergizing spatial and temporal texture,”
IEEE Transactions on Image Processing , vol. 11, no. 10, pp.1179–1191, Oct 2002.[32] A. Rahman and M. Murshed, “A novel 3D motion co-occurrence matrix (MCM) approach to characterise temporal textures,” in , 2004, vol. 1, pp. 717–720.[33] O. Chubach, P. Garus, M. Wien, and J. Ohm, “Motion-distribution based dynamic texture synthesis for video coding,” in
Picture Coding Symposium, June 2018, pp. 218–222.[34] A. V. Katsenou, D. Ma, and D. R. Bull, "Perceptually-Aligned Frame Rate Selection Using Spatio-Temporal Features," in
Picture Coding Symposium, June 2018, pp. 288–292.[35] M. Afonso, F. Zhang, and D. R. Bull, "Spatial resolution adaptation framework for video compression," in
SPIE Optical Engineering + Applications, 2018, vol. Proceedings Volume 10752, Applications of Digital Image Processing XLI.[36] A. V. Katsenou, J. Sole, and D. R. Bull, "Content-gnostic Bitrate Ladder Prediction for Adaptive Video Streaming," in
Picture Coding Symposium, November 2019.[37] G. Carter, C. Knapp, and A. Nuttall, "Statistics of the estimate of the magnitude-coherence function," IEEE Trans. on Audio and Electroacoustics, vol. 21, no. 4, pp. 388–389, Aug 1973.[38] D. Springer, W. Schnurrer, A. Weinlich, A. Heindel, J. Seiler, and A. Kaup, "Open source HEVC analyzer for rapid prototyping (HARP)," in
IEEE International Conference on Image Processing , 2014, pp. 2189–2191.[39] M. A. Papadopoulos, D. Agrafiotis, and D. R. Bull, “On the performance of modern video coding standards with textured sequences,” in
International Conference on Systems, Signals and Image Processing , Sept 2015, pp. 137–140.[40] T. Wiegand, G. J. Sullivan, G. Bjontegaard, and A. Luthra, “Overview of the H.264/AVC video coding standard,”
IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, no. 7, pp. 560–576, July 2003.[41] G. Bjontegaard, "Calculation of average PSNR differences between RD-curves," Tech. Rep., 13th VCEG-M33 Meeting, Austin, TX, 2001.[42] J. Hu and J. D. Gibson, "New rate distortion bounds for natural videos based on a texture-dependent correlation model," IEEE Transactions on Circuits and Systems for Video Technology, vol. 19, no. 8, pp. 1081–1094, Aug 2009.[43] G. Van der Auwera, M. Reisslein, and L. J. Karam, "Video texture and motion based modeling of rate variability-distortion (VD) curves,"
IEEE Transactions on Broadcasting , vol. 53, no. 3, pp. 637–648, Sept 2007.[44] J. Hu, M. Bhaskaranand, and J. D. Gibson, “Rate distortion lower bounds for video sources and the HEVC standard,” in
Information Theory and Applications Workshop, 2013, Feb 2013, pp. 1–10.
[45] M. Kuhn and K. Johnson, Applied Predictive Modeling, Springer, New York, USA, 2013.