Texture-aware Video Frame Interpolation
Duolikun Danier and David Bull
Visual Information Laboratory, University of Bristol, Bristol, BS8 1UB, United Kingdom
{Duolikun.Danier, Dave.Bull}@bristol.ac.uk

ABSTRACT
Temporal interpolation has the potential to be a powerful tool for video compression. Existing methods for frame interpolation do not discriminate between video textures and generally invoke a single general model capable of interpolating a wide range of video content. However, past work on video texture analysis and synthesis has shown that different textures exhibit vastly different motion characteristics and that they can be divided into three classes (static, dynamic continuous and dynamic discrete). In this work, we study the impact of video textures on video frame interpolation, and propose a novel framework where, given an interpolation algorithm, separate models are trained on different textures. Our study shows that video texture has a significant impact on the performance of frame interpolation models and that it is beneficial to have separate models specifically adapted to these texture classes, instead of training a single model that tries to learn generic motion. Our results demonstrate that models fine-tuned using our framework achieve, on average, a 0.3dB gain in PSNR on the test set used.
1. INTRODUCTION
Video frame interpolation (VFI) refers to the task of generating non-existent intermediate frames between any two consecutive frames in a video while preserving spatiotemporal consistencies. VFI allows up-conversion of frame rates for improved visual quality, and offers an important tool for video compression, where it can be used to replace or enhance conventional motion estimation [1] or to conceal errors in reconstructed frames [2]. Interpolating visually pleasing intermediate frames requires accurate modelling of motion. However, due to the variety of textures and motion patterns in real-world videos, it is difficult to capture the motion of different texture types using a single mathematical model. Hence frame interpolation remains a challenging task.
This work was supported by the China Scholarship Council - University of Bristol Scholarship, Grant No. 202008060038.

Existing VFI methods can be classified as either flow-based or kernel-based. Flow-based methods [3-9] generally involve two steps, namely motion estimation and frame synthesis, where the latter typically resorts to pixel-wise warping of the two adjacent frames utilising the motion information to predict the intermediate frame. However, these methods generally experience a degradation in performance under challenging conditions where finding reference pixels in nearby frames is made difficult, e.g. under illumination change, occlusion and large motion. On the other hand, kernel-based methods [10-17] estimate a convolution kernel for each output pixel and predict the pixel by convolving nearby regions in adjacent frames with the kernel. Although in this case each output pixel is synthesised by referring to multiple pixels in the adjacent frames and occlusion can be better handled, the complexity of the captured motion is still limited by the size of the kernels.

The aforementioned methods are mostly based on convolutional neural networks (CNNs). These have advanced the state of the art in VFI over the past few years with various innovations relating to their overall architectures and to specific processing stages. One common feature amongst them is that a single model is trained to interpolate all kinds of video frame. In contrast, previous work on video texture analysis and synthesis [19-24] has shown that different textures exhibit vastly different motion patterns which are better handled separately. For example, Zhang et al. [19] proposed an analysis-synthesis video compression framework in which different algorithms were adopted to synthesise dynamic and static textures.

A finer classification of video textures was proposed in [21, 22], where the authors clustered homogeneous sequences based on texture-relevant features as well as the HM encoding statistics, and found that three video texture clusters exist, referred to as static (rigid texture exhibiting perspective motion), dynamic discrete (texture with discernible parts undergoing perspective motion independently) and dynamic continuous (spatially irregular and unstructured texture moving as a continuum). Example frames from each class taken from the HomTex dataset [21] are shown in Fig. 1.

Fig. 1. Interpolation results of our texture-aware fine-tuned AdaCoF (AdaCoF-TAFI) and the original AdaCoF [15] on sample sequences from static (top row), dynamic discrete (middle row) and dynamic continuous (bottom row) textures. Sequences are "PaintingTilting1", "RiceField" and "ShinnyBlueWater downsampled" from the HomTex dataset [21].

While generalisation is a desirable property of deep learning models, it has also proved difficult to achieve due to the issue of over-fitting. Motivated by the classification of video textures in previous work, in this contribution we investigate the impact of video texture type on the performance of state-of-the-art VFI models. Based on our observations, we hypothesise that it is harder to train a generic VFI model that performs well on all texture types than to train separate models, each specialising in a specific texture (i.e. overfitting a texture class); we verify our hypothesis through experimentation. As a consequence, we propose a novel texture-aware frame interpolation (TAFI) framework that can generalise to any VFI method and improve its performance.
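To make the warping step used by flow-based methods more concrete, the following is a minimal sketch in PyTorch, assuming flow fields from the intermediate frame to each reference frame have already been estimated (as in [4]); it illustrates the general principle rather than the implementation of any particular cited model.

import torch
import torch.nn.functional as F

def backward_warp(src, flow_t_to_src):
    """Sample `src` (N, C, H, W) at positions displaced by the dense flow
    field `flow_t_to_src` (N, 2, H, W, in pixels), producing the view of
    `src` as seen from the intermediate time instant."""
    n, _, h, w = src.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().to(src.device)   # (2, H, W)
    coords = base.unsqueeze(0) + flow_t_to_src                   # (N, 2, H, W)
    # Normalise pixel coordinates to [-1, 1] as required by grid_sample.
    gx = 2.0 * coords[:, 0] / (w - 1) - 1.0
    gy = 2.0 * coords[:, 1] / (h - 1) - 1.0
    grid = torch.stack((gx, gy), dim=-1)                         # (N, H, W, 2)
    return F.grid_sample(src, grid, mode="bilinear", align_corners=True)

def naive_midpoint(frame0, frame1, flow_t_to_0, flow_t_to_1):
    # Simple average of the two warped references; real models add
    # occlusion reasoning, learned blending and refinement stages.
    return 0.5 * (backward_warp(frame0, flow_t_to_0) +
                  backward_warp(frame1, flow_t_to_1))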
2. RELATED WORK
Conventionally, VFI has been addressed by estimating optical flow and warping two adjacent frames using the flow map [3]. Such methods rely heavily on the adopted flow estimation algorithm and have generally been found to suffer from occlusion and large motion. Recent flow-based methods improved upon this framework by deploying CNNs for flow estimation and frame synthesis. Liu et al. [4] trained an encoder-decoder network to estimate the optical flow from the intermediate frame to the two reference frames, which is then used by a sampling layer to interpolate each output pixel. The complexity of the captured motion was limited since equal flow in the two directions was assumed. Jiang et al. [5] improved on [4] by enabling bidirectional flow. Niklaus et al. [6] incorporated context information during the warping process. Bao et al. [7] additionally estimated a depth map to refine the estimated flow. Recently, Niklaus et al. [9] proposed to use softmax splatting for differentiable forward warping such that different reference pixels can map to the same output pixel. Although this allows more complex motion to be captured, such complexity is still limited due to the dependence on flow estimation and pixel-wise warping.

On the other hand, kernel-based methods inherently allow more pixels to be sampled for predicting one output pixel. Niklaus et al. [11] proposed a CNN (improved later in [12] and [13]) to predict a kernel for each output pixel which is convolved with the input image to obtain the target pixel. Instead of predicting individual kernels for each pixel, Choi et al. [14] proposed an end-to-end network with channel attention that directly outputs the interpolated frame. Recently, there has been an increasing number of works on VFI that employ deformable convolution [18]. Lee et al. [15] developed a CNN to predict the kernel weights together with their offsets for each output pixel. The flexibility of deformable kernels enabled more complex motion to be captured. Cheng et al. [16] adopted a similar approach but used separable kernels. Shi et al. [17] enabled a further degree of freedom by allowing the reference pixels to be sampled in an interpolated space-time volume instead of just from the existing frames.

While these VFI methods were developed to generalise to all types of videos, previous work on video texture analysis and synthesis adopted a different approach. In [19], Zhang et al. observed that textured regions in videos typically consume more bits to encode, and proposed to synthesise such regions and encode only the warping parameters, where different synthesis methods were used for dynamic and static textures. In [21], Afonso et al. analysed video texture by clustering homogeneous videos based on their encoding statistics (e.g. prediction modes, bit allocation etc.) and showed that video texture can be categorised as static, dynamic discrete and dynamic continuous. A homogeneous texture dataset, HomTex, was produced. Katsenou et al. [22] confirmed the existence of such a categorisation by clustering videos based on their spatiotemporal features. In our work, the implications of such texture categorisation for VFI are investigated.
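To illustrate the core operation behind the kernel-based family, the snippet below applies a predicted, spatially varying kernel to the neighbourhood of every pixel of one reference frame. This is a simplified sketch, not the exact formulation used in [11-17]; in a full model the kernels would come from a prediction network and the contributions from the two reference frames would be combined.

import torch
import torch.nn.functional as F

def adaptive_local_convolution(frame, kernels, k=5):
    """frame:   (N, C, H, W) reference frame.
    kernels: (N, k*k, H, W), one k-by-k kernel per output pixel,
             e.g. softmax-normalised by a kernel-prediction network.
    Every output pixel is a weighted sum of the k*k neighbourhood
    around the co-located position in `frame`."""
    n, c, h, w = frame.shape
    # Gather the k*k neighbourhood of every pixel: (N, C*k*k, H*W).
    patches = F.unfold(frame, kernel_size=k, padding=k // 2)
    patches = patches.view(n, c, k * k, h, w)
    weights = kernels.view(n, 1, k * k, h, w)
    return (patches * weights).sum(dim=2)

Deformable variants such as [15-17] additionally predict per-pixel offsets, so the sampled neighbourhood need not be centred on the co-located position.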
3. TEXTURE-AWARE FRAME INTERPOLATION

3.1. Effect of Texture on VFI Performance
In this section we investigate the performance of VFI methods on different video textures. Specifically, the pre-trained versions of three CNN-based state-of-the-art VFI models with publicly available source code are evaluated: AdaCoF [15], CAIN [14] and GDConvNet [17]. The evaluation dataset used is HomTex [21], which contains 120 videos, each with 250 frames, at 25 and 60 fps.
Fig. 2. Performance of the three selected models on the three texture types, in terms of PSNR, SSIM and VMAF.
Table 1. Results of the one-way ANOVA test on PSNR, SSIM and VMAF scores obtained by the VFI models on different textures.
         AdaCoF            CAIN              GDConvNet
         F(2,117)   p      F(2,117)   p      F(2,117)   p
PSNR     38.41      0.00   40.14      0.00   35.99      0.00
SSIM     17.58      0.00   16.16      0.00   16.91      0.00
VMAF     73.57      0.00   67.92      0.00   70.07      0.00

Each video in HomTex is (approximately) texturally homogeneous; there are 45, 50 and 25 sequences labelled dynamic continuous, dynamic discrete and static respectively. Each model is evaluated on the entire HomTex dataset, where for each sequence every second frame, I_t, is considered the ground truth and the two adjacent frames I_{t-1} and I_{t+1} are input to the model (in the case of GDConvNet, four frames, namely I_{t-3}, I_{t-1}, I_{t+1} and I_{t+3}, are used as input due to the model design). Peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) [25] scores are computed using the original and interpolated frames, as these are the most commonly used metrics in VFI [3-17]. Additionally, video multi-method assessment fusion (VMAF) [26] scores are computed using the original and interpolated videos. The scores of each model are then grouped according to the video texture class, and their distributions are shown in Fig. 2, where the abbreviations "dyncon" and "dyndis" are used for dynamic continuous and dynamic discrete.

It is noticeable that the distributions of the scores obtained by the models vary with texture type. Specifically, all models scored the highest PSNR, SSIM and VMAF on static textures, which is expected since the perspective motion of rigid objects is relatively easier to capture. For dynamic textures, all models see degraded performance across all metrics, and this decrease is even more severe for dynamic continuous textures, which exhibit the most irregular motion, intrinsic to scenes containing water, smoke, fire, etc.

To assess whether the impact of texture on VFI performance is statistically significant, a one-way ANOVA is performed, with results shown in Table 1. Here we see that, in terms of all metrics, texture type has a significant effect on the performance of the VFI models at the p < 0.05 level. Furthermore, T-tests (Welch's T-tests, to account for unequal sample sizes) are performed between the scores of each model on each pair of textures to see if there are significant differences, and the results are summarised in Table 2. The last two rows of the table show that, in terms of all metrics, the average performance of the models on static textures is significantly different from that on dynamic discrete and dynamic continuous textures. Comparing the two dynamic types, all models have significantly different SSIM and VMAF scores. Combining these results with Fig. 2, we can confirm that, for p < 0.05, all models performed best on static textures and worst on dynamic continuous textures. There are two possible reasons for this observation. The first is that the dynamic texture types fall outside the distribution of the three models' training sets. We consider this reason less important since all models were trained on the Vimeo90K dataset [8], which involves a wide range of non-homogeneous videos covering all types of textures. The second reason is that the complexity of the motion patterns, and hence the difficulty of interpolating different textures, is inherently different, with dynamic continuous textures exhibiting the most unpredictable motion and thus being the hardest to interpolate, and static textures being the easiest.
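The grouped significance testing described above can be reproduced with standard SciPy routines. The sketch below assumes the per-sequence PSNR scores of one model have already been gathered into one list per texture class; the numbers shown are placeholders, not the actual results reported in Tables 1 and 2.

from scipy import stats

# Per-sequence PSNR scores of one model, grouped by texture class
# (placeholder values; HomTex has 45, 50 and 25 sequences per class).
scores = {
    "dyncon": [24.1, 26.3, 25.7, 23.9],
    "dyndis": [25.9, 27.0, 26.2, 26.8],
    "static": [35.4, 37.1, 36.8],
}

# One-way ANOVA: does texture class have a significant effect on the score?
f_stat, p_anova = stats.f_oneway(scores["dyncon"], scores["dyndis"], scores["static"])
print(f"ANOVA: F = {f_stat:.2f}, p = {p_anova:.3f}")

# Pairwise Welch's t-tests (equal_var=False), as used for Table 2.
pairs = [("dyndis", "dyncon"), ("static", "dyncon"), ("static", "dyndis")]
for a, b in pairs:
    t_stat, p_val = stats.ttest_ind(scores[a], scores[b], equal_var=False)
    print(f"{a} vs. {b}: t = {t_stat:.2f}, p = {p_val:.3f}")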
This implies that, rather than trying to train a single model that learns to model the motion of all textures, there might be a potential gain in having separate models specialised in the interpolation of one texture class, namely texture-aware frame interpolation.

Under the proposed VFI framework, the aforementioned three texture types are treated separately. That is, given a VFI model, we generate three versions of it, each trained exclusively on one type of texture. In this work we focus on the performance of models on homogeneously-textured videos, so inference on a test video is performed in the same way as for the original model, with the difference that the specialised version matching the video texture type is used; a minimal sketch of this dispatch is given after Table 2. We show through our experiments that it is more difficult to learn a generic model that can interpolate all types of videos than to learn a specialised model that overfits a specific texture class.

Table 2. Results of T-tests between the scores obtained by each model on the three pairs of textures.
AdaCoF             PSNR                  SSIM                 VMAF
dyndis vs. dyncon  t(75)=0.10, p=0.92    t(87)=3.52, p=0.00   t(92)=8.93, p=0.00
static vs. dyncon  t(66)=8.66, p=0.00    t(47)=7.78, p=0.00   t(58)=12.28, p=0.00
static vs. dyndis  t(62)=11.41, p=0.00   t(54)=4.01, p=0.00   t(59)=3.84, p=0.00

CAIN               PSNR                  SSIM                 VMAF
dyndis vs. dyncon  t(73)=-0.19, p=0.85   t(87)=3.21, p=0.00   t(92)=8.35, p=0.00
static vs. dyncon  t(66)=8.64, p=0.00    t(46)=7.53, p=0.00   t(61)=12.08, p=0.00
static vs. dyndis  t(61)=11.91, p=0.00   t(53)=4.24, p=0.00   t(61)=4.06, p=0.00

GDConvNet          PSNR                  SSIM                 VMAF
dyndis vs. dyncon  t(74)=0.18, p=0.86    t(87)=3.48, p=0.00   t(91)=8.90, p=0.00
static vs. dyncon  t(66)=8.47, p=0.00    t(46)=7.63, p=0.00   t(61)=11.85, p=0.00
static vs. dyndis  t(63)=11.03, p=0.00   t(53)=3.89, p=0.00   t(60)=3.40, p=0.00
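To make the inference procedure of the proposed framework concrete, the sketch below routes a test sequence to the version of a model fine-tuned on its texture class. The checkpoint names and the texture classifier are hypothetical placeholders rather than components of a released implementation; in this work the HomTex class labels are used directly.

import torch

# Hypothetical checkpoint paths for the three specialised versions of one model.
CHECKPOINTS = {
    "static": "adacof_static.pth",
    "dyndis": "adacof_dyndis.pth",
    "dyncon": "adacof_dyncon.pth",
}

def classify_texture(sequence):
    """Placeholder: return "static", "dyndis" or "dyncon" for a sequence.
    In this work the HomTex class labels are used directly; a general
    system would need an automatic texture classifier here."""
    raise NotImplementedError

def interpolate_tafi(build_model, frame0, frame1, texture_class):
    """Run the version of the model fine-tuned on the given texture class."""
    model = build_model()                       # same architecture for all versions
    state = torch.load(CHECKPOINTS[texture_class], map_location="cpu")
    model.load_state_dict(state)
    model.eval()
    with torch.no_grad():
        return model(frame0, frame1)            # predicted intermediate frame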
4. EXPERIMENTS

4.1. Experiment Setup
We focus on the three models used previously for analysis, i.e. AdaCoF [15], CAIN [14] and GDConvNet [17]. The test data used for evaluation are the 120 homogeneous sequences in HomTex [21]. For the purpose of training models tuned to a single texture type, the training dataset should contain homogeneous sequences. Therefore, we make use of three existing texture datasets, namely DynTex [20], BVI-Texture [23] and SynTex [24]. Specifically, DynTex contains 650 annotated videos at 25 fps, and we retained those that contain only one type of texture according to the annotations. BVI-Texture and SynTex contain 20 and 196 full-HD (1920x1080) homogeneous sequences at 60 fps respectively. It should be noted that, since HomTex is composed of videos from DynTex and BVI-Texture, we removed the DynTex sequences that exist in HomTex as well as other sequences that are visually similar to them. This resulted in a total of 214 dynamic continuous, 222 dynamic discrete and 110 static sequences, with each sequence containing at least 250 frames.

Due to the lack of a diverse homogeneous video texture dataset, the models are initialised with the pre-trained weights provided and fine-tuned on our training sets. The loss functions, optimisation strategies and other hyperparameters except the learning rate are set identical to the original implementations of the models. To fine-tune a model on a specific texture class, each training batch is formed by randomly sampling 10000 triplets from videos of that texture, where each time a sequence is randomly selected and then a triplet (or quintuplet for GDConvNet) of patches is sampled randomly from the space-time volume. The sampled frames are augmented via random horizontal and vertical flipping, colour jittering and temporal order reversal. Model-specific batch sizes and initial learning rates are used for AdaCoF, CAIN and GDConvNet. The models are fine-tuned on each video texture class for 10 epochs, with the learning rates decayed every 4 epochs. The computations are performed on Nvidia P100 GPU cards provided on the shared cluster BlueCrystal Phase 4 [27] at the University of Bristol.

In order to investigate whether there is a benefit in texture-specific training instead of training the models on all texture classes, in this experiment we fine-tune each model on four training sets: three for the three texture classes as described above, and also the combination of them (referred to as "mixed"). We name the fine-tuned versions after the targeted texture type, namely "dyncon", "dyndis", "static" and "mixed". The four fine-tuned versions, as well as the original "off-the-shelf" version (regarded as the baseline) of each model, are then evaluated on HomTex. The average PSNR and SSIM scores obtained for each texture subset in HomTex (i.e. HomTex-dyncon, HomTex-dyndis and HomTex-static) and the overall HomTex dataset (HomTex-overall) are summarised in Table 3.

We see from Table 3 that for all baseline models, texture-class tuning increased their performance on test sequences from that class in terms of both PSNR and SSIM, although in general the variations in SSIM scores are marginal. Meanwhile, models fine-tuned on a class generally performed worse on other textures, implying some overfitting, which is expected.
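For concreteness, the texture-specific patch sampling and augmentation step described in the setup above might be implemented along the following lines; the patch size, tensor layout and the omission of colour jittering are assumptions made for brevity rather than details of the actual implementation.

import random
import torch

def sample_triplet(video, patch_size=256):
    """video: (T, C, H, W) tensor holding one homogeneous texture sequence.
    Returns a (previous, target, next) triplet of co-located patches."""
    t, _, h, w = video.shape
    i = random.randrange(1, t - 1)                 # index of the ground-truth frame
    y = random.randrange(0, h - patch_size + 1)
    x = random.randrange(0, w - patch_size + 1)
    crop = lambda f: f[:, y:y + patch_size, x:x + patch_size]
    triplet = [crop(video[i - 1]), crop(video[i]), crop(video[i + 1])]

    # Augmentation: random horizontal/vertical flips and temporal reversal
    # (colour jittering is omitted here for brevity).
    if random.random() < 0.5:
        triplet = [f.flip(-1) for f in triplet]    # horizontal flip
    if random.random() < 0.5:
        triplet = [f.flip(-2) for f in triplet]    # vertical flip
    if random.random() < 0.5:
        triplet = triplet[::-1]                    # reverse temporal order
    return triplet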
The improvements due to class-based training are particularly evident for dynamic textures. The performance variations are more obviously reflected by PSNR, and this is because all models use an ℓ1 distortion as a major component of their loss functions, so one can expect a lower mean-squared error after training and hence the more obvious variations in PSNR.

Table 3. Performance of the original (baseline) and texture-aware fine-tuned versions of AdaCoF, CAIN and GDConvNet on HomTex. The version indicates the texture the model is fine-tuned on. The combination of the dyncon, dyndis and static versions forms the TAFI model. Models are evaluated on the three individual texture types in HomTex as well as on the whole HomTex dataset. Numbers in brackets denote change with respect to the baseline model. For each column, the best result is in bold text. Our TAFI framework consistently improved the overall performance of the baseline models.

                          HomTex-dyncon              HomTex-dyndis              HomTex-static              HomTex-overall
model      version        PSNR          SSIM         PSNR          SSIM         PSNR          SSIM         PSNR          SSIM
AdaCoF     baseline       25.90         0.68         26.03         0.83         36.69         0.94         28.20         0.80
           dyncon (ours)
           dyndis (ours)  26.14(+0.24)  0.69(+0.01)
           mixed          26.19(+0.29)  0.69(+0.01)  26.12(+0.09)  0.84(+0.01)  36.01(-0.68)  0.93(-0.01)  28.22(+0.02)  0.80(+0.00)
CAIN       baseline       26.26         0.70         26.03         0.83         36.96         0.95         28.39         0.81
           dyncon (ours)
           dyndis (ours)  26.44(+0.18)  0.70(-0.00)
           mixed          26.42(+0.16)  0.70(+0.00)  25.84(-0.19)  0.84(+0.01)  36.07(-0.89)  0.94(-0.01)  28.19(-0.20)  0.81(+0.00)
GDConvNet  baseline       26.38         0.70         26.60         0.84         36.97         0.95         28.68         0.81
           dyncon (ours)
           dyndis (ours)  26.02(-0.36)  0.69(-0.01)
           mixed          26.48(+0.10)  0.70(+0.00)  26.72(+0.12)  0.84(+0.00)  35.38(-1.59)  0.93(-0.02)  28.43(-0.25)  0.81(-0.00)

Comparing the specialised models that are fine-tuned on single texture types (which together form the TAFI model) against the model fine-tuned on the mixed training set, we see that in all cases the former managed to deliver an increased gain in terms of both PSNR and SSIM. This result is consistent with our hypothesis that learning a generic model is relatively harder than specialising on a specific video texture class. It is noted that the models fine-tuned on the mixed set consistently exhibited worse performance on static textures, while performance on the other two texture types is mostly improved. This can be attributed to the fact that the static sequences constitute a relatively small proportion (20%) of the mixed set, making the training more biased towards the dynamic texture types.

Finally, comparing the performance of the models on the entire HomTex dataset, it is clear that for all three VFI models, the combination of their specialised versions (TAFI) achieved higher PSNR and SSIM than their baselines, with approximately a 0.3dB gain in PSNR after texture-aware fine-tuning. Example qualitative interpolation results of the specialised AdaCoF models and the original AdaCoF are given in Fig. 1, where it can be observed that the specialised versions (AdaCoF-TAFI) produce intermediate frames with higher visual quality. This is particularly evident for dynamic textures, where the original AdaCoF model fails to capture the more complex motion and produces numerous visual artefacts.
5. CONCLUSION AND FUTURE WORK
In this work, we applied the video texture categorisation proposed in previous work to the problem of video frame interpolation (VFI). The effect of video texture type on VFI was studied by evaluating three state-of-the-art models on a video dataset, HomTex, which contains homogeneous sequences of static, dynamic discrete and dynamic continuous textures. The results showed that the models perform differently on different textures, with statistical significance. Motivated by this observation, we fine-tuned the three VFI models on each of the three types of textures as well as on their combination, and found that although each model fine-tuned on a specific texture overfits that texture class, it outperformed the model trained on the combination of textures, supporting our hypothesis that it is beneficial in terms of interpolation quality to have separate models specialised in different textures, instead of training a single model that learns to interpolate all kinds of textures.

To further confirm these findings, VFI models should be completely re-trained on larger-scale homogeneous video datasets, which do not yet exist. A possible direction of future work is to construct such a dataset and train state-of-the-art VFI models on homogeneous videos from scratch. In addition, despite the improved performance, our framework requires three times more computation compared to the original VFI models, and it is up to future research to develop a single model that can adapt to different textures. Finally, in this work we focused on the performance of VFI on homogeneous videos and did not consider fusion of the specialised models to interpolate generic sequences. Development of such ensemble algorithms will also be part of our future work.
6. REFERENCES

[1] H. Choi and I. V. Bajić, "Deep frame prediction for video coding," IEEE Transactions on Circuits and Systems for Video Technology, vol. 30, no. 7, pp. 1843-1855, July 2020.
[2] M. Usman, X. He, K. Lam, M. Xu, S. M. M. Bokhari and J. Chen, "Frame interpolation for cloud-based mobile video streaming," IEEE Transactions on Multimedia, vol. 18, no. 5, pp. 831-839, May 2016.
[3] S. Baker, S. Roth, D. Scharstein, M. J. Black, J. P. Lewis and R. Szeliski, "A database and evaluation methodology for optical flow," in IEEE 11th International Conference on Computer Vision, 2007, pp. 1-8.
[4] Z. Liu, R. A. Yeh, X. Tang, Y. Liu and A. Agarwala, "Video frame synthesis using deep voxel flow," in IEEE International Conference on Computer Vision, 2017, pp. 4473-4481.
[5] H. Jiang, D. Sun, V. Jampani, M. Yang, E. Learned-Miller and J. Kautz, "Super SloMo: high quality estimation of multiple intermediate frames for video interpolation," in IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 9000-9008.
[6] S. Niklaus and F. Liu, "Context-aware synthesis for video frame interpolation," in IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 1701-1710.
[7] W. Bao, W. Lai, C. Ma, X. Zhang, Z. Gao and M. Yang, "Depth-aware video frame interpolation," in IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 3698-3707.
[8] T. Xue, B. Chen, J. Wu, D. Wei and W. T. Freeman, "Video enhancement with task-oriented flow," International Journal of Computer Vision, vol. 127, no. 8, pp. 1106-1125, 2019.
[9] S. Niklaus and F. Liu, "Softmax splatting for video frame interpolation," in IEEE Conference on Computer Vision and Pattern Recognition, 2020, pp. 5436-5445.
[10] G. Long, L. Kneip, J. M. Alvarez, H. Li, X. Zhang and Q. Yu, "Learning image matching by simply watching video," in European Conference on Computer Vision, 2016, pp. 434-450.
[11] S. Niklaus, L. Mai and F. Liu, "Video frame interpolation via adaptive convolution," in IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2270-2279.
[12] S. Niklaus, L. Mai and F. Liu, "Video frame interpolation via adaptive separable convolution," in IEEE International Conference on Computer Vision, 2017, pp. 261-270.
[13] S. Niklaus, L. Mai and O. Wang, "Revisiting adaptive convolutions for video frame interpolation," in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2021, pp. 1099-1109.
[14] M. Choi, H. Kim, B. Han, N. Xu and K. M. Lee, "Channel attention is all you need for video frame interpolation," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 7, pp. 10663-10671, April 2020.
[15] H. Lee, T. Kim, T.-y. Chung, D. Pak, Y. Ban and S. Lee, "AdaCoF: adaptive collaboration of flows for video frame interpolation," in IEEE Conference on Computer Vision and Pattern Recognition, 2020, pp. 5315-5324.
[16] X. Cheng and Z. Chen, "Video frame interpolation via deformable separable convolution," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 7, pp. 10607-10614, April 2020.
[17] Z. Shi, X. Liu, K. Shi, L. Dai and J. Chen, "Video interpolation via generalized deformable convolution," 2020, arXiv:2008.10680. [Online]. Available: https://arxiv.org/abs/2008.10680.
[18] J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu and Y. Wei, "Deformable convolutional networks," in IEEE International Conference on Computer Vision, 2017, pp. 764-773.
[19] F. Zhang and D. R. Bull, "A parametric framework for video compression using region-based texture models," IEEE Journal of Selected Topics in Signal Processing, vol. 5, no. 7, pp. 1378-1392, 2011.
[20] R. Péteri, S. Fazekas and M. J. Huiskes, "DynTex: A comprehensive database of dynamic textures," Pattern Recognition Letters, vol. 31, no. 12, pp. 1627-1632, 2010.
[21] M. Afonso, A. Katsenou, F. Zhang, D. Agrafiotis and D. Bull, "Video texture analysis based on HEVC encoding statistics," in Picture Coding Symposium, 2016, pp. 1-5.
[22] A. V. Katsenou, T. Ntasios, M. Afonso, D. Agrafiotis and D. R. Bull, "Understanding video texture — A basis for video compression," in IEEE 19th International Workshop on Multimedia Signal Processing, 2017, pp. 1-6.
[23] M. A. Papadopoulos, F. Zhang, D. Agrafiotis and D. Bull, "A video texture database for perceptual compression and quality assessment," in IEEE International Conference on Image Processing, 2015, pp. 2781-2785.
[24] D. Ma, A. V. Katsenou and D. R. Bull, "A synthetic video dataset for video compression evaluation," in IEEE International Conference on Image Processing, 2019, pp. 1094-1098.
[25] Z. Wang, A. C. Bovik, H. R. Sheikh and E. P. Simoncelli, "Image quality assessment: from error visibility to structural similarity," IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600-612, April 2004.