Texture-aware Video Frame Interpolation
Duolikun Danier and David Bull
Visual Information Laboratory, University of Bristol, Bristol, BS8 1UB, United Kingdom
{Duolikun.Danier, Dave.Bull}@bristol.ac.uk

ABSTRACT
Temporal interpolation has the potential to be a powerful tool for video compression. Existing methods for frame interpolation do not discriminate between video textures and generally invoke a single general model capable of interpolating a wide range of video content. However, past work on video texture analysis and synthesis has shown that different textures exhibit vastly different motion characteristics and that they can be divided into three classes (static, dynamic continuous and dynamic discrete). In this work, we study the impact of video textures on video frame interpolation, and propose a novel framework where, given an interpolation algorithm, separate models are trained on different textures. Our study shows that video texture has a significant impact on the performance of frame interpolation models and that it is beneficial to have separate models specifically adapted to these texture classes, instead of training a single model that tries to learn generic motion. Our results demonstrate that models fine-tuned using our framework achieve, on average, a 0.3dB gain in PSNR on the test set used.
1. INTRODUCTION
Video frame interpolation (VFI) refers to the task of generating non-existent intermediate frames between any two consecutive frames in a video while preserving spatiotemporal consistencies. VFI allows up-conversion of frame rates for improved visual quality, and offers an important tool for video compression, where it can be used to replace or enhance conventional motion estimation [1] or to conceal errors in reconstructed frames [2]. Interpolating visually pleasing intermediate frames requires accurate modelling of motion. However, due to the variety of textures and motion patterns in real-world videos, it is difficult to capture the motion of different texture types using a single mathematical model. Hence frame interpolation remains a challenging task.
This work was supported by the China Scholarship Council - University of Bristol Scholarship, Grant No. 202008060038.

Existing VFI methods can be classified as either flow-based or kernel-based. Flow-based methods [3-9] generally involve two steps, namely motion estimation and frame synthesis, where the latter typically resorts to pixel-wise warping of the two adjacent frames utilising the motion information to predict the intermediate frame. However, these methods generally experience a degradation in performance under challenging conditions where finding reference pixels in nearby frames is made difficult, e.g. under illumination change, occlusion and large motion. On the other hand, kernel-based methods [10-17] estimate a convolution kernel for each output pixel and predict the pixel by convolving nearby regions in adjacent frames with the kernel. Although in this case each output pixel is synthesised by referring to multiple pixels in the adjacent frames and occlusion can be better handled, the complexity of the captured motion is still limited by the size of the kernels.

The aforementioned methods are mostly based on convolutional neural networks (CNNs). These have advanced the state of the art in VFI over the past few years with various innovations relating to their overall architectures and to specific processing stages. One common feature amongst them is that a single model is trained to interpolate all kinds of video frame. In contrast, previous work on video texture analysis and synthesis [19-24] has shown that different textures exhibit vastly different motion patterns which are better handled separately. For example, Zhang et al. [19] proposed an analysis-synthesis video compression framework in which different algorithms were adopted to synthesise dynamic and static textures.

A finer classification of video textures was proposed in [21, 22], where the authors clustered homogeneous sequences based on texture-relevant features as well as the HM encoding statistics, and found that three video texture clusters exist, referred to as static (rigid texture exhibiting perspective motion), dynamic discrete (texture with discernible parts undergoing perspective motion independently) and dynamic continuous (spatially irregular and unstructured texture moving as a continuum). Example frames from each class taken from the HomTex dataset [21] are shown in Fig. 1.

Fig. 1. Interpolation results of our texture-aware fine-tuned AdaCoF (AdaCoF-TAFI) and the original AdaCoF [15] on sample sequences from static (top row), dynamic discrete (middle row) and dynamic continuous (bottom row) textures. Sequences are "PaintingTilting1", "RiceField" and "ShinnyBlueWater downsampled" from the HomTex dataset [21].

While generalisation is a desirable property of deep learning models, it has also proved difficult to achieve due to the issue of over-fitting. Motivated by the classification of video textures in previous work, in this contribution we investigate the impact of video texture type on the performance of state-of-the-art VFI models. Based on our observations, we hypothesise that it is harder to train a generic VFI model that performs well on all texture types than to train separate models, each specialising in a specific texture (i.e. overfitting a texture class); we verify our hypothesis through experimentation. As a consequence, we propose a novel texture-aware frame interpolation (TAFI) framework that can generalise to any VFI method and improve its performance.
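To make the warping step used by flow-based methods more concrete, the following is a minimal sketch in PyTorch, assuming flow fields from the intermediate frame to each reference frame have already been estimated (as in [4]); it illustrates the general principle rather than the implementation of any particular cited model.

import torch
import torch.nn.functional as F

def backward_warp(src, flow_t_to_src):
    """Sample `src` (N, C, H, W) at positions displaced by the dense flow
    field `flow_t_to_src` (N, 2, H, W, in pixels), producing the view of
    `src` as seen from the intermediate time instant."""
    n, _, h, w = src.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().to(src.device)   # (2, H, W)
    coords = base.unsqueeze(0) + flow_t_to_src                   # (N, 2, H, W)
    # Normalise pixel coordinates to [-1, 1] as required by grid_sample.
    gx = 2.0 * coords[:, 0] / (w - 1) - 1.0
    gy = 2.0 * coords[:, 1] / (h - 1) - 1.0
    grid = torch.stack((gx, gy), dim=-1)                         # (N, H, W, 2)
    return F.grid_sample(src, grid, mode="bilinear", align_corners=True)

def naive_midpoint(frame0, frame1, flow_t_to_0, flow_t_to_1):
    # Simple average of the two warped references; real models add
    # occlusion reasoning, learned blending and refinement stages.
    return 0.5 * (backward_warp(frame0, flow_t_to_0) +
                  backward_warp(frame1, flow_t_to_1))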
2. RELATED WORK
Conventionally, VFI has been addressed by estimating optical flow and warping two adjacent frames using the flow map [3]. Such methods rely heavily on the adopted flow estimation algorithm and have generally been found to suffer from occlusion and large motion. Recent flow-based methods improved upon this framework by deploying CNNs for flow estimation and frame synthesis. Liu et al. [4] trained an encoder-decoder network to estimate the optical flow from the intermediate frame to the two reference frames, which is then used by a sampling layer to interpolate each output pixel. The complexity of the captured motion was limited since equal flow in the two directions was assumed. Jiang et al. [5] improved on [4] by enabling bidirectional flow. Niklaus et al. [6] incorporated context information during the warping process. Bao et al. [7] additionally estimated a depth map to refine the estimated flow. Recently, Niklaus et al. [9] proposed to use softmax splatting for differentiable forward warping such that different reference pixels can map to the same output pixel. Although this allows more complex motion to be captured, such complexity is still limited due to the dependence on flow estimation and pixel-wise warping.

On the other hand, kernel-based methods inherently allow more pixels to be sampled for predicting one output pixel. Niklaus et al. [11] proposed a CNN (improved later in [12] and [13]) to predict a kernel for each output pixel which is convolved with the input image to obtain the target pixel. Instead of predicting individual kernels for each pixel, Choi et al. [14] proposed an end-to-end network with channel attention that directly outputs the interpolated frame. Recently, there has been an increasing number of works on VFI that employ deformable convolution [18]. Lee et al. [15] developed a CNN to predict the kernel weights together with their offsets for each output pixel. The flexibility of deformable kernels enabled more complex motion to be captured. Cheng et al. [16] adopted a similar approach but used separable kernels. Shi et al. [17] enabled a further degree of freedom by allowing the reference pixels to be sampled in an interpolated space-time volume instead of just from the existing frames.

While these VFI methods were developed to generalise to all types of videos, previous work on video texture analysis and synthesis adopted a different approach. In [19], Zhang et al. observed that textured regions in videos typically consume more bits to encode, and proposed to synthesise such regions and encode only the warping parameters, where different synthesis methods were used for dynamic and static textures. In [21], Afonso et al. analysed video texture by clustering homogeneous videos based on their encoding statistics (e.g. prediction modes, bit allocation etc.) and showed that video texture can be categorised as static, dynamic discrete and dynamic continuous. A homogeneous texture dataset, HomTex, was produced. Katsenou et al. [22] confirmed the existence of such a categorisation by clustering videos based on their spatiotemporal features. In our work, the implications of such texture categorisation for VFI are investigated.
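To illustrate the core operation behind the kernel-based family, the snippet below applies a predicted, spatially varying kernel to the neighbourhood of every pixel of one reference frame. This is a simplified sketch, not the exact formulation used in [11-17]; in a full model the kernels would come from a prediction network and the contributions from the two reference frames would be combined.

import torch
import torch.nn.functional as F

def adaptive_local_convolution(frame, kernels, k=5):
    """frame:   (N, C, H, W) reference frame.
    kernels: (N, k*k, H, W), one k-by-k kernel per output pixel,
             e.g. softmax-normalised by a kernel-prediction network.
    Every output pixel is a weighted sum of the k*k neighbourhood
    around the co-located position in `frame`."""
    n, c, h, w = frame.shape
    # Gather the k*k neighbourhood of every pixel: (N, C*k*k, H*W).
    patches = F.unfold(frame, kernel_size=k, padding=k // 2)
    patches = patches.view(n, c, k * k, h, w)
    weights = kernels.view(n, 1, k * k, h, w)
    return (patches * weights).sum(dim=2)

Deformable variants such as [15-17] additionally predict per-pixel offsets, so the sampled neighbourhood need not be centred on the co-located position.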
3. TEXTURE-AWARE FRAME INTERPOLATION

3.1. Effect of Texture on VFI Performance
In this section we investigate the performance of VFI methods on different video textures. Specifically, the pre-trained versions of three CNN-based state-of-the-art VFI models with publicly available source code are evaluated: AdaCoF [15], CAIN [14] and GDConvNet [17]. The evaluation dataset used is HomTex [21], which contains 120 videos, each with 250 frames, at 25 and 60 fps.
Fig. 2. Performance of the three selected models on the three texture types, in terms of PSNR, SSIM and VMAF.
Table 1. Results of the one-way ANOVA test on PSNR, SSIM and VMAF scores obtained by the VFI models on different textures.
         AdaCoF            CAIN              GDConvNet
         F(2,117)   p      F(2,117)   p      F(2,117)   p
PSNR     38.41      0.00   40.14      0.00   35.99      0.00
SSIM     17.58      0.00   16.16      0.00   16.91      0.00
VMAF     73.57      0.00   67.92      0.00   70.07      0.00

Each video in HomTex is (approximately) texturally homogeneous; there are 45, 50 and 25 sequences labelled dynamic continuous, dynamic discrete and static respectively. Each model is evaluated on the entire HomTex dataset, where for each sequence every second frame, I_t, is considered the ground truth and the two adjacent frames I_{t-1} and I_{t+1} are input to the model (in the case of GDConvNet, four frames, namely I_{t-3}, I_{t-1}, I_{t+1} and I_{t+3}, are used as input due to the model design). Peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) [25] scores are computed using the original and interpolated frames, as these are the most commonly used metrics in VFI [3-17]. Additionally, video multi-method assessment fusion (VMAF) [26] scores are computed using the original and interpolated videos. The scores of each model are then grouped according to the video texture class, and their distributions are shown in Fig. 2, where the abbreviations "dyncon" and "dyndis" are used for dynamic continuous and dynamic discrete.

It is noticeable that the distributions of the scores obtained by the models vary with texture type. Specifically, all models scored the highest PSNR, SSIM and VMAF on static textures, which is expected since the perspective motion of rigid objects is relatively easier to capture. For dynamic textures, all models see degraded performance across all metrics, and this decrease is even more severe for dynamic continuous textures, which exhibit the most irregular motion, intrinsic to scenes containing water, smoke, fire, etc.

To assess whether the impact of texture on VFI performance is statistically significant, a one-way ANOVA is performed, with results shown in Table 1. Here we see that, in terms of all metrics, texture type has a significant effect on the performance of the VFI models at the p < 0.05 level. Furthermore, T-tests (Welch's T-tests, to account for unequal sample sizes) are performed between the scores of each model on each pair of textures to see if there are significant differences, and the results are summarised in Table 2. The last two rows of the table show that, in terms of all metrics, the average performance of the models on static textures is significantly different from that on dynamic discrete and dynamic continuous textures. Comparing the two dynamic types, all models have significantly different SSIM and VMAF scores. Combining these results with Fig. 2, we can confirm that, for p < 0.05, all models performed best on static textures and worst on dynamic continuous textures. There are two possible reasons for this observation. The first is that the dynamic texture types fall outside the distribution of the three models' training sets. We consider this reason less important since all models were trained on the Vimeo90K dataset [8], which involves a wide range of non-homogeneous videos covering all types of textures. The second reason is that the complexity of the motion patterns, and hence the difficulty of interpolating different textures, is inherently different, with dynamic continuous textures exhibiting the most unpredictable motion and thus being the hardest to interpolate, and static textures being the easiest.
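The grouped significance testing described above can be reproduced with standard SciPy routines. The sketch below assumes the per-sequence PSNR scores of one model have already been gathered into one list per texture class; the numbers shown are placeholders, not the actual results reported in Tables 1 and 2.

from scipy import stats

# Per-sequence PSNR scores of one model, grouped by texture class
# (placeholder values; HomTex has 45, 50 and 25 sequences per class).
scores = {
    "dyncon": [24.1, 26.3, 25.7, 23.9],
    "dyndis": [25.9, 27.0, 26.2, 26.8],
    "static": [35.4, 37.1, 36.8],
}

# One-way ANOVA: does texture class have a significant effect on the score?
f_stat, p_anova = stats.f_oneway(scores["dyncon"], scores["dyndis"], scores["static"])
print(f"ANOVA: F = {f_stat:.2f}, p = {p_anova:.3f}")

# Pairwise Welch's t-tests (equal_var=False), as used for Table 2.
pairs = [("dyndis", "dyncon"), ("static", "dyncon"), ("static", "dyndis")]
for a, b in pairs:
    t_stat, p_val = stats.ttest_ind(scores[a], scores[b], equal_var=False)
    print(f"{a} vs. {b}: t = {t_stat:.2f}, p = {p_val:.3f}")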
This implies that, rather than trying to train a single model that learns to model the motion of all textures, there might be a potential gain in having separate models specialised in the interpolation of one texture class, namely texture-aware frame interpolation.

Under the proposed VFI framework, the aforementioned three texture types are treated separately. That is, given a VFI model, we generate three versions of it, each trained exclusively on one type of texture. In this work we focus on the performance of models on homogeneously-textured videos, so inference on a test video is performed in the same way as for the original model, with the difference that the specialised version matching the video texture type is used; a minimal sketch of this dispatch is given after Table 2. We show through our experiments that it is more difficult to learn a generic model that can interpolate all types of videos than to learn a specialised model that overfits a specific texture class.

Table 2. Results of T-tests between the scores obtained by each model on the three pairs of textures.
AdaCoF             PSNR                  SSIM                 VMAF
dyndis vs. dyncon  t(75)=0.10, p=0.92    t(87)=3.52, p=0.00   t(92)=8.93, p=0.00
static vs. dyncon  t(66)=8.66, p=0.00    t(47)=7.78, p=0.00   t(58)=12.28, p=0.00
static vs. dyndis  t(62)=11.41, p=0.00   t(54)=4.01, p=0.00   t(59)=3.84, p=0.00

CAIN               PSNR                  SSIM                 VMAF
dyndis vs. dyncon  t(73)=-0.19, p=0.85   t(87)=3.21, p=0.00   t(92)=8.35, p=0.00
static vs. dyncon  t(66)=8.64, p=0.00    t(46)=7.53, p=0.00   t(61)=12.08, p=0.00
static vs. dyndis  t(61)=11.91, p=0.00   t(53)=4.24, p=0.00   t(61)=4.06, p=0.00

GDConvNet          PSNR                  SSIM                 VMAF
dyndis vs. dyncon  t(74)=0.18, p=0.86    t(87)=3.48, p=0.00   t(91)=8.90, p=0.00
static vs. dyncon  t(66)=8.47, p=0.00    t(46)=7.63, p=0.00   t(61)=11.85, p=0.00
static vs. dyndis  t(63)=11.03, p=0.00   t(53)=3.89, p=0.00   t(60)=3.40, p=0.00
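To make the inference procedure of the proposed framework concrete, the sketch below routes a test sequence to the version of a model fine-tuned on its texture class. The checkpoint names and the texture classifier are hypothetical placeholders rather than components of a released implementation; in this work the HomTex class labels are used directly.

import torch

# Hypothetical checkpoint paths for the three specialised versions of one model.
CHECKPOINTS = {
    "static": "adacof_static.pth",
    "dyndis": "adacof_dyndis.pth",
    "dyncon": "adacof_dyncon.pth",
}

def classify_texture(sequence):
    """Placeholder: return "static", "dyndis" or "dyncon" for a sequence.
    In this work the HomTex class labels are used directly; a general
    system would need an automatic texture classifier here."""
    raise NotImplementedError

def interpolate_tafi(build_model, frame0, frame1, texture_class):
    """Run the version of the model fine-tuned on the given texture class."""
    model = build_model()                       # same architecture for all versions
    state = torch.load(CHECKPOINTS[texture_class], map_location="cpu")
    model.load_state_dict(state)
    model.eval()
    with torch.no_grad():
        return model(frame0, frame1)            # predicted intermediate frame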
4. EXPERIMENTS

4.1. Experiment Setup
We focus on the three models used previously for analysis, i.e. AdaCoF [15], CAIN [14] and GDConvNet [17]. The test data used for evaluation are the 120 homogeneous sequences in HomTex [21]. For the purpose of training models tuned to a single texture type, the training dataset should contain homogeneous sequences. Therefore, we make use of three existing texture datasets, namely DynTex [20], BVI-Texture [23] and SynTex [24]. Specifically, DynTex contains 650 annotated videos at 25 fps, and we retained those that contain only one type of texture according to the annotations. BVI-Texture and SynTex contain 20 and 196 full-HD (1920x1080) homogeneous sequences at 60 fps respectively. It should be noted that, since HomTex is composed of videos from DynTex and BVI-Texture, we removed the DynTex sequences that exist in HomTex as well as other sequences that are visually similar to them. This resulted in a total of 214 dynamic continuous, 222 dynamic discrete and 110 static sequences, with each sequence containing at least 250 frames.

Due to the lack of a diverse homogeneous video texture dataset, the models are initialised with the pre-trained weights provided and fine-tuned on our training sets. The loss functions, optimisation strategies and other hyperparameters except the learning rate are set identical to the original implementations of the models. To fine-tune a model on a specific texture class, each training batch is formed by randomly sampling 10000 triplets from videos of that texture, where each time a sequence is randomly selected and then a triplet (or quintuplet for GDConvNet) of patches is sampled randomly from the space-time volume. The sampled frames are augmented via random horizontal and vertical flipping, colour jittering and temporal order reversal. Model-specific batch sizes and initial learning rates are used for AdaCoF, CAIN and GDConvNet. The models are fine-tuned on each video texture class for 10 epochs, with the learning rates decayed every 4 epochs. The computations are performed on Nvidia P100 GPU cards provided on the shared cluster BlueCrystal Phase 4 [27] at the University of Bristol.

In order to investigate whether there is a benefit in texture-specific training instead of training the models on all texture classes, in this experiment we fine-tune each model on four training sets: three for the three texture classes as described above, and also the combination of them (referred to as "mixed"). We name the fine-tuned versions after the targeted texture type, namely "dyncon", "dyndis", "static" and "mixed". The four fine-tuned versions, as well as the original "off-the-shelf" version (regarded as the baseline) of each model, are then evaluated on HomTex. The average PSNR and SSIM scores obtained for each texture subset in HomTex (i.e. HomTex-dyncon, HomTex-dyndis and HomTex-static) and the overall HomTex dataset (HomTex-overall) are summarised in Table 3.

We see from Table 3 that for all baseline models, texture-class tuning increased their performance on test sequences from that class in terms of both PSNR and SSIM, although in general the variations in SSIM scores are marginal. Meanwhile, models fine-tuned on a class generally performed worse on other textures, implying some overfitting, which is expected.
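For concreteness, the texture-specific patch sampling and augmentation step described in the setup above might be implemented along the following lines; the patch size, tensor layout and the omission of colour jittering are assumptions made for brevity rather than details of the actual implementation.

import random
import torch

def sample_triplet(video, patch_size=256):
    """video: (T, C, H, W) tensor holding one homogeneous texture sequence.
    Returns a (previous, target, next) triplet of co-located patches."""
    t, _, h, w = video.shape
    i = random.randrange(1, t - 1)                 # index of the ground-truth frame
    y = random.randrange(0, h - patch_size + 1)
    x = random.randrange(0, w - patch_size + 1)
    crop = lambda f: f[:, y:y + patch_size, x:x + patch_size]
    triplet = [crop(video[i - 1]), crop(video[i]), crop(video[i + 1])]

    # Augmentation: random horizontal/vertical flips and temporal reversal
    # (colour jittering is omitted here for brevity).
    if random.random() < 0.5:
        triplet = [f.flip(-1) for f in triplet]    # horizontal flip
    if random.random() < 0.5:
        triplet = [f.flip(-2) for f in triplet]    # vertical flip
    if random.random() < 0.5:
        triplet = triplet[::-1]                    # reverse temporal order
    return triplet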
The improvements due to class-based training are particularly evident for dynamic textures. The performance variations are more obviously reflected by PSNR, and this is because all models use an ℓ1 distortion as a major component of their loss functions, so one can expect a lower mean-squared error after training and hence the more obvious variations in PSNR.

Table 3. Performance of the original (baseline) and texture-aware fine-tuned versions of AdaCoF, CAIN and GDConvNet on HomTex. The version indicates the texture the model is fine-tuned on. The combination of the dyncon, dyndis and static versions forms the TAFI model. Models are evaluated on the three individual texture types in HomTex as well as on the whole HomTex dataset. Numbers in brackets denote change with respect to the baseline model. For each column, the best result is in bold text. Our TAFI framework consistently improved the overall performance of the baseline models.

                          HomTex-dyncon              HomTex-dyndis              HomTex-static              HomTex-overall
model      version        PSNR          SSIM         PSNR          SSIM         PSNR          SSIM         PSNR          SSIM
AdaCoF     baseline       25.90         0.68         26.03         0.83         36.69         0.94         28.20         0.80
           dyncon (ours)
           dyndis (ours)  26.14(+0.24)  0.69(+0.01)
           mixed          26.19(+0.29)  0.69(+0.01)  26.12(+0.09)  0.84(+0.01)  36.01(-0.68)  0.93(-0.01)  28.22(+0.02)  0.80(+0.00)
CAIN       baseline       26.26         0.70         26.03         0.83         36.96         0.95         28.39         0.81
           dyncon (ours)
           dyndis (ours)  26.44(+0.18)  0.70(-0.00)
           mixed          26.42(+0.16)  0.70(+0.00)  25.84(-0.19)  0.84(+0.01)  36.07(-0.89)  0.94(-0.01)  28.19(-0.20)  0.81(+0.00)
GDConvNet  baseline       26.38         0.70         26.60         0.84         36.97         0.95         28.68         0.81
           dyncon (ours)
           dyndis (ours)  26.02(-0.36)  0.69(-0.01)
           mixed          26.48(+0.10)  0.70(+0.00)  26.72(+0.12)  0.84(+0.00)  35.38(-1.59)  0.93(-0.02)  28.43(-0.25)  0.81(-0.00)

Comparing the specialised models that are fine-tuned on single texture types (which together form the TAFI model) against the model fine-tuned on the mixed training set, we see that in all cases the former managed to deliver an increased gain in terms of both PSNR and SSIM. This result is consistent with our hypothesis that learning a generic model is relatively harder than specialising on a specific video texture class. It is noted that the models fine-tuned on the mixed set consistently exhibited worse performance on static textures, while performance on the other two texture types is mostly improved. This can be attributed to the fact that the static sequences constitute a relatively small proportion (20%) of the mixed set, making the training more biased towards the dynamic texture types.

Finally, comparing the performance of the models on the entire HomTex dataset, it is clear that for all three VFI models, the combination of their specialised versions (TAFI) achieved higher PSNR and SSIM than their baselines, with approximately a 0.3dB gain in PSNR after texture-aware fine-tuning. Example qualitative interpolation results of the specialised AdaCoF models and the original AdaCoF are given in Fig. 1, where it can be observed that the specialised versions (AdaCoF-TAFI) produce intermediate frames with higher visual quality. This is particularly evident for dynamic textures, where the original AdaCoF model fails to capture the more complex motion and produces numerous visual artefacts.
5. CONCLUSION AND FUTURE WORK
In this work, we applied the video texture categorisation proposed in previous work to the problem of video frame interpolation (VFI). The effect of video texture type on VFI was studied by evaluating three state-of-the-art models on a video dataset, HomTex, which contains homogeneous sequences of static, dynamic discrete and dynamic continuous textures. The results showed that the models perform differently on different textures, with statistical significance. Motivated by this observation, we fine-tuned the three VFI models on each of the three types of textures as well as on their combination, and found that although each model fine-tuned on a specific texture overfits that texture class, it outperformed the model trained on the combination of textures, supporting our hypothesis that it is beneficial in terms of interpolation quality to have separate models specialised in different textures, instead of training a single model that learns to interpolate all kinds of textures.

To further confirm these findings, VFI models should be completely re-trained on larger-scale homogeneous video datasets, which do not yet exist. A possible direction of future work is to construct such a dataset and train state-of-the-art VFI models on homogeneous videos from scratch. In addition, despite the improved performance, our framework requires three times more computation compared to the original VFI models, and it is up to future research to develop a single model that can adapt to different textures. Finally, in this work we focused on the performance of VFI on homogeneous videos and did not consider fusion of the specialised models to interpolate generic sequences. Development of such ensemble algorithms will also be part of our future work.
6. REFERENCES

[1] H. Choi and I. V. Bajić, "Deep frame prediction for video coding," IEEE Transactions on Circuits and Systems for Video Technology, vol. 30, no. 7, pp. 1843-1855, July 2020.
[2] M. Usman, X. He, K. Lam, M. Xu, S. M. M. Bokhari and J. Chen, "Frame interpolation for cloud-based mobile video streaming," IEEE Transactions on Multimedia, vol. 18, no. 5, pp. 831-839, May 2016.
[3] S. Baker, S. Roth, D. Scharstein, M. J. Black, J. P. Lewis and R. Szeliski, "A database and evaluation methodology for optical flow," in IEEE 11th International Conference on Computer Vision, 2007, pp. 1-8.
[4] Z. Liu, R. A. Yeh, X. Tang, Y. Liu and A. Agarwala, "Video frame synthesis using deep voxel flow," in IEEE International Conference on Computer Vision, 2017, pp. 4473-4481.
[5] H. Jiang, D. Sun, V. Jampani, M. Yang, E. Learned-Miller and J. Kautz, "Super SloMo: high quality estimation of multiple intermediate frames for video interpolation," in IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 9000-9008.
[6] S. Niklaus and F. Liu, "Context-aware synthesis for video frame interpolation," in IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 1701-1710.
[7] W. Bao, W. Lai, C. Ma, X. Zhang, Z. Gao and M. Yang, "Depth-aware video frame interpolation," in IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 3698-3707.
[8] T. Xue, B. Chen, J. Wu, D. Wei and W. T. Freeman, "Video enhancement with task-oriented flow," International Journal of Computer Vision, vol. 127, no. 8, pp. 1106-1125, 2019.
[9] S. Niklaus and F. Liu, "Softmax splatting for video frame interpolation," in IEEE Conference on Computer Vision and Pattern Recognition, 2020, pp. 5436-5445.
[10] G. Long, L. Kneip, J. M. Alvarez, H. Li, X. Zhang and Q. Yu, "Learning image matching by simply watching video," in European Conference on Computer Vision, 2016, pp. 434-450.
[11] S. Niklaus, L. Mai and F. Liu, "Video frame interpolation via adaptive convolution," in IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2270-2279.
[12] S. Niklaus, L. Mai and F. Liu, "Video frame interpolation via adaptive separable convolution," in IEEE International Conference on Computer Vision, 2017, pp. 261-270.
[13] S. Niklaus, L. Mai and O. Wang, "Revisiting adaptive convolutions for video frame interpolation," in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2021, pp. 1099-1109.
[14] M. Choi, H. Kim, B. Han, N. Xu and K. M. Lee, "Channel attention is all you need for video frame interpolation," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 7, pp. 10663-10671, April 2020.
[15] H. Lee, T. Kim, T.-y. Chung, D. Pak, Y. Ban and S. Lee, "AdaCoF: adaptive collaboration of flows for video frame interpolation," in IEEE Conference on Computer Vision and Pattern Recognition, 2020, pp. 5315-5324.
[16] X. Cheng and Z. Chen, "Video frame interpolation via deformable separable convolution," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 7, pp. 10607-10614, April 2020.
[17] Z. Shi, X. Liu, K. Shi, L. Dai and J. Chen, "Video interpolation via generalized deformable convolution," 2020, arXiv:2008.10680. [Online]. Available: https://arxiv.org/abs/2008.10680.
[18] J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu and Y. Wei, "Deformable convolutional networks," in IEEE International Conference on Computer Vision, 2017, pp. 764-773.
[19] F. Zhang and D. R. Bull, "A parametric framework for video compression using region-based texture models," IEEE Journal of Selected Topics in Signal Processing, vol. 5, no. 7, pp. 1378-1392, 2011.
[20] R. Péteri, S. Fazekas and M. J. Huiskes, "DynTex: A comprehensive database of dynamic textures," Pattern Recognition Letters, vol. 31, no. 12, pp. 1627-1632, 2010.
[21] M. Afonso, A. Katsenou, F. Zhang, D. Agrafiotis and D. Bull, "Video texture analysis based on HEVC encoding statistics," in Picture Coding Symposium, 2016, pp. 1-5.
[22] A. V. Katsenou, T. Ntasios, M. Afonso, D. Agrafiotis and D. R. Bull, "Understanding video texture — A basis for video compression," in IEEE 19th International Workshop on Multimedia Signal Processing, 2017, pp. 1-6.
[23] M. A. Papadopoulos, F. Zhang, D. Agrafiotis and D. Bull, "A video texture database for perceptual compression and quality assessment," in IEEE International Conference on Image Processing, 2015, pp. 2781-2785.
[24] D. Ma, A. V. Katsenou and D. R. Bull, "A synthetic video dataset for video compression evaluation," in IEEE International Conference on Image Processing, 2019, pp. 1094-1098.
[25] Z. Wang, A. C. Bovik, H. R. Sheikh and E. P. Simoncelli, "Image quality assessment: from error visibility to structural similarity," IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600-612, April 2004.