All at Once: Temporally Adaptive Multi-Frame Interpolation with Advanced Motion Modeling
Zhixiang Chi, Rasoul Mohammadi Nasiri, Zheng Liu, Juwei Lu, Jin Tang, Konstantinos N Plataniotis
Noah's Ark Lab, Huawei Technologies; University of Toronto, Canada
{zhixiang.chi, rasoul.nasiri, zheng.liu1, tangjin, juwei.lu}@huawei.com, [email protected]

Abstract.
Recent advances in high refresh rate displays, as well as the increased interest in high-rate slow motion and frame rate up-conversion, fuel the demand for efficient and cost-effective multi-frame video interpolation solutions. In this regard, inserting multiple frames between consecutive video frames is of paramount importance for the consumer electronics industry. State-of-the-art methods are iterative solutions that interpolate one frame at a time, which introduces temporal inconsistencies and clearly noticeable visual artifacts. Departing from the state of the art, this work introduces a true multi-frame interpolator. It utilizes a pyramidal-style network in the temporal domain to complete the multi-frame interpolation task in one shot. A novel flow estimation procedure using a relaxed loss function, together with an advanced cubic-based motion model, further boosts interpolation accuracy when complex motion segments are encountered. Results on the Adobe240 dataset show that the proposed method generates visually pleasing, temporally consistent frames and outperforms the current best off-the-shelf method by 1.57 dB in PSNR with a model that is 8 times smaller and 7.7 times faster. The proposed method can easily be extended to interpolate a larger number of new frames while remaining efficient because of the one-shot mechanism.
1 Introduction

Video frame interpolation targets generating new frames for the moments at which no frame is recorded. It is mostly used in slow motion generation [26], adaptive streaming [25], and frame rate up-conversion [5]. Fast innovation in high refresh rate displays and the great interest in higher-rate slow motion and frame up-conversion create the need for multi-frame interpolation.

Recent efforts focus on the main challenges of interpolation, including occlusion and large motions, but they have not explored temporal consistency as a key factor in video quality, especially for multi-frame interpolation. Almost all existing methods interpolate one frame in each execution, and generating multiple frames is addressed either by iteratively generating a middle frame [19,15,27] or by independently creating each intermediate frame for the corresponding time stamp [10,2,3,17,14]. The former approach may cause error propagation by treating the generated middle frame as input, while the latter may suffer from temporal inconsistency due to the independent processing of each frame, which causes temporal jittering at playback. These artifacts are further enlarged as more frames are interpolated. An important point missed in existing methods is the variable level of difficulty in generating intermediate frames: the frames closer to the two input frames are easier to generate, and those with a larger temporal distance are more difficult. Consequently, current methods are not optimized in terms of model size and running time for multi-frame interpolation, which makes them inapplicable to real-life applications.

On the other hand, most state-of-the-art interpolation methods synthesize the intermediate frames by simply assuming a linear transition in motion between the pair of input frames. However, real-world motions reflected in video frames follow a variety of complex non-linear trends [26]. While a quadratic motion prediction model was proposed in [26] to overcome this limitation, it is still inadequate for modeling real-world scenarios, especially for non-rigid bodies, because it assumes constant acceleration. As the forces that move objects in the real world are not necessarily constant, acceleration varies over time.

To this end, we propose a temporal pyramidal processing structure that efficiently integrates multi-frame generation into one single network. Based on the expected level of difficulty, we adaptively process the easier cases (frames) with shallow parts of the network to guide the generation of harder frames, which are processed by deeper structures. Through joint optimization of all the intermediate frames, higher quality and temporal consistency can be ensured. In addition, we exploit the availability of multiple input frames, as in [26,13], to propose an advanced higher-order motion prediction model that captures variation in acceleration. Furthermore, inspired by [27], we develop a technique that boosts the quality of motion prediction as well as the final interpolation results by introducing a relaxed loss function to the optical flow (O.F.) estimation module. In particular, it gives the flexibility to map pixels to the neighborhood of their ground truth locations in the reference frame, so that a better motion prediction for the intermediate frames can be achieved.
Compared to the current state-of-the-art method [26], we outperform it in interpolation quality measured by PSNR by 1.57 dB on the Adobe240 dataset, with a model that is 8 times smaller and 7.7 times faster at generating 7 frames. We summarize our contributions as follows: 1) We propose a temporal pyramidal structure that integrates the multi-frame interpolation task into one single network to generate temporally consistent, high-quality frames; 2) We propose a higher-order motion model that exploits the variations in acceleration involved in real-world motion; 3) We develop a relaxed loss function for the flow estimation task that boosts interpolation quality; 4) We optimize the network size and speed so that it is applicable to real-world applications, especially on mobile devices.
2 Related Work

Recent efforts on frame interpolation have focused on dealing with the main sources of degradation in interpolation quality, such as large motion and occlusion. Different ideas have been proposed, such as estimating occlusion maps [10,28], learning an adaptive kernel for each pixel [19,18], exploring depth information [2], or extracting deep contextual features [17,3]. As most of these methods interpolate frames one at a time, inserting multiple frames is achieved by iteratively executing the models. As a fundamental issue, the step-wise implementation of multi-frame interpolation does not consider time continuity and may cause temporal inconsistency. In contrast, generating multiple frames in one integrated network implicitly enforces the network to generate temporally consistent sequences. The effectiveness of the integrated approach has been verified by Super SloMo [10]; however, that method is not purposely designed for the task of multi-frame interpolation. Specifically, what is missed in [10] is utilizing the error cue from the temporal distance between a middle frame and the input frames and optimizing the whole model accordingly. The proposed adaptive processing based on this difficulty pattern can therefore yield a more optimized solution, which is not considered in the state-of-the-art methods [10,2,3,26,19].

Given the estimated O.F. among the input frames, an important step in frame interpolation is modeling the traversal of pixels between the two frames. The most common approach is to consider a linear transition and scaling of the O.F. [28,17,10,15,3,2]. Recent work in [26,4] applied an acceleration-aware method by also involving the neighboring frames of the initial pair. However, in real life, the force applied to a moving object is not constant; thus, the motion does not follow a linear or quadratic pattern. In this paper, we propose a simple but powerful higher-order model to handle the more complex motions that happen in the real world, especially for non-rigid bodies. On the other hand, [10] imposes accurate estimation of the O.F. via a warping loss, whereas [27] reveals that accurate O.F. is not tailored for task-oriented problems. Motivated by that, we apply a flexible O.F. estimation between the initial frames, which gives higher flexibility to model complex motions.
3 Proposed Method

An overview of the proposed method is shown in Fig. 1. We use four input frames (I_{−1}, I_0, I_1, and I_2) to generate 7 frames (I_{t_i}, t_i = i/8, i ∈ [1, ..., 7]) between I_0 and I_1. We first use a two-stage O.F. estimation module to calculate the flows (f_{0→1}, f_{1→0}, f_{0→−1}, f_{1→2}) and then use these flows and cubic modeling to predict the flow between the input frames and the new frames. Our proposed temporal pyramidal network then refines the predicted O.F. and generates an initial estimation of the middle frames. Finally, the post processing network further improves the quality of the interpolated frames (I_{t_i}) with a similar temporal pyramid.
Fig. 1: An overview of the proposed multi-frame interpolation method (first- and second-stage flow estimation → cubic flow modeling → temporal pyramidal flow refinement → frame synthesis → temporal pyramidal post processing → 7 interpolated frames).
Cubic motion modeling. In this work, we integrate cubic motion modeling to specifically handle the acceleration variation in motions. Considering the motion starting from I_0 to a middle time stamp t_i as f_{0→t_i}, we model object motion with the cubic model:

f_{0→t_i} = v_0 t_i + (1/2) a_0 t_i^2 + (1/6) Δa t_i^3, (1)

where v_0, a_0, and Δa are the velocity, acceleration, and acceleration change rate estimated at I_0, respectively. The acceleration terms can be computed as:

Δa = a_1 − a_0,  a_0 = f_{0→1} + f_{0→−1},  a_1 = f_{1→2} + f_{1→0}, (2)

where a_0 and a_1 are calculated for pixels at I_0 and I_1, respectively. However, Δa should be calculated for pixels that correspond to the same real-world point rather than pixels with the same coordinates in the two frames. Therefore, we reformulate a_1, so that Δa is computed with reference to pixel locations at I_0, as:

a_1 = f_{0→2} − 2 f_{0→1}. (3)

To calculate v_0 in (1), the calculation in [26] does not hold when the acceleration is variable; instead, we apply (1) at t_i = 1 to solve for v_0 using only the information computed above:

v_0 = f_{0→1} − a_0/2 − (a_1 − a_0)/6. (4)

Finally, f_{0→t_i} for any t_i ∈ [0, 1] can be expressed based only on the O.F. between input frames:

f_{0→t_i} = f_{0→1} t_i + (a_0/2)(t_i^2 − t_i) + ((a_1 − a_0)/6)(t_i^3 − t_i). (5)

f_{1→t_i} can be computed in the same manner. The detailed derivation and proof of all the above equations are provided in the supplementary document.

In Fig. 2, we simulate three different 1-D motions, with constant velocity, constant acceleration, and variable acceleration, shown as three path lines. For each motion, the object position at four consecutive time stamps is given (gray circles); we apply three predictive models, linear, quadratic [26], and our cubic model, to blindly estimate the location of the object at the next time stamp t (without having the parameters of the simulated motions). The prediction results show that our cubic model is more robust in handling the different orders of motion.
Fig. 2: A toy example illustrating the performance of three predictive models (L: linear, Q: quadratic, C: cubic; GT: ground truth) on three motion patterns (constant velocity, constant acceleration, and variant acceleration).
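To make Eqs. (1)-(5) concrete, below is a minimal sketch of the cubic predictor on a 1-D toy motion in the spirit of Fig. 2. The function name and the test trajectory are ours, not from the paper's code; a_1 uses the I_0-referenced form of Eq. (3), so the flows between I_1 and its neighbors are not needed here.

```python
def cubic_flow(f01, f0m1, f02, t):
    """Predict f_{0->t} from flows among the input frames, per Eqs. (2)-(5).
    Works on scalars here; in practice the same expression applies per pixel."""
    a0 = f01 + f0m1                        # Eq. (2): acceleration at I_0
    a1 = f02 - 2.0 * f01                   # Eq. (3): a_1 referenced to I_0
    return (f01 * t                        # Eq. (5)
            + 0.5 * a0 * (t ** 2 - t)
            + (a1 - a0) / 6.0 * (t ** 3 - t))

# 1-D toy motion with variable acceleration: x(s) = s^3, so velocity,
# acceleration, and the acceleration change rate all vary over time.
x = lambda s: s ** 3

# "Flows" here are displacements relative to the position at time 0.
f01, f0m1, f02 = x(1) - x(0), x(-1) - x(0), x(2) - x(0)

for t in (0.25, 0.5, 0.75):
    pred = x(0) + cubic_flow(f01, f0m1, f02, t)
    print(f"t={t}: cubic={pred:+.4f}  ground truth={x(t):+.4f}")
```

On this trajectory the cubic prediction matches the ground truth exactly, while a quadratic model (which drops the Δa term) would predict x = t and miss the curvature change.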
Two-stage flow estimation. To estimate the O.F. among the input frames, existing frame interpolation methods commonly adopt off-the-shelf networks [26,17,3,2,24,6,8]. However, existing flow networks are not efficiently designed for multi-frame input, and some are limited to one-directional flow estimation. To this end, following the three-scale coarse-to-fine architecture of SPyNet [22], we design a customized two-stage flow estimation that involves the neighboring frames in better estimating the O.F. between I_0 and I_1. Both stages follow a similar three-scale architecture, and they share the weights of the two coarser levels. The first-stage network computes the O.F. between two consecutive frames; we use it to estimate f_{0→−1} and f_{1→2}. In the finest level of the second-stage network, we use I_0 and I_1 concatenated with −f_{0→−1} and −f_{1→2} as initial estimations to compute f_{0→1} and f_{1→0}. Alongside, we calculate estimations of f_{0→2} and f_{1→−1} in the first stage, which are used in our cubic motion modeling in later steps.
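The key wiring, negated first-stage flows warm-starting the second stage, can be sketched at a single scale as follows. This is a PyTorch illustration under our own assumptions (layer widths, a small residual conv head), not the authors' architecture, which uses three-scale SPyNet-style pyramids with shared coarse levels.

```python
import torch
import torch.nn as nn

class FlowLevel(nn.Module):
    """One (finest) pyramid level: predicts a residual on top of an
    initial flow estimate, given the two frames and that initial flow."""
    def __init__(self, width=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3 + 3 + 2, width, 3, padding=1), nn.LeakyReLU(0.1),
            nn.Conv2d(width, width, 3, padding=1), nn.LeakyReLU(0.1),
            nn.Conv2d(width, 2, 3, padding=1))

    def forward(self, a, b, init_flow):
        return init_flow + self.net(torch.cat([a, b, init_flow], dim=1))

class TwoStageFinestLevel(nn.Module):
    """Stage one: flows between consecutive frames. Stage two: flows
    between I_0 and I_1, warm-started with the negated stage-one flows
    (a constant-velocity guess)."""
    def __init__(self):
        super().__init__()
        self.stage1 = FlowLevel()
        self.stage2 = FlowLevel()

    def forward(self, I_m1, I_0, I_1, I_2):
        zero = torch.zeros(I_0.size(0), 2, I_0.size(2), I_0.size(3),
                           device=I_0.device, dtype=I_0.dtype)
        f_0_m1 = self.stage1(I_0, I_m1, zero)    # f_{0->-1}
        f_1_2 = self.stage1(I_1, I_2, zero)      # f_{1->2}
        f_0_1 = self.stage2(I_0, I_1, -f_0_m1)   # initialized with -f_{0->-1}
        f_1_0 = self.stage2(I_1, I_0, -f_1_2)    # initialized with -f_{1->2}
        return f_0_m1, f_1_2, f_0_1, f_1_0

net = TwoStageFinestLevel()
frames = [torch.randn(1, 3, 64, 64) for _ in range(4)]
flows = net(*frames)  # f_{0->-1}, f_{1->2}, f_{0->1}, f_{1->0}
```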
Motion estimation constraint relaxation. Common O.F. estimation methods try to map each pixel of the first frame to its exact corresponding location in the second frame. However, TOFlow [27] reveals that accurate O.F., as part of a higher-level task like frame interpolation, does not lead to the optimal solution of that task, especially under occlusion. Similarly, we observed that a strong constraint on O.F. estimation among the input frames can degrade the motion prediction for the middle frames, especially for complex motion. In contrast, accepting some flexibility in flow estimation provides an estimate closer to the ground truth motion between frames. The advantage of this flexibility is illustrated in the following examples.

Consider the two toy examples shown in Fig. 3, where a pixel moves along the blue curve in consecutive frames and (x, y) is the pixel coordinate in frame space. The pixel position is given in four consecutive frames as P_{−1}, P_0, P_1, and P_2, and the aim is to find its locations at seven moments between P_0 and P_1, indicated by blue stars. We consider P_0 as the reference point in motion prediction. The green lines represent the ground truth O.F. between P_0 and the other points. We predict the middle points (green stars) with the quadratic [26] and cubic models in (5), as shown in Fig. 3. The predicted locations are far from the ground truths (blue stars).
Fig. 3: An example of an object motion path (blue curve) and the motion prediction with and without relaxation by the (a) quadratic and (b) cubic models. Relaxation reduces the MSE from 2.72 to 1.21 for the quadratic model and from 2.52 to 0.94 for the cubic model.

However, instead of estimating the exact O.F., allowing the flexibility of mapping P_0 to the neighborhood of the other points, denoted P'_{−1}, P'_1, and P'_2, yields a much better prediction of the seven middle locations, as shown by the red stars, and reduces the mean squared error (MSE) significantly. The idea is analogous to introducing a certain error into the flow estimation process.

To apply the idea of relaxation, we employ the same unsupervised learning for O.F. estimation as [10], but with a relaxed warping loss. For example, the loss for estimating f_{0→1} is defined as:

L^{w relax}_{f_{0→1}} = Σ_{i=0}^{h−1} Σ_{j=0}^{z−1} min_{m,n} ‖ I^w_{1→0}(i, j) − I_0(i + m, j + n) ‖_1, for m, n ∈ [−d, +d], (6)

where I^w_{1→0} denotes I_1 warped by f_{0→1} to the reference frame I_0, d determines the range of the neighborhood, and h and z are the image height and width. We use L^{w relax} for both stages of O.F. estimation. We evaluate the trade-off between the performance of flow estimation and the final results in Section 4.4.
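A direct way to implement Eq. (6) is to take, at every pixel, the minimum L1 distance over all (2d+1)^2 shifted copies of the reference frame. The PyTorch sketch below uses our own helper name and averages over pixels rather than summing; it assumes the warped frame has already been computed.

```python
import torch
import torch.nn.functional as F

def relaxed_warping_loss(warped, ref, d=9):
    """Eq. (6): at each pixel, minimum L1 distance between the warped frame
    (e.g. I_1 backward-warped by f_{0->1}) and any pixel in a (2d+1)x(2d+1)
    neighbourhood of the reference frame I_0. d=9 follows the paper's
    training setting; tensors are (B, 3, H, W)."""
    b, c, h, w = ref.shape
    padded = F.pad(ref, (d, d, d, d), mode="replicate")
    best = None
    for m in range(2 * d + 1):            # enumerate all neighbourhood shifts
        for n in range(2 * d + 1):
            shifted = padded[:, :, m:m + h, n:n + w]
            err = (warped - shifted).abs().sum(dim=1)   # L1 over channels
            best = err if best is None else torch.minimum(best, err)
    return best.mean()
```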
Temporal pyramidal processing. Considering the similarity between consecutive frames and the pattern of difficulty in this task, we introduce adaptive joint processing, which we realize through temporal pyramidal models.

Temporal pyramidal network for O.F. refinement. The bidirectional O.F.s f_{0→t_i} and f_{1→t_i} predicted by (5) are based on the O.F.s computed among the input frames. The initial prediction may inherit errors from flow estimation and cubic motion modeling, notably at motion boundaries [10]. To effectively improve f_{0→t_i} and f_{1→t_i}, unlike the existing methods [2,3,15,10,17,28,20,14], we consider the relationship among the intermediate frames and process them all in one forward pass. To this end, we propose a temporal pyramidal O.F. refinement network, which enforces a strong bond between the intermediate frames, as shown in Fig. 4a. The network takes the concatenation of the seven pairs of predicted O.F.s as input and adaptively refines them based on the expected quality of the interpolation, which corresponds to the distance from I_0 and I_1. The closest ones, I_{t_1} and I_{t_7}, are processed by only one level of the pyramid, as they are most likely to achieve high quality. Following the same pattern, (I_{t_2}, I_{t_6}) are processed by two levels, (I_{t_3}, I_{t_5}) by three levels, and finally I_{t_4} by all four levels of the network, as it is expected to achieve the lowest interpolation quality.

Fig. 4: The pyramidal network designed for O.F. refinement (a) and the adaptive pyramidal structure for post processing (b). Sub-network channel widths decrease across the refinement levels (64, 48, 32, 24) in (a), and stay at 64 at every level in (b).

To fully utilize the refined O.F.s, we warp I_0 and I_1 by the refined O.F. at each level, obtaining I^w_{0→t_i} and I^w_{1→t_i}, and feed them to the next level. This helps achieve better results at the next level, since the warped frames are one step closer in the temporal domain to the target frame of that level compared to I_0 and I_1. Thus, the motion between I_0 and I_1 is composed of step-wise motions, each measured within a short temporal interval.

In addition to the refined O.F. at each level, a blending mask b_{t_i} [28] is also generated. The intermediate frames can therefore be synthesized, following [28], as

I_{t_i} = b_{t_i} ⊙ g(I_0, f̂_{0→t_i}) + (1 − b_{t_i}) ⊙ g(I_1, f̂_{1→t_i}), (7)

where f̂_{0→t_i} and f̂_{1→t_i} are the refined bidirectional O.F.s at t_i, ⊙ denotes element-wise multiplication, and g(·, ·) is the bilinear warping function from [28,9].
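Eq. (7) combines two backward-warped inputs with a learned mask. Below is a minimal sketch of g(·,·) and the blending step; the warping is our implementation of standard bilinear backward warping via grid_sample, and it assumes the flow stores (x, y) displacements in pixels.

```python
import torch
import torch.nn.functional as F

def backward_warp(img, flow):
    """g(., .): bilinear backward warping of img (B,3,H,W) by flow (B,2,H,W).
    flow[:, 0] / flow[:, 1] are assumed to be x / y displacements in pixels."""
    b, _, h, w = img.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, dtype=img.dtype, device=img.device),
        torch.arange(w, dtype=img.dtype, device=img.device), indexing="ij")
    gx = xs.unsqueeze(0) + flow[:, 0]            # sampling x, in pixels
    gy = ys.unsqueeze(0) + flow[:, 1]            # sampling y, in pixels
    grid = torch.stack([2 * gx / (w - 1) - 1,    # normalize to [-1, 1]
                        2 * gy / (h - 1) - 1], dim=-1)
    return F.grid_sample(img, grid, align_corners=True)

def synthesize(I0, I1, f0t, f1t, bt):
    """Eq. (7): blend the two warped inputs with the mask b_t in [0, 1]."""
    return bt * backward_warp(I0, f0t) + (1 - bt) * backward_warp(I1, f1t)

# Temporal pyramid schedule of Fig. 4a: how many refinement levels each
# time stamp t_i passes through (t_1/t_7 one level, t_4 all four).
LEVELS_PER_FRAME = {1: 1, 2: 2, 3: 3, 4: 4, 5: 3, 6: 2, 7: 1}
```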
Temporal pyramidal network for post processing. The intermediate frames synthesized by (7) may still contain artifacts due to inaccurate O.F., blending masks, or the synthesis process. Therefore, we introduce a post processing network, following an idea similar to the O.F. refinement network, to adaptively refine the interpolated frames I_{t_i}. However, as the generated frames are not aligned, feeding all the frames at the first level cannot properly enhance their quality. Instead, we input the generated frames separately at different levels of the network according to their temporal distance, as shown in Fig. 4b. At each time stamp t_i, we also feed the warped inputs I^w_{0→t_i} and I^w_{1→t_i} to reduce the error caused by inaccurate blending masks. Similar to the O.F. refinement network, the refined frames Î_{t_i} are also fed to the next level as guidance.

For both pyramidal networks, we employ the same sub-network at each level of the pyramid and adopt residual learning to learn the O.F. and frame residuals. The sub-network is composed of two residual blocks proposed by [16], with one convolutional layer at the input and another at the output. We set the number of channels in decreasing order for the O.F. refinement pyramid, as fewer frames are dealt with when moving toward the middle time step. In contrast, we keep the same number of channels for all levels of the post processing module.

Loss functions. The proposed integrated network for multi-frame interpolation targets temporal consistency through the joint optimization of all frames. To further impose consistency between frames, we apply the generative adversarial learning scheme of [29] and the two-player min-max game of [7] to train a discriminator network D, which optimizes the following problem:

min_G max_D E_{g∼p(I^{gt}_{t_i})}[log D(g)] + E_{x∼p(I)}[log(1 − D(G(x)))], (8)

where g = [I^{gt}_{t_1}, ..., I^{gt}_{t_7}] are the seven ground truth frames and x = [I_{−1}, I_0, I_1, I_2] are the four input frames. We add the following generative component of the GAN as the temporal loss [29,12]:

L_temp = Σ_{n=1}^{N} − log D(G(x)). (9)

The proposed framework in Fig. 1 serves as the generator and is trained alternately with the discriminator. To optimize the O.F. refinement and post processing networks, we apply the ℓ1 loss. The whole architecture is trained by combining all the loss functions:

L = Σ_{i=1}^{7} ( ‖Î_{t_i} − I^{gt}_{t_i}‖_1 + ‖I_{t_i} − I^{gt}_{t_i}‖_1 ) + L^{w relax} + λ L_temp, (10)

where λ is a weighting coefficient equal to 0.001.
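Putting Eqs. (8)-(10) together, a compact sketch of the generator-side objective follows. The helper name is ours; it assumes `disc` is a discriminator that scores a 21-channel stack of the 7 frames with a probability in (0, 1), and that the relaxed warping terms of Eq. (6) are precomputed.

```python
import torch

def generator_loss(refined, initial, gt, warp_losses, disc, lam=1e-3):
    """Sketch of Eq. (10). `refined` are the post-processed frames I_hat_{t_i},
    `initial` the frames synthesized by Eq. (7), `gt` the ground truths
    (each a list of 7 tensors of shape (B, 3, H, W))."""
    l1 = sum((r - g).abs().mean() for r, g in zip(refined, gt))
    l1 += sum((p - g).abs().mean() for p, g in zip(initial, gt))
    # Eq. (9): non-saturating generator component of the GAN game in Eq. (8).
    adv = -torch.log(disc(torch.cat(refined, dim=1)) + 1e-8).mean()
    return l1 + sum(warp_losses) + lam * adv
```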
4 Experiments

In this section, we provide implementation details and analyze the results of the proposed method in comparison to other methods, along with several ablation studies.

Training dataset. To train our network, we collected a dataset of 903 short video clips (2 to 10 seconds) with a frame rate of 240 fps and a resolution of 720 × 1280. We use the 1st, 9th, 17th, and 25th frames as inputs to generate the seven frames between the 9th and 17th frames, taking the 10th to 16th frames as ground truths.
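In 0-indexed terms, one training example can be sliced as in the sketch below (the helper name is ours):

```python
def make_training_sample(clip, start=0):
    """Slice one training example from a 240 fps clip (0-indexed):
    inputs are frames start+0, +8, +16, +24; ground truths are the seven
    frames start+9 .. start+15, i.e. t_i = i/8 between the middle inputs."""
    inputs = [clip[start + k] for k in (0, 8, 16, 24)]   # I_-1, I_0, I_1, I_2
    targets = [clip[start + k] for k in range(9, 16)]    # I_{t_1} .. I_{t_7}
    return inputs, targets
```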
We randomly crop 352 × 352 patches and apply horizontal, vertical, and temporal flips for data augmentation during training. To improve convergence speed, a stage-wise training strategy is adopted [30]. We first train each module except the discriminator independently with the ℓ1 loss for 15 epochs, keeping the other modules fixed. The whole network is then jointly trained using (10) for 100 epochs at a reduced learning rate. We use the Adam optimizer [11] and empirically set the neighborhood range d in (6) to 9. During training, the pixel values of all images are scaled to [−1, 1]. All experiments are conducted on an Nvidia V100 GPU. A more detailed network architecture is provided in the supplementary material.

Evaluation datasets. We evaluate the performance of the proposed method on widely used datasets, including two multi-frame interpolation datasets (Adobe240 [23] and GOPRO [16]) and two single-frame interpolation datasets (Vimeo90K [27] and DAVIS [21]). Adobe240 and GOPRO were initially designed for deblurring tasks, with a frame rate of 240 fps and a resolution of 720 × 1280. We interpolate the 4th (middle) frame for DAVIS and Vimeo90K by using the 1st, 3rd, 5th, and 7th frames as inputs.

Comparison with state-of-the-art methods. We compare our method with four state-of-the-art frame interpolation methods: Super SloMo [10], Quadratic [26], DAIN [2], and SepConv [19]. We train [10] and [26] on our training data and use the models released by the authors for the last two. We use PSNR, SSIM, and interpolation error (IE) [1] as evaluation metrics.
Table 1: Performance evaluation of the proposed method compared to state-of-the-art methods on different datasets.

Methods      | Adobe240 (PSNR / SSIM / TCC) | GoPro (PSNR / SSIM / TCC) | Vimeo90K (PSNR / SSIM / IE) | DAVIS (PSNR / SSIM / IE)
SepConv      | 32.38 / 0.938 / 0.832        | 30.82 / 0.910 / 0.789     | 33.60 / 0.944 / 5.30        | 26.30 / 0.789 / 15.61
Super SloMo  | 31.63 / 0.927 / 0.809        | 30.50 / 0.904 / 0.784     | 33.38 / 0.938 / 5.41        | 26.00 / 0.770 / 16.19
DAIN         | 31.36 / 0.932 / 0.808        | 29.74 / 0.900 / 0.759     | 34.54 / 0.950 / 4.76        | 27.25 / 0.820 / 13.17
Quadratic    | 32.80 / 0.949 / 0.842        | 32.01 / 0.936 / 0.822     | 33.62 / 0.946 / 5.22        | 27.38 / 0.834 / 12.46
Ours         |                              |                           |                             |
Fig. 5: An example from Adobe240 visualizing temporal consistency (input, SepConv, Super SloMo, DAIN, Quadratic, ours). The top row shows the middle frames generated by different methods, and the bottom row shows the interpolation error. Our method experiences less shifting in the temporal domain.
For multi-frame interpolation on GOPRO and Adobe240, we borrow the concept of Temporal Change Consistency (TCC) [29], which compares the generated frames and the ground truth in terms of the changes between adjacent frames:

TCC(F, G) = (1/6) Σ_{i=1}^{6} SSIM( |f_i − f_{i+1}|, |g_i − g_{i+1}| ), (11)

where F = (f_1, ..., f_7) and G = (g_1, ..., g_7) are the 7 interpolated and ground truth frames, respectively. For the multi-frame interpolation task, we report the average of the metrics over the 7 interpolated frames.
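Eq. (11) can be computed directly from the two frame sequences; a small sketch using scikit-image's SSIM follows (the helper name is ours, and data_range=255 assumes 8-bit frames):

```python
import numpy as np
from skimage.metrics import structural_similarity

def tcc(F, G):
    """Eq. (11): Temporal Change Consistency between the 7 interpolated
    frames F and the 7 ground-truth frames G (lists of HxWx3 arrays)."""
    scores = []
    for i in range(6):
        # Absolute change between adjacent frames, for prediction and GT.
        df = np.abs(F[i].astype(np.float32) - F[i + 1].astype(np.float32))
        dg = np.abs(G[i].astype(np.float32) - G[i + 1].astype(np.float32))
        scores.append(structural_similarity(
            df, dg, channel_axis=-1, data_range=255.0))
    return float(np.mean(scores))
```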
Methods         | Adobe240 (PSNR / SSIM / IE / TCC) | GOPRO (PSNR / SSIM / IE / TCC)
w/o post pro.   | 33.87 / 0.954 / 6.21 / 0.848      | 32.63 / 0.942 / 6.80 / 0.831
w/o adv. loss   | 34.35 / 0.958 / 5.89 / 0.850      | 32.86 / 0.942 / 6.77 / 0.830
w/o 2nd O.F.    | 34.24 / 0.957 / 5.97 / 0.854      | 32.73 / 0.940 / 6.91 / 0.832
w/o O.F. relax. | 33.92 / 0.955 / 6.14 / 0.851      | 32.45 / 0.936 / 7.09 / 0.828
w/o pyr.        | 33.92 / 0.954 / 6.33 / 0.845      | 32.37 / 0.935 / 7.30 / 0.820
Full model      |                                   |

Fig. 6: Visualization of the seven intermediate frames I_{t_1} to I_{t_7} generated by our method compared to Quadratic [26] and Super SloMo [10] on GOPRO.

The results reported in Table 1 show that our proposed method consistently performs better than the existing methods in both single- and multi-frame interpolation scenarios. Notably, on the multi-frame interpolation datasets (Adobe240 and GOPRO), our method significantly outperforms the best existing method [26] by 1.57 dB and 0.9 dB, respectively. The proposed method also achieves the highest temporal consistency as measured by TCC, thanks to the temporal pyramid structure and the joint optimization of the middle frames, which exploits the temporal relation among them.

In addition to TCC, to better show the power of the proposed method in preserving temporal consistency between frames, Fig. 5 reports Î_{t_4} and the IE generated by different methods on Adobe240. As shown in Fig. 5, the middle frames generated by the different methods are visually very similar to the ground truth. However, a comparison of the IE reveals significant errors near the edges of moving objects, caused by the temporal inconsistency between the generated frames and the ground truth. In contrast, our method generates a high-quality frame consistent with the ground truth in both the spatial and temporal domains.

Another example, from GOPRO, is shown in Fig. 6, comparing the proposed method with Super SloMo [10] and Quadratic [26], which do not apply any adaptive processing for frame interpolation. As can be seen in Fig. 6, at t_1 and t_7, which are closest to the input frames, all methods generate comparable results. However, approaching the middle frame, as the temporal distance from the inputs increases, the quality of the frames generated by Super SloMo and Quadratic starts to degrade, while our method experiences less degradation and higher quality. Especially for I_{t_4}, our improvement is significant, as also shown by the PSNR values at each time stamp t_i in Fig. 9c.

Our method also works better on DAVIS and Vimeo90K, as reported in Table 1. Fig. 7 shows an example of a challenging scenario that involves both translational and rotational motion. The acceleration-aware Quadratic can better estimate the motion, while the others undergo severe degradation. However, undesired artifacts are still generated by Quadratic near the motion boundary. In contrast, our method exploits cubic motion modeling and temporal pyramidal processing, which better capture this complex motion and generate results comparable to the ground truth.
Table 3: Comparison between the linear, quadratic, and cubic motion models.

Models | Adobe240 (PSNR / SSIM / IE) | GOPRO (PSNR / SSIM / IE)
Linear | 33.97 / 0.955 / 6.13        | 32.40 / 0.936 / 7.09
Quad.  | 34.24 / 0.957 / 5.95        | 32.70 / 0.941 / 6.85
Cubic  |                             |
Table 4: Comparison between models generating different numbers of frames.

Methods            | DAVIS (PSNR / SSIM) | Vimeo90K (PSNR / SSIM)
1 frame            | 27.07 / 0.819       | 32.02 / 0.944
3 frames           | 27.44 / 0.816       | 34.67 / 0.950
7 frames (no pyr.) | 27.25 / 0.815       | 34.56 / 0.950
7 frames           |                     |
Fig. 7: Sample results for interpolating the middle frame for a complex-motion example from the DAVIS dataset (inputs, SepConv, Super SloMo, DAIN, Quadratic, ours, GT).
Ablation studies. To explore the impact of the different components of the proposed model, we investigate the performance of our solution under the following variations: 1) w/o post pro.: removing the post processing; 2) w/o adv. loss: removing the adversarial loss; 3) w/o 2nd O.F.: replacing the second-stage flow estimation with the exact same network as the first stage; 4) w/o O.F. relax.: replacing L^{w relax} with L_{ℓ1}; 5) w/o pyr.: in both pyramidal modules, placing all inputs at the first level of the network and taking the outputs at the last level. The performance of these variations on the Adobe240 and GOPRO datasets, shown in Table 2, reveals that all the listed modifications degrade performance. As expected, motion relaxation and the pyramidal structure are important, as they provide more accurate motion prediction and enforce temporal consistency among the interpolated frames, as reflected in the TCC. The post processing, whose removal also brings a large degradation, is a crucial component that compensates for inaccurate O.F. and the blending process. It is worth noting that even though the quantitative improvement of the adversarial loss in terms of PSNR and SSIM is small, it is effective in preserving temporal consistency, as reported by the TCC values.
Motion models. To investigate the impact of different motion models, we also trained our method with linear and quadratic [26] motion prediction. The average quality reported in Table 3 shows that cubic modeling is dominant on both GOPRO and Adobe240. Importantly, the improvement of quadratic over linear is reported to be more than 1 dB for the model proposed in [26]; however, we observe only 0.27 dB and 0.3 dB on the Adobe240 and GOPRO datasets. We credit this to the proposed temporal pyramidal processing and motion relaxation. Compared with the impact of quadratic over linear, our cubic modeling adds another 0.13 dB and 0.21 dB of improvement on Adobe240 and GOPRO, respectively, which shows the necessity of cubic modeling for the complexity of motions present in different videos.
Table 5: Motion relaxation evaluation for warping, prediction and final results.
Datasets | PSNR(I^w_{1→0}, I_0)  | PSNR(I^w_{0→t_4}, I^{gt}_{t_4}) | PSNR(Î_{t_4}, I^{gt}_{t_4})
         | L_{ℓ1} / L^{w relax}  | L_{ℓ1} / L^{w relax}            | L_{ℓ1} / L^{w relax}
DAVIS    |                       |                                 |
Fig. 8: Sample results from Vimeo90K comparing O.F. estimation with (bottom row) and without (top row) relaxation, in terms of the interpolation error for the warped input, the motion prediction, and the final interpolation result (columns: inputs & GT; error of I^w_{1→0}; error of I^w_{0→t_4}; error of Î_{t_4}; Î_{t_4}).
Constraints relaxation in motion estimation.
To investigate the impact of applying motion estimation relaxation in our architecture, we train two versions of the entire solution: with relaxation (L^{w relax}) and without relaxation (L_{ℓ1}). For each case, we perform three comparisons: first, I_1 warped by f_{0→1}, denoted I^w_{1→0}, is compared to I_0; second, I_0 warped by the predicted f_{0→t_4} (before refinement), denoted I^w_{0→t_4}, is compared to I^{gt}_{t_4}; and finally, the final output of the network is compared with I^{gt}_{t_4}. Table 5 reports the evaluation results on DAVIS, and Fig. 8 shows the IE for an example from Vimeo90K. Both Table 5 and Fig. 8 show that although the relaxation makes the O.F. estimation between the initial pair poorer, it gives a better initial motion prediction for the middle frame as well as a better final interpolation result.
Temporal pyramidal structure. The effectiveness of the temporal pyramidal structure in interpolating multiple frames has already been verified in Table 2. To further investigate this impact while also considering the number of frames generated, we trained three additional variations of the model: predicting all 7 frames without the pyramidal structure, predicting 3 frames (i = 2, 4, 6) with the pyramidal model, and predicting a single frame (i = 4). Table 4 reports the interpolation quality of the middle frame on DAVIS and Vimeo90K for all these cases. The results in Table 4 demonstrate that the interpolation of the middle frame benefits from the joint optimization with the other frames.
Fig. 9: Efficiency of the proposed method compared to state-of-the-art methods (SepConv, DAIN, Super SloMo, Quadratic) in terms of (a) model size vs. PSNR on Adobe240, (b) inference speed for slow motion video generation at different frame rates, and (c) PSNR for each of the seven interpolated frames.
Efficiency analysis. Considering the wide range of applications for frame interpolation, especially on mobile and embedded devices, investigating the efficiency of the solution is crucial. We report the efficiency of the proposed method in terms of model size, interpolation quality, and inference time. Fig. 9a plots PSNR on Adobe240 against model size. The proposed method outperforms all the other methods in quality by a large margin while having a significantly smaller model; in particular, it outperforms Quadratic [26] by 1.57 dB using only 12.5% of its parameters. We also show the inference times for interpolating different numbers of frames in Fig. 9b. To interpolate more than 8× frames, our method can be extended by simply adding more levels to the pyramid. However, higher frame rate videos are hard to obtain for training; thus, we adopt an iterative interpolation strategy (running the 8× model multiple times and dropping the redundant frames). As reported in Fig. 9b, our method is around 7 times faster than [26] when interpolating more than 8 frames. Our method is the fastest and has the smallest size while keeping high-quality results for multi-frame interpolation tasks, which makes it applicable to low-power devices.
5 Conclusion

In this work, we proposed a powerful and efficient multi-frame interpolation solution that incorporates prior information about the challenges of this particular task. The prior knowledge of the varying difficulty levels among the intermediate frames led us to design a temporal pyramidal processing structure. To handle the challenges of complex real-world motion, our method benefits from the proposed advanced motion modeling, including cubic motion prediction and a relaxed loss function for flow estimation. Together, these components integrate multi-frame generation into a single optimized and efficient network that maximizes both the temporal consistency and the spatial quality of the frames, beating the state-of-the-art solutions.
References
1. Baker, S., Scharstein, D., Lewis, J., Roth, S., Black, M.J., Szeliski, R.: A database and evaluation methodology for optical flow. International Journal of Computer Vision (1), 1–31 (2011)
2. Bao, W., Lai, W.S., Ma, C., Zhang, X., Gao, Z., Yang, M.H.: Depth-aware video frame interpolation. In: IEEE Conference on Computer Vision and Pattern Recognition (2019)
3. Bao, W., Lai, W.S., Zhang, X., Gao, Z., Yang, M.H.: MEMC-Net: Motion estimation and motion compensation driven neural network for video interpolation and enhancement. arXiv preprint arXiv:1810.08768 (2018)
4. Bao, W., Zhang, X., Chen, L., Ding, L., Gao, Z.: High-order model and dynamic filtering for frame rate up-conversion. IEEE Transactions on Image Processing (8), 3813–3826 (2018)
5. Castagno, R., Haavisto, P., Ramponi, G.: A method for motion adaptive frame rate up-conversion. IEEE Transactions on Circuits and Systems for Video Technology (5), 436–446 (1996)
6. Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas, C., Golkov, V., Van Der Smagt, P., Cremers, D., Brox, T.: FlowNet: Learning optical flow with convolutional networks. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 2758–2766 (2015)
7. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial nets. In: Advances in Neural Information Processing Systems. pp. 2672–2680 (2014)
8. Ilg, E., Mayer, N., Saikia, T., Keuper, M., Dosovitskiy, A., Brox, T.: FlowNet 2.0: Evolution of optical flow estimation with deep networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 2462–2470 (2017)
9. Jaderberg, M., Simonyan, K., Zisserman, A., et al.: Spatial transformer networks. In: Advances in Neural Information Processing Systems. pp. 2017–2025 (2015)
10. Jiang, H., Sun, D., Jampani, V., Yang, M.H., Learned-Miller, E., Kautz, J.: Super SloMo: High quality estimation of multiple intermediate frames for video interpolation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 9000–9008 (2018)
11. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
12. Ledig, C., Theis, L., Huszár, F., Caballero, J., Cunningham, A., Acosta, A., Aitken, A., Tejani, A., Totz, J., Wang, Z., et al.: Photo-realistic single image super-resolution using a generative adversarial network. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 4681–4690 (2017)
13. Lee, W.H., Choi, K., Ra, J.B.: Frame rate up conversion based on variational image fusion. IEEE Transactions on Image Processing (1), 399–412 (2013)
14. Liu, Y.L., Liao, Y.T., Lin, Y.Y., Chuang, Y.Y.: Deep video frame interpolation using cyclic frame generation. In: AAAI Conference on Artificial Intelligence (2019)
15. Liu, Z., Yeh, R.A., Tang, X., Liu, Y., Agarwala, A.: Video frame synthesis using deep voxel flow. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 4463–4471 (2017)
16. Nah, S., Hyun Kim, T., Mu Lee, K.: Deep multi-scale convolutional neural network for dynamic scene deblurring. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 3883–3891 (2017)
17. Niklaus, S., Liu, F.: Context-aware synthesis for video frame interpolation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 1701–1710 (2018)
18. Niklaus, S., Mai, L., Liu, F.: Video frame interpolation via adaptive convolution. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 670–679 (2017)
19. Niklaus, S., Mai, L., Liu, F.: Video frame interpolation via adaptive separable convolution. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 261–270 (2017)
20. Peleg, T., Szekely, P., Sabo, D., Sendik, O.: IM-Net for high resolution video frame interpolation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 2398–2407 (2019)
21. Perazzi, F., Pont-Tuset, J., McWilliams, B., Van Gool, L., Gross, M., Sorkine-Hornung, A.: A benchmark dataset and evaluation methodology for video object segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 724–732 (2016)
22. Ranjan, A., Black, M.J.: Optical flow estimation using a spatial pyramid network. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 4161–4170 (2017)
23. Su, S., Delbracio, M., Wang, J., Sapiro, G., Heidrich, W., Wang, O.: Deep video deblurring for hand-held cameras. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 1279–1288 (2017)
24. Sun, D., Yang, X., Liu, M.Y., Kautz, J.: PWC-Net: CNNs for optical flow using pyramid, warping, and cost volume. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 8934–8943 (2018)
25. Wu, J., Yuen, C., Cheung, N.M., Chen, J., Chen, C.W.: Modeling and optimization of high frame rate video transmission over wireless networks. IEEE Transactions on Wireless Communications (4), 2713–2726 (2015)
26. Xu, X., Siyao, L., Sun, W., Yin, Q., Yang, M.H.: Quadratic video interpolation. In: Advances in Neural Information Processing Systems. pp. 1645–1654 (2019)
27. Xue, T., Chen, B., Wu, J., Wei, D., Freeman, W.T.: Video enhancement with task-oriented flow. International Journal of Computer Vision (IJCV) 127