A Group Variational Transformation Neural Network for Fractional Interpolation of Video Coding
Sifeng Xia, Wenhan Yang, Yueyu Hu, Siwei Ma and Jiaying Liu*
Peking University, Beijing, 100871, China
{xsfatpku, yangwenhan, huyy, swma, liujiaying}@pku.edu.cn

* Corresponding author. This work was supported by the National Natural Science Foundation of China under contract No. 61772043. We also gratefully acknowledge the support of NVIDIA Corporation with the GPU for this research.
Abstract
Motion compensation is an important technology in video coding that removes the temporal redundancy between coded video frames. In motion compensation, fractional interpolation is used to obtain more reference blocks at the sub-pixel level. Existing video coding standards commonly use fixed interpolation filters for fractional interpolation, which are not efficient enough to handle diverse video signals well. In this paper, we design a group variational transformation convolutional neural network (GVTCNN) to improve the fractional interpolation performance of the luma component in motion compensation. GVTCNN infers samples at different sub-pixel positions from the input integer-position sample. It first extracts a shared feature map from the integer-position sample to infer various sub-pixel position samples. Then a group variational transformation technique is used to transform a group of copied shared feature maps to samples at different sub-pixel positions. Experimental results have verified the interpolation efficiency of our GVTCNN. Compared with the interpolation method of High Efficiency Video Coding, our method achieves 1.9% bit saving on average and up to 5.6% bit saving under the low-delay P configuration.
1 Introduction

Motion compensation is a significant technology in video coding for removing the temporal redundancy between video frames. Specifically, during inter-prediction, at least one reference block is searched from the previously coded frames for each block to be coded. With the reference block, only the motion vector that indicates the position of the reference block and the residual between the two blocks need to be coded, which brings about bit saving in many cases.

However, due to the spatial sampling of digital video, adjacent pixels in a video frame are not continuous, which means that reference blocks at integer positions may not be similar enough to the block to be coded. In order to search for better reference blocks, video coding standards like High Efficiency Video Coding (HEVC) generate reference samples at sub-pixel positions by performing fractional interpolation over the retrieved integer-position sample.

Interpolation methods adopted by the coding standards usually use fixed interpolation filters. For example, MPEG-4 AVC/H.264 [1] uses a 6-tap filter for half-pixel interpolation and a simple average filter for quarter-pixel interpolation of the luma component. HEVC uses a DCT-based interpolation filter (DCTIF) [2] for fractional interpolation. Adopting simple fixed interpolation filters for motion compensation is efficient in real applications. However, the quality of the interpolation results generated by fixed filters may be limited, since fixed filters cannot fit natural and artificial video signals with various kinds of structures and content.

Recently, many deep learning based methods have been proposed for low-level image processing problems, e.g. image interpolation [3], denoising [4, 5] and super-resolution [6-9], and these methods have shown impressive results. In [3], Yang et al. proposed a variational learning network that effectively exploits structural similarities for image representation. The deep learning based denoising method [4] utilizes a deep convolutional neural network (CNN) to infer a noise map from the noisy image for denoising. Dong et al. proposed a super-resolution method called SRCNN [6], which is the first method that uses a CNN for super-resolution and has obtained significant gain over traditional super-resolution methods. In [7], a deeper CNN with residual learning is built to further improve super-resolution performance. Besides, edge information is additionally used to guide the inference of high-resolution images in [8]. Hu et al. [9] proposed a global context aggregation and local queue jumping network for image super-resolution, considering both reconstruction quality and time consumption.

Considering the great performance of deep learning based methods on low-level image processing problems, and their high implementation efficiency brought by GPU acceleration, there is a new opportunity to utilize deep learning based interpolation methods in motion compensation for video coding. Yan et al. [10] first proposed a CNN-based interpolation filter to replace the half-pixel interpolation part of HEVC. Their method has obtained obvious gain over HEVC, which has also demonstrated the superiority of deep learning based interpolation in video coding.
However, in their method, only the half-pixel interpolation is replaced and the quarter-pixel interpolation is still that of HEVC. Furthermore, they chose to train one model for each half-pixel position, which leads to a high storage cost and sets a barrier to applications in some extreme conditions.

In this paper, we propose a deep learning based fractional interpolation method to infer reference samples at all sub-pixel positions for motion compensation in HEVC. A group variational transformation convolutional neural network (GVTCNN) is designed to infer samples at various sub-pixel positions with one single network. The network first extracts a shared feature map from the integer-position sample. Then a group variational transformation method is used to infer different samples from a group of shared feature maps. Experimental results show the superiority of our GVTCNN in fractional interpolation, which further benefits video coding.

The rest of the paper is organized as follows. Sec. 2 introduces the proposed group variational transformation convolutional neural network based fractional interpolation. Details of fractional interpolation in HEVC are first introduced and analyzed. Then the architecture of the network is illustrated and the group variational transformation is described. The process of training data generation is also presented. Experimental results are shown in Sec. 3 and concluding remarks are given in Sec. 4.
Figure 1: Half-pixel and quarter-pixel sample positions in luma component fractional interpolation.
2 GVTCNN Based Fractional Interpolation

2.1 Fractional Interpolation in HEVC

During motion compensation of the luma component in HEVC, at least one integer-position sample in the previously coded frame is first searched for each query block which is to be coded. As shown in Fig. 1, the three half-pixel position samples are first interpolated based on the integer-position sample I_l. Quarter-pixel position samples are then interpolated from the values of the integer and half-pixel position samples. The most appropriate reference sample is finally selected among the integer, half-pixel and quarter-pixel position samples to facilitate coding the query block.

HEVC adopts a uniform 8-tap filter for half-pixel interpolation and 7-tap filters for quarter-pixel interpolation. These fixed interpolation filters may not be flexible enough to accomplish the interpolation tasks of various kinds of video scenes well. Moreover, the interpolation of each sub-pixel only covers a small area of the integer-position sample, which means that only limited reference information is utilized for sub-pixel generation. As a result, the interpolation results of HEVC may not be good enough in some hard cases, like scenes with complex structures.
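To make the fixed-filter baseline concrete, the sketch below applies HEVC-style 8-tap half-sample filtering to one row of integer-position pixels with NumPy. The tap values are the commonly quoted DCTIF luma half-sample coefficients; the snippet is an illustration only and omits the bit-depth scaling, clipping and block-boundary handling of a conforming HEVC implementation.

    import numpy as np

    # DCTIF luma half-sample taps as commonly quoted for HEVC
    # (normalized by 64 in the standard).
    HALF_PEL_TAPS = np.array([-1, 4, -11, 40, 40, -11, 4, -1]) / 64.0

    def half_pel_row(row):
        """Horizontal half-pel samples for one row of integer pixels.

        Output j is the half-pel sample between row[j] and row[j + 1];
        the row is edge-padded so every output sees 8 integer taps.
        """
        padded = np.pad(np.asarray(row, dtype=np.float64), (3, 4), mode="edge")
        # Correlate with the taps (np.convolve flips its kernel).
        return np.convolve(padded, HALF_PEL_TAPS[::-1], mode="valid")

Note how each half-pel sample depends on only 8 neighboring integer pixels, which is exactly the limited-coverage behavior discussed above.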
Deep convolutional neural network based methods have obtained much success in such low-level image processing problems. With the help of training data, deep CNN based methods learn a mapping from the input signal x to the target result y' by:

    y' = f(x, Θ),                                                  (1)

where Θ represents the set of learnt parameters of the convolutional neural network, which are learnt from the training data with the back-propagation algorithm.
2.2 Group Variational Transformation Convolutional Neural Network

The proposed GVTCNN consists of two components: the shared feature map extraction part and the group variational transformation part. The shared feature map is first extracted by GVTCNN from the integer-position sample I_l, and the group variational transformation then infers the residual maps of the samples at different sub-pixel positions based on the shared feature map.

Figure 2: Framework of the proposed group variational transformation convolutional neural network (GVTCNN).

Fig. 2 shows the architecture of GVTCNN. The integer-position sample I_l is the input of the network. h × w × c represents the size of each convolutional layer, where h and w are respectively the height and the width of the feature map, and c is its channel number. All convolutional layers use 3 × 3 kernels. Denote by f_k^out the output of the k-th convolutional layer. f_k^out is obtained by:

    f_k^out = P_k(W_k * f_{k-1}^out + B_k),                        (2)

where f_{k-1}^out is the output of the previous layer, W_k is the convolutional filter kernel of the k-th layer, B_k is the bias of the k-th layer, and f_0^out is the input integer-position sample. The function P_k(·) is the PReLU [11] activation of the k-th layer:

    P_k(x) = x,        x > 0,
             a_k · x,  x ≤ 0,                                      (3)

where x is the input signal and a_k is the parameter to be learned for the k-th layer. a_k is initially set to 0.25, and all channels of the k-th layer share the same parameter a_k.
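Eq. (3) translates directly into code. The minimal sketch below (PyTorch assumed, since the formulation itself is framework-agnostic) uses one scalar slope a_k shared by all channels of a layer, matching the description above:

    import torch

    def prelu(x: torch.Tensor, a_k: torch.Tensor) -> torch.Tensor:
        # Eq. (3): identity for positive inputs, learned slope a_k otherwise.
        # a_k is a single scalar shared by all channels of the k-th layer,
        # initialized to 0.25.
        return torch.where(x > 0, x, a_k * x)

In practice this is what torch.nn.PReLU(num_parameters=1, init=0.25) provides, with a_k learned by back-propagation together with W_k and B_k.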
In the shared feature map extraction component, a feature map with 48 channels is initially generated from the integer-position sample, followed by 8 convolutional layers with 10 channels, which are lightweight and reduce the cost of storing the learnt parameters. The 10-th layer then derives a 48-channel feature map again. The residual learning technique is utilized in shared feature map extraction to accelerate the convergence of the network: we add the output of the 1-st layer to that of the 10-th layer and then activate the sum with the PReLU function to obtain the shared feature map. After 9 convolutional layers with 3 × 3 kernels, the receptive field of the network reaches 19 × 19, which means that a large nearby area in the integer-position sample has been considered for the feature extraction of each pixel.

Considering the spatial correlation and continuity of the sub-pixels, we argue that there is no need to separately extract a feature map for the generation of each sub-pixel position sample. In other words, we do not need to train a network for each sub-pixel position sample, which would be inconvenient for real applications. As a result, after the shared feature map extraction, the shared feature map is used to infer the sub-pixel samples at different locations. The group variational transformation is performed over the copied shared feature maps with a specific convolutional layer for each sub-pixel sample. Different residual maps are then generated, and we obtain the final inferred sub-pixel position samples by adding the residual maps to the integer-position sample.
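The two components can be sketched as a single module. The authors implemented GVTCNN on Caffe; the PyTorch-style sketch below is only one reading of the architecture described above, with assumed zero-padding so that feature maps keep the input size (K = 3 sub-pixel positions for the half-pixel network and K = 12 for the quarter-pixel network):

    import torch
    import torch.nn as nn

    def conv_prelu(cin, cout):
        # One step of Eq. (2): 3x3 convolution followed by PReLU (Eq. (3)).
        return [nn.Conv2d(cin, cout, 3, padding=1),
                nn.PReLU(num_parameters=1, init=0.25)]

    class GVTCNN(nn.Module):
        """Sketch of GVTCNN: shared feature map extraction followed by
        one transformation branch per sub-pixel position."""

        def __init__(self, num_positions=3):
            super().__init__()
            # Layer 1: 48-channel map from the integer-position sample.
            self.layer1 = nn.Sequential(*conv_prelu(1, 48))
            # Layers 2-9: eight lightweight 10-channel layers.
            body = conv_prelu(48, 10)
            for _ in range(7):
                body += conv_prelu(10, 10)
            self.body = nn.Sequential(*body)
            # Layer 10 derives a 48-channel map (activated only after the
            # residual addition below, hence no PReLU here).
            self.layer10 = nn.Conv2d(10, 48, 3, padding=1)
            self.fuse = nn.PReLU(num_parameters=1, init=0.25)
            # Group variational transformation: one convolution per
            # sub-pixel position, each producing a residual map.
            self.branches = nn.ModuleList(
                [nn.Conv2d(48, 1, 3, padding=1) for _ in range(num_positions)])

        def forward(self, x):
            f1 = self.layer1(x)
            # Residual learning: add the 1st-layer output to the
            # 10th-layer output and activate the sum with PReLU.
            shared = self.fuse(f1 + self.layer10(self.body(f1)))
            # Each inferred sample = integer-position input + residual map.
            return [x + branch(shared) for branch in self.branches]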
During the training process, mean square error is used as the loss function. Let F(·) represent the learnt network that infers sub-pixel position samples from the integer-position sample, and let Θ denote the set of all the learnt parameters, including the convolutional filter kernels, the biases and the a_k of the PReLU function in each layer. The loss function can be formulated as follows:

    L(Θ) = (1/n) Σ_{i=1}^{n} ||F(x_i, Θ) − y_i||²,                 (4)

where the pairs {x_i, y_i}, i = 1, ..., n, are the generated ground-truth pairs of integer-position and sub-pixel position samples, and n is the total number of pairs.
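As a sketch of how Eq. (4) drives training (assuming, as one plausible reading, that the per-position errors of the K outputs are summed into a single objective):

    import torch
    import torch.nn.functional as nnf

    def training_step(model, optimizer, x, targets):
        """One optimization step on a batch of generated pairs (Eq. (4)).

        x:       integer-position samples, shape (n, 1, h, w)
        targets: K ground-truth sub-pixel samples, each (n, 1, h, w)
        """
        optimizer.zero_grad()
        outputs = model(x)  # K inferred sub-pixel position samples
        loss = sum(nnf.mse_loss(out, y) for out, y in zip(outputs, targets))
        loss.backward()     # standard back-propagation
        optimizer.step()
        return loss.item()

With the GVTCNN sketch above, optimizer would be, e.g., torch.optim.Adam(model.parameters()), mirroring the Adam training described in Sec. 3.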
2.3 Training Data Generation

In deep CNN based interpolation and super-resolution methods, the ground-truth high-resolution images are directly used as the label for the loss function and the down-sampled images are used as the input. However, there are big differences between image resolution recovery problems and fractional interpolation in video coding.

Firstly, interpolation and super-resolution methods recover a high-resolution image from a low-resolution image, where ground-truth high-resolution images exist. The resolution recovery quality can be measured by simply calculating the differences between the recovered images and the ground-truth images. Fractional interpolation in video coding instead aims to generate more sub-pixel position samples for motion compensation, and the efficiency of sub-pixel position sample generation is measured by the final coding performance. Secondly, fractional interpolation is performed over the reconstructed, previously coded reference frame, so the additional information loss brought by reconstruction in video coding should also be considered in training data generation. Besides, in many video coding standards, half-pixel interpolation and quarter-pixel interpolation are performed separately, which is reasonable since these two kinds of interpolation provide reference samples at different sub-pixel levels.

Figure 3: Flow chart of the training data generation for GVTCNN (the raw image is blurred and sub-pixel position samples are taken from it; the input is obtained by integer-position sampling followed by HEVC coding).
As a result, we correspondingly take several measures in training data generation so that the trained network can be applied to fractional interpolation in video coding. The overall flow chart of the training data generation process is shown in Fig. 3. Referring to the method mentioned in [10], the training data is first blurred with a Gaussian filter to simulate the correlations between the integer-position sample and the sub-pixel position samples. Sub-pixel position samples are later sampled from the blurred image. As for the input integer-position sample, an intermediate integer sample is first down-sampled from the raw image. Then, the intermediate down-sampled version is coded by HEVC to simulate the information loss of the reconstructed reference sample.

Moreover, we train two networks separately for the 3 half-pixel position samples and the 12 quarter-pixel position samples to better generate the samples at different sub-pixel levels, and there are some differences between the training data generation of the network that infers half-pixel position samples and the network that infers quarter-pixel position samples (respectively called GVTCNN-H and GVTCNN-Q).

In the process of training data generation for GVTCNN-H, the 200 training images and 200 testing images of the BSDS500 set [12], at sizes 481 × 321 and 321 × 481, are used for training. 3 × 3 Gaussian kernels with standard deviations up to 0.6 are used for blurring. By dividing the images into 2 × 2 grids, patches are separately sampled from the blurred image to derive the sub-pixel position samples.

For GVTCNN-Q, the inferred samples are at a finer sub-pixel level, and the sampling is performed based on 4 × 4 grids. To keep the continuity and similarity between the input integer samples and the target sub-pixel position samples, 10 YUV sequences at sizes 1024 × 768 and 1920 × 1080 (from http://media.xiph.org/video/derf/) are used for training.
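Under this reading of Fig. 3, one training pair can be produced as sketched below. The grid size (2 for GVTCNN-H, 4 for GVTCNN-Q) and the phase selection are assumptions based on the grid description above, the image dimensions are assumed divisible by the grid size, and the HEVC coding of the intermediate integer sample is an external step that is omitted here:

    import numpy as np
    from scipy.ndimage import gaussian_filter

    def make_training_pair(raw, sigma, grid=2):
        """Generate one (integer-position, sub-pixel labels) training pair.

        raw:   grayscale luma image as a 2-D array
        sigma: standard deviation of the Gaussian blurring kernel
        grid:  2 for GVTCNN-H (3 half-pel labels), 4 for GVTCNN-Q
        """
        blurred = gaussian_filter(np.asarray(raw, dtype=np.float64), sigma)
        # Input: down-sample the raw image (HEVC coding of this
        # intermediate integer sample would follow here to simulate
        # reconstruction loss).
        integer_sample = np.asarray(raw, dtype=np.float64)[::grid, ::grid]
        # Labels: grid phases of the blurred image. Phases with both
        # offsets even correspond to integer/half-pel positions and are
        # skipped, so grid=2 yields 3 labels and grid=4 yields 12.
        labels = [blurred[i::grid, j::grid]
                  for i in range(grid) for j in range(grid)
                  if (i % 2, j % 2) != (0, 0)]
        return integer_sample, labels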
The standard deviations of the Gaussian kernels are likewise drawn from a fixed range.

3 Experimental Results

During the training process, the training images are decomposed into 32 × 32 sub-images with a stride of 16. GVTCNN is trained on the Caffe platform [13] via Adam [14] with standard back-propagation. The learning rate is initially set to a fixed value and later decreased by a factor of 10 at a fixed iteration interval; the batch size is set to 128. Models after 50,000 iterations are used for testing. The network is trained on one Titan X GPU.

The proposed method is tested on the HEVC reference software HM 16.15 under the low-delay P (LDP) configuration, and BD-rate is used to measure the rate-distortion performance. The quantization parameter (QP) values are set to 22, 27, 32 and 37. We also compare with the CNN based half-pixel interpolation method proposed in [10]. During training data generation, each intermediate integer sample is coded at four QPs: 22, 27, 32 and 37, and we train one GVTCNN-H and one GVTCNN-Q for each QP based on the corresponding training data. For other QPs, the models belonging to their nearest QP are chosen. A CU-level rate-distortion optimization is also integrated to decide between our deep learning based interpolation method and the interpolation method of HEVC.
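For reference, BD-rate numbers like those in Tables 1-3 are typically computed with the Bjontegaard metric from four (bitrate, PSNR) points per method, one per QP. A common NumPy sketch is given below; tools differ in details (e.g., piecewise-cubic instead of polynomial fitting), so treat it as illustrative:

    import numpy as np

    def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test):
        """Bjontegaard delta rate in percent (negative means bit saving).

        Fits a cubic to log10(rate) as a function of PSNR for each
        rate-distortion curve and averages the horizontal gap between
        the curves over the overlapping PSNR interval.
        """
        log_a = np.polyfit(psnr_anchor, np.log10(rate_anchor), 3)
        log_t = np.polyfit(psnr_test, np.log10(rate_test), 3)
        lo = max(min(psnr_anchor), min(psnr_test))
        hi = min(max(psnr_anchor), max(psnr_test))
        int_a, int_t = np.polyint(log_a), np.polyint(log_t)
        avg_diff = ((np.polyval(int_t, hi) - np.polyval(int_t, lo))
                    - (np.polyval(int_a, hi) - np.polyval(int_a, lo))) / (hi - lo)
        return (10.0 ** avg_diff - 1.0) * 100.0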
The performance of the proposed deep learning based interpolation method in video coding for classes B, C, D, E and F is shown in Table 1. Our method obtains a 1.9% BD-rate saving on average, and up to a 5.6% BD-rate saving for the test sequence BQTerrace.

Table 1: BD-rate reduction of the proposed method compared to HEVC.

    Class      Sequence               Y       U       V
    Class B    Kimono                -3.7%    0.9%    1.3%
               BQTerrace             -5.6%   -4.2%   -5.5%
               BasketballDrive       -3.3%   -0.9%   -1.1%
               ParkScene             -1.1%    0.0%   -0.7%
               Cactus                -2.2%   -0.6%   -0.9%
               Average               -3.2%   -1.0%   -1.4%
    Class C    BasketballDrill       -2.3%   -1.4%   -0.7%
               BQMall                -2.8%   -1.3%   -1.2%
               PartyScene            -0.6%    0.0%   -0.6%
               RaceHorsesC           -1.9%   -0.7%   -1.6%
               Average               -1.9%   -0.8%   -1.0%
    Class D    BasketballPass        -3.1%   -1.2%   -1.6%
               BlowingBubbles        -1.6%   -0.2%   -0.5%
               BQSquare               1.6%    1.7%    2.7%
               RaceHorses            -1.8%   -1.7%   -1.6%
               Average               -1.2%   -0.3%   -0.2%
    Class E    FourPeople            -2.2%   -0.9%   -0.7%
               Johnny                -2.7%   -0.4%    0.7%
               KristenAndSara        -2.1%    0.5%    0.4%
               Average               -2.3%   -0.3%    0.1%
    Class F    BasketballDrillText   -1.7%   -0.6%   -0.3%
               ChinaSpeed            -1.4%   -1.9%   -1.6%
               SlideEditing           0.2%    0.1%    0.0%
               SlideShow             -0.1%    0.2%   -0.6%
               Average               -0.8%   -0.6%   -0.6%
    All        Overall               -1.9%   -0.6%   -0.7%

We also compare our method with the CNN based half-pixel interpolation method proposed in [10] (called CNNIF). CNNIF only replaces the half-pixel interpolation, without rate-distortion optimization, and is tested on HM 16.7. For a fair comparison, we also test our method on HM 16.7 and only replace the half-pixel interpolation with our GVTCNN-H, without rate-distortion optimization. The BD-rate reductions of the two methods for several test classes are shown in Table 2. Our method still has gain over CNNIF, and the gain becomes larger after integrating GVTCNN-Q and the rate-distortion optimization.

Table 2: BD-rate reduction of CNNIF and the proposed GVTCNN-H for different classes.

    Class      Sequence          CNNIF [10]              GVTCNN-H
                                 Y      U      V         Y      U      V
    Class C    BasketballDrill  -1.2%  -0.6%   0.2%     -1.9%  -1.3%  -0.4%
               BQMall           -0.9%   0.3%   0.7%     -2.0%  -0.8%  -0.9%
               PartyScene        0.2%   0.5%   0.3%     -0.3%  -0.1%  -0.1%
               RaceHorsesC      -1.5%  -0.5%  -0.1%     -1.6%  -1.0%  -0.2%
               Average          -0.9%  -0.1%   0.3%     -1.4%  -0.8%  -0.4%
    Class D    BasketballPass   -1.3%  -0.4%   0.3%     -2.4%  -1.2%  -0.7%
               BlowingBubbles   -0.3%   0.4%   0.8%     -0.9%   0.9%  -0.5%
               BQSquare          1.2%   2.9%   3.1%      1.9%   2.0%   3.7%
               RaceHorses       -0.8%  -0.9%   0.0%     -1.1%  -0.9%  -0.2%
               Average          -0.3%   0.5%   1.0%     -0.6%   0.2%   0.6%
    Class E    FourPeople       -1.3%  -0.4%   0.1%     -2.1%  -0.5%  -0.3%
               Johnny           -1.2%  -0.4%  -0.7%     -2.7%  -1.1%  -0.6%
               KristenAndSara   -1.0%   0.3%   0.2%     -2.3%   0.1%   0.1%
               Average          -1.2%  -0.2%  -0.1%     -2.4%  -0.5%  -0.3%

In order to further verify the effectiveness of our group variational transformation method, we additionally train one network for each sub-pixel position at each QP separately. Average BD-rate reductions for several test classes are shown in Table 3. As can be observed, the results of training the networks separately and uniformly are comparable.

Table 3: Average BD-rate reduction achieved by training the networks separately and uniformly.

    Class      GVTCNN-Separate          GVTCNN-Uniform
               Y      U      V          Y      U      V
    Class C   -1.9%  -1.0%  -1.2%      -1.9%  -0.8%  -1.0%
    Class D   -1.3%  -0.1%  -0.2%      -1.2%  -0.3%  -0.2%
    Class E   -2.4%  -0.1%  -0.4%      -2.3%  -0.3%   0.1%
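The storage argument behind the uniform design can be made tangible with a parameter count on the sketch from Sec. 2.2 (assumed to be in scope): the per-position branches are single convolutions, so one shared network with 12 branches stores far fewer parameters than 12 separately trained single-branch networks. The figures below describe the sketch, not the authors' Caffe models.

    # Parameter count of the GVTCNN sketch from Sec. 2.2: one uniform
    # 12-branch network versus 12 separate single-branch networks.
    uniform = sum(p.numel() for p in GVTCNN(num_positions=12).parameters())
    separate = 12 * sum(p.numel() for p in GVTCNN(num_positions=1).parameters())
    print(f"uniform: {uniform:,} parameters vs separate: {separate:,}")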
4 Conclusion

In this paper, we propose a group variational transformation deep convolutional neural network for fractional interpolation in motion compensation of video coding. The network first extracts a shared feature map from the input integer-position sample, and a group of copied shared feature maps is then transformed to samples at various sub-pixel positions. The training data generation for the proposed network is also carefully analyzed and designed. Experimental results show that our method obtains on average a 1.9% BD-rate saving on the test sequences compared with HEVC. The effectiveness of the group variational transformation adopted in our network is also verified by experiments.

References

[1] T. Wiegand, G. J. Sullivan, G. Bjontegaard, and A. Luthra, "Overview of the H.264/AVC video coding standard," IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, no. 7, pp. 560-576, 2003.
[2] G. J. Sullivan, J. Ohm, W.-J. Han, and T. Wiegand, "Overview of the high efficiency video coding (HEVC) standard," IEEE Transactions on Circuits and Systems for Video Technology, vol. 22, no. 12, pp. 1649-1668, 2012.
[3] W. Yang, J. Liu, S. Xia, and Z. Guo, "Variation learning guided convolutional network for image interpolation," in Proc. IEEE Int'l Conf. Image Processing, 2017.
[4] K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang, "Beyond a Gaussian denoiser: Residual learning of deep CNN for image denoising," IEEE Transactions on Image Processing, vol. 26, no. 7, pp. 3142-3155, 2017.
[5] K. Zhang, W. Zuo, S. Gu, and L. Zhang, "Learning deep CNN denoiser prior for image restoration," in Proc. IEEE Int'l Conf. Computer Vision and Pattern Recognition, 2017.
[6] C. Dong, C. C. Loy, K. He, and X. Tang, "Learning a deep convolutional network for image super-resolution," in Proc. European Conf. Computer Vision, 2014.
[7] J. Kim, J. Kwon Lee, and K. Mu Lee, "Accurate image super-resolution using very deep convolutional networks," in Proc. IEEE Int'l Conf. Computer Vision and Pattern Recognition, 2016.
[8] W. Yang, J. Feng, J. Yang, F. Zhao, J. Liu, Z. Guo, and S. Yan, "Deep edge guided recurrent residual learning for image super-resolution," IEEE Transactions on Image Processing, vol. 26, no. 12, pp. 5895-5907, 2017.
[9] Y. Hu, J. Liu, W. Yang, S. Deng, L. Zhang, and Z. Guo, "Real-time deep image super-resolution via global context aggregation and local queue jumping," in Proc. IEEE Visual Communication and Image Processing, 2017.
[10] N. Yan, D. Liu, H. Li, and F. Wu, "A convolutional neural network approach for half-pel interpolation in video coding," in Proc. IEEE Int'l Symposium on Circuits and Systems, 2017.
[11] K. He, X. Zhang, S. Ren, and J. Sun, "Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification," in Proc. IEEE Int'l Conf. Computer Vision, 2015.
[12] D. Martin, C. Fowlkes, D. Tal, and J. Malik, "A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics," in Proc. IEEE Int'l Conf. Computer Vision, 2001.
[13] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, "Caffe: Convolutional architecture for fast feature embedding," in Proc. ACM Int'l Conf. Multimedia, 2014.
[14] D. Kingma and J. Ba, "Adam: A method for stochastic optimization," in Proc. Int'l Conf. Learning Representations, 2015.