A PRACTICAL CONVOLUTIONAL NEURAL NETWORK AS LOOP FILTER FOR INTRA FRAME

Xiaodan Song∗, Jiabao Yao∗, Lulu Zhou∗, Li Wang, Xiaoyang Wu, Di Xie and Shiliang Pu

Hikvision Research Institute, Hangzhou, China
{songxiaodan, yaojiabao, zhoululu, wangli7, wuxiaoyang, xiedi, pushiliang}@hikvision.com

∗ Equal contribution

ABSTRACT
Loop filters are used in video coding to remove artifacts or improve coding performance. Recent advances in deploying convolutional neural networks (CNNs) to replace traditional loop filters show large gains, but with problems for practical application. First, a different model is used for frames encoded with each quantization parameter (QP), which is expensive for hardware. Second, floating-point operations in a CNN lead to inconsistency between encoding and decoding across different platforms. Third, redundancy within the CNN model consumes precious computational resources. This paper proposes a CNN as the loop filter for intra frames together with a scheme that solves the above problems. It aims at a single CNN model with low redundancy that adapts to decoded frames of different qualities and ensures consistency. To adapt to reconstructions with different qualities, both the reconstruction and the QP are taken as inputs. After training, the obtained model is compressed to reduce redundancy. To ensure consistency, dynamic fixed points (DFP) are adopted when testing the CNN: parameters in the compressed model are first quantized to DFP and then used for inference, and the outputs of each layer are computed by DFP operations. Experimental results on JEM 7.0 report 3.14%, 5.21% and 6.28% BD-rate savings for the luma and the two chroma components with the all intra configuration when replacing all traditional filters.
Index Terms — video coding, loop filter, convolutionalneural network, model compression, dynamic fixed point
1. INTRODUCTION
In lossy image/video coding, loop filters are usually used to remove artifacts or to further improve coding performance. For example, the recently standardized HEVC [1] employs a sample adaptive offset (SAO) and a deblocking filter (DF). Up to four filters are introduced in the reference software JEM 7.0, which is used within the joint video exploration team (JVET) group [2] for the next generation video coding standard. Two additional filters, i.e., the bilateral filter (BF) and the adaptive loop filter (ALF), are included as shown in Fig. 1(a).
Fig. 1. Comparison of the intra decoding scheme between JEM 7.0 and the proposed method.

It is natural to wonder whether these filters are sufficient to deal with the complex content of natural video and whether they can be replaced by a single type of filter. Recent advances in deep learning have shed light on this. Recent research has investigated deep learning approaches, especially the convolutional neural network (CNN), for post-processing [3, 4, 5, 6] or loop filtering [7, 8]. In post-processing, CNNs are used to improve the subjective or objective quality of reconstructed frames after decoding, which does not require changing the encoding algorithm. In loop filtering, the filtered reconstruction is used as a reference for the following frames and helps to reduce the bit-rate.

In post-processing, Dai et al. [3] adopt a CNN with a residual learning structure and variable filter sizes for acceleration. [4] designs and trains a deeper network for intra frames and directly deploys it to B and P frames. Yang et al. [6] argue that models obtained for intra frames are not good enough for B and P frames and propose a scalable CNN for devices with different computational resources. To improve the performance, [5] introduces a novel CNN that takes multi-scale information and the partition of the coding tree unit into account. In loop filtering, Park et al. [7] propose to use a CNN as an in-loop filter to replace DF and SAO in HEVC and report bit-rate reduction besides objective quality enhancement; however, generalizability cannot be ensured because the test data are included in the training set. [8] takes both the current reconstructed block and the co-located block in the nearest reference frame as inputs to jointly exploit spatial and temporal information.

The above-mentioned CNN approaches show significant gains over traditional methods, which makes them attractive for practical use. However, three problems impede the way. First, almost all of them train a separate model for each QP. For a codec in which the QP varies from M to N, N−M+1 models have to be stored, which is expensive for a hardware-oriented codec. Second, the default floating-point operations in CNN computation lead to inconsistency between encoding and decoding across different platforms, e.g. CPU and GPU, which is not acceptable in video communications among various manufacturers and users. Third, there is redundancy among the parameters of a pre-trained model [9], which consumes unnecessary storage and computational resources.

In this paper, we propose a practical convolutional neural network filter (CNNF) to replace all traditional filters, as shown in Fig. 1(b), together with a scheme that solves the above problems. Both the decoded frame and the QP are taken as inputs to CNNF to obtain a QP independent model that adapts to reconstructions with different qualities. After training, the obtained model is compressed for acceleration. To ensure consistency, the compressed model is quantized to DFP [10] and the outputs of each layer are computed by DFP operations during inference. Note that CNNF has been submitted to the JVET meeting in Gwangju [11], and an Ad hoc Group has been set up to investigate deep learning for video compression.

The rest of this paper is organized as follows. Section 2 gives an overview of the proposed CNNF and Section 3 explains it in detail. The training process is described in Section 4, experimental results are given in Section 5, and Section 6 concludes this paper.
2. OVERVIEW OF THE PROPOSED SCHEME
The proposed scheme mainly includes three parts: a novel network, model compression, and DFP-based inference of CNNF. A fully convolutional network with a residual learning structure is adopted in CNNF. To obtain one QP independent model, a QP map is generated and taken as an input of the CNN. After training, the model is compressed by reducing the filter number of each convolutional layer for acceleration. The compression includes two steps. First, additional regularization is included in the loss function to help the compression; during training, filters are automatically pruned based on the absolute value of the scale parameter in the corresponding batch normalization (BN) layer. Second, the model obtained in the first step is further compressed by low rank approximation with fine-tuning, after which the filters are reconstructed from a much lower-dimensional basis of the low-rank space for acceleration.

Before DFP inference, parameters in the compressed model are first quantized and converted to DFP. To recover the performance loss due to quantization, fine-tuning is performed [12]. During inference of CNNF, the outputs of each layer are computed by DFP operations and then quantized to DFP with a low bit-width to avoid overflow. After the fixed-point inference, the outputs are denormalized to obtain the final reconstruction.
Fig. 2. Network structure of CNNF, in which "Convi" represents a convolution layer, k is the kernel size and K_L is the kernel number.

Table 1. Compressed filter number for each convolution layer (conv1–conv8).
3. THE PROPOSED CNNF

3.1. Network structure
CNNF includes two inputs: the reconstruction and a QP map, which makes it possible to use a single set of parameters to adapt to reconstructions with different qualities. The QP map is generated by

$$QPMap(x, y) = QP, \quad x = 0, 1, \ldots, W-1, \quad y = 0, 1, \ldots, H-1,$$

where QP is the QP used for encoding, and W and H are the width and height of the reconstruction, respectively. Note that the reconstruction can be a decoded frame or a block.

Before being fed to CNNF, both inputs are normalized to [0, 1] for better convergence in the training process. Each pixel in the decoded frame is divided by $(1 \ll B) - 1$, in which B is the bit depth and $\ll$ denotes a left bit shift. The QP map is divided by the maximum value of the QP. After filtering, a corresponding denormalization is performed to obtain the final reconstruction.

In the following, a simple CNN with 8 convolution layers, as shown in Fig. 2, is taken as an example to make a trade-off between performance and complexity; K_L is set to 64. We note that taking the QP map as side information can also be applied to other networks, e.g. [3, 4, 5, 6, 8]. By connecting the normalized Y, U or V to the summation layer, the network is regularized to learn the characteristics of the residual between the decoded frame and its original.
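As an illustration of this input preparation, a minimal numpy sketch is given below; the helper names and the maximum QP of 51 (the HEVC/JEM range) are assumptions, not part of the paper:

```python
import numpy as np

MAX_QP = 51  # assumed maximum QP of the codec (HEVC/JEM range)

def prepare_inputs(recon, qp, bit_depth=8):
    """Build the two normalized input planes of CNNF.

    recon: 2-D array (H x W) of decoded pixel values.
    qp:    the QP used to encode the frame or block.
    """
    # Normalize the reconstruction to [0, 1] by the peak value (1 << B) - 1.
    recon_norm = recon.astype(np.float32) / ((1 << bit_depth) - 1)
    # The QP map is a constant plane of the same size, normalized by MAX_QP.
    qp_map = np.full(recon.shape, qp / MAX_QP, dtype=np.float32)
    return recon_norm, qp_map

def denormalize(output_norm, bit_depth=8):
    """Invert the normalization after filtering to get the final pixels."""
    peak = (1 << bit_depth) - 1
    return np.clip(np.round(output_norm * peak), 0, peak).astype(np.uint16)
```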
3.2. Model compression

To speed up inference, the learned model is compressed before testing. For efficient compression, a loss function with two additional regularizers is designed for the training process:

$$\mathrm{Loss} = \underbrace{\frac{1}{2M}\sum_{i=1}^{M}\left\lVert y_i - f_w(x_i)\right\rVert^2}_{\text{mean square error}} + \underbrace{\lambda_w \sum_{j=1}^{L}\lVert W_j\rVert_{g_1}}_{\text{normal regularizer}} + \underbrace{\lambda_s \lVert S\rVert_{g_2} + \lambda_{lda}\sum_{j=1}^{L-1}\sum_{i=1}^{L-1}\left\lVert \frac{W_j}{\lVert W_j\rVert} - \frac{W_i}{\lVert W_i\rVert}\right\rVert}_{\text{additional regularizers}}, \quad (1)$$

in which $y_i$ and $f_w(x_i)$ are the ground truth and the filtered result of $x_i$, respectively, $W_j$ denotes the parameters to be learned and $S$ the scale parameters of the BN layers [13]. $\lambda_w$, $\lambda_s$ and $\lambda_{lda}$ are hyper-parameters, $M$ is the batch size and $L$ is the number of convolution layers. The mean square error is used as the main measurement of the loss, while the normal regularizer constrains the model complexity. $g_1$ and $g_2$ denote the L1 or L2 norm; to reduce the number of combinations we set $g_1 = g_2$, and experimental results show that the L1 norm is better than the L2 norm.

With the first additional regularizer, the learned scale parameters in the BN layers tend toward zero. In the training process, a filter is pruned once the absolute value of its corresponding scale parameter is small enough. The second additional regularizer, i.e. the linear discriminant analysis (LDA) term, makes the learned parameters friendly to the following low rank approximation, for which singular value decomposition (SVD) is used [14]. After that, the filters are reconstructed from a much lower-dimensional basis [15].

Table 1 gives the compressed filter numbers. The compression efficiently reduces the kernel numbers, and the amount of parameters is reduced to 51% of the original model, while performance only changes by about -0.08%, -0.19% and 0.25% on average for the Y, U and V components of classes B, C, D and E on JEM 7.0 [16].
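The pruning step can be illustrated with a minimal sketch; the threshold value and the tensor layout are assumptions, not the paper's exact procedure:

```python
import numpy as np

PRUNE_THRESHOLD = 1e-3  # assumed magnitude below which a filter is dropped

def select_filters(bn_scales):
    """Return the indices of the filters to keep in one convolution layer.

    bn_scales: 1-D array of BN scale parameters (one per filter); the
    sparsity regularizer lambda_s * ||S||_{g_2} in Eq. (1) drives many
    of them toward zero during training.
    """
    return np.flatnonzero(np.abs(bn_scales) >= PRUNE_THRESHOLD)

def prune_conv(weights, bn_scales):
    """Drop the output channels of a conv layer whose BN scale vanished.

    weights: array shaped (out_channels, in_channels, k, k). Note that the
    corresponding input channels of the *next* layer must be dropped too.
    """
    keep = select_filters(bn_scales)
    return weights[keep], bn_scales[keep]
```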
3.3. DFP-based inference

To ensure consistency between encoding and decoding across different platforms, DFP [10] operations are used in testing. A value $V$ in dynamic fixed point is described by

$$V = (-1)^s \cdot 2^{-FL} \sum_{i=0}^{B_f-2} 2^i \cdot x_i, \quad (2)$$

where $B_f$ denotes the bit width used to represent the DFP value, $s$ the sign bit, $FL$ the fractional length and $x_i$ the mantissa bits. Each floating-point value among the model parameters and layer outputs is quantized and clipped to be converted to DFP.

Values in CNNF are divided into three groups: layer weights, biases and outputs. The bit widths for weights $B_w$ and biases $B_b$ are set to 8 and 32, respectively. Since weight and bias quantization leads to performance loss, fine-tuning that takes the quantization into account is performed, similar to [10, 12]; after that, the parameters are quantized to DFP. For layer outputs, the bit width is set to 16, and experimental results show that the quantization of outputs leads to negligible loss.

Each group in the same layer shares one common $FL$, which is estimated from the available training data and the layer parameters. Table 2 gives the estimated $FL$ for each convolution layer; $FL_w$, $FL_b$ and $FL_o$ denote the $FL$ of weights, biases and outputs, respectively. The $FL$ for the concat and summation layers are both set to 15. Since CPUs and GPUs do not support DFP, the DFP operations are simulated by floating points, similar to [10].

Table 2. Estimated FL for each convolution layer

conv | 1  | 2  | 3  | 4  | 5  | 6  | 7  | 8
FL_w | 17 | 15 | 14 | 16 | 15 | 13 | 13 | 16
FL_o | 15 | 14 | 14 | 15 | 15 | 15 | 16 | 18

Table 3. Evaluation environment

        | CPU+GPU                                   | CPU
CPU     | Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz | Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz
GPU     | NVIDIA Titan Xp with 12GB memory          | —
Library | cuDNN 5.1.10                              | OpenBLAS 0.2.18
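Since DFP is simulated with floating-point arithmetic in testing [10], the quantize-and-clip operation of Eq. (2) can be sketched as follows; the rounding mode and the example tensor are assumptions:

```python
import numpy as np

def quantize_dfp(x, bit_width, frac_len):
    """Simulate dynamic fixed point (Eq. (2)) in floating point.

    x:         array of float values (weights, biases or layer outputs).
    bit_width: total number of bits B_f (sign + mantissa).
    frac_len:  fractional length FL shared by the whole group.
    """
    step = 2.0 ** (-frac_len)                      # value of one LSB
    max_val = (2.0 ** (bit_width - 1) - 1) * step  # largest representable value
    min_val = -(2.0 ** (bit_width - 1)) * step     # smallest representable value
    # Round to the nearest representable level, then clip to avoid overflow.
    q = np.round(x / step) * step
    return np.clip(q, min_val, max_val)

# Example: quantize conv1 weights with B_w = 8 and FL_w = 17 (Table 2).
w = (np.random.randn(64, 2, 3, 3) * 1e-4).astype(np.float32)
w_q = quantize_dfp(w, bit_width=8, frac_len=17)
```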
4. TRAINING PROCESS
Training data used for obtaining the layer parameters of CNNF are generated from Visual Genome (VG) [17], DIV2K [18] and ILSVRC2012 [19]. Each image is intra encoded with QPs 22, 27, 32 and 37 on JEM 7.0 with BF, DF, SAO and ALF off. Then the decoded frames, including the luma and chroma components, are divided into patches of size 35 × 35. After that, 3.6 million training samples are generated, comprising 600 thousand luma samples and 300 thousand chroma samples for each QP. Finally, all the data are mixed in a random order.

We use Caffe [20] as the training software on an NVIDIA Titan Xp GPU with 12GB memory. Eq. (1) is used as the loss function during training. Parameters of the proposed model are initialized randomly. The batch size M is set to 64 and the base learning rate to 0.1. λ_w, λ_s and λ_lda are set to 1e-5, 5e-8 and 3e-6, respectively. Stochastic gradient descent with gradient clipping is used to solve the optimization. The training is stopped after 32 epochs.
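The patch-generation step described above can be sketched as follows; the helper names and the non-overlapping tiling are assumptions:

```python
import numpy as np

PATCH = 35  # patch size used for training

def extract_patches(plane, patch=PATCH):
    """Cut one decoded plane (luma or chroma) into non-overlapping
    patch x patch blocks; any border remainder is discarded."""
    h, w = plane.shape
    patches = []
    for y in range(0, h - patch + 1, patch):
        for x in range(0, w - patch + 1, patch):
            patches.append(plane[y:y + patch, x:x + patch])
    return np.stack(patches)

def build_training_set(decoded_planes, seed=0):
    """Gather patches from all decoded planes (all QPs) and shuffle them."""
    rng = np.random.default_rng(seed)
    data = np.concatenate([extract_patches(p) for p in decoded_planes])
    rng.shuffle(data)  # mix all QPs and components in random order
    return data
```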
5. EXPERIMENTAL RESULTS
In the evaluation, JEM 7.0 [16] serves as the reference software and the common test conditions in [21] are used. BD-rate [22] is used as the measurement for comparison. Unless otherwise specified, BF, SAO, DF and ALF are all turned off. Results are tested both with GPU and with CPU only, denoted as "CPU+GPU" and "CPU", respectively, and the anchor is evaluated in both settings. The test environment is listed in Table 3. Since the QP as side information can be introduced to any network for loop filtering, we do not compare with other works.

First, the effectiveness of the QP independent model is evaluated, with model compression and DFP inference disabled. The same network without the QP map is used for comparison and denoted as the "QP dependent Model": a separate model is trained for each of QP 22, 27, 32 and 37, denoted as "QP22", "QP27", "QP32" and "QP37", respectively, with the encoded data of the corresponding QP from Section 4 used as training data. In the test, decoded frames with all QPs are filtered by each model, respectively. The results where decoded frames with different QPs are filtered by their corresponding models are also tested and denoted as "Best".

Fig. 3. Comparison of subjective quality for BQSquare encoded with QP 37: (a) the original frame, (b) the decoded frame of JEM 7.0, (c) the decoded frame of CNNF. The red bounding boxes demonstrate that the highlighted areas are enhanced more by CNNF than by JEM 7.0.

Table 4. Performance improvement of the "QP dependent Model" and CNNF on luma

Model   | CNNF   | QP dependent Model
        |        | Best   | QP22   | QP27   | QP32   | QP37
ClassB  | -2.95% | -3.09% | -0.49% | -1.67% |  0.22% | 8.09%
ClassC  | -4.09% | -4.24% | -0.59% | -2.44% | -1.10% | 6.12%
ClassD  | -4.72% | -4.90% | -1.02% | -3.02% | -1.91% | 5.20%
ClassE  | -4.65% | -4.75% | -0.05% | -2.54% | -1.69% | 6.03%
Overall | -3.99% | -4.14% | -0.57% | -2.36% | -1.00% | 6.49%

Table 5. Test results of the AI configuration with ALF off

        | Y      | U      | V
ClassA1 | -1.57% | -4.74% | -4.03%
ClassA2 | -2.36% | -5.72% | -6.07%
ClassB  | -2.71% | -4.58% | -5.99%
ClassC  | -3.70% | -6.21% | -8.21%
ClassD  | -4.07% | -5.29% | -7.98%
ClassE  | -3.97% | -5.64% | -4.81%
Overall | -3.14% | -5.21% | -6.28%

Table 4 gives the test results. Compared with the "QP dependent Model", CNNF shows large gains when using a single model for all QPs, and it even shows gains comparable to "Best". Table 5 shows the results of CNNF with the all intra (AI) configuration: it achieves 3.14%, 5.21% and 6.28% BD-rate savings for the luma and the two chroma components. The subjective quality is given in Fig. 3; it can be observed that the subjective quality is enhanced, especially in the edge areas.

Table 6 and Table 7 show results on video coding, where CNNF is only applied to intra frames. Due to the inter dependency within ALF, it is not replaced, and for B and P frames the filters are configured the same as in JEM 7.0. Average gains of 3.57%, 6.17% and 7.06% are observed with the AI configuration. Though only applied to intra frames, CNNF achieves 1.23%, 3.65% and 3.88% gains with the RA configuration. Table 6 and Table 7 also compare the encoding time (EncT) and decoding time (DecT), which are measured as the ratio of the time consumed by the proposed scheme to that of JEM 7.0 during encoding and decoding, respectively.
Table 6. Test results of the AI configuration with ALF on

        | CPU+GPU                                 | CPU
        | Y      | U      | V      | EncT | DecT  | EncT | DecT
ClassA1 | -2.26% | -6.21% | -5.05% | 93%  | 157%  | 109% | 15360%
ClassA2 | -3.58% | -6.33% | -7.02% | 92%  | 158%  | 112% | 16312%
ClassB  | -3.08% | -5.06% | -6.27% | 94%  | 148%  | 108% | 15360%
ClassC  | -3.88% | -6.98% | -9.11% | 94%  | 158%  | 103% | 11139%
ClassD  | -4.13% | -5.63% | -8.20% | 94%  | 214%  | 102% | 7256%
ClassE  | -4.93% | -7.41% | -6.88% | 94%  | 169%  | 111% | 15441%
Overall | -3.57% | -6.17% | -7.06% | 93%  | 157%  | 109% | 12887%

Table 7. Test results of the RA configuration with ALF on (CPU)

        | Y      | U      | V      | EncT | DecT
ClassA1 | -0.39% | -1.96% | -1.93% | 99%  | 275%
ClassA2 | -1.76% | -3.70% | -4.29% | 99%  | 303%
ClassB  | -1.46% | -4.65% | -4.14% | 99%  | 339%
ClassC  | -1.28% | -4.40% | -4.75% | 99%  | 289%
ClassD  | -1.22% | -3.28% | -4.20% | 99%  | 219%
Overall | -1.23% | -3.65% | -3.88% | 99%  | 284%

With GPU, the EncT decreases and the DecT increases only a little; even when testing with CPU only, the EncT increases only slightly. Though the DecT is extremely high on CPU, we believe that with the development of deep-learning-specific hardware this will not be a problem.
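For reference, the BD-rate measure [22] used throughout this section fits a cubic polynomial to each rate-distortion curve and averages the log-rate difference over the overlapping quality range. A standard sketch of this computation (not code from the paper) is:

```python
import numpy as np

def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test):
    """Bjontegaard delta rate [22]: average bit-rate difference (%)
    between two RD curves over their overlapping PSNR range.
    Each argument is a sequence of RD points, e.g. for QP 22, 27, 32, 37.
    """
    lr_a, lr_t = np.log10(rate_anchor), np.log10(rate_test)
    # Fit log-rate as a cubic polynomial of PSNR for each curve.
    pa = np.polyfit(psnr_anchor, lr_a, 3)
    pt = np.polyfit(psnr_test, lr_t, 3)
    # Integrate both fits over the common PSNR interval.
    lo = max(min(psnr_anchor), min(psnr_test))
    hi = min(max(psnr_anchor), max(psnr_test))
    int_a = np.polyval(np.polyint(pa), hi) - np.polyval(np.polyint(pa), lo)
    int_t = np.polyval(np.polyint(pt), hi) - np.polyval(np.polyint(pt), lo)
    avg_diff = (int_t - int_a) / (hi - lo)
    # Convert the average log-rate difference back to a percentage.
    return (10 ** avg_diff - 1) * 100.0
```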
6. CONCLUSION
This paper proposes a practical CNN as the loop filter for intra frames. It uses a single model with low redundancy for loop filtering, which can adapt to reconstructions with different qualities. Besides, it solves the problem of mismatched encoding and decoding results across different platforms. Compared with JEM 7.0, the proposed CNNF achieves large gains even though it is only applied to intra frames. More gains can be expected by extending it to B and P frames in future work.
7. REFERENCES

[1] Gary J. Sullivan, Jens-Rainer Ohm, Woo-Jin Han, and Thomas Wiegand, "Overview of the high efficiency video coding (HEVC) standard," IEEE Transactions on Circuits and Systems for Video Technology, vol. 22, no. 12, pp. 1649–1668, 2012.

[2] "JVET document management system," http://phenix.it-sudparis.eu/jvet/, 2018, [Online; accessed 5-February-2018].

[3] Yuanying Dai, Dong Liu, and Feng Wu, "A convolutional neural network approach for post-processing in HEVC intra coding," in MultiMedia Modeling, Cham, 2017, pp. 28–39, Springer International Publishing.

[4] Tingting Wang, Mingjin Chen, and Hongyang Chao, "A novel deep learning-based method of improving coding efficiency from the decoder-end for HEVC," in Data Compression Conference (DCC), 2017. IEEE, 2017, pp. 410–419.

[5] Jihong Kang, Sungjei Kim, and Kyoung Mu Lee, "Multi-modal/multi-scale convolutional neural network based in-loop filter design for next generation video codec," 2017.

[6] Ren Yang, Mai Xu, and Zulin Wang, "Decoder-side HEVC quality enhancement with scalable convolutional neural network," in IEEE International Conference on Multimedia and Expo (ICME). IEEE, 2017, pp. 817–822.

[7] Woon-Sung Park and Munchurl Kim, "CNN-based in-loop filtering for coding efficiency improvement," in Image, Video, and Multidimensional Signal Processing Workshop (IVMSP), 2016 IEEE 12th. IEEE, 2016, pp. 1–5.

[8] Chuanmin Jia, Shiqi Wang, Xinfeng Zhang, Shanshe Wang, and Siwei Ma, "Spatial-temporal residue network based in-loop filter for video coding," arXiv preprint arXiv:1709.08462, 2017.

[9] Misha Denil, Babak Shakibi, Laurent Dinh, Nando de Freitas, et al., "Predicting parameters in deep learning," in Advances in Neural Information Processing Systems (NIPS), 2013, pp. 2148–2156.

[10] Philipp Gysel, Mohammad Motamedi, and Soheil Ghiasi, "Hardware-oriented approximation of convolutional neural networks," arXiv preprint arXiv:1604.03168, 2016.

[11] "JVET-I0022: Convolutional neural network filter (CNNF) for intra frame," http://phenix.it-sudparis.eu/jvet/, 2018, [Online; accessed 5-February-2018].

[12] Darryl Dexu Lin, Sachin S. Talathi, and V. Sreekanth Annapureddy, "Fixed point quantization of deep convolutional networks," CoRR, vol. abs/1511.06393, 2015.

[13] Sergey Ioffe and Christian Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," arXiv preprint arXiv:1502.03167, 2015.

[14] Emily L. Denton, Wojciech Zaremba, Joan Bruna, Yann LeCun, and Rob Fergus, "Exploiting linear structure within convolutional networks for efficient evaluation," in Advances in Neural Information Processing Systems (NIPS), 2014, pp. 1269–1277.

[15] Wei Wen, Cong Xu, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li, "Coordinating filters for faster deep neural networks," CoRR, vol. abs/1703.09746, 2017.

[16] "JEM 7.0," https://jvet.hhi.fraunhofer.de/svn/svn_HMJEMSoftware/branches/HM-16.6-JEM-7.0-dev/, 2018, [Online; accessed 5-February-2018].

[17] "Visual Genome (VG)," http://visualgenome.org/, 2018, [Online; accessed 5-February-2018].

[18] "DIV2K," https://data.vision.ee.ethz.ch/cvl/DIV2K/, 2018, [Online; accessed 5-February-2018].

[19] "ILSVRC2012," 2018, [Online; accessed 5-February-2018].

[20] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell, "Caffe: Convolutional architecture for fast feature embedding," arXiv preprint arXiv:1408.5093, 2014.

[21] "JVET-H1010: JVET common test conditions and software reference configurations," http://phenix.it-sudparis.eu/jvet/, 2018, [Online; accessed 5-February-2018].

[22] Gisle Bjontegaard, "Calculation of average PSNR differences between RD-curves," ITU-T SG16/Q6 document VCEG-M33, 2001.