A PRACTICAL CONVOLUTIONAL NEURAL NETWORK AS LOOP FILTER FOR INTRA FRAME

Xiaodan Song∗, Jiabao Yao∗, Lulu Zhou∗, Li Wang, Xiaoyang Wu, Di Xie and Shiliang Pu

Hikvision Research Institute, Hangzhou, China
{songxiaodan, yaojiabao, zhoululu, wangli7, wuxiaoyang, xiedi, pushiliang}@hikvision.com

∗ Equal contribution

ABSTRACT
Loop filters are used in video coding to remove artifacts or improve coding performance. Recent advances in deploying convolutional neural networks (CNNs) to replace traditional loop filters show large gains, but with problems for practical application. First, a different model is used for frames encoded with each quantization parameter (QP), which is expensive for hardware. Second, floating-point operations in a CNN lead to inconsistency between encoding and decoding across different platforms. Third, redundancy within the CNN model consumes precious computational resources. This paper proposes a CNN as the loop filter for intra frames together with a scheme that solves the above problems. It aims at a single CNN model with low redundancy that adapts to decoded frames of different qualities and ensures consistency. To adapt to reconstructions with different qualities, both the reconstruction and the QP are taken as inputs. After training, the obtained model is compressed to reduce redundancy. To ensure consistency, dynamic fixed points (DFP) are adopted when testing the CNN: parameters in the compressed model are first quantized to DFP and then used for inference, and the outputs of each layer are computed by DFP operations. Experimental results on JEM 7.0 report 3.14%, 5.21% and 6.28% BD-rate savings for the luma and the two chroma components with the all intra configuration when replacing all traditional filters.
Index Terms — video coding, loop filter, convolutionalneural network, model compression, dynamic fixed point
1. INTRODUCTION
In lossy image/video coding, loop filters are usually used to remove artifacts or to further improve coding performance. For example, the recently standardized HEVC [1] employs a sample adaptive offset (SAO) and a deblocking filter (DF). Up to four filters are introduced in the reference software JEM 7.0, which is used within the joint video exploration team (JVET) group [2] for the next generation video coding standard. Two additional filters, i.e., the bilateral filter (BF) and the adaptive loop filter (ALF), are included as shown in Fig. 1(a).
Fig. 1. Comparison of the intra decoding scheme between JEM 7.0 and the proposed method.

It is natural to wonder whether these filters are sufficient to deal with the complex content of natural video and whether they can be replaced by a single type of filter. Recent advances in deep learning have shed light on this. Recent research has investigated deep learning approaches, especially the convolutional neural network (CNN), for post-processing [3, 4, 5, 6] or loop filtering [7, 8]. In post-processing, CNNs are used to improve the subjective or objective quality of reconstructed frames after decoding, which does not require changing the encoding algorithm. In loop filtering, the filtered reconstruction is used as a reference for the following frames and helps to reduce the bit-rate.

In post-processing, Dai et al. [3] adopt a CNN with a residual learning structure and variable filter sizes for acceleration. [4] designs and trains a deeper network for intra frames and directly deploys it to B and P frames. Yang et al. [6] argue that models obtained for intra frames are not good enough for B and P frames and propose a scalable CNN for devices with different computational resources. To improve the performance, [5] introduces a novel CNN that takes multi-scale information and the partition of the coding tree unit into account. In loop filtering, Park et al. [7] propose to use a CNN as an in-loop filter to replace DF and SAO in HEVC and report bit-rate reduction besides objective quality enhancement; however, generalizability cannot be ensured because the test data are included in the training set. [8] takes both the current reconstructed block and the co-located block in the nearest reference frame as inputs to jointly exploit spatial and temporal information.

The above-mentioned CNN approaches show significant gains over traditional methods, which makes them attractive for practical use. However, three problems impede the way. First, almost all of them train a separate model for each QP. For a codec in which the QP varies from M to N, N−M+1 models have to be stored, which is expensive for a hardware-oriented codec. Second, the default floating-point operations in CNN computation lead to inconsistency between encoding and decoding across different platforms, e.g. CPU and GPU, which is not acceptable in video communications among various manufacturers and users. Third, there is redundancy among the parameters of a pre-trained model [9], which consumes unnecessary storage and computational resources.

In this paper, we propose a practical convolutional neural network filter (CNNF) to replace all traditional filters, as shown in Fig. 1(b), together with a scheme that solves the above problems. Both the decoded frame and the QP are taken as inputs to CNNF to obtain a QP independent model that adapts to reconstructions with different qualities. After training, the obtained model is compressed for acceleration. To ensure consistency, the compressed model is quantized to DFP [10] and the outputs of each layer are computed by DFP operations during inference. Note that CNNF has been submitted to the JVET meeting in Gwangju [11], and an Ad hoc Group has been set up to investigate deep learning for video compression.

The rest of this paper is organized as follows. Section 2 gives an overview of the proposed CNNF and Section 3 explains it in detail. The training process is described in Section 4, experimental results are given in Section 5, and Section 6 concludes this paper.
2. OVERVIEW OF THE PROPOSED SCHEME
The proposed scheme mainly includes three parts: a novel network, model compression, and DFP-based inference of CNNF. A fully convolutional network with a residual learning structure is adopted in CNNF. To obtain one QP independent model, a QP map is generated and taken as an input of the CNN. After training, the model is compressed by reducing the filter number of each convolutional layer for acceleration. The compression includes two steps. First, additional regularization is included in the loss function to help the compression; during training, filters are automatically pruned based on the absolute value of the scale parameter in the corresponding batch normalization (BN) layer. Second, the model obtained in the first step is further compressed by low rank approximation with fine-tuning, after which the filters are reconstructed from a much lower-dimensional basis of the low-rank space for acceleration.

Before DFP inference, parameters in the compressed model are first quantized and converted to DFP. To recover the performance loss due to quantization, fine-tuning is performed [12]. During inference of CNNF, the outputs of each layer are computed by DFP operations and then quantized to DFP with a low bit-width to avoid overflow. After the fixed-point inference, the outputs are denormalized to obtain the final reconstruction.
Fig. 2. Network structure of CNNF, in which "Convi" represents a convolution layer, k is the kernel size and K_L is the kernel number.

Table 1. Compressed filter number for each convolution layer (conv1–conv8).
3. THE PROPOSED CNNF

3.1. Network structure
CNNF includes two inputs: the reconstruction and a QP map, which makes it possible to use a single set of parameters to adapt to reconstructions with different qualities. The QP map is generated by

$$QPMap(x, y) = QP, \quad x = 0, 1, \ldots, W-1, \quad y = 0, 1, \ldots, H-1,$$

where QP is the QP used for encoding, and W and H are the width and height of the reconstruction, respectively. Note that the reconstruction can be a decoded frame or a block.

Before being fed to CNNF, both inputs are normalized to [0, 1] for better convergence in the training process. Each pixel in the decoded frame is divided by $(1 \ll B) - 1$, in which B is the bit depth and $\ll$ denotes a left bit shift. The QP map is divided by the maximum value of the QP. After filtering, a corresponding denormalization is performed to obtain the final reconstruction.

In the following, a simple CNN with 8 convolution layers, as shown in Fig. 2, is taken as an example to make a trade-off between performance and complexity; K_L is set to 64. We note that taking the QP map as side information can also be applied to other networks, e.g. [3, 4, 5, 6, 8]. By connecting the normalized Y, U or V to the summation layer, the network is regularized to learn the characteristics of the residual between the decoded frame and its original.
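As an illustration of this input preparation, a minimal numpy sketch is given below; the helper names and the maximum QP of 51 (the HEVC/JEM range) are assumptions, not part of the paper:

```python
import numpy as np

MAX_QP = 51  # assumed maximum QP of the codec (HEVC/JEM range)

def prepare_inputs(recon, qp, bit_depth=8):
    """Build the two normalized input planes of CNNF.

    recon: 2-D array (H x W) of decoded pixel values.
    qp:    the QP used to encode the frame or block.
    """
    # Normalize the reconstruction to [0, 1] by the peak value (1 << B) - 1.
    recon_norm = recon.astype(np.float32) / ((1 << bit_depth) - 1)
    # The QP map is a constant plane of the same size, normalized by MAX_QP.
    qp_map = np.full(recon.shape, qp / MAX_QP, dtype=np.float32)
    return recon_norm, qp_map

def denormalize(output_norm, bit_depth=8):
    """Invert the normalization after filtering to get the final pixels."""
    peak = (1 << bit_depth) - 1
    return np.clip(np.round(output_norm * peak), 0, peak).astype(np.uint16)
```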
3.2. Model compression

To speed up inference, the learned model is compressed before testing. For efficient compression, a loss function with two additional regularizers is designed for the training process:

$$\mathrm{Loss} = \underbrace{\frac{1}{2M}\sum_{i=1}^{M}\left\lVert y_i - f_w(x_i)\right\rVert^2}_{\text{mean square error}} + \underbrace{\lambda_w \sum_{j=1}^{L}\lVert W_j\rVert_{g_1}}_{\text{normal regularizer}} + \underbrace{\lambda_s \lVert S\rVert_{g_2} + \lambda_{lda}\sum_{j=1}^{L-1}\sum_{i=1}^{L-1}\left\lVert \frac{W_j}{\lVert W_j\rVert} - \frac{W_i}{\lVert W_i\rVert}\right\rVert}_{\text{additional regularizers}}, \quad (1)$$

in which $y_i$ and $f_w(x_i)$ are the ground truth and the filtered result of $x_i$, respectively, $W_j$ denotes the parameters to be learned and $S$ the scale parameters of the BN layers [13]. $\lambda_w$, $\lambda_s$ and $\lambda_{lda}$ are hyper-parameters, $M$ is the batch size and $L$ is the number of convolution layers. The mean square error is used as the main measurement of the loss, while the normal regularizer constrains the model complexity. $g_1$ and $g_2$ denote the L1 or L2 norm; to reduce the number of combinations we set $g_1 = g_2$, and experimental results show that the L1 norm is better than the L2 norm.

With the first additional regularizer, the learned scale parameters in the BN layers tend toward zero. In the training process, a filter is pruned once the absolute value of its corresponding scale parameter is small enough. The second additional regularizer, i.e. the linear discriminant analysis (LDA) term, makes the learned parameters friendly to the following low rank approximation, for which singular value decomposition (SVD) is used [14]. After that, the filters are reconstructed from a much lower-dimensional basis [15].

Table 1 gives the compressed filter numbers. The compression efficiently reduces the kernel numbers, and the amount of parameters is reduced to 51% of the original model, while performance only changes by about -0.08%, -0.19% and 0.25% on average for the Y, U and V components of classes B, C, D and E on JEM 7.0 [16].
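The pruning step can be illustrated with a minimal sketch; the threshold value and the tensor layout are assumptions, not the paper's exact procedure:

```python
import numpy as np

PRUNE_THRESHOLD = 1e-3  # assumed magnitude below which a filter is dropped

def select_filters(bn_scales):
    """Return the indices of the filters to keep in one convolution layer.

    bn_scales: 1-D array of BN scale parameters (one per filter); the
    sparsity regularizer lambda_s * ||S||_{g_2} in Eq. (1) drives many
    of them toward zero during training.
    """
    return np.flatnonzero(np.abs(bn_scales) >= PRUNE_THRESHOLD)

def prune_conv(weights, bn_scales):
    """Drop the output channels of a conv layer whose BN scale vanished.

    weights: array shaped (out_channels, in_channels, k, k). Note that the
    corresponding input channels of the *next* layer must be dropped too.
    """
    keep = select_filters(bn_scales)
    return weights[keep], bn_scales[keep]
```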
3.3. DFP-based inference

To ensure consistency between encoding and decoding across different platforms, DFP [10] operations are used in testing. A value $V$ in dynamic fixed point is described by

$$V = (-1)^s \cdot 2^{-FL} \sum_{i=0}^{B_f-2} 2^i \cdot x_i, \quad (2)$$

where $B_f$ denotes the bit width used to represent the DFP value, $s$ the sign bit, $FL$ the fractional length and $x_i$ the mantissa bits. Each floating-point value among the model parameters and layer outputs is quantized and clipped to be converted to DFP.

Values in CNNF are divided into three groups: layer weights, biases and outputs. The bit widths for weights $B_w$ and biases $B_b$ are set to 8 and 32, respectively. Since weight and bias quantization leads to performance loss, fine-tuning that takes the quantization into account is performed, similar to [10, 12]; after that, the parameters are quantized to DFP. For layer outputs, the bit width is set to 16, and experimental results show that the quantization of outputs leads to negligible loss.

Each group in the same layer shares one common $FL$, which is estimated from the available training data and the layer parameters. Table 2 gives the estimated $FL$ for each convolution layer; $FL_w$, $FL_b$ and $FL_o$ denote the $FL$ of weights, biases and outputs, respectively. The $FL$ for the concat and summation layers are both set to 15. Since CPUs and GPUs do not support DFP, the DFP operations are simulated by floating points, similar to [10].

Table 2. Estimated FL for each convolution layer

conv | 1  | 2  | 3  | 4  | 5  | 6  | 7  | 8
FL_w | 17 | 15 | 14 | 16 | 15 | 13 | 13 | 16
FL_o | 15 | 14 | 14 | 15 | 15 | 15 | 16 | 18

Table 3. Evaluation environment

        | CPU+GPU                                   | CPU
CPU     | Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz | Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz
GPU     | NVIDIA Titan Xp with 12GB memory          | —
Library | cuDNN 5.1.10                              | OpenBLAS 0.2.18
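Since DFP is simulated with floating-point arithmetic in testing [10], the quantize-and-clip operation of Eq. (2) can be sketched as follows; the rounding mode and the example tensor are assumptions:

```python
import numpy as np

def quantize_dfp(x, bit_width, frac_len):
    """Simulate dynamic fixed point (Eq. (2)) in floating point.

    x:         array of float values (weights, biases or layer outputs).
    bit_width: total number of bits B_f (sign + mantissa).
    frac_len:  fractional length FL shared by the whole group.
    """
    step = 2.0 ** (-frac_len)                      # value of one LSB
    max_val = (2.0 ** (bit_width - 1) - 1) * step  # largest representable value
    min_val = -(2.0 ** (bit_width - 1)) * step     # smallest representable value
    # Round to the nearest representable level, then clip to avoid overflow.
    q = np.round(x / step) * step
    return np.clip(q, min_val, max_val)

# Example: quantize conv1 weights with B_w = 8 and FL_w = 17 (Table 2).
w = (np.random.randn(64, 2, 3, 3) * 1e-4).astype(np.float32)
w_q = quantize_dfp(w, bit_width=8, frac_len=17)
```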
4. TRAINING PROCESS
Training data used for obtaining the layer parameters of CNNF are generated from Visual Genome (VG) [17], DIV2K [18] and ILSVRC2012 [19]. Each image is intra encoded with QPs 22, 27, 32 and 37 on JEM 7.0 with BF, DF, SAO and ALF off. Then the decoded frames, including the luma and chroma components, are divided into patches of size 35 × 35. After that, 3.6 million training samples are generated, comprising 600 thousand luma samples and 300 thousand chroma samples for each QP. Finally, all the data are mixed in a random order.

We use Caffe [20] as the training software on an NVIDIA Titan Xp GPU with 12GB memory. Eq. (1) is used as the loss function during training. Parameters of the proposed model are initialized randomly. The batch size M is set to 64 and the base learning rate to 0.1. λ_w, λ_s and λ_lda are set to 1e-5, 5e-8 and 3e-6, respectively. Stochastic gradient descent with gradient clipping is used to solve the optimization. The training is stopped after 32 epochs.
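The patch-generation step described above can be sketched as follows; the helper names and the non-overlapping tiling are assumptions:

```python
import numpy as np

PATCH = 35  # patch size used for training

def extract_patches(plane, patch=PATCH):
    """Cut one decoded plane (luma or chroma) into non-overlapping
    patch x patch blocks; any border remainder is discarded."""
    h, w = plane.shape
    patches = []
    for y in range(0, h - patch + 1, patch):
        for x in range(0, w - patch + 1, patch):
            patches.append(plane[y:y + patch, x:x + patch])
    return np.stack(patches)

def build_training_set(decoded_planes, seed=0):
    """Gather patches from all decoded planes (all QPs) and shuffle them."""
    rng = np.random.default_rng(seed)
    data = np.concatenate([extract_patches(p) for p in decoded_planes])
    rng.shuffle(data)  # mix all QPs and components in random order
    return data
```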
5. EXPERIMENTAL RESULTS
In the evaluation, JEM 7.0 [16] serves as the reference software and the common test conditions in [21] are used. BD-rate [22] is used as the measurement for comparison. Unless otherwise specified, BF, SAO, DF and ALF are all turned off. Results are tested both with GPU and with CPU only, denoted as "CPU+GPU" and "CPU", respectively, and the anchor is evaluated in both settings. The test environment is listed in Table 3. Since the QP as side information can be introduced to any network for loop filtering, we do not compare with other works.

First, the effectiveness of the QP independent model is evaluated, with model compression and DFP inference disabled. The same network without the QP map is used for comparison and denoted as the "QP dependent Model": a separate model is trained for each of QP 22, 27, 32 and 37, denoted as "QP22", "QP27", "QP32" and "QP37", respectively, with the encoded data of the corresponding QP from Section 4 used as training data. In the test, decoded frames with all QPs are filtered by each model, respectively. The results where decoded frames with different QPs are filtered by their corresponding models are also tested and denoted as "Best".

Fig. 3. Comparison of subjective quality for BQSquare encoded with QP 37: (a) the original frame, (b) the decoded frame of JEM 7.0, (c) the decoded frame of CNNF. The red bounding boxes demonstrate that the highlighted areas are enhanced more by CNNF than by JEM 7.0.

Table 4. Performance improvement of the "QP dependent Model" and CNNF on luma

Model   | CNNF   | QP dependent Model
        |        | Best   | QP22   | QP27   | QP32   | QP37
ClassB  | -2.95% | -3.09% | -0.49% | -1.67% |  0.22% | 8.09%
ClassC  | -4.09% | -4.24% | -0.59% | -2.44% | -1.10% | 6.12%
ClassD  | -4.72% | -4.90% | -1.02% | -3.02% | -1.91% | 5.20%
ClassE  | -4.65% | -4.75% | -0.05% | -2.54% | -1.69% | 6.03%
Overall | -3.99% | -4.14% | -0.57% | -2.36% | -1.00% | 6.49%

Table 5. Test results of the AI configuration with ALF off

        | Y      | U      | V
ClassA1 | -1.57% | -4.74% | -4.03%
ClassA2 | -2.36% | -5.72% | -6.07%
ClassB  | -2.71% | -4.58% | -5.99%
ClassC  | -3.70% | -6.21% | -8.21%
ClassD  | -4.07% | -5.29% | -7.98%
ClassE  | -3.97% | -5.64% | -4.81%
Overall | -3.14% | -5.21% | -6.28%

Table 4 gives the test results. Compared with the "QP dependent Model", CNNF shows large gains when using a single model for all QPs, and it even shows gains comparable to "Best". Table 5 shows the results of CNNF with the all intra (AI) configuration: it achieves 3.14%, 5.21% and 6.28% BD-rate savings for the luma and the two chroma components. The subjective quality is given in Fig. 3; it can be observed that the subjective quality is enhanced, especially in the edge areas.

Table 6 and Table 7 show results on video coding, where CNNF is only applied to intra frames. Due to the inter dependency within ALF, it is not replaced, and for B and P frames the filters are configured the same as in JEM 7.0. Average gains of 3.57%, 6.17% and 7.06% are observed with the AI configuration. Though only applied to intra frames, CNNF achieves 1.23%, 3.65% and 3.88% gains with the RA configuration. Table 6 and Table 7 also compare the encoding time (EncT) and decoding time (DecT), which are measured as the ratio of the time consumed by the proposed scheme to that of JEM 7.0 during encoding and decoding, respectively.
Table 6. Test results of the AI configuration with ALF on

        | CPU+GPU                                 | CPU
        | Y      | U      | V      | EncT | DecT  | EncT | DecT
ClassA1 | -2.26% | -6.21% | -5.05% | 93%  | 157%  | 109% | 15360%
ClassA2 | -3.58% | -6.33% | -7.02% | 92%  | 158%  | 112% | 16312%
ClassB  | -3.08% | -5.06% | -6.27% | 94%  | 148%  | 108% | 15360%
ClassC  | -3.88% | -6.98% | -9.11% | 94%  | 158%  | 103% | 11139%
ClassD  | -4.13% | -5.63% | -8.20% | 94%  | 214%  | 102% | 7256%
ClassE  | -4.93% | -7.41% | -6.88% | 94%  | 169%  | 111% | 15441%
Overall | -3.57% | -6.17% | -7.06% | 93%  | 157%  | 109% | 12887%

Table 7. Test results of the RA configuration with ALF on (CPU)

        | Y      | U      | V      | EncT | DecT
ClassA1 | -0.39% | -1.96% | -1.93% | 99%  | 275%
ClassA2 | -1.76% | -3.70% | -4.29% | 99%  | 303%
ClassB  | -1.46% | -4.65% | -4.14% | 99%  | 339%
ClassC  | -1.28% | -4.40% | -4.75% | 99%  | 289%
ClassD  | -1.22% | -3.28% | -4.20% | 99%  | 219%
Overall | -1.23% | -3.65% | -3.88% | 99%  | 284%

With GPU, the EncT decreases and the DecT increases only a little; even when testing with CPU only, the EncT increases only slightly. Though the DecT is extremely high on CPU, we believe that with the development of deep-learning-specific hardware this will not be a problem.
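For reference, the BD-rate measure [22] used throughout this section fits a cubic polynomial to each rate-distortion curve and averages the log-rate difference over the overlapping quality range. A standard sketch of this computation (not code from the paper) is:

```python
import numpy as np

def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test):
    """Bjontegaard delta rate [22]: average bit-rate difference (%)
    between two RD curves over their overlapping PSNR range.
    Each argument is a sequence of RD points, e.g. for QP 22, 27, 32, 37.
    """
    lr_a, lr_t = np.log10(rate_anchor), np.log10(rate_test)
    # Fit log-rate as a cubic polynomial of PSNR for each curve.
    pa = np.polyfit(psnr_anchor, lr_a, 3)
    pt = np.polyfit(psnr_test, lr_t, 3)
    # Integrate both fits over the common PSNR interval.
    lo = max(min(psnr_anchor), min(psnr_test))
    hi = min(max(psnr_anchor), max(psnr_test))
    int_a = np.polyval(np.polyint(pa), hi) - np.polyval(np.polyint(pa), lo)
    int_t = np.polyval(np.polyint(pt), hi) - np.polyval(np.polyint(pt), lo)
    avg_diff = (int_t - int_a) / (hi - lo)
    # Convert the average log-rate difference back to a percentage.
    return (10 ** avg_diff - 1) * 100.0
```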
6. CONCLUSION
This paper proposes a practical CNN as the loop filter for intra frames. It uses a single model with low redundancy for loop filtering, which can adapt to reconstructions with different qualities. Besides, it solves the problem of mismatched encoding and decoding results across different platforms. Compared with JEM 7.0, the proposed CNNF achieves large gains even though it is only applied to intra frames. More gains can be expected by extending it to B and P frames in future work.
7. REFERENCES

[1] Gary J. Sullivan, Jens-Rainer Ohm, Woo-Jin Han, and Thomas Wiegand, "Overview of the high efficiency video coding (HEVC) standard," IEEE Transactions on Circuits and Systems for Video Technology, vol. 22, no. 12, pp. 1649–1668, 2012.

[2] "JVET document management system," http://phenix.it-sudparis.eu/jvet/, 2018, [Online; accessed 5-February-2018].

[3] Yuanying Dai, Dong Liu, and Feng Wu, "A convolutional neural network approach for post-processing in HEVC intra coding," in MultiMedia Modeling, Cham, 2017, pp. 28–39, Springer International Publishing.

[4] Tingting Wang, Mingjin Chen, and Hongyang Chao, "A novel deep learning-based method of improving coding efficiency from the decoder-end for HEVC," in Data Compression Conference (DCC), 2017. IEEE, 2017, pp. 410–419.

[5] Jihong Kang, Sungjei Kim, and Kyoung Mu Lee, "Multi-modal/multi-scale convolutional neural network based in-loop filter design for next generation video codec," 2017.

[6] Ren Yang, Mai Xu, and Zulin Wang, "Decoder-side HEVC quality enhancement with scalable convolutional neural network," in IEEE International Conference on Multimedia and Expo (ICME). IEEE, 2017, pp. 817–822.

[7] Woon-Sung Park and Munchurl Kim, "CNN-based in-loop filtering for coding efficiency improvement," in Image, Video, and Multidimensional Signal Processing Workshop (IVMSP), 2016 IEEE 12th. IEEE, 2016, pp. 1–5.

[8] Chuanmin Jia, Shiqi Wang, Xinfeng Zhang, Shanshe Wang, and Siwei Ma, "Spatial-temporal residue network based in-loop filter for video coding," arXiv preprint arXiv:1709.08462, 2017.

[9] Misha Denil, Babak Shakibi, Laurent Dinh, Nando de Freitas, et al., "Predicting parameters in deep learning," in Advances in Neural Information Processing Systems (NIPS), 2013, pp. 2148–2156.

[10] Philipp Gysel, Mohammad Motamedi, and Soheil Ghiasi, "Hardware-oriented approximation of convolutional neural networks," arXiv preprint arXiv:1604.03168, 2016.

[11] "JVET-I0022: Convolutional neural network filter (CNNF) for intra frame," http://phenix.it-sudparis.eu/jvet/, 2018, [Online; accessed 5-February-2018].

[12] Darryl Dexu Lin, Sachin S. Talathi, and V. Sreekanth Annapureddy, "Fixed point quantization of deep convolutional networks," CoRR, vol. abs/1511.06393, 2015.

[13] Sergey Ioffe and Christian Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," arXiv preprint arXiv:1502.03167, 2015.

[14] Emily L. Denton, Wojciech Zaremba, Joan Bruna, Yann LeCun, and Rob Fergus, "Exploiting linear structure within convolutional networks for efficient evaluation," in Advances in Neural Information Processing Systems (NIPS), 2014, pp. 1269–1277.

[15] Wei Wen, Cong Xu, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li, "Coordinating filters for faster deep neural networks," CoRR, vol. abs/1703.09746, 2017.

[16] "JEM 7.0," https://jvet.hhi.fraunhofer.de/svn/svn_HMJEMSoftware/branches/HM-16.6-JEM-7.0-dev/, 2018, [Online; accessed 5-February-2018].

[17] "Visual Genome (VG)," http://visualgenome.org/, 2018, [Online; accessed 5-February-2018].

[18] "DIV2K," https://data.vision.ee.ethz.ch/cvl/DIV2K/, 2018, [Online; accessed 5-February-2018].

[19] "ILSVRC2012," 2018, [Online; accessed 5-February-2018].

[20] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell, "Caffe: Convolutional architecture for fast feature embedding," arXiv preprint arXiv:1408.5093, 2014.

[21] "JVET-H1010: JVET common test conditions and software reference configurations," http://phenix.it-sudparis.eu/jvet/, 2018, [Online; accessed 5-February-2018].

[22] Gisle Bjontegaard, "Calculation of average PSNR differences between RD-curves," ITU-T SG16/Q6 document VCEG-M33, 2001.