An Efficient QP Variable Convolutional Neural Network Based In-loop Filter for Intra Coding
Zhijie Huang, Xiaopeng Guo, Mingyu Shang, Jie Gao and Jun Sun∗
Wangxuan Institute of Computer Technology, Peking University, Beijing, 100871, China
{zhijiehuang,jsun}@pku.edu.cn
∗ Corresponding author
Abstract
In this paper, a novel QP variable convolutional neural network based in-loop filter is proposed for VVC intra coding. To avoid training and deploying multiple networks, we develop an efficient QP attention module (QPAM) which captures compression noise levels for different QPs and emphasizes meaningful features along the channel dimension. We then embed QPAM into the residual block and, based on it, design a network architecture that is equipped with controllability for different QPs. To make the proposed model focus more on examples that have more compression artifacts or are hard to restore, a focal mean square error (MSE) loss function is employed to fine-tune the network. Experimental results show that our approach achieves a 4.03% BD-Rate saving on average for the all-intra configuration, which is even better than QP-separate CNN models while using fewer model parameters.
1. Introduction
In-loop filtering is an essential module in video coding: it not only improves the quality of current frames directly by reducing compression artifacts, but also provides high-quality reference frames for succeeding pictures. In the latest video coding standard, Versatile Video Coding (VVC) [1], four in-loop filtering steps are applied to the reconstructed samples: a luma mapping with chroma scaling (LMCS) process [2], followed by a deblocking filter (DBF) [3], an SAO filter [4], and an adaptive loop filter (ALF) [5]. The DBF and SAO are similar to those of the HEVC [6] standard, whereas LMCS and ALF are newly adopted in VVC.

Besides the built-in in-loop filters, various convolutional neural network (CNN) based in-loop filters have been proposed in recent years. In [7], a very deep recursive residual CNN (RRCNN) was developed to recover reconstructed intra frames. Zhang et al. [8] introduced a deep residual highway CNN (RHCNN) based in-loop filter for HEVC. Wang et al. [9] designed a dense residual CNN based in-loop filter (DRNLF) for VVC. Typically, since the compression noise levels are distinct for videos compressed with different quantization parameters (QPs), many CNN models need to be trained, one per QP. To address this issue, Zhang et al. [8] merged the QPs into several bands and trained an optimal model for each band, but they still had to train and deploy several networks. Song et al. [10] combined QPs as an input and fed them into the CNN training stage by simply padding the scalar QPs into a matrix with the same size as the input frames or patches. However, these QP-combined models are inferior to QP-separate CNN models in terms of rate-distortion (RD) performance, and their flexibility and scalability are limited.

Figure 1: Overview of the QP attention module (QPAM). ⊗ denotes element-wise product.

In this paper, a novel QP variable convolutional neural network based in-loop filter is proposed for VVC intra frames. Specifically, considering the different compression noise levels for different QPs, a QP attention module (QPAM) is developed which assigns different weights to each channel of the input feature map according to the QP value. Compared with other methods, the proposed QPAM has wide applicability and stronger scalability, and it can also be applied to adapt to different frame types. QPAM is then embedded into the residual block, and based on it, we design a network architecture that not only fully utilizes residual features but also has controllability for different QPs. To further make the proposed model pay more attention to examples that have more compression artifacts or are hard to restore, a focal mean square error (MSE) loss function is employed to fine-tune the network. Experimental results verify the efficiency of the proposed QPAM and network architecture, which also outperforms other methods.
2. The Proposed QPALF Method
QPAM. Inspired by the channel attention module in [11], we propose a QPAM to avoid training and deploying multiple networks. Unlike the channel attention module, which extracts the attention map from the input feature map, our QP attention module is controlled by the QP value. An overview of the proposed QPAM is illustrated in Figure 1. Given a feature map $F \in \mathbb{R}^{H \times W \times C}$ as input, QPAM infers a 1D QP attention map $M \in \mathbb{R}^{1 \times 1 \times C}$. The attention process can be summarized as:

$$F' = M \otimes F \qquad (1)$$

where $\otimes$ denotes element-wise multiplication. During multiplication, the attention values are broadcast along the spatial dimensions, and $F'$ is the refined output. The QP attention map is generated as follows. Given a QP value $q \in \Omega = [a, b]$, since $q$ is an integer, we first map $q$ to a vector $v_\Omega(q) \in \mathbb{R}^{m \times 1}$ by one-hot encoding, where $m = |\Omega|$. The QP attention map $M$ is calculated by:

$$M' = \sigma(U\, v_\Omega(q)), \qquad M = \mathrm{reshape}(M') \qquad (2)$$

where $U \in \mathbb{R}^{C \times m}$ is a weight matrix and $\sigma(x) = \log(1 + e^x)$. From this process we can see that QPAM assigns different weights to each channel of the input feature map according to the QP value, so that the module can capture compression noise levels among different QPs. Meanwhile, the module can also emphasize meaningful features along the channel axis. Moreover, compared with other methods, the proposed QPAM has stronger scalability and can easily be applied to other discrete variables, e.g., frame types.

Figure 2: Overview of the architecture of our QPALF network.
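To make Eqs. (1)–(2) concrete, here is a minimal PyTorch sketch of QPAM; the class layout and parameter names are our own, with a bias-free linear layer playing the role of the weight matrix $U$ and softplus as $\sigma$:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class QPAM(nn.Module):
    """Sketch of the QP attention module: one-hot encode the QP,
    project it with a learned matrix U, apply softplus, and rescale
    the feature channels with the resulting 1x1xC attention map."""

    def __init__(self, channels: int, qp_min: int = 22, qp_max: int = 37):
        super().__init__()
        self.qp_min = qp_min
        self.m = qp_max - qp_min + 1                          # m = |Omega|
        self.proj = nn.Linear(self.m, channels, bias=False)   # U in R^{C x m}

    def forward(self, feat: torch.Tensor, qp: torch.Tensor) -> torch.Tensor:
        # feat: (B, C, H, W); qp: (B,) integer tensor of QP values
        one_hot = F.one_hot(qp - self.qp_min, num_classes=self.m).float()
        attn = F.softplus(self.proj(one_hot))       # sigma(U v(q)), shape (B, C)
        attn = attn.view(feat.size(0), -1, 1, 1)    # reshape to (B, C, 1, 1)
        return feat * attn                          # Eq. (1), spatial broadcast
```

For example, `QPAM(64)(torch.randn(1, 64, 32, 32), torch.tensor([37]))` returns a refined feature map of the same shape. Because the attention map depends only on a one-hot code, the same mechanism extends directly to other discrete variables such as frame types.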
Figure 3: Detail of the residual block with QPAM and the residual feature aggregation module. Left: (a) residual feature aggregation block; RB denotes residual block. Right: (b) residual block with QPAM.
Architecture. In Figure 2, we present an overview of the architecture of our QPALF network. This is one of the popular architectures used by many other methods [8, 7, 9], which usually consists of three parts: the head part, the backbone part, and the reconstruction part. The head part is responsible for initial feature extraction, with only one convolutional layer followed by an activation function. Given a compressed input $X$, we obtain a shallow feature $F_0$ through this layer:

$$F_0 = H_0(X) \qquad (3)$$

The backbone part is the key component of the network and is also the most distinct part among various networks. Here it is made up of $D$ cascaded residual feature aggregation (RFA) modules. The backbone receives the feature $F_0$ as input and sends the extracted global feature $F_G$ to the reconstruction part, which can be formulated as:

$$F_D = R_D(F_{D-1}) = R_D(R_{D-1}(\ldots(R_1(F_0))\ldots)) \qquad (4)$$

$$F_G = F_0 + R_c([F_1, \ldots, F_d, \ldots, F_D]) \qquad (5)$$

where $R_d$ denotes the $d$-th RFA module function, $F_{d-1}$ is the input of the $d$-th RFA module, and $F_d$ is the corresponding output. The output features of the $D$ RFAs are concatenated and fused by $R_c$, and a long skip connection is used to obtain the global feature $F_G$. Finally, the global feature $F_G$ is transformed through the reconstruction part:

$$\hat{Y} = X + H(F_G) \qquad (6)$$

where $\hat{Y}$ is the output and $H$ is the reconstruction function, which consists of only one convolutional layer. Global residual learning is used to ease the training difficulty in the reconstruction part.

Inspired by [12], we propose an RFA module to make better use of the local residual features. Figure 3(a) illustrates the detail of the RFA, which contains three residual blocks and one convolutional layer. The input is processed by the three residual blocks at three different levels; their outputs are then concatenated and fused by a 1 × 1 convolutional layer. We set $D = 6$ here.
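As an illustration of Eqs. (3)–(6) and of Figure 3, the following sketch assembles the QPALF network in PyTorch. The exact kernel sizes, the channel width of 64, and the internal layout of the residual block are our assumptions, and `QPAM` refers to the sketch above:

```python
import torch
import torch.nn as nn


class ResBlockQPAM(nn.Module):
    """Residual block with an embedded QPAM (assumed layout:
    conv3x3 -> PReLU -> conv3x3 -> QPAM, plus an identity skip)."""

    def __init__(self, ch: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.PReLU(),
            nn.Conv2d(ch, ch, 3, padding=1))
        self.qpam = QPAM(ch)

    def forward(self, x, qp):
        return x + self.qpam(self.body(x), qp)


class RFA(nn.Module):
    """Residual feature aggregation: three residual blocks whose
    outputs are concatenated and fused by a 1x1 convolution."""

    def __init__(self, ch: int):
        super().__init__()
        self.blocks = nn.ModuleList(ResBlockQPAM(ch) for _ in range(3))
        self.fuse = nn.Conv2d(3 * ch, ch, 1)

    def forward(self, x, qp):
        outs, h = [], x
        for blk in self.blocks:
            h = blk(h, qp)
            outs.append(h)
        return self.fuse(torch.cat(outs, dim=1))


class QPALF(nn.Module):
    """Head -> D cascaded RFA modules -> concat + 1x1 fusion with a
    long skip -> one reconstruction conv with a global residual."""

    def __init__(self, ch: int = 64, depth: int = 6):
        super().__init__()
        self.head = nn.Sequential(nn.Conv2d(1, ch, 3, padding=1), nn.PReLU())
        self.rfas = nn.ModuleList(RFA(ch) for _ in range(depth))
        self.fuse = nn.Conv2d(depth * ch, ch, 1)
        self.recon = nn.Conv2d(ch, 1, 3, padding=1)

    def forward(self, x, qp):
        f0 = self.head(x)                          # Eq. (3)
        feats, h = [], f0
        for rfa in self.rfas:                      # Eq. (4)
            h = rfa(h, qp)
            feats.append(h)
        fg = f0 + self.fuse(torch.cat(feats, 1))   # Eq. (5), long skip
        return x + self.recon(fg)                  # Eq. (6), global residual
```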
Dataset. For network training, we build a dataset using DIV2K [15], which contains 800 high-resolution images. First, we convert these images to the YUV 4:2:0 color format and encode them with VTM6.0 [16] under the all-intra (AI) configuration at four QPs: 22, 27, 32, and 37. The built-in in-loop filters are all enabled when compressing these images. The compressed images are then divided into two non-overlapping sets: training (700 images) and validation (100 images). To further expand the training data, we split the reconstructions into small patches of 64 × 64 with stride 16, and we remove patches whose PSNR is more than 50.0 or less than 20.0. When training QP variable models, the four training datasets are mixed in random order.
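A minimal sketch of the patch extraction and PSNR-based filtering described above, assuming 8-bit luma planes stored as NumPy arrays (function and argument names are ours):

```python
import numpy as np


def extract_patches(rec: np.ndarray, ref: np.ndarray,
                    size: int = 64, stride: int = 16,
                    psnr_lo: float = 20.0, psnr_hi: float = 50.0):
    """Split a reconstructed luma plane into size x size patches with
    the given stride, keeping only patches whose PSNR against the
    pristine reference lies within [psnr_lo, psnr_hi]."""
    patches = []
    h, w = rec.shape
    for y in range(0, h - size + 1, stride):
        for x in range(0, w - size + 1, stride):
            p_rec = rec[y:y + size, x:x + size].astype(np.float64)
            p_ref = ref[y:y + size, x:x + size].astype(np.float64)
            mse = np.mean((p_rec - p_ref) ** 2)
            psnr = 10.0 * np.log10(255.0 ** 2 / max(mse, 1e-10))
            if psnr_lo <= psnr <= psnr_hi:
                patches.append((p_rec, p_ref))
    return patches
```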
Loss Function. Let $X$ be the input and $\theta$ the set of network parameters to be optimized. Our goal is to learn an end-to-end function $F$ for generating a higher quality reconstruction $\hat{Y} = F(X; \theta)$ that is close to the ground truth $Y$. The loss function is the MSE between $\hat{Y}$ and $Y$:

$$L_{rec} = \frac{1}{N} \sum_{i=1}^{N} \| \hat{Y}^{(i)} - Y^{(i)} \|^2 \qquad (7)$$

where $N$ is the number of training samples in each batch. In order to train a more robust QP-combined network, we analyze the restoration ability of the network for different QPs. First, we train a QPALF network using the mixed dataset (the training details are presented below). Then we plot the cumulative proportion of validation images over the PSNR gain rate for different QPs, where the PSNR gain rate is defined as:

$$R = 1 - \frac{L_{rec}}{L_{init}} \qquad (8)$$

where $L_{init}$ is the MSE between $X$ and $Y$.

Figure 4: Cumulative PSNR gain rate distribution of QPALF on the validation image dataset.

From Figure 4, we can find: 1) The network has a lower PSNR gain rate overall on data with smaller QP, especially at QP=22. Obviously, a smaller QP means fewer compression artifacts, and we do not expect the network to pay much attention to data with few compression artifacts. 2) The PSNR gain rate of 80% of the validation data is less than 10%; that is, validation data with a low PSNR gain rate accounts for a large proportion. So we expect the network to focus more on data that has a low PSNR gain rate. To this end, we propose a focal MSE loss function to fine-tune the proposed network, calculated by:

$$L = \alpha_q (1 - R)^{\gamma} L_{rec} = \alpha_q \frac{L_{rec}^{\gamma+1}}{L_{init}^{\gamma}} \qquad (9)$$

where $\alpha_q$ is a weighting factor over the QP value and $\gamma$ is a focusing parameter.
Herein $\alpha_q$ is set separately for each of the four QPs (0.35 for QP=37) and $\gamma = 1$. Table 1 shows the coding performance of three networks over the test sequences, where QPALF-I, QPALF-II, and QPALF-III denote QPALF without fine-tuning, QPALF fine-tuned by MSE, and QPALF fine-tuned by focal MSE, respectively.

Table 1: The coding performance (BD-Rate, %) of three QPALF networks

Class     QPALF-I   QPALF-II   QPALF-III
A1        -1.54     -1.73      -2.02
A2        -1.98     -2.18      -2.29
B         -3.14     -3.30      -3.32
C         -4.49     -4.61      -4.73
D         -5.48     -5.60      -5.70
E         -5.43     -5.66      -5.84
Average   -3.75     -3.91      -4.03

As we can see, QPALF-III achieves more bit-rate saving than QPALF-II, which demonstrates the effectiveness of focal MSE.
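As an illustration, a direct PyTorch rendering of Eqs. (7)–(9) could look as follows; the function name is ours and the small clamp guarding against division by zero is our addition:

```python
import torch


def focal_mse_loss(pred, target, inp, alpha_q: float, gamma: float = 1.0):
    """Focal MSE of Eq. (9): L_rec is the MSE between the network output
    and the ground truth (Eq. (7)), L_init the MSE between the compressed
    input and the ground truth, and R = 1 - L_rec / L_init the PSNR gain
    rate (Eq. (8)). Easy examples (high R) are down-weighted by
    (1 - R) ** gamma; alpha_q is a per-QP weighting factor."""
    l_rec = torch.mean((pred - target) ** 2)
    l_init = torch.mean((inp - target) ** 2).clamp_min(1e-12)
    r = 1.0 - l_rec / l_init
    return alpha_q * (1.0 - r).pow(gamma) * l_rec
```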
Training Details. The widely adopted deep learning framework PyTorch [17] is used to train our models. We use Adam [18] optimization with a batch size of 64. The learning rate is halved every 25 epochs, and training takes 100 epochs in total. For QP-separate models, we first train the model for QP=37 and then use it to initialize the parameters of the networks for smaller QPs; the initial learning rate is 10− for QP=37 and 10− for the other QPs. For QP-combined models, the initial learning rate is 10−, and the fine-tuning process takes 50 epochs with a learning rate of 10−. All models are trained on NVIDIA Titan X (Pascal) GPUs.

Implementation. We integrate QPALF into VVC as an additional in-loop filtering tool between DBF and SAO. To obtain better coding performance, a frame-level flag is signaled in the bitstream to indicate to the decoder whether QPALF is enabled for the frame. When the reduction in RD cost is greater than 0, the flag is enabled and QPALF is applied to the luma component of the frame.
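The paper does not spell out the encoder-side decision logic, but schematically the frame-level flag could be decided as in this hypothetical sketch, with the RD cost $J = D + \lambda R$ and SSD as the distortion measure:

```python
import numpy as np


def decide_qpalf_flag(rec_frame, filtered_frame, orig_frame,
                      lam: float, flag_bits: float = 1.0) -> bool:
    """Enable the frame-level QPALF flag only when it reduces the RD
    cost. Distortion is the SSD against the original frame; the rate
    term here accounts only for the signaled flag bit."""
    def ssd(a, b):
        return float(np.sum((a.astype(np.float64) - b.astype(np.float64)) ** 2))

    j_off = ssd(rec_frame, orig_frame)                        # QPALF disabled
    j_on = ssd(filtered_frame, orig_frame) + lam * flag_bits  # QPALF enabled
    return j_on < j_off   # True -> write flag = 1, filter the luma plane
```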
3. Experiment
Experimental Setting
In our experiments, all in-loop filtering approaches are incorporated into the VVC reference software VTM6.0, and the LibTorch [17] library is integrated into VTM6.0 to perform in-loop filtering with the different models. Four typical QP values are tested: 22, 27, 32, and 37. We use the AI configuration suggested by the VVC common test conditions (CTC) [19]. The anchor for all experiments is VTM6.0 with all built-in in-loop filters enabled. Coding efficiency is evaluated on the standard video sequences from class A1 to class E recommended by JVET, and BD-Rate [20] is used to measure coding performance. We only train and apply the models on the Y channel, but our approach can be extended to an arbitrary number of channels.
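For reference, a compact sketch of the BD-Rate metric of [20]: fit a cubic polynomial to each RD curve in the (PSNR, log-rate) plane and integrate the horizontal gap over the overlapping PSNR range (implementations differ in fitting details; this plain cubic fit is one common variant):

```python
import numpy as np


def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test) -> float:
    """Bjontegaard delta rate: average bitrate change of the test codec
    against the anchor, in percent (negative means bitrate savings).
    Each argument is a sequence of four RD points, one per QP."""
    lr_a, lr_t = np.log(rate_anchor), np.log(rate_test)
    p_a = np.polyfit(psnr_anchor, lr_a, 3)      # log-rate as a cubic in PSNR
    p_t = np.polyfit(psnr_test, lr_t, 3)
    lo = max(min(psnr_anchor), min(psnr_test))  # overlapping PSNR interval
    hi = min(max(psnr_anchor), max(psnr_test))
    int_a = np.polyval(np.polyint(p_a), hi) - np.polyval(np.polyint(p_a), lo)
    int_t = np.polyval(np.polyint(p_t), hi) - np.polyval(np.polyint(p_t), lo)
    avg_log_diff = (int_t - int_a) / (hi - lo)
    return (np.exp(avg_log_diff) - 1.0) * 100.0
```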
Table 2: The BD-Rate (%) of different models on the Y channel under the AI configuration

Class  Sequence          RHCNN   DRNLF   QPMLF   QPALF-S   QPALF
A1     Tango2            -0.62   -0.63   -0.78   -0.62     -1.86
       Campfire          -0.82   -1.32   -0.79   -1.42     -2.01
       FoodMarket4       -0.74   -0.09   -0.29   -0.89     -2.20
A2     CatRobot          -1.10   -2.20   -1.97   -2.28     -3.39
       DaylightRoad2     -0.43    0.07   -0.17    1.02     -0.50
       ParkRunning3      -1.01   -1.96   -1.56   -2.04     -2.99
B      RitualDance       -1.88   -4.29   -4.03   -4.85     -6.32
       MarketPlace       -1.34   -2.33   -2.09   -2.64     -3.58
       BasketballDrive   -0.73   -1.63   -1.05   -1.82     -2.84
       BQTerrace         -0.64   -1.38   -1.02   -1.56     -2.06
       Cactus            -0.84   -2.07   -1.71   -1.69     -1.78
C      BasketballDrill   -2.29   -5.43   -4.50   -5.76     -7.48
       BQMall            -1.93   -4.31   -3.73   -4.58     -5.49
       PartyScene        -1.22   -3.01   -2.57   -3.19     -3.62
       RaceHorsesC       -0.81   -1.75   -1.39   -1.81     -2.31
D      BasketballPass    -2.10   -5.24   -4.42   -5.67     -6.76
       BlowingBubbles    -1.59   -3.64   -3.19   -3.87     -4.45
       BQSquare          -1.93   -5.12   -4.40   -5.28     -6.20
       RaceHorses        -2.03   -4.54   -4.21   -4.67     -5.40
E      FourPeople        -2.05   -4.73   -4.08   -5.09     -6.49
       Johnny            -1.63   -3.90   -3.17   -4.12     -5.72
       KristenAndSara    -1.59   -3.95   -3.31   -4.24     -5.31
Average All              -1.54   -2.88   -2.47   -3.05     -4.03
Evaluation on VVC Test Sequences
RD Performance. First, we compare our QPALF with the VTM baseline and two CNN based in-loop filters, RHCNN [8] and DRNLF [9]. For a fair comparison, we also train these models on our dataset and integrate them into VTM6.0 between DBF and SAO. The results are displayed in Table 2. Our QPALF clearly improves the coding efficiency, obtaining 4.03% bit-rate saving overall for the luma component on all the test sequences. To further verify the efficiency of the proposed model, we also compare our QPALF with QPMLF and QPALF-S. QPALF-S is the QPALF model trained separately on the four QPs; QPMLF is QPALF without QPAM, using the QP map method [10] instead. As we can see, compared with the QP-separate model QPALF-S, the performance of the QP-combined model QPMLF degrades, while our model QPALF achieves even better coding performance. Moreover, the PSNR gain of the three models over multiple QPs is depicted in Figure 5, where our QPAM obtains the highest PSNR gain over all QPs, demonstrating the generalization ability and robustness of the proposed method. (Since the models are trained on only four QPs, {22, 27, 32, 37}, we first map other QP values to these four QPs.)

Figure 5: PSNR gain of three methods on multiple QPs. (a) FourPeople; (b) BQMall.

Subjective Evaluation. Figure 6 illustrates a subjective visual quality comparison among the four approaches. It can be observed that the images enhanced by our approach retain less distortion than those produced by the other approaches, e.g., the clearer edge of the basketball net line. In Figure 7, we display the residual maps of the three methods over the VTM baseline. Compared with RHCNN and DRNLF, our method restores more image texture details.
Complexity. Table 3 shows the average encoding/decoding complexity increase and the parameter counts of different models on an Intel(R) Xeon(R) CPU E5-2697 v4 and a Titan X (Pascal). All of the neural networks run with GPU acceleration. The complexity increase is calculated by $\Delta T = (\hat{T} - T)/T$, where $\hat{T}$ is the encoding/decoding time with the models integrated and $T$ is the original encoding/decoding time. Our proposed QPALF has fewer model parameters overall. Compared with QPMLF, our model achieves much better RD performance with little complexity increase.

Table 3: Average complexity increase (ΔET, ΔDT) and parameters of different models (for RHCNN: ΔET 5.43%, ΔDT 10695.9%).
4. Conclusion and Future Work
In this paper, an efficient QP variable CNN based in-loop filter for VVC is proposed. With the proposed QPAM, QPALF can adapt to different QPs while achieving better RD performance, and a focal MSE loss is introduced to train a more robust model. Experimental results demonstrate that our QPALF significantly improves coding efficiency and outperforms other CNN based methods. Moreover, the proposed QPAM has wide applicability and stronger scalability: it can easily be implemented in other networks and extended to other discrete variables. In our future work, we will extend our model to inter mode and speed up QPALF.

Figure 6: Subjective image quality comparison on sequences BasketballDrill (Class C) and RaceHorses (Class D) at QP = 37, reporting (PSNR, SSIM) against the ground truth: VTM (26.16, 0.9045) and (28.15, 0.8643); RHCNN (26.29, 0.9073) and (28.24, 0.8666); DRNLF (26.56, 0.9104) and (28.32, 0.8682); Ours.

Figure 7: Residual map of different methods over the VTM baseline: (a) VTM; (b) RHCNN; (c) DRNLF; (d) Ours.
Acknowledgment
This work was supported by the National Natural Science Foundation of China under contract No. 61671025.
References

[1] S. Liu, B. Bross, and J. Chen, "Versatile Video Coding (Draft 10)," JVET-S2001, Joint Video Experts Team (JVET), Jul. 2020.
[2] T. Lu, F. Pu, P. Yin, S. McCarthy, W. Husak, T. Chen, E. Francois, C. Chevance, F. Hiron, J. Chen, R. Liao, Y. Ye, and J. Luo, "Luma Mapping with Chroma Scaling in Versatile Video Coding," 2020, pp. 193–202.
[3] A. Norkin, G. Bjontegaard, A. Fuldseth, M. Narroschke, M. Ikeda, K. Andersson, M. Zhou, and G. Van der Auwera, "HEVC Deblocking Filter," IEEE Transactions on Circuits and Systems for Video Technology, vol. 22, no. 12, pp. 1746–1754, 2012.
[4] C. Fu, C. Chen, Y. Huang, and S. Lei, "Sample Adaptive Offset for HEVC," 2011, pp. 1–5.
[5] C. Tsai, C. Chen, T. Yamakage, I. S. Chong, Y. Huang, C. Fu, T. Itoh, T. Watanabe, T. Chujoh, M. Karczewicz, and S. Lei, "Adaptive Loop Filtering for Video Coding," IEEE Journal of Selected Topics in Signal Processing, vol. 7, no. 6, pp. 934–945, 2013.
[6] G. J. Sullivan, J. Ohm, W. Han, and T. Wiegand, "Overview of the High Efficiency Video Coding (HEVC) Standard," IEEE Transactions on Circuits and Systems for Video Technology, vol. 22, no. 12, pp. 1649–1668, 2012.
[7] S. Zhang, Z. Fan, N. Ling, and M. Jiang, "Recursive Residual Convolutional Neural Network-Based In-Loop Filtering for Intra Frames," IEEE Transactions on Circuits and Systems for Video Technology, vol. 30, no. 7, pp. 1888–1900, 2020.
[8] Y. Zhang, T. Shen, X. Ji, Y. Zhang, R. Xiong, and Q. Dai, "Residual Highway Convolutional Neural Networks for In-loop Filtering in HEVC," IEEE Transactions on Image Processing, vol. 27, no. 8, pp. 3827–3841, 2018.
[9] Y. Wang, Z. Chen, Y. Li, L. Zhao, S. Liu, and X. Li, "Test Results of Dense Residual Convolutional Neural Network Based In-Loop Filter," document JVET-M0508, Marrakech, Morocco, Jan. 2019.
[10] X. Song, J. Yao, L. Zhou, L. Wang, X. Wu, D. Xie, and S. Pu, "A Practical Convolutional Neural Network as Loop Filter for Intra Frame," 2018, pp. 1133–1137.
[11] S. Woo, J. Park, J. Lee, and I. Kweon, "CBAM: Convolutional Block Attention Module," in Computer Vision – ECCV 2018, Cham, 2018, pp. 3–19.
[12] J. Liu, W. Zhang, Y. Tang, J. Tang, and G. Wu, "Residual Feature Aggregation Network for Image Super-Resolution," in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 2356–2365.
[13] K. He, X. Zhang, S. Ren, and J. Sun, "Deep Residual Learning for Image Recognition," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
[14] K. He, X. Zhang, S. Ren, and J. Sun, "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification," in IEEE International Conference on Computer Vision (ICCV), 2015, pp. 1026–1034.
[15] E. Agustsson and R. Timofte, "NTIRE 2017 Challenge on Single Image Super-Resolution: Dataset and Study," in IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), July 2017, pp. 1122–1131.
[16] "Versatile Video Coding Test Model (VTM), 6.0.1," 2019.
[17] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, and Z. Lin, "PyTorch: An Imperative Style, High-Performance Deep Learning Library," in Advances in Neural Information Processing Systems, 2019, pp. 8024–8035.
[18] D. P. Kingma and J. Ba, "Adam: A Method for Stochastic Optimization," arXiv e-prints, arXiv:1412.6980, Dec. 2014.
[19] K. Suehring and X. Li, "JVET Common Test Conditions and Software Reference Configurations," Document JVET-H1010, 8th JVET Meeting, Oct. 2017.
[20] G. Bjøntegaard, "Calculation of Average PSNR Differences between RD-Curves," Document VCEG-M33, Apr. 2001.