An Efficient QP Variable Convolutional Neural Network Based In-loop Filter for Intra Coding
Zhijie Huang, Xiaopeng Guo, Mingyu Shang, Jie Gao and Jun Sun∗
Wangxuan Institute of Computer Technology, Peking University, Beijing, 100871, China
{zhijiehuang,jsun}@pku.edu.cn
∗ Corresponding author
Abstract
In this paper, a novel QP variable convolutional neural network based in-loop filter is proposed for VVC intra coding. To avoid training and deploying multiple networks, we develop an efficient QP attention module (QPAM) which captures compression noise levels for different QPs and emphasizes meaningful features along the channel dimension. We then embed QPAM into the residual block and, based on it, design a network architecture that is equipped with controllability for different QPs. To make the proposed model focus more on examples that have more compression artifacts or are hard to restore, a focal mean square error (MSE) loss function is employed to fine-tune the network. Experimental results show that our approach achieves a 4.03% BD-Rate saving on average for the all-intra configuration, which is even better than QP-separate CNN models while using fewer model parameters.
1. Introduction
In-loop filtering is an essential module in video coding: it not only improves the quality of current frames directly by reducing compression artifacts, but also provides high-quality reference frames for succeeding pictures. In the latest video coding standard, Versatile Video Coding (VVC) [1], four in-loop filtering steps are applied to the reconstructed samples: a luma mapping with chroma scaling (LMCS) process [2], followed by a deblocking filter (DBF) [3], an SAO filter [4], and an adaptive loop filter (ALF) [5]. The DBF and SAO are similar to those of the HEVC [6] standard, whereas LMCS and ALF are newly adopted in VVC.

Besides the built-in in-loop filters, various convolutional neural network (CNN) based in-loop filters have been proposed in recent years. In [7], a very deep recursive residual CNN (RRCNN) was developed to recover reconstructed intra frames. Zhang et al. [8] introduced a deep residual highway CNN (RHCNN) based in-loop filter for HEVC. Wang et al. [9] designed a dense residual CNN based in-loop filter (DRNLF) for VVC. Typically, since the compression noise levels are distinct for videos compressed with different quantization parameters (QPs), many CNN models need to be trained, one per QP. To address this issue, Zhang et al. [8] merged the QPs into several bands and trained an optimal model for each band, but they still had to train and deploy several networks. Song et al. [10] combined QPs as an input and fed them into the CNN training stage by simply padding the scalar QPs into a matrix with the same size as the input frames or patches. However, these QP-combined models are inferior to QP-separate CNN models in terms of rate-distortion (RD) performance, and their flexibility and scalability are limited.

Figure 1: Overview of the QP attention module (QPAM). ⊗ denotes element-wise product.

In this paper, a novel QP variable convolutional neural network based in-loop filter is proposed for VVC intra frames. Specifically, considering the different compression noise levels for different QPs, a QP attention module (QPAM) is developed which assigns different weights to each channel of the input feature map according to the QP value. Compared with other methods, the proposed QPAM has wide applicability and stronger scalability, and it can also be applied to adapt to different frame types. QPAM is then embedded into the residual block, and based on it, we design a network architecture that not only fully utilizes residual features but also has controllability for different QPs. To further make the proposed model pay more attention to examples that have more compression artifacts or are hard to restore, a focal mean square error (MSE) loss function is employed to fine-tune the network. Experimental results verify the efficiency of the proposed QPAM and network architecture, which also outperforms other methods.
2. The Proposed QPALF Method
QPAM. Inspired by the channel attention module in [11], we propose a QPAM to avoid training and deploying multiple networks. Unlike the channel attention module, which extracts the attention map from the input feature map, our QP attention module is controlled by the QP value. An overview of the proposed QPAM is illustrated in Figure 1. Given a feature map $F \in \mathbb{R}^{H \times W \times C}$ as input, QPAM infers a 1D QP attention map $M \in \mathbb{R}^{1 \times 1 \times C}$. The attention process can be summarized as:

$$F' = M \otimes F \qquad (1)$$

where $\otimes$ denotes element-wise multiplication. During multiplication, the attention values are broadcast along the spatial dimensions, and $F'$ is the refined output. The QP attention map is generated as follows. Given a QP value $q \in \Omega = [a, b]$, since $q$ is an integer, we first map $q$ to a vector $v_\Omega(q) \in \mathbb{R}^{m \times 1}$ by one-hot encoding, where $m = |\Omega|$. The QP attention map $M$ is calculated by:

$$M' = \sigma(U\, v_\Omega(q)), \qquad M = \mathrm{reshape}(M') \qquad (2)$$

where $U \in \mathbb{R}^{C \times m}$ is a weight matrix and $\sigma(x) = \log(1 + e^x)$. From this process we can see that QPAM assigns different weights to each channel of the input feature map according to the QP value, so that the module can capture compression noise levels among different QPs. Meanwhile, the module can also emphasize meaningful features along the channel axis. Moreover, compared with other methods, the proposed QPAM has stronger scalability and can easily be applied to other discrete variables, e.g., frame types.

Figure 2: Overview of the architecture of our QPALF network.
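To make Eqs. (1)–(2) concrete, here is a minimal PyTorch sketch of QPAM; the class layout and parameter names are our own, with a bias-free linear layer playing the role of the weight matrix $U$ and softplus as $\sigma$:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class QPAM(nn.Module):
    """Sketch of the QP attention module: one-hot encode the QP,
    project it with a learned matrix U, apply softplus, and rescale
    the feature channels with the resulting 1x1xC attention map."""

    def __init__(self, channels: int, qp_min: int = 22, qp_max: int = 37):
        super().__init__()
        self.qp_min = qp_min
        self.m = qp_max - qp_min + 1                          # m = |Omega|
        self.proj = nn.Linear(self.m, channels, bias=False)   # U in R^{C x m}

    def forward(self, feat: torch.Tensor, qp: torch.Tensor) -> torch.Tensor:
        # feat: (B, C, H, W); qp: (B,) integer tensor of QP values
        one_hot = F.one_hot(qp - self.qp_min, num_classes=self.m).float()
        attn = F.softplus(self.proj(one_hot))       # sigma(U v(q)), shape (B, C)
        attn = attn.view(feat.size(0), -1, 1, 1)    # reshape to (B, C, 1, 1)
        return feat * attn                          # Eq. (1), spatial broadcast
```

For example, `QPAM(64)(torch.randn(1, 64, 32, 32), torch.tensor([37]))` returns a refined feature map of the same shape. Because the attention map depends only on a one-hot code, the same mechanism extends directly to other discrete variables such as frame types.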
Figure 3: Detail of the residual block with QPAM and the residual feature aggregation module. Left: (a) residual feature aggregation block; RB denotes residual block. Right: (b) residual block with QPAM.
Architecture. In Figure 2, we present an overview of the architecture of our QPALF network. This is one of the popular architectures used by many other methods [8, 7, 9], which usually consists of three parts: the head part, the backbone part, and the reconstruction part. The head part is responsible for initial feature extraction, with only one convolutional layer followed by an activation function. Given a compressed input $X$, we obtain a shallow feature $F_0$ through this layer:

$$F_0 = H_0(X) \qquad (3)$$

The backbone part is the key component of the network and is also the most distinct part among various networks. Here it is made up of $D$ cascaded residual feature aggregation (RFA) modules. The backbone receives the feature $F_0$ as input and sends the extracted global feature $F_G$ to the reconstruction part, which can be formulated as:

$$F_D = R_D(F_{D-1}) = R_D(R_{D-1}(\ldots(R_1(F_0))\ldots)) \qquad (4)$$

$$F_G = F_0 + R_c([F_1, \ldots, F_d, \ldots, F_D]) \qquad (5)$$

where $R_d$ denotes the $d$-th RFA module function, $F_{d-1}$ is the input of the $d$-th RFA module, and $F_d$ is the corresponding output. The output features of the $D$ RFAs are concatenated and fused by $R_c$, and a long skip connection is used to obtain the global feature $F_G$. Finally, the global feature $F_G$ is transformed through the reconstruction part:

$$\hat{Y} = X + H(F_G) \qquad (6)$$

where $\hat{Y}$ is the output and $H$ is the reconstruction function, which consists of only one convolutional layer. Global residual learning is used to ease the training difficulty in the reconstruction part.

Inspired by [12], we propose an RFA module to make better use of the local residual features. Figure 3(a) illustrates the detail of the RFA, which contains three residual blocks and one convolutional layer. The input is processed by the three residual blocks at three different levels; their outputs are then concatenated and fused by a 1 × 1 convolutional layer. We set $D = 6$ here.
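As an illustration of Eqs. (3)–(6) and of Figure 3, the following sketch assembles the QPALF network in PyTorch. The exact kernel sizes, the channel width of 64, and the internal layout of the residual block are our assumptions, and `QPAM` refers to the sketch above:

```python
import torch
import torch.nn as nn


class ResBlockQPAM(nn.Module):
    """Residual block with an embedded QPAM (assumed layout:
    conv3x3 -> PReLU -> conv3x3 -> QPAM, plus an identity skip)."""

    def __init__(self, ch: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.PReLU(),
            nn.Conv2d(ch, ch, 3, padding=1))
        self.qpam = QPAM(ch)

    def forward(self, x, qp):
        return x + self.qpam(self.body(x), qp)


class RFA(nn.Module):
    """Residual feature aggregation: three residual blocks whose
    outputs are concatenated and fused by a 1x1 convolution."""

    def __init__(self, ch: int):
        super().__init__()
        self.blocks = nn.ModuleList(ResBlockQPAM(ch) for _ in range(3))
        self.fuse = nn.Conv2d(3 * ch, ch, 1)

    def forward(self, x, qp):
        outs, h = [], x
        for blk in self.blocks:
            h = blk(h, qp)
            outs.append(h)
        return self.fuse(torch.cat(outs, dim=1))


class QPALF(nn.Module):
    """Head -> D cascaded RFA modules -> concat + 1x1 fusion with a
    long skip -> one reconstruction conv with a global residual."""

    def __init__(self, ch: int = 64, depth: int = 6):
        super().__init__()
        self.head = nn.Sequential(nn.Conv2d(1, ch, 3, padding=1), nn.PReLU())
        self.rfas = nn.ModuleList(RFA(ch) for _ in range(depth))
        self.fuse = nn.Conv2d(depth * ch, ch, 1)
        self.recon = nn.Conv2d(ch, 1, 3, padding=1)

    def forward(self, x, qp):
        f0 = self.head(x)                          # Eq. (3)
        feats, h = [], f0
        for rfa in self.rfas:                      # Eq. (4)
            h = rfa(h, qp)
            feats.append(h)
        fg = f0 + self.fuse(torch.cat(feats, 1))   # Eq. (5), long skip
        return x + self.recon(fg)                  # Eq. (6), global residual
```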
Dataset. For network training, we build a dataset using DIV2K [15], which contains 800 high-resolution images. First, we convert these images to the YUV 4:2:0 color format and encode them with VTM6.0 [16] under the all-intra (AI) configuration at four QPs: 22, 27, 32, and 37. The built-in in-loop filters are all enabled when compressing these images. The compressed images are then divided into two non-overlapping sets: training (700 images) and validation (100 images). To further expand the training data, we split the reconstructions into small patches of 64 × 64 with stride 16, and we remove patches whose PSNR is more than 50.0 or less than 20.0. When training QP variable models, the four training datasets are mixed in random order.
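A minimal sketch of the patch extraction and PSNR-based filtering described above, assuming 8-bit luma planes stored as NumPy arrays (function and argument names are ours):

```python
import numpy as np


def extract_patches(rec: np.ndarray, ref: np.ndarray,
                    size: int = 64, stride: int = 16,
                    psnr_lo: float = 20.0, psnr_hi: float = 50.0):
    """Split a reconstructed luma plane into size x size patches with
    the given stride, keeping only patches whose PSNR against the
    pristine reference lies within [psnr_lo, psnr_hi]."""
    patches = []
    h, w = rec.shape
    for y in range(0, h - size + 1, stride):
        for x in range(0, w - size + 1, stride):
            p_rec = rec[y:y + size, x:x + size].astype(np.float64)
            p_ref = ref[y:y + size, x:x + size].astype(np.float64)
            mse = np.mean((p_rec - p_ref) ** 2)
            psnr = 10.0 * np.log10(255.0 ** 2 / max(mse, 1e-10))
            if psnr_lo <= psnr <= psnr_hi:
                patches.append((p_rec, p_ref))
    return patches
```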
Loss Function. Let $X$ be the input and $\theta$ the set of network parameters to be optimized. Our goal is to learn an end-to-end function $F$ for generating a higher quality reconstruction $\hat{Y} = F(X; \theta)$ that is close to the ground truth $Y$. The loss function is the MSE between $\hat{Y}$ and $Y$:

$$L_{rec} = \frac{1}{N} \sum_{i=1}^{N} \| \hat{Y}^{(i)} - Y^{(i)} \|^2 \qquad (7)$$

where $N$ is the number of training samples in each batch. In order to train a more robust QP-combined network, we analyze the restoration ability of the network for different QPs. First, we train a QPALF network using the mixed dataset (the training details are presented below). Then we plot the cumulative proportion of validation images over the PSNR gain rate for different QPs, where the PSNR gain rate is defined as:

$$R = 1 - \frac{L_{rec}}{L_{init}} \qquad (8)$$

where $L_{init}$ is the MSE between $X$ and $Y$.

Figure 4: Cumulative PSNR gain rate distribution of QPALF on the validation image dataset.

From Figure 4, we can find: 1) The network has a lower PSNR gain rate overall on data with smaller QP, especially at QP=22. Obviously, a smaller QP means fewer compression artifacts, and we do not expect the network to pay much attention to data with few compression artifacts. 2) The PSNR gain rate of 80% of the validation data is less than 10%; that is, validation data with a low PSNR gain rate accounts for a large proportion. So we expect the network to focus more on data that has a low PSNR gain rate. To this end, we propose a focal MSE loss function to fine-tune the proposed network, calculated by:

$$L = \alpha_q (1 - R)^{\gamma} L_{rec} = \alpha_q \frac{L_{rec}^{\gamma+1}}{L_{init}^{\gamma}} \qquad (9)$$

where $\alpha_q$ is a weighting factor over the QP value and $\gamma$ is a focusing parameter.
Herein $\alpha_q$ is set separately for each of the four QPs (0.35 for QP=37) and $\gamma = 1$. Table 1 shows the coding performance of three networks over the test sequences, where QPALF-I, QPALF-II, and QPALF-III denote QPALF without fine-tuning, QPALF fine-tuned by MSE, and QPALF fine-tuned by focal MSE, respectively.

Table 1: The coding performance (BD-Rate, %) of three QPALF networks

Class     QPALF-I   QPALF-II   QPALF-III
A1        -1.54     -1.73      -2.02
A2        -1.98     -2.18      -2.29
B         -3.14     -3.30      -3.32
C         -4.49     -4.61      -4.73
D         -5.48     -5.60      -5.70
E         -5.43     -5.66      -5.84
Average   -3.75     -3.91      -4.03

As we can see, QPALF-III achieves more bit-rate saving than QPALF-II, which demonstrates the effectiveness of focal MSE.
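As an illustration, a direct PyTorch rendering of Eqs. (7)–(9) could look as follows; the function name is ours and the small clamp guarding against division by zero is our addition:

```python
import torch


def focal_mse_loss(pred, target, inp, alpha_q: float, gamma: float = 1.0):
    """Focal MSE of Eq. (9): L_rec is the MSE between the network output
    and the ground truth (Eq. (7)), L_init the MSE between the compressed
    input and the ground truth, and R = 1 - L_rec / L_init the PSNR gain
    rate (Eq. (8)). Easy examples (high R) are down-weighted by
    (1 - R) ** gamma; alpha_q is a per-QP weighting factor."""
    l_rec = torch.mean((pred - target) ** 2)
    l_init = torch.mean((inp - target) ** 2).clamp_min(1e-12)
    r = 1.0 - l_rec / l_init
    return alpha_q * (1.0 - r).pow(gamma) * l_rec
```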
Training Details. The widely adopted deep learning framework PyTorch [17] is used to train our models. We use Adam [18] optimization with a batch size of 64. The learning rate is halved every 25 epochs, and training takes 100 epochs in total. For QP-separate models, we first train the model for QP=37 and then use it to initialize the parameters of the networks for smaller QPs; the initial learning rate is 10− for QP=37 and 10− for the other QPs. For QP-combined models, the initial learning rate is 10−, and the fine-tuning process takes 50 epochs with a learning rate of 10−. All models are trained on NVIDIA Titan X (Pascal) GPUs.

Implementation. We integrate QPALF into VVC as an additional in-loop filtering tool between DBF and SAO. To obtain better coding performance, a frame-level flag is signaled in the bitstream to indicate to the decoder whether QPALF is enabled for the frame. When the reduction in RD cost is greater than 0, the flag is enabled and QPALF is applied to the luma component of the frame.
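The paper does not spell out the encoder-side decision logic, but schematically the frame-level flag could be decided as in this hypothetical sketch, with the RD cost $J = D + \lambda R$ and SSD as the distortion measure:

```python
import numpy as np


def decide_qpalf_flag(rec_frame, filtered_frame, orig_frame,
                      lam: float, flag_bits: float = 1.0) -> bool:
    """Enable the frame-level QPALF flag only when it reduces the RD
    cost. Distortion is the SSD against the original frame; the rate
    term here accounts only for the signaled flag bit."""
    def ssd(a, b):
        return float(np.sum((a.astype(np.float64) - b.astype(np.float64)) ** 2))

    j_off = ssd(rec_frame, orig_frame)                        # QPALF disabled
    j_on = ssd(filtered_frame, orig_frame) + lam * flag_bits  # QPALF enabled
    return j_on < j_off   # True -> write flag = 1, filter the luma plane
```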
3. Experiment
Experimental Setting
In our experiments, all in-loop filtering approaches are incorporated into the VVC reference software VTM6.0, and the LibTorch [17] library is integrated into VTM6.0 to perform in-loop filtering with the different models. Four typical QP values are tested: 22, 27, 32, and 37. We use the AI configuration suggested by the VVC common test conditions (CTC) [19]. The anchor for all experiments is VTM6.0 with all built-in in-loop filters enabled. Coding efficiency is evaluated on the standard video sequences from class A1 to class E recommended by JVET, and BD-Rate [20] is used to measure coding performance. We only train and apply the models on the Y channel, but our approach can be extended to an arbitrary number of channels.
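For reference, a compact sketch of the BD-Rate metric of [20]: fit a cubic polynomial to each RD curve in the (PSNR, log-rate) plane and integrate the horizontal gap over the overlapping PSNR range (implementations differ in fitting details; this plain cubic fit is one common variant):

```python
import numpy as np


def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test) -> float:
    """Bjontegaard delta rate: average bitrate change of the test codec
    against the anchor, in percent (negative means bitrate savings).
    Each argument is a sequence of four RD points, one per QP."""
    lr_a, lr_t = np.log(rate_anchor), np.log(rate_test)
    p_a = np.polyfit(psnr_anchor, lr_a, 3)      # log-rate as a cubic in PSNR
    p_t = np.polyfit(psnr_test, lr_t, 3)
    lo = max(min(psnr_anchor), min(psnr_test))  # overlapping PSNR interval
    hi = min(max(psnr_anchor), max(psnr_test))
    int_a = np.polyval(np.polyint(p_a), hi) - np.polyval(np.polyint(p_a), lo)
    int_t = np.polyval(np.polyint(p_t), hi) - np.polyval(np.polyint(p_t), lo)
    avg_log_diff = (int_t - int_a) / (hi - lo)
    return (np.exp(avg_log_diff) - 1.0) * 100.0
```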
Table 2: The BD-Rate (%) of different models on the Y channel under the AI configuration

Class  Sequence          RHCNN   DRNLF   QPMLF   QPALF-S   QPALF
A1     Tango2            -0.62   -0.63   -0.78   -0.62     -1.86
       Campfire          -0.82   -1.32   -0.79   -1.42     -2.01
       FoodMarket4       -0.74   -0.09   -0.29   -0.89     -2.20
A2     CatRobot          -1.10   -2.20   -1.97   -2.28     -3.39
       DaylightRoad2     -0.43    0.07   -0.17    1.02     -0.50
       ParkRunning3      -1.01   -1.96   -1.56   -2.04     -2.99
B      RitualDance       -1.88   -4.29   -4.03   -4.85     -6.32
       MarketPlace       -1.34   -2.33   -2.09   -2.64     -3.58
       BasketballDrive   -0.73   -1.63   -1.05   -1.82     -2.84
       BQTerrace         -0.64   -1.38   -1.02   -1.56     -2.06
       Cactus            -0.84   -2.07   -1.71   -1.69     -1.78
C      BasketballDrill   -2.29   -5.43   -4.50   -5.76     -7.48
       BQMall            -1.93   -4.31   -3.73   -4.58     -5.49
       PartyScene        -1.22   -3.01   -2.57   -3.19     -3.62
       RaceHorsesC       -0.81   -1.75   -1.39   -1.81     -2.31
D      BasketballPass    -2.10   -5.24   -4.42   -5.67     -6.76
       BlowingBubbles    -1.59   -3.64   -3.19   -3.87     -4.45
       BQSquare          -1.93   -5.12   -4.40   -5.28     -6.20
       RaceHorses        -2.03   -4.54   -4.21   -4.67     -5.40
E      FourPeople        -2.05   -4.73   -4.08   -5.09     -6.49
       Johnny            -1.63   -3.90   -3.17   -4.12     -5.72
       KristenAndSara    -1.59   -3.95   -3.31   -4.24     -5.31
Average All              -1.54   -2.88   -2.47   -3.05     -4.03
Evaluation on VVC Test Sequences
RD Performance. First, we compare our QPALF with the VTM baseline and two CNN based in-loop filters, RHCNN [8] and DRNLF [9]. For a fair comparison, we also train these models on our dataset and integrate them into VTM6.0 between DBF and SAO. The results are displayed in Table 2. Our QPALF clearly improves the coding efficiency, obtaining 4.03% bit-rate saving overall for the luma component on all the test sequences. To further verify the efficiency of the proposed model, we also compare our QPALF with QPMLF and QPALF-S. QPALF-S is the QPALF model trained separately on the four QPs; QPMLF is QPALF without QPAM, using the QP map method [10] instead. As we can see, compared with the QP-separate model QPALF-S, the performance of the QP-combined model QPMLF degrades, while our model QPALF achieves even better coding performance. Moreover, the PSNR gain of the three models over multiple QPs is depicted in Figure 5, where our QPAM obtains the highest PSNR gain over all QPs, demonstrating the generalization ability and robustness of the proposed method. (Since the models are trained on only four QPs, {22, 27, 32, 37}, we first map other QP values to these four QPs.)

Figure 5: PSNR gain of three methods on multiple QPs. (a) FourPeople; (b) BQMall.

Subjective Evaluation. Figure 6 illustrates a subjective visual quality comparison among the four approaches. It can be observed that the images enhanced by our approach retain less distortion than those produced by the other approaches, e.g., the clearer edge of the basketball net line. In Figure 7, we display the residual maps of the three methods over the VTM baseline. Compared with RHCNN and DRNLF, our method restores more image texture details.
Complexity. Table 3 shows the average encoding/decoding complexity increase and the parameter counts of different models on an Intel(R) Xeon(R) CPU E5-2697 v4 and a Titan X (Pascal). All of the neural networks run with GPU acceleration. The complexity increase is calculated by $\Delta T = (\hat{T} - T)/T$, where $\hat{T}$ is the encoding/decoding time with the models integrated and $T$ is the original encoding/decoding time. Our proposed QPALF has fewer model parameters overall. Compared with QPMLF, our model achieves much better RD performance with little complexity increase.

Table 3: Average complexity increase (ΔET, ΔDT) and parameters of different models (for RHCNN: ΔET 5.43%, ΔDT 10695.9%).
4. Conclusion and Future Work
In this paper, an efficient QP variable CNN based in-loop filter for VVC is proposed. With the proposed QPAM, QPALF can adapt to different QPs while achieving better RD performance, and a focal MSE loss is introduced to train a more robust model. Experimental results demonstrate that our QPALF significantly improves coding efficiency and outperforms other CNN based methods. Moreover, the proposed QPAM has wide applicability and stronger scalability: it can easily be implemented in other networks and extended to other discrete variables. In our future work, we will extend our model to inter mode and speed up QPALF.

Figure 6: Subjective image quality comparison on sequences BasketballDrill (Class C) and RaceHorses (Class D) at QP = 37, reporting (PSNR, SSIM) against the ground truth: VTM (26.16, 0.9045) and (28.15, 0.8643); RHCNN (26.29, 0.9073) and (28.24, 0.8666); DRNLF (26.56, 0.9104) and (28.32, 0.8682); Ours.

Figure 7: Residual map of different methods over the VTM baseline: (a) VTM; (b) RHCNN; (c) DRNLF; (d) Ours.
Acknowledgment
This work was supported by the National Natural Science Foundation of China under contract No. 61671025.
References

[1] S. Liu, B. Bross, and J. Chen, "Versatile Video Coding (Draft 10)," JVET-S2001, Joint Video Experts Team (JVET), Jul. 2020.
[2] T. Lu, F. Pu, P. Yin, S. McCarthy, W. Husak, T. Chen, E. Francois, C. Chevance, F. Hiron, J. Chen, R. Liao, Y. Ye, and J. Luo, "Luma Mapping with Chroma Scaling in Versatile Video Coding," 2020, pp. 193–202.
[3] A. Norkin, G. Bjontegaard, A. Fuldseth, M. Narroschke, M. Ikeda, K. Andersson, M. Zhou, and G. Van der Auwera, "HEVC Deblocking Filter," IEEE Transactions on Circuits and Systems for Video Technology, vol. 22, no. 12, pp. 1746–1754, 2012.
[4] C. Fu, C. Chen, Y. Huang, and S. Lei, "Sample Adaptive Offset for HEVC," 2011, pp. 1–5.
[5] C. Tsai, C. Chen, T. Yamakage, I. S. Chong, Y. Huang, C. Fu, T. Itoh, T. Watanabe, T. Chujoh, M. Karczewicz, and S. Lei, "Adaptive Loop Filtering for Video Coding," IEEE Journal of Selected Topics in Signal Processing, vol. 7, no. 6, pp. 934–945, 2013.
[6] G. J. Sullivan, J. Ohm, W. Han, and T. Wiegand, "Overview of the High Efficiency Video Coding (HEVC) Standard," IEEE Transactions on Circuits and Systems for Video Technology, vol. 22, no. 12, pp. 1649–1668, 2012.
[7] S. Zhang, Z. Fan, N. Ling, and M. Jiang, "Recursive Residual Convolutional Neural Network-Based In-Loop Filtering for Intra Frames," IEEE Transactions on Circuits and Systems for Video Technology, vol. 30, no. 7, pp. 1888–1900, 2020.
[8] Y. Zhang, T. Shen, X. Ji, Y. Zhang, R. Xiong, and Q. Dai, "Residual Highway Convolutional Neural Networks for In-loop Filtering in HEVC," IEEE Transactions on Image Processing, vol. 27, no. 8, pp. 3827–3841, 2018.
[9] Y. Wang, Z. Chen, Y. Li, L. Zhao, S. Liu, and X. Li, "Test Results of Dense Residual Convolutional Neural Network Based In-Loop Filter," document JVET-M0508, Marrakech, Morocco, Jan. 2019.
[10] X. Song, J. Yao, L. Zhou, L. Wang, X. Wu, D. Xie, and S. Pu, "A Practical Convolutional Neural Network as Loop Filter for Intra Frame," 2018, pp. 1133–1137.
[11] S. Woo, J. Park, J. Lee, and I. Kweon, "CBAM: Convolutional Block Attention Module," in Computer Vision – ECCV 2018, Cham, 2018, pp. 3–19.
[12] J. Liu, W. Zhang, Y. Tang, J. Tang, and G. Wu, "Residual Feature Aggregation Network for Image Super-Resolution," in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 2356–2365.
[13] K. He, X. Zhang, S. Ren, and J. Sun, "Deep Residual Learning for Image Recognition," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
[14] K. He, X. Zhang, S. Ren, and J. Sun, "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification," in IEEE International Conference on Computer Vision (ICCV), 2015, pp. 1026–1034.
[15] E. Agustsson and R. Timofte, "NTIRE 2017 Challenge on Single Image Super-Resolution: Dataset and Study," in IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), July 2017, pp. 1122–1131.
[16] "Versatile Video Coding Test Model (VTM), 6.0.1," 2019.
[17] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, and Z. Lin, "PyTorch: An Imperative Style, High-Performance Deep Learning Library," in Advances in Neural Information Processing Systems, 2019, pp. 8024–8035.
[18] D. P. Kingma and J. Ba, "Adam: A Method for Stochastic Optimization," arXiv e-prints, arXiv:1412.6980, Dec. 2014.
[19] K. Suehring and X. Li, "JVET Common Test Conditions and Software Reference Configurations," Document JVET-H1010, 8th JVET Meeting, Oct. 2017.
[20] G. Bjøntegaard, "Calculation of Average PSNR Differences between RD-Curves," Document VCEG-M33, Apr. 2001.