A Convolutional Neural Network-Based Low Complexity Filter
Chao Liu, Student Member, IEEE, Heming Sun, Member, IEEE, Jiro Katto, Member, IEEE, Xiaoyang Zeng, Member, IEEE, and Yibo Fan
Abstract—Convolutional Neural Network (CNN)-based filters have achieved significant performance in video artifact reduction. However, the high complexity of existing methods makes them difficult to deploy in real-world usage. In this paper, a CNN-based low complexity filter is proposed. We utilize depthwise separable convolution (DSC) merged with batch normalization (BN) as the backbone of the proposed CNN-based network. Besides, a weight initialization method is proposed to enhance the training performance. To solve the well-known over-smoothing problem for inter frames, a frame-level residual mapping (RM) is presented. We quantitatively analyze mainstream frame-level and block-level based filters and build our CNN-based filter with frame-level control to avoid the extra complexity and artificial boundaries caused by block-level control. In addition, the novel RM module is designed to restore the distortion from the learned residuals. As a result, we can effectively improve the generalization ability of the learning-based filter and reach an adaptive filtering effect. Moreover, this module is flexible and can be combined with other learning-based filters. The experimental results show that the proposed method achieves a significant BD-rate reduction over H.265/HEVC. It achieves about 1.2% more BD-rate reduction and a 79.1% decrease in FLOPs compared with VR-CNN. Finally, measurements on H.266/VVC and ablation studies are also conducted to confirm the effectiveness of the proposed method.
Index Terms—In-loop filter, HEVC, convolutional neural network, VTM.
This work was supported in part by the National Natural Science Foundation of China under Grant 61674041, in part by Alibaba Group through the Alibaba Innovative Research (AIR) Program, in part by the STCSM under Grant 16XD1400300, in part by the pioneering project of academy for engineering and technology and the Fudan-CIOMP joint fund, in part by the National Natural Science Foundation of China under Grant 61525401, in part by the Program of Shanghai Academic/Technology Research Leader under Grant 16XD1400300, in part by the Innovation Program of Shanghai Municipal Education Commission, and in part by JST, PRESTO Grant Number JPMJPR19M5, Japan. (Corresponding authors: Heming Sun and Yibo Fan.)
C. Liu, Y. Fan and X. Zeng are with the State Key Laboratory of ASIC and System, Fudan University, Shanghai 200433, China (e-mail: [email protected]; [email protected]; [email protected]).
H. Sun is with the Waseda Research Institute for Science and Engineering, Tokyo 169-8555, Japan, and JST, PRESTO, 4-1-8 Honcho, Kawaguchi, Saitama 332-0012, Japan (e-mail: [email protected]).
J. Katto is with the Waseda Research Institute for Science and Engineering, Tokyo 169-8555, Japan, and the Graduate School of Fundamental Science and Engineering, Waseda University, Tokyo 169-8555, Japan (e-mail: [email protected]).

I. INTRODUCTION

THE performance of video compression has been continuously improved with the development from H.264/AVC [1] and H.265/HEVC [2] to H.266/VVC [3]. These standards share a similar hybrid video coding framework, which adopts prediction [4], [5], transformation [6], quantization [7], and context adaptive binary arithmetic coding (CABAC) [8]. Owing to modules like quantization and flexible partitioning, some unavoidable artifacts are produced and degrade video quality, such as the blocking effect, the Gibbs effect, and ringing. To compensate for these artifacts, many advanced filtering tools have been designed, for instance, de-blocking (DB [9]), sample adaptive offset (SAO [10]), and the adaptive loop filter (ALF [11]). These tools reduce the artifacts effectively with acceptable complexity.

In the past decades, learning-based methods have made great progress in both low-level and high-level computer vision tasks [12]–[17], such as object detection [12], [13], semantic image segmentation [14], [15], and super resolution [16], [17]. By virtue of their powerful non-linear capability, learning-based tools have also been utilized to replace existing modules in video coding and show great potential, for instance, in intra prediction [18]–[20], inter prediction [21], [22], and entropy coding [23], [24]. Learning-based models, especially CNNs, have achieved excellent performance for the in-loop filter of video coding [25]–[33]. Dai et al. [26], [27] proposed VR-CNN, which adopts a variable filter size technique to obtain different receptive fields within one layer and achieves excellent performance with relatively low complexity. Zhang et al. [30] proposed the 13-layer RHCNN for both intra and inter frames; this relatively deep network has a strong mapping capability to learn the difference between the original and the reconstructed inter frames. To further adapt to the image content, Jia et al. [33] designed a multi-model filtering mechanism and proposed a content-aware CNN with a discriminative network. This method uses the discriminative network to select the most suitable deep learning model for each region.

Most learning-based filters can achieve considerable BD-rate [34] savings over the H.265/HEVC anchor. However, real-world applications often require lightweight models. High memory usage and computing resource consumption make it difficult to apply complex models to various hardware platforms. Therefore, designing a light network is essential to popularize learning-based in-loop filters.
Considering this, model compression methods that reduce the model complexity while maintaining performance are needed. In recent years, several well-known methods have been proposed, including lightweight layers [35], [36], knowledge transfer [37]–[39], low-bit quantization [40], [41], and network pruning [42], [43]. DSC [35], [36] is one of the most popular lightweight layers. It preserves the essential features of standard convolution while greatly reducing the complexity by using grouped convolution [35]. In this paper, we build our learning-based filter with DSC instead of standard convolution, and knowledge transfer is used to help the initialization of the trainable parameters without increasing the complexity.

Besides the learning-based filter itself, we also need a lightweight mechanism for the filtering of inter frames. Some inter blocks fully inherit the texture from their reference blocks and have almost no residuals. If the learning-based filter is applied to every frame, those blocks will be repeatedly filtered, which causes over-smoothing in inter blocks [33], [44]. One solution to this problem is training a specific filter for inter frames [30]. However, the coding of intra and inter frames shares some of the same modules in H.265/HEVC, like transformation, quantization, and block partitioning. This means a learning-based filter trained with intra frames can also be used for inter frames to some extent. Considering this, previous works [27], [33], [44]–[47] designed a syntax element control flag to indicate whether an inter CTU uses the learning-based filter or not, i.e., a selective filtering strategy for each CTU. We compare this strategy with frame-level control in Section IV-A and find that CTU-level control may lead to artificial boundaries between neighboring CTUs. We therefore propose to use a frame-level based filter to avoid unnecessary artificial boundaries. To improve the performance of frame-level based filtering, we propose a novel module called residual mapping (RM) in this paper.

In summary, we propose a novel light CNN-based in-loop filter for both intra and inter frames based on [48], [49]. Experimental results show that this model achieves excellent performance in terms of both video quality and complexity. Specifically, our contributions are as follows.

• A CNN-based lightweight in-loop filter is designed for H.265/HEVC. Low-complexity DSC merged with BN is used as the backbone of this model. Besides, we use attention transfer to pre-train it to help the initialization of parameters.

• For the filtering of inter frames, we analyze the control strategies and build our CNN filter at the frame level to avoid the artificial boundaries caused by CTU-level control. Besides, a novel post-processing module, RM, is proposed to improve the generalization ability of the frame-level based model and enhance both the subjective and objective quality.

• We integrate the proposed method into the HEVC and VVC reference software, where it achieves significant performance. Besides, we conduct extensive experiments, including ablation studies, to prove the effectiveness of the proposed methods.

The rest of this paper is organized as follows. Section II presents the related works, including the in-loop filter in video coding and lightweight network design. Section III elaborates on the proposed network, including the network structure and its loss functions. Section IV focuses on the proposed RM module and provides an analysis of different control strategies.
Experimental results and ablation studies are shown in Section V. In Section VI, we conclude this paper with future work.

II. RELATED WORKS
A. In-loop Filters in Video Coding

1) DB, SAO, and ALF:
DB, SAO, and ALF, which are adopted in the latest video coding standard H.266/VVC [3], aim at removing the artifacts in video coding. De-blocking [9] has been used to reduce the discontinuity at block boundaries since the publication of the coding standard H.263+ [50]. Depending on the boundary strength and the reconstructed average luminance level, DB chooses different coding parameters to filter the distorted boundaries. Meanwhile, by classifying the reconstructed samples into various categories, SAO [10] gives each category a different offset to compensate for the error between the reconstructed and original pixels. Based on the Wiener filter, ALF [11] tries different filter coefficients by minimizing the square error between the original and reconstructed pixels; the filter coefficients need to be signaled to the decoder side to ensure consistency between encoder and decoder. All these filters can effectively alleviate the various artifacts in reconstructed images. However, there is still much room for improvement.
2) Learning-based Filter:
Recently, learning-based filters have far outperformed DB, SAO, and ALF in terms of both objective and subjective quality. Different from SAO and ALF, they hardly need extra bits but can still compensate for errors adaptively. Most of them are based on CNNs and have achieved great success in this field. For the filtering of intra frames, Park et al. [25] first proposed a CNN-based in-loop filter, IFCNN, for video coding. Dai et al. [26] proposed VR-CNN as a post-processing step to replace DB and SAO in HEVC. Based on inception, Liu et al. [28] proposed a CNN-based filter with 475,233 trainable parameters. Meanwhile, Kang et al. [29] proposed a multi-modal/multi-scale neural network with up to 2,298,160 parameters. Considering the coding unit (CU) size information, He et al. [31] proposed a partition-masked CNN with a dozen residual blocks. Sun et al. [48] proposed a learning-based filter with ResNet [51] for the VTM. Liu et al. [49] proposed a lightweight learning-based filter based on DSC. Apart from the above, Zhang et al. [44] proposed a residual convolutional neural network with a recursive mechanism.

Different from training the filter on intra samples, training the filter with inter samples needs to consider the problem of repeated filtering [33], [47]. Jia et al. [33] proposed a content-aware CNN-based in-loop filtering method that applies multiple CNN models and a discriminative network in H.265/HEVC. The discriminative network judges the degree of distortion of the current block and selects the most appropriate filter for it. However, because the discriminative network requires additional complexity and memory usage, some researchers [27], [45] proposed to replace it with block-level syntax elements. This method requires extra bit consumption but obtains a more accurate judgment on whether to use the learning-based filter. Similarly, some researchers [25], [52] proposed to use frame-level syntax elements to control the filtering of inter frames. Besides, complicated models [30], [45], [53] like spatial-temporal networks are also useful for solving this problem. Jia et al. [45] proposed the spatial-temporal residue network (STResNet) with CTU-level control to suppress visual artifacts. RHCNN, which is trained for both intra and inter frames, was proposed by Zhang et al. [30]. Filtering on the decoder side [32], [54], [55] can also solve the problem of repeated enhancement well. For example, DS-CNN was designed by Yang et al. [32] to achieve quality enhancement as well. Li et al. [54] adopted a 20-layer deep CNN to improve the filtering performance. Zhang et al. [55] proposed a post-processing network for VTM 4.0.1.

In summary, filtering inter frames is more challenging than filtering intra frames. In most cases, a CNN-based in-loop filter with higher complexity can achieve better performance on intra frames. But for the filtering of inter frames, the existing methods have their own problems: frame-level control may lead to an over-smoothing problem; CTU-level control causes additional artificial boundaries; out-of-loop filters cannot use the filtered image as a reference; and adding a discriminative network or a complex model may be over-complex and impractical. Therefore, a more effective method for this task deserves attention.

Fig. 1. The depthwise separable convolution, where "Conv." indicates convolution: a depthwise convolution followed by a pointwise convolution.
B. Lightweight Network Design

1) Depthwise Separable Convolution:
As a novel neural network layer, DSC has achieved great success in practical applications because of its low complexity. It was initially introduced in [35] and subsequently used in MobileNets [36]. As shown in Fig. 1, DSC divides the calculation of a standard convolution into two parts: depthwise convolution and pointwise convolution. Different from standard convolution, depthwise convolution decomposes the calculation into grouped convolution to reduce the complexity. Meanwhile, the pointwise convolution is the same as a standard convolution with a 1 × 1 kernel. In other words, the depthwise convolution convolves the separate feature channels, whereas the pointwise convolution combines them to get the output feature maps. These two parts together form a complete DSC.
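As an illustration of this decomposition, the following is a minimal sketch using the Keras API (the framework of our experiments, see Table III). The channel count of 32 matches the F of the proposed model; everything else is illustrative rather than the paper's exact layer configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers

inp = layers.Input(shape=(None, None, 32))

# Depthwise stage: one 3x3 filter per input channel (grouped convolution),
# 3*3*32 = 288 weights. No bias here, since it can be merged into the
# pointwise bias (see Section III-A).
dw = layers.DepthwiseConv2D(kernel_size=3, padding="same", use_bias=False)(inp)

# Pointwise stage: a 1x1 standard convolution that mixes the 32 separate
# channel outputs, 32*32 = 1024 weights (+32 biases).
pw = layers.Conv2D(filters=32, kernel_size=1, padding="same")(dw)

# For comparison, a standard 3x3 convolution would need 3*3*32*32 = 9216
# weights. Keras also provides the fused equivalent as a single layer:
fused = layers.SeparableConv2D(filters=32, kernel_size=3, padding="same")(inp)
```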
2) Knowledge Distillation and Transfer:
Previous studies [37]–[39] have shown that the "knowledge" in pre-trained models can be transferred to another model. Hinton et al. [37] proposed a distillation method that uses a teacher model to generate a "soft target", which helps a student model with a similar structure perform better on a classification task. Besides softening the target in classification tasks, other methods [38], [39] use the intermediate representations of the pre-trained model to transfer the "knowledge". For example, Zagoruyko et al. [38] devised a method called attention transfer (AT) that improves the student model by letting it mimic the attention maps of a teacher model. Meanwhile, Huang et al. [39] designed a loss function that minimizes the maximum mean discrepancy (MMD) metric between the distributions of the teacher and the student model, where MMD is a distance metric for probability distributions [56].

III. PROPOSED CNN-BASED FILTER
A. Network Structure
As shown in Fig. 2, we design a network structure that is shared by the teacher and the proposed model. This structure is composed of convolutions, BN layers, and the ReLU activation [57]. The backbone of this structure is K layers of DSC with F feature maps each. The input to this structure is the HM reconstruction without filtering, and the output is the filtered reconstructed samples. The last part is a standard convolution with only one feature map, and we add the reconstructed samples to the output, inspired by residual learning [51]. The depthwise and the standard convolution kernels are both 3 × 3. Every convolution is followed by a ReLU except for the last one. The reason for choosing ReLU instead of other advanced activation functions is that ReLU has lower complexity while providing considerable nonlinearity. In our implementation, the values of K and F are 24 and 64 for the teacher model, and 9 and 32 for the proposed model. The parameters of the proposed model are described in Table I.

We use the BN layer in the training phase because it improves the back-propagation of the gradients. What's more, both BN and convolution are linear computations on the tensors in the proposed model. Therefore, the BN can be merged into the convolution to further reduce the computation during the inference phase. As shown in (1), the depthwise convolution output $\chi_{dwConv}$ can be formulated as

$\chi_{dwConv} = w_{dwConv} * \chi$   (1)

where $*$ indicates the convolution operation, $w_{dwConv}$ is the kernel, and $\chi$ is the depthwise convolution input. Similarly, the pointwise convolution output $\chi_{pwConv}$ can be written as

$\chi_{pwConv} = w_{pwConv} * \chi_{dwConv} + b_{pwConv}$   (2)

where $w_{pwConv}$ and $b_{pwConv}$ denote the kernel and bias. It is noticeable in (1) that the depthwise convolution has no bias; this is because the bias $b_{dwConv}$ can be merged into $b_{pwConv}$ when there is no activation between the depthwise and pointwise convolutions. After convolution, the output of BN is obtained by (3). (We use the $*$ operation here because, after simplification, the calculation of BN is equivalent to that of a depthwise convolution.)

$\chi_{bn} = \gamma * \left( \dfrac{\chi_{pwConv} - mean}{\sqrt{var + \epsilon}} \right) + \beta$   (3)

Substituting (2) into (3), we obtain (4) as follows:

$\chi_{bn} = \widehat{w}_{pwConv} * \chi_{dwConv} + \widehat{b}_{pwConv}$   (4)
where $\widehat{w}_{pwConv}$ and $\widehat{b}_{pwConv}$ in (4) are

$\widehat{w}_{pwConv} = \dfrac{\gamma * w_{pwConv}}{\sqrt{var + \epsilon}}$   (5)

$\widehat{b}_{pwConv} = \dfrac{\gamma * (b_{pwConv} - mean)}{\sqrt{var + \epsilon}} + \beta$   (6)

In (5) and (6), $\gamma$ and $\beta$ are the trainable parameters of BN, while $mean$ and $var$ are its non-trainable parameters. The hyper-parameter $\epsilon$ is a small positive number that prevents division-by-zero errors. In the inference phase, we use $\widehat{w}_{pwConv}$ and $\widehat{b}_{pwConv}$ to replace the weight $w_{pwConv}$ and bias $b_{pwConv}$ in the pointwise convolution, thus merging the BN into the DSC and reducing the model complexity.

Fig. 2. The architecture of the teacher model and the proposed model, where "Rec." indicates reconstructed pixels. The top-right and bottom-right are the teacher model and the proposed student model, respectively; both are stacks of depthwise separable convolution + (batch normalization) + ReLU blocks followed by a standard convolution and an element-wise addition of the input. The rectangle on the right implies the knowledge transfer: the attention loss between teacher and student attention maps guides the initialization of the parameters.

TABLE I
DESCRIPTION OF THE PARAMETERS IN THE PROPOSED MODEL

Index      | Block1         | Block2    | Block3    | Std Conv.^a | Sum
Parameters | 73 + 2 × 1,344 | 3 × 1,344 | 3 × 1,344 | 289         | 11,114

^a Standard convolution.
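A minimal NumPy sketch of the merge in Eqs. (5) and (6), assuming Keras-style weight shapes; the function and variable names are ours, not from the reference software.

```python
import numpy as np

def fold_bn_into_pointwise(w_pw, b_pw, gamma, beta, mean, var, eps=1e-3):
    """Fold a BatchNormalization layer into the preceding pointwise
    convolution, following Eqs. (5) and (6).

    w_pw : pointwise kernel, shape (1, 1, C_in, C_out)
    b_pw, gamma, beta, mean, var : per-output-channel vectors, shape (C_out,)
    """
    scale = gamma / np.sqrt(var + eps)     # per-channel BN scaling
    w_hat = w_pw * scale                   # Eq. (5): broadcasts over C_out
    b_hat = scale * (b_pw - mean) + beta   # Eq. (6)
    return w_hat, b_hat

# At inference, (w_hat, b_hat) simply replace (w_pw, b_pw) and the BN layer
# is removed, so the filtered output is unchanged but computed more cheaply.
```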
B. Standard Convolution of the Proposed Structure

In this subsection, the last part of the proposed structure is detailed, because the standard convolution needs fewer calculations than DSC when the number of convolution output channels is only one. It is worth noting that the last convolution of the proposed model is a standard convolution, which is not consistent with the backbone of the proposed model. The DSC consists of two steps: depthwise convolution and pointwise convolution. The depthwise convolution is a simplification of the standard convolution that reduces the amount of computation while preserving the ability to convolve the input feature maps. Meanwhile, the pointwise convolution is equivalent to a standard convolution with a 1 × 1 kernel; it is utilized to fuse the different depthwise convolution outputs. According to their computing methods, the ratio $r$ of the calculation of the DSC to that of the standard convolution is

$r = \dfrac{K_W K_H C_I W H + C_I C_O W H}{K_W K_H C_I C_O W H} = \dfrac{1}{C_O} + \dfrac{1}{K_W K_H}$   (7)

where $W$ and $H$ are the width and height of the input frame, $K_W$ and $K_H$ are the width and height of the convolution kernel, and $C_I$ and $C_O$ are the numbers of feature maps for the convolution input and output, respectively. In our proposed model, $C_O = 1$ and $K_W = K_H = 3$, so $r = \frac{1}{C_O} + \frac{1}{K_W K_H} = \frac{10}{9}$, which is bigger than 1. This means a DSC would consume more computing resources than a standard convolution here. The extra calculation is caused by the pointwise convolution, which is utilized to combine feature maps. However, the standard convolution can also combine features, which means the extra calculation of the pointwise convolution is unnecessary. Therefore, we choose a standard convolution at the end of the model to avoid meaningless calculations.
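Putting Sections III-A and III-B together, a sketch of the proposed student model in Keras could look as follows. The exact block boundaries, BN placement, and training details are simplified assumptions based on the description above, not a verbatim copy of the implementation.

```python
from tensorflow.keras import layers, models

def build_student(num_dsc_layers=9, feature_maps=32):
    """A sketch of the student filter: K = 9 DSC layers with F = 32 feature
    maps, a final 3x3 standard convolution back to one channel (cheaper
    than a DSC when C_O = 1, see Eq. (7)), and a global residual add."""
    rec = layers.Input(shape=(None, None, 1))      # unfiltered HM reconstruction
    x = rec
    for _ in range(num_dsc_layers):
        x = layers.SeparableConv2D(feature_maps, 3, padding="same")(x)
        x = layers.BatchNormalization()(x)         # merged into the conv at inference
        x = layers.ReLU()(x)
    residual = layers.Conv2D(1, 3, padding="same")(x)   # last conv: no activation
    out = layers.Add()([rec, residual])                 # residual learning
    return models.Model(rec, out)

student = build_student()
```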
C. Proposed Initialization and Training Scheme

In this subsection, we introduce the training process and loss functions of the proposed network. In most cases, a suitable initialization of the parameters helps the model converge to a better minimum. Inspired by transfer learning, a pre-trained teacher model is used to guide the initialization of the parameters in the proposed model. By using such an initialization, we hope the proposed model can produce an output similar to that of the teacher model before the real training begins. The teacher model is pre-trained with the mean square error (MSE) loss between the teacher output $Y_T$ and the original pixels $Y_O$:

$L_T = \dfrac{1}{N} \sum_{i=1}^{N} \| Y_T^i - Y_O^i \|^2$   (8)

After the training of the teacher model, we use its intermediate outputs to guide the parameter initialization of the proposed model. This process is denoted by the bold lines in Fig. 2. Because vanishing gradients may lead to insufficient training of the shallow layers, the teacher model is divided into differently-sized blocks to produce the intermediate hints. For the metric of the distance between the teacher and the proposed student model, we try two forms: MMD [39] and the attention loss [38]. The loss function $L_{MMD}(F_T, F_S)$ with a linear kernel function ($k(x, y) = x^T y$) can be written as follows:

$L_{MMD}(F_T, F_S) = \left\| \dfrac{1}{C_T} \sum_{i=1}^{C_T} \dfrac{f_T^i}{\|f_T^i\|_2} - \dfrac{1}{C_S} \sum_{j=1}^{C_S} \dfrac{f_S^j}{\|f_S^j\|_2} \right\|_2^2$   (9)

where $F$ represents the attention map, $f$ indicates a single feature map, $C$ is the number of feature maps, and the subscripts $T$ and $S$ identify the teacher and student model. Meanwhile, the loss function $L_{AT}(F_T, F_S)$ of attention transfer (AT) [38] can be written as follows:

$L_{AT}(F_T, F_S) = \left\| \dfrac{\sum_{i=1}^{C_T} |f_T^i|^p}{\left\| \sum_{i=1}^{C_T} |f_T^i|^p \right\|_2} - \dfrac{\sum_{j=1}^{C_S} |f_S^j|^p}{\left\| \sum_{j=1}^{C_S} |f_S^j|^p \right\|_2} \right\|_2^2$   (10)

We set $p$ to 2 in our implementation; note that when $p = 1$, these two methods are similar except for their normalization methods [39]. After the initialization, we start the real training process, using the MSE loss $L_S$ in (11) to train the proposed model, where $Y_S$ indicates the output of the proposed model:

$L_S = \dfrac{1}{N} \sum_{i=1}^{N} \| Y_S^i - Y_O^i \|^2$   (11)
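A sketch of the attention-transfer loss of Eq. (10) in TensorFlow. Here both hints share the same spatial size (neither model downsamples), and p = 2 as in our implementation; the function names are ours.

```python
import tensorflow as tf

def attention_map(feats, p=2):
    """Collapse features (N, H, W, C) into an l2-normalized spatial
    attention vector by summing |f|^p over the channel axis."""
    amap = tf.reduce_sum(tf.abs(feats) ** p, axis=-1)      # (N, H, W)
    flat = tf.reshape(amap, [tf.shape(amap)[0], -1])       # (N, H*W)
    return flat / (tf.norm(flat, axis=-1, keepdims=True) + 1e-12)

def at_loss(teacher_feats, student_feats, p=2):
    """Attention-transfer loss of Eq. (10) between one pair of hints;
    in training it would be summed over the chosen hint positions."""
    diff = attention_map(teacher_feats, p) - attention_map(student_feats, p)
    return tf.reduce_mean(tf.reduce_sum(diff ** 2, axis=-1))
```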
In summary, the whole process can be divided into the following steps.

Algorithm 1: The process of building the trained proposed model.

Input: The dataset pair of HM reconstruction samples $X$ and original samples $Y_O$;
Output: The trained proposed model;
1: Construct the teacher model $T$ and train it for $n_1$ epochs with the MSE loss $L_T$;
2: Extract the attention maps $F_T$ from the trained $T$;
3: Construct the student model $S$ with BN and train it for $n_2$ epochs with $L_{AT}(F_T, F_S)$ or $L_{MMD}(F_T, F_S)$;
4: Train $S$ with the MSE loss $L_S$ for $n_3$ epochs;
5: Calculate $\widehat{w}_{pwConv}$ and $\widehat{b}_{pwConv}$ for $S$;
6: Remove the BN from $S$;
7: Use $\widehat{w}_{pwConv}$ and $\widehat{b}_{pwConv}$ to replace the weight $w_{pwConv}$ and bias $b_{pwConv}$ in the pointwise convolutions of $S$;
8: return $S$;

IV. PROPOSED RESIDUAL MAPPING FOR THE CNN-BASED FILTERING
A. Analysis of CTU-level and Frame-level Control
From the size of the filtered samples, filtering methods can be divided into CTU-level (block-level) and frame-level methods. Compared with CTU-level control, frame-level control has two main advantages in CNN-based filter design, concerning the required computational resources and the video quality. In this subsection, the difference is analyzed from the perspectives of the padding methods and the filter kernels.

Firstly, to keep the size of the input frames unchanged, the CNN-based filter needs to pad the boundaries of the input with some samples. There are usually two padding ways: valid padding (padding with reconstructed samples) and same padding (padding with zero samples). In one case, if the CTUs are padded with reconstructed pixels to maintain the same accuracy as frame-level filtering, most networks need to pad the input block with plenty of pixels and require considerable extra calculation. Fig. 3 intuitively shows the difference in the amount of calculation between valid and same padding. The quantitative calculations [58] are illustrated in Table II (we assume that both of their output sizes for the filtered samples are 64 × 64); it can be found that valid padding (see the "Valid" columns) in the works [26], [30], [33] brings a considerable complexity increase over same padding (see the "Same" columns). In the other case, if same padding is selected, it causes calculation errors around the boundaries, as shown in Fig. 4. We assume that the size of the block control is $h \times h$ and that the width of the boundary area affected by the padding is $a$. The proportion $p_{fc}$ of affected pixels under frame-level control is calculated as follows:

$p_{fc} = 1 - \dfrac{(W - 2a)(H - 2a)}{WH} = \dfrac{2a(W + H - 2a)}{WH}$   (12)

Similarly, the proportion $p_{bc}$ of affected pixels under block-level control can be approximated as follows (no incomplete CTUs are considered):

$p_{bc} = \dfrac{4a(h - a)}{h^2} \approx \dfrac{4a}{h}$   (13)

It can be found from (13) that the area affected by same padding is approximately proportional to the perimeter of the filtered samples. Therefore, frame-level control, with its higher area-to-perimeter ratio, is less affected than block-level control. According to (12) and (13), for the HEVC test sequences, the same padding of our network affects an average of 45% of the pixels under CTU-level control, whereas under frame-level control it affects only 3%. Therefore, choosing frame-level control lays a solid foundation for the application of the CNN-based filter.

Secondly, the frames filtered by frame-level control have the property of integrity. Frame-level control uses the same kernel for the filtering of the entire frame, whereas CTU-level control may use different kernels for two consecutive CTUs, which may lead to artificial errors at the boundaries. As shown in Fig. 4, two consecutive CTUs with different filtering strategies have some errors along their common boundary because of the different kernels used in the filtering, especially when one of the CTUs uses the learning-based filter while the other one does not. This further demonstrates the advantages of frame-level control.

In summary, for the design of lightweight CNN-based filters, frame-level control has some advantages over block-level control. On the one hand, compared with frame-level control, CTU-level control leads to extra calculation cost with valid padding or to calculation errors with same padding. On the other hand, frame-level control has the property of integrity, which brings better subjective quality. To reduce the padding error and the complexity brought by the multi-layer neural network, we built our CNN-based in-loop filter on frame-level control. However, directly using frame-level control is weak because it only has the two states of using or not using the filter, so we need additional methods to improve its performance.

Fig. 3. The diagrams of convolution with different padding methods: (a) convolution with valid padding, which requires extra calculation on the padded input samples; (b) convolution with same padding, where the zero samples affect the pixels near the boundaries.

Fig. 4. The diagrams of the different control methods: (a) CTU-level control, where a filtered CTU next to an unfiltered CTU creates an artificial boundary; (b) frame-level control, where the whole W × H frame is filtered.

TABLE II
COMPLEXITY COMPARISON OF CTU-LEVEL CONTROL BETWEEN VALID PADDING AND SAME PADDING

Items          | RHCNN [30]      | Jia et al. [33] | VR-CNN [26]
Padding type   | Valid  | Same   | Valid | Same    | Valid | Same
FLOPs^a (G)    | 16.21  | 10.89  | 2.02  | 1.49    | 0.25  | 0.22
MAdd^b (G)     | 32.38  | 21.76  | 4.04  | 2.97    | 0.49  | 0.44
Memory^c (MB)  | 91.06  | 60.11  | 25.43 | 18.02   | 5.84  | 5.02
MemR+W^d (MB)  | 193.05 | 130.36 | 55.09 | 39.42   | 13.88 | 11.99

^a Theoretical amount of floating-point arithmetic. ^b Theoretical amount of multiply-adds. ^c Memory usage. ^d Memory read/write.
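The two proportions in Eqs. (12) and (13) are easy to evaluate numerically. In the sketch below, the affected border width a is an assumed receptive-field radius, not a value given explicitly in the paper; with a = 10 a 1080p frame reproduces a figure close to the 3% reported above, while a 64 × 64 CTU has more than half of its area affected.

```python
def affected_fraction_frame(width, height, a):
    # Eq. (12): fraction of pixels influenced by zero padding when the
    # whole width x height frame is filtered in one pass.
    return 1 - (width - 2 * a) * (height - 2 * a) / (width * height)

def affected_fraction_block(h, a):
    # Eq. (13): the same fraction for an h x h block, ignoring partial CTUs.
    return 4 * a * (h - a) / (h * h)

a = 10                                            # assumed receptive-field radius
print(affected_fraction_frame(1920, 1080, a))     # ~0.029 for a 1080p frame
print(affected_fraction_block(64, a))             # ~0.53 for a 64x64 CTU
```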
B. Residual Mapping

In this subsection, a novel post-processing module, RM, is proposed to improve the performance of the frame-level control based CNN filter. It can effectively alleviate the over-smoothing problem [33], [47] of inter frames. Besides, we find in Section V-D that it also brings a considerable improvement to intra frames. Most trained neural networks are fitted to a certain training set. Since the distribution of the training data is often very complicated, the training is actually a trade-off over the data set. For a specific image, the trained filter may be under-fitted or over-fitted, which may cause distortion or blur for a learning-based filter. What's more, if we want to use a neural network trained with intra samples for the filtering of inter samples, this phenomenon becomes more serious because of the difference between the distributions of the intra and inter datasets. With this in mind, we propose to use a parametric RM after the learning-based filter (which is some sort of non-parametric filter) to improve its generalization ability. Inspired by the potential correlation between the distortion and the learned residual shown in Fig. 5, we handle this filter from the perspective of restoring the distortion from the learned residual, which is equivalent to improving the quality of the distorted frames. The distortion $R_O$ is defined as the difference between the original samples $Y_O$ and the reconstruction after de-blocking, $X$:

$R_O = Y_O - X$   (14)

Fig. 5. A frame from the CLIC dataset [59] coded with HM-16.16 at QP 37. The original frame, the distortion, and the learned residual are shown in (a), (b), and (c), respectively.

Fig. 6. The comparison of different filtering mechanisms ("RaceHorses 416x240", QP 22, LDP configuration). Linear, quadratic, and cubic represent linear, quadratic, and cubic mapping functions, respectively. We can see in the red box that the performance of using the CNN filter directly is not satisfactory and even leads to a decrease in PSNR.

Fig. 7. The schemes of the different frameworks with the CNN-based filter: (a) serial structure (DB, CNN-filter, SAO); (b) parallel structure (DB and SAO alongside the CNN-filter); (c) proposed structure (DB, CNN-filter, RM, SAO).
Similarly, the learned residual $R_S$ is defined as the difference between the output of the learning-based filter and $X$:

$R_S = Y_S - X$   (15)

A function $f_\lambda(\cdot)$ with parameters $\lambda$ is designed as the parametric filter to map $R_S$ to $R_O$. We choose the MSE as the metric:

$\lambda = \arg\min_{\lambda} \| f_\lambda(R_S) - R_O \|^2$   (16)

We should use a model with a small number of parameters to construct $f_\lambda(\cdot)$, so that it is convenient to encode the parameters $\lambda$ into the bitstream to ensure the consistency of encoding and decoding. For the expression form of $f_\lambda(\cdot)$, we have tried linear functions and polynomial functions, as shown in Fig. 6. From the red box on the left, it can be found that only using the CNN filter (see the red dotted line) may decrease the coding performance, which proves that directly using CNN filters for inter frames may degrade the video quality, and that the performance is improved after adopting RM. It is noticeable that there is little difference in performance between the different polynomial functions, so we choose a simple linear function to build RM:

$\lambda = \arg\min_{\lambda} \| \lambda R_S - R_O \|^2$   (17)

We then add $X$ to the output of RM, $\widehat{R}_S$, to get the filtered frame $\widehat{Y}_S$. After sending it to SAO, the entire filtering process is completed:

$\widehat{Y}_S = X + \widehat{R}_S = X + \lambda R_S$   (18)

We quantize the candidate $\lambda$ with $n$ bits for each component, where $\lambda = i/(2^n - 1)$, $i \in \{0, 1, \ldots, 2^n - 1\}$. In the implementation, the number of required bits $n$ is set to 5, so each frame needs 15 bits for the RM module. A rate-distortion optimization (RDO) process is designed to find the best $\lambda$, and the regular mode of CABAC is used to code $\lambda$. RM needs neither specific models for inter frames nor additional classifiers for each CTU. What's more, it is independent of the proposed network and can be combined with other learning-based filters to alleviate the over-smoothing problem as well.

Different from the previous strategy [47] of choosing between traditional filtering and learning-based filtering, RM uses a serial structure and makes full use of both kinds of filtering, as shown in Fig. 7. From the perspective of reconstructed frames, the proposed RM can be interpreted as a post-processing module that fully utilizes the advantages of both the distorted reconstruction and the learned filtered output; this full use of the two aspects gives RM its excellent performance. For example, assume that the reference frame has been filtered by a learning-based filter. If the current frame and the reference frame are almost identical, the current frame does not need to use all the filters; conversely, if the current frame and the reference frame are completely different, the distorted residue easily produces artificial imprints, so the filters should be used in this case. For a specific frame, however, it is often difficult to obtain an accurate judgment about whether to use the filters from its encoded information, such as residuals or motion vectors. Considering the good generalization ability of the traditional filters, we keep them working and focus on the CNN filter. We therefore introduce the parametric module RM, which uses an RDO process to give the CNN filter an appropriate filtering strength. From (16), it can be observed that the filtering strength varies with $\lambda$, so we can traverse all of the candidate $\lambda$ values and code the one with the smallest reconstruction error into the bitstream. We could also differentiate the objective function to obtain the optimal parameters and code the quantized parameters into the bitstream. In this case, we need to consider the influence of parameter quantization: mapping functions that are sensitive to quantization noise, such as high-order polynomials, should be abandoned. Otherwise, larger quantization errors may appear in the decoded frames.

TABLE III
EXPERIMENTAL ENVIRONMENT

Items            | Specification
Optimizer        | Adam [60]
Processor        | Intel Xeon Gold 6134 at 3.20 GHz
GPU              | NVIDIA GeForce RTX 2080
Operating system | CentOS Linux release 7.6.1810
HM version       | 16.16
DNN framework    | Keras 2.2.4 [61] and TensorFlow 1.12.0 [62]
V. EXPERIMENTAL RESULTS

A. Experimental Setting
For the experiments, we mainly focus on objective quality, subjective quality, complexity, and ablation studies to illustrate the performance of our model. Nine hundred pictures from DIV2K [63] are cropped to a fixed resolution and then down-sampled; these two sets of pictures are spliced into two videos as our training sets. Only the luminance component is used for training, while the chrominance components are also tested with the proposed model. The patch size in training is 32 × 32 for H.265/HEVC and 64 × 64 for H.266/VVC, which is consistent with the largest size of the TU. Considering that reconstructed images with different QPs often have different degrees of distortion and artifacts, the whole QP range is divided into four bands: below 24, 25 to 29, 30 to 34, and above 35, and four proprietary models are trained, one for each QP band. The parameter initialization method is the normal distribution [64] for both the teacher model and the proposed model. The training epochs $n_1$ and $n_3$ are both set to 50. We use more training epochs in the special initialization phase for the models with higher QPs, because there are often more artifacts in the reconstructed images with higher QPs; specifically, $n_2$ is set smaller for the lower QPs and larger for the higher QPs. After the training phase, we save the trained model and call it for inference in the HEVC reference software (HM) and the VVC test model (VTM). In the test phase, the first 64 frames of the HEVC test sequences are used to evaluate the generalization ability of our model. We test four different configurations with default settings, including all-intra (AI), low-delay-B (LDB), low-delay-P (LDP), and random-access (RA), for the H.265/HEVC anchor. For the H.266/VVC anchor, we test the default AI and RA configurations. Four typical QPs from the common test conditions are tested: 22, 27, 32, and 37. The other important test conditions are shown in Table III. For a fair comparison with previous works, we use the coding results from their original papers; the complexities of the reference papers are tested on our local server to avoid the influence of the hardware platforms.
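The QP-band partition above maps directly to a model lookup. A small sketch follows, reading the bands as QP <= 24, 25-29, 30-34, and QP >= 35; the handling of the edge QPs 24 and 35 is our assumption, since the text leaves it implicit.

```python
def qp_band(qp):
    """Select which of the four trained models handles a given QP.
    Band edges follow the partition described above; the inclusive
    treatment of the edge QPs is an assumption."""
    if qp <= 24:
        return 0
    elif qp <= 29:
        return 1
    elif qp <= 34:
        return 2
    return 3

# e.g. models[qp_band(slice_qp)](reconstruction) applies the right filter.
```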
B. Experiment on H.265/HEVC

1) Objective Evaluation: In this subsection, an objective evaluation is conducted to assess the performance of the proposed model. The experimental results compared with the HM-16.16 anchor are shown in Table IV. For the luminance component, the proposed model achieves 6.3%, 4.5%, 5.4%, and 5.7% BD-rate reduction compared with the HEVC baseline under the AI, LDB, LDP, and RA configurations, respectively. For the chrominance components, the proposed model achieves even more BD-rate reduction than for the luminance component. This demonstrates the generalization ability of the proposed model, because we only use the luminance component of intra samples for training. Furthermore, comparisons with the previous works [26], [33] are conducted, and the BD-rate reduction is shown in Table V. It can be seen that our model achieves more BD-rate reduction in the AI configuration.

For the performance evaluation of the inter configurations, we compare our proposed model with frame-level control [25], CTU-level control [45], and Jia et al. [33], as shown in Table VI. For a fair comparison, we select same padding and use the proposed model to test the different control methods. From the experimental results, it can be seen that our proposed model achieves about 1% extra BD-rate reduction over both CTU-level and frame-level control for all inter configurations. Compared with Jia et al. [33], our model achieves a comparable BD-rate reduction in the inter configurations. For the chrominance components, our model achieves about 3% extra BD-rate reduction, which further demonstrates its generalization ability.
2) Subjective Evaluation:
We also conduct a subjective evaluation, as shown in Fig. 8 and Fig. 9. It can be seen from the experimental results that our model has a strong artifact-removal capability. First, we re-deploy the proposed model in HM-16.9 for a fair subjective comparison with Jia et al. [33]. From Fig. 8, it can be found that the various kinds of artifacts in (a) are eliminated by the proposed model, and the man's face looks smoother and fuller. At the same time, some vertical blocking effects are produced by Jia et al. [33], probably because it uses different filters for consecutive CTUs, whereas our proposed model uses the same filter for the whole image and introduces no additional boundaries. Besides, the man's eyes seem to be blurred by [33], which degrades the visual quality. Second, the subjective evaluation for the inter frames is conducted in Fig. 9. The default HM and HM with CTU-level control [27] are used as the anchors. As shown in Fig. 9, the contouring and blocky artifacts on the number are eliminated by the proposed model. For CTU-level control [27] based filtering, the subjective quality of this frame is reduced due to the artificial boundaries on the knee, whereas our proposed model introduces no boundaries there and achieves a better visual quality. To sum up, because our proposed method makes full use of the frame-level filtering strategy, it has significantly better visual effects than previous CTU-based methods.
3) Complexity Analysis:
As shown in Table V, we compare the complexity of Jia et al. [33], VR-CNN [26], and our proposed model from two aspects: computational complexity and storage consumption. Firstly, for the coding complexity evaluation, we use the following equation to calculate $\Delta T$:

$\Delta T = \dfrac{T'}{T}$   (19)

where $T'$ and $T$ denote the HM coding time with and without the learning-based filter, respectively. The FLOPs in Table V are tested for a frame with a resolution of 720p. Compared with VR-CNN [26], the FLOPs of our model are reduced by 79.1%, the decoding complexity is reduced by approximately 50%, and the encoding complexity is reduced by 4%. The processing time of the proposed model is almost the same for the encoder and the decoder; the difference in relative time arises because the network inference time accounts for a small proportion of the encoding complexity but a comparable one for the decoding.

In terms of storage consumption, compared with [26], the number of trainable parameters in the proposed model is reduced by 79.6%. The reduction of the model size is almost the same, because we use the same precision (float32) to save the models. The main reason why our model has relatively fewer parameters is that the design of the proposed model focuses more on complexity than on performance. For example, we use DSC as the backbone of the proposed model, whereas previous works [26], [33] utilize standard convolution. Meanwhile, we also use several useful methods to limit the model size while maintaining the performance, including the BN merge and the special initialization of parameters. What's more, our proposed model only needs one learning-based network for both intra and inter frames, so there is no need for additional models in practical applications. Compared with previous works that need multiple models or classifiers, our proposed method effectively reduces the required storage consumption, benefiting from the RM module.
TABLE IV
BD-RATE REDUCTION OF THE PROPOSED METHOD OVER THE HM-16.16 ANCHOR

Class  | Sequence        | AI (Y / U / V)      | LDB (Y / U / V)     | LDP (Y / U / V)      | RA (Y / U / V)
ClassA | Traffic         | -7.3% -3.4% -4.7%   | -4.6% -2.4% -0.9%   | -4.3% -3.4% -1.5%    | -6.4% -3.6% -2.7%
ClassA | PeopleOnStreet  | -6.8% -7.1% -6.9%   | -4.5% -0.6% -0.9%   | -3.1% -4.4% -2.4%    | -6.1% -4.6% -5.0%
ClassB | Kimono          | -4.9% -2.6% -2.5%   | -4.6% -7.5% -4.7%   | -7.3% -11.5% -6.7%   | -4.2% -5.7% -3.4%
ClassB | ParkScene       | -5.5% -3.2% -2.3%   | -1.9% -0.3% -0.7%   | -1.5% -0.5% -0.7%    | -3.8% -0.6% -0.3%
ClassB | Cactus          | -5.3% -4.1% -10.1%  | -4.4% -3.7% -4.4%   | -5.4% -5.6% -5.8%    | -6.8% -9.5% -7.2%
ClassB | BasketballDrive | -4.3% -8.9% -11.7%  | -3.4% -4.2% -7.8%   | -6.0% -9.2% -11.7%   | -4.4% -4.4% -8.9%
ClassB | BQTerrace       | -3.7% -4.3% -4.8%   | -6.1% -2.1% -2.4%   | -10.6% -4.5% -3.9%   | -8.8% -3.8% -3.3%
ClassC | BasketballDrill | -8.0% -11.7% -14.1% | -2.8% -4.9% -4.6%   | -3.4% -5.6% -6.2%    | -4.2% -8.0% -9.7%
ClassC | BQMall          | -6.0% -6.3% -7.2%   | -3.8% -3.2% -4.7%   | -4.6% -4.5% -5.6%    | -5.1% -4.7% -5.1%
ClassC | PartyScene      | -3.7% -4.8% -5.7%   | -0.8% -0.1% -0.2%   | -1.8% -0.4% -0.4%    | -1.7% -1.4% -2.0%
ClassC | RaceHorses      | -3.9% -6.9% -12.0%  | -4.1% -6.6% -11.3%  | -4.2% -7.9% -12.7%   | -4.7% -9.7% -14.2%
ClassD | BasketballPass  | -6.5% -7.3% -10.3%  | -4.4% -3.3% -4.6%   | -4.3% -4.8% -5.8%    | -3.9% -4.7% -6.1%
ClassD | BQSquare        | -4.2% -3.0% -6.8%   | -2.4% -1.6% -2.8%   | -4.1% -1.8% -2.9%    | -2.4% -1.0% -2.9%
ClassD | BlowingBubbles  | -5.3% -9.3% -9.8%   | -3.6% -5.8% -2.1%   | -3.9% -5.4% -1.8%    | -4.0% -6.1% -4.4%
ClassD | RaceHorses      | -7.5% -10.5% -14.6% | -6.3% -5.2% -10.3%  | -6.7% -7.4% -10.8%   | -6.8% -9.4% -12.2%
ClassE | Vidyo1          | -8.9% -8.7% -10.5%  | -6.7% -9.0% -9.6%   | -7.4% -9.4% -8.9%    | -8.1% -8.4% -9.7%
ClassE | Vidyo3          | -7.0% -5.2% -5.3%   | -4.0% -5.9% -3.1%   | -4.6% -6.3% -2.5%    | -6.5% -4.1% -5.1%
ClassE | Vidyo4          | -6.3% -10.1% -10.8% | -3.8% -11.5% -10.9% | -3.9% -12.1% -11.2%  | -5.6% -9.8% -10.1%
ClassE | FourPeople      | -9.4% -8.1% -9.0%   | -8.6% -9.2% -9.4%   | -9.0% -9.7% -10.8%   | -9.4% -7.7% -8.1%
ClassE | Johnny          | -8.3% -12.3% -11.0% | -7.0% -11.4% -9.1%  | -9.6% -13.1% -10.7%  | -8.3% -10.9% -9.7%
ClassE | KristenAndSara  | -8.6% -10.2% -11.1% | -7.7% -8.3% -8.6%   | -8.3% -10.1% -11.2%  | -8.2% -8.9% -9.6%
       | Average         | -6.3% -7.0% -8.6%   | -4.5% -5.1% -5.4%   | -5.4% -6.6% -6.4%    | -5.7% -6.1% -6.6%
TABLE V
BD-RATE REDUCTION AND COMPLEXITY (GPU) OF THE PROPOSED METHOD COMPARED WITH PREVIOUS WORKS [26], [33] IN THE AI CONFIGURATION

Sequences  | Jia et al. [33] (Y / U / V / ΔT_enc / ΔT_dec) | VR-CNN [26] (Y / U / V / ΔT_enc / ΔT_dec) | Proposed model (Y / U / V / ΔT_enc / ΔT_dec)
ClassA     | -4.7% -3.3% -2.6% 108.1% 734.9%   | -5.5% -4.7% -4.9% 108.3% 561.1%   | -7.1% -5.4% -5.9% 105.8% 281.0%
ClassB     | -3.5% -2.8% -3.0% 109.0% 659.8%   | -3.3% -3.2% -3.7% 110.3% 505.3%   | -4.8% -4.8% -6.4% 106.2% 265.2%
ClassC     | -3.4% -3.5% -5.0% 113.1% 894.9%   | -5.0% -5.5% -6.9% 113.0% 685.1%   | -5.4% -7.5% -9.9% 106.5% 326.3%
ClassD     | -3.2% -4.7% -6.0% 128.9% 1406.0%  | -5.4% -6.4% -8.1% 121.6% 1047.1%  | -5.9% -7.8% -10.5% 114.4% 548.0%
ClassE     | -5.8% -4.1% -5.2% 112.3% 1110.2%  | -6.5% -5.5% -5.6% 111.1% 836.7%   | -8.1% -9.2% -9.7% 107.2% 401.1%
Average    | -4.1% -3.7% -4.4% 114.3% 961.2%   | -5.1% -5.1% -5.8% 112.9% 727.0%   | -6.3% -7.0% -8.6% 108.0% 364.3%
FLOPs      | 334.84G                           | 50.39G                            | 10.53G
Parameters | 362,753                           | 54,512                            | 11,114
Model size | 1.38MB                            | 220KB                             | ≈45KB
TABLE VI
OVERALL BD-RATE COMPARISON WITH PREVIOUS METHODS [25], [33], [45] IN THE LDB, LDP, AND RA CONFIGURATIONS

Methods                           | LDB (Y / U / V)    | LDP (Y / U / V)    | RA (Y / U / V)
Jia et al. [33]                   | -6.0% -2.9% -3.5%  | -4.7% -1.0% -1.2%  | -6.0% -3.2% -3.8%
Our network + RM                  | -4.5% -5.1% -5.4%  | -5.4% -6.6% -6.4%  | -5.7% -6.1% -6.6%
Our network + Frame control [25]  | -3.7% -3.3% -3.2%  | -4.4% -4.6% -3.9%  | -4.6% -4.8% -5.1%
Our network + CTU control [45]    | -4.1% -4.4% -4.9%  | -4.6% -5.8% -5.9%  | -4.5% -5.1% -5.8%

Fig. 8. Visual quality comparison of Jia et al. [33] and the proposed model for the AI configuration: (a) HM Rec., (b) Jia et al. [33], (c) proposed model. The test QP is 37 and this is the 1st frame of FourPeople (anchor HM-16.9).

Fig. 9. Visual quality comparison of CTU-level control [27] and the proposed model for the RA configuration: (a) HM Rec., (b) CTU-level control [27], (c) proposed model. The test QP is 37 and this is the 16th frame of RaceHorses (anchor HM-16.16).
TABLE VII
BD-RATE REDUCTION AND COMPUTATIONAL COMPLEXITY (GPU) OF THE PROPOSED METHOD OVER THE VTM-6.3 ANCHOR

Class  | Sequence        | AI (Y / U / V / ΔT_enc / ΔT_dec)    | RA (Y / U / V / ΔT_enc / ΔT_dec)
ClassA | Traffic         | -1.6% -0.2% -0.4% 100.7% 234.1%     | -1.1% -0.7% -0.5% 99.6% 357.0%
ClassA | PeopleOnStreet  | -1.3% -0.4% -0.3% 98.0% 225.4%      | -0.9% -0.1% -0.2% 99.7% 266.3%
ClassB | Kimono          | -0.3% 0.1% -0.3% 104.9% 317.5%      | -0.2% 0.0% -0.4% 99.8% 319.3%
ClassB | ParkScene       | -1.9% 0.1% -0.1% 108.3% 232.4%      | -1.4% 0.3% -0.2% 101.2% 302.1%
ClassB | Cactus          | -1.3% -0.5% -0.8% 100.4% 244.5%     | -1.4% -1.8% -1.7% 103.1% 343.9%
ClassB | BasketballDrive | -0.3% -0.8% -1.0% 103.7% 282.7%     | -0.4% -0.8% -0.5% 100.5% 321.0%
ClassB | BQTerrace       | -1.0% -0.6% -0.6% 101.6% 228.4%     | -1.9% -1.6% -1.5% 101.6% 313.0%
ClassC | BasketballDrill | -2.7% -3.8% -5.5% 101.6% 219.2%     | -1.6% -3.4% -2.9% 102.9% 270.8%
ClassC | BQMall          | -2.2% -0.8% -0.7% 100.4% 220.6%     | -2.0% -1.1% -0.5% 104.6% 285.1%
ClassC | PartyScene      | -1.8% -1.1% -1.5% 100.5% 198.7%     | -1.3% -1.6% -1.8% 101.7% 242.7%
ClassC | RaceHorses      | -0.9% -1.1% -2.3% 101.6% 233.5%     | -1.1% -1.5% -2.5% 99.5% 243.8%
ClassD | BasketballPass  | -2.1% -1.4% -4.7% 99.4% 407.5%      | -1.2% -2.4% -1.0% 98.0% 406.1%
ClassD | BQSquare        | -3.0% -0.2% -1.0% 103.0% 319.1%     | -3.6% -1.0% -1.6% 105.6% 421.0%
ClassD | BlowingBubbles  | -2.1% -1.4% -1.0% 101.1% 352.0%     | -1.6% -2.3% -2.4% 101.0% 371.1%
ClassD | RaceHorses      | -2.8% -2.7% -4.6% 99.6% 366.6%      | -2.4% -3.1% -6.5% 98.6% 312.1%
ClassE | Vidyo1          | -1.3% -0.1% -0.3% 101.5% 340.3%     | -1.0% -0.1% 0.4% 102.6% 486.6%
ClassE | Vidyo3          | -1.1% 0.2% -0.2% 101.8% 298.8%      | -1.2% 1.5% 0.9% 102.0% 457.4%
ClassE | Vidyo4          | -0.8% -0.3% -0.2% 106.4% 291.0%     | -1.0% 0.3% -1.4% 101.3% 425.5%
ClassE | FourPeople      | -2.1% -0.5% -0.5% 99.2% 263.3%      | -1.8% -0.8% -0.7% 106.2% 449.3%
ClassE | Johnny          | -1.3% -0.4% -0.6% 99.9% 317.9%      | -2.6% -1.2% -1.1% 100.1% 452.4%
ClassE | KristenAndSara  | -1.7% -0.6% -0.7% 101.1% 344.5%     | -1.6% -0.4% -1.4% 100.2% 428.0%
       | Average         | -1.6% -0.8% -1.3% 101.6% 282.8%     | -1.5% -1.0% -1.3% 101.4% 355.9%

TABLE VIII
ABLATION STUDY OF RM (AI, VTM-6.3)

Sequences | Our network (Y / U / V) | Our network + RM (Y / U / V)
ClassA    | -0.5% -0.1% 0.4%        | -1.5% -0.3% -0.4%
ClassB    | 0.3% 1.3% 0.1%          | -1.0% -0.3% -0.5%
ClassC    | -1.6% -1.8% -2.8%       | -1.9% -1.7% -2.5%
ClassD    | -2.6% -2.2% -3.6%       | -2.5% -1.4% -2.8%
ClassE    | -0.3% 2.6% 1.5%         | -1.4% -0.3% -0.4%
Average   | -0.9% 0.3% -0.7%        | -1.6% -0.8% -1.3%

TABLE IX
ABLATION STUDY OF PARAMETER INITIALIZATION (AI, VTM-6.3)

Methods            | ΔPSNR Y (dB) | ΔPSNR U (dB) | ΔPSNR V (dB)
Student            | 0.310        | 0.231        | 0.295
Student + MMD [39] | 0.320        | 0.245        | 0.313
Student + AT [38]  | –            | –            | –
C. Experiment on H.266/VVC
To further evaluate the performance of the proposed model, we use the same test conditions to test it in VTM-6.3. The only difference is that we use the entire DIV2K dataset instead of the down-sampled one to train the proposed model. From the experimental results shown in Table VII, it can be found that our model achieves about 1.6% and 1.5% BD-rate reduction on the luminance component for the AI and RA configurations, respectively. For the chrominance components, it achieves a similar BD-rate reduction. In terms of complexity, the proposed method introduces a negligible increase on the encoding side and about a 3× complexity increase on the decoding side.
D. Ablation Study

1) RM for intra frames:
RM can effectively improve the generalization ability of learning-based filters. The experiments on RM for the inter frames were carried out in Section V-B. Here, based on VTM, we further conduct ablation experiments on intra frames to illustrate the performance of RM; the test setting is the same as before. From the experiments shown in Table VIII, we find that about 0.8% BD-rate reduction is achieved by the RM module. Regarding the performance on Class B, only using the proposed CNN filter may even have a negative effect and leads to a 0.3% BD-rate increment, but its performance is well improved after using RM. For most other classes, the performance is also improved to some extent after using RM.
2) The initialization of parameters:
The 1st frame of all HEVC test sequences is tested, and the overall PSNR increments are shown in Table IX, where the student model without transfer learning is indicated by the "Student" row. MMD and AT in Table IX represent the different transfer learning methods that act on the student model. By comparing the "Student" row with the other rows, we can find that the PSNR of the student model is improved by both MMD and AT. What's more, the improvements on the chrominance components are more obvious than on the luminance component.

VI. CONCLUSION
In this paper, a CNN-based low complexity filter is proposed for video coding. The lightweight DSC merged with batch normalization is used as the backbone. Based on transfer learning, attention transfer is utilized to initialize the parameters of the proposed network. By adding the novel parametric module RM after the CNN filter, the generality of the CNN filter is improved, and the filtering problem of inter frames is handled as well. What's more, RM is independent of the proposed network and can also be combined with other learning-based filters to alleviate the over-smoothing problem. The experimental results show that our proposed model achieves excellent performance in terms of both BD-rate and complexity. For the HEVC test sequences, our proposed model achieves about 1.2% more BD-rate reduction with a 79.1% decrease in FLOPs compared with the VR-CNN anchor. Compared with Jia et al. [33], our model achieves a comparable BD-rate reduction with much lower complexity. Finally, we also conduct experiments on H.266/VVC and ablation studies to demonstrate the effectiveness of the model. Our future work aims at further performance improvement of the learning-based filter in video coding.

REFERENCES
[1] T. Wiegand, G. J. Sullivan, G. Bjontegaard, and A. Luthra, "Overview of the H.264/AVC video coding standard," IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, no. 7, pp. 560–576, 2003.
[2] G. J. Sullivan, J.-R. Ohm, W.-J. Han, and T. Wiegand, "Overview of the high efficiency video coding (HEVC) standard," IEEE Transactions on Circuits and Systems for Video Technology, vol. 22, no. 12, pp. 1649–1668, 2012.
[3] H.266. https://de.wikipedia.org/wiki/h.266/, 2018.
[4] J. Lainema, F. Bossen, W.-J. Han, J. Min, and K. Ugur, "Intra coding of the HEVC standard," IEEE Transactions on Circuits and Systems for Video Technology, vol. 22, no. 12, pp. 1792–1801, 2012.
[5] J.-L. Lin, Y.-W. Chen, Y.-W. Huang, and S.-M. Lei, "Motion vector coding in the HEVC standard," IEEE Journal of Selected Topics in Signal Processing, vol. 7, no. 6, pp. 957–968, 2013.
[6] T. Nguyen, P. Helle, M. Winken, B. Bross, D. Marpe, H. Schwarz, and T. Wiegand, "Transform coding techniques in HEVC," IEEE Journal of Selected Topics in Signal Processing, vol. 7, no. 6, pp. 978–989, 2013.
[7] O. Crave, B. Pesquet-Popescu, and C. Guillemot, "Robust video coding based on multiple description scalar quantization with side information," IEEE Transactions on Circuits and Systems for Video Technology, vol. 20, no. 6, pp. 769–779, 2010.
[8] D. Marpe, H. Schwarz, and T. Wiegand, "Context-based adaptive binary arithmetic coding in the H.264/AVC video compression standard," IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, no. 7, pp. 620–636, 2003.
[9] A. Norkin, G. Bjontegaard, A. Fuldseth, M. Narroschke, M. Ikeda, K. Andersson, M. Zhou, and G. Van der Auwera, "HEVC deblocking filter," IEEE Transactions on Circuits and Systems for Video Technology, vol. 22, no. 12, pp. 1746–1754, 2012.
[10] C.-M. Fu, E. Alshina, A. Alshin, Y.-W. Huang, C.-Y. Chen, C.-Y. Tsai, C.-W. Hsu, S.-M. Lei, J.-H. Park, and W.-J. Han, "Sample adaptive offset in the HEVC standard," IEEE Transactions on Circuits and Systems for Video Technology, vol. 22, no. 12, pp. 1755–1764, 2012.
[11] C.-Y. Tsai, C.-Y. Chen, T. Yamakage, I. S. Chong, Y.-W. Huang, C.-M. Fu, T. Itoh, T. Watanabe, T. Chujoh, M. Karczewicz et al., "Adaptive loop filtering for video coding," IEEE Journal of Selected Topics in Signal Processing, vol. 7, no. 6, pp. 934–945, 2013.
[12] K. Duan, S. Bai, L. Xie, H. Qi, Q. Huang, and Q. Tian, "CenterNet: Keypoint triplets for object detection," in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 6569–6578.
[13] Z.-Q. Zhao, P. Zheng, S.-T. Xu, and X. Wu, "Object detection with deep learning: A review," IEEE Transactions on Neural Networks and Learning Systems, 2019.
[14] X. Liu, Z. Deng, and Y. Yang, "Recent progress in semantic image segmentation," Artificial Intelligence Review, vol. 52, no. 2, pp. 1089–1106, 2019.
[15] C. Liu, L.-C. Chen, F. Schroff, H. Adam, W. Hua, A. L. Yuille, and L. Fei-Fei, "Auto-DeepLab: Hierarchical neural architecture search for semantic image segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 82–92.
[16] X. Hu, H. Mu, X. Zhang, Z. Wang, T. Tan, and J. Sun, "Meta-SR: A magnification-arbitrary network for super-resolution," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 1575–1584.
[17] J. W. Soh, G. Y. Park, J. Jo, and N. I. Cho, "Natural and realistic single image super-resolution with explicit natural manifold discrimination," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 8122–8131.
[18] J. Li, B. Li, J. Xu, R. Xiong, and W. Gao, "Fully connected network-based intra prediction for image coding," IEEE Transactions on Image Processing, vol. 27, no. 7, pp. 3236–3247, 2018.
[19] Y. Hu, W. Yang, M. Li, and J. Liu, "Progressive spatial recurrent neural network for intra prediction," IEEE Transactions on Multimedia, vol. 21, no. 12, pp. 3024–3037, 2019.
[20] H. Sun, Z. Cheng, M. Takeuchi, and J. Katto, "Enhanced intra prediction for video coding by using multiple neural networks," IEEE Transactions on Multimedia, 2020.
[21] J. Liu, S. Xia, W. Yang, M. Li, and D. Liu, "One-for-all: Grouped variation network-based fractional interpolation in video coding," IEEE Transactions on Image Processing, vol. 28, no. 5, pp. 2140–2151, 2018.
[22] L. Zhao, S. Wang, X. Zhang, S. Wang, S. Ma, and W. Gao, "Enhanced motion-compensated video coding with deep virtual reference frame generation," IEEE Transactions on Image Processing, 2019.
[23] R. Song, D. Liu, H. Li, and F. Wu, "Neural network-based arithmetic coding of intra prediction modes in HEVC," in 2017 IEEE Visual Communications and Image Processing (VCIP). IEEE, 2017, pp. 1–4.
[24] C. Ma, D. Liu, X. Peng, Z.-J. Zha, and F. Wu, "Neural network-based arithmetic coding for inter prediction information in HEVC," in 2019 IEEE International Symposium on Circuits and Systems (ISCAS). IEEE, 2019, pp. 1–5.
[25] W. Park and M. Kim, "CNN-based in-loop filtering for coding efficiency improvement," in 2016 IEEE 12th Image, Video, and Multidimensional Signal Processing Workshop (IVMSP), July 2016, pp. 1–5.
[26] Y. Dai, D. Liu, and F. Wu, "A convolutional neural network approach for post-processing in HEVC intra coding," in International Conference on Multimedia Modeling. Springer, 2017, pp. 28–39.
[27] Y. Dai, D. Liu, Z.-J. Zha, and F. Wu, "A CNN-based in-loop filter with CU classification for HEVC," in 2018 IEEE Visual Communications and Image Processing (VCIP). IEEE, 2018, pp. 1–4.
[28] C. Liu, H. Sun, J. Chen, Z. Cheng, M. Takeuchi, J. Katto, X. Zeng, and Y. Fan, "Dual learning-based video coding with inception dense blocks," in 2019 Picture Coding Symposium (PCS), Nov. 2019, pp. 1–5.
[29] J. Kang, S. Kim, and K. M. Lee, "Multi-modal/multi-scale convolutional neural network based in-loop filter design for next generation video codec," in 2017 IEEE International Conference on Image Processing (ICIP), Sep. 2017, pp. 26–30.
[30] Y. Zhang, T. Shen, X. Ji, Y. Zhang, R. Xiong, and Q. Dai, "Residual highway convolutional neural networks for in-loop filtering in HEVC," IEEE Transactions on Image Processing, vol. 27, no. 8, pp. 3827–3841, Aug. 2018.
[31] X. He, Q. Hu, X. Zhang, C. Zhang, W. Lin, and X. Han, "Enhancing HEVC compressed videos with a partition-masked convolutional neural network," in 2018 IEEE International Conference on Image Processing (ICIP), Oct. 2018, pp. 216–220.
[32] R. Yang, M. Xu, and Z. Wang, "Decoder-side HEVC quality enhancement with scalable convolutional neural network," in 2017 IEEE International Conference on Multimedia and Expo (ICME). IEEE, 2017, pp. 817–822.
[33] C. Jia, S. Wang, X. Zhang, S. Wang, J. Liu, S. Pu, and S. Ma, "Content-aware convolutional neural network for in-loop filtering in high efficiency video coding," IEEE Transactions on Image Processing, vol. 28, no. 7, pp. 3343–3356, 2019.
[34] G. Bjontegaard, "Calculation of average PSNR differences between RD-curves," VCEG-M33, 2001.
[35] L. Sifre and S. Mallat, "Rigid-motion scattering for image classification," Ph.D. dissertation, 2014.
[36] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, "MobileNets: Efficient convolutional neural networks for mobile vision applications," arXiv preprint arXiv:1704.04861, 2017.
[37] G. Hinton, O. Vinyals, and J. Dean, "Distilling the knowledge in a neural network," Computer Science, vol. 14, no. 7, pp. 38–39, 2015.
[38] S. Zagoruyko and N. Komodakis, "Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer," arXiv preprint arXiv:1612.03928, 2016.
[39] Z. Huang and N. Wang, "Like what you like: Knowledge distill via neuron selectivity transfer," arXiv preprint arXiv:1707.01219, 2017.
[40] J. Yang, X. Shen, J. Xing, X. Tian, H. Li, B. Deng, J. Huang, and X.-S. Hua, "Quantization networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 7308–7316.
[41] M. Nagel, M. van Baalen, T. Blankevoort, and M. Welling, "Data-free quantization through weight equalization and bias correction," arXiv preprint arXiv:1906.04721, 2019.
[42] P. Molchanov, A. Mallya, S. Tyree, I. Frosio, and J. Kautz, "Importance estimation for neural network pruning," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 11264–11272.
[43] C. Zhao, B. Ni, J. Zhang, Q. Zhao, W. Zhang, and Q. Tian, "Variational convolutional neural network pruning," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 2780–2789.
[44] S. Zhang, Z. Fan, N. Ling, and M. Jiang, "Recursive residual convolutional neural network-based in-loop filtering for intra frames," IEEE Transactions on Circuits and Systems for Video Technology, vol. 30, no. 7, pp. 1888–1900, 2020.
[45] C. Jia, S. Wang, X. Zhang, S. Wang, and S. Ma, "Spatial-temporal residue network based in-loop filter for video coding," in 2017 IEEE Visual Communications and Image Processing (VCIP), Dec. 2017, pp. 1–4.
[46] J. Yao, X. Song, S. Fang, and L. Wang, "AHG9: Convolutional neural network filter for inter frame," JVET-J0043, Apr. 2018.
[47] D. Ding, L. Kong, G. Chen, Z. Liu, and Y. Fang, "A switchable deep learning approach for in-loop filtering in video coding," IEEE Transactions on Circuits and Systems for Video Technology, vol. 30, no. 7, pp. 1871–1887, 2020.
[48] H. Sun, C. Liu, J. Katto, and Y. Fan, "An image compression framework with learning-based filter," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2020, pp. 152–153.
[49] C. Liu, H. Sun, J. Katto, X. Zeng, and Y. Fan, "A learning-based low complexity in-loop filter for video coding," in 2020 IEEE International Conference on Multimedia & Expo Workshops (ICMEW). IEEE, 2020, pp. 1–6.
[50] G. Cote, B. Erol, M. Gallant, and F. Kossentini, "H.263+: Video coding at low bit rates," IEEE Transactions on Circuits and Systems for Video Technology, vol. 8, no. 7, pp. 849–866, 1998.
[51] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[52] Y. Wang, H. Zhu, Y. Li, Z. Chen, and S. Liu, "Dense residual convolutional neural network based in-loop filter for HEVC," IEEE, 2018, pp. 1–4.
[53] J. W. Soh, J. Park, Y. Kim, B. Ahn, H.-S. Lee, Y.-S. Moon, and N. I. Cho, "Reduction of video compression artifacts based on deep temporal networks," IEEE Access, vol. 6, pp. 63094–63106, 2018.
[54] C. Li, L. Song, R. Xie, and W. Zhang, "CNN based post-processing to improve HEVC," in 2017 IEEE International Conference on Image Processing (ICIP). IEEE, 2017, pp. 4577–4580.
[55] F. Zhang, C. Feng, and D. R. Bull, "Enhancing VVC through CNN-based post-processing," in 2020 IEEE International Conference on Multimedia and Expo (ICME). IEEE, 2020, pp. 1–6.
[56] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola, "A kernel two-sample test," Journal of Machine Learning Research, vol. 13, pp. 723–773, 2012.
[57] V. Nair and G. E. Hinton, "Rectified linear units improve restricted Boltzmann machines," in Proceedings of the 27th International Conference on Machine Learning (ICML-10), 2010, pp. 807–814.
[60] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[61] F. Chollet et al., "Keras," https://github.com/fchollet/keras, 2015.
[62] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, "TensorFlow: Large-scale machine learning on heterogeneous systems," 2015, software available from tensorflow.org. [Online]. Available: http://tensorflow.org/
[63] E. Agustsson and R. Timofte, "NTIRE 2017 challenge on single image super-resolution: Dataset and study," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, July 2017.
[64] K. He, X. Zhang, S. Ren, and J. Sun, "Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1026–1034.