A Convolutional Neural Network-Based Low Complexity Filter
Chao Liu, Student Member, IEEE, Heming Sun, Member, IEEE, Jiro Katto, Member, IEEE, Xiaoyang Zeng, Member, IEEE, and Yibo Fan
Abstract—Convolutional Neural Network (CNN)-based filters have achieved significant performance in video artifact reduction. However, the high complexity of existing methods makes them difficult to deploy in real-world usage. In this paper, a CNN-based low complexity filter is proposed. We utilize depthwise separable convolution (DSC) merged with batch normalization (BN) as the backbone of the proposed CNN-based network. Besides, a weight initialization method is proposed to enhance the training performance. To solve the well-known over-smoothing problem for inter frames, a frame-level residual mapping (RM) is presented. We quantitatively analyze mainstream frame-level and block-level based filters and build our CNN-based filter with frame-level control to avoid the extra complexity and artificial boundaries caused by block-level control. In addition, the novel RM module is designed to restore the distortion from the learned residuals. As a result, we can effectively improve the generalization ability of the learning-based filter and reach an adaptive filtering effect. Moreover, this module is flexible and can be combined with other learning-based filters. The experimental results show that the proposed method achieves a significant BD-rate reduction over H.265/HEVC. It achieves about 1.2% more BD-rate reduction and a 79.1% decrease in FLOPs compared with VR-CNN. Finally, measurements on H.266/VVC and ablation studies are also conducted to confirm the effectiveness of the proposed method.
Index Terms—In-loop filter, HEVC, convolutional neural network, VTM.
This work was supported in part by the National Natural Science Foundation of China under Grant 61674041, in part by Alibaba Group through the Alibaba Innovative Research (AIR) Program, in part by the STCSM under Grant 16XD1400300, in part by the pioneering project of academy for engineering and technology and the Fudan-CIOMP joint fund, in part by the National Natural Science Foundation of China under Grant 61525401, in part by the Program of Shanghai Academic/Technology Research Leader under Grant 16XD1400300, in part by the Innovation Program of Shanghai Municipal Education Commission, and in part by JST, PRESTO Grant Number JPMJPR19M5, Japan. (Corresponding authors: Heming Sun and Yibo Fan.)
C. Liu, Y. Fan and X. Zeng are with the State Key Laboratory of ASIC and System, Fudan University, Shanghai 200433, China (e-mail: [email protected]; [email protected]; [email protected]).
H. Sun is with the Waseda Research Institute for Science and Engineering, Tokyo 169-8555, Japan, and JST, PRESTO, 4-1-8 Honcho, Kawaguchi, Saitama 332-0012, Japan (e-mail: [email protected]).
J. Katto is with the Waseda Research Institute for Science and Engineering, Tokyo 169-8555, Japan, and the Graduate School of Fundamental Science and Engineering, Waseda University, Tokyo 169-8555, Japan (e-mail: [email protected]).

I. INTRODUCTION

THE performance of video compression has been continuously improved with the development from H.264/AVC [1] and H.265/HEVC [2] to H.266/VVC [3]. These standards share a similar hybrid video coding framework, which adopts prediction [4], [5], transformation [6], quantization [7], and context adaptive binary arithmetic coding (CABAC) [8]. Owing to modules like quantization and flexible partitioning, some unavoidable artifacts are produced and degrade video quality, such as the blocking effect, the Gibbs effect, and ringing. To compensate for these artifacts, many advanced filtering tools have been designed, for instance, de-blocking (DB [9]), sample adaptive offset (SAO [10]), and the adaptive loop filter (ALF [11]). These tools reduce the artifacts effectively with acceptable complexity.

In the past decades, learning-based methods have made great progress in both low-level and high-level computer vision tasks [12]–[17], such as object detection [12], [13], semantic image segmentation [14], [15], and super resolution [16], [17]. By virtue of their powerful non-linear capability, learning-based tools have also been utilized to replace existing modules in video coding and show great potential, for instance, in intra prediction [18]–[20], inter prediction [21], [22], and entropy coding [23], [24]. Learning-based models, especially CNNs, have achieved excellent performance for the in-loop filter of video coding [25]–[33]. Dai et al. [26], [27] proposed VR-CNN, which adopts a variable filter size technique to obtain different receptive fields within one layer and achieves excellent performance with relatively low complexity. Zhang et al. [30] proposed the 13-layer RHCNN for both intra and inter frames; this relatively deep network has a strong mapping capability to learn the difference between the original and the reconstructed inter frames. To further adapt to the image content, Jia et al. [33] designed a multi-model filtering mechanism and proposed a content-aware CNN with a discriminative network. This method uses the discriminative network to select the most suitable deep learning model for each region.

Most learning-based filters can achieve considerable BD-rate [34] savings over the H.265/HEVC anchor. However, real-world applications often require lightweight models. High memory usage and computing resource consumption make it difficult to apply complex models to various hardware platforms. Therefore, designing a light network is essential to popularize learning-based in-loop filters.
Considering this, model compression methods that reduce the model complexity while maintaining performance are needed. In recent years, several well-known methods have been proposed, including lightweight layers [35], [36], knowledge transfer [37]–[39], low-bit quantization [40], [41], and network pruning [42], [43]. DSC [35], [36] is one of the most popular lightweight layers. It preserves the essential features of standard convolution while greatly reducing the complexity by using grouped convolution [35]. In this paper, we build our learning-based filter with DSC instead of standard convolution, and knowledge transfer is used to help the initialization of the trainable parameters without increasing the complexity.

Besides the learning-based filter itself, we also need a lightweight mechanism for the filtering of inter frames. Some inter blocks fully inherit the texture from their reference blocks and have almost no residuals. If the learning-based filter is applied to every frame, those blocks will be repeatedly filtered, which causes over-smoothing in inter blocks [33], [44]. One solution to this problem is training a specific filter for inter frames [30]. However, the coding of intra and inter frames shares some of the same modules in H.265/HEVC, like transformation, quantization, and block partitioning. This means a learning-based filter trained with intra frames can also be used for inter frames to some extent. Considering this, previous works [27], [33], [44]–[47] designed a syntax element control flag to indicate whether an inter CTU uses the learning-based filter or not, i.e., a selective filtering strategy for each CTU. We compare this strategy with frame-level control in Section IV-A and find that CTU-level control may lead to artificial boundaries between neighboring CTUs. We therefore propose to use a frame-level based filter to avoid unnecessary artificial boundaries. To improve the performance of frame-level based filtering, we propose a novel module called residual mapping (RM) in this paper.

In summary, we propose a novel light CNN-based in-loop filter for both intra and inter frames based on [48], [49]. Experimental results show that this model achieves excellent performance in terms of both video quality and complexity. Specifically, our contributions are as follows.

• A CNN-based lightweight in-loop filter is designed for H.265/HEVC. Low-complexity DSC merged with BN is used as the backbone of this model. Besides, we use attention transfer to pre-train it to help the initialization of parameters.

• For the filtering of inter frames, we analyze the control strategies and build our CNN filter at the frame level to avoid the artificial boundaries caused by CTU-level control. Besides, a novel post-processing module, RM, is proposed to improve the generalization ability of the frame-level based model and enhance both the subjective and objective quality.

• We integrate the proposed method into the HEVC and VVC reference software, where it achieves significant performance. Besides, we conduct extensive experiments, including ablation studies, to prove the effectiveness of the proposed methods.

The rest of this paper is organized as follows. Section II presents the related works, including the in-loop filter in video coding and lightweight network design. Section III elaborates on the proposed network, including the network structure and its loss functions. Section IV focuses on the proposed RM module and provides an analysis of different control strategies.
Experimental results and ablation studies are shown in Section V. In Section VI, we conclude this paper with future work.

II. RELATED WORKS
A. In-loop Filters in Video Coding

1) DB, SAO, and ALF:
DB, SAO, and ALF, which are adopted in the latest video coding standard H.266/VVC [3], aim at removing the artifacts in video coding. De-blocking [9] has been used to reduce the discontinuity at block boundaries since the publication of the coding standard H.263+ [50]. Depending on the boundary strength and the reconstructed average luminance level, DB chooses different coding parameters to filter the distorted boundaries. Meanwhile, by classifying the reconstructed samples into various categories, SAO [10] gives each category a different offset to compensate for the error between the reconstructed and original pixels. Based on the Wiener filter, ALF [11] tries different filter coefficients by minimizing the square error between the original and reconstructed pixels; the filter coefficients need to be signaled to the decoder side to ensure consistency between encoder and decoder. All these filters can effectively alleviate the various artifacts in reconstructed images. However, there is still much room for improvement.
2) Learning-based Filter:
Recently, learning-based filters have far outperformed DB, SAO, and ALF in terms of both objective and subjective quality. Different from SAO and ALF, they hardly need extra bits but can still compensate for errors adaptively. Most of them are based on CNNs and have achieved great success in this field. For the filtering of intra frames, Park et al. [25] first proposed a CNN-based in-loop filter, IFCNN, for video coding. Dai et al. [26] proposed VR-CNN as a post-processing step to replace DB and SAO in HEVC. Based on inception, Liu et al. [28] proposed a CNN-based filter with 475,233 trainable parameters. Meanwhile, Kang et al. [29] proposed a multi-modal/multi-scale neural network with up to 2,298,160 parameters. Considering the coding unit (CU) size information, He et al. [31] proposed a partition-masked CNN with a dozen residual blocks. Sun et al. [48] proposed a learning-based filter with ResNet [51] for the VTM. Liu et al. [49] proposed a lightweight learning-based filter based on DSC. Apart from the above, Zhang et al. [44] proposed a residual convolutional neural network with a recursive mechanism.

Different from training the filter on intra samples, training the filter with inter samples needs to consider the problem of repeated filtering [33], [47]. Jia et al. [33] proposed a content-aware CNN-based in-loop filtering method that applies multiple CNN models and a discriminative network in H.265/HEVC. The discriminative network judges the degree of distortion of the current block and selects the most appropriate filter for it. However, because the discriminative network requires additional complexity and memory usage, some researchers [27], [45] proposed to replace it with block-level syntax elements. This method requires extra bit consumption but obtains a more accurate judgment on whether to use the learning-based filter. Similarly, some researchers [25], [52] proposed to use frame-level syntax elements to control the filtering of inter frames. Besides, complicated models [30], [45], [53] like spatial-temporal networks are also useful for solving this problem. Jia et al. [45] proposed the spatial-temporal residue network (STResNet) with CTU-level control to suppress visual artifacts. RHCNN, which is trained for both intra and inter frames, was proposed by Zhang et al. [30]. Filtering on the decoder side [32], [54], [55] can also solve the problem of repeated enhancement well. For example, DS-CNN was designed by Yang et al. [32] to achieve quality enhancement as well. Li et al. [54] adopted a 20-layer deep CNN to improve the filtering performance. Zhang et al. [55] proposed a post-processing network for VTM 4.0.1.

In summary, filtering inter frames is more challenging than filtering intra frames. In most cases, a CNN-based in-loop filter with higher complexity can achieve better performance on intra frames. But for the filtering of inter frames, the existing methods have their own problems: frame-level control may lead to an over-smoothing problem; CTU-level control causes additional artificial boundaries; out-of-loop filters cannot use the filtered image as a reference; and adding a discriminative network or a complex model may be over-complex and impractical. Therefore, a more effective method for this task deserves attention.

Fig. 1. The depthwise separable convolution, where "Conv." indicates convolution: a depthwise convolution followed by a pointwise convolution.
B. Lightweight Network Design

1) Depthwise Separable Convolution:
As a novel neural network layer, DSC has achieved great success in practical applications because of its low complexity. It was initially introduced in [35] and subsequently used in MobileNets [36]. As shown in Fig. 1, DSC divides the calculation of a standard convolution into two parts: depthwise convolution and pointwise convolution. Different from standard convolution, depthwise convolution decomposes the calculation into grouped convolution to reduce the complexity. Meanwhile, the pointwise convolution is the same as a standard convolution with a 1 × 1 kernel. In other words, the depthwise convolution convolves the separate feature channels, whereas the pointwise convolution combines them to get the output feature maps. These two parts together form a complete DSC.
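As an illustration of this decomposition, the following is a minimal sketch using the Keras API (the framework of our experiments, see Table III). The channel count of 32 matches the F of the proposed model; everything else is illustrative rather than the paper's exact layer configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers

inp = layers.Input(shape=(None, None, 32))

# Depthwise stage: one 3x3 filter per input channel (grouped convolution),
# 3*3*32 = 288 weights. No bias here, since it can be merged into the
# pointwise bias (see Section III-A).
dw = layers.DepthwiseConv2D(kernel_size=3, padding="same", use_bias=False)(inp)

# Pointwise stage: a 1x1 standard convolution that mixes the 32 separate
# channel outputs, 32*32 = 1024 weights (+32 biases).
pw = layers.Conv2D(filters=32, kernel_size=1, padding="same")(dw)

# For comparison, a standard 3x3 convolution would need 3*3*32*32 = 9216
# weights. Keras also provides the fused equivalent as a single layer:
fused = layers.SeparableConv2D(filters=32, kernel_size=3, padding="same")(inp)
```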
2) Knowledge Distillation and Transfer:
Previous studies [37]–[39] have shown that the "knowledge" in pre-trained models can be transferred to another model. Hinton et al. [37] proposed a distillation method that uses a teacher model to generate a "soft target", which helps a student model with a similar structure perform better on a classification task. Besides softening the target in classification tasks, other methods [38], [39] use the intermediate representations of the pre-trained model to transfer the "knowledge". For example, Zagoruyko et al. [38] devised a method called attention transfer (AT) that improves the student model by letting it mimic the attention maps of a teacher model. Meanwhile, Huang et al. [39] designed a loss function that minimizes the maximum mean discrepancy (MMD) metric between the distributions of the teacher and the student model, where MMD is a distance metric for probability distributions [56].

III. PROPOSED CNN-BASED FILTER
A. Network Structure
As shown in Fig. 2, we design a network structure that is shared by the teacher and the proposed model. This structure is composed of convolutions, BN layers, and the ReLU activation [57]. The backbone of this structure is K layers of DSC with F feature maps each. The input to this structure is the HM reconstruction without filtering, and the output is the filtered reconstructed samples. The last part is a standard convolution with only one feature map, and we add the reconstructed samples to the output, inspired by residual learning [51]. The depthwise and the standard convolution kernels are both 3 × 3. Every convolution is followed by a ReLU except for the last one. The reason for choosing ReLU instead of other advanced activation functions is that ReLU has lower complexity while providing considerable nonlinearity. In our implementation, the values of K and F are 24 and 64 for the teacher model, and 9 and 32 for the proposed model. The parameters of the proposed model are described in Table I.

We use the BN layer in the training phase because it improves the back-propagation of the gradients. What's more, both BN and convolution are linear computations on the tensors in the proposed model. Therefore, the BN can be merged into the convolution to further reduce the computation during the inference phase. As shown in (1), the depthwise convolution output $\chi_{dwConv}$ can be formulated as

$\chi_{dwConv} = w_{dwConv} * \chi$   (1)

where $*$ indicates the convolution operation, $w_{dwConv}$ is the kernel, and $\chi$ is the depthwise convolution input. Similarly, the pointwise convolution output $\chi_{pwConv}$ can be written as

$\chi_{pwConv} = w_{pwConv} * \chi_{dwConv} + b_{pwConv}$   (2)

where $w_{pwConv}$ and $b_{pwConv}$ denote the kernel and bias. It is noticeable in (1) that the depthwise convolution has no bias; this is because the bias $b_{dwConv}$ can be merged into $b_{pwConv}$ when there is no activation between the depthwise and pointwise convolutions. After convolution, the output of BN is obtained by (3). (We use the $*$ operation here because, after simplification, the calculation of BN is equivalent to that of a depthwise convolution.)

$\chi_{bn} = \gamma * \left( \dfrac{\chi_{pwConv} - mean}{\sqrt{var + \epsilon}} \right) + \beta$   (3)

Substituting (2) into (3), we obtain (4) as follows:

$\chi_{bn} = \widehat{w}_{pwConv} * \chi_{dwConv} + \widehat{b}_{pwConv}$   (4)
where $\widehat{w}_{pwConv}$ and $\widehat{b}_{pwConv}$ in (4) are

$\widehat{w}_{pwConv} = \dfrac{\gamma * w_{pwConv}}{\sqrt{var + \epsilon}}$   (5)

$\widehat{b}_{pwConv} = \dfrac{\gamma * (b_{pwConv} - mean)}{\sqrt{var + \epsilon}} + \beta$   (6)

In (5) and (6), $\gamma$ and $\beta$ are the trainable parameters of BN, while $mean$ and $var$ are its non-trainable parameters. The hyper-parameter $\epsilon$ is a small positive number that prevents division-by-zero errors. In the inference phase, we use $\widehat{w}_{pwConv}$ and $\widehat{b}_{pwConv}$ to replace the weight $w_{pwConv}$ and bias $b_{pwConv}$ in the pointwise convolution, thus merging the BN into the DSC and reducing the model complexity.

Fig. 2. The architecture of the teacher model and the proposed model, where "Rec." indicates reconstructed pixels. The top-right and bottom-right are the teacher model and the proposed student model, respectively; both are stacks of depthwise separable convolution + (batch normalization) + ReLU blocks followed by a standard convolution and an element-wise addition of the input. The rectangle on the right implies the knowledge transfer: the attention loss between teacher and student attention maps guides the initialization of the parameters.

TABLE I
DESCRIPTION OF THE PARAMETERS IN THE PROPOSED MODEL

Index      | Block1         | Block2    | Block3    | Std Conv.^a | Sum
Parameters | 73 + 2 × 1,344 | 3 × 1,344 | 3 × 1,344 | 289         | 11,114

^a Standard convolution.
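A minimal NumPy sketch of the merge in Eqs. (5) and (6), assuming Keras-style weight shapes; the function and variable names are ours, not from the reference software.

```python
import numpy as np

def fold_bn_into_pointwise(w_pw, b_pw, gamma, beta, mean, var, eps=1e-3):
    """Fold a BatchNormalization layer into the preceding pointwise
    convolution, following Eqs. (5) and (6).

    w_pw : pointwise kernel, shape (1, 1, C_in, C_out)
    b_pw, gamma, beta, mean, var : per-output-channel vectors, shape (C_out,)
    """
    scale = gamma / np.sqrt(var + eps)     # per-channel BN scaling
    w_hat = w_pw * scale                   # Eq. (5): broadcasts over C_out
    b_hat = scale * (b_pw - mean) + beta   # Eq. (6)
    return w_hat, b_hat

# At inference, (w_hat, b_hat) simply replace (w_pw, b_pw) and the BN layer
# is removed, so the filtered output is unchanged but computed more cheaply.
```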
B. Standard Convolution of the Proposed Structure

In this subsection, the last part of the proposed structure is detailed, because the standard convolution needs fewer calculations than DSC when the number of convolution output channels is only one. It is worth noting that the last convolution of the proposed model is a standard convolution, which is not consistent with the backbone of the proposed model. The DSC consists of two steps: depthwise convolution and pointwise convolution. The depthwise convolution is a simplification of the standard convolution that reduces the amount of computation while preserving the ability to convolve the input feature maps. Meanwhile, the pointwise convolution is equivalent to a standard convolution with a 1 × 1 kernel; it is utilized to fuse the different depthwise convolution outputs. According to their computing methods, the ratio $r$ of the calculation of the DSC to that of the standard convolution is

$r = \dfrac{K_W K_H C_I W H + C_I C_O W H}{K_W K_H C_I C_O W H} = \dfrac{1}{C_O} + \dfrac{1}{K_W K_H}$   (7)

where $W$ and $H$ are the width and height of the input frame, $K_W$ and $K_H$ are the width and height of the convolution kernel, and $C_I$ and $C_O$ are the numbers of feature maps for the convolution input and output, respectively. In our proposed model, $C_O = 1$ and $K_W = K_H = 3$, so $r = \frac{1}{C_O} + \frac{1}{K_W K_H} = \frac{10}{9}$, which is bigger than 1. This means a DSC would consume more computing resources than a standard convolution here. The extra calculation is caused by the pointwise convolution, which is utilized to combine feature maps. However, the standard convolution can also combine features, which means the extra calculation of the pointwise convolution is unnecessary. Therefore, we choose a standard convolution at the end of the model to avoid meaningless calculations.
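Putting Sections III-A and III-B together, a sketch of the proposed student model in Keras could look as follows. The exact block boundaries, BN placement, and training details are simplified assumptions based on the description above, not a verbatim copy of the implementation.

```python
from tensorflow.keras import layers, models

def build_student(num_dsc_layers=9, feature_maps=32):
    """A sketch of the student filter: K = 9 DSC layers with F = 32 feature
    maps, a final 3x3 standard convolution back to one channel (cheaper
    than a DSC when C_O = 1, see Eq. (7)), and a global residual add."""
    rec = layers.Input(shape=(None, None, 1))      # unfiltered HM reconstruction
    x = rec
    for _ in range(num_dsc_layers):
        x = layers.SeparableConv2D(feature_maps, 3, padding="same")(x)
        x = layers.BatchNormalization()(x)         # merged into the conv at inference
        x = layers.ReLU()(x)
    residual = layers.Conv2D(1, 3, padding="same")(x)   # last conv: no activation
    out = layers.Add()([rec, residual])                 # residual learning
    return models.Model(rec, out)

student = build_student()
```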
C. Proposed Initialization and Training Scheme

In this subsection, we introduce the training process and loss functions of the proposed network. In most cases, a suitable initialization of the parameters helps the model converge to a better minimum. Inspired by transfer learning, a pre-trained teacher model is used to guide the initialization of the parameters in the proposed model. By using such an initialization, we hope the proposed model can produce an output similar to that of the teacher model before the real training begins. The teacher model is pre-trained with the mean square error (MSE) loss between the teacher output $Y_T$ and the original pixels $Y_O$:

$L_T = \dfrac{1}{N} \sum_{i=1}^{N} \| Y_T^i - Y_O^i \|^2$   (8)

After the training of the teacher model, we use its intermediate outputs to guide the parameter initialization of the proposed model. This process is denoted by the bold lines in Fig. 2. Because vanishing gradients may lead to insufficient training of the shallow layers, the teacher model is divided into differently-sized blocks to produce the intermediate hints. For the metric of the distance between the teacher and the proposed student model, we try two forms: MMD [39] and the attention loss [38]. The loss function $L_{MMD}(F_T, F_S)$ with a linear kernel function ($k(x, y) = x^T y$) can be written as follows:

$L_{MMD}(F_T, F_S) = \left\| \dfrac{1}{C_T} \sum_{i=1}^{C_T} \dfrac{f_T^i}{\|f_T^i\|_2} - \dfrac{1}{C_S} \sum_{j=1}^{C_S} \dfrac{f_S^j}{\|f_S^j\|_2} \right\|_2^2$   (9)

where $F$ represents the attention map, $f$ indicates a single feature map, $C$ is the number of feature maps, and the subscripts $T$ and $S$ identify the teacher and student model. Meanwhile, the loss function $L_{AT}(F_T, F_S)$ of attention transfer (AT) [38] can be written as follows:

$L_{AT}(F_T, F_S) = \left\| \dfrac{\sum_{i=1}^{C_T} |f_T^i|^p}{\left\| \sum_{i=1}^{C_T} |f_T^i|^p \right\|_2} - \dfrac{\sum_{j=1}^{C_S} |f_S^j|^p}{\left\| \sum_{j=1}^{C_S} |f_S^j|^p \right\|_2} \right\|_2^2$   (10)

We set $p$ to 2 in our implementation; note that when $p = 1$, these two methods are similar except for their normalization methods [39]. After the initialization, we start the real training process, using the MSE loss $L_S$ in (11) to train the proposed model, where $Y_S$ indicates the output of the proposed model:

$L_S = \dfrac{1}{N} \sum_{i=1}^{N} \| Y_S^i - Y_O^i \|^2$   (11)
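A sketch of the attention-transfer loss of Eq. (10) in TensorFlow. Here both hints share the same spatial size (neither model downsamples), and p = 2 as in our implementation; the function names are ours.

```python
import tensorflow as tf

def attention_map(feats, p=2):
    """Collapse features (N, H, W, C) into an l2-normalized spatial
    attention vector by summing |f|^p over the channel axis."""
    amap = tf.reduce_sum(tf.abs(feats) ** p, axis=-1)      # (N, H, W)
    flat = tf.reshape(amap, [tf.shape(amap)[0], -1])       # (N, H*W)
    return flat / (tf.norm(flat, axis=-1, keepdims=True) + 1e-12)

def at_loss(teacher_feats, student_feats, p=2):
    """Attention-transfer loss of Eq. (10) between one pair of hints;
    in training it would be summed over the chosen hint positions."""
    diff = attention_map(teacher_feats, p) - attention_map(student_feats, p)
    return tf.reduce_mean(tf.reduce_sum(diff ** 2, axis=-1))
```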
In summary, the whole process can be divided into the following steps.

Algorithm 1: The process of building the trained proposed model.

Input: The dataset pair of HM reconstruction samples $X$ and original samples $Y_O$;
Output: The trained proposed model;
1: Construct the teacher model $T$ and train it for $n_1$ epochs with the MSE loss $L_T$;
2: Extract the attention maps $F_T$ from the trained $T$;
3: Construct the student model $S$ with BN and train it for $n_2$ epochs with $L_{AT}(F_T, F_S)$ or $L_{MMD}(F_T, F_S)$;
4: Train $S$ with the MSE loss $L_S$ for $n_3$ epochs;
5: Calculate $\widehat{w}_{pwConv}$ and $\widehat{b}_{pwConv}$ for $S$;
6: Remove the BN from $S$;
7: Use $\widehat{w}_{pwConv}$ and $\widehat{b}_{pwConv}$ to replace the weight $w_{pwConv}$ and bias $b_{pwConv}$ in the pointwise convolutions of $S$;
8: return $S$;

IV. PROPOSED RESIDUAL MAPPING FOR THE CNN-BASED FILTERING
A. Analysis of CTU-level and Frame-level Control
From the size of the filtered samples, filtering methods can be divided into CTU-level (block-level) and frame-level methods. Compared with CTU-level control, frame-level control has two main advantages in CNN-based filter design, concerning the required computational resources and the video quality. In this subsection, the difference is analyzed from the perspectives of the padding methods and the filter kernels.

Firstly, to keep the size of the input frames unchanged, the CNN-based filter needs to pad the boundaries of the input with some samples. There are usually two padding ways: valid padding (padding with reconstructed samples) and same padding (padding with zero samples). In one case, if the CTUs are padded with reconstructed pixels to maintain the same accuracy as frame-level filtering, most networks need to pad the input block with plenty of pixels and require considerable extra calculation. Fig. 3 intuitively shows the difference in the amount of calculation between valid and same padding. The quantitative calculations [58] are illustrated in Table II (we assume that both of their output sizes for the filtered samples are 64 × 64); it can be found that valid padding (see the "Valid" columns) in the works [26], [30], [33] brings a considerable complexity increase over same padding (see the "Same" columns). In the other case, if same padding is selected, it causes calculation errors around the boundaries, as shown in Fig. 4. We assume that the size of the block control is $h \times h$ and that the width of the boundary area affected by the padding is $a$. The proportion $p_{fc}$ of affected pixels under frame-level control is calculated as follows:

$p_{fc} = 1 - \dfrac{(W - 2a)(H - 2a)}{WH} = \dfrac{2a(W + H - 2a)}{WH}$   (12)

Similarly, the proportion $p_{bc}$ of affected pixels under block-level control can be approximated as follows (no incomplete CTUs are considered):

$p_{bc} = \dfrac{4a(h - a)}{h^2} \approx \dfrac{4a}{h}$   (13)

It can be found from (13) that the area affected by same padding is approximately proportional to the perimeter of the filtered samples. Therefore, frame-level control, with its higher area-to-perimeter ratio, is less affected than block-level control. According to (12) and (13), for the HEVC test sequences, the same padding of our network affects an average of 45% of the pixels under CTU-level control, whereas under frame-level control it affects only 3%. Therefore, choosing frame-level control lays a solid foundation for the application of the CNN-based filter.

Secondly, the frames filtered by frame-level control have the property of integrity. Frame-level control uses the same kernel for the filtering of the entire frame, whereas CTU-level control may use different kernels for two consecutive CTUs, which may lead to artificial errors at the boundaries. As shown in Fig. 4, two consecutive CTUs with different filtering strategies have some errors along their common boundary because of the different kernels used in the filtering, especially when one of the CTUs uses the learning-based filter while the other one does not. This further demonstrates the advantages of frame-level control.

In summary, for the design of lightweight CNN-based filters, frame-level control has some advantages over block-level control. On the one hand, compared with frame-level control, CTU-level control leads to extra calculation cost with valid padding or to calculation errors with same padding. On the other hand, frame-level control has the property of integrity, which brings better subjective quality. To reduce the padding error and the complexity brought by the multi-layer neural network, we built our CNN-based in-loop filter on frame-level control. However, directly using frame-level control is weak because it only has the two states of using or not using the filter, so we need additional methods to improve its performance.

Fig. 3. The diagrams of convolution with different padding methods: (a) convolution with valid padding, which requires extra calculation on the padded input samples; (b) convolution with same padding, where the zero samples affect the pixels near the boundaries.

Fig. 4. The diagrams of the different control methods: (a) CTU-level control, where a filtered CTU next to an unfiltered CTU creates an artificial boundary; (b) frame-level control, where the whole W × H frame is filtered.

TABLE II
COMPLEXITY COMPARISON OF CTU-LEVEL CONTROL BETWEEN VALID PADDING AND SAME PADDING

Items          | RHCNN [30]      | Jia et al. [33] | VR-CNN [26]
Padding type   | Valid  | Same   | Valid | Same    | Valid | Same
FLOPs^a (G)    | 16.21  | 10.89  | 2.02  | 1.49    | 0.25  | 0.22
MAdd^b (G)     | 32.38  | 21.76  | 4.04  | 2.97    | 0.49  | 0.44
Memory^c (MB)  | 91.06  | 60.11  | 25.43 | 18.02   | 5.84  | 5.02
MemR+W^d (MB)  | 193.05 | 130.36 | 55.09 | 39.42   | 13.88 | 11.99

^a Theoretical amount of floating-point arithmetic. ^b Theoretical amount of multiply-adds. ^c Memory usage. ^d Memory read/write.
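The two proportions in Eqs. (12) and (13) are easy to evaluate numerically. In the sketch below, the affected border width a is an assumed receptive-field radius, not a value given explicitly in the paper; with a = 10 a 1080p frame reproduces a figure close to the 3% reported above, while a 64 × 64 CTU has more than half of its area affected.

```python
def affected_fraction_frame(width, height, a):
    # Eq. (12): fraction of pixels influenced by zero padding when the
    # whole width x height frame is filtered in one pass.
    return 1 - (width - 2 * a) * (height - 2 * a) / (width * height)

def affected_fraction_block(h, a):
    # Eq. (13): the same fraction for an h x h block, ignoring partial CTUs.
    return 4 * a * (h - a) / (h * h)

a = 10                                            # assumed receptive-field radius
print(affected_fraction_frame(1920, 1080, a))     # ~0.029 for a 1080p frame
print(affected_fraction_block(64, a))             # ~0.53 for a 64x64 CTU
```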
B. Residual Mapping

In this subsection, a novel post-processing module, RM, is proposed to improve the performance of the frame-level control based CNN filter. It can effectively alleviate the over-smoothing problem [33], [47] of inter frames. Besides, we find in Section V-D that it also brings a considerable improvement to intra frames. Most trained neural networks are fitted to a certain training set. Since the distribution of the training data is often very complicated, the training is actually a trade-off over the data set. For a specific image, the trained filter may be under-fitted or over-fitted, which may cause distortion or blur for a learning-based filter. What's more, if we want to use a neural network trained with intra samples for the filtering of inter samples, this phenomenon becomes more serious because of the difference between the distributions of the intra and inter datasets. With this in mind, we propose to use a parametric RM after the learning-based filter (which is some sort of non-parametric filter) to improve its generalization ability. Inspired by the potential correlation between the distortion and the learned residual shown in Fig. 5, we handle this filter from the perspective of restoring the distortion from the learned residual, which is equivalent to improving the quality of the distorted frames. The distortion $R_O$ is defined as the difference between the original samples $Y_O$ and the reconstruction after de-blocking, $X$:

$R_O = Y_O - X$   (14)

Fig. 5. A frame from the CLIC dataset [59] coded with HM-16.16 at QP 37. The original frame, the distortion, and the learned residual are shown in (a), (b), and (c), respectively.

Fig. 6. The comparison of different filtering mechanisms ("RaceHorses 416x240", QP 22, LDP configuration). Linear, quadratic, and cubic represent linear, quadratic, and cubic mapping functions, respectively. We can see in the red box that the performance of using the CNN filter directly is not satisfactory and even leads to a decrease in PSNR.

Fig. 7. The schemes of the different frameworks with the CNN-based filter: (a) serial structure (DB, CNN-filter, SAO); (b) parallel structure (DB and SAO alongside the CNN-filter); (c) proposed structure (DB, CNN-filter, RM, SAO).
Similarly, the learned residual $R_S$ is defined as the difference between the output of the learning-based filter and $X$:

$R_S = Y_S - X$   (15)

A function $f_\lambda(\cdot)$ with parameters $\lambda$ is designed as the parametric filter to map $R_S$ to $R_O$. We choose the MSE as the metric:

$\lambda = \arg\min_{\lambda} \| f_\lambda(R_S) - R_O \|^2$   (16)

We should use a model with a small number of parameters to construct $f_\lambda(\cdot)$, so that it is convenient to encode the parameters $\lambda$ into the bitstream to ensure the consistency of encoding and decoding. For the expression form of $f_\lambda(\cdot)$, we have tried linear functions and polynomial functions, as shown in Fig. 6. From the red box on the left, it can be found that only using the CNN filter (see the red dotted line) may decrease the coding performance, which proves that directly using CNN filters for inter frames may degrade the video quality, and that the performance is improved after adopting RM. It is noticeable that there is little difference in performance between the different polynomial functions, so we choose a simple linear function to build RM:

$\lambda = \arg\min_{\lambda} \| \lambda R_S - R_O \|^2$   (17)

We then add $X$ to the output of RM, $\widehat{R}_S$, to get the filtered frame $\widehat{Y}_S$. After sending it to SAO, the entire filtering process is completed:

$\widehat{Y}_S = X + \widehat{R}_S = X + \lambda R_S$   (18)

We quantize the candidate $\lambda$ with $n$ bits for each component, where $\lambda = i/(2^n - 1)$, $i \in \{0, 1, \ldots, 2^n - 1\}$. In the implementation, the number of required bits $n$ is set to 5, so each frame needs 15 bits for the RM module. A rate-distortion optimization (RDO) process is designed to find the best $\lambda$, and the regular mode of CABAC is used to code $\lambda$. RM needs neither specific models for inter frames nor additional classifiers for each CTU. What's more, it is independent of the proposed network and can be combined with other learning-based filters to alleviate the over-smoothing problem as well.

Different from the previous strategy [47] of choosing between traditional filtering and learning-based filtering, RM uses a serial structure and makes full use of both kinds of filtering, as shown in Fig. 7. From the perspective of reconstructed frames, the proposed RM can be interpreted as a post-processing module that fully utilizes the advantages of both the distorted reconstruction and the learned filtered output; this full use of the two aspects gives RM its excellent performance. For example, assume that the reference frame has been filtered by a learning-based filter. If the current frame and the reference frame are almost identical, the current frame does not need to use all the filters; conversely, if the current frame and the reference frame are completely different, the distorted residue easily produces artificial imprints, so the filters should be used in this case. For a specific frame, however, it is often difficult to obtain an accurate judgment about whether to use the filters from its encoded information, such as residuals or motion vectors. Considering the good generalization ability of the traditional filters, we keep them working and focus on the CNN filter. We therefore introduce the parametric module RM, which uses an RDO process to give the CNN filter an appropriate filtering strength. From (16), it can be observed that the filtering strength varies with $\lambda$, so we can traverse all of the candidate $\lambda$ values and code the one with the smallest reconstruction error into the bitstream. We could also differentiate the objective function to obtain the optimal parameters and code the quantized parameters into the bitstream. In this case, we need to consider the influence of parameter quantization: mapping functions that are sensitive to quantization noise, such as high-order polynomials, should be abandoned. Otherwise, larger quantization errors may appear in the decoded frames.

TABLE III
EXPERIMENTAL ENVIRONMENT

Items            | Specification
Optimizer        | Adam [60]
Processor        | Intel Xeon Gold 6134 at 3.20 GHz
GPU              | NVIDIA GeForce RTX 2080
Operating system | CentOS Linux release 7.6.1810
HM version       | 16.16
DNN framework    | Keras 2.2.4 [61] and TensorFlow 1.12.0 [62]
V. EXPERIMENTAL RESULTS

A. Experimental Setting
For the experiments, we mainly focus on objective quality, subjective quality, complexity, and ablation studies to illustrate the performance of our model. Nine hundred pictures from DIV2K [63] are cropped to a fixed resolution and then down-sampled; these two sets of pictures are spliced into two videos as our training sets. Only the luminance component is used for training, while the chrominance components are also tested with the proposed model. The patch size in training is 32 × 32 for H.265/HEVC and 64 × 64 for H.266/VVC, which is consistent with the largest size of the TU. Considering that reconstructed images with different QPs often have different degrees of distortion and artifacts, the whole QP range is divided into four bands: below 24, 25 to 29, 30 to 34, and above 35, and four proprietary models are trained, one for each QP band. The parameter initialization method is the normal distribution [64] for both the teacher model and the proposed model. The training epochs $n_1$ and $n_3$ are both set to 50. We use more training epochs in the special initialization phase for the models with higher QPs, because there are often more artifacts in the reconstructed images with higher QPs; specifically, $n_2$ is set smaller for the lower QPs and larger for the higher QPs. After the training phase, we save the trained model and call it for inference in the HEVC reference software (HM) and the VVC test model (VTM). In the test phase, the first 64 frames of the HEVC test sequences are used to evaluate the generalization ability of our model. We test four different configurations with default settings, including all-intra (AI), low-delay-B (LDB), low-delay-P (LDP), and random-access (RA), for the H.265/HEVC anchor. For the H.266/VVC anchor, we test the default AI and RA configurations. Four typical QPs from the common test conditions are tested: 22, 27, 32, and 37. The other important test conditions are shown in Table III. For a fair comparison with previous works, we use the coding results from their original papers; the complexities of the reference papers are tested on our local server to avoid the influence of the hardware platforms.
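The QP-band partition above maps directly to a model lookup. A small sketch follows, reading the bands as QP <= 24, 25-29, 30-34, and QP >= 35; the handling of the edge QPs 24 and 35 is our assumption, since the text leaves it implicit.

```python
def qp_band(qp):
    """Select which of the four trained models handles a given QP.
    Band edges follow the partition described above; the inclusive
    treatment of the edge QPs is an assumption."""
    if qp <= 24:
        return 0
    elif qp <= 29:
        return 1
    elif qp <= 34:
        return 2
    return 3

# e.g. models[qp_band(slice_qp)](reconstruction) applies the right filter.
```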
B. Experiment on H.265/HEVC

1) Objective Evaluation: In this subsection, an objective evaluation is conducted to assess the performance of the proposed model. The experimental results compared with the HM-16.16 anchor are shown in Table IV. For the luminance component, the proposed model achieves 6.3%, 4.5%, 5.4%, and 5.7% BD-rate reduction compared with the HEVC baseline under the AI, LDB, LDP, and RA configurations, respectively. For the chrominance components, the proposed model achieves even more BD-rate reduction than for the luminance component. This demonstrates the generalization ability of the proposed model, because we only use the luminance component of intra samples for training. Furthermore, comparisons with the previous works [26], [33] are conducted, and the BD-rate reduction is shown in Table V. It can be seen that our model achieves more BD-rate reduction in the AI configuration.

For the performance evaluation of the inter configurations, we compare our proposed model with frame-level control [25], CTU-level control [45], and Jia et al. [33], as shown in Table VI. For a fair comparison, we select same padding and use the proposed model to test the different control methods. From the experimental results, it can be seen that our proposed model achieves about 1% extra BD-rate reduction over both CTU-level and frame-level control for all inter configurations. Compared with Jia et al. [33], our model achieves a comparable BD-rate reduction in the inter configurations. For the chrominance components, our model achieves about 3% extra BD-rate reduction, which further demonstrates its generalization ability.
2) Subjective Evaluation:
We also conduct a subjective evaluation, as shown in Fig. 8 and Fig. 9. It can be seen from the experimental results that our model has a strong artifact-removal capability. First, we re-deploy the proposed model in HM-16.9 for a fair subjective comparison with Jia et al. [33]. From Fig. 8, it can be found that the various kinds of artifacts in (a) are eliminated by the proposed model, and the man's face looks smoother and fuller. At the same time, some vertical blocking effects are produced by Jia et al. [33], probably because it uses different filters for consecutive CTUs, whereas our proposed model uses the same filter for the whole image and introduces no additional boundaries. Besides, the man's eyes seem to be blurred by [33], which degrades the visual quality. Second, the subjective evaluation for the inter frames is conducted in Fig. 9. The default HM and HM with CTU-level control [27] are used as the anchors. As shown in Fig. 9, the contouring and blocky artifacts on the number are eliminated by the proposed model. For CTU-level control [27] based filtering, the subjective quality of this frame is reduced due to the artificial boundaries on the knee, whereas our proposed model introduces no boundaries there and achieves a better visual quality. To sum up, because our proposed method makes full use of the frame-level filtering strategy, it has significantly better visual effects than previous CTU-based methods.
3) Complexity Analysis:
As shown in Table V, we compare the complexity of Jia et al. [33], VR-CNN [26], and our proposed model from two aspects: computational complexity and storage consumption. Firstly, for the coding complexity evaluation, we use the following equation to calculate $\Delta T$:

$\Delta T = \dfrac{T'}{T}$   (19)

where $T'$ and $T$ denote the HM coding time with and without the learning-based filter, respectively. The FLOPs in Table V are tested for a frame with a resolution of 720p. Compared with VR-CNN [26], the FLOPs of our model are reduced by 79.1%, the decoding complexity is reduced by approximately 50%, and the encoding complexity is reduced by 4%. The processing time of the proposed model is almost the same for the encoder and the decoder; the difference in relative time arises because the network inference time accounts for a small proportion of the encoding complexity but a comparable one for the decoding.

In terms of storage consumption, compared with [26], the number of trainable parameters in the proposed model is reduced by 79.6%. The reduction of the model size is almost the same, because we use the same precision (float32) to save the models. The main reason why our model has relatively fewer parameters is that the design of the proposed model focuses more on complexity than on performance. For example, we use DSC as the backbone of the proposed model, whereas previous works [26], [33] utilize standard convolution. Meanwhile, we also use several useful methods to limit the model size while maintaining the performance, including the BN merge and the special initialization of parameters. What's more, our proposed model only needs one learning-based network for both intra and inter frames, so there is no need for additional models in practical applications. Compared with previous works that need multiple models or classifiers, our proposed method effectively reduces the required storage consumption, benefiting from the RM module.
TABLE IV
BD-RATE REDUCTION OF THE PROPOSED METHOD OVER THE HM-16.16 ANCHOR

Class  | Sequence        | AI (Y / U / V)      | LDB (Y / U / V)     | LDP (Y / U / V)      | RA (Y / U / V)
ClassA | Traffic         | -7.3% -3.4% -4.7%   | -4.6% -2.4% -0.9%   | -4.3% -3.4% -1.5%    | -6.4% -3.6% -2.7%
ClassA | PeopleOnStreet  | -6.8% -7.1% -6.9%   | -4.5% -0.6% -0.9%   | -3.1% -4.4% -2.4%    | -6.1% -4.6% -5.0%
ClassB | Kimono          | -4.9% -2.6% -2.5%   | -4.6% -7.5% -4.7%   | -7.3% -11.5% -6.7%   | -4.2% -5.7% -3.4%
ClassB | ParkScene       | -5.5% -3.2% -2.3%   | -1.9% -0.3% -0.7%   | -1.5% -0.5% -0.7%    | -3.8% -0.6% -0.3%
ClassB | Cactus          | -5.3% -4.1% -10.1%  | -4.4% -3.7% -4.4%   | -5.4% -5.6% -5.8%    | -6.8% -9.5% -7.2%
ClassB | BasketballDrive | -4.3% -8.9% -11.7%  | -3.4% -4.2% -7.8%   | -6.0% -9.2% -11.7%   | -4.4% -4.4% -8.9%
ClassB | BQTerrace       | -3.7% -4.3% -4.8%   | -6.1% -2.1% -2.4%   | -10.6% -4.5% -3.9%   | -8.8% -3.8% -3.3%
ClassC | BasketballDrill | -8.0% -11.7% -14.1% | -2.8% -4.9% -4.6%   | -3.4% -5.6% -6.2%    | -4.2% -8.0% -9.7%
ClassC | BQMall          | -6.0% -6.3% -7.2%   | -3.8% -3.2% -4.7%   | -4.6% -4.5% -5.6%    | -5.1% -4.7% -5.1%
ClassC | PartyScene      | -3.7% -4.8% -5.7%   | -0.8% -0.1% -0.2%   | -1.8% -0.4% -0.4%    | -1.7% -1.4% -2.0%
ClassC | RaceHorses      | -3.9% -6.9% -12.0%  | -4.1% -6.6% -11.3%  | -4.2% -7.9% -12.7%   | -4.7% -9.7% -14.2%
ClassD | BasketballPass  | -6.5% -7.3% -10.3%  | -4.4% -3.3% -4.6%   | -4.3% -4.8% -5.8%    | -3.9% -4.7% -6.1%
ClassD | BQSquare        | -4.2% -3.0% -6.8%   | -2.4% -1.6% -2.8%   | -4.1% -1.8% -2.9%    | -2.4% -1.0% -2.9%
ClassD | BlowingBubbles  | -5.3% -9.3% -9.8%   | -3.6% -5.8% -2.1%   | -3.9% -5.4% -1.8%    | -4.0% -6.1% -4.4%
ClassD | RaceHorses      | -7.5% -10.5% -14.6% | -6.3% -5.2% -10.3%  | -6.7% -7.4% -10.8%   | -6.8% -9.4% -12.2%
ClassE | Vidyo1          | -8.9% -8.7% -10.5%  | -6.7% -9.0% -9.6%   | -7.4% -9.4% -8.9%    | -8.1% -8.4% -9.7%
ClassE | Vidyo3          | -7.0% -5.2% -5.3%   | -4.0% -5.9% -3.1%   | -4.6% -6.3% -2.5%    | -6.5% -4.1% -5.1%
ClassE | Vidyo4          | -6.3% -10.1% -10.8% | -3.8% -11.5% -10.9% | -3.9% -12.1% -11.2%  | -5.6% -9.8% -10.1%
ClassE | FourPeople      | -9.4% -8.1% -9.0%   | -8.6% -9.2% -9.4%   | -9.0% -9.7% -10.8%   | -9.4% -7.7% -8.1%
ClassE | Johnny          | -8.3% -12.3% -11.0% | -7.0% -11.4% -9.1%  | -9.6% -13.1% -10.7%  | -8.3% -10.9% -9.7%
ClassE | KristenAndSara  | -8.6% -10.2% -11.1% | -7.7% -8.3% -8.6%   | -8.3% -10.1% -11.2%  | -8.2% -8.9% -9.6%
       | Average         | -6.3% -7.0% -8.6%   | -4.5% -5.1% -5.4%   | -5.4% -6.6% -6.4%    | -5.7% -6.1% -6.6%
TABLE V
BD-RATE REDUCTION AND COMPLEXITY (GPU) OF THE PROPOSED METHOD COMPARED WITH PREVIOUS WORKS [26], [33] IN THE AI CONFIGURATION

Sequences  | Jia et al. [33] (Y / U / V / ΔT_enc / ΔT_dec) | VR-CNN [26] (Y / U / V / ΔT_enc / ΔT_dec) | Proposed model (Y / U / V / ΔT_enc / ΔT_dec)
ClassA     | -4.7% -3.3% -2.6% 108.1% 734.9%   | -5.5% -4.7% -4.9% 108.3% 561.1%   | -7.1% -5.4% -5.9% 105.8% 281.0%
ClassB     | -3.5% -2.8% -3.0% 109.0% 659.8%   | -3.3% -3.2% -3.7% 110.3% 505.3%   | -4.8% -4.8% -6.4% 106.2% 265.2%
ClassC     | -3.4% -3.5% -5.0% 113.1% 894.9%   | -5.0% -5.5% -6.9% 113.0% 685.1%   | -5.4% -7.5% -9.9% 106.5% 326.3%
ClassD     | -3.2% -4.7% -6.0% 128.9% 1406.0%  | -5.4% -6.4% -8.1% 121.6% 1047.1%  | -5.9% -7.8% -10.5% 114.4% 548.0%
ClassE     | -5.8% -4.1% -5.2% 112.3% 1110.2%  | -6.5% -5.5% -5.6% 111.1% 836.7%   | -8.1% -9.2% -9.7% 107.2% 401.1%
Average    | -4.1% -3.7% -4.4% 114.3% 961.2%   | -5.1% -5.1% -5.8% 112.9% 727.0%   | -6.3% -7.0% -8.6% 108.0% 364.3%
FLOPs      | 334.84G                           | 50.39G                            | 10.53G
Parameters | 362,753                           | 54,512                            | 11,114
Model size | 1.38MB                            | 220KB                             | ≈45KB
TABLE VI
OVERALL BD-RATE COMPARISON WITH PREVIOUS METHODS [25], [33], [45] IN THE LDB, LDP, AND RA CONFIGURATIONS

Methods                           | LDB (Y / U / V)    | LDP (Y / U / V)    | RA (Y / U / V)
Jia et al. [33]                   | -6.0% -2.9% -3.5%  | -4.7% -1.0% -1.2%  | -6.0% -3.2% -3.8%
Our network + RM                  | -4.5% -5.1% -5.4%  | -5.4% -6.6% -6.4%  | -5.7% -6.1% -6.6%
Our network + Frame control [25]  | -3.7% -3.3% -3.2%  | -4.4% -4.6% -3.9%  | -4.6% -4.8% -5.1%
Our network + CTU control [45]    | -4.1% -4.4% -4.9%  | -4.6% -5.8% -5.9%  | -4.5% -5.1% -5.8%

Fig. 8. Visual quality comparison of Jia et al. [33] and the proposed model for the AI configuration: (a) HM Rec., (b) Jia et al. [33], (c) proposed model. The test QP is 37 and this is the 1st frame of FourPeople (anchor HM-16.9).

Fig. 9. Visual quality comparison of CTU-level control [27] and the proposed model for the RA configuration: (a) HM Rec., (b) CTU-level control [27], (c) proposed model. The test QP is 37 and this is the 16th frame of RaceHorses (anchor HM-16.16).
TABLE VII
BD-RATE REDUCTION AND COMPUTATIONAL COMPLEXITY (GPU) OF THE PROPOSED METHOD OVER THE VTM-6.3 ANCHOR

Class  | Sequence        | AI (Y / U / V / ΔT_enc / ΔT_dec)    | RA (Y / U / V / ΔT_enc / ΔT_dec)
ClassA | Traffic         | -1.6% -0.2% -0.4% 100.7% 234.1%     | -1.1% -0.7% -0.5% 99.6% 357.0%
ClassA | PeopleOnStreet  | -1.3% -0.4% -0.3% 98.0% 225.4%      | -0.9% -0.1% -0.2% 99.7% 266.3%
ClassB | Kimono          | -0.3% 0.1% -0.3% 104.9% 317.5%      | -0.2% 0.0% -0.4% 99.8% 319.3%
ClassB | ParkScene       | -1.9% 0.1% -0.1% 108.3% 232.4%      | -1.4% 0.3% -0.2% 101.2% 302.1%
ClassB | Cactus          | -1.3% -0.5% -0.8% 100.4% 244.5%     | -1.4% -1.8% -1.7% 103.1% 343.9%
ClassB | BasketballDrive | -0.3% -0.8% -1.0% 103.7% 282.7%     | -0.4% -0.8% -0.5% 100.5% 321.0%
ClassB | BQTerrace       | -1.0% -0.6% -0.6% 101.6% 228.4%     | -1.9% -1.6% -1.5% 101.6% 313.0%
ClassC | BasketballDrill | -2.7% -3.8% -5.5% 101.6% 219.2%     | -1.6% -3.4% -2.9% 102.9% 270.8%
ClassC | BQMall          | -2.2% -0.8% -0.7% 100.4% 220.6%     | -2.0% -1.1% -0.5% 104.6% 285.1%
ClassC | PartyScene      | -1.8% -1.1% -1.5% 100.5% 198.7%     | -1.3% -1.6% -1.8% 101.7% 242.7%
ClassC | RaceHorses      | -0.9% -1.1% -2.3% 101.6% 233.5%     | -1.1% -1.5% -2.5% 99.5% 243.8%
ClassD | BasketballPass  | -2.1% -1.4% -4.7% 99.4% 407.5%      | -1.2% -2.4% -1.0% 98.0% 406.1%
ClassD | BQSquare        | -3.0% -0.2% -1.0% 103.0% 319.1%     | -3.6% -1.0% -1.6% 105.6% 421.0%
ClassD | BlowingBubbles  | -2.1% -1.4% -1.0% 101.1% 352.0%     | -1.6% -2.3% -2.4% 101.0% 371.1%
ClassD | RaceHorses      | -2.8% -2.7% -4.6% 99.6% 366.6%      | -2.4% -3.1% -6.5% 98.6% 312.1%
ClassE | Vidyo1          | -1.3% -0.1% -0.3% 101.5% 340.3%     | -1.0% -0.1% 0.4% 102.6% 486.6%
ClassE | Vidyo3          | -1.1% 0.2% -0.2% 101.8% 298.8%      | -1.2% 1.5% 0.9% 102.0% 457.4%
ClassE | Vidyo4          | -0.8% -0.3% -0.2% 106.4% 291.0%     | -1.0% 0.3% -1.4% 101.3% 425.5%
ClassE | FourPeople      | -2.1% -0.5% -0.5% 99.2% 263.3%      | -1.8% -0.8% -0.7% 106.2% 449.3%
ClassE | Johnny          | -1.3% -0.4% -0.6% 99.9% 317.9%      | -2.6% -1.2% -1.1% 100.1% 452.4%
ClassE | KristenAndSara  | -1.7% -0.6% -0.7% 101.1% 344.5%     | -1.6% -0.4% -1.4% 100.2% 428.0%
       | Average         | -1.6% -0.8% -1.3% 101.6% 282.8%     | -1.5% -1.0% -1.3% 101.4% 355.9%

TABLE VIII
ABLATION STUDY OF RM (AI, VTM-6.3)

Sequences | Our network (Y / U / V) | Our network + RM (Y / U / V)
ClassA    | -0.5% -0.1% 0.4%        | -1.5% -0.3% -0.4%
ClassB    | 0.3% 1.3% 0.1%          | -1.0% -0.3% -0.5%
ClassC    | -1.6% -1.8% -2.8%       | -1.9% -1.7% -2.5%
ClassD    | -2.6% -2.2% -3.6%       | -2.5% -1.4% -2.8%
ClassE    | -0.3% 2.6% 1.5%         | -1.4% -0.3% -0.4%
Average   | -0.9% 0.3% -0.7%        | -1.6% -0.8% -1.3%

TABLE IX
ABLATION STUDY OF PARAMETER INITIALIZATION (AI, VTM-6.3)

Methods            | ΔPSNR Y (dB) | ΔPSNR U (dB) | ΔPSNR V (dB)
Student            | 0.310        | 0.231        | 0.295
Student + MMD [39] | 0.320        | 0.245        | 0.313
Student + AT [38]  | –            | –            | –
C. Experiment on H.266/VVC
To further evaluate the performance of the proposed model, we use the same test conditions to test it in VTM-6.3. The only difference is that we use the entire DIV2K dataset instead of the down-sampled one to train the proposed model. From the experimental results shown in Table VII, it can be found that our model achieves about 1.6% and 1.5% BD-rate reduction on the luminance component for the AI and RA configurations, respectively. For the chrominance components, it achieves a similar BD-rate reduction. In terms of complexity, the proposed method introduces a negligible increase on the encoding side and about a 3× complexity increase on the decoding side.
D. Ablation Study

1) RM for intra frames:
RM can effectively improve the generalization ability of learning-based filters. The experiments on RM for the inter frames were carried out in Section V-B. Here, based on VTM, we further conduct ablation experiments on intra frames to illustrate the performance of RM; the test setting is the same as before. From the experiments shown in Table VIII, we find that about 0.8% BD-rate reduction is achieved by the RM module. Regarding the performance on Class B, only using the proposed CNN filter may even have a negative effect and leads to a 0.3% BD-rate increment, but its performance is well improved after using RM. For most other classes, the performance is also improved to some extent after using RM.
2) The initialization of parameters:
The 1st frame of all HEVC test sequences is tested, and the overall PSNR increments are shown in Table IX, where the student model without transfer learning is indicated by the "Student" row. MMD and AT in Table IX represent the different transfer learning methods that act on the student model. By comparing the "Student" row with the other rows, we can find that the PSNR of the student model is improved by both MMD and AT. What's more, the improvements on the chrominance components are more obvious than on the luminance component.

VI. CONCLUSION
In this paper, a CNN-based low complexity filter is proposed for video coding. The lightweight DSC merged with batch normalization is used as the backbone. Based on transfer learning, attention transfer is utilized to initialize the parameters of the proposed network. By adding the novel parametric module RM after the CNN filter, the generality of the CNN filter is improved, and the filtering problem of inter frames is handled as well. What's more, RM is independent of the proposed network and can also be combined with other learning-based filters to alleviate the over-smoothing problem. The experimental results show that our proposed model achieves excellent performance in terms of both BD-rate and complexity. For the HEVC test sequences, our proposed model achieves about 1.2% more BD-rate reduction with a 79.1% decrease in FLOPs compared with the VR-CNN anchor. Compared with Jia et al. [33], our model achieves a comparable BD-rate reduction with much lower complexity. Finally, we also conduct experiments on H.266/VVC and ablation studies to demonstrate the effectiveness of the model. Our future work aims at further performance improvement of the learning-based filter in video coding.

REFERENCES
[1] T. Wiegand, G. J. Sullivan, G. Bjontegaard, and A. Luthra, "Overview of the H.264/AVC video coding standard," IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, no. 7, pp. 560–576, 2003.
[2] G. J. Sullivan, J.-R. Ohm, W.-J. Han, and T. Wiegand, "Overview of the high efficiency video coding (HEVC) standard," IEEE Transactions on Circuits and Systems for Video Technology, vol. 22, no. 12, pp. 1649–1668, 2012.
[3] H.266. https://de.wikipedia.org/wiki/h.266/, 2018.
[4] J. Lainema, F. Bossen, W.-J. Han, J. Min, and K. Ugur, "Intra coding of the HEVC standard," IEEE Transactions on Circuits and Systems for Video Technology, vol. 22, no. 12, pp. 1792–1801, 2012.
[5] J.-L. Lin, Y.-W. Chen, Y.-W. Huang, and S.-M. Lei, "Motion vector coding in the HEVC standard," IEEE Journal of Selected Topics in Signal Processing, vol. 7, no. 6, pp. 957–968, 2013.
[6] T. Nguyen, P. Helle, M. Winken, B. Bross, D. Marpe, H. Schwarz, and T. Wiegand, "Transform coding techniques in HEVC," IEEE Journal of Selected Topics in Signal Processing, vol. 7, no. 6, pp. 978–989, 2013.
[7] O. Crave, B. Pesquet-Popescu, and C. Guillemot, "Robust video coding based on multiple description scalar quantization with side information," IEEE Transactions on Circuits and Systems for Video Technology, vol. 20, no. 6, pp. 769–779, 2010.
[8] D. Marpe, H. Schwarz, and T. Wiegand, "Context-based adaptive binary arithmetic coding in the H.264/AVC video compression standard," IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, no. 7, pp. 620–636, 2003.
[9] A. Norkin, G. Bjontegaard, A. Fuldseth, M. Narroschke, M. Ikeda, K. Andersson, M. Zhou, and G. Van der Auwera, "HEVC deblocking filter," IEEE Transactions on Circuits and Systems for Video Technology, vol. 22, no. 12, pp. 1746–1754, 2012.
[10] C.-M. Fu, E. Alshina, A. Alshin, Y.-W. Huang, C.-Y. Chen, C.-Y. Tsai, C.-W. Hsu, S.-M. Lei, J.-H. Park, and W.-J. Han, "Sample adaptive offset in the HEVC standard," IEEE Transactions on Circuits and Systems for Video Technology, vol. 22, no. 12, pp. 1755–1764, 2012.
[11] C.-Y. Tsai, C.-Y. Chen, T. Yamakage, I. S. Chong, Y.-W. Huang, C.-M. Fu, T. Itoh, T. Watanabe, T. Chujoh, M. Karczewicz et al., "Adaptive loop filtering for video coding," IEEE Journal of Selected Topics in Signal Processing, vol. 7, no. 6, pp. 934–945, 2013.
[12] K. Duan, S. Bai, L. Xie, H. Qi, Q. Huang, and Q. Tian, "CenterNet: Keypoint triplets for object detection," in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 6569–6578.
[13] Z.-Q. Zhao, P. Zheng, S.-T. Xu, and X. Wu, "Object detection with deep learning: A review," IEEE Transactions on Neural Networks and Learning Systems, 2019.
[14] X. Liu, Z. Deng, and Y. Yang, "Recent progress in semantic image segmentation," Artificial Intelligence Review, vol. 52, no. 2, pp. 1089–1106, 2019.
[15] C. Liu, L.-C. Chen, F. Schroff, H. Adam, W. Hua, A. L. Yuille, and L. Fei-Fei, "Auto-DeepLab: Hierarchical neural architecture search for semantic image segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 82–92.
[16] X. Hu, H. Mu, X. Zhang, Z. Wang, T. Tan, and J. Sun, "Meta-SR: A magnification-arbitrary network for super-resolution," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 1575–1584.
[17] J. W. Soh, G. Y. Park, J. Jo, and N. I. Cho, "Natural and realistic single image super-resolution with explicit natural manifold discrimination," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 8122–8131.
[18] J. Li, B. Li, J. Xu, R. Xiong, and W. Gao, "Fully connected network-based intra prediction for image coding," IEEE Transactions on Image Processing, vol. 27, no. 7, pp. 3236–3247, 2018.
[19] Y. Hu, W. Yang, M. Li, and J. Liu, "Progressive spatial recurrent neural network for intra prediction," IEEE Transactions on Multimedia, vol. 21, no. 12, pp. 3024–3037, 2019.
[20] H. Sun, Z. Cheng, M. Takeuchi, and J. Katto, "Enhanced intra prediction for video coding by using multiple neural networks," IEEE Transactions on Multimedia, 2020.
[21] J. Liu, S. Xia, W. Yang, M. Li, and D. Liu, "One-for-all: Grouped variation network-based fractional interpolation in video coding," IEEE Transactions on Image Processing, vol. 28, no. 5, pp. 2140–2151, 2018.
[22] L. Zhao, S. Wang, X. Zhang, S. Wang, S. Ma, and W. Gao, "Enhanced motion-compensated video coding with deep virtual reference frame generation," IEEE Transactions on Image Processing, 2019.
[23] R. Song, D. Liu, H. Li, and F. Wu, "Neural network-based arithmetic coding of intra prediction modes in HEVC," in 2017 IEEE Visual Communications and Image Processing (VCIP). IEEE, 2017, pp. 1–4.
[24] C. Ma, D. Liu, X. Peng, Z.-J. Zha, and F. Wu, "Neural network-based arithmetic coding for inter prediction information in HEVC," in 2019 IEEE International Symposium on Circuits and Systems (ISCAS). IEEE, 2019, pp. 1–5.
[25] W. Park and M. Kim, "CNN-based in-loop filtering for coding efficiency improvement," in 2016 IEEE 12th Image, Video, and Multidimensional Signal Processing Workshop (IVMSP), July 2016, pp. 1–5.
[26] Y. Dai, D. Liu, and F. Wu, "A convolutional neural network approach for post-processing in HEVC intra coding," in International Conference on Multimedia Modeling. Springer, 2017, pp. 28–39.
[27] Y. Dai, D. Liu, Z.-J. Zha, and F. Wu, "A CNN-based in-loop filter with CU classification for HEVC," in 2018 IEEE Visual Communications and Image Processing (VCIP). IEEE, 2018, pp. 1–4.
[28] C. Liu, H. Sun, J. Chen, Z. Cheng, M. Takeuchi, J. Katto, X. Zeng, and Y. Fan, "Dual learning-based video coding with inception dense blocks," in 2019 Picture Coding Symposium (PCS), Nov. 2019, pp. 1–5.
[29] J. Kang, S. Kim, and K. M. Lee, "Multi-modal/multi-scale convolutional neural network based in-loop filter design for next generation video codec," in 2017 IEEE International Conference on Image Processing (ICIP), Sep. 2017, pp. 26–30.
[30] Y. Zhang, T. Shen, X. Ji, Y. Zhang, R. Xiong, and Q. Dai, "Residual highway convolutional neural networks for in-loop filtering in HEVC," IEEE Transactions on Image Processing, vol. 27, no. 8, pp. 3827–3841, Aug. 2018.
[31] X. He, Q. Hu, X. Zhang, C. Zhang, W. Lin, and X. Han, "Enhancing HEVC compressed videos with a partition-masked convolutional neural network," in 2018 IEEE International Conference on Image Processing (ICIP), Oct. 2018, pp. 216–220.
[32] R. Yang, M. Xu, and Z. Wang, "Decoder-side HEVC quality enhancement with scalable convolutional neural network," in 2017 IEEE International Conference on Multimedia and Expo (ICME). IEEE, 2017, pp. 817–822.
[33] C. Jia, S. Wang, X. Zhang, S. Wang, J. Liu, S. Pu, and S. Ma, "Content-aware convolutional neural network for in-loop filtering in high efficiency video coding," IEEE Transactions on Image Processing, vol. 28, no. 7, pp. 3343–3356, 2019.
[34] G. Bjontegaard, "Calculation of average PSNR differences between RD-curves," VCEG-M33, 2001.
[35] L. Sifre and S. Mallat, "Rigid-motion scattering for image classification," Ph.D. dissertation, 2014.
[36] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, "MobileNets: Efficient convolutional neural networks for mobile vision applications," arXiv preprint arXiv:1704.04861, 2017.
[37] G. Hinton, O. Vinyals, and J. Dean, "Distilling the knowledge in a neural network," Computer Science, vol. 14, no. 7, pp. 38–39, 2015.
[38] S. Zagoruyko and N. Komodakis, "Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer," arXiv preprint arXiv:1612.03928, 2016.
[39] Z. Huang and N. Wang, "Like what you like: Knowledge distill via neuron selectivity transfer," arXiv preprint arXiv:1707.01219, 2017.
[40] J. Yang, X. Shen, J. Xing, X. Tian, H. Li, B. Deng, J. Huang, and X.-S. Hua, "Quantization networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 7308–7316.
[41] M. Nagel, M. van Baalen, T. Blankevoort, and M. Welling, "Data-free quantization through weight equalization and bias correction," arXiv preprint arXiv:1906.04721, 2019.
[42] P. Molchanov, A. Mallya, S. Tyree, I. Frosio, and J. Kautz, "Importance estimation for neural network pruning," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 11264–11272.
[43] C. Zhao, B. Ni, J. Zhang, Q. Zhao, W. Zhang, and Q. Tian, "Variational convolutional neural network pruning," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 2780–2789.
[44] S. Zhang, Z. Fan, N. Ling, and M. Jiang, "Recursive residual convolutional neural network-based in-loop filtering for intra frames," IEEE Transactions on Circuits and Systems for Video Technology, vol. 30, no. 7, pp. 1888–1900, 2020.
[45] C. Jia, S. Wang, X. Zhang, S. Wang, and S. Ma, "Spatial-temporal residue network based in-loop filter for video coding," in 2017 IEEE Visual Communications and Image Processing (VCIP), Dec. 2017, pp. 1–4.
[46] J. Yao, X. Song, S. Fang, and L. Wang, "AHG9: Convolutional neural network filter for inter frame," JVET-J0043, Apr. 2018.
[47] D. Ding, L. Kong, G. Chen, Z. Liu, and Y. Fang, "A switchable deep learning approach for in-loop filtering in video coding," IEEE Transactions on Circuits and Systems for Video Technology, vol. 30, no. 7, pp. 1871–1887, 2020.
[48] H. Sun, C. Liu, J. Katto, and Y. Fan, "An image compression framework with learning-based filter," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2020, pp. 152–153.
[49] C. Liu, H. Sun, J. Katto, X. Zeng, and Y. Fan, "A learning-based low complexity in-loop filter for video coding," in 2020 IEEE International Conference on Multimedia & Expo Workshops (ICMEW). IEEE, 2020, pp. 1–6.
[50] G. Cote, B. Erol, M. Gallant, and F. Kossentini, "H.263+: Video coding at low bit rates," IEEE Transactions on Circuits and Systems for Video Technology, vol. 8, no. 7, pp. 849–866, 1998.
[51] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[52] Y. Wang, H. Zhu, Y. Li, Z. Chen, and S. Liu, "Dense residual convolutional neural network based in-loop filter for HEVC," IEEE, 2018, pp. 1–4.
[53] J. W. Soh, J. Park, Y. Kim, B. Ahn, H.-S. Lee, Y.-S. Moon, and N. I. Cho, "Reduction of video compression artifacts based on deep temporal networks," IEEE Access, vol. 6, pp. 63094–63106, 2018.
[54] C. Li, L. Song, R. Xie, and W. Zhang, "CNN based post-processing to improve HEVC," in 2017 IEEE International Conference on Image Processing (ICIP). IEEE, 2017, pp. 4577–4580.
[55] F. Zhang, C. Feng, and D. R. Bull, "Enhancing VVC through CNN-based post-processing," in 2020 IEEE International Conference on Multimedia and Expo (ICME). IEEE, 2020, pp. 1–6.
[56] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola, "A kernel two-sample test," Journal of Machine Learning Research, vol. 13, pp. 723–773, 2012.
[57] V. Nair and G. E. Hinton, "Rectified linear units improve restricted Boltzmann machines," in Proceedings of the 27th International Conference on Machine Learning (ICML-10), 2010, pp. 807–814.
[60] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[61] F. Chollet et al., "Keras," https://github.com/fchollet/keras, 2015.
[62] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, "TensorFlow: Large-scale machine learning on heterogeneous systems," 2015, software available from tensorflow.org. [Online]. Available: http://tensorflow.org/
[63] E. Agustsson and R. Timofte, "NTIRE 2017 challenge on single image super-resolution: Dataset and study," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, July 2017.
[64] K. He, X. Zhang, S. Ren, and J. Sun, "Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1026–1034.