A DenseNet Based Approach for Multi-Frame In-Loop Filter in HEVC
Tianyi Li∗, Mai Xu∗, Ren Yang∗ and Xiaoming Tao†
∗ School of Electronic and Information Engineering, Beihang University, Beijing, 100191, China
† Department of Electronic Engineering, Tsinghua University, Beijing, 100084, China
[email protected] (Corresponding author: Mai Xu)
Abstract
High efficiency video coding (HEVC) has brought outperforming efficiency for video compression. To reduce the compression artifacts of HEVC, we propose a DenseNet based approach as the in-loop filter of HEVC, which leverages multiple adjacent frames to enhance the quality of each encoded frame. Specifically, the higher-quality frames are found by a reference frame selector (RFS). Then, a deep neural network for the multi-frame in-loop filter (named MIF-Net) is developed to enhance the quality of each encoded frame by utilizing the spatial information of this frame and the temporal information of its neighboring higher-quality frames. MIF-Net is built on the recently developed DenseNet, benefiting from its improved generalization capacity and computational efficiency. Finally, experimental results verify the effectiveness of our multi-frame in-loop filter, which outperforms the HM baseline and other state-of-the-art approaches.
The high efficiency video coding (HEVC) standard [1] developed by the Joint Collaborative Team on Video Coding (JCT-VC) has brought outperforming efficiency for video compression. However, various artifacts (e.g., blocking, blurring and ringing artifacts) also exist in compressed videos, mainly resulting from the block-wise prediction and quantization with limited precision. To alleviate such artifacts, in-loop filters were adopted for enhancing the quality of each encoded frame and providing higher-quality references for its successive frames. Consequently, the coding efficiency can be further improved by adopting the in-loop filters.

In total, three built-in in-loop filters were proposed for standard HEVC, including the deblocking filter (DBF) [2], the sample adaptive offset (SAO) filter [3] and the adaptive loop filter (ALF) [4]. Specifically, DBF is first used to remove the blocking artifacts. Then, the SAO filter reduces distortion by adding an adaptive offset to each sample. Afterwards, ALF minimizes the distortion based on a Wiener filter. However, ALF introduces heavy bit-rate overhead and was not adopted in the final version of HEVC. Besides the built-in in-loop filters of HEVC, various heuristic and learning-based methods have also been proposed. In heuristic methods, some prior knowledge of video coding is utilized to build a statistical model of compression artifacts, and then each frame is enhanced based on that model. For example, Matsumura et al. [5] utilized the weighted mean of non-local similar frame patches for artifact reduction. Zhang et al. [6] attached a low-rank constraint to each matrix formed by a patch group, and then established an adaptive soft-thresholding model to achieve sparse representation. More recently, deep learning has been successfully employed in many areas related to data compression, such as video coding [7] and quality enhancement [8]. Learning-based methods have also further improved the performance of in-loop filtering. Among them, Meng et al. [9] developed a multi-channel long-short-term dependency residual network (MLSDRN) for mapping a distorted frame to the raw frame, inserted between DBF and SAO. Zhang et al. [10] proposed a residual highway CNN (RHCNN) based on the ResNet [11], implemented after the standard SAO. However, none of the above learning-based methods employs multiple frames for in-loop filtering in HEVC. Typically, high fluctuation of visual quality exists across the encoded frames, and thus a low-quality frame can be enhanced by referring to its adjacent higher-quality frames.

Based on deep learning, this paper develops a multi-frame in-loop filter (MIF) for HEVC, replacing the original DBF and SAO. Specifically, we first exploit the quality fluctuation of encoded frames by designing a reference frame selector (RFS) to find reference frames for an unfiltered reconstructed frame (URF), based on frame quality and content similarity. If RFS provides sufficient reference frames, the URF flows through a deep neural network for MIF (named MIF-Net) to utilize both spatial information within one frame and temporal information across multiple frames. In the case that no sufficient reference frames are selected by RFS, a simpler deep neural network for in-loop filtering (named IF-Net) is used to enhance the URF instead.
Considering that the blocking artifacts are influenced by the coding tree unit (CTU) partition, the proposed networks are also adaptive to the partition structure, via varying convolutional kernels at different locations of the coding unit (CU) and transform unit (TU) maps. Finally, the experimental results show that our approach outperforms other state-of-the-art approaches, with 5.33% and 2.40% saving of the Bjøntegaard delta bit-rate (BD-BR) over the non-local adaptive loop filter [6] and the RHCNN [10], respectively.
The framework of our MIF approach is illustrated in Figure 1. In standard HEVC, each raw frame is encoded through intra-/inter-mode prediction, discrete transform and quantization. Then, the predicted frame and the residual frame form a URF. Subsequently, the URF is filtered with DBF and SAO for quality enhancement. Different from standard HEVC, we propose a deep-learning-based in-loop filter to enhance the URF, leveraging information from its neighboring frames. First, RFS selects high-quality and highly correlated frames as references, to be introduced in Section 2.2. Next, one of the two possible filtering modes is applied to the URF, as described below.

• Mode 1: MIF-Net.
Assume that M reference frames are needed in MIF-Net. If RFS selects at least M frames, the URF is processed by MIF-Net to generate an enhanced frame. In MIF-Net, each reference frame is first aligned with the URF in terms of content, with a motion compensation network. Then, both the aligned reference frames and the URF are fed into a quality enhancement network to output the reconstructed frame.

Figure 1: Framework of the proposed MIF.

• Mode 2: IF-Net.
If not enough reference frames are found for the URF, IF-Net is adopted instead for quality enhancement. In contrast to MIF-Net, IF-Net takes only the URF as input, without any consideration of multiple frames.

More details about Modes 1 and 2 are presented in Section 2.3. If MIF-Net or IF-Net fails to improve frame quality, the standard DBF and SAO can also be used as a supplementary mode. Finally, the best mode among the three possible choices (i.e., MIF-Net, IF-Net and the standard in-loop filters) is selected as the actual choice, ensuring the overall performance of our approach.
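To make the control flow above concrete, the following is a minimal Python sketch of the mode decision. The callables mif_net, if_net and dbf_sao stand in for the actual filters and are assumptions, as is the plain PSNR criterion for picking the best mode; the paper does not specify the exact encoder-side selection rule.

```python
import numpy as np

def psnr(a, b, peak=255.0):
    """PSNR between two frames (numpy arrays of the same shape)."""
    mse = np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)

def in_loop_filter(urf, raw_frame, ref_frames, M, mif_net, if_net, dbf_sao):
    """Sketch of the three-way mode decision.

    urf        : unfiltered reconstructed frame
    raw_frame  : original frame (available at the encoder side)
    ref_frames : reference frames already selected by RFS
    M          : number of reference frames required by MIF-Net
    mif_net, if_net, dbf_sao : caller-supplied callables (placeholders)
    """
    candidates = []
    if len(ref_frames) >= M:                       # Mode 1: MIF-Net
        candidates.append(mif_net(urf, ref_frames[:M]))
    else:                                          # Mode 2: IF-Net
        candidates.append(if_net(urf))
    candidates.append(dbf_sao(urf))                # supplementary standard filters

    # Keep the candidate closest to the raw frame (assumed PSNR criterion).
    return max(candidates, key=lambda f: psnr(f, raw_frame))
```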
In our approach, RFS selects reference frames for each URF. For the n-th URF (denoted as F^U_n) in a sequence, RFS examines its previous N encoded frames as the reference frame pool, each of which is denoted by F^P_i (n − N ≤ i ≤ n − 1). For each candidate F^P_i, six metrics are measured:

• ∆PSNR^Y_{i,n}, ∆PSNR^U_{i,n} and ∆PSNR^V_{i,n}: the PSNR increment of F^P_i over F^U_n, for the Y, U and V channels, respectively.

• CC^Y_{i,n}, CC^U_{i,n} and CC^V_{i,n}: the correlation coefficient (CC) values of frame content between F^P_i and F^U_n, for the Y, U and V channels, respectively.

Based on the above metrics, the reference frame pool is first divided into valid and invalid reference frames, and then all valid frames are fed into RFS-Net to select the final M frames. Specifically, a binary value V_{i,n} represents whether a reference frame from the pool is valid. If, for at least one channel of F^P_i, the PSNR increment is positive and the CC value is above a threshold τ, i.e., V_{i,n} = 1 in (1), F^P_i is regarded as a valid reference frame:

\[
V_{i,n} =
\begin{cases}
1, & \text{if } \bigvee_{c \in \{\mathrm{Y,U,V}\}} \left( \Delta\mathrm{PSNR}^{c}_{i,n} > 0 \ \wedge \ \mathrm{CC}^{c}_{i,n} > \tau \right) \\
0, & \text{otherwise}.
\end{cases}
\tag{1}
\]
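A minimal numpy sketch of the validity test in (1) is given below. It assumes that the PSNR of each frame is measured against its own raw frame (so the increment compares the quality of the already-encoded candidate with that of the URF) and that CC is the Pearson correlation coefficient between co-located planes; both are reasonable readings of the definitions above rather than details confirmed by the source.

```python
import numpy as np

def channel_psnr(a, b, peak=255.0):
    mse = np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)

def channel_cc(a, b):
    # Pearson correlation coefficient between two co-located planes.
    return np.corrcoef(a.ravel().astype(np.float64),
                       b.ravel().astype(np.float64))[0, 1]

def is_valid_reference(ref_yuv, ref_raw_yuv, urf_yuv, urf_raw_yuv, tau):
    """V_{i,n} of Eq. (1): the candidate F^P_i is valid if, for at least one
    channel, its quality exceeds that of the URF and its content correlation
    with the URF exceeds the threshold tau.

    Each *_yuv argument is a dict {"Y": plane, "U": plane, "V": plane}."""
    for c in ("Y", "U", "V"):
        delta_psnr = channel_psnr(ref_yuv[c], ref_raw_yuv[c]) \
                   - channel_psnr(urf_yuv[c], urf_raw_yuv[c])
        if delta_psnr > 0 and channel_cc(ref_yuv[c], urf_yuv[c]) > tau:
            return True
    return False
```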
If there exist at least M valid reference frames, the six metrics of each valid reference frame form a 6-dimensional vector, which is input to a two-layer fully connected network (named RFS-Net) to output a scalar R̂_{i,n}. The output R̂_{i,n} is a continuous variable representing the potential of F^P_i being the reference for F^U_n: a larger R̂_{i,n} indicates that F^P_i has more potential than other reference frames for enhancing F^U_n. Here, R̂_{i,n} is the value predicted by RFS-Net, with the corresponding ground-truth denoted by R_{i,n}. In RFS-Net, the ground-truth R_{i,n} should reflect the quality of a valid reference frame after it is aligned with F^U_n via motion compensation. To this end, we assign R_{i,n} as the PSNR between the compensated valid reference frame and the n-th raw frame (denoted as F_n). In accord with R̂_{i,n}, R_{i,n} is also Z-score normalized within one training batch. After normalization, the ℓ2-loss over the whole training batch can be used to measure the difference between R_{i,n} and R̂_{i,n}, formulated as

\[
L_{\mathrm{RFS}} = \sum_{\substack{n-N \le i \le n-1 \\ V_{i,n} = 1}} \left( R_{i,n} - \hat{R}_{i,n} \right)^{2},
\tag{2}
\]

which is optimized by the Adam algorithm [13]. (In RFS-Net, the 6-dimensional vector flows through two layers, with 12 hidden nodes and 1 output node, respectively. Both layers are activated with parametric rectified linear units (PReLU) [12]. Note that the samples in one training batch are extracted from the valid reference frames of only one URF, and the outputs of the samples in the same batch are Z-score normalized.) Using the trained RFS-Net model, the reference potential of all valid frames can be obtained. Then, RFS selects M frames denoted by {F^R_{m,n}}^M_{m=1}, where the index m indicates that F^R_{m,n} is the frame with the m-th highest R̂_{i,n} among all valid reference frames. In the exceptional case that the number of valid reference frames is less than M, RFS does not work and IF-Net is used to enhance F^U_n instead.
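The topology given above (6 inputs, 12 hidden nodes, 1 output, PReLU activations, per-batch Z-score normalization, ℓ2 loss, Adam) is enough to sketch RFS-Net in PyTorch. The learning rate and the epsilon guard in the normalization are assumptions not stated in the paper.

```python
import torch
import torch.nn as nn

class RFSNet(nn.Module):
    """Two-layer fully connected network: 6 metrics -> 12 hidden -> 1 score."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(6, 12), nn.PReLU(),
            nn.Linear(12, 1), nn.PReLU(),
        )

    def forward(self, x):                 # x: (num_valid_refs, 6)
        return self.net(x).squeeze(-1)

def zscore(v):                            # per-batch Z-score normalization
    return (v - v.mean()) / (v.std() + 1e-8)   # 1e-8 guard is an assumption

model = RFSNet()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # lr is an assumption

def train_step(metrics, gt_psnr):
    """One step of Eq. (2) on the valid reference frames of a single URF.

    metrics : (num_valid_refs, 6) tensor of the six RFS metrics
    gt_psnr : (num_valid_refs,) PSNR of each compensated reference vs. F_n
    """
    pred = zscore(model(metrics))
    loss = torch.sum((zscore(gt_psnr) - pred) ** 2)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```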
Figure 2: Architecture of MIF-Net or IF-Net.

This section mainly focuses on the architecture of MIF-Net and its training strategy, and then specifies the difference between IF-Net and MIF-Net. Figure 2 illustrates the overall architecture of MIF-Net or IF-Net. As shown in this figure, MIF-Net takes a URF F^U_n and its M reference frames {F^R_{m,n}}^M_{m=1} as input, to generate the enhanced frame F^E_n as output. The information from M parallel branches {B_m}^M_{m=1} is synthesized, with each branch B_m dealing with the corresponding reference frame F^R_{m,n}. In branch B_m, F^R_{m,n} is first aligned with F^U_n to produce a motion-compensated frame, denoted as F^C_{m,n}. Next, F^U_n together with F^C_{m,n} flows through a novel convolutional layer guided by the CTU partition structure of F^U_n (named the block-adaptive convolutional layer), to explore low-level features from different sources and merge these features with consideration of the CU and TU partition. Then, the low-level features flow through two successive dense units [14] to extract more comprehensive features within B_m. Finally, the extracted features from the M branches are concatenated together and further processed with two dense units to extract high-level features. For ease of training, the output of the last dense unit (denoted as F^∆_n) is regarded as a difference frame, and the enhanced frame F^E_n is the summation of F^∆_n and F^U_n. The details of the MIF-Net components are presented in the following.

Figure 3: Network details. (a) Motion compensation network. (b) Dense unit. For convolutional layers, "p × p, q" represents q output channels with p × p kernels. The convolutional stride is set to 1 by default, unless explicitly mentioned for certain layers.
Motion compensation network. We propose a motion compensation network based on the spatial transformer motion compensation (STMC) [15], for content alignment between F^R_{m,n} and F^U_n, as illustrated in Figure 3-(a). In [15], the STMC takes both F^R_{m,n} and F^U_n as input, to output a compensated frame denoted as F^STMC_{m,n}. The STMC consists of ×2 and ×4 down-scaling paths, which estimate coarse motion vector (MV) maps and warp F^R_{m,n} to output F^STMC_{m,n}. The two down-scaling paths in [15] are capable of estimating various scales of motion. However, the accuracy of the STMC is limited due to down-scaling, and its architecture can also be improved. Therefore, we propose a motion compensation network that improves the STMC as follows: besides the ×2 and ×4 down-scaling paths, a full-scale path is adopted to estimate fine MV maps M^X_{m,n} and M^Y_{m,n} at the original resolution, denoting the horizontal and vertical motion of all pixels from F^R_{m,n} to F^U_n. Finally, the compensated frame F^C_{m,n} is derived by

\[
F^{\mathrm{C}}_{m,n}(x, y) = \mathrm{Bil}\left\{ F^{\mathrm{R}}_{m,n}\left(x + M^{\mathrm{X}}_{m,n}(x, y),\; y + M^{\mathrm{Y}}_{m,n}(x, y)\right) \right\},
\tag{3}
\]

where x and y are the coordinates of a pixel, and Bil{·} represents bilinear interpolation, considering that the motion may be of non-integer pixels.
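The warping in Eq. (3) can be made concrete with a small numpy sketch; clipping the sub-pixel source coordinates to the frame border is an assumption about boundary handling.

```python
import numpy as np

def warp_bilinear(ref, mv_x, mv_y):
    """Eq. (3): F^C(x, y) = Bil{ F^R(x + M^X(x, y), y + M^Y(x, y)) }.

    ref          : reference frame F^R_{m,n}, shape (H, W)
    mv_x, mv_y   : per-pixel horizontal / vertical motion, shape (H, W)
    """
    H, W = ref.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    # Sub-pixel source coordinates, clipped to the frame (assumption).
    sx = np.clip(xs + mv_x, 0, W - 1)
    sy = np.clip(ys + mv_y, 0, H - 1)

    x0 = np.floor(sx).astype(int); x1 = np.minimum(x0 + 1, W - 1)
    y0 = np.floor(sy).astype(int); y1 = np.minimum(y0 + 1, H - 1)
    wx = sx - x0
    wy = sy - y0

    # Bilinear blend of the four neighbouring reference pixels.
    top = (1 - wx) * ref[y0, x0] + wx * ref[y0, x1]
    bot = (1 - wx) * ref[y1, x0] + wx * ref[y1, x1]
    return (1 - wy) * top + wy * bot
```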
Block-adaptive convolutional layer. The input to this layer is a concatenation of three feature maps, namely a compensated frame F^C_{m,n}, the URF F^U_n and their difference F^C_{m,n} − F^U_n. The CU and TU partitions are represented by two guidance maps, i.e., C_n and T_n, respectively. C_n and T_n each have the same size as F^U_n, and their values are assigned according to the partition structure: if pixel (x, y) is on the boundary of a CU or TU, C_n(x, y) or T_n(x, y) is set to 1; otherwise, the value is set to −1. In this layer, a guided convolution is performed, where P_I, P_G and P_O feature maps are used as the input, guidance and output, respectively. The guided convolution consists of two main procedures, i.e., intermediate map generation and convolution with intermediation. First, the P_G guidance feature maps are processed with two typical convolutional layers to generate P_O intermediate feature maps, keeping the size of each feature map unchanged. Then, during the convolution, the P_O output feature maps are generated based on these P_O intermediate feature maps, correspondingly. Compared with typical convolution using space-irrelevant weights w_{j,l} only, the guided convolution is conducted with space-relevant weights w^G_{j,l} generated from the intermediation, as formulated below:

\[
w^{G}_{j,l}(\Delta x, \Delta y) = w_{j,l}(\Delta x, \Delta y) \cdot F^{\mathrm{M}}_{l}(x + \Delta x,\; y + \Delta y),
\tag{4}
\]

\[
F^{\mathrm{O}}_{l}(x, y) = \sum_{j=1}^{P_{\mathrm{I}}} \sum_{\Delta x = -1}^{1} \sum_{\Delta y = -1}^{1} w^{G}_{j,l}(\Delta x, \Delta y) \cdot F^{\mathrm{I}}_{j}(x + \Delta x,\; y + \Delta y).
\tag{5}
\]

In (4) and (5), F^I_j, F^M_l and F^O_l represent the j-th input, the l-th intermediate and the l-th output feature maps, respectively, and ∆x, ∆y denote the relative coordinates within a 3 × 3 convolutional kernel. In this layer, P_I = 3 and P_G = 2, and we set the number of output maps to P_O = 16.
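A direct, unoptimized numpy sketch of Eqs. (4) and (5) for a single output map l is given below; the zero-padding of the 3 × 3 neighbourhood at the frame border is an assumption.

```python
import numpy as np

def guided_conv_single_map(inputs, intermediate, weights):
    """Eqs. (4)-(5): convolution whose weights are modulated, per position,
    by an intermediate (guidance-derived) feature map.

    inputs       : list of P_I input maps F^I_j, each of shape (H, W)
    intermediate : intermediate map F^M_l for this output map, shape (H, W)
    weights      : space-irrelevant kernels w_{j,l}, shape (P_I, 3, 3)
    Returns the output map F^O_l of shape (H, W).
    """
    H, W = intermediate.shape
    out = np.zeros((H, W))
    pad = lambda a: np.pad(a, 1, mode="constant")   # zero-padding (assumption)
    inputs_p = [pad(f) for f in inputs]
    inter_p = pad(intermediate)

    for y in range(H):
        for x in range(W):
            acc = 0.0
            for j, f in enumerate(inputs_p):
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        # Eq. (4): space-relevant weight at this position.
                        w_g = weights[j, dy + 1, dx + 1] * inter_p[y + dy + 1, x + dx + 1]
                        # Eq. (5): accumulate over inputs and offsets.
                        acc += w_g * f[y + dy + 1, x + dx + 1]
            out[y, x] = acc
    return out
```

In practice this would be vectorized (or fused into a custom layer), but the nested loops mirror the summations in (5) one-to-one.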
Dense units for quality enhancement. The DenseNet [14] introduces inter-layer connections of various lengths, which alleviate vanishing gradients and encourage feature reuse. Considering these advantages, (2M + 2) dense units are adopted in MIF-Net, i.e., 2 dense units in each branch and 2 dense units at the end of MIF-Net synthesizing the features from the M branches. Figure 3-(b) illustrates the structure of each dense unit; a dense unit with 4 convolutional layers includes 10 inter-layer connections, many more than the 4 connections of a 4-layer plain CNN. Here, each layer outputs 12 channels, except the last layer in the final dense unit, which outputs only 1 channel as the difference frame F^∆_n.
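A PyTorch sketch of one dense unit under the description above (4 convolutional layers, 12 output channels each, every layer fed the concatenation of the unit input and all preceding layer outputs, giving 10 inter-layer connections). The 3 × 3 kernels and PReLU activations follow the layer notation in Figure 3; returning only the last layer's maps as the unit output is an assumption.

```python
import torch
import torch.nn as nn

class DenseUnit(nn.Module):
    """One dense unit: 4 conv layers with dense (concatenated) connections."""
    def __init__(self, in_channels, growth=12, num_layers=4):
        super().__init__()
        self.layers = nn.ModuleList()
        channels = in_channels
        for _ in range(num_layers):
            self.layers.append(nn.Sequential(
                nn.Conv2d(channels, growth, kernel_size=3, padding=1),
                nn.PReLU(),
            ))
            channels += growth            # the next layer also sees this output

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            out = layer(torch.cat(features, dim=1))   # dense concatenation
            features.append(out)
        return features[-1]               # assumption: last layer's maps are the output
```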
MIF-Net Training. With both motion compensation and quality enhancement, it may be difficult to train the whole MIF-Net directly. Thus, we propose to train it with intermediate supervision [17], introducing two loss functions at different stages. First, the difference between F^U_n and each frame in {F^C_{m,n}}^M_{m=1} measures the performance of motion compensation, and thus it is defined as the intermediate loss

\[
L_{\mathrm{INT}} = \frac{1}{M} \sum_{m=1}^{M} \left\| F^{\mathrm{C}}_{m,n} - F^{\mathrm{U}}_{n} \right\|_{2},
\tag{6}
\]

where ‖·‖₂ represents the ℓ2-norm. Next, the difference between F^E_n and F_n indicates the performance of the whole MIF-Net, and thus the global loss is

\[
L_{\mathrm{GLO}} = \left\| F^{\mathrm{E}}_{n} - F_{n} \right\|_{2}.
\tag{7}
\]

The loss for training MIF-Net is their weighted summation:

\[
L = \alpha \cdot L_{\mathrm{INT}} + \beta \cdot L_{\mathrm{GLO}},
\tag{8}
\]

where α and β are adjustable positive weights. On account that quality enhancement relies on a well-trained motion compensation network, L_INT should be emphatically optimized with α ≫ β at the early stage of training. After L_INT converges, we set β ≫ α instead, to emphasize optimization of the global loss L_GLO.
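A sketch of the combined loss in Eqs. (6)-(8) with the two-stage weighting schedule; the concrete weight values quoted in the comments are those listed in Table 1.

```python
import torch

def mif_net_loss(compensated, urf, enhanced, raw, alpha, beta):
    """L = alpha * L_INT + beta * L_GLO  (Eqs. (6)-(8)).

    compensated : list of M compensated frames {F^C_{m,n}} (tensors)
    urf         : unfiltered reconstructed frame F^U_n
    enhanced    : network output F^E_n
    raw         : raw frame F_n
    """
    l_int = sum(torch.norm(c - urf, p=2) for c in compensated) / len(compensated)
    l_glo = torch.norm(enhanced - raw, p=2)
    return alpha * l_int + beta * l_glo

# Two-stage schedule (Table 1):
#   alpha, beta = 0.99, 0.01   # early stage, emphasizes motion compensation
#   alpha, beta = 0.01, 0.99   # after L_INT converges, emphasizes L_GLO
```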
Difference between IF-Net and MIF-Net. The difference between the two networks lies in the absence of the M reference frames in IF-Net. Therefore, only quality enhancement without motion compensation is conducted in IF-Net, illustrated by the red arrows in Figure 2. Compared with MIF-Net, only one branch without any compensated frame exists in IF-Net, and the concatenation synthesizing the M branches is also omitted. Despite its simplicity, a block-adaptive convolutional layer and four consecutive dense units still exist in IF-Net, ensuring sufficient network capacity. Considering that there is no motion compensation in IF-Net, the loss of IF-Net is the same as L_GLO in MIF-Net.
Experimental configurations.
In the experiments, all approaches for in-loop filtering were incorporated into the HEVC reference software HM 16.5. For our MIF approach, we established a large-scale database for HEVC in-loop filtering (named the HIF database) containing 111 raw video sequences, collected from the JCT-VC [18], Xiph.org [19] and the conversational video set [20]. Our HIF database was divided into non-overlapping sets for training (83 sequences), validation (10 sequences) and test (18 sequences). The training set was used to train the networks, and the hyper-parameters of our approach were tuned on the validation set. The test set was used for performance evaluation, containing all 18 standard sequences of the JCT-VC set [18]. The RA configuration was applied for both network training and performance evaluation at four QPs, {22, 27, 32, 37}. During evaluation, the BD-BR and the Bjøntegaard delta PSNR (BD-PSNR) were measured to assess the rate-distortion (RD) performance.
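For reference, the standard Bjøntegaard delta bit-rate calculation (cubic polynomial fit of log bit-rate against PSNR, averaged over the overlapping PSNR range) can be sketched in a few lines of numpy; this is the common BD metric, not code taken from the paper.

```python
import numpy as np

def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test):
    """BD-BR (%): average bit-rate difference of the test RD curve against
    the anchor curve, each given as arrays of (bit-rate, PSNR) points."""
    lr_a, lr_t = np.log10(rate_anchor), np.log10(rate_test)
    # Cubic fit of log-rate as a function of PSNR.
    p_a = np.polyfit(psnr_anchor, lr_a, 3)
    p_t = np.polyfit(psnr_test, lr_t, 3)

    # Integrate both fits over the overlapping PSNR interval.
    lo = max(min(psnr_anchor), min(psnr_test))
    hi = min(max(psnr_anchor), max(psnr_test))
    int_a = np.polyval(np.polyint(p_a), hi) - np.polyval(np.polyint(p_a), lo)
    int_t = np.polyval(np.polyint(p_t), hi) - np.polyval(np.polyint(p_t), lo)

    avg_log_diff = (int_t - int_a) / (hi - lo)
    return (10.0 ** avg_log_diff - 1.0) * 100.0
```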
Network settings. For our approach, one MIF-Net model and one IF-Net model were trained for each evaluated QP, while all QPs shared the same trained RFS-Net model. The tuned hyper-parameters of these networks are listed in Table 1. For training MIF-Net and IF-Net, all frames were segmented into 64 × 64 patches. Here, each training sample was composed of the co-located patches from a raw frame, a URF, a CU map, a TU map and the M reference frames (if available). Considering the efficiency of training, the IF-Net or MIF-Net model at QP = 37 was trained from scratch, while the model at each lower QP was fine-tuned from the trained model at the next higher QP.

Table 1: Hyper-parameters for the networks

Hyper-parameter | RFS-Net | MIF-Net or IF-Net
Size of reference frame pool: N | 16 | –
Threshold for CC value: τ |  | –
Number of reference frames: M |  |
Batch size | * |
Number of iterations |  |
Weights α and β in MIF-Net | – | 0.99 & 0.01 (at the beginning); 0.01 & 0.99 (after L_INT converged)

* The batch size of RFS-Net equals the number of valid reference frames for a URF.
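Returning to the training samples described under the network settings, the following sketch shows how a sample might be assembled from co-located 64 × 64 patches; the non-overlapping cropping layout and the dictionary packaging are assumptions for illustration.

```python
import numpy as np

def make_training_samples(raw, urf, cu_map, tu_map, ref_frames, patch=64):
    """Cut co-located 64x64 patches from the raw frame, the URF, the CU/TU
    maps and the M reference frames (all 2-D numpy arrays of equal size)."""
    H, W = raw.shape
    samples = []
    for y in range(0, H - patch + 1, patch):
        for x in range(0, W - patch + 1, patch):
            crop = lambda f: f[y:y + patch, x:x + patch]
            samples.append({
                "raw": crop(raw),
                "urf": crop(urf),
                "cu": crop(cu_map),
                "tu": crop(tu_map),
                "refs": [crop(r) for r in ref_frames],
            })
    return samples
```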
Table 2: RD performance of in-loop filters on the JCT-VC test set (each pair is BD-BR in % / BD-PSNR in dB)

Class | Sequence | Standard DBF and SAO | Non-local adaptive loop filter [6] | RHCNN [10] | Proposed MIF
A | PeopleOnStreet | -8.29 / 0.37 | -12.03 / 0.54 | -12.48 / 0.57 | -16.82 / 0.78
A | Traffic | -5.35 / 0.16 | -6.17 / 0.19 | -9.81 / 0.30 | -12.15 / 0.38
B | BasketballDrive | -6.65 / 0.15 | -8.84 / 0.20 | -11.05 / 0.25 | -14.87 / 0.35
B | BQTerrace | -7.15 / 0.11 | -11.40 / 0.17 | -14.36 / 0.23 | -17.13 / 0.27
B | Cactus | -7.54 / 0.16 | -8.90 / 0.19 | -12.52 / 0.27 | -15.83 / 0.35
B | Kimono | -7.54 / 0.22 | -9.26 / 0.27 | -10.48 / 0.31 | -12.24 / 0.37
B | ParkScene | -3.68 / 0.11 | -4.08 / 0.12 | -5.94 / 0.18 | -7.99 / 0.25
C | BasketballDrill | -5.02 / 0.21 | -5.39 / 0.22 | -7.81 / 0.33 | -10.32 / 0.43
C | BQMall | -3.93 / 0.15 | -4.45 / 0.17 | -7.65 / 0.30 | -9.38 / 0.37
C | PartyScene | -1.05 / 0.04 | -1.22 / 0.05 | -2.41 / 0.10 | -4.16 / 0.17
C | RaceHorses | -6.15 / 0.22 | -7.08 / 0.26 | -10.40 / 0.39 | -12.74 / 0.48
D | BasketballPass | -3.85 / 0.18 | -4.32 / 0.20 | -7.68 / 0.37 | -9.98 / 0.48
D | BlowingBubbles | -0.83 / 0.03 | -0.83 / 0.03 | -3.05 / 0.12 | -3.98 / 0.16
D | BQSquare | -0.05 / 0.00 | 0.01 / 0.00 | -3.24 / 0.12 | -4.40 / 0.17
D | RaceHorses | -4.44 / 0.20 | -4.80 / 0.22 | -8.84 / 0.41 | -10.99 / 0.51
E | FourPeople | -7.02 / 0.26 | -8.49 / 0.32 | -13.92 / 0.54 | -16.48 / 0.64
E | Johnny | -5.60 / 0.14 | -8.03 / 0.21 | -11.62 / 0.30 | -14.37 / 0.38
E | KristenAndSara | -6.41 / 0.20 | -8.01 / 0.25 | -12.62 / 0.41 | -15.34 / 0.50
 | Average | -5.03 / 0.16 | -6.29 / 0.20 | -9.22 / 0.30 | -11.62 / 0.39
Objective RD performance.
We analyze the objective performance of our MIF approach in terms of the BD-BR and BD-PSNR, compared with the standard in-loop filters (DBF and SAO), a model-based approach (the non-local adaptive loop filter [6]) and a deep-learning-based approach (the RHCNN [10]). For a fair comparison, the RHCNN models of [10] were re-trained on our HIF database. Table 2 tabulates the RD performance of all four approaches, with the original HM without in-loop filters used as the anchor. As indicated in Table 2, the BD-BR of our MIF approach is -11.62% averaged over the 18 standard test sequences, outperforming -5.03% of the HM baseline, -6.29% of [6] and -9.22% of [10]. In terms of BD-PSNR, our approach achieves 0.39 dB on average, again higher than the 0.16 dB, 0.20 dB and 0.30 dB of the three compared approaches.

Figure 4: Comparison of subjective visual quality on sequences RaceHorses (Class C) and PeopleOnStreet (Class A) at QP = 37.
Subjective visual quality.
Figure 4 compares the subjective visual quality of all four approaches. It can be observed that the frames enhanced by our approach exhibit less distortion than those of the other approaches, e.g., the clearer edge of the horse tail and the reduced blocking artifacts on the pedestrians. This highest visual quality mainly benefits from the utilization of multiple adjacent frames in the proposed MIF approach.
In this paper, we have proposed a DenseNet based in-loop filter for HEVC. Different from existing in-loop filter approaches based on a single frame, our MIF approach enhances the quality of each encoded frame by leveraging multiple adjacent frames. To this end, we first propose an RFS to find higher-quality reference frames. Then, we develop the MIF-Net model for the multi-frame in-loop filter in HEVC, which is based on the DenseNet and benefits from its improved generalization capacity and computational efficiency. Finally, experimental results demonstrate that our approach achieves -11.62% of BD-BR saving and 0.39 dB of BD-PSNR improvement on average, outperforming the state-of-the-art approaches.

Acknowledgment
This work was supported by the NSFC under Grants 61876013 and 61573037, and by the Fok Ying Tung Education Foundation under Grant 151061.
References

[1] G. J. Sullivan, J. Ohm, W. Han, and T. Wiegand, "Overview of the high efficiency video coding (HEVC) standard," IEEE Transactions on Circuits and Systems for Video Technology, vol. 22, no. 12, pp. 1649-1668, Dec. 2012.

[2] A. Norkin, G. Bjontegaard, A. Fuldseth, M. Narroschke, M. Ikeda, K. Andersson, M. Zhou, and G. Van der Auwera, "HEVC deblocking filter," IEEE Transactions on Circuits and Systems for Video Technology, vol. 22, no. 12, pp. 1746-1754, Dec. 2012.

[3] C. Fu, E. Alshina, A. Alshin, Y. Huang, C. Chen, C. Tsai, C. Hsu, S. Lei, J. Park, and W. Han, "Sample adaptive offset in the HEVC standard," IEEE Transactions on Circuits and Systems for Video Technology, vol. 22, no. 12, pp. 1755-1764, Dec. 2012.

[4] C. Tsai, C. Chen, T. Yamakage, I. S. Chong, Y. Huang, C. Fu, T. Itoh, T. Watanabe, T. Chujoh, M. Karczewicz, and S. Lei, "Adaptive loop filtering for video coding," IEEE Journal of Selected Topics in Signal Processing, vol. 7, no. 6, pp. 934-945, Dec. 2013.

[5] M. Matsumura, Y. Bandoh, S. Takamura, and H. Jozawa, "In-loop filter based on non-local means filter," JCTVC-E206, ITU-T SG16, Geneva, Switzerland, March 2011.

[6] X. Zhang, R. Xiong, W. Lin, J. Zhang, S. Wang, S. Ma, and W. Gao, "Low-rank-based nonlocal adaptive loop filter for high-efficiency video compression," IEEE Transactions on Circuits and Systems for Video Technology, vol. 27, no. 10, pp. 2177-2188, Oct. 2017.

[7] M. Xu, T. Li, Z. Wang, X. Deng, R. Yang, and Z. Guan, "Reducing complexity of HEVC: A deep learning approach," IEEE Transactions on Image Processing, vol. 27, no. 10, pp. 5044-5059, Oct. 2018.

[8] R. Yang, M. Xu, T. Liu, Z. Wang, and Z. Guan, "Enhancing quality for HEVC compressed videos," IEEE Transactions on Circuits and Systems for Video Technology, 2018.

[9] X. Meng, C. Chen, S. Zhu, and B. Zeng, "A new HEVC in-loop filter based on multi-channel long-short-term dependency residual networks," in Data Compression Conference (DCC), March 2018, pp. 187-196.

[10] Y. Zhang, T. Shen, X. Ji, Y. Zhang, R. Xiong, and Q. Dai, "Residual highway convolutional neural networks for in-loop filtering in HEVC," IEEE Transactions on Image Processing, vol. 27, no. 8, pp. 3827-3841, Aug. 2018.

[11] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in IEEE Conference on Computer Vision and Pattern Recognition, June 2016, pp. 770-778.

[12] K. He, X. Zhang, S. Ren, and J. Sun, "Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification," in IEEE International Conference on Computer Vision, 2015, pp. 1026-1034.

[13] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.

[14] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, "Densely connected convolutional networks," in IEEE Conference on Computer Vision and Pattern Recognition, July 2017, pp. 4700-4708.

[15] J. Caballero, C. Ledig, A. Aitken, A. Acosta, J. Totz, Z. Wang, and W. Shi, "Real-time video super-resolution with spatio-temporal networks and motion compensation," in IEEE Conference on Computer Vision and Pattern Recognition, July 2017, pp. 2848-2857.

[16] X. Glorot, A. Bordes, and Y. Bengio, "Deep sparse rectifier neural networks," in Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, 2011, pp. 315-323.

[17] S. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh, "Convolutional pose machines," in IEEE Conference on Computer Vision and Pattern Recognition, June 2016, pp. 4724-4732.

[18] J. Ohm, G. J. Sullivan, H. Schwarz, T. K. Tan, and T. Wiegand, "Comparison of the coding efficiency of video coding standards, including high efficiency video coding (HEVC)," IEEE Transactions on Circuits and Systems for Video Technology, vol. 22, no. 12, pp. 1669-1684, Dec. 2012.

[19] Xiph.org, "Xiph.org video test media," https://media.xiph.org/video/derf, 2017.

[20] M. Xu, X. Deng, S. Li, and Z. Wang, "Region-of-interest based conversational HEVC coding with hierarchical perception model of face,"