A DenseNet Based Approach for Multi-Frame In-Loop Filter in HEVC
Tianyi Li∗, Mai Xu∗, Ren Yang∗ and Xiaoming Tao†
∗ School of Electronic and Information Engineering, Beihang University, Beijing, 100191, China
† Department of Electronic Engineering, Tsinghua University, Beijing, 100084, China
[email protected] (Corresponding author: Mai Xu)
Abstract
High efficiency video coding (HEVC) has brought outperforming efficiency for video compression. To reduce the compression artifacts of HEVC, we propose a DenseNet based approach as the in-loop filter of HEVC, which leverages multiple adjacent frames to enhance the quality of each encoded frame. Specifically, the higher-quality frames are found by a reference frame selector (RFS). Then, a deep neural network for the multi-frame in-loop filter (named MIF-Net) is developed to enhance the quality of each encoded frame by utilizing the spatial information of this frame and the temporal information of its neighboring higher-quality frames. MIF-Net is built on the recently developed DenseNet, benefiting from its improved generalization capacity and computational efficiency. Finally, experimental results verify the effectiveness of our multi-frame in-loop filter, which outperforms the HM baseline and other state-of-the-art approaches.
The high efficiency video coding (HEVC) standard [1] developed by the Joint Collaborative Team on Video Coding (JCT-VC) has brought outperforming efficiency for video compression. However, various artifacts (e.g., blocking, blurring and ringing artifacts) also exist in compressed videos, mainly resulting from the block-wise prediction and quantization with limited precision. To alleviate such artifacts, in-loop filters were adopted for enhancing the quality of each encoded frame and providing higher-quality references for its successive frames. Consequently, the coding efficiency can be further improved by adopting the in-loop filters.

In total, three built-in in-loop filters were proposed for standard HEVC, including the deblocking filter (DBF) [2], the sample adaptive offset (SAO) filter [3] and the adaptive loop filter (ALF) [4]. Specifically, DBF is first used to remove the blocking artifacts. Then, the SAO filter reduces distortion by adding an adaptive offset to each sample. Afterwards, ALF minimizes the distortion based on a Wiener filter. However, ALF introduces heavy bit-rate overhead and was not adopted in the final version of HEVC. Besides the built-in in-loop filters of HEVC, various heuristic and learning-based methods have also been proposed. In heuristic methods, some prior knowledge of video coding is utilized to build a statistical model of compression artifacts, and then each frame is enhanced based on that model. For example, Matsumura et al. [5] utilized the weighted mean of non-local similar frame patches for artifact reduction. Zhang et al. [6] attached a low-rank constraint to each matrix formed by a patch group, and then established an adaptive soft-thresholding model to achieve sparse representation. More recently, deep learning has been successfully employed in many areas related to data compression, such as video coding [7] and quality enhancement [8]. Learning-based methods have also further improved the performance of in-loop filtering. Among them, Meng et al. [9] developed a multi-channel long-short-term dependency residual network (MLSDRN) for mapping a distorted frame to the raw frame, inserted between DBF and SAO. Zhang et al. [10] proposed a residual highway CNN (RHCNN) based on the ResNet [11], implemented after the standard SAO. However, none of the above learning-based methods employs multiple frames for in-loop filtering in HEVC. Typically, high fluctuation of visual quality exists across the encoded frames, and thus a low-quality frame can be enhanced by referring to its adjacent higher-quality frames.

Based on deep learning, this paper develops a multi-frame in-loop filter (MIF) for HEVC, replacing the original DBF and SAO. Specifically, we first exploit the quality fluctuation of encoded frames by designing a reference frame selector (RFS) to find reference frames for an unfiltered reconstructed frame (URF), based on frame quality and content similarity. If RFS provides sufficient reference frames, the URF flows through a deep neural network for MIF (named MIF-Net) to utilize both spatial information within one frame and temporal information across multiple frames. In the case that no sufficient reference frames are selected by RFS, a simpler deep neural network for in-loop filtering (named IF-Net) is used to enhance the URF instead.
Considering that the blocking artifacts are influenced by the coding tree unit (CTU) partition, the proposed networks are also adaptive to the partition structure, via varying convolutional kernels at different locations of the coding unit (CU) and transform unit (TU) maps. Finally, the experimental results show that our approach outperforms other state-of-the-art approaches, with 5.33% and 2.40% saving of the Bjøntegaard delta bit-rate (BD-BR) over the non-local adaptive loop filter [6] and the RHCNN [10], respectively.
The framework of our MIF approach is illustrated in Figure 1. In standard HEVC, each raw frame is encoded through intra-/inter-mode prediction, discrete transform and quantization. Then, the predicted frame and the residual frame form a URF. Subsequently, the URF is filtered with DBF and SAO for quality enhancement. Different from standard HEVC, we propose a deep-learning-based in-loop filter to enhance the URF, leveraging information from its neighboring frames. First, RFS selects high-quality and highly correlated frames as references, to be introduced in Section 2.2. Next, one of the two possible filtering modes is applied to the URF, as described below.

• Mode 1: MIF-Net.
Assume that M reference frames are needed in MIF-Net. If RFS selects at least M frames, the URF is processed by MIF-Net to generate an enhanced frame. In MIF-Net, each reference frame is first aligned with the URF in terms of content, with a motion compensation network. Then, both the aligned reference frames and the URF are fed into a quality enhancement network to output the reconstructed frame.

Figure 1: Framework of the proposed MIF.

• Mode 2: IF-Net.
If not enough reference frames are found for the URF, IF-Net is adopted instead for quality enhancement. In contrast to MIF-Net, IF-Net takes only the URF as input, without any consideration of multiple frames.

More details about Modes 1 and 2 are presented in Section 2.3. If MIF-Net or IF-Net fails to improve frame quality, the standard DBF and SAO can also be used as a supplementary mode. Finally, the best mode among the three possible choices (i.e., MIF-Net, IF-Net and the standard in-loop filters) is selected as the actual choice, ensuring the overall performance of our approach.
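To make the control flow above concrete, the following is a minimal Python sketch of the mode decision. The callables mif_net, if_net and dbf_sao stand in for the actual filters and are assumptions, as is the plain PSNR criterion for picking the best mode; the paper does not specify the exact encoder-side selection rule.

```python
import numpy as np

def psnr(a, b, peak=255.0):
    """PSNR between two frames (numpy arrays of the same shape)."""
    mse = np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)

def in_loop_filter(urf, raw_frame, ref_frames, M, mif_net, if_net, dbf_sao):
    """Sketch of the three-way mode decision.

    urf        : unfiltered reconstructed frame
    raw_frame  : original frame (available at the encoder side)
    ref_frames : reference frames already selected by RFS
    M          : number of reference frames required by MIF-Net
    mif_net, if_net, dbf_sao : caller-supplied callables (placeholders)
    """
    candidates = []
    if len(ref_frames) >= M:                       # Mode 1: MIF-Net
        candidates.append(mif_net(urf, ref_frames[:M]))
    else:                                          # Mode 2: IF-Net
        candidates.append(if_net(urf))
    candidates.append(dbf_sao(urf))                # supplementary standard filters

    # Keep the candidate closest to the raw frame (assumed PSNR criterion).
    return max(candidates, key=lambda f: psnr(f, raw_frame))
```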
In our approach, RFS selects reference frames for each URF. For the n-th URF (denoted as F^U_n) in a sequence, RFS examines its previous N encoded frames as the reference frame pool, each of which is denoted by F^P_i (n − N ≤ i ≤ n − 1). For each candidate F^P_i, six metrics are measured:

• ∆PSNR^Y_{i,n}, ∆PSNR^U_{i,n} and ∆PSNR^V_{i,n}: the PSNR increment of F^P_i over F^U_n, for the Y, U and V channels, respectively.

• CC^Y_{i,n}, CC^U_{i,n} and CC^V_{i,n}: the correlation coefficient (CC) values of frame content between F^P_i and F^U_n, for the Y, U and V channels, respectively.

Based on the above metrics, the reference frame pool is first divided into valid and invalid reference frames, and then all valid frames are fed into RFS-Net to select the final M frames. Specifically, a binary value V_{i,n} represents whether a reference frame from the pool is valid. If, for at least one channel of F^P_i, the PSNR increment is positive and the CC value is above a threshold τ, i.e., V_{i,n} = 1 in (1), F^P_i is regarded as a valid reference frame:

\[
V_{i,n} =
\begin{cases}
1, & \text{if } \bigvee_{c \in \{\mathrm{Y,U,V}\}} \left( \Delta\mathrm{PSNR}^{c}_{i,n} > 0 \ \wedge \ \mathrm{CC}^{c}_{i,n} > \tau \right) \\
0, & \text{otherwise}.
\end{cases}
\tag{1}
\]
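A minimal numpy sketch of the validity test in (1) is given below. It assumes that the PSNR of each frame is measured against its own raw frame (so the increment compares the quality of the already-encoded candidate with that of the URF) and that CC is the Pearson correlation coefficient between co-located planes; both are reasonable readings of the definitions above rather than details confirmed by the source.

```python
import numpy as np

def channel_psnr(a, b, peak=255.0):
    mse = np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)

def channel_cc(a, b):
    # Pearson correlation coefficient between two co-located planes.
    return np.corrcoef(a.ravel().astype(np.float64),
                       b.ravel().astype(np.float64))[0, 1]

def is_valid_reference(ref_yuv, ref_raw_yuv, urf_yuv, urf_raw_yuv, tau):
    """V_{i,n} of Eq. (1): the candidate F^P_i is valid if, for at least one
    channel, its quality exceeds that of the URF and its content correlation
    with the URF exceeds the threshold tau.

    Each *_yuv argument is a dict {"Y": plane, "U": plane, "V": plane}."""
    for c in ("Y", "U", "V"):
        delta_psnr = channel_psnr(ref_yuv[c], ref_raw_yuv[c]) \
                   - channel_psnr(urf_yuv[c], urf_raw_yuv[c])
        if delta_psnr > 0 and channel_cc(ref_yuv[c], urf_yuv[c]) > tau:
            return True
    return False
```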
If there exist at least M valid reference frames, the six metrics of each valid reference frame form a 6-dimensional vector, which is input to a two-layer fully connected network (named RFS-Net) to output a scalar R̂_{i,n}. The output R̂_{i,n} is a continuous variable representing the potential of F^P_i being the reference for F^U_n: a larger R̂_{i,n} indicates that F^P_i has more potential than other reference frames for enhancing F^U_n. Here, R̂_{i,n} is the value predicted by RFS-Net, with the corresponding ground-truth denoted by R_{i,n}. In RFS-Net, the ground-truth R_{i,n} should reflect the quality of a valid reference frame after it is aligned with F^U_n via motion compensation. To this end, we assign R_{i,n} as the PSNR between the compensated valid reference frame and the n-th raw frame (denoted as F_n). In accord with R̂_{i,n}, R_{i,n} is also Z-score normalized within one training batch. After normalization, the ℓ2-loss over the whole training batch can be used to measure the difference between R_{i,n} and R̂_{i,n}, formulated as

\[
L_{\mathrm{RFS}} = \sum_{\substack{n-N \le i \le n-1 \\ V_{i,n} = 1}} \left( R_{i,n} - \hat{R}_{i,n} \right)^{2},
\tag{2}
\]

which is optimized by the Adam algorithm [13]. (In RFS-Net, the 6-dimensional vector flows through two layers, with 12 hidden nodes and 1 output node, respectively. Both layers are activated with parametric rectified linear units (PReLU) [12]. Note that the samples in one training batch are extracted from the valid reference frames of only one URF, and the outputs of the samples in the same batch are Z-score normalized.) Using the trained RFS-Net model, the reference potential of all valid frames can be obtained. Then, RFS selects M frames denoted by {F^R_{m,n}}^M_{m=1}, where the index m indicates that F^R_{m,n} is the frame with the m-th highest R̂_{i,n} among all valid reference frames. In the exceptional case that the number of valid reference frames is less than M, RFS does not work and IF-Net is used to enhance F^U_n instead.
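The topology given above (6 inputs, 12 hidden nodes, 1 output, PReLU activations, per-batch Z-score normalization, ℓ2 loss, Adam) is enough to sketch RFS-Net in PyTorch. The learning rate and the epsilon guard in the normalization are assumptions not stated in the paper.

```python
import torch
import torch.nn as nn

class RFSNet(nn.Module):
    """Two-layer fully connected network: 6 metrics -> 12 hidden -> 1 score."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(6, 12), nn.PReLU(),
            nn.Linear(12, 1), nn.PReLU(),
        )

    def forward(self, x):                 # x: (num_valid_refs, 6)
        return self.net(x).squeeze(-1)

def zscore(v):                            # per-batch Z-score normalization
    return (v - v.mean()) / (v.std() + 1e-8)   # 1e-8 guard is an assumption

model = RFSNet()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # lr is an assumption

def train_step(metrics, gt_psnr):
    """One step of Eq. (2) on the valid reference frames of a single URF.

    metrics : (num_valid_refs, 6) tensor of the six RFS metrics
    gt_psnr : (num_valid_refs,) PSNR of each compensated reference vs. F_n
    """
    pred = zscore(model(metrics))
    loss = torch.sum((zscore(gt_psnr) - pred) ** 2)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```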
Figure 2: Architecture of MIF-Net or IF-Net.

This section mainly focuses on the architecture of MIF-Net and its training strategy, and then specifies the difference between IF-Net and MIF-Net. Figure 2 illustrates the overall architecture of MIF-Net or IF-Net. As shown in this figure, MIF-Net takes a URF F^U_n and its M reference frames {F^R_{m,n}}^M_{m=1} as input, to generate the enhanced frame F^E_n as output. The information from M parallel branches {B_m}^M_{m=1} is synthesized, with each branch B_m dealing with the corresponding reference frame F^R_{m,n}. In branch B_m, F^R_{m,n} is first aligned with F^U_n to produce a motion-compensated frame, denoted as F^C_{m,n}. Next, F^U_n together with F^C_{m,n} flows through a novel convolutional layer guided by the CTU partition structure of F^U_n (named the block-adaptive convolutional layer), to explore low-level features from different sources and merge these features with consideration of the CU and TU partition. Then, the low-level features flow through two successive dense units [14] to extract more comprehensive features within B_m. Finally, the extracted features from the M branches are concatenated together and further processed with two dense units to extract high-level features. For ease of training, the output of the last dense unit (denoted as F^∆_n) is regarded as a difference frame, and the enhanced frame F^E_n is the summation of F^∆_n and F^U_n. The details of the MIF-Net components are presented in the following.

Figure 3: Network details. (a) Motion compensation network. (b) Dense unit. For convolutional layers, "p × p, q" represents q output channels with p × p kernels. The convolutional stride is set to 1 by default, unless explicitly mentioned for certain layers.
Motion compensation network. We propose a motion compensation network based on the spatial transformer motion compensation (STMC) [15], for content alignment between F^R_{m,n} and F^U_n, as illustrated in Figure 3-(a). In [15], the STMC takes both F^R_{m,n} and F^U_n as input, to output a compensated frame denoted as F^STMC_{m,n}. The STMC consists of ×2 and ×4 down-scaling paths, which estimate coarse motion vector (MV) maps and warp F^R_{m,n} to output F^STMC_{m,n}. The two down-scaling paths in [15] are capable of estimating various scales of motion. However, the accuracy of the STMC is limited due to down-scaling, and its architecture can also be improved. Therefore, we propose a motion compensation network that improves the STMC as follows: besides the ×2 and ×4 down-scaling paths, a full-scale path is adopted to estimate fine MV maps M^X_{m,n} and M^Y_{m,n} at the original resolution, denoting the horizontal and vertical motion of all pixels from F^R_{m,n} to F^U_n. Finally, the compensated frame F^C_{m,n} is derived by

\[
F^{\mathrm{C}}_{m,n}(x, y) = \mathrm{Bil}\left\{ F^{\mathrm{R}}_{m,n}\left(x + M^{\mathrm{X}}_{m,n}(x, y),\; y + M^{\mathrm{Y}}_{m,n}(x, y)\right) \right\},
\tag{3}
\]

where x and y are the coordinates of a pixel, and Bil{·} represents bilinear interpolation, considering that the motion may be of non-integer pixels.
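The warping in Eq. (3) can be made concrete with a small numpy sketch; clipping the sub-pixel source coordinates to the frame border is an assumption about boundary handling.

```python
import numpy as np

def warp_bilinear(ref, mv_x, mv_y):
    """Eq. (3): F^C(x, y) = Bil{ F^R(x + M^X(x, y), y + M^Y(x, y)) }.

    ref          : reference frame F^R_{m,n}, shape (H, W)
    mv_x, mv_y   : per-pixel horizontal / vertical motion, shape (H, W)
    """
    H, W = ref.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    # Sub-pixel source coordinates, clipped to the frame (assumption).
    sx = np.clip(xs + mv_x, 0, W - 1)
    sy = np.clip(ys + mv_y, 0, H - 1)

    x0 = np.floor(sx).astype(int); x1 = np.minimum(x0 + 1, W - 1)
    y0 = np.floor(sy).astype(int); y1 = np.minimum(y0 + 1, H - 1)
    wx = sx - x0
    wy = sy - y0

    # Bilinear blend of the four neighbouring reference pixels.
    top = (1 - wx) * ref[y0, x0] + wx * ref[y0, x1]
    bot = (1 - wx) * ref[y1, x0] + wx * ref[y1, x1]
    return (1 - wy) * top + wy * bot
```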
Block-adaptive convolutional layer. The input to this layer is a concatenation of three feature maps, namely a compensated frame F^C_{m,n}, the URF F^U_n and their difference F^C_{m,n} − F^U_n. The CU and TU partitions are represented by two guidance maps, i.e., C_n and T_n, respectively. C_n and T_n each have the same size as F^U_n, and their values are assigned according to the partition structure: if pixel (x, y) is on the boundary of a CU or TU, C_n(x, y) or T_n(x, y) is set to 1; otherwise, the value is set to −1. In this layer, a guided convolution is performed, where P_I, P_G and P_O feature maps are used as the input, guidance and output, respectively. The guided convolution consists of two main procedures, i.e., intermediate map generation and convolution with intermediation. First, the P_G guidance feature maps are processed with two typical convolutional layers to generate P_O intermediate feature maps, keeping the size of each feature map unchanged. Then, during the convolution, the P_O output feature maps are generated based on these P_O intermediate feature maps, correspondingly. Compared with typical convolution using space-irrelevant weights w_{j,l} only, the guided convolution is conducted with space-relevant weights w^G_{j,l} generated from the intermediation, as formulated below:

\[
w^{G}_{j,l}(\Delta x, \Delta y) = w_{j,l}(\Delta x, \Delta y) \cdot F^{\mathrm{M}}_{l}(x + \Delta x,\; y + \Delta y),
\tag{4}
\]

\[
F^{\mathrm{O}}_{l}(x, y) = \sum_{j=1}^{P_{\mathrm{I}}} \sum_{\Delta x = -1}^{1} \sum_{\Delta y = -1}^{1} w^{G}_{j,l}(\Delta x, \Delta y) \cdot F^{\mathrm{I}}_{j}(x + \Delta x,\; y + \Delta y).
\tag{5}
\]

In (4) and (5), F^I_j, F^M_l and F^O_l represent the j-th input, the l-th intermediate and the l-th output feature maps, respectively, and ∆x, ∆y denote the relative coordinates within a 3 × 3 convolutional kernel. In this layer, P_I = 3 and P_G = 2, and we set the number of output maps to P_O = 16.
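A direct, unoptimized numpy sketch of Eqs. (4) and (5) for a single output map l is given below; the zero-padding of the 3 × 3 neighbourhood at the frame border is an assumption.

```python
import numpy as np

def guided_conv_single_map(inputs, intermediate, weights):
    """Eqs. (4)-(5): convolution whose weights are modulated, per position,
    by an intermediate (guidance-derived) feature map.

    inputs       : list of P_I input maps F^I_j, each of shape (H, W)
    intermediate : intermediate map F^M_l for this output map, shape (H, W)
    weights      : space-irrelevant kernels w_{j,l}, shape (P_I, 3, 3)
    Returns the output map F^O_l of shape (H, W).
    """
    H, W = intermediate.shape
    out = np.zeros((H, W))
    pad = lambda a: np.pad(a, 1, mode="constant")   # zero-padding (assumption)
    inputs_p = [pad(f) for f in inputs]
    inter_p = pad(intermediate)

    for y in range(H):
        for x in range(W):
            acc = 0.0
            for j, f in enumerate(inputs_p):
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        # Eq. (4): space-relevant weight at this position.
                        w_g = weights[j, dy + 1, dx + 1] * inter_p[y + dy + 1, x + dx + 1]
                        # Eq. (5): accumulate over inputs and offsets.
                        acc += w_g * f[y + dy + 1, x + dx + 1]
            out[y, x] = acc
    return out
```

In practice this would be vectorized (or fused into a custom layer), but the nested loops mirror the summations in (5) one-to-one.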
Dense units for quality enhancement. The DenseNet [14] introduces inter-layer connections of various lengths, which alleviate vanishing gradients and encourage feature reuse. Considering these advantages, (2M + 2) dense units are adopted in MIF-Net, i.e., 2 dense units in each branch and 2 dense units at the end of MIF-Net synthesizing the features from the M branches. Figure 3-(b) illustrates the structure of each dense unit; a dense unit with 4 convolutional layers includes 10 inter-layer connections, many more than the 4 connections of a 4-layer plain CNN. Here, each layer outputs 12 channels, except the last layer in the final dense unit, which outputs only 1 channel as the difference frame F^∆_n.
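A PyTorch sketch of one dense unit under the description above (4 convolutional layers, 12 output channels each, every layer fed the concatenation of the unit input and all preceding layer outputs, giving 10 inter-layer connections). The 3 × 3 kernels and PReLU activations follow the layer notation in Figure 3; returning only the last layer's maps as the unit output is an assumption.

```python
import torch
import torch.nn as nn

class DenseUnit(nn.Module):
    """One dense unit: 4 conv layers with dense (concatenated) connections."""
    def __init__(self, in_channels, growth=12, num_layers=4):
        super().__init__()
        self.layers = nn.ModuleList()
        channels = in_channels
        for _ in range(num_layers):
            self.layers.append(nn.Sequential(
                nn.Conv2d(channels, growth, kernel_size=3, padding=1),
                nn.PReLU(),
            ))
            channels += growth            # the next layer also sees this output

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            out = layer(torch.cat(features, dim=1))   # dense concatenation
            features.append(out)
        return features[-1]               # assumption: last layer's maps are the output
```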
MIF-Net Training. With both motion compensation and quality enhancement, it may be difficult to train the whole MIF-Net directly. Thus, we propose to train it with intermediate supervision [17], introducing two loss functions at different stages. First, the difference between F^U_n and each frame in {F^C_{m,n}}^M_{m=1} measures the performance of motion compensation, and thus it is defined as the intermediate loss

\[
L_{\mathrm{INT}} = \frac{1}{M} \sum_{m=1}^{M} \left\| F^{\mathrm{C}}_{m,n} - F^{\mathrm{U}}_{n} \right\|_{2},
\tag{6}
\]

where ‖·‖₂ represents the ℓ2-norm. Next, the difference between F^E_n and F_n indicates the performance of the whole MIF-Net, and thus the global loss is

\[
L_{\mathrm{GLO}} = \left\| F^{\mathrm{E}}_{n} - F_{n} \right\|_{2}.
\tag{7}
\]

The loss for training MIF-Net is their weighted summation:

\[
L = \alpha \cdot L_{\mathrm{INT}} + \beta \cdot L_{\mathrm{GLO}},
\tag{8}
\]

where α and β are adjustable positive weights. On account that quality enhancement relies on a well-trained motion compensation network, L_INT should be emphatically optimized with α ≫ β at the early stage of training. After L_INT converges, we set β ≫ α instead, to emphasize optimization of the global loss L_GLO.
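A sketch of the combined loss in Eqs. (6)-(8) with the two-stage weighting schedule; the concrete weight values quoted in the comments are those listed in Table 1.

```python
import torch

def mif_net_loss(compensated, urf, enhanced, raw, alpha, beta):
    """L = alpha * L_INT + beta * L_GLO  (Eqs. (6)-(8)).

    compensated : list of M compensated frames {F^C_{m,n}} (tensors)
    urf         : unfiltered reconstructed frame F^U_n
    enhanced    : network output F^E_n
    raw         : raw frame F_n
    """
    l_int = sum(torch.norm(c - urf, p=2) for c in compensated) / len(compensated)
    l_glo = torch.norm(enhanced - raw, p=2)
    return alpha * l_int + beta * l_glo

# Two-stage schedule (Table 1):
#   alpha, beta = 0.99, 0.01   # early stage, emphasizes motion compensation
#   alpha, beta = 0.01, 0.99   # after L_INT converges, emphasizes L_GLO
```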
Difference between IF-Net and MIF-Net. The difference between the two networks lies in the absence of the M reference frames in IF-Net. Therefore, only quality enhancement without motion compensation is conducted in IF-Net, illustrated by the red arrows in Figure 2. Compared with MIF-Net, only one branch without any compensated frame exists in IF-Net, and the concatenation synthesizing the M branches is also omitted. Despite its simplicity, a block-adaptive convolutional layer and four consecutive dense units still exist in IF-Net, ensuring sufficient network capacity. Considering that there is no motion compensation in IF-Net, the loss of IF-Net is the same as L_GLO in MIF-Net.
Experimental configurations.
In the experiments, all approaches for in-loop filtering were incorporated into the HEVC reference software HM 16.5. For our MIF approach, we established a large-scale database for HEVC in-loop filtering (named the HIF database) containing 111 raw video sequences, collected from the JCT-VC [18], Xiph.org [19] and the conversational video set [20]. Our HIF database was divided into non-overlapping sets for training (83 sequences), validation (10 sequences) and test (18 sequences). The training set was used to train the networks, and the hyper-parameters of our approach were tuned on the validation set. The test set was used for performance evaluation, containing all 18 standard sequences of the JCT-VC set [18]. The RA configuration was applied for both network training and performance evaluation at four QPs, {22, 27, 32, 37}. During evaluation, the BD-BR and the Bjøntegaard delta PSNR (BD-PSNR) were measured to assess the rate-distortion (RD) performance.
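For reference, the standard Bjøntegaard delta bit-rate calculation (cubic polynomial fit of log bit-rate against PSNR, averaged over the overlapping PSNR range) can be sketched in a few lines of numpy; this is the common BD metric, not code taken from the paper.

```python
import numpy as np

def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test):
    """BD-BR (%): average bit-rate difference of the test RD curve against
    the anchor curve, each given as arrays of (bit-rate, PSNR) points."""
    lr_a, lr_t = np.log10(rate_anchor), np.log10(rate_test)
    # Cubic fit of log-rate as a function of PSNR.
    p_a = np.polyfit(psnr_anchor, lr_a, 3)
    p_t = np.polyfit(psnr_test, lr_t, 3)

    # Integrate both fits over the overlapping PSNR interval.
    lo = max(min(psnr_anchor), min(psnr_test))
    hi = min(max(psnr_anchor), max(psnr_test))
    int_a = np.polyval(np.polyint(p_a), hi) - np.polyval(np.polyint(p_a), lo)
    int_t = np.polyval(np.polyint(p_t), hi) - np.polyval(np.polyint(p_t), lo)

    avg_log_diff = (int_t - int_a) / (hi - lo)
    return (10.0 ** avg_log_diff - 1.0) * 100.0
```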
Network settings. For our approach, one MIF-Net model and one IF-Net model were trained for each evaluated QP, while all QPs shared the same trained RFS-Net model. The tuned hyper-parameters of these networks are listed in Table 1. For training MIF-Net and IF-Net, all frames were segmented into 64 × 64 patches. Here, each training sample was composed of the co-located patches from a raw frame, a URF, a CU map, a TU map and the M reference frames (if available). Considering the efficiency of training, the IF-Net or MIF-Net model at QP = 37 was trained from scratch, while the model at each lower QP was fine-tuned from the trained model at the next higher QP.

Table 1: Hyper-parameters for the networks

Hyper-parameter | RFS-Net | MIF-Net or IF-Net
Size of reference frame pool: N | 16 | –
Threshold for CC value: τ |  | –
Number of reference frames: M |  |
Batch size | * |
Number of iterations |  |
Weights α and β in MIF-Net | – | 0.99 & 0.01 (at the beginning); 0.01 & 0.99 (after L_INT converged)

* The batch size of RFS-Net equals the number of valid reference frames for a URF.
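Returning to the training samples described under the network settings, the following sketch shows how a sample might be assembled from co-located 64 × 64 patches; the non-overlapping cropping layout and the dictionary packaging are assumptions for illustration.

```python
import numpy as np

def make_training_samples(raw, urf, cu_map, tu_map, ref_frames, patch=64):
    """Cut co-located 64x64 patches from the raw frame, the URF, the CU/TU
    maps and the M reference frames (all 2-D numpy arrays of equal size)."""
    H, W = raw.shape
    samples = []
    for y in range(0, H - patch + 1, patch):
        for x in range(0, W - patch + 1, patch):
            crop = lambda f: f[y:y + patch, x:x + patch]
            samples.append({
                "raw": crop(raw),
                "urf": crop(urf),
                "cu": crop(cu_map),
                "tu": crop(tu_map),
                "refs": [crop(r) for r in ref_frames],
            })
    return samples
```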
Table 2: RD performance of in-loop filters on the JCT-VC test set (each pair is BD-BR in % / BD-PSNR in dB)

Class | Sequence | Standard DBF and SAO | Non-local adaptive loop filter [6] | RHCNN [10] | Proposed MIF
A | PeopleOnStreet | -8.29 / 0.37 | -12.03 / 0.54 | -12.48 / 0.57 | -16.82 / 0.78
A | Traffic | -5.35 / 0.16 | -6.17 / 0.19 | -9.81 / 0.30 | -12.15 / 0.38
B | BasketballDrive | -6.65 / 0.15 | -8.84 / 0.20 | -11.05 / 0.25 | -14.87 / 0.35
B | BQTerrace | -7.15 / 0.11 | -11.40 / 0.17 | -14.36 / 0.23 | -17.13 / 0.27
B | Cactus | -7.54 / 0.16 | -8.90 / 0.19 | -12.52 / 0.27 | -15.83 / 0.35
B | Kimono | -7.54 / 0.22 | -9.26 / 0.27 | -10.48 / 0.31 | -12.24 / 0.37
B | ParkScene | -3.68 / 0.11 | -4.08 / 0.12 | -5.94 / 0.18 | -7.99 / 0.25
C | BasketballDrill | -5.02 / 0.21 | -5.39 / 0.22 | -7.81 / 0.33 | -10.32 / 0.43
C | BQMall | -3.93 / 0.15 | -4.45 / 0.17 | -7.65 / 0.30 | -9.38 / 0.37
C | PartyScene | -1.05 / 0.04 | -1.22 / 0.05 | -2.41 / 0.10 | -4.16 / 0.17
C | RaceHorses | -6.15 / 0.22 | -7.08 / 0.26 | -10.40 / 0.39 | -12.74 / 0.48
D | BasketballPass | -3.85 / 0.18 | -4.32 / 0.20 | -7.68 / 0.37 | -9.98 / 0.48
D | BlowingBubbles | -0.83 / 0.03 | -0.83 / 0.03 | -3.05 / 0.12 | -3.98 / 0.16
D | BQSquare | -0.05 / 0.00 | 0.01 / 0.00 | -3.24 / 0.12 | -4.40 / 0.17
D | RaceHorses | -4.44 / 0.20 | -4.80 / 0.22 | -8.84 / 0.41 | -10.99 / 0.51
E | FourPeople | -7.02 / 0.26 | -8.49 / 0.32 | -13.92 / 0.54 | -16.48 / 0.64
E | Johnny | -5.60 / 0.14 | -8.03 / 0.21 | -11.62 / 0.30 | -14.37 / 0.38
E | KristenAndSara | -6.41 / 0.20 | -8.01 / 0.25 | -12.62 / 0.41 | -15.34 / 0.50
 | Average | -5.03 / 0.16 | -6.29 / 0.20 | -9.22 / 0.30 | -11.62 / 0.39
Objective RD performance.
We analyze the objective performance of our MIF approach in terms of the BD-BR and BD-PSNR, compared with the standard in-loop filters (DBF and SAO), a model-based approach (the non-local adaptive loop filter [6]) and a deep-learning-based approach (the RHCNN [10]). For a fair comparison, the RHCNN models of [10] were re-trained on our HIF database. Table 2 tabulates the RD performance of all four approaches, with the original HM without in-loop filters used as the anchor. As indicated in Table 2, the BD-BR of our MIF approach is -11.62% averaged over the 18 standard test sequences, outperforming -5.03% of the HM baseline, -6.29% of [6] and -9.22% of [10]. In terms of BD-PSNR, our approach achieves 0.39 dB on average, again higher than the 0.16 dB, 0.20 dB and 0.30 dB of the three compared approaches.

Figure 4: Comparison of subjective visual quality on sequences RaceHorses (Class C) and PeopleOnStreet (Class A) at QP = 37.
Subjective visual quality.
Figure 4 compares the subjective visual quality of all four approaches. It can be observed that the frames enhanced by our approach exhibit less distortion than those of the other approaches, e.g., the clearer edge of the horse tail and the reduced blocking artifacts on the pedestrians. This highest visual quality mainly benefits from the utilization of multiple adjacent frames in the proposed MIF approach.
In this paper, we have proposed a DenseNet based in-loop filter for HEVC. Different from existing in-loop filter approaches based on a single frame, our MIF approach enhances the quality of each encoded frame by leveraging multiple adjacent frames. To this end, we first propose an RFS to find higher-quality reference frames. Then, we develop the MIF-Net model for the multi-frame in-loop filter in HEVC, which is based on the DenseNet and benefits from its improved generalization capacity and computational efficiency. Finally, experimental results demonstrate that our approach achieves -11.62% of BD-BR saving and 0.39 dB of BD-PSNR improvement on average, outperforming the state-of-the-art approaches.

Acknowledgment
This work was supported by the NSFC under Grants 61876013 and 61573037, and by the Fok Ying Tung Education Foundation under Grant 151061.
References

[1] G. J. Sullivan, J. Ohm, W. Han, and T. Wiegand, "Overview of the high efficiency video coding (HEVC) standard," IEEE Transactions on Circuits and Systems for Video Technology, vol. 22, no. 12, pp. 1649-1668, Dec. 2012.

[2] A. Norkin, G. Bjontegaard, A. Fuldseth, M. Narroschke, M. Ikeda, K. Andersson, M. Zhou, and G. Van der Auwera, "HEVC deblocking filter," IEEE Transactions on Circuits and Systems for Video Technology, vol. 22, no. 12, pp. 1746-1754, Dec. 2012.

[3] C. Fu, E. Alshina, A. Alshin, Y. Huang, C. Chen, C. Tsai, C. Hsu, S. Lei, J. Park, and W. Han, "Sample adaptive offset in the HEVC standard," IEEE Transactions on Circuits and Systems for Video Technology, vol. 22, no. 12, pp. 1755-1764, Dec. 2012.

[4] C. Tsai, C. Chen, T. Yamakage, I. S. Chong, Y. Huang, C. Fu, T. Itoh, T. Watanabe, T. Chujoh, M. Karczewicz, and S. Lei, "Adaptive loop filtering for video coding," IEEE Journal of Selected Topics in Signal Processing, vol. 7, no. 6, pp. 934-945, Dec. 2013.

[5] M. Matsumura, Y. Bandoh, S. Takamura, and H. Jozawa, "In-loop filter based on non-local means filter," JCTVC-E206, ITU-T SG16, Geneva, Switzerland, March 2011.

[6] X. Zhang, R. Xiong, W. Lin, J. Zhang, S. Wang, S. Ma, and W. Gao, "Low-rank-based nonlocal adaptive loop filter for high-efficiency video compression," IEEE Transactions on Circuits and Systems for Video Technology, vol. 27, no. 10, pp. 2177-2188, Oct. 2017.

[7] M. Xu, T. Li, Z. Wang, X. Deng, R. Yang, and Z. Guan, "Reducing complexity of HEVC: A deep learning approach," IEEE Transactions on Image Processing, vol. 27, no. 10, pp. 5044-5059, Oct. 2018.

[8] R. Yang, M. Xu, T. Liu, Z. Wang, and Z. Guan, "Enhancing quality for HEVC compressed videos," IEEE Transactions on Circuits and Systems for Video Technology, 2018.

[9] X. Meng, C. Chen, S. Zhu, and B. Zeng, "A new HEVC in-loop filter based on multi-channel long-short-term dependency residual networks," in Data Compression Conference (DCC), March 2018, pp. 187-196.

[10] Y. Zhang, T. Shen, X. Ji, Y. Zhang, R. Xiong, and Q. Dai, "Residual highway convolutional neural networks for in-loop filtering in HEVC," IEEE Transactions on Image Processing, vol. 27, no. 8, pp. 3827-3841, Aug. 2018.

[11] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in IEEE Conference on Computer Vision and Pattern Recognition, June 2016, pp. 770-778.

[12] K. He, X. Zhang, S. Ren, and J. Sun, "Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification," in IEEE International Conference on Computer Vision, 2015, pp. 1026-1034.

[13] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.

[14] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, "Densely connected convolutional networks," in IEEE Conference on Computer Vision and Pattern Recognition, July 2017, pp. 4700-4708.

[15] J. Caballero, C. Ledig, A. Aitken, A. Acosta, J. Totz, Z. Wang, and W. Shi, "Real-time video super-resolution with spatio-temporal networks and motion compensation," in IEEE Conference on Computer Vision and Pattern Recognition, July 2017, pp. 2848-2857.

[16] X. Glorot, A. Bordes, and Y. Bengio, "Deep sparse rectifier neural networks," in Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, 2011, pp. 315-323.

[17] S. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh, "Convolutional pose machines," in IEEE Conference on Computer Vision and Pattern Recognition, June 2016, pp. 4724-4732.

[18] J. Ohm, G. J. Sullivan, H. Schwarz, T. K. Tan, and T. Wiegand, "Comparison of the coding efficiency of video coding standards, including high efficiency video coding (HEVC)," IEEE Transactions on Circuits and Systems for Video Technology, vol. 22, no. 12, pp. 1669-1684, Dec. 2012.

[19] Xiph.org, "Xiph.org video test media," https://media.xiph.org/video/derf, 2017.

[20] M. Xu, X. Deng, S. Li, and Z. Wang, "Region-of-interest based conversational HEVC coding with hierarchical perception model of face,"