JOURNAL OF LATEX CLASS FILES
Learning a Virtual Codec Based on Deep Convolutional Neural Network to Compress Image
Lijun Zhao, Huihui Bai, Member, IEEE, Anhong Wang, Member, IEEE, and Yao Zhao, Senior Member, IEEE
Abstract—Although deep convolutional neural networks have been proved to efficiently eliminate the coding artifacts caused by the coarse quantization of a traditional codec, it is difficult to train any neural network placed in front of the encoder, because the codec blocks gradient back-propagation. In this paper, we propose an end-to-end image compression framework based on convolutional neural networks to resolve the non-differentiability of the quantization function in a standard codec. First, a feature description neural network is used to obtain a valid description of the ground-truth image in a low-dimensional space, so that the amount of image data is greatly reduced for storage or transmission. After this valid description, a standard image codec such as JPEG is leveraged to further compress the image, which introduces severe distortion and compression artifacts, especially blocking artifacts, missing details, blurring, and ringing artifacts. Then, we use a post-processing neural network to remove these artifacts. Because directly learning a non-linear function for a standard codec with a convolutional neural network is challenging, we propose to learn a virtual codec neural network that approximates the projection from the valid description image to the post-processed compressed image, so that the gradient can be efficiently back-propagated from the post-processing neural network to the feature description neural network during training. Meanwhile, an advanced learning algorithm is proposed to train our deep neural networks for compression. An obvious advantage of the proposed method is that it is compatible with existing standard codecs, and our learning strategy can be easily extended to codecs based on convolutional neural networks. Experimental results demonstrate the advantages of the proposed method over several state-of-the-art approaches, especially at very low bit-rates.
Index Terms—Virtual codec, valid description, post-processing, convolutional neural network, image compression, compression artifact.
I. INTRODUCTION

IMAGE and video compression is an essential and efficient tool to reduce the amount of social media and multimedia data on the Internet. Traditional image compression standards, such as JPEG and HEVC, are built on a block-wise transformation and quantization coding framework, which can largely reduce each image block's redundancy [1-3]. However, the quantization applied after individual block transformation inevitably results in blocking artifacts during image coding. Meanwhile, large quantization
L. Zhao, H. Bai, and Y. Zhao are with the Beijing Key Laboratory of Advanced Information Science and Network Technology, Institute of Information Science, Beijing Jiaotong University, Beijing, 100044, P. R. China (e-mail: {15112084, hhbai, yzhao}@bjtu.edu.cn). A. Wang is with the Institute of Digital Media & Communication, Taiyuan University of Science and Technology, Taiyuan, 030024, P. R. China (e-mail: wah [email protected]).

parameters are always assigned to the codec in order to achieve low bit-rate coding, leading to serious blurring and ringing artifacts [4-6] when the transmission bandwidth is very limited. In order to alleviate the problem of Internet transmission congestion [7], advanced coding techniques, such as de-blocking and post-processing [8], are still hot and open research issues.

The post-processing technique for a compressed image can be explicitly embedded into the codec to improve coding efficiency and reduce the artifacts caused by coarse quantization. For instance, adaptive de-blocking filtering is designed as a loop filter and integrated into the H.264/MPEG-4 AVC video coding standard [9], which does not require an extra frame buffer at the decoder. The advantage of de-blocking filtering inside the codec is to ensure that an established level of image quality is coded and conveyed over the transmission channel. However, this kind of filtering incurs relatively high computational complexity. In order to avoid this drawback and keep the filtering compatible with a traditional codec, a flexible alternative is to apply the filtering as a post-processing operation after image decoding. In [10], two methods are introduced to reduce blocking effects: a filtering method and an overlapped-block-based method.

To date, a large number of methods have been studied to remove compression artifacts by efficient filtering or other algorithms. In [11], a wavelet-based algorithm uses a three-scale over-complete wavelet to de-block via a theoretical analysis of blocking artifacts.
In [12], through an analysis of the image's total variation that defines two kinds of regions, adaptive bilateral filters are used as an image de-blocking method to treat these two regions differently. In [13], by defining a new metric involving the smoothness of block boundaries and the fidelity of image content to evaluate blocking artifacts, quantization noise on the blocks is removed by a non-local means filter. For image de-noising and de-blocking, both hard-thresholding and empirical Wiener filtering are carried out in the shape-adaptive discrete cosine transform (DCT) domain, where the support of an arbitrarily shaped transform is adaptively calculated for all pixels [14].

Beyond the above-mentioned methods [11-14], many works have incorporated priors or expert knowledge into their models. In [15], compression artifacts are reduced by adaptively estimating DCT coefficients in overlapped transform blocks and integrating a quantization noise model with a block similarity prior model. In [16], the maximum a posteriori criterion is used to resolve the problem of compressed-image post-processing by treating post-processing as an inverse problem. In [17], an artifact-reducing approach is developed to reduce JPEG compression artifacts via dictionary learning and total variation regularization. In [18], using a constrained non-convex low-rank model, image de-blocking is formulated as an optimization problem within a maximum a posteriori framework. In [19], by combining JPEG prior knowledge and sparse coding expertise, deep dual-domain based restoration is developed for JPEG-compressed images. In [20], by exploiting the redundancy of the residual in JPEG streams and the sparsity properties of the latent images, compressed-image restoration is regarded as a sparse coding process carried out jointly in the DCT and pixel domains.
Different from [15-20], the technique of structure-texture decomposition has been used in [21] to reduce compression artifacts for JPEG-compressed images as well as to enhance image contrast.

Unlike the specific task of compression artifact removal, image de-noising is a more general technique to remove noise such as additive Gaussian noise, equipment noise, compression artifacts, and so on. In [22], based on a sparse representation in the transform domain, an advanced image de-noising strategy achieves collaborative filtering through the following steps: grouping similar 2-D image fragments into 3-D data arrays, 3-D transformation of a group, shrinkage of the transform spectrum, and inverse 3-D transformation. In [23], by exploiting an image's nonlocal self-similarity, a weighted nuclear norm minimization problem is studied for image de-noising, and the solutions of this problem are analyzed under different weighting conditions. In [24], self-learning based image decomposition is applied to single-image denoising, where an over-complete dictionary is learned from the input image's high-spatial-frequency components for image reconstruction.

Deep learning has achieved great success in high-level computer vision tasks [25], such as image classification, object detection and tracking, and semantic segmentation. Meanwhile, it has repeatedly set milestones in low-level image processing, such as de-noising, image super-resolution, and image in-painting. Early on, a plain multi-layer perceptron was employed to directly learn a projection from a noisy image to a noise-free image [26]. Recently, by directly learning an end-to-end mapping from a low-resolution image to its high-resolution counterpart, a convolutional neural network was used to resolve the problem of image super-resolution [27]. More importantly, a general solution to image-to-image translation problems has been obtained with conditional generative adversarial networks [28].
Later, a conditional generative adversarial network was used in [29] to resolve multi-task learning problems, such as simultaneous color image super-resolution and depth image super-resolution, and simultaneous image smoothing and edge detection.

From the literature [25-29], it can be seen that deep learning has been widely applied in various fields. For compression artifact suppression in JPEG-compressed images, several works have pioneered the elimination of compression artifacts with deep convolutional neural networks. In [30], a 12-layer deep convolutional neural network with hierarchical skip connections is trained with a multi-scale loss function. In [31], a conditional generative adversarial framework is trained by replacing full-size patch generation with sub-patch discrimination, to remove compression artifacts and make the enhanced image look as realistic as possible. Besides, several works [32-34] have used convolutional neural networks to compress images, obtaining appealing results and making a great contribution toward the next image compression standard. However, existing standard codecs are still widely used all over the world, so how to make convolutional neural network based coding compatible with traditional codecs remains an open issue. Recently, two convolutional neural networks were trained with a unified optimization method to cooperate with each other, so as to obtain a compact intermediate representation for encoding and to reconstruct the decoded image with high quality [35]. Although a unified end-to-end learning algorithm is presented to simultaneously learn these two convolutional neural networks, its approximation, which directly connects the two networks before and after the codec, is not optimal. Our intuitive idea for an optimal solution to this problem would be to use a convolutional neural network to perfectly replace the classic codec for gradient back-propagation. However, this remains a challenging task.
Although image-to-image translation [28, 29] achieves many visually pleasant results that look very realistic, it is difficult to learn a mapping from the input image to the decoded image of a standard codec. Fortunately, the projection from the valid description image to the post-processed compressed image can be well learned by a convolutional neural network.

In this paper, we propose a new end-to-end neural network framework to compress images by learning a virtual codec neural network (denoted as VCNN). Firstly, a feature description neural network (FDNN) is used to obtain a valid description of the ground-truth image in a low-dimensional space, so that the amount of image data can be greatly reduced by the FDNN network. After this valid description, a standard image codec such as JPEG is leveraged to further compress the image, which introduces severe distortion and compression artifacts. Finally, we use a post-processing neural network (PPNN) to remove these compression artifacts. The experimental results will validate the efficiency of the proposed method, especially at very low bit-rates. Our contributions are listed as follows:

1) In order to efficiently back-propagate the gradient from the PPNN network to the FDNN network during training, the VCNN network is proposed to obtain an optimal approximation of the projection from the valid feature description image to the post-processed compressed image.

2) Due to the difficulty of directly training the whole framework at once, the learning of the three convolutional neural networks in our framework can be decomposed
Fig. 1. The framework of learning a virtual codec neural network to compress image.

into the learning of three sub-problems. Although three convolutional neural networks are used during training, only two of them are used for testing.

3) Our framework is compatible with standard codecs, so there is no need to change any part of the standard codec. Meanwhile, our learning strategy can be easily extended to codecs based on convolutional neural networks.

The rest of this paper is arranged as follows. First, we give a detailed description of the proposed method in Section II, which is followed by the experimental results in Section III. Finally, we draw conclusions in Section IV.

II. THE METHODOLOGY
In this paper, we propose a novel way to resolve the problem of the non-differentiability of the quantization function applied after block transformation in a classic codec, e.g., JPEG, when both convolutional neural networks and a traditional codec are used to compress images at very low bit-rates. The idea is to learn a virtual codec neural network that optimally approximates the mapping from the feature description image to the post-processed compressed image.

Our framework is composed of a standard codec (e.g., JPEG) and three convolutional neural networks: the FDNN network, the PPNN network, and the VCNN network, as shown in Fig. 1. In order to greatly reduce the amount of image data for storage or transmission, we use the FDNN network to obtain a valid description Y in a low-dimensional space with respect to the ground-truth image X of size M · N before image compression. For simplicity, the FDNN network is expressed as a non-linear function f(X, α), in which α is the parameter set of the FDNN network. The compression procedure of the standard codec is described as a mapping function Z = g(Y, β), where β is the parameter set of the codec. Our PPNN network learns a post-processing function h(Z, γ) from image Z to image X to remove the noise, such as blocking artifacts, ringing artifacts, and blurring, caused by the coarse quantization after the separate block transformations. Here, γ is the parameter set of the PPNN network.

In order to combine the standard codec with convolutional neural networks for compression, the direct way would be to learn a neural network that approximates the compression procedure of the codec. Although a convolutional neural network is a powerful tool to approximate any nonlinear function, it is well known that imitating the procedure of image compression is hard. The reason is that the quantization operator is applied separately in the transform domain of each block after the DCT, which leads to serious blocking artifacts and coding distortion.
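A toy numpy sketch makes the differentiability problem above concrete. The uniform step size below is our own simplifying assumption; real JPEG applies frequency-dependent quantization tables to 8x8 block-DCT coefficients, but any such rounding has the same property.

```python
import numpy as np

# Toy stand-in for the codec's quantizer (uniform step; an assumption for
# illustration, not the actual JPEG table-based quantization).
def quantize(y, step=16.0):
    return np.round(y / step) * step

# The mapping is piecewise constant: a small perturbation of the input almost
# never changes the output, so the derivative is zero almost everywhere and
# back-propagation through the codec carries no learning signal.
y = np.array([33.0, 41.0, 47.9, 48.1])
eps = 1e-3
grad = (quantize(y + eps) - quantize(y)) / eps   # finite-difference "gradient"
```

Since `grad` is identically zero except exactly at the quantization boundaries, no gradient can reach a network placed before the codec, which is what the VCNN is introduced to work around.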
However, compared with the compressed image Z, the post-processed compressed image ˜I has less distortion: ˜I loses some detail information, but does not exhibit obvious artifacts or blocking artifacts. Therefore, the function h(g(Y, β), γ), the composition of the two successive procedures of coding g(Y, β) and post-processing h(Z, γ), can be well represented by the VCNN network. To make sure that the gradient can be correctly back-propagated from the PPNN to the FDNN, our VCNN network is proposed to learn a projection function v(Y, θ) from the valid feature description Y to the final output ˜I of the PPNN. Here, θ is the parameter set of the VCNN network. This projection properly approximates the two successive procedures: compression by the standard codec and post-processing by the convolutional neural network. After training the VCNN network, we can use it to supervise the training of our FDNN network.

A. Objective function
Our objective function is written as follows:

$$\arg\min_{\alpha,\gamma,\theta} \; L(X, \tilde{I}) + L(\hat{I}, \tilde{I}) + L_{SSIM}(s(Y), X),$$
$$Y = f(X, \alpha), \quad \tilde{I} = h(Z, \gamma), \quad Z = g(Y, \beta), \quad \hat{I} = v(Y, \theta), \qquad (1)$$

where α, γ, and θ are respectively the three parameter sets of the FDNN, PPNN, and VCNN, and s(·) is the linear up-sampling
Fig. 2. The structure of the proposed three convolutional neural networks: FDNN, PPNN, and VCNN.

operator, so that s(Y) and X have the same image size. Here, in order to make the final output image ˜I similar to X, L(X, ˜I) consists of the L1 content loss L_content(X, ˜I) and the L1 gradient difference loss L_gradient(X, ˜I) for the regularization of training the FDNN network:

$$L_{content}(X, \tilde{I}) = \frac{1}{M \cdot N} \sum_i \| X_i - \tilde{I}_i \|, \qquad (2)$$

$$L_{gradient}(X, \tilde{I}) = \frac{1}{M \cdot N} \sum_i \sum_{k \in \Omega} \| \nabla_k X_i - \nabla_k \tilde{I}_i \|, \qquad (3)$$

where ‖·‖ is the L1 norm, which supervises a convolutional neural network's training better than the L2 norm, as reported in [36], which successfully learns to predict future images from video sequences.

Since the standard codec stands as a big obstacle between the PPNN network and the FDNN network, it is tough to back-propagate the gradient between them. Therefore, it is a challenging task to train the FDNN network directly, without the supervision of the PPNN network. To address this, we learn a nonlinear function from Y to ˜I with the VCNN network, where the L1 content loss L_content(ˆI, ˜I) and the L1 gradient difference loss L_gradient(ˆI, ˜I) in Eqs. (4)-(5) are used to supervise the VCNN network's training. Here, ˆI is the result predicted by the VCNN network to approximate ˜I.
$$L_{content}(\hat{I}, \tilde{I}) = \frac{1}{M \cdot N} \sum_i \| \hat{I}_i - \tilde{I}_i \|, \qquad (4)$$

$$L_{gradient}(\hat{I}, \tilde{I}) = \frac{1}{M \cdot N} \sum_i \sum_{k \in \Omega} \| \nabla_k \hat{I}_i - \nabla_k \tilde{I}_i \|. \qquad (5)$$

Moreover, we hope that the structural information of the feature description image is similar to that of the ground-truth image X, so the SSIM loss L_SSIM(s(Y), X) [31, 37], besides the loss from the VCNN network, is used to further supervise the learning of the FDNN. It is defined as follows:

$$L_{SSIM}(s(Y), X) = -\frac{1}{M \cdot N} \sum_i L_{SSIM}(s(Y)_i, X_i), \qquad (6)$$

$$L_{SSIM}(s(Y)_i, X_i) = \frac{(2 \mu_{s(Y)_i} \mu_{X_i} + c_1)(2 \sigma_{s(Y)_i X_i} + c_2)}{(\mu_{s(Y)_i}^2 + \mu_{X_i}^2 + c_1)(\sigma_{s(Y)_i}^2 + \sigma_{X_i}^2 + c_2)}, \qquad (7)$$

where c_1 and c_2 are two small constants. µ_{X_i} and σ_{X_i} respectively denote the mean value and the variance of the neighborhood window centered at pixel i in image X; µ_{s(Y)_i} and σ_{s(Y)_i} are defined similarly. Meanwhile, σ_{s(Y)_i X_i} is the covariance between the neighborhood windows centered at pixel i in image X and in image s(Y). Because the SSIM function is differentiable, the gradient can be efficiently back-propagated during the FDNN network's training.

B. Proposed Network
As depicted in Fig. 2, eight convolutional layers in the FDNN network are used to extract features from the ground-truth image X to obtain a valid feature description Y. The convolutional weights have a spatial size of 9x9 for the first and last layers, which makes the receptive field (RF) of the network large enough. In addition, the other six convolutional layers in the FDNN use 3x3 kernels to further enlarge the RF. In this figure, "Conv:1x9x9x128" denotes a convolutional layer whose input has 1 channel, whose kernel is 9x9 in the spatial domain, and whose number of output feature maps is 128; the other convolutional layers are labeled similarly. These convolutional layers increase the nonlinearity of the network, as ReLU activations follow the outputs of the hidden convolutional layers. The number of feature maps in the 1st to 7th convolutional layers is 128, but the last layer has only one feature map, so as to stay consistent with the ground-truth image X. Each convolutional layer uses a stride of 1, except that the second layer uses a stride of 2 to down-sample the feature maps, so that from the third to the eighth convolutional layer the convolutions operate in the low-resolution space to reduce computational complexity. All convolutional layers are followed by a ReLU activation, except the last one.

In the PPNN network, as shown in Fig. 2, we leverage seven convolutional layers to extract features, each activated by a ReLU function. The kernel size is 9x9 in the first layer and 3x3 in the remaining six layers, while the number of output feature maps is 128 in each of these convolutional layers.
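The receptive-field claim for the FDNN can be checked with the standard recursive RF formula. The layer list below follows the description above (9x9, a stride-2 3x3, five more 3x3, then 9x9); "same" padding is assumed, as is usual for this bookkeeping.

```python
def receptive_field(layers):
    """Receptive field (in input pixels) of stacked convolutions.

    layers: list of (kernel_size, stride) pairs, in input-to-output order.
    """
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump   # each layer widens the RF by (k-1) * current jump
        jump *= s              # stride compounds the sampling interval
    return rf

# FDNN as described: 9x9, then a stride-2 3x3, five more 3x3, and a final 9x9
fdnn = [(9, 1), (3, 2)] + [(3, 1)] * 5 + [(9, 1)]
```

For this configuration the formula gives an RF of 47 input pixels, large enough to aggregate well beyond a single 8x8 JPEG block.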
After these layers, one de-convolution layer with a 9x9 kernel and a stride of 2 is used to up-scale the feature maps from low resolution to high resolution, so that the size of the output image matches the ground-truth image.

We design the VCNN network to have the same structure as the PPNN network, as displayed in Fig. 2, because they belong to the same class of low-level image processing problems. From Fig. 2, it can also be seen that the VCNN network works to degrade the valid feature description image Y into a post-processed compressed but high-resolution image ˜I. In contrast, the functionality of the PPNN network is to improve the quality of the compressed feature description image Z, so that the user receives a high-quality image ˜I without blocking artifacts and ringing artifacts after post-processing with the PPNN network at the decoder.

Algorithm 1
Learning Algorithm for Training Our Three Convolutional Neural Networks: FDNN, PPNN, and VCNN

Input: ground-truth images X; the number of outer iterations K; the total number of training images n; the batch size m.
Output: the parameter sets of the FDNN and PPNN networks: α, γ.

Initialize the FDNN network's output by down-sampling, to prepare for the training of the PPNN network.
Initialize the parameter sets α, β, γ, θ.
for k = 1 to K do
    Compress the valid description images with the standard codec using β.
    for epoch = 1 to p do
        for i = 1 to floor(n/m) do
            Update γ by training the PPNN network to minimize Eqs. (2)-(3) with the i-th batch of images.
        end for
    end for
    for epoch = 1 to p do
        for j = 1 to floor(n/m) do
            Update θ by training the VCNN network to minimize Eqs. (4)-(5) with the j-th batch of images.
        end for
    end for
    for epoch = 1 to q do
        for l = 1 to floor(n/m) do
            Update α, with θ fixed, by training the FDNN network to minimize Eqs. (2)-(3) and Eqs. (6)-(7) with the l-th batch of images.
        end for
    end for
end for
Update γ by training the PPNN network to minimize Eqs. (2)-(3).
return α, γ

C. Learning Algorithm
Due to the difficulty of directly training the whole framework at once, we decompose the learning of the three convolutional neural networks in our framework into three sub-problems. First, we initialize all the parameter sets β, α, γ, and θ of the codec, FDNN network, PPNN network, and VCNN network. Meanwhile, we use the Bicubic, Nearest, Linear, Area, and LANCZOS4 interpolation methods to obtain an initial feature description image Y of the ground-truth image X, which is then compressed by the JPEG codec as the input of the training data set at the beginning. Next, the first sub-problem is to train the PPNN network by updating the parameter set γ according to Eqs. (2)-(3). The compressed description image Z obtained from the ground-truth image X and its post-processed compressed image ˜I predicted by the PPNN network are used for the second sub-problem, the learning of the VCNN, which updates the parameter set θ based on Eqs. (4)-(5). After the VCNN's learning, we fix its parameter set θ and carry out the third sub-problem by updating the parameter set α of the FDNN network according to Eqs. (2)-(3) and Eqs. (6)-(7). After the FDNN network's learning, the next iteration begins to train the PPNN network, once the updated description images have been compressed by the standard codec. Our whole training process is summarized in
Algorithm 1. It is worth mentioning that the functionality of the VCNN network is to bridge the great gap between the FDNN and the PPNN. Thus, once the training of our whole framework is finished, the VCNN network is no longer in use; that is to say, only the parameter sets α, γ of the FDNN and PPNN networks are used during testing.

Fig. 3. The data-set used for our testing.
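The alternating schedule of Algorithm 1 can be sketched as a plain Python skeleton. The callback names are ours, not the paper's: each `*_step` stands in for one batch update of the corresponding network, and `compress` for re-encoding the current descriptions with the standard codec.

```python
def train(K, p, q, num_batches, compress, ppnn_step, vcnn_step, fdnn_step):
    """Alternating three-sub-problem schedule of Algorithm 1 (illustrative)."""
    for _ in range(K):
        compress()                 # JPEG-compress the current description images
        for _ in range(p):         # sub-problem 1: PPNN minimizes Eqs. (2)-(3)
            for b in range(num_batches):
                ppnn_step(b)
        for _ in range(p):         # sub-problem 2: VCNN minimizes Eqs. (4)-(5)
            for b in range(num_batches):
                vcnn_step(b)
        for _ in range(q):         # sub-problem 3: FDNN (VCNN frozen), Eqs. (2)-(3), (6)-(7)
            for b in range(num_batches):
                fdnn_step(b)
    for b in range(num_batches):   # final PPNN refresh (one pass assumed here)
        ppnn_step(b)
```

With the paper's settings (K = 3, p = 60, q = 30), each outer iteration spends twice as many epochs on the PPNN and VCNN sub-problems as on the FDNN one, reflecting that the FDNN is only updated once the VCNN approximation is reliable.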
III. EXPERIMENTAL RESULTS
In order to demonstrate the effectiveness of the proposed framework, we compare our method with six approaches: JPEG [1], Foi's [14], BM3D [22], DicTV [17], CONCOLOR [18], and Jiang's [35]. Here, both Foi's [14] and BM3D [22] belong to the class of image de-noising; the method of Foi's [14] is specifically designed for de-blocking. The approaches DicTV [17] and CONCOLOR [18] use dictionary learning or a low-rank model to resolve the problems of de-blocking and de-artifacting. Our approach is highly related to Jiang's approach [35], which is a CNN-based method, so we give many comparisons between them later.
A. Training details
Our framework of learning a virtual codec neural network to compress images is implemented in TensorFlow [38]. The training data set comes from [39], which includes 400 images of size 180x180. We augment these data by cropping, rotating, and flipping images to build our training data set, in which the total number of image patches of size 160x160 is 3200 (n = 3200). For testing, as shown in Fig. 3, eight images that are broadly employed for compressed-image de-noising or de-artifacting are used to evaluate the efficiency of the proposed method. We train our model using the Adam optimizer with beta1 = 0.9 and beta2 = 0.999. The initial learning rate for training the three convolutional neural networks is set to 0.0001; the learning rate decays to half of the initial one once the training step reaches 3/5 of the total steps, and it decreases to 1/4 of the initial one when the training step reaches 4/5 of the total steps. In Algorithm 1, K equals 3, p = 60, q = 30, and m is set to 20.

B. The quality comparison of different methods
To validate the efficiency of the proposed framework at very low bit-rates, we compare our method with JPEG, Foi's [14], BM3D [22], DicTV [17], CONCOLOR [18], and Jiang's [35]. The JPEG implementation in OpenCV is used for all the experimental results. The results of Foi's [14], BM3D [22], DicTV [17], and CONCOLOR [18] are obtained by strictly using the authors' open codes with the parameter settings in their papers. However, the highly related method of Jiang's [35] only provides one quality factor for testing, so we re-implement their method with TensorFlow. Meanwhile, to compare fairly with Jiang's [35], we use our FDNN and PPNN to replace its ComCNN and ReCNN networks for training and testing, to avoid the effect of network structure design on the experimental results. Additionally, the training of the Jiang's simulation follows the framework in [35].

Clearly, Jiang's [35] has two convolutional neural networks, which are directly connected during training to back-propagate the gradient from the ReCNN to the ComCNN, whereas the proposed method has three convolutional neural networks, in which our virtual codec neural network accounts for the impact of the codec on the feature description neural network. From this aspect, it can be concluded that the proposed method is superior to Jiang's [35], which does not resolve the problem of the codec's effects on the ComCNN's training. For the comparisons with Jiang's [35] in the following, images are compressed by JPEG with quality factors of 5, 10, 20, and 40 for their training and testing. Meanwhile, the valid description of the input image in the proposed framework is also compressed by the JPEG codec with quality factors of 5, 10, 20, and 40 for training our three convolutional neural networks. Apart from the proposed method and Jiang's [35], the four other comparative methods deal with the JPEG-compressed image at full resolution, with the quality factor set to 2, 3, 4, 5, or 10.
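The comparisons that follow are reported in PSNR and SSIM. Minimal reference implementations can be sketched as below; the SSIM here is evaluated over one global window for brevity (the standard metric, as in Eq. (7), averages over small local windows), and the constants follow the conventional (0.01L)^2, (0.03L)^2 choice for 8-bit images, which is our assumption.

```python
import numpy as np

def psnr(x, y, peak=255.0):
    # Peak Signal-to-Noise Ratio in dB between two same-sized images
    mse = np.mean((np.asarray(x, dtype=float) - np.asarray(y, dtype=float)) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

def ssim_global(x, y, c1=(0.01 * 255) ** 2, c2=(0.03 * 255) ** 2):
    # SSIM in the form of Eq. (7), computed over a single global window;
    # a faithful implementation averages this over local windows per pixel.
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))
```

Identical images give an SSIM of 1 and an infinite PSNR; the bpp figures quoted alongside them are simply the compressed file size in bits divided by the pixel count.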
It is worth noticing that the JPEG codec is used in the proposed framework, but in fact our framework can be applied to most existing standard codecs.

We use the Peak Signal-to-Noise Ratio (PSNR) and the Structural SIMilarity index (SSIM) as objective quality measurements. From Fig. 4 and Fig. 5, where bpp denotes bits per pixel, it can be clearly observed that the proposed method has the best objective performance in PSNR and SSIM, as compared to several state-of-the-art approaches: JPEG [1], Foi's [14], BM3D [22], DicTV [17], CONCOLOR [18], and Jiang's [35]. Note that the results of Jiang's in Fig. 4 and Fig. 5 are re-implemented by us according to [35], as mentioned previously. Among these methods, CONCOLOR [18] has a stable objective performance and achieves a large gain in terms of PSNR and SSIM when compared with Foi's [14], BM3D [22], and DicTV [17]. As mentioned above, the proposed method can correctly back-propagate the gradient from the post-processing neural network to the feature description neural network ahead of the codec, so our method is nearly
Fig. 4. The objective comparison in PSNR and SSIM for several state-of-the-art approaches. (a1-a2) are the results for image (a) in Fig. 3, (b1-b2) for image (b), (c1-c2) for image (c), and (d1-d2) for image (d) in Fig. 3.

optimal as compared with the approach of Jiang's [35], which just provides a way to train their two convolutional neural networks together.

We have compared the visual quality of different methods for compression artifact removal, as shown in Fig. 6 and Fig. 7. From these figures, it can be seen that Foi's [14] and CONCOLOR [18] remove the blocking artifacts better than BM3D [22] and DicTV [17], but these methods may blur the image's boundaries. Both the proposed method and Jiang's [35] preserve discontinuities better than the other methods (please see the regions of the hair and the eyes of Lena in Fig. 6), but the proposed
Fig. 5. The objective comparison in PSNR and SSIM for several state-of-the-art approaches. (a1-a2) are the results for image (e) in Fig. 3, (b1-b2) for image (f), (c1-c2) for image (g), and (d1-d2) for image (h) in Fig. 3.

method can retain more details and save far more bits at very low bit-rates than Jiang's [35]. Meanwhile, we also show the difference between our FDNN's output image and the ComCNN's output image of [35], which results in different artifact distributions, so that different regions may be emphasized and protected during compression, as displayed in Fig. 6 (g, j) and Fig. 7 (g, j). This difference in artifact distributions, caused by the difference between our FDNN's description and the compact representation of Jiang's ComCNN, leads to the obvious reconstruction differences between Jiang's method and ours at the decoder, as displayed in Fig. 6 (h, k, i, l) and Fig. 7 (h, k, i, l). From the above comparisons, it can be seen that the back-propagation of the gradient from the post-processing neural network into the feature description network plays a significant role in the effectiveness of the feature description and the compression efficiency when combining neural networks with a standard codec to
Fig. 6. The visual comparisons for several state-of-the-art approaches. (a) input image of Lena; (b) compressed image of (a), 26.46/0.718/0.173 (PSNR/SSIM/bpp); (c) Foi's, 28.07/0.784/0.173; (d) BM3D, 27.76/0.764/0.173; (e) DicTV, 27.38/0.761/0.173; (f) CONCOLOR, 28.57/0.798/0.173; (g) the output of Jiang's ComCNN; (h) compressed image of (g); (i) the output of Jiang's RecCNN, 31.14/0.851/0.204; (j) our FDNN network's output; (k) compressed image of (j); (l) our PPNN network's output, 31.31/0.853/0.157. Note that the real resolution of (g-h) and (j-k) is half of the input image, while all the other images have the same size as (a).

effectively compress images. In a word, the proposed framework provides a good way to resolve the gradient back-propagation problem in an image compression framework with a convolutional neural network ahead of a standard codec, by learning a virtual codec neural network.

IV. CONCLUSION
In this paper, we propose a new image compression framework that resolves the problem of non-differentiability of the quantization function in lossy image compression by learning a virtual codec neural network at very low bit-rate. Our framework consists of a traditional codec, a feature description neural network, a post-processing neural network, and a virtual codec neural network. Directly learning the whole framework of the proposed method is an intractable problem, so we decompose this challenging optimization problem into the learning of three sub-problems. Finally, a large number of quantitative and qualitative experimental results have shown the superiority of the proposed method over several state-of-the-art methods.

REFERENCES

[1] G. Wallace, “The JPEG still picture compression standard,”
IEEE Transactions on Consumer Electronics, vol. 38, no. 1, pp. xviii–xxxiv, 1992.
[2] L. Shen, Z. Liu, X. Zhang, W. Zhao, and Z. Zhang, “An effective CU size decision method for HEVC encoders,” IEEE Transactions on Multimedia, vol. 15, no. 2, pp. 465–470, 2013.
[3] J. Xiong, H. Li, Q. Wu, and F. Meng, “A fast HEVC inter CU selection method based on pyramid motion divergence,” IEEE Transactions on Multimedia, vol. 16, no. 2, pp. 559–564, 2014.
[4] L. Zhao, A. Wang, B. Zeng, and Y. Wu, “Candidate value-based boundary filtering for compressed depth images,” Electronics Letters, vol. 51, no. 3, pp. 224–226, 2015.
[5] L. Zhao, H. Bai, A. Wang, Y. Zhao, and B. Zeng, “Two-stage filtering of compressed depth images with Markov random field,” Signal Processing: Image Communication, vol. 51, pp. 11–22, 2017.
[6] L. Zhao, H. Bai, A. Wang, and Y. Zhao, “Iterative range-domain weighted filter for structural preserving image smoothing and de-noising,” Multimedia Tools and Applications, pp. 1–28, 2017.

Fig. 7. The visual comparisons for several state-of-the-art approaches. (a) input image of House, (b) compressed image of (a) 27.77/0.773/0.197 (PSNR/SSIM/bpp), (c) Foi's 29.29/0.814/0.197, (d) BM3D 29.21/0.808/0.197, (e) DicTV 28.74/0.804/0.197, (f) CONCOLOR 30.16/0.829/0.197, (g) the output of Jiang's ComCNN, (h) compressed image of (g), (i) the output of Jiang's RecCNN 29.10/0.809/0.157, (j) our FDNN network's output, (k) compressed image of (j), (l) our PPNN network's output 29.57/0.819/0.131. Note that the real resolution of (g-h) and (j-k) is half of the input image, while all the other images have the same size as (a).

[7] J. Viron and C. Guillemot, “Real-time constrained TCP-compatible rate control for video over the Internet,” IEEE Transactions on Multimedia, vol. 6, no. 4, pp. 634–646, 2004.
[8] S. Yoo, K. Choi, and J. Ra, “Post-processing for blocking artifact reduction based on inter-block correlation,” IEEE Transactions on Multimedia, vol. 16, no. 6, pp. 1536–1548, 2014.
[9] P. List, A. Joch, J. Lainema, G. Bjontegaard, and M. Karczewicz, “Adaptive deblocking filter,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, no. 7, pp. 614–619, 2003.
[10] H. Reeve and J. Lim, “Reduction of blocking effect in image coding,” in IEEE International Conference on Acoustics, Speech, and Signal Processing, Boston, Massachusetts, USA, Apr. 1983.
[11] A. Liew and H. Yan, “Blocking artifacts suppression in block-coded images using overcomplete wavelet representation,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 14, no. 4, pp. 450–461, 2004.
[12] N. Francisco, N. Rodrigues, E. Da-Silva, and S. De-Faria, “A generic post-deblocking filter for block based image compression algorithms,” Signal Processing: Image Communication, vol. 27, no. 9, pp. 985–997, 2012.
[13] C. Wang, J. Zhou, and S. Liu, “Adaptive non-local means filter for image deblocking,” Signal Processing: Image Communication, vol. 28, no. 5, pp. 522–530, 2013.
[14] A. Foi, V. Katkovnik, and K. Egiazarian, “Pointwise shape-adaptive DCT for high-quality denoising and deblocking of grayscale and color images,” IEEE Transactions on Image Processing, vol. 16, no. 5, pp. 1395–1411, 2007.
[15] X. Zhang, R. Xiong, X. Fan, S. Ma, and W. Gao, “Compression artifact reduction by overlapped-block transform coefficient estimation with block similarity,” IEEE Transactions on Image Processing, vol. 22, no. 12, pp. 4613–4626, 2013.
[16] D. Sun and W. Cham, “Postprocessing of low bit-rate block DCT coded images based on a fields of experts prior,” IEEE Transactions on Image Processing, vol. 16, no. 11, pp. 2743–2751, 2007.
[17] H. Chang, M. Ng, and T. Zeng, “Reducing artifacts in JPEG decompression via a learned dictionary,” IEEE Transactions on Signal Processing, vol. 62, no. 3, pp. 718–728, 2014.
[18] J. Zhang, R. Xiong, C. Zhao, Y. Zhang, S. Ma, and W. Gao, “CONCOLOR: Constrained non-convex low-rank model for image deblocking,” IEEE Transactions on Image Processing, vol. 25, no. 3, pp. 1246–1259, 2016.
[19] Z. Wang, D. Liu, S. Chang, Q. Ling, Y. Yang, and T. Huang, “D3: Deep dual-domain based fast restoration of JPEG-compressed images,” in IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, United States, Jun. 2016.
[20] X. Liu, X. Wu, J. Zhou, and D. Zhao, “Data-driven soft decoding of compressed images in dual transform-pixel domain,” IEEE Transactions on Image Processing, vol. 25, no. 4, pp. 1649–1659, 2016.
[21] Y. Li, F. Guo, R. Tan, and M. Brown, “A contrast enhancement framework with JPEG artifacts suppression,” in European Conference on Computer Vision, Cham, Sep. 2014.
[22] K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian, “Image denoising by sparse 3-D transform-domain collaborative filtering,” IEEE Transactions on Image Processing, vol. 16, no. 8, pp. 2080–2095, 2007.
[23] S. Gu, L. Zhang, W. Zuo, and X. Feng, “Weighted nuclear norm minimization with application to image denoising,” in IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, Jun. 2014.
[24] D. Huang, L. Kang, Y. Wang, and C. Lin, “Self-learning based image decomposition with applications to single image denoising,” IEEE Transactions on Multimedia, vol. 16, no. 1, pp. 83–93, 2013.
[25] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. Yuille, “DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. PP, no. 99, pp. 1–1, 2016.
[26] H. Burger, C. Schuler, and S. Harmeling, “Image denoising: Can plain neural networks compete with BM3D?,” in IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, Jun. 2012.
[27] C. Dong, C. Loy, K. He, and X. Tang, “Image super-resolution using deep convolutional networks,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 2, pp. 295–307, 2016.
[28] P. Isola, J. Zhu, T. Zhou, and A. Efros, “Image-to-image translation with conditional adversarial networks,” in arXiv:1611.07004, 2016.
[29] L. Zhao, J. Liang, H. Bai, A. Wang, and Y. Zhao, “Simultaneously color-depth super-resolution with conditional generative adversarial network,” in arXiv:1708.09105, 2017.
[30] L. Cavigelli, P. Hager, and L. Benini, “CAS-CNN: A deep convolutional neural network for image compression artifact suppression,” in IEEE Conference on Neural Networks, Anchorage, AK, USA, May 2017.
[31] L. Galteri, L. Seidenari, M. Bertini, and A. Del Bimbo, “Deep generative adversarial compression artifact removal,” in arXiv:1704.02518, 2017.
[32] G. Toderici, S. O'Malley, S. Hwang, D. Vincent, D. Minnen, S. Baluja, and R. Sukthankar, “Variable rate image compression with recurrent neural networks,” in arXiv:1511.06085, 2015.
[33] J. Ballé, V. Laparra, and E. Simoncelli, “End-to-end optimized image compression,” in arXiv:1611.01704, 2016.
[34] M. Li, W. Zuo, S. Gu, D. Zhao, and D. Zhang, “Learning convolutional networks for content-weighted image compression,” in arXiv:1703.10553, 2017.
[35] F. Jiang, W. Tao, S. Liu, J. Ren, X. Guo, and D. Zhao, “An end-to-end compression framework based on convolutional neural networks,” IEEE Transactions on Circuits and Systems for Video Technology, 2017.
[36] M. Mathieu, C. Couprie, and Y. LeCun, “Deep multi-scale video prediction beyond mean square error,” in arXiv:1511.05440, 2015.
[37] Z. Wang, A. Bovik, H. Sheikh, and E. Simoncelli, “Image quality assessment: from error visibility to structural similarity,” IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600–612, 2004.
[38] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, et al., “TensorFlow: large-scale machine learning on heterogeneous distributed systems,” in arXiv:1603.04467, 2016.
[39] Y. Chen and T. Pock, “Trainable nonlinear reaction diffusion: A flexible framework for fast and effective image restoration,”