Layered Image Compression using Scalable Auto-encoder
Chuanmin Jia, Zhaoyi Liu, Yao Wang, Siwei Ma, Wen Gao
Department of Electrical and Computer Engineering, Tandon School of Engineering, New York University, Brooklyn, NY 11201, USA
Institute of Digital Media, Peking University, Beijing 100871, China
School of Information and Electronics, Beijing Institute of Technology, Beijing 100081, China
[email protected], {cmjia, swma, wgao}@pku.edu.cn, [email protected]

Abstract—This paper presents a novel convolutional neural network (CNN) based image compression framework via scalable auto-encoder (SAE). Specifically, our SAE-based deep image codec consists of hierarchical coding layers, each of which is an end-to-end optimized auto-encoder. The coarse image content and texture are encoded through the first (base) layer, while the consecutive (enhance) layers iteratively code the pixel-level reconstruction errors between the original and previously reconstructed images. The proposed SAE structure alleviates the need of recently proposed auto-encoder based codecs to train multiple models for different bit-rate points. The SAE layers can be combined to realize multiple rate points, or to produce a scalable stream. On a standard public image dataset, the proposed method achieves rate-distortion performance in the low-to-medium rate range similar to the state-of-the-art CNN-based image codec (which uses different optimized networks to realize different bit rates). Furthermore, the proposed codec generates better perceptual quality in this bit-rate range.
Keywords—image compression; end-to-end optimization; scalable auto-encoder; CNN
I. INTRODUCTION
Image compression aims at representing an image with minimal coding bits while preserving maximal pixel-level reconstruction quality. Recently, deep learning (DL) based image compression has become one of the emerging topics due to its elegant end-to-end optimization ability. Multiple learning-based image codecs have been proposed at the intersection of deep learning and image coding. Essentially, the deep models are trained to learn the image-to-image mapping between the pristine image and the reconstructed image based on a rate-distortion (R-D) learning objective.

Different from conventional image coding formats, JPEG [1], JPEG2000 [2] and BPG [3] (based on High Efficiency Video Coding, HEVC [4]), which utilize separate sub-modules for prediction and transform coding, the deep codecs formulate end-to-end learnt networks as transform coding [5]. The most representative models typically adopt an auto-encoder (AE) like structure, which generates latent representations that are quantized and entropy coded. For example, Toderici et al. [6] proposed an end-to-end image coding model based on the recurrent neural network (RNN), where the original and residual images are iteratively compressed using the RNN structure. During each RNN iteration, a better reconstruction is produced at a commensurate bit-rate cost (2 bits). However, the RNN-based method might have limitations in representing high-frequency residuals. Additionally, the lack of explicit entropy estimation during RNN training also constrains its overall R-D performance. To address this issue, Theis et al. [7] presented a convolutional neural network (CNN) based AE structure, where the entropy model was approximated using a Gaussian distribution during optimization. In [8], an inpainting based learning approach was proposed for image compression.
To enhance the visual quality, generative adversarial network (GAN) based learning strategies were embedded in the CNN-based framework [9], [10] to improve the perceptual quality of the reconstructed images. In [11], the generalized divisive normalization (GDN) [12] was introduced as a substitute for the nonlinear activation in a variational auto-encoder (VAE) to de-correlate the channel-wise dependency among latent representations, which significantly improves the coding performance to be competitive with the JPEG2000 standard. Compared with other activation functions, the core advantage of GDN is its full reversibility, which guarantees nearly no information loss for the transform coding. More recently, a novel distribution parameter estimation method was proposed for entropy coding [13], which has brought additional coding gain.

All the aforementioned approaches have to train a particular auto-encoder for each target bit rate, which can limit their applicability when the desired bit rates have to be adapted in real time, and/or when it is impractical to save multiple trained models. Inspired by the conventional scalable coding paradigm [14], we propose a scalable auto-encoder (SAE) based deep image coding method to solve this problem by iteratively and incrementally coding the errors using end-to-end trained auto-encoders. By cascading the bitstreams generated by each layer of the SAE, variable bit-rates or layered bit streams can be obtained while maintaining optimal R-D performance. Furthermore, the proposed SAE structure is compatible with any other learning-based image codec, since each layer of the SAE could be substituted by a different AE-based deep codec.

II. PROPOSED METHOD
The overall flowchart of the proposed framework is depicted in Fig. 1. The model contains several stacked modules, each of which is an end-to-end optimized AE. The original image is first compressed by the AE in the base layer. Subsequently, each enhance layer takes the difference between the latest reconstructed image and the original image as input and compresses the residue. We train the entire framework in a layer-by-layer manner: all previous layers are fixed when training the current one. The designs of the base layer and the enhance layers are introduced in more detail in the subsequent subsections.
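The layered coding scheme above can be sketched end to end. The following is a toy numpy illustration with hypothetical scalar "auto-encoders" (simple scaling transforms standing in for the trained CNN layers of the paper), meant only to show how the base and enhance layers cooperate and how truncating the layer stack still yields a valid reconstruction:

```python
import numpy as np

# Toy stand-ins for the per-layer auto-encoders. These are hypothetical
# scalar transforms (not the trained CNNs of the paper): "encoding" just
# scales the signal, so round() quantizes with step 1/scale.
def make_toy_ae(scale):
    return (lambda r: r * scale), (lambda q: q / scale)

def sae_encode(x, layers):
    """Base layer codes the image; each enhance layer codes the residual
    left over by all previous layers (the Fig. 1 structure)."""
    codes, recon = [], np.zeros_like(x)
    for encode, decode in layers:
        q = np.round(encode(x - recon))  # quantized latent of this layer
        codes.append(q)
        recon = recon + decode(q)        # running reconstruction
    return codes

def sae_decode(codes, layers, num_layers):
    """Scalable decoding: summing the first num_layers decoded outputs
    yields a valid (progressively better) reconstruction."""
    return sum(decode(q) for q, (_, decode) in zip(codes[:num_layers], layers))

# Usage: each extra layer refines the reconstruction.
rng = np.random.default_rng(0)
x = rng.random((8, 8))
layers = [make_toy_ae(s) for s in (4.0, 16.0, 64.0)]  # finer steps per layer
codes = sae_encode(x, layers)
errors = [np.abs(x - sae_decode(codes, layers, k)).mean() for k in (1, 2, 3)]
```

Decoding with one, two, or three layers reproduces the scalability of the bitstream: the mean reconstruction error shrinks as each enhance layer is added.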
Figure 1. The framework of the proposed SAE based image compression (first three layers are illustrated).
A. Base Layer
The base layer in the proposed SAE mainly compresses the coarse image content and basic texture information. Considering the original image x, the encoder part of the AE in the base layer encodes x into the latent representation q_b:

    q_b = E_b(x),    (1)

Subsequently, the quantizer (round(·)) is applied to the latent representation q_b to obtain the quantized latent feature q̄_b. The bits required to code q̄_b can be approximated by its entropy, which is determined from P_{q̄_b}, the marginal probability mass function of q̄_b. Following [11], for end-to-end differentiability of the loss function, during training we replace the quantizer by adding white noise Δq, uniformly distributed in (−1/2, 1/2), to the latent feature q_b. We considered two different training objectives. The first aims to optimize the objective quality measured by the mean square error (MSE), and is trained using the following loss function:

    Loss_b = ||x − x̂_b||² + λ_b · R(q_b + Δq),    (2)

where λ_b is the Lagrange multiplier, x̂_b = D_b(q̄_b) is the reconstructed image of the base layer, and D_b(·) represents the base layer decoder of the AE (shown in Fig. 2). The second training objective optimizes for perceptual quality measured by MS-SSIM [15]:

    Loss_b = (1 − MS-SSIM(x̂_b)) + λ_b,msssim · R(q_b + Δq),    (3)

where MS-SSIM(·) is based on [15] and λ_b,msssim is the Lagrange multiplier for MS-SSIM. To estimate the rate R(·), we deploy the state-of-the-art entropy model described in [13]. Following [11], E_b(·) consists of three down-sampling convolution layers, with GDN utilized after each convolution layer to achieve nonlinearity. The decoder (D_b(·)) has the mirror structure of E_b(·). The AE structure, which is the same as in [11], is depicted in Fig. 2.
Note that the same structure is used in each subsequent enhance layer; the only difference is the number of channels for the latent features, which will be described later.

Figure 2. The AE in each layer of our proposed SAE. The encoder and decoder are shown in the left and right panels, respectively.
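The training-time substitution of rounding by uniform noise used in Eqs. (2) and (3) can be sketched as follows. This is a minimal numpy illustration, not the paper's implementation; `rate_proxy` is a hypothetical stand-in (an assumed Laplacian density) for the learned entropy model of [13]:

```python
import numpy as np

rng = np.random.default_rng(0)

def hard_quantize(q):
    # test-time quantizer used for the actual bitstream
    return np.round(q)

def noisy_quantize(q):
    # training-time relaxation: additive i.i.d. uniform noise in (-1/2, 1/2),
    # which keeps the loss differentiable w.r.t. q while mimicking the
    # marginal statistics of hard rounding
    return q + rng.uniform(-0.5, 0.5, size=q.shape)

def rate_proxy(q_noisy, scale=2.0):
    # hypothetical stand-in for the learned entropy model R(.):
    # total bits under an assumed zero-mean Laplacian density per element
    p = np.exp(-np.abs(q_noisy) / scale) / (2.0 * scale)
    return np.sum(-np.log2(np.maximum(p, 1e-12)))

def rd_loss(x, x_hat, q, lam):
    # Eq. (2): distortion plus lambda times the (relaxed) rate
    distortion = np.mean((x - x_hat) ** 2)
    return distortion + lam * rate_proxy(noisy_quantize(q))
```

At test time `hard_quantize` replaces `noisy_quantize`, and the noise-trained entropy model is reused to drive the arithmetic coder.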
B. Enhance Layers
In the proposed framework, the enhance layers are responsible for iteratively encoding the residues between the reconstructed image from the previous layers and the original image. As shown in Fig. 1, the first enhance layer (enhance-1) takes the error between the original image x and the reconstructed image of the base layer as input. The formulation of enhance-1 can be represented as follows:

    x̃_e1 = D_e1(round(E_e1(x − x̂_b))),    (4)

To acquire the reconstruction from the base layer and enhance-1, we simply add the outputs of the two layers (x̂_e1 = x̂_b + x̃_e1). As such, the reconstruction quality is enhanced by the error coded in the enhance layer. For the subsequent enhance layers, taking the second enhance layer (enhance-2) as an example, the input is the residue between the original x and the reconstruction from the latest layer, x̂_e1. The reconstructed residue can be described as

    x̃_e2 = D_e2(round(E_e2(x − x̂_e1))),    (5)

In this case, the reconstruction for enhance-2 is obtained by adding the three corresponding outputs of the three AEs (x̂_e2 = x̂_b + x̃_e1 + x̃_e2). The loss function for training the i-th enhance layer has the following form:

    Loss_ei = ||x − x̂_ei||² + λ_ei · R(q_ei + Δq),    (6)

where λ_ei is the Lagrange multiplier for the i-th layer. Similar to the base layer training, uniform noise is also added to the latent variables q_ei when training the enhance layers, as a relaxation of the loss function. To obtain rate-distortion optimality, we tried different combinations of λ's, which are detailed in the next section.

III. TRAINING DETAILS
This section presents the training details of the proposed SAE based image codec. A subset of the ImageNet [16] database containing 5500 RGB images is used for training, and another 300 images are used for validation. Convergence is declared when the loss on the validation images becomes stable. We randomly crop a fixed-size region from each training sample to prevent boundary issues. The hyper-parameters of our training procedure are listed in Table I. For each SAE layer, we need to train a pair of encoder and decoder as well as an entropy model, iteratively. MSE_lr denotes the fixed learning rate for the AE and entropy model in Eq. (2), and Rate_lr is the initial learning rate of the entropy model; Rate_lr follows an exponential decay with decay rate 0.96 during training. Adam [17] is utilized as the optimizer for both the AE and the entropy model. We train each layer successively: given the previous (i−1) layers, we try to find the optimal hyper-parameters for the i-th SAE layer, including the number of feature maps and λ, that achieve the best rate-distortion tradeoff.

A. Base Layer Training
To achieve a good rate-distortion tradeoff for the base layer, and to use as few channels as possible for reduced complexity, we varied the number of feature maps in each of the three convolution layers of the encoder and decoder, as well as the λ values. We found that using 48 feature maps in all three layers and λ = 3000 achieved the best result for the MSE-oriented optimization. For the MS-SSIM-oriented optimization, λ = 50 achieved the best result.

Table I
THE HYPER-PARAMETERS FOR TRAINING SAE

  Parameter Name       Value
  MSE_lr               0.0001
  Rate_lr              0.001
  Optimizer            Adam
  epochs               1000
  batch size           8
  training image size  ×

Table II
λ AND NUMBER OF FEATURE MAPS FOR EACH LAYER OF THE PROPOSED SAE

  Layers              Base   e1    e2   e3   e4
  λ (for MSE)         3000   1000  300  100  30
  λ (for MS-SSIM)     50     30    10   0.5  -
  Feature Map Number  48     48    96   144  192

B. Enhance Layer Training
We train each subsequent enhance layer while fixing the lower layers at their optimized states. We have found that the number of feature maps should increase for deeper enhance layers. We suspect that this is because the distribution of the errors becomes more similar to random noise as the enhance layers go deeper, so that more parameters are needed to model and capture this distribution. The parameter λ, which balances the contributions of the entropy term and the MSE term in Eq. (6), should decrease layer by layer, since more emphasis should be put on minimizing the MSE. Recall that ideally λ should be equal to the negative slope of the MSE vs. rate curve, and this slope decreases as the rate increases. Table II summarizes the λ values and the number of feature maps for each layer in the trained SAE. These values were selected by exhaustive search over different combinations of the parameters when training the SAE structure.

C. Entropy Model
In this paper, we re-use the entropy model proposed in [13], which incorporates a hyper-prior to effectively capture spatial dependencies in the latent representation generated by each layer of the proposed SAE. In particular, we deploy a different entropy model for each layer of the proposed SAE.

IV. EXPERIMENTAL RESULTS
For training and testing, we used the popular DL library TensorFlow [18] and the Tensorflow-compression submodule [19], which is an implementation of [13].

Table III
THE R-D PERFORMANCE OF THE PROPOSED SAE BASED IMAGE CODEC

  Dataset   vs. [11]              vs. [13]
            BD-rate   BD-PSNR     BD-rate   BD-PSNR
  Kodak     -65.2 %   3.38 dB     -0.6 %    0.021 dB
A. Experiment Set-up
To evaluate the efficiency of the proposed SAE based image codec, we test the proposed model on the widely used Kodak Lossless True Color Image dataset [20], which contains 24 true color images. The results presented in this section are the average over these 24 images. It is worth noting that all the experiments and comparisons are based on three-channel true color images. The test environment of this work is an Intel i5-7200U CPU with 16 GB RAM and an NVIDIA GTX 1050Ti GPU.

B. Rate-distortion Performances
We compare our proposed SAE coder against the algorithms in [11] and [13]. Our method and [11] have the same structure for the encoder and decoder, as illustrated in Fig. 2, and [13] uses one more convolution layer on top of [11]. However, the numbers of feature maps differ among these methods: [11] used 192 feature maps in each layer, while [13] used 128 and 192 for the convolution layers and the bottleneck layer at low bit-rates, and 192 and 320, respectively, at high bit-rate points. The numbers of feature maps in the proposed SAE differ among the scalable layers and are summarized in Table II. The proposed method uses the same entropy estimation method as [13], whereas [11] used an older, less efficient method.

To illustrate the coding performance over a wide range of bit-rates, the R-D curves of the proposed SAE, [11] and [13] are provided in Fig. 3. We also provide the R-D curves obtained by the BPG codec. (When using the BPG codec [3], we set the option to indicate that the input image is in the RGB format. The BPG performance reported in [13] is lower than in Fig. 3 and 4 because the option there was set to assume the input is in YCbCr format.) For each of [13] and SAE, we provide two sets of results, one optimized for MSE and another for MS-SSIM. Table III summarizes the BD-rate and BD-PSNR of the proposed SAE method against [11] and [13], respectively. Compared with [11], the SAE achieves significant coding gain: over 65% of the bit-rate could be reduced and a 3.38 dB PSNR increase could be obtained. The SAE performance is slightly better than [13]. For the proposed SAE model, the points from left to right correspond to the results of the base layer and the enhance-1 to enhance-4 layers. Both [13] and the proposed model outperform [11] over the entire range of bit-rates by a clear margin, owing to the more efficient entropy coding method. The SAE coder is similar to [13] up to about 0.5 bpp, and then becomes less efficient.
This loss of coding efficiency with more layers is expected, as with any scalable coder compared to a non-scalable one. In fact, it is somewhat surprising that the SAE was able to achieve similar (in fact slightly better) performance than [13] with up to 3 enhancement layers.
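The BD-rate figures in Table III follow the Bjøntegaard metric. A minimal sketch, assuming the usual convention of fitting log-rate as a cubic polynomial of PSNR and integrating over the overlapping PSNR range, could look like:

```python
import numpy as np

def bd_rate(rates_ref, psnrs_ref, rates_test, psnrs_test):
    """Bjontegaard delta rate: average bit-rate change (in percent) of the
    test codec relative to the reference over the common PSNR range."""
    lr_ref = np.log(rates_ref)
    lr_test = np.log(rates_test)
    # cubic fit of log-rate as a function of PSNR for each codec
    p_ref = np.polynomial.polynomial.polyfit(psnrs_ref, lr_ref, 3)
    p_test = np.polynomial.polynomial.polyfit(psnrs_test, lr_test, 3)

    lo = max(min(psnrs_ref), min(psnrs_test))
    hi = min(max(psnrs_ref), max(psnrs_test))

    # integrate each fitted log-rate curve over the overlapping PSNR range
    int_ref = np.polynomial.polynomial.polyint(p_ref)
    int_test = np.polynomial.polynomial.polyint(p_test)
    avg_diff = (np.polynomial.polynomial.polyval(hi, int_test)
                - np.polynomial.polynomial.polyval(lo, int_test)
                - np.polynomial.polynomial.polyval(hi, int_ref)
                + np.polynomial.polynomial.polyval(lo, int_ref)) / (hi - lo)
    return (np.exp(avg_diff) - 1.0) * 100.0
```

A codec identical to the reference yields 0%, and one that spends twice the rate at every PSNR yields +100%; a negative value, as in Table III, means the test codec saves bits on average.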
Figure 3. R-D curves (PSNR vs. rate in bits/pixel) over the Kodak dataset. Curves compared: BPG (4:4:4); [11] (optimized for MSE); Proposed (optimized for MSE); Proposed (optimized for MS-SSIM); [13] (optimized for MSE); [13] (optimized for MS-SSIM).
Additionally, we provide comparisons based on MS-SSIM in Fig. 4. In general, the MS-SSIM metric is better correlated with perceptual quality than PSNR. It is very encouraging that the proposed SAE method has similar or better performance than [13] over the entire rate range. Moreover, the SAE method achieved much better performance than BPG over the entire bit-rate range, consistent with the visual evaluation described below.
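For reference, the MS-SSIM score used in Fig. 4 combines per-scale contrast/structure terms with a luminance term at the coarsest scale. The following is a simplified numpy sketch using global image statistics and 2× average pooling between scales (the implementation in [15] uses Gaussian-windowed local statistics instead); the five weights are the standard ones from [15]:

```python
import numpy as np

def _ssim_terms(x, y, c1=(0.01 * 255) ** 2, c2=(0.03 * 255) ** 2):
    # simplified SSIM components over global statistics (no sliding window);
    # assumes positively correlated inputs so the power terms stay real
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    luminance = (2 * mx * my + c1) / (mx ** 2 + my ** 2 + c1)
    contrast_structure = (2 * cov + c2) / (vx + vy + c2)
    return luminance, contrast_structure

def _downsample(x):
    # 2x average pooling (crop to even size first)
    h, w = (x.shape[0] // 2) * 2, (x.shape[1] // 2) * 2
    x = x[:h, :w]
    return (x[0::2, 0::2] + x[1::2, 0::2] + x[0::2, 1::2] + x[1::2, 1::2]) / 4

def ms_ssim(x, y, weights=(0.0448, 0.2856, 0.3001, 0.2363, 0.1333)):
    """Multi-scale SSIM: contrast/structure at every scale, luminance only
    at the coarsest scale, combined as a weighted product."""
    score = 1.0
    for i, w in enumerate(weights):
        lum, cs = _ssim_terms(x, y)
        score *= cs ** w
        if i == len(weights) - 1:
            score *= lum ** w
        else:
            x, y = _downsample(x), _downsample(y)
    return score
```

Identical images score 1.0, and distortions concentrated at fine scales are down-weighted relative to coarse-scale structure, which is one reason MS-SSIM tracks perceived quality better than PSNR.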
Figure 4. R-D curves (MS-SSIM vs. rate in bits/pixel) over the Kodak dataset. Curves compared: BPG (4:4:4); [11] (optimized for MSE); Proposed (optimized for MSE); Proposed (optimized for MS-SSIM); [13] (optimized for MSE); [13] (optimized for MS-SSIM).

C. Visual Evaluations
For visual comparisons, Fig. 5 and Fig. 6 present several cropped versions of reconstructed images from the Kodak dataset. Notice that the proposed SAE structure preserves more detail in contour and textural regions while using fewer or similar bits than both [11] and [13] at low to intermediate bit rates. All of the decoded test images produced by the proposed SAE framework and the methods of [11], [13] at different bit-rate points are provided in the supplementary materials.

V. CONCLUSION
In this paper, a scalable auto-encoder based deep image codec is proposed. The novelty of the paper lies in that the proposed method does not need to train multiple independent networks to realize different bit-rate points. Quantitative and qualitative evaluations have shown that the proposed method can achieve rate-distortion performance similar to a state-of-the-art DL-based method at low to intermediate bit rates in terms of mean square error, and similar or better performance than the benchmark in terms of perceptual quality over the entire rate range. We should also note that the proposed SAE structure is general: one can simply replace the particular AE structure in Fig. 2 by another structure that provides better coding performance in each layer. In fact, one can also use methods not based on AEs.

ACKNOWLEDGMENT
The authors would like to thank Dr. Johannes Ballé for kindly providing the trained models of [11] and the decoded images of [13] for performance comparison. This work was done by C. Jia and Z. Liu as visiting students at NYU-Tandon sponsored by the China Scholarship Council (CSC), which is gratefully acknowledged.

REFERENCES

[1] G. K. Wallace, "The JPEG still picture compression standard," IEEE Transactions on Consumer Electronics, vol. 38, no. 1, pp. xviii–xxxiv, 1992.
[2] A. Skodras, C. Christopoulos, and T. Ebrahimi, "The JPEG 2000 still image compression standard," IEEE Signal Processing Magazine, vol. 18, no. 5, pp. 36–58, 2001.
[3] F. Bellard, "BPG image format," available: http://bellard.org/bpg/, accessed: 05-28-2018.
[4] G. J. Sullivan, J.-R. Ohm, W.-J. Han, T. Wiegand et al., "Overview of the High Efficiency Video Coding (HEVC) standard," IEEE Transactions on Circuits and Systems for Video Technology, vol. 22, no. 12, pp. 1649–1668, 2012.
[5] V. K. Goyal, "Theoretical foundations of transform coding," IEEE Signal Processing Magazine, vol. 18, no. 5, pp. 9–21, 2001.

Please visit https://github.com/chuanminj/MIPR2019/ for supplementary materials.

Figure 5. Comparisons of decoded test images by [11], [13] and the proposed method. The last number under each sub-figure is the MS-SSIM value. (a) [11]: 0.1455 bpp, 24.1610 dB, 0.8827; (b) [13]: 0.1782 bpp, 25.0016 dB, 0.9079; (c) Ours: 0.1305 bpp, 24.9626 dB, 0.9250; (d) [11]: 0.1394 bpp, 25.1619 dB, 0.8739; (e) [13]: 0.1417 bpp, 25.5955 dB, 0.8870; (f) Ours: 0.1166 bpp, 25.8482 dB, 0.9192; (g) [11]: 0.6468 bpp, 28.9845 dB, 0.9725; (h) [13]: 0.4959 bpp, 28.3934 dB, 0.9648; (i) Ours: 0.5495 bpp, 30.9506 dB, 0.9852; (j) [11]: 0.5782 bpp, 30.0684 dB, 0.9680; (k) [13]: 0.4858 bpp, 29.1530 dB, 0.9533; (l) Ours: 0.4932 bpp, 28.7907 dB, 0.9831.

Figure 6. Comparisons of decoded test images by [11], [13] and the proposed method. The last number under each sub-figure is the MS-SSIM value. (a) [11]: 0.1128 bpp, 26.2665 dB, 0.8831; (b) [13]: 0.1102 bpp, 30.0579 dB, 0.9483; (c) Ours: 0.0996 bpp, 27.0657 dB, 0.9228; (d) [11]: 0.0850 bpp, 29.9462 dB, 0.9361; (e) [13]: 0.0739 bpp, 30.1352 dB, 0.9388; (f) Ours: 0.0811 bpp, 30.6522 dB, 0.9524; (g) [11]: 0.4858 bpp, 30.9572 dB, 0.9644; (h) [13]: 0.4596 bpp, 31.9486 dB, 0.9660; (i) Ours: 0.4336 bpp, 32.6423 dB, 0.9809; (j) [11]: 0.1370 bpp, 24.0123 dB, 0.8881; (k) [13]: 0.1387 bpp, 24.4523 dB, 0.9004; (l) Ours: 0.1130 bpp, 24.6560 dB, 0.9287.

[6] G. Toderici, D. Vincent, N. Johnston, S. J. Hwang, D. Minnen, J. Shor, and M. Covell, "Full resolution image compression with recurrent neural networks," in CVPR, 2017, pp. 5435–5443.
[7] L. Theis, W. Shi, A. Cunningham, and F. Huszár, "Lossy image compression with compressive autoencoders," arXiv preprint arXiv:1703.00395, 2017.
[8] M. H. Baig, V. Koltun, and L. Torresani, "Learning to inpaint for image compression," in Advances in Neural Information Processing Systems, 2017, pp. 1246–1255.
[9] E. Agustsson, M. Tschannen, F. Mentzer, R. Timofte, and L. Van Gool, "Generative adversarial networks for extreme learned image compression," arXiv preprint arXiv:1804.02958, 2018.
[10] O. Rippel and L. Bourdev, "Real-time adaptive image compression," arXiv preprint arXiv:1705.05823, 2017.
[11] J. Ballé, V. Laparra, and E. P. Simoncelli, "End-to-end optimized image compression," arXiv preprint arXiv:1611.01704, 2016.
[12] J. Ballé, V. Laparra, and E. P. Simoncelli, "Density modeling of images using a generalized normalization transformation," arXiv preprint arXiv:1511.06281, 2015.
[13] J. Ballé, D. Minnen, S. Singh, S. J. Hwang, and N. Johnston, "Variational image compression with a scale hyperprior," arXiv preprint arXiv:1802.01436, 2018.
[14] J.-R. Ohm, "Advances in scalable video coding," Proceedings of the IEEE, vol. 93, no. 1, pp. 42–56, 2005.
[15] Z. Wang, E. Simoncelli, A. Bovik et al., "Multi-scale structural similarity for image quality assessment," in Asilomar Conference on Signals, Systems and Computers, vol. 2. IEEE, 2003, pp. 1398–1402.
[16] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[17] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[18] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard et al., "TensorFlow: a system for large-scale machine learning," in