Extreme Image Coding via Multiscale Autoencoders With Generative Adversarial Optimization
Chao Huang, Haojie Liu, Tong Chen, Qiu Shen, and Zhan Ma∗
Vision Lab, Nanjing University
∗Corresponding Author: Z. Ma.
Abstract—We propose a MultiScale AutoEncoder (MSAE) based extreme image coding/compression framework to offer visually pleasing reconstruction at very low bitrates. Our method leverages the "priors" at different resolution scales to improve compression efficiency, and employs a generative adversarial network (GAN) with multiscale discriminators to perform end-to-end trainable rate-distortion optimization. We compare the perceptual quality of our reconstructions with traditional compression algorithms, namely the High-Efficiency Video Coding (HEVC) based Intra Profile and JPEG2000, on the public Cityscapes, ADE20K and Kodak datasets, demonstrating significant subjective quality improvement. However, objective measurements, such as PSNR and SSIM, often deteriorate when the generative adversarial optimization is applied.
I. INTRODUCTION
Images that capture vivid scenes and events are stored and shared extensively every day, so image compression plays a vital role in ensuring efficient storage and sharing at the scale of the entire Internet. Traditional image compression methods such as JPEG, JPEG2000, and HEVC-based BPG, as well as recent deep neural network (DNN) based image compression methods [1]–[4], have presented significant advances in compression efficiency. Typically, these DNN-based schemes exhibit better visual quality than the traditional methodologies at the same bit rate [5]. However, both fail to represent images efficiently with pleasant reconstruction quality at very low bitrates (e.g., targeting < 0.1 bits per pixel (bpp)) [6].

This is mainly because visually sensitive information (i.e., perceptual significance) cannot be well preserved using conventional quality optimization criteria, such as peak signal-to-noise ratio (PSNR) and multiscale structural similarity (MS-SSIM) [8], in such extreme compression scenarios. Recent explorations have shown that an adversarial loss can capture global semantic information and local texture, yielding appealing reconstructions [6], [9]. Thus, Agustsson et al. [6] developed a GAN-based extreme image compression framework with bitrates below 0.1 bpp, resulting in noticeable subjective quality improvement compared with JPEG2000 [10] and BPG [11]. However, it had limitations due to its purely GAN-based structure. First, it was difficult to ensure that the GAN generalizes to capture the variety of distributions across different datasets. Moreover, the GAN would sometimes introduce unexpected textures because of failures of the discriminator [12].

In this work, we propose a MultiScale AutoEncoder (MSAE) based extreme image compression structure, where we employ a multiscale network, shown in Fig. 1(a), to generate spatially scalable bitstreams. To the best of our knowledge, most learning-based compression methods [1]–[4] generate a single-layer bitstream at the native spatial resolution, without utilizing mutual information from other spatial scales. Different from the Scalable Auto-encoder [13], which iteratively codes pixel-level errors at the same resolution, "priors" at different spatial resolution scales that well capture the local textures are embedded as references to support coarse-to-fine reconstruction and compression in our MSAE framework. A generative adversarial loss [14] is applied at different scales for end-to-end trainable rate-distortion optimization, so as to optimize reconstruction quality subjectively by maintaining the global semantic structure for visual significance at a very low bitrate budget. We test our method on the Cityscapes, ADE20K and Kodak datasets, yielding significant perceptual quality margins over the existing JPEG2000 and BPG.

II. MULTISCALE AUTOENCODER WITH GENERATIVE ADVERSARIAL OPTIMIZATION
Fig. 1(a) presents the extreme image compression framework of MSAE with generative adversarial optimization. Let $X_k$ be the original image ($k$ is the size of the input). We downscale $X_k$ to obtain two more inputs, $X_{k/s}$ and $X_{k/(s \cdot s)}$, where $s$ denotes the downscaling factor, set to 2 in this paper. Let $A_i$ be the autoencoder network at scale $i$ ($i \in \{k, k/s, k/(s \cdot s)\}$), and let $U$ denote the upscaling operator. We then define the overall MSAE framework by

$$X'_{k/(s \cdot s)} = A_{k/(s \cdot s)}(X_{k/(s \cdot s)}), \qquad (1)$$
$$X'_{k/s} = U(X'_{k/(s \cdot s)}) + A_{k/s}\big(X_{k/s} - U(X'_{k/(s \cdot s)})\big), \qquad (2)$$
$$X'_{k} = U(X'_{k/s}) + A_{k}\big(X_{k} - U(X'_{k/s})\big). \qquad (3)$$

Our proposed MSAE framework in (1), (2), and (3) performs a coarse-to-fine reconstruction step by step. At the lowest scale $k/(s \cdot s)$, the autoencoder $A_{k/(s \cdot s)}$ only takes $X_{k/(s \cdot s)}$ as input to derive the reconstructed image $X'_{k/(s \cdot s)}$, yielding the coarsest representation of the original $X_k$. Then $X'_{k/(s \cdot s)}$, as the prior, is upscaled and aggregated with residuals at each scale to derive the final $X'_k$. Low-resolution reconstructions are thus used as "priors" to improve the overall rate-distortion performance. In addition, a conditional GAN [15] is integrated into our MSAE system for end-to-end training toward visually appealing reconstruction, by enabling multiscale discriminators on the high-resolution inputs.
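To make the data flow of (1)–(3) concrete, here is a minimal PyTorch sketch of the coarse-to-fine pass. It is an illustration, not the authors' released code: the three autoencoder submodules stand in for $A_k$, $A_{k/s}$, and $A_{k/(s \cdot s)}$, and bilinear interpolation is assumed for both the downscaler and the upscaling operator $U$.

```python
import torch.nn.functional as F
from torch import nn


def U(x, s=2):
    # Upscaling operator U; bilinear interpolation is an assumption.
    return F.interpolate(x, scale_factor=s, mode="bilinear", align_corners=False)


class MSAE(nn.Module):
    """Coarse-to-fine multiscale autoencoder following Eqs. (1)-(3)."""

    def __init__(self, A_k, A_ks, A_kss, s=2):
        super().__init__()
        self.A_k, self.A_ks, self.A_kss, self.s = A_k, A_ks, A_kss, s

    def forward(self, x_k):
        s = self.s
        x_ks = F.interpolate(x_k, scale_factor=1 / s,
                             mode="bilinear", align_corners=False)
        x_kss = F.interpolate(x_k, scale_factor=1 / (s * s),
                              mode="bilinear", align_corners=False)
        # Eq. (1): the coarsest scale codes the downscaled image directly.
        xr_kss = self.A_kss(x_kss)
        # Eq. (2): the middle scale codes the residual against the upscaled prior.
        prior_ks = U(xr_kss, s)
        xr_ks = prior_ks + self.A_ks(x_ks - prior_ks)
        # Eq. (3): the full scale repeats the residual-coding step.
        prior_k = U(xr_ks, s)
        xr_k = prior_k + self.A_k(x_k - prior_k)
        return xr_k, xr_ks, xr_kss
```

Each `A_*` is any image-to-image `nn.Module` (e.g., the autoencoder of Sec. II-A), so the same sketch also exposes the three per-scale reconstructions needed by the loss terms in (8).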
Fig. 1. Our extreme image compression framework via MultiScale AutoEncoder (MSAE) with GAN optimization: (a) overall structure, (b) autoencoder. The multiscale discriminator in (a) contains three identical discriminators that are patch-based fully convolutional networks [7]. The encoder network contains 1 convolutional layer with stride 1 and 4 convolutional layers with stride 2; all the residual blocks in the information augmentation module have the same convolutional kernel size of 3 and stride 1; the decoder network is a mirror of the encoder, containing 4 transposed convolutional layers with stride 2 and 1 convolutional layer with stride 1. Entropy encoding and decoding denote arithmetic encoding and decoding.
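The multiscale discriminator named in the caption can be sketched as three identical patch-based fully convolutional networks [7], [14], each operating on a progressively downscaled copy of its input; the layer widths and depths below are assumptions, as the paper does not specify them:

```python
import torch.nn.functional as F
from torch import nn


class PatchDiscriminator(nn.Module):
    # Fully convolutional (patch-based) discriminator: its output is a map
    # of real/fake scores, one per receptive-field patch.
    def __init__(self, c_in=3, n=64, depth=3):
        super().__init__()
        layers, ch = [], c_in
        for _ in range(depth):
            layers += [nn.Conv2d(ch, n, 4, stride=2, padding=1),
                       nn.LeakyReLU(0.2, inplace=True)]
            ch, n = n, n * 2
        layers += [nn.Conv2d(ch, 1, 4, stride=1, padding=1)]
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)


class MultiScaleDiscriminator(nn.Module):
    # Three identical patch discriminators applied at 1x, 1/2x and 1/4x.
    def __init__(self, c_in=3):
        super().__init__()
        self.discs = nn.ModuleList(PatchDiscriminator(c_in) for _ in range(3))

    def forward(self, x):
        outs = []
        for d in self.discs:
            outs.append(d(x))
            x = F.avg_pool2d(x, kernel_size=3, stride=2, padding=1)
        return outs
```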
A. AutoEncoder
The same autoencoder architecture is used at each scale of our MSAE framework. Except at the lowest scale $k/(s \cdot s)$, where the downscaled image $X_{k/(s \cdot s)}$ serves as the input, the residuals between upscaled priors and inputs at the same resolution (i.e., $X_{k/s} - U(X'_{k/(s \cdot s)})$ and $X_k - U(X'_{k/s})$) are fed into the autoencoder for compression. Using residuals instead of the original textures generally boosts coding efficiency at the same bitrate budget, owing to better energy compaction and redundancy exploitation.

Such an autoencoder, shown in Fig. 1(b), includes an encoder $E$ that encodes the input $X$ into a set of feature maps (fMaps) $\omega$. Then $\omega$ is passed to the quantizer $Q$, which produces a compressed representation $\hat{\omega} = Q(E(X))$. Specifically, the encoder $E$ first compresses the input of size $W \times H \times C$ into fMaps at $W/16 \times H/16$ spatial resolution (via its four stride-2 convolutions). Here, $W$ is the image width, $H$ the height, and $C$ the number of color channels (e.g., $C = 3$ for RGB color space). The fMaps are then projected down to $W/16 \times H/16 \times C_{neck}$ at the bottleneck layer prior to being quantized into $\hat{\omega}$. Note that $C_{neck}$ varies across scales.

The decoder, denoted as the generator $G$, reconstructs the image $X' = G(\hat{\omega})$ from the compressed representation $\hat{\omega}$. Within the decoder, an information augmentation module with nine residual blocks [16] is aggregated to retrieve more information from the data and improve the reconstruction. Decoded fMaps then go through a mirror network of $E$ to obtain the final reconstruction with the same dimensions as the input image, i.e., $W \times H \times C$.

Note that such an autoencoder is optimized using PSNR or MS-SSIM by default, which often results in compression artifacts such as blocking, blurring, and contouring at low bitrates. To address this problem, we adopt an adversarial loss [9] in training to reconstruct images $X'$ with visually pleasant quality.
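Following the layer counts given in the Fig. 1 caption, the sketch below assembles one scale's autoencoder. The hidden width `n = 64`, the kernel sizes of the strided layers, and the ReLU activations are assumptions, as the paper does not specify them; plain rounding stands in for the quantizer $Q$ (its training-time surrogate is sketched in Sec. II-B).

```python
import torch
from torch import nn


class ResidualBlock(nn.Module):
    # 3x3, stride-1 residual block of the information-augmentation module.
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, stride=1, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, stride=1, padding=1))

    def forward(self, x):
        return x + self.body(x)


class Autoencoder(nn.Module):
    def __init__(self, c_in=3, c_neck=4, n=64):
        super().__init__()
        # Encoder E: 1 stride-1 conv, then 4 stride-2 convs (16x spatial
        # reduction), then a projection to the C_neck-channel bottleneck.
        enc = [nn.Conv2d(c_in, n, 3, stride=1, padding=1), nn.ReLU(inplace=True)]
        for _ in range(4):
            enc += [nn.Conv2d(n, n, 3, stride=2, padding=1), nn.ReLU(inplace=True)]
        enc += [nn.Conv2d(n, c_neck, 3, stride=1, padding=1)]
        self.encoder = nn.Sequential(*enc)
        # Decoder/generator G: information augmentation (nine residual blocks),
        # then a mirror of E: 4 stride-2 transposed convs + 1 stride-1 conv.
        dec = [nn.Conv2d(c_neck, n, 3, stride=1, padding=1)]
        dec += [ResidualBlock(n) for _ in range(9)]
        for _ in range(4):
            dec += [nn.ConvTranspose2d(n, n, 3, stride=2, padding=1,
                                       output_padding=1),
                    nn.ReLU(inplace=True)]
        dec += [nn.Conv2d(n, c_in, 3, stride=1, padding=1)]
        self.decoder = nn.Sequential(*dec)

    def forward(self, x):
        w = self.encoder(x)
        w_hat = torch.round(w)  # quantizer Q; training uses a noise surrogate
        return self.decoder(w_hat)
```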
B. End-to-End Rate-Distortion Optimization

We adopt adversarial training in an end-to-end optimization framework for extreme compression, mainly because the adversarial loss can mitigate the blurring and contouring problems at low bitrates [9]. In the proposed framework, the decoder or generator $G$ is conditioned on the compressed representations, so there is no need to feed random noise to the generator [15]. For the discriminator $D$, we use the multiscale architecture following [14], which measures the divergence between the real image and the fake image generated by $G$, both globally and locally.

Fig. 2. Visual comparison of different architectures and loss functions, evaluated on a real-world image from the Cityscapes dataset (values below each image are bitrate, PSNR, and SSIM): (a) original image, (b) our MSAE framework, (c) without multiscale, (d) without GAN loss, (e) BPG at 0.047 bpp, (f) JPEG2000 at 0.051 bpp. In (c), we replace the MSAE model with a single-scale model. In (d), MSAE is optimized with the MS-SSIM loss and shows color degradation. In (e) and (f), the traditional codecs produce reconstructed images with undesired blur and artifacts. Our complete model in (b) produces a better reconstruction; compared with the result in (c), the MSAE model preserves local textures better (in the red box).

Here we introduce a loss function that is closer to perceptual similarity instead of relying on pixel-wise distortion [17], i.e.,

$$L_f = \frac{\lambda}{W_{m,n} H_{m,n}} \sum_{x=1}^{W_{m,n}} \sum_{y=1}^{H_{m,n}} \big( \phi_{m,n}(Y_k)_{x,y} - \phi_{m,n}(Y'_k)_{x,y} \big)^2, \qquad (4)$$

with $Y_k = D(X_k)$ and $Y'_k = D(X'_k)$. $\phi_{m,n}$ represents the feature map generated by the $n$-th convolution (with stride 2) of the $m$-th scale of the multiscale discriminator, and $W_{m,n}$ and $H_{m,n}$ are the dimensions of the respective feature maps. We set the coefficient $\lambda$ to 10.

The regular GAN [9] treats the discriminator as a classifier with a sigmoid cross-entropy loss, which may lead to the gradient vanishing problem. In this paper, we instead use the objectives $f(y) = (y-1)^2$ and $g(y) = y^2$ developed for the Least-Squares GAN [18], where $f$ and $g$ denote scalar functions. This results in the generator loss

$$L_G = \min_G f\Big( D\big( G(\hat{\omega}_k) + U( G(\hat{\omega}_{k/s}) + U( G(\hat{\omega}_{k/(s \cdot s)}) ) ) \big) \Big), \qquad (5)$$

and the discriminator loss

$$L_D = \min_D \Big( f\big(D(X_k)\big) + g\big(D(X'_k)\big) \Big). \qquad (6)$$

In order to backpropagate through the non-differentiable quantizer $Q$, we model the entropy rate at the bottleneck layer following [3]: we simply add uniform noise during training to ensure differentiability, and replace it with $\mathrm{ROUND}(\cdot)$ at inference. The entropy of $\hat{\omega}_i$ is evaluated as

$$H(\hat{\omega}) = -\sum_j \log\Big( p_{\hat{\omega}_j \mid \psi^{(j)}}\big(\hat{\omega}_j \mid \psi^{(j)}\big) \Big), \qquad (7)$$

where $\psi^{(j)}$ represents the parameters of each univariate distribution $p_{\hat{\omega}_j}$. To balance reconstruction quality and bitrate, an entropy rate term is added to the training loss for optimal rate-distortion efficiency, i.e.,

$$L_{RD} = \min \sum_{i \in \{k,\, k/s,\, k/(s \cdot s)\}} \Big( L_G + \alpha_i\, d(X_i, X'_i) + L_f + \beta_i H(\hat{\omega}_i) \Big). \qquad (8)$$

The rate-distortion trade-off is adjusted by varying $\alpha_i$ and $\beta_i$. The distortion $d(X_i, X'_i)$ is measured by PSNR in this study, and the entropy of the compressed representation, $H(\hat{\omega}_i)$, approximates the encoding bitrate [3]. This compound loss $L_{RD}$ is applied in an end-to-end trainable framework to achieve the optimal rate-distortion performance.
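The training-side pieces above can be sketched as follows: the additive-uniform-noise surrogate for $Q$, the LSGAN scalar functions $f$ and $g$, the rate proxy of (7), and the per-scale compound term of (8). The factorized density `p_model` is assumed to expose per-element likelihoods; its exact parameterization follows [3] and is not restated in the paper.

```python
import torch


def quantize(w, training):
    # Uniform noise in [-0.5, 0.5] keeps the graph differentiable during
    # training; plain rounding (ROUND) is used at inference.
    if training:
        return w + (torch.rand_like(w) - 0.5)
    return torch.round(w)


def f(y):  # LSGAN "real" term: (y - 1)^2
    return (y - 1.0) ** 2


def g(y):  # LSGAN "fake" term: y^2
    return y ** 2


def rate_bits(w_hat, p_model):
    # Eq. (7): H = -sum_j log p(w_j | psi); p_model is an assumed callable
    # returning per-element likelihoods of the factorized entropy model [3].
    return -torch.log(p_model(w_hat)).sum()


def rd_term(d_fake, dist, feat_loss, rate, alpha, beta):
    # One scale of Eq. (8): generator loss f(D(fake)) plus weighted
    # distortion, feature-matching loss L_f, and weighted rate H.
    return f(d_fake).mean() + alpha * dist + feat_loss + beta * rate


def d_loss(d_real, d_fake):
    # Eq. (6): push discriminator scores toward 1 on real, 0 on fake images.
    return f(d_real).mean() + g(d_fake).mean()
```

In training, `rd_term` is summed over the three scales $i \in \{k, k/s, k/(s \cdot s)\}$ with the per-scale weights $\alpha_i$ and $\beta_i$ given in Sec. III.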
III. EXPERIMENTAL STUDIES
Datasets:
We use two publicly accessible datasets for training: Cityscapes [19] and ADE20K [20]. The Cityscapes images have dimensions of 2048 × 1024 × 3 in RGB color space; during training, we randomly select 2400 images for training and use the rest for validation. These images are downscaled in our experiments to avoid GPU memory overflow during training. From the ADE20K dataset we choose a subset of images and randomly split it into a training set and a validation set; since the image dimensions vary, for simplicity we rescale all of them to a common resolution for training and validation.
Parameters: We set $\beta_k = 100$ and $\beta_{k/s} = \beta_{k/(s \cdot s)} = 1$; correspondingly, $\alpha_k = 1$ and $\alpha_{k/s} = \alpha_{k/(s \cdot s)} = 100$. The number of channels of the bottleneck layer varies across scales: for scales $k/s$ and $k/(s \cdot s)$ we set $C_{neck} = 1$, while at scale $k$, $C_{neck} = 4$. This setting aims to provide sufficient "prior" information while consuming little bit overhead. Additionally, we use the Adam optimizer for end-to-end learning.
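For reference, the stated hyperparameters can be collected into a single (hypothetical) configuration dictionary; the learning rate is omitted, and the scale keys assume $s = 2$:

```python
config = {
    "s": 2,                                     # downscaling factor
    "alpha": {"k": 1, "k/2": 100, "k/4": 100},  # distortion weights in Eq. (8)
    "beta": {"k": 100, "k/2": 1, "k/4": 1},     # rate weights in Eq. (8)
    "c_neck": {"k": 4, "k/2": 1, "k/4": 1},     # bottleneck channels per scale
    "lambda": 10,                               # feature-matching weight, Eq. (4)
    "optimizer": "Adam",
}
```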
Performance Evaluation: To evaluate the performance of our proposed MSAE-based extreme image compression method, we compare it with BPG and JPEG2000, as shown in Fig. 3, which reports both objective PSNR and SSIM and subjective snapshots of two samples. For all the images tested from the Cityscapes and ADE20K datasets, we use basic arithmetic coding to generate actual bitstreams, and the bitrate is below 0.1 bpp. For quantitative evaluation, we compute the PSNR and SSIM between the input $X$ and the reconstruction $X'$. We have to mention, however, that at such low bitrates, quantitative measurements such as PSNR or SSIM [21] become meaningless, as they penalize changes in local structure rather than rewarding the preservation of global semantics.

It is clear that our method demonstrates a noticeable perceptual quality margin over traditional JPEG2000 and BPG, even with a small loss in objective metrics such as PSNR and SSIM; similar conclusions can be drawn from Fig. 3. This also coincides with observations that learning-based compression usually provides better visual quality but worse PSNR [4]–[6].

Fig. 3. Performance comparison of our proposed extreme image compression method versus BPG and JPEG2000 on the ADE20K dataset, with objective PSNR and SSIM and subjective snapshots. Sample 1: Ours 0.039 bpp, 27.55 dB, 0.8089; BPG 0.044 bpp, 29.52 dB, 0.9133; JPEG2000 0.044 bpp, 27.52 dB, 0.8922. Sample 2: Ours 0.047 bpp, 23.27 dB, 0.5814; BPG 0.045 bpp, 23.73 dB, 0.6712; JPEG2000 0.049 bpp, 23.06 dB, 0.6461.

Fig. 4. Visual comparison on a Kodak image (left): example reconstructions by our MSAE framework (middle) and BPG (right). The respective bitrates, PSNR, and SSIM are 0.070 bpp, 19.78 dB, 0.5386 vs. 0.086 bpp, 20.97 dB, 0.6473.
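The quantitative numbers above follow the standard PSNR and SSIM definitions; a sketch of this evaluation step using scikit-image's reference implementations (an assumption, since the paper does not name its measurement tool) is:

```python
from skimage.metrics import peak_signal_noise_ratio, structural_similarity


def evaluate(x, x_rec):
    """x, x_rec: uint8 HxWx3 numpy arrays (original and reconstruction)."""
    psnr = peak_signal_noise_ratio(x, x_rec, data_range=255)
    ssim = structural_similarity(x, x_rec, channel_axis=-1, data_range=255)
    return psnr, ssim


def bpp(num_bytes, height, width):
    # Bits per pixel of the arithmetic-coded bitstream.
    return num_bytes * 8.0 / (height * width)
```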
IV. CONCLUDING REMARKS

We have developed an extreme image compression framework via a multiscale autoencoder structure with embedded generative adversarial optimization for end-to-end training. The multiscale autoencoder is realized by downscaling the original image into various scales to capture image statistics both locally and globally, with each decoded representation at a lower resolution scale utilized as the prior for efficient compression at the next higher scale. In addition to traditional pixel-wise distortion measurements (e.g., PSNR, SSIM), we have introduced the adversarial loss for pleasant image reconstruction at very low bitrates (usually below 0.05 bpp), so as to preserve the image structure and global semantics. Experimental studies have demonstrated that our method provides subjective quality improvement over the existing JPEG2000 and BPG on public datasets, although objective evaluation (e.g., PSNR or SSIM) suffers. This calls for further studies on quality metrics that accurately capture visual significance in the semantic domain for ultra-low-bitrate scenarios.
REFERENCES

[1] George Toderici, Damien Vincent, Nick Johnston, Sung Jin Hwang, David Minnen, Joel Shor, and Michele Covell, "Full resolution image compression with recurrent neural networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 5306–5314.
[2] Oren Rippel and Lubomir Bourdev, "Real-time adaptive image compression," arXiv preprint arXiv:1705.05823, 2017.
[3] Johannes Ballé, David Minnen, Saurabh Singh, Sung Jin Hwang, and Nick Johnston, "Variational image compression with a scale hyperprior," arXiv preprint arXiv:1802.01436, 2018.
[4] Haojie Liu, Tong Chen, Qiu Shen, Tao Yue, and Zhan Ma, "Deep image compression via end-to-end learning," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2018, pp. 2575–2578.
[5] Johannes Ballé, Valero Laparra, and Eero P. Simoncelli, "End-to-end optimized image compression," arXiv preprint arXiv:1611.01704, 2016.
[6] Eirikur Agustsson, Michael Tschannen, Fabian Mentzer, Radu Timofte, and Luc Van Gool, "Generative adversarial networks for extreme learned image compression," arXiv preprint arXiv:1804.02958, 2018.
[7] Jonathan Long, Evan Shelhamer, and Trevor Darrell, "Fully convolutional networks for semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3431–3440.
[8] Zhou Wang, Eero P. Simoncelli, and Alan C. Bovik, "Multiscale structural similarity for image quality assessment," in The Thirty-Seventh Asilomar Conference on Signals, Systems & Computers. IEEE, 2003, vol. 2, pp. 1398–1402.
[9] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio, "Generative adversarial nets," in Advances in Neural Information Processing Systems, 2014, pp. 2672–2680.
[10] David Taubman and Michael Marcellin, JPEG2000 Image Compression Fundamentals, Standards and Practice, vol. 642, Springer Science & Business Media, 2012.
[11] Fabrice Bellard, "BPG image format," https://bellard.org/bpg/.
[12] Xintao Wang, Ke Yu, Chao Dong, and Chen Change Loy, "Recovering realistic texture in image super-resolution by deep spatial feature transform," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 606–615.
[13] Chuanmin Jia, Zhaoyi Liu, Yao Wang, Siwei Ma, and Wen Gao, "Layered image compression using scalable auto-encoder," in IEEE Conference on Multimedia Information Processing and Retrieval (MIPR), 2019, pp. 431–436.
[14] Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, and Bryan Catanzaro, "High-resolution image synthesis and semantic manipulation with conditional GANs," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 8798–8807.
[15] Mehdi Mirza and Simon Osindero, "Conditional generative adversarial nets," arXiv preprint arXiv:1411.1784, 2014.
[16] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[17] Christian Ledig, Lucas Theis, Ferenc Huszár, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, et al., "Photo-realistic single image super-resolution using a generative adversarial network," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2017, pp. 105–114.
[18] Xudong Mao, Qing Li, Haoran Xie, Raymond Y. K. Lau, Zhen Wang, and Stephen Paul Smolley, "Least squares generative adversarial networks," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2794–2802.
[19] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele, "The Cityscapes dataset for semantic urban scene understanding," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 3213–3223.
[20] Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba, "Scene parsing through ADE20K dataset," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2017, vol. 1, p. 4.
[21] Zhou Wang, Alan C. Bovik, Hamid R. Sheikh, and Eero P. Simoncelli, "Image quality assessment: from error visibility to structural similarity," IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600–612, 2004.