Toward Joint Image Generation and Compression using Generative Adversarial Networks
Byeongkeun Kang, Subarna Tripathi, and Truong Q. Nguyen,
Fellow, IEEE
Abstract—In this paper, we present a generative adversarial network framework that generates compressed images instead of synthesizing raw RGB images and compressing them separately. In the real world, most images and videos are stored and transferred in a compressed format to save storage capacity and data transfer bandwidth. However, since typical generative adversarial networks generate raw RGB images, those generated images need to be compressed by a post-processing stage to reduce the data size. Among image compression methods, JPEG has been one of the most commonly used lossy compression methods for still images. Hence, we propose a novel framework that generates JPEG compressed images using generative adversarial networks. The novel generator consists of the proposed locally connected layers, chroma subsampling layers, quantization layers, residual blocks, and convolution layers. The locally connected layer is proposed to enable block-based operations. We also discuss training strategies for the proposed architecture, including the loss function and the transformation between its generator and its discriminator. The proposed method is evaluated using the publicly available CIFAR-10 dataset and LSUN bedroom dataset. The results demonstrate that the proposed method is able to generate compressed data with competitive quality. The proposed method is a promising baseline method for joint image generation and compression using generative adversarial networks.
I. INTRODUCTION
Most images and videos exist in a compressed form since data compression saves a large amount of data storage and network bandwidth and further enables many applications such as real-time video streaming on cell phones. Compression is indeed crucial, considering that compressed images and videos can be about 10 times and 50 times smaller than the raw data, respectively. Nevertheless, typical generative adversarial networks (GANs) focus on generating raw RGB images or videos [1]–[6]. Considering that one of the most common usages of GANs is generating large-scale synthetic images/videos for data augmentation, the created images/videos often require compression in a post-processing stage to store the large dataset on hardware [7]–[10]. Besides, typical GANs are evaluated using the generated raw RGB data although compression is applied to the raw data prior to final applications. Hence, we investigate GAN frameworks that aim to generate compressed data and to evaluate the networks using the generated encoded images. We focus on GAN frameworks for compressed image generation since image generation networks [1]–[4] have been far more studied compared to video generation networks [5],
B. Kang, S. Tripathi, and T. Q. Nguyen are with the Department of Electrical and Computer Engineering, University of California, San Diego, CA 92093 USA (e-mail: [email protected], [email protected], [email protected]). This work is supported in part by NSF grant IIS-1522125.

Fig. 1. The proposed framework to generate compressed images. The framework consists of a generator, a discriminator, and a transformer between the generator and the discriminator. The visualized generator output is an intermediate example.

[6]. In image compression, JPEG [11], [12] has been one of the most commonly used lossy compression methods [13]–[15]. It has been used in digital cameras and utilized for storing and transmitting images. While JPEG has many variants, typical JPEG consists of color space transformation, chroma subsampling, block-based discrete cosine transform (DCT) [16], quantization, and entropy coding [17]. In more detail, the compression method converts an image in the RGB domain to another color space (YCbCr) that separates the luminance component from the chrominance components. Then, the chrominance components are downsampled. It then applies the 8 × 8 block-based DCT, quantizes the DCT coefficients, and entropy codes the result.

Hence, we propose a novel framework that generates JPEG compressed images using generative adversarial networks as shown in Fig. 1. The framework consists of a generator, a discriminator, and a transformer between the generator and the discriminator. The proposed generator produces compressed data given a randomly selected noise in a latent space. The transformer is applied to make the data from the generator and the training data be in the same domain. Since the generator outputs compressed data and the training data is raw RGB images, the transformer should convert encoded data to a raw image, and is thus a decoder. The discriminator takes synthesized images and real images and aims to differentiate them. The proposed generator has three paths, one path for the luminance component and the other two paths for the chrominance components.
The separate paths are proposed to support any requested chroma subsampling. We also propose the locally connected layer that takes an input of a subregion and outputs for the corresponding subregion. The proposed locally connected layer is able to handle block-based processing in JPEG compression. In summary, the contributions of our work are as follows:
• We propose a framework that generates JPEG compressed images using generative adversarial networks.
• We propose the locally connected layer to enable block-based operations in JPEG compression.
• We analyze the effects of compression for the proposed method and other methods.

II. RELATED WORKS
A. Generative Adversarial Networks
Generative adversarial networks were introduced in [1], where the framework estimates generative models by learning two competing networks (a generator and a discriminator). The generator aims to capture the data distribution by learning the mapping from a known noise space to the data space. The discriminator differentiates between samples from the training data and those from the generator. These two networks compete since the goal of the generator is making the distribution of generated samples equivalent to that of the training data while the discriminator's objective is discovering the discrepancy between the two distributions.

While the work in [1] employed multilayer perceptrons for both the generator and the discriminator, deep convolutional GANs (DCGANs) replaced multilayer perceptrons with convolutional neural networks (CNNs) to take advantage of shared weights, especially for image-related tasks [2]. To utilize CNNs in the GAN framework, extensive architectures with relatively stable behavior during training were explored. They examined stability even for models with deeper layers and for networks that generate high-resolution outputs. The analysis includes fractional-stride convolutions, batch normalization, and activation functions. Salimans et al. presented methods that improve the training of GANs [18]. The techniques include matching expected activations of training data and those of generated samples, penalizing similar samples in a mini-batch, punishing sudden changes of weights, one-sided label smoothing, and virtual batch normalization.

Arjovsky et al. presented the advantage of the Earth-Mover (EM) distance (Wasserstein-1) compared to other popular probability distances and divergences such as the Jensen-Shannon divergence, the Kullback-Leibler divergence, and the Total Variation distance [3]. The advantage of the EM distance is that it is continuous everywhere and differentiable almost everywhere when it is applied to a neural network-based generator with a constrained input noise variable. They also showed that the EM distance is a more sensible cost function. Based on these, they proposed the Wasserstein GAN that uses a reasonable and efficient approximation of the EM distance. They then showed that the proposed GAN achieves improved stability in training. However, clipping weights for the Lipschitz constraint in [3] might cause optimization difficulties [4]. Hence, Gulrajani et al. proposed penalizing the gradient norm to enforce the Lipschitz constraint instead of clipping [4].

The Wasserstein GAN trains its discriminator multiple times at each training step of its generator so that the framework can train the generator using the more converged discriminator [3]. To avoid the expensive multiple updates of the discriminator, Heusel et al. proposed to use the two time-scale update rule (TTUR) [19] in a Wasserstein GAN framework [20]. Since TTUR enables separate learning rates for the discriminator and the generator, the discriminator can be trained faster than the generator by selecting a higher learning rate for the discriminator compared to that of the generator. It is also proved that TTUR converges to a stationary local Nash equilibrium under mild assumptions. They further experimentally showed that their method outperforms most other state-of-the-art methods. Hence, the proposed framework of this paper is based on [20].
B. Image Compression: JPEG
JPEG [12] has been one of the most commonly used lossy compression methods for still images with continuous tones [13]–[15]. It has been used for digital cameras, photographic images on the World Wide Web, medical images, and many other applications.

In JPEG, the discrete cosine transform (DCT) is utilized since it achieves high energy compaction while having low computational complexity. Considering that an image contains uncorrelated (various) information, block-based DCT is used so that each block contains correlated data. Using too small a block prevents compressing correlated information, while too large a block with uncorrelated pixels increases computational complexity without compression gain. Hence, JPEG uses 8 × 8 blocks. The quantized DCT coefficients are then entropy coded using Huffman coding, considering the statistical distribution of the information [17]. While these form the baseline of JPEG, other additional methods and components were also suggested for particular purposes. Also, typical JPEG uses the YCbCr color space and chroma subsampling [11].

Fig. 2. The proposed architecture for JPEG compressed image generation. (a) Generator. (b) Discriminator. (c) Residual block. The proposed generator consists of three paths, one for each luminance or chrominance component. The generator employs the proposed locally connected layer to operate block-based processing. The visualized generator considers a chroma subsampling ratio of 4:2:0.

C. Other Related Works
Our work differs from learning neural networks for image compression, such as autoencoders [21]–[24], which aim to learn image encoders to compress real images. The proposed method and learning encoders differ in two aspects. First, the goal of the proposed method is generating synthetic JPEG compressed data while that of the latter is compressing real images. Second, the proposed method intends to utilize the existing standard decoder in general electronic devices while the latter requires a particular decoder to decode the compressed data.

This paper is also distinct from [25], [26], which are about utilizing GANs for post-processing real data in a frequency domain. The proposed work generates compressed images from random noises in the latent space.

III. PROPOSED METHOD
We propose a framework that generates JPEG compressed images using generative adversarial networks. We use the architectures analyzed in the TTUR method [20] as our baseline networks since the method achieves one of the state-of-the-art results. The generator in the baseline architecture consists of one fully connected layer, four residual blocks, and one convolution layer. The discriminator in the architecture consists of one convolution layer, four residual blocks, and one fully connected layer as shown in Fig. 2(b). The residual block consists of two paths (see Fig. 2(c)). One path has two convolution layers with a filter-size of 3 × 3, and the other has only one convolution layer with a filter-size of 1 × 1. All the convolution layers outside of the residual blocks have a filter-size of 3 × 3.

Fig. 3. Visual comparison of (a) a convolution layer with a filter-size of 3 × 3, (b) a fully connected layer, and (c) the proposed locally connected layer with a block-size of 4 × 4. The locally connected layer operates comparably to block-based processing. Each region of the output is produced by the summation of the multiplication of the corresponding region of the input and shared weights. The weights are shared between blocks, but not between outputs in a block.

The entropy encoding layer is omitted since the encoding is lossless and located at the last layer. Hence, this exclusion does not affect the results.

We then present the processing between the generator and the discriminator in Section III-B. Since typical generators generate RGB images, which are in the same domain as the training images, no additional processing is required to use the output of the generator as the input to the discriminator. However, since the proposed generator produces JPEG compressed data, the outputs of the generator and the training images are in different domains and cannot be used together for the discriminator. Consequently, we need to either compress the training images or decode the generated JPEG compressed data so that they are in the same domain.

In Section III-C, we discuss training strategies for the proposed architecture. Although many studies have been conducted to improve training stability, training GANs for non-typical images is still quite challenging.
A. Generator
We propose a novel generator that generates JPEG compressed images (see Fig. 2). The generator consists of six locally connected layers, two chroma subsampling layers, and a quantization layer in addition to the layers in the baseline generator. The generator has three paths where each path generates one of the luminance or chrominance components in the YCbCr representation. The separated paths are required to handle any required chroma subsampling since the resolutions of the luminance component and the chrominance components differ if chroma subsampling is applied. The locally connected layer is proposed to operate block-based processing. An entropy encoding layer is not applied since the encoding is lossless and at the last layer, which does not impact the results.

The proposed locally connected layer takes an input of a subregion and produces an output for the corresponding subregion (see Fig. 3). The layer is proposed to perform operations comparable to block-based processing. Compared to a convolution layer, the proposed layer is different since a convolution layer takes an input from a region and outputs to only a single location. The nearby outputs from a convolution layer are produced by using different regions of inputs. In other words, to generate an 8 × 8 block of outputs from the corresponding 8 × 8 block of inputs, the proposed locally connected layer is used with a block-size of 8 × 8, which matches the 8 × 8 block-based DCT in JPEG.
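The block-wise weight sharing described above can be sketched as follows. This is an illustrative NumPy implementation of our own (not the paper's code): weights are shared across blocks but distinct for each output position inside a block, so each output block depends only on the corresponding input block.

```python
import numpy as np

def locally_connected(x, W, b, block):
    """Sketch of the proposed locally connected layer.

    x     : (H, W_in) input; H and W_in are multiples of `block`.
    W     : (block*block, block*block) weights, shared by every block
            but not shared between output positions within a block.
    b     : (block*block,) bias.
    block : block-size (4 in Fig. 3, 8 for JPEG-style DCT blocks).
    """
    h, w = x.shape
    y = np.empty_like(x, dtype=float)
    for i in range(0, h, block):
        for j in range(0, w, block):
            patch = x[i:i + block, j:j + block].reshape(-1)
            out = W @ patch + b  # same W and b reused for every block
            y[i:i + block, j:j + block] = out.reshape(block, block)
    return y
```

With W set to a 2D DCT basis (and zero bias), each block would be transformed independently, which is exactly the block structure that JPEG requires.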
The quantization matrices below (Q_{l,50}, Q_{c,50}) are for the quality factor of 50. The former and the latter matrices are for the luminance component (Q_l) and the chrominance components (Q_c), respectively.

Q_{l,50} =
  [ 16  11  10  16  24  40  51  61
    12  12  14  19  26  58  60  55
    14  13  16  24  40  57  69  56
    14  17  22  29  51  87  80  62
    18  22  37  56  68 109 103  77
    24  35  55  64  81 104 113  92
    49  64  78  87 103 121 120 101
    72  92  95  98 112 100 103  99 ].   (1)

Q_{c,50} =
  [ 17  18  24  47  99  99  99  99
    18  21  26  66  99  99  99  99
    24  26  56  99  99  99  99  99
    47  66  99  99  99  99  99  99
    99  99  99  99  99  99  99  99
    99  99  99  99  99  99  99  99
    99  99  99  99  99  99  99  99
    99  99  99  99  99  99  99  99 ].   (2)

The quantization matrix for another quality factor for the luminance component is computed as follows:

Q_{l,n} = max(1, ⌊(2 − n/50) Q_{l,50} + 0.5⌋)   if n ≥ 50,
Q_{l,n} = ⌊(50/n) Q_{l,50} + 0.5⌋               otherwise,   (3)

where n ∈ (0, 100] denotes the quality factor. The conversion of the quantization matrix for chrominance is equivalent.

The architecture in Fig. 2 is used for the LSUN bedroom dataset [30], which aims to generate images with a resolution of 64 × 64. For the CIFAR-10 dataset [31], whose objective resolution is 32 × 32, the output dimensions of all the layers in both the generator and the discriminator are reduced by half for both the x- and y-axes. Also, the number of activations (feature maps) is reduced by half up to the last residual block in both the generator and the discriminator.
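Equation (3) can be sketched in a few lines. This is an illustrative implementation of the standard JPEG quality scaling applied element-wise to the 8 × 8 matrix, not code from the paper:

```python
import numpy as np

def scale_quant_matrix(q50, n):
    """Scale a quality-50 quantization matrix to quality factor n in (0, 100],
    following Eq. (3): a higher n gives smaller entries, i.e. finer quantization."""
    q50 = np.asarray(q50, dtype=float)
    if n >= 50:
        q = np.floor((2.0 - n / 50.0) * q50 + 0.5)
        return np.maximum(1.0, q)  # entries are clamped to at least 1
    return np.floor((50.0 / n) * q50 + 0.5)
```

At quality 100 the matrix collapses toward all ones (nearly lossless up to rounding), while quality 25 doubles the quality-50 step sizes.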
B. Transform between Generator and Discriminator
Since the discriminator takes half of its inputs from the generator and the other half from the training dataset, the two data sources should be in the same domain (representation). However, the training dataset contains real images in the RGB domain while the outputs of the proposed generator are JPEG compressed data. Hence, we have to either compress the training data or decode the outputs of the generator so that the two are in the same domain.

We examined both alternatives. It turns out that it is better to decode the outputs of the generator before providing them as inputs to the discriminator. Our opinion is that convolution layers, which are the major elements of the discriminator, were invented for real images, which usually have a continuous tone. However, the compressed data contains amplitudes of block-based DCT, which vary largely at the boundaries of blocks and also within the blocks. Consequently, the discriminator provides inferior-quality gradients to the generator and hinders training a good generator. Hence, the proposed framework has a decoder that takes the outputs of the generator and renders the inputs to the discriminator during training. We do not need this conversion (decoder) after training since we only utilize the discriminator for training and our goal is generating compressed data.

Given an output of the generator, we first de-quantize the amplitudes by multiplying them by the corresponding quantization matrices used in the generator. We then apply the inverse DCT to transform the amplitudes in the frequency domain to the contents in the 2D color domain. We upsample the chrominance components to the same resolution as the luminance component if chroma subsampling is applied in the generator. We then convert the amplitudes in the YCbCr space to the RGB space. Lastly, we clip the amplitudes so that, after compensating for shifting and scaling, the amplitudes are in the range of [0, 255].
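The de-quantize and inverse-DCT steps above can be sketched as follows. This is a minimal NumPy illustration for a single 8 × 8 block (names are our own; the actual decoder additionally upsamples chrominance and converts YCbCr to RGB):

```python
import numpy as np

BLOCK = 8

def dct_basis(n=BLOCK):
    """Orthonormal DCT-II basis matrix C, so that F = C X C^T and X = C^T F C."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    c[0, :] = np.sqrt(1.0 / n)
    return c

def decode_block(quantized, q_matrix):
    """De-quantize an 8x8 block of DCT amplitudes, then apply the inverse DCT."""
    coeff = quantized * q_matrix  # de-quantization: multiply by the step sizes
    c = dct_basis()
    return c.T @ coeff @ c        # inverse 2D DCT

def to_pixel_range(x):
    """Compensate the level shift and clip to the valid range [0, 255]."""
    return np.clip(x + 128.0, 0.0, 255.0)
```

Quantization itself is lossy (the rounding cannot be undone), so `decode_block` recovers only an approximation of the original block; the de-quantization and inverse DCT steps are exact.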
C. Training
As GANs are difficult to train [1], many studies have been conducted to improve the stability of training [3], [4], [18]. Still, when employing current state-of-the-art training algorithms on our problem, we had difficulty in training the proposed networks. Hence, we propose a novel loss function to train the proposed framework.

Given a loss function L, the generator G and the discriminator D are trained by playing a minimax game as follows:

min_G max_D L(G, D).   (4)

Considering the objective function in the Wasserstein GAN with gradient penalty [4], we propose a loss function L by adding an additional loss term for the generator:

L(G, D, P) = E_{x̃∼P_g}[ D(x̃) + γ |P(G(x̃)) − Ĝ(x̃)| ] − E_{x∼P_r}[ D(x) ] + λ E_{x̂∼P_x̂}[ (‖∇_{x̂} D(x̂)‖₂ − 1)² ],   (5)

where P_g and P_r denote the generator distribution and the training data distribution. The authors in [4] implicitly defined P_x̂ as sampling uniformly along straight lines between pairs of points sampled from P_r and P_g. Ĝ denotes the layers in the generator G before any locally connected layer. Ĝ is initialized by the parameters that are trained using [4]. P is the transformation between the generator and the discriminator. γ is the hyperparameter that weights between the typical generator loss and the proposed additional generator loss. The gradient penalty coefficient λ is 10 and γ is 100 in all experiments. The learning rates for the discriminator and the generator are 0.0003 and 0.0001, respectively.

We believe that further study of optimization algorithms for non-typical images can improve the quality of the generated results further. However, developing a novel optimization algorithm is beyond the scope of this paper.

IV. EXPERIMENTS AND RESULTS
A. Dataset
We experiment using the CIFAR-10 training dataset [31] and the LSUN bedroom training dataset [30]. The CIFAR-10 dataset consists of 50,000 images with a resolution of 32 × 32. The dataset includes images from 10 categories (airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck). The LSUN bedroom dataset consists of 3,033,042 bedroom images. The images are scaled to 64 × 64 following the previous works [4].
TABLE I
Quantitative comparison using FIDs for the CIFAR-10 dataset [31]. The lower FID means the better result. The first and second rows show FIDs of training images and of generated RGB images using the TTUR method [20] by processing compression in a post-processing stage. The last three rows present generating compressed data directly using the TTUR method [20], the FC generator in the TTUR [20], and the proposed method.

Method             Generator output   Chroma subsampling   100     75      50      25
Real data          -                  4:4:4                0.02    6.49    16.89   35.76
                                      4:2:2                0.68    10.59   23.89   45.40
                                      4:2:0                1.86    16.30   32.47   55.83
TTUR [20] method                      4:4:4                26.10
                                      4:2:2                25.80
B. Metric
We use the Fréchet Inception Distance (FID) [20], which was improved from the Inception score [18] by considering the statistics of real data. The FID computes the Fréchet distance (also known as the Wasserstein-2 distance) [32], [33] between the statistics of real data and those of generated samples. The distance is computed using the first two moments (mean and covariance) of the activations from the last pooling layer in the Inception-v3 model [34]. The FID d is computed as follows:

d²((m, C), (m_w, C_w)) = ‖m − m_w‖₂² + Tr(C + C_w − 2 (C C_w)^{1/2}),   (6)

where (m, C) and (m_w, C_w) represent the mean and covariance of generated samples and of the real dataset, respectively. (m_w, C_w) are measured using the entire set of images in the training dataset. (m, C) are computed using 50,000 generated images.

C. Results
We analyze the proposed method and the architectures in the TTUR method [20] by training them using the datasets in Section IV-A and by evaluating them quantitatively and qualitatively. For the quantitative comparison, we measure the Fréchet Inception Distance (FID) in Section IV-B. Table I and Table II show the quantitative results for the CIFAR-10 dataset [31] and the LSUN bedroom dataset [30], respectively. In both tables, we show the FIDs for three chroma subsampling ratios (4:4:4, 4:2:2, 4:2:0) and for four quality factors of quantization (100, 75, 50, 25). The first row shows the FID variations of real images after applying chroma subsampling and quantization. The second row presents the FID of the original TTUR method generating RGB images [20]. For this analysis, JPEG compression is processed as a post-processing step that follows the neural networks. The third row shows the result of the TTUR method generating JPEG compressed images directly. In Table I, the fourth row presents the result of the fully connected (FC) generator in the TTUR method for generating JPEG compressed images directly. In the last row, we show the result of the proposed method.

We show visual results for the CIFAR-10 and the LSUN bedroom datasets in Figs. 4 and 5, respectively. On the left side, we denote the quality factor for quantization and the chroma subsampling ratios. We show the results of (100, 4:4:4), (100, 4:2:2), (100, 4:2:0), (75, 4:4:4), (50, 4:4:4), and (25, 4:4:4) from the first row to the last row. In the first column, we show the results of real images that are processed by the corresponding compression. The second column shows the results of the generated images using the original TTUR method [20]. These result images are first generated as RGB images and are then coded and decoded in a post-processing stage. The third column shows the result of the TTUR method [20]. In Fig. 4, the fourth column presents the result of the FC generator in the TTUR method [20]. The last column shows the results of the proposed method. The last three columns in Fig. 4 and the last two columns in Fig. 5 are the results of generating JPEG compressed images in the networks.

To compute the FID in the first row of both tables, we first encode and decode the 50,000 training images and then compute the statistics of the processed images. We then estimate the FID between the statistics of the original training data and those of the processed images. For the CIFAR-10 dataset in Table I, the distance is close to 0 using the chroma subsampling ratio of 4:4:4 and the quality factor of 100. It is quite small since the encoding/decoding barely affects the images (only small rounding errors, etc.). By decreasing the quality factor and by subsampling from a larger region, the encoding/decoding distorts the images more and hence the FID increases. Since decreasing the quality factor by 25 affects the images much more than adjusting the chroma subsampling from 4:4:4 to 4:2:0, the FID also increases by a larger amount.
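The FID comparisons above all reduce to evaluating Eq. (6) on two pairs of moments. A small NumPy sketch of our own (illustrative, not the evaluation code used in the paper):

```python
import numpy as np

def sqrtm_psd(a):
    """Matrix square root of a symmetric positive semi-definite matrix."""
    vals, vecs = np.linalg.eigh(a)
    return (vecs * np.sqrt(np.clip(vals, 0.0, None))) @ vecs.T

def frechet_distance(m1, c1, m2, c2):
    """Squared Frechet distance between N(m1, c1) and N(m2, c2), as in Eq. (6).
    Tr((c1 c2)^{1/2}) is computed as Tr((c1^{1/2} c2 c1^{1/2})^{1/2}),
    which is equal for PSD covariances and keeps the arithmetic real-valued."""
    s1 = sqrtm_psd(c1)
    diff = m1 - m2
    return float(diff @ diff + np.trace(c1 + c2 - 2.0 * sqrtm_psd(s1 @ c2 @ s1)))
```

For the actual FID, m and C would be the mean and covariance of Inception-v3 pooling features over 50,000 generated images and over the training set.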
TABLE II
Quantitative comparison using FIDs for the LSUN dataset [30]. The lower FID means the better result. The first and second rows show FIDs of training images and of generated RGB images using the TTUR method [20] by processing compression in a post-processing stage. The last two rows present generating compressed data directly using the TTUR [20] and the proposed method.

Method      Generator output        Chroma subsampling   100     75      50      25
Real data   -                       4:4:4                9.96    6.82    7.57    18.26
                                    4:2:2                10.36   7.13    8.18    18.94
                                    4:2:0                11.37   10.22   12.31   23.01
TTUR [20]   RGB image               4:4:4                12.44   10.24   11.35
                                    4:2:2                13.41   12.05   13.34   24.46
                                    4:2:0                14.80   16.00   17.98   27.99
TTUR [20]   JPEG compressed image   4:4:4                35.70   27.78   30.70   45.39
                                    4:2:2                43.35   34.51   36.19   50.22
                                    4:2:0                71.32   62.18   67.43   75.30
Proposed    JPEG compressed image   4:4:4
For the LSUN bedroom dataset in Table II, the distance of real data using the chroma subsampling ratio of 4:4:4 and the quality factor of 100 is much greater than that in the CIFAR-10 dataset. The FID is defined by the distance between the statistics of the entire training data and those of 50,000 processed or generated images. Since the number of images in the CIFAR-10 dataset is 50,000, the distance is quite small considering that the encoding/decoding does not distort much. However, since the LSUN bedroom dataset contains 3,033,042 images, 50,000 images must be sampled to compute the statistics of the processed images. This causes a relatively larger FID for the LSUN bedroom dataset. It is also interesting to note that for the LSUN dataset, the quality factor of 75 is better than that of 100 in most experiments. We believe that since the bedroom images often have a continuous tone, because of their contents or pre-processing, discarding high-frequency components decreases the FID.

The FIDs in the second row are computed by first generating RGB images using the TTUR method [20] and then encoding/decoding the generated RGB images. As generating RGB images has been studied extensively in recent years and the TTUR method is one of the state-of-the-art methods, the generated RGB images are quite visually plausible. The FID is increased by encoding/decoding the images using a lower quality factor and subsampling from a larger region. Some of the distortions can be visually observed in Figs. 4 and 5.

The third row presents applying the same method to generate JPEG encoded images. The results demonstrate that directly applying the method does not produce competitive results. The fourth row in Table I shows the results of applying the FC generator in the TTUR method [20] for generating encoded images. While the FC generator often performs more poorly than the selected TTUR method for generating typical images, we tried the FC generator to avoid the extensively used convolution layers in the TTUR method. However, the FC generator does not perform well even for generating encoded images.

The last row in both tables shows the results of the proposed method. The proposed method achieves promising results for generating JPEG encoded images directly. The proposed method outperforms applying the TTUR method for generating JPEG encoded images directly. Moreover, the proposed method is competitive with the method that generates RGB images using the TTUR method and compresses them by post-processing.

V. CONCLUSION

We present a generative adversarial network framework that combines image generation and compression by generating compressed images directly. We propose a novel generator consisting of the proposed locally connected layers, chroma subsampling layers, quantization layers, residual blocks, and convolution layers. We also present training strategies for the proposed framework, including the loss function and the transformation between the generator and the discriminator. We demonstrate that the proposed framework outperforms applying state-of-the-art GANs for generating compressed data directly. Moreover, we show that the proposed method achieves competitive results compared to generating raw RGB images using one of the state-of-the-art methods and compressing the images by post-processing. We believe that the proposed method can be further improved by investigating optimization algorithms for learning to generate compressed data. We also consider the scenario where the proposed method can serve as a baseline method for further studies.
Fig. 4. Visual results of generating compressed images using the CIFAR-10 dataset [31]. (a) Real data. (b) TTUR (RGB). (c) TTUR (encoded). (d) FC generator (encoded). (e) Proposed method. On the left side, we denote the quality factor for quantization and chroma subsampling ratios. We show the results of (100, 4:4:4), (100, 4:2:2), (100, 4:2:0), (75, 4:4:4), (50, 4:4:4), and (25, 4:4:4) from the first row to the last row. The first and second columns show the results of real images and the original TTUR method [20] that are processed by the corresponding encoding/decoding. The third and fourth columns show the results of the TTUR method and the FC generator in the TTUR method that generate compressed data directly. The last column shows the result of the proposed method.

Fig. 5. Visual results of generating compressed images using the LSUN dataset [30]. (a) Real data. (b) TTUR (RGB). (c) TTUR (encoded). (d) Proposed method. On the left side, we denote the quality factor for quantization and chroma subsampling ratios. We show the results of (100, 4:4:4), (100, 4:2:2), (100, 4:2:0), (75, 4:4:4), (50, 4:4:4), and (25, 4:4:4) from the first row to the last row. The first and second columns show the results of real images and the original TTUR method [20] that are processed by the corresponding encoding/decoding. The third column presents the results of the TTUR method that generates compressed data directly. The last column shows the result of the proposed method.

REFERENCES

[1] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio, "Generative adversarial nets," in Advances in Neural Information Processing Systems 27, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, Eds., pp. 2672–2680. Curran Associates, Inc., 2014.
[2] Alec Radford, Luke Metz, and Soumith Chintala, "Unsupervised representation learning with deep convolutional generative adversarial networks," CoRR, vol. abs/1511.06434, 2015.
[3] Martin Arjovsky, Soumith Chintala, and Léon Bottou, "Wasserstein generative adversarial networks," in Proceedings of the 34th International Conference on Machine Learning, Doina Precup and Yee Whye Teh, Eds., International Convention Centre, Sydney, Australia, 06–11 Aug 2017, vol. 70 of Proceedings of Machine Learning Research, pp. 214–223, PMLR.
[4] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C. Courville, "Improved training of wasserstein gans," in Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds., pp. 5767–5777. Curran Associates, Inc., 2017.
[5] Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba, "Generating videos with scene dynamics," in Advances in Neural Information Processing Systems 29, D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, Eds., pp. 613–621. Curran Associates, Inc., 2016.
[6] Wei Xiong, Wenhan Luo, Lin Ma, Wei Liu, and Jiebo Luo, "Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks," in , June 2018.
[7] Leon Sixt, Benjamin Wild, and Tim Landgraf, "Rendergan: Generating realistic labeled data," Frontiers in Robotics and AI, vol. 5, pp. 66, 2018.
[8] Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, and Bryan Catanzaro, "High-resolution image synthesis and semantic manipulation with conditional gans," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[9] Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Guilin Liu, Andrew Tao, Jan Kautz, and Bryan Catanzaro, "Video-to-video synthesis," in Advances in Neural Information Processing Systems (NIPS), 2018.
[10] M. Frid-Adar, E. Klang, M. Amitai, J. Goldberger, and H. Greenspan, "Synthetic data augmentation using gan for improved liver lesion classification," in , April 2018, pp. 289–293.
[11] ITU-R Recommendation 601: Encoding Parameters of Digital Television for Studios, International Telecommunication Union, 1982.
[12] ITU-T T.81 Information Technology – Digital Compression and Coding of Continuous-Tone Still Images – Requirements and Guidelines, International Telecommunication Union, 1992.
[13] G. K. Wallace, "The jpeg still picture compression standard," IEEE Transactions on Consumer Electronics, vol. 38, no. 1, pp. xviii–xxxiv, Feb 1992.
[14] William B. Pennebaker and Joan L. Mitchell, JPEG Still Image Data Compression Standard, Kluwer Academic Publishers, Norwell, MA, USA, 1st edition, 1992.
[15] G. Hudson, A. Léger, B. Niss, and I. Sebestyén, "Jpeg at 25: Still going strong," IEEE MultiMedia, vol. 24, no. 2, pp. 96–103, Apr 2017.
[16] N. Ahmed, T. Natarajan, and K. R. Rao, "Discrete cosine transform," IEEE Transactions on Computers, vol. C-23, no. 1, pp. 90–93, Jan 1974.
[17] D. A. Huffman, "A method for the construction of minimum-redundancy codes," Proceedings of the IRE, vol. 40, no. 9, pp. 1098–1101, Sept 1952.
[18] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, Xi Chen, and Xi Chen, "Improved techniques for training gans," in Advances in Neural Information Processing Systems 29, D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, Eds., pp. 2234–2242. Curran Associates, Inc., 2016.
[19] H. L. Prasad, Prashanth L.A., and Shalabh Bhatnagar, "Two-timescale algorithms for learning nash equilibria in general-sum stochastic games," in
Proceedings of the 2015 International Conference on AutonomousAgents and Multiagent Systems , Richland, SC, 2015, AAMAS ’15,pp. 1371–1379, International Foundation for Autonomous Agents andMultiagent Systems.[20] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, BernhardNessler, and Sepp Hochreiter, “Gans trained by a two time-scale updaterule converge to a local nash equilibrium,” in
Advances in NeuralInformation Processing Systems 30 , I. Guyon, U. V. Luxburg, S. Bengio,H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds., pp. 6626–6637. Curran Associates, Inc., 2017.[21] Lucas Theis, Wenzhe Shi, Andrew Cunningham, and Ferenc Huszar,“Lossy image compression with compressive autoencoder,” in
Interna-tional Conference on Learning Representations (ICLR) , 2017.[22] George Toderici, Sean M. O’Malley, Sung Jin Hwang, Damien Vincent,David Minnen, Shumeet Baluja, Michele Covell, and Rahul Sukthankar,“Variable rate image compression with recurrent neural networks,” in
International Conference on Learning Representations (ICLR) , 2016.[23] G. Toderici, D. Vincent, N. Johnston, S. J. Hwang, D. Minnen, J. Shor,and M. Covell, “Full resolution image compression with recurrent neural networks,” in , July 2017, pp. 5435–5443.[24] S. Santurkar, D. Budden, and N. Shavit, “Generative compression,” in , June 2018, pp. 258–262.[25] Takuhiro Kaneko, Shinji Takaki, Hirokazu Kameoka, and Junichi Yam-agishi, “Generative adversarial network-based postfilter for stft spectro-grams,” in
Proc. Interspeech 2017 , 2017, pp. 3389–3393.[26] M. Mardani, E. Gong, J. Y. Cheng, S. S. Vasanawala, G. Zaharchuk,L. Xing, and J. M. Pauly, “Deep generative adversarial neural networksfor compressive sensing (gancs) mri,”
IEEE Transactions on MedicalImaging , pp. 1–1, 2018.[27] Wen-Hsiung Chen and W. Pratt, “Scene adaptive coder,”
IEEETransactions on Communications , vol. 32, no. 3, pp. 225–232, March1984.[28] H. Lohscheller, “A subjectively adapted image communication system,”
IEEE Transactions on Communications , vol. 32, no. 12, pp. 1316–1322,December 1984.[29] Heidi A. Peterson, Huei Peng, J. H. Morgan, and William B. Pennebaker,“Quantization of color image components in the dct domain,” 1991.[30] Fisher Yu, Yinda Zhang, Shuran Song, Ari Seff, and Jianxiong Xiao,“Lsun: Construction of a large-scale image dataset using deep learningwith humans in the loop,” arXiv preprint arXiv:1506.03365 , 2015.[31] Alex Krizhevsky, “Learning multiple layers of features from tinyimages,”
Technical report, University of Toronto , 2009.[32] M. Fr´echet, “Sur la distance de deux lois de probabilit,”
C. R. Acad.Sci. Paris , pp. 689 – 692, 1957.[33] D.C Dowson and B.V Landau, “The fr´echet distance between multi-variate normal distributions,”
Journal of Multivariate Analysis , vol. 12,no. 3, pp. 450 – 455, 1982.[34] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Re-thinking the inception architecture for computer vision,” in2016 IEEEConference on Computer Vision and Pattern Recognition (CVPR)