Generalized Octave Convolutions for Learned Multi-Frequency Image Compression
Mohammad Akbari∗, Jie Liang∗, Jingning Han†, Chengjie Tu‡
[email protected], [email protected], [email protected], [email protected]
Simon Fraser University, Canada∗, Google Inc.†, Tencent Technologies‡

ABSTRACT
Learned image compression has recently shown the potential to outperform standard codecs. State-of-the-art rate-distortion (R-D) performance has been achieved by context-adaptive entropy coding approaches in which hyperprior and autoregressive models are jointly utilized to effectively capture the spatial dependencies in the latent representations. However, the latents in previous works are feature maps of the same spatial resolution, which contain some redundancies that affect the R-D performance. In this paper, we propose the first learned multi-frequency image compression and entropy coding approach, based on the recently developed octave convolutions, that factorizes the latents into high- and low-frequency (resolution) components, where the low frequency is represented at a lower resolution. Its spatial redundancy is therefore reduced, which improves the R-D performance. Novel generalized octave convolution and octave transposed-convolution architectures with internal activation layers are also proposed to preserve more of the spatial structure of the information. Experimental results show that the proposed scheme outperforms all existing learned methods as well as standard codecs, including the next-generation video coding standard VVC (4:2:0), on the Kodak dataset in both PSNR and MS-SSIM. We also show that the proposed generalized octave convolution can improve the performance of other auto-encoder-based computer vision tasks such as semantic segmentation and image denoising.
Index Terms — generalized octave convolutions, multi-frequency autoencoder, multi-frequency image coding, learned image compression, learned entropy model
1. INTRODUCTION
Deep learning-based image compression [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12] has shown the potential to outperform standard codecs such as JPEG2000 and the H.265/HEVC-based BPG image codec [13]. Learned image compression was first used in [10] to compress thumbnail images using long short-term memory (LSTM)-based recurrent neural networks (RNNs), in which better SSIM results than JPEG and WebP were reported. This approach was generalized in [5], which utilized spatially adaptive bit allocation to further improve the performance. In [4], a scheme based on generalized divisive normalization (GDN) and inverse GDN (IGDN) was proposed, which outperformed JPEG2000 in both PSNR and SSIM. A compressive auto-encoder framework with residual connections as in ResNet was proposed in [9], where the quantization was replaced by a smooth approximation, and a scaling approach was used to obtain different rates. In [2], a soft-to-hard vector quantization approach was introduced, and a unified framework was developed for image compression as well as deep learning model compression. In order to take the spatial variation of image content into account, a content-weighted framework was introduced in [6], where an importance map for locally adaptive bit rate allocation was employed. A learned channel-wise quantization along with arithmetic coding was also used to reduce the quantization error. There have also been some efforts to take advantage of other computer vision tasks in image compression frameworks. For example, in [3], a deep semantic segmentation-based layered image compression (DSSLIC) scheme was proposed, taking advantage of generative adversarial networks (GANs) and BPG-based residual coding.
It outperformed the BPG codec (in RGB444 format) in both PSNR and MS-SSIM [14] across a large range of bit rates. Since most learned image compression methods need to train multiple networks for multiple bit rates, variable-rate approaches have also been proposed, in which a single neural network model is trained to operate at multiple bit rates. This approach was first introduced by [10], which was then generalized for full-resolution images using deep learning-based entropy coding in [11]. A CNN-based multi-scale decomposition transform was optimized for all scales in [15], which achieved better performance than BPG in MS-SSIM. In [16], a learned progressive image compression model was proposed using bit-plane decomposition and bidirectional assembling of gated units. Another variable-rate framework was introduced in [17], which employed GDN-based shortcut connections, stochastic rounding-based scalable quantization, and a variable-rate objective function. The method in [17] outperformed previous learned variable-rate methods. Most previous works used fixed entropy models shared between the encoder and decoder. In [18], a conditional entropy model based on a Gaussian scale mixture (GSM) was proposed, where the scale parameters were conditioned on a hyperprior learned using a hyper auto-encoder. The compressed hyperprior was transmitted and added to the bit stream as side information. This model was extended in [7, 8], where a Gaussian mixture model (GMM) with both mean and scale parameters conditioned on the hyperprior was utilized. In these methods, the hyperpriors were combined with autoregressive priors generated using context models, which outperformed BPG in terms of both PSNR and MS-SSIM. The coding efficiency of [8] was further improved in [19] by a joint optimization of image compression and quality enhancement networks.
Another context-adaptive approach was introduced by [20], in which multi-scale masked convolutional networks were utilized for the autoregressive model, combined with hyperpriors. The state of the art in learned image compression has been achieved by context-adaptive entropy methods in which hyperprior and autoregressive models are combined [7]. These approaches are jointly optimized to effectively capture the spatial dependencies and probabilistic structures of the latent representations, which leads to a compression model with superior rate-distortion (R-D) performance. However, similar to natural images, the latents are usually represented by feature maps of the same spatial resolution, which introduces some spatial redundancy. For example, some of these maps have more low-frequency components, and therefore do not need the same resolution as others with more high-frequency components. This suggests that better R-D performance can be achieved by allowing different spatial redundancies in different feature maps. In this paper, a learned multi-frequency (bi-frequency) image compression and entropy model is introduced in which octave convolutions [21] are utilized to factorize the latent representations into high-frequency (HF) and low-frequency (LF) components. The LF information is then represented at a lower spatial resolution, which reduces the corresponding spatial redundancy and improves the compression performance, similar to wavelet transforms [22]. In addition, due to the effective communication between the HF and LF components in octave convolutions, the reconstruction performance is also improved. In the original octave convolution [21], fixed interpolation methods are used for the down- and up-sampling operations, which do not retain spatial information.
This can negatively affect the image compression performance. In order to preserve the spatial structure of the latents in our image coding framework, we develop novel generalized octave convolution and octave transposed-convolution architectures with internal activation layers. Experimental results show that the proposed scheme outperforms all existing learning-based methods and standard codecs in terms of both PSNR and MS-SSIM on the Kodak dataset. The framework proposed in this work bridges the wavelet transform and deep learning; therefore, many techniques from the wavelet transform literature can be used in the proposed framework to further improve its performance, which will have a profound impact on future image coding research. Besides, we show that the proposed generalized octave convolution and transposed-convolution architectures can improve the performance and the computational complexity of other auto-encoder-based computer vision tasks such as semantic segmentation and image denoising.

The paper is organized as follows. In Section 2, vanilla and octave convolutions are briefly described and compared. The proposed generalized octave convolution and transposed-convolution with internal activation layers are formulated and discussed in Section 3. The architecture of the proposed multi-frequency image compression framework as well as the multi-frequency entropy model are then introduced in Section 4. In Section 5, the experimental results along with an ablation study are discussed and compared with the state-of-the-art learning-based image compression methods. Following that, the proposed multi-frequency semantic segmentation and image denoising frameworks are studied in Sections 6 and 7, respectively. The conclusion is given in Section 8.
2. VANILLA VS. OCTAVE CONVOLUTION
Let X, Y ∈ ℝ^{h×w×c} be the input and output feature vectors with c channels of size h × w. Each feature map in the vanilla convolution is obtained as follows:

Y_(p,q) = Σ_{i,j ∈ N_k} Φ^T_{(i+(k−1)/2, j+(k−1)/2)} X_(p+i, q+j),   (1)

where Φ ∈ ℝ^{k×k×c} is a k × k convolution kernel, (p, q) is the location coordinate, and N_k is a local neighborhood.

In the vanilla convolution, all input and output feature maps have the same spatial resolution. If the feature maps in each layer of existing deep learning schemes have the same resolution, there will be some unnecessary redundancies, which hurt the performance in applications such as compression. For example, some feature maps capture more LF components, and therefore do not need the same resolution as others with more HF components. This suggests that better R-D performance can be achieved by using different spatial redundancies in different feature maps, similar to the approach in the wavelet transform [23].

To address this problem, in the recently developed octave convolution [21], the feature maps are factorized into HF and LF components with different resolutions, where each component is processed with different convolutions. As a result, the resolution of the LF feature maps can be spatially reduced, which saves both memory and computation.

The architecture of the original octave convolution is illustrated in Figure 1. The factorization of the input vector X in octave convolutions is denoted by X = {X^H, X^L}, where X^H ∈ ℝ^{h×w×(1−α)c} and X^L ∈ ℝ^{(h/2)×(w/2)×αc} are respectively the HF and LF maps. The ratio of channels allocated to the LF feature representation (i.e., at half of the spatial resolution) is defined by α ∈ [0, 1]. The factorized output vector is denoted by Y = {Y^H, Y^L}, where Y^H ∈ ℝ^{h′×w′×(1−α)c′} and Y^L ∈ ℝ^{(h′/2)×(w′/2)×αc′} are the output HF and LF maps. The outputs are given by:

Y^H = Y^{H→H} + Y^{L→H},  Y^L = Y^{L→L} + Y^{H→L},   (2)

where Y^{H→H} and Y^{L→L} are intra-frequency updates, and Y^{H→L} and Y^{L→H} denote inter-frequency communication. The intra-frequency component is used to update the information within each part, while the inter-frequency communication further enables information exchange between the two parts. Similar to filter bank theory [24], the octave convolution allows information exchange between the HF and LF feature maps.

The octave convolution kernel is given by Φ = [Φ^H, Φ^L], with which the inputs X^H and X^L are respectively convolved. Φ^H and Φ^L are further divided into intra- and inter-frequency components as follows: Φ^H = [Φ^{H→H}, Φ^{L→H}] and Φ^L = [Φ^{L→L}, Φ^{H→L}].

For the intra-frequency update, the regular vanilla convolution is used, while up- and down-sampling interpolations are applied to compute the inter-frequency communication, formulated as:

Y^H = f(X^H; Φ^{H→H}) + upsample(f(X^L; Φ^{L→H}), 2),
Y^L = f(X^L; Φ^{L→L}) + f(downsample(X^H, 2); Φ^{H→L}),   (3)

where f denotes a vanilla convolution with parameters Φ. As reported in [21], due to the effective inter-frequency communication, the octave convolution can achieve better performance in classification and recognition tasks compared to the vanilla convolution. Since the octave convolution allows different resolutions in the HF and LF feature maps, it is very suitable for image compression. This motivates us to apply it to learned image compression. However, some modifications are necessary in order to obtain good performance.

Fig. 1: The architecture of the original octave convolution. X^H and X^L: input HF and LF feature maps; f: regular vanilla convolution; downsample: fixed down-sampling operation (e.g., maxpooling); upsample: fixed up-sampling operation (e.g., bilinear); Y^{H→H} and Y^{L→L}: intra-frequency updates; Y^{H→L} and Y^{L→H}: inter-frequency communications; Φ^{H→H} and Φ^{L→L}: intra-frequency convolution kernels; Φ^{H→L} and Φ^{L→H}: inter-frequency convolution kernels; Y^H and Y^L: output HF and LF feature maps.
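The original octave convolution of Eqs. (2)-(3) can be sketched in a few lines. The following is a minimal PyTorch sketch, not the authors' code: the class name `OctConv` is illustrative, and average pooling / nearest-neighbor interpolation are used for the fixed down- and up-sampling operations.

```python
# Minimal sketch of the original octave convolution (Eqs. 2-3), assuming PyTorch.
import torch
import torch.nn as nn
import torch.nn.functional as F

class OctConv(nn.Module):
    """Factorizes features into HF/LF branches; the LF branch is at half resolution."""
    def __init__(self, in_ch, out_ch, kernel_size=3, alpha=0.5):
        super().__init__()
        pad = kernel_size // 2
        in_lf, out_lf = int(alpha * in_ch), int(alpha * out_ch)
        in_hf, out_hf = in_ch - in_lf, out_ch - out_lf
        # intra-frequency kernels (Phi^{H->H}, Phi^{L->L})
        self.f_hh = nn.Conv2d(in_hf, out_hf, kernel_size, padding=pad)
        self.f_ll = nn.Conv2d(in_lf, out_lf, kernel_size, padding=pad)
        # inter-frequency kernels (Phi^{H->L}, Phi^{L->H})
        self.f_hl = nn.Conv2d(in_hf, out_lf, kernel_size, padding=pad)
        self.f_lh = nn.Conv2d(in_lf, out_hf, kernel_size, padding=pad)

    def forward(self, x_h, x_l):
        # Eq. (3): fixed down-/up-sampling for the inter-frequency communication
        y_h = self.f_hh(x_h) + F.interpolate(self.f_lh(x_l), scale_factor=2,
                                             mode="nearest")
        y_l = self.f_ll(x_l) + self.f_hl(F.avg_pool2d(x_h, 2))
        return y_h, y_l

oct_conv = OctConv(32, 64, alpha=0.5)
x_h = torch.randn(1, 16, 64, 64)   # HF: full resolution, (1-alpha)*c channels
x_l = torch.randn(1, 16, 32, 32)   # LF: half resolution, alpha*c channels
y_h, y_l = oct_conv(x_h, x_l)
print(y_h.shape, y_l.shape)        # HF stays at 64x64, LF at 32x32
```

Note how the fixed `avg_pool2d`/`interpolate` calls implement the inter-frequency communication; these are exactly the operations that the generalized convolutions of the next section replace with learned strided (transposed-)convolutions.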
3. GENERALIZED OCTAVE CONVOLUTION
In the original octave convolution, average pooling and nearest-neighbor interpolation are respectively employed for the down- and up-sampling operations in the inter-frequency communication [21]. Such conventional interpolations do not preserve the spatial information and structure of the input feature map. In addition, in convolutional auto-encoders, where sub-sampling needs to be reversed at the decoder side, fixed operations such as pooling result in poor performance [25].

In this work, we propose a novel generalized octave convolution (GoConv) in which strided convolutions are used to sub-sample the feature vectors and compute the inter-frequency communication in a more effective way. Fixed sub-sampling operations such as pooling are designed to discard spatial structure, for example, in object recognition, where we only care about the presence or absence of the object, not its position. However, if the spatial information is important, strided convolution can be a useful alternative. With learned filters, strided convolutions can learn to handle discontinuities from striding and preserve more of the spatial properties required in the down-sampling operation [25]. Moreover, since they can learn how to summarize, better generalization with respect to the input is achieved. As a result, better performance with less spatial information loss can be achieved, especially in auto-encoders, where it is easier to reverse strided convolutions. Moreover, as in ResNet, applying strided convolution (i.e., convolution and down-sampling at the same time) reduces the computational cost compared to a convolution followed by a fixed down-sampling operation (e.g., average pooling).

The architecture of the proposed GoConv is shown in Figure 2a. Compared to the original octave convolution (Figure 1), we apply another important modification regarding the inputs to the inter-frequency convolution operations.
As summarized in Section 2, to calculate the inter-frequency communication outputs (denoted by Y^{H→L} and Y^{L→H}), the input HF and LF vectors (denoted by X^H and X^L) are considered as the inputs to the HF-to-LF and LF-to-HF convolutions, respectively (f↓ and g↑ in Figure 2a). This strategy is only efficient for a stride of 1 (i.e., when the input and output HF and LF vectors have the same size). In GoConv, however, this procedure can result in significant information loss for larger strides. As an example, consider using stride 2, which results in down-sampled output HF and LF feature maps (at half the resolution of the input HF and LF maps). To achieve this, stride 2 is required for the intra-frequency convolution (f in Figure 2a). However, for the inter-frequency convolution f↓, a harsh stride of 4 would be needed, which results in significant spatial information loss.

Fig. 2: Architecture of the proposed generalized octave convolution (GoConv), shown in the left figure (a), and transposed-convolution (GoTConv), shown in the right figure (b). In (a): X^H and X^L: input HF and LF feature maps; Y^{H→H} and Y^{L→L}: intra-frequency updates; Y^{H→L} and Y^{L→H}: inter-frequency communications; Φ^{H→H} and Φ^{L→L}: intra-frequency convolution kernels; Φ^{H→L} and Φ^{L→H}: inter-frequency convolution kernels; Y^H and Y^L: output HF and LF feature maps. In (b): Ỹ^H and Ỹ^L: input HF and LF feature maps; X̃^{H→H} and X̃^{L→L}: intra-frequency updates; X̃^{H→L} and X̃^{L→H}: inter-frequency communications; Ψ^{H→H} and Ψ^{L→L}: intra-frequency transposed-convolution kernels; Ψ^{H→L} and Ψ^{L→H}: inter-frequency convolution kernels; X̃^H and X̃^L: output HF and LF feature maps. Act: the activation layer; f: regular vanilla convolution; g: regular transposed-convolution; f↓: regular convolution with stride 2; g↑: regular transposed-convolution with stride 2.

In order to deal with this problem, we instead use two consecutive convolutions with stride 2, where the first convolution is in fact the intra-frequency operation f. In other words, to compute Y^{H→L}, we exploit the filters learned by f to reduce information loss. Thus, instead of X^H and X^L, we set Y^{H→H} and Y^{L→L} as the inputs to f↓ and g↑. The output HF and LF feature maps in GoConv are formulated as follows:

Y^H = Y^{H→H} + g↑(Y^{L→L}; Φ^{L→H}),
Y^L = Y^{L→L} + f↓(Y^{H→H}; Φ^{H→L}),  with
Y^{H→H} = f(X^H; Φ^{H→H}),
Y^{L→L} = f(X^L; Φ^{L→L}),   (4)

where f↓ and g↑ are respectively vanilla convolution and transposed-convolution operations with a stride of 2.

In the original octave convolution, activation layers (e.g., ReLU) are applied to the output HF and LF maps. In contrast, as shown in Figure 2, we utilize activations for each internal convolution performed in our proposed GoConv. In this way, we ensure that activation functions are properly applied to each feature map computed by the convolution operations.
Each of the inter- and intra-frequency components is then followed by an activation layer in GoConv.

We also propose a generalized octave transposed-convolution, denoted by GoTConv (Figure 2), which can replace the conventional transposed-convolution commonly employed in deep auto-encoder (encoder-decoder) architectures. Let Ỹ = {Ỹ^H, Ỹ^L} and X̃ = {X̃^H, X̃^L} respectively be the factorized input and output feature vectors. The output HF and LF maps X̃^H and X̃^L in GoTConv are obtained as follows:

X̃^H = X̃^{H→H} + g↑(X̃^{L→L}; Ψ^{L→H}),
X̃^L = X̃^{L→L} + f↓(X̃^{H→H}; Ψ^{H→L}),  with
X̃^{H→H} = g(Ỹ^H; Ψ^{H→H}),
X̃^{L→L} = g(Ỹ^L; Ψ^{L→L}),   (5)

where Ỹ^H, X̃^H ∈ ℝ^{h×w×(1−α)c} and Ỹ^L, X̃^L ∈ ℝ^{(h/2)×(w/2)×αc}. Unlike GoConv, in which the regular convolution operation is used, the transposed-convolution denoted by g is applied for the intra-frequency update in GoTConv. For the up- and down-sampling operations in the inter-frequency communication, the same strided convolutions g↑ and f↓ as in GoConv are respectively utilized.

Similar to the original octave convolution, the proposed GoConv and GoTConv are designed and formulated as generic, plug-and-play units. As a result, they can respectively replace the vanilla convolution and transposed-convolution units in any convolutional neural network (CNN) architecture, especially auto-encoder-based frameworks such as image compression, image denoising, and semantic segmentation. When used in an auto-encoder, the input image to the encoder is not represented as a multi-frequency tensor. In this case, to compute the output of the first GoConv layer in the encoder, Equation (4) is modified as follows:

Y^H = f(X; Φ^{H→H}),  Y^L = f↓(Y^H; Φ^{H→L}).   (6)

Similarly, at the decoder side, the output of the last GoTConv is a single tensor representation, which can be formulated by modifying Equation (5) as:

X̃ = X̃^{H→H} + g↑(X̃^{L→L}; Ψ^{L→H}),  with
X̃^{H→H} = g(Ỹ^H; Ψ^{H→H}),
X̃^{L→L} = g(Ỹ^L; Ψ^{L→L}).   (7)

Compared to GoConv, the order of using activations for each internal transposed-convolution in GoTConv is inverted: the activation layer is followed by the inter- and intra-frequency communications, as shown in Figure 2.

Fig. 3: Overall framework of the proposed image compression model. H-AE and H-AD: arithmetic encoder and decoder for the HF latents. L-AE and L-AD: arithmetic encoder and decoder for the LF latents. H-CM and L-CM: the HF and LF context models, each composed of one 5×5 masked convolution layer with 2M filters and a stride of 1. Q: represents additive uniform noise for training, or the uniform quantizer for testing.
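The GoConv unit of Eq. (4) can be sketched as follows. This is a minimal PyTorch sketch under stated assumptions: the class name `GoConv` mirrors the paper's terminology but is not the authors' code, and plain ReLU stands in for the internal activations (the paper uses GDN/IGDN in the core codec and Leaky ReLU elsewhere). GoTConv (Eq. 5) mirrors this structure with transposed-convolutions for the intra-frequency updates.

```python
# Sketch of GoConv (Eq. 4) with internal activations, assuming PyTorch.
import torch
import torch.nn as nn

class GoConv(nn.Module):
    def __init__(self, in_ch, out_ch, alpha=0.5, stride=2):
        super().__init__()
        in_lf, out_lf = int(alpha * in_ch), int(alpha * out_ch)
        in_hf, out_hf = in_ch - in_lf, out_ch - out_lf
        # intra-frequency updates f (Phi^{H->H}, Phi^{L->L}), each with an activation
        self.f_hh = nn.Sequential(nn.Conv2d(in_hf, out_hf, 3, stride, 1), nn.ReLU())
        self.f_ll = nn.Sequential(nn.Conv2d(in_lf, out_lf, 3, stride, 1), nn.ReLU())
        # inter-frequency paths: strided conv f-down and transposed conv g-up
        # replace the fixed pooling/interpolation of the original octave convolution
        self.f_down = nn.Sequential(nn.Conv2d(out_hf, out_lf, 3, 2, 1), nn.ReLU())
        self.g_up = nn.Sequential(
            nn.ConvTranspose2d(out_lf, out_hf, 3, 2, 1, output_padding=1), nn.ReLU())

    def forward(self, x_h, x_l):
        y_hh = self.f_hh(x_h)            # Y^{H->H}
        y_ll = self.f_ll(x_l)            # Y^{L->L}
        # Eq. (4): the inter-frequency paths take the intra-frequency OUTPUTS
        # as inputs, avoiding a harsh stride-4 convolution on X^H
        y_h = y_hh + self.g_up(y_ll)
        y_l = y_ll + self.f_down(y_hh)
        return y_h, y_l

go = GoConv(32, 64, alpha=0.5, stride=2)
y_h, y_l = go(torch.randn(1, 16, 64, 64), torch.randn(1, 16, 32, 32))
print(y_h.shape, y_l.shape)  # both branches are halved: 32x32 (HF) and 16x16 (LF)
```

Feeding Y^{H→H} and Y^{L→L} (rather than X^H and X^L) into the inter-frequency paths is the key difference from the original octave convolution: each path then only ever needs a stride of 2.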
4. MULTI-FREQUENCY IMAGE CODING AND ENTROPY MODEL
Octave convolution is similar to the wavelet transform [22], since it uses a lower resolution for LF than for HF. Therefore, it can be used to improve the R-D performance in learning-based image compression frameworks. Moreover, due to the effective inter-frequency communication as well as the receptive field enlargement in octave convolutions, they also improve the performance of the analysis (encoding) and synthesis (decoding) transforms in a compression framework.

The overall architecture of the proposed multi-frequency image compression framework is shown in Figure 3. Similar to [7], our architecture is composed of two sub-networks: the core auto-encoder and the entropy sub-network. The core auto-encoder is used to learn a quantized latent vector of the input image, while the entropy sub-network is responsible for learning a probabilistic model over the quantized latent representations, which is utilized for entropy coding.

In order to handle multi-frequency entropy coding, we have made several improvements to the scheme in [7]. First, all vanilla convolutions in the core encoder and hyper encoder are replaced by the proposed GoConv, and all vanilla transposed-convolutions in the core and hyper decoders are replaced by GoTConv. In [7], each convolution/transposed-convolution is accompanied by an activation layer (e.g., GDN/IGDN or Leaky ReLU). In our architecture, we move these layers into the GoConv and GoTConv architectures and directly apply them to the inter- and intra-frequency components, as described in Section 3. GDN/IGDN transforms are respectively used for the GoConv and GoTConv units employed in the proposed deep encoder and decoder, while Leaky ReLU is utilized for the hyper auto-encoder and the parameter estimator.
The convolution properties (i.e., the size and number of filters and the strides) of all networks, including the core and hyper auto-encoders, the context models, and the parameter estimator, are the same as in [7].

Let x ∈ ℝ^{h×w×3} be the input image. The multi-frequency latent representations are denoted by {y^H, y^L}, where y^H ∈ ℝ^{(h/16)×(w/16)×(1−α)M} and y^L ∈ ℝ^{(h/32)×(w/32)×αM} are generated using the parametric deep encoder (i.e., analysis transform) g_e, represented as:

{y^H, y^L} = g_e(x; θ_ge),   (8)

where θ_ge is the parameter vector to be optimized. M denotes the total number of output channels in g_e, which is divided into (1−α)M channels for HF and αM channels for LF (i.e., at half the spatial resolution of the HF part). The calculation in Equation (6) is used for the first GoConv layer, while the other encoder layers are formulated using Equation (4).

At the decoder side, the parametric decoder (i.e., synthesis transform) g_d with the parameter vector θ_gd reconstructs the image x̃ ∈ ℝ^{h×w×3} as follows:

x̃ = g_d({ỹ^H, ỹ^L}; θ_gd)  with  {ỹ^H, ỹ^L} = Q({y^H, y^L}),   (9)

where Q represents the addition of uniform noise to the latent representations during training, or uniform quantization (i.e., the round function in this work) and arithmetic coding/decoding of the latents during testing. As illustrated in Figure 3, the quantized HF and LF latents ỹ^H and ỹ^L are entropy-coded using two separate arithmetic encoders and decoders.

The entropy sub-network in our architecture contains two models: a context model and a hyper auto-encoder [7]. The context model is an autoregressive model over the multi-frequency latent representations.

Fig. 4: Sample HF and LF latent representations. Left column: original image; middle columns: HF; right column: LF.
Unlike the other networks in our architecture, where GoConv units are incorporated for the convolutions, we use vanilla convolutions in the context model to ensure that the causality of the contexts is not spoiled by the intra-frequency communication in GoConv. The contexts of the HF and LF latents, denoted by φ_i^H and φ_i^L, are then predicted with two separate models f_cm^H and f_cm^L, defined as follows:

φ_i^H = f_cm^H(ỹ_{<i}^H; θ_cm^H),  φ_i^L = f_cm^L(ỹ_{<i}^L; θ_cm^L),

where ỹ_{<i}^H and ỹ_{<i}^L denote the previously decoded HF and LF latents.
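The causal (masked) convolution used for such a context model can be sketched as below. This is an illustrative PyTorch sketch, not the authors' implementation: the 5×5 kernel and 2M output filters follow the Figure 3 caption, while the PixelCNN-style mask construction is a standard assumption for enforcing autoregressive causality.

```python
# Sketch of a causal 5x5 masked convolution for the autoregressive context model,
# assuming PyTorch. The mask zeros the current position and all "future" positions
# (raster-scan order), so each latent is predicted only from already-decoded ones.
import torch
import torch.nn as nn

class MaskedConv2d(nn.Conv2d):
    def __init__(self, in_ch, out_ch, kernel_size=5):
        super().__init__(in_ch, out_ch, kernel_size, padding=kernel_size // 2)
        k = kernel_size
        mask = torch.ones_like(self.weight)
        mask[:, :, k // 2, k // 2:] = 0   # center position and everything to its right
        mask[:, :, k // 2 + 1:, :] = 0    # all rows below the center
        self.register_buffer("mask", mask)

    def forward(self, x):
        self.weight.data *= self.mask     # re-apply the causal mask before each call
        return super().forward(x)

M = 192                                   # number of latent channels, as in Section 5
ctx_h = MaskedConv2d(M, 2 * M)            # HF context model: 2M output filters
phi_h = ctx_h(torch.randn(1, M, 16, 16))  # stride 1 keeps the spatial size
print(phi_h.shape)                        # (1, 384, 16, 16)
```

Keeping two independent masked convolutions (one per frequency branch, as in the text) preserves causality within each branch without any intra-frequency mixing.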
5. EXPERIMENTAL RESULTS: IMAGE COMPRESSION
The CLIC training set, with images of at least 256 pixels in height or width (1732 images in total), was used for training the proposed model. Random crops of size h = w = 256 were extracted from the images for training. We set α = 0.5, so that 50% of the latent representations are assigned to the LF part at half the spatial resolution. Sample HF and LF latent representations are shown in Figure 4.

Considering the four layers of strided convolutions (with a stride of 2) and the output channel size M = 192 in the core encoder (Figure 3), the HF and LF latents y^H and y^L will respectively be of size 16 × 16 × 96 and 8 × 8 × 96 for training. As discussed in [18], the optimal number of filters (i.e., N) increases with the R-D balance factor λ, which indicates that higher network capacity is required for models with higher bit rates. As a result, in order to avoid λ-dependent performance saturation and to boost the network capacity, we set M = N = 256 for higher bit rates (BPPs > 0.5). All models in our framework were jointly trained for 200 epochs with mini-batch stochastic gradient descent and a batch size of 8. The Adam solver with a learning rate of 0.00005 was used for the first 100 epochs, and the rate was gradually decreased to zero over the next 100 epochs.

We compare the performance of the proposed scheme with standard codecs, including JPEG, JPEG2000 [23], WebP [28], BPG (both YUV4:2:0 and YUV4:4:4 formats) [13], and the next-generation video coding standard VVC Test Model, VTM 5.2 (both YUV4:2:0 and YUV4:4:4 formats) [29], as well as the state-of-the-art learned image compression methods in [7, 6, 8, 19, 20]. We use both PSNR and MS-SSIM_dB as the evaluation metrics, where MS-SSIM_dB represents MS-SSIM scores in dB, defined as: MS-SSIM_dB = −10 log10(1 − MS-SSIM).

The comparison results on the popular Kodak image set (averaged over the 24 test images) are shown in Figure 5. For the PSNR results, we optimized the model for the MSE loss as the distortion metric d in Equation 15, while the perceptual MS-SSIM metric was used for the MS-SSIM results reported in Figure 5. In order to obtain the seven different bit rates on the R-D curve illustrated in Figure 5, seven models with seven different values of λ were trained.

Fig. 6: Kodak visual example (bits-per-pixel, PSNR, MS-SSIM_dB). (a) Original; (b) Ours (0.149bpp, 35.30dB, 15.25dB); (c) BPG (0.151bpp, 33.96dB, 14.44dB); (d) J2K (0.151bpp, 32.59dB, 13.43dB).

As shown in Figure 5, our method outperforms standard codecs such as BPG and VTM (4:2:0) as well as the state-of-the-art learning-based image compression methods in terms of both PSNR and MS-SSIM. Our method achieves ≈ ≈ ≈

In order to evaluate the performance of the different components of the proposed framework, ablation studies were performed, which are reported in Table 1. The results are averaged over the Kodak image set. All the models reported in this ablation study were optimized for the MSE distortion metric (for one single bit rate). However, the results were evaluated with both the PSNR and MS-SSIM metrics.
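The MS-SSIM_dB values reported in the comparisons follow the logarithmic conversion defined above; a minimal helper (assuming a scalar MS-SSIM score in [0, 1)) makes the mapping concrete:

```python
# MS-SSIM in dB: -10 * log10(1 - MS-SSIM), as defined in the evaluation setup.
import math

def ms_ssim_db(ms_ssim: float) -> float:
    """Convert an MS-SSIM score in [0, 1) to decibels."""
    return -10.0 * math.log10(1.0 - ms_ssim)

print(round(ms_ssim_db(0.99), 2))  # 20.0
```

The dB scale spreads out the near-1.0 scores where MS-SSIM saturates, which is why R-D curves at high quality are easier to compare in MS-SSIM_dB.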
Ratio of HF and LF : in order to study varying choicesof the ratio of channels allocated to the LF feature repre-sentations, we evaluated our model with three differentratios α ∈ { . , . , . } . As summarized in Table1, compressing 50% of the LF part to half the resolu-tion (i.e., α = 0 . ) results in the best R-D performancein both PSNR and MS-SSIM at 0.345bpp (where thecontributions of HF and LF latents are 0.276bpp and0.069bpp). As the ratio decreases to α = 0 . , less com- pression with a higher bit rate of 0.445bpp (0.410bpp forHF and 0.035 for LF) is obtained, while no significantgain in the reconstruction quality is achieved. Althoughincreasing the ratio to 75% provides a better compres-sion with 0.309bpp (high: 0.132bpp, low: 0.176bpp),it significantly results in a lower PSNR. As indicatedby the number of floating point operations per second(FLOPs) in the table, larger ratio results in a faster modelsince less operations are required for calculating the LFmaps with half spatial resolution.• Position of activation layers : in this scenario (denotedby ActOut), we cancel the internal activations (i.e.,GDN/IGDN) employed in our proposed GoConv andGoTConv. Instead, as in the original octave convolution[21], we apply GDN to the output HF and LF maps inGoConv, and IGDN before the input HF and LF mapsfor GoTConv. This experiment is denoted by ActOutin Table 1. As the comparison results indicate, the pro-posed architecture with internal activations ( α = 0.5)provides a better performance (with ≈ Octave only for core auto-encoder : as described inSection 4, the proposed multi-frequency entropy modelutilizes GoConv and GoTConv units for both latentsand hyper latents. In order to evaluate the effectivenessof multi-frequency modelling of hyper latents, we alsoreport the results in which GoConv and GoTConv areonly used for the core auto-encoder latents (denoted byCoreOct in Table 1). 
To deal with the HF and LF latentsresulted from the multi-frequency core auto-encoder, weused two separate networks (similar to [7] with Vanillaconvolutions) for each of the hyper encoder, and hyperdecoder. As summarized in the table, a PSNR gain of ≈ able 1 : Ablation study of different components in the proposed framework. BPP : bits-per-pixel (HF/LF: BPPs for HF and LFlatents).
ActOut : activation layers moved out of GoConv/GoTConv;
CoreOct : proposed GoConv/GoTConv only used for thecore auto-encoder;
OrgOct : GoConv/GoTConv replaced by original octave convolutions. α = 0.25 α = 0.5 α = 0.75 ActOut CoreOct OrgOctBPP(HF / LF) PSNR (dB)
MS-SSIM (dB)
FLOPs (G)
Fig. 7: The architecture of the original octave transposed-convolution. X̃^H and X̃^L: output HF and LF feature maps; Ỹ^H and Ỹ^L: input HF and LF feature maps; g: regular vanilla transposed-convolution; upsample: fixed up-sampling operation (e.g., bilinear); downsample: fixed down-sampling operation (e.g., maxpooling); X̃^{H→H} and X̃^{L→L}: intra-frequency updates; X̃^{H→L} and X̃^{L→H}: inter-frequency communications; Ψ^{H→H} and Ψ^{L→L}: intra-frequency transposed-convolution kernels; Ψ^{H→L} and Ψ^{L→H}: inter-frequency transposed-convolution kernels.

• Original octave convolutions: in this experiment, the performance of the proposed GoConv and GoTConv architectures is compared with the original octave convolutions (denoted by OrgOct in Table 1). We replace all GoConv layers in the proposed framework (Figure 3) with original octave convolutions (Figure 1). For the octave transposed-convolution used in the core and hyper decoders, we reverse the octave convolution operation, formulated as follows:

X̃^H = g(Ỹ^H; Ψ^{H→H}) + upsample(g(Ỹ^L; Ψ^{L→H}), 2),
X̃^L = g(Ỹ^L; Ψ^{L→L}) + g(downsample(Ỹ^H, 2); Ψ^{H→L}),   (19)

where {Ỹ^H, Ỹ^L} and {X̃^H, X̃^L} are the input and output feature maps, and g is the vanilla transposed-convolution. The architecture of the original octave transposed-convolution is illustrated in Figure 7. Similar to the octave convolution defined in [21], average pooling and nearest interpolation are respectively used for the down- and up-sampling operations.

As reported in Table 1, OrgOct provides significantly lower performance than the architecture with the proposed GoConv/GoTConv, which is due to the fixed sub-sampling operations incorporated for its inter-frequency components. The PSNR and MS-SSIM of the proposed architecture are both higher than those of OrgOct. α = 0. was used for the ActOut, CoreOct, and OrgOct models.

As explained in Section 3, since the proposed GoConv and GoTConv are designed as generic, plug-and-play units, they can be used in any auto-encoder-based architecture. In the following sections, the usage of GoConv and GoTConv in image semantic segmentation and image denoising is analyzed.
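The reversed octave operation of Eq. (19) can be sketched in PyTorch as follows. This is a minimal illustration, not the authors' released code: the module name, the channel split by the LF ratio α, and the stride-1/same-padding choice are all assumptions; the fixed average-pooling/nearest-interpolation resampling follows the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OrgOctTConv(nn.Module):
    """Original octave transposed-convolution of Eq. (19): the octave
    convolution of [21] run in reverse, using vanilla transposed-
    convolutions g(.) and *fixed* resampling (average pooling down,
    nearest interpolation up). Sizes here are illustrative."""

    def __init__(self, in_ch, out_ch, alpha=0.5, k=3):
        super().__init__()
        in_l, out_l = int(alpha * in_ch), int(alpha * out_ch)
        in_h, out_h = in_ch - in_l, out_ch - out_l
        p = k // 2  # same-size padding for the stride-1 layers
        self.hh = nn.ConvTranspose2d(in_h, out_h, k, 1, p)  # Psi^{H->H}
        self.ll = nn.ConvTranspose2d(in_l, out_l, k, 1, p)  # Psi^{L->L}
        self.lh = nn.ConvTranspose2d(in_l, out_h, k, 1, p)  # Psi^{L->H}
        self.hl = nn.ConvTranspose2d(in_h, out_l, k, 1, p)  # Psi^{H->L}

    def forward(self, y_h, y_l):
        # X^H = g(Y^H; Psi^{H->H}) + upsample(g(Y^L; Psi^{L->H}), 2)
        x_h = self.hh(y_h) + F.interpolate(self.lh(y_l),
                                           scale_factor=2, mode="nearest")
        # X^L = g(Y^L; Psi^{L->L}) + g(downsample(Y^H, 2); Psi^{H->L})
        x_l = self.ll(y_l) + self.hl(F.avg_pool2d(y_h, 2))
        return x_h, x_l
```

The fixed, non-learned resampling in the two inter-frequency paths is exactly what the proposed GoConv/GoTConv replace with strided (transposed-)convolutions.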
6. MULTI-FREQUENCY SEMANTIC SEGMENTATION
In this experiment, we evaluate the proposed GoConv/GoTConv units on image semantic segmentation. The popular UNet model [30] is used as the baseline. UNet is a CNN-based architecture originally developed for segmenting biomedical images, but it is also considered a baseline for general image semantic segmentation. UNet has an auto-encoder scheme composed of two paths: the contraction path (i.e., the encoder) and the expanding path (i.e., the decoder). The encoder captures the context in the image, while the decoder enables precise localization. The architecture of the UNet auto-encoder is summarized in Table 2. All the vanilla convolution layers used in the encoder and decoder are followed by batch normalization and ReLU layers.

Table 2: UNet architecture (Conv: vanilla convolution; T-Conv: vanilla transposed-convolution).

Encoder                 Decoder
Conv (3*3, 64, s1)      T-Conv (3*3, 512, s2)
Maxpool (2*2, s2)       Conv (3*3, 512, s1)
Conv (3*3, 128, s1)     T-Conv (3*3, 256, s1)
Maxpool (2*2, s2)       Conv (3*3, 256, s1)
Conv (3*3, 256, s1)     T-Conv (3*3, 128, s1)
Maxpool (2*2, s2)       Conv (3*3, 128, s1)
Conv (3*3, 512, s1)     T-Conv (3*3, 64, s2)
Maxpool (2*2, s2)       Conv (3*3, 64, s1)
Conv (3*3, 1024, s1)    Conv (3*3, 19, s1)

In this study, we build a multi-frequency UNet model, denoted by GoConv-UNet, by replacing all the UNet convolution layers with GoConv and all transposed-convolutions with GoTConv. The other properties, such as the number of layers, filters, and strides, are kept the same as in the original UNet. In order to compare the performance of the original octave convolutions with our proposed GoConv/GoTConv, we build another model, denoted by OrgOct-UNet, in which the original octave convolution units (shown in Figures 1 and 7) are employed.

Fig. 8: Visual examples on Cityscapes: (a, e) input image; (b, f) ground truth; (c) GoConv-UNet (Acc: 0.86, mIoU: 0.37); (d) UNet (Acc: 0.84, mIoU: 0.31); (g) GoConv-UNet (Acc: 0.89, mIoU: 0.38); (h) UNet (Acc: 0.86, mIoU: 0.34). GoConv-UNet: GoConv/GoTConv-based multi-frequency UNet with α = 0.; UNet: original UNet architecture; Acc: pixel accuracy; mIoU: mean IoU.

All the experiments in this section are performed on the Cityscapes dataset [31], which contains 2974 training images with 19 semantic labels. All models were trained with the cross-entropy loss function for 100 epochs. The models were then evaluated on the Cityscapes validation set (500 images) based on pixel accuracy, mean intersection-over-union (IoU), number of parameters, and number of FLOPs. The mean IoU (mIoU) is calculated by averaging the IoU values over all semantic classes.

Table 3 presents the comparison results of the original UNet and the proposed multi-frequency OrgOct-UNet and GoConv-UNet models with different α's. All the reported values are averages over the Cityscapes validation set. As given in the table, the GoConv-UNet model with α = 0. achieves the highest accuracy and mIoU. Compared to the original UNet model, GoConv-UNet (α = 0.) also requires fewer FLOPs. Although higher α's can result in even lower FLOPs, they have a negative impact on the model performance. OrgOct-UNet has the lowest number of parameters and FLOPs compared to the other models. However, it achieves the lowest performance, with no improvement over UNet, which is basically due to the fixed sub-sampling operations used in the original octave convolutions.

The visual comparison of UNet and GoConv-UNet is given in Figure 8. As shown in the results, higher-quality semantic segmentations are obtained with GoConv-UNet. For example, the pedestrian with a backpack in the top example, and the trees, the board, and the wall on the right side of the bottom example, are more accurately segmented with GoConv-UNet.
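The contraction path of Table 2 can be made concrete with a minimal PyTorch sketch. This is an illustration, not the authors' code: the helper name and the 256×256 input size are assumptions, and GoConv-UNet would be obtained by swapping each Conv2d for a GoConv unit (and, in the decoder, each ConvTranspose2d for a GoTConv unit).

```python
import torch
import torch.nn as nn

def conv_bn_relu(cin, cout, stride=1):
    # Each vanilla convolution in Table 2 is followed by BatchNorm and ReLU.
    return nn.Sequential(nn.Conv2d(cin, cout, 3, stride, 1),
                         nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

# UNet encoder (contraction path) exactly as listed in Table 2:
# Conv/Maxpool pairs halving the resolution four times.
encoder = nn.Sequential(
    conv_bn_relu(3, 64),     nn.MaxPool2d(2, 2),
    conv_bn_relu(64, 128),   nn.MaxPool2d(2, 2),
    conv_bn_relu(128, 256),  nn.MaxPool2d(2, 2),
    conv_bn_relu(256, 512),  nn.MaxPool2d(2, 2),
    conv_bn_relu(512, 1024),
)

# A 256x256 RGB image is reduced to a 1024-channel 16x16 bottleneck.
x = torch.randn(1, 3, 256, 256)
features = encoder(x)
```

In the multi-frequency variant, each feature map would additionally be split into an HF tensor and an LF tensor at half resolution according to the ratio α, which is where the FLOPs savings reported in Table 3 come from.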
7. MULTI-FREQUENCY IMAGE DENOISING
In this experiment, we build a simple convolutional auto-encoder and use it for the image denoising problem [32, 33]. In this problem, we try to denoise images corrupted by white Gaussian noise, which is the common result of many acquisition channels. The architecture of the auto-encoder used in this experiment is summarized in Table 4, where the encoder and decoder are respectively composed of a sequence of vanilla convolutions and transposed-convolutions, each followed by batch normalization and ReLU.

Table 3: Comparison results for image semantic segmentation on the Cityscapes database (UNet: original UNet architecture; GoConv-UNet: multi-frequency UNet with GoConv/GoTConv; OrgOct-UNet: multi-frequency UNet with original octave units; Ratio (α): the LF ratio used in octave convolutions; mIoU: mean intersection-over-union; Params (M): the number of model parameters, in millions; FLOPs (G): the number of FLOPs, in billions).

Table 4: Baseline convolutional auto-encoder for image denoising (Conv: vanilla convolution; T-Conv: vanilla transposed-convolution).

Encoder                 Decoder
Conv (3*3, 32, s1)      T-Conv (3*3, 128, s2)
Conv (3*3, 32, s1)      T-Conv (3*3, 128, s1)
Conv (3*3, 64, s1)      T-Conv (3*3, 64, s1)
Conv (3*3, 64, s2)      T-Conv (3*3, 64, s2)
Conv (3*3, 128, s1)     T-Conv (3*3, 32, s1)
Conv (3*3, 128, s1)     T-Conv (3*3, 32, s1)
Conv (3*3, 256, s2)     T-Conv (3*3, 3, s1)

We performed our experiments on the MNIST and CIFAR10 datasets. After 100 epochs of training, average PSNRs of 23.19dB and 23.29dB are achieved on the MNIST and CIFAR10 test sets, respectively. In order to analyze the performance of GoConv and GoTConv in this experiment, we replaced all the vanilla convolution layers with GoConv and all transposed-convolutions with GoTConv. The other properties of the encoder and decoder networks (e.g., number of layers, filters, and strides) are the same as the baseline in Table 4. We set α = 0. and trained the model for 100 epochs.

For the MNIST dataset, the multi-frequency auto-encoder achieved an average PSNR of 23.20dB (almost the same as the baseline with vanilla convolutions). However, for CIFAR10, we achieved an average PSNR of 23.54dB, which is 0.25dB higher than the baseline, due to the effective communication between HF and LF. In addition, the proposed multi-frequency auto-encoder has fewer parameters and FLOPs than the baseline model, which indicates the benefit of octave convolutions in reducing parameters and operations with no performance loss. The comparison results are presented in Table 5. In Figure 10, eight visual examples from the CIFAR10 test set are given. Compared to the baseline model with vanilla convolutions and transposed-convolutions, the multi-frequency model with the proposed GoConv/GoTConv results in higher visual quality in the denoised images (e.g., the red car in the second column from the right).
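The baseline auto-encoder of Table 4 can be sketched in PyTorch as follows. This is a minimal illustration under stated assumptions: the helper names are hypothetical, `output_padding=1` is assumed on the stride-2 transposed-convolutions so that they exactly double the resolution, and the final 3-channel layer is left without BatchNorm/ReLU so the output can take negative values.

```python
import torch
import torch.nn as nn

def conv(cin, cout, s=1):
    # Conv (3*3) -> BatchNorm -> ReLU, as in the encoder column of Table 4.
    return nn.Sequential(nn.Conv2d(cin, cout, 3, s, 1),
                         nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

def tconv(cin, cout, s=1):
    # T-Conv (3*3) -> BatchNorm -> ReLU; output_padding makes stride-2
    # layers exactly invert a stride-2 convolution's downsampling.
    return nn.Sequential(nn.ConvTranspose2d(cin, cout, 3, s, 1,
                                            output_padding=s - 1),
                         nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

# Encoder and decoder columns of Table 4 (3-channel input, e.g. CIFAR10).
encoder = nn.Sequential(conv(3, 32), conv(32, 32), conv(32, 64),
                        conv(64, 64, 2), conv(64, 128), conv(128, 128),
                        conv(128, 256, 2))
decoder = nn.Sequential(tconv(256, 128, 2), tconv(128, 128), tconv(128, 64),
                        tconv(64, 64, 2), tconv(64, 32), tconv(32, 32),
                        nn.ConvTranspose2d(32, 3, 3, 1, 1))

# Denoising usage: corrupt with white Gaussian noise, then reconstruct.
clean = torch.randn(2, 3, 32, 32)
noisy = clean + 0.1 * torch.randn_like(clean)
denoised = decoder(encoder(noisy))
```

The multi-frequency variant of the experiment keeps this layer layout and only swaps `nn.Conv2d` for GoConv and `nn.ConvTranspose2d` for GoTConv.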
8. CONCLUSION
In this paper, we propose a new learned multi-frequency image compression and entropy model with octave convolutions, in which the latents are factorized into HF and LF components, and the LF is stored at a lower resolution to reduce the spatial redundancy. To preserve the spatial structure of the input, novel generalized octave convolution and transposed-convolution architectures, denoted by GoConv and GoTConv, are introduced. Our experiments show that the proposed method significantly improves the R-D performance and achieves a new state of the art in learned image compression, which even outperforms VTM (4:2:0) in PSNR. Further improvements can be achieved by multi-frequency factorization of the latents into a sequence of high to low frequencies, as in the wavelet transform.

Table 5: Comparison results of the baseline and multi-frequency auto-encoders for image denoising on the MNIST and CIFAR10 test sets.

            Baseline            Multi-frequency
            MNIST    CIFAR10    MNIST    CIFAR10
PSNR (dB)   23.19    23.29      23.20    23.54

Fig. 9: Sample image denoising results from the MNIST test set: (a) input images; (b) input noisy images; (c) baseline denoised results; (d) multi-frequency denoised results.

Fig. 10: Sample image denoising results from the CIFAR10 test set: (a) input images; (b) input noisy images; (c) baseline denoised results; (d) multi-frequency denoised results.

Our method bridges the wavelet transform and deep learning-based image compression, and allows many techniques from wavelet transform research to be applied to learned image compression. This will lead to many other research topics in the future. We also show the benefit of the proposed GoConv/GoTConv in other CNN-based computer vision applications, particularly auto-encoder-based schemes such as image denoising and semantic segmentation.
Acknowledgements
This work is supported by the Natural Sciences and Engineering Research Council (NSERC) of Canada under grant RGPIN-2015-06522.
9. REFERENCES

[1] O. Rippel and L. Bourdev, "Real-time adaptive image compression," in International Conference on Machine Learning, vol. 70, 2017, pp. 2922–2930.
[2] E. Agustsson, F. Mentzer, M. Tschannen, L. Cavigelli, R. Timofte, L. Benini, and L. V. Gool, "Soft-to-hard vector quantization for end-to-end learning compressible representations," arXiv preprint arXiv:1704.00648, 2017.
[3] M. Akbari, J. Liang, and J. Han, "DSSLIC: Deep semantic segmentation-based layered image compression," in IEEE International Conference on Acoustics, Speech and Signal Processing, 2019, pp. 2042–2046.
[4] J. Ballé, V. Laparra, and E. P. Simoncelli, "End-to-end optimization of nonlinear transform codes for perceptual quality," in Picture Coding Symposium, 2016, pp. 1–5.
[5] N. Johnston, D. Vincent, D. Minnen, M. Covell, S. Singh, T. Chinen, S. J. Hwang, J. Shor, and G. Toderici, "Improved lossy image compression with priming and spatially adaptive bit rates for recurrent networks," arXiv preprint arXiv:1703.10114, 2017.
[6] M. Li, W. Zuo, S. Gu, J. You, and D. Zhang, "Learning content-weighted deep image compression," arXiv preprint arXiv:1904.00664, 2019.
[7] D. Minnen, J. Ballé, and G. D. Toderici, "Joint autoregressive and hierarchical priors for learned image compression," in Advances in Neural Information Processing Systems, 2018, pp. 10771–10780.
[8] J. Lee, S. Cho, and S. Beack, "Context-adaptive entropy model for end-to-end optimized image compression," arXiv preprint arXiv:1809.10452, 2018.
[9] L. Theis, W. Shi, A. Cunningham, and F. Huszár, "Lossy image compression with compressive autoencoders," arXiv preprint arXiv:1703.00395, 2017.
[10] G. Toderici, S. M. O'Malley, S. J. Hwang, D. Vincent, D. Minnen, S. Baluja, M. Covell, and R. Sukthankar, "Variable rate image compression with recurrent neural networks," arXiv preprint arXiv:1511.06085, 2015.
[11] G. Toderici, D. Vincent, N. Johnston, S. Jin Hwang, D. Minnen, J. Shor, and M. Covell, "Full resolution image compression with recurrent neural networks," in IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 5306–5314.
[12] B. Li, M. Akbari, J. Liang, and Y. Wang, "Deep learning-based image compression with trellis coded quantization," in Data Compression Conference, 2020, pp. 13–22.
[13] F. Bellard, "BPG image format," http://bellard.org/bpg, 2017.
[14] Z. Wang, E. Simoncelli, and A. Bovik, "Multiscale structural similarity for image quality assessment," in Record of the Thirty-Seventh Asilomar Conference on Signals, Systems and Computers, vol. 2, 2003, pp. 1398–1402.
[15] C. Cai, L. Chen, X. Zhang, and Z. Gao, "Efficient variable rate image compression with multi-scale decomposition network," IEEE Transactions on Circuits and Systems for Video Technology, 2018.
[16] Z. Zhang, Z. Chen, J. Lin, and W. Li, "Learned scalable image compression with bidirectional context disentanglement network," in IEEE International Conference on Multimedia and Expo, 2019, pp. 1438–1443.
[17] M. Akbari, J. Liang, J. Han, and C. Tu, "Learned variable-rate image compression with residual divisive normalization," in IEEE International Conference on Multimedia and Expo, 2020, pp. 1–6.
[18] J. Ballé, D. Minnen, S. Singh, S. J. Hwang, and N. Johnston, "Variational image compression with a scale hyperprior," arXiv preprint arXiv:1802.01436, 2018.
[19] J. Lee, S. Cho, and M. Kim, "A hybrid architecture of jointly learning image compression and quality enhancement with improved entropy minimization," arXiv preprint arXiv:1912.12817, 2019.
[20] J. Zhou, S. Wen, A. Nakagawa, K. Kazui, and Z. Tan, "Multi-scale and context-adaptive entropy model for image compression," arXiv preprint arXiv:1910.07844, 2019.
[21] Y. Chen, H. Fan, B. Xu, Z. Yan, Y. Kalantidis, M. Rohrbach, S. Yan, and J. Feng, "Drop an octave: Reducing spatial redundancy in convolutional neural networks with octave convolution," in IEEE International Conference on Computer Vision, 2019, pp. 3435–3444.
[22] M. Antonini, M. Barlaud, P. Mathieu, and I. Daubechies, "Image coding using wavelet transform," IEEE Transactions on Image Processing, vol. 1, no. 2, pp. 205–220, 1992.
[23] C. Christopoulos, A. Skodras, and T. Ebrahimi, "The JPEG2000 still image coding system: an overview," IEEE Transactions on Consumer Electronics, vol. 46, no. 4, pp. 1103–1127, 2000.
[24] P. P. Vaidyanathan, Multirate Systems and Filter Banks. Pearson Education India, 2006.
[25] J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller, "Striving for simplicity: The all convolutional net," arXiv preprint arXiv:1412.6806, 2014.
[26] A. Van den Oord, N. Kalchbrenner, L. Espeholt, O. Vinyals, A. Graves et al., "Conditional image generation with PixelCNN decoders," in Advances in Neural Information Processing Systems, 2016, pp. 4790–4798.
[27] B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba, "Scene parsing through ADE20K dataset," in IEEE Conference on Computer Vision and Pattern Recognition, vol. 1, 2017, p. 4.
[28] Google Inc., "WebP," https://developers.google.com/speed/webp, 2016.
[29] Fraunhofer HHI, "VVC official test model VTM," https://vcgit.hhi.fraunhofer.de/jvet/VVCSoftware_VTM, 2019.
[30] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional networks for biomedical image segmentation," in International Conference on Medical Image Computing and Computer-Assisted Intervention, 2015, pp. 234–241.
[31] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, "The Cityscapes dataset for semantic urban scene understanding," in IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[32] P. Liu, H. Zhang, K. Zhang, L. Lin, and W. Zuo, "Multi-level wavelet-CNN for image restoration," in IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2018, pp. 773–782.
[33] C. Tian, L. Fei, W. Zheng, Y. Xu, W. Zuo, and C. Lin, "Deep learning on image denoising: An overview," arXiv preprint arXiv:1912.13171, 2019.