A Machine Learning Approach to Optimal Inverse Discrete Cosine Transform (IDCT) Design
Yifan Wang∗, Zhanxuan Mei∗, Chia-Yang Tsai†, Ioannis Katsavounidis† and C.-C. Jay Kuo∗
∗University of Southern California, Los Angeles, California, USA
†Facebook, Inc., Menlo Park, California, USA
Abstract—The design of an optimal inverse discrete cosine transform (IDCT) that compensates for the quantization error is proposed for effective lossy image compression in this work. The forward and inverse DCTs are designed as a pair in current image/video coding standards without taking the quantization effect into account. Yet, the distribution of quantized DCT coefficients deviates from that of the original DCT coefficients. This is particularly obvious when the quality factor of JPEG-compressed images is small. To address this problem, we first use a set of training images to learn the compound effect of the forward DCT, quantization and dequantization in cascade. Then, a new IDCT kernel is learned to reverse the effect of this pipeline. Experiments are conducted to demonstrate the advantage of the new method, which yields a gain of 0.11-0.30 dB over standard JPEG over a wide range of quality factors.
I. INTRODUCTION
Many image and video compression standards have been developed in the last thirty years. Examples include JPEG [1], JPEG2000 [2], and BPG [3] for image compression and MPEG-1 [4], MPEG-2 [5], MPEG-4 [6], H.264/AVC [7] and HEVC [8] for video compression. Yet, we see few machine learning techniques adopted by them. As machine learning becomes more popular in multimedia computing and content understanding, it is interesting to see how machine learning can be introduced to boost image/video coding performance. In this work, we investigate how machine learning can be used in the optimal design of the inverse discrete cosine transform (IDCT) targeting quantization error compensation.

Block transforms are widely used in image and video coding standards for energy compaction in the spatial domain. Only a few leading DCT coefficients have large magnitudes after the transformation. Furthermore, a large number of coefficients become zero after quantization. These two factors contribute greatly to file size reduction. The 8×8 block DCT [9] is used in JPEG. The integer cosine transform (ICT) is adopted by H.264/AVC for lower computational complexity. All forward and inverse kernels are fixed in these standards.

Research on the IDCT has primarily focused on complexity reduction via new computational algorithms and efficient software/hardware implementations. For example, an adaptive algorithm was proposed in [10] to reduce the IDCT complexity. A zero-coefficient-aware butterfly IDCT algorithm was investigated in [11] for faster decoding. As to efficient implementations, a fast multiplier implementation was studied in [12], where the modified Loeffler algorithm was implemented on an FPGA to optimize speed and area. There is, however, little work on analyzing the difference of DCT coefficients before and after quantization. In image/video coding, DCT coefficients go through quantization and de-quantization before their inverse transform.
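The cascade just described — forward DCT, quantization, dequantization, inverse DCT — is easy to reproduce in a few lines. The sketch below is an illustration only: it uses SciPy's orthonormal 8×8 DCT and a single uniform quantization step in place of JPEG's full quantization tables. It shows how the reconstruction error of the standard IDCT grows as quantization becomes coarser:

```python
import numpy as np
from scipy.fft import dctn, idctn

rng = np.random.default_rng(0)
block = rng.integers(0, 256, size=(8, 8)).astype(float)  # one 8x8 "image" block

X = dctn(block, norm="ortho")          # forward 8x8 DCT with an orthonormal kernel

for step in (2.0, 32.0):               # fine vs. coarse quantization step
    Xq = np.round(X / step) * step     # Quan followed by DeQuan (uniform quantizer)
    xq = idctn(Xq, norm="ortho")       # standard IDCT applied to quantized coefficients
    err = np.sqrt(np.mean((block - xq) ** 2))
    print(f"step={step:5.1f}  RMS reconstruction error = {err:.3f}")
```

With a fine step the round trip is nearly lossless; with a coarse step (the low-bit-rate regime discussed below) the error of the standard IDCT becomes substantial.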
When the coding bit rate is high, quantized-and-then-dequantized DCT coefficients are very close to the input DCT coefficients. Consequently, the IDCT can reconstruct input image patches reasonably well. However, when the coding bit rate is low, this condition no longer holds. The quantization effect is not negligible. The distribution of quantized DCT coefficients deviates from that of the original DCT coefficients. The traditional IDCT kernel, which is derived from the forward DCT kernel, is not optimal. In this work, we attempt to find the optimal IDCT kernel. This new kernel is derived from a training set using a machine learning approach. Simply speaking, it transforms quantized DCT coefficients back to spatial-domain pixel values, which is exactly what the IDCT is supposed to do. Our solution can account for the quantization effect through training on real-world data so that the quantization error can be reduced in decoding. It can improve the evaluation score and visual quality of reconstructed images without any extra cost in the decoding stage.

The rest of this paper is organized as follows. The impact of quantization on the IDCT is analyzed in Sec. II. The new IDCT method is proposed in Sec. III. Experimental results are presented in Sec. IV. Finally, concluding remarks and future research directions are given in Sec. V.

II. IMPACT OF QUANTIZATION ON INVERSE DCT

The forward block DCT can be written as

X = Kx, (1)

where x ∈ R^N, N = n × n, is an N-dimensional vector that represents an input image block of n × n pixels, K ∈ R^{N×N} denotes the DCT transform matrix, and X ∈ R^N is the vector of transformed DCT coefficients. Rows of K are mutually orthogonal. Furthermore, they are normalized in principle. In the implementations of H.264/AVC and HEVC, rows of K may not be normalized to avoid floating-point computation, which could vary from one machine to another. Mathematically, we can still view K as a normalized kernel for simplicity. Then, the corresponding inverse DCT can be represented by

x_q = K^{-1} X_q = K^T X_q, (2)

where K^{-1} = K^T since K is an orthogonal transform and X_q is the vector of quantized DCT coefficients. Mathematically, we have

X_q = DeQuan(Quan(X)), (3)

where Quan(·) represents the quantization operation and DeQuan(·) represents the de-quantization operation. If there is little information loss in the quantization process (i.e., nearly lossless compression), we have X_q ≈ X. Consequently, the reconstructed image block x_q will be close to the input image block, x. Sometimes, a smaller compressed file size at the expense of lower quality is preferred. Then, the difference between X_q and X can be expressed as

e_q = X − X_q. (4)

In decoding, we have X_q rather than X. If we apply the traditional IDCT (i.e., K^{-1} = K^T) to X_q, we can derive the image block error based on Eqs. (2), (3) and (4) as

e_q = x − K^{-1} DeQuan[Quan(Kx)]. (5)

To minimize e_q, we demand that

K_q^{-1} ≈ [DeQuan ⊙ Quan ⊙ K]^{-1}, (6)

where ⊙ denotes an element-wise operation. Since it is difficult to find a mathematical model for the right-hand side of Eq. (6), we adopt a machine learning approach to learn the relationship between the quantized DCT coefficients, X_q, and the pixels of the original input block, x, from a large number of training samples. Then, given X_q, we can predict the target x.

III. PROPOSED IDCT METHOD
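In outline, the method replaces the fixed inverse kernel with a least-squares regression matrix that maps dequantized coefficients back to pixels. The following sketch illustrates the idea under simplifying assumptions: random 8×8 blocks stand in for training patches cropped from real images, and a single uniform quantization step replaces JPEG's QF-dependent tables.

```python
import numpy as np
from scipy.fft import dctn, idctn

rng = np.random.default_rng(1)
N = 20000                    # number of training blocks
step = 24.0                  # stand-in uniform quantizer step

# Random 8x8 blocks stand in for training patches cropped from real images.
blocks = rng.integers(0, 256, size=(N, 8, 8)).astype(float)
P = blocks.reshape(N, 64).T                        # 64 x N matrix of pixel columns

X = dctn(blocks, axes=(1, 2), norm="ortho")        # forward DCT, block by block
D = (np.round(X / step) * step).reshape(N, 64).T   # 64 x N dequantized coefficients

# Linear regression: minimize ||P - K_hat D||_F over all 64x64 matrices K_hat.
K_hat = np.linalg.lstsq(D.T, P.T, rcond=None)[0].T

# The standard IDCT is itself one particular linear map from coefficients to
# pixels, so on the training data the least-squares fit can never do worse.
std = idctn(D.T.reshape(N, 8, 8), axes=(1, 2), norm="ortho").reshape(N, 64).T
print("standard IDCT error :", np.linalg.norm(P - std))
print("learned kernel error:", np.linalg.norm(P - K_hat @ D))
```

Decoding with the learned kernel is then a single matrix product per block, `K_hat @ d`, so it costs no more than the standard matrix-form IDCT.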
We use image block samples of size 8 × 8 to learn the optimal IDCT matrix. The training procedure is stated below.
1) Perform the forward DCT on each image block.
2) Quantize the DCT coefficients with respect to a quality factor QF.
3) Conduct linear regression between the de-quantized DCT coefficients, which are the input, and the image pixels of the input blocks, which are the output, to determine the optimal IDCT matrix K̂.
Then, we use K̂, instead of K^{-1} in Eq. (5), for the IDCT task.

For Step 2, it is important to emphasize that there is no need to train and store K̂ for every QF value, which would be too costly. Instead, we can train and store several K̂ for a small set of selected quality factors. Then, according to the QF that is used to encode images, we can choose the K̂ with the closest QF to decode them. This point will be elaborated in Sec. IV.

For Step 3, we first reshape the 2D layout of image pixels and dequantized DCT coefficients (both of dimension 8 × 8) into 1D vectors of dimension 64. Notation x_i denotes the flattened pixel vector of the i-th image block and

X_{q,i} = DeQuan[Quan(Kx_i)] (7)

denotes the flattened dequantized DCT coefficients of the i-th image block. Next, we form the data matrix P ∈ R^{64×N} of image pixels and the data matrix D ∈ R^{64×N} of dequantized DCT coefficients. Columns of matrix P are formed by x_i while columns of matrix D are formed by X_{q,i}, i = 1, ..., N. Then, we can set up the regression problem by minimizing the following objective function

ξ(K̂) = Σ_{i=1}^{N} ||e_i||^2, (8)

where || · || is the Euclidean norm and e_i is the i-th column of the error matrix

E = P − K̂D, (9)

and where the learned kernel, K̂, is the regression matrix of dimension 64 × 64. Typically, the sample number, N, is significantly larger than 64 × 64 = 4,096.

Fig. 1: The L2 norm of differences between kernels learned with different QFs.

IV. EXPERIMENTS
Experimental Setup.
We perform experiments on three datasets with the libjpeg software [13] on a Unix machine. The three datasets are: Kodak [14], DIV2K [15], and Multiband Texture (MBT) [16]. The first two datasets contain generic images. They are used to test the power of the proposed IDCT method on typical images. The third dataset has specific content, and it is of interest to see whether our solution can be tailored to it for more performance gain. For each dataset, we split the images into disjoint sets: one for training and the other for testing. There are 24, 785 and 112 total images in the Kodak, DIV2K, and MBT datasets, and we choose 10, 100 and 20 of them as training images, respectively. The DCT remains the same as that in libjpeg. The IDCT kernel is learned at a certain QF based on the discussion in Sec. III. Afterwards, the learned kernel is used to decode images compressed by libjpeg. Both RGB-PSNR and SSIM [17] metrics are used to evaluate the quality of the decoded images.
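For reference, PSNR over RGB images can be computed as below. This is a generic sketch of the metric, not code from the paper; one common convention — assumed here — is to pool the squared error over all three channels before taking the logarithm:

```python
import numpy as np

def rgb_psnr(ref, test, peak=255.0):
    """PSNR with the MSE pooled over all RGB samples of the two images."""
    mse = np.mean((ref.astype(np.float64) - test.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)

ref = np.zeros((4, 4, 3), dtype=np.uint8)
print(rgb_psnr(ref, ref))                    # identical images -> inf
print(rgb_psnr(ref, np.full_like(ref, 16)))  # small distortion -> high PSNR
```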
Impact of QF Mismatch.
It is impractical to train many kernels with different QF values. To see the QF mismatch effect on kernel learning, we show the L2 norm of differences between learned kernels derived with different QFs in Fig. 1. We see from the figure that the kernel learned at a certain QF is close to those computed at its neighbouring QF values except for very small QF values (say, less than 20). When QF is very small, quantized DCT coefficients have a lot of zeros, especially in the high-frequency region. This makes the linear regression poor. The IDCT kernel derived from these quantized coefficients contains more zeros in each column, which makes it different from the others. We plot the PSNR and SSIM values of decoded test images using the standard IDCT in JPEG and the proposed optimal IDCT as a function of QF in Fig. 2. We see a clear gain across all QFs and all datasets. Furthermore, we show the averaged PSNR and SSIM gains offered by the proposed IDCT designed with fixed QF values over the standard DCT in Table I. The PSNR gain ranges from 0.11 to 0.30 dB.

Fig. 2: Comparison of quality of decoded test images using the standard IDCT in JPEG and the proposed optimal IDCT for the Kodak, DIV2K and MBT datasets. Panels (a)-(c): PSNR versus QF for Kodak, DIV2K and MBT; panels (d)-(f): SSIM versus QF for the same datasets.

Fig. 3: Visualization of a zoom-in region of an input Kodak image, the decoded region by JPEG and the proposed method with QF=70. Panels: (a) input image from Kodak, (b) zoom-in, (c) JPEG, (d) ours.

Fig. 4: Visualization of a zoom-in region from an input DIV2K image, the decoded region by JPEG and the proposed method with QF=70. Panels: (a) input image from DIV2K, (b) zoom-in, (c) JPEG, (d) ours.
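The kernel-distance study of Fig. 1 can be reproduced in outline. The sketch below learns K̂ at several quantization strengths and reports the Frobenius norm between kernels of neighbouring strengths; as before, random blocks and a uniform quantizer (whose step plays the role of an inverse quality factor) are illustrative stand-ins for real training images and JPEG's quantization tables.

```python
import numpy as np
from scipy.fft import dctn

def learn_kernel(step, n_blocks=20000, seed=0):
    """Least-squares IDCT kernel for a uniform quantizer with the given step."""
    rng = np.random.default_rng(seed)
    blocks = rng.integers(0, 256, size=(n_blocks, 8, 8)).astype(float)
    pixels = blocks.reshape(n_blocks, 64)             # N x 64 pixel rows
    coeffs = dctn(blocks, axes=(1, 2), norm="ortho")  # forward DCT per block
    deq = (np.round(coeffs / step) * step).reshape(n_blocks, 64)
    # Solve min ||pixels - deq @ K_hat^T||_F and return the 64x64 K_hat.
    return np.linalg.lstsq(deq, pixels, rcond=None)[0].T

steps = [4.0, 8.0, 16.0, 32.0, 64.0]    # coarser step ~ lower JPEG quality
kernels = [learn_kernel(s) for s in steps]
for s0, s1, k0, k1 in zip(steps, steps[1:], kernels, kernels[1:]):
    print(f"||K({s0:>4}) - K({s1:>4})||_F = {np.linalg.norm(k0 - k1):.4f}")
```

As in Fig. 1, neighbouring quantization strengths yield nearby kernels, with larger jumps at the coarse (low-QF) end of the range.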
Quality Comparison via Visual Inspection.
We show a zoom-in region of three representative images, each selected from the Kodak, DIV2K and MBT datasets, decoded using the standard IDCT and the proposed IDCT for visual inspection in Figs. 3, 4 and 5, respectively. The proposed IDCT method gives better edge, smooth and texture regions than the standard IDCT in these three examples. Specifically, edge boundaries suffer less from the Gibbs phenomenon due to the learned kernel. Similarly, texture regions are better preserved and the smooth regions close to edge boundaries are smoother using the learned kernel. All familiar quantization artifacts decrease by a certain degree.

Fig. 5: Visualization of a zoom-in region of an input image from MBT, the decoded region by JPEG and the proposed method with QF=70. Panels: (a) input image from MBT, (b) zoom-in, (c) JPEG, (d) ours.

TABLE I: Evaluation results on test images in the Kodak, DIV2K, and MBT datasets, where training QFs are set to 50, 70 and 90 for each column.

Dataset  Metric      QF=50     QF=70     QF=90
Kodak    PSNR (dB)   +0.1930   +0.2189   +0.2057
         SSIM        +0.0029   +0.0025   +0.0012
DIV2K    PSNR (dB)   +0.1229   +0.1488   +0.1144
         SSIM        +0.0024   +0.0020   +0.0008
MBT      PSNR (dB)   +0.1603   +0.2537   +0.3038
         SSIM        +0.0025   +0.0022   +0.0005
Impact of Image Content.
Another phenomenon of interest is the relationship between image content and the performance of the proposed IDCT method. On one hand, we would like to argue that the learned IDCT kernel is generally applicable; it is not too sensitive to image content. To demonstrate this point, we use 10 images from the Kodak dataset to train the IDCT kernel with QF=70 and then apply it to all images in the DIV2K dataset, as shown in Fig. 6. This kernel offers a PSNR gain of 0.24 dB over the standard IDCT kernel. On the other hand, it is still advantageous if the training and testing image content matches well. As shown in Table I, MBT has a higher PSNR gain than Kodak and DIV2K. It is well known that texture images contain high-frequency components. When QF is smaller, these components are quantized to zero and it is difficult to learn a good kernel. Yet, when QF is larger, high-frequency components are retained and the learned kernel can compensate quantization errors better, yielding a larger PSNR gain on the whole dataset.

Fig. 6: Comparison of quality of decoded images using the standard IDCT and the proposed IDCT trained on the Kodak dataset yet tested on the DIV2K dataset. Panels: (a) PSNR versus QF, (b) SSIM versus QF.

V. CONCLUSION AND FUTURE WORK
An IDCT kernel learning method that compensates for the quantization effect was proposed in this work. The proposed method adopts a machine learning approach to estimate the optimal IDCT kernel based on the quantized DCT coefficients and the desired output block of image pixels. Extensive experiments were conducted to demonstrate a clear advantage of this new approach. The learned kernel is sensitive neither to the training QF value nor to the image content. It offers a robust PSNR gain of 0.1 to 0.3 dB over standard JPEG. Since it is used only in the decoder, it increases neither the encoding time nor the compressed file size. The learned kernel can be transmitted offline as an overhead file or simply implemented by the decoder alone.

It is interesting to consider region-adaptive kernels. For example, we can roughly categorize regions into smooth, edge and textured regions. The distributions of DCT coefficients in these regions are quite different. Thus, we can use a clustering technique to group similar DCT coefficient distributions and conduct learning within each cluster. Furthermore, we may consider separable IDCT kernels since they can be implemented more efficiently. Finally, it is desirable to apply the same idea to video coding standards such as H.264/AVC and HEVC for further performance improvement.
REFERENCES
[1] G. K. Wallace, “The JPEG still picture compression standard,” IEEE Transactions on Consumer Electronics, vol. 38, no. 1, pp. xviii–xxxiv, 1992.
[2] M. Rabbani, “JPEG2000: Image compression fundamentals, standards and practice,” Journal of Electronic Imaging, vol. 11, no. 2, p. 286, 2002.
[3] Better Portable Graphics. [Online]. Available: https://bellard.org/bpg/
[4] K. Brandenburg and G. Stoll, “ISO/MPEG-1 audio: A generic standard for coding of high-quality digital audio,” Journal of the Audio Engineering Society, vol. 42, no. 10, pp. 780–792, 1994.
[5] B. G. Haskell, A. Puri, and A. N. Netravali, Digital Video: An Introduction to MPEG-2. Springer Science & Business Media, 1996.
[6] F. C. Pereira, F. M. B. Pereira, F. C. Pereira, F. Pereira, and T. Ebrahimi, The MPEG-4 Book. Prentice Hall Professional, 2002.
[7] T. Wiegand, G. J. Sullivan, G. Bjontegaard, and A. Luthra, “Overview of the H.264/AVC video coding standard,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, no. 7, pp. 560–576, 2003.
[8] G. J. Sullivan, J.-R. Ohm, W.-J. Han, and T. Wiegand, “Overview of the high efficiency video coding (HEVC) standard,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 22, no. 12, pp. 1649–1668, 2012.
[9] N. Ahmed, T. Natarajan, and K. R. Rao, “Discrete cosine transform,” IEEE Transactions on Computers, vol. 100, no. 1, pp. 90–93, 1974.
[10] I.-M. Pao and M.-T. Sun, “Modeling DCT coefficients for fast video encoding,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 9, no. 4, pp. 608–616, 1999.
[11] S.-h. Park, K. Choi, and E. S. Jang, “Zero coefficient-aware fast butterfly-based inverse discrete cosine transform algorithm,” IET Image Processing, vol. 10, no. 2, pp. 89–100, 2016.
[12] A. B. Atitallah, P. Kadionik, F. Ghozzi, P. Nouel, N. Masmoudi, and P. Marchegay, “Optimization and implementation on FPGA of the DCT/IDCT algorithm,” vol. 3. IEEE, 2006, pp. III–III.
[13] libjpeg. [Online]. Available: http://libjpeg.sourceforge.net
[14] Kodak images. [Online]. Available: http://r0k.us/graphics/kodak/
[15] E. Agustsson and R. Timofte, “NTIRE 2017 challenge on single image super-resolution: Dataset and study,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, July 2017.
[16] S. Abdelmounaime and H. Dong-Chen, “New Brodatz-based image databases for grayscale color and multiband texture analysis,” ISRN Machine Vision, vol. 2013, 2013.
[17] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: From error visibility to structural similarity,” IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600–612, 2004.