JQF: Optimal JPEG Quantization Table Fusion by Simulated Annealing on Texture Images and Predicting Textures
Chen-Hsiu Huang and Ja-Ling Wu
Department of Computer Science and Information Engineering, National Taiwan University, No. 1, Sec. 4, Roosevelt Rd., Taipei City 106, Taiwan (R.O.C.) {chenhsiu48,wjl}@cmlab.csie.ntu.edu.tw
Abstract
JPEG has been a widely used lossy image compression codec for nearly three decades. The JPEG standard allows the use of customized quantization tables; however, finding an optimal quantization table within an acceptable computational cost remains a challenging problem. This work tries to resolve the dilemma of balancing computational cost against image-specific optimality by introducing a new concept of texture mosaic images. Instead of optimizing a single image or a collection of representative images, the simulated annealing technique is applied to texture mosaic images to search for an optimal quantization table for each texture category. We use a pre-trained VGG-16 CNN model to learn those texture features and predict a new image's texture distribution, then fuse the optimal texture tables to produce an image-specific optimal quantization table. On the Kodak dataset with the quality setting Q = 95, our experiment shows a size reduction of 23.5% over the JPEG standard table with a slight 0.35% FSIM decrease, which is visually imperceptible. The proposed JQF method achieves per-image optimality for JPEG encoding with less than one second of additional timing cost. The online demo is available at https://matthorn.s3.amazonaws.com/JQF/qtbl_vis.html.

Introduction
JPEG is a commonly used lossy compression standard for digital images, developed by the Joint Photographic Experts Group [1] in 1992. Although the JPEG Still Picture Compression Standard was introduced nearly three decades ago, JPEG remains the most frequently used image format, whether for Internet content sharing [2] or as produced by various digital image capture devices. JPEG divides the image into 8x8 blocks and uses the Discrete Cosine Transform (DCT) to shift the pixels from the spatial domain to the frequency domain for better coding efficiency. Because the human visual system (HVS) is more sensitive to low-frequency components and less perceptive of high-frequency components, the transformed DCT coefficients are rearranged in zigzag order to reflect their spectral importance. Then a quantization table with values of different magnitudes is used to quantize the DCT coefficients at the corresponding spectral positions, resulting in reduced coefficient values and sparse DCT blocks, which are more beneficial to variable length coding and run-length encoding (RLE). Figure 1 (a) shows the default luminance quantization table provided in the JPEG standard. It is easy to observe that the quantization values generally increase along the zigzag scanning order.

Figure 1: Examples of JPEG luminance quantization tables: (a) the default standard table, and (b) the fused customized table for the lighthouse image.
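To make the quantization step concrete, the following minimal Python sketch quantizes and dequantizes one 8x8 pixel block against a luminance table. It uses SciPy's DCT as a stand-in for the encoder's transform, so the function names and details are illustrative rather than taken from any particular codec.

```python
import numpy as np
from scipy.fftpack import dct, idct

def quantize_block(block, qtable):
    """Forward 2-D DCT of one 8x8 pixel block, then divide by the
    quantization table and round -- the lossy step described above."""
    shifted = block.astype(np.float64) - 128.0                          # level shift to [-128, 127]
    coeffs = dct(dct(shifted, axis=0, norm='ortho'), axis=1, norm='ortho')
    return np.round(coeffs / qtable).astype(np.int32)

def dequantize_block(qcoeffs, qtable):
    """Multiply back by the table and apply the inverse DCT."""
    coeffs = qcoeffs.astype(np.float64) * qtable
    block = idct(idct(coeffs, axis=0, norm='ortho'), axis=1, norm='ortho') + 128.0
    return np.clip(np.round(block), 0, 255).astype(np.uint8)
```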
Since JPEG is lossy, the compression rate can be adjusted, allowing a selectable tradeoff between storage size and image quality. The JPEG standard library [3] includes a quality metric Q, ranging from 1 to 100, to scale the values in the quantization table and thus control the reduction of DCT coefficients. The libjpeg reference implementation demonstrates how to calculate the scaling factor S_f and scale the quantization values to the target quality:

$$S_f = \begin{cases} 50/Q, & 1 \le Q < 50 \\ (100 - Q)/50, & 50 \le Q \le 100 \end{cases} \quad (1)$$

$$T_Q(i) = \max(\min(T_S(i) \times S_f, 255), 1), \quad \text{for } i = 0 \dots 63 \quad (2)$$

where T_S is the standard quantization table and T_Q is the scaled table at quality Q.
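The following short Python sketch mirrors equations (1) and (2); the function names are ours, and libjpeg itself performs the equivalent computation with integer arithmetic.

```python
import numpy as np

def scaling_factor(quality):
    """Scaling factor S_f of equation (1)."""
    if quality < 50:
        return 50.0 / quality
    return (100.0 - quality) / 50.0

def scale_table(std_table, quality):
    """Scale a 64-entry quantization table to the target quality and
    clamp each value to [1, 255], as in equation (2)."""
    sf = scaling_factor(quality)
    scaled = np.round(np.asarray(std_table, dtype=np.float64) * sf)
    return np.clip(scaled, 1, 255).astype(np.uint8)

# At Q = 50 the factor is 1.0 (table unchanged); at Q = 95 it is 0.1.
```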
Although a smaller quality metric achieves better image size reduction, it may introduce visual artifacts such as blocking and ringing effects if we carefully observe the image pixels under a magnifier. Therefore, the default JPEG quality metric in various applications is usually set to higher values, say at least 75 or above. Besides the standard table, the JPEG standard also allows users to supply customized quantization tables. However, the selection of the JPEG quantization table remains a challenging and under-optimized problem due to the enormous solution space and the lack of reliable quality measurements that accurately model the HVS. As a result, modern image applications and digital camera image processors tend to compress JPEG images with minimal quantization values to preserve quality.

In this paper, we try to resolve the dilemma of balancing computational cost against image-specific optimality by introducing a new concept of texture mosaic images. We use the RAISE dataset [4] as our training database and cross-validate on the Kodak dataset [5]. We crop the training images into patches and apply unsupervised clustering to categorize different texture types, then stitch those texture patches together to form texture mosaic images. The simulated annealing technique is used on those texture mosaic images to search for an optimal quantization table for each texture category. A convolutional neural network (CNN) model is trained on the texture clustering result to learn a generic representation of texture features and is used to predict the testing image's texture distribution. Finally, based on each image's texture characteristics, the texture-relevant optimal quantization tables are fused to produce an image-specific optimal quantization table. On the RAISE testing set and the Kodak dataset, our experiments show a double-digit percentage size reduction compared to the JPEG standard table with a slight decrease of the FSIM score, which is visually imperceptible under a high quality metric setting.

Related Works
Simulated Annealing
The stochastic optimization process known as simulated annealing has been applied to find vector quantization parameters. The works of Monro and Sherlock [6, 7] were the first attempts to use simulated annealing to determine quantization tables for DCT coding. To locate an optimized quantization table, they applied simulated annealing to all 64 quantization values with a cost function composed of the RMSE error and a selected target compression ratio. Their optimization process searches for optimal tables on selected images with minimal RMSE error while keeping the compression ratio close to the chosen target. Around a single-digit percentage of error improvement is reported compared to the standard JPEG table. Both Monro's and Sherlock's works indicated two things: 1) high-frequency components are more critical than the assumption made in the JPEG standard table; and 2) RMSE can only be used as a power-based measure of signal fidelity and was shown to be an inferior metric for approximating subjective image quality.

Since the early 2000s, new objective FR-IQA methods like SSIM [8] and FSIM [9] have been proposed and shown to be statistically closer to the HVS. Jiang et al. [10] utilize SSIM as the quality metric to evaluate distortion in the compressed images during the simulated annealing process. In their work, a multi-objective optimization equation is proposed to minimize bitrate while maximizing SSIM. To solve the equation, they estimate the Pareto optimal point for finding an optimal quantization table, at which no other feasible point has both lower bitrate and a higher SSIM index. Their best annealing technique achieves an 11.68% size reduction over the JPEG standard table while slightly decreasing the SSIM index by 0.11%. However, since the Pareto optimal point differs for every image, the multi-objective optimization framework only proves useful on a per-image basis, not on a set of evaluation images.

On top of Jiang's work, Hopkins et al. [11] adopt FSIM as the quality metric and revise the annealing process to focus on compression maximization with a temperature function that rewards lower error. A set of 4,000 images was selected from the RAISE [4] dataset as a training set to run four groups of 400 separate annealing processes in parallel at quality metrics 35, 50, 75, and 95. With the four globally optimized quantization tables at different quality metrics, Hopkins' work reduces the compressed size by around 20% over the JPEG standard table and claims to improve the FSIM error by 10% on the evaluation set. The corpus of 4,000 training images is a fairly good proxy for universal pictures, but it is still not custom-tailored per image.

Another work, Google's JPEG encoder Guetzli [12], aims to produce visually indistinguishable images at a lower bit-rate using Butteraugli [13], Google's perceptual distance metric. Using a closed-loop optimizer, Guetzli optimizes global quantization tables and selectively zeroes out specific DCT coefficients in each block. Compared to Hopkins' globally optimized quantization table, Guetzli's per-image optimization strategy achieves a 29-45% data size reduction. However, most of Guetzli's size reduction comes from identifying DCT coefficients to zero out, not from optimizing the quantization table. Google's Guetzli provides both per-image customized optimization and better size reduction, but it is extremely slow (up to 30 minutes on a high-resolution image) and is considered impractical.
Proposed Method
The proposed JPEG Quantization Table Fusion (JQF) method contains two workflows. The training workflow, shown in Figure 2 (a), serves as a series of offline procedures to collect and cluster texture patches from the training image dataset, and then optimize the texture quantization tables. In the prediction workflow, shown in Figure 2 (b), we predict the texture distribution of the input image with the texture CNN model, then aggregate a custom-tailored quantization table for the input image.
Figure 2: The workflows of the proposed JQF method. (a) The texture training and mosaic image annealing flow. (b) The texture prediction and quantization table fusion flow.
Texture Patches Clustering
We crop the training images into 64 × 64 patches in a non-overlapping manner and perform unsupervised image clustering on the textures. Typically, the image clustering problem can be handled in two steps: 1) finding appropriate image visual features, and 2) training a classifier that minimizes the class assignments [14]. In recent years, CNN models pre-trained on ImageNet [15] have become the building blocks of many computer vision applications. In this context, the last activation maps after the layers of ConvNets, called bottleneck features, are a good choice of visual features to represent raw pixels in a vector space of fixed dimensionality.

We use the pre-trained VGG-16 network [16] to extract bottleneck features from texture patches, then apply principal component analysis (PCA) to further reduce the dimension to 500, covering 82% of the variance. Because we do not know how many types of textures exist in our training set, the K-Means algorithm is used to cluster the textures into K categories. On the RAISE dataset, we make a natural guess and select K = 100. We argue that the exact number of classes is not essential; it is merely the dimension of our table pool, since the optimal tables of different texture types will be aggregated in the prediction flow. The linear combination of optimal tables yields the final custom-tailored quantization table. Figure 3 (b) shows one example of the texture mosaic images that composite the lighthouse (kodim19) image of the Kodak PhotoCD dataset.

Figure 3: A texture distribution example. (a) The lighthouse image, and (b) the top-4 textures of the lighthouse image.
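A condensed sketch of this clustering step under the stated setup (VGG-16 bottleneck features, PCA to 500 dimensions, K-Means with K = 100). The variable texture_patches is assumed to hold the cropped 64 × 64 RGB patches; batching, normalization details, and I/O are simplified.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# VGG-16 convolutional backbone used as a fixed feature extractor.
vgg = models.vgg16(pretrained=True).features.eval()

transform = T.Compose([T.ToTensor(),
                       T.Normalize(mean=[0.485, 0.456, 0.406],
                                   std=[0.229, 0.224, 0.225])])

@torch.no_grad()
def bottleneck_features(patches):
    """patches: list of 64x64 RGB PIL images -> (N, D) feature matrix."""
    batch = torch.stack([transform(p) for p in patches])
    feats = vgg(batch)                     # (N, 512, 2, 2) for 64x64 inputs
    return feats.flatten(1).numpy()

features = bottleneck_features(texture_patches)   # texture_patches: assumed list of cropped patches
reduced = PCA(n_components=500).fit_transform(features)
labels = KMeans(n_clusters=100, random_state=0).fit_predict(reduced)
```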
Annealing on Texture Mosaic Images
The quantization table is an 8 × 8 matrix of 64 values. We anneal at quality Q = 50 and Q = 95 to validate our approach. In each step, we randomly choose some table indices based on the magnitude of the current values; we weight smaller values more, assuming that smaller values have higher visual importance. Then we perturb the chosen table values by a small amount in either direction. Following libjpeg, we scale the candidate solution to the target quality metric using equation (2), where S_f = 1.0 at Q = 50 and S_f = 0.1 at Q = 95. We compress a new JPEG file using the candidate table as the luminance table and the standard chrominance table. If the new JPEG file shows a size reduction and the quality degradation falls within the tolerance γ = 0.01, we accept the candidate solution and complete the current iteration. We evaluate the FSIM quality reduction by

$$\mathrm{FSIM}(I_r, I_c) \ge \mathrm{FSIM}(I_r, I_s) \times (1 - \gamma), \quad (3)$$

where I_r is the raw image, and I_c and I_s are the JPEG images compressed with the candidate table and the standard table, respectively. In order not to be trapped in a local minimum, there is a probability P(i) of accepting a worse solution, affected by the temperature function T(i) and the energy delta ΔE. The probability P(i) of accepting a worse answer is calculated by

$$P(i) = \Delta E \times T(i), \quad \text{for } i = 1 \dots M \quad (4)$$

$$\Delta E = \frac{S_i}{S_{i-1}}, \quad S_i = C_i \times (1 - D_i), \quad D_i = \mathrm{FSIM}(I_r, I_c) \quad (5)$$

$$T(i) = \frac{M}{M + i \times p}, \quad (6)$$

where i is the iteration index, M is the maximum number of iterations to anneal, C_i denotes the current compressed JPEG file size, and D_i is the FSIM quality of the compressed image. The temperature function is designed so that the probability approaches 1/(1 + p) at the end of the annealing process. With this design of the probability P(i), we have a higher chance of accepting a worse solution at the early stage, which prevents us from being trapped in a local minimum. The probability decreases gradually with the number of iterations; the annealing then becomes a hill-climbing process. The current iteration can also finish by accepting a worse solution; otherwise, we randomly update again to get the next candidate.
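The per-iteration accept/reject logic can be sketched in Python as follows. This is a simplified view of the loop described above; perturb, compress_with_table, and fsim are hypothetical helper names standing in for the table mutation, JPEG encoding with a candidate luminance table, and FSIM scoring, not actual library calls.

```python
import random

def temperature(i, M, p=10):
    """Temperature schedule T(i) of equation (6); decays toward 1/(1+p)."""
    return M / (M + i * p)

def anneal(raw_img, std_table, M=2000, p=10, gamma=0.01):
    """Simplified accept/reject loop following equations (3)-(6)."""
    size_s, img_s = compress_with_table(raw_img, std_table)   # hypothetical helper
    fsim_s = fsim(raw_img, img_s)                              # FSIM(I_r, I_s), hypothetical helper
    table, size_cur = list(std_table), size_s
    prev_energy = size_s * (1.0 - fsim_s)                      # S_0 in equation (5)

    for i in range(1, M + 1):
        candidate = perturb(table)                             # small +/- steps, biased to small entries
        size_c, img_c = compress_with_table(raw_img, candidate)
        fsim_c = fsim(raw_img, img_c)                          # FSIM(I_r, I_c)
        energy = size_c * (1.0 - fsim_c)                       # S_i = C_i * (1 - D_i)
        better = size_c < size_cur and fsim_c >= fsim_s * (1.0 - gamma)   # equation (3)
        p_accept = (energy / prev_energy) * temperature(i, M, p)          # equation (4)
        if better or random.random() < p_accept:
            table, size_cur, prev_energy = candidate, size_c, energy
    return table
```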
In this work, we choose M = 2,000 and p = 10 to anneal each texture mosaic image, taking around 5.5 hours to execute on an Intel Core i7-9700K processor core.

Texture Training and Prediction
After we cluster the texture mosaic images, we use the clustering labels to train a texture prediction model by supervised learning. With the ImageNet pre-trained VGG-16 network, we freeze the ConvNet parameters and fine-tune the fully connected layers as the classifier for the 100 textures. We crop 261,712 texture patches from the RAISE dataset and split them 80%-20% for training and testing. We employ the Adam optimizer with default settings in PyTorch to fine-tune our network with a learning rate of 0.0001 and a batch size of 2,048. We train the texture CNN model for 30 epochs, resulting in a top-3 testing accuracy of around 99.77%. For prediction, we crop the input image into 64 × 64 patches, then execute the forward propagation process to obtain each patch's texture category, forming a texture distribution that describes the image structure.
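A condensed PyTorch sketch of the fine-tuning and prediction steps under the stated setup; train_loader and the batching of an input image's patches are assumptions, and device placement, evaluation, and data loading details are omitted.

```python
import torch
import torch.nn as nn
import torchvision.models as models

# Start from the ImageNet pre-trained VGG-16 and freeze the ConvNet backbone.
model = models.vgg16(pretrained=True)
for param in model.features.parameters():
    param.requires_grad = False
model.classifier[6] = nn.Linear(4096, 100)            # re-head for 100 texture classes

optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4)
criterion = nn.CrossEntropyLoss()

for epoch in range(30):
    for patches, labels in train_loader:              # train_loader: assumed DataLoader of 64x64 patches
        optimizer.zero_grad()
        loss = criterion(model(patches), labels)
        loss.backward()
        optimizer.step()

@torch.no_grad()
def texture_distribution(patch_batch, num_classes=100):
    """Classify each patch of an image and return the normalized histogram
    of texture categories used as that image's texture distribution."""
    preds = model(patch_batch).argmax(dim=1)
    hist = torch.bincount(preds, minlength=num_classes).float()
    return hist / hist.sum()
```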
Quantization Table Fusion
It is common to see very different texture distributions, as pictures come in all kinds of variety. To aggregate the per-texture optimized quantization tables so that they better fit the whole image, we considered two strategies: voting by majority and weighted averaging. We select the weighted average policy for its overall better performance. The fused optimal quantization table T_O is calculated by

$$T_O(i) = \sum_t \left( T_t(i) \times W_t \right), \quad \text{for } i = 0 \dots 63 \quad (7)$$

where T_t denotes the optimal table of texture t, and W_t is the weight of the corresponding texture.
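In practice, equation (7) is a weighted average over the table pool. A minimal sketch follows, where the array shapes are assumptions (texture_tables is K × 64, weights is the length-K predicted texture distribution) and the clamping to [1, 255] is our addition to keep the result a valid table.

```python
import numpy as np

def fuse_tables(texture_tables, weights):
    """Weighted average of per-texture optimal tables (equation (7))."""
    fused = np.asarray(weights) @ np.asarray(texture_tables, dtype=np.float64)
    return np.clip(np.round(fused), 1, 255).astype(np.uint8)
```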
The fused optimal quantization table of the lighthouse image is shown in Figure 1 (b). Compared to the standard table, cells marked in red indicate increases and cells in blue indicate decreases. Our observation again validates the conclusion of Monro and Sherlock [6, 7] that the JPEG standard table improperly over-emphasizes the low-frequency parts and under-estimates the importance of the high frequencies.

Experimental Results
We briefly describe the two datasets used in our experiments as follows:
RAISE:
The RAISE dataset [4] is a real-world camera photo database, collected from four photographers capturing different scenes in over 80 places with different cameras. It consists of 8,156 high-resolution images.

Kodak:
The Kodak dataset [5] has 24 lossless images, commonly used for evaluating image compression. Each image is about 768 × 512 in resolution.

We choose the RAISE-1k subset of the RAISE dataset and randomly select 50 images as a testing set to evaluate the optimality of JPEG compression. The rest of the images are used as the training set. Each image is cropped into 64 × 64 patches with stride 256, generating 261,712 texture patches in total. We then cluster the patches into 100 textures and perform the simulated annealing process on the stitched mosaic images as described earlier. At most 225 patches are randomly selected from each texture category to limit the required annealing time. For fairness, all the raw images are encoded using the JPEG standard table, the best table from Hopkins' work [11], and our fused quantization table, each scaled to the target JPEG quality as in equation (2). The compression is done with the command line program cjpeg from libjpeg [3]. The full-reference image quality metrics PSNR, SSIM, and FSIM are used as the quality benchmark, reported as the distance between the original image and the compressed JPEG image.
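As a rough illustration of this setup, a custom table can be passed to cjpeg through its wizard switches; the exact flags (-qtables, -qslots) and their interaction with -quality depend on the libjpeg build, so this invocation is a sketch rather than the exact command used in our experiments.

```python
import subprocess

def encode_with_custom_table(src_image, dst_jpeg, qtable_file, quality=95):
    """Compress an image with cjpeg using a quantization-table file.
    The -qtables/-qslots wizard switches come from libjpeg's cjpeg;
    consult wizard.txt of your libjpeg build for their exact semantics."""
    subprocess.run(
        ["cjpeg", "-quality", str(quality),
         "-qtables", qtable_file,        # file holding luminance (and chrominance) tables
         "-qslots", "0,1",               # luminance -> table 0, chrominance -> table 1
         "-outfile", dst_jpeg, src_image],
        check=True)
```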
RAISE Training Texture Annealing Result
We present the annealing performance of the 100 texture mosaic images from the RAISE training patches at quality Q = 95 in Table 1. The proposed annealing method delivers slightly worse FSIM quality than Hopkins18, decreasing by 0.04%, but it further reduces the JPEG size by 7.25% on average. It is interesting to note that our method improves PSNR and SSIM by 1.38% and 0.09%. Although it is well known that PSNR does not align with the HVS, it still reflects signal fidelity, and we do not mind having higher PSNR if the compressed size is smaller. This somewhat indicates that the optimized quantization table better adapts to the image content than the standard and Hopkins' global tables.

Table 1: The RAISE training textures optimization performance, compared to Hopkins' global table at Q = 95. We mark the superior results in bold.

                 Hopkins18 vs. Standard              JQF vs. Hopkins18
Q=95             Size     PSNR    SSIM    FSIM       Size    PSNR   SSIM   FSIM
100 Textures     -17.86%  -4.96%  -0.56%  -0.40%     -7.25%  1.38%  0.09%  -0.04%
Evaluation Result and Cross-validation
The evaluation results on real-world images are reported in Table 2. A similar pattern of better PSNR, SSIM, and FSIM quality against Hopkins18 is also observed. The proposed JQF method achieves a further 8.46% size reduction and improves FSIM by 0.05% on RAISE, while it reduces the compressed size by 7.62% and enhances FSIM quality by 0.09% when cross-validated on the Kodak dataset. Although we see a better Quality vs. Size trade-off compared to Hopkins' work, this can only be regarded as one possible outcome of the rate-distortion optimization process. If we compare the JQF annealed tables with the standard JPEG table, the 25.14%-23.5% compression gain does hurt the image quality, reducing the FSIM score by 0.28%-0.35%. Our experience shows that if we anneal for fewer iterations or relax the tolerance γ in equation (3), we obtain optimal texture tables with better quality but a worse compression ratio. As storage cost becomes cheaper and cheaper, we should focus on customized optimality, not the total compressed size. Therefore, we only compare against the standard table when we scale the fused table to a different quality level.

Table 2: The RAISE evaluation performance and cross-validation result on the Kodak dataset, compared to Hopkins' global table at Q = 95.

                      JQF vs. Hopkins18              JQF vs. Standard
Q=95                  Size    PSNR   SSIM   FSIM     Size     PSNR    SSIM    FSIM
RAISE testing set     -8.46%  1.96%  0.22%  0.05%    -25.14%  -3.10%  -0.52%  -0.28%
Kodak dataset         -7.62%  2.51%  0.23%  0.09%    -23.50%  -2.90%  -0.45%  -0.35%
Scaling at Different Quality Metric
As both Hopkins [11] and Jiang [10] report their results at different quality levels, it is clear that the default scaling equation (2), which uniformly scales the quantization values, is not a proper way to adapt a quantization table to a target quality. From Table 3, our experiments confirm that the optimal tables annealed at Q = 95 are only optimal at the trained quality level; i.e., the tables scaled to Q = 50 have a much worse Quality vs. Size trade-off than tables directly annealed at Q = 50 (the right-hand side). Since annealing texture images at every quality level is not practical, we decide to use the optimal tables annealed at Q = 50 and scale them to other qualities as our proposed solution. The optimal tables from Q = 50 maintain a good Quality vs. Size reduction trade-off at different quality levels.

Table 3: The Kodak performance impact of annealed tables scaled to different Q.

Anneal Q=95     JQF vs. Standard                        Anneal Q=50     JQF vs. Standard
                Size     PSNR    SSIM     FSIM                          Size     PSNR    SSIM    FSIM
Scale to 35     -41.73%  -9.30%  -10.22%  -2.36%        Scale to 35     -24.19%  -3.91%  -4.13%  -1.01%
Scale to 50     -38.67%  -7.99%  -7.24%   -1.82%        Scale to 50     -21.65%  -3.06%  -2.71%  -0.73%
Scale to 75     -32.57%  -5.69%  -3.30%   -0.98%        Scale to 75     -17.49%  -1.71%  -1.05%  -0.38%
Scale to 95     -23.50%  -2.90%  -0.45%   -0.35%        Scale to 95     -12.34%  -0.58%  -0.09%  -0.09%
Prediction Computational Cost
The image-specific optimization only matters if the extra computational cost is within an acceptable range. For the texture prediction, we do not need to predict on the full resolution of the given image, but on a down-sampled version of it. We used a workstation with an Intel Core i7-9700K CPU and an Nvidia GeForce RTX 2080 Ti GPU to complete our experiments, taking about 0.48 seconds with the GPU and 23.83 seconds with pure CPU computation per RAISE testing image, as shown in Table 4.
Table 4: The average texture prediction time per database, in seconds.
Prediction on a smaller resized image could further reduce the time needed for prediction, but it may not be necessary. In fact, in our experience, the computational time of the image quality metric FSIM is the crucial factor that limits the optimization process. Even so, the annealing and CNN texture training processes can be performed offline without affecting the use of the proposed JQF as a real-time JPEG optimization approach.
Conclusion
We propose a novel JPEG Quantization Table Fusion method that uses simulated annealing on texture mosaic images to search for an optimal quantization table for each texture category. A VGG-16 pre-trained CNN model is used to learn those texture features and predict the input image's texture distribution, and the optimal texture tables are then fused to produce an image-specific optimal quantization table. Our method shows superior performance on a large consumer photo dataset and generalizes well in cross-database evaluation. On the Kodak dataset with the quality setting Q = 95, our experiment shows a size reduction of 23.5% over the JPEG standard table with a slight 0.35% FSIM decrease that is visually imperceptible. The per-image optimality for JPEG encoding is achieved with less than one second of additional timing cost.

References
[1] Gregory K. Wallace, "The JPEG still picture compression standard," IEEE Transactions on Consumer Electronics, vol. 38, no. 1, pp. xviii–xxxiv, 1992.
[2] Graham Hudson, Alain Léger, Birger Niss, István Sebestyén, and Jørgen Vaaben, "JPEG-1 standard 25 years: past, present, and future reasons for a success," Journal of Electronic Imaging, vol. 27, no. 4, 2018.
[3] Independent JPEG Group, "libjpeg," https://www.ijg.org/.
[4] Duc-Tien Dang-Nguyen, Cecilia Pasquini, Valentina Conotter, and Giulia Boato, "RAISE: A raw images dataset for digital image forensics," in Proceedings of the 6th ACM Multimedia Systems Conference, 2015, pp. 219–224.
[5] "Kodak PhotoCD dataset," http://r0k.us/graphics/kodak/.
[6] Donald M. Monro and Barry G. Sherlock, "Optimum DCT quantization," in Proceedings DCC '93: Data Compression Conference. IEEE, 1993, pp. 188–194.
[7] B. G. Sherlock, A. Nagpal, and D. M. Monro, "A model for JPEG quantization," in Proceedings of ICSIPNN '94: International Conference on Speech, Image Processing and Neural Networks. IEEE, 1994, pp. 176–179.
[8] Zhou Wang, Alan C. Bovik, Hamid R. Sheikh, and Eero P. Simoncelli, "Image quality assessment: from error visibility to structural similarity," IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600–612, 2004.
[9] Lin Zhang, Lei Zhang, Xuanqin Mou, and David Zhang, "FSIM: A feature similarity index for image quality assessment," IEEE Transactions on Image Processing, vol. 20, no. 8, pp. 2378–2386, 2011.
[10] Yuebing Jiang and Marios S. Pattichis, "JPEG image compression using quantization table optimization based on perceptual image quality assessment," in 2011 Asilomar Conference on Signals, Systems and Computers. IEEE, 2011, pp. 225–229.
[11] Max Hopkins, Michael Mitzenmacher, and Sebastian Wagner-Carena, "Simulated annealing for JPEG quantization," arXiv preprint arXiv:1709.00649, 2017.
[12] Jyrki Alakuijala, Robert Obryk, Ostap Stoliarchuk, Zoltan Szabadka, Lode Vandevenne, and Jan Wassenberg, "Guetzli: Perceptually guided JPEG encoder," arXiv preprint arXiv:1703.04421, 2017.
[13] Jyrki Alakuijala et al., "Butteraugli," https://github.com/google/butteraugli, 2016.
[14] Mathilde Caron, Piotr Bojanowski, Armand Joulin, and Matthijs Douze, "Deep clustering for unsupervised learning of visual features," in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 132–149.
[15] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[16] Karen Simonyan and Andrew Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.