Reinforced Bit Allocation under Task-Driven Semantic Distortion Metrics
Jun Shi, Zhibo Chen
CAS Key Laboratory of Technology in Geo-spatial Information Processing and Application System, University of Science and Technology of China
Hefei, China, [email protected]
Abstract—Rapidly growing intelligent applications require optimized bit allocation in image/video coding to support specific task-driven scenarios such as detection, classification, segmentation, etc. Some learning-based frameworks have been proposed for this purpose due to their inherent end-to-end optimization mechanisms. However, it is still quite challenging to integrate these task-driven metrics seamlessly into the traditional hybrid coding framework. To the best of our knowledge, this paper is the first work trying to solve this challenge with a reinforcement learning (RL) approach. Specifically, we formulate the bit allocation problem as a Markovian Decision Process (MDP) and train RL agents to automatically decide the quantization parameter (QP) of each coding tree unit (CTU) for HEVC intra coding, according to task-driven semantic distortion metrics. This bit allocation scheme can maximize the semantic-level fidelity of the task, such as classification accuracy, while minimizing the bit-rate. We also employ gradient class activation map (Grad-CAM) and Mask R-CNN tools to extract task-related importance maps to help the agents make decisions. Extensive experimental results demonstrate the superior performance of our approach, achieving 43.1% to 73.2% bit-rate saving over the HEVC anchor under equivalent task-related distortions.
Index Terms—HEVC, intra coding, bit allocation, reinforcement learning
I. INTRODUCTION
In the past few years, continuous progress has been made in the field of image/video analysis and understanding, promoting many intelligent applications such as surveillance video analysis, medical image diagnosis, mobile authentication, etc. This brings out the requirement for efficient compression of image/video signals to support these intelligent applications, which have quite different distortion metrics compared with traditional image/video compression scenarios.

The development of image/video coding technology has been continuously improving the rate-distortion performance, i.e., reducing the bit-rate under the same quality or reducing the distortion at the same bit-rate. However, the problem is how to define the distortion. Generally, it can be classified into three levels of distortion metrics: Pixel Fidelity, Perceptual Fidelity, and Semantic Fidelity, according to different levels of human cognition of the image/video signal [1].
Pixel Fidelity metrics like MSE (Mean Square Error) and PSNR (Peak Signal to Noise Ratio) have been widely adopted
This work was supported in part by NSFC under Grants U1908209, 61571413, 61632001 and the National Key Research and Development Program of China 2018AAA0101400.
Fig. 1: Illustration of the proposed framework. The RL agent decides the QP of each CTU according to the task.

in many coding techniques and standards (e.g., H.264 [2], High Efficiency Video Coding (HEVC) [3], etc.), which can be easily integrated into the image/video hybrid compression framework as an in-loop metric for rate-distortion optimization, while it is obvious that pixel fidelity metrics cannot fully reflect the human perceptual viewing experience [4]. Therefore, many researchers have developed
Perceptual Fidelity metrics to investigate objective metrics measuring human subjective viewing experience [5], [6], such as the saliency-guided PSNR [7]. Based on this new metric, many approaches [8]–[15] optimized the bit allocation scheme to provide better perceptual quality. The common way to achieve this goal in these methods is to heuristically allocate relatively more bits to region-of-interest (ROI) areas to ensure acceptable quality in those regions.

With the development of the aforementioned intelligent applications, image/video signals will be captured and processed not only for human eyes, but also for semantic analyses. Consequently, there will be more requirements on research for
Semantic Fidelity metrics to study the semantic difference (e.g., the accuracy difference of specific intelligent tasks) between the original image/video and the compressed one.

These various distortion metrics measure the quality of reconstructed image/video content, but it is a contradiction that most of these complicated, high-performance quality metrics cannot be integrated easily into existing image/video compression frameworks. Some approaches [16], [17] adjust encoding parameters (e.g., quantization parameters (QPs)) heuristically according to the embedded quality metrics, but they are still heuristic solutions without the ability to automatically and adaptively optimize encoding configurations according to different complicated distortion metrics. [1] is the first scheme trying to solve this challenge by designing an end-to-end image coding framework which inherently provides feasibility for integrating complicated metrics into the coding loop. But it is still a huge challenge for traditional hybrid coding frameworks, which are not differentiable like the end-to-end image coding framework.

In this paper, we try to solve this problem by using reinforcement learning (RL) to optimize the bit allocation scheme for HEVC intra coding according to task-driven distortion metrics, as Fig. 1 shows. We formulate the bit allocation scheme, i.e., deciding the QP of each CTU in sequence, as a Markovian Decision Process (MDP), and then introduce RL to decide the QP of each coding tree unit (CTU) to provide a better bit allocation scheme for different vision tasks including classification, detection and segmentation. For a specific task, importance maps of the original frames are generated using gradient class activation map (Grad-CAM) [18] and Mask R-CNN [19] tools, which can help the agents make better decisions. In order to train the RL agents efficiently, we establish a universal task-driven bit allocation dataset. Using this dataset, off-line training can be efficient.
With our scheme, we achieve a bit-rate reduction from 43.1% to 73.2% with comparable task-driven distortion, depending on the type of task. This shows that the proposed scheme is effective and efficient.

II. TASK-DRIVEN BIT ALLOCATION FRAMEWORK
A. Importance Map Generation
Although it is possible to train RL agents to directly output the bit allocation scheme based merely on the frame itself, this would make the agents hard to train and lack generalization and extendibility. Therefore, we employ existing approaches to generate importance maps, which help and ease the agents' decision making, as shown in Fig. 2.

First, we apply Grad-CAM, which backpropagates gradients into the final convolutional layer of a CNN model to produce a localization map highlighting the regions important for predicting the concept. The specific CNN model we adopt is VGG-16 [20]; thus we can obtain the importance maps of the frames for the classification task.

Then, we employ Mask R-CNN to get the detection and segmentation results of the original frames. This helps the agents discriminate the foreground from the background, and further obtain the density distribution of the instances. These importance maps can be used for the detection and segmentation tasks. Based on this pre-processed information, the RL agents are able to make QP decisions precisely and accurately.
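As an illustration of how a pixel-level importance map could be reduced to one score per CTU (what TABLE I calls a "mask ratio"), here is a minimal numpy sketch. This is our own illustrative code, not the paper's implementation; the function name and zero-padding of partial CTUs are assumptions.

```python
import numpy as np

CTU_SIZE = 64  # HEVC CTU size in luma samples (assumed here; HEVC allows up to 64x64)

def ctu_importance(importance_map: np.ndarray, ctu_size: int = CTU_SIZE) -> np.ndarray:
    """Average a pixel-level importance map (values in [0, 1]) over each CTU.

    Returns one score per CTU in raster order (left-to-right, top-to-bottom),
    matching the HEVC CTU scan. The frame is zero-padded on the right/bottom
    so partial border CTUs are handled (their mean is diluted by the padding).
    """
    h, w = importance_map.shape
    rows = -(-h // ctu_size)  # ceiling division
    cols = -(-w // ctu_size)
    padded = np.zeros((rows * ctu_size, cols * ctu_size), dtype=np.float64)
    padded[:h, :w] = importance_map
    # View as (rows, ctu, cols, ctu) blocks and average within each block.
    blocks = padded.reshape(rows, ctu_size, cols, ctu_size)
    return blocks.mean(axis=(1, 3))
```

The same pooling could be applied to a binary Mask R-CNN foreground mask to obtain the per-CTU mask ratios used by the agent.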
B. Reinforcement Learning for Bit Allocation
After getting the importance maps of the frames, a simplemethod to optimize towards the task-driven distortion metrics
Fig. 2: Importance map generation.

is to heuristically increase the bit-rate for the highly weighted areas, such as with a threshold scheme. However, these heuristic methods can hardly achieve optimal results and may introduce limitations, because the results rely heavily on handcrafted designs. Recently, RL has achieved outstanding performance in many tasks, especially in unsupervised or semi-supervised scenarios. It has also been used in several approaches [21]–[24] to optimize the traditional hybrid coding framework. In this paper, we adopt the RL algorithm Deep Q-learning (DQN) [25] to solve this problem.
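For readers unfamiliar with DQN [25], the core of its learning step is a regression toward bootstrapped targets. The following sketch is a generic DQN target computation, not code from this paper; the discount factor `gamma` and the batch layout are our assumptions.

```python
import numpy as np

def dqn_targets(rewards, next_q, dones, gamma=0.99):
    """One DQN learning step's regression targets (Mnih et al. [25]):
        y = r                                   if the episode ended
        y = r + gamma * max_a' Q_target(s', a') otherwise
    Here an episode could be one frame, ending at its last CTU.
    next_q: array of shape (batch, num_actions) from the target network.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    dones = np.asarray(dones, dtype=bool)
    bootstrap = gamma * next_q.max(axis=1)  # value of the best next action
    bootstrap[dones] = 0.0                  # no bootstrapping past episode end
    return rewards + bootstrap
```

The online Q-network is then fit to these targets for the actions actually taken, as in standard DQN training.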
1) Problem formulation:
We formulate the bit allocation problem as an MDP, which includes four elements: state, action, reward and policy (agent). In this process, DQN trains an agent that observes states from the environment and then executes a series of actions (QP decisions) in order to optimize the goal. Generally, the goal can be described as the maximization of the expected cumulative reward. In bit allocation, the specific goal is, given a frame that consists of i CTUs, to generate i QPs, one for each CTU, so that we achieve the largest bit-rate reduction under the least task-driven distortion. All elements of the MDP are detailed below.
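The "expected cumulative reward" above can be made concrete with a short sketch: given the per-CTU rewards collected over one frame, the discounted return at each step is computed backwards. This is standard RL bookkeeping, not the authors' code, and the discount factor value is an assumption.

```python
def discounted_return(rewards, gamma=0.99):
    """Cumulative (discounted) reward the agent maximizes:
        G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ...
    computed backwards over one episode (one frame = one QP decision per CTU).
    Returns the return G_t for every step t.
    """
    g = 0.0
    returns = []
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return list(reversed(returns))
```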
2) State:
The agent needs to observe the CTU and then make the QP decision. The CTU information is sent to the agent in encoding order, from left to right and top to bottom. In this paper, the luminance and the importance map of the current CTU form part of the state. We also include a 15-d feature vector in the state to reflect the global information. The details of the feature vector are shown in TABLE I.
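A minimal sketch of assembling this two-part state (our own illustrative code; the 64x64 CTU shape and the depth-wise stacking are assumptions based on the paper's Q-network description):

```python
import numpy as np

def build_state(luma_ctu, importance_ctu, global_vec):
    """Assemble the DQN state: the CTU luminance and its importance map
    stacked depth-wise (local branch), plus the 15-d global information
    vector of TABLE I (global branch)."""
    luma_ctu = np.asarray(luma_ctu, dtype=np.float32)
    importance_ctu = np.asarray(importance_ctu, dtype=np.float32)
    assert luma_ctu.shape == importance_ctu.shape == (64, 64)
    assert len(global_vec) == 15
    local = np.stack([luma_ctu, importance_ctu], axis=-1)  # (64, 64, 2)
    return local, np.asarray(global_vec, dtype=np.float32)
```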
3) Action:
The original QP of HEVC ranges from 1 to 51. A lower QP means more bit-rate, and more bit-rate means less distortion. However, we do not need all QPs in this paper. When the bit-rate is high enough, there is almost no semantic distortion, which means more bit-rate is useless, although it can still improve the pixel-level performance, such as PSNR. In our experiments, we observe that the semantic distortion becomes negligible compared with the original uncompressed images once QP reaches 22. Therefore, we set the action space as QP 22 to 51 in this paper.

Fig. 3: Structure of the proposed Q-network. There are two input branches: the current CTU part and the global information part. The luminance and importance map of the current block are concatenated in depth to represent the local information.

TABLE I: Global Information Vector

  index | vector components
  ------|-------------------------------------
  1     | number of overall CTUs
  2     | index of current CTU
  3     | mask ratio of current CTU
  4-7   | mask ratio of neighboring CTUs
  8     | mask ratio of overall frame
  9     | instance number of current CTU
  10-13 | instance number of neighboring CTUs
  14-15 | QPs of left and above CTUs
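The action space is therefore 30 discrete QPs. The mapping between a DQN action index and an HEVC QP is trivial but worth pinning down; the sketch below is ours, with the 0-based action indexing as an assumption.

```python
QP_MIN, QP_MAX = 22, 51  # action space from the paper: 30 discrete QPs

def action_to_qp(action: int) -> int:
    """Map a DQN action index (0..29) to an HEVC QP in [22, 51]."""
    if not 0 <= action <= QP_MAX - QP_MIN:
        raise ValueError("action index out of range")
    return QP_MIN + action

def qp_to_action(qp: int) -> int:
    """Inverse mapping, e.g. for building training targets."""
    if not QP_MIN <= qp <= QP_MAX:
        raise ValueError("QP outside the action space")
    return qp - QP_MIN
```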
4) Reward:
The cumulative reward is the optimization goal of the RL agent. Like the rate-distortion cost in HEVC coding, the designed reward has a similar format:

    reward = λ · BPP_save − Distortion_task    (1)

where BPP_save is the reduction of bits per pixel (BPP) in the current CTU from the anchor of QP = 22 to the decided QP, Distortion_task is a penalty term for the semantic difference of the task result (e.g., the difference in the number of detected instances for the detection task), and λ is a Lagrange factor balancing the BPP reduction against the risk of semantic loss.
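Eq. (1) translates directly into code. The sketch below is illustrative; the value of the Lagrange factor `lam` is an assumption, since the paper does not report it.

```python
def ctu_reward(bpp_anchor, bpp_decided, task_distortion, lam=1.0):
    """Per-CTU reward of Eq. (1): reward = lam * BPP_save - Distortion_task.

    bpp_anchor      BPP of the CTU encoded at the anchor QP = 22
    bpp_decided     BPP of the CTU at the QP the agent chose
    task_distortion semantic penalty for this decision (task dependent)
    lam             Lagrange factor trading bit saving against semantic loss
                    (value assumed; not reported in the paper)
    """
    bpp_save = bpp_anchor - bpp_decided
    return lam * bpp_save - task_distortion
```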
5) Agent:
The agent is a Q-network used for optimal QP prediction. Taking the state s_t as input, the Q-network outputs the discounted cumulative reward (Q-value) of each action a as Q(s_t, a). We obtain the optimal action a*_t as:

    a*_t = argmax_a Q(s_t, a)    (2)

For the Q-network structure, there are two input branches: the current CTU part and the global feature vector. The CTU information flows through four convolutional layers to extract features, which are then concatenated with the global feature vector after its dimension has been raised. The combination of these features helps the network better understand the content. Next, the overall features flow through three fully-connected layers: two hidden layers and one output layer. All convolutional layers and hidden fully-connected layers are activated with the Leaky Rectified Linear Unit (LeakyReLU), while the output layer has no activation. Fig. 3 shows the details of the proposed Q-network.

III. EXPERIMENTAL RESULTS AND ANALYSIS
In this section, we introduce the dataset and the specific experimental details for task-driven bit allocation.
A. Task-driven Bit Allocation (TBA) Dataset
As we employ RL agents (CNN models) as the bit allocation predictors, we need a large amount of training data, so we establish a universal dataset for Task-driven Bit Allocation, namely the TBA dataset. First, for detection and segmentation, we collect 1000 images from the TUD-Brussels pedestrian dataset [26]. For classification, we select 2400 high resolution images (larger than 576 × 576) from ImageNet [27]. For each image, QPs from 22 to 51 are applied for encoding. During encoding, we collect the BPP and MSE of every CTU for further usage.

After encoding, we get the reconstructions of all images, which are then sent to the vision task CNNs to obtain the semantic results. We use VGG-16 for classification and Mask R-CNN for detection and segmentation. Finally, the TBA dataset is obtained, which is randomly divided into training (80%) and test (20%) sets.
B. Configuration of Experiments
In this part, we implement our approach in the HEVC reference software HM 16.9, using the all-intra main configuration. For comparison, we run fixed QP = 22 as the benchmark, since the semantic distortion under this QP is negligible. In addition, two fixed-QP experiments without bit allocation, at the equivalent bit-rate and at the equivalent distortion of the proposed approach, are also conducted. After coding, we send the reconstructions to the task CNNs to measure the semantic task-driven distortions.

Fig. 4: Detailed comparison between the proposed method and the original HEVC. The first, third and fourth columns are encoded with fixed QP, while the second column is the result using our task-driven bit allocation.

This distortion is defined through semantic similarity, i.e., the difference of outputs compared with the benchmark frames of QP = 22. For classification on ImageNet, the top-5 accuracy is adopted, while the number of accurately detected instances is used for the detection task. The intersection over union (IoU) is measured for the segmentation task. Fig. 4 and TABLE II show the results of our experiments.
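For the segmentation metric, a minimal IoU implementation over binary masks looks as follows. This is a generic sketch of the standard definition, not the authors' evaluation code; the convention of returning 1.0 for two empty masks is our assumption.

```python
import numpy as np

def mask_iou(mask_a: np.ndarray, mask_b: np.ndarray) -> float:
    """Intersection over union between two boolean segmentation masks,
    e.g. comparing Mask R-CNN output on a reconstruction against the
    QP = 22 benchmark. Returns 1.0 for two empty masks by convention."""
    a, b = mask_a.astype(bool), mask_b.astype(bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 1.0
    return float(np.logical_and(a, b).sum() / union)
```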
C. Performance Evaluation and Analysis
Assessment of bit-rate reduction.
It is interesting to investigate how many bits can be saved when applying our approach compared with the original HEVC. Taking QP = 22 as the benchmark, we compare the BPP of our scheme with the fixed-QP scheme that has similar semantic distortion (Baseline* in TABLE II). For the classification task, we save a further 43.1% of bits (relative ratio) over this fixed-QP strategy under the same distortion. Similarly, our approach saves 73.2% of bits for pedestrian detection, while for segmentation, more bits must be used to preserve the pixels of the foreground, especially the boundaries, so only 58.6% of bits are saved, less than for the detection task. The RL agent thus uses as few bits as possible while minimizing the task-driven distortion.
Evaluation of task-driven distortion.
We also investigate how our approach preserves semantic fidelity, i.e., minimizes the task-driven distortion using the same bits. We choose the fixed-QP HEVC encoding at the equivalent BPP (Baseline in TABLE II) to examine the effectiveness of our proposed scheme. Our bit allocation strategy leads to much less semantic distortion in all tasks. For classification, our approach spends more bits on the important regions, such as the face of the fox in Fig. 4. Consequently, the task-driven distortion is much smaller. Similarly, for detection and segmentation, our approach better preserves the quality of foregrounds. Our RL agent spends the bits rationally according to the tasks.

TABLE II: Bit-Rate Reduction Ratio and Task-Driven Distortion Compared with QP = 22

                 |   Proposed   |   Baseline   |  Baseline*
  Vision task    |  BR    DIST  |  BR    DIST  |  BR    DIST
  ---------------|--------------|--------------|-------------
  Classification | 85.2%  3.7%  | 87.2%  12.2% | 74.0%  3.5%
  Detection      | 80.2%  2.7%  | 79.3%  18.6% | 26.2%  2.2%
  Segmentation   | 66.2%  6.8%  | 66.4%  11.9% | 18.5%  6.8%

  Baseline and Baseline* indicate the equivalent bit-rate and equivalent distortion schemes, respectively. Bit-rate reduction ratio and semantic distortion are abbreviated as BR and DIST in this table.

IV. CONCLUSION
In this paper, we propose an automatically optimized task-driven bit allocation scheme for HEVC intra coding using reinforcement learning. We formulate the sequential QP decision for each CTU as a Markovian Decision Process and then train RL agents to integrate vision task-driven distortion metrics into HEVC. Benefiting from the Grad-CAM and Mask R-CNN tools, we obtain the task-related importance maps of the original frames before bit allocation. These importance maps are of great help to the RL agents. In addition, we establish a TBA dataset, with which off-line training can be efficient. Compared with the original HEVC, our approach saves about 43% to 73% of bits under equivalent task-driven distortion. In future work, we will consider extending our scheme to inter mode for further optimization.
REFERENCES

[1] Z. Chen and T. He, "Learning based facial image compression with semantic fidelity metric," Neurocomputing, 2019.
[2] T. Wiegand, G. J. Sullivan, G. Bjontegaard, and A. Luthra, "Overview of the H.264/AVC video coding standard," IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, no. 7, pp. 560–576, July 2003.
[3] G. J. Sullivan, J. Ohm, W. Han, and T. Wiegand, "Overview of the High Efficiency Video Coding (HEVC) standard," IEEE Transactions on Circuits and Systems for Video Technology, vol. 22, no. 12, pp. 1649–1668, Dec 2012.
[4] Z. Wang and A. C. Bovik, "Mean squared error: Love it or leave it? A new look at signal fidelity measures," IEEE Signal Processing Magazine, vol. 26, no. 1, pp. 98–117, 2009.
[5] Z. Chen, N. Liao, X. Gu, F. Wu, and G. Shi, "Hybrid distortion ranking tuned bitstream-layer video quality assessment," IEEE Transactions on Circuits and Systems for Video Technology, vol. 26, no. 6, pp. 1029–1043, 2015.
[6] W. Zhou, Z. Chen, and W. Li, "Dual-stream interactive networks for no-reference stereoscopic image quality assessment," IEEE Transactions on Image Processing, 2019.
[7] Z. Li, S. Qin, and L. Itti, "Visual attention guided bit allocation in video compression," Image and Vision Computing, vol. 29, no. 1, pp. 1–14, 2011.
[8] A. Prakash, N. Moran, S. Garber, A. DiLillo, and J. Storer, "Semantic perceptual image compression using deep convolution networks," IEEE, 2017, pp. 250–259.
[9] H. Hadizadeh and I. V. Bajić, "Saliency-aware video compression," IEEE Transactions on Image Processing, vol. 23, no. 1, pp. 19–33, 2013.
[10] M. T. Khanna, K. Rai, S. Chaudhury, and B. Lall, "Perceptual depth preserving saliency based image compression," in Proceedings of the 2nd International Conference on Perception and Machine Intelligence. ACM, 2015, pp. 218–223.
[11] C. Guo and L. Zhang, "A novel multiresolution spatiotemporal saliency detection model and its applications in image and video compression," IEEE Transactions on Image Processing, vol. 19, no. 1, pp. 185–198, 2009.
[12] R. Leung and D. Taubman, "Perceptual optimization for scalable video compression based on visual masking principles," IEEE Transactions on Circuits and Systems for Video Technology, vol. 19, no. 3, pp. 309–322, 2009.
[13] S. Li, M. Xu, Y. Ren, and Z. Wang, "Closed-form optimization on saliency-guided image compression for HEVC-MSP," IEEE Transactions on Multimedia, vol. 20, no. 1, pp. 155–170, 2017.
[14] S. Zhu and Z. Xu, "Spatiotemporal visual saliency guided perceptual high efficiency video coding with neural network," Neurocomputing, vol. 275, pp. 511–522, 2018.
[15] H. Oh and W. Kim, "Video processing for human perceptual visual quality-oriented video coding," IEEE Transactions on Image Processing, vol. 22, no. 4, pp. 1526–1535, 2012.
[16] J. Alakuijala, R. Obryk, O. Stoliarchuk, Z. Szabadka, L. Vandevenne, and J. Wassenberg, "Guetzli: Perceptually guided JPEG encoder," arXiv preprint arXiv:1703.04421, 2017.
[17] D. Liu, D. Wang, and H. Li, "Recognizable or not: Towards image semantic quality assessment for compression," Sensing and Imaging, vol. 18, no. 1, p. 1, 2017.
[18] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, "Grad-CAM: Visual explanations from deep networks via gradient-based localization," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 618–626.
[19] K. He, G. Gkioxari, P. Dollár, and R. Girshick, "Mask R-CNN," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2961–2969.
[20] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
[21] J.-H. Hu, W.-H. Peng, and C.-H. Chung, "Reinforcement learning for HEVC/H.265 intra-frame rate control," IEEE, 2018, pp. 1–5.
[22] N. Li, Y. Zhang, L. Zhu, W. Luo, and S. Kwong, "Reinforcement learning based coding unit early termination algorithm for high efficiency video coding," Journal of Visual Communication and Image Representation, vol. 60, pp. 276–286, 2019.
[23] L. Costero, A. Iranfar, M. Zapater, F. D. Igual, K. Olcoz, and D. Atienza, "MAMUT: Multi-agent reinforcement learning for efficient real-time multi-user video transcoding," IEEE, 2019, pp. 558–563.
[24] A. Iranfar, M. Zapater, and D. Atienza, "Machine learning-based quality-aware power and thermal management of multistream HEVC encoding on multicore servers," IEEE Transactions on Parallel and Distributed Systems, vol. 29, no. 10, pp. 2268–2281, 2018.
[25] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, "Playing Atari with deep reinforcement learning," arXiv preprint arXiv:1312.5602, 2013.
[26] C. Wojek, S. Walk, and B. Schiele, "Multi-cue onboard pedestrian detection," IEEE, 2009, pp. 794–801.
[27] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in 2009 IEEE Conference on Computer Vision and Pattern Recognition.