AdaCompress: Adaptive Compression for Online Computer Vision Services
Hongshan Li
Tsinghua-Berkeley Shenzhen Institute, Tsinghua University
[email protected]

Yu Guo
Graduate School at Shenzhen, Tsinghua University
[email protected]

Zhi Wang ∗
Graduate School at Shenzhen, Tsinghua University; Peng Cheng Laboratory
[email protected]

Shutao Xia
Graduate School at Shenzhen, Tsinghua University
[email protected]

Wenwu Zhu ∗
Tsinghua-Berkeley Shenzhen Institute; Department of Computer Science and Technology, Tsinghua University
[email protected]
ABSTRACT
With the growth of computer vision based applications and services, an explosive amount of images have been uploaded to cloud servers that host such computer vision algorithms, usually in the form of deep learning models. JPEG has been used as the de facto compression and encapsulation method before one uploads the images, due to its wide adaptation. However, the standard JPEG configuration does not always perform well for compressing images that are to be processed by a deep learning model, e.g., the standard quality level of JPEG leads to 50% size overhead (compared with the best quality level selection) on ImageNet under the same inference accuracy in popular computer vision models including InceptionNet, ResNet, etc. Knowing this, designing a better JPEG configuration for online computer vision services is still extremely challenging: 1) cloud-based computer vision models are usually a black box to end-users, so it is difficult to design a JPEG configuration without knowing their model structures; 2) the JPEG configuration has to change when different users use it. In this paper, we propose a reinforcement learning based JPEG configuration framework. In particular, we design an agent that adaptively chooses the compression level according to the input image's features and the backend deep learning models. We then train the agent in a reinforcement learning manner to adapt it to different deep learning cloud services, which act as the interactive training environment, feeding back a reward that comprehensively considers accuracy and data size. In our real-world evaluation on Amazon Rekognition, Face++ and Baidu Vision, our approach can reduce the size of images by 1/2 – 1/3 while the overall classification accuracy only decreases slightly.
∗ Corresponding authors.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

MM '19, October 21–25, 2019, Nice, France
© 2019 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 978-1-4503-6889-6/19/10...$15.00
https://doi.org/10.1145/3343031.3350874
CCS CONCEPTS
• Networks → Network components; • Computer systems organization → Real-time systems.

KEYWORDS
edge computing; reinforcement learning; data compression; online computer vision services
ACM Reference Format:
Hongshan Li, Yu Guo, Zhi Wang, Shutao Xia, and Wenwu Zhu. 2019. AdaCompress: Adaptive Compression for Online Computer Vision Services. In Proceedings of the 27th ACM International Conference on Multimedia (MM '19), October 21–25, 2019, Nice, France. ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/3343031.3350874
1 INTRODUCTION

With the great success of deep learning in computer vision, this decade has witnessed an explosion of deep learning based computer vision applications. Because of the huge computational resource consumption of deep learning applications (e.g., inferring an image on VGG19 [37] requires 20 GFLOPS of GPU resources), in today's computer vision applications users usually have to upload the input images to central cloud service providers (e.g., SenseTime, Baidu Vision and Google Vision, etc.), leading to a significant uploading traffic burden. For example, a picture taken by a cellphone at a resolution of 3968 × [...] quality level in the default JPEG configuration; by retraining it on the original dataset, one can compress an image to a smaller version while maintaining the inference accuracy for a fixed deep computer vision algorithm. We then raise an intuitive question: to make it practically useful, can we improve the JPEG configuration adaptively for different cloud computer vision services, without any pre-knowledge of the original model and dataset?

Our answer to this question is a new learning-based compression methodology for today's cloud computer vision services. We tackle the following challenges in our design.

• Lack of information about the cloud computer vision models.
Different from the studies [15, 27, 42], in which the computer vision models are available so that one can adjust the JPEG configuration according to the model structure or retrain its parameters (e.g., one can greedily search by gradient descent for an optimal compression level in JPEG), in our study the details of the online cloud computer vision model are inaccessible.

• Different cloud computer vision models need different JPEG configurations.
As an adaptive JPEG configuration solution, we aim to provide a solution that is adaptive to different cloud computer vision services, i.e., one that can generate JPEG configurations for different models. However, today's cloud computer vision algorithms, based on deep and convolutional computations, are quite hard to interpret. The same compression level can lead to totally different accuracy performance. Some examples are shown in Figure 1: pictures 1a and 1b, and 2a and 2b, are visually similar to human beings, but the deep learning models give different inference results, only because the images are compressed at different quality levels. Such a relationship is not apparent, e.g., picture 3b is highly compressed and looks destroyed compared to picture 3a, but the deep learning model can still recognize it. This phenomenon is also presented in [7] and is commonly seen in adversarial neural network research [10, 43].

• Lack of well-labeled training data.
In our problem, one is not provided well-labeled data indicating which image should be compressed to which quality level, as in conventional supervised deep learning tasks. In practice, such an image compression module is usually utilized in an online manner, and the solution has to learn from the images it uploads automatically.

To address the above challenges, we present a deep reinforcement learning (DRL) based solution, AdaCompress, to choose the proper compression level for an image sent to a computer vision model on the cloud, in an online manner. We open-sourced our JPEG configuration module (https://github.com/hosea1008/AdaCompress) that works with today's cloud computer vision APIs upon acceptance of this paper. In particular, our contributions are summarized as follows:

• First, we design an interactive training environment that can be applied to different computer vision cloud services at different times; then we propose a deep Q neural network agent to evaluate and predict the performance of a compression level on an input image. In real-world application scenarios, this deep Q neural network can be run highly efficiently on today's edge infrastructures (e.g., Google Edge TPU [14], Huawei Atlas 500 edge station [21]).

• Second, we build a reinforcement learning framework to train the deep Q network in the above environment. By feeding the agent with a carefully designed reward comprehensively considering accuracy and data size, the agent can learn to choose a proper compression level for an input image after iteratively interacting with the environment. To make the solution adaptive to changing input images, we propose an explore-exploit mechanism to adapt the agent to different "scenery" online. After deploying the deep Q agent, an inference-estimate-retrain mechanism is designed to restart the training procedure once the scenery changes and the existing running Q agent can no longer guarantee stable accuracy performance.

• Finally, we provide analysis and insights on our design. We analyze the Q network's behavior by introducing Grad-Cam [36], and we explain why the Q network chooses a specific compression level, providing some general patterns. Generally speaking, images that contain large smooth areas are more sensitive to compression, while images with complex textures are more robust to compression when shown to deep learning models. We evaluate our system on some of the most popular cloud deep learning services, including Amazon Rekognition [2], Face++ [11] and Baidu Vision [5], and show that our design can reduce the uplink traffic load by up to 1/2 while maintaining comparable overall accuracy.

Figure 1: The prediction of a deep learning model is not completely related to the input image's quality, making it difficult to use a fixed compression quality for all images. Panels: (1a) Q=75, Face++ prediction = ["donut"]; (1b) Q=55, Face++ prediction = []; (2a) Q=75, Baidu prediction = ["chameleon"]; (2b) Q=55, Baidu prediction = ["electric fan"]; (3a) Q=75, Baidu prediction = ["leopard"]; (3b) Q=5, Baidu prediction = ["leopard"]. For images 1a, 1b and 2a, 2b, minor changes cause different predictions even though they are visually similar; for images 3a and 3b, the cloud model still outputs the correct label from a severely compressed image even though they look very different.

The rest of this paper is organized as follows. We present our framework and detailed design in Sec. 2. In Sec. 3 we present our solution's performance. We discuss related works in Sec. 4 and conclude the paper in Sec. 5.
Figure 2: Compared to the conventional solution ((a) a fixed, user-defined compression level), our solution ((b) AdaCompress: input image and model aware compression, with reward feedback) can update the compression strategy based on the backend model's feedback.
A brief framework of AdaCompress is shown in Figure 2. Briefly, it is a DRL (deep reinforcement learning) based system that trains an agent to choose the proper quality level c at which an image should be compressed by JPEG. We discuss the formulation, agent design, reinforcement learning framework, reward feedback, and retrain mechanism separately in the following subsections, and provide experimental details of all the hyperparameters in Sec. 3.

Without loss of generality, we denote the cloud deep learning service as ȳ_i = M(x_i): it provides a predicted result list ȳ_i for each input image x_i, and it has a baseline output ȳ_ref = M(x_ref) for each reference input x ∈ X_ref. We use this ȳ_ref as the ground-truth labels, and for each image x_c compressed at quality c we have ȳ_c = M(x_c). Therefore, we obtain an accuracy metric A_c by comparing ȳ_ref and ȳ_c. To be general, we use the top-5 accuracy as the following A, the same as the classification metric of ILSVRC2012 [29]:

$$A = \frac{1}{\mathrm{length}(\vec{y}_{\mathrm{ref}})} \sum_{k} \max_{j} d(l_j, g_k), \quad l_j \in \vec{y}_c,\ j = 1, \dots, 5, \quad g_k \in \vec{y}_{\mathrm{ref}},\ k = 1, \dots, \mathrm{length}(\vec{y}_{\mathrm{ref}})$$

$$d(x, y) = \begin{cases} 1 & \text{if } x = y \\ 0 & \text{otherwise} \end{cases}$$

which means that if any one of the top-5 predicted labels matches one of the predictions from ȳ_ref, it is regarded as a correct prediction. To be general, we stipulate that for a cloud deep learning service we cannot get the deep model's in-layer details (e.g., softmax probabilities); therefore we use a binary hard label d(x, y) ∈ {0, 1} to evaluate the accuracy.

We also denote JPEG compression as f_ic = J(x_i, c): for an input image x_i and a given compression quality c, it outputs a compressed file f_ic of size s_ic; for a reference compression level c_ref, the compressed file size is s_ref.
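As a concrete reading of the metric, the following is a minimal Python sketch of the relative top-5 accuracy A and the compression ratio Δs (function names are illustrative, not taken from the released code):

```python
# Sketch of the relative top-5 accuracy metric A and the compression
# ratio Δs described above; names are illustrative assumptions.

def relative_top5_accuracy(y_c, y_ref):
    """A: fraction of reference labels matched by any of the top-5
    predictions for the compressed image (binary hard label d)."""
    if not y_ref:
        return 0.0
    top5 = y_c[:5]
    hits = sum(1 for g in y_ref if g in top5)  # sum over k of max_j d(l_j, g_k)
    return hits / len(y_ref)

def compression_ratio(s_c, s_ref):
    """Δs = s_c / s_ref: compressed size relative to the reference size."""
    return s_c / s_ref
```
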
Besides, images input from a specific location usually belong to a particular contextual group. For example, in an indoor scene, the user input is less likely to contain images of the ocean, airplanes, or dolphins, and more likely to contain furniture and so on. Therefore, the agent at one place does not need to know the contextual features of all places. We formulate this as a contextual group X. This contextual grouping concept is also discussed in [18].

Initially, the agent tries different compression levels c_min < c < c_max, c ∈ N, to obtain a compressed image x_c from the input image x; an image compressed at a reference level c_ref is also uploaded to the cloud to obtain ȳ_ref. Comparing the two uploaded instances {x, x_c} and the cloud recognition results {ȳ_ref, ȳ_c}, we obtain the reference file size s_ref and the compressed file size s_c, and therefore the compression ratio Δs = s_c / s_ref and the accuracy metric A_c.

The DRL agent is expected to give a proper compression level c that minimizes the file size s_c while keeping the accuracy A. For the DRL agent, the input features are continuous numerical vectors and the expected outputs are discrete quality levels c; therefore we can use a DQN (Deep Q Network) [28] as the DRL agent. But a naive DQN cannot work well in this task because of the following challenges:

• The state space of reinforcement learning is too large; to preserve enough detail, we would have to add many layers and nodes to the neural network, making the DRL agent extremely difficult to converge.

• It takes a long time to train one step in a large inference neural network, making the training process too time-consuming.

• DRL starts training from random trials and only starts learning after it finds a better reward feedback. When training from a randomly initialized neural network, the reward feedback is very sparse, making it difficult for the agent to learn.

To address these challenges, we use the early layers of a well-trained neural network to extract the structural information of an input image, a commonly used strategy in training deep neural networks [12, 30]. Instead of training a DRL agent directly from the input image, we use a small pre-trained neural network to extract features from the input image, reducing the input dimension and accelerating the training procedure. In this work, we use the early convolution layers of MobileNetV2 [34] as the image feature extractor
E(·), for its efficiency in image classification and its light weight. The Q network ϕ is connected to the feature extractor's last convolution layer, so the output of E is the input of ϕ. We update the RL agent's policy by changing the parameters of the Q network ϕ while the feature extractor E remains fixed.

In a specific scenery where the user input x belongs to context group X, we define the contextual information X, together with the backend cloud model M, as the emulator environment {X, M} of the reinforcement learning problem. Based on this insight, we formulate the feature extractor's output E(J(X, c)) as states, and the compression quality c as discrete actions. In our system, to accelerate training, we define 10 discrete actions to indicate 10 quality levels of JPEG, ranging over 5, 15, ..., 95. We denote the action-value function as Q(ϕ(E(f_t)), c; θ); then the optimal compression level at time t is

$$c_t = \arg\max_c Q(\phi(\mathcal{E}(f_t)), c; \theta)$$

where θ indicates the parameters of the Q network ϕ. In this reinforcement learning formulation, the training phase minimizes a loss function

$$L_i(\theta_i) = \mathbb{E}_{s, c \sim \rho(\cdot)}\left[\left(y_i - Q(s, c; \theta_i)\right)^2\right]$$

that changes at each iteration i, where s = E(f_t), and

$$y_i = \mathbb{E}_{s' \sim \{\mathcal{X}, M\}}\left[r + \gamma \max_{c'} Q(s', c'; \theta_{i-1}) \mid s, c\right]$$

is the target for iteration i, r is the feedback reward, and ρ(s, c) is a probability distribution over sequences s and quality levels c [28]. By minimizing the distance between the action-value function's output Q(·) and the target y_i, the action-value function Q(·) outputs a more accurate estimation of an action.

This formulation is similar to the DQN problem but not identical. Different from conventional reinforcement learning, the interactions between the agent and the environment are infinite; there is no signal from the environment indicating that an episode has finished. Therefore, we train the RL agent intermittently at a manual interval T, once the condition t ≥ T_start guarantees that there are enough transitions in the memory buffer D. In the training phase, the RL agent first takes some random trials to observe the environment's reaction, and we decrease the randomness during training. All transitions are saved into a memory buffer queue D; the agent learns to optimize its action by minimizing the loss function L on a minibatch sampled from D. The training procedure converges as the agent's randomness keeps decaying. Finally, the agent's action is based on its historical optimal experiences. The training procedure is presented in Algorithm 1; we list the parameters in Sec. 3.

Algorithm 1
Training the RL agent ϕ in environment {X, M}

  Initialize replay memory queue D to capacity N
  Initialize action-value function Q with random weights θ
  Initialize sequence s_1 = E(J(x_1, c)), x ∈ X, and ϕ_1 = ϕ(f_1)
  for t = 1, ..., K do
      With probability ϵ select a random compression level c_t,
        otherwise select c_t = argmax_c Q(ϕ(E(f_t)), c; θ)
      Compress image x_t at quality c_t and upload it to the cloud
        to get results (ȳ_ref, ȳ_c) and calculate reward r = R(Δs, A_c)
      Set s_{t+1} = s_t, generate c_t, x_{t+1}, and preprocess ϕ_{t+1} = ϕ(E(f_{t+1}))
      Store transition (ϕ_t, c_t, r_t, ϕ_{t+1}) in D
      if t mod T == 0 and t ≥ T_start then
          Sample a random minibatch of transitions (ϕ_j, c_j, r_j, ϕ_{j+1}) from memory buffer D
          Set y_j = r_j + γ max_{c'} Q(ϕ_{j+1}, c'; θ)
          Decay exploration rate: ϵ ← µ_dec · ϵ if µ_dec · ϵ > ϵ_min, else ϵ ← ϵ_min
          Perform a gradient descent step on (y_j − Q(ϕ_j, c_j; θ))^2 according to [28]
      end if
  end for

In our solution, the agent is trained by the reward feedback from the environment {X, M}. In the above formulation, we defined the compression ratio Δs = s_c / s_ref and the accuracy metric A_c at compression quality c. Basically, we want the agent to choose a proper compression level that minimizes the file size while retaining acceptable accuracy; therefore the overall reward r should be in proportion to the accuracy A and in inverse proportion to the compression ratio Δs. We introduce two linear factors α and β to form the linear combination r = α A − Δs + β as the reward function R(Δs, A).

As a running system, we introduce an inference-estimate-retrain mechanism to cope with scenery changes in the inference phase, building a system with different components to run inference, capture scenery changes, and then retrain the RL agent. The overall system diagram is illustrated in Figure 3.
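The reward R(Δs, A) = α·A − Δs + β and the exploration-rate decay used in the training loop can be sketched as follows (a minimal sketch; the default α, β, µ_dec, and ϵ_min values here are placeholders, the paper's own settings are listed in Sec. 3):

```python
# Illustrative sketch of the reward function and epsilon decay from Algorithm 1.
# Parameter defaults are assumptions, not the paper's reported settings.

def reward(delta_s, accuracy, alpha=1.0, beta=0.0):
    """r = α·A − Δs + β: higher accuracy raises the reward,
    a larger relative file size lowers it."""
    return alpha * accuracy - delta_s + beta

def decay_epsilon(eps, mu_dec=0.99, eps_min=0.05):
    """Multiplicative decay of the exploration rate, clamped at eps_min."""
    return max(mu_dec * eps, eps_min)
```
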
Figure 3: Diagram of AdaCompress architecture
The system diagram is shown in Figure 3. We build the memory buffer D and the RL (reinforcement learning) training kernel on top of the compression-and-upload driver. When the RL kernel is called, it loads transitions from the memory buffer D to train the compression level predictor ϕ. When the system is deployed, the pre-trained RL agent ϕ guides the compression driver to compress the input image with an adaptive compression quality c chosen according to the input image, and then uploads the compressed image to the cloud.

After AdaCompress is deployed, the input images' scenery context X may change (e.g., day to night, sunny to rainy). When the scenery changes, the old RL agent's compression selection strategy may no longer be suitable, causing the overall accuracy to decrease. To cope with this scenery drifting issue, we invoke an estimator with probability p_est: we generate a random value ξ ∈ (0, 1) and compare it to p_est. If ξ ≤ p_est, the estimator is invoked; AdaCompress uploads the reference image x_ref along with the compressed image x_i to fetch ȳ_ref and ȳ_i, calculates A_i, and saves the transition (ϕ_i, c_i, r_i, A_i) to the memory buffer D. The estimator also compares the recent n steps' average accuracy Ā_n with the earliest average accuracy A_0 in the memory buffer D; once the recent average accuracy is much lower than the initial average accuracy, the estimator invokes the RL training kernel to retrain the agent. And once the estimator discovers that the trained reward is higher than a threshold, it stops the training kernel, returning to the normal inference state.

Basically, AdaCompress adaptively switches itself among three states; the switching policy is shown in Figure 4.

Figure 4: State switching policy (inference → estimate when ξ ≤ p_est; estimate → inference when ξ > p_est; estimate → retrain when Ā_n < A_0; retrain → inference when r̄_n > r_th).

Inference:
Most of the time, AdaCompress runs in this state. Only the compressed images are uploaded to the cloud, achieving the minimum uploading traffic load. To keep a stable accuracy performance even when the input scenery changes, the agent occasionally switches to the estimate state with probability p_est, and otherwise remains in the inference state with probability 1 − p_est. Estimate:
In this state, the reference image x_ref and the compressed image x_i are uploaded to the cloud simultaneously to fetch ȳ_ref and ȳ_i and therefore A_i. In each epoch i, the transition (ϕ_i, c_i, r_i, A_i) is logged in a memory buffer D. Once the average accuracy Ā_n of the latest n steps is lower than the average accuracy A_0 of the earliest n steps in the memory buffer D, indicating that the current agent is no longer suitable for the current input scenery, AdaCompress switches into the retrain state and invokes the RL training kernel. Otherwise, it remains in the estimate state with probability p_est or switches back into the inference state with probability 1 − p_est.

The estimating probability p_est is therefore vital to the whole system. On the one hand, the estimator should be invoked occasionally to estimate the current agent's accuracy, so that the agent can be retrained in time once the scenery changes; on the other hand, the estimator uploads the reference image x_ref along with the compressed image, so the upload size is greater than the conventional benchmark solution, causing a higher traffic load.

To trade off the risk of scenery changes against the objective of reducing upload traffic, we design an accuracy-aware dynamic p_est solution. We first define that after running for N steps, the recent n steps' average accuracy is

$$\bar{A}_n = \begin{cases} \frac{1}{n}\sum_{i=N-n}^{N} A_i & \text{if } N \ge n \\[4pt] \frac{1}{N}\sum_{i=1}^{N} A_i & \text{if } N < n \end{cases}$$

With this definition, an intuitive formulation is to let the change of p_est be inversely proportional to the gradient of Ā, meaning that when the recent accuracy is going down, we should increase the estimation probability p_est. We formulate this as p′_est = p_est − ω∇Ā, where ω is a scaling factor. With this recursive formula, the general term of p_est, given an initial estimation probability p_0, is p_est = p_0 − ω Σ_i ∇Ā_i. Retrain:
This state adapts the agent to the current input image scenery by retraining it with the memory buffer D, similarly to the training procedure. The retrain phase finishes once the recent n steps' average reward r̄_n is higher than a user-defined threshold r_th. When the retrain procedure finishes, the memory buffer D is flushed, ready to save new transitions for the retraining of the next scenery drift.

In the inference phase, the pre-trained RL agent predicts a proper compression level according to the input image's features. The reference image is no longer uploaded to the cloud; only the compressed image is uploaded, so the upload traffic is reduced. We noticed that the RL agent's behavior varies for different input datasets and backend cloud services; we investigate further by plotting the RL agent's "attention map" (i.e., visual explanations of why the agent chooses a quality level).
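The running-average accuracy Ā_n and the accuracy-aware update of p_est described above can be sketched as follows (a minimal sketch; the ω default and the clamping bounds are our assumptions, and the sign convention reflects that a falling accuracy should raise p_est):

```python
# Sketch of the accuracy-aware dynamic estimation probability.
# omega and the [p_min, 1.0] clamp are illustrative assumptions.

def recent_avg_accuracy(history, n):
    """Ā_n: mean of the last n accuracies, or of all of them if fewer than n."""
    window = history[-n:] if len(history) >= n else history
    return sum(window) / len(window)

def update_p_est(p_est, grad_a, omega=0.5, p_min=0.05):
    """When accuracy trends down (grad_a < 0) the probability of invoking
    the estimator rises; when it trends up, p_est decays."""
    p_new = p_est - omega * grad_a
    return min(max(p_new, p_min), 1.0)
```
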
Compression level choice variation:
In our experiments, we found that in different cloud application environments, the agent's final chosen compression qualities can be quite different. As shown in Figure 5, for Face++ and Amazon Rekognition the agent's choices are concentrated around c = 15, but for Baidu Vision the agent's choices are distributed more evenly. Therefore, the optimal compression strategy should be different for different backend cloud services. This variation is caused by the interaction between the agent and the backend model in the training phase: since the agent's training procedure is based on a specific backend cloud model M, for another cloud model M′ the interaction between the agent and M′ is quite different. Therefore, the agent's compression level choices vary across backend cloud models.
Figure 5: Histogram of the RL agent's best compression level selection for different cloud services
Moreover, in our experiments the agent exhibits different behavior when the input images change from one dataset to another. Figure 6 shows the agent's choices for the same backend model (Baidu Vision) but different image datasets. We prepare two datasets representing two contextual sceneries: we randomly sample images from ImageNet [33], whose images are mostly taken in the daytime, to act as a daytime scenery, and we randomly select nighttime images from DNIM [44] to form another dataset acting as a nighttime scenery. The histogram in Figure 6 shows that, for the ImageNet images, the agent prefers a lower compression level, but its
choices are distributed more evenly. For DNIM images, the agent's choices accumulate at some relatively high compression qualities. We can see that, to maintain high accuracy, when the input images' contextual group X changes, the agent's compression level selection changes as well. This phenomenon shows that the agent can adaptively choose a proper compression level based on the input image's features.

Figure 6: Histogram of the RL agent's best compression level selection for different scenery image inputs

Attention map variation:
To gain further insight, we plot the importance map of a chosen compression quality. We do so by introducing a conventional visualization algorithm, Grad-Cam, to observe the Q prediction network's interest when choosing compression levels. Grad-Cam is a widely used solution for presenting the importance map of a deep neural network; it works by calculating the gradients of each target concept and backtracking to the final convolution layer. In this work, we plot the RL agent's attention maps produced by Grad-Cam in Figure 7.

In our investigation, we found that in different environments {X, M}, the Q agent picks compression qualities based on the visual textures of different regions in the image. As shown in Figure 7, pictures 1a – 1d are pictures that the agent chooses to compress heavily; the agent selects lower compression qualities based on the complex texture of the images. On the contrary, for pictures 2a – 2d, the agent chooses higher compression qualities to preserve more details, and the agent's interest falls on some smooth regions. This is especially visible for 1a and 2a: in picture 1a, the agent chooses a low compression level based on the rough central region even though there are smooth regions around it, and in picture 2a, the agent chooses a relatively higher compression level based on the surrounding smooth region rather than the central region.

In this section, we present AdaCompress's behavior and effectiveness through real-world experiments.
We carry out real-world experiments to verify our solution's performance. We used a desktop PC with an NVIDIA 1080 Ti graphics card as the edge infrastructure. For the cloud deep learning services, we chose Baidu Vision, the Face++ object detection service, and Amazon Rekognition. In the experiments, we use the two datasets mentioned in Sec. 2.6: the ImageNet dataset representing a daytime scenery and the DNIM dataset representing a nighttime scenery. The important hyperparameters in our experiments (c_ref, K, ϵ_min, p_0, γ, ω, µ_dec, T, r_th, and n) are given in Table 1.

Table 1: Experiment parameter settings
In industry, the default compression quality for JPEG is usually 75 [26, 31]; we regard this as the typical value c_ref = 75 of the conventional industry benchmark.

In our experiments, we measure the compressed and original images' file sizes to obtain the compression ratio Δs. Since we do not have the real ground-truth label of an image, we use the output for a reference image, ȳ_ref, as the ground-truth label and calculate the relative top-5 accuracy A as the accuracy metric; the formula for A is presented in Sec. 2.1.

Figure 8 presents the upload traffic load of the training and inference phases. To be more intuitive, we plot the size overhead s/s_ref on the y-axis, where s is the real upload size of AdaCompress and s_ref is the benchmark upload size; therefore y ≥ 1 indicates uploading more data than the benchmark and y < 1 indicates uploading less.

Figure 9 presents the compression performance in the inference phase for each cloud service. We tested AdaCompress on Face++, Baidu Vision, and Amazon Rekognition. Compared to the conventional compression level, for all tested cloud services our solution can reduce the upload size by more than 1/2, while the relative accuracy, indicated by the brown bars, only decreases by about 7% on average, demonstrating the efficiency of our design.
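The size measurement above can be sketched with Pillow, assuming the file size is taken from an in-memory JPEG encode (a sketch, not the paper's measurement code; c_ref = 75 follows the benchmark quality):

```python
# Sketch: compress an image at JPEG quality c and measure Δs = s_c / s_ref.
from io import BytesIO
from PIL import Image

def jpeg_size(img, quality):
    """Return the encoded JPEG byte size of `img` at the given quality."""
    buf = BytesIO()
    img.save(buf, format="JPEG", quality=quality)
    return buf.tell()

def size_overhead(img, c, c_ref=75):
    """Δs: size at quality c relative to the size at the reference quality."""
    return jpeg_size(img, c) / jpeg_size(img, c_ref)
```
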
To evaluate the efficiency of the inference-estimate-retrain mechanism, we feed AdaCompress a combined dataset whose first 720 images come from the DNIM night images and whose later 2376 images are randomly sampled from ImageNet. We first adapt AdaCompress's DRL agent to the DNIM night scenery by training it on the DNIM dataset, then run AdaCompress on the combined dataset and observe its behavior upon the scenery change at step 720.

Figure 7: Visualization of the importance map for the RL agent to choose a compression quality. Panels: (1a) Q=5, (1b) Q=15, (1c) Q=15, (1d) Q=15; (2a) Q=85, (2b) Q=85, (2c) Q=75, (2d) Q=75.
Figure 8: Size overhead in training and inference phase
Figure 9: Average size (KB) and relative accuracy (%) on different cloud services, compared with the benchmark Q=75
We illustrate AdaCompress's behavior in Figure 10. The x-axis indicates steps; the vertical red line with a ∆ mark on the x-axis marks the dataset change (i.e., the scenery change). We plot AdaCompress's overall accuracy as the green line and the estimation probability p_est as the gray line. At the bottom of Figure 10, we also plot the scaled upload data size of AdaCompress and the benchmark solution to illustrate the upload data size overhead in the inference phase.

From Figure 10 we can see that AdaCompress adaptively updates the estimation probability p_est: usually, when the overall accuracy decreases, AdaCompress increases the estimation probability, trying to catch the scenery change. When the overall accuracy is stable and high enough, the estimation probability p_est decreases to reduce transmission.

Upon the data scenery change, shown as the vertical red line in Figure 10, the accuracy decreases dramatically compared to the earlier steps, and p_est therefore rises to determine whether the scenery has changed; the accuracy keeps dropping in the following estimations.
Figure 10: AdaCompress’s reaction upon scenery change
Therefore, AdaCompress starts to retrain, to adapt the RL agent to the current scenery. The retrain steps are shown as the light-blue region in Figure 10. In the retrain phase, AdaCompress always uses the reference image's prediction label y_ref as the output result, so the accuracy A and p_est are locked to 1. After the agent finishes retraining in the new scenery, the accuracy occasionally decreases in the following iterations and the estimation probability p_est rises to collect more samples; but because the accuracy never falls below the initial average accuracy A of this scenery, the retrain phase is not triggered again.

From Figure 10 we can also observe the upload file size overhead in the different phases: in the retrain phase AdaCompress uploads more data than the conventional benchmark, but in the inference phase AdaCompress's upload data size is only about half of the benchmark's.

Compared to the conventional solution that uploads the image directly, our solution first passes the image to the DRL agent to estimate the compression level. Running this DRL agent adds extra latency to the whole system; in this subsection we evaluate this latency overhead. We measured the DRL agent's inference time and compressed file size for batches of images, and simulated the latency of uploading the compressed images. We measured the average inference latency over 1000 ImageNet images and simulated the network bandwidth as 27.64 Mbps, the global average fixed-broadband upload speed [38] in Feb. 2019, to evaluate the end-to-end latency. The latency comparison is listed in Table 2.

                        Benchmark   AdaCompress
Average upload size     42.68 KB    18.46 KB
Inference latency       0 ms        2.09 ms
Transmission latency    12.35 ms    5.34 ms
Overall latency         12.35 ms    7.43 ms
Table 2: Latency between image upload and inference result feedback
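The transmission latencies in Table 2 follow directly from upload size and bandwidth (size × 8 / bandwidth, taking 1 KB as 1000 bytes). A minimal sketch reproducing the table's arithmetic:

```python
def transmission_latency_ms(size_kb: float, bandwidth_mbps: float) -> float:
    """Time in milliseconds to upload `size_kb` kilobytes over a `bandwidth_mbps` link."""
    bits = size_kb * 1000 * 8            # 1 KB = 1000 bytes, 8 bits per byte
    return bits / (bandwidth_mbps * 1e6) * 1000

BANDWIDTH = 27.64  # Mbps, global average fixed-broadband upload speed, Feb. 2019

# Benchmark: upload the 42.68 KB image directly, with no agent inference.
benchmark = transmission_latency_ms(42.68, BANDWIDTH)
# AdaCompress: 2.09 ms of agent inference plus upload of the 18.46 KB compressed image.
adacompress = 2.09 + transmission_latency_ms(18.46, BANDWIDTH)

print(f"benchmark:   {benchmark:.2f} ms")    # ≈ 12.35 ms
print(f"adacompress: {adacompress:.2f} ms")  # ≈ 7.43 ms
```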
Our solution introduces inference latency into the end-to-end latency, but the transmission latency is much lower because the upload file size shrinks. In today's network architecture, where the edge infrastructure's computational power is increasing significantly [20, 35], we can trade the computing power of the edge infrastructure for a reduction in upload traffic and transmission latency.
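The inference-estimate-retrain behavior described in this section can be sketched as a small control loop. This is only an illustration, not the paper's implementation: the sliding-window size, the accuracy threshold, and the multiplicative p_est update rule are all assumptions made for the sketch.

```python
class InferenceEstimateRetrain:
    """Illustrative sketch of an inference-estimate-retrain control loop.

    The caller uploads a reference (uncompressed) image with probability
    `p_est`; each such estimation step calls `step()` with both labels.
    """

    def __init__(self, p_min: float = 0.1, p_max: float = 1.0, threshold: float = 0.8):
        self.p_est = p_min            # probability of uploading a reference image
        self.p_min, self.p_max = p_min, p_max
        self.threshold = threshold    # retrain if accuracy < threshold * initial accuracy
        self.init_accuracy = None     # average accuracy of the current scenery
        self.recent = []              # sliding window: 1 if labels agreed, else 0

    def accuracy(self) -> float:
        return sum(self.recent) / len(self.recent) if self.recent else 1.0

    def step(self, compressed_label, reference_label) -> bool:
        """One estimation step; returns True when retraining should be triggered."""
        self.recent.append(1 if compressed_label == reference_label else 0)
        self.recent = self.recent[-50:]               # keep a bounded window
        acc = self.accuracy()
        if self.init_accuracy is None and len(self.recent) >= 10:
            self.init_accuracy = acc                  # lock in this scenery's accuracy
        if self.init_accuracy and acc < self.threshold * self.init_accuracy:
            # Accuracy collapsed: assume the scenery changed, reset, retrain.
            self.init_accuracy = None
            self.recent.clear()
            self.p_est = self.p_min
            return True
        if acc < (self.init_accuracy or 1.0):
            # Accuracy dipping: probe with reference images more often.
            self.p_est = min(self.p_max, self.p_est * 1.5)
        else:
            # Accuracy stable and high: probe less to save upload traffic.
            self.p_est = max(self.p_min, self.p_est * 0.9)
        return False
```

With a stable scenery the loop keeps p_est at its floor; feeding it a run of mismatched labels raises p_est and eventually trips the retrain branch, mirroring the curves in Figure 10.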
RELATED WORK

As cloud-based computer vision services have become the norm for today's applications [1, 22], many studies have been devoted to improving cloud-based model execution, including model compression and data compression.
Though the exact term is still debated in the community, we use "model compression" to refer to studies on compressing deep learning models and moving them close to users. A number of studies compress deep learning models and deploy them locally [3, 4, 13, 16, 17, 23], i.e., running an alternative "smaller version" of a computer vision model at the user end to avoid the image upload and thus improve inference efficiency. Other studies propose to run part of a deep learning model locally [9, 19, 24, 25] by decoupling the model into different parts, e.g., based on its layers, so that part of the inference is done locally to save execution time. However, these solutions usually need to re-train the model using its original training dataset, which is not practical for today's cloud computer vision services that are merely a black box to end users, e.g., in the form of a RESTful API.
Data compression solutions study how to compress the original data (e.g., a video or image) to be inferred by the cloud deep learning model, so that less traffic is needed to upload the data and inference speed improves. Conventional data compression solutions (e.g., JPEG, WebP, JPEG2000) and some recent neural-network-based compression solutions [32, 39-41] are designed primarily for human vision systems. In recent years, researchers have found that data compression solutions optimized for human vision are not always suitable for deep learning vision systems. Delac et al. [7] observed that a higher compression level does not always deteriorate model inference accuracy and, in some cases, even improves it slightly. Dodge et al. [8] further discovered that, besides JPEG compression, four types of quality distortions (blur, noise, contrast, and JPEG2000 compression) also affect deep learning inference performance.

Based on these insights, Torfason et al. [42] trained a neural network directly on the compressed representations of an auto-encoder. Liu et al. [27] proposed DeepN-JPEG, which provides a JPEG quantization table learned from the dataset so that the compressed image size is reduced for deep learning models. Recently, Gueguen et al. [15] presented a new type of neural network that performs inference directly on the discrete cosine transform (DCT) coefficients from the middle of the JPEG codec. Baluja et al. [6] proposed task-specific compression that compresses images based on the end use of the image.

However, such proposals all require one to understand the characteristics of the cloud-end deep learning model and to have access to the original training dataset, in order to generate the appropriate color space and/or compression scheme. To the best of our knowledge, we are the first to propose an adaptive compression configuration solution that learns the deep learning model by itself.
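The quality knob that these JPEG-based works tune is exposed directly by common codecs; for instance, with Pillow (cited above as [26]) one can re-encode an image at different quality levels and observe the size trade-off that AdaCompress exploits. A small sketch, using a synthetic gradient image as a stand-in for a real photo:

```python
from io import BytesIO

from PIL import Image


def jpeg_size(img: Image.Image, quality: int) -> int:
    """Return the encoded size in bytes of `img` saved as JPEG at `quality`."""
    buf = BytesIO()
    img.save(buf, format="JPEG", quality=quality)
    return buf.tell()


# A synthetic 224x224 gradient image stands in for a real photo here.
img = Image.new("RGB", (224, 224))
img.putdata([(x % 256, y % 256, (x + y) % 256)
             for y in range(224) for x in range(224)])

for q in (95, 75, 15):
    print(f"quality={q:3d}: {jpeg_size(img, q):6d} bytes")
```

Lower quality levels shrink the encoded size substantially; the open question the works above (and this paper) address is how far the quality can be lowered before the downstream model's accuracy suffers.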
CONCLUSION

To reduce the upload traffic of deep learning applications, most researchers focus on modifying the deep learning model, but this does not apply in industry because the backend deep model is usually inaccessible to users. We present a heuristic solution that uses a deep reinforcement learning agent to decide the proper compression quality for each image, according to the input image and the backend service. Our experiments show that for different backend deep learning cloud services and different input image sceneries, using different quality selection strategies can significantly reduce the upload file size while keeping comparable accuracy. Based on this work, possible future directions include the following: 1) For regularly changing sceneries (e.g., daytime and nighttime), one can design an agent caching strategy that caches an agent for a specific scenery and reuses it when a similar scenery arrives, rather than retraining from scratch. 2) By introducing transfer learning and knowledge distillation, an agent could learn from a nearby agent to accelerate its training.
ACKNOWLEDGMENTS
This work is supported in part by NSFC under Grants 61872215, 61531006, 61771273 and U1611461, the National Key R&D Program of China under Grants 2018YFB1800204 and 2015CB352300, SZSTI under Grants JCYJ20180306174057899 and JCYJ20180508152204044, and the Shenzhen Nanshan District Ling-Hang Team under Grant LHTD20170005.
REFERENCES
[1] Harsh Agrawal, Clint Solomon Mathialagan, Yash Goyal, Neelima Chavali, Prakriti Banik, Akrit Mohapatra, Ahmed Osman, and Dhruv Batra. 2015. CloudCV: Large-scale distributed computer vision as a cloud service. In Mobile Cloud Visual Media Computing. Springer, 265-290.
[2] Amazon. 2019. Amazon Rekognition. https://aws.amazon.com/rekognition/.
[3] Sajid Anwar, Kyuyeon Hwang, and Wonyong Sung. 2015. Fixed point optimization of deep convolutional neural networks for object recognition. In Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on. IEEE, 1131-1135.
[4] Sajid Anwar, Kyuyeon Hwang, and Wonyong Sung. 2017. Structured pruning of deep convolutional neural networks. ACM Journal on Emerging Technologies in Computing Systems (JETC) 13, 3 (2017), 32.
[5] Baidu. 2019. Baidu AI Open Platform. https://ai.baidu.com/.
[6] Shumeet Baluja, David Marwood, and Nicholas Johnston. 2019. Task-specific color spaces and compression for machine-based object recognition. (2019).
[7] Kresimir Delac, Mislav Grgic, and Sonja Grgic. 2005. Effects of JPEG and JPEG2000 compression on face recognition. In International Conference on Pattern Recognition and Image Analysis. Springer, 136-145.
[8] Samuel Dodge and Lina Karam. 2016. Understanding how image quality affects deep neural networks. In 2016 Eighth International Conference on Quality of Multimedia Experience (QoMEX). IEEE, 1-6.
[9] Amir Erfan Eshratifar and Massoud Pedram. 2018. Energy and performance efficient computation offloading for deep neural networks in a mobile cloud computing environment. In Proceedings of the 2018 on Great Lakes Symposium on VLSI. ACM, 111-116.
[10] Ivan Evtimov, Kevin Eykholt, Earlence Fernandes, Tadayoshi Kohno, Bo Li, Atul Prakash, Amir Rahmati, and Dawn Song. 2018. Robust physical-world attacks on deep learning models. In Computer Vision and Pattern Recognition.
[12] … In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1086-1095.
[13] Yunchao Gong, Liu Liu, Ming Yang, and Lubomir Bourdev. 2014. Compressing deep convolutional networks using vector quantization. arXiv preprint arXiv:1412.6115 (2014).
[14] Google Inc. 2019. Google Edge TPU. https://cloud.google.com/edge-tpu/.
[15] Lionel Gueguen, Alex Sergeev, Ben Kadlec, Rosanne Liu, and Jason Yosinski. 2018. Faster neural networks straight from JPEG. In Advances in Neural Information Processing Systems. 3933-3944.
[16] Song Han, Huizi Mao, and William J Dally. 2015. A deep neural network compression pipeline: Pruning, quantization, Huffman encoding. arXiv preprint arXiv:1510.00149 10 (2015).
[17] Song Han, Jeff Pool, John Tran, and William Dally. 2015. Learning both weights and connections for efficient neural networks. In Advances in Neural Information Processing Systems. 1135-1143.
[18] Seungyeop Han, Haichen Shen, Matthai Philipose, Sharad Agarwal, Alec Wolman, and Arvind Krishnamurthy. 2016. MCDNN: An approximation-based execution framework for deep stream processing under resource constraints. In Proceedings of the 14th Annual International Conference on Mobile Systems, Applications, and Services. ACM, 123-136.
[19] Pengfei Hu, Huansheng Ning, Tie Qiu, Yanfei Zhang, and Xiong Luo. 2017. Fog computing-based face identification and resolution scheme in Internet of Things. IEEE Transactions on Industrial Informatics 13, 4 (2017), 1910-1920.
[20] Yun Chao Hu, Milan Patel, Dario Sabella, Nurit Sprecher, and Valerie Young. 2015. Mobile edge computing - A key technology towards 5G. ETSI White Paper 11, 11 (2015), 1-16.
[21] Huawei. 2019. Huawei Atlas 500 Edge Station. https://e.huawei.com/en/products/cloud-computing-dc/servers/g-series/atlas-500.
[22] Loc N Huynh, Youngki Lee, and Rajesh Krishna Balan. 2017. DeepMon: Mobile GPU-based deep learning framework for continuous vision applications. In Proceedings of the 15th Annual International Conference on Mobile Systems, Applications, and Services. ACM, 82-95.
[23] Kyuyeon Hwang and Wonyong Sung. 2014. Fixed-point feedforward deep neural network design using weights +1, 0, and -1. In Signal Processing Systems (SiPS), 2014 IEEE Workshop on. IEEE, 1-6.
[24] Y. Kang, J. Hauswald, C. Gao, A. Rovinski, T. Mudge, J. Mars, and L. Tang. 2017. Neurosurgeon: Collaborative intelligence between the cloud and mobile edge. ACM, 615-629.
[25] Hongshan Li, Chenghao Hu, Jingyan Jiang, Zhi Wang, Yonggang Wen, and Wenwu Zhu. 2018. JALAD: Joint accuracy- and latency-aware deep structure decoupling for edge-cloud execution. In 2018 IEEE 24th International Conference on Parallel and Distributed Systems (ICPADS). 671-678. https://doi.org/10.1109/PADSW.2018.8645013
[26] Python Imaging Library. 2019. Image file formats. https://pillow.readthedocs.io/en/3.1.x/handbook/image-file-formats.html.
[27] Zihao Liu, Tao Liu, Wujie Wen, Lei Jiang, Jie Xu, Yanzhi Wang, and Gang Quan. 2018. DeepN-JPEG: A deep neural network favorable JPEG-based image compression framework. In Proceedings of the 55th Annual Design Automation Conference. ACM, 18.
[28] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. 2013. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602 (2013).
[29] Olga Russakovsky, Jia Deng, et al. 2012. ImageNet Large Scale Visual Recognition Challenge 2012. http://image-net.org/challenges/LSVRC/2012/.
[30] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. URL https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf (2018).
[31] rflynn. 2019. Lossy image optimization. https://github.com/rflynn/imgmin.
[32] Oren Rippel and Lubomir Bourdev. 2017. Real-time adaptive image compression. In Proceedings of the 34th International Conference on Machine Learning, Volume 70. JMLR.org, 2922-2930.
[33] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. 2015. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV).
[34] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. 2018. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4510-4520.
[35] Mahadev Satyanarayanan. 2017. The emergence of edge computing. Computer 50, 1 (2017), 30-39.
[36] Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. 2017. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision. 618-626.
[37] Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).
[38] Speedtest. 2019. Speedtest Global Index. https://www.speedtest.net/global-index.
[39] Lucas Theis, Wenzhe Shi, Andrew Cunningham, and Ferenc Huszár. 2017. Lossy image compression with compressive autoencoders. arXiv preprint arXiv:1703.00395 (2017).
[40] George Toderici, Sean M O'Malley, Sung Jin Hwang, Damien Vincent, David Minnen, Shumeet Baluja, Michele Covell, and Rahul Sukthankar. 2015. Variable rate image compression with recurrent neural networks. arXiv preprint arXiv:1511.06085 (2015).
[41] George Toderici, Damien Vincent, Nick Johnston, Sung Jin Hwang, David Minnen, Joel Shor, and Michele Covell. 2017. Full resolution image compression with recurrent neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 5306-5314.
[42] Robert Torfason, Fabian Mentzer, Eirikur Agustsson, Michael Tschannen, Radu Timofte, and Luc Van Gool. 2018. Towards image understanding from deep compression without decoding. arXiv preprint arXiv:1803.06131 (2018).
[43] Xiaoyong Yuan, Pan He, Qile Zhu, and Xiaolin Li. 2019. Adversarial examples: Attacks and defenses for deep learning. IEEE Transactions on Neural Networks and Learning Systems (2019).
[44] Hao Zhou, Torsten Sattler, and David W Jacobs. 2016. Evaluating local features for day-night matching. In European Conference on Computer Vision Workshops. Springer.