A Serverless Cloud-Fog Platform for DNN-Based Video Analytics with Incremental Learning
Huaizheng Zhang, Meng Shen, Yizheng Huang, Yonggang Wen, Yong Luo, Guanyu Gao, Kyle Guan
Huaizheng Zhang∗, Meng Shen∗, Yizheng Huang∗, Yonggang Wen∗, Yong Luo†, Guanyu Gao‡, Kyle Guan§
∗Nanyang Technological University, {huaizhen001, meng005, yizheng.huang, ygwen}@ntu.edu.sg
†Wuhan University, [email protected]
‡Nanjing University of Science and Technology, [email protected]
§K&C Technologies Solutions, [email protected]

Abstract—Deep neural network (DNN) based video analytics have empowered many new applications, such as automated retail and smart city. Meanwhile, the proliferation of fog computing systems provides system developers with more design options to improve performance and save cost. To the best of our knowledge, this paper presents the first serverless system that takes full advantage of the client-fog-cloud synergy to better serve DNN-based video analytics. Specifically, the system aims to achieve two goals: 1) provide optimal analytics results under the constraints of lower bandwidth usage and shorter round-trip time (RTT) by judiciously managing the computational and bandwidth resources deployed in the client, fog, and cloud environment; 2) free developers from tedious administration and operation tasks, including DNN deployment and cloud and fog resource management. To this end, we design and implement a holistic cloud-fog system referred to as VPaaS (Video-Platform-as-a-Service) to execute inference-related tasks. The proposed system adopts serverless computing to enable developers to build a video analytics pipeline by simply programming a set of functions (e.g., encoding and decoding, and model inference). These functions are then orchestrated to process video streams through carefully designed modules. To save bandwidth and reduce RTT at the same time, VPaaS provides a new video streaming protocol that only sends low-quality video to the cloud. The state-of-the-art hardware accelerators and high-performing DNNs deployed at the cloud identify regions of video frames that need further processing at the fog ends. At the fog ends, misidentified labels in these regions can be corrected using a lightweight DNN model. To address data drift issues, we incorporate limited human feedback into the system to verify the results and adopt incremental machine learning to improve our system continuously. The evaluation of our system with extensive experiments on standard video datasets demonstrates that VPaaS is superior to several state-of-the-art systems: it maintains high accuracy while reducing bandwidth usage by up to 21%, RTT by up to 62.5%, and cloud monetary cost by up to 50%. We plan to release VPaaS as open-source software to facilitate the research and development of video analytics.
Index Terms—model serving, video analytics, cloud computing, edge computing
I. INTRODUCTION
We are witnessing an unprecedented increase in high-resolution video camera deployment. These cameras, equipped with video analytics applications, continually collect high-quality video data and generate valuable insights. Many industries, such as automated retail, manufacturing, and smart city, rely on these video analytics to improve service efficiency and reduce operational expenditure [1], [2]. For instance, thousands of cameras in a city combined with a traffic monitoring application are able to provide drivers with optimized routes to reduce traffic congestion [3]. The key to such success lies in recent advances in deep neural networks (DNNs) that allow these applications to analyze video content with extremely high accuracy.

Utilizing DNNs to analyze video content is not without drawbacks. The DNN models (e.g., FasterRCNN101 [4]) that provide the highest accuracy can consist of hundreds of layers and millions of weight parameters. These extremely computationally intensive tasks thus rely on state-of-the-art hardware accelerators (e.g., GPUs or TPUs). Since client/edge devices are limited in their computational resources, the analytics tasks are executed in clouds equipped with the newest computational hardware so as to provide real-time feedback. As a consequence, the current video analytics pipeline needs first to stream videos to clouds, resulting in high bandwidth usage. Moreover, many works [5], [6] show that the video transmission time accounts for nearly half of the end-to-end processing time (the transmission time would increase even more in the presence of communication link outages or network congestion). As a result, users experience long round-trip times (RTT) from time to time.

In order to address the aforementioned issues, many efforts have been invested in designing efficient video analytics systems [7]–[11]. In general, these systems fall into two categories: client-driven methods and cloud-driven methods. The client-driven methods run small models [12], [13] or simple frame differencing algorithms [7] on resource-limited client devices to filter video frames, only sending regions potentially containing target objects to clouds for further processing. These methods suffer from missing important regions or sending redundant information due to the simple techniques used [11]. To overcome this issue, current systems [10], [11], [14] focus on utilizing cloud-driven methods that offload more model computation tasks to the cloud side. For instance, CloudSeg [15] sends low-resolution data from the client to the cloud, which runs a super-resolution model [16] to recover a frame to high resolution for prediction. DDS [11] and SimpleProto [17] design a multiple-round transmission method to save bandwidth while maintaining high accuracy. Though these cloud-driven methods can potentially alleviate the bandwidth consumption to a certain degree, they still have drawbacks and leave much room for improvement.

Fig. 1: The client-fog-cloud infrastructure is widely deployed to support many video applications (e.g., smart city, automated retail, warehouse, and wildlife monitoring). In general, the clients equipped with high-resolution cameras have very limited computation resources and cannot support DNN inference or video encoding and decoding very efficiently. The co-located fog nodes can only support lightweight DNN models, while the cloud can support computationally heavy DNN models.
First, by trading bandwidth for cloud computational resources, these methods inadvertently increase cloud infrastructure and operational costs. For instance, DDS [11] runs multiple rounds of inference on a single frame to attain high accuracy, and CloudSeg [15] needs extra cloud resources to run a super-resolution model to recover low-resolution frames. Moreover, these designs often incur higher latency overhead, so the RTT can deteriorate in the presence of network congestion or unavailability of cloud resources. Second, most of these works over-rely on the performance of state-of-the-art models (e.g., FasterRCNN101) to guide their design options [18]. Specifically, these single-stage models are often designed and trained to simultaneously perform several tasks (such as detection and classification) in an end-to-end fashion. These works overlook the fact that a carefully designed and trained multi-stage model, consisting of models each optimized for a single task, can achieve similar performance with less computational as well as bandwidth resources. Besides, these fixed, pre-trained models can easily encounter the data drift issue when the training dataset distribution differs from that of the online data. Third, deploying and operating these DNN-based video analytics pipelines is still non-trivial and requires immense manual effort. Even for a simple object detection task, developers have to undergo tedious administration and operation tasks such as resource management, DNN deployment, and so on.

We aim to design a cloud-driven platform for DNN-based video analytics with the following design guidelines to narrow these gaps. First, we utilize the widely deployed fog nodes, as shown in Figure 1, to minimize the cloud infrastructure cost and bandwidth usage while still achieving an accuracy performance comparable to those of the cloud-driven methods. The key insight is that a well-designed and trained detection model can provide a very accurate target location even for a low-quality frame (though this model cannot classify the object). With object locations, we only need to run classification models, which require much fewer computation resources compared to detection models. Moving DNN inference tasks to fog nodes that are close to users can also reduce the transmission time. Second, by involving very limited human feedback in the system rather than relying exclusively on the best cloud models (namely the golden configuration in [18]), the system should have the ability to evolve itself continuously to address the data drift issue and maintain the accuracy performance. Third, the often manual and mundane pipeline administration and resource management work should be kept minimal or even avoided.
The system should provide configurability and flexibility for users to easily build their video analytics pipelines and automate the operational tasks.

Following these guidelines and insights, we propose and develop VPaaS (Video-Platform-as-a-Service), a cloud-fog co-serving platform for DNN-based video analytics that takes full advantage of the client-fog-cloud synergy. First, a client equipped with cameras sends high-quality videos to the fog, where the videos are re-encoded into low-quality videos. The low-quality videos are then sent to the cloud, where a state-of-the-art detection model is employed to analyze the video content. The regions (within a video frame) with high-confidence classification scores are considered successfully identified. The coordinates of the regions with high-confidence location scores are sent back to the fog node, where a lightweight classification model is employed to recognize these regions using the high-quality frames. Second, to correct the mislabeling caused by the best cloud model and continuously improve the system, we design an interface and data collector to gather human feedback and employ incremental machine learning to update fog models. Third, the whole system adopts the serverless computing design philosophy. We provide a set of functions to ease the development and deployment of a video analytics pipeline. Users can easily register and run their newly designed models and scheduling policies in our system, which makes it easy to orchestrate both cloud and fog resources.

We deploy VPaaS on a testbed that emulates various real-world scenarios and evaluate system performance using multiple video datasets. Across these datasets, VPaaS achieves comparable or higher accuracy while reducing bandwidth usage by up to 21%, RTT by up to 62.5%, and cloud cost by up to 50%. Meanwhile, it can improve itself with very low resource consumption and update models with almost negligible overhead and bandwidth usage. We also conduct many case studies to show the ease of use, fault tolerance, and scalability of our system. The contributions of this paper can be summarized as follows:

• We develop a complete video processing platform termed VPaaS and conduct extensive experiments to verify its effectiveness.

• We implement a novel video streaming protocol that saves bandwidth while maintaining high accuracy by utilizing both cloud and fog resources.

• We employ incremental machine learning to improve system performance with minimal human effort.
• To the best of our knowledge, VPaaS is the first serverless cloud-fog platform that provides a set of APIs to ease the development and deployment of video analytics applications.

Fig. 2: Example video analysis pipelines: (a) a vanilla pipeline (video chunks, decoder, re-encoder, pre-processor, model inference, post-processor) and (b) a cascade pipeline (video chunks, quality control, pre-processor, ResNet18, ResNet101, post-processor). In general, a pipeline consists of two stages, quality control and content analysis. A video stream is re-encoded for bandwidth efficiency. It is then analyzed through three functions: pre-processing (e.g., key-frame extraction and resizing), model inference (e.g., object detection), and post-processing (e.g., adding bounding boxes).

The remainder of the paper is organized as follows. Section II introduces the background and related work. Section III provides an overview of VPaaS. Section IV and Section V detail the protocol and the incremental learning design. Section VI presents the evaluation results. Section VII summarizes the paper and discusses future work.

II. BACKGROUND AND RELATED WORK
This section first introduces the DNN-based video analytics pipeline and our deployment scenarios, and then surveys the related work on both video analytics and serverless computing.
A. DNN-based Video Analytics
Video analytics pipeline.
Typical video analytics pipelines are shown in Figure 2. A pipeline contains two main stages: quality control and content analytics. In the quality control stage, a video chunk's quality settings are adjusted via a decoder and encoder so that it can be transmitted in a bandwidth-efficient manner. Many settings, such as resolution, frame rate, and quantization parameter (QP), can be chosen to improve or degrade the video's quality [10], [11], [18]. Once the server (either in the cloud or fog) receives a video chunk, it starts the content analytics process, which consists of three steps: pre-processing, model inference, and post-processing. The pre-processing step contains many functions, such as decoding and image manipulation. These functions decode a video chunk into image frames and resize them so that they can be fed into DNN models for further processing. Many DNN models, such as object detection, object tracking, and image classification models, can be employed to analyze video content. Each type of model has variants, each with different processing speeds and prediction accuracy. Users need to evaluate the trade-off between prediction accuracy and processing speed before deploying them to a device.
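To make the two-stage structure concrete, the following minimal Python sketch composes such a pipeline from pluggable stage functions. The signature and stage names are illustrative, not VPaaS's actual API.

from typing import Any, Callable, List

def run_pipeline(chunk: bytes,
                 quality_control: Callable[[bytes], bytes],
                 pre_process: Callable[[bytes], List[Any]],
                 infer: Callable[[List[Any]], List[dict]],
                 post_process: Callable[[List[dict]], List[dict]]) -> List[dict]:
    # Stage 1: quality control, e.g., re-encode to a lower resolution/QP.
    chunk = quality_control(chunk)
    # Stage 2: content analysis.
    frames = pre_process(chunk)      # decode the chunk into frames and resize them
    results = infer(frames)          # run a DNN, e.g., an object detector
    return post_process(results)     # e.g., attach bounding boxes or labels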
Video applications under consideration.
This paper considers quasi-real-time video applications such as automated retail video analytics, warehouse management, etc. Unlike video analytics for self-driving, which has very stringent latency constraints, these video applications provide near real-time, high-quality feedback so that the downstream tasks can be finished in time. For instance, in an automated retail store, the video monitoring system must process videos efficiently and accurately so that the clearing system can help customers check out in time. Also, for modern warehouses (such as Amazon warehouses), this kind of video application is crucial for efficient workflow management.
Client-Fog-Cloud infrastructure.
Figure 1 shows our system's deployment scenario, which represents real-world applications. In this scenario, a client device is used to collect information and generate video chunks, which are sent to backend devices for content analytics. The backend devices include fog nodes and cloud servers. The fog nodes are often directly connected to the end devices and have limited computational resources. In comparison, the cloud servers are located in data centers connected to edge devices and fog nodes via local-area and wide-area networks and are equipped with high-performing hardware accelerators (e.g., NVIDIA V100 GPUs) to perform video analytics with state-of-the-art DNN models.
B. Related Work
Systems for video analytics.
Many systems [7]–[11], [15], [18] have been implemented to optimize the video analytics process with respect to metrics such as latency, bandwidth, etc. They can be categorized into client/edge-driven methods and cloud-driven methods. Representative client/edge approaches include NoScope [12], FilterForward [13], etc. They use lightweight and customized DNN models to improve processing throughput with a lower inference time. Glimpse [7] designs a frame-difference analyzer to filter out frames and runs a tracking model on the client for object detection. In comparison, cloud-driven methods balance the trade-off between resources and accuracy to optimize systems. CloudSeg [15] leverages a super-resolution model to improve video quality for highly accurate recognition in the cloud. VideoStorm [14] examines the impact of many system parameters on the final results and thus orchestrates GPU cluster resources more efficiently. AWStream [10] targets wide-area video streaming analytics with an efficient profiler and runtime. DDS [11] designs a multiple-round video streaming protocol. However, these systems (1) ignore the widely deployed fog nodes that could further optimize the system, whereas VPaaS leverages the fog nodes with a new protocol design, and (2) do not consider the data drift issue, whereas VPaaS employs human-in-the-loop machine learning to improve models continuously. Besides, VPaaS is the first integrated serverless cloud-fog platform that provides many functions for usability, scalability, fault tolerance, etc.
Serverless Computing.
Serverless cloud computing is designed to automatically handle all system administration operations, including auto-scaling, monitoring, and virtual machine (VM) configuration, thus greatly simplifying cloud service development and deployment. Many systems have been proposed to leverage these properties for large-scale video processing. For instance, ExCamera [19] utilizes AWS Lambdas for encoding massive videos in a few minutes. Sprocket [20] is a video processing framework (covering both video encoding and content analytics) built on an AWS serverless platform. However, none of these systems offload computation to fog nodes, especially for DNN-based video analytics applications.

Fig. 3: System architecture of VPaaS. The stateless server spans a fog server (executor, model cache, and cloud-fog coordinator) and a cloud server (executors, load balancer, and provisioner), while the stateful backend comprises the global control plane (global monitor, policy manager, function manager, model profiler, and global task scheduler), the training backend (data collector and model trainer), the deployment backend (model zoo and dispatcher), a dashboard/annotator, and the data store.

III. SYSTEM DESIGN
This section first details the system design and workflow and then introduces the core components for the video analytics pipeline with serverless computing.
A. Design Goals
Seamless integration of fog and cloud resources.
The system can handle the environmental heterogeneity and provide essential functions for smooth task execution.
End-to-end video analytics support.
The system should provide end-to-end support, including video decoding and encoding, frame processing, model inference, model tuning, and so on, ensuring users can run the pipeline with ease.
Low bandwidth, high accuracy.
Saving bandwidth usage while maintaining a high accuracy is the essential requirement of our system design.
Human-in-the-Loop (HITL).
VPaaS should incorporate human insights into the system to address the data drift issue and continually improve the model performance.
B. Architecture Overview
Figure 3 depicts our serverless cloud-fog platform architecture. It consists of 1) a stateless server to execute a video analytics pipeline across cloud and fog, and 2) a stateful backend to provide all essential functions to manage the whole system. First, a serverless cloud-fog server is developed to serve a DNN-based video analytics pipeline. The cloud server provides an executor to run models as well as the other video processing functions. It also includes a provisioner and load balancer to provide a highly available and scalable service. The fog server contains a low-latency function executor, a model cache, and a cloud-fog coordinator. The cloud-fog coordinator is responsible for the scheduling between cloud and fog under a pre-specified policy. Second, the stateful backend provides an interface and many functions to involve humans in the system. It allows users to register video analytics functions (e.g., newly trained ML models) and scheduling policies (e.g., ensembles) with the system. It also provides many administrative functions, such as a function dispatcher and a model profiler, to improve usability. To continuously improve the system, the data labeled by humans is collected, and a model trainer is used to automate the model tuning. A data store is implemented to hold models, data, and system logs. We detail our cloud-fog coordinator and HITL design in Section IV and Section V.
C. Serverless Cloud-Fog Computing
The serverless cloud-fog ML server performs video analytics tasks, including video decoding and encoding, data pre-processing and post-processing, and ML model inference. It also provides essential functions to the stateful backend to update models (via incremental learning) and manage the system resources.

The cloud ML server provides a runtime to execute computationally demanding tasks such as running accurate models (e.g., FasterRCNN101), training models, etc. It encapsulates all necessary functions, such as a load balancer and a resource provisioner, to run at scale without the need for maintenance by developers. Different from current public serverless platforms like AWS Lambda that only support CPUs, our executor can utilize GPU resources to speed up the execution of model inference tasks.

The fog ML server contains many useful functions, including a cloud-fog coordinator, a model cache, and an executor. The coordinator executes policies that involve both cloud and fog resources. We design a new policy and use it to discuss the fog coordinator in detail in Section IV. The model cache stores models dispatched from the cloud, and the models in it are updated periodically by our incremental learning. The executor, like that of the cloud ML server, utilizes hardware resources to run video analytics functions.
D. Stateful Backend
VPaaS provides users with a stateful control backend, including many functions developed to support serverless video inference serving. We organize these functions into four modules: the deployment backend, the training backend, utility functions, and the data store. A dashboard with a video annotator is provided as a frontend to achieve HITL ML and facilitate service deployment and management.

The deployment backend provides video analytics pipeline management and deployment across cloud and fog nodes. It includes (1) a function manager that provides a fine-grained housekeeping service (e.g., registration) for video processing related functions (as illustrated in Figure 2), (2) a policy manager that allows users to register and select scheduling policies for specific scenarios, and (3) a dispatcher for deploying functions and policies to fog and clouds.

Fig. 4: The performance of video quality control (4a) and DNN inference (4b) on different devices. Fig. 4a shows that the computational resources of client/edge devices (a Raspberry Pi 4B in the experiment) cannot support real-time video decoding and re-encoding, while the fog (NVIDIA Xavier) and the cloud (V100 GPU) can perform the task efficiently. Fig. 4b indicates that though the fog node cannot run heavy object detection models (e.g., YOLOv3, FasterR-CNN) very efficiently, it can more than support high-performing classification models (e.g., ResNet18, ResNet50) in real time.

The auto-training backend contains two useful functions, the data collector and the model trainer, to automatically improve the quality of the deployed models. The data collector manages the data labeled by humans during model inference, while the model trainer tunes models with incremental learning.

The global control plane provides all necessary system operation functions to manage and schedule system resources, freeing developers from tedious administration tasks. It has a model profiler to profile ML models on the underlying fog and cloud devices, a global monitor to collect system runtime performance information, and a global task scheduler to execute the dispatched policy.

IV. CLOUD-FOG COORDINATOR PROTOCOL
This section presents our design and implementation of the cloud-fog coordinator. We focus on the protocol design, which is complementary to existing offloading optimizations [8], [9], [21], [22] for the resource-accuracy trade-off. The protocol should meet the following requirements (RQ):

• RQ1. The module can utilize the best models running in clouds to maintain high accuracy.

• RQ2. Some of the tasks can be offloaded to fog nodes with no additional cloud cost.

• RQ3. The bandwidth usage should be minimized.

To achieve these goals, we first conduct several preliminary studies and obtain some key observations and insights. Based on these, we design and implement a practical protocol named high and low video streaming.

A. Key Observation and Formulation
In our client-fog-cloud scenario, we use the following formula to summarize the dependencies of the final accuracy $a$ on the design options of key system elements (e.g., choices of models and scheduling algorithms):

$a = F(M_{fog}, M_{cloud})$,  (1)

where $M_{fog}$ and $M_{cloud}$ are the models running in the fog and cloud, respectively. We use $F(\cdot)$ (abstractly) to describe a protocol. In this paper, we focus on finding the optimal $F^*(\cdot)$.

Intuitively, to achieve the highest accuracy, the most straightforward protocol is to send the original-quality video to the cloud, where the best video analysis DNN model is running to recognize content. In this process, bandwidth cost is incurred for transmitting video frames from the client to the cloud. This cost for a video frame is proportional to its size, which is in turn determined by the video resolution and the quantization parameter (QP) value (a lower value means more details are retained). Hence, we can denote the average video size as a function $F_v(r, q)$, where $r$ is the resolution and $q$ is the QP value. We can then estimate the bandwidth cost for transmitting a video frame as

$B = F_v(r, q) \cdot C_B$,  (2)

where $C_B$ is the monetary cost of transmitting one unit of data from the client to the cloud. Developers can adjust this value according to their local data rates.

To reduce this cost, we have several design options. The first is to avoid the transmission entirely and only utilize $M_{fog}$. However, this approach's inference accuracy cannot meet the requirement due to the poor performance of compressed small models. The second is to rely on filter methods like Glimpse [7], only sending elaborate frames or regions to the cloud for analysis. Though this approach, in general, has a very high processing speed, it can easily miss essential frames or regions, which degrades the accuracy. As we will show in Section VI, the accuracy obtained by this approach is unacceptable in our deployment scenarios.

To address these issues, we first conduct several preliminary studies. As shown in Figure 4, (Key Observation 1) though the fog cannot run the best object detection model very efficiently, it can support high-speed quality control and high-performing classification models. In a way, a classification model is superior to a detection model in terms of recognition ability. So the question becomes whether we can obtain the regions that contain objects and only use classification models in the fog. This is precisely a chicken-and-egg problem: if we do not run an object detection model, how do we get the objects' locations?

To answer the question, we rely on a property of current state-of-the-art detection models (like FasterRCNN101 [4]) and continue our empirical studies. These DNNs always involve two stages: they first identify the regions that might contain objects and then classify these regions into objects. We aim to utilize their localization power with minimal bandwidth usage, so we adjust the resolution $r$ and the QP $q$ to reduce the video size and observe the effect. As shown in Figure 5, (Key Observation 2) even for a low-quality video, the model can
identify regions that possibly contain objects; it just cannot recognize the objects due to the blurred video frames. A similar phenomenon has also been reported in [11]. In addition, (Key Observation 3) a smaller video size also means lower transmission and processing time for the object detection model, which leaves us much room to employ classification models on fog nodes.

Fig. 5: The output of the best cloud model with high- and low-quality videos. Fig. 5a shows the ground truth output from the cloud model (i.e., FasterRCNN-101) for high-quality (but bandwidth-consuming) videos. Fig. 5b illustrates that even for a very low-quality (but bandwidth-efficient) video, the cloud model can output some regions that contain objects with high confidence (red) and the locations of regions that may contain objects (blue). This observation provides us with design options to save bandwidth as well as computational resources.
B. High and Low Video Streaming
Based on the three key observations, we now design a new protocol named high and low video streaming. Figure 6 depicts the whole process in our client-fog-cloud deployment scenario. First, a client equipped with a high-resolution camera sends the high-quality video to a co-located fog node. Since the client and fog nodes are co-located, the bandwidth cost is negligible. The video is then re-encoded into a low-quality format and sent to the cloud, where a high-performing object detection model is employed to analyze the videos. The model outputs the coordinates of regions that may contain objects, with location confidence scores, as well as bounding boxes with high recognition scores. We directly treat the high-confidence bounding boxes as labels and send them back to the fog for downstream applications. For regions that cannot be recognized in the low-quality format, we apply a filter method derived from [11] (a code sketch appears at the end of this subsection): we first keep the regions with location confidence scores higher than a threshold θ_loc (the value may differ across deployment scenarios). We then remove the regions that have a large overlap with the above-mentioned bounding boxes. Here, we use the intersection-over-union (IoU) to measure the overlap; an overlap value lower than a threshold θ_iou indicates that we can keep the region. We finally remove the regions that account for more than θ_back% of the frame size because they are most likely background.

After the filter, we send the coordinate information of the remaining regions to the fog nodes. As this information only occupies several bytes, the bandwidth usage can be ignored. To reduce the classification overhead, we design a lightweight classification pipeline on the fog by following the one-vs-all reduction rule [23].
The pipeline contains a feature extraction backbone network pre-trained on the ImageNet dataset to learn a high-level representation of the input regions. This representation is fed into a set of binary classifiers for classification. In doing so, we significantly reduce the computation resources needed for multi-class classification while maintaining high accuracy (sometimes this method can even perform better, as illustrated in Figure 5). In addition, as the video content varies over time, the number of returned coordinates differs. To maintain high throughput and relatively low latency, we implement the well-known dynamic batching [24] and feed batched regions into the models.

Fig. 6: The overview of our cloud-fog coordinator. The client first streams high-quality videos to fog nodes, where the videos are re-encoded and sent to the cloud. The cloud runs the best model to recognize the videos. The cloud DNN models output the bounding boxes with high-confidence classification scores and only send the coordinates of regions that contain uncertain objects to the fog for further processing. The design saves bandwidth while maintaining high accuracy and improves the re-encoding efficiency to reduce processing latency.
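A minimal sketch of the three-step filter described above, assuming each region arrives as a (box, location-score) pair with coordinates normalized to [0, 1]; the helper names, data layout, and default thresholds are ours, not values from the paper.

def iou(a, b):
    # Intersection-over-union of two boxes given as (x1, y1, x2, y2).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def filter_regions(regions, confident_boxes,
                   theta_loc=0.5, theta_iou=0.3, theta_back=0.4):
    # regions: [(box, loc_score)] proposals from the cloud detector.
    # confident_boxes: boxes already recognized with high confidence.
    # With normalized coordinates, a box's area is the fraction of the
    # frame it covers, so theta_back is a fraction of the frame size.
    kept = []
    for box, loc_score in regions:
        if loc_score < theta_loc:                          # step 1: low location confidence
            continue
        if any(iou(box, b) >= theta_iou for b in confident_boxes):
            continue                                       # step 2: overlaps a confident box
        if (box[2] - box[0]) * (box[3] - box[1]) > theta_back:
            continue                                       # step 3: likely background
        kept.append(box)
    return kept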
V. HUMAN-IN-THE-LOOP LEARNING
In this section, we improve our system performance by adopting human-in-the-loop learning. We first present our observations and motivation and then describe the problem formulation and the learning process in detail.
A. Observations and Motivation
Many previous cloud-driven approaches using fixed settings rely on the predictions of the well-trained DNN running in the cloud as the ground truth [10], [11], [18]. Although this can save much human effort and speed up the system verification process, it has several drawbacks. First, as the example in Figure 7a shows, many objects still cannot be identified correctly even when using the best model on high-quality videos (Key Observation 4). Thus, using the fixed setting prevents us from exploring the parameter space to achieve higher performance. Second, in many cases, though an object in a complete image cannot be detected and classified correctly, it can be recognized in a cropped region (Key Observation 5), as shown in Figure 7b. This can be attributed to the fact that the pixels surrounding an object can mislead even sophisticated models. In this case, using the prediction of the sophisticated DNN as the ground truth will produce a wrong system performance result (the system's result is right, but the ground truth is wrong).
Fig. 7: Case studies of detection and classification models. Fig. 7a shows that even for the best object detection model, some objects cannot be located and recognized. Fig. 7b illustrates that once we crop the regions that cannot be detected by the object detection model and feed them to a classification model, they can be classified correctly. This observation can be attributed to the misleading surrounding pixels.

Moreover, the performance of the current system relies on fixed, pre-trained models and thus suffers from data drift when the distribution of online inference data diverges from the training data. Over time, the effectiveness of the entire system deteriorates. Also, when new objects appear, the system cannot handle them.

To overcome these issues, we design a module that combines human-in-the-loop and incremental learning. This particular enhancement faces two challenges. First, the module should consider the catastrophic forgetting issue (i.e., after being trained with new data, the model could perform worse on existing data). Second, the module should be lightweight, easy to implement, and extensible.
B. Incremental Learning Process
We now describe the incremental learning (IL) process, shown in Figure 8, following our high and low video streaming protocol. In this work, we only update the DNNs on fog servers and leave the cloud DNNs' update as future work. We do not claim novelty in the IL algorithm design, and the system supports easily integrating new models.

After obtaining the coordinates from the cloud, the system crops the regions that may contain objects. A human operator (user) can assign a label $y_t$ to the cropped parts, and a lightweight backbone network is applied to extract feature vectors from them. We denote the feature vector of the $t$-th cropped image as $x_t$; the classifier can then be given by $f(x; \Theta)$, where $\Theta$ is the set of parameters of the designed DNN. Once enough images are collected (a pre-defined value decided by developers), VPaaS updates the classification function $f$ in an incremental manner (to make use of the human feedback on cropped images) as

$f_t = f_{t-1} - \eta \, \partial_f R[f, x_t, y_t] \,|_{f = f_{t-1}}$,  (3)

where $R[f, x_t, y_t]$ is the empirical risk w.r.t. the labeled instance $(x_t, y_t)$, $\partial_f$ is short for $\partial/\partial f$ (the gradient w.r.t. $f$), and $\eta$ is the learning rate. To achieve real-time human-in-the-loop feedback, we propose to only update the weights $W$ of the last layer of the DNN. The objective function for updating the model is then given by

$W = \arg\min_W \|W - W_{t-1}\|_F^2 + \eta \, l(f(x_t), y_t)$,  (4)

where $l(\cdot, \cdot)$ is the empirical loss, chosen as the cross-entropy loss for classification, i.e.,

$l(f(x_t), y_t) = -y_t \log f(x_t)$.  (5)

Here, $f(x_t) = \sigma(W^T x_t)$, where $\sigma(\cdot)$ is an activation function, and the bias term is absorbed into $W$ by simply appending the original $\tilde{x}_t$ with a feature 1, i.e., $W^T x_t = [\tilde{W}, \tilde{b}]^T [\tilde{x}_t, 1]$. The detailed formulation of (4) is then

$\arg\min_W \|W - W_{t-1}\|_F^2 - \eta \, y_t \log f(x_t)$.  (6)

By taking the derivative w.r.t. $W$ and setting it to zero, we have

$W_t = W_{t-1} - \frac{\eta y_t}{\sigma(W^T x_t)} \frac{\partial \sigma(W^T x_t)}{\partial W}$.  (7)

If we use $W_{t-1}^T x_t$ to approximate $W_t^T x_t$ and choose the activation function to be ReLU, then

$W_t = W_{t-1} - \frac{\eta y_t}{\sigma(W_{t-1}^T x_t)} x_t$ if $W_{t-1}^T x_t > 0$, and $W_t = W_{t-1}$ if $W_{t-1}^T x_t \le 0$.  (8)

When the human labor budget is exhausted after $\tau$ update steps, we obtain a set of classifiers $\{W_t\}_{t=1}^{\tau}$. These classifiers can be combined with learned weights to improve performance in further prediction. Let $z_i = [f(x_i; W_1), \cdots, f(x_i; W_\tau)]$; the weights $\omega$ can then be learned by solving a regularized optimization problem, i.e.,

$\arg\min_\omega \|\omega^T z_i - y_i\|^2 + v \|\omega\|^2$,  (9)

where the labeled data $(x_i, y_i)$ obtained in the incremental learning stage is reused.

Fig. 8: The human-in-the-loop process. First, the cropped images are collected together with the inference results. Second, annotators check the dataset and correct the wrong results. Finally, the images with their human labels are fed into the model for retraining.
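The update in (3)-(8) touches only the last layer. The PyTorch sketch below mirrors that scheme under our own simplifications: a frozen backbone, a single linear last layer with cross-entropy loss, and one gradient step on the proximal objective of Eq. (4). It illustrates the procedure rather than reproducing the exact VPaaS trainer.

import torch
import torch.nn.functional as F

def incremental_update(W_prev, features, labels, eta=0.01):
    # W_prev:   (num_classes, feat_dim) last-layer weights from step t-1.
    # features: (batch, feat_dim) backbone features of cropped regions.
    # labels:   (batch,) human-corrected class indices.
    # Objective per Eq. (4): ||W - W_{t-1}||_F^2 + eta * cross_entropy.
    W = W_prev.clone().requires_grad_(True)
    logits = features @ W.t()
    loss = ((W - W_prev) ** 2).sum() + eta * F.cross_entropy(logits, labels)
    loss.backward()
    with torch.no_grad():
        W_t = W - W.grad          # a single gradient step on the last layer only
    return W_t.detach()

The classifiers accumulated over τ such updates can then be combined with weights learned from Eq. (9).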
VI. EVALUATION RESULTS

In this section, we first introduce the system implementation details and the experiment setup. We then evaluate VPaaS and present the evaluation results as well as the insights gained from them.
Fig. 9: The normalized bandwidth usage of different systems on three video datasets ((a) DashCam, (b) Drone, (c) Traffic). Compared to the cloud-driven methods, VPaaS achieves the lowest bandwidth usage while maintaining higher or comparable accuracy. The client-driven methods have lower bandwidth usage, but their accuracy drops drastically. MPEG denotes using the original videos for inference.
A. Experiment Settings
Implementation detail.
We build our system atop CloudBurst [25], an open-source serverless platform that can be deployed to a private cluster. We extend the system to support running on cloud servers equipped with GPUs and on fog devices with GPU computation cores. The communication between cloud and fog is supported by gRPC. We prototype our cloud-fog coordinator and human-in-the-loop module in Python. We also implement a set of Python APIs to provide fine-grained control and customization for users to perform a wide range of DNN-based video analysis tasks, such as 1) video encoding and decoding with support for many formats, 2) video pre-processing including resizing, batching, etc., 3) model inference, and 4) video post-processing. We use OpenCV 4.5.0 to read videos and FFmpeg 4.3.1 to adjust the video quality. Since DNN models are the core of our system, we also implement a model zoo using MongoDB. We use PyTorch 1.4.0 to re-train and fine-tune our DNN models when improving them with our human-in-the-loop process.
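For illustration, quality control of this kind can be done by shelling out to FFmpeg; the flags below form one plausible libx264 invocation and are our assumption rather than VPaaS's exact command.

import subprocess

def reencode(src, dst, scale=0.8, qp=36):
    # Downscale and re-encode with a fixed quantization parameter (QP).
    # A higher QP yields a smaller, lower-quality video.
    vf = f"scale=trunc(iw*{scale}/2)*2:trunc(ih*{scale}/2)*2"  # keep even dimensions
    subprocess.run(
        ["ffmpeg", "-y", "-i", src, "-vf", vf,
         "-c:v", "libx264", "-qp", str(qp), "-an", dst],
        check=True,
    )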
Experimental testbed.
We deploy our VPaaS on a real-world cloud-fog-client testbed, as shown in Figure 1. The cloud side is hosted on servers equipped with 4 NVIDIA V100 GPUs and an Intel Core i9-9940X CPU. The fog server is an NVIDIA AGX Xavier [26] with a 512-core Volta GPU and an 8-core ARM CPU. The client is a Raspberry Pi 4B with 4GB RAM and a 1080P video camera. Following existing video system settings [27], we set up a switch to build a local network connecting the fog and the clients. The network bandwidth between them is 10Gbps. Both the fog and the client are connected to the cloud servers through a WAN (wide area network).
Evaluation metrics.
An ideal video analytics system should consider the following metrics: bandwidth usage, accuracy, latency, and cloud cost.
Bandwidth Usage.
This metric evaluates the network resources used for video transmission. We calculate the bandwidth usage as $b = \frac{\sum_{i=1}^{n} v_i}{t}$, where $v_i$ is the size of a video chunk at a specific quality and $n$ is the total number of video chunks within a duration $t$. We normalize the usage of both VPaaS and the baselines against that of the original videos without quality control.

Accuracy.
We use the F1 score (the harmonic mean of precision and recall) as the accuracy. For a public dataset without human-labeled ground truth, we follow previous settings [10], [11], [18]: we run a FasterRCNN101 model and use its output as the label. We compare a system's output with the label to obtain the true positives, false positives, and false negatives for calculating the F1 score.
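As an illustration of this accuracy computation, the sketch below matches predictions to the FasterRCNN101-derived labels with an IoU test; the matching rule and the 0.5 threshold are our assumptions, since the paper does not spell them out.

def box_iou(a, b):
    # Intersection-over-union of boxes (x1, y1, x2, y2).
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def f1_score(preds, labels, iou_thresh=0.5):
    # preds/labels: lists of (box, class_id) for one frame; a prediction
    # is a true positive if it matches an unmatched label of the same
    # class with IoU >= iou_thresh.
    matched, tp = set(), 0
    for box, cls in preds:
        for j, (lbox, lcls) in enumerate(labels):
            if j not in matched and cls == lcls and box_iou(box, lbox) >= iou_thresh:
                matched.add(j)
                tp += 1
                break
    fp, fn = len(preds) - tp, len(labels) - tp
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0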
Cloud Cost.
Estimating cloud cost is critical for real-world system deployment, as some scenarios require a strict cloud budget. We do not consider the costs of fog devices in our evaluations, as they are amortized to zero over the continuous frame processing. In this paper, we adopt the serverless billing method, in which users pay for the total number of requests to the public cloud (e.g., AWS). Therefore, we define the cloud cost as $c_F = p_F \cdot n^*$, where $p_F$ is the cost per frame and $n^*$ is the number of frames processed by the cloud.

Latency.
We measure the end-to-end latency following the freshness definition from [11] and [10]. It is the duration between when an object first appears in a video frame captured by the client and when it is localized and classified in either the fog or the cloud. The duration consists of the quality control time, the transmission time, and the content analytics time.
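The bandwidth and cloud-cost metrics above reduce to simple sums; a minimal sketch with our own variable names:

def bandwidth_usage(chunk_sizes, duration):
    # b = (sum of chunk sizes v_i) / t; normalized elsewhere against
    # the size of the original, quality-uncontrolled stream.
    return sum(chunk_sizes) / duration

def cloud_cost(price_per_frame, frames_in_cloud):
    # c_F = p_F * n^*, following the per-request serverless billing model.
    return price_per_frame * frames_in_cloud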
Compared methods.
We compare our method with three state-of-the-art methods: one client-driven method, Glimpse [7], and two cloud-driven methods, DDS [11] and CloudSeg [28]. Glimpse computes pixel-level frame differences to filter out frames and runs an object tracking model for localization and recognition. Compared to its original version, our implementation uses a more advanced tracking model from OpenCV and hence has better accuracy. CloudSeg first sends low-quality videos from the client to the cloud and uses a pre-trained super-resolution model [16] from its official implementation to recover the videos; the system then calls an object detection model to analyze the video content. DDS also sends low-quality video to the cloud and then re-sends regions that may contain objects in high quality for further processing. To ensure a fair comparison, we use the same pre-trained object detection model, FasterRCNN101, in the cloud for all of these methods.

TABLE I: The specifications of the video datasets used in the evaluation. We evaluate our system with three datasets: dashcam, traffic, and drone. Each of them contains a large number of video clips covering a variety of scenarios.
Dataset.
We use real-world video datasets to evaluate the systems. The data covers a variety of scenarios, including traffic monitoring, parking management, and video surveillance. Their details are summarized in Table I, and links to their repositories can be found in [29].
B. Macro Benchmarking
We start by comparing the overall performance of VPaaS with that of the other baselines on all videos. For all of the videos under test, we follow the frame-skip setting from [11], extracting one frame (called a keyframe) every 15 frames. Once we accumulate 15 keyframes, we pack them into a video chunk and send it to the cloud. For both VPaaS and DDS, the first-round QP and resolution scale (RS) are 36 and 0.8, respectively, while the second-round QP and RS are 26 and 0.8, respectively. For CloudSeg, we downscale the videos with QP set to 20 and RS set to 0.35 and recover them with the upscale ratio set to 2x.
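A sketch of this keyframe extraction and chunking policy (the OpenCV usage is illustrative; VPaaS's actual implementation may differ):

import cv2

def keyframe_chunks(video_path, skip=15, chunk_len=15):
    # Yield lists of keyframes: every `skip`-th frame, `chunk_len` per chunk.
    cap = cv2.VideoCapture(video_path)
    chunk, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % skip == 0:
            chunk.append(frame)
            if len(chunk) == chunk_len:
                yield chunk        # encode and send this chunk to the cloud
                chunk = []
        idx += 1
    cap.release()
    if chunk:
        yield chunk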
Bandwidth and Accuracy.
Figure 9 presents the normalized bandwidth usage and achieved F1 scores of the different systems on all three datasets. We have two significant observations. First, VPaaS achieves higher or comparable accuracy with about 21% bandwidth savings compared with the closest cloud-driven system. The result indicates that by utilizing both fog computation and human-in-the-loop learning, we can further save bandwidth and improve accuracy with fewer cloud resources. Second, VPaaS consistently outperforms the client-driven methods in terms of accuracy. Though these methods consume less bandwidth, their very low accuracy prevents them from being deployed in complex scenarios.
Cloud Cost.
Figure 10a compares the cloud cost of VPaaS against the two cloud-driven baselines. As shown, VPaaS outperforms the other two methods by a large margin. Specifically, for each frame, our system runs the expensive object detection model in the cloud only once. In contrast, CloudSeg needs an extra super-resolution model, and hence its cost is double that incurred by our system. Also, DDS runs multiple rounds of detection on frames that are difficult to detect or classify, so it incurs more cost.
Latency.
We report the overall latency gain in Figure 10b. In general, VPaaS performs better than all the other cloud-driven methods. Specifically, VPaaS achieves about a 2.5x speedup in 50th percentile (median) latency over DDS and CloudSeg. Three factors contribute to the gains: moving video quality control from resource-limited clients to more capable fog nodes, reduced transmission time, and the faster classification models on fog nodes.
Fig. 10: The normalized cloud cost (a) and response latency CDF (b) over all three datasets. Compared to the other cloud-driven baselines, our method does not require extra cloud resources and thus saves cloud cost significantly. Meanwhile, with the help of our cloud-fog protocol design, the overall latency is reduced by a large margin. We owe this to the faster quality control process, near-client computation, and low-quality video transmission.

Fig. 11: The system latency under different network bandwidths for (a) 720P and (b) 1080P video. VPaaS achieves a low latency under both low bandwidth (10Mbps) and high bandwidth (20Mbps).
C. Micro Benchmarking
Impact of Network Bandwidth.
We validate VPaaS's sensitivity to network bandwidth by testing our system's response delays at bandwidths of 10, 15, and 20 Mbps. The results shown in Figure 11 demonstrate that our system achieves very steady latency under different network bandwidths, which indicates that our system is robust to network bandwidth fluctuations.
Impact of Video Content Types.
We then examine the bandwidth savings on different videos from the three datasets to illustrate the impact of video content type on performance. We first randomly select three videos from each dataset and then use the nine videos to evaluate our system and DDS (the closest work to our system). As shown in Figure 12, our system outperforms the baseline on all video types substantially. The results show that the performance gain mainly comes from our innovative system designs and is independent of video content.
Impact of HITL Parameter.
Next, we evaluate the influence of the HITL parameter, the human labor budget, which decides how much data will be labeled in a time window. To compare the impact of this setting, we first divide a dataset into a training set and a test set. We use a portion of the training set for training and then gradually increase the percentage of data participating in the training. The results presented in Figure 13a show that the HITL indeed addresses the data drift issue and improves performance. Also, as the budget increases, the growth in accuracy is no longer significant. We attribute this to overfitting and will explore more efficient algorithms to overcome this issue.

Fig. 12: Bandwidth usage (normalized to that of DDS) per video under three video content types (DashCam, Drone, and Traffic, three videos each). The bandwidth usage for DDS is fixed at one (100%) for each content type. VPaaS outperforms the baseline on all videos, indicating the effectiveness of our system design.

Fig. 13: The impact of HITL. Fig. 13a shows the effect of the human labor budget on accuracy, demonstrating that incremental learning can address the data drift issue and thus improve performance. Fig. 13b illustrates the training overhead. During training, the GPU utilization (top) increases by about 15% and the latency (bottom) increases by about 0.5 seconds. However, the effect is considered negligible and can be further avoided by designing a better scheduler.
HITL Overhead.
We now examine the HITL overhead. During video analytics, VPaaS triggers the auto-trainer to tune the model. It batches the data labeled by humans with batch size 4 and feeds the batched data into the model for training. This process is executed on the same GPU used for model inference to save cloud cost. Figure 13b illustrates the overhead. During training, the GPU utilization increases by about 10%, and the latency increases by about 0.5 seconds as a result of the higher workload. Once the process finishes, the latency quickly reverts to the normal level. This finding prompts us to avoid the workload spike by starting the training process when the system is relatively idle.
D. Case Studies
Usability.
We show a start-to-finish process of video application development from a VPaaS user's perspective. We assume that a user has trained a face recognition model. Figure 14 illustrates the user's code to build a video face recognition application across cloud and fog. First, the user needs to register the model with our system, where the model will be profiled. The model, together with its profiling information, is stored in the cloud model zoo. Second, the user can dispatch the model to the fog and deploy an already-registered model to the cloud. Third, the user can specify a policy to orchestrate the two models for the application (e.g., monitoring the network congestion/latency to decide whether to send videos to the cloud or process them locally). In addition, our system automatically calls the video pre- and post-processing functions to complete the pipeline.

Fig. 14: Example code to build a video application in VPaaS.
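Since the listing in Figure 14 is not reproduced in the text, the following is a hypothetical sketch of the three steps just described; the Client stub and every method name are illustrative placeholders, not VPaaS's published API.

# Hypothetical stub standing in for the VPaaS control plane; all names
# below are illustrative, not the platform's real API.
class Client:
    def register_model(self, name, path):
        print(f"profiling and storing {name} in the model zoo")
        return name
    def dispatch(self, model, target):
        print(f"dispatching {model} to {target}")
    def deploy(self, model, target):
        print(f"deploying {model} to {target}")
    def register_policy(self, name, fn):
        print(f"registering policy {name}")

client = Client()
# 1) Register the trained face recognition model; it is profiled and
#    stored in the cloud model zoo.
model = client.register_model("face_recognizer", "face_rec.pth")
# 2) Dispatch the model to the fog and deploy an already-registered
#    detector to the cloud.
client.dispatch(model, target="fog")
client.deploy("fasterrcnn101", target="cloud")
# 3) Register a policy that watches network latency to decide whether
#    to send videos to the cloud or process them locally.
client.register_policy(
    "latency_aware",
    lambda stats: "fog" if stats["net_latency_ms"] > 100 else "cloud")
# Pre- and post-processing functions are attached automatically.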
Fault-tolerance.
To test the system's fault-tolerance feature, we simulate an outage scenario by shutting down the cloud server. In this situation, our fog nodes run a backup: a lightweight object detection model such as YOLOv3 [30] is used to resume the recognition tasks quickly (albeit with reduced accuracy). As shown in Figure 15, the fog node detects the disconnection issue at t=25s. It then feeds the video chunks already cached on the fog nodes to the YOLOv3 model deployed at the fog to continue the object detection tasks. Though the accuracy decreases, our system maintains service continuity until the recovery of the cloud server.

Fig. 15: Fault-tolerance evaluation. VPaaS can quickly call a small backup object detection model (e.g., YOLOv3) on the fog once it detects a network disconnection. Though the accuracy drops (bottom), the system can still provide a low-latency service (middle).

Scalability.
Our serverless system also provides the essential provisioning function for dynamic workloads. To test this feature, we simulate a scenario in which users install more fog nodes and cameras by increasing the number of video chunks sent simultaneously. As shown in Figure 16, the number of GPUs used increases as more video chunks come in, so our system maintains a low latency even under heavy workload.

Fig. 16: Scalability evaluation. Our serverless platform can scale GPUs in/out to save cost and maintain high availability under a dynamic workload. In this case, we simulate a scenario where users install more fog nodes and cameras for their video applications.

VII. CONCLUSION
Efficient video analytics empowers many applications ranging from smart city to warehouse management. This paper presents a serverless platform termed VPaaS to run DNN-based video analytics pipelines that take full advantage of the client-fog-cloud infrastructure's synergy. It can efficiently orchestrate both fog and cloud resources for cost-effective and highly accurate video analytics. VPaaS employs a human-in-the-loop design philosophy, continuously improving model performance. The system provides a set of functions for video application development and deployment, freeing developers from tedious resource management and system administration tasks. Extensive experiments demonstrate that VPaaS consumes less bandwidth and cloud cost and has lower processing latency than state-of-the-art systems. As future work, we plan to involve network topology in the system design. Also, preserving privacy in video analytics is very important, and we plan to explore this research direction in the future.
REFERENCES

[1] J. Emmons, S. Fouladi, G. Ananthanarayanan, S. Venkataraman, S. Savarese, and K. Winstein, "Cracking open the DNN black-box: Video analytics with DNNs across the camera-cloud boundary," in Proceedings of the 2019 Workshop on Hot Topics in Video Analytics and Intelligent Edges, 2019, pp. 27-32.
[2] G. Ananthanarayanan, V. Bahl, L. Cox, A. Crown, S. Nogbahi, and Y. Shu, "Video analytics-killer app for edge computing," in Proceedings of the 17th Annual International Conference on Mobile Systems, Applications, and Services, 2019, pp. 695-696.
[3] E. Bas, A. M. Tekalp, and F. S. Salman, "Automatic vehicle counting from video for traffic flow analysis," in 2007 IEEE Intelligent Vehicles Symposium. IEEE, 2007, pp. 392-397.
[4] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," in Advances in Neural Information Processing Systems, 2015, pp. 91-99.
[5] S. S. Ogden and T. Guo, "Characterizing the deep neural networks inference performance of mobile applications," arXiv preprint arXiv:1909.04783, 2019.
[6] H. Zhang, Y. Huang, Y. Wen, J. Yin, and K. Guan, "No more 996: Understanding deep learning inference serving with an automatic benchmarking system," arXiv preprint arXiv:2011.02327, 2020.
[7] T. Y.-H. Chen, L. Ravindranath, S. Deng, P. Bahl, and H. Balakrishnan, "Glimpse: Continuous, real-time object recognition on mobile devices," in Proceedings of the 13th ACM Conference on Embedded Networked Sensor Systems, 2015, pp. 155-168.
[8] X. Ran, H. Chen, X. Zhu, Z. Liu, and J. Chen, "DeepDecision: A mobile deep learning framework for edge video analytics," in IEEE INFOCOM 2018 - IEEE Conference on Computer Communications. IEEE, 2018, pp. 1421-1429.
[9] C.-C. Hung, G. Ananthanarayanan, P. Bodik, L. Golubchik, M. Yu, P. Bahl, and M. Philipose, "VideoEdge: Processing camera streams using hierarchical clusters," in 2018 IEEE/ACM Symposium on Edge Computing (SEC). IEEE, 2018, pp. 115-131.
[10] B. Zhang, X. Jin, S. Ratnasamy, J. Wawrzynek, and E. A. Lee, "AWStream: Adaptive wide-area streaming analytics," in Proceedings of the 2018 Conference of the ACM Special Interest Group on Data Communication, 2018, pp. 236-252.
[11] K. Du, A. Pervaiz, X. Yuan, A. Chowdhery, Q. Zhang, H. Hoffmann, and J. Jiang, "Server-driven video streaming for deep learning inference," in Proceedings of the Conference of the ACM Special Interest Group on Data Communication, 2020, pp. 557-570.
[12] D. Kang, J. Emmons, F. Abuzaid, P. Bailis, and M. Zaharia, "NoScope: Optimizing neural network queries over video at scale," arXiv preprint arXiv:1703.02529, 2017.
[13] C. Canel, T. Kim, G. Zhou, C. Li, H. Lim, D. G. Andersen, M. Kaminsky, and S. R. Dulloor, "Scaling video analytics on constrained edge nodes," arXiv preprint arXiv:1905.13536, 2019.
[14] H. Zhang, G. Ananthanarayanan, P. Bodik, M. Philipose, P. Bahl, and M. J. Freedman, "Live video analytics at scale with approximation and delay-tolerance," in USENIX Symposium on Networked Systems Design and Implementation (NSDI 17), 2017, pp. 377-392.
[15] Y. Wang, W. Wang, J. Zhang, J. Jiang, and K. Chen, "Bridging the edge-cloud barrier for real-time advanced vision analytics," in USENIX Workshop on Hot Topics in Cloud Computing (HotCloud 19), 2019.
[16] N. Ahn, B. Kang, and K.-A. Sohn, "Fast, accurate, and lightweight super-resolution with cascading residual network," in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 252-268.
[17] C. Pakha, A. Chowdhery, and J. Jiang, "Reinventing video streaming for distributed vision analytics," in USENIX Workshop on Hot Topics in Cloud Computing (HotCloud 18), 2018.
[18] J. Jiang, G. Ananthanarayanan, P. Bodik, S. Sen, and I. Stoica, "Chameleon: Scalable adaptation of video analytics," in Proceedings of the 2018 Conference of the ACM Special Interest Group on Data Communication, 2018, pp. 253-266.
[19] S. Fouladi, R. S. Wahby, B. Shacklett, K. V. Balasubramaniam, W. Zeng, R. Bhalerao, A. Sivaraman, G. Porter, and K. Winstein, "Encoding, fast and slow: Low-latency video processing using thousands of tiny threads," in USENIX Symposium on Networked Systems Design and Implementation (NSDI 17), 2017, pp. 363-376.
[20] L. Ao, L. Izhikevich, G. M. Voelker, and G. Porter, "Sprocket: A serverless video processing framework," in Proceedings of the ACM Symposium on Cloud Computing, 2018, pp. 263-274.
[21] S. Han, H. Shen, M. Philipose, S. Agarwal, A. Wolman, and A. Krishnamurthy, "MCDNN: An approximation-based execution framework for deep stream processing under resource constraints," in Proceedings of the 14th Annual International Conference on Mobile Systems, Applications, and Services, 2016, pp. 123-136.
[22] S. Yi, Z. Hao, Q. Zhang, Q. Zhang, W. Shi, and Q. Li, "LAVEA: Latency-aware video analytics on edge computing platform," in Proceedings of the Second ACM/IEEE Symposium on Edge Computing, 2017, pp. 1-13.
[23] R. Rifkin and A. Klautau, "In defense of one-vs-all classification," Journal of Machine Learning Research, vol. 5, pp. 101-141, 2004.
[24] D. Crankshaw, X. Wang, G. Zhou, M. J. Franklin, J. E. Gonzalez, and I. Stoica, "Clipper: A low-latency online prediction serving system," in USENIX Symposium on Networked Systems Design and Implementation (NSDI 17), 2017, pp. 613-627.
[25] V. Sreekanti, C. W. X. C. Lin, J. M. Faleiro, J. E. Gonzalez, J. M. Hellerstein, and A. Tumanov, "Cloudburst: Stateful functions-as-a-service," arXiv preprint arXiv:2001.04592, 2020.
[26] NVIDIA AGX Xavier.
[27] In Proceedings of the 18th Conference on Embedded Networked Sensor Systems, 2020, pp. 409-421.
[28] Y. Wang, W. Wang, J. Zhang, J. Jiang, and K. Chen, "Bridging the edge-cloud barrier for real-time advanced vision analytics," in USENIX Workshop on Hot Topics in Cloud Computing (HotCloud 19), 2019.
[29] "Video datasets used in the paper," shorturl.at/fwIM8, accessed: 2021-01-01.
[30] J. Redmon and A. Farhadi, "YOLOv3: An incremental improvement," arXiv preprint arXiv:1804.02767, 2018.