Split Computing for Complex Object Detectors: Challenges and Preliminary Results
Yoshitomo Matsubara
University of California, [email protected]
Marco Levorato
University of California, [email protected]
ABSTRACT
Following the trends of mobile and edge computing for DNN models, an intermediate option, split computing, has been attracting attention from the research community. Previous studies empirically showed that, while mobile and edge computing are often the best options in terms of total inference time, there are scenarios where split computing methods achieve shorter inference time. All the proposed split computing approaches, however, focus on image classification tasks, and most are assessed with small datasets that are far from practical scenarios. In this paper, we discuss the challenges in developing split computing methods for powerful R-CNN object detectors trained on a large dataset, COCO 2017. We extensively analyze the object detectors in terms of layer-wise tensor size and model size, and show that naive split computing methods would not reduce inference time. To the best of our knowledge, this is the first study to inject small bottlenecks into such object detectors and unveil the potential of a split computing approach. The source code and trained models' weights used in this study are available at https://github.com/yoshitomo-matsubara/hnd-ghnd-object-detectors.
KEYWORDS
Object detection, Split computing, Head network distillation
Accepted to the 4th International Workshop on Embedded and Mobile Deep Learning (EMDL '20), September 21, 2020, London, United Kingdom.

INTRODUCTION
Along with the rapid evolution of computing devices, deep learning approaches have been widely studied to train powerful machine learning models, and many such models achieve state-of-the-art performance in various tasks in vision and natural language processing. Such models, however, are often too complex to be executed on mobile devices, which have a severely constrained computational capacity. To address this critical problem, the research community proposed two key strategies: model compression and edge computing. In the former approach, complex models are simplified, e.g., by knowledge distillation [6] and model pruning and quantization [20, 27]. In the latter approach [1, 13, 15], mobile devices offload the execution of computationally expensive models to powerful (edge) computers located at the network edge. Clearly, edge computing requires to wirelessly transport the input data and model outcome on the link connecting the mobile device to the edge computer.

Recently, an intermediate option, namely split Deep Neural Network (DNN) or split computing, has been attracting considerable interest [2, 3, 8, 10, 11, 16, 24, 26]. Many of these methods literally split DNN models into head and tail portions, which are executed by the mobile device and edge computer, respectively. Note that in this case, instead of the input data, the tensor produced by the head model is transported to the edge computer. More sophisticated approaches [3, 8, 16, 24] modify the architecture of the model itself to (a) reduce the size of the tensor to be transferred, and (b) reduce the computing load assigned to the mobile device. Intuitively, these two features would prove essential in enabling effective task offloading in challenged settings (e.g., low capacity of the wireless channel and low computing capacity at the mobile device).

Although promising, as discussed in detail in Section 3.2, most of the existing split DNN approaches are either not evaluated [2] or evaluated only in simple classification tasks [3, 8, 16, 24, 26], such as on the miniImageNet, Caltech 101, CIFAR-10, and CIFAR-100 datasets. The goal of this paper is to discuss the several technical challenges in achieving effective DNN splitting for edge computing in one of the most difficult computing tasks, that is, object detection.

We specifically consider Faster and Mask R-CNNs [4, 22]. Our extensive module-wise model analysis indicates that naive splitting strategies would fail to provide any improvement in total inference time compared to mobile and edge computing. Then, we propose to redesign the object detectors to introduce small bottlenecks whose output tensor sizes are significantly smaller than that of the input layer. To the best of our knowledge, this is the first work that discusses split DNN approaches for such powerful object detectors, providing 1) benchmark results on a well-known object detection dataset, COCO 2017, and 2) a thorough illustration of the tradeoff between bottleneck tensor size and detection performance.

As discussed in detail in Section 3.2, the complex structure of CNN-based object detectors poses unique challenges in designing effective splitting approaches. For instance, as the models branch to provide intermediate outputs to later modules, splitting at later layers would require sending multiple tensors, which results in an increased amount of data to be transferred. Our design, then, places the bottleneck early in the model, before the first branching. However, this strategy requires an effective redesign of the network, as the first layers tend to amplify the input, rather than compressing it. The bottlenecks we designed reduced the tensor size by about 80.3–93.4% compared to pure edge computing (where the input tensor is transmitted), and the aggressively small bottlenecks resulted in a significant loss of detection performance. Based on our study, it is apparent that there is a strong need for efficient training and compression strategies (e.g., quantization) to achieve effective splitting in a wide range of settings and conditions. The source code and trained models' weights used in this study are released to enable such further studies and developments.
In this section, we discuss the architecture of recent object detectors based on Convolutional Neural Networks (CNNs) that achieve state-of-the-art detection performance. These CNN-based object detectors are often categorized into either single-stage or two-stage models. Single-stage models, such as the YOLO and SSD series [14, 21], are designed to directly predict bounding boxes and classify the contained objects. Conversely, two-stage models [4, 22] generate region proposals as output of the first stage, and classify the objects in the proposed regions in the second stage. In general, single-stage models have smaller execution time due to their lower overall complexity compared to two-stage models, which are in turn superior to single-stage ones in terms of detection performance.

Recent object detectors, e.g., Mask R-CNN and SSD [4, 23], adopt state-of-the-art image classification models, such as ResNet [5] and MobileNet v2 [23], as backbone. The main role of backbones in detection models, pretrained on large image datasets such as the ILSVRC dataset, is feature extraction. As illustrated in Figure 1, such features include the outputs of multiple intermediate layers in the backbone. All the features are fed to complex modules specifically designed for detection tasks, e.g., the feature pyramid network [12], to extract further high-level semantic feature maps at different scales. Finally, these features are used for bounding box regression and object class prediction.

Figure 1: R-CNN with ResNet-based backbone. Blue modules are from its backbone model, and yellow modules are specific to object detection. C: Convolution, B: Batch normalization, R: ReLU, M: Max pooling layers.

In this study, we focus our attention on state-of-the-art two-stage models. Specifically, we consider Faster R-CNN and Mask R-CNN [4, 22] pretrained on the COCO 2017 datasets.
Faster R-CNN is the strong basis of several 1st-place entries [5] in the ILSVRC and COCO 2015 competitions. The model is extended to Mask R-CNN by adding a branch for predicting an object mask in parallel with the existing branch for bounding box recognition [4]. Mask R-CNN is not only a strong benchmark, but also a well-designed framework, as it easily generalizes to other tasks such as instance segmentation and person keypoint detection.
We discuss the challenges in deploying CNN-based object detectors in three different scenarios: mobile, edge, and split computing. We use total inference time (including the time needed to transport the data over the wireless channel) and object detection performance as performance metrics.
In mobile computing, the mobile device executes the whole model, and the inference time is determined by the complexity of the model and the local computing power. Due to limitations in the latter, in order to obtain a tolerable inference time, the models must be simple and lightweight. To this aim, one of the main approaches is to use human-engineered features instead of those extracted by stacked neural modules. For instance, Mekonnen et al. [19] propose an efficient HOG (Histogram of Oriented Gradients) based person detection method for mobile robotics. Designing high-level features of human behavior on touch screens, Matsubara et al. [18] propose distance/SVM-based one-class classification approaches for screen unlocking on smart devices in place of password or fingerprint authentication.

In recent years, however, deep learning based methods have been outperforming models with human-engineered features in terms of accuracy. For image classification tasks, MobileNets [7, 23] and MnasNets [25] are examples of models designed to be executed on mobile devices while providing moderate classification accuracy. Corresponding lightweight object detection models are SSD [14] and SSDLite [23]. Techniques such as model pruning, quantization, and knowledge distillation [6, 20, 27] can be used to produce lightweight models from larger ones.

Table 1 summarizes the performance of some models trained on the COCO 2014 minival dataset as reported in the TensorFlow repository. Obviously, the SSD series object detectors with MobileNet backbones outperform the Faster R-CNN with ResNet-50 backbone in terms of inference time, but such lightweight models offer degraded detection performance. Note that in the repository, the COCO 2014 minival split is used for evaluation, and the inference time is measured on a machine with one NVIDIA GeForce GTX TITAN X, which is clearly not suitable to be embedded in a mobile device.
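As a concrete illustration of one of the techniques mentioned above, the distillation objective of [6] trains a small student model to mimic a larger teacher by matching temperature-softened class distributions. A minimal, framework-free sketch (plain Python; the logits and temperature below are illustrative, not taken from the paper):

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax over raw scores, softened by a temperature > 1."""
    exps = [math.exp(z / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(student_logits, teacher_logits, temperature=4.0):
    """Cross-entropy between softened teacher and student distributions."""
    p = softmax(teacher_logits, temperature)  # teacher = soft targets
    q = softmax(student_logits, temperature)
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))

# The loss is minimized when the student reproduces the teacher's logits.
print(kd_loss([0.0, 5.0], [0.0, 5.0]) < kd_loss([5.0, 0.0], [0.0, 5.0]))  # True
```

In practice this term is combined with the standard cross-entropy on the ground-truth labels; the sketch only shows the teacher-matching component.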
Also, the values reported in Table 1 are given for models implemented with the TensorFlow framework, with input images resized to 600 × … . Table 2 reports the running times [sec/image] we measured for Faster and Mask R-CNNs with different ResNet backbones on candidate mobile and edge devices (e.g., Raspberry Pi 4 and the desktop machine, respectively). If the prediction results are to be sent back to the mobile device, a further communication delay term should be taken into account, although outcomes (e.g., bounding boxes and labels) typically have a much smaller size compared to the input image. As discussed in [10, 16], the delay of the communication from the mobile device to the edge computer is a critical component of the total inference time, which may become dominant in some network conditions, where the performance of edge computing suffers from a reduced channel capacity.

Table 1: Mean average precision (mAP) on the COCO 2014 minival dataset and running time on a machine with an NVIDIA GeForce GTX TITAN X, as reported in https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/detection_model_zoo.md

TensorFlow model                 mAP    Speed [sec]
SSDLite with MobileNet v2        0.220  0.027
SSD with MobileNet v3 (Large)    0.226  N/A*
Faster R-CNN with ResNet-50      0.300  0.0890
* Reported speed was measured on a different device.

Table 2: Running time [sec/image] of Faster and Mask R-CNNs with different ResNet backbones.

Faster R-CNN            ResNet-18  ResNet-34  ResNet-50  ResNet-101
Raspberry Pi 4 Model B  27.73      23.40      26.14      35.16
NVIDIA Jetson TX2       0.617      0.743      0.958      1.26
Desktop + 1 GPU         0.0274     0.033      0.0434     0.0600

Mask R-CNN              ResNet-18  ResNet-34  ResNet-50  ResNet-101
Raspberry Pi 4 Model B  18.30      23.65      27.02      34.73
NVIDIA Jetson TX2       0.645      0.784      0.956      1.27
Desktop + 1 GPU         0.0289     0.0541     0.0613     0.0606
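To make the mobile/edge tradeoff (and the split computing option discussed next) concrete, the total inference time in each scenario can be modeled as compute time plus transfer delay. A toy calculation with hypothetical numbers — a 5 Mbit/s link, a 1.5 MB input image, a bottleneck tensor ten times smaller than the input, and run times loosely inspired by the tables above:

```python
def mobile_time(full_model_s):
    """Mobile computing: the device runs the whole model locally."""
    return full_model_s

def edge_time(input_bytes, bandwidth_bps, edge_model_s):
    """Edge computing: ship the input over the wireless link, run at the edge."""
    return 8 * input_bytes / bandwidth_bps + edge_model_s

def split_time(head_s, bottleneck_bytes, bandwidth_bps, tail_s):
    """Split computing: run the head locally, ship the (smaller) bottleneck
    tensor, then finish on the edge computer."""
    return head_s + 8 * bottleneck_bytes / bandwidth_bps + tail_s

bw = 5e6  # 5 Mbit/s wireless link (hypothetical)
print(mobile_time(26.1))                    # 26.1 s  (Pi-4-like device)
print(edge_time(1_500_000, bw, 0.043))      # 2.443 s (transfer dominates)
print(split_time(0.5, 150_000, bw, 0.043))  # 0.783 s (smaller transfer)
```

With these particular numbers the split option wins; with a fast link the transfer term shrinks and pure edge computing becomes the best choice, which is exactly the tension discussed in the text.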
Split computing is an intermediate option between mobile and edge computing. The core idea is to split models into head and tail portions, which are deployed at the mobile device and the edge computer, respectively. To the best of our knowledge, Kang et al. [10] were the first to propose to split deep neural network models. However, the study simply proposed to optimize where to split the model, leaving the architecture unaltered.

In split computing, the total inference time is the sum of three components: mobile processing time, communication delay, and edge processing time. To shorten the inference time of split computing compared to those of mobile and edge computing, the core challenge is to significantly reduce the communication delay while leaving only a small portion of the computational load on the mobile device for compressing the data to be transferred to the edge server. Splitting models in a straightforward way, as suggested in [10], however, does not lead to an improvement in performance in most cases. The tension is between the penalty incurred by assigning a portion of the overall model to a weaker device (compared to the edge computer) and the potential benefit of transmitting a smaller amount of data. However, most models do not present "natural" bottlenecks in their design, that is, layers with a small number of output nodes, corresponding to a small tensor to be propagated to the edge computer. In fact, the Neurosurgeon framework [10] identifies pure mobile or edge computing as the optimal computing strategy in most models.

Building on the work of Kang et al. [10], recent contributions propose DNN splitting methods [2, 3, 8, 11, 16, 24, 26]. Most of these studies, however, (I) do not evaluate models using their proposed lossy compression techniques [2], (II) lack motivation to split the models as the size of the input data is exceedingly small, e.g., 32 × 32 pixel RGB images in [8, 24, 26], (III) specifically select models and network conditions in which their proposed method is advantageous [11], and/or (IV) assess the proposed models in simple classification tasks such as the miniImageNet, Caltech 101, CIFAR-10, and CIFAR-100 datasets [3, 8, 16, 24].

As with CNN-based image classification models, it is not possible to reduce the inference time of CNN-based object detectors by naive splitting methods without altering the models' architecture. This is due to the design of the early layers of the models, which amplify the input data size. It is worth noting that Matsubara et al. [16] apply a lossless compression technique, standard Zip compression, to the intermediate outputs of all the splittable layers in a CNN model, and show that the compression gain is not sufficient to significantly reduce inference time in split computing. Figure 2 illustrates this effect by showing the amplification of the data at each of the core layers in Faster and Mask R-CNNs with ResNet-50, compared to the input tensor size. A solution is to introduce a small bottleneck within the model, and split the model at that layer [16]. In the following section, we discuss bottleneck injection for CNN-based object detectors, specifically Faster and Mask R-CNNs, and present preliminary experimental results supporting this strategy.

One of the core challenges of bottleneck injection in R-CNN object detectors is that the bottleneck needs to be introduced
in early stages of the detector compared to image classification models. As illustrated in Figure 1, the first branch of the network is after Layer 1. As a result, the bottleneck needs to be injected before that layer to avoid the need to forward multiple tensors produced by the branches (Figs. 1 and 2).

Figure 2: Layer-wise output tensor sizes of Faster and Mask R-CNNs, scaled by the input tensor size.

The amount of computational load assigned to the mobile device should be considered as well when determining the bottleneck placement. In fact, the execution time of the head model, which will be deployed on the mobile device, is a critical component in minimizing the total inference time. Figures 3 and 4 depict the number of parameters of each model used for partial inference on the mobile device when splitting the model at specific modules. The reported values provide a rough estimate of the head model's complexity as a function of the splitting point.

Recall that the feature pyramid network (FPN), region proposal network (RPN), and region of interest (RoI) heads in the R-CNN models are designed specifically for object detection tasks, and all the modules before them originally come from an image classification model (ResNet models [5] in this study). Because of not only the models' branching, but also the trends in these figures, it is clear that the bottleneck, and thus the splitting point, should be placed before "Layer 1".

Matsubara et al. [16] attempted to introduce a bottleneck in the first convolution layer of DenseNet-169 [9]. The bottleneck uses 4 output channels in place of 64 channels, so that the output tensor of the layer is smaller than the input tensor to the model.
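The amplification in the early layers, and the effect of shrinking the channel count, can be checked with back-of-the-envelope arithmetic. A sketch assuming a 3 × 800 × 800 input and the standard ResNet stem (7 × 7 convolution with stride 2, then 3 × 3 max pooling with stride 2); the input resolution is illustrative:

```python
def conv_out(size, kernel, stride, padding):
    """Spatial output size of a convolution/pooling layer (floor convention)."""
    return (size + 2 * padding - kernel) // stride + 1

c_in, hw = 3, 800
input_elems = c_in * hw * hw          # 1,920,000 elements

hw1 = conv_out(hw, 7, 2, 3)           # ResNet conv1 -> 400 x 400
hw2 = conv_out(hw1, 3, 2, 1)          # max pooling  -> 200 x 200

print(64 * hw1 * hw1 / input_elems)   # conv1 output (64 ch): ~5.33x the input
print(64 * hw2 * hw2 / input_elems)   # after pooling: ~1.33x, still amplified
print(4 * hw2 * hw2 / input_elems)    # a 4-channel bottleneck: ~0.083x
```

Even after pooling, the 64-channel feature map is larger than the input, which is why a naive split early in the network transmits more data than pure edge computing; only a drastic channel reduction pushes the ratio well below 1.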
Using the Caltech 101 dataset, they naively trained the redesigned model, which significantly degraded classification accuracy despite the relatively low complexity of the dataset compared to the ILSVRC dataset.

Based on this analysis and these results, we attempt to introduce a bottleneck to "Layer 1", which consists of multiple low-level modules such as convolution layers. Here, we redesign Layer 1 by pruning a couple of layers and adjusting hyperparameters to make its output shape match that of Layer 1 in the original model. The redesigned Layer 1 has a small bottleneck with C output channels, a key parameter to control the balance between detection performance and bottleneck size.

Figure 3: Cumulative number of parameters in core modules of Faster R-CNN.

Figure 4: Cumulative number of parameters in core modules of Mask R-CNN.

Assuming that the original models are overparameterized, we distill the head models while injecting bottlenecks. Specifically, we use head network distillation [16], a teacher-student training scheme applied only to the head portion to reduce training time. We treat the layers up to the end of Layer 1 in the original detector as a teacher model, and those in the redesigned detector as a student model. Note that the redesigned detector reuses all the modules and learnt parameters of the original detector except the modules up to the end of Layer 1. The exact network architectures are not described in this paper due to limited space, but the code and trained model weights are released to ensure reproducible results.

Figure 5: Average bottleneck tensor size vs. BBox and Mask mAPs on the COCO 2017 validation dataset (the test split is not publicly available). Bottleneck tensor size is scaled by the average input tensor size.

In this study, we use pretrained Faster and Mask R-CNNs with ResNet-50 as teacher models.
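At its core, the head network distillation objective reduces to a reconstruction loss between the student head's output and the frozen teacher head's output on the same input. A minimal, framework-free sketch (feature maps represented as flat lists of floats; the released code operates on actual tensors):

```python
def head_distillation_loss(student_feat, teacher_feat):
    """Mean squared error between the student's and the (frozen) teacher's
    head outputs; the tail of the detector is left untouched."""
    assert len(student_feat) == len(teacher_feat)
    n = len(student_feat)
    return sum((s - t) ** 2 for s, t in zip(student_feat, teacher_feat)) / n

# A perfectly mimicking student incurs zero loss.
print(head_distillation_loss([0.1, -0.4, 2.0], [0.1, -0.4, 2.0]))  # 0.0
print(head_distillation_loss([0.0, 0.0], [1.0, 3.0]))              # 5.0
```

Because only the head is trained, no ground-truth bounding boxes are needed for this stage, which is what makes the scheme cheaper than retraining the full detector.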
To the best of our knowledge, this is the first study that discusses introducing bottlenecks to CNN-based object detectors for split computing and provides experimental results.

Figure 5 illustrates the relationship between the output tensor size of the introduced bottleneck, for different numbers of output channels C (the largest being 15), and the bounding box / mask mAPs of the modified Faster and Mask R-CNNs. The rightmost data points, on the dashed line in the figure, correspond to the detection performance (mAP) of the original R-CNN models in edge computing, that is, with the splitting points at their input layers. Since in this study we do not alter the tail portion of the models, but modify and train the head portion only, we take the detection performance of the original models (on the dashed line) as the upper bound performance of our modified models. It can be observed that we successfully reduce the size of the tensors to be transferred to the edge computer compared to the input tensors, at the cost of some mAP degradation. Recall that there are no effective splitting points in the original models, as shown in Figure 2 (i.e., most of the normalized tensor sizes are above 1), and our introduced bottlenecks save approximately 80.3–93.4% of the tensor size for offloading compared to edge computing.
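The reported savings are consistent with simple channel arithmetic. Assuming the bottleneck output keeps Layer 1's spatial resolution (1/4 of the input per dimension) while the input has 3 channels, the relative tensor size is C / (3 · 4²); the figures reported above are averages over variable-size COCO images, so the sketch below is only a sanity check under that assumption:

```python
def bottleneck_ratio(c_bottleneck, c_in=3, downscale=4):
    """Bottleneck tensor size relative to the input tensor, assuming the
    bottleneck output is spatially downscaled by `downscale` per dimension."""
    return c_bottleneck / (c_in * downscale ** 2)

for c in (3, 6, 9):  # illustrative channel counts
    print(c, f"{1 - bottleneck_ratio(c):.2%} saving")  # 93.75%, 87.50%, 81.25%
```

These back-of-the-envelope savings fall in the same ballpark as the measured 80.3–93.4% range.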
CONCLUSIONS
In this study, we discussed the challenges in deploying CNN-based object detectors to mobile devices using three key strategies: mobile computing, edge computing, and the recently proposed split computing. We focused our discussion on two different state-of-the-art two-stage object detectors, which have no suitable "natural" splitting points, and injected small bottlenecks into the models based on the analysis we provided. While the introduced bottlenecks are smaller than the input tensors, the detection performance is sometimes significantly degraded, especially when introducing aggressively small bottlenecks. In addition to improving the detection performance, it would be necessary to assess the inference time using further compression techniques such as quantization, which we further discuss in [17].
ACKNOWLEDGMENTS
This work was supported by the NSF grants IIS-1724331 and MLWiNS-2003237, and DARPA grant HR00111910001.
REFERENCES
[1] Marco V Barbera, Sokol Kosta, Alessandro Mei, and Julinda Stefa. 2013. To offload or not to offload? The bandwidth and energy costs of mobile cloud computing. In Proceedings of IEEE INFOCOM 2013. 1285–1293.
[2] John Emmons, Sadjad Fouladi, Ganesh Ananthanarayanan, Shivaram Venkataraman, Silvio Savarese, and Keith Winstein. 2019. Cracking open the DNN black-box: Video Analytics with DNNs across the Camera-Cloud Boundary. In Proceedings of the 2019 Workshop on Hot Topics in Video Analytics and Intelligent Edges. 27–32.
[3] Amir Erfan Eshratifar, Amirhossein Esmaili, and Massoud Pedram. 2019. BottleNet: A Deep Learning Architecture for Intelligent Mobile Cloud Computing Services. IEEE, 1–6.
[4] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. 2017. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision. 2961–2969.
[5] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770–778.
[6] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2014. Distilling the Knowledge in a Neural Network. In Deep Learning and Representation Learning Workshop: NIPS 2014.
[7] Andrew Howard, Mark Sandler, Grace Chu, Liang-Chieh Chen, Bo Chen, Mingxing Tan, Weijun Wang, Yukun Zhu, Ruoming Pang, Vijay Vasudevan, et al. 2019. Searching for MobileNetV3. In Proceedings of the IEEE International Conference on Computer Vision. 1314–1324.
[8] Diyi Hu and Bhaskar Krishnamachari. 2020. Fast and Accurate Streaming CNN Inference via Communication Compression on the Edge. IEEE, 157–163.
[9] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. 2017. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4700–4708.
[10] Yiping Kang, Johann Hauswald, Cao Gao, Austin Rovinski, Trevor Mudge, Jason Mars, and Lingjia Tang. 2017. Neurosurgeon: Collaborative Intelligence Between the Cloud and Mobile Edge. In Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems (Xi'an, China). 615–629. https://doi.org/10.1145/3037697.3037698
[11] Guangli Li, Lei Liu, Xueying Wang, Xiao Dong, Peng Zhao, and Xiaobing Feng. 2018. Auto-tuning Neural Network Quantization Framework for Collaborative Inference Between the Cloud and Edge. In International Conference on Artificial Neural Networks. 402–411.
[12] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. 2017. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2117–2125.
[13] Luyang Liu, Hongyu Li, and Marco Gruteser. 2019. Edge assisted real-time object detection for mobile augmented reality. In The 25th Annual International Conference on Mobile Computing and Networking. 1–16.
[14] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg. 2016. SSD: Single shot multibox detector. In European Conference on Computer Vision. 21–37.
[15] Pavel Mach and Zdenek Becvar. 2017. Mobile Edge Computing: A Survey on Architecture and Computation Offloading. IEEE Communications Surveys & Tutorials 19 (2017), 1628–1656.
[16] Yoshitomo Matsubara, Sabur Baidya, Davide Callegaro, Marco Levorato, and Sameer Singh. 2019. Distilled Split Deep Neural Networks for Edge-Assisted Real-Time Systems. In Proceedings of the 2019 MobiCom Workshop on Hot Topics in Video Analytics and Intelligent Edges. 21–26.
[17] Yoshitomo Matsubara and Marco Levorato. 2021. Neural Compression and Filtering for Edge-assisted Real-time Object Detection in Challenged Networks. In Proceedings of the 25th International Conference on Pattern Recognition (To appear).
[18] Yoshitomo Matsubara, Haruhiko Nishimura, Toshiharu Samura, Hiroyuki Yoshimoto, and Ryohei Tanimoto. 2016. Screen Unlocking by Spontaneous Flick Reactions with One-Class Classification Approaches. IEEE, 752–757.
[19] Alhayat Ali Mekonnen, Cyril Briand, Frédéric Lerasle, and Ariane Herbulot. 2013. Fast HOG based person detection devoted to a mobile robot with a spherical camera. IEEE, 631–637.
[20] Antonio Polino, Razvan Pascanu, and Dan Alistarh. 2018. Model compression via distillation and quantization. In Sixth International Conference on Learning Representations.
[21] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. 2016. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 779–788.
[22] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In Advances in Neural Information Processing Systems. 91–99.
[23] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. 2018. MobileNetV2: Inverted Residuals and Linear Bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4510–4520.
[24] Jiawei Shao and Jun Zhang. 2020. BottleNet++: An end-to-end approach for feature compression in device-edge co-inference systems. IEEE, 1–6.
[25] Mingxing Tan, Bo Chen, Ruoming Pang, Vijay Vasudevan, Mark Sandler, Andrew Howard, and Quoc V Le. 2019. MnasNet: Platform-Aware Neural Architecture Search for Mobile. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2820–2828.
[26] Surat Teerapittayanon, Bradley McDanel, and H. T. Kung. 2017. Distributed Deep Neural Networks Over the Cloud, the Edge and End Devices. 328–339.
[27] Yi Wei, Xinyu Pan, Hongwei Qin, Wanli Ouyang, and Junjie Yan. 2018. Quantization mimic: Towards very tiny CNN for object detection. In