CBinfer: Exploiting Frame-to-Frame Locality for Faster Convolutional Network Inference on Video Streams
Lukas Cavigelli, Luca Benini
Abstract—The last few years have brought advances in computer vision at an amazing pace, grounded on new findings in deep neural network construction and training as well as the availability of large labeled datasets. Applying these networks to images demands a high computational effort and pushes the use of state-of-the-art networks on real-time video data out of reach of embedded platforms. Many recent works focus on reducing network complexity for real-time inference on embedded computing platforms. We adopt an orthogonal viewpoint and propose a novel algorithm exploiting the spatio-temporal sparsity of pixel changes. This optimized inference procedure resulted in an average speed-up of 9.1× over cuDNN on the Tegra X2 platform at a negligible accuracy loss of <0.1% and no retraining of the network for a semantic segmentation application. Similarly, an average speed-up of 7.0× has been achieved for a pose detection DNN and a reduction of 5× of the number of arithmetic operations to be performed for object detection on static camera video surveillance data. These throughput gains combined with a lower power consumption result in an energy efficiency of 511 GOp/s/W compared to 70 GOp/s/W for the baseline.

I. INTRODUCTION
Computer vision (CV) technology has become a key ingredient for automated data analysis over a broad range of real-world applications: smart cameras for video surveillance, robotics, industrial quality assurance, medical diagnostics, and advanced driver assistance systems have recently become popular due to the rising accuracy and robustness of CV algorithms. This industry interest has fostered a wealth of research projects, yielding fierce competition on many benchmark datasets such as ImageNet/ILSVRC [1], MS COCO [2], and Cityscapes [3], on which scientists from academia and big industry players evaluate their latest algorithms.

In recent years, the most competitive approaches to address many CV challenges have relied on machine learning with complex, multi-layered, trained feature extractors commonly referred to as deep learning [4]. The most frequently used flavor of deep learning techniques for CV are convolutional neural networks (ConvNets, CNNs). Since their landslide success at the 2012 ILSVRC competition over hand-crafted features, their accuracy has further improved year-over-year, even exceeding human performance on this complex dataset [1], [5]. CNNs keep expanding to more areas of computer vision and data analytics in general [6]–[9] and are moving
towards analyzing video data for action recognition [10], tracking [11], and improved object detection [12], [13].

(The authors would like to thank armasuisse Science & Technology for funding this research. This project was supported in part by the EU's H2020 programme under grant no. 732631 (OPRECOMP). L. Cavigelli and L. Benini are with ETH Zürich, Zürich, Switzerland; e-mail: {cavigelli,benini}@iis.ee.ethz.ch.)

Unfortunately, the high accuracy of CNNs comes with a high computational cost, requiring powerful GPU servers to train these networks for several weeks using hundreds of gigabytes of labeled data. While this is a massive effort, it is a one-time endeavor and can be done offline for many applications. However, the inference of state-of-the-art CNNs also requires several billions of multiplications and additions to classify even low-resolution images by today's standards [14]. While in some cases offloading to centralized compute centers with powerful GPU servers is also possible for inference after deployment, it is extremely costly in terms of compute infrastructure and energy. Furthermore, collecting large amounts of data at a central site raises privacy concerns, and the required high-bandwidth communication channel causes additional reliability problems and potentially prohibitive cost of deployment and operation [15]. For many applications, the introduced latency is prohibitive.

The alternative, on-site near-sensor embedded processing, largely solves the aforementioned issues by transmitting only the less sensitive, condensed information—potentially only security alerts in the case of a smart surveillance camera—but imposes restrictions on the available computation resources and power. These push the evaluation of such networks for real-time semantic segmentation or object detection out of reach of even the most powerful embedded platforms available today for high-resolution video data [14]. However, exactly such systems are required for a wide range of applications limited in cost (CCTV/urban surveillance, perimeter surveillance, consumer behavior and highway monitoring) and latency (aerospace and UAV monitoring and defense, visual authentication) [15], [16].

Large efforts have thus already been taken to develop optimized software implementations for heterogeneous platforms [14], [17]–[20], to design specialized hardware architectures [9], [21]–[25], and to adapt the networks to avoid expensive arithmetic operations by reducing arithmetic precision, exploiting sparsity, and developing more compact DNNs [8], [26]. However, these either 1) do not provide a strong enough performance boost, 2) are already at the theoretical limit of what can be achieved on a given platform, 3) are inflexible and not commercially available, or 4) incur a considerable accuracy loss. It is thus essential to extend the available options to efficiently perform inference on CNNs.

In this paper, we propose a novel method performing change-based inference (hence named CBinfer) for convolutional neural networks on video data from a static camera with limited frame-to-frame changes. We extend our preliminary
work in [27] with:
1) Enhancements to the algorithm for improved compute time and for ensuring a consistent input-output relation for each convolution layer.
2) An in-depth analysis of how changes propagate through the CBinfer DNN.
3) An analysis of accuracy, compute time, and energy efficiency for long frame sequences.
4) Additional evaluations for pose and object detection applications with much deeper networks and datasets without annotations.
5) Optimizations and evaluations on the more recent Nvidia Tegra X2 platform (a system-on-chip available on an embedded board with an affordable power budget for a stationary smart camera).
6) A discussion and evaluation of the processing steps and how the chosen configuration provides the highest performance gain.

Overall, the proposed method provides an average speed-up by a factor of 9.1–7.0× over an implementation relying on cuDNN while introducing only negligible accuracy loss. It thus significantly outperforms previous approaches exploiting frame-to-frame locality, which all have measured performance gains in the range of a few tens of percent while introducing accuracy losses of several percent (cf. Section II-C). Our method can be combined with most single-frame optimizations such as exploiting weight sparsity or the development of more compact DNN models. The code is available online at https://github.com/lukasc-ch/CBinfer.

II. RELATED WORK
In this section, we first describe existing optimized implementations for CNN inference and existing approximations trading accuracy for throughput. We then specifically survey related approaches exploiting the limited changes in video data to reduce the computational effort required to perform CNN inference. Finally, we discuss available datasets and CNNs with which we can evaluate our proposed algorithm.

Most per-frame optimization techniques can be combined with the method we propose herein. Existing approaches targeting video data have very limited gains and have not been specifically optimized for static-camera frame sequences.
A. Optimized Embedded System Implementations
The latest wave of interest in neural networks can be attributed to their sudden success driven by the availability of large datasets and increasingly powerful computing platforms. One of the most economical and practicable solutions for training medium-sized CNNs is to use a workstation with GPUs. The available software frameworks to implement and train CNNs provide strong support for this kind of platform.

The massive amount of compute time spent training CNNs has spurred the development of highly optimized GPU implementations. At first, most widely used frameworks relied on their own custom implementations, which have all converged to methods relying on matrix multiplications, leveraging the availability of highly optimized code in BLAS libraries and the fact that GPUs are capable of achieving a throughput within a few percent of their peak performance with this type of workload. Specialized libraries such as Nvidia's cuDNN and Nervana Systems' Neon provide some additional performance gains through assembly-level implementations [19] and additional algorithmic improvements such as Winograd and FFT-based convolution [20]. A specific implementation for non-batched inference on an embedded platform building on a matrix multiplication is documented in [14], also showing that more than 90% of the time is spent computing convolutions.
B. Approximations Trading Accuracy for Throughput
DNNs commonly require a high computation effort in the order of 20 GOp/frame for the classification of a 224×224 pixel image (1 multiply-add is counted as 2 operations) [28]. Extracting features when working with high-resolution images (e.g. for object detection or semantic segmentation) scales up the effort proportionally to the number of pixels, quickly reaching a few 100 GOp/frame.

Admitting limited accuracy losses in order to gain higher throughput by approximating existing networks, inference algorithms, and arithmetic operations can help overcome the computational obstacles preventing the widespread adoption of CNN-based algorithms on embedded and mobile platforms. Several such approaches are surveyed and compared in [29], [30]. In this section, we provide an overview of the different options that can be exploited.

One such option is the reduction of the arithmetic precision required to evaluate NNs. Various methods exist, from normal fixed-point analysis to retraining networks to adapt to quantized weights and activations. On some off-the-shelf software-programmable platforms, 16-bit or 8-bit arithmetic operations can be vectorized to obtain a performance boost [31]. Extreme methods go as far as enforcing binary weights [32], [33], and in some cases also binary activations [26]. This means that multiplications can be dropped entirely, and in the case of binary activations some of the add/subtract operations even collapse into XNOR and bit-count operations. Many networks can be quantized to 8 bit without an increase in error rate before a trade-off between precision and accuracy sets in [21], [34]. Some methods try to reduce the computational effort by pruning many very small weights to zero, making it possible to skip some operations [35], or even dynamically skip operations when the activations are zero [36]. More sophisticated quantization schemes such as vector quantization exist and can further compress a trained CNN model, but they require specialized hardware to bring an improvement in energy efficiency [36], [37].

Further research has focused on optimizing semantic segmentation and object detection algorithms to better reuse already-computed features by eliminating any non-convolutional elements from the network [38], [39] or introducing structured sparsity [40]. Simplifying the operations in a network, such as through low-rank approximations of 2D convolutions, or simply designing smaller networks with state-of-the-art methods, has been evaluated in [28], [41].

The method we propose in this paper does not supersede these methods, but can be combined with the aforementioned approximation methods to further improve throughput.
C. Video-based Computation Reduction
Obtaining per-frame features naturally seems like an easier task when these frames belong to a video sequence rather than a random collection of images. The limited movement of objects between frames can be exploited in object tracking by working with a limited search window within the frame [42], not only reducing the problem size, but also simplifying the regression task—up until the tracked target is occluded by a large object.
Clockwork CNNs [43] specifically target CNNs for semantic segmentation with a structure similar to [39]. They extend this work on fully convolutional networks, which presents a CNN with skip connections and deconvolution layers to refine the lower-resolution feature maps obtained deep within the network using the features extracted early in the network. They exploit the fact that lower-resolution feature maps within the network are more stable over time than the full-resolution input. They thus propose to re-evaluate the first few layers and the last layers affected through the skip connections more frequently than the coarser-grained feature maps. This is a strong limitation on the set of CNNs this method can be applied to. They present evaluations based on a static as well as a dynamic, content-adaptive re-evaluation schedule, showing that they can reduce the number of full-frame convolutions by about 40% before the accuracy starts to drop on the Youtube-Objects dataset. However, this approach is limited to updating entire frames, whereas we exploit that often only small parts of the scene change and need to be re-evaluated, which leads to larger savings.
CNNCache [44] describes a general approach pursuing a similar direction. The authors describe their method as a caching mechanism, where blocks of the image are matched to blocks in the previous frame, thereby fetching the results of similar blocks from the cache instead of recomputing them. Similarly to our work, this requires the selection of a threshold, and on top of that a block size and a cache depth in the form of an expiration time. The block matching allows handling video data where the camera is not fully static, but it does not allow perspective changes. They have shown that their method achieves an average speed-up in the order of 20% at a top-1 accuracy loss of 3.5% performing image classification, relative to the ncnn framework's default implementation. The capability to recall convolution results even when the specific image tile has moved introduces a significant overhead for comparing image tiles, thereby limiting the potential speed-up significantly. Further, this method requires a relatively high tolerance when comparing image tiles to be able to find matches, thereby introducing significant accuracy losses.
DeepMon [45] proposes another method, combining convolution layer decomposition, half-precision computation, and convolutional layer caching. Similarly to CNNCache, they divide the input to each convolutional layer into blocks and reuse the result when a block matches the one in the previous frame. To reduce overhead, they do not directly compare the blocks, but instead extract histogram-based features. They apply their technique only to the first few layers, because in later layers the caching overhead exceeds the compute latency savings. They show a speed-up attributable to caching of 18% for object detection and 36% for image classification at an accuracy loss in the order of 3.8% to 6.2%. While their histogram-based comparison method for the image tiles reduces the overhead, it still remains significant, and the introduced accuracy loss increases further.
Sigma-Delta Quantized Networks [46] is the method most similar to ours. They combine quantizing the network with decomposing the input to each convolution layer into the difference between the current frame's values and the previous frame's values, accumulating the result over time. They show a multi-fold reduction in the total number of operations, part of which can be attributed to the temporal-differences aspect of their method, at some drop in accuracy. However, whether this reduction in the number of multiply-add operations can be turned into actual performance gains after all the introduced overhead remains an open question.

D. Suitable Datasets and Neural Networks
We show the applicability of the concept to various applications, namely by evaluating the proposed method for semantic segmentation and pose detection. These are both often applied to high-resolution images and video streams with high frame rates above 10 frame/s for meaningful applications.

We are specifically interested in video sequences obtained from a static camera. While some such datasets exist (e.g. for person or vehicle detection or re-identification), most of them are limited to extremely few (1–3) classes and rarely target semantic segmentation. However, for the first application scenario—semantic segmentation—the dataset used in [47] provides ground-truth labels for 10-class semantic segmentation from an urban street surveillance perspective, and while they work with individual images, several surrounding unlabeled frames and a trained convolutional network are available. An example image labeled with the provided CNN is shown in Figure 1, and a sample sequence of 3 images is visualized in Figure 2.

TABLE I: Semantic segmentation CNN used for evaluations (per layer: type, output resolution, number of feature maps, compute time [ms], and relative compute time).

Fig. 1. Example output of the scene labeling network of [47] on which we evaluate our algorithm.

Fig. 2. A sample video sequence from the dataset of [47] showing the frame-by-frame changes by overlaying a sequence of length 3. Moving objects are only a small part of the overall scene and affect only a small share of the pixels.

For the second application—pose detection—several datasets to detect joints and limbs exist in the form of annotated images or moving-camera frame sequences, but none with a static camera.
To overcome this and to show the feasibility of applying CBinfer without annotated data, we use unlabeled frame sequences from the CAVIAR dataset (available at http://homepages.inf.ed.ac.uk/rbf/CAVIAR/, collected through the EC-funded CAVIAR project/IST 2001 37540) and take the pretrained network to generate the reference output. The dataset contains scenes recorded using surveillance cameras with wide-angle lenses and captures the interactions of few people. It has a resolution of 384×288 pixel and a frame rate of 25 frame/s. A few sample frames are shown in Figure 3.

For object detection—our third application scenario—we use video sequences of traffic surveillance cameras. Object detection is performed using YOLOv3 [48] trained on the MS COCO dataset [2]. Since there is no ground truth available for the sequences, we generate our reference output by applying the original YOLOv3 network to each frame.

Fig. 3. Sample frames from sequences of the CAVIAR dataset on which we perform evaluations for pose detection.

Fig. 4. Sample frames from the video sequences on which we perform evaluations for object detection.

III. METHODOLOGY
The most straightforward pixel-level approach is to detect changing pixels in the input frame based on a threshold on the difference to the previous frame, and then update all the pixels affected by them. The number of pixels to be updated grows layer after layer due to the convolution operations: for a 7×7 convolution, a one-pixel change triggers an update of 49 pixels in the next layer and 169 pixels after another 7×7 convolution. Strided operations (often used with pooling layers) reduce this effect, but do not prevent it. This issue might seem prohibitive for multi-layer CNNs, particularly when considering that individual pixels might keep exceeding the threshold due to noise.

However, the change is not only spatially local at the input, but also at the output. Furthermore, noise-like changes will likely not have a strong impact on feature maps deeper within the network. We thus propose to perform the change detection not only at the input, but before each convolution layer—relative to its previous input—and to compute an updated value only for the affected output pixels. This can be done without modifications to the training of the CNN, can be applied to existing pre-trained networks, and is not specific to the CNN on which we evaluate the proposed algorithm.

We propose to replace all spatial convolution layers (conv layers) with change-based spatial convolution layers (CBconv layers). This means adapting the widely used, simple, and well-performing matrix-generation and matrix-multiplication sequence of operations [14], [49]. The convolution layer computes

$$y_o(j,i) = b_o + \sum_{c \in \mathcal{C}_{in}} \sum_{(\Delta j, \Delta i) \in \mathcal{S}_k} k_{o,c}(\Delta j, \Delta i)\, x_c(j-\Delta j,\, i-\Delta i), \tag{1}$$

where $o$ indexes the output channels $\mathcal{C}_{out}$ and $c$ indexes the input channels $\mathcal{C}_{in}$. The pixel is identified by the tuple $(j,i)$, and $\mathcal{S}_k$ denotes the support of the filter kernels $k$. This can be computed by performing a matrix multiplication

$$Y = KX, \quad Y \in \mathbb{R}^{|\mathcal{C}_{out}| \times h_o w_o}, \quad K \in \mathbb{R}^{|\mathcal{C}_{out}| \times |\mathcal{C}_{in}| h_k w_k}, \quad X \in \mathbb{R}^{|\mathcal{C}_{in}| h_k w_k \times h_o w_o}. \tag{2}$$

The image matrix $X$ is constructed as $X((c h_k + j) w_k + i,\; y_o w_o + x_o) = x(c,\, j + y_o,\, i + x_o)$ with $c = 1,\dots,|\mathcal{C}_{in}|$, $j = 1,\dots,h_k$, $i = 1,\dots,w_k$, and $y_o = 1,\dots,h_o$, $x_o = 1,\dots,w_o$. The filter matrix $K$ is given by $K(o,\, (c h_k + j) w_k + i) = k(o, c, j, i)$ for $o = 1,\dots,|\mathcal{C}_{out}|$, $c = 1,\dots,|\mathcal{C}_{in}|$, $j = 1,\dots,h_k$, and $i = 1,\dots,w_k$. The result matrix is stored as $Y(o,\, y_o w_o + x_o) = y(o, y_o, x_o)$. Zero-padding can be applied during the construction of the $X$ matrix, and an efficient strided convolution can be computed by dropping the unused rows.

We replace this matrix multiplication by the following sequence of processing steps, thereby drastically reducing the size of the matrix used in the main computation step.
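To make this matrix formulation concrete, the following minimal sketch expresses a full-frame convolution as the single matrix multiplication Y = KX of (2), using PyTorch's unfold for the X-matrix construction. The function name and test dimensions are our own illustration and not part of the released implementation.

```python
import torch
import torch.nn.functional as F

def conv2d_as_gemm(x, k, b):
    """Full-frame convolution as a single GEMM, following Eq. (1)-(2).

    x: input feature maps, shape (|C_in|, h, w)
    k: filter kernels,     shape (|C_out|, |C_in|, h_k, w_k)
    b: biases,             shape (|C_out|,)
    Assumes stride 1 and 'same' zero-padding with odd filter sizes.
    """
    c_out, c_in, h_k, w_k = k.shape
    # X matrix (im2col): one column of |C_in|*h_k*w_k values per output
    # pixel, with zero-padding applied during construction.
    X = F.unfold(x[None], (h_k, w_k), padding=(h_k // 2, w_k // 2))[0]
    K = k.reshape(c_out, -1)    # filters flattened row-wise
    Y = K @ X + b[:, None]      # the main computation step: one GEMM
    return Y.reshape(c_out, x.shape[1], x.shape[2])

# Sanity check against the built-in convolution.
x, k, b = torch.randn(16, 64, 96), torch.randn(32, 16, 7, 7), torch.randn(32)
ref = F.conv2d(x[None], k, b, padding=3)[0]
assert torch.allclose(conv2d_as_gemm(x, k, b), ref, atol=1e-3)
```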
A. Processing Steps

We modify the standard approach and use a sequence of processing steps (cf. Figure 5, top/feed-forward): change detection, change indexes extraction, matrix generation, matrix multiplication, and output update. In the following, we explain the individual steps.

a) Change Detection:
In this step, changed pixels are detected. We define a changed pixel as one where the absolute difference of the current to the previous input of any feature map/channel exceeds some threshold $\tau$, i.e.,

$$m(j,i) = \bigvee_{c \in \mathcal{C}_{in}} \mathbb{I}\left(\left|x^{(t)}(c,j,i) - x^{(t-1)}(c,j,i)\right| > \tau\right). \tag{3}$$

The computation effort of this step is crucial, since it is executed independently of whether any pixel changed. Each of these changes affects a region equal to the filter size, and these output pixels are marked for updating:

$$\widetilde{m}(j,i) = \bigvee_{(\Delta j, \Delta i) \in \mathcal{S}_k} m(j+\Delta j,\, i+\Delta i), \tag{4}$$

where $\mathcal{S}_k$ is the filter kernel support, e.g. $\mathcal{S}_k = \{-3,\dots,3\}^2$ for a 7×7 filter. All of this is implemented on the GPU by clearing the change map to all-zero and having one thread per pixel, which—if a change is detected—sets the pixels of the filter support neighborhood in the resulting change map.

b) Change Indexes Extraction: In this step, we condense the change map $\widetilde{m}$ to 1) a list of pixel indexes where changes occurred and 2) a count of the number of changed pixels. This has been implemented by relying on the Thrust copy_if function (https://thrust.github.io). The computed index list is later needed to access only the required pixels when assembling the matrix for the convolution.

c) Matrix Generation & Matrix Multiplication: Matrix multiplications are used in many applications, and highly optimized implementations such as the GEMM (general matrix multiplication) function provided by the Nvidia cuBLAS library come within a few percent of the peak FLOPS a GPU is capable of providing. Matrix multiplication-based implementations of the convolution layer relying on GEMM are widely available and highly efficient [14], [50], as described above. The X matrix in (2) is not generated full-sized; instead, only those columns corresponding to the relevant output pixels are assembled, resulting in a reduced width equal to the number of output pixels affected by the changes in the input image. The columns to be generated are selected using the change indexes (cf. Figure 5) and are constructed following the procedure described in the previous section. This is implemented with independent threads for each pixel and spatial filter position, where each of them copies all the feature map values at that position. The K matrix is made up of the filters trained using normal convolution layers and keeps the same dimensions, so the computation effort in this step is proportional to the number of changed pixels, and the matrix multiplication is in the worst case only as time consuming as the full-frame convolution.

d) Output Updating: We use the previously stored results and the newly computed output values along with the change indexes list to provide the updated output feature maps. To maximize throughput, we also include the ReLU activation of the affected pixels in this step, reducing the compute time by 1) not writing the values to memory and immediately reading them again—an independent ReLU layer is strongly memory-bandwidth limited—and 2) only applying the ReLU operation to changed pixels.
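The following PyTorch sketch strings the five steps together for a single CBconv layer, assuming stride 1 and odd, square filters. It is a functional illustration only: the actual implementation uses custom CUDA kernels, torch.nonzero stands in for the Thrust copy_if extraction, and the full unfold below is used for brevity where the real kernel assembles only the required columns.

```python
import torch
import torch.nn.functional as F

class CBConv2d(torch.nn.Module):
    """Sketch of a change-based convolution layer (feed-forward variant)."""

    def __init__(self, weight, bias, tau, relu=True):
        super().__init__()
        self.weight, self.bias, self.tau, self.relu = weight, bias, tau, relu
        self.prev_in = None    # previous input, x^(t-1)
        self.prev_out = None   # previous output (flattened), y^(t-1)

    def forward(self, x):                          # x: (|C_in|, h, w)
        c_out, c_in, h_k, w_k = self.weight.shape
        h, w = x.shape[1], x.shape[2]
        if self.prev_in is None:                   # first frame: full update
            upd = torch.ones(h, w, dtype=torch.bool, device=x.device)
        else:
            # a) change detection, Eq. (3): any channel exceeds threshold
            m = (x - self.prev_in).abs().gt(self.tau).any(dim=0)
            #    ... dilated over the filter support, Eq. (4)
            upd = F.max_pool2d(m[None, None].float(), h_k, stride=1,
                               padding=h_k // 2)[0, 0].bool()
        # b) change indexes extraction (copy_if in the CUDA implementation)
        idx = upd.flatten().nonzero().squeeze(1)
        # c) matrix generation: keep only columns of changed output pixels,
        #    then one reduced-size GEMM (full unfold here for brevity)
        X = F.unfold(x[None], (h_k, w_k), padding=(h_k // 2, w_k // 2))[0]
        Y = self.weight.reshape(c_out, -1) @ X[:, idx] + self.bias[:, None]
        # d) output update with fused ReLU, touching only changed pixels
        y = torch.zeros(c_out, h * w, device=x.device) \
            if self.prev_out is None else self.prev_out.clone()
        y[:, idx] = Y.clamp(min=0) if self.relu else Y
        self.prev_in, self.prev_out = x, y
        return y.reshape(c_out, h, w)
```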
B. Memory Requirements
The memory requirements of DNN frameworks are known to be very high, up to the point where they become a limiting factor for increasing the mini-batch size during learning and thus reduce the throughput when parallelizing across multiple GPUs. These requirements are very different when looking at embedded inference-only systems:
1) Inference is typically done on single frames. Creating mini-batches would introduce often unacceptable latency while only providing a few percent of additional performance [14].
2) During training, the input of each layer has to be stored in order to be able to compute the gradients. This is not required during inference.
3) Batch normalization layers, dropout layers, etc. (if present) are considered independent layers during training. They can be absorbed into the convolution layer for inference.
To obtain a baseline memory requirement, we compute the required memory of common DNN frameworks performing convolutions using matrix multiplication with a batch size of 1. We assume an optimized network minimizing the number of layers, e.g. by absorbing batch normalization layers into the convolution layers or using in-place activation layers.

Fig. 5. Processing flow and intermediate data tensors of CBinfer: a) feed-forward coarse-grained, b) closed-loop coarse-grained, c) feed-forward fine-grained per feature map (FG-FM), and d) feed-forward spatially fine-grained (FG-SP). Color code: custom processing kernels, cuBLAS kernel, variables sharable among layers, and variables to be stored per layer. Size and data type of intermediate results are indicated below the variable names. Coarse-grained feed-forward CBinfer (a) is introduced in Section III-A, the closed-loop formulation (b) is described in Section III-C, and the fine-grained extensions to the algorithm (c, d) are formulated in Section III-D.
This way, 30M values need to be stored for the intermediate results, 264M values for the X matrix, and 873k values for the parameters. This can be further optimized by sharing X among all convolution layers and by keeping memory allocated for storing only the output of two layers, switching back and forth between them layer-by-layer. This reduces the memory footprint to 9M, 93M, and 872k values, a total of 103M values for our baseline.

Applying our algorithm requires a little more memory, because we need to store additional intermediate results (cf. Figure 5) such as the change map, the change indexes list, and the Y matrix, which can all again be shared among the layers. We also need to store the previous output to use it as a basis for the updated output and as the previous input of the subsequent layer. For our sample network, this requires another ~60M values for a total of 163M values (+58%, total size ~650 MB)—an acceptable increase and not a limitation, considering that modern graphics cards typically come with around 12 GB of memory and even GPU-accelerated embedded platforms such as the Nvidia Jetson TX2 module provide 8 GB of memory.
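The accounting above can be illustrated with a small helper that tallies the shared X matrix, the two ping-pong activation buffers, the parameters, and the CBinfer extras (one stored previous output per layer plus the shared change map and index list). The layer description format is hypothetical, and the exact totals depend on the network at hand.

```python
def memory_footprint(layers):
    """Count stored values for GEMM-based inference with buffer sharing.

    layers: list of dicts with keys c_in, c_out, h, w (output size), k
            (square filter size); a hypothetical description format.
    Returns (baseline, cbinfer) in number of values.
    """
    # X matrix shared among all conv layers: sized for the largest one
    x_mat = max(l['c_in'] * l['k'] ** 2 * l['h'] * l['w'] for l in layers)
    # two ping-pong buffers for layer outputs, switched layer-by-layer
    outs = [l['c_out'] * l['h'] * l['w'] for l in layers]
    pingpong = max(outs[i] + outs[i + 1] for i in range(len(outs) - 1))
    params = sum(l['c_out'] * l['c_in'] * l['k'] ** 2 for l in layers)
    baseline = x_mat + pingpong + params
    # CBinfer extras: previous output stored per layer (also serving as
    # the next layer's previous input), plus change map and index list
    extras = sum(outs) + 2 * max(l['h'] * l['w'] for l in layers)
    return baseline, baseline + extras
```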
C. Closed-Loop Formulation
In Figure 5a and Section III-A we describe the processing steps for a feed-forward implementation of CBinfer. Note, however, that this structure allows gradually changing inputs (e.g., two images morphed over several frames with increments below the change detection threshold) to never trigger any update within the network and thus keep a stale result. In an outdoor surveillance setting, the effects could be even worse: consider a static scenery with a sunset and thus gradually changing brightness without any triggered update operations. Now a moving object passes, leaving a dark trace behind where the output has been updated under the changed lighting conditions.

To overcome such issues, we propose a closed-loop version of CBinfer as shown in Figure 5, bottom/closed-loop. Rather than storing the previous input, we now keep an input state, which is updated only for those pixels which have triggered a change. This can be done directly in the change detection phase. This way, the previous output is consistently the convolution result of the input state and is ensured not to drift far away from the ideal result.

Since the previous input had to be stored before as well, this does not introduce any memory overhead. Moreover, in many cases it can even decrease compute time, since only the few values where changes occurred have to be copied over from the input to the input state. For the feed-forward CBinfer, the entire input tensor would have to be copied.

(Note that one such tensor always has to be copied when applying CBinfer. Consider two CBinfer layers following each other: during the output update step of the first CBinfer layer, we copy the newly computed values into the previous output tensor and feed it to the next CBinfer layer as the input. If we did not copy the data from input to previous input there and instead just kept the memory address of the previous frame's input, it would be at the same location where the first CBinfer layer's result will be stored when processing the next frame, thus directly modifying the previous input variable and thereby introducing incorrect behavior, i.e., there would never be any changes, since ultimately the input and previous input would point to the same memory location.)
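Only the change detection step differs from the feed-forward variant. A minimal sketch under the same assumptions as before; the in-place partial update of the state is what keeps the stored output consistent:

```python
import torch
import torch.nn.functional as F

def closed_loop_change_detect(x, state, tau, h_k):
    """Closed-loop step a): detect changes against the input *state* and
    copy only the changed pixels into it, so that the stored output always
    remains the exact convolution result of the stored state.

    x, state: (|C_in|, h, w); state is modified in place.
    """
    m = (x - state).abs().gt(tau).any(dim=0)   # Eq. (3) against the state
    state[:, m] = x[:, m]                      # partial update, no full copy
    return F.max_pool2d(m[None, None].float(), h_k, stride=1,
                        padding=h_k // 2)[0, 0].bool()   # Eq. (4)
```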
D. Fine-Grained Change-based Inference

In the proposed scheme, every output value affected by any change at the input is recomputed. As the convolution operation is linear, updates based on the difference to the previous frame can be computed to reduce the number of multiplications and additions in two ways:
Fig. 6. Worst-case propagation of the change map when skipping the change detection step for a 3×3 convolution.
1) Fine-grained across feature maps (FG-FM): Only some of the input feature maps affecting a given output value might have changed. An incremental update of the affected feature maps based on the difference of the input values relative to the previous frame would be sufficient (a simplified sketch follows below). This results in a 3D-tensor change map and a correspondingly longer change indexes list, and—crucially—forces the decomposition of the large matrix multiplication into several smaller ones whose results have to be added individually during the output update step (cf. Figure 5c).
2) Spatially fine-grained (FG-SP): Just because an output pixel is affected by an input pixel does not mean that it has to be completely recomputed. With a 3×3 filter, a single pixel marked as changed would trigger the re-computation of 9 pixels. Here, too, an incremental update based on differences is possible (cf. Figure 5d).
However, there are some drawbacks and limitations:
• For both approaches, the structure of the core computation is less regular and cannot be written as a dense matrix multiplication.
• The compute effort of the change indexes extraction scales linearly with the number of values that have to be checked for changes. In case of (1), the effort in this step is thus scaled up by a factor of the number of input feature maps.
• The potential gains in case of (2) are limited. Changing pixels are typically clustered together, and all that is saved is a small halo on the change map around the changes. This can in most cases be expected to be in the range of a few percent.
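To illustrate the linearity argument behind these variants, the sketch below performs an FG-FM-style incremental update, simplified to whole-channel granularity (the actual proposal tracks changes per channel and pixel, cf. Figure 5c):

```python
import torch
import torch.nn.functional as F

def fg_fm_update(y_prev, x, x_prev, k, tau):
    """Incremental update across feature maps (simplified FG-FM sketch):
    by linearity of the convolution, only input channels that changed
    contribute a difference-convolution added onto the previous output.

    y_prev: (|C_out|, h, w); x, x_prev: (|C_in|, h, w);
    k: (|C_out|, |C_in|, h_k, w_k), odd square filters, stride 1.
    """
    diff = x - x_prev
    changed = diff.abs().flatten(1).max(dim=1).values > tau  # per channel
    if not changed.any():
        return y_prev                      # nothing to update
    pad = k.shape[-1] // 2
    # smaller convolution over the changed channels only; the bias cancels
    delta = F.conv2d(diff[changed][None], k[:, changed], padding=pad)[0]
    return y_prev + delta
```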
E. Propagating Changes & Pooling
Change detection and change indexes extraction can contribute up to half of the compute time (cf. Section IV-F). In some cases, it is thus worth considering skipping these steps:
1) If the previous layer was a CBconv layer as well, we can skip the change detection step and instead start from the previous layer's change map and apply change propagation to it (cf. Eq. (4) and Figure 6; see the sketch below). This change propagation can be computed much faster, because no iteration across all the feature maps is necessary.
2) In the special case that the current layer has a 1×1 filter size, the changes do not propagate. This implies that the change map is identical to the one of the previous layer, which allows also skipping the change indexes extraction and re-using the change indexes of the previous layer.
Avoiding change detection also saves the memory for storing the previous input of that layer. Besides the aforementioned advantages, there are some potential drawbacks:
1) In case of (1), only the change detection step can be avoided and replaced with a change propagation step, and the change indexes have to be extracted again. The changes spread out at every layer where this is done, although the change detection threshold might not have been exceeded everywhere and some of the changes could have been discarded.
2) For (2), there is no propagation of changes, and both change detection and change indexes extraction can be skipped. The only drawback is that a few changes might be updated although they would be discarded if the input were checked against the current layer's threshold.
Besides accelerating convolution layers, the above is also interesting for pooling layers, which can also be implemented using a change-based approach. Since they typically follow a convolution layer, case (2) can be applied and the change-based update introduces no significant overhead but saves compute time—mostly by reducing memory bandwidth, as pooling layers are memory-bound operations.
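A minimal sketch of both cases, under the same assumptions as the earlier sketches:

```python
import torch.nn.functional as F

def propagate_change_map(prev_map, h_k):
    """Derive this layer's change map from the previous layer's.

    Case (2): for a 1x1 filter the map is identical, so even the change
    indexes extraction can be skipped and the indexes reused directly.
    Case (1): otherwise, dilate the map over the filter support
    (worst-case propagation, Eq. (4)) without reading any feature maps.
    """
    if h_k == 1:
        return prev_map
    return F.max_pool2d(prev_map[None, None].float(), h_k, stride=1,
                        padding=h_k // 2)[0, 0].bool()
```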
F. Threshold Selection

The proposed algorithm adds one parameter to each convolution layer: the change detection threshold. It is fixed offline based on sample video sequences which are passed through the trained network. Other than through the selected values of the thresholds, this selection process does not affect the performance of the system. A threshold of zero yields results identical to the non-change-based implementation, which has been used for functional verification.

For our evaluations, we perform an automated threshold selection process. First, all convolution layers are converted to change-based convolutions, and batch normalization and ReLU layers are absorbed into the CBinfer layers wherever possible. We define and choose:
1) a performance metric such as pixel-wise classification accuracy, intersect-over-union (IoU), or mean average precision (mAP)—possibly the loss function of the network,
2) a set of frame sequences on which to evaluate the network, where the last frame is ideally annotated. An obvious alternative in case of a lack of frame sequences with an annotated last frame is the comparison of the change-based network model's output to the output of the original model using an appropriate metric, and
3) an initial threshold, a factor determining the rate with which we adjust the threshold, and a maximum acceptable increment in quality loss per layer.
We then set all thresholds to zero and start to iteratively step from the first to the last layer of the network. For each layer, we set an initial threshold value and evaluate the model
with the aforementioned metric and dataset. We increment the threshold by a fixed factor (e.g. 1.1), re-evaluate, and repeat until the quality loss introduced by the current layer (with respect to a zero threshold) exceeds the maximum acceptable limit, and then take the previous threshold value.

In case of a DNN with (re-)convergent paths, we perform the threshold selection on these paths independently while setting the thresholds of the other paths to zero.

The maximum acceptable quality loss can be set equally for all layers of the network. We focus on low-accuracy-loss configurations, and thus we try to select the threshold values such that they are right at the point where implementation losses start to occur. Nevertheless, we have observed the best results when splitting the overall acceptable loss unevenly, allowing the first layer to introduce most of the loss.

Fig. 7. Scheme of the image sequence used for the evaluations.
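The selection procedure can be summarized in a few lines. The names below (a per-layer .tau attribute, the eval_fn quality metric) are hypothetical stand-ins; the released code provides its own conversion and calibration helpers.

```python
def select_thresholds(model, eval_fn, frames, step=1.1, max_loss=4e-4,
                      tau_init=1e-3):
    """Iterative per-layer threshold selection as described above.

    eval_fn(model, frames) returns the chosen quality metric (higher is
    better); max_loss is the acceptable quality loss added per layer.
    """
    cb_layers = [m for m in model.modules() if hasattr(m, 'tau')]
    for layer in cb_layers:
        layer.tau = 0.0                      # start from exact inference
    for layer in cb_layers:                  # step from first to last layer
        reference = eval_fn(model, frames)   # loss w.r.t. zero threshold
        tau = tau_init
        while True:
            layer.tau = tau
            if reference - eval_fn(model, frames) > max_loss:
                layer.tau = tau / step       # back off to previous value
                break
            tau *= step                      # increment by a fixed factor
    return [layer.tau for layer in cb_layers]
```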
IV. RESULTS & DISCUSSION
In this section, we will first present the evaluation environment and analyze the baseline compute time breakdown. We then analyze the threshold selection and the effect on accuracy and achievable throughput. We then perform a more in-depth analysis of the throughput to verify the quality of the GPU implementation and investigate how the changes propagate in the network. We then establish why more fine-grained change detection does not pay off and how implementation loss and performance gains behave on longer sequences.
A. Evaluation Environment
We evaluate our method for two application scenarios: semantic segmentation and pose detection. For the first, we perform our evaluations on the urban surveillance dataset described in Section II-D and [47], using the corresponding scene labeling CNN and not using the multispectral imaging data. The dataset provides 51 training images and 6 validation images with the corresponding ground-truth scene labeling, classifying each pixel into one of the following 8 classes: building, road, tree, sky, tram, car/truck, water, distant background. For the validation set, the labeled images are part of short video sequences with 5 additional frames available before the frame for which the ground-truth labeling is available. A network trained on this data is described in [47], and its parameters are reused unaltered for our evaluations. The procedure with which we perform our evaluations is visualized in Figure 7.

For the pose detection application, we use frame sequences from the CAVIAR dataset without ground-truth annotations and the trained body estimation network of OpenPose [51] with T = 2 stages. The frames are re-sampled as in the original OpenPose implementation to enable a meaningful comparison. The frame sequences are subsampled in time by a factor of 6 to arrive at a frame rate of around 4 frame/s. In this setting, we measure the accuracy loss in terms of the mean-squared error (MSE) relative to the output of the non-change-based network; we have found the resulting MSE on the network's output to be sufficiently low for the pose detection to work reliably. With this dataset, we run change-based inference for 9 frames before the accuracy and throughput measurements are performed on frame 10 to avoid any start-up transients. As we will show later in Figure 18, these transients are very short and the error does not accumulate over time.

For our experiment on object detection, we use the YOLOv3 network trained on the MS COCO dataset with 80 classes of everyday objects. The input image is rescaled such that its smaller dimension corresponds to 416 pixels. The input sequences for our evaluations are described in Section II-D. Similar to pose detection, we do not have ground-truth data; instead, we generate our target output using non-change-based YOLOv3. For measuring the quality of the output feature maps, the MSE is not a suitable measure, given that e.g. the outputs for the classification of the recognized object are scaled differently than the objectness score or the bounding box size. We have experimentally identified the objectness score to be the output most sensitive to potential artifacts of applying CBinfer and thus measure the accuracy loss due to change-based inference using the MSE on the objectness score.

We have implemented the proposed algorithm in the PyTorch framework using custom CUDA kernels, including functions to aid in converting DNNs to CBinfer (automatic conversion and threshold selection). We have evaluated the performance on a Jetson TX2 board. Our performance baseline is the PyTorch implementation using Nvidia's cuDNN back-end. It includes optimizations such as the Winograd algorithm and FFT-based convolutions mentioned in Section II-A. Our evaluations were conducted using half-precision floating point numbers, which have no negative impact on accuracy for either DNN.

TABLE II: Performance baseline compute time breakdown per layer (convolution, activation, pooling, total, and share).

B. Baseline Throughput and Computation Breakdown
Before we discuss the performance of the proposed algorithm, we analyze the baseline throughput and compute time breakdown of the segmentation DNN in Table II. Clearly, the convolution operations are dominant, taking up 94.5% (around 506 ms) of the overall computation time (535 ms). This reaffirms the focus on the convolution layers; we will later show that after accelerating the convolution operation significantly, optimizations for activation and pooling become relevant.
C. Threshold Selection
Our algorithm introduces a threshold parameter for each layer, for which we outline the selection process in Section III-F. In Figure 8, we visualize the relation between accuracy and each layer's change detection threshold. We proceed similarly to our selection process, allowing an accuracy drop of 0.04% per layer for the semantic segmentation network. Starting from all-zero thresholds (τ_i = 0, i = 1,...,3), we sweep and select the optimal threshold parameter for each layer iteratively. The main purpose is to align the tipping points of the threshold-accuracy curves, such that no single layer's threshold limits the overall accuracy.

After the selection of the thresholds, we can scale them jointly to analyze the trade-off against the classification accuracy more concisely, as can be observed in Figure 9 (left). The accuracies of the individual test sequences (different traces) clearly show a similar behavior with a plateau up to a clear point where there is a steep increase in error rate. We repeated this analysis for the much deeper pose detection network (cf. Figure 10), showing similar behavior for the MSE with respect to the baseline DNN.

D. Throughput Evaluations
The motivation for the proposed algorithm was to increase throughput by focusing only on the frame-to-frame changes. We show the performance gain in Figure 9 (right) with the indicated baseline analyzing the entire frame with the same network using cuDNN. In the extreme case of setting all thresholds to zero, the entire frame is updated, which results in a clear performance loss because of the change detection overhead as well as fewer optimization options, such as less cache-friendly access patterns when generating the X matrix. Nevertheless, a few operations are skipped where the pixels did not change at all.

When increasing the threshold factor, the average throughput increases rapidly to about 20 frame/s, where it starts saturating, because the change detection step as well as other non-varying components like the pooling and pixel classification layers become dominant and the number of detected changed pixels does not decrease further. We almost reach this plateau already for a threshold factor of 1, where by construction we have almost no accuracy loss. The average frame rate over the different sequences is near 18 frame/s at this point—an improvement of 9.1× over the cuDNN baseline of 1.96 frame/s.

One sequence (Figure 9) has—while still being several times faster than the baseline—a significantly lower throughput than the other sequences. While most of them show typical scenarios such as shown in Figure 2, this sequence shows a very busy situation where the entire road is full of vehicles and all of them are moving. The effective number of operations (add or multiply operations) to compute the convolution updates is visualized in Figure 9 (center). For most frame sequences the savings are around an order of magnitude, while the aforementioned exceptional cases have a significantly higher share of operations and thus lower savings.

Running the same analysis for the pose detection network yields very similar results. For the cuDNN baseline, we get a frame rate of 0.72 frame/s, and CBinfer achieves a rate of 3–8 frame/s for a threshold factor of 1, a speed-up of 4.2× to 11.1×. A noticeable difference are the performance gains for the zero-threshold configuration: here, the overhead of CBinfer is outweighed by the savings due to many pixels at the input not changing at all and therefore not triggering an update even for a zero threshold, yielding a performance gain even in a completely loss-less configuration.

In Figure 11, we show the evaluation results for object detection using YOLOv3 trained on MS COCO and applied to various video sequences. We have observed that the most critical output of the network is the objectness score and that the classification and bounding box dimensions are much more resilient to the change detection threshold. As we do not have ground-truth data available for the video sequences used in the experiment, we measure the loss based on the MSE of the objectness score relative to frame-by-frame inference. Again, a clear reduction of around 5× in the number of operations can be observed, albeit not as much as for the other two application scenarios. We attribute this to the network's structure, which uses leaky ReLU activations and hence does not naturally eliminate all changes of feature map values below zero.

We have repeated the performance measurements for the segmentation application with fp32 precision on a workstation with an Nvidia GTX 1080 Ti GPU to compare it to the Tegra X2 platform, obtaining an almost identical throughput-threshold trade-off and compute time breakdown up to a constant scaling factor—as can be expected for a largely very well parallelizable workload on a significantly more powerful device with a similar architecture (Tegra X2: 437–750 GFLOPS (fp32), 874–1500 GFLOPS (fp16), and 58.4 GB/s DRAM bandwidth; GTX 1080 Ti: 10609 GFLOPS (fp32) and 484 GB/s).

E. Accuracy-Throughput Trade-Off
While for some scenarios any drop in accuracy is unacceptable, many applications allow for some trade-off between accuracy and throughput—after all, choosing a specific CNN already implies selecting a network with an associated accuracy and computational cost.

We analyze the trade-off directly in Figure 12. The most extreme case is updating the entire frame every time, resulting in the lowest throughput at the same accuracy as full-frame inference. Increasing the threshold factor in steps of 0.25 immediately results in a significant throughput gain, and for most sequences the trade-off only starts at frame rates close to saturation above 20 frame/s. The same frame sequence that already deviated from the norm before behaves differently here as well. However, an adaptive selection of the threshold factor with a control loop getting feedback about the number of changed pixels could allow for a guaranteed throughput by reducing the accuracy in such cases; this is left to be explored in future work.

Fig. 8. Analysis of the increase in pixel classification error rate when selecting a certain change detection threshold. This analysis is conducted layer-by-layer, where the error increase of any layer includes the error introduced by the previous layers' threshold choices.

Fig. 9. Evaluation of the impact of jointly scaling the change detection thresholds on the classification error, the number of detected changed pixels (sum over all 3 layers), and the throughput. The various traces are different sequences of the Gloriastrasse segmentation dataset, where one sequence is a particularly active scene (road full of cars and trams) and another shows a low-activity scene with a single car. The traces correspond to Seq. 1643 28, Seq. 1607 11, Seq. 1611 12, Seq. 1624 48, Seq. 1607 30, and Seq. 1624 17 in the dataset.

Fig. 10. Evaluation of the introduced loss (left), effective number of compute operations (center), and measured throughput (right) for several frame sequences running the pose detection network while varying the change detection threshold. The various traces correspond to frame sequences Seq. 1–4 of the CAVIAR dataset (cf. Figure 3).

Fig. 11. Evaluation of the introduced loss relative to non-change-based inference based on the MSE of the objectness score, and the number of executed operations, for object detection using YOLOv3. The various traces correspond to different video sequences.
Fig. 12. Evaluation of the throughput-accuracy trade-off for frame sequences of the Gloriastrasse segmentation dataset. The different frame sequences are marked identically to Figure 9.
Fig. 13. Compute time for the individual processing steps (change detection, change indexes extraction, X matrix generation, GEMM, output update) per layer, running on the GPU for a typical frame sequence.

Fig. 14. Cumulative number of multiply and add operations for the scene labeling network (Layers 1–3) as a function of the threshold factor.
F. Compute Time Breakdown
In Section IV-B, and specifically in Table II, we already discussed the compute time breakdown of the entire network when using frame-by-frame analysis. To gain a more in-depth understanding of the limiting factors of our proposed algorithm, we show a detailed compute time breakdown of only the change-based convolution layers in Figure 13. The time spent on change detection is similar across all 3 convolution layers, which aligns well with our expectations, since the feature map volume at the input of n_ch · h · w values is identical for L2 and L3, and 25% smaller for L1. That this step already makes up more than 23.4% of the overall time underlines the importance of a very simple change detection function: any increase in compute time for change detection has to be offset by time savings in the other steps through a significant reduction in the number of changes. The change indexes extraction effort is linear in the number of pixels h · w, and the clear drop from L1 to L2 is as expected. However, since this step is not well parallelizable, there is not much additional gain when comparing L2 to L3. The effort to generate the X matrix depends strongly on the number of changed pixels, the number of feature maps, and the filter size. Most importantly, the time spent on shuffling data around to generate X is significantly smaller than the actual matrix multiplication, which clearly makes up the largest share. The subsequent update of the output values, including activation, uses only a negligible part of the overall processing time.

An important aspect is not directly visible: the overall compute time for the dominant convolution layers has shrunk tremendously. This makes the pooling layers a non-negligible contributor to the overall compute time. As outlined in Section III-E, we can perform the pooling also with a change-based approach and skip the change detection and indexes extraction by relying on the preceding convolution layer's change indexes. This provides an additional speed-up for the first and second pooling layers.

G. Change Propagation
During the construction of the algorithm, we argued that change detection should be performed for every convolution layer not only for modularity, but also justifying that the worst-case change propagation would result in a rapid growth of the share of changed pixels as we proceed deeper into the network. However, skipping change detection and instead assuming worst-case propagation for some intermediate layers might improve performance. Our experiments have shown that this pays off for neither of the networks. An experiment has shown that change detection reduces the number of changes for Layer 2 by 6.8× from 7.57% to 1.11% and for Layer 3 by 1.3× from 2.58% to 1.94%. Not repeating change detection for some layer affects the compute time by:
1) reducing the compute time by substituting the change detection step with a more light-weight change propagation step,
2) scaling up the compute effort from the matrix generation through the output update proportionally to the increase in the number of pixels marked as changed, and
3) leaving the execution time of the change indexes extraction unaffected.
Combining this with the results in Figure 13, skipping change detection for Layer 2 would result in an increase in execution time, while for Layer 3 it would have approximately no effect on performance.

An immediate concern when evaluating a CNN based on changing pixels is the spreading of the affected regions through the convolutions. We have thus analyzed the effective number of compute operations in Figure 14 and Figure 15 for the semantic segmentation and the pose detection networks, respectively. For each layer, the number of compute operations is shown in dependence of the joint threshold scaling factor. Layers in parallel branches of the network are shown sequentially. The changes are neither spreading out nor vanishing as we proceed deeper into the DNN.

Fig. 15. Number of multiply and add operations for the pose detection network, stacked by layer with the first layer on the bottom.

Fig. 16. Analysis of the change propagation with a frame sequence of the same scene as Figure 2. (a) shows the changes detected (change map) in Layer 1 using the thresholds determined in Section IV-C; in the upper part of the image there are several single-pixel changes due to noise. We show the changed pixels for Layer 2 based on worst-case propagation as assumed when dropping the Layer 2 change detection step (b, 7.57%) and those when applying change detection instead (c, 1.11%).

Fig. 17. Analysis of the compute effort by layer for the pose detection network. We compare not using CBinfer to different granularities of CBinfer: normal (coarse-grained), spatially fine-grained, and feature map fine-grained.

Besides the effect on performance, the visualization of these changes in Figure 16 provides insight into the inner workings of the DNN. As expected, single-pixel artifacts such as noise disappear due to the smoothing effect of the convolution. Changes originating from moving objects such as pedestrians and cars are propagated to the next layer as desired. A particularly interesting observation is the effect on the region marked as tree: in the input frame sequence the leaves of the tree move in the wind, but already after the first convolution layer the resulting changes completely vanish.
We construe that the pixels in this region are already represented more abstractly as leaves.

Fig. 16. Analysis of the change propagation with a frame sequence of the same scene as Figure 2. (a) shows the changes detected (change map) in Layer 1 using the thresholds determined in Section IV-C; in the upper part of the image there are several single-pixel changes due to noise. We show the changed pixels for Layer 2 based on worst-case propagation as assumed when dropping the Layer 2 change detection step (b, 7.57%) and those when applying change detection instead (c, 1.11%). Annotated regions: noise, cars, pedestrians, tree.
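The sketch below, in the same illustrative NumPy style as above and with all names being our own, makes the trade-off explicit: worst-case propagation is a dilation of the previous change map by the filter footprint, and the three effects listed above combine into a simple per-layer time model.

```python
import numpy as np

def worst_case_propagate(changed, k):
    """Dilate a 2D boolean change map by a k x k filter footprint: every
    pixel whose receptive field contains a change is marked as changed."""
    p = k // 2
    h, w = changed.shape
    cp = np.pad(changed, p)
    out = np.zeros_like(changed)
    for dy in range(k):
        for dx in range(k):
            out |= cp[dy:dy + h, dx:dx + w]
    return out

def layer_time_skipping_detection(t, growth):
    """Per-layer time when replacing change detection by propagation.
    t: dict of per-step times as broken down in Figure 13 (keys are our
    own labels); growth: ratio of worst-case to detected changed pixels,
    e.g. 7.57/1.11 ~ 6.8 for Layer 2."""
    return (t['propagation']                        # 1) replaces t['detection']
            + t['extraction']                       # 3) unaffected
            + growth * (t['gather'] + t['gemm'] + t['update']))  # 2) scaled up
```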
H. Fine-grained CBinfer

In Section III-D we have introduced two types of fine-grained CBinfer to further reduce the number of multiply-add operations: spatially and across feature maps. We analyze their effect by running change-based inference and comparing the compute effort to the number of detected changes and the number of operations to perform per change in Figure 17.

Fig. 17. Analysis of the compute effort by layer for the pose detection network. We compare not using CBinfer to different granularities of CBinfer: normal (coarse-grained), spatially fine-grained, and feature map fine-grained.

The drawbacks discussed in Section III-D are confirmed:
• Spatially fine-grained (SP-FG) CBinfer reduces the number of operations only by around 20%, and exploiting this makes the operations much less regular: the single large matrix multiplication decays into many smaller ones followed by an aggregation step, both of which introduce a massive memory bandwidth overhead.
• The results for fine-grained evaluation by feature map show much more potential, with a reduction of multiply and add operations by around 65%. However, such an implementation also requires a change map per feature map, and thus the change extraction step has to be performed on a 3D tensor rather than a 2D tensor. The effort is scaled up by the number of feature maps at the input of the convolution layer (often 16, 64, 256, or more), thereby pushing this computation overhead (10-20% of compute time for normal CBinfer, cf. Figure 13) to several times (160-5120%) the overall compute effort of normal CBinfer (see the sketch below).
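A short sketch of why the extraction step blows up, again using our own illustrative NumPy formulation rather than the actual GPU kernels: with one change map per feature map, the extraction scans an n_ch × h × w tensor instead of an h × w map.

```python
import numpy as np

def extract_change_indexes(x_prev, x_curr, tau, per_feature_map=False):
    """Change indexes extraction for normal vs. feature-map fine-grained
    CBinfer. Returns an array of changed coordinates."""
    exceeded = np.abs(x_curr - x_prev) > tau          # (n_ch, h, w)
    if not per_feature_map:
        # normal CBinfer: reduce over feature maps, extract from a 2D map
        return np.argwhere(exceeded.any(axis=0))      # (n_chg, 2): (y, x)
    # fine-grained: one map per feature map; the scanned volume and thus
    # the extraction effort grows by the factor n_ch (often 16-256+)
    return np.argwhere(exceeded)                      # (n_chg, 3): (c, y, x)
```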
Fig. 18. Evaluation of accuracy effect, runtime, number of compute operations, power, and energy when performing inference for the pose detection network on a 200-frame video sequence. Legend: CBinfer vs. cuDNN; continuous: max-N (max. performance) power mode, dashed: max-Q (max. efficiency) power mode.

I. Energy Efficiency
We have measured the power consumption of the Tegra X2 module using the on-board sensors for two of its power modes: maximum performance (max-N) and maximum efficiency (max-Q). When idling, the power consumption is 1.80 W and 1.77 W for max-N and max-Q, respectively. The measurements under load have been conducted while running pose detection on a 200-frame sequence and are visualized in Figure 18. Generally, we can see a clear correlation between the number of operations that have to be computed and the run time, where the latter has a clear offset due to the overhead of change detection and change indexes extraction. We can also observe that there is no long-term rise in the introduced loss. The power draw is very stable for the cuDNN baseline in max-N (12 W) and max-Q mode (5.3 W), as well as for the CBinfer implementation (6.8 W and 4.8 W, respectively). Note that we were processing the frames without duty-cycling. The resulting energy per frame is shown in the trace at the bottom. The baseline uses around 9.6 J/frame in max-N mode and 6.1 J/frame in max-Q mode, whereas the CBinfer implementation uses an average of 1.1 J/frame and 0.8 J/frame, respectively. This corresponds to energy savings of 8.7× and 7.6× and an equivalent average energy efficiency of 148 GOp/s/W and 204 GOp/s/W for the max-N and max-Q power modes, respectively.

For the scene labeling network and the max-N power mode, we have measured a power consumption of 6.8 W with CBinfer and 10.5 W with cuDNN, and thus 411 and 3003 mJ/frame, respectively. With a frame requiring 210 GOp, this results in an energy efficiency of 511 and 70 GOp/s/W, an improvement of 7.3×.
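As a sanity check on these figures: GOp per frame divided by joules per frame equals GOp/s/W, since a watt is a joule per second. The snippet below reproduces the reported efficiencies purely from the measurements stated in this section:

```python
# Energy efficiency from the measured values reported above.
workload_gop = 210           # GOp per frame, scene labeling network
e_cudnn_j    = 3.003         # J/frame, cuDNN baseline, max-N mode
e_cbinfer_j  = 0.411         # J/frame, CBinfer, max-N mode

eff_cudnn   = workload_gop / e_cudnn_j      # ~70  GOp/s/W
eff_cbinfer = workload_gop / e_cbinfer_j    # ~511 GOp/s/W
print(eff_cbinfer / eff_cudnn)              # ~7.3x improvement
```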
J. Comparison to Related Work

CBinfer is compared to state-of-the-art methods exploiting temporal redundancy in Table III. DeepMon [45] achieves a speed-up of 46% using VGG-16 on the UCF101 dataset at an accuracy drop of 6%, from 89.9% to 83.9%. CNNCache [44] improves on this result in terms of accuracy loss, achieving a 23% speed-up at an accuracy loss of 3% using ResNet-50 on the simplified UCF101 dataset with 10 classes. With CBinfer, the average speed-up is in the range of 700–910% at negligible accuracy loss (e.g. < . on average for segmentation). These results should not be compared directly with the previous methods: since the video sequences of the UCF101 dataset have a moving camera and we require a static camera, the evaluation had to be performed on different datasets.

However, we can discuss the origins of the limited speed-ups and how they are overcome using our method. The static camera requirement allows us to eliminate the costly image block matching employed by CNNCache to find the corresponding image block in the previous frame. In our case, the matching becomes a trivial per-pixel comparison, since it is immediately clear where to find the corresponding patch of pixels in the previous frame. CNNCache hence also has to limit the identification of where cached results can be re-used to the input of the CNN, since repeated block matching with many feature maps would introduce an overhead likely in excess of the compute time for the CNN itself.

The repetition of the change detection at each layer is a crucial property of our algorithm and allows us to reuse significantly more data from the cache (the previous frame's output of each layer), because we are not bound to fetch rectangular regions from the cache and can eliminate irrelevant changes at an early stage within the network. This is particularly important for our application scenarios, where we expect multiple small moving objects of interest in the scene while expecting changed pixels from background objects to be irrelevant to the final output.

TABLE III: COMPARISON OF RESULTS WITH STATE-OF-THE-ART

Method | Dataset | Camera/Backgr. | Result (speed-up @ accuracy loss)
DeepMon i | UCF101 (activity/sports recordings, 101 classes) | object-following | 1.46× speed-up @ 6% loss with VGG-16
CNNCache | UCF101 simplified (activity/sports recordings, 10 cl.) | object-following | 1.23× speed-up @ 3% loss with ResNet-50
CBinfer (ours) | Gloriastr. (surveillance cam, semantic segm., 10 classes) | static | 9.1× speed-up @ < . loss ii; . × speed-up @ negligible loss with OpenPose iii

i Only speed-up from using convolutional layer caching; additional methods are also presented in the corresponding work. ii Difference to non-change-based inference of less than . · − MSE.

V. CONCLUSION
We have proposed and evaluated a novel algorithm for change-based evaluation of CNNs for video recorded in a static camera setting, exploiting the spatio-temporal sparsity of pixel changes. The method introduces a set of parameters to trade off accuracy and throughput. Even when choosing the parameters conservatively such that no significant accuracy loss is introduced, we have observed an average speed-up of . × for a semantic segmentation DNN and . × for a pose detection DNN relative to cuDNN using our GPU implementation. The resulting boost in energy efficiency over per-frame evaluation is an average of . × and . × for the two applications, respectively. This corresponds to an equivalent energy efficiency of 511 GOp/s/W on the Tegra X2 platform for the semantic segmentation DNN. For object detection using YOLOv3, we have observed a reduction of × in computational workload. We have furthermore analyzed various flavors of the proposed algorithm and how the changes propagate through the DNNs to further underline the suitability of the structure of the proposed algorithm.

Further gains might be possible by training the network on videos using change-based inference for the forward propagation, or by introducing noise during training to simulate the slight deviations from the ideal feature maps. The proposed method is also not limited to video data, but should work on any data where changes in at least one dimension are sparse (e.g., spectrograms of audio data). Finally, reducing the granularity of the algorithm by × or × would allow an implementation using Winograd's convolution algorithm for additional speed-up, as is done in the cuDNN baseline.

REFERENCES

[1] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, "ImageNet Large Scale Visual Recognition Challenge," IJCV, vol. 115, no. 3, pp. 211–252, 2015.
[2] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, "Microsoft COCO: Common Objects in Context," in Proc. ECCV, 2014.
[3] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, "The Cityscapes Dataset for Semantic Urban Scene Understanding," in Proc. IEEE CVPR, 2016, pp. 3213–3223.
[4] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016.
[5] K. He, X. Zhang, S. Ren, and J. Sun, "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification," in Proc. IEEE ICCV, 2015, pp. 1026–1034.
[6] S. Abu-El-Haija, N. Kothari, J. Lee, P. Natsev, G. Toderici, B. Varadarajan, and S. Vijayanarasimhan, "YouTube-8M: A Large-Scale Video Classification Benchmark," arXiv:1609.08675, 2016.
[7] P. Fischer, A. Dosovitskiy, E. Ilg, P. Haeusser, C. Hazirbas, V. Golkov, P. van der Smagt, D. Cremers, and T. Brox, "FlowNet: Learning Optical Flow with Convolutional Networks," arXiv:1504.06852, 2015.
[8] C. Zhang, Z. Fang, P. Zhou, P. Pan, and J. Cong, "Caffeine: Towards Uniformed Representation and Acceleration for Deep Convolutional Neural Networks," in Proc. ACM ICCAD, 2016.
[9] W.-S. Park and M. Kim, "CNN-based in-loop filtering for coding efficiency improvement," in Proc. IEEE IVMSP, 2016.
[10] P. Wang, Y. Cao, C. Shen, L. Liu, and H. T. Shen, "Temporal Pyramid Pooling Based Convolutional Neural Networks for Action Recognition," IEEE Transactions on Circuits and Systems for Video Technology, vol. 27, no. 12, pp. 2613–2622, 2015.
[11] K. Chen and W. Tao, "Once for All: a Two-flow Convolutional Neural Network for Visual Tracking," IEEE Transactions on Circuits and Systems for Video Technology, pp. 1–10, 2017.
[12] K. Kang, H. Li, J. Yan, X. Zeng, B. Yang, T. Xiao, C. Zhang, Z. Wang, R. Wang, X. Wang, and W. Ouyang, "T-CNN: Tubelets with Convolutional Neural Networks for Object Detection from Videos," IEEE Transactions on Circuits and Systems for Video Technology, pp. 1–11, 2017.
[13] Z. Jie, W. F. Lu, S. Sakhavi, Y. Wei, E. H. F. Tay, and S. Yan, "Object Proposal Generation With Fully Convolutional Networks," IEEE Transactions on Circuits and Systems for Video Technology, vol. 28, no. 1, pp. 62–75, 2018.
[14] L. Cavigelli, M. Magno, and L. Benini, "Accelerating Real-Time Embedded Scene Labeling with Convolutional Networks," in Proc. ACM/IEEE DAC, 2015.
[15] G. Ananthanarayanan, P. Bahl, P. Bodik, K. Chintalapudi, M. Philipose, L. Ravindranath, and S. Sinha, "Real-Time Video Analytics: The Killer App for Edge Computing," Computer, vol. 50, no. 10, pp. 58–67, 2017.
[16] L. T. Nguyen-Meidine, E. Granger, M. Kiran, and L.-A. Blais-Morin, "A comparison of CNN-based face and head detectors for real-time video surveillance applications," in Proc. IEEE IPTA, 2017.
[17] N. Vasilache, O. Zinenko, T. Theodoridis, P. Goyal, Z. DeVito, W. S. Moses, S. Verdoolaege, A. Adams, and A. Cohen, "Tensor Comprehensions: Framework-Agnostic High-Performance Machine Learning Abstractions," arXiv:1802.04730, 2018.
[18] S. Chetlur, C. Woolley, P. Vandermersch, J. Cohen, J. Tran, B. Catanzaro, and E. Shelhamer, "cuDNN: Efficient Primitives for Deep Learning," arXiv:1410.0759, 2014.
[19] A. Lavin, "maxDNN: An Efficient Convolution Kernel for Deep Learning with Maxwell GPUs," arXiv:1501.06633, 2015.
[20] A. Lavin and S. Gray, "Fast Algorithms for Convolutional Neural Networks," in Proc. IEEE CVPR, 2016, pp. 4013–4021.
[21] L. Cavigelli and L. Benini, "Origami: A 803-GOp/s/W Convolutional Network Accelerator," IEEE TCSVT, vol. 27, no. 11, pp. 2461–2475, 2017.
[22] R. Andri, L. Cavigelli, D. Rossi, and L. Benini, "YodaNN: An Ultra-Low Power Convolutional Neural Network Accelerator Based on Binary Weights," in Proc. IEEE ISVLSI, 2016, pp. 236–241.
[23] L. Cavigelli, D. Gschwend, C. Mayer, S. Willi, B. Muheim, and L. Benini, "Origami: A Convolutional Network Accelerator," in Proc. ACM GLSVLSI, 2015, pp. 199–204.
[24] Y.-H. Chen, T. Krishna, J. Emer, and V. Sze, "Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks," in Proc. IEEE ISSCC, 2016, pp. 262–263.
[25] C. Farabet, B. Martini, B. Corda, P. Akselrod, E. Culurciello, and Y. LeCun, "NeuFlow: A Runtime Reconfigurable Dataflow Processor for Vision," in Proc. IEEE CVPRW, 2011, pp. 109–116.
[26] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, "XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks," in Proc. ECCV, 2016, pp. 525–542.
[27] L. Cavigelli, P. Degen, and L. Benini, "CBinfer: Change-Based Inference for Convolutional Neural Networks on Video Data," in Proc. ACM ICDSC, 2017.
[28] A. Canziani, E. Culurciello, and A. Paszke, "Evaluation of neural network architectures for embedded systems," in Proc. IEEE ISCAS, 2017.
[29] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, "MobileNetV2: Inverted Residuals and Linear Bottlenecks," arXiv:1801.04381, 2018.
[30] F. Iandola and K. Keutzer, "Small neural nets are beautiful," in Proc. IEEE/ACM/IFIP CODES, 2017.
[31] P. Gysel, M. Motamedi, and S. Ghiasi, "Hardware-oriented Approximation of Convolutional Neural Networks," in ICLR Workshops, 2016.
[32] A. Zhou, A. Yao, Y. Guo, L. Xu, and Y. Chen, "Incremental Network Quantization: Towards Lossless CNNs with Low-Precision Weights," in Proc. ICLR, 2017.
[33] R. Andri, L. Cavigelli, D. Rossi, and L. Benini, "Hyperdrive: A Systolically Scalable Binary-Weight CNN Inference Engine for mW IoT End-Nodes," in Proc. IEEE ISVLSI, 2018.
[34] S. Hashemi, N. Anthony, H. Tann, R. I. Bahar, and S. Reda, "Understanding the Impact of Precision Quantization on the Accuracy and Energy of Neural Networks," arXiv:1612.03940, 2016.
[35] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf, "Pruning Filters for Efficient ConvNets," arXiv:1608.08710, 2016.
[36] A. Aimar, H. Mostafa, E. Calabrese, A. Rios-Navarro, R. Tapiador-Morales, I.-A. Lungu, M. B. Milde, F. Corradi, A. Linares-Barranco, S.-C. Liu, and T. Delbruck, "NullHop: A Flexible Convolutional Neural Network Accelerator Based on Sparse Representations of Feature Maps," IEEE Transactions on Neural Networks and Learning Systems, 2018.
[37] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally, "EIE: Efficient Inference Engine on Compressed Deep Neural Network," in Proc. ACM/IEEE ISCA, 2016, pp. 243–254.
[38] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You Only Look Once: Unified, Real-Time Object Detection," in Proc. IEEE CVPR, 2016, pp. 779–788.
[39] J. Long, E. Shelhamer, and T. Darrell, "Fully Convolutional Networks for Semantic Segmentation," in Proc. IEEE CVPR, 2015.
[40] X. Zhang, X. Zhou, M. Lin, and J. Sun, "ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices," arXiv:1707.01083, 2017.
[41] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer, "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size," arXiv:1602.07360, 2016.
[42] D. Held, S. Thrun, and S. Savarese, "Learning to Track at 100 FPS with Deep Regression Networks," in Proc. ECCV, ser. Lecture Notes in Computer Science, B. Leibe, J. Matas, N. Sebe, and M. Welling, Eds., vol. 9905. Cham: Springer International Publishing, 2016, pp. 749–765.
[43] E. Shelhamer, K. Rakelly, J. Hoffman, and T. Darrell, "Clockwork Convnets for Video Semantic Segmentation," ser. Lecture Notes in Computer Science, J. Matas, N. Sebe, and M. Welling, Eds., vol. 9905, 2016.
[44] M. Xu, X. Liu, Y. Liu, and F. X. Lin, "Accelerating Convolutional Neural Networks for Continuous Mobile Vision via Cache Reuse," arXiv:1712.01670, 2017.
[45] L. N. Huynh, Y. Lee, and R. K. Balan, "DeepMon: Mobile GPU-based Deep Learning Framework for Continuous Vision Applications," in Proc. ACM MobiSys, 2017, pp. 82–95.
[46] P. O'Connor and M. Welling, "Sigma-Delta Quantized Networks," in Proc. ICLR, 2017.
[47] L. Cavigelli, D. Bernath, M. Magno, and L. Benini, "Computationally efficient target classification in multispectral image data with Deep Neural Networks," in Proc. SPIE Security + Defence, vol. 9997, 2016.
[48] J. Redmon and A. Farhadi, "YOLOv3: An Incremental Improvement," 2018. [Online]. Available: http://arxiv.org/abs/1804.02767
[49] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, "Caffe," in Proc. ACM MM, 2014, pp. 675–678.
[50] J. Jin, V. Gokhale, A. Dundar, B. Krishnamurthy, B. Martini, and E. Culurciello, "An efficient implementation of deep convolutional neural networks on a mobile coprocessor," in Proc. IEEE MWSCAS, 2014, pp. 133–136.
[51] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh, "Realtime Multi-person 2D Pose Estimation Using Part Affinity Fields," in Proc. IEEE CVPR, 2017.