Analysis of Latent-Space Motion for Collaborative Intelligence
Extended version of the paper “Latent Space Motion Analysis for Collaborative Intelligence,” to be presented at the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, Canada, June 6–11, 2021.
ANALYSIS OF LATENT-SPACE MOTION FOR COLLABORATIVE INTELLIGENCE
Mateen Ulhaq and Ivan V. Bajić
School of Engineering Science, Simon Fraser University, Burnaby, BC, Canada
ABSTRACT
When the input to a deep neural network (DNN) is a video signal, a sequence of feature tensors is produced at the intermediate layers of the model. If neighboring frames of the input video are related through motion, a natural question is, “what is the relationship between the corresponding feature tensors?” By analyzing the effect of common DNN operations on optical flow, we show that the motion present in each channel of a feature tensor is approximately equal to the scaled version of the input motion. The analysis is validated through experiments utilizing common motion models.
Index Terms — Collaborative intelligence, latent space motion, feature domain motion, feature compression
1. INTRODUCTION
Collaborative intelligence (CI) [1] has emerged as a promising strategy to bring AI “to the edge.” In a typical CI system (Fig. 1), a deep neural network (DNN) is split into two parts: the edge sub-model, deployed on the edge device near the sensor, and the cloud sub-model, deployed in the cloud. Intermediate features produced by the edge sub-model are transferred from the edge device to the cloud. It has been shown that such a strategy may provide better energy efficiency [2, 3], lower latency [2, 3, 4], and lower bitrates over the communication channel [5, 6], compared to more traditional cloud-based analytics where the input signal is sent directly to the cloud. These potential benefits will find a number of applications in areas such as intelligent sensing [7] and video coding for machines [8, 9]. In particular, compression of intermediate features has become an important research problem, with a number of recent developments [10, 11, 12, 13, 14] for the case when the input to the edge sub-model is a still image.

When the input to the edge sub-model is video, its output is a sequence of feature tensors produced from successive frames in the input video. This sequence of feature tensors needs to be compressed prior to transmission and then decoded in the cloud for further processing. Since motion plays such an important role in video processing and compression, we are motivated to examine whether any similar relationship exists in the latent space among the feature tensors. Our theoretical and experimental results show that, indeed, motion from the input video is approximately preserved in the channels of the feature tensor. An illustration of this is presented in Fig. 2, where the estimated input-space motion field is shown on the left, and the estimated motion fields in several feature tensor channels are shown on the right. These findings suggest that methods for motion estimation, compensation, and analysis that have been developed for conventional video processing and compression may provide a solid starting point for equivalent operations in the latent space.

The paper is organized as follows. In Section 2, we analyze the actions of typical operations found in deep convolutional neural networks on optical flow in the input signal, and show that these operations tend to preserve the optical flow, at least approximately, with an appropriate scale. In Section 3 we provide empirical support for the theoretical analysis from Section 2. Finally, Section 4 concludes the paper.

This work was supported in part by the Natural Sciences and Engineering Research Council (NSERC) of Canada.

Fig. 1: Basic collaborative intelligence system.

Fig. 2: Motion estimates for input frames (left) and select channels from the output of ResNet-34's add 3 layer (right).

Fig. 3: The problem studied in this paper: if input images are related via motion, what is the relationship between the corresponding intermediate feature tensors?
2. LATENT SPACE MOTION ANALYSIS
The basic problem studied in this paper is illustrated in Fig. 3. Consider two images (video frames) input to the edge sub-model of a CI system. It is common to represent their relationship via a motion model. The question we seek to answer here is, “what is the relationship between the corresponding feature tensors produced by the edge sub-model?” To answer this question, we will look at the processing pipeline between the input image and a given channel of a feature tensor. In most deep models for computer vision applications, this processing pipeline consists of a sequence of basic operations: convolutions, pointwise nonlinearities, and pooling. We will show that each of these operations tends to preserve motion, at least approximately, in a certain sense, and from this we will conclude that (approximate) input motion may be observed in individual channels of a feature tensor.
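For concreteness, the sketch below (not from the paper) splits a torchvision ResNet-34 into an edge sub-model that emits an intermediate feature tensor and a cloud sub-model that completes the computation. PyTorch, torchvision, and the particular split point are assumptions made purely for illustration; the paper does not specify a framework or split layer.

```python
# Minimal sketch of an edge/cloud split of ResNet-34 (assumed framework: PyTorch).
import torch
import torch.nn as nn
from torchvision.models import resnet34

model = resnet34()   # random weights are sufficient for a shape/data-flow illustration
model.eval()

# Edge sub-model: stem plus the first two residual stages (hypothetical split point).
edge = nn.Sequential(
    model.conv1, model.bn1, model.relu, model.maxpool,
    model.layer1, model.layer2,
)

# Cloud sub-model: remaining stages plus the classifier head.
cloud_backbone = nn.Sequential(model.layer3, model.layer4)

with torch.no_grad():
    x = torch.randn(1, 3, 224, 224)     # stand-in for one input video frame
    z = edge(x)                         # intermediate feature tensor sent edge -> cloud
    y = model.fc(torch.flatten(model.avgpool(cloud_backbone(z)), 1))

print(z.shape)   # torch.Size([1, 128, 28, 28]) for this split
print(y.shape)   # torch.Size([1, 1000])
```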
Motion model.
Optical flow is a frequently used motion model in computer vision and video processing. In a “2D+t” model, $I(x, y, t)$ denotes pixel intensity at time $t$, at spatial position $(x, y)$. Under a constant-intensity assumption, optical flow satisfies the following partial differential equation [15]:

$$\frac{\partial I}{\partial x} v_x + \frac{\partial I}{\partial y} v_y + \frac{\partial I}{\partial t} = 0, \qquad (1)$$

where $(v_x, v_y)$ represents the motion vector. For notational simplicity, in the analysis below we will use a “1D+t” model, which captures all the main ideas but keeps the equations shorter. In a “1D+t” model, $I(x, t)$ denotes intensity at position $x$ at time $t$, and the optical flow equation is

$$\frac{\partial I}{\partial x} v + \frac{\partial I}{\partial t} = 0, \qquad (2)$$

with $v$ representing the motion. We will analyze the effect of the basic operations (convolutions, pointwise nonlinearities, and pooling) on (2), to gain insight into the relationship between input-space motion and latent-space motion.
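As a quick numerical sanity check of (2) (not part of the paper; the Gaussian test signal, velocity, and step sizes below are arbitrary choices), one can verify that the true velocity of a translating 1-D signal nearly cancels the discretized optical-flow residual:

```python
# Minimal numerical check of the "1D+t" optical flow equation (2): for a signal
# translating at velocity v, i.e. I(x, t+1) = I(x - v, t), the residual
# dI/dx * v + dI/dt evaluated with the true v should be close to zero.
import numpy as np

v = 1.0                                               # true motion (pixels per frame)
x = np.arange(0.0, 64.0)
g = lambda u: np.exp(-0.5 * ((u - 32.0) / 8.0) ** 2)  # smooth test signal (arbitrary)

I0 = g(x)                                             # frame at time t
I1 = g(x - v)                                         # frame at time t + 1

dI_dx = np.gradient(I0, x)                            # central-difference spatial derivative
dI_dt = I1 - I0                                       # forward-difference temporal derivative

residual = dI_dx * v + dI_dt                          # left-hand side of (2)
print(np.abs(residual).max(), np.abs(dI_dt).max())    # residual is ~10x smaller than dI/dt
```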
Convolution.
Let $f$ be a (spatial) filter kernel; then the optical flow after convolution is a solution to the following equation:

$$\frac{\partial}{\partial x}(f * I)\, v' + \frac{\partial}{\partial t}(f * I) = 0, \qquad (3)$$

where $v'$ is the motion after the convolution. Since convolution and differentiation are linear operations, we have

$$f * \left( \frac{\partial I}{\partial x} v' + \frac{\partial I}{\partial t} \right) = 0. \qquad (4)$$

Hence, solution $v$ from (2) is also a solution of (4), but (4) could also have other solutions besides those that satisfy (2).
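Under the same toy setup as the previous sketch (again an illustration of ours, with an arbitrary smoothing kernel), one can check that the input motion $v$ also approximately solves the post-convolution flow equation (3):

```python
# Check that filtering both frames with f leaves the optical-flow residual for the
# same v near zero, i.e. v remains a solution of (3) after convolution.
import numpy as np

v = 1.0
x = np.arange(0.0, 64.0)
g = lambda u: np.exp(-0.5 * ((u - 32.0) / 8.0) ** 2)
I0, I1 = g(x), g(x - v)                               # frames at t and t + 1

f = np.array([1.0, 4.0, 6.0, 4.0, 1.0]) / 16.0        # example smoothing kernel
J0 = np.convolve(I0, f, mode="same")                  # (f * I) at time t
J1 = np.convolve(I1, f, mode="same")                  # (f * I) at time t + 1

residual = np.gradient(J0, x) * v + (J1 - J0)         # left-hand side of (3)
print(np.abs(residual).max())                         # ~0 up to discretization error
```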
Pointwise nonlinearity.
Nonlinear activations such as sigmoid, ReLU, etc., are commonly applied in a pointwise fashion on the output of convolutions in deep models. Let $\sigma(\cdot)$ denote such a pointwise nonlinearity; then the optical flow after this nonlinearity is a solution to the following equation:

$$\frac{\partial}{\partial x}\left[\sigma(I)\right] v' + \frac{\partial}{\partial t}\left[\sigma(I)\right] = 0, \qquad (5)$$

where $v'$ is the motion after the pointwise nonlinearity. By using the chain rule of differentiation, the above equation can be rewritten as

$$\sigma'(I) \cdot \left( \frac{\partial I}{\partial x} v' + \frac{\partial I}{\partial t} \right) = 0. \qquad (6)$$

Hence, again, solution $v$ from (2) is also a solution of (6). It should be noted that (6) may have solutions other than those from (2). For example, in the region where inputs to ReLU are negative, the corresponding outputs will be zero, so $\sigma'(I) = 0$. Hence, in those regions, (6) will be satisfied for arbitrary $v'$. Nonetheless, the solution from (2) is still one of those arbitrary solutions.
Pooling.
There are various forms of pooling, such as max-pooling, mean-pooling, learnt pooling (via strided convolutions), etc. All of these can be decomposed into a sequence of two operations: a spatial operation (local maximum or convolution) followed by a scale change (downsampling). Spatial convolution operations can be analyzed as above, and the conclusion is that motion before such an operation is also a solution to the optical flow equation after such an operation. Hence, we will focus here on the local maximum operation and the scale change.
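The decomposition used here is easy to verify directly; the following sketch (ours, using PyTorch as an assumed framework) shows that a standard 2×2, stride-2 max pooling equals a dense local maximum followed by 2× downsampling:

```python
# A 2x2 / stride-2 max pooling is exactly a stride-1 local maximum (spatial operation)
# followed by subsampling every second row and column (scale change with s = 2).
import torch
import torch.nn.functional as F

x = torch.randn(1, 1, 8, 8)

pooled = F.max_pool2d(x, kernel_size=2, stride=2)      # usual max pooling

local_max = F.max_pool2d(x, kernel_size=2, stride=1)   # spatial operation only
downsampled = local_max[..., ::2, ::2]                 # scale change (downsampling)

print(torch.equal(pooled, downsampled))                # True
```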
Local maximum.
Consider the maximum of the function $I(x, t)$ over a local spatial region $[x_0 - h, x_0 + h]$, at a given time $t$. We can approximate $I(x, t)$ as a locally-linear function, whose slope is the spatial derivative of $I$ at $x_0$, $\frac{\partial}{\partial x} I(x_0, t)$. If the derivative is positive, the maximum is $I(x_0 + h, t)$, and if it is negative, it is $I(x_0 - h, t)$. In the special case when the derivative is zero, any point in $[x_0 - h, x_0 + h]$, including the endpoints, is a maximum. From the Taylor series expansion of $I(x, t)$ around $x_0$ up to and including the first-order term,

$$I(x_0 \pm \epsilon, t) \approx I(x_0, t) \pm \frac{\partial}{\partial x} I(x_0, t) \cdot \epsilon, \qquad (7)$$

for $|\epsilon| \le h$. With such a linear approximation, the local maximum of $I(x, t)$ over $[x_0 - h, x_0 + h]$ occurs either at $x_0 + h$ or at $x_0 - h$, depending on the sign of $\frac{\partial}{\partial x} I(x_0, t)$; if the derivative is zero, every point in the interval is a local maximum. Hence, the local maximum of $I(x, t)$ can be approximated as

$$\max_{x \in [x_0 - h,\, x_0 + h]} I(x, t) \approx I(x_0, t) + \operatorname{sign}\!\left( \frac{\partial}{\partial x} I(x_0, t) \right) \cdot \frac{\partial}{\partial x} I(x_0, t) \cdot h. \qquad (8)$$

Let (8) be the definition of $M(x_0, t)$, the function that takes on local spatial maximum values of $I(x, t)$ over windows of size $h$. The optical flow after such a local maximum operation is described by

$$\frac{\partial M}{\partial x} v' + \frac{\partial M}{\partial t} = 0, \qquad (9)$$

where $v'$ represents the motion after the local spatial maximum operation. Using (8) in (9), after some manipulation we obtain the following equation:

$$\frac{\partial I}{\partial x} v' + \frac{\partial I}{\partial t} + \operatorname{sign}\!\left( \frac{\partial I}{\partial x} \right) \cdot \frac{\partial}{\partial x}\!\left( \frac{\partial I}{\partial x} v' + \frac{\partial I}{\partial t} \right) \cdot h = 0. \qquad (10)$$

Note that if $v'$ satisfies the original optical flow equation (2), it will also satisfy (10); hence pre-max motion $v$ is also one possible solution for post-max motion $v'$.
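A small numerical check of the locally-linear approximation in (8) (ours; the Gaussian signal, window half-width, and evaluation point are arbitrary) shows how close the approximated local maximum is to the exact one for a smooth signal:

```python
# The maximum of I over [x0 - h, x0 + h] is close to I(x0) + sign(dI/dx) * dI/dx * h.
import numpy as np

g = lambda u: np.exp(-0.5 * ((u - 32.0) / 6.0) ** 2)   # smooth 1-D signal
x0, h = 40.0, 1.0

xs = np.linspace(x0 - h, x0 + h, 201)
true_max = g(xs).max()                                  # exact local maximum

dI_dx = (g(x0 + 1e-3) - g(x0 - 1e-3)) / 2e-3            # spatial derivative at x0
approx_max = g(x0) + np.sign(dI_dx) * dI_dx * h         # right-hand side of (8)

print(true_max, approx_max)                             # agree to within ~1% here
```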
Scale change.
Finally, consider the change of spatial scale by a factor $s$, such that the new signal is $I'(x, t) = I(s \cdot x, t)$. The optical flow equation is now

$$\frac{\partial I'}{\partial x} v' + \frac{\partial I'}{\partial t} = 0. \qquad (11)$$

Since $\frac{\partial I'}{\partial x} = s \cdot \frac{\partial I}{\partial x}$ and $\frac{\partial I'}{\partial t} = \frac{\partial I}{\partial t}$, we conclude that $v' = v/s$, where $v$ is the solution to the pre-scaling flow equation (2). Hence, as expected, down-scaling the signal spatially by a factor of 2 ($s = 2$) would reduce the motion by a factor of 2.

Combining the results of the above analyses, we conclude that convolutions, pointwise nonlinearities, and local maximum operations tend to be motion-preserving operations, in the sense that pre-operation motion is also a solution to post-operation optical flow, at least approximately. The operation with the most obvious impact on motion is scale change. Hence, when looking at latent-space motion at some layer in a deep model, we should expect to find motion similar to the input motion, but scaled down by a factor of $n^k$, where $k$ is the number of pooling operations (over $n \times n$ windows) between the input and the layer of interest. Specifically, if $\mathbf{v}$ is the motion vector at some position in the input frame, then at the corresponding spatial location in all the channels of the feature tensor we can expect to find the vector

$$\mathbf{v}' \approx \mathbf{v} / n^k. \qquad (12)$$

Fig. 4: Examples of motion transformations applied to the reference image. The output tensors of ResNet-34's add 3 layer are reliably predicted from only the reference tensor and known input-space motion.

In Section 3, we will verify these conclusions experimentally.
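Before turning to the experiments, here is a rough sketch (ours, not the paper's experimental code) of how (12) could be used for latent-space motion compensation of a feature tensor under a simple global translation; PyTorch, the scale factor $n^k = 8$, and the shift amounts are illustrative assumptions.

```python
# Latent-space motion compensation via (12): an input-space translation of v pixels
# predicts a shift of roughly v / n**k in every channel of the feature tensor
# (here n = 2, k = 3, i.e., three 2x2 pooling stages between input and tensor).
import torch

def predict_next_tensor(ref_tensor, v_input, n=2, k=3):
    """Shift every channel of ref_tensor by the scaled motion v' = v / n**k."""
    vy, vx = v_input
    scale = n ** k
    # Integer latent-space displacement for this simple sketch; sub-pixel shifts
    # would need interpolation (e.g., torch.nn.functional.grid_sample).
    dy, dx = round(vy / scale), round(vx / scale)
    # torch.roll wraps around at the borders; regions "entering the frame" would
    # have to be handled separately, as in the exclusions described in Section 3.
    return torch.roll(ref_tensor, shifts=(dy, dx), dims=(-2, -1))

ref = torch.randn(1, 128, 28, 28)                        # reference feature tensor
pred = predict_next_tensor(ref, v_input=(16.0, -8.0))    # 16 px down, 8 px left in input space
print(pred.shape)                                        # same shape; channels shifted by (2, -1)
```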
3. EXPERIMENTS
An illustration of the correspondence between the input-space motion and latent-space motion was shown in Fig. 2. This example was produced using a pair of frames from a video of a moving car. The motion vectors were estimated using an exhaustive block-matching search at each pixel, which sought to minimize the sum of squared differences (SSD). In the input frames, whose resolution was ×, a block size of × around each pixel and a search range of ± were used. In the corresponding feature tensor channels, whose resolution was ×, a block size of × and a search range of ± were used. Although the estimated motion vector fields are somewhat noisy, the similarity between the input-space motion and latent-space motion is evident.

To examine the relationship between input-space and latent-space motion more closely, we performed several experiments with synthetic input-space motion. In this case, the exact input-space motion is known, so relationship (12) can be tested more reliably. Fig. 4 shows examples of various transformations (translation, rotation, stretching, shearing) applied to an input image of a dog. The second column displays several channels from the actual tensor produced by the transformed image, and the third column shows the corresponding channels produced by motion compensating the tensor of the original image via (12). The last column shows the difference between the actual and predicted tensor channels. Note that regions that cannot be predicted, such as regions “entering the frame,” were excluded from the difference computation. As seen in Fig. 4, the model (12) works reasonably well, and the differences between the actual and predicted tensors are low.

For quantitative evaluation, experiments were conducted on several layers of ResNet-34 [17] and DenseNet-121 [18]. Normalized Root Mean Square Error (NRMSE) [19] was used for this purpose:

$$\mathrm{NRMSE} = \frac{1}{R} \sqrt{\frac{1}{N} \sum_{i=1}^{N} (p_i - a_i)^2}, \qquad (13)$$

where $a_i$ is the actual tensor value produced from the transformed input, $p_i$ is the tensor value predicted using our motion model (12), $N$ is the number of elements in the feature tensor, and $R$ is the dynamic range. Again, regions that cannot be predicted were excluded from the NRMSE computation. Fig. 5 shows NRMSE computed across a range of parameters for several transformations, at various layers of the two DNNs.

Fig. 5: NRMSE across parameter ranges for translation (top-left), rotation (top-right), scaling (bottom-left), and shear (bottom-right). For translation, NRMSE local minima occur when the input-space shifts correspond to integer latent-space shifts in (12).

Fig. 6: NRMSE histogram computed for an affine motion model [16] over the combination of the following seven independent uniformly distributed parameters: x- and y-translation (± px), x- and y-scaling (0.95–1.05), x- and y-shearing (±°), and rotation (±°).
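For reference, a minimal implementation of the NRMSE metric in (13) might look as follows (NumPy; the mask argument is a placeholder for excluding regions that cannot be predicted, and taking the dynamic range from the actual tensor is our assumption):

```python
# NRMSE per (13): RMSE between predicted and actual tensors, normalized by dynamic range.
import numpy as np

def nrmse(predicted, actual, mask=None):
    p = np.asarray(predicted, dtype=float)
    a = np.asarray(actual, dtype=float)
    if mask is not None:                       # keep only the predictable elements
        p, a = p[mask], a[mask]
    R = a.max() - a.min()                      # dynamic range R in (13)
    return np.sqrt(np.mean((p - a) ** 2)) / R

a = np.random.rand(64, 56, 56)                 # "actual" feature tensor (toy values)
p = a + 0.01 * np.random.randn(*a.shape)       # "predicted" tensor with small error
print(nrmse(p, a))                             # small value, roughly 0.01 here
```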
As seen in Fig. 5, NRMSE goes up to about 0.04 for reasonable ranges of transformation parameters. How good is this? To answer this question, we set out to find the typical values of NRMSE found in conventional motion-compensated frame prediction. In a recent study [20], the quality of frames predicted by conventional motion estimation and motion compensation (MEMC) in High Efficiency Video Coding (HEVC) [21] was compared against a DNN developed for frame prediction. From Table III in [20], the luminance Peak Signal to Noise Ratio (PSNR) of frames predicted uni-directionally by the DNN and conventional HEVC MEMC was in the range 27–41 dB over several HEVC test sequences. NRMSE can be computed from PSNR as

$$\mathrm{NRMSE} = \frac{1}{256}\sqrt{\frac{256^2}{10^{\mathrm{PSNR}/10}}} = 10^{-\mathrm{PSNR}/20}, \qquad (14)$$

so the PSNR range of 27–41 dB corresponds to the NRMSE range of 0.009–0.044. These levels of NRMSE are indicative of how much the motion models used in video coding deviate from the true motion. As seen in Fig. 5, the model (12) produces NRMSE in the same range, so the accuracy of (12) is comparable to the accuracy of common motion models used in video coding. Another illustration of this is presented in Fig. 6, which shows the histogram of NRMSE computed across a range of affine transformation parameters. Hence, (12) represents a good starting point for the development of latent-space motion compensation.
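A quick arithmetic check of (14) (ours, assuming an 8-bit dynamic range) confirms the quoted endpoints: 27 dB and 41 dB map to NRMSE of roughly 0.045 and 0.009.

```python
# PSNR-to-NRMSE conversion, simplified form of (14).
def psnr_to_nrmse(psnr_db: float) -> float:
    return 10.0 ** (-psnr_db / 20.0)

print(psnr_to_nrmse(27.0))   # ~0.0447
print(psnr_to_nrmse(41.0))   # ~0.0089
```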
4. CONCLUSIONS
Using the concept of optical flow, in this paper we analyzed motion in the latent space of a deep model induced by the motion in the input space, and showed that motion tends to be approximately preserved in the channels of intermediate feature tensors. These findings suggest that motion estimation, compensation, and analysis methods developed for conventional video signals should be able to provide a good starting point for latent-space motion processing, such as motion-compensated prediction and compression, tracking, action recognition, and other applications.
5. REFERENCES

[1] I. V. Bajić, W. Lin, and Y. Tian, “Collaborative intelligence: Challenges and opportunities,” in Proc. IEEE ICASSP, 2021, to appear.

[2] Y. Kang, J. Hauswald, C. Gao, A. Rovinski, T. Mudge, J. Mars, and L. Tang, “Neurosurgeon: Collaborative intelligence between the cloud and mobile edge,” in Proc. 22nd ACM Int. Conf. Arch. Support Programming Languages and Operating Syst., 2017, pp. 615–629.

[3] A. E. Eshratifar, M. S. Abrishami, and M. Pedram, “JointDNN: An efficient training and inference engine for intelligent mobile cloud computing services,” IEEE Trans. Mobile Computing, 2019, Early Access.

[4] M. Ulhaq and I. V. Bajić, “Shared mobile-cloud inference for collaborative intelligence,” arXiv:2002.00157, 2019, NeurIPS'19 demonstration.

[5] H. Choi and I. V. Bajić, “Deep feature compression for collaborative object detection,” in Proc. IEEE ICIP, Oct. 2018, pp. 3743–3747.

[6] H. Choi and I. V. Bajić, “Near-lossless deep feature compression for collaborative intelligence,” in Proc. IEEE MMSP, Aug. 2018, pp. 1–6.

[7] Z. Chen, K. Fan, S. Wang, L. Duan, W. Lin, and A. C. Kot, “Toward intelligent sensing: Intermediate deep feature compression,” IEEE Trans. Image Processing, vol. 29, pp. 2230–2243, 2019.

[8] ISO/IEC, “Draft call for evidence for video coding for machines,” ISO/IEC JTC 1/SC 29/WG 11 W19508, Jul. 2020.

[9] L. Duan, J. Liu, W. Yang, T. Huang, and W. Gao, “Video coding for machines: A paradigm of collaborative compression and intelligent analytics,” IEEE Trans. Image Processing, vol. 29, pp. 8680–8695, 2020.

[10] S. R. Alvar and I. V. Bajić, “Multi-task learning with compressible features for collaborative intelligence,” in Proc. IEEE ICIP, Sep. 2019, pp. 1705–1709.

[11] H. Choi, R. A. Cohen, and I. V. Bajić, “Back-and-forth prediction for deep tensor compression,” in Proc. IEEE ICASSP, 2020, pp. 4467–4471.

[12] S. R. Alvar and I. V. Bajić, “Bit allocation for multi-task collaborative intelligence,” in Proc. IEEE ICASSP, May 2020, pp. 4342–4346.

[13] R. A. Cohen, H. Choi, and I. V. Bajić, “Lightweight compression of neural network feature tensors for collaborative intelligence,” in Proc. IEEE ICME, Jul. 2020, pp. 1–6.

[14] S. R. Alvar and I. V. Bajić, “Pareto-optimal bit allocation for collaborative intelligence,” arXiv:2009.12430, Sep. 2020.

[15] B. K. P. Horn and B. G. Schunck, “Determining optical flow,” Artificial Intelligence, vol. 17, no. 1, pp. 185–203, 1981.

[16] Y. Wang, J. Ostermann, and Y.-Q. Zhang, Video Processing and Communications, Prentice-Hall, 2002.

[17] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. IEEE CVPR, 2016, pp. 770–778.

[18] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in Proc. IEEE CVPR, 2017, pp. 2261–2269.

[19] Wikipedia contributors, “Root-mean-square deviation — Wikipedia, the free encyclopedia,” 2020, [Online] Available: https://en.wikipedia.org/wiki/Root-mean-square_deviation.

[20] H. Choi and I. V. Bajić, “Deep frame prediction for video coding,”