Analysis of Latent-Space Motion for Collaborative Intelligence
Extended version of the paper “Latent Space Motion Analysis for Collaborative Intelligence,” to be presented at the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, Canada, June 6–11, 2021.
ANALYSIS OF LATENT-SPACE MOTION FOR COLLABORATIVE INTELLIGENCE
Mateen Ulhaq and Ivan V. Bajić
School of Engineering Science, Simon Fraser University, Burnaby, BC, Canada
ABSTRACT
When the input to a deep neural network (DNN) is a video signal, a sequence of feature tensors is produced at the intermediate layers of the model. If neighboring frames of the input video are related through motion, a natural question is, “what is the relationship between the corresponding feature tensors?” By analyzing the effect of common DNN operations on optical flow, we show that the motion present in each channel of a feature tensor is approximately equal to the scaled version of the input motion. The analysis is validated through experiments utilizing common motion models.
Index Terms — Collaborative intelligence, latent space motion, feature domain motion, feature compression
1. INTRODUCTION
Collaborative intelligence (CI) [1] has emerged as a promising strategy to bring AI “to the edge.” In a typical CI system (Fig. 1), a deep neural network (DNN) is split into two parts: the edge sub-model, deployed on the edge device near the sensor, and the cloud sub-model, deployed in the cloud. Intermediate features produced by the edge sub-model are transferred from the edge device to the cloud. It has been shown that such a strategy may provide better energy efficiency [2, 3], lower latency [2, 3, 4], and lower bitrates over the communication channel [5, 6], compared to more traditional cloud-based analytics where the input signal is sent directly to the cloud. These potential benefits will find a number of applications in areas such as intelligent sensing [7] and video coding for machines [8, 9]. In particular, compression of intermediate features has become an important research problem, with a number of recent developments [10, 11, 12, 13, 14] for the case when the input to the edge sub-model is a still image.

When the input to the edge sub-model is video, its output is a sequence of feature tensors produced from successive frames in the input video. This sequence of feature tensors needs to be compressed prior to transmission and then decoded in the cloud for further processing. Since motion plays such an important role in video processing and compression, we are motivated to examine whether any similar relationship exists in the latent space among the feature tensors. Our theoretical and experimental results show that, indeed, motion from the input video is approximately preserved in the channels of the feature tensor. An illustration of this is presented in Fig. 2, where the estimated input-space motion field is shown on the left, and the estimated motion fields in several feature tensor channels are shown on the right. These findings suggest that methods for motion estimation, compensation, and analysis that have been developed for conventional video processing and compression may provide a solid starting point for equivalent operations in the latent space.

The paper is organized as follows. In Section 2, we analyze the actions of typical operations found in deep convolutional neural networks on optical flow in the input signal, and show that these operations tend to preserve the optical flow, at least approximately, with an appropriate scale. In Section 3 we provide empirical support for the theoretical analysis from Section 2. Finally, Section 4 concludes the paper.

This work was supported in part by the Natural Sciences and Engineering Research Council (NSERC) of Canada.

Fig. 1: Basic collaborative intelligence system.

Fig. 2: Motion estimates for input frames (left) and select channels from the output of ResNet-34's add 3 layer (right).

Fig. 3: The problem studied in this paper: if input images are related via motion, what is the relationship between the corresponding intermediate feature tensors?
2. LATENT SPACE MOTION ANALYSIS
The basic problem studied in this paper is illustrated in Fig. 3. Consider two images (video frames) input to the edge sub-model of a CI system. It is common to represent their relationship via a motion model. The question we seek to answer here is, “what is the relationship between the corresponding feature tensors produced by the edge sub-model?” To answer this question, we will look at the processing pipeline between the input image and a given channel of a feature tensor. In most deep models for computer vision applications, this processing pipeline consists of a sequence of basic operations: convolutions, pointwise nonlinearities, and pooling. We will show that each of these operations tends to preserve motion, at least approximately, in a certain sense, and from this we will conclude that (approximate) input motion may be observed in individual channels of a feature tensor.
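For concreteness, the sketch below (not from the paper) splits a torchvision ResNet-34 into an edge sub-model that emits an intermediate feature tensor and a cloud sub-model that completes the computation. PyTorch, torchvision, and the particular split point are assumptions made purely for illustration; the paper does not specify a framework or split layer.

```python
# Minimal sketch of an edge/cloud split of ResNet-34 (assumed framework: PyTorch).
import torch
import torch.nn as nn
from torchvision.models import resnet34

model = resnet34()   # random weights are sufficient for a shape/data-flow illustration
model.eval()

# Edge sub-model: stem plus the first two residual stages (hypothetical split point).
edge = nn.Sequential(
    model.conv1, model.bn1, model.relu, model.maxpool,
    model.layer1, model.layer2,
)

# Cloud sub-model: remaining stages plus the classifier head.
cloud_backbone = nn.Sequential(model.layer3, model.layer4)

with torch.no_grad():
    x = torch.randn(1, 3, 224, 224)     # stand-in for one input video frame
    z = edge(x)                         # intermediate feature tensor sent edge -> cloud
    y = model.fc(torch.flatten(model.avgpool(cloud_backbone(z)), 1))

print(z.shape)   # torch.Size([1, 128, 28, 28]) for this split
print(y.shape)   # torch.Size([1, 1000])
```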
Motion model.
Optical flow is a frequently used motion model in computer vision and video processing. In a “2D+t” model, $I(x, y, t)$ denotes pixel intensity at time $t$, at spatial position $(x, y)$. Under a constant-intensity assumption, optical flow satisfies the following partial differential equation [15]:

$$\frac{\partial I}{\partial x} v_x + \frac{\partial I}{\partial y} v_y + \frac{\partial I}{\partial t} = 0, \qquad (1)$$

where $(v_x, v_y)$ represents the motion vector. For notational simplicity, in the analysis below we will use a “1D+t” model, which captures all the main ideas but keeps the equations shorter. In a “1D+t” model, $I(x, t)$ denotes intensity at position $x$ at time $t$, and the optical flow equation is

$$\frac{\partial I}{\partial x} v + \frac{\partial I}{\partial t} = 0, \qquad (2)$$

with $v$ representing the motion. We will analyze the effect of the basic operations (convolutions, pointwise nonlinearities, and pooling) on (2), to gain insight into the relationship between input-space motion and latent-space motion.
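As a quick numerical sanity check of (2) (not part of the paper; the Gaussian test signal, velocity, and step sizes below are arbitrary choices), one can verify that the true velocity of a translating 1-D signal nearly cancels the discretized optical-flow residual:

```python
# Minimal numerical check of the "1D+t" optical flow equation (2): for a signal
# translating at velocity v, i.e. I(x, t+1) = I(x - v, t), the residual
# dI/dx * v + dI/dt evaluated with the true v should be close to zero.
import numpy as np

v = 1.0                                               # true motion (pixels per frame)
x = np.arange(0.0, 64.0)
g = lambda u: np.exp(-0.5 * ((u - 32.0) / 8.0) ** 2)  # smooth test signal (arbitrary)

I0 = g(x)                                             # frame at time t
I1 = g(x - v)                                         # frame at time t + 1

dI_dx = np.gradient(I0, x)                            # central-difference spatial derivative
dI_dt = I1 - I0                                       # forward-difference temporal derivative

residual = dI_dx * v + dI_dt                          # left-hand side of (2)
print(np.abs(residual).max(), np.abs(dI_dt).max())    # residual is ~10x smaller than dI/dt
```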
Convolution.
Let $f$ be a (spatial) filter kernel; then the optical flow after convolution is a solution to the following equation:

$$\frac{\partial}{\partial x}(f * I)\, v' + \frac{\partial}{\partial t}(f * I) = 0, \qquad (3)$$

where $v'$ is the motion after the convolution. Since convolution and differentiation are linear operations, we have

$$f * \left( \frac{\partial I}{\partial x} v' + \frac{\partial I}{\partial t} \right) = 0. \qquad (4)$$

Hence, solution $v$ from (2) is also a solution of (4), but (4) could also have other solutions besides those that satisfy (2).
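Under the same toy setup as the previous sketch (again an illustration of ours, with an arbitrary smoothing kernel), one can check that the input motion $v$ also approximately solves the post-convolution flow equation (3):

```python
# Check that filtering both frames with f leaves the optical-flow residual for the
# same v near zero, i.e. v remains a solution of (3) after convolution.
import numpy as np

v = 1.0
x = np.arange(0.0, 64.0)
g = lambda u: np.exp(-0.5 * ((u - 32.0) / 8.0) ** 2)
I0, I1 = g(x), g(x - v)                               # frames at t and t + 1

f = np.array([1.0, 4.0, 6.0, 4.0, 1.0]) / 16.0        # example smoothing kernel
J0 = np.convolve(I0, f, mode="same")                  # (f * I) at time t
J1 = np.convolve(I1, f, mode="same")                  # (f * I) at time t + 1

residual = np.gradient(J0, x) * v + (J1 - J0)         # left-hand side of (3)
print(np.abs(residual).max())                         # ~0 up to discretization error
```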
Pointwise nonlinearity.
Nonlinear activations such as sigmoid, ReLU, etc., are commonly applied in a pointwise fashion on the output of convolutions in deep models. Let $\sigma(\cdot)$ denote such a pointwise nonlinearity; then the optical flow after this nonlinearity is a solution to the following equation:

$$\frac{\partial}{\partial x}\left[\sigma(I)\right] v' + \frac{\partial}{\partial t}\left[\sigma(I)\right] = 0, \qquad (5)$$

where $v'$ is the motion after the pointwise nonlinearity. By using the chain rule of differentiation, the above equation can be rewritten as

$$\sigma'(I) \cdot \left( \frac{\partial I}{\partial x} v' + \frac{\partial I}{\partial t} \right) = 0. \qquad (6)$$

Hence, again, solution $v$ from (2) is also a solution of (6). It should be noted that (6) may have solutions other than those from (2). For example, in the region where inputs to ReLU are negative, the corresponding outputs will be zero, so $\sigma'(I) = 0$. Hence, in those regions, (6) will be satisfied for arbitrary $v'$. Nonetheless, the solution from (2) is still one of those arbitrary solutions.
Pooling.
There are various forms of pooling, such as max-pooling, mean-pooling, learnt pooling (via strided convolutions), etc. All of these can be decomposed into a sequence of two operations: a spatial operation (local maximum or convolution) followed by a scale change (downsampling). Spatial convolution operations can be analyzed as above, and the conclusion is that motion before such an operation is also a solution to the optical flow equation after such an operation. Hence, we will focus here on the local maximum operation and the scale change.
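The decomposition used here is easy to verify directly; the following sketch (ours, using PyTorch as an assumed framework) shows that a standard 2×2, stride-2 max pooling equals a dense local maximum followed by 2× downsampling:

```python
# A 2x2 / stride-2 max pooling is exactly a stride-1 local maximum (spatial operation)
# followed by subsampling every second row and column (scale change with s = 2).
import torch
import torch.nn.functional as F

x = torch.randn(1, 1, 8, 8)

pooled = F.max_pool2d(x, kernel_size=2, stride=2)      # usual max pooling

local_max = F.max_pool2d(x, kernel_size=2, stride=1)   # spatial operation only
downsampled = local_max[..., ::2, ::2]                 # scale change (downsampling)

print(torch.equal(pooled, downsampled))                # True
```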
Local maximum.
Consider the maximum of the function $I(x, t)$ over a local spatial region $[x_0 - h, x_0 + h]$, at a given time $t$. We can approximate $I(x, t)$ as a locally-linear function, whose slope is the spatial derivative of $I$ at $x_0$, $\frac{\partial}{\partial x} I(x_0, t)$. If the derivative is positive, the maximum is $I(x_0 + h, t)$, and if it is negative, it is $I(x_0 - h, t)$. In the special case when the derivative is zero, any point in $[x_0 - h, x_0 + h]$, including the endpoints, is a maximum. From the Taylor series expansion of $I(x, t)$ around $x_0$ up to and including the first-order term,

$$I(x_0 \pm \epsilon, t) \approx I(x_0, t) \pm \frac{\partial}{\partial x} I(x_0, t) \cdot \epsilon, \qquad (7)$$

for $|\epsilon| \le h$. With such a linear approximation, the local maximum of $I(x, t)$ over $[x_0 - h, x_0 + h]$ occurs either at $x_0 + h$ or at $x_0 - h$, depending on the sign of $\frac{\partial}{\partial x} I(x_0, t)$; if the derivative is zero, every point in the interval is a local maximum. Hence, the local maximum of $I(x, t)$ can be approximated as

$$\max_{x \in [x_0 - h,\, x_0 + h]} I(x, t) \approx I(x_0, t) + \operatorname{sign}\!\left( \frac{\partial}{\partial x} I(x_0, t) \right) \cdot \frac{\partial}{\partial x} I(x_0, t) \cdot h. \qquad (8)$$

Let (8) be the definition of $M(x_0, t)$, the function that takes on local spatial maximum values of $I(x, t)$ over windows of size $h$. The optical flow after such a local maximum operation is described by

$$\frac{\partial M}{\partial x} v' + \frac{\partial M}{\partial t} = 0, \qquad (9)$$

where $v'$ represents the motion after the local spatial maximum operation. Using (8) in (9), after some manipulation we obtain the following equation:

$$\frac{\partial I}{\partial x} v' + \frac{\partial I}{\partial t} + \operatorname{sign}\!\left( \frac{\partial I}{\partial x} \right) \cdot \frac{\partial}{\partial x}\!\left( \frac{\partial I}{\partial x} v' + \frac{\partial I}{\partial t} \right) \cdot h = 0. \qquad (10)$$

Note that if $v'$ satisfies the original optical flow equation (2), it will also satisfy (10); hence pre-max motion $v$ is also one possible solution for post-max motion $v'$.
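A small numerical check of the locally-linear approximation in (8) (ours; the Gaussian signal, window half-width, and evaluation point are arbitrary) shows how close the approximated local maximum is to the exact one for a smooth signal:

```python
# The maximum of I over [x0 - h, x0 + h] is close to I(x0) + sign(dI/dx) * dI/dx * h.
import numpy as np

g = lambda u: np.exp(-0.5 * ((u - 32.0) / 6.0) ** 2)   # smooth 1-D signal
x0, h = 40.0, 1.0

xs = np.linspace(x0 - h, x0 + h, 201)
true_max = g(xs).max()                                  # exact local maximum

dI_dx = (g(x0 + 1e-3) - g(x0 - 1e-3)) / 2e-3            # spatial derivative at x0
approx_max = g(x0) + np.sign(dI_dx) * dI_dx * h         # right-hand side of (8)

print(true_max, approx_max)                             # agree to within ~1% here
```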
Scale change.
Finally, consider the change of spatial scale by a factor $s$, such that the new signal is $I'(x, t) = I(s \cdot x, t)$. The optical flow equation is now

$$\frac{\partial I'}{\partial x} v' + \frac{\partial I'}{\partial t} = 0. \qquad (11)$$

Since $\frac{\partial I'}{\partial x} = s \cdot \frac{\partial I}{\partial x}$ and $\frac{\partial I'}{\partial t} = \frac{\partial I}{\partial t}$, we conclude that $v' = v/s$, where $v$ is the solution to the pre-scaling flow equation (2). Hence, as expected, down-scaling the signal spatially by a factor of 2 ($s = 2$) would reduce the motion by a factor of 2.

Combining the results of the above analyses, we conclude that convolutions, pointwise nonlinearities, and local maximum operations tend to be motion-preserving operations, in the sense that pre-operation motion is also a solution to post-operation optical flow, at least approximately. The operation with the most obvious impact on motion is scale change. Hence, when looking at latent-space motion at some layer in a deep model, we should expect to find motion similar to the input motion, but scaled down by a factor of $n^k$, where $k$ is the number of pooling operations (over $n \times n$ windows) between the input and the layer of interest. Specifically, if $\mathbf{v}$ is the motion vector at some position in the input frame, then at the corresponding spatial location in all the channels of the feature tensor we can expect to find the vector

$$\mathbf{v}' \approx \mathbf{v} / n^k. \qquad (12)$$

Fig. 4: Examples of motion transformations applied to the reference image. The output tensors of ResNet-34's add 3 layer are reliably predicted from only the reference tensor and known input-space motion.

In Section 3, we will verify these conclusions experimentally.
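Before turning to the experiments, here is a rough sketch (ours, not the paper's experimental code) of how (12) could be used for latent-space motion compensation of a feature tensor under a simple global translation; PyTorch, the scale factor $n^k = 8$, and the shift amounts are illustrative assumptions.

```python
# Latent-space motion compensation via (12): an input-space translation of v pixels
# predicts a shift of roughly v / n**k in every channel of the feature tensor
# (here n = 2, k = 3, i.e., three 2x2 pooling stages between input and tensor).
import torch

def predict_next_tensor(ref_tensor, v_input, n=2, k=3):
    """Shift every channel of ref_tensor by the scaled motion v' = v / n**k."""
    vy, vx = v_input
    scale = n ** k
    # Integer latent-space displacement for this simple sketch; sub-pixel shifts
    # would need interpolation (e.g., torch.nn.functional.grid_sample).
    dy, dx = round(vy / scale), round(vx / scale)
    # torch.roll wraps around at the borders; regions "entering the frame" would
    # have to be handled separately, as in the exclusions described in Section 3.
    return torch.roll(ref_tensor, shifts=(dy, dx), dims=(-2, -1))

ref = torch.randn(1, 128, 28, 28)                        # reference feature tensor
pred = predict_next_tensor(ref, v_input=(16.0, -8.0))    # 16 px down, 8 px left in input space
print(pred.shape)                                        # same shape; channels shifted by (2, -1)
```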
3. EXPERIMENTS
An illustration of the correspondence between the input-space motion and latent-space motion was shown in Fig. 2. This example was produced using a pair of frames from a video of a moving car. The motion vectors were estimated using an exhaustive block-matching search at each pixel, which sought to minimize the sum of squared differences (SSD). In the input frames, whose resolution was ×, a block size of × around each pixel and a search range of ± were used. In the corresponding feature tensor channels, whose resolution was ×, a block size of × and a search range of ± were used. Although the estimated motion vector fields are somewhat noisy, the similarity between the input-space motion and latent-space motion is evident.

To examine the relationship between input-space and latent-space motion more closely, we performed several experiments with synthetic input-space motion. In this case, the exact input-space motion is known, so relationship (12) can be tested more reliably. Fig. 4 shows examples of various transformations (translation, rotation, stretching, shearing) applied to an input image of a dog. The second column displays several channels from the actual tensor produced by the transformed image, and the third column shows the corresponding channels produced by motion compensating the tensor of the original image via (12). The last column shows the difference between the actual and predicted tensor channels. Note that regions that cannot be predicted, such as regions “entering the frame,” were excluded from the difference computation. As seen in Fig. 4, the model (12) works reasonably well, and the differences between the actual and predicted tensors are low.

For quantitative evaluation, experiments were conducted on several layers of ResNet-34 [17] and DenseNet-121 [18]. Normalized Root Mean Square Error (NRMSE) [19] was used for this purpose:

$$\mathrm{NRMSE} = \frac{1}{R} \sqrt{\frac{1}{N} \sum_{i=1}^{N} (p_i - a_i)^2}, \qquad (13)$$

where $a_i$ is the actual tensor value produced from the transformed input, $p_i$ is the tensor value predicted using our motion model (12), $N$ is the number of elements in the feature tensor, and $R$ is the dynamic range. Again, regions that cannot be predicted were excluded from the NRMSE computation. Fig. 5 shows NRMSE computed across a range of parameters for several transformations, at various layers of the two DNNs.

Fig. 5: NRMSE across parameter ranges for translation (top-left), rotation (top-right), scaling (bottom-left), and shear (bottom-right). For translation, NRMSE local minima occur when the input-space shifts correspond to integer latent-space shifts in (12).

Fig. 6: NRMSE histogram computed for an affine motion model [16] over the combination of the following seven independent uniformly distributed parameters: x- and y-translation (± px), x- and y-scaling (0.95–1.05), x- and y-shearing (±°), and rotation (±°).
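For reference, a minimal implementation of the NRMSE metric in (13) might look as follows (NumPy; the mask argument is a placeholder for excluding regions that cannot be predicted, and taking the dynamic range from the actual tensor is our assumption):

```python
# NRMSE per (13): RMSE between predicted and actual tensors, normalized by dynamic range.
import numpy as np

def nrmse(predicted, actual, mask=None):
    p = np.asarray(predicted, dtype=float)
    a = np.asarray(actual, dtype=float)
    if mask is not None:                       # keep only the predictable elements
        p, a = p[mask], a[mask]
    R = a.max() - a.min()                      # dynamic range R in (13)
    return np.sqrt(np.mean((p - a) ** 2)) / R

a = np.random.rand(64, 56, 56)                 # "actual" feature tensor (toy values)
p = a + 0.01 * np.random.randn(*a.shape)       # "predicted" tensor with small error
print(nrmse(p, a))                             # small value, roughly 0.01 here
```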
As seen in Fig. 5, NRMSE goes up to about 0.04 for reasonable ranges of transformation parameters. How good is this? To answer this question, we set out to find the typical values of NRMSE found in conventional motion-compensated frame prediction. In a recent study [20], the quality of frames predicted by conventional motion estimation and motion compensation (MEMC) in High Efficiency Video Coding (HEVC) [21] was compared against a DNN developed for frame prediction. From Table III in [20], the luminance Peak Signal to Noise Ratio (PSNR) of frames predicted uni-directionally by the DNN and conventional HEVC MEMC was in the range 27–41 dB over several HEVC test sequences. NRMSE can be computed from PSNR as

$$\mathrm{NRMSE} = \frac{1}{256}\sqrt{\frac{256^2}{10^{\mathrm{PSNR}/10}}} = 10^{-\mathrm{PSNR}/20}, \qquad (14)$$

so the PSNR range of 27–41 dB corresponds to the NRMSE range of 0.009–0.044. These levels of NRMSE are indicative of how much the motion models used in video coding deviate from the true motion. As seen in Fig. 5, the model (12) produces NRMSE in the same range, so the accuracy of (12) is comparable to the accuracy of common motion models used in video coding. Another illustration of this is presented in Fig. 6, which shows the histogram of NRMSE computed across a range of affine transformation parameters. Hence, (12) represents a good starting point for the development of latent-space motion compensation.
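A quick arithmetic check of (14) (ours, assuming an 8-bit dynamic range) confirms the quoted endpoints: 27 dB and 41 dB map to NRMSE of roughly 0.045 and 0.009.

```python
# PSNR-to-NRMSE conversion, simplified form of (14).
def psnr_to_nrmse(psnr_db: float) -> float:
    return 10.0 ** (-psnr_db / 20.0)

print(psnr_to_nrmse(27.0))   # ~0.0447
print(psnr_to_nrmse(41.0))   # ~0.0089
```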
4. CONCLUSIONS
Using the concept of optical flow, in this paper we analyzed motion in the latent space of a deep model induced by the motion in the input space, and showed that motion tends to be approximately preserved in the channels of intermediate feature tensors. These findings suggest that motion estimation, compensation, and analysis methods developed for conventional video signals should be able to provide a good starting point for latent-space motion processing, such as motion-compensated prediction and compression, tracking, action recognition, and other applications.
5. REFERENCES

[1] I. V. Bajić, W. Lin, and Y. Tian, “Collaborative intelligence: Challenges and opportunities,” in Proc. IEEE ICASSP, 2021, to appear.

[2] Y. Kang, J. Hauswald, C. Gao, A. Rovinski, T. Mudge, J. Mars, and L. Tang, “Neurosurgeon: Collaborative intelligence between the cloud and mobile edge,” in Proc. 22nd ACM Int. Conf. Arch. Support Programming Languages and Operating Syst., 2017, pp. 615–629.

[3] A. E. Eshratifar, M. S. Abrishami, and M. Pedram, “JointDNN: An efficient training and inference engine for intelligent mobile cloud computing services,” IEEE Trans. Mobile Computing, 2019, Early Access.

[4] M. Ulhaq and I. V. Bajić, “Shared mobile-cloud inference for collaborative intelligence,” arXiv:2002.00157, 2019, NeurIPS'19 demonstration.

[5] H. Choi and I. V. Bajić, “Deep feature compression for collaborative object detection,” in Proc. IEEE ICIP, Oct. 2018, pp. 3743–3747.

[6] H. Choi and I. V. Bajić, “Near-lossless deep feature compression for collaborative intelligence,” in Proc. IEEE MMSP, Aug. 2018, pp. 1–6.

[7] Z. Chen, K. Fan, S. Wang, L. Duan, W. Lin, and A. C. Kot, “Toward intelligent sensing: Intermediate deep feature compression,” IEEE Trans. Image Processing, vol. 29, pp. 2230–2243, 2019.

[8] ISO/IEC, “Draft call for evidence for video coding for machines,” ISO/IEC JTC 1/SC 29/WG 11 W19508, Jul. 2020.

[9] L. Duan, J. Liu, W. Yang, T. Huang, and W. Gao, “Video coding for machines: A paradigm of collaborative compression and intelligent analytics,” IEEE Trans. Image Processing, vol. 29, pp. 8680–8695, 2020.

[10] S. R. Alvar and I. V. Bajić, “Multi-task learning with compressible features for collaborative intelligence,” in Proc. IEEE ICIP, Sep. 2019, pp. 1705–1709.

[11] H. Choi, R. A. Cohen, and I. V. Bajić, “Back-and-forth prediction for deep tensor compression,” in Proc. IEEE ICASSP, 2020, pp. 4467–4471.

[12] S. R. Alvar and I. V. Bajić, “Bit allocation for multi-task collaborative intelligence,” in Proc. IEEE ICASSP, May 2020, pp. 4342–4346.

[13] R. A. Cohen, H. Choi, and I. V. Bajić, “Lightweight compression of neural network feature tensors for collaborative intelligence,” in Proc. IEEE ICME, Jul. 2020, pp. 1–6.

[14] S. R. Alvar and I. V. Bajić, “Pareto-optimal bit allocation for collaborative intelligence,” arXiv:2009.12430, Sep. 2020.

[15] B. K. P. Horn and B. G. Schunck, “Determining optical flow,” Artificial Intelligence, vol. 17, no. 1, pp. 185–203, 1981.

[16] Y. Wang, J. Ostermann, and Y.-Q. Zhang, Video Processing and Communications, Prentice-Hall, 2002.

[17] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. IEEE CVPR, 2016, pp. 770–778.

[18] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in Proc. IEEE CVPR, 2017, pp. 2261–2269.

[19] Wikipedia contributors, “Root-mean-square deviation — Wikipedia, the free encyclopedia,” 2020, [Online] Available: https://en.wikipedia.org/wiki/Root-mean-square_deviation.

[20] H. Choi and I. V. Bajić, “Deep frame prediction for video coding,”