Key-Point Sequence Lossless Compression for Intelligent Video Analysis
Weiyao Lin, Xiaoyi He, Wenrui Dai, John See, Tushar Shinde, Hongkai Xiong, Lingyu Duan
Feature Article

Weiyao Lin, Shanghai Jiao Tong University
Xiaoyi He, Shanghai Jiao Tong University
Wenrui Dai, Shanghai Jiao Tong University
John See, Multimedia University
Tushar Shinde, Indian Institute of Technology Jodhpur
Hongkai Xiong, Shanghai Jiao Tong University
Lingyu Duan, Peking University
Abstract—Feature coding has recently been considered to facilitate intelligent video analysis for urban computing. Instead of raw videos, features extracted at the front-end are encoded and transmitted to the back-end for further processing. In this article, we present a lossless key-point sequence compression approach for efficient feature coding. The essence of this predict-and-encode strategy is to eliminate the spatial and temporal redundancies of key points in videos. Multiple prediction modes with an adaptive mode selection method are proposed to handle key-point sequences with various structures and motions. Experimental results validate the effectiveness of the proposed scheme on four types of widely used key-point sequences in video analysis.

INTELLIGENT VIDEO ANALYSIS, involving applications such as activity recognition, face recognition and vehicle re-identification, has become part and parcel of smart cities and urban computing. Recently, deep learning techniques have been adopted to improve the capabilities of urban video analysis and understanding by leveraging large amounts of video data. With the widespread deployment of surveillance systems in urban areas, massive amounts of video data are captured daily from front-end cameras. However, it remains a challenging task to transmit such large-scale data to the back-end server for analysis, although the state-of-the-art High Efficiency Video Coding (HEVC) [1] and ongoing Versatile Video Coding (VVC) standards present reasonably efficient solutions. An alternative strategy that transmits extracted and compressed compact features, rather than entire video streams, from the front-end to the back-end is illustrated in Figure 1.
Figure 1. Illustration of the feature compression and transmission framework. Best viewed in color.

These feature streams, when passed to the back-end, enable various video analysis tasks to be achieved efficiently. Here, we summarize the advantages of transmitting information via feature coding in a lossless fashion: (1) lossy video coding would affect the fidelity of reconstructed videos and subsequent feature extraction at the back-end, which leads to degraded accuracy in video analysis tasks; (2) transmitting features rather than videos can mitigate privacy concerns in sensitive scenes such as hospitals and prisons; (3) a computational balance can be struck between front-end and back-end processing, as decoded features are directly utilized for analysis at the back-end.

In video analysis, common features include hand-crafted features (e.g., LoG and SIFT descriptors), deep features and other contextual information (e.g., segmentation information, human and vehicle bounding boxes, facial and body pose landmarks). Among these, the key-point sequence is one of the most widely used types of feature. Key-point information such as facial landmarks, human body key points, bounding boxes of objects and regions of interest (ROIs) in videos is essential for many applications, e.g., face recognition, activity recognition, abnormal event detection, and ROI-based video transcoding.

Key-point sequences consist of the coordinates of key points in each frame and the corresponding tracking IDs. With the advances in multimedia systems, such semantic data become non-negligible for complex surveillance scenes with a large number of objects. Figure 2 shows that uncompressed skeleton streams still take up a costly portion of typical video streams.
Figure 2. Two typical surveillance video sequences along with uncompressed and compressed skeleton streams. Best viewed in color.

Therefore, there is an urgency to compress these sequences effectively. In this paper, we propose a new framework for lossless compression of key-point sequences in surveillance videos to eliminate their spatial and temporal redundancies. The spatial redundancy is caused by correlations of spatial positions, while the temporal redundancy arises from the significant similarities between the positions of object key points in consecutive frames. The proposed framework, as a proposal for key-point compression, has been accepted by the vision feature coding (VFC) group of AITISA as a coding standard for vision features.

We start with a brief review of feature representations for video analysis, particularly on how key-point information is extracted from videos to generate key-point sequences. Subsequently, we propose a lossless compression framework for key-point sequences with adaptive selection of prediction modes to minimize spatial and temporal redundancies. Finally, we present experimental results to showcase the strengths of the proposed framework on various key-point sequences.

FEATURE REPRESENTATION IN EVENT ANALYSIS
In this section, we discuss several widely used feature representations for event analysis.
Digital Video
As a prevailing representation of video signals, digital videos consist of multiple frames of pixels with three color components. Digital video content can be shown on mobile devices, desktop computers and televisions. Compression of digital videos has been well addressed in various studies. The High Efficiency Video Coding (HEVC) standard improves upon conventional hybrid frameworks such as MPEG-2 and H.264/AVC to yield quasi-equivalent visual quality at significantly reduced bit-rates, e.g., a 50% bitrate saving in comparison to H.264/AVC. The ongoing Versatile Video Coding (VVC) standard is expected to further improve upon HEVC.
Feature Map
Generally, feature maps (in the form of 4-dimensional tensors) are the output of applying filters to the preceding layer in neural networks. In recent years, deep convolutional neural networks (CNNs) have been utilized to extract deep features for video analysis. These features can be transmitted and deployed to accomplish analysis on the server side. Recently, there has been increasing interest in the compression of deep feature maps. For example, [2] employed HEVC to compress quantized 8-bit feature maps.
3D Point Cloud
3D point clouds are a popular means of directly representing 3D objects and scenes in applications such as VR/AR, autonomous driving and intelligent transportation systems. They are composed of a set of 3D coordinates and attributes (e.g., colors and normal vectors) for data points in space. However, communication of point clouds is challenging due to their huge volume of data, which necessitates effective compression techniques. As such, MPEG is finalizing the standard for point cloud compression, which includes both lossless and lossy compression methods. Typically, the coordinates and attributes of point clouds are compressed separately. Coordinates are decomposed into structures such as octrees [3] for quantization and encoding. When preprocessed with k-dimensional (k-d) tree and level-of-detail (LoD) descriptions, attributes are compressed with a similar encoding process (prediction, transform, quantization, and entropy coding) as traditional image and video coding.
Key-Point Sequence
Various key-point sequences have been considered to improve video representation for urban video analysis. However, the costs for transmission and processing are significant, as there exist no efficient compression algorithms for key-point sequences.
2D Bounding Box Sequence.
A 2D bounding box sequence is a sequence of 2D boxes over time for an object, as shown in Figure 3a. A 2D box can be represented by two (diagonal or anti-diagonal) key points. Multiple sequences of 2D boxes can be combined to depict the motion variations and interactions between objects in a scene. As such, these sequences are suitable for human counting, intelligent transportation and autonomous driving.

2D bounding box sequences can be obtained based on object detection [4], [5] and tracking [6]. 2D object detection methods can be classified into anchor-free [5] and anchor-based [4] methods. Object tracking [6] can be viewed as bounding box matching, as it is commonly realized based on a tracking-by-detection strategy. Furthermore, the MOT Challenge [7] provides a standard benchmark for multiple-object tracking to facilitate the detection and tracking of dense crowds in videos.

Figure 3. Examples of different key points: (a) 2D bounding boxes; (b) 3D bounding boxes; (c) facial landmarks; (d) skeletons. Best viewed in color.
3D Bounding Box Sequence.
Similar to the 2D case, a 3D bounding box sequence is a sequence of 3D boxes of an object over time. Compared with 2D, 3D bounding boxes offer the size and position of objects in real-world coordinates to perceive their poses and reveal occlusions. A 3D bounding box, shown in Figure 3b, consists of eight points and can be represented by five parameters. Since an autonomous vehicle requires an accurate perception of its surrounding environment, 3D box sequences are fundamental to autonomous driving systems. A 3D bounding box sequence can be obtained by 3D object detection and tracking methods. 3D object detection can be realized with monocular image, point cloud and fusion-based methods [8], [9]. Monocular image based methods mainly utilize single RGB images to predict 3D bounding boxes, which limits the detection accuracy. Fusion-based methods fuse front-view images and point clouds for robust detection. Tracking with 3D bounding boxes [10] is similar to 2D object tracking, except that the modeling of object attributes (i.e., motion and appearance) is performed in 3D space. However, uncompressed 3D bounding box sequences are impractical to transmit.
Skeleton Sequence of Human Bodies.
Skeleton sequences can address various problems, including action recognition, person re-identification, human counting, abnormal event detection, and surveillance analysis. In general, a skeleton sequence of a human body consists of 15 body joints (shown in Figure 3d), which provides camera-view-invariant and rich information about human kinematics. Skeleton sequences of human bodies can be obtained by pose estimation and tracking. OpenPose [11] is the first multi-person 2D pose estimation approach that achieves both high accuracy and real-time performance. AlphaPose [12] further presents an improved online pose tracker.
PoseTrack [13] is proposed as a large-scale benchmark for video-based human pose estimation and tracking, where data-driven approaches have been developed to benefit skeleton-based video analysis.
Facial Landmark Sequence.
A facial landmark sequence, which consists of the facial key points of a human face in a video, is widely used in video-based facial behavior analysis. Figure 3c provides an example of facial landmarks, where 68 key points are annotated for each human face. The dynamic motions in facial landmark sequences can produce accurate temporal representations of faces. Studies in facial landmark detection range from traditional generative models [14] to deep neural network based methods [15]. In addition, facial landmark tracking has been well studied under both constrained and unconstrained conditions [16], [17].
Figure 4. Example of an arbitrary form of key-point sequence in a frame and the corresponding incidence matrix. Vertices (key points) are annotated with numbers 1-7 and edges are annotated with letters a-f. Best viewed in color.
REPRESENTATION OF KEY-POINT SEQUENCES
Descriptor
To encode key-point sequences, we propose to represent key-point information in videos with four components: key-point coordinates, an incidence matrix of key points, a tracking ID and a visibility indicator.
Key Point Coordinate.
The key points of each object are expressed as a set of $N$ coordinates of $D$ dimensions (e.g., 2D and 3D):

$$K = \{k_1, k_2, \ldots, k_N\}, \quad (1)$$

where $k_i = (p_{i1}, p_{i2}, \ldots, p_{iD})$ with $p_{ij}$ the coordinate in the $j$-th dimension for the $i$-th point.

Incidence Matrix.
The encoded points can be used as references to predict the coordinates of the current point. To efficiently reduce redundancies, an incidence matrix is introduced to define the references of key points. Thus, the key points of an object can be viewed as the vertices of a directed graph. An edge directed from one point to another indicates that the former can serve as a reference point for the latter. Given a key point, one of its adjacent vertices indicated by the incidence matrix is selected for prediction and compression. This suggests that efficient prediction and compression can be achieved by selecting adjacent vertices with higher correlations as references.

Tracking ID.
Each object is assigned a tracking ID when it first appears in the video sequence. Note that the tracking ID of the same object does not change within the sequence, and new objects are assigned new tracking IDs in increasing arithmetic order.
Visibility Indicator.
Occlusions tend to appear in dense scenarios, commonly due to overlapping movements of different objects and movements into and out of the camera view. Similar to the PoseTrack [13] annotations, we introduce a one-bit flag for each key point to indicate whether it is occluded:

$$V = \{v_1, v_2, \ldots, v_N\}, \quad v_i \in \{0, 1\}. \quad (2)$$
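To make the descriptor concrete, the following is a minimal Python sketch of the four components for a single tracked object. The class and field names (e.g., KeyPointObject) are our own illustration and not part of the proposed bitstream syntax; the incidence matrix is stored as an edge list because it is pre-defined per key-point type and never transmitted.

```python
from dataclasses import dataclass
from typing import List, Tuple

Coordinate = Tuple[float, ...]  # one D-dimensional key point (2D or 3D)

@dataclass
class KeyPointObject:
    """Hypothetical container for one tracked object in one frame."""
    tracking_id: int               # fixed over the object's lifetime
    coordinates: List[Coordinate]  # K = {k_1, ..., k_N}, Equation (1)
    visibility: List[int]          # V = {v_1, ..., v_N}, v_i in {0, 1}, Eq. (2)
    # Directed reference edges of the incidence matrix, as (i, j) pairs used
    # to form the residuals r_{i,j} = k_i - k_j of Equation (3). Pre-defined
    # per key-point type (face, skeleton, box) at encoder and decoder.
    incidence_edges: List[Tuple[int, int]]

# Example: a 2D bounding box described by two diagonal key points.
box = KeyPointObject(
    tracking_id=1,
    coordinates=[(12.0, 34.0), (56.0, 78.0)],
    visibility=[1, 1],
    incidence_edges=[(1, 0)],  # point 0 acts as the reference of point 1
)
```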
LOSSLESS COMPRESSION FOR KEY-POINT SEQUENCES

Framework
Figure 5 illustrates the proposed framework for lossless key-point sequence compression based on the key-point sequence descriptor. Here, we only consider encoding the key-point coordinates, tracking IDs and visibility indicators, as pre-defined incidence matrices are provided at both the encoder and decoder for specific key-point sequences, e.g., facial key points, bounding boxes and skeleton key joints. Similar to H.264/AVC and HEVC, we adopt exponential-Golomb coding to encode the prediction residuals.

In this section, four different prediction modes with adaptive mode selection are developed for key-point coordinates, as they consume the bulk of the encoded bitstream.
Figure 5. The proposed framework for lossless key-point sequence compression. Best viewed in color.

Code computation varies for the different encoding modes. For the independent encoding mode, each frame is separately encoded and decoded without reference frames. Given references, a predict-and-encode strategy is developed to realize the encoding based on the temporal, spatial-temporal and trajectory prediction modes. Residuals between the original data and their predictions are calculated as the codes to be encoded. The prediction residuals are then fed into the entropy encoder to generate the bitstream. It is worth mentioning that the prediction mode for a key point can be adaptively predicted using its spatial and temporal neighbors. Furthermore, the predict-and-encode strategy leverages an adaptive prediction method to combine different prediction modes for key-point sequences with various structures and semantic information. The tracking ID and visibility indicator are also encoded, with the auxiliary information encoding module, for communication.
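The paper adopts exponential-Golomb coding for the prediction residuals, as in H.264/AVC and HEVC, but does not spell out the signed-value mapping. The sketch below is a minimal order-0 exponential-Golomb coder combined with the common zigzag mapping, which we assume here purely for illustration.

```python
def zigzag(r: int) -> int:
    # Map signed residuals to unsigned symbols: 0, -1, 1, -2, 2 -> 0, 1, 2, 3, 4.
    # The exact mapping is our assumption; the paper only names exp-Golomb coding.
    return 2 * r if r >= 0 else -2 * r - 1

def exp_golomb(value: int) -> str:
    # Order-0 exponential-Golomb: a unary prefix of (bit_length - 1) zeros,
    # followed by (value + 1) in binary.
    x = value + 1
    return "0" * (x.bit_length() - 1) + format(x, "b")

def encode_residual(r: int) -> str:
    return exp_golomb(zigzag(r))

# Small residuals yield short codewords, which is why accurate prediction pays:
for r in (0, -1, 1, 5, -17):
    print(r, encode_residual(r))  # 0 -> '1', -1 -> '010', 1 -> '011', ...
```

The unary prefix makes the codeword lengths self-delimiting, so the decoder needs no side information to parse the residual stream.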
Independent Encoding
For independent encoding, the key points of a single object are encoded by exploiting the spatial correlations, without introducing reference frames. We first encode the absolute coordinates of the key point $k_s$ with zero in-degree. Subsequently, the differences of coordinates between adjacent vertices defined by the incidence matrix (i.e., along the edges) are encoded. The residual of independent encoding $r^{IE}_{i,j}$ between the $i$-th and $j$-th vertices is computed by

$$r^{IE}_{i,j} = k_i - k_j. \quad (3)$$
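A minimal sketch of the independent mode, assuming integer coordinates and an edge list of (i, j) pairs in which vertex j acts as the reference of vertex i, matching Equation (3); the decoder replays the same traversal and adds the residuals back.

```python
from typing import Dict, List, Tuple

Point = Tuple[int, int]
Edge = Tuple[int, int]

def independent_encode(points: List[Point], edges: List[Edge],
                       s: int) -> Tuple[Point, Dict[Edge, Point]]:
    """Encode k_s (the zero in-degree point) absolutely, then the per-edge
    residuals r_{i,j} = k_i - k_j of Equation (3)."""
    residuals = {(i, j): (points[i][0] - points[j][0],
                          points[i][1] - points[j][1])
                 for i, j in edges}
    return points[s], residuals

# A three-point chain: point 0 absolute, points 1 and 2 as deltas along edges.
root, res = independent_encode([(100, 200), (104, 198), (110, 195)],
                               edges=[(1, 0), (2, 1)], s=0)
print(root, res)  # (100, 200) {(1, 0): (4, -2), (2, 1): (6, -3)}
```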
Reference-Based Prediction Modes

Besides independent encoding, three additional prediction modes are developed to minimize the residuals with temporal references.
Temporal Prediction.
For each object, the correlations between consecutive frames are characterized by its movements, including the translation of the main body and twists of some parts. As shown in Figure 6a, we first obtain a co-located prediction (yellow points in the current frame) of each point from the reference frame by motion compensation with the motion vector of the central key point (yellow vector). Consequently, the temporal prediction can be expressed as

$$p^t_i = k^{t-1}_i + MV_c, \quad (4)$$

where $MV_c = k^t_c - k^{t-1}_c$ and $k_c$ is the key point with maximum out-degree in the incidence matrix. The residuals of temporal prediction $r^{T,t}_i$ (red dashed vectors) are computed for transmission and reconstruction in a lossless fashion:

$$r^{T,t}_i = k^t_i - p^t_i. \quad (5)$$
Figure 6. (a) Illustration of spatial and spatial-temporal prediction modes; (b) illustration of the best mode estimation with already decoded spatial and temporal references. Best viewed in color.
However, temporal prediction can be affected by possible twists, i.e., the gap between the co-located (yellow) and original (blue) points in the current frame.
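As a sketch, Equations (4)-(5) translate directly into code; we assume the central point is coded first so that both encoder and decoder can form MV_c, an ordering the paper implies but does not state explicitly.

```python
from typing import List, Tuple

Point = Tuple[int, int]

def temporal_residuals(curr: List[Point], prev: List[Point],
                       c: int) -> List[Point]:
    """r^{T,t}_i = k^t_i - (k^{t-1}_i + MV_c), Equations (4)-(5).

    c indexes the central key point (maximum out-degree in the incidence
    matrix); MV_c = k^t_c - k^{t-1}_c is assumed to be available to the
    decoder before the other points are reconstructed.
    """
    mv = (curr[c][0] - prev[c][0], curr[c][1] - prev[c][1])
    return [(k[0] - p[0] - mv[0], k[1] - p[1] - mv[1])
            for k, p in zip(curr, prev)]

# A pure translation leaves all-zero residuals; only the twists cost bits.
prev = [(10, 10), (20, 10), (15, 25)]
curr = [(13, 12), (23, 12), (18, 28)]  # shifted by (3, 2), last point twisted
print(temporal_residuals(curr, prev, c=0))  # [(0, 0), (0, 0), (0, 1)]
```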
Spatial-temporal Prediction.
The spatial-temporal correlations between key points can be utilized to improve the accuracy of prediction and further reduce the redundancy. Since adjacent points in the incidence matrix are highly correlated in the spatial domain, their movements are likely in the same direction and possibly even of the same distance. Thus, the redundancy can be further reduced by encoding the residual of the prediction $p_i$ relative to that of its reference point $p_{r(i)}$, as their temporal prediction residuals are very close. For example, as shown in Figure 6a, the spatial-temporal prediction of the 5th point is obtained with the encoded residual of the 6th point (red vector) and the co-located temporal prediction (5th yellow point). In this case, the to-be-transmitted spatial-temporal residual of the 5th point (maroon vector, $r^{ST}$) is smaller than the temporal residual $r^T$. Formally, $r^{ST,t}_i$ can be computed by

$$r^{ST,t}_i = r^{T,t}_i - r^{T,t}_{r(i)}, \quad (6)$$

where $r(i)$ is the index of the $i$-th point's reference. Equation (6) is equivalent to predicting $k^t_i$ using $MV_c$ and the encoded residual $r^{T,t}_{r(i)}$ of the reference point.
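Under this reading of Equation (6), the spatial-temporal mode simply re-predicts each temporal residual from the residual of its reference point. The sketch below assumes a per-point reference index, with a negative index (our own convention) marking points that keep their plain temporal residual.

```python
from typing import List, Tuple

Point = Tuple[int, int]

def spatial_temporal_residuals(r_temporal: List[Point],
                               ref: List[int]) -> List[Point]:
    """r^{ST,t}_i = r^{T,t}_i - r^{T,t}_{r(i)}, Equation (6)."""
    out = []
    for i, r in enumerate(r_temporal):
        if ref[i] < 0:
            out.append(r)  # no spatial reference, e.g. the central point
        else:
            q = r_temporal[ref[i]]
            out.append((r[0] - q[0], r[1] - q[1]))
    return out

# Adjacent points that twist together produce near-zero ST residuals.
r_t = [(0, 0), (5, 4), (5, 3)]  # temporal residuals from Equation (5)
print(spatial_temporal_residuals(r_t, ref=[-1, 0, 1]))
# [(0, 0), (5, 4), (0, -1)]
```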
Trajectory Prediction.

The above two modes utilize the motion vector of the central point to accomplish temporal prediction. However, the motions of different parts of an object are complex, as they vary in direction and distance. Thus, the required bits for coding can be further reduced with more accurate prediction. For example, when we assume the motion of an object is uniform over a short time (e.g., three frames), the motion from the $(t-1)$-th frame to the $t$-th frame can be approximated with that from the $(t-2)$-th frame to the $(t-1)$-th frame. The predicted value is

$$p^t_i = k^{t-1}_i + (k^{t-1}_i - k^{t-2}_i). \quad (7)$$

The residual between the predicted value and the actual value is computed and transmitted.

The accuracy of trajectory prediction can be improved by incorporating more features at the cost of higher complexity [18]. In this paper, we propose a simple and efficient linear prediction based on the previous two frames.
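A sketch of the linear model of Equation (7); it needs two already decoded past frames of the same object, which is exactly why the mode is disabled for points lacking a (t-2) observation.

```python
from typing import List, Tuple

Point = Tuple[int, int]

def trajectory_residuals(curr: List[Point], prev1: List[Point],
                         prev2: List[Point]) -> List[Point]:
    """r_i = k^t_i - p^t_i with p^t_i = k^{t-1}_i + (k^{t-1}_i - k^{t-2}_i)."""
    return [(k[0] - (2 * p1[0] - p2[0]), k[1] - (2 * p1[1] - p2[1]))
            for k, p1, p2 in zip(curr, prev1, prev2)]

# Uniform motion is predicted exactly; acceleration leaves a small residual.
prev2, prev1, curr = [(0, 0)], [(4, 2)], [(8, 5)]
print(trajectory_residuals(curr, prev1, prev2))  # [(0, 1)]
```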
Adaptive Mode Selection

The independent encoding mode is adopted when key points are in the first frame or appear for the first time in a sequence. When temporal references are available, adaptive mode selection is performed over the candidate temporal, spatial-temporal and trajectory prediction modes. The prediction mode $m^\star$ is estimated from the encoded spatial and temporal reference points with weighted voting:

$$m^\star = \arg\min_m \sum_{n \in N_t} w^t_n \times b^m_n + \sum_{n \in N_s} w^s_n \times b^m_n, \quad (8)$$

where $N_t$ and $N_s$ are the sets of temporal and spatial reference points, $w^t_n$ and $w^s_n$ are the weights of the corresponding point $n$ in $N_t$ and $N_s$, $m$ ranges over the candidate temporal, spatial-temporal and trajectory prediction modes, and $b^m_n$ is the bit-length of point $n$ encoded with mode $m$. As depicted in Figure 6b, the prediction mode of the 4th key point in the $t$-th frame is estimated from the reconstructed 4th key points in the $(t-1)$-th and $(t-2)$-th frames, along with the encoded 1st key point in the $t$-th frame, with weights $w_1$, $w_2$ and $w_3$, respectively.

Equation (8) indicates that $m^\star$ is determined to minimize the weighted average bit-length of the spatial and temporal reference points encoded with each candidate mode. The weights are hyper-parameters that commonly decrease with the growth of the distance between the current point and its neighbors. Note that trajectory prediction is not always available; for example, an object or point that exists in the $t$-th and $(t-1)$-th frames may not appear in the $(t-2)$-th frame. The encoder and decoder determine symmetrically whether trajectory prediction is adopted, so we exclude it from the candidate modes when it is unavailable.
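A sketch of the weighted vote of Equation (8). Here bits[m][n] stands for the bit-length b^m_n that reference point n would cost under mode m, and the (point, weight) pair layout is our own simplification; unavailable modes are simply omitted from the dictionary.

```python
from typing import Dict, List, Tuple

def select_mode(bits: Dict[str, Dict[int, int]],
                temporal_refs: List[Tuple[int, float]],
                spatial_refs: List[Tuple[int, float]]) -> str:
    """m* = argmin_m sum_{n in N_t} w^t_n b^m_n + sum_{n in N_s} w^s_n b^m_n."""
    def cost(m: str) -> float:
        return (sum(w * bits[m][n] for n, w in temporal_refs) +
                sum(w * bits[m][n] for n, w in spatial_refs))
    return min(bits, key=cost)

# Figure 6b style: two temporal references (previous frames) and one spatial
# reference (an already encoded point in the current frame).
bits = {"temporal":         {40: 6, 41: 7, 10: 5},
        "spatial-temporal": {40: 4, 41: 5, 10: 6},
        "trajectory":       {40: 8, 41: 6, 10: 7}}
print(select_mode(bits, temporal_refs=[(40, 0.5), (41, 0.3)],
                  spatial_refs=[(10, 0.2)]))  # spatial-temporal
```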
Auxiliary Information Encoding

In addition to the key points, the tracking ID and visibility indicator are encoded as auxiliary information. Note that they consume only minimal bit-rates in the output bitstream.
Tracking ID.
A tracking ID is assigned in arithmetic order to each object when it first appears in the video. For each frame, we sort the objects in ascending order of tracking ID and encode the differences between neighboring tracking IDs.
Visibility Indicator.
Since the visibility indicator changes slowly across consecutive frames, one bit is used per object to represent whether it changes. If it does change, the difference is encoded and transmitted.
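A sketch of both auxiliary streams. The paper fixes the ideas (delta-coded sorted IDs; a one-bit change flag) but not the exact layout, so the XOR difference used for changed indicators below is an assumption.

```python
from typing import List, Optional, Tuple

def encode_tracking_ids(ids: List[int]) -> List[int]:
    """Sort ascending, then keep the first ID plus neighbor differences."""
    ids = sorted(ids)
    return [ids[0]] + [b - a for a, b in zip(ids, ids[1:])]

def encode_visibility(curr: List[int],
                      prev: List[int]) -> Tuple[int, Optional[List[int]]]:
    """One change bit per object; the per-point difference (here an XOR,
    our assumption) is sent only when the indicator actually changed."""
    if curr == prev:
        return 0, None  # a single bit covers the whole object
    return 1, [c ^ p for c, p in zip(curr, prev)]

print(encode_tracking_ids([7, 3, 4]))            # [3, 1, 3]
print(encode_visibility([1, 1, 0], [1, 1, 0]))   # (0, None)
print(encode_visibility([1, 0, 0], [1, 1, 0]))   # (1, [0, 1, 0])
```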
EXPERIMENTS
Evaluation Framework
To demonstrate the robustness of the proposed lossless compression method for key-point sequences, we evaluate four types of key points: 2D bounding boxes, human skeletons, 3D bounding boxes and facial landmarks.
2D Bounding Box Dataset.
The MOT17 dataset [7] consists of 14 different sequences (7 training, 7 test sequences). Here, we evaluate the training sequences, which have ground-truths. We also adopt another important dataset for 2D bounding boxes, the crowd-event BBX dataset [19], which we have constructed. This dataset includes annotated 2D bounding boxes (and corresponding tracking information) in crowded scenes.
Human Skeleton Dataset.
Two datasets are used for human skeleton compression: (1) PoseTrack; (2) our crowd-event skeleton dataset [19]. For human pose estimation and tracking, PoseTrack [13] is one of the most widely used datasets, with over 1,356 video sequences. Five challenging sequences that contain 7-12 skeletons are chosen as test sequences in this paper. In our own collected crowd-event skeleton dataset, each skeleton is labeled with 15 key joints (e.g., eyes, nose, neck), as shown in Figure 3d. Compared with the PoseTrack sequences, our crowd-event skeleton dataset contains a larger number of smaller skeletons in crowded scenes.
Table 1. Average bits for encoding one point and compression ratio for different encoding methods.

Sequence             | Fixed bit-length coding | Independent encoding | Temporal prediction | Spatial-temporal prediction | Trajectory prediction | Multimodal coding
MOT17                | 37.41                   | 36.65 (97.97%)       | 14.77 (39.49%)      | 14.77 (39.49%)              |                       |
Crowd-event skeleton | 33.79                   | 14.40 (42.62%)       | 3.17 (9.37%)        | 3.01 (8.92%)                | 4.06 (12.02%)         |
nuScenes             | 50.29                   | 35.48 (70.55%)       | 28.25 (56.18%)      | 27.92 (55.53%)              | 30.78 (61.22%)        |
Facial landmarks     | 33.11                   | 10.20 (30.80%)       | 9.37 (28.31%)       | 9.33 (28.18%)               | 9.49 (28.67%)         |

nuScenes Dataset.
The nuScenes dataset [20] is a large-scale public dataset for autonomous driving. It contains 1,000 driving scenes (a 20-second clip is selected for each scene), and accurate 3D bounding boxes sampled at 2 Hz over the entire dataset are annotated.
Facial Landmark Dataset.
We collect three video sequences and label the landmark sequences, as existing facial landmark datasets rarely contain tracking information. The sequences contain 11 to 34 visible human faces, each having about 100 frames on average.

The compression performance is evaluated in terms of (1) average bits for encoding one point (i.e., the ratio between the total bits required for encoding and the number of encoded key points) and (2) compression ratio (i.e., the ratio between the data amounts before and after compression). In this paper, the size of uncompressed data is calculated by encoding each coordinate of each key point with a 16-bit universal code, e.g., 32 bits for the 2D coordinates and 48 bits for the 3D coordinates of each key point. In Tables 1 and 2, the average bits for fixed bit-length coding are obtained by summing the bit-lengths assigned for coordinates and those required for encoding auxiliary information like tracking IDs and visibility indicators.
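The two metrics reduce to a few lines; the 16-bit-per-coordinate baseline follows the protocol above, while the helper names and the sample numbers are ours.

```python
def fixed_length_bits(num_points: int, dims: int,
                      bits_per_coord: int = 16) -> int:
    # Uncompressed baseline: 16 bits per coordinate, i.e. 32 bits for a 2D
    # key point and 48 bits for a 3D key point.
    return num_points * dims * bits_per_coord

def average_bits_per_point(total_bits: int, num_points: int) -> float:
    return total_bits / num_points

def compression_ratio(coded_bits: int, raw_bits: int) -> float:
    # Reported in Tables 1-2 as a percentage of the fixed-length baseline.
    return 100.0 * coded_bits / raw_bits

raw = fixed_length_bits(num_points=1000, dims=2)  # 32,000 bits
coded = 12_000                                    # hypothetical bitstream size
print(average_bits_per_point(coded, 1000))        # 12.0 bits per point
print(f"{compression_ratio(coded, raw):.2f}%")    # 37.50%
```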
Results
Table 1 reports the performance of the different prediction modes. The independent encoding mode is suitable for objects with dense key points (e.g., facial landmark sequences) by exploiting spatial correlations. However, it is inferior to the prediction modes based on temporal references.

The spatial-temporal prediction mode is competitive with or slightly better than the temporal prediction mode, due to the obvious correlations between spatially adjacent points. The largest performance gap is achieved on PoseTrack, as the sports scenes in PoseTrack are regular and predictable. The trajectory prediction mode outperforms the other modes on sequences with simple, predictable motions. Consequently, the multimodal coding method is developed to combine the different prediction modes and improve compression performance for complex scenes.

The multimodal coding method yields the best average performance on most sequences, which validates the advantages of the proposed scheme. For 2D bounding box sequences, the multimodal coding method is equivalent or slightly inferior to the single prediction mode based on temporal references. This implies that the multimodal coding method is more suited to sequences with complex and unpredictable motions, while a single reference-based prediction mode favors key-point sequences with simple and predictable motions, e.g., 2D bounding box sequences.

We further down-sample the video sequences for evaluations under various motion search ranges: a number of frames are skipped after each frame during encoding. To validate the effectiveness of our approach in real-world applications, we also conduct experiments on data estimated by existing algorithms and on noisy data obtained by adding zero-mean Gaussian noise, where a lot of missing and off-target key points exist.
Table 2. Average bits for encoding one point and compression ratio for different encoding methods under different frame-skip scenarios, Gaussian noise levels (standard deviation) and data sources (ground truth or estimated).
Two benchmark datasets (MOT17 and PoseTrack) are evaluated. Table 2 shows that compression performance drops as the frame-skipping range increases. More importantly, under the different settings, the multimodal coding method still achieves the best performance on all skeleton sequences, which demonstrates the robustness of our proposed scheme.
CONCLUSION AND OUTLOOK
In this paper, we highlight the problem of lossless compression of features and show its importance in modern urban computing applications. Importantly, we introduce a lossless key-point sequence compression approach in which both reference-free and reference-based modes are presented. Furthermore, an adaptive mode selection scheme is proposed to deal with a variety of scenarios, i.e., different camera scenes, key-point structures and degrees of motion. Looking forward, we expect that key-point sequence compression methods will play an important role in the transmission and storage of key-point data in urban computing and intelligent analysis.
REFERENCES
1. G. J. Sullivan et al., "Overview of the high efficiency video coding (HEVC) standard," IEEE Trans. Circuits Syst. Video Technol., vol. 22, no. 12, pp. 1649–1668, 2012.
2. H. Choi et al., "Deep feature compression for collaborative object detection," in IEEE Int. Conf. Image Process. (ICIP), 2018, pp. 3743–3747.
3. J. Elseberg et al., "One billion points in the cloud: an octree for efficient processing of 3D laser scans," ISPRS J. Photogrammetry Remote Sens., vol. 76, pp. 76–88, 2013.
4. S. Ren et al., "Faster R-CNN: Towards real-time object detection with region proposal networks," in Adv. Neural Inf. Process. Syst. 28, 2015, pp. 91–99.
5. H. Law et al., "CornerNet: Detecting objects as paired keypoints," in Eur. Conf. Comput. Vis., 2018, pp. 734–750.
6. A. Bewley et al., "Simple online and realtime tracking," in IEEE Int. Conf. Image Process. (ICIP), 2016, pp. 3464–3468.
7. A. Milan et al., "MOT16: A benchmark for multi-object tracking," arXiv preprint arXiv:1603.00831, 2016.
8. X. Chen et al., "Monocular 3D object detection for autonomous driving," in IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2016, pp. 2147–2156.
9. X. Chen et al., "Multi-view 3D object detection network for autonomous driving," in IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2017, pp. 1907–1915.
10. D. Frossard et al., "End-to-end learning of multi-sensor 3D tracking by detection," in IEEE Int. Conf. Robot. Autom. (ICRA), 2018, pp. 635–642.
11. Z. Cao et al., "Realtime multi-person 2D pose estimation using part affinity fields," in IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2017, pp. 7291–7299.
12. H.-S. Fang et al., "RMPE: Regional multi-person pose estimation," in IEEE Int. Conf. Comput. Vis. (ICCV), 2017, pp. 2353–2362.
13. M. Andriluka et al., "PoseTrack: A benchmark for human pose estimation and tracking," in IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2018, pp. 5167–5176.
14. T. F. Cootes et al., "Active appearance models," IEEE Trans. Pattern Anal. Mach. Intell., vol. 23, no. 6, pp. 681–685, 2001.
15. Y. Sun et al., "Deep convolutional network cascade for facial point detection," in IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2013, pp. 3476–3483.
16. J. Yang et al., "Facial shape tracking via spatio-temporal cascade shape regression," in IEEE Int. Conf. Comput. Vis. Workshops (ICCVW), 2015, pp. 41–49.
17. A. Yao et al., "Efficient facial landmark tracking using online shape regression method," U.S. Patent 9,361,510, Jun. 7, 2016.
18. R. Q. Mínguez et al., "Pedestrian path, pose, and intention prediction through Gaussian process dynamical models and pedestrian activity recognition," IEEE Trans. Intell. Transp. Syst., vol. 20, no. 5, pp. 1803–1814, 2018.
19. W. Lin et al., "Challenge on large-scale human-centric video analysis in complex events," 2020. [Online]. Available: http://humaninevents.org/
20. H. Caesar et al., "nuScenes: A multimodal dataset for autonomous driving," arXiv preprint arXiv:1903.11027, 2019.
Weiyao Lin is currently a Full Professor with the Department of Electronic Engineering, Shanghai Jiao Tong University, Shanghai, China. He received the Ph.D. degree from the University of Washington, Seattle, USA, in 2010. He has served as an associate editor for a number of journals, including TIP, TCSVT, and TITS. His research interests include urban computing and multimedia processing. Contact him at [email protected].

Xiaoyi He focuses his current research interests on large-scale video compression and semantic information coding. He received the B.S. degree in Electronic Engineering from Shanghai Jiao Tong University (SJTU), Shanghai, China, in 2017. He is currently working toward the M.S. degree at SJTU. Contact him at [email protected].

Wenrui Dai is currently an Associate Professor with the Department of Computer Science and Engineering, Shanghai Jiao Tong University (SJTU), Shanghai, China. He received the Ph.D. degree from SJTU in 2014. His research interests include learning-based image/video coding, image/signal processing and predictive modeling. Contact him at [email protected].

John See is a Senior Lecturer with the Faculty of Computing and Informatics at Multimedia University, Malaysia. He is currently the Chair of the Centre for Visual Computing (CVC) and leads the Visual Processing (ViPr) Lab. Since 2018, he has also been a Visiting Research Fellow at Shanghai Jiao Tong University (SJTU). Contact him at [email protected].

Tushar Shinde focuses his current research interests on multimedia processing and predictive coding. He received the M.S. degree in Electrical Engineering from the Indian Institute of Technology, Jodhpur (IITJ), India. He is currently working toward the Ph.D. degree at IITJ. Contact him at [email protected].

Hongkai Xiong is a Distinguished Professor in both the Department of Electronic Engineering and the Department of Computer Science and Engineering, Shanghai Jiao Tong University (SJTU). Currently, he is the Vice Dean of Zhiyuan College at SJTU. He received the Ph.D. degree from SJTU in 2003. His research interests include multimedia signal processing and coding. Contact him at [email protected].

Lingyu Duan is currently a Full Professor with the National Engineering Laboratory of Video Technology, School of Electronics Engineering and Computer Science, Peking University (PKU), Beijing, China. He has been the Associate Director of the Rapid-Rich Object Search Laboratory, a joint laboratory between Nanyang Technological University, Singapore, and PKU, since 2012. Contact him at [email protected].