A multimodal lossless coding method for skeletons in videos

Xiaoyi He*, Mingzhou Liu*, Weiyao Lin†, Xintong Han, Yanmin Zhu, Hongtao Lu, Hongkai Xiong

Department of Electronic Engineering, Shanghai Jiao Tong University, China; Malong Technologies; Department of Computer Science and Engineering, Shanghai Jiao Tong University, China
ABSTRACT
Skeleton information in videos plays an important role in human-centric video analysis, but effectively coding such massive skeleton information has never been addressed in previous work. In this paper, we make the first attempt to solve this problem by proposing a multimodal skeleton coding tool containing three different coding schemes, namely a spatial differential-coding scheme, a motion-vector-based differential-coding scheme, and an inter prediction scheme, thus utilizing both spatial and temporal redundancy to losslessly compress skeleton data. More importantly, these schemes are switched properly for different types of skeletons in video frames, hence further improving the compression rate. Experimental results show that our approach leads to 74.4% and 54.7% size reductions on our surveillance sequences and the overall test sequences respectively, which demonstrates the effectiveness of our skeleton coding tool.
Index Terms — feature coding, skeleton coding
1. INTRODUCTION AND RELATED WORK
Skeleton information in videos has become increasingly important in many applications such as event detection and video recognition. For example, previous works have shown how action recognition can benefit from skeleton-based video modeling [1, 2, 3, 4]. A person's pose is described by multiple skeleton key joints, and the skeleton information in videos represents the dynamic characteristics of body postures, which makes skeleton information widely used in human action recognition and other video analysis tasks.

Since video analysis is directly performed on extracted features, shifting feature extraction into a camera-integrated module can reduce the load on the analysis server and is highly desirable. Therefore, feature coding methods that aim to compress and transmit different kinds of extracted video features have been proposed recently. Duan et al. [5] describe compact descriptors for video analysis, where handcrafted and deep features are compressed and transmitted in a standardized bitstream. Chen et al. [6] introduce a Region-of-Interest (ROI) location coding

* Equal contribution. † Corresponding author: [email protected]
tool where the ROI location information itself is coded in the video bitstream.

Fig. 1. (a) An example of skeletons. (b) Skeleton information in one video frame. (c) Overview of the skeleton compression algorithm.

Recently, reliable human skeletons can be obtained from depth sensors using real-time skeleton estimation algorithms. However, transmitting these skeletons directly back to the analysis server is too expensive. In this paper, we argue that skeleton information in videos plays an important role in video analysis, yet existing approaches have overlooked coding this massive skeleton information. Therefore, it is necessary to develop new algorithms to encode skeleton data efficiently. To the best of our knowledge, this paper is the first to study coding skeleton information into a bitstream.

In our case, skeletons in many video frames need to be compressed and transmitted. We represent a human skeleton by fourteen key joints, as shown in Fig. 1a. For example, the 1st joint is located at the nose and the 14th represents the right ankle. Our task is to encode and transmit the size and location of each key point of these skeletons to the decoder.
Fig. 2. The framework of our multimodal skeleton coding method.

One straightforward way to do this is to directly transmit the (x, y) coordinates of every key joint. This simple method works well when there are only a few people in the video. However, when the number of skeletons becomes large (for example, the video frame shown in Fig. 1b), the skeleton location data become huge and non-negligible. According to our experiments, the skeleton data take about 42% of the total bits for a video like Fig. 1b with about 35 skeletons per frame. Therefore, new algorithms are required to efficiently compress such massive skeleton data.

To this end, we propose a novel approach that compresses the skeleton information with a lossless skeleton encoder working alongside the video codec; its framework is shown in Fig. 1c. In the encoder, the input video frame is encoded by a video encoder such as H.265. Meanwhile, the skeletons of this video frame are encoded by our skeleton encoding module, which also takes the skeletons of previous frames from the local skeleton decoder as input. These previous skeletons are used as references to reduce the redundancy of the skeletons in the current frame. The resulting skeleton bitstream is then combined with the bitstream of the frame to form the final output bitstream. Since the decoding process can be easily derived from the encoding process, we focus only on skeleton encoding in this paper.

The proposed multimodal skeleton coding tool contains three coding schemes: (1) a spatial differential-coding scheme, (2) a motion-vector-based (MV-based) differential-coding scheme, and (3) an inter prediction scheme, which are switched dynamically to encode different types of skeletons. In summary, our contributions are twofold:

1. This is the first work to study coding skeleton information itself into a bitstream.
A skeleton coding tool is developed in this paper, which achieves skeleton compression in videos with up to a 54.7% compression rate on average.

2. We introduce three different schemes for skeleton coding. Furthermore, a multimodal scheme that integrates these schemes is proposed and achieves more robust skeleton encoding results.

The rest of the paper is organized as follows: Section 2 describes the framework of our skeleton information coding tool. Section 3 details our coding tool and its three sub-schemes. Section 4 presents the experimental settings and results. Section 5 concludes this paper.
2. OVERVIEW OF OUR METHOD
Fig. 2 shows the framework of our multimodal skeleton coding algorithm. Skeletons are routed to one of three coding schemes to achieve a higher lossless compression rate. The spatial differential-coding scheme utilizes spatial redundancy to compress skeleton data, while the MV-based differential-coding scheme and the inter prediction scheme are mainly based on temporal redundancy. Thus, our multimodal skeleton coding tool can efficiently compress complex skeleton trajectories in crowded scenes.
3. THE SKELETON INFORMATION CODING TOOL
In this section, we first define skeletons in video and then describe the three proposed skeleton coding schemes. Finally, the multimodal skeleton coding method is introduced.
As mentioned, the skeleton of a human can be described and coded by fourteen key joints. Accordingly, we define the skeleton information as:

SK_i = { l_i, (x_{i,1}, y_{i,1}), (x_{i,2}, y_{i,2}), ..., (x_{i,14}, y_{i,14}) }    (1)

where l_i is the ID of the i-th human skeleton SK_i in one frame and (x_{i,j}, y_{i,j}) are the horizontal and vertical coordinates of the j-th key joint of SK_i (j ∈ {1, 2, ..., 14}). Note that each person has a unique ID over the whole video, assigned according to the person's first appearance in the video. The index i of a skeleton in one frame is decided according to its label. With these 29 elements, one human skeleton in one video frame is uniquely determined.

The difference between two skeletons is defined as the set of differences between corresponding key joints:

SK_i − SK_k = { (x_{i,j} − x_{k,j}, y_{i,j} − y_{k,j}) | j = 1, 2, ..., 14 }    (2)

Fig. 3. Illustration of the spatial differential-coding scheme (only the red parts are encoded).
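The representation of Eqs. (1)-(2) can be sketched in code as follows. This is a minimal illustration, not the paper's implementation; the class and function names are our own.

```python
# Sketch of the skeleton representation (Eq. 1) and the per-joint
# difference between two skeletons (Eq. 2). Each skeleton carries a
# person ID that is unique over the whole video plus 14 joint
# coordinates, i.e. 29 elements in total.
from dataclasses import dataclass
from typing import List, Tuple

NUM_JOINTS = 14

@dataclass
class Skeleton:
    person_id: int                    # l_i: unique ID over the whole video
    joints: List[Tuple[int, int]]     # [(x_{i,1}, y_{i,1}), ..., (x_{i,14}, y_{i,14})]

def skeleton_diff(a: "Skeleton", b: "Skeleton") -> List[Tuple[int, int]]:
    """SK_a - SK_b: element-wise coordinate differences per joint (Eq. 2)."""
    return [(ax - bx, ay - by) for (ax, ay), (bx, by) in zip(a.joints, b.joints)]
```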
Three coding schemes are introduced in our skeleton information coding tool:
Spatial differential-coding scheme.
Considering the spatial correlation of joints within a skeleton, we developed a spatial differential-coding scheme that utilizes spatial redundancy to compress the skeleton data. As shown in Fig. 3, only the absolute coordinates of the 1st joint and the difference vectors between joints (see the red joint and vectors) of a skeleton are encoded.

The procedure is as follows: for each skeleton in a frame, the coordinates of the 1st joint are encoded first and a set E = {1}, indicating that the 1st joint has been encoded, is initialized. Then, for each encoded joint in E, the difference between it and each of its not-yet-encoded neighbors is encoded, and those neighbors are added to E. This process is repeated until all joints of the skeleton are encoded.
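The traversal above can be sketched as a breadth-first walk over the skeleton tree. The joint adjacency below is an assumed 14-joint topology for illustration only; the paper does not specify the exact neighbor structure.

```python
# Hedged sketch of the spatial differential-coding scheme: encode the
# 1st joint's absolute coordinates, then encode every remaining joint
# as an offset from an already-encoded neighbor, in BFS order.
SKELETON_EDGES = {  # assumed parent -> children topology (0-indexed)
    0: [1], 1: [2, 5, 8, 11], 2: [3], 3: [4], 5: [6], 6: [7],
    8: [9], 9: [10], 11: [12], 12: [13],
}

def encode_spatial(joints):
    """Return [absolute root coords] + [(dx, dy) for the other 13 joints]."""
    out = [joints[0]]            # absolute coordinates of the 1st joint
    queue, encoded = [0], {0}    # E = {1st joint}
    while queue:
        j = queue.pop(0)
        for c in SKELETON_EDGES.get(j, []):
            if c not in encoded:
                out.append((joints[c][0] - joints[j][0],
                            joints[c][1] - joints[j][1]))
                encoded.add(c)
                queue.append(c)
    return out
```

Decoding simply replays the same traversal, accumulating the offsets, so the scheme is lossless.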
MV-based differential-coding scheme.
When many skeletons exist and need to be encoded in a dense crowd scene, a new compression algorithm is needed to handle such a huge amount of skeleton data efficiently. Therefore, we developed an MV-based differential-coding scheme that mainly utilizes the temporal redundancy of skeletons (the same person's skeletons in different frames are highly correlated). As shown in Fig. 4, the skeleton with lighter yellow joints and dashed lines in the t-th frame is co-located with the one in the (t−1)-th frame. A predicted skeleton is then obtained using the motion vector calculated from the 2nd joint (which corresponds to the center of the human body) of the co-located and original skeletons. Finally, the differences between the MV-predicted skeleton and the original one are encoded.

Formally, for a frame at T = t, the (t−1)-th frame is chosen as the reference frame. Then, for each skeleton SK_i^t, the difference between it and its corresponding skeleton SK_k in the reference frame is encoded. More specifically, the motion vector (MV) of the 2nd joint is first calculated:

MV(SK_i, SK_k) = (MV_x, MV_y) = (x_{i,2} − x_{k,2}, y_{i,2} − y_{k,2})    (3)

Then the motion compensation (MC) of the other joints of SK_i is obtained using the MV of the 2nd joint:

MC(SK_i) = { (x_{k,j} + MV_x, y_{k,j} + MV_y) | j = 1, 3, 4, ..., 14 }    (4)

Finally, the encoded parameters are defined as:

EP(SK_i) = SK_i − MC(SK_i)    (5)

Fig. 4. Illustration of the MV-based differential-coding scheme (only the red parts are encoded).

Fig. 5. Illustration of the inter prediction scheme (only the red parts are encoded).
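Eqs. (3)-(5) can be sketched as follows; function names are illustrative, and joint indices are 0-based here (so the 2nd joint is index 1).

```python
# Hedged sketch of the MV-based differential-coding scheme: the motion
# vector of the 2nd joint (index 1) translates the co-located reference
# skeleton from frame t-1, and only the per-joint residuals are kept.
def encode_mv_based(cur, ref):
    """cur, ref: lists of 14 (x, y) joints for the same person ID."""
    mv = (cur[1][0] - ref[1][0], cur[1][1] - ref[1][1])   # Eq. (3)
    residuals = [(cx - (rx + mv[0]), cy - (ry + mv[1]))   # Eqs. (4)-(5)
                 for (cx, cy), (rx, ry) in zip(cur, ref)]
    return mv, residuals   # the 2nd joint's residual is zero by construction

def decode_mv_based(ref, mv, residuals):
    """Lossless inverse: motion-compensate the reference, add residuals."""
    return [(rx + mv[0] + dx, ry + mv[1] + dy)
            for (rx, ry), (dx, dy) in zip(ref, residuals)]
```

When a person merely translates between frames, every residual is zero and the skeleton costs little more than one motion vector.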
Inter prediction scheme.
In the MV-based differential-coding scheme, the motion vector of the 2nd joint is used to predict all joints. This is optimal when the skeleton is purely translated from the previous frame to the current one (i.e., every joint of the body moves in the same direction over the same distance, without any rotation or reflection). However, human bodies are non-rigid objects, so the real situation obviously differs. We therefore argue that more accurate joint predictions lead to smaller residuals and thus a higher compression rate.

In the inter prediction scheme, the corresponding skeletons in the (t−1)-th and (t−2)-th frames are used to predict the skeleton in the t-th frame (light yellow joints and dashed lines), as shown in Fig. 5. The differences between the original skeleton and the predicted skeleton are then encoded.
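As a self-contained illustration of per-joint prediction from two reference frames, the sketch below uses simple constant-velocity linear extrapolation. This is an assumed stand-in: the paper's actual predictor is the trajectory prediction method of [10], not this extrapolation.

```python
# Illustrative stand-in for the inter prediction scheme: predict each
# joint of frame t individually from frames t-1 and t-2 (here via
# constant-velocity extrapolation), then encode only the residuals.
def predict_inter(prev1, prev2):
    """Predict frame-t joints from frame t-1 (prev1) and frame t-2 (prev2)."""
    return [(2 * x1 - x2, 2 * y1 - y2)
            for (x1, y1), (x2, y2) in zip(prev1, prev2)]

def encode_inter(cur, prev1, prev2):
    """Residuals between the actual and predicted joints."""
    pred = predict_inter(prev1, prev2)
    return [(cx - px, cy - py) for (cx, cy), (px, py) in zip(cur, pred)]
```

The better the predictor tracks non-rigid motion, the smaller these residuals, which is exactly why a learned trajectory predictor can outperform a single motion vector.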
Trajectories prediction.
Much research has addressed trajectory prediction [7, 8, 9, 10]. In our method, the trajectory prediction method proposed in [10] is used.
Fig. 6. An example of encoding skeletons in a frame with our multimodal coding tool.

More specifically, every key joint of a skeleton in the t-th frame is predicted individually from the corresponding joints in the (t−1)-th and (t−2)-th frames (i.e., the (t−1)-th and (t−2)-th frames are chosen as the reference frames).

Since labeling skeleton data is expensive, the skeletons in videos may be estimated by existing skeleton estimation methods. However, these methods may introduce unexpected skeleton trajectories (for example, missing key joints, inaccurate matching, and tracking errors), which makes the correlations between skeletons more complex, so a more robust and efficient algorithm is needed. To this end, we propose a multimodal skeleton coding method in which the three schemes are switched when encoding skeletons.

The framework of our multimodal skeleton coding scheme is shown in Fig. 2. The switching rules are defined as follows:

1. For a skeleton that newly appears in the current frame, the spatial differential-coding scheme is used. The spatial differential-coding scheme is also used for the first frame.

2. When both the MV-based differential-coding scheme and inter prediction can be used for a skeleton, the one with the smaller encoded bit length is chosen. A flag indicating the chosen scheme is allocated and transmitted.

3. For other skeletons that exist in the (t−1)-th and t-th frames but cannot be found in the (t−2)-th frame, the MV-based differential-coding scheme is used.

Furthermore, several details should be noted: (1) For a skeleton that exists in the previous frame but disappears in the current frame, a disappear flag is allocated in the bitstream. (2) For a skeleton that is exactly the same as its corresponding skeleton in the reference frame, a skip flag is allocated to indicate this condition instead of encoding fourteen zeros.

Fig. 6 shows an example of coding the skeletons in a frame using our proposed multimodal coding method. S_1 exists in all three frames, so both the MV-based scheme and the inter prediction scheme can be used; the MV-based scheme, which yields the smaller bit length for this skeleton, is chosen and a flag is transmitted. Because S_2 exists only in the last and current frames, the MV-based scheme is chosen. S_3 disappears in the current frame, so a disappear flag is allocated. As for S_4, which newly appears in the current frame, the spatial differential-coding scheme is applied. The resulting bitstream of the t-th frame is also shown in Fig. 6.

Fig. 7. Some example video frames.

Table 1. Experimental results of different coding schemes. Sequences 0, 1, 2 come from the PoseTrack dataset [11]. Sizes are in KB; percentages give the change relative to the raw skeleton size.

Seq.                                   | Raw (KB) | CM1            | CM2            | CM3            | CM4
0 (GT)                                 | 2.42     | 2.26 (-6.8%)   | 1.14 (-52.9%)  | 1.23 (-49.2%)  | -
1 (GT)                                 | 1.25     | 1.48 (+18.6%)  | 0.89 (-28.8%)  | 1.08 (-13.9%)  | -
2 (GT)                                 | 9.83     | 7.16 (-27.1%)  | -              | -              | -
3 (GT)                                 | 32.66    | 23.02 (-29.5%) | 10.68 (-67.3%) | 15.76 (-51.7%) | -
4 (GT)                                 | 43.31    | 30.94 (-28.6%) | -              | -              | -
5 (GT)                                 | 86.24    | 52.08 (-39.6%) | -              | -              | -
Average on our surveillance seq. (GT)  | -        | -28.1%         | -76.9%         | -62.1%         | -74.4%
Average on our surveillance seq. (ES)  | -        | -26.2%         | -39.6%         | -20.6%         | -43.7%
Average (all)                          | -        | -15.3%         | -53.8%         | -41.0%         | -54.7%
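The switching rules above can be sketched as a small decision function. The `history` structure and helper names are illustrative assumptions, not part of the paper.

```python
# Hedged sketch of the multimodal switching rules. `history` maps a
# person ID to the set of frame indices in which that skeleton has
# appeared so far; frames are 0-indexed.
def choose_scheme(pid, t, history):
    """Return the coding scheme used for skeleton `pid` at frame t."""
    seen = history.get(pid, set())
    if t == 0 or (t - 1) not in seen:
        return "spatial"          # rule 1: first frame or newly appearing
    if (t - 2) in seen:
        return "mv_or_inter"      # rule 2: try both, keep the shorter bitstream
    return "mv"                   # rule 3: present in t-1 but absent in t-2
```

In a full encoder, the `"mv_or_inter"` branch would run both schemes, compare the encoded bit lengths, and transmit a one-bit flag indicating the winner.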
4. EXPERIMENTAL RESULTS

4.1. Settings
In our experiments, the four aforementioned schemes (three single-modal schemes and one multimodal scheme) are evaluated and compared.

Seven videos with different resolutions and scenes are included in the test. Three of them come from the PoseTrack dataset [11] and the others were collected and labeled by ourselves. Some examples are shown in Fig. 7. To evaluate the performance of our methods under different degrees of motion, the test sequences are re-sampled at different sample rates before being encoded. Apart from encoding the ground-truth skeletons (GT), we also evaluate our methods on skeletons estimated by [12] (ES). Note that only the compression rate is used to evaluate our proposed lossless compression method.
4.2. Results

Table 1 compares the performance of the different coding methods. In Table 1, CM1 denotes the spatial differential-coding scheme; CM2 the MV-based differential-coding scheme; CM3 the inter prediction scheme; and CM4 our full multimodal coding method. Note that for a skeleton for which the MV-based scheme (or the inter prediction scheme) cannot be used, the spatial differential-coding scheme is used in CM2 (CM3). From Table 1, we make the following observations:

1. The full version of our approach, the multimodal coding method (CM4), achieves the best performance on average. Specifically, it reduces the size of the encoded skeleton data by 54.7% on average.

2. More importantly, our multimodal scheme shows superior performance (an extra 4.1% compression) over the MV-based scheme when compressing the estimated skeletons of the surveillance sequences (i.e., the most practical situation). This demonstrates that our multimodal coding method is more robust than the compared methods when the skeleton trajectories in videos are complex and noisy, and it is therefore especially useful in real applications.

3. When encoding the annotated skeletons of our collected surveillance sequences, 76.9% and 74.4% reductions in encoded size are obtained by our MV-based differential-coding scheme and multimodal coding method, respectively. This clearly indicates the effectiveness of our designed skeleton coding schemes.

4. Our MV-based scheme achieves a 53.8% compression rate across all test sequences, which is slightly worse than our multimodal scheme. This indicates that the MV-based scheme can also provide satisfactory results in different kinds of applications.
5. CONCLUSION
This paper presents a new skeleton coding tool for encoding skeletons in videos. We introduce a multimodal scheme in which three encoding sub-schemes that utilize both spatial and temporal redundancy to compress skeleton data are switched properly, hence achieving higher coding efficiency. Experimental results show that skeleton data can be reduced efficiently using our multimodal coding tool.
Acknowledgement
This paper is supported in part by the Shanghai "The Belt and Road" Young Scholar Exchange Grant (17510740100) and the PKU-NTU Joint Research Institute (JRI) sponsored by a donation from the Ng Teng Fong Charitable Foundation.
6. REFERENCES

[1] Hongsong Wang and Liang Wang, "Modeling temporal dynamics and spatial configurations of actions using two-stream recurrent neural networks," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

[2] Qiuhong Ke, Mohammed Bennamoun, Senjian An, Ferdous Sohel, and Farid Boussaid, "A new representation of skeleton sequences for 3D action recognition," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 4570-4579.

[3] Yansong Tang, Yi Tian, Jiwen Lu, Peiyang Li, and Jie Zhou, "Deep progressive reinforcement learning for skeleton-based action recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 5323-5332.

[4] Girum G. Demisse, Konstantinos Papadopoulos, Djamila Aouada, and Bjorn Ottersten, "Pose encoding for robust skeleton-based action recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2018, pp. 188-194.

[5] Ling-Yu Duan, Vijay Chandrasekhar, Shiqi Wang, Yihang Lou, Jie Lin, Yan Bai, Tiejun Huang, Alex Chichung Kot, and Wen Gao, "Compact descriptors for video analysis: The emerging MPEG standard," IEEE MultiMedia, 2018.

[6] Mingliang Chen, Weiyao Lin, and Xiaozhen Zheng, "An efficient coding method for coding region-of-interest locations in AVS2," in IEEE International Conference on Multimedia and Expo Workshops (ICMEW), 2014, pp. 1-5.

[7] Gianluca Antonini, Michel Bierlaire, and Mats Weber, "Discrete choice models of pedestrian walking behavior," Transportation Research Part B: Methodological, vol. 40, no. 8, pp. 667-687, 2006.

[8] Kota Yamaguchi, Alexander C. Berg, Luis E. Ortiz, and Tamara L. Berg, "Who are you with and where are you going?," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011, pp. 1345-1352.

[9] Tharindu Fernando, Simon Denman, Sridha Sridharan, and Clinton Fookes, "Soft + hardwired attention: An LSTM framework for human trajectory prediction and abnormal event detection," Neural Networks, vol. 108, pp. 466-478, 2018.

[10] Agrim Gupta, Justin Johnson, Li Fei-Fei, Silvio Savarese, and Alexandre Alahi, "Social GAN: Socially acceptable trajectories with generative adversarial networks," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.

[11] M. Andriluka, U. Iqbal, E. Insafutdinov, L. Pishchulin, A. Milan, J. Gall, and B. Schiele, "PoseTrack: A benchmark for human pose estimation and tracking," in CVPR, 2018.

[12] Yuliang Xiu, Jiefeng Li, Haoyu Wang, Yinghong Fang, and Cewu Lu, "Pose Flow: Efficient online pose tracking," arXiv preprint arXiv:1802.00977, 2018.