Human-Machine Collaborative Video Coding Through Cuboidal Partitioning
Ashek Ahmmed, Manoranjan Paul, Manzur Murshed, and David Taubman

School of Computing and Mathematics, Charles Sturt University, Australia.
School of Science, Engineering, and Information Technology, Federation University, Australia.
School of Electrical Engineering and Telecommunications, University of New South Wales, Australia.
ABSTRACT
Video coding algorithms encode and decode an entire video frame, while feature coding techniques only preserve and communicate the most critical information needed for a given application. This is because video coding targets human perception, while feature coding aims at machine vision tasks. Recently, attempts have been made to bridge the gap between these two domains. In this work, we propose a video coding framework that leverages the commonality between human vision and machine vision applications using cuboids. Cuboids, which are estimated rectangular regions over a video frame, are computationally efficient, have a compact representation, and are object centric. These properties have already been shown to add value to traditional video coding systems. Herein, cuboidal feature descriptors are extracted from the current frame and then employed to accomplish a machine vision task in the form of object detection. Experimental results show that a trained classifier yields superior average precision when equipped with the cuboidal feature oriented representation of the current test frame. Additionally, this representation costs less in bit rate if the captured frames need to be communicated to a receiver.

Index Terms — Cuboid, HEVC, VCM, Object detection
1. INTRODUCTION
Many video analysis systems employ a client-server architecture where video signals are captured at front-end devices and the analyses are carried out at cloud-end servers. In such systems, video data need to be communicated from the front end to the server. Prior to this transmission, the captured video signal is encoded using a video coding paradigm, and at the receiving end the obtained signal is decoded before performing any analysis task. An alternative approach is to extract features from the video data at the client end and then communicate the encoded features rather than the video signal itself. Since capturing of the signal and feature extraction are conducted at the front end, the server is relieved of much of this burden in large-scale applications and can dedicate most of its resources to the analysis work. However, as the estimated features are devised for specific tasks only, they can be difficult to generalize to a broad spectrum of machine vision tasks at the cloud end.

Traditional video coding standards like H.264/AVC [1] and HEVC [2] are block-based. They are also pixel and frame centric. In order to model the commonality that exists within a video sequence, the frame that needs to be coded, known as the current frame, is artificially partitioned into square or rectangular blocks. These blocks are formed by grouping together neighboring pixels. After this partitioning, for each current frame block, commonality modeling is employed to form a prediction of it, either using the already coded neighboring blocks belonging to the current frame (intra-prediction) or using motion estimation and compensation within a co-located neighborhood region in the set of already coded reference frame(s) (inter-prediction), by minimizing a rate-distortion criterion.

Due to the requirements for communicating feature descriptors, coding standards like compact descriptors for visual search (CDVS) [3] and compact descriptors for video analysis (CDVA) [4] were developed. In CDVS, local and global descriptors are designed to represent the visual characteristics of images. Deep learning features are employed in CDVA to further augment the video analysis performance. Although these features showed excellent performance for machine vision tasks, it is not possible to reconstruct full resolution videos for human vision from such features. This results in two successive stages of analysis and compression for machine and human vision [5].

Modern video coding standards like HEVC focus on human vision by reconstructing full resolution frames from the coded bitstream, whereas standards like CDVS and CDVA focus on machine vision tasks and are incapable of reconstructing full resolution pictures from the coded feature descriptors. There is therefore a need for human-machine collaborative coding that can leverage the advances made on the frontiers of video coding and feature coding, as well as bridge the gap between these two technologies. For example, in an autonomous driving use case, some decisions can be made by the machine itself, while for some other decisions humans would interact with the machine. Such a collaborative coding problem is known as video coding for machines (VCM) [5]. In this direction of work, it was proposed in [6] to extract features in the form of edge maps from a given image, which are used to perform machine vision tasks.
After that, a generative model was employed to perform image reconstruction based on the edge maps and additional reference pixels.

Although the objectives of human vision and machine vision are different, human-machine collaborative coding can benefit from exploring the significant overlap that exists between these two domains. One important similarity is that human vision is mostly hierarchical. For example, an image of a study room can be described by humans using annotations like: over the floor carpet (global context) there is a desk (local detail) and a laptop (finer detail) is on the desk; on the wall (global context) a clock (local detail) is mounted, etc. Machine vision tasks like object detection can also be modeled in this kind of hierarchical way. For instance, in [7] a method is proposed for performing hierarchical object detection in images through deep reinforcement learning [8]. Using scale-space theory [9] and deep reinforcement learning, another work [10] explored optimal search strategies for locating anatomical structures, based on image information at multiple scales.

Murshed et al. proposed the hierarchical cuboidal partitioning of image data (CuPID) algorithm in [11]. In the CuPID framework, an entire image is initially partitioned into two rectangular regions, known as cuboids, by finding a hyperplane orthogonal to one of the axes. A greedy optimization heuristic, with the sum of the information entropy of the split-pair cuboids as the objective function, is employed to find the minimizing hyperplane. This ensures that the obtained split-pair cuboids are maximally dissimilar in terms of image pixel intensity distribution. Next, each cuboid is recursively partitioned by solving the same greedy optimization problem. It was shown that cuboids have excellent global commonality modeling capabilities, are object centric, and are computationally efficient [12–14].

The aforementioned properties of cuboids were brought to traditional video coding in [15], wherein a reference frame is generated using a cuboidal approximation of the target frame; incorporating this additional reference frame into a modified HEVC encoder outperformed the rate-distortion performance of a standalone HEVC encoder. The obtained gain in delta PSNR and savings in bit rate can be attributed to the fact that HEVC employs a partitioning scheme that begins at a rigid fixed size of 64 × 64 pixels and, during this process, does not take into account the structural properties of the scene that needs to be coded. This sub-optimal partitioning was improved by incorporating cuboids, which are estimated considering the entire frame's pixel intensity distribution and are therefore more object-centric.

Fig. 1. The estimated cuboid map over the current frame, C. This frame is part of the vehicle detection dataset in [16]. The cuboid map is determined by the CuPID algorithm [15]. Each cuboid's coverage is depicted with white boundary pixels.

Building on the work in [15], in this paper we investigate the applicability of cuboidal features in video coding targeting machines. Based on the CuPID algorithm, (i) features are extracted from a given video frame for the machine vision task of object detection. These features cost fewer bits to encode. In addition, (ii) it is possible to reconstruct a full-resolution frame from the obtained feature descriptors alone, unlike the approach in [6] that requires additional reference pixels along with edge based feature descriptors in this regard.
The reconstructed frame is a coarse representation of the original frame and is capable of preserving important structural properties of the original frame. Finally, (iii) the object detection performance of the cuboidal descriptor set is put to the test across a varying number of cuboids.

The rest of the paper is organized as follows: in Section 2, we briefly describe the cuboidal feature extraction process from a given video frame and the reconstruction process of a full-resolution frame from those estimated feature descriptors. Section 3 describes the performance of the cuboidal features over an object detection task. Finally, in Section 4, we present our conclusions.
2. DETERMINATION OF CUBOIDAL FEATURES FROM THE CURRENT FRAME USING CUPID
Given the original uncompressed current frame C, of resolution X × Y pixels, the CuPID algorithm hierarchically partitions it into n cuboids C^(1), C^(2), ..., C^(n), where n is a user-defined number of tiles. The first partitioning can produce two half-cuboids C^(1)_i and C^(2)_i of size i × Y and (X − i) × Y pixels, respectively. This is achieved by selecting a vertical line x = i + 0.5 over C from the set of lines i ∈ {1, 2, ..., X − 1}. Alternatively, C could be split into two half-cuboids of size X × j and X × (Y − j) pixels, respectively, by selecting a horizontal line y = j + 0.5 from the set of lines j ∈ {1, 2, ..., Y − 1}.

Fig. 2. Different coarse representations of the current frame C, obtained by varying the number of cuboids, n: (a) R_co frame with n = 100: R_co^(n=100); (b) R_co frame with n = 200: R_co^(n=200); (c) R_co frame with n = 300: R_co^(n=300).

The first optimal split s* of C, from the (X − 1) + (Y − 1) = X + Y − 2 possible splits, is obtained by solving a greedy optimization [17] problem whose objective function is taken to be a measure of dissimilarity between the candidate cuboid split-pair [15]. After that, each obtained half-cuboid is further split recursively until the total number of cuboids obtained from C equals the user input n. Fig. 1 shows an example cuboid map for the frame C. It can be observed that the estimated cuboids are fairly homogeneous in terms of pixel intensity distribution and that they occupy rectangular regions of arbitrary size. Classes like road, sky, and trees are represented using separate larger cuboids, and important cues for vehicle detection are also preserved.

The estimated cuboid map can be reconstructed at the decoder side from the optimal partitioning indices {s*_i}_{i=1}^{n−1} alone. That means, for example, if the first optimal partitioning, which split the entire image into two cuboids, took place at height (or width) h, the number h needs to be encoded, and so on. These indices are encoded and augmented into a bitstream in the way described in [15]. Next, for each obtained cuboid C^(i) from the cuboid map, a feature descriptor m^(i) is computed. Its value is taken to be the mean pixel intensity (for each image channel of C), considering all the pixels p and their intensities p(x, y) within the coverage of the corresponding cuboid:

m_j^{(i)} = \frac{\sum_{p \in C^{(i)}} p_j(x, y)}{|C^{(i)}|}    (1)

where j indexes the different color component channels of the frame C. The feature descriptors m^(i) are encoded in the way described in [15] and are also part of the bitstream.

Having decoded the optimal partitioning indices and the feature descriptors from the communicated bitstream, it is possible to generate a full-resolution frame, R_co. This frame provides a coarse representation of the current frame C, since it is created by replacing the intensity of every encompassing pixel p ∈ C^(i) by the associated cuboidal feature descriptor m^(i).
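To make the split selection concrete, the following is a minimal sketch of one CuPID splitting step, assuming a grayscale frame stored as a NumPy array and Shannon entropy of the 8-bit intensity histogram as the entropy measure; the function names are illustrative and not taken from the CuPID implementation.

```python
import numpy as np

def entropy(region):
    # Shannon entropy of the 8-bit intensity histogram of a region.
    hist, _ = np.histogram(region, bins=256, range=(0, 256))
    prob = hist[hist > 0] / region.size
    return -np.sum(prob * np.log2(prob))

def best_split(cuboid):
    # Score all (X - 1) + (Y - 1) candidate split lines by the sum of the
    # entropies of the resulting split-pair, and return the minimizer.
    height, width = cuboid.shape
    best = ("none", 0, np.inf)
    for i in range(1, width):                # vertical lines x = i + 0.5
        cost = entropy(cuboid[:, :i]) + entropy(cuboid[:, i:])
        if cost < best[2]:
            best = ("vertical", i, cost)
    for j in range(1, height):               # horizontal lines y = j + 0.5
        cost = entropy(cuboid[:j, :]) + entropy(cuboid[j:, :])
        if cost < best[2]:
            best = ("horizontal", j, cost)
    return best                              # (direction, index, objective)
```

Each recursive application of best_split to an obtained half-cuboid adds one cuboid, so n − 1 splits produce the n-cuboid map, and the chosen indices are precisely the partitioning indices {s*_i} that the decoder needs.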
Table 1. Savings in bits and computational time from the cuboidal feature descriptors at different scales (varying number of cuboids, n) over the HEVC reference, together with the Y-PSNR of the corresponding reconstructed R_co frames with respect to the current frame C.

    n                                          100      200       300
    Savings in bits: (b_c − b_co)/b_c          71%      38%       4…%
    Savings in comp. time: (t_c − t_co)/t_c    …41%     41.38%    3…%
    Y-PSNR (in dB)                             …20      32.69     33.…

Fig. 2 shows examples of this coarse representation frame R_co. It can be observed that this low-frequency representation attempts to capture important structural information about the scene, and that the degree of detail it can communicate tends to increase with a growing number of cuboids, n. Table 1 reports the bit requirements, computational complexity, and quality (in terms of PSNR) of these R_co frames. Here, b_c and t_c stand for the bits and computational time (in seconds) required for an HEVC encoder [18] to intra-code the frame C (at a fixed QP), while b_co and t_co are the corresponding requirements for the cuboidal feature descriptors. The employed system configuration is: Intel Core i7-8650U CPU @ 1.90 GHz, 32.0 GB RAM. The maximum PSNR is achieved by the frame R_co^(n=300), and this frame can be encoded at a fraction of the bit requirement of HEVC. Moreover, the encoding process is several times faster than the HEVC reference software [18].
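As an illustration of Eq. (1) and of how R_co is painted, the sketch below computes the per-channel mean descriptor of each cuboid and writes it back over that cuboid's coverage. The rectangle list stands in for the decoded cuboid map, and the names are hypothetical rather than from the authors' code.

```python
import numpy as np

def coarse_frame(frame, cuboids):
    # frame: H x W x 3 uint8 array; cuboids: list of (y0, y1, x0, x1)
    # rectangles that tile the frame (the decoded cuboid map).
    r_co = np.empty_like(frame)
    for (y0, y1, x0, x1) in cuboids:
        # Eq. (1): per-channel mean intensity over the cuboid's pixels.
        m = frame[y0:y1, x0:x1].reshape(-1, 3).mean(axis=0)
        r_co[y0:y1, x0:x1] = np.round(m).astype(frame.dtype)
    return r_co
```

Only the split indices and the n per-channel means enter the bitstream, which is why this representation is so cheap relative to coding the pixels themselves.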
3. OBJECT DETECTION USING THE CUBOIDAL FEATURE DESCRIPTORS
The cuboidal feature descriptors attempt to capture important structural properties of the image at comparatively lower bit and computational complexity requirements. In this work, we propose to utilize them for a machine vision task; in particular, the R_co frames are employed in a vehicle detection problem. For the addressed vehicle detection problem, the object detector You Only Look Once (YOLO) v2 [19] is used. YOLO v2 is a deep learning object detection framework that uses a convolutional neural network (CNN) [20] for the detection task. The detector is trained using unoccluded RGB images of the front, rear, left, and right sides of cars in a highway scene. The CNN used with the vehicle detector is a modified version of the MobileNet-v2 network architecture [21]. The data is split into training, validation, and test portions, with the trained detector evaluated on the test portion.

Fig. 3. The trained vehicle detector's performance over different versions of the same test frame: (from left to right) the HEVC intra-coded version and the coarse representation R_co of C, respectively. The detector managed to detect the same number of vehicles in the R_co frame, with varying confidence scores.

Once the detector has been trained, it is used to detect vehicles over (i) HEVC coded versions of the original test video frames (Fig. 3(a)) and (ii) cuboidal descriptor oriented coarse representation frames (R_co) of the original test frames (Fig. 3(b)). As can be seen, the detector managed to detect the same number of vehicles in the R_co^(n=300) frame as in its HEVC coded counterpart. Moreover, for several vehicles the detector's performance (in terms of a higher confidence score) improved when compared to the HEVC coded frame. However, for the second vehicle, the confidence score is lower in the case of the R_co^(n=300) frame. This is because that vehicle is partly occluded by another vehicle, and therefore its cuboidal representation lacks some structural information.
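This comparison can be summarized with the sketch below, which runs one trained detector over both representations of the same frame; `detect` stands in for the trained YOLO v2 model and `hevc_intra_roundtrip` for the HEVC encode/decode step, both hypothetical wrappers rather than actual APIs, while `coarse_frame` is the function sketched in Section 2.

```python
def compare_detections(frame, cuboids, detect, hevc_intra_roundtrip):
    # Run the same detector over two representations of one frame and
    # report the per-detection confidence scores side by side.
    hevc_frame = hevc_intra_roundtrip(frame)   # human-vision path
    rco_frame = coarse_frame(frame, cuboids)   # machine-vision path (Sec. 2)
    hevc_dets = detect(hevc_frame)             # list of (box, score) pairs
    rco_dets = detect(rco_frame)
    print(f"HEVC: {len(hevc_dets)} vehicles, "
          f"scores {[round(s, 2) for _, s in hevc_dets]}")
    print(f"R_co: {len(rco_dets)} vehicles, "
          f"scores {[round(s, 2) for _, s in rco_dets]}")
    return hevc_dets, rco_dets
```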
Fig. 4. Precision-recall performance of the trained vehicle detector over different representations of the same test set: HEVC coded, R_co^(n=100), R_co^(n=200), and R_co^(n=300).

Table 2. Average precision (AP) values of the trained vehicle detector over different representations of the same test set [16], together with the bit rate required to code the test set frames.

    Test set        AP      Bit rate (Kbps)
    HEVC coded      …       …
    R_co^(n=100)    …       …
    R_co^(n=200)    …       …
    R_co^(n=300)    …       …

Table 2 reports the performance of the detector over these different test data in the form of average precision. It can be noted that the detector produced the maximum average precision over the R_co^(n=300) test frames. This result is achieved at a lower bit rate than HEVC; in particular, a substantial bit rate saving is obtained. Bit rate savings improve further if the R_co^(n=200) test frames are used instead; however, in this case the average precision decreases compared to the HEVC coded test set. Fig. 4 shows the precision-recall curves for the employed test sets. At high recall, the precision is greater over the test sets R_co^(n=200) and R_co^(n=300) than over the HEVC coded test set, and these cuboidal descriptor oriented test sets take fewer bits to encode than HEVC.
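For completeness, the following is a minimal sketch of how average precision can be computed from ranked detections, assuming each detection has already been matched against the ground-truth boxes (e.g., at an IoU threshold of 0.5). It follows the standard all-point interpolated AP and is not necessarily the exact evaluation code used here.

```python
import numpy as np

def average_precision(scores, is_true_positive, num_gt):
    # scores: confidence of each detection; is_true_positive: whether the
    # detection matched a previously unmatched ground-truth box;
    # num_gt: total number of ground-truth objects in the test set.
    order = np.argsort(-np.asarray(scores, dtype=float))
    tp = np.asarray(is_true_positive, dtype=float)[order]
    cum_tp = np.cumsum(tp)
    cum_fp = np.cumsum(1.0 - tp)
    recall = cum_tp / num_gt
    precision = cum_tp / (cum_tp + cum_fp)
    # Enforce a monotonically decreasing precision envelope, then
    # integrate precision over recall (all-point interpolation).
    for k in range(len(precision) - 2, -1, -1):
        precision[k] = max(precision[k], precision[k + 1])
    recall = np.concatenate(([0.0], recall))
    precision = np.concatenate(([1.0], precision))
    return float(np.sum(np.diff(recall) * precision[1:]))
```

Running this once per representation (HEVC coded and each R_co variant) yields the AP column of Table 2, and the (recall, precision) pairs trace the curves of Fig. 4.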
4. CONCLUSION
Despite the difference in objectives, there exist similarities between human vision and machine vision, and a collaborative coding paradigm is required for many use cases. The cuboidal descriptor oriented representation of the current frames was previously shown to be beneficial in traditional video coding. Leveraging properties such as object orientation, awareness of scene structure, compact representation, and computational simplicity, in this paper cuboidal descriptors are employed in a machine vision task. Experimental results show that the addressed vehicle detection problem can be solved more accurately (with improved average precision) if cuboidal descriptors are used, along with a saving in bit rate over an HEVC reference.
5. REFERENCES

[1] T. Wiegand, G. J. Sullivan, G. Bjøntegaard, and A. Luthra, "Overview of the H.264/AVC video coding standard," IEEE Trans. on CSVT, vol. 13, no. 7, pp. 560–576, July 2003.

[2] G. J. Sullivan, J. Ohm, W. Han, and T. Wiegand, "Overview of the high efficiency video coding (HEVC) standard," IEEE Trans. on CSVT, vol. 22, no. 12, pp. 1649–1668, 2012.

[3] L. Duan, T. Huang, and W. Gao, "Overview of the MPEG CDVS standard," in DCC, 2015, pp. 323–332.

[4] L. Duan, Y. Lou, Y. Bai, T. Huang, W. Gao, V. Chandrasekhar, J. Lin, S. Wang, and A. C. Kot, "Compact descriptors for video analysis: The emerging MPEG standard," IEEE MultiMedia, vol. 26, no. 2, pp. 44–54, 2019.

[5] L. Duan, J. Liu, W. Yang, T. Huang, and W. Gao, "Video coding for machines: A paradigm of collaborative compression and intelligent analytics," IEEE Trans. on Image Processing, vol. 29, pp. 8680–8695, 2020.

[6] Y. Hu, S. Yang, W. Yang, L. Y. Duan, and J. Liu, "Towards coding for human and machine vision: A scalable image coding approach," in IEEE ICME, 2020, pp. 1–6.

[7] M. Bellver, X. Giró-i-Nieto, F. Marqués, and J. Torres, "Hierarchical object detection with deep reinforcement learning," CoRR, vol. abs/1611.03718, 2016.

[8] V. Mnih, K. Kavukcuoglu, D. Silver, A. Rusu, J. Veness, M. Bellemare, A. Graves, M. Riedmiller, A. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis, "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529–533, Feb. 2015.

[9] T. Lindeberg, "Scale-space for discrete signals," IEEE Trans. on PAMI, vol. 12, no. 3, pp. 234–254, 1990.

[10] F. Ghesu, B. Georgescu, Y. Zheng, S. Grbic, A. Maier, J. Hornegger, and D. Comaniciu, "Multi-scale deep reinforcement learning for real-time 3D-landmark detection in CT scans," IEEE Trans. on PAMI, vol. 41, no. 1, pp. 176–189, 2019.

[11] M. Murshed, S. W. Teng, and G. Lu, "Cuboid segmentation for effective image retrieval," in DICTA, Nov. 2017, pp. 1–8.

[12] M. Murshed, P. Karmakar, S. W. Teng, and G. Lu, "Enhanced colour image retrieval with cuboid segmentation," in DICTA, Dec. 2018, pp. 1–8.

[13] S. Shahriyar, M. Murshed, M. Ali, and M. Paul, "Depth sequence coding with hierarchical partitioning and spatial-domain quantisation," IEEE Trans. on CSVT, pp. 1–1, 2019.

[14] A. Ahmmed, M. Murshed, and M. Paul, "Leveraging cuboids for better motion modeling in high efficiency video coding," in ICASSP, 2020, pp. 2188–2192.

[15] A. Ahmmed, M. Paul, M. Murshed, and D. Taubman, "A coarse representation of frames oriented video coding by leveraging cuboidal partitioning of image data," in MMSP.

[17] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, Introduction to Algorithms, The MIT Press, 3rd edition, 2009.

[18] "HM Reference Software for HEVC (version 16.20)," https://hevc.hhi.fraunhofer.de/.

[19] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," CoRR, vol. abs/1506.02640, 2015.

[20] S. Albawi, T. A. Mohammed, and S. Al-Zawi, "Understanding of a convolutional neural network," in ICET, 2017, pp. 1–6.

[21] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L. Chen, "Inverted residuals and linear bottlenecks: Mobile networks for classification, detection and segmentation," CoRR, vol. abs/1801.04381, 2018.