An Adaptive Video Acquisition Scheme for Object Tracking and its Performance Optimization
Srutarshi Banerjee, Henry H. Chopp, Juan G. Serra, Hao Tian Yang, Oliver Cossairt, A. K. Katsaggelos
SUBMITTED TO IEEE SENSORS JOURNAL
Abstract — We present a novel adaptive host-chip modular architecture for video acquisition that optimizes an overall objective task under a given bit rate constraint. The chip is a high-resolution imaging sensor, such as a gigapixel focal plane array (FPA), with low computational power, deployed remotely in the field, while the host is a server with high computational power. The data bandwidth of the communication channel between the chip and the host is too constrained to accommodate transfer of all captured data from the chip. The host performs the task-specific computations and also intelligently guides the chip to optimize (compress) the data sent to the host. The proposed system is modular and highly versatile in terms of flexibility in re-orienting the objective task. In this work, object tracking is the objective task. While our architecture supports any form of compression/distortion, in this paper we use quadtree (QT)-segmented video frames. We use the Viterbi (dynamic programming) algorithm to minimize the area-normalized weighted rate-distortion allocation of resources. The host receives only these degraded frames for analysis. An object detector is used to detect objects, and a Kalman filter based tracker is used to track those objects. System performance is evaluated in terms of the Multiple Object Tracking Accuracy (MOTA) metric. In the proposed architecture, gains in MOTA are obtained by training the object detector twice with different system-generated distortions, a novel 2-step process. Additionally, the object detector is assisted by the tracker, which upscores the region proposals in the detector to further improve performance.
Index Terms — Image acquisition, Image reconstruction, Video signal processing, Object detection, Object tracking, Object tracker assisted detection, Optimization, Viterbi algorithm
I. INTRODUCTION

THIS work focuses on the problem of optimal information extraction in wide-area surveillance using high-resolution sensors with low computational power for imaging applications. The imaging instrument (i.e., the chip) is assumed to be a very high resolution Focal Plane Array (FPA) (e.g., > MPixels) [2], [3], [4] providing imagery over the desired field of view, but with low computational power. Imagers of
Manuscript submitted on February 19, 2021. (Corresponding author: Srutarshi Banerjee.)

Preliminary work has been presented in [1]. This work is supported in part by the Defense Advanced Research Projects Agency (DARPA) under Grant No. HR0011-17-2-0044. Srutarshi Banerjee, Henry H. Chopp, and A. K. Katsaggelos are with the Department of Electrical and Computer Engineering (ECE), Northwestern University (NU), Evanston, IL 60208, USA (email: [email protected]; [email protected]; [email protected]). Juan G. Serra is with the Center for Interdisciplinary Exploration and Research in Astrophysics, NU, Evanston, IL 60208, USA (email: [email protected]). Hao Tian Yang was with the Computer Science (CS) Department, NU, Evanston, IL 60208, USA (email: [email protected]). Oliver Cossairt is with the Department of CS and Department of ECE, NU, Evanston, IL 60208, USA (email: [email protected]).

such high resolution capture data at a large bit rate, but do not process them fast enough. Limited computational power in FPAs and other imaging devices is a key practical constraint in the devices currently available in the market. Moreover, the FPA contains Readout Integrated Circuit (ROIC) electronics, and the primary challenge is that the data bandwidth of the ROIC limits the maximum amount of data (in bits/s) that can be delivered by the sensor (chip). For such a sensor with low computational power capturing data at a high rate, the data needs to be analyzed remotely on a server with high computational power, termed the host, in order to perform computationally heavy tasks such as object detection, tracking, and anomaly detection. In the case of a very high bandwidth and high readout rate from the chip, the chip can easily send all its captured high resolution video frames to the host for data analysis, and the analysis of the data on the host can be straightforward with state-of-the-art algorithms.
However, in practice, having a very high data bandwidth is impractical due to various factors: ROIC electronics, commercial aspects of using a large data bandwidth, lossy transmission media, and other factors. Thus, the chip can only send limited data to the host. In such a scenario, the chip must be selective in sending a subset or a compressed representation of the captured high
resolution video frames. Optimally selecting the compressed video frames is a challenging task for the chip. Moreover, the host has access only to the compressed frames. Task-specific computations (such as object detection and tracking) are more difficult to perform on compressed frames than on high quality frames.

Commercial FPAs have different controls over spatio-temporal sampling. Pixel-binning and sub-sampling modes allow a dynamic trade-off between spatial and temporal resolutions. For instance, high frame rates (e.g., > kfps) may be achieved at low resolution (e.g., < VGA), while the maximum frame rates that can be achieved for high resolution FPAs (e.g., > MPixels) are typically low (< Hz). The pixel-binning and sub-sampling modes provide a way to optimize sampling with constraints on the bandwidth of the ROIC electronics [5]-[10]. Analysis of the theoretical and experimental properties of compressive video cameras which utilize per-pixel coded exposure sequences has been done as well [5]-[7]. FPA implementations using programmable coded exposure sequences have also been developed [11], [12]. All of these algorithms aim to maximize the information content in space and time using a compressed sensing approach, which assumes the signal can be sparsely represented in a transform domain. Unfortunately, this assumption is not always valid for arbitrary scenes.

We propose an architecture which not only performs the objective task (such as object detection and tracking) but is also an intelligent system that can adapt its acquisition based on the scene. To do so, we use object detection and tracking algorithms on the host, which has the high computational power needed to perform such tasks in low computational time.
Object detection and tracking has been a topic of immense interest for the image processing community over the past decade, accelerated by the advent of Deep Learning approaches [13], [14], especially Convolutional Neural Networks (CNNs), one of the most widely used models of deep learning [15], [16], [17].

In this work, we introduce an algorithm for adaptive sampling of high bit rate data (such as from a high resolution FPA) that is optimized together with a reconstruction algorithm for object detection and tracking purposes. We develop the architecture assuming the imaging device (chip) has limited computational power and the host has high computational power. The communication channel between the chip and the host has limited bandwidth, and hence it is not possible to transfer all the captured data from the chip to the host. To the best of the authors' knowledge, this is the first work to introduce a bandwidth-limited, resource-constrained, optimized solution for object tracking, with our preliminary work presented in [1]. The detection and tracking of multiple objects in a compressed image domain is a unique approach in our system which requires careful optimization. Since the framework is aimed at object tracking, the final evaluation metric for the performance of this algorithm is not the traditional reconstructed image quality measured, for example, by PSNR or SSIM, but rather a surrogate tracking performance metric, Multiple Object Tracking Accuracy (MOTA), for tracking the objects of interest.

The proposed host-chip architecture allows dynamic, modular, re-configurable and content-adaptive acquisition of data from an imager (chip) with low computational power, with optimal bandwidth utilization. The optimization problem is posed as a resource allocation problem: given the constrained allowable data bandwidth between the host computer and the chip, with low computational power on the chip, we estimate the best tessellation per frame based on the ROIs.
A frame thus requires a reduced number of bits for its representation. The host and chip mutually transmit only the most important information. The main contributions of the paper are as follows:

1) The development of a host-chip modular feedback architecture designed for optimal information extraction over a bandwidth-constrained communication channel for object detection and tracking with a computationally constrained chip deployed in the field.
2) Implementation of a multiple object detector and tracker in a lossy image domain.
3) Performance optimization of the proposed system using a 2-step object detector training strategy.
4) Performance optimization of the proposed system by enhancing object detection with the assistance of an object tracker.
5) Performance comparison of the proposed system with state-of-the-art systems.

One of the shortcomings of this approach is the fact that the object tracking metric is used as the performance metric, instead of the traditional PSNR or SSIM metrics. For a highly constrained bandwidth, the image frames might have large distortion, while the object tracking performance may still be good. In such cases, priority is given to achieving a good object tracking metric instead of good image quality. Our architecture focuses on achieving a good tracking metric for highly distorted frames.

The rest of the paper is organized as follows: in Sections II and III, we describe the related work and problem formulation, respectively. Section IV describes the host-chip architecture. Section V describes the performance optimization of the system using a 2-step training approach for the object detector, as well as tracker assisted object detection. Section VI describes our experimental results. Section VII covers the discussion, while Section VIII concludes the paper.
II. RELATED WORK
Image acquisition with adaptivity has been introduced in several ways. For example, local features, e.g., standard deviation [18], [19], edge counting [20], or estimation of the reconstruction error [21] in a local domain, are used to guide the adaptive acquisition. An adaptive scheme was proposed in [22] that estimates the compression based on local redundancy measured statistically using previously sensed measurements.

In the rate-distortion literature for video/images, one of the early works on region based rate control for H.264 [23] was done in [24]. That work used a region-based rate control scheme for macro-blocks and grouped regions of similar characteristics into the same region, treating them as a basic unit for rate control. Similar works include region-classification-based rate control for Coding Tree Units (CTUs) in I-frames
to improve the reconstruction quality of I-frames for suppressing flicker artifacts [25], a region-based inter-frame rate-control scheme to improve the objective quality and reduce PSNR fluctuations among CTUs [26], moving regions used as the RoI for identifying the depth level of a CTU [27], and other works [28], [29], [30]. Further progress has been made in RoI-aware rate control, where a higher bit rate is allocated to regions of interest such as human faces [31], a combination of human faces with CTU level control [32], and human faces with tile-based rate control [33]. Work has been done on attention region based rate control for 3DVC depth map coding, based on regions classified as foreground, edges of objects, and dynamic regions [34]. A content-aware rate control scheme for HEVC [35] has been built on static and dynamic saliency detection, using a deep convolutional network to extract the static saliency map [36]. Work has been done on preserving scale-invariant features such as SIFT/SURF [37]. A rate control algorithm for HEVC based on RoIs using an improved Itti algorithm [38] has also been proposed. Rate control has also been performed by λ adjustment in [39] and in other works.

While some of these works developed compression algorithms based on priority regions, none of them focused on joint rate-distortion optimization with object tracking as an end metric. Moreover, these works have been developed without considering the computational power of the imager. None of them have a host-chip architecture with shared computational load between the chip and host, with the host performing heavy computation and providing feedback to the chip, which has low computational power, to adaptively acquire data for the next time instant.

Object detection using Deep Neural Networks has been a topic of heavy activity in the last few years.
Typically, CNNs have deeper architectures which allow hierarchical feature representation and learning with fewer weight parameters, which increases their expressive capability compared to shallow models [40]. R-CNN [41], Spatial Pyramid Pooling (SPP)-net [42], Faster R-CNN [43], Mask R-CNN [44], Region-based Fully Convolutional Network (R-FCN) [45], G-CNN [46], MultiBox [47], YOLOv3 [48], YOLOv4 [49], Single Shot MultiBox Detector (SSD) [50], and Deconvolutional Single Shot Detector (DSSD) [51] are a few of the deep learning based object detectors in the literature. Typically, object detectors are trained on datasets such as COCO [52], PASCAL VOC [53] and ImageNet [54], which have low inherent distortions and noise. This breed of trained object detectors is not optimized for distorted frames and leads to sub-optimal detection performance. We address this challenge by retraining the object detector at different distortions to handle degraded data and boost its performance.

Object tracking, such as the multiple object tracking (MOT) problem, is typically solved using Joint Probabilistic Data Association (JPDA) filters [55], [56] or Multiple Hypothesis Tracking (MHT) [57]. Real time application of these approaches in highly dynamic environments is impractical due to their complexity. Online trackers build appearance models of individual objects [58]-[60] or a global model [61]-[63]. Often motion is taken into account in addition to the appearance model [64]. Geiger et al. [65] used the Hungarian algorithm [66] in a two-stage process, forming tracklets by associating detections and then associating tracklets to bridge broken trajectories. Bewley et al. [67] used a Kalman filter based approach combined with the Hungarian algorithm in order to track multiple objects. More recently, deep learning based multi-object trackers have been introduced, such as [68]-[72].
While advances in object detection and tracking are still an interesting pursuit, in this work we focus on introducing a novel modular host-chip architecture which has the flexibility of upgrading the object detector and tracker as research in that domain progresses. Typically, the object detector independently detects objects in a frame. However, we utilize the locations of the objects in a frame predicted by the tracker to guide the object detections.

III. PROBLEM FORMULATION
Our method is based on a computational imaging approach using a prediction-correction feedback paradigm. The goal of the host computer is to predict the location of the regions of interest (RoIs) for a particular frame and be able to correct that prediction. The predicted ROIs help guide the chip, consisting of the FPA and ROIC, to capture optimal information for the host to optimally perform object detection and tracking. The methodology has been developed with consideration of the limited computational power on the chip, which forces it to transfer data to the host to perform heavy computations.

The adaptive segmentation is data-driven, based on a decomposition of the image into regions or blocks. While our architecture supports different distortion/compression introduced by these regions/blocks, in this work we focus on adaptive segmentation of the video frame based on a quadtree (QT) structure. We particularly use the QT structure as it fits into the H.264 [23], H.265 / High Efficiency Video Coding (HEVC) [35] and latest H.266 / Versatile Video Coding (VVC) [73] standards, which partition image frames into QT blocks. Thus, our architecture can be applied directly to existing electronic hardware systems which utilize the latest HEVC or VVC standards, as well as the earlier H.264 standard.

The host-chip system has been developed as a prediction-correction feedback system, as shown in Fig. 1. The host predicts the RoIs in a frame and updates its prediction based on the data received from the chip. This feedback mechanism is critical for our system as it prevents error propagation. The chip generates an optimized QT structure that subdivides the current frame into superpixels before transmitting them to the host. The bigger superpixels have high distortion, which may be mitigated by subdividing them if sufficient bandwidth is available. Further QT subdivision, depending on available bandwidth, captures finer details in a frame.
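In this work the QT itself is produced by the Viterbi rate-distortion optimization of Section IV; as a simpler illustration of how a content-adaptive quadtree tessellation behaves, the sketch below splits a square block whenever its pixel variance exceeds a threshold. The variance criterion, threshold, and function names are illustrative assumptions, not the paper's actual optimization.

```python
import numpy as np

def quadtree_leaves(frame, x, y, size, var_thresh, min_size):
    """Recursively split a square block until its pixel variance falls
    below var_thresh or the minimum block size is reached.
    Returns a list of (x, y, size) leaves (superpixels)."""
    block = frame[y:y + size, x:x + size]
    if size <= min_size or block.var() <= var_thresh:
        return [(x, y, size)]
    h = size // 2
    leaves = []
    for dx, dy in ((0, 0), (h, 0), (0, h), (h, h)):
        leaves += quadtree_leaves(frame, x + dx, y + dy, h, var_thresh, min_size)
    return leaves
```

Flat regions collapse into a few large superpixels (high compression), while detailed regions, such as the RoIs, are subdivided into many small leaves.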
QTs for a newly acquired frame on the chip contain information about the superpixels the host should update or skip in its frame of the previous time step. The intensities for the update regions are sent from the chip to the host. Skipped superpixels assume the value of the previous frame. The QT is optimized based on: (i) the distortion between the current and previously reconstructed frame, (ii) the predicted locations of the ROIs for the current
frame, and (iii) the available bandwidth. In this work, the fast and effective recursive encoding of the QT structure in [74] is used.
Fig. 1: Host-Chip Architecture (S: Skip, A: Acquire).

The host-chip architecture of the system is shown in Fig. 1. The chip, for a particular frame, sends the QT, the mode of the leaves (skip or acquire), and the pixel values corresponding to the acquire mode to the host. The host, based on this information, computes the ROIs for the next frame and sends them back to the chip. This iterative loop is repeated once for each frame the chip captures. Clearly, the host has access to only distorted frames which are compressed by the QT. The object detector on the host needs to classify and return bounding boxes based on these distorted frames, which is more challenging compared to the undistorted, higher quality frames. The performance of the object detector deteriorates due to the QT compression, and hence it is a necessity to boost its performance under low bandwidth conditions. This is of utmost importance for the host-chip architecture, which must be robust to both bandwidth fluctuations and different operating conditions. Additionally, the object detector uses spatial information per frame to generate bounding boxes. In order to maintain temporal continuity among the bounding boxes, the RoIs predicted by the object tracker are taken into account. Section V of this paper details the 2-step training process as well as the tracker assisted object detection. The performance metric for this host-chip architecture is Multiple Object Tracking Accuracy (MOTA) instead of the PSNR metric typically used in the literature, as the end system performance is of critical importance, rather than the fidelity to the undistorted frames. This helps in maintaining the objective performance metric as the criterion for comparison.
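One iteration of the feedback loop of Fig. 1 can be summarized in code. Every callable below (`capture`, `viterbi_optimize`, `reconstruct`, `detect`, `track`) is a hypothetical placeholder for the corresponding module described in Sections IV and V, not an actual API.

```python
def host_chip_step(capture, viterbi_optimize, reconstruct, detect, track,
                   prev_recon, predicted_rois, bandwidth):
    """One iteration of the Fig. 1 prediction-correction loop."""
    frame = capture()                                   # chip: full-res frame f_{t+1}
    qt, modes, values = viterbi_optimize(frame, prev_recon,
                                         predicted_rois, bandwidth)
    # only (qt, modes, values) cross the bandwidth-limited channel
    recon = reconstruct(qt, modes, values, prev_recon)  # host: distorted frame
    detections = detect(recon)                          # host: object detector
    next_rois = track(detections)                       # host: Kalman-predicted ROIs
    return recon, detections, next_rois                 # next_rois fed back to chip
```

The returned ROIs close the loop: they are sent back to the chip and bias the next frame's rate-distortion optimization toward the regions of interest.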
IV. HOST-CHIP SYSTEM ARCHITECTURE
The system architecture, consisting of a host-chip framework, is developed from the methodology of guiding a sensor (chip) through real-time tuning of its optimization parameters to collect data with the highest content of useful information for object tracking. The architecture is based on the consideration of a limited bandwidth channel capacity, B, between the host computer and a chip with limited (low) computational power. The host-chip modular architecture has been developed keeping in mind the prediction-correction feedback system. The chip has low computational power while the host has high computational power. The disparity between the computational power of the chip and host drives the design of the host and chip models.

Fig. 2: Computation on Chip.
1) Chip Computation:
Fig. 2 shows the computation on the chip. The compression of a video frame is based on a QT structure. The host computes the predicted bounding boxes b̃b_{t+1}, with b̃b_{t+1} ∈ R^{4×P} (P is the number of bounding boxes detected), and sends them to the chip for time t+1. The chip has a copy of f̂_t, which is the distorted frame for time t. The full resolution undistorted frame at t+1, f_{t+1}, is acquired at time t+1 by the FPA on the chip. These are inputs to the Viterbi optimization algorithm, which provides as output the optimal QT structure S′_{t+1} and optimal skip-acquire modes Q′_{t+1} subject to the communication channel bandwidth constraint B for time t+1. The skip (S) and acquire (A) modes in S′_{t+1} identify the QT leaves (blocks) where new data need to be acquired at time t+1 and the remaining leaves (QT blocks) where data will be copied from frame f̂_t. The S and A modes are included in the framework, as this allows only a reduced set of data to be sent from the chip to the host, thereby aiding data compression significantly. Now {f̂_t, f_{t+1}} ∈ R^{N1×N2}, where N1 = N2 = 512 (for instance) is the resolution of the frame, and S′_{t+1} and Q′_{t+1} are defined over the QT with maximum depth N (N = 9 for N1 = N2 = 512). The bounding box information b̃b_{t+1} is used to prioritize the distortion in the RoIs relative to other regions. The higher distortion in RoI regions forces the optimization algorithm to allocate more bits while performing the rate-distortion optimization. On the chip, S′_{t+1} and Q′_{t+1} provide us with the QT structure along with the skip/acquire modes. Corresponding to the acquire modes in Q′_{t+1} and the acquired frame at t+1, f_{t+1}, we can generate the pixel values for the leaves (QT blocks), V_{t+1}, for the acquire modes. Here, V_{t+1} ∈ R^{N_a}, with N_a the number of acquire modes in Q′_{t+1}.
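The skip/acquire bookkeeping can be made concrete with a small sketch. For illustration we assume each acquire leaf is summarized by a single mean superpixel value; the actual per-leaf payload in the system may differ.

```python
import numpy as np

def acquire_values(frame, leaves, modes):
    """For each QT leaf flagged 'A' (acquire), emit one superpixel value:
    the mean intensity of that block in the newly captured frame.
    'S' (skip) leaves send nothing; the host reuses the previous frame."""
    return np.array([frame[y:y + s, x:x + s].mean()
                     for (x, y, s), m in zip(leaves, modes) if m == 'A'])

def reconstruct(prev_recon, leaves, modes, values):
    """Host-side counterpart: paint acquire blocks with the received
    superpixel values; skip blocks keep the previous reconstruction."""
    out = prev_recon.copy()
    vals = iter(values)
    for (x, y, s), m in zip(leaves, modes):
        if m == 'A':
            out[y:y + s, x:x + s] = next(vals)
    return out
```

Note that only the acquire values (plus the QT and mode bits) cross the channel, which is where the bandwidth saving comes from.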
The chip sends S′_{t+1}, Q′_{t+1} and V_{t+1} to the host in order to reconstruct the frame f̂_{t+1}. This differential information is sent from the chip to the host, instead of the whole frame. This helps in reducing the bandwidth required for transferring the relevant information to the host for performing the tasks of object detection and tracking.

Viterbi Optimization
The Viterbi optimization provides a trade-off between the frame distortion D and frame bit rate R. This is done by minimizing the frame distortion D over the leaves x of the QT, subject to a given maximum frame bit rate R_max ∈ R, where N is the maximum depth of the QT. Previous works [74], [75], [76] on Viterbi optimization have used it for compression of the actual frames. In this work, we use the reconstructed frame f̂_t and the actual frame f_{t+1} acquired by the chip to compute the distortion. The optimization is formulated as follows:

argmin_x D(x),  s.t. R(x) ≤ R_max.  (1)

The distortion for each node of the QT is based on the acquisition mode Q′_{t+1} of that node. If a particular node x̂_t of a reconstructed frame at time t is a skip, the distortion with respect to the new node at time t+1, x_{t+1}, is given by

D_s = |x_{t+1} − x̂_t|.  (2)

On the contrary, if the node is an acquire, the distortion is proportional to the standard deviation σ. This is shown in Eq. 3, where N is the maximum depth of the QT and n is the level of the QT at which the distortion is computed. The root is defined to be on level 0, and the most subdivided level is N:

D_a = σ × 2^(N−n).  (3)

It must be kept in mind that the distortion D is computed per block of the QT, and thus {D_s, D_a} ∈ R. The total distortion is therefore defined as

D = D_s + D_a.  (4)

The constrained discrete optimization of Eq. 1 is solved using Lagrangian relaxation, leading to solutions on the convex hull of the rate-distortion curve [75]. The Lagrangian cost function is of the form

J_λ(x) = D(x) + λR(x),  (5)

where λ ≥ 0 (λ ∈ R) is a Lagrangian multiplier, and J_λ(x) is accumulated over all the leaves of the QT.
It has been shown that if there is a λ* such that

x* = argmin_x J_{λ*}(x)  (6)

leads to R(x*) = R_max, then x* is the optimal solution to Eq. 1. This is solved using the Viterbi algorithm, shown in detail in [74]. A sample frame with its QT decomposition containing the skip and acquire modes is shown in Fig. 3 for λ = 2, which corresponds to a regime of low distortion.

Now, in the distortion term, we want to prioritize the regions based on the bounding boxes, which are the ROIs. This is introduced by the weight factors w_i in each region i. However, in the case where region i occupies a large area within the frame, its amount of distortion may heavily outweigh other, smaller regions. We want the weighted distortion to be independent of the area of RoI i. This is done by dividing the weighted distortion by the area of the RoI of region i, thus modifying Eq. 5 as

J_λ(x) = Σ_{i∈Ω} w_i D_i(x_i) / A_i + λR(x),  (7)

where Ω is the set of differently weighted regions, D_i the distortion of region i (D_i ∈ R), w_i the weight of region i (w_i ∈ R), A_i the area of region i (A_i ∈ R), and x_i the leaves in the QT of region i.

The system can also be operated at a fixed bit rate within a certain tolerance. The λ value of the Lagrangian multiplier is adjusted at each frame to achieve the desired bit rate. The optimal λ* is computed by a convex search on the Bezier curve [75]. The Bezier curve accelerates convergence in fewer iterations.

Fig. 3: Sample frame (left) with its QT decomposition (right) for λ = 2.
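The λ search of Eqs. 5-6 can be illustrated with a toy version of the Lagrangian trade-off: each block offers a few (rate, distortion) operating points, the cost D + λR is minimized per block, and λ is adjusted until the rate budget R_max is met. The paper accelerates the search with a convex Bezier-curve search and optimizes over full QT structures with the Viterbi algorithm; the independent-block model and plain bisection below are simplified stand-ins.

```python
def lagrangian_select(options, lam):
    """For each block, pick the (rate, distortion) pair minimizing D + lam*R.
    options: list of blocks, each a list of (rate, distortion) pairs."""
    return [min(opts, key=lambda rd: rd[1] + lam * rd[0]) for opts in options]

def solve_for_rate(options, r_max, lo=0.0, hi=1e6, iters=60):
    """Bisect on lambda so the total rate meets the budget R_max.
    Assumes the cheapest selection (large lambda) fits the budget."""
    for _ in range(iters):
        lam = 0.5 * (lo + hi)
        total_rate = sum(r for r, _ in lagrangian_select(options, lam))
        if total_rate > r_max:
            lo = lam   # too many bits: penalize rate more
        else:
            hi = lam   # feasible: try spending more bits on quality
    return lagrangian_select(options, hi)
```

Larger λ penalizes rate and drives the solution toward coarse, cheap blocks; λ = 0 recovers the minimum-distortion (maximum-rate) solution.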
2) Host Computation:
Fig. 4: Computation on Host.

Fig. 4 shows the computation on the host. For an undistorted frame f_t acquired at time t on the chip, we have the QT, the skip or acquire modes for the leaves, and the values for the leaves with acquire modes, denoted by S′_t, Q′_t, and V_t, respectively. These are sent from the chip to the host in order to reconstruct frame f̂_t. The previously reconstructed frame f̂_{t−1} for time t−1, saved on the host, is used to copy the values into the skip leaves of f̂_t. Here {f_t, f̂_t, f̂_{t−1}} ∈ R^{N1×N2}, where N1 = N2 = 512 (for instance) is the resolution of the frame, and S′_t and Q′_t are defined over the QT with maximum depth N (N = 9 for N1 = N2 = 512). An object detector on the host then determines the ROIs of the reconstructed image, b̂b_t. The ROIs are then fed into a Kalman filter-based object tracker as an observation, which updates the state of the filter. The Kalman filter then predicts the locations of the ROIs for the next
frame at time t+1, based on a linear motion model, denoted as b̃b_{t+1}. Here, {b̂b_t, b̃b_{t+1}} ∈ R^{4×P} (P is the number of bounding boxes detected). These predicted ROIs for the frame at t+1 are then sent back to the chip. A copy of the distorted reconstructed frame f̂_t is kept on the host for creating the reconstructed frame f̂_{t+1} at time t+1.

Object Detection
The regions of interest are detected using an object detector on the reconstructed frame on the host, as shown in Fig. 4. While in principle the framework supports any object detector, in this work we use Faster R-CNN [43] for detecting objects of interest, owing to its higher accuracy compared to other deep learning based object detectors. In fact, Faster R-CNN is widely used in several systems. Faster R-CNN comprises two modules: the first module consists of the convolutional layers of VGG16 [77], which extract features, and a region proposal network (RPN), which finds and labels regions of probable objects as foreground or background. The second module classifies the objects in those region proposals and also regresses a bounding box for each object. This object detector on the host has access to only distorted reconstructed frames. To enhance its performance on degraded data as well, the object detector has been trained on both distorted and undistorted data. Additionally, in order to ensure continuity among the frames in terms of detected objects, the bounding boxes predicted by the tracker are used to assist the Faster R-CNN. Multiple classes of objects were used to train the Faster R-CNN network. In this work, we train the object detector using a novel 2-step methodology, assisted by tracker information, as described in Section V.
Object Tracker
The object detector generates bounding boxes with class labels, which are fed as input to an object tracker. While in principle the framework supports any tracker, in this work a Kalman filter-based multiple object tracker, Simple Online and Realtime Tracking (SORT) [67], is adapted for the object tracker implementation. The object tracker uses a linear motion model to predict the bounding box locations in the next frame f_{t+1}. It then associates the identities using linear assignment between the new detections from Faster R-CNN and the most recently predicted bounding boxes. The state of the Kalman filter, X_s, for each detection is modeled using a linear motion model as

X_s = [u, v, s, r, u̇, v̇, ṡ]^T,  (8)

where u and v represent the coordinates of the target's center, and s and r represent the scale (area) and the aspect ratio (width/height) of the target's bounding box, respectively. Three time derivatives are part of the state parameters as well, namely u̇, v̇, and ṡ.

When a detection is associated with a target, the target state is updated using the detected bounding box. The velocity components of the state are solved optimally via the Kalman filter framework [78]. The predicted bounding boxes are extracted from the predicted state of the Kalman filter. These are the ROIs for the acquisition of the next frame f_{t+1}, which are also input to the Viterbi algorithm. However, when there is no detection from the object detector, the predicted bounding boxes are translated following the constant motion model for N consecutive tracked frames. The predicted bounding boxes are fed into the Faster R-CNN for upscoring those predictions. Additionally, the predicted regions are of higher quality due to the lower distortion in those regions, as described in Eq. 7. This allows the Faster R-CNN to detect objects in one out of N frames and still have them tracked using the Kalman filter, thereby improving the tracking accuracy.
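A minimal sketch of the constant-velocity prediction step implied by Eq. 8 follows (dt = 1 frame). The covariance propagation and measurement update of the full SORT tracker are omitted here, and the state-to-box conversion mirrors the (scale, aspect ratio) parameterization described above.

```python
import numpy as np

# State x = [u, v, s, r, u_dot, v_dot, s_dot]^T as in Eq. 8;
# the aspect ratio r is assumed constant across frames.
F = np.eye(7)
F[0, 4] = F[1, 5] = F[2, 6] = 1.0  # u += u_dot, v += v_dot, s += s_dot

def predict(x):
    """One constant-velocity predict step (dt = 1 frame)."""
    return F @ x

def bbox_from_state(x):
    """Convert [u, v, s, r, ...] back to corner form (x1, y1, x2, y2):
    s = w*h is the area, r = w/h the aspect ratio."""
    u, v, s, r = x[:4]
    w = np.sqrt(s * r)
    h = s / w
    return (u - w / 2, v - h / 2, u + w / 2, v + h / 2)
```

The predicted boxes serve double duty: they are matched against new detections by linear assignment, and they are the ROIs fed back to the chip's Viterbi optimization.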
3) Performance Accuracy Metric:
The multi-target performance is measured using the Multiple Object Tracking Accuracy (MOTA) evaluation metric, defined as in [79] and referred to here as MOTA_full:

MOTA_full = 1 − [ Σ_t (m_t + fp_t + mme_t) ] / [ Σ_t g_t ],  (9)

where m_t represents the number of missed detections at time t, fp_t the number of false positives at time t, mme_t the number of mismatch (track switching) errors at time t, and g_t the number of ground truth objects at time t.

In this work we also consider a modified MOTA metric which does not penalize the false positives. It is of utmost importance for many object tracking applications (including ours) that all objects that should be tracked are indeed tracked, especially when there is an increased difficulty in detecting the objects in degraded frames. The modified MOTA (MOTA_mod) is given by

MOTA_mod = 1 − [ Σ_t (m_t + mme_t) ] / [ Σ_t g_t ].  (10)

Higher scores of MOTA_mod and MOTA_full correspond to better tracking of the objects in the video sequence and hence better performance. The experiments are conducted for different values of λ in reference to Eq. 5, which provides the operating point on the rate-distortion curve. This provides different average bit rates over a video sequence, which are a fraction of the maximum rate. For different values of λ, the distortion and the bit rate fluctuate for each frame. However, in practice the communication channel between the chip and the host is bandwidth-limited. Thus the bit rate of the data sent through the channel can only vary within a certain tolerance. In order to keep the bit rate constant, we vary λ for each frame. This mode of operation keeps the rate fixed, within a certain tolerance, but the distortion varies from frame to frame.
ERFORMANCE O PTIMIZATION
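Throughout this section, performance is reported with the two metrics of Eqs. (9) and (10). As a compact reference, both can be computed from per-frame error counts with one helper (names illustrative):

```python
def mota(missed, false_pos, mismatches, gt_counts, count_fp=True):
    """MOTA over a sequence, from per-frame lists of misses m_t, false
    positives fp_t, mismatches mme_t and ground-truth counts g_t.
    count_fp=True gives MOTA_full (Eq. 9); count_fp=False drops the
    false-positive term, giving MOTA_mod (Eq. 10)."""
    errors = sum(m + mme + (fp if count_fp else 0)
                 for m, fp, mme in zip(missed, false_pos, mismatches))
    return 1.0 - errors / sum(gt_counts)
```

For example, two frames with one miss and one false positive against ten ground-truth objects give MOTA_full = 0.8 and MOTA_mod = 0.9.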
The system is designed to achieve good object tracking performance for different bit rates R. The object detector identifies the ROIs, which are then input to the object tracker. Hence, it is the most important component of the host in its role of detecting and tracking objects in each frame. However, the host has access only to the reconstructed frame f̂_t at time t, which is a distorted version of the uncompressed high quality frame. In order to perform well, the Faster R-CNN must also be trained with similarly distorted frames. This improves the detection accuracy of the Faster R-CNN for system-generated distortions at different bit rates.
A. Training the Object Detector
Traditionally, object detectors are trained on data from publicly available datasets such as COCO [52], PASCAL VOC [53] and ImageNet [54], among others. Most of these datasets have been curated using good quality cameras, and the inherent distortion and noise in those image/video frames are low. Thus, these object detectors are finely tuned to the image quality of the particular dataset, and the detection performance worsens once they are tested with other forms of distortion. In order to address this issue and improve the performance of the detector on distorted frames, we resort to training the object detector in a novel two-step approach. This two-step approach achieves much higher performance with system-generated distortions than training with undistorted images. We used the ILSVRC VID dataset [54] to train the Faster R-CNN. Since the work is catered to surveillance applications in ground, air and water scenes, we trained our object detector on Airplanes, Watercrafts and Cars. However, it must be kept in mind that the architecture can work with an object detector trained on any number of classes. The training data in this dataset has been randomly split 70:30 into training and validation data for training the Faster R-CNN.
1) Step I:
In this step, the object detector in the host in Fig. 4 is replaced by ground truth bounding boxes. This creates exact bounding boxes (ROIs) precisely encompassing the entire object while still generating data consistent with the degradation we would see in the system.

The ROIs are then transmitted to the chip. The chip finds the optimal QT according to the ROIs and λ ∈ { , , , , } (the value in the Viterbi optimization algorithm), along with the full undistorted frame f_t on the chip and the previous reconstructed frame f̂_{t−1}. The distortion levels in the system are set by the weights w_i in the ROIs and background. The weights are selected such that the resulting distortion in the background is significantly higher than that in the ROIs. For each value of λ, the entire training data is passed through the architecture, which from f̂_t creates the training and validation dataset for the Faster R-CNN. The data in the original dataset, corresponding to λ = 0, is also included. The Faster R-CNN trained on this distorted data has therefore seen high quality data as well as data with different degrees of distortion corresponding to λ; the higher the λ, the higher the distortion. Ground truth annotations are used for training and validation of the Faster R-CNN.
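The Step-I data generation can be sketched as the following loop. This is an assumption-laden sketch: `reconstruct(frame, rois, lam)` stands in for the chip-side Viterbi/QT acquisition pipeline (not reproduced here), and the λ values are illustrative only, since the paper's exact set is elided above.

```python
# Illustrative lambda values only; the paper's actual set is not shown here.
LAMBDAS = [100, 250, 400]

def build_step1_dataset(frames, gt_rois, reconstruct):
    """Pair each ground-truth-ROI reconstruction with its annotations.
    lam = 0 keeps the pristine frame, mirroring the inclusion of the
    original (undistorted) data in the training set."""
    dataset = []
    for lam in [0] + LAMBDAS:
        for frame, rois in zip(frames, gt_rois):
            f_hat = frame if lam == 0 else reconstruct(frame, rois, lam)
            dataset.append((f_hat, rois, lam))
    return dataset
```

The resulting list pairs every reconstructed frame with the ground-truth annotations, which is what the Faster R-CNN is then trained and validated on.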
2) Step II:
The Faster R-CNN trained in Step I has been trained on perfect bounding boxes which encompass the object completely. However, in actual scenarios, the object detector may produce bounding boxes which do not perfectly align with the object: part of the bounding box may not overlap with the object at all. An example is shown in Fig. 5. The bounding box predicted by the object detector is shown in blue. It does not align perfectly with the ground truth bounding box, which is in white. Clearly, the tail and top of the boat are not covered by the blue bounding box, whereas portions of the background at the bottom of the boat are included in it.
Fig. 5: Bounding box and object partly overlapping.

Regardless, the Kalman Filter predicts ROIs for the next frame based on these imperfect detections. The chip then acquires the next frame based on these imperfections and sends it to the host. Portions of the object inside the ROI will be less distorted and portions outside the ROI will be highly distorted, as per the weight ratio. In order to improve the object detector performance, the Faster R-CNN needs to be trained on this type of unique distortion, where part of the object is segmented finely with low distortion and the rest coarsely with high distortion. This is the objective of the Step II training.

The Faster R-CNN trained in Step I is used as the object detector in the host as in Fig. 4. The bounding boxes detected by the Faster R-CNN are passed to the Kalman Filter to update the state and predict the ROIs in the next frame. The chip reconstructs the frame based on these ROIs predicted by the Kalman Filter. Analogously to the Step-I training, for each value of λ ∈ { , , , , } along with the original dataset (λ = 0), the entire training data is again passed through the architecture, which creates f̂_t, the training and validation data. The ground truth annotations are used for training and validation in this step as well.

During the testing phase, the Faster R-CNN trained in Step I generates bounding boxes closely aligned to the actual physical object. However, it never generates perfect bounding boxes exactly aligned to it; in most cases the detections only partially align with the actual objects.
These bounding boxes are then passed on to the Kalman Filter, which predicts the ROIs imperfectly compared to the actual object and sends them back to the chip. The reconstructed frame on the chip thus has different degrees of distortion across the actual physical object. The Step II training is hence critical, as it trains the Faster R-CNN taking into account the different distortion levels over the object.

The system performance is sensitive to the training data for the object detector. The generation of distorted data for training and validating the Faster R-CNN depends on the weights assigned to the ROIs and elsewhere. This is important as it dictates the extent of relative distortion. Based on randomly selected videos from the training data, for λ ∈ { , , } corresponding to low, medium and high distortions respectively, we chose the weights, with reference to Eq. 7, as w_i = 10 for the ROIs and w_i = 10 for the rest of the regions (background). These choices visually made the distortion between the ROIs and the background distinct, without the background being too heavily distorted compared to the ROIs. An example of such a frame is shown in Fig. 6. The car within the ROI has a finer segmentation (and therefore lower distortion) than the background.

Fig. 6: Example degraded frame of a car for λ = 100.
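The two-level weighting described above can be sketched as a per-pixel weight map. This is a sketch under assumptions: it assumes the weights of Eq. 7 multiply the distortion term, so a larger weight forces finer segmentation (lower distortion) inside the ROIs; the actual magnitudes (the exponents of the paper's w_i = 10 choices) are not recoverable from the text, so the values below are placeholders.

```python
import numpy as np

# Two-level weight map: high weight inside the ROIs (distortion penalized
# more, hence finer QT segmentation), low weight in the background.
# w_roi and w_bg are placeholder magnitudes, not the paper's values.
def weight_map(h, w, rois, w_roi=10.0, w_bg=1.0):
    """rois: list of (x0, y0, x1, y1) boxes in pixel coordinates."""
    wmap = np.full((h, w), w_bg, dtype=float)
    for x0, y0, x1, y1 in rois:
        wmap[y0:y1, x0:x1] = w_roi   # favor low distortion inside the ROI
    return wmap
```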
3) Model Variants:
We compare the tracking performance of the system with object detector models trained on different datasets. Videos including airplanes, cars and watercraft from the ILSVRC VID dataset at different distortions were used for training six different Faster R-CNN models:

1) Pristine NN model: Faster R-CNN trained exclusively with pristine (non-distorted) data
2) Uniform NN model: Faster R-CNN trained with pristine data and uniformly binned ×, × and × data
3) Mixed NN model: Faster R-CNN trained with pristine data and distorted data for a mixed assortment of λ ∈ { , , , , } generated in the Step I training
4) Mixed+ NN model: Faster R-CNN trained with pristine data and distorted data for λ ∈ { , , , , } generated in the Step II training
5) MixedU NN model: Faster R-CNN trained with pristine data, uniformly binned ×, × and × data, and distorted data for λ ∈ { , , , , } generated in the Step I training
6) MixedU+ NN model: Faster R-CNN trained with pristine data, uniformly binned ×, × and × data, and distorted data for λ ∈ { , , , , } generated in the Step II training

For the Mixed+ model, the Mixed model is used on the host to generate distorted data as described in the Step-II training. Similarly, to generate the MixedU+ model, the MixedU model is used as the object detector to generate distorted data in the Step-II training. The NN models were trained using ADAM [80] as the optimizer with a learning rate of 1e-5. Dropout of 0.5 is used while training the models; during testing, no dropout is used.

B. Tracker assisted Detection Framework
The task of object tracking was initially framed as different from object detection, but more recent algorithms have aimed to fuse the two. In this work, we use the region-based object detector (Faster R-CNN) with the Kalman Filter based tracker to form a novel joint Detector-Tracker (JDT) system, shown in Fig. 7. Region based object detectors (e.g., Faster R-CNN) generate many candidate bounding boxes, more than the number of objects in the scene, before eventually removing most of them. To prioritize the candidate bounding boxes overlapping with objects, a set of detection confidence scores is calculated for each candidate bounding box. If the detection confidence score of a candidate bounding box is lower than a pre-defined threshold, that candidate is classified as the "background" class and removed. However, this approach does not take into account any temporal continuity between the frames.

In order to utilize the temporal consistency among the image frames, we introduce the concept of a "tracking confidence score" to describe the likelihood of a given bounding box containing a tracked object. Similar to the detection confidence scores, we introduce multiple tracking confidence scores, one for each object class. The tracking confidence scores are computed from the highest Intersection over Union (IoU) values between all candidate bounding boxes and the bounding box predicted by the tracker. Additional constraints are enforced while computing the IoU in order to remove false positives: (1) candidate bounding boxes with IoU < . are rejected, and (2) candidate bounding boxes with a difference in size greater than are not considered.

The joint confidence score C_j is computed from the detection score C_d and tracking score C_t using Eqn. (11), with w_d and w_t as tunable parameters weighting the detection and tracking confidence scores respectively:

C_j = √(w_d C_d + w_t C_t).    (11)

Combining the tracking and detection scores of the candidate bounding boxes is tricky. We fuse the two scores into a joint confidence score satisfying three requirements: (1) bounding boxes containing objects entering the scene should not have their score penalized by a lack of tracking information; (2) bounding boxes that have a low detection score but a high tracking score should have their joint score boosted by virtue of the high tracking score; and (3) bounding boxes with mediocre detection and tracking scores should have a lower joint score than a bounding box with at least one excellent confidence score. With a drop in the quality of the frames, candidate bounding boxes with low detection scores must be compensated by high tracking scores, while an object entering the scene without any tracking history is rewarded through a high detection or tracking score without penalizing cases where one score is much lower than the other.

VI. EXPERIMENTAL RESULTS
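The experiments below use the joint score of Eq. (11). As a reference, here is a minimal sketch of the scoring rule together with the IoU-based tracking confidence; the IoU floor and the 1:1 weights shown are illustrative, since the exact threshold values are elided in the text above, and the size-difference constraint is omitted.

```python
import math

def iou(a, b):
    """Intersection over union of two boxes (x0, y0, x1, y1)."""
    x0, y0 = max(a[0], b[0]), max(a[1], b[1])
    x1, y1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x1 - x0) * max(0, y1 - y0)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / float(area(a) + area(b) - inter)

def joint_score(det_score, candidate, tracked_boxes,
                w_d=1.0, w_t=1.0, iou_floor=0.3):
    """C_j = sqrt(w_d*C_d + w_t*C_t), Eq. (11). The tracking score C_t is
    the best IoU against the tracker's predicted boxes, zeroed below the
    floor to suppress false positives. iou_floor is a placeholder value."""
    c_t = max((iou(candidate, t) for t in tracked_boxes), default=0.0)
    if c_t < iou_floor:
        c_t = 0.0
    return math.sqrt(w_d * det_score + w_t * c_t)
```

A weak detection that overlaps a tracked box perfectly is boosted to √(C_d + 1), while a box with no tracking history keeps √(C_d), matching the fusion requirements listed above.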
The experimental performance of the system is demonstrated by simulating the proposed model on three sequences of the ILSVRC VID dataset: (i) a video of airplanes, ILSVRC2015_val_00007010.mp4; (ii) a video of a watercraft, ILSVRC2015_val_00020006.mp4; and (iii) a video of cars, ILSVRC2015_val_00144000.mp4. These videos were selected to have optically small, medium, and large sized objects, as well as sequences with one, two and multiple objects. The frames were resized to × to accommodate the QT structure; the maximum depth of the tree is thus N = 9.

Fig. 7: Tracker assisted Object Detection. The candidate bounding boxes are given detection confidence and tracking confidence scores. The tracking confidence score is updated based on predictions from the object tracker. Both scores are combined to modify the bounding box score before applying the bounding box filter.

A. Variation of Distortion with Rate
The amount of distortion at different bit rates is an important characteristic of the data generated by this system. We compute the distortion in terms of the PSNR and SSIM metrics for the sequences at different bit rates, as shown in Table I. The uncompressed bit rate is 62.91 Mbits/s, and the PSNR and SSIM have been computed at different percentages of this uncompressed bit rate. For the small and medium sized objects, the PSNR and SSIM hold up at low bit rates, while for the relatively larger object (the boat sequence) the distortion is significantly higher, as shown by the low PSNR and SSIM values. The performance of the system has been optimized at distortion levels so high that the object is almost not recognizable. A sample frame f̂ for the different sequences at a bit rate of . of the maximum is shown in Fig. 8. Clearly the tail of the boat in Fig. 8(b) is not recognizable, while the cars and planes can still be recognized at such bit rates. It should be kept in mind that the PSNR and SSIM values reflect only visual quality; the end performance of the system is dictated by the MOTA metric.

TABLE I: SSIM and PSNR for the sequences at different bit rates

Sequence            Bit Rate (%)   PSNR (dB)   SSIM
Airplane Sequence   0.75           26.4150     0.8962
                    1.00           28.4576     0.9166
                    1.50           32.1420     0.9477
                    2.00           35.0587     0.9639
                    3.00           37.7115     0.9745
                    5.00           39.5348     0.9809
                    7.00           40.1643     0.9836
                    10.00          40.5849     0.9852
                    25.00          41.1766     0.9867
Boat Sequence       0.75           16.7498     0.4214
                    1.00           17.9076     0.4411
                    1.50           20.2070     0.4685
                    2.00           21.8621     0.5301
                    3.00           23.8621     0.5963
                    5.00           25.0364     0.6713
                    7.00           26.3222     0.7194
                    10.00          27.8823     0.7724
                    25.00          32.5498     0.8964
Car Sequence        0.75           22.1733     0.7390
                    1.00           24.2945     0.7683
                    1.50           27.8283     0.8335
                    2.00           29.7073     0.8609
                    3.00           32.3193     0.9018
                    5.00           35.0724     0.9348
                    7.00           36.8468     0.9508
                    10.00          38.5461     0.9638
                    25.00          40.6701     0.9795

Fig. 8: Degraded frame f̂ generated at . of the maximum bit rate: (a) Airplane Sequence, (b) Watercraft Sequence, (c) Car Sequence. The background has more distortion than the objects themselves. We ask the reader to zoom in on each of the frames to see the degradations.

B. Operation at constant λ

In this mode of operation, λ is kept constant, so the rate and distortion fluctuate per frame. MOTA is computed for each sequence to assess the performance of the system trained with the different object detectors; the effect of the 2-step training methodology is demonstrated here. Tracker assisted object detector upscoring is not included in this subsection of experiments.

Fig. 9 shows the detections in the distorted frames of the airplane, car and watercraft sequences for each of the six Faster R-CNN models, with distorted frames generated at λ = 400. The Pristine NN detector fails to detect the objects in each of the three cases, while the Uniform NN detector detects a few objects. The Mixed NN and MixedU NN detectors are able to detect almost all of the objects; however, the bounding boxes given by these detectors either overfit the objects, including excess background, or underfit them. On the other hand, the Mixed+ and MixedU+ NN detectors do a better job of fitting the bounding box to the objects in the scene with minimal background.

Fig. 10(a) and 11(a) show the MOTA_full and MOTA_mod curves for the airplane sequence, considering Eqns. (9) and (10) respectively. There are seven small sized airplanes within the full frame, and some become obscured over time. The system trained with the Pristine NN shows significant deterioration in performance after λ = 250. The performance of the Uniform NN is significantly better for higher λ values than the Pristine NN detector, indicating that the QT for small objects can be replaced with uniform binning. However, the performance of the Mixed and MixedU NN-based detectors is better than the Uniform NN detector, clearly suggesting the benefits of using actual degraded data generated by the system for training the Faster R-CNN. The best performance is obtained by using the Mixed+ and MixedU+ NNs. The Mixed+ NN detector performs slightly better than the MixedU+ NN detector, since its degradations correspond exactly to the QT binning, whereas the MixedU+ NN has been trained on actual system generated distortions as well as uniformly binned data. Thus the 2-step training strategy does help in improving the performance metric.

The performance of the system when tested on medium-sized cars is shown in the MOTA_full and MOTA_mod curves of Fig. 10(c) and 11(c). Here as well, the Pristine NN has a performance drop after λ = 250, and the Uniform NN detector performs better at higher λ values than the Pristine NN detector. The Mixed NN detector has higher accuracy than the Uniform NN detector. The MixedU NN detector performance is worse than the Mixed NN and Uniform NN detectors, indicating that training using both system generated and uniform distortions may lead to sub-optimal performance. However, with the Step-II training, the performance of MixedU+ is greater than MixedU. The Mixed+ NN detector performance is within about 0.05 at worst (λ = 100) of the Mixed NN detector for most of the λ values. The Mixed NN detector, trained only once with the system generated data, has performance close to that of the Mixed+ and MixedU+ NN detectors. Overall, however, the performance of the MixedU+ and Mixed+ NN detectors is better than that of the pristine detectors.

The watercraft sequence has a large boat which occupies most of the frame during the entire sequence. The performance of the system for this sequence is shown in Fig. 10(b) and 11(b). The Pristine NN performance drops significantly beyond λ ≥ , as in the previous two cases of small and medium sized objects. The Uniform NN detector performance is lower for most of the λ values than the Pristine NN detector, while the Mixed NN and MixedU NN detectors perform better than the Pristine NN detector. Surprisingly, the MixedU NN detector performance is higher than the Mixed NN detector's. This implies that for large sized objects the system generated distortion differs from uniformly binned distortions, and training the detector with both types of distortion actually aids performance. The performance of the Mixed+ NN detector is better than the Mixed NN detector. However, the performance of the MixedU NN detector is higher due to fewer false positives.
When no false positives are considered in our metric, the performances of the MixedU+ NN, MixedU NN and Mixed+ detectors are very similar. The MixedU+ detector performance is within about 0.05 at worst (λ = 100) of the MixedU NN detector for most of the λ values. In this case, the 2-step training process does improve the system performance, especially under the MOTA_mod metric, highlighting the benefit of this training process.

From our experimental studies, we observe that the Mixed and MixedU detectors are able to perform better for medium and large sized objects respectively, mostly due to the lack of false positives. However, the performance of the Mixed+ and MixedU+ detectors is the best among the different Faster R-CNN models across the board, especially when false positives are ignored. It is also observed that when background objects are significantly present (as in the boat sequence), the MixedU and Mixed NN detectors tend to perform better when false positives are considered in MOTA_full. Nevertheless, the experimental studies suggest the benefits of the 2-step training process for improving the performance metric in most cases. The object detectors trained only once (MixedU and Mixed) show performance improvements over the Pristine NN detector as well, but in general the performance gains are lower than those of the 2-step trained MixedU+ and Mixed+ models.

Fig. 9: Top row: f̂ of the Airplane Sequence; middle row: f̂ of the Car Sequence; bottom row: f̂ of the Watercraft Sequence. Degraded frames generated at λ = 400 for the Pristine, Uniform, Mixed, MixedU, Mixed+ and MixedU+ NN detectors. We ask the reader to zoom in on each of the frames to see the degradations.

Fig. 10: MOTA_full vs λ for the (a) Airplane, (b) Watercraft and (c) Car sequences.

C. Operation at Constant Bit Rate
In this mode of operation, we force the bit rate to be constant as a fraction of the maximum bit rate (within a tolerance of of the fractional bit rate). This makes λ and the distortion fluctuate in each frame and in each sequence. The detector has been trained with the 2-step strategy; tracker assisted object detector upscoring is not included in this subsection of experiments. MOTA_full and MOTA_mod are computed for each of these rates. Figs. 12 and 13 show the plots of MOTA_full and MOTA_mod vs bit rate as per Eqns. (9) and (10) respectively. We have computed the performance using the Pristine, Uniform, Mixed, MixedU, Mixed+ and MixedU+ NN detectors to show the overall performance with each detector. Both MOTA_full and MOTA_mod increase initially with the increase in the bit rate for the Airplane, Watercraft and Car sequences, and then remain approximately constant. The false positives are very few, as the MOTA_full and MOTA_mod values are close to each other. The performances of the Mixed+, MixedU+ and MixedU NN detectors are close to each other, with the MixedU NN detector producing fewer false positives. Across the board, however, the Mixed+ detector performs more consistently than the MixedU+ detector.

It is also pointed out that for the watercraft sequence, especially at lower bit rates ( < ), some frames have values of λ well over λ = 650, the maximum λ for which we had trained the detectors. Yet the system trained at medium distortions can perform quite well even at these higher distortions. This shows the robustness of the 2-step training process at distortion levels worse than the trained distortion levels. The early convergence of the curves to high MOTA_full and MOTA_mod accuracy at low bit rates shows the effectiveness of the 2-step training procedure over using a Pristine NN detector. The system performance has been shown for 0.75% to 25% of the maximum bit rate of 62.91 Mbits/s, which is the desired range of operation.

Fig. 11: MOTA_mod vs λ (Eq. (10)) for the (a) Airplane, (b) Watercraft and (c) Car sequences.
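The per-frame λ adjustment behind this constant-rate mode can be sketched as a search over λ. This is a sketch under two assumptions: that the rate decreases monotonically with λ, and that `encode(frame, lam)` (a stand-in for the chip-side Viterbi/QT pipeline, not reproduced here) returns the bits spent plus the reconstruction.

```python
# Per-frame search for the lambda whose rate lands within a tolerance of the
# target. Bisection assumes rate is monotonically decreasing in lambda.
def lambda_for_rate(frame, encode, target_bits, tol=0.05,
                    lam_lo=0.0, lam_hi=1000.0, max_iter=20):
    lam = 0.5 * (lam_lo + lam_hi)
    for _ in range(max_iter):
        lam = 0.5 * (lam_lo + lam_hi)
        bits, _ = encode(frame, lam)
        if abs(bits - target_bits) <= tol * target_bits:
            break                     # within tolerance: accept this lambda
        if bits > target_bits:
            lam_lo = lam              # too many bits: increase lambda
        else:
            lam_hi = lam              # too few bits: decrease lambda
    return lam
```

With a toy rate model such as bits(λ) = 1000/(1+λ), the search settles on a λ whose rate is within the requested tolerance of the target.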
D. Operation with Tracker assisted Detection Framework
We evaluate the performance of the system with tracker assisted object detection together with the 2-step training strategy for the object detector described in the previous section. We perform a parametric evaluation of the system performance with varying tunable detection weight w_d and tracking weight w_t, as shown in Fig. 14. The Mixed+ object detector has been used in these experiments, as it provides one of the best performances for the system, as shown in the previous subsection. We observe that for a fixed w_t (referred to in Fig. 14 as w ), the performance deteriorates with a reduction in w_d (referred to in Fig. 14 as wd) in most of the cases. On the other hand, for a fixed w_d, the performance of the system improves when w_t is increased. Based on our experimental results, we find that the best MOTA_mod performance is obtained in most cases with w_d = 1 and w_t = 1. It is clearly evident from our experiments that there is a significant increase in the system performance when the object detector is assisted by the tracker, compared to the system with no assistance from the object tracker (w_d = 1 and w_t = 0), especially when there is significant background, as in the boat and car sequences.

E. Comparison with other methods
We compare the performance of our method with three other techniques. One alternative compression technique is simple binning of images (without using our system) into ×, ×, × and × blocks, with each block taking the intensity value equal to the average of the individual pixels within the block. In the case of uniformly binned frames, the pristine detector is used to evaluate the MOTA metric. Alternatively, the video is separately compressed using the sophisticated H.264 (AVC) and H.265 (HEVC) techniques, which are the most commonly used video compression standards in the video and telecom industry. We utilize the FFmpeg library libx265 with its HEVC video encoder wrapper (x265); similarly, for H.264 compression the FFmpeg library libx264 is used. We use a -pass encoding scheme for both H.264 and H.265 as the rate control mode to limit the bit rate. For a fair comparison, we compute the performance metric at the same bit rates of . , . , . and of the maximum bit rate, which are identical to / , / , / and / of the maximum bit rate respectively. The MOTA_mod performance of the videos compressed with naive binning and with the AVC and HEVC standards has been evaluated with pristine object detectors. These compression standards compress videos with high PSNR and high quality, which makes it reasonable to use the pristine object detector for a fair comparison. In our proposed system, we use the Mixed+ and MixedU+ detectors assisted by the tracker. We see in Fig. 15 that the performance of naive simple binning deteriorates at rates less than of the maximum rate. On the other hand, the performance of our system and of the H.264 and H.265 compressed videos does not deteriorate at lower bit rates. In fact, the MOTA performance of our system is better than the H.264 and H.265 encoded videos in most of the cases. It must be kept in mind that sophisticated video coding techniques such as H.264 or H.265 are computationally heavy and not suitable to be applied directly on a resource constrained chip (as in our current architecture).
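The baseline encodes can be set up with FFmpeg along the following lines. This is a sketch: the file names are placeholders, the bit-rate fraction shown is illustrative (the exact fractions are elided above), and running the command requires an FFmpeg build with libx264/libx265.

```python
# Build a rate-limited FFmpeg encode command for the H.264/H.265 baselines.
# Input/output names are placeholders; only standard FFmpeg flags are used.
MAX_RATE_BPS = 62.91e6   # uncompressed rate of the test sequences (Table I)

def ffmpeg_cmd(src, dst, fraction, codec="libx265"):
    """Target `fraction` of the maximum bit rate with the given encoder
    (codec="libx264" for the H.264/AVC baseline)."""
    bps = int(MAX_RATE_BPS * fraction)
    return ["ffmpeg", "-y", "-i", src,
            "-c:v", codec, "-b:v", str(bps),
            dst]
```

For example, `ffmpeg_cmd("in.mp4", "out.mp4", 0.01)` targets 1% of the maximum rate with the HEVC encoder; the returned list can be passed to `subprocess.run`.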
Thus, with the current computationally constrained chip, our proposed system achieves good tracking accuracy compared to current state-of-the-art compression standards such as H.264 and H.265.

Fig. 12: MOTA_full vs rate curves for the Pristine, Uniform, Mixed, MixedU, Mixed+ and MixedU+ detectors: (a) Airplane, (b) Watercraft, (c) Car.

Fig. 13: MOTA_mod vs rate curves for the Pristine, Uniform, Mixed, MixedU, Mixed+ and MixedU+ detectors: (a) Airplane, (b) Watercraft, (c) Car.

Fig. 14: System performance with varying w_d and w_t with the Mixed+ detector for the (a) airplane, (b) watercraft and (c) car sequences with the tracker assisted detector.

VII. DISCUSSION
In this work, we propose an intelligent algorithm for adaptive sampling of high bit rate data captured by an imager (chip), optimized together with a reconstruction algorithm for object detection and tracking on a remote host. The model has been developed assuming a chip with low computational power and a remote host with high computational power. In this framework, the communication channel between the chip and host has limited bandwidth and thus limited data transfer capability. The chip performs the Viterbi optimization for generating the QT and skip/acquire modes, while the host performs the tasks of object detection and tracking, along with predicting the ROIs for the chip at the next time instant. The performance curves of MOTA_full and MOTA_mod indicate that the performance of the system deteriorates for the Pristine NN model beyond λ = 250. This is consistent among all the categories of objects, which have different sizes. It is also evident that the performance of the Faster R-CNN depends on the level of QT binning of the ROIs. The edges of the objects get distorted significantly based on the level of QT binning. Additionally, the texture of the object is affected by the QT binning, which in turn affects the detector performance. This is consistent with the observation in [81] that ImageNet-trained CNNs are biased towards texture rather than shape.

In our investigation, we find that at high distortions the background influences the number of false positives. In the case of a flat background, as in the airplane sequence, the false positives are fewer; they increase in the boat and car sequences, which have significant content in the background. The dataset contains small, medium and large sized objects in each class. For high λ, the distortion is very high and small objects are binned very similarly to the background. This produces false detections when there is sufficient background content, as the CNN identifies portions of the background as objects. The Faster R-CNN was trained to have good accuracy in detecting objects of different classes and sizes, which results in more false positives at higher λ values and reduces the MOTA_full scores. Both MOTA_full and MOTA_mod scores increase with an increase in bit rate and then saturate. As the rate reduces, the distortion increases; however, both detectors trained in the 2-step process perform better at low rates than the Pristine NN detector. Interestingly, the detector trained only once with a mixture of uniformly binned images and system generated images has comparable performance, especially over varying bit rates.

We also observe that adding tracker assisted object detection on top of the 2-step training strategy further improves the MOTA. A detailed study of the relative weighting of the detection confidence and tracker confidence of the proposal bounding boxes has been carried out, finding optimal weights of 1:1, which improves the MOTA scores across the board. The performance of the system is comparable to the sophisticated AVC and HEVC techniques, which require high computational power on the device. Additionally, our performance metrics are higher than naive binning techniques, especially at lower bit rates.

One limitation of this work is that no sensor-host hardware system has been developed so far to test this framework (to the best of the authors' knowledge). The gigapixel FPAs currently available in the market lack the ability to allocate bits non-uniformly in different regions of the image frame.

Fig. 15: Comparison of MOTA_mod vs rate curves (Eq. (10)) for the Binned, Mixed+, MixedU+, H.264 and H.265 videos: (a) Airplane, (b) Watercraft, (c) Car.

VIII. CONCLUSION
In summary and conclusion, this paper proposes a novelsystem using a host-chip architecture for video acquisitionoptimized for object detection and tracking especially designedfor computationally constrained chip (edge devices). Althoughthe system is based on QT compression driven by the ROIs,this architecture is generalizable to other forms of region/blockbased compression as well. Since QT sub-blocks are inherentin AVC, HEVC and VVC standards, we focus on the optimiza-tion of the host-chip architecture based on QT decomposition.Future work will involve exploring other region based com-pression techniques. A Viterbi-based optimization was usedto generate the acquisition modes in the FPA along withthe optimal QT structure that minimizes the area-normalized,weighted rate-distortion equation. The optimization algorithmtakes into account the priority regions in the scene based onobjects of interest. An object detector, Faster R-CNN, is usedto detect the ROIs based on the class of object. A novel 2-step training methodology of the Faster R-CNN is applied.In Step-I, we train using ground truth boxes as the outputof the detection step to generate training data. In Step-II,we generate new image data using the Step-I detector in oursystem instead of the ground truth bounding boxes as before.This is done in order to have more realistic data (e.g. imperfectbounding boxes from the last frame affecting the distortion inthe present frame). The ROIs from the detector in the currentframe are used by the Kalman filter-based tracker to predictthe ROIs in the next frame. Another novel tracker assistedupscoring of the object detector has been implemented whichaids in further improving the MOTA performance metric. Theperformance of the system is measured by the
MOTA_full and MOTA_mod scores. The results of our method show significant improvements in tracking performance and demonstrate the strength of this host-chip architecture under different operating conditions. Compared to the state-of-the-art, highly sophisticated compression techniques employed in image/video coding standards, our system performs better in most of the experimental cases.
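The Kalman filter-based ROI prediction summarized above can be sketched as a constant-velocity filter over a bounding-box centre. The state layout, noise levels, and class name below are illustrative assumptions, not the paper's exact tracker parameters:

```python
import numpy as np

# Minimal constant-velocity Kalman filter over a bounding-box centre,
# sketching how a tracker predicts the next-frame ROI from detections.
# State layout and noise magnitudes are illustrative assumptions.

class KalmanBoxTracker:
    def __init__(self, cx, cy):
        # State: [cx, cy, vx, vy]; start at the detection, zero velocity.
        self.x = np.array([cx, cy, 0.0, 0.0])
        self.P = np.eye(4) * 10.0                  # state covariance
        self.F = np.eye(4)                         # constant-velocity model
        self.F[0, 2] = self.F[1, 3] = 1.0
        self.H = np.eye(2, 4)                      # we observe position only
        self.Q = np.eye(4) * 0.01                  # process noise
        self.R = np.eye(2) * 1.0                   # measurement noise

    def predict(self):
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[:2]                          # predicted ROI centre

    def update(self, cx, cy):
        z = np.array([cx, cy])
        y = z - self.H @ self.x                    # innovation
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)   # Kalman gain
        self.x = self.x + K @ y
        self.P = (np.eye(4) - K @ self.H) @ self.P

# An object drifting right by ~2 px/frame: after a few detections the
# filter extrapolates the next-frame centre toward roughly (18, 50).
trk = KalmanBoxTracker(10.0, 50.0)
for cx in (12.0, 14.0, 16.0):
    trk.predict()
    trk.update(cx, 50.0)
print(trk.predict())
```

In the full system, the predicted centre (together with the box extent) would define the priority region handed to the QT rate-distortion optimization for the next frame.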
ACKNOWLEDGMENT
The authors are grateful to the Defense Advanced Research Projects Agency (DARPA) for funding this project. This work is supported in part by DARPA Grant No. HR0011-17-2-0044.