A Bayesian Filter for Multi-view 3D Multi-object Tracking with Occlusion Handling
Jonah Ong, Ba-Tuong Vo, Ba-Ngu Vo, Du Yong Kim and Sven Nordholm
Abstract—This paper proposes an online multi-camera multi-object tracker that only requires monocular detector training, independent of the multi-camera configurations, allowing seamless extension/deletion of cameras without retraining effort. The proposed algorithm has a linear complexity in the total number of detections across the cameras, and hence scales gracefully with the number of cameras. It operates in the 3D world frame, and provides 3D trajectory estimates of the objects. The key innovation is a high-fidelity yet tractable 3D occlusion model, amenable to optimal Bayesian multi-view multi-object filtering, which seamlessly integrates, into a single Bayesian recursion, the sub-tasks of track management, state estimation, clutter rejection, and occlusion/misdetection handling. The proposed algorithm is evaluated on the latest WILDTRACK dataset, and demonstrated to work in very crowded scenes on a new dataset.
Index Terms—Multi-view, Multi-sensor, Multi-object Visual Tracking, Occlusion Handling, Generalized Labeled Multi-Bernoulli
1 INTRODUCTION

The interest of visual tracking is to jointly estimate an unknown time-varying number of object trajectories from a stream of images [1]. The challenges of visual tracking are the random appearance/disappearance of the objects, false positives/negatives, and data association uncertainty [2]. Multiple object tracking (MOT) algorithms can operate online to produce current estimates as data arrives, or in batch mode, which delays estimation until further data is available [3], [4]. In principle, batch algorithms are more accurate than online ones as they allow better data integration into the estimates [2], [5], [6], [7]. Online algorithms, however, tend to be faster and hence better suited for time-critical applications [4], [8], [9], [10], [11].

The common sub-tasks, traditionally performed by separate modules in an MOT system, are track management, state estimation, clutter rejection, and occlusion/misdetection handling. Track management involves the initiation, termination and identification of trajectories of individual objects, while state estimation is concerned with determining the state vectors of the trajectories. Problems such as track loss, track fragmentation and identity switching are caused by false negatives, which can arise from occlusions when objects of interest are visually blocked from a sensor, or from misdetections when the sensor/detector fails to register objects of interest. On the other hand, false positives can lead to false tracks and identity switching. Hence, occlusion/misdetection handling and clutter rejection are critical for improving tracking performance.

While occlusion handling is just as challenging as the other sub-tasks, theoretical developments are few and far between [12]. This is due mainly to the complex object-to-object and object-to-background relationships, as well as computational tractability because, theoretically, all possible partitions of the set of objects need to be considered [4]. In a single-view setting, useful a priori information about the objects of interest is exploited to resolve occlusions [2], [6], [11], [13]. However, there are fundamental limitations on what can be achieved with single-view data. In contrast, a multi-view setting naturally allows exploiting complementary information from the data to resolve occlusions, since an object occluded in one view may not be occluded in another [14]. Furthermore, from an information-theoretic standpoint, data from diverse views will reduce the uncertainty on the set of objects of interest, thereby improving overall tracking performance. Given the proliferation of cameras in today's world, it is imperative to develop effective means for making the best of the information-rich multi-view data sources, not only for occlusion handling, but ultimately to achieve better visual tracking.

The perennial challenge in multi-view visual MOT is the high-dimensional data association problem between the detections and objects, across different views/cameras [12], [15]. Two common architectures for multi-view MOT are shown in Fig. 1. So far the best solutions are batch algorithms with the architecture in Fig. 1 (a).

• J. Ong, B.-T. Vo, B.-N. Vo, and S. Nordholm are with the Department of Electrical and Computer Engineering, Curtin University, Bentley, WA 6102, Australia. E-mail: {j.ong1, ba-tuong.vo, ba-ngu.vo, s.nordholm}@curtin.edu.au
• D.Y. Kim is with the School of Engineering, RMIT University, Melbourne, Australia. E-mail: [email protected]
These solutions are based on: generative modeling and dynamic programming [15]; convolutional neural network (CNN) multi-camera detection (MCD), trained on multi-view datasets [16], followed by track management [17]; and MCD via multi-view CNN training combined with Conditional Random Field (CRF) models to exploit multi-camera geometry (followed by track management) [18]. These MCD-based MOT solutions, which produce trajectories on the ground plane, have been shown to outperform previous works [16], and demonstrated remarkable performance in crowded scenarios [18]. Note that such data-centric MCDs require retraining when the multi-camera system is extended/reconfigured, and that training/learning is expensive as the input space is very high-dimensional due to the large number of possible combinations across the cameras [19]. In practice, it is desirable for a multi-view MOT system to produce trajectories in the 3D world frame, online, and to require no retraining for multi-camera extension/reconfiguration (including camera failures) so as to operate uninterrupted.

Fig. 1: Multi-view Architectures: (a) Multi-view Detection + Single-sensor Multi-object Tracking [17]; (b) Monocular Detection + Multi-sensor Multi-object Tracking.

This paper proposes a model-centric, online multi-view visual MOT solution that only requires monocular detector training, independent of the multi-camera configurations, via the architecture of Fig. 1 (b). Hence, no retraining of the detectors is needed when the multi-camera system is extended/reconfigured. More importantly, our algorithm has a linear complexity in the total number of detections, and thereby scales gracefully with the number of cameras. The algorithm intrinsically operates in the 3D world frame by exploiting multi-camera geometry, allowing it to track people jumping and falling, which is suitable for applications such as sports analytics, aged care, school environment monitoring, etc. We validate the proposed method on the latest WILDTRACK dataset on the ground plane and show comparable results with Deep-Occlusion+KSP+ptrack [17]. To evaluate tracking performance in the 3D world frame, we develop a new dataset with varying degrees of difficulty on scenarios with very closely spaced people, with addition/deletion of cameras during operation, and with people jumping and falling.

The key innovation is a high-fidelity yet tractable 3D occlusion model, amenable to Bayesian multi-sensor multi-object filtering [20], which seamlessly integrates, into a single Bayesian recursion, the sub-tasks of track management, state estimation, clutter rejection, and occlusion/misdetection handling. In the Bayesian paradigm, the multi-object filtering density captures all information on the set of trajectories in 3D, encapsulated in the observations, as well as the dynamic and observation models. The novel occlusion model, incorporated in the multi-object measurement likelihood function, enables the MOT Bayesian filter to correctly maintain occluded tracks that would have otherwise been incorrectly terminated. The schematic in Fig. 2 shows the integration of the novel occlusion model into a near-optimal multi-sensor multi-object Bayes filter known as the Multi-Sensor Generalized Labeled Multi-Bernoulli (MS-GLMB) filter [20]. This configuration enables the proposed algorithm, herein referred to as Multi-View GLMB with OCclusion modeling (MV-GLMB-OC), to address occlusions, and inherits the numerical efficiency of the MS-GLMB filter.

Fig. 2: MV-GLMB-OC filter processing chain. Monocular detections from multiple cameras are fed into the filter, which outputs the filtering density. This output is fed into: the estimator to generate track estimates; and back into the filter to process detections at the next time. The Occlusion Model (red) is an add-on that takes the filter output and computes the detection probabilities for the filter on-the-fly.

In short, our main technical contributions are:
• A tractable and realistic detection model that accommodates 3D occlusion by taking into account the Lines of Sight (LoSs) of all objects in the scene with respect to the cameras. In contrast, conventional detection models either neglect the LoSs of the objects or are computationally intractable, leading to poor tracking performance in the presence of occlusions. Our new detection model can be regarded as a generalization of tractable conventional detection models;
• The first Bayesian multi-view MOT filter for such a detection model, which resolves occlusion online and is scalable with the number of sensors. Experiments show better performance than the latest multi-camera tracking algorithm;
• A new dataset with full 3D annotations (not restricted to the ground plane), in terms of position and extent in all 3 x, y, z-coordinates, including sequences that involve changes in the z-coordinate due to people jumping and falling. Instead of reporting performance for the entire scenario duration (as done traditionally), we also introduce live or online tracking performance evaluation over time, using the OSPA(2) metric [21], to characterize the behavior of the algorithm and demonstrate uninterrupted operation when the multi-camera system is extended/reconfigured.

The rest of the paper is organized as follows. Section 2 presents the related work. Section 3 formulates the multi-view MOT problem, including the proposed occlusion/detection model, and the new tractable filter with occlusion handling capability via optimal Bayesian estimation. Section 4 presents the implementation of the algorithm. Section 5 presents experimental results and discussions. Finally, some conclusions are drawn in Section 6.
2 RELATED WORK
A deep Convolutional Neural Network (CNN) trained on a large-scale high-resolution image dataset, with efficient implementations such as Fast/Faster R-CNN [22], [23], has been shown to outperform all previous object detectors based on hand-engineered features, e.g. the Aggregated Channel Features (ACF) object detector [24]. Faster R-CNN introduces the concept of a Region Proposal Network (RPN) and exploits feature sharing together with an efficient multi-scale solution to improve test-time speed and detection accuracy, achieving real-time detection at 5 frames per second (fps) [23]. Recently, the You Only Look Once (YOLO) real-time object detector, which attains 40 fps at a mAP of 76.8% (resolution of 544x544) on PASCAL VOC 2007, has gained immense popularity [25]. In contrast to the aforementioned techniques that rely on a sliding classifier for every image, YOLO's impressive speed is achieved by only scanning the image once. Additionally, spatial constraints, introduced to eliminate unlikely bounding boxes, allow trade-offs between speed and accuracy via a suitable score threshold [26]. The YOLO detector can also be extended to 3D [27]. The main drawback is the inability to detect small objects due to the imposed spatial constraints [26].

Progress in object detection has facilitated the development of many tracking-by-detection approaches that typically join the detections together to form consistent trajectories [8], [28], [29]. Tracking-by-detection can be designed for batch or online operation. Online algorithms tend to be faster and better suited for time-critical applications, but may be prone to irrevocable errors if objects are undetected in several frames or if detections at different times are incorrectly joined [2]. Such errors can be reduced by global trajectory optimization over batches of frames [2], [3], [5], [6], [7]. However, track loss and fragmentation can still be caused by occlusion, which is an active area of research in itself [28]. In single-view/monocular settings, a popular approach to occlusion handling is to exploit a priori knowledge of the scene [2], [6], [7]. Deep neural network techniques that leverage spatio-temporal information in the images have been shown to perform well in autonomous driving [30], [31].

In a multi-view setting, complementary information from the data can be exploited to resolve occlusions naturally, since an object occluded in one view may not be occluded in another view [14]. The hierarchical composition approach in [3] uses monocular information from multiple views to construct estimates in the ground plane. However, this approach is susceptible to reprojection errors and ignores occlusions [18]. In [32], the author formulates an occlusion model based on 2D silhouette-based visual angles from multiple views. Subsequently, a simple approach is to pre-process images from individual views (e.g. via background subtraction) from which occupancy (on the ground plane) can be estimated using a Probability of Occupancy Map (POM) [15]. A more sophisticated approach was proposed in [12], which combines multi-view Bayesian network modeling of occlusion relationships and homography correspondence, across all views, with height-adaptive projection (HAP) to obtain final ground plane detections [12].
Stereo-based MOT approaches have also demonstrated improved 3D object estimation and tracking [33], [34], [35].

So far, the best multi-view tracking solution is based on a multi-camera detection (MCD) architecture that uses a CNN to train multi-view detectors from monocular and multi-view data [16], together with batch processing to compute global trajectories on the ground plane [17]. Combined with Conditional Random Field (CRF) modeling and Mean Field variational inference, this approach achieves remarkable performance in crowded scenarios [18]. This approach is more data-centric than model-centric as the multi-camera detection relies mostly on training from data. Hence, large training sets are required, and the learning algorithm tends to be computationally expensive in exploring tight convergence levels, especially for high-dimensional scenarios (e.g. a large number of cameras) [19]. More examples of deeply learned multi-view approaches are found in [36], [37]. To the best of our knowledge, no online MOT algorithm has produced comparable tracking performance with these data-centric batch solutions.

In practice, it is desirable to have online algorithms whose complexity scales linearly with the number of cameras, and which do not require multi-view training, so that reconfiguration (including addition and deletion) of cameras can be performed without interruption to the operation. Moreover, in a multi-view context, it is more prudent to have trajectories in the 3D world frame for applications such as sports analytics, aged care, school environment monitoring, etc. While there are solutions to online 3D multi-view MOT with monocular data such as [38], [39], they do not scale gracefully with the number of cameras. Similar to the aforementioned batch-processing methods, these solutions are more data-centric as they rely, respectively, on deep training for object depth information, and on motion learning.

At the other end of the spectrum are the model-centric approaches that rely largely on physical models of the dynamics of the objects, and on the geometry and characteristics of the sensors/cameras. Such model-based solutions to 3D online MOT with monocular data, using 2D object detections, 3D object proposals, and 3D point cloud techniques, were developed, respectively, in [33], [40], [41]. From a state-space modeling perspective, a natural choice for online MOT is the multi-object Bayes filter [42]. Since the inception of the Random Finite Set (RFS) framework for multi-object state-space models, a number of multi-object Bayesian filters have been developed [43], [44] and applied to visual MOT problems [4], [10], [45]. The latest is the Generalized Labeled Multi-Bernoulli (GLMB) filter, an analytic solution to the multi-object Bayes filter that jointly estimates the number of objects and their trajectories online [46]. The salient feature of this approach is that it seamlessly integrates track management, state estimation, clutter rejection, occlusion/misdetection handling and multiple sensor data into a single recursion [4]. In this article, we use this framework to develop an online 3D multi-view MOT solution that only requires one-off monocular detector training (or off-the-shelf monocular detectors), yet is capable of producing comparable results with the aforementioned data-centric batch-processing approaches.

In addition to algorithms, datasets for performance evaluation are an important aspect of 3D multi-view MOT research.
Existing multi-view datasets include DukeMTMC [47], PETS 2009 S2.L1 [48], EPFL - Laboratory, Terrace and Passageway [15], SALSA [49], Campus [3] and EPFL-RLC [16]. However, in [17] the authors discussed a number of their shortcomings and introduced a seven-camera high-definition (HD) unscripted pedestrian dataset known as WILDTRACK to provide a high-quality, highly crowded and cluttered evaluation scenario. It comes with accurate joint (extrinsic and intrinsic) calibration, and 7 series of 400 annotated frames for detection at a rate of 2 frames per second (fps).
TABLE 1: Basic Notation

Symbol | Description
a^T | Transpose of vector/matrix a
⊗ | Kronecker product (for matrices)
I_n | n-dimensional identity matrix
0_{n×m} | n by m zero matrix
diag(·) | Converts a vector to a diagonal matrix
X_{m:n} | X_m, X_{m+1}, ..., X_n
⟨f, g⟩ | ∫ f(x) g(x) dx
h^X | ∏_{x∈X} h(x), where h^∅ = 1
δ_Y[X] | Kronecker delta function: 1 if X = Y, 0 otherwise
1_Y(x) | Indicator function: 1 if x ∈ Y, 0 otherwise
N(·; μ, P) | Gaussian pdf with mean μ and covariance P

The annotations of the tracks are given both as locations on the ground plane and as 2D bounding boxes projected onto each view.

While WILDTRACK is more extensive than earlier datasets, it is still not sufficient for comprehensive 3D MOT performance evaluation. Specifically, for actual 3D MOT applications where objects may also move vertically (e.g. sports analytics, aged care, etc.), ground plane annotations are simply not adequate for evaluating tracking performance in full 3D, i.e. changes in all 3 x, y, z-coordinates. To enrich the datasets and to enable performance evaluation in full 3D, we propose the Curtin Multi-Camera (CMC) dataset that comprises four calibrated cameras, on scenarios of varying difficulties in crowd density and occlusion, as well as scenarios with people jumping and falling, all with 3D centroid-with-extent annotations, along with camera locations and parameters. Note that in addition to extrinsic and intrinsic parameters, we also provide the absolute camera locations needed for testing and evaluation of model-centric solutions that exploit multi-camera geometry.

3 BAYESIAN FORMULATION
This section formulates the multi-view MOT problem (Sections 3.1-3.4), including the proposed occlusion/detection model (Section 3.5), and the new tractable filter with occlusion handling capability (Section 3.6). The notation used in this paper is tabulated in Table 1.
We first recall the classical Bayesian filter where the state x of the object, in some finite-dimensional state space X, is modeled as a random vector. The dynamics of the state are described by a Markov chain with transition density f_+(x_+|x), i.e. the probability density of a transition to the state x_+ at the next time given the current state x. Note that for simplicity we omit the subscript for the current time and use the subscript '+' to denote the next time step. Additionally, the current state x generates an observation z described by the likelihood function g(z|x), i.e. the probability density of receiving the observation z given x. All information on the current state is encapsulated in the filtering density p, which can be propagated to the next time as p_+, via the celebrated Bayes recursion [50]

p_+(x_+) ∝ g(z_+|x_+) ∫ f_+(x_+|x) p(x) dx.   (1)

(The filtering densities are conditioned on the observations, which have been omitted for notational compactness.)
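To make recursion (1) concrete, here is a minimal sketch that propagates a filtering density on a discretized one-dimensional state space; the Gaussian random-walk transition and Gaussian likelihood are illustrative assumptions, not the models used later in the paper.

```python
import numpy as np

# Minimal sketch of the Bayes recursion (1) on a discretized 1D state space.
# The Gaussian transition and likelihood below are illustrative assumptions.
grid = np.linspace(0.0, 10.0, 201)              # discretized state space

def gaussian(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

# transition kernel F[i, j] = f+(grid[i] | grid[j]): random walk, variance 0.1
F = gaussian(grid[:, None], grid[None, :], 0.1)

def bayes_step(p, z, meas_var=0.5):
    """One step of (1): Chapman-Kolmogorov prediction, then Bayes update."""
    predicted = F @ p                           # ~ integral of f+(x+|x) p(x) dx
    posterior = gaussian(grid, z, meas_var) * predicted
    return posterior / posterior.sum()          # normalize on the grid

p = np.full_like(grid, 1.0 / len(grid))         # flat prior
for z in (3.0, 3.2, 3.1):                       # a short observation stream
    p = bayes_step(p, z)
print("posterior mean:", float((grid * p).sum()))
```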
The multi-view MOT Bayes filter used in this work is conceptually identical to the classical Bayes filter above by replacing: x and x_+ with the sets X and X_+; p and p_+ with the multi-object filtering densities π and π_+; f_+ and g with the multi-object transition density f_+ and multi-object observation likelihood g; z_+ with the observation set Z_+; and the integral with the set integral [43], i.e.

π_+(X_+) ∝ g(Z_+|X_+) ∫ f_+(X_+|X) π(X) δX.   (2)

The sets X (and X_+), containing the object states at the current (and next) time, are called the current (and next) multi-object state. Each element of the multi-object state X is an ordered pair x = (x, ℓ), where x ∈ X is a state vector, and ℓ ≜ (t, α) is a unique label consisting of the object's time of birth t, and an index α to distinguish those born at the same time [46]. The cardinality (number of elements) of X and X_+ may differ due to the appearance and disappearance of objects from one frame to the next.

Under the Bayesian paradigm, the multi-object state is modeled as a random finite set, i.e. a finite-set-valued random variable, characterized by Mahler's multi-object density [43], [44] (equivalent to a probability density [51]). The multi-object transition density f_+ captures the motions as well as births and deaths of objects. The multi-object observation likelihood g captures the detections, false alarms, occlusions, and misdetections.

An object at time k, represented by a state x = (x, ℓ), either survives with probability P_S(x) and evolves to state x_+ = (x_+, ℓ_+) at the next time with transition density

f_{S,+}(x_+|x) = f_{S,+}(x_+|x, ℓ) δ_ℓ[ℓ_+],   (3)

or dies with probability 1 − P_S(x) [46]. At this next time, an object with label ℓ is born with probability P_{B,+}(ℓ), and with feature vector x distributed according to a probability density f_{B,+}(·, ℓ). Note that the label of an object remains the same over time, and hence the trajectory of an object is a sequence of consecutive states with a common label [46].

Let B_k denote the finite set of all possible labels for objects born at time k; then the label space for all objects up to time k is the disjoint union L_k = ⊎_{t=0}^{k} B_t. For simplicity we omit the time subscript k, and let L(x) denote the label of an x ∈ X × L. For any finite X ⊂ X × L, we define L(X) ≜ {L(x) : x ∈ X}, and the distinct label indicator Δ(X) ≜ δ_{|X|}[|L(X)|]. At any time, the set X of (states of) objects in the scene must have distinct labels, i.e. Δ(X) = 1. Conditional on the current set of objects, it is standard practice to assume that objects are born or displaced at the next time, independently of one another. The expression for the multi-object transition density f_+ is not needed in this work; interested readers are referred to [46].
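The labeled-state bookkeeping above is straightforward to mirror in code; the sketch below (with container types of our own choosing, purely for illustration) represents labels as (birth time, index) pairs and checks the distinct label indicator Δ(X).

```python
from typing import NamedTuple, Tuple
import numpy as np

Label = Tuple[int, int]                      # l = (birth time t, index alpha)

class LabeledState(NamedTuple):
    x: np.ndarray                            # feature vector in the state space
    label: Label                             # fixed over the object's lifetime

def labels_of(X):                            # L(X) = {L(x) : x in X}
    return {s.label for s in X}

def distinct_label_indicator(X):             # Delta(X) = delta_|X|[|L(X)|]
    return int(len(X) == len(labels_of(X)))

X = [LabeledState(np.zeros(9), (0, 1)), LabeledState(np.ones(9), (2, 1))]
assert distinct_label_indicator(X) == 1      # a valid multi-object state
```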
Suppose that at time k, there are C cameras (sensors), and a set X of current objects. Each x ∈ X is either: detected by camera c ∈ {1, ..., C} with probability P_D^{(c)}(x; X−{x}), generating an observation z^{(c)} in the measurement space Z^{(c)} with likelihood g^{(c)}(z^{(c)}|x); or missed with probability 1 − P_D^{(c)}(x; X−{x}). Note that to account for occlusions (and uncertainty in the detection process), the probability of detecting an object x also depends on the states of the other current objects X − {x}. However, most MOT algorithms neglect this dependence for computational tractability.

The detection process also generates false positives at camera c, usually characterized by an intensity function κ^{(c)} on Z^{(c)}. The standard model is a Poisson distribution, with mean ⟨κ^{(c)}, 1⟩, for the number of false positives, and the false positives themselves are i.i.d. according to the probability density κ^{(c)}/⟨κ^{(c)}, 1⟩ [44], [52], [53]. Moreover, conditional on the set X of objects, detections are assumed to be independent from false positives, and the set Z^{(c)} of detections and false positives at sensor c is independent from those at other sensors.

An association hypothesis (at time k) associating labels with detections from camera c is a mapping γ^{(c)} : L → {−1, 0, 1, ..., |Z^{(c)}|}, such that no two distinct arguments are mapped to the same positive value [46]. This property ensures each detection comes from at most one object. Given an association hypothesis γ^{(c)}: γ^{(c)}(ℓ) = −1 means object ℓ does not exist; γ^{(c)}(ℓ) = 0 means object ℓ is not detected by camera c; γ^{(c)}(ℓ) > 0 means object ℓ generates detection z^{(c)}_{γ^{(c)}(ℓ)} at camera c; and the set L(γ^{(c)}) ≜ {ℓ ∈ L : γ^{(c)}(ℓ) ≥ 0} are the live labels of γ^{(c)}. Under standard assumptions, the (multi-object) likelihood for camera c is given by the following sum over the space Γ^{(c)} of association hypotheses with domain L and range {−1, 0, 1, ..., |Z^{(c)}|} [46]:

g^{(c)}(Z^{(c)}|X) ∝ Σ_{γ^{(c)}∈Γ^{(c)}} δ_{L(γ^{(c)})}[L(X)] [ψ^{(c,γ^{(c)})}_{X−{·}}(·)]^X,   (4)

where Z^{(c)} = {z^{(c)}_{1:|Z^{(c)}|}}, and

ψ^{(c,γ^{(c)})}_{X−{x}}(x) =
  { 1 − P_D^{(c)}(x; X−{x}),                                          if γ^{(c)}(L(x)) = 0,
  { P_D^{(c)}(x; X−{x}) g^{(c)}(z^{(c)}_j|x) / κ^{(c)}(z^{(c)}_j),    if γ^{(c)}(L(x)) = j > 0.   (5)

Note that ψ^{(c,γ^{(c)})}_{X−{x}}(x) also depends on Z^{(c)}, but we omit this for clarity. Interested readers are referred to the texts [43], [44] for the derivation/discussion.

A multi-sensor (association) hypothesis is an array γ ≜ (γ^{(1)}, ..., γ^{(C)}) of association hypotheses with the same set of live labels, denoted as L(γ). The likelihood that X generates the multi-sensor observation Z ≜ (Z^{(1:C)}) is the product ∏_{c=1}^{C} g^{(c)}(Z^{(c)}|X), which can be rewritten as [20]

g(Z|X) ∝ Σ_{γ∈Γ} δ_{L(γ)}[L(X)] [ψ^{(γ)}_{X−{·}}(·)]^X,   (6)

where Γ is the set of all multi-sensor hypotheses,

δ_{L(γ)}[J] ≜ ∏_{c=1}^{C} δ_{L(γ^{(c)})}[J],   (7)

ψ^{(γ)}_{X−{x}}(x) ≜ ∏_{c=1}^{C} ψ^{(c,γ^{(c)})}_{X−{x}}(x).   (8)

Remark: The sets of objects, observations, and possibly the number of sensors and their parameters, may vary with time. However, for clarity we suppress the time index.
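To make the factor (5) concrete, the sketch below evaluates ψ for one object under a given single-camera association map; the detection probability, single-object likelihood and clutter intensity are passed in as callables, and the toy models at the bottom are assumed placeholders for illustration only.

```python
import math

def psi_factor(x, label, gamma_c, Z_c, PD_c, g_c, kappa_c):
    """Per-object factor (5) for one camera.

    gamma_c : dict label -> {-1, 0, 1, ..., |Z_c|} (association map)
    PD_c    : callable, detection probability of x given the other objects
    g_c     : callable g(z|x), single-object measurement likelihood
    kappa_c : callable, clutter intensity on the measurement space
    """
    j = gamma_c[label]
    if j == 0:                                  # object exists but is missed
        return 1.0 - PD_c(x)
    if j > 0:                                   # object generates detection z_j
        z = Z_c[j - 1]
        return PD_c(x) * g_c(z, x) / kappa_c(z)
    raise ValueError("label is not live under this hypothesis")

# toy 1D usage; every model below is an assumed placeholder
print(psi_factor(5.1, (0, 1), {(0, 1): 1}, [5.0],
                 PD_c=lambda x: 0.9,
                 g_c=lambda z, x: math.exp(-0.5 * (z - x) ** 2),
                 kappa_c=lambda z: 0.01))
```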
Most of the literature on tracking assumes the probability of detection P_D^{(c)}(x; X−{x}) = P_D^{(c)}(x), i.e. independent of X − {x}. In this case, the Bayes recursion (2) admits an analytical solution based on Generalized Labeled Multi-Bernoulli (GLMB) models.

A GLMB is a multi-object density of the form [46]

π(X) = Δ(X) Σ_{I,ξ} w^{(I,ξ)} δ_I[L(X)] [p^{(ξ)}]^X,   (9)

where: I ∈ F(L), the space of all finite subsets of L; ξ ∈ Ξ, the space of all (multi-sensor) association hypothesis histories up to the current time, i.e. ξ ≜ γ_{1:k}; each w^{(I,ξ)} is a non-negative weight such that Σ_{I,ξ} w^{(I,ξ)} = 1; and each p^{(ξ)}(·, ℓ) is a probability density on X. For convenience, we represent a GLMB by its parameter set

π ≜ {(w^{(I,ξ)}, p^{(ξ)}) : (I, ξ) ∈ F(L) × Ξ}.   (10)

Each GLMB component (I, ξ) can be interpreted as a hypothesis with probability w^{(I,ξ)}, and each individual object ℓ ∈ I of this hypothesis has probability density p^{(ξ)}(·, ℓ).

A simple multi-object state estimate can be obtained from a GLMB by first determining the most probable cardinality n* from the cardinality distribution [46]

Prob(|X| = n) = Σ_{I,ξ} δ_n[|I|] w^{(I,ξ)};   (11)

and then the hypothesis (I*, ξ*) with highest weight such that |I*| = n*. The current state estimate for each object ℓ ∈ I* can be computed from p^{(ξ*)}(·, ℓ), e.g. the mode or mean. Alternatively, the entire trajectory of object ℓ ∈ I* can be estimated using the forward-backward algorithm, starting from its current filtering density p^{(ξ*)}(·, ℓ) and propagating backward to its time of birth [20], [54].

Under the Bayes recursion (2), and the standard multi-object model (i.e. with no occlusions, P_D^{(c)}(x; X−{x}) = P_D^{(c)}(x)), the multi-object filtering density at any time is a GLMB [46]. Moreover, if (10) is the current GLMB filtering density, then the next GLMB filtering density

π_+ = {(w_+^{(I_+,ξ_+)}, p_+^{(ξ_+)}) : (I_+, ξ_+) ∈ F(L_+) × Ξ_+},   (12)

can be computed via the MS-GLMB recursion [20]

π_+ = Ω(π; P_{D,+}),   (13)

where P_{D,+} ≜ (P_{D,+}^{(1)}, ..., P_{D,+}^{(C)}). The actual mathematical expressions for the recursion operator Ω : π ↦ π_+ are not critical for our arguments, and hence omitted from this section. Nonetheless, for completeness the definition of Ω is provided in Appendix 7.1. Note that Ω also depends on the measurement Z_+, and the model parameters for birth (P_{B,+}, f_{B,+}), death/survival P_S, motion f_{S,+}, false alarms κ_+ ≜ (κ_+^{(1)}, ..., κ_+^{(C)}), and detection g_+ ≜ (g_+^{(1)}, ..., g_+^{(C)}) (described in Section 3.3). However, for our purpose it suffices to show the dependence on the detection probabilities.

While the MS-GLMB filter can be applied directly to multi-view MOT, a detection probability (of an object x) that does not depend on the other objects, i.e. X − {x}, is unable to capture the effect of occlusions. On the other hand, accounting for occlusions with P_D^{(c)}(x; X−{x}) that actually depends on X − {x} results in filtering densities that are not GLMBs. One example is the merged-measurement model [55], which involves summing over all partitions of the set X, making it intractable [55]. Although the resulting filtering density can be approximated by a GLMB, this solution is still computationally demanding and not suitable for a large number of objects [55].
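Before turning to the occlusion model, the following sketch shows how the simple estimator described after (11) reads a multi-object estimate off a GLMB parameter set; the component layout is an illustrative data structure of our own, not a prescribed one.

```python
import numpy as np

def glmb_estimate(components):
    """components: list of (weight w, label set I, densities {l: p}) triples.

    Returns the labels (and densities) of the highest-weight hypothesis whose
    cardinality matches the MAP cardinality n* from (11).
    """
    max_n = max(len(I) for _, I, _ in components)
    card = np.zeros(max_n + 1)
    for w, I, _ in components:                 # Prob(|X| = n) = sum_{|I|=n} w
        card[len(I)] += w
    n_star = int(np.argmax(card))
    best = max((c for c in components if len(c[1]) == n_star),
               key=lambda c: c[0])
    return best[1], best[2]                    # labels and their densities

comps = [(0.5, {(0, 1)}, {}), (0.3, {(0, 1), (0, 2)}, {}), (0.2, set(), {})]
labels, dens = glmb_estimate(comps)
print(labels)                                  # -> {(0, 1)}
```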
In what follows, we propose a new detection model that addresses occlusions and permits efficient multi-view MOT implementations.

For tracking in 3D, we consider the state x = (x, ℓ), where:

x = (x^{(p)}, ẋ^{(p)}, x^{(s)});   (14)

x^{(p)} is the object's position (centroid) in 3D Cartesian coordinates; ẋ^{(p)} is its velocity; and x^{(s)} is its shape parameter. The region in R³ occupied by an object with labeled state x is denoted by R(x).

Consider camera c and the set X of current objects. In this work, an object (x, ℓ) ∈ X is regarded as occluded from camera c when its position x^{(p)} is not in the line of sight (LoS) of the camera, i.e. x^{(p)} is in the shadow regions of the other objects in X. Assuming straight LoSs, the shadow region of an object with labeled state x′, relative to camera c (see Fig. 3), is given by

S^{(c)}(x′) = {y ∈ R³ : l(u^{(c)}, y) ∩ R(x′) ≠ ∅},   (15)

where l(u^{(c)}, y) ≜ {λy + (1−λ)u^{(c)} : λ ∈ [0, 1)} is the line segment joining the position u^{(c)} of camera c and y. Note that for an ellipsoidal region R(x′), the indicator function 1_{S^{(c)}(x′)}(·) of its shadow region can be computed in closed form (see Section 4.1).

Fig. 3: The shadow region (in yellow) of an object with labeled state x′, relative to camera c.

To incorporate the effect of occlusions into the detection model, the probability that x ∈ X is detected by camera c should be close to zero when it is occluded from camera c. This can be accomplished by extending the standard detection probability so that: when x is in the LoS of camera c, its detection probability is P_D^{(c)}(x); and when occluded by the other objects, its detection probability scales down to βP_D^{(c)}(x), where β is a small positive number. More explicitly,

P_D^{(c)}(x; X−{x}) = P_D^{(c)}(x) (M(x; X−{x}) + β (1 − M(x; X−{x}))),   (16)

where

M(x; X−{x}) = ∏_{x′∈X−{x}} (1 − 1_{S^{(c)}(x′)}(x)).   (17)
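A direct transcription of (16)-(17): given any shadow-region indicator (a closed form for ellipsoids is derived in Section 4.1), the detection probability of x under one camera is scaled down by β whenever its centroid falls in the shadow of another object. The nominal values below are illustrative assumptions.

```python
def detection_probability(x, others, in_shadow_c, PD_nominal=0.9, beta=0.05):
    """Occlusion-adjusted detection probability (16)-(17) for one camera.

    in_shadow_c(x, xp) -> 1 if x's centroid lies in the shadow region
    S^(c)(xp) of object xp, else 0 (closed form given in Section 4.1).
    """
    M = 1.0
    for xp in others:                  # (17): product over the other objects
        M *= 1.0 - in_shadow_c(x, xp)
    return PD_nominal * (M + beta * (1.0 - M))

# fully visible -> 0.9; occluded by any object -> 0.9 * beta
print(detection_probability("x", ["x1"], in_shadow_c=lambda x, xp: 0))
print(detection_probability("x", ["x1"], in_shadow_c=lambda x, xp: 1))
```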
Conditional on detection, x is observed at camera c as a bounding box z^{(c)} ≜ (z_p^{(c)}, z_e^{(c)}), where z_p^{(c)} is the center, and z_e^{(c)} is the extent, parameterized by the logarithms of the width (x-axis) and height (y-axis), in image coordinates. The observed z^{(c)} is a noisy version of the box Φ^{(c)}(x) bounding the image of R(x) in the camera's image plane, under the projection of the camera matrix P^{(c)}_{3×4}. This matrix projects homogeneous points in the world coordinate frame to homogeneous points in the image plane of camera c, and can be obtained by standard calibration techniques (see [56] for details). Note that for an ellipsoidal region R(x), the axis-aligned Φ^{(c)}(x) on the image plane can be computed analytically (see Section 4.1). This observation process can be modeled by the likelihood

g^{(c)}(z^{(c)}|x) = N(z^{(c)}; Φ^{(c)}(x) + [0_{2×1}; −υ_e^{(c)}/2], diag([υ_p^{(c)}; υ_e^{(c)}])),   (18)

where υ_p^{(c)} and υ_e^{(c)} are, respectively, the vectors of noise variances for the center and the extent (in logarithm) of the box. This Gaussian model of the logarithms of the width and height is equivalent to modeling the actual width and height as log-normals, which ensures that they are non-negative. Note that these log-normals have mean 1, and variances e^{υ_{e,1}^{(c)}} − 1 and e^{υ_{e,2}^{(c)}} − 1, where υ_{e,1}^{(c)} and υ_{e,2}^{(c)} are the two components of υ_e^{(c)}. This means the observed width and height are randomly scaled versions of their nominal values, with an expected scaling factor of 1.

This subsection presents a tractable GLMB approximation to the multi-view Bayes filter to address occlusions. The proposed filter (with the new detection model to account for occlusion) is referred to as Multi-View GLMB with OCclusion modeling (MV-GLMB-OC).

Given the current GLMB filtering density (10), the predicted density ∫ f_+(X_+|X) π(X) δX in the Bayes recursion (2) is also a GLMB [46], which we denote by

π̂_+(X_+) = Δ(X_+) Σ_{I_+,ξ} w_+^{(I_+,ξ)} δ_{I_+}[L(X_+)] [p_+^{(ξ)}]^{X_+},   (19)

where I_+ ∈ F(L_+). Multiplying (19) by the likelihood (6) yields the next (unnormalized) multi-object density

π_+(X_+) ∝ Δ(X_+) Σ_{I_+,ξ,γ_+} δ_{L(γ_+)}[L(X_+)] w_+^{(I_+,ξ)} δ_{I_+}[L(X_+)] [p^{(ξ,γ_+)}_{X_+−{·}}(·)]^{X_+},   (20)

where

p^{(ξ,γ_+)}_{X_+−{x_+}}(x_+) = p_+^{(ξ)}(x_+) ψ^{(γ_+)}_{X_+−{x_+}}(x_+).   (21)

As previously alluded to, the multi-object density (20) is not a GLMB because p^{(ξ,γ_+)}_{X_+−{x_+}} depends on X_+ − {x_+}. Nonetheless, a good GLMB approximation of (20) can be obtained by approximating p^{(ξ,γ_+)}_{X_+−{x_+}} with a density that is independent of X_+ − {x_+}.

Note that ψ^{(γ_+)}_{X_+−{x_+}} is the only factor of p^{(ξ,γ_+)}_{X_+−{x_+}} which depends on X_+ − {x_+} (see (21)). Further inspection of (5) and (8) reveals that the detection probability functions P_{D,+}^{(c)}(·; X_+−{x_+}), c ∈ {1, ..., C}, are the only constituent terms that depend on X_+ − {x_+}. Moreover, it follows from (16) that P_{D,+}^{(c)}(x_+; X_+−{x_+}) only takes on two values, depending on whether x_+ falls in the shadow region of X_+ − {x_+} w.r.t. camera c. Assuming the positions of the elements of X_+ − {x_+} are concentrated around their predicted values according to the prediction densities p_+^{(ξ)}(·, ℓ), ℓ ∈ L(X_+ − {x_+}), we can approximate P_{D,+}^{(c)}(·; X_+−{x_+}) by replacing the set X_+ − {x_+} with its predicted value. Noting that the term δ_{I_+}[L(X_+)] in (20) implies L(X_+) = I_+, the prediction of X_+ − {x_+} is

X_+^{(ξ,I_+)} = {(x_+^{(ξ,ℓ)}, ℓ) : ℓ ∈ I_+ − {L(x_+)}},   (22)

where x_+^{(ξ,ℓ)} denotes an estimate (e.g. mean, mode) from the density p_+^{(ξ)}(·, ℓ), which is either the birth density f_{B,+}(·, ℓ) if ℓ ∈ B_+ or ∫ f_{S,+}(·|x, ℓ) p^{(ξ)}(x, ℓ) dx if ℓ ∉ B_+ [46].

The above approximation translates to

p^{(ξ,γ_+)}_{X_+−{x_+}} ≈ p^{(ξ,γ_+)}_{X_+^{(ξ,I_+)}},   (23)

which is independent of X_+ − {x_+}, thereby turning (20) into a GLMB. Moreover, the computation of this GLMB approximation to (20) only differs from the MS-GLMB recursion (13) in the detection probabilities

P_{D,+}^{(ξ,I_+)}(ℓ) ≜ (P_{D,+}^{(1)}((x̂_+, ℓ); X_+^{(ξ,I_+)}), ..., P_{D,+}^{(C)}((x̂_+, ℓ); X_+^{(ξ,I_+)})),   (24)

where ℓ = L(x_+), and x̂_+ denotes an estimate (e.g. mean, mode) from the density p_+^{(ξ)}(·, ℓ).
Specifically, the GLMB approximation of the multi-object filtering density can be propagated by the MS-GLMB recursion

π_+ = Ω(π; {P_{D,+}^{(ξ,I_+)}(ℓ) : ℓ ∈ I_+, (ξ, I_+) ∈ Ξ × F(L_+)}).   (25)

The integration of the proposed occlusion model (via the detection probabilities) into the MS-GLMB filter is shown in Fig. 2. The implementation of this so-called MV-GLMB-OC filter is discussed in the next section.
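Combining the two previous sketches, the per-hypothesis detection probabilities (24) could be computed as below, with the other objects replaced by their predicted point estimates as in (22)-(23); `predicted_mean` and `in_shadow` are assumed interfaces standing in for the quantities defined above.

```python
def hypothesis_detection_probs(I_plus, predicted_mean, in_shadow, C,
                               PD_nominal=0.9, beta=0.05):
    """Compute P_D^{(xi, I+)}(l) of (24) for every label in a hypothesis.

    predicted_mean : dict label -> predicted point estimate x_+^{(xi,l)} (22)
    in_shadow      : callable (camera c, x, xp) -> 0/1 shadow indicator
    """
    probs = {}
    for l in I_plus:
        others = [predicted_mean[lp] for lp in I_plus if lp != l]
        x_hat = predicted_mean[l]
        per_cam = []
        for c in range(C):                       # (16)-(17) for each camera
            M = 1.0
            for xp in others:
                M *= 1.0 - in_shadow(c, x_hat, xp)
            per_cam.append(PD_nominal * (M + beta * (1.0 - M)))
        probs[l] = per_cam
    return probs

probs = hypothesis_detection_probs(
    I_plus=[(0, 1), (0, 2)],
    predicted_mean={(0, 1): "x1", (0, 2): "x2"},
    in_shadow=lambda c, x, xp: 0,                # no occlusion in this toy call
    C=2)
print(probs)                                     # both labels: [0.9, 0.9]
```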
4 IMPLEMENTATION

This section describes the implementation of the proposed filter for ellipsoidal objects. Section 4.1 provides mathematical representations for the objects and the multi-object model parameters. Propagation of the MV-GLMB-OC filtering density is then described in Section 4.2.
Each object is represented by an axis-aligned ellipsoid. For an object with labeled state x = (x, ℓ), the position x^{(p)} is the centroid, and the shape parameter x^{(s)} is a vector containing the logarithms of the half-lengths of the ellipsoid's principal axes. Further, the time-evolution of the state vector x is modeled by a linear Gaussian transition density:

f_{S,+}(x_+|x, ℓ) = N(x_+; Fx + [0_{6×1}; −υ^{(s)}/2], Q),   (26)

where

F = [ I_3 ⊗ [1 T; 0 1]   0_{6×3} ; 0_{3×6}   I_3 ],   (27)

Q = [ diag(υ^{(p)}) ⊗ [T²/2; T][T²/2; T]^T   0_{6×3} ; 0_{3×6}   diag(υ^{(s)}) ],   (28)

T is the sampling period, and υ^{(p)} and υ^{(s)} are, respectively, 3D vectors of noise variances for the components of the centroid and the shape parameter (in logarithm) of the ellipsoid. This transition density describes a nearly constant velocity model for the centroid and a Gaussian random walk for the shape parameter. Gaussianity of the logarithms of the half-lengths is equivalent to modeling the half-lengths as log-normals, which ensures that they are non-negative. Note that these log-normals have mean 1, and variances e^{υ_i^{(s)}} − 1, i = 1, 2, 3, where υ_i^{(s)} is the i-th component of υ^{(s)}. Hence, the half-lengths are randomly scaled versions of their nominal values, with an expected scaling factor of 1.

Empirically, objects that have been in the scene for a long time are more likely to remain in the scene, unless they are close to the borders (exit regions). This can be modeled via the following object survival probability [4]:

P_S(x, ℓ) = b(x) / (1 + exp(−τ(k − t_ℓ))),   (29)

where ℓ = (t_ℓ, α), b(x) is the scene mask (chosen to be close to one in the middle of the scene, and close to zero in the designated exit regions and beyond) as depicted in Fig. 4 (a), and τ is the control parameter of the sigmoid function, which depends on the duration (age) k − t_ℓ of the track, as depicted in Fig. 4 (b).

Fig. 4: Illustration of the survival probability model: (a) The scene mask b(x); (b) The control parameter τ of the sigmoid function.
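A sketch assembling the transition model (26)-(28) and the survival probability (29); the state uses the per-axis (position, velocity) ordering implied by the Kronecker products, and all numeric values are illustrative assumptions.

```python
import numpy as np

def transition_matrices(T, var_p, var_s):
    """F and Q of (27)-(28): nearly constant velocity model for the 3D
    centroid/velocity plus a Gaussian random walk for the 3 log half-lengths
    (9D state). The mean in (26) also subtracts var_s/2 on the log
    half-lengths (not shown here)."""
    F = np.block([
        [np.kron(np.eye(3), np.array([[1.0, T], [0.0, 1.0]])), np.zeros((6, 3))],
        [np.zeros((3, 6)), np.eye(3)],
    ])
    g = np.array([[T**2 / 2.0], [T]])                 # NCV process-noise gain
    Q = np.block([
        [np.kron(np.diag(var_p), g @ g.T), np.zeros((6, 3))],
        [np.zeros((3, 6)), np.diag(var_s)],
    ])
    return F, Q

def survival_probability(scene_mask, tau, k, birth_time):
    """Sigmoid age-dependent survival probability (29); scene_mask in [0, 1]."""
    age = k - birth_time
    return scene_mask / (1.0 + np.exp(-tau * age))

F, Q = transition_matrices(T=0.25, var_p=[0.01] * 3, var_s=[0.0001] * 3)
print(F.shape, Q.shape)                               # (9, 9) (9, 9)
print(survival_probability(scene_mask=0.98, tau=0.5, k=20, birth_time=10))
```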
The detection probability (16)-(17) can be computed in closed form when the object extents are ellipsoids. As alluded to in Section 3.5, the shadow region indicator function 1_{S^{(c)}(y)}(·), used for checking whether an object is in the shadow region of the object y, can be determined analytically. Suppose that R(y) in (15) is a quadric; then it intersects the line l(u^{(c)}, x^{(p)}) (between u^{(c)} and x^{(p)}) if the roots of a certain quadratic equation are real [57]. Consequently, for an axis-aligned ellipsoidal object representation, the shadow region indicator function is given by

1_{S^{(c)}(y)}(x) = { 1, if (B^{(c)}_{x,y})² − A^{(c)}_{x,y} C^{(c)}_y ≥ 0; 0, otherwise },   (30)

where

A^{(c)}_{x,y} = (x^{(p)} − u^{(c)})^T (diag(y^{(s)}))^{−2} (x^{(p)} − u^{(c)}),   (31)

B^{(c)}_{x,y} = (x^{(p)} − u^{(c)})^T [(diag(y^{(s)}))^{−2} u^{(c)} + d_y],   (32)

C^{(c)}_y = (u^{(c)})^T [(diag(y^{(s)}))^{−2} u^{(c)} + 2 d_y] + E_y,   (33)

d_y = −y^{(p)}/(y^{(s)} · y^{(s)}),   E_y = ||y^{(p)}/y^{(s)}||² − 1,   (34)

and u^{(c)} is the position of camera c, with multiplication/division of two vectors of the same dimension to be understood as point-wise multiplication/division.
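The indicator (30)-(34) encodes, in closed form, a segment-ellipsoid intersection test; the sketch below performs the equivalent geometric test directly, by solving the quadratic for the segment from the camera to the object centroid (here parameterized by the ellipsoid's half-lengths rather than their logarithms).

```python
import numpy as np

def in_shadow(u_c, x_p, y_center, y_half_lengths):
    """1 if the segment from camera position u_c to centroid x_p intersects
    the axis-aligned ellipsoid (center y_center, half-lengths y_half_lengths);
    this is the geometric content of the indicator in (30)-(34)."""
    s = np.asarray(y_half_lengths, dtype=float)
    d = (np.asarray(x_p, float) - np.asarray(u_c, float)) / s   # scaled direction
    w = (np.asarray(u_c, float) - np.asarray(y_center, float)) / s
    a, b, c = d @ d, d @ w, w @ w - 1.0         # roots of a t^2 + 2 b t + c = 0
    disc = b * b - a * c
    if disc < 0:                                 # the line misses the ellipsoid
        return 0
    t1, t2 = (-b - np.sqrt(disc)) / a, (-b + np.sqrt(disc)) / a
    # occluded only if the hit lies between the camera (t=0) and the point (t=1)
    return int((0.0 <= t1 <= 1.0) or (0.0 <= t2 <= 1.0))

# camera at origin, person-like ellipsoid halfway to the query point -> occluded
print(in_shadow([0, 0, 2], [6, 0, 1], [3, 0, 1], [0.3, 0.3, 0.9]))  # -> 1
```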
In addition, using quadric projection [58, pp. 201], the relationship between the estimated bounding box Φ^{(c)}(x) and the measured bounding box z^{(c)}, captured in the measurement likelihood (18), has the following closed form:

Φ^{(c)}(x) ≜ Z(P^{(c)}(x)),   (35)

where

P^{(c)}(x) = (P^{(c)}_{3×4} [(diag(x^{(s)}))^{−2}  d_x; d_x^T  E_x]^{−1} (P^{(c)}_{3×4})^T)^{−1},   (36)

Z([A r; r^T q]) = [−Q D^{−1} Q^T r;  ν ||[1, 0] Q D^{−0.5}||;  ν ||[0, 1] Q D^{−0.5}||],   (37)

ν = (r^T Q D^{−1} Q^T r − q)^{0.5},   (38)

Q is a matrix containing the eigenvectors of A, and D is a diagonal matrix of the eigenvalues of A. Given the camera matrices P^{(1)}_{3×4}, ..., P^{(C)}_{3×4}, P^{(c)}(·) is a matrix-to-matrix projection that transforms the quadric into a conic on the image of camera c [58, pp. 201]. Z(·) is a matrix-to-vector transformation that transforms the conic into a 4D bounding box (in the same format as z^{(c)}). The overall transformation (35) is depicted in Fig. 5.

Fig. 5: The projections P^{(c)} of two quadrics (in cyan and pink) onto two image views (c = 1, 2) result in 2D conics. The transformation Z yields the corresponding estimated bounding boxes (in cyan and pink). The estimated bounding box and the measured bounding box (in red) from the monocular detector enter the measurement likelihood (18).

The Poisson false alarm intensity for camera c is κ^{(c)} ≜ λ_c U(·), where λ_c is the false-positive (clutter) rate, and U(·) is a uniform distribution on the measurement space Z^{(c)}. In many visual tracking cases, this value can either be estimated offline or manually tuned. The false alarm intensity can also be estimated by the Cardinalized Probability Hypothesis Density (CPHD) clutter estimator [59]. In this work, we bootstrap the CPHD clutter intensity estimator output to the tracker [60].

The number of components of the GLMB filtering density grows super-exponentially over time. To maintain tractability in GLMB filter implementations, truncating insignificant components has been proven to minimize the L₁ approximation error [20]. This truncation strategy can be formulated as an NP-hard multi-dimensional assignment problem [20]. Nonetheless, it can be solved by exploiting certain structural properties, and suitable adaptations of 2D assignment solutions such as Murty's or Auction [20].

The MV-GLMB-OC recursion described in Section 3.6 can be directly implemented with separate prediction and update, i.e. by computing a truncated version of the prediction (19) and the corresponding detection probabilities {P_{D,+}^{(ξ,I_+)}(ℓ) : ℓ ∈ I_+, (ξ, I_+) ∈ Ξ × F(L_+)}, and then using this to compute a truncated version of the update (25). This strategy requires keeping a significant portion of the predicted components that would end up as updated components with negligible weights, thereby wasting computations in solving a large number of 2D assignment problems. Thus, this approach is inefficient and becomes infeasible for systems with many sensors [20].

In this work, we exploit an efficient GLMB truncation strategy that has a linear complexity in the sum of the measurements across all sensors [20]. This approach bypasses the prediction truncation, and returns the significant components of the next GLMB filtering density (25) by sampling from a discrete probability distribution proportional to the weights of the components [20]. This means GLMB components with higher weights are more likely to be selected than those with lower weights. For the MV-GLMB-OC recursion, this discrete probability distribution s(·; P_{D,+}) of the GLMB components is determined by the detection probabilities P_{D,+} ≜ {P_{D,+}^{(ξ,I_+)}(ℓ) : ℓ ∈ I_+, (ξ, I_+) ∈ Ξ × F(L_+)} (and other multi-object system parameters, which are suppressed for clarity) [20]. However, since truncation of the prediction (19) has been bypassed, the predicted components {(ξ, I_+) ∈ Ξ × F(L_+)} and their corresponding detection probabilities are not available. Nonetheless, importance sampling can be used to generate weighted samples of s(·; P_{D,+}) by sampling from s(·; P̂_{D,+}), where P̂_{D,+} ≜ {P_{D,+}^{(ξ,I⊎B_+)}(ℓ) : ℓ ∈ I ⊎ B_+, (ξ, I) ∈ Ξ × F(L)}, and then re-weighting the resulting samples accordingly [50]. Note that the detection probabilities P̂_{D,+} can be readily computed from the components of the (truncated) current GLMB filtering density {(w^{(I,ξ)}, p^{(ξ)}) : (I, ξ) ∈ F(L) × Ξ}. Moreover, since P_{D,+}^{(ξ,I⊎B_+)} ⪯ P_{D,+}^{(ξ,I_+)} for any I_+ ⊆ I ⊎ B_+, it follows from [61] that s(·; P̂_{D,+}) is more diffused than s(·; P_{D,+}), i.e. the support of s(·; P̂_{D,+}) contains the support of s(·; P_{D,+}).

The MS-GLMB and MV-GLMB-OC recursions are presented in Algorithms 1 and 2 respectively. Observe that the main difference is the additional computation of the detection probabilities prior to, and the re-weighting after, the Gibbs sampling step in the MV-GLMB-OC filter.

In this work, the object's birth density f_{B,+}(·, ℓ), single-object transition (26) and likelihood (18) are all Gaussian. Standard Kalman prediction and Unscented Kalman update are used to evaluate the single-object filtering density p_+^{(ξ_+)}, which results in a Gaussian.
Algorithm 1: MS-GLMB Filter [20]

Global Input: {(P_{B,+}(ℓ), f_{B,+}(·, ℓ))}_{ℓ∈B_+}, f_{S,+}(·|·), P_S(·)
Global Input: κ, P_D, g
Input: π ≜ {(w^{(I,ξ)}, p^{(ξ)}) : (I, ξ) ∈ F(L) × Ξ}
Output: π_+ ≜ {(w_+^{(I_+,ξ_+)}, p_+^{(ξ_+)}) : (I_+, ξ_+) ∈ F(L_+) × Ξ_+}

for (I, ξ) ∈ F(L) × Ξ
    Construct the stationary distribution from the inputs
    Run the Gibbs sampler to obtain samples γ_+ [20, Algorithm 3]
    Use the samples γ_+ to compute π_+
end for
Extract labeled state estimates
Algorithm 2: MV-GLMB-OC Filter

Global Input: {(P_{B,+}(ℓ), f_{B,+}(·, ℓ))}_{ℓ∈B_+}, f_{S,+}(·|·), P_S(·)
Global Input: κ, P_D, g
Input: π ≜ {(w^{(I,ξ)}, p^{(ξ)}) : (I, ξ) ∈ F(L) × Ξ}
Output: π_+ ≜ {(w_+^{(I_+,ξ_+)}, p_+^{(ξ_+)}) : (I_+, ξ_+) ∈ F(L_+) × Ξ_+}

for (I, ξ) ∈ F(L) × Ξ
    Compute the occlusion-based detection probabilities {P_{D,+}^{(ξ,I⊎B_+)}(ℓ) : ℓ ∈ I ⊎ B_+} via (24)
    Construct the stationary distribution from the inputs and {P_{D,+}^{(ξ,I⊎B_+)}(ℓ) : ℓ ∈ I ⊎ B_+}
    Run the Gibbs sampler to obtain samples γ_+ [20, Algorithm 3]
    Update the occlusion-based detection probabilities {P_{D,+}^{(ξ,L(γ_+))}(ℓ) : ℓ ∈ L(γ_+)} via (24)
    Use the samples γ_+ and {P_{D,+}^{(ξ,L(γ_+))}(ℓ) : ℓ ∈ L(γ_+)} to compute π_+
end for
Extract labeled state estimates
5 EXPERIMENTS
This section demonstrates the three main advantages of the proposed MV-GLMB-OC approach. The first is the capability to produce 3D object trajectories using independent monocular detections from multiple views, where each object is represented as a 3D ellipsoid of unknown location and extent (Section 5.2). The second is the amenability to uninterrupted/seamless operation in the event that cameras are added, removed or repositioned on the fly (Section 5.3). The third is the flexibility of not confining objects to the ground plane, which is demonstrated by tracking people jumping and falling (Section 5.4). The effectiveness of the proposed occlusion model is also studied, by comparing the tracking performance of the MV-GLMB-OC against that of the standard MS-GLMB filter.

We first focus our demonstrations on the latest WILDTRACK dataset, which involves seven cameras at 1920x1080 resolution with overlapping views. The WILDTRACK dataset is also supplied with calibrated intrinsic and extrinsic camera parameters, along with 3D annotations, although these are restricted to the ground plane. WILDTRACK was initially introduced to address various perceived shortcomings in older multi-view datasets, the arguments for which were originally presented in [17] and are summarized as follows. The DukeMTMC dataset [47] is essentially non-overlapping in views and is now no longer available. The PETS 2009 S2.L1 dataset [48] has supposed inconsistencies when projecting 3D points across the views. The EPFL, SALSA and Campus datasets [3], [15], [49] involve a relatively small number of people, are relatively sparse in terms of person density, and do not provide 3D annotations. In addition, the EPFL-RLC dataset [16] only provides annotations for a small subset of the last 300 of 8000 frames. For the same reasons that the authors of WILDTRACK were motivated to introduce their new dataset, the older multi-view datasets superseded by WILDTRACK are not suitable for evaluating the MV-GLMB-OC filter in the 3D world frame.

In the context of demonstrating the MV-GLMB-OC approach, however, the WILDTRACK dataset is not suitable for evaluating tracking performance in full 3D, i.e. changes in all 3 x, y, z-coordinates. While WILDTRACK provides 3D annotations, these are restricted to the ground plane. Moreover, the annotations are for centroids only, and do not capture the extent (in terms of length, width and height) of objects in world coordinates. In our performance comparisons, the outputs of the proposed MV-GLMB-OC filter on WILDTRACK are limited to the estimated centroids projected onto the ground plane. To demonstrate the full capabilities of MV-GLMB-OC, it is critical to have annotations of the 3D centroids and their 3D extent, along with the ground truths for each of the camera locations. Consequently, we introduce a new Curtin Multi-Camera (CMC) dataset which meets these requirements.

The new CMC dataset is a four-camera 1920x1024 resolution dataset recorded at 4 fps in a room with dimensions 7.67m x 3.41m x 2.7m. The CMC dataset has 5 different sequences with varying levels of person density and occlusion: CMC1 has a maximum of 3 people and virtually no occlusion; CMC2 has a maximum of 10 people with some occlusion; CMC3 has a maximum of 15 people with significant occlusion; while CMC4 and CMC5 involve people jumping and falling with a maximum of 3 and 7 people respectively.
CMC1 and CMC4 have low person density and are intended for basic testing, while CMC2, CMC3 and CMC5 have higher person density and significant visual occlusions across multiple overlapping cameras, and are intended to highlight performance differences. The convention for the world coordinate frame is illustrated in Fig. 6. The origin is at the lower corner and the ground plane corresponds to the x-y plane, i.e. z = 0. In every sequence, each person enters the tracking area at (2. , . ) with an average height of . . The dataset is also supplied with camera locations and parameters, along with annotations for 3D centroid and extent. The 2D monocular annotation of bounding boxes is carried out with the MATLAB Image Labeler Tool, and the world coordinates are obtained by averaging the homographic projections of the feet coordinates from each view. The actual height and width of each person is used for the annotation.

Fig. 6: Layout for the CMC dataset: The blue line denotes the boundary of the tracking area. The yellow boxes denote the coordinates of the boundary in the (x, y, z) axes. The 4 cameras are positioned (in sequence) at the top 4 corners of the room.
A common setting for the object survival and detection model parameters is used in the evaluations on both the WILDTRACK and CMC datasets. Specifically: the survival probability P_S(x) given by (29) is parameterized by the control parameter τ = 0.5 and the scene mask b(·) with a margin of 0.3m inside the border of the tracking area; the detection probability, given in Section 3.5, is parameterized by P_D^{(c)}(x) = 0. and β = 0. . For all cameras, the observed bounding box model is described by (18), with position noise parameterized by υ_p^{(c)} = [400, ]^T, and extent noise parameterized by υ_e^{(c)} = [0. , . ]^T (on the logarithms of the half-lengths of the principal axes).

The performance of various combinations of detectors and trackers is evaluated using the CLEAR MOT devkit provided in [62]. For computing CLEAR MOT, we adhere to the convention of using the Euclidean distance (L₂-norm) on the estimated 3D centroid with a threshold of 1m. For MOT, the following performance indicators are reported: Multiple Object Tracking Accuracy (MOTA), which penalizes normalized false negatives (FNs), false positives (FPs) and identity switches (IDs) between consecutive frames; Multiple Object Tracking Precision (MOTP), which accounts for the overall dissimilarity between all true positives and the corresponding ground truth objects [63]; Mostly Tracked (MT), Partially Tracked (PT), and Mostly Lost (ML), which indicate how much of the trajectory is retained or lost by the tracker; Fragmentations (FM), which accounts for interrupted tracks based on ground truth trajectories; and Identity Precision (IDP), Identity Recall (IDR) and F1 score (IDF1), which are analogous to the standard precision, standard recall and F1 score with identifications (tracks) [47].

For reference, we also provide performance indicators on the bounding box detections, where we set the threshold at 0.5 and report: Multiple Object Detection Accuracy (MODA), which accounts for misdetections and false alarms; Multiple Object Detection Precision (MODP), which accounts for the spatial overlap information between the bounding boxes; precision, which is the measure of exactness; and recall, which is the measure of quality.

We note that CLEAR MOT is traditionally calculated over the entire scenario window, and thus the tracking performance is reported after the entire data stream has been processed. To evaluate the live or online tracking performance over time, we employ the Optimal Sub-Pattern Assignment (OSPA(2)) distance between two sets of tracks [21]. This distance is based on the OSPA metric, which captures both localization and cardinality errors between two finite sets of a metric space with a suitable base-distance between objects (e.g. the Euclidean distance) [64]. The OSPA(2) metric is defined as the OSPA distance between two sets of tracks over a time window. Details of the OSPA and OSPA(2) metrics are given in Appendix 7.2. By design, OSPA(2) captures both localization and cardinality errors between the sets of true and estimated tracks, and penalizes switched tracks or label changes [21]. The resulting metric carries the interpretation of a time-averaged per-track error. In our evaluation of the position estimate in real-world coordinates, we use a 3D Euclidean base-distance for OSPA(2) with order parameter 1 and cutoff parameter 1m.
Performance evaluation for live or online tracking is given by plotting the error over a sliding window of length L_w = 10 frames, while overall performance is captured in a single number by calculating the error over the entire scenario window.

As the proposed MV-GLMB-OC filter outputs 3D estimates of the object centroid and extent, we extend the performance evaluations to capture the joint error in the centroid and extent. This is achieved by employing an alternative base-distance between two objects, in this case a 3D generalized intersection over union (GIoU), which extends the commonly used IoU to non-overlapping bounding boxes [65]. The details of the IoU and GIoU metrics are given in Appendix 7.3. It is important to note that if there is no overlap between the ground truth and estimated shape, the IoU distance is zero regardless of their separation, whereas the GIoU distance captures the extent of the error while retaining the metric property [65]. We present evaluations of the estimated centroid with extent for CLEAR MOT (using a GIoU base-distance with a threshold of 0.5) and the OSPA(2) metric with GIoU base-distance (and with unit order and cut-off parameters). We refer the reader to [66] for the rationale and discussion on the use of OSPA(2)-GIoU for performance evaluation.
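For reference, a minimal sketch of the generalized IoU of [65] for two axis-aligned 3D boxes: GIoU = IoU − |C \ (A ∪ B)| / |C|, where C is the smallest enclosing box; how this quantity is turned into the base-distance used here is detailed in Appendix 7.3 and [66].

```python
import numpy as np

def giou_3d(min_a, max_a, min_b, max_b):
    """Generalized IoU [65] for axis-aligned 3D boxes given as corner arrays."""
    min_a, max_a = np.asarray(min_a, float), np.asarray(max_a, float)
    min_b, max_b = np.asarray(min_b, float), np.asarray(max_b, float)
    inter = np.prod(np.clip(np.minimum(max_a, max_b) - np.maximum(min_a, min_b),
                            0.0, None))                 # intersection volume
    union = np.prod(max_a - min_a) + np.prod(max_b - min_b) - inter
    hull = np.prod(np.maximum(max_a, max_b) - np.minimum(min_a, min_b))
    return inter / union - (hull - union) / hull        # in (-1, 1]

# identical boxes -> 1; disjoint boxes -> negative, unlike plain IoU
print(giou_3d([0, 0, 0], [1, 1, 1], [0, 0, 0], [1, 1, 1]))   # 1.0
print(giou_3d([0, 0, 0], [1, 1, 1], [3, 0, 0], [4, 1, 1]))   # -0.5
```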
We test MV-GLMB-OC against the latest multi-camera detector (Deep-Occlusion) [18] coupled with the k-shortest-path (KSP) algorithm [5] and ptrack, as shown in [17] (Deep-Occlusion+KSP+ptrack). KSP is an optimization algorithm that finds the most likely sequence of ground plane occupancies (trajectories) given by the multi-camera detector, and ptrack, described in [67], improves and smooths the tracks by learning motion patterns. As a baseline comparison, we employ the Deep-Occlusion multi-camera detector combined with single-view GLMB (Deep-Occlusion+GLMB). Since WILDTRACKS provides annotations in real-world coordinates but restricted to the ground plane, tracking is performed in real-world coordinates, also restricted to the ground plane. To further explore the performance of MV-GLMB-OC, we also run experiments using monocular detections from each of the cameras. For the detectors, we use the monocular backbone of the Deep-Occlusion detector, i.e. VGG16-net trained using Faster-RCNN [23], and separately the newer YOLOv3 [68], to produce separate monocular detections for input to MV-GLMB-OC. Since WILDTRACKS does not supply the camera positions required for our proposed occlusion model, we reconstruct the camera positions from the given camera parameters. We note that KSP and/or ptrack is an offline or batch method, while GLMB is online or recursive, and provides estimates on the fly.

The birth density is adaptive/measurement-driven (see Section F in [69]) with P_B,+(ℓ) = 0.… and f_B,+(x, ℓ) = N(x; µ_B,+^(ℓ), 0.… I), where µ_B,+^(ℓ) is obtained via clustering (e.g. k-means). The single-object transition is as described in (26), with position noise and extent (in logarithm) noise parameterized by υ^(p) = [0.…, 0.…, 0.…]^T and υ^(s) = [0.…, 0.…, 0.…]^T.

TABLE 2: WILDTRACKS Performance Benchmarks for 3D Position Estimates (restricted to the ground plane)

| Detector and Tracker | IDF1↑ | IDP↑ | IDR↑ | MT↑ | PT↓ | ML↓ | FP↓ | FN↓ | IDs↓ | FM↓ | MOTA↑ | MOTP↑ | OSPA(2)↓ |
| YOLOv3+MV-GLMB-OC | 74.3% | – | – | – | – | – | – | – | – | – | – | – | – |
| YOLOv3+MS-GLMB | 74.2% | 79.0% | 69.9% | 116 | 85 | 83 | 841 | 1951 | 139 | 105 | 61.9% | 68.3% | 0.81m |
| Faster-RCNN(VGG16)+MV-GLMB-OC | 76.5% | 84.5% | 70.0% | 119 | 118 | 47 | 545 | 1621 | 104 | 81 | 65.3% | 71.9% | 0.72m |
| Faster-RCNN(VGG16)+MS-GLMB | 75.5% | 76.8% | 74.3% | 98 | 104 | 82 | 1114 | 1716 | 179 | 116 | 61.5% | 65.8% | 0.88m |
| Deep-Occlusion+GLMB | 72.5% | 82.7% | 72.2% | 160 | 86 | 39 | 960 | – | 74 | 25 | – | – | – |
| Deep-Occlusion+KSP+ptrack | – | – | – | – | – | – | – | – | – | – | – | – | – |

CLEAR MOT scores and OSPA(2) distance are calculated on standard position estimates (↑ means higher is better; ↓ means lower is better). Three different detectors are considered: Deep-Occlusion (multiocular), Faster-RCNN (VGG16) (monocular) and YOLOv3 (monocular). Three types of trackers are considered: KSP+ptrack or GLMB (single-sensor), MV-GLMB-OC (multi-view with occlusion model) and MS-GLMB (multi-sensor without occlusion model).

Table 2 shows the CLEAR MOT and OSPA(2) benchmarks for MV-GLMB-OC (with occlusion modeling) and MS-GLMB (without occlusion modeling) with two different detectors, YOLOv3 and Faster-RCNN(VGG16). Results for Deep-Occlusion+KSP+ptrack, being the reference, are reproduced directly from the original paper [17]. The results indicate that the two trackers based on multi-camera detections, i.e. Deep-Occlusion+KSP+ptrack and Deep-Occlusion+GLMB, have very similar tracking performance in terms of MOTA/MOTP and OSPA(2). Importantly, closer examination of the tracking results based on multiple monocular detections indicates that performance is significantly improved by the addition of the occlusion model, as seen from the relative changes in MOTA/MOTP and OSPA(2). Several observations can also be drawn from comparing the multi-camera detector with batch processing (Deep-Occlusion+KSP+ptrack) against the related monocular detector with online processing (Faster-RCNN(VGG16)+MV-GLMB-OC). While the MOTP improves due to the use of multiple monocular detectors, the MOTA degrades due to the use of an online method, which is unable to correct past estimates. This is corroborated by the overall OSPA(2) value, which improves slightly from Deep-Occlusion+KSP+ptrack to Faster-RCNN(VGG16)+MV-GLMB-OC. Surprisingly, the results based on YOLOv3 are better across the board than those for Faster-RCNN(VGG16), even though YOLOv3 is more efficient than Faster-RCNN(VGG16).
For reference, the CLEAR evaluations for the detectors used in the experiment are presented in Appendix 7.4, from which it is noted that the monocular detections are generally much poorer than the multi-camera detections due to severe occlusions.
This subsection focuses on scenarios with people walking, in order of increasing difficulty, i.e. CMC1-CMC3. Similar to the WILDTRACKS evaluation, we evaluate our method with two monocular detectors, namely Faster-RCNN(VGG16) and YOLOv3. For each sequence, the effect of the occlusion model is studied by comparing the proposed MV-GLMB-OC with the standard MS-GLMB filter.
Unlike WILDTRACKS, where objects enter the scene from anywhere at the boundary, in CMC we know the locations at which objects enter the scene. Hence, we specify the birth parameters as P_B,+(ℓ) = 0.… and f_B,+(x, ℓ) = N(x; µ_B,+, 0.… I), where µ_B,+ = [2.03, 0, 0.71, 0, 0.825, 0, −0.…, −0.…, −0.…]^T (an illustrative encoding of this birth term is sketched after the table overview below). We use the single-object transition density (26) with position noise and extent (in logarithm) noise parameterized by υ^(p) = [0.…, 0.…, 0.…]^T and υ^(s) = [0.…, 0.…, 0.…]^T.

Table 3 shows the CLEAR MOT and OSPA(2) benchmarks with a Euclidean base-distance, for the estimated 3D centroids only. Table 4 shows the CLEAR MOT and OSPA(2) benchmarks with a 3D GIoU base-distance, for the estimated 3D centroids and extents. Both tables compare the tracking performance with and without the occlusion model, i.e. MV-GLMB-OC and MS-GLMB respectively. The asterisked entries denote the multi-camera reconfiguration case, which is discussed later on. All results are presented for the two detectors, YOLOv3 and Faster-RCNN(VGG16). We focus our initial examination on the non-asterisked entries in Tables 3 and 4; these correspond to the case where all cameras are operational.
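To illustrate how a fixed-location LMB birth term of the kind specified above might be encoded: it reduces to an existence probability plus a Gaussian. The field names and placeholder numbers below are hypothetical, since the source values for the birth probability and covariance scale are not reproduced here:

```python
import numpy as np

def make_birth_component(mean, birth_prob, cov_scale):
    # One labeled multi-Bernoulli birth term: an existence probability
    # P_{B,+} and a Gaussian density N(x; mu_{B,+}, cov_scale * I).
    # Field names are hypothetical; a sketch, not the authors' data structure.
    mean = np.asarray(mean, dtype=float)   # e.g. the 9-dim mean mu_{B,+} above
    return {"r_B": float(birth_prob),
            "mean": mean,
            "cov": cov_scale * np.eye(mean.size)}

# e.g. (placeholder birth probability, covariance scale and extent entries):
# make_birth_component([2.03, 0, 0.71, 0, 0.825, 0, -0.5, -0.5, -0.5],
#                      birth_prob=0.05, cov_scale=0.1)
```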
TABLE 3: CMC1,2,3 Performance Benchmarks for 3D Position Estimates

CMC1 (Maximum/Average 3 people)
| Detector and Tracker | IDF1↑ | IDP↑ | IDR↑ | MT↑ | PT↓ | ML↓ | FP↓ | FN↓ | IDs↓ | FM↓ | MOTA↑ | MOTP↑ | OSPA(2)↓ |
| YOLOv3+MV-GLMB-OC | – | – | – | – | – | – | – | – | – | – | – | – | – |
| YOLOv3+MV-GLMB-OC* | 98.9% | 97.9% | 99.8% | 14 | 1 | – | – | – | – | – | – | – | – |
| YOLOv3+MS-GLMB | – | – | – | – | – | – | 55 | 1 | 1 | – | – | – | – |
| Faster-RCNN(VGG16)+MV-GLMB-OC* | 95.5% | 91.4% | – | – | – | – | – | – | – | – | – | – | – |

CMC2 (Maximum/Average 10 people)
| Detector and Tracker | IDF1↑ | IDP↑ | IDR↑ | MT↑ | PT↓ | ML↓ | FP↓ | FN↓ | IDs↓ | FM↓ | MOTA↑ | MOTP↑ | OSPA(2)↓ |
| YOLOv3+MV-GLMB-OC | – | – | – | – | – | – | – | 11 | 9 | 2 | 98.3% | – | – |
| YOLOv3+MV-GLMB-OC* | 90.1% | 90.2% | 90.0% | 10 | 0 | 0 | 38 | 29 | 11 | 7 | 96.2% | 78.9% | 0.34m |
| YOLOv3+MS-GLMB | 67.7% | 79.9% | 58.9% | 4 | 6 | 0 | – | – | – | – | – | – | – |
| Faster-RCNN(VGG16)+MV-GLMB-OC | – | – | – | 10 | 0 | 0 | 50 | 37 | – | – | – | – | – |
| Faster-RCNN(VGG16)+MV-GLMB-OC* | – | – | – | 10 | 0 | 0 | 120 | 60 | 25 | 13 | 90.1% | 79.8% | 0.48m |
| Faster-RCNN(VGG16)+MS-GLMB | 75.3% | 81.9% | 69.7% | 7 | 3 | 0 | – | 316 | 23 | 19 | 83.3% | 80.4% | 0.58m |

CMC3 (Maximum/Average 15 people)
| Detector and Tracker | IDF1↑ | IDP↑ | IDR↑ | MT↑ | PT↓ | ML↓ | FP↓ | FN↓ | IDs↓ | FM↓ | MOTA↑ | MOTP↑ | OSPA(2)↓ |
| YOLOv3+MV-GLMB-OC | – | – | – | – | – | – | – | 191 | 44 | – | – | – | – |
| YOLOv3+MV-GLMB-OC* | 72.1% | 77.9% | 67.2% | 11 | 4 | 0 | 47 | 437 | 51 | 37 | 81.1% | 72.3% | 0.61m |
| YOLOv3+MS-GLMB | 50.5% | 69.9% | 39.5% | 0 | 15 | 0 | 71 | 303 | – | – | – | – | – |
| Faster-RCNN(VGG16)+MV-GLMB-OC | – | – | – | – | – | – | – | – | – | – | – | – | – |
| Faster-RCNN(VGG16)+MV-GLMB-OC* | – | – | – | – | – | – | 92 | 419 | 59 | 44 | 79.8% | 68.0% | 0.70m |
| Faster-RCNN(VGG16)+MS-GLMB | 54.3% | 73.2% | 43.1% | 0 | 15 | 0 | – | – | – | – | – | – | – |

CLEAR MOT scores and OSPA(2) distance are calculated on standard position estimates (↑ means higher is better; ↓ means lower is better). Two different detectors are considered: Faster-RCNN(VGG16) (monocular) and YOLOv3 (monocular). Two types of trackers are considered: MV-GLMB-OC (multi-view with occlusion model) and MS-GLMB (multi-sensor without occlusion model). The asterisk (*) indicates the multi-camera reconfiguration experiment.
TABLE 4: CMC1,2,3 Performance Benchmarks for 3D Centroid with Extent Estimates

CMC1 (Maximum/Average 3 people)
| Detector and Tracker | IDF1↑ | IDP↑ | IDR↑ | MT↑ | PT↓ | ML↓ | FP↓ | FN↓ | IDs↓ | FM↓ | MOTA↑ | MOTP↑ | OSPA(2)↓ |
| YOLOv3+MV-GLMB-OC | – | – | – | – | – | – | – | – | – | – | – | – | – |
| YOLOv3+MV-GLMB-OC* | 98.9% | 97.9% | 99.8% | 14 | 1 | – | – | – | – | – | – | – | – |
| YOLOv3+MS-GLMB | 95.9% | 92.3% | 99.8% | – | – | – | 55 | 1 | 1 | – | – | – | – |
| Faster-RCNN(VGG16)+MV-GLMB-OC* | 95.5% | 91.4% | – | – | – | – | – | – | – | – | – | – | – |
| Faster-RCNN(VGG16)+MS-GLMB | 99.6% | 99.2% | – | – | – | – | – | – | – | – | – | – | – |

CMC2 (Maximum/Average 10 people)
| Detector and Tracker | IDF1↑ | IDP↑ | IDR↑ | MT↑ | PT↓ | ML↓ | FP↓ | FN↓ | IDs↓ | FM↓ | MOTA↑ | MOTP↑ | OSPA(2)↓ |
| YOLOv3+MV-GLMB-OC | – | – | – | – | – | – | – | – | – | – | – | – | – |
| YOLOv3+MV-GLMB-OC* | 87.3% | 87.1% | 87.5% | 10 | 0 | 0 | 53 | 44 | 14 | 12 | 94.7% | 57.0% | 0.38 |
| YOLOv3+MS-GLMB | 59.4% | 69.9% | 51.7% | 4 | 6 | 0 | 21 | 563 | 30 | 31 | 70.4% | 55.7% | 0.62 |
| Faster-RCNN(VGG16)+MV-GLMB-OC | 86.7% | 86.5% | 87.0% | 10 | 0 | 0 | 68 | 55 | 10 | 8 | 93.6% | 60.9% | 0.34 |
| Faster-RCNN(VGG16)+MV-GLMB-OC* | 81.3% | 80.2% | 82.5% | 10 | 0 | 0 | 127 | 67 | 33 | 15 | 89.1% | 55.0% | 0.45 |
| Faster-RCNN(VGG16)+MS-GLMB | 68.6% | 74.6% | 63.5% | 7 | 3 | 0 | 23 | 332 | 23 | 21 | 81.8% | 57.1% | 0.52 |

CMC3 (Maximum/Average 15 people)
| Detector and Tracker | IDF1↑ | IDP↑ | IDR↑ | MT↑ | PT↓ | ML↓ | FP↓ | FN↓ | IDs↓ | FM↓ | MOTA↑ | MOTP↑ | OSPA(2)↓ |
| YOLOv3+MV-GLMB-OC | – | – | – | – | – | – | – | 222 | 45 | 37 | 87.2% | 52.8% | 0.53 |
| YOLOv3+MV-GLMB-OC* | 60.8% | 65.7% | 56.6% | 9 | 6 | 0 | 91 | 481 | 66 | 56 | 77.4% | 46.4% | 0.60 |
| YOLOv3+MS-GLMB | 41.4% | 57.3% | 32.4% | 0 | 15 | 0 | 97 | 329 | 63 | 41 | 82.7% | – | – |
| Faster-RCNN(VGG16)+MV-GLMB-OC | – | – | – | – | – | – | – | – | – | – | – | – | – |
| Faster-RCNN(VGG16)+MV-GLMB-OC* | – | – | – | – | – | – | 133 | 460 | 78 | 60 | 76.3% | 47.9% | 0.66 |
| Faster-RCNN(VGG16)+MS-GLMB | 45.7% | 61.7% | 36.3% | 0 | 15 | 0 | 13 | 1175 | 61 | 67 | 55.8% | 46.6% | 0.75 |
CLEAR MOT scores and OSPA(2) distance are calculated with a 3D GIoU base-distance for estimates of the 3D centroid with extent (↑ means higher is better; ↓ means lower is better). Two different detectors are considered: Faster-RCNN(VGG16) (monocular) and YOLOv3 (monocular). Two types of trackers are considered: MV-GLMB-OC (multi-view with occlusion model) and MS-GLMB (multi-sensor without occlusion model). The asterisk (*) indicates the multi-camera reconfiguration experiment.

For the sparse scenario CMC1, both MV-GLMB-OC and MS-GLMB, on either detector, achieved close-to-perfect CLEAR MOT scores in MOTA and MOTP. Some of the flagged FPs are caused by track initiation/termination mismatches with the ground truths (annotations). The OSPA(2) values are relatively low due to the sparsity of the scenario.

For the medium scenario CMC2, Fig. 7 shows a screenshot of the detections and the MV-GLMB-OC estimates. In this case, MV-GLMB-OC on both detectors managed to maintain consistent tracks and accurate estimates overall. The CLEAR MOT benchmarks for CMC2 show high MOTA and MOTP, but with some FNs and FPs. We observe an improvement in performance for MV-GLMB-OC over MS-GLMB, on both detectors, due to the inclusion of occlusion modeling. The improvement due to occlusion modeling is also reflected in the OSPA(2).

For the dense scenario CMC3, MV-GLMB-OC on both detectors managed to achieve acceptable MOTA/MOTP scores, but is penalized with high FPs, FNs, IDs and FMs. This outcome occurs even with the proposed occlusion model, as the algorithm fails when a person is totally occluded in all views. An example of this occurrence is illustrated in Fig. 8, where the red bounding boxes denote detections, while the yellow bounding boxes indicate people who are undetected in all views. Such an event can cause track termination/switching and is reflected in the performance evaluation. It is evident from Tables 3 and 4 that the tracking performance improves considerably with the occlusion model. Examination of the OSPA(2) error leads to a similar conclusion.

Fig. 7: CMC2 Camera 1 to 4 (left to right): YOLOv3 detections (top row) and MV-GLMB-OC estimates (bottom row).

Fig. 8: CMC3 Camera 1 to 4 (left to right): YOLOv3 detections (red bounding boxes) and people that are occluded in all four cameras (yellow bounding boxes).

Overall, YOLOv3+MV-GLMB-OC performs slightly better than Faster-RCNN(VGG16)+MV-GLMB-OC due to better detections. The tracking performance of the proposed MV-GLMB-OC filter generally degrades as the number of people in the scene increases, since visual occlusions become more frequent and more difficult to resolve. The results of this study on the proposed occlusion model suggest that, without proper modeling of the probability of detection, the algorithm fails to maintain tracks, resulting in poorer tracking results. The CLEAR evaluation for the monocular detectors used is given in Appendix 7.4.
The MV-GLMB-OC approach requires only a one-off training of each monocular detector, and hence can operate without retraining and without interruption in the event that cameras are added, removed or repositioned on the fly. To demonstrate this capability, we design a multi-camera reconfiguration experiment. At the start of the sequence, all four cameras are operational. Later, one camera is taken offline to mimic a camera failure. Subsequently, two cameras are taken offline to mimic a more severe camera failure. After this, the two previously offline cameras are made operational, while the previously operational cameras are taken offline, mimicking the event that the two operational cameras are moved to different locations. We benchmark the multi-camera reconfiguration experiment against the ideal case where all cameras are operational.

Results for the experiments on multi-camera reconfiguration are denoted with an asterisk in Tables 3 and 4. The reported CLEAR MOT scores and OSPA(2) errors show similar trends with respect to the inclusion of the occlusion model, increasing scenario density, and relative performance of the two detectors. The tracking performance in the multi-camera reconfiguration case is generally worse than when all cameras are active. This relative observation is in line with expectations, as there is less sensor data available to resolve occlusions and perform estimation. To facilitate a more detailed examination of the relative performance, Fig. 9 plots the
OSPA(2) error with 3D GIoU base-distance over a sliding window with time. As a reference point for the performance comparison, the YOLOv3+MV-GLMB-OC case with all cameras operational is also shown. The spikes in the error curve at the beginning and the end of the scenario are due to mismatches in track initiation and termination with the ground truths. For CMC1, we observe that the error curves are relatively close to the reference case. This is expected for a sparse scenario, as there are virtually no occlusions even when some cameras are offline. For CMC2 and CMC3, the error curves for both YOLOv3+MV-GLMB-OC* and Faster-RCNN(VGG16)+MV-GLMB-OC* begin to deviate from the all-cameras-operational reference midway into the sequence. The errors become more pronounced entering the two-camera-only segment, as the more crowded scenarios exacerbate the effect of occlusions and misdetections. Nonetheless, the results show that the MV-GLMB-OC filter is able to accommodate on-the-fly changes to the camera configuration.
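For concreteness, the sliding-window protocol can be sketched as follows, with tracks stored as dictionaries mapping frame indices to state vectors and a Euclidean base-distance (swapping in d_GIoU gives the curves of Fig. 9). This is an illustrative reading of Appendix 7.2 with unit order, not our MATLAB implementation:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def track_base_distance(f, g, window, c):
    # Base-distance (49) between tracks f, g over the given time window.
    # f, g: dicts {frame index: state vector}; absent key = track not present.
    times = [t for t in window if t in f or t in g]
    if not times:
        return 0.0
    total = 0.0
    for t in times:
        if t in f and t in g:
            total += min(c, float(np.linalg.norm(np.asarray(f[t]) - np.asarray(g[t]))))
        else:
            total += c  # one of the per-time singleton sets is empty
    return total / len(times)

def ospa2(true_tracks, est_tracks, window, c=1.0):
    # OSPA(2): OSPA over sets of tracks, with (49) as base-distance.
    m, n = len(true_tracks), len(est_tracks)
    if m == 0 and n == 0:
        return 0.0
    if m == 0 or n == 0:
        return c
    if m > n:
        true_tracks, est_tracks, m, n = est_tracks, true_tracks, n, m
    D = np.array([[track_base_distance(f, g, window, c)
                   for g in est_tracks] for f in true_tracks])
    row, col = linear_sum_assignment(D)       # optimal track assignment
    return (D[row, col].sum() + c * (n - m)) / n

def sliding_window_ospa2(true_tracks, est_tracks, num_frames, Lw=10, c=1.0):
    # Error curve over a sliding window of length Lw, as plotted in Fig. 9.
    return [ospa2(true_tracks, est_tracks, range(max(0, t - Lw + 1), t + 1), c)
            for t in range(num_frames)]
```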
Here we present the first multi-camera dataset with people jumping and falling, which is more challenging for MOT than scenarios with only normal walking. We demonstrate the versatility of the proposed MOT framework by using a Jump Markov System (JMS) to cater for potential switching between upright and fallen modes [70].

Fig. 9: Multi-Camera Reconfiguration Experiment: OSPA(2) plots with 3D GIoU base-distance for estimates of 3D centroid with extent. Three trackers are considered: YOLOv3+MV-GLMB-OC* (multi-camera reconfiguration), Faster-RCNN+MV-GLMB-OC* (multi-camera reconfiguration) and YOLOv3+MV-GLMB-OC (all cameras operational).
Each state x is augmented with a discrete mode or class m ∈ {0, 1}, where m = 0 corresponds to a standing state and m = 1 corresponds to a fallen state. We consider the single-object state as (x, m), with single-object density p^(ξ)(x, m) = p^(ξ)(x|m) µ^(ξ)(m). The following single-object transition density and observation likelihood are used:

$$f_{S,+}(x_+, m_+ \,|\, x, m) = f_{S,+}^{(m_+)}(x_+ \,|\, x, \ell, m)\, \delta_{\ell}[\ell_+]\, \mu_+(m_+ | m),$$

$$g^{(c)}(z^{(c)} \,|\, x, m) \propto g_e^{(c)}(z_e^{(c)} | m)\; \mathcal{N}\!\left(z^{(c)};\; \Phi^{(c)}(x) + \begin{bmatrix} 0_{2\times 1} \\ -\upsilon_e^{(c,m)}/2 \end{bmatrix},\; \mathrm{diag}\!\left(\begin{bmatrix} \upsilon_p^{(c)} \\ \upsilon_e^{(c,m)} \end{bmatrix}\right)\right).$$

The mode transition probabilities are µ_+(0|0) = 0.…, µ_+(1|0) = 0.…, µ_+(0|1) = 0.… and µ_+(1|1) = 0.….

For a standing object, i.e. m = 0, we have υ_e^(c,0) = υ_e^(c) = [0.…, 0.…]^T in the above observation likelihood. Further, standing objects typically have a bounding box size ratio (y-axis/x-axis) greater than one, thus the mode-dependent likelihood component is chosen as

$$g_e^{(c)}(z_e^{(c)} \,|\, 0) = e^{\rho\left(\frac{[0\;1]\,z_e^{(c)}}{[1\;0]\,z_e^{(c)}} - 1\right)}$$

for all cameras, where ρ = 2 is a control parameter (a numerical sketch of this aspect-ratio gating follows the discussion of Tables 5 and 6 below). The transition density to another standing state, f_S,+^(0)(x_+|x, ℓ, 0), is the same as in the previous subsection.

For a fallen object, i.e. m = 1, we have υ_e^(c,1) = [0.…, 0.…]^T in the above observation likelihood, and the mode-dependent likelihood component is chosen as

$$g_e^{(c)}(z_e^{(c)} \,|\, 1) = e^{-\rho\left(\frac{[0\;1]\,z_e^{(c)}}{[1\;0]\,z_e^{(c)}} - 1\right)}$$

for all cameras, because fallen objects typically have a bounding box size ratio (y-axis/x-axis) less than one. The transition density to another fallen state, f_S,+^(1)(x_+|x, ℓ, 1), is the same as that for standing-to-standing, except for the larger variance υ^(s) = [0.…, 0.…, 0.…]^T, which captures all possible orientations during the fall.

For a state transition involving a mode switch, i.e. standing-to-fallen or fallen-to-standing, the transition density f_+^(1)(x_+|x, ℓ, 0) or f_+^(0)(x_+|x, ℓ, 1) takes the form (26), with position noise and extent (in logarithm) noise parameterized by υ^(p) = [0.…, 0.…, 0.…]^T and υ^(s) = [0.…, 0.…, 0.…]^T. Notice that the position noise is increased in the case of a mode switch compared to the case of no switching, in order to capture the abrupt change in the size of the object during mode switching.

The birth density is an LMB with parameters P_B,+(ℓ) = 0.… and

f_B,+(x, ℓ, 0) = 0.… N(x; µ_B,+,0, Σ_B,+,0),
f_B,+(x, ℓ, 1) = 0.… N(x; µ_B,+,1, Σ_B,+,1),
µ_B,+,0 = [2.03, 0, 0.71, 0, 0.825, 0, −0.…, −0.…, −0.…]^T,
µ_B,+,1 = [2.03, 0, 0.71, 0, 0.413, 0, −0.…, −0.…, −0.…]^T,
Σ_B,+,0 = Σ_B,+,1 = 0.… I.

Tables 5 and 6 show the CLEAR MOT and OSPA(2) benchmarks for MV-GLMB-OC and MS-GLMB on both detectors, YOLOv3 and Faster-RCNN(VGG16). The CLEAR evaluations for the monocular detections are given in Appendix 7.4.

For CMC4, which has a maximum of 3 people, both MV-GLMB-OC and MS-GLMB, on either detector, achieved high CLEAR MOT scores in MOTA/MOTP and low OSPA(2) errors. The incidence of FPs and FNs is caused by track initiation/termination mismatches with the ground truths. Nonetheless, we observe that on MOTA/MOTP and OSPA(2), MV-GLMB-OC outperforms MS-GLMB.

For CMC5, which has a maximum of 7 people, both MV-GLMB-OC and MS-GLMB, on either detector, were still capable of producing reasonable MOTA/MOTP scores and OSPA(2) errors. Fig. 10 shows a snapshot of detections and estimates on a single view. However, due to poor detections and more occlusions in CMC5, we observe many IDs and FNs. Again, on MOTA/MOTP and OSPA(2), MV-GLMB-OC outperforms MS-GLMB.
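The aspect-ratio gating g_e(·|m) used above can be sketched numerically as follows (illustrative, up to the omitted proportionality constant):

```python
import numpy as np

def mode_likelihood(z_extent, mode, rho=2.0):
    # Mode-dependent likelihood component g_e(z_e | m) described above:
    # standing (m = 0) favours boxes taller than wide, fallen (m = 1) the
    # reverse. z_extent = (x_size, y_size) of the observed bounding box.
    # A sketch of the aspect-ratio gating, up to proportionality.
    ratio = z_extent[1] / z_extent[0]        # y-axis / x-axis size ratio
    sign = 1.0 if mode == 0 else -1.0
    return float(np.exp(sign * rho * (ratio - 1.0)))

# e.g. a 0.5-wide by 1.8-tall box strongly supports the standing mode:
# mode_likelihood((0.5, 1.8), mode=0) > mode_likelihood((0.5, 1.8), mode=1)
```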
TABLE 5: CMC4,5 Performance Benchmarks for 3D Position Estimates

CMC4 (Jumping and Falling, Maximum/Average 3 people)
| Detector and Tracker | IDF1↑ | IDP↑ | IDR↑ | MT↑ | PT↓ | ML↓ | FP↓ | FN↓ | IDs↓ | FM↓ | MOTA↑ | MOTP↑ | OSPA(2)↓ |
| YOLOv3+MV-GLMB-OC | – | – | – | – | – | – | – | – | – | – | – | – | – |
| YOLOv3+MV-GLMB-OC* | 95.0% | 93.5% | 96.5% | – | – | – | 17 | 4 | 5 | 1 | 93.6% | 87.7% | 0.18m |
| YOLOv3+MS-GLMB | 95.9% | 94.0% | 97.8% | – | – | – | 21 | 5 | 4 | 1 | 92.6% | 86.4% | 0.21m |
| Faster-RCNN(VGG16)+MV-GLMB-OC | 98.0% | 98.5% | 97.5% | – | – | – | – | – | – | – | – | – | – |

CMC5 (Jumping and Falling, Maximum/Average 7 people)
| Detector and Tracker | IDF1↑ | IDP↑ | IDR↑ | MT↑ | PT↓ | ML↓ | FP↓ | FN↓ | IDs↓ | FM↓ | MOTA↑ | MOTP↑ | OSPA(2)↓ |
| YOLOv3+MV-GLMB-OC | – | – | – | – | – | – | – | – | – | – | – | – | – |
| YOLOv3+MV-GLMB-OC* | 59.3% | 58.1% | 60.1% | – | – | – | 410 | 1185 | 61 | 49 | 60.3% | 64.1% | 0.66m |
| Faster-RCNN(VGG16)+MV-GLMB-OC* | 56.2% | 55.6% | 59.2% | – | – | – | – | – | – | – | – | – | – |

CLEAR MOT scores and OSPA(2) distance are calculated on standard position estimates (↑ means higher is better; ↓ means lower is better). Two different detectors are considered: Faster-RCNN(VGG16) (monocular) and YOLOv3 (monocular). Two types of trackers are considered: MV-GLMB-OC (multi-view with occlusion model) and MS-GLMB (multi-sensor without occlusion model). The asterisk (*) indicates the multi-camera reconfiguration experiment.
TABLE 6: CMC4,5 Performance Benchmarks for 3D Centroid with Extent Estimates

CMC4 (Jumping and Falling, Maximum/Average 3 people)
| Detector and Tracker | IDF1↑ | IDP↑ | IDR↑ | MT↑ | PT↓ | ML↓ | FP↓ | FN↓ | IDs↓ | FM↓ | MOTA↑ | MOTP↑ | OSPA(2)↓ |
| YOLOv3+MV-GLMB-OC | – | – | – | – | – | – | – | – | – | – | – | – | – |
| YOLOv3+MV-GLMB-OC* | 95.0% | 93.5% | 96.5% | – | – | – | 17 | 4 | 5 | 1 | 93.6% | 58.9% | 0.20 |
| YOLOv3+MS-GLMB | 95.9% | 94.0% | 97.8% | – | – | – | 21 | 5 | 4 | 1 | 92.6% | 57.0% | 0.26 |
| Faster-RCNN(VGG16)+MV-GLMB-OC | 98.0% | 98.5% | 97.5% | – | – | – | – | – | – | – | – | – | – |

CMC5 (Jumping and Falling, Maximum/Average 7 people)
| Detector and Tracker | IDF1↑ | IDP↑ | IDR↑ | MT↑ | PT↓ | ML↓ | FP↓ | FN↓ | IDs↓ | FM↓ | MOTA↑ | MOTP↑ | OSPA(2)↓ |
| YOLOv3+MV-GLMB-OC | – | – | – | – | – | – | – | – | – | – | – | – | – |
| YOLOv3+MV-GLMB-OC* | 55.9% | 54.9% | 57.1% | – | – | – | 451 | 1008 | 72 | 57 | 59.9% | 43.1% | 0.66 |
| Faster-RCNN(VGG16)+MV-GLMB-OC* | 55.9% | 53.6% | 51.6% | – | – | – | – | – | – | – | – | – | – |

CLEAR MOT scores and OSPA(2) distance are calculated with a 3D GIoU base-distance for estimates of the 3D centroid with extent (↑ means higher is better; ↓ means lower is better). Two different detectors are considered: Faster-RCNN(VGG16) (monocular) and YOLOv3 (monocular). Two types of trackers are considered: MV-GLMB-OC (multi-view with occlusion model) and MS-GLMB (multi-sensor without occlusion model). The asterisk (*) indicates the multi-camera reconfiguration experiment.

Fig. 10: CMC5 Camera 1: YOLOv3 detections (left) and MV-GLMB-OC estimates (right).
The multi-camera reconfiguration experiment described in Section 5.3.3 is repeated for the multi-modal datasets CMC4 and CMC5. The results for the multi-camera reconfiguration are denoted with asterisks in Tables 5 and 6. The plot of the
OSPA(2) with 3D GIoU base-distance over a sliding window with time is given in Fig. 11. While observations similar to those of the experiments without jumping and falling (CMC1-CMC3) can be made, the results for CMC4-CMC5 exhibit different behavior for people in the fallen state. The estimated extent is warped out of its ordinary shape when the person is on the ground, and more data is required to infer the corresponding state of the fallen person. In CMC4-CMC5, the effect of occlusions or misdetections is exacerbated by having fewer cameras when the person is on the ground, which is likely to lead to track termination or switching. Nonetheless, the results confirm that the JMS variant of the MV-GLMB-OC algorithm can automatically accommodate multi-camera reconfiguration.
TABLE 7:
MV-GLMB-OC Runtime on WILDTRACKS and CMC
| Dataset (Cams) | Frames | No. Obj (avg) | Exec. Time (s/frame) |
| W.T. (7) | 400 | 20 | 18.0 |
| CMC1 (4) | 261 | 3 | 0.1 |
| CMC2 (4) | 263 | 10 | 3.2 |
| CMC3 (4) | 263 | 15 | 7.9 |
| CMC4 (4) | 147 | 3 | 0.4 |
| CMC5 (4) | 560 | 7 | 5.5 |
Fig. 11: Multi-Camera Reconfiguration Experiment: OSPA(2) plots with 3D GIoU base-distance for estimates of 3D centroid with extent. Three trackers are considered: YOLOv3+MV-GLMB-OC* (multi-camera reconfiguration), Faster-RCNN+MV-GLMB-OC* (multi-camera reconfiguration) and YOLOv3+MV-GLMB-OC (all cameras operational).

The runtimes for the MV-GLMB-OC filter on the WILDTRACKS and CMC datasets are summarized in Table 7. The current implementation is unoptimized MATLAB code. The reported runtimes are consistent with the computational complexity of the MV-GLMB-OC algorithm: quadratic in the number of objects and linear in the sum of the number of detections across all cameras.
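As an illustrative back-of-envelope check (not an analysis from the runtime study), a dominant quadratic-in-objects term predicts the CMC2-to-CMC3 cost ratio reasonably well:

```python
# Rough consistency check of the quadratic-in-objects claim using Table 7.
# Illustrative only; it assumes the object-count term dominates the cost.
n2, t2 = 10, 3.2   # CMC2: average objects, seconds per frame
n3, t3 = 15, 7.9   # CMC3: average objects, seconds per frame

print((n3 / n2) ** 2)   # ~2.25: ratio predicted by a quadratic term
print(t3 / t2)          # ~2.47: measured ratio, of the same order
```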
CONCLUSIONS
By developing a tractable 3D occlusion model, we have derived an online Bayesian multi-view multi-object filtering algorithm that only requires monocular detector training, independent of the multi-camera configuration. This enables the multi-camera system to operate uninterrupted in the event of extension/reconfiguration (including camera failures), obviating the need for multi-view retraining. Moreover, it addresses the multi-camera data association problem in a way that is scalable in the total number of detections. Experiments on existing 3D multi-camera datasets have demonstrated performance similar to the state-of-the-art batch method. We also demonstrated the ability of the proposed algorithm to track in densely populated scenarios with high occlusions, and with people jumping/falling, in the 3D world frame.

REFERENCES

[1] F. Poiesi, R. Mazzon, and A. Cavallaro, "Multi-target tracking on confidence maps: An application to people tracking," Comput. Vis. Image Und., vol. 117, no. 10, pp. 1257–1272, 2013.
[2] H. B. Shitrit, J. Berclaz, F. Fleuret, and P. Fua, "Multi-commodity network flow for tracking multiple people," IEEE Trans. Pattern Anal. Mach. Intell., vol. 36, no. 8, pp. 1614–1627, 2014.
[3] Y. Xu, X. Liu, Y. Liu, and S.-C. Zhu, "Multi-view people tracking via hierarchical trajectory composition," in IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., pp. 4256–4265, 2016.
[4] D. Y. Kim, B.-N. Vo, B.-T. Vo, and M. Jeon, "A labeled random finite set online multi-object tracker for video data," Pattern Recognition, vol. 90, pp. 377–389, 2019.
[5] J. Berclaz, F. Fleuret, E. Turetken, and P. Fua, "Multiple object tracking using k-shortest paths optimization," IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 9, pp. 1806–1819, 2011.
[6] A. Milan, S. Roth, and K. Schindler, "Continuous energy minimization for multitarget tracking," IEEE Trans. Pattern Anal. Mach. Intell., vol. 36, no. 1, pp. 58–72, 2014.
[7] X. Wang, E. Türetken, F. Fleuret, and P. Fua, "Tracking interacting objects using intertwined flows," IEEE Trans. Pattern Anal. Mach. Intell., vol. 38, no. 11, pp. 2312–2326, 2016.
[8] M. D. Breitenstein, F. Reichlin, B. Leibe, E. Koller-Meier, and L. Van Gool, "Online multiperson tracking-by-detection from a single, uncalibrated camera," IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 9, pp. 1820–1833, 2010.
[9] B. Babenko, M.-H. Yang, and S. Belongie, "Robust object tracking with online multiple instance learning," IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 8, pp. 1619–1632, 2010.
[10] R. Hoseinnezhad, B.-N. Vo, B.-T. Vo, and D. Suter, "Visual tracking of numerous targets via multi-Bernoulli filtering of image data," Pattern Recognition, vol. 45, no. 10, pp. 3625–3635, 2012.
[11] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista, "High-speed tracking with kernelized correlation filters," IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 3, pp. 583–596, 2014.
[12] P. Peng, Y. Tian, Y. Wang, J. Li, and T. Huang, "Robust multiple cameras pedestrian detection with multi-view Bayesian network," Pattern Recognition, vol. 48, no. 5, pp. 1760–1772, 2015.
[13] A. Andriyenko, S. Roth, and K. Schindler, "An analytical formulation of global occlusion reasoning for multi-target tracking," pp. 1839–1846, IEEE, 2011.
[14] S. L. Dockstader and A. M. Tekalp, "Multiple camera fusion for multi-object tracking," in Proc. 2001 IEEE Workshop on Multi-Object Tracking, pp. 95–102, IEEE, 2001.
[15] F. Fleuret, J. Berclaz, R. Lengagne, and P. Fua, "Multicamera people tracking with a probabilistic occupancy map," IEEE Trans. Pattern Anal. Mach. Intell., vol. 30, no. 2, pp. 267–282, 2008.
[16] T. Chavdarova and F. Fleuret, "Deep multi-camera people detection," pp. 848–853, IEEE, 2017.
[17] T. Chavdarova et al., "WILDTRACK: A multi-camera HD dataset for dense unscripted pedestrian detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 5030–5039, 2018.
[18] P. Baqué, F. Fleuret, and P. Fua, "Deep occlusion reasoning for multi-camera multi-target detection," in Proc. IEEE Int. Conf. Comput. Vis., pp. 271–279, 2017.
[19] J. Domke, "Learning graphical model parameters with approximate marginal inference," IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 10, pp. 2454–2467, 2013.
[20] B.-N. Vo, B.-T. Vo, and M. Beard, "Multi-sensor multi-object tracking with the generalized labeled multi-Bernoulli filter," IEEE Trans. Signal Process., vol. 67, no. 23, pp. 5952–5967, 2019.
[21] M. Beard, B.-T. Vo, and B.-N. Vo, "A solution for large-scale multi-object tracking," IEEE Trans. Signal Process., vol. 68, pp. 2754–2769, 2020.
[22] R. Girshick, "Fast R-CNN," in Proc. IEEE Int. Conf. Comput. Vis., pp. 1440–1448, 2015.
[23] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," in Advances in Neural Information Process. Systems, pp. 91–99, 2015.
[24] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Process. Systems, pp. 1097–1105, 2012.
[25] J. Redmon and A. Farhadi, "YOLO9000: Better, faster, stronger," in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 7263–7271, 2017.
[26] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 779–788, 2016.
[27] X. Zhao, H. Jia, and Y. Ni, "A novel three-dimensional object detection with the modified you only look once method," Int. J. Adv. Robot. Syst., vol. 15, no. 2, 2018.
[28] M. Andriluka, S. Roth, and B. Schiele, "People-tracking-by-detection and people-detection-by-tracking," pp. 1–8, IEEE, 2008.
[29] A. W. Smeulders, D. M. Chu, R. Cucchiara, S. Calderara, A. Dehghan, and M. Shah, "Visual tracking: An experimental survey," IEEE Trans. Pattern Anal. Mach. Intell., vol. 36, no. 7, pp. 1442–1468, 2013.
[30] X. Weng, Y. Wang, Y. Man, and K. M. Kitani, "GNN3DMOT: Graph neural network for 3D multi-object tracking with 2D-3D multi-feature learning," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., pp. 6499–6508, 2020.
[31] M. Liang, B. Yang, W. Zeng, Y. Chen, R. Hu, S. Casas, and R. Urtasun, "PnPNet: End-to-end perception and prediction with tracking in the loop," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., pp. 11553–11562, 2020.
[32] K. Otsuka and N. Mukawa, "Multiview occlusion analysis for tracking densely populated objects based on 2-D visual angles," in Proc. 2004 IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., vol. 1, IEEE, 2004.
[33] A. Osep, W. Mehner, M. Mathias, and B. Leibe, "Combined image- and world-space tracking in traffic scenes," in IEEE Int. Conf. Robotics and Automation (ICRA), pp. 1988–1995, IEEE, 2017.
[34] P. Li, J. Shi, and S. Shen, "Joint spatial-temporal optimization for stereo 3D object tracking," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., pp. 6877–6886, 2020.
[35] M. Pedersen, J. B. Haurum, S. H. Bengtson, and T. B. Moeslund, "3D-ZeF: A 3D zebrafish tracking benchmark dataset," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., pp. 2426–2436, 2020.
[36] D. Frossard and R. Urtasun, "End-to-end learning of multi-sensor 3D tracking by detection," pp. 635–642, IEEE, 2018.
[37] W. Zhang, H. Zhou, S. Sun, Z. Wang, J. Shi, and C. C. Loy, "Robust multi-modality multi-object tracking," in Proc. IEEE Int. Conf. Comput. Vis., pp. 2365–2374, 2019.
[38] S. Scheidegger, J. Benjaminsson, E. Rosenberg, A. Krishnan, and K. Granström, "Mono-camera 3D multi-object tracking using deep learning detections and PMBM filtering," pp. 433–440, IEEE, 2018.
[39] H.-N. Hu, Q.-Z. Cai, D. Wang, J. Lin, M. Sun, P. Krahenbuhl, T. Darrell, and F. Yu, "Joint monocular 3D vehicle detection and tracking," in Proc. IEEE Int. Conf. Comput. Vis., pp. 5390–5399, 2019.
[40] B. Leibe, K. Schindler, N. Cornelis, and L. Van Gool, "Coupled object detection and tracking from static cameras and moving vehicles," IEEE Trans. Pattern Anal. Mach. Intell., vol. 30, no. 10, pp. 1683–1698, 2008.
[41] X. Weng and K. Kitani, "A baseline for 3D multi-object tracking," arXiv preprint arXiv:1907.03961, 2019.
[42] R. P. Mahler, "Multitarget Bayes filtering via first-order multitarget moments," IEEE Trans. Aerospace and Electronic Systems, vol. 39, no. 4, pp. 1152–1178, 2003.
[43] R. P. Mahler, Advances in Statistical Multisource-Multitarget Information Fusion. Artech House, 2014.
[44] R. P. Mahler, Statistical Multisource-Multitarget Information Fusion. Artech House, 2007.
[45] E. Maggio, M. Taj, and A. Cavallaro, "Efficient multitarget visual tracking using random finite sets," IEEE Trans. Circuits and Systems for Video Tech., vol. 18, no. 8, pp. 1016–1027, 2008.
[46] B.-T. Vo and B.-N. Vo, "Labeled random finite sets and multi-object conjugate priors," IEEE Trans. Signal Process., vol. 61, no. 13, pp. 3460–3475, 2013.
[47] E. Ristani, F. Solera, R. Zou, R. Cucchiara, and C. Tomasi, "Performance measures and a data set for multi-target, multi-camera tracking," in European Conf. Comput. Vis., pp. 17–35, Springer, 2016.
[48] J. Ferryman and A. Shahrokni, "PETS2009: Dataset and challenge," pp. 1–6, IEEE, 2009.
[49] X. Alameda-Pineda, J. Staiano, R. Subramanian, L. Batrinca, E. Ricci, B. Lepri, O. Lanz, and N. Sebe, "SALSA: A novel dataset for multimodal group behavior analysis," IEEE Trans. Pattern Anal. Mach. Intell., vol. 38, no. 8, pp. 1707–1720, 2015.
[50] B. Ristic, S. Arulampalam, and N. Gordon, Beyond the Kalman Filter: Particle Filters for Tracking Applications. Artech House, 2003.
[51] B.-N. Vo, S. Singh, and A. Doucet, "Sequential Monte Carlo methods for multitarget filtering with random finite sets," IEEE Trans. Aerospace and Electronic Systems, vol. 41, no. 4, pp. 1224–1245, 2005.
[52] Y. Bar-Shalom, T. E. Fortmann, and P. G. Cable, "Tracking and data association," 1990.
[53] S. Blackman and R. Popoli, Design and Analysis of Modern Tracking Systems (Artech House Radar Library). Artech House, 1999.
[54] T. T. D. Nguyen and D. Y. Kim, "GLMB tracker with partial smoothing," Sensors, vol. 19, no. 20, p. 4419, 2019.
[55] M. Beard, B.-T. Vo, and B.-N. Vo, "Bayesian multi-target tracking with merged measurements using labelled random finite sets," IEEE Trans. Signal Process., vol. 63, no. 6, pp. 1433–1447, 2015.
[56] Z. Zhang, "A flexible new technique for camera calibration," IEEE Trans. Pattern Anal. Mach. Intell., vol. 22, no. 11, pp. 1330–1334, 2000.
[57] P. Schneider and D. H. Eberly, Geometric Tools for Computer Graphics. Elsevier, 2002.
[58] R. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision. Cambridge University Press, 2003.
[59] R. P. Mahler, B.-T. Vo, and B.-N. Vo, "CPHD filtering with unknown clutter rate and detection profile," IEEE Trans. Signal Process., vol. 59, no. 8, pp. 3497–3513, 2011.
[60] C.-T. Do and T. T. D. Nguyen, "Multiple marine ships tracking from multistatic Doppler data with unknown clutter rate," in Int. Conf. Control, Autom. and Inf. Sci. (ICCAIS), pp. 1–6, IEEE, 2019.
[61] B.-N. Vo, B.-T. Vo, and H. G. Hoang, "An efficient implementation of the generalized labeled multi-Bernoulli filter," IEEE Trans. Signal Process., vol. 65, no. 8, pp. 1975–1987, 2017.
[62] L. Leal-Taixé, A. Milan, I. Reid, S. Roth, and K. Schindler, "MOTChallenge 2015: Towards a benchmark for multi-target tracking," arXiv preprint arXiv:1504.01942, 2015.
[63] R. Kasturi, D. Goldgof, P. Soundararajan, V. Manohar, J. Garofolo, R. Bowers, M. Boonstra, V. Korzhova, and J. Zhang, "Framework for performance evaluation of face, text, and vehicle detection and tracking in video: Data, metrics, and protocol," IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 2, pp. 319–336, 2008.
[64] D. Schuhmacher, B.-T. Vo, and B.-N. Vo, "A consistent metric for performance evaluation of multi-object filters," IEEE Trans. Signal Process., vol. 56, no. 8, pp. 3447–3457, 2008.
[65] H. Rezatofighi, N. Tsoi, J. Gwak, A. Sadeghian, I. Reid, and S. Savarese, "Generalized intersection over union: A metric and a loss for bounding box regression," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 658–666, 2019.
[66] H. Rezatofighi, T. T. D. Nguyen, B.-N. Vo, B.-T. Vo, S. Savarese, and I. Reid, "How trustworthy are the existing performance evaluations for basic vision tasks?," arXiv preprint arXiv:2008.03533, 2020.
[67] A. Maksai, X. Wang, F. Fleuret, and P. Fua, "Non-Markovian globally consistent multi-object tracking," in Proc. IEEE Int. Conf. Comput. Vis., pp. 2544–2554, 2017.
[68] J. Redmon and A. Farhadi, "YOLOv3: An incremental improvement," arXiv preprint arXiv:1804.02767, 2018.
[69] S. Reuter, B.-T. Vo, B.-N. Vo, and K. Dietmayer, "The labeled multi-Bernoulli filter," IEEE Trans. Signal Process., vol. 62, no. 12, pp. 3246–3260, 2014.
[70] Y. G. Punchihewa, B.-T. Vo, B.-N. Vo, and D. Y. Kim, "Multiple object tracking in unknown backgrounds with labeled random finite sets," IEEE Trans. Signal Process., vol. 66, no. 11, pp. 3040–3055, 2018.
Jonah Ong received the B.E. degree in electrical and power engineering with first-class honors from Curtin University, Perth, Western Australia, in 2018. He is currently working towards the Ph.D. degree in electrical engineering at Curtin University. His research interests include statistical signal processing, Bayesian filtering and estimation, random sets, and multi-target tracking.

Ba-Tuong Vo received the B.Sc. degree in applied mathematics and the B.E. degree in electrical and electronic engineering (with first-class honors) in 2004, and the Ph.D. degree in engineering (with Distinction) in 2008, all from the University of Western Australia. He is currently a Professor of Signal Processing at Curtin University and has primary research interests in random sets, filtering and estimation, and multiple object systems.

Ba-Ngu Vo received his Bachelor degrees in Mathematics and Electrical Engineering with first-class honors in 1994, and his PhD in 1997. Currently he is Professor of Signals and Systems at Curtin University. Vo is a Fellow of the IEEE, and is best known as a pioneer in the stochastic geometric approach to multi-object systems. His research interests are signal processing, systems theory and stochastic geometry.

Du Yong Kim received the B.E. degree in electrical and electronics engineering from Ajou University, Korea, in 2005. He received the M.S. and Ph.D. degrees from the Gwangju Institute of Science and Technology, Korea, in 2006 and 2011, respectively. As a Postdoctoral Researcher, he worked on statistical signal processing and image processing at the Gwangju Institute of Science and Technology (2011-2012), the University of Western Australia (2012-2014), and Curtin University (2014-2018). He is currently a Vice-Chancellor's Research Fellow at the School of Engineering, RMIT University. His main research interests include Bayesian filtering theory and its applications to machine learning, computer vision, sensor networks, and automatic control.

Sven Nordholm received his PhD in Signal Processing in 1992, Licentiate of Engineering in 1989, and MScEE (Civilingenjör) in 1983, all from Lund University, Sweden. Since 1999, Nordholm has been Professor of Signal Processing in the School of Electrical and Computer Engineering, Curtin University. He is a co-founder of two start-up companies: Sensear, providing voice communication in extreme noise conditions, and Nuheara, a hearables company. He was a lead editor for a special issue on assistive listening techniques in IEEE Signal Processing Magazine and several other EURASIP special issues. He is a former Associate Editor for EURASIP Advances in Signal Processing and the Journal of the Franklin Institute, and is currently an Associate Editor of IEEE/ACM Transactions on Audio, Speech and Language Processing. His primary research has encompassed speech enhancement, adaptive and optimum microphone arrays, audio signal processing and wireless communication. He has written more than 200 papers in refereed journals and conference proceedings, frequently contributes book chapters and encyclopedia articles, and holds seven patents in the area of speech enhancement and microphone arrays.

APPENDIX

7.1 MS-GLMB Recursion without Occlusions
Under the standard multi-object model with no occlusions, i.e. $P_D^{(c)}(x; X - \{x\}) = P_D^{(c)}(x)$, the function $\psi_{X-\{x\}}^{(\gamma)}(x)$ does not depend on $X - \{x\}$, and as a result we write $\psi_{X-\{x\}}^{(\gamma)}(x) = \psi^{(\gamma)}(x)$. In this case, the MS-GLMB recursion $\Omega$ taking the parameter set

$$\pi \triangleq \left\{ \left( w^{(I,\xi)}, p^{(\xi)} \right) : (I, \xi) \in \mathcal{F}(\mathbb{L}) \times \Xi \right\}$$

to the parameter set

$$\pi_+ = \left\{ \left( w_+^{(I_+,\xi_+)}, p_+^{(\xi_+)} \right) : (I_+, \xi_+) \in \mathcal{F}(\mathbb{L}_+) \times \Xi_+ \right\}$$

is given by

$$I_+ = \mathbb{B}_+ \uplus I, \quad \xi_+ = (\xi, \gamma_+), \tag{39}$$

$$w_+^{(I_+,\xi_+)} = 1_{\mathcal{F}(\mathbb{B}_+ \uplus I)}(\mathcal{L}(\gamma_+))\, w^{(I,\xi)} \left[ \omega^{(\xi,\gamma_+)} \right]^{\mathbb{B}_+ \uplus I}, \tag{40}$$

$$p_+^{(\xi_+)}(x_+, \ell) \propto \begin{cases} \left\langle \Lambda_S^{(\gamma_+)}(x_+|\cdot, \ell),\, p^{(\xi)}(\cdot, \ell) \right\rangle, & \ell \in \mathcal{L}(\gamma_+) - \mathbb{B}_+ \\ \Lambda_B^{(\gamma_+)}(x_+, \ell), & \ell \in \mathcal{L}(\gamma_+) \cap \mathbb{B}_+ \end{cases} \tag{41}$$

$$\omega^{(\xi,\gamma_+)}(\ell) = \begin{cases} 1 - \bar{P}_S^{(\xi)}(\ell), & \ell \in I - \mathcal{L}(\gamma_+) \\ \Lambda_S^{(\xi,\gamma_+)}(\ell), & \ell \in \mathcal{L}(\gamma_+) - \mathbb{B}_+ \\ 1 - P_{B,+}(\ell), & \ell \in \mathbb{B}_+ - \mathcal{L}(\gamma_+) \\ \Lambda_B^{(\gamma_+)}(\ell), & \ell \in \mathcal{L}(\gamma_+) \cap \mathbb{B}_+ \end{cases} \tag{42}$$

where

$$\bar{P}_S^{(\xi)}(\ell) = \left\langle P_S(\cdot, \ell),\, p^{(\xi)}(\cdot, \ell) \right\rangle, \tag{43}$$

$$\Lambda_B^{(\gamma_+)}(x_+, \ell) = \psi^{(\gamma_+)}(x_+, \ell)\, f_{B,+}(x_+, \ell)\, P_{B,+}(\ell), \tag{44}$$

$$\Lambda_S^{(\gamma_+)}(x_+ | y, \ell) = \psi^{(\gamma_+)}(x_+, \ell)\, f_{S,+}(x_+ | y, \ell)\, P_S(y, \ell), \tag{45}$$

$$\Lambda_B^{(\gamma_+)}(\ell) = \int \Lambda_B^{(\gamma_+)}(x, \ell)\, dx, \tag{46}$$

$$\Lambda_S^{(\xi,\gamma_+)}(\ell) = \int \left\langle \Lambda_S^{(\gamma_+)}(x|\cdot, \ell),\, p^{(\xi)}(\cdot, \ell) \right\rangle dx. \tag{47}$$
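For implementers, the weight factor (42) is simply a four-way branch on each label's membership in the surviving label set L(γ+) and the birth label set B+. The sketch below shows this structure, with the scalar functionals (43)-(47) supplied as callables; it is a structural illustration rather than our implementation:

```python
def omega_factor(label, L_gamma_plus, B_plus,
                 P_S_bar, Lambda_S_bar, P_B, Lambda_B_bar):
    # Per-label weight factor of (42). L_gamma_plus: labels alive in the
    # hypothesis gamma_+; B_plus: birth labels. The callables return the
    # scalars defined in (43)-(47) for the given label.
    alive = label in L_gamma_plus
    if label in B_plus:
        # birth label: born into the hypothesis, or not born at all
        return Lambda_B_bar(label) if alive else 1.0 - P_B(label)
    # legacy label: survives into the hypothesis, or dies
    return Lambda_S_bar(label) if alive else 1.0 - P_S_bar(label)
```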
7.2 OSPA and OSPA(2) Metrics

Consider a space $W$ with $d: W \times W \to [0, \infty)$ as the base-distance between the elements of $W$. Let $d^{(c)}(x, y) = \min(c, d(x, y))$, and let $\Pi_n$ be the set of permutations of $\{1, 2, \ldots, n\}$. The Optimal Sub-Pattern Assignment (OSPA) distance of order $p \geq 1$ and cut-off $c > 0$ between two finite sets of points $X = \{x_1, \ldots, x_m\}$ and $Y = \{y_1, \ldots, y_n\}$ of $W$ is defined by [64]

$$d_O^{(p,c)}(X, Y) = \left( \frac{1}{n} \left( \min_{\pi \in \Pi_n} \sum_{i=1}^{m} d^{(c)}\!\left(x_i, y_{\pi(i)}\right)^p + c^p (n - m) \right) \right)^{1/p} \tag{48}$$

if $n \geq m > 0$, and $d_O^{(p,c)}(X, Y) = d_O^{(p,c)}(Y, X)$ if $m > n > 0$. In addition, $d_O^{(p,c)}(X, Y) = c$ if one of the sets is empty, and $d_O^{(p,c)}(\emptyset, \emptyset) = 0$. The integer $p$ plays the same role as the order of the $\ell_p$-distance for vectors. The cut-off parameter $c$ provides a weighting between cardinality and localization errors: a large $c$ emphasizes cardinality error, while a small $c$ emphasizes localization error. However, a small $c$ also decreases the sensitivity to the separation between the points, due to the saturation of $d^{(c)}$ at $c$.

The OSPA(2) distance between two sets of tracks is the OSPA distance with the following base-distance between two tracks $f$ and $g$ [21]:

$$\tilde{d}^{(c)}(f, g) = \frac{\sum_{t \in \mathcal{D}_f \cup \mathcal{D}_g} d_O^{(c)}(\{f(t)\}, \{g(t)\})}{|\mathcal{D}_f \cup \mathcal{D}_g|} \tag{49}$$

if $\mathcal{D}_f \cup \mathcal{D}_g \neq \emptyset$, and $\tilde{d}^{(c)}(f, g) = 0$ if $\mathcal{D}_f \cup \mathcal{D}_g = \emptyset$, where $\mathcal{D}_f \cup \mathcal{D}_g$ denotes the set of instants when at least one of the tracks exists, and $d_O^{(c)}(\{f(t)\}, \{g(t)\})$ denotes the OSPA distance between the two sets containing the states of the two tracks at time $t$ (the set $\{f(t)\}$, respectively $\{g(t)\}$, is empty if track $f$, respectively $g$, does not exist at time $t$). Note that the order parameter $p$ of the OSPA distance in (49) is redundant, because only sets of at most one element are considered.
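A minimal implementation of (48) for an arbitrary base-distance, using the Hungarian algorithm for the optimal sub-pattern assignment (illustrative, assuming NumPy/SciPy are available):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def ospa(X, Y, d, p=1, c=1.0):
    """OSPA distance (48) between finite sets X, Y (lists of elements)
    under an arbitrary base-distance d(x, y); p: order, c: cut-off."""
    m, n = len(X), len(Y)
    if m == 0 and n == 0:
        return 0.0
    if m == 0 or n == 0:
        return c
    if m > n:                       # the definition is symmetric; ensure m <= n
        X, Y, m, n = Y, X, n, m
    # pairwise cut-off distances d^(c)(x, y) = min(c, d(x, y)), raised to p
    D = np.array([[min(c, d(x, y)) for y in Y] for x in X]) ** p
    row, col = linear_sum_assignment(D)
    return ((D[row, col].sum() + c**p * (n - m)) / n) ** (1.0 / p)

# e.g. with a Euclidean base-distance:
# ospa([np.zeros(3)], [np.ones(3)], d=lambda x, y: float(np.linalg.norm(x - y)))
```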
7.3 IoU and GIoU Metrics

For bounding boxes $x, y$, the IoU similarity index is given by $\mathrm{IoU}(x, y) = |x \cap y| / |x \cup y| \in [0, 1]$, where $|\cdot|$ denotes hyper-volume, while the generalized IoU index is given by $\mathrm{GIoU}(x, y) = \mathrm{IoU}(x, y) - |C(x \cup y) \setminus (x \cup y)| / |C(x \cup y)|$, where $C(x \cup y)$ is the convex hull of $x \cup y$ [65]. The metric forms of IoU and GIoU are, respectively, $d_{\mathrm{IoU}}(x, y) = 1 - \mathrm{IoU}(x, y)$ and $d_{\mathrm{GIoU}}(x, y) = \frac{1 - \mathrm{GIoU}(x, y)}{2}$, both of which are bounded by one [65].

7.4 CLEAR Evaluations for Detections

Table 9 shows the CLEAR evaluation for detections on the CMC dataset, which is referenced from Section 5.3.2 and Section 5.4.2. Table 8 shows the CLEAR evaluation for detections on the WILDTRACKS dataset, which is referenced from Section 5.2.2.
TABLE 8:
CLEAR Evaluation for Detection Results on WILDTRACKS Dataset
| Camera | Detector | MODA↑ | MODP↑ | Precision↑ | Recall↑ |
| C1 | YOLOv3 | 12.2% | 70.1% | 0.55 | 0.62 |
| C1 | F-RCNN(VGG16) | -17.1% | 69.6% | 0.44 | 0.62 |
| C2 | YOLOv3 | 31.7% | 68.5% | 0.68 | 0.58 |
| C2 | F-RCNN(VGG16) | 28.4% | 68.3% | 0.67 | 0.57 |
| C3 | YOLOv3 | -24.4% | 69.2% | 0.42 | 0.68 |
| C3 | F-RCNN(VGG16) | -34.6% | 69.0% | 0.40 | 0.69 |
| C4 | YOLOv3 | -272.4% | 71.1% | 0.14 | 0.57 |
| C4 | F-RCNN(VGG16) | -300.0% | 70.1% | 0.14 | 0.57 |
| C5 | YOLOv3 | -94.4% | 70.0% | 0.29 | 0.69 |
| C5 | F-RCNN(VGG16) | -113.0% | 67.8% | 0.27 | 0.71 |
| C6 | YOLOv3 | -12.6% | 63.4% | 0.44 | 0.50 |
| C6 | F-RCNN(VGG16) | -30.5% | 65.4% | 0.39 | 0.53 |
| C7 | YOLOv3 | -79.2% | 70.1% | 0.33 | 0.77 |
| C7 | F-RCNN(VGG16) | -100.3% | 69.3% | 0.31 | 0.77 |
| All | Deep-Occlusion | 74.1% | 53.8% | 0.95 | 0.80 |
TABLE 9:
CLEAR Evaluation for Detection Results on CMC1 to CMC5
CMC1
| Camera | Detector | MODA↑ | MODP↑ | Prcn↑ | Rcll↑ |
| Cam1 | YOLOv3 | 20.6% | 80.2% | 0.56 | 0.97 |
| Cam1 | F-RCNN(VGG16) | 12.0% | 80.3% | 0.53 | 0.97 |
| Cam2 | YOLOv3 | 20.5% | 78.8% | 0.56 | 0.97 |
| Cam2 | F-RCNN(VGG16) | 12.0% | 80.1% | 0.53 | 0.98 |
| Cam3 | YOLOv3 | 13.2% | 79.7% | 0.53 | 0.97 |
| Cam3 | F-RCNN(VGG16) | 10.1% | 80.8% | 0.51 | 0.97 |
| Cam4 | YOLOv3 | 12.1% | 79.7% | 0.51 | 0.96 |
| Cam4 | F-RCNN(VGG16) | 11.1% | 80.3% | 0.41 | 0.96 |

CMC2
| Camera | Detector | MODA↑ | MODP↑ | Prcn↑ | Rcll↑ |
| Cam1 | YOLOv3 | 51.2% | 76.2% | 0.77 | 0.72 |
| Cam1 | F-RCNN(VGG16) | 37.5% | 76.5% | 0.67 | 0.73 |
| Cam2 | YOLOv3 | 45.3% | 76.5% | 0.72 | 0.72 |
| Cam2 | F-RCNN(VGG16) | 35.5% | 76.6% | 0.66 | 0.73 |
| Cam3 | YOLOv3 | 43.4% | 77.2% | 0.71 | 0.72 |
| Cam3 | F-RCNN(VGG16) | 34.4% | 77.2% | 0.66 | 0.72 |
| Cam4 | YOLOv3 | 47.3% | 77.7% | 0.74 | 0.71 |
| Cam4 | F-RCNN(VGG16) | 37.4% | 78.0% | 0.67 | 0.72 |

CMC3
| Camera | Detector | MODA↑ | MODP↑ | Prcn↑ | Rcll↑ |
| Cam1 | YOLOv3 | 44.9% | 76.4% | 0.79 | 0.60 |
| Cam1 | F-RCNN(VGG16) | 33.1% | 76.0% | 0.67 | 0.61 |
| Cam2 | YOLOv3 | 39.8% | 75.3% | 0.73 | 0.62 |
| Cam2 | F-RCNN(VGG16) | 30.9% | 75.4% | 0.66 | 0.63 |
| Cam3 | YOLOv3 | 36.1% | 74.4% | 0.72 | 0.58 |
| Cam3 | F-RCNN(VGG16) | 29.6% | 74.0% | 0.66 | 0.61 |
| Cam4 | YOLOv3 | 37.0% | 74.9% | 0.72 | 0.59 |
| Cam4 | F-RCNN(VGG16) | 27.6% | 74.6% | 0.65 | 0.60 |

CMC4
| Camera | Detector | MODA↑ | MODP↑ | Prcn↑ | Rcll↑ |
| Cam1 | YOLOv3 | 86.8% | 82.0% | 0.93 | 0.93 |
| Cam1 | F-RCNN(VGG16) | 76.7% | 82.6% | 0.84 | 0.94 |
| Cam2 | YOLOv3 | 75.2% | 79.1% | 0.87 | 0.88 |
| Cam2 | F-RCNN(VGG16) | 68.3% | 80.3% | 0.82 | 0.88 |
| Cam3 | YOLOv3 | 86.7% | 84.6% | 0.93 | 0.93 |
| Cam3 | F-RCNN(VGG16) | 77.3% | 87.0% | 0.84 | 0.95 |
| Cam4 | YOLOv3 | 81.5% | 82.7% | 0.94 | 0.87 |
| Cam4 | F-RCNN(VGG16) | 75.9% | 82.2% | 0.82 | 0.97 |

CMC5
| Camera | Detector | MODA↑ | MODP↑ | Prcn↑ | Rcll↑ |