Detection and Tracking of General Movable Objects in Large 3D Maps
Nils Bore, Johan Ekekrantz, Patric Jensfelt and John Folkesson
Robotics, Perception and Learning Lab
Royal Institute of Technology (KTH)
Stockholm, SE-100 44, Sweden
Email: {nbore, ekz, patric, johnf}@kth.se
Submitted to IEEE Transactions on Robotics, October 2017.
Abstract — This paper studies the problem of detection and tracking of general objects with long-term dynamics, observed by a mobile robot moving in a large environment. A key problem is that, due to the environment scale, the robot can only observe a subset of the objects at any given time. Since some time passes between observations of objects in different places, the objects might be moved when the robot is not there. We propose a model for this movement in which the objects typically only move locally, but with some small probability they jump longer distances through what we call global motion. For filtering, we decompose the posterior over local and global movements into two linked processes. The posterior over the global movements and measurement associations is sampled, while we track the local movement analytically using Kalman filters. This novel filter is evaluated on point cloud data gathered autonomously by a mobile robot over an extended period of time. We show that tracking jumping objects is feasible, and that the proposed probabilistic treatment outperforms previous methods when applied to real-world data. The key to efficient probabilistic tracking in this scenario is focused sampling of the object posteriors.
Keywords — Mobile robot, multi-target tracking, movable objects, mapping.
I. INTRODUCTION
Mobile robots often operate in large environments that cannot be fully observed without moving. In fact, the floor plan of most buildings is divided into rooms, which the robot can only visit one at a time. This setup creates an interesting problem if the robot is to keep track of specific objects. For example, imagine a care robot that helps the elderly keep track of their belongings, as suggested in [1]. Since the robot cannot be in all places simultaneously, it will not observe most objects as they move, as is required by classical tracking techniques. We might instead rely fully on appearance to identify the objects, but this also fails if the objects are visually similar. To enable tracking in this scenario, we need to take into account that the objects might have moved while we were not observing them. Imagine we are looking for a patient's cell phone, and that we last saw it in her room. In general, we should expect to find the object in the same place, or somewhere nearby. Further, if the robot looks and the phone is not there, the next hypothesis is that it should be somewhere within the closed environment of the patient's department or the care home. Any further guesses need to be based on knowledge of who might have taken the phone and to where. These common-sense intuitions should form the basis for any tracking algorithm that works even when the robot does not continuously observe the targets.

The need for tracking specific instances of objects arises in several robotic applications, for example when mapping in highly dynamic [2] and long-term scenarios [3], obstacle avoidance [4][5] and object search [6]. Of particular interest to us is surveillance of objects with long-term dynamics moving in large environments. The application we have in mind is a robot tasked with monitoring a number of mostly static objects that typically do not leave some closed environment, such as in the care scenario described above.
Another application is security, where a robot should monitor, for example, the presence of important items in an office. In these surveillance applications, a probabilistic model for the objects' positions is essential for the robot, since it aids it in knowing where to look. In addition to helping us distinguish similar objects, a tracking framework with a probabilistic motion model provides us with such distributions, as illustrated in Figure 1. When the robot covers more locations, the model should refine the distributions to aid the search.

As mentioned in [7], tracking of instances is also necessary when we would like to learn explicit movement models for objects. This is true both for instance-level models and more general object categories. For example, to estimate a movement model for the mug category, it does not suffice to study the collective movement of mugs. This can be illustrated by a collection of mugs standing in a kitchen drying rack. If the method can only estimate the number of mugs in the rack, it will fail to detect that people are leaving washed mugs in order to pick them up later. Therefore, a reasonable algorithmic model would be that of a queue, rather than a set. More generally, a cumulative model will fail to capture any such interactions of objects.

Since a mobile robot cannot observe all objects in a large environment simultaneously, some of them will be moved when the robot is not looking. Most often, objects only move locally, as humans might use them for some task and then put them down. However, objects sometimes move in unpredictable ways, to entirely different parts of the robot environment. This might happen if the task is completed somewhere other than the original object position, as when drinking from a cup brought from the kitchen. A key insight is that these latter movements happen much more seldom than small adjustments of object positions.
We therefore propose decomposing the modeling of object movement into two parts: local movements describing the small adjustments, and rarer global movements for longer jumps. In this scheme, we can explain most object detections with local movements from the previous positions. One concrete advantage of this can be highlighted by an example where we have observed two visually similar objects. With new observations, the first object is well explained by local motion from its previous position. If the second is not, but a similar object is observed elsewhere, we can conclude that the second is more likely to have jumped there than the first. This matches well with our human intuition.

To track objects within the maps, we need to generate object detections from the scene observations. In previous work on similar topics [7][3], it has been popular to use scene differencing for detection. This is natural, as we are only interested in tracking objects that move; static objects are trivial to track. But even movable objects often remain static for long periods of time, necessitating a mechanism for detection when they do not move. Methods have been proposed for extending 3D scene differencing temporally, to segment one movable object across a sequence of scene observations [8]. In this work, we propose restricting such an extension to only the unambiguous parts. In practice, this means that we identify two parts in different scenes as the same only if they are static between the observations. As we will show, such a scheme can still segment the movable objects in all frames. Importantly, it defers movable object clustering to the tracker, which incorporates a probabilistic motion model and can therefore handle uncertainty.

In our scenario, the task is assumed to be to track and monitor a fixed set of objects. The robot patrols one floor of a building at regular intervals, visits all relevant locations and builds local 3D maps.
From each map observation, we can extract a set of object detections and corresponding visual features. In the experiments we use CNN features, which proved to discriminate well between dissimilar objects. Although it only takes a few minutes to travel between the different locations and construct the maps, we expect the robot to visit the same location again on the time scale of a few hours. Since it can only survey one location at a time, the majority of the objects we are interested in will not be observed for hours at a time. Between observations the objects move locally within the locations and will occasionally jump to different locations.

Fig. 1: Example of our system tracking objects: the current detections of objects (red) and location probabilities (blue-violet) for three different chairs, with panels (a), (b) and (c) showing the position probabilities of chairs 1, 2 and 3, respectively. The chairs are visually similar, making long-term tracking just from visual or shape features difficult.

The sparse observations and the two movement modes complicate inferring the object positions and demand careful modeling of probabilities. The main idea presented in the paper brings together modeling of discrete object jumps between locations and smaller local variation in a probabilistic multi-target tracking framework. We survey previous work of relevance and present a key set of assumptions which make the problem tractable.
Since the method is developed for tracking in the context of a mobile robot patrolling an environment, we have developed a complete system to verify the viability of the approach. In summary, we present the following contributions:
1) A multi-object tracking model for continuous movement paired with larger discontinuous movement that happens when the robot is somewhere else
2) A practical, efficient inference scheme that fully captures the proposed underlying model
3) Experimental validation on a robot and in simulation

II. RELATED WORK
This paper sits atop and draws upon a variety of different disciplines, including mapping, computer vision and Bayesian inference. As such, we focus on a few key areas where we find the most relevant work. Most studies that address detection and tracking of movable objects in a mobile robotic framework are found within the area of robot mapping. In this field, considerable effort has been spent to handle dynamics in order to better model noise in the measurements and to improve localization [9][10]. Others aim to infer object positions with respect to the map, as described in the sections below. Our work is a variant of the multi-target tracking problem, another area of relevance which we discuss in the subsequent paragraphs. Finally, we survey related work on 3D object detection and segmentation.

Within the fields of robot mapping and localization, there have been substantial contributions to solving detection and tracking of multiple objects, referred to as the DATMO problem [11]. The area can be divided into approaches dealing with short-term dynamics such as cars and humans [2][12], and long-term dynamics, such as furniture [13][7][3][14]. In a robotic setting, short-term dynamic objects are typically continuously observed, while long-term, or semi-static, objects are only observed every once in a while, during which the objects might have moved. In the following, we will focus on methods that deal with the latter problem, corresponding to our problem formulation. Biswas et al. [3] study the problem of associating dynamic objects between measurements at different times. Besides the static map, they also keep track of a set of objects, each associated with a set of measurements. While [3] do not incorporate a movement model, thus allowing arbitrary movement, they use the mutual exclusion constraint, which disallows association of two detections made at the same time to the same object. A more recent system dealing with objects that are moved was presented in [15]. The authors specifically address the problem of modeling how long objects can be expected to stay at their last observed position. In addition, they present a scheme for learning the model from observations. While we do not employ the proposed probabilistic model for this time interval, it can be integrated into our system by modifying our prior model.

The GATMO system [7] presented by Gallagher et al. is the method that comes closest to realizing the vision we present in this paper. In fact, that system addresses the problem of tracking short-term as well as long-term dynamics. The method detects movable objects by means of scene differencing on 2D laser data. The discovered objects are kept in a database of categories, with an object moving to an "Absent" category if it is no longer in a similar position as when last observed. It may move back to the "Movable" category if there is a new detection somewhere else with a similar laser signature. Also worth noting here is earlier work of Schulz et al. [13] and Wolf et al. [16], who present similar systems, but only for tracking movable objects locally within a map. Notably, [13] also incorporates the localization uncertainty of the robot when estimating the object positions. GATMO [7] and to some extent [13][16] address the same general problem that we do. We improve on [13][16] by also tracking larger-scale motion, necessitating joint modeling of the objects. Our probabilistic model is an important improvement on the GATMO [7] system. As demonstrated by our experiments, it enables robust performance also in the presence of noise. Lastly, unlike these earlier systems, the proposed system operates on top of 3D maps.

At the basic level, our problem is a multi-target tracking (MTT) problem. Many techniques have been proposed to address MTT, but few address the problem in a setting similar to ours. Our system must track the discrete process of jumping between a finite number of locations as well as the continuous 2D position of the objects. Typically, this also includes inferring the association of measurements to targets. However, explicitly representing associations between targets, locations and measurements requires a high-dimensional state space and is not feasible. Instead, we focus on sampling-based particle filters.
To improve performance in high-dimensional state spaces, one can sometimes use analytic representations of the continuous part, resulting in a so-called Rao-Blackwellized particle filter (RBPF). RBPFs have been proposed for multi-target tracking in a number of papers, e.g. [17][18][19]. Our approach is inspired by that of Särkkä et al. [17], which integrates one measurement at a time and samples correspondences to the tracked targets. Given the associations, the continuous positions can be tracked using classical Kalman filters. However, our sampling scheme is more similar to the method of Vu et al. [20], which uses MCMC techniques to estimate the associations of several measurements jointly. Another approach to MTT is Probability Hypothesis Density (PHD) filters [21][22], which maintain the combined target intensity function rather than the posteriors. As we are interested in explicit object posteriors, e.g. for object search, these methods are not directly relevant to this paper.

In the broader field of 3D object detection and segmentation, there are several works relevant to ours, particularly in regards to segmentation. In [8], Herbst et al. present a system which jointly segments scenes observed at different times. The method uses an elegant multi-scene MRF formulation and graph cuts to segment objects that move in between observations. This has the advantage of providing consistent segmentation boundaries by comparing several static scenes to each other at once, thus filtering out noise. While the output of this segmentation is the same as ours, it is unlikely that method would scale from the tens of scenes presented in the paper to the hundreds that might be collected by a robot. Instead, our method can run in an online manner, which is desirable on a robot platform. In [23], Finman et al. present a complete system for object discovery. Their method detects objects from pair-wise scene differencing and associates new detections with previously observed object models. Similar to us, they update the visual object models online using a tracker, but do not explicitly model specific object instances or their positions. In [24] we introduced a scene differencing method that was also used in [25]. The method finds objects by comparing to a background model of the environment, called a meta-room. In the current work, we do not rely on a background model. Instead, we compare precisely registered depth frames of subsequent observations, allowing us to reliably detect smaller objects than [24], such as mugs. Another advantage over [24] is the option of joint segmentation, allowing us to segment static objects in the past if they move in subsequent observations.

In a subsequent work [25], we studied the related problem of offline clustering of object instances that have been observed within one room at different times. One of the main conclusions was that lighting variation or inability to always observe objects from similar angles present problems for object recognition in this application.
To alleviate this problem, we proposed to group the initial clusters of visual features by spatio-temporal coherency. Importantly, the second step makes sure no objects observed at the same time are assigned to the same cluster. In the current work, we approach the problem from a different angle, by studying objects that may also move between locations but never completely leave or enter the robot environment. Further, while the offline clustering algorithm in [25] is greedy, we propose an online probabilistic framework where we perform joint data association.

To summarize, while [25][16][13] track local movement using features and [3][14] track only using features, ignoring position, none of them formulate a full motion model that can track general objects. Only [7] studies the full problem of tracking general long-term dynamic objects within large environments, but without a specified motion model or a noise model. We improve on their work by incorporating motion and noise priors in a probabilistic model working on 3D data. This enables reliable estimates even in the noisy environments where many robots find themselves. Detections come from a simple temporally consistent segmentation logic that combines the precision of [8] with the fast sequential updates of [24].

III. METHOD
A robot moves between a finite number of locations $l \in \mathcal{L}$ in an environment. Its task is to monitor a number of semi-static objects $j \in 1, \ldots, N$ at each time step $k$. For this, the robot needs to reason about the current locations $l_{j,k}$ of the objects, and their exact positions $\hat{x}_{j,k}$ within the locations. At each time $k$, it observes one of the locations $l_{y_k}$, giving it a sequence of $M_k$ point measurements $Y_k = \{y_{1,k}, \ldots, y_{M_k,k}\}$. Each point comprises a 2D position $\hat{y}^s_{m,k}$, together with a visual feature vector $\hat{y}^f_{m,k}$, each with some noise. While some of the measurements correspond to one of the $N$ objects it is monitoring, others originate from other objects or spurious noise. To know which objects to monitor, the initial positions $\hat{x}_{1,0}, \ldots, \hat{x}_{N,0}$ are given. Our tracking formulation is based on the closed world assumption, meaning no object $j$ enters or fully leaves the environment. This is justified, since the tracking system can still maintain a distribution over possible positions in the environment, while concluding it does not know where an object is if the uncertainty grows too high.

Fig. 2: The change detection result in one local map. There are different sources of detections, coming from forward change detection (green), backward change detection (red) and propagated detections (yellow). The hole in the center of the map is due to the robot standing in that position when collecting the local map.

In the following section, we briefly describe how our method generates the point measurements $Y_k$ from RGBD scans of the current location $l_{y_k}$. Then, in subsequent Sections III-B1 and III-B2, we detail the local and global process models, respectively. Finally, in Section III-B3 and onwards, we describe how to combine the models into a joint posterior and then propose a framework for efficient inference.

A. Map Differencing and Consistent Segmentation
As our robot patrols the environment, it performs 3D sweeps using its RGBD sensor at the pre-specified locations $\mathcal{L}$. The RGBD frames are registered into local maps, see [25] for details. Our aim is to track objects moving within our environment. It is therefore natural to use change detection techniques [24][23][26] to detect image segments corresponding to moving objects.

Change detection only detects objects that move in between two time points $k$ and $k+1$. However, if at time $k+2$ any of the objects remained in the same place as in $k+1$, we would like to still detect them, as this is vital information for our object tracking. That is, we need to distinguish between not having moved and not being there any more. Therefore, we add a new component to the detection system wherein we propagate detections of objects into previous and subsequent observations. This allows us, in theory, to segment all instances of objects that have moved at any point in the robot's observations. When formulating the principles for propagating the detected objects, there are several scenarios to take into account. First, an object can appear as well as disappear, allowing it to be detected by change detection. We distinguish between detection of appearing and disappearing objects by referring to the processes as forward and backward change detection, respectively. If an object has appeared in frame $k+1$, it could be present in the subsequent frames, while backward detections might have been present in previous frames. We thus compare all the registered depth image frames in observation $k+1$ with those in $k+2$ to see if the pixels corresponding to the object have similar values, see Figure 3. The object detections are propagated in either direction until the depth values no longer correspond or we detect an object change in the opposite direction, indicating that the object disappeared, see Figure 4. Finally, a complicating factor is that the objects might be occluded. While one could use the observations before and after an occlusion to model the probability of the object being there, we do not attempt any such reasoning in this work.

Fig. 3: A real world example where a chair moves and then moves again. It is detected through forward change detection and the detection is propagated into subsequent frames. The second movement is detected via the backward change detection pass.

Fig. 4: A more complex example where the object moves several times and with occlusion. Note that small local movements will be registered as a backward and a forward detection at the same time step. When occluded, we do not register any measurements.

To summarize, several passes determine the final object detections. First, our algorithm computes forward and backward change detection passes by comparing each subsequent pair of observations. Then, a forward pass propagates all the detected objects to the subsequent static observations of the same object. No propagated detection is added when it has already been detected by another pass. Finally, a backward pass propagates the detections from backward change detection. The output of the segmentation of one local map is shown in Figure 2. Note that the method can be modified to run online simply by skipping the backward detection and propagation steps.

B. Object Posteriors
Now we turn to the problem of associating the detections from every time step with objects in order to track them. The modeling of object movement can be described in terms of a combination of two different processes. On the one hand, an object often moves locally in one place, as illustrated in Figure 5a. Picture, for example, a chair in front of a desk; it frequently moves a bit when the user is getting up or adjusting the seating position. More rarely, objects might also move somewhere completely different, as illustrated in Figure 5b. To be able to track these rare movements, we need to reason about situations when a tracked object's visual descriptor does not match anything nearby its previous position. If its descriptor is also close to that of a new, unexplained object, we might be confident the object has "jumped". Since the robot can only observe objects within one location $l \in \mathcal{L}$ at each time step, estimating $l_{j,k}$ for each of the objects will prove essential to our treatment of the problem. In the following, we describe our modeling of the two processes separately and then how to combine them.
1) Local Movement:
Individual object, or target, states $x_{j,k}$ and measurements $y_{m,k}$ at time $k$ are composed of continuous object 2D position $\hat{\cdot}^s$ and descriptor $\hat{\cdot}^f$ vectors, as well as a discrete location $l \in \mathcal{L}$,

$$x_{j,k} = \left(\hat{x}^s_{j,k},\, \hat{x}^f_{j,k},\, l_{j,k}\right), \quad y_{m,k} = \left(\hat{y}^s_{m,k},\, \hat{y}^f_{m,k},\, l_{y_k}\right).$$

The continuous states $\hat{x}$ can be observed directly, with some noise, and therefore have the same dimension as the measurements $\hat{y}$. Note that for the measurements, the continuous location $\hat{y}^s_{m,k}$ also uniquely defines the discrete location $l_{y_k}$. We denote the set of target states at time step $k$ by $X_k = \{x_{1,k}, \ldots, x_{N,k}\}$ and the measurements by $Y_k$. If the objects move only locally, and if we assume Gaussian noise for the process, and for the position and descriptor measurements, we can describe the system using linear dynamics. Given assignments $c_{j,k} = m$ that map each target $j$ to its corresponding measurement $m$, we may use the standard update equations for the Kalman filter to track each target separately. With the prior that the object stays locally, the state will simply propagate as the previous state plus some normally distributed noise. The continuous states can further be directly observed, giving us

$$\hat{x}_{j,k} = \hat{x}_{j,k-1} + q_{j,k-1}, \qquad \hat{y}_{c_{j,k},k} = \hat{x}_{j,k} + r_{j,k}, \qquad (1)$$

with $q_{j,k-1}$ and $r_{j,k}$ denoting process noise and measurement noise, respectively. The feature part has no process noise, but we do assume normally distributed measurement noise. In our application this is a reasonable assumption, as discussed in Section IV-A. Note that our model can incorporate other distributions.

Fig. 5: Illustration of the possible movements of three objects: (a) local, small movements of the objects within the locations; (b) global movements, jumping within or between different locations. Arrows indicate possible movements. The objects can move locally, within a certain area. Additionally, they may also jump to new locations in completely different areas of the environment.
In this local model, the location $l_{j,k}$ never changes, but it is implicitly given by the last measurement associated with the target.
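As a concrete illustration of the local model, the per-target Kalman update implied by Equation 1, with identity observation matrix, can be sketched as follows. This is a minimal sketch: the function name and the noise covariances Q and R are illustrative placeholders, not quantities fixed by the paper.

```python
import numpy as np

def local_kalman_step(x_prev, P_prev, y, Q, R):
    """One predict/update cycle for a single target under the local
    (random-walk) motion model: the state is observed directly, so the
    observation matrix is the identity."""
    # Predict: the object stays put, up to Gaussian process noise q.
    x_pred = x_prev
    P_pred = P_prev + Q
    # Update with the associated measurement y (observation matrix H = I).
    S = P_pred + R                     # innovation covariance
    K = P_pred @ np.linalg.inv(S)      # Kalman gain
    x_new = x_pred + K @ (y - x_pred)
    P_new = (np.eye(len(x_prev)) - K) @ P_pred
    return x_new, P_new
```

With a diffuse prior covariance, the update essentially re-centers the target on the measurement, which is the behavior the tracker relies on after a confident association.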
2) Global Movement & Associations:
In addition to the continuous pose, our objects can also jump between several discrete locations $l \in \mathcal{L}$. These often correspond to rooms such as offices or different parts of a hallway. At each time point, we assume an object might take the "action" $u_{j,k} \in \{\text{jump}, \text{no jump}\}$ of jumping to a random $l$ or staying,

$$p(u_{j,k}) = \begin{cases} p_{\text{jump}}, & \text{if } u_{j,k} = \text{jump} \\ 1 - p_{\text{jump}}, & \text{if } u_{j,k} = \text{no jump}. \end{cases} \qquad (2)$$

If object $j$ did not jump, we know it stayed in the same location as before, $u_{j,k} = \text{no jump} \Rightarrow l_{j,k} = l_{j,k-1}$. In the event of a jump, it may uniformly jump to any of the $N_l = |\mathcal{L}|$ locations in the robot environment. There is therefore the probability $1/N_l$ of jumping to the location $l_{y_k}$ we are currently observing and $(N_l - 1)/N_l$ of jumping to any of the other locations. If the number $N_l$ is large, we gain little information about an object's location from knowing it is missing from one of the locations. Instead, for simplicity, we introduce a new location $l_{\text{unknown}}$, indicating that we believe the object jumped to a new location, but we do not know which. Therefore, if $l_{j,k} = l_{\text{unknown}}$, we already know $j$ jumped in subsequent time steps, and the location stays $l_{\text{unknown}}$ until it is associated with a measurement, $l_{j,k-1} = l_{\text{unknown}} \Rightarrow u_{j,k} = \text{jump}$. As we will see, this simplification helps improve our inference. Thus, the location priors are

$$p(l_{j,k} \mid u_{j,k} = \text{jump}, l_{y_k}) = \begin{cases} 1/N_l, & \text{if } l_{j,k} = l_{y_k} \\ (N_l - 1)/N_l, & \text{if } l_{j,k} = l_{\text{unknown}} \\ 0, & \text{for other } l \in \mathcal{L}. \end{cases}$$

Intrinsically, multi-target tracking is a problem of estimating the measurement associations $c_{j,k}$. Now, $c_{j,k}$ can take on measurement indices $m \in 1, \ldots, M_k$ as well as $\epsilon$, which indicates the target did not give rise to any measurement at this time.
If, for example, the last target estimate $l_{j,k-1}$ is not at the current observation location $l_{y_k}$, and the target did not jump, the probability of not getting a measurement of the target is $1$, since we can only observe one location at a time. Even if the object is at the estimated location, it is detected with some large probability $p_{\text{meas}}$, allowing for some errors in the detector. Together, this gives us the measurement prior model,

$$p(c_{j,k} = \epsilon \mid l_{j,k}, l_{y_k}) = \begin{cases} 1, & \text{if } l_{j,k} \neq l_{y_k} \\ 1 - p_{\text{meas}}, & \text{if } l_{j,k} = l_{y_k}, \end{cases}$$

and correspondingly, if each of the measurements in the location is a-priori equally likely to originate from the object,

$$p(c_{j,k} = m \mid l_{j,k}, l_{y_k}) = \frac{1}{M_k}\left(1 - p(c_{j,k} = \epsilon \mid l_{j,k}, l_{y_k})\right). \qquad (3)$$

With the conditional priors in hand, we can compute the individual target association priors $p(c_{j,k} \mid l_{y_k}, l_{j,k-1}) = p(c_{j,k} \mid l_{y_k}, l_{j,k})\, p(l_{j,k} \mid l_{y_k}, l_{j,k-1}, u_{j,k})\, p(u_{j,k})$, see Table I. Since the associations encode the locations $l_{j,k}$ except for when $c_{j,k} = \epsilon$, we simplify notation by taking $c_k$ to mean all associations and locations at time $k$. Note that the individual association priors are conditioned only on $c_{j,k-1}$, as opposed to the full $c_{k-1}$. Further, there is no closed form expression to combine them into a joint prior $p(c_k \mid c_{k-1}, Y_{k-1})$, since we need to disallow assignments of two targets to the same measurement. While we never sample from this specific distribution, it is relevant to talk about how one might do so here; we will use the same techniques later on for the proposal distribution.

One option to sample from the joint association prior is to use MCMC methods. In our case, one might use blocked Gibbs sampling [27] of two random target assignments at a time, conditioned on the other assignments. It is important to sample the assignments in blocks of several targets, since this allows the targets to switch measurement assignments.
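To make the factorization above concrete, the individual association prior, i.e. the entries of Table I, can be computed with a small helper. The string encoding of locations, the "unknown" sentinel, and the default values of p_jump and p_meas are our own illustrative choices, not values from the paper:

```python
def association_prior(u, l, c, l_prev, l_y, M_k, N_l, p_jump=0.05, p_meas=0.9):
    """Joint conditional prior p(u, l, c | l_prev) for one target, as the
    product p(c | l, l_y) * p(l | u, l_y) * p(u). 'c' is "m" (some
    measurement index) or "eps" (no measurement); locations are strings,
    with "unknown" as the summarizing extra location."""
    # p(u): if the previous location is unknown, the object must have jumped.
    if l_prev == "unknown":
        p_u = 1.0 if u == "jump" else 0.0
    else:
        p_u = p_jump if u == "jump" else 1.0 - p_jump
    # p(l | u): no jump keeps the old location; a jump lands in the observed
    # location with probability 1/N_l, and is otherwise summarized as "unknown".
    if u == "no_jump":
        p_l = 1.0 if (l == l_prev and l_prev != "unknown") else 0.0
    elif l == l_y:
        p_l = 1.0 / N_l
    elif l == "unknown":
        p_l = (N_l - 1.0) / N_l
    else:
        p_l = 0.0
    # p(c | l): a target outside the observed location yields no measurement;
    # inside it, each of the M_k detections is a-priori equally likely.
    if l != l_y:
        p_c = 1.0 if c == "eps" else 0.0
    else:
        p_c = p_meas / M_k if c != "eps" else 1.0 - p_meas
    return p_u * p_l * p_c
```

Summing the helper over all cases for a fixed previous location recovers the column sums of Table I, which is a convenient sanity check on the prior.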
The Gibbs sampling procedure allows us to keep the probability of assigning a measurement in the current room fixed, even when most of the observations are already assigned to another target. In effect, this is analogous to adaptively changing the set of measurements $Y_k$ depending on which are unassigned at that iteration of the sampling procedure, and sampling from $p(c_{j,k} \mid l_{y_k}, l_{j,k-1})$ computed over this modified set of measurements.

While we have found that Gibbs sampling of the joint prior is indeed feasible, it might be unnecessarily slow. We can also approximate the distribution using the assumption that the target locations $l_{j,k}$ are independent. However, there is still the constraint over $c_k$ that no two targets can be associated with the same measurement, giving

$$\tilde{p}(c_k \mid c_{k-1}, Y_{k-1}) \propto \begin{cases} 0, & \text{if } \exists\, j \neq j' : c_{j,k} = c_{j',k} \neq \epsilon \\ \prod_j p(c_{j,k} \mid l_{y_k}, l_{j,k-1}), & \text{otherwise}. \end{cases} \qquad (4)$$

TABLE I: The prior probabilities $p(u_{j,k}, l_{j,k}, c_{j,k} \mid l_{j,k-1})$ given the previous location, with the cases $(u_{j,k},\, l_{j,k},\, c_{j,k})$ as rows and the previous location as columns:

| $(u_{j,k},\, l_{j,k},\, c_{j,k})$ | $l_{j,k-1} = l_{y_k}$ | $l_{j,k-1} \neq l_{y_k}$ | $l_{j,k-1} = l_{\text{unknown}}$ |
| no jump, $l_{j,k-1}$, $m$ | $\frac{1}{M_k}(1 - p_{\text{jump}})\, p_{\text{meas}}$ | $0$ | $0$ |
| no jump, $l_{j,k-1}$, $\epsilon$ | $(1 - p_{\text{jump}})(1 - p_{\text{meas}})$ | $1 - p_{\text{jump}}$ | $0$ |
| jump, $l_{y_k}$, $m$ | $\frac{1}{M_k N_l}\, p_{\text{jump}}\, p_{\text{meas}}$ | $\frac{1}{M_k N_l}\, p_{\text{jump}}\, p_{\text{meas}}$ | $\frac{1}{M_k N_l}\, p_{\text{meas}}$ |
| jump, $l_{y_k}$, $\epsilon$ | $\frac{1}{N_l}\, p_{\text{jump}}(1 - p_{\text{meas}})$ | $\frac{1}{N_l}\, p_{\text{jump}}(1 - p_{\text{meas}})$ | $\frac{1}{N_l}(1 - p_{\text{meas}})$ |
| jump, $l_{\text{unknown}}$, $m$ | $0$ | $0$ | $0$ |
| jump, $l_{\text{unknown}}$, $\epsilon$ | $\frac{N_l - 1}{N_l}\, p_{\text{jump}}$ | $\frac{N_l - 1}{N_l}\, p_{\text{jump}}$ | $\frac{N_l - 1}{N_l}$ |

Note that we can get the probabilities in the second column from the first by taking into account that $p(c_{j,k} = m \mid l_{j,k-1} \neq l_{y_k}, u_{j,k} = \text{no jump}) = 0$. Similarly, we get the third column from $p(u_{j,k} = \text{jump}) = 1$.

In particular, this approximation becomes exact when $M_k \gg N$, since a target can then be assigned to $l_{y_k}$ without any noticeable effect on the other priors.
This distribution is easier to work with, as we can use rejection sampling: we sample from the individual priors and reject any sample set with overlapping assignments.
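To make the constrained sampling concrete, the rejection scheme above can be sketched as follows. This is a minimal illustration, not the paper's implementation; the dictionary-based prior representation is our own placeholder structure.

```python
import random

def sample_joint_assignments(priors, rng=random, max_tries=1000):
    """Sample a joint assignment c_k from the approximate prior of Eq. (4):
    draw each target's assignment independently from its individual prior,
    then reject any draw in which two targets share a measurement.

    priors: list over targets j of dicts {assignment: probability}, where an
    assignment is a measurement index or None (epsilon, no measurement).
    """
    for _ in range(max_tries):
        draw = []
        for p in priors:
            choices, weights = zip(*p.items())
            draw.append(rng.choices(choices, weights=weights)[0])
        used = [c for c in draw if c is not None]
        if len(used) == len(set(used)):  # hard constraint: no shared measurement
            return draw
    raise RuntimeError("no valid assignment found")
```

With many measurements per location ($M_k \gg N$) rejections are rare, matching the observation above that the independence approximation becomes exact in that regime.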
3) Combined Process Model:
Now, we would like to combine the continuous local processes and the model for the locations and associations $c_k$ in such a way that we can estimate the full posterior $p(X_k, c_k \mid Y_k)$ jointly. Fortunately, given $c_k$ and the measurements $Y_k$, the states $x_j$ are independent. This allows us to decompose the posterior:

$$p(X_k, c_k \mid Y_k) = p(X_k \mid c_k, Y_k) \, p(c_k \mid Y_k) = p(c_k \mid Y_k) \prod_j p(\hat{x}_{j,k} \mid c_k, Y_k). \quad (5)$$

This is the basic principle underlying the use of Rao-Blackwellized particle filters for multiple target tracking, see for example [17][19]. It allows us to sample a $c^i_k$ for every particle $i$ while also maintaining an analytic filter for each state $\hat{x}^i_j$. This decomposition reduces sampling variance, allowing the use of far fewer particles than if we were to track the full state $(X_k, c_k)$ using particle sampling.

We will outline our modeling of the combined process dynamics before we delve further into the details of inference. In the following, we assume the process has propagated according to the discrete dynamics in the previous section. If an object $j$ does not jump, it adheres to the dynamics in Equation 1. If it does jump, the movement model a-priori distributes the spatial part uniformly over the target spatial domain, $\hat{x}^s_{j,k} \sim U(X^s)$. If, at the same time, it is associated with a measurement $\hat{y}_{m,k}$, our new estimate of the target position will therefore be $N(\hat{y}^s_{m,k}, R^s_k)$, where $R^s_k$ is the spatial measurement noise. The complete continuous dynamics are

$$\begin{aligned} \hat{x}^s_{j,k} &= \hat{x}^s_{j,k-1} + q^s_{j,k-1}, && \text{if } u_{j,k} = \text{no jump} \\ \hat{x}^s_{j,k} &\sim U(X^s), && \text{if } u_{j,k} = \text{jump} \\ \hat{x}^f_{j,k} &= \hat{x}^f_{j,k-1} + q^f_{j,k-1} \\ \hat{y}_{m,k} &= \hat{x}_{j,k} + r_{j,k}, && \text{if } c_{j,k} = m, \end{aligned} \quad (6)$$

where the random parts are distributed according to

$$c_k \sim p(c_k \mid c_{k-1}, Y_{k-1}), \qquad q_{j,k-1} \sim N(0, Q_k), \qquad r_{j,k} \sim N(0, R_k). \quad (7)$$
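As an illustration of the jump/no-jump dynamics of Equation 6, the prediction step for a single target's spatial state might look as follows. This is a sketch: on a jump we moment-match the uniform prior over the domain with a Gaussian so that a Kalman-style representation can be kept, which is an implementation choice of ours, not something prescribed by the text.

```python
import numpy as np

def predict_target(mu_s, Sigma_s, Q_s, jumped, spatial_bounds):
    """Prediction step of Eq. (6) for one target's spatial state (sketch).

    If u_{j,k} = no jump, the state follows the random walk
    x_k = x_{k-1} + q_{k-1}, so the covariance simply grows by Q_s.
    If u_{j,k} = jump, the prior is uniform over the spatial domain; here we
    moment-match U(lo, hi) per dimension with a Gaussian (an assumption).
    """
    if not jumped:
        # local random walk: mean unchanged, covariance inflated
        return mu_s, Sigma_s + Q_s
    lo, hi = spatial_bounds
    center = (lo + hi) / 2.0
    var = (hi - lo) ** 2 / 12.0  # per-dimension variance of a uniform
    return center, np.diag(var)
```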
4) Likelihoods:
The idea behind our Rao-Blackwellized particle filter is to sample locations and associations $c_k$ and then update $\{\hat{x}^i_j\}$ using linear Gaussian dynamics. The idea comes from [17], where the system incorporates one new measurement for every time step. Since our system has distinct time steps where it gets several coupled measurements at once, we incorporate these measurements in the same update step, unlike [17][19]. In the following, we will look at how to extract weighted samples of the posterior $p(c_k \mid Y_k)$ recursively using a particle filter. This can then be combined with the linear part of the state to form the full posterior, see Equation 5. Using Bayes' rule two times, we can decompose the association posterior to give us a recursive expression:

$$p(c_k \mid Y_k) \propto p(Y_k \mid c_k, Y_{k-1}) \, p(c_k \mid Y_{k-1}) = p(Y_k \mid c_k, Y_{k-1}) \, p(c_k \mid c_{k-1}, Y_{k-1}) \, p(c_{k-1} \mid Y_{k-1}). \quad (8)$$

The first term, $p(Y_k \mid c_k, Y_{k-1}) = L(Y_k \mid c_k)$, defines our measurement likelihood. Given the full set of associations $c_k$, it decomposes into a product of point likelihoods. Since we know which data points originate from the tracked objects, we know that the others do not. We will refer to the latter as clutter measurements, giving

$$L(Y_k \mid c_k) = \prod_{j:\, c_{j,k} = m} p(y_{m,k} \mid c_{j,k} = m, u_{j,k}) \prod_{m:\, m \text{ is clutter}} p(y_{m,k} \mid m \text{ clutter}). \quad (9)$$

The point likelihood of local movements, $p(y_{m,k} \mid c_{j,k} = m, u_{j,k} = \text{no jump})$, is simply the Kalman marginal likelihood of position and features [17]. If we instead consider jumps, $u_{j,k} = \text{jump}$, we know from the previous section that $\hat{x}^s_{j,k}$ is a-priori uniformly distributed over the spatial domain. Further, taking the association $c_{j,k} = m$ into account, it must be somewhere within the measurement location $l_{y_k}$. The likelihood is therefore given by the Kalman feature marginal likelihood times a uniform density over the area.
With $A_k$ being the area of that location, and $\mu^f_{j,k}, \Sigma^f_{j,k}$ the feature estimate, we get

$$p(y_{m,k} \mid c_{j,k} = m, u_{j,k} = \text{jump}) = \frac{1}{A_k} N(\hat{y}^f_{m,k}; \mu^f_{j,k}, \Sigma^f_{j,k} + R^f_k).$$

As of yet, we have not defined any prior probability of getting clutter measurements, only of targets not giving rise to a measurement. However, as we will see, clutter is implicitly sampled within the Gibbs sampling procedure. The likelihood of the clutter measurements is given by a uniform density over the spatial and feature domains,

$$p(y_{m,k} \mid m \text{ clutter}) = \frac{1}{A_k S^f}.$$

Note that the support of the feature density, $S^f$, needs to be estimated from data.
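The likelihood factors of Equation 9 can be sketched numerically as below. This is illustrative only; the dictionary-based state layout and the argument names are our own assumptions.

```python
import numpy as np

def gaussian_pdf(x, mean, cov):
    """Multivariate normal density N(x; mean, cov)."""
    d = len(x)
    diff = x - mean
    quad = diff @ np.linalg.solve(cov, diff)
    return np.exp(-0.5 * quad) / np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))

def point_likelihood(y_s, y_f, mu, Sigma, R_s, R_f, jumped, A_k):
    """Point likelihood of Eq. (9) for one associated measurement (sketch).

    y_s, y_f: spatial and feature parts of the measurement.
    mu, Sigma: dicts with 's' and 'f' Kalman estimates (illustrative layout).
    On a jump, the spatial factor is uniform over the location area A_k;
    otherwise it is the Kalman spatial marginal likelihood.
    """
    feat = gaussian_pdf(y_f, mu['f'], Sigma['f'] + R_f)
    if jumped:
        return feat / A_k
    return feat * gaussian_pdf(y_s, mu['s'], Sigma['s'] + R_s)

def clutter_likelihood(A_k, S_f):
    """Uniform clutter density over the spatial area and feature support."""
    return 1.0 / (A_k * S_f)
```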
5) Sampling from the Proposal:
A density $q(c_k) \propto p(Y_k \mid c_k, Y_{k-1}) \, p(c_k \mid c_{k-1}, Y_{k-1})$ proportional to the update of the posterior in Equation 8 is called the optimal importance distribution. Sampling particle updates from the proposal $q(c_k)$ would minimize the variance among our particle weights, but this is in general difficult and typically requires some approximation [30]. Using approximate MCMC methods to sample from the proposal is therefore a well-established idea [30] and has been used to estimate data associations for multi-target tracking in [18]. In our case, recall that $p(c_k \mid c_{k-1}, Y_{k-1})$ describes our transition model, which is in turn given by the individual association priors as outlined in Section III-B1. The individual priors can be combined with the likelihoods to compute the individual proposals $q_j(c_{j,k}) \propto p(Y_k \mid c_{j,k}) \, p(c_{j,k} \mid c_{j,k-1}, l_{j,k-1})$. Using these distributions, the importance distribution $q(c_k)$ is sampled analogously to $p(c_k \mid c_{k-1}, Y_{k-1})$; either with Gibbs sampling or
(a) Features from experiment 1, reduced to two dimensions for visualization. They discriminate well between most of the objects. The only exceptions are the monitor and chairs to the far right: the features of objects 10 and 12 changed drastically after the objects jumped, interleaving them with the noise measurements and those of object 4.
(b) Features from experiment 2. Since the tracked objects are visually similar chairs, the features do not discriminate well. However, we see that the noisy detections are well separated from the chairs, indicating that they help us focus on the objects of interest.
Fig. 6: CNN features from the Google Inception v3 network [28]. They consist of the tensor responses at the final bottleneck layer, dimension-reduced using t-SNE [29]. Additional images and labels for training covariances were extracted from the
KTH Longterm Dataset Labels (https://strands.pdc.kth.se/public/KTH_longterm_dataset_labels/readme.html) data set. Note that each class corresponds to a specific instance. Semantically similar instances are mostly grouped together. For example, the four classes to the left in Figure 6a represent all of the food containers, bottles and mugs in that dataset.

approximated with an independence assumption. In the following, we explore the two methods in more detail.

In Gibbs sampling, the idea is to sample one of the variables $c_{j,k}$ at a time, conditioned on all the others, denoted $c^{-j}_k = \{c_{j',k}\}_{j'} \setminus c_{j,k}$. This gives us a modified individual proposal in the form of a conditional distribution over $c_{j,k}$, $q_j(c_{j,k} \mid c^{-j}_k) \propto p(Y_k \mid c_{j,k}, c^{-j}_k) \, p(c_{j,k} \mid c_{j,k-1}, c^{-j}_k)$. Note that since we know all of the assignments, $p(Y_k \mid c_{j,k}, c^{-j}_k)$ can be uniquely identified with either a target or a clutter likelihood in Equation 9. Again, due to the hard constraints in the prior, it is important to sample several assignments block-wise. We do 100 iterations of burn-in, sampling two random target assignments from a joint version of $q_j(c_{j,k} \mid c^{-j}_k)$ at each iteration, and pick the final assignments as our sample. The algorithm is initialized with assignments from approximate rejection sampling as described below.

We also investigate a faster sampling scheme using the prior of Equation 4, paired with approximate independent likelihoods. The data points are already independent given assignments, $y_{m,k} \perp\!\!\!\perp y_{n,k} \mid c_{j,k} = m$. But given no assignment, $c_{j,k} = \epsilon$, without the other assignments we do not know whether a measurement will be rejected as clutter. Instead, we approximate it with an independent likelihood $p(y_{m,k} \mid c_{j,k} = \epsilon)$ that should be the same for all data points; see Section III-D for a derivation of the approximation used here.
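The blocked Gibbs procedure described above might be sketched as follows. The `unnorm` weight function is a hypothetical interface standing in for the likelihood-times-prior terms of the conditional proposal; the block size of two targets and the 100 burn-in iterations follow the text.

```python
import random

def gibbs_sample_proposal(unnorm, n_targets, n_meas, burn_in=100, seed=None):
    """Blocked Gibbs sampling of the proposal q(c_k) (sketch): repeatedly
    pick two targets at random and resample their joint assignment from the
    conditional given all other assignments, honoring the hard constraint
    that no two targets share a measurement.

    unnorm(j, c, others): unnormalized weight for target j taking assignment
    c (a measurement index or None), given the set `others` of measurements
    already taken by the remaining targets (hypothetical interface).
    """
    rng = random.Random(seed)
    state = [None] * n_targets  # start with every target unassigned
    options = list(range(n_meas)) + [None]
    for _ in range(burn_in):
        j1, j2 = rng.sample(range(n_targets), 2)
        others = {c for j, c in enumerate(state)
                  if j not in (j1, j2) and c is not None}
        pairs, weights = [], []
        for c1 in options:
            for c2 in options:
                if c1 is not None and c1 == c2:
                    continue  # two targets may not share a measurement
                if c1 in others or c2 in others:
                    continue
                pairs.append((c1, c2))
                weights.append(unnorm(j1, c1, others) * unnorm(j2, c2, others))
        state[j1], state[j2] = rng.choices(pairs, weights=weights)[0]
    return state
```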
Given this likelihood, a joint independent distribution over $c_k$ is given by $\tilde{q}(c_k) = \prod_j p(y_{m,k} \mid c_{j,k}) \, p(c_{j,k} \mid c_{j,k-1}, l_{j,k-1})$. To sample from the approximate posterior, we generate samples from $\tilde{q}(c_k)$, which we reject if any two targets are assigned to the same measurement. In the experiments, we report results both from this approximate rejection sampling and from Gibbs sampling.

C. Calculating the Weights
Since the likelihood $p(Y_k \mid c_k, Y_{k-1})$ is not actually a probability mass function with respect to $c_k$, i.e. it does not sum to one, the importance distribution is defined by the product modulo a normalization constant $Z^i_k$,

$$q(c_k) = \frac{1}{Z^i_k} L(Y_k \mid c_k) \, p(c_k \mid c^i_{k-1}, Y_{k-1}),$$

where $Z^i_k$ is given by the sum of the product over all assignments,

$$Z^i_k = \sum_{c_k} L(Y_k \mid c_k) \, p(c_k \mid c^i_{k-1}, Y_{k-1}). \quad (10)$$

Since $Z^i_k$ varies between the particles, we need to update the weights proportionally to their values of $Z^i_k$, as it is not reflected in the sampling. Intuitively, this is similar to a particle filter that uses the likelihood to update the weights of the particles. As we cannot compute $Z^i_k$ directly, it needs to be approximated. The components of the sum in Equation 10 typically take on their largest values in a few places where the likelihood is large. The idea is therefore to produce estimates of $Z^i_k$ by sampling from the proposal distribution $q(c_k) \propto p(c_k \mid c^i_{k-1}, Y_{k-1}) \, L(Y_k \mid c_k)$. In the following, we mainly use the fact that $\sum_{c_k} p(c_k \mid c^i_{k-1}, Y_{k-1}) = 1$:

$$Z^i_k = Z^i_k \sum_{c_k} p(c_k \mid c^i_{k-1}, Y_{k-1}) = Z^i_k \sum_{c_k} \frac{p(c_k \mid c^i_{k-1}, Y_{k-1}) \, L(Y_k \mid c_k)}{L(Y_k \mid c_k)} = \frac{1}{\sum_{c_k} q(c_k) / L(Y_k \mid c_k)} = \frac{1}{E_q(1 / L(Y_k \mid c_k))}. \quad (11)$$

If we look at the last expression, we can estimate the expectation $E_q(1 / L(Y_k \mid c_k))$ by sampling many $c_k \sim q(c_k)$ and computing $1 / L(Y_k \mid c_k)$ for each sample. The inverse of the estimated expectation gives us our estimate of the sum $Z^i_k$. Importantly, we already perform Gibbs sampling from $q(c_k)$ to sample our particle proposals. By simply running a few more iterations of the same MCMC chain, we can use this procedure to estimate $Z^i_k$.
Being able to produce both of these quantities at the same time adds to the efficiency and simplicity of the algorithm.

In addition to this scheme, we also evaluate a faster version where we compute $Z^i_k$ using the approximated independent posterior $\tilde{q}(c_k)$. Inserting the corresponding product into the sum of Equation 10 results in a product over the individual sums,

$$Z^i_k \approx \prod_j \sum_{c_{j,k}} q_j(c_{j,k}) = \prod_j \sum_{c_{j,k}} p(y_{m,k} \mid c_{j,k}) \, p(c_{j,k} \mid c_{j,k-1}, l_{j,k-1}).$$
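The estimator of Equation 11 is straightforward to implement once the likelihoods of the proposal samples are available; a minimal sketch:

```python
def estimate_Z(likelihoods):
    """Estimate the normalizer Z of Eq. (10) from samples c_k ~ q(c_k),
    using Eq. (11): Z = 1 / E_q[1 / L(Y_k | c_k)].

    likelihoods: the values L(Y_k | c_k) evaluated at the sampled
    assignments, e.g. the extra iterations of the same Gibbs chain that the
    text mentions.
    """
    inv = [1.0 / L for L in likelihoods]
    # inverse of the sample mean of 1/L, i.e. the harmonic mean of the L's
    return len(inv) / sum(inv)
```

For constant likelihoods the estimate is exact; in general it is the harmonic mean of the sampled likelihood values.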
1) Rao-Blackwellized Multi-Target Tracking:
In summary, this gives us the sequential importance sampling updates:

1) Sample new locations and associations, either using Gibbs sampling or approximate rejection sampling: $c^i_k \sim q(c_k \mid Y_k, c^i_{k-1})$
2) Update the position and feature estimates $\mu^i_{j,k}$, $\Sigma^i_{j,k}$
3) Update the weights using the Gibbs estimate from the previous section, or with the approximation $Z^i_k = \prod_j \sum_{c_{j,k}} q_j(c_{j,k})$: $w^i_k = w^i_{k-1} Z^i_k$, followed by normalization $w^i_k \leftarrow w^i_k / \sum_i w^i_k$.

At each step of the filtering, we can then construct a posterior over the joint feature and spatial distributions of the objects. The feature part can be marginalized out simply by removing the corresponding dimensions from the posterior:

$$p(\hat{x}_j \mid Y_k) = \sum_i w^i N(\hat{x}_j; \mu^i_{j,k}, \Sigma^i_{j,k}). \quad (12)$$

The resulting density is illustrated in the filtering results, see for example Figure 1 and Table III.

D. Implementation
While each particle could potentially need its whole history of associations and measurements to define the probabilities, in reality we can get away with much less. Each particle is parametrized by a list $P^i_k = \{\mu^i_{j,k}, \Sigma^i_{j,k}, l^i_{j,k}\}_{j=1:N}$ of sets for the targets $j$. In particular, the target locations $l^i_{j,k}$ fully specify the probability $p(c_k \mid c^i_{k-1}, Y_{k-1})$ of the proposal distribution, as seen in Table I. $\mu^i_{j,k}$ and $\Sigma^i_{j,k}$ parametrize the marginal likelihoods.

For the version with approximate rejection sampling using $\tilde{q}(c_k)$, we need to estimate the independent likelihoods $p(y_{m,k} \mid c_{j,k} = \epsilon)$, conditioned on no observation of $j$. These roughly describe the data distribution. One approach would be to try to directly approximate the unconditional likelihood of the data points, which is hard in itself. However, we have found empirically that another approach works better in our case. We proceed using the intuition that these values should remain fairly constant with respect to the current target $j$. If the current data is assigned low values by the Kalman likelihoods, those values should start approaching the data density. This allows sampling, for example, of jumps. With this intuition in mind, we approximate the data likelihood by the expectation of the data likelihood with respect to the estimated continuous distribution of target $j$, $\hat{Y} \sim N(\mu_{j,k}, \Sigma_{j,k} + R_k)$. In the derivation, we use the fact that there should be negligible overlap between the different target distributions of one particle, and that $\sum_{c_k: c_{j,k} = m} p(c_k) \approx \frac{1}{N}$:

$$E_{\hat{Y}}\left[p(\hat{Y})\right] = \sum_{c_k} E_{\hat{Y}}\left[p(\hat{Y} \mid c_k)\right] p(c_k) \approx E_{\hat{Y}}\left[p(\hat{Y} \mid c_{j,k} = m)\right] \sum_{c_k: c_{j,k} = m} p(c_k) \approx \frac{1}{N} E_{\hat{Y}}\left[p(\hat{Y} \mid c_{j,k} = m)\right] = \frac{(4\pi)^{-D/2}}{N \sqrt{|\Sigma_{j,k} + R_k|}}$$

While this is an approximation, we have seen empirically that it performs better than trying to estimate the data likelihood.
Again, we reason that this is because the marginal $p(c_{j,k} = \epsilon \mid Y_k)$ is determined mainly by its ratio to the point likelihoods of the measurement associations.

IV. EXPERIMENTS

(a) Experiment 1. (b) Experiment 2.
Fig. 7: The jumps of the objects in the two experiments. Arrows indicate jumping objects, while dots show the position of static objects with only local movement. Jumps are annotated with time step and object id. In the first experiment, 13 objects from 3 rooms jump a total of 14 times. The second experiment contains 3 jumps between 2 rooms (i.e. locations).

We perform several experiments, both on real data collected autonomously by a robot and on simulated data. In the following, we describe the exact details of the method used, as well as the experimental setups.
A. Detections and Features
For basic change detection, we rely on the
Statistical Inlier Estimation (SIE) method of Ekekrantz et al. [31]. The method compares two subsequent local maps from the same location to find the differences and, through them, the moving objects. While the method can also extract parts that are highly dynamic (i.e. moving while the sweeps are collected), we choose to ignore these parts, as they mostly consist of moving humans. SIE explicitly models the noise of the range sensors as part of its optimization, enabling us to reliably extract smaller objects close by, as well as larger objects that are further away. The resulting detections are fed into the system of Section III-A to add detections of the objects in the observations where they are static.

In recent years, convolutional neural networks have come to play a vital role in computer vision. Today, they represent the most successful technique for, e.g., classification of images. For applications, a popular method for adapting a network to a particular domain is finetuning. In [32] and [33], the authors showed that finetuning of networks often leads to slightly better performance, but can be difficult to get right. Further, in [33] as well as [34], features extracted from the tensor responses of one of the layers in a pre-trained network showed remarkable performance on a wide range of computer vision tasks. Importantly for our application, [33] showed that these features can achieve state-of-the-art performance in both instance recognition benchmarks and retrieval scenarios. Whenever we can distinguish the
Fig. 8: The objects in experiment 1. Objects 0-12 from left to right: trash can, bag, fire extinguisher, bottle, chair, mug, trash can, bag, container, trash can, chair, mug, monitor. The tracker features are initialized from these images, which were also extracted using the method in Section III-A. The colors along the bottom correspond to the object colors in Figures 7a and 6a.

objects by visual appearance, we would like to take advantage of the discriminative power of these features. Therefore, we use CNN features reduced to three dimensions using t-SNE [29], see Figure 6. In our experiments, we found that three dimensions provide a good trade-off between the computational complexity of the filter, accurate estimation of covariances, and discrimination. In addition, we have seen that after t-SNE reduction, the variation within each class is well approximated by a multivariate Gaussian. The two-dimensional illustration in Figure 6 gives some indication of this.
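As a rough sketch of this feature pipeline, the reduce-then-fit-Gaussians structure can be written as below. The paper uses t-SNE [29] for the reduction; here a dependency-free PCA projection stands in for it, since only the overall structure is being illustrated, and the function and argument names are our own.

```python
import numpy as np

def reduce_and_fit(features, labels, dim=3):
    """Reduce high-dimensional CNN bottleneck features and fit one Gaussian
    per class (sketch; PCA stands in for the t-SNE reduction of the paper).

    features: (n, d) array of feature responses; labels: (n,) class ids.
    Returns the embedded points and a dict class -> (mean, covariance).
    """
    X = features - features.mean(axis=0)
    # principal axes from the SVD of the centered data
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    emb = X @ Vt[:dim].T
    stats = {}
    for c in np.unique(labels):
        pts = emb[labels == c]
        stats[int(c)] = (pts.mean(axis=0), np.cov(pts, rowvar=False))
    return emb, stats
```

The per-class covariances correspond to the measurement covariance fitting described in Section IV-B.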
B. Parameters
Parameter | $p_{jump}$ | $p_{meas}$ | $\sigma_q$ | $\sigma_r$ | $A_k$ | Particles | $R^f_k$, $S^f$
Value     | 0.03       | 0.98       | 0.35       | 0.15 m     |       | 300       | estimated
Most of the parameters have an intuitive meaning and can, to an extent, be selected by hand. In these experiments, we have $p_{jump} = 0.03$, $p_{meas} = 0.98$, $Q^s_k = \sigma_q I$, $R^s_k = \sigma_r I$, with $\sigma_q = 0.35$, $\sigma_r = 0.15$. Note that the spatial dimensions are independent. We deliberately set $p_{jump}$ to a slightly lower value to avoid sampling too many jumps, which might lead to particle depletion with many objects. For simplicity, the spatial process noise $Q^s_k$ is constant, but it could also increase as some function of the time between observations $k-1$ and $k$. $Q^f_k = 0$ since we do not expect the descriptors to change over time. We fit the measurement covariance $R^f_k$ and support $S^f$ from the data set referenced in Figure 6. While the areas of the locations, $A_k$, can be estimated from the 3D map observations, the rooms in our data sets are all of roughly the same area, and we use this common value in all experiments.

C. Evaluation metrics
To compare the performance of different methods, we use the CLEAR Multiple Object Tracking metrics from [35]. The metrics are defined using the association of targets to observations based on estimated positions. These associations can be used to compute the mean errors, the number of mismatching associations, the number of false positives and the number of missed observations. The MOTA metric [35] combines several of these measures into a global score, summarizing the performance of the trackers. We also compute the MOTP metric, which is simply the mean distance error of the associations [35]. To find the best association for the MOTA calculation, we use the
Hungarian algorithm to minimize the combined distance between filter estimates and observations, similar to [35]. We only consider associations closer than a fixed distance threshold. Once we have computed the associations, we compute the false positives as the number of estimates associated with a measurement with no label. Missed observations are, similarly, labels with no associated estimate.
Mismatches are observations to which the estimate assigns a label different from the annotated label. In [35], the authors count mismatches only in the observation sequence where the initial error is made, for example when two target paths cross. However, for our application it is important that we maintain tracking of the correct targets over the whole sequence. We therefore count any observation with the wrong label estimate as a mismatch.

To establish an independent baseline to compare against, we also used our detections and features in a system similar to GATMO [7]. Since all short-term dynamics and static structures have already been filtered out, we only need to keep track of the long-term dynamics. Therefore, the baseline tracker keeps track of two sets of objects, movable and absent. We use the information from Section III-A to know whether a detected object was propagated, i.e. whether it has not moved. Given the previous sets of movable and absent objects, a movable object is explained and kept in the movable set if it is static compared to the previous observation. If it is not, or if the object was in the absent set, it can be matched to one of the unexplained observations. A match is then made if the visual feature distance is smaller than some threshold. We find the threshold by running several experiments with different thresholds and picking the one that gives the best results on both annotated data sets. If an object is matched in this way, it is placed in the movable set, otherwise among the absent objects. The estimated position of a given object is simply taken to be that of its last matched observation.
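The association step of the MOTA computation can be sketched with the Hungarian algorithm, here via SciPy's `linear_sum_assignment` (an assumed dependency; the threshold value is a parameter rather than the specific one used in the paper):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(estimates, observations, max_dist):
    """Associate filter estimates with observations for the MOTA metrics:
    minimize the summed Euclidean distance over a one-to-one assignment
    (Hungarian algorithm, as in the text), then drop any pair farther apart
    than the distance threshold.

    estimates: (n, d) array; observations: (m, d) array.
    Returns a list of (estimate_index, observation_index) pairs.
    """
    cost = np.linalg.norm(
        estimates[:, None, :] - observations[None, :, :], axis=2)
    rows, cols = linear_sum_assignment(cost)
    return [(int(r), int(c)) for r, c in zip(rows, cols)
            if cost[r, c] <= max_dist]
```

From these pairs, the miss, false positive and mismatch counts follow by comparing matched indices against the annotations.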
D. Experimental setup
We perform several experiments using the STRANDS robot platform [36]. In each case, the inputs to the system are RGBD frames that a robot collects autonomously while moving between a few different locations. Typically, after observing one location, the robot visits the other ones before returning to the same location again. Since we want to validate the system in the presence of lots of movement, and especially movement of the objects between different locations, we manually move some of the objects when the robot is away. This gives us ground truth knowledge of which specific objects moved. As the motion of single objects in the experiments was exaggerated compared to most natural environments, we argue that it provides a thorough validation of the robustness of the tracker. The collected frames are fused into maps, mostly consisting of an entire room, and segmented using the system in Section III-A. Features are extracted from the images associated with the point cloud segments. The object position observations are simply computed as the mean of the point clouds. To initialize the tracking system, we mark the segments corresponding to the objects we wish to track in the first observations.

The first experiment should verify the tracking ability under frequent movement, but with dissimilar objects, to avoid ambiguity. There are a couple of instances of most types, but in particular the fire extinguisher and monitor types only appear once, see Figure 8. In the sequence of 77 observations, the 13 objects jump a total of 14 times, see Figure 7a. While the first experiment should validate the ability to track in favorable circumstances, the second experiment tests the ability to handle ambiguity and noise in the features. To that end, all of the objects are chairs, and most of them are visually similar. The seven chairs in the sequence are all in either of two locations.
In the 55 observations of the locations, there are three jumps taking place, as well as considerable local movement of all chairs, see Figure 7b. The second experiment should be challenging for methods that do not explicitly handle ambiguous evidence, such as the baseline.

To verify how much information either modality contributes to the tracker, we benchmark the full system as well as the local and global tracking components individually. The local tracker is simply a variant of the full system with $p_{jump} = 0$, disallowing any jumping within or between locations. A global tracker based purely on features is obtained by replacing the spatial Kalman measurement likelihood with a uniform distribution. The spatial position thus has no contribution to the final estimate. To attain position estimates, we instead set the positions of each particle's targets to those of their last estimated measurement correspondences $c_{j,k}$.

Finally, we perform experiments in simulation to investigate how tracking performance is affected by varying the number of tracked objects and the rate of jumps. Given a number of pre-defined areas, jumps are sampled between the areas, as well as Gaussian process noise around the current positions. Additionally, with each observation round we sample a number of noise detections not corresponding to any of the targets. Since we sample a full simulation sequence for every value of the parameters, the sampling will introduce some additional variability in the performance measures. To decrease the effects of sampling, we sample three simulations per value, and evaluate the tracker ten times on every sampled simulation.

V. RESULTS
Since the filtering method is sampling-based and the results vary slightly, we do 50 experiments with each method and present both qualitative and quantitative results from these runs. The experiments all use 300 particles, which proved to be sufficient to track the posteriors (see Section V-A). Qualitative results are reported on the version using approximate rejection sampling of the proposal, with comparisons to the results of other variations of our method and to the baseline. We evaluate sampling the proposals using Gibbs sampling by itself, and then also with Gibbs weight calculation.
A. Experiment 1: Various Objects
Fig. 9: The MOTA score as a function of the number of particles, with statistics from 10 runs on experiment 2. After about 300, increasing the number of particles has little effect.

In Figure 9, we see that the MOTA score on this experiment increases with the number of particles. After about 300 particles, the trend stabilizes, demonstrating that 300 particles are enough to approximate our defined posterior accurately, and that the remaining errors are due to modeling imperfections or system noise. The posteriors of four different objects from experiment 1 are visualized in Table III. Whenever an object disappears from a location, this affects the posterior of the corresponding object. The local uncertainty increases, at the same time as new associations are sampled around other measurement locations. In the example posteriors, we see that the posteriors are well concentrated when the objects are static.

Features of the tracked objects in experiment 1 are shown in Figure 6a. The different object classes result in features that are well separated. Across the runs, 10 out of the 14 jumps are tracked correctly in the majority of cases. These include successful tracking of diverse objects such as mugs, bottles, canisters, trash cans and bags. In fact, the failures were all with larger objects such as the jumping chair and the monitor. In these cases, the features seem to have changed dramatically before or during the jumps of the objects. In the case of the office chair, a rotation caused the feature of another chair to become more similar, causing confusion in the inference. The other failure occurred at time steps 37 and 39, when the fire extinguisher and a trash can, respectively, switched places, see Table III and Figure 7a.
This typically led the tracker to estimate that the objects stayed in the same places and to adapt to the new appearance. Besides the ten successfully tracked jumps, the two static objects were inferred correctly.

System         | MOTP [35] | Miss rate | False pos. | Mism. | MOTA [35]
Simplified     | 0.11      | 0.17      | 0.07       | 0.08  | 0.67
Gibbs sampl.   | 0.12      | 0.17      | 0.06       | 0.07  | 0.70
Gibbs weights  | 0.12      | 0.17      | 0.04       | 0.06  | 0.73
Gibbs weights* |           |           |            |       |

TABLE II: Results from experiment 1. The main method is run with the simplified independent proposal, Gibbs sampling of proposals and Gibbs estimation of weights. The rates are averaged over 50 runs. * optimized for experiment (about $\frac{1}{3} \times R^f_k$) ** no spatial tracking *** $p_{jump} = 0$

Quantitative results are presented in Table II. Importantly, we observed that in the experiments where the results were qualitatively worse, this correlated with a significantly lower MOTA rate. As expected, with almost no overlap between the feature distributions, the baseline achieves good results. Compared to the baseline, the proposed tracker takes several time steps to converge on the new target location whenever a target jumps. This is due to the uncertainty embedded in the tracker, and leads to a significantly lower MOTA score in this case. If we decrease the feature uncertainty to about one third of the estimated covariance $R^f_k$, as shown in Table II, the proposed tracker becomes more certain and collapses on new locations more quickly, leading to a score almost in line with the baseline. In general, the proposed tracker infers the same jumps as the baseline, but its estimates lag due to the uncertainty.

From the results, we see that the local tracker performed significantly worse than the full system on this dataset. Due to the large number of jumps, that model is too limited in this scenario. The high mismatch rate of the local tracker is due to its inability to jump, instead settling for the neighboring observations, which might be visually dissimilar.
On the other hand, we see that the feature-based tracker was almost on par with the local model. However, the high estimated uncertainty of the features keeps the feature-based tracker from achieving a result in line with that of the baseline. The result tells us approximately how much of the performance of the combined model is purely due to the features. We can see that, even with the feature uncertainty, the combined model performs significantly better than the individual models. This is mirrored in the qualitative results, where the full inference framework clearly outperformed both methods.

In this experiment, each Gibbs sampling method contributes a slight performance improvement over rejection sampling. Qualitatively, the Gibbs strategies were able to more often correctly infer

(Table III layout: rows Object - Bottle, Object - Mug, Object - Trash can, Object - Container; columns Initialization, 1, 2, 3, 4; "Jump" marks the time steps at which each object jumped.)
TABLE III: Posteriors (red-blue gradients) of four different objects from experiment 1 at different time steps, together with the image of the closest measurement. If there is no image, no measurement was associated with the estimate at that time step. The trash can in column 3 shows one of the failures: since it and the fire extinguisher swap places around step 2, the estimates are confused, causing a mismatch. The other objects are correctly tracked, since the images of "Initialization" and "4" coincide. Note that all objects jump at least once.

Fig. 10: The MOTA score as a function of the number of targets in simulation. Note that the score stays constant when adding targets. The initial variability is due to how the simulation is set up; if only one target is at a location and it jumps somewhere else, the old location is unlikely to be observed again, leading the tracker to believe that the target is still there.

Fig. 11: The MOTA score as a function of the jump rate in simulation, with statistics from 30 runs. Note that the score decreases with the jump rate. Since three simulations are sampled for each rate, some additional variability can be seen.

that the garbage can and fire extinguisher switched places and, more generally, to collapse on the correct locations faster. These results, and those from experiment 2, give us confidence that while the MCMC schemes are slightly more accurate, the approximate version is a reasonable approximation. The trade-off becomes apparent in the run-time of the method, which roughly doubles when doing 100 iterations of burn-in for Gibbs sampling.
B. Experiment 2: Chairs
As can be seen in Figure 6b, the features do not discriminate well between the chairs, and the inference therefore has to rely more on the measurement positions to estimate the target positions. Of the three jumps, the filter tracks two correctly, while the last one happens too late in the observation sequence for the estimate to consolidate on the new position. In general, the filter requires two to three observations before a jump is inferred. At the first jump, the chair jumps close to another identical chair, leading the tracker to confuse the two chairs in the majority of the ten runs. Such errors are to be expected, as there is no way to distinguish these visually identical chairs. Chair number 4 (see Figure 7b) moves around within a roughly three meter diameter, but the tracker accurately follows it in most of the runs. The rest of the objects all move around locally and are tracked correctly in the majority of experiments.
TABLE IV: Results from experiment 2 with chairs. The rates are an average over 50 runs.

System         MOTP [35]  Miss rate  False pos.  Mism.  MOTA [35]
Simplified     0.18       0.18       0.03        0.11   0.68
Gibbs sampl.   0.18       0.15       0.02        0.12   –
Gibbs weights  0.18       0.19       0.04        0.09   0.68
Features * **

* no spatial tracking   ** p_jump = 0

Quantitative results are presented in Table IV. Both Gibbs proposal sampling and Gibbs sampling with weight calculation have about the same MOTA score as approximate rejection sampling, with pure Gibbs proposal sampling being slightly higher. Correspondingly, qualitative results for all three methods are similar. The baseline method performs significantly worse than these methods on this experiment. It suffers since it only tracks static objects spatially, without a soft local movement prior. Whenever there is any movement, it therefore relies on the uncertain features to estimate associations. Qualitative results reflected the low score, with the estimated positions frequently jumping between different positions.

Since there are few jumps in this experiment, we see that the local version of the tracker performs well also in comparison to the full system. In particular, since the jumps do not occur until halfway through the sequence, this version typically gives perfect results up until that point. However, the larger jumps cannot be tracked, and it errs at this point. Interestingly, the small jump within one room at the end of the sequence is typically inferred correctly. The feature-based tracker qualitatively performed badly for this experiment, which can also be seen from the MOTA. Since different particle hypotheses were typically associated with several of the chairs in the two rooms, the estimate ended up somewhere in the middle. This can be seen from the high miss rate, which is due to the position annotations being too far away from any estimates. In conclusion, we see that the combined tracker performed the best in this scenario, with the local tracker providing most of the information for the joint inference.
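For reference, the MOTA values in Table IV follow the CLEAR MOT definition [35], which combines the three error rates into a single accuracy score. A minimal sketch, assuming the rates are already normalized per ground-truth object as in the table:

```python
def mota(miss_rate, false_pos_rate, mismatch_rate, num_objects=1.0):
    """CLEAR MOT accuracy [35]: one minus the total error rate.

    The three error counts (misses, false positives, mismatches) are
    summed and normalized by the number of ground-truth objects.
    """
    return 1.0 - (miss_rate + false_pos_rate + mismatch_rate) / num_objects

# "Simplified" row of Table IV: 1 - (0.18 + 0.03 + 0.11) gives the
# tabulated MOTA of approximately 0.68.
score = mota(0.18, 0.03, 0.11)
```

The "Gibbs weights" row is consistent with the same formula: 1 - (0.19 + 0.04 + 0.09) also yields approximately 0.68.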
C. Experiment 3: Simulations
In the simulated environment, we generated results for various numbers of targets and jump rates, with three simulations sampled for every parameter value. Our hypothesis is that tracking performance should decrease with the jump rate, rather than with the number of targets. This is due to how our proposal distribution is constructed; when one target is well explained by a measurement, most of the particles are likely to sample that association. Hence, particle depletion happens more rapidly when there are several uncertain associations, as happens when targets jump. If we look at the graph in Figure 11, we can indeed see that the MOTA develops negatively with an increasing jump rate. Correspondingly, Figure 10 shows that the performance stays more or less constant when increasing the number of targets. From these graphs, we can deduce that the complexity of the proposed algorithm scales with the jump rate rather than with the number of targets. This is important in real-world environments, where the number of objects might be very large, but jumps happen relatively seldom.
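The proposal behavior described above can be illustrated with a minimal sketch. This is not the paper's proposal distribution; the weight vectors are made-up likelihoods for a three-target example, showing how peaked weights keep the particle set in agreement while flat weights (as after a jump) scatter it.

```python
import random

def sample_association(weights, rng):
    """Draw an association index with probability proportional to the
    unnormalized likelihood weights (the core of a proposal step)."""
    u = rng.random() * sum(weights)
    acc = 0.0
    for i, w in enumerate(weights):
        acc += w
        if u < acc:
            return i
    return len(weights) - 1

rng = random.Random(0)
# One target explains the measurement well: nearly all particles sample
# the same association, so particle diversity is preserved elsewhere.
peaked = [sample_association([0.95, 0.03, 0.02], rng) for _ in range(200)]
# Ambiguous likelihoods, as after a jump: particles scatter over the
# associations, which is what accelerates depletion.
flat = [sample_association([1.0, 1.0, 1.0], rng) for _ in range(200)]
```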
D. Data from robot deployment
In addition to the experiments, we have also evaluated the method on data gathered by a mobile robot in a real-world deployment. The robot patrolled an office environment over a period of two months, gathering observations of a few places at intervals of one or two days. In our analysis, we have focused on three places with highly movable objects: one reception area and two kitchens. In these scenes, we marked three objects in each location, most of them moving around within the area. Since no motion was observed between the locations, the experiment should mostly be seen as a way to test the stability of the methods in a real-world scenario. In Table V, we see that the numbers reflect those in other experiments. Indeed, the method manages to track the objects through most of the sequences, confirming the validity of the approach as applied to natural data.

TABLE V: Statistics from robot deployment, with 9 objects in 34 observations. Values are comparable to other experiments.

VI. DISCUSSION
We see that the joint tracker is able to bring in information from both the visual and spatial modalities to produce a reliable estimate of the object positions. In the case where we have different object classes, the CNN features proved to be highly discriminative, allowing us to track the jumps even of comparatively small cups. In fact, the small objects seem to consistently provide good features, even in the presence of motion blur. This, together with the preliminary deployment results in Section V-D, gives us confidence that the proposed system will be able to work in a real-world scenario, where change detection is likely to detect the movement of a multitude of small objects. Further, the method works well even when we increase the number of objects to track, since the accuracy of the algorithm depends more on the number of jumps. To maintain accurate posteriors in the event of many jumping objects, one could employ an adaptive particle set scheme such as [37].

The baseline method is based on the previous GATMO [7], as applicable within our system. Since this method does not reason probabilistically, it suffers when there is noise and ambiguity in the measurements, as can be seen in the second experiment. Since there is overlap between the distributions of the different features, objects with overlapping features are sometimes confused. In a real-world scenario, this kind of situation is to be expected, as there are often many similar-looking objects. Moreover, we expect more noise in real-world scenarios, where there is more lighting variation and the objects may be observed from more angles. We conclude that it is important to incorporate uncertainty as part of the tracking, especially in environments that contain multiple visually similar objects.

While not discussed in the results, the tracking is enabled partly by the performance of the change detection framework discussed in Section III-A.
Even when the objects did not move and were detected through the propagation approach, all objects were detected in all the observations. This is remarkable, since in total we have 522 annotated objects in the datasets. It shows that this simple method is sufficient, at least given precise registration. However, the segmentation sometimes either over- or under-segments the objects. Particularly for the chairs, over-segmentation is an issue that should be dealt with in a more principled fashion, potentially by training a supervised segmentation method on similar data. In most cases, the tracker was nonetheless able to handle these deficiencies.

In Figure 12, we see the effect of varying the p_jump and p_meas parameters. In the experiments, we have chosen to use the same parameter values for consistency. However, from the figure we observe that the method performs better on experiment 1 when the value is higher than the one used. The difference in the optimal value of p_jump between the experiments is due to the actual jump rate being higher in experiment 1, with about . jumps per observation versus . in experiment 2. The relation of the jump frequencies to the optimal p_jump values leads us to believe that it might be beneficial to estimate the number of jumps using the algorithm and adjust the p_jump parameter iteratively in an EM fashion. Similarly, we saw in experiment 1 that estimating the feature covariance R_f for the data at hand can significantly improve results.

Fig. 12: Effect of the interaction of p_jump and p_meas on the MOTA score in the two experiments. In experiment 2, the method performs well roughly within the quadrant p_jump ≤ . , p_meas ≥ . . The optimal value of p_jump seems to be higher in the first experiment, at or above . .

VII. CONCLUSION & FUTURE WORK
We presented a system which can track multiple similar objects even in the presence of noise and when only a subset of the objects can be observed at any given time. There are three major reasons that together enable the system to accurately track objects of sizes ranging from mugs to chairs. First, the probabilistic method of [31], together with our temporal segmentation, allows us to consistently identify even small changes in the 3D maps. Second, recent CNN architectures such as the one used [28] allow us to represent the discovered objects using discriminative features. For the smaller objects, it is key to use the image data, as the depth data is often too noisy to produce reliable features. Third, our Rao-Blackwellized object tracker accurately models the dynamics of the scenario and allows us to maintain realistic posteriors over object positions. Even when the objects are visually similar, the tracker can rely on the spatial measurements for reasoning.

As mentioned, one pitfall of the current system is that it relies on a closed-world assumption. Without this, tracking the jumps of objects would likely be intractable. An open question is how to resolve this conflict. We would like to investigate whether we could model how probable different object types are to stay within the confines of the environment. For example, a chair is very likely to stay, while a cardboard box will likely be thrown away. It would therefore be appealing to learn these probabilities and track jumps only when the object is likely to stay in the environment. For temporary objects, the system should instead track the births and deaths, see e.g. [18], [17]. With the ability to integrate new objects as they are observed comes also the possibility of applying the movement models to improve a full-blown SLAM system, as suggested in [38].

More generally, we would like to learn more properties of the objects given the visual features.
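The per-type persistence idea above could be sketched as a simple gate on the tracking mode. The stay probabilities and threshold below are purely illustrative assumptions (in the envisioned system they would be learned from data), not values from the paper:

```python
# Hypothetical learned probabilities that an object of a given type stays
# in the environment; the numbers are made up for illustration.
P_STAY = {"chair": 0.99, "monitor": 0.999, "cardboard box": 0.2}

def track_jumps(obj_type, threshold=0.5):
    """Only spend jump-tracking effort on object types likely to remain in
    the environment; temporary objects would instead be handled by
    birth/death tracking as in [18], [17]."""
    return P_STAY[obj_type] >= threshold
```

A chair would thus be tracked through its jumps, while a cardboard box would be routed to a birth/death model once it disappears.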
In addition to estimating the jump probability of the object type, there should also be specific movement model variances for different targets. Here, a monitor serves as a good example of something that stays in the exact same place for a long time, while a chair will always be expected to move around a bit. The chair should thus have a higher uncertainty Q_sk associated with its movement model than the monitor. In turn, one specific kitchen chair might be more probable to move than an office chair.

VIII. ACKNOWLEDGEMENTS
The authors would like to thank Erik Ward for many rewarding conversations on the problems of multi-target tracking. The work presented in this paper has been funded by the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement No 600623 ("STRANDS").

REFERENCES

[1] J. Koch, J. Wettach, E. Bloch, and K. Berns, "Indoor localisation of humans, objects, and mobile robots with RFID infrastructure," in Hybrid Intelligent Systems (HIS), 2007 7th International Conference on, pp. 271-276, IEEE, 2007.
[2] C.-C. Wang, C. Thorpe, S. Thrun, M. Hebert, and H. Durrant-Whyte, "Simultaneous localization, mapping and moving object tracking," The International Journal of Robotics Research, vol. 26, no. 9, pp. 889-916, 2007.
[3] R. Biswas, B. Limketkai, S. Sanner, and S. Thrun, "Towards object mapping in non-stationary environments with mobile robots," in Intelligent Robots and Systems (IROS), 2002 IEEE/RSJ International Conference on, vol. 1, pp. 1014-1019, IEEE, 2002.
[4] N. E. Du Toit and J. W. Burdick, "Robot motion planning in dynamic, uncertain environments," IEEE Transactions on Robotics, vol. 28, no. 1, pp. 101-115, 2012.
[5] L. Montesano, J. Minguez, and L. Montano, "Modeling the static and the dynamic parts of the environment to improve sensor-based navigation," in Robotics and Automation (ICRA), 2005 IEEE International Conference on, pp. 4556-4562, IEEE, 2005.
[6] K. Shubina and J. K. Tsotsos, "Visual search for an object in a 3D environment using a mobile robot," Computer Vision and Image Understanding, vol. 114, no. 5, pp. 535-547, 2010.
[7] G. Gallagher, S. S. Srinivasa, J. A. Bagnell, and D. Ferguson, "GATMO: A generalized approach to tracking movable objects," in Robotics and Automation (ICRA), 2009 IEEE International Conference on, pp. 2043-2048, IEEE, 2009.
[8] E. Herbst, X. Ren, and D. Fox, "RGBD object discovery via multi-scene analysis," in Intelligent Robots and Systems (IROS), 2011 IEEE/RSJ International Conference on, pp. 4850-4856, IEEE, 2011.
[9] D. F. Wolf and G. S. Sukhatme, "Mobile robot simultaneous localization and mapping in dynamic environments," Autonomous Robots, vol. 19, no. 1, pp. 53-65, 2005.
[10] P. Biber, T. Duckett, et al., "Dynamic maps for long-term operation of mobile service robots," in Robotics: Science and Systems, pp. 17-24, 2005.
[11] C.-C. Wang and C. Thorpe, "Simultaneous localization and mapping with detection and tracking of moving objects," in Robotics and Automation (ICRA), 2002 IEEE International Conference on, vol. 3, pp. 2918-2924, IEEE, 2002.
[12] M. Montemerlo, S. Thrun, and W. Whittaker, "Conditional particle filters for simultaneous mobile robot localization and people-tracking," in Robotics and Automation (ICRA), 2002 IEEE International Conference on, vol. 1, pp. 695-701, IEEE, 2002.
[13] D. Schulz and W. Burgard, "Probabilistic state estimation of dynamic objects with a moving mobile robot," Robotics and Autonomous Systems, vol. 34, no. 2, pp. 107-115, 2001.
[14] D. Anguelov, R. Biswas, D. Koller, B. Limketkai, and S. Thrun, "Learning hierarchical object maps of non-stationary environments with mobile robots," in Proceedings of the Eighteenth Conference on Uncertainty in Artificial Intelligence, pp. 10-17, Morgan Kaufmann Publishers Inc., 2002.
[15] R. Toris and S. Chernova, "Temporal persistence modeling for object search," in Robotics and Automation (ICRA), 2017 IEEE International Conference on, pp. 3215-3222, IEEE, 2017.
[16] D. F. Wolf and G. S. Sukhatme, "Towards mapping dynamic environments," in Proceedings of the International Conference on Advanced Robotics (ICAR), pp. 594-600, 2003.
[17] S. Särkkä, A. Vehtari, and J. Lampinen, "Rao-Blackwellized particle filter for multiple target tracking," Information Fusion, vol. 8, no. 1, pp. 2-15, 2007.
[18] S. Oh, S. Russell, and S. Sastry, "Markov chain Monte Carlo data association for multi-target tracking," IEEE Transactions on Automatic Control, vol. 54, no. 3, pp. 481-497, 2009.
[19] I. Miller and M. Campbell, "Rao-Blackwellized particle filtering for mapping dynamic environments," in Robotics and Automation (ICRA), 2007 IEEE International Conference on, pp. 3862-3869, IEEE, 2007.
[20] T. Vu, B.-N. Vo, and R. Evans, "A particle marginal Metropolis-Hastings multi-target tracker," IEEE Transactions on Signal Processing, vol. 62, no. 15, pp. 3953-3964, 2014.
[21] R. Mahler and T. Zajic, "Multitarget filtering using a multitarget first-order moment statistic," in Proc. SPIE, vol. 4380, pp. 184-195, 2001.
[22] B.-N. Vo and W.-K. Ma, "The Gaussian mixture probability hypothesis density filter," IEEE Transactions on Signal Processing, vol. 54, no. 11, pp. 4091-4104, 2006.
[23] R. Finman, T. Whelan, M. Kaess, and J. J. Leonard, "Toward lifelong object segmentation from change detection in dense RGBD maps," in Mobile Robots (ECMR), 2013 European Conference on, pp. 178-185, IEEE, 2013.
[24] R. Ambrus, N. Bore, J. Folkesson, and P. Jensfelt, "Meta-rooms: Building and maintaining long term spatial models in a dynamic world," in Intelligent Robots and Systems (IROS), 2014 IEEE/RSJ International Conference on, pp. 1854-1861, IEEE, 2014.
[25] R. Ambrus, J. Ekekrantz, J. Folkesson, and P. Jensfelt, "Unsupervised learning of spatial-temporal models of objects in a long-term autonomy scenario," in Intelligent Robots and Systems (IROS), 2015 IEEE/RSJ International Conference on, pp. 5678-5685, IEEE, 2015.
[26] E. Herbst, P. Henry, and D. Fox, "Toward online 3-D object segmentation and mapping," in Robotics and Automation (ICRA), 2014 IEEE International Conference on, pp. 3193-3200, IEEE, 2014.
[27] S. Geman and D. Geman, "Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images," IEEE Transactions on Pattern Analysis and Machine Intelligence, no. 6, pp. 721-741, 1984.
[28] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, "Rethinking the Inception architecture for computer vision," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818-2826, 2016.
[29] L. van der Maaten and G. Hinton, "Visualizing data using t-SNE," Journal of Machine Learning Research, vol. 9, no. Nov, pp. 2579-2605, 2008.
[30] A. Doucet, S. Godsill, and C. Andrieu, "On sequential Monte Carlo sampling methods for Bayesian filtering," Statistics and Computing, vol. 10, no. 3, pp. 197-208, 2000.
[31] J. Ekekrantz, N. Bore, R. Ambrus, J. Folkesson, and P. Jensfelt, "Unsupervised object discovery and segmentation of RGBD images," arXiv preprint arXiv:1710.06929, 2017.
[32] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman, "Return of the devil in the details: Delving deep into convolutional nets," arXiv preprint arXiv:1405.3531, 2014.
[33] A. Sharif Razavian, H. Azizpour, J. Sullivan, and S. Carlsson, "CNN features off-the-shelf: An astounding baseline for recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 806-813, 2014.
[34] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell, "DeCAF: A deep convolutional activation feature for generic visual recognition," in ICML, vol. 32, pp. 647-655, 2014.
[35] K. Bernardin and R. Stiefelhagen, "Evaluating multiple object tracking performance: The CLEAR MOT metrics," EURASIP Journal on Image and Video Processing, vol. 2008, no. 1, pp. 1-10, 2008.
[36] N. Hawes, C. Burbridge, F. Jovan, L. Kunze, B. Lacerda, L. Mudrová, J. Young, J. Wyatt, D. Hebesberger, T. Kortner, et al., "The STRANDS project: Long-term autonomy in everyday environments," IEEE Robotics & Automation Magazine, vol. 24, no. 3, pp. 146-156, 2017.
[37] D. Fox, "Adapting the sample size in particle filters through KLD-sampling," The International Journal of Robotics Research, vol. 22, no. 12, pp. 985-1003, 2003.
[38] S. L. Bowman, N. Atanasov, K. Daniilidis, and G. J. Pappas, "Probabilistic data association for semantic SLAM," in Robotics and Automation (ICRA), 2017 IEEE International Conference on, IEEE, 2017.