Optimizing Consensus-based Multi-target Tracking with Multiagent Rollout Control Policies
(Invited Paper)
Tianqi Li, Lucas W. Krakow, and Swaminathan Gopalswamy

Tianqi Li and Swaminathan Gopalswamy are with the Department of Mechanical Engineering, Texas A&M University, College Station, TX 77840, USA, {xmcx731, sgopalswamy}@tamu.edu. Lucas W. Krakow is with the Bush Combat Development Complex (Texas A&M), Bryan, TX 77807, USA, [email protected].

Abstract — This paper considers a multiagent, connected, robotic fleet where the primary functionality of the agents is sensing. A distributed multi-sensor control strategy maximizes the value of the collective sensing capability of the fleet, using an information-driven approach. Each agent individually performs sensor processing (Kalman Filtering and Joint Probabilistic Data Association) to identify trajectories (and associated distributions). Using communications with its neighbors, the agents enhance the prediction of the trajectories using a
Consensus of Information approach that iteratively calculates the Kullback-Leibler average of trajectory distributions, enabling the calculation of the value of the collective information for the fleet. The dynamics of the agents, the evolution of the identified trajectories for each agent, and the dynamics of individual observed objects are captured as a Partially Observable Markov Decision Process (POMDP). Using this POMDP and applying rollout with receding horizon control, an optimized non-myopic control policy that maximizes the collective fleet information value is synthesized. Simulations are performed for a scenario with three heterogeneous UAVs performing coordinated target tracking that illustrate the proposed methodology and compare the centralized approach with a contemporary sequential multiagent distributed decision technique.
Index Terms — multiagent rollout, target tracking, multi-sensor system, information-driven, POMDP
I. INTRODUCTION
In the paradigm of autonomous systems, sensing is regarded as the primitive that transforms information about the physical world into system inputs, for example for action selection or planning. Interactions between sensing and other primitives support tasks that employ sensor information, such as world modeling, controller stimulus, and knowledge representation. Research on control strategies centered around information and sensors, including but not limited to target tracking, searching, and world mapping, is a new and fast-growing focus. Information-driven [1], [2], or sensor-driven [3], control studies the problem of optimizing a system's configuration and resource allocation to improve sensing performance.

There are two main schemes for data processing in multiagent systems: centralized and distributed. In a centralized architecture, every agent must send its information (observations, processed messages, control information, etc.) to a
central control node, either directly or by some multi-hop relay. This configuration can instigate challenges as the system's number of agents grows, and this is compounded when the agents are mobile: the central node needs bandwidth linear in the number of agents, and topological changes in the communication graph caused by moving agents make it hard for messages to be relayed by all nodes in the network [4]. The distributed approach means there is no fixed information aggregation component central to the operation of the system, and communication is peer to peer [5]. In applications like sensor fusion, distributed processing may negatively impact the system's overall performance compared to a centralized method, but its advantages are realized in sensor resource management and in the robustness of the system, e.g., its communications. Decentralized sensor fusion can be more adaptable in applications like sensor deployment for maximal coverage [6], where sensors that require local decisions already possess the needed local information, such as target tracking estimates. On the communication side, bandwidth constraints make centralized sensor fusion infeasible for large-system implementations [7]. The trend of applying decentralized algorithms is arising in information-driven tasks, leveraging the aforementioned advantages. [8] establishes the link between distributed target tracking and information-driven mobility; as a byproduct, flocking behavior is observed in single-target tracking by a mobile multisensor network. [9] presents coordinated intermittent searching by a robot team in a restricted area with a max-consensus algorithm on the information to picture the occupancy map, in which mobile sensor formations are propelled by average consensus on decisions.

However, an optimal solution for such coordinated heterogeneous multiagent planning poses a computational challenge. [10] introduces sensor planning based on the estimation of a single target with a range-bearing sensor model and motion constraints, and studies such a strategy in single- and (heterogeneous) multisensor scenarios. With motion constraints on the robots (maximum speed and minimum distance to the target), such non-convex constrained optimization problems are NP-hard in general. The work in [11] strives to maximize the FoV of the sensing team by topological reconfiguration of the robot network under sensor deterioration. The solution to the proposed reconfiguration problem equates to a circle packing problem in 2-dimensional space, which is shown to be NP-complete [12]. It is clear that optimality is difficult to achieve; thus, efficient suboptimal solutions are of great value for implementations. Recently a multiagent version of the rollout policy was published [13], which augments the traditional rollout approach through sequential optimization. In a multiagent system containing $n$ agents, each with an action domain of cardinality $s$, the joint optimization requires computation of order $O(s^n)$. However, when performing sequential optimization with respect to the agents, rollout reduces the computational load to the order of $O(sn)$.
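To make the $O(s^n)$ versus $O(sn)$ contrast concrete, the following toy Python sketch (our own illustration, not code from [13]) enumerates both search strategies over a small discrete action set; the reward function is a hypothetical stand-in.

```python
# Joint vs. sequential (agent-by-agent) action search on a toy problem.
from itertools import product

actions = range(3)      # s = 3 candidate actions per agent
n_agents = 4

def reward(joint_action):
    # Toy stand-in for the expected reward of a joint action.
    return -sum((a - 1) ** 2 for a in joint_action)

# Joint optimization: O(s^n) evaluations (3^4 = 81 here).
best_joint = max(product(actions, repeat=n_agents), key=reward)

# Sequential optimization in rollout style: O(s*n) evaluations (12 here).
# Each agent re-optimizes its own action while the others stay fixed.
u = [0] * n_agents                      # agents start from a base action
for i in range(n_agents):
    u[i] = max(actions, key=lambda a: reward(u[:i] + [a] + u[i + 1:]))
```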
This paper focuses on the coordination of a mobile multi-sensor system in the task of target tracking. Different from the work in [10], the scenario of tracking multiple targets is studied, giving a more generic version of distributed target tracking. For simplicity of the problem, the agents are taken to be UAVs, whose freedom of movement alleviates sensor-to-target and sensor-to-sensor distance constraints. Since future target positions are stochastic, non-myopic behavior is of great interest in the planning. This is achieved through the application of a receding horizon when computing rollout policies. We formulate a POMDP framework and apply a sequential rollout method in a consensus-based distributed target tracking task to solve information-driven coordinated planning with a heterogeneous multiagent system. With a simulation designed for target tracking, the performance of the rollout algorithm shows that:

1) The non-myopic behavior extracted from the multiagent rollout benefits the tracking performance of the fleet of mobile sensors.
2) The computational advantages of the sequential optimization in the multiagent rollout policy can be realized with minimal target tracking degradation.

This paper is organized in the following manner. Section II describes the fundamental setup of distributed target tracking and the objective of the proposed problem. Section III details the proposed non-myopic multisensor planning in a POMDP framework. Section IV presents the simulation results, which demonstrate the coordination of sensors in the target tracking task. Insights of this study are recapped and future work is proposed in Section V.

II. PROBLEM STATEMENT
We consider a heterogeneous mobile multi-sensor system in a task of target tracking for our specific control problem. Each agent is equipped with sensors, actuators, and a communication device for the three main functions: sensing, mobility, and communicating with peer agents. This multi-sensor system has $n$ agents in total and each agent has a unique label $i \in [n]$. Define the state of a single agent $i$ as $s^i \in \mathcal{S}^i$, where $s^i$ contains the pose and velocity of the agent in the horizontal plane, $s^i = (x, y, \theta, v_x, v_y)$. We assume a holonomic model for the dynamics of each agent where the control inputs directly alter the agent's velocity; the control domain $\mathcal{U}^i$ is the velocity of the agent, such that $u^i \in \mathcal{U}^i$, $u^i = (v_x, v_y)$, and the azimuth in the pose is consistent with the velocity, i.e., $\theta = \arctan(v_y / v_x)$. An agent $i$ has the following attributes:

- Mobility: For agent $i$, in discrete time, we have state $s_k^i \in \mathcal{S}^i$ and control $u_k^i \in \mathcal{U}^i$; the sensor state transition function is $s_{k+1}^i = f^i(s_k^i, u_k^i)$.
- Sensing: Each sensor has a limited field of view (FoV) and an associated sensing quality. For a quad-copter UAV agent, we define the FoV of agent $i$ by the variable $\phi_k^i(s_k^i)$, parameterized by the agent state. For sensing quality, a scalar $\alpha^i$ is defined in the measurement uncertainty matrix $R$, detailed in (3).
- Communication: An agent communicates to others within its neighborhood defined by the distance $d_c$. For simplicity, we assume this threshold is universal across agents.

The multisensor network is comprised of the $n$ agents described above. The network can then be represented as an undirected proximity graph $G = (V(G), E(G))$: each vertex $v \in V(G)$ represents an agent, and every edge $e \in E(G)$ represents a communication link between two agents, i.e., $\{i, j\} \in E(G)$ iff $\|s^i - s^j\| \leq d_c$. Define the neighbors of agent $i \in V(G)$ as the set of vertices $N_G(i) \subseteq V(G)$ such that $\forall j \in N_G(i)$, $\{i, j\} \in E(G)$, and denote the degree of agent $i$ as $d(i) = |N_G(i)|$.
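As a concrete illustration of this proximity graph, the following short Python sketch computes $N_G(i)$ for each agent; positions and $d_c$ are placeholder inputs for the sketch, not values from the paper.

```python
# Neighborhoods of the proximity communication graph G from planar positions.
import numpy as np

def neighbors(positions, d_c):
    """Return N_G(i) for every agent i: all j != i with ||s_i - s_j|| <= d_c."""
    n = len(positions)
    return {i: [j for j in range(n)
                if j != i and np.linalg.norm(positions[i] - positions[j]) <= d_c]
            for i in range(n)}

pos = np.array([[0.0, 0.0], [3.0, 4.0], [50.0, 0.0]])
N = neighbors(pos, d_c=10.0)   # {0: [1], 1: [0], 2: []}
```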
A. Distributed Target Tracking

In the multi-target tracking (MTT) problem, at each time step $k$, an agent $i$ generates a set of measurements deemed an observation, $Z_k^i = \{z_{k,1}^i, z_{k,2}^i, \ldots, z_{k,m_k^i}^i\}$ with cardinality $|Z_k^i| = m_k^i$. A target state $x \in \chi$ contains position and velocity variables in a 2-dimensional Cartesian coordinate system, and a measurement $z \in \mathcal{O}$ generated from a target is taken to be the target's position. A linear assumption of the target dynamics is

$x_{k+1} = F_k x_k + w_k, \quad w_k \sim \mathcal{N}(0, Q_k)$  (1)

where $w_k$ represents the motion disturbance as a normal distribution and $F_k$ is the nearly constant velocity motion model, and the observation law is

$z_k = H_k x_k + v_k, \quad v_k \sim \mathcal{N}(0, R_k)$  (2)

where $v_k$ represents the measurement noise. The common range-bearing sensor model is considered in our paper, which is widely studied and applied [14]. Specifically, for agent $i$ with state $s_k^i$, let $r_k$ and $\rho_k$ denote the estimated range and bearing of a mobile target: $r_k = \max(\|s_k^{i,pos} - x_k^{t,pos}\|, r_{\min})$, and $\rho_k$ is the angular measurement between the sensor and the target. The observation covariance matrix $R_k^i(s_k^i, x_k^t)$ for target $t$ is

$R_k^i(s_k^i, x_k^t) = \alpha^i G(\rho_k) \begin{bmatrix} 0.1 \times r_k & 0 \\ 0 & 0.1\pi \times r_k \end{bmatrix} G(\rho_k)^T$  (3)

where $G(\rho)$ is the rotation matrix of angle $\rho$, and the sensing quality factor $\alpha^i$ is a scalar in this uncertainty matrix. This imposes a spatially varying measurement error [15]. To avert computational issues inside the Kalman filter updates, we apply $r_{\min}$ as the minimal effective range threshold.

Based on the principle of distributed target tracking in [16], the systematic estimation is twofold: i) single-agent local estimation and prediction; ii) track fusion over agents. For a single agent, the estimation includes the measurement-to-track assignment and the estimation update given the measurements. One method for this measurement-to-track assignment is the joint probabilistic data association filter (JPDAF) [17]. The JPDA algorithm handles the assignments based on conditional probabilities calculated with respect to predicted target locations and the received measurements. Since a target may be outside the sensor FoV or simply undetected, we also employ M-out-of-N track maintenance logic for both track confirmation and deletion. A track is confirmed (deleted) if there are more than M detections (missed detections, respectively) in the latest N consecutive observations.

Once the local tracking step (JPDAF) is complete, messages containing tracking information from an agent's neighbors are received and processed to update the local tracks in the consensus period. As is common in distributed sensor fusion, we follow the average consensus scheme for this step. However, instead of sending the measurements, we perform consensus of information (CI), which directly processes the tracking information. For example, [18] utilizes the information filter to compute the average consensus of track information, and the result is proven to be the Kullback–Leibler average (KLA) of the initial local distributions [19]. Denote $\Theta_k^i$ as the set of local tracks of agent $i$. In a Kalman filter, a pair consisting of the mean vector and covariance matrix $(x_{i,k|k}^t, P_{i,k|k}^t)$ is the standard representation of a track $t \in \Theta_k^i$. The same track can be represented in the information filter by the pair comprised of an information matrix $\Omega_{i,k|k}^t = (P_{i,k|k}^t)^{-1}$ and an information vector $q_{i,k|k}^t = \Omega_{i,k|k}^t x_{i,k|k}^t$.

As with any consensus algorithm, multiple updates must be done for the information to converge across the network. We denote the total number of consensus update steps as $L$; these take place in order to accurately proliferate the JPDA tracking information to all agents of the network. The $L$ updates are taken over each discrete epoch, e.g., from $k$ to $k+1$. At each consensus update $l = 1, \ldots, L$ the average consensus step for agent $i$ is

$\Omega_{i,k|k}^t(l+1) = \frac{1}{1 + d(i)} \sum_{j \in N_G(i) \cup \{i\}} \Omega_{j,k|k}^t(l)$  (4a)

$q_{i,k|k}^t(l+1) = \frac{1}{1 + d(i)} \sum_{j \in N_G(i) \cup \{i\}} q_{j,k|k}^t(l)$  (4b)

Given the fixed number of consensus steps, the output of CI is $(\Omega_{i,k|k}^t(L), x_{i,k|k}^t(L))$.
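The consensus update (4) can be sketched in Python as follows; per-track bookkeeping and the JPDAF outputs are abstracted away, so `Omegas` and `qs` are hypothetical per-agent information pairs for a single common track.

```python
# Consensus-of-information iterations (4a)-(4b) for one track.
import numpy as np

def ci_step(Omegas, qs, N):
    """One consensus iteration l -> l+1 over all agents."""
    new_O, new_q = [], []
    for i in range(len(Omegas)):
        group = N[i] + [i]                    # N_G(i) U {i}
        w = 1.0 / len(group)                  # 1 / (1 + d(i))
        new_O.append(w * sum(Omegas[j] for j in group))
        new_q.append(w * sum(qs[j] for j in group))
    return new_O, new_q

def consensus_of_information(Omegas, qs, N, L):
    for _ in range(L):                        # L consensus steps per epoch
        Omegas, qs = ci_step(Omegas, qs, N)
    return Omegas, qs
```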
B. Information-driven Multiagent Coordination Control

The multisensor system described above is information-driven, i.e., the control of the mobile sensors aims to obtain more overall information about the targets through observation and distributed target tracking. First, a good choice of information utility is essential to capture the objective of the system in a mathematical formulation. Existing choices include entropy [1], the Fisher information matrix [8], the covariance matrix in the Gaussian setting [10], [11], [15], and other sensor model-based heuristics [20]. In our problem, the trace of the covariance matrix of the targets is chosen for its perceptual intuition and ease of computation in the Gaussian-based data association. To further simplify the computation, since the information matrix $\Omega_{i,k|k}^t$ is the inverse of the covariance matrix $P_{i,k|k}^t$, the trace of the information matrix is applied as the utility of information. In our distributed multisensor system, the utility of information of agent $i$ at time $k$ is defined as

$\Phi^i(k) = \sum_{t \in \Theta_k^i} \mathrm{tr}\big(\Omega_{i,k|k}^t(L)\big)$  (5)

Notice that this utility contains the process of distributed MTT in Section II-A; the term $\Omega_{i,k|k}^t(L)$ is defined as the output of the consensus step for a certain track $t$ by (4a).
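A direct reading of (5) in code, under the assumption that the post-consensus information matrices for agent $i$'s tracks are available as a list:

```python
# Information utility Phi^i(k) of Eq. (5).
import numpy as np

def information_utility(track_Omegas):
    """Sum of traces of the post-consensus information matrices."""
    return sum(np.trace(Om) for Om in track_Omegas)
```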
C. Semantic Map

The study of sensing is dependent on the environment, and in our mobile multisensor system the physical environment has an impact on the sensing ability of the agents. The semantic meaning of areas and objects helps in understanding the environment and provides guidance for robot decision making. Following [21], define the world $W = (T, X, Y)$ with a topological graph $T$ to represent the components of a physical object set $X$ and a semantic element set $Y$. In our problem, geographic factors such as tree shadows and ground reflection affect the quality of detection; by utilizing the semantic map, the movement of targets into an occlusion area can be predicted based on the Kalman filter. Such semantic information helps picture future observations based on the current target belief. Given a set of observations $Z$, the semantic map functions as $W(Z)$, which returns the observations outside the occlusion areas.
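One possible reading of the map filter $W(Z)$ in code, under the simplifying assumption of axis-aligned rectangular occlusion regions (the paper's map representation is more general):

```python
# Semantic-map observation filter: keep measurements outside occlusions.
import numpy as np

def semantic_map_filter(Z, occlusions):
    """Z: (m, 2) measurement positions; occlusions: (xmin, ymin, xmax, ymax)."""
    def occluded(z):
        return any(x0 <= z[0] <= x1 and y0 <= z[1] <= y1
                   for (x0, y0, x1, y1) in occlusions)
    return np.array([z for z in Z if not occluded(z)])
```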
III. METHOD

A. A POMDP Model
A Partially Observable Markov Decision Process (POMDP) model is formulated in [15], which describes non-myopic, sensor-centered, centralized multiagent motion planning for the task of target tracking. This POMDP model is extended to the multiagent scenario with distributed planning in this paper. Define a POMDP model as a tuple $\mathcal{P} = (\mathcal{X}, \mathcal{U}, \mathcal{T}, \mathcal{R}, \mathcal{O})$.

State $\mathcal{X}$: The state of the POMDP contains the state of the agents $\mathcal{S}$, the state of the targets $\chi$, and the state of the filter $\mathcal{F}$. The state of all $n$ agents is defined as $\mathcal{S} = \mathcal{S}^1 \times \mathcal{S}^2 \times \cdots \times \mathcal{S}^n$. Together with the agent states $s_k \in \mathcal{S}$, the overall FoV of the agents is $\phi_k(s_k) = \phi_k^1(s_k^1) \times \phi_k^2(s_k^2) \times \cdots \times \phi_k^n(s_k^n)$, regarded as a variable parameterized by the agent states. The target state $\chi$ contains the positions and velocities of all current targets. The filter state is $\mathcal{F}_k = \{(x_{i,k|k}^t, P_{i,k|k}^t) \mid \forall i \in [n], \forall t \in \Theta_k^i\}$, i.e., all agent-maintained tracks represented by Gaussian distributions with posterior means $x_{i,k|k}^t$ and posterior covariance matrices $P_{i,k|k}^t$. The POMDP state is summarized as $x_k = (s_k, \chi_k, \mathcal{F}_k)$.

Action $\mathcal{U}$: The action domain is the overall action of the multisensor system, with the control domain of each sensor defined by the 2-d velocity in Section II. Such a control domain is constrained by both the physical world and the agents' kinematic constraints. Assume the maximum velocity is set as $v_{\max}$ for all agents $i \in [n]$. Thus at time $k$, the control domain $\mathcal{U}_k^i$ of a single agent $i$ is implicit with its state $s_k^i$. The joint action domain is then defined as $\mathcal{U}_k = \mathcal{U}_k^1 \times \cdots \times \mathcal{U}_k^n$.

Observation and Observation Law $\mathcal{O}$: Observations are generated by the sensors equipped on the agents through a perception algorithm; here we assume that an ideal perceptual component outputs observations with no false alarms in areas without occlusion in $W$. Under Gaussian noise assumptions, the observation law of a target is presented by (2) with the dynamic sensor model defined in (3). From a system perspective, the joint observation domain of all sensors is defined as $\mathcal{O}$, $Z_k \in \mathcal{O}$, $Z_k = \{Z_k^i \mid i \in [n]\}$. For agent $i$, its observation $Z_k^i$ is constrained by the semantic map $W$ and its FoV $\phi_k^i$.

State Transition $\mathcal{T}$: The state transition law is defined as the mapping $\mathcal{T}: \mathcal{X} \times \mathcal{U} \rightarrow \mathcal{X}$. For the state $x_k \in \mathcal{X}_k$, $x_k = (s_k, \chi_k, \mathcal{F}_k)$, and $u_k \in \mathcal{U}_k$, the state transition decomposes into the following: i) the agents' transitions are completely determined by $s_{k+1}^i = f^i(s_k^i, u_k^i)$; ii) the target states $x_k^t \in \chi_k$ follow the stochastic process (1) with Gaussian additive noise, independent of the sensor state $s_k$ and control variable $u_k$; iii) the filter state $\mathcal{F}_k$ depends on the agent state $s_k$, the FoV of the sensors $\phi_k$, and the target state $\chi_k$, and evolves according to the chosen tracking filter equations, JPDA with average consensus.

Reward $\mathcal{R}$: The reward function for a single agent is a mapping $\mathcal{X} \times \mathcal{U} \rightarrow \mathbb{R}$, with the objective of maximizing the tracking information. For agent $i$, its tracking set $\Theta_k^i$ can be regarded as a subset of all targets in the system, $\Theta_k^i \subseteq \chi_k$. The standard objective in MTT is to minimize the tracking error over the system. By design, the information utility function defined in (5) directly reflects this purpose by maximizing the trace of the inverse of the uncertainty matrix over all tracks. In [19] it is shown that, under the assumption of strong connectivity in the sensor network, the estimation error of CI is asymptotically bounded in mean square error. Let $\tilde{\Phi}(\cdot)$ denote the converged estimate.
We assume that for sufficiently large $L$, $\forall i \in [n]$, $|\Phi^i(k) - \tilde{\Phi}(k)| < \epsilon$; with $\epsilon$ sufficiently small this error can be ignored. This implies that the track information of each agent is close enough after sufficiently many consensus steps that each agent may take its local estimate to be the converged estimate of the system, i.e., $\Phi^i(k) \approx \tilde{\Phi}(k)$. Then the one-step reward for the system is

$R(x_k, u_k) = \mathbb{E}_{v_k, w_k}[\tilde{\Phi}(k+1) \mid x_k, u_k]$  (6)

This POMDP contains two challenges for an optimal solution: first, the action domain expands exponentially with the number of agents, which leads to a computational challenge; second, the state transition $\mathcal{T}$ is a continuous stochastic process.
B. Belief-State MDP

Different from a POMDP, the Markov Decision Process (MDP) is fully observable in the state as well as the state transition model, which makes it possible to calculate an optimal policy given a state and a Q-function. To solve the POMDP, it can be transformed into an MDP problem over the belief state $b_k$: in the state definition $x_k = (s_k, \chi_k, \mathcal{F}_k)$, the unobservable part is the target state $\chi_k$, which is represented by the posterior distribution conditioned on the history of actions and observations, i.e., $b_k(x) = P_{\chi_k}(x \mid Z_1, \ldots, Z_k; u_1, \ldots, u_{k-1}) \in \mathcal{B}(\mathcal{X})$, where $\mathcal{B}(\mathcal{X})$ represents the domain of belief states. Then the one-step belief-state reward is defined as

$\tilde{R}(b_k, u_k) = \int R(x, u_k)\, b_k(x)\, dx$  (7)

replacing (6). Under the assumption of perfect data association and a linear Gaussian system, a target's belief state can be represented by a Gaussian distribution such that for target $\chi_i \in \chi_k$, $b_k^{\chi_i} \sim \mathcal{N}(x_{i,k|k}^{\chi_i}, P_{i,k|k}^{\chi_i})$. Based on this, we take the mean and covariance derived from the JPDAF as a sufficient representation of the belief state for the belief-state MDP reward calculations. The belief-state update is then captured using the defined state transition for the fully observable portions, i.e., the agent state, while the target belief-state update is $f_b: \mathcal{B}(\mathcal{X}) \times \mathcal{U} \rightarrow \mathcal{B}(\mathcal{X})$, approximated by the filter state update. With belief state $b_0$, the cumulative belief reward over $N$ decision stages (the horizon) is

$V_N(b_0) = \mathbb{E}\Big\{\sum_{k=0}^{N} \tilde{R}(b_k, u_k) \,\Big|\, b_0\Big\}$  (8)

The policy in a belief-state MDP is a mapping $\mu: \mathcal{B}(\mathcal{X}) \rightarrow \mathcal{U}$ which provides the action given the belief state. In a belief-state MDP, an optimal policy generates the action that maximizes the cumulative reward in (8). Following Bellman's principle, the optimal objective function is interpreted as $V_N^*(b) = \max_u \tilde{R}(b, u) + \mathbb{E}[V_{N-1}^*(b') \mid b, u]$. Thus, we define a Q-value function on the fixed horizon as $Q_N: \mathcal{B}(\mathcal{X}) \times \mathcal{U} \rightarrow \mathbb{R}$, specifically

$Q_N(b, u) = \tilde{R}(b, u) + \mathbb{E}[V_{N-1}^*(b') \mid b, u]$  (9)

If $N$ is fixed as a constant, this problem becomes receding horizon control and the optimal policy $\pi^*$ is defined by

$Q_N(b, u) = \tilde{R}(b, u) + \mathbb{E}[V_{N-1}^*(b') \mid b, u]$  (10a)

$\pi^*(b) = \arg\max_{u \in \mathcal{U}} Q_N(b, u)$  (10b)

where the state $b'$ in (10a) is the random belief state obtained from state $b$ after taking action $u$.
C. Multiagent Rollout Policy

The rollout algorithm is a suboptimal solution in decision making which implements a sequential optimization method over a finite time horizon [22]. Given a simple base policy (typically an easy-to-compute heuristic), the rollout policy is guaranteed to improve upon the base policy. A multiagent version of rollout was introduced recently by Bertsekas in [13], which maintains the policy improvement property under sequential decision making with respect to agents. In the standard one-step lookahead rollout algorithm, from an initial state $x_0 \in \mathcal{X}$ a trajectory $\{x_0, \tilde{u}_0, x_1, \ldots, x_{N-1}, \tilde{u}_{N-1}, x_N\}$ is generated with one-step reward function $g_k(x_k, u_k)$ and reward-to-go function $J_{k,\pi}(x_k)$, and a sequence of actions called the policy $\pi = \{\tilde{u}_0, \ldots, \tilde{u}_{N-1}\}$ is determined by the one-step lookahead maximization

$\tilde{u}_k = \arg\max_{u_k} \mathbb{E}\{g_k(x_k, u_k) + J_{k+1,\pi}(x_{k+1})\}$  (11)

Similarly, the belief-state MDP starts with an initial belief $b_0$ and solves the one-step lookahead maximization as in (11). Propagating the belief into the future is computationally expensive, and the accuracy of the estimates decreases with the length of the horizon. Thus the one-step rollout that we implement here focuses on solving the immediate action $\pi^*(b)$, which is based on an approximate expected reward-to-go (ERTG) for $\mathbb{E}[V_{N-1}^*(b') \mid b, u]$ in (10a), while the policy for future beliefs is the base policy.
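The ERTG approximation can be sketched as a base-policy simulation over the remaining horizon. In the Python sketch below, `step_belief`, `reward`, and `base_policy` are placeholders for the filter prediction, the belief reward (7), and the heuristic (13) described next.

```python
# Approximate reward-to-go of following the base policy for `horizon` steps.
def rollout_value(belief, horizon, base_policy, step_belief, reward):
    total, b = 0.0, belief
    for _ in range(horizon):
        u = base_policy(b)        # heuristic action, Eq. (13)
        total += reward(b, u)     # one-step belief reward, Eq. (7)
        b = step_belief(b, u)     # predicted filter/belief update
    return total
```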
1) Base Policy:
The base policy of our multi-sensor system is decomposed into individual sensors. Denote the base policy of the described system as $\bar{\mu} = \{\bar{\mu}^1, \ldots, \bar{\mu}^n\}$, with each sensor's base policy performing proximal tracking: the sensor moves towards the track of largest uncertainty within distance $d^i$, i.e., the track $t^*$ with

$t^* = \arg\min_{t \in \Theta_k^i} \mathrm{tr}\big(\Omega_{i,k|k}^t\big)$  (12a)

$\|s_k^{i,pos} - x_k^{t^*,pos}\| \leq d^i$  (12b)

For agent $i$, its base policy is

$\bar{\mu}^i(s_k^i) = v\, \frac{x_k^{t^*,pos} - s_k^{i,pos}}{\|x_k^{t^*,pos} - s_k^{i,pos}\|}$  (13)

The consideration of proximity in the base policy enables the sensor to maintain target tracking with the simple heuristic of the distance between the sensor and its worst local tracking estimate. This base policy is specifically motivated by our sensor model with its spatially varying measurement error from (3). By moving towards the track $t^*$ defined in (12), an efficient but myopic single-agent base policy is built.
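A sketch of (12)-(13) for a single agent, with assumed array shapes and placeholder parameter values:

```python
# Base policy: head at speed v toward the proximal track of largest
# uncertainty, i.e., the smallest information-matrix trace.
import numpy as np

def base_policy(agent_pos, track_positions, track_Omegas, v=5.0, d_i=50.0):
    """Return a velocity command (vx, vy) per Eq. (13)."""
    best, best_trace = None, np.inf
    for x_t, Om in zip(track_positions, track_Omegas):
        if np.linalg.norm(x_t - agent_pos) <= d_i:     # proximity (12b)
            if np.trace(Om) < best_trace:              # uncertainty (12a)
                best, best_trace = x_t, np.trace(Om)
    if best is None:
        return np.zeros(2)                             # no proximal track
    direction = best - agent_pos
    return v * direction / np.linalg.norm(direction)
```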
2) Rollout Policy:
The base policy is designed in a distributed, single-agent way without coordination among agents. The lookahead approach applied by the rollout policy implicitly optimizes the agents' maneuvers across the system, inducing coordinated movements. The one-step maximization objective function including the ERTG term $\tilde{V}_{N-1}(b') = \mathbb{E}[V_{N-1}(b') \mid b, \bar{\mu}]$ is

$Q_N(b, u) = \tilde{R}(b, u) + \tilde{V}_{N-1}(b')$  (14a)

$\pi^*(b) = \arg\max_{u \in \mathcal{U}} Q_N(b, u)$  (14b)

The optimal solution $\pi^*(b) = \{u^1, u^2, \ldots, u^n\}$ in (14b) lies in the high-dimensional space $\mathcal{U}$, which brings computational issues to the approach. In the multiagent rollout algorithm, there are two approaches to solving the one-step lookahead optimization: 1) all-agents-at-once solves (14b) in one optimization, i.e., returns a solution in the domain $\mathcal{U}$, which is typically computationally challenging as the dimensionality grows; 2) agent-by-agent initializes the policy $\pi(b)$ with the base policy $\bar{\mu}$ for each agent, then performs $n$ optimizations, the $i$-th optimizing the action of agent $i$, solving the joint action policy in a specified sequence of agents. This agent-by-agent optimization reduces the computation at the price of suboptimality; the sequential procedure is summarized in Algorithm 1, and a sketch in code follows the algorithm.

Algorithm 1 Multi-Sensor Agent-by-Agent Rollout Target Tracking

Require: At time $t = k$: sensor network graph $G = (V, E)$, agent states $s_k = (s_k^1, \ldots, s_k^n) \in \mathcal{S}$, observation set $Z_k = \{Z_k^1, \ldots, Z_k^n\}$.
1. Local sensor update: obtain the local tracking set $\Theta_k^i$ for each sensor $i$.
2. Consensus update with neighboring sensor nodes by (4); every agent obtains the POMDP belief state $b_k \in \mathcal{B}(\mathcal{X})$.
3. Initialize the control $u_k$ with the base policy $\bar{\mu}(b_k) = \{\bar{\mu}^1, \ldots, \bar{\mu}^n\}$ by (13), such that $u_k = \{u_k^i \mid \forall i \in [n], u_k^i = \bar{\mu}^i\}$.
4. Rollout policy solving the one-step optimization:
for $i \in [n]$ do
  $u_k^i = \arg\max_{u_k^i \in \mathcal{U}_k^i} Q_N(b, u_k)$; update $u_k$
end for
return $u_k$
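The following Python sketch covers steps 3-4 of Algorithm 1, pairing the sequential loop with SciPy's differential evolution as the per-agent optimizer mentioned in Section IV; `q_value` is a placeholder for the sampled $Q_N$ of (14a), and the velocity bound is an assumed value.

```python
# Agent-by-agent rollout step: re-optimize each agent's action in sequence,
# holding the others fixed at their current (initially base-policy) values.
import numpy as np
from scipy.optimize import differential_evolution

V_MAX = 5.0  # per-axis velocity bound (m/s), an assumption for this sketch

def q_value(belief, joint_action):
    # Placeholder for the sampled Q_N(b, u) of Eq. (14a).
    return -np.sum(joint_action ** 2)          # toy stand-in

def agent_by_agent_rollout(belief, base_actions):
    u = np.array(base_actions, dtype=float)    # step 3: base policy init
    for i in range(len(u)):
        def obj(a, i=i):                       # minimize -Q over agent i only
            u_try = u.copy()
            u_try[i] = a
            return -q_value(belief, u_try)
        res = differential_evolution(obj, bounds=[(-V_MAX, V_MAX)] * 2)
        u[i] = res.x                           # lock in agent i's action
    return u

u_k = agent_by_agent_rollout(belief=None, base_actions=np.zeros((3, 2)))
```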
IV. SIMULATION RESULT

To validate the aforementioned distributed target tracking rollout policy, a simulation of the multisensor system is constructed, described by the following scenario: 3 UAVs carrying camera sensors are deployed within a parking lot perimeter. The task of the UAVs is to monitor the people walking in this parking lot, and we assume the perception algorithm provides the observation data $Z_k^i$ at a frequency of 5 Hz. To closely simulate human trajectories, the step lengths of the trajectories obey a Lévy walk with a speed interval of [1.0, 3.0] m/s. For this multiagent system, the UAVs are set at different altitudes, which makes the FoVs $\phi_k^i(\cdot)$ and sensing qualities $\alpha^i$ heterogeneous: UAV 1 has a square FoV with the smallest side length and sensing quality $\alpha^1$; UAVs 2 and 3 have progressively larger FoV side lengths, with $\alpha^2$ and $\alpha^3$ proportionally larger (worse) than $\alpha^1$. As the control variable defined in Section II, the UAVs move on horizontal planes at fixed altitude by commanding the velocity, with $v = 5$ m/s defined in the base policy (13) and $d^i$ equal to the diagonal of the FoV to classify the proximal tracks of a certain agent. We assume that when the sensors move, the FoV maintains an orientation consistent with the coordinate system and the same angle through stabilization.

The parking lot map is shown in Fig. 1; here the areas of occlusion (tree cover) provide semantic meaning to the system in the observation law defined in the POMDP. Utilizing such semantic information enables the system to plan based on the environment and increases agent coordination.

Fig. 1: Scenario of 3 UAVs monitoring in a parking lot. The UAV FoVs, denoted as squares, define the initial positions. The vegetation areas, solid green blocks, occlude camera detection. The target trajectories are the grey gradient lines from dark (start) to light (end).

The control frequency for the system is 1 Hz, at which the rollout policy runs for all agents. For a sufficient Q-value approximation in (14), the belief states of the target positions are represented by Monte Carlo sampling [23]. In each control iteration, 50 samples are drawn from the Gaussian beliefs of the targets to formulate the possible trajectories over the $N$-second rollout horizon. Solving the one-step lookahead optimization of the systemwise decision in (14b), as well as per agent in Algorithm 1, is accomplished through numerical optimization, specifically Differential Evolution.
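The Monte Carlo approximation of the expected utility can be sketched as follows; `utility` stands in for the information reward evaluated along the sampled target positions, and the 50-sample count follows the text.

```python
# Monte Carlo estimate of the expected utility over Gaussian track beliefs.
import numpy as np

rng = np.random.default_rng(0)

def sampled_expected_utility(track_means, track_covs, utility, n_samples=50):
    total = 0.0
    for _ in range(n_samples):
        sample = [rng.multivariate_normal(m, P)     # one draw per track belief
                  for m, P in zip(track_means, track_covs)]
        total += utility(sample)
    return total / n_samples
```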
A typical control result of agent-by-agent rollout is presented in Fig. 2. Cooperative behavior is observed between UAVs 1 and 2 when Target 1 moves through the bottom green occlusion (between $t = 5$ and $t = 15$): UAV 1 moves to the bottom left to continue tracking Target 2 while UAV 2 starts tracking Target 1. A similar behavior happens again between UAVs 2 and 3 when Target 1 moves through the top occlusion area (between $t = 20$ and $t = 30$). UAV 2 maintains Target 3 and Target 4, but UAV 3, which initially kept track of Target 4, leaves Target 4 to be tracked by UAV 2 and transitions focus to the track of Target 1.

Fig. 2: An example tracking result from agent-by-agent rollout; sensor trajectories are dotted lines with corresponding FoVs.

The performance of the target tracking task is measured by the Optimal Subpattern Assignment (OSPA) [24] metric in Fig. 3. These results are all generated based on the scenario depicted in Fig. 1. Specifically, three algorithms are investigated: an agent-by-agent greedy algorithm ($N = 1$), which explicitly optimizes the one-step belief-state reward (7), and the two rollout policies, namely all-agents-at-once and agent-by-agent. Two major conclusions can be drawn from Fig. 3. First, the reader can observe that from $t = 15$ to $t = 35$ the performance of the greedy algorithm suffers. This anomaly arises from the myopic policy's lack of consideration for the future impact of current action choices and is especially apparent when the UAVs need to perform target hand-offs, changing focus from one set of targets to another. Both of the rollout algorithms with extended horizons, $N = 5$, account for the future trajectory shifts of the targets and the future movements of the cooperative fleet. The second realization is the minimal performance differentiation between the two rollout algorithms. This result shows that the sequential distributed control decision paradigm is a viable replacement for centralized control. More specifically, this demonstrates that the advantages of the control space dimension reduction from exponential to linear with respect to the number of agents [13] can be employed in a model with partially observable states, e.g., the task of target tracking. In the context of implementation, this also yields the advantage of allowing each agent to compute its own optimized control, distributing the computational load evenly across all agents.
V. CONCLUSION

A multiagent (multi-sensor) target tracking problem is formulated as a POMDP. The control policy optimization, based on receding horizon rollout, was implemented on an agent-by-agent basis. This sequential decision making augmentation, the multiagent rollout introduced in [13], is compared to a standard, all-agents-at-once, rollout policy generation. The sequential implementation showed similar performance when implemented in a distributed tracking system where information is maintained on a per-agent basis and assimilated across the system through consensus-based JPDA.

Fig. 3: Target tracking performance in the OSPA metric, a comparison between the agent-by-agent (distributed) and all-agents-at-once (centralized) rollout policies. A total of 50 trials are run in the simulation of 40 decision epochs, with the mean and 95% confidence interval of the OSPA metric plotted. The OSPA parameters are $c = 40$, $p = 2$, following the terminology in [24]. This encapsulates both tracking error and target cardinality estimates with respect to ground truth.

With these initial results in hand, further investigations will include applications considering limited communication range and/or bandwidth and extending the heterogeneity level of the agents. Beyond viable scenario extensions for real-world considerations, the simulations herein have shown lackluster computational efficiency, which leads to further extension of this sequential methodology either through the avenues of reinforcement learning alluded to in [13] or through approximate dynamic programming techniques which follow the vein of rollout and receding horizon control.

REFERENCES

[1] F. Zhao, J. Shin, and J. Reich, "Information-driven dynamic sensor collaboration," IEEE Signal Processing Magazine, vol. 19, no. 2, pp. 61–72, 2002.
[2] A. C. Bellini, W. Lu, R. Naldi, and S. Ferrari, "Information driven path planning and control for collaborative aerial robotic sensors using artificial potential functions," pp. 590–597, IEEE, 2014.
[3] L. Paull, S. Saeedi, M. Seto, and H. Li, "Sensor-driven online coverage planning for autonomous underwater vehicles," IEEE/ASME Transactions on Mechatronics, vol. 18, no. 6, pp. 1827–1838, 2012.
[4] L. Xiao, S. Boyd, and S. Lall, "A scheme for robust distributed sensor fusion based on average consensus," in IPSN 2005. Fourth International Symposium on Information Processing in Sensor Networks, pp. 63–70, IEEE, 2005.
[5] B. Siciliano and O. Khatib, Springer Handbook of Robotics. Springer, 2016.
[6] J. Cortes, S. Martinez, T. Karatas, and F. Bullo, "Coverage control for mobile sensing networks," IEEE Transactions on Robotics and Automation, vol. 20, no. 2, pp. 243–255, 2004.
[7] Z.-Q. Luo, "Universal decentralized estimation in a bandwidth constrained sensor network," IEEE Transactions on Information Theory, vol. 51, no. 6, pp. 2210–2219, 2005.
[8] R. Olfati-Saber, "Distributed tracking for mobile sensor networks with information-driven mobility," pp. 4606–4612, IEEE, 2007.
[9] B. Ristic and A. Skvortsov, "Intermittent information-driven multi-agent area-restricted search," Entropy, vol. 22, no. 6, p. 635, 2020.
[10] K. Zhou and S. I. Roumeliotis, "Multirobot active target tracking with combinations of relative observations," IEEE Transactions on Robotics, vol. 27, no. 4, pp. 678–695, 2011.
[11] R. K. Ramachandran, N. Fronda, and G. S. Sukhatme, "Resilience in multi-robot target tracking through reconfiguration," pp. 4551–4557, IEEE, 2020.
[12] E. D. Demaine, S. P. Fekete, and R. J. Lang, "Circle packing for origami design is hard," arXiv preprint arXiv:1008.1224, 2010.
[13] D. Bertsekas, "Multiagent rollout algorithms and reinforcement learning," arXiv preprint arXiv:1910.00120, 2019.
[14] B. D. Anderson and J. B. Moore, Optimal Filtering. Courier Corporation, 2012.
[15] L. W. Krakow, C. M. Eaton, and E. K. Chong, "Simultaneous non-myopic optimization of UAV guidance and camera gimbal control for target tracking," pp. 349–354, IEEE, 2018.
[16] N. F. Sandell and R. Olfati-Saber, "Distributed data association for multi-target tracking in sensor networks," pp. 1085–1090, IEEE, 2008.
[17] T. Fortmann, Y. Bar-Shalom, and M. Scheffe, "Sonar tracking of multiple targets using joint probabilistic data association," IEEE Journal of Oceanic Engineering, vol. 8, no. 3, pp. 173–184, 1983.
[18] G. Battistelli, L. Chisci, G. Mugnai, A. Farina, and A. Graziano, "Consensus-based linear and nonlinear filtering," IEEE Transactions on Automatic Control, vol. 60, no. 5, pp. 1410–1415, 2014.
[19] G. Battistelli and L. Chisci, "Kullback–Leibler average, consensus on probability densities, and distributed state estimation with guaranteed stability," Automatica, vol. 50, no. 3, pp. 707–718, 2014.
[20] M. Chu, H. Haussecker, and F. Zhao, "Scalable information-driven sensor querying and routing for ad hoc heterogeneous sensor networks," The International Journal of High Performance Computing Applications, vol. 16, no. 3, pp. 293–313, 2002.
[21] K. Zheng and A. Pronobis, "From pixels to buildings: End-to-end probabilistic deep networks for large-scale semantic mapping," arXiv preprint arXiv:1812.11866, 2018.
[22] D. Bertsekas, "Rollout algorithms for discrete optimization: A survey," in Handbook of Combinatorial Optimization, D.-Z. Du and P. Pardalos, Eds. Springer, 2010.
[23] Y. He and E. K. Chong, "Sensor scheduling for target tracking: A Monte Carlo sampling approach," Digital Signal Processing, vol. 16, no. 5, pp. 533–545, 2006.
[24] D. Schuhmacher, B.-T. Vo, and B.-N. Vo, "A consistent metric for performance evaluation of multi-object filters," IEEE Transactions on Signal Processing, vol. 56, no. 8, pp. 3447–3457, 2008.