Decentralized Game-Theoretic Control for Dynamic Task Allocation Problems for Multi-Agent Systems
Efstathios Bakolas Yoonjae Lee
Abstract
We propose a decentralized game-theoretic framework for dynamic task allocation problems for multi-agent systems. In our problem formulation, the agents' utilities depend on both the rewards and the costs associated with the successful completion of the tasks assigned to them. The rewards reflect how likely it is for the agents to accomplish their assigned tasks, whereas the costs reflect the effort needed to complete these tasks (this effort is determined by the solution of corresponding optimal control problems). The task allocation problem considered herein corresponds to a dynamic game whose solution depends on the states of the agents, in contrast with classic static (or single-act) game formulations. We propose a greedy solution approach in which the agents negotiate with each other to find a mutually agreeable (or individually rational) task assignment profile based on evaluations of the task utilities that reflect their current states. We illustrate the main ideas of this work by means of extensive numerical simulations.
I. INTRODUCTION
We consider a dynamic task allocation problem for a multi-agent system whose agents have continuous state and input spaces and have to complete a set of spatially distributed tasks (e.g., obtain in-situ measurements or pick up packages from different locations over a given spatial domain). We adopt a game-theoretic approach which seeks task assignments that maximize the individual utility of each agent conditional on the assignments of their teammates (individual rationality principle) while also ensuring that the self-interests of the agents are aligned with those of the team. To this aim, we design the agents' utilities in accordance with the concept of the wonderful life utility (WLU) [1], which allows us to associate the dynamic task allocation problem with a sequence of potential games [2]. We propose a greedy decentralized algorithm for the computation of task assignment profiles which are mutually agreeable in the long run.
Literature review:
Task allocation problems for multi-agent systems can be addressed by auction-based techniques, distributed and/or multi-objective optimization, and game-theoretic methods. The auction-based techniques are centralized when the agents negotiate with each other under the guidance of an auctioneer [3] and decentralized when they negotiate directly with each other [4]–[6]. Centralized methods rely on a single point of failure, whereas the communication cost in decentralized methods can be substantial if not prohibitive. Distributed optimization [7] and multi-objective optimization [8] approaches to task allocation problems are typically quite complex, require some knowledge about the utilities of the other agents and, more importantly, do not necessarily yield solutions which are mutually agreeable. As suggested in [9], game-theoretic tools constitute one of the most natural approaches to task allocation problems for intelligent, autonomous agents. Reference [9], which is the main inspiration of this paper, utilizes the framework of potential games to define in a systematic way the task and agent utilities as well as several negotiation protocols (game-theoretic learning algorithms [10], [11]) for the computation of mutually agreeable task assignments in a decentralized or distributed way. The negotiation protocols utilized in [9] converge to mutually agreeable task assignment profiles without requiring that any agent know the utility functions of her teammates (decentralized task allocation). However, their convergence is conditional on the game remaining the same (e.g., the functional description of the utilities does not change throughout the negotiation process). Thus, although the equilibrium of the game is found iteratively in [9], the task allocation problem itself is essentially modeled as a static game. Extensions of the game-theoretic framework for multi-agent control problems can be found in [12]–[14]. Reference [15] proposes a myopic solution approach to a dynamic task allocation problem modeled as a sequence of static (single-act) potential games. The approach in [15] cannot handle state-dependent utilities in general. Finally, the framework of state-dependent potential games [16] is only applicable to problems with finite (discrete) state spaces.
Contributions:
In this paper, we address a dynamic task allocation problem in which the task utilities depend on both the rewards earned by the agents for completing their assigned tasks and the costs they incur while doing so (cost-to-go functions of corresponding optimal control problems). Consequently, the utilities are in general state-dependent. We adopt a decentralized game-theoretic solution approach (each agent knows only her own utility function). The (individual) agent utilities are designed in accordance with the WLU framework, which ensures that their self-interests are aligned with the team's interests under the framework of potential games. We propose a greedy solution approach in which the negotiations between the agents take place on-the-fly while the agents move in their state space towards their assigned tasks. Every time an agent changes her individual assignment (and thus her final state destination), she has to update the estimate of the cost-to-go and consequently her utility function as well (state-dependent utilities). We design the negotiation process such that the agents compute a mutually agreeable profile which is not likely to change during the last phase of the process.

This work was supported in part by ARL under W911NF2020085. E. Bakolas (Associate Professor) and Y. Lee (graduate student) are with the Department of Aerospace Engineering and Engineering Mechanics, The University of Texas at Austin, Austin, Texas 78712-1221, USA. Emails: [email protected]; [email protected].

Outline:
The rest of the paper is organized as follows. In Section II, we discuss the problem preliminaries. The task, team and agent utilities are defined in Section III. The open-loop task allocation problem is addressed in Section IV and the dynamic problem in Section V. Numerical simulations are presented in Section VI. Finally, Section VII presents concluding remarks and directions for future work.

II. PRELIMINARIES AND PROBLEM SETUP
Notation:
We denote by R^n the set of n-dimensional real vectors. We denote by Z the set of integers. Given a, b ∈ Z with a ≤ b, we denote by [a, b]_d the discrete interval from a to b, that is, [a, b]_d := [a, b] ∩ Z. Given k ∈ Z, we write Z_{≥k} to denote the (unbounded) discrete interval [k, ∞) ∩ Z. Given a finite set A, we denote by card(A) its cardinality.

Problem setup:
We consider a multi-agent system (MAS) comprised of n agents. We denote by x_i ∈ S_i ⊆ Σ and u_i ∈ U_i, for i ∈ [1, n]_d, the state and input of the i-th agent of the MAS at time t ≥ 0, where S_i and U_i denote her state space and input space, respectively, and Σ ⊆ R^m. In addition, we denote by x ∈ S the joint state of the MAS, where x := (x_1, ..., x_n) and S := S_1 × ··· × S_n (joint state space), and by u ∈ U the joint input of the MAS, where u := (u_1, ..., u_n) and U := U_1 × ··· × U_n (joint input space). Furthermore, we denote by x_{−i} ∈ S_{−i} and u_{−i} ∈ U_{−i} the concatenations of the states and the inputs of all the agents except the i-th agent (the sets S_{−i} and U_{−i} are defined accordingly). The motion of the i-th agent is described by

ẋ_i = f_i(x, u),  x_i(0) = x_i^0,  i ∈ [1, n]_d,  (1)

where x_i^0 ∈ S_i is the initial state of the i-th agent and f_i : S × U → S_i is her associated vector field. Note that the evolution of the i-th agent is not fully determined by her own state and input. For instance, in any realistic setting, the input of every agent at each time is conditioned on the actions of the other agents or at least a subset of them. A similar argument can be made for the states of the other agents. We assume that the vector field f_i satisfies regularity conditions that ensure the existence and uniqueness of solutions to the differential equations (1) for all piecewise continuous joint inputs u taking values in U and all joint states x ∈ S. Finally, we write

ẋ = f(x, u),  x(0) = x^0,  (2)

where x^0 = (x_1^0, ..., x_n^0) ∈ S is the joint initial state and f := (f_1, ..., f_n) is the joint vector field.

The task allocation problem seeks individual assignments for a team of n agents and for a given set of p tasks, T := {T_1, ..., T_p}. Each task is associated with a distinct state in Σ. We denote by X_T the set of states associated with the given tasks, where X_T := {x_{T_1}, ..., x_{T_p}}.
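As a concrete instance of the joint dynamics (1)–(2), the sketch below integrates a small team of single-integrator agents (ẋ_i = u_i) with a forward Euler scheme; the proportional steering law and all numerical values are assumptions made purely for illustration.

```python
import numpy as np

# Hypothetical instance of the joint dynamics (2): n single-integrator
# agents with x_i' = u_i, integrated by forward Euler. The steering law
# and all values below are illustrative assumptions, not the paper's model.
n, m = 3, 2                      # number of agents, workspace dimension
dt, T = 0.01, 1.0                # step size and horizon
x = np.zeros((n, m))             # joint initial state x^0
x_tasks = np.array([[1.0, 0.0],  # states x_Tj of the assigned tasks
                    [0.0, 1.0],
                    [1.0, 1.0]])

def u(x):
    # simple proportional law steering each agent toward its assigned task
    return x_tasks - x

for _ in range(int(T / dt)):     # Euler steps of x' = f(x, u)
    x = x + dt * u(x)

# each agent has covered roughly a 1 - exp(-T) fraction of its gap
print(x)
```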
In principle, an agent can be assigned at most one task in T at each instant of time, although more than one agent can be assigned to the same task simultaneously. We denote by A_i := {a_i^k : k ∈ [1, card(A_i)]_d} the set of possible task assignments for the i-th agent given a set of tasks T. Later on, we will see that each assignment a_i^k induces a corresponding (admissible) control input u_i(·) via the solution of a corresponding optimal control problem. We assume that a_∅ ∈ A_i, where a_∅ denotes the null assignment (i.e., the i-th agent is not assigned to any task), which corresponds to the null control input; that is, when a_i = a_∅, then u_i(t) = 0 for all t ≥ 0. Each assignment a_i^k ∈ A_i is equal to either a task in T, that is, a_i^k = T_ℓ where T_ℓ ∈ T, or the null assignment, that is, a_i^k = a_∅. Thus, A_i ⊆ T̄, where T̄ := T ∪ {a_∅}.

III. TASK UTILITIES
The completion of a task T_j ∈ T will accrue rewards to the agents assigned to it. These rewards, which do not depend on the states of the agents, reflect the importance of this specific task as well as the likelihood of its successful completion by each agent assigned to it (in general, not all agents are equally likely to complete a specific task successfully). We will refer to these rewards as the static task utilities. Furthermore, an agent will have to incur a cost to complete her assigned task (e.g., the transition cost to a certain location associated with this task). The task completion cost is state-dependent, and we will refer to it as the dynamic task completion cost.

Static task utility: Given an action profile a = (a_1, ..., a_n), we denote by T_j^−(a) the index set corresponding to the agents assigned to task T_j ∈ T under the particular profile, that is, T_j^−(a) = {i ∈ [1, n]_d : a_i = T_j}. The completion of task T_j will accrue a reward r_{T_j} ≥ 0 to the agent or agents assigned to this task. In general, r_{T_j} is a function of the task assignment profile, that is,

r_{T_j}(a) = r̄_{T_j} [1 − ∏_{i ∈ T_j^−(a)} (1 − p_{ij})],  (3)

where r̄_{T_j} is the nominal reward of T_j and p_{ij} ∈ [0, 1] is the probability that task T_j is completed successfully by the i-th agent. If T_j^−(a) = ∅, then r_{T_j}(a) := 0.

State-dependent task completion cost:
Next, we define the cost for completing the task T_j, which is associated with the state x_{T_j}, by the i-th agent at time t = t_f. Essentially, the task completion cost is taken to be the cost incurred by the i-th agent, which starts from the state x_i at time t = 0, to reach the state x_{T_j} at time t = t_f. The latter state transition cost is defined as the optimal cost-to-go of the following optimal control problem:

Problem 1:
Let a_i = T_j, where T_j ∈ T and i ∈ [1, n]_d. Furthermore, let x_{T_j} ∈ S_i be the state associated with the task T_j and let t_f > 0 be the corresponding completion time (fixed and common for all the agents). Then, find an optimal piecewise continuous input u_i^⋆(·) : [0, t_f] → U_i that minimizes the following performance index:

J_i(u_i(·); x_i, x_{T_j}) := ∫_0^{t_f} L_i(x_i(t), u_i(t)) dt,  (4)

subject to the dynamic constraints (1) and the terminal constraint Ψ_i(x_i(t_f), x_{T_j}) = 0, where Ψ_i(·; x_{T_j}) is a given C^1 function. Finally, the optimal cost-to-go is denoted by ρ_i(x_i; x_{T_j}), where ρ_i(x_i; x_{T_j}) := J_i(u_i^⋆(·); x_i, x_{T_j}).

Remark 1
The terminal constraint function Ψ_i can be defined, for instance, as Ψ_i(x_i(t_f), x_{T_j}) = x_i(t_f) − x_{T_j}, in which case we require that x_i(t_f) = x_{T_j} (hard constraint).

Total Task Utility:
The total cost of completion of task T_j under the action profile a = (a_1, ..., a_n), which is denoted as R_{T_j}(a; x, x_{T_j}), is defined as the sum of the individual task completion costs of all the agents assigned to that task. More precisely,

R_{T_j}(a; x, x_{T_j}) := ∑_{i ∈ T_j^−(a)} ρ_i(x_i; x_{T_j}).  (5)

Note that R_{T_j} depends on the initial state x (more precisely, the initial states of the agents assigned to the task T_j). Furthermore, the total task utility associated with task T_j for a given x is denoted as U_{T_j}(a; x) and defined as follows:

U_{T_j}(a; x) := max{0, r_{T_j}(a) − R_{T_j}(a; x, x_{T_j})}.  (6)

Note that U_{T_j} is state-dependent because the task completion costs ρ_i(x_i; x_{T_j}), for i ∈ T_j^−(a), are state-dependent.

Individual and Team Utilities and Solution Concepts:
First, we define the team's utility (the latter reflects the team's collective welfare), which is denoted by U(a; x), as follows:

U(a; x) := ∑_{T_j ∈ T} U_{T_j}(a; x).  (7)

The individual utility of the i-th agent given a task profile a = (a_i, a_{−i}), which is denoted as U_i(a_i, a_{−i}) or U_i(a), is taken to be equal to her marginal contribution to the team's utility U(a; x), that is,

U_i(a; x) := U((a_i, a_{−i}); x) − U((a_∅, a_{−i}); x),  (8)

from which it follows, in view of (7), that, when a_i = T_j,

U_i(a; x) = U_{T_j}((a_i, a_{−i}); x) − U_{T_j}((a_∅, a_{−i}); x),  (9)

where (a_∅, a_{−i}) corresponds to the action profile in which the i-th agent has a null assignment, that is, a_i = a_∅. Next, we provide the definition of the basic solution concept that will be used in our task allocation problem.

Definition 1:
An assignment profile a^⋆ := (a_i^⋆, a_{−i}^⋆) ∈ A is a pure strategy Nash equilibrium of the game G, where G := ⟨U_1(a; x), ..., U_n(a; x); A⟩, if for all i ∈ [1, n]_d,

U_i(a_i^⋆, a_{−i}^⋆; x) ≥ U_i(a_i, a_{−i}^⋆; x),  ∀ a_i ∈ A_i.  (10)

Remark 2
The solution concept of (pure strategy) Nash equilibrium is fundamental in non-cooperative game theory. When all agents play in accordance with a Nash equilibrium, they act selfishly and try to maximize their own utilities conditional on the decisions of the others (individual rationality).

IV. THE OPEN-LOOP TASK ALLOCATION PROBLEM AND ITS DECENTRALIZED SOLUTION
A. Problem formulation and analysis
Next, we formulate the task allocation problem as a non-cooperative game. In the following formulation, we only account for the estimates of the task completion costs at the initial time (open-loop approach).
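Before stating the problem, the utility design of Section III can be illustrated with a small numerical sketch. The sizes, nominal rewards, success probabilities, and stand-in completion costs below are made-up assumptions (in the paper, the costs ρ_i come from Problem 1); the brute-force loop checks the exact-potential identity (11) with the team utility (7) as the potential, which Proposition 1 establishes formally.

```python
import itertools
import random

random.seed(0)
n, p = 3, 2                    # toy numbers of agents and tasks (illustrative)
NULL = -1                      # stands for the null assignment
r_bar = [5.0, 4.0]             # made-up nominal rewards of the tasks
prob = [[random.uniform(0.3, 0.9) for _ in range(p)] for _ in range(n)]
cost = [[random.uniform(0.1, 1.0) for _ in range(p)] for _ in range(n)]
# cost[i][j] stands in for the cost-to-go rho_i(x_i; x_Tj) of Problem 1

def task_utility(a, j):
    """U_Tj from (6): max{0, r_Tj(a) - R_Tj(a)}, with r_Tj from (3)."""
    assigned = [i for i in range(n) if a[i] == j]
    if not assigned:
        return 0.0
    miss = 1.0
    for i in assigned:
        miss *= 1.0 - prob[i][j]
    reward = r_bar[j] * (1.0 - miss)
    return max(0.0, reward - sum(cost[i][j] for i in assigned))

def team_utility(a):
    """Team utility (7): sum of the task utilities."""
    return sum(task_utility(a, j) for j in range(p))

def individual_utility(a, i):
    """WLU (8): marginal contribution of agent i to the team utility."""
    a_null = list(a); a_null[i] = NULL
    return team_utility(a) - team_utility(tuple(a_null))

# brute-force check of the exact-potential identity (11) with P = U:
actions = [NULL] + list(range(p))
for a in itertools.product(actions, repeat=n):
    for i in range(n):
        for ai_new in actions:
            b = list(a); b[i] = ai_new
            b = tuple(b)
            lhs = individual_utility(b, i) - individual_utility(a, i)
            assert abs(lhs - (team_utility(b) - team_utility(a))) < 1e-9
print("exact-potential identity (11) holds on this instance")
```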
Problem 2 (OLTA: Open-Loop Task Allocation):
Let t_f > 0 and x^0 ∈ S be given. Then, find a (time-invariant) task assignment profile a^⋆ ∈ A, where a^⋆ := (a_i^⋆, a_{−i}^⋆), such that for all i ∈ [1, n]_d the inequality (10) is satisfied. In other words, the task assignment profile a^⋆ := (a_i^⋆, a_{−i}^⋆) corresponds to a Nash equilibrium of the game G.

Remark 3
In the formulation of the open-loop task allocation problem (Problem 2), the agents' utilities (or, more precisely, their functional descriptions) do not change with time as the agents progress towards the states of their assigned tasks. This is because their estimated task completion costs are based on knowledge available at time t = 0 and these estimates are not updated afterwards. To the i-th individual task assignment a_i^⋆ from the optimal profile a^⋆, where, say, a_i^⋆ = T_j, we associate a corresponding state x_{T_j}, which in turn determines the terminal constraint Ψ_i(x_i(t_f), x_{T_j}) = 0 in Problem 1. Because all the task assignments are time-invariant, the control input u_i^⋆(·) that solves Problem 1 will not be updated along the i-th agent's ensuing trajectory.

It is well known that potential games correspond to a special class of non-cooperative games that always admit pure strategy Nash equilibria. We claim that Problem 2 corresponds to an exact potential game [17].

Definition 2:
The game G corresponds to an exact potential game if there exists an exact potential, that is, a function P : A → R such that, for all i ∈ [1, n]_d and a ∈ A, it holds true that

U_i(a_i', a_{−i}; x) − U_i(a_i, a_{−i}; x) = P(a_i', a_{−i}) − P(a_i, a_{−i}),  (11)

for all a_i, a_i' ∈ A_i.

Proposition 1:
The open-loop task allocation problem (Problem 2), with the team utility and the individual utilities defined by (7) and (8), respectively, is an exact potential game with potential P(a) = U(a; x).

Proof:
We will directly verify (11). In particular,

U(a_i', a_{−i}; x) − U(a_i, a_{−i}; x)
= [U((a_i', a_{−i}); x) − U((a_∅, a_{−i}); x)] − [U((a_i, a_{−i}); x) − U((a_∅, a_{−i}); x)]
= U_i(a_i', a_{−i}; x) − U_i(a_i, a_{−i}; x),

where in the derivation of the last equality we have used Eq. (8). In view of Definition 2, we conclude that G corresponds to an exact potential game, with the team utility U(·; x) serving as the potential.

B. Negotiation protocols for decentralized task allocation
Problem 2 can be solved by utilizing standard tools for the computation of Nash equilibria of non-cooperative (static or single-act) games and, in particular, potential games [17]. An alternative approach is to employ game-theoretic learning algorithms which generate a sequence of task assignment profiles that converge to a Nash equilibrium. Some of these algorithms include the fictitious play (FP), spatial adaptive play (SAP), and generalized regret matching (GRM) algorithms, to name but a few; the reader may refer to [9] for more information on these and other similar algorithms. A key point is that for their realization, an agent does not have to know the utilities of her teammates (decentralized implementation).

During the negotiation (learning) process, the task assignment profile of the team is updated at different time instants that form a non-decreasing sequence {τ_k}_{k=0}^∞ in [0, t_f] such that τ_0 = 0 and lim_{k→∞} τ_k = t_f. The time instant τ_k corresponds to the k-th stage of the negotiation process. At that stage, the i-th agent picks her new task assignment, which we denote as a_i(τ_k); we also denote the corresponding profile of the whole team as a(τ_k). The exact definition of a_i(τ_k) will be determined by the particular learning algorithm that will be employed, which in turn will rely on a corresponding information set I_k^i. The latter set may encode information about the past performance of the i-th agent (measured in terms of past values of her own utility) as well as information about the history (whole or truncated) of her teammates' actions (such information may correspond to, for instance, the empirical distribution of the agents' past task assignments). We write

a_i(τ_k) = φ_i(I_k^i; G),  a(τ_k) = φ(I_k; G),  i ∈ [1, n]_d,  (12)

where φ_i : I_k^i → A_i, for i ∈ [1, n]_d, is the update law (or proposal) of the individual task assignment of the i-th agent, whereas φ : I_k → A, with φ(I_k; G) := (φ_1(I_k^1; G), ..., φ_n(I_k^n; G)), is the update law of the task assignment profile of the MAS given the joint information set I_k := I_k^1 × ··· × I_k^n. The following claim is based on the analysis provided in [9] (refer to, for instance, Theorem 4.1) and references therein.

Claim 1:
The update law a(τ_k), which is defined as in (12) and corresponds to one of the decentralized negotiation protocols (game-theoretic learning algorithms) used in [9], will converge to a pure strategy Nash equilibrium of the game G.

It is important to emphasize that this update law for the task assignment profile will solve Problem 2 under the assumption that throughout the interval [0, t_f], the functional description of the utilities will be based on their initial estimates at time t = 0. However, the agents' utilities are state-dependent and thus their functional description will change along the agents' ensuing trajectories. This variability in the agents' preferences and capabilities (as reflected in their utilities) cannot be captured by this update law, or by the OLTA problem itself. An alternative interpretation of the negotiation process is to assume that it does not take place over the time interval [0, t_f] but instantaneously, at time t = 0. In other words, the clock is paused until the negotiations have converged (within some acceptable tolerance) to a Nash equilibrium of the potential game G. Subsequently, the agents can execute the corresponding inputs that will transfer them to the terminal states associated with their assigned tasks (these inputs are computed by solving Problem 1 for each agent). The input signal will remain the same function of time for all t ∈ [0, t_f]. It is worth mentioning that the game-theoretic learning algorithms can be implemented based on local information (distributed implementation) by requiring that an agent cannot be assigned a task which is not within a certain range ϱ > 0 from her (range constrained case), in contrast with the nominal (range unconstrained) case in which ϱ → ∞.

V. DYNAMIC TASK ALLOCATION AND A GREEDY ALGORITHM FOR ITS SOLUTION
A. Problem formulation
Next, we formulate a dynamic version of the task allocation problem in which the fact that the agents' utilities change along their ensuing trajectories is accounted for in the determination of their task assignments, in contrast with the OLTA problem. In this problem formulation, a new game G_t, where G_t := ⟨U_1(·; x(t)), ..., U_n(·; x(t)); A⟩, is essentially obtained at each t ∈ [0, t_f] as the agents move in their state space.

Problem 3 (DTA: Dynamic Task Allocation):
Let t_f > 0 and x^0 ∈ S be given. Then, find a time-varying task assignment profile a^⋆(·) : [0, t_f] → A, where a^⋆(t) := (a_i^⋆(t), a_{−i}^⋆(t)), which is such that for all i ∈ [1, n]_d,

U_i(a_i^⋆(t), a_{−i}^⋆(t); x(t)) ≥ U_i(a_i(t), a_{−i}^⋆(t); x(t)),  for all a_i(t) ∈ A_i,

as t → t_f. In other words, the task assignment profile a^⋆(t) converges to a Nash equilibrium of the game G_t as t → t_f.

Remark 4
Problem 3 seeks a task assignment profile that will become mutually agreeable as t approaches the final time t_f. Note that if t_f is taken to be sufficiently large, then the solution to Problem 3 will essentially converge to a steady-state task assignment profile.

B. A greedy algorithm for task allocation
Next, we propose a greedy solution approach to address Problem 3. To ensure that the game-theoretic learning algorithms discussed in Section IV-B will converge to a Nash equilibrium as t → t_f, we propose to stop updating the agents' utilities at time t = t_f − ε, for some 0 < ε < t_f, so that the utilized learning algorithm (whose convergence is guaranteed only for a static game) is given the chance to converge during the sub-interval [t_f − ε, t_f]. A key difference between the DTA and OLTA problems is that the static game that determines the task assignment profile in the former is not G (the game corresponding to time t = 0), as in the latter problem, but a game corresponding to time t = t_f − ε, assuming that the (dynamic) game has evolved for t ∈ [0, t_f − ε].

Next, we present the main steps of the proposed greedy algorithm. To this aim, let us consider a sequence {τ_k}_{k=0}^∞ as in Section IV-B and let K = K(ε) be the first positive integer at which τ_K > t_f − ε for the given ε (the existence of such a K is guaranteed by the fact that τ_k → t_f as k → ∞). Now, let φ(I_k; G_{τ_k}) denote the update law of a negotiation protocol as in Section IV-B. Let us consider the following update law:

a_d(τ_k) := φ(I_k; G_{τ_k}) for k ∈ [0, K − 1]_d,  a_d(τ_k) := φ(I_k; G_{τ_K}) for k ∈ Z_{≥K}.  (13)

We claim that the update law (13) will find an approximate (in a sense that we will explain shortly) solution to Problem 3.

Proposition 2:
The piecewise constant dynamic task assignment profile a(t) = a_d(τ_k), for all t ∈ [τ_k, τ_{k+1}) and all k ∈ Z_{≥0}, where a_d is defined as in (13), will converge, as t → t_f, to a Nash equilibrium of the game G_T, where T ∈ [t_f − ε, t_f].

Proof:
Given that {τ_k}_{k=0}^∞ is a non-decreasing sequence in [0, t_f] which converges to t_f as k → ∞, we conclude that τ_k ∈ [t_f − ε, t_f] for all k ∈ Z_{≥K}. After truncating the first K elements of {τ_k}_{k=0}^∞, we obtain a new non-decreasing sequence {σ_n}_{n=0}^∞, where σ_n = τ_{n+K} for n ∈ Z_{≥0}, which implies that σ_0 = τ_K and lim_{n→∞} σ_n = t_f. In view of Claim 1, the update law a(τ_k) := φ(I_k; G) defined in (12) will converge to a Nash equilibrium of the game G as k → ∞. From real analysis, we know that after truncating the first K elements of the convergent sequence {a(τ_k)}_{k=0}^∞, we obtain a new sequence {a(σ_n)}_{n=0}^∞ that will remain convergent with the same limit. The previous claim on convergence holds true for any game G_{τ_k} for a fixed k when the latter is treated as a static game (with a possibly different limit for each τ_k). We conclude that the sequence {a_d(σ_n)}_{n=0}^∞, where a_d is defined in (13), will also converge to a Nash equilibrium of the game G_{τ_K}, where by definition τ_K ∈ [t_f − ε, t_f]. This concludes the proof.

Remark 5
If at time t = τ_k the individual assignment of the i-th agent attains a different value than at t = τ_{k−1}, then the state corresponding to her new task will also be different. Therefore, the i-th agent will have to solve Problem 1 with the updated terminal constraint, with her initial state set equal to x_i(τ_k) and the final time to t_f − τ_k.

VI. NUMERICAL SIMULATIONS
In this section, we present numerical simulations to illustrate the main ideas of the methods proposed so far. We consider a team of agents with double integrator dynamics, that is, p̈_i = u_i, with p_i(0) = p_i^0 and ṗ_i(0) = v_i^0, where p_i ∈ R^2 (p_i^0 ∈ R^2) and ṗ_i ∈ R^2 (v_i^0 ∈ R^2) denote, respectively, the position and velocity of the i-th agent at time t (t = 0), i ∈ [1, n]_d. The performance index is given by J(u_i(·)) := (1/2) ∫_0^{t_f} |u_i(t)|^2 dt, whereas the terminal constraint function is Ψ_i(x_i(t_f); x_{T_j}) := x_i(t_f) − x_{T_j}, where x_i := (p_i, ṗ_i) ∈ R^4 and x_{T_j} := (p_{T_j}, 0) ∈ R^4, which means that the i-th agent tries to reach the position p_{T_j} associated with her assigned task T_j at time t = t_f with zero terminal velocity (soft landing). It turns out (see, for instance, [18]) that the optimal control input is given by u_i^⋆(t; t_f, x_i^0) = α + tβ, with α := (6/t_f^2)(p_{T_j} − p_i^0 − t_f v_i^0) + (2/t_f) v_i^0 and β := −(12/t_f^3)(p_{T_j} − p_i^0 − t_f v_i^0) − (6/t_f^2) v_i^0, and the optimal cost-to-go by ρ_i(x_i^0; x_{T_j}) := (1/2)(t_f |α|^2 + t_f^2 α^T β + (1/3) t_f^3 |β|^2).

TABLE I
TEAM UTILITIES (n = p = 100) FOR RANGE CONSTRAINED AND UNCONSTRAINED CASES
(Columns: GRM and SAP, each under OLTA and DTA, for ϱ → ∞ and range constrained ϱ; rows indexed by t_f.)

We present numerical simulations for both Problem 2 (OLTA) and Problem 3 (DTA) based on the SAP and GRM algorithms from [9], for both the range constrained and unconstrained cases. We use a constant time step δt (although in Proposition 2 we proposed a decreasing time step, it turns out that a sufficiently small constant step is adequate for our simulations). The negotiation process for the OLTA ran for k rounds (all these rounds took place at time t = 0, per the discussion in Section IV-B) in order to converge to a Nash equilibrium before the agents start moving toward the states corresponding to their assigned tasks.

Fig. 1. Dynamic task allocation for the range constrained case (GRM, n = p = 10, ϱ = 0., t_f = 10, U = 3.); (a) t = 0+, (b) t = 2., (c) t = 10.

Fig. 2. Dynamic task allocation for the range unconstrained case (GRM, n = p = 10, ϱ → ∞, t_f = 10, U = 4.); (a) t = 0+, (b) t = 2., (c) t = 10.

We have used k = 100 for the GRM algorithm and k = 1000 for the SAP algorithm. The negotiation process for the DTA starts with a random task assignment profile at t = 0 and, subsequently, the agents continue to update their utilities and individual assignments at every time step (k = t_f/δt). We have noticed that δt must be smaller for the SAP algorithm than for the GRM algorithm to achieve convergence. For this reason, we select δt = 0. for GRM and δt = 0. for SAP when solving the DTA problem. Per the discussion in Section V-B, the agents' utilities are not updated after time t = t_f − ε, whereas the negotiation process continues until t = t_f. In our simulations, we have used the following parameter values: ε = t_f/ , p_i^0 ∈ [0, , ṗ_i^0 ∈ [−. , . , p_{T_j} ∈ [0, , r̄_{T_j} ∈ [0, , and p_{i,j} ∈ [0, 1], where i ∈ [1, n]_d and j ∈ [1, p]_d.
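The closed-form minimum-effort solution quoted above can be checked directly. The sketch below (planar case, with made-up boundary data) recomputes α, β, and ρ_i, verifies the soft-landing terminal constraint in closed form, and compares the cost-to-go against a simple quadrature of (1/2)∫|u⋆|² dt.

```python
import numpy as np

def min_effort_soft_landing(p0, v0, pT, tf):
    """Closed-form solution of the soft-landing double-integrator problem
    (Section VI): minimize (1/2)∫|u|^2 dt for p'' = u with p(tf) = pT and
    p'(tf) = 0. Returns the coefficients of u*(t) = alpha + t*beta and the
    optimal cost-to-go rho. The planar case is assumed for illustration."""
    p0, v0, pT = map(np.asarray, (p0, v0, pT))
    d = pT - p0 - tf * v0
    alpha = (6.0 / tf**2) * d + (2.0 / tf) * v0
    beta = -(12.0 / tf**3) * d - (6.0 / tf**2) * v0
    rho = 0.5 * (tf * (alpha @ alpha) + tf**2 * (alpha @ beta)
                 + (tf**3 / 3.0) * (beta @ beta))
    return alpha, beta, rho

# made-up boundary data for the check
p0, v0, pT, tf = [0.0, 0.0], [1.0, -1.0], [2.0, 1.0], 10.0
alpha, beta, rho = min_effort_soft_landing(p0, v0, pT, tf)

# terminal state under u*(t) = alpha + t*beta (double integration, closed form)
p_tf = np.asarray(p0) + tf * np.asarray(v0) + alpha * tf**2 / 2 + beta * tf**3 / 6
v_tf = np.asarray(v0) + alpha * tf + beta * tf**2 / 2
assert np.allclose(p_tf, pT) and np.allclose(v_tf, 0.0)   # soft landing

# cost-to-go agrees with a midpoint-rule quadrature of (1/2)∫|u*|^2 dt
ts = np.linspace(0.0, tf, 2001)
tm = (ts[:-1] + ts[1:]) / 2
u_mid = alpha[None, :] + tm[:, None] * beta[None, :]
num_rho = 0.5 * np.sum((u_mid**2).sum(axis=1)) * (ts[1] - ts[0])
assert np.isclose(rho, num_rho, rtol=1e-5)
```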
For the implementation of the GRM and SAP algorithms, we have used ρ = 0. (discount factor), α = 0. (parameter for the agents' willingness to optimize at each time step), and randomization level τ = 10/k. Finally, ϱ ∈ { . , . } (parameter for the range constrained implementations of SAP and GRM). All the graphs and numerical outcomes presented herein are averaged data from multiple simulation runs.

Figures 1 and 2 illustrate the evolution of the agents' trajectories computed for the DTA problem at different time instants for the range constrained and the unconstrained cases, respectively. In particular, the dashed lines indicate the current task assignments whereas the solid curves correspond to the past segments of the agents' trajectories. Fig. 3 shows that the team utilities obtained by both the GRM and SAP negotiation protocols for the DTA problem reach the team utility attained by the solution to the OLTA problem. In addition, the negotiations converge to a pure strategy Nash equilibrium as t → t_f, in agreement with Proposition 2. Table I shows the values of the total team utility U for different scenarios, for both the range constrained and range unconstrained cases, with a significant number of agents and tasks and for different values of the terminal time t_f. We observe that for the DTA problem the team's performance improves as t_f increases. As we have discussed in Remark 4, when t_f is large, the equilibrium assignment profile corresponds to a "steady-state" profile, in which case the performance achieved by the solutions to both the OLTA and DTA problems is expected to be similar. The obtained results confirm the latter claim.
Fig. 3. Team utilities versus time (n = p = 100, t_f = 10); (a) GRM, (b) SAP.

VII. CONCLUDING REMARKS
In this paper, we have presented a framework to address dynamic task allocation problems for multi-agent systems with state-dependent utilities. Our approach, which leverages game-theoretic learning algorithms for the solution of static potential games, offers a practical solution to a class of more realistic and challenging dynamic task allocation problems for autonomous mobile agents. In our future work, we plan to extend the results presented herein to even more realistic task allocation problems, including scenarios with deadlines attached to tasks, pop-up tasks, and agents with varying capabilities and preferences.
REFERENCES
[1] D. H. Wolpert, K. R. Wheeler, and K. Tumer, "General principles of learning-based multi-agent systems," in Proceedings of the Third Annual Conference on Autonomous Agents, pp. 77–83, 1999.
[2] D. Monderer and L. S. Shapley, "Potential games," Games and Economic Behavior, vol. 14, no. 1, pp. 124–143, 1996.
[3] B. P. Gerkey and M. J. Mataric, "Sold!: Auction methods for multirobot coordination," IEEE Transactions on Robotics and Automation, vol. 18, no. 5, pp. 758–768, 2002.
[4] H. Choi, L. Brunet, and J. P. How, "Consensus-based decentralized auctions for robust task allocation," IEEE Transactions on Robotics, vol. 25, no. 4, pp. 912–926, 2009.
[5] M. Nanjanath and M. Gini, "Repeated auctions for robust task execution by a robot team," Robotics and Autonomous Systems, vol. 58, no. 7, pp. 900–909, 2010.
[6] J. Capitan, M. T. Spaan, L. Merino, and A. Ollero, "Decentralized multi-robot cooperation with auctioned POMDPs," The International Journal of Robotics Research, vol. 32, no. 6, pp. 650–671, 2013.
[7] K. S. Macarthur, R. Stranders, S. Ramchurn, and N. Jennings, "A distributed anytime algorithm for dynamic task allocation in multi-agent systems," in , 2011.
[8] A. T. Tolmidis and L. Petrou, "Multi-objective optimization for dynamic task allocation in a multi-robot system," Eng. Appl. Artif. Intell., vol. 26, no. 5-6, pp. 1458–1468, 2013.
[9] G. Arslan, J. R. Marden, and J. S. Shamma, "Autonomous vehicle-target assignment: A game-theoretical formulation," J. Dyn. Syst. Meas. Control, vol. 129, pp. 584–596, 2007.
[10] D. Fudenberg and D. Levine, "Learning in games," European Economic Review, vol. 42, no. 3-5, pp. 631–639, 1998.
[11] H. P. Young, Individual Strategy and Social Structure: An Evolutionary Theory of Institutions. Princeton University Press, 2020.
[12] J. R. Marden, G. Arslan, and J. S. Shamma, "Joint strategy fictitious play with inertia for potential games," IEEE Transactions on Automatic Control, vol. 54, no. 2, pp. 208–220, 2009.
[13] G. C. Chasparis, J. S. Shamma, and A. Rantzer, "Perturbed learning automata in potential games," in CDC, pp. 2453–2458, 2011.
[14] Y. Wang and L. Pavel, "A modified Q-learning algorithm for potential games," in , pp. 8710–8718, 2014.
[15] A. C. Chapman, R. A. Micillo, R. Kota, and N. R. Jennings, "Decentralized dynamic task allocation using overlapping potential games," The Computer Journal, vol. 53, no. 9, pp. 1462–1477, 2010.
[16] J. R. Marden, "State based potential games," Automatica, vol. 48, no. 12, pp. 3075–3088, 2012.
[17] D. González-Sánchez and O. Hernández-Lerma, "A survey of static and dynamic potential games," Science China Mathematics, vol. 59, no. 11, pp. 2075–2102, 2016.
[18] E. Bakolas, "A decentralized spatial partitioning algorithm based on the minimum control effort metric," in