Distributed Path Planning for Executing Cooperative Tasks with Time Windows
Raghavendra Bhat∗, Yasin Yazıcıoğlu∗, and Derya Aksaray∗∗
∗ Department of Electrical and Computer Engineering, University of Minnesota, Minneapolis, Minnesota 55455. Email: [email protected], [email protected]. ∗∗ Department of Aerospace Engineering and Mechanics, University of Minnesota, Minneapolis, Minnesota 55455. Email: [email protected]
Abstract:
We investigate the distributed planning of robot trajectories for optimal execution of cooperative tasks with time windows. In this setting, each task has a value and is completed if sufficiently many robots are simultaneously present at the necessary location within the specified time window. Tasks keep arriving periodically over cycles. The task specifications (required number of robots, location, time window, and value) are unknown a priori, and the robots try to maximize the value of completed tasks by planning their own trajectories for the upcoming cycle based on their past observations in a distributed manner. Considering the recharging and maintenance needs, robots are required to start and end each cycle at their assigned stations located in the environment. We map this problem to a game theoretic formulation and maximize the collective performance through distributed learning. Some simulation results are also provided to demonstrate the performance of the proposed approach.
Keywords:
Distributed control, multi-robot systems, planning, game theory, learning

1. INTRODUCTION

Teams of mobile robots provide efficient and robust solutions in a multitude of applications such as precision agriculture, environmental monitoring, surveillance, search and rescue, and warehouse automation. Such applications typically require the robots to complete a variety of spatio-temporally distributed tasks, some of which (e.g., lifting heavy objects) may require the cooperation of multiple robots. In such a setting, successful completion of the tasks requires the presence of sufficiently many robots at the right location and time. Accordingly, the overall performance depends on a coordinated plan of robot trajectories.

Planning is one of the fundamental topics in robotics (e.g., see LaValle (2006); Elbanhawi and Simic (2014) and the references therein). For multi-robot systems, the inherent complexity arising from the exponential growth of the joint planning space usually renders exact centralized solutions intractable. Unlike the standard formulations such as minimizing the travel time, the energy consumed, or the distance traveled while avoiding collisions, there is limited literature on planning trajectories for serving spatio-temporally distributed cooperative tasks. In Thakur et al. (2013), the authors investigate a problem where a team of robots allocate a given set of waypoints among themselves and plan paths through those waypoints while avoiding obstacles and reaching their goal regions by specific deadlines. In Bhattacharya et al. (2010), the authors consider the distributed path planning for robots with preassigned tasks that can be served in any order or time.

In this work, we study a distributed task execution (DTE) problem, where a homogeneous team of mobile robots optimize their trajectories to maximize the total value of completed cooperative tasks that arrive periodically over time at different locations in a discretized environment. In this setting, each task is defined by the following specifications: required number of robots, location, time window (arrival and departure times), and value. Tasks are completed if sufficiently many robots simultaneously spend one time step at the necessary location within the corresponding time window. Tasks keep arriving periodically over cycles and their specifications are unknown a priori. Robots are required to start and end each cycle at their assigned stations in the environment, and they try to maximize the value of completed tasks by each of them planning its own trajectory based on the observations from previous cycles.

In order to tackle the challenges due to the possibly large scale of the system (number of tasks and robots) and the initially unknown task specifications, we investigate how the DTE problem can be solved via distributed learning. Such distributed coordination problems are usually solved using methods based on machine learning, optimization, and game theory (e.g., see Bu et al. (2008); Boyd et al. (2011); Marden et al. (2009) and the references therein). Game theoretic methods have been used to solve various problems such as vehicle-target assignment (e.g., Marden et al. (2009)), coverage optimization (e.g., Zhu and Martínez (2013); Yazıcıoğlu et al. (2013, 2017)), or dynamic vehicle routing (e.g., Arsie et al. (2009)).
In this paper, we propose a game theoretic solution to the DTE problem by designing a corresponding game and utilizing a learning algorithm that drives the robots to configurations that maximize the global objective, i.e., the total value of completed tasks. In the proposed game, the action of each robot is defined as its trajectory during one cycle. We show that some feasible trajectories can never contribute to the global objective in this setting, no matter what the task specifications are. By excluding such trajectories, we obtain a game with a significantly smaller action space, which still contains the globally optimal combinations of trajectories but also facilitates faster learning due to its smaller size. Using the proposed method, robots spend an arbitrarily high percentage of cycles at the optimal combinations of trajectories in the long run.

This paper is organized as follows: Section 2 presents the formulation of the distributed task execution problem. Section 3 is on the proposed game theoretic formulation and solution. Some simulation results are presented in Section 4. Finally, Section 5 concludes the paper.

2. PROBLEM FORMULATION

In this section, we introduce the distributed task execution (DTE) problem, where the goal is to have a homogeneous team of n mobile robots, R = {r_1, r_2, ..., r_n}, optimize their trajectories to maximize the total value of completed cooperative tasks that arrive periodically and remain available only over specific time windows.

We consider a discretized environment represented as a 2D grid, P = {1, 2, ..., x̄} × {1, 2, ..., ȳ}, where x̄, ȳ ∈ N denote the number of cells along the corresponding directions. In this environment, some of the cells may be occupied by obstacles, P_O ⊂ P, and the robots are free to move over the feasible cells P_F = P \ P_O. Within the feasible space, we consider m stations, S = {s_1, ..., s_m} ⊆ P_F, where the robots start and end each cycle. Stations denote the locations where the robots recharge, go through maintenance when needed, and prepare for the next cycle. Each robot is assigned to a specific station (multiple robots can be assigned to the same station) and must return there by the end of each cycle. We assume that each cell represents a sufficiently large amount of space, hence any number of robots can be present in the same cell at the same time.

Each cycle consists of T time steps, and the trajectory of each robot r_i ∈ R over a cycle is denoted as p_i = {p_i^0, p_i^1, ..., p_i^T}. The robots can maintain their current position or move to any of the feasible neighboring cells within one time step. Specifically, if a robot is at some cell p = (x, y) ∈ P_F, at the next time step it has to be within p's neighborhood on the grid, N(p) ⊆ P_F, i.e.,

N(p) = {(x′, y′) ∈ P_F | |x′ − x| ≤ 1, |y′ − y| ≤ 1}.

For any robot r_i ∈ R, the set of feasible trajectories, P_i, is defined as

P_i = {p_i | p_i^0 = p_i^T = σ_i, p_i^{t+1} ∈ N(p_i^t), t = 0, ..., T − 1},   (1)

where σ_i ∈ S denotes the station of robot r_i. Accordingly, we use P = P_1 × ... × P_n to denote the combined set of feasible trajectories. An example of an environment with some obstacles and three stations is illustrated in Fig. 1, along with examples of feasible motions and trajectories of robots.

Fig. 1. A discretized environment with obstacles (red) and three stations (gray) is shown. Robots can move to any neighboring (non-obstacle) cell in one time step. On the left, the neighboring cells of station s_1 are shown via arrows. On the right, a feasible trajectory of length five is shown for a robot at station s_3.

Given such an environment and a set of mobile robots, we consider a set of k tasks, τ = {τ_1, τ_2, ..., τ_k}, each of which is defined as a tuple, τ_i = {c_i^*, l_i, t_i^a, t_i^d, v_i}, where c_i^* ∈ N is the required number of robots, l_i ∈ P_F is the location, t_i^a < t_i^d ∈ {0, ..., T} are the arrival and departure times, and v_i ∈ R_+ is the value. Accordingly, the task is completed if at least c_i^* robots simultaneously spend one time step at l_i within the time window [t_i^a, t_i^d]. More specifically, given the trajectories of all robots, p, the set of completed tasks τ*(p) ⊆ τ is defined as

τ*(p) = {τ_i ∈ τ | ∃ t ∈ [t_i^a, t_i^d − 1], c_i(p, t) ≥ c_i^*},   (2)

where c_i(p, t) ∈ {0, ..., n} is the counter denoting the number of robots that stayed at l_i ∈ P_F from time t to t + 1 in that cycle, i.e.,

c_i(p, t) = |{r_j ∈ R | p_j^t = p_j^{t+1} = l_i}|.   (3)

Accordingly, each task can be completed within one time step if sufficiently many robots simultaneously stay at the corresponding location within the specified time window. This model captures a variety of tasks with time windows. Examples include placing a box in a shelf, where the weight of the box determines the number of robots needed, or aerial monitoring, where the required ground resolution determines the number of needed drones (higher resolution implies closer inspection, hence more drones are needed to cover the area). As per this model, when a team of robots encounter a task, they are assumed to know how to achieve a proper low-level coordination (e.g., how multiple robots should move a box) so that the task is completed if there are sufficiently many robots. Our main focus is on the problem of having sufficiently many robots in the right place and time window. In this regard, we quantify the performance resulting from the trajectories of all robots, p, as the total value of completed tasks, i.e.,

f(p) = Σ_{τ_i ∈ τ*(p)} v_i.   (4)

We consider a setting where the tasks are unknown a priori. Accordingly, the robots are expected to improve the overall performance by updating their trajectories over the cycles based on their observations. In such a setting, p(t) denotes the trajectories at the t-th cycle for t ∈ {0, 1, ...}, and we are interested in the resulting long-run average performance. Accordingly, for any infinite sequence of robot trajectories over time (cycles), we quantify the resulting performance as

lim inf_{t*→∞} (1/(t* + 1)) Σ_{t=0}^{t*} f(p(t)).   (5)

We are interested in optimizing the long-run average performance via a distributed learning approach, where each robot r_i ∈ R independently plans its own trajectory for the upcoming cycle based on its past observations.
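For concreteness, the completion rule (2)-(3) and the score (4) can be prototyped in a few lines. The following is a minimal Python sketch of the model; all function and variable names are our own illustrative choices, with trajectories represented as sequences of T + 1 grid cells and tasks as tuples (c*, l, t^a, t^d, v).

def neighborhood(p, feasible):
    # N(p): feasible cells reachable from p in one step (8-connected, incl. p itself)
    x, y = p
    return {(x + dx, y + dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1)
            if (x + dx, y + dy) in feasible}

def is_feasible_trajectory(traj, station, feasible):
    # Membership test for P_i as defined in (1)
    if traj[0] != station or traj[-1] != station:
        return False
    return all(traj[t + 1] in neighborhood(traj[t], feasible)
               for t in range(len(traj) - 1))

def completed_tasks(trajs, tasks):
    # tau*(p) in (2): tasks with at least c* robots staying at l within [t^a, t^d]
    done = set()
    for j, (c_star, loc, t_a, t_d, value) in enumerate(tasks):
        for t in range(t_a, t_d):  # a stay from t to t+1 must end by t^d
            if sum(1 for p in trajs if p[t] == p[t + 1] == loc) >= c_star:  # counter (3)
                done.add(j)
                break
    return done

def score(trajs, tasks):
    # f(p) in (4): total value of completed tasks
    return sum(tasks[j][4] for j in completed_tasks(trajs, tasks))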
3. PROPOSED METHOD

We will present a game theoretic solution to the DTE problem. We first provide some game theory preliminaries.

A finite strategic game Γ = (I, A, U) has three components: (1) a set of players (agents) I = {1, 2, ..., m}, (2) an action space A = A_1 × A_2 × ... × A_m, where each A_i is the action set of player i, and (3) a set of utility functions U = {U_1, U_2, ..., U_m}, where each U_i : A → R is a mapping from the action space to real numbers. For any action profile a ∈ A, we use a_{−i} to denote the actions of players other than i. Using this notation, an action profile a can also be represented as a = (a_i, a_{−i}). An action profile a* ∈ A is called a Nash equilibrium if

U_i(a_i^*, a_{−i}^*) = max_{a_i ∈ A_i} U_i(a_i, a_{−i}^*), ∀i ∈ I.

A class of games that is widely utilized in cooperative control problems is the potential games. A game is called a potential game if there exists a potential function, φ : A → R, such that the change of a player's utility resulting from its unilateral deviation from an action profile equals the resulting change in φ. More precisely, for each player i, for every a_i, a_i′ ∈ A_i, and for all a_{−i} ∈ A_{−i},

U_i(a_i′, a_{−i}) − U_i(a_i, a_{−i}) = φ(a_i′, a_{−i}) − φ(a_i, a_{−i}).

When a cooperative control problem is mapped to a potential game, the game is designed such that its potential function captures the global objective (e.g., Marden et al. (2009)). Such a design achieves some alignment between the global objective and the utility of each agent.

In game theoretic learning, starting from an arbitrary initial configuration, the agents repetitively play a game. At each step t ∈ {0, 1, 2, ...}, each agent i ∈ I plays an action a_i(t) and receives some utility U_i(a(t)). In this setting, the agents update their actions in accordance with some learning algorithm. For potential games, there are many learning algorithms that provide convergence to a Nash equilibrium (e.g., Arslan et al. (2007); Marden et al. (2009) and the references therein). While any potential maximizer is a Nash equilibrium, a potential game may also have some suboptimal Nash equilibria. In that case, a learning algorithm such as log-linear learning (LLL) presented in Blume (1993) can be used to drive the agents to the equilibria that maximize the potential function φ(a). Essentially, LLL is a noisy best-response algorithm, and it induces a Markov chain over the action space with a unique limiting distribution, μ_ε^*, where ε denotes the noise parameter. When all agents follow LLL in potential games, the potential maximizers are the stochastically stable states, as shown in Blume (1993). In other words, as the noise parameter ε goes down to zero, the limiting distribution, μ_ε^*, has an arbitrarily large part of its mass accumulated over the set of potential maximizers, i.e.,

lim_{ε→0+} μ_ε^*(a) > 0 ⟺ φ(a) ≥ φ(a′), ∀a′ ∈ A.   (6)

To execute LLL, only one agent should update its action at each step. Furthermore, each agent should be able to compute its current utility as well as the hypothetical utilities it may gather by unilaterally switching to any other action. Alternatively, the payoff-based implementation of LLL presented in Marden and Shamma (2012) yields the same limiting behavior without requiring single-agent updates and the computation of hypothetical utilities.
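For illustration, one step of (utility-based) LLL for the single updating agent can be sketched as follows; note that it needs the hypothetical utility of every alternative action, which is exactly the requirement that the payoff-based variant removes. The helper name and signature below are our own.

import random

def lll_update(actions, utility_of, eps):
    # Choose an action with probability proportional to eps^(-U_i(a_i, a_-i));
    # since eps is in (0, 1), higher-utility actions get exponentially more mass,
    # and as eps -> 0+ the update approaches a best response.
    weights = [eps ** (-utility_of(a)) for a in actions]
    return random.choices(actions, weights=weights, k=1)[0]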
In light of (6), if the DTE problem is mapped to a potential game such that the potential function is equal to (4), then the long-run average performance given in (5) can be made arbitrarily close to its maximum possible value by having the robots follow some learning algorithm such as LLL. In this section, we build such a game theoretic solution. We first design a corresponding game Γ_DTE by defining the action space and the utility functions.

Since the impact of each agent on the overall performance is solely determined by its trajectory, it is natural to define the action set of each agent as some subset of the feasible trajectories, i.e., A_i ⊆ P_i. One possible choice is setting A_i = P_i, which allows the robots to take any feasible trajectory. However, it should be noted that the learning will involve robots searching within their feasible actions. Accordingly, defining a larger action space is likely to result in a slower learning rate, which is an important aspect in practice. Due to this practical concern, we will design a more compact action space which can yield the same long-run performance as the case with all feasible trajectories, yet approaches the limiting behavior much faster. The impact of this reduction in the size of the action space will later be demonstrated through numerical simulations in Section 4.
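To see the scale of the search space, |P_i| can be counted by dynamic programming over the cycle length, since feasible trajectories are exactly the closed walks of length T at the station. A short sketch follows, reusing the hypothetical neighborhood helper from Section 2; on the map of Fig. 1, this kind of count produces sizes like those reported in Section 4 (e.g., hundreds to thousands of feasible trajectories per robot for T = 6, depending on the obstacle layout).

def count_feasible_trajectories(station, feasible, T):
    # |P_i|: number of closed walks of length T from the station over feasible
    # cells, computed by propagating walk counts one time step at a time.
    counts = {station: 1}
    for _ in range(T):
        nxt = {}
        for cell, c in counts.items():
            for q in neighborhood(cell, feasible):
                nxt[q] = nxt.get(q, 0) + c
        counts = nxt
    return counts.get(station, 0)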
Action Space Design

While the set of possible trajectories, P_i, grows exponentially with the cycle length T, a large number of those trajectories actually can never be a useful choice in the proposed setting, regardless of the task specifications and the trajectories of other robots. In particular, if a trajectory does not involve staying anywhere, i.e., there is no t ∈ {0, ..., T − 1} such that p_i^t = p_i^{t+1}, then robot i is guaranteed not to contribute to the global score in (4), since it will not be helping with any task (no contribution to any of the counters in (3)). Furthermore, there are also many trajectories for which there exists some other trajectory guaranteed to be equally useful or better for the global performance, no matter what the task specifications or the trajectories of other robots are. In particular, if a trajectory p_i has all the stays some other trajectory q_i has, then removing q_i from the feasible options would not degrade the overall performance. By removing such inferior trajectories from the available options, we define a significantly smaller action set without any reduction in the achievable long-run performance:

A_i = argmin_{A_i ⊆ P_i} |A_i|   (7)
s.t. (8), (9),

where the constraints are

∀p_i ∈ A_i, ∃t ∈ {0, 1, ..., T − 1} : p_i^t = p_i^{t+1},   (8)

∀q_i ∈ P_i \ A_i, ∃p_i ∈ A_i : p_i^t = p_i^{t+1} = q_i^t, ∀t : q_i^t = q_i^{t+1}.   (9)

Accordingly, the action set of each robot, A_i, is the smallest subset of all its feasible trajectories P_i such that 1) each trajectory p_i ∈ A_i has at least one stay, i.e., (8), and 2) for every excluded trajectory q_i ∈ P_i \ A_i, there is a trajectory p_i ∈ A_i such that any stay in q_i is also included in p_i, i.e., (9). It is worth emphasizing that this reduced action space maintains all the global optima.
Lemma 1. For any set of feasible trajectories P_i as in (1) and the action sets as in (7),

max_{p ∈ P} f(p) = max_{p ∈ A} f(p).   (10)
Proof. Let q ∈ P \ A be a maximizer of f(p), i.e.,

f(q) = max_{p ∈ P} f(p).   (11)

In light of (9), there exists p ∈ A such that, for every robot i, all the stays under q_i are also included in p_i, i.e., for any t ∈ {0, ..., T − 1},

q_i^t = q_i^{t+1} ⟹ p_i^t = p_i^{t+1} = q_i^t.

Accordingly, in light of (2) and (3), all the tasks completed under q are completed under p as well, i.e.,

τ*(q) ⊆ τ*(p).

Hence, f(p) ≥ f(q). Since A ⊆ P, f(p) ≥ f(q) and (11) together imply (10).
Example: Consider the environment in Fig. 1, and let each cycle consist of three time steps (T = 3). In that case, the feasible trajectories for a robot i stationed at s_1 in cell (2, 2) are all the cyclic routes of three hops, p_i = {p_i^0, p_i^1, p_i^2, p_i^3}, such that p_i^0 = p_i^3 = (2, 2). For this setting, (7) yields an action set A_i consisting of only the following nine trajectories:

{(2,2), (1,1), (1,1), (2,2)}, {(2,2), (1,2), (1,2), (2,2)}, {(2,2), (1,3), (1,3), (2,2)}, {(2,2), (2,1), (2,1), (2,2)}, {(2,2), (2,3), (2,3), (2,2)}, {(2,2), (3,1), (3,1), (2,2)}, {(2,2), (3,2), (3,2), (2,2)}, {(2,2), (3,3), (3,3), (2,2)}, {(2,2), (2,2), (2,2), (2,2)}.

Note that these are the trajectories that 1) move to one of the eight adjacent cells, stay there for one time step, and return to the station, or 2) stay at cell (2, 2) throughout the cycle, which are indeed the only choices that may result in helping with the completion of some task in this example.
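A brute-force construction of this reduced action set, practical only for short cycles like the T = 3 example above, can be sketched as follows (names are ours; neighborhood is the hypothetical helper from Section 2). It keeps one representative trajectory per maximal stay pattern, which satisfies (8)-(9) and attains the minimum in (7):

def stays(traj):
    # The set of (time, cell) pairs where the robot stays put from t to t+1
    return {(t, traj[t]) for t in range(len(traj) - 1) if traj[t] == traj[t + 1]}

def reduced_action_set(station, feasible, T):
    # Enumerate P_i: all closed walks of length T starting/ending at the station
    walks = [[station]]
    for _ in range(T):
        walks = [w + [q] for w in walks for q in neighborhood(w[-1], feasible)]
    P_i = [tuple(w) for w in walks if w[-1] == station]
    # Constraint (8): keep only trajectories with at least one stay
    candidates = [p for p in P_i if stays(p)]
    # Constraint (9): drop any trajectory whose stays are strictly contained in
    # another's; keep a single representative per maximal stay pattern
    kept = []
    for q in candidates:
        if any(stays(q) < stays(p) for p in candidates):
            continue
        if any(stays(q) == stays(p) for p in kept):
            continue
        kept.append(q)
    return kept

For a station at cell (2, 2) whose eight neighbors are all feasible, reduced_action_set((2, 2), feasible, 3) returns exactly the nine trajectories listed above.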
Utility Design
We design the utility functions so that the total value of completed tasks becomes the potential function of the resulting game. To this end, we employ the notion of wonderful life utility presented in Tumer and Wolpert (2004). Accordingly, we set the utility of each robot i to the total value of completed tasks that would not have been completed if robot i were removed from the system, i.e.,

U_i(p) = Σ_{τ_j ∈ (τ*(p) \ τ*(p_{−i}))} v_j,   (12)

where τ*(p), as defined in (2), is the set of tasks completed given the trajectories of all robots, and τ*(p_{−i}) is the set of tasks completed when robot i is excluded from the system.
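In terms of the sketch from Section 2, this utility is a short counterfactual computation: evaluate the completed task set with and without robot i (again, the name below is our own). Since τ*(p_{−i}) ⊆ τ*(p), the set difference is exactly the set of tasks for which robot i is marginal.

def wonderful_life_utility(i, trajs, tasks):
    # U_i(p) in (12): total value of tasks that are completed under p but would
    # not be completed if robot i were removed from the system
    with_i = completed_tasks(trajs, tasks)                         # tau*(p)
    without_i = completed_tasks(trajs[:i] + trajs[i + 1:], tasks)  # tau*(p_-i)
    return sum(tasks[j][4] for j in with_i - without_i)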
Lemma 2. For any set of robots R with the action space A as per (7), the utilities in (12) lead to a potential game Γ_DTE = (R, A, U) with the potential function φ(p) = f(p), where f(p) is the total value of completed tasks as given in (4).
Proof. Let p_i ≠ p_i′ ∈ A_i be two possible trajectories for robot i, and let p_{−i} denote the trajectories of all other robots. Since removing a robot from the system cannot increase the number of robots present at each cell during the cycle, for each task τ_j we have c_j(p, t) ≥ c_j(p_{−i}, t) for the counters as defined in (3). Accordingly, by removing robot i from the system, the set of completed tasks can only shrink, i.e., τ*(p_{−i}) ⊆ τ*(p). Hence, the utility in (12) can be expressed as

U_i(p_i, p_{−i}) = Σ_{τ_j ∈ τ*(p_i, p_{−i})} v_j − Σ_{τ_j ∈ τ*(p_{−i})} v_j.

Accordingly,

U_i(p_i, p_{−i}) − U_i(p_i′, p_{−i}) = Σ_{τ_j ∈ τ*(p_i, p_{−i})} v_j − Σ_{τ_j ∈ τ*(p_i′, p_{−i})} v_j = f(p_i, p_{−i}) − f(p_i′, p_{−i}).

Consequently, Γ_DTE = (R, A, U) is a potential game with the potential function φ(p) = f(p).

Note that the utility in (12) can be computed by each robot based on local information. For every completed task τ_j ∈ τ*(p), let t_j^*(p) denote the time when the participating robots started the execution, i.e.,

t_j^*(p) = min{t ∈ [t_j^a, t_j^d − 1] | c_j(p, t) ≥ c_j^*}.

Each robot involved in the execution of τ_j needs to know the value v_j and whether the task could have been completed without itself in order to compute the amount of reward it receives from this task (0 or v_j). To assess whether the task could have been completed without itself, the robot needs to know the required number of robots c_j^*, and the number of robots that stay at l_j starting from t_j^* until the last point the task can be completed, i.e., c_j(p, t) for all t ∈ {t_j^*, ..., t_j^d − 1}. Here, the future values of c_j(p, t) until t_j^d − 1 are needed since another group of sufficiently many robots may stay at l_j before t_j^d. In that case, the second team could have completed the task if it wasn't already completed, which affects the marginal contributions of the robots in the first team.
Example: Consider the environment in Fig. 1 with three robots, all stationed at s_1 in cell (2, 2), and a single task τ_1 = {c_1^* = 2, l_1 = (2, 2), t_1^a = 0, t_1^d = 4, v_1 = 1}. Let T = 4 and let the trajectories of the robots be:

p_1 = {(2,2), (2,2), (2,3), (2,3), (2,2)},
p_2 = {(2,2), (2,2), (2,2), (2,2), (2,2)},
p_3 = {(2,2), (2,1), (2,1), (2,2), (2,2)}.

In that case, the task is completed by robots r_1 and r_2 in the first time step. Note that the task cannot be completed without r_2, since there are no instants where both r_1 and r_3 stay at (2, 2). Hence, r_2 should receive a utility of U_2(p) = 1. On the other hand, if r_1 is removed from the system, r_2 and r_3 can still complete the task at the last time step. So r_1 should receive a utility of U_1(p) = 0, although the task cannot be completed without r_1 in the first time step. Also, U_3(p) = 0 since the task was completed without r_3.

3.3 Learning Algorithm
Since we designed Γ_DTE as a game with the potential function equal to the total value of completed tasks, a learning algorithm such as log-linear learning (LLL) can be used to keep the long-run average performance arbitrarily close to the best possible value. In order to avoid the necessity of single-agent updates and the computation of the hypothetical utilities from all feasible actions when updating, we propose using the payoff-based log-linear learning (PB-LLL) algorithm presented in Marden and Shamma (2012). In this algorithm, each agent has a binary state x_i(t) denoting whether the agent has experimented with a new action in cycle t (1 if experimented, 0 otherwise). Every non-experimenting agent either experiments with a new random action at the next step with some small probability ε^m, or keeps its current action and stays non-experimenting with probability 1 − ε^m. Each experimenting agent settles at the action that it just tried or its previous action, with probabilities determined by the received utilities (similar to the softmax function). Accordingly, the agent assigns a much higher probability to the action that yielded the higher utility.

PB-LLL Algorithm (Marden and Shamma (2012))

1: initialization: ε ∈ (0, 1), m > 0, t = 0, x_i(0) = 0, p_i(0) ∈ A_i arbitrary
2: repeat
3:   if x_i(t) = 0
4:     with probability ε^m
5:       p_i(t + 1) is picked randomly from A_i
6:       x_i(t + 1) = 1
7:     with probability 1 − ε^m
8:       p_i(t + 1) = p_i(t)
9:       x_i(t + 1) = 0
10:  if x_i(t) = 1
11:    x_i(t + 1) = 0
12:    α = ε^{−U_i(p(t−1))} / (ε^{−U_i(p(t−1))} + ε^{−U_i(p(t))})   (13)
13:    p_i(t + 1) = p_i(t − 1) with prob. α; p_i(t) with prob. 1 − α   (14)
14:  end if
15:  t = t + 1
16: end repeat
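A possible payoff-based implementation of this update for one robot is sketched below (the class and method names are our own). The caller runs one cycle, measures the realized utility, and feeds back the utilities of the last two cycles; no hypothetical utilities are ever computed.

import random

class PBLLLAgent:
    def __init__(self, action_set, eps, m):
        self.A = list(action_set)
        self.eps, self.m = eps, m
        self.x = 0                           # experimentation flag x_i(t)
        self.action = random.choice(self.A)  # p_i(t), initialized arbitrarily
        self.prev_action = self.action       # p_i(t-1)

    def step(self, prev_utility, utility):
        # One PB-LLL update, mirroring lines 3-14 of the pseudocode above;
        # prev_utility and utility are U_i(p(t-1)) and U_i(p(t)).
        if self.x == 0:
            if random.random() < self.eps ** self.m:  # experiment w.p. eps^m
                self.prev_action = self.action
                self.action = random.choice(self.A)   # random trial action
                self.x = 1
            # otherwise keep the current action and remain non-experimenting
        else:
            self.x = 0
            w_old = self.eps ** (-prev_utility)       # weight of p_i(t-1)
            w_new = self.eps ** (-utility)            # weight of p_i(t)
            if random.random() < w_old / (w_old + w_new):  # alpha as in (13)
                self.action = self.prev_action        # settle at the old action
        return self.action                            # p_i(t+1)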
Theorem 3. For a set of robots R with the sets of feasible trajectories as in (1), let Γ_DTE = (R, A, U) be designed as per (7) and (12). If all robots follow the payoff-based log-linear learning (PB-LLL) algorithm with a sufficiently large value of m in a repeated play of Γ_DTE, then

lim_{ε→0+} lim_{t*→∞} (1/(t* + 1)) Σ_{t=0}^{t*} f(p(t)) = max_{q ∈ P} f(q).   (15)
Proof. Since Γ_DTE = (R, A, U) is a potential game with the potential function f(p), in light of Theorem 6.1 in Marden and Shamma (2012), for sufficiently large values of m, PB-LLL induces a Markov chain with the limiting distribution μ_ε^* over A such that

lim_{ε→0+} μ_ε^*(q) > 0 ⟺ f(q) = max_{q′ ∈ A} f(q′).   (16)

Note that the long-term average time spent at any state q ∈ A converges to the corresponding entry of the limiting distribution, i.e.,

lim_{t*→∞} (1/(t* + 1)) Σ_{t=0}^{t*} I(p(t), q) = μ_ε^*(q), ∀q ∈ A,

where I(p(t), q) = 1 if p(t) = q, and I(p(t), q) = 0 if p(t) ≠ q. Accordingly, the long-run average of the potential function satisfies

lim_{t*→∞} (1/(t* + 1)) Σ_{t=0}^{t*} f(p(t)) = lim_{t*→∞} (1/(t* + 1)) Σ_{t=0}^{t*} Σ_{q ∈ A} f(q) I(p(t), q) = Σ_{q ∈ A} μ_ε^*(q) f(q).   (17)

Using (16) and (17), we get

lim_{ε→0+} lim_{t*→∞} (1/(t* + 1)) Σ_{t=0}^{t*} f(p(t)) = max_{q ∈ A} f(q).   (18)

Using (18) together with (10), we obtain (15).

4. SIMULATION RESULTS

We consider the environment shown in Fig. 1 and present simulation results for two cases.
Case 1: This case aims to demonstrate how the proposed design of the action sets A_i ⊆ P_i in (7) improves the convergence rate compared to the trivial choice of A_i = P_i. We consider a small scenario consisting of two robots, one at station s_1 and one at station s_2, and a single task. Each cycle has six time steps, T = 6. The task requires two robots to be present at cell (6, 5), has the arrival and departure times t^a = 2 and t^d = 5, and has a value of v = 3, i.e., τ_1 = {2, (6, 5), 2, 5, 3}. For both simulations, the robots start with randomly selected initial trajectories and follow the PB-LLL algorithm with the same fixed values of m and ε. Robot 1 (station s_1) has |A_1| = 69 actions, which is significantly smaller than the number of feasible trajectories, |P_1| = 555. Robot 2 (station s_2) has |A_2| = 173 actions as opposed to |P_2| = 5349 feasible trajectories. The evolution of the total value of completed tasks (0 or 3 in this case) over the cycles is shown in Fig. 2 and Fig. 3. In both simulations, once the robots reach a configuration that completes the task, they maintain it most of the time. However, the scenario with the proposed A_i in (7) reaches that behavior about nine times faster than the scenario with A_i = P_i. In general, this ratio of convergence rates depends on the size of the problem and may get much bigger in cases with longer cycle lengths and larger teams of robots. The repeated drops of f(p(t)) to zero occur when robots experiment with other trajectories that result in the incompletion of the task. Due to the resulting drop in their own utility, robots almost never choose to stay at such failing configurations. For the results in Fig. 2, f(p(t)) = 3 in 99.89% of the cycles after the initial transient; a similar long-run fraction is eventually reached in Fig. 3, but only after roughly nine times as many cycles.

Fig. 2. Total value of completed tasks, f(p(t)), in Case 1 with the proposed action sets A_i ⊆ P_i as per (7).

Fig. 3. Total value of completed tasks, f(p(t)), in Case 1 when the action sets are defined as A_i = P_i. While the long-run average values are similar for the results in Figs. 2 and 3, convergence to the steady state behavior is much slower when A_i = P_i (note the different scales on the x axes of the figures).

Case 2: We simulate a larger scenario in the same environment with more robots and tasks. Each cycle has six time steps, T = 6. We define three tasks: τ_1 from the previous case and two more tasks, τ_2 and τ_3, located at cells with x-coordinates 4 and 3, respectively, each with its own required number of robots, time window, and value. There are seven robots: two at station s_1, two at station s_2, and three at station s_3. Their action sets are defined as per (7). The robots start with random initial trajectories and follow the PB-LLL algorithm. As shown in Fig. 4, after the initial learning transient, the robots complete all the tasks in the vast majority of the cycles.

Fig. 4. Total value of completed tasks, f(p(t)), in Case 2.
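Putting the pieces together, a Case 1-style experiment can be scripted as below. This is a sketch under placeholder assumptions: the grid, station cells, and the values of ε and m are hypothetical stand-ins (the exact experimental values are not reproduced here), and the helpers are the sketches introduced earlier. The brute-force action-set enumeration limits this script to small maps and short cycles.

# Hypothetical setup chosen for illustration only; the task tuple is
# tau_1 = {2, (6, 5), 2, 5, 3} from Case 1.
feasible = {(x, y) for x in range(1, 9) for y in range(1, 7)}  # obstacle-free stand-in
stations = [(2, 2), (6, 2)]          # placeholder station cells
T, eps, m = 6, 0.05, 1.5             # placeholder learning parameters
tasks = [(2, (6, 5), 2, 5, 3)]

agents = [PBLLLAgent(reduced_action_set(s, feasible, T), eps, m)
          for s in stations]
prev_u = [0.0] * len(agents)
history = []
for cycle in range(20000):
    trajs = [a.action for a in agents]
    history.append(score(trajs, tasks))          # f(p(t)) over the cycles
    utils = [wonderful_life_utility(i, trajs, tasks)
             for i in range(len(agents))]
    for i, a in enumerate(agents):
        a.step(prev_u[i], utils[i])              # sets p_i(t+1)
    prev_u = utils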
5. CONCLUSION

We presented a game-theoretic approach to distributed planning of robot trajectories for optimal execution of cooperative tasks with time windows. We considered a setting where each task has a value, and it is completed if sufficiently many robots simultaneously spend one unit of time at the necessary location within the specified time window. Tasks keep arriving periodically over cycles with the same specifications, which are unknown a priori. In consideration of the recharging and maintenance requirements, the robots are required to start and end each cycle at their assigned stations, and they try to maximize the value of completed tasks by planning their own trajectories in a distributed manner based on their observations in the previous cycles. We formulated this problem as a potential game and presented how a payoff-based learning algorithm can be used to maximize the long-run average (over cycles) of the total value of completed tasks. Performance of the proposed approach was also demonstrated via simulations.

REFERENCES

Arsie, A., Savla, K., and Frazzoli, E. (2009). Efficient routing algorithms for multiple vehicles with no explicit communications.
IEEE Transactions on Automatic Control, 54(10), 2302–2317.
Arslan, G., Marden, J., and Shamma, J.S. (2007). Autonomous vehicle-target assignment: a game theoretical formulation. ASME Journal of Dynamic Systems, Measurement, and Control, 584–596.
Bhattacharya, S., Likhachev, M., and Kumar, V. (2010). Multi-agent path planning with multiple tasks and distance constraints. In IEEE International Conference on Robotics and Automation (ICRA), 953–959.
Blume, L.E. (1993). The statistical mechanics of strategic interaction. Games and Economic Behavior, 5(3), 387–424.
Boyd, S., Parikh, N., Chu, E., Peleato, B., and Eckstein, J. (2011). Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1), 1–122.
Bu, L., Babu, R., De Schutter, B., et al. (2008). A comprehensive survey of multiagent reinforcement learning. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 38(2), 156–172.
Elbanhawi, M. and Simic, M. (2014). Sampling-based robot motion planning: A review. IEEE Access, 2, 56–77.
LaValle, S.M. (2006). Planning Algorithms. Cambridge University Press.
Marden, J.R., Arslan, G., and Shamma, J.S. (2009). Cooperative control and potential games. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 39(6), 1393–1407.
Marden, J.R. and Shamma, J.S. (2012). Revisiting log-linear learning: Asynchrony, completeness and payoff-based implementation. Games and Economic Behavior, 75(2), 788–808.
Thakur, D., Likhachev, M., Keller, J., Kumar, V., Dobrokhodov, V., Jones, K., Wurz, J., and Kaminer, I. (2013). Planning for opportunistic surveillance with multiple robots. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 5750–5757.
Tumer, K. and Wolpert, D.H. (2004). Collectives and the Design of Complex Systems. Springer Science & Business Media.
Yazıcıoğlu, A.Y., Egerstedt, M., and Shamma, J.S. (2013). A game theoretic approach to distributed coverage of graphs by heterogeneous mobile agents. IFAC Proceedings Volumes, 46(27), 309–315.
Yazıcıoğlu, A.Y., Egerstedt, M., and Shamma, J.S. (2017). Communication-free distributed coverage for networked systems. IEEE Transactions on Control of Network Systems, 4(3), 499–510.
Zhu, M. and Martínez, S. (2013). Distributed coverage games for energy-aware mobile sensor networks. SIAM Journal on Control and Optimization, 51(1), 1–27.