Dynamic neighbourhood optimisation for task allocation using multi-agent
NIALL CREECH, King's College London, UK
NATALIA CRIADO PACHECO, King's College London, UK
SIMON MILES, King's College London, UK
In large-scale systems there are fundamental challenges when centralised techniques are used for task allocation. The number of interactions is limited by resource constraints such as on computation, storage, and network communication. We can increase scalability by implementing the system as a distributed task-allocation system, sharing tasks across many agents. However, this also increases the resource cost of communications and synchronisation, and is difficult to scale. In this paper we present four algorithms to solve these problems. The combination of these algorithms enables each agent to improve their task allocation strategy through reinforcement learning, while changing how much they explore the system in response to how optimal they believe their current strategy is, given their past experience. We focus on distributed agent systems where the agents' behaviours are constrained by resource usage limits, limiting agents to local rather than system-wide knowledge. We evaluate these algorithms in a simulated environment where agents are given a task composed of multiple subtasks that must be allocated to other agents with differing capabilities, which then carry out those tasks. We also simulate real-life system effects such as networking instability. Our solution is shown to solve the task allocation problem to within … of the theoretical optimal within the system configurations considered. It provides …× better performance recovery over no-knowledge-retention approaches when system connectivity is impacted, and is tested against systems of up to … agents with less than a … impact on the algorithms' performance.

CCS Concepts: • Computing methodologies → Multi-agent systems; Intelligent agents; Multi-agent planning; Mobile agents; Cooperation and coordination; Q-learning; Temporal difference learning; • Theory of computation → Multi-agent reinforcement learning; Multi-agent learning.

Additional Key Words and Phrases: Multi-agent systems, distributed task allocation, multi-agent reinforcement learning, MARL
ACM Reference Format:
Niall Creech, Natalia Criado Pacheco, and Simon Miles. 2021. Dynamic neighbourhood optimisation for task allocation using multi-agent learning.
ACM Trans. Autonom. Adapt. Syst.
0, 0, Article 0 ( 2021), 28 pages. https://doi.org/0000001.0000001
In a distributed task-allocation system (DTAS) there are interactions between many independent agents. These systems are increasingly seen in a wide range of real-world applications such as wireless sensor networks (WSN) [4, 6, 18, 28], robotics [7, 24], and distributed computing [20, 26]. The growing complexity and scope of these applications presents a number of challenges, such as responding to change, handling failures, and optimisation. System performance must also
be scalable with growth in the number of agents, being able to perform tasks given constraints in computational or storage resources. The challenges summarised below are shared across many diverse subject areas, meaning relevant and practical solutions become more generally applicable.

• Task allocation: how best to allocate tasks amongst agents in the system. An agent may have a goal that comprises a composite task requiring the completion of a number of sub-tasks by other agents [35].
• Resource management: allocating and optimising the use of resources to complete a task. For example, managing energy usage while performing a function within a physical environment [15, 32, 45].
• Dynamic networking: agent discovery and communication adaptability. Agents must be able to communicate with each other while connections are lost and created [5].
• Self-organisation: autonomously forming structures to complete a goal. Rigidly architected solutions are often non-applicable to dynamic systems with many unknowns, as designs would be too complex. To improve agents' adaptability in these situations, self-organising solutions can be used [1, 13, 14, 17, 25].

Formally designed agents can perform set tasks given a well-understood system. However, it is often not feasible to design algorithms that can predict the large variety of failures or changes that may occur in large-scale, real-world operating environments. In addition, as the systems become more complex there is an exponential growth in agents' state-action space size. This space represents the set of combinations of states they can be in, alongside the actions they may take in those states. Knowing this space before deploying the agents is often unrealistic, as is understanding which algorithms will perform optimally. Introducing a centralised source of continually updated information on the environment and other agents can increase the knowledge available to an agent about their state-action space, allowing for better optimisation. Approaches like this, such as the use of orchestrating agents, agents that specialise in coordinating other agents in the system, are used within distributed software architectures [21, 23, 27, 34] and robotics [3, 10]. However, even extending this method through clustering and consensus techniques to increase fault-tolerance, a central point of fragility is created.
As other agents' interactions and communications are channelled through these agents, congestion and bandwidth saturation problems also grow.

Distributed agent systems with learning enhancements such as multi-agent reinforcement learning (MARL) can provide the same functionality but distributed across agents, removing the focal points for orchestration and mitigating congestion issues, while still providing the knowledge sharing and action coordination that allow agents to optimise state-action space. With an increasing number of interacting agents, though, we see an exponential increase in the amount of communication within the system, eventually saturating bandwidth and exhausting computational resources. There is also an expectation of stability: that the solution to the agents' optimisation remains relatively stable, with a gradual reduction in the need for exploration of state-action space over time. In dynamic systems this often does not hold. MARL techniques also do not take account of the inherent risks involved in taking different types of actions, leading to catastrophic effects in areas such as robotics, where some actions may risk severe physical damage, or in financial systems, where large losses might be incurred [16, 22, 29, 41].

The overall problem can be summarised as how to provide for efficient task allocation in a dynamic multi-agent system while ensuring scalability as the number of tasks increases and the availability of agents changes. The solution presented uses a number of algorithms in combination, allowing an agent to determine the capability of other known agents to perform tasks, allocating these tasks, and carrying out other actions based on its current knowledge and the need to explore agent capability space. The algorithms introduced are:
• The agent task allocation with risk-impact awareness (ATA-RIA) algorithm allows each agent to choose a subset of other agents in the system based on how much it predicts those agents will help complete the sub-tasks of their overall composite task. They can learn the best task allocation strategy for these agents, but can also change which agents compose the group to improve performance.
• The reward trends for action-risks probabilities (RT-ARP) algorithm gives agents the ability to transform their exploration strategies given the trends in the rewards obtained over time. Using this algorithm, agents can increase the likelihood of taking actions that risk larger changes to their task allocation strategy, depending on their historical performance.
• The state-action space knowledge-retention (SAS-KR) algorithm intelligently manages the resources used by agents to maintain the information they have learned about state-action space and the effects of their actions.
• The neighbourhood update (N-Prune) algorithm selectively removes agents from the group considered for task allocation by an agent, constraining resource usage. This selection is based on not only how much an agent predicts the other agents will contribute to its composite task, but also how much uncertainty it has about that prediction, so complementing the ATA-RIA algorithm's behaviour.

We evaluate the effectiveness of these algorithms in combination through evaluation of their performance in a series of simulated multi-agent systems.

Section 2 covers the related research in the areas of MARL and multi-agent systems. In-depth analysis of the problem domain and motivation is looked at in Section 3, with the proposed solution and algorithm definitions in Sections 4 and 5. We cover evaluation of the algorithms' performance in system simulations in Section 6. Finally we discuss conclusions and future research in Section 7.
To provide some context for the work to follow, we look at some relevant research in multi-agent reinforcement learning (MARL). Although there are other useful strategies, such as auction-based systems and particle swarm optimisation techniques, these have specific challenges. Auction-based systems carry increasing orchestration costs as the number of agents involved increases, which impacts the scalability of related solutions. They also suffer significant impact when the system is dynamic, as agent communication is lost. Swarm approaches can be effective under dynamic conditions but are also prone to optimising on local optima [37]. As we look for an approach that can handle scaling and dynamic systems, we focus here on MARL. In particular, we look at methods of allocating rewards to drive behaviours, and how allocation affects both the exploration of state space and coordination between agents.
State space exploration in multi-agent reinforcement learning.
Multi-agent reinforcement learning (MARL) [8, 9, 40] applies reinforcement learning techniques to multiple agents sharing a common environment. Each senses the environment and takes actions that cause a transition of the environment state to a new state, resulting in feedback in the form of the reward signal. There are two main issues that can limit the applicability of MARL techniques.

Firstly, the exploration of large state-action spaces. As the state space size can increase exponentially in realistic scenarios, finding the right balance of exploration, so that agents can fully explore the expansive state space, and exploitation, so that they can successfully complete tasks, is difficult. The dimensionality of the system greatly increases with the number of agents, mainly due to the corresponding increases in the number of actions and states. An agent may not only have to learn about its own effects on the environment but also about the nature of other agents in the
system. The exploration/exploitation issue increases in difficulty with both a non-stationary environment and the dynamism of other agents' policies and actions.

Secondly, we need to assign credit for task outcomes to specific agents and actions. Since the rewards and values of actions result from multiple agents' contributions, it is difficult to share rewards fairly, as the effects of individual actions are not easily separable. The delay between an action and a successful outcome results in a temporal credit assignment problem, as discussed by Sutton et al. [38]. There is the additional issue of assigning rewards to individual agents in a collection of agents participating in an outcome, the structural credit assignment problem [2, 46]. The difficulty in assigning credit makes choosing a good reward function for the system complex [30]. We must understand alignment, how well the individual agents' own goal optimisation improves the overall system goal, and also sensitivity, how responsive the reward is to an agent changing its own actions. If a reward is sensitive then the agent can separate the effect of changes to its behaviour from the behaviour of other agents more easily. This means it can learn much quicker than when the impact of its actions is less clear. If we use system rewards, where system-wide knowledge is used to decide rewards, learning becomes tightly coupled to the actions of other agents in the system, leading to low sensitivity [42]. If we use local rewards, where we restrict reward calculation to an agent's local view only, we keep this coupling low. There is a risk, however, that the agents' behaviours could become non-beneficial to the system goal, or become stuck in local-minima solutions that are sub-optimal.
Coordination in agent-based systems.
Agents in MARL systems can range from being fully cooperative to fully competitive. In cooperative systems the agents all share a common reward function and try to maximise that shared value function. Dedicated algorithms often rely on static, deterministic, or exact knowledge of other agents' states and actions. Coordination and maximisation of joint-action states results in high dimensionality due to the inclusion of other agents' actions in calculations. We can utilise the sparseness of the interactions in large multi-agent systems to reduce the coupling between agents by having them work independently and only collect information about other agents when required; for example, by learning the states where some degree of coordination is required [11, 12, 33]. In general, coordination in multi-agent systems increases the optimality of solutions found, but at the cost of increased overhead, which limits scalability.

This past research highlights some of the key challenges that we look to tackle in our work:
(1) In large or complex systems the correct policies for agents' behaviour are not known at system initialisation, and may be constantly changing due to system dynamics.
(2) Since systems may be dynamic, the optimal solution may be constantly changing.
(3) For a scalable system, system-wide knowledge is not feasible to maintain or to compute with.
(4) Agents have physical constraints on compute and memory in real situations that limit their maximum resource usage.

To do this we need to develop the abilities for agents to:
(1) learn to make the best decisions given their current state;
(2) adapt how they explore state-space depending on how successful they currently are in task allocation;
(3) make decisions based only on a localised or otherwise partial view of the system;
(4) maintain their resource usage within set limits.

The four algorithms we present in the following sections are designed to tackle these issues and combine to form a scalable, resilient, and adaptive multi-agent task allocation solution.
In the following sections we introduce the elements of the multi-agent system problem and model the system.
Informally, we define a distributed task allocation system as a multi-agent system where a set of agents work together to perform a set of composite tasks. These composite tasks are formed from atomic tasks that can be executed by individual agents. Each agent has some capabilities to perform atomic tasks and is also able to coordinate and oversee the execution of a set of composite tasks. Each agent also has constraints on memory and communication, limiting the number of agents it can interact with and maintain information on. This in turn constrains the size of the neighbourhood of agents it can learn to allocate tasks to, and the amount of knowledge it can retain on the system's agents overall.
Definition 3.1 (Distributed Task Allocation System).
A distributed task-allocation system (DTAS) is defined by a tuple ⟨AT, CT, G⟩ where:
• AT = {at_1, ..., at_m} is a set of atomic tasks (or tasks for short), where each task at ∈ AT can be performed by a single agent;
• CT = {ct_1, ..., ct_k} is a set of composite tasks, where each composite task is formed by a set of atomic tasks (∀ct ∈ CT : ct ⊆ AT);
• G = {g_1, ..., g_n} is a set of agents, where each agent g ∈ G is defined by a tuple ⟨c, r, δ_n, δ_k⟩, where:
  • c ⊆ AP is the agent's capabilities, i.e., the atomic task types that the agent can perform;
  • r ⊆ CP is the agent's responsibilities, i.e., the composite task types that the agent can oversee;
  • δ_n, δ_k ∈ ℕ are the resource constraints of the agent, namely the communication and memory constraints (i.e., how many other agents a given agent can communicate with and know about).

Atomic tasks are of one of the atomic task types ap in the system, with composite task types cp defined by the types of their elements. We define type_a : AT → AP and type_c : CT → CP as the mappings of atomic and composite tasks to their respective task types, where type_c({at_1, .., at_n}) = {type_a(at_1), .., type_a(at_n)}.

Given an agent g, we denote by c(g), r(g), δ_n(g), δ_k(g) the capabilities, responsibilities, communication, and memory constraints of that agent, respectively. The communication constraints limit the number of agents that an agent can interact with at any one time, its neighbourhood, while the memory constraints limit the amount of information it can have about other agents in the system as a whole, its knowledge. Note that for all atomic tasks in the system there is at least one agent capable of performing it. Similarly, for all composite tasks in the system there is at least one agent responsible for overseeing it.

Composite tasks arrive in the system with a constant or slowly varying frequency distribution. The DTAS is capable of processing these tasks in the following way:
(1) A request to perform a composite task of a defined composite type arrives in the system.
(2) The composite task is allocated to an agent that can be responsible for tasks of that type.
(3) The agent decomposes the composite task into atomic tasks.
(4) The agent allocates these atomic tasks to other agents.
(5) Once all the atomic tasks have been completed, the composite task is complete.
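To make the model concrete, the following is a minimal Python sketch of these structures. The class and field names (Agent, AtomicTask, and so on) are illustrative assumptions, not the paper's implementation.

```python
from dataclasses import dataclass, field

# Atomic task types are plain strings, e.g. "temp" or "sal" (illustrative).
AtomicTaskType = str

@dataclass(frozen=True)
class AtomicTask:
    task_type: AtomicTaskType
    created_at: int  # arrival time t, which also acts as an identifier

@dataclass
class Agent:
    name: str
    capabilities: set[AtomicTaskType]         # c: atomic task types it can perform
    responsibilities: set[frozenset]          # r: composite task types it can oversee
    delta_n: int                              # communication constraint, |N(g)| <= delta_n
    delta_k: int                              # memory constraint, |K(g)| <= delta_k
    neighbourhood: set[str] = field(default_factory=set)  # N(g), by agent name
    knowledge: set[str] = field(default_factory=set)      # K(g), by agent name

def composite_type(ct: set[AtomicTask]) -> frozenset:
    """type_c: a composite task's type is the set of its atomic task types."""
    return frozenset(at.task_type for at in ct)
```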
To be able to allocate atomic tasks, agents need not only to be aware of the other agents in the system and their capabilities to execute tasks, but also to have communication links with them. Hence, the current state of an agent is determined by the agents it knows (i.e., its knowledge) and the agents it has links with (i.e., its neighbourhood).
Definition 3.2 (Agent State).
Given an agent g = ⟨c, r, δ_n, δ_k⟩, we define its state at a particular point in time as a tuple ⟨K, N⟩, where:
• K ⊆ G is the knowledge of the agent;
• N ⊂ K is the neighbourhood of the agent.

Note that |K| ≤ δ_k and |N| ≤ δ_n. Given an agent g, we denote by K(g), N(g) its knowledge and neighbourhood. Given a set of agents G, we denote by G_S the set formed by their states.

At a given point in time the system is required to perform a set of composite tasks R by a set of external agents E. For simplicity, we assume that only one request can be made at a given moment in time and, hence, time allows us to distinguish between different requirements to perform the same task. It therefore acts as an identifier for each composite task, and the associated atomic tasks, allocated to the system.

A requirement to perform a composite task is allocated to a particular agent. We represent this by tuples such as ⟨ct, t, g, e⟩, where ct ∈ CT, t ∈ ℕ is the time at which the request to perform the task was created, g ∈ G is the agent responsible for the completion of the composite task, the parent agent, and e ∈ E is the agent who requested the execution of the composite task. Note that agents can also be allocated the atomic tasks needed to complete a composite task; we term these child agents. We represent that as allocations where the set of tasks is formed by one task, ⟨{at}, t, g, a⟩, where a ∈ G is an agent capable of performing the atomic task at. In general, we denote by L the set of all allocations at a given point in time. The set is formed by tuples ⟨T, t, g, a⟩, where T is a list of atomic tasks (which can be defined as a composite task), t ∈ ℕ is the time at which the request to perform the task was created, g ∈ G is the agent which is allocated the task, and a ∈ (G ∪ E) is the agent which allocated the task.

Definition 3.3 (System State).
Given a DTAS we define its configuration as a tuple S = ⟨G_S, AL⟩ where:
• G_S is the set of states of all agents in the system;
• AL is the joint system allocation, the set of task allocations in the system.

Example 3.4 (Real-world systems).
In a marine-based WSN system, agents are equipped with sensors that can complete tasks to measure temperature, salinity, oxygen levels, and pH levels, so AP = {ap_temp, ap_sal, ap_oxy, ap_ph}. Each agent's capabilities may be a subset of these atomic task-types depending on which sensors they have, and whether they are functional. For instance c_g = {ap_sal, ap_oxy} if an agent g only has working sensors to measure salinity and oxygen levels. Some agents receive composite tasks from outside the system, requests for samples of combinations of these measurements, e.g. ct = {at_sal, at_oxy}. These agents then decompose these composite tasks into atomic tasks and allocate them to other agents to complete.

The DTAS's configuration changes as a result of the actions executed by the agents and actions taken by the external agents (e.g., users) who make requests to the system to execute a set of tasks. In the following we provide the operational semantics for the different actions that can be executed in a DTAS. For simplicity, we represent the knowledge about a particular agent by the agent identifier, but the knowledge also includes other information such as the agent's capabilities and qualities when performing particular actions, etc.

• Requirement Assignment. Every time the DTAS receives a new requirement from an external agent e to perform a composite task ct at a given time t, it is randomly assigned to an agent responsible for that task:

REQUIREMENT(e, ct) ∧ e ∈ E ∧ time(t) ∧ ∃g ∈ G : ct ∈ r(g)
⟨G_S, AL⟩ → ⟨G_S, AL ∪ {⟨ct, t, g, e⟩}⟩

where g is a randomly selected agent responsible for that composite task, and time just returns the current time of the DTAS.

• Allocation action. An agent g performing an allocation action allocates an atomic task that is currently allocated to it to a neighbourhood agent. The system state is updated accordingly:

ALLOC(g, at, n) ∧ g ∈ G ∧ at ∈ AT ∧ n ∈ N(g) ∧ ∃⟨T, t, g, a⟩ ∈ AL : at ∈ T
⟨G_S, AL⟩ → ⟨G_S, AL ∪ {⟨{at}, t, n, g⟩}⟩

• Execute action. If an agent is allocated an atomic task and is capable of performing it, at ∈ c(g), then it can perform an execute action, EXEC(g, at):

EXEC(g, at) ∧ g ∈ G ∧ at ∈ AT ∧ at ∈ c(g) ∧ ∃⟨T, t, g, a⟩ ∈ AL : at ∈ T
⟨G_S, AL⟩ → ⟨G_S, AL′⟩

where AL′ = {⟨T, t′, g, a⟩ | ⟨T, t′, g, a⟩ ∈ AL ∧ t′ ≠ t} ∪ {⟨T′, t, g, a⟩ | ⟨T, t, g, a⟩ ∈ AL ∧ T′ = T \ {at}}. After executing an atomic task with a given time identifier, all task allocations corresponding to that identifier are reviewed so that the atomic task is removed from the list of pending tasks.

• Information action. An agent can request information on other agents in the system, from an agent in its neighbourhood, by carrying out an info action.
INFO(g, t, n) ∧ g ∈ G ∧ time(t) ∧ n ∈ N(g)
⟨G_S, AL⟩ → ⟨G_S, AL ∪ {⟨{info}, t, n, g⟩}⟩

where info is a special information atomic task that is not part of any composite task.

• Provide Information. Agents who are allocated an info action execute that action by providing information about a randomly selected agent from their knowledge:
PROVIDE_INFO(g, n, u) ∧ g ∈ G ∧ n ∈ N(g) ∧ u ∈ K(g) ∧ ⟨{info}, t, g, a⟩ ∈ AL
⟨G_S, AL⟩ → ⟨G′_S, AL \ {⟨{info}, t, g, a⟩}⟩

where G′_S = {⟨K(g′), N(g′)⟩ | ∀g′ ∈ (G \ {n})} ∪ {⟨K(n) ∪ {u}, N(n)⟩}

• Remove Info. An agent g ∈ G can remove information about an agent from its knowledge as long as that agent is not in its neighbourhood:

REMOVE_INFO(g, k) ∧ g ∈ G ∧ k ∈ K(g) ∧ k ∉ N(g)
⟨G_S, AL⟩ → ⟨G′_S, AL⟩

where G′_S = {⟨K(g′), N(g′)⟩ | ∀g′ ∈ (G \ {g})} ∪ {⟨K(g) \ {k}, N(g)⟩}
• Link action. An agent can add a known agent into its neighbourhood by taking a link action, LINK(g, k):
LINK(g, k) ∧ g ∈ G ∧ k ∈ K(g) ∧ |N(g)| < δ_n(g)
⟨G_S, AL⟩ → ⟨G′_S, AL⟩

where G′_S = {⟨K(g′), N(g′)⟩ | ∀g′ ∈ (G \ {g})} ∪ {⟨K(g), N(g) ∪ {k}⟩}

• Remove Link. An agent g ∈ G can remove an agent n from its neighbourhood by taking a remove link action, REMOVE_LINK(g, n):

REMOVE_LINK(g, n) ∧ g ∈ G ∧ n ∈ N(g)
⟨G_S, AL⟩ → ⟨G′_S, AL⟩

where G′_S = {⟨K(g′), N(g′)⟩ | ∀g′ ∈ (G \ {g})} ∪ {⟨K(g), N(g) \ {n}⟩}

We map a given action a to one of the defined action-categories above as category(a).
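As a rough illustration of how two of these rules act on the joint system allocation, here is a hedged Python sketch of the ALLOC and EXEC transitions, reusing the data structures sketched earlier and assuming allocations are stored as (tasks, time, holder, allocator) tuples. This is a hypothetical rendering, not the authors' implementation.

```python
# Allocations are tuples (tasks, t, holder, allocator): tasks allocated at time t
# to agent `holder` by `allocator`.

def alloc(AL: set, g: Agent, at: AtomicTask, n: str, t: int) -> set:
    """ALLOC(g, at, n): g delegates atomic task at to a neighbourhood agent n."""
    assert n in g.neighbourhood, "can only allocate to neighbours"
    return AL | {(frozenset({at}), t, n, g.name)}

def execute(AL: set, g: Agent, at: AtomicTask) -> set:
    """EXEC(g, at): g completes at; remove it from the pending allocations.
    The paper removes at from allocations sharing its time identifier; here
    we simply remove it wherever it appears."""
    assert at.task_type in g.capabilities, "agent lacks the capability"
    new_AL = set()
    for tasks, t, holder, allocator in AL:
        if at in tasks:
            tasks = tasks - {at}  # drop the completed atomic task
        if tasks:                 # discard allocations with no tasks left
            new_AL.add((tasks, t, holder, allocator))
    return new_AL
```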
Example 3.5 (Actions). An agent in a marine WSN receives a composite task ct = {at_sal, at_oxy}. Since agent g has a working salinity measuring sensor, ap_sal ∈ c_g, it can complete the task at_sal itself, and so performs the action EXEC(g, at_sal). As it doesn't have a sensor to detect oxygen levels, it cannot complete tasks of that type, ap_oxy ∉ c_g, and so it allocates this task to another agent n through the action ALLOC(g, at_oxy, n).

Given the set of all possible actions A, let A_g be all the actions that can be taken by an agent g. Finally, we define the child target actions of an agent g as those of its actions that interact with a set of other agents G, written A_{g≻G}, where A_{g≻G} ⊂ A_g : ∀a ∈ A_{g≻G}, category(a) ∈ {ALLOC, INFO, PROVIDE_INFO}.

In general we denote an allocation of atomic tasks AT to a set of agents G as al : AT × G → AT × G, where each atomic task forms a tuple with the agent it is allocated to. If AT represents all the current atomic tasks in the system then this is the joint system allocation, AL. The set of current atomic tasks an agent has been allocated but is yet to complete are its concurrent allocations, |al(AT, g)|, which we abbreviate as |AL_g|. On completing a task, an agent gives an atomic task quality, which depends on the task type and the agent's concurrent allocations, ω_g : AP × ℕ → ℝ≥0. Therefore the allocation quality of an allocation of tasks AT to agents G will depend on the joint system allocation as a whole:

ql(AT, G, AL) = Σ ω_g(type_a(at), |AL_g|), ∀(at, g) ∈ al(AT, G)    (1)

We can then simply define the utility of the system.

Definition 3.6 (System utility). If N atomic tasks are completed during a time period T, then the system utility is the sum of the allocation qualities of all these tasks:

u(T) = Σ_{i=1}^{N} ql_i(AT, G, AL)    (2)

The range of allocations that an agent can achieve is bounded by its neighbourhood. An allocation may be non-optimal, locally-optimal, system-optimal, or non-allocable. The optimal allocation of a set of tasks AT to a set of agents G, within a system with joint system allocation AL, is the allocation that maximises the allocation quality:

ol*(AT, G, AL) = argmax_{al(AT, G′), ∀G′ ⊆ G} ql(AT, G′, AL)    (3)
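A small sketch of how the allocation quality of Eq. (1) and the system utility of Eq. (2) could be computed. The paper only states that ω_g depends on the task type and the concurrent load |AL_g|; the 1/(1+load) degradation used here is an assumed stand-in for that resource-pressure effect.

```python
from collections import Counter

def allocation_quality(allocation, omega, load):
    """ql(AT, G, AL) from Eq. (1): sum the atomic task qualities omega_g over
    an allocation given as (atomic_task, agent_name) pairs. `omega` maps
    omega[agent][task_type] to a base quality; `load` is a Counter of each
    agent's concurrent allocations |AL_g|."""
    total = 0.0
    for at, g in allocation:
        # Assumed resource-pressure model: quality degrades with concurrent load.
        total += omega[g][at.task_type] / (1 + load[g])
    return total

def system_utility(completed, omega, load):
    """u(T) from Eq. (2): sum of allocation qualities over a time period."""
    return sum(allocation_quality(a, omega, load) for a in completed)
```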
Definition 3.7 (Locally optimal allocation). There exists a locally optimal allocation of tasks AT to the neighbourhood of an agent g, within a system with joint allocation AL, that gives the optimal allocation possible for that neighbourhood:

ol*_loc(AT, g, AL) = ol*(AT, N(g), AL)    (4)

This allows us to define an optimal neighbourhood of an agent g given a set of tasks AT: the neighbourhood within the system that gives the maximum possible locally optimal allocation.

on*(AT, g, AL) = argmax_{N(g)} ol*(AT, N(g), AL)    (5)

Definition 3.8 (System-optimal allocation).
The system-optimal allocation for an agent given a set of tasks is the optimal allocation of those tasks to the optimal neighbourhood:

ol*_sys(AT, g, AL) = ol*(AT, on*(AT, g, AL), AL)    (6)

Definition 3.9 (Optimal joint system allocation).
The optimal joint system allocation AL* is the joint system allocation of tasks AT over all agents in the system G that maximises the sum of allocation qualities:

al*(AT, G) = argmax_{AL} Σ ql(al(at, g), AL), ∀at ∈ AT, ∃g ∈ G    (7)

The different agent capabilities mean that there are a limited number of agents that can complete a given atomic task type, increasing the resource-pressure effect on the quality of atomic tasks. Given this, there exists system-wide competition between parent agents for child agents' resources, which can change individual optimal allocation solutions compared to where there is no competition.

Theorem 3.10 (Allocation state).
An agent g is allocated a composite task ct composed of a set of atomic tasks. If the agent has a set of neighbours N(g) then one of the following will be true:
(1) For each atomic task in the composite task, the capability required to complete the task is provided by one of the agents in the neighbourhood, ∃⟨c, r, δ_n, δ_k⟩ ∈ N(g). The composite task can be successfully allocated to g and completed. A locally optimal allocation exists.
(2) The capabilities required for the atomic tasks composing the composite task cannot be provided by agents within the neighbourhood. The composite task can be allocated to g but cannot be successfully completed.

Note that the optimal knowledge for an agent g is simply the knowledge base containing the agents within the optimal neighbourhood, ok*(AT, g, AL) = on*(AT, g, AL).
Theorem 3.11. If all sets of neighbourhoods in a system are pairwise disjoint then the optimal joint allocation is the union of all system-optimal allocations:

N(g_1) ∩ N(g_2) = ∅, ∀(g_1, g_2) ∈ (G × G) where g_1 ≠ g_2 ⟹ AL* = ⋃_{g ∈ G} ol*_sys(AT, g, AL)    (8)

Theorem 3.12 (Resource contention in non-disjoint neighbourhoods). If not all neighbourhoods in the system are pairwise disjoint then there can be resource contention on the agents in the intersection of the neighbourhoods. If the impact of resource contention on allocation quality is sufficient, then the optimal joint allocation may no longer be the union of all system-optimal allocations. In this case the optimal joint allocation of tasks cannot be decomposed and solved independently, and must be solved centrally, greatly increasing the complexity of the solution.
Given a set of agents G and a set of composite tasks CT, how can we find the optimal joint allocation al*(AT, G) of atomic tasks when the capabilities c_g and task qualities ω_g of the agents are dynamic and unknown, and thereby maximise the system utility u(T)? We separate this into two main sub-problems:
(1) Given a fixed local neighbourhood, how can an agent g find the optimal local allocation ol*_loc that returns the optimal quality?
(2) How does an agent find the optimal neighbourhood on*, within the set of all possible neighbourhoods it can achieve, containing the system-optimal allocation ol*_sys?

We now give a high-level introduction to our algorithms for solving the task-allocation problem. The concepts and notation will be covered in more depth in Section 5.

• The agent task allocation with risk-impact awareness (ATA-RIA) algorithm learns to take actions to optimise the task-allocation problem described. Its main purpose is to integrate the following three algorithms, as well as updating Q-values and sample data. It also makes action selections based on measured progress towards composite task completion (see Figure 1).
• The reward trends for action-risks probabilities (RT-ARP) algorithm increases the probability of an agent taking neighbourhood-altering actions, and increases exploration, when the possible optimal allocation achievable in its current neighbourhood is relatively poor compared to previous neighbourhoods.
• The state-action space knowledge-retention (SAS-KR) algorithm implements a knowledge-retention scheme under dynamic neighbourhood changes. This removes the parts of an agent's knowledge less relevant to the optimisation problem, so the agent can stay within resource bounds.
• The neighbourhood update (N-Prune) algorithm maintains an agent's neighbourhood within resource constraints by removing information on child agents based on their recent relative contribution to task completion quality.

In these algorithms we utilise some standard functions, which we summarise in Table 1.
The agent task allocation with risk-impact awareness (ATA-RIA) algorithm

The agent task allocation with risk-impact awareness (ATA-RIA) algorithm integrates the RT-ARP, SAS-KR, and N-Prune algorithms to provide a framework for optimising task-allocation in a multi-agent system (see Algorithm 1).
Fig. 1. Flowchart of the ATA-RIA algorithm. On receiving a composite task, an agent can carry out EXEC or PROVIDE_INFO actions immediately, or will choose amongst ALLOC, INFO, and LINK using the RT-ARP algorithm. Taking an INFO or LINK action will lead to knowledge removal through the SAS-KR algorithm or neighbourhood pruning through the N-Prune algorithm respectively.
Table 1. Summary of standard functions
• sumnorm(X): f(x_i) = x_i / Σ_{j=1}^{N} x_j. Unit normalisation; uniformly scales all values in X into the range ℝ[0,1], where Σ sumnorm(X) = 1.
• softmax(X): σ(x_i) = e^{x_i} / Σ_{j=1}^{N} e^{x_j}. Softmax normalisation; scales all N values in X into the range ℝ[0,1].
• rand(X): P(x_i) = U(x_i). Selects an element from the set X using the uniform distribution.
• boltzmann(Q): P(a_i) = e^{q_i/τ} / Σ_{j=1}^{N} e^{q_j/τ}. Selects an element from the N elements of Q = {(a_j, q_j)}_{j=1}^{N} using the resulting probability distribution and a temperature value τ.

The algorithm chooses between the actions an agent can take. It then updates the Q-values of each action selected, based on the quality values returned, using the temporal-difference update algorithm described later in Section 5.2.3. We detail the steps when an agent is allocated a composite task below.
(1) Execute an atomic task if the agent has the capability to do it [lines 2-6].
(2) Otherwise choose an action based on RT-ARP [line 8].
(3) Carry out the action and update the set of outputs, qualities, neighbours, and knowledge [lines 9-20].
(4) Prune the knowledge base using SAS-KR to keep within the agent's resource bounds [line 17].
(5) Prune the neighbourhood using N-Prune to keep within the agent's resource bounds [line 20].
(6) Update the Q-values using temporal-difference learning [line 22].
(7) Update the action samples [line 24].
(8) Repeat until all of the atomic tasks in the composite task are completed.
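The standard functions of Table 1 are simple enough to state directly; a minimal Python sketch follows, with names mirroring the table.

```python
import math
import random

def sumnorm(xs):
    """Unit normalisation (Table 1): scale values so they sum to 1."""
    s = sum(xs)
    return [x / s for x in xs]

def softmax(xs):
    """Softmax normalisation (Table 1): scale values into [0, 1]."""
    exps = [math.exp(x) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def boltzmann(q_pairs, tau=1.0):
    """Boltzmann selection (Table 1) over (action, q) pairs at temperature tau."""
    actions, qs = zip(*q_pairs)
    weights = [math.exp(q / tau) for q in qs]
    return random.choices(actions, weights=weights, k=1)[0]
```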
The reward trends for action-risks probabilities (RT-ARP) algorithm

The reward trends for action-risks probabilities (RT-ARP) algorithm estimates the possible optimal allocation of an agent's current neighbourhood, relative to previous neighbourhood estimates, using a TSQM (see Algorithm 2). It then takes the current Q-values for an agent and transforms them based on this estimate through the impact transformation function. The effect is to increase the probability of an agent taking neighbourhood-altering actions, and to increase the exploration factor, when the current neighbourhood is estimated to have a lower possible optimal allocation than historical neighbourhoods. The steps are:
(1) Generate an impact transformation function from the current TSQM [line 2].
(2) Calculate the impact values of actions based on the area under the impact transformation graph [line 2].
(3) Transform the current Q-values using the impact values. This increases the probability of taking neighbourhood-altering actions when in lower quality neighbourhoods [line 3].
(4) Transform the exploration factor of the agent using the impact transformation function and use this for e-greedy action selection. This means more exploration when recent neighbourhoods have lower quality optimal allocations achievable [lines 4-5].
(5) Either take the maximum-Q action amongst the transformed Q-values or use random Boltzmann selection based on the transformed exploration factor. The normalised, transformed Q-values are used as the probability distribution for action selection [lines 7-9].
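A compressed, hedged sketch of this selection step, reusing the sumnorm and boltzmann helpers sketched earlier. The impact and it_half parameters stand in for the impact-transformation scalings it(W) and the transformed exploration input; these interface details are assumptions, not the paper's notation.

```python
import random

def rt_arp_select(q_pairs, impact, eps_base, it_half):
    """Sketch of RT-ARP selection (following Algorithm 2 as written: the
    max-Q branch is taken with probability eps). q_pairs: (action, q) list;
    impact: per-action it(W) scalings; it_half: impact exploration factor."""
    # Scale Q-values element-wise by the impact transformation, then normalise.
    scaled_qs = sumnorm([q * w for (_, q), w in zip(q_pairs, impact)])
    scaled = [(a, q) for (a, _), q in zip(q_pairs, scaled_qs)]
    eps = eps_base * it_half  # scale the base exploration value
    if random.random() < eps:
        return max(scaled, key=lambda aq: aq[1])[0]  # best transformed action
    return boltzmann(scaled)                         # Boltzmann exploration
```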
The state-action space knowledge-retention (SAS-KR) algorithm

The state-action space knowledge-retention (SAS-KR) algorithm removes learned Q-values and knowledge, based on the action information quality, to stay within the bounds of an agent's resource constraints (see Algorithm 3).
(1) Find all of an agent's Q-values that involve agents that are in its knowledge base but not its neighbourhood [line 1].
(2) Calculate the action information quality based on the staleness and the number of times actions have been taken [line 2].
(3) Remove all knowledge of actions that have a value below a threshold value [line 4].
(4) Remove all knowledge of an external agent if there are no actions in an agent's Q-values that target the external agent [lines 5-6].
(5) Check if the size of the knowledge base exceeds its limit [line 9].
(6) Remove a random agent from the knowledge base [line 10].

The neighbourhood update (N-Prune) algorithm
The neighbourhood update (N-Prune) algorithm ensures that an agent's neighbourhood is maintained at a size that bounds it within resource constraints (see Algorithm 4). Each child agent's contributions to task quality values are summed, with decay used to reduce the relevance of older values. The information on the agents with the lowest contribution is then removed.
(1) Compare the neighbourhood size with the resource limits [line 1].
(2) If the neighbourhood is too big and we have accumulated some quality values, then select the agent that has produced the poorest quality value returns and remove it from the neighbourhood [lines 2-3].
(3) If the neighbourhood is too big and there are no quality values available, then remove a random agent [line 5].
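A minimal sketch of this pruning rule, assuming action samples are (agent, t, omega) triples; the decay and now parameters are assumptions, since the paper only states that decay reduces the relevance of older values.

```python
import random
from collections import defaultdict

def n_prune(neighbourhood, samples, delta_n, decay=0.95, now=0):
    """Sketch of N-Prune: shrink `neighbourhood` (a set of agent names) to
    delta_n by removing the lowest-contribution neighbours."""
    contrib = defaultdict(float)
    for agent, t, omega in samples:
        contrib[agent] += omega * decay ** (now - t)  # older returns count less
    while len(neighbourhood) > delta_n:
        if contrib:
            worst = min(neighbourhood, key=lambda n: contrib.get(n, 0.0))
        else:
            worst = random.choice(sorted(neighbourhood))  # no samples: random
        neighbourhood.discard(worst)
    return neighbourhood
```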
Algorithm 1: The agent task allocation with risk-impact awareness (ATA-RIA) algorithm

Input: g, the agent allocated the composite task
Input: ct, the composite task allocated to the agent
Input: AT⊖, the composite task's currently unallocated atomic tasks
Input: qm(g, type_a(AT⊖)), the Q-value mappings for agent g
Input: W, the potential change to neighbourhoods on taking an action
Input: Λ_g, the TSQM of summarised reward trends for agent g
Input: α, a value in ℝ[0,1] weighting the rate of Q-value update
Input: λ, a value in ℝ[0,1] weighting the importance of future rewards
Input: μ̂_min, the information retention threshold
Input: Ψ, the set of action samples
Result: N(g), updates to the neighbourhood of agent g
Result: K(g), updates to the knowledge base of agent g
Result: qm(g, type_a(AT⊖)), updates to the Q-mapping of agent g
Result: Ψ, updates to the set of action samples

1   for at ∈ ct do
2       if type_a(at) ∈ c(g) then            // execute atomic task if capable
3           EXEC(g, at)
4           if at is successfully completed then
5               ct ← ct − {at}
6           end
7       else
8           a ← RT-ARP(type_a(AT⊖), W, Λ, ε_base)   // select an action
9           if category(a) = ALLOC(g, at, n) then
10              ALLOC(g, at, n)
11              if at is successfully completed then
12                  ct ← ct − {at}
13              end
14          else if category(a) = INFO(g, t, n) then
15              k ← INFO(g, t, n)            // get new agent k from action
16              K(g) ← K(g) ∪ {k}
17              SAS-KR(AT⊖, N(g), K(g), Ψ_g, μ̂_min)   // prune knowledge base
18          else if category(a) = LINK(g, k) then
19              LINK(g, k)                   // add new agent to neighbourhood
20              N-Prune(N(g), Ψ)             // prune neighbourhood
21          end
22          qm(g, type_a(AT⊖)) ← td(g, type_a(AT⊖), type_a(AT⊖), ω, α, λ)   // update Q-values
23          updatetqsm(Λ_g, ω)               // update the TSQM with the quality value
24          Ψ ← Ψ ∪ {(a, t, ω)}              // update action samples
25      end
26  end
27  return (N(g), K(g), qm(g, type_a(AT⊖)), Ψ)
Algorithm 2: The reward trends for action-risks probabilities (RT-ARP) algorithm
Input: AT⊖, the set of unallocated atomic tasks of agent g
Input: W, the action-risk values for the available actions
Input: Λ, the TSQM used to generate the transformation function
Input: ε_base, the base exploration factor for the learning algorithm
Result: a, the action for the agent to carry out

1   (A, Q) ← qm(g, type_a(AT⊖))
2   (A, Q) ← (A, Q ∘ it(W))        // scale Q-values element-wise using the impact transformation
3   (A, Q) ← (A, sumnorm(Q))
4   ε_ief ← it(·)                  // calculate the impact exploration factor
5   ε ← ε_base × ε_ief             // scale the base exploration value
6   // Select the best action, or explore with Boltzmann selection
7   if rand(ℝ[0,1]) < ε then
8       (a, q) ← max_q(A, Q)
9   else (a, q) ← boltzmann_q(A, Q)
10  end
11  return a

Algorithm 3: The state-action space knowledge-retention (SAS-KR) algorithm

Input: AT⊖, the set of unallocated atomic tasks of agent g
Input: N(g), the neighbourhood of agent g
Input: K(g), the knowledge base of agent g
Input: Ψ, the set of action samples
Input: qm(g, type_a(AT⊖)), the Q-values for agent g
Input: μ̂_min, the information retention threshold
Result: K(g), updates to the knowledge of agent g
Result: Ψ, updates to the action samples
Result: qm(g, type_a(AT⊖)), updates to the Q-mappings of agent g

1   for (a, q) ∈ Q⊖ do                       // for all Q-values with unavailable actions
2       if mv(Ψ, a, t) < μ̂_min then          // test the information retention threshold
3           Ψ ← Ψ − {(a, t, ω) : (a, t, ω) ∈ s(Ψ, {a})}   // remove all samples of action a
4           qm(g, type_a(AT⊖)) ← qm(g, type_a(AT⊖)) − {(a, q)}   // remove the action's Q-values
5           X ← {x : ∀(a, q) ∈ Q, x ∈ K(g), a ∉ A_{g≻x}}   // agents not targeted by any action in Q
6           K(g) ← K(g) − X
7       end
8   end
9   while |K(g)| > δ_k(g) do                 // check if knowledge size exceeds the resource limit
10      K(g) ← K(g) − rand(K(g) − N(g))      // remove a random non-neighbourhood agent
11  end
12  return (K(g), Ψ, qm(g, type_a(AT⊖)))
Algorithm 4: The neighbourhood update (N-Prune) algorithm

Input: N(g), the neighbourhood of the agent
Input: Ψ, the set of action samples
Result: N(g), the updated neighbourhood of the agent

1   while |N(g)| > δ_n(g) do      // check if neighbourhood size exceeds the resource limit
2       if |Ψ_g| > 0 then
3           n ← mqn(Ψ, g)         // the neighbour that has returned the lowest total quality
4       else
5           n ← rand(N(g))        // choose a random neighbour agent
6       end
7       N(g) ← N(g) − {n}         // remove the neighbour agent
8   end
9   return N(g)

Next we detail the concepts and definitions that are used within our algorithms so that task allocation can be optimised through the use of reinforcement learning. We see how the probability of agents taking different types of actions can be changed based on previous experiences. Risk-impact awareness is also an important aspect in predicting whether certain actions will increase or decrease the likelihood of agents achieving optimal allocation solutions.
To use an agent's historical performance to alter future behaviours we need to collect information on past actions and their outcomes. We do this through action sample tuples ψ = ⟨a, t, ω⟩, where a is an action taken at time t that gave quality ω. We define the action sample selection function to allow us to specify subsets of action samples, s(Ψ, A) = {(a, t, ω) : ∀(a, t, ω) ∈ Ψ, ∃a ∈ A}. For convenience we also define the set of agent action samples, those samples involving a particular agent's actions, Ψ_g = s(Ψ, A_g), and the latest action sample in a set of samples, ls(Ψ, A) = max_t s(Ψ, A).

We first make the assumption that the predictability of an action's outcome increases with the recentness and frequency of samples of the action. This allows us to define the action information quality, a proxy for the value of information collected about an action a at time t, given the set of action samples Ψ:

mv(Ψ, a, t) = |s(Ψ, {a})| / (t − ls(Ψ, {a}))    (9)

The uncertain information threshold μ̂_min is then chosen as the minimum required action information quality value, below which actions are considered discardable. We define neighbour information quality as the sum of the quality values of all action samples Ψ of an agent g that refer to actions involving agents in a set G:

nq(Ψ, g, G) = Σ ω, ∀(a, t, ω) ∈ Ψ : a ∈ A_{g≻G}    (10)
Definition 5.1 (Minimum Quality Neighbour).
The minimum quality neighbour of an agent g is the child agent that generates the minimum neighbour quality:

mqn(Ψ, g) = argmin_{∀x ∈ N(g)} nq(Ψ, g, {x})    (11)

For all possible actions an agent can take, there is a probability that taking that action in the current state will increase future composite task qualities. When an action is taken, these estimates can be improved in accuracy based on the actual quality values returned.
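A sketch of these three sample-based metrics (Eqs. (9)-(11)), assuming each action object records the agent it targets via a hypothetical target attribute.

```python
def action_info_quality(samples, a, t):
    """mv(Psi, a, t) from Eq. (9): number of samples of action a divided by
    the time since its latest sample. `samples` holds (action, t, omega)."""
    of_a = [(act, ts, w) for act, ts, w in samples if act == a]
    if not of_a:
        return 0.0
    latest = max(ts for _, ts, _ in of_a)
    return len(of_a) / (t - latest) if t > latest else float("inf")

def neighbour_info_quality(samples, g_targets):
    """nq(Psi, g, G) from Eq. (10): summed quality of samples whose actions
    target agents in g_targets (via the assumed `target` attribute)."""
    return sum(w for act, _, w in samples if act.target in g_targets)

def min_quality_neighbour(samples, neighbourhood):
    """mqn(Psi, g) from Eq. (11): the child agent with minimum neighbour quality."""
    return min(neighbourhood, key=lambda x: neighbour_info_quality(samples, {x}))
```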
As previously mentioned in Definitions 3.2 and 3.3, the system state can be specified as ⟨G_S, AL⟩, where the set of agent states G_S is defined by the knowledge and neighbourhood of each agent, ⟨K, N⟩. For each state there exist Q-value tuples, Q = (a, q), q ∈ ℝ[0,1], where q is the likelihood that a is the optimal action to perform in the current state. Q-values are mapped to each agent and atomic task type, qm : G × AP → Q.

Not all of the actions an agent knows of are available for it to take. For example, an agent g cannot perform an allocation action ALLOC(g, at, n) if n ∈ K(g) but n ∉ N(g). We refer to these unavailable actions A⊖_g as actions that involve agents in an agent's knowledge base that are not currently in its neighbourhood. An agent's set of available actions A⊕_g are the actions it can take given its unallocated atomic tasks, neighbourhood, and knowledge. We can then additionally define Q⊕ and Q⊖ as available and unavailable Q-values respectively, those values that refer to available or unavailable actions.
Q-mapping values are updated using a temporal-difference algorithm (TD-Update). This is a standard reinforcement learning method of updating a set of learning-values from a set of quality values or rewards [39]. An agent with unallocated atomic tasks, AT⊖, will take an action from the set of Q-values and receive a quality value ω. We then update the action's associated optimal likelihood q using the temporal-difference update algorithm (TD-Update):

qm_t(g, type_a(AT⊖)) ← (1 − α) · qm_t(g, type_a(AT⊖)) + α · [ω + λ · max_a qm_{t+1}(g, type_a(AT⊖))]    (12)

where the first term is the current value and the bracketed term is the learned value, combining the reward ω with the discounted future estimate.

Some actions change an agent's neighbourhood or knowledge base. Predicting whether these actions will improve task allocations in the future is useful for agents in making an action selection. To enable agents to make decisions we:
(1) Define the impact of the different categories of actions on both an agent's neighbourhood and knowledge.
(2) Estimate the probability that actions generating impact will actually occur.
(3) Combine these factors to define action impact.
(4) Detail algorithms based on historical quality values to predict which action impacts will have a positive effect on task completion quality.
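As a concrete reference for the update in Eq. (12), a minimal sketch; alpha and lam correspond to the paper's α and λ.

```python
def td_update(q, omega, q_next_max, alpha=0.1, lam=0.9):
    """TD-Update from Eq. (12): blend the current Q-value with the observed
    quality omega plus the discounted best future estimate."""
    return (1 - alpha) * q + alpha * (omega + lam * q_next_max)

# Example: current q = 0.5, observed quality 0.6, best next estimate 0.8:
q_new = td_update(0.5, 0.6, 0.8)  # 0.9*0.5 + 0.1*(0.6 + 0.9*0.8) = 0.582
```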
There is an impact on possible allocation quality if an agent takes actions that change its neighbourhood, as the optimal allocation quality for a fixed set of atomic tasks will often be different. This neighbourhood impact of an agent changing its neighbourhood from a set of agents X to Y, within a system with joint system allocation AL, is the difference between the locally optimal allocation qualities of all the atomic tasks AT to be allocated in each respective neighbourhood:

ni(AT, X, Y, AL) = ql*(AT, Y, AL) − ql*(AT, X, AL)    (13)

Definition 5.2 (Maximum neighbourhood impact).
The maximum neighbourhood impact is the maximum possible neighbourhood impact given a set of atomic tasks AT and all combinations of neighbourhoods that can be formed from agents in the knowledge base K:

mni(AT, K, AL) = max_{∀(X,Y) ⊆ (K×K)} ni(AT, X, Y, AL)    (14)

Definition 5.3 (Knowledge impact).
The knowledge impact of an agent changing its knowledge from a set of agents J to K is the difference between the maximal neighbourhood impacts:

ki(AT, J, K, AL) = mni(AT, K, AL) − mni(AT, J, AL)    (15)

Example 5.4 (Impact).
An agent g in a marine WSN system has a neighbourhood N(g) = {n_1}, to which it is allocating oxygen reading tasks, ap_oxy, and a knowledge base K(g) = {n_1, n_2, n_3}. If n_2 returns much worse qualities for completing tasks of that type than n_1 (for example, due to low battery levels), and n_3 much better, then ql*({at_oxy}, {n_2}, AL) << ql*({at_oxy}, {n_1}, AL) << ql*({at_oxy}, {n_3}, AL). In this case, if g were to take an action to replace n_1 with n_2, this would give ni({at_oxy}, {n_1}, {n_2}, AL) < 0, a negative impact. In contrast, taking an action that replaces n_1 with n_3 would give ni({at_oxy}, {n_1}, {n_3}, AL) > 0, which is then the maximum neighbourhood impact given the knowledge base {n_1, n_2, n_3}.

The quality of a composite task on completion is the result of which agents the atomic tasks are allocated to. Since neighbourhoods and knowledge are dynamic, agents are continually added and removed. Therefore there is a probability that an agent will be part of the neighbourhood but never contribute to the quality of a composite task before it is completed. The neighbourhood impact probability p^N_{x∩y} is the probability of an action being taken that involves an agent in the intersection of two overlapping neighbourhoods, X ∩ Y ≠ ∅. The knowledge impact probability p^K_{j∩k} is the probability of an action being taken that involves an agent in the intersection of two overlapping knowledge bases, J ∩ K ≠ ∅. The action impact is the expected value of the change in allocation quality if an action a is taken, where taking the action changes the neighbourhood from X → Y and the knowledge base from J → K:

ai(AT, X, Y, J, K, AL) = p^N_{x∩y} · ni(AT, X, Y, AL) + p^K_{j∩k} · ki(AT, J, K, AL)    (16)

As calculating the impact of different types of action can quickly become non-tractable in a dynamic system, we use estimates based on properties such as whether they change the state of neighbourhoods or knowledge bases, and the probabilities of the impact actually occurring given the system's size.
Definition 5.5 (Action-impact values). Action-impact values W are estimated values for the maximum action impacts of each action-category, category(a). We assume that both |Y − X| ∈ {0, 1} and |K − J| ∈ {0, 1} for all actions. We also assume that AL is large enough to remain approximately constant despite any allocation change or resource pressure resulting from the action.

W = {(category(a), âi(AT, X, Y, J, K, AL)) : ∀a ∈ A}    (17)
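A sketch of the impact calculations of Eqs. (13) and (16); ql_star is assumed to be supplied externally, since computing the locally optimal allocation quality is itself the hard part of the problem.

```python
def neighbourhood_impact(ql_star, AT, X, Y, AL):
    """ni from Eq. (13): change in locally optimal allocation quality when
    the neighbourhood changes from X to Y."""
    return ql_star(AT, Y, AL) - ql_star(AT, X, AL)

def action_impact(p_n, ni, p_k, ki):
    """ai from Eq. (16): the expected allocation-quality change of an action,
    weighting its neighbourhood and knowledge impacts by the probabilities
    that those impacts actually materialise."""
    return p_n * ni + p_k * ki
```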
For an agent to know the optimal task quality it could achieve in its current neighbourhood, we use a metric to measure how far its current quality values are from optimal.

Definition 5.6 (Locally optimal allocation metric).
The locally optimal allocation metric is the difference between an agent's current allocation quality, of atomic tasks AT to agents G in its neighbourhood, and the locally optimal allocation quality:

d_loc(g, AT, G, AL) = ql*(AT, N(g), AL) − ql(AT, G, AL)    (18)

Definition 5.7 (System optimal allocation metric).
The system optimal allocation metric is the difference between an agent's current allocation quality and the system optimal allocation quality:

d_sys(g, AT, G, AL) = ql*(AT, on*(AT, g, AL), AL) − ql(AT, G, AL)    (19)

An agent needs to know the locally optimal allocation quality for both the current and the future neighbourhoods to predict whether the impact of changing neighbourhoods from X to Y would be positive. This is difficult since the agent is uncertain of d_loc and so does not know the best values it can obtain in the current neighbourhood. However, it is likely to have fewer samples of the actions available in Y, so may have even more uncertainty in future values if it changed neighbourhoods. To find proxies for these values we make the following assumptions based around time-based trends in action-samples.

Assumption 1 (Likelihood of neighbourhood change). The more actions an agent takes, the greater the likelihood that it will have taken actions that change its neighbourhood.

Assumption 2 (Variation of neighbourhoods). Samples in a large set of historical action-samples will come from many different neighbourhoods.
Assumption 3 (Time-dependent similarity of neighbourhoods). Action-samples separated by short spaces of time are likely to be from similar neighbourhoods. Those separated by large amounts of time are more likely to represent significantly different neighbourhoods.
With these assumptions we can estimate the relative local and system optimal allocation metric values. As recentaction-samples with small time separations come from the same or similar neighbourhoods we compare their quality
value statistics to estimate $d_{loc}$. As action-samples over the long term come from many different neighbourhoods we compare their values to estimate $d_{sys}$. To estimate which actions will have a positive impact we first use historical action-sample quality values to estimate action-impacts. Based on these values we increase or decrease the probabilities of taking different action-categories. Whether an impact is estimated to be positive or negative will alter the agent's likelihood of taking actions that explore allocation within the current neighbourhood, or that change its neighbourhood or knowledge base. The process is as follows (a code sketch of the pipeline is given after the definitions below):

(1) Define the time-summarised quality matrix (TSQM), a method of summarising historical quality returns over multiple time scales.
(2) Using this matrix we generate the impact interpolation function.
(3) We then define the impact transformation function using a ratio of integrations over the impact interpolation function.
(4) Finally, the action-impact values for each action-category are used as the input to the impact transformation function.
Definition 5.8 (Time-summarised quality matrix (TSQM)). A TSQM $\Lambda$ has shape $(m \times n)$ with all values initially null. Time-ordered action-sample quality values $\{\omega_t, \omega_{t-1}, \ldots, \omega_{t-n}\}$ for all actions of a specific agent are added to the first row $\Lambda_{(0,j)}$ as they are sampled, such that $\Lambda_{(0,\cdot)} \leftarrow \{\omega_i\}_{i=0}^{n}$. Each subsequent row is the result of averaging and pooling values in the previous row. This approach allows each row to represent the quality trends across different time-scales. If $h$ is the number of quality values added to the matrix then we update the elements,

$$\Lambda_{(i+1,k)} \leftarrow \frac{\sum \Lambda_{(i,\cdot)}}{\left|\Lambda_{(i,\cdot)}\right|}, \quad \text{if } h \bmod \left(k\,\left|\Lambda_{(i,\cdot)}\right|\right) = 0 \tag{20}$$

We summarise the full update process for an agent $g$ as the function $\mathrm{updatetsqm}(\Lambda_g, \omega)$.

The impact interpolation function $ii(x)$ is generated by taking a linear interpolation over the rows of a TSQM (see Figure 2). A decay factor $\delta$ acts to dampen the values of longer time-scales and allow more recent trends to have a stronger impact. For a TSQM of shape $(m \times n)$, a value $x \in \mathbb{R}_{[0,1]}$ is transformed as below:

$$ii(x) = \mathrm{interpol}\left(\left\{\frac{\mathrm{average}(\Lambda_i)\,\delta^i}{|\Lambda_i|}\right\}_{i=0}^{N}\right)(x), \quad \text{for layers } 0 \text{ to } N \tag{21}$$

The impact transformation function estimates the probability that taking an action from an action-category in the current neighbourhood will be positive, by taking a ratio over the integrals of the interpolation, representing the fraction of the historical quality values that occur up to the input value. For any $x \in \mathbb{R}_{[0,1]}$ this is given by:

$$it(x) = 1 - \frac{\int_{y=0}^{x} ii(y)\,dy}{\int_{y=0}^{1} ii(y)\,dy} \tag{22}$$

We also use the overall balance of the impact transformation function between shorter and longer timescales to adapt the exploration behaviour of our reinforcement learning model.
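The pipeline above can be sketched in a few lines of Python. This is a minimal reading of Equations (20)-(22), assuming simple mean-pooling between rows and numerical integration; the paper's exact pooling rule and interpolation are followed only in spirit, and all parameter values are placeholder assumptions.

```python
import numpy as np

class TSQM:
    """Time-summarised quality matrix: row 0 holds raw action-sample
    qualities; each deeper row holds means pooled from the row above,
    so rows summarise progressively longer time-scales."""

    def __init__(self, rows: int = 3, width: int = 8, decay: float = 0.9):
        self.rows, self.width, self.decay = rows, width, decay
        self.matrix = [[] for _ in range(rows)]

    def update(self, quality: float) -> None:
        # Newest samples go to the front of row 0 (updatetsqm in the text).
        self.matrix[0].insert(0, quality)
        for i in range(self.rows - 1):
            if len(self.matrix[i]) > self.width:
                # Cascade the row mean into the next (coarser) row, Eq. (20).
                self.matrix[i + 1].insert(0, float(np.mean(self.matrix[i])))
                self.matrix[i] = self.matrix[i][: self.width]

    def ii(self, x: float) -> float:
        # Eq. (21): linear interpolation over decay-damped row averages.
        levels = [float(np.mean(row)) * self.decay ** i
                  for i, row in enumerate(self.matrix) if row]
        if not levels:
            return 0.0
        xs = np.linspace(0.0, 1.0, num=len(levels))
        return float(np.interp(x, xs, levels))

    def it(self, x: float, steps: int = 101) -> float:
        # Eq. (22): 1 minus the fraction of the interpolation's area below x,
        # using average-value numerical integration over [0, 1].
        ys = np.linspace(0.0, 1.0, num=steps)
        vals = np.array([self.ii(y) for y in ys])
        total = float(vals.mean())
        below = float(vals[ys <= x].mean()) * x if x > 0 else 0.0
        return 1.0 - below / total if total > 0 else 0.0

# Usage sketch: stream rewards in, then read an exploration factor from it(·).
# The input the paper feeds to it(·) is elided in our source; 0.5 is an
# assumed midpoint, used only for illustration.
tsqm = TSQM()
for omega in [0.2, 0.4, 0.5, 0.55, 0.6, 0.7, 0.75, 0.8, 0.85, 0.9]:
    tsqm.update(omega)
epsilon_ief = tsqm.it(0.5)
```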
Fig. 2. Transforming the TSQM. [Diagram: the temporal reward set is pooled through layers 0-2 by convolution (active cell to target cells), then layer averaging and interpolation produce the impact transformation function IT(x).]
Higher values mean the agent is attaining better performance now than in the past, and should exploit rather than explore the system further; lower values mean its exploration of the system should be increased. This impact exploration factor is defined as $\epsilon_{ief} = it(\cdot)$. Finally, we can use the interpolation of the action-impact values $\mathbf{w}$ of the action-categories of each action $a$ to estimate the probability that taking that type of action will have a positive impact:

$$P(ni(AT, X, Y, AL) > 0 \mid a) \approx it(\mathbf{w}) \tag{23}$$

We simulated four systems to evaluate the algorithms' performance. In the stable system we look at the performance of the
ATA-RIA algorithm on the task allocation problem overall, when agents' neighbourhoods were randomly assigned on initialisation. The exploration system focuses on how the RT-ARP algorithm alters the probability of exploring system space to find the best neighbourhood for each agent. In this system we initialise parent agents' neighbourhoods to contain child agents with atomic task qualities that are significantly more or less than the average in the system. We then investigate how agents adapt these neighbourhoods to improve performance. The volatile system examines the adaptability of the algorithms when the system is highly dynamic; specifically, when child agents have a probability of leaving or rejoining the system each episode. Finally, in the large system we look at the performance of the algorithms as we increase the number of agents in the system.

Labels for the algorithms and configurations used in the simulations are described in Tables 2, 3, 4, and 5. System parameters are included in Appendix A, with general and individual system values shown in Tables 6 and 7 respectively. The composite task frequency distribution introduced the same fixed set of tasks over a specified period, defining each episode of the system. Child agents' atomic task qualities were set at system start time from values in the range (0, 1] drawn randomly from the normal distribution $X \sim \mathcal{N}(\mu, \sigma)$.
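For concreteness, a small sketch of this quality initialisation follows; the paper's $\mu$ and $\sigma$ values are elided in our source, so the 0.5 and 0.2 below are placeholder assumptions.

```python
import numpy as np

# Sketch of the simulation's quality initialisation: child-agent atomic task
# qualities drawn from a normal distribution, kept within (0, 1]. The mu and
# sigma values are hypothetical placeholders.
rng = np.random.default_rng(seed=0)

def sample_quality(mu: float = 0.5, sigma: float = 0.2) -> float:
    """Rejection-sample N(mu, sigma) until the draw lands in (0, 1]."""
    while True:
        q = float(rng.normal(mu, sigma))
        if 0.0 < q <= 1.0:
            return q

child_qualities = [sample_quality() for _ in range(10)]  # one per child agent
```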
Table 2. Summary of algorithm labels for the stable system

Algorithm    Summary
<optimal>    Used as a performance comparison, as it provides the theoretical optimum system utility. Its parent agents are initialised with the most optimal neighbourhoods available in the system, and always allocate tasks to the highest quality child agents.
<ataria>     The ATA-RIA algorithm.
Table 3. Summary of algorithm labels for the exploration system

Algorithm    Summary
<rtrap0>     ATA-RIA when the system is initialised with random neighbourhoods and then explores with a constant ε factor; RT-ARP is disabled. Used as the baseline comparison.
<rtrap+>     ATA-RIA when the system is initialised with neighbourhoods containing of the optimal neighbourhoods' agents, and explores using RT-ARP.
<rtrap->     ATA-RIA when the system is initialised with neighbourhoods containing of the least optimal agents, and explores using RT-ARP.
Table 4. Summary of algorithm labels for the volatile system

Algorithm    Summary
<nodrop>     ATA-RIA when the system has no network instability.
<drop>       ATA-RIA when of agents leave/rejoin the system each episode between episodes and .
<nosaskr>    ATA-RIA when of agents leave/rejoin the system each episode between episodes and , but with the RT-ARP and SAS-KR algorithms disabled.
Table 5. Summary of algorithm labels for the large system

Algorithm         Summary
<large-optimal>   ATA-RIA configured to give the most optimal possible RT-ARP performance in the given system.
<large-25>        ATA-RIA in a system of 25 agents.
<large-50>        ATA-RIA in a system of 50 agents.
<large-100>       ATA-RIA in a system of 100 agents.

Results for each system are shown in Figures 3, 4, 5, and 6. Values are shown as the percentage increase or decrease in system utility with the given algorithms in comparison to the baselines described: in the stable system the baseline is the <optimal> algorithm; in the exploration system, the <rtrap0> algorithm; in the volatile system, the <nodrop> algorithm; and in the large system, <large-optimal>. A summary of results is shown in Appendix B, in Tables 8, 9, 10, and 11 for the stable, exploration, volatile, and large systems respectively.

As seen in Figure 3, the <ataria> algorithm performs to . of the <optimal> algorithm after episodes in the stable system. Initially ∼ of the atomic task allocations made by the parent agents are not successful, but the failure rate rapidly falls to < . Although exploration is reduced as the algorithm approaches the optimal task allocation strategy, it never fully exploits the best strategy due to the effect of RT-ARP, which generates a low level of non-optimal actions. This shows that the <ataria> algorithm can optimise system utility well in a stable system.
Fig. 3. System utility comparison to the system optimal in the optimal system. [Line plot: episodes 10-100 on the x-axis; system utility loss w.r.t. optimal on the y-axis; series: optimal, ataria.]

Fig. 4. System utility comparison to the system optimal in the exploration system. [Line plot: episodes 100-500 on the x-axis; system utility gain w.r.t. baseline (-20% to 70%) on the y-axis; series: rtrap0, rtrap-, rtrap+.]

Although the effect of RT-ARP means that
ATA-RIA is not fully optimal under these conditions, it also improves its ability to adapt to changes as the environment becomes more dynamic.

Next we examine the exploration of state-space in the exploration system, in Figure 4. The <rtrap+> algorithm gains a . improvement in system utility compared to <rtrap0> after episodes. <rtrap-> improves . in task completion performance, with the expectation that this would merge with the utility levels of <rtrap+> given more episodes. The RT-ARP algorithm acts as a proxy comparison of the current allocation quality for an agent to the locally optimal and system-optimal allocation qualities for that agent. It drives the agent into better neighbourhoods.

Fig. 5. System utility comparison to the system optimal in the volatile system. [Line plot: episodes 10-100 on the x-axis; system utility loss w.r.t. optimal on the y-axis; series: nodrop, drop, nosaskr.]

Fig. 6. System utility comparison to the system optimal in the large system. [Line plot: episodes 10-100 on the x-axis; system utility loss w.r.t. optimal on the y-axis; series: large-optimal, large-25, large-50, large-100.]
SAS-KR algorithms’ effect on system resilience and recovery . Before theimpact on agent connectivity is introduced at episode , the algorithms’ performances are equivalent. On introducinginstability, the performance of the < drop > and < nosaskr > algorithms deteriorate by . , gradually improvingto . over the course of the disruption. After instability stops at episode < drop > recovers to . of theperformance of the non-impacted < nodrop > algorithm by episode , as compared to . for < nosaskr >. As the SAS-KR algorithm retains the most up-to-date, and least uncertain actions and associated Q-values, better informationabout past actions and neighbourhoods is kept by the agent. When the instability is removed, the quality of knowledgekept by the < drop > algorithm is higher than in < nosaskr >, allowing a quicker recovery to more optimal neighbourhoodformations, and so task-allocation quality and overall system utility.The large system is shown in Figure 6. Here we see the < large-25 > algorithm perform within . of the< large-optimal > algorithm, the optimal performance possible for the ATA-RIA algorithm in the system. The < large-50 >and < large-100 > algorithms optimise system utility to within . and . of < large-optimal > by the completionof episodes. As expected, the system utility of the ATA-RIA algorithm is initially poorer with increasing number ofagents in the system. On initialisation of the system, there is a greater likelihood of parent agents being in neighbour-hoods with agents that have lower than average atomic task qualities available, or where not all atomic tasks in theparent agents’ composite task are completable. There is also a larger system space for the algorithm to search. Evenso, the
ATA-RIA algorithm shows good performance in optimising the system utility to under of optimal with a system of agents.

Overall, the evaluation of the algorithms presented shows that they perform well at task allocation in both stable and unstable environments, as well as scaling to larger systems. The ATA-RIA algorithm improved system utility to . of the optimal in the simulated system. The RT-ARP algorithm reduced exploration as the system utility approached optimal, and adapted well in response to disruption. It allowed agents to alter their neighbourhoods from areas of state-action space that would not allow task completion to those where it would be possible. In environments with disrupted connectivity, the retention of learned knowledge through SAS-KR allowed for quicker re-optimisation and adaptation of neighbourhoods, over × better than when RT-ARP and SAS-KR were disabled and there was no adaptive exploration or knowledge retention strategy.
As we have shown in this paper, with the ATA-RIA algorithm optimising agents' task allocations, RT-ARP adapting exploration based on reward trends, and the SAS-KR and N-Prune algorithms managing knowledge and neighbourhood retention respectively, the contributions presented here combine to give a novel method of optimising task allocation in multi-agent systems. The evaluation results show that the combined algorithms give good task allocation performance compared to the optimal available in the simulated systems, and are resilient to system change with constrained computational cost and other resource usage. This indicates a good basis for successful application to real-life systems where there are resource constraints or dynamic environments.

The algorithms described here are applicable to a general class of problems where there are dynamic, self-organising networks, and where multiple agents need to learn to associate other agents with the subtasks necessary for the completion of a composite task. This work may be especially applicable to systems where changeable conditions cause instabilities and where there are very limited possibilities for maintenance or human intervention. There are applications
in wireless sensor networks (WSN) [31, 44], where adaptive networking and optimisation are essential to keep usage and maintenance costs minimal. The algorithms' adaptability to connectivity disruption and agent loss indicates that their performance in harsh environmental conditions, and where the reliability of components deteriorates over time, may be worth further investigation. Similarly, dynamic multi-agent systems such as vehicular ad-hoc networks (VANET) [43] and cloud computing service composition [19, 36] also provide real-world task allocation applications.

Adaptation to congestion when multiple agents are in competition showed how the algorithms could be useful in environments where resource contention, on both the targets of requests and the network itself, is a factor. Agents learned to compromise between allocating subtasks to the agents that would give the best quality but had more competition from other agents, and allocating to agents that had reduced contention on their resources. While this allows a degree of balance to develop in a contained system, it would be worth investigating how this behaviour could be used to drive exploration of the greater system. For example, agents who find themselves in a heavily resource-competitive area of the system could be pushed to prioritise exploration of less busy areas, adapting their behaviour to not require or utilise the same resources by adopting a different role in the system. This has uses in load balancing workloads across cloud compute systems and in energy consumption management in distributed sensor networks.
REFERENCES
[1] Abbas, H. A., Shaheen, S. I., and Amin, M. H. Organization of multi-agent systems: An overview. International Journal of Intelligent Information Systems 4, 3 (2015), 46–57.
[2] Agogino, A., and Tumer, K. Multi-agent reward analysis for learning in noisy domains. In Proceedings of the International Conference on Autonomous Agents (2005).
[3] Agrawal, S., and Kamal, R. Computational orchestrator: A super class for matrix, robotics and control system orchestration. International Journal of Computer Applications 117, 10 (2015), 12–19.
[4] Akyildiz, I. F., Su, W., Sankarasubramaniam, Y., and Cayirci, E. Wireless sensor networks: A survey. Computer Networks (2002).
[5] Al-Rawi, H. A. A., Ng, M. A., and Yau, K.-L. A. Application of reinforcement learning to routing in distributed wireless networks: A review. Artificial Intelligence Review 43, 3 (Mar 2015), 381–416.
[6] Albaladejo, C., Sánchez, P., Iborra, A., Soto, F., López, J. A., and Torres, R. Wireless sensor networks for oceanographic monitoring: A systematic review. Sensors 10, 7 (2010), 6948–6968.
[7] Bagnell, J. A., and Ng, A. Y. On local rewards and scaling distributed reinforcement learning. In Advances in Neural Information Processing Systems 18 (NIPS 2005) (2005).
[8] Buşoniu, L., Babuška, R., and De Schutter, B. A comprehensive survey of multiagent reinforcement learning. IEEE Transactions on Systems, Man and Cybernetics, Part C: Applications and Reviews 38, 2 (2008), 156–172.
[9] Buşoniu, L., Babuška, R., and De Schutter, B. Multi-agent reinforcement learning: An overview. In Proceedings of the 2nd International Conference on Multi-Agent Systems 19 (2010), 183–221.
[10] Chen, Y., and Bai, X. On robotics applications in service-oriented architecture. In Proceedings of the International Conference on Distributed Computing Systems (2008), 551–556.
[11] De Hauwere, Y. M., Vrancx, P., and Nowé, A. Solving sparse delayed coordination problems in multi-agent reinforcement learning. In Lecture Notes in Computer Science, vol. 7113 LNAI. Springer-Verlag, 2012, pp. 114–133.
[12] De Hauwere, Y. M., Vrancx, P., and Nowé, A. Learning multi-agent state space representations. In Proceedings of the International Joint Conference on Autonomous Agents and Multiagent Systems, AAMAS (2010).
[13] Di Marzo Serugendo, G., Gleizes, M.-P., and Karageorgos, A. Self-organisation and emergence in MAS: An overview. Informatica 30 (2006), 45–54.
[14] Di Marzo Serugendo, G., Foukia, N., Hassas, S., Karageorgos, A., Mostéfaoui, S. K., Rana, O. F., Ulieru, M., Valckenaers, P., and Van Aart, C. Self-organisation: Paradigms and applications. In Lecture Notes in Artificial Intelligence (2004).
[15] Edmondson, J., and Schmidt, D. Multi-agent distributed adaptive resource allocation (MADARA). International Journal of Communication Networks and Distributed Systems 5, 3 (2010), 229–245.
[16] García, J., and Fernández, F. A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research 16 (2015), 1437–1480.
[17] Gleizes, M. P. Self-adaptive complex systems. In Lecture Notes in Computer Science (2012).
[18] Gungor, V. C., and Hancke, G. P. Industrial wireless sensor networks: Challenges, design principles, and technical approaches. IEEE Transactions on Industrial Electronics 56, 10 (2009), 4258–4265.
[19] Gutierrez-Garcia, J. O., and Sim, K. M. Agent-based service composition in cloud computing. Communications in Computer and Information Science 121 (2010), 1–10.
[20] Gutierrez-Garcia, J. O., and Sim, K. M. Self-organizing agents for service composition in cloud computing. In Proceedings of the 2nd IEEE International Conference on Cloud Computing Technology and Science, CloudCom 2010 (2010), 59–66.
[21] Hindman, B., Konwinski, A., Zaharia, M., Ghodsi, A., Joseph, A. D., Katz, R., Shenker, S., and Stoica, I. Mesos: A platform for fine-grained resource sharing in the data center, 2011.
[22] Hodicky, J. (Ed.). Modelling and Simulation for Autonomous Systems: First International Workshop, MESAS 2014, Rome, Italy, May 5–6, 2014, Revised Selected Papers. Lecture Notes in Computer Science 8906 (2014).
[23] Howard, H., Schwarzkopf, M., Madhavapeddy, A., and Crowcroft, J. Raft refloated: Do we have consensus? In Operating Systems Review (ACM) (2015).
[24] Kober, J., Bagnell, J. A., and Peters, J. Reinforcement learning in robotics: A survey. International Journal of Robotics Research (2013).
[25] Kota, R., Gibbins, N., and Jennings, N. R. Decentralised structural adaptation in agent organisations. In Lecture Notes in Computer Science (2009).
[26] Krivic, P., Skocir, P., Kusek, M., and Jezic, G. Microservices as agents in IoT systems. Smart Innovation, Systems and Technologies 74 (2018), 22–31.
[27] Lakshman, A., and Malik, P. Cassandra. ACM SIGOPS Operating Systems Review 44, 2 (Apr 2010), 35.
[28] Lesser, V., Ortiz, C. L., and Tambe, M. Distributed Sensor Networks: Introduction to a Multiagent Perspective. Springer US, Boston, MA, 2003, pp. 1–8.
[29] Mannucci, T., Van Kampen, E., De Visser, C. C., and Chu, Q. P. SHERPA: A safe exploration algorithm for reinforcement learning controllers. In AIAA Guidance, Navigation, and Control Conference (2015).
[30] Mao, H., Gong, Z., and Xiao, Z. Reward design in cooperative multi-agent reinforcement learning for packet routing, 2020.
[31] Marsh, D., Tynan, R., O'Kane, D., and O'Hare, G. M. Autonomic wireless sensor networks. Engineering Applications of Artificial Intelligence 17, 7 (2004), 741–748.
[32] Mazrekaj, A., Minarolli, D., and Freisleben, B. Distributed resource allocation in cloud computing using multi-agent systems. Telfor Journal 9, 2 (2017), 110–115.
[33] Melo, F. S., and Veloso, M. Learning of coordination: Exploiting sparse interactions in multiagent systems. In Proceedings of the 8th International Conference on Autonomous Agents and Multiagent Systems (2009), 773–780.
[34] Ongaro, D., and Ousterhout, J. In search of an understandable consensus algorithm. In Proceedings of the 2014 USENIX Annual Technical Conference, USENIX ATC 2014 (2014).
[35] Parker, J. Task allocation for multi-agent systems in dynamic environments. (2013), 1445–1446.
[36] Qiu, L. Self-organization mechanisms for service composition in cloud computing. International Journal of Hybrid Information Technology (2014).
[37] Singhal, V., and Dahiya, D. Distributed task allocation in dynamic multi-agent system. In International Conference on Computing, Communication and Automation, ICCCA 2015 (2015), 643–648.
[38] Sutton, R. S. Temporal credit assignment in reinforcement learning, 1984.
[39] Sutton, R. S., and Barto, A. G. Reinforcement Learning: An Introduction. MIT Press, 1998.
[40] Tuyls, K., and Weiss, G. Multiagent learning: Basics, challenges, and prospects. AI Magazine (2012), 41–52.
[41] Verleysen, M. (Ed.). Safe exploration for RL. In European Symposium on Artificial Neural Networks: Advances in Computational Intelligence and Learning (ESANN 2008), Bruges, April 2008.
[42] Wolpert, D. H., and Tumer, K. Optimal payoff functions for members of collectives. Advances in Complex Systems (2001).
[43] Xu, S., Guo, C., Hu, R. Q., and Qian, Y. Multi-agent deep reinforcement learning enabled computation resource allocation in a vehicular cloud network, 2020.
[44] Ye, D., Zhang, M., and Yang, Y. A multi-agent framework for packet routing in wireless sensor networks. Sensors 15, 5 (Apr 2015), 10026–10047.
[45] Zhang, C., Lesser, V., and Shenoy, P. A multi-agent learning approach to online distributed resource allocation. In IJCAI International Joint Conference on Artificial Intelligence (2009).
[46] Zhong, Y., Gu, G., and Zhang, R. A new approach for structural credit assignment in distributed reinforcement learning systems. In Proceedings of the IEEE International Conference on Robotics and Automation (2003).
A PARAMETERS FOR SYSTEM SIMULATIONS AND ALGORITHMS
Table 6. General parameter values

Variable     Summary                                                          Value
|AP|         Number of atomic task types
|CP|         Number of composite task types
|K(g)|       Size of agents' knowledge
|N(g)|       Size of agents' neighbourhoods
|ap ∈ cp|    Number of atomic tasks composing a composite task type
n/a          Frequency distribution of composite tasks' arrival in the
             system                                                           One cp per parent agent per episode
ω_g          The atomic task quality produced by a child agent for a task    (0, 1]

Table 7. Simulation parameter values

Variable              Summary                                    Optimal   Exploration   Volatile   Large
|G_p|                 Number of parent agents in the system
|G_c|                 Number of child agents in the system       10        10            10         {25, 50, 100}
P(leave/join | g_p)   Probability of an agent leaving or
                      re-joining the system each episode                                  .01        0
B SUMMARY OF RESULTS
Table 8. Experimental results for the stable system after 100 episodes

Algorithm     % performance decrease from <optimal>
<ataria>      .
<congested>   .
<loss>        .
<cost>        .

Table 9. Experimental results for the exploration system after 100 episodes

Algorithm     % performance increase over <rtrap0>
<rtrap->      .
<rtrap+>      .

Table 10. Experimental results for the volatile system after 100 episodes

Algorithm     % performance decrease from <nodrop>
<drop>        .
<nosaskr>     .

Table 11. Experimental results for the large system after 100 episodes

Algorithm       % performance decrease from <large-optimal>
<large-25>      .
<large-50>      .
<large-100>     .6%