Leveraging Communication Topologies Between Learning Agents in Deep Reinforcement Learning
Dhaval Adjodah, Dan Calacci, Abhimanyu Dubey, Anirudh Goyal, P. M. Krafft, Esteban Moro, Alex Pentland

Massachusetts Institute of Technology; MILA/Université de Montréal; Universidad Carlos III de Madrid; Oxford Internet Institute, University of Oxford
ABSTRACT
A common technique to improve learning performance in deep reinforcement learning (DRL) and many other machine learning algorithms is to run multiple learning agents in parallel. A neglected component in the development of these algorithms has been how best to arrange the learning agents involved to improve distributed search. Here we draw upon results from the networked optimization literature suggesting that arranging learning agents in communication networks other than fully connected topologies (the implicit way agents are commonly arranged) can improve learning. We explore the relative performance of four popular families of graphs and observe that one such family (Erdos-Renyi random graphs) empirically outperforms the de facto fully-connected communication topology across several DRL benchmark tasks. Additionally, we observe that 1000 learning agents arranged in an Erdos-Renyi graph can perform as well as 3000 agents arranged in the standard fully-connected topology, showing the large learning improvement possible when carefully designing the topology over which agents communicate. We complement these empirical results with a theoretical investigation of why our alternate topologies perform better. Overall, our work suggests that distributed machine learning algorithms could be made more effective if the communication topology between learning agents was optimized.

KEYWORDS
Reinforcement Learning; Evolutionary Algorithms; Deep Learning; Networks
Implementations of deep reinforcement learning (DRL) algorithms have become increasingly distributed, running large numbers of parallel sampling and training nodes. For example, AlphaStar runs thousands of parallel instances of StarCraft II on TPUs [32], and OpenAI Five runs on 128,000 CPU cores at the same time [25]. Such distributed algorithms rely on an implicit communication network between the processing units being used in the algorithm.

Correspondence to Dhaval Adjodah ([email protected]). Code available at github.com/d-val/NetES.
Proc. of the 19th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2020), B. An, N. Yorke-Smith, A. El Fallah Seghrouchni, G. Sukthankar (eds.), May 2020, Auckland, New Zealand.
These units pass information such as data, parameters, or rewards between each other, often through a central controller. For example, in the popular A3C [17] reinforcement learning algorithm, multiple 'workers' are spawned with local copies of a global neural network, and they are used to collectively update the global network. These workers can either be viewed as implementing the parallelized form of an algorithm, or they can be seen as a type of multi-agent distributed optimization approach to searching the reward landscape for parameters that maximize performance.

In this work, we take the latter approach of thinking of the 'workers' as separate agents that search a reward landscape more or less efficiently. We adopt such an approach because it allows us to consider improvements studied in the field of multi-agent optimization [11], specifically the literatures of networked optimization (optimization over networks of agents with local rewards) [20-22] and collective intelligence (the study of mechanisms of how agents learn, influence and collaborate with each other) [36, 37].

These two literatures suggest a number of different ways to improve such multi-agent optimization, and, in this work, we choose to focus on one of the main ways to do so: optimizing the topology of communication between agents (i.e. the local and global characterization of the connections between agents used to communicate data, parameters, or rewards).

We focus on communication topology because it has been shown to result in increased exploration, higher reward, and higher diversity of solutions in both simulated high-dimensional optimization problems [15] and human experiments [5], and because, to the best of our knowledge, almost no prior work has investigated how the topology of communication between agents affects learning performance in distributed DRL.

Here, we empirically investigate whether using alternate communication topologies between agents could lead to improved learning performance in the context of DRL. The two topologies that are almost always used in DRL are either a complete (fully-connected) network, in which all processors communicate with each other; or a star network, in which all processors communicate with a single hub server, which is, in effect, a more efficient, centralized implementation of the complete network (e.g., [29]). Our hypothesis is that using topologies other than fully-connected will lead to learning improvements.

Given that network effects are sometimes only significant with large numbers of agents, we choose to build upon one of the DRL algorithms most oriented towards parallelizability and scalability: Evolution Strategies [26, 28, 34], which has recently been shown to scale up to tens of thousands of agents [27].

Type        Task            Fully-connected   Erdos   Improv. %
MuJoCo      Ant-v1          4496              4938    9.8
MuJoCo      HalfCheetah-v1  1571              7014    346.5
MuJoCo      Hopper-v1       1506              3811    153.1
MuJoCo      Humanoid-v1     762               6847    798.6
Roboschool  Humanoid-v1     364               429     17.9

Table 1: Improvements from Erdos-Renyi networks with 1000 nodes compared to fully-connected networks.
We introduce Networked Evolution Strategies (NetES), a networked decentralized variant of ES. NetES, like many DRL algorithms and evolutionary methods, relies on aggregating the rewards from a population of processors that search in parameter space to optimize a single global parameter set. Using NetES, we explore how the communication topology of a population of processors affects learning performance.

Key aspects of our approach, findings, and contributions are as follows:

• We introduce the notion of communication network topologies to the ES paradigm for DRL tasks.
• We perform an ablation study using various baseline controls to make sure that any improvements we see come from using alternate topologies and not other factors.
• We compare the learning performance of the main topological families of communication graphs, and observe that one family (Erdos-Renyi graphs) does best.
• Using an optimized Erdos-Renyi graph, we evaluate NetES on five difficult DRL benchmarks and find large improvements compared to using a fully-connected communication topology. We observe that our 1000-agent Erdos-Renyi graph can compete with 3000 fully-connected agents.
• We derive an upper bound which provides theoretical insights into why alternate topologies might outperform a fully-connected communication topology. We find that our upper bound depends only on the topology of learning agents, and not on the reward function of the reinforcement learning task at hand, which indicates that our results will likely generalize to other learning tasks.
As discussed earlier, given that network effects are generally only significant with large numbers of agents, we choose to build upon one of the DRL algorithms most oriented towards parallelizability and scalability: Evolution Strategies.

We begin with a brief overview of the application of the Evolution Strategies (ES) [28] approach to DRL, following Salimans et al. [27]. Evolution Strategies is a class of techniques that solve optimization problems using a derivative-free parameter update approach. The algorithm proceeds by selecting a fixed model, initialized with a set of weights θ (whose distribution p_ϕ is parameterized by parameters ϕ), and an objective (reward) function R(·) defined externally by the DRL task being solved. The ES algorithm then maximizes the average objective value E_{θ∼p_ϕ} R(θ), which is optimized with stochastic gradient ascent. The score function estimator for ∇_ϕ E_{θ∼p_ϕ} R(θ) is similar to REINFORCE [35], given by ∇_ϕ E_{θ∼p_ϕ} R(θ) = E_{θ∼p_ϕ} [R(θ) ∇_ϕ log p_ϕ(θ)].

Figure 1: Learning in DRL can be visualized with agents (red dots) searching a reward landscape for the parameter set (location) that leads to the highest reward. A: In most DRL algorithms, including ES, agents search the same local area. Because the controller receives information from all agents, and then broadcasts a new parameter to all other agents, agents are, in effect, communicating in a fully connected network. B: In NetES, the same number of agents are embedded in a communication topology over which they share data. This leads to a more distributed search where each cluster of agents focuses on a different part of the landscape.

The update equation used in this algorithm for the parameter θ at any iteration t + 1, for an appropriately chosen learning rate α and noise standard deviation σ, is a discrete approximation to the gradient:

    θ^(t+1) = θ^(t) + (α/(Nσ)) Σ_{i=1}^{N} ( R(θ^(t) + σϵ_i^(t)) · σϵ_i^(t) )    (1)

This update rule is implemented by spawning a collection of N agents at every iteration t, with perturbed versions of θ^(t), i.e. {(θ^(t) + σϵ_1^(t)), ..., (θ^(t) + σϵ_N^(t))} where ϵ ∼ N(0, I). The algorithm then calculates θ^(t+1), which is broadcast again to all agents, and the process is repeated.

In summary, a centralized controller holds a global parameter θ, records the perturbed noise ϵ_i^(t) used by all agents, collects rewards from all agents at the end of an episode, calculates the gradient, and obtains a new global parameter θ. Because the controller receives information from all agents, and then broadcasts a new parameter to all other agents, each agent is in effect communicating (through the controller) with all other agents.

This means that the de facto communication topology used in Evolution Strategies (and all other DRL algorithms that use a central controller) is a fully-connected network. Our hypothesis is that using alternate communication topologies between agents will lead to improved learning performance.
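To make the update concrete, the following is a minimal NumPy sketch of Equation 1; the toy quadratic reward and all function names are illustrative placeholders, not the paper's released implementation:

```python
import numpy as np

def es_step(theta, reward, n_agents=100, alpha=0.01, sigma=0.1, rng=None):
    """One ES iteration (Equation 1): perturb, evaluate, aggregate."""
    rng = rng or np.random.default_rng()
    epsilons = rng.standard_normal((n_agents, theta.size))  # eps_i ~ N(0, I)
    rewards = np.array([reward(theta + sigma * e) for e in epsilons])
    # theta <- theta + alpha/(N sigma) * sum_i R(theta + sigma eps_i) * sigma eps_i
    update = (rewards[:, None] * sigma * epsilons).sum(axis=0)
    return theta + alpha / (n_agents * sigma) * update

# Usage on a toy quadratic reward whose optimum is at w = 1.
theta = np.zeros(5)
for _ in range(200):
    theta = es_step(theta, lambda w: -np.sum((w - 1.0) ** 2))
```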
So far, we have assumed that all agents start with the same global parameter θ^(t). When each agent i starts with a different parameter θ_i^(t_0), Equation 1 has to be generalized. In the case where all agents start with the same parameter, Equation 1 can be understood as having each agent take a weighted average of the differences (perturbations) between its last local parameter copy and the perturbed copies of each agent, the differences being σϵ_i^(t) = ((θ^(t) + σϵ_i^(t)) − θ^(t)). The weight used in the weighted average is given by the reward at the location of each perturbed copy, R(θ^(t) + σϵ_i^(t)).

When agents start with different parameters, the same weighted average is calculated: because each agent now has different parameters, the difference between agent i's perturbed parameters and agent j's parameters is ((θ_i^(t) + σϵ_i^(t)) − θ_j^(t)). The weights are still R(θ_i^(t) + σϵ_i^(t)). In this notation, Equation 1 becomes:

    θ_j^(t+1) = θ_j^(t) + (α/(Nσ)) Σ_{i=1}^{N} ( R(θ_i^(t) + σϵ_i^(t)) · (θ_i^(t) + σϵ_i^(t) − θ_j^(t)) )    (2)

It is straightforward to show that Equation 2 reduces to Equation 1 when all agents start with the same parameter. As we will show, generalizing this standard update rule further to handle alternate topologies is straightforward.

The task ahead is to take the standard ES algorithm and operate it over new communication topologies, wherein each agent is only allowed to communicate with its neighbors. This allows us to test our hypothesis that alternate topologies perform better than the de facto fully-connected topology.

An interesting possibility for future work would be to optimize over the space of all possible topologies to find the ones that perform best for the task at hand. In this work, we take as a more tractable starting point a comparison of four popular graph families (including the fully-connected topology).
We denote a network topology by A = {a_ij}, where a_ij = 1 if agents i and j communicate with each other, and 0 otherwise. A represents the adjacency matrix of connectivity, and fully characterizes the communication topology between agents. In a fully connected network, a_ij = 1 for all i, j.

Using the adjacency matrix A, it is straightforward to allow Equation 2 to operate over any communication topology:

    θ_j^(t+1) = θ_j^(t) + (α/(Nσ)) Σ_{i=1}^{N} a_ij · ( R(θ_i^(t) + σϵ_i^(t)) · (θ_i^(t) + σϵ_i^(t) − θ_j^(t)) )    (3)

Because Equation 3 uses the same weighted average as in ES (Equations 1 and 2), when fully-connected networks are used (i.e. a_ij = 1) and agents start with the same parameters, Equation 3 reduces to Equation 1.

The only other change introduced by NetES is the use of periodic global broadcasts. We implemented parameter broadcast as follows: at every iteration, with probability p_b, we replace all agents' current parameters with the best agent's performing weights, and then continue training (as per Equation 3) after that. The same broadcast technique has been used in many other algorithms to balance local and global search (e.g. the 'exploit' action in Population Based Training [13] replaces current agent weights with the weights that give the highest rewards).

The full description of the NetES algorithm is shown in Algorithm 1.

Algorithm 1
Networked Evolution Strategies
Input: Learning rate α, noise standard deviation σ, initial policy parameters θ_i^(0) where i = 1, 2, ..., N (for N workers), adjacency matrix A, global broadcast probability p_b
Initialize: N workers with known random seeds, initial parameters θ_i^(0)
for t = 0, 1, 2, ... do
    for each worker i = 1, 2, ..., N do
        Sample ϵ_i^(t) ∼ N(0, I)
        Compute returns R_i = R(θ_i^(t) + σϵ_i^(t))
    Sample β^(t) ∼ U(0, 1)
    if β^(t) < p_b then
        for each worker i = 1, 2, ..., N do
            Set θ_i^(t+1) ← θ_{j*}^(t) + σϵ_{j*}^(t), where j* = arg max_j R(θ_j^(t) + σϵ_j^(t))
    else
        for each worker i = 1, 2, ..., N do
            Set θ_i^(t+1) ← θ_i^(t) + (α/(Nσ)) Σ_{j=1}^{N} a_ij · ( R(θ_j^(t) + σϵ_j^(t)) · (θ_j^(t) + σϵ_j^(t) − θ_i^(t)) )

In summary, NetES implements three modifications to the ES paradigm: the use of alternate topologies through a_ij, the use of different starting parameters, and the use of global broadcast. In the following sections, we run careful controls during an ablation study to investigate where the improvements in learning we observe come from. Our hypothesis is that they come mainly, or completely, from the use of alternate topologies. As we will show later, they do come from the use of alternate topologies alone (see Fig. 2B).
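Below is a compact, self-contained sketch of one NetES iteration, covering both branches of Algorithm 1 (the broadcast step and the Equation 3 update); the function and argument names are ours, and `reward` is a placeholder for the task-defined R(·):

```python
import numpy as np

def netes_step(thetas, adj, reward, alpha=0.01, sigma=0.1,
               p_broadcast=0.8, rng=None):
    """One NetES iteration over an (N, d) parameter array and (N, N) adjacency."""
    rng = rng or np.random.default_rng()
    n, d = thetas.shape
    epsilons = rng.standard_normal((n, d))     # eps_i ~ N(0, I)
    perturbed = thetas + sigma * epsilons      # theta_i + sigma * eps_i
    rewards = np.array([reward(w) for w in perturbed])
    if rng.uniform() < p_broadcast:
        # Broadcast branch: every agent adopts the best-performing weights.
        return np.tile(perturbed[np.argmax(rewards)], (n, 1))
    # Equation 3 branch: reward-weighted differences, gated by the adjacency.
    new_thetas = np.empty_like(thetas)
    for i in range(n):
        diffs = perturbed - thetas[i]          # (theta_j + sigma*eps_j) - theta_i
        weighted = adj[i][:, None] * rewards[:, None] * diffs
        new_thetas[i] = thetas[i] + alpha / (n * sigma) * weighted.sum(axis=0)
    return new_thetas
```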
Previous work [5] demonstrates that the exact form of the update rule does not matter as long as the optimization strategy is to find and aggregate the parameters with the highest reward (as opposed to, for example, finding the most common parameters many agents hold). Therefore, although our update rule is a straightforward extension of ES, we expect our primary insight (that network topology can affect DRL) to remain useful under alternate update rules.

Secondly, although Equation 3 is a biased gradient estimate, at least in the short term, it is unclear whether in practice we achieve a biased or an unbiased gradient estimate, marginalizing over the time steps between broadcasts. This is because in the full algorithm we implement (Algorithm 1), we combine this update rule with a periodic parameter broadcast (as is common in distributed learning algorithms; we address this in detail in a later section), and every broadcast returns the agents to a consensus position.

Future work can better characterize the theoretical properties of NetES and similar networked DRL algorithms using the recently developed tools of calculus on networks (e.g., [1]). Empirically and theoretically, we present results suggesting that the use of alternate topologies can lead to large performance improvements.

Figure 2: A: Learning performance on all network families: Erdos-Renyi graphs do best, fully-connected graphs do worst (MuJoCo Ant-v1 task with small networks of 100 nodes). B: Evaluation results for an Erdos-Renyi graph with 1000 agents compared to fully-connected networks with varying network sizes (RoboSchool Humanoid-v1). C: Comparing an Erdos-Renyi graph with 1000 agents to fully-connected networks with varying network sizes on training (not evaluation metric) performance (RoboSchool Humanoid-v1). All: Error bars represent 95% confidence intervals.
Given the update rule in Equation 3, the goal is to find which topology leads to the highest improvement. Because we draw inspiration from the study of collective intelligence and networked optimization, we use topologies that are prevalent in modeling how humans and animals learn collectively:

• Erdos-Renyi Networks: Networks where each edge between any two nodes has a fixed independent probability of being present [8], which are among the most commonly used benchmark graphs for comparison in social networks [23].
• Scale-Free Networks: Networks whose degree distribution follows a power law [7], commonly observed in citation and signaling biological networks [4].
• Small-World Networks: Networks where most nodes can be reached through a small number of neighbors, resulting in the famous 'six degrees of separation' [31].
• Fully-Connected Networks: Networks where every node is connected to every other node.

We used the generative model of [9] to create Erdos-Renyi graphs, the Watts-Strogatz model [33] for Small-World graphs, and the Barabási-Albert model [4] for Scale-Free networks.

We can randomly sample instances of graphs from each family, which is parameterized by the number of nodes N and their degree distribution. Erdos-Renyi networks, for example, are parameterized by their average density p, ranging from 0 to 1, where 0 would lead to a completely disconnected graph (no nodes are connected), and 1 would lead back to a fully-connected graph. The lower p is, the sparser a randomly generated network is. Similarly, the degree distribution of scale-free networks is defined by the exponent of the power distribution. Because each graph is generated randomly, two graphs with the same parameters will be different if they have different random seeds, even though, on average, they will have the same average degree (and therefore the same number of links).

Through the modifications to ES we have described, we are now able to operate on any communication topology. Given previous work in networked optimization and collective intelligence showing that alternate network structures result in better performance, we expect NetES to perform better on DRL tasks when using alternate topologies than with the de facto fully-connected topology. We also expect to see differences in performance between families of topologies.
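The following sketch shows one way such graphs can be instantiated with the standard networkx generators; the density-to-parameter mappings for the Watts-Strogatz and Barabási-Albert models, and the resample-until-connected convention, are our assumptions for illustration:

```python
import networkx as nx

def make_topology(family, n=100, density=0.5, seed=0):
    if family == "erdos-renyi":
        g = nx.erdos_renyi_graph(n, p=density, seed=seed)
    elif family == "scale-free":
        # Barabasi-Albert: m edges per new node sets the average degree.
        g = nx.barabasi_albert_graph(n, m=max(1, int(density * n / 2)), seed=seed)
    elif family == "small-world":
        # Watts-Strogatz: k nearest neighbors, rewired with probability 0.1.
        g = nx.watts_strogatz_graph(n, k=max(2, int(density * n)), p=0.1, seed=seed)
    elif family == "fully-connected":
        g = nx.complete_graph(n)
    else:
        raise ValueError(family)
    assert nx.is_connected(g), "resample with a new seed if disconnected"
    return nx.to_numpy_array(g)  # 0/1 adjacency matrix for Equation 3

adjacency = make_topology("erdos-renyi", n=1000, density=0.5, seed=42)
```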
A focus of recent DRL research has been the ability to run more and more agents in parallel (i.e. scalability). An early example is the Gorila framework [19], which collects experiences in parallel from many agents. Another is A3C [17], discussed earlier. IMPALA [10] is a recent algorithm which solves many tasks with a single parameter set, and Population Based Training [13] optimizes both learning weights and hyperparameters. However, in all the approaches described above, agents are organized in an implicit fully-connected centralized topology.

We build on the Evolution Strategies implementation of Salimans et al. [27], which was modified for scalability in DRL. There have been many variants of Evolution Strategies over the years, such as CMA-ES [3], which updates the covariance matrix of the Gaussian distribution, and Natural Evolution Strategies [34], where the inverse of the Fisher Information Matrix of the search distributions is used in the gradient update rule. Again, these algorithms implicitly use a fully-connected topology between learning agents.

On the other hand, work in the networked optimization literature has demonstrated that the network structure of communication between nodes significantly affects the convergence rate and accuracy of multi-agent learning [20-22]. However, this work has been focused on solving global objective functions that are the sum (or average) of private, local node-based objective functions, which is not always an appropriate framework for DRL. In the collective intelligence literature, alternate network structures have been shown to result in increased exploration, higher overall maximum reward, and higher diversity of solutions in both simulated high-dimensional optimization [15] and human experiments [5].

One recent study [16] investigated the effect of communication network topology, but only as an aside, and on very small networks; they also observe improvements when using networks that are not fully connected. Another work focuses on the absence of a central controller, but differs significantly in that agents solve different tasks at the same time [38]. The performance of a ring network has also been compared to a random one [14]; however, there are significant differences in domains (ours is non-convex optimization, theirs is a quadratic convex assignment task), solving technique (we focus on black-box optimization, theirs is a GA with mutations), and the number of components (we make sure that all our networks are in a single connected component for fair comparison, while they vary the number of components). Additionally, since a ring of N agents has N edges while a random network has O(pN²) edges, their result that a sparser (ring) network performs better than a (denser) random network generally agrees with ours in terms of sparsity. Finally, the dynamic concept of adjusting connections and limiting the creation of hubs has been explored [2]. We have expanded on their study both by looking to the DRL context and by looking at several network topologies they left for future work.

To the best of our knowledge, no prior work has focused on investigating how the topology of communication between agents affects learning performance in distributed DRL, for large networks and on popular graph families.

The main goal of our experiments is to test our hypothesis that using alternate topologies will lead to an improvement in learning performance. Therefore, we want to be able to generate communication topologies from each of the four popular random graph families, wire our agents using each topology, and deploy them to solve the DRL task at hand. We also want to run a careful ablation study to understand where the improvements come from.
We evaluate our NetES algorithm on a series of popular benchmark tasks for deep reinforcement learning, selected from two frameworks: the open source Roboschool [24] benchmark, and the MuJoCo framework [30]. The five benchmark tasks we evaluate on are: Humanoid-v1 (Roboschool and MuJoCo), HalfCheetah-v1 (MuJoCo), Hopper-v1 (MuJoCo), and Ant-v1 (MuJoCo). Our choice of benchmark tasks is motivated by the difficulty of these walker-based problems.

To maximize reproducibility of our empirical results, we use the standard evaluation metric of collecting the total reward agents obtain during a test-only episode, which we compute periodically during training [6, 18, 27]. Specifically, with a probability of 0.08, we intermittently pause training, take the parameters of the best agent, run these parameters (without added noise perturbation) for 1000 episodes, and take the average total reward over all episodes, as in Salimans et al. [27]. When performance eventually stabilizes to a maximum 'flat' line (determined by calculating whether a 50-episode moving average has changed by no more than 5%), we record the maximum of the evaluation performance values for this particular experimental run. As is usual [6], training performance (shown in Fig. 2C) will be slightly lower than the corresponding maximum evaluation performance (shown in Table 1). We observe this standard procedure to be quite robust to noise.

We repeat this evaluation procedure for multiple random instances of the same network topology by varying the random seed of network generation. These different instances share the same average density p (i.e. the same average number of links) and the same number of nodes N. We use a global broadcast probability of 0.8 (a popular hyperparameter value for broadcast in optimization problems). Since each node runs the same number of episode time steps per iteration, different networks with the same p can be fairly compared. For all experiments (all network families and sizes of networks), we use an average network density of 0.5 because it is sparse enough to provide good learning performance and consistent (not noisy) empirical results.

We then report the average performance over 6 runs with 95% confidence intervals. We share the JSON files that fully describe our experiments and our code; the JSON experiment files and code implementation can be found at github.com/d-val/NetES.

In addition to using the evaluation procedure of Salimans et al. [27], we also use their exact neural network architecture: multilayer perceptrons with two 64-unit hidden layers separated by tanh nonlinearities. We also keep all the modifications to the update rule introduced by Salimans et al. to improve performance: (1) training for one complete episode for each iteration; (2) employing antithetic sampling, also known as mirrored sampling [12], where we explore ϵ_i^(t), −ϵ_i^(t) for every sample ϵ_i^(t) ∼ N(0, I); (3) employing fitness shaping [34] by applying a rank transformation to the returns before computing each parameter update; and (4) weight decay in the parameters for regularization. We also use the exact same hyperparameters as the original OpenAI (fully-connected and centralized) implementation [27], varying only the network topology for our experiments.

Figure 3: A: Agents with any amount of periodic broadcasting do not learn (RoboSchool Humanoid-v1 with 1000 agents). B: None of the control baselines with fully-connected networks learn, showing that the use of alternate topologies is what leads to learning (MuJoCo Ant-v1 with 100 agents). C: We generate instances of random networks from our four families of networks, and observe that sparser Erdos-Renyi graphs maximize the diversity of parameter updates.
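As an illustration of this evaluation bookkeeping, a minimal sketch of the plateau test (comparing consecutive 50-episode moving averages) might look as follows; the function name and exact windowing are our own choices, not taken from the released code:

```python
import numpy as np

def has_plateaued(eval_scores, window=50, tolerance=0.05):
    """True once consecutive moving averages of eval scores differ by < 5%."""
    if len(eval_scores) < 2 * window:
        return False
    prev = np.mean(eval_scores[-2 * window:-window])
    curr = np.mean(eval_scores[-window:])
    return abs(curr - prev) <= tolerance * abs(prev)

# During training: with probability 0.08 per iteration, pause, run the best
# agent's noise-free parameters for 1000 episodes, append the mean total
# reward to eval_scores, and report max(eval_scores) once has_plateaued().
```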
We first use one benchmark task (MuJoCo Ant-v1, because it runs fastest) and networks of 100 agents to evaluate NetES on each of the four families of communication topology: Erdos-Renyi, scale-free, small-world, and the standard fully-connected network. As seen in Fig. 2A, Erdos-Renyi strongly outperforms the other topologies.

Our hypothesis is that using alternate topologies (other than the de facto fully-connected topologies) can lead to strong improvements in learning performance. We therefore focus on Erdos-Renyi graphs for all other results going forward; this choice is supported by our theoretical results, which indicate that Erdos-Renyi graphs would do better on any task.

If Erdos-Renyi continues to outperform fully-connected topologies on various tasks and with larger networks, our hypothesis will be confirmed, as long as the results are also in agreement with our ablation studies. We leave the full characterization of the performance of other network topologies to future work.
Using Erdos-Renyi networks, we run larger networks of 1000 agents on all five benchmarks. As can be seen in Table 1, our Erdos-Renyi networks outperform fully-connected networks on all benchmark tasks, resulting in improvements ranging from 9.8% on MuJoCo Ant-v1 to 798% on MuJoCo Humanoid-v1. All results are statistically significant (based on 95% confidence intervals).

We note that the difference in performance between Erdos-Renyi and fully-connected networks is higher for smaller networks (Fig. 2A and Fig. 3B) than for larger networks (Table 1) on the same benchmark, and we observe this behavior across different benchmarks. We believe that this is because NetES is able to achieve higher performance with fewer agents due to its efficiency of exploration, as supported by our empirical and theoretical results below.
So far, we have compared alternate network topologies with fully-connected networks containing the same number of agents. In this section, we investigate whether organizing the communication topology using Erdos-Renyi networks can outperform larger fully-connected networks. We choose one of the benchmarks that had a small difference between the two approaches at 1000 agents, RoboSchool Humanoid-v1. As shown in Fig. 2B and the training curves in Fig. 2C (which display the training performance, not the evaluation metric results, which would be higher, as discussed earlier), an Erdos-Renyi network with 1000 agents provides performance comparable to 3000 agents arranged in a fully-connected network.
To ensure that the improvements in performance come from the use of alternate network topologies rather than from the other modifications we implemented in the ES algorithm, we run control experiments on each modification: (1) the use of broadcast, and (2) the fact that each agent/node has a different parameter set. We test all combinations.
We want to make sure that broadcast (over different probabilities ranging from 0.0 to 1.0) does not explain away our performance improvements. We compare 'disconnected' networks, where agents can only learn from their own parameter update and from broadcasting (they do not see the rewards and parameters of any other agents at each step as in NetES), to Erdos-Renyi networks and fully-connected networks of 1000 agents on the RoboSchool Humanoid-v1 task. As can be seen in Fig. 3A, practically no learning happens with just broadcast and no network. These experiments show that broadcast does not explain away the performance improvement we observe when using NetES.
The other change we introduce in NetES is to have each agent hold its own parameter value θ_i^(t) instead of a global (noised) parameter θ^(t). We therefore investigate the performance of the following four control baselines, all fully-connected ES with 100 agents: (1) same global parameter, no broadcast; (2) same global parameter, with broadcast; (3) different parameters, with broadcast; and (4) different parameters, no broadcast; each compared to NetES running an Erdos-Renyi network. For this experiment we use MuJoCo Ant-v1. As shown in Fig. 3B, NetES does better than all four control baselines, showing that the improvements of NetES come from using alternate topologies and not from having different local parameters for each agent.

In this section, we present theoretical insights into why alternate topologies can outperform fully-connected topologies, and why Erdos-Renyi networks also outperform the other two network families we have tested. A motivating factor for introducing alternate connectivity is to search the parameter space more effectively, a common motivation in DRL and optimization in general. One possible heuristic for measuring the capacity to explore the parameter space is the diversity of parameter updates during each iteration, which can be measured by the variance of parameter updates:

Theorem 7.1. In a NetES update iteration t for a system with N agents with parameters Θ = {θ_1^(t), ..., θ_N^(t)}, agent communication matrix A = {a_ij}, agent-wise perturbations E = {ϵ_1^(t), ..., ϵ_N^(t)}, and parameter update u_i^(t) = (α/(Nσ)) Σ_{j=1}^{N} a_ij · ( R(θ_j^(t) + σϵ_j^(t)) · ((θ_j^(t) + σϵ_j^(t)) − θ_i^(t)) ) as per Equation 3, the following relation holds:

    Var_i[u_i^(t)] ≤ (max R(·)² / (N²σ²)) { ( ‖A‖_F / (min_l |A_l|)² ) · f(Θ, E) − ( min_l |A_l| / max_l |A_l| )² · (σ²/N) ( Σ_{i,j} ϵ_i^(t) ϵ_j^(t) ) }    (4)

Here, |A_l| = Σ_j a_jl, and f(Θ, E) = sqrt( Σ_{j,k,m} ( (θ_j^(t) + σϵ_j^(t) − θ_m^(t)) · (θ_k^(t) + σϵ_k^(t) − θ_m^(t)) )² ).

The proof for Theorem 7.1 is provided in the supplementary material.

In this work, our hypothesis is that some networks do better than the de facto fully-connected topologies used in state-of-the-art algorithms. We leave to future work the important question of optimizing the network topology for maximum performance. Doing so would require a lower bound, as it would provide the worst-case performance of a topology. Instead, in this section, we are interested in providing insights into why some networks do better than others, which can be understood through our upper bound, as it characterizes the capacity for parameter exploration possible in a network topology.

By Theorem 7.1, we see that the diversity of exploration in the parameter updates across agents is likely affected by two quantities that involve the connectivity matrix A: the first is the term ‖A‖_F / (min_l |A_l|)² (henceforth referred to as the reachability of the network), which according to our bound we want to maximize; the second is (min_l |A_l| / max_l |A_l|)² (henceforth referred to as the homogeneity of the network), which according to our bound we want to be as small as possible in order to maximize the diversity of parameter updates across agents. Reachability and homogeneity are not independent; both are statistics of the degree distribution of a graph. It is interesting to note that the upper bound does not depend on the reward landscape R(·) of the task at hand, indicating that our theoretical insights should be independent of the learning task.

Reachability is the ratio of the square root of the total number of paths of length 2 in A to the squared minimum degree over all nodes of A. Homogeneity is the squared ratio of the minimum to maximum connectivity of all nodes of A: the higher this value, the more homogeneously connected the graph is.

Using the above definitions of reachability and homogeneity, we generate random instances of each network family and plot them in Fig. 3C. Two main observations can be made from this simulation:

• Erdos-Renyi networks maximize reachability and minimize homogeneity, which means that they likely maximize the diversity of parameter exploration.
• Fully-connected networks are the single worst network in terms of exploration diversity: they minimize reachability and maximize homogeneity, the opposite of what would be required for maximizing parameter exploration according to our bound.

These theoretical results agree with our empirical results: Erdos-Renyi networks perform best, followed by scale-free networks, while fully-connected networks do worst.

It is also important to note that the quantity in Theorem 7.1 is not the variance of the value function gradient, which is typically minimized in reinforcement learning. It is instead the variance of the positions in parameter space of the agents after a step of our algorithm. This quantity is more productively conceptualized as akin to a radius of exploration for a distributed search procedure rather than in its relationship to the variance of the gradient. The challenge is then to maximize the search radius of positions in parameter space to find high-performing parameters. As for the side effects this might have, given the common wisdom that increasing the variance of the value gradient in single-agent reinforcement learning can slow convergence, it is worth noting that noise (i.e. variance) is often critical for escaping local minima in other algorithms, e.g. via the stochasticity of SGD.
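Both statistics can be computed directly from an adjacency matrix; the sketch below follows the convention used in the appendix, where ‖A‖_F is taken over the entries of A² (the counts of length-2 paths). The helper names are ours:

```python
import numpy as np

def reachability(adj):
    """||A||_F / (min_l |A_l|)^2, with ||A||_F taken over the entries of A^2."""
    degrees = adj.sum(axis=0)   # |A_l| = sum_j a_jl
    paths2 = adj @ adj          # entry (i, j): number of length-2 paths
    return np.sqrt(paths2.sum()) / degrees.min() ** 2

def homogeneity(adj):
    """Squared ratio of minimum to maximum degree."""
    degrees = adj.sum(axis=0)
    return (degrees.min() / degrees.max()) ** 2

# A fully-connected graph has homogeneity 1 (its worst case for diversity);
# sparse Erdos-Renyi graphs increase reachability and decrease homogeneity.
```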
We can approximate reachability and homogeneity for Erdos-Renyi networks as a function of their density (a derivation can be found in the supplementary material):

Lemma 7.2. For an Erdos-Renyi graph G with N vertices, density p, and adjacency matrix A, the following approximation can be made for its reachability ρ(G):

    ρ(G) ≈ (p²N)^(−1/2)

Similarly, its homogeneity γ(G) can be approximated as follows:

    γ(G) ≈ 1 − 8 sqrt( (1 − p) / (Np) )

As can be seen from these approximations, the sparser an Erdos-Renyi network is (i.e. the lower p is), the larger its reachability and the lower its homogeneity. The approximations and the actual reachability and homogeneity (computed directly from the graph adjacency matrix) are plotted in Fig. 4.

Figure 4: Reachability and homogeneity in the Erdos-Renyi case for different densities p. Points correspond to the real data, while the lines are the approximations.

In addition to providing insights as to why some families of network topologies do better than others (as shown in Fig. 3C), Theorem 7.1 also predicts that as Erdos-Renyi networks become sparser (less dense), because their reachability increases and their homogeneity decreases, the diversity of parameter updates during each iteration increases, leading to more effective parameter search and therefore increased performance.

By running a final set of experiments in which we vary the density of Erdos-Renyi networks (keeping the number of agents at 1000) and use these topologies on the RoboSchool Humanoid-v1 DRL benchmark, we can test whether sparser networks actually perform better. As can be seen in Figure 5, when the density of Erdos-Renyi networks decreases, learning performance increases significantly.
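As a quick, self-contained empirical check of Lemma 7.2, one can compare the closed-form approximations against the same statistics computed from sampled Erdos-Renyi graphs; the script below is our own illustration (the constant in the homogeneity approximation follows the appendix derivation):

```python
import networkx as nx
import numpy as np

n = 1000
for p in (0.1, 0.3, 0.5, 0.9):
    adj = nx.to_numpy_array(nx.erdos_renyi_graph(n, p, seed=0))
    degrees = adj.sum(axis=0)
    rho = np.sqrt((adj @ adj).sum()) / degrees.min() ** 2   # reachability
    gamma = (degrees.min() / degrees.max()) ** 2            # homogeneity
    rho_approx = (p ** 2 * n) ** -0.5
    gamma_approx = 1 - 8 * np.sqrt((1 - p) / (n * p))
    print(f"p={p:.1f}  rho={rho:.4f} (approx {rho_approx:.4f})  "
          f"gamma={gamma:.3f} (approx {gamma_approx:.3f})")
```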
Figure 5: The distribution of reward improvements (compared to the fully-connected topologies) with network density in Erdos-Renyi networks for RoboSchool Humanoid-v1. As predicted by Theorem 7.1, as density decreases, performance increases. Note that a density of 1.0 would result in a fully-connected network.

In our work, we extended ES, a DRL algorithm, to use alternate network topologies and empirically showed that the de facto fully-connected topology performs worse than alternate topologies in our experiments. We also performed an ablation study by running controls on all the modifications we made to the ES algorithm, and we showed that the improvements we observed are not explained away by modifications other than the use of alternate topologies. Finally, we provided theoretical insights into why alternate topologies may be superior, and observed that our theoretical predictions are in line with our empirical results. Future work could explore the use of dynamic topologies in which agent connections are continuously rewired to adapt to the local terrain of the reward landscape.
The authors wish to thank Yan Leng for her help in early analysis of the properties of networks, Alia Braley for proofreading, and Tim Salimans for his help with replicating the OpenAI results as a benchmark.
REFERENCES

[1] Daron Acemoglu, Munther A Dahleh, Ilan Lobel, and Asuman Ozdaglar. 2011. Bayesian learning in social networks. The Review of Economic Studies 78, 4 (2011), 1201–1236.
[2] Ricardo M Araujo and Luis C Lamb. 2008. Memetic Networks: Analyzing the Effects of Network Properties in Multi-Agent Performance. In AAAI, Vol. 8. 3–8.
[3] Anne Auger and Nikolaus Hansen. 2005. A restart CMA evolution strategy with increasing population size. In The 2005 IEEE Congress on Evolutionary Computation, Vol. 2. IEEE, 1769–1776.
[4] Albert-László Barabási and Réka Albert. 1999. Emergence of scaling in random networks. Science 286, 5439 (1999), 509–512.
[5] Daniel Barkoczi and Mirta Galesic. 2016. Social learning strategies modify the effect of network structure on group performance. Nature Communications 7 (2016), 13109.
[6] Marc G Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. 2013. The Arcade Learning Environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research 47 (2013), 253–279.
[7] Krzysztof Choromański, Michał Matuszak, and Jacek Miekisz. 2013. Scale-free graph with preferential attachment and evolving internal vertex structure. Journal of Statistical Physics (2013).
[8] Paul Erdős and Alfréd Rényi. 1959. On random graphs I. Publ. Math. Debrecen 6 (1959), 290–297.
[9] Paul Erdős and Alfréd Rényi. 1959. On random graphs. Publicationes Mathematicae 6, 26 (1959), 290–297.
[10] Lasse Espeholt, Hubert Soyer, Remi Munos, Karen Simonyan, Volodymyr Mnih, Tom Ward, Yotam Doron, Vlad Firoiu, Tim Harley, Iain Dunning, et al. 2018. IMPALA: Scalable distributed Deep-RL with importance weighted actor-learner architectures. arXiv preprint arXiv:1802.01561 (2018).
[11] Jacques Ferber and Gerhard Weiss. 1999. Multi-agent systems: An introduction to distributed artificial intelligence. Vol. 1. Addison-Wesley, Reading.
[12] John Geweke. 1988. Antithetic acceleration of Monte Carlo integration in Bayesian inference. Journal of Econometrics 38, 1-2 (1988), 73–89.
[13] Max Jaderberg, Valentin Dalibard, Simon Osindero, Wojciech M Czarnecki, Jeff Donahue, Ali Razavi, Oriol Vinyals, Tim Green, Iain Dunning, Karen Simonyan, et al. 2017. Population Based Training of Neural Networks. arXiv preprint arXiv:1711.09846 (2017).
[14] Tang Jing, Meng Hiot Lim, and Yew Soon Ong. 2004. Island model parallel hybrid-GA for large scale combinatorial optimization. In Proceedings of the 8th International Conference on Control, Automation, Robotics and Vision, Special Session on Computational Intelligence on the Grid.
[15] David Lazer and Allan Friedman. 2007. The network structure of exploration and exploitation. Administrative Science Quarterly 52, 4 (2007), 667–694.
[16] Sergio Valcarcel Macua, Aleksi Tukiainen, Daniel García-Ocaña Hernández, David Baldazo, Enrique Munoz de Cote, and Santiago Zazo. 2017. Diff-DAC: Distributed Actor-Critic for Multitask Deep Reinforcement Learning. arXiv preprint arXiv:1710.10363 (2017).
[17] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. 2016. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning. 1928–1937.
[18] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. 2013. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602 (2013).
[19] Arun Nair, Praveen Srinivasan, Sam Blackwell, Cagdas Alcicek, Rory Fearon, Alessandro De Maria, Vedavyas Panneershelvam, Mustafa Suleyman, Charles Beattie, Stig Petersen, et al. 2015. Massively parallel methods for deep reinforcement learning. arXiv preprint arXiv:1507.04296 (2015).
[20] Angelia Nedic. 2011. Asynchronous broadcast-based convex optimization over a network. IEEE Trans. Automat. Control 56, 6 (2011), 1337–1351.
[21] Angelia Nedić, Alex Olshevsky, and Michael G Rabbat. 2017. Network topology and communication-computation tradeoffs in decentralized optimization. arXiv preprint arXiv:1709.08765 (2017).
[22] Angelia Nedic and Asuman Ozdaglar. 2010. Cooperative distributed multi-agent optimization. In Convex Optimization in Signal Processing and Communications. 340–386.
[23] Mark Newman. 2010. Networks: An Introduction. Oxford University Press.
[24] OpenAI. 2017. Roboschool. https://github.com/openai/roboschool. (2017). Accessed: 2017-09-30.
[25] OpenAI. 2018. OpenAI Five. https://blog.openai.com/openai-five/. (2018).
[26] Ingo Rechenberg. 1973. Evolution Strategy: Optimization of Technical Systems by Means of Biological Evolution. Frommann-Holzboog, Stuttgart 104 (1973), 15–16.
[27] Tim Salimans, Jonathan Ho, Xi Chen, and Ilya Sutskever. 2017. Evolution strategies as a scalable alternative to reinforcement learning. arXiv preprint arXiv:1703.03864 (2017).
[28] Hans-Paul Schwefel. 1977. Numerische Optimierung von Computer-Modellen mittels der Evolutionsstrategie: mit einer vergleichenden Einführung in die Hill-Climbing- und Zufallsstrategie. Birkhäuser.
[29] Steven L Scott, Alexander W Blocker, Fernando V Bonassi, Hugh A Chipman, Edward I George, and Robert E McCulloch. 2016. Bayes and big data: The consensus Monte Carlo algorithm. International Journal of Management Science and Engineering Management 11, 2 (2016), 78–88.
[30] Emanuel Todorov, Tom Erez, and Yuval Tassa. 2012. MuJoCo: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 5026–5033.
[31] Jeffrey Travers and Stanley Milgram. 1977. An experimental study of the small world problem. In Social Networks. Elsevier, 179–197.
[32] Oriol Vinyals, Igor Babuschkin, Junyoung Chung, Michael Mathieu, Max Jaderberg, Wojtek Czarnecki, Andrew Dudzik, Aja Huang, Petko Georgiev, Richard Powell, Timo Ewalds, Dan Horgan, Manuel Kroiss, Ivo Danihelka, John Agapiou, Junhyuk Oh, Valentin Dalibard, David Choi, Laurent Sifre, Yury Sulsky, Sasha Vezhnevets, James Molloy, Trevor Cai, David Budden, Tom Paine, Caglar Gulcehre, Ziyu Wang, Tobias Pfaff, Toby Pohlen, Dani Yogatama, Julia Cohen, Katrina McKinney, Oliver Smith, Tom Schaul, Timothy Lillicrap, Chris Apps, Koray Kavukcuoglu, Demis Hassabis, and David Silver. 2019. AlphaStar: Mastering the Real-Time Strategy Game StarCraft II. https://deepmind.com/blog/alphastar-mastering-real-time-strategy-game-starcraft-ii/. (2019).
[33] Duncan J Watts and Steven H Strogatz. 1998. Collective dynamics of 'small-world' networks. Nature 393, 6684 (1998), 440–442.
[34] Daan Wierstra, Tom Schaul, Tobias Glasmachers, Yi Sun, Jan Peters, and Jürgen Schmidhuber. 2014. Natural evolution strategies. Journal of Machine Learning Research 15, 1 (2014), 949–980.
[35] Ronald J Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. In Reinforcement Learning. Springer, 5–32.
[36] David H Wolpert and Kagan Tumer. 1999. An introduction to collective intelligence. arXiv preprint cs/9908014 (1999).
[37] Anita Williams Woolley, Christopher F Chabris, Alex Pentland, Nada Hashmi, and Thomas W Malone. 2010. Evidence for a collective intelligence factor in the performance of human groups. Science 330, 6004 (2010), 686–688.
[38] Kaiqing Zhang, Zhuoran Yang, Han Liu, Tong Zhang, and Tamer Başar. 2018. Fully decentralized multi-agent reinforcement learning with networked agents. arXiv preprint arXiv:1802.08757 (2018).
APPENDIX 1: DIVERSITY OF PARAMETER UPDATES
Here we provide the proof of Theorem 7.1 from the main paper concerning the diversity of the parameter updates.

Theorem 9.1. In a multi-agent evolution strategies update iteration t for a system with N agents with parameters Θ = {θ_1^(t), ..., θ_N^(t)}, agent communication matrix A = {a_ij}, agent-wise perturbations E = {ϵ_1^(t), ..., ϵ_N^(t)}, and parameter update u_i^(t) given by the sparsely-connected update rule

    u_i^(t) = (α/(Nσ)) Σ_{j=1}^{N} a_ij · ( R(θ_j^(t) + σϵ_j^(t)) · ((θ_j^(t) + σϵ_j^(t)) − θ_i^(t)) ),

the following relation holds:

    Var_i[u_i^(t)] ≤ (max R(·)² / (N²σ²)) { ( ‖A‖_F / (min_l |A_l|)² ) · f(Θ, E) − ( min_l |A_l| / max_l |A_l| )² · g(E) }    (5)

Here, |A_l| = Σ_j a_jl, f(Θ, E) = sqrt( Σ_{j,k,m}^{N} ( (θ_j^(t) + σϵ_j^(t) − θ_m^(t)) · (θ_k^(t) + σϵ_k^(t) − θ_m^(t)) )² ), and g(E) = (σ²/N) ( Σ_{i,j}^{N} ϵ_i^(t) ϵ_j^(t) ).

Proof. From the sparsely-connected update rule above, the update is given by:

    u_i^(t) = (α/(Nσ)) Σ_{j=1}^{N} a_ij · ( R(θ_j^(t) + σϵ_j^(t)) · ((θ_j^(t) + σϵ_j^(t)) − θ_i^(t)) )    (6)

The variance of u_i^(t) over agents can be written as:

    Var_i[u_i^(t)] = E_{i∈A}[(u_i^(t))²] − (E_{i∈A}[u_i^(t)])²    (7)

Expanding E_{i∈A}[(u_i^(t))²]:

    E_{i∈A}[(u_i^(t))²] = (1/N) Σ_{i∈A} { (1/(Nσ)) Σ_j a_ij · R(θ_j^(t) + σϵ_j^(t)) · (θ_j^(t) + σϵ_j^(t) − θ_i^(t)) }²    (8)

Simplifying:

    = (1/(N²σ²)) Σ_{i,j,k} ( (a_ij a_ik / |A_i|²) R(θ_j^(t) + σϵ_j^(t)) R(θ_k^(t) + σϵ_k^(t)) · (θ_j^(t) + σϵ_j^(t) − θ_i^(t)) · (θ_k^(t) + σϵ_k^(t) − θ_i^(t)) )    (9)

Since R(·) ≤ max R(·):

    ≤ (max R(·)² / (N²σ²)) Σ_{i,j,k} (a_ij a_ik / |A_i|²) · (θ_j^(t) + σϵ_j^(t) − θ_i^(t)) · (θ_k^(t) + σϵ_k^(t) − θ_i^(t))    (10)

    ≤ (max R(·)² / (N²σ²)) Σ_{i,j,k} (a_ij a_ik / (min_l |A_l|)²) · (θ_j^(t) + σϵ_j^(t) − θ_i^(t)) · (θ_k^(t) + σϵ_k^(t) − θ_i^(t))    (11)

By the Cauchy-Schwarz inequality:

    E_{i∈A}[(u_i^(t))²] ≤ (max R(·)² / (N²σ²)) · ( sqrt(Σ_{i,j,k} (a_ij a_ik)²) / (min_l |A_l|)² ) · sqrt( Σ_{i,j,k} ( (θ_j^(t) + σϵ_j^(t) − θ_i^(t)) · (θ_k^(t) + σϵ_k^(t) − θ_i^(t)) )² )    (12)

Since a_ij ∈ {0, 1} for all (i, j), (a_ij a_ik)² = a_ij a_ik for all (i, j, k). Additionally, we know that a_ij = a_ji, since A is symmetric. Therefore Σ_i a_ij a_ik = Σ_i a_ji a_ik = (A²)_jk, and the first square root above equals the quantity written as ‖A‖_F throughout (taken over the entries of A²). Using this:

    E_{i∈A}[(u_i^(t))²] ≤ (max R(·)² / (N²σ²)) · ( ‖A‖_F / (min_l |A_l|)² ) · sqrt( Σ_{i,j,k} ( (θ_j^(t) + σϵ_j^(t) − θ_i^(t)) · (θ_k^(t) + σϵ_k^(t) − θ_i^(t)) )² )    (13)

Writing the last factor as f(Θ, E), with Θ = {θ_i^(t)}_{i=1}^{N} and E = {ϵ_i^(t)}_{i=1}^{N}, for compactness, we obtain:

    E_{i∈A}[(u_i^(t))²] ≤ (max R(·)² / (N²σ²)) · ( ‖A‖_F / (min_l |A_l|)² ) · f(Θ, E)    (14)

Similarly, the squared expectation of u_i^(t) over all agents can be written as:

    (E_{i∈A}[u_i^(t)])² = (1/(N²σ²)) ( Σ_{i,j} (a_ij / |A_i|) · R(θ_j^(t) + σϵ_j^(t)) · (θ_j^(t) + σϵ_j^(t) − θ_i^(t)) )²    (17)

Since R(·) ≥ min R(·):

    ≥ (min R(·)² / (N²σ²)) ( Σ_{i,j} (a_ij / |A_i|) · (θ_j^(t) + σϵ_j^(t) − θ_i^(t)) )²    (18)

    ≥ (min R(·)² / (N²σ² (max_l |A_l|)²)) ( Σ_{i,j} a_ij · (θ_j^(t) + σϵ_j^(t) − θ_i^(t)) )²    (19)

Since A is symmetric, Σ_{i,j} a_ij · (θ_j^(t) + σϵ_j^(t) − θ_i^(t)) = Σ_{i,j} a_ij · (θ_i^(t) + σϵ_i^(t) − θ_j^(t)). Averaging the two forms, the θ terms cancel:

    Σ_{i,j} a_ij · (θ_j^(t) + σϵ_j^(t) − θ_i^(t)) = (σ/2) Σ_{i,j} a_ij · (ϵ_j^(t) + ϵ_i^(t))    (21)

Using the symmetry of A again, Σ_{i,j} a_ij ϵ_i^(t) = Σ_{i,j} a_ij ϵ_j^(t), so:

    = σ Σ_{i,j} a_ij · ϵ_j^(t) = σ Σ_j |A_j| · ϵ_j^(t)    (23)

and therefore, bounding |A_j| by min_l |A_l| and using (Σ_j ϵ_j^(t))² = Σ_{i,j} ϵ_i^(t) ϵ_j^(t):

    (E_{i∈A}[u_i^(t)])² ≥ (min R(·)² / (N²σ²)) · ( min_l |A_l| / max_l |A_l| )² · σ² ( Σ_{i,j} ϵ_i^(t) ϵ_j^(t) )    (24)

Combining both terms of the variance expression, and using the normalization of the iteration rewards, which ensures min R(·) = −max R(·), we obtain:

    Var_{i∈A}[u_i^(t)] ≤ (max R(·)² / (N²σ²)) { ( ‖A‖_F / (min_l |A_l|)² ) · f(Θ, E) − ( min_l |A_l| / max_l |A_l| )² · g(E) }    (25)

∎

APPENDIX 2: APPROXIMATING REACHABILITY AND HOMOGENEITY FOR LARGE ERDOS-RENYI GRAPHS
Recall that an Erdos-Renyi graph is constructed in the following way:

(1) Take n nodes.
(2) For each pair of nodes, link them with probability p.

The model is simple, and we can infer the following:

• The average degree of a node is p(n − 1).
• The degree distribution of the nodes is the binomial distribution B(n − 1, p).
• The (average) number of paths of length 2 from a node i to a node j ≠ i, denoted n_ij^(2), can be calculated this way: a path of length two between i and j involves a third node k. Since there are n nodes, the number of possible choices of k other than i and j is n − 2. However, for that path to exist there have to be links between i and k and between k and j, an event with probability p². Thus, the average number of paths of length 2 between i and j is p²(n − 2).

Estimating Reachability
We can then estimate Reachability:
    Reachability = ‖A‖_F / (min_l |A_l|)² = sqrt( Σ_{i,j} n_ij^(2) ) / k_min²

where k_min = min_l |A_l| is the minimum degree in the network. Given the above calculations we can approximate:

    Σ_{i,j} n_ij^(2) = Σ_i n_ii^(2) + Σ_{i≠j} n_ij^(2) ≈ n · [p(n − 1)] + n(n − 1) · [p²(n − 2)]

where the first term is the number of paths of length 2 from i to i summed over all nodes, i.e. the sum of the degrees in the network, and the second term is the sum of p²(n − 2) over the pairs with i ≠ j. For large n we have:

    Σ_{i,j} n_ij^(2) ≈ p²n³, and thus ‖A‖_F ≈ sqrt(p²n³)    (26)

For the denominator k_min we could use the distribution of the minimum of the binomial distribution B(n − 1, p). However, since that is a complicated calculation, we can approximate it as follows: since the binomial distribution B(n − 1, p) looks like a Gaussian, the minimum of the distribution is close to the mean minus two times the standard deviation:

    k_min ≈ p(n − 1) − 2 sqrt( p(n − 1)(1 − p) )    (27)

Once again, in the case of large n, we have k_min ≈ pn. Thus:

    Reachability ≈ sqrt(p²n³) / [ p(n − 1) − 2 sqrt(p(n − 1)(1 − p)) ]²    (28)

Assuming that n is large, we can approximate:

    Reachability ≈ pn^(3/2) / (p²n²) = (p²n)^(−1/2)

Thus reachability decreases with increasing n and p. Note that the density of the Erdos-Renyi graph (the number of links over the number of possible links) is p, and thus for a fixed n, the sparser the network (the lower p), the higher its reachability.
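A small sampling check of the approximation in Equation (27) can be run by drawing node degrees from B(n − 1, p) directly; this script is our own illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 1000, 100
for p in (0.1, 0.3, 0.5, 0.9):
    # Marginally, each node's degree is B(n-1, p); we ignore the mild
    # correlations between degrees of an actual graph for this check.
    sampled = np.mean([rng.binomial(n - 1, p, size=n).min()
                       for _ in range(trials)])
    approx = p * (n - 1) - 2 * np.sqrt(p * (n - 1) * (1 - p))
    print(f"p={p:.1f}  sampled k_min={sampled:.1f}  approx={approx:.1f}")
```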
Estimating Homogeneity

The homogeneity is defined as:
    Homogeneity = ( k_min / k_max )²

As before, we can approximate:

    k_max ≈ p(n − 1) + 2 sqrt( p(n − 1)(1 − p) )

and thus:

    Homogeneity ≈ ( [ p(n − 1) − 2 sqrt(p(n − 1)(1 − p)) ] / [ p(n − 1) + 2 sqrt(p(n − 1)(1 − p)) ] )²

For large p we can approximate this as:

    Homogeneity ≈ 1 − 8 sqrt(1 − p) / sqrt(np)    (29)

which shows that the sparser the network (the lower p), the lower its homogeneity. Thus, for a fixed number of nodes n, decreasing the density p increases reachability and decreases homogeneity.

Figure 6: Comparison between the values of k_min, ‖A‖_F, and Reachability as a function of p for different realizations of the Erdos-Renyi model (points) and their approximations given in Equations (27), (26) and (28), respectively (lines).