Multi-Scale Games: Representing and Solving Games on Networks with Group Structure
A Preprint
Kun Jin
University of Michigan, Ann Arbor
[email protected]

Yevgeniy Vorobeychik
Washington University in St. Louis
[email protected]

Mingyan Liu
University of Michigan, Ann Arbor
[email protected]

January 22, 2021

Abstract
Network games provide a natural machinery to compactly represent strategic interactions among agents whose payoffs exhibit sparsity in their dependence on the actions of others. Besides encoding interaction sparsity, however, real networks often exhibit a multi-scale structure, in which agents can be grouped into communities, those communities further grouped, and so on, and where interactions among such groups may also exhibit sparsity. We present a general model of multi-scale network games that encodes such multi-level structure. We then develop several algorithmic approaches that leverage this multi-scale structure, and derive sufficient conditions for convergence of these to a Nash equilibrium. Our numerical experiments demonstrate that the proposed approaches enable orders of magnitude improvements in scalability when computing Nash equilibria in such games. For example, we can solve previously intractable instances involving up to 1 million agents in under 15 minutes.
Strategic interactions among interconnected agents are commonly modeled using the network, or graphical, game formalism (Kearns, Littman, and Singh, 2001; Jackson and Zenou, 2015). In such games, the utility of an agent depends on its own actions as well as those of its network neighbors. Many variations of games on networks have been considered, with applications including the provision of public goods (Allouch, 2015; Buckley and Croson, 2006; Khalili, Zhang, and Liu, 2019; Yu et al., 2020), security (Hota and Sundaram, 2018; La, 2016; Vorobeychik and Letchford, 2015), and financial markets (Acemoglu et al., 2012).

Figure 1: An illustration of a multi-scale (3-level) network.
While network games are a powerful modeling framework, they fail to capture a common feature of human organization: groups and communities. Indeed, the investigation of communities, or close-knit groups, in social networks is a major research thread in network science. Moreover, such groups often have a hierarchical structure (Clauset, Moore, and Newman, 2008; Girvan and Newman, 2002). For example, strategic interactions among organizations in a marketplace often boil down to interactions among their constituent business units, which are, in turn, comprised of individual decision makers. In the end, it is those lowest-level agents who ultimately accrue the consequences of these interactions (for example, corporate profits ultimately benefit individual shareholders). Moreover, while there are clear interdependencies among organizations, individual utilities are determined by a combination of the individual actions of some agents, together with aggregate decisions by the groups (e.g., business units, organizations). For example, an employee's bonus is determined in part by their performance in relation to their co-workers, and in part by how well their employer (organization) performs against its competitors in the marketplace.

We propose a novel multi-scale game model that generalizes network games to capture such hierarchical organization of individuals into groups. Figure 1 offers a stylized example in which three groups (e.g., organizations) are comprised of 2-3 subgroups each (e.g., business units), which are in turn comprised of 2-5 individual agents. Specifically, our model includes an explicit hierarchical network structure that organizes agents into groups across a series of levels. Further, each group is associated with an action which deterministically aggregates the decisions by its constituent agents.
The game is grounded at the lowest level, where the agents are associated with scalar actions and utility functions that have modular structure in the strategies taken at each level of the game. For example, in Figure 1, the utility function of an individual member $a_j$ of level-3 group $a_3^{(3)}$ is a function of the strategies of (i) $a_j$'s immediate neighbors (represented by links between pairs of filled-in circles), (ii) $a_j$'s level-2 group and its network neighbors (the small hollow circles), and (iii) $a_j$'s level-3 group, $a_3^{(3)}$ (large hollow circle), and its network neighbors, $a_1^{(3)}$ and $a_2^{(3)}$.

Our next contribution is a series of iterative algorithms for computing pure strategy Nash equilibria that explicitly leverage the proposed multi-scale game representation. The first of these simply takes advantage of the compact game representation in computing equilibria. The second algorithm we propose offers a further innovation through an iterative procedure that alternates between game levels, treating groups themselves as pseudo-agents in the process. We present sufficient conditions for the convergence of this algorithm to a pure strategy Nash equilibrium through a connection to Structured Variational Inequalities (He, Yang, and Wang, 2000), although the result is limited to games with two levels. To address the latter limitation, we design a third iterative algorithm that converges even in games with an arbitrary number of levels.

Our final contribution is an experimental evaluation of the proposed algorithms compared to best response dynamics. In particular, we demonstrate orders of magnitude improvements in scalability, enabling us to solve games that cannot be solved using a conventional network game representation.

Related Work:
Network games have been an active area of research; see, e.g., surveys by Jackson and Zenou (2015) and Bramoullé and Kranton (2016). We now review the most relevant papers. Conditions for the existence, uniqueness, and stability of Nash equilibria in network games under general best responses are studied in (Parise and Ozdaglar, 2019; Naghizadeh and Liu, 2017; Scutari et al., 2014; Bramoullé, Kranton, and D'Amours, 2014). Variational inequalities (VI) are used in these works to analyze the fixed point and contraction properties of the best response mappings. Parise and Ozdaglar (2019), Naghizadeh and Liu (2017), and Scutari et al. (2014) show that when the Jacobian matrix of the best response mapping is a P-matrix or is positive definite, a feasible unique Nash equilibrium exists and can be obtained by best-response dynamics (Scutari et al., 2014; Parise and Ozdaglar, 2019). In this paper, we extend the analysis of equilibrium and best responses from a conventional network game to a multi-scale network game, where the utility functions are decomposed into separable utility components to which best responses are applied separately. This is similar to the generalization from a conventional VI problem to a structured VI (SVI) problem (He, Yang, and Wang, 2000; He, 2009; He and Yuan, 2012; Bnouhachem, Benazza, and Khalfaoui, 2013).

Previous works on network games that involve group or community structure focus on finding such structures; e.g., community detection in networks using game-theoretic methods has been studied in (McSweeney, Mehrotra, and Oh, 2017; Newman, 2004; Alvari, Hajibagheri, and Sukthankar, 2014). By contrast, our work focuses on analyzing a network game with a given group/community structure, and using that structure as an analytical tool for the analysis of equilibrium and best responses.
A general normal-form game is defined by a set of agents (players) $I = \{1, \dots, N\}$, with each agent $a_i$ having an action/strategy space $K_i$ and a utility function $u_i(x_i, \boldsymbol{x}_{-i})$ that $i$ aims to maximize; $x_i \in K_i$, and $\boldsymbol{x}_{-i}$ denotes the
actions by all agents other than $i$. We term the collection of strategies of all agents, $\boldsymbol{x}$, a strategy profile. We assume $K_i \subset \mathbb{R}$ is a compact set.

We focus on computing a Nash equilibrium (NE) of a normal-form game, which is a strategy profile with each agent maximizing their utility given the strategies of others. Formally, $\boldsymbol{x}^*$ is a Nash equilibrium if for each agent $i$,

$$x_i^* \in \operatorname{argmax}_{x_i \in K_i} u_i(x_i, \boldsymbol{x}_{-i}^*). \quad (1)$$

A network game encodes structure in the utility functions such that they only depend on the actions of network neighbors. Formally, a network game is defined over a weighted graph $(I, E)$, where each node is an agent and $E$ is the set of edges; the agent's utility $u_i(x_i, \boldsymbol{x}_{-i})$ reduces to $u_i(x_i, \boldsymbol{x}_{I_i})$, where $I_i$ is the set of network neighbors of $i$, although we will frequently use the former for simplicity.

An agent's best response is its best strategy given the actions taken by all the other agents. Formally, the best response is a set defined by

$$BR_i(\boldsymbol{x}_{-i}, u_i) = \operatorname{argmax}_{x_i} u_i(x_i, \boldsymbol{x}_{-i}). \quad (2)$$

Whenever we deal with games that have a unique best response, we will use the singleton best response set to also refer to the player's best response strategy (the unique member of this set).

Clearly, a NE of a game is a fixed point of this best response correspondence. Consequently, one way to compute a NE of a game is through best response dynamics (BRD), a process whereby agents iteratively and asynchronously (that is, one agent at a time) take the others' actions as fixed values and play a best response to them.

We use this BRD algorithm as a major building block below. One important tool for analyzing BRD convergence is Variational Inequalities (VI). To establish the connection between NE and VI, we assume the utility functions $u_i$, $\forall i = 1, \dots, N$, are twice continuously differentiable.
Let $K = \prod_{i=1}^{N} K_i$ and define $F : \mathbb{R}^N \to \mathbb{R}^N$ as follows:

$$F(\boldsymbol{x}) := \left( -\nabla_{x_i} u_i(\boldsymbol{x}) \right)_{i=1}^{N}. \quad (3)$$

Then $\boldsymbol{x}^*$ is said to be a solution to $VI(K, F)$ if and only if

$$(\boldsymbol{x} - \boldsymbol{x}^*)^T F(\boldsymbol{x}^*) \ge 0, \quad \forall \boldsymbol{x} \in K. \quad (4)$$

In other words, the solution set of $VI(K, F)$ is equivalent to the set of NE of the game. Now, we can define the condition that will guarantee the convergence of BRD.

Definition 1.
The $P_\Upsilon$ condition: the $\Upsilon$ matrix generated from $F : \mathbb{R}^N \to \mathbb{R}^N$ is given by

$$\Upsilon(F) = \begin{pmatrix} \alpha_1(F) & -\beta_{1,2}(F) & \cdots & -\beta_{1,N}(F) \\ -\beta_{2,1}(F) & \alpha_2(F) & \cdots & -\beta_{2,N}(F) \\ \vdots & \vdots & \ddots & \vdots \\ -\beta_{N,1}(F) & -\beta_{N,2}(F) & \cdots & \alpha_N(F) \end{pmatrix},$$

where $\alpha_i(F) = \inf_{\boldsymbol{x} \in K} \|\nabla_i F_i\|$ and $\beta_{i,j}(F) = \sup_{\boldsymbol{x} \in K} \|\nabla_j F_i\|$, $i \ne j$. If $\Upsilon(F)$ is a P-matrix, that is, if all of its principal minors are positive, then we say $F$ satisfies the $P_\Upsilon$ condition.

Theorem 1. (Scutari et al., 2014) If $F$ satisfies the $P_\Upsilon$ condition, then $F$ is strongly monotone on $K$, and $VI(K, F)$ has a unique solution. Moreover, BRD converges to the unique NE from an arbitrary initial state.

Consider a conventional network (graphical) game with the set $I$ of $N$ agents situated on a network $G = (I, E)$, each with a utility function $u_i(x_i, \boldsymbol{x}_{I_i})$, with $I_i$ the set of $i$'s neighbors, $I$ the full set of agents/nodes, and $E$ the set of edges connecting them (the edges are generally weighted, resulting in a weighted adjacency matrix on which the utility depends). Suppose that this network $G$ exhibits the following structure in the strategic dependence among agents: agents can be partitioned into a collection of groups $\{S_k\}$, where $k$ is a group index, and an agent $a_i$ in the $k$th group (i.e., $a_i \in S_k$) has a utility function that depends (i) on the strategies of its network neighbors in $S_k$, and (ii) only on the aggregate strategies of groups other than $k$ (see, e.g., Fig. 1). Further, these groups may go on to form larger groups, whose aggregate strategies impact each other's agents, giving rise to a multi-scale structure
of the network. This kind of structure is very natural in a myriad of situations. For example, members of criminal organizations take stock of individual behavior by members of their own organization, but their interactions with other organizations (criminal or otherwise) are perceived in group terms (e.g., how much another group has harmed theirs). A similar multi-level interaction structure exists in national or ethnic conflicts, organizational competition in a marketplace, and politics. Indeed, a persistent finding in network science is that networks exhibit a multi-scale interaction structure (i.e., communities, and hierarchies of communities) (Girvan and Newman, 2002; Clauset, Moore, and Newman, 2008).

We present a general model to capture such multi-scale structure. Formally, an $L$-level structure is given by a hierarchical graph structure $\{G^{(l)}\}$ for each level $l$, $1 \le l \le L$, where $G^{(l)} = (\{S_k^{(l)}\}_k, E^{(l)})$ represents the level-$l$ structure. The first component, $\{S_k^{(l)}\}_k$, prescribes a partition, where agents in level $l-1$ form disjoint groups given by this partition; each group is viewed as an agent in level $l$, denoted $a_k^{(l)}$. Notationally, while both $a_k^{(l)}$ and $S_k^{(l)}$ bear the superscript $(l)$, the former refers to a level-$l$ agent, while the latter is the group (of level-$(l-1)$ agents) that the former represents. The set of level-$l$ agents is denoted by $I^{(l)}$ and their total number by $N^{(l)}$. The second component, $E^{(l)}$, is a set of edges that connect level-$l$ agents, encoding the dependence relationship among the groups they represent. This structure is anchored in level 1 (the lowest level), where the sets $S_k^{(1)}$ are singletons, corresponding to agents $a_k$ in the game, who constitute the set $I$.

To illustrate, the multi-scale structure shown in Fig.
1 is given by $G^{(1)} = G = (\{S_k^{(1)}\}_k = I, E^{(1)} = E)$, as well as how level-1 agents are grouped into level-2 agents, how level-2 agents are further grouped into level-3 agents, and the edges connecting these groups at each level.

It should be clear that the above multi-scale representation of a graphical game is a generalization of a conventional graphical game, as any such game corresponds to an $L = 1$ multi-scale representation. On the other hand, not all conventional graphical games have a meaningful $L > 1$ multi-scale representation (with non-singleton groups of level-1 agents); this is because our assumption that an agent's utility only depends on the aggregate decisions of groups other than the one they belong to implies certain properties of the dependence structure. For the remainder of this paper we proceed with a given multi-scale structure as defined above, while in Appendix G we outline a set of conditions on a graphical game $G$ that allows us to represent it in a (non-trivial) multi-scale fashion.

Since the resulting multi-scale network is strictly hierarchical, we can define the direct supervisor of agent $a_i^{(l)}$ in level $l$ to be the agent $a_k^{(l+1)}$ corresponding to the level-$(l+1)$ group $k$ that the former belongs to. Similarly, two agents who belong to the same level-$l$ group $k$ are (level-$l$) group mates. Finally, note that any level-1 agent $a_i$ belongs to exactly one group in each level $l$. We index the level-$l$ group to which $a_i$ belongs by $k_{il}$.

In order to capture the agent dependence on aggregate actions, we define an aggregation function $\sigma_k^{(l)}$ for each level-$l$ group $k$ that maps individual actions of group members to $\mathbb{R}$ (a group strategy). Specifically, consider a level-$l$ group $S_k^{(l)}$ with the level-$(l-1)$ agents in this group playing a strategy profile $\boldsymbol{x}_{S_k^{(l)}}$.
The (scalar) group strategy, which is also the strategy of the corresponding level-$l$ agent, is determined by the aggregation function,

$$x_k^{(l)} = \sigma_k^{(l)}(\boldsymbol{x}_{S_k^{(l)}}). \quad (5)$$

A natural example of this is linear aggregation (e.g., agents respond to total levels of violence by other criminal organizations): $\sigma_k^{(l)}(\boldsymbol{x}_{S_k^{(l)}}) = \sum_{i \in S_k^{(l)}} x_i^{(l-1)}$.

The $L$-level structure above is captured strategically by introducing structure into the utility functions of agents. Let $I_{k_{il}}$ denote the set of neighbors of the level-$l$ group $k$ to which level-1 agent $a_i$ belongs; i.e., this is the set of level-$l$ groups that interact with agent $a_i$'s group. This level-1 agent's utility function can be decomposed as follows:

$$u_i(x_i, \boldsymbol{x}_{-i}) = \sum_{l=1}^{L} u_{k_{il}}^{(l)}\left(x_{k_{il}}^{(l)}, \boldsymbol{x}_{I_{k_{il}}}^{(l)}\right). \quad (6)$$

In this definition, the level-$l$ strategies $x_k^{(l)}$ are implicitly functions of the level-1 strategies of the agents that comprise the group, per a recursive application of Eqn. (5). Consequently, the utility is an additive function of the hierarchy of group-level components for increasingly (with $l$) abstract groups of agents. Note that conventional network games are a special case with only a single level ($L = 1$).
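As a small illustration of the aggregation in Eqn. (5), the following Python sketch builds a hypothetical 3-level hierarchy (6 agents forming 3 level-2 groups, which form one level-3 group) with linear aggregation; the partition and strategy values are made up for illustration.

```python
import numpy as np

# Hypothetical hierarchy: 6 level-1 agents form 3 level-2 groups,
# which in turn form a single level-3 group.
S2 = [[0, 1], [2, 3], [4, 5]]   # level-2 partition of level-1 agents
S3 = [[0, 1, 2]]                # level-3 partition of level-2 groups

def aggregate(x, groups):
    """Linear aggregation sigma of Eqn. (5): sum the members' strategies."""
    return np.array([x[g].sum() for g in groups])

x1 = np.array([1.0, 2.0, 0.5, 0.5, 3.0, 1.0])  # level-1 strategy profile
x2 = aggregate(x1, S2)   # level-2 group strategies: [3. 1. 4.]
x3 = aggregate(x2, S3)   # level-3 group strategy:   [8.]
```

Recursive application of `aggregate` is exactly how the level-$l$ strategies entering the decomposition (6) are obtained from the level-1 profile.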
To illustrate, if we consider just two levels (a collection of individuals and the groups to which they directly belong), the utility function of each agent $a_i$ is a sum of two components:

$$u_i(x_i, \boldsymbol{x}_{-i}) = u_{k_i}^{(1)}\left(x_{k_i}^{(1)}, \boldsymbol{x}_{I_{k_i}}^{(1)}\right) + u_{k_i}^{(2)}\left(x_{k_i}^{(2)}, \boldsymbol{x}_{I_{k_i}}^{(2)}\right).$$

In the first component, $x_{k_i}^{(1)} = x_i$, since level-1 groups correspond to individual agents, whereas $\boldsymbol{x}_{I_{k_i}}^{(1)}$ is the strategy profile of $i$'s neighbors belonging to the same group as $i$, given by $E^{(1)}$. The second utility component depends only on the aggregate strategy $x_{k_i}^{(2)}$ of the group to which $i$ belongs, as well as the aggregate strategies of the groups with which $i$'s group interacts, given by $E^{(2)}$.

Consider the BRD algorithm (formalized in Algorithm 1), in which we iteratively select an agent who plays a best response to the strategies of the rest from the previous iteration.
ALGORITHM 1: BRD Algorithm

    Initialize the game: t = 0, x_i(0) = (x(0))_i, i = 1, ..., N
    while not converged do
        for i = 1:N do
            x_i(t+1) = BR_i(x_{-i}(t), u_i)
        end
        t ← t + 1
    end

The conventional BRD algorithm operates on the "flattened" utility function, which evaluates utilities explicitly as functions of the strategies played by all agents $a_i \in I$. Our goal henceforth is to develop algorithms that take advantage of the special multi-scale structure and enable significantly better scalability than standard BRD, while preserving the convergence properties of BRD.

The simplest way to take advantage of the multi-scale representation is to directly leverage the structure of the utility function in computing best responses. Specifically, the multi-scale utility function is more compact than one that explicitly accounts for the strategies of all neighbors of $i$ (which include all of the players in groups other than the one $i$ belongs to). This typically results in a direct computational benefit when computing a best response. For example, in a game with a linear best response, this can result in an exponential reduction in the number of linear operations.

The resulting algorithm, Multi-Scale Best-Response Dynamics (MS-BRD), which takes advantage of our utility representation, is formalized as Algorithm 2. The main difference from BRD is that it explicitly uses the multi-scale utility representation: in each iteration, it updates the aggregated strategies at all levels for the groups to which the most recent best-responding agent belongs. Since MS-BRD simply performs operations identical to BRD, but efficiently, its convergence is guaranteed under the same conditions (see Theorem 1). Next, we present iterative algorithms for computing NE that take further advantage of the multi-scale structure, and study their convergence.
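As a concrete toy sketch of Algorithm 1, the snippet below runs asynchronous best-response sweeps on a hypothetical 3-agent game with linear best responses; the parameters `b` and `G` are illustrative, not from the paper's experiments. For this game the $\Upsilon$ matrix of Definition 1 reduces to $I - |G|$, which is checked to be a P-matrix so that Theorem 1 applies.

```python
import numpy as np
from itertools import combinations

# Toy quadratic game: u_i = b_i x_i - 0.5 x_i^2 + sum_j g_ij x_i x_j,
# so the best response is linear: BR_i(x_{-i}) = b_i + sum_j g_ij x_j.
b = np.array([1.0, 2.0, 1.5])
G = np.array([[0.0, 0.2, 0.1],
              [0.2, 0.0, 0.1],
              [0.1, 0.3, 0.0]])   # weighted adjacency (zero diagonal)

# Definition 1 here: alpha_i = 1, beta_ij = |g_ij|, so Upsilon = I - |G|.
# P-matrix test: every principal minor must be positive.
Upsilon = np.eye(3) - np.abs(G)
assert all(np.linalg.det(Upsilon[np.ix_(S, S)]) > 0
           for r in range(1, 4) for S in combinations(range(3), r))

# Algorithm 1: asynchronous best-response sweeps.
x = np.zeros(3)
for _ in range(200):
    for i in range(3):
        x[i] = b[i] + G[i] @ x

# The fixed point solves (I - G) x = b: the unique NE (Theorem 1).
assert np.allclose(x, np.linalg.solve(np.eye(3) - G, b))
```

Here the row sums of $|G|$ are below one, so $\Upsilon$ is strictly diagonally dominant (hence a P-matrix), and the sweeps contract to the unique fixed point.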
In order to take full advantage of the multi-scale game structure, we now aim to develop algorithms that treat groups explicitly as agents, with the idea that iterative interactions among these can significantly speed up convergence. Of course, in our model groups are not actual agents in the game: utility functions are only defined for agents in level 1. However, note that we already have well-defined group strategies – these are just the aggregations of agent strategies at the level immediately below, per the aggregation function (5). Moreover, we have natural utilities for groups as well: we can use the corresponding group-level component of the utility of any agent in the group (note that these are identical for all group members in Eqn. (6)). However, using these as group utilities will in fact not work: since ultimately the game is only among the agents in level 1, equilibria of all of the games at more abstract levels must be consistent with equilibrium strategies in level 1. On the other hand, we need to enforce consistency only between neighboring levels, since that fully captures the across-level interdependence induced by the aggregation function.
ALGORITHM 2: Multi-Scale BRD (MS-BRD)

    Initialize the game: t = 0, x_i^(1)(0) = (x(0))_i, i = 1, ..., N
    for l = 2:L do
        for k = 1:N^(l) do
            x_k^(l)(0) = σ_k^(l)(x_{S_k^(l)}(0))
        end
    end
    while not converged do
        for i = 1:N (level 1) do
            x_i^(1)(t+1) = BR_i(x_{-i}^(1)(t), u_i)
        end
        for l = 2:L do
            for k = 1:N^(l) do
                x_k^(l)(t+1) = σ_k^(l)(x_{S_k^(l)}(t+1))
            end
        end
        t ← t + 1
    end

Therefore, we define the following pseudo-utility functions for agents at levels other than 1, with agent $k$ in level $l$ corresponding to a subset of agents from level $l-1$:

$$\hat{u}_k^{(l)} = u_k^{(l)}\left(x_k^{(l)}, \boldsymbol{x}_{I_k}^{(l)}\right) - L_k^{(l,l-1)}\left(x_k^{(l)}, \sigma_k^{(l)}(\boldsymbol{x}_{S_k^{(l)}})\right) - L_k^{(l,l+1)}\left(\sigma_k^{(l+1)}(\boldsymbol{x}_{S_k^{(l+1)}}), x_k^{(l+1)}\right). \quad (7)$$

The first term is the level-$l$ component of the utility of any level-1 agent in group $k$. The second and third terms model the inter-level inconsistency loss that penalizes a level-$l$ agent $a_k^{(l)}$, where $L_k^{(l,l+1)}$ and $L_i^{(l,l-1)}$ penalize its inconsistency with the level-$(l+1)$ and level-$(l-1)$ entities, respectively. In general, $L_k^{(l,l+1)}$ is a different function from $L_k^{(l+1,l)}$; we elaborate on this further below.

The central idea behind the second algorithm we propose is simple: in addition to iterating best response steps at level 1, we now interleave them with best response steps taken by agents at higher levels, which we can do since strategies and utilities of these pseudo-agents are well defined. This algorithm is similar to the augmented Lagrangian method in optimization theory, where penalty terms are added to relax an equality constraint and turn the problem into one with separable operators. We can decompose this type of problem into smaller subproblems and solve the subproblems sequentially using the alternating direction method (ADM) (Yuan and Li, 2011; Bnouhachem, Benazza, and Khalfaoui, 2013).
The games at adjacent levels are coupled through the equality constraints on their action profiles given by Eqn. (5), and the penalty functions are updated before starting a new iteration. The full algorithm, which we call Separated Hierarchical BRD (SH-BRD), is provided in Algorithm 3.

The penalty updating rule in iteration $t$ of Algorithm 3 is:

1. For $l = 2, \dots, L$, $i = 1, \dots, N^{(l)}$:
$$L_i^{(l,l-1)}\left(x_i^{(l)}, \sigma_i^{(l)}(\boldsymbol{x}_{S_i^{(l)}}(t+1))\right) = h_i^{(l)}\left[x_i^{(l)} - \sigma_i^{(l)}(\boldsymbol{x}_{S_i^{(l)}}(t+1)) + \lambda_i^{(l)}(t)\right]. \quad (8)$$

2. For $l = 1, \dots, L-1$, $i = 1, \dots, N^{(l)}$, where $a_i^{(l)} \in S_k^{(l+1)}$:
$$L_k^{(l,l+1)}\left(\sigma_k^{(l+1)}(\boldsymbol{x}_{S_k^{(l+1)}}), x_k^{(l+1)}(t)\right) = h_k^{(l+1)}\left[\sigma_k^{(l+1)}(\boldsymbol{x}_{S_k^{(l+1)}}) - x_k^{(l+1)}(t) - \lambda_k^{(l+1)}(t)\right]. \quad (9)$$

3. For $l = 2, \dots, L$, $i = 1, \dots, N^{(l)}$:
$$\lambda_i^{(l)}(t+1) = \lambda_i^{(l)}(t) - h_i^{(l)}\left[\sigma_i^{(l)}(\boldsymbol{x}_{S_i^{(l)}}(t+1)) - x_i^{(l)}(t+1)\right]. \quad (10)$$

When updating, all other variables are treated as fixed, and $\boldsymbol{\lambda}^{(l)}(0)$ and $h_i^{(l)} > 0$ are chosen arbitrarily.

ALGORITHM 3: Separated Hierarchical BRD (SH-BRD)

    Initialize the game: t = 0, x_i^(1)(0) = (x(0))_i, i = 1, ..., N^(1)
    for l = 2:L do
        for k = 1:N^(l) do
            x_k^(l)(0) = σ_k^(l)(x_{S_k^(l)}(0))
        end
    end
    while not converged do
        for l = 1:L do
            for i = 1:N^(l) (l to l-1 penalty update, if l > 1) do
                update L_i^(l,l-1)
            end
            for i = 1:N^(l) (l to l+1 penalty update, if l < L) do
                update L_k^(l,l+1), where a_i^(l) ∈ S_k^(l+1)
            end
            for i = 1:N^(l) (best response) do
                x_i^(l)(t+1) = BR_i(σ_i^(l)(x_{S_i^(l)}(t+1)), x_{I_i}^(l)(t), x_k^(l+1)(t), û_i^(l))
            end
        end
        t ← t + 1
    end

Unlike MS-BRD, the convergence of the SH-BRD algorithm is non-trivial. To prove it, we exploit a connection between this algorithm and Structured Variational Inequalities (SVI) with separable operators (He, 2009; He and Yuan, 2012; Bnouhachem, Benazza, and Khalfaoui, 2013). To formally state the convergence result, we need to make several explicit assumptions.
Assumption 1. The functions $u_i^{(l)}$, $\forall l = 1, \dots, L$, $\forall i = 1, \dots, N^{(l-1)}$, are twice continuously differentiable.

Assumption 2. $-\nabla_{x_i^{(l)}} u_i^{(l)}$ are monotone $\forall l = 1, \dots, L$, $\forall i = 1, \dots, N^{(l-1)}$. The solution set of $\nabla_{x_i^{(l)}} u_i^{(l)} = 0$, $\forall l = 1, \dots, L$, $\forall i = 1, \dots, N^{(l-1)}$, is nonempty, with solutions in the interior of the action spaces.

Let $F^{(l)}$ be defined as in Equation (3) for each level-$l$ pseudo-utility.

Assumption 3. $F^{(l)}$ satisfy the $P_\Upsilon$ condition.

Note that these assumptions directly generalize the conditions required for the convergence of BRD to our multi-scale pseudo-utilities. The following theorem formally states that SH-BRD converges to a NE for $L = 2$.

Theorem 2.
Suppose $L = 2$. If Assumptions 1 and 3 hold, SH-BRD converges to a NE, which is unique.
The full proof of this theorem, which makes use of the connection between SH-BRD and SVI, is provided in the Supplement due to space constraints. The central issue, however, is that there are no established convergence guarantees for ADM-based algorithms for SVI with 3 or more separable operators. Alternative algorithms for SVI can extend to the case of 3 operators using parallel operator updates with regularization terms, but no approaches exist that can handle more than 3 operators (He, 2009). We thus propose an algorithm for iteratively solving multi-scale games that uses the general idea from SH-BRD, but packs all levels into two meta-levels. Each meta-level must be
comprised of consecutive levels. For example, if we have 5 levels, we can have the combination {1, 2, 3} and {4, 5}, but not {1, 3, 5} and {2, 4}. Upon grouping levels together to obtain a meta-game with only two meta-levels, we can apply what amounts to a 2-level version of SH-BRD. This yields an algorithm, which we call Hybrid Hierarchical BRD (HH-BRD), that provably converges to a NE for an arbitrary number of levels $L$ given Assumptions 1-3.

As presenting the general version of HH-BRD involves cumbersome notation, we illustrate the idea by presenting it for a 4-level game (Algorithm 4). The fully general version is deferred to the Supplement. In this example, the objectives of the meta-levels are defined as

$$\hat{u}_i^{(sl_1)} = u_i^{(1)} + u_{k_i}^{(2)} - L_{k_i}^{(sl_1, sl_2)}\left(\sigma_{k_i}^{(3)}(\boldsymbol{x}_{S_{k_i}^{(3)}}), x_{k_i}^{(3)}\right),$$
$$\hat{u}_{k_i}^{(sl_2)} = u_{k_i}^{(3)} + u_{k_i}^{(4)} - L_{k_i}^{(sl_2, sl_1)}\left(x_{k_i}^{(3)}, \sigma_{k_i}^{(3)}(\boldsymbol{x}_{S_{k_i}^{(3)}})\right).$$

ALGORITHM 4: Hybrid Hierarchical BRD (HH-BRD)

    Initialize the game: t = 0, x_i^(1)(0) = (x(0))_i, i = 1, ..., N^(1)
    for l = 2:4 do
        for k = 1:N^(l) do
            x_k^(l)(0) = σ_k^(l)(x_{S_k^(l)}(0))
        end
    end
    while not converged do
        for k = 1:N^(3) (meta-level-1 penalty update) do
            update L_k^(sl1,sl2)
        end
        for i = 1:N^(1) (level 1) do
            x_i^(1)(t+1) = BR_i(x_{I_i}^(1)(t), x_{I_{k_i}}^(2)(t), x_{k_i}^(3)(t), û_i^(sl1))
        end
        for j = 1:N^(2) (level 2) do
            x_j^(2)(t+1) = σ_j^(2)(x_{S_j^(2)}(t+1))
        end
        for k = 1:N^(3) (meta-level-2 penalty update) do
            update L_k^(sl2,sl1)
        end
        for k = 1:N^(3) (level 3) do
            x_k^(3)(t+1) = BR_k(σ_k^(3)(x_{S_k^(3)}(t+1)), x_{I_k}^(3)(t), x_{-p}^(4)(t), û_k^(sl2)), where a_k^(3) ∈ S_p^(4)
        end
        for p = 1:N^(4) (level 4) do
            x_p^(4)(t+1) = σ_p^(4)(x_{S_p^(4)}(t+1))
        end
        t ← t + 1
    end

Theorem 3.
Suppose Assumptions 1-3 hold. Then HH-BRD finds the unique NE.

Proof Sketch. We first "flatten" the game within each meta-level to obtain an effective 2-level game. We then use Theorem 2 to show that this 2-level game converges to the unique NE of the game under SH-BRD. Finally, we prove that SH-BRD and HH-BRD have the same trajectory given the same initialization, thus establishing convergence for HH-BRD. For the full proof, see Supplement, Appendix D.

HH-BRD combines the advantages of both MS-BRD and SH-BRD: not only does it exploit the sparsity embedded in the network topology, but it also avoids the convergence problem of SH-BRD when the number of levels is higher than
three. Indeed, a known challenge in the related work on structured variational inequalities, which we leverage for our convergence results (with operators mapping to levels in our multi-scale game representation), is that convergence is difficult to establish with three or more operators (He, 2009). One may be concerned that the HH-BRD pseudocode appears to involve greater complexity (and more steps) than SH-BRD. However, this does not imply greater algorithmic complexity; it is rather due to our greater elaboration of the steps within each meta-level. Indeed, as our experiments below demonstrate, the superior theoretical convergence of HH-BRD also translates into a concrete computational advantage for this algorithm.
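To make the two-level scheme concrete, here is a minimal Python sketch in the spirit of SH-BRD for a hypothetical 2-level game: 4 agents in 2 groups, quadratic utilities, linear aggregation, and an augmented-Lagrangian consistency penalty between each group's pseudo-agent variable `z[k]` and the aggregate of its members. All parameters (`b`, `d`, `w`, `h`) are illustrative, and the sign/scaling conventions follow standard scaled-form ADMM rather than Eqns. (8)-(10) verbatim; the fixed point nonetheless coincides with the NE of the flattened game.

```python
import numpy as np

b = np.array([1.0, 2.0, 1.5, 0.5])   # level-1 utility slopes (illustrative)
d = np.array([0.5, 1.0])             # level-2 utility slopes (illustrative)
w, h = 0.1, 1.0                      # cross-group coupling, penalty weight
groups = [[0, 1], [2, 3]]            # level-2 partition of the 4 agents

x = np.zeros(4)    # level-1 strategies
z = np.zeros(2)    # level-2 pseudo-agent strategies
lam = np.zeros(2)  # consistency multipliers

for _ in range(2000):
    for k, members in enumerate(groups):
        for i in members:            # penalized level-1 best responses
            j = [m for m in members if m != i][0]
            x[i] = (b[i] - h * (x[j] - z[k] + lam[k])) / (1 + h)
    sigma = np.array([x[g].sum() for g in groups])   # group aggregates
    for k in range(2):               # penalized level-2 best responses
        z[k] = (d[k] + w * z[1 - k] + h * (sigma[k] + lam[k])) / (1 + h)
    lam += sigma - z                 # multiplier (consistency) update

# Compare with the NE of the flattened game (its first-order conditions).
A = np.array([[2, 1, -w, -w], [1, 2, -w, -w],
              [-w, -w, 2, 1], [-w, -w, 1, 2]], dtype=float)
rhs = b + d[[0, 0, 1, 1]]
assert np.allclose(x, np.linalg.solve(A, rhs), atol=1e-4)
```

At a fixed point the multiplier absorbs the level-2 marginal utility, the consistency constraint holds exactly, and the limit is the flattened game's NE.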
In this section, we numerically compare the three algorithms introduced in Section 4, as well as conventional BRD. We only consider settings which satisfy Assumptions 1-3; consequently, we focus the comparison on computational costs. We use two measures of computational cost: floating-point operations (FLOPs) in the case of games with a linear best response (a typical measure for such settings), and CPU time for the rest. All experiments were performed on a machine with a 6-core 2.60/4.50 GHz CPU with hyperthreaded cores, 12MB cache, and 16GB RAM.
Games with a Linear Best Response (GLBRs)
GLBRs (Bramoullé, Kranton, and D'Amours, 2014; Candogan, Bimpikis, and Ozdaglar, 2012; Miura-Ko et al., 2008) feature utility functions such that an agent's best response is a linear function of its neighbors' actions. This includes quadratic utilities of the form

$$u_i(x_i, \boldsymbol{x}_{I_i}) = a_i + \frac{b_i}{2} x_i^2 + \left(\sum_{j \in I_i} g_{ij} x_j\right) x_i - \frac{c_i}{2} x_i^2, \quad (11)$$

since an agent's best response is then

$$BR_i(\boldsymbol{x}_{I_i}, u_i) = \frac{\sum_{j \in I_i} g_{ij} x_j}{c_i - b_i}.$$

We consider a 2-level GLBR and compare three algorithms: BRD (baseline), MS-BRD, and SH-BRD (note that in 2-level games, HH-BRD is identical to SH-BRD, and we thus do not include it here). We construct random 2-level games with utility functions based on Equation (11). Specifically, we generalize this utility so that Equation (11) represents only the level-1 portion, $u_i^{(1)}$, and let the level-2 utilities be

$$u_k^{(2)}(x_k, \boldsymbol{x}_{I_k}) = x_k^{(2)} \sum_{p \ne k} v_{kp} x_p^{(2)}$$

for each group $k$. At every level, the existence of a link between two agents follows a Bernoulli distribution with probability $P_{\text{exist}}$. If a link exists, we then generate a parameter for it. The parameters of the utility functions are sampled uniformly in $[0, 1]$, without requiring symmetry. Please refer to Appendix E and E.1 for further details. Results comparing BRD, MS-BRD, and SH-BRD are shown in Table 1. We observe a dramatic improvement in scalability from using MS-BRD compared to conventional BRD. This improvement stems from the representational advantage provided by multi-scale games compared to conventional graphical games (since without the multi-scale representation, we have to use the standard version of BRD for equilibrium computation). We see further improvement going from MS-BRD to SH-BRD, which makes algorithmic use of the multi-scale representation.
[Table 1 (columns: Size, BRD, MS-BRD, SH-BRD): Convergence and complexity (FLOPs) comparison with linear best response under multiple initialization.]
Games with a Non-Linear Best Response
Next, we study the performance of the proposed algorithms in 2- and 3-level games, with the same number of groups in each level (we systematically vary the number of groups). Since SH-BRD and HH-BRD are identical in 2-level games, the latter is only used in 3-level games. All results are averaged over 30 generated sample games. The non-linear best response covers a much broader class of utility functions than the linear best response, and the best responses generally do not have closed-form representations. In this case, we cannot solve linear equations to find the best response and instead have to apply gradient-based methods. In our instances, the utility with non-linear best responses is generated by adding an exponential cost term to the utility function used in GLBRs. Please refer to Appendix E and E.2 for further details.
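A best response without a closed form can be computed by first-order ascent on the (strictly concave) utility, as the gradient-based step described above. The following sketch is ours, not the paper's code; in particular the exponential coefficient 0.1 is an illustrative placeholder, not a value taken from the paper:

```python
import numpy as np

def best_response_gradient(i, x, G, b, c, lr=0.05, tol=1e-9, max_iter=100_000):
    """Gradient ascent on agent i's utility when no closed form exists.

    Illustrative utility (the 0.1 exponential coefficient is a placeholder):
        u_i = b_i x_i + x_i * (G[i] @ x) - c_i x_i^2 - exp(0.1 * x_i)
    The exponential cost term makes the best response non-linear in x_{I_i}.
    """
    neigh = G[i] @ x            # diag(G) = 0, so this excludes agent i's own action
    xi = x[i]
    for _ in range(max_iter):
        # d u_i / d x_i for the utility above
        grad = b[i] + neigh - 2.0 * c[i] * xi - 0.1 * np.exp(0.1 * xi)
        xi_new = xi + lr * grad
        if abs(xi_new - xi) < tol:
            return xi_new
        xi = xi_new
    return xi

rng = np.random.default_rng(0)
G = rng.uniform(0, 1, (4, 4))
np.fill_diagonal(G, 0.0)
b, c, x = np.ones(4), np.ones(4), np.zeros(4)
xi = best_response_gradient(0, x, G, b, c)
```

Because the utility is strictly concave in x_i, the ascent converges to the unique maximizer, but each best-response evaluation is now iterative, which is why CPU time rather than FLOPs is the cost measure in this setting.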
[Table 2 (columns: Size, BRD, MS-BRD, SH-BRD): CPU times on a single machine on 2-level games with general best response functions; all times are in seconds.]

Table 2 shows the CPU time comparison between all algorithms. The scalability improvements from our proposed algorithms are substantial, with orders-of-magnitude speedups in some cases (e.g., from ∼25 minutes for the BRD baseline down to ∼12 seconds for SH-BRD in games with 10K agents). Furthermore, BRD fails to solve instances with 250K agents, which SH-BRD solves in ∼42 min. Again, we separate here the representational advantage of multi-scale games, illustrated by MS-BRD, from the algorithmic advantage that comes from SH-BRD. Note that SH-BRD, which takes full advantage of the multi-scale structure, also exhibits significant improvement over MS-BRD, yielding a factor of 2-3 reduction in runtime.
[Table 3 (columns: Size, BRD, MS-BRD, SH-BRD): CPU times on a single machine for 2-level, linear/nonlinear best-response games; all times are in seconds.]

Our next set of experiments involves games in which the level-1 utility has a linear best response, but the level-2 utility has a non-linear best response. The results are shown in Table 3. We see an even bigger advantage of SH-BRD over the others: it is now typically orders of magnitude faster than even MS-BRD, which is itself an order of magnitude faster than BRD. For example, in games with 250K agents, in which BRD fails to return a solution, MS-BRD takes more than 1 hour to find a solution, whereas SH-BRD finds a solution in under 30 seconds.
[Table 4 (columns: Size, BRD, MS-BRD, SH-BRD, HH-BRD): CPU times in seconds on a single machine on 3-level, general best response games.]
Finally, Table 4 presents the results of HH-BRD in games with more than 2 levels, compared to SH-BRD, which does not provably converge in such games. In this case, HH-BRD outperforms the other alternatives, with up to 22% improvement over MS-BRD; indeed, we find that SH-BRD is considerably worse even than MS-BRD.

Conclusion

We proposed a novel representation of games that have a multi-scale network structure. These generalize network games, with the special structure that agent utilities are additive across the levels of the hierarchy, with the utility at each level depending only on the aggregate strategies of other groups. We presented several iterative algorithms that make use of the multi-scale game structure, and showed that they converge to a pure-strategy Nash equilibrium under conditions similar to those for best response dynamics in network games. Our experiments demonstrate that the proposed algorithms can yield orders-of-magnitude scalability improvements over conventional best response dynamics. Our multi-scale algorithms can also reveal to what extent one's group affiliation impacts one's strategic decision making, and how strategic interactions among groups shape strategic interactions among individuals.

While the issue of multi-scale networks abounds in the network science literature (e.g., hierarchical clustering), the "multi-scale" part there is primarily concerned with community structure in networks, rather than with modeling how communities interact, which is critical for us in describing a formal multi-scale structure for games. Thus a very important future direction is to identify and obtain relevant field data for experiments, and to create realistic benchmarks for multi-scale games. This would involve identifying ways to obtain data about how communities (and not just individuals) interact. Once we have the ability to collect data about interactions at multiple scales (e.g., among members and among groups), we can apply our algorithms to such multi-scale networks.
To use criminal networks (criminal organizations and their members) as an example, given game models constructed with the help of domain expertise, we can:

1. compute equilibria predicting, say, criminal activity as a function of structural changes to organizations;
2. infer utility models from observational data at multiple scales;
3. study policies (including strengthening or weakening connections between agents or groups, endowing agents/groups with more resources (lower costs of effort), etc.) that would induce more desirable equilibrium outcomes.
Acknowledgment
This work is supported by the NSF under grants CNS-1939006, CNS-2012001, and IIS-1905558 (CAREER), and by the ARO under contract W911NF1810208.
References
Acemoglu, D.; Carvalho, V. M.; Ozdaglar, A.; and Tahbaz-Salehi, A. 2012. The network origins of aggregate fluctuations. Econometrica.
Allouch, N. 2015. On the private provision of public goods on networks. Journal of Economic Theory.
Applied Mathematics and Computation.
Bramoullé, Y.; Kranton, R.; and D'Amours, M. 2014. Strategic interaction and networks. American Economic Review.
Games Played on Networks.
Buckley, E., and Croson, R. 2006. Income and wealth heterogeneity in the voluntary provision of linear public goods. Journal of Public Economics.
Candogan, O.; Bimpikis, K.; and Ozdaglar, A. 2012. Optimal pricing in networks with externalities. Operations Research.
Nature.
Gabay, D., and Mercier, B. 1976. A dual algorithm for the solution of nonlinear variational problems via finite element approximation. Computers & Mathematics with Applications.
Proc. Natl. Acad. Sci.
Glowinski, R., and Oden, J. T. 1985. Numerical methods for nonlinear variational problems. Journal of Applied Mechanics.
He, B. S.; Yang, H.; and Wang, S. L. 2000. Alternating direction method with self-adaptive penalty parameters for monotone variational inequalities. Journal of Optimization Theory and Applications.
He, B. 2009. Parallel splitting augmented Lagrangian methods for monotone structured variational inequalities. Computational Optimization and Applications.
Hota, A. R., and Sundaram, S. 2018. Interdependent security games on networks under behavioral probability weighting. IEEE Transactions on Control of Network Systems.
Jackson, M. O., and Zenou, Y. 2015. Games on networks. In Handbook of Game Theory with Economic Applications, volume 4. Elsevier. 95-163.
Kearns, M. J.; Littman, M. L.; and Singh, S. P. 2001. Graphical models for game theory. In Conference in Uncertainty in Artificial Intelligence, 253-260.
Khalili, M. M.; Zhang, X.; and Liu, M. 2019. Public good provision games on networks with resource pooling. In Network Games, Control, and Optimization. Springer. 271-287.
La, R. J. 2016. Interdependent security with strategic agents and cascades of infection. IEEE/ACM Transactions on Networking.
Lions, P.-L., and Mercier, B. 1979. Splitting algorithms for the sum of two nonlinear operators. SIAM Journal on Numerical Analysis.
Game-Theoretic Framework for Community Detection. 1-16.
Miura-Ko, R. A.; Yolken, B.; Mitchell, J.; and Bambos, N. 2008. Security decision-making among interdependent organizations. 66-80.
Naghizadeh, P., and Liu, M. 2017. On the uniqueness and stability of equilibria of network games. 280-286. IEEE.
Newman, M. 2004. Detecting community structure in networks. Eur. Phys. J. B.
Parise, F., and Ozdaglar, A. 2019. A variational inequality framework for network games: Existence, uniqueness, convergence and sensitivity analysis. Games and Economic Behavior.
Scutari, G.; Facchinei, F.; Pang, J.-S.; and Palomar, D. P. 2014. Real and complex monotone communication games. IEEE Transactions on Information Theory.
Tseng, P. 1990. Further applications of a splitting algorithm to decomposition in variational inequalities and convex programming. Mathematical Programming.
Vorobeychik, Y., and Letchford, J. 2015. Securing interdependent assets. Journal of Autonomous Agents and Multiagent Systems.
Yu, S.; Zhou, K.; Brantingham, P. J.; and Vorobeychik, Y. 2020. Computing equilibria in binary networked public goods games. In AAAI Conference on Artificial Intelligence, 2310-2317.
Yuan, X.-M., and Li, M. 2011. An LQP-based decomposition method for solving a class of variational inequalities. SIAM Journal on Optimization.
Appendices
A Structured Variational Inequalities
A structured variational inequality SVI_n arises when a VI problem has n separable operators. This is used to analyze our game under the multi-scale perspective described in Section 3.

We now introduce a particular type of SVI relevant to our model. Suppose the N level-1 agents form M disjoint groups in the game, and S_j denotes the j-th level-1 group, whereby i ∈ S_j denotes that a_i is a member of S_j. Consider the following utility function of a_i:

u_i(x_i, x_{−i}, y_j, y_{−j}) = u^(1)_i(x_i, x_{−i}) + u^(2)_j(y_j, y_{−j}),    (12)

where x ∈ R^N denotes the level-1 action profile, y ∈ R^M denotes the level-2 action profile, and Ax + y = 0, for

A_{ji} = −1 if i ∈ S_j, and 0 else,    j = 1, …, M,  i = 1, …, N.

Thus Ax + y = 0 is equivalent to y_j = ∑_{i∈S_j} x_i. We say x and y are two separated operators, and define

F^(1)(x) := ( −∇_{x_i} u^(1)_i(x) )_{i=1}^{N},  x_i ∈ K^(1)_i,
F^(2)(y) := ( −∇_{y_j} u^(2)_j(y) )_{j=1}^{M},  y_j ∈ K^(2)_j,
K^(1) = ∏_{i=1}^{N} K^(1)_i,  K^(2) = ∏_{j=1}^{M} K^(2)_j,  K = K^(1) × K^(2),
v = (x, y) ∈ K,  F(v) = ( F^(1)(x), F^(2)(y) ).    (13)

Define Ω = { v ∈ K | Ax + y = 0 }. Then the VI(Ω, F) problem is to find v* ∈ Ω such that

(v − v*)^T F(v*) ≥ 0,  ∀ v ∈ Ω.

This problem is equivalent to the SVI problem VI(W, Q) defined in Eqn (14):

(ω − ω*)^T Q(ω*) ≥ 0,  ∀ ω ∈ W,    (14)

where W = K × R^M and

ω = (x, y, λ),  Q(ω) = ( F^(1)(x) − A^T λ,  F^(2)(y) − λ,  Ax + y ).    (15)

It is easy to see that if we use ∑_{i∈S_j} x_i to replace y_j, then we again have a single operator x and can construct a VI(K, F) as outlined in Section 2. There is a one-to-one mapping between a solution x* to VI(K, F) and a solution ω* = (x*, −Ax*, λ*) to VI(W, Q). Therefore, solving either VI(K, F) or VI(W, Q) finds the set of NEs.
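The coupling between the two operators is entirely carried by the matrix A. The sketch below (a toy partition of our own choosing) builds A exactly as defined above and verifies that the constraint Ax + y = 0 encodes the group aggregates:

```python
import numpy as np

# Membership of each level-1 agent: agent i belongs to group membership[i].
# (Toy partition; in the paper's notation, the S_j are the level-1 groups.)
membership = np.array([0, 0, 1, 1, 1, 2])
N, M = len(membership), membership.max() + 1

# A as defined above: A[j, i] = -1 if agent i is in group S_j, else 0,
# so that A x + y = 0 is equivalent to y_j = sum_{i in S_j} x_i.
A = np.zeros((M, N))
A[membership, np.arange(N)] = -1.0

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([x[membership == j].sum() for j in range(M)])
assert np.allclose(A @ x + y, 0.0)   # the linear coupling constraint holds
```

In the one-to-one mapping of solutions, this is exactly why the second component of ω* can always be written as −Ax*.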
B Uniqueness of NE

We will introduce some special matrices before we move on to the sufficient conditions for the uniqueness of the NE.
Definition 2.
Some special matrices:

1. P-matrix: a square matrix is a P-matrix if all of its principal minors are positive;
2. Z-matrix: a square matrix is a Z-matrix if all of its off-diagonal entries are nonpositive;
3. M-matrix: an M-matrix is a Z-matrix whose eigenvalues all have nonnegative real parts;
4. L-matrix: an L-matrix is a Z-matrix whose diagonal elements are nonnegative.
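These classes can be checked numerically for small matrices. The sketch below is ours; note that the brute-force P-matrix test enumerates every principal minor, so it is only practical in low dimensions:

```python
import numpy as np
from itertools import combinations

def is_Z(A):
    """Z-matrix: all off-diagonal entries nonpositive."""
    off = A - np.diag(np.diag(A))
    return bool(np.all(off <= 0))

def is_P(A):
    """P-matrix: every principal minor is positive (exponential-time check)."""
    n = A.shape[0]
    for k in range(1, n + 1):
        for idx in combinations(range(n), k):
            if np.linalg.det(A[np.ix_(idx, idx)]) <= 0:
                return False
    return True

def is_M(A):
    """M-matrix: a Z-matrix whose eigenvalues have nonnegative real parts."""
    return is_Z(A) and bool(np.all(np.linalg.eigvals(A).real >= -1e-12))

def is_L(A):
    """L-matrix: a Z-matrix with nonnegative diagonal."""
    return is_Z(A) and bool(np.all(np.diag(A) >= 0))

U = np.array([[2.0, -1.0],
              [-1.0, 2.0]])   # strictly diagonally dominant Z-matrix
assert is_Z(U) and is_L(U) and is_M(U) and is_P(U)
```

The strictly diagonally dominant example above lands in all four classes at once, which is the pattern exploited by the sufficient conditions that follow.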
For an arbitrary mapping F: R^N → R^N, we denote the Jacobian of F(x) by JF(x), so that ∇_j F_i = [JF(x)]_{ij}. Checking whether a matrix is a P-matrix is still non-trivial, and we can look at the spectral radius of a related matrix instead.
Theorem 4.
The P_Γ condition: we define the matrix Γ generated from F as follows:

Γ(F) =
[ 0                    −β_{1,2}(F)/α_1(F)   ⋯   −β_{1,N}(F)/α_1(F)
  −β_{2,1}(F)/α_2(F)   0                    ⋯   −β_{2,N}(F)/α_2(F)
  ⋮                    ⋮                    ⋱   ⋮
  −β_{N,1}(F)/α_N(F)   −β_{N,2}(F)/α_N(F)   ⋯   0 ].    (16)

If the spectral radius satisfies ρ(Γ(F)) < 1, then we say F satisfies the P_Γ condition. The P_Γ condition is equivalent to the P_Υ condition, and VI(K, F) has a unique solution.

In Scutari et al. (2014), the authors mention that the P_Υ condition captures "some kind of diagonal dominance". In fact, strong diagonal dominance (s.d.d.) or weakly chained diagonal dominance (w.c.d.d.) of Υ is an easier, yet sufficient, condition to check.

Theorem 5. If Υ is s.d.d. or w.c.d.d., the NE is unique, since

Υ is an s.d.d. L-matrix ⇒ Υ is a w.c.d.d. L-matrix ⇔ Υ is a nonsingular weakly diagonally dominant (w.d.d.) L-matrix ⇔ Υ is a nonsingular w.d.d. M-matrix ⇒ Υ is a P-matrix.

Also, when Υ is s.d.d., Γ is a (right, row) substochastic matrix, so ρ(Γ) < 1 holds trivially and the NE is unique. The P_Υ condition guarantees both the uniqueness of the NE and the convergence of BRD. Please refer to Parise and Ozdaglar (2019) for more conditions on uniqueness.

C Proof of Theorem 2
Proof.
This algorithm is designed to solve the SVI problem presented in Eqns (14) and (15). We denote H = diag(h) and, for G ≻ 0, define the norm ||x||_G by ||x||²_G = x^T G x.
For simplicity, we will use x and y in place of x^(1) and x^(2) in the remainder of the proof. We can rewrite the steps in Algorithm 3 as follows:

• Step 0 (Initialization): given ε, µ and x_0, let t = 0, x(0) = x_0, y_k(0) = σ_k(x_{S_k}(0)); arbitrarily choose λ(0).
• Step 1: find x* ∈ K^(1) that solves

  (x′ − x*)^T [ F^(1)(x*) − A^T ( λ(t) − H(Ax* + y(t)) ) ] ≥ 0    (17)

  for all x′ ∈ K^(1), and set x(t+1) = x*.
• Step 2: find y* ∈ K^(2) that solves

  (y′ − y*)^T [ F^(2)(y*) − ( λ(t) − H(Ax(t+1) + y*) ) ] ≥ 0    (18)

  for all y′ ∈ K^(2), and set y(t+1) = y*.
• Step 3: set

  λ(t+1) = λ(t) − H(Ax(t+1) + y(t+1)).    (19)

• Step 4 (Convergence verification): if ||ω(t+1) − ω(t)||_∞ < ε, stop; otherwise let t ← t + 1 and go back to Step 1.
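The alternating steps can be exercised on a small linear-quadratic instance, where each sub-VI is unconstrained and has a closed-form solution. The instance, the penalty matrix H, and the iteration budget below are our own illustrative choices, not the paper's experiments:

```python
import numpy as np

# Linear-quadratic SVI instance: F1(x) = P x - b, F2(y) = R y - d,
# coupled by A x + y = 0 (K unconstrained here for simplicity).
rng = np.random.default_rng(1)
N, M = 6, 2
membership = np.array([0, 0, 0, 1, 1, 1])
A = np.zeros((M, N))
A[membership, np.arange(N)] = -1.0
P = np.diag(rng.uniform(2.0, 3.0, N))   # s.p.d. => strongly monotone F1
R = np.diag(rng.uniform(2.0, 3.0, M))
b, d = rng.standard_normal(N), rng.standard_normal(M)
H = 0.5 * np.eye(M)                     # penalty matrix H = diag(h), H > 0

x, y, lam = np.zeros(N), np.zeros(M), np.zeros(M)
for _ in range(2000):
    # Step 1: the x-subproblem (Eqn (17)) reduces to a linear solve.
    x = np.linalg.solve(P + A.T @ H @ A, b + A.T @ lam - A.T @ H @ y)
    # Step 2: the y-subproblem (Eqn (18)).
    y = np.linalg.solve(R + H, d + lam - H @ (A @ x))
    # Step 3: the dual update (Eqn (19)).
    lam = lam - H @ (A @ x + y)

assert np.linalg.norm(A @ x + y) < 1e-6               # coupling constraint met
assert np.allclose(P @ x - b, A.T @ lam, atol=1e-6)   # stationarity in x
assert np.allclose(R @ y - d, lam, atol=1e-6)         # stationarity in y
```

This is the standard alternating-direction pattern: the two operators are handled in separate, cheaper subproblems, and only the multiplier λ ties them together.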
When we have y(t+1) = y(t) and λ(t+1) = λ(t), ω(t+1) = (x(t+1), y(t+1), λ(t+1)) is the solution to our SVI. We denote the unique solution as ω* = (x*, y*, λ*). From Eqns (18) and (19), we have the following from Section 2 of He (2009):

||y(t+1) − y*||²_H + ||λ(t+1) − λ*||²_{H⁻¹}
  ≤ ( ||y(t) − y*||²_H + ||λ(t) − λ*||²_{H⁻¹} ) − ( ||y(t+1) − y(t)||²_H + ||λ(t+1) − λ(t)||²_{H⁻¹} )
  < ||y(t) − y*||²_H + ||λ(t) − λ*||²_{H⁻¹},    (20)

which shows the contraction property of the sequence {(y(t), λ(t))} and thus proves the convergence of the algorithm. A more detailed proof of convergence of the steps in Eqns (17)-(19) is covered in (Gabay and Mercier, 1976; Glowinski and Oden, 1985), and a more general version of these steps and convergence proofs is covered in (Tseng, 1990; Lions and Mercier, 1979).

D Proof of Theorem 3
D.1 Full version of HH-BRD
We will first show the full version of HH-BRD. Suppose the superlevel partition is taken between level q − 1 and level q. Then for i = 1, …, N^(1),

û^(sl)_i = ∑_{l=1}^{q−1} u^(l)_{k_il}(x_{k_il}, x_{I_{k_il}}) − L^{(sl₁,sl₂)}_{k_iq}( σ^{(1,q)}_{k_iq}(x_{S^{(1,q)}_{k_iq}}), x^(q)_{k_iq} ),    (21)

where

S^{(1,q)}_p = { a^(1)_i | k_iq = p },   σ^{(1,q)}_p(x_{S^{(1,q)}_p}) = ∑_{a^(1)_i ∈ S^{(1,q)}_p} x^(1)_i.

And for j = 1, …, N^(q),

û^(sl)_j = ∑_{l=q}^{L} u^(l)_{k_jl}(x_{k_jl}, x_{I_{k_jl}}) − L^{(sl₁,sl₂)}_j( x^(q)_j, σ^{(1,q)}_j(x_{S^{(1,q)}_j}) ).    (22)

Please refer to Algorithm 5 for the pseudocode of the full version of this algorithm. The loss function updates are similar to those of Algorithm 3.

D.2 Proof of Theorem 3
We will first construct a 2-level game equivalent to the L-level game where L > 2, and then show that the action profile update trajectories are the same for the original game and the equivalent game. Finally, the convergence of the equivalent game follows from Theorem 2, and thus Algorithm 4 guarantees convergence.

Proof.
We define the following counterpart for the utility component u^(l)_i(x^(l)_i, x^(l)_{I_i}) (1 < l < q):

u^(l)_i(x_{S^{(1,l)}_i}, x_{S^{(1,l)}_{I_i}}) = u^(l)_i(x^(l)_i, x^(l)_{I_i}),    (23)
ALGORITHM 5: Hybrid Hierarchical BRD (Full Version)

Initialize the game: t = 0, x^(1)_i(0) = (x_0)_i for i = 1, …, N^(1)
for l = 2:L do
  for k = 1:N^(l) do
    x^(l)_k(0) = σ^(l)_k(x_{S^(l)_k}(0))
  end
end
while not converged do
  for k = 1:N^(q) (Meta-Level-1 Penalty Update) do
    update L^{(sl₁,sl₂)}_k
  end
  for i = 1:N^(1) (Level-1/Meta-Level-1 Gaming) do
    x^(1)_i(t+1) = BR_i( x^(1)_{I_i}(t), x^(2)_{I_{k_i2}}(t), …, x^(q)_{k_iq}(t), û^(sl)_i )
  end
  for l = 2:q−1 (Level-2 to Level-(q−1) Aggregation) do
    for j = 1:N^(l) do
      x^(l)_j(t+1) = σ^(l)_j(x_{S^(l)_j}(t+1))
    end
  end
  for k = 1:N^(q) (Meta-Level-2 Penalty Update) do
    update L^{(sl₁,sl₂)}_k
  end
  for j = 1:N^(q) (Level-q/Meta-Level-2 Gaming) do
    x^(q)_j(t+1) = BR_j( σ^{(1,q)}_j(x_{S^{(1,q)}_j}), x^(q)_{I_j}(t), x^(q+1)_{I_{k_j(q+1)}}(t), …, x^(L)_{I_{k_jL}}(t), û^(sl)_j )
  end
  for l = q+1:L (Level-(q+1) to Level-L Aggregation) do
    for p = 1:N^(l) do
      x^(l)_p(t+1) = σ^(l)_p(x_{S^(l)_p}(t+1))
    end
  end
  t ← t + 1
end

Eqn (23) applies when x^(l)_i = σ^{(1,l)}_i(x_{S^{(1,l)}_i}), ∀i, ∀l ∈ {2, …, q−1}. Both x_{S^{(1,l)}_i} and x_{S^{(1,l)}_{I_i}} are level-1 action profiles. This is exactly how we create the utility functions under the flat perspective, where we expand the higher-level aggregate actions down to level 1.

Similarly, we define the following counterpart for the utility component u^(l)_j(x^(l)_j, x^(l)_{I_j}) (q < l ≤ L):

u^(l)_j(x_{S^{(q,l)}_j}, x_{S^{(q,l)}_{I_j}}) = u^(l)_j(x^(l)_j, x^(l)_{I_j}),    (24)

when x^(l)_j = σ^{(q,l)}_j(x_{S^{(q,l)}_j}), ∀j, ∀l ∈ {q, …, L}. Both x_{S^{(q,l)}_j} and x_{S^{(q,l)}_{I_j}} are level-q action profiles. This time we expand the higher-level aggregate actions down to level q instead of level 1.
22, 2021So then we can define a “flattened” super-level-1 utility function counterpart for u ( sl ) i as follows u ( sl ) i ( x (1) i , x (1) I i ) = q − (cid:88) l =1 u ( l ) k il ( xxx S (1 ,l ) kil , xxx S (1 ,l ) Ikil ) − L ( sl ,sl ) k iq (cid:18) σ (1 ,q ) k iq ( xxx S (1 ,q ) kiq ) , x ( q ) k iq (cid:19) , (25)where I ( sl ) i = { a (1) j | k jq = k iq , j (cid:54) = i } . Similarly, for meta-level 2, we can define a “flattened”(to level-q) function counterpart for u ( sl ) j as follows u ( sl ) j ( x ( q ) j , x ( q ) I ( sl j ) = L (cid:88) l = q u ( l ) k jl ( xxx S ( q,l ) kjl , xxx S ( q,l ) IkjL ) − L ( sl ,sl ) j (cid:18) x ( q ) j , σ (1 ,q ) j ( xxx S (1 ,q ) j ) (cid:19) , (26)where I ( sl ) j = { a ( q ) p | k pL = k jL , p (cid:54) = j } . So now we can create a 2-level game where the level-1(resp. level-q) agents in the original game become the level-1(resp. level-2) agents in the new game with utility functions defined in Eqn (25) (resp. Eqn (26)). Based on Theorem2, we know that if we apply SH-BRD, we can converge to the unique NE of the game under Assumptions 1-3.Then it remains to show that given the same initialization, applying HH-BRD in the original game and the MS-BRDin the new 2-level game generate the same level-1 action profile update trajectory. This can be shown using induction.We know from initialization that x ( l ) i (0) = σ (1 ,l ) i ( xxx S (1 ,l ) i (0)) , ∀ i, ∀ l ∈ { , . . . , q − } ,x ( l ) j (0) = σ ( q,l ) j ( xxx S ( q,l ) j (0)) , ∀ j, ∀ l ∈ { q, . . . , L } . Then based on Eqn (23), we know that u ( sl ) i ( x (1) i , xxx (1) I i (0) , . . . , x ( q ) k iq (0))= u ( sl ) i ( x (1) i , x (1) I i (0)) ⇔ BR i ( xxx (1) I i (0) , . . . , x ( q ) k iq (0) , u ( sl ) i )= BR i ( x (1) I i (0) , u ( sl ) i ) , and thus when t = 1 , xxx (1) ( t ) are the same when applying HH-BRD in the original game and the MS-BRD in the new2-level game. 
Similarly, x^(q)(1) is the same based on Eqn (24). Suppose x^(1)(t) and x^(q)(t) are the same for the two dynamics for t = 0, 1, …, T; we need to show that x^(1)(t) and x^(q)(t) are the same for t = T + 1 to complete the proof. Again, based on Eqn (23), we know that

û^(sl)_i(x^(1)_i, x^(1)_{I_i}(T), …, x^(q)_{k_iq}(T)) = u^(sl)_i(x^(1)_i, x^(1)_{I_i}(T))
  ⇔ BR_i(x^(1)_{I_i}(T), …, x^(q)_{k_iq}(T), û^(sl)_i) = BR_i(x^(1)_{I_i}(T), u^(sl)_i),

which implies that x^(1)(T+1) is the same for the two dynamics, and similarly x^(q)(T+1) is the same based on Eqn (24).
E Data Generation for Numerical Experiments
We introduce the data generation procedures for games with linear and with non-linear best responses.

First, for both types of games, we create an adjacency matrix for each of the groups on every level. This matrix has zero diagonal elements, and for the off-diagonal elements, the existence of a directed edge follows a Bernoulli distribution with a fixed probability P_exist. If a directed edge exists, its weight is generated by choosing a value uniformly at random. Later, we multiply these matrices by different scalars to adjust the values so that Assumption 3 holds. These matrices have zero diagonal elements because they capture the dependencies of agents on each other, or equivalently, they model the external impact the agents receive from the network. The internal impact is modeled by cost functions and marginal benefit terms that depend only on an agent's own action.
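The generation step above can be sketched as follows. The edge probability 0.1 and the unit-interval weight range below are our placeholder choices, not values taken from the paper:

```python
import numpy as np

def random_adjacency(n, p_exist, rng):
    """Directed weighted adjacency matrix as described above.

    Zero diagonal; each off-diagonal edge exists with probability p_exist
    (Bernoulli), and existing edges get i.i.d. uniform random weights.
    No symmetry is imposed.
    """
    exists = rng.random((n, n)) < p_exist
    weights = rng.uniform(0.0, 1.0, (n, n))
    W = np.where(exists, weights, 0.0)
    np.fill_diagonal(W, 0.0)
    return W

rng = np.random.default_rng(42)
W = random_adjacency(100, 0.1, rng)   # p_exist = 0.1 is our placeholder value
assert np.all(np.diag(W) == 0)
assert np.all((W >= 0) & (W < 1))
```

A subsequent global rescaling of such matrices (and of the cost coefficients) is what enforces the diagonal-dominance requirement of Assumption 3.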
E.1 Linear Best Response Games

For games with a linear best response, we generated a 2-level game with 100 groups and 10,000 level-1 agents. The adjacency matrix generation uses a small P_exist, which creates a rather sparse network. Each level-2 group S^(2)_k contains 100 members, and we use W_k to denote the corresponding adjacency matrix and V to denote the level-2 adjacency matrix. From Eqn (6), we know that for each level-1 agent the utility function is

u_i(x^(1)_i, x^(1)_{I_i}, x^(2)_{I_{k_i}}) = u^(1)_i(x^(1)_i, x^(1)_{I_i}) + u^(2)_{k_i}(x^(2)_{k_i}, x^(2)_{I_{k_i}}),

where

u^(1)_i(x^(1)_i, x^(1)_{I_i}) = b_i x^(1)_i + x^(1)_i ∑_{j∈I_i} (W_{k_i})_{r_i r_j} x^(1)_j − c_i (x^(1)_i)²,
u^(2)_k(x^(2)_k, x^(2)_{I_k}) = x^(2)_k ∑_{p≠k} V_{kp} x^(2)_p.

We choose the cost coefficients c_i to be large enough that Υ(F) satisfies the P_Υ condition (from Appendix B, strong diagonal dominance implies the P_Υ condition); in the experiments, the spectral radius ρ(Γ) (see Appendix B for Γ) is strictly below 1.

Then, under the flat perspective, a level-1 agent a^(1)_i has the following utility function:

u^flat_i(x^(1)_i, x^(1)_{−i}) = b_i x^(1)_i + x^(1)_i ∑_{j≠i} W^flat_{ij} x^(1)_j − c_i (x^(1)_i)² + d_i,

where

d_i = ∑_{j∈I_i} x^(1)_j ∑_{p∉S^(2)_{k_i}} W^flat_{jp} x^(1)_p,

W^flat =
[ W_1        V_{1,2}·1   ⋯   V_{1,M}·1
  V_{2,1}·1  W_2         ⋯   V_{2,M}·1
  ⋮          ⋮           ⋱   ⋮
  V_{M,1}·1  V_{M,2}·1   ⋯   W_M ],

and 1 here represents the all-ones matrix of suitable size (100 × 100).
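Under the stated assumptions (equal group sizes; all names below are ours), W^flat can be assembled directly from the block structure:

```python
import numpy as np

def flat_matrix(W_list, V):
    """Assemble the flat-perspective matrix W^flat from the block structure.

    Diagonal blocks are the within-group matrices W_k; the (k, p) off-diagonal
    block is V[k, p] times the all-ones matrix, reflecting that every member
    of group k depends identically on every member of group p.
    """
    M = len(W_list)
    n = W_list[0].shape[0]               # members per group (equal sizes here)
    ones = np.ones((n, n))
    blocks = [[W_list[k] if k == p else V[k, p] * ones for p in range(M)]
              for k in range(M)]
    return np.block(blocks)

# Toy instance: 3 groups of 2 agents each.
rng = np.random.default_rng(7)
W_list = [rng.uniform(0, 1, (2, 2)) for _ in range(3)]
for W in W_list:
    np.fill_diagonal(W, 0.0)
V = rng.uniform(0, 1, (3, 3))
Wf = flat_matrix(W_list, V)
assert Wf.shape == (6, 6)
```

The uniform off-diagonal blocks are exactly what the standard BRD baseline has to materialize, which is why the flat representation is so much larger than the multi-scale one.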
E.2 General Best Response Games

For games with a general (non-linear) best response, we generated data using the graphical game model similarly to the above. However, this time we use a mixed cost term that is a weighted sum of a quadratic component and an exponential component. Therefore, we can no longer represent the best response functions as linear functions, and computing a best response now relies on gradient-based optimization steps. In the experiments shown in the main article,
the adjacency matrix is generated with a small P_exist, which creates a sparse network. We also tried P_exist = 1, and the results on the dense networks are included in this part of the appendix.

We use W^(l)_i to denote the adjacency matrix within S^(l)_i and W^(L+1) to denote the adjacency matrix between highest-level agents. For the 2-level games with a general best response, the utility components are set as follows (β below denotes a fixed positive coefficient in the exponential cost terms):

u^(1)_i(x^(1)_i, x^(1)_{I_i}) = b_i x^(1)_i + x^(1)_i ∑_{j∈I_i} (W^(2)_{k_i})_{r_i r_j} x^(1)_j − c_i (x^(1)_i)² − e^{β x^(1)_i},
u^(2)_i(x^(2)_i, x^(2)_{I_i}) = x^(2)_i ∑_{j≠i} (W^(3)_{k_i})_{ij} x^(2)_j − |S^(2)_i| · e^{β x^(2)_i / |S^(2)_i|}.

For 3-level games with a general best response, the components at levels 1 and 2 remain the same, and the level-3 components are

u^(3)_i(x^(3)_i, x^(3)_{I_i}) = x^(3)_i ∑_{j≠i} W^(4)_{ij} x^(3)_j − |S^{(1,3)}_i| · e^{β x^(3)_i / |S^{(1,3)}_i|}.

For the 2-level games with linear/nonlinear best response, the utility components are set as follows:

u^(1)_i(x^(1)_i, x^(1)_{I_i}) = b_i x^(1)_i + x^(1)_i ∑_{j∈I_i} (W^(2)_{k_i})_{r_i r_j} x^(1)_j − c_i (x^(1)_i)²,
u^(2)_i(x^(2)_i, x^(2)_{I_i}) = x^(2)_i ∑_{j≠i} (W^(3)_{k_i})_{ij} x^(2)_j − |S^(2)_i| · e^{β x^(2)_i / |S^(2)_i|}.

Again, the adjacency matrices and the cost terms are scaled to ensure that Assumption 3 holds; in the experiments, the spectral radius ρ(Γ) (see Appendix B for Γ) is strictly below 1.

Hyperparameter settings: besides the parameters of the graphical games, the parameter h^(l)_i in the loss function updates in Eqn (10) can be chosen arbitrarily. These parameters can also be referred to as "penalty parameters". In our experiments, performance is rather smooth over these parameters under Assumption 3. The hyperparameters h^(l)_i are set to the same value on each level l.
In the 2-level case, we perform a binary search on this hyperparameter, where each value is tested for 5 runs to measure average performance. In the 3-level case, we need to determine 2 hyperparameter values, and this is done by a fixed-step-size search performed iteratively on the two values: we tune the first value (each candidate tested for 5 runs as above) while fixing the second; after that, we switch to tuning the second value, and the process continues iteratively. The parameters used in the numerical experiments are:

• 2-Level games: one tuned value of h^(2)_i per network size.
• 3-Level games: for SH-BRD, one tuned pair (h^(2)_i, h^(3)_j) per network size; for HH-BRD, one tuned value of h^(sl)_i per network size.
Under the current parameter settings, we still have not brought out the best performance of SH-BRD and HH-BRD. In fact, the performance gap between the current setting and the optimal setting should not be too large, since the best response steps are well-posed; and even with these sub-optimal settings, we have seen their advantages over the other algorithms.

In (He, Yang, and Wang, 2000), the authors describe an adaptive method for generating the penalty parameter matrix H, which is generally not diagonal and can speed up the solution steps. This is an interesting direction for generalizing our current algorithm when the best response functions become more ill-posed in the future.

E.3 CPU Specs

• CPU: 6 cores, 12 threads, 2.60/4.50 GHz, 12MB cache
• OS: Windows 10
• Software: Python 3.7
• RAM: 16 GB
E.4 Results on Dense Networks

[Table 5 (columns: Size, BRD, MS-BRD, SH-BRD): Convergence and complexity (FLOPs) comparison with linear best response under multiple initialization, dense network.]
[Table 6 (columns: Size, BRD, MS-BRD, SH-BRD): CPU times on a single machine on 2-level games with general best response functions, dense network; all times are in seconds.]
[Table 7 (columns: Size, BRD, MS-BRD, SH-BRD): CPU times on a single machine for 2-level, linear/nonlinear best-response games, dense network; all times are in seconds.]
[Table 8 (columns: Size, BRD, MS-BRD, SH-BRD, HH-BRD): CPU times in seconds on a single machine on 3-level, general best response games, dense network.]

We can see that although the results in linear best response games are very different on sparse and dense networks, the results in games with non-linear best responses are quite similar on both types of networks. In games with linear best responses, the standard deviation results from the different initializations. For the same game, one initial action profile's distance (measured in the Euclidean norm) to the equilibrium point can be 20 times the distance of another initial action profile; this results in different numbers of iterations before the algorithm converges. However, it only takes about 20% more iterations for a "distant" initial action profile to reach convergence, which shows that these algorithms have good convergence properties under Assumptions 1-3. In games with non-linear best responses, the standard deviations of CPU times are relatively small (around 1%) compared to the mean values, which shows that the performance of all the algorithms is stable with a fixed initial action profile.
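The mild effect of the initial profile's distance to equilibrium can be reproduced on a toy GLBR instance; the parameters below are our own, and the sketch simply counts synchronous best-response sweeps from a near and a 20x-more-distant start:

```python
import numpy as np

def brd_iterations(G, b, c, x0, tol=1e-8):
    """Synchronous best-response dynamics; returns (fixed point, #sweeps)."""
    x, sweeps = x0.astype(float).copy(), 0
    while True:
        x_new = (b + G @ x) / (2.0 * c)
        sweeps += 1
        if np.max(np.abs(x_new - x)) < tol:
            return x_new, sweeps
        x = x_new

rng = np.random.default_rng(3)
n = 50
G = (rng.random((n, n)) < 0.1) * rng.uniform(0, 1, (n, n))
np.fill_diagonal(G, 0.0)
b, c = np.ones(n), np.full(n, 6.0)      # large costs keep the BR map a contraction
x_star, _ = brd_iterations(G, b, c, np.zeros(n))

near = x_star + 0.01 * rng.standard_normal(n)
far = x_star + 20.0 * (near - x_star)   # 20x the distance to equilibrium
_, it_near = brd_iterations(G, b, c, near)
_, it_far = brd_iterations(G, b, c, far)
assert it_near <= it_far
```

Because convergence is geometric under the P_Γ condition, a 20x larger initial distance costs only an additive handful of sweeps (on the order of log 20 divided by the log of the contraction factor), consistent with the roughly 20% overhead reported above.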
F Algorithm Performances and Network Sizes
In this part, we present results that show the algorithms' performance at different network sizes in 2-level games.

Figure 2 shows the number of FLOPs per iteration for the three algorithms in I × M games, where I is the number of agents in each group and M the number of groups in the network. Both Algorithms 2 and 4 outperform Algorithm 1. Algorithm 4 generally has lower complexity per iteration than Algorithm 2, since it has less input in every sub-problem and the numbers of sub-problems in Algorithms 2 and 4 are similar when the group sizes are large. However, when group sizes are small compared to the number of groups, Algorithms 2 and 4 are similar per iteration.

[Figure 2: Complexity per iteration for linear best response.]

G Reverse Engineer Multi-scale Structure
A question that naturally arises is whether sparsity in the network can be exploited when the multi-scale structure is not readily available. The utility function in Eqn (6) suggests that such reverse engineering is possible if the game satisfies:

1. An agent is either connected to all agents in another group or not connected to any agent in that group; if so, we can create a set of possible group partitions.
2. Based on the partition in the previous step, agents in one group have the same dependency on an agent in another group.
3. Based on the partition, we can represent the groups' aggregate actions from their members' actions using some aggregation functions.
4. Based on the partition, the original utility function of each agent can be separated into components on different levels, each component based only on the actions and dependencies of the corresponding level.

An example of the first condition is shown in Figs. 3 and 4. For the other conditions, the "flattened" utility functions used in Appendix E are good examples.

[Figure 3: Ungrouped. Figure 4: Grouped.]
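Condition 1 can be verified programmatically for a candidate partition. The sketch below uses hypothetical toy matrices of our own; note that it checks only the all-or-none edge pattern, not the weight-equality required by condition 2:

```python
import numpy as np

def satisfies_condition1(W, membership):
    """Check condition 1: each agent is connected either to all members of
    another group or to none of them (edge existence only, weights aside)."""
    groups = np.unique(membership)
    E = W != 0
    for i in range(W.shape[0]):
        for g in groups:
            if g == membership[i]:
                continue                 # within-group links are unconstrained
            links = E[i, membership == g]
            if links.any() and not links.all():
                return False             # mixed pattern: partition is invalid
    return True

membership = np.array([0, 0, 1, 1])
W_good = np.array([[0, 1, 2, 2],
                   [1, 0, 0, 0],
                   [3, 3, 0, 1],
                   [0, 0, 1, 0]])
W_bad = W_good.copy()
W_bad[0, 2] = 0   # agent 0 now reaches only one member of group 1
assert satisfies_condition1(W_good, membership)
assert not satisfies_condition1(W_bad, membership)
```

Running such a check over a family of candidate partitions (e.g., those produced by community detection) would yield the "set of possible group partitions" that condition 1 refers to.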