Nonasymptotic Convergence Rates for Cooperative Learning Over Time-Varying Directed Graphs
Angelia Nedić, Alex Olshevsky and César A. Uribe
Abstract
We study the problem of distributed hypothesis testing with a network of agents where some agents repeatedly gain access to information about the correct hypothesis. The group objective is to globally agree on a joint hypothesis that best describes the observed data at all the nodes. We assume that the agents can interact with their neighbors in an unknown sequence of time-varying directed graphs. Following the pioneering work of Jadbabaie, Molavi, Sandroni, and Tahbaz-Salehi, we propose local learning dynamics which combine Bayesian updates at each node with a local aggregation rule of private agent signals. We show that these learning dynamics drive all agents to the set of hypotheses which best explain the data collected at all nodes as long as the sequence of interconnection graphs is uniformly strongly connected. Our main result establishes a non-asymptotic, explicit, geometric convergence rate for the learning dynamics.
I. INTRODUCTION
Recent years have seen a considerable amount of work on the analysis of distributed algorithms. Nonetheless, the study of distributed decision making and computation can be traced back to the classic papers [1], [2], [3] from the 70s and 80s. Applications of such algorithms range from opinion dynamics analysis, network learning and inference, cooperative robotics, and communication networks, to social as well as sensor networks. It is the latter settings of social and sensor networks which are the focus of the current paper.

Interactions among people produce exchanges of ideas, opinions, observations and experiences, from which new ideas, opinions, and observations are generated. Analyzing dynamic models of such processes generates insight into human behavior and produces algorithms useful in the sensor networking context.
The authors are with the Coordinated Science Laboratory, University of Illinois, 1308 West Main Street, Urbana, IL 61801, USA, {angelia,aolshev2,cauribe2}@illinois.edu. This research is supported partially by the National Science Foundation under grant no. CCF-1017564 and by the Office of Naval Research under grant no. N00014-12-1-0998.
We consider an agent network where agents repeatedly receive information from their neighbors and private signals from an external source, which provide samples of a random variable with an unknown distribution. The agents would like to collectively agree on a hypothesis (distribution) that best explains the data.

Initial results on learning in social networks are described in [4], where local update rules are designed to match Bayes' theorem. That is, given a prior and new observations, an agent computes likelihood functions in order to generate a new posterior; see [5]. Nevertheless, a fully Bayesian approach might not be possible in general, since full knowledge of the network structure or of the other agents' hypotheses might not be available [6]. Fortunately, non-Bayesian methods have been shown to be successful in learning as well. For example, in [7], the authors propose a modification of Bayes' rule that accounts for over-reactions or under-reactions to new information.

In a distributed setting, several groundbreaking papers have described ways in which agents achieve global behaviors by repeatedly aggregating local information in a network [8], [9], [10]. For example, for distributed hypothesis testing using belief propagation, convergence and its dependence on the communication structure were established in [10]. Later, extensions to finite capacity channels, packet losses, delayed communications and tracking were developed [11], [12]. In [9], the authors proved convergence in probability and asymptotic normality of the distributed estimator, and provided conditions under which the distributed estimator is as good as a centralized one. Later, in [8], almost sure convergence of a non-Bayesian rule based on the arithmetic mean was shown for fixed-topology graphs. Extensions to information heterogeneity and asymptotic convergence rates have been derived as well [13]. Following [8], other methods to aggregate Bayes estimates in a network have been explored. In [14], geometric means are used, also for fixed topologies; however, the consensus and learning steps are separated. The work in [15] extends the results of [8] to time-varying undirected graphs. In [16], local exponential rates of convergence for undirected gossip-like graphs are studied.

In this paper we propose a non-Bayesian learning rule, analyze its consistency, and derive a non-asymptotic rate of convergence for time-varying directed graphs. Our first result shows consistency: over time, the protocol learns the hypothesis, or set of hypotheses, which best explains the data collected by all the nodes. Moreover, our main result provides a geometric, non-asymptotic, and explicit characterization of the rate of convergence, which immediately leads to finite-time bounds that scale intelligibly with the number of nodes.

In a simultaneous independent effort, the authors in [17], [18] proposed a similar non-Bayesian learning algorithm where a local Bayes update is followed by a consensus step. In [17], a convergence result for
fixed graphs is provided and large-deviation convergence rates are given, proving the existence of a random time after which the beliefs concentrate exponentially fast. In [18], similar probabilistic bounds on the rate of convergence are derived for fixed graphs, and comparisons with the centralized version of the learning rule are provided.

This paper is organized as follows. In Section II we describe the model that we study and the proposed update rule. In Section III we analyze the consistency of the information aggregation and estimation models, while in Section IV we establish non-asymptotic convergence rates of the agent beliefs. Section V presents simulation results, and some conclusions and future work directions are given in Section VI.
Notation:
Upper case letters represent random variables (e.g., $X_k$), and the corresponding lower case letters their realizations (e.g., $x_k$). A subindex will generally indicate the time index. We write $[A_k]_{ij}$ for the entry of the matrix $A_k$ in its $i$-th row and $j$-th column. We write $A'$ for the transpose of a matrix $A$ and $x'$ for the transpose of a vector $x$. We use $I$ for the identity matrix. Bold letters represent vectors, which are assumed to be column vectors. The $i$-th entry of a vector is denoted by a superscript $i$, i.e., $\mathbf{x}_k = [x_k^1, \ldots, x_k^n]'$. We write $\mathbf{1}_n$ to denote the all-ones vector of size $n$. For a sequence of matrices $\{A_t\}$, we let $A_{t_f:t_i} \triangleq A_{t_f} \cdots A_{t_i+1} A_{t_i}$ for all $t_f \ge t_i \ge 0$. The terms "almost surely" and "independent identically distributed" are abbreviated by a.s. and i.i.d., respectively.

II. PROBLEM SETUP AND MAIN RESULTS
We consider a group of $n$ agents, each of which observes a random variable at each time step $k = 1, 2, 3, \ldots$. We use $S_k^i$ to denote the random variable whose samples are observed by agent $i$ at time step $k$. We denote the set of outcomes of the random variable $S_k^i$ by $S^i$, and we assume that this set is finite, i.e., $S^i = \{s_1^i, s_2^i, \ldots, s_{m_i}^i\}$ for all $i = 1, \ldots, n$. Furthermore, we assume that all $S_k^i$ are i.i.d. and drawn according to some probability distribution $f^i : S^i \to [0,1]$. For convenience, we stack up all the $S_k^i$'s into a single random vector $S_k$.

We assume there is a finite set of hypotheses, $\Theta = \{\theta_1, \theta_2, \ldots, \theta_m\}$, and there is a probability distribution $l^i(\cdot \mid \theta)$ for each agent $i$ and hypothesis $\theta \in \Theta$. Intuitively, we think of $l^i(\cdot \mid \theta)$ as the probability distribution seen by agent $i$ if hypothesis $\theta$ were true. Note that the agents are not required to have a hypothesis that is exactly equal to the unknown distribution $f^i$. The goal of the agents is to agree on an element of $\Theta$ that best fits all the observations in the network (in a technical sense to be described soon).

Agents communicate with their neighbors; this communication is modeled as a graph $\mathcal{G}_k = \{V, E_k\}$ composed of a node set $V = \{1, 2, \ldots, n\}$ and a set of directed links $E_k$. We will refer to probability distributions over $\Theta$ as beliefs and assume that agent $i$ begins with an initial belief $\mu_0^i$, which we also refer to as its prior distribution or prior belief.

This paper focuses on the study of the group dynamics wherein, at time $k$, each agent $i$ updates its previous belief $\mu_k^i$ to a new belief $\mu_{k+1}^i$ as follows:
$$\mu_{k+1}^i(\theta) = \frac{\prod_{j=1}^n \mu_k^j(\theta)^{[A_k]_{ij}}\, l^i\!\left(s_{k+1}^i \mid \theta\right)}{\sum_{p=1}^m \prod_{j=1}^n \mu_k^j(\theta_p)^{[A_k]_{ij}}\, l^i\!\left(s_{k+1}^i \mid \theta_p\right)}, \qquad (1)$$
with $[A_k]_{ij} > 0$ when $i$ receives information from $j$ at time $k$, and $[A_k]_{ij} = 0$ otherwise.

The "weight matrices" $A_k$ satisfy some technical connectivity conditions which have been previously used in the convergence analysis of distributed averaging and other consensus algorithms [19], [20], [21]. The assumptions on the communication graph are stated in Assumption 1 below.
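As a concrete illustration of the update in Eq. (1), the following Python sketch performs one belief update for a single agent: the neighbors' beliefs are aggregated geometrically, with exponents given by the agent's row of $A_k$, and the result is re-weighted by the likelihood of the newly observed private signal. The array layout, variable names and numbers are our own illustrative choices, not part of the paper.

```python
import numpy as np

def belief_update(mu_prev, A_row_i, likelihood_i, s_new):
    """One step of Eq. (1) for agent i.

    mu_prev      : (n, m) array, mu_prev[j, p] = belief of agent j in hypothesis theta_p at time k
    A_row_i      : (n,) array, i-th row of the row-stochastic weight matrix A_k
    likelihood_i : (m, |S^i|) array, likelihood_i[p, s] = l^i(s | theta_p)
    s_new        : index of the private signal s^i_{k+1} observed by agent i
    """
    # Geometric aggregation of neighbors' beliefs: prod_j mu_k^j(theta_p)^{[A_k]_{ij}}
    log_agg = A_row_i @ np.log(mu_prev)                   # shape (m,)
    # Bayesian re-weighting with the new signal, then normalization over Theta
    log_unnorm = log_agg + np.log(likelihood_i[:, s_new])
    w = np.exp(log_unnorm - log_unnorm.max())             # subtract max for numerical stability
    return w / w.sum()

# Tiny usage example with made-up numbers: 3 agents, 2 hypotheses, binary signals.
mu_k = np.array([[0.5, 0.5], [0.9, 0.1], [0.2, 0.8]])
A_row = np.array([0.5, 0.25, 0.25])
lik = np.array([[0.3, 0.7], [0.8, 0.2]])                  # lik[p, s] = l^1(s | theta_p)
print(belief_update(mu_k, A_row, lik, s_new=1))
```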
Assumption 1
The graph sequence $\{\mathcal{G}_k\}$ and the matrix sequence $\{A_k\}$ are such that:
1) $A_k$ is row-stochastic with $[A_k]_{ij} > 0$ if $(j,i) \in E_k$;
2) $A_k$ has positive diagonal entries, $[A_k]_{ii} > 0$;
3) if $[A_k]_{ij} > 0$, then $[A_k]_{ij} > \eta$ for some positive constant $\eta$;
4) $\{\mathcal{G}_k\}$ is $B$-strongly connected, i.e., there is an integer $B \ge 1$ such that the graph $\left(V, \bigcup_{i=kB}^{(k+1)B-1} E_i\right)$ is strongly connected for all $k \ge 0$.

As a measure of the explanatory quality of the hypotheses in the set $\Theta$ we use the Kullback-Leibler divergence between two discrete probability distributions $p$ and $q$:
$$d(p \,\|\, q) = \sum_{i=1}^n p_i \log\left(\frac{p_i}{q_i}\right).$$
Concretely, the quality of hypothesis $\theta_j$ for agent $i$ is measured by the Kullback-Leibler divergence $d\left(f^i(\cdot) \,\|\, l^i(\cdot \mid \theta_j)\right)$ between the true distribution of the signals $S_k^i$ and the probability distribution $l^i(\cdot \mid \theta_j)$ seen by agent $i$ if hypothesis $\theta_j$ were correct. We use the following assumption on the agents' best hypotheses.

Assumption 2
The set $\Theta^*$ defined as $\Theta^* \triangleq \bigcap_{i=1}^n \Theta_i$, where $\Theta_i = \arg\min_{\theta \in \Theta} d\left(f^i(\cdot) \,\|\, l^i(\cdot \mid \theta)\right)$ for each $i$, is non-empty.

Assumption 2 is satisfied if there is some "true state of the world" $\hat{\theta} \in \Theta$ such that each agent $i$ sees distributions generated according to $\hat{\theta}$, i.e., $f^i(\cdot) = l^i(\cdot \mid \hat{\theta})$. However, this need not be the case for Assumption 2 to hold. Indeed, the assumption is considerably weaker, as it merely requires that the sets of hypotheses which provide the "best fits" for each agent have at least a single element in common.
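To make the selection criterion in Assumption 2 concrete, the following sketch computes the Kullback-Leibler divergences $d\left(f^i \,\|\, l^i(\cdot\mid\theta)\right)$ for a small toy problem and recovers each agent's best-hypothesis set $\Theta_i$ and their intersection $\Theta^*$; the numerical values are made up for illustration.

```python
import numpy as np

def kl(p, q):
    """Kullback-Leibler divergence d(p || q) for discrete distributions."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log(p / q)))

# Toy example: 2 agents, binary signals, 2 hypotheses (values are illustrative only).
f = [np.array([0.3, 0.7]), np.array([0.3, 0.7])]          # true distributions f^i
l = [  # l[i][t] = l^i(. | theta_t)
    [np.array([0.3, 0.7]), np.array([0.8, 0.2])],          # agent 1 can tell the hypotheses apart
    [np.array([0.5, 0.5]), np.array([0.5, 0.5])],          # agent 2 is uninformative
]

# Theta_i = argmin_theta d(f^i || l^i(.|theta)); Theta* is the intersection over agents.
best_sets = []
for i in range(2):
    divs = np.array([kl(f[i], l[i][t]) for t in range(2)])
    best_sets.append(set(np.flatnonzero(np.isclose(divs, divs.min()))))
theta_star = set.intersection(*best_sets)
print(best_sets, theta_star)   # e.g. [{0}, {0, 1}] and {0}: Assumption 2 holds
```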
We will further require the following assumptions on the initial distribution and the likelihood functions.The first of these is sometimes referred to as the Zero Probability Property [22].
Assumption 3
For all agents $i = 1, \ldots, n$:
1) the prior beliefs on all $\theta^* \in \Theta^*$ are positive, i.e., $\mu_0^i(\theta^*) > 0$ for all $\theta^* \in \Theta^*$;
2) there exists an $\alpha > 0$ such that $l^i\!\left(s^i \mid \theta\right) > \alpha$ for all $s^i \in S^i$ and $\theta \in \Theta$.

Assumption 3.1 can be relaxed to the requirement that all prior beliefs are positive for some $\theta^* \in \Theta^*$. Both conditions are equally straightforward to satisfy: for example, they hold if each agent has a uniform prior belief, which is reasonable in the absence of any initial information about the goodness of the hypotheses.

We now state our first result, which asserts that the dynamics in Eq. (1) concentrate all agents' beliefs on the optimal hypothesis set. We provide its proof in Section III.

Theorem 1
Under Assumptions 1, 2, and 3, the update rule of Eq. (1) has the following property:
$$\lim_{k \to \infty} \mu_k^i(\theta) = 0 \quad \text{a.s.} \qquad \forall\, \theta \notin \Theta^*,\ i = 1, \ldots, n.$$

The result states that the agents' beliefs concentrate on the set $\Theta^*$ asymptotically as $k \to \infty$. Our main result is a non-asymptotic explicit convergence rate, given in the following theorem and proven in Section IV.

Theorem 2
Let Assumptions 1, 2, and 3 hold. Also, let $\rho \in (0,1)$ be a given error percentile (or confidence value). Then, the update rule of Eq. (1) has the following property: there exists an integer $N(\rho)$ such that, with probability $1 - \rho$, for all $k \ge N(\rho)$ and for any $\theta \notin \Theta^*$,
$$\mu_k^i(\theta) \le \exp\left(-\frac{k}{2}\gamma_2 + \gamma_1\right) \qquad \forall\, i = 1, \ldots, n,$$
where
$$N(\rho) \triangleq \left\lceil \frac{8\left(\log\frac{1}{\alpha}\right)^2 \log\frac{1}{\rho}}{\gamma_2^2} \right\rceil + 1,$$
$$\gamma_1 \triangleq \max_{\theta^* \in \Theta^*}\, \max_{\theta \notin \Theta^*} \left\{ \max_{1 \le i \le n} \log\frac{\mu_0^i(\theta)}{\mu_0^i(\theta^*)} + \frac{C}{1-\lambda}\, \|H(\theta)\|_1 \right\},$$
$$\gamma_2 \triangleq \frac{\delta}{n} \min_{\theta \notin \Theta^*} \|H(\theta)\|_1, \qquad [H(\theta)]_i = d\left(f^i(\cdot)\,\|\,l^i(\cdot\mid\theta)\right) - d\left(f^i(\cdot)\,\|\,l^i(\cdot\mid\theta^*)\right),$$
with $\alpha$ from Assumption 3.2.
The constants $C$, $\delta$ and $\lambda$ satisfy the following relations:
1) For general $B$-connected graph sequences $\{\mathcal{G}_k\}$: $C = 2$, $\lambda \le \left(1 - \eta^{nB}\right)^{1/(nB)}$, $\delta \ge \eta^{nB}$.
2) If every matrix $A_k$ is doubly stochastic: $C = \sqrt{2}$, $\lambda = \left(1 - \frac{\eta}{4n^2}\right)^{1/B}$, $\delta = 1$.
3) If each $\mathcal{G}_k$ is an undirected graph and each $A_k$ is the lazy Metropolis matrix, i.e., the stochastic matrix which satisfies $[A_k]_{ij} = \frac{1}{2\max(d(i), d(j))}$ for all $\{i,j\} \in \mathcal{G}_k$: $C = \sqrt{2}$, $\lambda = 1 - O\!\left(\frac{1}{n^2}\right)$, $\delta = 1$.

Note that $H(\theta)$ does not depend on $\theta^*$, since $d\left(f^i(\cdot)\,\|\,l^i(\cdot\mid\theta^*)\right)$ is the same for all $\theta^* \in \Theta^*$.

In contrast to the previous literature, this convergence rate is not only geometric but also non-asymptotic and explicit, in the sense of immediately leading to bounds which scale intelligibly with the number of nodes. For example, in the case of doubly stochastic matrices, Theorem 2 immediately implies that, after a transient time which scales cubically in the number $n$ of nodes, the network achieves exponential decay towards the correct answer with per-step exponent $\frac{1}{2n}\min_{\theta \notin \Theta^*}\|H(\theta)\|_1$.

Now, consider the case when Assumption 3.1 is relaxed to the following requirement: the prior beliefs on some $\theta^* \in \Theta^*$ are positive (i.e., $\mu_0^i(\theta^*) > 0$ for some $\theta^* \in \Theta^*$ and all $i$). Then, it can be seen that Theorem 2 remains valid with $\max_{\theta^* \in \Theta^*}$ and $\min_{\theta^* \in \Theta^*}$ replaced, respectively, by $\max_{\theta^* \in \widetilde{\Theta}^*}$ and $\min_{\theta^* \in \widetilde{\Theta}^*}$, where $\widetilde{\Theta}^* \subseteq \Theta^*$ is the set of all $\theta^* \in \Theta^*$ for which all the agents' priors $\mu_0^i$ are positive.
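To see how the quantities in Theorem 2 interact, the sketch below evaluates $\gamma_1$, $\gamma_2$ and $N(\rho)$ for the doubly stochastic case ($C = \sqrt{2}$, $\delta = 1$) on a two-hypothesis example, following the expressions as stated above; all numerical inputs ($\eta$, $\alpha$, $\rho$, the signal model and the priors) are illustrative assumptions, not values from the paper.

```python
import numpy as np

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

# Illustrative toy data: n agents, hypotheses {theta_0 (in Theta*), theta_1 (outside Theta*)}.
n, B, eta, alpha, rho = 4, 1, 0.25, 0.05, 0.01
f      = np.array([0.3, 0.7])                  # common true signal distribution
l_star = np.array([0.3, 0.7])                  # l^i(. | theta*) for every agent
l_bad  = np.array([0.8, 0.2])                  # l^i(. | theta) for the hypothesis outside Theta*
mu0    = np.full((n, 2), 0.5)                  # uniform priors

# H(theta): per-agent KL gap between the bad hypothesis and theta*.
H = np.array([kl(f, l_bad) - kl(f, l_star) for _ in range(n)])

# Doubly stochastic case of Theorem 2.
C, delta = np.sqrt(2.0), 1.0
lam = (1.0 - eta / (4 * n**2)) ** (1.0 / B)
gamma1 = np.max(np.log(mu0[:, 1] / mu0[:, 0])) + C / (1.0 - lam) * H.sum()
gamma2 = delta / n * H.sum()
N_rho = int(np.ceil(8 * np.log(1 / alpha)**2 * np.log(1 / rho) / gamma2**2)) + 1
print(gamma1, gamma2, N_rho)   # bound: mu_k^i(theta) <= exp(-k*gamma2/2 + gamma1) for k >= N(rho)
```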
III. CONSISTENCY OF THE LEARNING RULE

In this section we prove Theorem 1, which provides a statement about the consistency (see [23], [24]) of the distributed estimator given in Eq. (1). Our analysis will require some auxiliary results. First, we recall some results from [25] about the convergence of products of row-stochastic matrices.
Lemma 1 ([25], [26])
Under Assumption 1, for a graph sequence $\{\mathcal{G}_k\}$ and each $t \ge 0$, there is a stochastic vector $\phi_t$ (meaning its entries are nonnegative and sum to one) such that for all $i, j$ and all $k \ge t \ge 0$,
$$\left| [A_{k:t}]_{ij} - \phi_t^j \right| \le C \lambda^{k-t},$$
where
$C > 0$ and $\lambda \in (0,1)$ satisfy the relations described in Theorem 2.

The proof of Lemma 1 may be found in [25], with the exception of the bounds on
$C$ and $\lambda$ for the lazy Metropolis chains, which we omit here due to space constraints.
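As a quick numerical illustration of Lemma 1, the sketch below multiplies a sequence of random row-stochastic matrices with positive entries and checks that the rows of the backward product $A_{k:t}$ approach a common stochastic vector $\phi_t$ geometrically; the matrix generator is an illustrative stand-in for an actual $B$-connected graph sequence.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k_max, t = 5, 30, 0

def random_row_stochastic(n):
    """Row-stochastic matrix with positive entries (hence positive diagonal)."""
    M = rng.uniform(0.1, 1.0, size=(n, n))
    return M / M.sum(axis=1, keepdims=True)

# Backward product A_{k:t} = A_k ... A_{t+1} A_t and the spread of its columns.
prod = np.eye(n)
for k in range(t, k_max + 1):
    prod = random_row_stochastic(n) @ prod     # A_{k:t} = A_k A_{k-1:t}
    spread = np.max(prod.max(axis=0) - prod.min(axis=0))
    print(k, spread)   # spread bounds max_{i,j} |[A_{k:t}]_{ij} - phi_t^j|; it decays geometrically
```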
Lemma 2 ([25])
Let the graph sequence $\{\mathcal{G}_k\}$ satisfy Assumption 1. Define
$$\delta \triangleq \inf_{k \ge 0} \left( \min_{1 \le i \le n} \left[ \mathbf{1}_n' A_{k:0} \right]_i \right). \qquad (2)$$
Then, $\delta \ge \eta^{nB}$, and if all $A_k$ are doubly stochastic, then $\delta = 1$. Furthermore, the sequence $\phi_t$ from Lemma 1 satisfies $\phi_t^j \ge \delta/n$ for all $t \ge 0$ and $j = 1, \ldots, n$.

Next, we need a technical lemma regarding weighted averages of random variables with finite variance.
Lemma 3
Let Assumptions 1, 2 and 3 hold. Then, for a graph sequence $\{\mathcal{G}_k\}$, we have for any $\theta \notin \Theta^*$ and $\theta^* \in \Theta^*$,
$$\lim_{k\to\infty} \left( \frac{1}{k} \sum_{t=1}^k A_{k:t}\mathcal{L}_t^\theta + \frac{1}{k}\sum_{t=1}^k \mathbf{1}_n \phi_t' H(\theta) \right) = \mathbf{0} \quad \text{a.s.},$$
where $\mathcal{L}_t^\theta$ is the random vector with coordinates given by
$$\left[\mathcal{L}_t^\theta\right]_i = \log\frac{l^i\!\left(S_t^i\mid\theta\right)}{l^i\!\left(S_t^i\mid\theta^*\right)} \qquad \forall\, i = 1, \ldots, n,$$
while the vector $H(\theta)$ has coordinates given by $[H(\theta)]_i = d\left(f^i(\cdot)\,\|\,l^i(\cdot\mid\theta)\right) - d\left(f^i(\cdot)\,\|\,l^i(\cdot\mid\theta^*)\right)$.

Proof: Adding and subtracting $\frac{1}{k}\sum_{t=1}^k \mathbf{1}_n\phi_t' \mathcal{L}_t^\theta$ yields
$$\frac{1}{k}\sum_{t=1}^k\left(A_{k:t}\mathcal{L}_t^\theta + \mathbf{1}_n\phi_t' H(\theta)\right) = \frac{1}{k}\sum_{t=1}^k\left(A_{k:t} - \mathbf{1}_n\phi_t'\right)\mathcal{L}_t^\theta + \frac{1}{k}\sum_{t=1}^k \mathbf{1}_n\phi_t'\left(\mathcal{L}_t^\theta + H(\theta)\right). \qquad (3)$$
By Lemma 1, $\lim_{k\to\infty} A_{k:t} = \mathbf{1}_n\phi_t'$ for all $t \ge 0$. Moreover, each of the entries of $\mathcal{L}_t^\theta$ is bounded by Assumption 3.2. Thus, the first term on the right hand side of Eq. (3) goes to zero as $k \to \infty$. Regarding the second term in Eq. (3), by the definition of the KL divergence we have
$$\mathbb{E}\left[\log\frac{l^i\!\left(S_t^i\mid\theta\right)}{l^i\!\left(S_t^i\mid\theta^*\right)}\right] = \sum_{j=1}^{m_i} f^i\!\left(s_j^i\right)\log\frac{l^i\!\left(s_j^i\mid\theta\right)}{l^i\!\left(s_j^i\mid\theta^*\right)} = \sum_{j=1}^{m_i} f^i\!\left(s_j^i\right)\log\frac{l^i\!\left(s_j^i\mid\theta\right)\, f^i\!\left(s_j^i\right)}{l^i\!\left(s_j^i\mid\theta^*\right)\, f^i\!\left(s_j^i\right)} = d\left(f^i(\cdot)\,\|\,l^i(\cdot\mid\theta^*)\right) - d\left(f^i(\cdot)\,\|\,l^i(\cdot\mid\theta)\right),$$
or equivalently, $\mathbb{E}\left[\mathcal{L}_t^\theta\right] = -H(\theta)$.

Kolmogorov's strong law of large numbers states that if $\{X_t\}$ is a sequence of independent random variables with variances such that $\sum_{k=1}^\infty \frac{\mathrm{Var}(X_k)}{k^2} < \infty$, then $\frac{1}{n}\sum_{k=1}^n X_k - \frac{1}{n}\sum_{k=1}^n \mathbb{E}[X_k] \to 0$ a.s. Let $X_t = \phi_t' \mathcal{L}_t^\theta$. Then, using Assumptions 1 and 3.2, it can be seen that $\sup_{t\ge 0}\mathrm{Var}(X_t) < \infty$. The result follows by Lemma 1 and Kolmogorov's strong law of large numbers.

With Lemma 3 in place, we are ready to prove Theorem 1. The proofs of Theorem 1 and Theorem 2 make use of the following quantities: for all $i = 1, \ldots, n$ and $k \ge 0$,
$$\varphi_k^i(\theta) \triangleq \log\frac{\mu_k^i(\theta)}{\mu_k^i(\theta^*)} \quad \text{for all } \theta \in \Theta, \qquad (4)$$
defined for any $\theta^* \in \Theta^*$ (the dependence on $\theta^*$ is suppressed).

Proof (Theorem 1): Dividing both sides of Eq. (1) by $\mu_{k+1}^i(\theta^*)$, taking logarithms and using the definition of $\varphi_k^i(\theta)$, we obtain
$$\varphi_{k+1}^i(\theta) = \sum_{j=1}^n [A_k]_{ij}\varphi_k^j(\theta) + \log\frac{l^i\!\left(s_{k+1}^i\mid\theta\right)}{l^i\!\left(s_{k+1}^i\mid\theta^*\right)}.$$
Stacking up the values $\varphi_{k+1}^i(\theta)$ over the agents $i = 1, \ldots, n$ into a single vector $\boldsymbol{\varphi}_{k+1}(\theta)$, we can compactly write the preceding relations as
$$\boldsymbol{\varphi}_{k+1}(\theta) = A_k \boldsymbol{\varphi}_k(\theta) + \mathcal{L}_{k+1}^\theta, \qquad (5)$$
which implies that for all $k \ge 0$,
$$\boldsymbol{\varphi}_{k+1}(\theta) = A_{k:0}\boldsymbol{\varphi}_0(\theta) + \sum_{t=1}^k A_{k:t}\mathcal{L}_t^\theta + \mathcal{L}_{k+1}^\theta. \qquad (6)$$
Adding and subtracting $\sum_{t=1}^k \mathbf{1}_n\phi_t' H(\theta)$ in Eq. (6) gives
$$\boldsymbol{\varphi}_{k+1}(\theta) = A_{k:0}\boldsymbol{\varphi}_0(\theta) + \sum_{t=1}^k\left(A_{k:t}\mathcal{L}_t^\theta + \mathbf{1}_n\phi_t' H(\theta)\right) + \mathcal{L}_{k+1}^\theta - \sum_{t=1}^k \mathbf{1}_n\phi_t' H(\theta).$$
Using the lower bound on $\phi_t$ from Lemma 2 and the fact that $H(\theta) \ge 0$, we obtain
$$\boldsymbol{\varphi}_{k+1}(\theta) \le A_{k:0}\boldsymbol{\varphi}_0(\theta) + \sum_{t=1}^k\left(A_{k:t}\mathcal{L}_t^\theta + \mathbf{1}_n\phi_t' H(\theta)\right) + \mathcal{L}_{k+1}^\theta - \frac{\delta}{n}\,k\,\|H(\theta)\|_1\,\mathbf{1}_n.$$
Therefore,
$$\lim_{k\to\infty}\frac{1}{k}\boldsymbol{\varphi}_{k+1}(\theta) \le \lim_{k\to\infty}\frac{1}{k}A_{k:0}\boldsymbol{\varphi}_0(\theta) - \frac{\delta}{n}\|H(\theta)\|_1\,\mathbf{1}_n + \lim_{k\to\infty}\frac{1}{k}\mathcal{L}_{k+1}^\theta + \lim_{k\to\infty}\frac{1}{k}\sum_{t=1}^k\left(A_{k:t}\mathcal{L}_t^\theta + \mathbf{1}_n\phi_t' H(\theta)\right).$$
The first term on the right hand side of the preceding relation converges to zero deterministically. The third term goes to zero as well since $\mathcal{L}_t^\theta$ is bounded, and the fourth term converges to zero almost surely by Lemma 3. Consequently,
$$\lim_{k\to\infty}\frac{1}{k}\boldsymbol{\varphi}_{k+1}(\theta) \le -\frac{\delta}{n}\|H(\theta)\|_1\,\mathbf{1}_n \quad \text{a.s.} \qquad (7)$$
Now, if $\theta \notin \Theta^*$, then $\|H(\theta)\|_1 > 0$ and, thus, $\varphi_k^i(\theta) \to -\infty$ almost surely. This implies $\mu_k^i(\theta) \to 0$ almost surely.
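The proof hinges on the observation that the log-ratios $\varphi_k^i(\theta)$ obey the linear recursion in Eq. (5). The following sketch simulates the belief dynamics of Eq. (1) and checks this identity numerically; the network, signal model and hypotheses are invented for the purpose of the check.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 4, 2                      # agents, hypotheses; theta* is hypothesis 0
A = np.full((n, n), 1.0 / n)     # fixed doubly stochastic weights (illustrative)
lik = np.array([[0.3, 0.7],      # lik[p] = l(. | theta_p), shared by all agents here
                [0.8, 0.2]])
f = np.array([0.3, 0.7])         # true signal distribution, so theta_0 is the best fit
mu = np.full((n, m), 0.5)        # uniform priors

for k in range(50):
    s = rng.choice(2, size=n, p=f)                       # private signals s^i_{k+1}
    phi = np.log(mu[:, 1] / mu[:, 0])                    # varphi_k(theta_1), Eq. (4)
    # Update rule, Eq. (1), in log space for numerical stability.
    log_mu = A @ np.log(mu) + np.log(lik[:, s]).T        # (n, m)
    mu = np.exp(log_mu - log_mu.max(axis=1, keepdims=True))
    mu /= mu.sum(axis=1, keepdims=True)
    # Check the recursion of Eq. (5): varphi_{k+1} = A varphi_k + L^theta_{k+1}.
    L = np.log(lik[1, s] / lik[0, s])
    assert np.allclose(np.log(mu[:, 1] / mu[:, 0]), A @ phi + L)
print(mu[:, 1])   # beliefs on the wrong hypothesis decay towards zero
```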
IV. NON-ASYMPTOTIC RATE OF CONVERGENCE

In this section we prove Theorem 2, which states an explicit rate of convergence for the cooperative agent learning process. Before proving the theorem, we state an auxiliary lemma that provides a bound on the expectation of the random variables $\varphi_k^i(\theta)$ defined in Eq. (4).

Lemma 4
Let $\theta^* \in \Theta^*$ be arbitrary, and consider $\varphi_k^i(\theta)$ as defined in Eq. (4). Then, for any $\theta \notin \Theta^*$ we have
$$\mathbb{E}\left[\varphi_{k+1}^i(\theta)\right] \le \gamma_1 - (k+1)\gamma_2 \quad \text{for all } i \text{ and } k \ge 0,$$
where $\gamma_1$ and $\gamma_2$ are defined in Theorem 2.

Proof: Taking the expected value of Eq. (5) and using $\mathbb{E}\left[\mathcal{L}_{k+1}^\theta\right] = -H(\theta)$ gives
$$\mathbb{E}\left[\boldsymbol{\varphi}_{k+1}(\theta)\right] = A_k\,\mathbb{E}\left[\boldsymbol{\varphi}_k(\theta)\right] - H(\theta).$$
Therefore, by recursion, we can see that for all $k \ge 0$,
$$\mathbb{E}\left[\boldsymbol{\varphi}_{k+1}(\theta)\right] = A_{k:0}\boldsymbol{\varphi}_0(\theta) - \sum_{t=1}^k A_{k:t}H(\theta) - H(\theta).$$
By adding and subtracting $\sum_{t=1}^k \mathbf{1}_n\phi_t' H(\theta)$, we obtain
$$\mathbb{E}\left[\boldsymbol{\varphi}_{k+1}(\theta)\right] = A_{k:0}\boldsymbol{\varphi}_0(\theta) + \sum_{t=1}^k\left(\mathbf{1}_n\phi_t' - A_{k:t}\right)H(\theta) - \sum_{t=1}^k \mathbf{1}_n\phi_t' H(\theta) - H(\theta).$$
We may drop the last term on the right hand side, since $H(\theta) \ge 0$. Moreover, bounding the entries of the first two terms on the right hand side and using the fact that $A_{k:0}$ is a stochastic matrix, we have
$$\mathbb{E}\left[\boldsymbol{\varphi}_{k+1}(\theta)\right] \le \|\boldsymbol{\varphi}_0(\theta)\|_\infty\,\mathbf{1}_n - \sum_{t=1}^k \mathbf{1}_n\phi_t' H(\theta) + \sum_{t=1}^k \max_{1\le i,j\le n}\left|\phi_t^j - [A_{k:t}]_{ij}\right|\,\|H(\theta)\|_1\,\mathbf{1}_n.$$
Next, we use the upper bound on $\left|\phi_t^j - [A_{k:t}]_{ij}\right|$ from Lemma 1 and the lower bound on the entries of $\phi_t$ from Lemma 2, and we arrive at
$$\mathbb{E}\left[\boldsymbol{\varphi}_{k+1}(\theta)\right] \le \|\boldsymbol{\varphi}_0(\theta)\|_\infty\,\mathbf{1}_n + \sum_{t=1}^k C\lambda^{k-t}\,\|H(\theta)\|_1\,\mathbf{1}_n - k\,\frac{\delta}{n}\,\|H(\theta)\|_1\,\mathbf{1}_n,$$
and the result follows.

The proof of Theorem 2 uses McDiarmid's inequality [27], which provides bounds on the probability that the beliefs exceed a given value $\epsilon$. McDiarmid's inequality is stated below.

Theorem 3 (McDiarmid's inequality [27])
Let $\{X_t\}_{t=1}^k = (X_1, \ldots, X_k)$ be a sequence of independent random variables with $X_t \in \mathcal{X}$. If a function $g : \mathcal{X}^k \to \mathbb{R}$ has bounded differences, i.e., for all $t$,
$$\sup_{X_t \in \mathcal{X}} g(\ldots, X_t, \ldots) - \inf_{Y_t \in \mathcal{X}} g(\ldots, Y_t, \ldots) \le c_t,$$
then for any $\epsilon > 0$ and all $k \ge 1$,
$$\mathbb{P}\left(g\left(\{X_t\}_{t=1}^k\right) - \mathbb{E}\left[g\left(\{X_t\}_{t=1}^k\right)\right] \ge \epsilon\right) \le \exp\left(-\frac{2\epsilon^2}{\sum_{t=1}^k c_t^2}\right).$$
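As a sanity check on how Theorem 3 will be used, the sketch below compares the empirical tail of a bounded-differences function of i.i.d. signals (a weighted sum of log-likelihood ratios, mirroring the structure in the proof that follows) against the McDiarmid bound; the distributions and weights are illustrative, and the sample mean stands in for the exact expectation.

```python
import numpy as np

rng = np.random.default_rng(2)
k, trials, eps = 200, 20000, 30.0
alpha = 0.2                                   # lower bound on likelihoods (Assumption 3.2)
f = np.array([0.3, 0.7])                      # signal distribution
L_vals = np.log(np.array([0.8, 0.2]) / np.array([0.3, 0.7]))   # log-likelihood ratio per signal
w = np.full(k, 1.0)                           # unit weights; a row of A_{k:t} would sum to one

# g(X_1,...,X_k) = sum_t w_t * log-ratio(X_t); changing one X_t moves g by at most c_t.
c = w * 2 * np.log(1 / alpha)
signals = rng.choice(2, size=(trials, k), p=f)
g = (L_vals[signals] * w).sum(axis=1)
empirical = np.mean(g - g.mean() >= eps)      # empirical tail, with g.mean() approximating E[g]
mcdiarmid = np.exp(-2 * eps**2 / np.sum(c**2))
print(empirical, mcdiarmid)                   # the empirical tail should sit below the bound
```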
Now, we are ready to prove Theorem 2.

Proof (Theorem 2): First, we express the belief $\mu_{k+1}^i(\theta)$ in terms of the variable $\varphi_{k+1}^i(\theta)$; this will allow us to use McDiarmid's inequality to obtain the concentration bounds. By the dynamics of Eq. (1) and Assumption 3.1, since $\mu_{k+1}^i(\theta^*) \in (0,1]$ for any $\theta^* \in \Theta^*$, we have
$$\mu_{k+1}^i(\theta) \le \frac{\mu_{k+1}^i(\theta)}{\mu_{k+1}^i(\theta^*)} = \exp\left(\varphi_{k+1}^i(\theta)\right).$$
Therefore,
$$\mathbb{P}\left(\mu_{k+1}^i(\theta) \ge \exp\left(-\frac{k}{2}\gamma_2 + \gamma_1\right)\right) \le \mathbb{P}\left(\varphi_{k+1}^i(\theta) \ge -\frac{k}{2}\gamma_2 + \gamma_1\right) = \mathbb{P}\left(\varphi_{k+1}^i(\theta) - \mathbb{E}\left[\varphi_{k+1}^i(\theta)\right] \ge -\frac{k}{2}\gamma_2 + \gamma_1 - \mathbb{E}\left[\varphi_{k+1}^i(\theta)\right]\right) \le \mathbb{P}\left(\varphi_{k+1}^i(\theta) - \mathbb{E}\left[\varphi_{k+1}^i(\theta)\right] \ge \frac{k}{2}\gamma_2\right),$$
where the last inequality follows from Lemma 4.

We now view $\varphi_{k+1}^i(\theta)$ as a function of the random vectors $s_1, \ldots, s_k, s_{k+1}$ (see Eq. (6)), where $s_t = (s_t^1, \ldots, s_t^n) \in S$ for all $t$. Thus, for all $t$ with $1 \le t \le k$, we have
$$\max_{s_t \in S} \varphi_{k+1}^i(\theta) - \min_{s_t \in S} \varphi_{k+1}^i(\theta) = \max_{s_t \in S} \sum_{j=1}^n [A_{k:t}]_{ij}\left[\mathcal{L}_t^\theta\right]_j - \min_{s_t \in S} \sum_{j=1}^n [A_{k:t}]_{ij}\left[\mathcal{L}_t^\theta\right]_j = \max_{s_t \in S} \sum_{j=1}^n [A_{k:t}]_{ij}\log\frac{l^j\!\left(s_t^j\mid\theta\right)}{l^j\!\left(s_t^j\mid\theta^*\right)} - \min_{s_t \in S} \sum_{j=1}^n [A_{k:t}]_{ij}\log\frac{l^j\!\left(s_t^j\mid\theta\right)}{l^j\!\left(s_t^j\mid\theta^*\right)} \le \log\frac{1}{\alpha} + \log\frac{1}{\alpha} = 2\log\frac{1}{\alpha}.$$
Similarly, from Eq. (6) we can see that $\max_{s_{k+1}\in S}\varphi_{k+1}^i(\theta) - \min_{s_{k+1}\in S}\varphi_{k+1}^i(\theta) \le 2\log\frac{1}{\alpha}$.

It follows that $\varphi_{k+1}^i(\theta)$ has bounded differences, and by McDiarmid's inequality (Theorem 3) we obtain the following concentration inequality:
$$\mathbb{P}\left(\varphi_{k+1}^i(\theta) - \mathbb{E}\left[\varphi_{k+1}^i(\theta)\right] \ge \frac{k}{2}\gamma_2\right) \le \exp\left(-\frac{2\left(\frac{k}{2}\gamma_2\right)^2}{\sum_{t=1}^{k+1}\left(2\log\frac{1}{\alpha}\right)^2}\right) = \exp\left(-\frac{(k\gamma_2)^2}{8(k+1)\left(\log\frac{1}{\alpha}\right)^2}\right) \le \exp\left(-\frac{(k-1)\,\gamma_2^2}{8\left(\log\frac{1}{\alpha}\right)^2}\right).$$
Finally, for a given confidence level $\rho$, requiring this last bound to be at most $\rho$ for all $k \ge N(\rho)$ yields $\mathbb{P}\left(\mu_k^i(\theta) \ge \exp\left(-\frac{k}{2}\gamma_2 + \gamma_1\right)\right) \le \rho$, and the desired result follows.
V. SIMULATION RESULTS

In this section we show simulation results for a group of agents connected over a time-varying directed graph, shown in Figure 1, for some specific weighting matrices. Each agent updates its beliefs according to Eq. (1). The graph is such that the edge connecting agent 1 and agent 2 switches on and off at each time step; the edges connecting agents 2-6 change at each time step as well.
Fig. 1. Time-Varying graph with a switching external agent
Every agent $i$ receives information from a binary random variable $S_k^i : \Omega \to \{0, 1\}$ with the same probability distribution $f^i$ for all $i$. Moreover, every agent has two possible models, $\theta_1$ and $\theta_2$. Agent 1's hypotheses have likelihood functions $l^1(\cdot\mid\theta_1)$ and $l^1(\cdot\mid\theta_2)$ such that hypothesis $\theta_1$ is closer to the true distribution. On the other hand, agents 2 to 6 have uniformly distributed, observationally equivalent hypotheses for both $\theta_1$ and $\theta_2$; that is, they are not able to differentiate between the hypotheses individually. Thus, $l^i(s\mid\theta) = 0.5$ for $i \in \{2, \ldots, 6\}$, $s \in \{0, 1\}$ and $\theta \in \{\theta_1, \theta_2\}$.

Figure 2 shows the empirical mean, over 5000 Monte Carlo simulations, of the beliefs of agents 1, 4, 5 and 6. The results show that agent 1 is the fastest learning agent, since it is the one with the correct model. Nevertheless, all other agents converge to the correct parameter model as well, even though they do not have distinguishable models.
Fig. 2. Simulation results for Agents 1, 4, 5 and 6
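A compact sketch of this experiment is given below. The specific signal probabilities, agent 1's likelihood values and the exact switching pattern of the graph are not recoverable from the text, so the values used here are illustrative assumptions chosen to match the qualitative description (agent 1 informative, agents 2-6 uninformative, a switching link between agents 1 and 2).

```python
import numpy as np

rng = np.random.default_rng(3)
n, steps, runs = 6, 300, 200
f = np.array([0.3, 0.7])                       # assumed true distribution of the binary signals
lik = np.full((n, 2, 2), 0.5)                  # lik[i, p, s] = l^i(s | theta_p); agents 2-6 uninformative
lik[0] = np.array([[0.3, 0.7],                 # agent 1, theta_1 (matches f)
                   [0.8, 0.2]])                # agent 1, theta_2 (far from f)

def weights(k):
    """Row-stochastic A_k: agents 2-6 on a directed ring; the 1-2 link is active on even steps."""
    A = np.eye(n)
    for i in range(1, n):
        j = 1 + i % (n - 1)                    # next agent on the ring over indices 1..5
        A[i, i], A[i, j] = 0.5, A[i, j] + 0.5
    if k % 2 == 0:                             # agents 1 and 2 exchange information on even steps
        A[0, 0] = A[0, 1] = 0.5
        A[1, :] = 0.0
        A[1, [0, 1, 2]] = 1.0 / 3.0
    return A

avg = np.zeros(n)
for _ in range(runs):
    mu = np.full((n, 2), 0.5)                  # uniform priors
    for k in range(steps):
        s = rng.choice(2, size=n, p=f)         # private signals s^i_{k+1}
        log_mu = weights(k) @ np.log(mu) + np.log(lik[np.arange(n), :, s])
        mu = np.exp(log_mu - log_mu.max(axis=1, keepdims=True))
        mu /= mu.sum(axis=1, keepdims=True)
    avg += mu[:, 0] / runs
print(avg)                                     # mean belief in theta_1: all agents approach 1, agent 1 fastest
```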
VI. CONCLUSIONS AND FUTURE WORK
We have studied the consistency and the rate of convergence of a distributed non-Bayesian learning system. We have shown almost sure consistency and have provided bounds on the global exponential rate of convergence. The novelty of our results lies in the establishment of convergence rate estimates that are non-asymptotic, geometric, and explicit, in the sense that the bounds capture the quantities characterizing the graph sequence properties as well as the agents' learning capabilities. These results were derived for general time-varying directed graphs.

Our work suggests a number of open questions. It is natural to attempt extensions to continuous hypothesis spaces, to larger numbers of agents, to larger numbers of hypotheses, etc. The results can also be extended to tracking problems where the distribution of the observations changes with time. When the number of hypotheses is large, ideas from social sampling can be incorporated in this framework [28]. Moreover, the possibility of corrupted measurements or conflicting models between the agents is also of interest, especially in the setting of social networks.

REFERENCES
[1] R. J. Aumann, "Agreeing to disagree," The Annals of Statistics, pp. 1236–1239, 1976.
[2] V. Borkar and P. P. Varaiya, "Asymptotic agreement in distributed estimation," IEEE Transactions on Automatic Control, vol. 27, no. 3, pp. 650–655, 1982.
[3] J. N. Tsitsiklis and M. Athans, "Convergence and asymptotic agreement in distributed decision problems," IEEE Transactions on Automatic Control, vol. 29, no. 1, pp. 42–50, 1984.
[4] D. Acemoglu, M. A. Dahleh, I. Lobel, and A. Ozdaglar, "Bayesian learning in social networks," The Review of Economic Studies, vol. 78, no. 4, pp. 1201–1236, 2011.
[5] M. Mueller-Frank, "A general framework for rational learning in social networks," Theoretical Economics, vol. 8, no. 1, pp. 1–40, 2013.
[6] D. Gale and S. Kariv, "Bayesian learning in social networks," Games and Economic Behavior, vol. 45, no. 2, pp. 329–346, 2003.
[7] L. G. Epstein, J. Noor, and A. Sandroni, "Non-Bayesian learning," The B.E. Journal of Theoretical Economics, vol. 10, no. 1, 2010.
[8] A. Jadbabaie, P. Molavi, A. Sandroni, and A. Tahbaz-Salehi, "Non-Bayesian social learning," Games and Economic Behavior, vol. 76, no. 1, pp. 210–225, 2012.
[9] K. Rahnama Rad and A. Tahbaz-Salehi, "Distributed parameter estimation in networks," in IEEE Conference on Decision and Control, 2010, pp. 5050–5055.
[10] M. Alanyali, S. Venkatesh, O. Savas, and S. Aeron, "Distributed Bayesian hypothesis testing in sensor networks," in American Control Conference, vol. 6, 2004, pp. 5369–5374.
[11] V. Saligrama, M. Alanyali, and O. Savas, "Distributed detection in sensor networks with packet losses and finite capacity links," IEEE Transactions on Signal Processing, vol. 54, no. 11, pp. 4118–4132, 2006.
[12] R. Rahman, M. Alanyali, and V. Saligrama, "Distributed tracking in multihop sensor networks with communication delays," IEEE Transactions on Signal Processing, vol. 55, no. 9, pp. 4656–4668, 2007.
[13] A. Jadbabaie, P. Molavi, and A. Tahbaz-Salehi, "Information heterogeneity and the speed of learning in social networks," Columbia Business School Research Paper, no. 13-28, 2013.
[14] S. Bandyopadhyay and S.-J. Chung, "Distributed estimation using Bayesian consensus filtering," in American Control Conference (ACC), 2014, pp. 634–641.
[15] Q. Liu, A. Fang, L. Wang, and X. Wang, "Social learning with time-varying weights," Journal of Systems Science and Complexity, vol. 27, no. 3, pp. 581–593, 2014.
[16] S. Shahrampour and A. Jadbabaie, "Exponentially fast parameter estimation in networks using distributed dual averaging," in IEEE Conference on Decision and Control, 2013, pp. 6196–6201.
[17] A. Lalitha, T. Javidi, and A. Sarwate, "Social learning and distributed hypothesis testing," arXiv preprint arXiv:1410.4307, 2015.
[18] S. Shahrampour, A. Rakhlin, and A. Jadbabaie, "Distributed detection: Finite-time analysis and impact of network topology," arXiv preprint arXiv:1409.8606, 2014.
[19] D. P. Bertsekas and J. N. Tsitsiklis, Parallel and Distributed Computation: Numerical Methods. Prentice-Hall, Inc., 1989.
[20] L. Moreau, "Stability of multiagent systems with time-dependent communication links," IEEE Transactions on Automatic Control, vol. 50, no. 2, pp. 169–182, 2005.
[21] A. Jadbabaie, J. Lin, and A. S. Morse, "Coordination of groups of mobile autonomous agents using nearest neighbor rules," IEEE Transactions on Automatic Control, vol. 48, no. 6, pp. 988–1001, 2003.
[22] C. Genest, J. V. Zidek et al., "Combining probability distributions: A critique and an annotated bibliography," Statistical Science, vol. 1, no. 1, pp. 114–135, 1986.
[23] J. L. Doob, "Application of the theory of martingales," Le calcul des probabilités et ses applications, pp. 23–27, 1949.
[24] S. Ghosal, "A review of consistency and convergence of posterior distribution," in Varanashi Symposium in Bayesian Inference, Banaras Hindu University, 1997.
[25] A. Nedić and A. Olshevsky, "Distributed optimization over time-varying directed graphs," IEEE Transactions on Automatic Control, vol. 60, no. 3, pp. 601–615, 2015.
[26] A. Nedić, A. Olshevsky, A. Ozdaglar, and J. N. Tsitsiklis, "On distributed averaging algorithms and quantization effects," IEEE Transactions on Automatic Control, vol. 54, no. 11, pp. 2506–2517, 2009.
[27] C. McDiarmid, "On the method of bounded differences," Surveys in Combinatorics, vol. 141, no. 1, pp. 148–188, 1989.
[28] A. Sarwate and T. Javidi, "Distributed learning of distributions via social sampling," IEEE Transactions on Automatic Control, vol. 60, no. 1, pp. 34–45, 2015.