Distributed Online Optimization in Dynamic Environments Using Mirror Descent
Shahin Shahrampour and Ali Jadbabaie
Abstract: This work addresses decentralized online optimization in non-stationary environments. A network of agents aims to track the minimizer of a global time-varying convex function. The minimizer evolves according to known dynamics corrupted by an unknown, unstructured noise. At each time, the global function can be cast as a sum of a finite number of local functions, each of which is assigned to one agent in the network. Moreover, the local functions become available to agents sequentially, and agents have no prior knowledge of the future cost functions. Therefore, agents must communicate with each other to build an online approximation of the global function. We propose a decentralized variation of the celebrated Mirror Descent, developed by Nemirovski and Yudin. Using the notion of Bregman divergence in lieu of Euclidean distance for projection, Mirror Descent has been shown to be a powerful tool in large-scale optimization. Our algorithm builds on Mirror Descent, while ensuring that agents perform a consensus step to follow the global function and take into account the dynamics of the global minimizer. To measure the performance of the proposed online algorithm, we compare it to its offline counterpart, where the global functions are available a priori. The gap between the two is called dynamic regret. We establish a regret bound that scales inversely in the spectral gap of the network and, more notably, reflects the deviation of the minimizer sequence from the given dynamics. We then show that our results subsume a number of results in distributed optimization. We demonstrate the application of our method to decentralized tracking of dynamic parameters and verify the results via numerical experiments.
I. INTRODUCTION
Distributed convex optimization has received a great deal of interest in science and engineering. Classical engineering problems such as decentralized tracking, estimation, and detection are optimization problems in essence [1]–[7], and early studies on parallel and distributed computation date back three decades to the seminal works of [8]–[10]. In any decentralized scheme, the objective is to perform a global task assigned to a number of agents in a network. Each individual agent has limited resources or partial information about the task. As a result, agents engage in local interactions to complement their insufficient knowledge and accomplish the global task. The use of decentralized techniques has increased rapidly since they impose a low computational burden on agents and are robust to node failures, as opposed to centralized algorithms, which rely heavily on a single information processing unit.
The authors gratefully acknowledge the support of the ONR BRC Program on Decentralized, Online Optimization. Shahin Shahrampour is with the Department of Electrical Engineering at Harvard University, Cambridge, MA 02138 USA. Ali Jadbabaie is with the Institute for Data, Systems, and Society at Massachusetts Institute of Technology, Cambridge, MA 02139 USA.
In distributed optimization, the main task is often minimization of a global convex function, written as the sum of local convex functions, where each agent holds a private copy of one specific local function. Then, based on a communication protocol, agents exchange local gradients to minimize the global cost function.
Decentralized optimization is a mature discipline in addressing problems with time-invariant cost functions [11]–[17]. However, in many real-world applications, cost functions vary over time. Consider, for instance, the problem of tracking a moving target, where the goal is to follow the position, velocity, and acceleration of the target. One should tackle the problem by minimizing a loss function defined with respect to these parameters; however, since they are time-variant, the cost function becomes dynamic.
When the problem framework is dynamic in nature, there are two key challenges one needs to consider:
1) Agents often observe the local cost functions in an online or sequential fashion, i.e., the local functions are revealed only after agents make their instantaneous decision at each round, and agents are unaware of future cost functions. In the last ten years, this problem (in the centralized domain) has been the main focus of the online optimization field in the machine learning community [18].
2) Any online algorithm should mimic the performance of its offline counterpart, and the gap between the two is called regret.
The most stringent benchmark is an offline problem that aims to track the minimizer of the global cost function over time, which brings forward the notion of dynamic regret [19]. It is well known that this benchmark makes the problem intractable in the worst case. However, as studied in centralized online optimization [19]–[22], the hardness of the problem can be characterized via a complexity measure that captures the variation in the minimizer sequence.
In this paper, we aim to address the above directions simultaneously. We consider an online optimization problem, where the global cost is realized sequentially, and the objective is to track the minimizer of the function. The dynamics of the minimizer are common knowledge, but the time-varying minimizer sequence can deviate from these dynamics due to an unstructured noise. At each time step, the global function can be cast as a sum of local functions, each of which is associated with one agent. Therefore, agents need to exchange information to solve the global problem.
Our multi-agent tracking setup is reminiscent of distributed Kalman filtering [23]. However, there are fundamental distinctions in our approach: (i) We do not assume that the minimizer sequence is corrupted with a Gaussian noise. Nor do we assume that this noise has a statistical distribution. Instead, we consider an adversarial-noise model with unknown structure. (ii) Agents' observations are not necessarily linear; in fact, the observations are local gradients, which are potentially non-linear. Furthermore, our focus is on finite-horizon analysis rather than asymptotic results.
We also note that our setup differs from distributed particle filtering [24] as it is online, and agents receive only one observation per iteration. Moreover, we reiterate that the noise does not have a specific statistical distribution.
For this setup, we propose a decentralized version of the well-known Mirror Descent, developed by Nemirovski and Yudin [25].
Using the notion of Bregman divergence in lieu of Euclidean distance for projection, Mirror Descent has been shown to be a powerful tool in large-scale optimization. Our algorithm consists of three interleaved updates: (i) each agent follows the local gradient while staying close to previous estimates in the local neighborhood; (ii) agents take into account the dynamics of the minimizer sequence; (iii) agents average their estimates in their local neighborhood in a consensus step.
Motivated by centralized online optimization, we use the notion of dynamic regret to characterize the difference between our online decentralized algorithm and its offline centralized version. We establish a regret bound that scales inversely in the spectral gap of the network and, more notably, reflects the deviation of the minimizer sequence from the given dynamics. That is, it highlights the impact of the arbitrary noise driving the dynamical model of the minimizer. We further consider stochastic optimization, where agents observe only noisy versions of their local gradients, and we prove that in this case our regret bound holds in expectation.
Our main theoretical contribution is a comprehensive analysis of networked online optimization in the dynamic setting. Our results subsume two important classes of decentralized optimization in the literature: (i) decentralized optimization of time-invariant objectives, and (ii) decentralized optimization of time-variant objectives over fixed variables. This generalization is a consequence of allowing dynamics in both the objective and the variable.
We finally show that our algorithm is applicable to decentralized tracking of dynamic parameters. In fact, we show that the problem can be posed as the minimization of the square loss using Euclidean distance as the Bregman divergence. We then empirically verify that the tracking quality depends on how well the parameter follows its given dynamics.
A. Related Literature
This work is related to two distinct bodies of literature: (i) decentralized optimization, and (ii) online optimization in dynamic environments. Our goal in this work is to bridge the two and provide a general framework for decentralized online optimization in non-stationary environments. Below, we provide an overview of the works related to both scenarios:
(Footnote: Algorithms relying on Gradient Descent minimize Euclidean distance in the projection step. Mirror Descent generalizes the projection step using the concept of Bregman divergence [25], [26]. Euclidean distance is a special Bregman divergence that reduces Mirror Descent to Gradient Descent. Kullback-Leibler divergence is another well-known type of Bregman divergence; see e.g. [27] for more details on Bregman divergence.)
Decentralized Optimization:
There are a host of results in the literature on decentralized optimization for time-invariant functions. The seminal work of [11] studies distributed subgradient methods over time-varying networks and provides convergence analysis. The effect of stochastic gradients is then considered in [13]. Shi et al. [17] prove fast convergence rates for Lipschitz-differentiable objectives by adding a correction term to the decentralized gradient descent algorithm. Of particular relevance to this work is [28], where decentralized mirror descent is developed for the case in which agents receive the gradients with a delay. More recently, the application of mirror descent to saddle point problems is studied in [29]. Moreover, Rabbat [30] proposes a decentralized mirror descent for stochastic composite optimization problems and provides guarantees for strongly convex regularizers. In [31], Raginsky and Bouvrie investigate distributed stochastic mirror descent in the continuous-time domain. On the other hand, Duchi et al. [14] study dual averaging for distributed optimization and provide a comprehensive analysis of the impact of network parameters on the problem. The extension of dual averaging to online distributed optimization is considered in [32]. Mateos-Núñez and Cortés [33] consider online optimization using subgradient descent of local functions, where the graph structure is time-varying. In [34], a decentralized variant of Nesterov's primal-dual algorithm is proposed for online optimization. Finally, in [35], distributed online optimization is studied for strongly convex objective functions over time-varying networks.
Online Optimization in Dynamic Environments:
In online optimization, the benchmark can be defined abstractly in terms of a time-varying sequence, a particular case of which is the minimizer sequence of a time-varying cost function. Several versions of the problem have been studied in the machine learning literature in the centralized case. In [19], Zinkevich develops the celebrated online gradient descent and considers its extension to time-varying sequences. The authors of [20] generalize this idea to study time-varying sequences following given dynamics. Besbes et al. [21] restrict their attention to the minima sequence and introduce a complexity measure for the problem in terms of variation in cost functions. For the same problem, the authors of [22] develop an adaptive algorithm whose regret bound is expressed in terms of the variation of both the functions and the minima sequence, while in [36] an improved rate is derived for strongly convex objectives. Moreover, online dynamic optimization with linear objectives is discussed in [37]. Fazlyab et al. [38] consider interior point methods and provide continuous-time analysis for the problem. Finally, Yang et al. [39] provide optimal bounds for the case in which the minimizer belongs to the feasible set.
B. Organization
The paper is organized as follows. The notation, problem formulation, assumptions, and algorithm are described in Section II. In Section III, we provide our theoretical results characterizing the behavior of the dynamic regret. Section IV is dedicated to the application of our method to decentralized tracking of dynamic parameters. Section V concludes, and the proofs are given in Section VI (Appendix).
II. PROBLEM FORMULATION AND ALGORITHM
Notation:
We use the following notation in the exposition of our results:
$[n]$ : The set $\{1, 2, \ldots, n\}$ for any integer $n$
$x^\top$ : Transpose of the vector $x$
$x(k)$ : The $k$-th element of vector $x$
$I_n$ : Identity matrix of size $n$
$\Delta_d$ : The $d$-dimensional probability simplex
$\langle \cdot, \cdot \rangle$ : Standard inner product operator
$\mathbb{E}[\cdot]$ : Expectation operator
$\|\cdot\|_p$ : $p$-norm operator
$\|\cdot\|_*$ : The dual norm of $\|\cdot\|$
$\lambda_i(W)$ : The $i$-th largest eigenvalue of matrix $W$
$\sigma_i(W)$ : The $i$-th largest singular value of matrix $W$
Throughout the paper, all vectors are in column format.
A. Decentralized Optimization in Dynamic Environments
In this work, we consider an optimization problem involving a global convex function. We let $\mathcal{X}$ be a convex set and represent the global function by $f_t : \mathcal{X} \to \mathbb{R}$ at time $t$. The global function is time-variant, and the goal is to track the minimizer of $f_t(\cdot)$, denoted by $x_t^\star$. We address a finite-time problem whose offline and centralized version can be posed as follows:
$$\underset{x_1, \ldots, x_T}{\text{minimize}} \ \sum_{t=1}^{T} f_t(x_t) \quad \text{subject to} \quad x_t \in \mathcal{X}, \ t \in [T]. \tag{1}$$
However, we want to solve the problem in an online and decentralized fashion. In particular, the global function at each time $t$ can be written as the sum of $n$ local functions as
$$f_t(x) := \frac{1}{n} \sum_{i=1}^{n} f_{i,t}(x), \tag{2}$$
where $f_{i,t} : \mathcal{X} \to \mathbb{R}$ is a local convex function on $\mathcal{X}$ for all $i \in [n]$. We have a network of $n$ agents facing two challenges in solving problem (1): (i) agent $j \in [n]$ receives private information only about $f_{j,t}(\cdot)$ and does not have access to the global function $f_t(\cdot)$, which is common to decentralized schemes; (ii) the functions are revealed to agents sequentially along the time horizon, i.e., at any time instance $s$, agent $j$ has observed $f_{j,t}(\cdot)$ for $t < s$, whereas the agent does not know $f_{j,t}(\cdot)$ for $s \le t \le T$, which is common to online settings.
The agents can exchange information with one another, and their relationship is encoded via an undirected graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where $\mathcal{V} = [n]$ denotes the set of nodes (agents), and $\mathcal{E}$ is the set of edges (links between agents). Each agent $i$ assigns a positive weight $[W]_{ij}$ to the information received from agent $j \neq i$. Hence, the set of neighbors of agent $i$ is defined as $\mathcal{N}_i := \{ j : [W]_{ij} > 0 \}$.
Note that our framework subsumes two important classes of decentralized optimization in the literature:
1) Existing methods often consider time-invariant objectives (see e.g. [11], [14], [28]).
This is simply the special case where $f_t(x) = f(x)$ and $x_t = x$ in (1).
2) Online algorithms deal with time-varying functions, but often the network's objective is to minimize the temporal average of $\{f_t(x)\}_{t=1}^{T}$ over a fixed variable $x$ (see e.g. [32], [33]). This can be captured by our setup when $x_t = x$ in (1).
To exhibit the online nature of the problem, the latter class above is usually reformulated via a popular performance metric called regret. Since in that setup $x_t = x$ for $t \in [T]$, denoting $x^\star := \operatorname{argmin}_{x \in \mathcal{X}} \sum_{t=1}^{T} f_t(x)$, the optimal value of problem (1) becomes $\sum_{t=1}^{T} f_t(x^\star)$. Then, the goal of an online algorithm is to mimic its offline version by minimizing the regret defined as follows:
$$\mathbf{Reg}^{s}_{T} = \frac{1}{n} \sum_{i=1}^{n} \sum_{t=1}^{T} f_t(x_{i,t}) - \sum_{t=1}^{T} f_t(x^\star), \tag{3}$$
where $x_{i,t}$ is the estimate of agent $i$ for $x^\star$ at time $t$. Moreover, the superscript "s" reiterates the fact that the benchmark is the minimum of the sum $\sum_{t=1}^{T} f_t(x)$ over a static or fixed comparator variable $x$ that resides in the set $\mathcal{X}$. In this setup, a successful algorithm incurs a sub-linear regret, which asymptotically closes the gap between the online algorithm and the offline algorithm (when normalized by $T$).
On the contrary, the focal point of this paper is to study the scenario where functions and comparator variables evolve simultaneously, i.e., the variables $\{x_t\}_{t=1}^{T}$ are not constrained to be fixed in (1). Let $x_t^\star := \operatorname{argmin}_{x \in \mathcal{X}} f_t(x)$ be the minimizer of the global function at time $t$. Then, the optimal value of problem (1) is simply $\sum_{t=1}^{T} f_t(x_t^\star)$. Therefore, to capture the online nature of problem (1), we reformulate it using the notion of dynamic regret as
$$\mathbf{Reg}^{d}_{T} = \frac{1}{n} \sum_{i=1}^{n} \sum_{t=1}^{T} f_t(x_{i,t}) - \sum_{t=1}^{T} f_t(x_t^\star), \tag{4}$$
where $x_{i,t}$ is the estimate of agent $i$ for $x_t^\star$ at time $t$.
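As an illustration of the difference between (3) and (4), the following minimal sketch evaluates both regrets for a given table of estimates. It assumes scalar quadratic local losses $f_{i,t}(x) = \frac{1}{2}(x - c_{i,t})^2$ with $\mathcal{X} = \mathbb{R}$, a choice made only for this example; the function names and data layout are hypothetical and are not taken from the paper.

```python
# Illustrative sketch: the quadratic local losses f_{i,t}(x) = 0.5 * (x - c_{i,t})**2
# on X = R are an assumption made for this example only.

def global_loss(x, targets_t):
    """f_t(x) = (1/n) * sum_i f_{i,t}(x), as in (2), for the assumed quadratic losses."""
    return sum(0.5 * (x - c) ** 2 for c in targets_t) / len(targets_t)


def online_cost(estimates, targets):
    """(1/n) * sum_i sum_t f_t(x_{i,t}); estimates[t][i] holds x_{i,t}."""
    n = len(estimates[0])
    total = sum(global_loss(x, c_t)
                for x_t, c_t in zip(estimates, targets) for x in x_t)
    return total / n


def dynamic_regret(estimates, targets):
    """Eq. (4): the benchmark uses the per-round minimizer x*_t (here the mean of c_{i,t})."""
    offline = sum(global_loss(sum(c_t) / len(c_t), c_t) for c_t in targets)
    return online_cost(estimates, targets) - offline


def static_regret(estimates, targets):
    """Eq. (3): the benchmark uses one fixed x* (here the mean of all c_{i,t})."""
    flat = [c for c_t in targets for c in c_t]
    x_star = sum(flat) / len(flat)
    offline = sum(global_loss(x_star, c_t) for c_t in targets)
    return online_cost(estimates, targets) - offline
```

For instance, with two agents, `targets = [[0, 2], [2, 4], [4, 6]]` (so the per-round minimizers are 1, 3, 5) and lagging estimates `[[0, 0], [1, 1], [3, 3]]`, the dynamic regret is 4.5 while the static regret is only 0.5, which shows how much more demanding the dynamic benchmark is on a drifting minimizer.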
The goal is to minimize the dynamic regret, which measures the gap between the online algorithm and its offline version. The superscript "d" indicates that the benchmark is the sum of minima $\sum_{t=1}^{T} f_t(x_t^\star)$ characterized by dynamic variables $\{x_t^\star\}_{t=1}^{T}$ that lie in the set $\mathcal{X}$.
It is well known that the more stringent benchmark in the dynamic setup makes the problem intractable in the worst case, i.e., achieving a sub-linear regret could be impossible. However, as studied in centralized online optimization [20]–[22], we would like to characterize the hardness of the problem via a complexity measure that captures the pattern of the minimizer sequence $\{x_t^\star\}_{t=1}^{T}$. More specifically, assuming that the dynamics $A$ are common knowledge in the network, and
$$x_{t+1}^\star = A x_t^\star + v_t, \tag{5}$$
we want to prove a regret bound in terms of
$$C_T := \sum_{t=1}^{T} \left\| x_{t+1}^\star - A x_t^\star \right\| = \sum_{t=1}^{T} \| v_t \|, \tag{6}$$
which represents the deviation of the minimizer sequence from the dynamics $A$. Note that generalizing the results to the time-variant case is straightforward, i.e., when $A$ is replaced by $A_t$ in (5).
The problem setup (1) coupled with the dynamics given in (5) is reminiscent of distributed Kalman filtering [23]. However, there are fundamental distinctions here: (i) The mismatch noise $v_t$ is neither Gaussian nor of known statistical distribution. It can be thought of as an adversarial noise with unknown structure, which represents the deviation from the dynamics. (ii) Agents' observations are not necessarily linear; in fact, the observations are local gradients of $\{f_{i,t}(\cdot)\}_{t=1}^{T}$ and are non-linear when the objective is not quadratic.
Furthermore, another implicit distinction in this work is our focus on finite-time analysis rather than asymptotic results.
We note that our framework also differs from distributed particle filtering [24] since agents receive only one observation per iteration, and the mismatch noise $v_t$ has no structure or distribution.
With that in mind, to solve the online consensus optimization (4), we propose to decentralize the Mirror Descent algorithm [25] and to analyze it in a dynamic framework. The appealing feature of Mirror Descent is that it generalizes the projection step using Bregman divergence in lieu of Euclidean distance, which makes the algorithm applicable to a wide range of problems. Before defining Bregman divergence and elaborating on the algorithm, we start by stating a couple of standard assumptions in the context of decentralized optimization.
Assumption 1.
For any $i \in [n]$, the function $f_{i,t}(\cdot)$ is Lipschitz continuous on $\mathcal{X}$ with a uniform constant $L$. That is,
$$| f_{i,t}(x) - f_{i,t}(y) | \le L \| x - y \|,$$
for any $x, y \in \mathcal{X}$. This further implies that the gradient of $f_{i,t}(\cdot)$, denoted by $\nabla f_{i,t}(\cdot)$, is uniformly bounded on $\mathcal{X}$ by the constant $L$, i.e., we have $\| \nabla f_{i,t}(\cdot) \|_* \le L$.
Assumption 2.
The network is connected, i.e., there exists a path from any agent $i \in [n]$ to any agent $j \in [n]$. Also, the matrix $W$ is doubly stochastic with positive diagonal. That is,
$$\sum_{i=1}^{n} [W]_{ij} = \sum_{j=1}^{n} [W]_{ij} = 1.$$
The connectivity constraint in Assumption 2 guarantees the information flow in the network. It implies uniqueness of $\lambda_1(W) = 1$ and warrants that the other eigenvalues of $W$ are strictly less than one in magnitude [40].
(Footnote: In online learning, the focus is not on the distribution of data. Instead, data is thought to be generated arbitrarily, and its effect is observed through the loss functions [18].)
(Footnote: The relationship between Lipschitz continuity and bounded gradients in Assumption 1 is standard; see e.g. Lemma 2.6 in [18] for more details.)
(Footnote: For the sake of simplicity, we assume that the topology is time-invariant, and $W$ is fixed. The extension of the problem to time-varying topology is straightforward, as previously investigated in the literature; see e.g. [11], [14], [28].)
B. Decentralized Online Mirror Descent
The development of Mirror Descent relies on the Bregman divergence outlined in this section. Consider a convex set $\mathcal{X}$ in a Banach space $\mathcal{B}$, and let $\mathcal{R} : \mathcal{B} \to \mathbb{R}$ denote a 1-strongly convex function on $\mathcal{X}$ with respect to a norm $\|\cdot\|$. That is,
$$\mathcal{R}(x) \ge \mathcal{R}(y) + \langle \nabla \mathcal{R}(y), x - y \rangle + \frac{1}{2} \| x - y \|^2,$$
for any $x, y \in \mathcal{X}$. Then, the Bregman divergence $\mathcal{D}_{\mathcal{R}}(\cdot, \cdot)$ with respect to the function $\mathcal{R}(\cdot)$ is defined as follows:
$$\mathcal{D}_{\mathcal{R}}(x, y) := \mathcal{R}(x) - \mathcal{R}(y) - \langle x - y, \nabla \mathcal{R}(y) \rangle.$$
Combining the two relations above yields an important property of the Bregman divergence: for any $x, y \in \mathcal{X}$ we get
$$\mathcal{D}_{\mathcal{R}}(x, y) \ge \frac{1}{2} \| x - y \|^2, \tag{7}$$
due to the strong convexity of $\mathcal{R}(\cdot)$. Two famous examples of Bregman divergence are the Euclidean distance and the Kullback-Leibler (KL) divergence, generated from $\mathcal{R}(x) = \frac{1}{2} \| x \|_2^2$ and $\mathcal{R}(x) = \sum_{i=1}^{d} x(i) \log x(i) - x(i)$, respectively.
Assumption 3.
Let $x$ and $\{y_i\}_{i=1}^{n}$ be vectors in $\mathbb{R}^d$. The Bregman divergence satisfies the separate convexity in the following sense:
$$\mathcal{D}_{\mathcal{R}}\Big(x, \sum_{i=1}^{n} \alpha(i) y_i\Big) \le \sum_{i=1}^{n} \alpha(i) \mathcal{D}_{\mathcal{R}}(x, y_i),$$
where $\alpha \in \Delta_n$ is on the $n$-dimensional simplex.
The assumption is satisfied for commonly used cases of Bregman divergence. For instance, the Euclidean distance evidently respects the condition. The KL divergence also satisfies the constraint, and we refer the reader to Theorem 6.4 in [27] for the proof.
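The definitions above can be checked numerically on small vectors. The sketch below is illustrative only: it implements the generic Bregman divergence from its definition, specializes it to the two generating functions mentioned above, and spot-checks separate convexity (Assumption 3) at one point. The helper names are hypothetical, and a numerical check at a single point is of course not a proof.

```python
import math

def bregman(R, grad_R, x, y):
    """D_R(x, y) = R(x) - R(y) - <x - y, grad R(y)>, straight from the definition."""
    g = grad_R(y)
    return R(x) - R(y) - sum((a - b) * gi for a, b, gi in zip(x, y, g))


# Euclidean case: R(x) = 0.5 * ||x||_2^2, so D_R(x, y) = 0.5 * ||x - y||_2^2.
def R_euclid(x):
    return 0.5 * sum(a * a for a in x)

def grad_euclid(x):
    return list(x)


# Entropy case: R(x) = sum_i x(i) log x(i) - x(i); requires strictly positive entries.
def R_entropy(x):
    return sum(a * math.log(a) - a for a in x)

def grad_entropy(x):
    return [math.log(a) for a in x]


def kl_closed_form(x, y):
    """Closed form of the entropy-generated divergence: sum_i x log(x/y) - x + y."""
    return sum(a * math.log(a / b) - a + b for a, b in zip(x, y))


def mix(alpha, ys):
    """Convex combination sum_i alpha(i) * y_i of a list of vectors."""
    return [sum(a * y[k] for a, y in zip(alpha, ys)) for k in range(len(ys[0]))]
```

A typical use is to compare `bregman(R_entropy, grad_entropy, x, mix(alpha, ys))` against the weighted sum of `bregman(R_entropy, grad_entropy, x, y_i)` for points on the simplex; the former should not exceed the latter when Assumption 3 holds.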
Assumption 4.
The Bregman divergence satisfies a Lipschitz condition of the form
$$| \mathcal{D}_{\mathcal{R}}(x, z) - \mathcal{D}_{\mathcal{R}}(y, z) | \le K \| x - y \|,$$
for all $x, y, z \in \mathcal{X}$.
When the function $\mathcal{R}$ is Lipschitz on $\mathcal{X}$, the Lipschitz condition on the Bregman divergence is automatically satisfied. Again, for the Euclidean distance the assumption evidently holds. In the particular case of KL divergence, the condition can be achieved by mixing in a uniform distribution to avoid the boundary. More specifically, consider $\mathcal{R}(x) = \sum_{i=1}^{d} x(i) \log x(i) - x(i)$, for which $| \nabla \mathcal{R}(x) | = | \sum_{i=1}^{d} \log x(i) | \le d \log T$ as long as $x \in \{ \mu : \sum_{i=1}^{d} \mu(i) = 1; \ \mu(i) \ge \frac{1}{T}, \ \forall i \in [d] \}$. Therefore, in this case the constant $K$ is of $O(\log T)$ (see e.g. [22] for more comments on the assumption).
We are now ready to propose a three-step algorithm to solve the optimization problem formulated in terms of dynamic regret in (4). Let us define $\nabla_{i,t} := \nabla f_{i,t}(x_{i,t})$ as the shorthand for the local gradients. Noting the dynamic framework, we develop the decentralized online mirror descent via the following updates:
$$\hat{x}_{i,t+1} = \operatorname{argmin}_{x \in \mathcal{X}} \big\{ \eta_t \langle x, \nabla_{i,t} \rangle + \mathcal{D}_{\mathcal{R}}(x, y_{i,t}) \big\}, \tag{8a}$$
$$x_{i,t} = A \hat{x}_{i,t}, \quad \text{and} \quad y_{i,t} = \sum_{j=1}^{n} [W]_{ij} x_{j,t}, \tag{8b}$$
where $\{\eta_t\}_{t=1}^{T}$ is the step-size sequence, and $A \in \mathbb{R}^{d \times d}$ is the given dynamics in (5), which is common knowledge. Recall that $x_{i,t} \in \mathbb{R}^d$ represents the estimate of agent $i$ for the global minimizer $x_t^\star$ at time $t$. The step-size sequence is non-increasing and positive. Our proposed methodology can also be recognized as the decentralized variant of the Dynamic Mirror Descent algorithm in [20], though we restrict our attention only to linear dynamics.
The update (8a) allows the algorithm to follow the private gradient while staying close to the previous estimates in the local neighborhood. This closeness is achieved in the sense of minimizing the Bregman divergence.
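To make the updates (8a)-(8b) concrete, here is a minimal sketch of one round in the Euclidean special case $\mathcal{R}(x) = \frac{1}{2}\|x\|_2^2$, where (8a) reduces to a gradient step from $y_{i,t}$ followed by a Euclidean projection onto $\mathcal{X}$. The function names, the list-based data layout, and the `project` argument (identity by default, i.e. $\mathcal{X} = \mathbb{R}^d$) are assumptions made for this illustration and do not come from the paper.

```python
def matvec(A, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * b for a, b in zip(row, v)) for row in A]


def domd_round(x_hat, grad, W, A, eta, project=lambda v: v):
    """One round of (8a)-(8b), assuming the Euclidean Bregman divergence.

    x_hat   : list of n vectors, the current x-hat_{i,t}
    grad    : hypothetical callback grad(i, x) returning the local gradient of f_{i,t} at x
    W       : n x n doubly stochastic weight matrix (list of rows)
    A       : d x d dynamics matrix (list of rows)
    eta     : step size eta_t
    project : Euclidean projection onto X (identity means X = R^d)
    Returns (x, x_hat_next), i.e. the estimates x_{i,t} and the new x-hat_{i,t+1}.
    """
    n = len(x_hat)
    # First update in (8b): propagate each estimate through the known dynamics A.
    x = [matvec(A, v) for v in x_hat]
    d = len(x[0])
    # Second update in (8b): consensus step y_{i,t} = sum_j [W]_ij x_{j,t}.
    y = [[sum(W[i][j] * x[j][k] for j in range(n)) for k in range(d)]
         for i in range(n)]
    # Update (8a): with R(x) = 0.5 * ||x||^2 the argmin is the projection of
    # y_{i,t} - eta * grad_{i,t}, where grad_{i,t} is evaluated at x_{i,t}.
    x_hat_next = []
    for i in range(n):
        g = grad(i, x[i])
        x_hat_next.append(project([yk - eta * gk for yk, gk in zip(y[i], g)]))
    return x, x_hat_next
```

For a non-Euclidean choice of $\mathcal{R}$ (for example the entropy on the simplex), the last step would be replaced by the corresponding mirror update; that case is not covered by this sketch.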
On the other hand, the first update in (8b) takes into account the potential dynamics that the minimizer sequence follows, and the second update in (8b) is the consensus term averaging the estimates in the local neighborhood.
Assumption 5.
The mapping $A$ is assumed to be non-expansive. That is, the condition
$$\mathcal{D}_{\mathcal{R}}(Ax, Ay) \le \mathcal{D}_{\mathcal{R}}(x, y)$$
holds for all $x, y \in \mathcal{X}$, and $\| A \| \le 1$.
The assumption postulates a natural constraint on the mapping $A$: it does not allow the effect of a poor prediction (at one step) to be amplified as the algorithm moves forward.
III. THEORETICAL RESULTS
In this section, we state our theoretical results and their consequences. The proofs are presented later in the Appendix (Section VI). Our main result (Theorem 3) proves a bound on the dynamic regret, which captures the deviation of the minimizer trajectory from the dynamics $A$ (tracking error) as well as the decentralization cost (network error). After stating the theorem, we show that our result recovers previous rates on decentralized optimization (static regret) once the tracking error is removed. Also, it recovers previous rates on centralized online optimization in the dynamic setting when the network error is factored out. Therefore, we establish that our generalization is bona fide.
A. Preliminary Results
We start with a convergence result on the local estimates, which presents an upper bound on the deviation of the local estimates at each iteration from their consensual value. A similar result has been proven in [28] for time-invariant functions without dynamics; however, the following lemma extends that of [28] to the online setting and takes into account the dynamics $A$ in (8b).
(Footnote: The algorithm is initialized at $\mathbf{0}$ to avoid clutter in the analysis. In general, any initialization could work for the algorithm.)
Lemma 1. (Network Error) Let $\mathcal{X}$ be a convex set in a Banach space $\mathcal{B}$, $\mathcal{R} : \mathcal{B} \to \mathbb{R}$ denote a 1-strongly convex function on $\mathcal{X}$ with respect to a norm $\|\cdot\|$, and $\mathcal{D}_{\mathcal{R}}(\cdot, \cdot)$ represent the Bregman divergence with respect to $\mathcal{R}$. Furthermore, assume that the local functions are Lipschitz continuous (Assumption 1), the matrix $W$ is doubly stochastic (Assumption 2), and the mapping $A$ is non-expansive (Assumption 5). Then, the local estimates $\{x_{i,t}\}_{t=1}^{T}$ generated by the updates (8a)-(8b) satisfy
$$\| x_{i,t+1} - \bar{x}_{t+1} \| \le L \sqrt{n} \sum_{\tau=0}^{t} \eta_\tau \sigma_2^{t-\tau}(W),$$
for any $i \in [n]$, where $\bar{x}_t := \frac{1}{n} \sum_{i=1}^{n} x_{i,t}$.
It turns out that the error bound depends on the network parameter $\sigma_2(W)$ and the step-size sequence $\{\eta_t\}_{t=1}^{T}$. It is well known that a smaller $\sigma_2(W)$ results in closeness of the estimates to their average by speeding up the mixing rate (see e.g. the results of [14]). For instance, when the communication is all-to-all, i.e., the graph is complete, $\sigma_2(W) = 0$ and the mixing rate is most rapid since each agent receives the private gradients of others after only one iteration of delay. On the other hand, a usual diminishing step-size, which asymptotically goes to zero, can guarantee asymptotic closeness; however, such a step-size sequence is most suitable for static rather than dynamic environments. We will discuss the choice of step-size carefully when we state our main result.
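The role of $\sigma_2(W)$ in Lemma 1 can be explored numerically. The sketch below builds a hypothetical lazy ring weight matrix (self-weight 1/2, each neighbor 1/4, chosen only for illustration), estimates $\sigma_2(W)$ by power iteration on $W - \frac{1}{n}\mathbf{1}\mathbf{1}^\top$ (whose largest singular value equals $\sigma_2(W)$ when $W$ is doubly stochastic), and evaluates the right-hand side of Lemma 1 as stated above. The function names and the fixed start vector are assumptions; the power iteration presumes that the start vector is not orthogonal to the leading singular direction.

```python
import math

def ring_weights(n):
    """Hypothetical doubly stochastic W on a ring (n >= 3): 1/2 on the diagonal, 1/4 per neighbor."""
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        W[i][i] = 0.5
        W[i][(i - 1) % n] = 0.25
        W[i][(i + 1) % n] = 0.25
    return W


def sigma2(W, iters=500):
    """Estimate sigma_2(W) as the spectral norm of M = W - (1/n) 1 1^T via power iteration on M^T M."""
    n = len(W)
    M = [[W[i][j] - 1.0 / n for j in range(n)] for i in range(n)]
    v = [float(k + 1) for k in range(n)]  # fixed, deterministic start vector
    est = 0.0
    for _ in range(iters):
        Mv = [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm_v = math.sqrt(sum(a * a for a in v))
        est = math.sqrt(sum(a * a for a in Mv)) / norm_v
        MtMv = [sum(M[i][j] * Mv[i] for i in range(n)) for j in range(n)]
        norm = math.sqrt(sum(a * a for a in MtMv))
        if norm == 0.0:  # M annihilates v (e.g. complete graph, M = 0)
            return est
        v = [a / norm for a in MtMv]
    return est


def lemma1_bound(L, n, etas, sigma):
    """Right-hand side of Lemma 1: L * sqrt(n) * sum_{tau=0}^{t} eta_tau * sigma^(t - tau), t = len(etas) - 1."""
    t = len(etas) - 1
    return L * math.sqrt(n) * sum(eta * sigma ** (t - tau) for tau, eta in enumerate(etas))
```

With a complete graph and uniform weights, `sigma2` returns zero and only the most recent step size survives in `lemma1_bound`, which matches the one-iteration-delay remark above; on the ring, older step sizes contribute with geometrically decaying weights.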
Before that, we need to state another lemma as follows.
Lemma 2. (Tracking Error) Let $\mathcal{X}$ be a convex set in a Banach space $\mathcal{B}$, $\mathcal{R} : \mathcal{B} \to \mathbb{R}$ denote a 1-strongly convex function on $\mathcal{X}$ with respect to a norm $\|\cdot\|$, and $\mathcal{D}_{\mathcal{R}}(\cdot, \cdot)$ represent the Bregman divergence with respect to $\mathcal{R}$. Furthermore, assume that the matrix $W$ is doubly stochastic (Assumption 2), the Bregman divergence satisfies the Lipschitz condition and the separate convexity (Assumptions 3-4), and the mapping $A$ is non-expansive (Assumption 5). Then, it holds that
$$\frac{1}{n} \sum_{i=1}^{n} \sum_{t=1}^{T} \left( \frac{1}{\eta_t} \mathcal{D}_{\mathcal{R}}(x_t^\star, y_{i,t}) - \frac{1}{\eta_t} \mathcal{D}_{\mathcal{R}}(x_t^\star, \hat{x}_{i,t+1}) \right) \le \frac{2R^2}{\eta_{T+1}} + \sum_{t=1}^{T} \frac{K}{\eta_{t+1}} \left\| x_{t+1}^\star - A x_t^\star \right\|,$$
where $R^2 := \sup_{x, y \in \mathcal{X}} \mathcal{D}_{\mathcal{R}}(x, y)$.
In the update (8a), each agent $i$ calculates $\hat{x}_{i,t+1}$ while staying close to $y_{i,t}$ by minimizing the Bregman divergence. Lemma 2 establishes a bound on the difference of these two quantities when they are evaluated in the Bregman divergence with respect to $x_t^\star$. The relation of the left-hand side to dynamic regret is not immediate, and it becomes clear in the analysis. However, the term $\| x_{t+1}^\star - A x_t^\star \| = \| v_t \|$ in the bound highlights the impact of the mismatch noise $v_t$ on the tracking quality. Lemmata 1 and 2 disclose the critical parameters involved in the regret bound. We carefully discuss the consequences of these bounds in the subsequent section.
B. Finite-horizon Performance: Regret Bound
We now state our main result on the non-asymptotic performance of the decentralized online mirror descent in dynamic environments. The succeeding theorem provides the regret bound in the general case, and it is followed by a corollary characterizing the regret rate for the optimized fixed step-size sequence. In particular, the theorem uses the results in the previous section to present an upper bound on the dynamic regret decomposed into tracking and network errors.
Theorem 3.
Let $\mathcal{X}$ be a convex set in a Banach space $\mathcal{B}$, $\mathcal{R} : \mathcal{B} \to \mathbb{R}$ denote a 1-strongly convex function on $\mathcal{X}$ with respect to a norm $\|\cdot\|$, and $\mathcal{D}_{\mathcal{R}}(\cdot, \cdot)$ represent the Bregman divergence with respect to $\mathcal{R}$. Furthermore, assume that the local functions are Lipschitz continuous (Assumption 1), the matrix $W$ is doubly stochastic (Assumption 2), the Bregman divergence satisfies the Lipschitz condition and the separate convexity (Assumptions 3-4), and the mapping $A$ is non-expansive (Assumption 5). Then, using the local estimates $\{x_{i,t}\}_{t=1}^{T}$ generated by the updates (8a)-(8b), the regret (4) can be bounded as
$$\mathbf{Reg}^{d}_{T} \le E_{\text{Track}} + E_{\text{Net}},$$
where
$$E_{\text{Track}} := \frac{2R^2}{\eta_{T+1}} + \sum_{t=1}^{T} \frac{K}{\eta_{t+1}} \left\| x_{t+1}^\star - A x_t^\star \right\| + \frac{L^2}{2} \sum_{t=1}^{T} \eta_t,$$
and
$$E_{\text{Net}} := 4 L^2 \sqrt{n} \sum_{t=1}^{T} \sum_{\tau=0}^{t-1} \eta_\tau \sigma_2^{t-\tau-1}(W).$$
Corollary 4.
Under the same conditions stated in Theorem 3, using the fixed step-size $\eta = \sqrt{(1 - \sigma_2(W)) C_T / T}$ yields a regret bound of order
$$\mathbf{Reg}^{d}_{T} \le O\left( \sqrt{\frac{C_T \, T}{1 - \sigma_2(W)}} \right),$$
where $C_T = \sum_{t=1}^{T} \| x_{t+1}^\star - A x_t^\star \|$.
Proof: The proof of the corollary follows directly from substituting the step-size into the bound in Theorem 3.
Theorem 3 decomposes the upper bound into two terms for a general step-size $\{\eta_t\}_{t=1}^{T}$. In Corollary 4, we fix the step-size and observe the role of $C_T$ in controlling the regret bound. As we recall from (5), this quantity collects the mismatch errors $\{v_t\}_{t=1}^{T}$, which are not necessarily Gaussian or of some statistical distribution. In Section II, we discussed that our setup generalizes some previous works, and it is important to notice that our result recovers the corresponding rates when restricted to those special cases:
1) When the global function $f_t(x) = f(x)$ is time-invariant, the minimizer sequence $\{x_t^\star\}_{t=1}^{T}$ is fixed, i.e., the mapping $A = I_d$ and $v_t = \mathbf{0}$ in (5). In this case, the term involving $\| x_{t+1}^\star - A x_t^\star \|$ in $E_{\text{Track}}$ in Theorem 3 is equal to zero, and we can use the step-size $\eta = \sqrt{(1 - \sigma_2(W)) / T}$ to recover the result of comparable algorithms, such as [14], in which distributed dual averaging is proposed.
2) The same argument holds when the global function is time-variant, but the comparator variables are fixed. In this case, the problem is reduced to minimizing the static regret (3). Since $\| x_{t+1}^\star - A x_t^\star \| = 0$ again, our result recovers that of [32] on distributed online dual averaging.
3) When the graph is complete, $\sigma_2(W) = 0$ and the $E_{\text{Net}}$ term in Theorem 3 vanishes.
We then recover theresults of [20] on centralized online learning in dynamicenvironments.As we mentioned earlier, when mismatch errors { v t } Tt =1 arelarge, the minimizer sequence { x (cid:63)t } Tt =1 fluctuates drastically,and C T could become linear in time. The bound in thecorollary is then not useful in the sense of keeping the dynamicregret sub-linear. Such behavior is natural since even in thecentralized online optimization, the algorithm receives onlya single gradient to predict the next step . As discussedin Section II, in this worst-case, the problem is generallyintractable. However, our goal was to consider C T as acomplexity measure of the problem environment and expressthe regret bound with respect to this parameter. In practice, ifthe algorithm is allowed to query multiple gradients per time,the error would be reduced, but this direction is beyond thescope of this paper. C. Optimization with Stochastic Gradients
In many engineering applications such as decentralized tracking, learning, and estimation, agents' observations are usually noisy. In this section, we demonstrate that the result of Theorem 3 does not rely on exact gradients, and that it holds true in expectation when agents follow stochastic gradients. Mathematically speaking, let $\mathcal{F}_t$ be the $\sigma$-field containing all information prior to the outset of round $t+1$. Let also $\tilde{\nabla}_{i,t}$ represent the stochastic gradient observed by agent $i$ after calculating the estimate $x_{i,t}$. Then, we define a stochastic oracle that provides noisy gradients respecting the following conditions
$$\mathbb{E}\left[ \tilde{\nabla}_{i,t} \,\middle|\, \mathcal{F}_{t-1} \right] = \nabla_{i,t}, \qquad \mathbb{E}\left[ \| \tilde{\nabla}_{i,t} \|_*^2 \,\middle|\, \mathcal{F}_{t-1} \right] \le G^2. \tag{9}$$
The new updates take the following form
$$\hat{x}_{i,t+1} = \operatorname*{argmin}_{x \in \mathcal{X}} \left\{ \eta_t \langle x, \tilde{\nabla}_{i,t} \rangle + \mathcal{D}_{\mathcal{R}}(x, y_{i,t}) \right\}, \tag{10a}$$
$$x_{i,t} = A \hat{x}_{i,t}, \quad \text{and} \quad y_{i,t} = \sum_{j=1}^{n} [W]_{ij}\, x_{j,t}, \tag{10b}$$
where the only distinction between (10a) and (8a) is using the stochastic gradient in the former. A commonly used model to generate stochastic gradients satisfying (9) is an additive zero-mean noise with bounded variance. We now discuss the impact of stochastic gradients in the following theorem.

Theorem 5.
Let $\mathcal{X}$ be a convex set in a Banach space $\mathcal{B}$, $\mathcal{R} : \mathcal{B} \to \mathbb{R}$ denote a 1-strongly convex function on $\mathcal{X}$ with respect to a norm $\|\cdot\|$, and $\mathcal{D}_{\mathcal{R}}(\cdot,\cdot)$ represent the Bregman divergence with respect to $\mathcal{R}$. Furthermore, assume that the local functions are Lipschitz continuous (Assumption 1), the matrix $W$ is doubly stochastic (Assumption 2), the Bregman divergence satisfies the Lipschitz condition and the separate convexity (Assumptions 3-4), and the mapping $A$ is non-expansive (Assumption 5). Let the local estimates $\{x_{i,t}\}_{t=1}^{T}$ be generated by updates (10a)-(10b), where the stochastic gradients satisfy the condition (9). Then,
$$\mathbb{E}\left[ \mathbf{Reg}^d_T \right] \le \frac{2R^2}{\eta_{T+1}} + \sum_{t=1}^{T} \frac{K}{\eta_{t+1}} \left\| x^\star_{t+1} - A x^\star_t \right\| + \frac{G^2}{2} \sum_{t=1}^{T} \eta_t + 4G^2\sqrt{n} \sum_{t=1}^{T} \sum_{\tau=0}^{t-1} \eta_\tau\, \sigma_2^{t-\tau-1}(W).$$

The theorem indicates that when using stochastic gradients, the result of Theorem 3 holds true in expectation. Thus, the algorithm can be used in dynamic environments where agents' observations are noisy.

Footnote: Even in a more structured problem setting such as Kalman filtering, when we know the exact value of a state at a time step, we cannot exactly predict the next state, and we incur a minimum mean-squared error of the size of the noise variance.

IV. NUMERICAL EXPERIMENT: STATE ESTIMATION AND TRACKING DYNAMIC PARAMETERS
The generality of Mirror Descent stems from the freedom over the selection of the Bregman divergence. A particularly well-known Bregman divergence is the Euclidean distance, which turns our framework into state estimation and tracking. In this section, we focus on this scenario as an application of our method. Distributed state estimation and tracking of dynamic parameters has a long history in the literature of control and signal processing. However, there are key distinctions in our approach to the dynamical model of the parameter and agents' observations. We elaborate on these differences as we describe our numerical experiment.

Let us consider a slowly maneuvering target in the 2D plane and assume that each position component of the target evolves independently according to a near constant velocity model [41]. The state of the target at each time consists of four components: horizontal position, vertical position, horizontal velocity, and vertical velocity. Therefore, representing such state at time $t$ by $x^\star_t \in \mathbb{R}^4$, the state space model takes the form
$$x^\star_{t+1} = A x^\star_t + v_t,$$
where $v_t \in \mathbb{R}^4$ is the system noise, and using $\otimes$ for the Kronecker product, $A$ is described as
$$A = I_2 \otimes \begin{bmatrix} 1 & \epsilon \\ 0 & 1 \end{bmatrix},$$
with $\epsilon$ being the sampling interval (a sampling interval of $\epsilon$ seconds is equivalent to a sampling rate of $1/\epsilon$ Hz). The goal is to track $x^\star_t$ using a network of agents. This problem has been studied in the context of distributed Kalman filtering [23], [42], state estimation [43]–[45], and particle filtering [24], [46], [47]. However, as opposed to Kalman filtering, we need not assume that the system noise $v_t$ is Gaussian. Also, unlike particle filtering, we do not assume receiving a large number of samples (particles) per iteration since our setup is online, i.e., agents only observe one sample per iteration.
Moreover, we do not assume a statistical distribution on $v_t$ in our analysis, which makes our framework different from state estimation. We have a model-free approach in which the noise can be deterministic with unknown structure, or even stochastic with dependence over time. For our experiment, we generate this noise according to a zero-mean Gaussian distribution with covariance matrix $\Sigma$ as follows
$$\Sigma = \sigma_v^2\, I_2 \otimes \begin{bmatrix} \epsilon^3/3 & \epsilon^2/2 \\ \epsilon^2/2 & \epsilon \end{bmatrix}.$$
We let the sampling interval be $\epsilon$ = 0 . seconds which is equivalent to frequency Hz. The constant $\sigma_v$ is changed in different scenarios, so we describe the choice of this parameter later. Importantly, we remark that though this noise is generated randomly, it is fixed within each run of our experiment. That is, the noise is generated once and remains fixed throughout, so it can be considered deterministic.

We consider a sensor network of $n = 25$ agents located on a $5 \times 5$ grid. Agents aim to track the moving target $x^\star_t$ collaboratively. At time $t$, agent $i$ observes $z_{i,t}$, a noisy version of one coordinate of $x^\star_t$, as follows
$$z_{i,t} = e_{k_i}^\top x^\star_t + w_{i,t},$$
where $w_{i,t} \in \mathbb{R}$ denotes the observation noise, and $e_k$ is the $k$-th unit vector in the standard basis of $\mathbb{R}^4$ for $k \in \{1, 2, 3, 4\}$. We divide agents into four groups, and for each group we choose one specific $k_i$ from the set $\{1, 2, 3, 4\}$. Furthermore, the observation noise must satisfy the standard assumption of being zero-mean and finite-variance. Our results are not dependent on Gaussian noise, so we generate $w_{i,t}$ independently from a uniform distribution on [ − , ].

Though not locally observable to each agent, it is straightforward to see that the target $x^\star_t$ is globally identifiable from the standpoint of the whole network (see e.g. [43] for the exact definition of global identifiability in a general tracking problem).

At time $t$, each agent $i$ forms an estimate $x_{i,t}$ of $x^\star_t$ based on observations $\{z_{i,\tau}\}_{\tau=1}^{t-1}$.
After that, the new signal $z_{i,t}$ becomes available to the agent. The online nature of the problem allows us to pose it as an instance of online optimization formulated in (4). To derive an explicit update for $x_{i,t}$, we need to introduce the loss functions. We use the local square loss
$$f_{i,t}(x) := \mathbb{E}\left[ \left( z_{i,t} - e_{k_i}^\top x \right)^2 \,\middle|\, x^\star_t \right],$$
for each agent $i$, resulting in the network loss
$$f_t(x) := \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}\left[ \left( z_{i,t} - e_{k_i}^\top x \right)^2 \,\middle|\, x^\star_t \right].$$
In our experiment $v_t$ is a deterministic noise, but in both definitions $x^\star_t$ could be random in the case that $v_t$ is random, so we use the conditional expectation to be precise. Now using Euclidean distance as the Bregman divergence in updates (10a)-(10b), we can derive the following update
$$x_{i,t} = \sum_{j=1}^{n} [W]_{ij}\, A\, x_{j,t-1} + \eta_t\, A\, e_{k_i} \left( z_{i,t-1} - e_{k_i}^\top x_{i,t-1} \right).$$

Fig. 1. The plot of dynamic regret versus iterations (x-axis: iteration; y-axis: normalized expected dynamic regret; one curve for each of $\sigma_v = 0.25, 0.5, 0.75, 1$). Naturally, when $\sigma_v$ is smaller, the innovation noise added to the dynamics is smaller with high probability, and the network incurs a lower dynamic regret. In this plot, the dynamic regret is normalized by iterations, so the $y$-axis is $\mathbb{E}\left[ \mathbf{Reg}^d_T \right]/T$.

We fix the step size to $\eta_t = \eta$ = 0 . since using a diminishing step size is not useful in tracking unless we have diminishing system noise [48]. The update is akin to consensus+innovation updates in the literature (see e.g. [48]–[50]), though we recall that we did not analyze this update for a system noise $v_t$ with a statistical distribution.

It is proved in [51] that in decentralized tracking, the dynamic regret can be presented in terms of the tracking error $x_{i,t} - x^\star_t$ of all agents. More specifically, the dynamic regret averages the tracking error over space and time (when normalized by $T$).
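The consensus+innovation update above can be tried out numerically. The following is a minimal sketch under stated assumptions, not the authors' experiment code: the lazy Metropolis weights on the grid, the assignment of observed coordinates by agent index modulo 4, the zero initialization of target and estimates, and the default values of `eps`, `sigma_v`, `eta` and the uniform observation noise on [-1, 1] are all hypothetical choices made for illustration.

```python
import numpy as np


def lazy_metropolis_grid(side):
    """Doubly stochastic weights on a side x side grid (assumed topology/weights)."""
    n = side * side
    adj = np.zeros((n, n), dtype=bool)
    for r in range(side):
        for c in range(side):
            i = r * side + c
            if c + 1 < side:                       # neighbour to the right
                adj[i, i + 1] = adj[i + 1, i] = True
            if r + 1 < side:                       # neighbour below
                adj[i, i + side] = adj[i + side, i] = True
    deg = adj.sum(axis=1)
    W = np.zeros((n, n))
    for i in range(n):
        for j in np.flatnonzero(adj[i]):
            W[i, j] = 1.0 / (1 + max(deg[i], deg[j]))
        W[i, i] = 1.0 - W[i].sum()                 # rows (and columns) sum to one
    return 0.5 * (np.eye(n) + W)                   # lazy version: eigenvalues in [0, 1]


def simulate_tracking(T=1000, side=5, eps=0.1, sigma_v=0.5, eta=0.5, seed=0):
    """Return the network-average squared tracking error at each iteration."""
    rng = np.random.default_rng(seed)
    n = side * side
    A = np.kron(np.eye(2), np.array([[1.0, eps], [0.0, 1.0]]))
    Sigma = sigma_v**2 * np.kron(
        np.eye(2), np.array([[eps**3 / 3, eps**2 / 2], [eps**2 / 2, eps]]))
    chol = np.linalg.cholesky(Sigma)               # requires sigma_v > 0
    W = lazy_metropolis_grid(side)
    E = np.eye(4)[np.arange(n) % 4]                # row i is e_{k_i}
    x_star = np.zeros(4)                           # target state x*_t
    X = np.zeros((n, 4))                           # row i is the estimate x_{i,t}
    err = np.empty(T)
    for t in range(T):
        err[t] = np.mean(np.sum((X - x_star) ** 2, axis=1))
        z = E @ x_star + rng.uniform(-1.0, 1.0, size=n)      # observations z_{i,t}
        innovation = z - np.sum(E * X, axis=1)               # z_{i,t} - e_{k_i}^T x_{i,t}
        X = (W @ X + eta * innovation[:, None] * E) @ A.T    # consensus + innovation, then A
        x_star = A @ x_star + chol @ rng.standard_normal(4)  # target moves with noise v_t
    return err
```

Calling `simulate_tracking(sigma_v=s).mean()` for several values of `s` gives a rough, single-run counterpart of the normalized tracking error; averaging over seeds would be needed before drawing any conclusion about trends.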
Exploiting this connection and combining that with the result of Theorem 5, we observe that once the parameter does not deviate too much from the dynamics, i.e., when $\sum_{t=1}^{T} \|v_t\|$ is small, the bound on the dynamic regret (or equivalently the collective tracking error) becomes small and vice versa.

We demonstrate this intuitive idea by tuning $\sigma_v$. Larger values for $\sigma_v$ are more likely to cause deviations from the dynamics $A$; therefore, we expect large dynamic regret (worse performance) when $\sigma_v$ is large. In Fig. 1, we plot the dynamic regret for $\sigma_v \in \{0.25, 0.5, 0.75, 1\}$. For each specific value of $\sigma_v$, we run the experiment 50 times and average out the dynamic regret over all runs. As we conjectured, the performance improves once $\sigma_v$ tends to smaller values.

Let us now focus on the case that $\sigma_v$ = 0 . . For one run of this case, we provide a snapshot of the target trajectory (in red) in Fig. 2 and plot the estimator trajectory (in blue) for agents $i \in \{1, 6, 12, 23\}$. While the dynamic regret can be controlled in the expectation sense (Theorem 5), Fig. 2 suggests that agents' estimators closely follow the trajectory of the moving target with high probability.

[Figure 2: four panels (Agent 1, Agent 6, Agent 12, Agent 23), each showing the target and estimator trajectories.]

Fig. 2. The trajectory of $x^\star_t$ over $T = 1000$ iterations is shown in red. We also depict the trajectory of the estimator $x_{i,t}$ (shown in blue) for $i \in \{1, 6, 12, 23\}$ and observe that it closely follows $x^\star_t$ in every case.

V. CONCLUSION
This work unifies a number of frameworks in the literature by addressing decentralized, online optimization in dynamic environments. We considered tracking the minimizer of a global time-varying convex function via a network of agents. The minimizer of the global function follows dynamics known to agents, but an unknown, unstructured noise causes deviation from these dynamics. The global function can be written as a sum of local functions at each time step, and each agent can only observe its associated local function. However, these local functions appear sequentially, and agents do not have prior knowledge of the future cost functions.

Our proposed algorithm for this setup can be cast as a decentralized version of Mirror Descent. However, the algorithm possesses two additional steps to include agents' interactions and the dynamics of the minimizer. We used a notion of network dynamic regret to measure the performance of our algorithm versus its offline counterpart. We established that the regret bound scales inversely in the spectral gap of the network and captures the deviation of the minimizer sequence with respect to the given dynamics. We next considered stochastic optimization, where agents observe only noisy versions of their local gradients, and we proved that in this case, our regret bound holds true in expectation. We showed that our generalization is valid and convincing in the sense that the results recover those of distributed optimization in the online and offline settings. We also applied our method to decentralized tracking of dynamic parameters in the numerical experiments.

Our work opens a few directions for future work. We conjecture that our theoretical results can be strengthened in a setup where agents receive multiple gradients per time step. However, as mentioned in Section III, this is still an open question. Also, the result of Corollary 4 assumes the step-size is tuned in advance. This would require knowledge of $C_T$ or an upper bound on the quantity. For the centralized setting, one can potentially avoid the issue using doubling tricks, which require online accumulation of the mismatch noise $v_t$. However, it is more natural to consider that this noise is not fully observable in the decentralized setting. Therefore, an adaptive solution to step-size tuning remains open for future investigation.

VI. APPENDIX
The following lemma is standard in the analysis of mirror descent. We state the lemma here and invoke it in our analysis later.
Lemma 6 (Beck and Teboulle [26]). Let $\mathcal{X}$ be a convex set in a Banach space $\mathcal{B}$, $\mathcal{R} : \mathcal{B} \to \mathbb{R}$ denote a 1-strongly convex function on $\mathcal{X}$ with respect to a norm $\|\cdot\|$, and $\mathcal{D}_{\mathcal{R}}(\cdot,\cdot)$ represent the Bregman divergence with respect to $\mathcal{R}$. Then, any update of the form
$$x^\star = \operatorname*{argmin}_{x \in \mathcal{X}} \left\{ \langle a, x \rangle + \mathcal{D}_{\mathcal{R}}(x, c) \right\}$$
satisfies the following inequality
$$\langle x^\star - d, a \rangle \le \mathcal{D}_{\mathcal{R}}(d, c) - \mathcal{D}_{\mathcal{R}}(d, x^\star) - \mathcal{D}_{\mathcal{R}}(x^\star, c),$$
for any $d \in \mathcal{X}$.

A. Proof of Lemma 1
Applying Lemma 6 to the update (8a), we get
$$\eta_t \langle \hat{x}_{i,t+1} - y_{i,t}, \nabla_{i,t} \rangle \le -\mathcal{D}_{\mathcal{R}}(y_{i,t}, \hat{x}_{i,t+1}) - \mathcal{D}_{\mathcal{R}}(\hat{x}_{i,t+1}, y_{i,t}).$$
In view of the strong convexity of $\mathcal{R}$, the Bregman divergence satisfies $\mathcal{D}_{\mathcal{R}}(x, y) \ge \frac{1}{2}\|x - y\|^2$ for any $x, y \in \mathcal{X}$ (see (7)). Therefore, we can simplify the equation above as follows
$$\eta_t \langle y_{i,t} - \hat{x}_{i,t+1}, \nabla_{i,t} \rangle \ge \mathcal{D}_{\mathcal{R}}(y_{i,t}, \hat{x}_{i,t+1}) + \mathcal{D}_{\mathcal{R}}(\hat{x}_{i,t+1}, y_{i,t}) \ge \| y_{i,t} - \hat{x}_{i,t+1} \|^2. \tag{11}$$
On the other hand, for any primal-dual norm pair it holds that
$$\langle y_{i,t} - \hat{x}_{i,t+1}, \nabla_{i,t} \rangle \le \| y_{i,t} - \hat{x}_{i,t+1} \| \, \| \nabla_{i,t} \|_* \le L \| y_{i,t} - \hat{x}_{i,t+1} \|,$$
using Assumption 1 in the last line. Combining the above with (11), we obtain
$$\| y_{i,t} - \hat{x}_{i,t+1} \| \le L \eta_t. \tag{12}$$
Letting $e_{i,t} := \hat{x}_{i,t+1} - y_{i,t}$, we can now rewrite update (8b) as
$$\hat{x}_{i,t+1} = \sum_{j=1}^{n} [W]_{ij}\, x_{j,t} + e_{i,t},$$
which implies
$$x_{i,t+1} = A \hat{x}_{i,t+1} = \sum_{j=1}^{n} [W]_{ij}\, A x_{j,t} + A e_{i,t}. \tag{13}$$
Using Assumption 2 (double stochasticity of $W$), the above immediately yields
$$\bar{x}_{t+1} := \frac{1}{n} \sum_{i=1}^{n} x_{i,t+1} = \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{n} [W]_{ij}\, A x_{j,t} + \frac{1}{n} \sum_{i=1}^{n} A e_{i,t} = \frac{1}{n} \sum_{j=1}^{n} \left( \sum_{i=1}^{n} [W]_{ij} \right) A x_{j,t} + \frac{1}{n} \sum_{i=1}^{n} A e_{i,t} = A \bar{x}_t + A \bar{e}_t,$$
where $\bar{e}_t := \frac{1}{n} \sum_{i=1}^{n} e_{i,t}$, and $\bar{x}_t = \frac{1}{n} \sum_{i=1}^{n} x_{i,t}$ as defined in the statement of the lemma. As a result,
$$\bar{x}_{t+1} = \sum_{\tau=0}^{t} A^{t+1-\tau}\, \bar{e}_\tau. \tag{14}$$
On the other hand, stacking the local vectors $x_{i,t}$ and $e_{i,t}$ in (13) in the following form
$$\mathbf{x}_t := [x_{1,t}^\top, x_{2,t}^\top, \ldots, x_{n,t}^\top]^\top, \qquad \mathbf{e}_t := [e_{1,t}^\top, e_{2,t}^\top, \ldots, e_{n,t}^\top]^\top,$$
and using $\otimes$ to denote the Kronecker product, we can write (13) in matrix form as
$$\mathbf{x}_{t+1} = (W \otimes A)\, \mathbf{x}_t + (I_n \otimes A)\, \mathbf{e}_t = \sum_{\tau=0}^{t} (W \otimes A)^{t-\tau} (I_n \otimes A)\, \mathbf{e}_\tau = \sum_{\tau=0}^{t} (W^{t-\tau} \otimes A^{t-\tau})(I_n \otimes A)\, \mathbf{e}_\tau.$$
Therefore, using the above, we have
$$x_{i,t+1} = \sum_{\tau=0}^{t} \sum_{j=1}^{n} \left[ W^{t-\tau} \right]_{ij} A^{t+1-\tau}\, e_{j,\tau}.$$
Combining the above with (14), we derive
$$x_{i,t+1} - \bar{x}_{t+1} = \sum_{\tau=0}^{t} \sum_{j=1}^{n} \left( \left[ W^{t-\tau} \right]_{ij} - \frac{1}{n} \right) A^{t+1-\tau}\, e_{j,\tau},$$
which entails
$$\| x_{i,t+1} - \bar{x}_{t+1} \| \le \sum_{\tau=0}^{t} \sum_{j=1}^{n} \left| \left[ W^{t-\tau} \right]_{ij} - \frac{1}{n} \right| L \eta_\tau, \tag{15}$$
where we used $\| e_{i,\tau} \| \le L \eta_\tau$ obtained in (12) as well as the assumption $\|A\| \le 1$ (Assumption 5). By standard properties of doubly stochastic matrices (see e.g. [40]), the matrix $W$ satisfies
$$\sum_{j=1}^{n} \left| \left[ W^{t} \right]_{ij} - \frac{1}{n} \right| \le \sqrt{n}\, \sigma_2^{t}(W).$$
Substituting the above into (15) finishes the proof.

B. Proof of Lemma 2
We start by adding, subtracting, and regrouping several terms as follows
$$\begin{aligned}
\frac{1}{\eta_t} \mathcal{D}_{\mathcal{R}}(x^\star_t, y_{i,t}) - \frac{1}{\eta_t} \mathcal{D}_{\mathcal{R}}(x^\star_t, \hat{x}_{i,t+1}) ={}& \frac{1}{\eta_t} \mathcal{D}_{\mathcal{R}}(x^\star_t, y_{i,t}) - \frac{1}{\eta_{t+1}} \mathcal{D}_{\mathcal{R}}(x^\star_{t+1}, y_{i,t+1}) \\
&+ \frac{1}{\eta_{t+1}} \mathcal{D}_{\mathcal{R}}(x^\star_{t+1}, y_{i,t+1}) - \frac{1}{\eta_{t+1}} \mathcal{D}_{\mathcal{R}}(A x^\star_t, y_{i,t+1}) \\
&+ \frac{1}{\eta_{t+1}} \mathcal{D}_{\mathcal{R}}(A x^\star_t, y_{i,t+1}) - \frac{1}{\eta_{t+1}} \mathcal{D}_{\mathcal{R}}(x^\star_t, \hat{x}_{i,t+1}) \\
&+ \frac{1}{\eta_{t+1}} \mathcal{D}_{\mathcal{R}}(x^\star_t, \hat{x}_{i,t+1}) - \frac{1}{\eta_t} \mathcal{D}_{\mathcal{R}}(x^\star_t, \hat{x}_{i,t+1}).
\end{aligned} \tag{16}$$
We now need to bound each of the four terms above. For the second term, we note that
$$\mathcal{D}_{\mathcal{R}}(x^\star_{t+1}, y_{i,t+1}) - \mathcal{D}_{\mathcal{R}}(A x^\star_t, y_{i,t+1}) \le K \left\| x^\star_{t+1} - A x^\star_t \right\|, \tag{17}$$
by the Lipschitz condition on the Bregman divergence (Assumption 4). Also, by the separate convexity of the Bregman divergence (Assumption 3) as well as the double stochasticity of $W$ (Assumption 2), we have
$$\begin{aligned}
\sum_{i=1}^{n} \mathcal{D}_{\mathcal{R}}(A x^\star_t, y_{i,t+1}) - \sum_{i=1}^{n} \mathcal{D}_{\mathcal{R}}(x^\star_t, \hat{x}_{i,t+1})
&= \sum_{i=1}^{n} \mathcal{D}_{\mathcal{R}}\Big(A x^\star_t, \sum_{j=1}^{n} [W]_{ij}\, x_{j,t+1}\Big) - \sum_{i=1}^{n} \mathcal{D}_{\mathcal{R}}(x^\star_t, \hat{x}_{i,t+1}) \\
&\le \sum_{i=1}^{n} \sum_{j=1}^{n} [W]_{ij}\, \mathcal{D}_{\mathcal{R}}(A x^\star_t, x_{j,t+1}) - \sum_{i=1}^{n} \mathcal{D}_{\mathcal{R}}(x^\star_t, \hat{x}_{i,t+1}) \\
&= \sum_{j=1}^{n} \mathcal{D}_{\mathcal{R}}(A x^\star_t, x_{j,t+1}) \sum_{i=1}^{n} [W]_{ij} - \sum_{i=1}^{n} \mathcal{D}_{\mathcal{R}}(x^\star_t, \hat{x}_{i,t+1}) \\
&= \sum_{j=1}^{n} \mathcal{D}_{\mathcal{R}}(A x^\star_t, x_{j,t+1}) - \sum_{i=1}^{n} \mathcal{D}_{\mathcal{R}}(x^\star_t, \hat{x}_{i,t+1}) \\
&= \sum_{i=1}^{n} \mathcal{D}_{\mathcal{R}}(A x^\star_t, A \hat{x}_{i,t+1}) - \sum_{i=1}^{n} \mathcal{D}_{\mathcal{R}}(x^\star_t, \hat{x}_{i,t+1}) \le 0,
\end{aligned} \tag{18}$$
where the last inequality follows from the fact that $A$ is non-expansive (Assumption 5). When summing (16) over $t \in [T]$, the first term telescopes, while the second and third terms are handled with the bounds in (17) and (18), respectively.
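For readers who want the summation spelled out, the following is a sketch of the intermediate step as we read it, assuming a positive, non-increasing step-size sequence and writing $R^2$ for an upper bound on the Bregman divergence over $\mathcal{X}$. It covers only the first and fourth terms of (16).

```latex
\begin{align*}
\sum_{t=1}^{T}\Big(\tfrac{1}{\eta_t}\mathcal{D}_{\mathcal{R}}(x^\star_t,y_{i,t})
  -\tfrac{1}{\eta_{t+1}}\mathcal{D}_{\mathcal{R}}(x^\star_{t+1},y_{i,t+1})\Big)
&=\tfrac{1}{\eta_1}\mathcal{D}_{\mathcal{R}}(x^\star_1,y_{i,1})
  -\tfrac{1}{\eta_{T+1}}\mathcal{D}_{\mathcal{R}}(x^\star_{T+1},y_{i,T+1})
 \le \frac{R^2}{\eta_1},\\
\sum_{t=1}^{T}\Big(\tfrac{1}{\eta_{t+1}}-\tfrac{1}{\eta_t}\Big)
  \mathcal{D}_{\mathcal{R}}(x^\star_t,\hat{x}_{i,t+1})
&\le R^2\sum_{t=1}^{T}\Big(\tfrac{1}{\eta_{t+1}}-\tfrac{1}{\eta_t}\Big)
 = R^2\Big(\tfrac{1}{\eta_{T+1}}-\tfrac{1}{\eta_1}\Big).
\end{align*}
```

The first line uses the non-negativity of the Bregman divergence to drop the final telescoped term; the second uses that each coefficient $1/\eta_{t+1} - 1/\eta_t$ is non-negative.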
Recalling from the statement of the lemma that $R^2 = \sup_{x,y \in \mathcal{X}} \mathcal{D}_{\mathcal{R}}(x, y)$, we obtain
$$\begin{aligned}
\frac{1}{n} \sum_{i=1}^{n} \sum_{t=1}^{T} \left( \frac{1}{\eta_t} \mathcal{D}_{\mathcal{R}}(x^\star_t, y_{i,t}) - \frac{1}{\eta_t} \mathcal{D}_{\mathcal{R}}(x^\star_t, \hat{x}_{i,t+1}) \right)
&\le \frac{R^2}{\eta_1} + \sum_{t=1}^{T} \frac{K}{\eta_{t+1}} \left\| x^\star_{t+1} - A x^\star_t \right\| + R^2 \sum_{t=1}^{T} \left( \frac{1}{\eta_{t+1}} - \frac{1}{\eta_t} \right) \\
&\le \frac{2R^2}{\eta_{T+1}} + \sum_{t=1}^{T} \frac{K}{\eta_{t+1}} \left\| x^\star_{t+1} - A x^\star_t \right\|,
\end{aligned}$$
where we used the fact that the step-size is positive and decreasing in the last line.

C. An Auxiliary Lemma
In the proof of Theorem 3, we make use of another technicallemma provided below.
Lemma 7.
Let $\mathcal{X}$ be a convex set in a Banach space $\mathcal{B}$, $\mathcal{R} : \mathcal{B} \to \mathbb{R}$ denote a 1-strongly convex function on $\mathcal{X}$ with respect to a norm $\|\cdot\|$, and $\mathcal{D}_{\mathcal{R}}(\cdot,\cdot)$ represent the Bregman divergence with respect to $\mathcal{R}$. Furthermore, assume that the local functions are Lipschitz continuous (Assumption 1), the matrix $W$ is doubly stochastic (Assumption 2), the Bregman divergence satisfies the Lipschitz condition and the separate convexity (Assumptions 3-4), and the mapping $A$ is non-expansive (Assumption 5). Then, for the local estimates $\{x_{i,t}\}_{t=1}^{T}$ generated by the updates (8a)-(8b), it holds that
$$\frac{1}{n} \sum_{i=1}^{n} \sum_{t=1}^{T} \Big( f_{i,t}(x_{i,t}) - f_{i,t}(x^\star_t) \Big) \le \frac{2R^2}{\eta_{T+1}} + \sum_{t=1}^{T} \frac{K}{\eta_{t+1}} \left\| x^\star_{t+1} - A x^\star_t \right\| + \frac{L^2}{2} \sum_{t=1}^{T} \eta_t + 2L^2\sqrt{n} \sum_{t=1}^{T} \sum_{\tau=0}^{t-1} \eta_\tau\, \sigma_2^{t-\tau-1}(W),$$
where $R^2 := \sup_{x,y \in \mathcal{X}} \mathcal{D}_{\mathcal{R}}(x, y)$.

Proof: In view of the convexity of $f_{i,t}(\cdot)$, we have
$$f_{i,t}(x_{i,t}) - f_{i,t}(x^\star_t) \le \langle \nabla_{i,t}, x_{i,t} - x^\star_t \rangle = \langle \nabla_{i,t}, \hat{x}_{i,t+1} - x^\star_t \rangle + \langle \nabla_{i,t}, x_{i,t} - y_{i,t} \rangle + \langle \nabla_{i,t}, y_{i,t} - \hat{x}_{i,t+1} \rangle \tag{19}$$
for any $i \in [n]$. We now need to bound each of the three terms on the right hand side of (19). Starting with the last term and using boundedness of gradients (Assumption 1), we have that
$$\langle \nabla_{i,t}, y_{i,t} - \hat{x}_{i,t+1} \rangle \le \| y_{i,t} - \hat{x}_{i,t+1} \| \, \| \nabla_{i,t} \|_* \le L \| y_{i,t} - \hat{x}_{i,t+1} \| \le \frac{1}{2\eta_t} \| y_{i,t} - \hat{x}_{i,t+1} \|^2 + \frac{\eta_t}{2} L^2, \tag{20}$$
where the last step is due to the AM-GM inequality. Next, we recall update (8b) to bound the second term in (19) using Assumptions 1 and 2 as
$$\begin{aligned}
\langle \nabla_{i,t}, x_{i,t} - y_{i,t} \rangle &= \langle \nabla_{i,t}, x_{i,t} - \bar{x}_t + \bar{x}_t - y_{i,t} \rangle \\
&= \langle \nabla_{i,t}, x_{i,t} - \bar{x}_t \rangle + \sum_{j=1}^{n} [W]_{ij} \langle \nabla_{i,t}, \bar{x}_t - x_{j,t} \rangle \\
&\le L \| x_{i,t} - \bar{x}_t \| + L \sum_{j=1}^{n} [W]_{ij} \| x_{j,t} - \bar{x}_t \| \\
&\le 2L^2\sqrt{n} \sum_{\tau=0}^{t-1} \eta_\tau\, \sigma_2^{t-\tau-1}(W),
\end{aligned} \tag{21}$$
where in the last line we appealed to Lemma 1. Finally, we apply Lemma 6 to update (8a), with $x^\star_t$ in the role of $d$, to get
$$\begin{aligned}
\langle \nabla_{i,t}, \hat{x}_{i,t+1} - x^\star_t \rangle &\le \frac{1}{\eta_t} \mathcal{D}_{\mathcal{R}}(x^\star_t, y_{i,t}) - \frac{1}{\eta_t} \mathcal{D}_{\mathcal{R}}(x^\star_t, \hat{x}_{i,t+1}) - \frac{1}{\eta_t} \mathcal{D}_{\mathcal{R}}(\hat{x}_{i,t+1}, y_{i,t}) \\
&\le \frac{1}{\eta_t} \mathcal{D}_{\mathcal{R}}(x^\star_t, y_{i,t}) - \frac{1}{\eta_t} \mathcal{D}_{\mathcal{R}}(x^\star_t, \hat{x}_{i,t+1}) - \frac{1}{2\eta_t} \| \hat{x}_{i,t+1} - y_{i,t} \|^2,
\end{aligned} \tag{22}$$
since the Bregman divergence satisfies $\mathcal{D}_{\mathcal{R}}(x, y) \ge \frac{1}{2} \| x - y \|^2$ for any $x, y \in \mathcal{X}$. Substituting (20), (21), and (22) into the bound (19), we derive
$$f_{i,t}(x_{i,t}) - f_{i,t}(x^\star_t) \le \frac{\eta_t}{2} L^2 + 2L^2\sqrt{n} \sum_{\tau=0}^{t-1} \eta_\tau\, \sigma_2^{t-\tau-1}(W) + \frac{1}{\eta_t} \mathcal{D}_{\mathcal{R}}(x^\star_t, y_{i,t}) - \frac{1}{\eta_t} \mathcal{D}_{\mathcal{R}}(x^\star_t, \hat{x}_{i,t+1}). \tag{23}$$
Summing over $t \in [T]$ and $i \in [n]$, and applying Lemma 2 completes the proof.

D. Proof of Theorem 3
To bound the regret defined in (4), we start with
$$\begin{aligned}
f_t(x_{i,t}) - f_t(x^\star_t) &= f_t(x_{i,t}) - f_t(\bar{x}_t) + f_t(\bar{x}_t) - f_t(x^\star_t) \\
&\le L \| x_{i,t} - \bar{x}_t \| + f_t(\bar{x}_t) - f_t(x^\star_t) \\
&= \frac{1}{n} \sum_{j=1}^{n} f_{j,t}(\bar{x}_t) - \frac{1}{n} \sum_{j=1}^{n} f_{j,t}(x^\star_t) + L \| x_{i,t} - \bar{x}_t \|,
\end{aligned}$$
where we used the Lipschitz continuity of $f_t(\cdot)$ (Assumption 1) in the second line. Using the Lipschitz continuity of $f_{j,t}(\cdot)$ for $j \in [n]$, we simplify the above as follows
$$f_t(x_{i,t}) - f_t(x^\star_t) \le \frac{1}{n} \sum_{j=1}^{n} f_{j,t}(x_{j,t}) - \frac{1}{n} \sum_{j=1}^{n} f_{j,t}(x^\star_t) + L \| x_{i,t} - \bar{x}_t \| + \frac{L}{n} \sum_{j=1}^{n} \| x_{j,t} - \bar{x}_t \|. \tag{24}$$
Summing over $t \in [T]$ and $i \in [n]$, and applying Lemmata 1 and 7 completes the proof.

E. Proof of Theorem 5
We need to rework the proof of Theorem 3 using stochastic gradients by tracking the changes. Following the lines in the proof of Lemma 1, equation (12) will be changed to
$$\| y_{i,t} - \hat{x}_{i,t+1} \| \le \eta_t \| \tilde{\nabla}_{i,t} \|_*,$$
yielding
$$\| x_{i,t+1} - \bar{x}_{t+1} \| \le \sqrt{n} \sum_{\tau=0}^{t} \eta_\tau \| \tilde{\nabla}_{i,\tau} \|_*\, \sigma_2^{t-\tau}(W). \tag{25}$$
On the other hand, at the beginning of Lemma 7, we should use the stochastic gradient as
$$\begin{aligned}
f_{i,t}(x_{i,t}) - f_{i,t}(x^\star_t) &\le \langle \nabla_{i,t}, x_{i,t} - x^\star_t \rangle \\
&= \langle \tilde{\nabla}_{i,t}, x_{i,t} - x^\star_t \rangle + \langle \nabla_{i,t} - \tilde{\nabla}_{i,t}, x_{i,t} - x^\star_t \rangle \\
&= \langle \tilde{\nabla}_{i,t}, \hat{x}_{i,t+1} - x^\star_t \rangle + \langle \tilde{\nabla}_{i,t}, x_{i,t} - y_{i,t} \rangle + \langle \tilde{\nabla}_{i,t}, y_{i,t} - \hat{x}_{i,t+1} \rangle + \langle \nabla_{i,t} - \tilde{\nabla}_{i,t}, x_{i,t} - x^\star_t \rangle.
\end{aligned}$$
Moreover, as in Lemma 1, any bound involving $L$, which was originally an upper bound on the exact gradient, must be replaced by the norm of the stochastic gradient, which changes inequality (23) to
$$\begin{aligned}
f_{i,t}(x_{i,t}) - f_{i,t}(x^\star_t) \le{}& \frac{\eta_t}{2} \| \tilde{\nabla}_{i,t} \|_*^2 + 2\sqrt{n} \sum_{\tau=0}^{t-1} \eta_\tau \| \tilde{\nabla}_{i,\tau} \|_*^2\, \sigma_2^{t-\tau-1}(W) \\
&+ \frac{1}{\eta_t} \mathcal{D}_{\mathcal{R}}(x^\star_t, y_{i,t}) - \frac{1}{\eta_t} \mathcal{D}_{\mathcal{R}}(x^\star_t, \hat{x}_{i,t+1}) + \langle \nabla_{i,t} - \tilde{\nabla}_{i,t}, x_{i,t} - x^\star_t \rangle.
\end{aligned}$$
Then taking expectation of the above, since
$$\mathbb{E}\left[ \langle \nabla_{i,t} - \tilde{\nabla}_{i,t}, x_{i,t} - x^\star_t \rangle \right] = \mathbb{E}\left[ \mathbb{E}\left[ \langle \nabla_{i,t} - \tilde{\nabla}_{i,t}, x_{i,t} - x^\star_t \rangle \,\middle|\, \mathcal{F}_{t-1} \right] \right] = \mathbb{E}\left[ \left\langle \mathbb{E}\left[ \nabla_{i,t} - \tilde{\nabla}_{i,t} \,\middle|\, \mathcal{F}_{t-1} \right], x_{i,t} - x^\star_t \right\rangle \right] = 0,$$
using condition (9), we get
$$\mathbb{E}[f_{i,t}(x_{i,t})] - f_{i,t}(x^\star_t) \le \frac{\eta_t}{2} \mathbb{E}\left[ \| \tilde{\nabla}_{i,t} \|_*^2 \right] + 2\sqrt{n} \sum_{\tau=0}^{t-1} \eta_\tau\, \mathbb{E}\left[ \| \tilde{\nabla}_{i,\tau} \|_*^2 \right] \sigma_2^{t-\tau-1}(W) + \mathbb{E}\left[ \frac{1}{\eta_t} \mathcal{D}_{\mathcal{R}}(x^\star_t, y_{i,t}) - \frac{1}{\eta_t} \mathcal{D}_{\mathcal{R}}(x^\star_t, \hat{x}_{i,t+1}) \right].$$
Summing over $i \in [n]$ and $t \in [T]$, we apply the bounded second moment condition (9) and Lemma 2 to get the same bound as in Lemma 7, except for $L$ being replaced by $G$. Then the proof is finished once we return to (24).

REFERENCES
[1] D. Li, K. D. Wong, Y. H. Hu, and A. M. Sayeed, “Detection, classification, and tracking of targets,”
IEEE Signal Processing Magazine, vol. 19, no. 2, pp. 17–29, 2002.
[2] M. Rabbat and R. Nowak, “Distributed optimization in sensor networks,” in Proceedings of the 3rd international symposium on Information processing in sensor networks. ACM, 2004, pp. 20–27.
[3] L. Xiao, S. Boyd, and S.-J. Kim, “Distributed average consensus with least-mean-square deviation,” Journal of Parallel and Distributed Computing, vol. 67, no. 1, pp. 33–46, 2007.
[4] V. Lesser, C. L. Ortiz Jr, and M. Tambe, Distributed sensor networks: A multiagent perspective. Springer Science & Business Media, 2012, vol. 9.
[5] S. Shahrampour, A. Rakhlin, and A. Jadbabaie, “Distributed detection: Finite-time analysis and impact of network topology,” IEEE Transactions on Automatic Control, vol. 61, 2016.
[6] A. Nedić, A. Olshevsky, and C. A. Uribe, “Fast convergence rates for distributed non-bayesian learning,” arXiv preprint arXiv:1508.05161, 2015.
[7] L. Qipeng, Z. Jiuhua, and W. Xiaofan, “Distributed detection via bayesian updates and consensus,” in . IEEE, 2015, pp. 6992–6997.
[8] J. N. Tsitsiklis, “Problems in decentralized decision making and computation,” Ph.D. dissertation, Massachusetts Institute of Technology, 1984.
[9] J. N. Tsitsiklis, D. P. Bertsekas, and M. Athans, “Distributed asynchronous deterministic and stochastic gradient optimization algorithms,” in American Control Conference (ACC), 1984, pp. 484–489.
[10] D. P. Bertsekas and J. N. Tsitsiklis, Parallel and distributed computation: numerical methods. Prentice Hall, Englewood Cliffs, NJ, 1989, vol. 23.
[11] A. Nedic and A. Ozdaglar, “Distributed subgradient methods for multi-agent optimization,” IEEE Transactions on Automatic Control, vol. 54, no. 1, pp. 48–61, 2009.
[12] B. Johansson, M. Rabi, and M. Johansson, “A randomized incremental subgradient method for distributed optimization in networked systems,” SIAM Journal on Optimization, vol. 20, no. 3, pp. 1157–1170, 2009.
[13] S. S. Ram, A. Nedić, and V. V. Veeravalli, “Distributed stochastic subgradient projection algorithms for convex optimization,” Journal of optimization theory and applications, vol. 147, no. 3, pp. 516–545, 2010.
[14] J. C. Duchi, A. Agarwal, and M. J. Wainwright, “Dual averaging for distributed optimization: convergence analysis and network scaling,” IEEE Transactions on Automatic Control, vol. 57, no. 3, pp. 592–606, 2012.
[15] M. Zhu and S. Martínez, “On distributed convex optimization under inequality and equality constraints,” IEEE Transactions on Automatic Control, vol. 57, no. 1, pp. 151–164, 2012.
[16] D. Jakovetić, J. Xavier, and J. M. Moura, “Fast distributed gradient methods,” IEEE Transactions on Automatic Control, vol. 59, no. 5, pp. 1131–1146, 2014.
[17] W. Shi, Q. Ling, G. Wu, and W. Yin, “Extra: An exact first-order algorithm for decentralized consensus optimization,” SIAM Journal on Optimization, vol. 25, no. 2, pp. 944–966, 2015.
[18] S. Shalev-Shwartz, “Online learning and online convex optimization,” Foundations and Trends in Machine Learning, vol. 4, no. 2, pp. 107–194, 2011.
[19] M. Zinkevich, “Online convex programming and generalized infinitesimal gradient ascent,” International Conference on Machine Learning (ICML), 2003.
[20] E. C. Hall and R. M. Willett, “Online convex optimization in dynamic environments,” IEEE Journal of Selected Topics in Signal Processing, vol. 9, no. 4, pp. 647–662, 2015.
[21] O. Besbes, Y. Gur, and A. Zeevi, “Non-stationary stochastic optimization,” Operations Research, vol. 63, no. 5, pp. 1227–1244, 2015.
[22] A. Jadbabaie, A. Rakhlin, S. Shahrampour, and K. Sridharan, “Online optimization: Competing with dynamic comparators,” in Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, 2015, pp. 398–406.
[23] R. Olfati-Saber, “Distributed kalman filtering for sensor networks,” in IEEE Conference on Decision and Control, 2007, pp. 5492–5498.
[24] D. Gu, “Distributed particle filter for target tracking,” in IEEE International Conference on Robotics and Automation, 2007, pp. 3856–3861.
[25] D. Yudin and A. Nemirovskii, “Problem complexity and method efficiency in optimization,” 1983.
[26] A. Beck and M. Teboulle, “Mirror descent and nonlinear projected subgradient methods for convex optimization,” Operations Research Letters, vol. 31, no. 3, pp. 167–175, 2003.
[27] H. H. Bauschke and J. M. Borwein, “Joint and separate convexity of the bregman distance,” Studies in Computational Mathematics, vol. 8, pp. 23–36, 2001.
[28] J. Li, G. Chen, Z. Dong, and Z. Wu, “Distributed mirror descent method for multi-agent optimization with delay,” Neurocomputing, vol. 177, pp. 643–650, 2016.
[29] J. Li, G. Chen, Z. Dong, Z. Wu, and M. Yao, “Distributed mirror descent method for saddle point problems over directed graphs,” Complexity, 2016.
[30] M. Rabbat, “Multi-agent mirror descent for decentralized stochastic optimization,” in Computational Advances in Multi-Sensor Adaptive Processing (CAMSAP), 2015 IEEE 6th International Workshop on. IEEE, 2015, pp. 517–520.
[31] M. Raginsky and J. Bouvrie, “Continuous-time stochastic mirror descent on a network: Variance reduction, consensus, convergence,” in IEEE Conference on Decision and Control (CDC), 2012, pp. 6793–6800.
[32] S. Hosseini, A. Chapman, and M. Mesbahi, “Online distributed optimization via dual averaging,” in IEEE Conference on Decision and Control (CDC), 2013, pp. 1484–1489.
[33] D. Mateos-Núñez and J. Cortés, “Distributed online convex optimization over jointly connected digraphs,” IEEE Transactions on Network Science and Engineering, vol. 1, no. 1, pp. 23–37, 2014.
[34] A. Nedić, S. Lee, and M. Raginsky, “Decentralized online optimization with global objectives and local communication,” in IEEE American Control Conference (ACC), 2015, pp. 4497–4503.
[35] M. Akbari, B. Gharesifard, and T. Linder, “Distributed online convex optimization on time-varying directed graphs,” IEEE Transactions on Control of Network Systems, 2015.
[36] A. Mokhtari, S. Shahrampour, A. Jadbabaie, and A. Ribeiro, “Online optimization in dynamic environments: Improved regret rates for strongly convex problems,” arXiv preprint arXiv:1603.04954, 2016.
[37] C.-J. Lee, C.-K. Chiang, and M.-E. Wu, “Resisting dynamic strategies in gradually evolving worlds,” in Third International Conference on Robot, Vision and Signal Processing (RVSP). IEEE, 2015, pp. 191–194.
[38] M. Fazlyab, S. Paternain, V. M. Preciado, and A. Ribeiro, “Interior point method for dynamic constrained optimization in continuous time,” in IEEE American Control Conference (ACC), July 2016, pp. 5612–5618.
[39] T. Yang, L. Zhang, R. Jin, and J. Yi, “Tracking slowly moving clairvoyant: Optimal dynamic regret of online learning with true and noisy gradient,” International Conference on Machine Learning (ICML), 2016.
[40] R. A. Horn and C. R. Johnson, Matrix analysis. Cambridge university press, 2012.
[41] Y. Bar-Shalom, Tracking and data association. Academic Press Professional, Inc., 1987.
[42] F. S. Cattivelli and A. H. Sayed, “Diffusion strategies for distributed kalman filtering and smoothing,” IEEE Transactions on automatic control, vol. 55, no. 9, pp. 2069–2084, 2010.
[43] U. Khan, S. Kar, A. Jadbabaie, J. M. Moura et al., “On connectivity, observability, and stability in distributed estimation,” in IEEE Conference on Decision and Control (CDC), 2010, pp. 6639–6644.
[44] S. Das and J. M. Moura, “Distributed state estimation in multi-agent networks,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2013, pp. 4246–4250.
[45] D. Han, Y. Mo, J. Wu, S. Weerakkody, B. Sinopoli, and L. Shi, “Stochastic event-triggered sensor schedule for remote state estimation,” IEEE Transactions on Automatic Control, vol. 60, no. 10, pp. 2661–2675, 2015.
[46] O. Hlinka, O. Sluciak, F. Hlawatsch, P. M. Djuric, and M. Rupp, “Likelihood consensus and its application to distributed particle filtering,” IEEE Transactions on Signal Processing, vol. 60, no. 8, pp. 4334–4349, 2012.
[47] J. Li and A. Nehorai, “Distributed particle filtering via optimal fusion of gaussian mixtures,” in Information Fusion (Fusion), 2015 18th International Conference on. IEEE, 2015, pp. 1182–1189.
[48] D. Acemoglu, A. Nedić, and A. Ozdaglar, “Convergence of rule-of-thumb learning rules in social networks,” in IEEE Conference on Decision and Control (CDC), 2008, pp. 1714–1720.
[49] S. Shahrampour, S. Rakhlin, and A. Jadbabaie, “Online learning of dynamic parameters in social networks,” in Advances in Neural Information Processing Systems, 2013.
[50] S. Kar, J. M. Moura, and K. Ramanan, “Distributed parameter estimation in sensor networks: Nonlinear observation models and imperfect communication,” IEEE Transactions on Information Theory, vol. 58, no. 6, pp. 3575–3605, 2012.
[51] S. Shahrampour, A. Rakhlin, and A. Jadbabaie, “Distributed estimation of dynamic parameters: Regret analysis,” in