Distributed Networked Real-time Learning
DDistributed Networked Real-time Learning
Alfredo Garcia ∗ , Luochao Wang, Jeff Huang, and Lingzhou Hong † Texas A&M University
Abstract
Many machine learning algorithms have been developed under the assumption that datasets are already available in batch form. Yet in many application domains data is only availablesequentially overtime via compute nodes in different geographic locations. In this paper, weconsider the problem of learning a model when streaming data cannot be transferred to a singlelocation in a timely fashion. In such cases, a distributed architecture for learning relying on anetwork of interconnected “local” nodes is required. We propose a distributed scheme in whichevery local node implements stochastic gradient updates based upon a local data stream. Toensure robust estimation, a network regularization penalty is used to maintain a measure ofcohesion in the ensemble of models. We show the ensemble average approximates a stationarypoint and characterize the degree to which individual models differ from the ensemble average.We compare the results with federated learning to conclude the proposed approach is morerobust to heterogeneity in data streams (data rates and estimation quality). We illustratethe results with an application to image classification with a deep learning model based uponconvolutional neural networks. keywords: asynchronous computing, distributed computing,networks, non-convex optimization, real-time machine learning.
Streaming data sets are pervasive in certain application domains often involving a network of com-pute nodes located in different geographic locations. However, most machine learning algorithmshave been developed under the assumption that data sets are already available in batch form. Whenthe data is obtained through a network of heterogeneous compute nodes, assembling a diverse batchof data points in a central processing location to update a model may imply significant latency.Recently, an architecture referred to as federated learning (FL, see e.g. [1, 2]) with a central serverin proximity to local nodes has been proposed. In FL, each node implements updates to a machinelearning model that is kept in the central server. This allows collaborative learning while keepingall the training data on nodes rather than in the cloud. In general, schemes that avoid the need torely on the cloud for data storage and/or computation are referred to as “edge computing”.With high data payloads, such architecture for real-time learning is subject to an accuracy vs speed trade-off due to asymmetries in data quality vs. data rates as we explain in what follows.Consider nodes i ∈ { , . . . , N } generating data points ( x i,n , y i,n ) , n ∈ N + at different rates µ i > θ k , k ∈ N + (striving tominimize loss (cid:96) ). This setting could correspond for example with supervised deep learning in real-time wherein gradient estimates (with noise variance σ i >
0) are computed via back-propagation in ∗ This work was supported by NSF ECCS-1933878 Award and Grant AFOSR-15RT0767. † { alfredo.garcia, wangluochao, jeffhuang,hlz } @tamu.edu a r X i v : . [ m a t h . O C ] S e p igure 1: Performance comparison for Deep Learning on MNIST with learning rate γ = 0 .
01. The95% percentile is depicted with green lines.a relatively fast fashion. Without complete information on σ i >
0, updating the model parametersbased upon every incoming data point yields high speed but possibly at the expense of low accuracy.For example, if the nodes producing noisier estimates are also faster at producing data, it is highlyunlikely that an accurate model will be identified at all.To illustrate this scenario, in Figure 1, we depict the performance of FL for deep convolutionalneural networks with the MNIST dataset. In these simulations, each one of N = 5 nodes sends dataaccording to independent Poisson processes with µ = 8 and µ i = 1, i ∈ { , . . . , } . The fastestnode computes gradient estimates based upon a single image whereas the slower nodes computegradient estimates based upon a batch of 64 images.This trade-off between speed and precision is mitigated in a distributed approach to real-timelearning subject to a network regularization penalty. In such an approach, each one of N > Specifically, we show that the ensemble averagesolution approximates a stationary point and that the approximation quality is O ( (cid:80) Ni =1 σ i N ), whichcompares quite favorably with FL, which is highly sensitive to fast and inaccurate data streams.We illustrate the results with an application to deep learning with convolutional neural networks.The structure of the paper is as follows. In section 2 we introduce the distributed scheme thatcombines stochastic gradient descent with network regularization (NR). In section 3 we analyze thescheme and show that it converges (in a certain sense) to a stationary point, we also compare itsperformance with FL. Finally, in section 4 we report the results from a testbed on deep learningapplication to image processing, and in section 5 we offer conclusions. We consider a setting in which data is made available sequentially overtime via nodes i ∈ { , . . . , N } in different geographic locations. 
We denote the i -th stream by { ( x i,n , y i,n ) : n ∈ N + } and assume Similar network regularization methods have been used in multi-task learning to account for inherent networkstructure in data sets (see e.g. [3–8]). See section D. P i .We also assume the data streams are independent but heterogeneous, i.e. P i (cid:54) = P j , i (cid:54) = j . Eachnode strives to find a parameter specification θ ∈ Θ ⊂ R p that minimizes the performance criteria E P i [ L ( x i , y i ; θ )], where the loss function L ( · ) ≥ θ .Though data is distributed and heterogeneous, we consider a setting in which nodes agree on a com-mon learning task. This is formalized in the first standing assumption. Let g i ( θ ) (cid:44) ∇ θ L ( x i , y i ; θ )denote the gradient evaluated at ( x i , y i ) ∼ P i , and assume g i ( θ ) is uniformly integrable. Assumption 0:
For all θ ∈ Θ , and i ∈ { , . . . , N } : E P i [ g i ( θ )] = E P j [ g j ( θ )] . Let (cid:96) ( θ ) denote the (ensemble) average expected loss: (cid:96) ( θ ) (cid:44) N N (cid:88) i =1 E P i [ L ( x i , y i ; θ )] . By uniform integrability, ∇ θ E P i L ( x i , y i ; θ ) = E P i g i ( θ ). Assumption 0 thus implies that E P i [ g i ( θ )] = (cid:79) (cid:96) ( θ ) for all i and θ .Let ε i ( θ ) (cid:44) g i ( θ ) − (cid:79) (cid:96) ( θ ), then it holds that E [ ε i ( θ )] = 0. We further assume: Assumption 1:
For all θ i ∈ Θ , the random variables { ε i ( θ i ) : i ∈ { , . . . , N }} are independentand E [ (cid:107) ε i ( θ i ) (cid:107) ] ≤ σ i . Define σ = (cid:80) Ni =1 σ i . By independence of data streams: E [ ε i ( θ ) (cid:124) ε j ( θ )] = E [ ε i ( θ )] (cid:124) E [ ε j ( θ )] = 0 , for all θ ∈ Θ , j ∈ { , . . . , N }\ { i } .Streams generate data over time according to independent Poisson processes D i ( t ) with rate µ i > D i (0) = 1. We assume the time required to compute gradient estimates and/orexchange parameters locally among neighbors or with the central server are negligible compared tothe time in between model updates. In what follows we make use of a virtual clock that producesticks according to an aggregate counting process D ( t ) = (cid:80) Ni =1 D i ( t ) with rate µ = (cid:80) Ni =1 µ i . Let k ∈ N + denote the index set of ticks associated with the aggregate process. Since we assume theparameter is updated once a data point arrives, the k -th iteration is completed at the k -th tick.Index k denotes the k -th step in the schemes described below. In FL, gradient estimates are communicated to a central server where a model is updated as follows: θ k +1 = θ k − γ N (cid:88) i =1 i,k g i ( θ k ) , (1)where γ is the learning rate, i,k is a indicator of whether node i performs the an update: i,k = 1if the next gradient estimate comes from the i -th stream and i,k = 0 otherwise.The algorithmic scheme described in (1) was first analyzed in [9] for data in batch form andhas been used in the recent literature on asynchronous parallel optimization algorithms (see forexample [10], [11] and [12]). As Figure 1 suggests, with heterogeneous data streams, the scheme3n (1) trades off speed in producing parameter updates at the expense of heterogeneous noise ingradient estimates. In what follows we introduce a distributed approach that relies on a networkregularization penalty to ensure the ensemble average approximates a stationary point (i.e. 
a choiceof parameters with null gradient). We will show that in such a networked approach the trade-offbetween precision and speed is mitigated. In the NR scheme, we consider a network of local compute nodes which we model as a graph G = ( N , E ), where N = { , . . . , N } stands for the set of nodes and E ⊆ N × N is the set of linksconnecting nodes. Let A = [ α ij ] ∈ R N × N be the adjacency matrix of G , where α ij ∈ { , } indicateswhether node i communicates with node j : α ij = 1 if two nodes can exchange local informationand α ij = 0 otherwise.In this scheme, each local node i performs model updates according to a linear combination of localgradient estimate and the gradient of a consensus potential: F ( θ ) = 14 (cid:88) i (cid:88) j (cid:54) = i α ij (cid:107) θ i − θ j (cid:107) , where θ (cid:124) t = ( θ (cid:124) ,t , . . . , θ (cid:124) N,t ) ∈ R p × N . The consensus potential is a measure of similarity across localmodels. The update performed by node i is of the form: θ i,k +1 = θ i,k − γ i,k [ g i ( θ i,k ) + a (cid:79) F i,k ] , (2)where a > (cid:79) F i,k (cid:44) ∇ θ i F ( θ k ) = (cid:88) j (cid:54) = i α ij ( θ i,k − θ j,k ) . Note that the basic iterate (2) can be interpreted as a stochastic gradient approach to solve thelocal problem: min θ i [ E P i ( θ i ) + a F ( θ )] , in which the objective function is a linear combination of loss and consensus potential. When a = 0, each local node ignores the neighboring models. For large values of a >
0, the resultingdynamics reflect the countervailing effects of seeking to minimize consensus potential and improvingmodel fit. With highly dissimilar initial models, each local node largely ignores its own data andopts for updates that lead to a model that is similar to the local average. Once approximateconsensus is achieved, local gradient estimates begin to dictate the dynamics of model updates.In what follows it will be convenient to rewrite (2) as follows: θ i,k +1 = θ i,k − γ i,k [ (cid:79) (cid:96) ( θ i,k ) + a (cid:79) F i,k + ε i,k ] . (3)Given that local nodes independently update and maintain their own parameters, the networkregularized scheme is not subject to the possibility of biased gradient estimates stemming fromupdate delays in FL (see [16]). This consensus potential has been used in the literature of opinion dynamics (see e.g. [13]). This interpretation is not novel (see e.g. [14], [15] for its use in swarm (flocking) optimization and in multi-tasklearning [4–8]). .4 Literature Review The scheme proposed in (2) has already been considered in the machine learning literature. In aseries of papers (see [4–8])), the authors consider an approach to multi-task learning based upon anetwork regularization penalty as in (2). This paper focuses on distributed single-task learning. Incontrast to the papers referred above, we consider a non-convex setting with heterogeneous nodesasynchronously updating their respective models at different rates over time.The scheme proposed in (2) is also related to the literature on consensus optimization (see e.g. [17],[11], [18]). However, the proposed approach can not be interpreted as being based upon averagingover local models as in consensus-based optimization. In that literature, the basic iteration is ofthe form: θ i,k +1 = (cid:88) j W i,j,k θ j,k − γg ( θ i,k ) , where W k ∈ R N × N is doubly stochastic and g ( θ i,k ) is a noisy gradient estimate. 
Indeed one canrewrite (2) as: θ i,k +1 = (cid:88) j W i,j θ j,k − γ i,k g i ( θ i,k ) , with W i,i = 1 − γa (cid:80) j α i,j and W i,j = γa (cid:80) j α i,j . However, the resulting matrix W is not doublystochastic in general since we only require a >
0. Thus, the approach to consensus in (2) can notbe interpreted as being based upon averaging over local models as in consensus-optimization.The algorithms proposed in [11] and [18] are designed for batch data while our approach deals with streaming data. For example, in [11], each node uses the same mini-batch size for estimating gradi-ents while in our approach gradient estimation noise is heterogeneous . In addition, the algorithmsproposed in [11] and [18], every node is equally likely to be selected at each iteration to update itslocal model. In contrast, in our approach data streams are heterogeneous so that certain nodes are more likely to update their models at any given time. Finally, in [11] the objective function (loss)is defined with respect to a distribution that is biased towards the nodes that update more often.This is in contrast to the objective function defined in this paper (i.e. (cid:96) ( θ )), where every nodecontributes to the global distribution with the same weight regardless of their updating frequency. In this section, we show the NR scheme converges (in a certain sense) to a stationary point. Tothat end we study stochastic processes { θ i,k : k > } associated with each one of the N >
Assumption 2 : The graph G corresponding to the network of nodes is undirected ( A = A (cid:124) ) andconnected, i.e., there is a path between every pair of vertices. Assumption 3 (Lipschitz) (cid:107) (cid:79) (cid:96) ( θ ) − (cid:79) (cid:96) ( θ (cid:48) ) (cid:107) ≤ L (cid:107) θ − θ (cid:48) (cid:107) for some L > and for all θ, θ (cid:48) . The ensemble average ¯ θ k (cid:44) N (cid:80) Ni =1 θ i,k plays an important role in characterizing the performanceof the network regularized scheme. To this end, we analyze the process { V k : k > } defined as V k (cid:44) N N (cid:88) i =1 (cid:107) θ i,k − ¯ θ k (cid:107) . e i,k (cid:44) θ i,k − ¯ θ k and V i,k (cid:44) (cid:107) e i,k (cid:107) , then V k = N (cid:80) Ni =1 V i,k . We now introduce some additionalnotations. Let deg( i ) denote the degree of vertex i in graph G and d := max i deg( i ). Let E [ V k +1 | θ k ]denote the conditional expectation of V k +1 given θ k . We define µ max = max { µ i : 1 ≤ i ≤ N } and µ min = min { µ i : 1 ≤ i ≤ N } . We first prove two intermediate results. Lemma 1
Suppose Assumptions 0, 1 and 2 hold. It holds that V k +1 = V k − N N (cid:88) i =1 γe (cid:124) i,k i,k [ (cid:79) (cid:96) ( θ i,k ) + a (cid:79) F i,k ] − N N (cid:88) i =1 γe (cid:124) i,k ε i,k i,k + 1 N N (cid:88) i =1 γ (cid:107) δ i,k (cid:107) , where δ i,k = δ fi,k + δ gi,k + δ ni,k , and δ fi,k (cid:44) (cid:79) (cid:96) ( θ i,k ) i,k − (cid:79) ¯ (cid:96) k , (cid:79) ¯ (cid:96) k (cid:44) N N (cid:88) j =1 (cid:79) (cid:96) ( θ j,k ) j,k ,δ gi,k (cid:44) a ( (cid:79) F i,k i,k − (cid:79) ¯ F k ) , δ ni,k (cid:44) ε i,k i,k − N N (cid:88) j =1 ε j,k j,k , (cid:79) ¯ F k (cid:44) N N (cid:88) j =1 (cid:79) F j,k j,k . Lemma 2
Suppose Assumptions 0, 1, 2 and 3 hold. Let ξ = µ max /µ min , then: E [ V k +1 | θ k ] ≤ (1 + κγN ) V k + 4 γ ξN (cid:13)(cid:13) (cid:79) (cid:96) (¯ θ k ) (cid:13)(cid:13) + γ ξσ N , where λ denotes the second-smallest of the Laplacian associated with graph G and κ = 2( Lµ min − aλ µ max ) + 4 γξN ( L + 2 a d ) . We are now ready to state and prove the main theorem.As in [19], convergence is described in terms of the expected value of the average squared norm ofthe gradient in the first K -updates. The ensuing corollary goes into further detail by describingthe same result in terms of real-time elapsed and not just on a total number of iterations. Theorem 1 : Suppose Assumptions 0, 1, 2 and 3 hold. Choose γ < min { ¯ γ , ¯ γ } , where ¯ γ = N aλ µ max − L (2 µ min + ξ/ ξ ( L + 2 a d ) and ¯ γ = 14 L (2 N + 1)6 re positive by choosing a > µ min L + ξL λ µ max . With scheme (2), it holds that E (cid:34) K K − (cid:88) k =0 E [ (cid:107) (cid:79) (cid:96) (¯ θ k ) (cid:107) ] (cid:35) ≤ ηK (cid:20) (cid:96) (¯ θ ) + LV + KLγ ξσ N (1 + 12 N ) (cid:21) , where η = γξN (cid:16) − γL (2 + N ) (cid:17) . The regularization penalty parameter a must be high enough to ensure cohesion between localmodels. This condition is weaker with a higher degree of connectivity (i.e. higher values of λ ).Note also that for fixed N >
0, when a → ∞ , then γ ∝ /a . So convergence, as characterizedby Theorem 1, may be slower. This is not necessarily the case since the conditions in Theorem1 identify a wide range of choices for a and γ . For example, simulations indicate that for fixed γ higher values of a may speed up convergence (see Figure 3 (c)). The analysis in Theorem 1 takes place in the time scale indexed by k > µ >
0. To embed the result in Theorem 1 in real-time , recall that { D ( t ) : t ≥ } is the counting process governing the aggregation of all datastreams. Given our assumption on computation times being negligible, the total number of updatescompleted in [0 , t ) is also D ( t ). Let us define the conditional average squared gradient norm (cid:107) ¯ (cid:79) (cid:96) t (cid:107) in the interval [0 , t ) as follows: E [ (cid:107) ¯ (cid:79) (cid:96) t (cid:107) | D ( t )] (cid:44) D ( t ) D ( t ) (cid:88) k =1 (cid:107) (cid:79) (cid:96) (¯ θ k ) (cid:107) . (4)Hence, the result in Theorem 1 can be reinterpreted by taking expectation of (4) over D ( t ) as: E [ (cid:107) ¯ (cid:79) (cid:96) t (cid:107) ] = E (cid:104) E [ (cid:107) ¯ (cid:79) (cid:96) t (cid:107) | D ( t )] (cid:105) ≤ E (cid:104) ηD ( t ) (cid:0) (cid:96) (¯ θ ) + LV (cid:1) + Lγ ξσ ηN (cid:0) N (cid:1)(cid:105) = ( (cid:96) (¯ θ ) + LV )(1 − e − µt ) ηµt + Lγ ξσ ηN (1 + 12 N ) . According to Theorem 1, and using γ ∼ N , the coupling of solutions via the network regularizationpenalty implies the ensemble average approximates a stationary point in the sense that:lim sup t →∞ E [ (cid:107) ¯ (cid:79) (cid:96) t (cid:107) ] = O ( σ N ) . The approximation quality is monotonically increasing in the number of nodes. The convergenceproperties outlined above are related to the ensemble average. It is, therefore, necessary to examinethe degree to which individual models differ from the ensemble average. This is the gist of the nextresult. 7 orollary 1 : With the same assumptions and definitions in Theorem 1, it holds that E (cid:34) K K − (cid:88) k =0 V k (cid:35) ≤ K | κ | (cid:104)(cid:0) Nγ + 4 Lγξη (cid:1) V + 4 γξη l (¯ θ ) (cid:105) + 4 Lγ ξ σ η | κ | N (1 + 12 N ) + γξσ | κ | N .
We embed the result in Corollary 1 in real-time . Define the conditional average of ¯ V k in theinterval [0 , t ) as E [ ¯ V t | D ( t )] (cid:44) D ( t ) D ( t ) (cid:88) k =1 ¯ V k . The random process { ¯ V t : t > } tracks the average distance of individual models to the ensembleaverage. Similar to the discussion of Theorem 1, the real-time result of Corollary 1 is as follows: E [ ¯ V t ] = E (cid:104) E [ ¯ V t | D ( t )] (cid:105) ≤ D ( t ) | κ | (cid:104)(cid:0) Nγ + 4 Lγξη (cid:1) V + 4 γξη l (¯ θ ) (cid:105) + 4 Lγ ξ σ η | κ | N (1 + 12 N ) + γξσ | κ | N = 1 − e − µt µt | κ | (cid:104)(cid:0) Nγ + 4 Lγξη (cid:1) V + 4 γξη l (¯ θ ) (cid:105) + + 4 Lγ ξ σ η | κ | N (1 + 12 N ) + γξσ | κ | N .
This implies the asymptotic difference between individual models and the ensemble average satisfies:lim sup t →∞ E [ ¯ V t ] = O ( σ N ) . The network regularization parameter a >
N >
0, when a → ∞ , then γ, η ∝ /a and | κ | ∝ a , it followsthat E (cid:104) K (cid:80) K − k =0 V k (cid:105) ∝ /a . Hence, the upper bound in Corollary 1 can be made arbitrarily smallby choosing large enough a . We now present the counterpart convergence result regarding to FL.
Proposition 1 : Suppose Assumptions 0, 1, 2 and 3 hold. For scheme 1, with a choice γ ∈ (0 , L ) ,it holds that: E (cid:104) K K − (cid:88) k =0 (cid:107) (cid:79) (cid:96) ( θ k ) (cid:107) (cid:105) ≤ (cid:96) ( θ )˜ ηK + Lγ η N (cid:88) i =1 µ i µ σ i , with ˜ η = γ (1 − Lγ ) . To embed the process in Proposition 1 in real-time , let us define the average squared gradient norm (cid:107) (cid:79) ˜ (cid:96) t (cid:107) in the interval [0 , t ) as follows: E [ (cid:107) (cid:79) ˜ (cid:96) t (cid:107) | D ( t )] (cid:44) D ( t ) D ( t ) (cid:88) k =1 (cid:107) (cid:79) (cid:96) ( θ k ) (cid:107) . (5)8ence, the result in Proposition 1 can be reinterpreted by taking expectation of (5) over D ( t ) as: E [ (cid:107) (cid:79) ˜ (cid:96) t (cid:107) ] = E (cid:104) E [ (cid:107) (cid:79) ˜ (cid:96) t (cid:107) | D ( t )] (cid:105) ≤ E (cid:104) (cid:96) ( θ )˜ ηD ( t ) + Lγ η N (cid:88) i =1 µ i µ σ i (cid:105) = (cid:96) ( θ )(1 − e − µt )˜ ηµt + Lγ η N (cid:88) i =1 µ i µ σ i . To compare FL with NR, we also make γ ∼ N . The asymptotic approximation quality is given by:lim sup t →∞ E [ (cid:107) (cid:79) ˜ (cid:96) t (cid:107) ] = O ( 1 N (cid:88) i µ i µ σ i ) , which suggests that the approximation quality is determined by the faster data streams. This leadsto unsatisfactory performance whenever µ i ∝ σ i (i.e. faster data streams are also less accurate).Evidently, the opposite holds true when faster nodes are also more accurate, i.e. µ i ∝ /σ i . How-ever, in many real-time machine learning applications, this is not likely to be the case. Obtaininghigher precision gradient estimates requires larger batches and/or increased computation. Thusnodes with higher precision are less likely to be the faster ones. Deep
Learning
In this section, we report the results of NR (scheme (2)) to distributed real-time learning fromthree aspects: the comparison with FL (scheme (1)), the effects of the regularization parameter a ,and the effects of the network connectivity.The specific learning task is to classify handwritten digits between 0 and 9 digits as given inthe MNIST data set [20]. The dataset is composed of 10000 testing items and 60 ,
000 trainingitems. Each item in the dataset is a black-and-white (single-channel) image of 28 ×
28 pixels of ahandwritten digit between 0 and 9.In the first two experiments, we implement schemes in a heterogeneous setting with 5 nodes, andthe third experiment with 20 nodes in a homogeneous setting. In the test-bed MNIST streamsaccording to independent Poisson processes. Gradient estimates are obtained with different mini-batch sizes. Evidently, a smaller mini-batch size implies noisier gradient estimates. The detailedexperimental settings are summarized in Table 1. In the heterogeneous setting, “node 0” is thefastest and noisiest in producing gradient estimates.Setting Stream ID µ i D D − D Table 1.The experiment hyperparameters of the two settings, including thedata stream ID (Stream ID), number of nodes involved ( µ i ). We use the
Ray platform (see [21]) which is a popular library with shared memory supported,allowing information exchange between local nodes without copying as well as avoiding a centralbottleneck. For low-level computation, Google TensorFlow is used. We use a Convolutional NeuralNetwork (CNN) with two 2D Convolutions each with kernel size 5 ×
5, stride 1 and 32, 64 filters.Each convolution layer is followed by a Max-pooling with a 2 × . Details on the implementation are available at: https://github.com/wangluochao902/Network-Regularized-Approach .We present the experimental results in mean plots with stand error bar. The means are computedacross 10 trials under the same hyper-parameters (namely, γ and a ). In this experiment, we compare NR with FL in the heterogeneous setting. In Figure 2, we plot themeans of the ensemble average of NR and FL with different learning rates. (a) (b)
Figure 2. The mean plot of ensemble average computed under the schemes ofNR and FL in heterogeneous setting. The parameter a is set to 10 and thenetwork is fully connected. The learning rate γ is set to 0 .
002 in (a) and 0 . We can observe from Figure (a) that when the learning rate is moderate, both FL and NR canconverge, but the empirical standard deviation of FL is much larger than that of NR. With increased γ , FL fails to converge while NR still performs relatively well, as shown in Figure (b). We can seethat NR is more robust with respect to the learning rate. In this experiment, we look at the effects of changing the regularization parameter a . In Figure 3,we present the means of each node as well as the ensemble average.As we increase a from 1 to 10, we can observe from Figure 3 (a) and (b) that the consensus amongnodes increases and the empirical mean standard deviation of the “node 0” decreases. As presentedin Corollary 1, the regularization parameter a influences the degree of similarity between individualmodels and the ensemble average. Note that we only identify a range of values for a (lower bound)and γ (upper bound) for which convergence is guaranteed so that a higher value of a does notnecessarily imply slower convergence, as shown in Figure 3 (c). With max-pooling the loss function is not differentiable in a set of measure zero. If in the course of execution anon-differentiable point is encountered, Tensorflow assumes a zero derivative a) (b)(c) Figure 3. The mean plot of each node computed under the scheme of NR inheterogeneous setting. The parameter γ is set to 0 .
01 and the network is fullyconnected. The regularization parameter a is set to 1 in (a) and 10 in (b). Themean plot of the ensemble average under two choices of a is presented in (c). .3 The Effects of Network Connectivity In the third experiment, we check the effect of increased connectivity in the homogeneous settingby using a Watts-Strogatz “small world” topology (see [22]), in which each node is connected with2 (or 8) nearest neighbors. (a)
Figure 4. The mean plot of ensemble average com-puted under the scheme of NR in the homogeneoussetting. The parameter a is set to 10 and the learningrate γ is set to 0 . We can see from Figure 4 that increasing the connectivity of the topology only improves theperformance slightly, meaning that only a limited connectivity is needed for the network regularizedapproach to enjoy a satisfactory rate of convergence.
In many application domains, data streams through a network of heterogeneous nodes in differentgeographic locations. When there is high data payload (e.g. high-resolution video), assemblinga diverse batch of data points in a central processing location in order to update a model entailssignificant latency. In such cases, a distributed architecture for learning relying on a network ofinterconnected “local” nodes may prove advantageous. We have analyzed a distributed scheme inwhich every local node implements stochastic gradient updates every time a data point is obtained.To ensure robust estimation, a local regularization penalty is used to maintain a measure of cohesionin the ensemble of models. We show the ensemble average approximates a stationary point. Theapproximation quality is superior to that of FL, especially when there is heterogeneity in gradientestimation quality. We also show that our approach is robust against changes in the learningrate and network connectivity. We illustrate the results with an application to deep learning withconvolutional neural networks.In future work we plan to study different localized model averaging schemes. A careful selectionof weights for computing local average model ensures a reduction of estimation variance. Thisis motivated by the literature on the optimal combination of forecasts (see [23]). For example,weights minimizing the sample mean square prediction error are of the form ˆ σ − i (cid:80) Nj =1 ˆ σ − j where ˆ σ i isthe estimated mean squared prediction error of the i -th model.12 Appendix
Proof of Lemma 1
Note that ¯ θ k +1 = 1 N N (cid:88) i =1 [ θ i,k − γ i,k [ (cid:79) (cid:96) ( θ i,k ) + a (cid:79) F i,k + ε i,k ]]= ¯ θ k − γN N (cid:88) i =1 (cid:79) (cid:96) ( θ i,k ) i,k − aγN N (cid:88) i =1 (cid:79) F i,k i,k − γN N (cid:88) i =1 ε i,k i,k . Hence, e i,k +1 = θ i,k +1 − ¯ θ k +1 = e i,k − γδ i,k . Then V i,k +1 = ( e i,k − γδ i,k ) (cid:124) ( e i,k − γδ i,k )= e (cid:124) i,k e i,k − γe (cid:124) i,k δ i,k + γ (cid:107) δ i,k (cid:107) = V i,k − γe (cid:124) i,k ( δ fi,k + δ gi,k + δ ni,k ) + γ (cid:107) δ i,k (cid:107) , and V k +1 = V k − γN N (cid:88) i =1 e (cid:124) i,k ( δ fi,k + δ gi,k + δ ni,k ) + γ N N (cid:88) i =1 (cid:107) δ i,k (cid:107) . Finally, note that N (cid:88) i =1 e (cid:124) i,k δ fi,k = N (cid:88) i =1 e (cid:124) i,k (cid:2) (cid:79) (cid:96) ( θ i,k ) i,k − (cid:79) ¯ (cid:96) k (cid:3) = N (cid:88) i =1 e (cid:124) i,k (cid:79) (cid:96) ( θ i,k ) i,k , N (cid:88) i =1 e (cid:124) i,k δ gi,k = a N (cid:88) i =1 e (cid:124) i,k ( (cid:79) F i,k i,k − (cid:79) ¯ F k ) = a N (cid:88) i =1 e (cid:124) i,k (cid:79) F i,k i,k , N (cid:88) i =1 e (cid:124) i,k δ ni,k = N (cid:88) i =1 e (cid:124) i,k ( ε i,k i,k − N N (cid:88) j =1 ε j,k j,k ) = N (cid:88) i =1 e (cid:124) i,k ε i,k i,k . So the result follows. (cid:4)
Proof of Lemma 2
In light of Lemma 1 we have: E [ V k +1 | θ k ] = V k − γN N (cid:88) i =1 µ i µ e (cid:124) i,k [ (cid:79) (cid:96) ( θ i,k ) + a (cid:79) F i,k ] + γ N N (cid:88) i =1 (cid:107) δ i,k (cid:107) . Let e k = (cid:104) e (cid:124) ,k , e (cid:124) ,k , · · · , e (cid:124) N,k (cid:105) (cid:124) and L = [ l ij ] be the Laplacian matrix associated with the ad-jacency matrix A , where l ii = (cid:80) j a ij and l ij = − a ij when i (cid:54) = j . For an undirected graph, theLaplacian matrix is symmetric positive semi-definite. It follows that N (cid:88) i =1 e (cid:124) i,k (cid:79) F i,k = N (cid:88) i =1 N (cid:88) j =1 ,j (cid:54) = i α ij e (cid:124) i,k ( e i,k − e j,k ) = − N (cid:88) i =1 N (cid:88) j (cid:54) = i l ij e (cid:124) i,k ( e i,k − e j,k )= N (cid:88) i =1 N (cid:88) j (cid:54) = i l ij e (cid:124) i,k e j,k = e (cid:124) k ( L ⊗ I p ) e k ≥ λ N (cid:88) i =1 (cid:107) e i,k (cid:107) , λ := λ ( L ) is the second-smallest eigenvalue of L , also called the algebraic connectivity of G [24]. Thus, E [ V k +1 | θ k ] ≤ V k − γN N (cid:88) i =1 µ i µ e (cid:124) i,k (cid:79) (cid:96) ( θ i,k ) − aλ γN N (cid:88) i =1 µ i µ (cid:107) e i,k (cid:107) + γ N N (cid:88) i =1 E [ (cid:107) δ i,k (cid:107) | θ k ]= V k − γN N (cid:88) i =1 µ i µ ( (cid:79) (cid:96) ( θ i,k ) − (cid:79) (cid:96) (¯ θ k )) (cid:124) e i,k − aλ γN N (cid:88) i =1 µ i µ (cid:107) e i,k (cid:107) + γ N N (cid:88) i =1 E [ (cid:107) δ i,k (cid:107) | θ k ] . By CauchySchwarz inequality and Assumption 3, we can obtain that − ( (cid:79) (cid:96) ( θ i,k ) − (cid:79) (cid:96) (¯ θ k )) (cid:124) e i,k ≤ (cid:13)(cid:13) (cid:79) (cid:96) ( θ i,k ) − (cid:79) (cid:96) (¯ θ k ) (cid:13)(cid:13) (cid:107) e i,k (cid:107) ≤ L (cid:107) e i,k (cid:107) . Define ¯ µ = µ/N , and by the inequalities µ min N ¯ µ ≤ µ i µ ≤ µ max N ¯ µ , we can obtain E [ V k +1 | θ k ] ≤ (1 + 2 γN ¯ µ ( Lµ max − aλ µ min )) V k + γ N N (cid:88) i =1 E [ (cid:107) δ i,k (cid:107) | θ k ] . 
(6)We now simplify the last term in the right hand side of (6). First we note that: E [ (cid:107) δ i,k (cid:107) | θ k ] = E [ (cid:107) δ fi,k + δ gi,k (cid:107) | θ k ] + E [ (cid:107) δ ni,k (cid:107) | θ k ] . (7)The first term in the right hand side of (7) can be further described as follows: γ E [ (cid:107) δ fi,k + δ gi,k (cid:107) | θ k ]= γ E (cid:104) (cid:13)(cid:13) (cid:79) (cid:96) ( θ i,k ) i,k − (cid:79) ¯ (cid:96) k + a ( (cid:79) F i,k i,k − (cid:79) ¯ F k ) (cid:13)(cid:13) (cid:12)(cid:12)(cid:12) θ k (cid:105) = γ E (cid:104) (cid:107) (1 − N )[ (cid:79) (cid:96) ( θ i,k ) + a (cid:79) F i,k ] i,k + 1 N N (cid:88) j (cid:54) = i [ (cid:79) (cid:96) ( θ j,k ) + a (cid:79) F j,k ] j,k (cid:107) | θ k (cid:105) = γ N (cid:104) (1 − N ) µ i ¯ µ (cid:107) (cid:79) (cid:96) ( θ i,k ) + a (cid:79) F i,k (cid:107) + 1 N N (cid:88) j (cid:54) = i µ j ¯ µ (cid:107) (cid:79) (cid:96) ( θ j,k )+ a (cid:79) F j,k (cid:107) (cid:105) . This leads to: N (cid:88) i =1 γ E [ (cid:107) δ fi,k + δ gi,k (cid:107) | θ k ] ≤ γ ξN (cid:104) (1 − N ) N (cid:88) i =1 (cid:107) (cid:79) (cid:96) ( θ i,k )+ a (cid:79) F i,k (cid:107) + 1 N N (cid:88) i =1 N (cid:88) j (cid:54) = i (cid:107) (cid:79) (cid:96) ( θ j,k )+ a (cid:79) F j,k (cid:107) (cid:105) ≤ γ ξN N (cid:88) i =1 (cid:107) (cid:79) (cid:96) ( θ i,k )+ a (cid:79) F i,k (cid:107) . (8)Finally, γ N (cid:88) i =1 E [ (cid:107) δ ni,k (cid:107) | θ k ] = γ N (cid:88) i =1 E (cid:104) (cid:107) (1 − N ) ε i,k i,k − N N (cid:88) j (cid:54) = i ε j,k j,k (cid:107) | θ k (cid:105) = γ N (cid:104) (1 − N ) N (cid:88) i =1 µ i ¯ µ E [ (cid:107) ε i,k (cid:107) | θ k ] + 1 N N (cid:88) i =1 N (cid:88) j (cid:54) = i µ i ¯ µ E [ (cid:107) ε j,k (cid:107) | θ k ] (cid:105) ≤ γ N (cid:0) µ max µ min (cid:1) N (cid:88) i =1 E [ (cid:107) ε i,k (cid:107) | θ k ] ≤ γ ξσ N . 
(9)
We use inequalities (8) and (9) with (7) to obtain an upper bound on (6) as follows:
\[
\mathbb{E}[V_{k+1}\mid\theta_k] \le \Big(1 + \frac{2\gamma}{N\bar\mu}\big(L\mu_{\max} - a\lambda_2\,\mu_{\min}\big)\Big)V_k + \frac{\gamma^2\xi}{N^2}\sum_{i=1}^{N}\big\|\nabla\ell(\theta_{i,k}) + a\nabla F_{i,k}\big\|^2 + \frac{\gamma^2\xi\sigma^2}{N}.
\]
(10)
Finally, we analyze the middle term on the right-hand side of (10). By the parallelogram law,
\[
\big\|\nabla\ell(\theta_{i,k}) + a\nabla F_{i,k}\big\|^2 = 2\big\|\nabla\ell(\theta_{i,k})\big\|^2 + 2\big\|a\nabla F_{i,k}\big\|^2 - \big\|\nabla\ell(\theta_{i,k}) - a\nabla F_{i,k}\big\|^2 \le 2\big\|\nabla\ell(\theta_{i,k})\big\|^2 + 2a^2\big\|\nabla F_{i,k}\big\|^2.
\]
In addition,
\[
\|\nabla F_{i,k}\|^2 = \deg(i)^2\,\Big\|\frac{1}{\deg(i)}\sum_{j\neq i}\alpha_{ij}\big(\theta_{i,k}-\theta_{j,k}\big)\Big\|^2 \le \deg(i)\sum_{j\neq i}\alpha_{ij}^2\,\|\theta_{i,k}-\theta_{j,k}\|^2 \le \bar d\,\sum_{j\neq i}\alpha_{ij}^2\,\|\theta_{i,k}-\theta_{j,k}\|^2,
\]
which implies
\[
\sum_{i=1}^{N}\|\nabla F_{i,k}\|^2 \le \bar d\sum_{i=1}^{N}\sum_{j\neq i}\alpha_{ij}^2\,\|e_{i,k}-e_{j,k}\|^2 \le 2\bar d\sum_{i=1}^{N}\sum_{j\neq i}\alpha_{ij}^2\big(\|e_{i,k}\|^2 + \|e_{j,k}\|^2\big) \le 4\bar d^{\,2}\sum_{i=1}^{N}\|e_{i,k}\|^2 = 8N\bar d^{\,2}\,V_k.
\]
Thus,
\[
\sum_{i=1}^{N}\big\|\nabla\ell(\theta_{i,k}) + a\nabla F_{i,k}\big\|^2 \le 2\sum_{i=1}^{N}\big\|\nabla\ell(\theta_{i,k})\big\|^2 + 16\,a^2 N\bar d^{\,2}\,V_k
\]
\[
= 4N\big\|\nabla\ell(\bar\theta_k)\big\|^2 + 4\sum_{i=1}^{N}\big\|\nabla\ell(\theta_{i,k}) - \nabla\ell(\bar\theta_k)\big\|^2 + 16\,a^2 N\bar d^{\,2}\,V_k \le 4N\big\|\nabla\ell(\bar\theta_k)\big\|^2 + 8N\big(L^2 + 2a^2\bar d^{\,2}\big)V_k.
\]
(11)
The result follows by using the previous inequality to obtain an upper bound for the right-hand side of (10):
\[
\mathbb{E}[V_{k+1}\mid\theta_k] \le \Big(1 + \frac{\kappa\gamma}{N}\Big)V_k + \frac{4\gamma^2\xi}{N}\big\|\nabla\ell(\bar\theta_k)\big\|^2 + \frac{\gamma^2\xi\sigma^2}{N},
\]
with $\kappa := \frac{2}{\bar\mu}\big(L\mu_{\max} - a\lambda_2\,\mu_{\min}\big) + 8\gamma\xi\big(L^2 + 2a^2\bar d^{\,2}\big)$. $\square$

Proof of Theorem 1
By Taylor expansion and the Lipschitz assumption:
\[
\ell(\bar\theta_{k+1}) \le \ell(\bar\theta_k) + \nabla\ell(\bar\theta_k)^\top\big(\bar\theta_{k+1} - \bar\theta_k\big) + \frac{L}{2}\big\|\bar\theta_{k+1}-\bar\theta_k\big\|^2
\]
\[
= \ell(\bar\theta_k) - \frac{\gamma}{N}\sum_{i=1}^{N}\nabla\ell(\bar\theta_k)^\top\nabla\ell(\theta_{i,k})\,\mathbf{1}_{i,k} - \frac{a\gamma}{N}\sum_{i=1}^{N}\nabla\ell(\bar\theta_k)^\top\nabla F_{i,k}\,\mathbf{1}_{i,k} - \frac{\gamma}{N}\sum_{i=1}^{N}\nabla\ell(\bar\theta_k)^\top\varepsilon_{i,k}\,\mathbf{1}_{i,k} + \frac{L}{2}\big\|\bar\theta_{k+1}-\bar\theta_k\big\|^2.
\]
Since $\sum_{i=1}^{N}\nabla F_{i,k} = 0$, it follows that
\[
\mathbb{E}\big[\ell(\bar\theta_{k+1})\mid\theta_k\big] \le \ell(\bar\theta_k) - \frac{\gamma}{\xi N^2}\sum_{i=1}^{N}\nabla\ell(\bar\theta_k)^\top\nabla\ell(\theta_{i,k}) + \frac{L}{2}\,\mathbb{E}\big[\|\bar\theta_{k+1}-\bar\theta_k\|^2\mid\theta_k\big]
\]
(12)
\[
= \ell(\bar\theta_k) - \frac{\gamma}{\xi N^2}\sum_{i=1}^{N}\nabla\ell(\bar\theta_k)^\top\big[\nabla\ell(\theta_{i,k}) - \nabla\ell(\bar\theta_k)\big] - \frac{\gamma}{\xi N}\big\|\nabla\ell(\bar\theta_k)\big\|^2 + \frac{L}{2}\,\mathbb{E}\big[\|\bar\theta_{k+1}-\bar\theta_k\|^2\mid\theta_k\big].
\]
Using (11) from the proof of Lemma 2 we obtain
\[
\mathbb{E}\big[\|\bar\theta_{k+1}-\bar\theta_k\|^2\mid\theta_k\big] \le \frac{\gamma^2}{N}\sum_{i=1}^{N}\frac{\mu_i}{\mu}\Big(\big\|\nabla\ell(\theta_{i,k})+a\nabla F_{i,k}\big\|^2 + \mathbb{E}\big[\|\varepsilon_{i,k}\|^2\mid\theta_k\big]\Big) \le \frac{4\gamma^2\xi}{N}\big\|\nabla\ell(\bar\theta_k)\big\|^2 + \frac{8\gamma^2\xi}{N}\big(L^2+2a^2\bar d^{\,2}\big)V_k + \frac{\gamma^2\xi\sigma^2}{N}.
\]
(13)
Also,
\[
-\nabla\ell(\bar\theta_k)^\top\big[\nabla\ell(\theta_{i,k}) - \nabla\ell(\bar\theta_k)\big] = \frac{1}{2}\big\|\nabla\ell(\bar\theta_k)\big\|^2 + \frac{1}{2}\big\|\nabla\ell(\theta_{i,k})-\nabla\ell(\bar\theta_k)\big\|^2 - \frac{1}{2}\big\|\nabla\ell(\theta_{i,k})\big\|^2 \le \frac{1}{2}\big\|\nabla\ell(\bar\theta_k)\big\|^2 + \frac{L^2}{2}\big\|\theta_{i,k}-\bar\theta_k\big\|^2.
\]
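The inner-product identity used in the last display is the polarization identity $-a^\top b = \frac{1}{2}\|a\|^2 + \frac{1}{2}\|b\|^2 - \frac{1}{2}\|a+b\|^2$, applied with $a = \nabla\ell(\bar\theta_k)$ and $b = \nabla\ell(\theta_{i,k}) - \nabla\ell(\bar\theta_k)$. A minimal numerical check on arbitrary vectors:

```python
import numpy as np

# Polarization identity: -a.b = (||a||^2 + ||b||^2 - ||a+b||^2) / 2,
# with a standing in for grad l(theta_bar) and b for the gradient difference.
rng = np.random.default_rng(2)
a = rng.standard_normal(5)
b = rng.standard_normal(5)

lhs = -a @ b
rhs = 0.5 * (a @ a) + 0.5 * (b @ b) - 0.5 * ((a + b) @ (a + b))
assert np.isclose(lhs, rhs)
```

Expanding $\|a+b\|^2 = \|a\|^2 + 2a^\top b + \|b\|^2$ shows the identity holds exactly for any pair of vectors.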
(14)
Substituting (14) and (13) into (12) we obtain:
\[
\mathbb{E}\big[\ell(\bar\theta_{k+1})\mid\theta_k\big] \le \ell(\bar\theta_k) - \frac{\gamma}{2\xi N}\big\|\nabla\ell(\bar\theta_k)\big\|^2 + \frac{L^2\gamma}{\xi N}V_k + \frac{2L\gamma^2\xi}{N}\big\|\nabla\ell(\bar\theta_k)\big\|^2 + \frac{4L\gamma^2\xi}{N}\big(L^2+2a^2\bar d^{\,2}\big)V_k + \frac{L\gamma^2\xi\sigma^2}{2N}.
\]
(15)
Consider the function $\ell(\bar\theta_k) + LV_k$. From the inequalities in (15) and Lemma 2 we obtain:
\[
\mathbb{E}[V_{k+1}\mid\theta_k] \le \Big(1+\frac{\kappa\gamma}{N}\Big)V_k + \frac{4\gamma^2\xi}{N}\big\|\nabla\ell(\bar\theta_k)\big\|^2 + \frac{\gamma^2\xi\sigma^2}{N},
\]
\[
\mathbb{E}\big[\ell(\bar\theta_{k+1}) + LV_{k+1}\mid\theta_k\big] \le \big(\ell(\bar\theta_k)+LV_k\big) - \frac{\gamma}{\xi N}\Big(\frac{1}{2} - 6\gamma L\xi^2\Big)\big\|\nabla\ell(\bar\theta_k)\big\|^2 + \Big[\kappa + \frac{L}{\xi} + 4\gamma\xi\big(L^2+2a^2\bar d^{\,2}\big)\Big]\frac{L\gamma}{N}V_k + \frac{3L\gamma^2\xi\sigma^2}{2N}.
\]
By choosing $a > \frac{(2\xi+1)L\mu_{\max}}{2\xi\lambda_2\,\mu_{\min}}$ we have $\frac{2}{\bar\mu}\big(L\mu_{\max}-a\lambda_2\mu_{\min}\big) + \frac{L}{\xi} < 0$, so there exists $\bar\gamma >$
0 such that for all $\gamma < \bar\gamma$ we have $\kappa + \frac{L}{\xi} + 4\gamma\xi\big(L^2+2a^2\bar d^{\,2}\big) \le 0$ and $\frac{1}{2} - 6\gamma L\xi^2 > 0$. Given the choice $\gamma < \bar\gamma$ in the statement of Theorem 1, we have
\[
\Big[\kappa + \frac{L}{\xi} + 4\gamma\xi\big(L^2+2a^2\bar d^{\,2}\big)\Big]\frac{L\gamma}{N}V_k \le 0,
\]
so that
\[
\frac{\gamma}{\xi N}\Big(\frac{1}{2} - 6\gamma L\xi^2\Big)\big\|\nabla\ell(\bar\theta_k)\big\|^2 \le \ell(\bar\theta_k) + LV_k - \mathbb{E}\big[\ell(\bar\theta_{k+1}) + LV_{k+1}\mid\theta_k\big] + \frac{3L\gamma^2\xi\sigma^2}{2N}.
\]
Let $\eta = \frac{\gamma}{\xi N}\big(\frac{1}{2} - 6\gamma L\xi^2\big)$. By the definition of $\bar\gamma$ and since $\gamma < \bar\gamma$, we have $\eta >$
0. Since the loss function is nonnegative, $\ell(\cdot) \ge 0$ and $V_k \ge 0$ for all $k$. Taking full expectation and summing from $k = 0$ to $k = K-1$:
\[
\eta \sum_{k=0}^{K-1}\mathbb{E}\big[\|\nabla\ell(\bar\theta_k)\|^2\big] \le \ell(\bar\theta_0) + LV_0 + \frac{3KL\gamma^2\xi\sigma^2}{2N}.
\]
We conclude that
\[
\frac{1}{K}\sum_{k=0}^{K-1}\mathbb{E}\big[\|\nabla\ell(\bar\theta_k)\|^2\big] \le \frac{1}{\eta K}\Big[\ell(\bar\theta_0) + LV_0 + \frac{3KL\gamma^2\xi\sigma^2}{2N}\Big]. \quad \square
\]

Proof of Corollary 1
Since $\gamma < \bar\gamma$, it follows that
\[
\kappa \le -\frac{L}{\xi} - 4\gamma\xi\big(L^2+2a^2\bar d^{\,2}\big) < 0,
\]
and from Lemma 2:
\[
|\kappa|\,\frac{\gamma}{N}\,V_k \le V_k - \mathbb{E}[V_{k+1}\mid\theta_k] + \frac{4\gamma^2\xi}{N}\big\|\nabla\ell(\bar\theta_k)\big\|^2 + \frac{\gamma^2\xi\sigma^2}{N}.
\]
Taking full expectation and summing from $k = 0$ to $k = K-1$:
\[
\frac{1}{K}\sum_{k=0}^{K-1}\mathbb{E}[V_k] \le \frac{N}{|\kappa|\gamma K}\,V_0 + \frac{4\gamma\xi}{|\kappa|}\Big[\frac{1}{K}\sum_{k=0}^{K-1}\mathbb{E}\big[\|\nabla\ell(\bar\theta_k)\|^2\big] + \frac{\sigma^2}{4}\Big],
\]
and using Theorem 1 we obtain the result. $\square$

Proof of Proposition 1
By Assumption 3 and Taylor expansion,
\[
\ell(\theta_{k+1}) \le \ell(\theta_k) + \nabla\ell(\theta_k)^\top(\theta_{k+1}-\theta_k) + \frac{L}{2}\|\theta_{k+1}-\theta_k\|^2 = \ell(\theta_k) - \gamma\,\nabla\ell(\theta_k)^\top\sum_{i=1}^{N}\mathbf{1}_{i,k}\,g_i(\theta_k) + \frac{L}{2}\|\theta_{k+1}-\theta_k\|^2.
\]
Taking conditional expectation on both sides,
\[
\mathbb{E}[\ell(\theta_{k+1})\mid\theta_k] \le \ell(\theta_k) - \gamma\,\nabla\ell(\theta_k)^\top\sum_{i=1}^{N}\frac{\mu_i}{\mu}\,\mathbb{E}\big[\nabla\ell(\theta_k)+\varepsilon_i\big] + \frac{L}{2}\,\mathbb{E}\big[\|\theta_{k+1}-\theta_k\|^2\mid\theta_k\big] = \ell(\theta_k) - \gamma\,\|\nabla\ell(\theta_k)\|^2 + \frac{L}{2}\,\mathbb{E}\big[\|\theta_{k+1}-\theta_k\|^2\mid\theta_k\big].
\]
Note that
\[
\mathbb{E}\big[\|\theta_{k+1}-\theta_k\|^2\mid\theta_k\big] = \mathbb{E}\Big[\big\|\gamma\sum_{i=1}^{N}\mathbf{1}_{i,k}\,g_i(\theta_k)\big\|^2\,\Big|\,\theta_k\Big] = \gamma^2\sum_{i=1}^{N}\frac{\mu_i}{\mu}\,\big\|\nabla\ell(\theta_k)+\varepsilon_i(\theta_k)\big\|^2 \le \gamma^2\Big(\|\nabla\ell(\theta_k)\|^2 + \frac{1}{\mu}\sum_{i=1}^{N}\mu_i\sigma_i^2\Big),
\]
It follows that
\[
\gamma\Big(1 - \frac{L\gamma}{2}\Big)\|\nabla\ell(\theta_k)\|^2 \le \ell(\theta_k) - \mathbb{E}[\ell(\theta_{k+1})\mid\theta_k] + \frac{L\gamma^2}{2\mu}\sum_{i=1}^{N}\mu_i\sigma_i^2.
\]
The result follows by taking full expectation and summing from $k = 0$ to $k = K-1$. $\square$

References

[1] Q. Yang, Y. Liu, T. Chen, and Y. Tong, "Federated machine learning: Concept and applications,"
ACM Transactions on Intelligent Systems and Technology (TIST), vol. 10, no. 2, pp. 1–19, 2019.

[2] T. Li, A. K. Sahu, A. Talwalkar, and V. Smith, "Federated learning: Challenges, methods, and future directions," arXiv preprint arXiv:1908.07873, 2019.

[3] D. Hallac, J. Leskovec, and S. Boyd, "Network lasso: Clustering and optimization in large graphs," Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 387–396, 2015.

[4] J. Chen, C. Richard, and A. Sayed, "Multitask diffusion adaptation over networks," IEEE Transactions on Signal Processing, vol. 62, pp. 4129–4144, 2014.

[5] R. Nassif, C. Richard, A. Ferrari, and A. Sayed, "Multitask diffusion adaptation over asynchronous networks," IEEE Transactions on Signal Processing, vol. 64, pp. 2835–2850, 2016.

[6] ——, "Multitask diffusion adaptation over asynchronous networks," IEEE Transactions on Signal Processing, vol. 64, pp. 2835–2850, 2016.

[7] R. Nassif, S. Vlaski, and A. Sayed, "Learning over multitask graphs (part i: Stability analysis)," https://arxiv.org/abs/1805.08535, 2019.

[8] ——, "Learning over multitask graphs (part ii: Performance analysis)," https://arxiv.org/abs/1805.08547, 2019.

[9] F. Niu, B. Recht, C. Re, and S. Wright, "Hogwild: A lock-free approach to parallelizing stochastic gradient descent," Advances in Neural Information Processing Systems, pp. 693–701, 2011.

[10] J. Liu, S. Wright, C. Ré, V. Bittorf, and S. Srikrishna, "An asynchronous parallel stochastic coordinate descent algorithm," Journal of Machine Learning Research, vol. 16, pp. 285–322, 2015.

[11] X. Lian, Y. Huang, Y. Li, and J. Liu, "Asynchronous parallel stochastic gradient for nonconvex optimization," in Advances in Neural Information Processing Systems 28 (NIPS), 2015.

[12] J. Duchi, S. Chaturapruek, and C. Ré, "Asynchronous stochastic convex optimization," in Advances in Neural Information Processing Systems 28 (NIPS), 2015.

[13] N. Friedkin and E. Johnsen, "Social influence and opinions," Journal of Mathematical Sociology, vol. 15, pp. 193–205, 1990.

[14] V. Gazi and K. Passino, Swarm Stability and Optimization. Springer, 2011.

[15] S. Pu and A. Garcia, "A flocking-based approach for distributed stochastic optimization," Operations Research, vol. 6, pp. 267–281, 2017.

[16] R. Leblond, F. Pedregosa, and S. Lacoste-Julien, "Improved asynchronous parallel optimization analysis for stochastic incremental methods," Journal of Machine Learning Research, vol. 19, pp. 1–68, 2018.

[17] A. Nedic and A. Ozdaglar, "Distributed subgradient methods for multi-agent optimization," IEEE Transactions on Automatic Control, vol. 54, no. 1, pp. 48–61, 2009.

[18] W. Shi, Q. Ling, G. Wu, and W. Yin, "EXTRA: An exact first-order algorithm for decentralized consensus optimization," SIAM Journal on Optimization, vol. 25, pp. 944–966, 2015.

[19] S. Ghadimi and G. Lan, "Stochastic first- and zeroth-order methods for nonconvex stochastic programming," SIAM Journal on Optimization, vol. 23, no. 4, pp. 2341–2368, 2013.

[20] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner et al., "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.

[21] P. Moritz, R. Nishihara, S. Wang, A. Tumanov, R. Liaw, E. Liang, M. Elibol, Z. Yang, W. Paul, M. I. Jordan et al., "Ray: A distributed framework for emerging AI applications," in Proceedings of the 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2018, pp. 561–577.

[22] D. Watts and S. Strogatz, "Collective dynamics of 'small-world' networks," Nature, vol. 393, pp. 440–442, 1998.

[23] J. M. Bates and C. W. J. Granger, "The combination of forecasts," Operational Research Quarterly, vol. 20, pp. 451–468, 1969.

[24] C. Godsil and G. F. Royle, Algebraic Graph Theory. Springer, 2001.