Personalized Federated Learning: A Meta-Learning Approach
Alireza Fallah∗, Aryan Mokhtari†, Asuman Ozdaglar∗

Abstract
In Federated Learning, we aim to train models across multiple computing units (users), while users can only communicate with a common central server, without exchanging their data samples. This mechanism exploits the computational power of all users and allows users to obtain a richer model, as their models are trained over a larger set of data points. However, this scheme only develops a common output for all the users and therefore does not adapt the model to each user. This is an important missing feature, especially given the heterogeneity of the underlying data distribution for various users. In this paper, we study a personalized variant of federated learning in which our goal is to find an initial shared model that current or new users can easily adapt to their local dataset by performing one or a few steps of gradient descent with respect to their own data. This approach keeps all the benefits of the federated learning architecture and, by structure, leads to a more personalized model for each user. We show that this problem can be studied within the Model-Agnostic Meta-Learning (MAML) framework. Inspired by this connection, we study a personalized variant of the well-known Federated Averaging algorithm and evaluate its performance in terms of gradient norm for non-convex loss functions. Further, we characterize how this performance is affected by the closeness of the underlying distributions of user data, measured in terms of distribution distances such as Total Variation and 1-Wasserstein.
1 Introduction

In Federated Learning (FL), we consider a set of n users that are all connected to a central node (server), where each user has access only to its local data [1]. In this setting, the users aim to come up with a model that is trained over all the data points in the network, without exchanging their local data with other users or the central node due to privacy issues or communication limitations. More formally, if we define f_i : R^d → R as the loss corresponding to user i, the goal is to solve

min_{w ∈ R^d} f(w) := (1/n) Σ_{i=1}^n f_i(w).   (1)

In particular, consider a supervised learning setting, where f_i represents the expected loss over the data distribution of user i, i.e.,

f_i(w) := E_{(x,y)∼p_i}[ l_i(w; x, y) ],   (2)

where l_i(w; x, y) measures the error of model w in predicting the true label y ∈ Y_i given the input x ∈ X_i, and p_i is the distribution over X_i × Y_i. The focus of this paper is on a data-heterogeneous setting where the probability distributions p_i of the users are not identical. To illustrate this formulation, consider the example of training a Natural Language Processing (NLP) model over the devices of a set of users. In this problem, p_i represents the empirical distribution of words and expressions used by user i. Hence, f_i(w) can be expressed as f_i(w) = Σ_{(x,y)∈S_i} p_i(x, y) l_i(w; x, y), where S_i is the dataset corresponding to user i and p_i(x, y) is the probability that user i assigns to a specific word, which is proportional to the frequency with which user i uses this word.

∗ Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA, USA. {[email protected], [email protected]}.
† Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, TX, USA. [email protected].
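To make the objective in (1)-(2) concrete, the following sketch evaluates the global FL loss as the average of local losses. The quadratic per-user losses, the centers c_i standing in for the heterogeneous distributions p_i, and all names are purely illustrative assumptions, not part of the paper's setup:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: n users, each with local quadratic loss
# f_i(w) = 0.5 * ||w - c_i||^2, mimicking heterogeneity via distinct c_i.
n, d = 5, 3
centers = rng.normal(size=(n, d))  # each user's own optimum

def f_i(w, i):
    # Local loss of user i, playing the role of (2).
    return 0.5 * np.sum((w - centers[i]) ** 2)

def f(w):
    # Global FL objective (1): average of the local losses.
    return np.mean([f_i(w, i) for i in range(n)])

# For these quadratics the minimizer of the average is the mean of the
# centers, which need not be close to any single user's optimum.
w_star = centers.mean(axis=0)
print(f(w_star) <= f(np.zeros(d)))  # True
```

Note that w_star minimizes the average loss but generally minimizes none of the individual f_i, which is exactly the personalization gap discussed next.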
Indeed, each user can solve its local problem defined in (2) without any exchange of information with other users; however, the resulting model may not generalize well to new samples, as it has been trained over a small number of samples. If users cooperate and exploit the data available at all users, then their local models could obtain stronger generalization guarantees. A conventional approach for achieving this goal is minimizing the aggregate of the local functions defined in (1). However, this scheme only develops a common output for all the users and therefore does not adapt the model to each user. In particular, in heterogeneous settings where the underlying data distributions of users are not identical, the resulting global model obtained by minimizing the average loss could perform arbitrarily poorly once applied to the local dataset of each user. In other words, the solution of problem (1) is not personalized for each user. To highlight this point, recall the NLP example: although the distribution over words and expressions varies from one person to another, the solution to problem (1) provides a shared answer for all users, and therefore it is not fully capable of achieving a user-specific model.

In this paper, we overcome this issue by considering a modified formulation of the federated learning problem which incorporates personalization (Section 2). Building on the Model-Agnostic Meta-Learning (MAML) problem formulation introduced in [2], the goal of this new formulation is to find an initial point, shared between all users, which performs well after each user updates it with respect to its own loss function, potentially by performing a few steps of a gradient-based method. This way, while the initial model is derived in a distributed manner over all users, the final model implemented by each user differs from the others based on his or her own data.
We study a personalized variant of the FedAvg algorithm, called Per-FedAvg, designed for solving the proposed personalized FL problem (Section 3). In particular, we elaborate on its connections with the original FedAvg algorithm [3], and also discuss a number of considerations that one needs to take into account when implementing Per-FedAvg. We also establish the convergence properties of the proposed Per-FedAvg algorithm for minimizing non-convex loss functions (Section 4). In particular, we characterize the role of data heterogeneity and of the closeness of the data distributions of different users, measured by distribution distances such as Total Variation (TV) or 1-Wasserstein, on the convergence of Per-FedAvg.

Related Work.
Recently, we have witnessed significant progress in developing novel methods that address different challenges in FL; see [4, 5]. In particular, there have been several works on various aspects of FL, including preserving the privacy of users [6, 7, 8, 9] and lowering communication cost [10, 11, 12, 13]. Several works develop algorithms for the homogeneous setting, where the data points of all users are sampled from the same probability distribution [14, 15, 16, 17]. More related to our paper, several works study the statistical heterogeneity of users' data points in FL [19, 20, 21, 22, 24, 25], but they do not attempt to find a personalized solution for each user. The centralized version of the model-agnostic meta-learning (MAML) problem was first proposed in [2] and followed by a number of papers studying its empirical characteristics [26, 27, 28, 29, 30, 31] as well as its convergence properties [32, 33]. In this work, we focus on the convergence of MAML methods in the FL setting, which is more challenging as nodes perform multiple local updates before sending their models to the server, a feature not considered in previous theoretical works on meta-learning. Recently, the idea of personalization in FL and its connections with MAML have gained a lot of attention. In particular, [34] considers a formulation and algorithm similar to our paper, and elaborates on the empirical success of this framework. Also, a number of other recent papers have studied different combinations of MAML-type methods with the FL architecture from an empirical point of view [35, 36]. However, our main focus is on developing a theoretical understanding of this formulation, where we characterize the convergence of Per-FedAvg and the role of this algorithm's parameters in its performance. Besides, in our numerical experiments section, we show how the method studied in [34] may not perform well in some cases, and propose another algorithm which addresses this issue.
In addition, an independent and concurrent work [37] studies a similar formulation theoretically for the case of strongly convex functions. The results in [37] are completely different from ours, as they study the case where the functions are strongly convex and exact gradients are available, while we study nonconvex functions and also address gradient stochasticity. Using meta-learning and multi-task learning to achieve personalization is not limited to the MAML framework. In particular, [38] proposes ARUBA, a meta-learning algorithm inspired by online convex optimization, and shows that applying it to FedAvg improves its performance. A similar idea is later used in [39] to design differentially private algorithms with applications in FL. Also, in [40], the authors use the multi-task learning framework and propose a new method, MOCHA, to address statistical and systems challenges, including data heterogeneity and communication efficiency. Their proposed multi-task learning scheme also leads to a set of solutions that are more user-specific. A detailed survey on the connections of FL with multi-task learning and meta-learning can be found in [4, 5]. Also, in [18], the authors consider a framework for training a mixture of a single global model and local models, leading to a personalized solution for each user. A similar idea has been studied in [41], where the authors propose an adaptive federated learning algorithm that learns a mixture of local and global models as the personalized model.
2 Personalized Federated Learning

As we stated in Section 1, our goal in this section is to show how the fundamental idea behind the Model-Agnostic Meta-Learning (MAML) framework in [2] can be exploited to design a personalized variant of the FL problem. To do so, let us first briefly recap the MAML formulation. Given a set of tasks drawn from an underlying distribution, in MAML, in contrast to the traditional supervised learning setting, the goal is not to find a model which performs well on all the tasks in expectation. Instead, in MAML, we assume we have a limited computational budget to update our model after a new task arrives, and in this new setting, we look for an initialization which performs well after it is updated with respect to this new task, possibly by one or a few steps of gradient descent. In particular, if we assume each user takes the initial point and updates it using one step of gradient descent with respect to its own loss function, then problem (1) changes to

min_{w ∈ R^d} F(w) := (1/n) Σ_{i=1}^n f_i( w − α ∇f_i(w) ),   (3)

where α ≥ 0 is the stepsize. The strength of this formulation is that it not only allows us to maintain the advantages of FL, but also captures the difference between users, as either existing or new users can take the solution of this new problem as an initial point and slightly update it with respect to their own data. Going back to the NLP example, this means that the users could take this resulting initialization and update it by going over their own data S_i and performing just one or a few steps of gradient descent to obtain a model that works well on their own dataset. As mentioned earlier, for the considered heterogeneous model of data distribution, solving problem (1) is not the ideal choice, as it returns a single model that, even after a few steps of local gradient descent, may not quickly adjust to each user's local data.
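As a quick illustration of the objective in (3), the sketch below evaluates F on synthetic quadratic losses and shows that one step of local adaptation can only help. The quadratic form and all names are illustrative assumptions (here L = 1, so any α in [0, 1] is admissible):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy heterogeneous users: f_i(w) = 0.5 * ||w - c_i||^2 (illustrative).
n, d, alpha = 5, 3, 0.5
centers = rng.normal(size=(n, d))

def grad_f_i(w, i):
    return w - centers[i]

def f_i(w, i):
    return 0.5 * np.sum((w - centers[i]) ** 2)

def F(w):
    # Personalized objective (3): loss after one local gradient step per user.
    return np.mean([f_i(w - alpha * grad_f_i(w, i), i) for i in range(n)])

def f(w):
    # Plain FL objective (1), for comparison.
    return np.mean([f_i(w, i) for i in range(n)])

w = rng.normal(size=d)
# One local adaptation step shrinks each of these quadratic losses:
print(F(w) <= f(w))  # True
```

For these particular losses one can check by hand that F(w) = (1 − α)² f(w), so the adapted loss is uniformly smaller; with general nonconvex f_i the two objectives can of course have very different minimizers.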
On the other hand, by solving (3) we find an initial model (meta-model) which is trained so that one step of local gradient descent leads to a good model for each individual user. This formulation can also be extended to the case where users run a few steps of gradient update, but to simplify our notation we focus on the single-gradient-update case. We would like to mention that the problem formulation in (3) for FL has been proposed independently in another work [34] and studied numerically. In this work, we focus on the theoretical aspects of this problem and seek a provably convergent method for the case where the functions f_i are nonconvex.

3 Personalized FedAvg

In this section, we present the Personalized FedAvg (Per-FedAvg) method to solve (3). This algorithm is inspired by FedAvg, but it is designed to find the optimal solution of (3) instead of (1). In FedAvg, at each round, the server chooses a fraction of users of size rn (with r ∈ (0, 1]) and sends its current model to these users. Each selected user i updates the received model based on its own loss function f_i by running τ ≥ 1 steps of stochastic gradient descent. Then, the active users return their updated models to the server. Finally, the server updates the global model by computing the average of the models received from these selected users, and then the next round follows. Per-FedAvg follows the same principles. First, note that the function F in (3) can be written as the average of the meta-functions F_1, ..., F_n, where the meta-function F_i associated with user i is defined as

F_i(w) := f_i( w − α ∇f_i(w) ).   (4)

Algorithm 1: The proposed Personalized FedAvg (Per-FedAvg) Algorithm
Input: Initial iterate w_0, fraction of active users r.
for k = 0 to K−1 do
    Server chooses a subset A_k of users uniformly at random, with |A_k| = rn;
    Server sends w_k to all users in A_k;
    for all i ∈ A_k do
        Set w^i_{k+1,0} = w_k;
        for t = 1 to τ do
            Compute the stochastic gradient ˜∇f_i(w^i_{k+1,t−1}, D^i_t) using dataset D^i_t;
            Set w̃^i_{k+1,t} = w^i_{k+1,t−1} − α ˜∇f_i(w^i_{k+1,t−1}, D^i_t);
            Set w^i_{k+1,t} = w^i_{k+1,t−1} − β ( I − α ˜∇²f_i(w^i_{k+1,t−1}, D''^i_t) ) ˜∇f_i(w̃^i_{k+1,t}, D'^i_t);
        end for
        Agent i sends w^i_{k+1,τ} back to the server;
    end for
    Server updates its model by averaging over the received models: w_{k+1} = (1/rn) Σ_{i∈A_k} w^i_{k+1,τ};
end for

To follow a similar scheme as FedAvg for solving problem (3), the first step is to compute the gradient of the local functions, in this case the gradient ∇F_i, which is given by

∇F_i(w) = ( I − α ∇²f_i(w) ) ∇f_i( w − α ∇f_i(w) ).   (5)

Computing the gradient ∇f_i(w) at every round is often computationally costly. Hence, we take a batch of data D_i sampled with respect to the distribution p_i to obtain the unbiased estimate ˜∇f_i(w, D_i) given by

˜∇f_i(w, D_i) := (1/|D_i|) Σ_{(x,y)∈D_i} ∇l_i(w; x, y).   (6)

Similarly, the Hessian ∇²f_i(w) in (5) can be replaced by its unbiased estimate ˜∇²f_i(w, D_i). At round k of Per-FedAvg, similar to FedAvg, the server first sends the current global model w_k to a fraction A_k of users, chosen uniformly at random with size rn. Each user i ∈ A_k performs τ steps of stochastic gradient descent locally with respect to F_i. In particular, these local updates generate a local sequence {w^i_{k+1,t}}_{t=0}^{τ}, where w^i_{k+1,0} = w_k and, for 1 ≤ t ≤ τ,

w^i_{k+1,t} = w^i_{k+1,t−1} − β ˜∇F_i(w^i_{k+1,t−1}),   (7)

where β is the local learning rate (stepsize) and ˜∇F_i(w^i_{k+1,t−1}) is an estimate of the gradient ∇F_i(w^i_{k+1,t−1}) in (5).
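The exact meta-gradient in (5) can be sanity-checked numerically. The sketch below uses a single illustrative quadratic loss (the matrix A, vector b, and all names are assumptions for the demo) and verifies the closed-form ∇F_i against central finite differences of F_i:

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative single-user quadratic f(w) = 0.5 w^T A w - b^T w, so that
# grad f(w) = A w - b and the Hessian is the constant SPD matrix A.
d, alpha = 4, 0.1
M = rng.normal(size=(d, d))
A = M @ M.T + np.eye(d)
b = rng.normal(size=d)

grad_f = lambda w: A @ w - b
f = lambda w: 0.5 * w @ A @ w - b @ w
F = lambda w: f(w - alpha * grad_f(w))  # meta-function, as in (4)

def grad_F(w):
    # Formula (5): (I - alpha * Hessian) applied to grad f at the adapted point.
    return (np.eye(d) - alpha * A) @ grad_f(w - alpha * grad_f(w))

# Check against central finite differences of F.
w = rng.normal(size=d)
eps = 1e-6
num = np.array([(F(w + eps * e) - F(w - eps * e)) / (2 * eps)
                for e in np.eye(d)])
print(np.allclose(grad_F(w), num, atol=1e-4))  # True
```

The chain rule produces the (I − α∇²f_i) factor because the adapted point w − α∇f_i(w) itself depends on w; this is exactly the second-order term that the stochastic estimator in (8) has to approximate.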
Note that the stochastic gradient ˜∇F_i(w^i_{k+1,t−1}) for all local iterates is computed using independent batches D^i_t, D'^i_t, and D''^i_t as follows:

˜∇F_i(w^i_{k+1,t−1}) := ( I − α ˜∇²f_i(w^i_{k+1,t−1}, D''^i_t) ) ˜∇f_i( w^i_{k+1,t−1} − α ˜∇f_i(w^i_{k+1,t−1}, D^i_t), D'^i_t ).   (8)

Note that ˜∇F_i(w^i_{k+1,t−1}) is a biased estimator of ∇F_i(w^i_{k+1,t−1}), due to the fact that ˜∇f_i(w^i_{k+1,t−1} − α ˜∇f_i(w^i_{k+1,t−1}, D^i_t), D'^i_t) is a stochastic gradient that contains another stochastic gradient inside. Once the local updates w^i_{k+1,τ} are evaluated, users send them to the server, and the server updates its global model by averaging over the received models, i.e., w_{k+1} = (1/rn) Σ_{i∈A_k} w^i_{k+1,τ}. Note that, as in other MAML methods [2, 33], the update in (7) can be implemented in two stages: first, we compute w̃^i_{k+1,t} = w^i_{k+1,t−1} − α ˜∇f_i(w^i_{k+1,t−1}, D^i_t), and then we evaluate w^i_{k+1,t} = w^i_{k+1,t−1} − β ( I − α ˜∇²f_i(w^i_{k+1,t−1}, D''^i_t) ) ˜∇f_i(w̃^i_{k+1,t}, D'^i_t). Indeed, it can be verified that the outcome of these two steps is equivalent to the update in (7). To simplify the notation, throughout the paper we assume that the sizes of D^i_t, D'^i_t, and D''^i_t are equal to D, D', and D'', respectively, for any i and t. The steps of Per-FedAvg are depicted in Algorithm 1.

4 Convergence Analysis

In this section, we study the convergence properties of the Personalized FedAvg (Per-FedAvg) method. We focus on nonconvex settings and characterize the overall number of communication rounds between the server and the users needed to find an ε-approximate first-order stationary point, whose formal definition follows.
Definition 4.1.
A random vector w_ε ∈ R^d is called an ε-approximate First-Order Stationary Point (FOSP) for problem (3) if it satisfies E[ ‖∇F(w_ε)‖² ] ≤ ε.

Next, we formally state the assumptions required for proving our main results.
Assumption 1.
Functions f_i are bounded below, i.e., min_{w ∈ R^d} f_i(w) > −∞.

Assumption 2.
For every i ∈ {1, ..., n}, f_i is twice continuously differentiable and L_i-smooth, and its gradient is bounded by a nonnegative constant B_i, i.e.,

‖∇f_i(w)‖ ≤ B_i,   ‖∇f_i(w) − ∇f_i(u)‖ ≤ L_i ‖w − u‖,   ∀ w, u ∈ R^d.   (9)

As we discussed in Section 3, the second-order derivatives of all functions appear in the update rule of the Per-FedAvg algorithm. Hence, in the next assumption, we impose a regularity condition on the Hessian of each f_i, which is also a customary assumption in the analysis of second-order methods.

Assumption 3.
For every i ∈ {1, ..., n}, the Hessian of the function f_i is ρ_i-Lipschitz continuous, i.e.,

‖∇²f_i(w) − ∇²f_i(u)‖ ≤ ρ_i ‖w − u‖,   ∀ w, u ∈ R^d.   (10)

To simplify the analysis, in the rest of the paper we define B := max_i B_i, L := max_i L_i, and ρ := max_i ρ_i, which can be, respectively, considered as a bound on the norm of the gradient of f_i, the smoothness parameter of f_i, and the Lipschitz continuity parameter of the Hessian ∇²f_i, for i = 1, ..., n. Our next assumption provides upper bounds on the variances of the gradient and Hessian estimates.

Assumption 4.
For any w ∈ R^d, the stochastic gradient ∇l_i(x, y; w) and stochastic Hessian ∇²l_i(x, y; w), computed with respect to a single data point (x, y) ∈ X_i × Y_i, have bounded variance, i.e.,

E_{(x,y)∼p_i}[ ‖∇l_i(x, y; w) − ∇f_i(w)‖² ] ≤ σ_G²,   (11)
E_{(x,y)∼p_i}[ ‖∇²l_i(x, y; w) − ∇²f_i(w)‖² ] ≤ σ_H².   (12)

Finally, we state our last assumption, which characterizes the similarity between the tasks of the users.

Assumption 5.
For any w ∈ R^d, the gradients and Hessians of the local functions f_i(w) and of the average function f(w) = (1/n) Σ_{i=1}^n f_i(w) satisfy

(1/n) Σ_{i=1}^n ‖∇f_i(w) − ∇f(w)‖² ≤ γ_G²,   (1/n) Σ_{i=1}^n ‖∇²f_i(w) − ∇²f(w)‖² ≤ γ_H².   (13)

Assumption 5 captures the diversity between the gradients and Hessians of the users. Note that under Assumption 2, the conditions in Assumption 5 are automatically satisfied for γ_G = 2B and γ_H = 2L. However, we state this assumption separately to highlight the role of the similarity of the functions corresponding to different users in the convergence analysis of Per-FedAvg. In particular, in the following subsection, we highlight the connections between this assumption and the similarity of the distributions p_i for the case of supervised learning (2) under two different distribution distances.

4.1 Assumption 5 and Distribution Distances

Recall the definition of f_i in (2). Note that Assumption 5 captures the similarity of the loss functions of different users. Hence, a fundamental question here is whether this has any connection with the closeness of the distributions p_i. We study this connection by considering two different distances: the Total Variation (TV) distance and the 1-Wasserstein distance. Throughout this subsection, we assume all users have the same loss function l(·;·) over the same set of inputs and labels, i.e., f_i(w) := E_{z∼p_i}[ l(z; w) ], where z := (x, y) ∈ Z := X × Y. Also, let p = (1/n) Σ_i p_i denote the average of all users' distributions.

• Total Variation (TV) Distance:
For distributions q_1 and q_2 over a countable set Z, their TV distance is given by ‖q_1 − q_2‖_TV = (1/2) Σ_{z∈Z} |q_1(z) − q_2(z)|. If we assume a stronger version of Assumption 2 holds, where for any z ∈ Z and w ∈ R^d we have ‖∇_w l(z; w)‖ ≤ B and ‖∇²_w l(z; w)‖ ≤ L, then Assumption 5 holds with (check Appendix B)

γ_G² = (4B²/n) Σ_{i=1}^n ‖p_i − p‖²_TV,   γ_H² = (4L²/n) Σ_{i=1}^n ‖p_i − p‖²_TV.   (14)

This simple derivation shows that γ_G and γ_H exactly capture the difference between the probability distributions of the users in a heterogeneous setting.

• 1-Wasserstein Distance: The 1-Wasserstein distance between two probability distributions q_1 and q_2 over a metric space Z is defined as W_1(q_1, q_2) := inf_{q ∈ Q(q_1, q_2)} ∫_{Z×Z} d(z_1, z_2) dq(z_1, z_2), where d(·,·) is a distance function over the metric space Z and Q(q_1, q_2) denotes the set of all measures on Z × Z with marginals q_1 and q_2 on the first and second coordinates, respectively. Here, we assume all p_i have bounded support (note that this assumption holds in many cases, as either Z itself is bounded or we normalize the data). Also, we assume that for any w, the gradient ∇_w l(z; w) and the Hessian ∇²_w l(z; w) are both Lipschitz with respect to the parameter z and the distance d(·,·), i.e.,

‖∇_w l(z_1; w) − ∇_w l(z_2; w)‖ ≤ L_Z d(z_1, z_2),   ‖∇²_w l(z_1; w) − ∇²_w l(z_2; w)‖ ≤ ρ_Z d(z_1, z_2).   (15)

Then, Assumption 5 holds with (check Appendix B)

γ_G² = (L_Z²/n) Σ_{i=1}^n W_1(p_i, p)²,   γ_H² = (ρ_Z²/n) Σ_{i=1}^n W_1(p_i, p)².   (16)

This derivation does not require Assumption 2 and holds whenever the conditions in (15) are satisfied. Finally, consider a special case where the data distributions are homogeneous, and each p_i is an empirical distribution drawn from a distribution p_u with sample size m. In this case, we have W_1(p_i, p_u) = O(1/√m) [42].
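The O(1/√m) behavior of the Wasserstein distance between empirical distributions can be checked empirically in one dimension, where W_1 between two equal-size empirical distributions has a closed form via sorted samples. The Gaussian sampling distribution and all names below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)

def w1_empirical(a, b):
    # 1-Wasserstein distance between two 1-D empirical distributions with
    # equally many atoms: mean absolute difference of sorted samples.
    return np.mean(np.abs(np.sort(a) - np.sort(b)))

def avg_w1(m, trials=20):
    # Average W1 between two fresh m-sample empirical versions of N(0, 1).
    return np.mean([w1_empirical(rng.normal(size=m), rng.normal(size=m))
                    for _ in range(trials)])

# The distance shrinks roughly like 1/sqrt(m) as the sample size grows.
print(avg_w1(100) > avg_w1(10000))  # True
```

In one dimension this sorted-sample formula is exact; in higher dimensions the empirical rate degrades, which is the dimension caveat noted after Lemma 4.3.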
Hence, since W_1 is a distance, it is easy to verify that γ_G, γ_H = O(1/√m).

4.2 Convergence Analysis of Per-FedAvg

In this subsection, we derive the overall complexity of Per-FedAvg for achieving an ε-first-order stationary point. To do so, we first prove the following intermediate result, which shows that under Assumptions 2 and 3, the local meta-functions F_i(w) defined in (4) and their average function F(w) = (1/n) Σ_{i=1}^n F_i(w) are smooth.

Lemma 4.2.
Recall the definition of F_i(w) in (4), with α ∈ [0, 1/L]. If Assumptions 2 and 3 hold, then F_i is smooth with parameter L_F := 4L + αρB. As a consequence, the average function F(w) = (1/n) Σ_{i=1}^n F_i(w) is also smooth with parameter L_F.

Assumption 4 provides upper bounds on the variances of the gradient and Hessian estimates for the functions f_i. To analyze the convergence of Per-FedAvg, however, we require upper bounds on the bias and variance of the gradient estimate of F_i. We derive these bounds in the following lemma.

Lemma 4.3.
Recall the definition of the gradient estimate ˜∇F_i(w) in (8), which is computed using independent batches with sizes D, D', and D'', respectively. If Assumptions 2-4 hold, then for any α ∈ [0, 1/L] and w ∈ R^d we have

‖ E[ ˜∇F_i(w) − ∇F_i(w) ] ‖ ≤ αLσ_G/√D,

E[ ‖˜∇F_i(w) − ∇F_i(w)‖² ] ≤ σ_F² := 12 [ B² + σ_G² ( 1/D' + (αL)²/D ) ] [ 1 + σ_H²α²/D'' ] − 12B².

To measure the tightness of this result, we consider two special cases. First, if exact gradients and Hessians are available, i.e., σ_G = σ_H = 0, then σ_F = 0 as well, which is expected, as we can compute ∇F_i exactly. Second, for the classic federated learning problem, i.e., α = 0 and F_i = f_i, we have σ_F² = O(1) σ_G²/D', which is tight up to constants. Next, we use the similarity conditions for the functions f_i in Assumption 5 to study the similarity between the gradients of the functions F_i.

(While our focus here is to elaborate on the dependence of the Wasserstein distance on the number of samples, it is worth noting that one drawback of this bound is that the convergence speed of the Wasserstein distance is exponentially slow in the dimension.)

Lemma 4.4. Recall the definition of F_i(w) in (4) and assume that α ∈ [0, 1/L]. Suppose that the conditions in Assumptions 2, 3, and 5 are satisfied. Then, for any w ∈ R^d, we have

(1/n) Σ_{i=1}^n ‖∇F_i(w) − ∇F(w)‖² ≤ γ_F² := 3B²α²γ_H² + 192γ_G².

To check the tightness of this result, we focus on two special cases, as we did for Lemma 4.3. First, if the ∇f_i are all equal, i.e., γ_G = γ_H = 0, then γ_F = 0. This is indeed expected, as all ∇F_i are equal to each other in this case.
Second, for the classic federated learning problem, i.e., α = 0 and F_i = f_i, we have γ_F = O(1) γ_G, which is optimal up to a constant factor given the conditions in Assumption 5.

Theorem 4.5.
Consider the objective function F defined in (3) for the case that α ∈ (0, 1/L]. Suppose that the conditions in Assumptions 1-4 are satisfied, and recall the definitions of L_F, σ_F, and γ_F from Lemmas 4.2-4.4. Consider running Algorithm 1 for K rounds with τ local updates in each round and with β ≤ 1/(10τL_F). Then, the following first-order stationary condition holds:

(1/(τK)) Σ_{k=0}^{K−1} Σ_{t=0}^{τ−1} E[ ‖∇F(w̄_{k+1,t})‖² ] ≤ 4( F(w_0) − F* )/(βτK) + O(1) ( βL_F (1 + βL_F τ(τ−1)) σ_F² + βL_F γ_F² ( (1−r)/(r(n−1)) + βL_F τ(τ−1) ) + α²L²σ_G²/D ),

where w̄_{k+1,t} is the average of the iterates of the users in A_k at time t, i.e., w̄_{k+1,t} = (1/rn) Σ_{i∈A_k} w^i_{k+1,t}, and in particular, w̄_{k+1,0} = w_k and w̄_{k+1,τ} = w_{k+1}.

Note that σ_F is not a constant and, as expressed in Lemma 4.3, we can make it arbitrarily small by choosing the batch sizes D, D', or D'' large enough. To see how tight our result is, we again focus on special cases. Let α = 0, τ = 1, and r = 1. In this case, Per-FedAvg reduces to stochastic gradient descent, where the only source of stochasticity is the batches of gradients, and the second term on the right-hand side reduces to O(βL_F σ_F²), where σ_F² itself is equal to σ_G²/D'. This is the classic result for stochastic gradient descent for nonconvex functions, and we recover the known lower bounds [43]. Also, it is worth noting that the term α²L²σ_G²/D appears in the upper bound due to the fact that ˜∇F_i(w) is a biased estimator of ∇F_i(w). This bias term is eliminated if we assume access to exact gradients at training time (see the discussion after Lemma 4.3), which is, for instance, the case in [37], where the authors focus on the deterministic setting. Next, we characterize the choices of τ, K, and β in terms of the required accuracy ε that obtain the best possible complexity bound from the result in Theorem 4.5.

Corollary 4.6.
Suppose the conditions in Theorem 4.5 are satisfied. If we set the number of local updates as τ = O(ε^{−1/2}), the number of communication rounds with the server as K = O(ε^{−3/2}), and the stepsize of Per-FedAvg as β = O(ε), then we find an O(ε + α²σ_G²/D)-first-order stationary point of F.

The result in Corollary 4.6 shows that to achieve an O(ε + α²σ_G²/D)-first-order stationary point of F, the Per-FedAvg algorithm requires K = O(ε^{−3/2}) rounds of communication between the users and the server. Indeed, with these choices, βτK = O(ε · ε^{−1/2} · ε^{−3/2}) = O(ε^{−1}), so the first term of the bound in Theorem 4.5 is O(ε), while βL_F = O(ε) and (βL_F)²τ(τ−1) = O(ε) make the variance and heterogeneity terms O(ε) as well, leaving only the bias term α²L²σ_G²/D. Moreover, by setting D = O(ε^{−1}), or by setting the meta-step stepsize as α = O(ε^{1/2}), Per-FedAvg can find an ε-first-order stationary point of F for any arbitrary ε > 0.

Remark 4.7.
The results of Theorem 4.5 and Corollary 4.6 provide an upper bound on the average of E[ ‖∇F(w̄_{k+1,t})‖² ] over all k ∈ {0, 1, ..., K−1} and t ∈ {0, 1, ..., τ−1}. However, one concern here is that, due to the structure of Algorithm 1, for any k we only have access to w̄_{k+1,t} for t = 0. To address this issue, at any iteration k, the server can choose t_k ∈ {0, ..., τ−1} uniformly at random and ask all the users in A_k to send w^i_{k+1,t_k} back to the server, in addition to w^i_{k+1,τ}. Following this scheme, we can ensure that the same upper bound also holds for the expected average models at the server, i.e., (1/K) Σ_{k=0}^{K−1} E[ ‖∇F(w̄_{k+1,t_k})‖² ].

Remark 4.8.
It is worth noting that it is possible to achieve the same complexity bound using a diminishing stepsize. We discuss this further at the end of Appendix G.

5 Numerical Experiments
In this section, we numerically study the role of personalization when the data distributions are heterogeneous. In particular, we consider the multi-class classification problem over the MNIST [44] and CIFAR-10 [45] datasets and distribute the training data between n users as follows: (i) half of the users each have a images of each of the first five classes; (ii) the rest each have a/2 images from only one of the first five classes and a images from only one of the other five classes (see Appendix I for an illustration). We set the parameter a to a = 196 and a = 68 for the MNIST and CIFAR-10 datasets, respectively. This way, we create an example in which the distributions of images over the users are different. Similarly, we divide the test data over the nodes with the same distribution as for the training data. Note that for this particular example, in which the users' distributions are significantly different, our goal is not to achieve state-of-the-art accuracy; rather, we aim to provide an example to compare the various approaches for obtaining personalization in the heterogeneous setting. Indeed, by using more complex neural networks the results for all the considered algorithms would improve; however, their relative performance would stay the same.

We focus on three algorithms. The first method that we consider is FedAvg; to make a fair comparison, we take the output of FedAvg, update it with one step of stochastic gradient descent with respect to the test data, and then evaluate its performance. The second and third algorithms are two different efficient approximations of Per-FedAvg. Similarly, we evaluate the performance of these methods for the case that one step of local stochastic gradient descent is performed at test time. To formally explain these two approximate versions of Per-FedAvg, note that the implementation of Per-FedAvg requires access to second-order information, which is computationally costly.
To address this issue, we consider two different approximations:

(i) First, we replace the gradient estimate with its first-order approximation, which ignores the Hessian term; i.e., ˜∇F_i(w^i_{k+1,t−1}) in (8) is approximated by ˜∇f_i(w^i_{k+1,t−1} − α ˜∇f_i(w^i_{k+1,t−1}, D^i_t), D'^i_t). This is the same idea deployed in First-Order MAML (FO-MAML) in [2], and it has been studied empirically in the federated learning setting in [34]. We refer to this algorithm as Per-FedAvg (FO).

(ii) Second, we use the idea of HF-MAML, proposed in [33], in which the Hessian-vector product in the MAML update is replaced by a difference of gradients, using the approximation ∇²φ(w)v ≈ ( ∇φ(w + δv) − ∇φ(w − δv) )/(2δ). We refer to this algorithm as Per-FedAvg (HF).

As shown in [33], for a small stepsize α at test time, both FO-MAML and HF-MAML perform well, but as α becomes large, HF-MAML outperforms FO-MAML in the centralized setting. A more detailed discussion of Per-FedAvg (FO) and Per-FedAvg (HF) is provided in Appendix H; moreover, there we discuss how our analysis can be extended to these two methods. Note that the model obtained by any of these three methods is later updated using one step of stochastic gradient descent at test time, and hence they all have the same budget at test time.

We use a neural network with two hidden layers with sizes 80 and 60, with the Exponential Linear Unit (ELU) activation function. We take n = 50 users in the network and run all three algorithms for K = 1000 rounds. At each round, a fraction r of the agents is chosen to run τ local updates; the batch sizes are D = D' = 40, and β is a fixed learning rate. Part of the code is adopted from [46].
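The gradient-difference trick underlying Per-FedAvg (HF) can be checked on a toy quadratic, where the Hessian-vector product is available in closed form. The matrix A, vector b, and all names are illustrative assumptions for the demo:

```python
import numpy as np

rng = np.random.default_rng(3)

# Illustrative loss phi(w) = 0.5 w^T A w - b^T w with known Hessian A.
d, delta = 4, 1e-4
M = rng.normal(size=(d, d))
A = M @ M.T + np.eye(d)
b = rng.normal(size=d)
grad_phi = lambda w: A @ w - b

def hvp_approx(w, v):
    # Central-difference Hessian-vector product: only two gradient calls,
    # and the d x d Hessian is never formed.
    return (grad_phi(w + delta * v) - grad_phi(w - delta * v)) / (2 * delta)

w, v = rng.normal(size=d), rng.normal(size=d)
print(np.allclose(hvp_approx(w, v), A @ v, atol=1e-5))  # True
```

For a quadratic the central difference is exact up to floating-point error; for general smooth losses the approximation error is O(δ²), which is why a small δ suffices in practice.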
Note that the reported results for all the considered methods correspond to the average test accuracy among all users, after running one step of local stochastic gradient descent.

The test accuracy results, along with 95% confidence intervals, are reported in Table 1. For the MNIST dataset, both Per-FedAvg methods achieve a marginal gain compared to FedAvg. However, the gain from using Per-FedAvg (HF) compared to FedAvg is more significant for the CIFAR-10 dataset. In particular, we have three main observations: (i) For the smaller value of \alpha and \tau = 10, Per-FedAvg (FO) and Per-FedAvg (HF) perform almost similarly, and better than FedAvg. In addition, decreasing \tau leads to a decrease in the performance of all three algorithms, which is expected as the total number of iterations decreases. (ii) Next, we study the role of \alpha. By increasing \alpha, for \tau = 4, the performance of Per-FedAvg (HF) improves, which could be due to the fact that the model adapts better to user data at test time. However, as discussed above, for larger \alpha the performance of Per-FedAvg (FO) drops significantly. (iii) Third, we examine the effect of changing the level of data heterogeneity. To do so, we change the data distribution of the half of the users that have a/2 images from one of the first five classes by removing these images from their dataset. As the last line of Table 1 shows, Per-FedAvg (HF) performs significantly better than FedAvg under these new distributions, while Per-FedAvg (FO) still suffers from the issue we discussed in (ii). In summary, the more accurate implementation of Per-FedAvg, i.e., Per-FedAvg (HF), outperforms FedAvg in all cases and leads to a more personalized solution.

Table 1: Comparison of test accuracy of different algorithms given different parameters (mean ± 95% confidence interval)

Dataset   | Parameters                     | FedAvg + update | Per-FedAvg (FO) | Per-FedAvg (HF)
MNIST     | \tau = 10, \alpha = ·          |      · ± ·      |      · ± ·      |      · ± ·
MNIST     | \tau = 4,  \alpha = ·          |      · ± ·      |      · ± ·      |      · ± ·
CIFAR-10  | \tau = 10, \alpha = ·          |      · ± ·      |      · ± ·      |      · ± ·
CIFAR-10  | \tau = 4,  \alpha = ·          |      · ± ·      |      · ± ·      |      · ± ·
CIFAR-10  | \tau = 4,  \alpha = ·          |      · ± ·      |      · ± ·      |      · ± ·
CIFAR-10  | \tau = 4,  \alpha = ·, diff. hetero. | · ± ·     |      · ± ·      |      · ± ·
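The shared evaluation protocol, in which each user takes the learned initialization and runs one local SGD step before being evaluated, can be sketched on a toy model. The quadratic per-user losses below are an illustrative stand-in for the neural networks used in the experiments; all names and constants are our own.

```python
import numpy as np

# Sketch of the test-time protocol: every method produces one shared model w0,
# each user adapts it with ONE gradient step on its own data, then is scored.
rng = np.random.default_rng(4)
d, n_users, alpha = 5, 10, 0.05
hessians = [np.diag(rng.uniform(0.5, 2.0, d)) for _ in range(n_users)]  # user curvatures
optima = [rng.standard_normal(d) for _ in range(n_users)]               # user optima differ

def loss(i, w):
    return 0.5 * (w - optima[i]) @ hessians[i] @ (w - optima[i])

def grad(i, w):
    return hessians[i] @ (w - optima[i])

w0 = rng.standard_normal(d)   # shared initialization (e.g., the FedAvg output)
before = np.mean([loss(i, w0) for i in range(n_users)])
after = np.mean([loss(i, w0 - alpha * grad(i, w0)) for i in range(n_users)])
```

Since the stepsize is well below 2/L for every user, the one-step adapted models improve the average loss; the point of Per-FedAvg is to choose w0 so that this one-step gain is as large as possible.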
We considered the Federated Learning (FL) problem in the heterogeneous case and studied a personalized variant of the classic FL formulation, in which our goal is to find a proper initialization model for the users that can be quickly adapted to the local data of each user after the training phase. We highlighted the connections of this formulation with Model-Agnostic Meta-Learning (MAML) and showed how a decentralized implementation of MAML, which we call Per-FedAvg, can be used to solve the proposed personalized FL problem. We also characterized the overall complexity of Per-FedAvg for achieving first-order optimality in nonconvex settings. Finally, we provided a set of numerical experiments that compare two different first-order approximations of Per-FedAvg with the FedAvg method, and showed that Per-FedAvg leads to a more personalized solution than FedAvg.
Research was sponsored by the United States Air Force Research Laboratory and was accomplished under Cooperative Agreement Number FA8750-19-2-1000. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the United States Air Force or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein. Alireza Fallah acknowledges support from the MathWorks Engineering Fellowship. The research of Aryan Mokhtari is supported by NSF Award CCF-2007668.

References

[1] J. Konečný, H. B. McMahan, F. X. Yu, P. Richtárik, A. T. Suresh, and D. Bacon, "Federated learning: Strategies for improving communication efficiency," arXiv preprint arXiv:1610.05492, 2016. [2] C. Finn, P. Abbeel, and S. Levine, "Model-agnostic meta-learning for fast adaptation of deep networks," in
Proceedings of the 34th International Conference on Machine Learning, (Sydney, Australia), 06–11 Aug 2017. [3] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, "Communication-Efficient Learning of Deep Networks from Decentralized Data," in
Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, vol. 54 of
Proceedings ofMachine Learning Research , (Fort Lauderdale, FL, USA), pp. 1273–1282, PMLR, 20–22 Apr2017.[4] P. Kairouz, H. B. McMahan, B. Avent, A. Bellet, M. Bennis, A. N. Bhagoji, K. Bonawitz,Z. Charles, G. Cormode, R. Cummings, et al. , “Advances and open problems in federatedlearning,” arXiv preprint arXiv:1912.04977 , 2019.[5] T. Li, A. K. Sahu, A. Talwalkar, and V. Smith, “Federated learning: Challenges, methods,and future directions,”
IEEE Signal Process. Mag. , vol. 37, no. 3, pp. 50–60, 2020.[6] J. C. Duchi, M. I. Jordan, and M. J. Wainwright, “Privacy aware learning,”
Journal of the ACM (JACM), vol. 61, no. 6, p. 38, 2014. [7] H. B. McMahan, D. Ramage, K. Talwar, and L. Zhang, "Learning differentially private recurrent language models," arXiv preprint arXiv:1710.06963, 2017. [8] N. Agarwal, A. T. Suresh, F. X. X. Yu, S. Kumar, and B. McMahan, "cpsgd: Communication-efficient and differentially-private distributed sgd," in
Advances in Neural Information Processing Systems, pp. 7564–7575, 2018. [9] W. Zhu, P. Kairouz, B. McMahan, H. Sun, and W. Li, "Federated heavy hitters discovery with differential privacy," in
International Conference on Artificial Intelligence and Statistics ,pp. 3837–3847, 2020.[10] A. Reisizadeh, A. Mokhtari, H. Hassani, A. Jadbabaie, and R. Pedarsani, “Fedpaq: Acommunication-efficient federated learning method with periodic averaging and quantization,”in
International Conference on Artificial Intelligence and Statistics , pp. 2021–2031, 2020.[11] X. Dai, X. Yan, K. Zhou, K. K. Ng, J. Cheng, and Y. Fan, “Hyper-sphere quantization:Communication-efficient sgd for federated learning,” arXiv preprint arXiv:1911.04655 , 2019.[12] D. Basu, D. Data, C. Karakus, and S. Diggavi, “Qsparse-local-sgd: Distributed sgd with quan-tization, sparsification and local computations,” in
Advances in Neural Information Processing Systems, pp. 14668–14679, 2019. [13] Z. Li, D. Kovalev, X. Qian, and P. Richtárik, "Acceleration for compressed gradient descent in distributed and federated optimization," arXiv preprint arXiv:2002.11364, 2020. [14] S. U. Stich, "Local sgd converges fast and communicates little," arXiv preprint arXiv:1805.09767, 2018. [15] J. Wang and G. Joshi, "Cooperative sgd: A unified framework for the design and analysis of communication-efficient sgd algorithms," arXiv preprint arXiv:1808.07576, 2018. [16] F. Zhou and G. Cong, "On the convergence properties of a k-step averaging stochastic gradient descent algorithm for nonconvex optimization," in
Proceedings of the 27th International Joint Conference on Artificial Intelligence, pp. 3219–3227, 2018. [17] T. Lin, S. U. Stich, K. K. Patel, and M. Jaggi, "Don't use large mini-batches, use local SGD," 2020. [18] F. Hanzely and P. Richtárik, "Federated learning of a mixture of global and local models," arXiv preprint arXiv:2002.05516, 2020. [19] Y. Zhao, M. Li, L. Lai, N. Suda, D. Civin, and V. Chandra, "Federated learning with non-iid data," arXiv preprint arXiv:1806.00582, 2018. [20] A. K. Sahu, T. Li, M. Sanjabi, M. Zaheer, A. Talwalkar, and V. Smith, "On the convergence of federated optimization in heterogeneous networks," arXiv preprint arXiv:1812.06127, 2018. [21] S. P. Karimireddy, S. Kale, M. Mohri, S. J. Reddi, S. U. Stich, and A. T. Suresh, "Scaffold: Stochastic controlled averaging for on-device federated learning," arXiv preprint arXiv:1910.06378, 2019. [22] F. Haddadpour and M. Mahdavi, "On the convergence of local descent methods in federated learning," arXiv preprint arXiv:1910.14425, 2019. [23] A. Khaled, K. Mishchenko, and P. Richtárik, "Tighter theory for local sgd on identical and heterogeneous data," arXiv preprint arXiv:1909.04746, 2019. [24] X. Li, K. Huang, W. Yang, S. Wang, and Z. Zhang, "On the convergence of fedavg on non-iid data," arXiv preprint arXiv:1907.02189, 2019. [25] A. K. R. Bayoumi, K. Mishchenko, and P. Richtarik, "Tighter theory for local sgd on identical and heterogeneous data," in
International Conference on Artificial Intelligence and Statistics ,pp. 4519–4529, 2020.[26] A. Antoniou, H. Edwards, and A. Storkey, “How to train your MAML,” in
InternationalConference on Learning Representations , 2019.[27] Z. Li, F. Zhou, F. Chen, and H. Li, “Meta-SGD: Learning to learn quickly for few-shotlearning,” arXiv preprint arXiv:1707.09835 , 2017.[28] E. Grant, C. Finn, S. Levine, T. Darrell, and T. Griffiths, “Recasting gradient-based meta-learning as hierarchical bayes,” in
International Conference on Learning Representations , 2018.[29] A. Nichol, J. Achiam, and J. Schulman, “On first-order meta-learning algorithms,” arXivpreprint arXiv:1803.02999 , 2018.[30] L. Zintgraf, K. Shiarli, V. Kurin, K. Hofmann, and S. Whiteson, “Fast context adaptationvia meta-learning,” in
Proceedings of the 36th International Conference on Machine Learning ,pp. 7693–7702, 2019.[31] H. S. Behl, A. G. Baydin, and P. H. S. Torr, “Alpha MAML: adaptive model-agnostic meta-learning,” 2019.[32] P. Zhou, X. Yuan, H. Xu, S. Yan, and J. Feng, “Efficient meta learning via minibatch proximalupdate,” in
Advances in Neural Information Processing Systems 32 , pp. 1534–1544, CurranAssociates, Inc., 2019.[33] A. Fallah, A. Mokhtari, and A. Ozdaglar, “On the convergence theory of gradient-based model-agnostic meta-learning algorithms,” in
International Conference on Artificial Intelligence and Statistics, pp. 1082–1092, 2020. [34] F. Chen, M. Luo, Z. Dong, Z. Li, and X. He, "Federated meta-learning with fast convergence and efficient communication," arXiv preprint arXiv:1802.07876, 2018. [35] Y. Jiang, J. Konečný, K. Rush, and S. Kannan, "Improving federated learning personalization via model agnostic meta learning," arXiv preprint arXiv:1909.12488, 2019. [36] T. Li, M. Sanjabi, and V. Smith, "Fair resource allocation in federated learning," arXiv preprint arXiv:1905.10497, 2019. [37] S. Lin, G. Yang, and J. Zhang, "A collaborative learning framework via federated meta-learning," arXiv preprint arXiv:2001.03229, 2020. [38] M. Khodak, M.-F. F. Balcan, and A. S. Talwalkar, "Adaptive gradient-based meta-learning methods," in
Advances in Neural Information Processing Systems , pp. 5915–5926, 2019.[39] J. Li, M. Khodak, S. Caldas, and A. Talwalkar, “Differentially private meta-learning,” arXivpreprint arXiv:1909.05830 , 2019.[40] V. Smith, C.-K. Chiang, M. Sanjabi, and A. S. Talwalkar, “Federated multi-task learning,” in
Advances in Neural Information Processing Systems , pp. 4424–4434, 2017.[41] Y. Deng, M. M. Kamani, and M. Mahdavi, “Adaptive personalized federated learning,” arXivpreprint arXiv:2003.13461 , 2020.[42] E. del Barrio, E. Giné, and C. Matrán, “Central limit theorems for the wasserstein distancebetween the empirical and the true distributions,”
Annals of Probability , pp. 1009–1071, 1999.[43] Y. Arjevani, Y. Carmon, J. C. Duchi, D. J. Foster, N. Srebro, and B. Woodworth, “Lowerbounds for non-convex stochastic optimization,” arXiv preprint arXiv:1912.02365 , 2019.[44] Y. LeCun, “The mnist database of handwritten digits,” http://yann. lecun. com/exdb/mnist/ ,1998.[45] A. Krizhevsky, G. Hinton, et al. , “Learning multiple layers of features from tiny images,” 2009.[46] J. Langelaar, “Mnist neural network training and testing,”
MATLAB Central File Exchange ,2019.[47] C. Villani,
Optimal transport: old and new, vol. 338. Springer Science & Business Media, 2008.

Appendix
A Intermediate Notes
Note that the gradient Lipschitz assumption, i.e., the second inequality in (9), also implies that f_i satisfies the following conditions for all w, u \in \mathbb{R}^d:

    -L_i I_d \preceq \nabla^2 f_i(w) \preceq L_i I_d,   (17a)

    | f_i(w) - f_i(u) - \nabla f_i(u)^\top (w - u) | \le \frac{L_i}{2} \| w - u \|^2.   (17b)

B Proofs of results in Subsection 4.1
B.1 TV Distance
Note that

    \| \nabla f_i(w) - \nabla f(w) \| = \| \sum_{z \in \mathcal{Z}} \nabla_w l(z; w) ( p_i(z) - p(z) ) \|
        \le \sum_{z \in \mathcal{Z}} \| \nabla_w l(z; w) \| \, | p_i(z) - p(z) |
        \le B \sum_{z \in \mathcal{Z}} | p_i(z) - p(z) | = 2B \| p_i - p \|_{TV},   (18)

where the second inequality holds due to the assumption that \| \nabla_w l(z; w) \| \le B for any w and z. Plugging (18) into \frac{1}{n} \sum_{i=1}^{n} \| \nabla f_i(w) - \nabla f(w) \| gives us the desired result. The other result, on Hessians, can be proved similarly.

B.2 1-Wasserstein Distance
We claim that for any i and w \in \mathbb{R}^d, we have

    \| \nabla f_i(w) - \nabla f(w) \| \le L_{\mathcal{Z}} W_1(p_i, p),   (19)

which will immediately give us one of the two results. To show this, first note that

    \| \nabla f_i(w) - \nabla f(w) \| = \sup_{v \in \mathbb{R}^d : \|v\| \le 1} v^\top ( \nabla f_i(w) - \nabla f(w) )
        = \sup_{v \in \mathbb{R}^d : \|v\| \le 1} ( E_{z \sim p_i}[ v^\top \nabla l(z; w) ] - E_{z \sim p}[ v^\top \nabla l(z; w) ] ).

Thus, we need to show that for any v \in \mathbb{R}^d with \|v\| \le 1, we have

    E_{z \sim p_i}[ v^\top \nabla l(z; w) ] - E_{z \sim p}[ v^\top \nabla l(z; w) ] \le L_{\mathcal{Z}} W_1(p_i, p).   (20)

Next, note that since p_i and p both have bounded support, by Kantorovich-Rubinstein duality [47], we have

    W_1(p_i, p) = \sup \{ E_{z \sim p_i}[g(z)] - E_{z \sim p}[g(z)] \mid \text{continuous } g : \mathcal{Z} \to \mathbb{R}, \ \mathrm{Lip}(g) \le 1 \}.   (21)

Using this result, to show (20) it suffices to show that g(z) = v^\top \nabla l(z; w) is L_{\mathcal{Z}}-Lipschitz. The Cauchy-Schwarz inequality implies

    \| v^\top \nabla l(z_1; w) - v^\top \nabla l(z_2; w) \| \le \|v\| \, \| \nabla l(z_1; w) - \nabla l(z_2; w) \| \le L_{\mathcal{Z}} d(z_1, z_2),   (22)

where the last inequality is obtained using \|v\| \le 1 along with (15). Finally, we can similarly show the result for \gamma_H by considering the fact that

    \| \nabla^2 f_i(w) - \nabla^2 f(w) \| = \max_{\xi \in \{1, -1\}} \sup_{v : \|v\| \le 1} \xi v^\top ( \nabla^2 f_i(w) - \nabla^2 f(w) ) v
        = \max_{\xi \in \{1, -1\}} \sup_{v : \|v\| \le 1} \xi ( E_{z \sim p_i}[ v^\top \nabla^2 l(z; w) v ] - E_{z \sim p}[ v^\top \nabla^2 l(z; w) v ] ),

and taking the functions g_1(z) = v^\top \nabla^2 l(z; w) v and g_2(z) = -v^\top \nabla^2 l(z; w) v along with using the Kantorovich-Rubinstein duality theorem again.

C Proof of Lemma 4.2
Recall that

    \nabla F_i(w) = ( I - \alpha \nabla^2 f_i(w) ) \nabla f_i( w - \alpha \nabla f_i(w) ).   (23)

Given this, note that for any w, u \in \mathbb{R}^d,

    \| \nabla F_i(w) - \nabla F_i(u) \|
      = \| ( I - \alpha \nabla^2 f_i(w) ) \nabla f_i( w - \alpha \nabla f_i(w) ) - ( I - \alpha \nabla^2 f_i(u) ) \nabla f_i( u - \alpha \nabla f_i(u) ) \|
      = \| ( I - \alpha \nabla^2 f_i(w) ) ( \nabla f_i( w - \alpha \nabla f_i(w) ) - \nabla f_i( u - \alpha \nabla f_i(u) ) )
          + ( ( I - \alpha \nabla^2 f_i(w) ) - ( I - \alpha \nabla^2 f_i(u) ) ) \nabla f_i( u - \alpha \nabla f_i(u) ) \|   (24)
      \le \| I - \alpha \nabla^2 f_i(w) \| \, \| \nabla f_i( w - \alpha \nabla f_i(w) ) - \nabla f_i( u - \alpha \nabla f_i(u) ) \|
          + \alpha \| \nabla^2 f_i(w) - \nabla^2 f_i(u) \| \, \| \nabla f_i( u - \alpha \nabla f_i(u) ) \|,   (25)

where (24) is obtained by adding and subtracting ( I - \alpha \nabla^2 f_i(w) ) \nabla f_i( u - \alpha \nabla f_i(u) ), and the last inequality follows from the triangle inequality and the definition of the matrix norm. We now bound the two terms of (25) separately.

First, note that by (17a), \| I - \alpha \nabla^2 f_i(w) \| \le 1 + \alpha L. Using this along with the smoothness of f_i, we have

    \| I - \alpha \nabla^2 f_i(w) \| \, \| \nabla f_i( w - \alpha \nabla f_i(w) ) - \nabla f_i( u - \alpha \nabla f_i(u) ) \|
      \le (1 + \alpha L) L \| w - \alpha \nabla f_i(w) - u + \alpha \nabla f_i(u) \|
      \le (1 + \alpha L) L ( \| w - u \| + \alpha \| \nabla f_i(w) - \nabla f_i(u) \| )
      \le (1 + \alpha L)^2 L \| w - u \| \le 4L \| w - u \|,   (26)

where we used the smoothness of f_i along with \alpha \le 1/L.

For the second term, using (9) in Assumption 2 along with Assumption 3 implies

    \alpha \| \nabla^2 f_i(w) - \nabla^2 f_i(u) \| \, \| \nabla f_i( u - \alpha \nabla f_i(u) ) \| \le \alpha \rho B \| w - u \|.   (27)

Putting (26) and (27) together, we obtain the desired result.

D Proof of Lemma 4.3
Recall that the expression for the stochastic gradient \tilde{\nabla} F_i(w) is given by

    \tilde{\nabla} F_i(w) = ( I - \alpha \tilde{\nabla}^2 f_i(w, D'') ) \tilde{\nabla} f_i( w - \alpha \tilde{\nabla} f_i(w, D), D' ),   (28)

which can be written as

    \tilde{\nabla} F_i(w) = ( I - \alpha \nabla^2 f_i(w) + e_1 ) ( \nabla f_i( w - \alpha \nabla f_i(w) ) + e_2 ).   (29)

In the above expression, e_1 and e_2 are given by

    e_1 = \alpha ( \nabla^2 f_i(w) - \tilde{\nabla}^2 f_i(w, D'') ),
    e_2 = \tilde{\nabla} f_i( w - \alpha \tilde{\nabla} f_i(w, D), D' ) - \nabla f_i( w - \alpha \nabla f_i(w) ).

Based on Assumption 4, it can be easily shown that

    E[e_1] = 0,   (30a)
    E[ \|e_1\|^2 ] \le \frac{\alpha^2 \sigma_H^2}{D''}.   (30b)

Next, we proceed to bound the first and second moments of e_2. To do so, first note that e_2 can also be written as

    e_2 = ( \tilde{\nabla} f_i( w - \alpha \tilde{\nabla} f_i(w, D), D' ) - \nabla f_i( w - \alpha \tilde{\nabla} f_i(w, D) ) )
        + ( \nabla f_i( w - \alpha \tilde{\nabla} f_i(w, D) ) - \nabla f_i( w - \alpha \nabla f_i(w) ) ).   (31)

Conditioned on D, the first term is zero mean and the second term is deterministic. Therefore,

    \| E[e_2] \| = \| E[ \nabla f_i( w - \alpha \tilde{\nabla} f_i(w, D) ) - \nabla f_i( w - \alpha \nabla f_i(w) ) ] \|
      \le E[ \| \nabla f_i( w - \alpha \tilde{\nabla} f_i(w, D) ) - \nabla f_i( w - \alpha \nabla f_i(w) ) \| ]
      \le \alpha L \, E[ \| \tilde{\nabla} f_i(w, D) - \nabla f_i(w) \| ]   (32)
      \le \frac{\alpha L \sigma_G}{\sqrt{D}},   (33)

where (32) is obtained using the smoothness of f_i. The last inequality is obtained using

    E[ \| \tilde{\nabla} f_i(w, D) - \nabla f_i(w) \|^2 ] \le \frac{\sigma_G^2}{D}.   (34)
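The 1/D variance reduction in (34) can be verified exactly on a small discrete gradient distribution by enumerating all minibatches of size D; the distribution below is an illustrative example, not tied to any particular loss.

```python
import itertools
import numpy as np

# Exact check of E|| (1/D) sum_j g_j - E[g] ||^2 = sigma_G^2 / D for i.i.d. draws,
# enumerating all D-tuples from a 3-point gradient distribution (illustrative).
G = np.array([[1.0, 0.0], [0.0, 2.0], [-1.0, -1.0]])   # possible per-sample gradients
p = np.array([0.2, 0.5, 0.3])                          # their probabilities
mean = p @ G                                           # E[g]
sigma2 = sum(pi * np.sum((g - mean) ** 2) for pi, g in zip(p, G))  # E||g - E[g]||^2
D = 2
lhs = sum(p[i] * p[j] * np.sum(((G[i] + G[j]) / 2 - mean) ** 2)
          for i, j in itertools.product(range(3), repeat=2))
```

Because the D draws are independent, the cross terms vanish in expectation, leaving exactly sigma2 / D.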
In addition, we have

    E[ \|e_2\|^2 ] = E[ E[ \|e_2\|^2 \mid D ] ]
      = E[ \| \tilde{\nabla} f_i( w - \alpha \tilde{\nabla} f_i(w, D), D' ) - \nabla f_i( w - \alpha \tilde{\nabla} f_i(w, D) ) \|^2 ]
        + E[ \| \nabla f_i( w - \alpha \tilde{\nabla} f_i(w, D) ) - \nabla f_i( w - \alpha \nabla f_i(w) ) \|^2 ]
      \le \frac{\sigma_G^2}{D'} + L^2 \alpha^2 E[ \| \tilde{\nabla} f_i(w, D) - \nabla f_i(w) \|^2 ]   (35)
      \le \sigma_G^2 ( \frac{1}{D'} + \frac{(\alpha L)^2}{D} ),   (36)

where (36) follows from (34), and (35) is obtained using the smoothness of f_i along with the fact that

    E[ \| \tilde{\nabla} f_i( w - \alpha \tilde{\nabla} f_i(w, D), D' ) - \nabla f_i( w - \alpha \tilde{\nabla} f_i(w, D) ) \|^2 ] \le \frac{\sigma_G^2}{D'}.

Next, note that, by comparing (29) and (5), along with the fact that e_1 and e_2 are independent and e_1 is zero mean (30a), we have

    E[ \tilde{\nabla} F_i(w) - \nabla F_i(w) ] = ( I - \alpha \nabla^2 f_i(w) ) E[e_2].   (37)

Hence, by taking the norm of both sides, we obtain

    \| E[ \tilde{\nabla} F_i(w) - \nabla F_i(w) ] \| = \| ( I - \alpha \nabla^2 f_i(w) ) E[e_2] \| \le \| I - \alpha \nabla^2 f_i(w) \| \, \| E[e_2] \|,   (38)

where the last inequality follows from the definition of the matrix norm.
Now, using (33) along with the fact that \| I - \alpha \nabla^2 f_i(w) \| \le 1 + \alpha L \le 2 gives us the first result in Lemma 4.3.

To show the other result, note that, by comparing (29) and (5), along with the matrix norm definition, we have

    \| \tilde{\nabla} F_i(w) - \nabla F_i(w) \| \le \| I - \alpha \nabla^2 f_i(w) \| \|e_2\| + \|e_1\| \| \nabla f_i( w - \alpha \nabla f_i(w) ) \| + \|e_1\| \|e_2\|.   (39)

As a result, by the Cauchy-Schwarz inequality (a + b + c)^2 \le 3(a^2 + b^2 + c^2) for a, b, c \ge 0, we have

    \| \tilde{\nabla} F_i(w) - \nabla F_i(w) \|^2 \le 3 \| I - \alpha \nabla^2 f_i(w) \|^2 \|e_2\|^2 + 3 \|e_1\|^2 \| \nabla f_i( w - \alpha \nabla f_i(w) ) \|^2 + 3 \|e_1\|^2 \|e_2\|^2.   (40)

By taking expectation, and using the facts that \| I - \alpha \nabla^2 f_i(w) \| \le 1 + \alpha L \le 2 and \| \nabla f_i( w - \alpha \nabla f_i(w) ) \| \le B, we have

    E[ \| \tilde{\nabla} F_i(w) - \nabla F_i(w) \|^2 ] \le 3 B^2 E[ \|e_1\|^2 ] + 12 E[ \|e_2\|^2 ] + 3 E[ \|e_1\|^2 ] E[ \|e_2\|^2 ],   (41)

where we also used the fact that e_1 and e_2 are independent, as D'' is independent of D and D'. Plugging (30b) and (36) into (41), we obtain

    E[ \| \tilde{\nabla} F_i(w) - \nabla F_i(w) \|^2 ] \le \frac{3 B^2 \alpha^2 \sigma_H^2}{D''} + 12 \sigma_G^2 ( \frac{1}{D'} + \frac{(\alpha L)^2}{D} ) + 3 \alpha^2 \sigma_G^2 \sigma_H^2 ( \frac{1}{D' D''} + \frac{(\alpha L)^2}{D D''} ),

which gives us the desired result.

E Proof of Lemma 4.4
Recall that

    \nabla F_i(w) = ( I - \alpha \nabla^2 f_i(w) ) \nabla f_i( w - \alpha \nabla f_i(w) ),   (42)

which can be expressed as

    \nabla F_i(w) = ( I - \alpha \nabla^2 f(w) + E_i ) ( \nabla f( w - \alpha \nabla f(w) ) + r_i ),   (43)

where

    E_i = \alpha ( \nabla^2 f(w) - \nabla^2 f_i(w) ),   (44)
    r_i = \nabla f_i( w - \alpha \nabla f_i(w) ) - \nabla f( w - \alpha \nabla f(w) ).   (45)

First, note that, by Assumption 5, we have

    \frac{1}{n} \sum_{i=1}^{n} \| E_i \|^2 = \alpha^2 \gamma_H^2.   (46)

Second, note that

    \| r_i \| \le \| \nabla f_i( w - \alpha \nabla f_i(w) ) - \nabla f_i( w - \alpha \nabla f(w) ) \| + \| \nabla f_i( w - \alpha \nabla f(w) ) - \nabla f( w - \alpha \nabla f(w) ) \|
      \le \alpha L \| \nabla f_i(w) - \nabla f(w) \| + \| \nabla f_i( w - \alpha \nabla f(w) ) - \nabla f( w - \alpha \nabla f(w) ) \|,   (47)

where the last inequality is obtained using (9) in Assumption 2. Now, by using (a + b)^2 \le 2(a^2 + b^2), we have

    \frac{1}{n} \sum_{i=1}^{n} \| r_i \|^2 \le \frac{2}{n} \sum_{i=1}^{n} ( (\alpha L)^2 \| \nabla f_i(w) - \nabla f(w) \|^2 + \| \nabla f_i( w - \alpha \nabla f(w) ) - \nabla f( w - \alpha \nabla f(w) ) \|^2 )
      \le 2 ( (\alpha L)^2 \gamma_G^2 + \gamma_G^2 )   (48)
      \le 4 \gamma_G^2,   (49)

where the second inequality follows from Assumption 5 and the last inequality is obtained using \alpha L \le 1. Next, recall that the goal is to bound the variance of \nabla F_i(w) when i is drawn from a uniform distribution. We know that subtracting a constant from a random variable does not change its variance. Thus, the variance of \nabla F_i(w) is equal to the variance of \nabla F_i(w) - ( I - \alpha \nabla^2 f(w) ) \nabla f( w - \alpha \nabla f(w) ).
Also, the variance of the latter is bounded by its second moment, and hence

    \frac{1}{n} \sum_{i=1}^{n} \| \nabla F_i(w) - \nabla F(w) \|^2 \le \frac{1}{n} \sum_{i=1}^{n} \| E_i \nabla f( w - \alpha \nabla f(w) ) + ( I - \alpha \nabla^2 f(w) ) r_i + E_i r_i \|^2
      \le \frac{3}{n} \sum_{i=1}^{n} ( \| E_i \nabla f( w - \alpha \nabla f(w) ) \|^2 + \| ( I - \alpha \nabla^2 f(w) ) r_i \|^2 + \| E_i r_i \|^2 ).   (50)

Therefore, using \| \nabla f( w - \alpha \nabla f(w) ) \| \le B along with \| I - \alpha \nabla^2 f(w) \| \le 1 + \alpha L and the Cauchy-Schwarz inequality (a + b + c)^2 \le 3(a^2 + b^2 + c^2) for a, b, c \ge 0, we obtain

    \frac{1}{n} \sum_{i=1}^{n} \| \nabla F_i(w) - \nabla F(w) \|^2 \le 3 ( B^2 \frac{1}{n} \sum_{i=1}^{n} \| E_i \|^2 + (1 + \alpha L)^2 \frac{1}{n} \sum_{i=1}^{n} \| r_i \|^2 + \frac{1}{n} \sum_{i=1}^{n} \| E_i r_i \|^2 )
      \le 3 ( B^2 \frac{1}{n} \sum_{i=1}^{n} \| E_i \|^2 + 4 \frac{1}{n} \sum_{i=1}^{n} \| r_i \|^2 + \frac{1}{n} \sum_{i=1}^{n} \| E_i \|^2 \| r_i \|^2 ),   (51)

where the last inequality is obtained using \alpha L \le 1 along with \| E_i r_i \| \le \| E_i \| \| r_i \|, which comes from the definition of the matrix norm. Finally, to complete the proof, notice that

    \frac{1}{n} \sum_{i=1}^{n} \| E_i \|^2 \| r_i \|^2 \le \max_i \| E_i \|^2 ( \frac{1}{n} \sum_{i=1}^{n} \| r_i \|^2 )   (52)
      \le \max_i \| E_i \|^2 ( 4 \gamma_G^2 )   (53)
      \le 16 (\alpha L)^2 \gamma_G^2 \le 16 \gamma_G^2,   (54)

where (53) follows from (49) and the last line is obtained using \alpha L \le 1 along with the fact that \| \nabla^2 f_i(w) \| \le L, and thus

    \frac{ \| E_i \| }{ \alpha } = \| \nabla^2 f(w) - \nabla^2 f_i(w) \| \le 2L.   (55)

Plugging (52)-(54) into (51) along with (46) and (49), we obtain the desired result.

F An Intermediate Result
Proposition F.1.
Recall from Section 3 that at any round k \ge 1, and for any agent i \in \{1, \dots, n\}, we can define a sequence of local updates \{ w_{k,t}^i \}_{t=0}^{\tau} where w_{k,0}^i = w_{k-1} and, for \tau \ge t \ge 1,

    w_{k,t}^i = w_{k,t-1}^i - \beta \tilde{\nabla} F_i( w_{k,t-1}^i ).   (56)

We further define the average of these local updates at round k and time t as w_{k,t} = \frac{1}{n} \sum_{i=1}^{n} w_{k,t}^i. Suppose that the conditions in Assumptions 2-4 are satisfied. Then, for any \alpha \in [0, 1/L] and any t \ge 0, we have

    E[ \frac{1}{n} \sum_{i=1}^{n} \| w_{k,t}^i - w_{k,t} \| ] \le 2 \beta t ( 1 + 2 \beta L_F )^{t-1} ( \sigma_F + \gamma_F ),   (57a)

    E[ \frac{1}{n} \sum_{i=1}^{n} \| w_{k,t}^i - w_{k,t} \|^2 ] \le 4 \beta^2 ( 1 + \frac{1}{\phi} ) t ( 1 + \phi + 16 ( 1 + \frac{1}{\phi} ) \beta^2 L_F^2 )^{t-1} ( 2 \sigma_F^2 + \gamma_F^2 ),   (57b)

where \phi > 0 is an arbitrary positive constant and L_F, \sigma_F, and \gamma_F are given in Lemmas 4.2, 4.3, and 4.4, respectively.

Before stating the proof, note that an immediate consequence of this result is the following corollary:

Corollary F.2.
Under the same assumptions as Proposition F.1, and for any \beta \le 1/(10 \tau L_F), we have

    E[ \frac{1}{n} \sum_{i=1}^{n} \| w_{k,t}^i - w_{k,t} \| ] \le 4 \beta t ( \sigma_F + \gamma_F ),   (58a)

    E[ \frac{1}{n} \sum_{i=1}^{n} \| w_{k,t}^i - w_{k,t} \|^2 ] \le 36 \beta^2 t \tau ( 2 \sigma_F^2 + \gamma_F^2 )   (58b)

for any 0 \le t \le \tau.

Proof. Let

    S_t := \frac{1}{n} \sum_{i=1}^{n} E[ \| w_{k,t}^i - w_{k,t} \| ],   (59)

where S_0 = 0 since w_{k,0}^i = w_{k-1} for any i. Note that

    S_{t+1} = \frac{1}{n} \sum_{i=1}^{n} E[ \| w_{k,t+1}^i - w_{k,t+1} \| ]
      = \frac{1}{n} \sum_{i=1}^{n} E \| w_{k,t}^i - \beta \tilde{\nabla} F_i( w_{k,t}^i ) - \frac{1}{n} \sum_{j=1}^{n} ( w_{k,t}^j - \beta \tilde{\nabla} F_j( w_{k,t}^j ) ) \|
      \le \frac{1}{n} \sum_{i=1}^{n} E \| w_{k,t}^i - \frac{1}{n} \sum_{j=1}^{n} w_{k,t}^j \| + \frac{\beta}{n} \sum_{i=1}^{n} E \| \tilde{\nabla} F_i( w_{k,t}^i ) - \frac{1}{n} \sum_{j=1}^{n} \tilde{\nabla} F_j( w_{k,t}^j ) \|.   (60)

Note that the first term in (60) is in fact S_t, and the second one can be upper bounded as follows:

    \frac{1}{n} \sum_{i=1}^{n} E \| \tilde{\nabla} F_i( w_{k,t}^i ) - \frac{1}{n} \sum_{j=1}^{n} \tilde{\nabla} F_j( w_{k,t}^j ) \|
      \le \frac{1}{n} \sum_{i=1}^{n} E \| \nabla F_i( w_{k,t}^i ) - \frac{1}{n} \sum_{j=1}^{n} \nabla F_j( w_{k,t}^j ) \|
        + \frac{1}{n} \sum_{i=1}^{n} E[ \| \nabla F_i( w_{k,t}^i ) - \tilde{\nabla} F_i( w_{k,t}^i ) \| ]
        + \frac{1}{n} \sum_{i=1}^{n} E[ \frac{1}{n} \sum_{j=1}^{n} \| \nabla F_j( w_{k,t}^j ) - \tilde{\nabla} F_j( w_{k,t}^j ) \| ]
      \le \frac{1}{n} \sum_{i=1}^{n} E \| \nabla F_i( w_{k,t}^i ) - \frac{1}{n} \sum_{j=1}^{n} \nabla F_j( w_{k,t}^j ) \| + 2 \sigma_F,

where the last inequality is obtained using Lemma 4.3. By substituting this in (60), we obtain

    S_{t+1} \le S_t + 2 \beta \sigma_F + \frac{\beta}{n} \sum_{i=1}^{n} E \| \nabla F_i( w_{k,t}^i ) - \frac{1}{n} \sum_{j=1}^{n} \nabla F_j( w_{k,t}^j ) \|.   (61)
If we define \eta_i := \nabla F_i( w_{k,t}^i ) - \nabla F_i( w_{k,t} ), using (61) we obtain

    S_{t+1} \le S_t + 2 \beta \sigma_F + \frac{\beta}{n} \sum_{i=1}^{n} E \| \nabla F_i( w_{k,t} ) - \frac{1}{n} \sum_{j=1}^{n} \nabla F_j( w_{k,t} ) \| + \frac{\beta}{n} \sum_{i=1}^{n} E \| \eta_i - \frac{1}{n} \sum_{j=1}^{n} \eta_j \|.   (62)

Note that, by Lemma 4.2,

    \| \eta_i \| \le L_F \| w_{k,t}^i - w_{k,t} \|,   (63)

and thus,

    \frac{1}{n} \sum_{i=1}^{n} \| \eta_i \| \le L_F \cdot \frac{1}{n} \sum_{i=1}^{n} \| w_{k,t}^i - w_{k,t} \|.   (64)

As a result, and by using (62), we have

    S_{t+1} \le ( 1 + 2 \beta L_F ) S_t + 2 \beta \sigma_F + \frac{\beta}{n} \sum_{i=1}^{n} E \| \nabla F_i( w_{k,t} ) - \frac{1}{n} \sum_{j=1}^{n} \nabla F_j( w_{k,t} ) \|
      \le ( 1 + 2 \beta L_F ) S_t + 2 \beta ( \sigma_F + \gamma_F ),   (65)

where the last inequality is obtained using Lemma 4.4. Using (65) recursively, we obtain

    S_{t+1} \le \sum_{j=0}^{t} ( 1 + 2 \beta L_F )^j \, 2 \beta ( \sigma_F + \gamma_F ) \le 2 \beta ( t + 1 )( 1 + 2 \beta L_F )^t ( \sigma_F + \gamma_F ),   (66)

which completes the proof of (57a). To prove (57b), let

    \Sigma_t := \frac{1}{n} \sum_{i=1}^{n} E[ \| w_{k,t}^i - w_{k,t} \|^2 ].   (67)

Similarly, \Sigma_0 = 0. Note that

    \Sigma_{t+1} = \frac{1}{n} \sum_{i=1}^{n} E[ \| w_{k,t+1}^i - w_{k,t+1} \|^2 ]
      = \frac{1}{n} \sum_{i=1}^{n} E \| w_{k,t}^i - \beta \tilde{\nabla} F_i( w_{k,t}^i ) - \frac{1}{n} \sum_{j=1}^{n} ( w_{k,t}^j - \beta \tilde{\nabla} F_j( w_{k,t}^j ) ) \|^2
      \le \frac{1 + \phi}{n} \sum_{i=1}^{n} E \| w_{k,t}^i - \frac{1}{n} \sum_{j=1}^{n} w_{k,t}^j \|^2 + \frac{( 1 + 1/\phi ) \beta^2}{n} \sum_{i=1}^{n} E \| \tilde{\nabla} F_i( w_{k,t}^i ) - \frac{1}{n} \sum_{j=1}^{n} \tilde{\nabla} F_j( w_{k,t}^j ) \|^2   (68)
      \le ( 1 + \phi ) \Sigma_t + \frac{( 1 + 1/\phi ) \beta^2}{n} \sum_{i=1}^{n} E \| \tilde{\nabla} F_i( w_{k,t}^i ) - \frac{1}{n} \sum_{j=1}^{n} \tilde{\nabla} F_j( w_{k,t}^j ) \|^2,   (69)

where (68) is obtained using \| a + b \|^2 \le ( 1 + \phi ) \| a \|^2 + ( 1 + 1/\phi ) \| b \|^2 for any arbitrary positive real number \phi.
To bound the second term in (69), note that

    E \| \tilde{\nabla} F_i( w_{k,t}^i ) - \frac{1}{n} \sum_{j=1}^{n} \tilde{\nabla} F_j( w_{k,t}^j ) \|^2
      \le 2 E \| \nabla F_i( w_{k,t}^i ) - \frac{1}{n} \sum_{j=1}^{n} \nabla F_j( w_{k,t}^j ) \|^2
        + 2 E \| ( \tilde{\nabla} F_i( w_{k,t}^i ) - \nabla F_i( w_{k,t}^i ) ) + \frac{1}{n} \sum_{j=1}^{n} ( \nabla F_j( w_{k,t}^j ) - \tilde{\nabla} F_j( w_{k,t}^j ) ) \|^2.   (70)

Now, we bound the second term in (70). Using the Cauchy-Schwarz inequality

    \| \sum_{l=1}^{n+1} a_l b_l \|^2 \le ( \sum_{l=1}^{n+1} \| a_l \|^2 )( \sum_{l=1}^{n+1} \| b_l \|^2 )   (71)

with a_1 = \tilde{\nabla} F_i( w_{k,t}^i ) - \nabla F_i( w_{k,t}^i ), b_1 = 1 and a_l = \frac{1}{\sqrt{n}} ( \tilde{\nabla} F_{l-1}( w_{k,t}^{l-1} ) - \nabla F_{l-1}( w_{k,t}^{l-1} ) ), b_l = \frac{1}{\sqrt{n}} for l = 2, \dots, n+1, implies

    E \| ( \tilde{\nabla} F_i( w_{k,t}^i ) - \nabla F_i( w_{k,t}^i ) ) + \frac{1}{n} \sum_{j=1}^{n} ( \nabla F_j( w_{k,t}^j ) - \tilde{\nabla} F_j( w_{k,t}^j ) ) \|^2
      \le 2 E \| \tilde{\nabla} F_i( w_{k,t}^i ) - \nabla F_i( w_{k,t}^i ) \|^2 + \frac{2}{n} \sum_{j=1}^{n} E \| \nabla F_j( w_{k,t}^j ) - \tilde{\nabla} F_j( w_{k,t}^j ) \|^2 \le 4 \sigma_F^2,   (72)

where the last inequality is obtained using Lemma 4.3. Plugging (72) into (70) and using (69), we obtain

    \Sigma_{t+1} \le ( 1 + \phi ) \Sigma_t + 8 ( 1 + \frac{1}{\phi} ) \beta^2 \sigma_F^2 + 2 ( 1 + \frac{1}{\phi} ) \beta^2 \frac{1}{n} \sum_{i=1}^{n} E \| \nabla F_i( w_{k,t}^i ) - \frac{1}{n} \sum_{j=1}^{n} \nabla F_j( w_{k,t}^j ) \|^2.   (73)

Now, it remains to bound the last term in (73). Recall \eta_i = \nabla F_i( w_{k,t}^i ) - \nabla F_i( w_{k,t} ).
First, note that, using \| a + b \|^2 \le 2 \| a \|^2 + 2 \| b \|^2, we have

    \| \nabla F_i( w_{k,t}^i ) - \frac{1}{n} \sum_{j=1}^{n} \nabla F_j( w_{k,t}^j ) \|^2 \le 2 \| \nabla F_i( w_{k,t} ) - \frac{1}{n} \sum_{j=1}^{n} \nabla F_j( w_{k,t} ) \|^2 + 2 \| \eta_i - \frac{1}{n} \sum_{j=1}^{n} \eta_j \|^2.   (74)

Substituting this bound in (73) and using Lemma 4.4 yields

    \Sigma_{t+1} \le ( 1 + \phi ) \Sigma_t + 4 ( 1 + \frac{1}{\phi} ) \beta^2 ( 2 \sigma_F^2 + \gamma_F^2 ) + 4 ( 1 + \frac{1}{\phi} ) \beta^2 \frac{1}{n} \sum_{i=1}^{n} E \| \eta_i - \frac{1}{n} \sum_{j=1}^{n} \eta_j \|^2.   (75)

Note that using the Cauchy-Schwarz inequality (71) with a_1 = \eta_i, b_1 = 1 and a_l = \frac{1}{\sqrt{n}} \eta_{l-1}, b_l = \frac{1}{\sqrt{n}} for l = 2, \dots, n+1 implies

    \| \eta_i - \frac{1}{n} \sum_{j=1}^{n} \eta_j \|^2 \le 2 \| \eta_i \|^2 + \frac{2}{n} \sum_{j=1}^{n} \| \eta_j \|^2 \le 2 L_F^2 \| w_{k,t}^i - w_{k,t} \|^2 + \frac{2 L_F^2}{n} \sum_{j=1}^{n} \| w_{k,t}^j - w_{k,t} \|^2,   (76)

where the last inequality is obtained using Lemma 4.2, which states

    \| \eta_i \| \le L_F \| w_{k,t}^i - w_{k,t} \|.   (77)

Plugging (76) into (75) implies

    \Sigma_{t+1} \le ( 1 + \phi + 16 ( 1 + \frac{1}{\phi} ) \beta^2 L_F^2 ) \Sigma_t + 4 ( 1 + \frac{1}{\phi} ) \beta^2 ( 2 \sigma_F^2 + \gamma_F^2 ).   (78)

As a result, similar to (66), we obtain

    \Sigma_{t+1} \le 4 \beta^2 ( 1 + \frac{1}{\phi} )( t + 1 ) ( 1 + \phi + 16 ( 1 + \frac{1}{\phi} ) \beta^2 L_F^2 )^t ( 2 \sigma_F^2 + \gamma_F^2 ),   (79)

which gives us the desired result (57b).

Finally, to show (58), first note that for any n \ge 1, we know

    ( 1 + \frac{1}{n} )^n \le e.   (80)

Using this, along with the assumption \beta \le 1/(10 L_F \tau) and the fact that e^{1/5} \le 2, we immediately obtain (58a). To show the other one, (58b), we use (57b) with \phi = 1/(2\tau):

    1 + \phi + 16 ( 1 + \frac{1}{\phi} ) \beta^2 L_F^2 = 1 + \frac{1}{2\tau} + 16 ( 1 + 2\tau ) \beta^2 L_F^2 \le 1 + \frac{1}{2\tau} + 16 ( 1 + 2\tau ) \frac{1}{100 \tau^2} \le 1 + \frac{1}{\tau},   (81)

where the first inequality follows from the assumption \beta \le 1/(10 L_F \tau) and the last inequality is obtained using the trivial bound 1 \le \tau. Finally, using (81) along with (80) completes the proof.

G Proof of Theorem 4.5
Although we only ask a fraction of agents to compute their local updates in Algorithm 1, here, and just for the sake of analysis, we assume that all agents perform local updates. This is only for our analysis, and we will not use all agents' updates in computing \(w_{k+1}\). Also, from Proposition F.1, recall that \(w_{k,t}=\frac{1}{n}\sum_{i=1}^{n}w^i_{k,t}\).

Let \(\mathcal{F}^t_{k+1}\) denote the \(\sigma\)-field generated by \(\{w^i_{k+1,t}\}_{i=1}^{n}\). Note that, by Lemma 4.2, we know \(F\) is smooth with gradient Lipschitz parameter \(L_F\), and thus, by (17b), we have
\[
F(\bar w_{k+1,t+1})\le F(\bar w_{k+1,t})+\nabla F(\bar w_{k+1,t})^{\top}(\bar w_{k+1,t+1}-\bar w_{k+1,t})+\frac{L_F}{2}\|\bar w_{k+1,t+1}-\bar w_{k+1,t}\|^2
\le F(\bar w_{k+1,t})-\beta\,\nabla F(\bar w_{k+1,t})^{\top}\Big(\frac{1}{rn}\sum_{i\in\mathcal{A}_k}\tilde\nabla F_i(w^i_{k+1,t})\Big)+\frac{L_F\beta^2}{2}\Big\|\frac{1}{rn}\sum_{i\in\mathcal{A}_k}\tilde\nabla F_i(w^i_{k+1,t})\Big\|^2 \tag{82}
\]
where the last inequality is obtained using the fact that
\[
\bar w_{k+1,t+1}=\frac{1}{rn}\sum_{i\in\mathcal{A}_k}w^i_{k+1,t+1}=\frac{1}{rn}\sum_{i\in\mathcal{A}_k}\Big(w^i_{k+1,t}-\beta\tilde\nabla F_i(w^i_{k+1,t})\Big)=\bar w_{k+1,t}-\frac{\beta}{rn}\sum_{i\in\mathcal{A}_k}\tilde\nabla F_i(w^i_{k+1,t}).
\]
Taking expectation from both sides of (82) yields
\[
\mathbb{E}[F(\bar w_{k+1,t+1})]\le\mathbb{E}[F(\bar w_{k+1,t})]-\beta\,\mathbb{E}\Big[\nabla F(\bar w_{k+1,t})^{\top}\Big(\frac{1}{rn}\sum_{i\in\mathcal{A}_k}\tilde\nabla F_i(w^i_{k+1,t})\Big)\Big]+\frac{L_F\beta^2}{2}\,\mathbb{E}\Big[\Big\|\frac{1}{rn}\sum_{i\in\mathcal{A}_k}\tilde\nabla F_i(w^i_{k+1,t})\Big\|^2\Big]. \tag{83}
\]
Next, note that
\[
\frac{1}{rn}\sum_{i\in\mathcal{A}_k}\tilde\nabla F_i(w^i_{k+1,t})=X+Y+Z+\frac{1}{rn}\sum_{i\in\mathcal{A}_k}\nabla F_i(\bar w_{k+1,t}) \tag{84}
\]
where
\[
X=\frac{1}{rn}\sum_{i\in\mathcal{A}_k}\Big(\tilde\nabla F_i(w^i_{k+1,t})-\nabla F_i(w^i_{k+1,t})\Big), \tag{85}
\]
\[
Y=\frac{1}{rn}\sum_{i\in\mathcal{A}_k}\big(\nabla F_i(w^i_{k+1,t})-\nabla F_i(w_{k+1,t})\big), \tag{86}
\]
\[
Z=\frac{1}{rn}\sum_{i\in\mathcal{A}_k}\big(\nabla F_i(w_{k+1,t})-\nabla F_i(\bar w_{k+1,t})\big). \tag{87}
\]
We next bound the moments of \(X\), \(Y\), and \(Z\), conditioned on \(\mathcal{F}^t_{k+1}\).
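The identity used after (82) — that averaging the agents' local stochastic gradient steps is the same as moving the average iterate by the average stochastic gradient — is pure linearity. A minimal sketch (our own made-up scalar numbers, purely illustrative):

```python
# Minimal sketch of the identity below (82): averaging the local SGD steps
#   w^i <- w^i - beta * g^i
# over the sampled agents equals stepping the average iterate by the
# average gradient. Scalar iterates and toy numbers for brevity.
def mean(xs):
    return sum(xs) / len(xs)

ws = [1.0, 2.0, -0.5, 3.5]       # iterates w^i_{k+1,t} for i in A_k
gs = [0.2, -1.0, 0.4, 0.8]       # stochastic gradients at those iterates
beta = 0.1

avg_of_steps = mean([w - beta * g for w, g in zip(ws, gs)])
step_of_avg = mean(ws) - beta * mean(gs)
assert abs(avg_of_steps - step_of_avg) < 1e-12
```

The same cancellation holds coordinate-wise for vector iterates, which is why the server can track \(\bar w_{k+1,t}\) without ever materializing it during the local phase.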
First, recall the Cauchy-Schwarz inequality
\[
\Big\|\sum_{i=1}^{rn}a_i b_i\Big\|^2\le\Big(\sum_{i=1}^{rn}\|a_i\|^2\Big)\Big(\sum_{i=1}^{rn}\|b_i\|^2\Big). \tag{88}
\]
• Using this inequality with \(a_i=(\tilde\nabla F_i(w^i_{k+1,t})-\nabla F_i(w^i_{k+1,t}))/\sqrt{rn}\) and \(b_i=1/\sqrt{rn}\), we obtain
\[
\|X\|^2\le\frac{1}{rn}\sum_{i\in\mathcal{A}_k}\Big\|\tilde\nabla F_i(w^i_{k+1,t})-\nabla F_i(w^i_{k+1,t})\Big\|^2, \tag{89}
\]
and hence, by using Lemma 4.3 along with the tower rule, we have
\[
\mathbb{E}[\|X\|^2]=\mathbb{E}\big[\mathbb{E}[\|X\|^2\mid\mathcal{F}^t_{k+1}]\big]\le\sigma_F^2. \tag{90}
\]
• Regarding \(Y\), note that by using the Cauchy-Schwarz inequality (similar to what we did above) along with smoothness of \(F_i\), we obtain
\[
\|Y\|^2\le\frac{1}{rn}\sum_{i\in\mathcal{A}_k}\big\|\nabla F_i(w^i_{k+1,t})-\nabla F_i(w_{k+1,t})\big\|^2\le\frac{L_F^2}{rn}\sum_{i\in\mathcal{A}_k}\big\|w^i_{k+1,t}-w_{k+1,t}\big\|^2. \tag{91}
\]
Again, taking expectation and using the fact that \(\mathcal{A}_k\) is chosen uniformly at random implies
\[
\mathbb{E}[\|Y\|^2]=\mathbb{E}\big[\mathbb{E}[\|Y\|^2\mid\mathcal{F}^t_{k+1}]\big]\le L_F^2\,\mathbb{E}\Big[\mathbb{E}\Big[\frac{1}{rn}\sum_{i\in\mathcal{A}_k}\big\|w^i_{k+1,t}-w_{k+1,t}\big\|^2\,\Big|\,\mathcal{F}^t_{k+1}\Big]\Big]=L_F^2\,\mathbb{E}\Big[\frac{1}{n}\sum_{i=1}^{n}\|w^i_{k+1,t}-w_{k+1,t}\|^2\Big]\le 35\beta^2L_F^2\tau(\tau-1)\big(2\sigma_F^2+\gamma_F^2\big) \tag{92}
\]
where the last step follows from (58b) in Corollary F.2 along with the fact that \(t\le\tau-1\).
• Regarding \(Z\), first recall that if we have \(n\) numbers \(a_1,\dots,a_n\) with mean \(\mu=\frac{1}{n}\sum_{i=1}^{n}a_i\) and variance \(\sigma^2=\frac{1}{n}\sum_{i=1}^{n}|a_i-\mu|^2\), and we take a subset \(\{a_i\}_{i\in\mathcal{A}}\) of size \(|\mathcal{A}|=rn\) by sampling without replacement, then we have
\[
\mathbb{E}\Big[\Big|\frac{\sum_{i\in\mathcal{A}}a_i}{rn}-\mu\Big|^2\Big]=\frac{\sigma^2}{rn}\Big(1-\frac{rn-1}{n-1}\Big)=\frac{\sigma^2(1-r)}{r(n-1)}. \tag{93}
\]
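The without-replacement variance identity (93) can be verified exhaustively on a small population by enumerating every subset of size \(m=rn\). A standalone sketch (our own toy data):

```python
import itertools

def exact_mean_sq_dev(a, m):
    # Average of |sample_mean - mu|^2 over ALL size-m subsets of a,
    # i.e. the exact expectation under sampling without replacement.
    n = len(a)
    mu = sum(a) / n
    subsets = list(itertools.combinations(a, m))
    return sum((sum(s) / m - mu) ** 2 for s in subsets) / len(subsets)

def formula_93(a, m):
    # sigma^2 * (1 - r) / (r * (n - 1)) with r = m / n, as in (93).
    n = len(a)
    mu = sum(a) / n
    sigma2 = sum((x - mu) ** 2 for x in a) / n
    r = m / n
    return sigma2 * (1 - r) / (r * (n - 1))

population = [1.0, 4.0, 2.0, 8.0, 5.0, 7.0]
for m in (2, 3, 4, 5):
    assert abs(exact_mean_sq_dev(population, m) - formula_93(population, m)) < 1e-12
```

The formula is exact, not just an upper bound: the error vanishes entirely when \(r=1\) (full participation), which is what makes the \(Z\) term disappear when all agents are sampled.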
Using this, we have
\[
\mathbb{E}\big[\|\bar w_{k+1,t}-w_{k+1,t}\|^2\mid\mathcal{F}^t_{k+1}\big]\le\frac{1-r}{r(n-1)}\cdot\frac{1}{n}\sum_{i=1}^{n}\|w^i_{k+1,t}-w_{k+1,t}\|^2, \tag{94}
\]
and hence, by taking expectation from both sides and using the tower rule along with (58b) in Corollary F.2, we obtain
\[
\mathbb{E}\big[\|\bar w_{k+1,t}-w_{k+1,t}\|^2\big]\le\frac{35(1-r)\beta^2\tau(\tau-1)\big(2\sigma_F^2+\gamma_F^2\big)}{r(n-1)}. \tag{95}
\]
Next, note that by using the Cauchy-Schwarz inequality (88), with \(a_i=(\nabla F_i(w_{k+1,t})-\nabla F_i(\bar w_{k+1,t}))/\sqrt{rn}\) and \(b_i=1/\sqrt{rn}\), we have
\[
\|Z\|^2\le\frac{1}{rn}\sum_{i\in\mathcal{A}_k}\|\nabla F_i(w_{k+1,t})-\nabla F_i(\bar w_{k+1,t})\|^2\le\frac{L_F^2}{rn}\sum_{i\in\mathcal{A}_k}\|w_{k+1,t}-\bar w_{k+1,t}\|^2=L_F^2\|\bar w_{k+1,t}-w_{k+1,t}\|^2 \tag{96}
\]
where the last inequality is obtained using smoothness of \(F_i\) (Lemma 4.2). Now, taking expectation from both sides and using (95) yields
\[
\mathbb{E}[\|Z\|^2]\le\frac{35(1-r)\beta^2L_F^2\tau(\tau-1)\big(2\sigma_F^2+\gamma_F^2\big)}{r(n-1)}. \tag{97}
\]
Now, getting back to (83), we first lower bound the term
\[
\mathbb{E}\Big[\nabla F(\bar w_{k+1,t})^{\top}\Big(\frac{1}{rn}\sum_{i\in\mathcal{A}_k}\tilde\nabla F_i(w^i_{k+1,t})\Big)\Big].
\]
To do so, note that, by (84), we have
\[
\mathbb{E}\Big[\nabla F(\bar w_{k+1,t})^{\top}\Big(\frac{1}{rn}\sum_{i\in\mathcal{A}_k}\tilde\nabla F_i(w^i_{k+1,t})\Big)\Big]
=\mathbb{E}\Big[\nabla F(\bar w_{k+1,t})^{\top}\Big(X+Y+Z+\frac{1}{rn}\sum_{i\in\mathcal{A}_k}\nabla F_i(\bar w_{k+1,t})\Big)\Big]
\ge\mathbb{E}\Big[\nabla F(\bar w_{k+1,t})^{\top}\Big(\frac{1}{rn}\sum_{i\in\mathcal{A}_k}\nabla F_i(\bar w_{k+1,t})\Big)\Big]-\Big\|\mathbb{E}\big[\nabla F(\bar w_{k+1,t})^{\top}X\big]\Big\|-\frac{1}{4}\mathbb{E}[\|\nabla F(\bar w_{k+1,t})\|^2]-\mathbb{E}[\|Y+Z\|^2] \tag{98}
\]
where the last inequality is obtained using the fact that
\[
\mathbb{E}\big[\nabla F(\bar w_{k+1,t})^{\top}(Y+Z)\big]\le\frac{1}{4}\mathbb{E}[\|\nabla F(\bar w_{k+1,t})\|^2]+\mathbb{E}[\|Y+Z\|^2].
\]
Now, we bound the terms in (98) separately.
First, note that by the tower rule we have
\[
\mathbb{E}\Big[\nabla F(\bar w_{k+1,t})^{\top}\Big(\frac{1}{rn}\sum_{i\in\mathcal{A}_k}\nabla F_i(\bar w_{k+1,t})\Big)\Big]
=\mathbb{E}\Big[\mathbb{E}\Big[\nabla F(\bar w_{k+1,t})^{\top}\Big(\frac{1}{rn}\sum_{i\in\mathcal{A}_k}\nabla F_i(\bar w_{k+1,t})\Big)\,\Big|\,\mathcal{F}^t_{k+1}\Big]\Big]
=\mathbb{E}\Big[\nabla F(\bar w_{k+1,t})^{\top}\,\mathbb{E}\Big[\frac{1}{rn}\sum_{i\in\mathcal{A}_k}\nabla F_i(\bar w_{k+1,t})\,\Big|\,\mathcal{F}^t_{k+1}\Big]\Big]
=\mathbb{E}\big[\|\nabla F(\bar w_{k+1,t})\|^2\big] \tag{99}
\]
where the last equality is obtained using the fact that \(\mathcal{A}_k\) is chosen uniformly at random, and thus,
\[
\mathbb{E}\Big[\frac{1}{rn}\sum_{i\in\mathcal{A}_k}\nabla F_i(\bar w_{k+1,t})\,\Big|\,\mathcal{F}^t_{k+1}\Big]=\frac{1}{n}\sum_{i=1}^{n}\nabla F_i(\bar w_{k+1,t})=\nabla F(\bar w_{k+1,t}).
\]
Second, note that
\[
\mathbb{E}\big[\nabla F(\bar w_{k+1,t})^{\top}X\big]=\mathbb{E}\Big[\mathbb{E}\big[\nabla F(\bar w_{k+1,t})^{\top}X\mid\mathcal{F}^t_{k+1}\big]\Big]=\mathbb{E}\Big[\nabla F(\bar w_{k+1,t})^{\top}\,\mathbb{E}\big[X\mid\mathcal{F}^t_{k+1}\big]\Big].
\]
As a result, we have
\[
\Big\|\mathbb{E}\big[\nabla F(\bar w_{k+1,t})^{\top}X\big]\Big\|=\Big\|\mathbb{E}\Big[\nabla F(\bar w_{k+1,t})^{\top}\,\mathbb{E}\big[X\mid\mathcal{F}^t_{k+1}\big]\Big]\Big\|
\le\frac{1}{4}\mathbb{E}[\|\nabla F(\bar w_{k+1,t})\|^2]+\mathbb{E}\Big[\big\|\mathbb{E}[X\mid\mathcal{F}^t_{k+1}]\big\|^2\Big]
\le\frac{1}{4}\mathbb{E}[\|\nabla F(\bar w_{k+1,t})\|^2]+\frac{4\alpha^2L^2\sigma_G^2}{D} \tag{100}
\]
where the last inequality follows from Lemma 4.3. Third, note that by the Cauchy-Schwarz inequality,
\[
\mathbb{E}[\|Y+Z\|^2]\le 2\big(\mathbb{E}[\|Y\|^2]+\mathbb{E}[\|Z\|^2]\big)\le 70\beta^2L_F^2\tau(\tau-1)\big(2\sigma_F^2+\gamma_F^2\big)\Big(1+\frac{1-r}{r(n-1)}\Big)\le 140\beta^2L_F^2\tau(\tau-1)\big(2\sigma_F^2+\gamma_F^2\big) \tag{101}
\]
where the second inequality is obtained using (92) and (97).
Plugging (99), (100), and (101) in (98) implies
\[
\mathbb{E}\Big[\nabla F(\bar w_{k+1,t})^{\top}\Big(\frac{1}{rn}\sum_{i\in\mathcal{A}_k}\tilde\nabla F_i(w^i_{k+1,t})\Big)\Big]\ge\frac{1}{2}\mathbb{E}\big[\|\nabla F(\bar w_{k+1,t})\|^2\big]-140\beta^2L_F^2\tau(\tau-1)\big(2\sigma_F^2+\gamma_F^2\big)-\frac{4\alpha^2L^2\sigma_G^2}{D}. \tag{102}
\]
Next, we characterize an upper bound for the other term in (83), namely \(\mathbb{E}\big[\|\frac{1}{rn}\sum_{i\in\mathcal{A}_k}\tilde\nabla F_i(w^i_{k+1,t})\|^2\big]\). Note that, by (84), we have
\[
\Big\|\frac{1}{rn}\sum_{i\in\mathcal{A}_k}\tilde\nabla F_i(w^i_{k+1,t})\Big\|^2\le 2\|X+Y+Z\|^2+2\Big\|\frac{1}{rn}\sum_{i\in\mathcal{A}_k}\nabla F_i(\bar w_{k+1,t})\Big\|^2, \tag{103}
\]
and thus, by (101) along with (90), we have
\[
\mathbb{E}\Big[\Big\|\frac{1}{rn}\sum_{i\in\mathcal{A}_k}\tilde\nabla F_i(w^i_{k+1,t})\Big\|^2\Big]\le 2\,\mathbb{E}\Big[\Big\|\frac{1}{rn}\sum_{i\in\mathcal{A}_k}\nabla F_i(\bar w_{k+1,t})\Big\|^2\Big]+4\sigma_F^2+560\beta^2L_F^2\tau(\tau-1)\big(2\sigma_F^2+\gamma_F^2\big). \tag{104}
\]
Note that \(\mathbb{E}\big[\frac{1}{rn}\sum_{i\in\mathcal{A}_k}\nabla F_i(\bar w_{k+1,t})\mid\mathcal{F}^t_{k+1}\big]=\nabla F(\bar w_{k+1,t})\), since \(\mathcal{A}_k\) is chosen uniformly at random. Also, by Lemma 4.4, we have
\[
\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}\Big[\|\nabla F_i(\bar w_{k+1,t})-\nabla F(\bar w_{k+1,t})\|^2\,\Big|\,\mathcal{F}^t_{k+1}\Big]\le\gamma_F^2,
\]
and thus, by (93), we have
\[
\mathbb{E}\Big[\Big\|\frac{1}{rn}\sum_{i\in\mathcal{A}_k}\nabla F_i(\bar w_{k+1,t})\Big\|^2\Big]\le\mathbb{E}\big[\|\nabla F(\bar w_{k+1,t})\|^2\big]+\frac{\gamma_F^2(1-r)}{r(n-1)}. \tag{105}
\]
Plugging (105) in (104), we obtain
\[
\mathbb{E}\Big[\Big\|\frac{1}{rn}\sum_{i\in\mathcal{A}_k}\tilde\nabla F_i(w^i_{k+1,t})\Big\|^2\Big]\le 2\,\mathbb{E}\big[\|\nabla F(\bar w_{k+1,t})\|^2\big]+\frac{2\gamma_F^2(1-r)}{r(n-1)}+4\sigma_F^2+560\beta^2L_F^2\tau(\tau-1)\big(2\sigma_F^2+\gamma_F^2\big). \tag{106}
\]
Substituting (106) and (102) in (83) implies
\[
\mathbb{E}[F(\bar w_{k+1,t+1})]\le\mathbb{E}[F(\bar w_{k+1,t})]-\beta\Big(\frac{1}{2}-\beta L_F\Big)\mathbb{E}\big[\|\nabla F(\bar w_{k+1,t})\|^2\big]+140(1+2\beta L_F)\beta^3L_F^2\tau(\tau-1)\big(2\sigma_F^2+\gamma_F^2\big)+\beta^2L_F\Big(2\sigma_F^2+\frac{\gamma_F^2(1-r)}{r(n-1)}\Big)+\frac{4\beta\alpha^2L^2\sigma_G^2}{D}
\le\mathbb{E}[F(\bar w_{k+1,t})]-\frac{\beta}{4}\mathbb{E}\big[\|\nabla F(\bar w_{k+1,t})\|^2\big]+\beta\sigma_T^2 \tag{107}
\]
where
\[
\sigma_T^2:=280(\beta L_F)^2\tau(\tau-1)\big(2\sigma_F^2+\gamma_F^2\big)+\beta L_F\Big(2\sigma_F^2+\frac{\gamma_F^2(1-r)}{r(n-1)}\Big)+\frac{4\alpha^2L^2\sigma_G^2}{D} \tag{108}
\]
and the last inequality is obtained using \(\beta\le 1/(10\tau L_F)\). Summing up (107) for all \(t=0,\dots,\tau-1\), we obtain
\[
\mathbb{E}[F(w_{k+1})]\le\mathbb{E}[F(w_k)]-\frac{\beta\tau}{4}\Big(\frac{1}{\tau}\sum_{t=0}^{\tau-1}\mathbb{E}\big[\|\nabla F(\bar w_{k+1,t})\|^2\big]\Big)+\beta\tau\sigma_T^2 \tag{109}
\]
where we used the fact that \(\bar w_{k+1,\tau}=w_{k+1}\). Finally, summing up (109) for \(k=0,\dots,K-1\) implies
\[
\mathbb{E}[F(w_K)]\le F(w_0)-\frac{\beta\tau K}{4}\Big(\frac{1}{\tau K}\sum_{k=0}^{K-1}\sum_{t=0}^{\tau-1}\mathbb{E}\big[\|\nabla F(\bar w_{k+1,t})\|^2\big]\Big)+\beta\tau K\sigma_T^2. \tag{110}
\]
As a result, we have
\[
\frac{1}{\tau K}\sum_{k=0}^{K-1}\sum_{t=0}^{\tau-1}\mathbb{E}\big[\|\nabla F(\bar w_{k+1,t})\|^2\big]\le\frac{4}{\beta\tau K}\big(F(w_0)-\mathbb{E}[F(w_K)]+\beta\tau K\sigma_T^2\big)\le\frac{4(F(w_0)-F^*)}{\beta\tau K}+4\sigma_T^2 \tag{111}
\]
which gives us the desired result.

Remark G.1.
As stated in Remark 4.8, our analysis can easily be extended to the case of a diminishing stepsize. In particular, by using \(\beta_k\) as the stepsize at iteration \(k\), the descent result (109) holds with \(\beta=\beta_k\). Hence, summing up this inequality for \(k=0,\dots,K-1\), we recover the same complexity bounds using \(\beta_k=\mathcal{O}(1/\sqrt{\tau k})\).

H On First-Order Approximations of Per-FedAvg
As we stated previously, the Per-FedAvg method, like MAML, requires computing a Hessian-vector product, which is computationally costly in some applications. As a result, one may consider using a first-order approximation of the update rule of the Per-FedAvg algorithm. The main goal of this section is to show how our analysis can be extended to the case where we either drop the second-order term or approximate the Hessian-vector product using first-order techniques. To do so, we show that it suffices to extend the result in Lemma 4.3 to the first-order approximation setting and find \(m_F\) and \(\tilde\sigma_F\) such that
\[
\Big\|\mathbb{E}\big[\tilde\nabla F_i(w)-\nabla F_i(w)\big]\Big\|\le m_F,\qquad\mathbb{E}\Big[\big\|\tilde\nabla F_i(w)-\nabla F_i(w)\big\|^2\Big]\le\tilde\sigma_F^2.
\]
One can easily check that the rest of the analysis does not change, and the final result (Theorem 4.5) holds if we simply replace \(\sigma_F\) by \(\tilde\sigma_F\) and \(2\alpha L\sigma_G/\sqrt{D}\) by \(m_F\). We next focus on two different approaches, developed for the MAML formulation, for approximating the Hessian-vector product, and show how we can characterize \(m_F\) and \(\tilde\sigma_F\) in both cases:

• Ignoring the second-order term:
The authors in [2] suggested simply ignoring the second-order term in the update of MAML to reduce its computation cost, i.e., replacing \(\tilde\nabla F_i(w)\) with
\[
\tilde\nabla f_i\Big(w-\alpha\tilde\nabla f_i(w,\mathcal{D}),\mathcal{D}'\Big). \tag{112}
\]
This approach is known as First-Order MAML (FO-MAML), and it has been shown to perform relatively well in many cases [2]. In particular, [33] characterized the convergence properties of FO-MAML for the centralized MAML problem. Next, we characterize the mean and variance of this gradient approximation.

Lemma H.1.
Assume that we estimate \(\nabla F_i(w)\) by (112), where \(\mathcal{D}\) and \(\mathcal{D}'\) are independent batches with sizes \(D\) and \(D'\), respectively. Suppose that the conditions in Assumptions 2-4 are satisfied. Then, for any \(\alpha\in[0,1/L]\) and \(w\in\mathbb{R}^d\), we have
\[
\Big\|\mathbb{E}\big[\tilde\nabla F_i(w)-\nabla F_i(w)\big]\Big\|\le m_F^{FO}:=\alpha L\Big(\frac{\sigma_G}{\sqrt{D}}+B\Big),
\qquad
\mathbb{E}\Big[\big\|\tilde\nabla F_i(w)-\nabla F_i(w)\big\|^2\Big]\le(\tilde\sigma_F^{FO})^2:=2\sigma_G^2\Big(\frac{1}{D'}+\frac{(\alpha L)^2}{D}\Big)+2(\alpha LB)^2.
\]
Proof. In fact, in this case, \(\tilde\nabla F_i(w)\) approximates
\[
G_i(w):=\nabla f_i(w-\alpha\nabla f_i(w)). \tag{113}
\]
To bound \(m_F^{FO}\), note that
\[
\Big\|\mathbb{E}\big[\tilde\nabla F_i(w)-\nabla F_i(w)\big]\Big\|\le\Big\|\mathbb{E}\big[\tilde\nabla F_i(w)-G_i(w)\big]\Big\|+\big\|\mathbb{E}[G_i(w)-\nabla F_i(w)]\big\| \tag{114}
\]
\[
\le\frac{\alpha L\sigma_G}{\sqrt{D}}+\alpha LB \tag{115}
\]
where the bound on the first term follows from (33) in the proof of Lemma 4.3 in Appendix D, and the bound on the second term is obtained using
\[
\|G_i(w)-\nabla F_i(w)\|=\alpha\big\|\nabla^2 f_i(w)\nabla f_i(w-\alpha\nabla f_i(w))\big\|\le\alpha\|\nabla^2 f_i(w)\|\cdot\|\nabla f_i(w-\alpha\nabla f_i(w))\|\le\alpha LB \tag{116}
\]
where the first inequality follows from the definition of the matrix norm and the last inequality is obtained using Assumption 2. To characterize \(\tilde\sigma_F^{FO}\), note that
\[
\mathbb{E}\Big[\big\|\tilde\nabla F_i(w)-\nabla F_i(w)\big\|^2\Big]\le 2\,\mathbb{E}\Big[\big\|\tilde\nabla F_i(w)-G_i(w)\big\|^2\Big]+2\,\mathbb{E}\Big[\|G_i(w)-\nabla F_i(w)\|^2\Big]. \tag{117}
\]
We bound these two terms separately.
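The deterministic part of this bias, the quantity bounded in (116), can be computed exactly on a one-dimensional quadratic loss (our own toy example, not from the paper; with \(f(w)=cw^2/2\) there is no sampling noise, so the gap between the FO estimate and the full gradient is purely the dropped Hessian term):

```python
# Toy example (ours): quadratic loss f(w) = c*w**2/2, no sampling noise.
# The full gradient of F(w) = f(w - a*f'(w)) is
#   (1 - a*f''(w)) * f'(w - a*f'(w)) = c*(1 - a*c)**2 * w,
# while the FO estimate (112) drops the (1 - a*f''(w)) factor. The gap is
# exactly a * f''(w) * f'(w - a*f'(w)), the quantity bounded in (116).
def fprime(w, c):
    return c * w

def full_maml_grad(w, c, a):
    return (1 - a * c) * fprime(w - a * fprime(w, c), c)

def fo_maml_grad(w, c, a):
    return fprime(w - a * fprime(w, c), c)

c, a, w = 2.0, 0.1, 3.0
gap = fo_maml_grad(w, c, a) - full_maml_grad(w, c, a)
assert abs(full_maml_grad(w, c, a) - c * (1 - a * c) ** 2 * w) < 1e-12
assert abs(gap - a * c * fprime(w - a * fprime(w, c), c)) < 1e-12
```

Note the gap scales with \(\alpha\) but not with any batch size, which is precisely why the \(2(\alpha LB)^2\) term in \((\tilde\sigma_F^{FO})^2\) cannot be removed by larger batches.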
Note that we have already bounded the first term in Appendix D (see (36)), and we have
\[
\mathbb{E}\Big[\big\|\tilde\nabla F_i(w)-G_i(w)\big\|^2\Big]\le\sigma_G^2\Big(\frac{1}{D'}+\frac{(\alpha L)^2}{D}\Big). \tag{118}
\]
Plugging (118) and (116) into (117), we obtain the desired result.

Note that while the first term in \((\tilde\sigma_F^{FO})^2\) can be made arbitrarily small by choosing \(D\) and \(D'\) large enough, this is not the case for the second term. That said, the second term is also negligible if \(\alpha\) is small enough. Still, this bound suggests that the approximation introduces a non-vanishing error term which is carried directly into the final result (Theorem 4.5).

• Estimating the Hessian-vector product using gradient differences:
In the context of the MAML problem, it has been shown that the update of FO-MAML leads to an additive error that does not vanish as time progresses. To resolve this issue, [33] introduced another variant of MAML, called HF-MAML, which approximates the Hessian-vector product by gradient differences. More formally, the idea behind their method is that, for any function \(g\), the product of the Hessian \(\nabla^2 g(w)\) with any vector \(v\) can be approximated by
\[
\frac{\nabla g(w+\delta v)-\nabla g(w-\delta v)}{2\delta} \tag{119}
\]
with an error of at most \(\rho\delta\|v\|^2\), where \(\rho\) is the parameter for Lipschitz continuity of the Hessian of \(g\). Building on this idea, in the Per-FedAvg update rule, we can replace \(\tilde\nabla F_i(w)\) by
\[
\tilde\nabla f_i\Big(w-\alpha\tilde\nabla f_i(w,\mathcal{D}),\mathcal{D}'\Big)-\alpha\tilde d_i(w) \tag{120}
\]
where
\[
\tilde d_i(w):=\frac{\tilde\nabla f_i\Big(w+\delta\tilde\nabla f_i(w-\alpha\tilde\nabla f_i(w,\mathcal{D}),\mathcal{D}'),\mathcal{D}''\Big)-\tilde\nabla f_i\Big(w-\delta\tilde\nabla f_i(w-\alpha\tilde\nabla f_i(w,\mathcal{D}),\mathcal{D}'),\mathcal{D}''\Big)}{2\delta}. \tag{121}
\]
For this approximation, we have the following result, which shows that we now have an additional degree of freedom (\(\delta\)) to control the error term that does not decrease with increasing batch sizes.

Lemma H.2. Assume that we estimate \(\nabla F_i(w)\) by (120), where \(\mathcal{D}\), \(\mathcal{D}'\), and \(\mathcal{D}''\) are independent batches with sizes \(D\), \(D'\), and \(D''\), respectively. Suppose that the conditions in Assumptions 2-4 are satisfied. Then, for any \(\alpha\in[0,1/L]\) and \(w\in\mathbb{R}^d\), we have
\[
\Big\|\mathbb{E}\big[\tilde\nabla F_i(w)-\nabla F_i(w)\big]\Big\|\le m_F^{HF}:=\alpha\Big(\frac{L\sigma_G}{\sqrt{D}}+\frac{L\sigma_G}{\sqrt{D'}}+\rho\delta B^2\Big),
\qquad
\mathbb{E}\Big[\big\|\tilde\nabla F_i(w)-\nabla F_i(w)\big\|^2\Big]\le(\tilde\sigma_F^{HF})^2:=6\sigma_G^2\Big(\frac{2(\alpha L)^2}{D}+\frac{2}{D'}+\frac{\alpha^2}{\delta^2 D''}\Big)+2(\alpha\rho\delta)^2B^4.
\]
Proof.
Note that this time \(\tilde\nabla F_i(w)\) approximates
\[
G'_i(w):=\nabla f_i(w-\alpha\nabla f_i(w))-\alpha d_i(w) \tag{122}
\]
where
\[
d_i(w):=\frac{\nabla f_i\big(w+\delta\nabla f_i(w-\alpha\nabla f_i(w))\big)-\nabla f_i\big(w-\delta\nabla f_i(w-\alpha\nabla f_i(w))\big)}{2\delta} \tag{123}
\]
is the term approximating \(\nabla^2 f_i(w)\nabla f_i(w-\alpha\nabla f_i(w))\). Below, we characterize \(\tilde\sigma_F^{HF}\); the bound on \(m_F^{HF}\) can be derived similarly. Similar to (117), we have
\[
\mathbb{E}\Big[\big\|\tilde\nabla F_i(w)-\nabla F_i(w)\big\|^2\Big]\le 2\,\mathbb{E}\Big[\big\|\tilde\nabla F_i(w)-G'_i(w)\big\|^2\Big]+2\,\mathbb{E}\Big[\big\|G'_i(w)-\nabla F_i(w)\big\|^2\Big]. \tag{124}
\]
We again bound both terms separately. To simplify the notation, let us define
\[
g_i(w):=\nabla f_i(w-\alpha\nabla f_i(w)),\qquad\tilde g_i(w):=\tilde\nabla f_i\Big(w-\alpha\tilde\nabla f_i(w,\mathcal{D}),\mathcal{D}'\Big). \tag{125}
\]
First, note that, using \((a+b+c)^2\le 3(a^2+b^2+c^2)\) for \(a,b,c\ge 0\), we have
\[
\big\|\tilde\nabla F_i(w)-G'_i(w)\big\|^2\le 3\|\tilde g_i(w)-g_i(w)\|^2+\frac{3\alpha^2}{4\delta^2}\Big\|\tilde\nabla f_i(w+\delta\tilde g_i(w),\mathcal{D}'')-\nabla f_i(w+\delta g_i(w))\Big\|^2+\frac{3\alpha^2}{4\delta^2}\Big\|\tilde\nabla f_i(w-\delta\tilde g_i(w),\mathcal{D}'')-\nabla f_i(w-\delta g_i(w))\Big\|^2. \tag{126}
\]
Taking expectation from both sides, along with using (118), we have
\[
\mathbb{E}\Big[\big\|\tilde\nabla F_i(w)-G'_i(w)\big\|^2\Big]\le 3\sigma_G^2\Big(\frac{1}{D'}+\frac{(\alpha L)^2}{D}\Big)+\frac{3\alpha^2}{4\delta^2}\Big(\mathbb{E}\Big[\big\|\tilde\nabla f_i(w+\delta\tilde g_i(w),\mathcal{D}'')-\nabla f_i(w+\delta g_i(w))\big\|^2\Big]+\mathbb{E}\Big[\big\|\tilde\nabla f_i(w-\delta\tilde g_i(w),\mathcal{D}'')-\nabla f_i(w-\delta g_i(w))\big\|^2\Big]\Big)
\le 3\sigma_G^2\Big(\frac{\alpha^2}{2\delta^2 D''}+\frac{1}{D'}+\frac{(\alpha L)^2}{D}\Big)+\frac{3\alpha^2}{4\delta^2}\Big(\mathbb{E}\big[\|\nabla f_i(w+\delta\tilde g_i(w))-\nabla f_i(w+\delta g_i(w))\|^2\big]+\mathbb{E}\big[\|\nabla f_i(w-\delta\tilde g_i(w))-\nabla f_i(w-\delta g_i(w))\|^2\big]\Big) \tag{127}
\]
where (127) is obtained using the fact that \(\mathcal{D}''\) is independent of \(\mathcal{D}\) and \(\mathcal{D}'\), which implies
\[
\mathbb{E}\Big[\big\|\tilde\nabla f_i(w\pm\delta\tilde g_i(w),\mathcal{D}'')-\nabla f_i(w\pm\delta g_i(w))\big\|^2\Big]\le\frac{\sigma_G^2}{D''}+\mathbb{E}\big[\|\nabla f_i(w\pm\delta\tilde g_i(w))-\nabla f_i(w\pm\delta g_i(w))\|^2\big].
\]
Next, note that Assumption 2 yields
\[
\|\nabla f_i(w\pm\delta\tilde g_i(w))-\nabla f_i(w\pm\delta g_i(w))\|\le\delta L\,\|\tilde g_i(w)-g_i(w)\|.
\]

[Table 2 (illustration of the numerical setting, by groups of users) and the panels of Figure 1 ("a: Comparison in terms of runtime", "b: Comparison in terms of number of iterations") appear here in the original layout.]
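The central-difference estimate (119), which \(\tilde d_i\) instantiates with stochastic gradients, is easy to sanity-check numerically. Below is a standalone sketch with a one-dimensional toy function of our own choosing (not from the paper): \(g(w)=w^4\), so \(\nabla g(w)=4w^3\) and \(\nabla^2 g(w)=12w^2\), and the error indeed shrinks as \(\delta\) decreases:

```python
# Standalone sketch of (119) on a toy function of our own choosing:
# approximate the Hessian-vector product g''(w)*v by
#   (g'(w + d*v) - g'(w - d*v)) / (2*d).
# Here g(w) = w**4, so g'(w) = 4*w**3 and g''(w) = 12*w**2; expanding,
# the estimate equals 12*w**2*v + 4*d**2*v**3, so the error is O(d**2).
def gprime(w):
    return 4.0 * w ** 3

def hvp_estimate(w, v, d):
    return (gprime(w + d * v) - gprime(w - d * v)) / (2.0 * d)

w, v = 1.5, 2.0
exact = 12.0 * w ** 2 * v            # g''(w) * v = 54.0
err_coarse = abs(hvp_estimate(w, v, 1e-1) - exact)
err_fine = abs(hvp_estimate(w, v, 1e-3) - exact)
assert err_fine < err_coarse         # smaller offset, smaller error
assert err_fine < 1e-3
```

Only two extra gradient evaluations are needed per estimate, which is the computational appeal of HF-MAML over forming the Hessian-vector product exactly.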
Plugging this bound into (127) and using (118) implies
\[
\mathbb{E}\Big[\big\|\tilde\nabla F_i(w)-G'_i(w)\big\|^2\Big]\le 3\sigma_G^2\Big(\frac{\alpha^2}{2\delta^2 D''}+\Big(1+\frac{(\alpha L)^2}{2}\Big)\Big(\frac{1}{D'}+\frac{(\alpha L)^2}{D}\Big)\Big)\le 3\sigma_G^2\Big(\frac{2(\alpha L)^2}{D}+\frac{2}{D'}+\frac{\alpha^2}{\delta^2 D''}\Big) \tag{128}
\]
where the last inequality is obtained using \(\alpha L\le 1\). Bounding the second term in (124) is more straightforward, as we have
\[
\big\|G'_i(w)-\nabla F_i(w)\big\|^2=\alpha^2\big\|d_i(w)-\nabla^2 f_i(w)\nabla f_i(w-\alpha\nabla f_i(w))\big\|^2\le(\alpha\rho\delta)^2\|g_i(w)\|^4\le(\alpha\rho\delta)^2B^4. \tag{129}
\]
Plugging (128) and (129) into (124) gives us the desired result.

I More on Numerical Experiments
In this section, we present further numerical results. We thank the anonymous reviewers for suggesting these additions, and we look forward to exploring our method further from a numerical point of view in future work.

First, in Table 2, we provide an illustration of the numerical setting in Section 5.

Second, in Figure 1a, we illustrate the average test accuracy of all studied algorithms with respect to time. As this figure shows, Per-FedAvg (HF) achieves a higher level of accuracy than the regular FedAvg with local updates within the same computation time.

Third, we also compare our method with ARUBA [38]. To do so, we report the output of FedAvg+ARUBA after refinement for each user. In particular, we consider τ = 4 and K = 1000, and tune the hyper-parameters of ARUBA for a fair comparison. The final accuracy of all algorithms is as follows: Per-FedAvg (FO): . ± . , FedAvg+ARUBA (with refinement): . ± . , Per-FedAvg (HF): . ± . . In Figure 1b, we have also depicted one realization