Distributed Zero-Order Optimization under Adversarial Noise
aa r X i v : . [ m a t h . O C ] F e b Distributed Zero-Order Optimization under Adversarial Noise
Arya Akhavan , Massimiliano Pontil , and Alexandre B. Tsybakov CSML, Istituto Italiano di Tecnologia CREST, ENSAE, IP Paris Dept. of Computer Science, University College LondonFebruary 4, 2021
Abstract
We study the problem of distributed zero-order optimization for a class of strongly convexfunctions. They are formed by the average of local objectives, associated to different nodes ina prescribed network of connections. We propose a distributed zero-order projected gradientdescent algorithm to solve this problem. Exchange of information within the network is permit-ted only between neighbouring nodes. A key feature of the algorithm is that it can query onlyfunction values, subject to a general noise model, that does not require zero mean or indepen-dent errors. We derive upper bounds for the average cumulative regret and optimization error ofthe algorithm which highlight the role played by a network connectivity parameter, the numberof variables, the noise level, the strong convexity parameter of the global objective and certainsmoothness properties of the local objectives. When the bound is specified to the standard undis-tributed setting, we obtain an improvement over the state-of-the-art bounds, due to the novelgradient estimation procedure proposed here. We also comment on lower bounds and observethat the dependency over certain function parameters in the bound is nearly optimal.
We study the problem of distributed optimization where each node (or agent) has an objective func-tion f i : R d → R and exchange of information is limited between neighbouring agents within aprescribed network of connections. The goal is to minimize the average of these objectives on aclosed bounded convex set Θ ⊂ R d , min x ∈ Θ f ( x ) where f ( x ) = 1 n n X i =1 f i ( x ) . (1)Distributed optimization has been widely studied in the literature, we refer to [36, 18, 19, 6, 8, 13, 17,14, 34, 12, 31, 26] and references therein. This problem has broad applications such as multi-agenttarget seeking [16], distributed learning [15], and wireless networks [24], among others.We address problem (1) from the perspective of zero-order distributed optimization. That is weassume that only function values can be queried by the algorithm, subject to measurement noise.Each agent i is associated to one local objective f i and communication between agents is determinedby a prescribed undirected graph, whereby each agent can exchange information only with the corre-sponding neighboring agents.The principal contribution of this work is a distributed zero-order optimization algorithm whichwe show to achieve tight rates of convergence under certain assumptions on the objective functions.Specifically, we consider that the local objectives f i are β -H ¨older and the average objective f is α -strongly convex. In Section 6, we provide the rates of convergence for the cumulative regret and theoptimization error of the proposed algorithm. The rates highlight the dependency with respect to the1umber of variables d , the number of function queries T , the spectral gap of the network matrix, andthe parameters n , α and β . In Section 7, we provide the improved rates of convergence for -smoothfunctions. When these bounds are compared to related lower bounds in [2] for undistributed setting,one can see that the rates are optimal with respect to T and α or with respect to T and d .A key novelty of our approach is the general noise model added to the queried function values.The noise variables do not need to be zero mean or independently sampled, and thus they include“adversarial” noise. Moreover, our algorithm relies on a novel gradient estimator procedure, thatresults in an O ( d ) computational gain over previous zero-order algorithms [2, 3] and referencestherein. Previous Work
For both deterministic and stochastic scenarios of problem (1), a large body ofliterature is devoted to first-order gradient based methods with a consensus scheme (see the paperscited above and references therein). On the other hand, the study of zero-order methods was startedonly recently [27, 29, 28, 11, 37, 35]. The works [27, 37, 35] are dealing with zero-order distributedmethods in noise-free settings while the noisy setting is developed in [11, 29, 28]. Namely, [11]consider 2-point zero-order methods with stochastic queries for non-convex optimization but assumethat the noise is the same for both queries, which makes the problem analogous to noise-free scenarioin terms of optimization rates. [29, 28] study zero-order distributed optimization for strongly convexand β -smooth functions f i with β ∈ { , } . They derive bounds on the optimization error, thoughwithout providing closed form expressions (see Section 6 for the details). Notation
Throughout we denote by h· , ·i and k · k be the standard inner product and Euclideannorm on R d , respectively, and by k · k ∗ the spectral norm of a matrix. The notation I is used for the n -dimensional identity matrix and for the vector in R n with all entries equal to 1. We denote by e j the j -th canonical basis vector in R d . For any set A , the number of elements in A is denoted by | A | .For x ∈ R , the value ⌊ x ⌋ is the maximal integer less than x . We denote by Proj Θ ( · ) the operator ofEuclidean projection onto the set Θ and by diam(Θ) the Euclidean diameter of Θ . For every closedconvex set Θ ⊂ R d and x ∈ R d we denote by Proj Θ ( x ) = argmin {k z − x k : z ∈ Θ } the Euclideanprojection of x onto Θ . We denote by U [ − , the uniform distribution on [ − , . Let n be the number of agents and let G = ( V, E ) be an undirected graph, where V = { , . . . , n } is the set of nodes and E ⊆ V × V is the set of edges. The adjacency matrix of G is the symmetricmatrix ( A ij ) ni,j =1 defined as A ij = 1 , if ( i, j ) ∈ E and zero otherwise. We consider the followingsequential learning framework, where each agent i gets values of function f i corrupted by noise andshares information with other agents. At step t , agent i acts as follows:• makes queries and gets noisy values of f i ,• provides a local output u i ( t ) based on these queries and on the past information,• broadcasts u i ( t ) to neighboring agents,• updates its local variable using information from other agents as follows: x i ( t +1)= n X j =1 W ij u j ( t ) , where W = ( W ij ) ni,j =1 is a given matrix called the consensus matrix.Below we use the following condition on the consensus matrix. Assumption 2.1.
Matrix W is symmetric, doubly stochastic, and ρ := (cid:13)(cid:13) W − n − ⊤ (cid:13)(cid:13) ∗ < . lgorithm 1 Distributed Zero-Order Gradient
Input
Communication matrix ( W ij ) ni,j =1 , step sizes ( η t > T − t =1 Initialization
Choose initial vectors x (1) = · · · = x n (1) ∈ R d For t = 1 , . . . , T − For i = 1 , . . . , n
1. Build an estimate g i ( t ) of the gradient ∇ f i ( x i ( t )) using noisy evaluations of f i
2. Update x i ( t +1)= n P k =1 W ik Proj Θ ( x k ( t ) − η t g k ( t )) EndEndOutput
Approximate minimizer ¯ x ( T ) = n P ni =1 x i ( T ) of the average objective f = n P ni =1 f i Matrix W accounts for the connectivity properties of the network. If W ij = 0 the agents i and j are not connected (do not exchange information). Often W is defined as a doubly stochastic matrixfunction of the adjacency matrix A of the graph. One popular example is as follows: W ij = A ij γ max { d ( i ) ,d ( j ) } if i = j, − P k : k = i A ki γ max { d ( i ) ,d ( k ) } if i = j, where d ( i ) = P nj =1 A ij is the degree of node i and γ > is a constant. Then, clearly, W = ( W ij ) is a symmetric and doubly stochastic matrix, and W ij = 0 if agents i and j are not connected.Moreover, we have ρ < − c/n for a constant c > (see [27, 22]). Values of spectral gaps ρ forsome other W reflecting different network topologies can be found in [8]. Typically, ρ < − a n ,where a n = Ω( n − ) or a n = Ω( n − ) . Parameter ρ can be viewed as a measure of difference betweenthe distributed problem and a standard optimization problem. If the graph of communication is acomplete graph a natural choice is W = n − n ⊤ n and then ρ = 0 . For more examples of consensusmatrices W , see [23, 8] and references therein.The local outputs u i can be defined in different ways. Our approach is outlined in Algorithm 1.At Step 1, an estimate of the gradient of the local objective f i at x i ( t ) is constructed. This involvesa randomized procedure that we describe and justify in Section 4. The local output u i is definedas an update of the projected gradient algorithm with such an estimated gradient. At Step 2 of thealgorithm, each agent computes the next point by a local consensus gradient descent step, which useslocal and neighbor information. Step 2 of the algorithm is known as gossip method, see e.g., [5]),which was initially introduced as an approach for the networks with the imposed connection betweenthe nodes changing by time. We also refer to [30] for similar algorithms in the context of distributedstochastic first-order gradient methods. In this section, we give some definitions and introduce our assumptions on the local objective func-tions f , . . . , f n . Definition 3.1.
Denote by F β ( L ) the set of all functions f : R d → R that are ℓ = ⌊ β ⌋ timesdifferentiable and satisfy, for all x, z ∈ R d the H¨older-type condition (cid:12)(cid:12)(cid:12)(cid:12) f ( z ) − X ≤| m |≤ ℓ m ! D m f ( x )( z − x ) m (cid:12)(cid:12)(cid:12)(cid:12) ≤ L k z − x k β , (2)3 lgorithm 2 Gradient Estimator with d Queries
Input
Function F : R d → R and point x ∈ R d Requires
Kernel K : [ − , → R , parameter h > Initialization
Generate random r from uniform distribution on [ − , For j = 1 , . . . , d
1. Obtain noisy values y j = F ( x + hre j ) + ξ j and y ′ j = F ( x − hre j ) + ξ ′ j
2. Compute g j = h ( y j − y ′ j ) K ( r ) EndOutput g = ( g j ) dj =1 ∈ R d estimator of ∇ F ( x ) where L > , the sum is over the multi-index m = ( m , ..., m d ) ∈ N d , we used the notation m ! = m ! · · · m d ! , | m | = m + · · · + m d , and we defined, for every ν = ( ν , . . . , ν d ) ∈ R d , D m f ( x ) ν m = ∂ | m | f ( x ) ∂ m x · · · ∂ m d x d ν m · · · ν m d d . Elements of the class F β ( L ) are referred to as β -H¨older functions. Definition 3.2.
Function f : R d → R is called 2-smooth if it is differentiable on R d and there exists ¯ L > such that, for every ( x, x ′ ) ∈ R d × R d , it holds that k∇ f ( x ) − ∇ f ( x ′ ) k ≤ ¯ L k x − x ′ k . Definition 3.3.
Let α > . Function f : R d → R is called α -strongly convex if f is differentiable on R d and f ( x ) − f ( x ′ ) ≥ h∇ f ( x ′ ) , x − x ′ i + α (cid:13)(cid:13) x − x ′ (cid:13)(cid:13) , ∀ x, x ′ ∈ R d . Assumption 3.4.
Functions f , . . . , f n : (i) belong to the class F β ( L ) , for some β ≥ , and (ii) are2-smooth. In Section 6 we will analyse the convergence properties of Algorithm 1 when the objective func-tion f in 1 is α -strongly convex. We stress that we do not need the functions f , . . . , f n , to be as well α -strongly convex. It is enough to make such an assumption on the compound function f , while thelocal functions f i only need to satisfy the smoothness conditions stated in Assumption 3.4 above. In this section, we detail our choice of gradient estimators g i ( t ) used at Step 1 of Algorithm 1. Weconsider Algorithm 2. For any function F : R d → R and any point x , the vector g returned byAlgorithm 2 is an estimate of ∇ F ( x ) based on noisy observations of F at randomized points. Theestimator is computed for every node i at each step t , thus giving the vectors g = g i ( t ) in Algorithm1. The gradient estimator crucially requires a kernel function K : [ − , → R that allows us totake advantage of possible higher order smoothness properties of f . Specifically, in what follows weassume that Z uK ( u ) du = 1 , Z u j K ( u ) du = 0 , j = 0 , , , . . . , ℓ, and κ β ≡ Z | u | β | K ( u ) | du < ∞ , (3)for given β ≥ and ℓ = ⌊ β ⌋ . In [25] such kernels can be constructed as weighted sums of Legendrepolynomials, in which case κ β ≤ √ β with β ≥ ; see also Appendix A.3 in [3] for a derivation.The gradient estimator in Algorithm 2 differs from the standard d -point Kiefer-Wolfowitz typeestimator in that it uses multiplication by a random variable K ( r ) with a well-chosen kernel K . On4he other hand, it is also different from the previous kernel-based estimators in zero-order optimiza-tion literature [25, 3, 2] in that it needs d function queries per step, whereas those estimators requireonly one or two queries; see, in particular, Algorithm 1 in [2] for a comparison. At first sight, thisseems a big drawback of the estimator proposed here. We argue that it is not the case. Usual criticsabout d -point estimators put forward the fact that they cannot be realized when the dimension d ishigh as they need at least T = 2 d queries. However, the lower bounds in [32, 2] (see (12) below)indicate that no estimator can achieve nontrivial convergence rate for zero-order optimization when T . d ββ − . Thus, having the total budget of T ≫ d queries is a necessary condition for successof any zero-order stochastic optimization method. Algorithms with one or two queries per step can,of course, be realized for T . d but in this case they do not enjoy any nontrivial error behavior.Moreover, with the same total budget of queries T , the gradient estimator from Algorithm 2 is com-putationally more efficient than the estimators in [25, 3, 2]. As detailed in Section 6, with the samebudget of queries, it needs /d less calls of random variable generator than it would be for the gra-dient estimator in [3, 2]. We also show that, again with the same budget, the rate of convergence ofsuch a gradient estimator is better than the rate obtained in [2].When the estimator in Algorithm 2 is used at the t -th outer step of Algorithm 1, it should beintended as a random variable that depends on the randomization used during the current estimationat the given node, as well as on the randomness of the past iterations, inducing the σ -algebra F t (seeSection 5 for the definition). Bounds for the bias of this estimator conditional on the past and for itssecond moment play an important role below, in our analysis of the convergence rates. These boundsare presented in the next two lemmas. We state them in the simpler setting of Algorithm 2, with noreference to the filtration ( F t ) t ∈ N . Lemma 4.1.
Let f : R d → R be a function in F β ( L ) , β ≥ , and let the random variables ξ , . . . , ξ d and ξ ′ , . . . , ξ ′ d be independent of r and such that E [ | ξ j | ] < ∞ , E [ | ξ ′ j | ] < ∞ , for j = 1 , . . . , d . Let thekernel satisfy the conditions (3) . If g is the estimator of the gradient given by Algorithm 2 then, forevery x ∈ R d , k E [ g ] − ∇ f ( x ) k ≤ Lκ β √ dh β − . Proof.
By Taylor expansion we have f ( x + hre j ) − f ( x − hre j )2 h = ∂f ( x ) ∂x j r + 1 h X ≤ m ≤ ℓ,m odd ( rh ) m m ! ∂ m f ( x ) ∂x mj + R ( hre j ) − R ( − hre j )2 h , where | R ( ± hre j ) | ≤ L k hre j k β = L | r | β h β . Using (3) it follows that (cid:12)(cid:12)(cid:12) E [ g j ] − ∂f ( x ) ∂x j (cid:12)(cid:12)(cid:12) = (cid:12)(cid:12)(cid:12) E (cid:20) f ( x + hre j ) − f ( x − hre j )2 h K ( r ) (cid:21) − ∂f ( x ) ∂x j (cid:12)(cid:12)(cid:12) ≤ Lκ β h β − , which implies the result.It is straightforward to see that the bound of Lemma 4.1 holds when the estimators are buildrecursively during the execution of Algorithm 1 and the expectation is taken conditionally on F t .This will be used in the proofs. Lemma 4.2.
Let f : R d → R be -smooth and let max x ∈ Θ k∇ f ( x ) k ≤ G , κ ≡ R K ( u ) du < ∞ .Let the random variables ξ , . . . , ξ d and ξ ′ , . . . , ξ ′ d be independent of r and E [ ξ j ] ≤ σ , E [( ξ ′ j ) ] ≤ σ for j = 1 , . . . , d. If g is defined by Algorithm 2, where x is a random variable with values in Θ independent of r and depending on ξ , . . . , ξ d and ξ ′ , . . . , ξ ′ d in an arbitrary way, then E k g k ≤ dκ (cid:18) σ h + 3 ¯ L h (cid:19) + 9 G κ. roof. Fix j ∈ , . . . , d. Using the inequality ( a + b + c ) ≤ a + b + c ) and the independencebetween r and ( ξ j , ξ ′ j ) we have E [ g j ] = 14 h E (cid:2) ( f ( x + hre j ) − f ( x − hre i ) + ξ i − ξ ′ i ) K ( r ) (cid:3) (4) ≤ h E h(cid:16)(cid:0) f ( x + hre j ) − f ( x − hre j ) (cid:1) + 2 σ (cid:17) K ( r ) i . The same calculations as in the proof of Lemma 2.4 in [2] yield (cid:0) f ( x + hre j ) − f ( x − hre j ) (cid:1) ≤ (cid:18) ¯ L k hre j k + 4 h∇ f ( x ) , hre j i (cid:19) , Finally, we combine this inequality with (4) to obtain E [ g j ] ≤ κ (cid:18) σ h + 3 ¯ L h (cid:19) + 9 κ E [ h∇ f ( x ) , e i i ] , which immediately implies the lemma. Algorithm 2 is called to compute estimators of gradients of the local functions f i , i = 1 , . . . n, at eachiteration t of Algorithm 1. Thus, we assume that agent i at iteration t generates a uniform randomvariable r i ( t ) ∼ U [ − , and gets d noisy observations y i,j ( t ) = f ( x i ( t ) + h t r i ( t ) e j ) + ξ i,j ( t ) and y ′ i,j ( t ) = f ( x i ( t ) + h t r i ( t ) e j ) + ξ ′ i,j ( t ) , j = 1 , . . . , d , where h t > . In what follows, we denote by F t the σ -algebra generated by the random variables x i ( t ) , for i = 1 , . . . , n .In order to meet the conditions of Lemmas 4.1 and 4.2 for each ( i, t ) , we impose the followingassumption on the collection of random variables ( r i ( t ) , ξ i,j ( t ) , ξ ′ i,j ( t )) . Assumption 5.1.
For all integers t and i ∈ { , . . . , n } the following properties hold.(i) The random variables r i ( t ) ∼ U [ − , are independent of ξ i, ( t ) , . . . ξ i,d ( t ) , ξ ′ i, ( t ) , . . . , ξ ′ i,d ( t ) and from the σ -algebra F t ,(ii) E [( ξ i,j ( t )) ] ≤ σ , E [( ξ ′ i,j ( t )) ] ≤ σ for j = 1 , . . . , d , and some σ ≥ . Assumption 5.1 is very mild. Indeed, its part (i) occurs as a matter of course since it is unnaturalto assume dependence between the random environment noise and artificial random variables r i ( t ) generated by the agents. We state (i) only for the purpose of formal rigor. Remarkably, we donot assume the noises ξ i,j ( t ) and ξ ′ i,j ( t ) to have zero mean. What is more, these variables can bedeterministic and no independence between them for different i, j, t is required, so we consider anadversarial environment. Having such a relaxed assumption on the noise is possible because of themultiplication by the zero-mean variable K ( r ) in Algorithm 2. This and the fact that all componentsof the vectors are treated separately allows the proofs go through without the zero-mean assumptionand under arbitrary dependence between the noises. In this section, we provide upper bounds on the performance of the proposed algorithms. Recallthat T is the number of outer iterations in Algorithm 1. Let T be the total number of times that weobserved noisy values of f . At each iteration of Algorithm 2 we make d queries. Thus, to keep thetotal budget equal to T we need to make T = T / (2 d ) steps of Algorithm 1 (assuming that T / (2 d ) is an integer). We compare our results to lower bounds for any algorithm with the total budget of T queries. 6or given β ≥ , we choose the tuning parameters η t and h = h t in Algorithms 1 and 2 asfollows : η t = 2 αt , and h t = t − β . (5)Inspection of the proofs in Appendix B shows that these values of η t and h t lead to the best ratesminimizing the bounds.As one can expect, there are two contributions to the bounds. One of them represents the usualstochastic optimization error and the second one accounts for the distributed character of the problem.This second contribution to the bounds is driven by the following quantity that we call the meandiscrepancy: ∆( t ) ≡ n − P ni =1 E [ (cid:13)(cid:13) x i ( t ) − ¯ x ( t ) (cid:13)(cid:13) ] . It plays an important role in our argument andmay be of interest by itself, cf. [35]. The next lemma gives a control of the mean discrepancy.
Lemma 6.1.
Let Assumptions 2.1, 3.4, and 5.1 hold. Let Θ be a convex compact subset of R d .Assume that diam (Θ) ≤ K and max x ∈ Θ k∇ f ( x ) k ≤ G . If the updates x i ( t ) , ¯ x ( t ) are defined byAlgorithm 1, in which the gradient estimators for i -th agent are defined by Algorithm 2 with F = f i , i = 1 , . . . , n , and parameters (5) then ∆( t ) ≤ A (cid:18) ρ − ρ (cid:19) dα t − β − β , (6) where A is a constant independent of T, d, α, n, ρ . The explicit value of A can be found in the proof. Proof sketch.
Let V ( t ) = P ni =1 (cid:13)(cid:13) x i ( t ) − ¯ x ( t ) (cid:13)(cid:13) , and z i ( t ) = Proj Θ (cid:0) x i ( t ) − η t g i ( t ) (cid:1) − ( x i ( t ) − η t g i ( t )) . The first step is to show that, due to the definition of the algorithm and Assumptions 2.1 onmatrix W , we have V ( t + 1) ≤ ρ n X i =1 (cid:13)(cid:13) x i ( t ) − ¯ x ( t ) − η t ( g i ( t ) − ¯ g ( t )) + z i ( t ) − ¯ z ( t ) (cid:13)(cid:13) , (7)where ¯ g ( t ) and ¯ z ( t ) denote the averages of g i ( t ) ’s and z i ( t ) ’s over the agents i . From (7), by usingthe fact that k z i ( t ) k ≤ η t k g i ( t ) k , applying Lemma 4.1 conditionally on F t , taking expectations andthen applying Lemma 4.2 we deduce the recursion ∆( t + 1) ≤ ρ ∆( t ) + A ρ − ρ · dα t − β − β , where A > is a constant. The initialization of Algorithm 1 is chosen such that ∆(1) = 0 . Itfollows that ∆( t ) is bounded by a discrete convolution that can be carefully evaluated leading finallyto (6).Using Lemma 6.1 we obtain the following theorem. Theorem 6.2.
Let f be an α -strongly convex function and let the assumptions of Lemma 6.1 besatisfied. Then for any x ∈ Θ the cumulative regret satisfies T X t =1 E (cid:2) f (¯ x ( t )) − f ( x ) (cid:3) ≤ dα T /β (cid:18) B + B ρ − ρ (cid:19) + B α (1 − ρ ) (log( T ) + 1) , where the positive constants B i are independent of T, d, α, n, ρ . The explicit values of these constantscan be found in the proof. Furthermore, if x ∗ is the minimizer of f over Θ the optimization error ofthe averaged estimator ˆ x ( T ) = T P T t =1 ¯ x ( t ) satisfies E [ f (ˆ x ( T )) − f ( x ∗ )] ≤ dα T − β − β (cid:18) B + B ρ − ρ (cid:19) + B α (1 − ρ ) (cid:16) log( T ) + 1 T (cid:17) . (8)7 roof sketch. Note first that, due to the definition of Algorithm 1 and to the properties of matrix W we have ¯ x ( t + 1) = ¯ x ( t ) − η t ¯ g ( t ) + ¯ z ( t ) . This resembles the usual recursion of the gradient algorithmwith an additional term ¯ z ( t ) = n − P ni =1 z i ( t ) , where k z i ( t ) k ≤ η t k g i ( t ) k . Using this boundand α -strong convexity of f , analyzing the recursion in the standard way and taking conditionalexpectations we obtain that, for any x ∈ Θ , f (¯ x ( t )) − f ( x ) ≤ η t E (cid:2) a t − a t +1 |F t (cid:3) − αa t η t n n X i =1 E (cid:2) (cid:13)(cid:13) g i ( t ) (cid:13)(cid:13) |F t (cid:3) + (cid:13)(cid:13) E (cid:2) ¯ g ( t ) |F t (cid:3) − ∇ f (¯ x ( t )) (cid:13)(cid:13) k ¯ x ( t ) − x k | {z } Bias1 + 1 η t E (cid:2) h ¯ z ( t ) , ¯ x ( t ) − x i|F t (cid:3)| {z } Bias2 , (9)where a t = k ¯ x ( t ) − x k . Here, the term Bias2 is entirely induced by the distributed nature of theproblem. Using the properties of Euclidean projection and some algebra, it can be bounded as Bias2 ≤ η t − ρ ) n n X i =1 E (cid:2) (cid:13)(cid:13) g i ( t ) (cid:13)(cid:13) |F t (cid:3) + 1 − ρ nη t n X i =1 (cid:13)(cid:13) x i ( t ) − ¯ x ( t ) (cid:13)(cid:13) . (10)On the other hand, Bias1 accumulates two contributions, the first due to the gradient approximation(cf. Lemma 4.1) and the second due to the distributed nature of the problem: Bias1 ≤ κ β L √ dh β − t k ¯ x ( t ) − x k + ¯ Ln n X i =1 (cid:13)(cid:13) x i ( t ) − ¯ x ( t ) (cid:13)(cid:13) k ¯ x ( t ) − x k≤ (cid:16) ( κ β L ) α dh β − t + αa t (cid:17) + ¯ Ltα (1 − ρ ) n n X i =1 (cid:13)(cid:13) x i ( t ) − ¯ x (cid:13)(cid:13) + ¯ L K tα (1 − ρ ) ! . (11)Next, we combine inequalities (9)–(11), take expectations of both sides of the resulting inequality,and use Lemmas 4.2 and 6.1 to bound the second moments E (cid:2) (cid:13)(cid:13) g i ( t ) (cid:13)(cid:13) (cid:3) and the mean discrepancy.The final result is obtained by summing up from t = 1 to t = T and recalling that η t = αt , h t = t − β .Due to α -strong convexity of f , Theorem 6.2 immediately implies a bound on the estimationerror E [ k ˆ x ( T ) − x ∗ k ] . The bound is of the order of the right-hand side of (8) divided by α . Further-more, combining this consequence of Theorem 6.2, Lemma 6.1 and the trivial bound by the squareddiameter of Θ we easily get the following corollary. Corollary 6.3.
Let the assumptions of Theorem 6.2 hold. Then the local average estimator ˆ x i ( T ) = T P T t =1 x i ( t ) satisfies E [ k ˆ x i ( T ) − x ∗ k ] ≤ C min (cid:26) , dα (1 − ρ ) T − β − β (cid:18) nρ (1 − ρ ) T (cid:19)(cid:27) , where C > is a positive constant independent of T, d, α, n, ρ . We now state a corollary of Theorem 6.2 for an algorithm with total budget of T queries. Assumethat T = T / (2 d ) is an integer. As our algorithm makes d queries per step the estimator ˆ x ( T / (2 d )) uses the total budget of T queries. Combining Theorem 6.2 with the trivial bound E [ f (ˆ x ( T / (2 d )) − f ( x ∗ )] ≤ G K we get the following result. Corollary 6.4.
Let T ≥ d and let the assumptions of Theorem 6.2 be satisfied. Then satisfies E [ f (ˆ x ( T / (2 d )) − f ( x ∗ )] ≤ C min ( , d − /β α (1 − ρ ) T − β − β ) , where C > is a positive constant independent of T, d, α, n, ρ . n = 1 , ρ = 0 corresponds to usual (undistributed) zero-order stochastic optimization.Then Corollary 6.4 gives a bound of order min (cid:0) , d − /β α T − β − β (cid:1) . This improves upon thebound min (cid:0) , d α T − β − β (cid:1) obtained under the same assumptions in [2] but still does not matchthe minimax lower bound established in [2] and equal to min (cid:16) max( α, T − / /β ) , d √ T , dα T − β − β (cid:17) . (12)For α ≍ the lower bound (12) scales as min (cid:0) , dα T − β − β (cid:1) . It has the same behavior inthe interesting regime of α not too small ( α ≥ T − / /β ) and T ≥ d . Note, however, thatthe lower bound (12) is obtained for the setting with i.i.d. noise, while our upper bound isvalid under adversarial noise. Therefore, it may seem rather surprising that the ratio is only d − /β . Furthermore, for β = 2 and α ≍ one can match both bounds, see [33, 2] and anextension to distributed setting in Section 7 below. When α is too small ( α < T − / /β ) thebest achievable rate is presumably the same as for simply convex functions. Indeed, the bound(12) is asymptotically Ω( d/ √ T ) independently of the smoothness index β and on α , while itis shown by [4] that achievable rate for convex functions can be as fast as O ( d . / √ T ) , up tolog-factors.2. We would like to emphasize that, with the same budget of queries T , our d -point method (cf.Algorithm 2) is computationally simpler than the methods with one or two queries per step[25, 3, 2] previously suggested previously suggested for the same setting. For example, themethod in [3, 2] prescribes, at each step t = 1 , . . . , T , to generate a random variable uniformlydistributed on the unit sphere in R d . This requires d calls of one-dimensional random variablegenerator. Overall, in T steps, the number of calls is dT . For our method with the same budget T , we make T = T / (2 d ) steps and at each step we need to call the generator only once inorder to get r ∼ U [ − , . Thus, with the same budget of queries, our Algorithm 2 needs /d less calls of random variable generator than it would be for the gradient estimator in [3, 2].3. If we deal with distributed learning in the case of complete graph ( ρ = 0 ), the discrepancyrelated to the distributed nature of the problem vanishes. In this case, the bound of Theorem6.2 is, to within a constant, the same as in undistributed setting.4. It is straightforward from the proofs that the results of this section remain valid for time-varyingmatrices W = W ( t ) satisfying Assumption 2.1 with ρ := sup t (cid:13)(cid:13) W ( t ) − n − ⊤ (cid:13)(cid:13) ∗ < .We conclude this section with a brief overview of existing results on zero-order distributed opti-mization. The problem with noisy queries was considered in detail in [29, 28], where the settingdiffers from ours in many aspects (the updates are obtained not as in Step 2 of Algorithm 1 butrather via decentralized techniques, matrix W is random, the noise is zero-mean random rather thanadversarial, 2-point gradient estimator is used). [29, 28] provide, for β = 2 and β = 3 , boundson E [ k x i ( T ) − x ∗ k ] of the order at least n / (1 − ρ ) T − / and n / (1 − ρ ) T − / , respectively, as functionsof n , ρ and T . It should be noted that the bounds in [29, 28] contain uncontrolled terms of theform E [ k x i ( k ) − x ∗ k ] for some large enough k = k ( n, α, d ) leaving unclear the resulting rate.[11] consider 2-point methods with stochastic queries but assume that the noise is the same for bothqueries and deal with non-convex optimization. Noisy-free zero-order distributed optimization isstudied by [27, 37, 35]. From these, [35] is the closest to our work as it builds on the updates as atStep 2 of Algorithm 1 (though without projections). The bounds obtained in these papers are of theorder (1 − ρ ) − considered as functions of ρ . Note that the bound of Theorem 6.2 scales only as The recent work [21] obtains the same improvement, using the gradient estimator of [2]. However as we notice belowthat estimator is less computationally appealing. − ρ ) − and this bound holds true for noisy-free setting, which is its special case corresponding to σ = 0 . Since most common values of − ρ are of the order n − (or n − ), this represents a substantialgain. Moreover, Theorem 6.2 covers a difficult noise setting as we deal with adversarial noise. It isworthwhile to note that the first-order distributed optimization exhibits much better dependency on ρ since bounds that scale as (1 − ρ ) − / can be achieved [9, 31]. β = 2 In this section we provide improved upper bounds for the case β = 2 in Corollary 6.4, where werelax the dependency over d , from d / to d . Following the literature on undistributed zero-orderoptimization, we use a standard 2-point method with elements of the analysis developed in [10, 1, 9,32, 33, 2] among others. Specifically, we define g i ( t ) = d h t ( y i ( t ) − y ′ i ( t )) ζ i ( t ) (13)where y i ( t ) = f i ( x i ( t )+ h t ζ i ( t ))+ ξ i ( t ) , y ′ i ( t ) = f i ( x i ( t ) − h t ζ i ( t ))+ ξ ′ i ( t ) , with the random variables ζ i ( t ) , ≤ i ≤ n , ≤ t ≤ T , that are i.i.d. uniformly distributedon the unit Euclidean sphere in R d . We make the following assumption on the noise analogous toAssumption 5.1. Assumption 7.1.
For all integers t and all i ∈ { , . . . , n } the following properties hold.(i) The random variables ζ i ( t ) are independent of ξ i ( t ) , ξ ′ i ( t ) and from the σ -algebra F t ,(ii) E [( ξ i ( t )) ] ≤ σ , E [( ξ ′ i ( t )) ] ≤ σ for some σ ≥ . Theorem 7.2.
Let f be an α -strongly convex function. Let Assumptions 2.1, 3.4, and 7.1 hold with β = 2 . Let Θ be a convex compact subset of R d , and assume that diam (Θ) ≤ K . Assume that max x ∈ Θ k∇ f i ( x ) k ≤ G , for ≤ i ≤ n . Let the updates x i ( t ) , ¯ x ( t ) be defined by Algorithm 1, inwhich the gradient estimator for i -th agent is defined by (13) , and η t = αt , h t = (cid:16) d σ Lαt +9 L d (cid:17) / .Then for the estimator ˜ x ( T ) = T −⌊ T/ ⌋ P Tt = ⌊ T/ ⌋ +1 ¯ x ( t ) we have E [ f (˜ x ( T )) − f ( x ∗ )] ≤ B − ρ (cid:18) d √ αT + d αT (cid:19) , where B > is a constant independent of T, d, α, n, ρ . The main idea of the proof is to use surrogate functions ˆ f it ( x ) , for ≤ i ≤ n , defined as follows ˆ f it ( x ) = E f i ( x + h t ˜ ζ ) , ∀ x ∈ R d . where the expectation with respect to the random vector ˜ ζ uniformly distributed on the unit ball B d = { u ∈ R d : k u k ≤ } . A result, which can be traced back to [20] implies the fact that g i ( t ) is an unbiased estimator of the gradient of the surrogate function ˆ f it at x i ( t ) . Thus, we can considerAlgorithm 1 as a gradient descent for the surrogate function. Then replacing f i and f by the surrogatefunctions with the cost of the order h t , we can recover the initial problem. This method does notwork for β > since the error of approximation by surrogate function becomes of bigger order thanthe optimal rate T − β − β . The results that we implement as tools for this section are given in AppendixC. Combining Theorem 7.2 with the obvious bound E [ f (˜ x ( T )) − f ( x ∗ )] ≤ G K we obtain E [ f (˜ x ( T )) − f ( x ∗ )] ≤ B ′ − ρ min (cid:16) , d √ αT (cid:17) , (14)where B ′ > is a constant independent of T, d, α, n, ρ . By comparing this upper bound with theminimax lower bound (12) for β = 2 , one can note that (14) is optimal with respect to the parameters T and d when α ≍ . 10 Conclusion
We presented a zero-order distributed optimization algorithm to minimize a strongly convex functionformed by the average of local smooth objectives. We analysed the performance of the algorithmunder a general noise model on the queried function values. We observed that when the algorithmis specified to the undistributed setting it enjoys better rates than those previously available, whilebeing computationally more efficient. Important directions of future research include the analysis ofdisturbed zero-order algorithms for larger classes of functions, such as α -gradient dominant ones, aswell the study of lower bounds in the zero-order distributed setting. Furthermore, extension of ourresults to stochastic updates or asynchronous activation schemes would be valuable. References [1] A. Agarwal, O. Dekel, and L. Xiao. Optimal algorithms for online convex optimization withmulti-point bandit feedback. In
Proc. 23rd International Conference on Learning Theory , pages28–40, 2010.[2] A. Akhavan, M. Pontil, and A. Tsybakov. Exploiting higher order smoothness in derivative-freeoptimization and continuous bandits. In
Advances in Neural Information Processing Systems33 , 2020.[3] F. Bach and V. Perchet. Highly-smooth zero-th order online optimization. In
Proc. 29th AnnualConference on Learning Theory , pages 1–27, 2016.[4] A. Belloni, T. Liang, H. Narayanan, and A. Rakhlin. Escaping the local minima via simulatedannealing: Optimization of approximately convex functions. In
Proc. 28th Annual Conferenceon Learning Theory , pages 240–265, 2015.[5] S. Boyd, A. Ghosh, B. Prabhakar, and D. Shah. Randomized gossip algorithms.
IEEE Transac-tions on Information Theory , 52(6):2508–2530, 2006.[6] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statisticallearning via the alternating direction method of multipliers.
Foundations and Trends in MachineLearning , 3:1–122, 2011.[7] L. Devroye, L. Gy¨orfi, and G. Lugosi.
A Probabilistic Theory of Pattern Recognition . Springer,New York, 1996.[8] J. C. Duchi, A. Agarwal, and M. J. Wainwright. Dual averaging for distributed optimiza-tion: Convergence analysis and network scaling.
IEEE Transactions on Automatic Control ,57(3):592–606, 2012.[9] J. C. Duchi, M. I. Jordan, M. J. Wainwright, and A. Wibisono. Optimal rates for zero-order con-vex optimization: The power of two function evaluations.
IEEE Transactions on InformationTheory , 61(5):2788–2806, 2015.[10] A. D. Flaxman, A. T. Kalai, and H. B. McMahan. Online convex optimization in the banditsetting: gradient descent without a gradient. In
Proc. 16th Annual ACM-SIAM Symposium onDiscrete algorithms (SODA) , pages 385—-394, 2005.[11] D. Hajinezhad, M. Hong, and A. Garcia. Zeroth order nonconvex multi-agent optimization overnetworks.
IEEE Transactions on Automatic Control , 64(10):3995–4010, 2019.[12] D. Jakoveti´c. A unification and generalization of exact distributed first-order methods.
IEEETransactions on Signal and Information Processing over Networks , 5(1):31–46, 2019.1113] D. Jakoveti´c, J. Xavier, and J. M. F. Moura. Fast distributed gradient methods.
IEEE Transac-tions on Automatic Control , 59(5):1131–1146, 2014.[14] S. Kia, J. Cort´es, and S. Mart´ınez. Distributed convex optimization via continuous-time coordi-nation algorithms with discrete-time communication.
Autom. , 55:254–264, 2015.[15] T. Kraska, A. Talwalkar, J. C. Duchi, R. Griffith, M. Franklin, and M. I. Jordan. MLbase: Adistributed machine-learning system. In
CIDR , 2013.[16] L. Liu, C. Luo, and F. Shen. Multi-agent formation control with target tracking and navigation.In , pages 98–103,2017.[17] I. Lobel, A. Ozdaglar, and D. Feijer. Distributed multi-agent optimization with state-dependentcommunication.
Mathematical Programming , 129:255––284, 2011.[18] A. Nedic and A. Ozdaglar. Distributed subgradient methods for multi-agent optimization.
IEEETransactions on Automatic Control , 54(1):48–61, 2009.[19] A. Nedic, A. Ozdaglar, and P. Parrilo. Constrained consensus and optimization in multi-agentnetworks.
Automatic Control, IEEE Transactions on , 55:922–938, 2010.[20] A. S. Nemirovsky and D. B. Yudin.
Problem Complexity and Method Efficiency in Optimization.
Wiley & Sons, 1983.[21] V. Novitskii and A. Gasnikov. Improved exploiting higher order smoothness in derivative-freeoptimization and continuous bandit. arXiv preprint arXiv:2101.03821 , 2021.[22] A. Olshevsky. Linear time average consensus on fixed graphs and implications for decentralizedoptimization and multi-agent control. arXiv: Optimization and Control , 2014.[23] A. Olshevsky and J. Tsitsiklis. Convergence speed in distributed consensus and control.
SIAMJournal on Control and Optimization , 48(1):33–55, 2009.[24] J. Park, S. Samarakoon, A. Elgabli, J. Kim, M. Bennis, S.-L. Kim, and M. Debbah.Communication-efficient and distributed learning over wireless networks: Principles and ap-plications, 08 2020.[25] T. B. Polyak and A. B. Tsybakov. Optimal order of accuracy of search algorithms in stochasticoptimization.
Problems of Information Transmission , 26(2):45–53, 1990.[26] S. Pu, W. Shi, J. Xu, and A. Nedi´c. Push-pull gradient methods for distributed optimization innetworks.
IEEE Transactions on Automatic Control , 66(1):1–16, 2021.[27] G. Qu and N. Li. Harnessing smoothness to accelerate distributed optimization.
IEEE Transac-tions on Control of Network System , 5(5):1245–1260, 2018.[28] A. Sahu, D. Jakovetic, D. Bajovic, and S. Kar. Communication-efficient distributed stronglyconvex stochastic optimization: Non-asymptotic rates. arXiv:1809.02920 , 2018.[29] A. Sahu, D. Jakovetic, D. Bajovic, and S. Kar. Distributed zeroth order optimization overrandom networks: A kiefer-wolfowitz stochastic approximation approach. pages 4951–4958,12 2018.[30] M. O. Sayin, N. D. Vanli, S. S. Kozat, and T. Bas¸ar. Stochastic subgradient algorithms forstrongly convex optimization over distributed networks.
IEEE Transactions on Network Scienceand Engineering , 4(4):248–260, 2017. 1231] K. Scaman, F. Bach, S. Bubeck, Y. T. Lee, and L. Massouli´e. Optimal convergence rates forconvex distributed optimization in networks.
Journal of Machine Learning Research , 20:1–31,2019.[32] O. Shamir. On the complexity of bandit and derivative-free stochastic convex optimization. In
Proc. 30th Annual Conference on Learning Theory , pages 1–22, 2013.[33] O. Shamir. An optimal algorithm for bandit and zero-order convex optimization with two-pointfeedback.
Journal of Machine Learning Research , 18(1):1703–1713, 2017.[34] W. Shi, G. Wu, and W. Yin. Extra: An exact first-order algorithm for decentralized consensusoptimization.
SIAM Journal on Optimization , 25(2):944–966, 2014.[35] Y. Tang, J. Zhang, and N. Li. Distributed zero-order algorithms for nonconvex multi-agentoptimization. arXiv preprint arXiv:1908.11444v3 , 2019.[36] J. Tsitsiklis, D. Bertsekas, and M. Athans. Distributed asynchronous deterministic and stochas-tic gradient optimization algorithms.
IEEE Transactions on Automatic Control , 31(9):803–812,1986.[37] Z. Yu, D. W. C. Ho, and D. Yuan. Distributed randomized gradient-free mirror descent algo-rithm for constrained optimization. arXiv preprint arXiv:1903.04157 , 2019.13
Auxiliary Lemma
Lemma A.1.
Let W be a matrix satisfying Assumption 2.1 and let x i = P nj =1 W i,j u j for i =1 , . . . , n , where u , . . . , u n are some vectors in R d . Set ¯ x = n − P ni =1 x i , ¯ u = n − P ni =1 u i . Then n X i =1 (cid:13)(cid:13) x i − ¯ x (cid:13)(cid:13) ≤ ρ n X i =1 (cid:13)(cid:13) u i − ¯ u (cid:13)(cid:13) . Proof.
Introduce the matrices X ⊤ = ( x , . . . , x n ) ∈ R d × n , U ⊤ = ( u , . . . , u n ) ∈ R d × n and thecentering matrix H = I − n ⊤ ∈ R n × n . Notice that P ni =1 (cid:13)(cid:13) x i − ¯ x (cid:13)(cid:13) = Tr(Σ) , where Tr(Σ) isthe trace of the matrix
Σ = n X i =1 ( x i − ¯ x )( x i − ¯ x ) ⊤ = n X i =1 x i ( x i ) ⊤ − ¯ x ¯ x ⊤ = X ⊤ HX.
It is not hard to check that
Tr(Σ) = Tr( U ⊤ W HW U ) . Moreover, as W is symmetric and W = we have HW = W − n ⊤ := W = W H . Thus,
W HW = W H W = HW H and Tr(Σ) = Tr( U ⊤ HW HU ) ≤ k W k ∗ Tr( U ⊤ H U ) ≤ ρ Tr( U ⊤ HU ) = ρ n X i =1 (cid:13)(cid:13) u i − ¯ u (cid:13)(cid:13) . B Proofs for Section 6
Recall the notation ∆( t ) = n − P ni =1 E [ (cid:13)(cid:13) x i ( t ) − ¯ x ( t ) (cid:13)(cid:13) ] , ¯ g ( t ) = n P ni =1 g i ( t ) , and z i ( t ) = Proj Θ (cid:16) x i ( t ) − η t g i ( t ) (cid:17) − ( x i ( t ) − η t g i ( t )) . We also set ¯ z ( t ) = n P ni =1 z i ( t ) . Lemma B.1.
Let Assumptions 2.1, 3.4, and 5.1 hold. Let Θ be a convex compact subset of R d .Assume that diam (Θ) ≤ K and max x ∈ Θ k∇ f ( x ) k ≤ G . If the updates x i ( t ) , ¯ x ( t ) are defined byAlgorithm 1, in which the gradient estimators for i -th agent are defined by Algorithm 2 with F = f i , i = 1 , . . . , n , and parameters (5) then ∆( t ) ≤ A (cid:18) ρ − ρ (cid:19) dα t − β − β , (6) where A is a constant independent of T, d, α, n, ρ . The explicit value of A can be found in the proof. Proof.
Set V ( t ) = P ni =1 (cid:13)(cid:13) x i ( t ) − ¯ x ( t ) (cid:13)(cid:13) . The definition of Algorithm 1 and Lemma A.1 imply: V ( t + 1) ≤ ρ n X i =1 (cid:13)(cid:13) x i ( t ) − ¯ x ( t ) − η t ( g i ( t ) − ¯ g ( t )) + z i ( t ) − ¯ z ( t ) (cid:13)(cid:13) . The result is immediate if ρ = 0 . Therefore, in rest of the proof we assume that ρ > . We have V ( t + 1) ≤ ρ n X i =1 h V ( t ) + η t (cid:13)(cid:13) g i ( t ) − ¯ g ( t ) (cid:13)(cid:13) + (cid:13)(cid:13) z i ( t ) − ¯ z ( t ) (cid:13)(cid:13) (15) − η t D x i ( t ) − ¯ x ( t ) , g i ( t ) − ¯ g ( t ) E (16) − η t D g i ( t ) − ¯ g ( t ) , z i ( t ) − ¯ z ( t ) E (17) + 2 D x i ( t ) − ¯ x ( t ) , z i ( t ) − ¯ z ( t ) Ei . (18)14or any z ∈ R d , we have P ni =1 (cid:13)(cid:13) g i ( t ) − ¯ g ( t ) (cid:13)(cid:13) ≤ P ni =1 (cid:13)(cid:13) g i ( t ) − z (cid:13)(cid:13) , so that η t n X i =1 E (cid:2) (cid:13)(cid:13) g i ( t ) − ¯ g ( t ) (cid:13)(cid:13) |F t (cid:3) ≤ η t n X i =1 E (cid:2) (cid:13)(cid:13) g i ( t ) (cid:13)(cid:13) |F t (cid:3) . Next, from the definition of the projection, (cid:13)(cid:13) z i ( t ) (cid:13)(cid:13) = (cid:13)(cid:13)(cid:13) Proj Θ (cid:16) x i − η t g i ( t ) (cid:17) − ( x i − η t g i ( t )) (cid:13)(cid:13)(cid:13) ≤ (cid:13)(cid:13) x i − ( x i − η t g i ( t )) (cid:13)(cid:13) = η t (cid:13)(cid:13) g i ( t ) (cid:13)(cid:13) . (19)Therefore, for the term containing (cid:13)(cid:13) z i ( t ) − ¯ z ( t ) (cid:13)(cid:13) in (15) we obtain n X i =1 E [ (cid:13)(cid:13) z i ( t ) − ¯ z ( t ) (cid:13)(cid:13) |F t ] ≤ n X i =1 E [ (cid:13)(cid:13) z i ( t ) (cid:13)(cid:13) |F t ] ≤ η t n X i =1 E h (cid:13)(cid:13) g i ( t ) (cid:13)(cid:13) |F t i . For the expression in (16), by decoupling we get − η t n X i =1 E hD x i ( t ) − ¯ x ( t ) , g i ( t ) − ¯ g ( t ) E |F t i ≤ λV ( t ) + η t λ n X i =1 E h (cid:13)(cid:13) g i ( t ) (cid:13)(cid:13) |F t i , where λ > is a value to be chosen later. For the expression in (17), we have − η t n X i =1 E hD g i ( t ) − ¯ g ( t ) , z i ( t ) − ¯ z ( t ) E |F t i ≤ η t n X i =1 E h (cid:13)(cid:13) g i ( t ) − ¯ g ( t ) (cid:13)(cid:13) |F t i + n X i =1 E h (cid:13)(cid:13) z i ( t ) − ¯ z ( t ) (cid:13)(cid:13) |F t i ≤ η t n X i =1 E h (cid:13)(cid:13) g i ( t ) (cid:13)(cid:13) |F t i . Similarly, for the expression in (18), using the Cauchy–Schwarz inequality we get n X i =1 E hD x i ( t ) − ¯ x ( t ) , z i ( t ) − ¯ z ( t ) E |F t i ≤ n X i =1 E h (cid:13)(cid:13) x i ( t ) − ¯ x ( t ) (cid:13)(cid:13) (cid:13)(cid:13) z i ( t ) − ¯ z ( t ) (cid:13)(cid:13) |F t i ≤ λV ( t ) + 1 λ n X i =1 E h (cid:13)(cid:13) z i ( t ) − ¯ z ( t ) (cid:13)(cid:13) |F t i ≤ λV ( t ) + η t λ n X i =1 E h (cid:13)(cid:13) g i ( t ) (cid:13)(cid:13) |F t i . Combining the above inequalities yields E [ V ( t + 1) |F t ] ≤ ρ (1 + 2 λ ) V ( t ) + ρ (cid:16) λ (cid:17) η t n X i =1 E h (cid:13)(cid:13) g i ( t ) (cid:13)(cid:13) |F t i . (20)Taking expectations in (20) and applying Lemma 4.2 we obtain ∆( t + 1) ≤ ρ (1 + 2 λ )∆( t ) + ρ (cid:16) λ (cid:17) η t (cid:16) κG + d (cid:16) h t κ ¯ L κσ h t (cid:17)(cid:17) . Choose here λ = − ρ ρ . Then, using the fact that η t = αt , h t = t − β we find ∆( t + 1) ≤ ρ ∆( t ) + A ρ − ρ · dα t − β − β , (21)15here A = κG d + 18 κ ¯ L + 24 κσ . Due to the recursion in (21) we have, for any t ≥ , ∆( t + 1) ≤ ρ t ∆(1) + A ρ − ρ · dα t X s =1 s − β − β ρ t − s ≤ A ρ − ρ · dα (cid:16) ⌊ t ⌋ ⌊ t ⌋ X s =1 s − β − β t − X k = t −⌊ t ⌋ ρ k + 1 ⌊ t ⌋ t X s = ⌊ t ⌋ +1 s − β − β t −⌊ t ⌋− X k =0 ρ k (cid:17) , (22)where ∆(1) = 0 by the choice of initial values and the last inequality uses the fact that if the function φ ( · ) is monotone decreasing and φ ( · ) is monotone increasing then S S X s =1 φ ( s ) φ ( s ) ≤ S S X s =1 φ ( s ) ! S S X s =1 φ ( s ) ! , see, e.g., [7, Theorem A.19]. The sums in (22) satisfy ⌊ t ⌋ X s =1 s − β − β ≤ Z ∞ s − β − β = 2 β − β − , t X s = ⌊ t ⌋ +1 s − β − β ≤ t (cid:18) t (cid:19) − β − β = 2 β − β t − β − β , t −⌊ t ⌋− X k =0 ρ k ≤ − ρ , t − X k = t −⌊ t ⌋ ρ k ≤ t − X k = ⌊ t ⌋ ρ k ≤ tρ ⌊ t ⌋ / ≤ /ρ ) t , where the last inequality follows from the fact that ρ k ≤ /ρ ) k for any positive integer k . Plug-ging the above inequalities in (22) gives ∆( t + 1) ≤ A ρ − ρ dα (cid:16) /ρ ) t β − β − β − β ) t − β − β − ρ (cid:17) ≤ A ρ (1 − ρ ) dα t − β − β , where A = 24 β − β − + 3(2 β − β ) . Therefore, setting A := 2 A we conclude that, for t ≥ , ∆( t ) ≤ A ρ (1 − ρ ) dα t − β − β . For t ∈ { , } the bound of the lemma holds trivially since ¯ x and all x i belong to the compact Θ . Theorem 6.2.
Let f be an α -strongly convex function and let the assumptions of Lemma 6.1 besatisfied. Then for any x ∈ Θ the cumulative regret satisfies T X t =1 E (cid:2) f (¯ x ( t )) − f ( x ) (cid:3) ≤ dα T /β (cid:18) B + B ρ − ρ (cid:19) + B α (1 − ρ ) (log( T ) + 1) , where the positive constants B i are independent of T, d, α, n, ρ . The explicit values of these constantscan be found in the proof. Furthermore, if x ∗ is the minimizer of f over Θ the optimization error ofthe averaged estimator ˆ x ( T ) = T P T t =1 ¯ x ( t ) satisfies E [ f (ˆ x ( T )) − f ( x ∗ )] ≤ dα T − β − β (cid:18) B + B ρ − ρ (cid:19) + B α (1 − ρ ) (cid:16) log( T ) + 1 T (cid:17) . (8)16 roof. From the definition of Algorithm 1 and (19) we obtain k ¯ x ( t + 1) − x k = k ¯ x ( t ) − x k + k ¯ z ( t ) k + η t k ¯ g ( t ) k − η t h ¯ g ( t ) , ¯ x ( t ) − x i + 2 h ¯ z ( t ) , ¯ x ( t ) − x i − η t h ¯ z ( t ) , ¯ g ( t ) i≤ k ¯ x ( t ) − x k − η t h ¯ g ( t ) , ¯ x ( t ) − x i + 2 h ¯ z ( t ) , ¯ x ( t ) − x i + 4 η t n n X i =1 (cid:13)(cid:13) g i ( t ) (cid:13)(cid:13) . It follows that h ¯ g ( t ) , ¯ x ( t ) − x i ≤ k ¯ x ( t ) − x k − k ¯ x ( t + 1) − x k η t + 1 η t h ¯ z ( t ) , ¯ x ( t ) − x i + 2 η t n n X i =1 (cid:13)(cid:13) g i ( t ) (cid:13)(cid:13) . The strong convexity assumption implies f (¯ x ( t )) − f ( x ) ≤ h∇ f (¯ x ( t )) , ¯ x ( t ) − x i − α k ¯ x ( t ) − x k . Combining the last two displays and taking conditional expectations from both sides we get E (cid:2) f (¯ x ( t )) − f ( x ) |F t (cid:3) ≤ (cid:13)(cid:13) E (cid:2) ¯ g ( t ) |F t (cid:3) − ∇ f (¯ x ( t )) (cid:13)(cid:13) k ¯ x ( t ) − x k + 12 η t E (cid:2) a t − a t +1 |F t (cid:3) + 2 η t n n X i =1 E (cid:2) (cid:13)(cid:13) g i ( t ) (cid:13)(cid:13) |F t (cid:3) − α a t + 1 η t E (cid:2) h ¯ z ( t ) , ¯ x ( t ) − x i|F t (cid:3) , (23)where a t = k ¯ x ( t ) − x k .The first term in right hand side of (23) is bounded as follows (cid:13)(cid:13) E (cid:2) ¯ g ( t ) |F t (cid:3) − ∇ f (¯ x ( t )) (cid:13)(cid:13) k ¯ x ( t ) − x k ≤ (cid:20) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) E (cid:2) ¯ g ( t ) |F t (cid:3) − n n X i =1 ∇ f i ( x i ( t )) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) + (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n n X i =1 ∇ f i ( x i ( t )) − n n X i =1 ∇ f i (¯ x ( t )) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) (cid:21) k ¯ x ( t ) − x k≤ κ β L √ dh β − t k ¯ x ( t ) − x k + ¯ Ln n X i =1 (cid:13)(cid:13) x i ( t ) − ¯ x ( t ) (cid:13)(cid:13) k ¯ x ( t ) − x k , (24)where the last inequality is due to Lemma 4.1 and Assumption 3.4(ii). We now decouple the termsin (24) using the fact that ab ≤ a v + vb , ∀ a, b ≥ , v > . Thus, we obtain κ β L √ dh β − t k ¯ x ( t ) − x k ≤ ( κ β L ) α dh β − t + α k ¯ x ( t ) − x k (25)and ¯ Ln n X i =1 (cid:13)(cid:13) x i ( t ) − ¯ x ( t ) (cid:13)(cid:13) k ¯ x ( t ) − x k ≤ ¯ Ltα (1 − ρ ) n n X i =1 (cid:13)(cid:13) x i ( t ) − ¯ x (cid:13)(cid:13) + ¯ L K tα (1 − ρ ) . (26)Combining (25) and (26) with (24) gives (cid:13)(cid:13) E (cid:2) ¯ g ( t ) |F t (cid:3) − ∇ f (¯ x ( t )) (cid:13)(cid:13) k ¯ x ( t ) − x k ≤ ( κ β L ) α dh β − t + α k ¯ x ( t ) − x k ++ ¯ Ltα (1 − ρ ) n n X i =1 (cid:13)(cid:13) x i ( t ) − ¯ x ( t ) (cid:13)(cid:13) + ¯ L K tα (1 − ρ ) . (27)17ext, we have η t h ¯ z ( t ) , ¯ x ( t ) − x i = 1 nη t n X i =1 h z i ( t ) , ¯ x ( t ) − x i≤ nη t n X i =1 h z i ( t ) , ¯ x ( t ) − (cid:0) x i ( t ) − η t g i ( t ) (cid:1) i + h z i ( t ) , (cid:0) x i ( t ) − η t g i ( t ) (cid:1) − x i . (28)Since Proj Θ ( · ) is the Euclidean projection on the convex set Θ , for any w ∈ R d , x ∈ Θ we have h Proj Θ ( w ) − w, Proj Θ ( w ) − x i ≤ , which implies h Proj Θ ( w ) − w, w − x i = − k Proj Θ ( w ) − w k + h Proj Θ ( w ) − w, Proj Θ ( w ) − x i ≤ . Therefore, h z i ( t ) , x i − η t g i ( t ) − x i = h Proj Θ ( x i ( t ) − η t g i ( t )) − ( x i ( t ) − η t g i ( t )) , x i ( t ) − η t g i ( t ) − x i ≤ . Applying this inequality in (28) and using (19) we find η t h ¯ z ( t ) , ¯ x ( t ) − x i ≤ nη t n X i =1 h z i ( t ) , (cid:0) ¯ x ( t ) − x i ( t ) (cid:1) + η t g i ( t ) i≤ nη t n X i =1 (cid:13)(cid:13) z i ( t ) (cid:13)(cid:13) (cid:13)(cid:13) x i ( t ) − ¯ x ( t ) (cid:13)(cid:13) + 1 n n X i =1 (cid:13)(cid:13) z i ( t ) (cid:13)(cid:13) (cid:13)(cid:13) g i ( t ) (cid:13)(cid:13) ≤ nη t n X i =1 h η t (cid:13)(cid:13) g i ( t ) (cid:13)(cid:13) − ρ + (1 − ρ ) (cid:13)(cid:13) x i − ¯ x ( t ) (cid:13)(cid:13) i + η t n n X i =1 (cid:13)(cid:13) g i ( t ) (cid:13)(cid:13) ≤ η t − ρ ) n n X i =1 (cid:13)(cid:13) g i ( t ) (cid:13)(cid:13) + 1 − ρ nη t n X i =1 (cid:13)(cid:13) x i ( t ) − ¯ x ( t ) (cid:13)(cid:13) . (29)Inserting (29) and (27) in (23) and using the fact that η t = αt we get E [ f (¯ x ( t )) − f ( x ) |F t ] ≤ η t E [ a t − a t +1 |F t ] − α a t + (1 + 4 ¯ L ) tα (1 − ρ )4 n n X i =1 (cid:13)(cid:13) x i − ¯ x ( t ) (cid:13)(cid:13) ++ 7 η t − ρ ) n n X i =1 E [ (cid:13)(cid:13) g i ( t ) (cid:13)(cid:13) |F t ] + ( κ β L ) α dh β − t + ¯ L K tα (1 − ρ ) . where the last inequality follows from. Taking the expectations, setting r t := E [ a t ] and applyingLemma 4.2 we get E (cid:2) f (¯ x ( t )) − f ( x ) (cid:3) ≤ r t − r t +1 η t − α r t + (1 + 4 ¯ L ) tα (1 − ρ )4 ∆( t )++ 7 α (1 − ρ ) t (cid:16) κG + d (cid:16) h t κ ¯ L κσ h t (cid:17)(cid:17) + ( κ β L ) α dh β − t + ¯ L K tα (1 − ρ ) . Notice that since η t = αt we have T X t =1 (cid:18) r t − r t +1 η t − α r t (cid:19) ≤ . h t = t − β and summing over t we get T X t =1 E (cid:2) f (¯ x ( t )) − f ( x ) (cid:3) ≤ (1 + 4 ¯ L ) α (1 − ρ )4 T X t =1 t ∆( t ) + B dα T − β + ¯ L K α (1 − ρ ) (cid:0) log( T ) + 1 (cid:1) , where B = 7 β (cid:16) κG d + ( κ ¯ L + κσ ) (cid:17) + β ( κ β L ) . Finally, using Lemma 6.1 we obtain T X t =1 E (cid:2) f (¯ x ( t )) − f ( x ) (cid:3) ≤ B dα T /β + B ρ − ρ dα T /β + B α (1 − ρ ) (cid:0) log( T ) + 1 (cid:1) , where B = β (1+4¯ L )4 A , and B = ¯ L K . This proves the first part of the theorem. The second partfollows from the first one by the convexity of f . C Proofs for Section 7
We first restate the following three lemmas from [2].
Lemma C.1.
Let for β = 2 , Assumptions 3.4 and 7.1 hold. Let ¯ g ( t ) be the average of gradientestimators for n agents defined each by (13) , and h = h t . If max x ∈ Θ k∇ f i ( x ) k ≤ G , for ≤ i ≤ n ,then E [ k ¯ g ( t ) k ] ≤ κ (cid:16) G d + L d h t (cid:17) + 3 κd σ h t . Introduce the notation ˆ f t ( x ) = E f ( x + h t ˜ ζ ) , ∀ x ∈ R d , and ˆ f it ( x ) = E f i ( x + h t ˜ ζ ) , ∀ x ∈ R d . Lemma C.2.
Suppose f i is differentiable. For the conditional expectation given F t , we have E [ g i ( t ) |F t ] = ∇ ˆ f it ( x i ( t )) . Lemma C.3. If f is α -strongly convex then ˆ f t is α -strongly convex. If f ∈ F ( L ) , for any x ∈ R d and h t > , we have | ˆ f t ( x ) − f ( x ) | ≤ Lh t , and | E f ( x ± h t ζ t ) − f ( x ) | ≤ Lh t . Lemma C.4.
Let Assumptions 2.1, 3.4, and 7.1 hold with β = 2 . Let Θ be a convex compact subsetof R d , and assume that diam (Θ) ≤ K . Assume that max x ∈ Θ k∇ f i ( x ) k ≤ G , for ≤ i ≤ n . Letthe updates x i ( t ) , ¯ x ( t ) be defined by Algorithm 1, in which the gradient estimator for i -th agent isdefined by (13) , and η t = αt , h t = (cid:16) d σ Lαt +9 L d (cid:17) / . Then ∆( t ) ≤ (cid:16) ρ − ρ (cid:17) (cid:16) A ′ dα / t − + A ′ d α t − (cid:17) , where A ′ and A ′ are positive constants independent of T, d, α, n, ρ . roof. Similarly to Lemma 6.1 we obtain E [ V ( t + 1) |F t ] ≤ ρ (1 + 2 λ ) V ( t ) + ρ (4 + 2 λ ) η t n X i =1 E [ (cid:13)(cid:13) g i ( t ) (cid:13)(cid:13) |F t ] . Choosing λ = − ρ ρ and using Lemma C.1 we get E [ V ( t + 1) |F t ] ≤ ρV ( t ) + 4 ρ − ρ η t (cid:16) G d + L d h t d σ h t (cid:17) . Taking here the expectations and setting η t = αt and h t = (cid:16) d σ Lαt +9 L d (cid:17) / yields ∆( t + 1) ≤ ρ ∆( t ) + ρ − ρ (cid:16) A ′ dα / t / + A ′ d α t (cid:17) with A ′ = 2 √ Lσ , and A ′ = 12 √ Lσ + G d . On the other hand, by recursion we have ∆( t + 1) ≤ ρ t ∆(1) + ρ − ρ dα / A ′ t X s =1 s − ρ t − s + A ′ dα / + t X s =1 s − ρ t − s (cid:17) . Here ∆(1) = 0 due to the initialization. The sums on right hand side can be estimated by using anargument, which is quite analogous to what was done in the proof of Lemma 6.1, after equation (22),leading to the result of the lemma.
Lemma C.5.
Let the assumptions of Lemma C.4 hold and let f be an α -strongly convex function.Then E [ k ¯ x ( t ) − x ∗ k ] ≤ C − ρ (cid:18) dt / α / + d tα (cid:19) , where C > is a constant independent of T, d, α, n, ρ . Proof.
First note that due to the strong convexity assumption we have k ¯ x (1) − x ∗ k ≤ G α . Therefore, for t = 1 the result holds. For t ≥ , by the definition of the algorithm we have k ¯ x ( t + 1) − x ∗ k ≤ k ¯ x ( t ) − x ∗ k + η t k ¯ g ( t ) k + k ¯ z ( t ) k − η t h ¯ g ( t ) , ¯ z ( t ) i−− η t h ¯ g ( t ) , ¯ x ( t ) − x ∗ i + 2 h ¯ x ( t ) − x ∗ , ¯ z ( t ) i . Taking conditional expectations we get E [ a t +1 |F t ] ≤ a t + 2 η t n n X i =1 E [ (cid:13)(cid:13) g i ( t ) (cid:13)(cid:13) |F t ] − η t E [ h ¯ g ( t ) , ¯ z ( t ) i|F t ] − (30) − η t E [ h ¯ g ( t ) , ¯ x ( t ) − x ∗ i|F t ] + 2 E [ h ¯ x ( t ) − x ∗ , ¯ z ( t ) i|F t ] , (31)where we used the fact that (cid:13)(cid:13) z i ( t ) (cid:13)(cid:13) ≤ η t (cid:13)(cid:13) g i ( t ) (cid:13)(cid:13) for ≤ i ≤ n .For the term − η t E [ h ¯ g ( t ) , ¯ x ( t ) − x ∗ i|F t ] in (30), we have − η t E [ h ¯ g ( t ) , ¯ x ( t ) − x ∗ i|F t ] ≤ − η t n n X i =1 (cid:16) E [ h g i ( t ) − ∇ ˆ f it ( x i ( t )) , ¯ x ( t ) − x ∗ i|F t ]+ (32) + h∇ ˆ f it ( x i ( t )) − ∇ ˆ f it (¯ x ( t )) , ¯ x ( t ) − x ∗ i + (33) + h∇ ˆ f t (¯ x ( t )) , ¯ x ( t ) − x ∗ i (cid:17) (34)20or the term in (32), by Lemma C.2 we have − η t n n X i =1 E [ h g i ( t ) − ∇ ˆ f it ( x i ( t )) , ¯ x ( t ) − x ∗ i|F t ] = 0 . For the term in (33), decoupling yields − η t n n X i =1 h∇ ˆ f it ( x i ( t )) − ∇ ˆ f it (¯ x ( t )) , ¯ x ( t ) − x ∗ i ≤ η t tαn (1 − ρ ) V ( t ) + ¯ L η t tα − ρ a t . Next, we use the strong convexity (cf. Lemma C.3) to handle (34): − η t h∇ ˆ f t (¯ x ( t )) , ¯ x ( t ) − x ∗ i ≤ − η t αa t . Finally, for the term containing h ¯ x ( t ) − x ∗ , ¯ z ( t ) i in (31) we obtain similarly to (29) that E [ h ¯ x ( t ) − x ∗ , ¯ z ( t ) i|F t ] ≤ η t (1 − ρ ) n n X i =1 E [ (cid:13)(cid:13) g i ( t ) (cid:13)(cid:13) |F t ] + 1 − ρn V ( t ) . Combining the above inequalities yields E [ a t +1 |F t ] ≤ (1 − η t α ) a t + 2 η t n n X i =1 E [ k ¯ g ( t ) k |F t ] − η t E [ h ¯ g ( t ) , ¯ z ( t ) i|F t ] + η t ¯ L K tα (1 − ρ ) ++ η t tα + 1 n (1 − ρ ) V ( t ) + 3 η t (1 − ρ ) n n X i =1 E [ (cid:13)(cid:13) g i ( t ) (cid:13)(cid:13) |F t ] . Now, recalling that η t = tα , h t = (cid:16) d σ Lαt +9 L d (cid:17) / , taking the expectations and applying LemmaC.1 we find r t +1 ≤ (cid:16) − t (cid:17) r t + 2(1 − ρ )∆( t ) + C (1 − ρ ) (cid:16) dt / α / + d t α (cid:17) , (35)where r t = E [ a t ] , and C > is a constant independent of T, d, α, n, ρ . Using Lemma C.4 to bound ∆( t ) in (35) we get r t +1 ≤ (cid:16) − t (cid:17) r t + C ′ (1 − ρ ) (cid:16) dt / α / + d t α (cid:17) , where C ′ > is a constant independent of T, d, α, n, ρ . The desired result follows from this recursionby applying [2, Lemma D.1].
Theorem 7.2.
Let f be an α -strongly convex function. Let Assumptions 2.1, 3.4, and 7.1 hold with β = 2 . Let Θ be a convex compact subset of R d , and assume that diam (Θ) ≤ K . Assume that max x ∈ Θ k∇ f i ( x ) k ≤ G , for ≤ i ≤ n . Let the updates x i ( t ) , ¯ x ( t ) be defined by Algorithm 1, inwhich the gradient estimator for i -th agent is defined by (13) , and η t = αt , h t = (cid:16) d σ Lαt +9 L d (cid:17) / .Then for the estimator ˜ x ( T ) = T −⌊ T/ ⌋ P Tt = ⌊ T/ ⌋ +1 ¯ x ( t ) we have E [ f (˜ x ( T )) − f ( x ∗ )] ≤ B − ρ (cid:18) d √ αT + d αT (cid:19) , where B > is a constant independent of T, d, α, n, ρ . roof. Fix x ∈ Θ . Due to the α -strong convexity of ˆ f t , we have ˆ f t (¯ x ( t )) − ˆ f t ( x ∗ ) ≤ h∇ ˆ f t (¯ x ( t )) , ¯ x ( t ) − x ∗ i − α k ¯ x ( t ) − x ∗ k . Thus, by Lemma C.3 we get f (¯ x ( t )) − f ( x ∗ ) ≤ Lh t + h∇ ˆ f t (¯ x ( t )) , ¯ x ( t ) − x ∗ i − α k ¯ x ( t ) − x ∗ k . Let a t = k ¯ x ( t ) − x ∗ k . Taking conditional expectations and applying Lemma C.2 we obtain E [ f (¯ x ( t )) − f ( x ∗ ) |F t ] ≤ Lh t + 1 n n X i =1 h∇ ˆ f it (¯ x ( t )) − ∇ ˆ f it ( x i ( t )) , ¯ x ( t ) − x ∗ i − α a t + E [ h ¯ g ( t ) , ¯ x ( t ) − x ∗ i|F t ] ≤ Lh t + 1 n n X i =1 E [ h∇ ˆ f it (¯ x ( t )) − ∇ ˆ f it ( x i ( t )) , ¯ x ( t ) − x ∗ i|F t ] − α a t + a t − E [ a t +1 |F t ]2 η t + 1 η t E [ h ¯ z ( t ) , ¯ x ( t ) − x ∗ i|F t ] + 2 η t n n X i =1 E [ (cid:13)(cid:13) g i ( t ) (cid:13)(cid:13) |F t ] , (36)where the last inequality uses the definition of the algorithm. Now, by decoupling we find n n X i =1 h∇ ˆ f it (¯ x ( t )) − ∇ ˆ f it ( x i ( t )) , ¯ x ( t ) − x ∗ i ≤ tα n (1 − ρ ) V ( t ) + 12(1 − ρ ) ¯ L tα K , (37)while similarly to (29) we also have η t E [ h ¯ z ( t ) , ¯ x ( t ) − x ∗ i|F t ] ≤ − ρ η t n n X i =1 E [ (cid:13)(cid:13) g i ( t ) (cid:13)(cid:13) |F t ] + (1 − ρ ) 12 nη t V ( t ) . (38)Combining the above inequalities and applying Lemma C.1 yields E [ f (¯ x ( t )) − f ( x ∗ ) |F t ] ≤ (cid:16) η t + tα (cid:17) − ρ n V ( t ) + 12(1 − ρ ) ¯ L tα K − α a t + a t − E [ a t +1 |F t ]2 η t ++ 2 Lh t + (cid:18) − ρ ) (cid:19) η t n n X i =1 E [ (cid:13)(cid:13) g i ( t ) (cid:13)(cid:13) |F t ] . (39)Let r t = E [ a t ] . Using the fact that η t = αt , h t = (cid:16) d σ Lαt +9 L d (cid:17) / , taking the expectations in (39)and applying Lemma C.1 we find E [ f (¯ x ( t )) − f ( x ∗ )] ≤ tα (cid:16) r t − r t +1 (cid:17) − α r t + (1 − ρ ) αt ∆( t ) + C − ρ (cid:16) d √ αt + d αt (cid:17) , where C > is a constant independent of T, d, α, n, ρ . Summing up both sides over t gives T X t = ⌊ T ⌋ +1 E [ f (¯ x ( t )) − f ( x ∗ )] ≤ r ⌊ T ⌋ +1 ⌊ T ⌋ α − ρ ) α T X t = ⌊ T ⌋ +1 t ∆( t ) + C − ρ (cid:16) d √ T √ α + d α (cid:17) C > is a constant independent of T, d, α, n, ρ . We now apply Lemma C.4 to bound ∆( t ) and Lemma C.5 to bound r ⌊ T ⌋ +1 . It follows that T X t = ⌊ T ⌋ +1 E [ f (¯ x ( t )) − f ( x ∗ )] ≤ C − ρ (cid:16) d √ T √ α + d α (cid:17) , where C > is a constant independent of T, d, α, n, ρ . The desired bound for E [ f (˜ x ( T )) − f ( x ∗ )] follows from this inequality by the convexity of ff