Faster Gradient-Free Proximal Stochastic Methods for Nonconvex Nonsmooth Optimization
Feihu Huang, Bin Gu, Zhouyuan Huo, Songcan Chen∗, Heng Huang
College of Computer Science & Technology, Nanjing University of Aeronautics and Astronautics, Nanjing, 211106, China
Department of Electrical & Computer Engineering, University of Pittsburgh, PA 15261, USA
[email protected], [email protected], [email protected], [email protected], [email protected]
Abstract
The proximal gradient method plays an important role in solving many machine learning tasks, especially nonsmooth problems. However, in some machine learning problems such as the bandit model and the black-box learning problem, the proximal gradient method may fail because the explicit gradients of these problems are difficult or infeasible to obtain. The gradient-free (zeroth-order) method can address these problems because only the objective function values are required in the optimization. Recently, the first zeroth-order proximal stochastic algorithm was proposed to solve nonconvex nonsmooth problems. However, its convergence rate is O(1/√T) for nonconvex problems, which is significantly slower than the best convergence rate O(1/T) of zeroth-order stochastic algorithms, where T is the iteration number. To fill this gap, in this paper we propose a class of faster zeroth-order proximal stochastic methods with the variance reduction techniques of SVRG and SAGA, which are denoted as ZO-ProxSVRG and ZO-ProxSAGA, respectively. In the theoretical analysis, we address the main challenge that an unbiased estimate of the true gradient does not hold in the zeroth-order case, which was required in previous theoretical analyses of both SVRG and SAGA. Moreover, we prove that both the ZO-ProxSVRG and ZO-ProxSAGA algorithms have O(1/T) convergence rates. Finally, the experimental results verify that our algorithms have a faster convergence rate than the existing zeroth-order proximal stochastic algorithm.

Introduction
Proximal gradient (PG) methods (Mine and Fukushima, 1981; Nesterov, 2004; Parikh, Boyd, and others, 2014) are a class of powerful optimization tools in artificial intelligence and machine learning. In general, they consider the following nonsmooth optimization problem:

min_{x ∈ R^d} f(x) + ψ(x), (1)

where f(x) usually is a loss function such as the hinge loss or logistic loss, and ψ(x) is a nonsmooth structured regularizer such as the ℓ1-norm regularization. In recent research, Beck and Teboulle (2009) and Nesterov (2013) proposed accelerated PG methods to solve convex problems by using Nesterov's acceleration technique. After that, Li and Lin (2015) presented a class of accelerated PG methods for nonconvex optimization. More recently, Gu, Huo, and Huang (2018) introduced inexact PG methods for nonconvex nonsmooth optimization. To solve big data problems, incremental or stochastic PG methods (Bertsekas, 2011; Xiao and Zhang, 2014) were developed for large-scale convex optimization. Correspondingly, Ghadimi, Lan, and Zhang (2016) and Reddi et al. (2016) proposed stochastic PG methods for large-scale nonconvex optimization.

However, in many machine learning problems, the explicit expressions of gradients are difficult or infeasible to obtain. For example, in some complex graphical model inference (Wainwright, Jordan, and others, 2008) and structured prediction problems (Sokolov, Hitschler, and Riezler, 2018), it is difficult to compute the explicit gradients of the objective functions. Even worse, in bandit (Shamir, 2017) and black-box learning (Chen et al., 2017) problems, only the objective function values are available (the explicit gradients cannot be calculated). Clearly, the above PG methods fail in these scenarios.

∗ Corresponding Author.
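To make the PG update for problem (1) concrete, here is a small sketch (ours, not from the paper): for ψ(x) = λ‖x‖₁ the proximal operator has the well-known closed-form soft-thresholding solution, and the PG iteration alternates a gradient step on f with that prox step. The quadratic f, the step size, and all names below are illustrative assumptions.

```python
import numpy as np

def prox_l1(x, t):
    """Proximal operator of t*||.||_1, i.e. coordinate-wise soft-thresholding."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def proximal_gradient(grad_f, x0, eta, lam, iters=200):
    """Plain proximal gradient descent for f(x) + lam*||x||_1."""
    x = x0.copy()
    for _ in range(iters):
        # gradient step on the smooth part, then prox step on the nonsmooth part
        x = prox_l1(x - eta * grad_f(x), eta * lam)
    return x

# Toy smooth part: f(x) = 0.5*||x - b||^2, whose gradient is x - b.
b = np.array([3.0, -0.5, 0.05])
x_star = proximal_gradient(lambda x: x - b, np.zeros(3), eta=0.5, lam=0.1)
# For this f, the minimizer of f + 0.1*||.||_1 is the soft-thresholded b,
# so the small entry b[2] = 0.05 is driven exactly to zero (sparsity).
```

Note that the whole update uses the explicit gradient of f; the zeroth-order methods discussed next replace exactly that ingredient with function-value queries.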
The gradient-free (zeroth-order) optimization method (Nesterov and Spokoiny, 2017) is a promising choice to address these problems because it only uses function values in the optimization process. Thus, gradient-free optimization methods have been increasingly embraced for solving many machine learning problems (Conn, Scheinberg, and Vicente, 2009).

Although many gradient-free methods have recently been developed and studied (Agarwal, Dekel, and Xiao, 2010; Nesterov and Spokoiny, 2017; Liu et al., 2018b), they often suffer from the high variances of zeroth-order gradient estimates. In addition, these algorithms are mainly designed for smooth or convex settings, as discussed in the related works below, thus limiting their applicability to the wide range of nonconvex nonsmooth machine learning problems involving nonconvex loss functions and nonsmooth regularization.

Thus, in this paper, we propose a class of faster gradient-free proximal stochastic methods for solving the following nonconvex nonsmooth problem:

min_{x ∈ R^d} F(x) := f(x) + ψ(x), f(x) := (1/n) Σ_{i=1}^n f_i(x), (2)

Table 1: Comparison of representative zeroth-order stochastic algorithms for finding an ε-approximate stationary point of a nonconvex problem, i.e., E‖∇f(x)‖² ≤ ε or E‖g_η(x)‖² ≤ ε. (S, NS, C and NC are the abbreviations of smooth, nonsmooth, convex and nonconvex, respectively. T is the total iteration number, d is the dimension of the data, n denotes the sample size, and B (≤ n) is a mini-batch size.)

Algorithm | Reference | Gradient estimator | Problem | Convergence rate
RSGF | Ghadimi and Lan (2013) | GauSGE | S(NC) | O(√(d/T))
ZO-SVRG | Liu et al. (2018c) | CooSGE | S(NC) | O(d/T)
SZVR-G | Liu et al. (2018a) | GauSGE | S(NC) | O(max(d B, d B)/T)
SZVR-G | Liu et al. (2018a) | GauSGE | NS(NC) | O(d √B /T)
RSPGF | Ghadimi, Lan, and Zhang (2016) | GauSGE | S(NC) + NS(C) | O(√(d/T))
ZO-ProxSVRG | Ours | CooSGE | S(NC) + NS(C) | O(d/T)
ZO-ProxSVRG | Ours | GauSGE | S(NC) + NS(C) | O(d/T + dσ²)
ZO-ProxSAGA | Ours | CooSGE | S(NC) + NS(C) | O(d/T)
ZO-ProxSAGA | Ours | GauSGE | S(NC) + NS(C) | O(d/T + dσ²)

where each f_i(x) is a nonconvex and smooth loss function, and ψ(x) is a convex and nonsmooth regularization term. Until now, there are few zeroth-order stochastic methods for solving problem (2), except a recent attempt proposed in (Ghadimi, Lan, and Zhang, 2016). Specifically, Ghadimi, Lan, and Zhang (2016) proposed a randomized stochastic projected gradient-free method (RSPGF), i.e., a zeroth-order proximal stochastic gradient method. However, due to the large variance of the zeroth-order estimated gradient, generated by randomly selecting both the sample and the direction of the derivative, RSPGF only has a convergence rate of O(1/√T), which is significantly slower than O(1/T), the best convergence rate of zeroth-order stochastic algorithms. To accelerate the RSPGF algorithm, we use the variance reduction strategies from first-order methods, i.e., SVRG (Xiao and Zhang, 2014) and SAGA (Defazio, Bach, and Lacoste-Julien, 2014), to reduce the variance of the estimated stochastic gradient.

Although SVRG and SAGA have shown good performance, applying these strategies to the zeroth-order method is not a trivial task. The main challenge arises because both SVRG and SAGA rely on the assumption that a stochastic gradient is an unbiased estimate of the true full gradient, which does not hold in zeroth-order algorithms. In this paper, we thus fill this gap between the zeroth-order proximal stochastic method and the classic variance reduction approaches (SVRG and SAGA).

Main Contributions
In summary, our main contributions are as follows:
• We propose a class of faster gradient-free proximal stochastic methods (ZO-ProxSVRG and ZO-ProxSAGA), based on the variance reduction techniques of SVRG and SAGA. Our new algorithms only use the objective function values in the optimization process.
• Moreover, we provide a theoretical analysis of the convergence properties of both the new ZO-ProxSVRG and ZO-ProxSAGA methods. Table 1 shows the specific convergence rates of the proposed algorithms and other related ones. In particular, our algorithms have a faster convergence rate O(1/T) than the O(1/√T) rate of RSPGF (Ghadimi, Lan, and Zhang, 2016), the existing stochastic PG algorithm for solving nonconvex nonsmooth problems.
• Extensive experimental results and theoretical analysis demonstrate the effectiveness of our algorithms.
Related Works
Gradient-free (zeroth-order) methods have been effectively used to solve many machine learning problems where the explicit gradient is difficult or infeasible to obtain, and have also been widely studied. For example, Nesterov and Spokoiny (2017) proposed several random gradient-free methods by using the Gaussian smoothing technique. Duchi et al. (2015) proposed a zeroth-order mirror descent algorithm. More recently, Yu et al. (2018) and Dvurechensky, Gasnikov, and Gorbunov (2018) presented accelerated zeroth-order methods for convex optimization. To solve nonsmooth problems, zeroth-order online or stochastic ADMM methods (Liu et al., 2018b; Gao, Jiang, and Zhang, 2018) have been introduced.

The above zeroth-order methods mainly focus on (strongly) convex problems. In fact, there exist many nonconvex machine learning tasks whose explicit gradients are not available, such as nonconvex black-box learning problems (Chen et al., 2017; Liu et al., 2018c). Thus, several recent works have begun to study zeroth-order stochastic methods for nonconvex optimization. For example, Ghadimi and Lan (2013) proposed the randomized stochastic gradient-free (RSGF) method, i.e., a zeroth-order stochastic gradient method. To accelerate optimization, more recently, Liu et al. (2018c,a) proposed zeroth-order stochastic variance reduction gradient (ZO-SVRG) methods. Moreover, to solve large-scale machine learning problems, some asynchronous parallel stochastic zeroth-order algorithms have been proposed in (Gu, Huo, and Huang, 2016; Lian et al., 2016; Gu et al., 2018).

Although the above zeroth-order stochastic methods can effectively solve nonconvex optimization problems, there are few zeroth-order stochastic methods for nonconvex nonsmooth composite optimization, except the RSPGF method presented in (Ghadimi, Lan, and Zhang, 2016). In addition, Liu et al. (2018a) have also studied a zeroth-order algorithm for solving a nonconvex nonsmooth problem, which is different from problem (2).
Zeroth-Order Proximal Stochastic Method Revisited
In this section, we briefly review the zeroth-order proximal stochastic gradient (ZO-ProxSGD) method for solving problem (2). Before that, we first revisit the proximal gradient descent (ProxGD) method (Mine and Fukushima, 1981). ProxGD is an effective method to solve problem (2) via the following iteration:

x_{t+1} = Prox_{ηψ}(x_t − η∇f(x_t)), t = 0, 1, ···, (3)

where η > 0 is a step size, and Prox_{ηψ}(·) is the proximal operator defined as:

Prox_{ηψ}(x) = argmin_{y ∈ R^d} { ψ(y) + (1/(2η))‖y − x‖² }. (4)

As discussed above, because ProxGD needs to compute the gradient at each iteration, it cannot be applied to problems where the explicit gradient of the function f(x) is not available. For example, in a black-box machine learning model, only function values (e.g., prediction results) are available (Chen et al., 2017). To avoid computing the explicit gradient, we use zeroth-order gradient estimators (Nesterov and Spokoiny, 2017; Liu et al., 2018c) to estimate the gradient only from function values. • Specifically, we use the
Gaussian Smoothing Gradient Estimator (GauSGE) (Nesterov and Spokoiny, 2017; Ghadimi, Lan, and Zhang, 2016) to estimate the gradients as follows:

∇̂f_i(x) = ((f_i(x + μu_i) − f_i(x))/μ) u_i, i ∈ [n], (5)

where μ > 0 is a smoothing parameter, and {u_i}_{i=1}^n denote i.i.d. random directions drawn from a zero-mean isotropic multivariate Gaussian distribution N(0, I). • Moreover, to obtain a better estimated gradient, we can use the
Coordinate Smoothing Gradient Estimator (CooSGE) (Gu, Huo, and Huang, 2016; Gu et al., 2018; Liu et al., 2018c) to estimate the gradients as follows:

∇̂f_i(x) = Σ_{j=1}^d ((f_i(x + μ_j e_j) − f_i(x − μ_j e_j))/(2μ_j)) e_j, i ∈ [n], (6)

where μ_j > 0 is a coordinate-wise smoothing parameter, and e_j is the standard basis vector with 1 at its j-th coordinate and 0 otherwise. Although the CooSGE needs more function queries than the GauSGE, it yields a better estimated gradient, and can even enable the algorithms to obtain a faster convergence rate.

Finally, based on these estimated gradients, we give a zeroth-order proximal gradient descent (ZO-ProxGD) method, which performs the following iteration:

x_{t+1} = Prox_{ηψ}(x_t − η∇̂f(x_t)), t = 0, 1, ···, (7)

where ∇̂f(x) = (1/n) Σ_{i=1}^n ∇̂f_i(x). Since ZO-ProxGD needs to estimate the full gradient ∇̂f(x) = (1/n) Σ_{i=1}^n ∇̂f_i(x), its high cost per iteration is prohibitive when n is large in problem (2). As a result, Ghadimi, Lan, and Zhang (2016) proposed the RSPGF (i.e., ZO-ProxSGD), which performs the following iteration:

x_{t+1} = Prox_{ηψ}(x_t − η∇̂f_{I_t}(x_t)), t = 0, 1, ···, (8)

where ∇̂f_{I_t}(x_t) = (1/b) Σ_{i∈I_t} ∇̂f_i(x_t), I_t ⊆ {1, 2, ···, n} and b = |I_t| is the mini-batch size.

New Faster Zeroth-Order Proximal Stochastic Methods
In this section, to efficiently solve large-scale nonconvex nonsmooth problems, we propose a class of faster zeroth-order proximal stochastic methods with the variance reduction (VR) techniques of SVRG and SAGA, respectively.
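Before specializing to the SVRG/SAGA variants, the two estimators from the previous section can be compared numerically. The following is our own sketch of Eq. (5) and Eq. (6) on a toy quadratic (the test function, point, and sample count are illustrative assumptions, not from the paper): the CooSGE is deterministic and near-exact for small μ, while a single GauSGE draw is very noisy and only its average over many draws recovers the gradient of the smoothed function.

```python
import numpy as np

def gausge(f, x, mu, rng):
    """One-sample Gaussian smoothing gradient estimator, Eq. (5)."""
    u = rng.standard_normal(x.shape)
    return (f(x + mu * u) - f(x)) / mu * u

def coosge(f, x, mu):
    """Coordinate smoothing gradient estimator, Eq. (6): 2d queries, deterministic."""
    g = np.zeros_like(x)
    for j in range(x.size):
        e = np.zeros_like(x)
        e[j] = mu                        # perturb only coordinate j
        g[j] = (f(x + e) - f(x - e)) / (2.0 * mu)
    return g

f = lambda v: 0.5 * np.dot(v, v)         # gradient of f at v is v itself
x = np.array([1.0, -2.0, 0.5])
rng = np.random.default_rng(0)

coo = coosge(f, x, mu=1e-5)              # central differences: near-exact here
gau_avg = np.mean([gausge(f, x, 1e-5, rng) for _ in range(20000)], axis=0)
```

Averaging 20000 GauSGE draws brings the estimate close to the true gradient (1.0, −2.0, 0.5), but a single draw can be off by an O(‖x‖) amount; this high variance is precisely what motivates the variance-reduced methods below.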
ZO-ProxSVRG
In the subsection, we propose the zeroth-order proximalSVRG (ZO-ProxSVRG) method by using VR technique ofSVRG in (Xiao and Zhang, 2014; Reddi et al., 2016).The corresponding algorithmic framework is describedin Algorithm 1, where we use a mixture stochastic gra-dient ˆ v st = ˆ ∇ f I t ( x st ) − ˆ ∇ f I t (˜ x s ) + ˆ ∇ f (˜ x s ) . Note that E I t [ˆ v st ] = ˆ ∇ f ( x st ) (cid:54) = ∇ f ( x st ) , i.e. , this stochastic gradi-ent is a biased estimate of the true full gradient. Althoughthe SVRG has shown a great promise, it relies upon theassumption that the stochastic gradient is an unbiased es-timate of the true full gradient. Thus, adapting the similarideas of SVRG to zeroth-order optimization is not a trivialtask. To address this issue, we analyze the upper bound forthe variance of the estimated gradient ˆ v st , and choose the ap-propriate step size η and smoothing parameter µ to controlthis variance, which will be in detail discussed in the belowtheorems.Next, we derive the upper bounds for the variance of es-timated gradient ˆ v st based on the CooSGE and the GauSGE,respectively. Lemma 1.
In Algorithm 1 using the CooSGE, given the mixture estimated gradient v̂_t^s = ∇̂f_{I_t}(x_t^s) − ∇̂f_{I_t}(x̃^s) + ∇̂f(x̃^s), the following inequality holds:

E‖v̂_t^s − ∇f(x_t^s)‖² ≤ (δ_n L² d / b) E‖x_t^s − x̃^s‖² + L² d μ², (9)

where 0 ≤ δ_n ≤ 1. Remark 1.
Lemma 1 shows that the variance of v̂_t^s has an upper bound. As the number of iterations increases, both x_t^s and x̃^s approach the same stationary point x*, and then the variance of the stochastic gradient decreases, but does not vanish, due to the use of the zeroth-order estimated gradient.

Lemma 2. In Algorithm 1 using the GauSGE, given the estimated gradient v̂_t^s = ∇̂f_{I_t}(x_t^s) − ∇̂f_{I_t}(x̃^s) + ∇̂f(x̃^s), the following inequality holds:

E‖v̂_t^s − ∇f(x_t^s)‖² ≤ (2 + 12δ_n/b)(d + 6)³ L² μ² + (6δ_n L²/b) E‖x_t^s − x̃^s‖² + (4 + 24δ_n/b)(2d + 9) σ². (10)
Lemma 2 shows that the variance of v̂_t^s has an upper bound. As the number of iterations increases, both x_t^s and x̃^s approach the same stationary point x*, and then the variance of the stochastic gradient decreases.

Algorithm 1
ZO-ProxSVRG for Nonconvex Optimization
Input: mini-batch size b, epoch number S, epoch length m, and step size η > 0;
Initialize: x_0^1 = x̃^1 ∈ R^d;
for s = 1, 2, ···, S do
  ∇̂f(x̃^s) = (1/n) Σ_{i=1}^n ∇̂f_i(x̃^s);
  for t = 0, 1, ···, m − 1 do
    Uniformly randomly pick a mini-batch I_t ⊆ {1, 2, ···, n} such that |I_t| = b;
    Use (5) or (6) to estimate the mixture stochastic gradient v̂_t^s = ∇̂f_{I_t}(x_t^s) − ∇̂f_{I_t}(x̃^s) + ∇̂f(x̃^s);
    x_{t+1}^s = Prox_{ηψ}(x_t^s − η v̂_t^s);
  end for
  x̃^{s+1} = x_m^s and x_0^{s+1} = x_m^s;
end for
Output:
Iterate x chosen uniformly at random from {(x_t^s)_{t=1}^m}_{s=1}^S.

ZO-ProxSAGA
In the subsection, we propose the zeroth-order proximalSAGA (ZO-ProxSAGA) method via using VR technique ofSAGA in (Defazio, Bach, and Lacoste-Julien, 2014; Reddiet al., 2016).The corresponding algorithmic description is given in Al-gorithm 2, where we use a mixture stochatic gradient ˆ v t = b (cid:80) i t ∈I t (cid:0) ˆ ∇ f i t ( x t ) −∇ f i t ( z ti t ) (cid:1) + ˆ φ t . Similarly, E I t [ˆ v t ] =ˆ ∇ f ( x st ) (cid:54) = ∇ f ( x st ) , i.e. , this stochastic gradient is a biased estimate of the true full gradient. Note that in Algorithm 2,due to (cid:80) i t ∈I t ˆ ∇ f i t ( z t +1 i t ) = (cid:80) i t ∈I t ˆ ∇ f i t ( x t ) , the step 8can use directly the term (cid:80) i t ∈I t (cid:0) ˆ ∇ f i t ( x t ) − ˆ ∇ f i t ( z ti t ) (cid:1) ,which is computed in the step 5, to avoid unnecessary cal-culations. Next, we give the upper bounds for the varianceof stochastic gradient ˆ v t based on the CooSGE and theGauSGE, respectively. Lemma 3.
In Algorithm 2 using the CooSGE, given the estimated gradient v̂_t = (1/b) Σ_{i_t∈I_t} (∇̂f_{i_t}(x_t) − ∇̂f_{i_t}(z_{i_t}^t)) + φ̂_t with φ̂_t = (1/n) Σ_{i=1}^n ∇̂f_i(z_i^t), the following inequality holds:

E‖v̂_t − ∇f(x_t)‖² ≤ (L² d/(nb)) Σ_{i=1}^n E‖x_t − z_i^t‖² + L² d μ². (11)

Remark 3.
Lemma 3 shows that the variance of v̂_t has an upper bound. As the number of iterations increases, both x_t and {z_i^t}_{i=1}^n approach the same stationary point, and then the variance of the stochastic gradient decreases.

Lemma 4.
In Algorithm 2 using the GauSGE, given the estimated gradient v̂_t = (1/b) Σ_{i_t∈I_t} (∇̂f_{i_t}(x_t) − ∇̂f_{i_t}(z_{i_t}^t)) + φ̂_t with φ̂_t = (1/n) Σ_{i=1}^n ∇̂f_i(z_i^t), the following inequality holds:

E‖v̂_t − ∇f(x_t)‖² ≤ (2 + 12/b)(d + 6)³ L² μ² + (6L²/(nb)) Σ_{i=1}^n E‖x_t − z_i^t‖² + (4 + 24/b)(2d + 9) σ². (12)

Remark 4.
Lemma 4 shows that the variance of v̂_t has an upper bound. As the number of iterations increases, both x_t and {z_i^t}_{i=1}^n approach the same stationary point x*, and then the variance of the stochastic gradient decreases.

Algorithm 2
ZO-ProxSAGA for Nonconvex Optimization
Input: mini-batch size b, iteration number T, and step size η > 0;
Initialize: x_0 ∈ R^d, z_i^0 = x_0 for i ∈ {1, 2, ···, n}, and φ̂_0 = (1/n) Σ_{i=1}^n ∇̂f_i(z_i^0);
for t = 0, 1, ···, T − 1 do
  Uniformly randomly pick a mini-batch I_t ⊆ {1, 2, ···, n} (with replacement) such that |I_t| = b;
  Use (5) or (6) to estimate the mixture stochastic gradient v̂_t = (1/b) Σ_{i_t∈I_t} (∇̂f_{i_t}(x_t) − ∇̂f_{i_t}(z_{i_t}^t)) + φ̂_t;
  x_{t+1} = Prox_{ηψ}(x_t − η v̂_t);
  z_{i_t}^{t+1} = x_t for i_t ∈ I_t and z_i^{t+1} = z_i^t for i ∉ I_t;
  φ̂_{t+1} = φ̂_t − (1/n) Σ_{i_t∈I_t} (∇̂f_{i_t}(z_{i_t}^t) − ∇̂f_{i_t}(z_{i_t}^{t+1}));
end for
Output:
Iterate x chosen uniformly at random from {x_t}_{t=1}^T.

Convergence Analysis
In this section, we conduct the convergence analysis of both ZO-ProxSVRG and ZO-ProxSAGA. First, we give some mild assumptions regarding problem (2) as follows:
Assumption 1.
For any i ∈ {1, 2, ···, n}, the gradient of the function f_i is Lipschitz continuous with a Lipschitz constant L > 0, such that

‖∇f_i(x) − ∇f_i(y)‖ ≤ L‖x − y‖, ∀x, y ∈ R^d,

which implies f_i(x) ≤ f_i(y) + ∇f_i(y)^T(x − y) + (L/2)‖x − y‖².

Assumption 2.
The gradient is bounded as ‖∇f_i(x)‖ ≤ σ for all i = 1, 2, ···, n.

The first assumption is standard in the convergence analysis of zeroth-order algorithms (Ghadimi, Lan, and Zhang, 2016; Nesterov and Spokoiny, 2017; Liu et al., 2018c). The second assumption gives the bounded gradient used in (Nesterov and Spokoiny, 2017; Liu et al., 2018b), which is stricter than the bounded gradient variance used in (Lian et al., 2016; Liu et al., 2018c,a), because we need to analyze the more complex problem (2), which includes a nonsmooth part. Next, we introduce the standard gradient mapping (Parikh, Boyd, and others, 2014) used in the convergence analysis:

g_η(x) = (1/η)(x − Prox_{ηψ}(x − η∇f(x))). (13)

For nonconvex problems, if g_η(x) = 0, the point x is a critical point (Parikh, Boyd, and others, 2014). Thus, we can use the following definition as the convergence metric.

Definition 1. (Reddi et al., 2016) A solution x is called ε-accurate if E‖g_η(x)‖² ≤ ε for some η > 0.

Convergence Analysis of ZO-ProxSVRG
In this subsection, we show the convergence analysis of the ZO-ProxSVRG with the CooSGE (ZO-ProxSVRG-CooSGE) and the GauSGE (ZO-ProxSVRG-GauSGE), respectively.
Theorem 1.
Assume the sequence {(x_t^s)_{t=1}^m}_{s=1}^S is generated from Algorithm 1 using the CooSGE, and define a sequence {c_t}_{t=1}^m as follows: for s = 1, 2, ···, S,

c_t = δ_n L² d η/b + c_{t+1}(1 + β) for 0 ≤ t ≤ m − 1, and c_m = 0, (14)

where β > 0. Let T = mS, η = ρ/(dL) (0 < ρ < 1), and let b satisfy the following inequality:

ρm/b + ρ ≤ 1, (15)

then we have

E‖g_η(x_t^s)‖² ≤ E[F(x_0) − F(x*)]/(Tγ) + L² d μ² η/γ, (16)

where γ = η − Lη² and x* is an optimal solution of problem (2). Further, let b = [n], m = [n], with ρ chosen to satisfy (15) and μ = O(1/√(dT)); then we have

E‖g_η(x_t^s)‖² ≤ dL E[F(x_0) − F(x*)]/T + O(d/T). (17)
Theorem 1 shows that, given μ = O(1/√(dT)), b = [n] and m = [n], the ZO-ProxSVRG-CooSGE has an O(d/T) convergence rate.

Theorem 2.
Assume the sequence {(x_t^s)_{t=1}^m}_{s=1}^S is generated from Algorithm 1 using the GauSGE, and define a sequence {c_t}_{t=1}^m as follows: for s = 1, 2, ···, S,

c_t = δ_n L² η/b + c_{t+1}(1 + β) for 0 ≤ t ≤ m − 1, and c_m = 0, (18)

where β > 0. Let η = ρ/L (0 < ρ < 1) and let b satisfy the following inequality:

ρm/b + ρ ≤ 1, (19)

then we have

E‖g_η(x_t^s)‖² ≤ E[F(x_0) − F(x*)]/(Tγ) + (1 + 6δ_n/b)(d + 6)³ L² μ²/(ηγ) + (2 + 12δ_n/b)(2d + 9) σ²/(ηγ), (20)

where γ = η − η²L and x* is an optimal solution of problem (2). Further, let b = [n], m = [n], with ρ chosen to satisfy (19) and μ = O(1/(d√T)); then we have

E‖g_η(x_t^s)‖² ≤ L E[F(x_0) − F(x*)]/T + O(d/T) + O(dσ²). (21)
Theorem 2 shows that, given μ = O(1/(d√T)), b = [n] and m = [n], the ZO-ProxSVRG-GauSGE has an O(d/T + dσ²) convergence rate, in which the O(dσ²) part comes from the GauSGE.

Convergence Analysis of ZO-ProxSAGA
In this subsection, we provide the convergence analysis of the ZO-ProxSAGA with the CooSGE (ZO-ProxSAGA-CooSGE) and the GauSGE (ZO-ProxSAGA-GauSGE), respectively.
Theorem 3.
Assume the sequence {x_t}_{t=1}^T is generated from Algorithm 2 using the CooSGE, and define a positive sequence {c_t}_{t=1}^T as follows:

c_t = L² d η/b + c_{t+1}(1 − p)(1 + β), (22)

where β > 0. Let c_T = 0, η = ρ/(Ld) (0 < ρ < 1), and let b satisfy the following inequality:

ρn/b + ρ ≤ 1, (23)

then we have

E‖g_η(x_t)‖² ≤ E[F(x_0) − F(x*)]/(Tγ) + L² d μ² η/γ, (24)

where γ = η − Lη² and x* is an optimal solution of problem (2). Further, let b = [n], with ρ chosen to satisfy (23) and μ = O(1/√(dT)); then we have

E‖g_η(x_t)‖² ≤ dL E[F(x_0) − F(x*)]/(3T) + O(d/T). (25)
Theorem 3 shows that, given μ = O(1/√(dT)) and b = [n], the ZO-ProxSAGA-CooSGE has an O(d/T) convergence rate.

Theorem 4. Assume the sequence {x_t}_{t=1}^T is generated from Algorithm 2 using the GauSGE, and define a positive sequence {c_t}_{t=1}^T as follows:

c_t = 3L² η/b + c_{t+1}(1 − p)(1 + β), (26)

where β > 0. Let c_T = 0, η = ρ/L (0 < ρ < 1) and let b satisfy the following inequality:

ρn/b + ρ ≤ 1, (27)

then we have

E‖g_η(x_t)‖² ≤ E[F(x_0) − F(x*)]/(Tγ) + (2 + 12/b)(2d + 9) σ²/(ηγ) + (1 + 6/b)(d + 6)³ L² μ²/(ηγ), (28)

where γ = η − Lη² and x* is an optimal solution of problem (2). Further, given b = [n], ρ chosen to satisfy (27) and μ = O(1/(d√T)), we have

E‖g_η(x_t)‖² ≤ L E[F(x_0) − F(x*)]/(5T) + O(d/T) + O(dσ²). (29)
Theorem 4 shows that, given μ = O(1/(d√T)) and b = [n], the ZO-ProxSAGA-GauSGE has an O(d/T + dσ²) convergence rate, in which the O(dσ²) part comes from the GauSGE. All related proofs are in the supplementary document.

Experiments
In this section, we compare the proposed algorithms (ZO-ProxSVRG-CooSGE, ZO-ProxSVRG-GauSGE, ZO-ProxSAGA-CooSGE, ZO-ProxSAGA-GauSGE) with the RSPGF method (Ghadimi, Lan, and Zhang, 2016) on two applications: black-box binary classification and adversarial attacks on black-box deep neural networks (DNNs). Note that RSPGF uses the GauSGE to estimate gradients.
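To make the compared methods concrete, here is a minimal sketch (ours, not the authors' code) of ZO-ProxSVRG (Algorithm 1) with the CooSGE applied to an elastic-net-regularized black-box loss of the form (30); the toy data, hyperparameters, and all function names are our own assumptions for illustration.

```python
import numpy as np

def coosge_batch(f_i, X, y, w, idx, mu=1e-4):
    """Mini-batch CooSGE (Eq. (6)): central differences along every coordinate."""
    g = np.zeros_like(w)
    for j in range(w.size):
        e = np.zeros_like(w)
        e[j] = mu
        g[j] = np.mean([(f_i(X[i], y[i], w + e) - f_i(X[i], y[i], w - e)) / (2 * mu)
                        for i in idx])
    return g

def prox_elastic(w, t, lam1, lam2):
    """Prox of t*(lam1*||.||_1 + lam2*||.||^2): soft-threshold, then shrink."""
    return np.sign(w) * np.maximum(np.abs(w) - t * lam1, 0.0) / (1.0 + 2.0 * t * lam2)

def zo_prox_svrg(f_i, X, y, w0, eta=0.1, lam1=1e-3, lam2=1e-3, S=5, m=10, b=5, seed=0):
    """Sketch of ZO-ProxSVRG with the CooSGE; only f_i values are queried."""
    rng = np.random.default_rng(seed)
    n = len(y)
    w = w0.copy()
    w_tilde = w0.copy()
    for _ in range(S):
        full = coosge_batch(f_i, X, y, w_tilde, np.arange(n))  # ZO full gradient at snapshot
        for _ in range(m):
            idx = rng.choice(n, size=b, replace=False)
            # mixture stochastic gradient: biased for the true gradient, but low-variance
            v = (coosge_batch(f_i, X, y, w, idx)
                 - coosge_batch(f_i, X, y, w_tilde, idx) + full)
            w = prox_elastic(w - eta * v, eta, lam1, lam2)
        w_tilde = w.copy()
    return w
```

With the nonconvex sigmoid loss f_i(a, l, w) = 1/(1 + exp(l·a·w)) on a small synthetic dataset, a few epochs of this sketch typically lower the regularized objective from a random start; a ZO-ProxSAGA variant would replace the per-epoch snapshot with the gradient table {z_i}.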
Black-Box Binary Classification
Experimental Setup
In this experiment, we apply our algorithms to the black-box binary classification problem. Specifically, given a set of training samples {a_i, l_i}_{i=1}^n, where a_i ∈ R^d and l_i ∈ {−1, 1}, we find the optimal predictor x ∈ R^d by solving the following problem:

min_{x ∈ R^d} (1/n) Σ_{i=1}^n f_i(x) + λ₁‖x‖₁ + λ₂‖x‖², (30)

where f_i(x) is a black-box loss function that only returns the function value given an input. Here, we specify the nonconvex sigmoid loss function f_i(x) = 1/(1 + exp(l_i a_i^T x)) in the black-box setting.

Table 2: Real data for black-box binary classification. The datasets (a9a, w8a, and covtype.binary) are summarized with their numbers of samples, features, and classes.

In the algorithms, we fix the mini-batch size b = 20, and the smoothing parameters μ = O(1/(d√t)) in the GauSGE and μ = O(1/√(dt)) in the CooSGE. Meanwhile, we fix λ₁ = λ₂ = 10⁻ , and use the same initial solution x_0, drawn from the standard normal distribution, in each experiment. For each dataset, we use half of the samples as training data and the rest as testing data.

Experimental Results
Figures 1 and 2 show that both the objective values and the test losses of the proposed methods decrease faster than those of the RSPGF method as time increases. In particular, both the ZO-ProxSVRG and ZO-ProxSAGA using the CooSGE show better performance than their counterparts using the GauSGE. From these results, we find that the CooSGE performs better than the GauSGE in estimating gradients. Moreover, these results also demonstrate that both the ZO-ProxSVRG and ZO-ProxSAGA using the CooSGE have a relatively faster convergence rate than their counterparts using the GauSGE. Since the ZO-ProxSAGA has a lower function query complexity than the ZO-ProxSVRG, it shows better performance than the ZO-ProxSVRG. For example, the ZO-ProxSVRG-CooSGE needs O(ndS + bdT) function queries, while the ZO-ProxSAGA-CooSGE needs O(bdT) function queries.

Adversarial Attacks on Black-Box DNNs
In this experiment, we apply our methods to generate adversarial examples to attack a pre-trained neural network model. Following (Chen et al., 2017; Liu et al., 2018c), the parameters of the given model are hidden from us and only its outputs are accessible. In this case, we cannot compute the gradients by using the back-propagation algorithm. Thus, we use the zeroth-order algorithms to find a universal adversarial perturbation x ∈ R^d that can fool the samples {a_i ∈ R^d, l_i ∈ N}_{i=1}^n, which can be specified as the following elastic-net attack on black-box DNNs:

min_{x ∈ R^d} (1/n) Σ_{i=1}^n max{ F_{l_i}(a_i + x) − max_{j≠l_i} F_j(a_i + x), 0 } + λ₁‖x‖₁ + λ₂‖x‖², (31)

where λ₁ and λ₂ are nonnegative parameters that balance attack success rate, distortion, and sparsity. Here F(a) = [F₁(a), ···, F_K(a)] ∈ [0, 1]^K represents the final-layer

(The 20news dataset is from https://cs.nyu.edu/~roweis/data.html; a9a, w8a and covtype.binary are from the website.)
Figure 1: Objective value versus CPU time on black-box binary classification: (a) 20news, (b) a9a, (c) w8a, (d) covtype.binary.
Figure 2: Test loss versus CPU time on black-box binary classification: (a) 20news, (b) a9a, (c) w8a, (d) covtype.binary.

output of the neural network, which gives the probabilities of the K classes.

Following (Liu et al., 2018c), we use a pre-trained DNN on the MNIST dataset as the target black-box model, which achieves 99.4% test accuracy. In the experiment, we select n = 10 examples from the same class, and set the batch size b = 5 and a constant step size η = 1/d for the zeroth-order algorithms, where d = 28 × 28. In addition, we set λ₁ = 10⁻ and λ₂ = 1 in the experiment.

Figure 3 shows that both the objective values and the black-box attack losses (i.e., the first part of problem (31)) of the proposed algorithms decrease faster than those of the RSPGF method as the number of iterations increases. Here, we add the ZO-ProxSGD-CooSGE method for comparison, which is obtained by combining the ZO-ProxSGD method with the CooSGE. Interestingly, the ZO-ProxSGD-CooSGE shows better performance than both the ZO-ProxSVRG-GauSGE and ZO-ProxSAGA-GauSGE, which further demonstrates that the CooSGE can have better performance than the GauSGE in estimating gradients. Although it has a relatively good performance in generating adversarial samples, the ZO-ProxSGD still performs worse than both the ZO-ProxSVRG-CooSGE and ZO-ProxSAGA-CooSGE, because it does not use the VR technique.

Conclusions
In this paper, we proposed a class of faster gradient-free proximal stochastic methods based on the zeroth-order gradient estimators, i.e., the GauSGE and the CooSGE, which only use the objective function values in the optimization. Moreover, we provided a theoretical analysis of the convergence properties of the proposed algorithms (ZO-ProxSVRG and ZO-ProxSAGA) based on the CooSGE and the GauSGE, respectively. In particular, both the ZO-ProxSVRG and ZO-ProxSAGA using the CooSGE have relatively faster convergence rates than their counterparts using the GauSGE, since the CooSGE performs better than the GauSGE in estimating gradients.

(The pre-trained DNN model is from https://github.com/carlini/nn_robust_attacks.)

Figure 3: Objective value (a) and black-box attack loss (b) on generating adversarial samples from black-box DNNs.
Acknowledgments
F. Huang and S. Chen were partially supported by the Natural Science Foundation of China (NSFC) under Grants No. 61806093 and No. 61682281, the Key Program of NSFC under Grant No. 61732006, and the Jiangsu Postdoctoral Research Grant Program No. 2018K004A. F. Huang, Z. Huo, and H. Huang were partially supported by U.S. NSF IIS 1836945, IIS 1836938, DBI 1836866, IIS 1845666, IIS 1852606, IIS 1838627, and IIS 1837956.

References
Agarwal, A.; Dekel, O.; and Xiao, L. 2010. Optimal algorithms for online convex optimization with multi-point bandit feedback. In COLT, 28–40. Citeseer.
Beck, A., and Teboulle, M. 2009. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences.
The 10th ACM Workshop on Artificial Intelligence and Security, 15–26. ACM.
Conn, A. R.; Scheinberg, K.; and Vicente, L. N. 2009. Introduction to Derivative-Free Optimization, volume 8. SIAM.
Defazio, A.; Bach, F.; and Lacoste-Julien, S. 2014. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, 1646–1654.
Duchi, J. C.; Jordan, M. I.; Wainwright, M. J.; and Wibisono, A. 2015. Optimal rates for zero-order convex optimization: The power of two function evaluations. IEEE Transactions on Information Theory.
arXiv preprint arXiv:1802.09022.
Gao, X.; Jiang, B.; and Zhang, S. 2018. On the information-adaptive variants of the ADMM: an iteration complexity perspective. Journal of Scientific Computing.
Ghadimi, S.; Lan, G.; and Zhang, H. 2016. Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization. Mathematical Programming.
ICML, 1807–1816.
Gu, B.; Huo, Z.; and Huang, H. 2016. Zeroth-order asynchronous doubly stochastic algorithm with variance reduction. arXiv preprint arXiv:1612.01425.
Gu, B.; Huo, Z.; and Huang, H. 2018. Inexact proximal gradient methods for non-convex and non-smooth optimization. In AAAI.
Li, H., and Lin, Z. 2015. Accelerated proximal gradient methods for nonconvex programming. In Advances in Neural Information Processing Systems, 379–387.
Lian, X.; Zhang, H.; Hsieh, C. J.; Huang, Y.; and Liu, J. 2016. A comprehensive linear speedup analysis for asynchronous stochastic parallel optimization from zeroth-order to first-order. In Advances in Neural Information Processing Systems, 3054–3062.
Liu, L.; Cheng, M.; Hsieh, C.-J.; and Tao, D. 2018a. Stochastic zeroth-order optimization via variance reduction method. CoRR abs/1805.11811.
Liu, S.; Chen, J.; Chen, P.-Y.; and Hero, A. 2018b. Zeroth-order online alternating direction method of multipliers: Convergence analysis and applications. In The Twenty-First International Conference on Artificial Intelligence and Statistics, volume 84, 288–297.
Liu, S.; Kailkhura, B.; Chen, P.-Y.; Ting, P.; Chang, S.; and Amini, L. 2018c. Zeroth-order stochastic variance reduction for nonconvex optimization. arXiv preprint arXiv:1805.10367.
Mine, H., and Fukushima, M. 1981. A minimization method for the sum of a convex function and a continuously differentiable function. Journal of Optimization Theory & Applications.
Introductory Lectures on Convex Programming, Volume I: Basic Course. Kluwer, Boston.
Nesterov, Y. 2013. Gradient methods for minimizing composite functions. Mathematical Programming.
Reddi, S. J.; Sra, S.; Póczos, B.; and Smola, A. J. 2016. Proximal stochastic methods for nonsmooth nonconvex finite-sum optimization. In Advances in Neural Information Processing Systems, 1145–1153.
Shamir, O. 2017. An optimal algorithm for bandit and zero-order convex optimization with two-point feedback. Journal of Machine Learning Research.
arXiv preprint arXiv:1806.04458.
Wainwright, M. J.; Jordan, M. I.; et al. 2008. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning.
IJCAI, 3040–3046.

Supplementary Materials for "Faster Gradient-Free Proximal Stochastic Methods for Nonconvex Nonsmooth Optimization"
In this section, we provide the detailed proofs of the above lemmas and theorems. First, we give some useful properties of theCooSGE and the GauSGE, respectively.
Lemma 5. (Liu et al., 2018c) Assume that the function $f(x)$ is $L$-smooth. Let $\hat\nabla f(x)$ denote the estimated gradient defined by the CooSGE. Define $f_{\mu_j}(x) = \mathbb{E}_{u \sim U[-\mu_j,\mu_j]} f(x + u e_j)$, where $U[-\mu_j,\mu_j]$ denotes the uniform distribution on the interval $[-\mu_j,\mu_j]$. Then we have:

1) $f_{\mu_j}$ is $L$-smooth, and
$$\hat\nabla f(x) = \sum_{j=1}^{d} \frac{\partial f_{\mu_j}(x)}{\partial x_j} e_j, \quad (32)$$
where $\partial f/\partial x_j$ denotes the partial derivative with respect to the $j$-th coordinate.

2) For $j \in [d]$,
$$|f_{\mu_j}(x) - f(x)| \le L\mu_j^2, \quad (33)$$
$$\Big|\frac{\partial f_{\mu_j}(x)}{\partial x_j} - \frac{\partial f(x)}{\partial x_j}\Big| \le L\mu_j. \quad (34)$$

3) If $\mu = \mu_j$ for all $j \in [d]$, then
$$\|\hat\nabla f(x) - \nabla f(x)\|^2 \le L^2 d\mu^2. \quad (35)$$

Lemma 6.
Assume that the function $f(x)$ is $L$-smooth. Let $\hat\nabla f(x)$ denote the estimated gradient defined by the GauSGE. Define $f_{\mu}(x) = \mathbb{E}_{u \sim N(0,I)}[f(x + \mu u)]$. Then we have:

1) For any $x \in \mathbb{R}^d$, $\nabla f_{\mu}(x) = \mathbb{E}_u[\hat\nabla f(x)]$.

2) For any $x \in \mathbb{R}^d$,
$$|f_{\mu}(x) - f(x)| \le \frac{\mu^2 L d}{2}, \qquad \|\nabla f_{\mu}(x) - \nabla f(x)\| \le \frac{\mu L}{2}(d+3)^{3/2},$$
$$\mathbb{E}_u\|\hat\nabla f(x)\|^2 \le 2(d+4)\|\nabla f(x)\|^2 + \frac{\mu^2 L^2}{2}(d+6)^3. \quad (36)$$

3) For any $x \in \mathbb{R}^d$,
$$\mathbb{E}_u\|\hat\nabla f(x) - \nabla f(x)\|^2 \le 2(2d+9)\|\nabla f(x)\|^2 + \mu^2 L^2 (d+6)^3. \quad (37)$$

Proof.
The first and second parts of the above results can be obtained from Lemma 5 in (Ghadimi, Lan, and Zhang, 2016). Using the inequality (36), we have
$$\mathbb{E}_u\|\hat\nabla f(x) - \nabla f(x)\|^2 \le 2\,\mathbb{E}_u\|\hat\nabla f(x)\|^2 + 2\|\nabla f(x)\|^2 \le 2(2d+9)\|\nabla f(x)\|^2 + \mu^2 L^2 (d+6)^3,$$
where the first inequality holds by the Cauchy-Schwarz and Young's inequalities.

Notations:
To make the paper easier to follow, we use the following notation:
• $\|\cdot\|$ denotes the vector $\ell_2$ norm and the matrix spectral norm, respectively.
• $\mu$ denotes the smoothing parameter of the gradient estimators (i.e., the CooSGE and the GauSGE).
• $\eta$ denotes the step size for updating the variable $x$.
• $L$ denotes the Lipschitz constant of $\nabla f(x)$.
• $b$ denotes the mini-batch size of the stochastic gradient.
• $T$, $m$ and $S$ denote the total number of iterations, the number of iterations in the inner loop, and the number of iterations in the outer loop, respectively.
• For notational simplicity, $\mathbb{E}$ denotes $\mathbb{E}_{\mathcal{I}_t, u}$.

Convergence Analysis of ZO-ProxSVRG-CooSGE

In this section, we give the convergence analysis of the ZO-ProxSVRG-CooSGE. First, we give a useful lemma about the upper bound of the variance of the estimated gradient.
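As an informal numerical companion to the analysis below, the coordinate-wise estimator of (32) can be sketched as a per-coordinate central difference. This is our own illustrative sketch, not code from the paper: the helper name `coosge`, the exact difference scheme, and the test function are assumptions.

```python
import numpy as np

def coosge(f, x, mu):
    """Coordinate-wise zeroth-order gradient estimate: a central difference
    of f along each coordinate direction e_j with smoothing radius mu."""
    d = x.size
    g = np.zeros(d)
    for j in range(d):
        e = np.zeros(d)
        e[j] = 1.0
        g[j] = (f(x + mu * e) - f(x - mu * e)) / (2.0 * mu)
    return g

# For the quadratic f(x) = 0.5 * ||x||^2 the central difference is exact up
# to rounding, since grad f(x) = x.
x = np.array([1.0, -2.0, 3.0])
g = coosge(lambda v: 0.5 * float(v @ v), x, mu=1e-4)
```

Note that one estimate costs $2d$ function evaluations, which is the price paid for the dimension-dependent but small bias in (35).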
Lemma 7.
In Algorithm 1 using the CooSGE, given the estimated gradient $\hat v_t^s = \hat\nabla f_{\mathcal{I}_t}(x_t^s) - \hat\nabla f_{\mathcal{I}_t}(\tilde{x}^s) + \hat\nabla f(\tilde{x}^s)$, the following inequality holds:
$$\mathbb{E}\|\hat v_t^s - \nabla f(x_t^s)\|^2 \le \frac{2\delta_n L^2 d}{b}\,\mathbb{E}\|x_t^s - \tilde{x}^s\|^2 + 2L^2 d\mu^2. \quad (38)$$

Proof.
Since
$$\mathbb{E}_{\mathcal{I}_t}[\hat\nabla f_{\mathcal{I}_t}(x_t^s) - \hat\nabla f_{\mathcal{I}_t}(\tilde{x}^s)] = \hat\nabla f(x_t^s) - \hat\nabla f(\tilde{x}^s), \quad (39)$$
we have
$$\begin{aligned}
\mathbb{E}\|\hat v_t^s - \nabla f(x_t^s)\|^2 &= \mathbb{E}\|\hat\nabla f_{\mathcal{I}_t}(x_t^s) - \hat\nabla f_{\mathcal{I}_t}(\tilde{x}^s) + \hat\nabla f(\tilde{x}^s) - \nabla f(x_t^s)\|^2 \\
&= \mathbb{E}\|\hat\nabla f_{\mathcal{I}_t}(x_t^s) - \hat\nabla f_{\mathcal{I}_t}(\tilde{x}^s) - \mathbb{E}_{\mathcal{I}_t}[\hat\nabla f_{\mathcal{I}_t}(x_t^s) - \hat\nabla f_{\mathcal{I}_t}(\tilde{x}^s)] + \hat\nabla f(x_t^s) - \nabla f(x_t^s)\|^2 \\
&\le 2\,\mathbb{E}\|\hat\nabla f_{\mathcal{I}_t}(x_t^s) - \hat\nabla f_{\mathcal{I}_t}(\tilde{x}^s) - \mathbb{E}_{\mathcal{I}_t}[\hat\nabla f_{\mathcal{I}_t}(x_t^s) - \hat\nabla f_{\mathcal{I}_t}(\tilde{x}^s)]\|^2 + 2\,\mathbb{E}\|\hat\nabla f(x_t^s) - \nabla f(x_t^s)\|^2 \\
&\le 2\,\mathbb{E}\|\hat\nabla f_{\mathcal{I}_t}(x_t^s) - \hat\nabla f_{\mathcal{I}_t}(\tilde{x}^s) - \mathbb{E}_{\mathcal{I}_t}[\hat\nabla f_{\mathcal{I}_t}(x_t^s) - \hat\nabla f_{\mathcal{I}_t}(\tilde{x}^s)]\|^2 + 2L^2 d\mu^2, \quad (40)
\end{aligned}$$
where the second inequality holds by Lemma 5. By the equality (39), we have
$$\sum_{i=1}^{n}\Big(\hat\nabla f_i(x_t^s) - \hat\nabla f_i(\tilde{x}^s) - \mathbb{E}_{\mathcal{I}_t}[\hat\nabla f_{\mathcal{I}_t}(x_t^s) - \hat\nabla f_{\mathcal{I}_t}(\tilde{x}^s)]\Big) = n\big(\hat\nabla f(x_t^s) - \hat\nabla f(\tilde{x}^s)\big) - n\big(\hat\nabla f(x_t^s) - \hat\nabla f(\tilde{x}^s)\big) = 0. \quad (41)$$
Based on (41), we have
$$\mathbb{E}\big\|\hat\nabla f_{\mathcal{I}_t}(x_t^s) - \hat\nabla f_{\mathcal{I}_t}(\tilde{x}^s) - \mathbb{E}_{\mathcal{I}_t}[\hat\nabla f_{\mathcal{I}_t}(x_t^s) - \hat\nabla f_{\mathcal{I}_t}(\tilde{x}^s)]\big\|^2 \le \frac{\delta_n}{bn}\sum_{i=1}^{n}\mathbb{E}\|\hat\nabla f_i(x_t^s) - \hat\nabla f_i(\tilde{x}^s)\|^2, \quad (42)$$
where the inequality holds by Lemmas 4 and 5 in (Liu et al., 2018c), and
$$\delta_n = \begin{cases} 1, & \text{if } \mathcal{I}_t \text{ contains i.i.d. samples drawn with replacement}, \\ I(b<n), & \text{if } \mathcal{I}_t \text{ contains samples drawn without replacement}, \end{cases}$$
where $I(b<n) = 1$ if $b < n$ and $0$ otherwise.

Using (32), we have
$$\mathbb{E}\|\hat\nabla f_i(x_t^s) - \hat\nabla f_i(\tilde{x}^s)\|^2 = \mathbb{E}\Big\|\sum_{j=1}^{d}\Big(\frac{\partial f_{i,\mu_j}(x_t^s)}{\partial x_j} - \frac{\partial f_{i,\mu_j}(\tilde{x}^s)}{\partial x_j}\Big)e_j\Big\|^2 \le d\sum_{j=1}^{d}\mathbb{E}\Big|\frac{\partial f_{i,\mu_j}(x_t^s)}{\partial x_j} - \frac{\partial f_{i,\mu_j}(\tilde{x}^s)}{\partial x_j}\Big|^2 \le L^2 d\sum_{j=1}^{d}\mathbb{E}|x_{t,j}^s - \tilde{x}_j^s|^2 = L^2 d\,\mathbb{E}\|x_t^s - \tilde{x}^s\|^2, \quad (43)$$
where the first inequality holds by Jensen's inequality, i.e., $\|\frac{1}{n}\sum_{i=1}^{n} z_i\|^2 \le \frac{1}{n}\sum_{i=1}^{n}\|z_i\|^2$, and the second inequality holds since the function $f_{\mu_j}$ is $L$-smooth. Finally, combining the inequalities (40), (42) and (43), we obtain the above result.

Next, based on the above lemma, we study the convergence property of the ZO-ProxSVRG-CooSGE.

Theorem 5.
Assume the sequence $\{(x_t^s)_{t=1}^{m}\}_{s=1}^{S}$ is generated by Algorithm 1 using the CooSGE, and define a sequence $\{c_t\}_{t=1}^{m}$ as follows: for $s = 1, 2, \cdots, S$,
$$c_t = \begin{cases} \frac{\delta_n L^2 d\eta}{b} + c_{t+1}(1+\beta), & 1 \le t \le m-1, \\ 0, & t = m, \end{cases} \quad (44)$$
where $\beta > 0$. Let $T = mS$, $\eta = \frac{\rho}{dL}$ $(0 < \rho < 1)$, and let $b$ satisfy
$$\frac{8\rho^2 m^2}{b} + \rho \le 1; \quad (45)$$
then we have
$$\min_{t,s}\mathbb{E}\|g_\eta(x_t^s)\|^2 \le \frac{\mathbb{E}[F(x^0) - F(x^*)]}{T\gamma} + \frac{L^2 d\mu^2\eta}{\gamma}, \quad (46)$$
where $\gamma = \frac{\eta}{2} - L\eta^2$ and $x^*$ is an optimal solution of the problem (2). Further, let $b = [n^{2/3}]$, $m = [n^{1/3}]$, $\rho = \frac{1}{4}$ and $\mu = O(\frac{1}{\sqrt{dT}})$; then we have
$$\min_{t,s}\mathbb{E}\|g_\eta(x_t^s)\|^2 \le \frac{16dL\,\mathbb{E}[F(x^0) - F(x^*)]}{T} + O\Big(\frac{d}{T}\Big). \quad (47)$$

Proof.
We begin by defining an iteration that uses the full true gradient:
$$\bar{x}_{t+1}^s = \mathrm{Prox}_{\eta\psi}\big(x_t^s - \eta\nabla f(x_t^s)\big). \quad (48)$$
Applying Lemma 2 of Reddi et al. (2016), we have, for all $z \in \mathbb{R}^d$,
$$F(\bar{x}_{t+1}^s) \le F(z) + \Big(\frac{L}{2} - \frac{1}{2\eta}\Big)\|\bar{x}_{t+1}^s - x_t^s\|^2 + \Big(\frac{L}{2} + \frac{1}{2\eta}\Big)\|z - x_t^s\|^2 - \frac{1}{2\eta}\|\bar{x}_{t+1}^s - z\|^2. \quad (49)$$
Since $x_{t+1}^s = \mathrm{Prox}_{\eta\psi}(x_t^s - \eta\hat v_t^s)$, we similarly have
$$F(x_{t+1}^s) \le F(z) + \langle x_{t+1}^s - z, \nabla f(x_t^s) - \hat v_t^s\rangle + \Big(\frac{L}{2} - \frac{1}{2\eta}\Big)\|x_{t+1}^s - x_t^s\|^2 + \Big(\frac{L}{2} + \frac{1}{2\eta}\Big)\|z - x_t^s\|^2 - \frac{1}{2\eta}\|x_{t+1}^s - z\|^2. \quad (50)$$
Setting $z = x_t^s$ in (49) and $z = \bar{x}_{t+1}^s$ in (50), then summing them together and taking expectations, we have
$$\mathbb{E}[F(x_{t+1}^s)] \le \mathbb{E}\Big[F(x_t^s) + \underbrace{\langle x_{t+1}^s - \bar{x}_{t+1}^s, \nabla f(x_t^s) - \hat v_t^s\rangle}_{T_1} + \Big(\frac{L}{2} - \frac{1}{2\eta}\Big)\|x_{t+1}^s - x_t^s\|^2 + \Big(L - \frac{1}{2\eta}\Big)\|\bar{x}_{t+1}^s - x_t^s\|^2 - \frac{1}{2\eta}\|x_{t+1}^s - \bar{x}_{t+1}^s\|^2\Big]. \quad (51)$$
Next, we give an upper bound of the term $T_1$:
$$T_1 = \mathbb{E}\langle x_{t+1}^s - \bar{x}_{t+1}^s, \nabla f(x_t^s) - \hat v_t^s\rangle \le \frac{1}{2\eta}\mathbb{E}\|x_{t+1}^s - \bar{x}_{t+1}^s\|^2 + \frac{\eta}{2}\mathbb{E}\|\nabla f(x_t^s) - \hat v_t^s\|^2 \le \frac{1}{2\eta}\mathbb{E}\|x_{t+1}^s - \bar{x}_{t+1}^s\|^2 + \frac{\delta_n L^2 d\eta}{b}\mathbb{E}\|x_t^s - \tilde{x}^s\|^2 + L^2 d\mu^2\eta, \quad (52)$$
where the first inequality holds by the Cauchy-Schwarz and Young's inequalities and the second inequality holds by Lemma 7. Combining (51) with (52), we have
$$\mathbb{E}[F(x_{t+1}^s)] \le \mathbb{E}\Big[F(x_t^s) + \frac{\delta_n L^2 d\eta}{b}\|x_t^s - \tilde{x}^s\|^2 + L^2 d\mu^2\eta + \Big(\frac{L}{2} - \frac{1}{2\eta}\Big)\|x_{t+1}^s - x_t^s\|^2 + \Big(L - \frac{1}{2\eta}\Big)\|\bar{x}_{t+1}^s - x_t^s\|^2\Big]. \quad (53)$$
Next, we define a useful Lyapunov function:
$$R_t^s = \mathbb{E}\big[F(x_t^s) + c_t\|x_t^s - \tilde{x}^s\|^2\big], \quad (54)$$
where $\{c_t\}$ is a nonnegative sequence.

Considering the upper bound of $\|x_{t+1}^s - \tilde{x}^s\|^2$, we have
$$\|x_{t+1}^s - \tilde{x}^s\|^2 = \|x_{t+1}^s - x_t^s\|^2 + 2(x_{t+1}^s - x_t^s)^T(x_t^s - \tilde{x}^s) + \|x_t^s - \tilde{x}^s\|^2 \le \Big(1 + \frac{1}{\beta}\Big)\|x_{t+1}^s - x_t^s\|^2 + (1+\beta)\|x_t^s - \tilde{x}^s\|^2, \quad (55)$$
where $\beta > 0$. Then we have
$$\begin{aligned}
R_{t+1}^s &= \mathbb{E}\big[F(x_{t+1}^s) + c_{t+1}\|x_{t+1}^s - \tilde{x}^s\|^2\big] \\
&\le \mathbb{E}\Big[F(x_{t+1}^s) + c_{t+1}\Big(1+\frac{1}{\beta}\Big)\|x_{t+1}^s - x_t^s\|^2 + c_{t+1}(1+\beta)\|x_t^s - \tilde{x}^s\|^2\Big] \\
&\le \mathbb{E}\Big[F(x_t^s) + \Big(\frac{\delta_n L^2 d\eta}{b} + c_{t+1}(1+\beta)\Big)\|x_t^s - \tilde{x}^s\|^2 + \Big(L - \frac{1}{2\eta}\Big)\|\bar{x}_{t+1}^s - x_t^s\|^2 + \Big(\frac{L}{2} - \frac{1}{2\eta} + c_{t+1}\Big(1+\frac{1}{\beta}\Big)\Big)\|x_{t+1}^s - x_t^s\|^2 + L^2 d\mu^2\eta\Big] \\
&= R_t^s + \Big(L - \frac{1}{2\eta}\Big)\mathbb{E}\|\bar{x}_{t+1}^s - x_t^s\|^2 + \Big(\frac{L}{2} - \frac{1}{2\eta} + c_{t+1}\Big(1+\frac{1}{\beta}\Big)\Big)\mathbb{E}\|x_{t+1}^s - x_t^s\|^2 + L^2 d\mu^2\eta, \quad (56)
\end{aligned}$$
where $c_t = \frac{\delta_n L^2 d\eta}{b} + c_{t+1}(1+\beta)$. Let $c_m = 0$, $\beta = \frac{1}{m}$ and $\eta = \frac{\rho}{dL}$ $(0 < \rho < 1)$. Recursing on $t$, we have
$$c_t = \frac{\delta_n L^2 d\eta}{b}\cdot\frac{(1+\beta)^{m-t}-1}{\beta} = \frac{\delta_n L\rho m}{b}\Big(\Big(1+\frac{1}{m}\Big)^{m-t}-1\Big) \le \frac{\delta_n L\rho m}{b}(e-1) \le \frac{2L\rho m}{b}, \quad (57)$$
where the first inequality holds because $(1+\frac{1}{m})^m$ is an increasing function with $\lim_{m\to\infty}(1+\frac{1}{m})^m = e$; the second inequality holds by $0 \le \delta_n \le 1$ and $e-1 \le 2$. It follows that
$$\frac{L}{2} + c_{t+1}\Big(1+\frac{1}{\beta}\Big) \le \frac{L}{2} + \frac{2L\rho m}{b}(1+m) \le \frac{L}{2} + \frac{4L\rho m^2}{b} = \Big(\rho + \frac{8\rho^2 m^2}{b}\Big)\frac{L}{2\rho} \le \frac{L}{2\rho} \le \frac{1}{2\eta}, \quad (58)$$
where the third inequality holds by $\frac{8\rho^2 m^2}{b} + \rho \le 1$. Thus, we have $\frac{L}{2} - \frac{1}{2\eta} + c_{t+1}(1+\frac{1}{\beta}) \le 0$. Then, we obtain
$$R_{t+1}^s \le R_t^s + \Big(L - \frac{1}{2\eta}\Big)\mathbb{E}\|\bar{x}_{t+1}^s - x_t^s\|^2 + L^2 d\mu^2\eta. \quad (59)$$

Telescoping inequality (59) over $t$ from $0$ to $m-1$, and using $x_0^s = x_m^{s-1} = \tilde{x}^{s-1}$ and $x_m^s = \tilde{x}^s$, we have
$$\frac{1}{m}\sum_{t=1}^{m}\mathbb{E}\|g_\eta(x_t^s)\|^2 \le \frac{\mathbb{E}[F(\tilde{x}^{s-1}) - F(\tilde{x}^s)]}{m\gamma} + \frac{L^2 d\mu^2\eta}{\gamma}, \quad (60)$$
where $\gamma = \frac{\eta}{2} - L\eta^2$, and
$$g_\eta(x_t^s) = \frac{1}{\eta}\big[x_t^s - \mathrm{Prox}_{\eta\psi}(x_t^s - \eta\nabla f(x_t^s))\big] = \frac{1}{\eta}(x_t^s - \bar{x}_{t+1}^s). \quad (61)$$
Summing the inequality (60) over $s$ from $1$ to $S$, we have
$$\min_{t,s}\mathbb{E}\|g_\eta(x_t^s)\|^2 \le \frac{1}{T}\sum_{s=1}^{S}\sum_{t=1}^{m}\mathbb{E}\|g_\eta(x_t^s)\|^2 \le \frac{\mathbb{E}[F(\tilde{x}^0) - F(\tilde{x}^S)]}{T\gamma} + \frac{L^2 d\mu^2\eta}{\gamma} \le \frac{\mathbb{E}[F(\tilde{x}^0) - F(x^*)]}{T\gamma} + \frac{L^2 d\mu^2\eta}{\gamma}, \quad (62)$$
where $x^*$ is an optimal solution of (2). Given $m = [n^{1/3}]$, $b = [n^{2/3}]$ and $\rho = \frac{1}{4}$, it is easily verified that $\rho + \frac{8\rho^2 m^2}{b} = \frac{3}{4} < 1$. Using $d \ge 1$, we have $\gamma = \frac{\eta}{2} - L\eta^2 = \frac{1}{8dL} - \frac{1}{16d^2 L} \ge \frac{1}{16dL}$. Finally, setting $\mu = O(\frac{1}{\sqrt{dT}})$, we can obtain the above results.

Convergence Analysis of ZO-ProxSVRG-GauSGE

In this section, we give the convergence analysis of the ZO-ProxSVRG-GauSGE. First, we give a useful lemma about the upper bound of the variance of the estimated gradient.
Lemma 8.
In Algorithm 1 using the GauSGE, given the estimated gradient $\hat v_t^s = \hat\nabla f_{\mathcal{I}_t}(x_t^s) - \hat\nabla f_{\mathcal{I}_t}(\tilde{x}^s) + \hat\nabla f(\tilde{x}^s)$, the following inequality holds:
$$\mathbb{E}\|\hat v_t^s - \nabla f(x_t^s)\|^2 \le \frac{6\delta_n L^2}{b}\,\mathbb{E}\|x_t^s - \tilde{x}^s\|^2 + \Big(2 + \frac{12\delta_n}{b}\Big)L^2\mu^2(d+6)^3 + \Big(4 + \frac{24\delta_n}{b}\Big)(2d+9)\sigma^2. \quad (63)$$

Proof.
Since
$$\mathbb{E}_{\mathcal{I}_t}[\hat\nabla f_{\mathcal{I}_t}(x_t^s) - \hat\nabla f_{\mathcal{I}_t}(\tilde{x}^s)] = \hat\nabla f(x_t^s) - \hat\nabla f(\tilde{x}^s), \quad (64)$$
we have
$$\begin{aligned}
\mathbb{E}\|\hat v_t^s - \nabla f(x_t^s)\|^2 &= \mathbb{E}\|\hat\nabla f_{\mathcal{I}_t}(x_t^s) - \hat\nabla f_{\mathcal{I}_t}(\tilde{x}^s) - \mathbb{E}_{\mathcal{I}_t}[\hat\nabla f_{\mathcal{I}_t}(x_t^s) - \hat\nabla f_{\mathcal{I}_t}(\tilde{x}^s)] + \hat\nabla f(x_t^s) - \nabla f(x_t^s)\|^2 \\
&\le 2\,\mathbb{E}\|\hat\nabla f_{\mathcal{I}_t}(x_t^s) - \hat\nabla f_{\mathcal{I}_t}(\tilde{x}^s) - \mathbb{E}_{\mathcal{I}_t}[\cdot]\|^2 + 2\,\mathbb{E}\|\hat\nabla f(x_t^s) - \nabla f(x_t^s)\|^2 \\
&\le 2\,\mathbb{E}\|\hat\nabla f_{\mathcal{I}_t}(x_t^s) - \hat\nabla f_{\mathcal{I}_t}(\tilde{x}^s) - \mathbb{E}_{\mathcal{I}_t}[\cdot]\|^2 + 4(2d+9)\,\mathbb{E}\|\nabla f(x_t^s)\|^2 + 2\mu^2 L^2(d+6)^3 \\
&\le 2\,\mathbb{E}\|\hat\nabla f_{\mathcal{I}_t}(x_t^s) - \hat\nabla f_{\mathcal{I}_t}(\tilde{x}^s) - \mathbb{E}_{\mathcal{I}_t}[\cdot]\|^2 + 4(2d+9)\sigma^2 + 2\mu^2 L^2(d+6)^3, \quad (65)
\end{aligned}$$
where $\mathbb{E}_{\mathcal{I}_t}[\cdot]$ abbreviates $\mathbb{E}_{\mathcal{I}_t}[\hat\nabla f_{\mathcal{I}_t}(x_t^s) - \hat\nabla f_{\mathcal{I}_t}(\tilde{x}^s)]$, the second inequality holds by Lemma 6 and the third inequality follows from Assumption 2. By the equality (64), we have
$$\sum_{i=1}^{n}\Big(\hat\nabla f_i(x_t^s) - \hat\nabla f_i(\tilde{x}^s) - \mathbb{E}_{\mathcal{I}_t}[\hat\nabla f_{\mathcal{I}_t}(x_t^s) - \hat\nabla f_{\mathcal{I}_t}(\tilde{x}^s)]\Big) = 0. \quad (66)$$
It follows that
$$\mathbb{E}\|\hat\nabla f_{\mathcal{I}_t}(x_t^s) - \hat\nabla f_{\mathcal{I}_t}(\tilde{x}^s) - \mathbb{E}_{\mathcal{I}_t}[\cdot]\|^2 \le \frac{\delta_n}{bn}\sum_{i=1}^{n}\mathbb{E}\|\hat\nabla f_i(x_t^s) - \hat\nabla f_i(\tilde{x}^s)\|^2, \quad (67)$$
where the inequality holds by Lemmas 4 and 5 in (Liu et al., 2018c).

Next, we have
$$\begin{aligned}
\mathbb{E}\|\hat\nabla f_i(x_t^s) - \hat\nabla f_i(\tilde{x}^s)\|^2 &= \mathbb{E}\|\hat\nabla f_i(x_t^s) - \nabla f_i(x_t^s) + \nabla f_i(x_t^s) - \nabla f_i(\tilde{x}^s) + \nabla f_i(\tilde{x}^s) - \hat\nabla f_i(\tilde{x}^s)\|^2 \\
&\le 3\,\mathbb{E}\|\hat\nabla f_i(x_t^s) - \nabla f_i(x_t^s)\|^2 + 3\,\mathbb{E}\|\nabla f_i(x_t^s) - \nabla f_i(\tilde{x}^s)\|^2 + 3\,\mathbb{E}\|\nabla f_i(\tilde{x}^s) - \hat\nabla f_i(\tilde{x}^s)\|^2 \\
&\le 6(2d+9)\big(\mathbb{E}\|\nabla f_i(x_t^s)\|^2 + \mathbb{E}\|\nabla f_i(\tilde{x}^s)\|^2\big) + 3L^2\,\mathbb{E}\|x_t^s - \tilde{x}^s\|^2 + 6L^2\mu^2(d+6)^3 \\
&\le 12(2d+9)\sigma^2 + 3L^2\,\mathbb{E}\|x_t^s - \tilde{x}^s\|^2 + 6L^2\mu^2(d+6)^3, \quad (68)
\end{aligned}$$
where the first inequality holds by Jensen's inequality, the second inequality holds by Lemma 6 and the $L$-smoothness of $f_i$, and the third inequality follows from Assumption 2. Finally, combining the inequalities (65), (67) and (68), we obtain the above result.

Next, based on the above lemma, we study the convergence property of the ZO-ProxSVRG-GauSGE.

Theorem 6. Assume the sequence $\{(x_t^s)_{t=1}^{m}\}_{s=1}^{S}$ is generated by Algorithm 1 using the GauSGE, and define a sequence $\{c_t\}_{t=1}^{m}$ as follows: for $s = 1, 2, \cdots, S$,
$$c_t = \begin{cases} \frac{3\delta_n L^2\eta}{b} + c_{t+1}(1+\beta), & 1 \le t \le m-1, \\ 0, & t = m, \end{cases} \quad (69)$$
where $\beta > 0$. Let $\eta = \frac{\rho}{L}$ $(0 < \rho < 1)$ and let $b$ satisfy
$$\frac{24\rho^2 m^2}{b} + \rho \le 1; \quad (70)$$
then we have
$$\min_{t,s}\mathbb{E}\|g_\eta(x_t^s)\|^2 \le \frac{\mathbb{E}[F(x^0) - F(x^*)]}{T\gamma} + \Big(1 + \frac{6\delta_n}{b}\Big)\frac{(d+6)^3 L^2\mu^2\eta}{\gamma} + \Big(2 + \frac{12\delta_n}{b}\Big)\frac{(2d+9)\sigma^2\eta}{\gamma}, \quad (71)$$
where $\gamma = \frac{\eta}{2} - L\eta^2$ and $x^*$ is an optimal solution of the problem (2). Further, let $b = [n^{2/3}]$, $m = [n^{1/3}]$, $\rho = \frac{1}{6}$ and $\mu = O(\frac{1}{d\sqrt{T}})$; then we have
$$\min_{t,s}\mathbb{E}\|g_\eta(x_t^s)\|^2 \le \frac{18L\,\mathbb{E}[F(x^0) - F(x^*)]}{T} + O\Big(\frac{d}{T}\Big) + O(d\sigma^2). \quad (72)$$

Proof.
This proof is similar to the proof of Theorem 5. We start by defining an iteration that uses the full true gradient:
$$\bar{x}_{t+1}^s = \mathrm{Prox}_{\eta\psi}\big(x_t^s - \eta\nabla f(x_t^s)\big); \quad (73)$$
then, applying Lemma 2 of Reddi et al. (2016), we have, for all $z \in \mathbb{R}^d$,
$$F(\bar{x}_{t+1}^s) \le F(z) + \Big(\frac{L}{2} - \frac{1}{2\eta}\Big)\|\bar{x}_{t+1}^s - x_t^s\|^2 + \Big(\frac{L}{2} + \frac{1}{2\eta}\Big)\|z - x_t^s\|^2 - \frac{1}{2\eta}\|\bar{x}_{t+1}^s - z\|^2. \quad (74)$$
Since $x_{t+1}^s = \mathrm{Prox}_{\eta\psi}(x_t^s - \eta\hat v_t^s)$, we have
$$F(x_{t+1}^s) \le F(z) + \langle x_{t+1}^s - z, \nabla f(x_t^s) - \hat v_t^s\rangle + \Big(\frac{L}{2} - \frac{1}{2\eta}\Big)\|x_{t+1}^s - x_t^s\|^2 + \Big(\frac{L}{2} + \frac{1}{2\eta}\Big)\|z - x_t^s\|^2 - \frac{1}{2\eta}\|x_{t+1}^s - z\|^2. \quad (75)$$
Setting $z = x_t^s$ in (74) and $z = \bar{x}_{t+1}^s$ in (75), then summing them together and taking expectations, we obtain
$$\mathbb{E}[F(x_{t+1}^s)] \le \mathbb{E}\Big[F(x_t^s) + \underbrace{\langle x_{t+1}^s - \bar{x}_{t+1}^s, \nabla f(x_t^s) - \hat v_t^s\rangle}_{T_1} + \Big(\frac{L}{2} - \frac{1}{2\eta}\Big)\|x_{t+1}^s - x_t^s\|^2 + \Big(L - \frac{1}{2\eta}\Big)\|\bar{x}_{t+1}^s - x_t^s\|^2 - \frac{1}{2\eta}\|x_{t+1}^s - \bar{x}_{t+1}^s\|^2\Big]. \quad (76)$$
Next, we give an upper bound of the term $T_1$:
$$T_1 \le \frac{1}{2\eta}\mathbb{E}\|x_{t+1}^s - \bar{x}_{t+1}^s\|^2 + \frac{\eta}{2}\mathbb{E}\|\nabla f(x_t^s) - \hat v_t^s\|^2 \le \frac{1}{2\eta}\mathbb{E}\|x_{t+1}^s - \bar{x}_{t+1}^s\|^2 + \frac{3\delta_n L^2\eta}{b}\mathbb{E}\|x_t^s - \tilde{x}^s\|^2 + \Big(1 + \frac{6\delta_n}{b}\Big)L^2(d+6)^3\mu^2\eta + \Big(2 + \frac{12\delta_n}{b}\Big)(2d+9)\sigma^2\eta, \quad (77)$$
where the first inequality holds by the Cauchy-Schwarz and Young's inequalities and the second inequality holds by Lemma 8. Combining (76) with (77), we have
$$\mathbb{E}[F(x_{t+1}^s)] \le \mathbb{E}\Big[F(x_t^s) + \frac{3\delta_n L^2\eta}{b}\|x_t^s - \tilde{x}^s\|^2 + \Big(1 + \frac{6\delta_n}{b}\Big)L^2(d+6)^3\mu^2\eta + \Big(2 + \frac{12\delta_n}{b}\Big)(2d+9)\sigma^2\eta + \Big(\frac{L}{2} - \frac{1}{2\eta}\Big)\|x_{t+1}^s - x_t^s\|^2 + \Big(L - \frac{1}{2\eta}\Big)\|\bar{x}_{t+1}^s - x_t^s\|^2\Big]. \quad (78)$$
Next, we define a useful Lyapunov function:
$$\Psi_t^s = \mathbb{E}\big[F(x_t^s) + c_t\|x_t^s - \tilde{x}^s\|^2\big], \quad (79)$$
where $\{c_t\}$ is a nonnegative sequence. As in (55), we have
$$\|x_{t+1}^s - \tilde{x}^s\|^2 \le \Big(1 + \frac{1}{\beta}\Big)\|x_{t+1}^s - x_t^s\|^2 + (1+\beta)\|x_t^s - \tilde{x}^s\|^2, \quad (80)$$
where $\beta > 0$. Then we have
$$\Psi_{t+1}^s \le \Psi_t^s + \Big(L - \frac{1}{2\eta}\Big)\mathbb{E}\|\bar{x}_{t+1}^s - x_t^s\|^2 + \Big(\frac{L}{2} - \frac{1}{2\eta} + c_{t+1}\Big(1+\frac{1}{\beta}\Big)\Big)\mathbb{E}\|x_{t+1}^s - x_t^s\|^2 + \Big(1 + \frac{6\delta_n}{b}\Big)(d+6)^3 L^2\mu^2\eta + \Big(2 + \frac{12\delta_n}{b}\Big)(2d+9)\sigma^2\eta, \quad (81)$$
where $c_t = \frac{3\delta_n L^2\eta}{b} + c_{t+1}(1+\beta)$. Let $c_m = 0$, $\beta = \frac{1}{m}$ and $\eta = \frac{\rho}{L}$ $(0 < \rho < 1)$. Recursing on $t$, we have
$$c_t = \frac{3\delta_n L\rho m}{b}\Big(\Big(1+\frac{1}{m}\Big)^{m-t}-1\Big) \le \frac{3\delta_n L\rho m}{b}(e-1) \le \frac{6L\rho m}{b}, \quad (82)$$
where the first inequality holds because $(1+\frac{1}{m})^m$ is an increasing function with $\lim_{m\to\infty}(1+\frac{1}{m})^m = e$. It follows that
$$\frac{L}{2} + c_{t+1}\Big(1+\frac{1}{\beta}\Big) \le \frac{L}{2} + \frac{6L\rho m}{b}(1+m) \le \frac{L}{2} + \frac{12L\rho m^2}{b} = \Big(\rho + \frac{24\rho^2 m^2}{b}\Big)\frac{L}{2\rho} \le \frac{L}{2\rho} = \frac{1}{2\eta}, \quad (83)$$
where the last inequality holds by $\frac{24\rho^2 m^2}{b} + \rho \le 1$.

Then we obtain
$$\Psi_{t+1}^s \le \Psi_t^s + \Big(L - \frac{1}{2\eta}\Big)\mathbb{E}\|\bar{x}_{t+1}^s - x_t^s\|^2 + \Big(1 + \frac{6\delta_n}{b}\Big)(d+6)^3 L^2\mu^2\eta + \Big(2 + \frac{12\delta_n}{b}\Big)(2d+9)\sigma^2\eta. \quad (84)$$
Telescoping inequality (84) over $t$ from $0$ to $m-1$, and using $x_0^s = x_m^{s-1} = \tilde{x}^{s-1}$ and $x_m^s = \tilde{x}^s$, we have
$$\frac{1}{m}\sum_{t=1}^{m}\mathbb{E}\|g_\eta(x_t^s)\|^2 \le \frac{\mathbb{E}[F(\tilde{x}^{s-1}) - F(\tilde{x}^s)]}{m\gamma} + \Big(1 + \frac{6\delta_n}{b}\Big)\frac{(d+6)^3 L^2\mu^2\eta}{\gamma} + \Big(2 + \frac{12\delta_n}{b}\Big)\frac{(2d+9)\sigma^2\eta}{\gamma}, \quad (85)$$
where $\gamma = \frac{\eta}{2} - L\eta^2$, and
$$g_\eta(x_t^s) = \frac{1}{\eta}\big[x_t^s - \mathrm{Prox}_{\eta\psi}(x_t^s - \eta\nabla f(x_t^s))\big] = \frac{1}{\eta}(x_t^s - \bar{x}_{t+1}^s). \quad (86)$$
Summing the inequality (85) over $s$ from $1$ to $S$, we have
$$\frac{1}{T}\sum_{s=1}^{S}\sum_{t=1}^{m}\mathbb{E}\|g_\eta(x_t^s)\|^2 \le \frac{\mathbb{E}[F(\tilde{x}^0) - F(x^*)]}{T\gamma} + \Big(1 + \frac{6\delta_n}{b}\Big)\frac{(d+6)^3 L^2\mu^2\eta}{\gamma} + \Big(2 + \frac{12\delta_n}{b}\Big)\frac{(2d+9)\sigma^2\eta}{\gamma}, \quad (87)$$
where $x^*$ is an optimal solution of (2). Let $m = [n^{1/3}]$, $b = [n^{2/3}]$ and $\rho = \frac{1}{6}$; it is easily verified that $\rho + \frac{24\rho^2 m^2}{b} = \frac{5}{6} \le 1$ and $\gamma = \frac{\eta}{2} - L\eta^2 = \frac{1}{18L}$. Finally, given $\mu = O(\frac{1}{d\sqrt{T}})$, we can obtain the above results.

Convergence Analysis of ZO-ProxSAGA-CooSGE

In this section, we give the convergence analysis of the ZO-ProxSAGA-CooSGE. First, we give a useful lemma about the upper bound of the variance of the estimated gradient.
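Before turning to the SAGA variants, the SVRG-style zeroth-order proximal update analyzed above can be sketched end to end. This is a minimal sketch, not the authors' exact Algorithm 1: it assumes an $\ell_1$ regularizer (so the proximal step is soft-thresholding), a coordinate-wise central-difference estimator, and a toy least-squares problem; all function and parameter names are ours.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 (soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def coord_grad(f, x, mu):
    """Coordinate-wise zeroth-order gradient estimate (CooSGE-style)."""
    g = np.zeros(x.size)
    for j in range(x.size):
        e = np.zeros(x.size)
        e[j] = mu
        g[j] = (f(x + e) - f(x - e)) / (2.0 * mu)
    return g

def zo_prox_svrg(fs, lam, x0, eta, mu, m, S, b, rng):
    """SVRG-style estimate v = grad_I(x) - grad_I(x~) + grad(x~), followed by
    a proximal step on the l1 regularizer."""
    n, x = len(fs), x0.copy()
    for _ in range(S):                               # outer loop: snapshot
        x_tilde = x.copy()
        full = sum(coord_grad(f, x_tilde, mu) for f in fs) / n
        for _ in range(m):                           # inner loop: VR steps
            idx = rng.choice(n, size=b, replace=False)
            v = full + sum(coord_grad(fs[i], x, mu)
                           - coord_grad(fs[i], x_tilde, mu) for i in idx) / b
            x = soft_threshold(x - eta * v, eta * lam)
    return x

# Toy composite problem: f_i(x) = 0.5 * (a_i @ x - y_i)^2 plus lam * ||x||_1.
rng = np.random.default_rng(1)
n, d = 8, 5
A = rng.standard_normal((n, d))
y = A @ np.array([1.0, 0.0, -2.0, 0.0, 0.5])
fs = [lambda x, a=A[i], yi=y[i]: 0.5 * float(a @ x - yi) ** 2 for i in range(n)]
lam = 0.01
F = lambda x: sum(f(x) for f in fs) / n + lam * np.abs(x).sum()
x0 = np.zeros(d)
x_out = zo_prox_svrg(fs, lam, x0, eta=0.02, mu=1e-5, m=10, S=5, b=2, rng=rng)
```

On this toy problem the composite objective $F$ decreases from its value at the zero initializer, which is the qualitative behavior the theorems above quantify.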
Lemma 9.
In Algorithm 2 using the CooSGE, given the estimated gradient $\hat v_t = \frac{1}{b}\sum_{i_t\in\mathcal{I}_t}\big(\hat\nabla f_{i_t}(x_t) - \hat\nabla f_{i_t}(z_{i_t}^t)\big) + \hat\phi_t$ with $\hat\phi_t = \frac{1}{n}\sum_{i=1}^{n}\hat\nabla f_i(z_i^t)$, the following inequality holds:
$$\mathbb{E}\|\hat v_t - \nabla f(x_t)\|^2 \le \frac{2L^2 d}{nb}\sum_{i=1}^{n}\mathbb{E}\|x_t - z_i^t\|^2 + 2L^2 d\mu^2. \quad (88)$$

Proof.
By the definition of the estimated gradient $\hat v_t$, we have
$$\begin{aligned}
\mathbb{E}\|\hat v_t - \nabla f(x_t)\|^2 &= \mathbb{E}\Big\|\frac{1}{b}\sum_{i_t\in\mathcal{I}_t}\big(\hat\nabla f_{i_t}(x_t) - \hat\nabla f_{i_t}(z_{i_t}^t)\big) + \hat\phi_t - \nabla f(x_t)\Big\|^2 \\
&\le 2\,\mathbb{E}\Big\|\frac{1}{b}\sum_{i_t\in\mathcal{I}_t}\big(\hat\nabla f_{i_t}(x_t) - \hat\nabla f_{i_t}(z_{i_t}^t)\big) - \big(\hat\nabla f(x_t) - \hat\phi_t\big)\Big\|^2 + 2\,\mathbb{E}\|\hat\nabla f(x_t) - \nabla f(x_t)\|^2 \\
&\le 2\,\mathbb{E}\Big\|\frac{1}{b}\sum_{i_t\in\mathcal{I}_t}\big(\hat\nabla f_{i_t}(x_t) - \hat\nabla f_{i_t}(z_{i_t}^t)\big) - \big(\hat\nabla f(x_t) - \hat\phi_t\big)\Big\|^2 + 2L^2 d\mu^2, \quad (89)
\end{aligned}$$
where the second inequality holds by Lemma 5. Using
$$\mathbb{E}_{\mathcal{I}_t}\Big[\frac{1}{b}\sum_{i_t\in\mathcal{I}_t}\hat\nabla f_{i_t}(x_t) - \hat\nabla f_{i_t}(z_{i_t}^t)\Big] = \hat\nabla f(x_t) - \frac{1}{n}\sum_{i=1}^{n}\hat\nabla f_i(z_i^t) = \hat\nabla f(x_t) - \hat\phi_t, \quad (90)$$
we have
$$\sum_{i=1}^{n}\big(\hat\nabla f_i(x_t) - \hat\nabla f_i(z_i^t) - (\hat\nabla f(x_t) - \hat\phi_t)\big) = n\big(\hat\nabla f(x_t) - \hat\phi_t\big) - n\big(\hat\nabla f(x_t) - \hat\phi_t\big) = 0. \quad (91)$$
It follows that
$$\mathbb{E}\Big\|\frac{1}{b}\sum_{i_t\in\mathcal{I}_t}\big(\hat\nabla f_{i_t}(x_t) - \hat\nabla f_{i_t}(z_{i_t}^t)\big) - \big(\hat\nabla f(x_t) - \hat\phi_t\big)\Big\|^2 \le \frac{1}{bn}\sum_{i=1}^{n}\mathbb{E}\|\hat\nabla f_i(x_t) - \hat\nabla f_i(z_i^t)\|^2, \quad (92)$$
where the inequality holds by Lemmas 4 and 5 in (Liu et al., 2018c).

By (32), we have
$$\mathbb{E}\|\hat\nabla f_i(x_t) - \hat\nabla f_i(z_i^t)\|^2 = \mathbb{E}\Big\|\sum_{j=1}^{d}\Big(\frac{\partial f_{i,\mu_j}(x_t)}{\partial x_j} - \frac{\partial f_{i,\mu_j}(z_i^t)}{\partial x_j}\Big)e_j\Big\|^2 \le d\sum_{j=1}^{d}\mathbb{E}\Big|\frac{\partial f_{i,\mu_j}(x_t)}{\partial x_j} - \frac{\partial f_{i,\mu_j}(z_i^t)}{\partial x_j}\Big|^2 \le L^2 d\,\mathbb{E}\|x_t - z_i^t\|^2, \quad (93)$$
where the first inequality follows from Jensen's inequality, and the second inequality holds since the function $f_{\mu_j}$ is $L$-smooth. Finally, combining the inequalities (89), (92) and (93), we obtain the above result.

Next, based on the above lemma, we study the convergence property of the ZO-ProxSAGA-CooSGE.

Theorem 7.
Assume the sequence $\{x_t\}_{t=1}^{T}$ is generated by Algorithm 2 using the CooSGE, and define a positive sequence $\{c_t\}_{t=1}^{T}$ as follows:
$$c_t = \frac{L^2 d\eta}{b} + c_{t+1}(1-p)(1+\beta), \quad (94)$$
where $\beta > 0$. Let $c_T = 0$, $\eta = \frac{\rho}{dL}$ $(0 < \rho < 1)$, and let $b$ satisfy
$$\frac{32\rho^2 n^2}{b^3} + \rho \le 1; \quad (95)$$
then we have
$$\min_t\mathbb{E}\|g_\eta(x_t)\|^2 \le \frac{\mathbb{E}[F(x^0) - F(x^*)]}{T\gamma} + \frac{L^2 d\mu^2\eta}{\gamma}, \quad (96)$$
where $\gamma = \frac{\eta}{2} - L\eta^2$ and $x^*$ is an optimal solution of the problem (2). Further, let $b = [n^{2/3}]$, $\rho = \frac{1}{8}$ and $\mu = O(\frac{1}{\sqrt{dT}})$; then we obtain
$$\min_t\mathbb{E}\|g_\eta(x_t)\|^2 \le \frac{64dL\,\mathbb{E}[F(x^0) - F(x^*)]}{3T} + O\Big(\frac{d}{T}\Big). \quad (97)$$

Proof.
First, we define an iteration that uses the full true gradient:
$$\bar{x}_{t+1} = \mathrm{Prox}_{\eta\psi}\big(x_t - \eta\nabla f(x_t)\big); \quad (98)$$
then, applying Lemma 2 of Reddi et al. (2016), we have, for all $z \in \mathbb{R}^d$,
$$F(\bar{x}_{t+1}) \le F(z) + \Big(\frac{L}{2} - \frac{1}{2\eta}\Big)\|\bar{x}_{t+1} - x_t\|^2 + \Big(\frac{L}{2} + \frac{1}{2\eta}\Big)\|z - x_t\|^2 - \frac{1}{2\eta}\|\bar{x}_{t+1} - z\|^2. \quad (99)$$
Since $x_{t+1} = \mathrm{Prox}_{\eta\psi}(x_t - \eta\hat v_t)$, we have
$$F(x_{t+1}) \le F(z) + \langle x_{t+1} - z, \nabla f(x_t) - \hat v_t\rangle + \Big(\frac{L}{2} - \frac{1}{2\eta}\Big)\|x_{t+1} - x_t\|^2 + \Big(\frac{L}{2} + \frac{1}{2\eta}\Big)\|z - x_t\|^2 - \frac{1}{2\eta}\|x_{t+1} - z\|^2. \quad (100)$$
Setting $z = x_t$ in (99) and $z = \bar{x}_{t+1}$ in (100), then summing them together and taking expectations, we obtain
$$\mathbb{E}[F(x_{t+1})] \le \mathbb{E}\Big[F(x_t) + \underbrace{\langle x_{t+1} - \bar{x}_{t+1}, \nabla f(x_t) - \hat v_t\rangle}_{T_1} + \Big(\frac{L}{2} - \frac{1}{2\eta}\Big)\|x_{t+1} - x_t\|^2 + \Big(L - \frac{1}{2\eta}\Big)\|\bar{x}_{t+1} - x_t\|^2 - \frac{1}{2\eta}\|x_{t+1} - \bar{x}_{t+1}\|^2\Big]. \quad (101)$$
Next, we give an upper bound of the term $T_1$:
$$T_1 \le \frac{1}{2\eta}\mathbb{E}\|x_{t+1} - \bar{x}_{t+1}\|^2 + \frac{\eta}{2}\mathbb{E}\|\nabla f(x_t) - \hat v_t\|^2 \le \frac{1}{2\eta}\mathbb{E}\|x_{t+1} - \bar{x}_{t+1}\|^2 + \frac{L^2 d\eta}{nb}\sum_{i=1}^{n}\mathbb{E}\|x_t - z_i^t\|^2 + L^2 d\mu^2\eta, \quad (102)$$
where the first inequality holds by the Cauchy-Schwarz and Young's inequalities and the second inequality holds by Lemma 9. Combining (101) with (102), we have
$$\mathbb{E}[F(x_{t+1})] \le \mathbb{E}\Big[F(x_t) + \frac{L^2 d\eta}{nb}\sum_{i=1}^{n}\|x_t - z_i^t\|^2 + L^2 d\mu^2\eta + \Big(\frac{L}{2} - \frac{1}{2\eta}\Big)\|x_{t+1} - x_t\|^2 + \Big(L - \frac{1}{2\eta}\Big)\|\bar{x}_{t+1} - x_t\|^2\Big]. \quad (103)$$
Next, we define a useful Lyapunov function:
$$\Phi_t = \mathbb{E}\Big[F(x_t) + \frac{c_t}{n}\sum_{i=1}^{n}\|x_t - z_i^t\|^2\Big], \quad (104)$$
where $\{c_t\}$ is a nonnegative sequence.

By the step 7 of Algorithm 2, we have
$$\frac{1}{n}\sum_{i=1}^{n}\|x_{t+1} - z_i^{t+1}\|^2 = \frac{1}{n}\sum_{i=1}^{n}\big(p\|x_{t+1} - x_t\|^2 + (1-p)\|x_{t+1} - z_i^t\|^2\big) = p\|x_{t+1} - x_t\|^2 + \frac{1-p}{n}\sum_{i=1}^{n}\|x_{t+1} - z_i^t\|^2, \quad (105)$$
where $p$ denotes the probability of an index $i$ being in $\mathcal{I}_t$. Here we have
$$p = 1 - \Big(1 - \frac{1}{n}\Big)^b \ge 1 - \frac{1}{1 + b/n} = \frac{b/n}{1 + b/n} \ge \frac{b}{2n}, \quad (106)$$
where the first inequality follows from $(1-a)^b \le \frac{1}{1+ab}$, and the second inequality holds by $b \le n$. Considering the upper bound of $\|x_{t+1} - z_i^t\|^2$, we have
$$\|x_{t+1} - z_i^t\|^2 = \|x_{t+1} - x_t\|^2 + 2(x_{t+1} - x_t)^T(x_t - z_i^t) + \|x_t - z_i^t\|^2 \le \Big(1 + \frac{1}{\beta}\Big)\|x_{t+1} - x_t\|^2 + (1+\beta)\|x_t - z_i^t\|^2, \quad (107)$$
where $\beta > 0$. Combining (105) with (107), we have
$$\frac{1}{n}\sum_{i=1}^{n}\|x_{t+1} - z_i^{t+1}\|^2 \le \Big(1 + \frac{1-p}{\beta}\Big)\|x_{t+1} - x_t\|^2 + \frac{(1-p)(1+\beta)}{n}\sum_{i=1}^{n}\|x_t - z_i^t\|^2. \quad (108)$$
It follows that
$$\Phi_{t+1} \le \Phi_t + \Big(\frac{L}{2} - \frac{1}{2\eta} + c_{t+1}\Big(1 + \frac{1-p}{\beta}\Big)\Big)\mathbb{E}\|x_{t+1} - x_t\|^2 + \Big(L - \frac{1}{2\eta}\Big)\mathbb{E}\|\bar{x}_{t+1} - x_t\|^2 + L^2 d\mu^2\eta, \quad (109)$$
where $c_t = \frac{L^2 d\eta}{b} + c_{t+1}(1-p)(1+\beta)$. Let $c_T = 0$ and $\beta = \frac{b}{4n}$. Since $(1-p)(1+\beta) \le 1 + \beta - p$ and $p \ge \frac{b}{2n}$, we have
$$c_t \le c_{t+1}(1-\theta) + \frac{L^2 d\eta}{b}, \quad (110)$$
where $\theta = p - \beta \ge \frac{b}{4n}$. Recursing on $t$, for $0 \le t \le T-1$, we have
$$c_t \le \frac{L^2 d\eta}{b}\cdot\frac{1 - (1-\theta)^{T-t}}{\theta} \le \frac{L^2 d\eta}{b\theta} \le \frac{4nL^2 d\eta}{b^2}. \quad (111)$$
Using $\eta = \frac{\rho}{dL}$ $(0 < \rho < 1)$, we obtain $c_t \le \frac{4n\rho L}{b^2}$. It follows that
$$c_{t+1}\Big(1 + \frac{1-p}{\beta}\Big) + \frac{L}{2} \le \frac{4n\rho L}{b^2}\Big(\frac{4n}{b} - 1\Big) + \frac{L}{2} \le \frac{16\rho L n^2}{b^3} + \frac{L}{2} = \Big(\frac{32\rho^2 n^2}{b^3} + \rho\Big)\frac{L}{2\rho} \le \frac{L}{2\rho} \le \frac{1}{2\eta}, \quad (112)$$
where we used $1 + \frac{1-p}{\beta} \le 1 + (1 - \frac{b}{2n})\frac{4n}{b} = \frac{4n}{b} - 1$, and the third inequality holds by $\frac{32\rho^2 n^2}{b^3} + \rho \le 1$. Thus, we obtain
$$\Phi_{t+1} \le \Phi_t + \Big(L - \frac{1}{2\eta}\Big)\mathbb{E}\|\bar{x}_{t+1} - x_t\|^2 + L^2 d\mu^2\eta. \quad (113)$$
Summing the inequality (113) across all the iterations, we have
$$\frac{1}{T}\sum_{t=1}^{T}\Big(\frac{1}{2\eta} - L\Big)\mathbb{E}\|x_t - \bar{x}_{t+1}\|^2 \le \frac{\Phi_0 - \Phi_T}{T} + L^2 d\mu^2\eta. \quad (114)$$
Since $c_T = 0$ and $z_i^0 = x^0$ for all $i = 1, 2, \cdots, n$, we have
$$\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\|g_\eta(x_t)\|^2 \le \frac{\mathbb{E}[F(x^0) - F(x_T)]}{T\gamma} + \frac{L^2 d\mu^2\eta}{\gamma} \le \frac{\mathbb{E}[F(x^0) - F(x^*)]}{T\gamma} + \frac{L^2 d\mu^2\eta}{\gamma}, \quad (115)$$
where $\gamma = \frac{\eta}{2} - L\eta^2$ and
$$g_\eta(x_t) = \frac{1}{\eta}\big[x_t - \mathrm{Prox}_{\eta\psi}(x_t - \eta\nabla f(x_t))\big] = \frac{1}{\eta}(x_t - \bar{x}_{t+1}). \quad (116)$$
Given $b = [n^{2/3}]$ and $\rho = \frac{1}{8}$, it is easily verified that $\frac{32\rho^2 n^2}{b^3} + \rho = \frac{5}{8} \le 1$ and $\gamma = \frac{\eta}{2} - L\eta^2 = \frac{1}{16dL} - \frac{1}{64d^2 L} \ge \frac{3}{64dL}$. Finally, letting $\mu = O(\frac{1}{\sqrt{dT}})$, we can obtain the above result.

Convergence Analysis of ZO-ProxSAGA-GauSGE
In this section, we give the convergence analysis of ZO-ProxSAGA-GauSGE. First, we give a useful lemma that upper-bounds the variance of the estimated gradient.
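Before the formal analysis, the GauSGE estimator itself can be illustrated numerically. The sketch below is a minimal, illustrative implementation of a Gaussian smoothing gradient estimator; the quadratic test function, the smoothing parameter `mu`, and the sample count are arbitrary choices for the demonstration, not quantities fixed by the paper:

```python
import numpy as np

def gausge(f, x, mu, num_samples, rng):
    """Gaussian smoothing gradient estimate of f at x:
    average of (f(x + mu*u) - f(x)) / mu * u over u ~ N(0, I_d)."""
    d = x.size
    U = rng.standard_normal((num_samples, d))
    vals = np.apply_along_axis(f, 1, x + mu * U)  # f at the perturbed points
    return np.mean(((vals - f(x)) / mu)[:, None] * U, axis=0)

rng = np.random.default_rng(0)
d = 5
A = np.diag(np.arange(1.0, d + 1.0))
f = lambda v: 0.5 * v @ A @ v        # smooth test function with gradient A v
x = rng.standard_normal(d)
g_hat = gausge(f, x, mu=1e-4, num_samples=100_000, rng=rng)
rel_err = np.linalg.norm(g_hat - A @ x) / np.linalg.norm(A @ x)
print(rel_err)  # small relative error, using only function values
```

With a small smoothing parameter the estimate is nearly unbiased, but its variance grows with the dimension $d$; this dimension dependence is exactly why factors such as $(2d+9)$ and $(d+6)^3$ appear in the bounds of this section.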
Lemma 10.
In Algorithm 2 using GauSGE, given the estimated gradient $\hat{v}_t = \frac{1}{b}\sum_{i_t\in I_t}\big(\hat{\nabla} f_{i_t}(x_t) - \hat{\nabla} f_{i_t}(z^t_{i_t})\big) + \hat{\phi}_t$ with $\hat{\phi}_t = \frac{1}{n}\sum_{i=1}^n \hat{\nabla} f_i(z^t_i)$, the following inequality holds:
$$
\mathbb{E}\|\hat{v}_t - \nabla f(x_t)\|^2 \le \frac{6L^2}{nb}\sum_{i=1}^n \mathbb{E}\|x_t - z^t_i\|^2 + \Big(4 + \frac{24}{b}\Big)(2d+9)\sigma^2 + \Big(2 + \frac{12}{b}\Big)(d+6)^3 L^2\mu^2. \tag{117}
$$

Proof.
By the definition of the estimated gradient $\hat{v}_t$, we have
$$
\begin{aligned}
\mathbb{E}\|\hat{v}_t - \nabla f(x_t)\|^2 &= \mathbb{E}\Big\|\frac{1}{b}\sum_{i_t\in I_t}\big(\hat{\nabla} f_{i_t}(x_t) - \hat{\nabla} f_{i_t}(z^t_{i_t})\big) + \hat{\phi}_t - \nabla f(x_t)\Big\|^2 \\
&= \mathbb{E}\Big\|\frac{1}{b}\sum_{i_t\in I_t}\big(\hat{\nabla} f_{i_t}(x_t) - \hat{\nabla} f_{i_t}(z^t_{i_t})\big) + \hat{\phi}_t - \hat{\nabla} f(x_t) + \hat{\nabla} f(x_t) - \nabla f(x_t)\Big\|^2 \\
&\le 2\,\mathbb{E}\Big\|\frac{1}{b}\sum_{i_t\in I_t}\big(\hat{\nabla} f_{i_t}(x_t) - \hat{\nabla} f_{i_t}(z^t_{i_t})\big) + \hat{\phi}_t - \hat{\nabla} f(x_t)\Big\|^2 + 2\,\mathbb{E}\|\hat{\nabla} f(x_t) - \nabla f(x_t)\|^2 \\
&\le 2\,\mathbb{E}\Big\|\frac{1}{b}\sum_{i_t\in I_t}\big(\hat{\nabla} f_{i_t}(x_t) - \hat{\nabla} f_{i_t}(z^t_{i_t})\big) - \big(\hat{\nabla} f(x_t) - \hat{\phi}_t\big)\Big\|^2 + 4(2d+9)\,\mathbb{E}\|\nabla f(x_t)\|^2 + 2(d+6)^3 L^2\mu^2, 
\end{aligned} \tag{118}
$$
where the second inequality holds by Lemma 6. Using
$$
\mathbb{E}_{I_t}\Big[\frac{1}{b}\sum_{i_t\in I_t}\hat{\nabla} f_{i_t}(x_t) - \hat{\nabla} f_{i_t}(z^t_{i_t})\Big] = \hat{\nabla} f(x_t) - \frac{1}{n}\sum_{i=1}^n \hat{\nabla} f_i(z^t_i) = \hat{\nabla} f(x_t) - \hat{\phi}_t, \tag{119}
$$
then we have
$$
\sum_{i=1}^n\Big(\hat{\nabla} f_i(x_t) - \hat{\nabla} f_i(z^t_i) - \big(\hat{\nabla} f(x_t) - \hat{\phi}_t\big)\Big) = n\big(\hat{\nabla} f(x_t) - \hat{\phi}_t\big) - n\big(\hat{\nabla} f(x_t) - \hat{\phi}_t\big) = 0. \tag{120}
$$
It follows that
$$
\begin{aligned}
\mathbb{E}\Big\|\frac{1}{b}\sum_{i_t\in I_t}\big(\hat{\nabla} f_{i_t}(x_t) - \hat{\nabla} f_{i_t}(z^t_{i_t})\big) - \big(\hat{\nabla} f(x_t) - \hat{\phi}_t\big)\Big\|^2 &\le \frac{1}{bn}\sum_{i=1}^n \mathbb{E}\Big\|\hat{\nabla} f_i(x_t) - \hat{\nabla} f_i(z^t_i) - \big(\hat{\nabla} f(x_t) - \hat{\phi}_t\big)\Big\|^2 \\
&= \frac{1}{bn}\sum_{i=1}^n \mathbb{E}\|\hat{\nabla} f_i(x_t) - \hat{\nabla} f_i(z^t_i)\|^2 - \frac{1}{b}\|\hat{\nabla} f(x_t) - \hat{\phi}_t\|^2 \\
&\le \frac{1}{bn}\sum_{i=1}^n \mathbb{E}\|\hat{\nabla} f_i(x_t) - \hat{\nabla} f_i(z^t_i)\|^2, 
\end{aligned} \tag{121}
$$
where the first inequality holds by Lemmas 4 and 5 in (Liu et al., 2018c).
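The first inequality in (121) is the standard variance bound for a size-$b$ mini-batch drawn uniformly with replacement: the expected squared norm of the averaged, mean-centered terms is $1/b$ times the average squared norm of the individual centered terms. A Monte-Carlo sanity check of this fact, with arbitrary stand-in vectors $a_i$ (not quantities from the paper), is:

```python
import numpy as np

rng = np.random.default_rng(1)
n, b, d, trials = 8, 3, 4, 200_000
a = rng.standard_normal((n, d))   # stand-ins for the per-component gradient terms
a_bar = a.mean(axis=0)

# Monte Carlo: draw b indices with replacement, average the terms, measure error
idx = rng.integers(0, n, size=(trials, b))
mc = ((a[idx].mean(axis=1) - a_bar) ** 2).sum(axis=1).mean()

# Closed form behind (121): (1/b) * average of ||a_i - a_bar||^2
exact = ((a - a_bar) ** 2).sum(axis=1).mean() / b
print(mc, exact)  # the two values agree up to Monte-Carlo noise
```

Dropping the non-positive $-\frac{1}{b}\|\hat{\nabla} f(x_t) - \hat{\phi}_t\|^2$ term then yields the last inequality of (121).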
By (32), we have
$$
\begin{aligned}
\mathbb{E}\|\hat{\nabla} f_i(x_t) - \hat{\nabla} f_i(z^t_i)\|^2 &= \mathbb{E}\|\hat{\nabla} f_i(x_t) - \nabla f_i(x_t) + \nabla f_i(x_t) - \nabla f_i(z^t_i) + \nabla f_i(z^t_i) - \hat{\nabla} f_i(z^t_i)\|^2 \\
&\le 3\,\mathbb{E}\|\hat{\nabla} f_i(x_t) - \nabla f_i(x_t)\|^2 + 3\,\|\nabla f_i(x_t) - \nabla f_i(z^t_i)\|^2 + 3\,\mathbb{E}\|\nabla f_i(z^t_i) - \hat{\nabla} f_i(z^t_i)\|^2 \\
&\le 6(2d+9)\big(\|\nabla f_i(x_t)\|^2 + \|\nabla f_i(z^t_i)\|^2\big) + 3L^2\|x_t - z^t_i\|^2 + 6(d+6)^3 L^2\mu^2, 
\end{aligned} \tag{122}
$$
where the first inequality follows from Jensen's inequality, and the second inequality holds by Lemma 6. Finally, combining the inequalities (118), (121) and (122), together with the bound $\mathbb{E}\|\nabla f_i(x)\|^2 \le \sigma^2$, we obtain the above result. $\square$

Next, based on the above lemma, we study the convergence properties of ZO-ProxSAGA-GauSGE.

Theorem 8.
Assume the sequence $\{x_t\}$ is generated from Algorithm 2 using the GauSGE, and define the positive sequence $\{c_t\}$ as follows:
$$
c_t = \frac{3L^2\eta}{b} + c_{t+1}(1-p)(1+\beta), \tag{123}
$$
where $\beta > 0$. Let $c_T = 0$, $\eta = \frac{\rho}{L}$ $(0 < \rho < \frac{1}{2})$, and let $b$ satisfy the following inequality:
$$
\frac{48\rho n^2}{b^3} + \rho \le \frac{1}{2}, \tag{124}
$$
then we have
$$
\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\|g_\eta(x_t)\|^2 \le \frac{\mathbb{E}[F(x^0) - F(x^*)]}{T\gamma} + \Big(2 + \frac{12}{b}\Big)(2d+9)\frac{\eta\sigma^2}{\gamma} + \Big(1 + \frac{6}{b}\Big)(d+6)^3\frac{L^2\mu^2\eta}{\gamma}, \tag{125}
$$
where $\gamma = \frac{\eta}{2} - L\eta^2$ and $x^*$ is an optimal solution of the problem (2). Further, given $b = \lceil n^{2/3}\rceil$, $\rho = \frac{1}{100}$ and $\mu = O\big(\frac{1}{d\sqrt{T}}\big)$, we have
$$
\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\|g_\eta(x_t)\|^2 \le \frac{10^4 L\,\mathbb{E}[F(x^0) - F(x^*)]}{49\,T} + O\Big(\frac{d}{T}\Big) + O(d\sigma^2). \tag{126}
$$

Proof.
This proof is similar to the proof of Theorem 3. We begin by defining an iteration that uses the full true gradient:
$$
\bar{x}_{t+1} = \mathrm{Prox}_{\eta\psi}\big(x_t - \eta\nabla f(x_t)\big). \tag{127}
$$
Using Lemma 2 of Reddi et al. (2016), for any $z \in \mathbb{R}^d$ we have
$$
F(\bar{x}_{t+1}) \le F(z) + \Big(\frac{L}{2} - \frac{1}{2\eta}\Big)\|\bar{x}_{t+1} - x_t\|^2 + \Big(\frac{L}{2} + \frac{1}{2\eta}\Big)\|z - x_t\|^2 - \frac{1}{2\eta}\|\bar{x}_{t+1} - z\|^2. \tag{128}
$$
Since $x_{t+1} = \mathrm{Prox}_{\eta\psi}\big(x_t - \eta\hat{v}_t\big)$, similarly, we have
$$
F(x_{t+1}) \le F(z) + \langle x_{t+1} - z, \nabla f(x_t) - \hat{v}_t\rangle + \Big(\frac{L}{2} - \frac{1}{2\eta}\Big)\|x_{t+1} - x_t\|^2 + \Big(\frac{L}{2} + \frac{1}{2\eta}\Big)\|z - x_t\|^2 - \frac{1}{2\eta}\|x_{t+1} - z\|^2. \tag{129}
$$
Setting $z = x_t$ in (128) and $z = \bar{x}_{t+1}$ in (129), then summing them together and taking expectations, we obtain
$$
\mathbb{E}[F(x_{t+1})] \le \mathbb{E}\Big[F(x_t) + \underbrace{\langle x_{t+1} - \bar{x}_{t+1}, \nabla f(x_t) - \hat{v}_t\rangle}_{T_1} + \Big(\frac{L}{2} - \frac{1}{2\eta}\Big)\|x_{t+1} - x_t\|^2 + \Big(L - \frac{1}{2\eta}\Big)\|\bar{x}_{t+1} - x_t\|^2 - \frac{1}{2\eta}\|x_{t+1} - \bar{x}_{t+1}\|^2\Big]. \tag{130}
$$
Next, we give an upper bound of the term $T_1$ as follows:
$$
\begin{aligned}
T_1 &= \mathbb{E}\langle x_{t+1} - \bar{x}_{t+1}, \nabla f(x_t) - \hat{v}_t\rangle \\
&\le \frac{1}{2\eta}\mathbb{E}\|x_{t+1} - \bar{x}_{t+1}\|^2 + \frac{\eta}{2}\mathbb{E}\|\nabla f(x_t) - \hat{v}_t\|^2 \\
&\le \frac{1}{2\eta}\mathbb{E}\|x_{t+1} - \bar{x}_{t+1}\|^2 + \frac{3L^2\eta}{nb}\sum_{i=1}^n \mathbb{E}\|x_t - z^t_i\|^2 + \Big(2 + \frac{12}{b}\Big)(2d+9)\eta\sigma^2 + \Big(1 + \frac{6}{b}\Big)(d+6)^3 L^2\mu^2\eta, 
\end{aligned} \tag{131}
$$
where the first inequality holds by the Cauchy–Schwarz and Young's inequalities, and the second inequality holds by Lemma 10. Combining (130) with (131), we have
$$
\mathbb{E}[F(x_{t+1})] \le \mathbb{E}\Big[F(x_t) + \frac{3L^2\eta}{nb}\sum_{i=1}^n \mathbb{E}\|x_t - z^t_i\|^2 + \Big(2 + \frac{12}{b}\Big)(2d+9)\eta\sigma^2 + \Big(1 + \frac{6}{b}\Big)(d+6)^3 L^2\mu^2\eta + \Big(\frac{L}{2} - \frac{1}{2\eta}\Big)\|x_{t+1} - x_t\|^2 + \Big(L - \frac{1}{2\eta}\Big)\|\bar{x}_{t+1} - x_t\|^2\Big]. \tag{132}
$$
Next, we define a useful Lyapunov function as follows:
$$
\Omega_t = \mathbb{E}\Big[F(x_t) + \frac{c_t}{n}\sum_{i=1}^n \|x_t - z^t_i\|^2\Big], \tag{133}
$$
where $\{c_t\}$ is a nonnegative sequence. By step 7 of Algorithm 2, we have
$$
\frac{1}{n}\sum_{i=1}^n \|x_{t+1} - z^{t+1}_i\|^2 = \frac{1}{n}\sum_{i=1}^n\big(p\|x_{t+1} - x_t\|^2 + (1-p)\|x_{t+1} - z^t_i\|^2\big) = p\|x_{t+1} - x_t\|^2 + \frac{1-p}{n}\sum_{i=1}^n \|x_{t+1} - z^t_i\|^2, \tag{134}
$$
where $p$ denotes the probability of an index $i$ being in $I_t$. Here we have
$$
p = 1 - \Big(1 - \frac{1}{n}\Big)^b \ge 1 - \frac{1}{1 + b/n} = \frac{b/n}{1 + b/n} \ge \frac{b}{2n}, \tag{135}
$$
where the first inequality follows from $(1-a)^b \le \frac{1}{1+ab}$, and the second inequality holds by $b \le n$. Considering the upper bound of $\|x_{t+1} - z^t_i\|^2$, we have
$$
\begin{aligned}
\|x_{t+1} - z^t_i\|^2 &= \|x_{t+1} - x_t + x_t - z^t_i\|^2 \\
&= \|x_{t+1} - x_t\|^2 + 2(x_{t+1} - x_t)^T(x_t - z^t_i) + \|x_t - z^t_i\|^2 \\
&\le \|x_{t+1} - x_t\|^2 + \Big(\frac{1}{\beta}\|x_{t+1} - x_t\|^2 + \beta\|x_t - z^t_i\|^2\Big) + \|x_t - z^t_i\|^2 \\
&= \Big(1 + \frac{1}{\beta}\Big)\|x_{t+1} - x_t\|^2 + (1 + \beta)\|x_t - z^t_i\|^2, 
\end{aligned} \tag{136}
$$
where $\beta > 0$ and the inequality uses Young's inequality. Combining (134) with (136), we have
$$
\frac{1}{n}\sum_{i=1}^n\|x_{t+1} - z^{t+1}_i\|^2 \le \Big(1 + \frac{1-p}{\beta}\Big)\|x_{t+1} - x_t\|^2 + (1-p)(1+\beta)\frac{1}{n}\sum_{i=1}^n\|x_t - z^t_i\|^2. \tag{137}
$$
It follows that
$$
\begin{aligned}
\Omega_{t+1} &= \mathbb{E}\Big[F(x_{t+1}) + \frac{c_{t+1}}{n}\sum_{i=1}^n\|x_{t+1} - z^{t+1}_i\|^2\Big] \\
&\le \mathbb{E}\Big[F(x_{t+1}) + c_{t+1}\Big(1 + \frac{1-p}{\beta}\Big)\|x_{t+1} - x_t\|^2 + c_{t+1}(1-p)(1+\beta)\frac{1}{n}\sum_{i=1}^n\|x_t - z^t_i\|^2\Big] \\
&\le \mathbb{E}\Big[F(x_t) + \Big(\frac{3L^2\eta}{b} + c_{t+1}(1-p)(1+\beta)\Big)\frac{1}{n}\sum_{i=1}^n \mathbb{E}\|x_t - z^t_i\|^2 + \Big(\frac{L}{2} - \frac{1}{2\eta} + c_{t+1}\Big(1 + \frac{1-p}{\beta}\Big)\Big)\|x_{t+1} - x_t\|^2 \\
&\quad + \Big(L - \frac{1}{2\eta}\Big)\|\bar{x}_{t+1} - x_t\|^2 + \Big(2 + \frac{12}{b}\Big)(2d+9)\eta\sigma^2 + \Big(1 + \frac{6}{b}\Big)(d+6)^3 L^2\mu^2\eta\Big] \\
&\le \Omega_t + \Big(\frac{L}{2} - \frac{1}{2\eta} + c_{t+1}\Big(1 + \frac{1-p}{\beta}\Big)\Big)\|x_{t+1} - x_t\|^2 + \Big(L - \frac{1}{2\eta}\Big)\|\bar{x}_{t+1} - x_t\|^2 \\
&\quad + \Big(2 + \frac{12}{b}\Big)(2d+9)\eta\sigma^2 + \Big(1 + \frac{6}{b}\Big)(d+6)^3 L^2\mu^2\eta, 
\end{aligned} \tag{138}
$$
where $c_t = \frac{3L^2\eta}{b} + c_{t+1}(1-p)(1+\beta)$.

Let $c_T = 0$ and $\beta = \frac{b}{4n}$. Since $(1-p)(1+\beta) = 1 + \beta - p - p\beta \le 1 + \beta - p$ and $p \ge \frac{b}{2n}$, it follows that
$$
c_t \le c_{t+1}(1-\theta) + \frac{3L^2\eta}{b}, \tag{139}
$$
where $\theta = p - \beta \ge \frac{b}{4n}$. Then, recursing on $t$, for $0 \le t \le T-1$, we have
$$
c_t \le \frac{3L^2\eta}{b}\cdot\frac{1 - (1-\theta)^{T-t}}{\theta} \le \frac{3L^2\eta}{b\theta} \le \frac{12 n L^2\eta}{b^2}. \tag{140}
$$
Let $\eta = \frac{\rho}{L}$ $(0 < \rho < \frac{1}{2})$; we have $c_t \le \frac{12 n\rho L}{b^2}$. It follows that
$$
c_{t+1}\Big(1 + \frac{1-p}{\beta}\Big) + L \le \frac{12 n\rho L}{b^2}\Big(1 + \frac{4n - 2b}{b}\Big) + L = \frac{12 n\rho L}{b^2}\Big(\frac{4n}{b} - 1\Big) + L \le \frac{48\rho L n^2}{b^3} + L = \Big(\frac{48\rho n^2}{b^3} + \rho\Big)\frac{L}{\rho} \le \frac{L}{2\rho} = \frac{1}{2\eta}, \tag{141}
$$
where the third inequality holds by $\frac{48\rho n^2}{b^3} + \rho \le \frac{1}{2}$. Thus, we obtain
$$
\Omega_{t+1} \le \Omega_t + \Big(L - \frac{1}{2\eta}\Big)\|\bar{x}_{t+1} - x_t\|^2 + \Big(2 + \frac{12}{b}\Big)(2d+9)\eta\sigma^2 + \Big(1 + \frac{6}{b}\Big)(d+6)^3 L^2\mu^2\eta. \tag{142}
$$
Summing the inequality (142) across all iterations, we have
$$
\frac{1}{T}\sum_{t=0}^{T-1}\Big(\frac{1}{2\eta} - L\Big)\mathbb{E}\|x_t - \bar{x}_{t+1}\|^2 \le \frac{\Omega_0 - \Omega_T}{T} + \Big(2 + \frac{12}{b}\Big)(2d+9)\eta\sigma^2 + \Big(1 + \frac{6}{b}\Big)(d+6)^3 L^2\mu^2\eta. \tag{143}
$$
Since $c_T = 0$ and $z^0_i = x^0$ for all $i = 1, 2, \cdots, n$, we have
$$
\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\|g_\eta(x_t)\|^2 \le \frac{\mathbb{E}[F(x^0) - F(x^T)]}{T\gamma} + \Big(2 + \frac{12}{b}\Big)(2d+9)\frac{\eta\sigma^2}{\gamma} + \Big(1 + \frac{6}{b}\Big)(d+6)^3\frac{L^2\mu^2\eta}{\gamma}, \tag{144}
$$
where $\gamma = \frac{\eta}{2} - L\eta^2$ and
$$
g_\eta(x_t) = \frac{1}{\eta}\big[x_t - \mathrm{Prox}_{\eta\psi}\big(x_t - \eta\nabla f(x_t)\big)\big] = \frac{1}{\eta}(x_t - \bar{x}_{t+1}). \tag{145}
$$
Given $b = \lceil n^{2/3}\rceil$ and $\rho = \frac{1}{100}$, it is easily verified that $\frac{48\rho n^2}{b^3} + \rho \le \frac{49}{100} \le \frac{1}{2}$ and $\gamma = \frac{\eta}{2} - L\eta^2 = \frac{1}{200L} - \frac{1}{10^4 L} = \frac{49}{10^4 L}$. Finally, letting $\mu = O\big(\frac{1}{d\sqrt{T}}\big)$, we obtain the above result. $\square$
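The backward recursion $c_t = \frac{3L^2\eta}{b} + c_{t+1}(1-p)(1+\beta)$ with $c_T = 0$, together with the two bounds it must satisfy, the upper bound on $c_t$ in (140) and the step-size condition in (141), can be sanity-checked numerically. The sketch below uses the proof's parameter choices $\eta = \rho/L$, $\beta = \frac{b}{4n}$, $\rho = \frac{1}{100}$, $b = \lceil n^{2/3}\rceil$; the Lipschitz constant `L` and the horizon `T` are arbitrary test values:

```python
import math

def check(n, L=2.0, rho=0.01, T=10_000):
    b = math.ceil(n ** (2.0 / 3.0))
    eta = rho / L
    p = 1.0 - (1.0 - 1.0 / n) ** b   # prob. that a given index lands in the batch
    beta = b / (4.0 * n)
    c, ok = 0.0, True                # c_T = 0, recurse backwards in t
    for _ in range(T):
        c = 3.0 * L**2 * eta / b + c * (1.0 - p) * (1.0 + beta)
        ok &= c <= 12.0 * n * L**2 * eta / b**2                    # bound (140)
        ok &= c * (1.0 + (1.0 - p) / beta) + L <= 1.0 / (2 * eta)  # bound (141)
    return ok

print(all(check(n) for n in (10, 100, 1000)))  # True
```

Since $(1-p)(1+\beta) < 1$ under these choices, the recursion converges monotonically to its fixed point, so the two bounds hold uniformly over all $t$.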