An Optimal Hybrid Variance-Reduced Algorithm for Stochastic Composite Nonconvex Optimization
Deyi Liu, Lam M. Nguyen, and Quoc Tran-Dinh

August 21, 2020
Abstract
In this note, we propose a new variant of the hybrid variance-reduced proximal gradient method in [7] to solve a common stochastic composite nonconvex optimization problem under standard assumptions. We simply replace the independent unbiased estimator in our hybrid SARAH estimator introduced in [7] by the stochastic gradient evaluated at the same sample, which makes it identical to the momentum SARAH estimator introduced in [2]. This allows us to save one stochastic gradient per iteration compared to [7] and requires only two samples per iteration. Our algorithm is very simple and achieves the optimal stochastic oracle complexity bound in terms of stochastic gradient evaluations (up to a constant factor). Our analysis is essentially inspired by [7], but we do not use two different step-sizes.
We consider the following stochastic composite, possibly nonconvex, optimization problem:
$$\min_{x\in\mathbb{R}^p}\Big\{ F(x) := \mathbb{E}_{\xi}\big[f_{\xi}(x)\big] + \psi(x) \Big\}, \qquad (1)$$
where $f_{\xi}(\cdot) : \mathbb{R}^p \times \Omega \to \mathbb{R}$ is a stochastic function such that, for each $x \in \mathbb{R}^p$, $f_{\xi}(x)$ is a random variable on a given probability space $(\Omega, \mathbb{P})$, while for each realization $\xi \in \Omega$, $f_{\xi}(\cdot)$ is differentiable on $\mathbb{R}^p$; $f(x) := \mathbb{E}_{\xi}[f_{\xi}(x)]$ is the expectation of the random function $f_{\xi}(x)$ over $\xi$ on $\Omega$; and $\psi : \mathbb{R}^p \to \mathbb{R}\cup\{+\infty\}$ is a proper, closed, and convex function.

Our algorithm developed in this note relies on the following fundamental assumptions:

Assumption 1.1.
The objective functions $f$ and $\psi$ of (1) satisfy the following conditions:

(a) (Convexity of $\psi$) $\psi : \mathbb{R}^p \to \mathbb{R}\cup\{+\infty\}$ is proper, closed, and convex. In addition, $\mathrm{dom}(F) := \mathrm{dom}(f)\cap\mathrm{dom}(\psi) \neq \emptyset$.

(b) (Boundedness from below) There exists a finite lower bound
$$F^{\star} := \inf_{x\in\mathbb{R}^p}\Big\{ F(x) := f(x) + \psi(x) \Big\} > -\infty. \qquad (2)$$

(c) ($L$-average smoothness) The expectation function $f(\cdot)$ is $L$-smooth on $\mathrm{dom}(F)$, i.e., there exists $L \in (0, +\infty)$ such that
$$\mathbb{E}_{\xi}\big[\|\nabla f_{\xi}(x) - \nabla f_{\xi}(y)\|^2\big] \le L^2\|x - y\|^2, \quad \forall x, y \in \mathrm{dom}(F). \qquad (3)$$

(d) (Bounded variance) There exists $\sigma \in [0, +\infty)$ such that
$$\mathbb{E}_{\xi}\big[\|\nabla f_{\xi}(x) - \nabla f(x)\|^2\big] \le \sigma^2, \quad \forall x \in \mathrm{dom}(F). \qquad (4)$$

These assumptions are very standard in stochastic optimization and are required for various gradient-based methods. Unlike [2], we do not impose a bounded gradient assumption, i.e., $\|\nabla f_{\xi}(x)\| \le G$ for all $x \in \mathbb{R}^p$. Algorithm 1 below has a single loop and achieves the optimal oracle complexity bound since it matches the lower bound complexity in [1] up to a constant factor.
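To make the setting concrete, the following sketch (an illustration, not taken from the paper) builds one hypothetical instance of (1), a finite-sum least-squares loss with an $\ell_1$ regularizer, and numerically estimates the constants $L$ and $\sigma^2$ of Assumption 1.1 for it. All names are illustrative.

```python
# Hypothetical instance of (1): f_xi(x) = 0.5*(a_xi^T x - b_xi)^2 with xi uniform
# over a finite dataset, and psi(x) = lam*||x||_1. Illustration of Assumption 1.1 only.
import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 200, 10, 0.1
A = rng.normal(size=(n, p))          # rows a_xi
b = rng.normal(size=n)               # targets b_xi

def grad_f_xi(x, i):                 # stochastic gradient of f_xi at x
    return A[i] * (A[i] @ x - b[i])

def grad_f(x):                       # full (expected) gradient of f
    return A.T @ (A @ x - b) / n

# (c) L-average smoothness: ||grad f_i(x) - grad f_i(y)|| <= ||a_i||^2 ||x - y||,
# so E_xi ||grad f_xi(x) - grad f_xi(y)||^2 <= (mean_i ||a_i||^4) ||x - y||^2.
L = np.sqrt(np.mean(np.sum(A**2, axis=1) ** 2))

# (d) variance of the stochastic gradient at a test point; for least squares a
# uniform bound sigma^2 only holds on a bounded domain, so this is a local check.
x0 = rng.normal(size=p)
var_x0 = np.mean([np.sum((grad_f_xi(x0, i) - grad_f(x0)) ** 2) for i in range(n)])
print(f"L ~= {L:.2f}, gradient variance at x0 ~= {var_x0:.2f}")
```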
2 Hybrid Variance-Reduced Proximal Gradient Algorithm

We first propose a new variant of [7, Algorithm 1] for solving (1) and then analyze its convergence and oracle complexity.
We propose a novel hybrid variance-reduced proximal gradient method to solve (1) under standard assumptions (i.e., Assumption 1.1), as described in Algorithm 1.
Algorithm 1 (Hybrid Variance-Reduced Proximal Gradient Algorithm)

Initialization:
1: An arbitrary initial point $x_0 \in \mathrm{dom}(F)$.
2: Choose an initial batch size $\tilde{b} \ge 1$, a weight $\beta \in (0, 1)$, and a step-size $\eta > 0$ as in Theorem 2.1 below.
3: Generate an unbiased estimator $v_0 := \frac{1}{\tilde{b}}\sum_{\tilde{\xi}_i\in\widetilde{\mathcal{B}}}\nabla f_{\tilde{\xi}_i}(x_0)$ at $x_0$ using a mini-batch $\widetilde{\mathcal{B}}$ of size $\tilde{b}$.
4: Update $x_1 := \mathrm{prox}_{\eta\psi}(x_0 - \eta v_0)$.
5: For $t := 1, \cdots, T$ do
6:   Generate a proper sample $\xi_t$ (single sample or mini-batch).
7:   Evaluate $v_t$ and update $x_{t+1}$ as
$$\begin{cases} v_t := \nabla f_{\xi_t}(x_t) + (1-\beta)\big[v_{t-1} - \nabla f_{\xi_t}(x_{t-1})\big], \\ x_{t+1} := \mathrm{prox}_{\eta\psi}(x_t - \eta v_t). \end{cases} \qquad (5)$$
8: EndFor
9: Choose $\bar{x}_T$ uniformly at random from $\{x_0, x_1, \cdots, x_T\}$.

Compared to [7, Algorithm 1], the new algorithm, Algorithm 1, has two major differences. First, it uses a new estimator $v_t$ adopted from [2]. This estimator can also be viewed as a variant of the hybrid SARAH estimator in [7] obtained by using the same sample $\xi_t$ for the unbiased term $\nabla f_{\zeta_t}(x_t)$. That is,

Hybrid SARAH [7]: $v^h_t := (1-\beta)\big[v^h_{t-1} + \nabla f_{\xi_t}(x_t) - \nabla f_{\xi_t}(x_{t-1})\big] + \beta\nabla f_{\zeta_t}(x_t)$, with $\zeta_t$ independent of $\xi_t$;
Momentum SARAH [2]: $v_t := (1-\beta)\big[v_{t-1} + \nabla f_{\xi_t}(x_t) - \nabla f_{\xi_t}(x_{t-1})\big] + \beta\nabla f_{\zeta_t}(x_t)$, with $\zeta_t = \xi_t$.

Second, it does not require an extra damped step-size $\gamma$ as in [7], making Algorithm 1 simpler than the one in [7].

To analyze Algorithm 1, as usual, we define the following gradient mapping of (1):
$$\mathcal{G}_{\eta}(x) := \tfrac{1}{\eta}\big(x - \mathrm{prox}_{\eta\psi}(x - \eta\nabla f(x))\big), \qquad (6)$$
where $\eta > 0$ is any given step-size. It is straightforward to show that $x^{\star} \in \mathrm{dom}(F)$ is a stationary point of (1), i.e., $0 \in \nabla f(x^{\star}) + \partial\psi(x^{\star})$, if and only if $\mathcal{G}_{\eta}(x^{\star}) = 0$. We will show that, for any $\varepsilon > 0$, Algorithm 1 can find $\bar{x}_T$ such that $\mathbb{E}\big[\|\mathcal{G}_{\eta}(\bar{x}_T)\|^2\big] \le \varepsilon^2$, which means that $\bar{x}_T$ is an $\varepsilon$-approximate stationary point of (1), where the expectation is taken over all the present randomness.
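A minimal sketch of Algorithm 1 in Python, assuming the user supplies a stochastic gradient oracle grad_f(x, xi), a sampler sample_xi(), and the proximal operator prox(z, eta) of $\eta\psi$. The function name and driver are illustrative, not part of the paper.

```python
# A minimal sketch of Algorithm 1 (momentum-SARAH proximal gradient).
# Assumed user-supplied pieces: grad_f(x, xi) = gradient of f_xi at x,
# sample_xi() drawing one realization of xi, prox(z, eta) = prox_{eta*psi}(z).
import numpy as np

def hybrid_vr_prox_grad(x0, grad_f, sample_xi, prox, T, eta, beta, b_init, rng):
    # Step 3: initial mini-batch estimator v_0 at x_0.
    batch = [sample_xi() for _ in range(b_init)]
    v = np.mean([grad_f(x0, xi) for xi in batch], axis=0)
    # Step 4: first proximal-gradient step.
    x_prev, x = x0, prox(x0 - eta * v, eta)
    iterates = [x0, x]
    for _ in range(T):                          # Steps 5-8
        xi = sample_xi()                        # Step 6: one fresh sample per iteration
        # Step 7: momentum-SARAH estimator -- two gradients at the SAME sample xi.
        v = grad_f(x, xi) + (1.0 - beta) * (v - grad_f(x_prev, xi))
        x_prev, x = x, prox(x - eta * v, eta)
        iterates.append(x)
    # Step 9: output an iterate chosen uniformly at random from {x_0, ..., x_T}.
    return iterates[rng.integers(T + 1)]
```

For instance, with $\psi = \lambda\|\cdot\|_1$ one can pass prox = lambda z, eta: np.sign(z) * np.maximum(np.abs(z) - eta * lam, 0.0) for a regularization parameter lam.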
The following theorem establishes the convergence of Algorithm 1 and provides its oracle complexity.

Theorem 2.1. Under Assumption 1.1, suppose that $\eta \in \big(0, \tfrac{1}{L}\big)$ is a given step-size and $\tfrac{2L^2\eta^2}{1 - L\eta} \le \beta < 1$. Let $\{x_t\}_{t=0}^{T}$ be generated by Algorithm 1. Then, we have
$$\frac{1}{T+1}\sum_{t=0}^{T}\mathbb{E}\big[\|\mathcal{G}_{\eta}(x_t)\|^2\big] \le \frac{2[F(x_0) - F^{\star}]}{\eta(T+1)} + \frac{\mathbb{E}\big[\|v_0 - \nabla f(x_0)\|^2\big]}{\beta(T+1)} + 2\beta\sigma^2. \qquad (7)$$
In particular, if we choose $\eta := \tfrac{1}{2L(T+1)^{1/3}}$, $\beta := \tfrac{1}{(T+1)^{2/3}}$, and $\tilde{b} := \big\lceil (T+1)^{1/3}\big\rceil \ge 1$, then the output $\bar{x}_T$ of Algorithm 1 satisfies
$$\mathbb{E}\big[\|\mathcal{G}_{\eta}(\bar{x}_T)\|^2\big] \le \frac{4L[F(x_0) - F^{\star}] + 4\sigma^2}{(T+1)^{2/3}}. \qquad (8)$$
Consequently, for any tolerance $\varepsilon > 0$, the total number of stochastic gradient evaluations in Algorithm 1 to achieve $\bar{x}_T$ such that $\mathbb{E}\big[\|\mathcal{G}_{\eta}(\bar{x}_T)\|^2\big] \le \varepsilon^2$ is at most
$$T_{\nabla f} := \Big\lceil \frac{2\Delta_0^{3/2}}{\varepsilon^3} + \frac{\Delta_0^{1/2}}{\varepsilon}\Big\rceil, \quad \text{where}\ \Delta_0 := 4\big[L[F(x_0) - F^{\star}] + \sigma^2\big].$$

Theorem 2.1 shows that the oracle complexity of Algorithm 1 is $\mathcal{O}\big(\Delta_0^{3/2}\varepsilon^{-3} + \Delta_0^{1/2}\varepsilon^{-1}\big)$ as in [7], where $\Delta_0 := 4\big(L[F(x_0) - F^{\star}] + \sigma^2\big)$. This complexity bound in fact matches the lower bound in [1] up to a constant factor under the same assumptions as in Assumption 1.1. Hence, we conclude that Algorithm 1 is optimal.
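As a quick reference, the helper below computes the step-size, weight, and initial batch size suggested by Theorem 2.1 for a given horizon $T$ and smoothness constant $L$. The specific constant factors follow the statement above and should be treated as indicative; the helper also checks the theorem's lower bound on $\beta$.

```python
# Parameter choices suggested by Theorem 2.1 (constants as stated above); illustrative helper.
import math

def theorem21_parameters(T, L):
    eta = 1.0 / (2.0 * L * (T + 1) ** (1.0 / 3.0))   # eta = 1/(2L(T+1)^{1/3})
    beta = 1.0 / (T + 1) ** (2.0 / 3.0)              # beta = 1/(T+1)^{2/3}
    b_init = math.ceil((T + 1) ** (1.0 / 3.0))       # initial batch size ~ (T+1)^{1/3}
    # Check the condition 2*L^2*eta^2/(1 - L*eta) <= beta of Theorem 2.1.
    assert 2.0 * L**2 * eta**2 / (1.0 - L * eta) <= beta + 1e-12
    return eta, beta, b_init

# Example: parameters for T = 10**4 iterations and L = 1.
print(theorem21_parameters(10**4, 1.0))
```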
Let us denote by $\mathcal{F}_t := \sigma(\xi_1, \xi_2, \cdots, \xi_t)$ the $\sigma$-field generated by $\{\xi_1, \xi_2, \cdots, \xi_t\}$. We also denote by $\mathbb{E}[\cdot]$ the full expectation over the history $\mathcal{F}_t$. The following lemma establishes a key estimate for our convergence analysis. We emphasize that Lemma 2.1 is self-contained and can be applied to other types of estimators (e.g., Hessian estimators) and other problems.

Lemma 2.1. Let $v_t$ be computed by (5) for $\beta \in (0, 1]$. Then, under Assumption 1.1, we have
$$\mathbb{E}_{\xi_t}\big[\|v_t - \nabla f(x_t)\|^2\big] \le (1-\beta)^2\|v_{t-1} - \nabla f(x_{t-1})\|^2 + 2(1-\beta)^2 L^2\|x_t - x_{t-1}\|^2 + 2\beta^2\sigma^2. \qquad (9)$$
Therefore, by induction, we have
$$\mathbb{E}\big[\|v_t - \nabla f(x_t)\|^2\big] \le (1-\beta)^{2t}\,\mathbb{E}\big[\|v_0 - \nabla f(x_0)\|^2\big] + 2\beta\sigma^2 + 2L^2\sum_{i=0}^{t-1}(1-\beta)^{2(t-i)}\,\mathbb{E}\big[\|x_{i+1} - x_i\|^2\big]. \qquad (10)$$
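Before turning to the proof, the small numerical check below (an illustration, not part of the paper) verifies the one-step bound (9) on a synthetic finite-sum least-squares instance, where the expectation over $\xi_t$ and the constant $L$ can be computed exactly; the instance and all names are hypothetical.

```python
# Numerical sanity check of the one-step variance bound (9) on a finite-sum
# least-squares instance f_i(x) = 0.5*(a_i^T x - b_i)^2, xi uniform over i.
import numpy as np

rng = np.random.default_rng(1)
n, p, beta = 500, 5, 0.3
A, b = rng.normal(size=(n, p)), rng.normal(size=n)

grad_i = lambda x, i: A[i] * (A[i] @ x - b[i])           # stochastic gradient
grad = lambda x: A.T @ (A @ x - b) / n                    # full gradient
L = np.sqrt(np.mean(np.sum(A**2, axis=1) ** 2))           # L-average smoothness constant

# Fix x_{t-1}, x_t, v_{t-1}; the expectation over xi_t is an exact average over i.
x_prev, x = rng.normal(size=p), rng.normal(size=p)
v_prev = grad(x_prev) + 0.1 * rng.normal(size=p)          # some estimator of grad f(x_{t-1})

lhs = np.mean([np.sum((grad_i(x, i) + (1 - beta) * (v_prev - grad_i(x_prev, i))
                       - grad(x)) ** 2) for i in range(n)])
var_x = np.mean([np.sum((grad_i(x, i) - grad(x)) ** 2) for i in range(n)])   # <= sigma^2
rhs = ((1 - beta) ** 2 * np.sum((v_prev - grad(x_prev)) ** 2)
       + 2 * (1 - beta) ** 2 * L**2 * np.sum((x - x_prev) ** 2)
       + 2 * beta**2 * var_x)
print(f"LHS = {lhs:.3f} <= RHS = {rhs:.3f}: {bool(lhs <= rhs)}")
```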
Proof. Let us denote $a_t := (1-\beta)\big[\nabla f_{\xi_t}(x_t) - \nabla f(x_t) - \nabla f_{\xi_t}(x_{t-1}) + \nabla f(x_{t-1})\big]$ and $b_t := \beta\big[\nabla f_{\xi_t}(x_t) - \nabla f(x_t)\big]$. Since $\mathbb{E}_{\xi_t}[a_t] = \mathbb{E}_{\xi_t}[b_t] = 0$, using (3), we can derive (9) as follows:
$$\begin{aligned}
\mathbb{E}_{\xi_t}\big[\|v_t - \nabla f(x_t)\|^2\big]
&= \mathbb{E}_{\xi_t}\big[\|\nabla f_{\xi_t}(x_t) + (1-\beta)(v_{t-1} - \nabla f_{\xi_t}(x_{t-1})) - \nabla f(x_t)\|^2\big] \\
&= \mathbb{E}_{\xi_t}\big[\|(1-\beta)[v_{t-1} - \nabla f(x_{t-1})] + a_t + b_t\|^2\big] \\
&= (1-\beta)^2\|v_{t-1} - \nabla f(x_{t-1})\|^2 + \mathbb{E}_{\xi_t}\big[\|a_t + b_t\|^2\big] \\
&\le (1-\beta)^2\|v_{t-1} - \nabla f(x_{t-1})\|^2 + 2\,\mathbb{E}_{\xi_t}\big[\|a_t\|^2\big] + 2\,\mathbb{E}_{\xi_t}\big[\|b_t\|^2\big] \\
&\le (1-\beta)^2\|v_{t-1} - \nabla f(x_{t-1})\|^2 + 2(1-\beta)^2\,\mathbb{E}_{\xi_t}\big[\|\nabla f_{\xi_t}(x_t) - \nabla f_{\xi_t}(x_{t-1})\|^2\big] + 2\beta^2\,\mathbb{E}_{\xi_t}\big[\|\nabla f_{\xi_t}(x_t) - \nabla f(x_t)\|^2\big] \\
&\le (1-\beta)^2\|v_{t-1} - \nabla f(x_{t-1})\|^2 + 2(1-\beta)^2 L^2\|x_t - x_{t-1}\|^2 + 2\beta^2\sigma^2.
\end{aligned}$$
Taking the full expectation over the history $\mathcal{F}_t$ in (9), and noticing that, for $\beta \in (0, 1]$, $\tfrac{1 - (1-\beta)^{2t}}{1 - (1-\beta)^2} \le \tfrac{1}{\beta}$, by induction we can show that
$$\begin{aligned}
\mathbb{E}\big[\|v_t - \nabla f(x_t)\|^2\big]
&\le (1-\beta)^{2t}\,\mathbb{E}\big[\|v_0 - \nabla f(x_0)\|^2\big] + 2\beta^2\sigma^2\cdot\frac{1 - (1-\beta)^{2t}}{1 - (1-\beta)^2} + 2L^2\sum_{i=0}^{t-1}(1-\beta)^{2(t-i)}\,\mathbb{E}\big[\|x_{i+1} - x_i\|^2\big] \\
&\le (1-\beta)^{2t}\,\mathbb{E}\big[\|v_0 - \nabla f(x_0)\|^2\big] + 2\beta\sigma^2 + 2L^2\sum_{i=0}^{t-1}(1-\beta)^{2(t-i)}\,\mathbb{E}\big[\|x_{i+1} - x_i\|^2\big].
\end{aligned}$$
This proves (10). $\square$

Next, we prove another property of the composite function $F$ in (1).

Lemma 2.2. Let $\{x_t\}$ be generated by Algorithm 1 for solving (1) and $\mathcal{G}_{\eta}$ be defined by (6). Then, under Assumption 1.1, we have
$$\mathbb{E}[F(x_{t+1}) - F^{\star}] \le \mathbb{E}[F(x_t) - F^{\star}] - \Big(\tfrac{1}{2\eta} - \tfrac{L}{2}\Big)\mathbb{E}\big[\|x_{t+1} - x_t\|^2\big] - \tfrac{\eta}{2}\,\mathbb{E}\big[\|\mathcal{G}_{\eta}(x_t)\|^2\big] + \tfrac{\eta}{2}\,\mathbb{E}\big[\|\nabla f(x_t) - v_t\|^2\big]. \qquad (11)$$

Proof.
Let us denote $\bar{x}_t := \mathrm{prox}_{\eta\psi}(x_t - \eta\nabla f(x_t))$. From the optimality condition of this proximal operator, we have
$$\langle\nabla f(x_t), \bar{x}_t - x_t\rangle + \tfrac{1}{2\eta}\|\bar{x}_t - x_t\|^2 + \psi(\bar{x}_t) \le \psi(x_t) - \tfrac{1}{2\eta}\|x_t - \bar{x}_t\|^2.$$
Similarly, from $x_{t+1} = \mathrm{prox}_{\eta\psi}(x_t - \eta v_t)$, we also have
$$\langle v_t, x_{t+1} - x_t\rangle + \tfrac{1}{2\eta}\|x_{t+1} - x_t\|^2 + \psi(x_{t+1}) \le \langle v_t, \bar{x}_t - x_t\rangle + \tfrac{1}{2\eta}\|\bar{x}_t - x_t\|^2 + \psi(\bar{x}_t) - \tfrac{1}{2\eta}\|\bar{x}_t - x_{t+1}\|^2.$$
Combining the last two inequalities, we can show that
$$\psi(x_{t+1}) + \tfrac{1}{2\eta}\|x_{t+1} - x_t\|^2 \le \psi(x_t) - \tfrac{\eta}{2}\|\mathcal{G}_{\eta}(x_t)\|^2 - \tfrac{1}{2\eta}\|\bar{x}_t - x_{t+1}\|^2 + \langle v_t, \bar{x}_t - x_{t+1}\rangle - \langle\nabla f(x_t), \bar{x}_t - x_t\rangle. \qquad (12)$$
By the Cauchy-Schwarz inequality, for any $\eta > 0$, we easily get
$$\langle\nabla f(x_t) - v_t, x_{t+1} - \bar{x}_t\rangle \le \tfrac{\eta}{2}\|\nabla f(x_t) - v_t\|^2 + \tfrac{1}{2\eta}\|x_{t+1} - \bar{x}_t\|^2. \qquad (13)$$
Finally, using the $L$-smoothness of $f$ (implied by the $L$-average smoothness (3)), we can derive
$$\begin{aligned}
f(x_{t+1}) + \psi(x_{t+1})
&\le f(x_t) + \langle\nabla f(x_t), x_{t+1} - x_t\rangle + \tfrac{L}{2}\|x_{t+1} - x_t\|^2 + \psi(x_{t+1}) \\
&= f(x_t) - \big(\tfrac{1}{2\eta} - \tfrac{L}{2}\big)\|x_{t+1} - x_t\|^2 + \langle\nabla f(x_t), x_{t+1} - x_t\rangle + \psi(x_{t+1}) + \tfrac{1}{2\eta}\|x_{t+1} - x_t\|^2 \\
&\overset{(12)}{\le} f(x_t) - \big(\tfrac{1}{2\eta} - \tfrac{L}{2}\big)\|x_{t+1} - x_t\|^2 + \psi(x_t) + \langle\nabla f(x_t) - v_t, x_{t+1} - \bar{x}_t\rangle - \tfrac{\eta}{2}\|\mathcal{G}_{\eta}(x_t)\|^2 - \tfrac{1}{2\eta}\|\bar{x}_t - x_{t+1}\|^2 \\
&\overset{(13)}{\le} f(x_t) + \psi(x_t) - \big(\tfrac{1}{2\eta} - \tfrac{L}{2}\big)\|x_{t+1} - x_t\|^2 + \tfrac{\eta}{2}\|\nabla f(x_t) - v_t\|^2 - \tfrac{\eta}{2}\|\mathcal{G}_{\eta}(x_t)\|^2.
\end{aligned}$$
Taking the full expectation of both sides of the last inequality and noting that $F = f + \psi$, we obtain (11). $\square$

Now, we are ready to prove our main result, Theorem 2.1.

The proof of Theorem 2.1. First, summing up (10) from $t := 0$ to $t := T$, we get
$$\begin{aligned}
\sum_{t=0}^{T}\mathbb{E}\big[\|v_t - \nabla f(x_t)\|^2\big]
&\le \sum_{t=0}^{T}(1-\beta)^{2t}\,\mathbb{E}\big[\|v_0 - \nabla f(x_0)\|^2\big] + 2(T+1)\beta\sigma^2 + 2L^2\sum_{t=0}^{T}\sum_{i=0}^{t-1}(1-\beta)^{2(t-i)}\,\mathbb{E}\big[\|x_{i+1} - x_i\|^2\big] \\
&\le \tfrac{1}{\beta}\,\mathbb{E}\big[\|v_0 - \nabla f(x_0)\|^2\big] + 2(T+1)\beta\sigma^2 + 2L^2\sum_{i=0}^{T-1}\sum_{t=i+1}^{T}(1-\beta)^{2(t-i)}\,\mathbb{E}\big[\|x_{i+1} - x_i\|^2\big] \\
&\le \tfrac{1}{\beta}\,\mathbb{E}\big[\|v_0 - \nabla f(x_0)\|^2\big] + 2(T+1)\beta\sigma^2 + \tfrac{2L^2}{\beta}\sum_{i=0}^{T-1}\mathbb{E}\big[\|x_{i+1} - x_i\|^2\big]. \qquad (14)
\end{aligned}$$
Next, summing up (11) from $t := 0$ to $t := T$, we obtain
$$\begin{aligned}
\mathbb{E}[F(x_{T+1}) - F^{\star}]
&\le [F(x_0) - F^{\star}] - \tfrac{\eta}{2}\sum_{t=0}^{T}\mathbb{E}\big[\|\mathcal{G}_{\eta}(x_t)\|^2\big] - \sum_{t=0}^{T}\Big(\tfrac{1}{2\eta} - \tfrac{L}{2}\Big)\mathbb{E}\big[\|x_{t+1} - x_t\|^2\big] + \tfrac{\eta}{2}\sum_{t=0}^{T}\mathbb{E}\big[\|v_t - \nabla f(x_t)\|^2\big] \\
&\overset{(14)}{\le} [F(x_0) - F^{\star}] - \tfrac{\eta}{2}\sum_{t=0}^{T}\mathbb{E}\big[\|\mathcal{G}_{\eta}(x_t)\|^2\big] - \sum_{t=0}^{T}\Big(\tfrac{1}{2\eta} - \tfrac{L}{2}\Big)\mathbb{E}\big[\|x_{t+1} - x_t\|^2\big] \\
&\qquad + \tfrac{\eta}{2\beta}\,\mathbb{E}\big[\|v_0 - \nabla f(x_0)\|^2\big] + \sum_{i=0}^{T-1}\tfrac{L^2\eta}{\beta}\,\mathbb{E}\big[\|x_{i+1} - x_i\|^2\big] + (T+1)\eta\beta\sigma^2.
\end{aligned}$$
Since $\eta \in \big(0, \tfrac{1}{L}\big)$, we have $\tfrac{1}{2\eta} - \tfrac{L}{2} > 0$. Suppose further that $\tfrac{1}{2\eta} - \tfrac{L}{2} \ge \tfrac{L^2\eta}{\beta}$, i.e., $\beta \ge \tfrac{2L^2\eta^2}{1 - L\eta}$; then
$$\mathbb{E}[F(x_{T+1}) - F^{\star}] \le [F(x_0) - F^{\star}] - \tfrac{\eta}{2}\sum_{t=0}^{T}\mathbb{E}\big[\|\mathcal{G}_{\eta}(x_t)\|^2\big] + \tfrac{\eta}{2\beta}\,\mathbb{E}\big[\|v_0 - \nabla f(x_0)\|^2\big] + (T+1)\eta\beta\sigma^2,$$
which leads to (7) since $F(x_{T+1}) \ge F^{\star}$.

Now, if we choose $\eta := \tfrac{1}{2L(T+1)^{1/3}}$ and $\beta := \tfrac{1}{(T+1)^{2/3}}$, then we can verify that $\beta \ge \tfrac{2L^2\eta^2}{1 - L\eta}$. Moreover, (7) becomes
$$\frac{1}{T+1}\sum_{t=0}^{T}\mathbb{E}\big[\|\mathcal{G}_{\eta}(x_t)\|^2\big] \le \frac{4L[F(x_0) - F^{\star}]}{(T+1)^{2/3}} + \frac{2\sigma^2}{(T+1)^{2/3}} + \frac{\mathbb{E}\big[\|v_0 - \nabla f(x_0)\|^2\big]}{(T+1)^{1/3}}.$$
By Step 3 of Algorithm 1 and the choice $\tilde{b} := \big\lceil(T+1)^{1/3}\big\rceil$, we have $\mathbb{E}\big[\|v_0 - \nabla f(x_0)\|^2\big] \le \tfrac{\sigma^2}{\tilde{b}} \le \tfrac{\sigma^2}{(T+1)^{1/3}}$. Substituting this bound into the previous one and using $\mathbb{E}\big[\|\mathcal{G}_{\eta}(\bar{x}_T)\|^2\big] = \tfrac{1}{T+1}\sum_{t=0}^{T}\mathbb{E}\big[\|\mathcal{G}_{\eta}(x_t)\|^2\big]$, we obtain (8).

Finally, from (8), to guarantee $\mathbb{E}\big[\|\mathcal{G}_{\eta}(\bar{x}_T)\|^2\big] \le \varepsilon^2$, we need $T + 1 \ge \tfrac{\Delta_0^{3/2}}{\varepsilon^3}$, where $\Delta_0 := 4L[F(x_0) - F^{\star}] + 4\sigma^2$. We can take $T := \big\lceil\tfrac{\Delta_0^{3/2}}{\varepsilon^3}\big\rceil$.
Therefore, the number of stochastic gradient evaluations is $T_{\nabla f} = \tilde{b} + 2T \approx \tfrac{2\Delta_0^{3/2}}{\varepsilon^3} + \tfrac{\Delta_0^{1/2}}{\varepsilon}$. Rounding it, we obtain $T_{\nabla f} = \big\lceil\tfrac{2\Delta_0^{3/2}}{\varepsilon^3} + \tfrac{\Delta_0^{1/2}}{\varepsilon}\big\rceil$. $\square$

Theorem 2.1 only analyzes a simple variant of Algorithm 1 with a constant step-size $\eta = \mathcal{O}\big(T^{-1/3}\big)$ and a constant weight $\beta = \mathcal{O}\big(T^{-2/3}\big)$. It also uses a large initial mini-batch of size $\tilde{b} = \mathcal{O}\big(T^{1/3}\big)$. Compared to SARAH-based methods, e.g., in [3, 4, 5], Algorithm 1 is simpler since it is single-loop. At each iteration, it uses only two stochastic gradient evaluations compared to three in [7]. We remark that the convergence of Algorithm 1 can also be established by means of a Lyapunov function as in [7].

The results of this note can be extended in different directions:

• We can adapt our analysis to mini-batch, adaptive step-size $\eta_t$, and adaptive weight $\beta_t$ variants as in [6]. If we use an adaptive weight $\beta_t$ as in [6], then we can remove the initial batch $\widetilde{\mathcal{B}}$ at Step 3 of Algorithm 1. However, the convergence rate in Theorem 2.1 becomes $\mathcal{O}\big(\tfrac{\log(T)}{T^{2/3}}\big)$ instead of $\mathcal{O}\big(T^{-2/3}\big)$. The rate $\mathcal{O}\big(\tfrac{\log(T)}{T^{2/3}}\big)$ matches the result of [2] without the bounded gradient assumption.

• Our results, especially Lemma 2.1, can be applied to develop stochastic algorithms for other optimization problems such as compositional nonconvex optimization, minimax problems, and reinforcement learning.

• The idea here can also be extended to develop second-order methods such as sub-sampled and sketching Newton or cubic regularization-based methods.

It is also interesting to incorporate this idea with adaptive schemes as done in [2] by developing different strategies such as curvature-aided or quasi-Newton methods.

References
[1] Y. Arjevani, Y. Carmon, J. C. Duchi, D. J. Foster, N. Srebro, and B. Woodworth. Lower bounds for non-convex stochastic optimization. arXiv preprint arXiv:1912.02365, 2019.
[2] A. Cutkosky and F. Orabona. Momentum-based variance reduction in non-convex SGD. In Advances in Neural Information Processing Systems, pages 15210–15219, 2019.
[3] C. Fang, C. J. Li, Z. Lin, and T. Zhang. SPIDER: Near-optimal non-convex optimization via stochastic path-integrated differential estimator. In Advances in Neural Information Processing Systems, pages 689–699, 2018.
[4] L. M. Nguyen, J. Liu, K. Scheinberg, and M. Takáč. SARAH: A novel method for machine learning problems using stochastic recursive gradient. In ICML, 2017.
[5] N. H. Pham, L. M. Nguyen, D. T. Phan, and Q. Tran-Dinh. ProxSARAH: An efficient algorithmic framework for stochastic composite nonconvex optimization. J. Mach. Learn. Res., 21:1–48, 2020.
[6] Q. Tran-Dinh, D. Liu, and L. M. Nguyen. Hybrid variance-reduced SGD algorithms for nonconvex-concave minimax problems. Tech. Report STOR.05.20, UNC-Chapel Hill (arXiv preprint arXiv:2006.15266), 2020.
[7] Q. Tran-Dinh, N. H. Pham, D. T. Phan, and L. M. Nguyen. A hybrid stochastic optimization framework for stochastic composite nonconvex optimization. arXiv preprint arXiv:1907.03793, pages 1–49, 2019.
Authors’ information:
Deyi Liu and Quoc Tran-Dinh∗
Department of Statistics and Operations Research
The University of North Carolina at Chapel Hill
Chapel Hill, NC 27599
Email: [email protected], [email protected]
∗ Corresponding author.

Lam M. Nguyen, IBM Research, Thomas J. Watson Research Center, NY 10598
Email: [email protected]