Randomized Fast Subspace Descent Methods
LONG CHEN†, XIAOZHE HU‡, AND HUIWEN WU§

Abstract.
Randomized Fast Subspace Descent (RFASD) methods are developed and analyzed for smooth, unconstrained convex optimization problems. The efficiency of the method relies on a space decomposition that is stable in the A-norm and on the condition number κ_A measured in the A-norm being small. At each iteration, a subspace is chosen randomly, either uniformly or with probability proportional to the local Lipschitz constants, and a preconditioned gradient descent step is applied in that subspace. RFASD converges sublinearly for convex functions and linearly for strongly convex functions. Compared with randomized block coordinate descent methods, the convergence of RFASD is faster provided κ_A is small and the subspace decomposition is A-stable. This improvement is demonstrated by considering a multilevel space decomposition for Nesterov's "worst" problem.

Key words.
Convex optimization, randomized methods, subspace decomposition
AMS subject classifications.
1. Introduction.
We consider the unconstrained convex minimization problem
(1.1) min_{x∈V} f(x),
where f is a smooth convex function whose derivative is Lipschitz continuous with constant L, and V is a Hilbert space. In practice V = R^N, but it may be equipped with an inner product other than the standard ℓ² inner product of R^N. Solving the minimization problem (1.1) is a central task with wide applications in scientific computing, machine learning, data science, etc.

Due to the explosion of data and the stochasticity of the real world, randomness is introduced to make algorithms more robust and computationally affordable. In the following, we restrict ourselves to randomized algorithms related to the coordinate descent (CD) method [18, 3, 14] and its block variant, i.e., the block CD (BCD) method [18, 3, 2, 27, 1], and propose a new algorithm generalizing the randomized CD (RCD) and randomized BCD (RBCD) methods.

In [16], Nesterov studied an RBCD method for huge-scale optimization problems. Assuming the gradient of the objective function f is coordinate-wise Lipschitz continuous with constants L_i, at each step a block of coordinates is chosen randomly with probability p_i = L_i^α [∑_{j=1}^n L_j^α]^{-1}, α ∈ R, and an optimal block coordinate step with step size 1/L_i is employed. Note that, when α = 0, the probability p_i is uniform; when α = 1, it is proportional to L_i. It is shown that such an RBCD method converges linearly for a strongly convex function f and sublinearly in the convex case. Later, in [1], the cyclic version of the BCD method was studied, in which each iteration performs a gradient projection step with respect to a block taken in cyclic order. A global sublinear convergence rate was established for convex problems and, when the objective function is strongly convex, a linear convergence rate can be proved. More recently, Wright [28] studied a simple version of RCD that updates one coordinate at a time with a uniformly chosen index.

∗ The work of X. Hu is partially supported by the National Science Foundation under grants DMS-1812503 and CCF-1934553. L. Chen and H. Wu are partially supported by the National Science Foundation under grant DMS-1913080.
† Department of Mathematics, University of California at Irvine, Irvine, CA 92697, USA ([email protected]).
‡ Department of Mathematics, Tufts University, Medford, MA 02155, USA ([email protected]).
§ Department of Mathematics, University of California at Irvine, Irvine, CA 92697, USA ([email protected]).
It was pointed out that, when applied to a linear system Ax = b via a least-squares formulation, such an RCD is exactly a randomized Kaczmarz method [22, 11]. Similarly, it was shown that RCD converges sublinearly for convex functions and linearly for strongly convex functions. In [13], Lu developed a randomized block proximal damped Newton (RBPDN) method. For solving smooth convex minimization problems, RBPDN uses Newton's method in each block. Compared with Newton's method, the computational complexity of RBPDN is reduced since the Newton step is performed locally on each block. There is a trade-off between convergence rate and computational complexity: if the dimension of the blocks is too small, say O(1), the Hessian on each block might lose a lot of information, which can lead to slow convergence, while if the block dimension is large, the computation of the Hessian inverse on each block may still be expensive.

These existing RCD and RBCD methods can achieve acceleration compared with standard gradient descent (GD) methods, especially for large-scale problems. However, the convergence of RCD and RBCD becomes quite slow when the problem is ill-conditioned. It is well known that preconditioning techniques can be used to improve the conditioning of an optimization problem; see [19, 18]. While preconditioning techniques can be motivated in different ways, one approach is to look at problem (1.1) in V endowed with an inner product induced by the preconditioner. Roughly speaking, assuming the preconditioner A is symmetric and positive definite, we consider Lipschitz continuity and convexity with respect to the A-inner product and A-norm. A good preconditioner means that the condition number measured in the A-norm is relatively small and, therefore, the convergence of preconditioned GD (PGD) can be accelerated. The price to pay is the cost of the action of A^{-1}, which might be prohibitive for large-scale problems. Moreover, it is also difficult to use the preconditioner in the RCD and RBCD methods because the coordinate-wise decomposition is essentially based on the ℓ²-norm.

One main idea of the proposed algorithm is to generalize the coordinate-wise decomposition to a subspace decomposition which is more suitable for the A-norm. This idea itself is not new. For example, the well-known multigrid method [23], which is one of the most efficient methods for solving elliptic problems, can be derived from subspace correction methods based on a multilevel space decomposition [29]. Its randomized version has been considered in [10]. Recently, in [5], we developed fast subspace descent (FASD) methods by borrowing the subspace decomposition idea of multigrid methods for solving optimization problems. In this paper, we provide a randomized version of FASD, abbreviated as RFASD.

A key feature of FASD and RFASD is a subspace decomposition V = V_1 + V_2 + ··· + V_J, with V_i ⊂ V, i = 1, ···, J. Note that we do not require the space decomposition to be a direct sum or orthogonal. Indeed, the overlapping/redundancy between the subspaces is crucial to speeding up the convergence if the space decomposition is stable in the ‖·‖_A norm in the following sense:
• (SD) Stable decomposition: there exists a constant C_A > 0 such that
(1.2) ∀ v ∈ V, v = ∑_{i=1}^J v_i, and ∑_{i=1}^J ‖v_i‖²_A ≤ C_A ‖v‖²_A.
With such a subspace decomposition, the proposed RFASD method is similar to the RBCD method.
At each iteration, RFASD randomly chooses a subspace according to a certain sampling distribution, computes a search direction in that subspace, and then updates the iterate with an appropriate step size.

Based on standard assumptions, we first prove a sufficient decay inequality. Coupled with a standard upper bound on the optimality gap, we are then able to show that RFASD converges, with a rate governed by quantities measured in the
A-norm; there is no need to invert A directly, and local solves on each subspace suffice. Using the strongly convex case as an example, we show that the convergence rate is
1 − 1/(J C_A κ_A),
where κ_A is the condition number of f measured in the A-norm, which can be much smaller than the condition number of f measured in the ℓ²-norm. Here J is the number of subspaces and C_A measures the stability of the space decomposition in the A-norm; see (1.2). Therefore, if we choose a proper preconditioner A such that κ_A = O(1) and a stable subspace decomposition such that C_A = O(1), after J iterations we have
(1 − 1/(J C_A κ_A))^J ≤ exp(−1/(C_A κ_A)) = exp(−O(1)).
This indicates an exponential decay rate that is independent of the size of the optimization problem, which shows the potential of the proposed RFASD method for solving large-scale optimization problems. In summary, based on a stable subspace decomposition, we can achieve the preconditioning effect by solving only small problems on each subspace, which reduces the computational complexity.

The paper is organized as follows. In Section 2, we set up the minimization problem min_{x∈V} f(x) with proper assumptions on f and ∇f, and then propose the RFASD algorithm. In Section 3, based on the stable decomposition assumption, convergence analyses for convex functions and strongly convex functions are derived. In Section 4, we give some examples and comparisons between several methods within the framework of RFASD. Numerical experiments are provided in Section 5 to confirm the theory. Finally, we provide some conclusions in Section 6.
2. Fast Subspace Descent Methods.
In this section, we introduce the basic setting of the optimization problem we consider as well as basic definitions, notation, and assumptions. Then we propose the fast subspace descent method based on a proper subspace decomposition.
We consider the minimization problem (1.1). The Hilbert space V is a vector space equipped with an inner product (·,·)_A, and the induced norm is denoted by ‖x‖_A = (x,x)_A^{1/2}. Although our discussion might be valid in general Hilbert spaces, we restrict ourselves to finite-dimensional spaces and, without loss of generality, take V = R^N. In this case, the standard ℓ² dot product in R^N corresponds to A = I and is denoted by
(2.1) (x,y) = x·y := ∑_{i=1}^N x_i y_i, ∀ x, y ∈ V.
The inner product (·,·)_A is induced by a given symmetric positive definite (SPD) matrix A ∈ R^{N×N} and is defined as follows:
(2.2) (x,y)_A := (Ax, y), ∀ x, y ∈ V.
Let V′ := L(V, R) be the linear space of all linear and continuous mappings V → R, which is called the dual space of V. The dual norm with respect to the A-norm is defined, for f ∈ V′, as
(2.3) ‖f‖_{V′} = sup_{‖x‖_A ≤ 1} ⟨f, x⟩,
where ⟨·,·⟩ denotes the standard duality pairing between V and V′. By the Riesz representation theorem, f can also be treated as a vector, and it is straightforward to verify that ‖f‖_{V′} = ‖f‖_{A^{-1}} := ⟨A^{-1}f, f⟩^{1/2}.

Next, we introduce a decomposition of the space V, i.e.,
(2.4) V = V_1 + V_2 + ··· + V_J, V_i ⊂ V, i = 1, ···, J.
Again we emphasize that the space decomposition is not necessarily a direct sum nor orthogonal. For each subspace V_i, we assign an inner product induced by a symmetric positive definite matrix A_i : V_i → V_i. The product space
Ṽ = V_1 × V_2 × ··· × V_J
is equipped with the product topology: for ṽ = (v_1, v_2, ..., v_J)^⊺ ∈ Ṽ,
‖ṽ‖_Ã := ( ∑_{i=1}^J ‖v_i‖²_{A_i} )^{1/2}.
In matrix form, Ã = diag(A_1, A_2, ..., A_J) is a block diagonal matrix defined on Ṽ.

We shall make the following assumptions on the objective function:
• (LCi) The gradient of f is Lipschitz continuous restricted to each subspace with Lipschitz constant L_{A,i}, i.e.,
‖∇f(x + v_i) − ∇f(x)‖_{A^{-1}} ≤ L_{A,i} ‖v_i‖_{A_i}, ∀ v_i ∈ V_i.
• (SC) f is strongly convex with strong convexity constant µ_A ≥ 0, i.e.,
⟨∇f(x) − ∇f(y), x − y⟩ ≥ µ_A ‖x − y‖²_A, ∀ x, y ∈ V.

Let I_i : V_i → V be the natural inclusion and let R_i = I_i^⊺ : V′ → V_i′ be its transpose. In the terminology of multigrid methods, I_i corresponds to the prolongation operator and R_i to the restriction. A small illustrative sketch of these objects is given below, after which we propose the randomized fast subspace descent (RFASD) algorithm.
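The following minimal NumPy sketch is our own illustration of the objects just defined (it is not part of the original paper): the A-norm, the dual norm ‖·‖_{A^{-1}}, a coordinate-based inclusion I_i with restriction R_i = I_i^⊺, and the Galerkin subspace matrix A_i = R_i A I_i. The particular matrix and index set are placeholders.

```python
import numpy as np

# Hypothetical SPD matrix A (a small tridiagonal example) and a coordinate subspace.
N = 7
A = 2.0 * np.eye(N) - np.eye(N, k=1) - np.eye(N, k=-1)

def a_norm(x, A):
    """A-norm: ||x||_A = (Ax, x)^{1/2}."""
    return np.sqrt(x @ (A @ x))

def dual_norm(g, A):
    """Dual norm: ||g||_{A^{-1}} = <A^{-1} g, g>^{1/2}."""
    return np.sqrt(g @ np.linalg.solve(A, g))

def inclusion(indices, N):
    """Natural inclusion I_i of a coordinate subspace as a column-selection matrix."""
    I_i = np.zeros((N, len(indices)))
    I_i[indices, np.arange(len(indices))] = 1.0
    return I_i

I_1 = inclusion([0, 1, 2], N)   # V_1 = span{e_1, e_2, e_3}
R_1 = I_1.T                     # restriction R_1 = I_1^T
A_1 = R_1 @ A @ I_1             # Galerkin subspace matrix A_1 = R_1 A I_1

x = np.random.rand(N)
# For g = Ax the dual norm equals the A-norm of x: <A^{-1}(Ax), Ax> = (Ax, x).
print(a_norm(x, A), dual_norm(A @ x, A))
```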
Algorithm 2.1
Randomized Fast Subspace Descent Method
1: Choose x_0 and set k ← 0
2: for k = 0, 1, ··· do
3:   Choose an index of subspace i_k from {1, ···, J} with the sampling probability
     (2.5) i_k = i with probability p_i = (1/J) · L_{A,i}/L̄_A, i = 1, 2, ..., J,
     where L̄_A = (1/J) ∑_{i=1}^J L_{A,i} is the averaged Lipschitz constant
4:   Compute the subspace search direction s_{i_k} = −I_{i_k} A_{i_k}^{-1} R_{i_k} ∇f(x_k)
5:   Choose the step size α_k = 1/L_{A,i_k} and update by the subspace correction x_{k+1} := x_k + α_k s_{i_k}
6: end for

The step size α_k = 1/L_{A,i_k} requires a priori knowledge of L_{A,i}. A conservative plan is to use one upper bound for all subspaces: for example, when the gradient of f is Lipschitz continuous with Lipschitz constant L_A, i.e.,
‖∇f(x) − ∇f(y)‖_{A^{-1}} ≤ L_A ‖x − y‖_A, ∀ x, y ∈ V,
we can set L_{A,i} = L_A for all i and, consequently, pick the subspace uniformly and use a uniform step size 1/L_A.

There is a balance between the number of subspaces and the complexity of the subspace solvers. For example, we can choose J = 1 and thus C_A = 1, but then we need to compute A^{-1}, which may cost O(N^p) for some p > 1 and is not practical for large-scale problems (e.g., using standard Gaussian elimination to compute A^{-1} leads to p = 3). At the other extreme, we can choose a multilevel decomposition with J = O(N log N) and n_i = O(1); then the cost of computing A_i^{-1} is O(1).

One important question is: what is a good choice of the A-norm? Any good preconditioner for the objective function is a candidate. For example, when ∇²f exists, A = ∇²f(x_k) or an approximation of it is a good choice, since this inherits advantages of Newton's method or quasi-Newton methods.

Another important question is how to obtain a stable decomposition for a given SPD matrix A. There is no satisfactory and universal answer to this question. One can always start from a block coordinate decomposition. When A = ∇²f(x_k), this leads to the block Newton method considered in [13]. We can then merge small blocks to form larger ones in a multilevel fashion, and algebraic multigrid methods [23] can be used in this process to provide a coarsening of the graph defined by the Hessian. In general, an efficient and effective space decomposition will be problem dependent. We shall provide an example later on.

The full approximation storage (FAS) scheme, in the deterministic setting, was proposed in [4] and is a multigrid method for solving nonlinear equations. Several FAS-like algorithms for solving optimization problems have been considered in the literature [6, 12, 15, 7], including line search-based recursive and trust region-based recursive algorithms. Based on RFASD, we shall propose a randomized FAS (RFAS) algorithm. We first recall FAS in the optimization setting as discussed in [5]. Given a space decomposition
V = V_1 + V_2 + ··· + V_J, V_i ⊂ V, i = 1, ···, J,
let Q_i : V → V_i be a projection operator; ideally, Q_i v should provide a good approximation of v ∈ V in the subspace V_i. In addition, f_i : V_i → R is a local objective function; f_i can be the original f, in which case the method coincides with the multilevel optimization methods established by Tai and Xu [24]; see Remark 4.2 in [5]. Given the current approximation x_k, in FAS, the search direction s_i, i = 1, 2, ···, J, is computed by the following steps:

Algorithm 2.2
Compute search direction s_i for FAS
Input: current approximation x_k and index i
Output: search direction s_i
1: Compute the so-called τ-correction: τ_i = ∇f_i(Q_i x_k) − R_i ∇f(x_k)
2: Solve the τ-perturbed problem: ⟨∇f_i(η_i), v_i⟩ = ⟨τ_i, v_i⟩, ∀ v_i ∈ V_i
3: Compute the search direction: s_i := η_i − Q_i x_k

Replacing Step 4 in RFASD (Algorithm 2.1) by this procedure, we obtain the RFAS method shown in Algorithm 2.3.
Algorithm 2.3
Randomized Full Approximation Storage Scheme
1: Choose x_0 and set k ← 0
2: for k = 0, 1, ··· do
3:   Choose an index i_k from {1, ···, J} with the sampling probability given in (2.5)
4:   Compute the subspace search direction s_{i_k} using Algorithm 2.2 with inputs x_k and i_k
5:   Update by the subspace correction x_{k+1} := x_k + (1/L_{A,i_k}) s_{i_k}
6: end for

Next we show that if we choose f_i in a certain way, RFAS becomes a special case of RFASD. Given the SPD matrices A_i that induce the inner products on the subspaces V_i, i = 1, 2, ···, J, we define the quadratic local objective functions
f_i(w) = (1/2) ‖w‖²_{A_i}, ∀ w ∈ V_i, i = 1, 2, ···, J.
From Algorithm 2.2, it is easy to see that s_i = −I_i A_i^{-1} R_i ∇f(x_k). Therefore, in this case, RFAS agrees with RFASD. Note that in this setting A_i may not be the Galerkin projection A_i = R_i A I_i of A, which is different from RPSD. Nevertheless, the convergence analysis (Theorems 3.4 and 3.6) still applies, but the constants L_{A,i} and C_A depend on the choices of A_i.

Consider the slightly more general case in which the local objective function f_i is in C². Then, by the mean value theorem, from Algorithm 2.2 we can write the equation for s_i as
⟨∇²f_i(ξ_i) s_i, v_i⟩ = −⟨R_i ∇f(x_k), v_i⟩, ∀ v_i ∈ V_i,
which implies A_i = R_i ∇²f_i(ξ_i) I_i and, again, RFAS is a special case of RFASD. Of course, this choice of A_i is impractical because we do not know ξ_i in general. One practical choice might be A_i = R_i ∇²f(x_k) I_i, the Galerkin projection of the Hessian matrix of the original f. In this case, RFAS becomes a block Newton method. The constants L_{A,i} and C_A change accordingly, as A_i depends on x_k and is different at each iteration.
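To connect the pseudocode above to a concrete implementation, here is a minimal NumPy sketch of Algorithm 2.1 (our own illustration, not code from the paper). It assumes a coordinate-based space decomposition, so that I_i and R_i are index selections, and takes the local matrices A_i and constants L_{A,i} as inputs; all names are placeholders.

```python
import numpy as np

def rfasd(grad, x0, subspaces, A_blocks, L_blocks, n_iter=1000, rng=None):
    """Minimal sketch of Algorithm 2.1 (RFASD).

    grad      : callable returning the full gradient of f
    subspaces : list of index arrays; V_i = span{e_j : j in subspaces[i]}
                (a coordinate-based decomposition is assumed for simplicity)
    A_blocks  : list of SPD matrices A_i defining the subspace inner products
    L_blocks  : list of local Lipschitz constants L_{A,i}
    """
    rng = rng or np.random.default_rng()
    J = len(subspaces)
    L_bar = np.mean(L_blocks)
    probs = np.array(L_blocks) / (J * L_bar)      # p_i = L_{A,i} / (J * Lbar_A); sums to 1
    x = x0.copy()
    for _ in range(n_iter):
        i = rng.choice(J, p=probs)                # sample a subspace index (2.5)
        idx = subspaces[i]
        g_i = grad(x)[idx]                        # restriction R_i grad f(x_k)
        s_i = -np.linalg.solve(A_blocks[i], g_i)  # local solve: s_i = -A_i^{-1} g_i
        x[idx] += s_i / L_blocks[i]               # step size alpha_k = 1 / L_{A,i}
    return x

# Usage sketch for a quadratic f(x) = 0.5 x^T Q x - b^T x (Q, b hypothetical):
# x = rfasd(lambda x: Q @ x - b, np.zeros(N), subspaces, A_blocks, L_blocks)
```

For the choice A_i = R_i ∇²f(x_k) I_i discussed above, one would recompute A_blocks[i] from the current Hessian before the local solve.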
3. Convergence Analysis.
In this section, we shall present a convergence analysis of RFASD. We first discuss the stability of a space decomposition and then prove the crucial sufficient decay property. We then obtain a linear or sublinear convergence rate via different upper bounds on the optimality gap.
We first introduce the mapping
Π : Ṽ → V as follows:
Πṽ = ∑_{i=1}^J v_i, for ṽ = (v_1, v_2, ..., v_J)^⊺ ∈ Ṽ.
We can write
Π = (I_1, I_2, ..., I_J) and Π^⊺ = (R_1, R_2, ..., R_J)^⊺ in terms of the prolongation and restriction operators.

The space decomposition V = ∑_i V_i implies that Π is surjective. Since all norms on finite-dimensional spaces are equivalent, Π is a linear and continuous operator and thus, by the open mapping theorem, there exists a continuous right inverse of Π. Namely, there exists a constant C_A > 0 such that for any v ∈ V there exists a decomposition v = ∑_{i=1}^J v_i, with v_i ∈ V_i for i = 1, ..., J, and
(3.1) ∑_{i=1}^J ‖v_i‖²_{A_i} ≤ C_A ‖v‖²_A.
The constant C_A measures the stability of the space decomposition; when the decomposition is orthogonal, C_A = 1. As the adjoint, the operator Π^⊺ : V′ → Ṽ′ is injective and bounded below. The following result is essentially a consequence of the fact that Π and Π^⊺ have the same minimum singular value 1/√C_A.
LEMMA 3.1. For a given g ∈ V′, let g_i = R_i g and s_i = −A_i^{-1} g_i for i = 1, 2, ..., J. Then
(3.2) −⟨g, ∑_{i=1}^J s_i⟩ = ∑_{i=1}^J ‖g_i‖²_{A_i^{-1}} = ∑_{i=1}^J ‖s_i‖²_{A_i}
and
(3.3) ‖g‖²_{A^{-1}} ≤ C_A ∑_{i=1}^J ‖g_i‖²_{A_i^{-1}} = C_A ∑_{i=1}^J ‖s_i‖²_{A_i},
where C_A is the constant in (3.1).

Proof. The first identity is an easy consequence of the definitions: for i = 1, 2, ..., J,
−⟨g, s_i⟩ = ⟨g, I_i A_i^{-1} R_i g⟩ = ‖g_i‖²_{A_i^{-1}} = ‖A_i^{-1} g_i‖²_{A_i} = ‖s_i‖²_{A_i}.
We now prove (3.3). For a given g ∈ V′, let g_i = R_i g ∈ V_i′. For any w ∈ V, choose a stable decomposition w = ∑_{i=1}^J w_i, w_i ∈ V_i. Then
⟨g, w⟩ = ∑_{i=1}^J ⟨g, w_i⟩ = ∑_{i=1}^J ⟨g_i, w_i⟩ ≤ ( ∑_{i=1}^J ‖g_i‖²_{A_i^{-1}} )^{1/2} ( ∑_{i=1}^J ‖w_i‖²_{A_i} )^{1/2} ≤ C_A^{1/2} ( ∑_{i=1}^J ‖g_i‖²_{A_i^{-1}} )^{1/2} ‖w‖_A.
Thus,
‖g‖²_{A^{-1}} = ( sup_{w∈V} ⟨g, w⟩/‖w‖_A )² ≤ C_A ∑_{i=1}^J ‖g_i‖²_{A_i^{-1}} = C_A ∑_{i=1}^J ‖s_i‖²_{A_i},
which completes the proof.

We next prove a sufficient decay property for the function values. Note that we do not assume f is convex, but only Lipschitz continuous in each subspace.
LEMMA 3.2. Suppose the objective function f and the space decomposition V = ∑_{i=1}^J V_i satisfy (LCi). Let {x_k} be the sequence generated by Algorithm 2.1. Then for all k ≥ 0 we have
(3.4) E[f(x_{k+1})] − f(x_k) ≤ −(1/(2 L̄_A C_A J)) ‖∇f(x_k)‖²_{A^{-1}}.

Proof. By the Lipschitz continuity (LCi) and the choice of the step size, we have
f(x_{k+1}) ≤ f(x_k) + α_k ⟨∇f(x_k), s_{i_k}⟩ + (L_{A,i_k} α_k²/2) ‖s_{i_k}‖²_{A_{i_k}}
= f(x_k) + (1/L_{A,i_k}) ⟨∇f(x_k), s_{i_k}⟩ + (1/(2 L_{A,i_k})) ‖s_{i_k}‖²_{A_{i_k}}.
Taking the expectation over i_k conditioned on x_k, with probabilities p_i = (1/J) L_{A,i}/L̄_A, i = 1, ···, J, gives
E[f(x_{k+1})] − f(x_k) ≤ (1/(J L̄_A)) ⟨∇f(x_k), ∑_{i=1}^J s_i⟩ + (1/(2 J L̄_A)) ∑_{i=1}^J ‖s_i‖²_{A_i}
≤ −(1/(2 J L̄_A)) ∑_{i=1}^J ‖s_i‖²_{A_i}   (by (3.2))
≤ −(1/(2 J L̄_A C_A)) ‖∇f(x_k)‖²_{A^{-1}}   (by (3.3)).

As we will show in the next two subsections, the sufficient decay property (3.4), together with a proper upper bound on the optimality gap, yields a linear or sublinear convergence rate for the strongly convex and convex cases, respectively.
To derive the linear convergence for the strongly convex case, we shall use the following upper bound on the optimality gap.

LEMMA 3.3. Suppose that f satisfies assumption (SC) with constant µ_A > 0 and x* ∈ V is the minimizer of f; then for all x ∈ V,
(3.5) f(x) − f(x*) ≤ (1/(2µ_A)) ‖∇f(x)‖²_{A^{-1}}.

Now we are ready to show the linear convergence of RFASD when the objective function f is strongly convex; the result is summarized in the following theorem.

THEOREM 3.4. Suppose the objective function and the space decomposition satisfy (LCi) and (SC) with µ_A > 0. Let {x_k} be the sequence generated by Algorithm 2.1. Then for all k ≥ 0 we have the linear contraction
(3.6) E[f(x_{k+1})] − f(x*) ≤ (1 − µ_A/(J L̄_A C_A)) (E[f(x_k)] − f(x*)).

Proof.
The left-hand side of (3.4) can be rewritten as E[f(x_{k+1})] − f(x*) − (f(x_k) − f(x*)). Combining (3.4) and (3.5), rearranging terms, and taking the expectation with respect to x_k, we get the desired result.

Next, we give the convergence result for convex but not strongly convex objective functions, i.e., µ_A = 0 in (SC), based on the following bounded level set assumption.
• (BL) Bounded level set: f is convex and attains its minimum value f* on a set S. There is a finite constant R such that the level set of f defined by x_0 is bounded, that is,
(3.7) max_{x*∈S} max_x { ‖x − x*‖_A : f(x) ≤ f(x_0) } ≤ R.

LEMMA 3.5.
Suppose the objective function f satisfies (LC) and (BL). Then for all x ∈ V with f(x) ≤ f(x_0),
(3.8) f(x) − f(x*) ≤ R ‖∇f(x)‖_{A^{-1}}.

Proof.
By convexity and (BL), for x ∈ V with f(x) ≤ f(x_0),
f(x) − f* ≤ ⟨∇f(x), x − x*⟩ ≤ ‖∇f(x)‖_{A^{-1}} ‖x − x*‖_A ≤ R ‖∇f(x)‖_{A^{-1}}.
Suppose the objective function and space decomposition satisfy (LCi),(BL) and (SC) with µ A = 0 . Let { x k } be the sequence generated by Algorithm 2.1. Then forall k > , we have (3.9) E (cid:2) f ( x k ) (cid:3) − f ∗ ≤ d d c k ≤ R J ¯ L A C A k , where d = f ( x ) − f ∗ and c = 1 / (2 R J ¯ L A C A ) . Proof.
Note that (3.4) can be written as
E[f(x_{k+1})] − f(x*) − (f(x_k) − f(x*)) ≤ −(1/(2 L̄_A J C_A)) ‖∇f(x_k)‖²_{A^{-1}} ≤ −c (f(x_k) − f(x*))²,
where c = 1/(2R² J L̄_A C_A) and the last inequality follows from (3.8) with x = x_k. Now, denoting d_k = E[f(x_k)] − f* and taking the expectation with respect to x_k, we have
d_{k+1} − d_k ≤ −c d_k² ≤ 0.
Based on the above inequality, we obtain
1/d_{k+1} − 1/d_k = (d_k − d_{k+1})/(d_k d_{k+1}) ≥ (d_k − d_{k+1})/d_k² ≥ c.
Recursively applying this inequality, we obtain
(3.10) 1/d_k ≥ 1/d_0 + c k,
which implies (3.9).
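As a quick illustration of the recursion just used (this snippet is ours, not from the paper), one can verify numerically that the worst-case recursion d_{k+1} = d_k − c d_k² indeed stays below the O(1/k) bound of (3.9):

```python
# Worst case of the decay estimate: equality d_{k+1} = d_k - c*d_k^2.
# The proof shows 1/d_k >= 1/d_0 + c*k, i.e. d_k <= d_0 / (1 + c*d_0*k).
c, d0 = 1e-3, 5.0
d = d0
for k in range(1, 10001):
    d = d - c * d * d
    bound = d0 / (1.0 + c * d0 * k)
    assert d <= bound + 1e-12
print("the O(1/k) bound of (3.9) holds along the recursion")
```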
REMARK 3.7. The parameters R, J, L̄_A, and C_A can change dynamically, i.e., as functions of k. For example, we can use R_k := max_{x*∈S} max_x { ‖x − x*‖_A : f(x) ≤ f(x_k) }, which is smaller than R. The space decomposition and the local Lipschitz constants could also be improved during the iterations. In these cases, we write c_k for the resulting constant, and the last inequality (3.10) holds with the term c k on the right-hand side replaced by ∑_{i=1}^k c_i.

Based on the convergence results in Theorems 3.4 and 3.6, we can estimate the computational complexity of the proposed RFASD method and compare it with the GD, PGD, RCD, and RBCD methods. As usual, for a prescribed accuracy ε > 0, we first estimate how many iterations are needed to reach the tolerance and then estimate the overall computational complexity based on the cost of each iteration.

For gradient-based methods, the main cost per iteration is the evaluation of the gradient ∇f(x_k). In general, this may take O(N²) operations. In certain cases the cost can be reduced: for example, when ∇f(x_k) is sparse, i.e., computing one coordinate component of ∇f(x_k) only needs O(1) operations, then computing ∇f(x_k) takes O(N) operations. Another example is the use of advanced techniques, such as the fast multipole method [21, 8], which can reduce the cost to O(N log N). In our discussion, we focus on the general case (referred to as the dense case) and the sparse case. For subspace decomposition type algorithms, including RCD, RBCD, and RFASD, each iteration only needs to compute ∇f(x_k) restricted to one subspace and, therefore, the cost of computing the gradient is O(N n_i) (dense case) or O(n_i) (sparse case); note n_i = 1 for RCD. When preconditioning is applied, an extra cost is introduced besides computing the gradient. We assume that computing the inverse of an n × n matrix costs O(n^p) with p ≥ 1; then the extra cost for PGD is O(N^p) due to the need for A^{-1}. For the proposed RFASD, the extra cost is reduced to O(n_i^p), since we only need to compute A_i^{-1} on each subspace. The complexity comparison is summarized in Table 1.

TABLE 1
Complexity comparisons. (ε: target accuracy; L: Lipschitz constant in the ℓ²-norm; µ: strong convexity constant in the ℓ²-norm; L_A: Lipschitz constant in the A-norm; µ_A: strong convexity constant in the A-norm; L̄ := (1/J)∑_{i=1}^J L_i: averaged Lipschitz constant in the ℓ²-norm; L̄_A := (1/J)∑_{i=1}^J L_{A,i}: averaged Lipschitz constant in the A-norm. GD: gradient descent; PGD: preconditioned gradient descent; RCD: randomized coordinate descent; RBCD: randomized block coordinate descent; RFASD: randomized fast subspace descent.)

Method | Convex          | Strongly convex            | Cost/iter. (dense) | Cost/iter. (sparse)
GD     | O(L/ε)          | O((L/µ) |log ε|)           | O(N²)              | O(N)
PGD    | O(L_A/ε)        | O((L_A/µ_A) |log ε|)       | O(N² + N^p)        | O(N^p)
RCD    | O(N L̄/ε)        | O((N L̄/µ) |log ε|)         | O(N)               | O(1)
RBCD   | O(J L̄/ε)        | O((J L̄/µ) |log ε|)         | O(N n_i)           | O(n_i)
RFASD  | O(C_A J L̄_A/ε)  | O((C_A J L̄_A/µ_A) |log ε|) | O(N n_i + n_i^p)   | O(n_i^p)

From Table 1, it is clear that RFASD can take advantage of the preconditioning effect, i.e., L_A/µ_A ≪ L/µ or L̄_A/µ_A ≪ L̄/µ.
Meanwhile, there is no need to invert A globally; we only need to compute A_i^{-1} on each subspace, which reduces the computational cost in the sense that O(N n_i + n_i^p) ≪ O(N² + N^p) (dense case) or O(n_i^p) ≪ O(N^p) (sparse case) when n_i ≪ N and p ≥ 1. Of course, the key is a stable space decomposition in the A-norm such that the stability constant C_A is kept small. In the next section, we use Nesterov's "worst" problem [17] as an example to demonstrate how to achieve this in practice.
4. Examples.
In this section, we give some examples of the RFASD method and use the example introduced by Nesterov [17] to discuss different methods. We first recall Nesterov's "worst" problem [17].
EXAMPLE 4.1. For x ∈ R^N, consider the unconstrained minimization problem (1.1) with
(4.1) f(x) := f_{L,r}(x) = (L/4) ( (1/2) [ x_1² + ∑_{i=1}^{r−1} (x_i − x_{i+1})² + x_r² ] − x_1 ),
where x_i denotes the i-th coordinate of x and r < N is a constant integer that defines the intrinsic dimension of the problem. The minimum value of the function is
(4.2) f* = (L/8) ( −1 + 1/(r+1) ).

We follow [16] to present the RBCD methods. Let V = R^N be endowed with the standard ℓ²-norm ‖·‖, i.e., A = I. Define a partition of the identity matrix
I_N = (U_1, ···, U_J) ∈ R^{N×N}, U_i ∈ R^{N×n_i}, i = 1, ···, J,
and consider the space decomposition V = ⊕_{i=1}^J V_i, where V_i = Range(U_i) and ∑_{i=1}^J n_i = N. Naturally, I_i = U_i and R_i = U_i^⊺ in this setting. For each subspace, we also use the ℓ²-norm ‖·‖, i.e., A_i = I_{n_i} is the identity matrix of size n_i. In this setting, RFASD (Algorithm 2.1) is given by
(4.3) x_{k+1} = x_k + (1/L_{i_k}) s_{i_k}, s_{i_k} = −U_{i_k} U_{i_k}^⊺ ∇f(x_k).
This is just the RBCD algorithm proposed in [16]. Moreover, if the space decomposition is coordinate-wise, i.e., n_i = 1, i = 1, 2, ···, J = N, then it reduces to the RCD method.

Regarding the convergence analysis, since the subspace decomposition is direct and orthogonal in the ℓ² inner product, in this case we have
∑_{i=1}^J s_i = −∇f(x_k) and ∑_{i=1}^J ‖s_i‖² = ‖∇f(x_k)‖²,
which implies that C_A = 1 in (SD). Moreover, because the ℓ²-norm is used here, the Lipschitz constant and the strong convexity constant are measured in the ℓ²-norm and, hence, we drop the subscript A for those constants. Finally, we apply Theorems 3.4 and 3.6 and recover the classical convergence results of RBCD [16] as follows:
• Convex case: 2 J L̄ R² / k;
• Strongly convex case: (1 − µ/(J L̄))^k.
Consider Example 4.1 with r = N. Since the ℓ²-norm is used, it is easy to see that L̄ ≤ L and µ = O(L N^{−2}). Therefore, the condition number is L̄/µ = O(N²) and, for the strongly convex case, the convergence rate is 1 − 1/(J N²); according to Table 1, it requires O(N³ |log ε|) operations (due to the fact that this problem is sparse) to achieve a given accuracy ε. This could be quite expensive, even impractical, for large N, i.e., for large-scale problems.

The RFASD method allows us to use a preconditioner A without computing A^{-1}. We choose an appropriate A-norm ‖·‖_A defined by an SPD matrix A. Let
V = V_1 + V_2 + ··· + V_J, V_i ⊂ V, i = 1, ···, J,
be a space decomposition. For each subspace, we still use the A-norm, namely ‖v_i‖_{A_i} = ‖v_i‖_A for all v_i ∈ V_i. One can easily verify that A_i = R_i A I_i, which is the so-called Galerkin projection of A onto the subspace V_i. In this case, the averaged Lipschitz constant L̄_A and the strong convexity constant µ_A are measured in the A-norm. Based on Theorems 3.4 and 3.6, we naturally have
• Convex case: 2 J C_A L̄_A R² / k;
• Strongly convex case: (1 − µ_A/(J C_A L̄_A))^k.
Compared with the convergence results of RBCD, the key here is to design an appropriate preconditioner A, which induces the A-norm, and a corresponding stable decomposition such that C_A L̄_A ≪ L̄ in the convex case, or C_A L̄_A/µ_A ≪ L̄/µ in the strongly convex case. Then we achieve a speedup compared with RCD/RBCD. Of course, the choices of the preconditioner and the stable decomposition are usually problem dependent. Let us again consider Example 4.1 with r = N.
Note that the objective function f can be written in the matrix form
(4.4) f(x) = f_{L,N}(x) = (1/2)(Ax, x) − (x, b),
(4.5) A = (L/4) tridiag(−1, 2, −1) ∈ R^{N×N} and b = (L/4) e_1,
where e_1 = (1, 0, ···, 0)^⊺ ∈ R^N. A good choice of the preconditioner is A itself. It is easy to verify that, when measured in the A-norm, the averaged Lipschitz constant is L̄_A = 1 and the strong convexity constant is µ_A = 1. Therefore, the condition number measured in the A-norm is L̄_A/µ_A = 1 ≪ O(N²), i.e., much smaller than the condition number measured in the ℓ²-norm.

To achieve good overall performance, we also need a stable subspace decomposition that avoids inverting A directly while keeping C_A small. Note that A (to be precise, (4N²/L) A) is essentially the central finite difference approximation of −u′′(x) on the interval [0, 1] using a uniform mesh with N subintervals. A stable decomposition is given by the multilevel decomposition used in geometric multigrid methods [26, 9]; we postpone the details of this decomposition to Section 5. Using such a decomposition, we have n_i = 1, J = O(N log N), and C_A = O(1). Therefore, RFASD has complexity O(N log N |log ε|) to achieve a given accuracy ε, which is quasi-optimal and scalable for large N. Compared with the RCD/RBCD methods discussed in the previous section, the improvement is evident.
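To make the conditioning comparison concrete, the following small NumPy check (ours, not from the paper) builds the matrix A of (4.5) and reports its ℓ² condition number, which grows like O(N²), whereas in the A-norm the quadratic (4.4) has κ_A = L̄_A/µ_A = 1 as stated above.

```python
import numpy as np

L = 1.0
for N in (7, 63, 511):
    A = (L / 4.0) * (2 * np.eye(N) - np.eye(N, k=1) - np.eye(N, k=-1))
    eigs = np.linalg.eigvalsh(A)
    # l2 condition number of the quadratic; observed growth is roughly proportional to N^2
    print(N, eigs[-1] / eigs[0])
```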
5. Numerical Experiments.
In this section, we present some numerical results for solving Nesterov's worst function described in Example 4.1. We focus on the strongly convex case, i.e., r = N, to better demonstrate the advantages of using RFASD. The comparisons are made between RCD and RFASD. Results based on RCD and RFASD with permutation (non-replacement sampling), denoted by RCD_perm and RFASD_perm respectively, cyclic CD, and cyclic FASD are also presented to illustrate the influence of randomization.

For the implementation of RCD and RCD_perm, we follow [16] and use the step size α_k = 1/L_{i_k}; see (4.3). For Nesterov's worst function, i.e., Example 4.1, we have L_i = ‖a_i‖, where a_i is the i-th column of A = (L/4) tridiag(−1, 2, −1) ∈ R^{N×N}, and, therefore, α_k = 1/‖a_{i_k}‖ is used in our numerical experiments. The same step size is used for cyclic CD. In addition, due to the definition of A, L_1 = L_N ≈ L_2 = ··· = L_{N−1}. Thus, we use uniform sampling in RCD, i.e., p_i = 1/N.

According to the matrix form (4.4) of Nesterov's worst function, we use A = tridiag(−1, 2, −1) ∈ R^{N×N} as the preconditioner. As suggested in Section 4.2, the stable decomposition used here is based on geometric multigrid methods [26, 9]; we refer to [30] for a brief introduction to multigrid methods. More precisely, we treat the i-th component x_i as the function value at an artificial grid point ih with h = 1/N. We choose N = 2^{level} − 1 and use a multilevel nodal decomposition as the space decomposition (2.4). Figure 1 illustrates such a multilevel space decomposition for the case V = R^7, i.e., level = 3.

FIG. 1. Example: multilevel nodal decomposition for V = R^7 = V_1 + ··· + V_{11}: V_1, ···, V_7 are spanned by the coordinate vectors e_1, ···, e_7, and V_8, ···, V_{11} are spanned by scaled nodal (hat) vectors on the two coarser levels.
Note that we have
J < 2N and n_i = 1, i = 1, ···, J. With the choice A_i = R_i A I_i, it is well known that C_A = O(1) (see, e.g., [29]). Note that, in this setting, L_{A,i} = 1, i = 1, ···, J; therefore, the step size is simply α_k = 1 in our numerical experiments, and uniform sampling, p_i = 1/J, is used for RFASD. For RFASD_perm and cyclic FASD, we use the same space decomposition and step size.

For all the experiments, we choose the initial guess to be the vector (1, ···, 1)^⊺ ∈ R^N and stop the iterations when the relative norm of the gradient, ‖∇f(x_k)‖/‖∇f(x_0)‖, falls below a fixed tolerance. Due to the randomness in the algorithms, we repeat each test several times and report the average results.

TABLE 2
Performance of RCD, RCD_perm, and cyclic CD (for larger problem sizes, all three methods take too many iterations to converge and are therefore not reported here).

     |        RCD          |      RCD_perm       |     cyclic CD
N    | Iter.    | Epoch    | Iter.    | Epoch    | Iter.    | Epoch
7    | 1.4129e3 | 201.84   | 944.50   | 134.93   | 819      | 117
15   | 1.1054e4 | 736.96   | 7.4658e3 | 497.72   | 6.465e3  | 431
31   | 8.3177e4 | 2.6831e3 | 5.6284e4 | 1.8156e3 | 4.8576e4 | 1.5670e3
63   | 6.1011e5 | 9.6843e3 | 4.1218e5 | 6.5425e3 | 3.5519e5 | 5.6379e3

In Tables 2 and 3, we compare all the algorithms by reporting the number of iterations (Iter.) and the number of epochs (Epoch). Here, we borrow the terminology from machine learning and refer to one pass over all the subspaces as one epoch; therefore, the number of epochs is defined as Epoch = Iter./J. From the two tables, we immediately see that the FASD-type algorithms outperform the CD-type algorithms. For the CD-type algorithms, the number of iterations grows like O(N³), which confirms the discussion in Section 4.1; since J = N in this case, the number of epochs behaves like O(N²), which is consistent with Table 2. Among the three CD-type algorithms, RCD_perm uses the fewest iterations while RCD uses the most. For the FASD-type algorithms, we report the number of subspaces obtained by the multilevel decomposition; due to this construction, J < 2N. Following the discussion in Section 4.2, we expect the number of iterations and the number of epochs to behave like O(N) and O(1), respectively. This is confirmed by the numerical results in Table 3 and demonstrates that, thanks to the good choice of the preconditioner and a proper subspace decomposition, RFASD and its variants, RFASD_perm and cyclic FASD, achieve optimal computational complexity for Example 4.1. Finally, among the FASD-type algorithms, we observe that RFASD_perm seems to be the best choice. A better understanding of the convergence of RFASD_perm, and in particular its comparison with RFASD, is an open question and a subject of our future research. Nevertheless, we recommend using RFASD_perm in practice when possible.

TABLE 3
Performance of RFASD, RFASD_perm, and cyclic FASD.

      |       |       RFASD         |     RFASD_perm      |    cyclic FASD
N     | J     | Iter.    | Epoch    | Iter.    | Epoch    | Iter.    | Epoch
7     | 11    | 104.90   | 9.54     | 46.90    | 4.26     | 69       | 6.27
15    | 26    | 323.20   | 12.43    | 157.30   | 6.05     | 213      | 8.19
31    | 57    | 830.70   | 14.57    | 467.20   | 8.20     | 518      | 9.09
63    | 120   | 1.9249e3 | 16.04    | 975.40   | 8.13     | 1.103e3  | 9.19
127   | 247   | 3.8384e3 | 15.54    | 2.0484e3 | 8.29     | 2.278e3  | 9.22
255   | 502   | 7.7405e3 | 15.42    | 4.301e3  | 8.57     | 4.637e3  | 9.24
511   | 1,013 | 1.5797e4 | 15.59    | 8.2902e3 | 8.18     | 9.632e3  | 9.51
1,023 | 2,036 | 3.2071e4 | 15.75    | 1.7563e4 | 8.63     | 1.9352e4 | 9.50
2,047 | 4,083 | 6.7138e4 | 16.44    | 3.3591e4 | 8.23     | 3.88e4   | 9.50
4,095 | 8,178 | 1.2813e5 | 15.67    | 6.9891e4 | 8.55     | 7.7701e4 | 9.50
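For readers who want to reproduce the setup, the following sketch (our own illustration, not code from the paper) generates a multilevel nodal decomposition of R^N, N = 2^{level} − 1, with one-dimensional subspaces spanned by hat-function vectors on each grid level; it reproduces the subspace counts J reported in Table 3 (e.g., J = 11 for N = 7).

```python
import numpy as np

def multilevel_nodal_subspaces(level):
    """Sketch of a multilevel nodal decomposition of R^N, N = 2^level - 1:
    on each grid level, every interior node contributes a one-dimensional
    subspace spanned by the corresponding nodal (hat) vector."""
    N = 2 ** level - 1
    nodes = np.arange(1, N + 1) / (N + 1)      # fine-grid interior nodes
    subspaces = []
    for l in range(level, 0, -1):              # from finest (l = level) to coarsest
        n_l = 2 ** l - 1
        H = 1.0 / (n_l + 1)
        for j in range(1, n_l + 1):
            center = j * H
            # hat function of width 2H on level l, sampled at the fine-grid nodes
            phi = np.maximum(0.0, 1.0 - np.abs(nodes - center) / H)
            subspaces.append(phi)              # V_i = span{phi}, so n_i = 1
    return subspaces

subs = multilevel_nodal_subspaces(3)           # N = 7
print(len(subs))                               # 7 + 3 + 1 = 11 subspaces, matching J = 11
```

Each returned vector spans one subspace V_i; with the Galerkin choice the local matrix is the scalar φ_i^⊺ A φ_i, so the local solve in Algorithm 2.1 reduces to a single division.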
6. Conclusions and Future Work.
In this paper, we have derived a randomized version of the fast subspace descent methods: we first find a stable space decomposition of R^N and then randomly choose a subspace in which to apply a gradient descent step.

We have also developed a convergence analysis which shows that the convergence of the proposed method is governed by the condition number in a new A-norm; faster convergence is therefore achieved whenever this condition number is much smaller than the standard one measured in the ℓ²-norm and the space decomposition is A-stable.

For Nesterov's "worst" problem, we have found a stable multilevel space decomposition and shown, both theoretically and numerically, the optimal convergence rate. We have not, however, discussed how to design a stable space decomposition for other benchmark optimization problems, which is a subject of our ongoing work.

In data science applications, there is no natural grid hierarchy with which to construct the space decomposition; algebraic multigrid methods (AMG) will be a more appropriate approach. There is still very little theory and algorithmic work on AMG for nonlinear equations and optimization problems [25, 20]. In the future, we propose to develop an algebraic FASD method for nonlinear optimization problems.

REFERENCES
[1] A. Beck and L. Tetruashvili. On the convergence of block coordinate descent type methods. SIAM Journal on Optimization, 23(4):2037–2060, 2013.
[2] D. P. Bertsekas. Nonlinear programming. Journal of the Operational Research Society, 48(3):334–334, 1997.
[3] D. P. Bertsekas and J. N. Tsitsiklis. Parallel and Distributed Computation: Numerical Methods, volume 23. Prentice Hall, Englewood Cliffs, NJ, 1989.
[4] A. Brandt. Multi-level adaptive solutions to boundary-value problems. Mathematics of Computation, 31(138):333–390, 1977.
[5] L. Chen, X. Hu, and S. M. Wise. Convergence analysis of the fast subspace descent methods for convex optimization problems. Mathematics of Computation, to appear, 2020.
[6] E. Gelman and J. Mandel. On multilevel iterative methods for optimization problems. Mathematical Programming, 48(1-3):1–17, 1990.
[7] S. Gratton, A. Sartenaer, and P. L. Toint. Recursive trust-region methods for multiscale nonlinear optimization. SIAM Journal on Optimization, 19(1):414–444, 2008.
[8] L. Greengard and V. Rokhlin. A fast algorithm for particle simulations. Journal of Computational Physics, 73(2):325–348, 1987.
[9] W. Hackbusch. Multi-Grid Methods and Applications. Springer Science & Business Media, 2013.
[10] X. Hu, J. Xu, and L. T. Zikatanov. Randomized and fault-tolerant method of subspace corrections. Research in the Mathematical Sciences, 6(3):29, 2019.
[11] D. Leventhal and A. S. Lewis. Randomized methods for linear constraints: convergence rates and conditioning. Mathematics of Operations Research, 35(3):641–654, 2010.
[12] R. M. Lewis and S. G. Nash. Model problems for the multigrid optimization of systems governed by differential equations. SIAM Journal on Scientific Computing, 26(6):1811–1837, 2005.
[13] Z. Lu. Randomized block proximal damped Newton method for composite self-concordant minimization. SIAM Journal on Optimization, 27(3):1910–1942, 2017.
[14] Z. Q. Luo and P. Tseng. On the convergence of the coordinate descent method for convex differentiable minimization. Journal of Optimization Theory and Applications, 72(1):7–35, 1992.
[15] S. G. Nash. A multigrid approach to discretized optimization problems. Journal of Optimization Methods and Software, 14:99–116, 2000.
[16] Y. Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341–362, 2012.
[17] Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course, volume 87. Springer Science & Business Media, 2013.
[18] J. M. Ortega and W. C. Rheinboldt. Iterative Solution of Nonlinear Equations in Several Variables. Classics in Applied Mathematics. Society for Industrial and Applied Mathematics, 2000.
[19] B. T. Polyak. Introduction to Optimization. Optimization Software, Inc., Publications Division, New York, 1987.
[20] C. Ponce, D. Bindel, and P. S. Vassilevski. A nonlinear algebraic multigrid framework for the power flow equations. SIAM Journal on Scientific Computing, 40(3):B812–B833, 2018.
[21] V. Rokhlin. Rapid solution of integral equations of classical potential theory. Journal of Computational Physics, 60(2):187–207, 1985.
[22] T. Strohmer and R. Vershynin. A randomized Kaczmarz algorithm with exponential convergence. Journal of Fourier Analysis and Applications, 15(2):262, 2009.
[23] K. Stüben. A review of algebraic multigrid. In Numerical Analysis: Historical Developments in the 20th Century, pages 331–359. Elsevier, 2001.
[24] X.-C. Tai and J. Xu. Global and uniform convergence of subspace correction methods for some convex optimization problems. Mathematics of Computation, 71(237):105–125, 2001.
[25] E. Treister, J. S. Turek, and I. Yavneh. A multilevel framework for sparse optimization with application to inverse covariance estimation and logistic regression. SIAM Journal on Scientific Computing, 38(5):S566–S592, 2016.
[26] U. Trottenberg, C. W. Oosterlee, and A. Schuller. Multigrid. Elsevier, 2000.
[27] P. Tseng. Convergence of a block coordinate descent method for nondifferentiable minimization. Journal of Optimization Theory and Applications, 109(3):475–494, 2001.
[28] S. J. Wright. Coordinate descent algorithms. Mathematical Programming, 151(1):3–34, 2015.
[29] J. Xu. Iterative methods by space decomposition and subspace correction. SIAM Review, 34(4):581–613, 1992.
[30] I. Yavneh. Why multigrid methods are so efficient.