Bregman Finito/MISO for nonconvex regularized finite sum minimization without Lipschitz gradient continuity
Puya Latafat, Andreas Themelis, Masoud Ahookhosh, Panagiotis Patrinos
Abstract. We introduce two algorithms for nonconvex regularized finite sum minimization, where typical Lipschitz differentiability assumptions are relaxed to the notion of relative smoothness [7]. The first one is a Bregman extension of Finito/MISO [28, 42], studied for fully nonconvex problems when the sampling is random, or under convexity of the nonsmooth term when it is essentially cyclic. The second algorithm is a low-memory variant, in the spirit of SVRG [34] and SARAH [48], that also allows for fully nonconvex formulations. Our analysis is made remarkably simple by employing a Bregman Moreau envelope as Lyapunov function. In the randomized case, linear convergence is established when the cost function is strongly convex, yet with no convexity requirements on the individual functions in the sum. For the essentially cyclic and low-memory variants, global and linear convergence results are established when the cost function satisfies the Kurdyka-Łojasiewicz property.
1. Introduction
We consider the following regularized finite sum minimization

    minimize_{x∈ℝⁿ} φ(x) ≔ (1/N) ∑_{i=1}^N f_i(x) + g(x)   subject to x ∈ cl C,    (P)

where cl C denotes the closure of C ≔ ∩_{i=1}^N int dom h_i, for some convex functions h_i, i ∈ [N] ≔ {1, …, N}. Our goal in this paper is to study such problems without imposing convexity assumptions on f_i and g, and in a setting where the f_i are differentiable but their gradients need not be Lipschitz continuous. Our full setting is formalized in Assumption I.

To relax the Lipschitz differentiability assumption, we adopt the notion of smoothness relative to a distance-generating function [7], and following [40] we will use the terminology of relative smoothness. Despite the lack of Lipschitz differentiability, in many applications the involved functions satisfy a descent property where the usual quadratic upper bound is replaced by a Bregman distance (cf. Fact 2.5(i) and Definition 2.1). Owing to this property, Bregman extensions of many classical schemes have been proposed [7, 40, 6, 59, 49, 1].

In the setting of finite sum minimization, the incremental aggregated algorithm PLIAG was proposed recently [69] as a Bregman variant of the incremental aggregated gradient method [15, 16, 65]. The analysis of PLIAG is limited to the convex case and requires restrictive assumptions on the Bregman kernel [69, Thm. 4.1, Assump. B4]. Stochastic mirror descent (SMD) is another relevant algorithm, which can tackle more general stochastic optimization problems. SMD may be viewed as a Bregman extension of the stochastic (sub)gradient method and has long been studied [46, 61, 11, 45]. More recently, [32] studied SMD for convex and relatively smooth formulations, and (sub)gradient versions have been analyzed under relative continuity in a convex setting [39], as well as under relative weak convexity [71, 25].

Department of Electrical Engineering (ESAT-STADIUS) – KU Leuven, Kasteelpark Arenberg 10, 3001 Leuven, Belgium
Faculty of Information Science and Electrical Engineering (ISEE) – Kyushu University, 744 Motooka, Nishi-ku 819-0395, Fukuoka, Japan
Department of Mathematics and Computer Science, University of Antwerp, Middelheimlaan 1, B-2020 Antwerp, Belgium
E-mail addresses: [email protected], [email protected], [email protected], [email protected].
2020 Mathematics Subject Classification.
Key words and phrases. Nonsmooth nonconvex optimization, incremental aggregated algorithms, Bregman Moreau envelope, KL inequality.
This work was supported by the Research Foundation Flanders (FWO) PhD grant 1196820N and research projects G0A0920N, G086518N, and G086318N; Research Council KU Leuven C1 project No. C14 / /
Algorithm 1
Bregman Finito/MISO (BFinito) for the regularized finite sum minimization (P)
Require: Legendre kernels h_i such that f_i is L_{f_i}-smooth relative to h_i; stepsizes γ_i ∈ (0, N/L_{f_i}); initial point x_init ∈ C ≔ ∩_{i=1}^N int dom h_i
Initialize: table s⁰ = (s⁰_1, …, s⁰_N) ∈ ℝ^{nN} of vectors s⁰_i = (1/γ_i)∇h_i(x_init) − (1/N)∇f_i(x_init); ℝⁿ-vector s̃⁰ = ∑_{i=1}^N s⁰_i
Repeat for k = 0, 1, … until convergence:
  1.1: Compute z^k ∈ argmin_{w∈ℝⁿ} { g(w) + ∑_{i=1}^N (1/γ_i) h_i(w) − ⟨s̃^k, w⟩ }.
  1.2: Select a subset of indices 𝓘^{k+1} ⊆ [N] ≔ {1, …, N} and update the table s^{k+1} as follows:
       s^{k+1}_i = (1/γ_i)∇h_i(z^k) − (1/N)∇f_i(z^k) if i ∈ 𝓘^{k+1}, and s^{k+1}_i = s^k_i otherwise.
  1.3: Update the vector s̃^{k+1} = s̃^k + ∑_{i∈𝓘^{k+1}} (s^{k+1}_i − s^k_i).
Return z^k

Motivated by these recent works, we propose a Bregman extension of the popular Finito/MISO algorithm [28, 42] in a fully nonconvex setting and with very general sampling strategies that will be made precise shortly. In a nutshell, our analysis revolves around the fact that, regardless of the index selection strategy, the function 𝓛 : ℝⁿ × ℝ^{nN} → ℝ̄ defined as

    𝓛(z, s) ≔ φ(z) + ∑_{i=1}^N D_{ĥ*_i}(s_i, ∇ĥ_i(z)),    (1.1)

where ĥ*_i denotes the convex conjugate of ĥ_i ≔ h_i/γ_i − f_i/N, monotonically decreases along the iterates (z^k, s^k)_{k∈ℕ} generated by Algorithm 1 (see Assumption I for the requirements on h_i, f_i). Our methodology leverages an interpretation of Finito/MISO as a block-coordinate algorithm that was observed in [37] in the Euclidean setting. In fact, the analysis is here further simplified after noticing that the smooth function can be "hidden" in the distance-generating function, resulting in a Lyapunov function 𝓛 that can be expressed as a
Bregman Moreau envelope (cf. Lemma 3.2).

We cover a wide range of sampling strategies for the index set 𝓘^{k+1} at step 1.2, which we can summarize into the following two umbrella categories:

    Random sampling rule: ∃ p_1, …, p_N > 0 : ℙ_k[i ∈ 𝓘^{k+1}] = p_i  ∀k ∈ ℕ, i ∈ [N].    (S1)
    Essentially cyclic rule: ∃ T > 0 : ∪_{t=1}^T 𝓘^{k+t} = [N]  ∀k ∈ ℕ.    (S2)

The randomized setting (S1), in which ℙ_k denotes the probability conditional to the knowledge at iteration k, covers, for instance, a mini-batch strategy of size b. Another notable case is when each index i is selected at random with probability p_i independently of the other indices.

The essentially cyclic sampling (S2) is also very general and has been considered by many authors [62, 60, 33, 24, 67]. Two notable special cases of single index selection rules complying with (S2) are the cyclic and shuffled cyclic sampling strategies:

    Shuffled cyclic rule: 𝓘^{k+1} = { π_⌊k/N⌋(mod(k, N) + 1) },    (S-shuf)

where π_0, π_1, … are permutations of the set of indices [N] (chosen randomly or deterministically). When π_⌊k/N⌋ = id one recovers the (plain) cyclic sampling rule

    Cyclic rule: 𝓘^{k+1} = { mod(k, N) + 1 }.    (S-cycl)

We remark that, in the cyclic case, our algorithm generalizes DIAG [44] for smooth strongly convex problems, which itself may be seen as a cyclic variant of Finito/MISO.
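To make the bookkeeping of Algorithm 1 and the sampling rules concrete, below is a minimal sketch of the Euclidean specialization h_i = (1/2)‖·‖², for which step 1.1 has a closed form when g = λ‖·‖₁ (soft-thresholding). The least-squares data, the stepsize safety factor 0.99, and the uniform single-index instance of rule (S1) are illustrative assumptions, not prescriptions of the paper.

```python
import numpy as np

# Minimal Euclidean instance of Algorithm 1 (BFinito): h_i = (1/2)||.||^2,
# f_i(x) = (1/2)(<a_i, x> - b_i)^2, g = lam*||.||_1 (synthetic, illustrative data).
rng = np.random.default_rng(0)
N, n, lam = 50, 10, 0.1
A, b = rng.standard_normal((N, n)), rng.standard_normal(N)

grad_f = lambda i, x: (A[i] @ x - b[i]) * A[i]   # gradient of f_i
L = np.sum(A * A, axis=1)                        # L_{f_i} = ||a_i||^2 for the Euclidean kernel
gamma = 0.99 * N / L                             # stepsizes gamma_i in (0, N/L_{f_i})
c = np.sum(1.0 / gamma)                          # curvature of sum_i h_i/gamma_i
soft = lambda v, t: np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

x0 = np.zeros(n)
s = np.array([x0 / gamma[i] - grad_f(i, x0) / N for i in range(N)])  # table s^0
s_tilde = s.sum(axis=0)                                              # aggregate s~^0

for k in range(2000):
    z = soft(s_tilde / c, lam / c)               # step 1.1: prox of g in the weighted metric
    i = rng.integers(N)                          # rule (S1) with p_i = 1/N, single index
    s_new = z / gamma[i] - grad_f(i, z) / N      # step 1.2: refresh the i-th table entry
    s_tilde += s_new - s[i]                      # step 1.3: update the aggregate
    s[i] = s_new

print("phi(z) =", 0.5 * np.mean((A @ z - b) ** 2) + lam * np.abs(z).sum())
```

Replacing the random draw with i = k % N recovers the cyclic rule (S-cycl); only the index selection at step 1.2 changes, the table and aggregate updates being identical.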
1.1. Low-memory variant.
One iteration of Algorithm 1 involves the computation of z^k at step 1.1 and that of the gradients ∇(h_i/γ_i − f_i/N), i ∈ 𝓘^{k+1}, at step 1.2. Consequently, the overall complexity of each iteration is independent of the number N of functions appearing in problem (P), and is instead proportional to the number of sampled indices, which the user is allowed to upper bound by any integer between 1 and N. As is the case for all incremental gradient methods, the low iteration cost comes at the price of having to store in memory a table s^k of N many ℝⁿ vectors, which can become problematic when N grows large. Other incremental algorithms for convex optimization such as
Algorithm 2
Low-memory Bregman Finito/MISO
Require: Legendre kernels h_i such that f_i is L_{f_i}-smooth relative to h_i; stepsizes γ_i ∈ (0, N/L_{f_i}); initial point x_init ∈ C ≔ ∩_{i=1}^N int dom h_i
Initialize: ℝⁿ-vector s̃⁰ = ∑_{i=1}^N [(1/γ_i)∇h_i(x_init) − (1/N)∇f_i(x_init)]; set of selectable indices 𝓚⁰ = ∅  ▹ conventionally set to ∅ so as to start with a full update
Repeat for k = 0, 1, … until convergence:
  2.1: z^k ∈ argmin_{w∈ℝⁿ} { g(w) + ∑_{i=1}^N (1/γ_i) h_i(w) − ⟨s̃^k, w⟩ }
  2.2: if 𝓚^k = ∅ then  ▹ no index left to be sampled: full update
  2.3:   𝓘^{k+1} = 𝓚^{k+1} = [N]  ▹ activate all indices and reset the selectable indices
  2.4:   z̃^k = z^k  ▹ store the full update z^k
  2.5:   s̃^{k+1} = ∑_{i=1}^N (1/γ_i)∇h_i(z^k) − (1/N) ∑_{i=1}^N ∇f_i(z^k)
  2.6: else
  2.7:   select a nonempty subset of indices 𝓘^{k+1} ⊆ 𝓚^k  ▹ select among the indices not yet sampled
  2.8:   𝓚^{k+1} = 𝓚^k \ 𝓘^{k+1}  ▹ update the set of selectable indices
  2.9:   z̃^k = z̃^{k−1}
  2.10:  s̃^{k+1} = s̃^k + ∑_{i∈𝓘^{k+1}} [((1/γ_i)∇h_i(z^k) − (1/N)∇f_i(z^k)) − ((1/γ_i)∇h_i(z̃^k) − (1/N)∇f_i(z̃^k))]
Return z̃^k

IAG [15, 16, 65], IUG [64], SAG [54], and SAGA [27] can considerably reduce memory allocation from O(nN) to O(n) in applications such as logistic regression and lasso, where the gradients ∇f_i can be expressed as scaled versions of the data vectors. Despite the favorable performance of the Finito/MISO algorithm on such problems as observed in [27], this memory reduction trick cannot be employed, due to the fact that the vectors s_i stored in the table depend not only on the gradients, but also on the vectors ∇h_i(z^k). Nevertheless, inspired by the popular stochastic methods SVRG [34, 66] and SARAH [48], by suitably interleaving incremental and full gradient evaluations it is possible to completely waive the need of a memory table and match the O(n) storage requirement.

In a nutshell, after a full update — which in Algorithm 1 corresponds to selecting 𝓘^{k+1} = [N] — all vectors s^{k+1}_i in the table only depend on the variable z^k computed at step 1.1, until i is sampled again. As long as full gradient updates are frequent enough so that no index is sampled twice in between, it thus suffices to keep track of z^k ∈ ℝⁿ instead of the table s^k ∈ ℝ^{nN}. The variant is detailed in Algorithm 2, in which 𝓚^k ⊆ [N] keeps track of the indices that have not yet been sampled between full gradient updates (and is thus reset whenever such full steps occur, cf. step 2.3). The vector z̃^k ∈ ℝⁿ equals the z^k corresponding to the latest full gradient update (cf. step 2.4) and acts as a low-memory surrogate of the table s^k of Algorithm 1. Similarly to SVRG and SARAH, this reduction in the storage requirements comes at the cost of an extra gradient evaluation per sampled index (cf. step 2.10); a minimal sketch of this bookkeeping is given below.

Since full gradient updates correspond to selecting all indices, Algorithm 2 may be viewed as Algorithm 1 with an essentially cyclic sampling rule of period N, a claim that will be formally shown in Lemma 4.12. In fact, not only does it naturally inherit all the convergence results, but its particular sampling strategy also allows us to waive convexity requirements on g that are necessary for more general essentially cyclic rules.
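The following sketch isolates the O(n) bookkeeping of Algorithm 2 in the same illustrative Euclidean setting as the earlier snippet (all data synthetic): only the aggregate s̃, the last full-update point z̃, and the set 𝓚 of still-selectable indices are stored, never the table.

```python
import numpy as np

# Sketch of Algorithm 2's O(n) storage: keep only s~, z~ and the selectable
# set K; Euclidean kernels, least-squares f_i, g = lam*||.||_1 (illustrative).
rng = np.random.default_rng(1)
N, n, lam = 50, 10, 0.1
A, b = rng.standard_normal((N, n)), rng.standard_normal(N)
grad_f = lambda i, x: (A[i] @ x - b[i]) * A[i]
gamma = 0.99 * N / np.sum(A * A, axis=1)
c = np.sum(1.0 / gamma)
soft = lambda v, t: np.sign(v) * np.maximum(np.abs(v) - t, 0.0)
v = lambda i, x: x / gamma[i] - grad_f(i, x) / N   # (1/gamma_i) grad h_i - (1/N) grad f_i

z_bar = np.zeros(n)                                # z~: point of the last full update
s_tilde = sum(v(i, z_bar) for i in range(N))
K = set()                                          # empty: triggers a full update first (step 2.2)

for k in range(3000):
    z = soft(s_tilde / c, lam / c)                 # step 2.1: same subproblem as Algorithm 1
    if not K:                                      # steps 2.3-2.5: full update
        K = set(range(N))                          # reset the selectable indices
        z_bar = z                                  # store the full-update point
        s_tilde = sum(v(i, z) for i in range(N))   # full aggregate refresh
    else:                                          # steps 2.7-2.10: incremental update
        i = K.pop()                                # an index not sampled since the full pass
        s_tilde += v(i, z) - v(i, z_bar)           # swap the stale contribution of index i

print("phi(z) =", 0.5 * np.mean((A @ z - b) ** 2) + lam * np.abs(z).sum())
```

Each index is used exactly once between full passes, reflecting the "no index sampled twice in between" property that makes the table redundant.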
1.2. Contributions. As a means to informally summarize the content of the paper, in Table 1 we synopsize the convergence results of the two algorithms.

1. To the best of our knowledge, this is the first analysis of an incremental aggregated method in a fully nonconvex setting and without Lipschitz differentiability assumptions. Our analysis is surprisingly simple, and yet it covers randomized and essentially cyclic samplings altogether; it relies on a sure descent property of the Bregman Moreau envelope (cf. Lemma 4.2).

2. We propose a novel low-memory variant of the (Bregman) Finito/MISO algorithm that, in the spirit of SVRG [34, 66] and SARAH [48], alternates between incremental steps and a full proximal gradient step. It is highly interesting even in the Euclidean case, as it can accommodate fully nonconvex formulations while maintaining an O(n) memory requirement.
Table 1. Summary of the convergence results for Algorithm 1 with randomized sampling (S1) and essentially cyclic sampling (S2), and for the low-memory variant of Algorithm 2 (LM). Claims are either sure, almost sure (a.s.), or in expectation (E).
Other abbreviations: loc: locally; cvx: convex; str: strongly; smooth: Lipschitz differentiable; ω: set of limit points; kl_θ: Kurdyka-Łojasiewicz property with exponent θ.

Property                | Sampling | Requirements (additionally to Assumption I)                  | Claim | Reference
z^k bounded             | any      | φ + δ_{cl C} level bounded                                    | sure  | Lemma 4.2(iv)
φ(z^k) convergent       | (S1)     |                                                               | a.s.  | Theorem 4.6(ii)
                        | (S2)     | C = ℝⁿ; g cvx; h_i (loc) str cvx smooth; (φ level bounded)    | sure  | Theorem 4.9
                        | LM       |                                                               | sure  | Theorem 4.13
ω(z^k) stationary       | (S1)     | either C = ℝⁿ                                                 | a.s.  | Theorem 4.6(iv)
                        |          | or dom h_i closed; φ cvx                                      | a.s.  | Theorem 4.6(vi)
                        | (S2)     | C = ℝⁿ; g cvx; h_i (loc) str cvx smooth; (φ level bounded)    | sure  | Theorem 4.9
                        | LM       | C = ℝⁿ                                                        | sure  | Theorem 4.13
z^k convergent          | (S1)     | either C = ℝⁿ; φ cvx                                          | a.s.  | Theorem 4.6(vii)
                        |          | or Assumption II; φ + δ_{cl C} cvx level bounded              | a.s.  |
                        | (S2)     | Assumption III; g cvx                                         | sure  | Theorem 4.11(i)
                        | LM       | Assumption III                                                | sure  | Theorem 4.14(i)
φ(z^k) and z^k linearly | (S1)     | C = ℝⁿ; φ str cvx; h_i loc smooth                             | E     | Theorem 4.7
convergent              | (S2)     | Assumption III; φ kl_{1/2}; g cvx                             | sure  | Theorem 4.11(iii)
                        | LM       | Assumption III; φ kl_{1/2}                                    | sure  | Theorem 4.14(iii)
3. Linear convergence of Algorithm 1 in the randomized case is established when the cost function φ is strongly convex, yet with no convexity requirement on f_i or g. To the best of our knowledge, this is a novelty even in the Euclidean case, for all available results are bound to strong convexity of each term f_i in the sum; see, e.g., [28, 42, 44, 37, 50].

4. We leverage the Kurdyka-Łojasiewicz (KL) property to establish global (as opposed to subsequential) convergence as well as linear convergence, for Algorithm 1 with (essentially) cyclic sampling and for the low-memory Algorithm 2.

1.3. Organization.
The problem setting is formally described in Section 2 together with a list of related definitions and known facts involving Bregman distances, relative smoothness, and proximal mappings. Section 3 offers an alternative interpretation of Algorithm 1 as the block-coordinate Bregman proximal point Algorithm 3, which majorly simplifies the analysis, addressed in Section 4. Some auxiliary results are deferred to Appendices A and B. Section 5 applies the proposed algorithms to sparse phase retrieval problems, and Section 6 concludes the paper. We conclude this section by introducing some notational conventions.

1.4. Notation.
The sets of real and extended-real numbers are ℝ ≔ (−∞, ∞) and ℝ̄ ≔ ℝ ∪ {∞}, while the positive and strictly positive reals are ℝ₊ ≔ [0, ∞) and ℝ₊₊ ≔ (0, ∞). With id we indicate the identity function x ↦ x defined on a suitable space. We denote by ⟨·,·⟩ and ‖·‖ the standard Euclidean inner product and the induced norm. For a vector w = (w_1, …, w_r) ∈ ℝ^{∑_i n_i}, w_i ∈ ℝ^{n_i} is used to denote its i-th block coordinate. int E and bdry E respectively denote the interior and boundary of a set E, and for a sequence (x^k)_{k∈ℕ} we write (x^k)_{k∈ℕ} ⊆ E to indicate that x^k ∈ E for all k ∈ ℕ. We say that (x^k)_{k∈ℕ} converges at Q-linear rate (resp. R-linear rate) to a point x if there exists c ∈ (0, 1) such that ‖x^{k+1} − x‖ ≤ c‖x^k − x‖ (resp. ‖x^k − x‖ ≤ ρ c^k for some ρ > 0) holds for all k ∈ ℕ.

We use the notation H : ℝⁿ ⇉ ℝ^m to indicate a mapping from each point x ∈ ℝⁿ to a subset H(x) of ℝ^m. The graph of H is the set gph H ≔ {(x, y) ∈ ℝⁿ × ℝ^m | y ∈ H(x)}. We say that H is outer semicontinuous (osc) if gph H is a closed subset of ℝⁿ × ℝ^m, and locally bounded if for every bounded U ⊂ ℝⁿ the set ∪_{x∈U} H(x) is bounded.

The domain and epigraph of an extended-real-valued function h : ℝⁿ → ℝ̄ are the sets dom h ≔ {x ∈ ℝⁿ | h(x) < ∞} and epi h ≔ {(x, α) ∈ ℝⁿ × ℝ | h(x) ≤ α}. Function h is said to be proper if dom h ≠ ∅, and lower semicontinuous (lsc) if epi h is a closed subset of ℝ^{n+1}. We say that h is level bounded if its α-sublevel set lev_{≤α} h ≔ {x ∈ ℝⁿ | h(x) ≤ α} is bounded for all α ∈ ℝ. The conjugate of h is defined by h*(y) ≔ sup_{x∈ℝⁿ} {⟨y, x⟩ − h(x)}. The indicator function of a set E ⊆ ℝⁿ is denoted by δ_E, namely δ_E(x) = 0 if x ∈ E and ∞ otherwise.

We denote by ∂̂h : ℝⁿ ⇉ ℝⁿ the regular subdifferential of h, where

    v ∈ ∂̂h(x̄)  ⇔  liminf_{x̄≠x→x̄} [h(x) − h(x̄) − ⟨v, x − x̄⟩] / ‖x − x̄‖ ≥ 0.

A necessary condition for local minimality of x for h is 0 ∈ ∂̂h(x), see [53, Thm. 10.1]. The (limiting) subdifferential of h is ∂h : ℝⁿ ⇉ ℝⁿ, where v ∈ ∂h(x) iff x ∈ dom h and there exists a sequence (x^k, v^k)_{k∈ℕ} ⊆ gph ∂̂h such that (x^k, h(x^k), v^k) → (x, h(x), v) as k → ∞. Finally, the set of r times continuously differentiable functions from X to ℝ is denoted by 𝒞^r(X).

2. Problem setting and preliminaries

Throughout this paper, problem (P) is studied under the following assumptions.
Assumption I (basic requirements). In problem (P),
 a1: f_i : ℝⁿ → ℝ̄ are L_{f_i}-smooth relative to Legendre kernels h_i (Definitions 2.2 and 2.4);
 a2: g : ℝⁿ → ℝ̄ is proper and lower semicontinuous (lsc);
 a3: a solution exists: argmin {φ(x) | x ∈ cl C} ≠ ∅;
 a4: for given γ_i ∈ (0, N/L_{f_i}), i ∈ [N], it holds for any s ∈ ℝⁿ that

    T(s) ≔ argmin_{w∈ℝⁿ} { g(w) + ∑_{i=1}^N (1/γ_i) h_i(w) − ⟨s, w⟩ } ⊆ C.    (2.1)

As it will become clear in Section 3, the subproblem (2.1) is in fact a reformulation of a (Bregman) proximal mapping. Assumption I.a4 excludes boundary points from range T. This is a standard assumption that usually holds in practice [20, 56], e.g., when g is convex or when the intersection of the dom h_i, i ∈ [N], is an open set.

Definition 2.1 (Bregman distance). For a convex function h : ℝⁿ → ℝ̄ that is continuously differentiable on int dom h ≠ ∅, the Bregman distance D_h : ℝⁿ × ℝⁿ → ℝ̄ is defined as

    D_h(x, y) ≔ h(x) − h(y) − ⟨∇h(y), x − y⟩ if y ∈ int dom h, and ∞ otherwise.    (2.2)

Function h will be referred to as a distance-generating function.

Definition 2.2 (Legendre kernel). A proper, lsc, and strictly convex function h : ℝⁿ → ℝ̄ with int dom h ≠ ∅ and such that h ∈ 𝒞¹(int dom h) is said to be a Legendre kernel if it is (i) supercoercive, i.e., such that lim_{‖x‖→∞} h(x)/‖x‖ = ∞, and (ii) essentially smooth, i.e., if ‖∇h(x^k)‖ → ∞ for every sequence (x^k)_{k∈ℕ} ⊆ int dom h converging to a boundary point of dom h.
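As a concrete illustration of Definition 2.1, the snippet below evaluates D_h for three Legendre kernels: the Euclidean one, the Boltzmann-Shannon entropy (dom h = ℝⁿ₊), and the quartic kernel h(x) = (1/4)‖x‖⁴ + (1/2)‖x‖², a standard choice for quartic-growth objectives such as the phase retrieval problems of Section 5. The sample points are arbitrary and chosen inside int dom h.

```python
import numpy as np

# Definition 2.1 in code: D_h(x, y) = h(x) - h(y) - <grad h(y), x - y>,
# for three illustrative Legendre kernels (domains assumed respected by inputs).
def bregman(h, grad_h):
    return lambda x, y: h(x) - h(y) - grad_h(y) @ (x - y)

D_euc = bregman(lambda x: 0.5 * x @ x, lambda x: x)
D_ent = bregman(lambda x: np.sum(x * np.log(x) - x), np.log)   # int dom h = R^n_++
D_qrt = bregman(lambda x: 0.25 * (x @ x) ** 2 + 0.5 * x @ x,
                lambda x: (x @ x) * x + x)

rng = np.random.default_rng(0)
x, y = rng.random(5) + 0.1, rng.random(5) + 0.1                # strictly positive points
for name, D in [("euclidean", D_euc), ("entropy", D_ent), ("quartic", D_qrt)]:
    print(f"{name:10s} D(x,y) = {D(x, y):.4f}   D(y,x) = {D(y, x):.4f}")  # >= 0, asymmetric
```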
Fact 2.3. The following hold for a Legendre kernel h : ℝⁿ → ℝ̄, x ∈ ℝⁿ, and y, z ∈ int dom h:
(i) h* ∈ 𝒞¹(ℝⁿ) is strictly convex and ∇h^{−1} = ∇h* [52, Thm. 26.5 and Cor. 13.3.1].
(ii) D_h(x, z) = D_h(x, y) + D_h(y, z) + ⟨x − y, ∇h(y) − ∇h(z)⟩ [22, Lem. 3.1].
(iii) D_h(y, z) = D_{h*}(∇h(z), ∇h(y)) [8, Thm. 3.7(v)].
(iv) D_h(·, z) and D_h(z, ·) are level bounded [9, Lem. 7.3(v)-(viii)].
(v) If dom h is closed and D_h(x^k, y^k) → 0 for some x^k ∈ dom h and y^k ∈ int dom h, then (x^k)_{k∈ℕ} converges to a point x iff so does (y^k)_{k∈ℕ} [56, Thm. 2.4].
Moreover, for any convex set 𝓤 ⊆ int dom h and u, v ∈ 𝓤 the following hold:
(vi) If h is μ_{h,𝓤}-strongly convex on 𝓤, then (μ_{h,𝓤}/2)‖v − u‖² ≤ D_h(v, u) ≤ (1/(2μ_{h,𝓤}))‖∇h(v) − ∇h(u)‖².
(vii) If ∇h is ℓ_{h,𝓤}-Lipschitz on 𝓤, then (1/(2ℓ_{h,𝓤}))‖∇h(v) − ∇h(u)‖² ≤ D_h(v, u) ≤ (ℓ_{h,𝓤}/2)‖v − u‖².

Definition 2.4 (relative smoothness [7]). We say that a proper, lsc function f : ℝⁿ → ℝ̄ is smooth relative to a Legendre kernel h : ℝⁿ → ℝ̄ if dom f ⊇ dom h, and there exists L_f ≥ 0 such that L_f h ± f are convex functions on int dom h. We will simply say that f is L_f-smooth relative to h to make the modulus L_f explicit.
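Facts 2.3(ii) and 2.3(iii) can be checked numerically; a small sketch for the entropy kernel h(x) = ∑_j (x_j log x_j − x_j), whose conjugate h*(y) = ∑_j e^{y_j} and gradient maps ∇h = log, ∇h* = exp are classical, is given below (random positive test points, purely illustrative).

```python
import numpy as np

# Sanity checks of Fact 2.3 for the Boltzmann-Shannon kernel:
# h(x) = sum(x log x - x), h*(y) = sum(exp(y)), grad h = log, grad h* = exp.
h      = lambda x: np.sum(x * np.log(x) - x)
h_conj = lambda y: np.sum(np.exp(y))
D  = lambda x, y: h(x) - h(y) - np.log(y) @ (x - y)            # D_h
Dc = lambda u, v: h_conj(u) - h_conj(v) - np.exp(v) @ (u - v)  # D_{h*}

rng = np.random.default_rng(0)
x, y, z = (rng.random(4) + 0.1 for _ in range(3))

# (ii) three-point identity
lhs = D(x, z)
rhs = D(x, y) + D(y, z) + (x - y) @ (np.log(y) - np.log(z))
print(abs(lhs - rhs))                            # ~ 0 up to rounding

# (iii) duality: D_h(y, z) = D_{h*}(grad h(z), grad h(y))
print(abs(D(y, z) - Dc(np.log(z), np.log(y))))   # ~ 0
```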
Fact 2.5.
Let f : ℝⁿ → ℝ̄ be L_f-smooth relative to a Legendre kernel h : ℝⁿ → ℝ̄. Then, f ∈ 𝒞¹(int dom h) and the following hold:
(i) |f(y) − f(x) − ⟨∇f(x), y − x⟩| ≤ L_f D_h(y, x) for all x, y ∈ int dom h.
(ii) −L_f ∇²h ⪯ ∇²f ⪯ L_f ∇²h on int dom h, provided that f, h ∈ 𝒞²(int dom h).
(iii) If ∇h is Lipschitz continuous with modulus ℓ_{h,𝓤} on a convex set 𝓤, then so is ∇f with modulus ℓ_{f,𝓤} = L_f ℓ_{h,𝓤} [1, Prop. 2.5(ii)].

Relative to a Legendre kernel h : ℝⁿ → ℝ̄, the Bregman proximal mapping of ψ is the set-valued map prox^h_ψ : int dom h ⇉ ℝⁿ given by

    prox^h_ψ(x) ≔ argmin_{z∈ℝⁿ} {ψ(z) + D_h(z, x)},    (2.3)

and the corresponding Bregman-Moreau envelope is ψ^h : ℝⁿ → [−∞, ∞] defined as

    ψ^h(x) ≔ inf_{z∈ℝⁿ} {ψ(z) + D_h(z, x)}.    (2.4)

Fact 2.6 (regularity properties of prox^h_ψ and ψ^h [35]). The following hold for a Legendre kernel h : ℝⁿ → ℝ̄ and a proper, lsc, lower bounded function ψ : ℝⁿ → ℝ̄:
(i) prox^h_ψ is locally bounded, compact-valued, and outer semicontinuous on int dom h.
(ii) ψ^h is real-valued and continuous on int dom h; in fact, it is locally Lipschitz if so is ∇h.

Fact 2.7 (relation between ψ and ψ^h). Let h be a Legendre kernel and ψ : ℝⁿ → ℝ̄ be proper, lsc, and lower bounded on dom h. Then, for every x ∈ int dom h, y ∈ dom h, and x̄ ∈ prox^h_ψ(x):
(i) ψ^h(x) = ψ(x̄) + D_h(x̄, x) ≤ ψ(y) + D_h(y, x), and in particular ψ^h(x) ≤ ψ(x);
(ii) if ψ is convex, then ψ^h(x) ≤ ψ(y) + D_h(y, x) − D_h(y, x̄) [59, Lem. 3.1].
Moreover, if range prox^h_ψ ⊆ int dom h, then the following also hold [1, Prop. 3.3]:
(iii) inf_{dom h} ψ ≤ inf_{int dom h} ψ = inf ψ^h and argmin ψ^h = argmin_{int dom h} ψ.
(iv) ψ + δ_{dom h} is level bounded iff so is ψ^h.
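Definitions (2.3)-(2.4) and Fact 2.7(i) can be visualized with a brute-force computation; in the one-dimensional sketch below the entropy kernel and ψ(x) = |x − 1| are illustrative choices, and a grid minimization stands in for the exact prox.

```python
import numpy as np

# Brute-force evaluation of the Bregman proximal mapping (2.3) and
# Bregman-Moreau envelope (2.4) on a 1-D grid, for the entropy kernel
# h(x) = x log x - x and psi(x) = |x - 1| (purely illustrative choices).
h = lambda x: x * np.log(x) - x
D = lambda z, x: h(z) - h(x) - np.log(x) * (z - x)   # D_h(z, x), x in (0, inf)
psi = lambda z: np.abs(z - 1.0)

zgrid = np.linspace(1e-4, 5.0, 200_000)              # discretization of dom h

def prox_and_envelope(x):
    vals = psi(zgrid) + D(zgrid, x)
    j = np.argmin(vals)
    return zgrid[j], vals[j]                         # (prox_psi^h(x), psi^h(x))

for x in (0.2, 1.0, 3.0):
    p, env = prox_and_envelope(x)
    # Fact 2.7(i): psi^h(x) <= psi(x), with the prox realizing the infimum
    print(f"x={x:3.1f}  prox={p:.3f}  envelope={env:.3f}  psi(x)={psi(x):.3f}")
```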
3. A block-coordinate interpretation

By introducing N copies of x, problem (P) can equivalently be written as

    minimize_{x=(x_1,…,x_N)∈ℝ^{nN}} Φ(x) = F(x) + G(x),  where F(x) ≔ (1/N)∑_{i=1}^N f_i(x_i) and G(x) ≔ (1/N)∑_{i=1}^N g(x_i) + δ_Δ(x),
    subject to x ∈ cl C × ⋯ × cl C,    (3.1)

where Δ ≔ {x = (x_1, …, x_N) ∈ ℝ^{nN} | x_1 = x_2 = ⋯ = x_N} is the consensus set. The equivalence between (3.1) and the original problem (P) is formally established in Lemma A.1. Note that Assumption I.a1 implies that F as in (3.1) is smooth relative to the Legendre kernel H : ℝ^{nN} → ℝ̄ defined as

    H(x) = ∑_{i=1}^N h_i(x_i),    (3.2)

making Bregman forward-backward iterations x⁺ ∈ argmin {⟨∇F(x), ·⟩ + G(·) + (1/γ) D_H(·, x)} for some stepsize γ > 0 well defined. Here, L_F = (1/N) max_{i=1,…,N} L_{f_i} is a smoothness modulus of F relative to H, indicating that fixed-point iterations x ← x⁺ under Assumption I converge (in some sense to be made precise) to a stationary point of the problem whenever γ ∈ (0, 1/L_F). Notice that a higher degree of flexibility can be granted by considering an N-tuple of individual stepsizes Γ = (γ_1, …, γ_N), giving rise to the forward-backward operator T^{F,G}_Γ : ℝ^{nN} ⇉ ℝ^{nN} in the Bregman metric (z, x) ↦ ∑_{i=1}^N (1/γ_i) D_{h_i}(z_i, x_i), namely

    T^{F,G}_Γ(x) ≔ argmin_{z∈ℝ^{nN}} { F(x) + ⟨∇F(x), z − x⟩ + G(z) + ∑_{i=1}^N (1/γ_i) D_{h_i}(z_i, x_i) }.    (3.3)

This intuition is validated in the next result, which asserts that whenever the stepsizes γ_i are selected as in Algorithm 1 the operator T^{F,G}_Γ coincides with a proximal mapping in a suitable Legendre kernel Ĥ. This observation leads to a much simpler analysis of Algorithm 1, which will be shown to be a block-coordinate variant of a Bregman proximal point method.
Lemma 3.1.
Suppose that Assumption I.a1 holds and let γ_i ∈ (0, N/L_{f_i}) be selected as in Algorithm 1. Then, ĥ_i ≔ h_i/γ_i − f_i/N (with the convention ∞ − ∞ = ∞) is a Legendre kernel with dom ĥ_i = dom h_i, i ∈ [N], and thus so is the function Ĥ : ℝ^{nN} → ℝ̄ defined as

    Ĥ(x) = ∑_{i=1}^N ĥ_i(x_i).    (3.4)

Moreover, for any (z, x) ∈ ℝ^{nN} × ℝ^{nN} it holds that

    Φ(z) + D_Ĥ(z, x) = F(x) + ⟨∇F(x), z − x⟩ + G(z) + ∑_{i=1}^N (1/γ_i) D_{h_i}(z_i, x_i),    (3.5)

and in particular the forward-backward operator (3.3) satisfies

    T^{F,G}_Γ(x) = prox^Ĥ_Φ(x).    (3.6)

When Assumption I is satisfied, then the following also hold:
(i) D_Ĥ(z, x) ≥ ∑_{i=1}^N (1/γ_i − L_{f_i}/N) D_{h_i}(z_i, x_i).
(ii) prox^Ĥ_Φ(x) = {(z, …, z) | z ∈ T(∑_{i=1}^N ∇ĥ_i(x_i))}, with T as in (2.1), is a nonempty and compact subset of C × ⋯ × C for any x ∈ int dom h_1 × ⋯ × int dom h_N.
(iii) If z ∈ prox^Ĥ_Φ(x), then ∇Ĥ(x) − ∇Ĥ(z) ∈ ∂̂Φ(z); the converse also holds when Φ is convex.
(iv) If ∇h_i is ℓ_{h_i,𝓤_i}-Lipschitz on a convex set 𝓤_i ⊆ int dom h_i, then ∇ĥ_i is ℓ_{ĥ_i,𝓤_i}-Lipschitz on 𝓤_i with ℓ_{ĥ_i,𝓤_i} ≤ (1/γ_i + L_{f_i}/N) ℓ_{h_i,𝓤_i}. If, in addition, f_i − (μ_{f_i,𝓤_i}/2)‖·‖² is convex on 𝓤_i for some μ_{f_i,𝓤_i} ∈ ℝ, then ℓ_{ĥ_i,𝓤_i} ≤ ℓ_{h_i,𝓤_i}/γ_i − μ_{f_i,𝓤_i}/N.
(v) If h_i is μ_{h_i,𝓤_i}-strongly convex on a convex set 𝓤_i ⊆ dom h_i, then ĥ_i is μ_{ĥ_i,𝓤_i}-strongly convex on 𝓤_i with μ_{ĥ_i,𝓤_i} ≥ (1/γ_i − L_{f_i}/N) μ_{h_i,𝓤_i}.

Proof. The claims on ĥ_i are shown in [1, Thm. 4.1], and (3.5) and (3.6) then easily follow.
♠ (i) This is an immediate consequence of Fact 2.5(i).
♠ (ii) Let x be as in the statement, and observe that x ∈ int dom Ĥ; nonemptiness and compactness of prox^Ĥ_Φ(x) then follow from Fact 2.6(i). Let now u ∈ prox^Ĥ_Φ(x) be fixed, and note that the consensus constraint encoded in Φ ensures that u_i = u_j for all i, j ∈ [N]. Thus,

    u_i = argmin_{w∈ℝⁿ} { Φ(w, …, w) + Ĥ(w, …, w) − ⟨∇Ĥ(x), (w, …, w)⟩ }
        = argmin_{w∈ℝⁿ} { (1/N)∑_{i=1}^N f_i(w) + g(w) + ∑_{i=1}^N [ĥ_i(w) − ⟨∇ĥ_i(x_i), w⟩] }
        = argmin_{w∈ℝⁿ} { g(w) + ∑_{i=1}^N (1/γ_i) h_i(w) − ⟨∑_{i=1}^N ∇ĥ_i(x_i), w⟩ }
        = T(∑_{i=1}^N ∇ĥ_i(x_i)) ⊆ C,    (by (2.1))

where the inclusion follows from Assumption I.a4.
♠ (iii) Observe first that necessarily x ∈ int dom h_1 × ⋯ × int dom h_N, for otherwise no such z exists. Moreover, from assertion 3.1(ii) it follows that z also belongs to this open set, on which Ĥ is continuously differentiable. The claim then follows from the necessary condition for optimality of z in the minimization problem (2.4) — which is also sufficient when Φ is convex, for so is Φ + D_Ĥ(·, x) in this case — having

    0 ∈ ∂̂[Φ + D_Ĥ(·, x)](z) = ∂̂[Φ + Ĥ − ⟨∇Ĥ(x), ·⟩](z) = ∂̂Φ(z) + ∇Ĥ(z) − ∇Ĥ(x).

The last equality follows from [53, Ex. 8.8(c)], owing to smoothness of Ĥ at z.
♠ (iv) and (v) Observe that

    ((N − γ_i L_{f_i})/(N γ_i)) h_i ⪯ ((N − γ_i L_{f_i})/(N γ_i)) h_i + (1/N)(L_{f_i} h_i − f_i) = ĥ_i
        = ((N + γ_i L_{f_i})/(N γ_i)) h_i − (1/N)(L_{f_i} h_i + f_i) ⪯ ((N + γ_i L_{f_i})/(N γ_i)) h_i,    (3.7)

where the terms (1/N)(L_{f_i} h_i ∓ f_i) are convex and, for notational convenience, we used the partial ordering "⪯", defined as α ⪯ β iff β − α is convex. The claimed moduli ℓ_{ĥ_i,𝓤_i} ≤ (1/γ_i + L_{f_i}/N) ℓ_{h_i,𝓤_i} and μ_{ĥ_i,𝓤_i} ≥ (1/γ_i − L_{f_i}/N) μ_{h_i,𝓤_i} are thus readily inferred. In case f_i − (μ_{f_i,𝓤_i}/2)‖·‖² is convex on 𝓤_i, we may write

    ĥ_i = [h_i/γ_i − (μ_{f_i,𝓤_i}/(2N))‖·‖²] − (1/N)[f_i − (μ_{f_i,𝓤_i}/2)‖·‖²] ⪯ h_i/γ_i − (μ_{f_i,𝓤_i}/(2N))‖·‖²

to obtain the tighter bound ℓ_{ĥ_i,𝓤_i} ≤ ℓ_{h_i,𝓤_i}/γ_i − μ_{f_i,𝓤_i}/N. □
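The mechanism behind (3.5) is the per-block identity D_{ĥ_i}(z, x) = (1/γ_i) D_{h_i}(z, x) − (1/N)[f_i(z) − f_i(x) − ⟨∇f_i(x), z − x⟩], which follows from linearity of the Bregman distance in its generating function. The snippet below verifies it numerically for the quartic kernel and an (illustrative) relatively smooth f with L_f = 1.

```python
import numpy as np

# Numeric check of the per-block identity behind (3.5): with h_hat = h/gamma - f/N,
#   D_{h_hat}(z, x) = (1/gamma) D_h(z, x) - (1/N)[f(z) - f(x) - <grad f(x), z - x>].
# Quartic kernel h and quartic-growth f are illustrative; f is 1-smooth relative to h.
h,  gh = lambda x: 0.25 * (x @ x) ** 2 + 0.5 * x @ x, lambda x: (x @ x) * x + x
f,  gf = lambda x: 0.25 * (x @ x) ** 2,               lambda x: (x @ x) * x
N, gamma = 10, 0.5                                     # gamma in (0, N/L_f) with L_f = 1

hh  = lambda x: h(x) / gamma - f(x) / N
ghh = lambda x: gh(x) / gamma - gf(x) / N
D   = lambda F, G, z, x: F(z) - F(x) - G(x) @ (z - x)  # generic Bregman distance

rng = np.random.default_rng(0)
z, x = rng.standard_normal(3), rng.standard_normal(3)
lhs = D(hh, ghh, z, x)
rhs = D(h, gh, z, x) / gamma - (f(z) - f(x) - gf(x) @ (z - x)) / N
print(abs(lhs - rhs))                                  # ~ 0 up to rounding
```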
Block-coordinate proximal point formulation of Algorithm 1R equire
Legendre kernels h i such that f i is L f i -smooth relative to h i stepsizes γ i ∈ (0 , N / L fi )initial point x init ∈ ∩ Ni = int dom h i = C D enote x = ( x init , . . . , x init ), ˆ h j (cid:66) γ i h j − N f j , ˆ H ( x ) (cid:66) (cid:80) Ni = ˆ h i ( x i )R epeat for k = , , . . . until convergence . : u k ∈ arg min w ∈ (cid:146) nN (cid:110) Φ ( w ) + ˆ H ( w ) − (cid:104)∇ ˆ H ( x k ) , w (cid:105) (cid:111) = arg min w ∈ (cid:146) nN (cid:110) Φ ( w ) + D ˆ H ( w , x k ) (cid:111) . : Select a subset of indices (cid:74) k + ⊆ [ N ] . : x k + (cid:74) k + = u k (cid:74) k + and x k + N ] \ (cid:74) k + = x k [ N ] \ (cid:74) k + R eturn ˜ z k Block-coordinate proximal point reformulation of Algorithm 1.
Algorithm 3 presents a block-coordinate (BC) proximal point algorithm with distance-generating function Ĥ. Note that, in a departure from most of the existing literature on BC proximal methods, which considers separable nonsmooth terms (see, e.g., [63, 47, 12, 19, 31]), here the nonsmooth function G in (3.1) is nonseparable. It is shown in the next lemma that this conceptual algorithm is equivalent to the Bregman Finito/MISO Algorithm 1.
Lemma 3.2 (equivalence of Algorithms 1 and 3). As long as the same initialization parameters are chosen in the two algorithms, to any sequence (s^k, s̃^k, z^k, 𝓘^{k+1})_{k∈ℕ} generated by Algorithm 1 there corresponds a sequence (x^k, u^k, 𝓙^{k+1})_{k∈ℕ} generated by Algorithm 3 (and vice versa) satisfying the following identities for all k ∈ ℕ and i ∈ [N]:
(i) 𝓘^{k+1} = 𝓙^{k+1}
(ii) (z^k, …, z^k) = u^k
(iii) s^k_i = (1/γ_i)∇h_i(x^k_i) − (1/N)∇f_i(x^k_i) (or, equivalently, x^k_i = ∇ĥ*_i(s^k_i))
(iv) s̃^k = ∑_{i=1}^N ∇ĥ_i(x^k_i)
(v) φ(z^k) = Φ(u^k) = Φ^Ĥ(x^k) − D_Ĥ(u^k, x^k)
(vi) Φ^Ĥ(x^k) = 𝓛(z^k, s^k), where 𝓛 is as in (1.1).

Proof. Let the index sets 𝓘^{k+1} and 𝓙^{k+1} be chosen identically, k ∈ ℕ. It follows from Lemma 3.1(ii) that u^k_i = u^k_j for all k ∈ ℕ and i, j ∈ [N], with

    u^k_i = argmin_{w∈ℝⁿ} { g(w) + ∑_{i=1}^N (1/γ_i) h_i(w) − ⟨v^k, w⟩ },  where v^k ≔ ∑_{i=1}^N ∇ĥ_i(x^k_i).    (3.8)

We now proceed by induction to show assertions 3.2(ii), 3.2(iii), and 3.2(iv). Note that the latter amounts to showing that v^k as defined in (3.8) coincides with s̃^k; by comparing (3.8) with the expression of z^k in step 1.1, the claimed correspondence of u^k and z^k as in assertion 3.2(ii) is then also obtained and, in turn, so is the identity in 3.2(v).

For k = 0, assertions 3.2(iii) and 3.2(iv) hold because of the initialization of s̃⁰ in Algorithm 1 and of x⁰ in Algorithm 3; in turn, as motivated above, the base case for assertion 3.2(ii) also holds. Suppose now that the three assertions hold for some k ≥ 0; then,

    v^{k+1} = ∑_{i=1}^N ∇ĥ_i(x^{k+1}_i)
            = ∑_{i∈𝓘^{k+1}} ∇ĥ_i(u^k_i) + v^k − ∑_{i∈𝓘^{k+1}} ∇ĥ_i(x^k_i)
            = ∑_{i∈𝓘^{k+1}} ∇ĥ_i(z^k) + s̃^k − ∑_{i∈𝓘^{k+1}} s^k_i    (by induction)
            = ∑_{i∈𝓘^{k+1}} s^{k+1}_i + s̃^k − ∑_{i∈𝓘^{k+1}} s^k_i = s̃^{k+1},

where the last two equalities are due to steps 1.2 and 1.3. Therefore, v^{k+1} = s̃^{k+1} and thus u^{k+1} = (z^{k+1}, …, z^{k+1}). It remains to show that s^{k+1}_i = (1/γ_i)∇h_i(x^{k+1}_i) − (1/N)∇f_i(x^{k+1}_i). For i ∈ 𝓘^{k+1} this holds because of the update rule at step 1.2 and the fact that x^{k+1}_i = u^k_i = z^k owing to step 3.3. For i ∉ 𝓘^{k+1} this holds because (x^{k+1}_i, s^{k+1}_i) = (x^k_i, s^k_i). Finally,

    Φ^Ĥ(x^k) = Φ(u^k) + D_Ĥ(u^k, x^k)    (by definition)
             = φ(z^k) + ∑_{i=1}^N D_{ĥ_i}(z^k, x^k_i)    (by (v))
             = φ(z^k) + ∑_{i=1}^N D_{ĥ_i}(z^k, ∇ĥ*_i(s^k_i)),    (by (iii))

and the last term is 𝓛(z^k, s^k) (cf. Facts 2.3(i) and 2.3(iii)), yielding assertion 3.2(vi). □
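The identities of Lemma 3.2 can be monitored while running Algorithm 1; the sketch below (Euclidean kernels, g = 0, synthetic least-squares data — all illustrative) maintains the block variable x^k of Algorithm 3 alongside the table of Algorithm 1 and asserts 3.2(iii)-(iv) at every iteration.

```python
import numpy as np

# Bookkeeping check of Lemma 3.2(iii)-(iv): run Algorithm 1 (Euclidean kernels,
# g = 0) while also maintaining x^k of Algorithm 3, and verify that
# s_i^k = grad h_hat_i(x_i^k) and s~^k = sum_i grad h_hat_i(x_i^k) throughout.
rng = np.random.default_rng(0)
N, n = 20, 5
A, b = rng.standard_normal((N, n)), rng.standard_normal(N)
grad_f = lambda i, w: (A[i] @ w - b[i]) * A[i]
gamma = 0.9 * N / np.sum(A * A, axis=1)
ghat = lambda i, w: w / gamma[i] - grad_f(i, w) / N   # grad h_hat_i, Euclidean h_i
c = np.sum(1.0 / gamma)

x = np.zeros((N, n))                                  # x^k of Algorithm 3
s = np.array([ghat(i, x[i]) for i in range(N)])       # table s^k of Algorithm 1
s_tilde = s.sum(axis=0)

for k in range(200):
    z = s_tilde / c                                   # step 1.1 with g = 0
    i = rng.integers(N)                               # single-index random sampling
    s_tilde += ghat(i, z) - s[i]                      # steps 1.2-1.3
    s[i] = ghat(i, z)
    x[i] = z                                          # step 3.3 of Algorithm 3
    assert np.allclose(s_tilde, sum(ghat(j, x[j]) for j in range(N)))  # Lemma 3.2(iv)
print("bookkeeping identities hold")
```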
4. Convergence analysis
The block-coordinate interpretation of Algorithm 1 presented in Section 3 plays a crucial role in the proposed methodology, and leads to a remarkably simple convergence analysis. In fact, many key facts can be established without confining the discussion to a particular sampling strategy. These preliminary results are presented in the next subsection and will be extensively referred to in the subsequent subsections, which are instead devoted to specific sampling strategies.

4.1. General sampling results.
Unlike classical analyses of BC proximal methods that employ the cost as a Lyapunov function (see, e.g., [10, §11]), here the nonseparability of G precludes this possibility. To address this challenge, we instead employ the Bregman Moreau envelope equipped with the distance-generating function Ĥ (see (3.4)). Before showing its Lyapunov-type behavior for Algorithm 3, we list some of its properties and its relation to the original problem. The proof is a simple consequence of Facts 2.6(ii) and 2.7 and the fact that Ĥ is a Legendre kernel with dom Ĥ = dom h_1 × ⋯ × dom h_N (cf. Lemma 3.1).

Lemma 4.1 (connections between φ + δ_{cl C} and Φ^Ĥ). Suppose that Assumption I holds. Then,
(i) Φ^Ĥ is continuous on dom Φ^Ĥ = int dom h_1 × ⋯ × int dom h_N; in fact, it is locally Lipschitz if so is ∇h_i on int dom h_i, i ∈ [N];
(ii) min_{cl C} φ ≤ inf_C φ = inf Φ^Ĥ and argmin Φ^Ĥ = {(x*, …, x*) | x* ∈ argmin_C φ};
(iii) Φ^Ĥ is level bounded iff so is φ + δ_{cl C}.

Lemma 4.2 (sure descent). Suppose that Assumption I holds, and consider the iterates generated by Algorithm 3. Then, u^k = (u^k, …, u^k) for some u^k ∈ C and x^k ∈ C × ⋯ × C ⊆ int dom Ĥ for every k ∈ ℕ, and the algorithm is thus well defined. Moreover, the following hold:
(i) Φ^Ĥ(x^{k+1}) ≤ Φ^Ĥ(x^k) − D_Ĥ(x^{k+1}, x^k) = Φ^Ĥ(x^k) − ∑_{i∈𝓙^{k+1}} D_{ĥ_i}(u^k, x^k_i) for every k ∈ ℕ; when Φ is convex (i.e., when so is φ), the inequality can be strengthened to Φ^Ĥ(x^{k+1}) ≤ Φ^Ĥ(x^k) − D_Ĥ(x^{k+1}, x^k) − D_Ĥ(u^k, u^{k+1}).
(ii) (Φ^Ĥ(x^k))_{k∈ℕ} monotonically decreases to a finite value φ* ≥ inf_C φ ≥ min_{cl C} φ.
(iii) The sequence (D_Ĥ(x^{k+1}, x^k))_{k∈ℕ} has finite sum (and in particular vanishes); the same holds also for (D_Ĥ(u^k, u^{k+1}))_{k∈ℕ} when Φ is convex (i.e., when so is φ).
(iv) If φ + δ_{cl C} is level bounded, then (x^k)_{k∈ℕ} and (u^k)_{k∈ℕ} are bounded.
(v) If dom h_i is closed, a subsequence (x^k_i)_{k∈K} converges to a point x* iff so does (x^{k+1}_i)_{k∈K}.
(vi) If C = ℝⁿ, then Φ^Ĥ is constant (and equals φ* as above) on the set of accumulation points of (x^k)_{k∈ℕ}.

Proof. It follows from Lemma 3.1(ii) that u^k ∈ C holds for every k ∈ ℕ. Notice that for every i ∈ [N] and k ∈ ℕ, either x^k_i = x_init ∈ C (by initialization), or there exists k_i ≤ k such that x^k_i = z^{k_i} ∈ C. It readily follows that x^k ∈ C × ⋯ × C ⊆ int dom H = int dom Ĥ, hence that prox^Ĥ_Φ(x^k) ≠ ∅ for all k ∈ ℕ by Lemma 3.1(ii), whence the well definedness of the algorithm. We now show the numbered claims.
♠ (i) It follows from Facts 2.7(i) and 2.7(ii) that Φ^Ĥ(x^{k+1}) ≤ Φ(u^k) + D_Ĥ(u^k, x^{k+1}) − c_k, where c_k ≥ 0, and c_k = D_Ĥ(u^k, u^{k+1}) when Φ is convex. Therefore,

    Φ^Ĥ(x^{k+1}) ≤ Φ(u^k) + D_Ĥ(u^k, x^{k+1}) − c_k
                = Φ^Ĥ(x^k) − D_Ĥ(u^k, x^k) + D_Ĥ(u^k, x^{k+1}) − c_k
                = Φ^Ĥ(x^k) − D_Ĥ(x^{k+1}, x^k) − ⟨u^k − x^{k+1}, ∇Ĥ(x^{k+1}) − ∇Ĥ(x^k)⟩ − c_k,    (by Fact 2.3(ii))

and the claim follows by noting that the inner product is zero:

    ⟨u^k − x^{k+1}, ∇Ĥ(x^{k+1}) − ∇Ĥ(x^k)⟩ = ∑_{j∈[N]} ⟨u^k − x^{k+1}_j, ∇ĥ_j(x^{k+1}_j) − ∇ĥ_j(x^k_j)⟩ = 0,

since x^{k+1}_j = u^k for j ∈ 𝓙^{k+1} and x^{k+1}_j = x^k_j for j ∉ 𝓙^{k+1}.
♠ (ii) Monotonic decrease of (Φ^Ĥ(x^k))_{k∈ℕ} follows from assertion 4.2(i). This ensures that the sequence converges to some value φ*, bounded below by min_{cl C} φ in light of Lemma 4.1(ii).
♠ (iii) It follows from assertion 4.2(i) that ∑_{k∈ℕ} D_Ĥ(x^{k+1}, x^k) ≤ Φ^Ĥ(x⁰) − inf Φ^Ĥ ≤ Φ^Ĥ(x⁰) − inf_C φ < ∞, owing to Lemma 4.1(ii) and Assumption I.a3. When φ is convex, the tighter bound in assertion 4.2(i) yields the similar claim for (D_Ĥ(u^k, u^{k+1}))_{k∈ℕ}.
♠ (iv) It follows from assertion 4.2(ii) that the entire sequence (x^k)_{k∈ℕ} is contained in the sublevel set {w | Φ^Ĥ(w) ≤ Φ^Ĥ(x⁰)}, which is bounded provided that φ + δ_{cl C} is level bounded, as shown in Lemma 4.1(iii). In turn, boundedness of (u^k)_{k∈ℕ} then follows from local boundedness of T^{F,G}_Γ = prox^Ĥ_Φ, cf. (3.6) and Fact 2.6(i).
♠ (v) Follows from Fact 2.3(v), since x^k_i ∈ int dom h_i = int dom ĥ_i for every k (with equality owing to Lemma 3.1), and D_{ĥ_i}(x^{k+1}_i, x^k_i) → 0 by assertion 4.2(iii).
♠ (vi) Follows from assertion 4.2(ii) and the continuity of Φ^Ĥ, see Fact 2.6(ii). □

In conclusion of this subsection, we provide an overview of the ingredients that are needed to show that the limit points of the sequence (z^k)_{k∈ℕ} generated by Algorithm 1 are stationary for problem (P). As will be shown in Lemma 4.4, these amount to the vanishing of the residual D_Ĥ(u^k, x^k) together with some assumptions on the distance-generating functions h_i. For the iterates of Algorithm 1, this translates to D_{ĥ*_i}(s^k_i, ∇ĥ_i(z^k)) → 0 for every i ∈ [N], indicating that all vectors s^{k+1}_i in the table should be good estimates of ∇ĥ_i(z^{k+1}) = (1/γ_i)∇h_i(z^{k+1}) − (1/N)∇f_i(z^{k+1}), as opposed to (1/γ_i)∇h_i(z^k) − (1/N)∇f_i(z^k) and for the indices in 𝓘^{k+1} only (cf. step 1.2). As a result, we may view this property as jointly having z^k − z^{k+1} vanish, desirable if any convergence of (z^k)_{k∈ℕ} is expected, and the fact that a consensus is eventually reached among the sampled blocks.

In line with any result in the literature we are aware of, a complete convergence analysis for nonconvex problems will ultimately require C = ℝⁿ. For convex problems, that is, when the cost function φ is convex without any among f_i and g being necessarily so, the following requirement will instead suffice to our purposes in the randomized sampling setting of (S1).

Assumption II (requirements on the distance-generating functions). For i ∈ [N], dom h_i is closed, and whenever int dom h_i ∋ z^k → z ∈ bdry dom h_i it holds that D_{h_i}(z, z^k) → 0.

Remark 4.3.
(i) Assumption II is vacuously satisfied when dom h_i = ℝⁿ, having bdry ℝⁿ = ∅.
(ii) For any i ∈ [N], function h_i complies with Assumption II iff so does ĥ_i, owing to the inequalities ((N − γ_i L_{f_i})/(N γ_i)) D_{h_i} ≤ D_{ĥ_i} ≤ ((N + γ_i L_{f_i})/(N γ_i)) D_{h_i} (cf. (3.7)). □

Lemma 4.4 (subsequential convergence recipe). Suppose that Assumption I holds, and consider the iterates generated by Algorithm 1. Let x^k_i = ∇ĥ*_i(s^k_i) and (z^k, …, z^k) = u^k be the corresponding iterates generated by Algorithm 3 as in Lemma 3.2, and suppose that
 (a) D_Ĥ(u^k, x^k) → 0 (or, equivalently, D_{ĥ*_i}(s^k_i, ∇ĥ_i(z^k)) → 0, i ∈ [N]).
Then, letting φ* be as in Lemma 4.2(ii), the following hold:
(i) φ(z^k) = Φ(u^k) → φ* as k → ∞.
(ii) If dom h_i is closed, i ∈ [N], then having (a) (z^k)_{k∈K} → z, (b) (x^k_i)_{k∈K} → z for some i ∈ [N], and (c) (z^{k+1}, x^{k+1}_i)_{k∈K} → (z, z) for all i ∈ [N], are all equivalent conditions. In particular, if (z^k)_{k∈ℕ} is bounded (e.g., when φ + δ_{cl C} is level bounded), then ‖z^{k+1} − z^k‖ → 0 holds, and the set of its limit points, be it ω, is thus nonempty, compact, and connected.
(iii) Under Assumption II, φ ≡ φ* on ω (the set of limit points of (z^k)_{k∈ℕ}).
(iv) If C = ℝⁿ, then every z* ∈ ω is stationary for (P).

Proof. Assumption 4.4(a) can be written as D_{ĥ_i}(z^k, ∇ĥ*_i(s^k_i)) → 0, i ∈ [N]. In turn, by using the conjugate identity of Fact 2.3(iii), the equivalent expression in terms of s^k_i and z^k is obtained.
♠ (i) As shown in Lemma 4.2(ii), (Φ^Ĥ(x^k))_{k∈ℕ} monotonically decreases to φ*. In turn, Lemma 3.2(v) and assumption 4.4(a) then imply that φ(z^k) = Φ(u^k) converges to φ*.
♠ (ii) The equivalences owe to Fact 2.3(v) and Lemma 4.2(v) (as dom ĥ_i = dom h_i), and imply ‖z^{k+1} − z^k‖ → 0 when (z^k)_{k∈ℕ} is bounded. The claim on ω then follows from [19, Rem. 5].
♠ (iii) Let z* ∈ ω be fixed, and let (z^k)_{k∈K} be a subsequence converging to z*. Assertion 4.4(ii) ensures that (x^k)_{k∈K} → z̄ ≔ (z*, …, z*), hence

    φ* = lim_{k∈K} [Φ^Ĥ(x^k) − D_Ĥ(z̄, x^k)] ≤ Φ(z̄) ≤ liminf_{k∈K} Φ(u^k) = φ*,

where the first equality uses Lemma 4.2(ii) and Assumption II (for the vanishing of D_Ĥ(z̄, x^k)), the first inequality owes to Fact 2.7(i), the second to lower semicontinuity of Φ, and the last equality to assertion 4.4(i).
♠ (iv) Suppose that C = ℝⁿ and (z^k)_{k∈K} → z* for some infinite K ⊆ ℕ and z* ∈ ℝⁿ, so that, by virtue of assertion 4.4(ii), (x^k, x^{k+1})_{k∈K} → (z̄, z̄). Since (z^k, …, z^k) = u^k ∈ prox^Ĥ_Φ(x^k), the osc of prox^Ĥ_Φ (Fact 2.6(i)) ensures that z̄ ∈ prox^Ĥ_Φ(z̄), hence 0 ∈ ∂̂Φ(z̄) owing to Lemma 3.1(iii). By invoking Lemma A.1(iv) we conclude that z* is stationary for (P). □

4.2. Randomized sampling rule (S1). The analysis for the randomized case dealt with in this section will make use of the following result, known as the Robbins-Siegmund supermartingale theorem, and stated here in simplified form following [13, Prop. 2].
Fact 4.5 (supermartingale convergence theorem [51]). For k ∈ ℕ, let ξ_k and η_k be random variables, and 𝓕_k be a set of random variables such that
 (i) 𝓕_k ⊆ 𝓕_{k+1};
 (ii) 0 ≤ ξ_k, η_k are functions of the random variables in 𝓕_k;
 (iii) E[ξ_{k+1} | 𝓕_k] ≤ ξ_k − η_k.
Then, almost surely, ∑_{k∈ℕ} η_k < ∞ and ξ_k converges to a (nonnegative) random variable.

The sets 𝓕_k in the above formulation will represent the information available at iteration k, and the notation E_k[·] will be used as a shorthand for E[· | 𝓕_k].

Theorem 4.6 (subsequential convergence of Algorithm 1 with randomized sampling (S1)). Suppose that Assumption I holds. Then, denoting p_min = min_i p_i, the iterates generated by Algorithm 1 with indices selected according to the randomized rule (S1) satisfy

    E_k[𝓛(z^{k+1}, s^{k+1})] ≤ 𝓛(z^k, s^k) − p_min ∑_{i=1}^N D_{ĥ*_i}(s^k_i, ∇ĥ_i(z^k))  ∀k ∈ ℕ,    (4.1)

where 𝓛 is as in (1.1) (and satisfies 𝓛(z^k, s^k) = Φ^Ĥ(x^k), cf. Lemma 3.2(vi)). Moreover, letting ω denote the set of limit points of (z^k)_{k∈ℕ}, the following assertions hold almost surely:
(i) The sequence (D_{ĥ*_i}(s^k_i, ∇ĥ_i(z^k)))_{k∈ℕ} has finite sum (and in particular vanishes), i ∈ [N].
(ii) The sequence (φ(z^k))_{k∈ℕ} converges to the finite value φ* ≤ φ(x_init) of Lemma 4.2(ii).
(iii) If Assumption II is satisfied, then φ ≡ φ* on ω.
(iv) If C = ℝⁿ, then 0 ∈ ∂̂φ(z*) for every z* ∈ ω.
When φ is convex (without g or any f_i necessarily being so) and dom h_i is closed, i ∈ [N]:
(v) (φ(z^k))_{k∈ℕ} converges to min_{cl C} φ with φ(z^k) − min_{cl C} φ ≤ o(1/k);
(vi) the limit points of (z^k)_{k∈ℕ} all belong to argmin_{cl C} φ;
(vii) if either Assumption II holds and φ + δ_{cl C} is level bounded, or C = ℝⁿ, then (z^k)_{k∈ℕ} and (∇ĥ*_i(s^k_i))_{k∈ℕ}, i ∈ [N], converge to the same point in argmin_{cl C} φ. Proof.
By exploiting Lemma 3.2, we consider the simpler setting of Algorithm 3. We have

    E_k[𝓛(z^{k+1}, s^{k+1})] = E_k[Φ^Ĥ(x^{k+1})]    (by Lemma 3.2(vi))
        ≤ E_k[Φ^Ĥ(x^k) − ∑_{i∈𝓘^{k+1}} D_{ĥ_i}(u^k, x^k_i)]    (by Lemma 4.2(i))
        = Φ^Ĥ(x^k) − ∑_{i=1}^N p_i D_{ĥ_i}(u^k, x^k_i)
        ≤ Φ^Ĥ(x^k) − p_min D_Ĥ(u^k, x^k) = 𝓛(z^k, s^k) − p_min ∑_{i=1}^N D_{ĥ*_i}(s^k_i, ∇ĥ_i(z^k)),    (by Lemma 3.2(vi) and Fact 2.3(iii))

which is (4.1). We thus infer from Fact 4.5 that (D_Ĥ(u^k, x^k))_{k∈ℕ} has almost surely finite sum, and the proof of assertions 4.6(i)-4.6(iv) then follows from Lemma 4.4.
♠ (v) Suppose that φ is convex and dom h_i is closed, i ∈ [N], so that cl C = ∩_{i=1}^N dom h_i [14, Prop. 1.3.8], and in particular D_{ĥ_i}(y, x) < ∞ holds for any (y, x) ∈ dom h_i × int dom h_i ⊇ cl C × C. For any x* ∈ argmin_{cl C} φ, so that x̄ ≔ (x*, …, x*) ∈ argmin_{cl C×⋯×cl C} Φ (cf. Lemma A.1(v)), the three-point identity (Fact 2.3(ii)), convexity of Φ (Lemma A.1(vii)), and the inclusion ∇Ĥ(x^k) − ∇Ĥ(u^k) ∈ ∂̂Φ(u^k) (Lemma 3.1(iii)) give the bound

    D_Ĥ(x̄, u^k) = D_Ĥ(x̄, x^k) − D_Ĥ(u^k, x^k) + ⟨∇Ĥ(x^k) − ∇Ĥ(u^k), x̄ − u^k⟩
                ≤ D_Ĥ(x̄, x^k) − D_Ĥ(u^k, x^k) + Φ(x̄) − Φ(u^k).    (4.2)

Next,

    E_k[∑_{i=1}^N p_i^{−1} D_{ĥ_i}(x*, x^{k+1}_i)]
        = ∑_{i=1}^N p_i^{−1} (p_i D_{ĥ_i}(x*, u^k) + (1 − p_i) D_{ĥ_i}(x*, x^k_i))
        = D_Ĥ(x̄, u^k) + ∑_{i=1}^N (1 − p_i) p_i^{−1} D_{ĥ_i}(x*, x^k_i)    (4.3)
        ≤ D_Ĥ(x̄, x^k) − D_Ĥ(u^k, x^k) + Φ(x̄) − Φ(u^k) + ∑_{i=1}^N (p_i^{−1} − 1) D_{ĥ_i}(x*, x^k_i)    (by (4.2))
        = ∑_{i=1}^N p_i^{−1} D_{ĥ_i}(x*, x^k_i) − D_Ĥ(u^k, x^k) − (φ(z^k) − min_{cl C} φ),    (by Lemmas 3.2(v) and A.1)

where u^k = (z^k, …, z^k). From Fact 4.5 we conclude that

    ∑_{k∈ℕ} D_Ĥ(u^k, x^k) < ∞  and  ∑_{k∈ℕ} (φ(z^k) − min_{cl C} φ) < ∞    (4.4)

almost surely, and

    ∑_{i=1}^N p_i^{−1} D_{ĥ_i}(x*, x^k_i) converges a.s. for any x* ∈ argmin_{cl C} φ.    (4.5)

It now follows from (4.4) that φ(z^k) converges a.s. to min_{cl C} φ with the claimed o(1/k) rate.
♠ (vi) Suppose that (z^k)_{k∈K} → z*. Then, (u^k)_{k∈K} → ū for u^k = (z^k, …, z^k) and ū = (z*, …, z*). Notice that z* ∈ cl C, since z^k ∈ C for all k (cf. Lemma 4.2). We have

    min_{cl C} φ = min_{cl C×⋯×cl C} Φ ≤ Φ(ū) ≤ liminf_{k∈K} Φ(u^k) = liminf_{k∈K} φ(z^k) = min_{cl C} φ,    (4.6)

where the first equality is Lemma A.1(v), the second inequality owes to lower semicontinuity, and the last two equalities to assertion 4.6(v). Therefore, ū is a minimizer of Φ on cl C × ⋯ × cl C, and thus z* is a minimizer of φ on cl C by virtue of Lemma A.1(v).
♠ (vii) If φ + δ_{cl C} is level bounded, then by Lemma 4.2(iv) (x^k)_{k∈ℕ} and (u^k)_{k∈ℕ} are bounded. Alternatively, if C = ℝⁿ, then boundedness of the former sequence follows from Fact 2.3(iv) and (4.5), and in turn that of the latter from (4.4). In either case Assumption II holds, as discussed in Remark 4.3(i). Boundedness of the sequences ensures the existence of K ⊆ ℕ, z* and ū as in the proof of assertion 4.6(vi). The vanishing of D_Ĥ(u^k, x^k) shown in (4.4) implies through Lemma 4.4(ii) that (x^k)_{k∈ℕ} and (u^k)_{k∈ℕ} have the same limit points, and that (x^k)_{k∈K} → ū. In turn, (∑_{i=1}^N p_i^{−1} D_{ĥ_i}(z*, x^k_i))_{k∈K} → 0, and by (4.5) in fact (∑_{i=1}^N p_i^{−1} D_{ĥ_i}(z*, x^k_i))_{k∈ℕ} → 0, which by Fact 2.3(v) implies (x^k_i)_{k∈ℕ} → z*, i ∈ [N]. As discussed above, this implies that (u^k)_{k∈ℕ} → ū, and the identity u^k = (z^k, …, z^k) of Lemma 3.2(ii) yields the claimed convergence. □
In Theorem 4.6(vii) the assumption that h_i has full domain can be relaxed by instead requiring that for every v ∈ dom h_i and every α ∈ ℝ the level set {w ∈ int dom h_i | D_{h_i}(v, w) ≤ α} is bounded, as this would suffice to ensure boundedness of the generated sequence. In fact, together with the closed-domain requirement this is a standing assumption in many works dealing with Bregman distances, specifically those involving Bregman functions "with zone S" (S being the interior of the domain); see, e.g., [56] and references therein.

We conclude this subsection with an analysis of the strongly convex case, in which linear convergence (in expectation) will be shown. Remarkably, strong convexity of the cost function φ alone will suffice, without imposing any such requirement on the individual terms f_i or g, which, in fact, are even allowed to be nonconvex.

Theorem 4.7 (linear convergence with random sampling (S1) for strongly convex problems). Consider the iterates of Algorithm 1. Additionally to Assumption I, suppose that
 (a) φ is μ_φ-strongly convex;
 (b) h_i has locally Lipschitz gradient on the whole space ℝⁿ, i ∈ [N] (hence C = ℝⁿ).
Let 𝓤 be a convex compact set containing x_init and the sequence (z^k)_{k∈ℕ}, and let ℓ_{h_i,𝓤} be a Lipschitz modulus for ∇h_i on 𝓤, i ∈ [N]. Let x* = argmin φ, φ* = min φ, and

    c_𝓤 = (min_i p_i) / (1 + 2 μ_φ^{−1} ∑_i (ℓ_{h_i,𝓤}/γ_i − σ_{f_i,𝓤}/N)),    (4.7)

where σ_{f_i,𝓤} ≥ −L_{f_i} ℓ_{h_i,𝓤} is a (weak) convexity modulus of f_i on 𝓤. Then, for all k ∈ ℕ,

    E_k[𝓛(z^{k+1}, s^{k+1}) − φ*] ≤ (1 − c_𝓤)(𝓛(z^k, s^k) − φ*),    (4.8)

and

    E[(μ_φ/2)‖z^k − x*‖²] ≤ E[φ(z^k) − φ*] ≤ (1 − c_𝓤)^k (φ(x_init) − φ*).    (4.9)

Proof.
See Appendix B. □
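The expectation bound (4.9) can be observed empirically. In the sketch below the individual f_i are indefinite quadratics whose average is strongly convex — the regime covered by Theorem 4.7 but not by analyses requiring strong convexity of each f_i. The data, shift, and iteration counts are illustrative, and a single run is shown rather than an average over realizations.

```python
import numpy as np

# Empirical look at Theorem 4.7: phi = (1/N) sum_i f_i is strongly convex while
# each quadratic f_i is indefinite (Euclidean kernels, g = 0, uniform sampling;
# synthetic illustrative data, single realization).
rng = np.random.default_rng(0)
N, n = 30, 8
Q = rng.standard_normal((N, n, n))
Q = (Q + Q.transpose(0, 2, 1)) / 2                     # symmetric, typically indefinite
shift = 1.0 - np.linalg.eigvalsh(Q.mean(axis=0)).min()
Q += shift * np.eye(n)                                 # now (1/N) sum_i Q_i >= I
cvec = rng.standard_normal((N, n))

grad_f = lambda i, w: Q[i] @ w + cvec[i]
L = np.array([np.abs(np.linalg.eigvalsh(Q[i])).max() for i in range(N)])
gamma = 0.9 * N / L                                    # gamma_i in (0, N/L_{f_i})
curv = np.sum(1.0 / gamma)

x_star = -np.linalg.solve(Q.mean(axis=0), cvec.mean(axis=0))
phi = lambda w: np.mean([0.5 * w @ Q[i] @ w + cvec[i] @ w for i in range(N)])

z = np.zeros(n)
s = np.array([z / gamma[i] - grad_f(i, z) / N for i in range(N)])
s_tilde = s.sum(axis=0)
for k in range(3001):
    z = s_tilde / curv                                 # step 1.1 with g = 0
    if k % 1000 == 0:
        print(f"k={k:4d}  phi(z)-phi* = {phi(z) - phi(x_star):.3e}")  # roughly linear decay
    i = rng.integers(N)
    s_new = z / gamma[i] - grad_f(i, z) / N
    s_tilde += s_new - s[i]
    s[i] = s_new
```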
Moreover, in the Euclidean case h_i(·) = (1/2)‖·‖², i ∈ [N], ∇h_i is 1-Lipschitz continuous on ℝⁿ, and the results of Theorem 4.7 hold with 𝓤 = ℝⁿ and ℓ_{h_i} = 1; they improve those of [37, Cor. 3.3] (limited to the Euclidean case) both by providing tighter rates and by relaxing (strong) convexity assumptions on the individual f_i. Notice that, for the uniform sampling strategy p_i = 1/N, a rate of 1 − O(1/N) is obtained. The same arguments still hold for the Bregman extension of Algorithm 1 dealt with in this paper, as long as each h_i is Lipschitz differentiable. This fact is stated in the following corollary, where, by using the fact that μ_φ ≥ (1/N)∑_{i=1}^N σ_{f_i} under a convexity assumption on g, a simplified expression for the constant c in (4.7) is obtained. We remark that in the Euclidean case a variant of SVRG [34] has also been studied in [2] under similar assumptions.

Corollary 4.8 (global linear convergence rate). Additionally to Assumption I, suppose that
 (a) g is convex, and f ≔ (1/N)∑_i f_i is μ_f-strongly convex (yet each f_i can be nonconvex);
 (b) ∇h_i is Lipschitz on ℝⁿ, hence so is ∇f_i with modulus ℓ_{f_i} (cf. Fact 2.5(iii)), i ∈ [N].
Fix α ∈ (0, 1) and set γ_i = αN/L_{f_i}. Then, the following bound holds for the constant in (4.7):

    c = c_{ℝⁿ} ≥ (α min_i p_i) / (2 κ_f),

where κ_f ≔ ((1/N)∑_i ℓ_{f_i}) / μ_f ≤ ((1/N)∑_i ℓ_{f_i}) / ((1/N)∑_i σ_{f_i}) is a problem-specific constant, with σ_{f_i} ≥ −ℓ_{f_i} being a (weak) convexity modulus of f_i.

[Footnotes: such a set 𝓤 exists by Lemma 4.2(iv), owing to strong convexity and consequent level boundedness of φ. For i ∈ [N], a finite ℓ_{h_i,𝓤} then exists because of assumption (b) of Theorem 4.7 and since 𝓤 ⊂ C (as opposed to 𝓤 ⊆ cl C), having C = ℝⁿ. Moreover, f_i is σ_{f_i,𝓤}-weakly convex on 𝓤 if f_i − (σ_{f_i,𝓤}/2)‖·‖² is convex on 𝓤, thus coinciding with convexity (resp. σ_{f_i,𝓤}-strong convexity) on 𝓤 when σ_{f_i,𝓤} ≥ 0 (resp. σ_{f_i,𝓤} > 0); the lower bound σ_{f_i,𝓤} ≥ −L_{f_i} ℓ_{h_i,𝓤} owes to Fact 2.5(iii).]

4.3. Essentially cyclic sampling rule (S2). The convergence results in this subsection require convexity of the nonsmooth term g and local strong convexity and smoothness of h_i (as is the case when h_i ∈ 𝒞²(ℝⁿ) with ∇²h_i ≻ 0).

Theorem 4.9 (subsequential convergence with essentially cyclic sampling (S2)). Additionally to Assumption I, assume that g is convex, C = ℝⁿ, and that
 (a) either each h_i is strongly convex and Lipschitz differentiable,
 (b) or φ is level bounded and each h_i is locally strongly convex and locally Lipschitz differentiable.
Then, all the claims in Theorems 4.6(i) to 4.6(iv) hold surely.

Proof. See Appendix B. □
Our next goal is to establish global (and linear) convergence results without convexity assumptions on the f_i or their sum. To this end, we leverage the Kurdyka-Łojasiewicz property [38, 36], which has become the standard tool in the analysis of nonconvex proximal methods, and most notably holds for the class of semialgebraic functions [18, 17, 3, 4, 5, 19].

Definition 4.10 (KL property with exponent θ). A proper lsc function q : ℝⁿ → ℝ̄ has the Kurdyka-Łojasiewicz (KL) property with exponent θ ∈ (0, 1) if for every w̄ ∈ dom ∂q there exist ε, η, ϱ > 0 such that ψ'(q(w) − q(w̄)) dist(0, ∂q(w)) ≥ 1 holds for every w satisfying ‖w − w̄‖ < ε and q(w̄) < q(w) < q(w̄) + η, where ψ(s) ≔ ϱ s^{1−θ}.

As will be detailed in Theorem 4.11, global convergence is established when the model 𝓜 : ℝ^{nN} × ℝ^{nN} → ℝ̄ defined as 𝓜(w, x) ≔ Φ(w) + D_Ĥ(w, x) has the KL property. The next assumption provides easily verifiable requirements in terms of f_i, h_i, and φ.

Assumption III (global convergence requirements). In problem (P),
(i) for i ∈ [N], f_i, h_i ∈ 𝒞²(ℝⁿ) (hence C = ℝⁿ) with ∇²h_i ≻ 0;
(ii) φ has the KL property with exponent θ ∈ (0, 1) (e.g., when f_i and g are semialgebraic) and is level bounded.

Theorem 4.11 (global and linear convergence with essentially cyclic sampling (S2)). Suppose that Assumptions I and III are satisfied and that g is convex. Then, the following hold for the iterates generated by Algorithm 1 with an essentially cyclic sampling rule (S2):
(i) (z^k)_{k∈ℕ} converges to a stationary point z* for φ.
(ii) If θ > 1/2, then there exists c > 0 such that φ(z^k) − φ(z*) ≤ c k^{−1/(2θ−1)} holds for all k ∈ ℕ.
(iii) If θ ∈ (0, 1/2], then (z^k)_{k∈ℕ} and (φ(z^k))_{k∈ℕ} converge at R-linear rate.

Proof. Notice that assumption (b) of Theorem 4.9 holds (by Assumption III(i)), hence it follows from Lemma 4.4 that the set ω of limit points of (z^k)_{k∈ℕ} is nonempty, compact, connected, and made of stationary points for φ, with φ ≡ φ* ≔ lim_{k→∞} Φ^Ĥ(x^k) on ω. If Φ^Ĥ(x^k) = φ* holds for some k ∈ ℕ, then it follows from Lemma 4.2(i) that (x^k)_{k∈ℕ} is asymptotically constant, and the assertion holds trivially. In what follows we thus assume that Φ^Ĥ(x^k) > φ* holds for all k. The assumptions together with Lemma A.1(iii) ensure that Φ enjoys the KL property with exponent θ. Since Ĥ is locally strongly convex, we may invoke [68, Lem. 5.1] to infer that the function 𝓜 : ℝ^{nN} × ℝ^{nN} → ℝ̄ defined as 𝓜(w, x) = Φ(w) + D_Ĥ(w, x) has the KL property with exponent ϑ ≔ max{θ, 1/2} at every point of the compact set Ω ≔ {(z̄, z̄) | z̄ ∈ ω̄}, where ω̄ ≔ ω × ⋯ × ω. Notice that Φ^Ĥ(x^k) = 𝓜(u^k, x^k) and 𝓜(z̄, z̄) = Φ^Ĥ(z̄) = φ* hold for every k ∈ ℕ and z̄ ∈ ω̄ (cf. Theorem 4.9), and that

    ∂𝓜(w, x) = (∂Φ(w) + ∇Ĥ(w) − ∇Ĥ(x), ∇²Ĥ(x)(x − w)).
In particular, since $\nabla\hat H(x^k) - \nabla\hat H(u^k) \in \partial\Phi(u^k)$,
\[
\operatorname{dist}\bigl(0, \partial\mathcal{M}(u^k, x^k)\bigr) \le \|\nabla^2\hat H(x^k)\|\,\|x^k - u^k\| \le C\,\|x^k - u^k\|, \tag{4.10}
\]
where $C = \sup_k \|\nabla^2\hat H(x^k)\|$ is finite due to boundedness of $(x^k)_{k\in\mathbb{N}}$ and continuity of $\nabla^2\hat H$. Let $\psi(t) := \rho\, t^{1-\vartheta}$ be a desingularizing function for $\mathcal{M}$ on $\Omega$ [3, Lem. 1(ii)], namely such that
\[
\psi'\bigl(\mathcal{M}(w,x) - \varphi^\star\bigr)\, \operatorname{dist}\bigl(0, \partial\mathcal{M}(w,x)\bigr) \ge 1
\]
holds, for some $\varepsilon > 0$, for every $(w,x)$ $\varepsilon$-close to $\Omega$ such that $0 < \mathcal{M}(w,x) - \varphi^\star < \varepsilon$. Since $\mathcal{M}(u^k, x^k) = \Phi^{\hat H}(x^k) \searrow \varphi^\star$ and $(u^k, x^k)_{k\in\mathbb{N}}$ is bounded and accumulates on $\Omega$, by discarding early iterates we may assume that the inequality above holds for $(w,x) = (u^k, x^k)$, $k \in \mathbb{N}$, which combined with (4.10) results in
\[
\rho^{-1}(1-\vartheta)^{-1}\bigl(\Phi^{\hat H}(x^k) - \varphi^\star\bigr)^{\vartheta} = \psi'\bigl(\Phi^{\hat H}(x^k) - \varphi^\star\bigr)^{-1} \le C\,\|x^k - u^k\|. \tag{4.11}
\]
Let $\Delta_k := \psi(\Phi^{\hat H}(x^k) - \varphi^\star)$, so that $\Phi^{\hat H}(x^k) - \varphi^\star = (\Delta_k/\rho)^{1/(1-\vartheta)}$. By concavity of $\psi$ we have
\[
\Delta_{T(\nu+1)} - \Delta_{T\nu}
\le \psi'\bigl(\Phi^{\hat H}(x^{T\nu}) - \varphi^\star\bigr)\bigl(\Phi^{\hat H}(x^{T(\nu+1)}) - \Phi^{\hat H}(x^{T\nu})\bigr)
\overset{\text{(4.11)}}{\le} \frac{\Phi^{\hat H}(x^{T(\nu+1)}) - \Phi^{\hat H}(x^{T\nu})}{C\,\|u^{T\nu} - x^{T\nu}\|}
\overset{\text{(B.6)}}{\le} -c\,\|u^{T\nu} - x^{T\nu}\| \tag{4.12}
\]
for some constant $c > 0$. Similarly, by suitably shifting, we conclude that for all $k \in \mathbb{N}$
\[
\Delta_{k+T} - \Delta_k \le -c\,\|u^k - x^k\|. \tag{4.13}
\]
Since $\Delta_k \ge 0$, by telescoping we conclude that $(\|u^k - x^k\|)_{k\in\mathbb{N}}$ has finite sum, and since $\|x^{k+1} - x^k\| \le \|u^k - x^k\|$ for all $k \in \mathbb{N}$, $(x^k)_{k\in\mathbb{N}}$ has finite length. Therefore, $(x^k)_{k\in\mathbb{N}}$ and $(u^k)_{k\in\mathbb{N}}$ converge to the same stationary point, be it $\mathbf{z}^\star$, owing to Theorem 4.9.

We now show the convergence rates. It follows from (4.11) that
\[
C\rho^{1/(1-\vartheta)}(1-\vartheta)\,\|x^k - u^k\| \ge \rho^{\vartheta/(1-\vartheta)}\bigl(\Phi^{\hat H}(x^k) - \varphi^\star\bigr)^{\vartheta} = \Delta_k^{\vartheta/(1-\vartheta)}. \tag{4.14}
\]
Combined with (4.13), it results in $c\,\Delta_k^{\vartheta/(1-\vartheta)} \le \Delta_k - \Delta_{k+T}$ for some $c > 0$. We may now invoke Lemma A.3 to infer that, for every $t \in [T]$, $(\Delta_{t+\nu T})_{\nu\in\mathbb{N}}$ converges Q-linearly (to 0) if $\theta \le 1/2$ (which corresponds to $\vartheta = 1/2$), and $\Delta_{t+\nu T} \le c\,\nu^{-(1-\theta)/(2\theta-1)}$ for some $c > 0$ otherwise ($\vartheta = \theta > 1/2$). Note that the former case implies that $(\Delta_k)_{k\in\mathbb{N}}$ converges R-linearly, whereas the latter implies that $\Delta_k \le \tilde c\, k^{-(1-\theta)/(2\theta-1)}$ holds for every $k$ and some $\tilde c \ge c$. Recalling that $\Phi^{\hat H}(x^k) - \varphi^\star = (\Delta_k/\rho)^{1/(1-\vartheta)}$, the claimed rates for the cost function follow from Fact 2.7(i). Similarly, when $(\Delta_k)_{k\in\mathbb{N}}$ converges Q-linearly then so does $(\|x^k - u^k\|)_{k\in\mathbb{N}}$, as it follows from (4.13), and in turn so does $(\|x^k - x^{k+1}\|)_{k\in\mathbb{N}}$ owing to the inequality $\|x^k - x^{k+1}\| \le \|u^k - x^k\|$. These two facts imply that $(x^k)_{k\in\mathbb{N}}$ and $(u^k = (z^k, \ldots, z^k))_{k\in\mathbb{N}}$ are R-linearly convergent. □

Footnote 3. Consistently with the locality of the KL property and the compactness of $\omega$, the global strong convexity requirement in [68, Lem. 5.1] can clearly be replaced by local strong convexity. Similarly, if $\Phi$ is a KL function with exponent $\theta$, then it is trivially a KL function with exponent $\vartheta \ge \theta$, thus complying with the requirement in the reference.
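To put the rates in perspective with a worked instance of the statement: for a cost with KL exponent $\theta = 3/4$, item (ii) of Theorem 4.11 gives $\varphi(z^k) - \varphi(z^\star) = O(k^{-2})$, since $1/(2\theta - 1) = 2$; as $\theta$ increases toward $1$ the guaranteed exponent $1/(2\theta-1)$ decreases toward $1$, whereas any exponent $\theta \le 1/2$ already yields R-linear convergence by item (iii).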
4.4. Low-memory variant. We now analyze Algorithm 2, which, as shown next, is simply a particular implementation of Algorithm 1.
Lemma 4.12 (Algorithm 2 as an instance of Algorithm 1). As long as the same parameters are chosen in Algorithms 1 and 2, to any sequence $(\tilde s^k_{\rm lm}, z^k_{\rm lm}, \mathcal{I}^{k+1}_{\rm lm})_{k\in\mathbb{N}}$ generated by Algorithm 2 there corresponds an identical sequence $(\tilde s^k, z^k, \mathcal{I}^{k+1})_{k\in\mathbb{N}}$ generated by Algorithm 1. Moreover, the indices $(\mathcal{I}^{k+1}_{\rm lm})_{k\in\mathbb{N}}$ comply with the essentially cyclic rule with period $T = N + 1$.
Proof.
See Appendix B. □
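To make the bookkeeping behind Lemma 4.12 concrete, the following is a minimal Python sketch of the $O(n)$-memory aggregate update, grounded in the update rules (B.8)-(B.12) of the proof in Appendix B and in the cyclic inner loop used in Section 5. The function handles `grad_h_hat` and `solve_subproblem`, as well as the variable names, are illustrative assumptions rather than the paper's pseudocode.

```python
import numpy as np

def bregman_finito_low_memory(grad_h_hat, solve_subproblem, z0, N, K):
    """Schematic sketch of the O(n)-memory variant (Algorithm 2).

    Instead of storing one vector s_i per index (O(n*N) memory), only the
    running aggregate s_tilde and the point z_tilde of the last full update
    are kept.  grad_h_hat(i, z) stands for the gradient of the surrogate
    kernel  h_i/gamma_i - f_i/N  at z; solve_subproblem(s) returns a
    minimizer of the z-update subproblem (both assumed given).
    """
    z = z0.copy()
    z_tilde = z0.copy()                       # point of the last full pass
    s_tilde = sum(grad_h_hat(i, z0) for i in range(N))
    for k in range(K):
        if k % (N + 1) == 0:                  # periodic full sampling
            z_tilde = z.copy()
            s_tilde = sum(grad_h_hat(i, z) for i in range(N))
        else:                                 # single-index update
            i = (k % (N + 1)) - 1
            s_tilde += grad_h_hat(i, z) - grad_h_hat(i, z_tilde)
        z = solve_subproblem(s_tilde)
    return z
```

Only $\tilde s$ and $\tilde z$ are stored, at the price of one extra gradient evaluation per incremental step, instead of the $N$ vectors $s_i$ of Algorithm 1.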
As a consequence of Lemma 4.12, Algorithm 2 inherits all the convergence results of Subsection 4.3. In addition, here, the convexity requirement on $g$ in Theorems 4.9 and 4.11 can be lifted thanks to the periodic full sampling of the indices.

Theorem 4.13 (subsequential convergence of Algorithm 2). Suppose that Assumption I holds and let $\omega$ denote the set of limit points of the sequence $(\tilde z^k)_{k\in\mathbb{N}}$ generated by Algorithm 2. Then,
(i) the sequence $(\varphi(\tilde z^k))_{k\in\mathbb{N}}$ converges to the finite value $\varphi^\star \le \varphi(x^{\rm init})$ as in Lemma 4.2(ii);
(ii) if Assumption II is satisfied, then $\varphi \equiv \varphi^\star$ on $\omega$;
(iii) if $C = \mathbb{R}^n$, then $0 \in \hat\partial\varphi(z^\star)$ for every $z^\star \in \omega$.
Proof.
As shown in Lemma 4.12, Algorithm 2 coincides with Algorithm 1 with an essentially cyclic sampling, and there exists an indexing subsequence $(k_r)_{r\in\mathbb{N}}$ with $0 < k_{r+1} - k_r \le N + 1$ such that $\mathcal{K}^{k_r} = \emptyset$. Then, the $\tilde z$-update rule of Algorithm 2 yields
\[
z^{k_r} = \tilde z^{k_r} = \tilde z^{k_r + 1} = \cdots = \tilde z^{k_{r+1} - 1} \quad \forall r \in \mathbb{N}. \tag{4.15}
\]
We have
\[
\Phi^{\hat H}(x^{k_r + 1})
\le \Phi^{\hat H}(x^{k_r}) - \sum_{i \in \mathcal{I}^{k_r+1}} D_{\hat h_i}(u^{k_r}, x^{k_r}_i)
= \Phi^{\hat H}(x^{k_r}) - D_{\hat H}(u^{k_r}, x^{k_r})
\le \Phi^{\hat H}(x^{k_{r-1}+1}) - D_{\hat H}(u^{k_r}, x^{k_r}), \tag{4.16}
\]
holding for every $r \in \mathbb{N}$, where the first inequality and the equality owe to Lemma 4.2(i) and the fact that $\mathcal{I}^{k_r+1} = [N]$, and the last inequality again to Lemma 4.2(i) together with $k_r \ge k_{r-1} + 1$. By telescoping and by using the fact that $\Phi^{\hat H} \ge \min\varphi > -\infty$, it follows that $(D_{\hat H}(u^{k_r}, x^{k_r}))_{r\in\mathbb{N}}$ has finite sum and in particular vanishes. Since $z^{k_r} = \tilde z^{k_r}$,
\[
\varphi^\star \underset{r\to\infty}{\overset{\text{Lem. 4.2(ii)}}{\longleftarrow}} \Phi^{\hat H}(x^{k_r}) = \varphi(z^{k_r}) + D_{\hat H}(u^{k_r}, x^{k_r}) = \varphi(\tilde z^{k_r}) + D_{\hat H}(u^{k_r}, x^{k_r}),
\]
whence assertion 4.13(i) follows. Assertions 4.13(ii) and 4.13(iii) follow by patterning the arguments of Lemmas 4.4(iii) and 4.4(iv). □

In the next theorem global convergence results are provided under Assumption III in a fully nonconvex setting. Moreover, in the spirit of Theorem 4.11, linear and sublinear convergence rates are obtained according to the KL exponent.
Theorem 4.14 (global and linear convergence of Algorithm 2). Suppose that Assumptions I and III are satisfied. Then, the following hold for the iterates generated by Algorithm 2:
(i) $(\tilde z^k)_{k\in\mathbb{N}}$ converges to a stationary point $x^\star$ for $\varphi$.
(ii) If $\theta > 1/2$, then there exists $c > 0$ such that $\varphi(\tilde z^k) - \varphi(x^\star) \le c\, k^{-1/(2\theta-1)}$ holds for all $k \in \mathbb{N}$.
(iii) If $\theta \in (0, 1/2]$, then $(\tilde z^k)_{k\in\mathbb{N}}$ and $(\varphi(\tilde z^k))_{k\in\mathbb{N}}$ converge at R-linear rate.
Proof. Notice that Assumption III entails local strong convexity and Lipschitz differentiability of each $h_i$. Thus, as discussed in the proof of Theorem 4.9, $\hat H$ is $\mu_{\hat H,\mathcal{U}}$-strongly convex on a convex compact set $\mathcal{U}$ containing the iterates. Let the indexing subsequence $(k_r)_{r\in\mathbb{N}}$ be as in the proof of Theorem 4.13, and observe that $k_r \le (N+1)r$ holds for every $r \in \mathbb{N}$. The assertions are established following the same arguments as in Theorem 4.11, with the difference that here we examine the generated iterates at the subindices $k_r$. That is, (4.12) is replaced by
\[
\Delta_{k_{r+1}} - \Delta_{k_r}
\le \frac{\Phi^{\hat H}(x^{k_{r+1}}) - \Phi^{\hat H}(x^{k_r})}{C\,\|u^{k_r} - x^{k_r}\|}
\le \frac{\Phi^{\hat H}(x^{k_r+1}) - \Phi^{\hat H}(x^{k_r})}{C\,\|u^{k_r} - x^{k_r}\|}
\le -\frac{\mu_{\hat H,\mathcal{U}}}{2C}\,\|u^{k_r} - x^{k_r}\|, \tag{4.17}
\]
where the second inequality owes to Lemma 4.2(i) and the last one follows from (4.16) and Fact 2.3(vi). Subsequently, by patterning the arguments of Theorem 4.11, we obtain that $(\Delta_{k_r})_{r\in\mathbb{N}}$ converges Q-linearly if $\theta \le 1/2$, and $\Delta_{k_r} \le c\, r^{-(1-\theta)/(2\theta-1)}$ for some $c > 0$, if $\theta > 1/2$. The claims follow by noting that $\tilde z^k = z^{k_r}$ for all $k$ satisfying $k_r \le k < k_{r+1}$, cf. (4.15), and arguing as in the last part of the proof of Theorem 4.11. □
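The sublinear regime in Theorems 4.11 and 4.14 ultimately rests on the elementary recursion of Lemma A.3 (Appendix A). A quick numerical check of that lemma's rate, with hypothetical constants, can be run as the following sketch shows; the observed decay exponent should approach $1/(\delta - 1)$.

```python
import numpy as np

def recursion_decay(delta, c=1.0, a0=1.0, K=10**4):
    """Iterate a_{k+1}^delta = c*(a_k - a_{k+1}) (cf. Lemma A.3) and return
    the empirical decay exponent, expected to be close to 1/(delta-1)."""
    a, hist = a0, []
    for _ in range(K):
        # a_{k+1} solves t**delta/c + t = a_k; bisection on the monotone map
        lo, hi = 0.0, a
        for _ in range(80):
            mid = (lo + hi) / 2
            lo, hi = (mid, hi) if mid**delta / c + mid < a else (lo, mid)
        a = (lo + hi) / 2
        hist.append(a)
    k1, k2 = K // 10, K
    slope = np.log(hist[k2 - 1] / hist[k1 - 1]) / np.log(k2 / k1)
    return -slope

# e.g. delta = 3, corresponding to KL exponent theta = 3/4: expect ~0.5
```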
5. Application to phase retrieval and numerical simulations
In this section we study two examples related to the phase retrieval problem, which consists of recovering a signal based on intensity measurements, and arises in many important applications including X-ray crystallography, speech processing, electron microscopy, astronomy, and optical imaging; see, e.g., [21, 41, 55, 58]. Here, we consider phase retrieval problems with real-valued data, that is, given $a_i \in \mathbb{R}^n \setminus \{0\}$ and scalars $b_i \in \mathbb{R}_+$, $i \in [N]$, the goal is to find $x \in \mathbb{R}^n$ such that
\[
b_i \approx \langle a_i, x\rangle^2, \quad i \in [N], \tag{5.1}
\]
accounting for the fact that in real-world applications the recorded intensities are likely corrupted by noise, and may involve outliers due to measurement errors. To tackle such problems, we consider the following sparse phase retrieval formulation:
\[
\operatorname*{minimize}_{x \in \mathbb{R}^n} \ \tfrac1N \sum_{i=1}^N L\bigl(b_i, \langle a_i, x\rangle^2\bigr) + g(x), \tag{5.2}
\]
where $L$ is a loss function, and $g$ is some sparsity-inducing function (e.g., the $\ell_0$- or $\ell_1$-norm). In particular, we study the case of the squared loss $L(y,z) = \tfrac14(y - z)^2$ [21, 58], as well as the Poisson loss $L(y,z) = z - y\log(z)$ [23, 70], which is suitable when the measurements follow the Poisson model ($b_i \approx \operatorname{Poisson}(\langle a_i, x\rangle^2)$). Note that other formulations with $\ell_1$-loss have also been studied in the literature [29, 26].

5.1. Sparse phase retrieval with squared loss.
Consider the nonconvex minimization (5.2) with the squared loss $L(y,z) = \tfrac14(y-z)^2$, with either $g = \lambda\|\cdot\|_1$, $\lambda \ge 0$, or $g = \delta_{B^0_\kappa}$, where $B^0_\kappa$ is the $\ell_0$-norm ball of radius $\kappa$. This problem is written in the form of (P) with functions $f_i$ and Legendre kernels $h_i$ given by
\[
f_i(x) = \tfrac14\bigl(\langle a_i, x\rangle^2 - b_i\bigr)^2 \quad\text{and}\quad h_i(x) = \tfrac14\|x\|^4 + \tfrac12\|x\|^2. \tag{5.3}
\]
The next lemma is a simple adaptation from [20] for finding the smoothness moduli of $f_i$ relative to the Legendre kernel $h_i$, and for computing the solutions to the subproblem (2.1). For $\ell_1$-regularization, the inner subproblems amount to computations involving the soft-thresholding operator, whereas in the case of the $\ell_0$-norm ball they amount to computing projections onto $B^0_\kappa$, that is, setting to zero the $n - \kappa$ smallest elements in absolute value.

Lemma 5.1 ([20, Lem. 5.1, Props. 5.1 and 5.2]). Let $f_i$ and $h_i$ be as in (5.3). Then, $f_i$ is $L_{f_i}$-smooth relative to $h_i$ with $L_{f_i} = 3\|a_i\|^4 + \|a_i\|^2 |b_i|$. Moreover, denoting $\bar\gamma = (\sum_{i=1}^N 1/\gamma_i)^{-1}$, for any $y(s) \in \operatorname{prox}_{\bar\gamma g}(\bar\gamma s)$ the operator $T$ as defined in (2.1) may be computed as follows:
(i) if $g = \lambda\|\cdot\|_1$, $\lambda \ge 0$, then $T(s) = t^\star y(s)$, where $t^\star$ is the real positive root of the equation $\|y(s)\|^2 t^3 + t - 1 = 0$;
(ii) if $g = \delta_{B^0_\kappa}$, then $T(s) \ni t^\star \|y(s)\|^{-1} y(s)$, where $t^\star$ is the real nonnegative root of $t^3 + t - \|y(s)\| = 0$ (see Footnote 4).

Footnote 4. Nonnegative real roots of the cubic equation $t^3 + pt + q = 0$ with $p > 0$ and $q \le 0$ admit the closed-form expression $t^\star = (c - q/2)^{1/3} - (c + q/2)^{1/3}$, where $c = (q^2/4 + p^3/27)^{1/2}$; see, e.g., [57].
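A minimal numerical sketch of case (i), combining soft-thresholding with the cubic root formula of Footnote 4 (the argument names, such as `gamma_bar` for $\bar\gamma$, are ours):

```python
import numpy as np

def T_squared_loss_l1(s, gamma_bar, lam):
    """Sketch of Lemma 5.1(i): T(s) = t_star * y(s) for g = lam*||.||_1.

    y(s) is the soft-thresholding of gamma_bar*s, and t_star solves
    ||y||^2 t^3 + t - 1 = 0, computed via Footnote 4 applied to
    t^3 + p t + q = 0 with p = 1/||y||^2 > 0 and q = -1/||y||^2 <= 0.
    """
    y = np.sign(gamma_bar * s) * np.maximum(np.abs(gamma_bar * s) - gamma_bar * lam, 0.0)
    ny2 = np.dot(y, y)
    if ny2 == 0.0:
        return y                      # thresholding annihilates s: T(s) = 0
    p, q = 1.0 / ny2, -1.0 / ny2
    c = np.sqrt(q**2 / 4 + p**3 / 27)
    t_star = np.cbrt(c - q / 2) - np.cbrt(c + q / 2)
    return t_star * y
```

Case (ii) is analogous: replace the soft-thresholding by the projection onto $B^0_\kappa$ (keeping the $\kappa$ largest entries in absolute value) and solve $t^3 + t - \|y(s)\| = 0$ instead.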
5.2. Sparse phase retrieval with Poisson loss. We now assume that the recorded intensities follow the Poisson model ($b_i \sim \operatorname{Poisson}(\langle a_i, x\rangle^2)$). In this setting we adopt the Poisson loss $L(y,z) = z - y\log(z)$ and consider the regularized problem (5.2) with $g = \lambda\|\cdot\|_1$, $\lambda \ge 0$. This problem may be written in the form of (P) by setting
\[
f_i(x) = -b_i\log\bigl(\langle a_i, x\rangle\bigr) + \tfrac12\langle a_i, x\rangle^2 \quad\text{and}\quad h_i(x) = \tfrac12\|a_i\|^2\|x\|^2 - b_i\sum_{j=1}^n \log(x_j). \tag{5.4}
\]
As shown next, the nonconvex function $f_i$ is smooth relative to $h_i$, and the operator $T$ as in (2.1) is easily computable.
Lemma 5.2. Let $f_i$ and $h_i$ be as in (5.4), with $a_i \in \mathbb{R}^n_+ \setminus \{0\}$ and $b = (b_1, \ldots, b_N) \in \mathbb{R}^N_+ \setminus \{0\}$. Then, $h_i$ is a $\|a_i\|^2$-strongly convex Legendre kernel (with $\operatorname{dom} h_i = \mathbb{R}^n_{++}$), and $f_i$ is $L_{f_i}$-smooth relative to $h_i$ with $L_{f_i} = 1$. Moreover, denoting $c_a = \sum_{i=1}^N \gamma_i^{-1}\|a_i\|^2$ and $c_b = \sum_{i=1}^N \gamma_i^{-1} b_i$, the operator $T$ as defined in (2.1) with $g = \lambda\|\cdot\|_1$, $\lambda \ge 0$, is given by
\[
T(s) = (w_1, \ldots, w_n), \quad\text{with}\quad w_j = \frac{1}{2c_a}\Bigl(s_j - \lambda + \bigl((s_j - \lambda)^2 + 4 c_a c_b\bigr)^{1/2}\Bigr). \tag{5.5}
\]
Proof.
To avoid clutter, we drop the subscript $i$ from $h, f, a, b$. The assertion on $h$ is of immediate verification. Since both $f$ and $h$ are $\mathcal{C}^2$ on $\operatorname{int}\operatorname{dom} h$, once we show that $\nabla^2 h - \nabla^2 f \succeq 0$ there, the claim will follow. From direct computations we obtain that $\nabla^2 h(x) = \|a\|^2\operatorname{I} + b\operatorname{diag}(x_1^{-2}, \ldots, x_n^{-2})$ and $\nabla^2 f(x) = \bigl(1 + b\,\langle a, x\rangle^{-2}\bigr) a a^\top$, and in particular
\[
M(x) := \nabla^2 h(x) - \nabla^2 f(x) = b\Bigl(\operatorname{diag}(x_1^{-2}, \ldots, x_n^{-2}) - \frac{a a^\top}{\langle a, x\rangle^2}\Bigr) + \|a\|^2\operatorname{I} - a a^\top.
\]
For every $y \in \mathbb{R}^n$ it holds that (here $a_j$ stands for the $j$-th coordinate of $a$)
\[
\langle y, M(x) y\rangle
\ \ge\ b\sum_{j=1}^n \frac{y_j^2}{x_j^2} - b\,\frac{\langle a, y\rangle^2}{\langle a, x\rangle^2}
\ \ge\ b\sum_{j=1}^n \frac{y_j^2}{x_j^2} - b\sum_{j=1}^n \frac{a_j x_j}{\langle a, x\rangle}\,\frac{y_j^2}{x_j^2}
\ \ge\ 0,
\]
where the first inequality follows from the Cauchy-Schwarz inequality, the second one from Jensen's inequality $(\sum_i \alpha_i \beta_i^2)(\sum_j \alpha_j) \ge (\sum_i \alpha_i \beta_i)^2$, and the third one from the fact that $(\sum_i \alpha_i)(\sum_j \beta_j) \ge \sum_i \alpha_i \beta_i$ for every $\alpha_i, \beta_i \ge 0$, $i \in [n]$. The closed-form solution for the proximal mapping $T(s)$ follows directly from its first-order optimality conditions. □
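For completeness, a sketch of the closed form (5.5); the vectorized implementation below assumes numpy arrays and the constants $c_a$, $c_b$ precomputed as in Lemma 5.2:

```python
import numpy as np

def T_poisson_l1(s, c_a, c_b, lam):
    """Sketch of (5.5): componentwise closed form of T(s) for the Poisson
    kernel.  Each w_j is the positive root of the quadratic
    c_a*w^2 + (lam - s_j)*w - c_b = 0, i.e. the first-order optimality
    condition of the strictly convex one-dimensional subproblem."""
    r = s - lam
    return (r + np.sqrt(r**2 + 4.0 * c_a * c_b)) / (2.0 * c_a)
```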
5.3. Experimental setup. We test Algorithm 1 with the randomized sampling rule (using single-index selection with uniform sampling), the shuffled cyclic rule, and the cyclic rule, as well as the low-memory Algorithm 2 with a cyclic inner loop (corresponding to $\mathcal{I}^{k+1} = [N]$ if $\operatorname{mod}(k, N+1) = 0$, and $\mathcal{I}^{k+1} = \{\operatorname{mod}(k, N+1)\}$ otherwise). We also implement the SMD method for relatively smooth problems [25, 32]. The incremental method PLIAG [69] was not tested, as the problem setting does not comply with the requirements therein; cf. [69, Assump. B4].

Parameters selection. For Algorithms 1 and 2, we always use $\gamma_i = \alpha N / L_{f_i}$ with a fixed fraction $\alpha \in (0,1)$. For SMD we used the popular square-summable stepsize $\gamma_k = \alpha/(L_f k)$, where $k$ is the iteration counter, $L_f$ is the smoothness modulus of $f = \frac1N\sum_{i=1}^N f_i$ relative to a suitable Bregman kernel $h$, and $\alpha > 0$. Specifically, $L_f = \frac1N\sum_{i=1}^N L_{f_i}$, with $h(x) = \frac14\|x\|^4 + \frac12\|x\|^2$ for the simulations related to Subsection 5.1, and $h(x) = \frac{1}{2N}\sum_{i=1}^N \|a_i\|^2\|x\|^2 - \frac1N\sum_{i=1}^N b_i \sum_{j=1}^n \log(x_j)$ for those related to Subsection 5.2.

Optimality criteria. As a measure of suboptimality, we consider
\[
\mathcal{D}(z^k) := \|z^k - v^k\| \quad\text{for some } v^k \in T\Bigl(\textstyle\sum_{i=1}^N \nabla\hat h_i(z^k)\Bigr), \tag{5.6}
\]
since it satisfies
\[
\tfrac1N \operatorname{dist}\bigl(0, \hat\partial\varphi(v^k)\bigr)
= \inf_{w \in \hat\partial\Phi(\mathbf{v}^k)} \tfrac1N \Bigl\|\textstyle\sum_{i=1}^N w_i\Bigr\|
\le \tfrac1N \Bigl\|\textstyle\sum_{i=1}^N \bigl(\nabla\hat h_i(z^k) - \nabla\hat h_i(v^k)\bigr)\Bigr\|
\le \tfrac1N \textstyle\sum_{i=1}^N \bigl\|\nabla\hat h_i(z^k) - \nabla\hat h_i(v^k)\bigr\|
\le \eta\,\|z^k - v^k\| = \eta\,\mathcal{D}(z^k), \tag{5.7}
\]
where $\mathbf{v}^k = (v^k, \ldots, v^k)$, the first equality owes to Lemma A.1(ii), and $\eta$ is some positive constant that exists by virtue of local Lipschitz continuity of $\nabla h_i$, Lemma 3.1(iv), and boundedness of the sequences $(z^k)_{k\in\mathbb{N}}$ and $(v^k)_{k\in\mathbb{N}}$. Although, in a similar fashion to (5.7), $\operatorname{dist}(0, \hat\partial\varphi(z^k))$ may be upper bounded by $\|\tilde s^k - \sum_i \nabla\hat h_i(z^k)\|$, which would be an equally good estimate of $\operatorname{dist}(0, \hat\partial\varphi(z^k))$, this quantity is not readily available in other methods such as SMD. The introduction of $v^k$, instead, offers a viable algorithm-independent alternative that only requires access to output variables.
5.4. Simulations. In the first set of simulations we consider $16 \times 16$ gray-scale images from a digits dataset [30] (see Footnote 5) and a QR code dataset [43]. The images are vectorized, resulting in the signal $x \in \mathbb{R}^n$ with $n = 256$. The rows $a_i$ of the measurement matrix $A \in \mathbb{R}^{N \times n}$, with $N = nd$ for a number $d$ of random masks, are generated as follows: letting $M \in \mathbb{R}^{n \times n}$ be a normalized Hadamard matrix, we generate $d$ many i.i.d. diagonal sign matrices $S_i$ with diagonal elements in $\{-1, 1\}$ selected uniformly at random, and set $A = [MS_1, \ldots, MS_d]$. Typically, a small number of masks $d$ is sufficient for near complete recovery on noiseless data. In our simulations, we corrupted a fraction of the measurements $b_i = \langle a_i, x\rangle^2$ independently with probability $p_c$. All of the plotted algorithms are initialized using the initialization scheme described in [29, §3]. For the $\ell_1$-regularized problem we performed tests with different values of the regularization parameter $\lambda$ (small fractions of $1/N$) and selected the one leading to a visually favorable recovery. When $g = \delta_{B^0_\kappa}$, we set $\kappa = 160$ and $\kappa = 125$ for the digit and QR data, respectively. The convergence behavior in terms of $\mathcal{D}(z^k)$ (see (5.6)) is plotted in Fig. 1 for a representative digit 8 image. With the above described initialization, the algorithms converge to the same cost. In our simulations SMD had the slowest performance, and the cyclic sampling in Algorithm 1 was observed to consistently outperform all others.

Footnote 5. Available at https://web.stanford.edu/~hastie/ElemStatLearn/data.html.
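The masked-Hadamard measurement design just described can be reproduced with a few lines of numpy/scipy; the sketch below covers the noiseless model (the random corruption step is omitted), with the mask count d as above:

```python
import numpy as np
from scipy.linalg import hadamard

def masked_hadamard_measurements(x, d, rng=np.random.default_rng(0)):
    """Measurement design A = [M S_1; ...; M S_d] with M a normalized
    Hadamard matrix and S_i random diagonal sign matrices; b = (A x)^2."""
    n = x.size                                  # must be a power of two
    M = hadamard(n) / np.sqrt(n)                # normalized Hadamard matrix
    signs = rng.choice([-1.0, 1.0], size=(d, n))
    A = np.vstack([M * s for s in signs])       # M * s scales the columns, i.e. M @ diag(s)
    b = (A @ x) ** 2
    return A, b
```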
Figure 1. Representative convergence plots (suboptimality $\mathcal{D}(z^k) = \|z^k - v^k\|$ and cost $\varphi(z^k)$ versus iterations, for the randomized, cyclic, shuffled, and low-memory variants and SMD) for problem (5.2) with squared loss on a digits image: (first row) $\ell_1$-regularization, (second row) $\ell_0$-norm ball constraint. The related plots for the QR code images follow a very similar trend and are therefore omitted.

Figure 2. Image recovery with corrupted measurements at two tolerance levels; panels: (a) original image, (b) initialization, (c)-(d) recovery at the looser and tighter tolerance. The sparsity parameters $\kappa = 160$ and $\kappa = 125$ are used for the digit and the QR code, respectively.
Figure 3. Representative convergence plots ($\mathcal{D}(z^k)$ and $\varphi(z^k)$; same algorithms as in Figure 1) for the $\ell_1$-regularized problem with Poisson loss.

The low-memory Algorithm 2 has a comparable performance, almost always better than the randomized sampling. As evident from Fig. 2, despite corrupted measurements a reasonably good recovery is achieved with the $\ell_0$-norm ball. A similar recovery is observed with $\ell_1$ regularization.

In the last set of simulations we consider synthetic data. We generate a standard random Gaussian matrix $A \in \mathbb{R}^{N \times n}$ for two problem sizes $N$. The data vector $a_i$, $i \in [N]$, is set equal to the absolute value of the $i$-th row of $A$. We also drew a random vector from $\mathcal{N}(0, \operatorname{I}_n)$ and set the signal $x$ equal to its absolute value. We generated the measurements according to the Poisson model $b_i \sim \operatorname{Poisson}(\langle a_i, x\rangle^2)$, $i \in [N]$, and further corrupted the measurements $b_i$, with probability $p_c$, by setting them equal to the nearest integer to the absolute value of $\|x\|\,\mathcal{N}(0,1)$. All methods were initialized at the same random point. We ran simulations with three values of the regularization parameter $\lambda$ proportional to $1/N$; we only report the results for one of them due to space limitations, nevertheless remarking that for the other values similar plots were observed. The results are illustrated in Fig. 3. Similar to the previous experiments, SMD had the worst performance, while the best results are observed for Algorithm 1 with cyclic sampling. The low-memory Algorithm 2 usually outperforms the randomized variant of Algorithm 1.

6. Conclusions

A Bregman incremental aggregated method was developed that extends Finito/MISO [28, 42] to non-Lipschitz and nonconvex settings. The basic algorithm was studied under randomized and essentially cyclic sampling strategies. Furthermore, a variant with $O(n)$ memory requirements was developed that is novel even in the Euclidean case. A sure descent property established on a Bregman Moreau envelope leads to a surprisingly simple convergence analysis. As one particularly interesting result, in the randomized setting linear convergence is established under strong convexity of the cost without requiring convexity of the individual functions $f_i$ or $g$. Future research directions include extending the analysis to the framework of Douglas-Rachford splitting and momentum-type schemes, as well as applications to nonconvex distributed asynchronous optimization.

Appendix A. Auxiliary results
Lemma A.1 (equivalence between (3.1) and (P)). Let the function $\Phi: \mathbb{R}^{nN} \to \overline{\mathbb{R}}$ and the set $\Delta \subseteq \mathbb{R}^{nN}$ be as in (3.1). Then, the following hold:
(i) cost: $\Phi(\mathbf{x}) = \varphi(x)$ if $\mathbf{x} = (x, \ldots, x)$, and $\Phi(\mathbf{x}) = \infty$ otherwise.
(ii) subdifferential: $\hat\partial(\Phi + \delta_{\overline C \times\cdots\times \overline C})(\mathbf{x}) = \bigl\{v = (v_1, \ldots, v_N) \in \mathbb{R}^{nN} \mid \sum_{i=1}^N v_i \in \hat\partial(\varphi + \delta_{\overline C})(x)\bigr\}$ if $\mathbf{x} = (x, \ldots, x)$ for some $x \in \mathbb{R}^n$, and is empty otherwise; the same relation still holds if the regular subdifferential $\hat\partial$ is replaced by the limiting subdifferential $\partial$.
(iii) KL property: $\varphi$ has the KL property at $x$ iff so does $\Phi$ at $\mathbf{x} = (x, \ldots, x)$, in which case the desingularizing functions are the same up to a positive scaling.
(iv) stationary points: a point $\mathbf{x}^\star$ is stationary for problem (3.1) iff $\mathbf{x}^\star = (x^\star, \ldots, x^\star)$ for some $x^\star \in \mathbb{R}^n$ which is stationary for problem (P).
(v) minimizers: $\mathbf{x}^\star$ is a (local) minimizer of problem (3.1) iff $\mathbf{x}^\star = (x^\star, \ldots, x^\star)$ for some $x^\star \in \mathbb{R}^n$ which is a (local) minimizer for problem (P); in fact, $\inf_{\overline C\times\cdots\times\overline C} \Phi = \inf_{\overline C} \varphi$.
(vi) level boundedness: $\varphi + \delta_{\overline C}$ is level bounded iff so is $\Phi + \delta_{\overline C\times\cdots\times\overline C}$.
(vii) convexity: $\varphi: \mathbb{R}^n \to \overline{\mathbb{R}}$ is convex iff so is $\Phi: \mathbb{R}^{Nn} \to \overline{\mathbb{R}}$.
Proof. For notational convenience, up to possibly replacing $g$ with $g + \delta_{\overline C}$ we may assume without loss of generality that $\overline C = \mathbb{R}^n$.
♠ A.1(i): Trivial consequence of the fact that $\operatorname{dom}\Phi \subseteq \Delta$ (the consensus set, cf. (3.1)).
♠ A.1(ii): In light of the previous point, having $\mathbf{x} = (x, \ldots, x)$ for some $x \in \mathbb{R}^n$ is necessary for the nonemptiness of $\hat\partial\Phi(\mathbf{x})$. Let $\mathbf{x} = (x, \ldots, x)$ and $v \in \hat\partial\Phi(\mathbf{x})$ be fixed. Then,
\[
0 \le \liminf_{\mathbf{x}\ne\mathbf{y}\to\mathbf{x}} \frac{\Phi(\mathbf{y}) - \Phi(\mathbf{x}) - \langle v, \mathbf{y}-\mathbf{x}\rangle}{\|\mathbf{y}-\mathbf{x}\|}
= \liminf_{x\ne y\to x} \frac{\varphi(y) - \varphi(x) - \langle\sum_i v_i,\, y - x\rangle}{\sqrt N\,\|y - x\|}, \tag{A.1}
\]
where the equality comes from the fact that $\operatorname{dom}\Phi \subseteq \Delta$ together with assertion A.1(i). This shows that $\sum_i v_i \in \hat\partial\varphi(x)$. Conversely, let $u \in \hat\partial\varphi(x)$ and $v \in \mathbb{R}^{nN}$ be such that $\sum_i v_i = u$. By reading (A.1) from right to left we obtain that $v \in \hat\partial\Phi(\mathbf{x})$. Having shown the identity for the regular subdifferential, the same claim for the limiting subdifferential follows by definition.
♠ A.1(iii): Follows from the bounds
\[
\tfrac{1}{\sqrt N}\operatorname{dist}(0, \partial\varphi(x))
= \inf_{v\in\partial\Phi(\mathbf{x})} \tfrac{1}{\sqrt N}\Bigl\|\textstyle\sum_{i=1}^N v_i\Bigr\|
\le \operatorname{dist}(0, \partial\Phi(\mathbf{x}))
\le \inf_{v\in\partial\Phi(\mathbf{x})} \textstyle\sum_{i=1}^N \|v_i\|
\le \inf_{v\in\partial\Phi(\mathbf{x})} \Bigl\|\textstyle\sum_{i=1}^N v_i\Bigr\|
= \operatorname{dist}(0, \partial\varphi(x)),
\]
where the first and last equalities are due to assertion A.1(ii).
♠ A.1(iv): Directly follows from assertion A.1(ii).
♠ A.1(v), A.1(vi) & A.1(vii): Follow from assertion A.1(i). □

Lemma A.2.
Let $\mathcal{U}_i \subseteq \operatorname{int}\operatorname{dom} h_i$ be nonempty and convex, $i \in [N]$, and let $\mathcal{U} := \mathcal{U}_1 \times\cdots\times \mathcal{U}_N$. Additionally to Assumption I, suppose that $g$ is convex, and that $h_i$, $i \in [N]$, is $\ell_{h_i,\mathcal{U}_i}$-Lipschitz differentiable and $\mu_{h_i,\mathcal{U}_i}$-strongly convex on $\mathcal{U}_i$. Then, the following hold for the function $\hat H$ as in (3.4) with $\gamma_i \in (0, N/L_{f_i})$, $i \in [N]$:
(i) $\operatorname{prox}^{\hat H}_\Phi$ is Lipschitz continuous on $\mathcal{U}$.
If in addition $f_i$ and $h_i$ are twice continuously differentiable on $\mathcal{U}_i$, $i \in [N]$, then
(ii) $\Phi^{\hat H}$ is continuously differentiable on $\mathcal{U}$ with $\nabla\Phi^{\hat H}(\mathbf{x}) = \nabla^2\hat H(\mathbf{x})\bigl(\mathbf{x} - \operatorname{prox}^{\hat H}_\Phi(\mathbf{x})\bigr)$;
(iii) $\operatorname{dist}(0, \partial\Phi^{\hat H}(\mathbf{x})) = \|\nabla\Phi^{\hat H}(\mathbf{x})\| \le C_{\mathcal{U}}\,\|\mathbf{x} - \mathbf{z}\|$ for any $\mathbf{x} \in \mathcal{U}$, where $\mathbf{z} = \operatorname{prox}^{\hat H}_\Phi(\mathbf{x})$ and $C_{\mathcal{U}} = \max_i \bigl\{\bigl(1 + \tfrac{\gamma_i L_{f_i}}{N}\bigr)\tfrac{\ell_{h_i,\mathcal{U}_i}}{\gamma_i}\bigr\}$.
Proof. It follows from Lemmas 3.1(iv) and 3.1(v) that $\hat h_i$ is $\ell_{\hat h_i,\mathcal{U}_i}$-Lipschitz differentiable and $\mu_{\hat h_i,\mathcal{U}_i}$-strongly convex on $\mathcal{U}_i$ with $\ell_{\hat h_i,\mathcal{U}_i} = \bigl(1 + \tfrac{\gamma_i L_{f_i}}{N}\bigr)\tfrac{\ell_{h_i,\mathcal{U}_i}}{\gamma_i}$ and $\mu_{\hat h_i,\mathcal{U}_i} = \bigl(1 - \tfrac{\gamma_i L_{f_i}}{N}\bigr)\tfrac{\mu_{h_i,\mathcal{U}_i}}{\gamma_i}$. It follows that $\hat H$ is Lipschitz differentiable and strongly convex on $\mathcal{U}$ (with respective moduli $\ell_{\hat H,\mathcal{U}} = \max_i \ell_{\hat h_i,\mathcal{U}_i}$ and $\mu_{\hat H,\mathcal{U}} = \min_i \mu_{\hat h_i,\mathcal{U}_i}$), and therefore so is its conjugate $\hat H^*$ on $\nabla\hat H(\mathcal{U})$ (with respective moduli $\ell_{\hat H^*,\mathcal{U}} = \mu_{\hat H,\mathcal{U}}^{-1}$ and $\mu_{\hat H^*,\mathcal{U}} = \ell_{\hat H,\mathcal{U}}^{-1}$). Notice that convexity of $g$, this being equivalent to that of $G$, implies that $\hat H(\mathbf{x}) + \Phi(\mathbf{x}) = G(\mathbf{x}) + \sum_{i=1}^N \tfrac{1}{\gamma_i} h_i(x_i)$ is strongly convex on $\mathcal{U}$. We may thus invoke [35, Thm. 4.2] and Fact 2.3(i) to conclude that $\operatorname{prox}^{\hat H}_\Phi = \partial(\hat H + \Phi)^* \circ \nabla\hat H = \nabla(\hat H + \Phi)^* \circ \nabla\hat H$ is the composition of Lipschitz-continuous mappings on $\mathcal{U}$, which shows assertion A.2(i). In turn, assertion A.2(ii) follows from [35, Cor. 3.1]. Finally, $\ell_{\hat H,\mathcal{U}}$-Lipschitz continuity of $\nabla\hat H$ on $\mathcal{U}$ entails the bound $\|\nabla^2\hat H\| \le \ell_{\hat H,\mathcal{U}}$ on $\mathcal{U}$, leading to assertion A.2(iii). □
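As a sanity check of assertion A.2(ii), not part of the original argument: for a Euclidean kernel the identity collapses to the classical gradient formula for the Moreau envelope,
\[
\hat H = \tfrac{1}{2\gamma}\|\cdot\|^2
\;\Longrightarrow\;
D_{\hat H}(w, x) = \tfrac{1}{2\gamma}\|w - x\|^2,
\qquad
\nabla\Phi^{\hat H}(x) = \nabla^2\hat H(x)\bigl(x - \operatorname{prox}^{\hat H}_\Phi(x)\bigr) = \tfrac{1}{\gamma}\bigl(x - \operatorname{prox}_{\gamma\Phi}(x)\bigr),
\]
since in this case $\operatorname{prox}^{\hat H}_\Phi$ reduces to the classical proximal mapping $\operatorname{prox}_{\gamma\Phi}$.

Lemma A.3.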
Let $(\alpha_k)_{k\in\mathbb{N}} \subset \mathbb{R}_+$ be a sequence, and suppose that there exist $c > 0$ and $\delta \in [1, \infty)$ such that $\alpha_{k+1}^\delta \le c(\alpha_k - \alpha_{k+1})$ holds for every $k \in \mathbb{N}$.
(i) If $\delta = 1$, then $(\alpha_k)_{k\in\mathbb{N}}$ is Q-linearly convergent (to $0$).
(ii) If $\delta \in (1, \infty)$, then there exists $c' > 0$ such that $\alpha_k \le c'\, k^{-1/(\delta - 1)}$ holds for all $k \in \mathbb{N}$.
Proof. The case $\delta = 1$ follows from $\alpha_{k+1} \le \tfrac{c}{1+c}\alpha_k$. The case $\delta \in (1,\infty)$ is shown in the proof of [3, Thm. 2] (cf. equation (13) therein with $\theta = \tfrac{\delta}{\delta+1}$). □

Appendix B. Omitted proofs of Section 4

Proof of Theorem 4.7 (linear convergence with randomized sampling). We will use the equivalent BC-reformulation of Algorithm 3, through the identities shown in Lemma 3.2. We start by observing that $x_i^k \in \{x^{\rm init}, z^j \mid j \in \mathbb{N}\} \subseteq \mathcal{U}$ holds for any $k \in \mathbb{N}$ and $i \in [N]$, as it follows from the $x$-update of Algorithm 3 and the fact that $u_i^k = z^k$, cf. Lemma 3.2(ii). Let $\mathbf{x}^\star = (x^\star, \ldots, x^\star)$ be the unique minimizer of $\Phi$ (cf. Lemma A.1(v)). As shown in (4.2), denoting $v^k := \nabla\hat H(x^k) - \nabla\hat H(u^k) \in \hat\partial\Phi(u^k)$, we have
\[
\begin{aligned}
\Phi^{\hat H}(x^k) - \min\Phi
&= \Phi(u^k) + D_{\hat H}(u^k, x^k) - \Phi(\mathbf{x}^\star) \\
&= \Phi(u^k) + D_{\hat H}(\mathbf{x}^\star, x^k) - D_{\hat H}(\mathbf{x}^\star, u^k) + \langle v^k, \mathbf{x}^\star - u^k\rangle - \Phi(\mathbf{x}^\star) && \text{[Fact 2.3(ii)]} \\
&= \varphi(z^k) + D_{\hat H}(\mathbf{x}^\star, x^k) - D_{\hat H}(\mathbf{x}^\star, u^k) + \bigl\langle\textstyle\sum_{i=1}^N v_i^k,\, x^\star - z^k\bigr\rangle - \varphi(x^\star) && \text{[Lemma A.1(i)]} \\
&\le D_{\hat H}(\mathbf{x}^\star, x^k) - D_{\hat H}(\mathbf{x}^\star, u^k) - \tfrac{\mu_\varphi}{2}\|z^k - x^\star\|^2 && \text{[Lemma A.1(ii)]} \\
&= D_{\hat H}(u^k, x^k) + \textstyle\sum_{i=1}^N \bigl\langle\nabla\hat h_i(z^k) - \nabla\hat h_i(x_i^k),\, x^\star - z^k\bigr\rangle - \tfrac{\mu_\varphi}{2}\|z^k - x^\star\|^2, && \text{[Fact 2.3(ii)]}
\end{aligned} \tag{B.1}
\]
where the inequality follows from strong convexity of $\varphi$. For any $\varepsilon_i > 0$, $i \in [N]$, Young's inequality gives
\[
\bigl\langle\nabla\hat h_i(z^k) - \nabla\hat h_i(x_i^k),\, x^\star - z^k\bigr\rangle \le \tfrac{\varepsilon_i}{2}\|x^\star - z^k\|^2 + \tfrac{1}{2\varepsilon_i}\bigl\|\nabla\hat h_i(z^k) - \nabla\hat h_i(x_i^k)\bigr\|^2.
\]
Plugged into (B.1) with $\varepsilon_i > 0$ such that $\sum_{i=1}^N \varepsilon_i = \mu_\varphi$, so as to cancel the squared norm therein, this yields
\[
\Phi^{\hat H}(x^k) - \min\Phi
\le D_{\hat H}(u^k, x^k) + \textstyle\sum_{i=1}^N \tfrac{1}{2\varepsilon_i}\bigl\|\nabla\hat h_i(z^k) - \nabla\hat h_i(x_i^k)\bigr\|^2
\overset{\text{Fact 2.3(vii)}}{\le} D_{\hat H}(u^k, x^k) + \textstyle\sum_{i=1}^N \tfrac{\ell_{\hat h_i,\mathcal{U}}}{\varepsilon_i}\, D_{\hat h_i}(z^k, x_i^k)
= \textstyle\sum_{i=1}^N \Bigl(1 + \tfrac{\ell_{\hat h_i,\mathcal{U}}}{\varepsilon_i}\Bigr) D_{\hat h_i}(z^k, x_i^k),
\]
where $\ell_{\hat h_i,\mathcal{U}}$ is a Lipschitz constant for $\nabla\hat h_i$ on $\mathcal{U}$ as in Lemma 3.1(iv). By choosing $\varepsilon_i = \ell_{\hat h_i,\mathcal{U}}/\kappa$ with $\kappa := \sum_j \ell_{\hat h_j,\mathcal{U}}/\mu_\varphi$ (which satisfies $\sum_{i=1}^N \varepsilon_i = \mu_\varphi$), we obtain
\[
\Phi^{\hat H}(x^k) - \min\Phi \le (1 + \kappa)\,\textstyle\sum_{i=1}^N D_{\hat h_i}(z^k, x_i^k) = (1+\kappa)\, D_{\hat H}(u^k, x^k).
\]
By combining this with (4.1) (recall the equivalences in Lemma 3.2), the inequality
\[
\operatorname{E}_k\bigl[\Phi^{\hat H}(x^{k+1}) - \min\Phi\bigr] \le \Bigl(1 - \frac{p_{\min}}{1+\kappa}\Bigr)\bigl(\Phi^{\hat H}(x^k) - \min\Phi\bigr) = (1 - c_{\mathcal{U}})\bigl(\Phi^{\hat H}(x^k) - \min\Phi\bigr)
\]
follows, where $c_{\mathcal{U}}$ as in the statement is obtained by using the estimates of Lemma 3.1(iv) for the moduli $\ell_{\hat h_i,\mathcal{U}}$, $i \in [N]$, appearing in the constant $\kappa$: namely, since $\hat h_i = h_i/\gamma_i - f_i/N$ and $\ell_{f_i,\mathcal{U}} := \ell_{h_i,\mathcal{U}} L_{f_i}$ is a Lipschitz modulus for $\nabla f_i$ on $\mathcal{U}$ (cf. Fact 2.5(iii)), one has $\sigma_{f_i,\mathcal{U}} \ge -\ell_{f_i,\mathcal{U}}$ and $\ell_{\hat h_i,\mathcal{U}} \le \ell_{h_i,\mathcal{U}}/\gamma_i - \sigma_{f_i,\mathcal{U}}/N$. This concludes the proof of inequality (4.8). In turn, (4.9) follows by taking unconditional expectation on both sides and using the fact that $\varphi(z^k) = \Phi(u^k) \le \Phi^{\hat H}(x^k)$, owing to Lemma 3.2(v) and Fact 2.7(i). □
Proof of Theorem 4.9 (subsequential convergence with essentially cyclic sampling). The claims are established using the simpler setting of Algorithm 3, owing to the equivalence between the two algorithms shown in Lemma 3.2. We begin by observing that $\operatorname{prox}^{\hat H}_\Phi$ is $\lambda$-Lipschitz continuous on a convex set $\mathcal{U}$ that contains the sequences $(x^k)_{k\in\mathbb{N}}$ and $(u^k)_{k\in\mathbb{N}}$, for some (finite) $\lambda > 0$. In fact, if $\varphi$ is level bounded, then a bounded $\mathcal{U}$ exists by Lemma 4.2(iv), and local Lipschitz differentiability and strong convexity are thus enough to invoke Lemma A.2(i). If, instead, those properties hold globally, then the same result can be invoked with $\mathcal{U} = \mathbb{R}^{nN}$. Similarly, it follows from Lemmas 3.1(iv) and 3.1(v) that $\hat H$ is $\mu_{\hat H,\mathcal{U}}$-strongly convex and $\ell_{\hat H,\mathcal{U}}$-Lipschitz differentiable on $\mathcal{U}$ for some constants $\ell_{\hat H,\mathcal{U}} \ge \mu_{\hat H,\mathcal{U}} > 0$. Since every index $i \in [N]$ is sampled at least once every $T$ iterations, $t_\nu(i) := \min\{t \in [T] \mid i \text{ is sampled at iteration } \nu T + t - 1\}$ is well defined for each index $i \in [N]$ and $\nu \in \mathbb{N}$. In other words, since $i$ is sampled at iteration $\nu T + t_\nu(i) - 1$ and at none of the iterations $\nu T, \ldots, \nu T + t_\nu(i) - 2$, it holds that
\[
x_i^{\nu T} = x_i^{\nu T + 1} = \cdots = x_i^{\nu T + t_\nu(i) - 1} \quad\text{and}\quad x_i^{\nu T + t_\nu(i)} = u^{\nu T + t_\nu(i) - 1} \quad \forall i \in [N],\ \nu \in \mathbb{N}, \tag{B.2}
\]
recalling that $u^k = (u^k, \ldots, u^k)$. We now proceed to establish a descent inequality for $\Phi^{\hat H}$ holding over every interval of $T$ iterations. First,
\[
\Phi^{\hat H}(x^{T(\nu+1)}) - \Phi^{\hat H}(x^{\nu T})
= \sum_{\tau=1}^{T}\bigl(\Phi^{\hat H}(x^{\nu T + \tau}) - \Phi^{\hat H}(x^{\nu T + \tau - 1})\bigr)
\le -\sum_{\tau=1}^{T} D_{\hat H}(x^{\nu T + \tau}, x^{\nu T + \tau - 1})
\le -D_{\hat H}(x^{\nu T + t}, x^{\nu T + t - 1})
\le -\tfrac{\mu_{\hat H,\mathcal{U}}}{2}\,\|x^{\nu T + t} - x^{\nu T + t - 1}\|^2 \tag{B.3}
\]
holds for all $t \in [T]$, where the first inequality owes to Lemma 4.2(i) and the last one to Fact 2.3(vi). Next, for every $i \in [N]$ it holds that
\[
\|u^{\nu T + t_\nu(i) - 1} - u^{\nu T}\|
\le \tfrac{\lambda}{\sqrt N}\,\|x^{\nu T + t_\nu(i) - 1} - x^{\nu T}\|
\le \tfrac{\lambda}{\sqrt N}\sum_{\tau=1}^{t_\nu(i)-1}\|x^{\nu T + \tau} - x^{\nu T + \tau - 1}\|
\overset{\text{(B.3)}}{\le} \tfrac{\lambda T}{\sqrt N}\sqrt{\tfrac{2}{\mu_{\hat H,\mathcal{U}}}\bigl(\Phi^{\hat H}(x^{\nu T}) - \Phi^{\hat H}(x^{T(\nu+1)})\bigr)}, \tag{B.4}
\]
where the first inequality uses the $\lambda$-Lipschitz continuity of $\operatorname{prox}^{\hat H}_\Phi$ together with the consensus structure of the $u$-iterates, the second one the triangle inequality, and the last one the fact that $t_\nu(i) \le T$. For all $i \in [N]$, it then follows from (B.2), (B.3), (B.4) and the triangle inequality that
\[
\|x_i^{\nu T} - u^{\nu T}\|
\le \|x_i^{\nu T + t_\nu(i) - 1} - x_i^{\nu T + t_\nu(i)}\| + \|u^{\nu T + t_\nu(i) - 1} - u^{\nu T}\|
\le \sqrt{\tfrac{2}{\mu_{\hat H,\mathcal{U}}}}\Bigl(1 + \tfrac{\lambda T}{\sqrt N}\Bigr)\sqrt{\Phi^{\hat H}(x^{\nu T}) - \Phi^{\hat H}(x^{T(\nu+1)})}. \tag{B.5}
\]
By squaring and summing over $i \in [N]$ we obtain
\[
\Phi^{\hat H}(x^{T(\nu+1)}) - \Phi^{\hat H}(x^{\nu T}) \le -\frac{\mu_{\hat H,\mathcal{U}}}{2N\bigl(1 + \lambda T/\sqrt N\bigr)^2}\,\|x^{\nu T} - u^{\nu T}\|^2. \tag{B.6}
\]
Similarly, by suitably shifting, we conclude that for all $k \in \mathbb{N}$
\[
\Phi^{\hat H}(x^{k+T}) - \Phi^{\hat H}(x^k)
\le -\frac{\mu_{\hat H,\mathcal{U}}}{2N\bigl(1 + \lambda T/\sqrt N\bigr)^2}\,\|x^k - u^k\|^2
\overset{\text{Fact 2.3(vii)}}{\le} -\frac{\mu_{\hat H,\mathcal{U}}}{\ell_{\hat H,\mathcal{U}}\, N\bigl(1 + \lambda T/\sqrt N\bigr)^2}\, D_{\hat H}(u^k, x^k). \tag{B.7}
\]
By telescoping the inequality and using the fact that the envelope is lower bounded (Lemma 4.1(ii) and Assumption I), all the assertions follow from Lemma 4.4. □

Proof of Lemma 4.12 (Algorithm 2 as an instance of Algorithm 1).
Note that, in Algorithm 1, $\sum_{i=1}^N s_i^0 = \tilde s^0$. Arguing by induction, suppose that $\sum_{i=1}^N s_i^k = \tilde s^k$ for some $k \ge 0$. Then,
\[
\tilde s^{k+1} = \tilde s^k + \sum_{i\in\mathcal{I}^{k+1}}(s_i^{k+1} - s_i^k) = \sum_{i=1}^N s_i^k + \sum_{i\in\mathcal{I}^{k+1}}(s_i^{k+1} - s_i^k) = \sum_{i=1}^N s_i^{k+1}, \tag{B.8}
\]
where the first equality follows from the $\tilde s$-update of Algorithm 1, the second one from the induction hypothesis, and the last one from the fact that $s_i^{k+1} = s_i^k$ for $i \notin \mathcal{I}^{k+1}$.
In what follows, let $\mathcal{N}_{\rm full} := \{k \mid \mathcal{K}^k = \emptyset\}$, and observe that $k \in \mathcal{N}_{\rm full}$ iff the if-condition triggering a full update in Algorithm 2 is true. In particular,
\[
\mathcal{N}_{\rm full} \subseteq \bigl\{k \mid \mathcal{I}^{k+1}_{\rm lm} = [N]\bigr\}. \tag{B.9}
\]
We now proceed by induction on $k$ to establish that $(z^k_{\rm lm})_{k\in\mathbb{N}}$ and $(\tilde s^k_{\rm lm})_{k\in\mathbb{N}}$ are sequences generated by Algorithm 1 with index sets chosen as
\[
\mathcal{I}^{k+1} := \mathcal{I}^{k+1}_{\rm lm} \quad \forall k \in \mathbb{N}. \tag{B.10}
\]
The claim is true for $k = 0$; suppose it holds up to iteration $k \ge 0$. We consider two cases:
♠ Case 1: $k \in \mathcal{N}_{\rm full}$. We have
\[
\tilde s^{k+1}_{\rm lm} = \sum_{i=1}^N \nabla\hat h_i(z^k_{\rm lm}) = \sum_{i=1}^N \nabla\hat h_i(z^k) = \sum_{i=1}^N s_i^{k+1} = \tilde s^{k+1}, \tag{B.11}
\]
where the first equality follows from the full-update step of Algorithm 2, the second one by induction, the third one from the fact that $\mathcal{I}^{k+1} = \mathcal{I}^{k+1}_{\rm lm} = [N]$ (cf. (B.9) and (B.10)), and the last one from (B.8). It follows that the minimization problems defining $z^{k+1}$ and $z^{k+1}_{\rm lm}$ coincide, thus ensuring that $z^{k+1} = z^{k+1}_{\rm lm}$ is a feasible update for Algorithm 1.
♠ Case 2: $k \notin \mathcal{N}_{\rm full}$. Let $t(k) := \max\{t \le k \mid t \in \mathcal{N}_{\rm full}\}$ be the last iteration before $k$ at which the full-update condition holds, so that $\tilde z^k_{\rm lm} = z^{t(k)}_{\rm lm}$. We have
\[
\tilde s^{k+1}_{\rm lm}
= \tilde s^k_{\rm lm} + \sum_{i\in\mathcal{I}^{k+1}_{\rm lm}}\bigl[\nabla\hat h_i(z^k_{\rm lm}) - \nabla\hat h_i(\tilde z^k_{\rm lm})\bigr]
= \tilde s^k + \sum_{i\in\mathcal{I}^{k+1}}\bigl[\nabla\hat h_i(z^k) - \nabla\hat h_i(z^{t(k)})\bigr]
= \tilde s^k + \sum_{i\in\mathcal{I}^{k+1}}\bigl[s_i^{k+1} - s_i^{t(k)+1}\bigr], \tag{B.12}
\]
where the first equality is the $\tilde s$-update of Algorithm 2, the second one uses the induction hypothesis and (B.10), and in the last one (B.9) was used to infer that $s_i^{t(k)+1} = \nabla\hat h_i(z^{t(k)})$ for all $i \in [N]$ (and in particular for all $i \in \mathcal{I}^{k+1}$). To conclude, note that the selection rule for $\mathcal{I}^{k+1}_{\rm lm}$ ensures through (B.10) that the index sets $\mathcal{I}^{t(k)+2}, \ldots, \mathcal{I}^{k+1}$ are all pairwise disjoint, hence that $s_i^{t(k)+1} = s_i^{t(k)+2} = \cdots = s_i^k$ for all $i \in \mathcal{I}^{k+1}$. We may thus replace $s_i^{t(k)+1}$ with $s_i^k$ in (B.12) to recover the $\tilde s$-update of Algorithm 1, and conclude that $\tilde s^{k+1}_{\rm lm} = \tilde s^{k+1}$. As discussed in the last part of Case 1, this in turn shows that $z^{k+1} = z^{k+1}_{\rm lm}$ is a feasible update for Algorithm 1. □

References

[1] Masoud Ahookhosh, Andreas Themelis, and Panagiotis Patrinos. A Bregman forward-backward linesearch algorithm for nonconvex composite optimization: superlinear convergence to nonisolated local minima. SIAM J. Optim., 2021. (to appear).
[2] Zeyuan Allen-Zhu and Yang Yuan. Improved SVRG for non-strongly-convex or sum-of-non-convex objectives. In Int. Conf. Mach. Learn., pages 1080–1089, 2016.
[3] Hédy Attouch and Jérôme Bolte. On the convergence of the proximal algorithm for nonsmooth functions involving analytic features. Math. Program., 116(1-2):5–16, 2009.
[4] Hédy Attouch, Jérôme Bolte, Patrick Redont, and Antoine Soubeyran. Proximal alternating minimization and projection methods for nonconvex problems: An approach based on the Kurdyka-Łojasiewicz inequality. Math. Oper. Res., 35(2):438–457, 2010.
[5] Hédy Attouch, Jérôme Bolte, and Benar Fux Svaiter. Convergence of descent methods for semi-algebraic and tame problems: proximal algorithms, forward-backward splitting, and regularized Gauss-Seidel methods. Math. Program., 137(1):91–129, 2013.
[6] Heinz H. Bauschke, Jérôme Bolte, Jiawei Chen, Marc Teboulle, and Xianfu Wang. On linear convergence of non-Euclidean gradient methods without strong convexity and Lipschitz gradient continuity. J. Optim. Theory Appl., 182(3):1068–1087, 2019.
[7] Heinz H. Bauschke, Jérôme Bolte, and Marc Teboulle. A descent lemma beyond Lipschitz gradient continuity: First-order methods revisited and applications. Math. Oper. Res., 42(2):330–348, 2017.
[8] Heinz H. Bauschke and Jonathan M. Borwein. Legendre functions and the method of random Bregman projections. J. Convex Anal., 4(1):27–67, 1997.
[9] Heinz H. Bauschke, Jonathan M. Borwein, and Patrick L. Combettes. Essential smoothness, essential strict convexity, and Legendre functions in Banach spaces. Commun. Contemp. Math., 03(04):615–647, 2001.
[10] Amir Beck. First-Order Methods in Optimization. Society for Industrial and Applied Mathematics, Philadelphia, PA, 2017.
[11] Amir Beck and Marc Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Oper. Res. Lett., 31(3):167–175, 2003.
[12] Amir Beck and Luba Tetruashvili. On the convergence of block coordinate descent type methods. SIAM J. Optim., 23(4):2037–2060, 2013.
[13] Dimitri P. Bertsekas. Incremental proximal methods for large scale convex optimization. Math. Program., 129(2):163–195, 2011.
[14] Dimitri P. Bertsekas. Convex Optimization Theory. Athena Scientific, 2015.
[15] Dimitri P. Bertsekas and John N. Tsitsiklis. Gradient convergence in gradient methods with errors. SIAM J. Optim., 10(3):627–642, 2000.
[16] Doron Blatt, Alfred O. Hero, and Hillel Gauchman. A convergent incremental gradient method with a constant step size. SIAM J. Optim., 18(1):29–51, 2007.
[17] Jérôme Bolte, Aris Daniilidis, and Adrian Lewis. The Łojasiewicz inequality for nonsmooth subanalytic functions with applications to subgradient dynamical systems. SIAM J. Optim., 17(4):1205–1223, 2007.
[18] Jérôme Bolte, Aris Daniilidis, Adrian Lewis, and Masahiro Shiota. Clarke subgradients of stratifiable functions. SIAM J. Optim., 18(2):556–572, 2007.
[19] Jérôme Bolte, Shoham Sabach, and Marc Teboulle. Proximal alternating linearized minimization for nonconvex and nonsmooth problems. Math. Program., 146(1-2):459–494, 2014.
[20] Jérôme Bolte, Shoham Sabach, Marc Teboulle, and Yakov Vaisbourd. First order methods beyond convexity and Lipschitz gradient continuity with applications to quadratic inverse problems. SIAM J. Optim., 28(3):2131–2151, 2018.
[21] Emmanuel J. Candes, Xiaodong Li, and Mahdi Soltanolkotabi. Phase retrieval via Wirtinger flow: Theory and algorithms. IEEE Trans. Inf. Theory, 61(4):1985–2007, 2015.
[22] Gong Chen and Marc Teboulle. Convergence analysis of a proximal-like minimization algorithm using Bregman functions. SIAM J. Optim., 3(3):538–543, 1993.
[23] Yuxin Chen and Emmanuel Candes. Solving random quadratic systems of equations is nearly as easy as solving linear systems. Adv. Neural Inf. Process. Syst., 28:739–747, 2015.
[24] Yat Tin Chow, Tianyu Wu, and Wotao Yin. Cyclic coordinate-update algorithms for fixed-point problems: Analysis and applications. SIAM J. Sci. Comput., 39(4):A1280–A1300, 2017.
[25] Damek Davis, Dmitriy Drusvyatskiy, and Kellie J. MacPhee. Stochastic model-based minimization under high-order growth. ArXiv:1807.00255, 2018.
[26] Damek Davis, Dmitriy Drusvyatskiy, and Courtney Paquette. The nonsmooth landscape of phase retrieval. IMA J. Numer. Anal., 40(4):2652–2695, 2020.
[27] Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Adv. Neural Inf. Process. Syst., pages 1646–1654, 2014.
[28] Aaron Defazio and Justin Domke. Finito: A faster, permutable incremental gradient method for big data problems. In Int. Conf. Mach. Learn., pages 1125–1133, 2014.
[29] John C. Duchi and Feng Ruan. Solving (most) of a set of quadratic equalities: Composite optimization for robust phase retrieval. Inf. Inference J. IMA, 8(3):471–529, 2019.
[30] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. The Elements of Statistical Learning, volume 1. Springer Series in Statistics, New York, 2001.
[31] Tianxiang Gao, Songtao Lu, Jia Liu, and Chris Chu. Randomized Bregman coordinate descent methods for non-Lipschitz optimization. ArXiv:2001.05202, 2020.
[32] Filip Hanzely and Peter Richtárik. Fastest rates for stochastic mirror descent methods. ArXiv:1803.07374, 2018.
[33] Mingyi Hong, Xiangfeng Wang, Meisam Razaviyayn, and Zhi-Quan Luo. Iteration complexity analysis of block coordinate descent methods. Math. Program., 163(1-2):85–114, 2017.
[34] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Adv. Neural Inf. Process. Syst. 26, pages 315–323, 2013.
[35] Chao Kan and Wen Song. The Moreau envelope function and proximal mapping in the sense of the Bregman distance. Nonlinear Anal. Theory Methods Appl., 75(3):1385–1399, 2012.
[36] Krzysztof Kurdyka. On gradients of functions definable in o-minimal structures. Ann. Inst. Fourier, 48(3):769–783, 1998.
[37] Puya Latafat, Andreas Themelis, and Panagiotis Patrinos. Block-coordinate and incremental aggregated proximal gradient methods for nonsmooth nonconvex problems. Math. Program., 2021.
[38] Stanislaw Łojasiewicz. Sur la géométrie semi- et sous-analytique. Ann. Inst. Fourier, 43(5):1575–1595, 1993.
[39] Haihao Lu. "Relative continuity" for non-Lipschitz nonsmooth convex optimization using stochastic (or deterministic) mirror descent. INFORMS J. Optim., 1(4):288–303, 2019.
[40] Haihao Lu, Robert M. Freund, and Yurii Nesterov. Relatively smooth convex optimization by first-order methods, and applications. SIAM J. Optim., 28(1):333–354, 2018.
[41] D. Russell Luke, James V. Burke, and Richard G. Lyon. Optical wavefront reconstruction: Theory and numerical methods. SIAM Rev., 44(2):169–224, 2002.
[42] Julien Mairal. Incremental majorization-minimization optimization with application to large-scale machine learning. SIAM J. Optim., 25(2):829–855, 2015.
[43] Christopher A. Metzler, Manoj K. Sharma, Sudarshan Nagesh, Richard G. Baraniuk, Oliver Cossairt, and Ashok Veeraraghavan. Coherent inverse scattering via transmission matrices: Efficient phase retrieval algorithms and a public dataset. In IEEE International Conference on Computational Photography (ICCP), pages 1–16, 2017.
[44] Aryan Mokhtari, Mert Gürbüzbalaban, and Alejandro Ribeiro. Surpassing gradient descent provably: A cyclic incremental method with linear convergence rate. SIAM J. Optim., 28(2):1420–1447, 2018.
[45] Angelia Nedic and Soomin Lee. On stochastic subgradient mirror-descent algorithm with weighted averaging. SIAM J. Optim., 24(1):84–107, 2014.
[46] Arkadi Nemirovski, Anatoli Juditsky, Guanghui Lan, and Alexander Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM J. Optim., 19(4):1574–1609, 2009.
[47] Yurii Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM J. Optim., 22(2):341–362, 2012.
[48] Lam M. Nguyen, Jie Liu, Katya Scheinberg, and Martin Takáč. SARAH: A novel method for machine learning problems using stochastic recursive gradient. In Proc. 34th Int. Conf. Mach. Learn., volume 70, pages 2613–2621. PMLR, 2017.
[49] Peter Ochs, Jalal Fadili, and Thomas Brox. Non-smooth non-convex Bregman minimization: Unification and new algorithms. J. Optim. Theory Appl., 181(1):244–278, 2019.
[50] Xun Qian, Alibek Sailanbayev, Konstantin Mishchenko, and Peter Richtárik. MISO is making a comeback with better proofs and rates. ArXiv:1906.01474, 2019.
[51] Herbert Robbins and David Siegmund. A convergence theorem for non negative almost supermartingales and some applications. In Herbert Robbins Selected Papers, pages 111–135. Springer, 1985.
[52] R. Tyrrell Rockafellar. Convex Analysis. Princeton University Press, 1970.
[53] R. Tyrrell Rockafellar and Roger J.-B. Wets. Variational Analysis, volume 317. Springer, 2009.
[54] Mark Schmidt, Nicolas Le Roux, and Francis Bach. Minimizing finite sums with the stochastic average gradient. Math. Program., 162(1):83–112, 2017.
[55] Yoav Shechtman, Yonina C. Eldar, Oren Cohen, Henry Nicholas Chapman, Jianwei Miao, and Mordechai Segev. Phase retrieval with application to optical imaging: a contemporary overview. IEEE Signal Process. Mag., 32(3):87–109, 2015.
[56] Mikhail V. Solodov and Benar F. Svaiter. An inexact hybrid generalized proximal point algorithm and some new results on the theory of Bregman functions. Math. Oper. Res., 25(2):214–230, 2000.
[57] Murray Ralph Spiegel. Mathematical Handbook of Formulas and Tables. McGraw-Hill, 1999.
[58] Ju Sun, Qing Qu, and John Wright. A geometric analysis of phase retrieval. Found. Comput. Math., 18(5):1131–1198, 2018.
[59] Marc Teboulle. A simplified view of first order methods for optimization. Math. Program., 170(1):67–96, 2018.
[60] Paul Tseng. Convergence of a block coordinate descent method for nondifferentiable minimization. J. Optim. Theory Appl., 109(3):475–494, 2001.
[61] Paul Tseng. On accelerated proximal gradient methods for convex-concave optimization. SIAM J. Optim., 2(3), 2008.
[62] Paul Tseng and Dimitri P. Bertsekas. Relaxation methods for problems with strictly convex separable costs and linear constraints. Math. Program., 38(3):303–321, 1987.
[63] Paul Tseng and Sangwoon Yun. A coordinate gradient descent method for nonsmooth separable minimization. Math. Program., 117(1):387–423, 2009.
[64] Paul Tseng and Sangwoon Yun. Incrementally updated gradient methods for constrained and regularized optimization. J. Optim. Theory Appl., 160(3):832–853, 2014.
[65] Nuri Denizcan Vanli, Mert Gürbüzbalaban, and Asu Ozdaglar. Global convergence rate of proximal incremental aggregated gradient methods. SIAM J. Optim., 28(2):1282–1300, 2018.
[66] Lin Xiao and Tong Zhang. A proximal stochastic gradient method with progressive variance reduction. SIAM J. Optim., 24(4):2057–2075, 2014.
[67] Yangyang Xu and Wotao Yin. A globally convergent algorithm for nonconvex optimization based on block coordinate update. J. Sci. Comput., 72(2):700–734, 2017.
[68] Peiran Yu, Guoyin Li, and Ting Kei Pong. Deducing Kurdyka-Łojasiewicz exponent via inf-projection. ArXiv:1902.03635, 2019.
[69] Hui Zhang, Yu-Hong Dai, Lei Guo, and Wei Peng. Proximal-like incremental aggregated gradient method with linear convergence under Bregman distance growth conditions. Math. Oper. Res., 2019.
[70] Huishuai Zhang, Yuejie Chi, and Yingbin Liang. Provable non-convex phase retrieval with outliers: Median truncated Wirtinger flow. In Proc. 33rd Int. Conf. Mach. Learn., volume 48 of Proceedings of Machine Learning Research, pages 1022–1031, New York, NY, USA, 2016. PMLR.
[71] Siqi Zhang and Niao He. On the convergence rate of stochastic mirror descent for nonsmooth nonconvex optimization.