Variance Reduction via Primal-Dual Accelerated Dual Averaging for Nonsmooth Convex Finite-Sums
Chaobing Song, Stephen J. Wright, and Jelena Diakonikolas
Department of Computer Sciences, University of Wisconsin–Madison
[email protected], [email protected], [email protected]

* CS acknowledges support from the NSF award 2023239. JD acknowledges support from the Office of the Vice Chancellor for Research and Graduate Education at the University of Wisconsin–Madison with funding from the Wisconsin Alumni Research Foundation. SW acknowledges support from NSF Awards 1740707, 1839338, 1934612, and 2023239; Subcontract 8F-30039 from Argonne National Laboratory; and Award N660011824020 from the DARPA Lagrange Program.
March 1, 2021
Abstract
We study structured nonsmooth convex finite-sum optimization that appears widely in machine learning applications, including support vector machines and least absolute deviation. For the primal-dual formulation of this problem, we propose a novel algorithm called
Variance Reduction via Primal-Dual Accelerated Dual Averaging (vrpda). In the nonsmooth and general convex setting, vrpda has the overall complexity $O(nd\log\min\{1/\epsilon, n\} + d/\epsilon)$ in terms of the primal-dual gap, where $n$ denotes the number of samples, $d$ the dimension of the primal variables, and $\epsilon$ the desired accuracy. In the nonsmooth and strongly convex setting, the overall complexity of vrpda becomes $O(nd\log\min\{1/\epsilon, n\} + d/\sqrt{\epsilon})$ in terms of both the primal-dual gap and the distance between iterate and optimal solution. Both of these results for vrpda improve significantly on the state-of-the-art complexity estimates, which are $O(nd\log\min\{1/\epsilon, n\} + \sqrt{n}d/\epsilon)$ for the nonsmooth and general convex setting and $O(nd\log\min\{1/\epsilon, n\} + \sqrt{n}d/\sqrt{\epsilon})$ for the nonsmooth and strongly convex setting, and they are obtained in a much simpler and more direct way. Moreover, both complexities are better than lower bounds for general convex finite sums that lack the particular (common) structure that we consider. Our theoretical results are supported by numerical experiments, which confirm the competitive performance of vrpda compared to the state-of-the-art.

We consider large-scale regularized nonsmooth convex empirical risk minimization (ERM) of linear predictors in machine learning. Let $b_i \in \mathbb{R}^d$, $i = 1, 2, \ldots, n$, be sample vectors with $n$ typically large; $g_i : \mathbb{R} \to \mathbb{R}$, $i = 1, 2, \ldots, n$, be possibly nonsmooth convex loss functions associated with the linear predictor $\langle b_i, x\rangle$; and $\ell : \mathbb{R}^d \to \bar{\mathbb{R}}$ be an extended-real-valued, $\sigma$-strongly convex ($\sigma \ge 0$)
and possibly nonsmooth regularizer that admits an efficiently computable proximal operator. The problem we study is
$$\min_{x\in\mathbb{R}^d}\ f(x) := g(x) + \ell(x) := \frac{1}{n}\sum_{i=1}^n g_i(b_i^T x) + \ell(x), \tag{P}$$
where $g(x) := \frac{1}{n}\sum_{i=1}^n g_i(b_i^T x)$. Instances of the nonsmooth ERM problem (P) include $\ell_1$-norm and $\ell_2$-norm regularized support vector machines (SVM) and least absolute deviation. For practicality of our approach, we require in addition that the convex conjugates of the functions $g_i$, defined by $g_i^*(y_i) := \max_{z_i}(z_i y_i - g_i(z_i))$, admit efficiently computable proximal operators. (Such is true of the examples mentioned above.) From the statistical perspective, nonsmoothness in the loss function is essential for obtaining a model that is both tractable and robust. But from the optimization viewpoint, nonsmooth optimization problems are intrinsically more difficult to solve. On one hand, the lack of smoothness in $g$ precludes the use of black-box first-order information to obtain efficient methods. On the other hand, the use of structured composite optimization methods that rely on the proximal operator of $g$ is out of the question here too, because the proximal operator of the sum $\frac{1}{n}\sum_{i=1}^n g_i(b_i^T x)$ may not be efficiently computable w.r.t. $x$, even when the proximal operators of the individual functions $g_i(\cdot)$ are.

Driven by applications in machine learning, computational statistics, signal processing, and operations research, the nonsmooth problem (P) and its variants have been studied for more than two decades. There have been two main lines of work: deterministic algorithms that exploit the underlying simple primal-dual structure to improve efficiency (i.e., dependence on the accuracy parameter $\epsilon$) and randomized algorithms that exploit the finite-sum structure to improve scalability (i.e., dependence on the number of samples $n$).

Exploiting the primal-dual structure.
A naïve approach for solving (P) would be subgradient descent, which requires access to subgradients of $g(x)$ and $\ell(x)$. To find a solution $x$ with $f(x) - f(x^*) \le \epsilon$, where $x^*$ is an optimal solution of (P) and $\epsilon > 0$ is the desired accuracy, subgradient descent requires $O(1/\epsilon^2)$ iterations in the nonsmooth convex setting. This complexity is high, but it is also the best possible if we are only allowed to access "black-box" information of function value and subgradient. To obtain improved complexity bounds, we must consider approaches that exploit structure in (P). To begin, we note that (P) admits an explicit and simple primal-dual reformulation:
$$\min_{x\in\mathbb{R}^d}\ \big\{ f(x) = \max_{y\in\mathbb{R}^n} L(x,y)\big\}, \qquad L(x,y) := \langle Bx, y\rangle - g^*(y) + \ell(x), \tag{PD}$$
where $B = \frac{1}{n}[b_1, b_2, \ldots, b_n]^T$, $g^*(y) = \frac{1}{n}\sum_{i=1}^n g_i^*(y_i)$, and with the convex conjugate functions $g_i^*(\cdot)$ satisfying $g_i(b_i^T x) = \max_{y_i}\{y_i\langle b_i, x\rangle - g_i^*(y_i)\}$. The nonsmooth loss $g(x)$ in (P) is thereby decoupled into a bilinear term $\langle Bx, y\rangle$ and a separable function $g^*(y)$ that admits an efficiently computable proximal operator. Due to the possible nonsmoothness of $g(x)$, we can assume only that $L(x,y)$ is concave w.r.t. $y$—but not strongly concave. Therefore, Problem (PD) is $\sigma$-strongly convex-(general) concave ($\sigma \ge 0$). Nesterov (2005b) showed that, by smoothing the nonsmooth term and applying accelerated gradient methods to the smoothed problem, the complexity can be improved from $O(1/\epsilon^2)$ to $O(1/\epsilon)$. Later, Nemirovski and Nesterov, respectively, showed that extragradient-type methods such as mirror-prox (Nemirovski, 2004) and dual extrapolation (Nesterov, 2007) can obtain the same $O(1/\epsilon)$ complexity bound for (PD) directly, without the use of smoothing or Nesterov acceleration. Extragradient-type methods need to perform updates twice per iteration, for both primal and dual variables. Chambolle and Pock (2011) introduced an (extrapolated) primal-dual hybrid gradient (pdhg) method that obtains the same $O(1/\epsilon)$ complexity with an extrapolation step on either the primal or dual variable rather than an extragradient step. Thus, pdhg needs to update primal and dual variables just once per iteration. All three kinds of methods have been extensively studied from different perspectives (Nesterov, 2005a; Chen et al., 2017; Tran-Dinh et al., 2018; Song et al., 2020b; Diakonikolas et al., 2020). In the case of large $n$, the focus has been on randomized variants with low per-iteration cost (Zhang and Lin, 2015; Alacaoglu et al., 2017; Tan et al., 2018; Chambolle et al., 2018; Carmon et al., 2019; Lei et al., 2019; Devraj and Chen, 2019; Alacaoglu et al., 2020).
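The conjugate representation behind (PD) can be checked numerically on a simple example. The sketch below (ours, not from the paper) uses the hinge loss $g_i(z) = \max\{0, 1-z\}$, whose conjugate is $g_i^*(y) = y$ on $[-1, 0]$ and $+\infty$ elsewhere; the grid maximization is only a verification device, not part of any algorithm.

```python
import numpy as np

def hinge(z):
    # g_i(z) = max{0, 1 - z}: the hinge loss evaluated at the margin z = <b_i, x>
    return np.maximum(0.0, 1.0 - z)

def hinge_via_conjugate(z, grid_size=100001):
    # g_i(z) = max_{y in [-1, 0]} { y*z - g_i^*(y) } with g_i^*(y) = y on [-1, 0];
    # the maximization is done on a dense grid purely as a numerical check.
    y = np.linspace(-1.0, 0.0, grid_size)
    return np.max(y * z - y)

for z in [-2.0, 0.0, 0.5, 1.0, 3.0]:
    assert abs(hinge(z) - hinge_via_conjugate(z)) < 1e-4
print("conjugate representation matches the hinge loss")
```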
Exploiting the finite-sum structure.

The deterministic methods discussed above have per-iteration cost $O(nd)$, which can be prohibitively high when $n$ is large. There has been much work on randomized methods whose per-iteration cost is independent of $n$. To be efficient, the iteration count of such methods cannot increase too much over the deterministic methods. A major development in the past decade of research has been the use of variance reduction in randomized optimization algorithms, which reduces the per-iteration cost and improves the overall complexity. For the variant of Problem (P) in which $g(x)$ is smooth, there exists a vast literature on developing efficient finite-sum solvers; see for example Roux et al. (2012), Johnson and Zhang (2013), Lin et al. (2014), Zhang and Lin (2015), Allen-Zhu (2017), Zhou et al. (2018), Lan et al. (2019), Song et al. (2020a). In particular, the recent work of Song et al. (2020a) proposed an algorithm called Variance Reduction via Accelerated Dual Averaging (vrada) that matches all three lower bounds from Woodworth and Srebro (2016), Hannah et al. (2018) for the smooth and (general/ill-conditioned strongly/well-conditioned strongly) convex settings, using a simple, unified algorithm description and convergence analysis. As discussed in Song et al. (2020a), the efficiency, simplicity, and unification of vrada are due to randomizing accelerated dual averaging rather than accelerated mirror descent (as was done in Allen-Zhu (2017)) and adopting a novel initialization strategy. The results of Song et al. (2020a) provide the main motivation for this work.

When the loss function is nonsmooth, the classical variance reduction tricks such as svrg and saga (Johnson and Zhang, 2013; Defazio et al., 2014) are no longer available, as their effectiveness strongly depends on smoothness. A compromise proposed in Allen-Zhu and Hazan (2016), Allen-Zhu (2017) is to smoothen and regularize (P) and then apply existing finite-sum solvers, such as Katyusha. As shown by Allen-Zhu (2017), in the nonsmooth and general convex setting, the resulting overall complexity is improved from $O(nd/\epsilon^2)$ to $O(nd\log\min\{1/\epsilon, n\} + \sqrt{n}d/\epsilon)$; in the nonsmooth and strongly convex setting, it is improved from $O(nd/\sqrt{\epsilon})$ to $O(nd\log\min\{1/\epsilon, n\} + \sqrt{n}d/\sqrt{\epsilon})$. Both of these improved complexity results match the lower bounds of Woodworth and Srebro (2016) for general nonsmooth finite sums when $\epsilon$ is small. However, the smoothing and regularization require tuning of additional parameters, which complicates the algorithm implementation. Meanwhile, it is not clear whether the complexity can be further improved to take advantage of the additional ERM structure that is present in (P).

Table 1: Overall complexity and per-iteration cost for solving (FS-PD) in the $\sigma$-strongly convex-general concave setting ($\sigma \ge 0$).
| Algorithm | $\sigma = 0$, primal-dual gap | $\sigma > 0$, primal-dual gap | $\sigma > 0$, distance to solution | Per-iteration cost |
| --- | --- | --- | --- | --- |
| rpd (Dang and Lan, 2014) | $O(n^{3/2}d/\epsilon)$ | $O(n^{3/2}d/\sqrt{\epsilon})$ | — | $O(d)$ |
| smart-cd (Alacaoglu et al., 2017) | $O(nd/\epsilon)$ | — | — | $O(d)$ |
| Carmon et al. (2019) | $O(nd + \sqrt{nd(n+d)}\log(nd)/\epsilon)$ | — | — | $O(n+d)$ |
| spdhg (Chambolle et al., 2018) | $O(nd/\epsilon)$ | — | $O(nd/(\sigma\sqrt{\epsilon}))$ † | $O(d)$ |
| pure-cd (Alacaoglu et al., 2020) | $O(n^2 d/\epsilon)$ | — | — | $O(d)$ |
| vrpda (this paper) | $O(nd\log\min\{1/\epsilon, n\} + d/\epsilon)$ | $O(nd\log\min\{1/\epsilon, n\} + d/\sqrt{\sigma\epsilon})$ | $O(nd\log\min\{1/\epsilon, n\} + d/(\sigma\sqrt{\epsilon}))$ | $O(d)$ |

"Overall complexity" is the number of iterations times the per-iteration cost. † Only applicable when $\epsilon$ is small enough (see Chambolle et al. (2018, Theorem 5.1)).
For the nonsmooth ERM problem (P) considered here and its primal-dual formulation, the literature is much scarcer (Dang and Lan, 2014; Alacaoglu et al., 2017; Chambolle et al., 2018; Carmon et al., 2019; Latafat et al., 2019; Fercoq and Bianchi, 2019; Alacaoglu et al., 2020). All of the existing methods target (PD) directly and focus on extending the aforementioned deterministic algorithms to this case. As sampling one element of the finite sum from (P) is reduced to sampling one dual coordinate in (PD), all of these methods can be viewed as coordinate variants of the deterministic counterparts. For convenience, we explicitly rewrite (PD) in the following finite-sum primal-dual form:
$$\min_{x\in\mathbb{R}^d}\max_{y\in\mathbb{R}^n} L(x,y), \qquad L(x,y) = \frac{1}{n}\sum_{i=1}^n \big(y_i\langle b_i, x\rangle - g_i^*(y_i)\big) + \ell(x). \tag{FS-PD}$$

Existing approaches.

Table 1 compares vrpda to existing randomized algorithms for solving (FS-PD) in terms of the overall complexity and per-iteration cost, under the setting of uniform sampling and a general data matrix (i.e., the data matrix is not necessarily sparse). The competing algorithms (rpd, smart-cd, spdhg, and pure-cd) attain $O(d)$ per-iteration cost, but have overall complexity no better than that of the deterministic algorithms in both the nonsmooth and general/strongly convex settings. Meanwhile, the algorithms of Carmon et al. (2019) attain $O(n+d)$ per-iteration cost with improved dependence on the dimension $d$ when $n \ge d$. However, the overall dependence on the dominant term $n$ is still not improved, which raises the question of whether it is even possible to simultaneously achieve the low $O(d)$ per-iteration cost and reduce the overall complexity compared to the deterministic algorithms. Addressing this question is the main contribution of our work.

Our contributions.
We propose the vrpda algorithm for (FS-PD) in the $\sigma$-strongly convex-general concave setting ($\sigma \ge 0$), which corresponds to the nonsmooth and $\sigma$-strongly convex setting of (P) ($\sigma \ge 0$). In both the $\sigma = 0$ and $\sigma > 0$ cases, vrpda has $O(d)$ per-iteration cost and significantly improves the best-known overall complexity results in a unified and simplified way. As shown in Table 1, to find an $\epsilon$-accurate solution in terms of the primal-dual gap, the overall complexity of vrpda is
$$\begin{cases} O\big(nd\log\big(\min\{1/\epsilon, n\}\big) + d/\epsilon\big), & \text{if } \sigma = 0,\\[2pt] O\big(nd\log\big(\min\{1/\epsilon, n\}\big) + d/\sqrt{\sigma\epsilon}\big), & \text{if } \sigma > 0, \end{cases}$$
which is significantly better than any of the existing results for (FS-PD). In particular, we only need $O(nd\log n)$ overall cost to attain an $\epsilon$-accurate solution with $\epsilon = \Omega(1/(n\log n))$. Meanwhile, when $\epsilon$ is sufficiently small compared to $1/n$, so that the second term in the bound becomes dominant, the overall complexity ($O(d/\epsilon)$ for $\sigma = 0$ and $O(d/\sqrt{\sigma\epsilon})$ for $\sigma > 0$)
is independent of $n$, thus showing a $\Theta(n)$ improvement compared to the deterministic algorithms. To the best of our knowledge, even for smooth $g_i$'s, the improvement of existing algorithms is at most $\Theta(\sqrt{n})$ and is attained by accelerated variance reduction methods such as Katyusha (Allen-Zhu, 2017) and vrada (Song et al., 2020a).
Comparison to lower bounds.
What makes our results particularly surprising is that they seem to contradict the iteration complexity lower bounds for composite objectives, which are $\Omega(n + \sqrt{n}/\epsilon)$ for nonsmooth and general convex objectives and $\Omega(n + \sqrt{n/(\sigma\epsilon)})$ for nonsmooth and $\sigma$-strongly convex objectives (Woodworth and Srebro, 2016). This apparent contradiction is resolved upon a closer inspection of the lower-bound instance in Woodworth and Srebro (2016). This lower-bound instance has the form $f(x) = \frac{1}{n}\sum_{i=1}^n f_i(x)$, where the component functions $f_i$ are accessed using a black-box oracle (that returns function value, (sub)gradient, and prox-oracle). By contrast, we assume that each component function is a composition of a scalar convex function and a linear transformation (that is, $g_i(b_i^T x)$), which admits the use of a primal-dual formulation. Our setting also differs from the setting of Woodworth and Srebro (2016) in terms of the assumed oracle access. While Woodworth and Srebro (2016) restricts the prox-oracle access to apply to component functions $f_i$ ($g_i$, in our setting), our algorithm relies on the prox-oracles for the convex conjugates $g_i^*$. Thus, the upper bounds of vrpda are not in conflict with the lower bounds in Woodworth and Srebro (2016).

Remarkably, our upper bounds show that the use of the finite-sum primal-dual formulation (FS-PD) can lead not only to improvements in efficiency (dependence on $\epsilon$), as in Nesterov (2005b), but also in scalability (dependence on $n$). As the ERM problem (P) is one of the main motivations for convex finite-sum solvers, it would be interesting to characterize the complexity of the problem class (P) from the aspect of oracle lower bounds and determine whether vrpda attains optimal oracle complexity. (We conjecture that it does, at least for small values of $\epsilon$.) Since the primary focus of the current paper is on algorithms, we leave the study of lower bounds for future research.

Our techniques.
Our vrpda algorithm is founded on a new deterministic algorithm, Primal-Dual Accelerated Dual Averaging (pda), for (PD). Similar to pdhg (Chambolle and Pock, 2011), pda is a primal-dual method with extrapolation on the primal or dual variable. However, unlike pdhg, which is based on mirror-descent-type updates (a.k.a. agile updates (Allen-Zhu and Orecchia, 2017)), pda performs updates of dual-averaging style (Nesterov, 2015) (a.k.a. lazy mirror-descent updates (Hazan et al., 2016)). Our analysis is based on the classical estimate sequence technique, but with a novel design of the estimate sequences that requires careful coupling of the primal and dual portions of the gap; see Section 3 for a further discussion. The resulting argument allows us to use a unified parameter setting and convergence analysis for pda in all the (general/strongly) convex-(general/strongly) concave settings. Thus, by building on pda rather than pdhg, the design and analysis of vrpda is unified over the different settings and also significantly simplified. Moreover, the dual averaging framework allows us to use a novel initialization strategy inspired by the vrada algorithm (Song et al., 2020a), which is key to cancelling the randomized error of order $n$ in the main loop and obtaining our improved results from Table 1. It is worth noting that although pda can be used in all the (general/strongly) convex-(general/strongly) concave settings, vrpda is only applicable to the (general/strongly) convex-general concave settings, which correspond to the nonsmooth and (general/strongly) convex settings of (P). Study of vrpda in the (general/strongly) convex-strongly concave settings is deferred to future research.

Throughout the paper, we use $\|\cdot\|$ to denote the Euclidean norm. In the case of matrices $B$, $\|B\|$ is the standard operator norm defined by $\|B\| := \max_{x\in\mathbb{R}^d,\ \|x\|\le 1}\|Bx\|$. In the following, we provide standard definitions and properties that will be used in our analysis. We start by stating the definition of strongly convex functions that captures both strong and general convexity, allowing us to treat both cases in a unified manner for significant portions of the analysis. We use $\bar{\mathbb{R}} = \mathbb{R}\cup\{+\infty\}$ to denote the extended real line.

Definition 1.
Given $\sigma \ge 0$, we say that a function $f : \mathbb{R}^d \to \bar{\mathbb{R}}$ is $\sigma$-strongly convex if, for all $x, \hat{x} \in \mathbb{R}^d$ and all $\alpha \in (0,1)$,
$$f((1-\alpha)x + \alpha\hat{x}) \le (1-\alpha)f(x) + \alpha f(\hat{x}) - \frac{\sigma}{2}\alpha(1-\alpha)\|\hat{x} - x\|^2.$$
When $\sigma = 0$, we say that $f$ is (general) convex.

When $f$ is subdifferentiable at $x$ and $g_f(x) \in \partial f(x)$ is any subgradient of $f$ at $x$, where $\partial f(x)$ denotes the subdifferential set (the set of all subgradients) of $f$ at $x$, then strong convexity implies that for all $\hat{x} \in \mathbb{R}^d$ we have
$$f(\hat{x}) \ge f(x) + \langle g_f(x), \hat{x} - x\rangle + \frac{\sigma}{2}\|\hat{x} - x\|^2.$$

Since we work with general nonsmooth convex functions $f$, we require that their proximal operators, defined as solutions to problems of the form $\min_{x\in\mathbb{R}^d}\big\{f(x) + \frac{\tau}{2}\|x - \hat{x}\|^2\big\}$, are efficiently solvable for any $\tau > 0$ and $\hat{x} \in \mathbb{R}^d$.

Problem definition.
As discussed in the introduction, our focus is on Problem (PD) under the following assumption.
Assumption 1. $g^*(y)$ is proper, l.s.c., and $\gamma$-strongly convex ($\gamma \ge 0$); $\ell(x)$ is proper, l.s.c., and $\sigma$-strongly convex ($\sigma \ge 0$); the proximal operators of $g^*$ and $\ell$ can be computed efficiently; and $\|B\| = R$ for some $R \in (0, \infty)$.

Observe that since $g^*$ and $\ell$ are only assumed to be proper, l.s.c., and (strongly) convex, they can contain indicators of closed convex sets in their description. Thus, the problem class that satisfies Assumption 1 contains constrained optimization. We use $\mathcal{X}$ and $\mathcal{Y}$ to denote the domains of $\ell$ and $g^*$, respectively, defined by $\mathcal{X} = \mathrm{dom}(\ell) = \{x : \ell(x) < \infty\}$ and $\mathcal{Y} = \mathrm{dom}(g^*) = \{y : g^*(y) < \infty\}$. When $\mathcal{X}, \mathcal{Y}$ are bounded, we use $D_{\mathcal{X}}, D_{\mathcal{Y}}$ to denote their diameters, $D_{\mathcal{X}} = \max_{x,u\in\mathcal{X}}\|x - u\|$, $D_{\mathcal{Y}} = \max_{y,v\in\mathcal{Y}}\|y - v\|$.

Note that Assumption 1 does not enforce a finite-sum structure of $g^*$ (and $g$). Thus, for the results that utilize variance reduction, we will make a further assumption.

Assumption 2. $g^*(y) = \frac{1}{n}\sum_{i=1}^n g_i^*(y_i)$, where each $g_i^*(y_i)$ is convex and has an efficiently computable proximal operator. Further, $\|b_i\| \le R'$ for all $i \in \{1, \ldots, n\}$.

Recall that $B = \frac{1}{n}[b_1, b_2, \ldots, b_n]^T$. Observe that $R = \|B\| \le \frac{1}{n}\big(\sum_{i=1}^n\|b_i\|^2\big)^{1/2} \le \frac{1}{n}\sum_{i=1}^n\|b_i\| \le R'$. Observe further that, under Assumption 2, $g^*(y)$ is separable over its coordinates. As a consequence, the domain $\mathcal{Y}$ of $g^*$ can be expressed as the Cartesian product of $\mathrm{dom}(g_i^*)$. This structure is crucial for variance reduction, as the algorithm in this case relies on performing coordinate descent updates over the dual variables $y$.
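As an illustration of the prox-oracles that Assumptions 1 and 2 ask for, the following sketch (ours; the elastic-net SVM instance from the experiments is used as an assumed example) gives the standard closed-form proximal maps of the hinge-loss conjugate $g_i^*(y) = y$ on $[-1,0]$ and of the elastic-net regularizer $\ell(x) = \lambda\|x\|_1 + \frac{\sigma}{2}\|x\|^2$.

```python
import numpy as np

def prox_hinge_conjugate(w, tau):
    # prox of g_i^*(y) = y + indicator([-1, 0]) (conjugate of the hinge loss):
    # argmin_y { tau*y + 0.5*(y - w)^2 : -1 <= y <= 0 } = clip(w - tau, -1, 0)
    return np.clip(w - tau, -1.0, 0.0)

def prox_elastic_net(v, tau, lam, sigma):
    # prox of l(x) = lam*||x||_1 + (sigma/2)*||x||^2:
    # soft-threshold by tau*lam, then shrink by 1/(1 + tau*sigma)
    return np.sign(v) * np.maximum(np.abs(v) - tau * lam, 0.0) / (1.0 + tau * sigma)

print(prox_hinge_conjugate(np.array([0.7, -0.3, -2.0]), 0.5))
print(prox_elastic_net(np.array([1.5, -0.2, 0.05]), 1.0, 0.1, 0.01))
```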
Primal-dual gap.

Given $x \in \mathbb{R}^d$, the primal value of problem (PD) is $P(x) = \max_{v\in\mathbb{R}^n} L(x, v)$. Similarly, the dual value of (PD) is defined by $D(y) = \min_{u\in\mathbb{R}^d} L(u, y)$. Given a primal-dual pair $(x, y) \in \mathbb{R}^d\times\mathbb{R}^n$, the primal-dual gap is then defined by
$$\mathrm{Gap}(x,y) = P(x) - D(y) = \max_{(u,v)\in\mathbb{R}^d\times\mathbb{R}^n} \mathrm{Gap}_{u,v}(x,y), \quad \text{where we define } \mathrm{Gap}_{u,v}(x,y) = L(x,v) - L(u,y). \tag{1}$$
Observe that, by the definition of $P(x)$ and $D(y)$, the maximum of $\mathrm{Gap}_{u,v}(x,y)$ for fixed $(x,y)$ is attained when $(u,v)\in\mathcal{X}\times\mathcal{Y}$, so we can also write $\mathrm{Gap}(x,y) = \max_{(u,v)\in\mathcal{X}\times\mathcal{Y}}\mathrm{Gap}_{u,v}(x,y)$.

For our analysis, it is useful to work with the relaxed gap $\mathrm{Gap}_{u,v}(x,y)$. In particular, to bound the primal-dual gap $\mathrm{Gap}(x,y)$ for a candidate solution pair $(x,y)$ constructed by the algorithm, we first bound $\mathrm{Gap}_{u,v}(x,y)$ for arbitrary $(u,v)\in\mathcal{X}\times\mathcal{Y}$. The bound on $\mathrm{Gap}(x,y)$ then follows by taking the supremum of $\mathrm{Gap}_{u,v}(x,y)$ over $(u,v)\in\mathcal{X}\times\mathcal{Y}$. In general, $\mathrm{Gap}(x,y)$ can be bounded by a finite quantity only when $\mathcal{X}, \mathcal{Y}$ are compact (Nesterov, 2005b; Ouyang and Xu, 2019). If either of $\mathcal{X}, \mathcal{Y}$ is unbounded, to provide meaningful results and similar to Chambolle and Pock (2011), we assume that an optimal primal-dual pair $(x^*, y^*)$ for which $\mathrm{Gap}(x^*, y^*) = 0$ exists, and bound the primal-dual gap in a ball around $(x^*, y^*)$.

Auxiliary results.
Additional auxiliary results on the growth of sequences, needed when establishing the convergence rates in our results, are provided in Appendix C.
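Before moving to the algorithms, the next sketch (ours, not from the paper) shows how the primal and dual values defined above can be evaluated in closed form for an assumed instance: the hinge loss with labels folded into $b_i$ and the pure ridge regularizer $\ell(x) = \frac{\sigma}{2}\|x\|^2$, so that $P(x) - D(y)$ can be monitored for any pair with $y \in \mathcal{Y} = [-1,0]^n$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, sigma = 200, 20, 0.1
B = rng.standard_normal((n, d)) / n          # B = (1/n) [b_1, ..., b_n]^T
b = n * B                                    # rows are the individual b_i

def primal(x):
    # P(x) = (1/n) sum_i max(0, 1 - <b_i, x>) + (sigma/2)||x||^2
    return np.mean(np.maximum(0.0, 1.0 - b @ x)) + 0.5 * sigma * np.dot(x, x)

def dual(y):
    # D(y) = min_x L(x, y) = -(1/n) sum_i y_i - ||B^T y||^2 / (2 sigma),
    # valid for y in [-1, 0]^n and the ridge regularizer l(x) = (sigma/2)||x||^2
    By = B.T @ y
    return -np.mean(y) - np.dot(By, By) / (2.0 * sigma)

x = rng.standard_normal(d)
y = np.clip(rng.standard_normal(n), -1.0, 0.0)
print("primal-dual gap:", primal(x) - dual(y))   # nonnegative for feasible y
```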
In this section, we provide the pda algorithm for solving Problem (PD) under Assumption 1. The results in this section provide the basis for our results in Section 4 for the finite-sum primal-dual setting.

pda is described in Algorithm 1. Observe that the points $u, v$ in the definitions of the estimate sequences $\phi_k(x), \psi_k(y)$ do not play a role in the definitions of $x_k, y_k$, as the corresponding argmins are independent of $u$ and $v$. They appear in the definitions of $\phi_k(x), \psi_k(y)$ only for the convenience of the convergence analysis; the algorithm itself can be stated without them.

Algorithm 1 Primal-Dual Accelerated Dual Averaging (pda)
1: Input: $(x_0, y_0)\in\mathcal{X}\times\mathcal{Y}$, $(u, v)\in\mathcal{X}\times\mathcal{Y}$, $\sigma \ge 0$, $\gamma \ge 0$, $\|B\| = R > 0$, $K$.
2: $a_0 = A_0 = 0$.
3: $x_0 = x_{-1}\in\mathbb{R}^d$, $y_0\in\mathbb{R}^n$.
4: $\phi_0(\cdot) = \frac{1}{2}\|\cdot - x_0\|^2$, $\psi_0(\cdot) = \frac{1}{2}\|\cdot - y_0\|^2$.
5: for $k = 1, 2, \ldots, K$ do
6:   $a_k = \frac{\sqrt{(1+\sigma A_{k-1})(1+\gamma A_{k-1})}}{\sqrt{2}R}$, $A_k = A_{k-1} + a_k$.
7:   $\bar{x}_{k-1} = x_{k-1} + \frac{a_{k-1}}{a_k}(x_{k-1} - x_{k-2})$.
8:   $y_k = \arg\min_{y\in\mathbb{R}^n}\{\psi_k(y) = \psi_{k-1}(y) + a_k(\langle -B\bar{x}_{k-1}, y - v\rangle + g^*(y))\}$.
9:   $x_k = \arg\min_{x\in\mathbb{R}^d}\{\phi_k(x) = \phi_{k-1}(x) + a_k(\langle x - u, B^T y_k\rangle + \ell(x))\}$.
10: end for
11: return $\tilde{y}_K = \frac{1}{A_K}\sum_{k=1}^K a_k y_k$, $\tilde{x}_K = \frac{1}{A_K}\sum_{k=1}^K a_k x_k$.

We now outline the main technical ideas in the pda algorithm. To bound the relaxed notion of the primal-dual gap $\mathrm{Gap}_{u,v}(\tilde{x}_k, \tilde{y}_k)$ discussed in Section 2, we use the estimate sequences $\phi_k(x)$ and $\psi_k(y)$ defined in the algorithm. Unlike the classical estimate sequences used, for example, in Nesterov (2005b), these estimate sequences do not directly estimate the values of the primal and dual, but instead contain additional bilinear terms, which are crucial for forming an intricate coupling argument between the primal and the dual that leads to the desired convergence bounds. In particular, the bilinear term in the definition of $\psi_k$ is defined w.r.t. an extrapolated point $\bar{x}_{k-1}$. This extrapolated point is not guaranteed to lie in the domain of $\ell$, but because this point appears only in bilinear terms, we never need to evaluate either $\ell$ or its subgradient at $\bar{x}_{k-1}$. Instead, the extrapolated point plays a role in cancelling error terms that appear when relating the estimate sequences to $\mathrm{Gap}_{u,v}(\tilde{x}_k, \tilde{y}_k)$.
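To illustrate the dual-averaging structure of Algorithm 1, the following sketch (ours, not the paper's implementation) uses the fact that $\psi_k$ and $\phi_k$ accumulate only linear terms plus $A_k$-weighted copies of $g^*$ and $\ell$, so each argmin in Steps 8–9 is a single proximal step applied to the initial point shifted by the accumulated weighted linear terms. The prox callables and the hinge/ridge instance at the bottom are assumptions for illustration only.

```python
import numpy as np

def pda(B, prox_gstar, prox_ell, x0, y0, sigma, gamma, K):
    # prox_gstar(w, tau) should return argmin_y tau*g^*(y) + 0.5*||y - w||^2 (same for prox_ell).
    R = np.linalg.norm(B, 2)                     # operator norm ||B||
    a_prev, A = 0.0, 0.0
    x_prev, x, y = x0.copy(), x0.copy(), y0.copy()
    s_dual = np.zeros_like(y0)                   # sum_i a_i * B @ xbar_{i-1}
    s_primal = np.zeros_like(x0)                 # sum_i a_i * B.T @ y_i
    x_avg, y_avg = np.zeros_like(x0), np.zeros_like(y0)
    for _ in range(K):
        a = np.sqrt((1 + sigma * A) * (1 + gamma * A)) / (np.sqrt(2) * R)
        A_next = A + a
        xbar = x + (a_prev / a) * (x - x_prev)   # extrapolated primal point
        s_dual += a * (B @ xbar)
        y = prox_gstar(y0 + s_dual, A_next)      # y_k = argmin of psi_k
        s_primal += a * (B.T @ y)
        x_prev, x = x, prox_ell(x0 - s_primal, A_next)   # x_k = argmin of phi_k
        x_avg, y_avg = x_avg + a * x, y_avg + a * y
        a_prev, A = a, A_next
    return x_avg / A, y_avg / A                  # weighted averages (x~_K, y~_K)

# Assumed example: hinge conjugate g^*(y) = (1/n) sum_i y_i on [-1,0]^n, ridge l(x) = (sigma/2)||x||^2.
n, d, sigma = 50, 10, 0.1
rng = np.random.default_rng(0)
B = rng.standard_normal((n, d)) / n
x_bar, y_bar = pda(
    B,
    prox_gstar=lambda w, tau: np.clip(w - tau / n, -1.0, 0.0),
    prox_ell=lambda w, tau: w / (1.0 + tau * sigma),
    x0=np.zeros(d), y0=np.zeros(n), sigma=sigma, gamma=0.0, K=200,
)
```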
Our main technical result for this section, concerning the convergence of pda, is summarized in the following theorem. The proof of this result and supporting technical results are provided in Appendix A.

Theorem 1. Under Assumption 1, for Algorithm 1 we have, for all $(u,v)\in\mathcal{X}\times\mathcal{Y}$ and $k \ge 1$,
$$\mathrm{Gap}_{u,v}(\tilde{x}_k, \tilde{y}_k) \le \frac{\|u - x_0\|^2 + \|v - y_0\|^2}{2A_k}.$$
Further, if $(x^*, y^*)$ is a primal-dual solution to (PD), then
$$(1+\sigma A_k)\|x_k - x^*\|^2 + \frac{1+\gamma A_k}{2}\|y_k - y^*\|^2 \le \|x_0 - x^*\|^2 + \|y_0 - y^*\|^2. \tag{2}$$
In both cases, the growth of $A_k$ can be bounded below as
$$A_k \ge \frac{1}{\sqrt{2}R}\max\Big\{k,\ \Big(1+\tfrac{\sqrt{\sigma\gamma}}{\sqrt{2}R}\Big)^{k-1},\ \tfrac{\sigma}{4\sqrt{2}R}\big([k-k_0]_+ + \max\{\tfrac{\sqrt{2}R}{\sigma}, 1\}\big)^2,\ \tfrac{\gamma}{4\sqrt{2}R}\big([k-k_0']_+ + \max\{\tfrac{\sqrt{2}R}{\gamma}, 1\}\big)^2\Big\},$$
where $[\cdot]_+ = \max\{\cdot, 0\}$, $k_0 = \lceil\sqrt{2}R/\sigma\rceil$, and $k_0' = \lceil\sqrt{2}R/\gamma\rceil$.

Remark 1. As $\sigma \ge 0$ and $\gamma \ge 0$, Theorem 1 guarantees that all iterates of pda remain within a bounded set, due to Eq. (2). In particular, $x_k \in B(x^*, r)$ and $y_k \in B(y^*, \sqrt{2}\,r)$, where $r = \sqrt{\|x_0 - x^*\|^2 + \|y_0 - y^*\|^2}$ and $B(z, r)$ denotes the Euclidean ball of radius $r$ centered at $z$. Moreover, by rearranging Eq. (2), we can conclude that $\|x^* - x_k\|^2 \le \frac{r^2}{\sigma A_k}$ and $\|y^* - y_k\|^2 \le \frac{2r^2}{\gamma A_k}$.

Remark 2.
Observe that when the domains of $g^*$ and $\ell$ are bounded (i.e., when $D_{\mathcal{X}} < \infty$, $D_{\mathcal{Y}} < \infty$, and, in particular, in the setting of constrained optimization over compact sets), Theorem 1 implies the following bound on the primal-dual gap: $\mathrm{Gap}(\tilde{x}_k, \tilde{y}_k) \le \frac{D_{\mathcal{X}}^2 + D_{\mathcal{Y}}^2}{2A_k}$. This bound can be shown to be optimal, using results from Ouyang and Xu (2019). For unbounded domains of $g^*$ and $\ell$, it is generally not possible to have any finite bound on $\mathrm{Gap}(x,y)$ unless $(x,y) = (x^*, y^*)$ (for a concrete example, see, e.g., Diakonikolas (2020)). In such a case, it is common to restrict $u, v$ to bounded sets that include $x^*, y^*$, such as $B(x^*, r)$, $B(y^*, \sqrt{2}\,r)$ from Remark 1 (Chambolle and Pock, 2011).

Algorithm 2 Variance Reduction via Primal-Dual Accelerated Dual Averaging (vrpda)
1: Input: $(x_0, y_0)\in\mathcal{X}\times\mathcal{Y}$, $(u, v)\in\mathcal{X}\times\mathcal{Y}$, $\sigma \ge 0$, $R' > 0$, $K$, $n$.
2: $\phi_0(\cdot) = \frac{1}{2}\|\cdot - x_0\|^2$, $\psi_0(\cdot) = \frac{1}{2}\|\cdot - y_0\|^2$.
3: $a_0 = A_0 = 0$, $\tilde{a}_1 = \frac{1}{2R'}$.
4: $y_1 = \arg\min_{y\in\mathbb{R}^n}\{\tilde{\psi}_1(y) := \psi_0(y) + \tilde{a}_1(\langle -Bx_0, y - v\rangle + g^*(y))\}$.
5: $z_1 = B^T y_1$.
6: $x_1 = \arg\min_{x\in\mathbb{R}^d}\{\tilde{\phi}_1(x) := \phi_0(x) + \tilde{a}_1(\langle x - u, z_1\rangle + \ell(x))\}$.
7: $\psi_1 := n\tilde{\psi}_1$, $\phi_1 := n\tilde{\phi}_1$, $a_1 = A_1 = n\tilde{a}_1$, $a_2 = \frac{n}{n-1}a_1$, $A_2 = A_1 + a_2$.
8: for $k = 2, 3, \ldots, K$ do
9:   $\bar{x}_{k-1} = x_{k-1} + \frac{a_{k-1}}{a_k}(x_{k-1} - x_{k-2})$.
10:  Pick $j_k$ uniformly at random in $[n]$.
11:  $y_k = \arg\min_{y\in\mathbb{R}^n}\{\psi_k(y) = \psi_{k-1}(y) + a_k(-b_{j_k}^T\bar{x}_{k-1}(y_{j_k} - v_{j_k}) + g_{j_k}^*(y_{j_k}))\}$.
12:  $x_k = \arg\min_{x\in\mathbb{R}^d}\{\phi_k(x) = \phi_{k-1}(x) + a_k(\langle x - u, z_{k-1} + (y_{k,j_k} - y_{k-1,j_k})b_{j_k}\rangle + \ell(x))\}$.
13:  $z_k = z_{k-1} + \frac{1}{n}(y_{k,j_k} - y_{k-1,j_k})b_{j_k}$.
14:  $a_{k+1} = \min\Big(\frac{n}{n-1}a_k,\ \frac{\sqrt{n(n + \sigma A_k)}}{2R'}\Big)$, $A_{k+1} = A_k + a_{k+1}$.
15: end for
16: return $x_K$ or $\tilde{x}_K := \frac{1}{A_K}\sum_{i=1}^K a_i x_i$.
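The next sketch (ours) instantiates Algorithm 2 for the hinge-conjugate/ridge pair used later in the experiments. The per-coordinate bookkeeping mirrors Steps 9–14; the initialization scale $a_1 = n/(2R')$ is an assumption made for this illustration, and the additional bookkeeping of Appendix D (e.g., for forming the averaged dual iterate) is omitted.

```python
import numpy as np

def vrpda_hinge_ridge(b, sigma, R_prime, K, seed=0):
    """Sketch of Algorithm 2 for g_i^*(t) = t on [-1, 0] (hinge conjugate) and
    l(x) = (sigma/2)||x||^2.  Keeps O(n + d) state; per-iteration cost is O(d)."""
    rng = np.random.default_rng(seed)
    n, d = b.shape
    B = b / n
    x0, y0 = np.zeros(d), np.zeros(n)
    # Steps 2-7: full (deterministic) primal-dual initialization, then scaling by n.
    a = A = n / (2.0 * R_prime)                      # a_1 = A_1 = n * a_tilde_1 (assumed scale)
    lin_y = -a * (B @ x0)                            # per-coordinate linear terms in psi
    wt = np.full(n, a / n)                           # per-coordinate weights on g_i^*
    y = np.clip(y0 - (lin_y + wt) / n, -1.0, 0.0)    # y_1 (coordinate-wise prox)
    z = B.T @ y                                      # z_1
    lin_x = a * z                                    # accumulated linear term in phi
    x_prev, x = x0.copy(), (x0 - lin_x / n) / (1.0 + A * sigma / n)   # x_1
    a_next = (n / (n - 1.0)) * a                     # a_2
    # Steps 8-15: randomized main loop.
    for k in range(2, K + 1):
        xbar = x + (a / a_next) * (x - x_prev)       # extrapolation
        j = rng.integers(n)                          # Step 10: sample one coordinate
        lin_y[j] -= a_next * np.dot(b[j], xbar)      # Step 11: only coordinate j changes
        wt[j] += a_next
        y_old = y[j]
        y[j] = np.clip(y0[j] - (lin_y[j] + wt[j]) / n, -1.0, 0.0)
        lin_x += a_next * (z + (y[j] - y_old) * b[j])  # Step 12: variance-reduced direction
        z += (y[j] - y_old) * b[j] / n                 # Step 13
        a, A = a_next, A + a_next
        x_prev, x = x, (x0 - lin_x / n) / (1.0 + A * sigma / n)
        a_next = min((n / (n - 1.0)) * a, np.sqrt(n * (n + sigma * A)) / (2.0 * R_prime))
    return x
```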
Remark 3. To bound the function value gap $f(\tilde{x}_k) - f(x^*)$ for Problem (P) using Theorem 1, we need only that $D_{\mathcal{Y}}$ is bounded, leading to the bound $f(\tilde{x}_k) - f(x^*) \le \frac{4r^2 + D_{\mathcal{Y}}^2}{2A_k}$, where $r = \sqrt{\|x_0 - x^*\|^2 + \|y_0 - y^*\|^2}$ as in Remark 1, since for $u \in B(x^*, r)$ we have that $\|u - x_0\| \le 2r$. To see this, note that, as the iterates $x_i$ of pda are guaranteed to remain in $B(x^*, r)$ (by Remark 1), there is no difference between applying this algorithm to $f$ or to $f + I_{B(x^*, r)}$, where $I_{B(x^*, r)}$ is the indicator function of $B(x^*, r)$. This allows us to restrict $u \in B(x^*, r)$ when bounding $f(\tilde{x}_k) - f(x^*)$ by $\mathrm{Gap}_{u,v}(\tilde{x}_k, \tilde{y}_k)$. Note that for typical instances of nonsmooth ERM problems, the domain $\mathcal{Y}$ of $g^*$ is compact. Further, if $g^*$ is strongly convex ($\gamma > 0$), then the set $\tilde{\mathcal{Y}} := \{\arg\max_{y\in\mathbb{R}^n}\langle Bx, y\rangle - g^*(y) : x \in B(x^*, r)\}$ is guaranteed to be compact. This claim follows from standard results, as in this case $\arg\max_{y\in\mathbb{R}^n}\langle Bx, y\rangle - g^*(y) = \nabla g(Bx)$ (by the standard Fenchel–Young inequality; see, e.g., Rockafellar and Wets (2009, Proposition 11.3)) and $g$ is $(1/\gamma)$-smooth. Thus, $\sup_{v,y\in\tilde{\mathcal{Y}}}\|v - y\| = \sup_{x,u\in B(x^*, r)}\|\nabla g(Bx) - \nabla g(Bu)\| \le \frac{2Rr}{\gamma}$.

We now study the finite-sum form (FS-PD) of (PD), making use of the properties of the finite-sum terms described in Assumption 2. In Algorithm 2, we describe vrpda, which is a randomized coordinate variant of the pda algorithm from Section 3. By extending the unified nature of pda, vrpda provides a unified and simplified treatment for both the general convex-general concave ($\sigma = 0$) setting and the strongly convex-general concave setting.

To provide an algorithm with complexity better than the deterministic counterpart pda, we combine the deterministic initialization strategy of a full primal-dual update in Steps 4–6 with randomized primal-dual updates in the main loop—a strategy inspired by the recent paper of Song et al. (2020a). The use of the factor $n$ during initialization, in Step 7, helps to cancel an error term of order $O(n)$ in the analysis. The main loop (Steps 8–15) randomizes the main loop of pda by introducing sampling in Step 10 and adding an auxiliary variable $z_k$ that is updated with $O(d)$ cost in Step 13 ($z_1$ is initialized in Step 5). In Step 11, we update the estimate sequence $\psi_k$ by adding a term involving only the $j_k$-th component of the finite sum, rather than the entire sum, as is required in Step 8 of Algorithm 1. As a result, although we define the estimate sequence for the entire vector $y_k$, each update to $y_k$ requires updating only the $j_k$-th coordinate of $y_k$. In Step 12, we use a "variance reduced gradient" $z_{k-1} + (y_{k,j_k} - y_{k-1,j_k})b_{j_k}$ to update $\phi_k$, helping to cancel the error from the randomized update of Step 11. The update of the sequences $\{a_k\}, \{A_k\}$ appears at the end of the main loop, to accommodate their modified definitions. The modified update for $a_{k+1}$ ensures that $a_k$ cannot have exponential growth with a rate higher than $\frac{n}{n-1}$, which is an intrinsic constraint for sampling with replacement (see Song et al. (2020a), Hannah et al.
(2018)). Finally, as Algorithm 2 is tailored to the nonsmooth ERM problem (P), we only return the last iterate $x_K$ or the weighted average iterate $\tilde{x}_K$ on the primal side, even though we provide guarantees for both primal and dual variables.

Algorithm 2 provides sufficient detail for the convergence analysis, but its efficient implementation is not immediately clear, due especially to Step 11. An implementable version is described in Appendix D, showing that the per-iteration cost is $O(d)$ and that $O(n)$ additional storage is required.

Our main technical result is summarized in Theorem 2. Its proof relies on three main technical lemmas that bound the growth of the estimate sequences $\phi_k(x_k)$ and $\psi_k(y_k)$ below and above. Proofs are provided in Appendices B and C.

Theorem 2.
Suppose that Assumption 2 holds. Then for any $(u,v)\in\mathcal{X}\times\mathcal{Y}$, the vectors $x_k, y_k$, $k = 2, 3, \ldots, K$, and the average $\tilde{x}_K$ generated by Algorithm 2 satisfy the following bound for $k = 2, 3, \ldots, K$:
$$\mathbb{E}[\mathrm{Gap}_{u,v}(\tilde{x}_k, \tilde{y}_k)] \le \frac{n(\|u - x_0\|^2 + \|v - y_0\|^2)}{2A_k},$$
where $\tilde{y}_K := \frac{1}{A_K}\big(na_K y_K + \sum_{i=2}^{K-1}(na_i - (n-1)a_{i+1})y_i\big)$. Moreover, if $(x^*, y^*)$ is a primal-dual solution to (PD), then
$$\mathbb{E}\Big[\frac{n}{4}\|y^* - y_k\|^2 + \frac{n + \sigma A_k}{2}\|x^* - x_k\|^2\Big] \le \frac{n(\|x^* - x_0\|^2 + \|y^* - y_0\|^2)}{2}.$$
In both cases, $A_k$ is bounded below as follows:
$$A_k \ge \max\Big\{\tfrac{n-1}{2R'}\big(\tfrac{n}{n-1}\big)^k\,\mathbb{1}_{k\le k_0},\ \ \tfrac{(n-1)^2\sigma}{4R'^2 n}(k - k_0 + n - 1)^2\,\mathbb{1}_{k\ge k_0},\ \ \tfrac{n(k - K_0 + n - 1)}{2R'}\,\mathbb{1}_{k\ge K_0}\Big\},$$
where $\mathbb{1}$ denotes the indicator function, $K_0 = \big\lceil\frac{\log(n)}{\log(n) - \log(n-1)}\big\rceil$, $k_0 = \big\lceil\frac{\log B_{n,\sigma,R'}}{\log(n) - \log(n-1)}\big\rceil$, and $B_{n,\sigma,R'} = \frac{\sigma n(n-1)}{4R'^2} + \sqrt{\big(\frac{\sigma n(n-1)}{4R'^2}\big)^2 + n} \ge \sqrt{n}\max\big\{1, \frac{\sigma(n-1)}{2R'}\big\}$.

Observe that, due to the randomized nature of the algorithm, the convergence bounds are obtained in expectation w.r.t. the random choices of the coordinates $j_k$ over iterations. Now let us comment on the iteration complexity of vrpda, given a target error $\epsilon > 0$. For concreteness, let $D := \|u - x_0\|^2 + \|v - y_0\|^2$, where $D$ can be bounded using the same reasoning as in Remarks 2 and 3. To bound the gap by $\epsilon$, we need $A_k \ge \frac{nD}{2\epsilon}$. When $\epsilon \ge \frac{nR'D}{(n-1)B_{n,\sigma,R'}}$, then $k = \big\lceil\frac{\log(nR'D/((n-1)\epsilon))}{\log(n) - \log(n-1)}\big\rceil = O\big(n\log\big(\frac{R'D}{\epsilon}\big)\big)$ iterations suffice, as in this case $k \le k_0$. When $\epsilon < \frac{nR'D}{(n-1)B_{n,\sigma,R'}}$, then the bound on $k$ is obtained by ensuring that either of the last two terms bounding $A_k$ below in Theorem 2 is bounded below by $\frac{nD}{2\epsilon}$, leading to $k = O\big(n\log(B_{n,\sigma,R'}) + \min\big\{\frac{R'\sqrt{D}}{\sqrt{\sigma\epsilon}}, \frac{R'D}{\epsilon}\big\}\big)$.
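As a quick illustration of the two-phase behavior in this complexity bound, the sketch below (ours) simply simulates the weight recursion of Algorithm 2 and counts how many iterations are needed before $A_k \ge nD/(2\epsilon)$; the initialization scale $a_1 = n/(2R')$ and the example parameter values are assumptions for illustration.

```python
import numpy as np

def iterations_for_gap(n, R_prime, sigma, D, eps):
    # Simulate a_{k+1} = min((n/(n-1)) a_k, sqrt(n(n + sigma*A_k))/(2R'))
    # until A_k >= n*D/(2*eps), the condition under which Theorem 2 bounds the expected gap by eps.
    a = n / (2.0 * R_prime)          # assumed a_1
    A, k = a, 1
    target = n * D / (2.0 * eps)
    while A < target:
        a = min(n / (n - 1.0) * a, np.sqrt(n * (n + sigma * A)) / (2.0 * R_prime))
        A += a
        k += 1
    return k

for eps in [1e-1, 1e-2, 1e-3, 1e-4]:
    # exponential growth of A_k dominates for moderate eps; the 1/eps (or 1/sqrt(eps)) phase for tiny eps
    print(eps, iterations_for_gap(n=1000, R_prime=1.0, sigma=0.0, D=1.0, eps=eps))
```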
Remark 4. The usefulness of our results extends beyond ERM problems (P), due to the symmetry of the primal-dual problem (PD) that vrpda solves. In particular, the symmetry allows exchanging the roles of $x$ and $y$ in cases where $d \gg n$, $\ell$ is decomposable over the coordinates with each coordinate function having an efficiently computable prox-oracle, and $g^*$ has an efficiently computable prox-oracle. A concrete example are feature selection problems, including, e.g., Lasso (Tibshirani, 1996), elastic net (Zou and Hastie, 2005), square-root Lasso/distributionally robust optimization (Belloni et al., 2011; Blanchet et al., 2019), and $\ell_1$-regularized logistic regression (Ravikumar et al., 2010). For all these problems (ignoring the dependence on $R'$, $D$), the overall complexity that can be achieved with vrpda is $O(nd\log(\min\{d, 1/\epsilon\}) + \min\{n/\epsilon, n/\sqrt{\sigma\epsilon}\})$; i.e., we can reduce the dependence on $d$ (by a factor of $d$ when $\epsilon$ is small enough). To the best of our knowledge, such a result cannot be obtained with any other variance reduction methods.

[Figure 1 appears here: twelve panels showing the average and last iterates on a9a and MNIST for each value of $\sigma$.]
Figure 1: Comparison of vrpda to spdhg and pure-cd run for the elastic net-regularized SVM, on the a9a and MNIST datasets. In all the plots, $\sigma$ is the strong convexity parameter of the regularizer $\ell$; "last" refers to the last iterate, "average" to the average iterate. For all problem instances, vrpda attains either similar or improved convergence compared to the other algorithms.

We study the performance of vrpda using the elastic net-regularized support vector machine (SVM) problem, which corresponds to (P) with $g_i(b_i^T x) = \max\{1 - c_i b_i^T x, 0\}$, $c_i \in \{1, -1\}$, and $\ell(x) = \lambda\|x\|_1 + \frac{\sigma}{2}\|x\|^2$, $\lambda \ge 0$, $\sigma \ge 0$.
This problem is nonsmooth and general convex if $\sigma = 0$, or strongly convex if $\sigma > 0$. Its primal-dual formulation is
$$\min_{x\in\mathbb{R}^d}\ \max_{-1\le y_i\le 0,\ i\in[n]} L(x,y), \qquad L(x,y) = \frac{1}{n}\sum_{i=1}^n y_i(\langle c_i b_i, x\rangle - 1) + \lambda\|x\|_1 + \frac{\sigma}{2}\|x\|^2.$$

We compare vrpda with two competitive algorithms, spdhg (Chambolle et al., 2018) and pure-cd (Alacaoglu et al., 2020), on the standard a9a and MNIST datasets from the LIBSVM library (LIB). (For each sample of MNIST, we reassign the label as 1 if the digit is in $\{0, 1, \ldots, 4\}$ and $-1$ otherwise.) For simplicity, we normalize the samples so that the norm-related constants (e.g., $R'$ in vrpda) used by the algorithms are at most 1. Then we tune the Lipschitz constants over a small grid of values; in our experiments, all the algorithms diverge when the Lipschitz constant is set to 0. As is standard for ERM, we plot the function value gap of the primal problem (P) in terms of the number of passes over the dataset. (The optimal value was evaluated by solving (P) to high accuracy.)

In the experiments, we fix the $\ell_1$-regularization parameter $\lambda$ to a small positive constant and vary $\sigma$ over zero and two positive values, to represent the general convex, ill-conditioned strongly convex, and well-conditioned strongly convex settings, respectively. For all the settings, we provide the comparison in terms of the average and the last iterate. (spdhg and pure-cd provide no results for the average iterate in the nonsmooth and strongly convex setting, so we use a simple uniform average for both of them.) As can be observed from Fig. 1, the average iterate yields much smoother curves, decreasing monotonically, and is generally more accurate than the last iterate. This is expected for the nonsmooth and general convex setting, as there are no theoretical guarantees for the last iterate, while for the other cases the guarantee for the last iterate is on the distance to the optimum, not the primal gap. As can be seen in Fig. 1, the average iterate of vrpda is either competitive with or improves upon spdhg and pure-cd. A more detailed discussion is provided in Appendix E.

We introduced a novel vrpda algorithm for structured nonsmooth ERM problems that are common in machine learning. vrpda leverages the separable structure of common ERM problems to attain improved convergence compared to the state-of-the-art algorithms, both in theory and practice, even improving upon the lower bounds for (general, non-structured) composite optimization. It is an open question to obtain tighter lower bounds for the structured ERM setting to which vrpda applies, possibly certifying its optimality, at least for small target error $\epsilon$.

References
LIBSVM Library. Accessed: Feb. 3, 2020.

Ahmet Alacaoglu, Quoc Tran Dinh, Olivier Fercoq, and Volkan Cevher. Smooth primal-dual coordinate descent algorithms for nonsmooth convex optimization. In
Proc. NIPS’17 , 2017.Ahmet Alacaoglu, Olivier Fercoq, and Volkan Cevher. Random extrapolation for primal-dual coordinatedescent. In
Proc. ICML’20 , 2020.Zeyuan Allen-Zhu. Katyusha: The first direct acceleration of stochastic gradient methods. In
Proc. ACMSTOC’17 , 2017.Zeyuan Allen-Zhu and Elad Hazan. Optimal black-box reductions between optimization objectives. In
Proc. NIPS’16 , 2016.Zeyuan Allen-Zhu and Lorenzo Orecchia. Linear coupling: An ultimate unification of gradient and mirrordescent. In
Proc. ITCS’17 , 2017.Alexandre Belloni, Victor Chernozhukov, and Lie Wang. Square-root lasso: Pivotal recovery of sparse signalsvia conic programming.
Biometrika , 98(4):791–806, 2011.Jose Blanchet, Yang Kang, and Karthyek Murthy. Robust Wasserstein profile inference and applications tomachine learning.
Journal of Applied Probability , 56(3):830–857, 2019.Yair Carmon, Yujia Jin, Aaron Sidford, and Kevin Tian. Variance reduction for matrix games. In
Proc. NeurIPS'19, 2019.

Antonin Chambolle and Thomas Pock. A first-order primal-dual algorithm for convex problems with applications to imaging. Journal of Mathematical Imaging and Vision, 40(1):120–145, 2011.

Antonin Chambolle, Matthias J. Ehrhardt, Peter Richtárik, and Carola-Bibiane Schönlieb. Stochastic primal-dual hybrid gradient algorithm with arbitrary sampling and imaging applications.
SIAM Journal onOptimization , 28(4):2783–2808, 2018.Yunmei Chen, Guanghui Lan, and Yuyuan Ouyang. Accelerated schemes for a class of variational inequalities.
Mathematical Programming , 165(1):113–149, 2017.Cong Dang and Guanghui Lan. Randomized first-order methods for saddle point optimization. arXiv preprintarXiv:1409.8625 , 2014.Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method withsupport for non-strongly convex composite objectives. In
Proc. NIPS’14 , 2014.Adithya M Devraj and Jianshu Chen. Stochastic variance reduced primal dual algorithms for empiricalcomposition optimization. In
Proc. NeurIPS’19 , 2019.Jelena Diakonikolas. Halpern iteration for near-optimal and parameter-free monotone inclusion and strongsolutions to variational inequalities. In
Proc. COLT’20 , 2020.Jelena Diakonikolas, Constantinos Daskalakis, and Michael I Jordan. Efficient methods for structurednonconvex-nonconcave min-max optimization. arXiv preprint arXiv:2011.00364 , 2020.Olivier Fercoq and Pascal Bianchi. A coordinate-descent primal-dual algorithm with large step size andpossibly nonseparable functions.
SIAM Journal on Optimization , 29(1):100–134, 2019.Robert Hannah, Yanli Liu, Daniel O’Connor, and Wotao Yin. Breaking the span assumption yields fastfinite-sum minimization. In
Proc. NeurIPS’18 , 2018.Elad Hazan et al. Introduction to online convex optimization.
Foundations and Trends ® in Optimization , 2(3-4):157–325, 2016.Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction.In Proc. NIPS’13 , 2013.Guanghui Lan, Zhize Li, and Yi Zhou. A unified variance-reduced accelerated gradient method for convexoptimization. In
Proc. NeurIPS’19 , 2019.Puya Latafat, Nikolaos M Freris, and Panagiotis Patrinos. A new randomized block-coordinate primal-dual proximal algorithm for distributed optimization.
IEEE Transactions on Automatic Control , 64(10):4050–4065, 2019.Qi Lei, Jiacheng Zhuo, Constantine Caramanis, Inderjit S Dhillon, and Alexandros G Dimakis. Primal-dualblock generalized Frank-Wolfe.
Proc. NeurIPS’19 , 2019.Qihang Lin, Zhaosong Lu, and Lin Xiao. An accelerated proximal coordinate gradient method.
Proc. NIPS'14, 2014.

Arkadi Nemirovski. Prox-method with rate of convergence $O(1/t)$ for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM Journal on Optimization, 15(1):229–251, 2004.

Yu Nesterov. Excessive gap technique in nonsmooth convex minimization.
SIAM Journal on Optimization, 16(1):235–249, 2005a.

Yu Nesterov. Smooth minimization of non-smooth functions.
Mathematical programming , 103(1):127–152,2005b.Yurii Nesterov. Dual extrapolation and its applications to solving variational inequalities and related problems.
Mathematical Programming , 109(2-3):319–344, 2007.Yurii Nesterov. Universal gradient methods for convex optimization problems.
Mathematical Programming ,152(1-2):381–404, 2015.Yurii Nesterov.
Lectures on convex optimization , volume 137. Springer, 2018.Yuyuan Ouyang and Yangyang Xu. Lower complexity bounds of first-order methods for convex-concavebilinear saddle-point problems.
Mathematical Programming, pages 1–35, 2019.

Pradeep Ravikumar, Martin J. Wainwright, John D. Lafferty, et al. High-dimensional Ising model selection using $\ell_1$-regularized logistic regression. The Annals of Statistics, 38(3):1287–1319, 2010.

R. Tyrrell Rockafellar and Roger J-B Wets.
Variational analysis , volume 317. Springer Science & BusinessMedia, 2009.Nicolas Roux, Mark Schmidt, and Francis Bach. A stochastic gradient method with an exponential convergencerate for finite training sets.
Proc. NIPS’12 , 2012.Chaobing Song, Yong Jiang, and Yi Ma. Variance reduction via accelerated dual averaging for finite-sumoptimization.
Proc. NeurIPS’20 , 2020a.Chaobing Song, Zhengyuan Zhou, Yichao Zhou, Yong Jiang, and Yi Ma. Optimistic dual extrapolation forcoherent non-monotone variational inequalities.
Proc. NeurIPS'20, 2020b.

Conghui Tan, Tong Zhang, Shiqian Ma, and Ji Liu. Stochastic primal-dual method for empirical risk minimization with $O(1)$ per-iteration complexity. In Proc. NeurIPS'18, 2018.

Robert Tibshirani. Regression shrinkage and selection via the lasso.
Journal of the Royal Statistical Society:Series B (Methodological) , 58(1):267–288, 1996.Quoc Tran-Dinh, Olivier Fercoq, and Volkan Cevher. A smooth primal-dual optimization framework fornonsmooth composite convex minimization.
SIAM Journal on Optimization , 28(1):96–134, 2018.Blake E Woodworth and Nati Srebro. Tight complexity bounds for optimizing composite objectives.
Proc. NIPS’16 , 2016.Lin Xiao. Dual averaging methods for regularized stochastic learning and online optimization.
Journal ofMachine Learning Research , 11(Oct):2543–2596, 2010.Yuchen Zhang and Xiao Lin. Stochastic primal-dual coordinate method for regularized empirical riskminimization. In
Proc. ICML’15 , 2015.Kaiwen Zhou, Fanhua Shang, and James Cheng. A simple stochastic variance reduced algorithm with fastconvergence rates. In
Proc. ICML’18 , 2018.Hui Zou and Trevor Hastie. Regularization and variable selection via the elastic net.
Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):301–320, 2005.
Omitted Proofs from Section 3
We start by proving two auxiliary lemmas that bound the growth of the estimate sequences $\phi_k(x_k)$ and $\psi_k(y_k)$ above and below, respectively.

Lemma 1.
In Algorithm 1, for all $(u,v)\in\mathcal{X}\times\mathcal{Y}$ and $k \ge 1$, we have
$$\psi_k(y_k) \le A_k g^*(v) + \frac{1}{2}\|v - y_0\|^2 - \frac{1+\gamma A_k}{2}\|v - y_k\|^2,$$
$$\phi_k(x_k) \le A_k\ell(u) + \frac{1}{2}\|u - x_0\|^2 - \frac{1+\sigma A_k}{2}\|u - x_k\|^2.$$

Proof.
By the definition of $\psi_k(y)$ in Algorithm 1, it follows that, for all $k \ge 1$,
$$\psi_k(y) = \sum_{i=1}^k a_i\big(\langle -B\bar{x}_{i-1}, y - v\rangle + g^*(y)\big) + \frac{1}{2}\|y - y_0\|^2.$$
As, by definition, $y_k = \arg\min_{y\in\mathbb{R}^n}\psi_k(y)$ (see Algorithm 1; observe that, by the definition of $\psi_k$, it must be that $y_k\in\mathcal{Y}$), it follows that there exists $g_{g^*}(y_k)\in\partial g^*(y_k)$ such that
$$\sum_{i=1}^k a_i\big(-B\bar{x}_{i-1} + g_{g^*}(y_k)\big) + (y_k - y_0) = 0.$$
Thus, for any $v\in\mathbb{R}^n$, we have that, for all $k \ge 1$,
$$\begin{aligned}
\psi_k(y_k) &= \sum_{i=1}^k a_i\big(\langle -B\bar{x}_{i-1}, y_k - v\rangle + g^*(y_k)\big) + \frac{1}{2}\|y_k - y_0\|^2\\
&= \sum_{i=1}^k a_i\big(\langle g_{g^*}(y_k), v - y_k\rangle + g^*(y_k)\big) + \langle y_k - y_0, v - y_k\rangle + \frac{1}{2}\|y_k - y_0\|^2\\
&= A_k\big(\langle g_{g^*}(y_k), v - y_k\rangle + g^*(y_k)\big) + \langle y_k - y_0, v - y_k\rangle + \frac{1}{2}\|y_k - y_0\|^2.
\end{aligned}$$
As $g^*$ is assumed to be $\gamma$-strongly convex, we have that $\langle g_{g^*}(y_k), v - y_k\rangle + g^*(y_k) \le g^*(v) - \frac{\gamma}{2}\|v - y_k\|^2$. Thus, using that $\langle y_k - y_0, v - y_k\rangle + \frac{1}{2}\|y_k - y_0\|^2 = \frac{1}{2}\|v - y_0\|^2 - \frac{1}{2}\|v - y_k\|^2$, we further have
$$\psi_k(y_k) \le A_k\Big(g^*(v) - \frac{\gamma}{2}\|v - y_k\|^2\Big) + \frac{1}{2}\|v - y_0\|^2 - \frac{1}{2}\|v - y_k\|^2 = A_k g^*(v) + \frac{1}{2}\|v - y_0\|^2 - \frac{1+\gamma A_k}{2}\|v - y_k\|^2,$$
as claimed. Bounding $\phi_k(x_k)$ can be done using the same sequence of arguments and is thus omitted.

Lemma 2.
In Algorithm 1, we have, for all $(u,v)\in\mathcal{X}\times\mathcal{Y}$ and $k \ge 1$,
$$\begin{aligned}
\psi_k(y_k) \ge{}& \psi_{k-1}(y_{k-1}) + \frac{1+\gamma A_{k-1}}{2}\|y_k - y_{k-1}\|^2 + a_k\big(g^*(y_k) + \langle -Bx_k, y_k - v\rangle\big)\\
&+ a_k\langle B(x_k - x_{k-1}), y_k - v\rangle - a_{k-1}\langle B(x_{k-1} - x_{k-2}), y_{k-1} - v\rangle - a_{k-1}\langle B(x_{k-1} - x_{k-2}), y_k - y_{k-1}\rangle,
\end{aligned}$$
$$\phi_k(x_k) \ge \phi_{k-1}(x_{k-1}) + \frac{1+\sigma A_{k-1}}{2}\|x_k - x_{k-1}\|^2 + a_k\big(\langle x_k - u, B^T y_k\rangle + \ell(x_k)\big).$$

Proof. By the definition of $\psi_k(y_k)$, the fact that $y_{k-1}$ is optimal for $\psi_{k-1}$, and the $(1+\gamma A_{k-1})$-strong convexity of $\psi_{k-1}$, we have, for all $k \ge 1$,
$$\psi_k(y_k) = \psi_{k-1}(y_k) + a_k\big(\langle -B\bar{x}_{k-1}, y_k - v\rangle + g^*(y_k)\big) \ge \psi_{k-1}(y_{k-1}) + \frac{1+\gamma A_{k-1}}{2}\|y_k - y_{k-1}\|^2 + a_k\big(\langle -B\bar{x}_{k-1}, y_k - v\rangle + g^*(y_k)\big). \tag{3}$$
On the other hand, by the definition of $\bar{x}_{k-1}$, we also have
$$\begin{aligned}
\langle -B\bar{x}_{k-1}, y_k - v\rangle &= \langle -B(\bar{x}_{k-1} - x_k), y_k - v\rangle + \langle -Bx_k, y_k - v\rangle\\
&= \langle B(x_k - x_{k-1}), y_k - v\rangle - \frac{a_{k-1}}{a_k}\langle B(x_{k-1} - x_{k-2}), y_k - v\rangle + \langle -Bx_k, y_k - v\rangle\\
&= \langle B(x_k - x_{k-1}), y_k - v\rangle - \frac{a_{k-1}}{a_k}\langle B(x_{k-1} - x_{k-2}), y_{k-1} - v\rangle - \frac{a_{k-1}}{a_k}\langle B(x_{k-1} - x_{k-2}), y_k - y_{k-1}\rangle + \langle -Bx_k, y_k - v\rangle.
\end{aligned}\tag{4}$$
Hence, combining Eqs. (3) and (4), we have
$$\begin{aligned}
\psi_k(y_k) \ge{}& \psi_{k-1}(y_{k-1}) + \frac{1+\gamma A_{k-1}}{2}\|y_k - y_{k-1}\|^2 + a_k g^*(y_k) + a_k\langle B(x_k - x_{k-1}), y_k - v\rangle\\
&- a_{k-1}\langle B(x_{k-1} - x_{k-2}), y_{k-1} - v\rangle - a_{k-1}\langle B(x_{k-1} - x_{k-2}), y_k - y_{k-1}\rangle + a_k\langle -Bx_k, y_k - v\rangle.
\end{aligned}$$
Similarly, by the definition of $\phi_k$, the $(1+\sigma A_{k-1})$-strong convexity of $\phi_{k-1}$, and $x_{k-1} = \arg\min_{x\in\mathbb{R}^d}\phi_{k-1}(x)$, we have, for all $k \ge 1$,
$$\phi_k(x_k) = \phi_{k-1}(x_k) + a_k\big(\langle x_k - u, B^T y_k\rangle + \ell(x_k)\big) \ge \phi_{k-1}(x_{k-1}) + \frac{1+\sigma A_{k-1}}{2}\|x_k - x_{k-1}\|^2 + a_k\big(\langle x_k - u, B^T y_k\rangle + \ell(x_k)\big),$$
completing the proof.

We are now ready to prove Theorem 1.

Theorem 1.
Under Assumption 1, for Algorithm 1 we have, for all $(u,v)\in\mathcal{X}\times\mathcal{Y}$ and $k \ge 1$,
$$\mathrm{Gap}_{u,v}(\tilde{x}_k, \tilde{y}_k) \le \frac{\|u - x_0\|^2 + \|v - y_0\|^2}{2A_k}.$$
Further, if $(x^*, y^*)$ is a primal-dual solution to (PD), then
$$(1+\sigma A_k)\|x_k - x^*\|^2 + \frac{1+\gamma A_k}{2}\|y_k - y^*\|^2 \le \|x_0 - x^*\|^2 + \|y_0 - y^*\|^2. \tag{2}$$
In both cases, the growth of $A_k$ can be bounded below as
$$A_k \ge \frac{1}{\sqrt{2}R}\max\Big\{k,\ \Big(1+\tfrac{\sqrt{\sigma\gamma}}{\sqrt{2}R}\Big)^{k-1},\ \tfrac{\sigma}{4\sqrt{2}R}\big([k-k_0]_+ + \max\{\tfrac{\sqrt{2}R}{\sigma}, 1\}\big)^2,\ \tfrac{\gamma}{4\sqrt{2}R}\big([k-k_0']_+ + \max\{\tfrac{\sqrt{2}R}{\gamma}, 1\}\big)^2\Big\},$$
where $[\cdot]_+ = \max\{\cdot, 0\}$, $k_0 = \lceil\sqrt{2}R/\sigma\rceil$, and $k_0' = \lceil\sqrt{2}R/\gamma\rceil$.

Proof. Applying Lemma 2, we have
$$\begin{aligned}
\psi_k(y_k) + \phi_k(x_k) \ge{}& \psi_{k-1}(y_{k-1}) + \phi_{k-1}(x_{k-1}) + a_k\langle B(x_k - x_{k-1}), y_k - v\rangle - a_{k-1}\langle B(x_{k-1} - x_{k-2}), y_{k-1} - v\rangle\\
&+ \Big[\frac{1+\gamma A_{k-1}}{2}\|y_k - y_{k-1}\|^2 + \frac{1+\sigma A_{k-1}}{2}\|x_k - x_{k-1}\|^2 - a_{k-1}\langle B(x_{k-1} - x_{k-2}), y_k - y_{k-1}\rangle\Big]_1\\
&+ \big[a_k\big(g^*(y_k) + \ell(x_k) - \langle Bu, y_k\rangle + \langle x_k, B^T v\rangle\big)\big]_2.
\end{aligned}\tag{5}$$
The terms from the first line of this inequality give a recursive relationship for the sum of estimate sequences $\psi_k + \phi_k$, and they telescope when summed over iterations. Thus, we only need to focus on bounding the terms inside $[\cdot]_1$ and $[\cdot]_2$.

Observe that, by the definition (1) of $\mathrm{Gap}_{u,v}(x,y)$, we have
$$[\cdot]_2 = a_k\big(\mathrm{Gap}_{u,v}(x_k, y_k) + g^*(v) + \ell(u)\big). \tag{6}$$
To bound $[\cdot]_1$, observe first that
$$a_{k-1}\langle B(x_{k-1} - x_{k-2}), y_k - y_{k-1}\rangle \le a_{k-1}\|B\|\|x_{k-1} - x_{k-2}\|\|y_k - y_{k-1}\| \le R a_{k-1}\|x_{k-1} - x_{k-2}\|\|y_k - y_{k-1}\|,$$
where we have used the Cauchy–Schwarz inequality, the definition of the operator norm, and $\|B\| \le R$ (which holds by Assumption 1). Applying Young's inequality, we have for all $k \ge 2$
$$\begin{aligned}
a_{k-1}\langle B(x_{k-1} - x_{k-2}), y_k - y_{k-1}\rangle &\le \frac{R^2 a_{k-1}^2}{1+\gamma A_{k-1}}\|x_{k-1} - x_{k-2}\|^2 + \frac{1+\gamma A_{k-1}}{4}\|y_k - y_{k-1}\|^2\\
&\le \frac{R^2 a_{k-1}^2}{1+\gamma A_{k-2}}\|x_{k-1} - x_{k-2}\|^2 + \frac{1+\gamma A_{k-1}}{4}\|y_k - y_{k-1}\|^2\\
&= \frac{1+\sigma A_{k-2}}{2}\|x_{k-1} - x_{k-2}\|^2 + \frac{1+\gamma A_{k-1}}{4}\|y_k - y_{k-1}\|^2,
\end{aligned}\tag{7}$$
where the second line is by $A_{k-1} \ge A_{k-2}$, for all $k$, and the third line is by the definition of $a_{k-1}$ in Algorithm 1. Hence, we have for $k \ge 2$
$$[\cdot]_1 \ge \frac{1+\sigma A_{k-1}}{2}\|x_k - x_{k-1}\|^2 - \frac{1+\sigma A_{k-2}}{2}\|x_{k-1} - x_{k-2}\|^2. \tag{8}$$
For $k = 1$, we have
$$[\cdot]_1 = \frac{1+\gamma A_0}{2}\|y_1 - y_0\|^2 + \frac{1+\sigma A_0}{2}\|x_1 - x_0\|^2 - a_0\langle B(x_0 - x_{-1}), y_1 - y_0\rangle = \frac{1}{2}\|y_1 - y_0\|^2 + \frac{1}{2}\|x_1 - x_0\|^2. \tag{9}$$
When we sum $[\cdot]_1$ over $1, 2, \ldots, k$, we obtain from (8) and (9) that the telescoped sum is bounded below by
$$\frac{1+\sigma A_{k-1}}{2}\|x_k - x_{k-1}\|^2 - \frac{1+\sigma A_0}{2}\|x_1 - x_0\|^2 + \frac{1}{2}\|y_1 - y_0\|^2 + \frac{1}{2}\|x_1 - x_0\|^2 = \frac{1+\sigma A_{k-1}}{2}\|x_k - x_{k-1}\|^2 + \frac{1}{2}\|y_1 - y_0\|^2. \tag{10}$$
Thus, by combining Eqs.
(5)–(6), telescoping from 1 to $k$, and using (10), we have
$$\begin{aligned}
\psi_k(y_k) + \phi_k(x_k) \ge{}& \psi_0(y_0) + \phi_0(x_0) + a_k\langle B(x_k - x_{k-1}), y_k - v\rangle - a_0\langle B(x_0 - x_{-1}), y_0 - v\rangle\\
&+ \frac{1+\sigma A_{k-1}}{2}\|x_k - x_{k-1}\|^2 + \frac{1}{2}\|y_1 - y_0\|^2 + \sum_{i=1}^k a_i\big(\mathrm{Gap}_{u,v}(x_i, y_i) + g^*(v) + \ell(u)\big)\\
\ge{}& a_k\langle B(x_k - x_{k-1}), y_k - v\rangle + \frac{1+\sigma A_{k-1}}{2}\|x_k - x_{k-1}\|^2 + \sum_{i=1}^k a_i\,\mathrm{Gap}_{u,v}(x_i, y_i) + A_k\big(g^*(v) + \ell(u)\big),
\end{aligned}\tag{11}$$
where we have used $\phi_0(x_0) = \psi_0(y_0) = 0$ and $x_0 = x_{-1}$, which hold by assumption. By rearranging Eq. (11) and using the bounds on $\phi_k(x_k)$ and $\psi_k(y_k)$ from Lemma 1, we have
$$\sum_{i=1}^k a_i\,\mathrm{Gap}_{u,v}(x_i, y_i) \le -a_k\langle B(x_k - x_{k-1}), y_k - v\rangle - \frac{1+\sigma A_{k-1}}{2}\|x_k - x_{k-1}\|^2 + \frac{1}{2}\|v - y_0\|^2 - \frac{1+\gamma A_k}{2}\|v - y_k\|^2 + \frac{1}{2}\|u - x_0\|^2 - \frac{1+\sigma A_k}{2}\|u - x_k\|^2. \tag{12}$$
By using the same sequence of arguments leading to Eq. (7), but with $v$ replacing $y_{k-1}$ and $k+1$ replacing $k$, we have that the following bound holds for all $k \ge 1$:
$$-a_k\langle B(x_k - x_{k-1}), y_k - v\rangle \le \frac{1+\sigma A_{k-1}}{2}\|x_k - x_{k-1}\|^2 + \frac{1+\gamma A_k}{4}\|y_k - v\|^2.$$
By combining with Eq. (12), we obtain
$$\sum_{i=1}^k a_i\,\mathrm{Gap}_{u,v}(x_i, y_i) \le \frac{1}{2}\|u - x_0\|^2 + \frac{1}{2}\|v - y_0\|^2 - \frac{1+\sigma A_k}{2}\|u - x_k\|^2 - \frac{1+\gamma A_k}{4}\|v - y_k\|^2. \tag{13}$$
To complete bounding the primal-dual gap, it remains to observe that, for any fixed $u, v$, $\mathrm{Gap}_{u,v}(x,y)$ is separable and convex in $x, y$. Thus, by the definition of $\tilde{x}_k, \tilde{y}_k$ and Jensen's inequality, we have that
$$\mathrm{Gap}_{u,v}(\tilde{x}_k, \tilde{y}_k) \le \frac{1}{A_k}\sum_{i=1}^k a_i\,\mathrm{Gap}_{u,v}(x_i, y_i) \le \frac{\|u - x_0\|^2 + \|v - y_0\|^2}{2A_k}.$$
To bound $\|x_k - x^*\|^2 + \|y_k - y^*\|^2$, note that for $(u,v) = (x^*, y^*)$ we must have $\mathrm{Gap}_{u,v}(x,y) \ge 0$ for all $(x,y)$, since $(x^*, y^*)$ is a primal-dual solution. Thus, rearranging Eq. (13), we arrive at the claimed bound
$$(1+\sigma A_k)\|x_k - x^*\|^2 + \frac{1+\gamma A_k}{2}\|y_k - y^*\|^2 \le \|x_0 - x^*\|^2 + \|y_0 - y^*\|^2.$$
To complete the proof, it remains to bound the growth of $A_k$. We do this using the relationship $A_k - A_{k-1} = a_k = \frac{\sqrt{(1+\sigma A_{k-1})(1+\gamma A_{k-1})}}{\sqrt{2}R}$. When either $\gamma = 0$ or $\sigma = 0$, the bound follows by applying Lemma 6.
If $\sigma > 0$ and $\gamma > 0$, then we have $A_k - A_{k-1} = a_k \ge \frac{\sqrt{\sigma\gamma}}{\sqrt{2}R}A_{k-1}$, which leads to $A_k \ge A_1\big(1 + \frac{\sqrt{\sigma\gamma}}{\sqrt{2}R}\big)^{k-1} = \frac{1}{\sqrt{2}R}\big(1 + \frac{\sqrt{\sigma\gamma}}{\sqrt{2}R}\big)^{k-1}$, as claimed.

Omitted Proofs from Section 4
We start by showing the following identity for the initial estimate sequence $\psi_1(y_1) + \phi_1(x_1)$.

Lemma 3.
For the initialization steps (Lines 2–7 of Algorithm 2), we have, for all $(u,v)\in\mathcal{X}\times\mathcal{Y}$, that
$$\psi_1(y_1) + \phi_1(x_1) = \frac{n}{2}\|y_1 - y_0\|^2 + \frac{n}{2}\|x_1 - x_0\|^2 + a_1\big(\mathrm{Gap}_{u,v}(x_1, y_1) + g^*(v) + \ell(u) + \langle x_1 - x_0, B^T(y_1 - v)\rangle\big).$$

Proof.
By the definition of ˜ ψ ( · ) and ˜ φ ( · ), we have˜ ψ ( y ) = 12 (cid:107) y − y (cid:107) + ˜ a ( (cid:104)− Bx , y − v (cid:105) + g ∗ ( y )) , ˜ φ ( x ) = 12 (cid:107) x − x (cid:107) + ˜ a ( (cid:104) x − u , B T y (cid:105) + (cid:96) ( x )) . Thus, using the definition (1) of Gap u , v , we have:˜ ψ ( y ) + ˜ φ ( x ) = 12 (cid:107) y − y (cid:107) + 12 (cid:107) x − x (cid:107) + ˜ a ( g ∗ ( y ) + (cid:96) ( x ) − (cid:104) Bu , y (cid:105) + (cid:104) x , B T v (cid:105) + (cid:104) x − x , B T ( y − v ) (cid:105) )= 12 (cid:107) y − y (cid:107) + 12 (cid:107) x − x (cid:107) + ˜ a (Gap u , v ( x , y ) + g ∗ ( v ) + (cid:96) ( u ) + (cid:104) x − x , B T ( y − v ) (cid:105) ) , We have by definition that ψ = n ˜ ψ , φ = n ˜ φ , and a = n ˜ a , so the result follows when we multiply bothsides of this expression by n .The following two lemmas now bound the growth of ψ k ( y k ) + φ k ( x k ) below and above, and are the maintechnical lemmas used in proving Theorem 2. Lemma 4.
For all steps of Algorithm 2 with $k \geq 2$, we have, $\forall (u, v) \in \mathcal{X} \times \mathcal{Y}$,
$$\psi_k(y_k) \leq \sum_{i=2}^{k} a_i\, g^*_{j_i}(v_{j_i}) + a_1 g^*(v) + \frac{n}{2}\|v - y_0\|^2 - \frac{n}{2}\|v - y_k\|^2,$$
$$\phi_k(x_k) \leq A_k\, \ell(u) + \frac{n}{2}\|u - x_0\|^2 - \frac{n + \sigma A_k}{2}\|u - x_k\|^2.$$

Proof.
By the definition of ψ k ( y ) in Algorithm 2, it follows that, ∀ k ≥ ,ψ k ( y ) = k (cid:88) i =2 a i ( − b Tj i ¯ x i − ( y j i − v j i ) + g ∗ j i ( y j i ))+ (cid:16) n (cid:107) y − y (cid:107) + a ( (cid:104)− Bx , y − v (cid:105) + g ∗ ( y )) (cid:17) (14)and φ k ( x ) = k (cid:88) i =2 a i ( (cid:104) x − u , z i − + ( y i,j i − y i − ,j i ) b j i (cid:105) + (cid:96) ( x ))+ (cid:16) n (cid:107) x − x (cid:107) + a ( (cid:104) x − u , B T y (cid:105) + (cid:96) ( x )) (cid:17) . (15)By the first-order optimality condition in the definition of y k (see Algorithm 2), it follows that thereexists g g ∗ ( y k ) ∈ ∂g ∗ ( y k ) such that k (cid:88) i =2 a i ( − b Tj i ¯ x i − + ( g ∗ j i ) (cid:48) ( y k,j i )) e j i + n ( y k − y ) + a ( − Bx + g g ∗ ( y k )) = , e j i denotes the j i th standard basis vector (i.e., a vector whose element j i equals one, while all theremaining elements are zero) and ( g ∗ j i ) (cid:48) ∈ ∂g ∗ j i ( y k,j i ) denotes the j i th element of g g ∗ ( y k )). By rearranging thisexpression, we obtain − a Bx − k (cid:88) i =2 a i b Tj i ¯ x i − e j i = − k (cid:88) i =2 a i ( g ∗ j i ) (cid:48) ( y k,j i ) e j i − n ( y k − y ) − a g g ∗ ( y k ) . (16)By setting y = y k in (14), then substituting from (16), we obtain ψ k ( y k ) = k (cid:88) i =2 a i ( (cid:104) ( g ∗ j i ) (cid:48) ( y k,j i ) , v j i − y k,j i (cid:105) + g ∗ j i ( y k,j i )) + a ( (cid:104) g g ∗ ( y k ) , v − y k (cid:105) + g ∗ ( y k ))+ (cid:104) n ( y k − y ) , v − y k (cid:105) + n (cid:107) y k − y (cid:107) ≤ k (cid:88) i =2 a i g ∗ j i ( v j i ) + a g ∗ ( v ) + n (cid:107) v − y (cid:107) − n (cid:107) v − y k (cid:107) , where we have used (cid:104) g g ∗ ( y k ) , v − y k (cid:105) + g ∗ ( y k ) ≤ g ∗ ( v ) (by convexity of g ∗ ) and (cid:104) n ( y k − y ) , v − y k (cid:105) + n (cid:107) y k − y (cid:107) = n (cid:107) v − y (cid:107) − n (cid:107) v − y k (cid:107) . The bound on φ k ( x k ) follows from a similar sequence of arguments. By the first-order optimality in thedefinition of x k , we have that there exists g (cid:96) ( x k ) ∈ ∂(cid:96) ( x k ) such that k (cid:88) i =2 a i ( z i − + ( y i,j i − y i − ,j i ) b j i + g (cid:96) ( x k )) + n ( x k − x ) + a ( B T y + g (cid:96) ( x k )) = , which rearranges to k (cid:88) i =2 a i ( z i − + ( y i,j i − y i − ,j i ) b j i ) + a B T y = − k (cid:88) i =2 a i g (cid:96) ( x k ) − n ( x k − x ) − a g (cid:96) ( x k ) . By using this expression in Eq. (15) with x = x k , we obtain φ k ( x k ) = k (cid:88) i =2 a i ( (cid:104) g (cid:96) ( x k ) , u − x k (cid:105) + (cid:96) ( x k )) + a ( (cid:104) g (cid:96) ( x k ) , u − x k (cid:105) + (cid:96) ( x k ))+ (cid:104) n ( x k − x ) , u − x k (cid:105) + n (cid:107) x k − x (cid:107) ≤ k (cid:88) i =2 a i ( (cid:96) ( u ) − σ (cid:107) u − x k (cid:107) ) + a ( (cid:96) ( u ) − σ (cid:107) u − x k (cid:107) ) + n (cid:107) u − x (cid:107) − n (cid:107) u − x k (cid:107) = A k (cid:96) ( u ) + n (cid:107) u − x (cid:107) − n + σA k (cid:107) u − x k (cid:107) , where we have used (cid:104) g (cid:96) ( x k ) , u − x k (cid:105) + (cid:96) ( x k ) ≤ (cid:96) ( u ) − σ (cid:107) u − x k (cid:107) (by σ -strong convexity of (cid:96) ) and (cid:104) n ( x k − x ) , u − x k (cid:105) + n (cid:107) x k − x (cid:107) = n (cid:107) u − x (cid:107) − n (cid:107) u − x k (cid:107) . The last line follows from the definition of A k . Lemma 5.
For all steps of Algorithm 2 with k ≥ , taking expectation on all the randomness in the algorithm, e have for all ( u , v ) ∈ X × Y that E [ ψ k ( y k )] ≥ E (cid:104) ψ k − ( y k − ) + n (cid:107) y k − y k − (cid:107) + a k g ∗ j k ( y k,j k )+ a k (cid:104) B ( x k − x k − ) , y k − v (cid:105) − a k − (cid:104) B ( x k − − x k − ) , y k − − v (cid:105)− na k − (cid:104) B ( x k − − x k − ) , y k − y k − (cid:105) + a k (cid:104)− Bx k , y k − v (cid:105)− ( n − a k (cid:16) (cid:104) B ( x k − − u ) , y k − y k − (cid:105) + (cid:104) Bu , y k − y k − (cid:105) (cid:17)(cid:105) , E [ φ k ( x k )] ≥ E (cid:104) φ k − ( x k − ) + n + σA k − (cid:107) x k − x k − (cid:107) + a k ( (cid:104) x k − u , B T y k (cid:105) + ( n − (cid:104) x k − u , B T ( y k − y k − ) (cid:105) + (cid:96) ( x k )) (cid:105) . Proof.
By the definition of ψ k , we have ψ k ( y k ) = ψ k − ( y k ) + a k ( − b Tj k ¯ x k − ( y k,j k − v j k ) + g ∗ j k ( y k,j k ))= ψ k − ( y k − ) + (cid:2) ψ k − ( y k ) − ψ k − ( y k − ) (cid:3) + (cid:2) a k ( − b Tj k ¯ x k − ( y k,j k − v j k ) + g ∗ j k ( y k,j k )) (cid:3) . (17)To bound ψ k ( y k ) in expectation and obtain the claimed bound, we need to bound the terms in [ · ] and [ · ] .To do so, let F k be the natural filtration that contains all the randomness up to and including iteration k. Inwhat follows, we will use the tower property of conditional expectation, which guarantees E [ · ] = E [ E [ ·|F k − ]] . To bound [ · ] , we use the definition of ψ k − from Algorithm 2, and the facts that y k − is optimal for ψ k − and that ψ k − is the sum of the n -strongly convex function ψ with k − · ] = ψ k − ( y k ) − ψ k − ( y k − ) ≥ n (cid:107) y k − y k − (cid:107) . (18)To bound [ · ] , observe that y k and y k − only differ over the coordinate j k , which is chosen uniformly atrandom, independent of the history. Further, recall that B = n [ b , b , . . . , b n ] T , and let B − j k denote thematrix B with its j k th row replaced by a zero vector. Then, we have E (cid:2) − b Tj k ¯ x k − ( y k,j k − v j k ) |F k − (cid:3) = E (cid:2) (cid:104)− n B ¯ x k − , y k − v (cid:105) + (cid:104) n B − j k ¯ x k − , y k − v (cid:105)|F k − (cid:3) = E (cid:2) (cid:104)− n B ¯ x k − , y k − v (cid:105)|F k − (cid:3) + (cid:104) ( n − B ¯ x k − , y k − − v (cid:105) , where the second equality follows from y k being equal to y k − over all the coordinates apart from j k , andfrom j k being chosen uniformly at random. Taking expectations on both sides of the last equality andusing (cid:104) B ¯ x k − , y k − v (cid:105) = (cid:104) B ¯ x k − , y k − y k − (cid:105) + (cid:104) B ¯ x k − , y k − − v (cid:105) and the tower property of conditionalexpectation, we have E (cid:2) − b Tj k ¯ x k − ( y k,j k − v j k ) (cid:3) = E (cid:2) (cid:104)− B ¯ x k − , y k − v (cid:105) (cid:3) − ( n − E (cid:2) (cid:104) B ¯ x k − , y k − y k − (cid:105) (cid:3) . (19)To finish bounding E (cid:2) − b Tj k ¯ x k − ( y k,j k − v j k ) (cid:3) (and, consequently, E [[ · ] ]), we now proceed to bound theterms inside the expectations in Eq. (19). First, adding an subtracting x k in the first inner product termand using the definition of ¯ x k − , we have (cid:104)− B ¯ x k − , y k − v (cid:105) = (cid:104)− B ( ¯ x k − − x k ) , y k − v (cid:105) − (cid:104) Bx k , y k − v (cid:105) = (cid:104) B ( x k − x k − ) , y k − v (cid:105) − a k − a k (cid:104) B ( x k − − x k − ) , y k − v (cid:105) − (cid:104) Bx k , y k − v (cid:105) = (cid:104) B ( x k − x k − ) , y k − v (cid:105) − a k − a k (cid:104) B ( x k − − x k − ) , y k − − v (cid:105)− a k − a k (cid:104) B ( x k − − x k − ) , y k − y k − (cid:105) − (cid:104) Bx k , y k − v (cid:105) . (20)20n the other hand, using the definition of ¯ x k − , we also have (cid:104) B ¯ x k − , y k − y k − (cid:105) = (cid:104) Bx k − , y k − y k − (cid:105) + a k − a k (cid:104) B ( x k − − x k − ) , y k − y k − (cid:105) = (cid:104) B ( x k − − u ) , y k − y k − (cid:105) + (cid:104) Bu , y k − y k − (cid:105) + a k − a k (cid:104) B ( x k − − x k − ) , y k − y k − (cid:105) . (21)Thus, combining Eqs. 
(19)-(21) with the definition of [ · ] , we have: E (cid:2) [ · ] (cid:3) = E (cid:2) a k g ∗ j k ( y k,j k ) + a k (cid:104) B ( x k − x k − ) , y k − v (cid:105) − a k − (cid:104) B ( x k − − x k − ) , y k − − v (cid:105)− na k − (cid:104) B ( x k − − x k − ) , y k − y k − (cid:105) − a k (cid:104) Bx k , y k − v (cid:105)− ( n − a k (cid:0) (cid:104) B ( x k − − u ) , y k − y k − (cid:105) + (cid:104) Bu , y k − y k − (cid:105) (cid:1)(cid:3) . (22)The bound on ψ k ( y k ) from the statement of the lemma now follows by combining Eq. (17) with Eqs. (18)and (22).To bound φ k ( x k ) , we use similar arguments to those we used above for bounding ψ k ( y k ). In particular,from the definition of φ k ( x k ) , and using that φ k − ( x k − ) is ( n + σA k − )-strongly convex and minimized at x k − , we have φ k ( x k ) ≥ φ k − ( x k − ) + n + σA k − (cid:107) x k − x k − (cid:107) + a k (cid:0) (cid:104) x k − u , z k − + ( y k,j k − y k − ,j k ) b j k (cid:105) + (cid:96) ( x k ) (cid:1) . (23)Since y k and y k − only differ on their j k element, by the definition of z k , we have z k = z k − + B T ( y k − y k − ),so by a recursive argument it follows that z i = B T y i for all i = 1 , , . . . , k . Thus, we have z k − + ( y k,j k − y k − ,j k ) b j k = B T y k − + n B T ( y k − y k − )= B T y k + ( n − B T ( y k − y k − ) , and consequently (cid:104) x k − u , z k − + ( y k,j k − y k − ,j k ) b j k (cid:105) = (cid:104) x k − u , B T y k (cid:105) + ( n − (cid:104) x k − u , B T ( y k − y k − ) (cid:105) . (24)To complete the proof, it remains to combine Eqs. (23) and (24).Using Lemmas 3–5, we are now ready to prove our main result. Theorem 2.
Suppose that Assumption 2 holds. Then for any $(u, v) \in \mathcal{X} \times \mathcal{Y}$, the vectors $x_k, y_k$, $k = 2, 3, \dots, K$, and the average $\tilde{x}_K$ generated by Algorithm 2 satisfy the following bound for $k = 2, 3, \dots, K$:
$$\mathbb{E}[\mathrm{Gap}_{u,v}(\tilde{x}_k, \tilde{y}_k)] \leq \frac{n(\|u - x_0\|^2 + \|v - y_0\|^2)}{2A_k},$$
where $\tilde{y}_K := \frac{1}{A_K}\big(n a_K y_K + \sum_{i=2}^{K-1}(n a_i - (n-1)a_{i+1})\, y_i\big)$. Moreover, if $(x^*, y^*)$ is a primal-dual solution to (PD), then
$$\mathbb{E}\Big[\frac{n}{2}\|y^* - y_k\|^2 + \frac{n + \sigma A_k}{2}\|x^* - x_k\|^2\Big] \leq \frac{n(\|x^* - x_0\|^2 + \|y^* - y_0\|^2)}{2}.$$
In both cases, $A_k$ is bounded below as follows:
$$A_k \geq \max\Big\{\frac{n-1}{2R'}\Big(\frac{n}{n-1}\Big)^{k}\,\mathbb{1}\{k \leq k_0\},\ \frac{(n-1)^2\sigma}{(4R')^2 n}(k - k_0 + n - 1)^2\,\mathbb{1}\{k \geq k_0\},\ \frac{n(k - K_0 + n - 1)}{2R'}\,\mathbb{1}\{k \geq K_0\}\Big\},$$
where $\mathbb{1}$ denotes the indicator function, $K_0 = \big\lceil\frac{\log(n)}{\log(n) - \log(n-1)}\big\rceil$, $k_0 = \big\lceil\frac{\log B_{n,\sigma,R'}}{\log(n) - \log(n-1)}\big\rceil$, and
$$B_{n,\sigma,R'} = \frac{\sigma n(n-1)}{4R'} + \sqrt{\Big(\frac{\sigma n(n-1)}{4R'}\Big)^2 + n^2} \geq n\max\Big\{1, \frac{\sigma(n-1)}{2R'}\Big\}.$$

Proof.
Fix any ( u , v ) ∈ X × Y . By combining the bounds on ψ k ( y k ) and φ k ( x k ) from Lemma 5, we have ∀ k ≥ E [ ψ k ( y k ) + φ k ( x k )] ≥ E (cid:104) ψ k − ( y k − ) + φ k − ( x k − )+ a k (cid:104) B ( x k − x k − ) , y k − v (cid:105) − a k − (cid:104) B ( x k − − x k − ) , y k − − v (cid:105) + P k + Q k , (cid:105) . (25)where P k = n (cid:107) y k − y k − (cid:107) + n + σA k − (cid:107) x k − x k − (cid:107) − na k − (cid:104) B ( x k − − x k − ) , y k − y k − (cid:105) + ( n − a k (cid:104) x k − x k − , B T ( y k − y k − ) (cid:105) , (26)and Q k = a k ( g ∗ j k ( y k,j k ) + (cid:96) ( x k ) − (cid:104) Bu , y k (cid:105) + (cid:104) x k , B T v (cid:105) − ( n − (cid:104) Bu , y k − y k − (cid:105) ) . (27)Observe that the first line Eq. (25) gives the desired recursive relationship for the sum of estimate sequences,while the second line telescopes. Thus, we only need to focus on bounding P k and Q k .To bound P k , we start by bounding the inner product terms that appear in it. Recall that y k and y k − only differ on coordinate j k . We thus have (cid:104) B ( x k − − x k − ) , y k − y k − (cid:105) = 1 n b Tj k ( x k − − x k − )( y k,j k − y k − ,j k ) ≤ n (cid:16) α (cid:0) b Tj k ( x k − − x k − ) (cid:1) + α y k,j k − y k − ,j k ) (cid:17) , for any α > , by Young’s inequality. Further, as, by Assumption 2, max ≤ j ≤ n (cid:107) b j (cid:107) ≤ R (cid:48) , applyingCauchy-Schwarz inequality, we have that b Tj k ( x k − − x k − ) ≤ R (cid:48) (cid:107) x k − − x k − (cid:107) , and, hence (cid:104) B ( x k − − x k − ) , y k − y k − (cid:105) ≤ n (cid:16) R (cid:48) α (cid:107) x k − − x k − (cid:107) + α (cid:107) y k − y k − (cid:107) (cid:17) . (28)Similarly, ∀ β > , (cid:104) B ( x k − x k − ) , y k − y k − (cid:105) ≥ − n (cid:18) R (cid:48) β (cid:107) x k − x k − (cid:107) + β (cid:107) y k − y k − (cid:107) (cid:19) . (29)Thus, by combining Eqs. (26), (28), and (29), we have ∀ α, β > P k ≥ n − αna k − − β ( n − a k n (cid:107) y k − y k − (cid:107) + n ( n + σA k − ) − ( n − a k R (cid:48) /β n (cid:107) x k − x k − (cid:107) − a k − R (cid:48) α (cid:107) x k − − x k − (cid:107) . Taking α = n a k − , β = n a k , and recalling that by our choice of step sizes in Algorithm 2, a k ≤ √ n ( n + σA k − )2 R (cid:48) , ∀ k ≥ , we can further simplify the bound on P k to P k ≥ n + σA k − (cid:107) x k − x k − (cid:107) − n + σA k − (cid:107) x k − − x k − (cid:107) , (30)22hich telescopes. Combining the bound on P k from Eq. (30) with the initial bound on ψ k ( y k ) + φ k ( x k ) fromEq. (25) and telescoping, we have E [ ψ k ( y k ) + φ k ( y k )] ≥ E (cid:104) ψ ( x ) + φ ( y )+ a k (cid:104) B ( x k − x k − ) , y k − v (cid:105) − a (cid:104) B ( x − x ) , y − v (cid:105) + n + σA k − (cid:107) x k − x k − (cid:107) − n (cid:107) x − x (cid:107) + k (cid:88) i =2 Q i (cid:105) . (31)We now proceed to bound E (cid:2) (cid:80) ki =2 Q i (cid:3) . Observe first that g ∗ j i ( y i,j i ) = n ( g ∗ ( y i ) − n (cid:80) j (cid:54) = j i g ∗ j ( y i − ,j )) . Thus, E (cid:2) g ∗ j i ( y i,j i ) (cid:3) = E (cid:104) E (cid:104) n (cid:16) g ∗ ( y i ) − n (cid:88) j (cid:54) = j i g ∗ j ( y i − ,j ) (cid:17)(cid:12)(cid:12)(cid:12) F i − (cid:105)(cid:105) = E (cid:2) ng ∗ ( y i ) − ( n − g ∗ ( y i − ) (cid:3) , where F i − is the natural filtration, containing all randomness up to and including iteration i − . 
Therefore,we can bound E (cid:2) (cid:80) ki =2 Q i (cid:3) as follows: E (cid:2) k (cid:88) i =2 Q i (cid:3) = E (cid:34) k (cid:88) i =2 a i ( ng ∗ ( y i ) − ( n − g ∗ ( y i − )) (cid:35) + E (cid:34) k (cid:88) i =2 a i (cid:96) ( x i ) (cid:35) + E (cid:34) k (cid:88) i =2 a i ( − n (cid:104) Bu , y i (cid:105) + ( n − (cid:104) Bu , y i − (cid:105) + (cid:104) x i , B T v (cid:105) ) (cid:35) = E (cid:104) na k g ∗ ( y k ) + k − (cid:88) i =2 ( na i − ( n − a i +1 ) g ∗ ( y i ) − ( n − a g ∗ ( y ) (cid:105) + E (cid:104) − na k (cid:104) Bu , y k (cid:105) + k − (cid:88) i =2 ( − na i + ( n − a i +1 ) (cid:104) Bu , y i (cid:105) + ( n − a (cid:104) Bu , y (cid:105) (cid:105) + E (cid:34) k (cid:88) i =2 a i (cid:96) ( x i ) + k (cid:88) i =2 a i (cid:104) x i , B T v (cid:105) (cid:35) . (32)On the other hand, recall that, by Lemma 3, we have ψ ( y ) + φ ( x ) = n (cid:107) y − y (cid:107) + n (cid:107) x − x (cid:107) + a ( g ∗ ( y ) + (cid:96) ( x ) − (cid:104) Bu , y (cid:105) + (cid:104) x , B T v (cid:105) + (cid:104) x − x , B T ( y − v ) (cid:105) )= n (cid:107) y − y (cid:107) + n (cid:107) x − x (cid:107) + ( n − a ( g ∗ ( y ) − (cid:104) Bu , y (cid:105) )+ a ( (cid:96) ( x ) + (cid:104) x , B T v (cid:105) + (cid:104) x − x , B T ( y − v ) (cid:105) ) , (33)where we have used the setting a = a n − of Algorithm 2.23hus, combining Eqs. (31)–(33), we have E [ ψ k ( y k ) + φ k ( y k )] ≥ E (cid:34) a k (cid:104) B ( x k − x k − ) , y k − v (cid:105) + n + σA k − (cid:107) x k − x k − (cid:107) + n (cid:107) y − y (cid:107) + n (cid:107) x − x (cid:107) + na k g ∗ ( y k ) + k − (cid:88) i =2 ( na i − ( n − a i +1 ) g ∗ ( y i ) − na k (cid:104) Bu , y k (cid:105) − k − (cid:88) i =2 ( na i − ( n − a i +1 ) (cid:104) Bu , y i (cid:105) + k (cid:88) i =1 a i (cid:96) ( x i ) + k (cid:88) i =1 a i (cid:104) x i , B T v (cid:105) (cid:35) . Using convexity of g ∗ and (cid:96) (to apply Jensen’s inequality) and the definitions of ˜ x k , ˜ y k , it now follows that E [ ψ k ( y k ) + φ k ( y k )] ≥ E (cid:34) a k (cid:104) B ( x k − x k − ) , y k − v (cid:105) + n + σA k − (cid:107) x k − x k − (cid:107) + n (cid:107) y − y (cid:107) + n (cid:107) x − x (cid:107) + A k ( g ∗ ( ˜ y k ) + (cid:96) ( ˜ x k ) − (cid:104) Bu , ˜ y k (cid:105) + (cid:104) ˜ x k , B T v (cid:105) ) (cid:35) . (34)Convexity can be used here because, due to our setting of a i , we have na i − ( n − a i +1 ≥ na k + (cid:80) k − i =2 ( na i − ( n − a i +1 ) = na + (cid:80) ki =3 a i = A k , as na = a + a , and ˜ x k , ˜ y k are chosen as˜ y k = 1 A k (cid:16) na k y k + k − (cid:88) i =2 ( na i − ( n − a i +1 ) y i (cid:17) , ˜ x k := 1 A k k (cid:88) i =1 a k x i . Recalling the definition (1) of Gap u , v ( x , v ) and rearranging Eq. (34), we have A k E [Gap u , v ( ˜ x k , ˜ y k )] ≤ E (cid:104) ψ k ( y k ) + φ k ( x k ) − A k ( g ∗ ( v ) + (cid:96) ( u )) − a k (cid:104) B ( x k − x k − ) , y k − v (cid:105) − n + σA k − (cid:107) x k − x k − (cid:107) − n (cid:107) y − y (cid:107) − n (cid:107) x − x (cid:107) (cid:105) . To complete the proof, it remains to apply Lemma 4, and simplify. 
In particular, as E (cid:104) k (cid:88) i =2 a i g ∗ j i ( v j i ) + a g ∗ ( v ) (cid:105) = k (cid:88) i =2 a i g ∗ ( v ) + a g ∗ ( v ) = A k g ∗ ( v ) , (35)we have E [ ψ k ( y k ) + φ k ( y k )] ≤ E (cid:34) A k g ∗ ( v ) + n (cid:107) v − y (cid:107) − n (cid:107) v − y k (cid:107) + A k (cid:96) ( u ) + n (cid:107) u − x (cid:107) − n + σA k (cid:107) u − x k (cid:107) (cid:35) , A k E [Gap u , v ( ˜ x k , ˜ y k )] ≤ E (cid:104) n ( (cid:107) u − x (cid:107) + (cid:107) v − y (cid:107) )2 − n (cid:107) v − y k (cid:107) − n + σA k (cid:107) u − x k (cid:107) − a k (cid:104) B ( x k − x k − ) , y k − v (cid:105) − n + σA k − (cid:107) x k − x k − (cid:107) − n (cid:107) y − y (cid:107) − n (cid:107) x − x (cid:107) (cid:105) . Finally, we have from Young’s inequality and the definition of a k that − a k (cid:104) B ( x k − x k − ) , y k − v (cid:105) ≤ a k (cid:107) B (cid:107)(cid:107) x k − x k − (cid:107)(cid:107) y k − v (cid:107)≤ R (cid:48) a k (cid:107) x k − x k − (cid:107)(cid:107) y k − v (cid:107)≤ R (cid:48) a k n (cid:107) x k − x k − (cid:107) + n (cid:107) v − y k (cid:107) ≤ n + σA k − (cid:107) x k − x k − (cid:107) + n (cid:107) v − y k (cid:107) , leading to A k E [Gap u , v ( ˜ x k , ˜ y k )] ≤ E (cid:104) n ( (cid:107) u − x (cid:107) + (cid:107) v − y (cid:107) )2 − n (cid:107) v − y k (cid:107) − n + σA k (cid:107) u − x k (cid:107) (cid:105) . Similarly as in the proof of Theorem 1, it now follows that, ∀ ( u , v ) ∈ X × Y , E [Gap u , v ( ˜ x k , ˜ y k )] ≤ n ( (cid:107) u − x (cid:107) + (cid:107) v − y (cid:107) )2 A k and, as Gap x ∗ , y ∗ ( ˜ x k , ˜ y k ) ≥
0, we also have
$$\mathbb{E}\Big[\frac{n}{2}\|y^* - y_k\|^2 + \frac{n + \sigma A_k}{2}\|x^* - x_k\|^2\Big] \leq \frac{n(\|x^* - x_0\|^2 + \|y^* - y_0\|^2)}{2}.$$
The bound on $A_k$ follows by applying Lemma 7 (Appendix C).

C Growth of Sequences

We first state a general lemma that is useful for bounding the convergence rate of both pda and vrpda.

Lemma 6.
Let { A k } k ≥ be a sequence of nonnegative real numbers such that A = 0 and A k is definedrecursively via A k = A k − + (cid:112) c + c A k − , where c > , and c ≥ . Define K = (cid:100) c c (cid:101) . Then A k ≥ c (cid:16) k − K + max (cid:110) (cid:113) c c , (cid:111)(cid:17) , if c > and k > K ,c k, otherwise . Proof.
As, by assumption, c ≥
0, we have that A k ≥ A k − + c , which, combined with A = 0 , implies A k ≥ c k. Thus, we only need to focus on proving the stated bound in the case that c > k > K . To prove the lemma, let us start by assuming that there exist p > , q > , and k ∈ { , , . . . } , such that A k ≥ pq , and A k − ≥ p ( k − − k + q ) , (36)25or some k ≥ k + 1 (observe that the inequalities are consistent for k − k ). Then, by induction on k, wecan prove that A k ≥ p ( k − k + q ) , for some specific p, q, k . In particular: A k ≥ A k − + (cid:112) c A k − ≥ p ( k − − k + q ) + (cid:112) c p ( k − − k + q ) = p ( k − k + q ) − p ( k − k + q ) + p + √ c p ( k − − k + q )= p ( k − k + q ) − √ p (2 √ p − √ c )( k − k ) + p (1 − q ) + √ c p ( − q ) . Let c = 9 p and q ≥
1. Then, we have A k ≥ p ( k − k + q ) + p ( k − k ) + p (1 − q ) + 3 p ( − q ) ≥ p ( k − k + q ) + p ( − q ) ≥ p ( k − k + q ) , where we have used k ≥ k + 1 . It remains to show that we can choose p, q, k that make the definition of A k from the statement of thelemma consistent with the assumption from Eq. (36).If c ≤ c , we have A = c ≥ c = p. Thus, to satisfy Eq. (36), we can set k = K = (cid:100) c c (cid:101) = 1 and q = 3 (cid:113) c c . On the other hand, if c > c , we have A = c < c = p. Thus, by setting k = K = (cid:100) c c (cid:101) and q = 1 > (cid:113) c c and using that A k ≥ c k (argued at the beginning of the proof), we have A k ≥ c = p. We now examine the properties of the sequence { a k } k ≥ defined in Algorithm 2 and restated here: a = n R (cid:48) , a = 1 n − a , a k = min (cid:110)(cid:16) n − (cid:17) a k − , (cid:112) n ( n + σA k − )2 R (cid:48) (cid:111) for k ≥
3. (37)

We also examine the growth properties of the sequence $\{A_k\}_{k \geq 1}$ defined by $A_k = \sum_{i=1}^{k} a_i$.

Proposition 1.
Suppose that n ≥ . Then there exists an index k such that A k = n − R (cid:48) (cid:16) n − (cid:17) k for all k ≤ k , where k = (cid:24) log B n,σ,R (cid:48) log n − log( n − (cid:25) , and B n,σ,R (cid:48) = σn ( n − R (cid:48) + (cid:115)(cid:18) σn ( n − R (cid:48) (cid:19) + n ≥ n max (cid:26) , σ ( n − R (cid:48) (cid:27) . Further, ( n −
1) log B n,σ,R (cid:48) ≤ k ≤ . n −
1) log B n,σ,R (cid:48) + 1 , and we can conclude that the dependence of k on n is k = Ω( n log n ) .Proof. Cases k = 1 , A = a = n R (cid:48) = n − R (cid:48) (cid:0) n − (cid:1) and A = a + a = a (cid:0) n − (cid:1) = n − R (cid:48) (cid:0) n − (cid:1) . Observe that also a = R (cid:48) (cid:0) n − (cid:1) . For k > , as long as a k − (cid:0) n − (cid:1) ≤ √ n ( n + σA k − )2 R (cid:48) for all successive iterates, we have that a k = a k − (cid:16) n − (cid:17) = 12 R (cid:48) (cid:16) n − (cid:17) k − . A k = (cid:80) ki =1 a i , we have A k = n R (cid:48) + 12 R (cid:48) k (cid:88) i =2 (cid:16) n − (cid:17) i − = n R (cid:48) + 12 R (cid:48) (cid:16) (cid:0) n − (cid:1) k −
11 + n − − − (cid:17) = n − R (cid:48) (cid:16) n − (cid:17) k . Now let k be the first iteration for which a k (cid:16) n − (cid:17) > (cid:112) n ( n + σA k )2 R (cid:48) . (38)Since k is the first such iteration, we also have (by the argument above) that a k = R (cid:48) (cid:0) n − (cid:1) k − and A k = n − R (cid:48) (cid:16) n − (cid:17) k . Thus, by using these equalities and squaring both sides in (38), we obtain n ( n − · R (cid:48) ) (cid:18) n − (cid:19) k − > n + nσ n − R (cid:48) (cid:16) n − (cid:17) k (2 R (cid:48) ) . After simplifying the last expression, we have (cid:16) n − (cid:17) k > n (cid:16) σ ( n − nR (cid:48) (cid:16) n − (cid:17) k (cid:17) , and we seek the smallest positive integer k that satisfies this property. By introducing the notation r = (cid:16) n − (cid:17) k , we can write this condition as r > n (cid:18) σ ( n − nR (cid:48) r (cid:19) ⇔ r − σn ( n − R (cid:48) r − n > , from which we obtain (by solving the quadratic) that r > B n,σ,R (cid:48) , where B n,σ,R (cid:48) = 12 (cid:18) σn ( n − R (cid:48) + (cid:114)(cid:16) σn ( n − R (cid:48) (cid:17) + 4 n (cid:19) , which is identical to the definition of B n,σ,R (cid:48) in the statement of the result.Thus, k is the smallest integer such that (cid:18) n − (cid:19) k > B n,σ,R (cid:48) , or, in other words, k = (cid:100) κ (cid:101) , where κ satisfies (cid:18) n − (cid:19) κ = B n,σ,R (cid:48) , which yields the main result when we take logs of both sides.By using log(1 + δ ) ∈ ((1 / . δ, δ ) for δ ∈ (0 , . κ = log B n,σ,R (cid:48) log (cid:16) n − (cid:17) ∈ (1 , . n −
1) log B n,σ,R (cid:48) for n ≥
2. The final claim is immediate.
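To build intuition for Proposition 1 and the propositions that follow, the sketch below simulates the step-size recursion (37) numerically. It is an illustration only: the values chosen for $n$, $\sigma$, and $R'$ are hypothetical and are not taken from the paper's experiments. The sketch exhibits the two growth phases of $A_k$ established in this appendix: a geometric phase lasting for roughly $k_0 = \Omega(n\log n)$ iterations, followed by quadratic growth when $\sigma > 0$ (linear growth when $\sigma = 0$).

```python
import math

def simulate_A(n, sigma, R_prime, num_iters):
    """Simulate the step sizes a_k and the partial sums A_k from Eq. (37)."""
    a1 = n / (2.0 * R_prime)             # a_1 = n / (2 R')
    a2 = a1 / (n - 1.0)                  # a_2 = a_1 / (n - 1)
    a, A = [a1, a2], [a1, a1 + a2]
    for _ in range(3, num_iters + 1):
        geometric = (n / (n - 1.0)) * a[-1]
        capped = math.sqrt(n * (n + sigma * A[-1])) / (2.0 * R_prime)
        a_next = min(geometric, capped)  # step-size rule of Eq. (37)
        a.append(a_next)
        A.append(A[-1] + a_next)
    return a, A

if __name__ == "__main__":
    n, sigma, R_prime = 100, 1e-2, 1.0   # hypothetical parameters, illustration only
    a, A = simulate_A(n, sigma, R_prime, num_iters=5000)
    # First index at which the square-root cap is active, i.e., the end of the
    # purely geometric phase (roughly k_0 from Proposition 1).
    k0 = next((k for k in range(3, len(a) + 1)
               if a[k - 1] < (n / (n - 1.0)) * a[k - 2] - 1e-12), None)
    print("geometric phase ends near k =", k0)
    if k0 is not None:
        for k in (k0, 2 * k0, 4 * k0):
            print("A_%d = %.3e" % (k, A[k - 1]))
```

A larger $\sigma$ raises the square-root cap in the min, so the geometric branch remains active longer; this is consistent with $k_0$ growing with $B_{n,\sigma,R'}$ in Proposition 1.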
Proposition 2. For $k_0$ defined in Proposition 1 and $n \geq 2$, we have for all $k \geq k_0$ that
$$a_k\Big(\frac{n}{n-1}\Big) > \frac{\sqrt{n(n + \sigma A_k)}}{2R'}.$$
Thus, we have that
$$a_{k+1} = \frac{\sqrt{n(n + \sigma A_k)}}{2R'}, \quad \text{for all } k \geq k_0. \qquad (39)$$

Proof.
Suppose for the purpose of contradiction that there is k ≥ k such that a k (cid:18) n − (cid:19) > (cid:112) n ( n + σA k )2 R (cid:48) (40) a k +1 (cid:18) n − (cid:19) ≤ (cid:112) n ( n + σA k +1 )2 R (cid:48) . (41)It follows from (40) that a k +1 = (cid:112) n ( n + σA k )2 R (cid:48) . (42)By squaring both sides of (41) and using (42) and A k +1 = A k + a k , we have (cid:18) n − (cid:19) a k +12 ≤ n ( n + σA k +1 )(2 R (cid:48) ) ⇔ (cid:18) nn − (cid:19) n ( n + σA k )(2 R (cid:48) ) ≤ n ( n + σA k + σa k +1 )(2 R (cid:48) ) ⇔ (cid:18) n ( n − − (cid:19) n ( n + σA k )(2 R (cid:48) ) ≤ nσa k +1 (2 R (cid:48) ) ⇔ n − n − a k +12 ≤ nσ (2 R (cid:48) ) a k +1 ⇔ a k +1 ≤ n ( n − σ (2 R (cid:48) ) (2 n − , and it follows by taking logs of both sides of the last expression thatlog a k +1 ≤ log (cid:16) R (cid:48) (cid:17) + log (cid:16) n − n − (cid:17) + log (cid:16) σn ( n − R (cid:48) (cid:17) . (43)We now obtain a lower bound on a k . Since, as noted in the proof of Proposition 1, we have a k = R (cid:48) (cid:16) n − (cid:17) k − , we have, using the definitions of k and B n,σ,R (cid:48) in the statement of Proposition 1 thatlog a k = log (cid:16) R (cid:48) (cid:17) + ( k −
1) log (cid:16) nn − (cid:17) = log (cid:16) R (cid:48) (cid:17) − log (cid:16) nn − (cid:17) + k log (cid:16) nn − (cid:17) ≥ log (cid:16) R (cid:48) (cid:17) − log (cid:16) nn − (cid:17) + log( B n,σ,R (cid:48) ) > log (cid:16) R (cid:48) (cid:17) − log (cid:16) nn − (cid:17) + log (cid:16) σn ( n − R (cid:48) (cid:17) Now for n ≥
2, we have − log (cid:16) nn − (cid:17) = log (cid:16) n − n (cid:17) ≥ log (cid:16) n − n − (cid:17) ,
28o by substituting in the last expression above, we obtainlog a k > log (cid:16) R (cid:48) (cid:17) + log (cid:16) n − n − (cid:17) + log (cid:16) σn ( n − R (cid:48) (cid:17) . (44)By comparing Eq. (43) and Eq. (44), we see that a k +1 < a k which (since { a i } i ≥ is a monotonically increasingsequence) implies that k + 1 < k , which contradicts our choice of k . Thus, no such k exists, and our proof iscomplete.Using Proposition 2, we have that for all k ≥ k , A k +1 = A k + √ n ( n + σA k )2 R (cid:48) . Thus, we can use Lemma 6 toconclude that after iteration k , the growth of A k is quadratic. However, to obtain tighter constants, we willderive a slightly tighter bound in the following proposition. Proposition 3.
Let $k_0$, $B_{n,\sigma,R'}$ be defined as in Proposition 1. Then, for all $k \geq k_0$,
$$A_k \geq c\,(k - k_0 + n - 1)^2, \quad \text{where } c = \frac{(n-1)^2\sigma}{(4R')^2 n}.$$

Proof.
We prove the proposition by induction on k . Observe that B n,σ,R (cid:48) ≥ σn ( n − R (cid:48) . Applying Proposition 1,we have that A k ≥ n − R (cid:48) B n,σ,R (cid:48) ≥ ( n − nσ R (cid:48) ) = 4 cn ≥ c ( n − , so the claim holds for k = k . Now assume that the claim holds for k ≥ k and consider iteration k + 1 . Wehave A k +1 = A k + (cid:112) n + nσA k R (cid:48) > c ( k − k + n − + √ ncσ R (cid:48) ( k − k + n − . Let us argue that √ ncσ R (cid:48) ( k − k + n − ≥ c ( k − k + n ) . We note first that √ ncσ R (cid:48) = √ nσ R (cid:48) ( n − √ σ R (cid:48) √ n = ( n − σ R (cid:48) ) = nn − c = ⇒ c = (cid:18) − n (cid:19) √ ncσ R (cid:48) . We then have √ ncσ R (cid:48) ( k − k + n − − c ( k − k + n ) = (cid:16) √ ncσ R (cid:48) − c (cid:17) ( k − k + n ) − √ ncσ R (cid:48) ≥ √ ncσ R (cid:48) n ( k − k + n ) − √ ncσ R (cid:48) ≥ , where we have used k ≥ k . Hence, we have that: A k +1 > c ( k − k + n − + 2 c ( k − k + n ) = c ( k − k + n ) + 1 > c ( k − k + n ) , establishing the inductive step and proving the claim.Proposition 3 is mainly useful when σ is not too small. For small or zero values of σ, however, we canshow that after k iterations the growth of A k is at least a linear function of k , as follows. Proposition 4.
Let $K_0 = \big\lceil\frac{\log(n)}{\log(n) - \log(n-1)}\big\rceil$ and $n \geq 2$. Then, for all $k \geq K_0$, we have that
$$A_k \geq \frac{n(k - K_0 + n - 1)}{2R'}.$$

Proof.
Since K ≤ k , we have by Proposition 1 that A K ≥ n ( n − R (cid:48) and a K ≥ n − R (cid:48) . As a k = min (cid:8)(cid:0) n − (cid:1) a k − , √ n ( n + σA k − )2 R (cid:48) (cid:9) and σ ≥ , for all k ≥ , we have that a k ≥ n R (cid:48) for all k ≥ K + 1 , leading to theclaimed bound on A k . lgorithm 3 Variance Reduction via Primal-Dual Accelerated Dual Averaging ( vrpda , ImplementationVersion) Input: ( x , y ) ∈ X × Y , ( u , v ) ∈ X × Y . a = A = 0 , ˜ a = R (cid:48) , ˜ p = − Bx . y = prox ˜ a g ∗ ( y − ˜ a ˜ p ) . z = B T y . x = prox ˜ a (cid:96) ( x − ˜ a z ) . a = A = n ˜ a , a = n − a , A = A + a . p = a ˜ p , q = a z , r = n a . for k = 2 , , . . . , K do ¯ x k − = x k − + a k − a k ( x k − − x k − ) . Pick j k uniformly at random in [ n ] . p k,i = (cid:40) p k − ,i , i (cid:54) = j k p k − ,i − a k b Tj k ¯ x k − , i = j k , r k,i = (cid:40) r k − ,i , i (cid:54) = j k r k − ,i + a k , i = j k . y k,i = (cid:40) y k − ,i , i (cid:54) = j k prox n r k,jk g ∗ jk ( y ,j k − n p k,j k ) , i = j k . q k = q k − + a k ( z k − + ( y k,j k − y k − ,j k ) b j k ) . x k = prox n A k (cid:96) ( x − n q k ) . z k = z k − + n ( y k,j k − y k − ,j k ) b j k . a k +1 = min (cid:16)(cid:0) n − (cid:1) a k , √ n ( n + σA k )2 R (cid:48) (cid:17) , A k +1 = A k + a k +1 . end for return x K or ˜ x K := A K (cid:80) Kk =1 a k x k .We can now combine Propositions 1-4 to obtain a lower bound on A k , as summarized in the followinglemma. Its proof is a direct consequence of Propositions 1-4, and is thus omitted. Lemma 7.
Let the sequences $\{a_k\}_{k\geq 1}$, $\{A_k\}_{k\geq 1}$ be defined by Eq. (37). Then:
$$A_k \geq \max\Big\{\frac{n-1}{2R'}\Big(\frac{n}{n-1}\Big)^{k}\,\mathbb{1}\{k \leq k_0\},\ \frac{(n-1)^2\sigma}{(4R')^2 n}(k - k_0 + n - 1)^2\,\mathbb{1}\{k \geq k_0\},\ \frac{n(k - K_0 + n - 1)}{2R'}\,\mathbb{1}\{k \geq K_0\}\Big\},$$
where $\mathbb{1}$ denotes the indicator function and
$$K_0 = \Big\lceil\frac{\log n}{\log n - \log(n-1)}\Big\rceil, \quad k_0 = \Big\lceil\frac{\log B_{n,\sigma,R'}}{\log n - \log(n-1)}\Big\rceil, \quad B_{n,\sigma,R'} = \frac{\sigma n(n-1)}{4R'} + \sqrt{\Big(\frac{\sigma n(n-1)}{4R'}\Big)^2 + n^2} \geq n\max\Big\{1, \frac{\sigma(n-1)}{2R'}\Big\}.$$

D Efficient Implementation of VRPDA and Other Computational Considerations

We discuss here an equivalent form of Algorithm 2 in which the aspects needed for efficient implementation are treated more transparently. To implement Algorithm 2, we use the proximal operator $\mathrm{prox}_{\tau f}(\bar{x}) = \arg\min_{x}\{\tau f(x) + \frac{1}{2}\|x - \bar{x}\|^2\}$ for a scalar $\tau > 0$, a convex function $f$, and a given point $\bar{x}$.
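As a concrete illustration of the proximal operators that Algorithm 3 relies on, the sketch below gives closed forms for the two prox steps that arise in the elastic net-regularized SVM experiments of Section 5: the prox of the hinge-loss conjugate (a shift followed by a projection onto $[-1, 0]$) and the prox of the elastic-net regularizer (soft-thresholding followed by a rescaling). These are standard closed forms written here for illustration, not code from the paper, and the parameter names lam1 and sigma are ours.

```python
import numpy as np

def prox_hinge_conjugate(v, tau):
    """prox_{tau * g*}(v) for the hinge loss g(z) = max(0, 1 - z).

    The convex conjugate is g*(y) = y on [-1, 0] (and +inf elsewhere), so the
    prox reduces to a shift followed by a projection onto [-1, 0].
    """
    return np.clip(v - tau, -1.0, 0.0)

def prox_elastic_net(v, tau, lam1, sigma):
    """prox_{tau * l}(v) for l(x) = lam1 * ||x||_1 + (sigma / 2) * ||x||^2.

    Soft-thresholding handles the l1 term; the quadratic term only rescales.
    """
    shrunk = np.sign(v) * np.maximum(np.abs(v) - tau * lam1, 0.0)
    return shrunk / (1.0 + tau * sigma)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    v = rng.standard_normal(5)
    print(prox_hinge_conjugate(v, tau=0.3))   # entries always land in [-1, 0]
    print(prox_elastic_net(v, tau=0.3, lam1=0.1, sigma=1e-3))
```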
This implementation version (shown here as Algorithm 3) maintains several additional vectors that are needed to keep track of coefficients in the "arg min" Steps 11 and 12 of Algorithm 2. In particular, we maintain a vector $p_k \in \mathbb{R}^n$ that contains the coefficients of the linear term in $y$ in the function $\psi_k(\cdot)$, a vector $r_k \in \mathbb{R}^n$ that contains the coefficients of $g^*_j(\cdot)$ in $\psi_k(\cdot)$, and a vector $q_k \in \mathbb{R}^d$ that is the coefficient of the linear term in $x$ in the function $\phi_k(\cdot)$. Each of these vectors generally must be stored in full, and all are initialized in Step 7 of Algorithm 3.

In Step 11 of Algorithm 3, only one component (the $j_k$ component) of $p_k$ and $r_k$ needs to be updated. For $r_k$, the cost of the update is one scalar addition, whereas for $p_k$ it requires computation of the inner product $b_{j_k}^T \bar{x}_{k-1}$, which costs $O(d)$ scalar operations if $b_{j_k}$ is dense but potentially much less for sparse $B$. The update of $q_k$ (Step 14 of Algorithm 3) requires the addition of scalar multiples of two vectors ($z_{k-1}$ and $b_{j_k}$), also at a cost of $O(d)$ operations in general. (For the latter operation, savings are possible if $B$ is sparse, but since $z_{k-1}$ is generally dense, the overall cost will still be $O(d)$.) In discussing the original Algorithm 2, we noted that the arg min operation over $y$ in Step 11 results in only a single component (component $j_k$) differing between $y_{k-1}$ and $y_k$. We make this fact explicit in Step 12 of Algorithm 3, where we show precisely the prox operation that needs to be performed to recover the scalar $y_{k,j_k}$.

To summarize, each iteration of Algorithm 3 requires several vector operations with the (possibly sparse) vector $b_{j_k} \in \mathbb{R}^d$, several other vector operations of cost $O(d)$, a prox operation involving $\ell(x)$, and a scalar prox operation involving $g^*_{j_k}$. There are no operations of cost $O(n)$.

Initialization of Algorithm 3 involves more significant costs, including one matrix-vector product each with $B$ and $B^T$, one prox operation involving $\ell(\cdot)$, a prox operation involving $g^*(\cdot)$ (which can be implemented as $n$ scalar prox operations involving $g^*_1, g^*_2, \dots, g^*_n$ in turn), and several vector operations of cost $O(d + n)$. However, this cost is absorbed by the overall (amortized) computational cost of the algorithm, which adds up to $O(nd\log(\min\{n, 1/\epsilon\}) + d/\epsilon)$ for the general convex case and $O(nd\log(\min\{n, 1/\epsilon\}) + d/\sqrt{\sigma\epsilon})$ for the strongly convex case, as stated in Theorem 2 and in Table 1.
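The key bookkeeping device behind these $O(d)$ per-iteration costs is that $z_k = B^T y_k$ is maintained incrementally: when only coordinate $j_k$ of the dual vector changes, $z$ can be corrected with one scaled row of $B$ instead of being recomputed. The fragment below sketches only this device, under the convention $B = \frac{1}{n}[b_1, \dots, b_n]^T$ used in the text; it is not a full implementation of vrpda, and the variable names are ours.

```python
import numpy as np

def coordinate_dual_update(b_rows, y, z, j, y_j_new):
    """Update z = (1/n) * sum_i y_i * b_i after changing the single coordinate y_j.

    The correction costs O(d) (or O(nnz(b_j)) for sparse data) instead of the
    O(n d) cost of recomputing B^T y from scratch.
    """
    n = len(y)
    z += (y_j_new - y[j]) * b_rows[j] / n
    y[j] = y_j_new
    return y, z

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    n, d = 6, 4
    b_rows = rng.standard_normal((n, d))     # rows b_1, ..., b_n
    y = rng.standard_normal(n)
    z = b_rows.T @ y / n                     # z = B^T y with B = (1/n)[b_1, ..., b_n]^T
    y, z = coordinate_dual_update(b_rows, y, z, j=2, y_j_new=0.7)
    assert np.allclose(z, b_rows.T @ y / n)  # incremental update matches recomputation
    print(z)
```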
E Additional Discussion of Numerical Experiments

We now provide additional discussion of the numerical results from Section 5. The two datasets we use,
a9a and MNIST, are large scale, with $n = 32561$, $d = 123$ for a9a and $n = 60000$, $d = 780$ for MNIST. As mentioned in Section 5, the plotted function value gap was evaluated using an estimated value $\tilde{f}^*$ of $f^* = \min_{x} f(x)$. For the plots to depict an accurate estimate of the function value gap, the true gap $f - f^*$ must dominate the error of the estimate, $\tilde{f}^* - f^*$. In our numerical experiments, this is achieved by running the algorithms for 30 times more iterations than are shown in the plots, taking the lowest function value $f_{\min}$ seen over these runs and over all the algorithms, and then setting $\tilde{f}^* = f_{\min} - \delta$, where $\delta$ is a small positive constant whose value is chosen depending on $\sigma$.

As can be observed from Figure 1, there is a noticeable difference in the performance of all the algorithms when the function value gap is evaluated at the average iterate versus the last iterate. For vrpda, which is a dual averaging-style method whose dual averaging component promotes sparsity (Xiao, 2010), this difference comes from the significantly different sparsity of the average and last iterates: as shown in Figure 2, the average iterate is less sparse but provides a more accurate fit, while the last iterate is sparser (and thus more robust) but less accurate. For spdhg, the last iterate is significantly more accurate than the average iterate in the strongly convex settings (i.e., for the two nonzero values of $\sigma$), because the simple uniform averaging we use may not be the best choice in those settings. (As we mentioned in the main body, spdhg (Chambolle et al., 2018) provides no results for the average iterate in the strongly convex setting.) Compared with vrpda, the better performance of spdhg in terms of the last iterate is partly due to the fact that it is a mirror descent-style algorithm with a less sparse last iterate. In our experiments, the pure cd algorithm is always worse than vrpda and spdhg, which is partly consistent with its worse convergence guarantee, as shown in Table 1. However, as pure cd is specifically designed for sparse datasets, it may have better runtime performance on sparse data (as shown in Alacaoglu et al. (2020)), which is beyond the scope of this paper.

Figure 2: Comparison of sparsity for vrpda, spdhg, and pure cd run on the elastic net-regularized SVM problem, on the a9a and MNIST datasets. Panels (a)-(l) cover both datasets, $\sigma = 0$ and the two strongly convex settings, and both the average and the last iterate. In all plots, $\sigma$ is the strong convexity parameter of the regularizer $\ell$; "last" refers to the last iterate, "average" to the average iterate. For all problem instances, vrpda generally constructs the sparsest solutions of the three algorithms. (The number of nonzeros is computed by counting the elements whose absolute value exceeds a small fixed threshold.)

Meanwhile, the performance of the average iterate of vrpda and the last iterate of spdhg is almost the same (the figures for the average iterate and the last iterate under the same setting use the same scale), which is surprising, as vrpda has $\sqrt{n}$-times better theoretical guarantees than spdhg for small $\epsilon$. The better theoretical guarantees of vrpda come from the particular initialization strategy inspired by Song et al. (2020a).
However, similar to the experimental results in Song et al. (2020a), we do not observe the performance gain of the initialization strategy in practice (although it does not worsen the empirical performance either). Thus, it is of interest to explore whether the initialization strategy is essential for the improved algorithm performance or whether it is only needed for the theoretical argument to go through.

Finally, as is common practice, we chose not to plot $\|x_k - x^*\|$, since doing so would require estimating $x^*$, which is much harder than estimating $f^*$. In particular, estimating $x^*$ is out of reach for problems that are not strongly convex, even if $f$ were smooth, due to known lower bounds (see, e.g., Nesterov (2018, Theorem 2.1.7)). For strongly convex problems, we can obtain an estimate $\tilde{x}^*$ of $x^*$ from the function value gap $f(\tilde{x}^*) - f(x^*)$ using strong convexity of $f$, since in this case $\|\tilde{x}^* - x^*\| \leq \sqrt{\frac{2}{\sigma}\big(f(\tilde{x}^*) - f(x^*)\big)}$. However, guaranteeing error $\delta = \|\tilde{x}^* - x^*\|$ through the function value gap would require $f(\tilde{x}^*) - f(x^*) \leq \frac{\sigma\delta^2}{2}$, which, even for a modest target $\delta$ in the "well-conditioned" setting of our experiments, would require minimizing $f$ to error on the order of $10^{-20}$.
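As a minimal sketch of the evaluation protocol described above, the reported function value gap and nonzero counts can be computed as follows. The objective histories, the margin delta, and the sparsity threshold in the sketch are placeholder values chosen for illustration, not the constants used in our plots.

```python
import numpy as np

def estimate_fstar(f_histories, delta):
    """f~* = (lowest objective value seen over all runs and algorithms) - delta."""
    f_min = min(min(history) for history in f_histories)
    return f_min - delta

def num_nonzeros(x, threshold=1e-7):
    """Count entries whose magnitude exceeds a small threshold (placeholder value)."""
    return int(np.sum(np.abs(np.asarray(x)) > threshold))

if __name__ == "__main__":
    # Hypothetical objective histories from three algorithms (longer reference runs).
    f_histories = [[0.93, 0.41, 0.308], [0.93, 0.52, 0.311], [0.93, 0.67, 0.350]]
    f_star_est = estimate_fstar(f_histories, delta=1e-9)
    gaps = [f - f_star_est for f in f_histories[0]]   # plotted function value gap
    print(f_star_est, gaps, num_nonzeros([0.0, 1e-9, 0.3, -2.0]))
```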