[PDF] Un-regularizing: approximate proximal point and faster stochastic algorithms for empirical risk minimization

Abstract

We develop a family of accelerated stochastic algorithms that minimize sums of convex functions. Our algorithms improve upon the fastest running time for empirical risk minimization (ERM), and in particular linear least-squares regression, across a wide range of problem settings. To achieve this, we establish a framework based on the classical proximal point algorithm. Namely, we provide several algorithms that reduce the minimization of a strongly convex function to approximate minimizations of regularizations of the function. Using these results, we accelerate recent fast stochastic algorithms in a black-box fashion. Empirically, we demonstrate that the resulting algorithms exhibit notions of stability that are advantageous in practice. Both in theory and in practice, the provided algorithms reap the computational benefits of adding a large strongly convex regularization term, without incurring a corresponding bias to the original problem.

Full PDF

UUn-regularizing: approximate proximal point and faster stochasticalgorithms for empirical risk minimization

Roy Frostig ∗ , Rong Ge , Sham M. Kakade , and Aaron Sidford Stanford University Microsoft Research, New England MIT

Abstract

We develop a family of accelerated stochastic algorithms that minimize sums of convexfunctions. Our algorithms improve upon the fastest running time for empirical risk minimization(ERM), and in particular linear least-squares regression, across a wide range of problem settings.To achieve this, we establish a framework based on the classical proximal point algorithm .Namely, we provide several algorithms that reduce the minimization of a strongly convex func-tion to approximate minimizations of regularizations of the function. Using these results, weaccelerate recent fast stochastic algorithms in a black-box fashion.Empirically, we demonstrate that the resulting algorithms exhibit notions of stability thatare advantageous in practice. Both in theory and in practice, the provided algorithms reap thecomputational beneﬁts of adding a large strongly convex regularization term, without incurringa corresponding bias to the original problem.

A general optimization problem central to machine learning is that of empirical risk minimization (ERM): ﬁnding a predictor or regressor that minimizes a sum of loss functions deﬁned by a datasample. We focus in part on the problem of empirical risk minimization of linear predictors: givena set of n data points a i , . . . , a n ∈ R d and convex loss functions φ i : R → R for i = 1 , . . . , n , solvemin x ∈ R n F ( x ) , where F ( x ) def = n (cid:88) i =1 φ i ( a T i x ) . (1)This problem underlies supervised learning ( e.g. the training of logistic regressors when φ i ( z ) =log(1 + e − zb i ), or their regularized form when φ i ( z ) = log(1 + e − zb i ) + γ n (cid:107) x (cid:107) for a scalar γ > φ i ( z ) = ( z − b i ) .Over the past ﬁve years, problems such as (1) have received increased attention, with a recentburst of activity in the design of fast randomized algorithms. Iterative methods that randomly ∗ This is an extended and updated version of our conference paper that appeared in

Proceedings of the 32 nd International Conference on Machine Learning , Lille, France, 2015. JMLR: W&CP Volume 37.Email: [email protected] , [email protected] , [email protected] , [email protected] . a r X i v : . [ s t a t . M L ] J un ample the φ i have been shown to outperform standard ﬁrst-order methods under mild assumptions(Bottou and Bousquet, 2008; Johnson and Zhang, 2013; Xiao and Zhang, 2014; Defazio et al., 2014;Shalev-Shwartz and Zhang, 2014).Despite the breadth of these recent results, their running time guarantees when solving theERM problem (1) are sub-optimal in terms of their dependence on a natural notion of the prob-lem’s condition number (See Section 1.1). This dependence can, however, signiﬁcantly impact theirguarantees on running time. High-dimensional problems encountered in practice are often poorlyconditioned. In large-scale machine learning applications, the condition number of the ERM prob-lem (1) captures notions of data complexity arising from variable correlation in high dimensionsand is hence prone to be very large.More speciﬁcally, among the recent randomized algorithms, each one either:1. Solves the ERM problem (1), under an assumption of strong convexity, with convergencethat depends linearly on the problem’s condition number (Johnson and Zhang, 2013; Defazioet al., 2014).2. Solves only an explicitly regularized ERM problem, min x { F ( x ) + λr ( x ) } where the regularizer r is a known 1-strongly convex function and λ must be strictly positive, even when F is itselfstrongly convex. One such result is due to Shalev-Shwartz and Zhang (2014) and is theﬁrst to achieve acceleration for this problem, i.e. dependence only on the square root of theregularized problem’s condition number, which scales inversely with λ . Hence, taking small λ to solve the ERM problem (where λ = 0 in eﬀect) is not a viable option.In this paper we show how to bridge this gap via black-box reductions. Namely, we developalgorithms to solve the ERM problem (1) – under a standard assumption of strong convexity –through repeated, approximate minimizations of the regularized ERM problem min x { F ( x ) + λr ( x ) } for fairly large λ . Instantiating our framework with known randomized algorithms that solve theregularized ERM problem, we achieve accelerated running time guarantees for solving the originalERM problem.The key to our reductions are approximate variants of the classical proximal point algorithm (PPA) (Rockafellar, 1976; Parikh and Boyd, 2014). We show how both PPA and the inner minimiza-tion procedure can then be accelerated and our analysis gives precise approximation requirementsfor either option. Furthermore, we show further practical improvements when the inner minimizeroperates by a dual ascent method. In total, this provides at least three diﬀerent algorithms forachieving an improved accelerated running time for solving the ERM problem (1) under the stan-dard assumption of strongly convex F and smooth φ i . (Table 1 summarizes our improvements incomparison to existing minimization procedures.)Perhaps the strongest and most general theoretical reduction we provide in this paper is en-compassed by the following theorem which we prove in Section 3. Theorem 1.1 (Accelerated Approximate Proximal Point Algorithm) . Let f : R n → R be a µ -strongly convex function and suppose that, for all x ∈ R n , c > , λ > , we can compute a point x c (possibly random) such that E f ( x c ) − min x (cid:26) f ( x ) + λ (cid:107) x − x (cid:107) (cid:27) ≤ c (cid:20) f ( x ) − min x (cid:26) f ( x ) + λ (cid:107) x − x (cid:107) (cid:27)(cid:21) n time T c . Then given any x , c > , λ ≥ µ , we can compute x such that E f ( x ) − min x f ( x ) ≤ c (cid:104) f ( x ) − min x f ( x ) (cid:105) in time O (cid:18) T λ + µµ ) / (cid:112) (cid:100) λ/µ (cid:101) log c (cid:19) . This theorem essentially states that we can use a linearly convergent algorithm for minimiz-ing f ( x ) + λ (cid:107) x − x (cid:107) in order to minimize f , while incurring a multiplicative overhead of only O ( (cid:112) (cid:100) λ/µ (cid:101) polylog ( λ/µ )). Applying this theorem to previous state-of-the-art algorithms improvesboth the running time for solving (1), as well as the following more general ERM problem:min x ∈ R d n (cid:88) i =1 ψ i ( x ) , where ψ i : R d → R . (2)Problem (2) is fundamental in the theory of convex optimization and covers ERM problems formulticlass and structured prediction.There are a variety of additional extensions to the ERM problem to which some of our analysiseasily applies. For instance, we could work in more general normed spaces, allow non-uniformsmoothness of the φ , add an explicit regularizer, etc. However, to simplify exposition and compari-son to related work, we focus on (1) and make clear the extensions to (2) in Section 3. These casescapture the core of the arguments presented and illustrate the generality of this approach.Several of the algorithmic tools and analysis techniques in this paper are similar in principle to(and sometimes appear indirectly in) work scattered throughout the machine learning and optimiza-tion literature – from classical treatments of error-tolerant PPA (Rockafellar, 1976; Guler, 1992) tothe eﬀective proximal term used by Accelerated Proximal SDCA Shalev-Shwartz and Zhang (2014)in enabling its acceleration.By analyzing these as separate tools, and by bookkeeping the error requirements that theyimpose, we are able to assemble them into algorithms with improved guarantees. We believe thatthe presentation of Accelerated APPA (Algorithm 2) arising from this view simpliﬁes, and clariﬁesin terms of broader convex optimization theory, the “outer loop” steps employed by AcceleratedProximal SDCA. More generally, we hope that disentangling the relevant algorithmic componentsinto this general reduction framework will lead to further applications both in theory and in practice. We consider the ERM problem (1) in the following common setting:

Assumption 1.2 (Regularity) . Each loss function φ i is L -smooth, i.e. for all x, y ∈ R , φ ( y ) ≤ φ ( x ) + φ (cid:48) ( x )( y − x ) + L y − x ) , and the sum F is µ -strongly convex, i.e. for all x, y ∈ R d , F ( x ) ≥ F ( x ) + ∇ F ( x ) T ( y − x ) + µ (cid:107) y − x (cid:107) .

3e let R def = max i (cid:107) a i (cid:107) and let A ∈ R n × d be the matrix whose i ’th row is a T i . We refer to κ def = (cid:100) LR /µ (cid:101) as the condition number of (1).Although many algorithms are designed for special cases of the ERM objective F where there issome known, exploitable structure to the problem, our aim is to study the most general case subjectto Assumption 1.2. To standardize the comparison among algorithms, we consider the followinggeneric model of interaction with F : Assumption 1.3 (Computational model) . For any i ∈ [ n ] and x ∈ R d , we consider two primitiveoperations: • For b ∈ R , compute the gradient of x (cid:55)→ φ i ( a T i x − b ) . • For b ∈ R , c ∈ R d , minimize φ i ( a T i x ) + b (cid:107) x − c (cid:107) .We refer to these operations, as well as to the evaluation of φ i ( a T i x ) , as single accesses to φ i , andassume that these operations can be performed in O ( d ) time. Notation

Denote [ n ] def = { , . . . , n } . Denote the optimal value of a convex function by f opt =min x f ( x ), and, when f is clear from context, let x opt denote a minimizer. A point x (cid:48) is an (cid:15) -approximate minimizer of f if f ( x (cid:48) ) − f opt ≤ (cid:15) . The Fenchel dual of a convex function f : R k → R is f ∗ : R k → R deﬁned by f ∗ ( y ) = sup x ∈ R k {(cid:104) y, x (cid:105) − f ( x ) } . We use (cid:101) O ( · ) to hide factorspolylogarithmic in n , L , µ , λ , and R , i.e. (cid:101) O ( f ) = O ( f polylog ( n, L, µ, λ, R )). Regularization and duality

Throughout the paper we let F : R d → R denote a µ -stronglyconvex function. For certain results presented, F must in particular be the ERM problem (1),while other statements hold more generally. We make it clear on a case-by-case basis when F musthave the ERM structure as in (1).Beginning in Section 1.3 and throughout the remainder of the paper, we frequently consider thefunction f s,λ ( x ), deﬁned for all x, s ∈ R d and λ > f s,λ ( x ) def = F ( x ) + λ (cid:107) x − s (cid:107) (3)In such context, we let x opt s,λ def = argmin x f s,λ ( x ) and we call κ λ def = (cid:100) LR /λ (cid:101) the regularized condition number .When F is indeed the ERM objective (1), certain algorithms for minimizing f s,λ operate in the regularized ERM dual . Namely, they proceed by decreasing the negative dual objective g s,λ : R n → R , given by g s,λ ( y ) def = G ( y ) + 12 λ (cid:107) A T y (cid:107) − s T A T y, (4)where G ( y ) def = (cid:80) ni =1 φ ∗ i ( y i ). Similar to the above, we let y opt s,λ def = argmin y g s,λ ( y ).4 mpirical risk minimization Algorithm Running time ProblemGD dn κ log( (cid:15) /(cid:15) ) F Accel. GD dn / √ κ log( (cid:15) /(cid:15) ) F SAG, SVRG dnκ log( (cid:15) /(cid:15) ) F SDCA dnκ λ log( (cid:15) /(cid:15) ) F + λr AP-SDCA dn √ κ λ log( (cid:15) /(cid:15) ) F + λr APCG dn √ κ λ log( (cid:15) /(cid:15) ) F + λr This work dn √ κ log( (cid:15) /(cid:15) ) F Linear least-squares regression

Algorithm Running time ProblemNaive mult. nd (cid:107) Ax − b (cid:107) Fast mult. nd ω − (cid:107) Ax − b (cid:107) Row sampling ( nd + d ω ) log( (cid:15) /(cid:15) ) (cid:107) Ax − b (cid:107) OSNAP ( nd + d ω ) log( (cid:15) /(cid:15) ) (cid:107) Ax − b (cid:107) R. Kaczmarz dnκ log( (cid:15) /(cid:15) ) Ax = b Acc. coord. dn √ κ log( (cid:15) /(cid:15) ) Ax = b This work dn √ κ log( (cid:15) /(cid:15) ) (cid:107) Ax − b (cid:107) Table 1: Theoretical performance comparison on ERM and linear regression. Running times holdin expectation for randomized algorithms. In the “problem” column for ERM, F marks algorithmsthat can optimize the ERM objective (1), while F + λr marks those that only solve the explicitlyregularized problem. For linear regression, Ax = b marks algorithms that only solve consistentlinear systems, whereas (cid:107) Ax − b (cid:107) marks those that more generally minimize the squared loss. Theconstant ω denotes the exponent of the matrix multiplication running time (currently below 2 . dual-to-primal mapping, given by (cid:98) x s,λ ( y ) def = s − λ A T y, (5)and the primal-to-dual mapping, given entrywise by[ (cid:98) y ( x )] i def = (cid:20) ∂φ i ( z ) ∂z (cid:21)(cid:12)(cid:12)(cid:12)(cid:12) z = a T i x (6)for i = 1 , . . . , n . (See Appendix B for a derivation of these facts and further properties of the dual.) In Table 1 we compare our results with the running time of both classical and recent algorithmsfor solving the ERM problem (1) and linear least-squares regression. Here we brieﬂy explain theserunning times and related work.

Empirical risk minimization

In the context of the ERM problem, GD refers to canonical gra-dient descent on F , Accel. GD is Nesterov’s accelerated gradient decent (Nesterov, 1983, 2004),SVRG is the stochastic variance-reduced gradient of Johnson and Zhang (2013), SAG is the stochas-tic average gradient of Roux et al. (2012) and Defazio et al. (2014), SDCA is the stochastic dualcoordinate ascent of Shalev-Shwartz and Zhang (2013), AP-SDCA is the Accelerated ProximalSDCA of Shalev-Shwartz and Zhang (2014) and APCG is the accelerated coordinate algorithmof Lin et al. (2014). The latter three algorithms are more restrictive in that they only solve theexplicitly regularized problem F + λr , even if F is itself strongly convex (such algorithms run intime inversely proportional to λ ).The running times of the algorithms are presented based on the setting considered in this paper, i.e. under Assumptions 1.2 and 1.3. Many of the algorithms can be applied in more general settings5 e.g. even if the function F is not strongly convex) and have diﬀerent convergence guarantees inthose cases. The running times are characterized by four parameters: d is the data dimension, n is the number of samples, κ = (cid:100) LR /µ (cid:101) is the condition number (for F + λr minimizers, thecondition number κ λ = (cid:100) LR /λ (cid:101) is used) and (cid:15) /(cid:15) is the ratio between the initial and desiredaccuracy. Running times are stated per (cid:101) O -notation; factors that depend polylogarithmically on n , κ , and κ λ are ignored. Linear least-squares regression

For the linear least-squares regression problem, there is greatervariety in the algorithms that apply. For comparison, Table 1 includes Moore-Penrose pseudoin-version – computed via naive matrix multiplication and inversion routines, as well as by theirasymptotically fastest counterparts – in order to compute a closed-form solution via the standardnormal equations. The table also lists algorithms based on the randomized Kaczmarz method(Strohmer and Vershynin, 2009; Needell et al., 2014) and their accelerated variant (Lee and Sid-ford, 2013), as well as algorithms based on subspace embedding (OSNAP) or row sampling (Nelsonand Nguyen, 2013; Li et al., 2013; Cohen et al., 2015). Some Kaczmarz-based methods only applyto the more restrictive problem of solving a consistent system (ﬁnding x satisfying Ax = b ) ratherthan minimize the squared loss (cid:107) Ax − b (cid:107) . The running times depend on the same four parameters n, d, κ, (cid:15) /(cid:15) as before, except for computing the closed-form pseudoinverse, which for simplicity weconsider “exact,” independent of initial and target errors (cid:15) /(cid:15) . Approximate proximal point

The key to our improved running times is a suite of approximateproximal point algorithms that we propose and analyze. We remark that notions of error-tolerancein the typical proximal point algorithm – for both its plain and accelerated variants – have beendeﬁned and studied in prior work (Rockafellar, 1976; Guler, 1992). However, these mainly considerthe cumulative absolute error of iterates produced by inner minimizers, assuming that such asequence is somehow produced. Since essentially any procedure of interest begins at some initialpoint – and has runtime that depends on the relative error ratio between its start and end – sucha view does not yield fully concrete algorithms, nor does it yield end-to-end runtime upper boundssuch as those presented in this paper.

Additional related work

There is an immense body of literature on proximal point methodsand alternating direction method of multipliers (ADMM) that are relevant to the approach in thispaper; see Boyd et al. (2011); Parikh and Boyd (2014) for modern surveys. We also note that theindependent work of Lin et al. (2015) contains results similar to some of those in this paper.

All formal results in this paper are obtained through a framework that we develop for iterativelyapplying and accelerating various minimization algorithms. When instantiated with recently-developed fast minimizers we obtain, under Assumptions 1.2 and 1.3, algorithms guaranteed tosolve the ERM problem in time (cid:101) O ( nd √ κ log(1 /(cid:15) )).Our framework stems from a critical insight of the classical proximal point algorithm (PPA) or proximal iteration : to minimize F (or more generally, any convex function) it suﬃces to iterativelyminimize f s,λ ( x ) def = F ( x ) + λ (cid:107) x − s (cid:107) λ > center s ∈ R d . PPA iteratively applies the update x ( t +1) ← argmin x f x ( t ) ,λ ( x )and converges to the minimizer of F . The minimization in the update is known as the proximaloperator (Parikh and Boyd, 2014), and we refer to it in the sequel as the inner minimizationproblem.In this paper we provide three distinct approximate proximal point algorithms, i.e. algorithmsthat do not require full inner minimization. Each enables the use of existing fast algorithm as itsinner minimizer, in turn yielding several ways to obtain our improved ERM running time: • In Section 2 we develop a basic approximate proximal point algorithm (APPA). The algo-rithm is essentially PPA with a relaxed requirement of inner minimization by only a ﬁxed multiplicative constant in each iteration. Instantiating this algorithm with an accelerated,regularized ERM solver – such as APCG (Lin et al., 2014) – as its inner minimizer yields theimproved accelerated running time for the ERM problem (1). • In Section 3 we develop Accelerated APPA. Instantiating this algorithm with SVRG (Johnsonand Zhang, 2013) as its inner minimizer yields the improved accelerated running time for boththe ERM problem (1) as well as the general ERM problem (2). • In Section 4 we develop Dual APPA: an algorithm whose approximate inner minimizersoperate on the dual f s,λ , with warm starts between iterations. Dual APPA enables severalinner minimizers that are a priori incompatible with APPA. Instantiating this algorithmwith an accelerate, regularized ERM solver – such as APCG (Lin et al., 2014) – as its innerminimizer yields the improved accelerated running time for the ERM problem (1).Each of the three algorithms exhibits a slight advantage over the others in diﬀerent regimes.APPA has by far the simplest and most straightforward analysis, and applies directly to any µ -strongly convex function F (not only F given by (1)). Accelerated APPA is more complicated, butin many regimes is a more eﬃcient reduction than APPA; it too applies to any µ -strongly convexfunction F and in turn proves Theorem 1.1.Our third algorithm, Dual APPA, is the least general in terms of the assumptions on which itrelies. It is the only reduction we develop that requires the ERM structure of F . However, thisalgorithm is a natural choice in conjunction with inner minimizers that operate on a popular dualobjective. In Section 5 we demonstrate moreover that this algorithm has properties that make itdesirable in practice. The remainder of this paper is organized as follows. In Section 2, Section 3, and Section 4 we stateand analyze the approximate proximal point algorithms described above. In Section 5 we discusspractical concerns and cover numerical experiments involving Dual APPA and related stochasticalgorithms. In Appendix A we prove general technical lemmas used throughout the paper and inAppendix B we provide a derivation of regularized ERM duality and related technical lemmas.7

Approximate proximal point algorithm (APPA)

In this section we describe our approximate proximal point algorithm (APPA). This algorithmis perhaps the simplest, both in its description and in its analysis, in comparison to the othersdescribed in this paper. This section also introduces technical machinery that is used throughoutthe sequel.We ﬁrst present our formal abstraction of inner minimizers (Section 2.1), then we present ouralgorithm (Section 2.2), and ﬁnally we step through its analysis (Section 2.3).

To design APPA, we ﬁrst quantify the error that can be tolerated of an inner minimizer, whileaccounting for the computational cost of ensuring such error. The abstraction we use is the followingnotion of inner approximation:

Deﬁnition 2.1.

An algorithm P is a primal ( c, λ )-oracle if, given x ∈ R d , it outputs P ( x ) that isa ([ f x,λ ( x ) − f opt x,λ ] /c ) -approximate minimizer of f x,λ in time T P . In other words, a primal oracle is an algorithm initialized at x that reduces the error of f x,λ bya 1 /c fraction, in time that depends on λ , and c , and regularity properties of F . Typical iterativeﬁrst-order algorithms, such as those in Table 1, yield primal ( c, λ )-oracles with runtimes T P thatscale inversely in λ or √ λ , and logarithmically in c . For instance: Theorem 2.2 (SVRG as a primal oracle) . SVRG (Johnson and Zhang, 2013) is a primal ( c, λ ) -oracle with runtime complexity T P = O ( nd min { κ, κ λ } log c ) for both the ERM problem (1) and thegeneral ERM problem (2) . Theorem 2.3 (APCG as an accelerated primal oracle) . Using APCG (Lin et al., 2014) we can ob-tain a primal ( c, λ ) -oracle with runtime complexity T P = (cid:101) O ( nd √ κ λ log c ) for the ERM problem (1) . Proof.

Corollary B.3 implies that, given a primal point x , we can obtain, in O ( nd ) time, a corre-sponding dual point y such that the duality gap f x,λ ( x ) + g x,λ ( y ) (and thus the dual error) is atmost O ( poly ( κ λ )) times the primal error. Lemma B.1 implies that decreasing the dual error bya factor O ( poly ( κ λ ) c ) decreases the induced primal error by c . Therefore, applying APCG to thedual and performing the primal and dual mappings yield the theorem. Our Approximate Proximal Point Algorithm (APPA) is given by the following Algorithm 1. When the oracle is a randomized algorithm, we require that expected error is the same, i.e. that the solution be (cid:15) -approximate in expectation. AP-SDCA could likely also serve as a primal oracle with the same guarantees. However, the results in Shalev-Shwartz and Zhang (2014) are stated assuming initial primal and dual variables are zero. It is not directly clear howone can provide a generic relative decrease in error from this speciﬁc initial primal-dual pair. lgorithm 1 Approximate PPA (APPA) input x (0) ∈ R d , λ > input primal ( λ + µ ) µ , λ )-oracle P for t = 1 , . . . , T do x ( t ) ← P ( x ( t − ) end foroutput x ( T ) The central goal of this section is to prove the following lemma, which guarantees a geometricconvergence rate for the iterates produced in this manner

Lemma 2.4 (Contraction in APPA) . For any c (cid:48) ∈ (0 , , x ∈ R d , and possibly randomized primal ( λ + µc (cid:48) µ , λ ) -oracle P (possibly randomized) we have E [ F ( P ( x ))] − F opt ≤ λ + c (cid:48) µλ + µ (cid:0) F ( x ) − F opt (cid:1) . (7)This lemma immediately implies the following running-time bounds for APPA. Theorem 2.5 (Un-regularizing in APPA) . Given a primal ( µ + λ ) µ , λ ) -oracle P , Algorithm 1 min-imizes the general ERM problem (2) to within accuracy (cid:15) in time O ( T P (cid:100) λ/µ (cid:101) log( (cid:15) /(cid:15) )) . Combining Theorem 2.5 and Theorem 2.3 immediately yields our desired running time forsolving (1).

Corollary 2.6.

Instantiating Algorithm 1 with the Theorem 2.3 as the primal oracle and taking λ = µ yields the running time of (cid:101) O ( nd √ κ log( (cid:15) /(cid:15) )) for solving (1) . This section gives a proof of Lemma 2.4. Throughout, no assumption is made on F aside from µ -strong convexity. Namely, we need not have F be smooth or at all diﬀerentiable.First, we consider the eﬀect of an exact inner minimizer. Namely, we prove the following lemmarelating the minimum of the inner problem f s,λ to F opt . Lemma 2.7 (Relationship between minima) . For all s ∈ R d and λ ≥ f opt s,λ − F opt ≤ λµ + λ (cid:0) F ( s ) − F opt (cid:1) . Proof.

Let x opt = argmin x F ( x ) and for all α ∈ [0 ,

1] let x α = (1 − α ) s + αx opt . The µ -strongconvexity of F implies that, for all α ∈ [0 , F ( x α ) ≤ (1 − α ) F ( s ) + αF ( x opt ) − α (1 − α ) µ (cid:107) s − x opt (cid:107) . Consequently, by the deﬁnition of f opt s,λ , f opt s,λ ≤ F ( x α ) + λ (cid:107) x α − s (cid:107) ≤ (1 − α ) F ( s ) + αF ( x opt ) − α (1 − α ) µ (cid:107) s − x opt (cid:107) + λα (cid:107) s − x opt (cid:107) Choosing α = µµ + λ yields the result. When the oracle is a randomized algorithm, the expected accuracy is at most (cid:15) . F decreases by a multiplicative λ/ ( λ + µ ). Using this we prove Lemma 2.4. Proof of Lemma 2.4.

Let x (cid:48) = P ( x ). By deﬁnition of primal oracle P we have f x,λ ( x (cid:48) ) − f opt x,λ ≤ c (cid:48) µλ + µ (cid:16) f x,λ ( x ) − f opt x,λ (cid:17) . Combining this and Lemma 2.7 we have f x,λ ( x (cid:48) ) − F opt ≤ c (cid:48) µλ + µ (cid:16) f x,λ ( x ) − f opt x,λ (cid:17) + λµ + λ (cid:0) F ( x ) − F opt (cid:1) Using that clearly for all z we have F ( z ) ≤ f x,λ ( z ) we see that F ( x (cid:48) ) ≤ f x,λ ( x (cid:48) ) and F opt ≤ f opt x,λ .Combining with the fact that f x,λ ( x ) = F ( x ) yields the result. In this section we show how generically accelerate the APPA algorithm of Section 2. AcceleratedAPPA (Algorithm 2) uses inner minimizers more eﬃciently, but requires a smaller minimizationfactor when compared to APPA. The algorithm and its analysis immediately prove Theorem 1.1and in turn yield another means by which we achieve the accelerated running time guarantees forsolving (1).We ﬁrst present the algorithm and state its running time guarantees (Section 3.1), then provethe guarantees as part of analysis (Section 3.2).

Our accelerated APPA algorithm is given by Algorithm 2. In every iteration it still makes a singlecall to a primal oracle, but rather than requiring a ﬁxed constant minimization the minimizationfactor depends polynomial on the ratio of λ and µ . Algorithm 2

Accelerated APPA input x (0) ∈ R d , µ > λ > µ input primal (4 ρ / , λ )-oracle P , where ρ = µ +2 λµ Deﬁne ζ = µ + λ v (0) ← x (0) for t = 0 , . . . , T − do y ( t ) ← (cid:16) ρ − / (cid:17) x ( t ) + (cid:16) ρ − / ρ − / (cid:17) v ( t ) x ( t +1) ← P ( y ( t ) ) g ( t ) ← λ ( y ( t ) − x ( t +1) ) v ( t +1) ← (1 − ρ − / ) v ( t ) + ρ − / (cid:2) y ( t ) − ζg ( t ) (cid:3) end foroutput x ( T ) The central goal is to prove the following theorem regarding the running time of APPA.10 heorem 3.1 (Un-regularizing in Accelerated APPA) . Given a primal (4( λ + µµ ) / , λ ) -oracle P for λ ≥ µ , Algorithm 2 minimizes the general ERM problem (2) to within accuracy (cid:15) in time O ( T P (cid:112) (cid:100) λ/µ (cid:101) log( (cid:15) /(cid:15) )) . This theorem is essentially a restatement of Theorem 1.1 and by instantiating it with Theo-rem 2.2 we obtain the following.

Corollary 3.2.

Instantiating Theorem 3.1 with SVRG (Johnson and Zhang, 2013) as the primaloracle and taking λ = 2 µ + LR yields the running time bound (cid:101) O ( nd √ κ log( (cid:15) /(cid:15) )) for the generalERM problem (2) . Here we establish the convergence rate of Algorithm 2, Accelerated APPA, and prove Theorem 3.1.Note that as in Section 2 the results in this section use nothing about the structure of F other thanstrong convexity and thus they apply to the general ERM problem (2).We remark that aspects of the proofs in this section bear resemblance to the analysis in Shalev-Shwartz and Zhang (2014), which achieves similar results in a more specialized setting.Our proof is split into the following parts. • In Lemma 3.3 we show that applying a primal oracle to the inner minimization problem givesus a quadratic lower bound on F ( x ). • In Lemma 3.4 we use this lower bound to construct a series of lower bounds for the mainobjective function f , and accelerate the APPA algorithm, comprising the bulk of the analysis. • In Lemma 3.5 we show that the requirements of Lemma 3.4 can be met by using a primaloracle that decreases the error by a constant factor. • In Lemma 3.6 we analyze the initial error requirements of Lemma 3.4.The proof of Theorem 3.1 follows immediately from these lemmas.

Lemma 3.3.

For x ∈ R n and (cid:15) > suppose that x + is an (cid:15) -approximate solution to f x ,λ . Thenfor µ (cid:48) def = µ/ , g def = λ ( x − x + ) , and all x ∈ R n we have F ( x ) ≥ F ( x + ) − µ (cid:48) (cid:107) g (cid:107) + µ (cid:48) (cid:13)(cid:13)(cid:13)(cid:13) x − (cid:18) x − (cid:18) µ (cid:48) + 1 λ (cid:19) g (cid:19)(cid:13)(cid:13)(cid:13)(cid:13) − λ + 2 µ (cid:48) µ (cid:48) (cid:15). Note that as µ (cid:48) = µ/ Proof.

Since F is µ -strongly convex clearly f x ,λ is µ + λ strongly convex, by Lemma A.1 f x ,λ ( x ) − f x ,λ ( x opt x ,λ ) ≥ µ + λ (cid:107) x − x opt (cid:107) . (8)By Cauchy-Schwartz and Young’s Inequality we know that λ + µ (cid:48) (cid:107) x − x + (cid:107) ≤ λ + µ (cid:48) (cid:16) (cid:107) x − x opt x ,λ (cid:107) + (cid:107) x opt x ,λ − x + (cid:107) (cid:17) + µ (cid:48) (cid:107) x − x opt x ,λ (cid:107) + ( λ + µ (cid:48) ) µ (cid:48) (cid:107) x opt x ,λ − x + (cid:107) , µ + λ (cid:107) x − x opt x ,λ (cid:107) ≥ λ + µ (cid:48) (cid:107) x − x + (cid:107) − λ + µ (cid:48) µ (cid:48) · λ + µ (cid:107) x opt x ,λ − x + (cid:107) . On the other hand, since f x ,λ ( x + ) ≤ f x ,λ ( x opt x ,λ )+ (cid:15) by assumption we have λ + µ (cid:107) x + − x opt (cid:107) ≤ (cid:15) and therefore f x ,λ ( x ) − f x ,λ ( x + ) ≥ f x ,λ ( x ) − f x ,λ ( x opt x ,λ ) − (cid:15) ≥ µ + λ (cid:107) x − x opt x ,λ (cid:107) − (cid:15) ≥ λ + µ (cid:48) (cid:107) x − x + (cid:107) − λ + µ (cid:48) µ (cid:48) · λ + µ (cid:107) x opt x ,λ − x + (cid:107) − (cid:15) ≥ λ + µ (cid:48) (cid:107) x − x + (cid:107) − λ + 2 µ (cid:48) µ (cid:48) (cid:15). Now since (cid:107) x − x + (cid:107) = (cid:107) x − x + 1 λ g (cid:107) = (cid:107) x − x (cid:107) + 2 λ (cid:104) g, x − x (cid:105) + 1 λ (cid:107) g (cid:107) , and using the fact that f x ,λ ( x ) = F ( x ) + λ (cid:107) x − x (cid:107) , we have F ( x ) ≥ F ( x + ) + (cid:20) λ + µ (cid:48) λ (cid:21) (cid:107) g (cid:107) + (cid:18) µ (cid:48) λ (cid:19) (cid:104) g, x − x (cid:105) + µ (cid:48) (cid:107) x − x (cid:107) − λ + 2 µ (cid:48) µ (cid:48) (cid:15). The right hand side of the above equation is a quadratic function. Looking at its gradient withrespect to x we see that it obtains its minimum when x = x − ( µ (cid:48) + λ ) g and has a minimum valueof F ( x + ) − µ (cid:48) (cid:107) g (cid:107) − λ +2 µ (cid:48) µ (cid:48) (cid:15) . Lemma 3.4.

Suppose that in each iteration t we have ψ t def = ψ opt t + µ (cid:48) (cid:107) x − v ( t ) (cid:107) such that F ( x ) ≥ ψ t ( x ) for all x . Let ρ def = µ (cid:48) + λµ (cid:48) for λ ≥ µ (cid:48) , and let • y ( t ) def = (cid:16) ρ − / (cid:17) x ( t ) + (cid:16) ρ − / ρ − / (cid:17) v ( t ) , • E [ f y ( t ) ,λ ( x ( t +1) )] − f opt y ( t ) ,λ ≤ ρ − / ( F ( x ( t ) ) − ψ opt t ) , • g ( t ) def = λ ( y ( t ) − x ( t +1) ) , • v ( t +1) def = (1 − ρ − / ) v ( t ) + ρ − / (cid:104) y ( t ) − (cid:16) µ (cid:48) + λ (cid:17) g ( t ) (cid:105) .We have E [ F ( x ( t ) ) − ψ opt t ] ≤ (cid:32) − ρ − / (cid:33) t ( F ( x ) − ψ opt ) . Proof.

Regardless of how y ( t ) is chosen we know by Lemma 3.3 that for γ = 1 + µ (cid:48) λ and all x ∈ R n F ( x ) ≥ F ( x ( t +1) ) − µ (cid:48) (cid:107) g ( t ) (cid:107) + µ (cid:48) (cid:13)(cid:13)(cid:13)(cid:13) x − (cid:18) y ( t ) − γµ (cid:48) g ( t ) (cid:19)(cid:13)(cid:13)(cid:13)(cid:13) − λ + 2 µ (cid:48) µ (cid:48) (cid:16) f y ( t ) ,λ ( x ( t +1) ) − f opt y ( t ) ,λ (cid:17) . (9)12hus, for β = 1 − ρ − / we can let ψ t +1 ( x ) def = βψ t ( x ) + (1 − β ) (cid:20) F ( x ( t +1) ) − µ (cid:48) (cid:107) g ( t ) (cid:107) + µ (cid:48) (cid:107) x − (cid:18) y ( t ) − γµ (cid:48) g ( t ) (cid:19) (cid:107) − λ + 2 µ (cid:48) µ (cid:48) ( f y ( t ) ,λ ( x ( t +1) ) − f opt y ( t ) ,λ ) (cid:21) = β (cid:20) ψ opt t + µ (cid:48) (cid:107) x − v ( t ) (cid:107) (cid:21) + (1 − β ) (cid:20) F ( x ( t +1) ) − µ (cid:48) (cid:107) g ( t ) (cid:107) + µ (cid:48) (cid:107) x − (cid:18) y ( t ) − γµ (cid:48) g ( t ) (cid:19) (cid:107) − λ + 2 µ (cid:48) µ (cid:48) ( f y ( t ) ,λ ( x ( t +1) ) − f opt y ( t ) ,λ ) (cid:21) = ψ opt t +1 + µ (cid:48) (cid:107) x − v ( t +1) (cid:107) . where in the last line we used Lemma A.3. Again, by Lemma A.3 we know that ψ opt t +1 = βψ t + (1 − β ) (cid:18) F ( x ( t +1) ) − µ (cid:48) (cid:107) g ( t ) (cid:107) − λ + 2 µ (cid:48) µ (cid:48) ( f y ( t ) ,λ ( x ( t +1) ) − f opt y ( t ) ,λ ) (cid:19) + β (1 − β ) µ (cid:48) (cid:107) v ( t ) − (cid:18) y ( t ) − γµ (cid:48) g ( t ) (cid:19) (cid:107) ≥ βψ t + (1 − β ) F ( x ( t +1) ) − (1 − β ) µ (cid:48) (cid:107) g ( t ) (cid:107) + β (1 − β ) γ (cid:68) g ( t ) , v ( t ) − y ( t ) (cid:69) − (1 − β )( λ + 2 µ (cid:48) ) µ (cid:48) ( f y ( t ) ,λ ( x ( t +1) ) − f opt y ( t ) ,λ ) . In the second step we used the following fact: − − β µ (cid:48) + β (1 − β ) µ (cid:48) · γ µ (cid:48) = 1 − β µ (cid:48) ( − βγ ) ≥ − (1 − β ) µ (cid:48) . Furthermore, expanding the term µ (cid:107) ( x − y ( t ) ) + γµ g ( t ) (cid:107) and instantiating x with x ( t ) in (9) yields F ( x ( t +1) ) ≤ F ( x ( t ) ) − λ (cid:107) g ( t ) (cid:107) + γ (cid:68) g ( t ) , y ( t ) − x ( t ) (cid:69) + λ + 2 µ (cid:48) µ (cid:48) ( f y ( t ) ,λ ( x ( t +1) ) − f opt y ( t ) ,λ ) . Consequently we know F ( x ( t +1) ) − ψ opt t +1 ≤ β [ f ( x ( t ) ) − ψ opt t ] + (cid:20) (1 − β ) µ (cid:48) − βλ (cid:21) (cid:107) g ( t ) (cid:107) + γβ (cid:68) g ( t ) , y ( t ) − x ( t ) − (1 − β )( v ( t ) − y ( t ) ) (cid:69) + ( λ + 2 µ (cid:48) ) µ (cid:48) ( f y ( t ) ,λ ( x ( t +1) ) − f opt y ( t ) ,λ )Note that we have chosen y ( t ) so that the inner product term equals 0, and we choose β = 1 − ρ − / ≥ which ensures (1 − β ) µ (cid:48) − βλ ≤ µ (cid:48) + λ ) − λ ≤ . Also, by assumption we know E [ f y ( t ) ,λ ( x ( t +1) ) − f opt y ( t ) ,λ ] ≤ ρ − / ( f ( x ( t ) ) − ψ opt t ), which implies E [ F ( x ( t +1) ) − ψ opt t +1 ] ≤ (cid:32) β + ( λ + 2 µ (cid:48) ) µ (cid:48) · ρ − / (cid:33) ( F ( x ( t ) ) − ψ opt t ) ≤ (1 − ρ − / / F ( x ( t ) ) − ψ opt t ) .

13n the ﬁnal step we are using the fact that λ +2 µ (cid:48) µ (cid:48) ≤ ρ and ρ ≥ Lemma 3.5.

Under the setting of Lemma 3.4, we have f y ( t ) ,λ ( x ( t ) ) − f opt y ( t ) ,λ ≤ F ( x ( t ) ) − ψ opt t . Inparticular, in order to achieve E [ f y ( t ) ,λ ( x ( t +1) )] ≤ ρ − / ( F ( x ( t ) ) − ψ opt t ) we only need an oracle thatshrinks the function error by a factor of ρ − / (in expectation).Proof. We know f y ( t ) ,λ ( x ( t ) ) − f ( x ( t ) ) = λ (cid:107) x ( t ) − y ( t ) (cid:107) = λ · ρ − (1 + ρ − / ) (cid:107) x ( t ) − v ( t ) (cid:107) . We will try to show the lower bound f opt y ( t ) ,λ is larger than ψ opt t by the same amount. This is becausefor all x we have f y ( t ) ,λ ( x ) = F ( x ) + λ (cid:107) x − y ( t ) (cid:107) ≥ ψ opt t + µ (cid:48) (cid:107) x − v ( t ) (cid:107) + λ (cid:107) x − y ( t ) (cid:107) . The right hand side is a quadratic function, whose optimal point is at x = µ (cid:48) v ( t ) + λy ( t ) µ (cid:48) + λ and whoseoptimal value is equal to ψ opt t + λ (cid:18) µ (cid:48) µ (cid:48) + λ (cid:19) (cid:107) v ( t ) − y ( t ) (cid:107) + µ (cid:48) (cid:18) λµ + λ (cid:19) (cid:107) v ( t ) − y ( t ) (cid:107) = ψ opt t + µ (cid:48) λ µ (cid:48) + λ ) · ρ − / ) (cid:107) x ( t ) − v ( t ) (cid:107) . By deﬁnition of ρ − , we know µ (cid:48) λ µ (cid:48) + λ ) · ρ − / ) (cid:107) x ( t ) − v ( t ) (cid:107) is exactly equal to λ · ρ − (1+ ρ − / ) (cid:107) x ( t ) − v ( t ) (cid:107) , therefore f y ( t ) ,λ ( x ( t ) ) − f opt y ( t ) ,λ ≤ F ( x ( t ) ) − ψ opt t . Remark

In the next lemma we show that moving to the regularized problem has the same eﬀecton the primal function value and the lower bound. This is a result of the choice of β in the proofof Lemma 3.4. However, this does not mean that the choice of β is very fragile. We can choose any β (cid:48) that is between the current β and 1; the eﬀect on this lemma will be that the increase in primalfunction becomes smaller than the increase in the lower bound (so the lemma continues to hold). Lemma 3.6.

Let ψ opt = F ( x (0) ) − λ +2 µ (cid:48) µ (cid:48) ( F ( x (0) ) − f opt ) , and v (0) = x (0) , then ψ = ψ opt + µ (cid:48) (cid:107) x − v (cid:107) is a valid lower bound for F . In particular when λ = LR then F ( x (0) ) − ψ opt ≤ κ ( F ( x (0) ) − f opt ) .Proof. This lemma is a direct corollary of Lemma 3.3 with x + = x (0) . In this section we develop Dual APPA (Algorithm 3), a natural approximate proximal point al-gorithm that operates entirely in the regularized ERM dual. Our focus here is on theoreticalproperties of Dual APPA; Section 5 later explores aspects of Dual APPA more in practice.We ﬁrst present an abstraction for dual-based inner minimizers (Section 4.1), then present thealgorithm (Section 4.2), and ﬁnally step through its runtime analysis (Section 4.3).14 .1 Approximate dual oracles

Our primary goal in this section is to quantify how much objective function progress an algorithmneeds to make in the dual problem, g s,λ (See Section 1.1) in order to ensure primal progress at arate similar to that in APPA (Algorithm 1).Here, similar to Section 2.1, we formally deﬁne our requirements for an approximate dual-basedinner dual minimize. In particular, we use the following notion of dual oracle. Deﬁnition 4.1.

An algorithm D is a dual ( c, λ )-oracle if, given s ∈ R d and y ∈ R n , it outputs D ( s, y ) that is a ([ g s,λ ( y ) − g opt s,λ ] /c ) -approximate minimizer of g s,λ in time T D . Dual based algorithms for regularized ERM and variants of coordinate descent typically can beused as such a dual oracle. In particular we note that APCG is such a dual oracle.

Theorem 4.2 (APCG as a dual oracle) . APCG (Lin et al., 2014) is a dual ( c, λ ) -oracle withruntime complexity T D = (cid:101) O ( nd √ κ λ log c ) . Our dual APPA is given by the following Algorithm 3.

Algorithm 3

Dual APPA input x (0) ∈ R d , λ > input dual ( σ, λ )-oracle D (see Theorem 4.3 for σ ) y (0) ← (cid:98) y ( x (0) ) for t = 1 , . . . , T do y ( t ) ← D ( x ( t − , y ( t − ) x ( t ) ← (cid:98) x x ( t − ,λ ( y ( t ) ) end foroutput x ( T ) Dual APPA (Algorithm 3) repeatedly queries a dual oracle while producing primal iterates viathe dual-to-primal mapping (5) along the way. We show that it obtains the following running timebound:

Theorem 4.3 (Un-regularizing in Dual APPA) . Given a dual ( σ, λ ) -oracle D , where σ ≥ n κ λ max { κ, κ λ }(cid:100) λ/µ (cid:101) Algorithm 3 minimizes the ERM problem (1) to within accuracy (cid:15) in time (cid:101) O ( T D (cid:100) λ/µ (cid:101) log( (cid:15) /(cid:15) )) . Combining Theorem 4.3 and Theorem 4.2 immediately yields another way to achieve our desiredrunning time for solving (1). As in the primal oracle deﬁnition, when the oracle is a randomized algorithm, we require that its output be anexpected (cid:15) -approximate solution. As in Theorem 2.3, AP-SDCA could likely also serve as a dual oracle with the same guarantees, provided it ismodiﬁed to allow for the more general primal-dual initialization. As in Theorem 2.5, when the oracle is a randomized algorithm, the expected accuracy is at most (cid:15) . orollary 4.4. Instantiating Theorem 4.3 with Theorem 4.2 as the dual oracle and taking λ = µ yields the running time bound (cid:101) O ( nd √ κ log( (cid:15) /(cid:15) )) . While both this result and the results in Section 2 show that APCG can be used to achieve ourfastest running times for solving (1), note that the algorithms they suggest are in fact diﬀerent. Inevery invocation of APCG in Algorithm 1, we need to explicitly compute both the primal-to-dualand dual-to-primal mappings (in O ( nd ) time). However, here we only need to compute the primal-to-dual mapping once upfront, in order to initialize the algorithm. Every subsequent invocationof APCG then only requires a single dual-to-primal mapping computation, which can often bestreamlined. From a practical viewpoint, this can be seen as a natural “warm start” scheme forthe dual-based inner minimizer. Here we proves Theorem 4.3. We begin by bounding the error of the dual regularized ERM problemwhen the center of regularization changes. This characterizes the initial error at the beginning ofeach Dual APPA iteration.

Lemma 4.5 (Dual error after re-centering.) . For all y ∈ R n , x ∈ R d , and x (cid:48) = (cid:98) x x ( y ) we have g x (cid:48) ,λ ( y ) − g opt x (cid:48) ,λ ≤ g x,λ ( y ) − g opt x,λ ) + 4 nκ (cid:2) F ( x (cid:48) ) − F opt + F ( x ) − F opt (cid:3) In other words, the dual error g s,λ ( y ) − g opt s,λ is bounded across a re-centering step by multiplesof previous sub-optimality measurements (namely, dual error and gradient norm). Proof.

By the deﬁnition of g x,λ and x (cid:48) we have, for all z , g x (cid:48) ,λ ( z ) = G ( z ) + 12 λ (cid:107) A T z (cid:107) − x (cid:48) T A T z = g x,λ ( z ) − ( x (cid:48) − x ) (cid:62) A T z = g x,λ ( z ) + 1 λ y T AA T z . Furthermore, since g is L -strongly convex we can invoke Lemma A.2 obtaining g x (cid:48) ,λ ( y ) − g opt x (cid:48) ,λ ≤ (cid:104) g x,λ ( y ) − g opt x (cid:48) ,λ (cid:105) + L (cid:13)(cid:13)(cid:13)(cid:13) λ AA T y (cid:13)(cid:13)(cid:13)(cid:13) . Since each row of A has (cid:96) norm at most R we know that (cid:107) Az (cid:107) ≤ nR (cid:107) z (cid:107) and we know that bydeﬁnition A T y = λ ( x − x (cid:48) ). Combining these yields g x (cid:48) ,λ ( y ) − g opt x (cid:48) ,λ ≤ (cid:104) g x,λ ( y ) − g opt x (cid:48) ,λ (cid:105) + nLR (cid:107) x − x (cid:48) (cid:107) . Finally, since F is µ -strongly convex, by Lemma A.1, we have12 (cid:107) x − x (cid:48) (cid:107) ≤ (cid:107) x (cid:48) − x opt (cid:107) + (cid:107) x − x opt (cid:107) ≤ µ (cid:2) F ( x (cid:48) ) − F opt + F ( x ) − F opt (cid:3) . Combining and recalling the deﬁnition of κ yields the result.The following lemma establishes the rate of convergence of the primal iterates { x ( t ) } producedover the course of Dual APPA, and in turn implies Theorem 4.3.16 emma 4.6 (Convergence rate of Dual APPA) . Let c (cid:48) ∈ (0 , be arbitrary and suppose that σ ≥ (40 /c (cid:48) ) n κ λ max { κ, κ λ }(cid:100) λ/µ (cid:101) in Dual APPA (Algorithm 3). Then in every iteration t ≥ ofDual APPA (Algorithm 3) the following invariants hold: F ( x ( t − ) − F opt ≤ (cid:18) λ + c (cid:48) µλ + µ (cid:19) t − (cid:16) F ( x (0) ) − F opt (cid:17) , and (10) g x ( t − ,λ ( y ( t ) ) − g opt x ( t − ,λ ≤ (cid:18) λ + c (cid:48) µλ + µ (cid:19) t − (cid:16) F ( x (0) ) − F opt (cid:17) . (11) Proof.

For notational convenience we let r def = ( λ + c (cid:48) µλ + µ ), g t def = g x ( t ) ,λ , f t def = f x ( t ) ,λ , and (cid:15) t def = F ( x ( t ) ) − F opt for all t ≥

0. Thus, we wish to show that (cid:15) t − ≤ r t − (cid:15) (equivalent to (11)) and we wish toshow that g t − ( y ( t ) ) − g opt t − ≤ r t − (cid:15) (equivalent to (10)) for all t ≥ t ≥ g t − ( y ( t ) ) − g opt t − ≤ σ (cid:104) g t − ( y t − ) − g opt t − (cid:105) , (12)by Lemma B.1 we have, for all t ≥ f t − ( x ( t ) ) − f opt t − ≤ n κ λ (cid:104) g t − ( y ( t ) ) − g opt t − (cid:105) , (13)by Lemma 4.5 we know g t ( y ( t ) ) − g opt t ≤ (cid:104) g t − ( y ( t ) ) − g opt t (cid:105) + 4 nκ ( (cid:15) t + (cid:15) t − ) , (14)and by Lemma 2.7 we know that for all t ≥ f opt t − − F opt ≤ λµ + λ (cid:15) t − (15)Furthermore, by Corollary B.3, the deﬁnition of y (0) , and the facts that f ( x (0) ) = F ( x (0) ) and f t ( z ) ≥ F ( z ) we have g ( y (0) ) − g opt0 ≤ κ λ (cid:16) f ( x (0) ) − f opt0 (cid:17) ≤ κ λ (cid:16) F ( x (0) ) − F opt (cid:17) = 2 κ λ (cid:15) (16)We show that combining these and applying strong induction on t yields the desired result.We begin with our base cases. When t = 1 the invariant (11) holds immediately by deﬁnition.Furthermore, when t = 1 we see that the invariant (10) holds, since σ ≥ κ λ and g ( y (1) ) − g opt0 ≤ σ ( g ( y (0) ) − g opt0 ) ≤ κ λ σ (cid:16) f ( x (0) ) − f opt0 (cid:17) ≤ κ λ σ (cid:15) , (17)were we used (12) and (16) respectively. Finally we show that invariant (11) holds for t = 2: F ( x (1) ) − F opt ≤ f ( x (1) ) − f opt0 + f opt0 − F opt (Since F ( z ) ≤ f t ( z ) for all t, z ) ≤ n κ λ ( g ( y (1) ) − g opt0 ) + λµ + λ (cid:15) (Equations (13) and (15)) ≤ (cid:18) n κ λ σ + λµ + λ (cid:19) (cid:15) (Equation (17)) ≤ r(cid:15) (Since σ ≥ nκ λ / ( c (cid:48) λ/ ( µ + λ )))17ow consider t ≥ t . F ( x ( t − ) − F opt ≤ f t − ( x ( t − ) − f opt t − + f opt t − − F opt (Since F ( z ) ≤ f t ( z ) for all t, z ) ≤ n κ λ ( g t − ( y t − ) − g opt t − ) + λµ + λ (cid:15) t − (Equations (13) and (15)) ≤ n κ λ σ (cid:16) g t − ( y t − ) − g opt t − (cid:17) + λµ + λ (cid:15) t − (Equation (12))Furthermore, g t − ( y t − ) − g opt t − ≤ g t − ( y t − ) − g opt t − ) + 4 nκ [ (cid:15) t − + (cid:15) t − ] (Equation (17)) ≤ (cid:0) r t − + 4 nκ ( r t − + r t − ) (cid:1) (cid:15) (Inductive hypothesis) ≤ nκr t − (cid:15) ( r ≤ κ ≥ σ ≥ n κ λ κ/ ( c (cid:48) λ/ ( µ + λ )) combining yields that2 n κ λ σ (cid:16) g t − ( y t − ) − g opt t − (cid:17) ≤ c (cid:48) µµ + λ r t − (cid:15) and the result follows by the inductive hypothesis on (cid:15) t − .Finally we show that invariant (10) holds for any t ≥ t andinvariant (11) holds for that t and all smaller t . g t − ( y ( t ) ) − g opt t − ≤ σ ( g t − ( y t − ) − g opt t − ) (Deﬁnition dual oracle.) ≤ σ (cid:104) g x ( t − ( y t − ) − g opt x ( t − ) + 4 nκ [ (cid:15) t − + (cid:15) t − ] (cid:105) (Equation (14)) ≤ σ (cid:2) r t − + 4 nκ (cid:2) r t + r t − (cid:3)(cid:3) (cid:15) (Inductive hypothesis) ≤ r t − (cid:15) ( σ ≥ nκ )The result then follows by induction. In the following two subsections, respectively, we discuss implementation details and report on anempirical evaluation of the APPA framework.

While theoretical convergence rates lay out a broad-view comparison of the algorithms in theliterature, we brieﬂy remark on some of the ﬁner-grained diﬀerences between algorithms, whichinform their implementation or empirical behavior. To match the terminology used for SVRG inJohnson and Zhang (2013), we refer to a “stage” as a single step of APPA, i.e. the time spentexecuting the inner minimization of f x ( t ) ,λ or g x ( t ) ,λ (as in (3) and (4)).18 e-centering overhead of Dual APPA vs. SVRG At the end of every one of its stages,SVRG pauses to compute an exact gradient by a complete pass over the dataset (costing Θ( nd )time during which n gradients are computed). Although an amortized runtime analysis hides thiscost, this operation cannot be carried out in-step with the iterative updates of the previous stage,since the exact gradient is computed at a point that is only selected at the stage’s end.Meanwhile, if each stage in Dual APPA is initialized with a valid primal-dual pair for the innerproblem, Dual APPA can update the current primal point together with every dual coordinateupdate, in time O ( d ), i.e. with negligible increase in the overhead of the update. When doing so,the corresponding data row remains fresh in cache and, unlike SVRG, no additional gradient needbe computed.Moreover, initializing each stage with a valid such primal-dual pair can be done in only O ( d )time. At the end of a stage where s was the center point, Dual APPA holds a primal-dual pair( x, y ) where x = (cid:98) x s ( y ). The next stage is centered at x and the dual variables initialized at y , soit remains to set up a corresponding primal point x (cid:48) = (cid:98) x x ( y ) = x − λ A T y . This can be done bycomputing x (cid:48) ← x − s , since we know that x − s = − λ A T y . Decreasing λ APPA and Dual APPA enjoy the nice property that, as long as the inner problemsare solved with enough accuracy, the algorithm does not diverge even for large choice of λ . Inpractice this allows us to start with a large λ and make faster inner minimizations. If we heuristicallyobserve that the function error is not decreasing rapidly enough, we can switch to a smaller λ .Figure 3 (Section 5.2) demonstrates this empirically. This contrasts with algorithm parameterssuch as step size choices in stochastic optimizers (that may still appear in inner minimization).Such parameters are typically more sensitive, and can suddenly lead to divergence when taken toolarge, making them less amenable to mid-run parameter tuning. Stable update steps

When used as inner minimizers, dual coordinate-wise methods such asSDCA typically provide a convenient framework in which to derive parameter updates with data-dependent step sizes, or sometimes enables closed-form updates altogether ( i.e. optimal solutions toeach single-coordinate maximization sub-problem). For example, when Dual APPA is used togetherwith SDCA to solve a problem of least-squares or ridge regression, the locally optimal SDCAupdates can be performed eﬃciently in closed form. This decreases the number of algorithmicparameters requiring tuning, improves the overall the stability of the end-to-end optimizer and, inturn, makes it easier to use out of the box.

We experiment with Dual APPA in comparison with SDCA, SVRG, and SGD on several binaryclassiﬁcation tasks.Beyond general benchmarking, the experiments also demonstrate the advantages of the unordi-nary “bias-variance tradeoﬀ” presented by approximate proximal iteration: the vanishing proximalterm empirically provides advantages of regularization (added strong convexity, lower variance)at a bias cost that is less severe than with typical (cid:96) regularization. Even if some amount of (cid:96) shrinkage is desired, Dual APPA can place yet higher weight on its (cid:96) term, enjoy improved speedand stability, and after a few stages achieve roughly the desired bias.19 atasets In this section we show results for three binary classiﬁcation tasks, derived fromMNIST, CIFAR-10, and Protein: in MNIST we classify the digits { , , , , } vs. the rest,and in CIFAR we classify the animal categories vs. the automotive ones. MNIST and CIFAR aretaken under non-linear feature transformations that increase the problem scale signiﬁcantly: wenormalize the rows by scaling the data matrix by the inverse average (cid:96) row norm. We then taketake n/ ∼ ∼

30K test) that we preprocess minimally by row normalization and anappended aﬃne feature, and whose train/test split we obtain by randomly holding out 20% of theoriginal labeled data.

Algorithms

Each algorithm is parameterized by a scalar value λ analogous to the λ used inproximal iteration: λ is the step size for SVRG, λt − / is the decaying step size for SGD, and λ (cid:107) x (cid:107) is the ridge penalty for SDCA. (See Johnson and Zhang (2013) for a comparison of SVRG toa more thoroughly tuned SGD under diﬀerent decay schemes.) We use Dual APPA (Algorithm 3)with SDCA as the inner minimizer. For the algorithms with a notion of a stage – i.e. Dual APPA’stime spent invoking the inner minimizer, SVRG’s period between computing exact gradients – weset the stage size equal to the dataset size for simplicity. SVRG is given an advantage in that wechoose not to count its gradient computations when it computes the exact gradient between stages.All algorithms are initialized at x = 0. Each algorithm was run under λ = 10 i for i = − , − , . . . , Convergence and bias

The proximal term in APPA introduces a vanishing bias for the problem(towards the initial point of x = 0) that provides a speedup by adding strong convexity to the prob-lem. We investigate a natural baseline: for the purpose of minimizing the original ERM problem,how does APPA compare to solving one instance of a regularized ERM problem (using a single runof its inner optimizer)? In other words, to what extent does re-centering the regularizer over timehelp in solving the un-regularized problem? Intuitively, even if SDCA is run to convergence, someof the minimization is of the regularization term rather than the ERM term, hence one cannotweigh the regularization too heavily. Meanwhile, APPA can enjoy more ample strong convexityby placing a larger weight on its (cid:96) term. This advantage is evident for MNIST and CIFAR inFigures 1 and 2: recalling that λ is the same strong convexity added both by APPA and by SDCA,we see that APPA takes λ at least an order of magnitude larger than SDCA does, to achieve fasterand more stable convergence towards an ultimately lower ﬁnal value.Figure 1 also shows dashed lines corresponding to the ERM performance of the least-squaresﬁt and of fully-optimized ridge regression, using λ as that of the best APPA and SDCA runs.These appear in the legend as “ls( λ ).” They indicate lower bounds on the ERM value attainableby any algorithm that minimizes the corresponding regularized ERM objective. Lastly, test setclassiﬁcation accuracy demonstrates the extent to which a shrinkage bias is statistically desirable. http://yann.lecun.com/exdb/mnist/ ∼ kriz/cifar.html http://osmot.cs.cornell.edu/kddcup/datasets.html Such a choice is justiﬁed by the observation that doubling the stage size does not have noticeable eﬀect on theresults discussed. n sdca(0 .

1) svrg(1 .

0) appa(1 .

0) sgd(1 .

0) ls(0) ls(0 .

1) ls(1 . n (a) MNIST. Left: excess train loss F ( x ) − F opt . Right: test error rate. n sdca(0 .

1) svrg(1 .

0) appa(1 .

0) sgd(1 .

0) ls(0) ls(0 .

1) ls(1 . n (b) CIFAR. Left: excess train loss F ( x ) − F opt . Right: test error rate. n sdca(0 .

1) svrg(0 .

01) appa(0 .

1) sgd(1 .

0) ls(0) ls(0 .

1) ls(0 . n (c) Protein. Left: excess train loss F ( x ) − F opt . Right: test error rate. Figure 1: Sub-optimality curves when optimizing under squared loss φ i ( z ) = n ( z − b i ) .21 n sdca(1 e −

08) svrg(10 .

0) appa(0 .

1) sgd(1000 . n (a) MNIST. Left: train loss F ( x ). Right: test error rate. n sdca(0 .

01) svrg(10 .

0) appa(0 .

1) sgd(1000 . n (b) CIFAR. Left: train loss F ( x ). Right: test error rate. Figure 2: Objective curves when optimizing under logistic loss φ i ( z ) = n log(1 + e − zb i ).In the MNIST and CIFAR holdout, we want only the small bias taken explicitly by SDCA (andeﬀectively achieved by APPA). In the Protein holdout, we want no bias at all (again eﬀectivelyachieved by APPA). Parameter sensitivity

By solving only regularized ERM inner problems, SDCA and APPAenjoy a stable response to poor speciﬁcation of the biasing parameter λ . Figure 3 plots the al-gorithms’ ﬁnal value after 20 stages, against diﬀerent choices of λ . Overestimating the step sizein SGD or SVRG incurs a sharp transition into a regime of divergence. Meanwhile, APPA andSDCA always converge, with solution quality degrading more smoothly. APPA then exhibits aneven better degradation as it overcomes an overaggressive biasing by the 20th stage.22 − − parameter setting λ ﬁn a l v a l u e sdca svrg appa sgd − − parameter setting λ ﬁn a l v a l u e (a) Squared loss. Left: MNIST. Right: CIFAR. − − parameter setting λ ﬁn a l v a l u e sdca svrg appa sgd − − parameter setting λ ﬁn a l v a l u e (b) Logistic loss. Left: MNIST. Right: CIFAR. Figure 3: Sensitivity to λ : the ﬁnal objective values attained by each algorithm, after 20 stages (orthe equivalent), with λ chosen at diﬀerent orders of magnitude. SGD and SVRG exhibit a sharpthreshold past which they easily diverge, whereas SDCA degrades more gracefully, and Dual APPAyet more so. Acknowledgments

Part of this work took place while RF and AS were at Microsoft Research, New England, andanother part while AS was visiting the Simons Institute for the Theory of Computing, UC Berkeley.This work was partially supported by NSF awards 0843915 and 1111109, NSF Graduate ResearchFellowship (grant no. 1122374).

References

L. Bottou and O. Bousquet. The tradeoﬀs of large scale learning. In

Advances in Neural InformationProcessing Systems (NIPS) , 2008. 23. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statisticallearning via the alternating direction method of multipliers.

Foundations and Trends in MachineLearning , 3(1):1–122, 2011.M. B. Cohen, Y. T. Lee, C. Musco, C. Musco, R. Peng, and A. Sidford. Uniform sampling formatrix approximation. In

Innovations in Theoretical Computer Science (ITCS) , 2015.A. Defazio, F. Bach, and S. Lacoste-Julien. Saga: A fast incremental gradient method with supportfor non-strongly convex composite objectives. In

Advances in Neural Information ProcessingSystems (NIPS) , 2014.O. Guler. New proximal point algorithms for convex minimization.

SIAM Journal on Optimization ,2(4):649–664, 1992.R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance re-duction. In

Advances in Neural Information Processing Systems (NIPS) , 2013.Y. T. Lee and A. Sidford. Eﬃcient accelerated coordinate descent methods and faster algorithmsfor solving linear systems. In

Foundations of Computer Science (FOCS) , 2013.M. Li, G. L. Miller, and R. Peng. Iterative row sampling. In

Foundations of Computer Science(FOCS) , 2013.H. Lin, J. Mairal, and Z. Harchaoui. A universal catalyst for ﬁrst-order optimization. arXiv , 2015.Q. Lin, Z. Lu, and L. Xiao. An accelerated proximal coordinate gradient method. In

Advances inNeural Information Processing Systems (NIPS) , 2014.D. Needell, N. Srebro, and R. Ward. Stochastic gradient descent, weighted sampling, and therandomized kaczmarz algorithm. In

Advances in Neural Information Processing Systems (NIPS) ,2014.J. Nelson and H. L. Nguyen. OSNAP: Faster numerical linear algebra algorithms via sparsersubspace embeddings. In

Foundations of Computer Science (FOCS) , 2013.Y. Nesterov. A method of solving a convex programming problem with convergence rate O (1 /k ). Soviet Mathematics Doklady , 27(2):372–376, 1983.Y. Nesterov.

Introductory Lectures on Convex Optimization: A Basic Course . Springer, 2004.N. Parikh and S. Boyd. Proximal algorithms.

Foundations and Trends in Optimization , 1(3):123–231, 2014.A. Rahimi and B. Recht. Random features for large-scale kernel machines. In

Advances in NeuralInformation Processing Systems (NIPS) , 2007.R. T. Rockafellar. Monotone operators and the proximal point algorithm.

SIAM Journal on Controland Optimization , 14(5):877–898, 1976.N. L. Roux, M. Schmidt, and F. Bach. A stochastic gradient method with an exponential conver-gence rate for ﬁnite training sets. In

Advances in Neural Information Processing Systems (NIPS) ,2012. 24. Shalev-Shwartz and T. Zhang. Stochastic dual coordinate ascent methods for regularized lossminimization.

Journal of Machine Learning Research (JMLR) , 14:567–599, 2013.S. Shalev-Shwartz and T. Zhang. Accelerated proximal stochastic dual coordinate ascent for regu-larized loss minimization.

Mathematical Programming , pages 1–41, 2014.T. Strohmer and R. Vershynin. A randomized kaczmarz algorithm with exponential convergence.

Journal of Fourier Analysis and Applications , 15:262–278, 2009.V. V. Williams. Multiplying matrices faster than Coppersmith-Winograd. In

Symposium on Theoryof Computing (STOC) , 2012.L. Xiao and T. Zhang. A proximal stochastic gradient method with progressive variance reduction.

SIAM Journal on Optimization , 24(4):2057–2075, 2014.

A Technical lemmas

In this section we provide several stand-alone technical lemmas we use throughout the paper. Firstwe provide Lemma A.1 some common inequalities regarding smooth or strongly convex functions,then Lemma A.2 which shows the eﬀect of adding a linear term to a convex function, and thenLemma A.3 a small technical lemma regarding convex combinations of quadratic functions.

Lemma A.1 (Standard bounds for smooth, strongly convex functions) . Let f : R k → R be diﬀer-entiable function that obtains its minimal value at x opt .If f is L -smooth then for all x ∈ R k L (cid:107)∇ f ( x ) (cid:107) ≤ f ( x ) − f ( x opt ) ≤ L (cid:107) x − x opt (cid:107) . If f is µ -strongly convex the for all x ∈ R k µ (cid:107) x − x opt (cid:107) ≤ f ( x ) − f ( x opt ) ≤ µ (cid:107)∇ f ( x ) (cid:107) . Proof.

Apply the deﬁnition of smoothness and strong convexity at the points x and x opt andminimize the resulting quadratic form. Lemma A.2.

Let f : R n → R be a µ -strongly convex function and for all a, x ∈ R n let f a ( x ) = f ( x ) + a (cid:62) x . Then f a ( x ) − f opt a ≤ f ( x ) − f opt ) + 1 µ (cid:107) a (cid:107) Proof. Let x opt = argmin x f ( x ). Since f is µ -strongly convex by Lemma A.1 we have f ( x ) ≥ f ( x opt ) + µ (cid:107) x − x opt (cid:107) for all x . Consequently, for all xf opt a ≥ f ( x ) + a (cid:62) x ≥ f ( x opt ) + µ (cid:107) x − x opt (cid:107) + a (cid:62) x ≥ f a ( x opt ) + a (cid:62) ( x − x opt ) + µ (cid:107) x − x opt (cid:107) Note we could have also proved this by appealing to the gradient of f and Lemma A.1, however the proof hereholds even if f is not diﬀerentiable. x yields that f opt a ≥ f a ( x opt ) − µ (cid:107) a (cid:107) . Consequently, by CauchySchwarz, and Young’s Inequality we have f a ( x ) − f opt a ≤ f ( x ) − f opt + a (cid:62) ( x − x opt ) + 12 µ (cid:107) a (cid:107) (18) ≤ f ( x ) − f opt + 12 µ (cid:107) a (cid:107) + µ (cid:107) x − x opt (cid:107) + 12 µ (cid:107) a (cid:107) (19)Applying A.1 again yields the result. Lemma A.3.

Suppose that for all x we have f ( x ) def = ψ + µ (cid:107) x − v (cid:107) and f ( x ) = ψ + µ (cid:107) x − v (cid:107) then αf ( x ) + (1 − α ) f ( x ) = ψ α + µ (cid:107) x − v α (cid:107) where v α = αv + (1 − α ) v and ψ α = αψ + (1 − α ) ψ + µ α (1 − α ) (cid:107) v − v (cid:107) Proof.

Setting the gradient of αf ( x ) + (1 − α ) f ( x ) to 0 we know that v α must satisfy αµ ( v α − v ) + (1 − α ) µ ( v α − v ) = 0and thus v α = αv + (1 − α ) v . Finally, ψ α = α (cid:104) ψ + µ (cid:107) v α − v (cid:107) (cid:105) + (1 − α ) (cid:104) ψ + µ (cid:107) v α − v (cid:107) (cid:105) = αψ + (1 − α ) ψ + µ (cid:2) α (1 − α ) (cid:107) v − v (cid:107) + (1 − α ) α (cid:107) v − v (cid:107) (cid:3) = αψ + (1 − α ) ψ + µ α (1 − α ) (cid:107) v − v (cid:107) . B Regularized ERM duality

In this section we derive the dual (4) to the problem of computing proximal operator for the ERMobjective (3) (Section B.1) and prove several bounds on primal and dual errors (Section B.2).Throughout this section we assume F is given by the ERM problem (1) and we make extensive useof the notation and assumptions in Section 1.1. B.1 Dual derivation

We can rewrite the primal problem, min x f s,λ ( x ), asmin x ∈ R d ,z ∈ R n (cid:80) ni =1 φ i ( z i ) + λ (cid:107) x − s (cid:107) subject to z i = a T i x, for i = 1 , . . . , n .

26y convex duality, this is equivalent tomin x, { z i } max y ∈ R n n (cid:88) i =1 φ i ( z i ) + λ (cid:107) x − s (cid:107) + y T ( Ax − z ) = max y min x, { z i } n (cid:88) i =1 φ i ( z i ) + λ (cid:107) x − s (cid:107) + y T ( Ax − z )Since min z i { φ i ( z i ) − y i z i } = − max z i { y i z i − φ i ( z i ) } = − φ ∗ i ( y i )andmin x (cid:26) λ (cid:107) x − s (cid:107) + y T Ax (cid:27) = y T As + min x (cid:26) λ (cid:107) x − s (cid:107) + y T A ( x − s ) (cid:27) = y T As − λ (cid:107) A T y (cid:107) , it follows that the optimization problem is in turn equivalent to − min y n (cid:88) i =1 φ ∗ i ( y i ) + 12 λ (cid:107) A T y (cid:107) − s T A T y. This negated problem is precisely the dual formulation.The ﬁrst problem is a Lagrangian saddle-point problem, where the Lagrangian is deﬁned as L ( x, y, z ) = n (cid:88) i =1 φ i ( z i ) + λ (cid:107) x − s (cid:107) + y T ( Ax − z ) . The dual-to-primal mapping (5) and primal-to-dual mapping (6) are implied by the KKT conditionsunder L , and can be derived by solving for x , y , and z in the system ∇L ( x, y, z ) = 0.The duality gap in this context is deﬁned asgap s,λ ( x, y ) def = f s,λ ( x ) + g s,λ ( y ) . (20)Strong duality dictates that gap s,λ ( x, y ) ≥ x ∈ R d , y ∈ R n , with equality attained when x is primal-optimal and y is dual-optimal. B.2 Error bounds

Lemma B.1 (Dual error bounds primal error) . For all s ∈ R d , y ∈ R n , and λ > we have f s,λ ( (cid:98) x s,λ ( y )) − f opt s,λ ≤ nκ λ ) ( g s,λ ( y ) − g opt s,λ ) . Proof.

Because F is nR L smooth, f s,λ is nR L + λ smooth. Consequently, for all x ∈ R d we have f s,λ ( x ) − f opt s,λ ≤ nR L + λ (cid:107) x − x opt s,λ (cid:107) Since we know that x opt s,λ = s − λ A T y opt s,λ and (cid:107) A T z (cid:107) ≤ nR (cid:107) z (cid:107) for all z ∈ R n we have f s,λ ( (cid:98) x x,λ ( y )) − f s,λ ( x opt s,λ ) ≤ nR L + λ (cid:107) s − λ A T y − ( s − λ A T y opt s,λ ) (cid:107) = nR L + λ λ (cid:107) y − y opt s,λ (cid:107) AA T ≤ nR ( nR L + λ )2 λ (cid:107) y − y opt s,λ (cid:107) . (21)27inally, since each φ ∗ i is 1 /L -strongly convex, G is 1 /L -strongly convex and hence so is g s,λ . There-fore by Lemma A.1 we have 12 L (cid:107) y − y opt s,λ (cid:107) ≤ g s,λ ( y ) − g s,λ ( y opt s,λ ) . (22)Substituting (22) in (21) and recalling that κ λ ≥ Lemma B.2 (Gap for primal-dual pairs) . For all s, x ∈ R d and λ > we have gap s,λ ( x, (cid:98) y ( x )) = 12 λ (cid:107)∇ F ( x ) (cid:107) + λ (cid:107) x − s (cid:107) . (23) Proof.

To prove the ﬁrst identity (23), let (cid:98) y = (cid:98) y ( x ) for brevity. Recall that (cid:98) y i = φ (cid:48) i ( a T i x ) ∈ argmax y i { x T a i y i − φ ∗ i ( y i ) } (24)by deﬁnition, and hence x T a i (cid:98) y i − φ ∗ i ( (cid:98) y i ) = φ i ( a T i x ). Observe thatgap s,λ ( x, (cid:98) y ) = n (cid:88) i =1 (cid:16) φ i ( a T i x ) + φ ∗ i ( (cid:98) y i ) (cid:17) − x T A T (cid:98) y + λ (cid:107) A T (cid:98) y (cid:107) + λ (cid:107) x − s (cid:107) = n (cid:88) i =1  φ i ( a T i x ) + φ ∗ i ( (cid:98) y i ) − x T a i (cid:98) y i (cid:124) (cid:123)(cid:122) (cid:125) =0 (by (24))  + λ (cid:107) A T (cid:98) y (cid:107) + λ (cid:107) x − s (cid:107) = λ (cid:107) A T (cid:98) y (cid:107) + λ (cid:107) x − s (cid:107) = λ (cid:107) n (cid:88) i =1 a i φ (cid:48) i ( a T i x ) (cid:107) + λ (cid:107) x − s (cid:107) = λ (cid:107)∇ F ( x ) (cid:107) + λ (cid:107) x − s (cid:107) . Corollary B.3 (Initial dual error) . For all s, x ∈ R d and λ > we have g x,λ ( (cid:98) y ( x )) − g opt x,λ ≤ κ λ (cid:16) f x,λ ( x ) − f opt x,λ (cid:17) Proof.

By Lemma B.2 we havegap x,λ ( x, (cid:98) y ( x )) = 12 λ (cid:107)∇ F ( x ) (cid:107) + λ (cid:107) x − x (cid:107) = 12 λ (cid:107)∇ F ( x ) (cid:107) Now clearly ∇ F ( x ) = ∇ f x,λ ( x ). Furthermore, since f x,λ ( x ) is ( nLR + λ )-smooth by Lemma A.1we have (cid:107)∇ f x,λ ( x ) (cid:107) ≤ nLR + λ )( f x,λ ( x ) − f opt x,λ ). Consequently, g x,λ ( (cid:98) y ( x )) − g opt x,λ ≤ gap x,λ ( x, (cid:98) y ( x )) ≤ nLR + λ )2 λ (cid:16) f x,λ ( x ) − f opt x,λ (cid:17) . Recalling the deﬁnition of κ λ and the fact that 1 ≤ κ λλ