Hyperfast Second-Order Local Solvers for Efficient Statistically Preconditioned Distributed Optimization
Pavel Dvurechensky, Dmitry Kamzolov, Aleksandr Lukashevich, Soomin Lee, Erik Ordentlich, César A. Uribe, Alexander Gasnikov
aa r X i v : . [ m a t h . O C ] F e b Hyperfast Second-Order Local Solvers for EfficientStatistically Preconditioned Distributed Optimization
Pavel Dvurechensky Dmitry Kamzolov Aleksandr Lukashevich Soomin Lee Erik Ordentlich C´esar A. Uribe Alexander Gasnikov
Abstract
Statistical preconditioning can be used to designfast methods for distributed large-scale empiri-cal risk minimization problems, for strongly con-vex and smooth loss functions, allowing fewercommunication rounds. Multiple worker nodescompute gradients in parallel, which are thenused by the central node to update the parameterby solving an auxiliary (preconditioned) smaller-scale optimization problem. However, previousworks require an exact solution of an auxiliaryoptimization problem by the central node at ev-ery iteration, which may be impractical. Thispaper proposes a method that allows the inexactsolution of the auxiliary problem, reducing thetotal computation time. Moreover, for loss func-tions with high-order smoothness, we exploit thestructure of the auxiliary problem and propose ahyperfast second-order method with complexity ˜ O ( κ / ) , where κ is the local condition number.Combining these two building blocks (inexact-ness and hyperfast methods), we show complex-ity estimates for the proposed algorithm, whichis provably better than classical variance reduc-tion methods and has the same convergence rateas statistical preconditioning with exact solutions.Finally, we illustrate the proposed method’s prac-tical efficiency by performing large-scale numer-ical experiments on logistic regression models. * Equal contribution Weierstrass Institute for Applied Anal-ysis and Stochastics, Berlin, Germany Moscow Institute ofPhysics and Technology, Dolgoprudny, Russia Skolkovo Insti-tute of Science and Technology, Moscow, Russia Yahoo! Re-search, Sunnyvale, CA Rice University, Houston, TX Institutefor Information Transmission Problems RAS, Moscow, Russia Higher school of economics, Moscow, Russia. Correspondenceto: Pavel Dvurechensky < [email protected] > .
1. Introduction
Efficient parallelization of large-scale learning is one of themost challenging problems in modern machine learning.Distributed computation and preconditioning have beenshown effective in accelerating optimization algorithms,specially with increasing amounts of data (Shamir et al.,2014a; Hendrikx et al., 2020b; Yuan & Li, 2020). In thispaper, we propose an efficient distributed optimization al-gorithm for solving the empirical risk minimization (ERM)problem: min x ∈ R d F ( x ) , N N X i =1 ℓ ( x ; ξ i , η i ) , (1)where { ζ i = ( ξ i , η i ) } Ni =1 are training samples, and ℓ is aconvex loss function with respect to x . Furthermore, weassume F is L F -smooth and σ F -strongly convex, i.e., σ F I d (cid:22) ∇ F ( x ) (cid:22) L F I d , (2)where I d is the d -dimensional identity matrix. The condi-tion number of F is denoted as κ F = L F /σ F , and thesolution to (1) as x ∗ .Sum-type optimization problems of the form (1) can beused to model various statistical learning problems, includ-ing least square regression, logistic regression, and supportvector machines. One characteristic of modern applicationsof (1) is the so-called large-scale regime, where N is verylarge. Having a large N poses additional challenges re-lated to the storage and processing of data, which in turndrives the need for modern distributed/federated architec-tures (Wang et al., 2018) that take advantage of parallel pro-cessing capabilities (Hendrikx et al., 2020a), e.g., ApacheSpark (Yang, 2013), Parameter Server (Li et al., 2014) andMapReduce (Dean & Ghemawat, 2008).The distributed setup assumes the N data-points (denotedas D ) are not stored or accessible at a single machine.Instead, data is distributed uniformly among m comput-ing units/nodes/agents such that D = {D , . . . , D m } and N = mn . That is, each machine j ∈ { , . . . , m } locallystores n samples D j = { ξ ( j ) i , η ( j ) i } ni =1 . Moreover, there yperfast Second-Order Local Solvers for Efficient Statistically Preconditioned Distributed Optimization is a central node, that is able to communicate with all theworker nodes. Therefore, each agent j has a local empir-ical risk, denoted as F j ( x ) , (1 /n ) P ni =1 ℓ ( x ; ξ ( j ) i , η ( j ) i ) .Thus, F ( x )= 1 N m X j =1 F j ( x )= 1 nm m X j =1 n X i =1 ℓ ( x ; ξ ( j ) i , η ( j ) i ) . (3)The distributed optimization architecture described above,with a central node and a number of workers, typi-cally consists of two essential resources: communicationand computation. Communication is usually regarded asthe most valuable resource (Lan et al., 2017). Thus, re-cent efforts (Shamir et al., 2014a; Hendrikx et al., 2020b;Yuan & Li, 2020) have been focused on the efficiency ofcommunications, where one seeks to minimize (3) withminimal communication between the workers and the cen-tral node.The iteration complexity (or the number of communicationrounds) of minimizng (3) using first-order distributed meth-ods depends on the condition number of the objective func-tion κ F . However, one can take advantage of the specificproblem formulation to reduce communication complexityin the distributed statistical learning setup. That is, thedatasets D j at different workers are i.i.d samples, and eachlocal subproblem should be similar (in a sense to be de-fined in Section 2) to the global problem when data size issufficiently large. Recent Distributed Optimization Approaches:
The distributed approximate Newton-type method(DANE) (Shamir et al., 2014a) has been one of themost popular second-order methods for communication-efficient distributed machine learning. DANE improvesthe polynomial dependency of the iteration complexityon the condition number κ F of first-order methods fordistributed empirical risk minimization problems, com-pared to the geometric rates available for centralizedmethods (Nesterov et al., 2018). Particularly, DANE hasan iteration (communication complexity) of ˜ O ( κ F /n ) forquadratic functions, and ˜ O ( κ F ) for convex non-quadraticfunctions. However, DANE requires the exact solutionof a carefully constructed subproblem, which can beimpractical (Shamir et al., 2014a). An inexact version ofDANE, termed InexactDANE (Reddi et al., 2016), andits accelerated variant, termed AIDE (Reddi et al., 2016)achieve an iteration complexity of ˜ O ( κ F ) , and ˜ O ( √ κ F ) respectively, without requiring exact solutions of the aux-iliary sub-problem. For quadratic functions InexactDANEand AIDE have an iteration complexity of ˜ O ( κ F /n ) and ˜ O ( √ κ F /n / ) respectively. Nevertheless, the advantageof preconditioning, where the condition number is effec-tively reduced as n increases, was only shown for quadraticproblems. Recently, in (Yuan & Li, 2020), the authors showed that the preconditioning effect holds locally fora variation of DANE termed DANE-HB with inexactsolutions to the local subproblem. Specifically, an iterationcomplexity of ˜ O ( d / √ κ F /n / ) was shown to hold in aneighborhood around the optimal point for non-quadraticconvex functions. Additionally, if for linear predictionmodels an improved global bound of ˜ O ( √ κ F /n / ) was shown (Yuan & Li, 2020), termed as the D ANEAlgorithm. Note that we have hidden poly-logarithmicterms in the above bounds for simplicity. One of the mainobservations in (Yuan & Li, 2020) is that the loosenessin the bounds of DANE and AIDE came from the reduce(model aggregation) step done by the central node. Thus,DANE-HB and D ANE build their results from a modifiedstructure, where the worker nodes compute gradients andcommunicate them back to the central node, which respec-tively solved the preconditioned auxiliary problem. Suchalgorithmic structure was used in (Hendrikx et al., 2020b)recently where the authors proposed the StatisticallyPreconditioned Accelerated Gradient (SPAG) method.SPAG has an iteration complexity of ˜ O ( √ κ F /n / ) forquadratic functions with direct acceleration, instead ofusing the Catalyst framework (Lin et al., 2015). SPAGwas also shown to have an asymptotic iteration com-plexity of ˜ O ( √ κ F /n / ) , with empirical evidence thatsuch behavior rates hold non-asymptotically in practice.However, exact solvers for the inner auxiliary problem arerequired. 
Such convergence rates match complexity lowerbounds (Dragomir et al., 2019; Arjevani & Shamir, 2015).In a more challenging setup (which we do not considerin this paper) of decentralized distributed optimization(Sun et al., 2019) propose an algorithm with iteration com-plexity ˜ O ( κ F / √ n ) and similar up to a network-dependentfactor communication complexity.Although SPAG obtains the near-optimal iteration complex-ity for distributed algorithms applied to (3), it strongly de-pends on the ability to exactly solve an intermediate aux-iliary optimization problem (usually in the form of a non-standard Bregman projection), whose complexity was notexplicitly taken into account in (Hendrikx et al., 2020b).More importantly, as pointed out in (Hendrikx et al.,2020b), such an intermediate problem is computationallyhard, and the accuracy of its solution dramatically affectsthe performance of the preconditioned gradient methods.We solve this issue in this paper. The key innovation in oursolution is to take into account the inexactness of the auxil-iary subproblem explicitly. Moreover, for the case of func-tions with high-order bounded derivatives (e.g., logistic re-gression or softmax problems (Bullins, 2020)), we providea hyperfast second-order method that efficiently computesthe approximate solution of the subproblem. With this re-spect we continue the line of works on implementable ten-sor methods recently initiated by Yu. Nesterov (Nesterov, yperfast Second-Order Local Solvers for Efficient Statistically Preconditioned Distributed Optimization O (1 /k ) for convex functions with Lipschitz third-orderderivative. As a next step, (Nesterov, 2020a) proposesan inexact accelerated high-order proximal point methodwhich allows to improve the convergence rate of super-fast second-order method to O (1 /k ) up to logarithmic fac-tors. In parallel to the last work and inspired by (Nesterov,2020c), the authors of (Kamzolov & Gasnikov, 2020) pro-posed a hyperfast second-order method with the sameconvergence rate, but based on another accelerated high-order proximal point method developed in (Gasnikov et al.,2019). In this paper, we extend both methods to the settingof strongly convex minimization problems and apply themto solve the intermediate auxiliary optimization problem ineach iteration of our inexact version of SPAG. Contributions:
SPAG is one of the fastest distributedmethods (in terms of communication steps) for the mini-mization of (1) with i.i.d. samples (Hendrikx et al., 2020b).Moreover, the hyperfast second-order method is the bestknown (near-optimal) second-order method to minimizefunctions with high-order Lipschitz derivatives. Hyper-fast second-order methods strongly depend on the compu-tational complexity of Hessian inversion. We argue thatthe extended combination of the proposed inexact SPAGand the new Hyperfast second-order method provides a use-ful approach to ameliorate these problems. Specifically,in SPAG, the central node solves a problem with a simi-lar structure as (1), but with a smaller number N of linearterms. Therefore, with a reduced number of samples, thecomplexity of calculating the Hessian is comparable (dueto the sum type structure of F ) with its inversion by thematrix inversion lemma (Cormen et al., 2009) and modernpractical versions of Strassen-type algorithm (Huang et al.,2016). In this regime, hyperfast second-order methods out-perform existing variance reduction first-order schemes. We extend the theoretical analysis of inexact statisticalpreconditioning methods alongside high-order methodsand show that they jointly provide an efficient second-order method that outperforms (from theoretical and prac-tical points of view) well known (randomized) first-orderschemes.
The main contributions of this paper are as follows:• We propose an inexact statistically preconditionedmethod, and explicitly characterize the accuracy by which the corresponding auxiliary problem needs to besolved to guarantee the same convergence rate as theexact method, i.e., ˜ O ( √ κ F /n / ) . Our method is not adirect extension and has slightly simpler structure thanthe method in (Hendrikx et al., 2020b).• We extend and generalize the hyperfast second-ordermethod (Nesterov, 2020a; Kamzolov & Gasnikov,2020), recently proposed for smooth and convexproblems, to the class of uniformly convex functions.We show a linear convergence rate for this problemclass.• We discuss the distributed optimization problemregime, for which high-order optimization methods pro-vide a theoretical advantage over classical first-ordermethods, for the problem size, dimension, and desiredaccuracy of the solution.• We provide experimental results to large-scale machinelearning problems that supports the use of high-ordermethods in practice. To the best knowledge of the au-thors, this is one of the first attempts to apply near-optimal tensor methods. Specifically, we test the pro-posed algorithm on a proprietary data set with mil-lion entries and a dimension of . million. Outline:
Section 2 presents the problem formulation andmotivating ideas. Also, we introduced the inexact SPAGmethod and showed its convergence properties. Section 3discusses the regime for which high-order methods are the-oretically sound. Section 4 discusses some experimentalresults. We finalize with conclusions in Section 5.
2. Inexact Statistically PreconditionedAccelerated Gradient Method
We consider the empirical risk minimization problem (1)where there are m machines or worker nodes, with n sam-ples each. Moreover, without loss of generality we in-dex the central node as node . Following the same al-gorithmic structure as DANE (Shamir et al., 2014a) andSPAG (Hendrikx et al., 2020b), we define a reference func-tion φ ( x ) = 1 n n X k =1 ℓ ( x, ζ k ) + µ k x k , (4)where the examples ζ k are taken from the node which ischosen to be central. It follows that φ ( x ) is L φ -smooth,and σ φ -strongly convex since it has a similar form as F ( x ) ,see also (17) and (4). The value of the parameter µ is setto be a upper bound that quantifies how statistically similarthe function F is from F , i.e., we assume that with highprobability, it holds that k∇ F ( x ) − ∇ F ( x ) k ≤ µ. (5) yperfast Second-Order Local Solvers for Efficient Statistically Preconditioned Distributed Optimization Later we will see specific upper bounds for the value of µ for the function class we are interested. Moreover, itfollows that is F ( x ) is L F/φ -relative smooth and σ F/φ -relative strongly convex with respect to φ ( x ) (Zhang & Lin,2015; Hendrikx et al., 2020b), i.e., σ F/φ ∇ φ ( x ) ≤ ∇ F ( x ) ≤ L F/φ ∇ φ ( x ) , (6)with L F/φ = 1 , σ F/φ = σ F / ( σ F + 2 µ ) , and κ F/φ = L F/φ /σ F/φ , where the Bregman divergence is defined as D φ ( x, y ) , φ ( x ) − φ ( y ) − ∇ φ ( y ) ⊤ ( x − y ) . (7)Once the specific Bregman divergence has been defined,c.f. (7), distributed statistical preconditioning methods relyon Bregman proximal steps, where the algorithm needs tominimize functions of the form arg min x ∈ R d (cid:8) h∇ F ( z ) , x − z i + L F/φ D φ ( x, z ) (cid:9) , (8)at every iteration.Non-accelerated proximal methods based on steps ofthe form (8) have an iteration complexity of ˜ O ( κ F/φ ) (Bauschke et al., 2017; Lu et al., 2018; Stonyakin et al.,2020). More explicitly, statistical preconditioning al-lows for the relative condition number κ F/φ to deter-mine the convergence rate instead of κ F . The authorsin (Hendrikx et al., 2020b) show that for quadratic func-tions µ = ˜ O ( L F / √ n ) , which implies κ F/φ = 1 +˜ O ( κ F / √ n ) . Similarly, for non-quadratic functions µ =˜ O ( κ F p d/n ) , thus κ F/φ = 1 + ˜ O ( κ F p d/n ) .Therefore, in view of (6), the total number of commu-nication rounds is ˜ O (cid:0) κ F/φ (cid:1) , which is quantitative bet-ter than methods that do not use such statistical precon-ditioning (Arjevani & Shamir, 2015; Scaman et al., 2017;Hendrikx et al., 2020a). A similar argument follows for itsaccelerated variants, where the iteration complexity will be ˜ O (cid:0) κ / F/φ (cid:1) (Hendrikx et al., 2020b).Next, we study the building blocks of the proposed ap-proach. First, we consider the inexact version of the SPAGalgorithm and theoretically analyze the accuracy of the so-lution for the subproblem in each iteration of the method.Notably, the required accuracy increases as iterations go,meaning that the approximate solution’s quality is not highin the first iterations. Next, we introduce and analyze aHyperfast second-order method for uniformly convex func-tions, which we will apply to solve the subproblem in eachiteration of the SPAG algorithm. Finally, we analyze the to-tal complexity for combining the Inexact SPAG plus theHyperfast second-order method to solve the stated prob-lem. This combination is advantageous because we useonly first-order information from the whole dataset and ob-tain a small-size subproblem on the central node. 
Then, afast second-order method is used to solve this subproblem.
Algorithm 1
InSPAG ( L F/φ , σ
F/φ , x , D ) Input: D s.t. x ∗ ∈ B (0 , D ) , ˆ D φ = 2 L φ D , L F/φ , σ F/φ , G . Set y = u = x ∈ B (0 , D ) , α , , A , α . for t ≥ do At the central node Find the smallest integer i t ≥ such that D φ ( x t +1 , y t +1 ) ≤ G t +1 α t +1 A t +1 D φ ( u t +1 , u t ) , (9)where G t +1 =2 i t − G t , A t +1 , A t + α t +1 , and α t +1 is the largest root of A t +1 (1 + A t σ F/φ ) = L F/φ G t +1 α t +1 . (10) Send y t +1 , ( α t +1 u t + A t x t ) /A t +1 to workers. At every worker node Compute n P ni =1 ∇ ℓ (cid:0) y t +1 , ζ ( j ) i (cid:1) and send it to thecentral node. At the central node
Compute ∇ F ( y t +1 ) = 1 nm m X j =1 n X i =1 ∇ ℓ (cid:0) y t +1 , ζ ( j ) i (cid:1) . Solve u t +1 , arg min x ∈ B (0 ,D ) ˆ D φ /t V t ( x ) , (11)where V t ( x ) , α t +1 h∇ F ( y t +1 ) , x − y t +1 i ++ (1 + A t σ F/φ ) D φ ( x, u t )++ α t +1 σ F/φ D φ ( x, y t +1 ) , (12) Set x t +1 , α t +1 u t +1 + A t x t A t +1 . (13) end for2.1. Proposed Algorithm and Main Results This subsection introduces an inexact version of theSPAG algorithm together with its convergence rate anal-ysis. Inexactness in statistically preconditioned problemshas been studied for DANE, resulting in InexactDANE,AIDE (Reddi et al., 2016), and D ANE (Yuan & Li, 2020).We use the following notation for the inexact solution of aconvex optimization problem.
Definition 1 ((Ben-Tal & Nemirovski, 2020)) . For a con-vex optimization problem min x ∈ Q Ψ( x ) , we denote by Arg min ∆ x ∈ Q Ψ( x ) a set of such e x that ∃ h ∈ ∂ Ψ( e x ): ∀ x ∈ Q → h h, x − e x i ≥ − ∆ . (14) We denote by arg min ∆ x ∈ Q Ψ( x ) some element of yperfast Second-Order Local Solvers for Efficient Statistically Preconditioned Distributed Optimization Arg min ∆ x ∈ Q Ψ( x ) . The pseudocode of the proposed Inexact SPAG algorithm ispresented in Algorithm 1. Unlike (Hendrikx et al., 2020b),our algorithm is inspired by a similar-triangles-type of ac-celerated methods (Gasnikov & Nesterov, 2018; Nesterov,2018; Dvurechensky et al., 2018b;a; Stonyakin et al., 2020;Dvurechensky et al., 2021), which leads to a slightly sim-pler algorithm. Note that Line of Algorithm 1 re-quires the approximate of minimization of the auxiliaryfunction (12). First, we present the complexity analysis ofAlgorithm 1 in Theorem 1 assuming the approximate solu-tion to (12). In Subsection 2.2, we show the complexity ofobtaining said approximate solution efficiently using high-order methods. Theorem 1.
Assume the function F is σ F -strongly con-vex and L F -smooth, and σ F/φ -strongly convex and L F/φ -smooth with respect to a function φ , where φ is σ φ -stronglyconvex and L φ -smooth. Moreover, let x t , t ≥ be the se-quence generated by Algorithm 1. Then, after T iterationsit holds that F ( x T ) − F ( x ∗ ) ≤ ˆ D φ A T (3 / T ) , (15) Moreover, the value A T grows as follows: A T ≥ max T L F/φ e G T , L F/φ G exp T s σ F/φ L F/φ e G T , where e G − / T = T P T − t =0 1 √ G t +1 . The proof of Theorem 1 can be found in Appendix A. More-over, following the general result in (15) it follows that as σ F/φ → , i.e., the non strongly convex setup, the con-vergence rate is the accelerated sub-linear rate ˜ O (1 /T ) .Additionally, if φ is a quadratic function, then G t = 1 ,and the communication complexity will be O ( √ κ F/φ ) . Inthe general case, where φ is not quadratic, the authorsin (Hendrikx et al., 2020b; Lin & Xiao, 2014) show that G t → linearly with rate ˜ O ( √ κ F ) .Next, we study the properties of the auxiliary problemin (11). Under the additional assumption that the func-tion ℓ has bounded fourth-order derivatives, we show theexplicit complexity of computing an approximate solutionusing high-order methods. In this subsection, we elaborate the properties of the aux-iliary problem (11). Recall that, at each iteration of Al-gorithm 1, we need to find an approximate minimizer of V t ( x ) on a Euclidean ball in the sense of Definition 1. Thenext lemma provides a bound on the accuracy needed in the approximate minimizer of V t ( x ) for the condition (11)to hold. Lemma 1.
Let us denote x ∗ t = arg min x ∈ B (0 ,D ) V t ( x ) and the point ˆ x satisfy V t (ˆ x ) − V t ( x ∗ t ) ≤ ∆ t , σ φ ˆ D φ t (3 L φ D + k∇ V t (0) k ) . (16) Then ˆ x ∈ arg min ˆ D φ /tx ∈ B (0 ,D ) V t ( x ) . The proof of Lemma 1 can be found in Appendix B.Next, we will propose an efficient method to obtain a point ˆ x for which (16) holds. Here we will need an additionalassumption on the structure of the function F , particularlyon the smoothness of the loss function ℓ . Assumption 1.
The loss function ℓ has bounded fourth-order derivatives, i.e., there exists L < ∞ such that k∇ ℓ ( x, ζ ) − ∇ ℓ ( y, ζ ) k ≤ L k x − y k , holds uniformly for all x, y ∈ dom ( ℓ ) , and all ζ . Specifically, we consider the sparse empirical risk mini-mization problem with logistic loss where ℓ ( x, ζ k ) = log (cid:0)
1+ exp( − η k x ⊤ ξ k ) (cid:1) ++ λ X i ∈ I S x i + λ X i ∈ I D x i , (17)where ζ k = ( ξ k , η k ) , η k = 1 indicates a positive (clicked)example, and η k = − otherwise. We assume thereare two types of features, namely, sparse and dense fea-tures. Let ξ k,i be the i -th element of the vector ξ k , ξ k,i is a sparse feature if ξ k,i = 0 for almost all k ∈{ , . . . , N } , and a dense feature if x k,i = 0 for many k ∈ { , . . . , N } . We denote by I S (and I D ) a set ofsparse (and dense) features with I S ∪ I D = { , . . . , n } and I S ∩ I D = ∅ . Moreover, it follows from (Nesterov,2005, Section 4.4) that the function F is L F -smooth with L F = max { λ , λ } + N P Nk =1 k η k ξ k k = O ( s ) , where s is an average number of nonzero elements in ξ k , and σ F -strongly convex with σ F = min { λ , λ } . More impor-tantly, the logistic loss in (17) has bounded fourth-orderderivatives (Bullins, 2020), which means that Assumption1 holds.The function φ is strongly convex with parameter σ φ , thus, V t ( x ) is also σ φ -strongly convex. Moreover, V t ( x ) in (12)has the form of regularized logistic regression (17). There-fore, V t ( x ) has Lipschitz derivative of all orders, in partic-ular of the order (Bullins, 2020, Theorem 5.4), see Ap-pendix B for more details. Thus, Assumption 1 holds forthe function V t ( x ) with constant L V, = 15 k A ⊤ A k w.r.t. -norm or with constant L V, = 15 w.r.t. k · k A ⊤ A -norm, yperfast Second-Order Local Solvers for Efficient Statistically Preconditioned Distributed Optimization Algorithm 2
Restarted Hyperfast Second-Order Method
Require: z t , constant c which defines convergence rate ofthe basic Hyperfast method. Set R = 2 D for k = 0 , , ... do Set R k = R · − k , and N k = max {⌈ (cid:0) cL V, R k /σ φ (cid:1) ⌉ , } , Set z k +1 = y N k as the output of Hyperfast Second-Order Method (either (Nesterov, 2020a, Eq.3.6)for p = 3 and β = 1 / and with auxiliarysteps described in (Nesterov, 2020a, Sect. 5.2)or (Kamzolov & Gasnikov, 2020, Algorithm 2))started from z k and run for N k steps applied to V t ( x ) . Set k = k + 1 . end forEnsure: z k .where A = [ η ξ , . . . , η n ξ n ] ⊤ . At the same time, V t hasLipschitz gradient with constant L φ . In summary, at eachiteration of Algorithm 1, we need to minimize a stronglyconvex function V t ( x ) with Lipschitz third-order derivativeup to accuracy (16).The idea is to use a second-order implementation of a third-order method, in the sense of (Nesterov, 2020a, Sect. 5.2)or (Kamzolov & Gasnikov, 2020, Algorithm 2), to mini-mize V t in each iteration of InSPAG. Since, in our case, theobjective V t has additional strong convexity property, wepropose a new high-order method for strongly convex func-tions and show that it has a faster convergence rate underthat specific problem class. A general extension of the Hy-perfast method for the uniformly convex case is presentedin Appendix E. A special case for strongly convex case cor-responds to q = 2 and is described below as Algorithm2.As a building block, this method uses basic Hyper-fast method which has a convergence rate of the form cL k x ∗ − z k /k , where k is the iteration counter, c =48 for (Nesterov, 2020a, Theorem 2) and c = 35 for(Kamzolov & Gasnikov, 2020, Theorem 2). Theorem 2.
Let sequence z k , k ≥ be generated by Algo-rithm 2. Then σ φ k z k − x ∗ t k ≤ V t ( z k ) − V t ( x ∗ t ) ≤ σ φ D · − k . The proof can be found in Appendix C.
Corollary 1.
The total number of steps of the Hyperfastsecond-order method to reach V t ( z k ) − V t ( x ∗ t ) ≤ ∆ t isbounded by (cid:0) cL V, D /σ φ (cid:1) + log (cid:0) σ φ D / ∆ t (cid:1) . (18) Moreover, when applied to (11) in Algorithm 1, with accu-racy ∆ t for an iteration t , the required number of iterationsin (18) is O (cid:18) k A ⊤ A k D min { λ , λ } + µ (cid:19) + log t ( L φ D + k∇ V t (0) k ) L φ D ! . (19) Now, we have the complexity estimate for the number ofiterations in Algorithm 2 to guarantee that (11) in Algo-rithm 1 holds, which will run for T ≤ ˜ O ( √ κ F/φ log(1 /ε )) iterations. Thus, the total number of iterations of Algo-rithm 2 to obtain an ε accurate minimizer of F ( x ) is ˜ O √ κ F Dn / (cid:18) k A ⊤ A k D min { λ , λ } + µ (cid:19) ! . (20)See the details of bound (20) in Appendix D.
3. A Case for High-order Methods
Observe that (8) has almost the same (up to a linear terms)structure as (1) with N = n . Informally, statistical precon-dition reduces N to n . If we limit ourselves to first-ordermethods, for variance reduction schemes the oracle com-plexity, i.e., the number of computations of the loss func-tion gradient ∇ ℓ is ˜ O (cid:0) n + √ nκ (cid:1) , where κ = L φ /σ φ (Lan,2020). This bound corresponds to the lower bound for theclass of first-order incremental methods (Lan, 2020). Thus,for the sparse logistic regression problem, the (wall-clocktime) complexity is ˜ O (cid:0) s · (cid:0) n + √ nκ (cid:1)(cid:1) . (21)However, as pointed out in (Nesterov, 2020c; 2019), when f ( x ) is third-order smooth (i.e., bounded fourth-orderderivatives); third-order methods outperform second-orderones without lost in computational complexity with a fasterconvergence rate.A third-order (Nesterov, 2019; Gasnikov et al.,2019; Doikov & Nesterov, 2019; Nesterov, 2020c;a;Kamzolov & Gasnikov, 2020) method applied to (1) willrequire intermediate steps of the form: y k +1 = arg min y ∈ R d {h∇ Φ x t ( y k ) , y − y k i ++ 12 h∇ Φ x t ( y k )( y − y k ) , y − y k i ++ 16 ∇ Φ x t ( y k ) [ y − y k ] + L k y − y k k (cid:27) , (22) yperfast Second-Order Local Solvers for Efficient Statistically Preconditioned Distributed Optimization where k∇ Φ x t ( y ) − ∇ Φ x t ( x ) k ≤ L k y − x k , and L = O (cid:0) n P nj =1 k ξ j k (cid:1) = O ( s ) for the particular caseof sparse logistic regression problem in (17). Note thatin (22), we use the index k to denote the inner loop iter-ations to solve (8). The outer loop runs with with iterationindex t . Solving (22) is the most expensive computation interms of the number of arithmetic operations. Furthermore,under the assumption that one can solve problems of theform (22), Problem (8) has been shown to have a complex-ity ˜ O (cid:0) s n + d log (cid:1) · (cid:18) L D σ φ (cid:19) / ! , (23)see (Gasnikov et al., 2019; Nesterov, 2020a;Kamzolov & Gasnikov, 2020) for further references.The first term in (23), i.e., s n , denotes the complexityof Hessian calculation. The second term, i.e. d log , isthe complexity of Hessian inversion, e.g. by the matrixinversion lemma using Strassen’s algorithm (Huang et al.,2016)). The term (cid:0) L D /σ φ (cid:1) / corresponds to thenumber of iterations required for near-optimal third-ordertensor method to achieve a desired accuracy that appears aspolylogaritmic term (Gasnikov et al., 2019). Additionally,we may expect D = O ( d ) , since dim x ∗ = d .Nevertheless, we now have the question of how to effi-ciently solve (22). It is shown in (Nesterov, 2019) thatthe objective in (22) is relatively smooth and strongly con-vex with respect to the function a ( y ) = h∇ Φ x t ( y k )( y − y k ) , y − y k i + L k y − y k k with σ a = 1 − / √ , L a =1 + 1 / √ (Nesterov, 2019). The auxiliary problem in (22)will be a quadratic optimization problem with a k y − y k k regularizer. Therefore, the complexity of this problem (upto a logarithmic factor) is not worse than the complexity ofthe quadratic programming problem and can be estimatedby the complexity of matrix inversion (Nesterov, 2020b).Without loss of generality we can assume that the parame-ter n can be set such that d log = O (cid:0) s n (cid:1) . In this case,the Hyperfast second-order tensor method in (22) outper-forms (accelerated) variance reduced schemes with com-plexity (21) if σ φ . s − n − . Where . 
, and ≃ shouldbe understood as ignoring dimension factors of order O (1) .For the particular case of sparse logistic regression prob-lems, our focused application, we can assume that s =˜ O (1) . Therefore, we have that if d . n . and σ φ . n − holds, in other words, if d log . n . σ − / φ , then, theHyperfast second-order method is better than variance re-duction approaches. The last inequality seems to be veryrestrictive. But in practice, via regularization (Gasnikov,2017), it is reasonable to set σ φ ≃ σ F ≃ ε/D ≃ ε/d ,where ε is a desired accuracy in the function value of solv-ing (17). Thus, in this case we can rewrite the last inequal-ity as ε . n − . ( d . . n . ε − . ). We can concludethat Hyperfast second-order methods are better when our goal is to solve (17) with high accuracy.Coming back to the original question: why should one usea high-order method instead of a first-order? The answeris: we need to compare min { n + √ nκ, √ κn/m } with κ / dn/m . Assuming that the number of workers in thedistributed optimization setup is large enough, the secondterm is smaller than the first-order complexity. In this case,the high-order approach is better if d . κ / .From the theoretical point of view, we have two regimesfor which the use of a third-order method becomes ap-pealing. First, n should not be too big. This is thecase for general statistical preconditioning, as we wantthe auxiliary problem to be much easier to solve thanthe original one. Second, d . κ / , where we knowthat κ / = ˜ O (cid:0) / ( σ F √ n ) / (cid:1) . This relation suggeststhat the limitation on the dimension of the variables is in-versely proportional to the strong convexity of the orig-inal function. This is the case, for example, for non-strongly convex functions with strongly convex regulariz-ers (Nemirovskii & Yudin, 1983). In such scenarios, onesets σ F = O ( ε ) , where ε is a desired precision in function.Therefore, high-order methods are reasonable from the the-oretical point of view for large-scale convex problems thatrequire high accuracy. Further advantage can be poten-tially achieved by using inexact tensor methods (Nesterov,2020b; Doikov & Nesterov, 2020; Agafonov et al., 2020;Kamzolov et al., 2020) to save some computational work.
4. Preliminary Numerical Analysis andImplementation Details
In this section, we present a numerical analysis and imple-mentation details of Algorithm 1. We work with the binaryclassification problems with logistic regression with costfunction 17 on a public datasets from LibSVM1 , namelyRCV1 (Lewis et al., 2004), and a proprietary large-scale in-house dataset that was generated from the click logs of alarge-scale commercial system for mobile app install ads.The main statistics of the datasets are shown in Table 1.Dataset N d
Feat. SizeRCV1 20k 47k 74.05 13.7In-house 710M 3,246k 109.86 650.8k
Table1. Statistics of the datasets. N is the number of samples, d is the number of features, Feat. is the average number of non-zerofeatures, and Size is the data size in MB. We obtained an MPI-based distributed implementation ofSPAG from the authors of (Hendrikx et al., 2020b) andmodified it to run on an Apache Spark (Apache, 2020) clus-ter. As shown in Algorithm 1, InSPAG switches between yperfast Second-Order Local Solvers for Efficient Statistically Preconditioned Distributed Optimization two phases: a parallel gradient computation phase and acentral-node optimization phase in which we run the Hy-perfast second-order method in Algorithm 2. In our imple-mentation, the central-node optimization phase is carriedout by the driver, while executors compute the gradient.The code for the implementations was developed in Py-Torch (Pytorch, 2020). Algorithm 2 requires a line-searchwhere each test point calculates the full second-order step.The computational complexity of this step is theoreticallybounded above by O (log( ǫ − )) . However, in practice, weobserve that the line-search ends in approximately iter-ations. Therefore, we bound the number of iterations ex-ecuted in the line-search step. Additionally, experimentsshow that the number of steps required in the line-search de-creases as more iterations of Algorithm 1 is executed. Wecompute the approximate third-order terms multiplied byvectors using off-the-shelf automatic differentiation codes.Such a procedure’s computational complexity is around − gradient computations in practice.Moreover, recall that Algorithm 2 requires the use of ahigh-order optimizer, either (Nesterov, 2020a, Eq.3.6) for p = 3 and β = 1 / , and with auxiliary steps described in(Nesterov, 2020a, Sect. 5.2), or (Kamzolov, 2020, Algo-rithm 2)). Such algorithm requires in turn the solution of aproblem of the form: min (cid:26) c ⊤ x + 12 x ⊤ ∇ f ( x k ) x + L k x k (cid:27) . (24)The problem in (24) is solved using ADAM (Kingma & Ba,2014). The gradient of the argument in (24) has closed-form c + ∇ f ( x k ) x + L k x k x . Hence, we may computeonly Hessian-vector products using automatic differentia-tion, which in practice takes around − times the timerequired for gradient computation. On the very inner level,our method becomes the first-order method with hessian-vector and third-order derivatives on two vectors multipli-cation computed by an automatic differentiation technique.We do not compute full Hessians or full third-order deriva-tives, but we use information about them to accelerate theconvergence. Moreover, the central node uses GPU to ac-celerate the various Hessian related matrix-vector opera-tions in the algorithm. We believe our implementation tobe the first practical implementation of an algorithm in theHyperfast family of higher-order optimizers that can oper-ate on data at the above dimensionality. We compare Algorithm 1 with inner solver Algo-rithm 2 as well as Stochastic Dual Coordinate Ascent(SDCA) (Shalev-Shwartz, 2016) used in (Hendrikx et al.,2020b). For RCV1 dataset, we also compare the perfor-mance of Algorithm 1 with DANE (Shamir et al., 2014b)with both SDCA and Hyperfast at the central-node solver. https://github.com/OPTAMI/OPTAMI/ Figure1. Comparison of the communication rounds for the dataset RCV1.Figure2. Wall clock time performance of the InSPAG method forthe data set RCV1.
We used n = 10 samples for preconditioning, λ = 10 − , µ = 2 × − , constant L F/φ = 0 . , and a practicalapproximate − for D φ . We set the precision of the aux-iliary subproblem to − . Other parameters: L = 0 . ,the learning rate of ADAM is set to , and the number ofiterations of ADAM is . Figures 1 and 2 show results forthe RCV1 dataset. The point ˆ x is set as the point where theminimal cost was achieved overall the iterations and runsof the algorithm and serves as a proxy point used instead ofthe minimizer, which is in general unknown. We see thatAlgorithm 1 outperforms DANE regardless of the subsolverused. Moreover, InSPAG-SDCA has better performanceduring initial iterations. However, InSPAG-Hyperfast out-performs all other methods by accuracy. Also, we find thatHyperfast iterations are faster than SDCA near the mini-mum point. For example, the first five iterations take about seconds each, and the last five iterations take about . seconds each. Hence, suggesting that some combination ofmethods would be of use in practice. However, the Hyper-fast approach finds better solutions overall.Figure 3 shows the results of the comparison on the in-house dataset (split over 200 nodes, i.e., m = 200 ) with λ = 1 × − , µ = 2 × − . Other parameters are de-scribed in Table 2. We see that InSPAG-Hyperfast outper- yperfast Second-Order Local Solvers for Efficient Statistically Preconditioned Distributed Optimization Figure3. Comparison of the communication rounds for thein house dataset. a) L = 10 , ADAM learning rate . , n =10000 ; b) L = 100 , ADAM learning rate . , n = 10000 ; c) L = 10 , ADAM learning rate . , n = 10000 ; d) L = 15 ,ADAM learning rate . , n = 1000 . forms InSPAG-SDCA for this large-scale dataset. Furtherdetails on the numerical experiments, and additional simu-lation results can be found in Appendix F.Run L ADAM n µ a)
10 0 .
01 1 × × − b)
100 0 . × × − c)
10 0 . × × − d)
15 0 .
01 1 × × − Table2. Parameter selection for experiments on in-house data.
5. Conclusions
We study the distributed optimization problem for smoothsum type (strongly) convex optimization problems withi.i.d. data stored at nodes. Building upon the recent re-sult of statistical preconditioning, we propose an algorithmthat iteratively minimizes the objective function taking ad-vantage of the statistical similarity of the cost functions oneach of the agents. Statistical preconditioning requires anauxiliary optimization problem at a designated central node.Contrary to existing approaches, we analyze the case wherethe auxiliary problem is solved inexactly. Moreover, pro-vide the conditions on the accuracy of the solution that guar-antees convergence at the same rate that the algorithm withaccess to exact minimizers of the auxiliary problem. Ad-ditionally, under an additional high-order bounded deriva-tives, we extend recently proposed high-order methods tothe class of uniformly convex functions. We show thatthe auxiliary problem in the statistical preconditioned prob-lem can be solved efficiently at a linear rate via a hyper-fast second-order method. We analyze the complexity ofthe proposed inexact statistically preconditioned algorithmwith the high-order sub-solver and show that it convergeslinearly with the improved relative condition number. Fi- nally, we show the first empirical results on the implemen-tation of high-order methods on large-scale problems, withdimension is of the order of million and million sam-ples. Funding
This project was supported by the Yahoo! Faculty Engage-ment Program. The research of P. Dvurechensky was par-tially supported by the Ministry of Science and Higher Edu-cation of the Russian Federation (Goszadaniye) 075-00337-20-03, project no. 0714-2020-0005. The work of D. Kam-zolov was funded by RFBR, project number 19-31-27001.The work of A. Gasnikov was partially supported by RFBR19-31-51001.
Acknowledgement
The authors are grateful to Hadrien Hendrikxfor sharing the code of the SPAG algorithm http://proceedings.mlr.press/v119/hendrikx20a.html
References
Agafonov, A., Kamzolov, D., Dvurechensky, P., and Gas-nikov, A. Inexact tensor methods and their applicationto stochastic convex optimization. arXiv:2012.15636 ,2020.Apache. Spark 2.4.5, 2020. https://spark.apache.org/ .Arjevani, Y. and Shamir, O. Communication complex-ity of distributed convex learning and optimization.In Cortes, C., Lawrence, N., Lee, D., Sugiyama,M., and Garnett, R. (eds.),
Advances in NeuralInformation Processing Systems , volume 28, pp.1756–1764. Curran Associates, Inc., 2015. URL https://proceedings.neurips.cc/paper/2015/file/7fec306d1e665bc9c748b5d2b99a6e97-Paper.pdf .Bauschke, H. H., Bolte, J., and Teboulle, M. A descentlemma beyond lipschitz gradient continuity: first-ordermethods revisited and applications.
Mathematics of Op-erations Research , 42(2):330–348, 2017.Ben-Tal, A. and Nemirovski, A.
Lectures on Mod-ern Convex Optimization (Lecture Notes) . Per-sonal web-page of A. Nemirovski, 2020. URL .Bullins, B. Highly smooth minimization of non-smoothproblems. In Abernethy, J. and Agarwal, S. (eds.),
Pro-ceedings of Thirty Third Conference on Learning Theory ,volume 125 of
Proceedings of Machine Learning Re-search , pp. 988–1030. PMLR, 09–12 Jul 2020. URL http://proceedings.mlr.press/v125/bullins20a.html . yperfast Second-Order Local Solvers for Efficient Statistically Preconditioned Distributed Optimization Cormen, T. H., Leiserson, C. E., Rivest, R. L., and Stein, C.
Introduction to algorithms . MIT press, 2009.Dean, J. and Ghemawat, S. Mapreduce: simplified data pro-cessing on large clusters.
Communications of the ACM ,51(1):107–113, 2008.Doikov, N. and Nesterov, Y. Contracting proximal meth-ods for smooth convex optimization. arXiv preprintarXiv:1912.07972 , 2019.Doikov, N. and Nesterov, Y. Inexact tensor meth-ods with dynamic accuracies. In III, H. D. andSingh, A. (eds.),
Proceedings of the 37th Interna-tional Conference on Machine Learning , volume119 of
Proceedings of Machine Learning Research ,pp. 2577–2586. PMLR, 13–18 Jul 2020. URL http://proceedings.mlr.press/v119/doikov20a.html .Dragomir, R.-A., Taylor, A., d’Aspremont, A., and Bolte,J. Optimal complexity and certification of bregman first-order methods. arXiv preprint arXiv:1911.08510 , 2019.Dvurechensky, P., Dvinskikh, D., Gasnikov, A., Uribe,C. A., and Nedi´c, A. Decentralize and randomize:Faster algorithm for Wasserstein barycenters. InBengio, S., Wallach, H., Larochelle, H., Grauman, K.,Cesa-Bianchi, N., and Garnett, R. (eds.),
Advances inNeural Information Processing Systems 31 , NIPS’18,pp. 10783–10793. Curran Associates, Inc., 2018a. URL http://papers.nips.cc/paper/8274-decentralize-and-randomize-faster-algorithm-for-wasserstein-barycenters.pdf .arXiv:1802.04367.Dvurechensky, P., Gasnikov, A., and Kroshnin, A. Com-putational optimal transport: Complexity by acceler-ated gradient descent is better than by Sinkhorn’s algo-rithm. In Dy, J. and Krause, A. (eds.),
Proceedings ofthe 35th International Conference on Machine Learn-ing , volume 80 of
Proceedings of Machine Learning Re-search , pp. 1367–1376, 2018b. arXiv:1802.04367.Dvurechensky, P., Staudigl, M., and Shtern, S. First-ordermethods for convex optimization. arXiv:2101.00935 ,2021.Gasnikov, A. Universal gradient descent. arXiv preprintarXiv:1711.00394 , 2017.Gasnikov, A., Dvurechensky, P., Gorbunov, E., Vorontsova,E., Selikhanovych, D., Uribe, C. A., Jiang, B., Wang, H.,Zhang, S., Bubeck, S., Jiang, Q., Lee, Y. T., Li, Y., andSidford, A. Near optimal methods for minimizing con-vex functions with lipschitz p -th derivatives. In Beygelz-imer, A. and Hsu, D. (eds.), Proceedings of the Thirty-Second Conference on Learning Theory , volume 99 of
Proceedings of Machine Learning Research , pp. 1392–1393, Phoenix, USA, 25–28 Jun 2019. PMLR. URL http://proceedings.mlr.press/v99/gasnikov19b.html .arXiv:1809.00382.Gasnikov, A. V. and Nesterov, Y. E. Universal methodfor stochastic composite optimization problems.
Com-putational Mathematics and Mathematical Physics , 58(1):48–64, 2018. First appeared in arXiv:1604.05275.Hendrikx, H., Bach, F., and Massoulie, L. An optimal al-gorithm for decentralized finite sum optimization. arXivpreprint arXiv:2005.10675 , 2020a.Hendrikx, H., Xiao, L., Bubeck, S., Bach, F., and Mas-soulie, L. Statistically preconditioned acceleratedgradient method for distributed optimization. InIII, H. D. and Singh, A. (eds.),
Proceedings of the37th International Conference on Machine Learning ,volume 119 of
Proceedings of Machine Learning Re-search , pp. 4203–4227. PMLR, 13–18 Jul 2020b. URL http://proceedings.mlr.press/v119/hendrikx20a.html .Huang, J., Smith, T. M., Henry, G. M., and van de Geijn,R. A. Strassen’s algorithm reloaded. In
SC’16: Pro-ceedings of the International Conference for High Per-formance Computing, Networking, Storage and Analysis ,pp. 690–701. IEEE, 2016.Kamzolov, D. Near-optimal hyperfast second-ordermethod for convex optimization. In Kochetov, Y.,Bykadorov, I., and Gruzdeva, T. (eds.),
Mathemati-cal Optimization Theory and Operations Research , pp.167–178, Cham, 2020. Springer International Publishing.ISBN 978-3-030-58657-7.Kamzolov, D. and Gasnikov, A. Near-optimal hyperfastsecond-order method for convex optimization and itssliding. arXiv preprint arXiv:2002.09050 , 2020.Kamzolov, D., Gasnikov, A., and Dvurechensky, P. Op-timal combination of tensor optimization methods. InOlenev, N., Evtushenko, Y., Khachay, M., and Malkova,V. (eds.),
Optimization and Applications , pp. 166–183,Cham, 2020. Springer International Publishing. ISBN978-3-030-62867-3. arXiv:2002.01004.Kingma, D. P. and Ba, J. Adam: A method for stochasticoptimization. arXiv preprint arXiv:1412.6980 , 2014.Lan, G.
First-order and Stochastic Optimization Methodsfor Machine Learning . Springer, 2020.Lan, G., Lee, S., and Zhou, Y. Communication-efficientalgorithms for decentralized and stochastic optimization. arXiv preprint arXiv:1701.03961 , 2017.Lewis, D. D., Yang, Y., Rose, T. G., and Li, F.Rcv1: A new benchmark collection for textcategorization research.
Journal of Machine yperfast Second-Order Local Solvers for Efficient Statistically Preconditioned Distributed Optimization
Learning Research , 5(Apr):361–397, 2004. URL .Li, M., Andersen, D. G., Park, J. W., Smola, A. J., Ahmed,A., Josifovski, V., Long, J., Shekita, E. J., and Su, B.-Y.Scaling distributed machine learning with the parameterserver. In { USENIX } Symposium on Operating Sys-tems Design and Implementation ( { OSDI } , pp. 583–598, 2014.Lin, H., Mairal, J., and Harchaoui, Z. A universal catalystfor first-order optimization. In Proceedings of the 28thInternational Conference on Neural Information Pro-cessing Systems - Volume 2 , NIPS’15, pp. 3384–3392,Cambridge, MA, USA, 2015. MIT Press.Lin, Q. and Xiao, L. An adaptive accelerated proxi-mal gradient method and its homotopy continuationfor sparse optimization. In Xing, E. P. and Jebara,T. (eds.),
Proceedings of the 31st InternationalConference on Machine Learning , volume 32 of
Proceedings of Machine Learning Research , pp.73–81, Bejing, China, 22–24 Jun 2014. PMLR. URL http://proceedings.mlr.press/v32/lin14.html .Lu, H., Freund, R. M., and Nesterov, Y. Relatively smoothconvex optimization by first-order methods, and appli-cations.
SIAM Journal on Optimization , 28(1):333–354,2018.Nemirovskii, A. and Yudin.
Problem Complexity andMethod Efficiency in Optimization . Wiley, 1983.Nesterov, Y. Smooth minimization of non-smooth func-tions.
Mathematical Programming , 103(1):127–152,2005.Nesterov, Y.
Lectures on Convex Optimization . SpringerOptimization and Its Applications 137. Springer Interna-tional Publishing, 2nd ed. edition, 2018. ISBN 978-3-319-91577-7,978-3-319-91578-4.Nesterov, Y. Implementable tensor methods in uncon-strained convex optimization.
Mathematical Program-ming , pp. 1–27, 2019.Nesterov, Y. Inexact high-order proximal-point methodswith auxiliary search procedure. CORE Discus-sion Paper 2020/10, CORE UCL, 2020a. URL https://dial.uclouvain.be/pr/boreal/object/boreal%3A227954/datastream/PDF_01/view .Nesterov, Y. Inexact basic tensor methods for some classesof convex optimization problems.
Optimization Methodsand Software , pp. 1–29, 2020b.Nesterov, Y. Superfast second-order methods for uncon-strained convex optimization.
CORE DP , 7:2020, 2020c. Nesterov, Y. et al.
Lectures on convex optimization , volume137. Springer, 2018.Pytorch. 1.5.0, 2020. https://pytorch.org/ .Reddi, S. J., Koneˇcn`y, J., Richt´arik, P., P´ocz´os, B., andSmola, A. Aide: Fast and communication efficient dis-tributed optimization. arXiv preprint arXiv:1608.06879 ,2016.Scaman, K., Bach, F., Bubeck, S., Lee, Y. T., andMassouli´e, L. Optimal algorithms for smooth andstrongly convex distributed optimization in networks.In Precup, D. and Teh, Y. W. (eds.),
Proceedings ofthe 34th International Conference on Machine Learn-ing , volume 70 of
Proceedings of Machine LearningResearch , pp. 3027–3036, International Convention Cen-tre, Sydney, Australia, 06–11 Aug 2017. PMLR. URL http://proceedings.mlr.press/v70/scaman17a.html .Shalev-Shwartz, S. Sdca without duality, regulariza-tion, and individual convexity. In Balcan, M. F.and Weinberger, K. Q. (eds.),
Proceedings ofThe 33rd International Conference on MachineLearning , volume 48 of
Proceedings of MachineLearning Research , pp. 747–754, New York,New York, USA, 20–22 Jun 2016. PMLR. URL http://proceedings.mlr.press/v48/shalev-shwartza16.html .Shamir, O., Srebro, N., and Zhang, T. Communication-efficient distributed optimization using an approximatenewton-type method. In Xing, E. P. and Jebara,T. (eds.),
Proceedings of the 31st InternationalConference on Machine Learning , volume 32 of
Proceedings of Machine Learning Research , pp. 1000–1008, Bejing, China, 22–24 Jun 2014a. PMLR. URL http://proceedings.mlr.press/v32/shamir14.html .Shamir, O., Srebro, N., and Zhang, T. Communication-efficient distributed optimization using an approximatenewton-type method. In
International conference on ma-chine learning , pp. 1000–1008, 2014b.Stonyakin, F., Tyurin, A., Gasnikov, A., Dvurechensky,P., Agafonov, A., Dvinskikh, D., Pasechnyuk, D., Arta-monov, S., and Piskunova, V. Inexact relative smooth-ness and strong convexity for optimization and varia-tional inequalities by inexact model. arXiv:2001.09013 ,2020.Sun, Y., Daneshmand, A., and Scutari, G. Distributed op-timization based on gradient-tracking revisited: Enhanc-ing convergence rate via surrogation. arXiv:1905.02637 ,2019.Wang, S., Roosta, F., Xu, P., and Mahoney, M. W. Gi-ant: Globally improved approximate newton method for yperfast Second-Order Local Solvers for Efficient Statistically Preconditioned Distributed Optimization distributed optimization. In
Advances in Neural Informa-tion Processing Systems , pp. 2332–2342, 2018.Yang, T. Trading computation for communication: Dis-tributed stochastic dual coordinate ascent. In
Advancesin Neural Information Processing Systems , pp. 629–637,2013.Yuan, X.-T. and Li, P. On convergence of distributedapproximate newton methods: Globalization,sharper bounds and beyond.
Journal of MachineLearning Research , 21(206):1–51, 2020. URL http://jmlr.org/papers/v21/19-764.html .Zhang, Y. and Lin, X. Disco: Distributed optimizationfor self-concordant empirical loss. In Bach, F. andBlei, D. (eds.),
Proceedings of the 32nd Interna-tional Conference on Machine Learning , volume 37of
Proceedings of Machine Learning Research , pp.362–370, Lille, France, 07–09 Jul 2015. PMLR. URL http://proceedings.mlr.press/v37/zhangb15.html . yperfast Second-Order Local Solvers for Efficient Statistically Preconditioned Distributed Optimization A. Analysis of Inexact SPAG
First, we need the following technical lemma.
Lemma 2 ((Stonyakin et al., 2020, Lemma 3.5.)) . Let ψ ( x ) be a convex function and y = arg min x ∈ Q e δ { ψ ( x ) + βD φ ( x, z ) + γD φ ( x, u ) } , where β ≥ and γ ≥ . Then ψ ( x ) + βD φ ( x, z ) + γD φ ( x, u ) ≥ ψ ( y ) + βD φ ( y, z ) + γD φ ( y, u ) + ( β + γ ) D φ ( x, y ) − e δ, ∀ x ∈ Q. It can be shown (Lu et al., 2018) that from (6) it follows that σ F/φ D φ ( x, y ) ≤ F ( x ) − F ( y ) − h∇ F ( y ) , x − y i ≤ L F/φ D φ ( x, y ) . (25)Let us also denote Q = B (0 , D ) . The following is the main technical lemma that gives the per-iteration progress of thealgorithm. Lemma 3.
For all x ∈ Q , we have A t +1 F ( x t +1 ) − A t F ( x t ) + (1 + A t +1 σ F/φ ) D φ ( x, u t +1 ) − (1 + A k σ F/φ ) D φ ( x, u t ) ≤ α t +1 F ( x ) + ˆ D φ /t. Proof.
By the relative smoothness condition (6) and the stopping condition (9), we have F ( x t +1 ) (6) ≤ F ( y t +1 ) + h∇ F ( y t +1 ) , x t +1 − y t +1 i + L F/φ D φ ( x t +1 , y t +1 ) (9) ≤ F ( y t +1 ) + h∇ F ( y t +1 ) , x t +1 − y t +1 i + L F/φ G t +1 α t +1 A t +1 D φ ( u t +1 , u t ) . Substituting in this expression the definition (13) of the point x t +1 , using A t +1 = A t + α t +1 , and the definition (10) forthe sequence α t +1 , we obtain F ( x t +1 ) (13) ≤ F ( y t +1 ) + h∇ F ( y t +1 ) , α t +1 u t +1 + A t x t A t +1 − y t +1 i + L F/φ G t +1 α t +1 A t +1 D φ ( u t +1 , u t ) (10) = A t A t +1 ( F ( y t +1 ) + h∇ F ( y t +1 ) , x t − y t +1 i ) + α t +1 A t +1 ( F ( y t +1 ) + h∇ F ( y t +1 ) , u t +1 − y t +1 i )+ L F/φ G t +1 α t +1 A t +1 D φ ( u t +1 , u t ) conv-ty ≤ A t A t +1 F ( x t ) + α t +1 A t +1 (cid:16) F ( y t +1 ) + h∇ F ( y t +1 ) , u t +1 − y t +1 i (cid:17) + L F/φ G t +1 α t +1 A t +1 D φ ( u t +1 , u t ) (10) = A t A t +1 F ( x t ) + α t +1 A t +1 (cid:16) F ( y t +1 ) + h∇ F ( y t +1 ) , u t +1 − y t +1 i + 1 + A t σ F/φ α t +1 D φ ( u t +1 , u t ) (cid:17) . (26)By Lemma 2 for the optimization problem in (11) with ψ ( x ) = α t +1 h∇ F ( y t +1 ) , x − y t +1 i , β = 1 + A t σ F/φ , z = u t , γ = α t +1 σ F/φ , and u = y t +1 , it holds, for any x ∈ Q , that α t +1 h∇ F ( y t +1 ) , u t +1 − y t +1 i + (1 + A t σ F/φ ) D φ ( u t +1 , u t ) + α t +1 σ F/φ D φ ( u t +1 , y t +1 )+ (1 + A t +1 σ F/φ ) D φ ( x, u t +1 ) − ˆ D φ /t ≤ α t +1 h∇ F ( y t +1 ) , x − y t +1 i + (1 + A t σ F/φ ) D φ ( x, u t ) + α t +1 σ F/φ D φ ( x, y t +1 ) . Whence, using the fact that V [ y t +1 ]( u t +1 ) ≥ , we obtain α t +1 h∇ F ( y t +1 ) , u t +1 − y t +1 i + (1 + A t σ F/φ ) D φ ( u t +1 , u t ) ≤ α t +1 h∇ F ( y t +1 ) , x − y t +1 i + (1 + A t σ F/φ ) D φ ( x, u t ) − (1 + A t +1 σ F/φ ) D φ ( x, u t +1 ) + α t +1 σ F/φ D φ ( x, y t +1 ) + ˆ D φ /t. (27) yperfast Second-Order Local Solvers for Efficient Statistically Preconditioned Distributed Optimization Combining (26) and (27), we obtain F ( x t +1 ) ≤ A t A t +1 F ( x t ) + α t +1 A t +1 (cid:16) F ( y t +1 ) + h∇ F ( y t +1 ) , x − y t +1 i + σ F/φ D φ ( x, y t +1 )+ 1 + A t σ F/φ α t +1 D φ ( x, u t ) − A t +1 σ F/φ α t +1 D φ ( x, u t +1 ) + ˆ D φ tα t +1 (cid:17) . We finish the proof of Lemma 3 applying the left inequality in (25) F ( x t +1 ) ≤ A t A t +1 F ( x t ) + α t +1 A t +1 F ( x )+ 1 + A t σ F/φ A t +1 D φ ( x, u t ) − A t +1 σ F/φ A t +1 D φ ( x, u t +1 ) + ˆ D φ tA t +1 . Tofinish the proof of the first part of Theorem 1, we telescope the inequality in Lemma 3 for k from to T − and take x = x ∗ : A T F ( x T ) ≤ A T F ( x ∗ ) + D φ ( x ∗ , u ) − (1 + A T σ F/φ ) D φ ( x ∗ , u T ) + T − X t =0 ˆ D φ /t. (28)Since D φ ( x ∗ , u T ) ≥ and D φ ( x ∗ , u ) ≤ ˆ D φ , we have A T F ( x T ) − A T F ( x ∗ ) ≤ ˆ D φ (3 / T ) , (29)which is (15) in Theorem 1.Next lemma gives a lower bound for the growth rate of the sequence A T , which constitutes the second claim of Theorem1. Lemma 4.
For all T ≥ , A T ≥ max T L F/φ e G T , L F/φ G exp T s σ F/φ L F/φ e G T , where e G − / T = T P T − t =0 1 √ G t +1 .Proof. In view of definition (10) of sequence α t +1 , we have: A T ≤ A T (1 + σ F/φ A T − ) = L F/φ G T ( α T ) = L F/φ G T ( A T − A T − ) ≤ L F/φ G T ( A / T − A / T − ) ( A / T + A / T − ) ≤ L F/φ G T A T ( A / T − A / T − ) . We can see that A / T ≥ A / T − + 12 L F/φ G T and A T ≥ T − X t =0 p L F/φ G t +1 ! = 12 L F/φ T · T T − X t =0 p G t +1 ! = T L F/φ e G T . This bound holds even when σ F/φ = 0 . yperfast Second-Order Local Solvers for Efficient Statistically Preconditioned Distributed Optimization For the case when σ F/φ > we obtain: σ F/φ A T − A T ≤ A T (1 + σ F/φ A T − ) ≤ L F/φ G T A T ( A / T − A / T − ) . From the fact that A = 1 /L F/φ G and the last inequality we can show that A / T ≥ s σ F/φ L F/φ G T ! A / T − ≥ p L F/φ G T − Y t =0 s σ F/φ L F/φ G t +1 ! ≥ p L F/φ G T − Y t =0 exp s σ F/φ L F/φ G t +1 ! = 1 p L F/φ G exp T s σ F/φ L F/φ e G T . The final technical block justifies that the backtracking line-search for G t is correctly defined and (9) holds after a reason-able number of trials Lemma 5.
The final technical block justifies that the backtracking line-search for $G_t$ is correctly defined and that (9) holds after a reasonable number of trials.

Lemma 5. Under the assumption that $\phi$ is $\sigma_\phi$-strongly convex and $L_\phi$-smooth, inequality (9) holds with $G_{t+1} = L_\phi / \sigma_\phi$.

Proof. Since $\phi$ is $\sigma_\phi$-strongly convex and $L_\phi$-smooth, we have that
$$\frac{\sigma_\phi}{2} \|x - y\|^2 \leq D_\phi(x, y) \leq \frac{L_\phi}{2} \|x - y\|^2.$$
Thus, by Step 6 of the algorithm and (13), we have
$$D_\phi(x_{t+1}, y_{t+1}) \leq \frac{L_\phi}{2} \|x_{t+1} - y_{t+1}\|^2 = \frac{L_\phi \alpha_{t+1}^2}{2 A_{t+1}^2} \|u_{t+1} - u_t\|^2 \leq \frac{L_\phi \alpha_{t+1}^2}{\sigma_\phi A_{t+1}^2} D_\phi(u_{t+1}, u_t).$$
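In code, the line-search certified by Lemma 5 is a plain doubling loop. The sketch below is ours, not the paper's implementation; it assumes a callable `step(G)` that performs one trial InSPAG step for a candidate $G_{t+1}$ and returns the trial points together with the ratio $\alpha_{t+1}/A_{t+1}$ computed from (10).

```python
def backtrack_G(step, bregman, G0=1.0):
    """Backtracking line-search for G_{t+1} (illustrative sketch).

    step(G)  -- assumed to perform one trial InSPAG step with candidate G
                and return (x_next, y_next, u_next, u_prev, ratio), where
                ratio stands for alpha_{t+1} / A_{t+1} from (10);
    bregman  -- the Bregman divergence D_phi(., .) of the preconditioner.

    The candidate G is doubled until the acceptance test (9) holds:
        D_phi(x_{t+1}, y_{t+1}) <= G * ratio**2 * D_phi(u_{t+1}, u_t).
    By Lemma 5 the test passes at the latest once G >= L_phi / sigma_phi,
    so only O(log(L_phi / sigma_phi)) doublings are ever needed.
    """
    G = G0
    while True:
        x_next, y_next, u_next, u_prev, ratio = step(G)
        if bregman(x_next, y_next) <= G * ratio ** 2 * bregman(u_next, u_prev):
            return G, x_next, u_next
        G *= 2.0
```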
B. Properties of the auxiliary problem

In each iteration of InSPAG we need to solve the following optimization problem:
$$\arg\min_{x \in B(0, D)} \Big\{ V_t(x) = \alpha_{t+1} \langle \nabla F(y_{t+1}), x - y_{t+1} \rangle + (1 + A_t \sigma_{F/\phi}) D_\phi(x, u_t) + \alpha_{t+1} \sigma_{F/\phi} D_\phi(x, y_{t+1}) \Big\}. \qquad (30)$$
Since $\phi$ is strongly convex with parameter $\sigma_\phi = \min\{\lambda_1, \lambda_2\} + \mu$ (see (17) and (4)), $V_t$ is $\sigma_\phi$-strongly convex. Further,

$$V_t(x) = \alpha_{t+1} \langle \nabla F(y_{t+1}), x \rangle$$
$$+ (1 + A_t \sigma_{F/\phi}) \Bigg( \frac{1}{n} \sum_{k=1}^n \ell(x, \zeta_k) + \frac{\mu}{2}\|x\|^2 - \bigg( \frac{1}{n} \sum_{k=1}^n \ell(u_t, \zeta_k) + \frac{\mu}{2}\|u_t\|^2 \bigg) - \bigg\langle \nabla \bigg( \frac{1}{n} \sum_{k=1}^n \ell(u_t, \zeta_k) + \frac{\mu}{2}\|u_t\|^2 \bigg), x - u_t \bigg\rangle \Bigg)$$
$$+ \alpha_{t+1} \sigma_{F/\phi} \Bigg( \frac{1}{n} \sum_{k=1}^n \ell(x, \zeta_k) + \frac{\mu}{2}\|x\|^2 - \bigg( \frac{1}{n} \sum_{k=1}^n \ell(y_{t+1}, \zeta_k) + \frac{\mu}{2}\|y_{t+1}\|^2 \bigg) - \bigg\langle \nabla \bigg( \frac{1}{n} \sum_{k=1}^n \ell(y_{t+1}, \zeta_k) + \frac{\mu}{2}\|y_{t+1}\|^2 \bigg), x - y_{t+1} \bigg\rangle \Bigg)$$
$$= \mathrm{Const} + \Bigg\langle \alpha_{t+1} \nabla F(y_{t+1}) - \frac{1}{n} \sum_{k=1}^n \Big( (1 + A_t \sigma_{F/\phi}) \nabla \ell(u_t, \zeta_k) + \alpha_{t+1} \sigma_{F/\phi} \nabla \ell(y_{t+1}, \zeta_k) \Big) - \mu \big( (1 + A_t \sigma_{F/\phi}) u_t + \alpha_{t+1} \sigma_{F/\phi} y_{t+1} \big),\ x \Bigg\rangle$$
$$+ (1 + A_{t+1} \sigma_{F/\phi}) \bigg( \frac{1}{n} \sum_{k=1}^n \ell(x, \zeta_k) + \frac{\mu}{2}\|x\|^2 \bigg), \qquad (31)$$

where $\ell(\cdot)$ is defined in (17). Thus, the subproblem at each step of InSPAG has the form of a regularized logistic regression minimization. This problem has a Lipschitz derivative of every order, in particular of order 3. Indeed, let us define the matrix $A = [\eta_1 \xi_1, \ldots, \eta_n \xi_n]^\top$. Then, by Theorem 5.4 in (Bullins, 2020) with $\mu = 1$, the function $\frac{1}{n} \sum_{k=1}^n \ell(x, \zeta_k)$ has a Lipschitz third-order derivative with constant $L_{V,3} = 15\|A^\top A\|^2$ w.r.t. the 2-norm, or with constant $L_{V,3} = 15$ w.r.t. the $\|\cdot\|_{A^\top A}$-norm. Since adding a linear function and a quadratic function does not change the Lipschitz constant of the third-order derivative, $V_t$ has a Lipschitz third-order derivative with constant $L_{V,3}$. At the same time, $\frac{1}{n} \sum_{k=1}^n \ell(x, \zeta_k)$ has a Lipschitz gradient with constant $\max\{\lambda_1, \lambda_2\} + \frac{1}{n} \sum_{k=1}^n \|\eta_k \xi_k\|_2^2$. Here, for the logistic loss, we used the smoothing result (Nesterov, 2005)[Sect. 4.4] with $m = 2$, $\mu = 1$, $a_1 = 0$, and $a_2 = \eta_k \xi_k$ for each $k$, and then summed the results for $k = 1, \ldots, n$. Note also that, according to (17), $\ell$ contains a quadratic function. Combining these observations with (31), we obtain that $V_t$ has a Lipschitz gradient with constant $L_{V,1} = \mu + \max\{\lambda_1, \lambda_2\} + \frac{1}{n} \sum_{k=1}^n \|\eta_k \xi_k\|_2^2 = L_\phi$.
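The practical consequence of (31) is that, between outer iterations, only the coefficient of the linear term and the scalar $1 + A_{t+1}\sigma_{F/\phi}$ change, while the expensive regularized logistic part is fixed. Below is a sketch of the resulting oracle; the names are ours, the rows of `A_mat` stand for $\eta_k \xi_k$, and we fold any extra quadratic from (17) into $\mu$ here.

```python
import numpy as np

def logistic_part(x, A_mat):
    """Value and gradient of (1/n) * sum_k log(1 + exp(-a_k^T x)),
    where row a_k of A_mat plays the role of eta_k * xi_k."""
    z = A_mat @ x
    val = np.mean(np.logaddexp(0.0, -z))   # numerically stable log(1 + e^{-z})
    grad = A_mat.T @ (-1.0 / (1.0 + np.exp(z))) / A_mat.shape[0]
    return val, grad

def V_t_oracle(x, A_mat, mu, lin_coef, scale):
    """Value (up to Const) and gradient of V_t in the form (31):
    lin_coef collects the fixed linear term (the long bracket in (31)),
    scale = 1 + A_{t+1} * sigma_{F/phi} multiplies the composite part."""
    val, grad = logistic_part(x, A_mat)
    composite = val + 0.5 * mu * np.dot(x, x)
    return lin_coef @ x + scale * composite, lin_coef + scale * (grad + mu * x)
```

This is why warm-starting the inner solver across outer iterations is cheap: only `lin_coef` and `scale` are refreshed.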
Proof of Lemma 1

Since $V_t$ is $\sigma_\phi$-strongly convex, we have
$$\frac{\sigma_\phi}{2} \|\hat x - x_t^*\|^2 \leq V_t(\hat x) - V_t(x_t^*) \leq \Delta_t \;\Rightarrow\; \|\hat x - x_t^*\| \leq \sqrt{2\Delta_t / \sigma_\phi}. \qquad (32)$$
Hence, for any $x \in B(0, D)$,
$$\langle \nabla V_t(\hat x), x - \hat x \rangle = \langle \nabla V_t(\hat x) - \nabla V_t(x_t^*), x - \hat x \rangle + \langle \nabla V_t(x_t^*), x - x_t^* \rangle + \langle \nabla V_t(x_t^*), x_t^* - \hat x \rangle$$
$$\geq -L_\phi \|x_t^* - \hat x\| \, \|x - \hat x\| + 0 - \|\nabla V_t(x_t^*)\| \, \|x_t^* - \hat x\| \geq -2 L_\phi D \sqrt{2\Delta_t / \sigma_\phi} - \big( L_\phi \|0 - x_t^*\| + \|\nabla V_t(0)\| \big) \sqrt{2\Delta_t / \sigma_\phi}$$
$$\geq -\big( 3 L_\phi D + \|\nabla V_t(0)\| \big) \sqrt{2\Delta_t / \sigma_\phi} \geq -\hat D_\phi / (t+1),$$
where we used the expression for $\Delta_t$.
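Read in the other direction, Lemma 1 prescribes how accurately the inner solver must work: rearranging (32) for the target right-hand side $\hat D_\phi/(t+1)$ gives a functional accuracy $\Delta_t$. A sketch of this rearrangement (ours, under the notation above):

```python
def inner_accuracy(t, D_hat, sigma_phi, L_phi, D, grad_V0_norm):
    """Functional accuracy Delta_t for the inner solver such that, by (32),
        (3*L_phi*D + ||grad V_t(0)||) * sqrt(2*Delta_t / sigma_phi)
            <= D_hat / (t + 1),
    i.e. the inexact first-order optimality condition of Lemma 1 holds."""
    target = D_hat / (t + 1)
    return sigma_phi * target ** 2 / (2.0 * (3.0 * L_phi * D + grad_V0_norm) ** 2)
```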
C. Hyperfast Second-Order Method

In this section we consider the general problem
$$x^* = \arg\min_{x \in Q} f(x),$$
where $Q$ is a closed convex bounded set and $f$ has $L_3$-Lipschitz third-order derivative and is strongly convex, i.e., there exists $\sigma > 0$ s.t.
$$f(y) \geq f(x) + \langle \nabla f(x), y - x \rangle + \frac{\sigma}{2} \|y - x\|^2, \quad \forall x, y \in Q. \qquad (33)$$
As a corollary,
$$f(y) \geq f(x^*) + \frac{\sigma}{2} \|y - x^*\|^2, \quad \forall y \in Q, \qquad (34)$$
where $x^*$ is a solution to the problem.

Theorem 3 ((Nesterov, 2020a)[Theorem 2]). Let the sequence $x_k$, $k \geq 0$, be generated by the Hyperfast Second-Order Method (Nesterov, 2020a)[Eq. 3.6] for $p = 3$ and $\beta = 1/2$ and with auxiliary steps described in (Nesterov, 2020a)[Sect. 5.2]. Then
$$f(x_k) - f^* \leq \frac{3 L_3 R^4}{4(1 - \beta)} \left[ \frac{k}{2} \right]^{-5} \leq \frac{48 \, L_3 R^4}{k^5},$$
where $R$ is such that $\|x_0 - x^*\| \leq R$. Let us define $c = 48$.
Theorem 4. Let the sequence $z_k$, $k \geq 0$, be generated by Algorithm 3. Then
$$\frac{\sigma}{2} \|z_k - x^*\|^2 \leq f(z_k) - f(x^*) \leq \frac{\sigma R_0^2}{2} \cdot 4^{-k},$$
and the total number of steps of the Hyperfast Second-Order Method to reach $f(z_k) - f^* \leq \varepsilon$ is bounded by ($c$ is the constant in Theorem 3)
$$5 \left( \frac{8 c L_3 R_0^2}{\sigma} \right)^{1/5} + \log_2 \frac{\sigma R_0^2}{\varepsilon}.$$
Algorithm 3 Restarted Hyperfast Second-Order Method
Require: $\sigma$, $z_0$, $R_0$ s.t. $\|x^* - z_0\| \leq R_0$.
for $k = 0, 1, \ldots$ do
    Set $R_k = R_0 \cdot 2^{-k}$ and $N_k = \max\left\{ \left\lceil \left( \frac{8 c L_3 R_k^2}{\sigma} \right)^{1/5} \right\rceil, 1 \right\}$.
    Set $z_{k+1} = y_{N_k}$ as the output of the Hyperfast Second-Order Method (Nesterov, 2020a)[Eq. 3.6] for $p = 3$ and $\beta = 1/2$, with auxiliary steps described in (Nesterov, 2020a)[Sect. 5.2], started from $z_k$ and run for $N_k$ steps applied to $f(x)$.
    Set $k = k + 1$.
end for
Ensure: $z_k$.
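A compact sketch of Algorithm 3, treating the Hyperfast Second-Order Method as a black box `inner_method(z, N)` with the Theorem 3 guarantee; the names and the stopping test are ours.

```python
import math

def restarted_hyperfast(inner_method, z0, R0, sigma, L3, eps, c=48.0):
    """Algorithm 3 (sketch): halve the distance estimate R_k at each restart.

    inner_method(z, N) -- assumed to run N steps of the Hyperfast
    Second-Order Method from z and return the last iterate, so that
    f(out) - f* <= c * L3 * ||z - x*||**4 / N**5  (Theorem 3).
    """
    z, R = z0, R0
    while sigma * R * R / 2.0 > eps:       # invariant: f(z_k) - f* <= sigma*R_k^2/2
        N = max(math.ceil((8.0 * c * L3 * R * R / sigma) ** 0.2), 1)
        z = inner_method(z, N)             # contracts the distance to x* by 1/2
        R /= 2.0
    return z
```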
Proof. For $k = 0$ we have $\|x^* - z_k\| \leq R_k$. Let us assume that $\|x^* - z_k\| \leq R_k$ and show that $\|x^* - z_{k+1}\| \leq R_{k+1}$. From Theorem 3, since $f$ is $\sigma$-strongly convex and has $L_3$-Lipschitz third-order derivative, it holds that
$$\frac{\sigma}{2} \|z_{k+1} - x^*\|^2 \leq f(z_{k+1}) - f(x^*) \leq \frac{c L_3 \|z_k - x^*\|^4}{N_k^5} \leq \frac{\sigma (R_k/2)^2}{2} = \frac{\sigma R_{k+1}^2}{2}$$
by the choice of $N_k$. Thus, $\|z_{k+1} - x^*\| \leq R_{k+1}$ and $f(z_k) - f(x^*) \leq \frac{\sigma R_k^2}{2}$ for all $k \geq 0$.

It remains to estimate the number of iterations of the Hyperfast method. Summing up the numbers of steps $N_i$, $i = 0, \ldots, k$, we obtain
$$\sum_{i=0}^k N_i \leq \sum_{i=0}^k \left[ \left( \frac{8 c L_3 R_i^2}{\sigma} \right)^{1/5} + 1 \right] = \left( \frac{8 c L_3 R_0^2}{\sigma} \right)^{1/5} \sum_{i=0}^k 2^{-2i/5} + k \leq 5 \left( \frac{8 c L_3 R_0^2}{\sigma} \right)^{1/5} + \log_2 \frac{\sigma R_0^2}{\varepsilon}.$$
Here we used the facts that $\sum_{i=0}^\infty 2^{-2i/5} = \frac{1}{1 - 2^{-2/5}} \leq 5$ and that it is sufficient to take $k = \log_2 \frac{\sigma R_0^2}{\varepsilon}$ to reach the accuracy $\varepsilon$. This completes the proof.

The result of Theorem 2 is obtained as a corollary by considering $V_t(x)$ as $f$ and substituting $L_3 = L_{V,3}$, $\sigma = \sigma_\phi$, $R_0 = 2D$, $\varepsilon = \Delta_t$, $\hat D_\phi = 2 L_\phi D^2$.
D. Combining the building blocks

When applied to the solution of the auxiliary problem (11) in InSPAG with accuracy $\Delta_t$ at iteration $t$, Theorem 2 gives the iteration complexity (in terms of iterations of the Hyperfast method)
$$O\left( \left( \frac{c L_{V,3} D^2}{\sigma_\phi} \right)^{1/5} + \log_2 \frac{\sigma_\phi D^2}{\Delta_t} \right) = O\left( \left( \frac{\|A^\top A\|^2 D^2}{\min\{\lambda_1, \lambda_2\} + \mu} \right)^{1/5} + \log_2 \frac{2 D^2 t^2 \big( 3 L_\phi D + \|\nabla V_t(0)\| \big)^2}{\hat D_\phi^2} \right)$$
$$= O\left( \left( \frac{\|A^\top A\|^2 D^2}{\min\{\lambda_1, \lambda_2\} + \mu} \right)^{1/5} + \log_2 \frac{t^2 \big( 3 L_\phi D + \|\nabla V_t(0)\| \big)^2}{2 L_\phi^2 D^2} \right) = O\left( \left( \frac{\|A^\top A\|^2 D^2}{\min\{\lambda_1, \lambda_2\} + \mu} \right)^{1/5} + \log_2 \frac{t \big( L_\phi D + \|\nabla V_t(0)\| \big)}{L_\phi D} \right), \qquad (35)$$
where we substituted $\Delta_t$ from Lemma 1 and $\hat D_\phi = 2 L_\phi D^2$. To get the total number of inner iterations for $T$ steps of InSPAG, we need to sum these numbers for $t = 1, \ldots, T$, which gives
$$O\left( T \left( \frac{\|A^\top A\|^2 D^2}{\min\{\lambda_1, \lambda_2\} + \mu} \right)^{1/5} + T \log T + T \log \frac{L_\phi D + \max_t \|\nabla V_t(0)\|}{L_\phi D} \right).$$
Finally, recall that $T \leq \tilde O(\sqrt{\kappa_{F/\phi}} \log(1/\varepsilon))$, $\kappa_{F/\phi} = 1 + \tilde O(\kappa_F \sqrt{d/n})$, and $D = O(d)$.
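For concreteness, the bound obtained by summing (35) can be evaluated numerically; the constants hidden by the $O(\cdot)$ are suppressed, so the sketch below (with hypothetical inputs) is only a back-of-the-envelope estimate of how the two terms scale, not a guaranteed count.

```python
import math

def total_inner_iterations(T, AtA_norm, D, lam_min, mu, L_phi, grad_norms):
    """Back-of-the-envelope evaluation of the sum of (35) over t = 1..T.
    grad_norms[t-1] stands for ||grad V_t(0)||; O(.)-constants are dropped."""
    tensor = (AtA_norm ** 2 * D ** 2 / (lam_min + mu)) ** 0.2  # kappa^{1/5}-type term
    total = 0.0
    for t in range(1, T + 1):
        total += tensor + math.log(t * (L_phi * D + grad_norms[t - 1]) / (L_phi * D))
    return total

# e.g. with T = 50 communication rounds and constant gradient norms:
print(total_inner_iterations(50, 10.0, 5.0, 0.1, 0.1, 20.0, [50.0] * 50))
```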
E. Hyperfast Second-Order Method for uniformly convex functions

In the setting of Appendix C, instead of assuming strong convexity, let us make the assumption that the objective $f(x)$ is uniformly convex of degree $2 \leq q \leq 4$ on the convex bounded set $Q$, i.e., there exists $\sigma_q > 0$ s.t.
$$f(y) \geq f(x) + \langle \nabla f(x), y - x \rangle + \frac{\sigma_q}{q} \|y - x\|^q, \quad \forall x, y \in Q. \qquad (36)$$
As a corollary,
$$f(y) \geq f(x^*) + \frac{\sigma_q}{q} \|y - x^*\|^q, \quad \forall y \in Q, \qquad (37)$$
where $x^*$ is a solution to the problem.

We show how the restart technique can be used to accelerate the Hyperfast Second-Order Method under this additional assumption.
Theorem 5. Let the sequence $z_k$, $k \geq 0$, be generated by Algorithm 4. Then
$$\frac{\sigma_q}{q} \|z_k - x^*\|^q \leq f(z_k) - f^* \leq \Delta_0 \cdot 2^{-k},$$
and the total number of steps of Algorithm 2 is bounded by ($c$ is the constant in Theorem 3)
$$\left( \frac{2 c \, q^{4/q} L_3}{\sigma_q^{4/q}} \right)^{1/5} (\Delta_0)^{\frac{4-q}{5q}} \cdot \sum_{i=0}^k 2^{-i \frac{4-q}{5q}} + k.$$
Proof. Let us prove the first statement of the theorem by induction. For $k = 0$ it holds. If it holds for some $k \geq 0$, then by the choice of $N_k$ we have that
$$\frac{c L_3}{N_k^5} \left( \frac{q \Delta_k}{\sigma_q} \right)^{4/q} \leq \frac{\Delta_k}{2}.$$
Algorithm 4 Restarted Hyperfast Second-Order Method
Require: $q$, $\sigma_q$, $z_0$, $\Delta_0$ s.t. $f(z_0) - f^* \leq \Delta_0$.
for $k = 0, 1, \ldots$ do
    Set $\Delta_k = \Delta_0 \cdot 2^{-k}$ and $N_k = \max\left\{ \left\lceil \left( \frac{2 c \, q^{4/q} L_3}{\sigma_q^{4/q}} \, \Delta_k^{\frac{4-q}{q}} \right)^{1/5} \right\rceil, 1 \right\}$.
    Set $z_{k+1} = y_{N_k}$ as the output of Algorithm 2 started from $z_k$ and run for $N_k$ steps.
    Set $k = k + 1$.
end for
Ensure: $z_k$.

By (37),
$$\|z_k - x^*\| \leq \left( \frac{q (f(z_k) - f^*)}{\sigma_q} \right)^{1/q} \leq \left( \frac{q \Delta_k}{\sigma_q} \right)^{1/q}$$
since, by our assumption, $q \leq 4$. Combining the above two inequalities and Theorem 3, we obtain
$$f(z_{k+1}) - f^* \leq \frac{c L_3 \|z_k - x^*\|^4}{N_k^5} \leq \frac{\Delta_k}{2} = \Delta_{k+1}.$$
It remains to bound the total number of steps of Algorithm 2. Denote $\tilde c = \left( 2 c \, q^{4/q} \right)^{1/5}$. Then
$$\sum_{i=0}^k N_i \leq \tilde c \left( \frac{L_3}{\sigma_q^{4/q}} \right)^{1/5} \sum_{i=0}^k \big( \Delta_0 \cdot 2^{-i} \big)^{\frac{4-q}{5q}} + k \leq \tilde c \left( \frac{L_3}{\sigma_q^{4/q}} \right)^{1/5} (\Delta_0)^{\frac{4-q}{5q}} \cdot \sum_{i=0}^k 2^{-i \frac{4-q}{5q}} + k.$$

Let us make several remarks on the complexity of the restarted scheme in different settings. It is easy to see from Theorem 5 that, to achieve an accuracy $\varepsilon$, i.e., to find a point $\hat x$ s.t. $f(\hat x) - f^* \leq \varepsilon$, the number of tensor steps in Algorithm 4 is
$$O\left( \left( \frac{L_3}{\sigma_q^{4/q}} \right)^{1/5} (\Delta_0)^{\frac{4-q}{5q}} + \log_2 \frac{\Delta_0}{\varepsilon} \right) \text{ for } q < 4, \quad \text{and} \quad O\left( \left( \left( \frac{L_3}{\sigma_4} \right)^{1/5} + 1 \right) \log_2 \frac{\Delta_0}{\varepsilon} \right) \text{ for } q = 4.$$
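The only moving part of Algorithm 4 is the schedule $(\Delta_k, N_k)$. The sketch below (our names, hypothetical constants) makes the two regimes of the remark above visible: for $q < 4$ the inner-step counts decay geometrically, while for $q = 4$ they stay constant and the total is driven by the $\log_2(\Delta_0/\varepsilon)$ restarts.

```python
import math

def restart_schedule(q, sigma_q, L3, Delta0, eps, c=48.0):
    """Inner-step counts N_k of Algorithm 4 until f(z_k) - f* <= eps.
    For q < 4 the exponent (4 - q)/q is positive, so N_k decays with k;
    for q = 4 it is zero and every N_k equals the same constant."""
    Delta, schedule = Delta0, []
    while Delta > eps:
        N = (2.0 * c * q ** (4.0 / q) * L3 / sigma_q ** (4.0 / q)
             * Delta ** ((4.0 - q) / q)) ** 0.2
        schedule.append(max(math.ceil(N), 1))
        Delta /= 2.0
    return schedule

print(restart_schedule(q=2, sigma_q=0.5, L3=10.0, Delta0=100.0, eps=1e-3))
print(restart_schedule(q=4, sigma_q=0.5, L3=10.0, Delta0=100.0, eps=1e-3))
```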
F. Additional Simulation Results

In this section, we show additional simulation results on the performance of the proposed InSPAG method in Algorithm 1.

Figure 4 shows the loss of the auxiliary problem at different communication rounds for both the Hyperfast method in Algorithm 2 and the SDCA method from (Shalev-Shwartz, 2016). The x-axis is the wall clock time recorded by the algorithm, and the y-axis is the loss of the auxiliary problem. Markers indicate the time when an iteration has been completed. We show the function value of the auxiliary problem at three communication rounds. We can observe that in the first of these rounds, both the Hyperfast and SDCA methods have comparable performance in the sense that they execute the same number of iterations in approximately the same time. However, the loss achieved by SDCA is lower than the one achieved by the Hyperfast method. In this case, we use preconditioning with n = 1000. Nevertheless, in the two later communication rounds, the Hyperfast method outperforms SDCA both in the obtained loss and in the time required for every inner iteration.

Figure 4. The complexity of solving the auxiliary subproblem with the Hyperfast method in Algorithm 2 and the SDCA method from (Shalev-Shwartz, 2016), with n = 1000, at three communication rounds. The x-axis is the wall clock time recorded by the algorithm, and the y-axis is the loss of the auxiliary problem. Markers indicate the time when an iteration has been completed.

Figure 5 shows the times required by the Hyperfast method in Algorithm 2 and the SDCA method from (Shalev-Shwartz, 2016) to complete the inner iterations at the same three communication rounds. The x-axis is the iteration number, and the y-axis is the time required by the corresponding algorithm to complete an inner iteration. We can observe that in the first of these rounds, the time required by both methods is approximately the same on average. However, in the two later communication rounds, the Hyperfast method outperforms SDCA as it requires less time to complete an iteration.

Figure 6 on the left shows the loss function F(x_k) evaluated at the iterate x_k as a function of the wall clock time recorded by the InSPAG method in Algorithm 1. Markers identify when an iteration, in this case an execution of the Hyperfast method in Algorithm 2, has been completed. Moreover, we show the dependency on the number of points n used for preconditioning. We observe that for different values of n, the final loss is about the same. However, as n increases, the required wall clock time increases as well. The right plot shows the loss function F(x_k) evaluated at the iterate x_k as a function of the number of communication rounds. As expected, when the number of data points used for preconditioning increases, the number of required communication rounds decreases. However, this implies that the central node needs to solve a bigger problem, and hence take a longer time, at every iteration.

Figure 5. The time complexity per iteration for the Hyperfast method in Algorithm 2 and the SDCA method from (Shalev-Shwartz, 2016) at three communication rounds. The x-axis is the iteration number, and the y-axis is the time required by the corresponding algorithm to complete an inner iteration.
Figure 6. A comparison of the wall clock times and communication rounds for the InSPAG method in Algorithm 1 for different numbers of data points used for preconditioning. On the left, the x-axis indicates time in seconds, and on the right the x-axis indicates the number of communication rounds. In both cases the y-axis is the loss function at the current iterate.

Figure 7 shows the wall clock time required by the central node to solve the auxiliary problem at every communication round. The x-axis shows the number of communication rounds, and the y-axis shows the clock time in seconds. Additionally, we show the results for different values of the preconditioning parameter n. As n increases, the time required for the solution of the auxiliary problem increases as well. However, the time complexity of the auxiliary subproblem decreases as the number of communication rounds increases.

Figure 7. Time complexity for the solution of the auxiliary subproblem for different numbers of preconditioning data points. The x-axis shows the number of communication rounds, and the y-axis shows the wall clock time in seconds.