Communication-Efficient Distributed Optimization with Quantized Preconditioners
Foivos Alimisis Peter Davies Dan Alistarh
IST Austria          IST Austria          IST Austria & Neural Magic
Abstract
We investigate fast and communication-efficient algorithms for the classic problem of minimizing a sum of strongly convex and smooth functions that are distributed among n different nodes, which can communicate using a limited number of bits. Most previous communication-efficient approaches for this problem are limited to first-order optimization, and therefore have linear dependence on the condition number in their communication complexity. We show that this dependence is not inherent: communication-efficient methods can in fact have sublinear dependence on the condition number. For this, we design and analyze the first communication-efficient distributed variants of preconditioned gradient descent for Generalized Linear Models, and for Newton's method. Our results rely on a new technique for quantizing both the preconditioner and the descent direction at each step of the algorithms, while controlling their convergence rate. We also validate our findings experimentally, showing faster convergence and reduced communication relative to previous methods.

Due to the sheer size of modern datasets, many practical instances of large-scale optimization are now distributed, in the sense that data are split among several computing nodes, which collaborate to jointly optimize the global objective function. This shift towards distribution induces new challenges, and the community has reconsidered many classic, well-established algorithms in terms of their distribution costs, usually measured in terms of the number of bits sent and received by the algorithms (communication complexity) or by the number of parallel iterations (round complexity).

In this paper, we focus on the communication (bit) complexity of the classic minimization problem

min_{x ∈ R^d} f(x) := (1/n) Σ_{i=1}^n f_i(x),

where the global d-dimensional cost function f is formed as the average of smooth and strongly-convex local costs f_i, owned by different machines, indexed by i = 1, ..., n.

This problem has a rich history. The seminal paper by Tsitsiklis and Luo (1986) considered the case n = 2, and provided a lower bound of Ω(d log(d/ε)) for quadratic functions, as well as an almost-matching upper bound for this case, within logarithmic factors. (Here, d is the problem dimension and ε is the error-tolerance.)

The problem has concentrated significant attention, given the surge of interest in distributed optimization and machine learning, e.g. (Niu et al., 2011; Jaggi et al., 2014; Alistarh et al., 2016; Nguyen et al., 2018; Ben-Nun and Hoefler, 2019). In particular, a series of papers (Khirirat et al., 2018; Ye and Abbe, 2018; Magnússon et al., 2020; Alistarh and Korhonen, 2020) continued to provide improved upper and lower bounds for the communication complexity of this problem, both for deterministic and randomized algorithms, as well as examining related distributed settings and problems (Scaman et al., 2017; Jordan et al., 2018; Vempala et al., 2020; Mendler-Dünner and Lucchi, 2020; Hendrikx et al., 2020).

The best known lower bound for solving the above problem for deterministic algorithms and general n is of Ω(nd log(d/ε)) total communication bits, given recently by Alistarh and Korhonen (2020).
On the algorithmic side, the deterministic solution with the lowest known communication complexity is a variant of quantized gradient descent (Magnússon et al., 2020; Alistarh and Korhonen, 2020) using O(ndκ log κ log(γd/ε)) total bits, where κ is the condition number of the problem and γ is the smoothness bound of f.

An intriguing open question concerns the optimal dependency on the condition number. While existing lower bounds show no such explicit dependency, all known algorithms have linear (or worse) dependency on κ. Resolving this problem is non-trivial, since one usually removes this dependency in the non-distributed case by leveraging curvature information in the form of preconditioning or full Newton steps. However, existing distribution techniques are designed for gradient quantization, and it is not at all clear for instance how using a preconditioning matrix would interact with the convergence properties of the algorithm, and in particular whether favourable convergence behaviour can be preserved at all following quantization.

Contribution.
In this paper, we resolve this question in the positive, and present the first communication-efficient variants of preconditioned gradient descent for generalized linear models (GLMs) and of distributed Newton's method.

Specifically, given a small enough error-tolerance ε, a communication-efficient variant of preconditioned gradient descent for GLMs (QPGD-GLM) can find an ε-minimizer of a γ-smooth function using a total number of bits

B_QPGD-GLM = O( n d κ_ℓ log(n κ_ℓ κ(M)) log(γD²/ε) ),

where d is the dimension, n is the number of nodes, κ_ℓ is the condition number of the loss function ℓ used to measure the distance of training data from the prediction, κ(M) is the condition number of the averaged covariance matrix of the training data, and D is a bound on the initial distance from the optimum. Practically, κ_ℓ is much smaller than the condition number κ of the problem and is equal to 1 in the case that ℓ is a quadratic.

This first result suggests that distributed methods need not have linear dependence on the condition number of the problem. Our main technical result extends the approach to a distributed variant of Newton's method, showing that the same problem can be solved using

B_Newton = O( n d² log(dκ) log(γµ²/(σ²ε)) )

total bits, under the assumption that the Hessian is σ-Lipschitz.

Viewed in conjunction with the above Ω(nd log(d/ε)) lower bound, these algorithms outline a new communication complexity trade-off between the dependency on the dimension of the problem d, and its condition number κ. Specifically, for ill-conditioned but low-dimensional problems, it may be advantageous to employ quantized Newton's method, whereas QPGD-GLM can be used in cases where the structure of the training data favors preconditioning. Further, our results suggest that there can be no communication lower bound for the coordinator model with linear dependence on the condition number of the problem.

Our paper introduces a few tools which should have broader applicability. One is a lattice-based matrix quantization technique, which extends the state-of-the-art vector (gradient) quantization techniques to preconditioners. This enables us to carefully trade off the communication compression achieved by the algorithm with the non-trivial error in the descent directions due to quantization. Our main technical advance is in the context of quantized Newton's method, where we need to keep track of the concentration of quantized Hessians relative to the full-precision version. Further, our algorithms quantize directly the local descent directions obtained by multiplying the inverse of the quantized estimation of the preconditioner with the exact local gradient. This is a non-obvious choice, which turns out to be the correct way to deal with quantized preconditioned methods.

We validate our theoretical results on standard regression datasets, where we show that our techniques can provide an improvement of over 3x in terms of total communication complexity used by the algorithm, while maintaining convergence and solution quality.

Related Work.
There has been a surge of interest in distributed optimization and machine learning. While a complete survey is beyond our scope, we mention the significant work on designing and analyzing communication-efficient versions of classic optimization algorithms, e.g. (Jaggi et al., 2014; Scaman et al., 2017; Jordan et al., 2018; Khirirat et al., 2018; Nguyen et al., 2018; Alistarh et al., 2016; Ye and Abbe, 2018; Magnússon et al., 2020; Ghadikolaei and Magnússon, 2020), and the growing interest in communication and round complexity lower bounds, e.g. (Arjevani and Shamir, 2015; Vempala et al., 2020; Alistarh and Korhonen, 2020; Zhang et al., 2013; Shamir, 2014). In this context, our work is the first to address the bit complexity of optimization methods which explicitly employ curvature information, and shows that such methods can indeed be made communication-efficient.

Tsitsiklis and Luo (1986) gave the first upper and lower bounds for the communication (bit) complexity of distributed convex optimization, considering the case of two nodes. Their algorithm is a variant of gradient descent which performs adaptive quantization, in the sense that nodes adapt the number of bits they send and the quantization grid depending on the iteration. Follow-up work, e.g. (Khirirat et al., 2018; Alistarh et al., 2016), generalized their algorithm to an arbitrary number of nodes, and continued to improve complexity. In this line, the work closest to ours is that of Magnússon et al. (2020), who introduce a family of adaptive gradient quantization schemes which can enable linear convergence in any norm for gradient-descent-type algorithms, in the same system setting considered in our work. However, we emphasize that this work did not consider preconditioning. (Alistarh and Korhonen (2020) also focus on GD, but use different quantizers and a more refined analysis to obtain truly tight communication bounds for the specific case of quadratics.)

Conceptually, the quantization techniques we introduce serve a similar purpose: to allow the convergence properties of the algorithm to be preserved, despite noisy directional information. At the technical level, however, the schemes we describe and analyze are different, and arguably more complex. For instance, since only the gradient information is quantized, Magnússon et al. (2020) can use grid quantization adapted to gradient norms, whereas we employ more complex quantization, as well as fine-grained bookkeeping with respect to the concentration of quantized matrices and descent directions.

Recently, Hendrikx et al. (2020) proposed a distributed preconditioned accelerated gradient method for our setting, where preconditioning is done by solving a local optimization problem over a subsampled dataset at the server. The goal of the method is to reduce the number of communication rounds, and not bits transmitted. Their convergence rate depends on the square root of the relative condition number between the global and local loss functions.
Distributed Setting.
As discussed, we are in a standard distributed optimization setting, where we have n nodes, and each node i has its own local cost function f_i: R^d → R (where d is the dimension of the problem). We wish to minimize the average cost f = (1/n) Σ_{i=1}^n f_i and, for that, some communication between nodes is required. Communication may be performed over various network topologies, but in this work we assume a simple structure where an arbitrary node plays the role of the central server, i.e. receives messages from the others, processes them, and finally sends the result back to all. (Such topologies are also common in practice (Li et al., 2014).) Then, the nodes compute an update based on their local cost, and subsequently transmit this information again to the master, repeating the pattern until convergence.

The two main usually considered complexity metrics are the total number of rounds, or iterations, which the algorithm requires, and the total number of bits transmitted. In this paper, we focus on the latter metric, and assume that nodes cannot communicate their information with infinite precision, but instead aim to limit the number of bits that each node can use to encode messages. Thus, we measure complexity in terms of the total number of bits that the optimization algorithm needs to use, in order to minimize f within some accuracy.

Matrix Vectorization.
One of the main technical tools of our work is quantization of matrices. All the matrices that we care to quantize turn out to be symmetric. The first step for quantizing is to vectorize them. We do so by using the mapping φ: S(d) → R^{d(d+1)/2} defined by

φ(P) = (p_{11}, ..., p_{1d}, p_{22}, ..., p_{2d}, ..., p_{dd}),

where P = (p_{ij})_{i,j=1}^d and S(d) is the space of d × d symmetric matrices. Thus, the mapping φ just isolates the upper triangle of a symmetric matrix and writes it as a vector. It is direct to check that φ is a linear isomorphism (notice that dim(S(d)) = d(d+1)/2). The following lemma relates the ℓ2 norm in S(d) and the ℓ2 one in R^{d(d+1)/2}:

Lemma 1.
For any matrices
P, P′ ∈ S(d), we have

(1/√d) ‖φ(P) − φ(P′)‖ ≤ ‖P − P′‖ ≤ √2 ‖φ(P) − φ(P′)‖.

The proof can be found in Appendix A. We will use the isomorphism φ later in our applications to Generalized Linear Models and Newton's method. This is the reason for the appearance of the extra d inside a logarithm in our upper bounds. From now on we use ‖·‖ to denote the ℓ2 norm of either vectors or matrices.
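To make the vectorization concrete, the following minimal numpy sketch implements φ and its inverse (the helper names mat_to_vec and vec_to_mat are ours, not from the paper) and checks the norm relation of Lemma 1 on random symmetric matrices; it is an illustration under our own toy conventions, not the paper's implementation.

```python
import numpy as np

def mat_to_vec(P):
    """phi: stack the upper triangle (including the diagonal) of a symmetric d x d
    matrix into a vector of length d(d+1)/2."""
    iu = np.triu_indices(P.shape[0])
    return P[iu]

def vec_to_mat(v, d):
    """phi^{-1}: rebuild the symmetric matrix from its upper-triangular vector."""
    P = np.zeros((d, d))
    iu = np.triu_indices(d)
    P[iu] = v
    return P + P.T - np.diag(np.diag(P))   # mirror the strict upper triangle

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d = 5
    A = rng.standard_normal((d, d)); P = (A + A.T) / 2
    B = rng.standard_normal((d, d)); Pp = (B + B.T) / 2
    vec_gap = np.linalg.norm(mat_to_vec(P) - mat_to_vec(Pp))
    spec_gap = np.linalg.norm(P - Pp, 2)            # spectral (l2) norm of the matrix gap
    assert vec_gap / np.sqrt(d) <= spec_gap <= np.sqrt(2) * vec_gap + 1e-12
    assert np.allclose(vec_to_mat(mat_to_vec(P), d), P)
```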
Lattice Quantization. For estimating the gradient and Hessian in a distributed manner with limited communication, we use a quantization procedure developed in (Davies et al., 2020). The original quantization scheme involves randomness, but we use a deterministic version of it, by picking the closest point to the vector that we want to encode. This is similar to the quantization scheme used by Alistarh and Korhonen (2020) for standard gradient descent, and has the properties summarized in Proposition 2 below.
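Before the formal statement, here is a rough, simplified sketch of such a deterministic quantizer: a cubic grid with mod-based coloring, in which the encoder sends each coordinate's grid cell modulo a small number of colors and the decoder resolves the ambiguity using its reference point. This is our own toy stand-in for the lattice construction of Davies et al. (2020); in particular it needs O(d log(√d·y/ε)) bits rather than the O(d log(y/ε)) of Proposition 2, and all names are ours.

```python
import numpy as np

def _params(d, y, eps):
    h = eps / np.sqrt(d)                        # grid spacing: l2 rounding error <= eps/2
    m = 2 * (int(np.ceil(y / h)) + 1) + 1       # number of "colors" per coordinate
    return h, m

def enc(x, y, eps):
    """Encoder: knows only x. Sends each coordinate's grid cell modulo m."""
    h, m = _params(x.shape[0], y, eps)
    return np.round(x / h).astype(np.int64) % m  # about d * log2(m) bits

def dec(residues, x_ref, y, eps):
    """Decoder: knows a reference point x_ref with ||x - x_ref|| <= y."""
    h, m = _params(x_ref.shape[0], y, eps)
    ref_cell = np.round(x_ref / h).astype(np.int64)
    delta = (residues - ref_cell) % m
    delta = np.where(delta > m // 2, delta - m, delta)   # closest consistent cell
    return (ref_cell + delta) * h

def Q(x, x_ref, y, eps):
    return dec(enc(x, y, eps), x_ref, y, eps)

# quick check of the decoding and accuracy guarantees on random inputs
rng = np.random.default_rng(1)
x_ref = rng.standard_normal(20)
x = x_ref + 0.3 * rng.standard_normal(20)
y, eps = np.linalg.norm(x - x_ref) + 1e-9, 0.05
assert np.linalg.norm(Q(x, x_ref, y, eps) - x) <= eps
```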
Proposition 2. (Davies et al., 2020; Alistarh and Korhonen, 2020) Denoting by b the number of bits that each machine uses to communicate, there exists a quantization function Q: R^d × R^d × R_+ × R_+ → R^d, which, for each ε, y > 0, consists of an encoding function enc_{ε,y}: R^d → {0,1}^b and a decoding one dec_{ε,y}: {0,1}^b × R^d → R^d, such that, for all x, x′ ∈ R^d,

• dec_{ε,y}(enc_{ε,y}(x), x′) = Q(x, x′, y, ε), if ‖x − x′‖ ≤ y.
• ‖Q(x, x′, y, ε) − x‖ ≤ ε, if ‖x − x′‖ ≤ y.
• If y/ε exceeds a fixed constant, the cost of the quantization procedure in number of bits satisfies b = O(d log(y/ε)).

As a warm-up, we consider the case of a Generalized Linear Model (GLM) with data matrix A ∈ R^{m×d}. GLMs are particularly attractive models to distribute, because the distribution across nodes can be performed naturally by partitioning the available data. For more background on distributing GLMs see (Mendler-Dünner and Lucchi, 2020).

The matrix A contains the data used for training in its rows, i.e. we have m-many d-dimensional data points. As is custom in regression analysis, we assume that m ≫ d, i.e. we are in the case of big but low-dimensional data. If m is very large, it can be very difficult to store the whole matrix A in one node, so we distribute it over n-many nodes, each one owning m_i-many data points (m = Σ_{i=1}^n m_i). We pack the data owned by node i in a matrix A_i ∈ R^{m_i×d} and denote the function used to measure the error on machine i by ℓ_i: R^{m_i} → R. Then the local cost function f_i: R^d → R at machine i reads

f_i(x) = ℓ_i(A_i x).

We can express the global cost function f in the form f(x) = ℓ(Ax), where ℓ: R^m → R is a global loss function defined by

ℓ(y) = (1/n) Σ_{i=1}^n ℓ_i(y_i),

where y_i are the sets of m_i-many coordinates of y obtained by the same data partitioning.
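As a concrete (hypothetical) instance of this setup, the sketch below builds a toy distributed quadratic GLM, with local losses ℓ_i(z) = ½‖z − b_i‖², so that f_i(x) = ½‖A_i x − b_i‖²; the data, sizes, and variable names are ours and are reused by the later sketches.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m_i = 4, 10, 50                 # nodes, dimension, points per node (toy sizes)
A_parts = [rng.standard_normal((m_i, d)) for _ in range(n)]   # A_i: local data blocks
b_parts = [rng.standard_normal(m_i) for _ in range(n)]        # local targets

def local_loss(i, x):
    # f_i(x) = l_i(A_i x) with the quadratic loss l_i(z) = 0.5 * ||z - b_i||^2
    r = A_parts[i] @ x - b_parts[i]
    return 0.5 * r @ r

def local_grad(i, x):
    # gradient of f_i: A_i^T (A_i x - b_i)
    return A_parts[i].T @ (A_parts[i] @ x - b_parts[i])

def global_loss(x):
    # f(x) = (1/n) * sum_i f_i(x)
    return sum(local_loss(i, x) for i in range(n)) / n

x = np.zeros(d)
grad_f = sum(local_grad(i, x) for i in range(n)) / n   # what the master would aggregate
print(global_loss(x), np.linalg.norm(grad_f))
```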
Assumption 3. The local loss functions ℓ_i are µ_ℓ-strongly convex and γ_ℓ-smooth.

This assumption implies that the global loss function ℓ is (µ_ℓ/n)-strongly convex and (γ_ℓ/n)-smooth. This is because the Hessian of ℓ has the block-diagonal structure

∇²_y ℓ(y) = (1/n) diag( ∇²_{y_1} ℓ_1(y_1), ..., ∇²_{y_n} ℓ_n(y_n) )

and the eigenvalues of all matrices ∇²_{y_i} ℓ_i(y_i) are between µ_ℓ and γ_ℓ. The Hessian of f can be written as

∇²f(x) = A^T ∇²ℓ(Ax) A ∈ S(d) ⊆ R^{d×d}.

We detail the computation of ∇²f in Appendix B.

Assumption 4.
The matrix A ∈ R^{m×d} is of full rank (i.e. rank(A) = d, since d < m).

This assumption is natural: if two columns of the matrix A were linearly dependent, we would not need both the related features in our statistical model. Practically, we can prune one of them and get a new data matrix of full rank.

Proposition 5.
The maximum eigenvalue of ∇²f satisfies

γ := λ_max(∇²f) ≤ γ_ℓ λ_max(A^T A / n)

and the minimum eigenvalue of ∇²f satisfies

µ := λ_min(∇²f) ≥ µ_ℓ λ_min(A^T A / n).

The proof is presented in Appendix B. Thus, the condition number κ of our minimization problem satisfies

κ ≤ κ_ℓ κ(A^T A / n),

where κ(A^T A / n) is the condition number of the covariance matrix A^T A averaged over the number of machines. The convergence rate of gradient descent generally depends on κ, which can be much larger than κ_ℓ in case the condition number of A^T A is large. The usual way to get rid of κ(A^T A / n) is to precondition gradient descent using A^T A / n, which we denote by M from now on (we recall the convergence analysis of this method in Appendix C). In our setting M is not known to all machines simultaneously, since each machine owns only a part of the overall data. However, we observe that

M = (1/n) Σ_{i=1}^n A_i^T A_i,

where A_i^T A_i =: M_i is the local covariance matrix of the data owned by node i.

In this section we present our QPGD-GLM algorithm and study its communication complexity. We structure the algorithm in four steps: first, we describe how to recover a quantized version of the averaged covariance matrices. Then, we describe how nodes perform initialization. Next, we describe how nodes can quantize the initial descent direction. Finally, we describe how to quantize the descent directions for subsequent steps. Our notation for quantization operations follows Section 2.1.

1. Choose an arbitrary master node, say i_1.

(A) Averaged Covariance Matrix Quantization:
2. Compute M_i := A_i^T A_i in each node.

3. Encode M_i in each node i and decode it in the master node using its information:

M̄_i = φ^{-1}( Q( φ(M_i), φ(M_{i_1}), 2√d n λ_max(M), λ_min(M) / (16√2 κ_ℓ) ) ).

In detail, we first transform the local matrix M_i via the isomorphism φ, and then quantize it via Q, with carefully-set parameters. The matrix will then be de-quantized relative to the master's reference point φ(M_{i_1}), and then re-constituted (in approximate form) via the inverse isomorphism.

4. Average the decoded matrices in the master node: S = (1/n) Σ_{i=1}^n M̄_i.

5. Encode the average in the master node and decode it in each node i using its local information:

M̄ = φ^{-1}( Q( φ(S), φ(M_i), √d ( λ_min(M)/(16κ_ℓ) + 2n λ_max(M) ), λ_min(M) / (16√2 κ_ℓ) ) ).
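The sketch below mirrors the communication pattern of part (A) on the toy data and helpers from the earlier sketches (Q, mat_to_vec, vec_to_mat, A_parts): workers encode their local covariance matrices against the master's matrix, the master averages the decoded matrices and broadcasts a quantized average. The bound and precision passed to Q are simple placeholders, not the carefully-set constants of steps 3 and 5.

```python
# Continues the toy helpers Q, mat_to_vec, vec_to_mat and the data blocks A_parts.
import numpy as np

n, d = len(A_parts), A_parts[0].shape[1]
master = 0                                        # plays the role of node i_1
M_locals = [A.T @ A for A in A_parts]             # M_i = A_i^T A_i
M_exact = sum(M_locals) / n
bound = 2 * np.sqrt(d) * (n * np.linalg.norm(M_exact, 2) + 1.0)   # placeholder radius
prec = 1e-2                                                        # placeholder precision

# workers -> master: each M_i is encoded against the master's own matrix M_{i_1}
M_bar = [vec_to_mat(Q(mat_to_vec(M_locals[i]), mat_to_vec(M_locals[master]),
                      bound, prec), d) for i in range(n)]
S = sum(M_bar) / n                                # averaged decoded matrices at the master

# master -> workers: the average is encoded against each worker's local matrix M_i
M_hat = [vec_to_mat(Q(mat_to_vec(S), mat_to_vec(M_locals[i]), bound, prec), d)
         for i in range(n)]
print(max(np.linalg.norm(Mh - M_exact, 2) for Mh in M_hat))   # small quantization error
```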
(B) Starting Point and Parameters for Descent Direction Quantization:

6. Choose D > 0 and x^(0) ∈ R^d, such that ‖x^(0) − x*‖ ≤ D.

7. Define the parameters

ξ := 1 − 1/(2κ_ℓ),  K := 2/ξ,  δ := ξ(1 − ξ)/4,  R^(t) := (γ_ℓ K / 2) (1 − 1/(4κ_ℓ))^t D.

(C) Quantizing the Initial Descent Direction:
8. Compute M̄^{-1}∇f_i(x^(0)) in each node.

9. Encode M̄^{-1}∇f_i(x^(0)) in each node and decode it in the master node using its local information:

v_i^(0) = Q( M̄^{-1}∇f_i(x^(0)), M̄^{-1}∇f_{i_1}(x^(0)), 4n κ(M) R^(0), δR^(0) ).
10. Average the quantized local information in the master node: r^(0) = (1/n) Σ_{i=1}^n v_i^(0).

11. Encode r^(0) in the master node and decode it in each machine i using its local information:

v^(0) = Q( r^(0), M̄^{-1}∇f_i(x^(0)), (δ + 4n κ(M)) R^(0), δR^(0) ).

12. For t ≥ 0, update

x^(t+1) = x^(t) − η v^(t), for a step size η > 0.

(D) Descent Direction Quantization for Next Steps:
13. Encode M̄^{-1}∇f_i(x^(t+1)) in each node i and decode it in the master node using the previous local estimate:

v_i^(t+1) = Q( M̄^{-1}∇f_i(x^(t+1)), v_i^(t), 4n κ(M) R^(t+1), δR^(t+1) ).
14. Average the quantized local information: r^(t+1) = (1/n) Σ_{i=1}^n v_i^(t+1).

15. Encode r^(t+1) in the master node and decode it in each node using the previous global estimate:

v^(t+1) = Q( r^(t+1), v^(t), (δ + 4n κ(M)) R^(t+1), δR^(t+1) ).

We now discuss the algorithm's assumptions. First, we assume that an over-approximation D for the distance of the initialization from the minimizer is known. This is practical, especially in the case of GLMs: since the loss functions ℓ_i are often quadratics, we can use strong convexity and write

‖x^(0) − x*‖² ≤ (2/µ)(f(x^(0)) − f*) ≤ (2/µ) f(x^(0)) =: D².

Further, following Magnússon et al. (2020) (Assumption 2, page 5), the value f(x^(0)) is often available, for example in the case of logistic regression. Of course, if we are restricted to a compact domain, as is the case in Tsitsiklis and Luo (1986) and Alistarh and Korhonen (2020), then the domain itself provides an over-approximation for all the distances inside it.

The parameters λ_max(M), λ_min(M) used for quantization of the matrix M are usually assumed to be known. Specifically, it is common in distributed optimization to assume that all nodes know estimates of the smoothness and strong convexity constants of each of the local cost functions (Tsitsiklis and Luo, 1986). In our case this would imply knowing all λ_max(M_i), λ_min(M_i). However, we assume knowledge of just λ_max(M) and λ_min(M). This also explains the appearance of the extra log n factor in our GLM bounds, relative to those for Newton's method.

The convergence and communication complexity of our algorithm are described in Theorem 6 below.
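Putting parts (B)–(D) together, the following sketch runs the quantized preconditioned loop on the toy least-squares data from the earlier sketches (reusing Q, A_parts, b_parts, local_grad and M_exact); the radii passed to Q are fixed placeholders rather than the R^(t), δR^(t) schedule, and the exact preconditioner stands in for M̄.

```python
# Continues the earlier toy sketches (Q, A_parts, b_parts, local_grad, M_exact).
import numpy as np

n, d = len(A_parts), A_parts[0].shape[1]
M_inv = np.linalg.inv(M_exact)            # stands in for the quantized preconditioner
eta = 1.0                                 # quadratic loss: kappa_l = 1, so eta = 2/(mu_l + gamma_l) = 1
x = np.zeros(d)
v_loc = [np.zeros(d) for _ in range(n)]   # previous local estimates v_i^(t)
v_glob = np.zeros(d)                      # previous global estimate v^(t)
bound, prec = 1e3, 1e-3                   # placeholder radii, not the R^(t), delta*R^(t) schedule

for t in range(25):
    # workers -> master: quantize local preconditioned gradients against previous estimates
    v_loc = [Q(M_inv @ local_grad(i, x), v_loc[i], bound, prec) for i in range(n)]
    r = sum(v_loc) / n
    # master -> workers: quantize the average against the previous global estimate
    v_glob = Q(r, v_glob, bound, prec)
    x = x - eta * v_glob

grad_norm = np.linalg.norm(sum(local_grad(i, x) for i in range(n)) / n)
print(grad_norm)    # should settle at roughly the quantization precision
```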
Theorem 6. The iterates x^(t) produced by the previous algorithm with η = 2/(µ_ℓ + γ_ℓ) satisfy

‖x^(t) − x*‖ ≤ (1 − 1/(4κ_ℓ))^t D,

and the total number of bits used for communication until f(x^(t)) − f* ≤ ε is

O( n d² log(√d n κ_ℓ κ(M)) ) + O( n d κ_ℓ log(n κ_ℓ κ(M)) log(γD²/ε) ).   (1)

When the accuracy ε is sufficiently small (which is often the case in practice), the first summand is negligible and the total number of bits until reaching it is just

b = O( n d κ_ℓ log(n κ_ℓ κ(M)) log(γD²/ε) ),

which gains over quantized gradient descent in (Alistarh and Korhonen, 2020) the linear dependence on the condition number of M. We prove Theorem 6 in Appendix D.

After warming up with quantizing fixed preconditioners in the case of Generalized Linear Models, we move forward to quantize non-fixed ones. The extreme case of a preconditioner is the whole Hessian matrix; preconditioning with it yields Newton's method, which is computationally expensive, but removes completely the dependency on the condition number from the iteration complexity. We develop a quantized version of Newton's method in order to address a question raised by Alistarh and Korhonen (2020) regarding whether the communication complexity of minimizing a sum of smooth and strongly convex functions depends linearly on the condition number of the problem. The main technical challenge towards that is keeping track of the concentration of the Hessians around the Hessian evaluated at the optimum, while the algorithm converges. We show that the linear dependence of the communication cost on the condition number of the problem is not necessary, in exchange for extra dependence on the dimension of the problem, i.e. d² instead of d. This can give a significant advantage for low-dimensional and ill-conditioned problems (training generalized linear models is among them).

As is natural for Newton's method, we make the following assumptions on the objective function f:

Assumption 7.
The functions f_i are all γ-smooth and µ-strongly convex, with σ-Lipschitz Hessians, for γ, µ, σ > 0.

We note that the lower bound derived by Alistarh and Korhonen (2020) is obtained for the case that the f_i are quadratic functions; quadratic functions indeed satisfy Assumption 7. As in the case of GLMs, we define the condition number of the problem to be κ := γ/µ. We also introduce a constant α ∈ [0, 1), which controls the quantization accuracy and the initialization region.

We now describe our quantized Newton's algorithm. Again, we split the presentation into several parts: local initialization (A), estimating the initial Hessian modulo quantization (B), as well as the quantized initial descent direction (C), and finally, quantization and update for each iteration (D, E).

1. Choose the master node at random, e.g. i_1.

(A) Starting Point and Parameters for Hessian Quantization:
2. Choose x^(0) ∈ R^d, such that ‖x^(0) − x*‖ ≤ αµ/(2σ).
3. We define the parameters G^(t) = µ α^(t+1).

(B) Initial Hessian Quantized Estimation:
4. Compute ∇²f_i(x^(0)) in each node.

5. Encode ∇²f_i(x^(0)) in each node i and decode it in the master node i_1 using its information:

H_{i,0} = φ^{-1}( Q( φ(∇²f_i(x^(0))), φ(∇²f_{i_1}(x^(0))), 2√d γ, G^(0)/(√2 κ) ) ).
6. Average the decoded matrices in the master node: S_0 = (1/n) Σ_{i=1}^n H_{i,0}.

7. Encode the average in the master node and decode it in each node i using its local information:

H_0 = φ^{-1}( Q( φ(S_0), φ(∇²f_i(x^(0))), √d ( G^(0)/κ + 2γ ), G^(0)/(√2 κ) ) ).

Parameters for Descent Direction Quantization:
8. Define the parameters

θ := α(1 − α)/4,  K := 2/α,  P^(t) := (µ/σ) K α^(t+1).

(C) Initial Descent Direction Quantized Estimation:
9. Compute H_0^{-1}∇f_i(x^(0)) in each node.

10. Encode H_0^{-1}∇f_i(x^(0)) in each node and decode it in the master node using its local information:

v_i^(0) = Q( H_0^{-1}∇f_i(x^(0)), H_0^{-1}∇f_{i_1}(x^(0)), 4κ P^(0), θP^(0) ).
11. Average the quantized local information: p^(0) = (1/n) Σ_{i=1}^n v_i^(0).

12. Encode p^(0) in the master node and decode it in each machine i using its local information:

v^(0) = Q( p^(0), H_0^{-1}∇f_i(x^(0)), (θ + 4κ) P^(0), θP^(0) ).

13. For t ≥ 0, update

x^(t+1) = x^(t) − v^(t).

(D) Hessian Quantized Estimation for Next Steps:
14. Compute ∇²f_i(x^(t+1)) in each node.

15. Encode ∇²f_i(x^(t+1)) in each node i and decode it in the master node using the previous local estimate:

H_{i,t+1} = φ^{-1}( Q( φ(∇²f_i(x^(t+1))), φ(H_{i,t}), (5√d/α) G^(t+1), G^(t+1)/(√2 κ) ) ).
16. Average the quantized local Hessian information: S_{t+1} = (1/n) Σ_{i=1}^n H_{i,t+1}.

17. Encode S_{t+1} in the master node and decode it back in each node using the previous global estimate:

H_{t+1} = φ^{-1}( Q( φ(S_{t+1}), φ(H_t), √d ( 1/κ + 5/α ) G^(t+1), G^(t+1)/(√2 κ) ) ).

(E) Descent Direction Quantized Estimation:
18. Compute H_{t+1}^{-1}∇f_i(x^(t+1)) in each node.

19. Encode H_{t+1}^{-1}∇f_i(x^(t+1)) in each node i and decode it in the master node using the previous local estimate:

v_i^(t+1) = Q( H_{t+1}^{-1}∇f_i(x^(t+1)), v_i^(t), 4κ P^(t+1), θP^(t+1) ).
20. Average the quantized local information: p^(t+1) = (1/n) Σ_{i=1}^n v_i^(t+1).

21. Encode p^(t+1) in the master node and decode it back in each node using the previous global estimate:

v^(t+1) = Q( p^(t+1), v^(t), (θ + 4κ) P^(t+1), θP^(t+1) ).

The restriction on the initialization x^(0) is standard for Newton's method, which is known to converge only locally. Usually x^(0) is chosen such that α ≥ (σ/µ)‖x^(0) − x*‖, while we choose it such that α ≥ (2σ/µ)‖x^(0) − x*‖. This difference arises from the extra errors due to quantization.

We now state our theorem on the communication complexity of the quantized Newton's algorithm, which is the main result of the paper. The proof is in Appendix E, and relies on analyzing the behaviour of both the quantized Hessian estimates and the quantized descent direction estimates, as can be seen in Lemmas 16 and 17 respectively.
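To make the flow of parts (B)–(E) concrete, the sketch below runs a toy quantized Newton loop on a small strongly convex problem (a quadratic plus a small quartic term, our own example), reusing the toy helpers Q, mat_to_vec, vec_to_mat and the data blocks A_parts, b_parts from earlier; the radii fed to Q are fixed placeholders rather than the G^(t), P^(t) schedules.

```python
# Toy quantized Newton loop; reuses Q, mat_to_vec, vec_to_mat, A_parts, b_parts.
import numpy as np

n, d = len(A_parts), A_parts[0].shape[1]

def grad_i(i, x):   # gradient of f_i(x) = 0.5*||A_i x - b_i||^2 + 0.1*sum(x^4)
    return A_parts[i].T @ (A_parts[i] @ x - b_parts[i]) + 0.4 * x**3

def hess_i(i, x):   # local Hessian (changes with x, so it is re-quantized every step)
    return A_parts[i].T @ A_parts[i] + np.diag(1.2 * x**2)

h_bound, h_prec = 1e4, 1e-2      # placeholder radii for Hessian quantization
g_bound, g_prec = 1e3, 1e-4      # placeholder radii for descent-direction quantization

x = np.zeros(d)
H_loc = [np.zeros((d, d)) for _ in range(n)]   # previous local Hessian estimates H_{i,t}
H_glob = np.zeros((d, d))                      # previous global Hessian estimate H_t
v_loc = [np.zeros(d) for _ in range(n)]        # previous local direction estimates
v_glob = np.zeros(d)                           # previous global direction estimate

for t in range(15):
    # (D) workers -> master -> workers: quantized Hessian estimate
    H_loc = [vec_to_mat(Q(mat_to_vec(hess_i(i, x)), mat_to_vec(H_loc[i]),
                          h_bound, h_prec), d) for i in range(n)]
    S = sum(H_loc) / n
    H_glob = vec_to_mat(Q(mat_to_vec(S), mat_to_vec(H_glob), h_bound, h_prec), d)
    # (E) workers -> master -> workers: quantized descent direction
    H_inv = np.linalg.inv(H_glob)
    v_loc = [Q(H_inv @ grad_i(i, x), v_loc[i], g_bound, g_prec) for i in range(n)]
    v_glob = Q(sum(v_loc) / n, v_glob, g_bound, g_prec)
    x = x - v_glob

print(np.linalg.norm(sum(grad_i(i, x) for i in range(n)) / n))
```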
Theorem 8. The iterates of the quantized Newton's method starting from a point x^(0) such that

‖x^(0) − x*‖ ≤ µ/(4σ)   (i.e., α = 1/2)

satisfy

‖x^(t) − x*‖ ≤ (µ/σ) (1/2)^(t+1),

and the communication cost until reaching accuracy ε in terms of function values is

O( n d² log(√d κ) log(γµ²/(σ²ε)) )   (2)

many bits in total.

We note that the lower bound derived in (Alistarh and Korhonen, 2020) is for the case that all functions f_i are quadratics. For quadratics, the Hessian is constant, thus σ = 0 and α can be chosen equal to 0 as well. Then, (non-distributed) Newton's method converges in only one step. However, in the distributed case, σ = 0 implies G^(t) = 0, thus the estimation of ∇²f(x^(t)) must be exact. This would mean that we need to use an infinite number of bits, and this can be seen also in our communication complexity results. In order to apply our result in a practical manner, we need to allow the possibility of strictly positive quantization error of the Hessian, thus we must choose σ > 0.

In the previous sections we computed an approximate minimizer of our objective function up to some accuracy and counted the communication cost of the whole process. We now extend our interest to the slightly harder problem of estimating the minimum f* of the function f (which is again assumed to be γ-smooth and µ-strongly convex) in the master node with accuracy ε. This extension is not considered in (Magnússon et al., 2020), but is discussed in (Alistarh and Korhonen, 2020). To that end, we estimate the minimizer x* of f by a vector x^(t), such that f(x^(t)) − f* ≤ ε, and the communication cost of doing that is again given by expression (1) for GLM training and expression (2) for Newton's method.

We denote by x*_i the minimizer of the local cost function f_i and by f*_i := f_i(x*_i) its minimum. We also assume that we are aware of an over-approximation C > 0 of the distances of x* from the minimizers of the local costs x*_i, i.e.

max_{i=1,...,n} ‖x* − x*_i‖ ≤ C,

and a radius c > 0 such that max_{i=1,...,n} |f*_i| ≤ c.

Estimating these constants can be feasible in many practical situations:

• We can always bound the quantity max_{i=1,...,n} ‖x* − x*_i‖ by a known constant if we set our problem in a compact domain, as is the case in (Tsitsiklis and Luo, 1986) and (Alistarh and Korhonen, 2020). Also, if our local data are obtained from the same distribution, then we do not expect the minimizers of the local costs to be too far away from the global minimizer.

• The minima f*_i of the local costs are often exactly 0 (as assumed in (Alistarh and Korhonen, 2020)). This is because the local cost functions f_i are often quadratics, as it happens in the case of GLMs. In the worst case, knowing just that f_i ≥
0, we can write

|f*_i| = f*_i ≤ f_i(x^(0)) ≤ n f(x^(0)),

and the value f(x^(0)) is often available, as discussed in Section 3 and in (Magnússon et al., 2020).

For estimating the minimum f*, we start by computing f_i(x^(t)) in each node i and communicating it to the master node i_1 as follows:

q_i^(t) := Q( f_i(x^(t)), f_{i_1}(x^(t)), γC² + c, ε/2 ).

Then the master node computes and outputs the average

f̄ = (1/n) Σ_{i=1}^n q_i^(t).
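A minimal sketch of this final averaging step, reusing the toy quantizer Q from the earlier sketch on 1-dimensional vectors; the loss values and the bound standing in for γC² + c are hypothetical.

```python
# Reuses the toy quantizer Q from the earlier sketch; all values below are hypothetical.
import numpy as np

f_vals = np.array([0.31, 0.27, 0.35, 0.29])    # f_i(x^(t)) computed at each node
master = 0
bound, eps = 10.0, 1e-3                         # placeholders for gamma*C^2 + c and eps/2

# each node quantizes its scalar loss value against the master's own value
q = [Q(np.array([f_vals[i]]), np.array([f_vals[master]]), bound, eps)[0]
     for i in range(len(f_vals))]
f_bar = float(np.mean(q))                       # master's estimate of f*
print(f_bar, abs(f_bar - f_vals.mean()) <= eps)
```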
Proposition 9. The value f̄ which results from the previous quantization procedure is an estimate of the true minimum f* of f with accuracy ε, and the cost of quantization is

O( n log((γC² + c)/ε) )

if ε is sufficiently small.

The proof is presented in Appendix F. Thus, for the problem in which the master node needs to output estimates for both the minimizer and the minimum with accuracy ε in terms of function values, the total communication cost is at most

O( n d κ_ℓ log(n κ_ℓ κ(M)) log((γ(C + D)² + c)/ε) )

many bits in total for QPGD-GLM, and

O( n d² log(√d κ) log((γ(µ/σ + C)² + c)/ε) )

many bits in total for quantized Newton's method, when ε is sufficiently small.

[Figure 1(a): Performance on synthetic data. Plot of log(cost) per iteration; m = 4096, n = 8, d = 64, lr = 0.75, qb = 3, pb = 3.1; methods: QSGDq, QSGDf, HADq, HADf, GDn, GDf.]

[Figure 1(b): Performance on cpusmall scale. Plot of log(cost) per iteration; m = 8192, n = 8, d = 12, lr = 0.11, qb = 8, pb = 9.0; same methods.]
We test our method experimentally to compress a parallel solver for least-squares regression. The setting is as follows: we are given as input a data matrix A, with rows randomly partitioned evenly among the nodes, and a target vector b, with the goal of finding x* = argmin_x ‖Ax − b‖². Since this loss function f(x) := ‖Ax − b‖² is quadratic, its Hessian is constant, and so Newton's method and QPGD-GLM are equivalent: in both cases, we need only to provide the preconditioner matrix A^T A in the first iteration, and machines can henceforth use it for preconditioning in every iteration.

To quantize the preconditioner matrix, we apply the 'practical version' (that is, using the cubic lattice with mod-based coloring) of the quantization method of Davies et al. (2020), employing the 'error detection' method in order to adaptively choose the number of bits required for the decoding to succeed. Each node i quantizes the matrix A_i^T A_i, which is decoded by the master node i_1 using A_{i_1}^T A_{i_1}. Node i_1 computes the average, quantizes, and returns the result to the other nodes, who decode using A_i^T A_i.

To quantize gradients, we use two leading gradient quantization techniques: QSGD by Alistarh et al. (2016), and the Hadamard-rotation based method by Suresh et al. (2017), since these are optimized for such an application. (There is a wide array of other gradient quantization methods; we demonstrate these two as representative examples, since we are mostly concerned with the effects of preconditioner quantization.) In each iteration (other than the first), we quantize the difference between the current local gradient and that of the last iteration, average these at the master node i_1, and quantize and broadcast the result.
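The sketch below mirrors this experimental pipeline on the toy data from the earlier sketches (reusing Q, mat_to_vec, vec_to_mat, A_parts, b_parts): the preconditioner is exchanged in quantized form once, and afterwards each node's gradient is decoded against the copy the master already holds, so that effectively only the change since the previous iteration is communicated; the toy quantizer stands in for QSGD/Hadamard quantization and all radii are placeholders.

```python
# Mirrors the experimental pipeline on the toy data; reuses Q, mat_to_vec, vec_to_mat,
# A_parts and b_parts from the earlier sketches. All radii are placeholders.
import numpy as np

n, d = len(A_parts), A_parts[0].shape[1]
lr = 1.0
bound, prec = 1e4, 1e-3

# one-time quantized exchange of the local covariance matrices A_i^T A_i
M_q = [vec_to_mat(Q(mat_to_vec(A.T @ A), mat_to_vec(A_parts[0].T @ A_parts[0]),
                    bound, prec), d) for A in A_parts]
P_inv = np.linalg.inv(sum(M_q) / n)      # every node preconditions with the same matrix

x = np.zeros(d)
g_hat = [np.zeros(d) for _ in range(n)]  # the master's current copy of each local gradient
for t in range(30):
    for i in range(n):
        g = A_parts[i].T @ (A_parts[i] @ x - b_parts[i])
        # decoding against the copy the master already holds means only the change
        # since the previous iteration effectively needs to be communicated
        g_hat[i] = Q(g, g_hat[i], bound, prec)
    x = x - lr * P_inv @ (sum(g_hat) / n)   # broadcast-side quantization omitted for brevity

print(np.linalg.norm(sum(A_parts[i].T @ (A_parts[i] @ x - b_parts[i]) for i in range(n)) / n))
```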
Compared Methods. In Figures 1a and 1b we compare the following methods: GDn and GDf are full-precision (i.e., using 32-bit floats) gradient descent using no preconditioning and full-precision preconditioning, respectively, as baselines. QSGDq and QSGDf use QSGD for gradient quantization, and the quantized and full-precision preconditioner, respectively. HADq and HADf are the equivalents using instead the Hadamard-rotation method for gradient quantization. When using a preconditioner, we rescale preconditioned gradients to preserve ℓ2-norm, in order to ensure that our comparison is based only on update direction and not size.

Parameters.
In addition to m, n, and d, we also have the following parameters: the learning rate (lr in the figure titles) is set close to the maximum for which gradient descent will converge, since this is the regime in which preconditioning can help. The numbers of bits per coordinate used to quantize gradients (qb) and preconditioners (pb) are also shown; the latter is an average since the quantization method uses a variable number of bits. (These quantization methods, and most others, also require the exchange of two full-precision scalars, which are not included in the per-coordinate costs since they are independent of dimension.) The results presented are an average of the cost function per descent iteration, over 10 repetitions with different random seeds.

Synthetic Data.
We first apply the methods to synthetic data: our data matrix A consists of independent Gaussian entries with variance 1, we choose our target optimum x* to be Gaussian with variance 1000, and set b = Ax*. Figure 1a shows that, even using only ∼

Real Data.
Next, we use the dataset cpusmall scale from the LIBSVM collection (Chang and Lin, 2011). Here we must use a higher level of quantization precision (8 bits for gradients and 9 for the preconditioner), but we can still outperform non-preconditioned gradient descent and approach the performance of full-precision preconditioned gradient descent using significantly reduced communication (Figure 1b).
We proposed communication-efficient versions of two fundamental optimization algorithms, and analyzed their convergence and communication complexity. Our work shows for the first time that quantizing second-order information can i) theoretically yield communication complexity upper bounds with sub-linear dependence on the condition number of the problem, and ii) empirically achieve superior performance over vanilla methods. There are intriguing questions for future work.

The log κ-dependency for Newton's method occurs because of our bounds for the input and output variance of quantization. We conjecture that this cannot be avoided.

Another interesting question is whether the log d-dependency can be circumvented. The log d is obtained directly from the use of the vectorization φ and could be avoided by quantization using lattices with good spectral norm properties (we are however unaware of such lattice constructions).

Finally, one key issue is the d²-dependence, due to quantization of d²-dimensional preconditioners. Future work can examine whether one can substitute these preconditioners with d-dimensional (e.g. blockwise) approximations and still provide similar guarantees.
Acknowledgements
The authors would like to thank Janne Korhonen, Aurelien Lucchi, Celestine Mendler-Dünner and Antonio Orvieto for helpful discussions regarding the content of this work. FA and DA are supported by the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No 805223 ScaleML). PD is supported by the European Union's Horizon 2020 programme under the Marie Skłodowska-Curie grant agreement No. 754411.
References
Dan Alistarh and Janne H Korhonen. Improved communication lower bounds for distributed optimisation. arXiv preprint arXiv:2010.08222, 2020.
Dan Alistarh, Demjan Grubic, Jerry Li, Ryota Tomioka, and Milan Vojnovic. QSGD: Communication-efficient SGD via gradient quantization and encoding. arXiv preprint arXiv:1610.02132, 2016.
Yossi Arjevani and Ohad Shamir. Communication complexity of distributed convex learning and optimization. In
Advances in Neural Information Processing Systems 28 (NIPS 2015) , pages 1756–1764, 2015.Tal Ben-Nun and Torsten Hoefler. Demystifying parallel and distributed deep learning: An in-depth concurrencyanalysis.
ACM Computing Surveys (CSUR) , 52(4):1–43, 2019.Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A library for support vector machines.
ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27, 2011. Software available at .
Yuxin Chen. Gradient methods for unconstrained problems. , 2019. Princeton University, Fall 2019.
Peter Davies, Vijaykrishna Gurunathan, Niusha Moshrefi, Saleh Ashkboos, and Dan Alistarh. Distributed variance reduction with optimal communication. arXiv e-prints, pages arXiv–2002, 2020.
Hossein S Ghadikolaei and Sindri Magnússon. Communication-efficient variance-reduced stochastic gradient descent. arXiv preprint arXiv:2003.04686, 2020.
Hadrien Hendrikx, Lin Xiao, Sebastien Bubeck, Francis Bach, and Laurent Massoulie. Statistically preconditioned accelerated gradient method for distributed optimization. In
International Conference on Machine Learning ,pages 4203–4227. PMLR, 2020.Martin Jaggi, Virginia Smith, Martin Tak´aˇc, Jonathan Terhorst, Sanjay Krishnan, Thomas Hofmann, andMichael I Jordan. Communication-efficient distributed dual coordinate ascent. arXiv preprint arXiv:1409.1458 ,2014.Michael I Jordan, Jason D Lee, and Yun Yang. Communication-efficient distributed statistical inference.
Journalof the American Statistical Association , 2018.Sarit Khirirat, Hamid Reza Feyzmahdavian, and Mikael Johansson. Distributed learning with compressedgradients. arXiv preprint arXiv:1806.06573 , 2018.Mu Li, David G Andersen, Jun Woo Park, Alexander J Smola, Amr Ahmed, Vanja Josifovski, James Long,Eugene J Shekita, and Bor-Yiing Su. Scaling distributed machine learning with the parameter server. In
Proc.11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 2014) , pages 583–598,2014.Sindri Magn´usson, Hossein Shokri-Ghadikolaei, and Na Li. On maintaining linear convergence of distributedlearning and optimization under limited communication.
IEEE Transactions on Signal Processing , 68:6101–6116,2020.Celestine Mendler-D¨unner and Aurelien Lucchi. Randomized block-diagonal preconditioning for parallel learning.In
International Conference on Machine Learning , pages 6841–6851. PMLR, 2020.Lam Nguyen, Phuong Ha Nguyen, Marten Dijk, Peter Richt´arik, Katya Scheinberg, and Martin Tak´ac. Sgd andhogwild! convergence without the bounded gradients assumption. In
International Conference on MachineLearning , pages 3750–3758. PMLR, 2018.Feng Niu, Benjamin Recht, Christopher R´e, and Stephen J Wright. Hogwild!: A lock-free approach to parallelizingstochastic gradient descent. arXiv preprint arXiv:1106.5730 , 2011.Kevin Scaman, Francis Bach, S´ebastien Bubeck, Yin Tat Lee, and Laurent Massouli´e. Optimal algorithms forsmooth and strongly convex distributed optimization in networks. In international conference on machinelearning , pages 3027–3036. PMLR, 2017.Ohad Shamir. Fundamental limits of online and distributed algorithms for statistical learning and estimation. In
Advances in Neural Information Processing Systems 27 (NIPS 2014) , pages 163–171, 2014. oivos Alimisis, Peter Davies, Dan Alistarh
Ananda Theertha Suresh, Felix X. Yu, Sanjiv Kumar, and H. Brendan McMahan. Distributed mean estimationwith limited communication. In Doina Precup and Yee Whye Teh, editors,
Proceedings of the 34th InternationalConference on Machine Learning , volume 70 of
Proceedings of Machine Learning Research, pages 3329–3337, 2017.
J. N. Tsitsiklis and Z. Luo. Communication complexity of convex optimization. In , pages 608–611, 1986. doi: 10.1109/CDC.1986.267379.
Santosh S Vempala, Ruosong Wang, and David P Woodruff. The communication complexity of optimization. In
Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms , pages 1733–1752. SIAM,2020.Min Ye and Emmanuel Abbe. Communication-computation efficient gradient coding. In
International Conferenceon Machine Learning , pages 5610–5619. PMLR, 2018.Yuchen Zhang, John Duchi, Michael I Jordan, and Martin J Wainwright. Information-theoretic lower boundsfor distributed statistical estimation with communication constraints. In
Advances in Neural InformationProcessing Systems 26 (NIPS 2013) , pages 2328–2336, 2013. oivos Alimisis, Peter Davies, Dan Alistarh
Appendix: Proofs and Supplementaries
A  The isomorphism φ

Lemma 1. For any matrices P, P′ ∈ S(d), we have

(1/√d) ‖φ(P) − φ(P′)‖ ≤ ‖P − P′‖ ≤ √2 ‖φ(P) − φ(P′)‖.

Proof.
The Frobenius norm of a matrix P = ( p ij ) di,j =1 is defined as (cid:107) P (cid:107) F = (cid:118)(cid:117)(cid:117)(cid:116) d (cid:88) i,j =1 p ij thus (cid:107) P − P (cid:48) (cid:107) F = d (cid:88) i,j =1 ( p ij − p (cid:48) ij ) = (cid:88) i = j ( p ij − p (cid:48) ij ) + (cid:88) i (cid:54) = j ( p ij − p (cid:48) ij ) = (cid:88) i = j ( p ij − p (cid:48) ij ) + 2 (cid:88) i We firstly compute the Hessian of the global cost function f in terms of the Hessian of the global loss function (cid:96) : Lemma 10. We have ∇ f ( x ) = A T (cid:96) ( Ax ) A. Proof. We start by computing the gradient of f . We fix an arbitrary vector v ∈ R d and we write (cid:104)∇ f ( x ) , v (cid:105) = d x f ( x ) v = d x ( (cid:96) ( Ax )) v = d y ( (cid:96) ( y )) | y = Ax d x ( Ax ) v = d y ( (cid:96) ( y )) | y = Ax Av = (cid:104)∇ (cid:96) ( y ) | y = Ax , Av (cid:105) = ( Av ) T ∇ (cid:96) ( Ax ) = v T A T ∇ (cid:96) ( Ax ) = (cid:104) A T ∇ (cid:96) ( Ax ) , v (cid:105) Since v is arbitrary, the gradient of f is ∇ f ( x ) = A T ∇ (cid:96) ( Ax )For the Hessian, we have ∇ x f ( x ) = ∇ x ∇ x f ( x ) = ∇ x ( A T ∇ x (cid:96) ( Ax )) = A T ∇ x ( ∇ x (cid:96) ( Ax )) = A T ∇ y ( ∇ y (cid:96) ( y )) | y = Ax ∇ x ( Ax ) = A T ∇ (cid:96) ( Ax ) A. oivos Alimisis, Peter Davies, Dan Alistarh We recall standard technical results from linear algebra in order to prove Proposition 5. They will be useful alsoin the proof of Proposition 13 and Lemma 14. Lemma 11. Given matrices P ∈ R m × d and Q ∈ R d × m , we have that P Q and QP have exactly the same non-zero eigenvalues.Proof. Let λ (cid:54) = 0 an eigenvalues of P Q . Then there exists v (cid:54) = 0, such that P Qv = λv . Multiplying both sides by Q , we get QP ( Qv ) = λ ( Qv ). We know that Qv (cid:54) = 0, because then λ would be 0. Thus λ is an eigenvalue of QP with eigenvector Qv . Thus any non-zero eigenvalue of P Q is also an eigenvalue of QP . Switching P and Q in theprevious argument implies that any non-zero eigenvalue of QP is also an eigenvalue of P Q . Thus, P Q and QP have the same non-zero eigenvalues. Corollary 11.1. Given matrices P ∈ R m × d and Q ∈ R d × m , we have that rank ( P Q ) = rank ( QP ) = min { rank ( P ) , rank ( Q ) } Lemma 12. Given a symmetric positive semi-definite matrix S ∈ R m × m and a symmetric positive definite T ∈ R m × m with eigenvalues λ ( S ) ≤ ... ≤ λ m ( S ) and λ ( T ) ≤ ... ≤ λ m ( T ) we have that λ k ( S ) λ ( T ) ≤ λ k ( ST ) ≤ λ k ( S ) λ m ( T ) for any k = 1 , ..., m .Proof. We use the min-max principle for the k -th eigenvalue of a matrix A ∈ R m × m . This reads λ k ( A ) = min F ⊂ R M dim( F )= k (cid:18) max x ∈ F \{ } ( Ax, x )( x, x ) (cid:19) We know that λ k ( ST ) = λ k ( √ T S √ T ). Since T is symmetric and positive-definite, its square root √ T is alsosymmetric and positive-definite. 
Thus, we have λ k ( ST ) = λ k ( √ T S √ T ) = min F ⊂ R M dim( F )= k (cid:32) max x ∈ F \{ } ( √ T S √ T x, x )( x, x ) (cid:33) = min F ⊂ R n dim( F )= k (cid:32) max x ∈ F \{ } ( S √ T x, √ T x )( √ T x, √ T x ) ( T x, x )( x, x ) (cid:33) Thus min F ⊂ R n dim( F )= k (cid:32) max x ∈ F \{ } ( S √ T x, √ T x )( √ T x, √ T x ) (cid:33) λ min ( T ) ≤ λ k ( ST ) ≤ min F ⊂ R n dim( F )= k (cid:32) max x ∈ F \{ } ( S √ T x, √ T x )( √ T x, √ T x ) (cid:33) λ max ( T )If { e , ..., e k } is a basis for F , we define F (cid:48) = span ( √ T − e , ..., √ T − e k ) and we havemin F ⊂ R n dim( F )= k (cid:32) max x ∈ F \{ } ( S √ T x, √ T x )( √ T x, √ T x ) (cid:33) = min F (cid:48) ⊂ R n dim( F (cid:48) )= k (cid:18) max x ∈ F (cid:48) \{ } ( Sx, x )( x, x ) (cid:19) = λ k ( S )and the desired result follows. Proposition 5. The max. eigenvalue λ max of ∇ f satisfies γ := λ max ( ∇ f ) ≤ γ (cid:96) λ max (cid:18) A T An (cid:19) and the min. eigenvalue λ min of ∇ f satisfies µ := λ min ( ∇ f ) ≥ µ (cid:96) λ min (cid:18) A T An (cid:19) . oivos Alimisis, Peter Davies, Dan Alistarh Proof. Using Lemma 11, we have that the eigenvalues of the d × d matrix ∇ f are equal to the non-zero eigenvaluesof the m × m matrix ∇ (cid:96)AA T . Using Corollary 11.1, we have that the matrix AA T is of rank d , thus the matrix ∇ (cid:96)AA T is also of rank d . This means that it has exactly m − d zero eigenvalues. Exactly the same holds for thematrix AA T . We use also Lemma 12 for the positive definite matrix ∇ (cid:96) and the positive semi-definite matrix AA T and we have: • The maximum eigenvalue of the matrix ∇ f is equal to the maximum eigenvalue of the matrix ∇ (cid:96)AA T .For that we have λ max ( ∇ (cid:96)AA T ) ≤ λ max ( ∇ (cid:96) ) λ max ( AA T ) . Similarly the maximum eigenvalue of AA T is equal to the maximum one of A T A and we finally have λ max ( ∇ f ) ≤ γ (cid:96) n λ max ( A T A ) = γ (cid:96) λ max (cid:18) n A T A (cid:19) . • The minimum eigenvalue of the matrix ∇ f is equal to the eigenvalue of the matrix ∇ (cid:96)AA T of order m − d + 1. Using Lemma 12, we have λ m − d +1 ( ∇ (cid:96)AA T ) ≥ λ min ( ∇ (cid:96) ) λ m − d +1 ( AA T ) . By using similar arguments as before, we have that λ m − d +1 ( AA T ) = λ min ( A T A ) . Thus, we finally have λ min ( ∇ f ) ≥ µ (cid:96) n λ min ( A T A ) = µ (cid:96) λ min (cid:18) n A T A (cid:19) . C Gradient Descent with Preconditioning for GLMs Gradient descent for a γ -smooth and µ -strongly convex function f ( x ) = (cid:96) ( Ax ) : R d → R preconditioned by amatrix M ∈ R d × d reads x ( t +1) = x ( t ) − ηM − ∇ f ( x ( t ) ) ,x (0) ∈ R d . In our setting the matrix M := n A T A is invertible, because we have assumed that the matrix A is of full rank.The convergence is now improved up to the condition number of M . For the proof we follow the techniquepresented in (Chen, 2019) for (non-preconditioned) gradient descent. Proposition 13. The iterates x ( t ) of the previous algorithm with η = µ (cid:96) + γ (cid:96) satisfy (cid:107) x ( t ) − x ∗ (cid:107) ≤ (cid:18) − κ (cid:96) (cid:19) t (cid:107) x (0) − x ∗ (cid:107) Proof. 
Similarly to the previous argument, we have x ( t +1) − x ∗ = x ( t ) − ηM − ∇ f ( x ( t ) ) − x ∗ = ( x ( t ) − x ∗ ) − ηM − (cid:18)(cid:90) ∇ f ( x ( ξ )) dξ (cid:19) ( x ( t ) − x ∗ )= (cid:18) Id − η (cid:90) M − ∇ f ( x ( ξ )) dξ (cid:19) ( x ( t ) − x ∗ )where x ( ξ ) = x ( t ) + ξ ( x ∗ − x ( t ) ) oivos Alimisis, Peter Davies, Dan Alistarh Thus (cid:107) x ( t +1) − x ∗ (cid:107) ≤ (cid:13)(cid:13)(cid:13)(cid:13) Id − η (cid:90) M − ∇ f ( x ( ξ )) dξ (cid:13)(cid:13)(cid:13)(cid:13) (cid:107) x ( t ) − x ∗ (cid:107) ≤ max ≤ ξ ≤ (cid:107) Id − ηM − ∇ f ( x ( ξ )) (cid:107)(cid:107) x ( t ) − x ∗ (cid:107) Now we can write M − ∇ f ( x ( ξ )) = M − A T ∇ (cid:96) ( Ax ( ξ )) A By Lemma 11, the eigenvalues of the last matrix are exactly the same with the non-zero eigenvalues of the matrix ∇ (cid:96) ( Ax ( ξ )) AM − A T . This matrix is m × m with rank d , thus it has exactly m − d zero eigenvalues. The sameholds for the matrix AM − A T . Again by applying Lemma 11, we know that AM − A T has m − d eigenvaluesequal to 0 and the others are exactly equal with the ones of M − A T A = n Id, i.e. they are all equal to n .Thus, we have λ max ( M − A T ∇ (cid:96) ( Ax ( ξ )) A ) = λ max ( ∇ (cid:96) ( Ax ( ξ )) AM − A T ) ≤ λ max ( ∇ (cid:96) ) λ max ( AM − A T ) = γ (cid:96) n n = γ (cid:96) and λ min ( M − A T ∇ (cid:96) ( Ax ( ξ )) A ) = λ m − d +1 ( ∇ (cid:96) ( Ax ( ξ )) AM − A T ) ≥ λ min ( ∇ (cid:96) ) λ m − d +1 ( AM − A T ) = µ (cid:96) n n = µ (cid:96) by Lemma 12, because ∇ (cid:96) is positive definite and AM − A T is positive semi-definite.Since we choose η = µ (cid:96) + γ (cid:96) , the maximum eigenvalues of the matrix ηM − ∇ f ( x ( ξ )) is γ (cid:96) µ (cid:96) + γ (cid:96) and theminimum one is µ (cid:96) µ (cid:96) + γ (cid:96) . Thus, the maximum eigenvalue of Id − ηM − ∇ f ( x ( ξ )) is less or equal thanmax (cid:110) γ (cid:96) µ (cid:96) + γ (cid:96) − , − µ (cid:96) µ (cid:96) + γ (cid:96) (cid:111) = γ (cid:96) − µ (cid:96) γ (cid:96) + µ (cid:96) ≤ − κ (cid:96) . Thus (cid:107) x t +1 − x ∗ (cid:107) ≤ (cid:18) − κ (cid:96) (cid:19) (cid:107) x ( t ) − x ∗ (cid:107) and by an induction argument we get the desired result. D Proofs of convergence for GLMs We prove the convergence result of the preconditioned algorithm for GLMs. We recall firstly the algorithm incompact form: Algorithm 1 Quantized Preconditioned Gradient Descent for GLM training ¯ M i = φ − (cid:16) Q (cid:16) φ ( M i ) , φ ( M i ) , √ dnλ max ( M ) , λ min ( M )16 √ κ (cid:96) (cid:17)(cid:17) S = n (cid:80) ni =1 ¯ M i ¯ M = φ − (cid:16) Q ( φ ( S ) , φ ( M i ) , √ d (cid:16) λ min ( M )16 κ (cid:96) + 2 nλ max ( M ) (cid:17) , λ min ( M )16 √ κ (cid:96) ) (cid:17) x (0) ∈ R d , (cid:107) x (0) − x ∗ (cid:107) ≤ D v (0) i = Q (cid:16) ¯ M − ∇ f i ( x (0) ) , ¯ M − ∇ f i ( x (0) ) , nκ ( M ) R (0) , δR (0) (cid:17) r (0) = n (cid:80) ni =1 v (0) i . v (0) = Q (cid:16) r (0) , ¯ M − ∇ f i ( x (0) ) , (cid:0) δ + 4 nκ ( M ) (cid:1) R (0) , δR (0) (cid:17) for t ≥ do x ( t +1) = x ( t ) − ηv ( t ) v ( t +1) i = Q (cid:16) ¯ M − ∇ f i ( x ( t +1) ) , v ( t ) i , nκ ( M ) R ( t +1) , δR ( t +1) (cid:17) r ( t +1) = n (cid:80) ni =1 v ( t +1) i v ( t +1) = Q (cid:16) r ( t +1) , v ( t ) , ( δ + 4 nκ ( M )) R ( t +1) , δR ( t +1) (cid:17) end forLemma 14. 
Consider the algorithm x ( t +1) = x ( t ) − η ¯ M − ∇ f ( x ( t ) ) oivos Alimisis, Peter Davies, Dan Alistarh starting from a point x (0) ∈ R d such that (cid:107) x (0) − x ∗ (cid:107) ≤ D , where η = µ (cid:96) + γ (cid:96) and ¯ M is the quantized estimationof M obtained in Algorithm 1. Then, the iterates of this algorithm satisfy (cid:107) x ( t ) − x ∗ (cid:107) ≤ (cid:18) − κ (cid:96) (cid:19) t D. Proof. We use the same proof technique as in Proposition 13, with the difference that now we have the quantizedestimation ¯ M of M instead of the original: x ( t +1) − x ∗ = x ( t ) − η ¯ M − ∇ f ( x ( t ) ) − x ∗ = ( x ( t ) − x ∗ ) − η ¯ M − (cid:18)(cid:90) ∇ f ( x ( ξ )) dξ (cid:19) ( x ( t ) − x ∗ )= (cid:18) Id − η (cid:90) ¯ M − ∇ f ( x ( ξ )) dξ (cid:19) ( x ( t ) − x ∗ )where x ( ξ ) = x ( t ) + ξ ( x ∗ − x ( t ) ) . Thus (cid:107) x ( t +1) − x ∗ (cid:107) ≤ (cid:13)(cid:13)(cid:13)(cid:13) Id − η (cid:90) ¯ M − ∇ f ( x ( ξ )) dξ (cid:13)(cid:13)(cid:13)(cid:13) (cid:107) x ( t ) − x ∗ (cid:107) ≤ max ≤ ξ ≤ (cid:107) Id − η ¯ M − ∇ f ( x ( ξ )) (cid:107)(cid:107) x ( t ) − x ∗ (cid:107) . Now we can write ¯ M − ∇ f ( x ( ξ )) = M − ∇ f ( x ( ξ )) + ( ¯ M − − M − ) ∇ f ( x ( ξ ))and (cid:107) Id − η ¯ M − ∇ f ( x ( ξ )) (cid:107) ≤ (cid:107) Id − ηM − ∇ f ( x ( ξ )) (cid:107) + η (cid:107) ( ¯ M − − M − ) ∇ f ( x ( ξ )) (cid:107) . For the matrix M − A T ∇ (cid:96) ( Ax ( ξ )) A we apply exactly the same argument as in Proposition 13 and havemax ≤ ξ ≤ (cid:107) Id − ηM − ∇ f ( x ( ξ )) (cid:107) ≤ γ (cid:96) − µ (cid:96) γ (cid:96) + µ (cid:96) < − κ (cid:96) . For the extra error term, we firstly have to study the quantization error (cid:107) M − ¯ M (cid:107) :Notice that (cid:107) φ ( M i ) − φ ( M i ) (cid:107) ≤ √ d (cid:107) M i − M i (cid:107) ≤ √ d ( (cid:107) M i (cid:107) + (cid:107) M i (cid:107) ) ≤ √ dnλ max ( M )which implies that (cid:107) φ ( ¯ M i ) − φ ( M i ) (cid:107) ≤ λ min ( M )16 √ κ (cid:96) by the definition of quantization parameters (we have λ max ( M i ) ≤ nλ max ( M ), because nM = (cid:80) ni =1 M i andevery M i is positive semi-definite). The last inequality implies (cid:107) ¯ M i − M i (cid:107) ≤ λ min ( M )16 κ (cid:96) and (cid:107) S − M (cid:107) ≤ n n (cid:88) i =1 (cid:107) ¯ M i − M i (cid:107) ≤ λ min ( M )16 κ (cid:96) . Now we can write (cid:107) φ ( S ) − φ ( M i ) (cid:107) ≤ √ d (cid:107) S − M i (cid:107) ≤ √ d ( (cid:107) S − M (cid:107) + (cid:107) M − M i (cid:107) ) ≤ √ d (cid:18) λ min ( M )16 κ (cid:96) + 2 nλ max ( M ) (cid:19) ≤ n √ dλ max ( M ) . By the definition of quantization parameters, this implies (cid:107) φ ( ¯ M ) − φ ( S ) (cid:107) ≤ λ min ( M )16 √ κ (cid:96) oivos Alimisis, Peter Davies, Dan Alistarh and concequently (cid:107) ¯ M − S (cid:107) ≤ λ min ( M )16 κ (cid:96) . Then (cid:107) M − ¯ M (cid:107) ≤ (cid:107) M − S (cid:107) + (cid:107) S − ¯ M (cid:107) ≤ λ min ( M )8 κ (cid:96) . By standard results in perturbation theory, we know that λ min ( ¯ M ) ≥ λ min ( M ) − (cid:107) M − ¯ M (cid:107) ≥ λ min ( M ) − λ min ( M )8 κ (cid:96) ≥ λ min ( M )2 . This implies (cid:107) ¯ M − (cid:107) = λ max ( ¯ M − ) = 1 λ min ( ¯ M − ) ≤ λ min ( M ) . 
Now we havemax ≤ ξ ≤ µ (cid:96) + γ (cid:96) (cid:107) ( ¯ M − − M − ) ∇ f ( x ( ξ )) (cid:107) ≤ max ≤ ξ ≤ µ (cid:96) + γ (cid:96) (cid:107) ¯ M − ( M − ¯ M ) M − ∇ f ( x ( ξ )) (cid:107)≤ max ≤ ξ ≤ µ (cid:96) + γ (cid:96) (cid:107) ¯ M − ( M − ¯ M ) (cid:107)(cid:107) M − ∇ f ( x ( ξ )) (cid:107) ≤ µ (cid:96) + γ (cid:96) (cid:107) ¯ M − (cid:107)(cid:107) M − ¯ M (cid:107) γ (cid:96) ≤ λ min ( M ) λ min ( M )8 κ (cid:96) = 12 κ (cid:96) . Thus, it holds (cid:107) x ( t +1) − x ∗ (cid:107) ≤ (cid:18) − κ (cid:96) (cid:19) (cid:107) x ( t ) − x ∗ (cid:107) which implies (cid:107) x ( t +1) − x ∗ (cid:107) ≤ (cid:18) − κ (cid:96) (cid:19) t (cid:107) x (0) − x ∗ (cid:107) ≤ (cid:18) − κ (cid:96) (cid:19) t D. We recall the parameters ξ = 1 − κ (cid:96) ,K = 2 ξ ,δ = ξ (1 − ξ )4 ,R ( t ) = γ (cid:96) K (cid:18) − κ (cid:96) (cid:19) t D. Lemma 15. The iterates of Algorithm 1 satisfy the following inequalities: (cid:107) x ( t ) − x ∗ (cid:107) ≤ (cid:18) − κ (cid:96) (cid:19) t D, (cid:107) ¯ M − ∇ f i ( x ( t ) ) − v ( t ) i (cid:107) ≤ δR ( t ) , (cid:107) ¯ M − ∇ f ( x ( t ) ) − v ( t ) (cid:107) ≤ δR ( t ) . Proof. We firstly prove these inequalities for t = 0. The first one is direct. For the second one, we notice that (cid:107) ¯ M − ∇ f i ( x (0) ) − ¯ M − ∇ f i ( x (0) ) (cid:107) ≤ λ min ( M ) ( (cid:107)∇ f i ( x (0) ) (cid:107) + (cid:107)∇ f i ( x (0) ) (cid:107) ) ≤ λ min ( M ) ( γ i (cid:107) x (0) − x ∗ (cid:107) + γ i (cid:107) x (0) − x ∗ (cid:107) ) ≤ γ (cid:96) λ max ( M i ) λ min ( M ) D + 2 γ (cid:96) λ max ( M i ) λ min ( M ) D ≤ nκ ( M ) R (0) . oivos Alimisis, Peter Davies, Dan Alistarh The last inequality follows because K ≥ λ max ( M i ) ≤ nλ max ( M ).(We recall also that (cid:107) ¯ M − (cid:107) ≤ λ min ( M ) , because λ min ( ¯ M ) ≥ λ min ( M ) − (cid:107) M − ¯ M (cid:107) ≥ λ min ( M ) / v (0) i , we have (cid:107) v (0) i − ¯ M − ∇ f i ( x (0) ) (cid:107) ≤ δR (0) . Towards the third inequality at t = 0, we have (cid:107) r (0) − ¯ M − ∇ f ( x (0) ) (cid:107) ≤ n n (cid:88) i =1 (cid:107) v (0) i − ¯ M − ∇ f i ( x (0) ) (cid:107) ≤ δR (0) . Also, it holds (cid:107) r (0) − ¯ M − ∇ f i ( x (0) ) (cid:107) ≤ (cid:107) r (0) − ¯ M − ∇ f ( x (0) ) (cid:107) + (cid:107) ¯ M − ∇ f ( x (0) ) − ¯ M − ∇ f i ( x (0) ) (cid:107) ≤ (cid:18) δ nκ ( M ) (cid:19) R (0) , thus, by the definition of v (0) , (cid:107) v (0) − r (0) (cid:107) ≤ δR (0) (cid:107) v (0) − ¯ M − ∇ f ( x (0) ) (cid:107) ≤ (cid:107) v (0) − r (0) (cid:107) + (cid:107) r (0) − ¯ M − ∇ f ( x (0) ) (cid:107) ≤ δR (0) δR (0) δR (0) . Now we assume that the inequalities hold for t and prove that they continue to hold for t + 1. We start with thefirst one: (cid:107) x ( t +1) − x ∗ (cid:107) = (cid:107) x ( t ) − ηv ( t ) + η ¯ M − ∇ f ( x ( t ) ) − η ¯ M − ∇ f ( x ( t ) ) − x ∗ (cid:107)≤ η (cid:107) ¯ M − ∇ f ( x ( t ) ) − v ( t ) (cid:107) + (cid:107) x ( t ) − η ¯ M − ∇ f ( x ( t ) ) − x ∗ (cid:107)≤ γ (cid:96) δR ( t ) + ξ (cid:18) − κ (cid:96) (cid:19) t D = 2 γ (cid:96) δ γ (cid:96) K (cid:18) − κ (cid:96) (cid:19) t D + ξ (cid:18) − κ (cid:96) (cid:19) t D = δK (cid:18) − κ (cid:96) (cid:19) t D + ξ (cid:18) − κ (cid:96) (cid:19) t D = ( δK + ξ ) (cid:18) − κ (cid:96) (cid:19) t D = (cid:18) − κ (cid:96) (cid:19) t +1 D. For the second inequality is suffices to show that (cid:107) ¯ M − ∇ f i ( x ( t +1) ) − v ( t ) i (cid:107) ≤ nκ ( M ) R ( t +1) . 
Theorem 6. The iterates $x^{(t)}$ produced by Algorithm 1 with $\eta = \frac{2}{\mu_\ell + \gamma_\ell}$ satisfy
$$\|x^{(t)} - x^*\| \le \left(1 - \frac{1}{4\kappa_\ell}\right)^t D$$
and the total number of bits used for communication until $f(x^{(t)}) - f^* \le \epsilon$ is
$$O\!\left(nd^2\log\left(\sqrt d\, n\kappa_\ell\kappa(M)\right)\right) + O\!\left(nd\kappa_\ell\log(n\kappa_\ell\kappa(M))\log\frac{\gamma D^2}{\epsilon}\right).\qquad(1)$$

Proof. The inequality for the convergence rate of the distance of the iterates from the minimizer follows from the previous lemma. We now turn our attention to the total communication cost. We start with the quantization of the matrix $M$. The communication cost for encoding each $M_i$ and decoding it at the master node is
$$O\!\left(\frac{d(d+1)}{2}\log\frac{2\sqrt d\, n\lambda_{\max}(M)}{\lambda_{\min}(M)/(16\sqrt2\,\kappa_\ell)}\right) = O\!\left(d^2\log\left(\sqrt d\, n\kappa_\ell\kappa(M)\right)\right).$$
The communication cost of encoding $S$ at the master node and decoding it back at every machine is
$$O\!\left(\frac{d(d+1)}{2}\log\frac{3n\sqrt d\,\lambda_{\max}(M)}{\lambda_{\min}(M)/(16\sqrt2\,\kappa_\ell)}\right) = O\!\left(d^2\log\left(\sqrt d\, n\kappa_\ell\kappa(M)\right)\right).$$
Since we have $n$ communications of each kind, the total communication cost for the preconditioner is
$$b_m = O\!\left(nd^2\log\left(\sqrt d\, n\kappa_\ell\kappa(M)\right)\right).$$
The communication cost of quantizing the descent direction $v^{(t)}$ at step $t \ge 0$ is
$$O\!\left(nd\log\frac{4n\kappa(M)R^{(t)}}{\delta R^{(t)}/2}\right) = O\!\left(nd\log\frac{n\kappa(M)}{\delta}\right)$$
for encoding the local descent directions and
$$O\!\left(nd\log\frac{\left(4n\kappa(M) + \delta/2\right)R^{(t)}}{\delta R^{(t)}/2}\right) \le O\!\left(nd\log\frac{9n\kappa(M)}{\delta}\right) = O\!\left(nd\log\frac{n\kappa(M)}{\delta}\right)$$
for decoding back. Since we have
$$\frac1\delta = \frac{4}{\xi(1-\xi)} = \frac{8\kappa_\ell}{1 - \frac{1}{2\kappa_\ell}} \le 16\kappa_\ell,$$
we can bound the per-step communication cost by
$$b_v = O(nd\log(n\kappa_\ell\kappa(M))).$$
We have $f(x^{(t)}) - f(x^*) \le \epsilon$ if $\|x^{(t)} - x^*\| \le \sqrt{2\epsilon/\gamma}$, thus we reach accuracy $\epsilon$ in terms of function values in at most
$$t = 2\kappa_\ell\log\frac{\gamma D^2}{2\epsilon}$$
iterations, and putting everything together we find that the total communication cost for quantizing the descent directions along the whole optimization procedure is
$$O\!\left(nd\kappa_\ell\log(n\kappa_\ell\kappa(M))\log\frac{\gamma D^2}{\epsilon}\right).$$
Thus, the total communication cost in number of bits is obtained by summing the costs for matrix and descent-direction quantization:
$$b = O\!\left(nd^2\log\left(\sqrt d\, n\kappa_\ell\kappa(M)\right)\right) + O\!\left(nd\kappa_\ell\log(n\kappa_\ell\kappa(M))\log\frac{\gamma D^2}{\epsilon}\right).$$
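For intuition about how the two terms of the bound (1) scale, the following snippet evaluates them for illustrative parameter values. Constants hidden by the $O(\cdot)$ are dropped, so the numbers indicate scaling only, not exact bit counts.

```python
import math

def glm_bits(n, d, kappa_l, kappa_M, gamma, D, eps):
    preconditioner = n * d**2 * math.log(math.sqrt(d) * n * kappa_l * kappa_M)
    directions = (n * d * kappa_l * math.log(n * kappa_l * kappa_M)
                  * math.log(gamma * D**2 / eps))
    return preconditioner, directions

pre, dirs = glm_bits(n=16, d=100, kappa_l=50, kappa_M=1e3, gamma=1.0, D=10.0, eps=1e-6)
print(f"preconditioner term ~ {pre:.2e}, descent-direction term ~ {dirs:.2e}")
```

The one-time preconditioner term carries the $d^2$ factor, while the per-iteration term carries the $\kappa_\ell$ factor; the point of the bound is that the $\log(1/\epsilon)$ part is multiplied by $\kappa_\ell$, the condition number of the outer losses, rather than by the condition number of $f$ itself.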
E Proofs for Quantized Newton's Method

We first recall Quantized Newton's method in compact form, as we did for GLMs:

Algorithm 2 Quantized Newton's Method
Input: $x^{(0)} \in \mathbb{R}^d$ with $\|x^{(0)} - x^*\| \le \frac{\alpha\mu}{2\sigma}$
  $H_i^0 = \phi^{-1}\!\left(Q\!\left(\phi(\nabla^2 f_i(x^{(0)})),\ \phi(\nabla^2 f_1(x^{(0)})),\ 2\sqrt d\,\gamma,\ \frac{G^{(0)}}{\sqrt2\,\kappa}\right)\right)$
  $S_0 = \frac1n\sum_{i=1}^n H_i^0$
  $H_0 = \phi^{-1}\!\left(Q\!\left(\phi(S_0),\ \phi(\nabla^2 f_i(x^{(0)})),\ \sqrt d\left(\frac{G^{(0)}}{\kappa} + 2\gamma\right),\ \frac{G^{(0)}}{\sqrt2\,\kappa}\right)\right)$
  $v_i^{(0)} = Q\!\left(H_0^{-1}\nabla f_i(x^{(0)}),\ H_0^{-1}\nabla f_1(x^{(0)}),\ 4\kappa P^{(0)},\ \frac{\theta P^{(0)}}{2}\right)$
  $p^{(0)} = \frac1n\sum_{i=1}^n v_i^{(0)}$
  $v^{(0)} = Q\!\left(p^{(0)},\ H_0^{-1}\nabla f_i(x^{(0)}),\ \left(\frac\theta2 + 4\kappa\right)P^{(0)},\ \frac{\theta P^{(0)}}{2}\right)$
for $t \ge 0$ do
  $x^{(t+1)} = x^{(t)} - v^{(t)}$
  $H_i^{t+1} = \phi^{-1}\!\left(Q\!\left(\phi(\nabla^2 f_i(x^{(t+1)})),\ \phi(H_i^t),\ \frac{5\sqrt d}{\alpha}G^{(t+1)},\ \frac{G^{(t+1)}}{\sqrt2\,\kappa}\right)\right)$
  $S_{t+1} = \frac1n\sum_{i=1}^n H_i^{t+1}$
  $H_{t+1} = \phi^{-1}\!\left(Q\!\left(\phi(S_{t+1}),\ \phi(H_t),\ \sqrt d\left(\frac1\kappa + \frac5\alpha\right)G^{(t+1)},\ \frac{G^{(t+1)}}{\sqrt2\,\kappa}\right)\right)$
  $v_i^{(t+1)} = Q\!\left(H_{t+1}^{-1}\nabla f_i(x^{(t+1)}),\ v_i^{(t)},\ 4\kappa P^{(t+1)},\ \frac{\theta P^{(t+1)}}{2}\right)$
  $p^{(t+1)} = \frac1n\sum_{i=1}^n v_i^{(t+1)}$
  $v^{(t+1)} = Q\!\left(p^{(t+1)},\ v^{(t)},\ \left(\frac\theta2 + 4\kappa\right)P^{(t+1)},\ \frac{\theta P^{(t+1)}}{2}\right)$
end for

We also recall the parameters
$$G^{(t)} = \frac\mu5\alpha^{t+1},\qquad \alpha \ge \frac{2\sigma}{\mu}\|x^{(0)} - x^*\|.$$
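The sketch below mirrors the structure of Algorithm 2: flatten and quantize the local Hessians, average and re-quantize at the master, then apply the same two-round pattern to the Newton directions, with resolutions shrinking like $G^{(t)}$. It uses plain rounding in place of the operator $Q$, a single resolution schedule for simplicity (Algorithm 2 uses the separate schedule $\theta P^{(t)}$ for the directions), and illustrative function names throughout.

```python
import numpy as np

def phi(H):
    """Flatten the upper triangle (including the diagonal) of a symmetric matrix."""
    return H[np.triu_indices(H.shape[0])]

def phi_inv(v, d):
    """Inverse of phi: rebuild the symmetric matrix from its upper triangle."""
    H = np.zeros((d, d))
    H[np.triu_indices(d)] = v
    return H + np.triu(H, 1).T

def quantize(x, eps):
    return eps * np.round(x / eps)          # stand-in for the operator Q

def quantized_newton(grads, hessians, x0, mu, alpha=0.5, steps=10):
    """grads[i](x), hessians[i](x): local first/second-order oracles of node i.
    x0 should satisfy the locality condition ||x0 - x*|| <= alpha*mu/(2*sigma)."""
    n, d = len(grads), x0.size
    x = x0.copy()
    for t in range(steps):
        G_t = (mu / 5.0) * alpha ** (t + 1)  # shrinking resolution G^{(t)}
        # Nodes send quantized flattened Hessians; the master averages and
        # re-broadcasts a quantized average playing the role of H_t.
        H_loc = [phi_inv(quantize(phi(h(x)), G_t), d) for h in hessians]
        H_t = phi_inv(quantize(phi(sum(H_loc) / n), G_t), d)
        # Same two-round pattern for the Newton directions.
        dirs = [quantize(np.linalg.solve(H_t, g(x)), G_t) for g in grads]
        x = x - quantize(sum(dirs) / n, G_t)
    return x
```

Quantizing each round relative to the previously communicated quantity (here only hinted at by the shared, shrinking resolution) is what keeps the per-round cost at $O(d^2\log(\sqrt d\,\kappa))$ bits in the analysis below, instead of growing as the accuracy improves.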
Lemma 16. Consider the algorithm
$$x^{(t+1)} = x^{(t)} - H_t^{-1}\nabla f(x^{(t)}),$$
where $H_t$ is the quantized approximation of the Hessian at step $t$, but $\nabla f$ is the exact gradient. The iterates of this algorithm satisfy the following three inequalities:
$$\|x^{(t)} - x^*\| \le \frac{\mu}{2\sigma}\alpha^{t+1},\qquad \|H_i^t - \nabla^2 f_i(x^{(t)})\| \le \frac{G^{(t)}}{\kappa},\qquad \|H_t - \nabla^2 f(x^{(t)})\| \le \frac{G^{(t)}}{\kappa}.$$

Proof. We start by proving that the inequalities hold for $t = 0$. The first one is direct from the bound on our initialization. For the second one, it suffices to show that
$$\|\phi(H_i^0) - \phi(\nabla^2 f_i(x^{(0)}))\| \le \frac{G^{(0)}}{\sqrt2\,\kappa},$$
and for that it suffices that
$$\|\phi(\nabla^2 f_i(x^{(0)})) - \phi(\nabla^2 f_1(x^{(0)}))\| \le 2\sqrt d\,\gamma,$$
which is indeed the case because
$$\|\phi(\nabla^2 f_i(x^{(0)})) - \phi(\nabla^2 f_1(x^{(0)}))\| \le \sqrt d\,\|\nabla^2 f_i(x^{(0)}) - \nabla^2 f_1(x^{(0)})\| \le \sqrt d\,\left(\|\nabla^2 f_i(x^{(0)})\| + \|\nabla^2 f_1(x^{(0)})\|\right) \le 2\sqrt d\,\gamma.$$
For the third inequality at $t = 0$, we have
$$\|\nabla^2 f(x^{(0)}) - S_0\| \le \frac1n\sum_{i=1}^n\|\nabla^2 f_i(x^{(0)}) - H_i^0\| \le \frac{G^{(0)}}{\kappa}.$$
We also need $\|S_0 - H_0\| \le \frac{G^{(0)}}{\kappa}$, and for that it suffices that
$$\|\phi(S_0) - \phi(H_0)\| \le \frac{G^{(0)}}{\sqrt2\,\kappa},$$
which follows from
$$\|\phi(S_0) - \phi(\nabla^2 f_i(x^{(0)}))\| \le \sqrt d\left(\frac{G^{(0)}}{\kappa} + 2\gamma\right).$$
In order to show the latter, we write
$$\|\phi(S_0) - \phi(\nabla^2 f_i(x^{(0)}))\| \le \sqrt d\,\|S_0 - \nabla^2 f_i(x^{(0)})\| \le \sqrt d\,\left(\|S_0 - \nabla^2 f(x^{(0)})\| + \|\nabla^2 f(x^{(0)}) - \nabla^2 f_i(x^{(0)})\|\right) \le \sqrt d\left(\frac{G^{(0)}}{\kappa} + 2\gamma\right).$$
For $t \ge 0$, we assume that the inequalities hold simultaneously for $t$ and want to prove that they continue to hold for $t + 1$. We use a proof in the same style as in the analysis of the convergence of GLMs with a fixed preconditioner. We have
$$x^{(t+1)} - x^* = x^{(t)} - H_t^{-1}\nabla f(x^{(t)}) - x^* = (x^{(t)} - x^*) - H_t^{-1}\left(\int_0^1\nabla^2 f(x(\xi))\,d\xi\right)(x^{(t)} - x^*) = \left(\mathrm{Id} - \int_0^1 H_t^{-1}\nabla^2 f(x(\xi))\,d\xi\right)(x^{(t)} - x^*),$$
where $x(\xi) = x^{(t)} + \xi(x^* - x^{(t)})$. Thus
$$\|x^{(t+1)} - x^*\| \le \left\|\mathrm{Id} - \int_0^1 H_t^{-1}\nabla^2 f(x(\xi))\,d\xi\right\|\;\|x^{(t)} - x^*\| \le \max_{0\le\xi\le1}\|\mathrm{Id} - H_t^{-1}\nabla^2 f(x(\xi))\|\;\|x^{(t)} - x^*\|.$$
Now we can write
$$\max_{0\le\xi\le1}\|\mathrm{Id} - H_t^{-1}\nabla^2 f(x(\xi))\| \le \max_{0\le\xi\le1}\|\mathrm{Id} - \nabla^2 f(x^{(t)})^{-1}\nabla^2 f(x(\xi))\| + \max_{0\le\xi\le1}\|(\nabla^2 f(x^{(t)})^{-1} - H_t^{-1})\nabla^2 f(x(\xi))\| \le \max_{0\le\xi\le1}\|\nabla^2 f(x^{(t)})^{-1}(\nabla^2 f(x^{(t)}) - \nabla^2 f(x(\xi)))\| + \|\nabla^2 f(x^{(t)})^{-1} - H_t^{-1}\|\max_{0\le\xi\le1}\|\nabla^2 f(x(\xi))\| \le \frac{\sigma}{\mu}\|x^{(t)} - x^*\| + \frac{2G^{(t)}}{\kappa\mu^2}\,\gamma.$$
The first summand of the last inequality is obtained from the facts that the largest eigenvalue of $\nabla^2 f(x^{(t)})^{-1}$ is at most $1/\mu$ and $\nabla^2 f$ is $\sigma$-Lipschitz. The second summand is obtained using that
$$\|\nabla^2 f(x^{(t)})^{-1} - H_t^{-1}\| = \|\nabla^2 f(x^{(t)})^{-1}(\nabla^2 f(x^{(t)}) - H_t)H_t^{-1}\| \le \|\nabla^2 f(x^{(t)}) - H_t\|\,\|\nabla^2 f(x^{(t)})^{-1}\|\,\|H_t^{-1}\| \le \frac{G^{(t)}}{\kappa}\cdot\frac1\mu\cdot\frac{1}{\mu - G^{(t)}/\kappa},$$
where the last factor comes from
$$\|H_t^{-1}\| = \frac{1}{\lambda_{\min}(H_t)} \le \frac{1}{\lambda_{\min}(\nabla^2 f(x^{(t)})) - \|\nabla^2 f(x^{(t)}) - H_t\|} \le \frac{1}{\mu - G^{(t)}/\kappa}.$$
Now, we have $\frac{G^{(t)}}{\kappa} \le \frac{\mu}{5\kappa}\alpha^{t+1} \le \frac\mu2$, which always holds true because $\kappa \ge 1$ and $\alpha < 1$; hence $\mu - G^{(t)}/\kappa \ge \mu/2$ and $\|\nabla^2 f(x^{(t)})^{-1} - H_t^{-1}\| \le \frac{2G^{(t)}}{\kappa\mu^2}$. Therefore, using $\kappa = \gamma/\mu$,
$$\|x^{(t+1)} - x^*\| \le \left(\frac{\sigma}{\mu}\|x^{(t)} - x^*\| + \frac{2G^{(t)}}{\mu}\right)\|x^{(t)} - x^*\| \le \left(\frac{\sigma}{\mu}\cdot\frac{\mu}{2\sigma}\alpha^{t+1} + \frac25\alpha^{t+1}\right)\frac{\mu}{2\sigma}\alpha^{t+1} = \frac{9}{10}\alpha^{t+1}\cdot\frac{\mu}{2\sigma}\alpha^{t+1} \le \frac{\mu}{2\sigma}\alpha^{t+2},$$
since $\alpha < 1$. This also implies that
$$\|x^{(t+1)} - x^*\| \le \left(\frac{\sigma}{\mu}\|x^{(t)} - x^*\| + \frac{2G^{(t)}}{\mu}\right)\|x^{(t)} - x^*\| \le \alpha^{t+1}\|x^{(t)} - x^*\|.$$
For the second inequality it suffices to prove that
$$\|\phi(H_i^{t+1}) - \phi(\nabla^2 f_i(x^{(t+1)}))\| \le \frac{G^{(t+1)}}{\sqrt2\,\kappa},$$
and for that it suffices that
$$\|\phi(\nabla^2 f_i(x^{(t+1)})) - \phi(H_i^t)\| \le \frac{5\sqrt d}{\alpha}G^{(t+1)}.$$
We indeed have
$$\|\phi(\nabla^2 f_i(x^{(t+1)})) - \phi(H_i^t)\| \le \|\phi(\nabla^2 f_i(x^{(t+1)})) - \phi(\nabla^2 f_i(x^{(t)}))\| + \|\phi(\nabla^2 f_i(x^{(t)})) - \phi(H_i^t)\| \le \sqrt d\left(\|\nabla^2 f_i(x^{(t+1)}) - \nabla^2 f_i(x^{(t)})\| + \|\nabla^2 f_i(x^{(t)}) - H_i^t\|\right) \le \sqrt d\left(\sigma\|x^{(t+1)} - x^{(t)}\| + \frac{G^{(t)}}{\kappa}\right) \le \sqrt d\left(\frac\mu2(1+\alpha)\alpha^{t+1} + \frac\mu5\alpha^{t+1}\right) \le \sqrt d\,\mu\,\alpha^{t+1} = \frac{5\sqrt d}{\alpha}G^{(t+1)},$$
where the next-to-last inequality uses $\alpha \le 3/5$ (which in particular covers the choice $\alpha = 1/2$ made in Theorem 8).
For the last inequality, we have
$$\|\nabla^2 f(x^{(t+1)}) - S_{t+1}\| \le \frac1n\sum_{i=1}^n\|\nabla^2 f_i(x^{(t+1)}) - H_i^{t+1}\| \le \frac{G^{(t+1)}}{\kappa}.$$
Now it suffices to prove that $\|S_{t+1} - H_{t+1}\| \le \frac{G^{(t+1)}}{\kappa}$, which holds if
$$\|\phi(S_{t+1}) - \phi(H_{t+1})\| \le \frac{G^{(t+1)}}{\sqrt2\,\kappa},$$
and for that it suffices that
$$\|\phi(S_{t+1}) - \phi(H_t)\| \le \sqrt d\left(\frac1\kappa + \frac5\alpha\right)G^{(t+1)}.$$
We now have
$$\|\phi(S_{t+1}) - \phi(H_t)\| \le \sqrt d\,\|S_{t+1} - H_t\| \le \sqrt d\,\left(\|S_{t+1} - \nabla^2 f(x^{(t+1)})\| + \|\nabla^2 f(x^{(t+1)}) - \nabla^2 f(x^{(t)})\| + \|\nabla^2 f(x^{(t)}) - H_t\|\right) \le \sqrt d\left(\frac{G^{(t+1)}}{\kappa} + \sigma\|x^{(t+1)} - x^{(t)}\| + \frac{G^{(t)}}{\kappa}\right) \le \sqrt d\,\frac{G^{(t+1)}}{\kappa} + \frac{5\sqrt d}{\alpha}G^{(t+1)} = \sqrt d\left(\frac1\kappa + \frac5\alpha\right)G^{(t+1)},$$
which concludes the induction.
We now recall the parameters
$$\theta = \frac{\alpha(1-\alpha)}{4},\qquad K = \frac2\alpha,\qquad P^{(t)} = \frac{\mu}{2\sigma}K\alpha\left(\frac{1+\alpha}{2}\right)^t.$$

Lemma 17. The iterates $x^{(t)}$ of the quantized Newton's algorithm satisfy the inequalities
$$\|x^{(t)} - x^*\| \le \frac{\mu}{2\sigma}\alpha\left(\frac{1+\alpha}{2}\right)^t,\qquad \|H_t^{-1}\nabla f_i(x^{(t)}) - v_i^{(t)}\| \le \frac{\theta P^{(t)}}{2},\qquad \|H_t^{-1}\nabla f(x^{(t)}) - v^{(t)}\| \le \theta P^{(t)}.$$

Proof. We first prove that the inequalities hold at $t = 0$. The first one is trivial by the choice of $x^{(0)}$. For the second one it suffices to show that
$$\|H_0^{-1}\nabla f_i(x^{(0)}) - H_0^{-1}\nabla f_1(x^{(0)})\| \le 4\kappa P^{(0)}.$$
Indeed,
$$\|H_0^{-1}\nabla f_i(x^{(0)}) - H_0^{-1}\nabla f_1(x^{(0)})\| \le \|H_0^{-1}\|\left(\|\nabla f_i(x^{(0)})\| + \|\nabla f_1(x^{(0)})\|\right) \le \frac2\mu\cdot2\gamma\|x^{(0)} - x^*\| \le \frac{4\gamma}{\mu}\cdot\frac{\alpha\mu}{2\sigma} = \frac{2\alpha\gamma}{\sigma} \le 4\kappa P^{(0)}.$$
For the third inequality at $t = 0$, we have
$$\|H_0^{-1}\nabla f(x^{(0)}) - p^{(0)}\| \le \frac1n\sum_{i=1}^n\|H_0^{-1}\nabla f_i(x^{(0)}) - v_i^{(0)}\| \le \frac{\theta P^{(0)}}{2}.$$
We also need $\|v^{(0)} - p^{(0)}\| \le \frac{\theta P^{(0)}}{2}$. For that it suffices to show that
$$\|p^{(0)} - H_0^{-1}\nabla f_i(x^{(0)})\| \le \left(\frac\theta2 + 4\kappa\right)P^{(0)}.$$
Indeed,
$$\|p^{(0)} - H_0^{-1}\nabla f_i(x^{(0)})\| \le \|p^{(0)} - H_0^{-1}\nabla f(x^{(0)})\| + \|H_0^{-1}\nabla f(x^{(0)}) - H_0^{-1}\nabla f_i(x^{(0)})\| \le \frac{\theta P^{(0)}}{2} + 4\kappa P^{(0)} = \left(\frac\theta2 + 4\kappa\right)P^{(0)}.$$
Consequently $\|v^{(0)} - p^{(0)}\| \le \frac{\theta P^{(0)}}{2}$ and
$$\|H_0^{-1}\nabla f(x^{(0)}) - v^{(0)}\| \le \|H_0^{-1}\nabla f(x^{(0)}) - p^{(0)}\| + \|p^{(0)} - v^{(0)}\| \le \theta P^{(0)}.$$
Now we assume that the inequalities hold for $t$ and wish to prove that they also hold for $t + 1$. We start with the first one:
$$\|x^{(t+1)} - x^*\| = \|x^{(t)} - v^{(t)} + H_t^{-1}\nabla f(x^{(t)}) - H_t^{-1}\nabla f(x^{(t)}) - x^*\| \le \|H_t^{-1}\nabla f(x^{(t)}) - v^{(t)}\| + \|x^{(t)} - H_t^{-1}\nabla f(x^{(t)}) - x^*\| \le \theta P^{(t)} + \alpha\|x^{(t)} - x^*\| \le \theta\,\frac{\mu}{2\sigma}K\alpha\left(\frac{1+\alpha}{2}\right)^t + \alpha\,\frac{\mu}{2\sigma}\alpha\left(\frac{1+\alpha}{2}\right)^t = (\theta K + \alpha)\frac{\mu}{2\sigma}\alpha\left(\frac{1+\alpha}{2}\right)^t = \frac{\mu}{2\sigma}\alpha\left(\frac{1+\alpha}{2}\right)^{t+1},$$
since $\theta K + \alpha = \frac{1-\alpha}{2} + \alpha = \frac{1+\alpha}{2}$.
For the second inequality it suffices to prove that
$$\|H_{t+1}^{-1}\nabla f_i(x^{(t+1)}) - v_i^{(t)}\| \le 4\kappa P^{(t+1)}.$$
We have
$$\|H_{t+1}^{-1}\nabla f_i(x^{(t+1)}) - v_i^{(t)}\| \le \|H_{t+1}^{-1}\nabla f_i(x^{(t+1)}) - H_{t+1}^{-1}\nabla f_i(x^{(t)})\| + \|H_{t+1}^{-1}\nabla f_i(x^{(t)}) - H_t^{-1}\nabla f_i(x^{(t)})\| + \|H_t^{-1}\nabla f_i(x^{(t)}) - v_i^{(t)}\| \le \frac{2\gamma}{\mu}\|x^{(t+1)} - x^{(t)}\| + \frac{4\gamma}{\mu}\|x^{(t)} - x^*\| + \frac{\theta P^{(t)}}{2} \le \left(8\kappa + \frac{\theta K}{2}\right)\frac{\mu}{2\sigma}\alpha\left(\frac{1+\alpha}{2}\right)^t \le 2\kappa K(1+\alpha)\frac{\mu}{2\sigma}\alpha\left(\frac{1+\alpha}{2}\right)^t = 4\kappa P^{(t+1)}.$$
The second inequality of this derivation follows from the $\gamma$-smoothness of $f_i$, the fact that $\|H_t^{-1}\|, \|H_{t+1}^{-1}\| \le \frac2\mu$, and the induction hypothesis.
For the third inequality, we have
$$\|H_{t+1}^{-1}\nabla f(x^{(t+1)}) - p^{(t+1)}\| \le \frac1n\sum_{i=1}^n\|H_{t+1}^{-1}\nabla f_i(x^{(t+1)}) - v_i^{(t+1)}\| \le \frac{\theta P^{(t+1)}}{2}.$$
We also want to prove that $\|p^{(t+1)} - v^{(t+1)}\| \le \frac{\theta P^{(t+1)}}{2}$. For that it suffices to show that
$$\|p^{(t+1)} - v^{(t)}\| \le \left(\frac\theta2 + 4\kappa\right)P^{(t+1)}.$$
We have
$$\|p^{(t+1)} - v^{(t)}\| \le \|p^{(t+1)} - H_{t+1}^{-1}\nabla f(x^{(t+1)})\| + \|H_{t+1}^{-1}\nabla f(x^{(t+1)}) - H_{t+1}^{-1}\nabla f(x^{(t)})\| + \|H_{t+1}^{-1}\nabla f(x^{(t)}) - H_t^{-1}\nabla f(x^{(t)})\| + \|H_t^{-1}\nabla f(x^{(t)}) - v^{(t)}\| \le \frac{\theta P^{(t+1)}}{2} + \frac{2\gamma}{\mu}\|x^{(t+1)} - x^{(t)}\| + \frac{4\gamma}{\mu}\|x^{(t)} - x^*\| + \theta P^{(t)} \le \left(\frac\theta2 + 4\kappa\right)P^{(t+1)},$$
where the last step follows from the same argument used in deriving the second inequality. This completes the induction.
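The two inductions above close because of two small parameter identities, $\delta K + \xi = \frac{1+\xi}{2}$ for the GLM analysis and $\theta K + \alpha = \frac{1+\alpha}{2}$ for the Newton analysis. The lines below simply check them numerically for a few values; this is purely a sanity check, not part of the analysis.

```python
for kappa_l in (2.0, 10.0, 100.0):
    xi = 1 - 1 / (2 * kappa_l)
    K, delta = 2 / xi, xi * (1 - xi) / 4
    assert abs(delta * K + xi - (1 + xi) / 2) < 1e-12        # GLM induction step

for alpha in (0.1, 0.25, 0.5):
    K, theta = 2 / alpha, alpha * (1 - alpha) / 4
    assert abs(theta * K + alpha - (1 + alpha) / 2) < 1e-12  # Newton induction step

print("parameter identities verified")
```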
Theorem 8. The iterates of the quantized Newton's method, starting from a point $x^{(0)}$ such that
$$\|x^{(0)} - x^*\| \le \frac{\mu}{4\sigma}\qquad(\text{this corresponds to }\alpha = 1/2),$$
satisfy
$$\|x^{(t)} - x^*\| \le \frac{\mu}{4\sigma}\left(\frac34\right)^t,$$
and the communication cost until reaching accuracy $\epsilon$ in terms of function values is
$$O\!\left(nd^2\log\left(\sqrt d\,\kappa\right)\log\frac{\gamma\mu^2}{\sigma^2\epsilon}\right)\qquad(2)$$
bits in total.

Proof. The claim about the convergence of the iterates follows easily by applying Lemma 17 with $\alpha = 1/2$. This means that we achieve $\|x^{(t)} - x^*\| \le \epsilon$ in at most
$$t = \frac{1}{\log(4/3)}\log\frac{\mu}{4\sigma\epsilon} \le 4\log\frac{\mu}{4\sigma\epsilon}$$
iterations. We have $f(x^{(t)}) - f^* \le \epsilon$ if $\|x^{(t)} - x^*\| \le \sqrt{2\epsilon/\gamma}$, thus we reach accuracy $\epsilon$ in terms of function values in at most
$$t = 2\log\frac{\gamma\mu^2}{\sigma^2\epsilon}$$
iterations. For the communication cost, in order to carry out the Hessian quantization (Lemma 16) at $t = 0$, we need
$$O\!\left(n\,\frac{d(d+1)}{2}\log\frac{2\sqrt d\,\gamma}{G^{(0)}/(\sqrt2\,\kappa)}\right) = O\!\left(n\,\frac{d(d+1)}{2}\log\frac{\sqrt d\,\gamma\kappa}{\mu}\right) = O\!\left(nd^2\log\left(\sqrt d\,\kappa\right)\right)$$
bits for encoding the local Hessian matrices (recall $G^{(0)} = \mu\alpha/5$), and
$$O\!\left(n\,\frac{d(d+1)}{2}\log\frac{\sqrt d\,(G^{(0)}/\kappa + 2\gamma)}{G^{(0)}/(\sqrt2\,\kappa)}\right) = O\!\left(n\,\frac{d(d+1)}{2}\log\frac{2\sqrt d\,\gamma}{G^{(0)}/(\sqrt2\,\kappa)}\right) = O\!\left(nd^2\log\left(\sqrt d\,\kappa\right)\right)$$
for decoding their sum back to all machines (this is because $G^{(0)}/\kappa \le \gamma$). Thus the total communication cost for Hessian quantization at $t = 0$ is $O\!\left(nd^2\log\left(\sqrt d\,\kappa\right)\right)$.
For $t \ge 1$, the cost for quantizing the local Hessians is
$$O\!\left(n\,\frac{d(d+1)}{2}\log\frac{(5\sqrt d/\alpha)\,G^{(t+1)}}{G^{(t+1)}/(\sqrt2\,\kappa)}\right) = O\!\left(nd^2\log\left(\sqrt d\,\kappa\right)\right)$$
and for communicating the sum back to all machines it is
$$O\!\left(n\,\frac{d(d+1)}{2}\log\frac{\sqrt d\,(1/\kappa + 5/\alpha)\,G^{(t+1)}}{G^{(t+1)}/(\sqrt2\,\kappa)}\right) = O\!\left(nd^2\log\left(\sqrt d\,\kappa\right)\right),$$
again because $1/\kappa \le 5/\alpha$. Thus the total cost of Hessian quantization along the whole optimization process until reaching accuracy $\epsilon$ is
$$b_m = O\!\left(nd^2\log\left(\sqrt d\,\kappa\right)\log\frac{\gamma\mu^2}{\sigma^2\epsilon}\right)$$
bits in total. On the other hand, the cost of quantizing the local descent directions at step $t \ge 0$ is
$$O\!\left(nd\log\frac{4\kappa P^{(t)}}{\theta P^{(t)}/2}\right) = O(nd\log\kappa),$$
because $\theta$ is now just a constant ($\theta = 1/16$ for $\alpha = 1/2$). The cost of sending the average of the quantized local directions back to the machines is
$$O\!\left(nd\log\frac{(\theta/2 + 4\kappa)\,P^{(t)}}{\theta P^{(t)}/2}\right) = O\!\left(nd\log\frac{4\kappa P^{(t)}}{\theta P^{(t)}/2}\right) = O(nd\log\kappa),$$
because $\theta \le \kappa$. Thus, the total communication cost for quantizing the descent directions until reaching accuracy $\epsilon$ is
$$b_v = O\!\left(nd\log\kappa\log\frac{\gamma\mu^2}{\sigma^2\epsilon}\right)$$
bits. The total communication cost of Quantized Newton's method overall is therefore
$$O\!\left(nd^2\log\left(\sqrt d\,\kappa\right)\log\frac{\gamma\mu^2}{\sigma^2\epsilon}\right) + O\!\left(nd\log\kappa\log\frac{\gamma\mu^2}{\sigma^2\epsilon}\right) = O\!\left(nd^2\log\left(\sqrt d\,\kappa\right)\log\frac{\gamma\mu^2}{\sigma^2\epsilon}\right).$$
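To see the qualitative difference between the two results, the snippet below evaluates the iteration-dependent part of (1) against the total of (2) for illustrative parameter values; constants inside the $O(\cdot)$ are again dropped, so only the scaling is meaningful.

```python
import math

def glm_iteration_bits(n, d, kappa_l, kappa_M, gamma, D, eps):
    return n * d * kappa_l * math.log(n * kappa_l * kappa_M) * math.log(gamma * D**2 / eps)

def newton_total_bits(n, d, kappa, gamma, mu, sigma, eps):
    return n * d**2 * math.log(math.sqrt(d) * kappa) * math.log(gamma * mu**2 / (sigma**2 * eps))

print(f"GLM, iteration-dependent part ~ "
      f"{glm_iteration_bits(16, 100, 50, 1e3, 1.0, 10.0, 1e-6):.2e}")
print(f"Newton, total ~ "
      f"{newton_total_bits(16, 100, 50, 1.0, 0.1, 1.0, 1e-6):.2e}")
```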
F Estimation of the Minimum

Proposition 9. The value $\bar f$ obtained from the previous quantization procedure is an estimate of the true minimum $f^*$ of $f$ with accuracy $2\epsilon$, and the cost of the quantization is
$$O\!\left(n\log\frac{\gamma C^2 + c}{\epsilon}\right)$$
bits, if $\epsilon$ is sufficiently small.

Proof. We have that
$$|f_i(x^{(t)}) - f_j(x^{(t)})| \le |f_i(x^{(t)})| + |f_j(x^{(t)})| \le \frac\gamma2\|x^{(t)} - x_i^*\|^2 + |f_i^*| + \frac\gamma2\|x^{(t)} - x_j^*\|^2 + |f_j^*|.$$
In order for $x^{(t)}$ to satisfy $f(x^{(t)}) - f^* \le \epsilon$, we compute $x^{(t)}$ such that $\|x^{(t)} - x^*\| \le \sqrt{2\epsilon/\gamma}$. Given that, we can write
$$\|x^{(t)} - x_i^*\|^2 = \|x^* - x_i^*\|^2 + \|x^{(t)} - x^*\|^2 + 2\langle x^* - x_i^*, x^{(t)} - x^*\rangle \le \|x^* - x_i^*\|^2 + \|x^{(t)} - x^*\|^2 + 2\|x^* - x_i^*\|\,\|x^{(t)} - x^*\| \le C^2 + \frac{2\epsilon}{\gamma} + 2\sqrt{\frac{2\epsilon}{\gamma}}\,C \le 2C^2$$
for sufficiently small $\epsilon$. Similarly, we have $\|x^{(t)} - x_j^*\|^2 \le 2C^2$ for small $\epsilon$. Thus
$$|f_i(x^{(t)}) - f_j(x^{(t)})| \le 2(\gamma C^2 + c),$$
and by the definition of the quantization, we have $|q_i^{(t)} - f_i(x^{(t)})| \le \epsilon$, which implies
$$|\bar f - f(x^{(t)})| \le \frac1n\sum_{i=1}^n|q_i^{(t)} - f_i(x^{(t)})| \le \epsilon.$$
Overall, we get
$$\bar f - f^* \le |\bar f - f(x^{(t)})| + f(x^{(t)}) - f^* \le \epsilon + \epsilon = 2\epsilon.$$
The communication cost for quantizing the values $f_i(x^{(t)})$ is
$$O\!\left(n\log\frac{\gamma C^2 + c}{\epsilon}\right),$$
since we quantize real numbers, which are one-dimensional, and we need $n$ such communications.
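A minimal sketch of the estimation step of Proposition 9: each node rounds the scalar $f_i(x^{(t)})$ to a grid of width $2\epsilon$ (so, given the distance bound derived above, roughly $\log((\gamma C^2 + c)/\epsilon)$ bits suffice per node), and the master averages the rounded values. The helper name and the sample values are illustrative.

```python
def estimate_minimum(local_values, eps):
    """Each node rounds f_i(x^{(t)}) to a grid of width 2*eps (error <= eps);
    the master averages the rounded scalars."""
    rounded = [2 * eps * round(v / (2 * eps)) for v in local_values]
    return sum(rounded) / len(rounded)

vals = [0.3141, -1.27, 0.005, 2.71]      # stand-ins for the local values f_i(x^{(t)})
print(estimate_minimum(vals, eps=1e-3), sum(vals) / len(vals))
```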