Numerically Stable Polynomially Coded Computing
11 Numerically Stable Polynomially Coded Computing
Mohammad Fahim and Viveck R. Cadambe
Abstract
We study the numerical stability of polynomial based encoding methods, which has emerged to be a powerful class oftechniques for providing straggler and fault tolerance in the area of coded computing. Our contributions are as follows:1) We construct new codes for matrix multiplication that achieve the same fault/straggler tolerance as the previously constructed
MatDot Codes and
Polynomial Codes . Unlike previous codes that use polynomials expanded in a monomial basis, our codesuse a basis of orthogonal polynomials.2) We show that the condition number of every m × m sub-matrix of an m × n, n ≥ m Chebyshev-Vandermonde matrix, evaluatedon the n -point Chebyshev grid, grows as O ( n n − m ) ) for n > m . An implication of this result is that, when Chebyshev-Vandermonde matrices are used for coded computing, for a fixed number of redundant nodes s = n − m, the condition numbergrows at most polynomially in the number of nodes n .3) By specializing our orthogonal polynomial based constructions to Chebyshev polynomials, and using our condition numberbound for Chebyshev-Vandermonde matrices, we construct new numerically stable techniques for coded matrix multiplication.We empirically demonstrate that our constructions have significantly lower numerical errors compared to previous approacheswhich involve inversion of Vandermonde matrices. We generalize our constructions to explore the trade-off between computa-tion/communication and fault-tolerance.4) We propose a numerically stable specialization of Lagrange coded computing. Motivated by our condition number bound, ourapproach involves the choice of evaluation points and a suitable decoding procedure that involves inversion of an appropriateChebyshev-Vandermonde matrix. Our approach is demonstrated empirically to have lower numerical errors as compared tostandard methods. I. I
NTRODUCTION
The recently emerging area of “coded computing” focuses on incorporating redundancy based on coding-theory-inspired strategies to tackle central challenges in distributed computing, including stragglers, failures, processingerrors, communication bottlenecks and security issues. Such ideas have been applied to different large scale distributedcomputations such as matrix multiplication [1]–[5], gradient methods [6]–[8], linear solvers [9]–[11] and multi-variatepolynomial evaluation [12]. An important idea that has emerged from this body of the work is the use of novel, Reed-Solomon like polynomial based methods for encoding data. In polynomial based methods, each computation node storesa linearly encoded combination of the data partitions, where data stored at different worker nodes can be interpretedas evaluation of an appropriate polynomial at different points. The nodes then perform computation on these encodedversions of the data, and a central master/fusion node aggregates the outputs of these computations to recover theoverall result via a decoding process that inevitably involves polynomial interpolation. Much like Reed Solomon Codes,if the number of nodes performing the computation is higher than the number of evaluation points required for accurateinterpolation, the overall computation is tolerant to faults and stragglers.Perhaps the most striking application of polynomial based methods comes in the context of matrix multiplication.To multiply two N × N matrices A , B , assuming that each node stores 1 /m of each matrix, classical work in algorithmbased fault tolerance [13] outlines a coding based method which has been analyzed in [14]. Reference [2] showedthrough polynomial based encoding methods that the result of just m nodes can be used by the master node torecover the matrix-product. Remarkably, this means that polynomial based codes ensure that the recovery threshold -the worst case number of nodes whose computation suffices to recover the overall matrix-product - does not grow with P , the number of the distributed system’s worker nodes, unlike the approaches of [13], [14]. The recovery thresholdfor matrix multiplication has been improved to 2 m − Lagrange coded computing [12], where coding is applied for multi-variate polynomial computingwith guarantees of straggler resilience, security and privacy. In addition, polynomial-based methods are also useful forcommunication-efficient approaches for inverse problems and gradient methods [8], [10], [15].Despite the enormous success, the scalability of polynomial based methods in practice are limited by an “inconvenienttruth”, their numerical instability. The decoding methods for polynomial based methods require interpolating a degree K − K evaluation points. While this is numerically stable for classical error correcting codesfor communication and storage which are implemented over finite fields, we are concerned here for data processingapplications where the operations are typically real-valued. The main reason for the instability is that either implicitlyor explicitly, interpolation effectively solves a linear system whose transform is characterized by a Vandermonde matrix. M. Fahim and V. Cadambe are with the Department of Electrical Engineering, Pennsylvania State University, University Park, PA 16802.This work will be presented in part at the IEEE International Symposium on Information Theory (ISIT), July 2019. a r X i v : . [ c s . I T ] M a y Fig. 1. Example of MatDot Codes [3], with a recovery threshold of . The matrix product AB is the coefficient of x in p A ( x ) p B ( x ) , and can be recoveredat the fusion node upon receiving the output of any worker nodes and interpolating p A ( x ) p B ( x ) . It is well known that the condition number of Vandermonde matrices with real-valued nodes grows exponentially inthe dimension of the matrix [16]–[19]. The large condition number means that small perturbations of the Vandermondematrix due to numerical precision errors can result in singular matrices [20], [21]. In practice, this can translateto large numerical errors even when the coded computation is distributed among few tens of nodes . Conventionalintuition dictates that the main scalability bottlenecks in distributed computing include computation cost per worker,communication bottlenecks, and stragglers. However, for polynomially coded computing , it turns out that numericallystability is also critical and constitutes a huge bottleneck for scalability of such codes. Indeed, a polynomially codedcomputing scheme that achieves the minimum recovery threshold, and that is optimal computation/communicationwise, will simply fail once implemented on a distributed system with tens of computing nodes due to the large numericalerrors. Thus, the main contribution of our paper is a new numerically stable approach to polynomially coded computing. II. S
UMMARY OF C ONTRIBUTIONS
In this paper, we develop a new, numerically stable, approach for polynomially coded computing. A significantdifference from previous polynomial coding approaches is that we depart from the monomial basis, which allows us tocircumvent the inherently ill-conditioned Vandermonde-matrices. We demonstrate our approach through two importantapplications of polynomially coded computing: matrix multiplication, and Lagrange coded computing.To illustrate our results, consider the coded matrix multiplication problem, where the goal is to multiply two matrices A , B over P computation nodes where each node stores 1 /m of each of the two matrices. A master node encodes A , B into P matrices each, and sends these matrices respectively to each worker node. Each worker node multiplies thereceived encoded matrices, and sends the product back to the fusion node , which aims to recover AB from a subset ofthe worker nodes. The recovery threshold is defined as a number K such that the computation of any set of K workernodes suffices to recover the product AB . The MatDot scheme of [3] achieves the best known recovery threshold of2 m −
1. We begin with an example of MatDot Codes for m = 2 . Example 1: MatDot Codes [3], recovery threshold = 3:
Consider two N × N matrices A = (cid:2) A A (cid:3) , B = (cid:20) B B (cid:21) , where A , A are N × N/ B , B are N/ × N matrices. Define p A ( x ) = A + A x and p B ( x ) = B x + B , and let x , · · · , x P be distinct real values. Notice that AB = A B + A B is the coefficient of x in polynomial p A ( x ) p B ( x ) . In MatDot Codes, as illustrated in Fig. 1, worker node i computes p A ( x i ) p B ( x i ) , i = 1 , , . . . P, so thatfrom any of the P nodes, the polynomial p ( x ) = A B + ( A B + A B ) x + A B x can be interpolated. Havinginterpolated the polynomial, the product AB is simply the coefficient of x . A generalization of the above example leads toa recovery threshold of 2 m −
1, with a decoding process that involves effectively inverting a 2 m − × m − n × n Vandermonde matrix grows exponentially in n with both (cid:96) ∞ and (cid:96) norms [16], [17]. The intuition behind the inherent poor conditioning of the monomial basis { , x, x , . . . , x m − } is demonstrated in Fig. 2 and Fig. 3.Motivated by Fig.3, we aim, in this paper, to choose polynomials that are orthonormal. However, it is not immediatelyclear whether orthonormal polynomials are applicable for matrix multiplications. We demonstrate the applicability For example, [22], reports that “In our experiments we observed large floating point errors when inverting high degree Vandermonde matrices forpolynomial interpolation”. The master and fusion nodes are logical entities; in practice, they may be the same node, or may be emulated in a decentralized manner by thecomputation nodes. -1 0 1-101 x x x x x xx Fig. 2. Plot of monomials , x, x , x , x , x , x , x ver-sus x for x ∈ [ − , . Note that for a large degree d, smallchanges in x can lead to large changes in x d ; this leads tosignificant numerical errors when working with the monomialbasis. Fig. 3. Note that { , x, . . . , x d } forms a basis for the vectorspace of d -degree polynomials, with the inner-product (cid:104) f, g (cid:105) = (cid:82) − f ( x ) g ( x ) dx. We have plotted the vectors x and x . Thesmall angle between the two vectors leads to numerical errors. -1 0 1-101 T ( x ) T ( x ) T ( x ) T ( x ) T ( x ) Fig. 4. Plot of Chebyshev polynomials T ( x ) , T ( x ) , T ( x ) , T ( x ) , T ( x ) versus x for x ∈ [ − , . of orthonormal codes for matrix multiplication. For the example below, let q ( x ) , q ( x ) denote two orthonormalpolynomials such that (cid:90) − q i ( x ) q j ( x ) dx = (cid:26) if i = j otherwise (1) where q i ( x ) , i = 0 , i . Example 2 : OrthoMatDot Codes [This paper], recovery threshold = 3:
For two N × N matrices A = (cid:2) A A (cid:3) , B = (cid:20) B B (cid:21) , let p A ( x ) = A q ( x ) + A q ( x ) and p B ( x ) = B q ( x ) + B q ( x ) . Notice that because of (1), wehave AB = (cid:90) − p A ( x ) p B ( x ) dx. This leads to the following coded computing scheme: worker node i computes p A ( x i ) p B ( x i ) , i = 1 , , . . . P, where x , · · · , x P are distinct real values, so that from any of the P nodes, the fusion node can interpolate p ( x ) = p A ( x ) p B ( x ) .Having interpolated the polynomial, the fusion node obtains the product AB by performing (cid:82) − p A ( x ) p B ( x ) dx . Thisexample is illustrated in Fig. 5 . Fig. 5. Example of our proposed orthonormal polynomials based codes, with a recovery threshold of . The matrix product AB is (cid:82) − p A ( x ) p B ( x ) dx ,and can be recovered at the fusion node upon receiving the output of any worker nodes, then interpolating p A ( x ) p B ( x ) , and computing the integral (cid:82) − p A ( x ) p B ( x ) dx. A simple generalization of the above example, described in Construction 1 in Section IV, leads to a class of codes, werefer to it as
OrthoMatDot Codes, with recovery threshold of 2 m −
1, the same recovery threshold as MatDot Codes.In general, orthonormal polynomials are defined over arbitrary weight measure (cid:82) − · w ( x ) dx ; some well known classesof polynomials corresponding to different weight measures w ( x ) include Legendre, Chebyshev, Jacobi and LaguerrePolynomials [20], [21] (See Section III for definitions). Our OrthoMatDot Codes in Section IV can use any weightmeasure, and therefore can be used with different classes of orthonormal polynomials. Of particular interest to ourpaper are the Chebyshev polynomials (Fig. 4).With our basic template, the task of developing numerically stable codes boils down to (A) interpolating p A ( x ) p B ( x )in a numerically stable manner, and (B) integrating this polynomial in a numerically stable manner. For task (B),we use a decoding procedure via Gauss Quadrature [20], [21], [23] to recover the integral. Task (A) is particularlychallenging in the coding setting, because our goal is to interpolate the coefficients of p A ( x ) p B ( x ) - expanded over aseries of orthonormal polynomials - from any 2 m − P points.In Section V, we provide a specialization to the class of OrthoMatDot Codes, a numerically stable matrix multi-plication code construction that has the same recovery threshold and communication/computation cost per worker asMatDot codes. The construction specializes the class of OrthoMatDot Codes via the use of Chebyshev polynomials,which are a class of orthogonal polynomials that are ubiquitous in numerical methods and approximation theory [21].Construction 2 also specifies the choice of evaluation points x , x , . . . , x P . The decoding procedure outlined for the specialization of OrthoMatDot Codes in Section V involves the effectiveinversion of some 2 m − × m − m − × P Chebyshev-Vandermonde matrix [19], where each of the i -th column contains evaluations of the first 2 m − x i , i = 1 , , . . . , P . A key technical resultof our paper shows that, with our choice of evaluation points x , x , . . . , x P , every 2 m − × m − m − × P Chebyshev-Vandermonde matrix is well-conditioned. More precisely, we show that, with our choice of x , x , . . . , x P , the condition number of any m − × m − P when the number of redundant parity nodes ∆ = P − (2 m −
1) is fixed. Our condition numberbound may be viewed as result of independent interest in the area of numerical methods, and requires non-trivial use oftechniques from numerical approximation theory. This result is in contrast with the well known exponential growth forVandermonde systems. We also show the significant improvement in stability via numerical experiments in Section V-C.We also provide a preview of the results here in Table I, whose results demonstrate that remarkably, our Chebyhev-Vandermonde construction with even P = 150 nodes has a smaller relative error than the Vandermonde-based MatDotCodes with P = 30 nodes.While MatDot Codes [3] have an optimal recovery threshold of 2 m −
1, they have relatively higher computationcost per worker ( O ( N /m )) and worker node to fusion node communication cost ( O ( N )) as compared to PolynomialCodes [2] which have a computation cost per worker of O ( N /m ) and worker node to fusion node communicationcost of O ( N /m ). In particular, each worker in MatDot Codes performs an “outer” product of an N × N/m matrixwith a
N/m × N matrix, whereas each worker in Polynomial Codes performs an “inner” product of a N/m × N matrix with a N × N/m matrix. The reduced computation/communication comes at the cost of weaker fault-tolerance- Polynomial Codes have a higher recovery threshold of m as compared with MatDot Codes (2 m − OrthoPoly Codes . We note that the numerical error depends not only on the condition number of the matrix, but also the algorithm used for solving the linear system.However, we are not aware of any approach that can accurately solve, say, a × linear system with a Vandermonde matrix (See e.g., [24], [25]) TABLE IA
TABLE DEPICTING THE RELATIVE ERRORS OF VARIOUS SCHEMES FOR ∆ = P − (2 m −
1) = 3
REDUNDANT NODES . T
HE ERROR IS MEASURED VIATHE F ROBENIUS NORM , I . E ., || AB − ˆ C || F || AB || F . T HE MATRICES A , B ARE CHOSEN WITH ENTRIES N (0 , . T HE AVERAGE RELATIVE ERROR AVERAGES OVERALL POSSIBLE NODE FAILURES , I . E ., OVER EVERY SET OF m − NODES AMONG THE P = 2 m + 2 NODES ; THE WORST CASE RELATIVE ERRORINVOLVES THE WORST SET OF m − NODES . S EE S ECTION
V-C
FOR MORE DETAILS . Number MatDot OrthoMatDot MatDot OrthoMatDotof Workers worst case worst case average average ( P ) relative error relative error relative error relative error30 . × − . × − . × − . × − . × . × − . × . × − . × . × − . × . × − . × . × − . × . × − The trade-off between computation/communication cost and recovery threshold imposed by MatDot Codes andPolynomial Codes has motivated general code constructions that interpolates both of them [3], [5], [26], albeit usingthe monomial basis. In Section VII, we extend our approach to a general matrix multiplication code construction,referred to as
Generalized OrthoMatDot , that offers a computation/communication cost vs recovery threshold trade-off,following the research thread for the monomial basis [3], [5], [26], however we also target numerical stability in ourproposed construction. While our Generalized OrthoMatDot Codes specialize to OrthoMatDot Codes, i.e., they achievethe same optimal recovery threshold as OrthoMatDot Codes when allowing for the same computation/communicationcost as OrthoMatDot Codes, they do not specialize to OrthoPoly Codes. Specifically, Generalized OrthoMatDot codeshave higher recovery threshold than OrthoPoly Codes when allowing for the same computation/communication cost asOrthoPoly Codes. In Section VIII, we exploit the result obtained in Theorem 5.1 on the condition number of the square K × K sub-matrices of the K × P Chebyshev-Vandermonde matrices to propose a numerically stable algorithm forLagrange coded computing. In Section IX, we conclude with a discussion on other related problems such as matrix-vectormultiplication [13], [27], and describe some related open questions.
III. P
RELIMINARIES ON N UMERICAL A NALYSIS AND N OTATIONS
We discuss, in this section, the problem of finite precision in representing real numbers on digital machines and howit may horribly affect the output of computation problems performed on these machines. In addition, we also introducesome basic definitions and results from the area of numerical approximation theory that will be used in this paper [23],[28]. At the end of this section, we provide most of the common notations that will be used in this paper.
A. Preliminaries on Numerical Analysis
Since digital machines have finite memory, real numbers are digitally stored using a finite number of bits, i.e., finiteprecision. However, storing real numbers using a finite number of bits leads to inevitable errors since a finite numberof bits can only represent a finite number of real numbers with no errors. On the other hand, real numbers that cannotbe directly represented using the specified finite number of bits have to be either truncated or rounded-off in orderto fit in the memory. Although such perturbation (e.g., truncation/round-off error) of real numbers due to the finiteprecision of digital machines can be negligibly small, the perturbation of the output of any computation that uses such“small” perturbed stored real numbers as input does not necessarily be small as well. In fact, a very small perturbationto the input of some computation may lead to an output that is totally wrong and irrelevant to the correct output.The condition number of a computation problem captures/measures this observation.
Definition 3.1 (Condition Number):
Let f be a function representing a computation problem with input x , and let δx bea small perturbation of x , and define δf = f ( x + δx ) − f ( x ) to be the perturbation of f due to δx , the condition number ofthe problem at x with respect to some norm || · || is κ ( x ) = sup δx (cid:18) || δf |||| f ( x ) || (cid:30) || δx |||| x || (cid:19) . (2) Given the above definition of condition number, a problem is said to be “ill-conditioned” if small perturbations in theinput lead to large perturbation in the output (i.e., the condition number is large). On the other hand, a problem issaid to be “well-conditioned” if small perturbations in the input lead to small perturbations in the output (i.e., thecondition number is small). In what follows, we discuss the condition number of two computation problems: thematrix-vector multiplication and solving a system of linear equations. For both problems, consider the system of linearequations represented in the matrix form Ax = y , where A ∈ R n,n and non-singular, and x , y ∈ R n , and let || · || besome matrix norm. Then, let A be fixed, the condition number of this matrix-vector multiplication problem with y as its output given small perturbations in the input x is κ ( x ) ≤ || A |||| A − || , for any x ∈ R n . Also, for the problem ofsolving the system of linear equations Ax = y , with A still fixed, the condition number of the problem of solving thissystem of linear equations, given small perturbations in the input y , where x is the output, is κ ( y ) ≤ || A |||| A − || , forany y ∈ R n .Since we focus on polynomially coded computing, next, we introduce some basic tools of numerical approximationtheory that will be used throughout this paper. Notice that, in the following, C [ a, b ] denotes the vector space ofcontinuous integrable functions defined on the interval [ a, b ]. Definition 3.2 (Inner Products on C [ a, b ] ): For any f, g ∈ C [ a, b ] , and given a non-negative integrable weight function w , (cid:104) f, g (cid:105) = (cid:90) ba f ( x ) g ( x ) w ( x ) dx defines an inner product on C [ a, b ] relative to w . Definition 3.3 (Orthogonal Polynomials):
Consider a non-negative integrable weight function w , the polynomials { q i } i ≥ in C [ a, b ] where q i ( x ) has degree i and (cid:104) q i , q j (cid:105) = (cid:26) c i if i = j , otherwise, (3)for some non-zero values c i , where the inner product is relative to w , are called orthogonal polynomials relative to w , . Definition 3.4 (Orthonormal Polynomials):
Consider a non-negative integrable weight function w , the polynomials { q i } i ≥ ,where q i ( x ) has degree i , in C [ a, b ] such that (cid:104) q i , q j (cid:105) = (cid:26) if i = j , otherwise, (4)where the inner product is relative to w , are called orthonormal polynomials relative to w . Note that based on the above definitions, if the polynomials { q i } i ≥ are orthogonal (or orthonormal), then q n ( x ) isorthogonal to all polynomials of degree ≤ n −
1, i.e., (cid:104) p n − ( x ) , q n ( x ) (cid:105) = 0, for any polynomial p n − ∈ C [ a, b ] with degreestrictly less than n . It’s also worth noting that for w ( x ) = 1 , a = − , b = 1, the orthogonal polynomials are Legendrepolynomials, which are derived via Gram-Schmidt procedure applied to { , x, x , . . . , } sequentially. In addition, thefollowing is an important class of orthogonal polynomials in our paper. Example 3.1 (Chebyshev polynomials of the first kind):
The following recurrence relation defines the Chebyshevpolynomials of the first kind: T n ( x ) = 2 xT n − ( x ) − T n − ( x ) , where, T ( x ) = 1 , T ( x ) = x . These Chebyshev polynomials are the corner stone of modern numerical approximation theoryand practice with applications to numerical integration, and least-square approximations of continuous functions [23], [28]. √ T , T , T , · · · are orthonormal relative to the weight function π √ − x . In general, Chebyshev polynomials are defined over x ∈ R . However, for x ∈ [ − , , T n ( x ) = cos( n arccos( x )) , for any n ∈ N . For the rest of this paper, unless otherwise isstated, whenever Chebyshev polynomials are used, they are restricted only to the range [ − , . We state, next, two results from [28] in Theorems 3.1 and 3.2.
Theorem 3.1:
Let w be a weight function on the range [ a, b ] , i.e., w is a non-negative integrable function on [ a, b ] , and let x , · · · , x n be distinct real numbers such that a < x < · · · < x n < b , there exist unique weights a , · · · , a n such that (cid:90) ba f ( x ) w ( x ) dx = n (cid:88) i =1 a i f ( x i ) , for all polynomials f with degree less than n . Theorem 3.1 is not surprising - the left hand side of the equation stated in the theorem is a linear operator on thevector space of n − n − n points. Therefore, the left hand side can be expressed as an innerproduct of the functions evaluations at n points. We next state a remarkable result by Gauss which states conditionsunder which the expression of Theorem 3.1 is exact for polynomials of degree up to 2 n − , even though the number ofevaluation points is just n . Theorem 3.2 (Gauss Quadrature):
Fix a weight function w , and let { q i } i ≥ be a set of orthonormal polynomials in C [ a, b ] relative to w . Given n , let η , · · · , η n be the roots of q n such that a ≤ η < η < · · · < η n ≤ b , and choose real values a , · · · , a n such that (cid:80) ni =1 a i f ( η i ) = (cid:82) ba f ( x ) w ( x ) dx , for any f ∈ C [ a, b ] with degree less than n . Then, (cid:80) ni =1 a i f ( η i ) = (cid:82) ba f ( x ) w ( x ) dx , for any polynomial f with degree less than n . Remark 3.1:
1) Consider any orthonormal polynomials { q i } i> . For any n ∈ N , the set { q , q , · · · , q n − } forms a basis for the vectorspace of polynomials with degree less than n .2) In Theorem 3.2, a , · · · , a n can be chosen as a i = (cid:90) ba (cid:18) (cid:89) j ∈ [ n ] − i x − η j η i − η j (cid:19) w ( x ) dx, i ∈ [ n ] . (5)3) In Theorem 3.2, the roots of q n , i.e., η , · · · , η n are, in fact, real and distinct. Moreover, the Chebyshev polynomial ofthe first kind T n has the following roots ρ ( n ) i = cos (cid:18) i − n π (cid:19) , i ∈ [ n ] . (6)The set { ρ ( n )1 , · · · , ρ ( n ) n } is often called the n -point Chebyshev grid, and its elements ρ ( n )1 , · · · , ρ ( n ) n are called “Chebyshevnodes” of degree n . We here discard the term “node” and use the term “Chebyshev points” to avoid confusion withcomputation nodes. We also denote by ρ ( n ) the vector ( ρ ( n )1 , · · · , ρ ( n ) n ) . It is useful to note that T n ( x ) can be written as T n ( x ) = 2 n − n (cid:89) i =1 ( x − ρ ( n ) i ) , (7)and for T n ( x ) , the weights a i in (5) are all equal to /n when w ( x ) = π √ − x . B. Notations
Throughout this paper, we use lowercase bold letters to denote vectors and uppercase bold letters to denote matrices.In addition, for any positive integers k, n , and given a set of orthogonal polynomials q , q , · · · , q k − on the interval[ a, b ], let x = ( x , · · · , x n ) be a vector with entries in [ a, b ], we define the k × n matrix Q ( k,n ) ( x ) as: Q ( k,n ) ( x ) = q ( x ) · · · q ( x n ) ... . . . ...q k − ( x ) · · · q k − ( x n ) . (8) For any subset S = { s , · · · , s r } ⊂ [ n ], we denote by Q ( k,n ) S ( x ) the sub-matrix of Q ( k,n ) ( x ) formed by concatenatingcolumns with indices in S , i.e., Q ( k,n ) S ( x ) = q ( x s ) · · · q ( x s r ) ... . . . ...q k − ( x s ) · · · q k − ( x s r ) . (9) For the special case where the orthogonal polynomials are the Chebyshev polynomials of the first kind T , T , · · · , T k − ,we define the k × n matrix G ( k,n ) ( x ) as: G ( k,n ) ( x ) = T ( x ) · · · T ( x n ) ... . . . ...T k − ( x ) · · · T k − ( x n ) , (10) we denote by G ( k,n ) S ( x ) the sub-matrix of G ( k,n ) ( x ) formed by concatenating columns with indices in S , i.e., G ( k,n ) S ( x ) = T ( x s ) · · · T ( x s r ) ... . . . ...T k − ( x s ) · · · T k − ( x s r ) . (11) Also, for the case where the orthogonal polynomials are the “orthonormal” Chebyshev polynomials √ T , T , · · · , T k − ,we define the k × n matrix ˜ G ( k,n ) ( x ) as:˜ G ( k,n ) ( x ) = T ( x ) / √ · · · T ( x n ) / √ T ( x ) · · · T ( x n ) ... . . . ...T k − ( x ) · · · T k − ( x n ) , (12) Fig. 6. The distributed system framework and we denote by ˜ G ( k,n ) S ( x ) the sub-matrix of ˜ G ( k,n ) ( x ) formed by concatenating columns with indices in S , i.e.,˜ G ( k,n ) S ( x ) = T ( x s ) / √ · · · T ( x s r ) / √ T ( x s ) · · · T ( x s r ) ... . . . ...T k − ( x s ) · · · T k − ( x s r ) . (13) Wherever there is no ambiguity on x , it may be dropped from the notation.In the next section, we show that orthonormal polynomials can be used for designing codes for the distributed largescale matrix multiplication problem. IV. O
RTHO M AT D OT : O RTHONORMAL P OLYNOMIALS BASED C ODES FOR D ISTRIBUTED M ATRIX M ULTIPLICATION
In this section, we present a new orthonormal polynomials based class of codes for matrix-multiplication calledOrthoMatDot. These codes achieve the same recovery threshold as MatDot Codes, and have similar computationalcomplexity as MatDot. The main advantage of the proposed codes is that they avoid dealing with the ill-conditionedmonomial basis used in previous work (e.g., in [2], [3], [5], [26]). In Section V, OrthoMatDot Codes will be specializedand demonstrated to have higher numerical stability as compared with state of the art. We begin with a formal problemformulation in Section IV-A, and describe our codes in Section IV-B.
A. System Model and Problem Formulation1) System Model:
We consider the distributed framework depicted in Fig. 6 that consists of a master node, P workernodes, and a fusion node where the only communication allowed is from the master node to the different worker nodesand from the worker nodes to the fusion node. It can happen that the fusion node and the master node be representedby the same node. In this case, the only communication allowed is the communication between the master node andevery worker node.
2) Problem Formulation:
The master node possesses two real-valued input matrices A , B with dimensions N × N , N × N , respectively. Every worker node receives from the master node an encoded matrix of A of dimension N × N /m and an encoded matrix of B of dimension N /m × N , and performs matrix multiplication of these two received inputs.Upon performing the matrix multiplication, each worker node sends the result to the fusion node. The fusion nodeneeds to recover the matrix multiplication AB once it receives the results of any K worker nodes, where K ≤ P . Inthis case, K is denoted by the recovery threshold of the distributed computing scheme. B. OrthoMatDot Code Construction
Our result regarding the existence of achievable codes solving the distributed matrix multiplication problem usingorthonormal polynomials is stated in the following theorem.
Theorem 4.1:
For the matrix multiplication problem described in Section IV-A2 computed on the system defined in SectionIV-A1, a recovery threshold of m − is achievable using any set of orthonormal polynomials { q i } i ≥ relative to some weightpolynomial w and defined on a range [ a, b ] . Before proving this theorem, we first present OrthoMatDot, a code construction that achieves the recovery thresholdof 2 m − { q i } i ≥ of orthonormal polynomials relative to a weight polynomial w ( x ) and defined on a range [ a, b ]. In our code construction, we assume that matrix A is split vertically into m equal sub-matrices, of dimension N × N /m each, and matrix B is split horizontally into m equal sub-matrices, of dimension N /m × N each, asfollows: A = ( A A . . . A m − ) , B = B B ... B m − , (14) we also define a set of P distinct real numbers x , · · · , x P in the range [ a, b ], and define two encoding polynomials p A ( x ) = (cid:80) m − i =0 A i q i ( x ) and p B ( x ) = (cid:80) m − i =0 B i q i ( x ) , and let p C ( x ) = p A ( x ) p B ( x ).In the following, we briefly describe the OrthoMatDot construction. First, for every r ∈ [ P ], the master node sendsto the r -th worker node evaluations of p A ( x ) , p B ( x ) at x = x r , that is, it sends p A ( x r ) and p B ( x r ) to the r -th workernode. Next, for every r ∈ [ P ], the r -th worker node computes the matrix product p C ( x r ) = p A ( x r ) p B ( x r ) and sendsthe result to the fusion node. Once the fusion node receives the output of any 2 m − p C ( x ) = p A ( x ) p B ( x ), and evaluates p C ( x ) at η , · · · , η m , where η , · · · , η m are the roots of q m . Then, itperforms the summation (cid:80) mr =1 a r p C ( η r ), where a , · · · , a m are as in (5).We formally present OrthoMatDot code in Construction 1. Construction 1 uses the following notation. The outputof the algorithm is the N × N matrix ˆ C . The ( i, j )-th entries of the matrix polynomial p C ( x ) and the matrix ˆ C arerespectively denoted as p ( i,j ) C ( x ) and ˆ C ( i, j ) . The reader may also recall the definition of matrices Q (2 m − ,P ) ( x ) and Q (2 m − ,P ) R ( x ) , for any subset R = { r , · · · , r m − } ⊂ [ P ]. η = ( η , · · · , η m ) is the vector of the roots of q m . Based onConstruction 1, we state the following claim. Construction 1
OrthoMatDot:
Inputs : A , B , Output : ˆ C procedure M ASTER N ODE ( A , B ) (cid:46) The master node’s procedure r ← while r (cid:54) = P + 1 do p A ( x r ) ← (cid:80) m − i =0 A i q i ( x r ) p B ( x r ) ← (cid:80) m − i =0 B i q i ( x r ) send p A ( x r ) , p B ( x r ) to worker node r r ← r + 1 end while end procedure procedure W ORKER N ODE ( p A ( x r ) , p B ( x r ) ) (cid:46) The procedure of worker node r p C ( x r ) ← p A ( x r ) p B ( x r ) send p C ( x r ) to the fusion node end procedure procedure F USION N ODE ( { p C ( x r ) , · · · , p C ( x r m − ) } ) (cid:46) The fusion node’s procedure, r i ’s are distinct Q inv ← (cid:16) Q (2 m − ,P ) R (cid:17) − for i ∈ [ N ] do for j ∈ [ N ] do ( c ( i,j )0 , · · · , c ( i,j )2 m − ) ← ( p ( i,j ) C ( x r ) , · · · , p ( i,j ) C ( x r m − )) Q inv ( p ( i,j ) C ( η ) , · · · , p ( i,j ) C ( η m )) ← ( c ( i,j )0 , · · · , c ( i,j )2 m − ) Q (2 m − ,m ) ( η ) ˆ C ( i, j ) ← ( p ( i,j ) C ( η ) , · · · , p ( i,j ) C ( η m ))( a , · · · , a m ) T (cid:46) a i ’s are as defined in (5) end for end for return ˆ C end procedureClaim 4.2: AB = (cid:80) mr =1 a r p C ( η r ) . The proof of Claim 4.2 is provided in Appendix A.Now, we can prove Theorem 4.1.
Proof of Theorem 4.1:
In order to prove the theorem, it suffices to show that Construction 1 is a valid constructionwith a recovery threshold of 2 m −
1. Therefore, in the following, we prove that Construction 1 can recover AB afterthe fusion node receives the output of at most 2 m − the results of any 2 m − p C ( x ) has degree 2 m −
2, the evaluations of p C ( x ) at any 2 m − x , · · · , x P are distinct,the fusion node can interpolate p C ( x ) once it receives the output of any 2 m − AB = (cid:80) mr =1 a r p C ( η r ) (Claim 4.2), the fusion node can evaluate p C ( η ) , · · · , p C ( η m ) and perform the scaled summation (cid:80) mr =1 a r p C ( η r ) to recover AB . (cid:3) Remark 4.1:
In Construction 1, setting x , · · · , x m to be the roots of q m leads to a faster decoding for the scenarios inwhich the first m worker nodes send their results but only less than m − workers succeed to send their outputs. For suchscenarios, we have (cid:80) mr =1 a r p C ( x r ) = (cid:80) mr =1 a r p C ( η r ) = AB , where the last equality follows from Claim 4.2. Next, we study the computational and communication costs of OrthoMatDot.
1) Complexity Analyses of OrthoMatDot:
Encoding Complexity : Encoding for each worker requires performing two additions, each adding m scaled matricesof size N N /m and N N /m , for an overall encoding complexity for each worker of O ( N N + N N ). Therefore, theoverall computational complexity of encoding for P workers is O ( N N P + N N P ). Computational Cost per Worker : Each worker multiplies two matrices of dimensions N × N /m and N /m × N ,requiring O ( N N N /m ) operations. Decoding Complexity : Since p C ( x ) has degree 2 m −
2, the interpolation of p C ( x ) requires the inversion of a2 m − × m − O ( m ), and performing N N matrix-vector multiplications, each of them isbetween the inverted matrix and a column vector of length 2 m − p C ( x ) at some position ( i, j ) ∈ [ N ] × [ N ], with complexity O ( N N m ). Next, the evaluation of the polynomial p C ( x ) at η , · · · , η m requires a complexity of O ( N N m ). Finally, performing the summation (cid:80) mr =1 a r p C ( η r ) requiresa complexity of O ( N N m ). Thus, assuming that m (cid:28) N , N , the overall decoding complexity is O ( m + 2 N N m + N N m ) = O ( N N m ). Communication Cost : The master node sends O ( N N P/m + N N P/m ) symbols, and the fusion node receives O ( N N m ) symbols from the successful worker nodes. Remark 4.2:
With the reasonable assumption that the dimensions of the input matrices A , B are large enough such that N , N , N (cid:29) m, P , we can conclude that the encoding and decoding costs at the master and fusion nodes, respectively, arenegligible compared to the computation cost at each worker node.V. N UMERICALLY S TABLE C ODES FOR M ATRIX M ULTIPLICATION VIA O RTHO M AT D OT C ODES WITH C HEBYSHEV P OLYNOMIALS
In this section, we specialize OrthoMatDot Codes by restricting the orthonormal polynomials to be Chebyshevpolynomials of the first kind { T i } i ≥ with the evaluation points chosen to be the P -dimensional Chebyshev grid, i.e., x i = ρ ( P ) i , i ∈ [ P ]. Our specialized OrthoMatDot, described in Construction 2 in Section V-A, develops a decoding thatinvolves inversion of a 2 m − × m − m − × P Chebyshev-Vandermonde matrix. One of the maintechnical results of this section (and paper), presented in Theorem 5.1 in Section V-B, is an upper bound to the worstcase condition number over all possible 2 m − × m − m − × P Chebeshev-Vandermondematrix for the case where the distinct evaluation points x , · · · , x P are chosen as the Chebyshev points of degree P , i.e., x i = ρ ( P ) i , i ∈ [ P ]. In fact, the derived bound shows that the worst case condition number grows at most polynomiallyin P at a fixed number of straggler/parity worker nodes. This is in contrast with the monomial basis codes where thecondition number grows exponentially in P , even when there is no redundancy [16]–[19]. We show through numericalexperiments in Section V-C that our proposed codes provide significantly lower numerical errors as compared to MatDotCodes in [3]. A. Chebyshev Polynomials based OrthoMatDot Code Construction
Recalling from Example 3.1 that √ T , T , T , · · · form an orthonormal polynomial set relative to the weight function w ( x ) = π √ − x , in Construction 2, we explain the application of Chebyshev polynomials of the first kind to Construction1. Note that, in Construction 2, we assume that the input matrices A and B are also split as in (14), and let x , x , . . . , x P be distinct real numbers in the range [ − , p A ( x ) , p B ( x ) as p A ( x ) = √ A T ( x ) + (cid:80) m − i =1 A i T i ( x ) and p B ( x ) = √ B T ( x ) + (cid:80) m − i =1 B i T i ( x ) , and let p C ( x ) = p A ( x ) p B ( x ).The idea of our Chebyshev polynomials based OrthoMatDot code is as follows: First, for every r ∈ [ P ], the masternode sends to the r -th worker node p A ( ρ ( P ) r ) and p B ( ρ ( P ) r ). Next, for every r ∈ [ P ], the r -th worker node computes thematrix product p C ( ρ ( P ) r ) = p A ( ρ ( P ) r ) p B ( ρ ( P ) r ) and sends the result to the fusion node. Once the fusion node receivesthe output of any 2 m − p C ( x ). Then, it evaluates p C ( x ) at ρ ( m )1 , · · · , ρ ( m ) m , where ρ ( m ) i ’sare as defined in (6), and computes (cid:80) mi =1 a i p C ( ρ ( m ) i ), where a i = 2 /m, i ∈ [ m ] based on 3) in Remark 3.1. A formal description of our Chebyshev polynomials based OrthoMatDot code is provided in Construction 2. Con-struction 2 uses the following notation. We let the ( i, j )-th entry of the matrix polynomial p C ( x ) be denoted p ( i,j ) C ( x )and written as p ( i,j ) C ( x ) = √ c ( i,j )0 T ( x ) + (cid:80) m − l =1 c ( i,j ) l T l ( x ). Also, following the notation in Section III-B, we define theChebyshev-Vandermonde matrices ˜ G (2 m − ,P ) ( ρ ( P ) ) , and ˜ G (2 m − ,P ) R ( ρ ( P ) ), for any subset R = { r , · · · , r m − } ⊂ [ P ],we also define the matrix ˜ G (2 m − ,m ) ( ρ ( m ) ). Finally, we assume that our construction returns an N × N matrix ˆ C representing the result of the product AB , where the ( i, j )-th entry of ˆ C is ˆ C ( i, j ). Construction 2
Chebyshev Polynomials based OrthoMatDot:
Inputs : A , B , Output : ˆ C procedure M ASTER N ODE ( A , B ) (cid:46) The master node’s procedure r ← while r (cid:54) = P + 1 do p A ( ρ ( P ) r ) ← √ A + (cid:80) m − i =1 A i T i ( ρ ( P ) r ) p B ( ρ ( P ) r ) ← √ B + (cid:80) m − i =1 B i T i ( ρ ( P ) r ) send p A ( ρ ( P ) r ) , p B ( ρ ( P ) r ) to worker node r r ← r + 1 end while end procedure procedure W ORKER N ODE ( p A ( ρ ( P ) r ) , p B ( ρ ( P ) r ) ) (cid:46) The procedure of worker node r p C ( ρ ( P ) r ) ← p A ( ρ ( P ) r ) p B ( ρ ( P ) r ) send p C ( ρ ( P ) r ) to the fusion node end procedure procedure F USION N ODE ( { p C ( ρ ( P ) r ) , · · · , p C ( ρ ( P ) r m − ) } ) (cid:46) The fusion node’s procedure, r i ’s are distinct G inv ← (cid:16) ˜ G (2 m − ,P ) R (cid:17) − for i ∈ [ N ] do for j ∈ [ N ] do ( c ( i,j )0 , · · · , c ( i,j )2 m − ) ← ( p ( i,j ) C ( ρ ( P ) r ) , · · · , p ( i,j ) C ( ρ ( P ) r m − )) G inv ( p ( i,j ) C ( ρ ( m )1 ) , · · · , p ( i,j ) C ( ρ ( m ) m )) ← ( c ( i,j )0 , · · · , c ( i,j )2 m − ) ˜ G (2 m − ,m ) ( ρ ( m ) ) ˆ C ( i, j ) ← m ( p ( i,j ) C ( ρ ( m )1 ) , · · · , p ( i,j ) C ( ρ ( m ) m ))(1 , · · · , T (cid:46) a i ’s are all /m end for end for return ˆ C end procedure
1) Complexity Analyses::
The different encoding complexity, computational complexity per worker, decoding com-plexity and communication cost for Chebyshev polynomials based OrthoMatDot are the same as their counterparts ofOrthoMatDot stated in Section IV-B1.
B. Evaluation Points and Condition Number Bound
When there is no redundancy, i.e., n = 2 m − , it is well known that the n × n decoding matrix G ( n,n ) has conditionnumber n with the (cid:96) as well as the Frobenius norms [17]. Note the remarkable contrast with the Vandermonde matrix,whose condition number for real-valued evaluation points grows exponentially in n , no matter how the nodes are chosen[16], [17]. Our problem differs from the standard problem in numerical methods, since we have to choose a rectangular“generator” matrix where every square sub-matrix is well-conditioned. In particular, even for Chebyshev-Vandermondematrix, if the evaluation points are not chosen carefully, they are poorly conditioned [19] (also see Fig. 8). Here, weshow that choosing x i = ρ ( n ) i leads to a well-conditioned system with s redundant nodes. Our goal is to choose vector x such that κ max ( G ( n − s,n ) ( x )) is sufficiently small, where κ max ( G ( n − s,n ) ( x )) denotes the worst case condition numberover all possible n − s × n − s sub-matrices of G ( n − s,n ) ( x ). Theorem 5.1:
For any s ∈ [ n − , κ maxF ( G ( n − s,n ) ( ρ ( n ) )) = O (cid:16) ( n − s ) (cid:112) ns ( n − s ) (cid:0) n (cid:1) s − (cid:17) , where κ maxF denotes the worst case condition number over all possible n − s × n − s sub-matrices of G ( n − s,n ) ( x ) with respect tothe Frobenius norm, ρ ( n ) = ( ρ ( n )1 , ρ ( n )2 , . . . , ρ ( n ) n ) are the roots of the Chebyshev polynomial T n , i.e., ρ ( n ) i = cos (cid:0) i − n π (cid:1) , i ∈ [ n ] .
12 12.5 13 13.5 14 14.5 1510
28 28.5 29 29.5 3010
39 39.2 39.4 39.6 39.8 4010
49 49.2 49.4 49.6 49.8 5010 Fig. 7. Comparison between the condition number of the interpolating matrix of the Chebyshev polynomials based OrthoMatDot Codes and MatDot Codesin five different distributed systems with , , , , and worker nodes, respectively. Fig. 8. The growth of the condition number, for both Chebyshev polynomials based OrthoMatDot and MatDot Codes, with the system size given a fixednumber of redundant worker nodes.
Since || . || ≤ || . || F , the above bound applies to the standard (cid:96) matrix norm as well. The proof uses techniques fromnumerical methods, and is provided in Appendix B. Remark 5.1:
Although the bound in Theorem 5.1 is derived for G ( n − s,n ) ( ρ ( n ) ) , the theorem also applies for ˜ G ( n − s,n ) ( ρ ( n ) ) .This is because it can be shown using simple matrix operations that for any ˜ G ( n − s,n ) R , for a subset R ⊂ [ n ] such that |R| = n − s , κ F ( ˜ G ( n − s,n ) R ) < √ κ F ( G ( n − s,n ) R ) . C. Numerical Results
The numerical stability of our codes is determined by the condition number of 2 m − × m − G (2 m − ,P ) . The natural comparison is with MatDot Codes where the decoding depends on effectively inverting2 m − × m − M = · · · x x · · · x P ... ... . . . ...x m − x m − · · · x m − P . (15) Based on the result of Theorem 5.1, we choose x i = ρ ( P ) i . In our experiments, we consider systems with various numberof worker nodes, namely, P = 16 , , , , κ max ( G (2 m − ,P ) ) with κ max ( M ). We also compare the -15 -10 -5 -15 -10 -5 Fig. 9. The growth of the relative error, for both Chebyshev polynomials based OrthoMatDot and MatDot Codes, both using Chebyshev points, with thesystem size given a fixed number of redundant worker nodes. average (cid:96) condition number of all 2 m − × m − G (2 m − ,P ) and all 2 m − × m − M . The results, in Fig. 7, show that, for every examined system, the maximum and average condition numbers of the2 m − × m − G (2 m − ,P ) are less than its MatDot Codes counterparts, especially for larger systemswith 60 , , and 100 worker nodes. In fact, for these specific systems, the improvement in the condition number isaround a scaling of 10 .Fig. 8 shows how the maximum/average condition number of the 2 m − × m − G (2 m − ,P ) growswith the size of the distributed system given a fixed number of redundant worker nodes, namely 1 and 3, and compareswith MatDot Codes. The figure shows that while MatDot Codes provide a reasonable condition number ( ∼ ) todistributed systems with size up to only 25 worker nodes, Construction 2 can afford distributed systems with size upto 150 worker nodes for the same condition number bound ∼ .As a reflection to the significant higher stability of Chebyshev polynomials based OrthoMatDot compared to MatDotCodes, Fig. 9 shows that Chebyshev polynomials based OrthoMatDot provides much more accurate outputs comparedto MatDot Codes. For the experiments whose results are shown in Fig. 9, the entries of the input matrices A , B arechosen independently according to the standard Gaussian distribution N (0 , A , B , let ˆ C be the output of the distributed system (which is not necessarily equal to the correct answer AB ), wedefine the relative error between AB and ˆ C to be E r ( AB , ˆ C ) = || AB − ˆ C || F || AB || F . (16) Fig. 9 shows how the maximum relative error (the worst case relative error given a fixed number of parity workers s among all the P − s successful nodes scenarios) grows with the size of the distributed system. In Fig. 9, we plot theaverage result of five different realizations of the system at each system size P . The figure shows that MatDot Codescrushes after the size of the system exceeds 50 workers, providing a relative error of around 10 . On the other hand,our OrthoMatDot construction can support systems with sizes up to 150 worker nodes only allowing for a relative error < − . It is also worth mentioning that in our experiments, we use the MATLAB command inv () [29] for matrixinversion. We have also tried matrices inversion through the Bjork-Pereyra algorithm [30], however, its results weremuch less accurate than inv (), especially for large systems with a number of worker nodes > Remark 5.2:
A main challenge in this work is that we assume operations over the real field. For finite fields, one can alwaysperform arithmetic operations with no errors. Although this fact may motivate a simple solution to the numerical stability ofreal-valued computations by rounding the computation’s inputs to a finite field’s elements and performing computations overthis finite field, such solution has limited applicability, especially for inputs with wide range, due to the following reason.Since performing arithmetic operations over a finite field F n requires representing each element of F n as an element in F n through a bit representation, this solution is applicable in machines with fixed point operations and word sizes of at least n .However, the solution is not applicable in machines with floating point operations since in floating point representation not allthe intermediate values between the minimum and the maximum representable values can be represented, this is a drawback ofthe floating point representation over the fixed point representation, though floating point representation can represent a widerrange of values than fixed point representation for the same word size.VI. O RTHO P OLY : L OW C OMMUNICATION /C OMPUTATION N UMERICALLY S TABLE C ODES FOR D ISTRIBUTED M ARIX M ULTIPLICATION
While MatDot Codes [3] have an optimal recovery threshold of 2 m −
1, they have relatively higher computation costper worker and worker node to fusion node communication cost as compared to Polynomial Codes [2]. In this section,motivated by the condition number bound in Theorem 5.1, we use the idea of using Chebyshev polynomials to provide a numerically stable code construction for matrix multiplication that has the same low communication/computation costsas Polynomial Codes, as well as the same recovery threshold. However, as will be shown in this section, our proposedcodes, denoted by OrthoPoly , provides lower numerical errors than Polynomial Codes. In this section, we follow thesame system model as in Section IV-A1, and solve the problem statement formulated in Section VI-A. We provide amotivating example in Section VI-B, then we provide the general code construction in Section VI-C. Finally, in SectionVI-D, we show experimentally that OrthoPoly Codes achieve lower numerical errors as compared to Polynomial Codes.
A. Problem Formulation
The master node possesses two real-valued input matrices A , B with dimensions N × N , N × N , respectively.Every worker node receives from the master node an encoded matrix of A of dimension N /m × N and an encodedmatrix of B of dimension N × N /n , and performs matrix multiplication of these two received inputs. Upon performingthe matrix multiplication, each worker node sends the result to the fusion node. The fusion node needs to recover thematrix multiplication AB once it receives the results of any mn worker nodes. B. Example ( m = n = 3)Consider computing the matrix multiplication AB , for some two real matrices A , B of dimensions N × N and N × N , respectively, over a distributed system of P ≥ A of dimension N / × N , and an encoded matrix of B of dimension N × N / AB can be recovered by the fusion node given the results of any 9 worker nodes.A solution can be as follows: First, matrices A , B can be partitioned as A = A A A , B = (cid:0) B B B (cid:1) , (17) where, for any i ∈ { , , } , A i has dimension N / × N , and B i has dimension N × N /
3. Next, let p A ( x ) = A T ( x ) + A T ( x ) + A T ( x ) ,p B ( x ) = B T ( x ) + B T ( x ) + B T ( x ) . Now, p A ( x ) p B ( x ) can be written as p A ( x ) p B ( x ) = (cid:0) A T ( x ) + A T ( x ) + A T ( x ) (cid:1)(cid:0) B T ( x ) + B T ( x ) + B T ( x ) (cid:1) = A B + (cid:0) A B + 12 A B (cid:1) T ( x ) + (cid:0) A B + 12 A B (cid:1) T ( x )+ A B T ( x ) + 12 (cid:0) A B + A B (cid:1) T ( x ) + 12 (cid:0) A B + A B (cid:1) T ( x )+ A B T ( x ) + 12 A B T ( x ) + 12 A B T ( x ) (18) Since p A ( x ) p B ( x ) is a degree 8 polynomial, once the fusion node receives the output of any 9 workers, it can interpolate p A ( x ) p B ( x ), i.e., obtain its matrix coefficients, let such matrix coefficients be C T , · · · , C T . Specifically, for any i ∈{ , · · · , } , let C T i be the matrix coefficient of T i in p A ( x ) p B ( x ). Now, recalling (17), the product AB can be writtenas AB = A B A B A B A B A B A B A B A B A B . (19) While the obtained set of matrix coefficients { C T i : i ∈ { , · · · , }} is not equal to { A i B j : i, j ∈ { , , }} , C T i ’s arelinear combinations of A i B j ’s. Specifically, for any C T i , i ∈ { , · · · , } , let C ( k,l ) T i be its ( k, l )-th entry, and, for any i, j ∈ { , , } , let ( A i B j ) ( k,l ) be the ( k, l )-th entry of the product A i B j , we can write C ( k,l ) T C ( k,l ) T C ( k,l ) T C ( k,l ) T C ( k,l ) T C ( k,l ) T C ( k,l ) T C ( k,l ) T C ( k,l ) T = / / / /
20 0 0 0 0 1 / / / / ( A B ) ( k,l ) ( A B ) ( k,l ) ( A B ) ( k,l ) ( A B ) ( k,l ) ( A B ) ( k,l ) ( A B ) ( k,l ) ( A B ) ( k,l ) ( A B ) ( k,l ) ( A B ) ( k,l ) , (20) for any ( k, l ) ∈ [ N / × [ N / A i B j , i, j ∈ { , , } can be obtained by computing ( A B ) ( k,l ) ( A B ) ( k,l ) ( A B ) ( k,l ) ( A B ) ( k,l ) ( A B ) ( k,l ) ( A B ) ( k,l ) ( A B ) ( k,l ) ( A B ) ( k,l ) ( A B ) ( k,l ) = / / / /
20 0 0 0 0 1 / / / / − C ( k,l ) T C ( k,l ) T C ( k,l ) T C ( k,l ) T C ( k,l ) T C ( k,l ) T C ( k,l ) T C ( k,l ) T C ( k,l ) T , (21) for all ( k, l ) ∈ [ N / × [ N / C. OrthoPoly Code Construction
We assume that matrix A is split horizontally into m equal sub-matrices, of dimension N /m × N each, and matrix B is split vertically into n equal sub-matrices, of dimension N × N /n each, as follows: A = A A ... A m − , B = ( B B . . . B n − ) , (22) and define two encoding polynomials p A ( x ) = (cid:80) m − i =0 A i T i ( x ) and p B ( x ) = (cid:80) n − i =0 B i T im ( x ) , and let p C ( x ) = p A ( x ) p B ( x ).We describe, next, the idea of the general code construction. First, for all r ∈ [ P ], the master node sends to the r -thworker evaluations of p A ( x ) and p B ( x ) at x = ρ ( P ) r , that is, it sends p A ( ρ ( P ) r ) and p B ( ρ ( P ) r ) to the r -th worker. Next,for every r ∈ [ P ], the r -th worker node computes the matrix product p C ( ρ ( P ) r ) = p A ( ρ ( P ) r ) p B ( ρ ( P ) r ) and sends the resultto the fusion node. Once the fusion node receives the output of any mn worker nodes, it interpolates p C ( x ). Next,the fusion node recovers the products A i B j , i ∈ { , · · · , m − } , j ∈ { , · · · , n − } , from the matrix coefficients of p C ( x ) using a low complexity matrix-vector multiplication, specified later in Construction 3. We formally present ourOrthoPoly Codes in Construction 3. In the following, we explain the notation used in Construction 3. The output ofthe algorithm is the N × N matrix ˆ C , where the ( k, l )-th block of ˆ C is the N /m × N /n matrix ˆ C k,l , and the ( i, j )-thentry of any matrix ˆ C k,l is ˆ c ( i,j ) k,l . The ( i, j )-th entry of the matrix polynomial p C ( x ) is denoted as p ( i,j ) C ( x ), and SectionIII-B defines matrices G ( mn,P ) ( ρ ( P ) ) and G ( mn,P ) R ( ρ ( P ) ), for any subset R = { r , · · · , r mn } ⊂ [ P ]. In addition, H is an mn × mn matrix of the following form H = (cid:0) H H · · · H n − (cid:1) , where H is an mn × m matrix with ones on the main diagonal and zeros elsewhere, and for any i ∈ { , · · · , n − } , H i is an mn × m matrix of the following structure H i = · · · ... ... ... ... ... · · · ... ... ... . . . /
20 0 0 . . .
00 0 1 / . . . ... / · · ·
01 0 0 · · ·
00 1 / · · ·
00 0 1 / . . . ... . . . ... ... ... . . . /
20 0 0 · · · ... ... ... ... ... · · · , where the value 1 in the first column is at the ( im + 1)-th row of H i . Construction 3
OrthoPoly:
Inputs : A , B , Output : ˆ C procedure M ASTER N ODE ( A , B ) (cid:46) The master node’s procedure r ← while r (cid:54) = P + 1 do p A ( ρ ( P ) r ) ← (cid:80) m − i =0 A i T i ( ρ ( P ) r ) p B ( ρ ( P ) r ) ← (cid:80) n − i =0 B i T im ( ρ ( P ) r ) send p A ( ρ ( P ) r ) , p B ( ρ ( P ) r ) to worker node r r ← r + 1 end while end procedure procedure W ORKER N ODE ( p A ( ρ ( P ) r ) , p B ( ρ ( P ) r ) ) (cid:46) The procedure of worker node r p C ( ρ ( P ) r ) ← p A ( ρ ( P ) r ) p B ( ρ ( P ) r ) send p C ( ρ ( P ) r ) to the fusion node end procedure procedure F USION N ODE ( { p C ( ρ ( P ) r ) , · · · , p C ( ρ ( P ) r mn ) } ) (cid:46) The fusion node’s procedure, r i ’s are distinct G inv ← (cid:16) G ( mn,P ) R (cid:17) − for i ∈ [ N /m ] do for j ∈ [ N /n ] do ( c ( i,j )0 , · · · , c ( i,j ) mn − ) ← ( p ( i,j ) C ( ρ ( P ) r ) , · · · , p ( i,j ) C ( ρ ( P ) r mn )) G inv (ˆ c ( i,j )0 , · · · ˆ c ( i,j ) m − , · · · · · · ˆ c ( i,j )0 ,n − · · · ˆ c ( i,j ) m − ,n − ) ← ( c ( i,j )0 , · · · , c ( i,j ) mn − )( H − ) T end for end for return ˆ C end procedure
1) Complexity Analyses of OrthoPoly:
Encoding Complexity : Encoding for each worker requires performing two additions, the first one adds m scaledmatrices of size N N /m and the other adds n scaled matrices of size N N /n , for an overall encoding complexityfor each worker of O ( N N + N N ). Therefore, the overall computational complexity of encoding for P workers is O ( N N P + N N P ). -15 -10 -5 -15 -10 -5 Fig. 10. The growth of the relative error, for both OrthoPoly and Polynomial Codes, both using Chebyshev points, with the system size given a fixed numberof redundant worker nodes.
Computational Cost per Worker : Each worker multiplies two matrices of dimensions N /m × N and N × N /n ,requiring O ( N N N /mn ) operations. Decoding Complexity : Since p A ( x ) p B ( x ) has degree mn −
1, the interpolation of p C ( x ) requires the inversion ofa mn × mn matrix, with complexity O ( m n ), and performing N N /mn matrix-vector multiplications, each of themis between the inverted matrix and a column vector of length mn of the received evaluations of the matrix polynomial p C ( x ) at some position ( i, j ) ∈ [ N /m ] × [ N /n ], with complexity O ( N N m n / ( mn )) = O ( N N mn ). Thus, assumingthat mn (cid:28) N , N , the overall decoding complexity is O ( N N mn ). Communication Cost : The master node sends O ( N N P/m + N N P/n ) symbols, and the fusion node receives O ( N N ) symbols from the successful worker nodes. Remark 6.1:
With the reasonable assumption that the dimensions of the input matrices A , B are large enough such that N , N , N (cid:29) m, n, P , we can conclude that the encoding and decoding costs at the master and fusion nodes, respectively,are negligible compared to the computation cost at each worker node. D. Numerical Results
In our experiments, the entries of the input matrices A , B are chosen independently according to the standardGaussian distribution N (0 , A , B , let ˆ C be the output of the distributedsystem, we define the relative error between AB and ˆ C to be E r ( AB , ˆ C ) = || AB − ˆ C || F || AB || F . Fig. 10 shows how the maximum relative error (the worst case relative error given a fixed number of parity workers s among all the P − s successful nodes scenarios) grows with the size of the distributed system for both Construction 3and Polynomial Codes. In Fig. 10, we plot the average result of five different realizations of the system at each systemsize P . The figure shows that Polynomial Codes have unacceptable relative errors after the size of the system exceeds50 workers, providing a relative error of around 10 . On the other hand, OrthoPoly can support systems with sizes upto 170 worker nodes only allowing for a relative error < − . VII. G
ENERALIZED O RTHO M AT D OT : N UMERICALLY STABLE C ODES FOR M ATRIX M ULTIPLICATION WITH C OMMUNICATION /C OMPUTATION -R ECOVERY T HRESHOLD T RADE - OFF
Although MatDot Codes [3] have a low recovery threshold of 2 m − mn , MatDot Codes’ worker to fusion nodes communication cost and computation costper worker are higher than Polynomial Codes. Codes proposed in [4], [5], [26] offer a trade-off between the commu-nication/computation cost and the recovery threshold. However, all of these codes are based on the “ill-conditioned”monomial basis. In this section, we offer a numerically stable code construction, denoted by Generalized OrthoMatDot ,that offers a trade-off between communication/computation costs and recovery threshold. Our construction incurs ahigher recovery threshold than the codes of [5], [26] by a factor of at most 4 for the same communication/computationcost. We provide in Section VII-A the formal problem statement considered in this section. We describe an example ofour construction in Section VII-B, provide the general code construction in Section VII-C, and describe our numericalexperiments in Section VII-D. A. System Model and Problem Formulation
We consider the same system model and problem formulation as in Section IV-A with the following change: Weassume that the master node is allowed to send an encoded 1 /m fraction of matrix A , and an encoded 1 /n fraction ofmatrix B , where m and n are not necessarily equal, and A and B are split as follows A = A , · · · A ,m − ... . . . ... A m − , · · · A m − ,m − , B = B , · · · B ,m − ... . . . ... B m − , · · · B m − ,m − , (23) where m , m , m divide N , N , N , respectively, and m = m m , n = m m . In addition, we assume that each workernode receives a linear combination of sub-matrices A i,j , and another linear combination of sub-matrices B i,j . Remark 7.1:
Although, in this section, we offer Generalized OrthoMatDot, a code construction with lower conditionnumbers than codes in [5], [26], the recovery threshold of our codes are higher by a factor of at most than the codes of thesereferences. Specifically, Generalized OrthoMatDot codes have a recovery threshold of m m m − m m + m m + m m )+ m + 2 m + m − while both codes in [5], [26] have a recovery threshold of m m m + m − . This increased recoverythreshold is due to the fact that Generalized OrthoMatDot Codes are based on Chebyshev polynomials which have the followingproperty: For any i, j ∈ N , T i ( x ) T j ( x ) = 1 / T i + j ( x ) + T | i − j | ( x )) . This property allows for a higher number of undesired terms in the multiplication of the encoding polynomials p A ( x ) , p B ( x ) . In order to avoid combining undesired and desired termsat the same degree, higher degree Chebyshev polynomials have to be used in p B ( x ) , yielding a higher recovery threshold. Itis still an open question whether the recovery threshold in [5], [26] can be achieved using orthonormal polynomials. B. Example ( m = m = m = 2)Consider computing the matrix multiplication AB , for some two real matrices A , B of dimensions N × N and N × N , respectively, over a distributed system of P ≥
15 workers such that:1) Each worker receives an encoded matrix of A of dimension N / × N /
2, and an encoded matrix of B of dimension N / × N / AB can be recovered by the fusion node given the results of any 15 worker nodes.A solution can be as follows: First, matrices A , B can be partitioned as A = (cid:18) A , A , A , A , (cid:19) , B = (cid:18) B , B , B , A , (cid:19) , (24) where, for i, j ∈ { , } , A i,j has dimension N / × N /
2, and B i,j has dimension N / × N /
2. Next, let p A ( x ) = A , T ( x ) + 12 A , T ( x ) + A , T α +1 ( x ) + A , T α ( x ) ,p B ( x ) = 12 B , T ( x ) + B , T ( x ) + B , T β ( x ) + B , T β +1 ( x ) , where α, β to be specified next, and define P distinct real numbers x , x , · · · , x P in the range [ − , r ∈ [ P ], the master node sends p A ( x r ) p B ( x r ).Now, in order to specify the best values for α, β , we expand the polynomial p A ( x ) p B ( x ) in the Chebyshev basis, andthen point out some observations. p A ( x ) p B ( x ) = 14 A , B , + 12 A , B , T ( x ) + 12 A , B , T ( x ) + A , B , T ( x ) T ( x )+ 12 A , B , T α ( x ) + A , B , T ( x ) T α ( x ) + 12 A , B , T α +1 ( x ) + A , B , T ( x ) T α +1 ( x )+ 12 A , B , T β ( x ) + A , B , T ( x ) T β ( x ) + 12 A , B , T β +1 ( x ) + A , B , T ( x ) T β +1 ( x )+ A , B , T ( x ) T β +1 ( x ) + A , B , T α +1 ( x ) T β ( x ) + A , B , T α ( x ) T β +1 ( x )+ A , B , T α +1 ( x ) T β +1 ( x ) . (25) Using the property of the Chebyshev polynomials that for any i, j ∈ N , T i ( x ) T j ( x ) = 1 / T i + j ( x ) + T | i − j | ( x )), (25)can be rewritten as p A ( x ) p B ( x ) = 14 A , B , + 12 A , B , + 12 ( A , B , + A , B , ) T ( x ) + 12 A , B , T ( x ) + 12 A , B , T α − ( x ) + 12 ( A , B , + A , B , ) T α ( x ) + 12 ( A , B , + A , B , ) T α +1 ( x )+ 12 A , B , T α +2 ( x ) + 12 A , B , T β − α − ( x ) + 12 ( A , B , + A , B , ) T β − α ( x )+ 12 A , B , T β − α +1 ( x ) + 12 A , B , T β − ( x ) + 12 ( A , B , + A , B , ) T β ( x )+ 12 ( A , B , + A , B , ) T β +1 ( x ) + 12 A , B , T β +2 ( x ) + 12 A , B , T β + α ( x )+ 12 ( A , B , + A , B , ) T β + α +1 ( x ) + 12 A , B , T β + α +2 ( x ) . (26) Now, note the following regrading p A ( x ) p B ( x ) in (26):(i) ( A , B , + A , B , ) is the coefficient of T ( x ),(ii) ( A , B , + A , B , ) is the coefficient of T α +1 ( x ),(iii) ( A , B , + A , B , ) is the coefficient of T β +1 ( x ),(iv) ( A , B , + A , B , ) is the coefficient of T β + α +1 ( x ).Since p A ( x ) p B ( x ) has degree β + α + 2, and this polynomial is evaluated at distinct value at each worker node,once the fusion node receives the output of any β + α + 3 worker nodes, it can interpolate p A ( x ) p B ( x ) and extractthe product AB (i.e., the matrix coefficients of T ( x ) , T α +1 ( x ), T β +1 ( x ) , T β + α +1 ( x )). Now, we aim for picking valuesfor α, β such that the degree of p A ( x ) p B ( x ) is minimal; and hence, the recovery threshold is minimal as well. Theseminimal values for α, β must be chosen such that the desired coefficients in (i)-(iv) are separate. That is, each of themis neither combined with another desired nor undesired term. This constraint leads to the following two inequalities: α − > , and α + 1 < β − α − , which implies that α = 3 , β = 9. Next, we provide our general code construction for the Generalized OrthoMatDotCodes. C. Generalized OrthoMatDot Code Construction
Theorem 7.1:
For the matrix multiplication problem described in Section VII-A computed on the system defined in SectionIV-A1, there exists a coding strategy with recovery threshold m m m − m m + m m + m m )+ m + 2 m + m − . (27) Notice that the problem specified in Section VII-A restricts the output matrix of each worker node to be of dimension N /m × N /m , for some positive integers m , m that divide N , N , respectively. This is smaller than the dimensionsof the output matrix of each worker node according to the problem specified in Section IV-A2 (i.e., N × N ) by afactor of m m . However, according to Theorem 7.1, this communication advantage, when m > m >
1, comesat the expense of a higher recovery threshold compared to OrthoMatDot Codes.
Remark 7.2 (Notation):
For ease of exposition in the remaining of this section, we use T (cid:48) , T (cid:48) , T (cid:48) , · · · to denote T , T , T , · · · ,respectively. In order to prove Theorem 7.1, we first present a code construction that achieves the recovery threshold in (27),then we prove that the presented code construction is valid. First, note that in the Generalized OrthoMatDot codeconstruction, we assume that the two input matrices A , B are split as in (23). Also, note that given this partitioningof input matrices, we can write C = AB , where C is written as C = C , · · · C ,m − ... . . . ... C m − , · · · C m − ,m − , (28) and each of C i,l has dimension N /m × N /m and can be expressed as C i,l = (cid:80) m − j =0 A i,j B j,l , for any i ∈{ , , · · · , m − } , and l ∈ { , , · · · , m − } . Also, let x , · · · , x P be distinct real numbers in the range [ − , p A ( x ) = m − (cid:88) i =0 m − (cid:88) j =0 A i,j T (cid:48) m − − j + i (2 m − ( x ) ,p B ( x ) = m − (cid:88) k =0 m − (cid:88) l =0 B k,l T (cid:48) k + l (2 m − m − ( x ) , (29) and let p C ( x ) = p A ( x ) p B ( x ). Notice that p C ( x ) is a polynomial matrix of degree equals deg C := 4 m m m − m m + m m + m m ) + m + 2 m + m − Claim 7.2:
For any i ∈ { , , · · · , m − } and l ∈ { , , · · · , m − } , C i,l is the matrix coefficient of T m − i (2 m − l (2 m − m − in p C ( x ) , The proof of this claim is in Appendix C.We describe, next, the idea of our proposed Generalized OrthoMatDot code construction. First, for all r ∈ [ P ], themaster node sends to the r -th worker evaluations of p A ( x ) and p B ( x ) at x = ρ ( P ) r , that is, it sends p A ( ρ ( P ) r ) and p B ( ρ ( P ) r ) to the r -th worker. Next, for every r ∈ [ P ], the r -th worker node computes the matrix product p C ( ρ ( P ) r ) = p A ( ρ ( P ) r ) p B ( ρ ( P ) r ) and sends the result to the fusion node. Once the fusion node receives the output of any deg C +1worker nodes, it interpolates p C ( x ).We formally present our Generalized OrthoMatDot code construction in Construction 4. In the following, we explainthe notation used in Construction 4. The output of the algorithm is the N × N matrix ˆ C , where the ( k, l )-th block ofˆ C is the N /m × N /m matrix ˆ C k,l , and the ( i, j )-th entry of any matrix ˆ C k,l is ˆ c ( i,j ) k,l . The ( i, j )-th entry of the matrixpolynomial p C ( x ) is denoted as p ( i,j ) C ( x ), and Section III-B defines matrices G (deg C +1 ,P ) ( ρ ( P ) ) and G (deg C +1 ,P ) R ( ρ ( P ) ),for any subset R = { r , · · · , r deg C +1 } ⊂ [ P ]. Construction 4
Generalized OrthoMatDot:
Inputs : A , B , Output : ˆ C procedure M ASTER N ODE ( A , B ) (cid:46) The master node’s procedure r ← while r (cid:54) = P + 1 do p A ( ρ ( P ) r ) ← (cid:80) m − i =0 (cid:80) m − j =0 A i,j T (cid:48) m − − j + i (2 m − ( ρ ( P ) r ) p B ( ρ ( P ) r ) ← (cid:80) m − k =0 (cid:80) m − l =0 B k,l T (cid:48) k + l (2 m − m − ( ρ ( P ) r ) send p A ( ρ ( P ) r ) , p B ( ρ ( P ) r ) to worker node r r ← r + 1 end while end procedure procedure W ORKER N ODE ( p A ( ρ ( P ) r ) , p B ( ρ ( P ) r ) ) (cid:46) The procedure of worker node r p C ( ρ ( P ) r ) ← p A ( ρ ( P ) r ) p B ( ρ ( P ) r ) send p C ( ρ ( P ) r ) to the fusion node end procedure procedure F USION N ODE ( { p C ( ρ ( P ) r ) , · · · , p C ( ρ ( P ) r deg C +1 ) } ) (cid:46) The fusion node’s procedure, r i ’s are distinct G inv ← (cid:16) G (deg C +1 ,P ) R (cid:17) − for i ∈ [ N /m ] do for j ∈ [ N /m ] do ( c ( i,j )0 , · · · , c ( i,j )deg C ) ← ( p ( i,j ) C ( ρ ( P ) r ) , · · · , p ( i,j ) C ( ρ ( P ) r deg C +1 )) G inv for k ∈ [ m ] do for l ∈ [ m ] do ˆ c ( i,j ) k,l ← c ( i,j ) m − k − m − l − m − m − end for end for end for end for return ˆ C end procedure Now, we prove Theorem 7.1.
Proof of Theorem 7.1:
To prove the theorem, it suffices to prove that Construction 4 is valid. Noting that p A ( x ) p B ( x )has degree 4 m m m − m m + m m + m m ) + m + 2 m + m − p A ( x ) p B ( x ) at a distinct point, once the fusion node receives the output of any 4 m m m − m m + m m + m m ) + m + 2 m + m − p A ( x ) p B ( x ) (i.e., obtain all its matrix coefficients). Thisincludes the coefficients of T m − i (2 m − l (2 m − m − for all i ∈ { , , · · · , m − } , and l ∈ { , , · · · , m − } ,i.e., C i,l , for all i ∈ { , , · · · , m − } , and l ∈ { , , · · · , m − } (Claim 7.2), which completes the proof. (cid:3) Fig. 11. Comparison between the condition number of the interpolating matrix of the Generalized OrthoMatDot Codes and the monomial-based codes [5],[26] in two distributed systems, one with worker nodes and the other with worker nodes, at different partitioning factors m . Next, we provide the different complexity analyses of the Generalized OrthoMatDot Codes.
1) Complexity Analyses of Generalized OrthoMatDot:
Encoding Complexity : Encoding for each worker requires performing two additions, the first one adds m m scaled matrices of size N N / ( m m ) and the other adds m m scaled matrices of size N N / ( m m ), for an overallencoding complexity for each worker of O ( N N + N N ). Therefore, the overall computational complexity of encodingfor P workers is O ( N N P + N N P ). Computational Cost per Worker : Each worker multiplies two matrices of dimensions N /m × N /m and N /m × N /m , requiring O ( N N N / ( m m m )) operations. Decoding Complexity : Since p A ( x ) p B ( x ) has degree k − m m m − m m + m m + m m ) + m + 2 m + m −
2, the interpolation of p C ( x ) requires the inversion of a k × k matrix, with complexity O ( k ) = O ( m m m ), andperforming N N / ( m m ) matrix-vector multiplications, each of them is between the inverted matrix and a columnvector of length k of the received evaluations of the matrix polynomial p C ( x ) at some position ( i, j ) ∈ [ N /m ] × [ N /m ], with complexity O ( N N k / ( m m )) = O ( N N m m m ). Thus, assuming that m , m (cid:28) N , N , theoverall decoding complexity is O ( N N m m m ) = O ( N N mn ). Communication Cost : The master node sends O ( N N P/ ( m m ) + N N P/ ( m m )) symbols, and the fusionnode receives O ( N N m ) symbols from the successful worker nodes. Remark 7.3:
With the reasonable assumption that the dimensions of the input matrices A , B are large enough suchthat N , N , N (cid:29) m , m , m , P , we can conclude that the encoding and decoding costs at the master and fusion nodes,respectively, are negligible compared to the computation cost at each worker node. D. Numerical Results
In our experiments on Construction 4, we considered distributed systems with P = 16 ,
25 worker nodes. Fig. 11 showsthat, for every examined system, the condition number of the interpolation matrix using the Generalized OrthoMatDotCodes is less than its counterpart codes in [5], [26]. The results in Fig. 11 also show that, for the same system, as thepartitioning factor m decreases (i.e., as the redundancy in worker nodes increases), the stability of the GeneralizedOrthoMatDot code construction decreases; however, it is still better than the monomial-basis based codes in any cases. VIII. N
UMERICALLY S TABLE L AGRANGE C ODED C OMPUTING
In this section, we study the numerical stability of Lagrange coded computing [12] that lifts coded computing beyondmatrix-vector and matrix-matrix multiplications, to multi-variate polynomial computations. As shown in [12], Lagrangecoded computing has applications in gradient coding, privacy and secrecy. Our main contribution here is to develop anumerically stable approach towards Lagrange coded computing inspired by our result of Theorem 5.1. In particular, ourcontribution involves (a) careful choice of evaluation points, and (b) a careful decoding algorithm that involves inversionof the appropriate Chebyshev Vandermonde matrix. We describe the system model in Section VIII-A. We overview theLagrange coded computing technique of [12] in Section VIII-B. We describe our numerically stable approach in SectionVIII-C, and present the results of our numerical experiments in Section VIII-D.
A. System Model and Problem Formulation
We consider, for this section, the distributed computing framework depicted in Fig. 12, that is used in [12] andconsists of a master node, P worker nodes, and a fusion node where the only communication allowed is from the masternode to the different worker nodes and from the worker nodes to the fusion node. The worker nodes have a prior Fig. 12. The Lagrange coded computing system framework knowledge of a polynomial function of interest f : R d → R v of degree deg( f ), where d, v ∈ N + . In addition, the masternode possesses a set of data points X = { X , · · · , X m } , where X i ∈ R d , i ∈ [ m ]. For every worker node i ∈ [ P ], themaster node is allowed to send some encoded vector ˜ X i ( X , · · · , X m ) ∈ R d . Once a worker node receives the encodedvector on its input, it evaluates f at this encoded vector and sends the evaluation to the fusion node. That is, for i ∈ [ P ], worker node i receives ˜ X i on its input, evaluates f ( ˜ X i ), then it sends the result to the fusion node. Finally, thefusion node is expected to numerically stably decode the set of evaluations F = { f ( X ) , · · · , f ( X m ) } after it receivesthe output of any K worker nodes. B. Background on Lagrange Coded Computing
In this section, we review the baseline Lagrange coded computing method introduced in [12] considering the frameworkin Section VIII-A. Notice that although the method in [12] is more general, here, for simplicity, we limit our discussionto the systematic
Lagrange coded computing. That is, we assume that for i ∈ [ m ], worker node i receives the i -th datapoint from the master node. In other words, we assume that ˜ X i = X i , i ∈ [ m ]. Now, the encoding procedure goes asfollows: First, let x , · · · , x P be distinct real values, an encoding function g ( x ) is defined as: g ( x ) = m (cid:88) i =1 X i (cid:89) j ∈ [ m ] − i x − x j x i − x j . (30) Given this encoding function, the master node sends the encoded vector ˜ X i = g ( x i ) to the worker node i , for every i ∈ [ P ]. Notice that the encoding function g ( x ) indeed leads to a systematic encoding since ˜ X i = g ( x i ) = X i , for all i ∈ [ m ]. Every worker node i computes f ( ˜ X i ) upon the reception of ˜ X i , and sends the result to the fusion node. Thefusion node waits till receiving the output of any K := ( m −
1) deg( f ) + 1. Since f ( g ( x )) has degree ( m −
1) deg( f ) in x , the fusion node is able to interpolate f ( g ( x )) after receiving the outputs of any ( m −
1) deg( f ) + 1, i.e., K , workernodes. Since g ( x i ) = X i , i ∈ [ m ], the fusion nodes evaluates { f ( g ( x )) , · · · , f ( g ( x m )) } to obtain { f ( X ) , · · · , f ( X m ) } . C. Numerically Stable Lagrange Coded Computing
Lagrange coded computing requires performing an interpolation at the fusion node to recover the polynomial f ( g ( x )).Performing the interpolation by obtaining the coefficients of the polynomial in a monomial basis requires inverting asquare Vandermonde matrix which is numerically unstable. Noting that the first (cid:96) Cheybshev polynomials also formsa basis for degree (cid:96) − f ( g ( x )) in the basis of Chebyshev polynomials. Thereby, our decoding procedure involvesinverting the Chebyshev-Vandermonde matrix . Guided by Theorem 5.1, we choose the evaluation points to be the P -point Chebyshev grid ρ ( P ) to obtain a decoding procedure that is more stable than one that uses the monomialbasis.Our numerically stable algorithm for Lagrange coded computing is formally described in Construction 5. In thefollowing, we explain the notation used in Construction 5. We let the polynomial at the i -th entry of f ( g ( x )) bedenoted f ( i ) ( x ) and written as f ( i ) ( x ) = (cid:80) K − l =0 c ( i ) l T l ( x ). Following the notation in Section III-B, we use the Chebyshev-Vandermonde matrices G ( K,P ) ( ρ ( P ) ), and G ( K,P ) R ( ρ ( P ) ), for any subset R = { r , · · · , r K } ⊂ [ P ], we also define the Since both systematic and non-systematic Lagrange coded computing require the inversion of the same Chebyshev-Vandermonde matrix, our numericallystable decoding procedure in Construction 5 naturally extends to non-systematic Lagrange coded computing, with the only difference is in the last stepof evaluating f ( g ( x )) at x , · · · , x m , where in the non-systematic case, f ( g ( x )) is instead evaluated at some predefined values y , · · · , y m such that g ( y i ) = X i for all i ∈ [ m ] . matrix G ( K,P )[ m ] ( ρ ( P ) ). Finally, we assume that our construction returns as output the set of evaluations ˆ F = { ˆ f ( X ) , · · · , ˆ f ( X m ) } , where for each ˆ f ( X i ) , i ∈ [ m ], we have ˆ f ( X i ) = ( ˆ f (1) ( x i ) , · · · , ˆ f ( v ) ( x i )), where for every i ∈ [ m ] , j ∈ [ v ] , ˆ f ( j ) ( x i ) and f ( j ) ( x i ) would be the same if the machine had infinite precision.In the following, we show through numerical experiments the stability of our proposed Construction 5. Construction 5
Numerically Stable Lagrange Coded Computing
Inputs : f, X = { X , · · · , X m } , Output : ˆ F = { ˆ f ( X ) , · · · , ˆ f ( X m ) } procedure M ASTER N ODE ( X ) (cid:46) The master node’s procedure r ← while r (cid:54) = P + 1 do if r ∈ [ m ] then ˜ X r ← X i else ˜ X r ← (cid:80) mi =1 X i (cid:81) j ∈ [ m ] − i ρ ( P ) r − ρ ( P ) j ρ ( P ) i − ρ ( P ) j end if send ˜ X r to worker node r r ← r + 1 end while end procedure procedure W ORKER N ODE ( f, ˜ X r ) (cid:46) The procedure of worker node r Out r ← f ( ˜ X r ) send Out r to the fusion node end procedure procedure F USION N ODE (Out r , · · · , Out r K ) (cid:46) The fusion node’s procedure, r i ’s are distinct G inv ← (cid:16) G ( K,P ) R (cid:17) − for i ∈ [ v ] do ( c ( i )0 , · · · , c ( i ) K − ) ← ( Out ( i ) r , · · · , Out ( i ) r K ) G inv ( ˆ f ( i ) ( x ) , · · · , ˆ f ( i ) ( x m )) ← ( c ( i )0 , · · · , c ( i ) K − ) G ( K,P )[ m ] end for return ˆ F end procedure D. Numerical Results
In our experiments, we assume that we have a distributed system of P worker nodes, m = P − X , · · · , X m , each of them is of dimension d = 10, where each entry of every input vector is picked independently,according to the standard Gaussian distribution N (0 , f ( X ) = Y T X , where Y is some d -dimensional vector with entries picked independently according to the standard Gaussian distribution N (0 , f = ( ˆ f ( X ) · · · ˆ f ( X m )) be the system’soutput vector, and f = ( f ( X ) · · · f ( X m )) be the correct output vector, we define the relative error between f and ˆ f tobe E r ( f , ˆ f ) = || f − ˆ f || || f || . (31) The results, shown in Fig. 13, illustrates that using the Chebyshev basis for interpolation provides less relativeerror/higher stability than the monomial basis at every system size. Fig. 13 also shows that under a certain relativeerror constraint, Construction 5 provides higher scalability than the monomial basis case. Specifically, let us assumethat a relative error up to 0 . IX. C
ONCLUDING R EMARKS
In this paper, we develop numerically stable codes for matrix-matrix multiplication and Lagrange coded computing.A distinctive character of our work is the infusion of principles of numerical approximation theory into coded computing Fig. 13. The growth of the relative error, for Construction 5, using both Chebyshev basis interpolation and monomial basis interpolation, both using Chebyshevpoints, with the system size given a fixed number of redundant worker nodes equals 2. towards the end goal of numerical stability. In particular, our work is marked by the use of orthogonal polynomialsfor encoding, Gauss quadrature techniques for decoding and new bounds on the condition number of ChebyshevVandermonde matrices. Notably, our constructions obtain the same recovery threshold as MatDot Codes and PolynomialCodes for matrix multiplication as well as for Lagrange Coded Computing. However, our construction in Section VIIobtains a weaker (higher) recovery threshold than previous constructions [5], [26] for the problem of coded matrixmultiplication when the computation/communication cost is constrained to be lower than that of MatDot Codes. Thesearch of numerically stable codes for this application with the same recovery threshold as [5], [26] remains open.While our paper focuses on applications where polynomial based encoding are particularly useful, our results mightbe useful for other applications as well. For instance, for the simple matrix-multiplication problem Ax performed in adistributed setting over P worker nodes, where the goal is to encode A such that each worker stores a partition 1 /m ofmatrix A , it is well known that MDS type codes can be used [13], [27]. Specifically, let A = A A ... A m and let H = ( h ij ) bean m × P matrix where every m × m submatrix of H has a full rank of m . Then the p -th worker for p ∈ { , , . . . , P } cancompute ( (cid:80) mi =1 h ip A i ) x ; the product Ax can be recovered from any m of the P nodes. The instinctual, Reed-Solomoninspired solution of choosing H to be a Vandermode matrix is ill-conditioned over real numbers. Note however that,unlike the matrix multiplication problem, the matrix H does not need to have a polynomial structure. Indeed, choosing H to be a random Gaussian matrix leads to well-conditioned solutions with high probability. In particular, the followingresult follows from elementary arguments that build on [31]. Theorem 9.1:
Let H be an m × P matrix, P ≥ m ≥ , and let the entries of H be independent and identically distributedstandard Gaussian random variables. Then, Pr (cid:16) κ max ( H ) > mP P − m ) (cid:1)(cid:17) < . P ( P − m ) . The theorem which is proved in Appendix D, formally demonstrates that for a fixed number of redundant workers s = P − m, the worst case condition number grows as O ( mP s ) with high probability. However, the random Gaussianmatrix approach has two drawbacks: (i) for a given realization of the random variables, it is difficult to verify whetherit is well-conditioned, and (ii) the lack of structure could lead to more complex decoding. Our result of Theorem 5.1also indicates that choosing H = G ( m,P ) ( ρ ( P ) ) , i.e., to be a Chebyshev Vandermonde matrix, naturally provides awell-conditioned solution to this problem. Another solution for the matrix-vector multiplication problem is provided in[25] via universally decodable matrices [32]; in this work numerical stability is demonstrated empirically.It is, however, important to note that the problems resolved in our paper here are more restrictive since matrixmultiplication codes - where both matrices are to be encoded so that the product can be recovered - require much morestructure than matrix-multiplication where only one matrix is to be encoded. For instance, random Gaussian encodingdoes not naturally work for matrix multiplication to get a recovery threshold of 2 m −
1, and it is not clear whether thesolution of [25] is applicable either. The utility of Chebyshev-Vandermonde matrices for a variety of coded computingproblems including matrix-vector multiplication, matrix multiplication and Lagrange coded computing motivates thestudy of low-complexity decoding and error correction mechanisms for these systems. R EFERENCES [1] S. Dutta, V. Cadambe, and P. Grover, “Short-Dot: Computing Large Linear Transforms Distributedly Using Coded Short DotProducts,” in
Advances In Neural Information Processing Systems (NIPS) , 2016, pp. 2092–2100.[2] Q. Yu, M. A. Maddah-Ali, and A. S. Avestimehr, “Polynomial Codes: an Optimal Design for High-Dimensional Coded MatrixMultiplication,” in
Advances In Neural Information Processing Systems (NIPS) , 2017, pp. 4403–4413.[3] M. Fahim, H. Jeong, F. Haddadpour, S. Dutta, V. Cadambe, and P. Grover, “On the optimal recovery threshold of coded matrixmultiplication,” in
Communication, Control, and Computing (Allerton) , Oct 2017, pp. 1264–1270.[4] S. Dutta, M. Fahim, F. Haddadpour, H. Jeong, V. R. Cadambe, and P. Grover, “On the optimal recovery threshold of coded matrixmultiplication,”
CoRR , vol. abs/1801.10292, 2018, Accepted to appear in
IEEE Transactions on Information Theory .[5] S. Dutta, Z. Bai, H. Jeong, T. M. Low, and P. Grover, “A unified coded deep neural network training strategy based ongeneralized polydot codes,” in , June 2018, pp. 1585–1589,http://arxiv.org/abs/1811.10 751.[6] R. Tandon, Q. Lei, A. G. Dimakis, and N. Karampatziakis, “Gradient coding,” in
Machine Learning Systems Workshop, Advances inNeural Information Processing Systems (NIPS) , 2016.[7] ——, “Gradient coding: Avoiding stragglers in distributed learning,” in
International Conference on Machine Learning , 2017, pp.3368–3376.[8] M. Ye and E. Abbe, “Communication-computation efficient gradient coding,” in
Proceedings of the 35th International Conference onMachine Learning, ICML 2018, Stockholmsm¨assan, Stockholm, Sweden, July 10-15, 2018 , 2018, pp. 5606–5615. [Online]. Available:http://proceedings.mlr.press/v80/ye18a.html[9] Y. Yang, P. Grover, and S. Kar, “Coding for a single sparse inverse problem,” in , June 2018, pp. 1575–1579, http://arxiv.org/abs/1706.00 163.[10] F. Haddadpour, Y. Yang, V. R. Cadambe, and P. Grover., “Cross-iteration coded computing,” in
Communication, Control, andComputing (Allerton) , 2018.[11] R. K. Maity, A. S. Rawat, and A. Mazumdar, “Robust gradient descent via moment encoding with ldpc codes,” arXiv preprintarXiv:1805.08327 , 2018.[12] Q. Yu, N. Raviv, J. So, and A. S. Avestimehr, “Lagrange coded computing: Optimal design for resiliency, security and privacy,” arXivpreprint arXiv:1806.00939 , 2018.[13] K. H. Huang and J. Abraham, “Algorithm-Based Fault Tolerance for Matrix Operations,”
IEEE Transactions on Computers , vol. 100,no. 6, pp. 518–528, 1984.[14] K. Lee, C. Suh, and K. Ramchandran, “High-dimensional coded matrix multiplication,” in
IEEE International Symposium onInformation Theory (ISIT) , 2017, pp. 2418–2422.[15] S. Li, S. M. M. Kalan, Q. Yu, M. Soltanolkotabi, and A. S. Avestimehr, “Polynomially coded regression: Optimal straggler mitigationvia data encoding,” arXiv preprint arXiv:1805.09934 , 2018.[16] W. Gautschi and G. Inglese, “Lower bounds for the condition number of vandermonde matrices,”
Numerische Mathematik , vol. 52,no. 3, pp. 241–250, 1987.[17] W. Gautschi, “How (un) stable are vandermonde systems,”
Asymptotic and computational analysis , vol. 124, pp. 193–210, 1990.[18] ——, “Norm estimates for inverses of vandermonde matrices,”
Numerische Mathematik , vol. 23, no. 4, pp. 337–347, 1974.[19] L. Reichel and G. Opfer, “Chebyshev-vandermonde systems,”
Mathematics of Computation , vol. 57, no. 196, pp. 703–721, 1991.[20] A. Quarteroni, R. Sacco, and F. Saleri,
Numerical mathematics . Springer Science & Business Media, 2010, vol. 37.[21] L. N. Trefethen,
Approximation theory and approximation practice . Siam, 2013, vol. 128.[22] U. Sheth, S. Dutta, M. Chaudhari, H. Jeong, Y. Yang, J. Kohonen, T. Roos, and P. Grover, “An Application of Storage-OptimalMatDot Codes for Coded Matrix Multiplication: Fast k-Nearest Neighbors Estimation,” in
IEEE Big Data (Short Paper) , 2018.[23] L. N. Trefethen and D. Bau,
Numerical Linear Algebra . SIAM, 1997.[24] J. Demmel and P. Koev, “The accurate and efficient solution of a totally positive generalized vandermonde linear system,”
SIAMJournal on Matrix Analysis and Applications , vol. 27, no. 1, pp. 142–152, 2005.[25] A. Ramamoorthy, L. Tang, and P. O. Vontobel, “Universally decodable matrices for distributed matrix-vector multiplication,” arXivpreprint arXiv:1901.10674 , 2019.[26] Q. Yu, M. A. Maddah-Ali, and A. S. Avestimehr, “Straggler mitigation in distributed matrix multiplication: Fundamental limits andoptimal coding,” in , June 2018, pp. 2022–2026, arXiv preprintarXiv:1801.07 487.[27] K. Lee, M. Lam, R. Pedarsani, D. Papailiopoulos, and K. Ramchandran, “Speeding up distributed machine learning using codes,”
IEEE Transactions on Information Theory
Mathematics of Computation , vol. 24, no. 112, pp. 893–903,1970.[31] J.-M. Aza¨ıs and M. Wschebor, “Upper and lower bounds for the tails of the distribution of the condition number of a gaussian matrix,”
SIAM Journal on Matrix Analysis and Applications , vol. 26, no. 2, pp. 426–440, 2004.[32] A. Ganesan and P. O. Vontobel, “On the existence of universally decodable matrices,”
IEEE transactions on information theory ,vol. 53, no. 7, pp. 2572–2575, 2007. A PPENDIX AP ROOF OF C LAIM
We have, (cid:90) ba p A ( x ) p B ( x ) w ( x ) dx = (cid:90) ba (cid:32) m − (cid:88) i =0 A i q i ( x ) (cid:33) m − (cid:88) j =0 B j q j ( x ) w ( x ) dx = (cid:90) ba m − (cid:88) i =0 m − (cid:88) j =0 A i B j q i ( x ) q j ( x ) w ( x ) dx = m − (cid:88) i =0 m − (cid:88) j =0 A i B j (cid:90) ba q i ( x ) q j ( x ) w ( x ) dx = m − (cid:88) i =0 m − (cid:88) j =0 A i B j (cid:104) q i , q j (cid:105) = m − (cid:88) i =0 A i B i = AB . (32) In addition, noting that p A ( x ) p B ( x ) (i.e., p C ( x )) is of degree 2 m − m ), Theorem 3.2 implies that (cid:90) ba p A ( x ) p B ( x ) w ( x ) dx = m (cid:88) r =1 a r p A ( η r ) p B ( η r )= m (cid:88) r =1 a r p C ( η r ) . (33) Finally, combining (32) and (33) completes the proof. (cid:3) A PPENDIX BP ROOF OF T HEOREM
We use the following trigonometric identity in our proof.
Lemma B.1:
For n ≥ , let x i be chosen as (6). Then (cid:81) j (cid:54) = i ( x i − x j ) = ( − i − − n n sin( (2 i − π n ) Proof:
Note that 2 n − (cid:81) ni =1 ( x − x i ) = T n ( x ) = cos( n cos − ( x )). Therefore,2 n − (cid:89) j (cid:54) = i ( x i − x j ) = T (cid:48) n ( x i ) = n (cid:112) − x i sin( n cos − ( x i ))where T (cid:48) n ( x ) denotes the derivative of T n ( x ) . Using x i = cos( (2 i − π n ) above we get the desired result. (cid:3) Proof of Theorem 5.1:
We show that any square sub-matrix of G ( n − s,n ) ( ρ ( n ) ) formed by any n − s columns of G ( n − s,n ) ( ρ ( n ) ) satisfies the bound stated in the theorem. Let S be a subset of [ n ] such that |S| = s , for some s < n ,and define G ( n − s,n )[ n ] −S ( ρ ( n ) ) to be the square n − s × n − s submatrix of G ( n − s,n ) ( ρ ( n ) ) after removing the columns withindices in S . Recalling the structure of G ( n − s,n ) ( ρ ( n ) ) from (12), we can write it as G ( n − s,n ) ( ρ ( n ) ) = T ( ρ ( n )1 ) · · · T ( ρ ( n ) n ) ... . . . ...T n − s − ( ρ ( n )1 ) · · · T n − s − ( ρ ( n ) n ) . Moreover, for any
S ⊂ [ n ] such that |S| = s , we can write G ( n − s,n )[ n ] −S ( ρ ( n ) ) := G ( n − s,n )Γ := T ( γ ) · · · T ( γ n − s ) ... . . . ...T n − s − ( γ ) · · · T n − s − ( γ n − s ) , where Γ = ( γ , γ , · · · , γ n − s ) = ( ρ ( n ) g , ρ ( n ) g , · · · , ρ ( n ) g n − s ), where { g i } i ∈ [ n − s ] = [ n ] − S and g < g < · · · < g n − s . Now,notice that || G ( n − s,n )Γ || F = (cid:80) n − si =1 (cid:80) n − sj =1 | T i − ( γ j ) | , and | T i ( γ j ) | ≤ i, j ∈ [ n − s ]. Therefore, we have || G ( n − s,n )Γ || F ≤ ( n − s ) . (34) In the following, we obtain an upper bound on || ( G ( n − s,n )Γ ) − || F . Let L Γ ,k be the k -th Lagrange polynomial associatedwith Γ, that is, L Γ ,k ( x ) = (cid:89) i ∈ [ n − s ] −{ k } x − γ i γ k − γ i (35) Since L Γ ,k ( x ) has a degree of n − s −
1, it can be written in terms of the Chebyshev basis T ( x ) , · · · , T n − s − ( x ) as L Γ ,k ( x ) = n − s − (cid:88) i =0 a i,k T i ( x ) , (36) for some real coefficients a ,k , · · · , a n − s − ,k . Now, from (35), note the following property regarding L Γ ,k ( x ): L Γ ,k ( x ) = (cid:26) , if x = γ k , if x ∈ { γ i } i ∈ [ n − s ] − k . Using this property and observing (36), we conclude that, for any j ∈ [ n − s ], (cid:80) n − s − i =0 a i,k T i ( γ j ) = δ ( k − j ). Therefore, a , · · · a n − s − , ... . . . ...a ,n − s · · · a n − s − ,n − s G ( n − s,n )Γ = I n − s , where I n − s is the n − s × n − s identity matrix. That is, (cid:16) G ( n − s,n )Γ (cid:17) − = a , · · · a n − s − , ... . . . ...a ,n − s · · · a n − s − ,n − s , (37) Therefore, (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:16) G ( n − s,n )Γ (cid:17) − (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) F = n − s (cid:88) i =1 n − s (cid:88) j =1 | a i − ,j | . (38) In addition, we have that n − s (cid:88) k =1 (cid:90) − L ,k ( x ) w ( x ) dx = n − s (cid:88) k =1 (cid:90) − n − s − (cid:88) i =0 n − s − (cid:88) j =0 a i,k a j,k T i ( x ) T j ( x ) w ( x ) dx = n − s (cid:88) k =1 n − s − (cid:88) i =0 n − s − (cid:88) j =0 a i,k a j,k (cid:90) − T i ( x ) T j ( x ) w ( x ) dx = n − s (cid:88) k =1 n − s − (cid:88) i =0 n − s − (cid:88) j =0 a i,k a j,k (cid:104) T i , T j (cid:105) = n − s (cid:88) k =1 n − s − (cid:88) i =0 | a i,k | . (39) From (38) and (39), we conclude that || ( G ( n − s,n )Γ ) − || F = (cid:80) n − sk =1 (cid:82) − L ,k ( x ) w ( x ) dx .Now, we express the integral (cid:82) − L ,k ( x ) w ( x ) dx in the Gauss quadrature form using the n roots of T n ( x ) : ρ ( n )1 , · · · , ρ ( n ) n .Note that this is a “trick” we use in the proof - it is possible to use the Gauss quadrature formula over n − s nodes toexpress the integral of the degree 2( n − s −
1) polynomial L ,k ( x ). However, the use of n nodes instead of n − s nodesleads to simple tractable bound for || ( G ( n − s,n )Γ ) − || F . Now, we can write (cid:90) − L ,k ( x ) w ( x ) dx = n (cid:88) i =1 c i L ,k ( ρ ( n ) i ) , (40) for some constants c , · · · , c n . Moreover, c , · · · , c n for the Chebyshev polynomials of the first kind are, in fact, all equalto π/n . Therefore, we have (cid:90) − L ,k ( x ) w ( x ) dx = πn n (cid:88) i =1 L ,k ( ρ ( n ) i ) , (41) and, consequently, (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:16) G ( n − s,n )Γ (cid:17) − (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) F = πn n − s (cid:88) k =1 n (cid:88) i =1 L ,k ( ρ ( n ) i ) . (42) Now, from (35), note that L Γ ,k ( x ) has the following evaluations L Γ ,k ( ρ ( n ) i ) = , if i = g k , if i ∈ { g i } i ∈ [ n − s ] − k (cid:81) j ∈ [ n − s ] −{ k } ρ ( n ) i − γ j γ k − γ j , if i ∈ S . (43) Therefore, (42) can be written as (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:16) G ( n − s,n )Γ (cid:17) − (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) F = πn n − s (cid:88) k =1 (cid:88) i ∈S (cid:89) j ∈ [ n − s ] −{ k } (cid:32) ρ ( n ) i − γ j γ k − γ j (cid:33) = π ( n − s ) n + πn n − s (cid:88) k =1 (cid:88) i ∈S (cid:89) j ∈ [ n − s ] −{ k } (cid:32) ρ ( n ) i − γ j γ k − γ j (cid:33) (44) In order to obtain our upper bound on || ( G ( n − s,n )Γ ) − || F , in the following, we get an upper bound on the term (cid:81) j ∈ [ n − s ] −{ k } (cid:18) ρ ( n ) i − γ j γ k − γ j (cid:19) in (44). Notice that (cid:81) j ∈ [ n − s ] −{ k } (cid:18) ρ ( n ) i − γ j γ k − γ j (cid:19) can be written as (cid:89) j ∈ [ n − s ] −{ k } (cid:32) ρ ( n ) i − γ j γ k − γ j (cid:33) = (cid:89) j ∈ [ n − s ] −{ k } (cid:32) ρ ( n ) i − ρ ( n ) g j ρ ( n ) g k − ρ ( n ) g j (cid:33) = (cid:89) j ∈ [ n − s ] −{ k } (cid:32) ρ ( n ) i − ρ ( n ) g j ρ ( n ) g k − ρ ( n ) g j (cid:33) (cid:81) j ∈S∪{ g k }−{ i } (cid:16) ρ ( n ) i − ρ ( n ) j (cid:17) (cid:81) j ∈S∪{ g k }−{ i } (cid:16) ρ ( n ) i − ρ ( n ) j (cid:17) = (cid:81) j ∈ [ n ] −{ i } (cid:16) ρ ( n ) i − ρ ( n ) j (cid:17) (cid:81) j ∈ [ n − s ] −{ k } (cid:16) ρ ( n ) g k − ρ ( n ) g j (cid:17) (cid:81) j ∈S∪{ g k }−{ i } (cid:16) ρ ( n ) i − ρ ( n ) j (cid:17) = (cid:16) − n n/ sin( (2 i − π n ) (cid:17) (cid:81) j ∈ [ n − s ] −{ k } (cid:16) ρ ( n ) g k − ρ ( n ) g j (cid:17) (cid:81) j ∈S∪{ g k }−{ i } (cid:16) ρ ( n ) i − ρ ( n ) j (cid:17) , (45) where the last equality follows from Lemma B.1. Moreover, the product (cid:81) j ∈ [ n − s ] −{ k } (cid:16) ρ ( n ) g k − ρ ( n ) g j (cid:17) in (45) can bewritten as (cid:89) j ∈ [ n − s ] −{ k } (cid:16) ρ ( n ) g k − ρ ( n ) g j (cid:17) = (cid:81) j ∈ [ n − s ] −{ k } (cid:16) ρ ( n ) g k − ρ ( n ) g j (cid:17) (cid:81) j ∈S (cid:16) ρ ( n ) g k − ρ ( n ) j (cid:17) (cid:81) j ∈S (cid:16) ρ ( n ) g k − ρ ( n ) j (cid:17) = (cid:16) − n n/ sin( (2 g k − π n ) (cid:17) (cid:81) j ∈S (cid:16) ρ ( n ) g k − ρ ( n ) j (cid:17) , (46) where the last equality follows from Lemma B.1. Now, substituting from (46) in (45) yields (cid:89) j ∈ [ n − s ] −{ k } (cid:32) ρ ( n ) i − γ j γ k − γ j (cid:33) = (cid:16) sin( (2 g k − π n ) (cid:17) (cid:16) sin( (2 i − π n ) (cid:17) (cid:81) j ∈S (cid:16) ρ ( n ) g k − ρ ( n ) j (cid:17) (cid:81) j ∈S∪{ g k }−{ i } (cid:16) ρ ( n ) i − ρ ( n ) j (cid:17) = (cid:16) sin( (2 g k − π n ) (cid:17) (cid:16) sin( (2 i − π n ) (cid:17) (cid:81) j ∈S−{ i } (cid:16) ρ ( n ) g k − ρ ( n ) j (cid:17) (cid:81) j ∈S−{ i } (cid:16) ρ ( n ) i − ρ ( n ) j (cid:17) , ≤ (cid:16) sin( (2 g k − π n ) (cid:17) (cid:16) sin( (2 i − π n ) (cid:17) max j ∈S−{ i } (cid:16) ρ ( n ) g k − ρ ( n ) j (cid:17) min j ∈S−{ i } (cid:16) ρ ( n ) i − ρ ( n ) j (cid:17) s − = 1 (cid:16) sin( (2 i − π n ) (cid:17) (cid:34) (cid:0) cos( π n ) − cos( π n ) (cid:1) (cid:35) s − = 4 s − (cid:16) sin( (2 i − π n ) (cid:17) (cid:0) cos( π n ) − cos( π n ) (cid:1) s − = O (4 s − n s − ) . (47) Using (47) in (44), we conclude that (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:16) G ( n − s,n )Γ (cid:17) − (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) F = O (4 s − ( n − s ) sn s − ) . (48) Finally, combining (34) and (48), we conclude that κ maxF ( G ( n − s,n ) ( ρ ( n ) )) = O (cid:16) ( n − s ) (cid:112) ns ( n − s ) (cid:0) n (cid:1) s − (cid:17) . (cid:3) A PPENDIX CP ROOF OF C LAIM
Let α = 2 m − , γ = α (2 m − p A ( x ) in (29) can be written as p A ( x ) = m − (cid:88) i =0 m − (cid:88) j =0 A i,j T (cid:48) m − − j + iα ( x )= m − (cid:88) j =0 A ,j T m − − j ( x ) + 1 / A ,m − T ( x ) + m − (cid:88) i =1 m − (cid:88) j =0 A i,j T m − − j + iα ( x )Similarly, p B ( x ) in (29) can be written as p B ( x ) = m − (cid:88) k =0 m − (cid:88) l =0 B k,l T (cid:48) k + lγ ( x )= 1 / B , T ( x ) + m − (cid:88) k =1 B k, T k ( x ) + m − (cid:88) k =0 m − (cid:88) l =1 B k,l T k + lγ ( x ) (49) Now, the product p A ( x ) p B ( x ) can be written as p A ( x ) p B ( x ) = 12 (cid:0) p ( x ) + p ( x ) (cid:1) (50) where, p ( x ) = m − (cid:88) j =0 A ,j B , T m − − j ( x ) + 12 A ,m − B , T ( x ) + m − (cid:88) i =1 m − (cid:88) j =0 A i,j B , T m − − j + iα ( x )+ m − (cid:88) j =0 m − (cid:88) k =1 A ,j B k, T m − − j + k ( x ) + m − (cid:88) k =1 A ,m − B k, T k ( x )+ m − (cid:88) i =1 m − (cid:88) j =0 m − (cid:88) k =1 A i,j B k, T m − − j + iα + k ( x ) + m − (cid:88) j =0 m − (cid:88) k =0 m − (cid:88) l =1 A ,j B k,l T m − − j + k + lγ ( x )+ m − (cid:88) k =0 m − (cid:88) l =1 A ,m − B k,l T k + lγ ( x ) + m − (cid:88) i =1 m − (cid:88) j =0 m − (cid:88) k =0 m − (cid:88) l =1 A i,j B k,l T m − − j + iα + k + lγ ( x ) (51) and, p ( x ) = m − (cid:88) j =0 m − (cid:88) k =1 A ,j B k, T | m − − j − k | ( x ) + m − (cid:88) i =1 m − (cid:88) j =0 m − (cid:88) k =1 A i,j B k, T | m − − j + iα − k | ( x ) + m − (cid:88) j =0 m − (cid:88) k =0 m − (cid:88) l =1 A ,j B k,l T | m − − j − k − lγ | ( x ) + m − (cid:88) i =1 m − (cid:88) j =0 m − (cid:88) k =0 m − (cid:88) l =1 A i,j B k,l T | m − − j + iα − k − lγ | ( x ) . (52) Now, in order to prove the claim, it suffices to prove the following two statements:1) For any i ∈ { , · · · , m − } , l ∈ { , · · · , m − } , C i,l is the matrix coefficient of T m − iα + lγ in p ( x ).2) For any i ∈ { , · · · , m − } , l ∈ { , · · · , m − } , the matrix coefficient of T m − iα + lγ in p ( x ) is N /m × N /m ,where N /m × N /m is the N /m × N /m all zeros matrix.In the following, we prove that statement 1) is true. In order to find the coefficient of T m − iα + lγ in p ( x ), we find theset S = { ( i (cid:48) , j (cid:48) , k (cid:48) , l (cid:48) ) : m − − j (cid:48) + i (cid:48) α + k (cid:48) + l (cid:48) γ = m − iα + lγ } . Rewriting m − − j (cid:48) + i (cid:48) α + k (cid:48) + l (cid:48) γ = m − iα + lγ ,we have ( k (cid:48) − j (cid:48) ) + ( i (cid:48) − i ) α + ( l (cid:48) − l ) γ = 0 . (53) (53) implies that l (cid:48) = l . Suppose l (cid:48) (cid:54) = l , this means that ( k (cid:48) − j (cid:48) ) + ( i (cid:48) − i ) α = cγ for some integer c . However, this is acontradiction since | ( k (cid:48) − j (cid:48) ) + ( i (cid:48) − i ) α | < γ , for any i, i (cid:48) , j (cid:48) , k (cid:48) . Now, (53) can be written as( k (cid:48) − j (cid:48) ) + ( i (cid:48) − i ) α = 0 . (54) Again, (54) implies i (cid:48) = i . Suppose i (cid:48) (cid:54) = i , this means k (cid:48) − j (cid:48) = cα , for some integer c . However, this is a contradictionsince | k (cid:48) − j (cid:48) | < α . Now, since i (cid:48) = i , (54) implies j (cid:48) = k (cid:48) . Thus, S = { ( i, j (cid:48) , j (cid:48) , k ) : j (cid:48) ∈ { , · · · , m − }} . That is, forany i ∈ { , · · · , m − } , j ∈ { , · · · , m − } , the matrix coefficient of T m − iα + lγ in p ( x ) is (cid:80) m − j (cid:48) =0 A i,j (cid:48) B j (cid:48) ,l = C i,l .Now, it remains to prove statement 2). That is, for any i ∈ { , · · · , m − } , l ∈ { , · · · , m − } , the matrix coefficientof T m − iα + lγ in p ( x ) is N /m × N /m . In order to find the coefficient of T m − iα + lγ in p ( x ), we find the sets S (1)2 = { ( i (cid:48) , j (cid:48) , k (cid:48) , l (cid:48) ) : m − − j (cid:48) + i (cid:48) α − k (cid:48) − l (cid:48) γ = m − iα + lγ } , and S (2)2 = { ( i (cid:48) , j (cid:48) , k (cid:48) , l (cid:48) ) : − m +1+ j (cid:48) − i (cid:48) α + k (cid:48) + l (cid:48) γ = m − iα + lγ } .First, for the set S (1)2 , rewriting m − − j (cid:48) + i (cid:48) α − k (cid:48) − l (cid:48) γ = m − iα + lγ , we get( − j (cid:48) − k (cid:48) ) + ( i (cid:48) − i ) α + ( l + l (cid:48) ) γ = 0 . (55) From (55), we conclude that l + l (cid:48) = 0. Otherwise, ( − j (cid:48) − k (cid:48) ) + ( i (cid:48) − i ) α = cγ , for some integer c , a contradiction since | ( − j (cid:48) − k (cid:48) ) + ( i (cid:48) − i ) α | < γ . Since l + l (cid:48) = 0 and both l, l (cid:48) are non-negative, we conclude that l (cid:48) = l = 0. Moreover, now(55) reduces to ( − j (cid:48) − k (cid:48) ) + ( i (cid:48) − i ) α = 0 . (56) Again, since | − j (cid:48) − k (cid:48) | < α , we conclude that i (cid:48) = i , which implies that j (cid:48) + k (cid:48) = 0. Since j (cid:48) + k (cid:48) = 0 and both j (cid:48) , k (cid:48) are non-negative, we conclude that j (cid:48) = k (cid:48) = 0. Thus, S (1)2 = { ( i, , , } . Now, noticing from (52) that A i, B , does not contribute to any term in p ( x ), we conclude that the matrix coefficient of T m − iα + lγ in p ( x ) is only dueto the set S (2)2 . Recall that S (2)2 = { ( i (cid:48) , j (cid:48) , k (cid:48) , l (cid:48) ) : − m + 1 + j (cid:48) − i (cid:48) α + k (cid:48) + l (cid:48) γ = m − iα + lγ } , we rewrite − m + 1 + j (cid:48) − i (cid:48) α + k (cid:48) + l (cid:48) γ = m − iα + lγ as( j (cid:48) + k (cid:48) − m + 2) − ( i (cid:48) + i ) α + ( l (cid:48) − l ) γ = 0 (57) From (57), we conclude that l = l (cid:48) . Otherwise, ( j (cid:48) + k (cid:48) − m + 2) − ( i (cid:48) + i ) α = cγ , for some integer c , a contradictionsince | ( j (cid:48) + k (cid:48) − m + 2) − ( i (cid:48) + i ) α | < γ . Moreover, now (57) reduces to( j (cid:48) + k (cid:48) − m + 2) + ( i (cid:48) + i ) α = 0 . (58) Again, since | j (cid:48) + k (cid:48) − m + 2 | < α , we conclude that i (cid:48) + i = 0. Since i (cid:48) + i = 0 and both i, i (cid:48) are non-negative, weconclude that i (cid:48) = i = 0, which implies that j (cid:48) + k (cid:48) = 2 m −
2. Since j (cid:48) + k (cid:48) = 2 m − j (cid:48) , k (cid:48) ≤ m −
1, weconclude that j (cid:48) = k (cid:48) = m −
1. Thus, S (2)2 = { (0 , m − , m − , l ) } . Now, noticing from (52) that A ,m − B m − ,l doesnot contribute to any term in p ( x ), we conclude that the matrix coefficient of T m − iα + lγ in p ( x ) is N /m × N /m . (cid:3) A PPENDIX DU PPER B OUND ON THE C ONDITION N UMBER OF G AUSSIAN M ATRICES
We first introduce the following theorem from [31].
Theorem D.1:
Let A be an m × m matrix, m ≥ , and let the entries of A be independent and identically distributedstandard Gaussian random variables. Then, for all α > , Pr( κ ( A ) > mα ) < . α , where κ ( A ) is the condition number of A with respect to the matrix norm induced by (cid:96) . As a consequence, in the following, we extend the result in Theorem D.1 to bound the condition number of every m × m sub-matrix of a random m × P matrix with i.i.d standard Gaussian entries, P ≥ m . Proof of Theorem 9.1
For any subset
S ⊆ { , , . . . , P } , let H S denote the |S| × m sub-matrix of H containing thecolumns H corresponding to S , and let s = P − m . Then we havePr (cid:16) κ max ( H ) > mP s (cid:17) = Pr (cid:16) (cid:91) S (cid:48) ⊂ [ P ] , |S (cid:48) | = m (cid:0) κ ( H S (cid:48) ) > mP s (cid:1)(cid:17) (59) (1) ≤ (cid:88) S (cid:48) ⊂ [ P ] , |S (cid:48) | = m Pr (cid:0) κ ( H S (cid:48) ) > mP s (cid:1) = (cid:18) Ps (cid:19) Pr (cid:0) κ ( H S (cid:48) ) > mP s (cid:1) , for any S (cid:48) ⊂ [ P ] such that |S (cid:48) | = m (2) < P s . P s = 5 . P s , where (1) follows from the union bound, and (2) follows from the fact that (cid:18) Ps (cid:19) ≤ P s and Theorem D.1.and Theorem D.1.