Coded Computing with Noise
Royee Yosibash and Ram Zamir
EE-Systems Department, Tel Aviv University, Israel
Email: [email protected], [email protected]
Abstract—Distributed computation is a framework used to break down a complex computational task into smaller tasks and distribute them among computational nodes. Erasure-correction codes have recently been introduced and have become a popular workaround to the well-known "straggling nodes" problem, in particular by matching linear coding to linear computation tasks. We observe that decoding tends to amplify the computation "noise", i.e., the numerical errors at the computation nodes. We use noise amplification as a performance measure to compare various erasure-correction codes, and in particular polynomial codes (of which Reed–Solomon codes and other popular codes are a subset). We show that noise amplification can be significantly reduced by a clever selection of the sampling points and powers of the polynomial code.
Index Terms—distributed computation, erasure codes, polynomial codes, noise amplification, DFT, frames, difference set, equiangular tight frames, Jacobi/MANOVA distribution.
I. INTRODUCTION
In recent years, algorithms have struggled with the running time of large-scale, computationally complex tasks that require many consecutive calculations. A common practice for decreasing the running time of such algorithms is to use a large distributed system comprised of individual computational nodes (described in detail in Section III). However, these large systems present new challenges that the system designer has to mitigate. One of the more significant challenges is the "stragglers" – computational nodes that, unexpectedly, have a significantly higher response time than their non-straggling counterparts. These straggler nodes create a computational bottleneck that delays the final computational product needed from all the system's nodes. Taking this uncertainty into account calls for the system engineer to find some sort of "back-up" scheme in order to ensure high-quality service. One such method is implementing a coding technique taken from the realm of information and coding theory. In information-theoretic terms, these straggler nodes are considered as "erasures" – a symbol in a transmitted stream that, instead of arriving correctly through the channel, is lost, and the receiver knows only through side information that the symbol's real value is unknown. The purpose of Section IV is to explain how these erasures and a coding scheme fit into a distributed computing model. Some of the codes that have been studied include general maximum distance separable (MDS) codes [6], Reed–Solomon (RS) or Bose–Chaudhuri–Hocquenghem (BCH) codes [9], and general polynomial codes [7]. In order to evaluate which code is best suited for distributed computation, many research groups have chosen performance measures such as computational complexity of recovery [7] or average run time [6] in order to show that a certain code is a good solution.

This paper discusses another perspective of coding and decoding that was left rather unexplored: noisy calculation. Section V proposes a model that introduces some form of noise to the computation due to finite word-length. The existence of such noise raises the question of the noise amplification arising in the decoding process, which becomes even more complex under the assumption of stragglers. Using aspects of frame theory and random matrix theory, this paper presents how the decoding scheme amplifies the "noisy" computations returned from the computational nodes, with respect to the random aspect induced by stragglers. In Section VI we discuss recent theoretical results in frame theory and random matrix theory that provide frames serving as good/bad benchmarks for the noise amplification performance measure. We also present how new variations of polynomial codes over the complex field can be constructed with design guidelines taken from these benchmarks. In Section VII we demonstrate via simulations that the noise amplification of these codes follows the theoretical expectations, and that one of the suggested types of polynomial codes has near-identical amplification to the benchmark for good noise amplification.

II. NOTATION
In order not to cause confusion due to coinciding notations between frame theory and code theory, the notation in this paper is set as defined in this section. Any one-dimensional variable $a$ is written with a roman letter. A vector or set of one-dimensional elements $\mathbf{v}$ is written in bold lowercase letters. All vectors are represented as column vectors, so for vectors $\dim(\mathbf{v}) = h \times 1$. A matrix $\mathbf{A} = \{A_{j,i}\}$, in bold capital letters, is defined as having $\dim(\mathbf{A}) = h \times \ell$. The parameter $\gamma$ is called the frame aspect ratio, or the redundancy ratio, and is defined for $\mathbf{A}$ as $\gamma = \frac{h}{\ell}$. The $\ell$ column vectors of $\mathbf{A}$ are denoted $\mathbf{a}_i$, and the vector cross-correlation is defined as

$$c_{i_1,i_2} = \langle \mathbf{a}_{i_1}, \mathbf{a}_{i_2} \rangle = \sum_{j=1}^{h} A^{*}_{j,i_1} A_{j,i_2} \qquad (1)$$

where $^{*}$ is the notation for conjugate transpose (complex conjugation for scalars). Matrices/frames are said to have "unit-norm columns" if $c_{i,i} = 1$ for all $i \in [\ell]$. If all $c_{i,i} \neq 0$, then a matrix/frame can be transformed into one with unit-norm columns by dividing each $\mathbf{a}_i$ by $\sqrt{c_{i,i}}$.
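To make the notation concrete, here is a minimal NumPy sketch (the matrix and its dimensions are arbitrary placeholders, not anything from the paper) that computes the cross-correlations of equation (1) and transforms a matrix into one with unit-norm columns.

```python
import numpy as np

rng = np.random.default_rng(0)
h, l = 6, 4                      # dim(A) = h x l, aspect ratio gamma = h / l
A = rng.normal(size=(h, l)) + 1j * rng.normal(size=(h, l))

# Cross-correlations of equation (1): C[i1, i2] = <a_i1, a_i2> = sum_j conj(A[j, i1]) * A[j, i2]
C = A.conj().T @ A

# Transform A into a unit-norm-column matrix by dividing each column a_i by sqrt(c_{i,i})
norms = np.sqrt(np.real(np.diag(C)))
A_unit = A / norms               # broadcasting divides column i by norms[i]

print(np.allclose(np.diag(A_unit.conj().T @ A_unit), 1.0))   # True: c_{i,i} = 1 for all i
```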
III. THE NOISELESS SETUP

We consider a distributed computation setup in which a function $f$ is to be computed over a data set $A$, defined over some arbitrary field, $A \in \mathbb{F}^{h \times \ell}$. The output of $f$ over the data set $A$ lies in the same field but may have different dimensions. The task of computing $f(A)$ is too complex for a single node, so it is broken down into $m$ simpler tasks, denoted $g_1, \dots, g_m$, each operating on $A$ or on a subset of elements of $A$, denoted $A_1, \dots, A_m$. The master node utilizes the worker nodes by sending the functions $g_i$ with the corresponding subsets $A_i$ to the nodes, so that each node $i$ receives $g_i$ and $A_i$. Each node computes the simpler task $g_i(A_i)$ and returns the answer to the master node. The restriction on the function set $\{g_i\}$ and the subsets $\{A_i\}$ is that the master node has to be able to reconstruct $f(A)$ from $\{g_i(A_i)\}$ alone. A block diagram of a distributed computation setup is given in Figure 1.

Fig. 1. The distributed computation setup without coding
In a distributed computation setup, a coding scheme is a function that operates on the $m$ subsets $\{A_i\}$ to create $n > m$ new subsets, denoted $\{A'_i\}$, and $n$ new functions $\{g'_i\}$. These subsets and functions are sent in the same manner to $n$ nodes. After receiving the answers, the decoding scheme is a function that operates on $\{g'_i(A'_i)\}$ and converts them back into $\{g_i(A_i)\}$, from which the master node computes $f(A)$. The decoder might not need all $n$ answers, and may be able to retrieve all $\{g_i(A_i)\}$ from only $k$ answers, $m \le k < n$. In order to simplify, we only discuss cases in which all $g_i$'s are identical and unchanged when performed on $A'_i$; therefore, the coding scheme only operates on $\{A_i\}$. The encoding and decoding schemes slightly alter the setup described in Figure 1, and the new setup is described in Figure 2.

In order to satisfy the prerequisite of reconstructing all $\{g(A_i)\}$ from the returned $\{g(A'_i)\}$, without creating a coding scheme custom made for each function $g$, the function and the coding scheme have to be commutative operators. For this reason, we restrict our coding schemes to linear codes and the function $g$ to be linear, and hence commutative with these linear transforms of the subsets $\{A_i\}$.

Fig. 2. The distributed computation setup with coding

The linear transformation of the $m$ elements of $\{A_i\}$ to the $n$ elements of $\{A'_i\}$ can be represented in matrix form by the code generator matrix $F^T$, with $\dim(F^T) = n \times m$ and $A_i \in \mathbb{F}^{h \times \ell}$:

$$F^T \left[A_1, \cdots, A_m\right]^T = \left[A'_1, \cdots, A'_n\right]^T \qquad (2)$$

A good example of a problem that might benefit from a distributed computation solution is the matrix multiplication problem: for some matrix $A$ with $\dim(A) = h \times \ell$ and a vector $\mathbf{x}$ with $\dim(\mathbf{x}) = h \times 1$, the distributed model uses $n$ computational nodes in order to calculate $A^T\mathbf{x}$. The master node sends each node $i$ some matrix $T_i$ along with the vector $\mathbf{x}$, and receives back a result in the form of a vector $\mathbf{r}_i$. After gathering enough answers from the nodes to compute the product $A^T\mathbf{x}$, the master considers the computation as completed successfully. A "naïve" approach, which does not yet take stragglers into account, is to divide the columns of $A$ (the rows of $A^T$) into $n$ equal parts, creating $n$ matrices that satisfy $A^T = [A_1^T, \cdots, A_n^T]^T$, with each matrix $A_i$ having $\dim(A_i) = h \times \lceil \ell/n \rceil$ (padding with zeros if $n$ does not divide $\ell$). Each computational node $i$ then calculates $\mathbf{r}_i = A_i^T\mathbf{x}$ and transmits the result back to the master node. In turn, the master node stacks all the results to get $A^T\mathbf{x} = [(\mathbf{r}_1)^T, \cdots, (\mathbf{r}_n)^T]^T$. Assuming the nodes calculate at the same pace and neglecting communication delays, it is easy to see that the master node has increased the speed of the calculation of $A^T\mathbf{x}$ by a factor of $n$. To this point, the motivation for coding was not obvious; the introduction of stragglers is what makes coding an important part of distributed computation.
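As a minimal numerical sketch of the naïve splitting described above (with arbitrary illustrative dimensions; the variable names are ours, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(1)
h, l, n = 8, 12, 3                       # arbitrary illustrative sizes, n divides l here
A = rng.normal(size=(h, l))
x = rng.normal(size=(h, 1))

# Naive (uncoded) split: divide the columns of A (rows of A^T) among the n nodes
blocks = np.split(A, n, axis=1)          # A_i with dim(A_i) = h x (l/n)

# Each node i computes r_i = A_i^T x; the master stacks the partial results
r = [Ai.T @ x for Ai in blocks]
result = np.vstack(r)

print(np.allclose(result, A.T @ x))      # True: the stacked answers equal A^T x
```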
IV. THE ERASURE MODEL

An erasure, as described before, is the event of a computational node not returning a response in due time (or at all), so that the decoder deems it unresponsive. This can be seen as the decoder having side information on which of the $n$ nodes are valid and which are erased. The number of remaining nodes is denoted $k$, and the set of node indices that have not been erased is denoted $\mathbf{k} \subseteq [n]$.

Returning to the model of computing $f(A)$ using computational nodes, the introduction of stragglers creates a major drawback of this distributed setup: the calculation time is lower bounded by the slowest node. If any node is straggling, and all nodes are needed to complete the calculation (the naïve approach), then the effectiveness of this method is dramatically reduced. In a system with a very large $n$, the desired run time, which would have decreased with $n$, is hindered by this lower bound if the system suffers at least one straggler node (the probability of which increases with $n$). As alluded to before, one way to mitigate the straggler problem is to use a coding scheme with erasure-resilient codes.

Continuing with the example of the matrix multiplication problem, it is easy to show a simple coding scheme that mitigates a single straggler: for $n = 3$ computational nodes tasked to solve $A^T\mathbf{x}$, and assuming that no more than one node will straggle, we recreate $A^T\mathbf{x}$ by dividing $A^T$ into $A^T = [A_1^T, A_2^T]^T$ and then encoding these two matrices as three matrices with the following erasure code:

$$F = \begin{bmatrix} I_{\dim(A_i)} & 0_{\dim(A_i)} & I_{\dim(A_i)} \\ 0_{\dim(A_i)} & I_{\dim(A_i)} & I_{\dim(A_i)} \end{bmatrix} \qquad (3)$$

$$F^T \left[A_1^T, A_2^T\right]^T = \left[A_1^T, A_2^T, \left(A_1^T + A_2^T\right)\right]^T = \left[A_1'^T, A_2'^T, A_3'^T\right]^T \qquad (4)$$

It is easy to see that after receiving any two out of the three calculation results $\left[A_1'^T\mathbf{x}, A_2'^T\mathbf{x}, A_3'^T\mathbf{x}\right]$, one can recreate both $A_1^T\mathbf{x}$ and $A_2^T\mathbf{x}$ and then output the solution of the original task, $A^T\mathbf{x} = [A_1^T\mathbf{x}, A_2^T\mathbf{x}]$.

It is useful to represent erasures as an operator that acts on $F$. The operator uses the $(n - k)$ elements of $\mathbf{k}' = \{x \in [n] : x \notin \mathbf{k}\}$ and nullifies the columns of $F$ with the same indices (equivalently, the rows with the same indices in $F^T$). This operator is simply a "column retain matrix" $P$, which right-multiplies $F$ to create the equivalent code generator matrix $F_s = F \cdot P$. The matrix $P$ is constructed by taking a $k \times k$ identity matrix and inserting rows of all zeros so that $P$ has zero rows at the indices of the elements of $\mathbf{k}'$. One can also create a slightly different $P$ matrix by taking an $n \times n$ identity matrix and changing every column whose index is in $\mathbf{k}'$ into a column of all zeros. This alternative keeps the erased columns as zeroed columns instead of omitting them; while it is an equivalent representation of erasures, the zeroed columns are redundant and have no use in the decoding scheme. The new frame created by the erasures has a new sub-frame aspect ratio of $\beta = \frac{k}{m}$.

As mentioned before, the decoder does not necessarily know how many nodes will straggle. We could also assume that the erasure events form an i.i.d. Bernoulli process with probability $p$, from which it immediately follows that the expected value of $k$ is

$$E[k] = n \cdot (1 - p) \qquad (5)$$

In the latter sections we compare frames, and so we hold $k$ constant; the random variation is in which elements comprise the set $\mathbf{k}$.
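The following sketch, under the same arbitrary dimensions as before, illustrates the single-straggler code of equations (3)-(4) together with a column-retain matrix $P$; the surviving set is a hypothetical choice for illustration.

```python
import numpy as np

def column_retain(n, survivors):
    """n x k column-retain matrix P: right-multiplying F by P keeps the surviving columns."""
    return np.eye(n)[:, sorted(survivors)]

rng = np.random.default_rng(2)
h, l = 8, 12
A = rng.normal(size=(h, l))
x = rng.normal(size=(h, 1))
A1, A2 = np.split(A, 2, axis=1)                      # A^T = [A_1^T, A_2^T]^T

# Encoded tasks per equation (4): A'_1 = A_1, A'_2 = A_2, A'_3 = A_1 + A_2
tasks = [A1, A2, A1 + A2]
answers = [Ai.T @ x for Ai in tasks]                 # what the three worker nodes return

# Block version of F (equation (3)) and its erased counterpart F_s = F P
I = np.eye(l // 2)
F = np.block([[I, np.zeros_like(I), I],
              [np.zeros_like(I), I, I]])             # dim(F) = l x (3l/2)
survivors = [0, 2]                                   # node 1 straggles (any 2 of 3 suffice)
P = column_retain(3, survivors)
F_s = F @ np.kron(P, np.eye(l // 2))                 # column retain, acting block-wise

# Decode the two surviving answers back to the original task with the pseudo-inverse of F_s^T
received = np.vstack([answers[i] for i in survivors])
decoded = np.linalg.pinv(F_s.T) @ received
print(np.allclose(decoded, A.T @ x))                 # True: the original task is recovered
```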
V. THE NOISY SETUP

Prior to this paper, many papers have discussed the straggler problem in the framework described above. We choose to discuss another perspective that was left unexplored: noisy computation. Assume a simple model of a distributed computation system designed with an encoding method chosen to better handle stragglers. After the task is transmitted to each computational node, node $i$ returns the output $\mathbf{u}_i = A_i\mathbf{x}$ with some noise, denoted $\mathbf{z}$. While the computation noise might not be independent of the value of $\mathbf{u}_i$, we choose to approximate the noise as an additive i.i.d. process in order to simplify the model. The returned transmission is therefore

$$\mathbf{r}_i = \mathbf{u}_i + \mathbf{z} = A_i\mathbf{x} + \mathbf{z} \qquad (6)$$

After gathering enough $\mathbf{r}_i$'s, the reconstruction of the original task is done, under a high-SNR assumption, by using the pseudo-inverse of the encoding matrix:

$$A_{dec} = \left((F_s^T)^{*} F_s^T\right)^{-1} (F_s^T)^{*} \qquad (7)$$

The noise amplification is defined as the MSE divided by the variance of the i.i.d. noise; it equivalently defines the degradation in SNR. Therefore, when decoding with the pseudo-inverse of the encoding matrix, the noise amplification is

$$\mathrm{Noise\ Amplification} = \frac{MSE}{\sigma_z^2} = \frac{1}{k}\,\mathrm{trace}\!\left(A_{dec}A_{dec}^{*}\right) = \frac{1}{k}\,\mathrm{trace}\!\left(\left((F_s^T)^{*}F_s^T\right)^{-1}\right) \qquad (8)$$
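A minimal sketch of the pseudo-inverse decoder (7) and of the noise-amplification formula (8) as stated above; the frame and the surviving set are random illustrative choices, and the $1/k$ normalization simply follows (8).

```python
import numpy as np

rng = np.random.default_rng(3)
m, n, k = 4, 8, 6

# A unit-norm-column frame F (m x n); F^T is the code generator matrix
F = rng.normal(size=(m, n)) + 1j * rng.normal(size=(m, n))
F /= np.linalg.norm(F, axis=0)

survivors = sorted(rng.choice(n, size=k, replace=False))
F_s = F[:, survivors]                        # erased frame, dim m x k

# Pseudo-inverse decoder of equation (7)
FsT = F_s.T
A_dec = np.linalg.inv(FsT.conj().T @ FsT) @ FsT.conj().T

# Noise amplification per equation (8)
amplification = np.real(np.trace(A_dec @ A_dec.conj().T)) / k
print(amplification)
```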
Lemma 1 [2]. The noise amplification of any unit-norm-column matrix $A$ with $\dim(A) = h \times \ell$ is lower bounded by its frame aspect ratio $\gamma$.

Because the noise amplification is directly related to $\mathrm{trace}\big(((F_s^T)^{*}F_s^T)^{-1}\big)$, we choose to make all the frames discussed unit-norm, in order to later compare them properly under the noise amplification performance measure. It is important to note that transforming a code generator matrix into a unit-norm one has no effect on the "wellness" of the coding scheme, as the "new" generator matrix admits a near-identical decoding scheme to the original one (with the non-normalized generator matrix).

Proof.
Dividing all columns of the generator matrix by their respective normalization factors $\sqrt{c_{i,i}}$ is equivalent to creating a new source of $m$ elements – each new element being the same element as in the previous source set, only scaled by the normalization factor of the corresponding column index. So if the "new" source elements can be decoded by the original coding scheme, then recovering the original source elements from the decoded "new source" is trivial – which proves that the operation retains a viable decoding scheme. ∎

From this point forward, all code generator matrices discussed, unless specified otherwise, are assumed to have unit-norm columns.
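The argument above can be illustrated with a short sketch: decoding with the unit-norm generator returns the source elements scaled by the known column norms, and undoing that scaling recovers the original source (all matrices below are arbitrary illustrative choices).

```python
import numpy as np

rng = np.random.default_rng(4)
m, n = 3, 6
FT = rng.normal(size=(n, m)) * np.array([1.0, 2.5, 0.3])   # generator F^T with non-unit column norms
src = rng.normal(size=(m, 1))                               # m source elements

codeword = FT @ src                                         # encode with the original generator

# Unit-norm version of the generator and the induced "new source"
norms = np.linalg.norm(FT, axis=0)
FT_unit = FT / norms                                        # divide column j by its norm
new_src = np.linalg.pinv(FT_unit) @ codeword                # decoded "new source" = norms * src

recovered = new_src / norms[:, None]                        # undo the scaling -> original source
print(np.allclose(recovered, src))                          # True
```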
VI. CODES AS FRAMES

A. Known frames
When choosing noise amplification as the performance measure of one type of frame or another, it is important to give examples of the types of frames that serve as benchmarks for good/bad performance in this measure. Frames based on choosing a subset $\mathbf{s} \subseteq [n]$ of the rows of the $DFT(n)$ matrix can have very different noise amplifications after erasures. The first example is choosing $\mathbf{s} = [m]$ (the first $m$ consecutive rows) of the DFT matrix. The pattern created is recognized as the matrix form of a low-pass filter, as the truncated $(n - m)$ rows of the $DFT(n)$ matrix are those that give weight to the $(n - m)$ highest frequencies. A band-pass/notch filter frame is created in the same way, by choosing $\mathbf{s}$ as a consecutive series in $[n]$ with a cyclic shift modulo $n$. All these frames have the same Gram matrix, because they are identical up to scaling factors that are roots of unity (and cancel out in the Gram matrix), and so they have the same noise amplification. These frames also turn out to be very noise-amplifying [2], [8], [10].

A better frame in terms of noise amplification is a matrix $A$ with $\dim(A) = m \times n$ whose entries are i.i.d. Gaussian random variables with variance $1/m$. Any sub-matrix formed by any $k$ of the $n$ columns has, asymptotically for large $m$, columns of norm 1. It has been shown in [11] that the eigenvalue distribution of the Gram matrix of this frame converges to the Marchenko–Pastur (MP) density

$$f_{MP}(x) = \frac{\sqrt{(x - \lambda^{MP}_{-})(\lambda^{MP}_{+} - x)}}{2\pi\beta x}\cdot \mathbb{1}_{(\lambda^{MP}_{-},\,\lambda^{MP}_{+})} \qquad (9)$$

$$\lambda^{MP}_{\pm} = \left(1 \pm \sqrt{\beta}\right)^2 \qquad (10)$$

As noted in [2], the noise amplification of this type of frame is asymptotically $(\beta - 1)^{-1}$. While this result is better than the low-pass filter, we can improve on both by using the DFT matrix but choosing $\mathbf{s}$ to be a difference set. This sub-frame of the DFT was proved to be an ETF (Equiangular Tight Frame) [12]. It has been suggested that the Gram-matrix eigenvalues of ETFs are asymptotically distributed according to the MANOVA distribution [3]:

$$f_{MANOVA}(x) = \frac{\sqrt{(x - r_{-})(r_{+} - x)}}{2\pi\beta x (1 - \gamma x)} \cdot \mathbb{1}_{(r_{-},\, r_{+})} \qquad (11)$$

$$r_{\pm} = \left(\sqrt{(1 - \gamma)\beta} \pm \sqrt{1 - \gamma\beta}\right)^2 \qquad (12)$$

Notice that for $\gamma \to 0$ the MANOVA distribution converges to the MP distribution. ETFs seem to have better noise amplification [3] than most (if not all) frames with the same dimensions, and so they are a benchmark for good noise amplification. In [1] it is shown that the eigenvalue distribution of a random selection of rows $\mathbf{s} \subseteq [n]$ of the DFT matrix converges almost surely to the MANOVA distribution.
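As a small numerical illustration of these benchmarks, the sketch below compares the noise amplification of equation (8) for two sub-frames of $DFT(n)$ under random erasures: one built from $m$ consecutive rows (low-pass) and one built from a difference set of rows. The parameters $n = 21$, $m = 5$, $k = 10$ and the $(21, 5, 1)$ difference set $\{3, 6, 7, 12, 14\}$ are our illustrative choices, not values used in the paper.

```python
import numpy as np

def dft_subframe(n, rows):
    """Unit-norm-column frame made of the selected rows of the DFT(n) matrix."""
    omega = np.exp(2j * np.pi / n)
    F = omega ** np.outer(np.array(rows), np.arange(n))    # |rows| x n
    return F / np.linalg.norm(F, axis=0)

def noise_amplification(F, survivors):
    """Equation (8) for the erased frame F_s (columns of F indexed by the survivors)."""
    FsT = F[:, survivors].T
    gram = FsT.conj().T @ FsT
    return np.real(np.trace(np.linalg.inv(gram))) / len(survivors)

n, m, k = 21, 5, 10
lowpass_rows = list(range(m))                  # consecutive rows: low-pass frame
diffset_rows = [3, 6, 7, 12, 14]               # a (21, 5, 1) difference set -> ETF [12]

rng = np.random.default_rng(5)
amp = {"low-pass": [], "difference set": []}
for _ in range(500):
    survivors = rng.choice(n, size=k, replace=False)
    amp["low-pass"].append(noise_amplification(dft_subframe(n, lowpass_rows), survivors))
    amp["difference set"].append(noise_amplification(dft_subframe(n, diffset_rows), survivors))

for name, vals in amp.items():
    print(name, np.mean(vals))                 # the difference-set frame amplifies far less on average
```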
B. Polynomial codes represented as frames

Now that the motivation for discussing erasure codes and the noise amplification arising from these codes is clear, let us define and discuss the following codes and their generator matrices:
Definition 1. Polynomial code
Given two parameters $(n, m) \in \mathbb{N}$ and two sets, $\mathbf{s} \in \mathbb{F}^n$ (here $\mathbb{F}$ is either some Galois field or an infinite field) and $\mathbf{z} \subseteq [n-1] \cup \{0\}$ with $m$ elements, a polynomial code is a linear transformation of $m$ elements $A_j$ in the field $\mathbb{F}^{h \times \ell}$, in the following manner:

$$A'_i = \sum_{j=1}^{m} A_j\, s_i^{z_j} \qquad (13)$$

where the $A'_i$ are $n$ encoded elements defined over the same field as the $A_j$. The $n$ elements of $\mathbf{s}$ are also called the sample points, and the $m$ elements of $\mathbf{z}$ are also called the polynomial powers. This linear transformation can also be described in matrix form:

$$\begin{bmatrix} A'_1 \\ \vdots \\ A'_n \end{bmatrix} = \underbrace{\begin{bmatrix} s_1^{z_1} & s_1^{z_2} & \cdots & s_1^{z_m} \\ \vdots & \vdots & & \vdots \\ s_{n-1}^{z_1} & s_{n-1}^{z_2} & \cdots & s_{n-1}^{z_m} \\ s_n^{z_1} & s_n^{z_2} & \cdots & s_n^{z_m} \end{bmatrix}}_{F^T_{PC}} \begin{bmatrix} A_1 \\ \vdots \\ A_m \end{bmatrix} \qquad (14)$$

While polynomial codes are defined over an arbitrary field, we will continue to discuss only polynomial codes defined over the complex field. It is important to note that all sample points in $\mathbf{s}$ must be distinct and that $0 \notin \mathbf{s}$. Also, notice that if $\mathbf{z} = [m-1] \cup \{0\}$ then $F_{PC}$ is a Vandermonde matrix; otherwise it is a generalized Vandermonde matrix [4]. In the case of $\mathbf{z} = [m-1] \cup \{0\}$, the code is the well-known Reed–Solomon code over the complex field.

We will now define and discuss a few key examples in this family of polynomial codes:
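A short sketch of the generator matrix of equations (13)-(14); the sample points and powers below are arbitrary placeholders chosen only to show the construction.

```python
import numpy as np

def polynomial_code_generator(s, z):
    """Generalized Vandermonde generator F_PC^T of equation (14): entry (i, j) = s_i ** z_j."""
    s = np.asarray(s, dtype=complex)
    z = np.asarray(z)
    return s[:, None] ** z[None, :]          # n x m

n, m = 6, 3
s = np.exp(2j * np.pi * np.arange(n) / n)    # n distinct nonzero sample points
z = np.array([0, 1, 2])                      # consecutive powers -> Vandermonde (Reed-Solomon over C)
F_PC_T = polynomial_code_generator(s, z)

A = np.arange(m) + 1.0                       # m (scalar) source elements, for illustration
encoded = F_PC_T @ A                         # A'_i = sum_j A_j * s_i ** z_j, as in (13)
print(encoded.shape)                         # (n,)
```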
Definition 2. Polynomial code with uniform sampling of the unit circle (USPC)
We define a polynomial code with uniform sampling as a polynomial code with $\mathbf{z}$ consisting of $m$ consecutive members of $[n-1] \cup \{0\}$ and with the sample points in $\mathbf{s}$ given by

$$s_j = \exp\!\left(\frac{2\pi i}{n}\cdot(j-1)\right) = \omega^{\,j-1} \qquad (15)$$

For $\mathbf{z} = [m-1] \cup \{0\}$, these samples create the following code generator matrix:

$$F^T_{USPC} = \frac{1}{\sqrt{m}} \begin{bmatrix} 1 & 1 & \cdots & 1 \\ 1 & \omega & \cdots & \omega^{m-1} \\ \vdots & \vdots & & \vdots \\ 1 & \omega^{n-1} & \cdots & \omega^{(n-1)(m-1)} \end{bmatrix} \qquad (16)$$

If $\mathbf{z} \neq [m-1] \cup \{0\}$ (but still consecutive), then each row $i$ of the frame is the corresponding row of $F^T_{USPC}$ multiplied by a unimodular factor $s_i^{z_1}$; this scaling does not affect the Gram matrix, so the unit-norm transform of the frame is equivalent, in terms of noise amplification, to the unit-norm transform of $F^T_{USPC}$. Notice that the frame defined in equation (16) is identical to the $DFT(n)$ matrix with the latter $(n - m)$ columns omitted and then multiplied by $\sqrt{n/m}$ (due to the frame normalization factor). As mentioned in Section VI-A, this is recognized as a low-pass filter, and as a frame it has been shown to be very noise-amplifying.
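The stated relation between the unit-norm USPC generator (16) and the truncated $DFT(n)$ matrix can be checked directly; $n$ and $m$ below are arbitrary, and the DFT matrix is written with the same sign convention as (16).

```python
import numpy as np

n, m = 12, 5
omega = np.exp(2j * np.pi / n)

# Equation (16): unit-norm USPC generator (n x m)
F_uspc_T = omega ** np.outer(np.arange(n), np.arange(m)) / np.sqrt(m)

# Unitary DFT(n) matrix (same sign convention as (16)) with the latter (n - m) columns omitted
W = omega ** np.outer(np.arange(n), np.arange(n)) / np.sqrt(n)
print(np.allclose(F_uspc_T, np.sqrt(n / m) * W[:, :m]))    # True
```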
Definition 3. Polynomial code with non-uniform sampling of the unit circle (NUSPC)
We define a polynomial code with non-uniform sampling of the unit circle as a variation of the code with uniform sampling. The NUSPC is defined by the parameters $(n, m, b, r) \in \mathbb{N}$ and a set $\mathbf{y}$ of $\frac{r}{b}$ elements with $\mathbf{y} \subseteq [r-1] \cup \{0\}$. With these parameters, the NUSPC sample point set is

$$\mathbf{s} = \left\{ \omega^{\frac{y_j + r\cdot\alpha}{b}} : y_j \in \mathbf{y},\ \alpha \in \mathbb{N} \right\} \qquad (17)$$

Notice that the constraint on the number of elements in $\mathbf{y}$ is imposed so that the number of unique sample points in $\mathbf{s}$ remains $n$. We can also see that a USPC is a special case of a NUSPC, created by choosing $y_j$ that are all equally spaced in $[r-1] \cup \{0\}$. This means the NUSPC also has the potential to be very noise-amplifying.
Definition 4. Polynomial code with uniform sampling of the unit circle and non-consecutive powers
The two codes defined in Definitions 2 and 3 vary the choice of sample points. This choice creates variations/irregularities in the dimension that is subject to erasures. We would also like to introduce irregularities in the dimension that is not subject to erasures. We define this type of polynomial code as one that uses the same samples as in Definition 2, but chooses the set $\mathbf{z}$ so that it does not contain a consecutive series in $[n-1] \cup \{0\}$. The discussion in Subsection VI-A implies the following lemma:

Lemma 2. If $\mathbf{z}$ is a difference set, and the sample points are as described in (15), then this polynomial code is an ETF.

More generally, we expect that even if $\mathbf{z}$ is chosen wisely yet is not a difference set (e.g., randomly), the code will still have noise amplification that is close to the good benchmark of ETFs.
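A quick numerical check of Lemma 2 on a small case: with $n = 7$, $m = 3$ and the difference set $\mathbf{z} = \{1, 2, 4\}$ in $\mathbb{Z}_7$ as the powers (an illustrative choice on our part), all pairs of distinct frame vectors have the same absolute inner product, i.e., the frame is equiangular.

```python
import numpy as np

n, m = 7, 3
z = np.array([1, 2, 4])                        # a (7, 3, 1) difference set in Z_7
omega = np.exp(2j * np.pi / n)

# Unit-norm polynomial-code frame: column i is (1/sqrt(m)) * (omega**(i*z_1), ..., omega**(i*z_m))
F = omega ** np.outer(z, np.arange(n)) / np.sqrt(m)    # m x n

G = np.abs(F.conj().T @ F)                     # absolute values of the Gram matrix
off_diag = G[~np.eye(n, dtype=bool)]
print(np.allclose(off_diag, off_diag[0]))      # True: equiangular
print(off_diag[0], np.sqrt((n - m) / (m * (n - 1))))   # both equal the Welch bound
```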
VII. NUMERICAL RESULTS

In order to show the performance of the codes described in Section VI-B, we show empirical results for codes with $m = 50$ and $n = m/\gamma$, with $\gamma$ in a wide range. The number of erased nodes out of the total number of nodes was set to a fixed fraction (a constant $k/n$ ratio). For each choice of $m$, $\gamma$, and code type, 200 random codes fitting the description of a valid code in Definitions 2, 3 and 4 were drawn, and the noise amplification was averaged over the trials. The best noise amplification of a code with the given set of $m$, $\gamma$ was then chosen. The comparison is shown in the graph in Figure 3.

Fig. 3. Comparison of noise amplification vs. $\gamma^{-1}$ for different polynomial codes

As theorized in Subsection VI-B, it is clear that the polynomial code created by choosing non-consecutive powers outperforms polynomial codes with consecutive powers. We also see that, as expected, the code created by choosing non-consecutive powers closely mimics the noise amplification of a matrix with eigenvalues drawn from the MANOVA probability density function. In relation to the discussion around Definition 4, we conclude that introducing irregularities in the dimension that is not subject to erasures is preferable to only introducing irregularities in the same dimension as the erasures. This phenomenon is interesting and should be investigated in future work.

We also see that using non-uniform sampling of the unit circle does not seem to have a clear advantage over uniform sampling for $\gamma$'s much greater than the $k/n$ ratio. The reason for this behavior is that the worst-case noise amplifications under uniform sampling are much higher than in the non-uniform sampling case; but outside of these few outliers (which also considerably drive up the mean noise amplification), uniform sampling has lower noise amplification than most variations of the non-uniform sampling codes. In an average $\log(\mathrm{Noise\ Amplification})$ measure, the uniform sampling code outperforms the non-uniform sampling code for all $\gamma$'s tested.
VIII. FUTURE WORK

While we defined polynomial codes over an arbitrary field $\mathbb{F}$, we chose to analyze codes defined over the complex field in order to draw from theoretical results in frame theory and random matrix theory. In the future we hope to analyze codes defined over finite fields, where the design guidelines are not as immediate. As discussed in Section V, the noise model might be too simplistic, and future work might expand it to some other form of non-additive noise.
IX. ACKNOWLEDGMENTS

We would like to thank Itzhak Tamo for the insightful discussion on suitable erasure codes for distributed computation systems. This research was partially supported by a grant from the Israel Science Foundation.
REFERENCES

[1] B. Farrell, Limiting empirical singular value distribution of restrictions of the discrete Fourier transform, Journal of Fourier Analysis and Applications, 17.4 (2011): 733-753.
[2] M. Haikin and R. Zamir, Analog coding of a source with erasures, 2016 IEEE International Symposium on Information Theory (ISIT), IEEE, 2016.
[3] M. Haikin, R. Zamir and M. Gavish, Frame moments and Welch bound with erasures, 2018 IEEE International Symposium on Information Theory (ISIT), IEEE, 2018.
[4] E. R. Heineman, Generalized Vandermonde determinants, Transactions of the American Mathematical Society, 31.3 (1929): 464-476.
[5] K. S. Kedlaya and C. Umans, Fast polynomial factorization and modular composition, SIAM Journal on Computing, 40.6 (2011): 1767-1802.
[6] K. Lee, M. Lam, R. Pedarsani, D. Papailiopoulos and K. Ramchandran, Speeding up distributed machine learning using codes, IEEE Transactions on Information Theory, 64.3 (2017): 1514-1529.
[7] S. Li and A. S. Avestimehr, Coded Computing: Mitigating Fundamental Bottlenecks in Large-Scale Distributed Computing and Machine Learning, (2020), pp. 66-102.
[8] A. Mashiach and R. Zamir, Noise-shaped quantization for nonuniform sampling.
[9] N. Raviv, I. Tamo, R. Tandon and A. G. Dimakis, Gradient coding from cyclic MDS codes and expander graphs, IEEE Transactions on Information Theory, 66.12 (2020): 7475-7489.
[10] D. Seidner and M. Feder, Noise amplification of periodic nonuniform sampling, IEEE Trans. Signal Process., vol. 48, no. 1, pp. 275-277, Jan. 2000.
[11] A. M. Tulino and S. Verdú, Random Matrix Theory and Wireless Communications, Now Publishers Inc., 2004.
[12] P. Xia, S. Zhou and G. B. Giannakis, Achieving the Welch bound with difference sets, IEEE Transactions on Information Theory, 51.5 (2005).