Adaptive Private Distributed Matrix Multiplication
Rawad Bitar, Marvin Xhemrishi and Antonia Wachter-Zeh
Institute for Communications Engineering, Technical University of Munich, Munich, Germany
{rawad.bitar, marvin.xhemrishi, antonia.wachter-zeh}@tum.de

Abstract
We consider the problem of designing codes with flexible rate (referred to as rateless codes) for private distributed matrix-matrix multiplication. A master server owns two private matrices A and B and hires worker nodes to help compute their multiplication. The matrices should remain information-theoretically private from the workers. Codes with fixed rate require the master to assign tasks to the workers and then wait for a predetermined number of workers to finish their assigned tasks. The size of the tasks, hence the rate of the scheme, depends on the number of workers that the master waits for. We design a rateless private matrix-matrix multiplication scheme, called RPM3. In contrast to fixed-rate schemes, our scheme fixes the size of the tasks and allows the master to send multiple tasks to the workers. The master keeps sending tasks and receiving results until it can decode the multiplication, rendering the scheme flexible and adaptive to heterogeneous environments. Despite resulting in a smaller rate than known straggler-tolerant schemes, RPM3 provides a smaller mean waiting time of the master by leveraging the heterogeneity of the workers. The waiting time is studied under two different models for the workers' service time. We provide upper bounds for the mean waiting time under both models. In addition, we provide lower bounds on the mean waiting time under the worker-dependent fixed service time model.

Index Terms
Private rateless codes, double-sided private matrix multiplication, partial stragglers, information-theoretic privacy
I. INTRODUCTION
Matrix-matrix multiplication is at the core of several machine learning algorithms. In such applications the multiplied matrices are large and distributing the computation over several machines is required. We consider the problem in which a master server owns two private matrices A and B and wants to compute C = AB. The master splits the computation into smaller tasks and distributes them to several worker nodes that can run those computations in parallel. However, waiting for all workers to finish their tasks suffers from the presence of slow processing nodes [2], [3], referred to as stragglers. The presence of stragglers can outweigh the benefit of parallelism, see e.g., [4]–[6] and references therein. Moreover, the master's data must remain private from the workers. We are interested in information-theoretic privacy, which does not impose any constraints on the computational power of the compromised workers. On the other hand, information-theoretic privacy assumes that the number of compromised workers is limited by a certain threshold. We consider applications where the resources of the workers are different, limited and time-varying. Examples of this setting include edge computing, in which the devices collecting the data (e.g., sensors, tablets, etc.) cooperate to run the intensive computations. In such applications, the workers have different computation power, battery life and network latency, which can change over time. We refer to this setting as the heterogeneous and time-varying setting. We develop a coding scheme that allows the master to offload the computational tasks to the workers while satisfying the following requirements: i) leverage the heterogeneity of the workers, i.e., assign a number of tasks to the workers that is proportional to their resources; ii) adapt to the time-varying nature of the workers; and iii) maintain the privacy of the master's data. We focus on matrix-matrix multiplication since this is a building block of several machine learning algorithms [7], [8]. We use coding-theoretic techniques to encode the tasks sent to the workers. We illustrate the use of codes to distribute the tasks in the following example.

Example 1.
Let A ∈ F_q^{r×s} and B ∈ F_q^{s×ℓ} be two private matrices owned by the master, who wants to compute C = AB. The master has access to 5 workers, out of which at most 2 can be stragglers. The workers do not collude, i.e., the workers do not share with each other the tasks sent to them by the master. To encode the tasks, the master generates two matrices R ∈ F_q^{r×s} and S ∈ F_q^{s×ℓ} uniformly at random and independently from A and B. The master creates two polynomials f(x) = R(1 − x) + Ax and g(x) = S(1 − x) + Bx. The data sent to worker i is f(a_i) and g(a_i), i = 1, . . . , 5, where a_i ∈ F_q \ {1}. Each worker computes h(a_i) ≜ f(a_i)g(a_i) = RS(1 − a_i)² + RBa_i(1 − a_i) + ASa_i(1 − a_i) + ABa_i² and sends the result to the master. When the master receives at least three evaluations of h(x) ≜ f(x)g(x), it can interpolate the polynomial of degree 2. In particular, the master can compute AB = h(1). On a high level, the privacy of A and B is maintained because each matrix is masked by a random matrix before being sent to a worker.

Parts of the results of this work were presented at the IEEE International Symposium on Information Theory and its Applications (ISITA) [1]. This work was partly supported by the Technical University of Munich - Institute for Advanced Studies, funded by the German Excellence Initiative and European Union Seventh Framework Programme under Grant Agreement No. 291763. The multiplication and addition within the polynomials is element-wise, e.g., each element of A is multiplied by x.

TABLE I: A depiction of the tasks sent to the workers in Example 2.
Round 1 — workers 1, 2, 3: f_1^{(1)}(a_i) = R_1(1 − a_i) + A_1a_i and g_1^{(1)}(a_i) = S_1(1 − a_i) + Ba_i.
Round 1 — workers 4, 5: f_1^{(2)}(a_i) = R_1(1 − a_i) + A_2a_i and g_1^{(2)}(a_i) = S_1(1 − a_i) + Ba_i.
Round 2 — workers 1, 2, 3: f_2^{(1)}(a_i) = R_2(1 − a_i) + (A_1 + A_2)a_i and g_2^{(1)}(a_i) = S_2(1 − a_i) + Ba_i.

In Example 1, even if there are no stragglers, the master ignores the responses of two workers. In addition, all the workers obtain computational tasks of the same complexity. We highlight in Example 2 the main ideas of our scheme, which allows the master to assign tasks of different complexity to the workers and use all the responses of the non-stragglers.
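Before turning to Example 2, the following is a minimal numerical sketch of Example 1 over the small prime field F_11. The field size, matrix dimensions and all function names are ours, chosen purely for illustration; the paper works over an arbitrary F_q.

```python
# Minimal sketch of Example 1 over F_q with q = 11 (illustrative values).
import numpy as np

q = 11
rng = np.random.default_rng(0)

A = rng.integers(0, q, size=(2, 2))   # private matrix A
B = rng.integers(0, q, size=(2, 2))   # private matrix B
R = rng.integers(0, q, size=(2, 2))   # random mask for A
S = rng.integers(0, q, size=(2, 2))   # random mask for B

def f(x): return (R * (1 - x) + A * x) % q   # f(x) = R(1-x) + Ax
def g(x): return (S * (1 - x) + B * x) % q   # g(x) = S(1-x) + Bx

# Evaluation points must avoid 1, since f(1) = A and g(1) = B would leak data.
points = [2, 3, 4]
results = [(f(a) @ g(a)) % q for a in points]  # h(a_i) computed by worker i

def lagrange_at_one(pts):
    """Lagrange coefficients for evaluating a degree-2 polynomial at x = 1."""
    coeffs = []
    for i, ai in enumerate(pts):
        num, den = 1, 1
        for j, aj in enumerate(pts):
            if j != i:
                num = num * (1 - aj) % q
                den = den * (ai - aj) % q
        coeffs.append(num * pow(den, -1, q) % q)
    return coeffs

lam = lagrange_at_one(points)
C = sum(l * h for l, h in zip(lam, results)) % q
assert np.array_equal(C, (A @ B) % q)  # the master recovers AB = h(1)
```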
Example 2. Consider the same setting as in Example 1. Assume that workers 1, 2 and 3 are more powerful than the others. The master splits A into A = [A_1^T A_2^T]^T and wants C = [(A_1B)^T (A_2B)^T]^T. The master divides the computation into two rounds. In the first round, the master generates two matrices R_1 ∈ F_q^{r/2×s} and S_1 ∈ F_q^{s×ℓ} uniformly at random and independently from A and B. The master creates four polynomials:

f_1^{(1)}(x) = R_1(1 − x) + A_1x,
f_1^{(2)}(x) = R_1(1 − x) + A_2x,
g_1^{(1)}(x) = g_1^{(2)}(x) = S_1(1 − x) + Bx.

The master sends f_1^{(1)}(a_i) and g_1^{(1)}(a_i) to workers 1, 2, 3, and sends f_1^{(2)}(a_i) and g_1^{(2)}(a_i) to workers 4, 5, where a_i ∈ F_q \ {0, 1}. Workers 1, 2, 3 compute h_1^{(1)}(a_i) ≜ f_1^{(1)}(a_i)g_1^{(1)}(a_i) and workers 4, 5 compute h_1^{(2)}(a_i) ≜ f_1^{(2)}(a_i)g_1^{(2)}(a_i). The master starts round 2 when workers 1, 2, 3 finish their tasks. It generates two random matrices R_2 ∈ F_q^{r/2×s} and S_2 ∈ F_q^{s×ℓ} and creates

f_2^{(1)}(x) = R_2(1 − x) + (A_1 + A_2)x,
g_2^{(1)}(x) = S_2(1 − x) + Bx,

and sends evaluations to the first three workers, which compute h_2^{(1)}(a_i) ≜ f_2^{(1)}(a_i)g_2^{(1)}(a_i). One main component of our scheme is to generate Ã_1 ≜ A_1, Ã_2 ≜ A_2 and Ã_3 ≜ A_1 + A_2 as Fountain-coded [9]–[12] codewords of A_1 and A_2. The tasks sent to the workers are depicted in Table I.

Decoding C: The master has two options:
1) Workers 4 and 5 finish their first task before workers 1, 2, 3 finish their second tasks, i.e., there are no stragglers. The master interpolates h_1^{(1)}(x) and obtains h_1^{(1)}(1) = A_1B and h_1^{(1)}(0) = R_1S_1. Notice that h_1^{(2)}(0) = h_1^{(1)}(0) = R_1S_1. Thus, the master has three evaluations of h_1^{(2)}(x) and can interpolate it to obtain A_2B = h_1^{(2)}(1).
2) Workers 4 and 5 are stragglers and do not finish their first task before workers 1, 2, 3 finish their second tasks. The master interpolates (decodes) both h_1^{(1)}(x) and h_2^{(1)}(x). In particular, the master obtains A_1B = h_1^{(1)}(1) and A_2B = (A_1 + A_2)B − A_1B = h_2^{(1)}(1) − A_1B.

On a high level, the privacy of A and B is maintained because each matrix is masked by a different random matrix before being sent to a worker.

In this paper, we generalize the setting of Example 2 and show how to encode the data and generate the tasks sent to the workers in such a heterogeneous setting. We compare the proposed scheme to schemes with fixed straggler tolerance and to perfect load balancing, where we assume the master knows how the resources of the workers change over time.
Related work:
The use of codes to mitigate stragglers in distributed linear computations was first proposed in [6] without privacy constraints. Several works, such as [12]–[25], propose different techniques improving on [6] and provide fundamental limits on distributed computing. Of particular importance is the work of [12], where the authors show how to construct rateless codes for non-private distributed matrix-matrix multiplication. Straggler mitigation with privacy constraints is considered in [26]–[39]. The majority of the literature assumes a fixed threshold on the number of stragglers. In [26], [27] the authors consider the setting in which the number of stragglers is not known a priori and design schemes that can cope with this setting. However, [26], [27] consider the matrix-vector multiplication setting in which only the input matrix must remain private. Our proposed scheme can be seen as a generalization of the coding scheme in [27] to handle matrix-matrix multiplication. In [40], the authors characterize the regimes in which distributing the private matrix-matrix multiplication is faster than computing the product locally.

Each evaluation of the polynomial f(x) (or g(x)) is a matrix of the same dimension as A (or B). The computational complexity of a task is therefore proportional to the dimension of the created polynomial.

Contributions:
We present RPM3, a coding scheme for private matrix-matrix multiplication that has flexible straggler tolerance. Our scheme is based on dividing the input matrices into smaller parts and encoding the small parts using rateless Fountain codes. The Fountain-coded matrices are then encoded into small computational tasks (using several Lagrange polynomials) and sent to the workers. The master adaptively sends tasks to the workers. In other words, the master first sends a small task to each worker and then starts sending new small tasks to workers who finished their previous task. We show that our scheme satisfies the following properties: i) it maintains the privacy of the input matrices against a given number of colluding workers; ii) it leverages the heterogeneity of the resources at the workers; and iii) it adapts to the time-varying resources of the workers. We study the rate, i.e., the number of needed computations versus the number of assigned tasks, and the mean waiting time of the master when using RPM3. We give an upper bound on the mean waiting time of the master under two different delay models and provide a probabilistic guarantee of succeeding when given a deadline. We show that despite having a lower rate, RPM3 outperforms schemes assuming a fixed number of stragglers in heterogeneous environments. In addition, we provide lower bounds on the mean waiting time of the master. The bounds are derived by assuming perfect load balancing, i.e., assuming the master has full knowledge of the variation of the compute power at the workers during the matrix-matrix multiplication process. We shed light on the properties that the encoding functions of any flexible-rate code should satisfy in order to have a mean waiting time that matches our lower bounds. We leave the problem of finding lower bounds on the rate, and thus tighter lower bounds on the mean waiting time, as an interesting open problem.

Organization:
In Section II we set the notation and define the model. Our main results are summarized in Section III. We provide the details of our RPM3 scheme in Section IV. In Section V, we analyze the efficiency of RPM3 and compare it to the straggler-tolerant fixed-rate scheme with the best known rate. We introduce the delay models and analyze the waiting time of the master in Section VI. In Section VII we explain the perfect load balancing scheme, provide lower bounds on the mean waiting time and compare them to the mean waiting time of RPM3. We conclude the paper in Section VIII.

II. PRELIMINARIES
A. Notation
For any positive integer n we define [n] ≜ {1, . . . , n}. We denote by n the total number of workers. For i ∈ [n] we denote worker i by w_i. For a prime power q, we denote by F_q the finite field of size q. We denote by H(A) the entropy of the random variable A and by I(A; B) the mutual information between two random variables A and B. All logarithms are to the base q.

B. Problem setting
The master possesses two private matrices A ∈ F_q^{r×s} and B ∈ F_q^{s×ℓ}, uniformly distributed over their respective fields, and wants to compute C = AB ∈ F_q^{r×ℓ}. The master has access to n workers that satisfy the following properties:
- The workers have different resources. They can be grouped into c ≥ 1 clusters with n_u workers, u = 1, . . . , c, with similar resources such that Σ_{u∈[c]} n_u = n.
- The resources available at the workers can change with time. Therefore, the size of the clusters and their number can change throughout the multiplication of A and B.
- The workers have limited computational capacity.
- Up to z, 1 ≤ z < min_{u∈[c]} n_u, workers collude to obtain information about A and/or B. If z = 1, we say the workers do not collude.

The master splits A row-wise and B column-wise into m and k smaller sub-matrices, respectively, i.e., A = [A_1^T, . . . , A_m^T]^T and B = [B_1, . . . , B_k]. The master sends several computational tasks to each of the workers such that each task has the same computational complexity as A_iB_j, i ∈ [m], j ∈ [k]. After receiving km(1 + ε) responses from the workers, where 0 ≤ ε ≤ 1 is a parameter of the scheme, the master should be able to compute C = AB. The value of ε depends on the encoding strategy and decreases with km. A matrix-matrix multiplication scheme is double-sided z-private if any collection of z colluding workers learns nothing about the input matrices involved in the multiplication.

Definition 1 (Double-sided z-private matrix-matrix multiplication scheme). Let A and B be the random variables representing the input matrices. We denote by W_i the set of random variables representing all the tasks assigned to worker w_i, i = 1, . . . , n. For a set A ⊆ [n] we define W_A as the set of random variables representing all tasks sent to workers indexed by A, i.e., W_A = {W_i | i ∈ A}. Then the privacy constraint can be expressed as

I(A, B; W_Z) = 0, ∀Z ⊂ [n], s.t. |Z| = z. (1)

For regular Fountain codes, ε is small [9], [10].
Fig. 1: Average waiting time of RPM3 and the fixed-rate scheme from [31]. (a) Empirical average waiting time under model 1. (b) Empirical average waiting time under model 2. (c) Theoretical bound and empirical average waiting time under model 1. We model the response time of the individual workers as a shifted exponential random variable. We consider n = 1000 workers grouped in different clusters, referred to as setting 1, and shown in Table II. We consider two scenarios where workers of different clusters have similar service time (homogeneous environment) and different service times (heterogeneous environment). In Figure 1a we plot the average waiting time over the experiments when considering Model 1. In Figure 1b we plot the average waiting time over the experiments when considering Model 2. We observe that for both models RPM3 outperforms the scheme of [31] in a heterogeneous environment. However, when the workers of different clusters have similar service time, RPM3 outperforms the scheme of [31] only in the more accurate model 2. In Figure 1c we plot the theoretical bound (15) on the mean waiting time of RPM3 under model 1 derived in Theorem 2 and the empirical average waiting time of RPM3. We observe that the bound is a good representation of the empirical average waiting time.

Let R_i be the set of random variables representing all the computational results of w_i received at the master. Let C be the random variable representing the matrix C. The decodability constraint can be expressed as

H(C | R_1, . . . , R_n) = 0. (2)

Note that the sets R_i can be of different cardinality, and some may be empty, reflecting the heterogeneity of the system and the straggler tolerance. Let the download rate ρ of the scheme be defined as the ratio between the number of needed tasks to compute C and the number of responses sent by the workers to the master,

ρ = mk / (number of received responses).

We are interested in designing rateless double-sided z-private codes for this setting. By rateless, we mean that the download rate, or simply rate, of the scheme is not fixed a priori, but changes depending on the resources available at the workers. For instance, the rate of the scheme in Example 1 is fixed to 1/5, whereas the rate of the scheme in Example 2 is either 2/5 or 1/3 depending on the behavior of the workers.

C. Model of the service time

Workers model:
The time spent by a worker to compute a task is a shifted exponential random variable with shift s and rate λ, i.e., the probability density function of the service (upload, compute and download) time is given by [6], [23], [26], [41]

Pr_T(x) = λ exp(−λ(x − s)) if x > s, and 0 otherwise.

The service time includes the time spent to send the task from the master to the worker, the computation time at the worker and the time spent to send the result back from the worker to the master. Following the literature, we make the following assumption. Let Pr_T(x) be the probability density function of the time spent to compute the whole product AB at one worker; we assume that the time spent to compute a fraction 1/mk of the whole computation follows a scaled distribution of Pr_T(x), where the shift s becomes s/mk and the rate becomes λmk. In our delay analysis of RPM3, we consider two different models to account for sending several tasks from the master to the workers. Model 1 is rather simplistic and treats all the tasks sent to a worker as one large task. However, it provides significant engineering insights. Model 2 is more accurate and treats all tasks separately.

Model 1: Worker-dependent fixed service time.
In this model, we assume that the time spent to compute a task at worker w_i in cluster u, i = 1, . . . , n_u, is fixed to Y_u^i throughout the whole algorithm. We take Y_u^i to be a random variable that is exponentially distributed with rate λ_ukm, i.e., mean 1/(λ_ukm), for a given λ_u > 0. In addition, for each task there is an initial handshake time s_u/km, for a given s_u > 0, after which the worker computes the task. Therefore, for each worker w_i in cluster u, the time spent to compute τ_u tasks under this model is T_u^i = τ_us_u/km + τ_uY_u^i for all i = 1, . . . , n_u. That is, the random variable T_u^i is a shifted exponential random variable with shift τ_us_u/km and rate λ_ukm/τ_u. The random variable Y_u^i is different for every worker; hence the name of the model.

Model 2: Cluster-dependent additive service time.
In this model, we assume that the time spent to compute a task at worker w_i, i = 1, . . . , n_u, in cluster u at a given round is a shifted exponential random variable with shift s_u/km and rate λ_ukm, with s_u, λ_u > 0. Namely, at a given round, the service time of a worker depends only on the cluster u to which it belongs. Therefore, for each worker w_i in cluster u, the time spent to compute τ_u tasks under this model is T_u^i = Σ_{j=1}^{τ_u} X_j, where X_1, . . . , X_{τ_u} are iid random variables following the shifted exponential distribution with shift s_u/km and rate λ_ukm. That is, the random variable T_u^i follows a shifted Erlang distribution with shape τ_u and rate λ_ukm. The shape τ_u of the Erlang random variable is cluster-dependent; hence the name of the model.
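The two models can be contrasted with a short simulation. The following sketch (all parameter values and function names are ours) samples the time one worker of cluster u needs for τ_u tasks under each model; both have the same mean, but model 1 has a larger variance since the whole batch is governed by a single exponential draw.

```python
# Monte Carlo sketch of the two service-time models (illustrative values).
import numpy as np

rng = np.random.default_rng(1)
m, k = 100, 100                  # splitting parameters, so mk tasks overall
lam_u, s_u, tau = 0.5, 0.2, 8    # example rate, shift, and task count

def model1_time(n_samples):
    # Model 1: Y fixed per worker, exponential with rate lam_u * m * k;
    # total time = tau * s_u/(m*k) + tau * Y (a shifted exponential).
    Y = rng.exponential(scale=1.0 / (lam_u * m * k), size=n_samples)
    return tau * s_u / (m * k) + tau * Y

def model2_time(n_samples):
    # Model 2: each of the tau tasks takes an iid shifted exponential
    # with shift s_u/(m*k) and rate lam_u * m * k; total is shifted Erlang.
    X = rng.exponential(scale=1.0 / (lam_u * m * k), size=(n_samples, tau))
    return tau * s_u / (m * k) + X.sum(axis=1)

t1, t2 = model1_time(100_000), model2_time(100_000)
print(t1.mean(), t2.mean())  # both means equal tau*(s_u + 1/lam_u)/(m*k)
print(t1.var(), t2.var())    # model 1 has tau times the variance of model 2
```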
III. MAIN RESULTS

We introduce a new scheme for rateless private matrix-matrix multiplication that we call RPM3. This scheme allows the master to cluster the workers into groups of workers of similar speed. Then, RPM3 takes a fixed-rate private scheme and couples it with the special Fountain coding technique of [12] applied to the input matrices to create a rateless code. We prove, in Theorem 1, the properties of RPM3 when the fixed-rate scheme is based on Lagrange polynomials.

A. Rate analysis
On a high level, under privacy constraints, clustering the workers comes at the expense of reducing the efficiency of the scheme by 2z − 1 for every new cluster. We reduce this loss to z − 1 per cluster by using Lagrange polynomials that share z evaluations with each other, c.f., Remark 1. To remove the penalty of flexibility, the encoding polynomials of a rateless scheme must share 2z − 1 evaluations. We analyze the rate of RPM3 when using Lagrange polynomials and show in Lemma 1 that the rate of RPM3 is

ρ_RPM3 = \frac{mk}{2mk(1 + ε) + (z − 1)τ_c \sum_{u=1}^{c} γ_u + zτ_cγ_1},

where τ_c is the number of tasks finished by the slowest cluster c, γ_uτ_c is the number of tasks finished by cluster u and ε is the overhead required by Fountain codes. In contrast to other schemes, the rate of RPM3 does not depend on the number of workers n. Therefore, the master can design the size of the tasks to fit the computational power and storage constraints of the workers. The penalty of clustering the workers appears in the term (z − 1)τ_c Σ_{u=2}^{c} γ_u. Clustering the workers does not affect the rate of RPM3 for z = 1. However, for z > 1 the rate decreases linearly with the number of clusters and with z; hence the existence of a tradeoff between the rate and the flexibility (number of clusters) of the scheme. We compare the rate of RPM3 to the rate of the fixed-rate scheme of [31] that tolerates a fixed number of stragglers. This scheme is the closest to our model and has the best known rate among schemes that tolerate stragglers. We observe that the rate of RPM3 is lower than the rate of the scheme in [31] except for a small subset of parameters.

B. Time analysis
We analyze the waiting time of the master for both considered models of the service time. The first model is a simplified model that provides both theoretical and engineering insights about the system and its design. The second model better reflects our RPM3 scheme, but its analysis does not provide engineering insights about the system design. For both models, we give an upper bound on the probability distribution and on the mean of the waiting time of the master, Theorem 2 and Theorem 3, respectively. The bound on the probability distribution can be used to tune the parameters of the scheme so that the waiting time of the master is less than a fixed deadline with high probability. We analyze the waiting time of the master when using the scheme of [31] for both models of the service time, Corollary 2 and Corollary 3. Note that model 1 is the accurate model for fixed-rate schemes that send only one task to the workers. However, we allow the fixed-rate scheme to divide its tasks into smaller tasks to benefit from the advantages of the second model. For the first model, we give a theoretical guarantee on when the master has a smaller mean waiting time when using RPM3. However, for the second model the derived expressions are hard to analyze. We provide numerical evidence (e.g., Figures 1a and 1b) showing that when the workers have different service times (heterogeneous environment) RPM3 has a smaller waiting time for small values of z. However, when the workers have similar service times, RPM3 outperforms the fixed-rate scheme only under the second time model. We also show numerically that the provided upper bound on the mean waiting time of RPM3 under model 1 is a good representative of the actual mean waiting time, e.g., Figure 1c.

C. Comparison to perfect load balancing
To check the effect of clustering on the mean waiting time of the master, we compare RPM3 to the setting where the master has prior knowledge of the computation power of the workers. In such a setting, the master can simply assign tasks that are proportional to the compute power of the workers without the need of rateless codes. We show in Theorem 4 that, under model 1, the mean waiting time of the master when using RPM3 is away from the mean waiting time of the load balancing scheme by a factor of (τ_{u⋆}/τ_c^{(LB)}) · (λ_c/λ_{u⋆}), where τ_{u⋆} is the number of tasks computed by the slowest cluster of RPM3 and τ_c^{(LB)} is the number of tasks computed by the slowest cluster in the load balancing scheme. Since τ_c^{(LB)} depends on the rate of the scheme used for load balancing, we consider two settings. An ideal setting in which there exists a scheme with the best possible rate (n − z)/n that the master uses; schemes achieving this rate for matrix-matrix multiplication exist for z = 1. Since this rate may not always be achievable, we also present an achievable load balancing scheme based on GASP codes [29]. We derive the value of τ_c^{(LB)} under both settings and different parameter regimes, see Table III. We make a key observation when comparing RPM3 to the ideal setting and to the achievable setting with large z. Those settings are the two extreme settings of the spectrum of achievable rates.

Corollary 1 (Informal; see Corollary 4 for a formal version). The mean waiting time of the master when using RPM3 is bounded away from that of the ideal setting by

E[T_RPM3] ≤ E[T_LB] · \frac{2(1 + ε)}{R_1(z)} · \frac{λ_c}{λ_{u⋆}}.

The mean waiting time of RPM3 is bounded away from the mean waiting time of the achievable GASP scheme with large values of z by

E[T_RPM3] ≤ E[T_LB] · \frac{1 + ε}{R_2(z)} · \frac{λ_c}{λ_{u⋆}}.

Recall that 0 ≤ ε ≤ 1 is the overhead required by Fountain codes. R_1(z), R_2(z) ∈ [0, 1] are decreasing functions of z that reach 0 when z reaches its maximal value allowed by the cluster sizes, cf. (5). The functions R_1(z) and R_2(z) show the effect of the rate loss due to clustering on the waiting time. When alleviating the penalty of clustering, R_1(z) = R_2(z) = 1 and we would get back the expected bound on the waiting time, see Remark 4. Finding rateless codes with R_1(z) = 1 or R_2(z) = 1 remains an open problem, see Remark 2. In other words, if the clustering comes at no extra cost, the only increase in the waiting time would be due to the overhead of Fountain codes. Note that the rate of GASP codes for large z coincides with the rate of Lagrange polynomials. However, when comparing to the ideal scheme an extra factor of 2 is present. This is due to the fact that Lagrange polynomials have rate (n − z)/2n, which is half of the rate of the ideal scheme. The ratio λ_c/λ_{u⋆} ≤ 1 is the ratio of the expected service rate of the cluster that finishes the last task of RPM3 to that of the slowest cluster c.

IV. RPM3 SCHEME
We provide a detailed explanation of our RPM3 scheme and prove the following theorem.
Theorem 1.
Consider a matrix-matrix multiplication setting as described in Section II. The RPM3 scheme defined next is a rateless double-sided z-private matrix-matrix multiplication scheme that adapts to the heterogeneous behavior of the workers.

Proof: The proof is constructive. We give the details of the construction in Sections IV-A and IV-B. In Section IV-C we show that the master can obtain the desired computation. We prove the privacy constraint in Section IV-D.
A. Data encoding
The master divides the encoding into rounds. At a given round t, the workers are grouped into c clusters of n_u workers each, u = 1, . . . , c, with Σ_{u=1}^{c} n_u = n. We defer the clustering technique to the next section. Dividing the workers into several clusters adds flexibility in the decoding at the master: the results returned from a cluster of workers allow the master to decode new Fountain-coded computations, as explained next. We define d_1 ≜ ⌊(n_1 − 2z + 1)/2⌋ and d_u ≜ ⌊(n_u − z + 1)/2⌋ for u = 2, . . . , c. The master generates c Lagrange polynomial pairs f_t^{(u)}(x) and g_t^{(u)}(x). Each polynomial f_t^{(u)}(x) contains d_u Fountain-coded matrices Ã_{t,κ}^{(u)}, κ = 1, . . . , d_u, defined as Ã_{t,κ}^{(u)} ≜ Σ_{i=1}^{m} b_{κ,i}^{(u)} A_i, where b_{κ,i}^{(u)} ∈ {0, 1}. Similarly, each polynomial g_t^{(u)}(x) contains d_u Fountain-coded matrices B̃_{t,κ}^{(u)} ≜ Σ_{j=1}^{k} b̃_{κ,j}^{(u)} B_j, where b̃_{κ,j}^{(u)} ∈ {0, 1} are chosen randomly [12]. The distributions from which the b_{κ,i}^{(u)} and b̃_{κ,j}^{(u)} are drawn must be designed jointly as in [12] to guarantee that the master can decode AB after receiving (1 + ε)mk products of the form ÃB̃ with small values of ε. The master generates z uniformly random matrices R_{t,1}, . . . , R_{t,z} ∈ F_q^{r/m×s} and S_{t,1}, . . . , S_{t,z} ∈ F_q^{s×ℓ/k}. Note that b_{κ,i}^{(u)} also depends on t, but we remove the subscript t for ease of notation.

Remark 1 (Penalty of clustering the workers). From the definition of d_1 and d_u, u = 2, . . . , c, we can see that clustering the workers and assigning one polynomial pair to each cluster incurs an extra penalty of losing z − 1 computations per cluster. The number d_u is the number of computations of the form ÃB̃ computed at every cluster. Assigning one Lagrange polynomial pair to all the workers results in d = ⌊(n − 2z + 1)/2⌋; to incur no loss at all, every d_u would have to equal ⌊n_u/2⌋. Initially, one would expect a loss of 2z − 1 computations per cluster. We reduce this loss by allowing the encoding polynomials to share z evaluations, as explained in the decoding part. To alleviate the penalty of clustering entirely, the encoding polynomials of every round would have to share 2z − 1 evaluations. More details are given in Remark 2 after formally defining the decoding process.

Let d_max ≜ max_u d_u and let α_1, . . . , α_{d_max+z} ∈ F_q be distinct elements of F_q. The polynomials are constructed as shown in (3) and (4):

f_t^{(u)}(x) = \sum_{\delta=1}^{z} R_{t,\delta} \prod_{\nu \in [d_u+z] \setminus \{\delta\}} \frac{x - \alpha_\nu}{\alpha_\delta - \alpha_\nu} + \sum_{\delta=z+1}^{d_u+z} \tilde{A}_{t,\delta-z}^{(u)} \prod_{\nu \in [d_u+z] \setminus \{\delta\}} \frac{x - \alpha_\nu}{\alpha_\delta - \alpha_\nu}, (3)

g_t^{(u)}(x) = \sum_{\delta=1}^{z} S_{t,\delta} \prod_{\nu \in [d_u+z] \setminus \{\delta\}} \frac{x - \alpha_\nu}{\alpha_\delta - \alpha_\nu} + \sum_{\delta=z+1}^{d_u+z} \tilde{B}_{t,\delta-z}^{(u)} \prod_{\nu \in [d_u+z] \setminus \{\delta\}} \frac{x - \alpha_\nu}{\alpha_\delta - \alpha_\nu}. (4)

The master chooses n distinct elements β_i ∈ F_q \ {α_1, · · · , α_{d_max+z}}, i = 1, . . . , n. For each worker w_i, the master checks the cluster u to which this worker belongs and sends f_t^{(u)}(β_i), g_t^{(u)}(β_i) to that worker.
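As an illustration of (3), the following sketch evaluates one encoding polynomial f_t^{(u)} at a worker's point β over a prime field. It is a toy instance with made-up dimensions and names; in particular, the 0/1 combinations below are drawn uniformly instead of from the jointly designed degree distributions of [12].

```python
# Toy sketch of the task encoding in (3): the first z Lagrange coefficients
# are random masks, the next d_u are Fountain-coded blocks of A.
import numpy as np

q = 2**31 - 1
rng = np.random.default_rng(2)

def lagrange_eval(coeff_mats, alphas, x):
    """Evaluate sum_delta M_delta * prod_{nu != delta} (x-a_nu)/(a_delta-a_nu) mod q."""
    acc = np.zeros_like(coeff_mats[0])
    for d, (M, ad) in enumerate(zip(coeff_mats, alphas)):
        num, den = 1, 1
        for nu, an in enumerate(alphas):
            if nu != d:
                num = num * (x - an) % q
                den = den * (ad - an) % q
        acc = (acc + M * (num * pow(den, -1, q) % q)) % q
    return acc

z, d_u, m = 2, 3, 4                        # collusion, blocks per task, splits of A
blocks = [rng.integers(0, q, (2, 2)) for _ in range(m)]   # A_1, ..., A_m
R = [rng.integers(0, q, (2, 2)) for _ in range(z)]        # random masks
# d_u Fountain-coded combinations (here: uniform 0/1 combinations of the A_i)
A_tilde = [sum(int(b) * Ai for b, Ai in
           zip(rng.integers(0, 2, m), blocks)) % q for _ in range(d_u)]

alphas = list(range(1, d_u + z + 1))       # alpha_1, ..., alpha_{d_u+z}
beta = 101                                 # evaluation point of one worker
task = lagrange_eval(R + A_tilde, alphas, beta)  # f_t^{(u)}(beta) sent to the worker
```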
B. Clustering of the workers and task distribution

Clustering: For the first round t = 1, the master groups all the workers in one cluster of size n_1 = n. The master generates tasks as explained above and sends them to the workers. For t > 1, the master wants to put workers that have similar response times in the same cluster. In other words, workers that send their results in round t − 1 to the master within a pre-specified interval of time will be put in the same cluster. Let ∆ be the length of the time interval desired by the master. In addition to the time constraint, the number of workers per cluster and the privacy parameter z must satisfy

n_u ≥ 2z − 1 if u = 1, and n_u ≥ z + 1 otherwise. (5)

Those constraints ensure that the master can decode the respective polynomials h_t^{(u)}(x), as explained in the next section. Let η_1 be the time at which the result of the fastest worker is received by the master (at round t − 1). All workers that send their results before time η_1 + ∆ are put in cluster 1. If n_1 ≥ 2z − 1, the master moves to cluster 2. Otherwise, the master increases ∆ so that n_1 ≥ 2z − 1. The master repeats the same procedure until all the workers are put in clusters, guaranteeing n_u ≥ z + 1, u = 2, . . . , c. In the remainder of the paper we assume that the number of workers per cluster is fixed during the whole algorithm and is known as a system parameter. Over the course of the computation process, the master keeps measuring the empirical response time of the workers. The response time of a worker is the time spent by that worker to receive, compute and return the result of one task. Having those measurements, the master can update the clustering accordingly when needed, using the same time intervals.
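This clustering rule can be sketched as follows; the greedy window-growing and the fold-in of an undersized tail cluster are our simplifications of the procedure described above.

```python
def cluster_workers(response_times, delta, z):
    """Group workers by response time; grow each window until the size
    constraints in (5) are met (n_1 >= 2z-1, n_u >= z+1 for u >= 2)."""
    order = sorted(range(len(response_times)), key=lambda i: response_times[i])
    clusters = []
    i = 0
    while i < len(order):
        need = 2 * z - 1 if not clusters else z + 1
        start = response_times[order[i]]
        cluster = [order[i]]
        i += 1
        # take every worker within the window, widening it while the
        # cluster is still smaller than the privacy-imposed minimum size
        while i < len(order) and (response_times[order[i]] <= start + delta
                                  or len(cluster) < need):
            cluster.append(order[i])
            i += 1
        if len(cluster) < need and clusters:
            clusters[-1].extend(cluster)   # fold an undersized tail cluster
        else:
            clusters.append(cluster)
    return clusters

# e.g. cluster_workers([0.1, 0.12, 0.5, 0.52, 0.9], delta=0.05, z=1)
# -> [[0, 1], [2, 3, 4]]
```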
Task distribution: At the beginning of the algorithm, the master generates tasks assuming all workers are in the same cluster and sends those tasks to the workers. For round 2, the master arranges the workers in their respective clusters and sends tasks accordingly. Afterwards, when the master receives a task from worker w_i, it checks at which round t_i this worker is (how many tasks the worker has finished so far) and to which cluster u it belongs. The master generates f_{t_i+1}^{(u)}(x), g_{t_i+1}^{(u)}(x) if w_i is the first worker of cluster u to finish round t_i, and sends f_{t_i+1}^{(u)}(β_i), g_{t_i+1}^{(u)}(β_i) to w_i.

Choosing the β_i's carefully is needed to maintain the privacy constraints, as explained in the sequel. In this section, all variables depend on t; however, we omit t for clarity of presentation. To avoid idle time at the workers, the master can measure the expected computation time of each worker at round t_i − 1. Using this information, the master can then send a task to a worker in a way that the worker receives the task right after finishing its current computation. This guarantees that the worker will not be idle during the transmission of tasks to and from the master. See [20] for more details.

C. Decoding
At a given round t, the master first waits for the n_1 fastest workers, which belong to cluster 1, to finish computing their tasks so that it can interpolate h_t^{(1)}(x). This is possible because the master obtains n_1 = 2d_1 + 2z − 1 evaluations of h_t^{(1)}(x), equal to the degree of h_t^{(1)}(x) plus one. By construction, for a given t, the polynomials f_t^{(u)}(x) and g_t^{(u)}(x) share the same random matrices as coefficients, see (3) and (4). Thus, for i = 1, . . . , z, the polynomials h_t^{(u)}(x) share the following z evaluations:

h_t^{(1)}(α_i) = h_t^{(2)}(α_i) = · · · = h_t^{(c)}(α_i) = R_{t,i}S_{t,i}. (6)

Therefore, the master can interpolate h_t^{(u)}(x) when n_u workers of cluster u, u = 2, . . . , c, return their results. This is possible because the master receives n_u = 2d_u + z − 1 evaluations of h_t^{(u)}(x) and possesses the z evaluations shared with h_t^{(1)}(x). Allowing the polynomials to share the randomness enables us to reduce the number of workers needed from every cluster u > 1 by z workers. After successfully interpolating a polynomial h_t^{(u)}(x) for a given round t and a cluster u, the master computes d_u products of Fountain-coded matrices

h_t^{(u)}(α_{κ+z}) = Ã_{t,κ}^{(u)}B̃_{t,κ}^{(u)} (7)

for κ = 1, . . . , d_u. The master feeds those d_u computations to a peeling decoder [9]–[12] and continues this process until the peeling decoder can successfully decode all the components of the matrix C, thus allowing flexibility in the rate and leveraging the rateless property. The peeling decoder works by searching for a received computation which is equal to one of the components of the desired matrix C, i.e., a Fountain-coded matrix product ÃB̃ = A_iB_j. If such a computation exists, the decoder extracts (decodes) its value and then subtracts it from all other Fountain-coded packets that contain it as a summand. This procedure is done iteratively until the decoding succeeds, i.e., until all of the components of C are decoded.
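The peeling step can be sketched as follows, with scalars standing in for the matrix products ÃB̃ and with a packet representation that is ours.

```python
# Sketch of a peeling decoder over Fountain-coded products: each received
# packet is (set of component indices (i, j), sum of A_i B_j over that set).
def peel(packets, total):
    """packets: list of (index_set, value); total: number of C-components."""
    decoded = {}
    progress = True
    while progress and len(decoded) < total:
        progress = False
        for idx_set, val in packets:
            undecoded = idx_set - decoded.keys()
            if len(undecoded) == 1:              # degree-one packet found
                (ij,) = undecoded
                # subtract every already-known summand from the packet
                decoded[ij] = val - sum(decoded[p] for p in idx_set - {ij})
                progress = True
    return decoded

# Toy run: C components (0,0) -> 5 and (0,1) -> 7; the packets encode
# {(0,0)} and {(0,0), (0,1)} respectively.
pkts = [({(0, 0)}, 5), ({(0, 0), (0, 1)}, 12)]
print(peel(pkts, total=2))   # {(0, 0): 5, (0, 1): 7}
```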
D. Proof of double-sided privacy

Since the master generates new random matrices at each round, it is sufficient to prove that the privacy constraint given in (1) holds at each round separately. The proof is rather standard and follows the same steps as [27], [32]. We give a complete proof in Appendix A and provide next a sketch of the proof. Let W_{i,t} be the set of random variables representing the tasks sent to worker w_i at round t. For a set A ⊆ [n] we define W_{A,t} as the set of random variables representing the tasks sent to the workers indexed by A at round t, i.e., W_{A,t} ≜ {W_{i,t} | i ∈ A}. We want to prove that at every round t

I(A, B; W_{Z,t}) = 0, ∀Z ⊂ [n], s.t. |Z| = z. (8)

To prove (8), it is enough to show that, given the input matrices A and B, any collection of z workers w_{i_1}, . . . , w_{i_z} can use the tasks given to them at round t to obtain the random matrices R_{t,1}, . . . , R_{t,z} and S_{t,1}, . . . , S_{t,z}. Decoding the random matrices is possible due to the use of Lagrange polynomials and setting the random matrices as the first z coefficients.
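For concreteness, the key step of this sketch can be written out as follows (our rendering of the standard Lagrange argument; the full proof is in Appendix A). Let ℓ_δ(x) ≜ Π_{ν∈[d_u+z]\{δ}} (x − α_ν)/(α_δ − α_ν) denote the Lagrange basis polynomials appearing in (3). Given A and B, z colluding workers indexed by Z = {i_1, . . . , i_z} can compute the Fountain-coded part of their shares and subtract it, leaving the linear system

\sum_{\delta=1}^{z} R_{t,\delta}\, \ell_\delta(\beta_{i_j}) = f_t^{(u)}(\beta_{i_j}) - \sum_{\delta=z+1}^{d_u+z} \tilde{A}_{t,\delta-z}^{(u)}\, \ell_\delta(\beta_{i_j}), \qquad j = 1, \ldots, z.

Since a polynomial of degree d_u + z − 1 is uniquely determined by its values at the d_u points α_{z+1}, . . . , α_{d_u+z} together with any z further distinct points, this system has a unique solution in R_{t,1}, . . . , R_{t,z} (and similarly for the S_{t,δ} from g_t^{(u)}), which is exactly the property used in the sketch above.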
V. RATE ANALYSIS

We assume that workers in the same cluster have very similar response times. We compare RPM3 to the scheme in [31], which has an improved rate over plainly using Lagrange polynomials but does not exist for all values of m and k.

Rate of RPM3:
Let τ_u be the number of rounds finished (tasks successfully computed) by all the workers in cluster u, u = 1, . . . , c. There exist real numbers γ_u ≥ 1 for u ∈ [c] such that τ_u = γ_uτ_c. This means that the number of tasks computed by workers in cluster u is γ_u times the number of tasks computed by workers in the slowest cluster c. Given the values of τ_1, . . . , τ_c, we can compute the rate of RPM3 as follows.
Consider a private distributed matrix-matrix multiplication with n workers out of which at most z can collude. Let the input matrices A and B be split into m and k submatrices, A = [A_1^T, . . . , A_m^T]^T and B = [B_1, . . . , B_k], respectively. Let c be the number of clusters of workers and τ_u be the number of rounds in which the polynomials h_t^{(u)}(x), t = 1, . . . , τ_u, are interpolated at the master. Then, for an ε overhead required by the Fountain code decoding process, the rate of RPM3 under this setting is

ρ_RPM3 = \frac{mk}{2mk(1 + ε) + (z − 1)τ_c \sum_{u=1}^{c} γ_u + zτ_cγ_1}. (9)
Proof: We count the number of results N (the denominator of the rate) collected by the master at the end of the computation process. From each cluster u, u = 1, . . . , c, the master collects n_uτ_u results. Recall that n_1 = 2d_1 + 2z − 1 and n_u = 2d_u + z − 1 for u = 2, . . . , c. We can write

N = \sum_{u=1}^{c} n_uτ_u = \sum_{u=2}^{c} (2d_u + z − 1)τ_u + (2d_1 + 2z − 1)τ_1
  = 2\sum_{u=1}^{c} d_uτ_u + (z − 1)\sum_{u=1}^{c} τ_u + zτ_1
  = 2mk(1 + ε) + (z − 1)τ_c \sum_{u=1}^{c} γ_u + zτ_cγ_1. (10)

Equation (10) follows from the fact that Σ_{u=1}^{c} d_uτ_u = mk(1 + ε). This is true because the master needs mk(1 + ε) different values of Ã_{i,t}^{(u)}B̃_{j,t}^{(u)} in total to compute AB, and each interpolated polynomial h_t^{(u)}(x) encodes d_u such values.

Lemma 1 shows a tradeoff between the rate of the scheme and its adaptivity to heterogeneous systems. Dividing the workers into c clusters and sending several polynomials to the workers affects the rate of the scheme. The loss in the rate appears in the term (z − 1)τ_c Σ_{u=2}^{c} γ_u. However, sending several polynomials to the workers gives the master the flexibility to assign a number of tasks proportional to the resources of the workers, hence increasing the speed of the computing process.

Remark 2 (Properties of the optimal encoding polynomials). A flexible-rate scheme that clusters the workers into c clusters has the same rate as a fixed-rate scheme using the same encoding polynomials if the encoding polynomials h_t^{(1)}(x) and h_t^{(u)}(x), u = 2, . . . , c, share 2z − 1 evaluations. In addition, h_t^{(u)}(x) can be interpolated from the shared evaluations and an additional n_u evaluations of the form h_t^{(u)}(β_i), i ∈ [n_u]. This constraint can be interpreted as finding

f_t^{(u)}(x) = r_t(x) + p_t^{(u)}(x),  g_t^{(u)}(x) = s_t(x) + q_t^{(u)}(x),

such that h_t^{(1)}(x) = f_t^{(1)}(x)g_t^{(1)}(x) and h_t^{(u)}(x) = f_t^{(u)}(x)g_t^{(u)}(x) share 2z − 1 evaluations for u = 2, . . . , c. The polynomials r_t(x) and s_t(x) do not depend on the cluster number and are the polynomials that encode the randomness to guarantee privacy. The polynomials p_t^{(u)}(x) and q_t^{(u)}(x) encode the Fountain-coded data and change from one cluster to another to guarantee that the master obtains new coded packets. For z = 1, RPM3 satisfies this property and thus has optimal encoding polynomials. The problem of finding optimal encoding polynomials for z > 1 is left open.

The main property of RPM3 is that the rate of the scheme is independent of the degree of the encoding polynomials and of the number of available workers n. The rate only depends on the number of tasks assigned to the workers in the different clusters. This property reflects the ability of RPM3 to flexibly assign the tasks to the workers based on their available resources. In addition, it reflects the fact that RPM3 can design tasks of arbitrarily small size to fit the computational power of the available workers.
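As a quick numeric sketch of (9) (all parameter values below are made up):

```python
def rpm3_rate(m, k, z, gammas, tau_c, eps):
    """Rate of RPM3 per (9); gammas[u] = tau_u / tau_c, gammas[-1] = 1."""
    denom = (2 * m * k * (1 + eps)
             + (z - 1) * tau_c * sum(gammas)
             + z * tau_c * gammas[0])
    return m * k / denom

# Example: the rate decreases linearly in z and in the number of clusters.
print(rpm3_rate(m=2000, k=3000, z=10, gammas=[12, 9, 6, 3, 1], tau_c=50, eps=0.05))
```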
Comparison to the scheme of [31]: We refer to this scheme as the KES scheme for brevity. The KES scheme has the highest rate amongst known schemes that tolerate stragglers and have a model similar to the one considered in this paper. In particular, it has a better rate than naively using Lagrange polynomials to send the tasks to the workers. The better rate is achieved by carefully choosing the coefficients and the degrees of x in f_t^{(u)}(x) and g_t^{(u)}(x) to reduce the number of evaluations needed from the workers to interpolate h_t^{(u)}(x).
One could use the polynomials of the schemes in [29], [31] instead of Lagrange polynomials to potentially improvethe rate of RPM3. However, the polynomials h ( u ) t constructed in [29], [31] are not guaranteed to share any evaluations (thusmay require a large number of workers per cluster) and do not exist for all values of m and k . For values of m u , k u and n u , u = 1 , . . . , c, for which the polynomials in [29], [31] can be used per cluster, the rate of RPM3 is improved if the followingholds c X u =1 m u k u mk ≥ c X u =1 d u . We assume that the master sends several tasks to the workers. Let m I be the number of rows in A i encoded for a given task.Let k I be the number of columns in B i encoded for a given task. Each task is of size m I k I where ( m I + z )( k I +1) − n − n s . . . Number of colluding workers z R a t e KES scheme [31]RPM3 - HomogeneousRPM3 - Heterogeneous (a) Rate of RPM3 and the KESscheme.
Fig. 2: Comparison between the rate of RPM3 and the KES scheme for the first setting (see Table II). (a) Rate of RPM3 and the KES scheme. (b) Empirical average waiting time under model 1. (c) Empirical average waiting time under model 2. The KES scheme has a higher rate than RPM3 for this particular setting. Note that RPM3 is restricted to z ≤ 110 because of the clustering (see (5)), while for the KES scheme z is restricted due to the rate calculation (see (11)). Figures 2b and 2c are an extension of Figures 1a and 1b that shows the mean waiting time for a larger range of z. The rate of RPM3 depends on the relative computation power γ_u of the different clusters. The γ_u's can be considered as random variables; thus, the plot in 2a shows the expected rate of RPM3 rather than the conventional rate. For the KES scheme, the rate is fixed and depends only on the number of workers, the number of stragglers and the privacy parameter z. Under model 1 the KES scheme outperforms RPM3 for homogeneous clusters, and RPM3 outperforms the KES scheme for heterogeneous clusters for small values of z. However, under model 2 RPM3 outperforms the KES scheme for both heterogeneous and homogeneous clusters for the values of z for which it has a non-zero rate.

Numerical comparison:
We compare numerically the rate ρ_KES of the KES scheme and the rate ρ_RPM3 of RPM3. We plot ρ_RPM3 and ρ_KES in Figures 2a and 3a for n = 1000, m = 2000, k = 3000, c = 5 and the system parameters summarized in Table II. We set the maximum number of tolerated stragglers n_s for the KES scheme to be equal to the number of workers in the slowest cluster, i.e., n_s = n_5.
TABLE II: Parameters for the numerical simulations. The first columns represent two different settings, where the clusters are of different sizes; the other columns show the relative computation powers γ_u of the clusters. The respective expected service rates of the workers in each cluster in the homogeneous and heterogeneous scenarios complete the table.

Setting 1: (n_1, . . . , n_5) = (220, 240, 160, 150, 230), (γ_1, . . . , γ_5) = (12, 9, 6, 3, 1).
Setting 2: (n_1, . . . , n_5) = (220, 300, 190, 160, 130), (γ_1, . . . , γ_5) = (100, 60, 10, 3, 1).

For both settings we consider two scenarios: homogeneous and heterogeneous clusters, i.e., workers of different clusters with similar or different expected service rates. The KES scheme tolerates a maximal value of z for each setting; this restriction is dictated by requiring that the values of m_I and k_I satisfy (m_I + z)(k_I + 1) − 1 ≤ n − n_s. RPM3 tolerates z ≤ 110 for both settings, since it is restricted by the number of workers in cluster 1, see (5). The rate of the KES scheme depends on n_s. For the two considered settings, the rate behaviour is different. For the first setting, the rate of RPM3 is always smaller than the rate of the KES scheme (Figure 2a). However, RPM3 can have a higher rate than the KES scheme in the second setting (Figure 3a). More precisely, RPM3 has a better rate for large values of z. Despite the rate loss due to the Lagrange polynomials and the overhead penalty of clustering, the decrease in rate of RPM3 is slower in z, which allows RPM3 to have a higher rate for large values of z.

In [1] we defined a scaled version of this rate to reflect the size of the tasks of the KES scheme in comparison to those sent using RPM3. However, we change the definition of the rate here to keep it a number between 0 and 1. We shall explain the effect of the size of the tasks in the sequel.

Analytical comparison:
We compare the rates of RPM3 and the KES scheme as follows:

\frac{ρ_KES}{ρ_RPM3} = \frac{m_Ik_I \left( 2mk(1 + ε) + (z − 1)τ_c \sum_{u=1}^{c} γ_u + zτ_cγ_1 \right)}{mk \left( (m_I + z)(k_I + 1) − 1 \right)}. (12)

Let D ≜ (z − 1)τ_c \sum_{u=1}^{c} γ_u + zτ_cγ_1. From (12) we deduce that ρ_RPM3 is larger than ρ_KES when the following holds.
\frac{D m_Ik_I}{mk} ≤ m_I + z(k_I + 1) − m_Ik_I(1 + 2ε) − 1.

For small values of m and k, the left-hand side is larger than the right-hand side and therefore the rate of the KES scheme is better than the rate of RPM3. However, in the regime of interest where the tasks sent to the workers are small, i.e., m and k are large, the inequality depends mostly on z and k_I. For small values of z the left-hand side is larger than the right-hand side. However, for large values of k_I, when z increases the left-hand side grows more slowly than the right-hand side, which makes RPM3 better for larger z. For z = 1, the left-hand side is equal to τ_cγ_1m_Ik_I/mk and the right-hand side is equal to n − n_s − 2m_Ik_I(1 + ε). When z increases by ∆z, m_I and/or k_I decrease so that (m_I + z)(k_I + 1) − 1 ≤ n − n_s, and τ_c increases because the d_u decrease and Σ_{u=1}^{c} d_uτ_cγ_u = mk(1 + ε). The values of γ_u do not depend on z. For fixed k_I, m_I + z remains constant. Hence, m_Ik_I decreases as k_I∆z and the right-hand side of the inequality increases by 2k_I∆z(1 + ε), whereas on the left-hand side D increases with z through ∆τ_c and m_Ik_I decreases as k_I∆z. Despite the loss in rate for RPM3, the crucial advantage of RPM3 is the reduced time spent at the master to finish its computation. In RPM3, the master waits until each worker of the slowest cluster computes τ_c tasks, whereas in the KES scheme the master waits until every non-straggling worker computes ⌈m/m_I⌉⌈k/k_I⌉ tasks. In particular, assume that the slowest non-straggler in the KES scheme belongs to the slowest cluster in RPM3. If τ_c < ⌈m/m_I⌉⌈k/k_I⌉, then in RPM3 the master waits for the slowest workers to compute a smaller number of tasks, which increases the speed of the computation with high probability. In Figures 2b and 2c we plot the average waiting time at the master for the same schemes and parameters used for Figure 2a when the time spent at the workers to compute a task is an exponential random variable. To understand the improvement brought by RPM3, we analyze next the waiting time at the master for different schemes and show for which parameter regimes RPM3 outperforms the KES scheme.
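The comparison of (9) and (11), together with the equivalent condition above, can be checked numerically as follows (all parameter values are made up):

```python
# Sketch comparing rho_RPM3 from (9) with rho_KES from (11), and checking
# the equivalent condition derived above.
def rho_kes(m_I, k_I, z):
    return m_I * k_I / ((m_I + z) * (k_I + 1) - 1)

def rho_rpm3(m, k, z, gammas, tau_c, eps):
    D = (z - 1) * tau_c * sum(gammas) + z * tau_c * gammas[0]
    return m * k / (2 * m * k * (1 + eps) + D)

m, k, z, eps = 2000, 3000, 40, 0.05
gammas, tau_c = [12, 9, 6, 3, 1], 50
m_I, k_I = 25, 18

D = (z - 1) * tau_c * sum(gammas) + z * tau_c * gammas[0]
lhs = D * m_I * k_I / (m * k)
rhs = m_I + z * (k_I + 1) - m_I * k_I * (1 + 2 * eps) - 1
# lhs <= rhs  iff  rho_rpm3 >= rho_kes; both booleans below agree
print(lhs <= rhs, rho_rpm3(m, k, z, gammas, tau_c, eps) >= rho_kes(m_I, k_I, z))
```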
VI. TIME ANALYSIS

We analyze the performance of RPM3 by computing the expected waiting time of the master to compute AB, under service models 1 and 2 and some simplifying assumptions for tractability of the analysis. We compare the mean waiting time of the master when using RPM3 to that when using the KES scheme.

A. Clustering
In the encoding process of RPM3, at every round the master clusters the workers into c different clusters of fixed size n_u, u ∈ [c]. We assume that n_u is fixed during the computing process. The workers of each cluster are assumed to have similar expected speed in computing a new task, i.e., similar service rate λ_u. In the delay analysis, we assume that each worker of cluster u, u ∈ [c], has a compute time following a shifted exponential distribution with service rate λ_u and shift s_u. Therefore, at every completed round the master obtains n_u tasks computed at rate λ_u and shift s_u, for all u ∈ [c].

B. Decoding
Let τ_u be the number of tasks successfully computed by workers of cluster u during the whole computation process, i.e., τ_u is the number of tasks computed by all n_u workers of cluster u. Recall that Σ_{u=1}^{c} τ_ud_u ≥ km(1 + ε), so that the master receives enough packets to decode AB. The variable ε is the required overhead of Fountain codes and d_u is the number of Fountain-coded packets ÃB̃ encoded within each task sent to cluster u.

C. Waiting time
The waiting time, i.e., the time spent at the master to compute AB, can now be expressed as the time spent to receive the last packet from a given worker that makes Σ_{u=1}^{c} τ_ud_u ≥ km(1 + ε). Let T_u^i be the random variable representing the time spent by worker w_i in cluster u during τ_u different rounds, i.e., T_u^i is the time spent until worker w_i receives, computes and sends the results of τ_u tasks to the master. Recall that the master needs n_u responses from workers in cluster u to decode the τ_u packets. Let T_u^⋆ ≜ max_{i∈[n_u]} T_u^i be the time spent by all the workers of cluster u to receive, compute and send the results of τ_u packets to the master. The waiting time of the master is given by

T_RPM3 = max_{u∈{1,...,c}} T_u^⋆. (13)

D. Probability distribution of the waiting time
In the following analysis we ignore the real identity of the workers and only assign identities that depend on the speed of the designated worker at a given round. More precisely, instead of referring to a worker by w_j, j ∈ [n], we refer to a worker w_i of cluster u as a worker who falls into cluster u ∈ [c] at position i ∈ [n_u]. This worker w_i of cluster u can be any of the w_j's, and j could be different at different rounds of the algorithm. This abstraction will help us simplify the notation and the concepts explained in the sequel.

Theorem 2 (Waiting time under worker-dependent fixed service time). Consider a master running a private distributed matrix-matrix multiplication of two matrices A and B using RPM3. The multiplication is divided into km smaller multiplications. Consider the worker-dependent fixed service time model where λ_us_u ≜ t_m, i.e., the handshake time s_u is a constant factor of the mean service time 1/λ_u for all u ∈ [c]. Let u⋆ be the value of u that minimizes the ratio λ_u/τ_u, and let s_m ≜ max_{u∈[c]} τ_us_u. Let H_n be the nth harmonic sum, defined as H_n ≜ Σ_{i=1}^{n} 1/i. The probability distribution of the waiting time T_RPM3 of the master is bounded by

Pr(T_RPM3 > x) ≤ 1 − \left(1 − e^{t_m − \frac{λ_{u⋆}mk}{τ_{u⋆}}x}\right)^n (14)

for x ≥ s_m/km, and Pr(T_RPM3 > x) = 1 otherwise. The mean waiting time of the master is upper bounded by

E[T_RPM3] ≤ (t_m + H_n)\frac{τ_{u⋆}}{λ_{u⋆}mk}. (15)

The proof of Theorem 2 is given in Appendix B. Numerical simulations indicate that this bound is a good representation of the empirical mean waiting time. This is illustrated in Figure 1c for setting 1 described in Table II. While the bound in (14) is not surprising, it has important practical implications: it allows the master to tune the parameters and set a probabilistic deadline on the computation time. Assume the master wants a probabilistic guarantee on the maximum computing time, i.e., a probabilistic deadline t_D given by Pr(T_RPM3 > t_D) ≤ g_D, where g_D < 1 is the probabilistic guarantee. Given the number of clusters and their respective service rates λ_u, the master finds the minimum ratio λ_u/τ_u (maximum number of tasks per cluster) that satisfies

\left(1 − e^{t_m − \frac{λ_{u⋆}mk}{τ_{u⋆}}t_D}\right)^n = 1 − g_D.

Assume that for the given λ_u's the allowed values of τ_u, u = 1, . . . , c, i.e., those for which τ_u/λ_u ≤ τ_{u⋆}/λ_{u⋆}, satisfy Σ_{u=1}^{c} d_uτ_u ≥ km(1 + ε). Then, there exists at least one possible task assignment strategy that satisfies the deadline guarantee. Next we find a lower bound on the mean waiting time of the master when using the KES scheme so that we can compare the waiting time of the master under both schemes.
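This deadline-tuning use of (14) can be sketched as follows: given t_D and g_D, the displayed equality is inverted in closed form for the largest admissible ratio τ_{u⋆}/λ_{u⋆} (names and parameter values are ours).

```python
# Sketch: find the largest tau/lambda ratio (fewest resources) that still
# meets Pr(T > t_D) <= g_D under the bound (14).
import math

def max_tau_over_lambda(t_D, g_D, n, t_m, m, k):
    """Solve (1 - exp(t_m - (lam/tau)*m*k*t_D))**n = 1 - g_D for tau/lam."""
    inner = (1 - g_D) ** (1.0 / n)          # required per-worker probability
    # exp(t_m - (lam/tau)*m*k*t_D) = 1 - inner  =>  solve for tau/lam
    exponent = t_m - math.log(1 - inner)
    return m * k * t_D / exponent

ratio = max_tau_over_lambda(t_D=2.0, g_D=0.01, n=1000, t_m=0.1, m=100, k=100)
# any assignment with tau_u / lambda_u <= ratio for all u meets the deadline
print(ratio)
```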
Corollary 2.
Consider the worker-dependent fixed service time model where λ_us_u ≜ t_m, i.e., the handshake time s_u is a constant factor of the mean service time 1/λ_u for all u ∈ [c]. Let n_s be the number of stragglers and let m_I ≤ m and k_I ≤ k be the numbers of sub-matrices into which A and B must be divided to use the KES scheme. The mean waiting time E[T_KES] of the master is bounded by

E[T_KES] ≥ \frac{t_m + H_n − H_{n−n_s}}{λ_1m_Ik_I}.

The proof of Corollary 2 follows the same steps as the proof of Theorem 2; a sketch of the proof is given in Appendix B. We can now compare the mean waiting time of the master when using RPM3 and the KES scheme as follows:

\frac{E[T_RPM3]}{E[T_KES]} ≤ \frac{λ_1}{λ_{u⋆}} · \frac{t_m + H_n}{t_m + H_n − H_{n−n_s}} · \frac{m_Ik_Iτ_{u⋆}}{mk}. (16)
Fig. 3: Comparison of the performance of RPM3 and the KES scheme for setting 2, c.f., Table II. (a) Rate of RPM3 and the KES scheme. (b) Empirical average waiting time under model 1. (c) Empirical average waiting time under model 2. In contrast to setting 1 (Figure 2a), RPM3 enjoys a higher rate for large values of z, regardless of the clustering environment. In addition, for large z, the waiting time of RPM3 is smaller for both settings under both service time models.

The inequality in (16) can be understood as follows. The first ratio λ_1/λ_{u⋆} indicates how fast the workers of cluster 1 are compared to cluster u⋆. This is an artifact of our bounding technique. To understand the remaining ratios, recall that (m_I + z)(k_I + 1) − 1 ≤ n − n_s. For a fixed n_s, the ratio

\frac{t_m + H_n}{t_m + H_n − H_{n−n_s}} ≈ \frac{t_m + \log n}{t_m + \log\frac{n}{n − n_s}}

reflects the speed-up brought by mitigating the stragglers. Here we approximate H_n by H_n ≈ log(n) + γ, where γ ≈ 0.577 is the Euler-Mascheroni constant. The ratio m_Ik_Iτ_{u⋆}/mk is the most important one. It reflects the respective number of tasks assigned to the workers under the different schemes. RPM3 assigns τ_{u⋆} tasks of size 1/mk each to the workers of cluster u⋆, i.e., the slowest cluster, whereas the KES scheme assigns one task of size 1/(m_Ik_I), or equivalently mk/(m_Ik_I) tasks of size 1/mk, to all the workers. For fixed system parameters, when z increases, the size of the KES tasks increases and RPM3 is expected to outperform the KES scheme. Similarly, for large values of n_s, the size of the KES tasks increases and RPM3 is expected to outperform the KES scheme. Under the second model of the service time, we have the following upper bound on the master's mean waiting time.

Theorem 3 (Mean waiting time under cluster-dependent additive service time). Consider a master running a private distributed matrix-matrix multiplication of two matrices A and B using RPM3. The multiplication is divided into km smaller multiplications. Consider the cluster-dependent additive service time model where λ_us_u ≜ t_m, i.e., the shift s_u is a constant factor of the mean service time 1/λ_u for all u ∈ [c]. Let τ_max ≜ max_{u∈{1,...,c}} τ_u and let s_m ≜ max_u s_uτ_u. The probability distribution P(x) ≜ Pr(T_RPM3 > x) is bounded from above by

P(x) < 1 − \left(1 − \sum_{j=0}^{τ_max−1} \frac{e^{λ_cs_m − λ_ckmx}}{j!}(λ_ckmx − λ_cs_m)^j\right)^n

for x ≥ s_m/km, and is equal to 1 otherwise. The mean waiting time of the master is bounded from above by

E[T_RPM3] ≤ \frac{s_m}{km} + Φ(n), (17)

where Φ(n) is given by

Φ(n) = \frac{n}{λ_ckm(τ_max − 1)!} \sum_{j=0}^{n−1} (−1)^j \binom{n−1}{j} \sum_{ℓ=0}^{(τ_max−1)j} a_ℓ(τ_max, j)\frac{(τ_max + ℓ)!}{(j + 1)^{τ_max+ℓ+1}} (18)

and a_ℓ(τ_max, j) is the coefficient of x^ℓ in the expansion of \left(\sum_{i=0}^{τ_max−1} \frac{x^i}{i!}\right)^j.

The bound in (18) can be interpreted as the mean of the nth order statistic of n iid Erlang random variables with shift max_u s_uτ_u, service rate λ_c and shape τ_max, chosen such that cluster c has computed the highest number of tasks with respect to its service rate. The shape of an Erlang random variable is the number of iid exponential random variables being summed. Recall that the service rate λ_u reflects the expected number of tasks that a worker can process per time unit. Thus, the ratio λ_u/τ_u represents how large the number of tasks computed at cluster u is with respect to its rate.
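Since the exact expression (18) is combinatorial, a quick Monte Carlo estimate of the same bound is often handy; the following sketch (made-up parameters) draws the maximum of n iid shifted Erlang variables exactly as in the interpretation above.

```python
# Monte Carlo sketch of the Theorem 3 bound: the waiting time is bounded by
# the maximum of n iid shifted Erlang variables with shape tau_max and rate
# lam_c * k * m.
import numpy as np

rng = np.random.default_rng(3)
n, tau_max, lam_c, k, m = 1000, 10, 0.5, 100, 100
s_m = 2.0                     # max_u s_u * tau_u, per the theorem statement

samples = rng.gamma(shape=tau_max, scale=1.0 / (lam_c * k * m),
                    size=(20_000, n))          # Erlang = integer-shape Gamma
T_upper = s_m / (k * m) + samples.max(axis=1)  # shift + max over n workers
print(T_upper.mean())          # empirical estimate of the bound in (17)
```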
In implementation we expect λ u /τ u to be almost the same for all values of u and therefore we expect the upper bound to be a good estimate of the actualmean waiting time at the master.For the KES scheme, we prove that the mean waiting time of the master under this setting is bounded from below by themean of the ( n − n s ) th ordered statistic of n iid Erlang random variables with rate λ /mk and shift s mk l mm I m l kk I m and shape l mm I m l kk I m , i.e., sum of l mm I m · l kk I m iid exponential random variables. Corollary 3.
Consider the cluster-dependent additive service time model where λ u s u , t m , i.e., the shift s u is a constantfactor of the mean service time /λ u for all u ∈ [ c ] . The mean waiting time at the master when using the KES scheme thattolerates n s stragglers is bounded from below by E [ T KES ] ≥ s k I m I + Φ I ( n ) , where Φ I ( n ) is given by Φ I ( n ) = λ ( n − n s )( τ − n − n s − X j =0 ( − j (cid:18) n − n s − j (cid:19) · ( τ − n − n s + j ) X ℓ =0 a ℓ ( τ , n − n s + j )( τ + ℓ )!( n − n s + j + 1) τ max + ℓ +1 , (19) where again a ℓ ( o, p ) is the coefficient of x ℓ in the expansion o − X i =0 x i i ! ! p . E. Numerical results
We provide numerical simulations showing the empirical average waiting time at the master for both considered modelsunder the two settings and two clustering environments summarized in Table II.
Model 1:
The empirical average waiting time under the worker-dependent fixed service time model is simulated numericallyand shown in Figure 2b and Figure 3b for settings 1 and 2, respectively. Recall that in this model we assume that the timespent to compute τ tasks at a worker is a scaled shifted exponential random variable. For the first setting, the KES schemeenjoys a smaller waiting time in the homogeneous scenario. When the workers compute powers are different, i.e., in theheterogeneous scenario, RPM3 has a slightly better waiting time for small values of z . However, for large values of z theoverhead of clustering becomes significant. Thus, the KES scheme outperforms RPM3 in this parameter regime. Model 2:
The empirical average waiting time under the cluster-dependent additive service time model is simulated numericallyand shown in Figure 2c and Figure 3c for settings 1 and 2, respectively. Recall that in this model we assume that the timespent to compute τ tasks at a worker is the sum of τ iid shifted exponential random variables, i.e., it is a random variable thatfollows a shifted Erlang distribution. We make a small change to the KES scheme. We assign multiple smaller tasks to theworkers, rather than one large task as is done in the original scheme of [31]. This change allows a fair comparison to RPM3under model 2. Interestingly, regardless of the heterogeneity of the clusters, RPM3 outperforms the KES scheme for the valuesof z it can tolerate. As expected, in the heterogeneous scenario RPM3 has much smaller average waiting time compared to theKES scheme. This is an important observation, since model 2 is a better approximation of the real waiting time. The reason isthat in RPM3 the master sends multiple independent tasks to the workers and it is more realistic to assign them independentrandom variables. VII. P ERFECT L OAD B ALANCING
In this section we study the waiting time of the master under perfect load balancing in the worker-dependent fixed servicetime model (model 1). To make load balancing possible, we assume that the master has previous knowledge of the behaviorof the workers during the run time of the algorithm. More precisely, the master knows the overall computing power of eachworker, i.e., the master knows s i and λ i , for all i = 1 , . . . , n . Given this knowledge, the master knows the number of clusters c , number of workers per cluster n u , u = 1 , . . . , c , and can assign tasks that are proportional to the workers computing powerwithout the need of a rateless code. This knowledge alleviates the need of assigning one polynomial per cluster and payingthe extra penalty of privacy per cluster. TABLE III: The value of τ c under perfect load balancing when using different schemes. The acronym LB refers to loadbalancing. The values of m ui and k ui are the number of matrices in which A and B are divided according to the rate of eachscheme and the value of the number of workers per cluster n u . The ratio of τ c of two schemes is proportional to the ratio ofthe mean waiting time of those schemes under service model 1, see (23). Setting RPM3 LB: GASP low z LB: GASP large zτ c mk (1+ ǫ ) γ j n − z +12 k + P cu =2 ⌊ nu − z +12 ⌋ γ u mk P cu =1 γ u n u − P cu =1 ( γ u − γ u +1 )( k u + m u ) − γ ( z + z −
3) 2 mk P cu =1 γ u n u − γ (2 z − Setting LB: GASP z = 1 LB: GASP medium z LB: Ideal scheme τ c mk P cu =1 γ u n u − P cu =1 ( γ u − γ u +1 )( k u + m u ) mk P cu =1 γ u n u − P cu =1 ( γ u − γ u +1 )( k u + zm u ) − γ ( z − mk P cu =1 γ u n u − γ z The master groups workers with similar s i and λ i in the same cluster. Let τ ( LB ) u be the number of tasks assigned to workersin cluster u , u = 1 , . . . , c . The master chooses the values of τ ( LB ) u such that τ ( LB ) u ( s u + λ u ) is a constant so that the averagecompute time is the same at all the workers. We assume that s u λ u is a constant, hence τ ( LB ) u depends on λ u . We write τ ( LB ) u asa function of τ ( LB ) c , i.e., τ ( LB ) u = γ u τ ( LB ) c . Recall that the master divides A into m sub-matrices A i , i = 1 . . . , m, and B into k sub-matrices B j , j = 1 , . . . , k . Thus, the master needs mk computations of the form A i B j . Given a double-sided z -privatetask assignment to n workers, the number of computations of the form A i B j that can be computed depends on the rate ofthe scheme used to encode the tasks. Therefore, the total number of tasks that the master needs to assign to the workers, andhence the value of τ ( LB ) u , depends on the rate of the scheme.Given the value of τ ( LB ) c , the mean waiting time of the master under perfect load balancing is characterized as follows. Theorem 4.
Given a scheme that assigns τ ( LB ) c to the workers of cluster c . The mean waiting time at the master using loadbalancing under the worker-dependent fixed service time model is given by E [ T LB ] = s c τ ( LB ) c km + H n τ ( LB ) c λ c mk . (20) Therefore, when λ u s u = t m , the mean waiting time achievable using RPM3 is bounded away from the mean waiting time withload balancing as follows E [ T RP M ] ≤ E [ T LB ] τ u ⋆ τ ( LB ) c λ c λ u ⋆ . (21) Sketch of proof:
The proof of Theorem 4 follows the same steps of the proofs of Theorem 2 and Theorem 3. The maindifference is that τ ( LB ) u is designed after assuming that the master has a full knowledge of the future. Thus, the mean waitingtime at all the workers is the same. Therefore, the mean waiting time E [ T LB ] is exactly the n th ordered statistic of n iid randomvariables following a shifted exponential distribution with shift s c τ ( LB ) c /km and rate λ c mk/τ ( LB ) c . Under the assumption that λ u s u is a fixed constant t m , i.e., s u = t m /λ u , the value u ⋆ of u that maximizes s u τ u = τ u t m /λ u is the same as the value of u that minimizes the ratio λ u τ u . Thus, s m = s u ⋆ τ u ⋆ . We can now write E [ T LB ] E [ T RP M ] ≥ s c τ ( LB ) c km + H n τ ( LB ) c λ c mks m km + H n τ u ⋆ λ u ⋆ mk (22) = τ ( LB ) c τ u ⋆ λ u ⋆ λ c . (23)In (22) we replace E [ T RP M ] by the upper bound obtained in Theorem 2. The remaining follows from simple calculations.Thus, we only need to calculate the value of τ ( LB ) c depending on the scheme used by the master for perfect load balancing.The values of τ ( LB ) c of the different considered schemes are summarized in Table III.We are particularly interested in comparing RPM3 to two extreme settings: load balancing using a scheme with the besttheoretical rate possible, referred to as the ideal scheme; and load balancing using GASP when z takes large values. Loadbalancing using the ideal scheme provides a theoretical lower bound on the mean waiting time for the master using any possiblerateless scheme. Load balancing using GASP provides a lower bound on the mean waiting time that could be achievable ifone finds encoding polynomials with the same rate of GASP and that satisfy the properties of Remark 2. After explaining theschemes used for load balancing and finding their respective values of τ ( LB ) c , we can prove the following. Corollary 4.
For a fixed overhead of Fountain codes ǫ and a given τ ( ideal ) c for load balancing using the ideal scheme. Themean waiting time of RPM3 is bounded away from the mean waiting time of the ideal scheme as follows. E [ T RP M ] ≤ E [ T LB ] 2 γ u ⋆ λ c (1 + ǫ ) λ u ⋆ (cid:18) − P cu =1 ( z + 1) γ u P cu =1 γ u n u − γ z (cid:19) . (24) The mean waiting time of RPM3 is bounded away from the mean waiting time of load balancing using GASP scheme withlarge values of z as follows. E [ T RP M ] ≤ E [ T LB ] γ u ⋆ (1 + ǫ )1 − P cu =2 γ u ( z + 1) + 2 γ P cu =1 γ u n u − γ (2 z − λ c λ u ⋆ . (25)The ratio of the sums in both denominators of (24) and (25) is less than one. The equality is attained when z = n + 1 / n u + 1 for all u = 2 , . . . , c . Remark 4.
Finding coding strategies that achieve the mean waiting time under load balancing is equivalent to finding encodingpolynomials with the desired rates (as GASP or the ideal scheme) that share z − evaluations as explained in Remark 2.The problem of finding such polynomials, if they exist, is left open.A. Theoretical lower bound Given n workers and a double-sided z -private task assignment with no straggler tolerance, the best rate the master can hopefor is ( n − z ) /n . In other words, out of every n tasks the master obtains n − z computations of the form A i B j . This is clearfrom the privacy requirements. Any collection of z workers should learn nothing about A and B and therefore any collectionof z tasks cannot give information about A and B . This is standard in the literature of secret sharing. This rate is indeedachievable in the private matrix-vector multiplication setting where the input vector is not private. However, in the setting ofmatrix-matrix multiplication no known scheme achieves this rate. When restricting the encoding to polynomials, better boundscan be obtained by replacing the ideal rate with the lower bounds provided in [42]. We use the ideal scheme to keep our lowerbound theoretical and independent from the encoding strategy.Now we turn our attention to the task assignment. The master wants to maximize the rate to reduce the number of assignedtasks. Since the rate ( n − z ) /n is a linear function of n , the master wants to maximize n when possible. Let τ ( ideal ) u be thenumber of tasks assigned to workers in cluster u under this setting. Given the knowledge of the workers, the master assigns τ ( ideal ) c tasks with maximal rate n − z , and τ ( ideal ) c − − τ ( ideal ) c tasks with rate n − n c − z , and τ ( ideal ) c − − τ ( ideal ) c − tasks with rate n − n c − n c − − z and so on. The only constraint is the following c X u =1 ( τ ( ideal ) u − τ ( ideal ) u +1 )( n − z − c − X i = u n i +1 ) = mk, where τ ( ideal ) c +1 is defined as and P ji is defined as if j < i . Next we compute τ ( ideal ) c given mk and γ , . . . , γ c − . We firstuse the telescopic expansion to write mk = ( τ ( ideal ) c )( n − z )+ ( τ ( ideal ) c − − τ ( ideal ) c )( n − n c − z )+ ( τ ( ideal ) c − − τ ( ideal ) c − )( n − n c − n c − − z ) ... + ( τ ( ideal )1 − τ ( ideal )2 )( n − z )= c X u =2 τ ( ideal ) u n u + τ ( ideal )1 ( n − z ) . Using the notation τ ( ideal ) u = γ u τ ( ideal ) c with γ c = 1 , we can compute τ ( ideal ) c as in (26). Given the value of τ ( ideal ) c , wecharacterize the average waiting time at the master as in the first part of Corollary 4. We provide the proof here. τ ( ideal ) c = mk P cu =1 γ u n u − γ z . (26) Proof of Corollary 4:
To compare the waiting time of RPM3 to that of the ideal scheme, we express τ c as follows τ c = mk (1 + ǫ ) γ (cid:22) n − z + 12 (cid:23) + P cu =2 (cid:22) n u − z + 12 (cid:23) γ u . Now we can write τ ( ideal ) c τ u ⋆ = γ (cid:22) n − z + 12 (cid:23) + P cu =2 (cid:22) n u − z + 12 (cid:23) γ u γ u ⋆ (1 + ǫ ) ( P cu =1 γ u n u − γ z ) ≥ γ (cid:18) n − z − (cid:19) + P cu =2 (cid:18) n u − z − (cid:19) γ u γ u ⋆ (1 + ǫ ) ( P cu =1 γ u n u − γ z )= ( P cu =1 ( n u − z − γ u − γ z )2 γ u ⋆ (1 + ǫ ) ( P cu =1 γ u n u − γ z )= 12 γ u ⋆ (1 + ǫ ) (cid:18) − P cu =1 ( z + 1) γ u P cu =1 γ u n u − γ z (cid:19) . We conclude the proof of the first part by combining the above inequality together with (23).
B. Load balancing using GASP
Given n workers and a z -private task assignment with no straggler tolerance, the best achievable rate under our modelis obtained by using GASP codes [29]. The rate of these codes depends on the privacy parameter z and on the number ofsub-matrices A i and B j used in one encoding of GASP. We divide our analysis into four parts accordingly. a) No collusion: For z = 1 , let m and k be the number of sub-matrices that A and B are divided into, respectively. Then,the rate achieved by GASP is equal to ( k m ) / ( k m + k + m ) . Therefore, given n workers the master obtains n − k − m computations of the form A i B j . Notice that the rate of the scheme depends on n since n must satisfy n = m k + m + k .Since the rate is a linear function of n , the master wants to maximize n when possible. Let τ ( no col ) u be the number of tasksthat could be assigned to workers in cluster u when using this scheme. When using GASP, the value of m and k dependson the number of workers. We denote by m u and k u the number of divisions when the tasks are sent to all workers in clusters to u . Thus, the master assigns τ ( no col ) c tasks with maximal rate n − k c − m c to all n workers, and τ ( no col ) c − − τ ( no col ) c taskswith rate n − n c − k c − − m c − to the n − n c workers not in the slowest cluster c and so on. The only constraint on τ ( no col ) u is the following c X u =1 (cid:16) τ ( no col ) u − τ ( no col ) u +1 (cid:17) n − k u − m u − c − X i = u n i ! = mk, where τ ( no col ) c +1 , m c +11 and k c +11 are defined as and P ji is defined as if j < i . Next we compute τ ( no col ) c given mk and γ , . . . , γ c − .Using the telescopic expansion we write mk = c X u =1 τ ( no col ) u n u − c X u =1 ( τ ( no col ) u − τ ( no col ) u +1 )( k u + m u ) . Using the notation τ ( no col ) u = γ u τ ( no col ) c with γ c = 1 and γ c +1 = 0 , we can compute τ ( no col ) c as τ ( no col ) c = mk P cu =1 γ u n u − P cu =1 ( γ u − γ u +1 )( k u + m u ) . (27) b) Small number of collusion: Let m and k be the number of sub-matrices that A and B are divided into, respectively.For ≤ z < min { m , k } the rate achieved by GASP is equal to ( k m ) / ( k m + k + m + z + z − . Therefore, given n workers the master obtains n − k − m − z − z + 3 computations of the form A i B j . Notice that also here the rate of thescheme depends on n since n must satisfy n = k m + k + m + z + z − . Since the rate is a linear function of n , themaster wants to maximize n when possible. Let τ ( S-GASP ) u be the number of tasks that could be assigned to workers in cluster u when using this scheme. Thus, the master assigns τ ( S-GASP ) c tasks with maximal rate n − k c − m c − z − z + 3 to all n workers, and τ ( S-GASP ) c − − τ ( S-GASP ) c tasks with rate n − n c − k c − − m c − − z − z + 3 to the n − n c workers not in cluster C and so on. Using the notation τ ( S-GASP ) u = γ u τ ( S-GASP ) c with γ c = 1 and γ c +1 = 0 , we can compute τ ( S-GASP ) c by following thesame steps as above. We express τ ( S-GASP ) c as mk P cu =1 γ u n u − P cu =1 ( γ u − γ u +1 )( k u + m u ) − γ ( z + z − . (28) c) Medium number of collusion: Let m and k be the number of sub-matrices that A and B are divided into, respectively.Let m ≤ k . For m ≤ z < k the rate achieved by GASP is equal to ( k m ) / (( k + z )( m + 1) − . This rate coincideswith the rate of [31]. For m ≥ k , the values of m and k are interchanged in the rate. Therefore, given n workers themaster obtains n − k − zm − z + 1 computations of the form A i B j . 
Since the rate is a linear function of n , the masterwants to maximize n when possible. Let τ ( L-GASP ) u be the number of tasks that could be assigned to workers in cluster u whenusing this scheme. Thus, the master assigns τ ( L-GASP ) c tasks with maximal rate n − k c − zm c − z + 1 to all workers, and τ ( L-GASP ) c − − τ ( L-GASP ) c tasks with rate n − n c − k c − − zm c − − z + 1 to the n − n c workers not in cluster c and so on.Using the notation τ ( L-GASP ) u = γ u τ ( L-GASP ) c with γ c = 1 and γ c +1 = 0 , we can compute τ ( L-GASP ) c by following the samesteps as above. We express τ ( L-GASP ) c as mk P cu =1 γ u n u − P cu =1 ( γ u − γ u +1 )( k u + zm u ) − γ ( z − . (29) d) Large number of collusion: Let m and k be the number of sub-matrices that A and B are divided into, respectively.For max { m , k } ≤ z the rate achieved by GASP is equal to ( k m ) / (2 m k + 2 z − . Notice that this rate coincides withthe rate of regular Lagrange polynomials that we use in our scheme. Given n workers the master obtains ⌊ ( n − z + 1) / ⌋ computations of the form A i B j . Since the rate is a linear function of n , the master wants to maximize n when possible. Let τ ( Lag ) u be the number of tasks that could be assigned to workers in cluster u when using this scheme. Thus, the master assigns τ ( Lag ) c tasks with maximal rate ⌊ ( n − z + 1) / ⌋ to all n workers, and τ ( Lag ) c − − τ ( Lag ) c tasks with rate ⌊ ( n − n c − z + 1) / ⌋ to the n − n c workers not in cluster c and so on. We can now prove the second part of Corollary 4. Proof of Corollary 4 (Continued):
Assuming that n u is even for all u = 1 , . . . , c and using the notation τ ( Lag ) u = γ u τ ( Lag ) c with γ c = 1 , we can compute τ ( Lag ) c by following the same steps as above. We express τ ( Lag ) c as τ ( Lag ) c = 2 mk P cu =1 γ u n u − γ (2 z − . (30)We compare τ ( Lag ) c to τ c obtained for RPM3. Recall that τ c is expressed as τ c = mk (1 + ǫ ) γ (cid:22) n − z + 12 (cid:23) + P cu =2 (cid:22) n u − z + 12 (cid:23) γ u . Following the steps of the proof of the first part of this corollary we can write τ ( Lag ) c τ u ⋆ ≥ P cu =1 n u γ u − P cu =1 γ u ( z + 1) − γ zγ u ⋆ (1 + ǫ ) ( P cu =1 γ u n u − γ (2 z − γ u ⋆ (1 + ǫ ) (cid:18) − P cu =2 γ u ( z + 1) + 2 γ P cu =1 γ u n u − γ (2 z − (cid:19) . Thus, we prove the second part of Corollary 4. VIII. C
ONCLUSION
We considered the heterogeneous and time-varying setting of private distributed matrix-matrix multiplication. The workershave different computing and communication resources that can change over time. We designed a scheme called RPM3 thatallows the master to group the workers into clusters with similar resources. The workers are assigned a number of tasksproportional to their overall available resources, i.e., faster workers compute more tasks and slower workers compute lesstasks. This flexibility increases the speed of the computation.We analyzed the rate of RPM3 and the mean waiting time of the master under two models of the workers service times.Using RPM3 results in a smaller mean waiting time than known fixed-rate straggler-tolerant schemes. The reduction of themean waiting time is possible by leveraging the heterogeneity of the workers. We provide lower bounds on the mean waitingtime of the master under the worker-dependent fixed service time model. The lower bounds are obtained by assuming perfectload balancing, i.e., the master has full knowledge of the future behavior of the workers. In terms of rate, RPM3 has a worserate than rates of known fixed-rate straggler-tolerant schemes. We show that there exists a tradeoff between the flexibility ofRPM3 and its rate. Dividing the workers into several clusters provides a good granularity for the master to refine the taskassignment. However, increasing the number of clusters negatively affects the rate of RPM3. Finding lower bounds on the rate of flexible schemes and finding codes that achieve this rate are left as open problems. Therate of a flexible scheme using polynomials is affected by two factors. The first factor is the rate of the polynomial h ( u ) t ( x ) assigned to each cluster u at round t . The rate of h ( u ) t ( x ) is the ratio of the number of computations of the form A i B i tothe number of evaluations needed to interpolate h ( u ) t ( x ) . Increasing the rate of h ( u ) t ( x ) increases the rate of the scheme. Thesecond factor is the number of evaluations that are shared between h (1) t ( x ) and all other h ( u ) t ( x ) for u = 2 , . . . , c . Allowing alarger number of shared evaluations increases the rate of the scheme, see Remark 1 and Remark 2. Finding lower bounds onthe rate of a flexible scheme or lower bounds on the number of evaluations that any two polynomials h ( u ) t ( x ) and h ( u ) t ( x ) can share, implies finding better lower bounds on the mean waiting time of the master.R EFERENCES[1] R. Bitar, M. Xhemrishi, and A. Wachter-Zeh, “Rateless codes for private distributed matrix-matrix multiplication,” 2020.[2] J. Dean and L. A. Barroso, “The tail at scale,”
Communications of the ACM , vol. 56, no. 2, pp. 74–80, 2013.[3] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, A. Senior, P. Tucker, K. Yang, Q. V. Le, et al. , “Large scale distributed deep networks,”in
Advances in neural information processing systems , pp. 1223–1231, 2012.[4] G. Ananthanarayanan, A. Ghodsi, S. Shenker, and I. Stoica, “Effective straggler mitigation: Attack of the clones,” in
Presented as part of the 10thUSENIX Symposium on Networked Systems Design and Implementation (NSDI 13) , pp. 185–198, 2013.[5] G. Liang and U. C. Kozat, “Fast cloud: Pushing the envelope on delay performance of cloud storage with coding,”
IEEE/ACM Transactions on Networking ,vol. 22, no. 6, pp. 2012–2025, 2014.[6] K. Lee, M. Lam, R. Pedarsani, D. Papailiopoulos, and K. Ramchandran, “Speeding up distributed machine learning using codes,”
IEEE Transactions onInformation Theory , vol. 64, no. 3, pp. 1514–1529, 2017.[7] J. A. Suykens and J. Vandewalle, “Least squares support vector machine classifiers,”
Neural processing letters , vol. 9, no. 3, pp. 293–300, 1999.[8] G. A. Seber and A. J. Lee,
Linear regression analysis , vol. 329. John Wiley & Sons, 2012.[9] D. J. MacKay, “Fountain codes,”
IEEE Proceedings-Communications , vol. 152, no. 6, pp. 1062–1068, 2005.[10] M. Luby, “Lt codes,” in
The 43rd Annual IEEE Symposium on Foundations of Computer Science (FOCS) , pp. 271–280, 2002.[11] A. Shokrollahi, “Raptor codes,”
IEEE/ACM Transactions on Networking (TON) , vol. 14, no. SI, pp. 2551–2567, 2006.[12] A. K. Pradhan, A. Heidarzadeh, and K. R. Narayanan, “Factored LT and factored raptor codes for large-scale distributed matrix multiplication,”
CoRR ,vol. abs/1907.11018, 2019.[13] A. Mallick, M. Chaudhari, and G. Joshi, “Rateless codes for near-perfect load balancing in distributed matrix-vector multiplication,” arXiv preprintarXiv:1804.10331 , 2018.[14] T. Baharav, K. Lee, O. Ocal, and K. Ramchandran, “Straggler-proofing massive-scale distributed matrix multiplication with d -dimensional product codes,”in IEEE International Symposium on Information Theory (ISIT) , pp. 1993–1997, 2018.[15] S. Wang, J. Liu, and N. Shroff, “Coded sparse matrix multiplication,” arXiv preprint arXiv:1802.03430 , 2018.[16] Q. Yu, M. Maddah-Ali, and S. Avestimehr, “Polynomial codes: an optimal design for high-dimensional coded matrix multiplication,” in
Advances inNeural Information Processing Systems (NIPS) , pp. 4403–4413, 2017.[17] S. Li, M. A. Maddah-Ali, Q. Yu, and A. S. Avestimehr, “A fundamental tradeoff between computation and communication in distributed computing,”
IEEE Transactions on Information Theory , vol. 64, no. 1, pp. 109–128, 2018.[18] Q. Yu, M. A. Maddah-Ali, and A. S. Avestimehr, “Straggler mitigation in distributed matrix multiplication: Fundamental limits and optimal coding,”
IEEE Transactions on Information Theory , vol. 66, no. 3, pp. 1920–1933, 2020.[19] M. Fahim, H. Jeong, F. Haddadpour, S. Dutta, V. Cadambe, and P. Grover, “On the optimal recovery threshold of coded matrix multiplication,” in , pp. 1264–1270, 2017.[20] Y. Keshtkarjahromi, Y. Xing, and H. Seferoglu, “Dynamic heterogeneity-aware coded cooperative computation at the edge,” arXiv preprint,rXiv:1801.04357v3 , 2018.[21] H. A. Nodehi and M. A. Maddah-Ali, “Secure coded multi-party computation for massive matrix operations,” arXiv preprint arXiv:1908.04255 , 2019.[22] A. Behrouzi-Far and E. Soljanin, “Efficient replication for straggler mitigation in distributed computing,” arXiv preprint arXiv:2006.02318 , 2020.[23] P. Peng, E. Soljanin, and P. Whiting, “Diversity vs. parallelism in distributed computing with redundancy,” in
IEEE International Symposium onInformation Theory (ISIT) , pp. 257–262, 2020.[24] S. Wang, J. Liu, and N. Shroff, “Coded sparse matrix multiplication,” in
Proceedings of the 35th International Conference on Machine Learning (J. Dyand A. Krause, eds.), vol. 80 of
Proceedings of Machine Learning Research , (Stockholmsm¨assan, Stockholm Sweden), pp. 5152–5160, PMLR, 10–15Jul 2018.[25] A. Severinson, A. Graell i Amat, and E. Rosnes, “Block-diagonal and LT codes for distributed computing with straggling servers,”
IEEE Transactionson Communications , vol. 67, no. 3, pp. 1739–1753, 2019.[26] R. Bitar, P. Parag, and S. El Rouayheb, “Minimizing latency for secure coded computing using secret sharing via Staircase codes,”
IEEE Transactionson Communications , 2020.[27] R. Bitar, Y. Xing, Y. Keshtkarjahromi, V. Dasari, S. El Rouayheb, and H. Seferoglu, “Private and rateless adaptive coded matrix-vector multiplication,” arXiv preprint arXiv:1909.12611 , 2019.[28] H. Yang and J. Lee, “Secure distributed computing with straggling servers using polynomial codes,”
IEEE Transactions on Information Forensics andSecurity , vol. 14, no. 1, pp. 141–150, 2018.[29] R. G. D’Oliveira, S. El Rouayheb, and D. Karpuk, “GASP codes for secure distributed matrix multiplication,”
IEEE Transactions on Information Theory ,vol. 66, no. 7, pp. 4038–4050, 2020.[30] W.-T. Chang and R. Tandon, “On the capacity of secure distributed matrix multiplication,” in , pp. 1–6, IEEE, 2018.[31] J. Kakar, S. Ebadifar, and A. Sezgin, “Rate-efficiency and straggler-robustness through partition in distributed two-sided secure matrix computation,” arXiv preprint arXiv:1810.13006 , 2018.[32] Q. Yu, S. Li, N. Raviv, S. M. M. Kalan, M. Soltanolkotabi, and S. A. Avestimehr, “Lagrange coded computing: Optimal design for resiliency, security,and privacy,” in
The 22nd International Conference on Artificial Intelligence and Statistics (AISTATS) , pp. 1215–1225, 2019.[33] Q. Yu and A. S. Avestimehr, “Entangled polynomial codes for secure, private, and batch distributed matrix multiplication: Breaking the “cubic” barrier,” arXiv preprint arXiv:2001.05101 , 2020.[34] M. Kim and J. Lee, “Private secure coded computation,” in
IEEE International Symposium on Information Theory (ISIT) , pp. 1097–1101, 2019.[35] J. Kakar, S. Ebadifar, and A. Sezgin, “On the capacity and straggler-robustness of distributed secure matrix multiplication,”
IEEE Access , vol. 7,pp. 45783–45799, 2019.[36] M. Aliasgari, O. Simeone, and J. Kliewer, “Private and secure distributed matrix multiplication with flexible communication load,”
IEEE Transactionson Information Forensics and Security , vol. 15, pp. 2722–2734, 2020. [37] Z. Jia and S. A. Jafar, “Cross subspace alignment codes for coded distributed batch matrix multiplication,” arXiv preprint arXiv:1909.13873 , 2019.[38] N. Mital, C. Ling, and D. Gunduz, “Secure distributed matrix computation with discrete fourier transform,” arXiv preprint arXiv:2007.03972 , 2020.[39] J. Kakar, A. Khristoforov, S. Ebadifar, and A. Sezgin, “Uplink cost adjustable schemes in secure distributed matrix multiplication,” in IEEE InternationalSymposium on Information Theory (ISIT) , pp. 1124–1129, IEEE, 2020.[40] R. G. D’Oliveira, S. El Rouayheb, D. Heinlein, and D. Karpuk, “Notes on communication and computation in secure distributed matrix multiplication,” arXiv preprint arXiv:2001.05568 , 2020.[41] G. Liang and U. C. Kozat, “TOFEC: Achieving optimal throughput-delay trade-off of cloud storage using erasure codes,” in
IEEE International Conferenceon Computer Communications (INFOCOM) , pp. 826–834, 2014.[42] G. R. D’Oliveira, S. El Rouayheb, D. Heinlein, and D. Karpuk, “Degree tables for secure distributed matrix multiplication,” in
IEEE Information TheoryWorkshop (ITW) , pp. 1–5, IEEE, 2019.[43] A. R´enyi, “On the theory of order statistics,”
Acta Mathematica Academiae Scientiarum Hungarica , vol. 4, no. 3-4, pp. 191–231, 1953.[44] S. S. Gupta, “Order statistics from the gamma distribution,”
Technometrics , vol. 2, no. 2, pp. 243–262, 1960. A PPENDIX AP ROOF OF PRIVACY
We want to prove that RPM3 maintains information theoretic privacy of the master’s data. We prove that at any given round t , the tasks sent to the workers do not reveal any information about the input matrices A and B . This is sufficient since therandom matrices generated at every round are drawn independently and uniformly at random. In other words, if the workersdo not obtain any information about A and B at any given round, then the workers obtain no information about A and B throughout the whole process.Recall that we define W i,t as the set of random variables representing the tasks sent to worker w i at round t . In addition,for a set A ⊆ [ n ] we define W A ,t as the set of random variables representing the tasks sent to the workers indexed by A atround t , i.e., W A ,t , {W i,t | i ∈ A} . The privacy constraint is then expressed as I ( A , B ; W Z ,t ) = 0 , ∀Z ⊂ [ n ] , s.t. |Z| = z. We start by proving the privacy constraint for A . For a set A ⊆ [ n ] , let F A ,t be the set of random variables representingthe evaluations of f ( u ) t ( x ) sent to workers indexed by the set A at round t . We want to prove I ( A ; F Z ,t ) = 0 , ∀Z ⊂ [ n ] , s.t. |Z| = z. Proving the satisfaction of the privacy constraint for B follows the same steps and is omitted.Let K be the set of random variable presenting the random matrices R t, , . . . , R t,z generated by the master at round t . Westart by showing that proving the privacy constraint is equivalent to proving that H ( K | F Z , A ) = 0 for all Z ⊆ [ n ] , |Z| = z .To that end we write, H ( A | F Z ) = H ( A ) − H ( F Z ) + H ( F Z | A ) (31) = H ( A ) − H ( F Z ) + H ( F Z | A ) − H ( F Z | A , K ) (32) = H ( A ) − H ( F Z ) + I ( F Z ; K | A )= H ( A ) − H ( F Z ) + H ( K | A ) − H ( K | F Z , A )= H ( A ) − H ( F Z ) + H ( K ) − H ( K | F Z , A ) (33) = H ( A ) − H ( K | F Z , A ) . (34)Equation (32) follows because H ( F Z | A, K ) = 0 , i.e., the tasks sent to the workers are a function of the matrix A and therandom matrices R t, , . . . , R t,z which is true by construction. In (33) we use the fact that the random matrices R t, , . . . , R t,z are chosen independently from A , i.e., H ( K | A ) = H ( K ) . Equation (34) follows because for any collection of z workers, themaster assigns z tasks each of which has the same dimension as R t,δ , δ ∈ { , . . . , z } . In addition, all matrices R t, , . . . , R t,z are chosen independently and uniformly at random; hence, H ( F Z ) = H ( K ) .Therefore, since the entropy H ( . ) is positive, proving that H ( A | F Z ) = H ( A ) is equivalent to proving that H ( K | F Z , A ) =0 . The explanation of H ( K | F Z , A ) = 0 is that given the matrix A and all the tasks received at round t , any collection of z workers can obtain the value of the matrices R t, , . . . , R t,z .This follows immediately from the use of Lagrange polynomials and setting the random matrices as the first z coefficients.More precisely, given the data matrix as side information, the tasks sent to any collection of z workers become the evaluationsof a Lagrange polynomial of degree z − whose coefficients are the random matrices R t, , . . . , R t,z . Thus, the workers caninterpolate that polynomial and obtain the random matrices. Therefore, by repeating the same calculations for B , we show thatinformation theoretic privacy of the input matrices is guaranteed. A PPENDIX BP ROOFS FOR THE MEAN WAITING TIME
Proof of Theorem 2:
In this model we assume that for a given worker w i in cluster u , the random variable T iu followsa shifted exponential distribution with shift s u τ u /mk and rate λ u mk/τ u . In the following we shall focus on T u since T iu depends only on the identity of the cluster u and is the same for all workers in this cluster.We can write the pdf of T u as Pr T u ( x ) = if x < s u τ u mk λ u mkτ u exp (cid:16) λ u s u − λ u mkτ u x (cid:17) otherwise . Remark 5.
It is worth noting that by coupling RPM3 with a mechanism like the one proposed in [20], we can reduce the shiftof T u to s u /mk by allowing the master to send computational tasks to the workers while they are busy computing other tasksand also allowing the workers to send results of the previous task to the master while computing a new task. Thus absorbingthe delays of communicating every task to and from the workers except for the delays of sending the first task and receivingthe last task. In our analysis we do not assume the use of such mechanism. Let { x ≥ suτumk } be the indicator function that is equal to when x ≥ s u τ u / mk and is equal to otherwise. The cumulativedensity function F T u ( x ) , Pr( T u < x ) of T u can then be expressed as F T u ( x ) = (cid:16) − e ( λ u s u − λumkτu x ) (cid:17) { x ≥ suτumk } . The random variable T ⋆u is the maximum of n u iid copies of T u . Hence, we can write F T ⋆u ( x ) = (cid:16)(cid:16) − e ( λ u s u − λumkτu x ) (cid:17) { x ≥ suτumk } (cid:17) n u . Recall that T RP M = max u ∈{ ,...,c } T ⋆u is the maximum of c independent random variables. Therefore, we can write F T RPM ( x ) = Pr( T RP M < x )= Pr( T ⋆ < x ) Pr( T ⋆ < x ) · · · Pr( T ⋆c < x )= c Y u =1 (cid:16)(cid:16) − e ( λ u s u − λumkτu x ) (cid:17) { x ≥ suτumk } (cid:17) n u . Define s m , max u s u τ u and t m , λ u s u . We consider ¯ F T RPM ( x ) , − F T RPM ( x ) and find a bound on ¯ F T RPM ( x ) asfollows ¯ F T RPM ( x ) = Pr( T RP M > x )= 1 − Pr( T RP M < x )= 1 − Pr( T ⋆ < x ) Pr( T ⋆ < x ) · · · Pr( T ⋆c < x ) . To obtain a non-trivial lower bound on ¯ F T RPM ( x ) , we consider x ≥ s m /mk and bound from below each term in theproduct Q cu =1 Pr( T ⋆u < x ) by Pr( T ⋆u < x ) ≥ (cid:18) − e (cid:16) t m − λu⋆ mkτu⋆ x (cid:17) (cid:19) . The bound is obtained by maximizing the exponent of e in F T u ( x ) because t m − λ u⋆ / τ u⋆ mkx ≤ . To see that, recall that s m = max u s u τ u . Let s m = s u τ u we can write x ≥ s u τ u mk = λ u ⋆ s u τ u λ u ⋆ mk = λ u s u mk τ u λ u ⋆ ≥ λ u s u mk τ u ⋆ λ u ⋆ . Let F X ( x ) , − e (cid:16) t m − λu⋆ mkτu⋆ x (cid:17) , we can write F T RPM ( x ) ≥ F X ( x ) n . Notice that F X ( x ) is the cumulative distribution function (CDF) of a random variable following a shifted exponentialdistribution with shift s m /km and rate λ u ⋆ km/τ u ⋆ . Given n iid random variables X , . . . , X n , we let X (1) ≤ X (2) ≤ · · · ≤ X ( n ) be the ordered values of the X i ’s (known as ordered statistics). With this notation, F T RPM ( x ) is bounded by the distribution ofthe n th ordered statistic of n random variables following a shifted exponential distribution, i.e., we have F X ( x ) n = F X ( n ) ( x ) .It follows that E [ T RP M ] = Z ∞ Pr( T RP M > x ) dx = Z ∞ (1 − F T RPM ( x )) dx ≤ Z ∞ (1 − F X ( x ) n ) dx = Z ∞ (1 − F X ( n ) ( x )) dx = E [ X ( n ) ] . (35)Next we obtain a bound on E [ X ( n ) ] the mean of X ( n ) . We express X as X = smkm + X ′ , where X ′ is a random variable following an exponential distribution with rate λ u ⋆ km/τ u ⋆ . The following equations hold F X ( x ) = F X ′ (cid:16) x − smkm (cid:17) , ∀ x ≥ smkm , (36) E [ X ] = smkm + E [ X ′ ] . (37)We use the following Theorem from Renyi [43] to compute E [ X ′ ] . Theorem (Renyi [43]) . The d th order statistic X ′ ( d ) of n iid exponential random variables X ′ i is equal to the following randomvariable in the distribution X ′ ( d ) , d X j =1 X ′ j n − j + 1 . 
Using Renyi’s theorem, the mean of the d th order statistic E [ X ′ ( d ) ] can be written as E [ X ′ ( d ) ] = E [ X ′ j ] d − X j =0 n − j = ( H n − H n − d ) τ u ⋆ λ u ⋆ mk , where H n is the n th harmonic sum defined as H n , P ni =1 1 i , with the notation H , . In particular, E [ X ( n ) ] = H n τ u ⋆ λ u ⋆ mk . (38)Combining the results of (35), (37) and (38) we have E [ T RP M ] ≤ smkm + H n τ u ⋆ λ u ⋆ mk . Under the assumption that λ u s u is a fixed constant t m , i.e., s u = t m /λ u , the value u ⋆ of u that maximizes s u τ u = τ u t m /λ u is the same as the value of that minimizes the ratio λ u τ u . Thus, we can write s m = s u ⋆ τ u ⋆ = t m τ u ⋆ /λ u ⋆ . This concludes theproof. Proof of Corollary 2:
We only provide a sketch of the proof because the detailed steps are similar to steps of the proofof Theorem 2. We first bound F T I ( x ) from above by F X ( x ) n − n s where X is a shifted exponential random variable with rate λ mk l mm I m l kk I m and shift s mk l mm I m l kk I m . This is the n − n s ordered statistic of n iid random variable following the shiftedexponential distribution. We use Renyi’s theorem and the inequality l mm I m l kk I m ≥ mkm I k I and the rest follows. Proof of Theorem 3:
In this model the random variable T u (the time spent by a worker in cluster u to compute τ u tasks)is the sum of τ u iid random variables following the shifted exponential distribution with rate λ u mk and shift s u /mk . Thus, T u is a random variable following a shifted Erlang distribution. The CDF F T u ( x ) of T u is equal to if x < s u /mk and is expressed as follows otherwise. F T u ( x ) = 1 − τ u − X j =0 e − λ u km ( x − s u /mk ) j ! (cid:16) λ u km ( x − s u km ) j (cid:17) = 1 − τ u − X j =0 e λ u s u − λ u kmx j ! ( λ u kmx − λ u s u ) j . Again, the random variable T ⋆u (the time spent by all the workers of cluster u to compute τ u tasks) is the maximum of n u iid copies of T u . Hence, we have F T ⋆u ( x ) = 0 for x < s u /km and for x ≥ s u /km we have F T ⋆u ( x ) = − τ u − X j =0 e λ u s u − λ u kmx j ! ( λ u kmx − λ u s u ) j n u . Similarly to the flow of the proof of Theorem 2, we want to bound the mean waiting time of the master. We first have thefollowing set of inequalities. E [ T RP M ] = Z ∞ Pr( T RP M > x ) dx (39) = Z ∞ (1 − Pr( T RP M < x )) dx = s m km + Z ∞ smkm − c Y u =1 F T ⋆u ( x ) ! dx = s m km + Z ∞ smkm − c Y u =1 F T u ( x ) n u ! dx. (40)Next, we bound F T u ( x ) from below for all values of u ∈ [ c ] . Since all the terms in the summation in F T u are positive, wecan write F T u ( x ) ≥ − τ max − X j =0 e λ u s u − λ u kmx j ! ( λ u kmx − λ u s u ) j . Taking the derivative of e λusu − λukmx j ! ( λ u kmx − λ u s u ) j with respect to λ u , we see that this function is decreasing in λ u . Wecan now bound F T u ( x ) as F T u ( x ) ≥ − τ max − X j =0 e λ c s u − λ c kmx j ! ( λ c kmx − λ c s u ) j . (41)Let F max ( x ) , − P τ max − j =0 e λcsu − λckmx j ! ( λ c kmx − λ c s u ) j . Combining (40) and (41) we can bound the mean waiting timefrom above by E [ T RP M ] ≤ s m km + Z ∞ smkm − c Y u =1 ( F max ( x )) n u ! dx = s m km + Z ∞ smkm (1 − ( F max ( x )) n ) dx. Notice that F max ( x ) is the cumulative density function of an Erlang distribution with shape τ max (i.e. sum of τ max iid exponential random variables) and rate λ c . Hence, the mean waiting time of the master when using RPM3 is bounded by themean of the n th order statistic of n Erlang random variables. Using the derivation from [44] we can bound the mean waitingtime as in the statement of the theorem. A
PPENDIX CA LGORITHMS
We summarize the encoding process of RPM3 in the following four algorithms. We consider the clustering of the workers(result of Algorithm 4) as global knowledge for all the provided algorithms. The coordinator (Algorithm 1) takes as input thenumbers of clusters and keeps track of the workers. More precisely, when any worker is idle, the coordinator calls the Encodefunction (Algorithm 3) to generate a new task. The encoder in Algorithm 3 takes as input the round, the cluster in which the idle worker is located. The encoder needs to know if the idle worker is the first worker of this cluster starting this round. If so,the master creates a fresh polynomial pair and sends an evaluation to the workers. Only for the first round, the clustering in theencoding does not play a role. This holds because at the first round one polynomial pair is generated for all the workers. Thecoordinator checks at real-time if the idle worker is the last worker of its cluster to respond. In this case, the coordinator callsthe interpolation algorithm. Algorithm 2 provides the Fountain-coded matrices. To do so, it requires the index of the clusterat hand and the considered round. If it is the first cluster of the considered round, the algorithm interpolates the polynomial h (1) t ( x ) and saves the z shared evaluations that are used in the interpolation of the other clusters. Otherwise, the interpolationalgorithm extracts the z shared evaluations from the memory and interpolates the polynomial. The coordinator collects all theFountain-coded matrices obtained from the interpolation and saves them in a list. When enough Fountain-coded matrices arecollected, the coordinator runs the peeling decoder to obtain all mk components of the multiplication C successfully. A. Scheduler
Algorithm 1:
Coordinator
Input : n workers Result: A B , . . . , A m B k Inter ← [] ; // storage for Fountain-coded matrices counter ← ; // nb of coded matrices collected t , t , . . . , t n ← ; // put all workers to round Encode ( t = 1 , n t = n , u = c ) ; // encode for all the workers Clustering( n tasks, ∆ , z ) ; // Cluster the workers and make it global i ← ; // auxiliary variable representing worker w i n t (1) , . . . , n t ( c ) ← ; // nb of workers in cluster u at round t while counter ≤ mk (1 + ε ) doif worker i is ready then t i ← extract number of packets computed by this worker so far; n t i ( u ′ ) = n t i ( u ′ ) + 1 ; if n t i = n u ′ then Inter ← [ Inter ; Interpolate ( n t i ( u ′ ) , u ′ )] ;counter = counter + d u ′ ; end t i = t i + 1 ;Encode( i, t i ) ; else i = i + 1 mod n ; // Check the next worker endend A B , . . . , A m B k ← Peeling decoder(Inter) as in [12] ;
B. Interpolation
Algorithm 2:
Interpolate
Input : u Result:
Fountain-coded matrices e A e B if u = 1 then // first cluster finished computing interpolate h (1) t ( x ) using d + 2 z − evaluations ;save h (1) t ( α ) , . . . , h (1) t ( α z ) in the memory as z common evaluations; else // cluster u = 1 finished computing extract z common evaluations for round t from the memory ;interpolate h ( u ) t using d u + z − and z common evaluations ; endreturn h ( u ) t ( α z +1 ) , . . . , h ( u ) t ( α z + d u ) ; C. Encoding
Algorithm 3:
Encode
Input : t, n t , u Result:
Tasks for the idle workers if t = 1 then generate z random matrices R , S ;encode P cu =1 d u Fountain-coded matrices e A , e B ;construct c polynomial-pairs f ( u )1 ( x ) , g ( u )1 ( x ) as in (3), (4) ;pick carefully distinct β , . . . , β n elements from F q ; return n evaluations f (1)1 ( β i ) , g (1)1 ( β i ) elseif n t ( u ) = 1 thenif u = 1 then // the first worker at round t generate z random matrices R t , S t ;encode d Fountain-coded matrices e A , e B ;construct a polynomial-pair f (1) t ( x ) , g (1) t ( x ) ;send an evaluation to worker i , f (1) t ( β i ) , g (1) t ( β i ) ; else // the first worker of non-first cluster at round t encode d u Fountain-coded matrices e A , e B ;extract R t , S t from the memory ;construct a polynomial-pair f ( u ) t ( x ) , g ( u ) t ( x ) ;send to worker i an evaluation f ( u ) t ( β i ) , g ( u ) t ( β i ) ; endelse // non-first worker of any cluster extract the polynomial-pair f ( u ) t ( x ) , g ( u ) t ( x ) from the memory ;send f ( u ) t ( β i ) , g ( u ) t ( β i ) to worker i ; endend D. Clustering
Algorithm 4:
Clustering
Input : n idle workers and n tasks, ∆ , z Result: c , n , . . . , n c Send n tasks to n workers ;non assigned ← n ; // nb of workers not assigned u ← ; // indexing of the clusters n ← ; // nb of workers in the first cluster while non assigned ≥ z + 1 do stop ← False ; // stopping criterion while not stop doif any worker w i completed the task then n u = n u + 1 ; if n u = 1 then start time ← get current time() ; // time when first worker of cluster u finished computing end non assigned = non assigned − ; if u = 1 then stop = ( n ≥ z − and get current time() − start time ≥ ∆ else stop = ( n u ≥ z + 1) and get current time() − start time ≥ ∆ endendend u = u + 1 ; n u ← ; end c = u − ; // nb of clusters n c = n c + non assigned ;non assigned = 0 ; return c, n , . . . , n cc