Simplex Queues for Hot-Data Download
Mehmet Fatih Aktaş, Elie Najm, and Emina Soljanin,
Fellow, IEEE
Abstract
In cloud storage systems, hot data is usually replicated over multiple nodes in order to accommodate simultaneous access by multiple users as well as to increase the fault tolerance of the system. Recent cloud storage research has proposed using availability codes, a special class of erasure codes, as a more storage-efficient way to store hot data. These codes enable data recovery from multiple, small disjoint groups of servers. The number of recovery groups is referred to as the availability and the size of each group as the locality of the code. Until now, we have had very limited knowledge of how code locality and availability affect data access time. Data download from these systems involves multiple fork-join queues operating in parallel, making the analysis of access time a very challenging problem. In this paper, we present an approximate analysis of data access time in storage systems that employ simplex codes, an important and in a certain sense optimal class of availability codes. We consider and compare three strategies for assigning download requests to servers: the first aggressively exploits the storage availability for faster download, the second implements only load balancing, and the third employs storage availability only for hot data download without incurring any negative impact on the cold data download.
Index Terms
Distributed storage, erasure coding, hot data access, queueing analysis.
I. INTRODUCTION
In distributed systems, reliable data storage is accomplished through redundancy, which has traditionally been achieved by simple replication of data across multiple nodes [1], [2]. A special class of erasure codes, known as locally repairable codes (LRCs) [3], has started to replace replication in practice [4], [5] as a more storage-efficient way to provide a desired reliability. A storage code has locality r and availability t if each data symbol can be recovered from t disjoint groups of at most r servers. Low code locality is desired to limit the number of nodes accessed while recovering from a node failure. Code availability provides a means to cope with node failures and skews in content popularity as follows. First, when data is available in multiple recovery groups, simultaneous node failures have a lower chance of preventing the user from accessing the desired content. Second, frequently requested hot data can be simultaneously served by the node at which the data resides as well as by the multiple groups of servers that store the recovery groups of the desired data. Popularity of the stored content in distributed systems is known to exhibit significant variability. Data collected from a large Microsoft Bing cluster shows that 90% of the stored content is not accessed by more than one task simultaneously, while the remaining 10% is observed to be frequently and simultaneously accessed [6]. LRCs with good locality and availability have been explored, and several construction methods are presented in, e.g., [7], [8].

It has recently been recognized that storage redundancy can also provide fast data access [9]. The idea is to simultaneously request data from both the original and the redundant storage, and wait for the fastest subset of the initiated downloads that is sufficient to reconstruct the desired content. Download with redundant requests is shown to help eliminate long queueing and/or service times (see, e.g.,
[9], [10], [11], [12], [13], [14], [15], [16] and references therein). Most of these papers consider download of all jointly encoded pieces of data, i.e., the entire file. Very few papers have addressed download, in the low traffic regime, of only some, possibly hot, pieces of data that are jointly encoded with those of less interest [17], [18], [19].

In this paper, we are concerned with hot data download from systems that employ an LRC with availability for reliable storage [7], [8]. In particular, we consider simplex codes, which are a subclass of LRCs with availability. Three reasons for this choice are as follows: 1) Simplex codes are optimal in several ways, e.g., they meet the upper bound on the distance of LRCs with a given locality [20], they are shown to achieve the maximum rate among binary linear codes with a given availability and locality two [21], and they meet the Griesmer bound and are therefore linear codes with the lowest possible length given the code distance [22]; 2) Simplex codes are minimally different from replication, in that when data is simplex coded any single node failure can be recovered by accessing two nodes, while accessing a single node is sufficient to recover replicated data; 3) Simplex codes have recently been shown to achieve the maximum fair robustness for a given storage budget against the skews in content popularity [23].

In a distributed system that employs a simplex code, hot data can be downloaded either from the single systematic node that stores the data, or from any one of the t pairs of nodes (i.e., recovery groups) where the desired data is encoded jointly with others, or redundantly by some subset of these options (see Fig. 1). We consider three strategies for scheduling the download requests: 1) Replicate-to-all, where each arriving request is simultaneously assigned to its systematic node and all its recovery groups; 2) Select-one, where each arriving request is assigned either to its systematic node or to one of its recovery groups; 3) Fairness-first, where each arriving request is primarily assigned to its systematic node, while only hot data requests are opportunistically replicated to their recovery groups when they are idle. The first two scheduling strategies are the two polarities between download with redundant requests and plain load balancing, while the third aims to exploit download with redundancy while incurring no negative impact on the cold data download time.

M. Aktas and E. Soljanin are with Rutgers University. E-mail: {mehmet.aktas, emina.soljanin}@rutgers.edu. E. Najm is with EPFL. E-mail: elie.najm@epfl.ch. This paper was presented in part at the 2017 ACM SIGMETRICS/International Conference on Measurement and Modeling of Computer Systems.

Fig. 1: Data access model in a system that employs a simplex code with availability t. Desired data can be accessed by downloading either from the systematic node that stores the data, or from any of the t recovery groups of two nodes, or redundantly by some subset of these options. Download from a recovery group requires fetching the coded content from both servers and recovering the desired data.

A download time analysis for storage systems that employ LRCs is given in [17], [18] by assuming a low traffic regime, in that no request arrives at the servers before the request in service departs, so the adopted system model does not involve any queues or any other type of resource sharing. When the low traffic assumption does not hold, a system that employs an LRC with availability consists of multiple inter-dependent fork-join queues. This renders the download time analysis very challenging even under Markovian arrival and service time models.

In this paper, we present a first attempt at analyzing the download time, with no low traffic assumption, in systems that employ LRCs with availability, simplex codes in particular, for reliable storage. Under replicate-to-all scheduling, the system implements multiple inter-dependent fork-join queues, hence the exact analysis is formidable due to the infamous state explosion. We outline a set of observations and heuristics that allow us to approximate the system as an
M/G/1 queue. Using simulations, we show that the presented approximation gives an accurate prediction of the average hot data download time in all arrival regimes. Under select-one scheduling, the results available in the literature for fork-join queues with two servers can be translated and used to get accurate approximations of the average download time. Lastly, we study download time under fairness-first scheduling, and evaluate the pain and gain of replicate-to-all (aggressive exploitation of redundancy) compared to fairness-first (opportunistic exploitation of redundancy).

The remainder of the paper is organized as follows. In Sec. II we define a simplex coded storage model, and explain the queueing and service model that is adopted to study the data access time. Sec. III introduces the replicate-to-all scheduling, explains the difficulties it poses for analyzing data access time, and presents an approximate method for analysis. Sec. IV presents the analysis for systems with availability one, and Sec. V generalizes the analysis to systems with any degree of availability. In Sec. VI we present the select-one scheduling and compare its data access time performance with replicate-to-all scheduling. Sec. VII introduces the fairness-first scheduler and shows the gain and pain of replicate-to-all over fairness-first in hot and cold data access time.

II. SYSTEM MODEL
A binary simplex code is a linear systematic code. With an [n = 2^k − 1, k] simplex code, a data set [d_1, …, d_k] is expanded into [d_1, …, d_k, c_1, …, c_{n−k}], where the c_i represent the added redundant data. Each symbol in the resulting expanded data set is stored on a separate node. For a given k and n, simplex codes have optimal distance, and a simplex coded storage can recover from the simultaneous failure of any 2^{k−1} − 1 nodes. The availability of a simplex code is given as t = 2^{k−1} − 1 and its locality is equal to two, that is, any data symbol d_i is available at its systematic node or can be recovered from t other disjoint pairs of nodes. For instance, the data set [a, b, c] would be expanded into [a, b, a+b, c, a+c, b+c, a+b+c] in a [7, 3] simplex coded storage.

In the data download scenario we consider here, download requests arrive for individual data symbols d_i rather than for the complete data set. We assume that one of the jointly encoded data symbols is more popular than the others at any given time. The popular data symbol is referred to as the hot data, while the others are referred to as cold data. Download requests are dispatched to the storage servers at their arrival time to the system. We consider three request scheduling strategies. The first one is replicate-to-all scheduling, which aggressively implements download with redundancy for all arriving requests. Each request is assigned to the systematic server that stores the desired data, and simultaneously replicated to all t recovery groups at which the requested data can be reconstructed. A request is completed and departs the system as soon as the fastest one of its t + 1 dispatched copies completes, and its remaining t outstanding copies are immediately removed from the system.
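The [7, 3] example above is easy to reproduce in code. The following minimal sketch (our illustration, not from the paper) encodes k data bits with a binary simplex code by XOR-ing the subset of data symbols selected by each nonzero bit mask; the mask order 1, 2, …, 2^k − 1 reproduces the symbol layout [a, b, a+b, c, a+c, b+c, a+b+c] used above.

```python
def simplex_encode(data):
    """Encode a list of GF(2) symbols (0/1 ints) with a binary simplex code.

    Every nonzero mask m in 1..2^k - 1 selects a subset of the k data
    symbols; the coded symbol is their XOR. The resulting length is
    n = 2^k - 1, and the code has locality 2 and availability 2^(k-1) - 1.
    """
    k = len(data)
    coded = []
    for mask in range(1, 2 ** k):
        symbol = 0
        for i in range(k):
            if (mask >> i) & 1:
                symbol ^= data[i]
        coded.append(symbol)
    return coded

# For data [a, b, c] = [1, 0, 1], the layout [a, b, a+b, c, a+c, b+c, a+b+c]:
print(simplex_encode([1, 0, 1]))  # → [1, 0, 1, 1, 0, 1, 0]
```

Note that the first symbol a can be read directly, or recovered from any of its t = 3 recovery groups: (b, a+b), (c, a+c), or (b+c, a+b+c).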
The second one we consider is select-one scheduling, which does not employ redundancy for download and simply implements load balancing, in that each arriving request is assigned either to its systematic server or to one of its recovery groups. In the download time analysis of these two scheduling strategies, download traffic for cold data is assumed to be negligible, and the goal of the system is to finish the arriving hot data requests as fast as possible. We compare the performance of these two polarities of scheduling in terms of system stability and the average hot data download time.

Lastly, we consider the scenario in which the cold data arrival traffic is non-negligible, while the cold data requests are assumed to arrive at a slower rate than the hot data requests. Hot data download with redundancy can reduce the download time and allow the hot data requests to leave the system faster. However, redundant copies of the hot data requests occupy the cold data servers in the recovery groups, which causes additional waiting time for the cold data requests and may lead to dramatically higher cold data download times. This does not maintain fairness across the hot and cold data requests, which is an important goal of scheduling in systems with redundancy [24]. We introduce and study fairness-first scheduling, which exploits redundancy for hot data download opportunistically, with no effect on the cold data download. Each arriving request is primarily assigned to its systematic server, and only the hot data requests are replicated to the recovery groups. A redundant copy of a hot data request is accepted into service only if the cold data server in the assigned recovery group is idle.
Even if a redundant hot data request copy is allowed to start service at a recovery group, it is immediately removed from service as soon as a cold data request arrives at the occupied cold data server.

For tractable analysis, we assume that download requests arrive to the system according to a Poisson process of rate λ. Each server can serve only one request at a time, and the requests assigned to a server wait for their turn in a FIFO queue. We refer to the random amount of time to fetch the data at a server and stream it to the user as the server service time. Download of hot data at a recovery group requires fetching content from both servers. Thus, redundant hot data requests assigned to a recovery group are forked into the two recovery servers, which then need to join to complete a download.

Initial and redundant data stored at the storage nodes have the same size. We assume independent service time variability for content download at each server. This allows us to model the service times as independent and identically distributed (i.i.d.) across the requests and the servers. We first study the system by assuming exponential service times for analytic tractability, and then extend the analysis in some cases to general service times. Note that exponential service times cannot model the effect of data size on the download time, since an exponential sample can be arbitrarily close to zero. Therefore, it is an appropriate model only when the effect of data size on the service times is negligible [15].

III. REPLICATE-TO-ALL SCHEDULING
In this section, we consider hot data download in a binary simplex coded storage under replicate-to-all scheduling. Recall that in a Microsoft Bing data center, 90% of the stored content is reported to be not accessed frequently or simultaneously [6]. Adopting this observation in our analysis of the hot data access time under replicate-to-all scheduling, download traffic for cold data is assumed to be negligible. Subsec. V-B relates the analysis to the scenario where cold data download traffic is non-negligible. Later, in Sec. VII, we study the scenario with non-negligible cold data download traffic and discuss the impact of replicate-to-all scheduling on hot and cold data download time.

The replicate-to-all scheduler replicates each arriving hot data request to its systematic server as well as to all its t recovery groups (see Fig. 1). A request is completed as soon as one of its copies completes at either the systematic server or one of the recovery groups. As soon as a request is completed, all its outstanding copies are immediately removed from the system. Download from each recovery group implements a separate fork-join queue with two servers.

Analysis of fork-join queues is a notoriously hard problem, and the exact average response time is known only for a two-server fork-join queue with exponential service times [25]. Moreover, the fork-join queues in the replicate-to-all system are inter-dependent, since the completion of a request copy at the systematic server or at one of the fork-join queues triggers the cancellation of its outstanding copies that are either waiting or in service at other servers. Given this inter-dependence, exact analysis of the download time in the replicate-to-all system is a formidable task.

A derivation of the average hot data download time is presented in [17] under the low traffic assumption, that is, assuming no queues or resource sharing in the system.
When the low traffic assumption does not hold and queueing analysis is required, an upper bound on the download time is found in [17] by using the more restrictive split-merge service model. The split-merge model assumes that servers are not allowed to proceed with the requests in their queues until the request at the head of the system departs. Simulation results show that the upper bound obtained by the split-merge model is loose unless the arrival rate is very low. We here present a framework to approximate the replicate-to-all system as an M/G/1 queue under any arrival rate regime, which proves to be an accurate estimator of the average hot data download time. The presented approximation provides a fairly simple interpretation of the system dynamics and employs heuristics that shed light on the performance of aggressive exploitation of storage availability.

A. M/G/1 Approximation
Each arriving request has one copy at the systematic server and t redundant copies split across the recovery groups. A request departs from the system as soon as one of its copies completes. Request copies at the systematic server wait and get served in a FIFO queue, hence departures from the systematic server can only be in the order of arrival. Request copies that are assigned to recovery groups are forked into two siblings, and both have to finish service for the request to complete. One of the recovery servers can be ahead of the other one in the same group in executing the requests in its queue. We refer to such a server as a leading server. Although one of the forked copies may happen to be served by a leading server and finish service earlier than its sibling, the remaining copy has to wait for the other request copies in front of it. Thus, departures at the recovery groups can also only be in the order of arrival. These observations imply that a request cannot exit the replicate-to-all system before the other requests in front of it, that is, requests depart the system in the order they arrive.

Exact analysis of the download time requires keeping track of each copy of the requests in the system. There can be multiple (at most t) leading recovery servers in the system simultaneously, and each can be ahead of its slower peer by at most as many requests as are in the system. These cause the infamous state explosion problem and render the exact analysis intractable. Due to the leading servers, copies of a single request can start service at different times and multiple requests can be in service simultaneously (there can be at most t + 1 different requests in service), which complicates the analysis even further. In the following, we redefine the service start time of the requests, which allows us to eliminate this last complication.

Definition 1:
The service start time of a request is defined as the first time epoch at which all its copies are in service, i.e., when none of its copies is left waiting in queue.

Given this new definition of request service start times, there can be only one request in service while the system is busy. For two requests to be simultaneously in service, all copies of each must be in service simultaneously, which is impossible given that requests depart the system in order. This observation is crucial for the rest of the section and is stated below so that we can refer to it later.
Observation 1:
Requests depart the replicate-to-all system in the order they arrive, and there can be at most one request in service at any time given Def. 1.

Once a request starts service according to Def. 1, the layout of its copies at the servers determines its service time distribution. For instance, if all copies of a request remain in the system at its service start epoch, then the request will be served according to

$$S \sim \min\{\tilde V, \tilde{\bar V}_1, \tilde{\bar V}_2, \dots, \tilde{\bar V}_t\},$$

where $\tilde V$ denotes the residual service time of the copy at the systematic server, and the $\tilde{\bar V}_i$ denote the maximum of the two residual service times of the sibling copies at each recovery group.

A request copy can depart from a leading server before the request starts service. When j copies of a request depart early, we denote the service start layout of the request as type-j, for which the service time distribution is given as

$$S_j \sim \min\{\tilde V, \tilde V_1, \dots, \tilde V_j, \tilde{\bar V}_1, \dots, \tilde{\bar V}_{t-j}\} \quad \text{for } j = 0, 1, \dots, t.$$

Here $\tilde V$ is the residual service time at the systematic server, $\tilde V_1, \dots, \tilde V_j$ denote the residual service times at the j recovery groups with a leading server, and the $\tilde{\bar V}_i$ denote the maximum of the two residual service times at the remaining t − j recovery groups without a leading server. Notice that the previous example with no early departure of any request copy corresponds to a type-0 service start.

The layout of the copies of a request at the servers, hence its service time distribution, is determined by the number of copies that depart before the request starts service. These early departures can only be from the leading servers, so the service time distribution of a request depends on the difference between the queue lengths at the leading servers and their slow peers. In other words, the service time distribution of a request is dictated by the system state at its service start epoch. Queue lengths carry memory between service starts of subsequent requests.
For instance, if a request makes a type-j service start with all the leading servers being ahead of their slow peers by at least two, then the service start type of the subsequent request will be at least j. Therefore, in general, service time distributions are not independent across subsequent requests.

When service times at the servers are not exponentially distributed, there are infinitely many possible distributions for the request service times. This is because some copies of a request may start earlier and stay in service until the request moves to the head of the line and starts service. The residual service time of these early starters is distributed differently from that of the request copies that move into service at the request service start epoch. We can trim the number of possible request service time distributions down to t + 1 by modifying the system such that each copy of a request remaining in service is restarted at the service start time of the request. This modification is essential for approximating the download time when the service times at the servers are not exponentially distributed, as further discussed in Sec. V-A. When service times at the servers are exponentially distributed, this modification is not required: the memoryless property allows us to treat the copies that start service early as if they move into service at the request service start time.

Lemma 1:
In the replicate-to-all system, if the service times V_i at the servers are exponentially distributed with rate µ, an arbitrary request can be served with any of t + 1 different types of distributions. The type-j distribution for j = 0, …, t is given as

$$S_j \sim \min\{V, V_1, \dots, V_j, \bar V_1, \dots, \bar V_{t-j}\}, \quad (1)$$

where the $\bar V_i$ are distributed as the maximum of two independent $V_i$'s. Then, the first and second moments of the type-j request service time distribution are given as

$$E[S_j] = \sum_{k=0}^{t-j} \binom{t-j}{k} \frac{2^k (-1)^{t-j-k}}{\mu(2t+1-j-k)}, \qquad E[S_j^2] = \sum_{k=0}^{t-j} \binom{t-j}{k} \frac{2^{k+1} (-1)^{t-j-k}}{\big[\mu(2t+1-j-k)\big]^2}. \quad (2)$$

All moments of S_j decrease monotonically in j.

Proof: (1) follows from the memoryless property of the exponential service times at the servers; (2) is then obtained by the law of total expectation.

One can easily show that the tail probability of the type-j request service time, Pr{S_j ≥ s}, decreases monotonically in j, i.e., S_{j+1} is stochastically dominated by S_j for j = 0, …, t − 1. To see this intuitively, a type-j service start means that at exactly j recovery groups, one of the forked copies departs from a leading server before the request starts service. For the completion of such a request, only the single remaining copy has to finish at each of the j recovery groups with a leading server, while both of the remaining copies have to finish service at the other t − j recovery groups. Waiting for one copy is stochastically better than having to wait for two, hence the larger the number of early departures, the faster the request service will be. Besides, since service times are non-negative, we have

$$E[S_j^m - S_{j+1}^m] = \int_0^\infty m s^{m-1}\big(\Pr\{S_j \ge s\} - \Pr\{S_{j+1} \ge s\}\big)\, ds > 0,$$

thus all moments of S_j decrease monotonically in j.

Even though service time distributions are not independent across the requests, they are only loosely coupled.
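The moments in (2) are easy to evaluate numerically. The sketch below (our code, not from the paper) computes them in closed form and cross-checks them against a direct numerical integration of the tail Pr{S_j ≥ s} = e^{−(j+1)µs}(2e^{−µs} − e^{−2µs})^{t−j}, which follows from (1).

```python
import math

def moment_Sj(t, j, mu, m=1):
    """Closed-form m-th moment (m = 1 or 2) of the type-j service time, eq. (2)."""
    return sum(
        math.comb(t - j, k) * 2 ** (k + m - 1) * (-1) ** (t - j - k)
        / (mu * (2 * t + 1 - j - k)) ** m
        for k in range(t - j + 1)
    )

def moment_numeric(t, j, mu, m=1, ds=1e-3, s_max=40.0):
    """Same moment via E[S^m] = integral of m*s^(m-1)*Pr{S_j >= s} ds (midpoint rule)."""
    total, s = 0.0, ds / 2
    while s < s_max:
        tail = math.exp(-(j + 1) * mu * s) * (
            2 * math.exp(-mu * s) - math.exp(-2 * mu * s)
        ) ** (t - j)
        total += m * s ** (m - 1) * tail * ds
        s += ds
    return total

t, mu = 3, 1.0
for j in range(t + 1):
    print(j, moment_Sj(t, j, mu), moment_numeric(t, j, mu))
# E[S_j] decreases in j: more early departures mean a faster service time.
```

For j = t, S_t is the minimum of t + 1 i.i.d. exponentials, so (2) collapses to E[S_t] = 1/((t + 1)µ), a useful sanity check.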
For the request at the head of the line to make a type-j start, there needs to be a leading server in j of the recovery groups, while the servers in the rest of the recovery groups should be leveled. Every request departure from the system triggers the cancellation of the copies at the slow recovery servers, which helps the slow servers catch up with the leading servers. Therefore, it is "hard" for the leading servers to keep leading, because they compete not just with their slow peers but with every other server in the system. Thus, the queues at all the servers are expected to frequently level up. A time epoch at which the queues level up breaks the dependence between the service times of the requests that start service before or after the leveling. Given that these time epochs occur frequently, request service times constitute a series of independent small-size batches, where the dependence exists only within each batch.

Observation 2:
The replicate-to-all system experiences frequent time epochs across which the request service time distributions are independent.

Lastly, request arrivals see time averages in terms of the probabilities over the different service time distributions. This holds even when the service times at the servers are not exponentially distributed.
Lemma 2:
In the replicate-to-all system, let J_i be the type of service time distribution of the i-th request. Then, lim_{i→∞} Pr{J_i = j} = f_j, where the f_j are the time average values of the probabilities over the possible request service time distributions.

Proof:
Given that two requests find the system in the same state at their arrival epochs, the same stochastic process generates their life cycles under stability. This implies that the probabilities over the possible service time distributions for a request are completely determined by the system state seen by the request at its arrival. Since Poisson arrivals see time averages of the system state under stability [26], they also see time averages of the probabilities over the possible service time distributions.

The observations we have made so far, which are also validated by the simulations, allow us to develop an approximate method for analyzing the replicate-to-all system. Requests depart in the order they arrive (Obv. 1), hence the system as a whole acts like a first-in first-out queue. Over the trimmed-down space of possibilities for exponential service times at the servers, the t + 1 possible distributions for the request service times are given in Lm. 1. Although request service times are not independent, they exhibit only loose coupling (Obv. 2). Putting all these together and relying on their accuracy gives us an M/G/1 approximation for the actual system.

Proposition 1:
The replicate-to-all system can be approximated as an M/G/1 queue, hence the PK formula gives an approximate value for the average hot data download time as

$$E[T] \approx E[S] + \frac{\lambda E[S^2]}{2(1 - \lambda E[S])}. \quad (3)$$

The moments of S are given as

$$E[S] = \sum_{j=0}^{t} f_j E[S_j], \qquad E[S^2] = \sum_{j=0}^{t} f_j E[S_j^2], \quad (4)$$

where f_j is the probability that an arbitrary request has the type-j service time distribution S_j (see Lm. 2). When service times at the servers are exponentially distributed, the moments of S_j are given in (2).

B. An exact upper and lower bound
An upper bound on the hot data download time in the replicate-to-all system is given in [18]. The bound is derived by assuming a more restricted service model, known as split-merge. One of the servers in a recovery group can execute the request copies in its queue faster than its sibling, i.e., a recovery server can lead its peer. The split-merge model does not allow this: servers in the recovery groups are blocked until the request at the head of the line departs. Thus, under the split-merge model requests are accepted into service one by one, hence all requests are restricted to have the type-0 (slowest) service time distribution.

Employing the same idea that leads to the split-merge model, we next find a lower bound.

Theorem 1: An M/G/1 queue with service times distributed as S_t in Lm. 1 is a lower bound on the replicate-to-all system in hot data download time.

Proof:
In the replicate-to-all system, there are t + 1 possible request service time distributions, which are stochastically ordered as S_0 > S_1 > · · · > S_t. Restricting all request service time distributions to S_t gives an M/G/1 queue that performs faster than the actual system.

The M/G/1 approximation given in Prop. 1 is an approximate intermediate point between these two polarities: the split-merge upper bound and the lower bound given in Thm. 1.

C. Probabilities for the request service time distributions
For the M/G/1 approximation to be useful, we need to find the probabilities f_j (equivalently, their time average values) of the possible request service time distributions S_j. An exact expression for f_j is found as follows. The sub-sequence of request arrivals that see an empty system forms a renewal process [27, Thm. 5.5.8]. Let us define a renewal-reward function R_j(t) = 𝟙{J(t) = j}, where 𝟙 is the indicator function and {J(t) = j} denotes the event that the request at the head of the line at time t has the type-j service time distribution S_j. Then

$$f_j = \lim_{t\to\infty} \Pr\{R_j(t) = 1\} = \lim_{t\to\infty} E[R_j(t)] \overset{(a)}{=} \lim_{t\to\infty} \frac{1}{t}\int_0^t R_j(\tau)\, d\tau \overset{(b)}{=} \frac{E\Big[\int_{S^{rn}_{n-1}}^{S^{rn}_n} R_j(t)\, dt\Big]}{E[X]},$$

where (a) and (b) are due to the equality of the limiting time and ensemble averages of the renewal-reward function R_j(t) [27, Thm. 5.4.5], S^{rn}_{n−1} and S^{rn}_n are the (n−1)-th and n-th renewal epochs (i.e., consecutive arrival epochs that find the system empty), and X is the i.i.d. inter-renewal interval.

As the expression above clearly indicates, deriving the f_j requires an exact analysis of the system. However, we conjecture the following relation, which allows us to find good estimates, rather than the exact values, of the f_j; these estimates are presented in the following sections.

Conjecture 1:
In the replicate-to-all system, the probabilities f_j over the trimmed-down space of possible request service time distributions S_j (see Lm. 1) satisfy the relation

f_{i−1} > f_i for 1 ≤ i ≤ t, or equivalently, f_i = ρ_i f_{i−1} for some ρ_i < 1.

We here briefly discuss the reasoning behind the conjecture. Obv. 2 states that the servers frequently level up, because the leading servers in the recovery groups compete with every other server to keep leading or to possibly advance further ahead. For a request to be served according to the type-j distribution, one of the forked copies of the request has to depart at exactly j recovery groups before the request starts service. This requires one server in each of the j recovery groups to be leading, which is harder for larger values of j. We validated the conjecture with simulations but could not prove it. However, we derive a strong pointer for the conjecture in Appendix A: given that a request is served according to the type-j distribution, the next request in line is more likely to be served according to a type-i distribution with i < j.

IV. REPLICATE-TO-ALL WITH AVAILABILITY ONE

In this section, we study the replicate-to-all system with availability one, e.g., the storage system [a, b, a + b] as illustrated in Fig. 2. We initially assume that service times at the servers are exponentially distributed, and let the service rates at the systematic server and the two recovery servers be respectively γ, α and β. Then, there are two possible request service time distributions, S_0 and S_1, as given in Lm. 1. The M/G/1 approximation in Prop. 1 requires finding the probabilities f_0 and f_1 for an arbitrary request to be served according to S_0 and S_1, respectively. Although an exact solution proves to be hard, we find good estimates for f_0 and f_1 in the following.
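Before specializing to heterogeneous rates, it helps to see the whole pipeline for the homogeneous case where all servers have rate µ: given any estimates of the f_j, Prop. 1's approximation (3)-(4) and the two polarities (split-merge with all-S_0, and Thm. 1's bound with all-S_t) take only a few lines. The weights f below are hypothetical values chosen for illustration (decreasing in j, in line with Conjecture 1); the paper's estimates come later.

```python
import math

def moment_Sj(t, j, mu, m=1):
    """m-th moment (m = 1 or 2) of the type-j service time S_j, eq. (2)."""
    return sum(
        math.comb(t - j, k) * 2 ** (k + m - 1) * (-1) ** (t - j - k)
        / (mu * (2 * t + 1 - j - k)) ** m
        for k in range(t - j + 1)
    )

def mg1_mean_time(lam, es, es2):
    """Mean time in an M/G/1 queue (PK formula, eq. (3)); needs lam*es < 1."""
    if lam * es >= 1:
        raise ValueError("unstable: lam * E[S] must be < 1")
    return es + lam * es2 / (2 * (1 - lam * es))

def approx_download_time(lam, t, mu, f):
    """Prop. 1 approximation: mixture of S_0..S_t with weights f, eq. (4)."""
    es = sum(f[j] * moment_Sj(t, j, mu, 1) for j in range(t + 1))
    es2 = sum(f[j] * moment_Sj(t, j, mu, 2) for j in range(t + 1))
    return mg1_mean_time(lam, es, es2)

def download_time_bounds(lam, t, mu):
    """(lower, upper): M/G/1 with all-S_t (Thm. 1) vs all-S_0 (split-merge)."""
    lower = mg1_mean_time(lam, moment_Sj(t, t, mu), moment_Sj(t, t, mu, 2))
    upper = mg1_mean_time(lam, moment_Sj(t, 0, mu), moment_Sj(t, 0, mu, 2))
    return lower, upper

# Availability t = 1, all rates mu = 1; f = [f_0, f_1] is a hypothetical pair.
lam, t, mu, f = 0.5, 1, 1.0, [0.7, 0.3]
lo, hi = download_time_bounds(lam, t, mu)
print(lo, approx_download_time(lam, t, mu, f), hi)
```

By construction, the approximation always lands between the two bounds, since S_0 and S_t are the stochastically slowest and fastest of the t + 1 service time types.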
Fig. 2: Replicate-to-all system with availability one. Requests are replicated to the systematic server and to the recovery servers at arrival. Two system snapshots are illustrated, at the epochs at which the request on the left (system state ((0, 0))) and the request on the right (system state ((0, 1))) start service. The former has service time distributed as S_0, while the latter is served with S_1; the request service time distributions S_0 and S_1 are defined in Lm. 1.

A. Markov process for the system state
Imagine a join queue attached at the tail of the system that enqueues the request copies that depart from the servers. Since a copy departing from the systematic server completes the associated request immediately, a copy waiting for synchronization in the join queue must come from a leading recovery server. As soon as a request completes, its copies waiting in the join queue are removed. The state of the join queue at time t is the tuple n(t) = (n_1(t), n_2(t)), where n_i denotes the number of request copies in the join queue that departed from the ith recovery server.

Ordered departure of the requests (see Obv. 1) and cancellation of the outstanding copies upon a request completion imply n_1(t)·n_2(t) = 0 for all t. Together with the total number N(t) of requests in the system, s(t) = (N(t), n_1(t), n_2(t)) constitutes the state of the replicate-to-all system at time t (see Fig. 2). The system state s(t) is a Markov process, as illustrated in Fig. 3. Let us define Pr{s(t) = (k,i,j)} = p_{k,i,j}(t), and suppose that stability is imposed and lim_{t→∞} p_{k,i,j}(t) = p_{k,i,j}. Then the system balance equations are summarized for k, i, j ≥ 0 as

[λ + γ1(k ≥ 1) + α1(i ≥ 1) + β1(j ≥ 1)] p_{k,i,j} = λ1(k ≥ 1, i ≥ 1, j ≥ 1) p_{k−1,i−1,j−1} + γ p_{k+1,i+1,j+1} + (γ+α) p_{k+1,i+1,j} + (γ+β) p_{k+1,i,j+1}. (5)

Computing the generating function

P(w, x, y) = Σ_{k,i,j≥0} p_{k,i,j} w^k x^i y^j

from the above balance equations is intractable, and so is the exact solution of the system's steady-state behavior: the state space is infinite in two dimensions and its analysis is tedious (see Appendix B for a guessing-based analysis with the local balance method). As discussed next, the state space of the system can be made much simpler by employing a high traffic assumption.
Fig. 3: Markov state process for the replicate-to-all system with availability one (Top), and its high traffic approximation (Bottom).
B. Analysis under high traffic assumption
Suppose that the arrival rate λ is very close to its critical value for stability, so that the queues at the servers are always nonempty. This is a rather crude assumption and holds only when the system is unstable; however, it yields a good approximation for the system and makes the analysis much easier. We refer to this set of working conditions as the high traffic assumption. Under this assumption, the servers are always busy and the state of the join queue n(t) represents the state of the whole system. The system state then follows a birth-death process, as shown in Fig. 3, for which the steady-state balance equations are given as

α p_{i,0} = (γ+β) p_{i+1,0},  β p_{0,i} = (γ+α) p_{0,i+1},  i ≥ 0, (6)

where p_{i,j} = lim_{t→∞} Pr{n(t) = (i,j)}. Solving the balance equations, we find

p_{i,0} = (α/(β+γ))^i p_{0,0},  p_{0,i} = (β/(α+γ))^i p_{0,0},  i ≥ 0. (7)

By the axiom of probability, we have

[1 + Σ_{i=1}^∞ (β/(α+γ))^i + Σ_{i=1}^∞ (α/(β+γ))^i] p_{0,0} = 1.

Assuming β < α + γ and α < β + γ,

p_{0,0} = (1 + β/(α+γ−β) + α/(β+γ−α))^{−1} = (γ² − (α−β)²) / (γ(α+β+γ)),

from which p_{i,0} and p_{0,i} are found by substituting p_{0,0} in (7). Other useful quantities are

Σ_{i=1}^∞ p_{i,0} = α(α+γ−β) / (γ(α+β+γ)),  Σ_{i=1}^∞ p_{0,i} = β(β+γ−α) / (γ(α+β+γ)).

For tractability, we continue by assuming α = β = µ, thus

p_{0,0} = γ/(γ+2µ),  Σ_{i=1}^∞ p_{i,0} = Σ_{i=1}^∞ p_{0,i} = µ/(γ+2µ).

Next, we use the steady-state probabilities under the high traffic assumption to obtain estimates for the request service time probabilities f_0 and f_1, and some other quantities that give insight into the system behavior.

C. Request completions at the servers
A request completes as soon as either its copy at the systematic server departs or both of its copies at the recovery group depart. Here we find bounds on the fraction of the requests completed by the systematic server and by the recovery group.
Theorem 2:
In the replicate-to-all system with availability one, let w_s and w_r be the fractions of requests completed by the systematic server and by the recovery group, respectively. Then the following inequalities hold:

w_s ≥ γν/(γν + 2µ²),  w_r ≤ 2µ²/(γν + 2µ²);  ν = γ + 2µ. (8)

Proof:
Under the high traffic approximation, w_s and w_r can be found from the steady-state probabilities of the Markov chain (MC) embedded in the state process (see Fig. 3). Recall that the system state under the high traffic approximation is n(t) = (n_1(t), n_2(t)), where n_i is the number of request copies in the join queue that departed from the ith recovery server. The system stays in each state for an exponential duration of rate ν = γ + 2µ. Therefore, the steady-state probabilities p_i of n(t) (i.e., the limiting fractions of time spent in state i) and the steady-state probabilities π_i of the embedded MC (i.e., the limiting fractions of state transitions into state i) are equal, as seen from the equality

π_i = (p_i ν) / Σ_{i≥0} p_i ν = p_i.

Let f_s and f_r be the limiting fractions of state transitions that represent request completions by the systematic server and by the recovery group, respectively. Then we find

f_s = π_{0,0} γ/ν + Σ_{i=1}^∞ π_{i,0} γ/ν + Σ_{i=1}^∞ π_{0,i} γ/ν = γ/ν,
f_r = Σ_{i=1}^∞ (π_{0,i} + π_{i,0}) µ/ν = 2(µ/ν)².

The limiting fraction of all state transitions that represent a request departure is then f_d = f_s + f_r = (γν + 2µ²)/ν². Finally, we find the fractions of request completions by the systematic server and by the recovery group as

ŵ_s = f_s/f_d = γν/(γν + 2µ²),  ŵ_r = f_r/f_d = 2µ²/(γν + 2µ²).

Recall that these values are calculated under the high traffic approximation, in which the queues are never empty. However, the recovery servers have to go idle regularly under stability. Therefore, the fraction of request completions at the recovery group is smaller under stability than under high traffic, hence we conclude w_r ≤ ŵ_r, which implies w_s ≥ ŵ_s.

The bounds on w_s and w_r become tighter as the arrival rate λ increases, as shown in Fig. 4; this is because the bounds are derived by studying the system under the high traffic assumption.
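The closed form for p_{0,0} above, and hence the bounds in (8), can be cross-checked by solving a finite truncation of the birth-death chain of Fig. 3 numerically. A sketch (the truncation level and rates are illustrative choices):

```python
import numpy as np

def high_traffic_p00(gamma, alpha, beta, N=200):
    """Solve the truncated high-traffic birth-death chain of Fig. 3 and
    return the stationary probability of the central state (0, 0).
    State s in [-N, N]: s > 0 means the alpha-server leads by s copies,
    s < 0 means the beta-server leads by -s copies."""
    n = 2 * N + 1
    Q = np.zeros((n, n))
    for s in range(-N, N + 1):
        r = s + N                       # row index of state s
        if 0 <= s < N:                  # alpha-server extends its lead
            Q[r, r + 1] += alpha
        if -N < s <= 0:                 # beta-server extends its lead
            Q[r, r - 1] += beta
        if s > 0:                       # join (beta) or systematic (gamma)
            Q[r, r - 1] += beta + gamma
        if s < 0:
            Q[r, r + 1] += alpha + gamma
    np.fill_diagonal(Q, -Q.sum(axis=1))
    A = np.vstack([Q.T, np.ones(n)])    # pi Q = 0 plus normalization
    b = np.zeros(n + 1); b[-1] = 1.0
    pi = np.linalg.lstsq(A, b, rcond=None)[0]
    return pi[N]

gamma, alpha, beta = 1.0, 0.8, 0.6      # stable: alpha < beta + gamma, etc.
closed = (gamma**2 - (alpha - beta)**2) / (gamma * (alpha + beta + gamma))
print(high_traffic_p00(gamma, alpha, beta), closed)
```

The two printed values agree up to truncation and round-off error; the geometric decay of p_{i,0} and p_{0,i} makes the truncation error negligible.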
Fig. 4: Simulated fractions of request completions by the systematic server (w_s) and by the recovery group (w_r) in the replicate-to-all system with t = 1 and γ = α = β = 1, together with the bounds given in (8).

D. M/G/1 approximation

Using the analysis under the high traffic assumption given in Sec. IV-B, here we obtain estimates for the probabilities f_0 and f_1 over the request service time distributions S_0 and S_1. This enables us to state an analytical form of the M/G/1 approximation.

Lemma 3:
The hot data download time under the high traffic assumption is a lower bound on the download time in the replicate-to-all system.
Proof:
Let us first consider the system with availability one. Comparing the two Markov processes in Fig. 3, one can see that the process under the high traffic assumption can be obtained from the actual state process as follows: 1) introduce additional transitions of rate α from state (i, (i, 0)) to (i+1, (i+1, 0)) and additional transitions of rate β from state (i, (0, i)) to (i+1, (0, i+1)) for each i ≥ 0; 2) gather the states (i, (m, n)) for i ≥ 0 into a super-state (m, n); 3) observe that the process with the super-states is the same as the process under the high traffic assumption. Thus, employing the high traffic assumption has the effect of placing extra state transitions that help the system serve the requests faster.

The above rationale extends to the system with any degree of availability. Recall from Lm. 1 that the request service time distributions are stochastically ordered as S_0 > ··· > S_t. The high traffic assumption ensures that there is always a replica for the leading recovery servers to serve, hence it increases the fraction of requests served with a faster request service time distribution. Thus, the analysis under the high traffic assumption yields a lower bound on the download time in the actual system.

Theorem 3:
Under the assumptions in Prop. 1 and for the replicate-to-all system with availability one, we have the following bounds on the probabilities f_0 and f_1 over the request service time distributions S_0 and S_1 (see Lm. 1):

f_0 ≥ f_lb = γν/(γν + 2µ²),  f_1 ≤ f_ub = 2µ²/(γν + 2µ²).

Proof:
Consider the state process under the high traffic assumption as given in Fig. 3. State transitions towards the center state (0, 0) represent request completions; let us define f_d as the fraction of such state transitions. Since the queues are never empty under the high traffic assumption, the next request always starts service as soon as the head-of-the-line request departs. Therefore, among the state transitions that represent request completions, a request makes a type-0 service start per transition into state (0, 0), while a request makes a type-1 service start per transition into any other state. Let f_{0→} and f_{1→} be the fractions of state transitions that represent type-0 and type-1 request service starts, respectively. We find

f_d = π_{0,0} γ/ν + Σ_{i=1}^∞ (π_{0,i} + π_{i,0}) (µ+γ)/ν = (2µ² + 2µγ + γ²)/ν²,

f_{0→} = π_{0,0} γ/ν + π_{1,0} (µ+γ)/ν + π_{0,1} (µ+γ)/ν = π_{0,0} (γ/ν + (2µ/(µ+γ)) (µ+γ)/ν) = π_{0,0} = γ/(γ+2µ).

Then, under the high traffic approximation, the limiting fractions of requests that make a type-0 (f̂_0) or type-1 (f̂_1) service start are found, for ν = γ+2µ, as

f̂_0 = f_{0→}/f_d = γν/(γν + 2µ²),  f̂_1 = 1 − f̂_0 = 2µ²/(γν + 2µ²).

Under stability, the system has to empty out regularly, and a request that finds the system empty makes a type-0 service start for sure. Under the high traffic assumption, however, the servers are always busy, which enforces fewer type-0 request service starts. Therefore, the fraction f̂_0 of type-0 service starts obtained under the high traffic approximation is a lower bound on its value under stability, i.e., f̂_0 ≤ f_0, from which f̂_1 ≥ f_1 follows immediately.

Given that type-0 service is slower than type-1 service, substituting the bounds f_lb and f_ub given in Thm. 3 in place of the actual probabilities f_0 and f_1 in (4) yields the following lower bounds on the request service time moments:

E[S_lb] = f_lb (2/(γ+µ) − 1/(γ+2µ)) + f_ub · 1/(γ+µ),
E[S_lb²] = f_lb (4/(γ+µ)² − 2/(γ+2µ)²) + f_ub · 2/(γ+µ)². (9)

As suggested by the M/G/1 approximation in Prop. 1, substituting these lower bounds on the service time moments into the PK formula gives an approximate lower bound on the hot data download time in the replicate-to-all system with availability one. To emphasize: since the M/G/1 model is not exact but an approximation, the lower bound on the download time can only be claimed to be an approximate one.

Fig. 5: For the replicate-to-all system with availability one and exponentially distributed service times at the servers, comparison of the simulated average hot data download time, the split-merge upper bound in [18], the lower bound in Thm. 1, the matrix-analytic upper bound in Thm. 4, and the M/G/1 approximation given in Prop. 1 and Thm. 3.
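The chain from Thm. 3 through (9) to the PK formula can be collected into a few lines. The sketch below assumes the standard Pollaczek-Khinchine mean sojourn time E[T] = E[S] + λE[S²]/(2(1 − λE[S])), and compares two splits of a common rate budget γ + 2µ, anticipating the service rate allocation observation of Sec. IV-E; the numbers are illustrative:

```python
def mg1_download_time(lam, gamma, mu):
    """Approximate mean download time for replicate-to-all with t = 1:
    Thm. 3 bounds on f_0, f_1, the moment bounds (9), and the
    Pollaczek-Khinchine mean sojourn time."""
    nu = gamma + 2 * mu
    f0 = gamma * nu / (gamma * nu + 2 * mu ** 2)    # lower bound on f_0
    f1 = 1 - f0                                     # upper bound on f_1
    ES = f0 * (2 / (gamma + mu) - 1 / (gamma + 2 * mu)) + f1 / (gamma + mu)
    ES2 = (f0 * (4 / (gamma + mu) ** 2 - 2 / (gamma + 2 * mu) ** 2)
           + f1 * 2 / (gamma + mu) ** 2)
    assert lam * ES < 1, "approximation requires lam * E[S] < 1"
    return ES + lam * ES2 / (2 * (1 - lam * ES))    # P-K formula

# Two allocations of the same total budget gamma + 2*mu = 3:
print(mg1_download_time(0.5, 2.4, 0.3))   # fast systematic server
print(mg1_download_time(0.5, 0.6, 1.2))   # slow systematic server
```

With these illustrative numbers, the allocation that favors the systematic server yields the smaller approximate download time.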
Simulated average hot data download times and the derived M/G/1 approximation are compared in Fig. 5. The approximation predicts the actual average download time very well, and in particular does better than the split-merge upper bound given in [18] and the lower bound stated in Thm. 1.

E. Service rate allocation
Suppose that the system is given a budget of service rate that can be arbitrarily allocated across the servers. When service times at the servers are exponentially distributed, we show that allocating more service rate to the systematic server achieves a smaller hot data download time (see Appendix C). This suggests that in systems employing availability codes for reliable storage, systematic symbols should be stored on the fast nodes, while the slow nodes can be used for storing the redundant symbols, for faster hot data download.

F. Matrix analytic solution
Here we find an upper bound on the average hot data download time that is provably tighter than the split-merge upper bound in [18]. Let us truncate the state process for the replicate-to-all system with availability one (see Fig. 3) so that the pyramid is limited to the five central columns and is infinite only in the vertical dimension. This choice is intuitive given Conj. 1: the system spends more time in the central states, which correspond to type-0 requests, and it is hard for the leading servers to advance further ahead, which is what would make the system state move towards the wings of the pyramid Markov chain. The analysis in Appendix B also suggests that the most frequently visited system states are located in the central columns.
1) Computing the steady state probabilities:
Finding a closed form solution for the steady state probabilities of the statesin the truncated process is as challenging as the original problem. However one can solve the truncated process numericallywith an arbitrarily small error using the
matrix analytic method [28, Chapter 21]. In the following, we denote vectors and matrices in bold font. Let us first define the limiting steady-state probability vector

π = [π_{0,(0,0)}, π_{1,(0,0)}, π_{1,(0,1)}, π_{1,(1,0)}, π_{2,(0,0)}, π_{2,(0,1)}, π_{2,(0,2)}, π_{2,(1,0)}, π_{2,(2,0)}, π_{3,(0,0)}, π_{3,(0,1)}, π_{3,(0,2)}, π_{3,(1,0)}, π_{3,(2,0)}, ···]
  = [π_0, π_1, π_2, π_3, ···],

where π_{k,(i,j)} is the steady-state probability of state (k, (i,j)) and

π_0 = [π_{0,(0,0)}, π_{1,(0,0)}, π_{1,(0,1)}, π_{1,(1,0)}],
π_i = [π_{i+1,(0,0)}, π_{i+1,(0,1)}, π_{i+1,(0,2)}, π_{i+1,(1,0)}, π_{i+1,(2,0)}], i ≥ 1.

One can write the balance equations governing the limiting probabilities in the form

πQ = 0. (10)

For the truncated process, Q has the block-tridiagonal form

Q = [ F_0  H_0
      L_0  F    H
           L    F    H
                L    F    H
                     ⋱    ⋱    ⋱ ],  (11)

where the sub-matrices F_0, H_0, L_0, F, H and L are given in Appendix D in terms of the server service rates γ, α, β and the request arrival rate λ. Using (10) and (11), we get the following system of equations in matrix form:

π_0 F_0 + π_1 L_0 = 0,
π_0 H_0 + π_1 F + π_2 L = 0,
π_i H + π_{i+1} F + π_{i+2} L = 0, i ≥ 1. (12)

In order to solve the system above, we assume the steady-state probability vectors to be of the form

π_i = π_1 R^{i−1}, i ≥ 1, R ∈ R^{5×5}. (13)

Combining (12) and (13), we get

π_0 F_0 + π_1 L_0 = 0,
π_0 H_0 + π_1 (F + RL) = 0,
π_i (H + RF + R²L) = 0, i ≥ 1. (14)

From (14), we have the conditions

H + RF + R²L = 0,  R = −(R²L + H) F^{−1}. (15)

The inverse of F in (15) exists since det(F) = −δ³(δ−α)(δ−β) ≠ 0 for δ = α + β + γ + λ and λ > 0. Using (15), an iterative algorithm to compute R is given in Algorithm 1. The norm ‖R_i − R_{i−1}‖ corresponds to the absolute value of the largest element of the difference matrix R_i − R_{i−1}; the algorithm therefore terminates when the largest difference between the elements of the last two computed matrices is smaller than the threshold ϵ. The initial matrix R_0 could take any value, not necessarily 0. The error threshold ϵ can be fixed to any arbitrary value, but the lower this value, the slower the convergence.

Algorithm 1 Computing matrix R
procedure COMPUTINGR
    ϵ ← desired error threshold, R_0 ← 0, i ← 1
    while true do
        R_i ← −(R_{i−1}² L + H) F^{−1}
        if ‖R_i − R_{i−1}‖ > ϵ then
            i ← i + 1
        else
            return R_i

Having computed R, the vectors π_0 and π_1 remain to be found in order to deduce the values of all limiting probabilities. Recall that in (14) the first two equations are yet to be used. Writing these two equations in matrix form,

[π_0 π_1] [ F_0  H_0
            L_0  RL + F ] = 0, (16)

where 0 is a 1×9 zero vector and

Φ = [ F_0  H_0
      L_0  RL + F ] ∈ R^{9×9}.

In addition, we have the normalization equation to take into account. Denoting 1_4 = [1, 1, 1, 1]ᵀ and 1_5 = [1, 1, 1, 1, 1]ᵀ, and using (13), we get

π_0 1_4 + Σ_{i=1}^∞ π_i 1_5 = 1
π_0 1_4 + Σ_{i=1}^∞ π_1 R^{i−1} 1_5 = 1
π_0 1_4 + π_1 (I − R)^{−1} 1_5 = 1
[π_0 π_1] [ 1_4 ; (I − R)^{−1} 1_5 ] = 1, (17)

where I is the 5×5 identity matrix. In order to find π_0 and π_1, we solve the following system:

[π_0 π_1] Ψ = [1, 0, 0, 0, 0, 0, 0, 0, 0], (18)

where Ψ is obtained by replacing the first column of Φ with [1_4ᵀ, ((I − R)^{−1} 1_5)ᵀ]ᵀ. Hence, (18) is a linear system of nine equations in nine unknowns. After solving (18), we obtain the remaining limiting probability vectors using (13).
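Algorithm 1 can be exercised on the smallest possible quasi-birth-death process, the M/M/1 queue, where all blocks are 1×1: H = λ (level up), L = µ (level down), F = −(λ+µ), and the minimal solution of H + RF + R²L = 0 is the known geometric factor R = λ/µ. A sketch, with scalar arithmetic standing in for the 5×5 matrix operations of the truncated process:

```python
def compute_R(F, H, L, eps=1e-12):
    """Algorithm 1 in the scalar case: iterate R <- -(R^2 L + H) / F
    until successive iterates differ by at most eps."""
    R = 0.0
    while True:
        R_next = -(R * R * L + H) / F
        if abs(R_next - R) <= eps:
            return R_next
        R = R_next

lam, mu = 1.0, 2.0                       # stable: lam < mu
R = compute_R(F=-(lam + mu), H=lam, L=mu)
print(R)                                 # converges to lam/mu

# Series identity used later in (19):
# sum_{i>=2} i R^{i-2} = (1-R)^{-2} + (1-R)^{-1}
lhs = sum(i * R ** (i - 2) for i in range(2, 200))
rhs = (1 - R) ** -2 + (1 - R) ** -1
print(lhs, rhs)
```

Starting from R = 0 the iteration converges monotonically to the minimal nonnegative solution, which is the one with probabilistic meaning.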
2) Bounding the average hot data download time:
Let N_ma be the number of requests in the truncated system. First notice that

Pr{N_ma = 0} = π_{0,(0,0)},
Pr{N_ma = 1} = π_{1,(0,0)} + π_{1,(0,1)} + π_{1,(1,0)} = π_0 1_4 − π_{0,(0,0)},
Pr{N_ma = i} = π_{i,(0,0)} + π_{i,(0,1)} + π_{i,(0,2)} + π_{i,(1,0)} + π_{i,(2,0)} = π_{i−1} 1_5, i ≥ 2.

Then the average number of requests in the truncated system is computed as

E[N_ma] = Σ_{i=0}^∞ i Pr{N_ma = i}
= π_0 1_4 − π_{0,(0,0)} + Σ_{i=2}^∞ i (π_{i−1} 1_5)
= π_0 1_4 − π_{0,(0,0)} + Σ_{i=2}^∞ i (π_1 R^{i−2} 1_5)
= π_0 1_4 − π_{0,(0,0)} + π_1 (Σ_{i=2}^∞ ((i−1) R^{i−2} + R^{i−2})) 1_5
= π_0 1_4 − π_{0,(0,0)} + π_1 (Σ_{j=1}^∞ j R^{j−1} + Σ_{i=0}^∞ R^i) 1_5
= π_0 1_4 − π_{0,(0,0)} + π_1 ((I − R)^{−2} + (I − R)^{−1}) 1_5. (19)

Equation (19) shows that we only need π_0, π_1 and R; there is no need to calculate the infinite number of limiting probabilities.

Theorem 4:
A strict upper bound on the average hot data download time of the replicate-to-all system with availability one is given by E[N_ma]/λ, where E[N_ma] is given in (19).

Proof:
Truncation of the state process is equivalent to imposing blocking on a recovery group whenever one of its recovery servers is ahead of its sibling by two replicas, which works more slowly than the actual system. Therefore, the average download time found for the truncated system is an upper bound on the actual download time, and by Little's law it is expressed as E[N_ma]/λ.

Fig. 5 shows that the upper bound found with the matrix analytic procedure is tighter than the upper bound obtained from the split-merge model in [18]. This is because the split-merge model is equivalent to truncating the state process down to only the central column, while the truncation considered here keeps the five central columns of the state process, which yields greater predictive accuracy as expected.

V. EXTENSION TO THE GENERAL CASE
Given the probabilities f_j over the possible request service time distributions S_j, we have an M/G/1 approximation for the replicate-to-all system with any degree of availability (Prop. 1). However, even when the system availability is one (t = 1), exact analysis is formidable, and we could only find estimates for the f_j in Sec. IV. In this section, we derive bounds and estimates for the f_j when the system availability is greater than one, building on the relation between the f_j conjectured in Conj. 1.

Theorem 5:
Under the assumptions in Conj. 1, we have the following bound and approximations for the probabilities f_j over the request service time distributions S_j:

f_0 ≥ (1−ρ)/(1−ρ^{t+1}),  f_j ≈ ρ^j (1−ρ)/(1−ρ^{t+1}), 1 ≤ j ≤ t, (20)

for ρ ≥ max{ρ_j}_{j=1}^t. The approximation for the f_j tends to be an overestimate, especially as j gets larger.

Proof:
Recall the relation in Conj. 1:

f_j = ρ_j f_{j−1}; ρ_j < 1, j = 1, ..., t.

Using the normalization requirement, we obtain

Σ_{i=0}^t f_i = f_0 [1 + Σ_{j=1}^t Π_{i=1}^j ρ_i] = 1.

Then the f_j are found in terms of the ρ_j as

f_0 = [1 + Σ_{j=1}^t Π_{i=1}^j ρ_i]^{−1},  f_j = f_0 Π_{i=1}^j ρ_i. (21)

Substituting each ρ_i with the same ρ, which is an upper bound on all the ρ_i, and solving for the f_j gives the estimates

f̂_0 = (1−ρ)/(1−ρ^{t+1}),  f̂_j = ρ^j f̂_0.

Note that the above estimates preserve the relation given in Conj. 1. It is easy to see that f_0 ≥ f̂_0. Concluding that f_j ≤ f̂_j for j ≠ 0 requires further assumptions on the ρ_j, but it is also easy to see why the f̂_j tend to be overestimates for most sequences of ρ_j, especially for larger j; e.g., showing that f̂_t/f_t > 1 is straightforward.

The M/G/1 approximation in Prop. 1 relies on good estimates for the request service time probabilities f_j. The tighter the upper bound ρ in Thm. 5, the better the estimates f̂_j. The simplest choice is to set ρ to its naive maximum, ρ = 1, and obtain the estimates f̂_j = 1/(t+1) for j = 0, ..., t. Substituting these estimates into the M/G/1 approximation gives us a naive approximation for the replicate-to-all system. Next, we find an inequality for ρ in Thm. 5, which leads us to better estimates for the f_j.

Corollary 1:
The upper bound ρ in Thm. 5 satisfies the inequality

(1 − λE[Ŝ]) ρ^{t+1} − ρ + λE[Ŝ] ≥ 0, (22)

where E[Ŝ] = (1/(t+1)) Σ_{j=0}^t E[S_j].

Proof:
Under stability, the sub-sequence of request arrivals that find the system empty forms a renewal process [27, Thm. 5.5.8]. The assumptions in Conj. 1 imply that the replicate-to-all system is an M/G/1 queue with arrival rate λ and service time S. The expected number of request arrivals E[J] between successive renewal epochs (busy periods) is given by E[J] = 1/(1 − λE[S]) [27, Thm. 5.5.10].

Requests that find the system empty at arrival are served with the type-0 service time distribution, while requests that arrive within a busy period can be served with any type-j service time distribution, j = 0, ..., t. This observation reveals that 1/E[J] is a lower bound on f_0. Computing the value of E[J] requires knowing E[S], which is what we have been trying to estimate. An upper bound is given by E[J_ub] = 1/(1 − λE[S_lb]), where E[S_lb] is a lower bound on E[S]. One possibility is

E[S_lb] = (1/(t+1)) Σ_{j=0}^t E[S_j],

which we previously used for the naive M/G/1 approximation. Thus, we have

f_0 ≥ 1/E[J] ≥ 1/E[J_ub] = 1 − λE[S_lb].

In the system for which the estimate f̂_0 = 1/(t+1) becomes exact, the lower bound obtained from renewal theory (i.e., 1 − λE[S]) holds as well under stability. For this system E[S_lb] is exact, hence f̂_0 = 1/(t+1) ≥ 1 − λE[S_lb]. One can see that (1−ρ)/(1−ρ^{t+1}) ≥ 1/(t+1) for 0 ≤ ρ < 1, so we have

(1−ρ)/(1−ρ^{t+1}) ≥ 1/(t+1) ≥ 1 − λE[S_lb],

from which (22) follows.

Next, we use inequality (22) to get a tighter value for the upper bound ρ in Thm. 5 as follows. Solving for ρ in (22) does not yield a closed form solution, so to get one we take the limit

lim_{t→∞} [(1 − λE[Ŝ]) ρ^{t+1} − ρ + λE[Ŝ]] ≥ 0 ⟺ ρ ≤ λE[Ŝ].

Using this new upper bound instead of the naive one (i.e., ρ = 1) gives us a better M/G/1 approximation for the replicate-to-all system. We next present the best M/G/1 approximation we could obtain, by estimating each ρ_j separately rather than using a single upper bound ρ.

Corollary 2:
In the replicate-to-all system, the service time probabilities f_j are well approximated as

f̂_0 = [1 + Σ_{i=1}^t Π_{j=0}^{i−1} ρ̂_j]^{−1},  f̂_j = f̂_0 Π_{i=0}^{j−1} ρ̂_i, (23)

where the ρ̂_j are computed recursively as

ρ̂_0 = λE[Ŝ] / (t(1 − λE[Ŝ])),
ρ̂_j = [1 − (1 − λE[Ŝ])(1 + Σ_{k=0}^{j−1} Π_{l=0}^k ρ̂_l)] / [(1 − λE[Ŝ])(t − j) Π_{k=0}^{j−1} ρ̂_k], j = 1, ..., t−1, (24)

and E[Ŝ] = (1/(t+1)) Σ_{j=0}^t E[S_j].

Proof:
Setting f̂_j = ρ̂_0 f̂_0 for j = 1, ..., t and using the normalization requirement Σ_{i=0}^t f̂_i = 1, we find f̂_0 = 1/(1 + tρ̂_0). Using the inequality 1/(1 + tρ̂_0) ≥ 1/(t+1) ≥ 1 − λE[Ŝ] (as found in the proof of Cor. 1), we get the upper bound

ρ̂_0 ≤ λE[Ŝ] / (t(1 − λE[Ŝ])).

Fixing the value of ρ̂_0 to the upper bound above and setting f̂_1 = ρ̂_0 f̂_0 and f̂_i = ρ̂_1 ρ̂_0 f̂_0 for 2 ≤ i ≤ t, we find an upper bound on ρ̂_1 by executing the same steps that we took for finding the upper bound on ρ̂_0. The normalization requirement gives f̂_0 = [1 + ρ̂_0(1 + ρ̂_1(t−1))]^{−1}, and we have f̂_0 ≥ 1/(t+1) ≥ 1 − λE[Ŝ], which yields

ρ̂_1 ≤ [1 − (1 − λE[Ŝ])(1 + ρ̂_0)] / [(1 − λE[Ŝ])(t−1) ρ̂_0].

The same process can be repeated to find an upper bound on ρ̂_2 by fixing ρ̂_0 and ρ̂_1 to the upper bounds above. Generalizing this, fixing ρ̂_0, ..., ρ̂_{j−1} to their respective upper bounds, we find the upper bound

ρ̂_j ≤ [1 − (1 − λE[Ŝ])(1 + Σ_{k=0}^{j−1} Π_{l=0}^k ρ̂_l)] / [(1 − λE[Ŝ])(t − j) Π_{k=0}^{j−1} ρ̂_k].

Finally, setting each ρ̂_j to its respective upper bound allows us to compute the estimates f̂_i.

To summarize, we obtained a naive M/G/1 approximation by setting ρ = 1 in Thm. 5. Next, using Cor. 1 we obtained a tighter bound on ρ, which gave the better approximation. Finally, in Cor. 2 we found the best approximation that we could, by computing the estimates for the ρ_j recursively. Fig. 6 gives a comparison of these naive, better and best approximations with the simulated average hot data download time. Note that the approximations derived in this section are close, though not as good as the one for the system with availability one, for which we were able to obtain a very good approximation using the high traffic assumption in Sec. IV.
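The recursion (24) and the estimates (23) translate directly into code. A sketch; the E[S_j] values below are hypothetical placeholders (any decreasing sequence consistent with the ordering in Lm. 1 will do):

```python
def service_time_probs(t, lam, ES_list):
    """Estimates f_hat_j of Cor. 2 from the recursion (24) and (23).
    ES_list holds E[S_0], ..., E[S_t]."""
    ES_hat = sum(ES_list) / (t + 1)
    a = lam * ES_hat
    assert 0 < a < 1, "need lam * E[S_hat] < 1"
    rho = [a / (t * (1 - a))]            # rho_hat_0
    prods = [rho[0]]                     # prods[k] = rho_0 * ... * rho_k
    for j in range(1, t):                # rho_hat_1, ..., rho_hat_{t-1}
        num = 1 - (1 - a) * (1 + sum(prods))
        den = (1 - a) * (t - j) * prods[-1]
        rho.append(num / den)
        prods.append(prods[-1] * rho[-1])
    f0 = 1 / (1 + sum(prods))
    return [f0] + [f0 * p for p in prods]

# Hypothetical service time means for t = 3 (decreasing, as in Lm. 1):
f = service_time_probs(3, 0.4, [0.9, 0.7, 0.6, 0.55])
print(f, sum(f))
```

By construction the estimates sum to one and pin f̂_0 at the renewal-theoretic lower bound 1 − λE[Ŝ].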
Fig. 6: Comparison of the approximations given in Sec. V and the simulated average hot data download time for the replicate-to-all system with availability t = 1 (Left) and t = 3 (Right), with Exp(µ = 1) servers.

A. Non-exponential servers
Modern distributed storage systems have been shown to be susceptible to heavy tails in service time [29], [30], [31]. Herewe evaluate the
M/G/1 approximation presented in Prop. 1 and Cor. 2 under server service time distributions that have been advocated in the literature as better models of server-side variability in practice.

We consider two analytically tractable service time distributions. The first is Pareto(s, α), with minimum value s > 0 and tail index α > 1. Pareto is a canonical heavy-tailed distribution (smaller α means a heavier tail) that has been observed to fit service times in real systems [30]. The second is Bernoulli(U, L, p), which takes the value L (long, i.e., L > U) with probability p < 1 and the value U (usual) otherwise; this has also been mentioned as a proper model of server-side variability in practice [30].

The two main observations about the replicate-to-all system that led us to the M/G/1 approximation hold also when the server service times are non-exponential: requests depart in the order they arrive, and the system experiences frequent time epochs that break the chain of dependence between the request service times. However, without the memoryless property of the exponential distribution, infinitely many request service time distributions are possible. As discussed in Sec. III-A, the set of possible request service time distributions can be trimmed down to size t+1 by modifying the system and forcing a service restart for request copies that started service early, as soon as their associated request moves to the head of the line (i.e., once all copies of a request are in service).

Repeating the formulation in Lm. 1 for Pareto server service times, the moments of the type-j request service time distribution are found for j = 0, ..., t as

E[S_j] = Σ_{k=0}^{t−j} C(t−j, k) 2^k (−1)^{t−j−k} · s α(2t+1−j−k) / (α(2t+1−j−k) − 1),
E[S_j²] = Σ_{k=0}^{t−j} C(t−j, k) 2^k (−1)^{t−j−k} · s² α(2t+1−j−k) / (α(2t+1−j−k) − 2), (25)

where C(n, k) denotes the binomial coefficient, and when server service times are Bernoulli, we find

E[S_j] = U + (L − U) p^{t+1} (2 − p)^{t−j},
E[S_j²] = U² + (L² − U²) p^{t+1} (2 − p)^{t−j}. (26)

The moments given above enable the M/G/1 approximation stated in Prop. 1 and Cor. 2. Fig. 7 illustrates that the approximation yields a close estimate of the average hot data download time when service times at the servers are distributed as Pareto or Bernoulli. When the tail of the server service times is heavy (e.g., with the
Pareto distribution), the system operates more like the restricted split-merge model. This is because under a heavier tail it is harder for the leading servers in the recovery groups to keep leading, and it is more likely for a leading server to straggle and lose the lead. This effect on the leading servers grows as the system availability increases, since the leading servers compete with more servers to keep leading. This increases the fraction of requests that have the type-0 (slowest) service time, hence the split-merge bound gets tighter at higher availability. (Simulating queues with heavy-tailed service times is difficult, and the numeric results given here are optimistic, i.e., the plotted values serve as a lower bound [32].)

Fig. 7: Comparison of the
M/G/1 approximation in Cor. 2, the split-merge upper bound in [18], the lower bound given in Thm. 1, and the simulated average hot data download time for the replicate-to-all system under Pareto (Top) and
Bernoulli (Bottom) service times at the servers.
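The moment expressions (25) and (26) can be checked mechanically: the Bernoulli case by exhaustive enumeration over server outcomes, and the Pareto case through the survival-function expansion behind (25). A sketch, using the Bernoulli parameters of Fig. 7 and the structure of S_j as in Lm. 1 (one systematic copy, j lone recovery copies, t − j pairs):

```python
from itertools import product
from math import comb

def bern_ESj(U, L, p, t, j):
    """E[S_j] for Bernoulli(U, L, p) servers, per (26)."""
    return U + (L - U) * p ** (t + 1) * (2 - p) ** (t - j)

def bern_ESj_enum(U, L, p, t, j):
    """Brute force: S_j = min over the systematic copy, j lone recovery
    copies, and (t - j) recovery pairs (max of two copies each)."""
    n = 1 + j + 2 * (t - j)                   # independent server draws
    E = 0.0
    for draw in product((0, 1), repeat=n):    # 1 = long value, prob. p
        w = 1.0
        for d in draw:
            w *= p if d else 1 - p
        v = [L if d else U for d in draw]
        comps = v[:1 + j] + [max(v[1 + j + 2 * k], v[2 + j + 2 * k])
                             for k in range(t - j)]
        E += w * min(comps)
    return E

def pareto_survival_terms(u, t, j):
    """Expansion of P(S_j > x) in powers of u = (s/x)^alpha; summing the
    terms must reproduce u^(j+1) * (2u - u^2)^(t-j)."""
    return sum(comb(t - j, k) * 2 ** k * (-1) ** (t - j - k)
               * u ** (2 * t + 1 - j - k) for k in range(t - j + 1))

print(bern_ESj(1, 8, 0.2, t=2, j=1), bern_ESj_enum(1, 8, 0.2, t=2, j=1))
print(pareto_survival_terms(0.3, 3, 1), 0.3 ** 2 * (2 * 0.3 - 0.3 ** 2) ** 2)
```

Integrating each power u^{αm} of the Pareto survival expansion gives a Pareto(s, αm) minimum with mean sαm/(αm − 1), which is how the coefficients of (25) arise.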
B. Hot data download under mixed arrivals
Recall that in Sec. II we fixed the system model to the fixed arrival scenario, in which the requests arriving within a busy period ask for only one data symbol, which we refer to as hot data. This was based on the fact that cold data in practice is very rarely requested [6]. Here we relate the discussion of the replicate-to-all system so far to the mixed arrival scenario. Under the fixed arrival scenario, the role of each server, in terms of storing a systematic or a coded data symbol, is the same for every arriving request. Under the mixed arrival scenario, the role of each server depends on the data symbol requested by an arrival: if the requested symbol is a, then the systematic server for the request is the one that stores a, while the others are recovery servers. When the requested data symbol changes, the role of every server changes.

Under the mixed arrival scenario, multiple requests can be simultaneously served at their corresponding systematic servers; hence requests do not necessarily depart in the order they arrive, and multiple requests can depart simultaneously. The analysis presented in the previous sections for the replicate-to-all system under the fixed arrival scenario depends mainly on the observations that led us to approximate the system as an M/G/1 queue, and the system under the mixed arrival scenario violates these observations. Therefore, the analysis of the system under the mixed arrival scenario turns out to be much harder. However, we can compare the performance of the system under the two arrival scenarios as follows.

Theorem 6:
In replicate-to-all system, hot data download time under mixed arrivals is a lower bound for the case with fixedarrivals.
Proof:
Under fixed arrivals, one of the servers in each recovery group may be ahead of its sibling at any time and proceed with a copy of a request that is waiting in line. Therefore, departure of the request copies at the leading servers cannot complete a request alone, and early departing copies can only shorten the service duration of a request (recall that the type-j service time is stochastically less than the type-(j−1) service time, see Lm. 1). However, under mixed arrivals, a leading server may be the systematic server for a request waiting in line, and completion of its copy at the leading server can complete the request. This is a bigger reduction in the overall download time than completion of a mere copy that shortens the service time of a request, which concludes the proof.

Fig. 8 illustrates Thm. 6 by comparing the simulated values of average hot data download time under fixed and mixed arrival scenarios.
Fig. 8: Comparison of the simulated average hot data download time in the replicate-to-all system under fixed (i.e., all requests arriving within a busy period ask for hot data) and mixed arrival scenarios (i.e., arriving requests may ask for any one of the stored data symbols equally likely), for t = 1 and t = 3 with Exp(µ = 1) servers.
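The stochastic ordering invoked in the proof above (a type-j service time is stochastically smaller than a type-(j−1) one) is easy to check numerically for survival functions of the form Pr{S_i > s} = Pr{V > s}(1 − Pr{V ≤ s}^2)^i, a family of the same flavor as the one appearing later in (28). A small sketch, assuming V ∼ Exp(µ); function names are ours:

```python
from math import exp

def surv(i, s, mu=1.0):
    # Pr{S_i > s} = Pr{V > s} * (1 - Pr{V <= s}^2)^i for V ~ Exp(mu):
    # the request completes when its systematic copy finishes or when
    # both downloads of any one of i recovery groups finish.
    p_gt = exp(-mu * s)
    return p_gt * (1.0 - (1.0 - p_gt) ** 2) ** i

def mean_service(i, mu=1.0, ds=1e-3, smax=40.0):
    # E[S_i] = integral of the survival function (left Riemann sum)
    return sum(surv(i, k * ds, mu) for k in range(int(smax / ds))) * ds
```

Here mean_service(0) recovers E[V] = 1/µ = 1, and the means strictly decrease in i: racing against more recovery groups can only shorten the download.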
VI. SELECT-ONE SCHEDULING
Storage systems experience varying download load that exhibits certain traffic patterns throughout the day [1], [33]. Phase changes in the system load can be anticipated or detected online, which allows for adapting the request scheduling strategy to enable faster data access. In a distributed setting, data access schemes, hence the request scheduling strategies, are either imposed or limited by the availability of data across the storage nodes. The desired scheduling strategies that are optimized for different load phases may require a change in the data availability layout. For instance, hot data may need to be replicated to achieve stable download performance, or replicas of the previously hot but currently cold data may need to be removed in order to open up space in the storage; see e.g., [34] for a survey of adaptive replica management strategies. In a coded storage, data availability can be changed by fetching, re-encoding and re-distributing the stored data. This online modification of data availability puts additional load on the system and introduces additional operational complexity that makes the system prone to operational errors [1].

LRCs, in particular simplex codes, provide the necessary flexibility for the system to seamlessly switch between different request scheduling strategies. Batch codes have been proposed for load balancing purposes in [35], and their connection to LRCs is studied in [36]. A simplex code is a linear batch code. In a simplex coded storage system, the simplest strategy for load balancing, namely select-one, is to assign an arriving request either to the systematic server or to one of the recovery groups chosen at random according to a scheduling distribution. Availability of a simplex code allows replicate-to-all and select-one strategies to be used interchangeably.
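The select-one rule amounts to a single categorical draw per request over the systematic server and the t recovery groups. A minimal dispatcher sketch (the scheduling distribution p is an input; the interface is ours):

```python
import random

def dispatch(p, rng=random):
    """Select-one dispatch: return 0 for the systematic server, or
    i in 1..t for recovery group i, drawn from the scheduling
    distribution p = [p0, p1, ..., pt]."""
    u, acc = rng.random(), 0.0
    for i, pi in enumerate(p):
        acc += pi
        if u < acc:
            return i
    return len(p) - 1  # guard against floating-point round-off
```

Because the hot-data arrivals are Poisson and the draws are i.i.d., this thinning yields the independent Poisson sub-streams used in the analysis below.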
For the low or middle traffic regime, replicate-to-all achieves faster data download; however, the system operates under stability over a greater range of request arrival rates with the select-one strategy (see Fig. 9). This implements the capability for the system to switch between the replicate-to-all and select-one schedulers seamlessly depending on the measured system load.

If we assume that arrivals for cold data are negligible (i.e., the fixed arrival scenario as defined in Sec. V-B), the select-one scheduler splits the arrival stream of hot data requests into independent Poisson streams flowing to the systematic server and to each recovery group. Then, the systematic server implements an M/G/1 queue, while each recovery group implements an independent fork-join queue with two servers and Poisson arrivals. Download time for an arbitrary request can then be found as the weighted sum of the response times of each of the sub-systems, where the weights are the scheduling probabilities across the sub-systems.

When the service times at the servers are exponentially distributed, we state below an exact expression for the average hot data download time using an exact expression given in [25] for the average response time of a two-server fork-join queue. In general, an approximation on the average download time can be found along the same lines using the approximations available in the literature on the response time of fork-join queues with an arbitrary service time distribution. We refer the reader to [37] for an excellent survey of the available approximations for fork-join queues.

Theorem 7:
Suppose that the service times at the servers are exponentially distributed with rate µ. Given a scheduling distribution [p_0, p_1, ..., p_t], respectively over the systematic server and the t recovery groups, the average hot data download time in the select-one system is given as

E[T] = p_0 / (µ − p_0 λ) + Σ_{i=1}^{t} p_i (12µ − p_i λ) / (8µ(µ − p_i λ)).  (27)

Proof:
Each arriving request is independently assigned either to the systematic server with probability p_0 or to recovery group-i with probability p_i for i = 1, ..., t. Given a Poisson(λ) hot data request arrival stream, arrivals that flow to the systematic server form an independent Poisson(p_0 λ) stream, while those that flow to repair group-i form an independent Poisson(p_i λ) stream. Then the systematic server is an M/M/1 queue and each repair group is a fork-join queue with two servers and Poisson arrivals, for which an exact expression is given in [38] for its average response time.

Fig. 9: Comparison of the average hot data download time in replicate-to-all and select-one systems, for t = 1 and t = 3 with Exp(µ = 1) servers.
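The closed form in Thm. 7 is straightforward to evaluate. The sketch below assumes the exact two-server fork-join mean response time E[T_FJ] = ((12 − ρ)/8) · 1/(µ − λ_i), with ρ = λ_i/µ, from [38]; the function name and interface are ours:

```python
def select_one_mean_download_time(lam, mu, p):
    """Average hot-data download time under select-one scheduling.

    lam: hot-data arrival rate, mu: exponential service rate,
    p = [p0, p1, ..., pt]: scheduling probabilities over the systematic
    server and the t recovery groups. Assumes stability of every queue.
    """
    assert abs(sum(p) - 1.0) < 1e-9
    # Systematic server: M/M/1 with arrival rate p0 * lam.
    total = p[0] / (mu - p[0] * lam)
    # Each recovery group: two-server fork-join with arrival rate pi * lam.
    for pi in p[1:]:
        rho = pi * lam / mu
        total += pi * (12 - rho) / 8 / (mu - pi * lam)
    return total
```

As λ → 0 with t = 1 and p = [0.5, 0.5], this gives 0.5 · (1/µ) + 0.5 · (12/8) · (1/µ): the fork-join leg is 1.5 times slower than the systematic leg in the light-traffic limit.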
VII. FAIRNESS-FIRST SCHEDULING
In a simplex coded storage, we assume that hot and cold data are coded and stored together. The replicate-to-all scheduling that we studied in Sec. III and IV aims to exploit download with redundancy for all the arriving hot data requests. Our study of the replicate-to-all system is centered around the assumption that the traffic for cold data requests is negligible, as observed in practice, e.g., only 10% of the stored content is frequently and simultaneously accessed [6]. In this section, we consider the case when the request traffic for cold data is non-negligible.

In the replicate-to-all system, hot data requests are replicated to all servers at arrival, including the servers that store cold data. For instance, in a [a, b, a + b]-system, requests for hot data a are replicated both to its systematic server-a and to the server-b that hosts the cold data. Then, the arriving requests for b may end up waiting for the hot data request copies at server-b. This increases the cold data download time and is unfair to the cold data requests. In [39], the authors propose a fair scheduler for systems with replicated redundancy, where the original requests are designated as primary and assigned high service priority at the servers, while the redundant request copies are designated as secondary and assigned lower service priority.

We here consider the fairness-first scheduler for implementing fairness while opportunistically exploiting the redundancy for faster hot data downloads in a simplex coded storage. In the fairness-first system, each arriving request, asking for hot or cold data, is primarily assigned to its systematic server; e.g., requests for a go to server-a and requests for b go to server-b in a [a, b, a + b]-system. Redundant copies are launched to mitigate server side variability only for hot data requests, but in a restricted manner as follows. As soon as a hot data request moves into service at its systematic server, its recovery groups are checked to see if they are idle.
A redundant copy of it is launched only at the idle recovery groups. For instance, in a [a, b, a + b]-system, when a request for hot data a reaches the head of the line at server-a, a copy of it will be launched at server-b and server-(a + b) if both are idle; otherwise, it will be served only at server-a.

The fairness-first system aims to achieve perfect fairness towards cold data download. When a cold data request finds its systematic server busy with a redundant hot data request copy, the redundant copy in service will be immediately removed from the system. Thus, hot data requests can exploit redundancy for faster download only opportunistically, with zero effect on cold data download.

We assume the request arrival stream for each stored data symbol is an independent Poisson process. The rate of request arrivals for each cold data symbol is set to be the same and denoted by λ_c, while the rate of request arrivals for hot data is denoted by λ. In the fairness-first system, redundant copies of hot data requests have zero effect on the cold data download, hence the download traffic for each cold data symbol implements an independent M/G/1 queue. Cold data download time can then be analyzed using the standard results on M/G/1 queues.

Download traffic for hot data implements a complex queueing system with redundancy. When a hot data request moves into service at its systematic server, it gets replicated to all of its idle recovery groups. A request copy assigned to a recovery group needs to wait for downloading from both servers. In addition, if there exists a cold data server in a recovery group, an assigned hot data request copy will get cancelled as soon as the cold data server receives a request. These interruptions modify the service time distribution of hot data requests on the fly. The main difficulty in analyzing the hot data download time is that service times of subsequent hot data requests are not independent.
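Even before any queueing analysis, the fresh-start service time of a single hot request under this discipline is easy to sample. The Monte Carlo sketch below assumes exponential service and the low-traffic setting later formalized in Prop. 2, and treats a cancelled copy as never completing; the function name, parameters, and the busy-with-probability-ρ shortcut are our modeling assumptions:

```python
import random

def fairness_first_service_time(t, lam_c, mu=1.0, rng=random):
    # One fresh-start hot-request completion time under fairness-first.
    # m = log2(t + 1) recovery groups contain a cold-data server (Sec. II,
    # for t = 2^k - 1); each is independently busy w.p. rho = lam_c / mu
    # and then receives no redundant copy.
    m = (t + 1).bit_length() - 1
    rho = lam_c / mu
    draw = lambda: rng.expovariate(mu)
    times = [draw()]                       # copy at the systematic server
    for _ in range(t - m):                 # groups without a cold server
        times.append(max(draw(), draw()))
    for _ in range(m):                     # groups with a cold server
        if rng.random() < rho:
            continue                       # busy: no copy is launched
        v_coded, v_cold = draw(), draw()
        x = rng.expovariate(lam_c)         # next cold-data arrival
        # the copy is cancelled if a cold request arrives first
        times.append(max(v_coded, v_cold) if v_cold <= x else float("inf"))
    return min(times)
```

Averaging many samples shows the mean service time growing with λ_c: busier cold servers leave fewer opportunities for redundancy.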
The number of recovery groups that are found idle by a hot data request, or the interruption of its redundant copies by cold data arrivals, gives us some information about the service time distribution of the subsequent hot data request.

An upper and a lower bound on the hot data download time are easy to find. As an upper bound we can use the split-merge model that is used to upper bound the download time in the replicate-to-all system, and a lower bound can be found along the same lines as Thm. 1.

Theorem 8:
Hot data download time in the fairness-first system is bounded above by an M/G/1 queue with service time distribution S_{t−log2(t+1)}, and is bounded below by an M/G/1 queue with service time distribution S_t, where

Pr{S_i > s} = Pr{V > s} (1 − Pr{V ≤ s}^2)^i,  (28)

and V denotes the service time at the servers.

Proof:
The worst case scenario gives the upper bound, in which no hot data request can be replicated to any one of the recovery groups that include a cold data server, i.e., the cold data servers are assumed to be busy all the time (λ_c = 1/E[V]). The remaining number of recovery groups with no cold data server is t − log2(t + 1) (see Sec. II for simplex code properties). The best case scenario gives the lower bound, in which each hot data request is fully replicated to all the t recovery groups, i.e., the cold data servers are assumed to be idle all the time (λ_c = 0). In either case, hot data download implements an M/G/1 queue with a service time distribution given in (28).

Consider a low traffic regime under which the hot data requests do not wait in queue and go straight into service, i.e., each hot data request completes service before the next one arrives. We refer to this set of conditions at which the system operates as the low traffic assumption. Under this assumption, each arriving request independently sees time averages (by PASTA), hence the service times of the subsequent hot data requests become i.i.d. Therefore, hot data download implements an M/G/1 queue under the low traffic assumption, as stated in the following.

Proposition 2:
Let us denote the service times at the servers by V, and let the arrivals for each cold data symbol follow a Poisson process of rate λ_c. Under the low traffic assumption, hot data download in the fairness-first system implements an M/G/1 queue with a service time distribution given as

Pr{S > s} = C(s) [α(s)(1 − ρ) + ρ]^m,  (29)

where m = log2(t + 1), ρ = λ_c E[V],

α(s) = 1 − Pr{V ≤ s} ∫_0^s e^{−λ_c v} Pr{V = v} dv,
C(s) = Pr{V > s} (1 − Pr{V ≤ s}^2)^{t−m}.

Proof:
Cold data downloads at each cold data server implement an independent M/G/1 queue. Under stability, the fraction of the time that a cold data server is busy with a cold data request is then given as ρ = λ_c E[V]. The number of recovery groups with a cold data server is given as m = log2(t + 1).

The low traffic assumption basically means that request arrivals for hot data do not wait in queue at the hot data server. Since the inter-arrival times are assumed to be Markovian, we can use PASTA and state that each recovery group with a cold data server is independently found idle with probability 1 − ρ by each arriving hot data request. The number of such idle recovery groups seen by a hot data arrival is then distributed as R ∼ Binomial(m, 1 − ρ).

Given R = r, a request will be simultaneously replicated to the systematic hot data server, to the t − m recovery groups without a cold data server, and to the r idle recovery groups with a cold data server. Let us denote the service time distribution of such a request as S_r. The service time of the copy at the systematic server is distributed as V. At the recovery groups without a cold data server, we simply wait for downloading from two servers, hence the service time is distributed as V_2, the maximum of two independent copies of V. The service time V′ at the recovery groups with a cold data server requires a bit more attention, since the hot data request copies get removed from service as soon as the cold data server receives a request. Let X ∼ Exp(λ_c), and denote the service time at the recovery server that stores coded data (e.g., a + b) by V and at the recovery server that stores cold data (e.g., b) by V_c. Then, we can write the distribution of V′ as

Pr{V′ ≤ s} = Pr{V′ ≤ s, V_c ≤ X} + Pr{V′ ≤ s, V_c > X}
(a)= Pr{max{V, V_c} ≤ s, V_c ≤ X}
= Pr{V ≤ s} Pr{V_c ≤ s, V_c ≤ X}
= Pr{V ≤ s} ∫_0^s e^{−λ_c v} Pr{V_c = v} dv,

where (a) is due to the cancellation of the copy at the cold data server upon an arrival of a cold data request; in other words, Pr{V′ ≤ s, V_c > X} = 0. Now we are ready to write the distribution of S_r as

Pr{S_r > s} = Pr{min{V, {V_2}_{i=1}^{t−m}, {V′}_{i=1}^{r}} > s}
= Pr{V > s} (1 − Pr{V ≤ s}^2)^{t−m} Pr{V′ > s}^r
= C(s) Pr{V′ > s}^r.

Since each arriving hot data request independently samples from R, hot data downloads implement an M/G/1 queue with service time S, which is distributed as

Pr{S > s} = E_R[Pr{S_R > s}]
= Σ_{r=0}^{m} (m choose r) (1 − ρ)^r ρ^{m−r} Pr{S_r > s}
= C(s) Σ_{r=0}^{m} (m choose r) [Pr{V′ > s}(1 − ρ)]^r ρ^{m−r}
= C(s) [α(s)(1 − ρ) + ρ]^m.

The approximation given in Prop. 2 is exact when the hot data arrival rate is low. Fig. 10 gives a comparison of the simulated average hot data download time and the approximation. The approximation seems to be fairly accurate as well when the hot data requests arrive at relatively higher rates. Accuracy of the approximation diminishes with increasing arrival rate. This is because the length of the busy periods at the hot data server increases, which makes the independence assumption on the request service times invalid, hence the adopted
M/G/1 model starts to deviate from the actual behavior.

Fig. 10: Comparison of the upper and lower bounds in Thm. 8, the M/G/1 approximation in Prop. 2, and the simulated average hot data download time in the fairness-first system, for t = 1 with Exp(µ = 1) servers (cold data arrival rates 0.1 and 0.5) and with Pareto(s = 1, α = 3) servers (cold data arrival rates 0.1 and 0.3).

To recap, the replicate-to-all scheduler aims to exploit download with redundant copies for each request arrival with no distinction. On the other hand, the fairness-first scheduler allows for download with redundancy only for hot data requests, by eliminating any negative impact caused on the cold data download by the redundant hot data request copies. At low arrival rates, one would expect replicate-to-all to outperform fairness-first in both hot and cold data download, since the queues do not build up much and the redundant request copies do not significantly increase the waiting times of the requests. However, at higher arrival regimes, replicate-to-all overburdens the cold data servers with redundant requests and causes great pain in the waiting times experienced by the cold data requests. Fig. 11 illustrates these observations and compares the gain of replicate-to-all over the fairness-first scheduler. After a certain threshold in hot data arrival rate, replicate-to-all starts incurring more pain in percentage (negative gain) on cold data download than the gain achieved for hot data download. This is intuitive, since download with redundancy has diminishing returns as the arrival rate gets higher, and the high waiting times caused by the excessive number of redundant request copies in the system have a much greater impact on the cold data download time.
Fig. 11: Gain or pain of replicate-to-all over the fairness-first scheduler in average hot and cold data download time, for Exp(µ = 1) servers, t = 1, and cold data arrival rates 0.1 and 0.5. Negative gain means pain, which indicates that replicate-to-all performs worse than fairness-first in terms of the hot or cold data download time.

ACKNOWLEDGMENTS
This material is based upon work supported by the National Science Foundation under Grant No. CIF-1717314.

REFERENCES

[1] Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. The Google file system. In ACM SIGOPS Operating Systems Review, volume 37, pages 29–43. ACM, 2003.
[2] Dhruba Borthakur. The Hadoop distributed file system: Architecture and design. Hadoop Project Website, 11(2007):21, 2007.
[3] Parikshit Gopalan, Cheng Huang, Huseyin Simitci, and Sergey Yekhanin. On the locality of codeword symbols. IEEE Transactions on Information Theory, 58(11):6925–6934, 2012.
[4] Cheng Huang, Huseyin Simitci, Yikang Xu, Aaron Ogus, Brad Calder, Parikshit Gopalan, Jin Li, and Sergey Yekhanin. Erasure coding in Windows Azure Storage. In Presented as part of the 2012 USENIX Annual Technical Conference (USENIX ATC 12), pages 15–26, 2012.
[5] Maheswaran Sathiamoorthy, Megasthenis Asteris, Dimitris Papailiopoulos, Alexandros G. Dimakis, Ramkumar Vadali, Scott Chen, and Dhruba Borthakur. XORing elephants: Novel erasure codes for big data. In Proceedings of the VLDB Endowment, volume 6, pages 325–336. VLDB Endowment, 2013.
[6] Ganesh Ananthanarayanan, Sameer Agarwal, Srikanth Kandula, Albert Greenberg, Ion Stoica, Duke Harlan, and Ed Harris. Scarlett: Coping with skewed content popularity in MapReduce clusters. In Proceedings of the Sixth Conference on Computer Systems, pages 287–300. ACM, 2011.
[7] Itzhak Tamo and Alexander Barg. A family of optimal locally recoverable codes. IEEE Transactions on Information Theory, 60(8):4661–4676, 2014.
[8] Ankit Singh Rawat, Dimitris S. Papailiopoulos, Alexandros G. Dimakis, and Sriram Vishwanath. Locality and availability in distributed storage. In , pages 681–685. IEEE, 2014.
[9] Gauri Joshi, Yanpei Liu, and Emina Soljanin. Coding for fast content download. In Communication, Control, and Computing (Allerton), 2012 50th Annual Allerton Conference on, pages 326–333. IEEE, 2012.
[10] Longbo Huang, Sameer Pawar, Hao Zhang, and Kannan Ramchandran. Codes can reduce queueing delay in data centers. In Proceed. 2012 IEEE International Symposium on Information Theory (ISIT'12), pages 2766–2770.
[11] Kristen Gardner, Samuel Zbarsky, Sherwin Doroudi, Mor Harchol-Balter, and Esa Hyytia. Reducing latency via redundant requests: Exact analysis. ACM SIGMETRICS Performance Evaluation Review, 43(1):347–360, 2015.
[12] Bin Li, Aditya Ramamoorthy, and R. Srikant. Mean-field-analysis of coding versus replication in cloud storage systems. In Computer Communications, IEEE INFOCOM 2016 - The 35th Annual IEEE International Conference on, pages 1–9. IEEE, 2016.
[13] Parimal Parag, Archana Bura, and Jean-Francois Chamberland. Latency analysis for distributed storage. In INFOCOM 2017 - IEEE Conference on Computer Communications, IEEE, pages 1–9. IEEE, 2017.
[14] Gauri Joshi, Emina Soljanin, and Gregory W. Wornell. Efficient redundancy techniques for latency reduction in cloud systems. ACM Trans. Modeling and Perform. Evaluation of Comput. Sys. (TOMPECS), to appear, 2017.
[15] Kristen Gardner, Mor Harchol-Balter, Alan Scheller-Wolf, and Benny Van Houdt. A better model for job redundancy: Decoupling server slowdown and job size. IEEE/ACM Transactions on Networking, 2017.
[16] Mehmet Aktas and Emina Soljanin. Heuristics for analyzing download time in MDS coded storage systems. In Information Theory (ISIT), 2018 IEEE International Symposium on. IEEE, 2018.
[17] Swanand Kadhe, Emina Soljanin, and Alex Sprintson. Analyzing the download time of availability codes. In , pages 1467–1471. IEEE, 2015.
[18] Swanand Kadhe, Emina Soljanin, and Alex Sprintson. When do the availability codes make the stored data more available? In , pages 956–963. IEEE, 2015.
[19] Qiqi Shuai and Victor O. K. Li. Reducing delay of flexible download in coded distributed storage system. In , pages 1–6, 2016.
[20] Viveck R. Cadambe and Arya Mazumdar. Bounds on the size of locally recoverable codes. IEEE Transactions on Information Theory, 61(11):5787–5794, 2015.
[21] Swanand Kadhe and Robert Calderbank. Rate optimal binary linear locally repairable codes with small availability. In Information Theory (ISIT), 2017 IEEE International Symposium on, pages 166–170. IEEE, 2017.
[22] Andreas Klein. On codes meeting the Griesmer bound. Discrete Mathematics, 274(1-3):289–297, 2004.
[23] Mehmet Aktas, Sarah E. Anderson, Ann Johnston, Gauri Joshi, Swanand Kadhe, Gretchen L. Matthews, Carolyn Mayer, and Emina Soljanin. On the service capacity region of accessing erasure coded content. arXiv preprint arXiv:1710.03376, 2017.
[24] Kristen Gardner, Mor Harchol-Balter, Esa Hyytiä, and Rhonda Righter. Scheduling for efficiency and fairness in systems with redundancy. Performance Evaluation, 116:1–25, 2017.
[25] Leopold Flatto and S. Hahn. Two parallel queues created by arrivals with two demands I. SIAM Journal on Applied Mathematics, 44(5):1041–1053, 1984.
[26] Ronald W. Wolff. Poisson arrivals see time averages. Operations Research, 30(2):223–231, 1982.
[27] Robert G. Gallager. Stochastic Processes: Theory for Applications. Cambridge University Press, 2013.
[28] M. Harchol-Balter. Performance Modeling and Design of Computer Systems: Queueing Theory in Action. Cambridge University Press, New York, NY, 1st edition, 2013.
[29] Yunjing Xu, Zachary Musgrave, Brian Noble, and Michael Bailey. Bobtail: Avoiding long tails in the cloud. In NSDI, volume 13, pages 329–342, 2013.
[30] Jeffrey Dean and Luiz André Barroso. The tail at scale. Communications of the ACM, 56(2):74–80, 2013.
[31] Charles Reiss, Alexey Tumanov, Gregory R. Ganger, Randy H. Katz, and Michael A. Kozuch. Towards understanding heterogeneous clouds at scale: Google trace analysis. Intel Science and Technology Center for Cloud Computing, Tech. Rep., page 84, 2012.
[32] E. Yücesan, C. H. Chen, J. L. Snowdon, J. M. Charnes, Donald Gross, and John F. Shortle. Difficulties in simulating queues with Pareto service. In Simulation Conference, 2002.
[33] Dror G. Feitelson. Workload Modeling for Computer Systems Performance Evaluation. Cambridge University Press, 2015.
[34] Tehmina Amjad, Muhammad Sher, and Ali Daud. A survey of dynamic replication strategies for improving data availability in data grids. Future Generation Computer Systems, 28(2):337–349, 2012.
[35] Yuval Ishai, Eyal Kushilevitz, Rafail Ostrovsky, and Amit Sahai. Batch codes and their applications. In Proceedings of the Thirty-Sixth Annual ACM Symposium on Theory of Computing, pages 262–271. ACM, 2004.
[36] Vitaly Skachek. Batch and PIR codes and their connections to locally repairable codes. In Network Coding and Subspace Designs, pages 427–442. Springer, 2018.
[37] Alexander Thomasian. Analysis of fork/join and related queueing systems. ACM Computing Surveys (CSUR), 47(2):17, 2015.
[38] R. Nelson and A. N. Tantawi. Approximate analysis of fork/join synchronization in parallel queues. IEEE Transactions on Computers, 37(6):739–743, 1988.
[39] Kristen Gardner, Mor Harchol-Balter, Esa Hyytiä, and Rhonda Righter. Scheduling for efficiency and fairness in systems with redundancy. Performance Evaluation, 116:1–25, 2017.

APPENDIX
A. On Conj. 1 in Sec. III
We here present a theorem that helps build intuition for Conj. 1. We first assume that the service times at the servers are exponentially distributed with rate µ. Let us redefine the state of the replicate-to-all system S as the type (see Lm. 1 for details) of service start for the request at the head of the system, so S ∈ {0, 1, ..., t}. System state transitions correspond to time epochs at which a request moves into service (see Def. 1 for the definition of request service start times).

Given that Conj. 1 holds, i.e., f_j > f_{j+1} for j = 0, ..., t − 1, one would expect the average drift at state-j to be towards the lower rank state-i's with i < j. In the theorem below we prove this indeed holds. However, it poses neither a necessary nor a sufficient condition for Conj. 1. The biggest difficulty in proving the conjecture is that a possible transition exists between every pair of states. Finding the transition probabilities requires the exact analysis of the system, which proved to be formidable. For instance, if the transitions were possible only between the consecutive states j and j + 1, as in a birth-death chain, the theorem below would have implied the conjecture.

Theorem 9:
In the replicate-to-all system, let J_i be the type of service time distribution for request-i. Then

Pr{J_{i+1} > j | J_i = j} < 0.5.

In other words, given a request with type-j service time distribution, the subsequent request is more likely to be served with a type-i distribution such that i ≤ j.

Proof:
Let us define L_k(t) as the absolute difference of the queue lengths at repair group k at time t. Suppose that the i-th request moves into service at time τ and its service time is type-j; namely, J_i = j. Let also A denote the event that L_k(τ) > 1 for every repair group k that has a leading server at time τ; we refer to the other recovery groups as non-leading. Then, the following inequality holds:

Pr{J_{i+1} > j | J_i = j, A} > Pr{J_{i+1} > j | J_i = j}.  (30)

Event A guarantees that J_{i+1} ≥ j, i.e., Pr{J_{i+1} ≥ j | J_i = j, A} = 1, because even when none of the leading servers advances before the i-th request departs, the next request will make at least a type-j start.

We next try to find Pr{J_{i+1} > j | J_i = j, A} = 1 − Pr{J_{i+1} = j | J_i = j, A}. Suppose that the i-th request departs at time τ′ and, without loss of generality, let repair group k have a leading server only if k ≤ j. The event {J_{i+1} = j | J_i = j, A} is equivalent to the event

B_j = {L_k(ζ) < 2, ζ ∈ [τ, τ′], j < k ≤ t} for 0 ≤ j ≤ t − 1.  (31)

This is because, for the (i + 1)-th request to have a type-(j + 1) or higher service time, one of the servers in at least one of the non-leading recovery groups should advance by at least two replicas before the i-th request terminates. Event B_j can be re-expressed as

B_j = ∪_{l=0}^{t−j} C_l;  C_l = {L_{k_i}(τ′) = 1; 1 ≤ i ≤ l, j < k_i ≤ t}.

Event C_l describes that l non-leading recovery groups start leading by one replica before the departure of the i-th request. Given that there are currently j leading recovery groups, let p^{+1}_j be the probability that a non-leading repair group starts to lead by one before the departure of the i-th request, and p^D_j be the probability that the i-th request departs first. Then, we can write

Pr{C_l} = p^D_{j+l} ∏_{k=j}^{j+l−1} p^{+1}_k.

Since the events C_l for l = 0, ..., t − j are disjoint, Pr{B_j} = Σ_{l=0}^{t−j} Pr{C_l}, and we get the recurrence relation

Pr{B_j} = p^D_j + p^{+1}_j Pr{B_{j+1}}.

Since the service times at the servers are exponentially distributed, the probabilities are easy to find as

p^D_j = (1 + j)/(1 + 2t),  p^{+1}_j = 2(t − j)/(1 + 2t).

Then we find

Pr{B_{t−1}} = p^D_{t−1} + p^{+1}_{t−1} p^D_t = t/(1 + 2t) + (2/(1 + 2t)) (1 + t)/(1 + 2t) = 1 − (2t² + t − 1)/(1 + 2t)² > 1/2.

Suppose Pr{B_{j+1}} > 1/2; then we have

Pr{B_j} = (1 + j)/(1 + 2t) + (2(t − j)/(1 + 2t)) Pr{B_{j+1}} > (1 + j)/(1 + 2t) + (t − j)/(1 + 2t) = (1 + t)/(1 + 2t) = 1/2 + 1/(2(1 + 2t)) > 1/2.

Knowing that Pr{B_{t−1}} > 1/2, and that Pr{B_j} > 1/2 whenever Pr{B_{j+1}} > 1/2, we have by induction Pr{B_j} > 1/2 for each j. By (30) and (31), we find

Pr{J_{i+1} > j | J_i = j} < Pr{J_{i+1} > j | J_i = j, A} = 1 − Pr{B_j} < 0.5,

which tells us that for any request with type-j service time, the subsequent request is more likely to have a service time with type no greater than j.

B. Approximate analysis of the Markov process for the replicate-to-all system with availability one
Here we give an approximate analysis of the process illustrated in Fig. 3 with a guessing based local balance equations approach. Consider the case where all three servers operate at the same rate, i.e., α = β = µ, which makes the pyramid process symmetric around the center column, i.e., p_{k,(i,0)} = p_{k,(0,i)} for 0 ≤ i ≤ k. Thus, the following discussion is given in terms of the states on the right side of the pyramid.

Observe that under low traffic load, the system spends almost the entire time in the states (0, (0,0)), (1, (0,0)), (1, (1,0)) and (1, (0,1)). Given this observation, notice that the rate entering into (1, (0,0)) due to request arrivals is equal to the rate leaving the state due to request departures. To help with guessing the steady-state probabilities, we start with the assumption that the rate entering into a state due to request arrivals is equal to the rate leaving the state due to request departures. This gives us the following relation between the steady-state probabilities of the column-wise subsequent states:

p_{k,(i,0)} = (λ/(γ + 2µ)) p_{k−1,(i,0)},  0 ≤ i ≤ k.  (32)

Let us define τ = λ/(γ + 2µ). This relation allows us to write p_{k,(i,0)} = τ^{k−i} p_{i,(i,0)}. However, this obviously won't hold for higher arrival rates, since at higher arrival rates some requests wait in the queue, which requires the rate entering into a state due to request arrivals to be higher than the rate leaving the state due to task completions. To be used in the following discussion, we first write p_{1,(1,0)} in terms of p_{0,(0,0)} from the global balance equations as follows:

λ p_{0,(0,0)} = γ p_{1,(0,0)} + 2(γ + µ) p_{1,(1,0)},
p_{1,(1,0)} = ((λ − γτ)/(2(γ + µ))) p_{0,(0,0)}.  (33)

For the nodes at the far right side of the pyramid, we can write the global balance equations and solve the corresponding recurrence relation as follows.
$$p_{i,(i,0)}(\lambda + \mu + \gamma) = p_{i,(i-1,0)}\,\mu + p_{i+1,(i+1,0)}(\mu + \gamma), \quad i \ge 1,$$
$$p_{i+2,(i+2,0)} = b\, p_{i+1,(i+1,0)} + a\, p_{i,(i,0)}, \quad i \ge 0, \quad \text{where } b = 1 + \frac{\lambda}{\mu + \gamma},\quad a = -\frac{\tau\mu}{\gamma + \mu},$$
$$\implies p_{i,(i,0)} = A r_1^{-i} + B r_2^{-i}, \quad \text{where } B = \frac{r_1\, p_{0,(0,0)} + \left(p_{1,(1,0)} - b\, p_{0,(0,0)}\right) r_1 r_2}{r_1 - r_2}, \quad A = p_{0,(0,0)} - B,$$
$$(r_1, r_2) = \left(\frac{-b - \sqrt{\Delta}}{2a},\ \frac{-b + \sqrt{\Delta}}{2a}\right), \quad \Delta = b^2 + 4a,$$
$$p_{k,(i,0)} = p_{k,(0,i)} = \tau^{k-i}\left(A r_1^{-i} + B r_2^{-i}\right), \quad 0 \le i \le k. \qquad (34)$$
Even though the algebra does not permit much cancellation, one can find the unknowns $A$ and $B$ above by computing $p_{0,(0,0)}$ from the normalization condition as follows.
$$\sum_{k=0}^{\infty} p_{k,(0,0)} + \sum_{i=1}^{\infty}\sum_{k=i}^{\infty}\left(p_{k,(i,0)} + p_{k,(0,i)}\right) = \frac{p_{0,(0,0)}}{1-\tau} + \frac{2}{1-\tau}\sum_{i=1}^{\infty} p_{i,(i,0)} \qquad (\tau < 1)$$
$$= \frac{p_{0,(0,0)}}{1-\tau} + \frac{2}{1-\tau}\sum_{i=1}^{\infty}\left(A r_1^{-i} + B r_2^{-i}\right) = \frac{p_{0,(0,0)}}{1-\tau} + \frac{2}{1-\tau}\left(\frac{A}{r_1 - 1} + \frac{B}{r_2 - 1}\right) \qquad (\text{by } (34),\ r_1, r_2 > 1)$$
$$= \frac{p_{0,(0,0)}}{1-\tau} + \frac{2}{1-\tau}\left(\frac{p_{0,(0,0)} - B}{r_1 - 1} + \frac{B}{r_2 - 1}\right) = \frac{p_{0,(0,0)}}{1-\tau} + \frac{2}{1-\tau}\left(\frac{B(r_1 - r_2)}{(r_1-1)(r_2-1)} + \frac{p_{0,(0,0)}}{r_1 - 1}\right)$$
$$= \frac{p_{0,(0,0)}}{1-\tau} + \frac{2}{1-\tau}\left(\frac{r_1\, p_{0,(0,0)} + r_1 r_2\left(p_{1,(1,0)} - b\, p_{0,(0,0)}\right)}{(r_1-1)(r_2-1)} + \frac{p_{0,(0,0)}}{r_1 - 1}\right)$$
$$= \frac{p_{0,(0,0)}}{1-\tau}\left(1 + \frac{2\left(r_1 + r_1 r_2\left((\lambda - \gamma\tau)/(2\gamma + 2\mu) - b\right)\right)}{(r_1-1)(r_2-1)} + \frac{2}{r_1 - 1}\right) = 1.$$
Simulation results show that the model for $p_{k,(i,0)}$ discussed above is proper in structure, i.e., $p_{k,(i,0)}$ decreases exponentially in $k$ and $i$. However, simulations also show that the correct decay rate is $\tau(\lambda) = k(\gamma,\mu)\,\lambda/(\gamma + 2\mu)$ for a correction factor $k(\gamma,\mu)$; for instance, for $\gamma = \mu$ the factor $k(\gamma,\mu)$ can be found numerically. Nevertheless, this does not permit finding a general expression for $k(\gamma,\mu)$.

C. Service rate allocation in the replicate-to-all system with availability one
Here we present the algebra showing that $\partial E[T_{\text{approx}}]/\partial\rho < 0$, where $E[T_{\text{approx}}]$ is given in Thm. 3. Define $C = \gamma + 2\mu$ and $\rho = \gamma/\mu$; then the following quantities can be calculated:
$$\hat{f} = \frac{1}{1 + 2/[\rho(\rho+2)]}, \qquad \frac{\partial \hat{f}}{\partial\rho} = \frac{4(\rho+1)}{(\rho^2 + 2\rho + 2)^2},$$
$$E[V_f] = \frac{\rho+2}{C(\rho+1)}, \qquad E[V_f^2] = \frac{2}{C^2}\left(\frac{\rho+2}{\rho+1}\right)^2,$$
$$E[V_s] = \frac{2(\rho+2)}{C(\rho+1)} - \frac{1}{C}, \qquad E[V_s^2] = \frac{4}{C^2}\left(\frac{\rho+2}{\rho+1}\right)^2 - \frac{2}{C^2},$$
$$\frac{\partial E[V_f]}{\partial\rho} = -\frac{1}{C(\rho+1)^2}, \qquad \frac{\partial E[V_f^2]}{\partial\rho} = -\frac{4(\rho+2)}{C^2(\rho+1)^3},$$
$$\frac{\partial E[V_s]}{\partial\rho} = -\frac{2}{C(\rho+1)^2}, \qquad \frac{\partial E[V_s^2]}{\partial\rho} = -\frac{8(\rho+2)}{C^2(\rho+1)^3}.$$
$$(i)\quad E[V_{lb}] = E[V_f] + \hat{f}\left(E[V_s] - E[V_f]\right),$$
$$\frac{\partial E[V_{lb}]}{\partial\rho} = \frac{\partial E[V_f]}{\partial\rho} + \frac{\partial \hat{f}}{\partial\rho}\left(E[V_s] - E[V_f]\right) + \hat{f}\,\frac{\partial\left(E[V_s] - E[V_f]\right)}{\partial\rho}$$
$$= -\frac{1}{C(\rho+1)^2} + \frac{4(\rho+1)}{(\rho^2+2\rho+2)^2}\cdot\frac{1}{C(\rho+1)} - \frac{1}{C(\rho+1)^2}\cdot\frac{\rho(\rho+2)}{\rho^2+2\rho+2}$$
$$= \frac{1}{C(\rho+1)^2}\left(-1 + \frac{4(\rho+1)^2}{(\rho^2+2\rho+2)^2} - \frac{\rho(\rho+2)}{\rho^2+2\rho+2}\right)$$
$$= \frac{-(\rho^2+2\rho+2)^2 + 4(\rho+1)^2 - (\rho^2+2\rho)(\rho^2+2\rho+2)}{C(\rho+1)^2(\rho^2+2\rho+2)^2}$$
$$= \frac{-2(\rho+1)^2(\rho^2+2\rho+2) + 4(\rho+1)^2}{C(\rho+1)^2(\rho^2+2\rho+2)^2} = \frac{-2(\rho+1)^2(\rho^2+2\rho)}{C(\rho+1)^2(\rho^2+2\rho+2)^2} < 0,$$
$$(ii)\quad E[V_{lb}^2] = E[V_f^2] + \hat{f}\left(E[V_s^2] - E[V_f^2]\right),$$
$$\frac{\partial E[V_{lb}^2]}{\partial\rho} = \frac{\partial E[V_f^2]}{\partial\rho} + \frac{\partial \hat{f}}{\partial\rho}\left(E[V_s^2] - E[V_f^2]\right) + \hat{f}\,\frac{\partial\left(E[V_s^2] - E[V_f^2]\right)}{\partial\rho}$$
$$= -\frac{4(\rho+2)}{C^2(\rho+1)^3} + \frac{8(\rho+1)}{C^2(\rho^2+2\rho+2)^2}\left(\left(\frac{\rho+2}{\rho+1}\right)^2 - 1\right) - \frac{4\rho(\rho+2)^2}{C^2(\rho+1)^3(\rho^2+2\rho+2)}$$
$$= \frac{8}{C^2}\left(\frac{2\rho+3}{(\rho+1)(\rho^2+2\rho+2)^2} - \frac{\rho+2}{2(\rho+1)^3}\right) - \frac{4\rho(\rho+2)^2}{C^2(\rho+1)^3(\rho^2+2\rho+2)} < 0.$$
$$(iii)\quad E[T_{\text{approx}}] = E[V_{lb}] + \frac{\lambda E[V_{lb}^2]}{2\left(1 - \lambda E[V_{lb}]\right)},$$
$$\frac{\partial E[T_{\text{approx}}]}{\partial\rho} = \frac{\partial E[V_{lb}]}{\partial\rho} + \frac{\lambda\,\frac{\partial E[V_{lb}^2]}{\partial\rho}\left(1 - \lambda E[V_{lb}]\right) + \lambda^2 E[V_{lb}^2]\,\frac{\partial E[V_{lb}]}{\partial\rho}}{2\left(1 - \lambda E[V_{lb}]\right)^2}$$
$$= \frac{\partial E[V_{lb}]}{\partial\rho}\left(1 + \frac{\lambda^2 E[V_{lb}^2]}{2\left(1 - \lambda E[V_{lb}]\right)^2}\right) + \frac{\partial E[V_{lb}^2]}{\partial\rho}\cdot\frac{\lambda}{2\left(1 - \lambda E[V_{lb}]\right)}.$$
Under stability, $\lambda E[V_{lb}] < 1$, and we found above that $\partial E[V_{lb}]/\partial\rho < 0$ and $\partial E[V_{lb}^2]/\partial\rho < 0$, which shows that $\partial E[T_{\text{approx}}]/\partial\rho < 0$.

D. Matrix Analytic solution for the replicate-to-all system with availability one
Defining $\delta = \alpha + \beta + \gamma + \lambda$, the sub-matrices forming $Q$ are given below, with the phases of each interior level ordered as $(0,2), (0,1), (0,0), (1,0), (2,0)$ and the boundary level consisting of the empty state followed by the level-one phases $(0,1), (0,0), (1,0)$:
$$F_0 = \begin{bmatrix} -\lambda & 0 & \lambda & 0 \\ \alpha+\gamma & \beta-\delta & 0 & 0 \\ \gamma & \beta & -\delta & \alpha \\ \beta+\gamma & 0 & 0 & \alpha-\delta \end{bmatrix}, \quad
H_0 = \begin{bmatrix} 0 & 0 & 0 & 0 & 0 \\ 0 & \lambda & 0 & 0 & 0 \\ 0 & 0 & \lambda & 0 & 0 \\ 0 & 0 & 0 & \lambda & 0 \end{bmatrix}, \quad
L_0 = \begin{bmatrix} 0 & \alpha+\gamma & 0 & 0 \\ 0 & 0 & \alpha+\gamma & 0 \\ 0 & 0 & \gamma & 0 \\ 0 & 0 & \beta+\gamma & 0 \\ 0 & 0 & 0 & \beta+\gamma \end{bmatrix},$$
$$F = \begin{bmatrix} \beta-\delta & 0 & 0 & 0 & 0 \\ \beta & -\delta & 0 & 0 & 0 \\ 0 & \beta & -\delta & \alpha & 0 \\ 0 & 0 & 0 & -\delta & \alpha \\ 0 & 0 & 0 & 0 & \alpha-\delta \end{bmatrix}, \quad
L = \begin{bmatrix} 0 & \alpha+\gamma & 0 & 0 & 0 \\ 0 & 0 & \alpha+\gamma & 0 & 0 \\ 0 & 0 & \gamma & 0 & 0 \\ 0 & 0 & \beta+\gamma & 0 & 0 \\ 0 & 0 & 0 & \beta+\gamma & 0 \end{bmatrix}, \quad
H = \lambda I_5.$$
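As a structural sanity check of the sub-matrices above, the sketch below assembles a finite truncation of the generator $Q$ (boundary level of 4 states, interior levels of 5 phases, with arrivals folded back at the top level) and verifies that every row sums to zero and that the resulting stationary distribution is a proper probability vector. The phase ordering follows the matrices as written above; the rate values and the truncation depth `levels` are illustrative assumptions.

```python
import numpy as np

# Assemble a truncated generator from the QBD sub-matrices and check
# that it is a valid CTMC generator (zero row sums, proper stationary
# distribution). Rates are illustrative.
alp, bet, gam, lam = 1.0, 1.0, 1.0, 0.5
dlt = alp + bet + gam + lam

F0 = np.array([[-lam, 0, lam, 0],
               [alp + gam, bet - dlt, 0, 0],
               [gam, bet, -dlt, alp],
               [bet + gam, 0, 0, alp - dlt]])
H0 = np.zeros((4, 5)); H0[1, 1] = H0[2, 2] = H0[3, 3] = lam
L0 = np.zeros((5, 4))
L0[0, 1] = L0[1, 2] = alp + gam
L0[2, 2] = gam
L0[3, 2] = L0[4, 3] = bet + gam
F = np.array([[bet - dlt, 0, 0, 0, 0],
              [bet, -dlt, 0, 0, 0],
              [0, bet, -dlt, alp, 0],
              [0, 0, 0, -dlt, alp],
              [0, 0, 0, 0, alp - dlt]])
L = np.zeros((5, 5))
L[0, 1] = L[1, 2] = alp + gam
L[2, 2] = gam
L[3, 2] = L[4, 3] = bet + gam
H = lam * np.eye(5)

levels = 6                     # number of interior levels kept (illustrative)
n = 4 + 5 * levels
Q = np.zeros((n, n))
Q[:4, :4] = F0
Q[:4, 4:9] = H0
Q[4:9, :4] = L0
for m in range(levels):
    r = 4 + 5 * m
    # drop arrivals at the top level so rows still sum to zero
    Q[r:r+5, r:r+5] = F + (H if m == levels - 1 else 0)
    if m > 0:
        Q[r:r+5, r-5:r] = L
    if m < levels - 1:
        Q[r:r+5, r+5:r+10] = H

row_sums = Q.sum(axis=1)

# stationary distribution: pi Q = 0, sum(pi) = 1
A = Q.T.copy()
A[-1, :] = 1.0
b = np.zeros(n); b[-1] = 1.0
pi = np.linalg.solve(A, b)
print("max |row sum| =", np.abs(row_sums).max(), " sum(pi) =", pi.sum())
```

The zero row sums confirm that $F_0, H_0, L_0$ and $F, H, L$ fit together consistently: within each level, the local transitions ($F$), arrivals ($H$), and departures ($L$) account for the total outflow rate $\delta$ (less $\beta$ or $\alpha$ in the truncated corner phases).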