On the Delay-Storage Trade-off in Content Download from Coded Distributed Storage Systems
Gauri Joshi, Yanpei Liu,
Student Member, IEEE,
Emina Soljanin,
Senior Member, IEEE
Abstract
In this paper we study how coding in distributed storage reduces expected download time, in addition to providing reliability against disk failures. The expected download time is reduced because when a content file is encoded to add redundancy and distributed across multiple disks, reading only a subset of the disks is sufficient to reconstruct the content. For the same total storage used, coding exploits the diversity in storage better than simple replication, and hence gives faster download. We use a novel fork-join queueing framework to model multiple users requesting the content simultaneously, and derive bounds on the expected download time. Our system model and results are a novel generalization of the fork-join system studied in the queueing theory literature. Our results demonstrate the fundamental trade-off between the expected download time and the amount of storage space. This trade-off can be used to determine the amount of redundancy required to meet the delay constraints on content delivery.
Index Terms: distributed storage, fork-join queues, MDS codes
This work was presented in part at the 50th Annual Allerton Conference on Communication, Control and Computing, Monticello, IL, October 2012. G. Joshi is with the Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA. Y. Liu is with the Department of Electrical and Computer Engineering, University of Wisconsin-Madison, Madison, WI. E. Soljanin is with Bell Labs, Alcatel-Lucent, Murray Hill, NJ. (E-mail: [email protected], [email protected], [email protected])

I. INTRODUCTION
Large-scale cloud storage and distributed file systems such as Amazon Elastic Block Store (EBS) [1] and Google File System (GoogleFS) [2] have become the backbone of many applications such as web search, e-commerce, and cluster computing. In these distributed storage systems, the content files stored on a set of disks may be simultaneously requested by multiple users. The users have two major demands: reliable storage and fast content download. Content download time includes the time a user spends competing with other users for access to the disks, and the time to acquire the data from the disks. Fast content download is important for delay-sensitive applications such as video streaming and VoIP, as well as collaborative tools like Dropbox [3] and Google Docs [4].

The authors in [2] point out that in large-scale distributed storage systems, disk failures are the norm and not the exception. To protect the data from disk failures, cloud storage providers today simply replicate content over multiple disks throughout the storage network. In addition to fault tolerance, replication makes the content quickly accessible, since multiple users requesting a content can be directed to different replicas. However, replication consumes a large amount of storage space. In data centers that process massive amounts of data, using more storage space implies higher expenditure on electricity, maintenance and repair, as well as the cost of leasing physical space.

Coding, which was originally developed for reliable communication in the presence of noise, offers a more efficient way to store data in distributed systems. The main idea behind coding is to add redundancy so that a content, stored on a set of disks, can be reconstructed by reading only a subset of these disks. Previous work shows that coding can achieve the same reliability against failures with less storage space. It also allows efficient replacement of disks that have to be removed due to failure or maintenance.
We show that, in addition to reliability and easy repair, coding also gives faster content download, because we only have to wait for the download to complete from a subset of the disks. Some preliminary results on the analysis of download time via queueing-theoretic modeling were presented in [5].

A. Previous Work
Research in coding for distributed storage was galvanized by the results reported in [6]. Prior to that work, the literature on distributed storage recognized that, compared with replication, coding can offer huge storage savings for the same reliability level. But it was also argued that the benefits of coding are limited, and are outweighed by certain disadvantages and extra complexity. Namely, to provide reliability in multi-disk storage systems, when some disks fail, it must be possible to restore either the exact lost data or an equivalent level of reliability with minimal download from the remaining storage. This problem of efficient recovery from disk failures was addressed in some early work [7]. But in general, the cost of repair regeneration was considered much higher in coded systems than in replicated ones [8], until [6] established the existence and advantages of new regenerating codes. This work was quickly followed up, and the area is very active today (see e.g., [9]–[11] and references therein).

Only recently [12]–[14] was it realized that, in addition to reliability, coding can guarantee the same level of content accessibility with lower storage than replication. In [12], a scenario is considered where, when there are multiple requests, all but one of them are blocked, and accessibility is measured in terms of blocking probability. In [13], multiple requests are placed in a queue instead of being blocked, and the authors propose a scheduling scheme that maps requests to servers (or disks) to minimize the waiting time. In [14], the authors give a combinatorial proof that flooding requests to all disks, instead of a subset of them, gives the fastest download time. This result corroborates the system model we use in this paper to model the distributed storage system and analyze its download time.

Using redundancy for delay reduction has also been studied in the context of packet transmission in [15]–[17], and for some content retrieval scenarios in [18], [19].
Although these works share a common spirit with ours, they do not consider the effect of queueing of requests in coded distributed storage systems.
B. Our Contributions
In this paper we show that coding allows fast content download in addition to reliable storage. Since multiple users can simultaneously request the content, the download time includes the time to wait for access to the disks plus the time to read the data. When the content is coded and distributed on multiple disks, it is sufficient to read it from only a subset of these disks in order to retrieve the content. We take a queueing-theoretic approach to study how coding the content in this way provides diversity in storage, and achieves a significant reduction in the download time. The analysis of download time leads us to an interesting trade-off between download time and storage space, which can be used to design the optimal level of redundancy in a distributed storage system. To the best of our knowledge, we are the first to propose the (n, k) fork-join system and find bounds on its mean response time, a novel generalization of the (n, n) fork-join system studied in the queueing theory literature.

We consider that requests entering the system are assigned to multiple disks, where they enter local queues waiting for disk access. Note that this is in contrast to some existing works (e.g. [13], [14]) where requests wait in a centralized queue when all disks are busy. Our approach of immediate dispatching of requests to local queues is used by most server farms to facilitate fast acknowledgement responses to customers [20]. Under this queueing model, we propose the (n, k) fork-join system, where each request is forked to the n disks that store the coded content, and it exits the system when any k (k ≤ n) disks are read. The (n, n) fork-join system, in which all n disks have to be read, has been extensively studied in the queueing theory and operations research literature [21]–[23]. Our analysis of download time can be seen as a generalization of the analysis of the (n, n) fork-join system.

The rest of the paper is organized as follows.
In Section II, we present some preliminary concepts that are central to the results presented in the paper. In Section III, we analyze the expected download time of the (n, k) fork-join system and present the fundamental trade-off between expected download time and storage. These results were presented in part in [5]. In Section IV, we relax some simplifying assumptions, and present the delay-storage trade-off by considering some practical issues such as heavy-tailed and correlated service times of the disks. In Section V, we extend the analysis to distributed storage systems with a large number of disks. Such systems can be divided into groups of n disks each, where each group is an independent (n, k) fork-join system. Finally, Section VI concludes the paper and gives future research directions.

Fig. 1: Storage is higher, but response time (per disk & overall) is reduced.

II. PRELIMINARY CONCEPTS
A. Reducing Delay using Coding
One natural way to reduce the download time is to replicate the content across n disks. Then, if the user issues n download requests, one to each disk, it only needs to wait for one of the requests to be served. This strategy gives a sharp reduction in download time, but at the cost of n times more storage space and the cost of processing multiple requests.

It is more efficient to use coding instead of replication. Consider a content F of unit size, divided into k blocks of equal size. It is encoded into n ≥ k blocks using an (n, k) maximum distance separable (MDS) code, and the coded blocks are stored on an array of n disks. MDS codes have the property that any k out of the n blocks are sufficient to reconstruct the entire file. MDS codes have been suggested to provide reliability against disk failures. In this paper we show that, in addition to error correction, we can exploit these codes to reduce the download time of the content.

The encoded blocks are stored on n different disks (one block per disk). Each incoming request is sent to all n disks, and the content can be recovered when any k out of the n blocks are successfully downloaded. An illustrative example with n = 3 disks and k = 2 is shown in Fig. 1. The content F is split into equal blocks a and b, and stored on the disks as a, b, and a ⊕ b, the exclusive-or of blocks a and b. Thus each disk stores content of half the size of file F. Downloads from any two disks jointly enable reconstruction of F.

B. Role of Order Statistics

The time taken to download a block of content is a random variable. If the block download times are independent and identically distributed (i.i.d.), the time to download any k out of n blocks is the k-th order statistic of the block download times. We now provide some background on order statistics of i.i.d. random variables; for a more complete treatment, please refer to [24]. Although in our system model the block download times are not i.i.d., this background on i.i.d. order statistics is a powerful tool for our analysis of the dependent case, as shown in later sections.

Let X₁, X₂, ..., X_n be i.i.d. random variables. Then X_{k,n}, the k-th order statistic of the X_i, 1 ≤ i ≤ n, i.e., the k-th smallest of the variables, has the distribution

f_{X_{k,n}}(x) = n (n−1 choose k−1) F_X(x)^{k−1} (1 − F_X(x))^{n−k} f_X(x),

where f_X is the probability density function (PDF) and F_X is the cumulative distribution function (CDF) of the X_i. In particular, if the X_i are exponential with mean 1/µ, then the expectation and variance of the order statistic X_{k,n} are given by

E[X_{k,n}] = (1/µ) Σ_{i=1}^{k} 1/(n−k+i) = (H_n − H_{n−k})/µ,   (1)

V[X_{k,n}] = (1/µ²) Σ_{i=1}^{k} 1/(n−k+i)² = (H_{n²} − H_{(n−k)²})/µ²,   (2)

where H_n and H_{n²} are generalized harmonic numbers defined by

H_n = Σ_{j=1}^{n} 1/j and H_{n²} = Σ_{j=1}^{n} 1/j².   (3)

We observe from (1) that, for fixed n, E[X_{k,n}] decreases as k becomes smaller. This fact will help us understand the analysis of download time in Section III and Section V respectively.

C. Assignment Policies
In our distributed storage model, we divide the content into k blocks, use 1/k units of space on each disk, and hence the total storage space used is n/k units. This is unlike conventional replication-based storage solutions where n entire copies of the content are stored on the n disks. In such systems, each incoming request can be assigned to any of the n disks. One such assignment policy is the power-of-d assignment [20], [25]. For each incoming request, the power-of-d job assignment uniformly selects d nodes (d ≤ n) and sends the request to the node with the least work left among the d nodes. The work left at a node can be the expected time for that node to become empty if there were no new arrivals, or simply the number of jobs queued. When d = n, power-of-d reduces to the least-work-left (LWL) policy (or join-the-shortest-queue (JSQ) if work is measured by the number of jobs). Power-of-d assignment has received much attention recently due to the prevailing popularity of large-scale parallel computing. In Section III and Section V, we compare these policies with our proposed distributed storage model.

III. THE (n, k) FORK-JOIN SYSTEM
We consider the scenario where users attempt to download the content from the distributed storage system, with their requests placed in a queue at each disk. In Section III-A we propose the (n, k) fork-join system to model the queueing of download requests, and we derive theoretical bounds on the expected download time in Section III-B. This analysis leads us to the fundamental trade-off between download time and storage, which provides insights into practical system design. Numerical and simulation results demonstrating this trade-off are presented in Section III-C.

A. System Model
We model the queueing of download requests at the disks using the (n, k) fork-join system, which is defined as follows.

Definition 1 ((n, k) fork-join system). An (n, k) fork-join system consists of n nodes. Every arriving job is divided into n tasks which enter first-come first-serve queues at each of the n nodes. The job departs the system when any k out of the n tasks are served by their respective nodes. The remaining n − k tasks abandon their queues and exit the system before completion of service.

The (n, n) fork-join system, known in the literature as the fork-join queue, has been extensively studied in, e.g., [21]–[23]. However, the (n, k) generalization in Definition 1 above has not been previously studied to the best of our knowledge. Fig. 2 illustrates the (3, 2) fork-join system corresponding to the coded distributed storage example shown in Fig. 1.

Fig. 2: Illustration of the (3, 2) fork-join system. Since 2 out of 3 tasks of Job 1 are served, the third task abandons its queue and the job exits the system. Job 2 has to wait for one more task to be served.

Each download request, or job, is forked into tasks at the three nodes. When 2 out of 3 tasks are served, the third task abandons its queue and the job exits the system. For example, Job 1 is about to exit the system, while Job 2 is waiting for one more task to be served. The letters F and J denote the fork and join operations respectively.

We consider that the arrival of download requests is Poisson with rate λ. Every request is forked to the n disks. The time taken to download one unit of data is exponential with mean 1/µ. Since each disk stores 1/k units of data, we consider the service time of each node to be exponentially distributed with mean 1/µ′, where µ′ = kµ. Define the load factor ρ′ ≜ λ/µ′. This model with an M/M/1 queue at every disk is sometimes referred to as a Flatto-Hahn-Wright (or FHW) model [26], [27] in the fork-join queue literature.
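The dynamics of Definition 1 can be estimated numerically. The sketch below is our own minimal discrete-event simulation of the (n, k) fork-join system (all function and variable names are ours, not from the paper): each arriving job forks one task into every node's FCFS queue, and when its k-th task completes, its remaining tasks are cancelled. Cancelling an in-service task and drawing a fresh exponential service time for the next head-of-line task is statistically harmless here by memorylessness.

```python
import random
from collections import deque

def simulate_fork_join(n, k, lam, mu, num_jobs=20000, seed=1):
    """Estimate the mean response time of an (n, k) fork-join system:
    Poisson(lam) job arrivals, each job forked into n tasks served FCFS
    at n nodes with Exp(mu) service (mu is the per-node rate mu');
    a job departs when k tasks finish, and its other tasks abandon."""
    rng = random.Random(seed)
    INF = float("inf")
    arr, t = [], 0.0
    for _ in range(num_jobs):                      # pre-draw arrival times
        t += rng.expovariate(lam)
        arr.append(t)
    queue = [deque() for _ in range(n)]            # job ids; head is in service
    end = [INF] * n                                # completion time of in-service task
    ndone = [0] * num_jobs                         # finished tasks per job
    resp, completed, next_arr = 0.0, 0, 0
    while completed < num_jobs:
        i = min(range(n), key=lambda j: end[j])    # node with earliest completion
        if next_arr < num_jobs and arr[next_arr] <= end[i]:
            now = arr[next_arr]                    # arrival event: fork n tasks
            for j in range(n):
                queue[j].append(next_arr)
                if end[j] == INF:                  # idle node starts serving
                    end[j] = now + rng.expovariate(mu)
            next_arr += 1
            continue
        now = end[i]                               # task completion at node i
        job = queue[i].popleft()
        end[i] = INF
        ndone[job] += 1
        if ndone[job] == k:                        # job departs; cancel the rest
            completed += 1
            resp += now - arr[job]
            for j in range(n):
                if job in queue[j]:
                    if queue[j][0] == job:
                        end[j] = INF               # cancel the in-service task
                    queue[j].remove(job)
        for j in range(n):                         # restart any idle node with work
            if queue[j] and end[j] == INF:
                end[j] = now + rng.expovariate(mu)
    return resp / num_jobs

t_sim = simulate_fork_join(n=3, k=1, lam=1.0, mu=1.0)
```

For k = 1 the (n, 1) system behaves as a single M/M/1 queue with service rate nµ (the minimum of n exponentials is Exp(nµ)), so the estimate above should be close to 1/(3µ − λ) = 0.5 for these parameters.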
Most of our analytical results in Section III and Section V are for the FHW model, while we use simulations to study systems with M/G/1 queues at the disks in Section IV.

For the (n, n) fork-join system to be stable, [28] shows that the arrival rate λ must be less than µ′, the service rate of a node, which in our (n, n) system equals nµ. In Lemma 1 below, we show that λ < nµ is also a necessary condition for the stability of the (n, k) fork-join system for any 1 ≤ k ≤ n.

Lemma 1 (Stability of the (n, k) fork-join system). For the (n, k) fork-join system to be stable, the rate of Poisson arrivals λ and the service rate µ′ = kµ per node must satisfy

λ < nµ′/k = nµ.   (4)

Proof:
Tasks arrive at each queue at rate λ and are served at rate µ′ = kµ. But when k out of the n tasks finish service, the remaining n − k tasks abandon their queues. A task can be one of the abandoning tasks with probability (n − k)/n. Hence the effective arrival rate to each queue is λ minus the rate of abandonment λ(n − k)/n. Then the condition for stability of each queue is

λ − λ(n − k)/n < µ′,   (5)

which reduces to (4).

B. Bounds on the Mean Response Time
Our objective is to determine the expected download time, which we refer to as the mean response time T_{(n,k)} of the (n, k) fork-join system. It is the expected time that a job spends in the system, from its arrival until k out of its n tasks are served by their respective nodes. Previous works [21]–[23] have studied T_{(n,n)}, but it has not been found in closed form; only bounds are known. An exact expression for the mean response time is known only for the (2, 2) fork-join system [22].

Since the n tasks are served by independent M/M/1 queues, intuition suggests that T_{(n,k)} is the k-th order statistic of n exponential service times. However, this is not true, which makes the analysis of T_{(n,k)} challenging. The reason the order-statistics approach does not work is that when j nodes (j < n) finish serving their tasks, they can start serving the tasks of the next job (cf. Fig. 2). As a result, the service time of a job depends on the departure times of previous jobs.

We now present upper and lower bounds on the mean response time T_{(n,k)}. The numerical results in Section III-C show that these bounds are fairly tight.

Theorem 1 (Upper Bound on Mean Response Time). The mean response time T_{(n,k)} of an (n, k) fork-join system satisfies

T_{(n,k)} ≤ (H_n − H_{n−k})/µ′ + λ[(H_n − H_{n−k})² + H_{n²} − H_{(n−k)²}] / (2µ′²[1 − ρ′(H_n − H_{n−k})]),   (6)

where λ is the request arrival rate, µ′ is the service rate at each queue, ρ′ = λ/µ′ is the load factor, and the generalized harmonic numbers H_n and H_{n²} are as given in (3). The bound is valid only when ρ′(H_n − H_{n−k}) < 1.

Proof: To find this upper bound, we use a model called the split-merge system, which is similar to, but easier to analyze than, the fork-join system. In the (n, k) fork-join queueing model, after a node serves a task, it can start serving the next task in its queue.
In contrast, in the split-merge model, the n nodes are blocked until k of them finish service. Thus, the job departs all the queues at the same time. Due to this blocking of nodes, the mean response time of the (n, k) split-merge model is an upper bound on (and a pessimistic estimate of) T_{(n,k)} for the (n, k) fork-join system.

The (n, k) split-merge system is equivalent to an M/G/1 queue where arrivals are Poisson with rate λ and the service time is a random variable S distributed as the k-th order statistic of the exponential distribution. The mean and variance of S are (cf. (1) and (2))

E[S] = (H_n − H_{n−k})/µ′ and V[S] = (H_{n²} − H_{(n−k)²})/µ′².   (7)

The Pollaczek-Khinchin formula [29] gives the mean response time T of an M/G/1 queue in terms of the mean and variance of S as

T = E[S] + λ(E[S]² + V[S]) / (2(1 − λE[S])).   (8)

Substituting the values of E[S] and V[S] given by (7), we get the upper bound (6). Note that the Pollaczek-Khinchin formula is valid only when λE[S] < 1, the stability condition of the M/G/1 queue. Since E[S] increases with k, there exists a k₀ such that the M/G/1 queue is unstable for all k ≥ k₀. The inequality λE[S] < 1 can be rewritten as ρ′(H_n − H_{n−k}) < 1, which is the condition for the validity of the upper bound given in Theorem 1.

Remark 1. For the (n, n) fork-join system, the authors in [22] find an upper bound on the mean response time different from (6) derived above. To find their bound, they first prove that the response times of the n queues form a set of associated random variables [30]. Then they use the property of associated random variables that their expected maximum is smaller than that of independent variables with the same marginal distributions. However, the approach used in [22] cannot be extended to the (n, k) fork-join system with k < n, because this property of associated variables does not hold for the k-th order statistic when k < n.
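The split-merge bound of (6)–(8) is straightforward to evaluate numerically. The following sketch (the helper names are ours) plugs the order-statistic mean and variance (7) into the Pollaczek-Khinchin formula (8), and checks the k = 1 case, where the split-merge and fork-join systems coincide:

```python
def gen_harmonic(n, r=1):
    """Generalized harmonic number: sum_{j=1}^{n} 1/j**r, cf. (3)."""
    return sum(1.0 / j**r for j in range(1, n + 1))

def pk_upper_bound(n, k, lam, mu_prime):
    """Upper bound (6) on T_(n,k) via the split-merge M/G/1 queue:
    service S is the k-th order statistic of n Exp(mu_prime) draws,
    plugged into the Pollaczek-Khinchin formula (8)."""
    ES = (gen_harmonic(n) - gen_harmonic(n - k)) / mu_prime           # mean in (7)
    VS = (gen_harmonic(n, 2) - gen_harmonic(n - k, 2)) / mu_prime**2  # variance in (7)
    if lam * ES >= 1:
        raise ValueError("bound invalid: rho'(H_n - H_{n-k}) >= 1")
    return ES + lam * (ES**2 + VS) / (2 * (1 - lam * ES))             # formula (8)

# k = 1: the bound coincides with the exact value 1/(n*mu - lam) = 0.5 here.
ub = pk_upper_bound(3, 1, lam=1.0, mu_prime=1.0)
```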
As a corollary to Theorem 1 above, we can get an exact expression for T_{(n,1)}, the mean response time of the (n, 1) fork-join system. Recall that in the (n, 1) fork-join system, the entire content is replicated on n disks, and we just have to wait for any one disk to serve the incoming request.

Corollary 1.
The mean response time T_{(n,1)} of the (n, 1) fork-join system is given by

T_{(n,1)} = 1/(nµ − λ),   (9)

where λ is the rate of Poisson arrivals and µ is the service rate.

Proof: In Theorem 1 we constructed the (n, k) split-merge system, which always has a worse response time than the corresponding (n, k) fork-join system. For the special case k = 1, the split-merge system is equivalent to the fork-join system and gives the same response time. Substituting k = 1 and µ′ = kµ = µ in (7) and (8), we get the result (9).

Theorem 2 (Lower Bound on Mean Response Time). The mean response time T_{(n,k)} of an (n, k) fork-join system satisfies

T_{(n,k)} ≥ (1/µ′)[H_n − H_{n−k} + ρ′(H_{n(n−ρ′)} − H_{(n−k)(n−k−ρ′)})],   (10)

where λ is the request arrival rate, µ′ is the service rate at each queue, ρ′ = λ/µ′ is the load factor, and the generalized harmonic number H_{n(n−ρ′)} is given by

H_{n(n−ρ′)} = Σ_{j=1}^{n} 1/(j(j − ρ′)).

Proof:
The lower bound in (10) is a generalization of the bound for the (n, n) fork-join system derived in [23]. The bound for the (n, n) system is derived by considering that a job goes through n stages of processing. A job is said to be in the j-th stage if j out of its n tasks have been served by their respective nodes, for 0 ≤ j ≤ n − 1. The job waits for the remaining n − j tasks to be served, after which it departs the system. For the (n, k) fork-join system, since we only need k tasks to finish service, each job goes through only k stages of processing. In the j-th stage, where 0 ≤ j ≤ k − 1, j tasks have been served and the job departs when k − j more tasks finish service.

We now show that the service rate of a job in the j-th stage of processing is at most (n − j)µ′. Consider two jobs B₁ and B₂ in the i-th and j-th stages of processing respectively. Let i > j, that is, B₁ has completed more tasks than B₂. Job B₂ moves to the (j + 1)-th stage when one of its n − j remaining tasks completes. If all these tasks are at the heads of their respective queues, the service rate of job B₂ is exactly (n − j)µ′. However, since i > j, one of B₁'s tasks could be ahead of B₂'s in one of the n − j pending queues, due to which that task of B₂ cannot be served immediately. Hence, the service rate of a job in the j-th stage of processing is at most (n − j)µ′.

Thus, the time for a job to move from the j-th to the (j + 1)-th stage is lower bounded by 1/((n − j)µ′ − λ), the mean response time of an M/M/1 queue with arrival rate λ and service rate (n − j)µ′.
The total mean response time is the sum of the mean response times of each of the k stages of processing, and is bounded below as

T_{(n,k)} ≥ Σ_{j=0}^{k−1} 1/((n − j)µ′ − λ)
         = (1/µ′) Σ_{j=0}^{k−1} 1/((n − j) − ρ′)
         = (1/µ′) Σ_{j=0}^{k−1} [1/(n − j) + ρ′/((n − j)(n − j − ρ′))]
         = (1/µ′)[H_n − H_{n−k} + ρ′(H_{n(n−ρ′)} − H_{(n−k)(n−k−ρ′)})].

Hence, we have found lower and upper bounds on the mean response time T_{(n,k)}. In Fig. 3 we demonstrate how the tightness of the bounds changes with the service rate µ. The figure shows the mean response time of a (10, k) fork-join system versus k for three service rates: µ = 3, µ = 1, and a third, smaller value.

Fig. 3: Behavior of the mean response time T_{(10,k)} as k increases (and total storage n/k decreases), for arrival rate λ = 1 and service rates (a) µ = 3, (b) µ = 1, and (c) a smaller µ. The plot shows that the bounds on mean response time given by (6) and (10) are tight when the system is lightly loaded, and become loose as µ decreases and/or k increases.

Note that the upper bound for k = n shown in the plot is T_{(n,n)} ≤ H_n/(nµ − λ) as given in [22], instead of the bound in (6). The reason behind this substitution is explained in Remark 1.

We observe in Fig. 3 that the bounds become loose as k increases and/or µ decreases. In particular, the upper bound becomes loose because the blocking of queues in the split-merge system becomes significant as k increases and/or µ decreases. At the smallest service rate, the upper bound in (6) eventually becomes invalid for large k because the condition ρ′(H_n − H_{n−k}) < 1 is violated. Similarly, the lower bound becomes loose with increasing k and decreasing µ because the difference between the actual service rate in the j-th stage of processing and its bound (n − j)µ′ increases. When k = 1, the bounds coincide and give T_{(n,1)} = 1/(nµ − λ).

C. Download Time vs. Storage Space Trade-off

Fig. 4: CDFs of the response time of (10, k) fork-join systems, and the required storage.

In this section we present numerical results demonstrating the fundamental trade-off between the storage and the response time of the (n, k) fork-join system. We also compare the response time of the (n, k) fork-join system to the power-of-d and LWL assignment policies introduced in Section II-C.

The expected download time of the file can be reduced in two ways: 1) by increasing the total storage, i.e., the storage expansion n/k per file, and 2) by increasing the number n of disks used for file storage. Both the total storage and the number of disks can be limiting factors in practice. We first address the scenario where the number of disks n is kept constant, but the storage expansion n/k varies between 1 and n as we choose k between n and 1. We then study the scenario where the storage expansion factor n/k is kept constant, but the number of disks varies.
1) Flexible Storage Expansion & Fixed Number of Disks:
Fig. 3 is a plot of mean response time versus k for a fixed number of disks n. Note that as we increase k, the total storage n/k used decreases, as shown in Fig. 3b. When we increase k, two factors affect the mean response time T_{(n,k)} in opposite ways: 1) as k increases, the storage per disk decreases, which reduces the mean response time; 2) with higher k we have to wait for more nodes to finish service before the job can exit the system, so we lose the diversity benefit of coding, which increases the mean response time.

In Fig. 3 we observe that at the two higher service rates, the second factor dominates, causing the mean response time T_{(n,k)} to strictly increase with k. At the lowest service rate, shown in Fig. 3c, the mean response time first decreases and then increases with k. At small k (e.g. k = 1), the per-node service time 1/kµ is large, outweighing the benefit of waiting for just k nodes to finish service. At large k, waiting for many nodes to respond outweighs the fast 1/kµ service time. Due to this phenomenon, there is an optimal k that minimizes the mean response time.

In addition to a small mean response time, ensuring quality-of-service to the user may also require that the probability of exceeding some maximum tolerable response time be small. Thus, we study the cumulative distribution function (CDF) of the response time for different values of k for a fixed n. In Fig. 4 we plot the CDF of the response time for k = 1, 5, and 10, with fixed n = 10. The arrival rate and service rate are λ = 1 and µ = 3 as defined earlier. For k = 1, the CDF is that of the minimum of n exponential random variables, which is itself exponentially distributed.

The CDF plot can be used to design a storage system that gives probabilistic bounds on the response time. For example, if we wish to keep the response time below a given deadline with sufficiently high probability, the CDF plot shows that k = 5 and k = 10 satisfy the requirement while k = 1 does not. The plot also shows that, at a given response time, a larger fraction of requests is complete in the fork-join systems than in the single-disk case.
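The k = 1 curve in Fig. 4 can be reproduced without a full fork-join simulator: for k = 1 the system behaves as a single M/M/1 queue with service rate nµ (the minimum of n exponentials is Exp(nµ)), and the stationary M/M/1 response time is exponential with rate nµ − λ. A minimal sketch using the Lindley recursion (our own code, with the paper's λ = 1, µ = 3, n = 10):

```python
import math
import random

def mm1_response_times(lam, mu, num_jobs=100_000, seed=3):
    """Sample FCFS M/M/1 response times via the Lindley recursion:
    wait_{i+1} = max(0, wait_i + service_i - interarrival_{i+1})."""
    rng = random.Random(seed)
    wait, out = 0.0, []
    for _ in range(num_jobs):
        s = rng.expovariate(mu)
        out.append(wait + s)                       # response = wait + service
        wait = max(0.0, wait + s - rng.expovariate(lam))
    return out

n, lam, mu = 10, 1.0, 3.0
resp = mm1_response_times(lam, n * mu)             # the (10, 1) fork-join case
t = 0.1
empirical_cdf = sum(r <= t for r in resp) / len(resp)
exact_cdf = 1 - math.exp(-(n * mu - lam) * t)      # P(T <= t) for Exp(n*mu - lam)
```

With these parameters the empirical and exact CDF values at t = 0.1 should agree to about two decimal places.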
2) Flexible Number of Disks & Fixed Storage Expansion:
Next, we take a different viewpoint and analyze the benefit of spreading the content across more disks while using the same total storage space. Fig. 5 plots the bounds (6) and (10) on the mean response time T_{(n,k)} versus k while keeping the code rate constant at k/n = 1/2, for the (n, k) fork-join system with λ = 1 and three different values of µ. For these parameter values the bounds are tight, and can be used for analysis in place of simulations.

Fig. 5: Upper and lower bounds on the mean response time T_{(n,n/2)} for λ = 1 and three different service rates µ. Due to the diversity advantage of more disks, T_{(n,n/2)} reduces with n.

We observe that the mean response time T_{(n,k)} reduces as k increases, because we get the diversity advantage of having more disks. The reduction in T_{(n,k)} happens at a higher rate for small values of k and µ. For heavy-tailed service time distributions (e.g. Pareto, cf. Section IV), the benefit that comes from diversity is even larger.

T_{(n,k)} approaches zero as n → ∞ for a fixed storage expansion n/k. This is because we assumed that the service rate of a single disk is kµ, since 1/k units of the content F are stored on each disk. In practice, however, the mean service time 1/kµ will not go to zero, because reading a disk requires some non-zero setup time to complete each task, irrespective of the amount of data read from the disk. In Section IV we will see how this setup time affects the delay-storage trade-off.

In order to understand the response time better, we plot in Fig. 6 the CDF of the response time for different values of k for a fixed ratio k/n = 1/2. Again we observe that the diversity of an increasing number of disks n helps to reduce the response time.

Fig. 6: CDFs of the response time of (n, k = n/2) fork-join systems, and the required storage.

Fig. 7: For λ = 1 and the same amount of total storage used (10 units), the fork-join system has lower mean response time than the corresponding power-of-d and LWL assignment policies.

3) Comparison with Power-of-d Assignment:
We now compare the mean response time of the $(n,k)$ fork-join system with the power-of-$d$ and least-work-left (LWL) job assignment policies introduced in Section II. Recall that for each incoming request, the power-of-$d$ policy assigns the request to the node with the least work left among $d$ uniformly selected nodes. Fig. 7 plots the mean response time versus $1/\mu$, the average time taken to download one unit of content. It compares the $(n,k)$ fork-join system, which uses $n/k$ units of total storage, with the power-of-$d$ and LWL assignment policies with the entire content (one unit) replicated on $n/k$ disks. Thus, all the systems shown in Fig. 7 use the same total storage space, $n/k = 10$ units.

We observe in Fig. 7 that the fork-join system outperforms the power-of-$d$ and LWL assignment policies. This is because, as we saw in Fig. 5, when we increase $n$ and $k$ while keeping the ratio $n/k$ (the total storage) fixed, the mean response time of the $(n,k)$ fork-join system decreases. That is, the diversity advantage dominates the slowdown due to waiting for more nodes to finish service. Thus, for large enough $n$, the $(n,k)$ fork-join system outperforms the corresponding power-of-$d$ scheme that uses the same storage space of $n/k$ units.

There are other practical issues that are not considered in Fig. 7. For instance, in the $(n,k)$ fork-join system there are communication costs associated with forking jobs to $n$ nodes, and costs of decoding the MDS-coded blocks. On the other hand, the power-of-$d$ assignment system requires constant feedback from the nodes to determine the work left at each node.

IV. GENERALIZING THE SERVICE DISTRIBUTION
The theoretical analysis and numerical results so far assumed a specific service time distribution at each node, namely the exponential distribution. In this section we present some results for more general service time distributions. In Section IV-A we extend the upper bound to general service time distributions. We present numerical results for heavy-tailed and correlated service times in Section IV-B and Section IV-C respectively.
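Since the bounds that follow hold for arbitrary service distributions, it is useful to have a simulator against which they can be checked. The sketch below is our own illustration, not part of the paper; the function name and parameters are hypothetical. It simulates the $(n,k)$ fork-join system of Definition 1 for any service time sampler, cancelling the outstanding $n-k$ tasks (queued or in service) once a job completes:

```python
import random
from collections import deque

def simulate_fork_join(n, k, lam, service, num_jobs=20000, seed=1):
    """Monte Carlo estimate of the mean response time of an (n, k)
    fork-join system: each arriving job forks one task to each of the
    n FCFS queues, completes when any k tasks finish, and its remaining
    tasks (queued or in service) are cancelled."""
    rng = random.Random(seed)
    arrivals, t = [], 0.0
    for _ in range(num_jobs):            # Poisson arrivals with rate lam
        t += rng.expovariate(lam)
        arrivals.append(t)

    pending = [deque() for _ in range(n)]   # waiting tasks per queue
    in_service = [None] * n                 # job id in service, or None
    finish = [float('inf')] * n             # finish time of task in service
    done = [0] * num_jobs                   # finished tasks per job
    cancelled = [False] * num_jobs
    next_arr, completed, resp_sum = 0, 0, 0.0

    def start_next(i, now):
        """Begin serving the next non-cancelled task at queue i."""
        while pending[i] and cancelled[pending[i][0]]:
            pending[i].popleft()
        if pending[i]:
            in_service[i] = pending[i].popleft()
            finish[i] = now + service()
        else:
            in_service[i] = None
            finish[i] = float('inf')

    while completed < num_jobs:
        i = min(range(n), key=lambda q: finish[q])
        if next_arr < num_jobs and arrivals[next_arr] <= finish[i]:
            now, j = arrivals[next_arr], next_arr   # arrival event
            next_arr += 1
            for q in range(n):                      # fork to all n queues
                pending[q].append(j)
                if in_service[q] is None:
                    start_next(q, now)
        else:
            now, j = finish[i], in_service[i]       # service completion
            done[j] += 1
            if done[j] == k:                        # job is complete
                resp_sum += now - arrivals[j]
                completed += 1
                cancelled[j] = True
                for q in range(n):                  # pre-empt sibling tasks
                    if q != i and in_service[q] == j:
                        start_next(q, now)
            start_next(i, now)
    return resp_sum / num_jobs
```

For exponential service with rate $k\mu$ one would pass `service=lambda: random.expovariate(k * mu)`; heavy-tailed or correlated samplers can be plugged in the same way.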
A. General Service Time Distribution
In several practical scenarios the service distribution is unknown. We present an upper bound on the mean response time for such cases, using only the mean and the variance of the service distribution. Let $X_1, X_2, \ldots, X_n$ be the i.i.d. random variables representing the service times of the $n$ nodes, with expectation $E[X_i] = \mu'$ and variance $V[X_i] = \sigma^2$ for all $i$.

Theorem 3 (Upper Bound with General Service Time). The mean response time $T_{(n,k)}$ of an $(n,k)$ fork-join system with general service time $X$ such that $E[X] = \mu'$ and $V[X] = \sigma^2$ satisfies
\[
T_{(n,k)} \leq \mu' + \sigma\sqrt{\frac{k-1}{n-k+1}} + \frac{\lambda\left[\left(\mu' + \sigma\sqrt{\frac{k-1}{n-k+1}}\right)^2 + C(n,k)\,\sigma^2\right]}{2\left[1 - \lambda\left(\mu' + \sigma\sqrt{\frac{k-1}{n-k+1}}\right)\right]}. \tag{11}
\]

Proof:
The proof follows from Theorem 1, where the upper bound can be calculated using the $(n,k)$ split-merge system and the Pollaczek-Khinchin formula (8). Unlike the exponential case, we do not have an exact expression for $S$, the $k$-th order statistic of the service times $X_1, X_2, \ldots, X_n$. Instead, we use the following upper bounds on the expectation and variance of $S$, derived in [31] and [32]:
\[
E[S] \leq \mu' + \sigma\sqrt{\frac{k-1}{n-k+1}}, \tag{12}
\]
\[
V[S] \leq C(n,k)\,\sigma^2. \tag{13}
\]
The proof of (12) involves Jensen's inequality and the Cauchy-Schwarz inequality; for details please refer to [31]. The constant $C(n,k)$ depends only on $n$ and $k$, and can be found in the table in [32]. Holding $n$ constant, $C(n,k)$ decreases as $k$ increases. The proof of (13) can be found in [32].

Note that (8) strictly increases as either $E[S]$ or $V[S]$ increases. Thus, we can substitute the upper bounds into it to obtain the upper bound (11) on the mean response time.

Regarding the lower bound, we note that our proof of Theorem 2 cannot be extended to this general service time setting. That proof requires the memoryless property of the service time, which does not necessarily hold in the general case.

B. Heavy-tailed Service Time
In many practical systems the service time has a heavy-tailed distribution, which means that there is a larger probability of drawing very large values. More formally, a random variable $X$ is said to have a heavy-tailed distribution if its tail probability is not exponentially bounded, that is,
\[
\lim_{x \to \infty} e^{\beta x} \Pr(X > x) = \infty \quad \text{for all } \beta > 0.
\]

Fig. 8: Mean response time $T_{(10,k)}$ for different service time distributions (Pareto with $\alpha = 1.8$, $\alpha = 4$, $\alpha = \infty$, and exponential), with $\lambda = 1$ and $\mu = 3$. For more heavy-tailed (smaller $\alpha$) distributions, the increase in mean response time with $k$ becomes dominant, since we have to wait for more nodes to finish service.

We consider the Pareto distribution, which has been widely used to model heavy-tailed jobs in the existing literature (see for example [33], [34]). The Pareto distribution is parametrized by the scale parameter $x_m$ and the shape parameter $\alpha$, and its cumulative distribution function is given by
\[
F_X(x) = \begin{cases} 1 - \left(\frac{x_m}{x}\right)^{\alpha} & x \geq x_m, \\ 0 & x < x_m. \end{cases} \tag{14}
\]
A smaller value of $\alpha$ implies a heavier tail. In particular, when $\alpha = \infty$ the service time becomes deterministic, and when $\alpha \leq 1$ the mean service time becomes infinite. In [33], a Pareto distribution with $\alpha$ close to 1 was reported for the sizes of files requested from websites.

In Fig. 8 we plot the mean response time $T_{(n,k)}$ versus $k$ for $n = 10$ disks, arrival rate $\lambda = 1$, and service rate $\mu = 3$, for the exponential and Pareto service distributions. Each disk stores $1/k$ units of data, and thus the service rate of each individual queue is $\mu' = k\mu$. For a given $k$, all distributions have the same mean service time $1/k\mu$. We observe that as the distribution becomes more heavy-tailed (smaller $\alpha$), the increase in mean response time from waiting for more nodes (larger $k$) to finish outweighs the decrease caused by the smaller service time $1/k\mu$. For smaller $\alpha$, the optimal $k$ decreases, because the increase in mean response time for larger $k$ is more dominant.

C. Correlated Service Times
Thus far we have assumed that the $n$ tasks of a job have independent service times. We now analyze how correlation between the service times affects the mean response time of the fork-join system. In practice, service times can be correlated because, for example, the service time is proportional to the size of the file being downloaded. We model the correlation by letting the service time of the task at queue $i$ be $\delta X_d + (1-\delta) X_{r,i}$, a weighted sum of two independent exponential random variables $X_d$ and $X_{r,i}$, both with mean $1/k\mu$. The variable $X_d$ is common across the $n$ queues, while the $X_{r,i}$ are independent across the queues, $1 \leq i \leq n$. The weight $\delta$ represents the degree of correlation between the service times of the $n$ queues. When $\delta = 0$, the system is identical to the original $(n,k)$ fork-join system analyzed in Section III. The mean response time $T'_{(n,k)}$ of the $(n,k)$ fork-join system with the service time distribution described above is
\[
T'_{(n,k)} = \delta E[X_d] + (1-\delta) T_{(n,k)} = \frac{\delta}{k\mu} + (1-\delta) T_{(n,k)}, \tag{15}
\]
where $T_{(n,k)}$ is the response time with independent exponential service times analyzed in Section III. Fig. 9 shows the trade-off between the mean response time and $k$ for weights $\delta = 0$, $0.5$, and $1$. When $\delta$ is small, coding provides diversity and gives a faster response time for smaller $k$, as we already observed in Fig. 3. As the correlation between service times increases, we lose the diversity advantage provided by coding and no longer get a fast response for small $k$. Note that for $\delta = 1$ there is no diversity advantage, and the decrease in response time with $k$ is due only to the fact that each disk stores $1/k$ units of data.

V. THE (m, n, k) FORK-JOIN SYSTEM
In a distributed storage system with a large number of disks $m$, an $(m,k)$ fork-join system would involve a large signaling overhead of forking each request to all $m$ disks, and high decoding complexity. The decoding complexity is high even for small $k$, because it depends on the field size, which is a function of $m$ in standard codes such as Reed-Solomon codes. Hence, we propose a system where we divide the $m$ disks into $g = m/n$ groups of $n$ disks each, which act as independent $(n,k)$ fork-join systems.

Fig. 9: Mean response time $T'_{(n,k)}$ of $(10,k)$ fork-join systems with correlated service times, for $\lambda = 1$, $\mu = 3$, and $\delta = 0, 0.5, 1$. As $\delta$ increases, the service times are more correlated and we lose the diversity advantage of coding.

In Section V-A we give the system model and analyze the mean response time of the $(m,n,k)$ fork-join system. In Section V-B we present numerical results comparing the mean response time under different policies of assigning an incoming request to one of the groups.

A. Analysis of Response Time
Consider a distributed storage system with $m$ disks. We divide them into $g = m/n$ groups of $n$ disks each, as shown in Fig. 10. We refer to this system as the $(m,n,k)$ fork-join system, formally defined as follows.

Definition 2 (The $(m,n,k)$ fork-join system). An $(m,n,k)$ fork-join system consists of $m \geq n$ disks partitioned into $g = m/n$ groups of $n$ disks each. An incoming download request is assigned to one of the $g$ groups according to some policy (e.g., uniformly at random). Each group behaves as an independent $(n,k)$ fork-join system as described in Definition 1.

Fig. 10: The $(m,n,k)$ fork-join system, with incoming service requests split among $g = m/n$ fork-join systems.

We can extend Lemma 1 to find a necessary condition for the stability of the $(m,n,k)$ fork-join system, in terms of the arrival rate $\lambda_i$ to each group $i$, for $1 \leq i \leq g$.

Lemma 2 (Stability of the $(m,n,k)$ fork-join system). For the $(m,n,k)$ fork-join system to be stable, the rate of arrival of requests $\lambda_i$ to group $i$ and the service rate $\mu' = k\mu$ per node must satisfy
\[
\lambda_i < n\mu, \quad \forall\, 1 \leq i \leq g. \tag{16}
\]

Proof:
Since each group behaves as an independent $(n,k)$ fork-join system, we can apply the stability condition of Lemma 1 with $\lambda$ replaced by $\lambda_i$. The result follows.

The response time of the $(m,n,k)$ fork-join system depends on the policy used to assign an incoming request to one of the groups. Under the uniform job assignment policy, each incoming request is assigned to a group chosen uniformly at random from the $g$ groups. The Poisson arrival rate to each group is then reduced to $\lambda/g$, and each group is an independent $(n,k)$ fork-join system. Therefore, we can extend the bounds in Theorem 1 and Theorem 2 to the mean response time of the $(m,n,k)$ fork-join system as follows.

Corollary 2.
The response time $T_{(m,n,k)}$ of an $(m,n,k)$ fork-join system with uniform group assignment is bounded by (6) and (10) with $\lambda$ replaced by $\lambda/g$.

B. Numerical Results

To reduce the decoding complexity and signaling overhead, an $(m,n,k)$ fork-join system with smaller $n$, and thus more groups $g = m/n$, is preferred. However, reducing $n$ reduces the diversity advantage, which can give a higher expected download time (cf. Fig. 6). Thus, there is a delay-complexity trade-off when we vary the number of groups $g$. Moreover, the content has to be replicated at all groups to which its request can be directed, so having a large number of groups also means increased storage space.

Fig. 11: Mean response time $T_{(12,n,k)}$ of $(12,6,k)$ and $(12,12,k)$ systems with the exponential and Pareto ($\alpha = 1.8$) service distributions, for $\lambda = 1$ and $\mu = 3$. Given $m$, we would like to find the smallest $n$ and largest $k$ that can achieve a given target response time.

In Fig. 11 we plot the mean response time of the $(12,n,k)$ system with uniform group assignment, for exponential and Pareto service times. Given the number of disks $m$, we would like to find the smallest $n$ and the largest $k$ that can achieve a given target response time. A smaller $n$ means fewer disks per group, and hence less signaling overhead of forking a request to the disks in a group. A larger $k$ is desirable because the total storage space used is $m/k$ units. For the exponential service distribution, Fig. 11 shows that the diversity of a large $n$, or a smaller $k$, always gives a lower response time. But this monotonicity does not hold for the Pareto service time distribution.
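The Pareto service times used in these experiments can be generated by inverse-CDF sampling. The helper below is our own illustrative sketch, not from the paper; it chooses the scale $x_m$ so that the mean service time of a task equals a given target (e.g., $1/k\mu$), matching the mean used for the exponential case:

```python
import random

def pareto_service(alpha, mean_time, rng=random):
    """Sample a Pareto service time with shape alpha > 1 whose mean,
    alpha * x_m / (alpha - 1), equals mean_time."""
    x_m = mean_time * (alpha - 1.0) / alpha
    # Invert the CDF F(x) = 1 - (x_m / x)**alpha of (14)
    return x_m / (1.0 - rng.random()) ** (1.0 / alpha)
```

Under uniform group assignment, each of the $g = m/n$ groups sees Poisson arrivals of rate $\lambda/g$, so samples drawn this way (with `mean_time` set to $1/k\mu$) can drive $g$ independent $(n,k)$ fork-join simulations.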
For example, with the Pareto service distribution, a $(12,6,k)$ fork-join system with $n = 6$ disks per group can achieve a lower response time than a $(12,12,k')$ fork-join system with $n = 12$ disks per group.

We now study the response time of the $(m,n,k)$ fork-join system under three different group assignment policies: the uniform job assignment policy, where each incoming request is assigned to a group chosen uniformly at random from the $g$ groups, and the power-of-$d$ and least-work-left (LWL) policies introduced in Section II-C.

Fig. 12: Mean response time of $(20,5,k)$ and $(20,10,k)$ systems, with $\lambda = 1$, for different group assignment policies. The power-of-$d$ and LWL assignments give a faster response time than uniform assignment. (Curves: $(20,5,k)$ uniform, power-of-2, and LWL, i.e., power-of-4; $(20,10,k)$ uniform and LWL, i.e., power-of-2.)

In Fig. 12 we compare the response time of the $(20,n,k)$ fork-join system under the uniform, power-of-$d$, and LWL group assignment policies. Request arrivals are Poisson with rate $\lambda = 1$ and service times are exponential. As expected, the power-of-$d$ assignment gives a lower response time than the uniform assignment, but at the cost of receiving feedback about the amount of work left at each node. We again note that the power-of-2 policy is only slightly worse than the LWL policy (cf. Fig. 7). The simulations suggest that power-of-$d$ group assignment is a strategy worth considering in actual implementations.

VI. CONCLUDING REMARKS
A. Major Implications
In this paper we show how coding in distributed storage systems, which has been used to provide reliability against disk failures, also reduces the content download time. We consider content that is divided into $k$ blocks and stored on $n > k$ disks or nodes in a network. The redundancy is added using an $(n,k)$ maximum distance separable (MDS) code, which allows content reconstruction by reading any $k$ of the $n$ disks. Since the download time from each disk is random, waiting for only $k$ out of $n$ disks reduces the overall download time significantly.

We take a queueing-theoretic approach to model multiple users requesting the content simultaneously. We propose the $(n,k)$ fork-join system model, where each request is forked to queues at the $n$ disks. This is a novel generalization of the $(n,n)$ fork-join system studied in the queueing theory literature. We analytically derive upper and lower bounds on the expected download time and show that they are fairly tight. To the best of our knowledge, we are the first to propose the $(n,k)$ fork-join system and find bounds on its mean response time. We also extend this analysis to distributed systems with a large number of disks, which can be divided into many $(n,k)$ fork-join systems.

Our results demonstrate the fundamental trade-off between the download time and the amount of storage space. This trade-off can be used to design the amount of redundancy required to meet the delay constraints of content delivery. We observe that the optimal operating point varies with the service distribution of the time to read each disk. We present theoretical results for the exponential distribution, and simulation results for the heavy-tailed Pareto distribution.

B. Future Perspectives
Although we focus on distributed storage here, the results in this paper can be extended to computing systems such as MapReduce [35], as well as content access networks [16], [36]. There are some practical issues affecting the download time that are not considered in this paper and could be addressed in future work. For instance, the signaling overhead of forking the request to $n$ disks, and the complexity of decoding the content, increase with $n$. In practical storage systems, adding redundancy not only requires extra capital investment in storage and networking, but also consumes more energy [37]. It would be interesting to study the fundamental trade-off between power consumption and quality of service. Finally, note that in this paper we focus on the read operation in a storage system. However, in practical systems the requests entering the system consist of both read and write operations; we leave the investigation of the write operation for future work.

REFERENCES

[1] Amazon EBS, http://aws.amazon.com/ebs/.
[2] S. Ghemawat, H. Gobioff, and S.-T. Leung, "The Google file system," in
ACM SIGOPS Op. Sys. Rev., vol. 37, no. 5, pp. 29–43, 2003.
[3] Dropbox, http://www.dropbox.com/.
[4] Google Docs, http://docs.google.com/.
[5] G. Joshi, Y. Liu, and E. Soljanin, "Coding for fast content download," Allerton Conf. on Commun., Control and Comput., pp. 326–333, Oct. 2012.
[6] A. G. Dimakis, P. B. Godfrey, M. Wainwright, and K. Ramchandran, "Network coding for distributed storage systems," Proc. IEEE INFOCOM, pp. 2000–2008, May 2007.
[7] M. Blaum, J. Brady, J. Bruck, and J. Menon, "EVENODD: an efficient scheme for tolerating double disk failures in RAID architectures," IEEE Trans. Comput., vol. 44, no. 2, pp. 192–202, 1995.
[8] R. Rodrigues and B. Liskov, "High availability in DHTs: erasure coding vs. replication," Int. Workshop on Peer-to-Peer Systems, pp. 226–239, Feb. 2005.
[9] K. V. Rashmi, N. B. Shah, P. V. Kumar, and K. Ramchandran, "Explicit construction of optimal exact regenerating codes for distributed storage," Allerton Conf. on Commun., Control and Comput., pp. 1243–1249, Sep. 2009.
[10] N. B. Shah, K. V. Rashmi, P. V. Kumar, and K. Ramchandran, "Interference alignment in regenerating codes for distributed storage: necessity and code constructions," IEEE Trans. Inform. Theory, vol. 58, pp. 2134–2158, Apr. 2012.
[11] I. Tamo, Z. Wang, and J. Bruck, "Zigzag codes: MDS array codes with optimal rebuilding," IEEE Trans. Inform. Theory, vol. 59, no. 3, pp. 1597–1616, 2013.
[12] U. Ferner, M. Médard, and E. Soljanin, "Toward sustainable networking: storage area networks with network coding," Allerton Conf. on Commun., Control and Comput., pp. 517–524, Oct. 2012.
[13] L. Huang, S. Pawar, H. Zhang, and K. Ramchandran, "Codes can reduce queueing delay in data centers," Proc. Int. Symp. Inform. Theory, pp. 2766–2770, Jul. 2012.
[14] N. Shah, K. Lee, and K. Ramchandran, "The MDS queue: analyzing latency performance of codes and redundant requests," Tech. Rep. arXiv:1211.5405, Nov. 2012.
[15] G. Kabatiansky, E. Krouk, and S. Semenov, Error Correcting Coding and Security for Data Networks: Analysis of the Superchannel Concept, 1st ed. Wiley, Mar. 2005, ch. 7.
[16] N. F. Maxemchuk, "Dispersity routing in high-speed networks," Computer Networks and ISDN Systems, vol. 25, pp. 645–661, Jan. 1993.
[17] Y. Liu, J. Yang, and S. C. Draper, "Exploiting route diversity in multi-packet transmission using mutual information accumulation," Allerton Conf. on Commun., Control and Comput., pp. 1793–1800, Sep. 2011.
[18] L. Xu, "Highly available distributed storage systems," Ph.D. dissertation, California Institute of Technology, 1998.
[19] E. Soljanin, "Reducing delay with coding in (mobile) multi-agent information transfer," Allerton Conf. on Commun., Control and Comput., pp. 1428–1433, Sep. 2010.
[20] M. Harchol-Balter, Performance Modeling and Design of Computer Systems: Queueing Theory in Action. Cambridge University Press, 2013.
[21] C. Kim and A. K. Agrawala, "Analysis of the fork-join queue," IEEE Trans. Comput., vol. 38, no. 2, pp. 250–255, Feb. 1989.
[22] R. Nelson and A. Tantawi, "Approximate analysis of fork/join synchronization in parallel queues," IEEE Trans. Comput., vol. 37, no. 6, pp. 739–743, Jun. 1988.
[23] E. Varki, A. Merchant, and H. Chen, "The M/M/1 fork-join queue with variable sub-tasks," unpublished, available online.
[24] S. Ross, A First Course in Probability, 6th ed. Prentice Hall, 2002, ch. 6.6, p. 273.
[25] M. Mitzenmacher, "The power of two choices in randomized load balancing," Ph.D. dissertation, University of California, Berkeley, CA, 1996.
[26] L. Flatto and S. Hahn, "Two parallel queues created by arrivals with two demands I," SIAM J. Appl. Math., vol. 44, no. 5, pp. 1041–1053, 1984.
[27] P. E. Wright, "Two parallel processors with coupled inputs," Adv. Appl. Prob., pp. 986–1007, 1992.
[28] P. Konstantopoulos and J. Walrand, "Stationarity and stability of fork-join networks," J. Appl. Prob., vol. 26, pp. 604–614, Sep. 1989.
[29] H. C. Tijms, A First Course in Stochastic Models, 2nd ed. Wiley, 2003, ch. 2.5, p. 58.
[30] J. Esary, F. Proschan, and D. Walkup, "Association of random variables, with applications," Annals of Math. Stat., vol. 38, no. 5, pp. 1466–1474, Oct. 1967.
[31] B. C. Arnold and R. A. Groeneveld, "Bounds on expectations of linear systematic statistics based on dependent samples," Annals of Stat., vol. 7, pp. 220–223, Oct. 1979.
[32] N. Papadatos, "Maximum variance of order statistics," Ann. Inst. Statist. Math., vol. 47, pp. 185–193, 1995.
[33] M. E. Crovella and A. Bestavros, "Self-similarity in World Wide Web traffic: evidence and possible causes," IEEE/ACM Trans. Networking, pp. 835–846, Dec. 1997.
[34] M. Faloutsos, P. Faloutsos, and C. Faloutsos, "On power-law relationships of the Internet topology," Proc. ACM SIGCOMM, pp. 251–262, 1999.
[35] J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," Commun. ACM, vol. 51, pp. 107–113, Jan. 2008.
[36] A. Vulimiri, O. Michel, P. B. Godfrey, and S. Shenker, "More is less: reducing latency via redundancy," Proc. ACM HotNets, pp. 13–18, 2012.
[37] T. Bostoen, S. Mullender, and Y. Berbers, "Power-reduction techniques for data-center storage systems," ACM Comput. Surveys, vol. 45, no. 3, 2013.