Distributed storage algorithms with optimal tradeoffs
arXiv preprint [cs.IT], January
Michael Luby∗, IEEE Fellow, ACM Fellow, and Thomas Richardson†, IEEE Fellow
∗BitRipple, Inc. †Qualcomm Technologies, Inc.
Abstract
One of the primary objectives of a distributed storage system is to reliably store large amounts of source data for long durations using a large number $N$ of unreliable storage nodes, each with $clen$ bits of storage capacity. Storage nodes fail randomly over time and are replaced with nodes of equal capacity initialized to zeroes, and thus bits are erased at some rate $E$. To maintain recoverability of the source data, a repairer continually reads data over a network from nodes at a rate $R$, and generates and writes data to nodes based on the read data.

The distributed storage source data capacity is the maximum amount of source data that can be reliably stored for long periods of time. The research described in [1] shows that asymptotically the distributed storage source data capacity is at most

$$\left(1 - \frac{E}{2 \cdot R}\right) \cdot N \cdot clen \qquad (1)$$

as $N$ and $R$ grow.

In this work we introduce and analyze algorithms such that asymptotically the distributed storage source data capacity is at least the expression in Equation (1). Thus, Equation (1) expresses a fundamental trade-off between network traffic and storage overhead to reliably store source data.

Index Terms
distributed information systems, data storage systems, data warehouses, information science, information theory, information entropy, error compensation, mutual information, channel capacity, channel coding, time-varying channels, error correction codes, Reed-Solomon codes, network coding, signal to noise ratio, throughput, distributed algorithms, algorithm design and analysis, reliability, reliability engineering, reliability theory, fault tolerance, redundancy, robustness, failure analysis, equipment failure.
I. THE BASIC LIQUID SYSTEM
In this section we review the basic liquid storage system model, as developed in [2], and indicate some of the extensions introduced in this paper. We establish the essential mathematical framework of liquid storage, and prepare for variations that improve on certain characteristics, in particular read bandwidth requirements for repair. Many practical details discussed in [2] are outside the scope of this paper and, for purposes of exposition, we make certain simplifying assumptions in the model.

In the storage system models we consider, a total of $n_{obj}$ equal-sized data objects are stored. Each object's source data is partitioned into $k_c$ equal-size data fragments and an MDS erasure correcting code is used to generate as many as $k_c + r$ fragments corresponding to a length $k_c + r$ code. The code may or may not be systematic. In the basic liquid system the code length, $k_c + r$, equals $N$, the number of storage nodes used in the system, and each storage node is uniquely associated to a symbol in the error correcting code. Data fragments corresponding to a particular symbol are stored on the associated node. In the basic liquid system this association is exclusive: the node stores all fragments associated to the code symbol and no others. In the extensions considered in this paper we may have $k_c + r > N$, and the total number of fragments generated upon object repair can vary and may be less than $k_c + r$. Nodes will generally be associated to code symbols in the code but fragments associated to other code symbols may also be stored on a node.

Footnote: Portions of this work were done while the first author was with Qualcomm Technologies, Inc. Portions of this work were supported by the National Science Foundation under grant 1936572. BitRipple, Inc ([email protected]) and International Computer Science Institute ([email protected]).

Footnote: In [2] nearly MDS codes, RaptorQ codes, were proposed for their scalability and complexity properties. Here we assume MDS codes largely for convenience; in a practical implementation of the proposed schemes RaptorQ would still likely be the best practical choice.

We define $\beta = \frac{N - k_c}{N}$ as the overhead of the storage system. The stored source data objects comprise a fraction $(1 - \beta)$ of the storage system capacity and the remaining fraction $\beta$ of the storage capacity is used to provide resiliency against data loss due to node failure.

We say that a fragment is intact at time $t$ if it is stored on a node and can be read at that time. When a storage node fails all fragments stored on it are lost. The node is then replaced with a new empty node, and we assume that this happens instantaneously. Thus node failure is functionally equivalent to node erasure and we will often refer to fragment loss as fragment erasure. In the basic liquid system the replacement node is assigned the same code symbol as the node it replaces. In the systems developed in this paper this is no longer the case. Instead we will assume an ordered list of symbols: when a node fails its associated symbol is placed at the bottom of the list and the new node is associated to the first unassociated symbol on the list.

In the basic liquid system, objects are repaired serially in a fixed cyclic order. When an object is repaired, $k_c$ fragments are read, which are sufficient to reconstruct the data, and any erased fragment is (re)generated and written to its associated node, i.e., the node associated to the corresponding code symbol. Hence, immediately upon repair, all $N$ of an object's fragments are intact. We treat object repair as an atomic event, i.e., we do not concern ourselves with partial repairs of objects being interrupted by node failure. We associate the time of this atomic event with the completion of the repair. Repair efficiency is related to the number of erased fragments regenerated per object repair, and regenerating a large number of erased fragments makes more efficient use of the $k_c$-fragment read than regenerating a smaller number.
In the basic liquid system, the maximum number of fragments that can be regenerated is $N - k_c$, since if more than $N - k_c$ fragments are erased then the data object is not recoverable. For efficient repair it is desirable to operate the system so that the number of fragments regenerated is near this maximum. This aim must be balanced against the risk of losing data, and some margin must be maintained. Exploiting the law of large numbers, larger systems can tolerate smaller relative margins. In the systems considered in this paper objects will not necessarily be repaired in strictly cyclic order, but cyclic repair will be loosely followed. We will also generally treat repair as an atomic event, but this atomic event may involve more than one object repair. In a typical object repair we will regenerate more than $N - k_c$ fragments, thereby increasing repair efficiency. Some of those fragments will correspond to code symbols not yet associated to an actual storage node in the system (they may instead be associated to virtual nodes) and those fragments will be temporarily stored in other locations until the associated node is physically introduced into the system, at which point those fragments are moved (copied) to that node.

In this paper we often assume a fixed rate of object repair since we are interested in asymptotic performance limits that use minimal repair resources. This rate of repair is characterized by $T_{tot}$, which denotes the time required to repair all objects once each. In the basic liquid system this is equivalent to the time between repairs for a fixed object, since the repair cycles through the objects, but this will not hold in the extended systems. We generally assume that the time to failure of a node is an exponentially distributed random variable of known rate. With fixed system size, this assumption is equivalent to a Poisson node failure process. In practice the node failure rate may not be precisely known and the Poisson assumption may not be valid.
In [2] a feedback regulator was described that modulates the repair rate as a function of the observed node failure process. To maintain repair efficiency the regulator attempts to steer the number of repaired fragments toward some target, the choice of which balances efficiency against the risk of data loss. We will indicate how the technique can be extended to the models considered here.

Footnote: Part of the appeal of liquid storage is that repair, including node replacement, can be delayed without significantly affecting the risk of data loss. We assume instantaneous node replacement largely as a mathematical convenience.

Footnote: Practical systems would likely read, or at least access, slightly more than $k_c$ fragments to reduce latency due to straggler nodes. In the case of non-MDS codes more than $k_c$ fragments might be needed. In this paper we ignore these marginal effects.
It is convenient to introduce a two dimensional visualization of the repair process in which one axis (the $x$ axis) represents ordered objects in the system and the other axis (the $y$ axis) represents the nodes. In this visualization storage capacity is faithfully represented as area, see Fig. 1. Since objects are effectively queued for repair, we will also refer to this as the repair queue. At any time the position of an object in the repair queue is the number of objects behind it in the queue. In the basic liquid system, with its strict cyclic repair order, this is identical with the number of objects that have been repaired since the given object was last repaired. Under this interpretation, an object's position is an integer in $0, \ldots, n_{obj} - 1$. It is sometimes convenient to introduce a real variable $x \in [0, 1]$ to represent position in a scale invariant manner. By the object at position $x$ we will mean the object at integer position $\lfloor x n_{obj} \rfloor$ and we will use both "object-position" and "$x$-position" to refer to position in the repair queue. To avoid distracting complications we will often tacitly assume operation in the liquid limit, by which we mean the limit of an infinite number of objects. In particular, in that limit any distinction between $x n_{obj}$ and $\lfloor x n_{obj} \rfloor$ disappears. We represent the storage nodes as equally spaced on the $y$ axis, where each node occupies an interval of height $1/N$ so that the storage capacity of the node corresponds with a rectangular strip of height $1/N$ along the $y$-axis and length $1$ along the $x$-axis. Thus, the vertical axis is normalized to have $y \in [0, 1]$. In this visualization, the nodes are generally ordered by their age. If the node associated to the vertical segment $[k/N, (k+1)/N]$ fails, then the $y$ axis locations of nodes $0, \ldots, k-1$ are each increased by $1/N$ and the empty replacement node is placed in the strip in $y$-position $[0, 1/N)$.
The repair process then ensures that the set of objects which store intact fragments on a node is non-decreasing (by inclusion) in $y$. Whereas in the basic liquid system a replacement node immediately starts storing regenerated fragments, in the systems considered in this paper use of the replacement node for storage may be delayed until a certain transition in the repair process occurs. Until that time the replacement node does not actively function in the system and storage on that node will be virtual.

Assume an object is repaired at time $t = 0$ and the object is then placed at the tail of the repair queue. After some additional time $t$ has elapsed some of the object's fragments may be lost due to node failure. Let $\mathcal{F}(x, t)$ denote the set of erased fragment symbols for the object in position $x$. If $x_1 \le x_2$ then we have $\mathcal{F}(x_1, t) \subset \mathcal{F}(x_2, t)$. This nested fragment ordering is central to operation in the basic liquid system. The systems in this paper will largely preserve this property, but some deviation will occur.

Let us define the function $f(x, t) = \frac{1}{N} |\mathcal{F}(x, t)|$. Since there are initially $N$ fragments, one on each node, and node lifetimes are independent and exponentially distributed with rate $\lambda$, the distribution of the number of erased fragments at time $t$ is binomial. In particular, the probability that $k$ fragments have been erased by time $t$ is $\binom{N}{k} (1 - e^{-\lambda t})^{k} (e^{-\lambda t})^{N-k}$. Assuming $t < T_{tot}$, the object will be in $x$-position $t/T_{tot}$, and we see that the expected number of erased fragments for an object in position $x$ is $N\,\mathbb{E} f(x, t) = N (1 - e^{-\lambda T_{tot} x})$.

The repair efficiency of the basic liquid system stems from the number of fragments regenerated upon repair. For an object repaired at time $t$ the number of repair fragments regenerated is $f(1, t) N$.
Under our current assumptions the probability that this takes the value $\ell$ is $q(\ell) = \binom{N}{\ell} (1 - e^{-\lambda T_{tot}})^{\ell} (e^{-\lambda T_{tot}})^{N-\ell}$, and it has an expected value of $N (1 - e^{-\lambda T_{tot}})$. Data loss occurs if $\ell > r$.
In particular $T_{tot}$ should be chosen so that $N (1 - e^{-\lambda T_{tot}}) < r$. In [2] the bound

$$\mathrm{MTTDL} \ge \frac{1}{\lambda N (1 - \beta)\, q(r)} - \frac{T_{tot}}{\sum_{j=0}^{r} q(j)}$$

was shown, where MTTDL denotes the mean time to data loss assuming a perfect (all fragments intact) initial state. For an appropriate choice of $T_{tot}$ we have $\sum_{j=0}^{r} q(j) \simeq 1$ and $q(r) \ll 1$. The MTTDL from a typical state is essentially the same as that from a perfect state [2].
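The binomial model above is easy to evaluate numerically. The following is a minimal sketch, assuming hypothetical values for $N$, $\beta$, $\lambda$ and $T_{tot}$ that are not taken from the paper; the function names are ours. It computes $q(\ell)$, checks the expected value $N(1 - e^{-\lambda T_{tot}})$, and evaluates the probability that more than $r$ fragments are erased at repair time.

```python
import math

def erased_fragment_pmf(N, lam, T_tot):
    """q(l): probability that exactly l of N fragments are erased after time T_tot,
    assuming independent exponential node lifetimes with rate lam."""
    p = 1.0 - math.exp(-lam * T_tot)  # per-node failure probability over one cycle
    return [math.comb(N, l) * p**l * (1.0 - p)**(N - l) for l in range(N + 1)]

def loss_probability(q, r):
    """Probability that more than r fragments are erased when the object is repaired."""
    return sum(q[r + 1:])

# Hypothetical parameters, chosen only for illustration.
N, beta, lam = 200, 0.25, 1.0
r = round(beta * N)      # r = beta * N redundant fragments
T_tot = 0.15 / lam       # satisfies N * (1 - exp(-lam * T_tot)) < r

q = erased_fragment_pmf(N, lam, T_tot)
mean_erased = sum(l * ql for l, ql in enumerate(q))
expected = N * (1.0 - math.exp(-lam * T_tot))
print(mean_erased, expected, loss_probability(q, r))
```

Increasing `T_tot` toward the point where the expected number of erasures approaches `r` raises repair efficiency but also raises the loss probability, which is the margin trade-off described above.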
A. Asymptotic Repair Read Rate
In the limit $N \to \infty$ we can choose $T_{tot}(N)$ such that $(1 - e^{-\lambda T_{tot}(N)}) \to \beta$ and $\mathrm{MTTDL} \to \infty$. In this limit the system achieves its maximum per-repair efficiency with an object repair rate given by $T_{tot}(N) \to -\lambda^{-1} \ln(1 - \beta)$. In [1] a lower bound on data read rates was established that for small $\beta$ can be written as $\frac{1}{\lambda T_{tot}} \ge \frac{1}{2\beta}$. Compared to this bound the read rate in the basic liquid system is larger, essentially by a factor of $2$.

Fig. 1: Visualization of basic liquid repair queue. The head of the queue is on the right and the tail on the left. Objects at the head of the queue are repaired and then move to the tail. Hence, the cyclic order of the objects is invariant. (Horizontal axis: objects in order of age since last repair, from $0$ to $n_{obj} - 1$. Vertical axis: nodes ordered by age. Regions: complete ($k + \Delta$), incomplete ($r - \Delta$), unused storage.)

An evident cause of this gap is the typically unused portion of the overhead storage capacity in the basic liquid system, see Fig. 1. The area under the curve $y = f(x, t)$ represents unused storage capacity and in the asymptotic limit discussed above this unused capacity exceeds $\beta/2$. If in the basic liquid system we excluded the unused portion of the overhead storage capacity in the accounting of the overhead, then the lower read rate bound would be achieved to within second order in $\beta$. This observation is the key to the storage schemes presented in this paper, where we aim to achieve full storage utilization.

II. PARTIALLY VIRTUALIZED REPAIR QUEUE SCHEMES
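As a numerical illustration of the gap discussed at the end of Section I, the sketch below evaluates the asymptotic cycle time $\lambda T_{tot} = -\ln(1 - \beta)$ of the basic liquid system and the area under $y = 1 - e^{-\lambda T_{tot} x}$ on $[0, 1]$, which has the closed form $1 + \beta / \ln(1 - \beta)$. It assumes the small-$\beta$ bound from [1] is read as $1/(\lambda T_{tot}) \ge 1/(2\beta)$; the function names and the sample values of $\beta$ are ours.

```python
import math

def liquid_cycle_time(beta, lam=1.0):
    """Asymptotic T_tot of the basic liquid system, from 1 - exp(-lam * T_tot) = beta."""
    return -math.log(1.0 - beta) / lam

def unused_overhead_area(beta):
    """Area under y = 1 - exp(-lam * T_tot * x) on [0, 1] when exp(-lam * T_tot) = 1 - beta.
    Integrating 1 - (1 - beta)**x gives the closed form 1 + beta / ln(1 - beta)."""
    return 1.0 + beta / math.log(1.0 - beta)

for beta in (0.05, 0.1, 0.2):
    rate_liquid = 1.0 / liquid_cycle_time(beta)  # 1 / (lam * T_tot) for basic liquid
    rate_bound = 1.0 / (2.0 * beta)              # assumed small-beta form of the bound in [1]
    print(beta, rate_liquid / rate_bound, unused_overhead_area(beta) / beta)
```

For small $\beta$ the first ratio approaches $2$ from below and the second approaches $1/2$ from above, consistent with roughly half of the overhead being unused.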
As described above, in typical operation the basic liquid repair system leaves approximately half of the available storage overhead unused, and this deficiency accounts for its suboptimal repair read rates. One possible remedy for this deficiency is to virtualize the storage of the bottom portion of the repair queue by placing those fragments in the remaining unused portion of the repair queue, see Fig. 2. This is the approach considered in this section. The repacking perturbs the dynamics of the repair process and the typical form of the incomplete repair queue changes. It turns out that under an appropriate form of the repacking the repacked portion of the queue (asymptotically) exactly fits in the remaining unused portion and, consequently, optimal read repair rates with full storage utilization can be asymptotically achieved.

In the basic liquid system the storage overhead $\beta$ corresponds with the code rate in that $k_c = (1 - \beta) N$ and $r = \beta N$. In the partially virtualized scheme we will instead use a larger code length to accommodate virtualized nodes. When a storage node fails, its physical replacement will assume the identity of a virtualized node, including adopting its associated code symbol. The symbol associated to the failed node will instead be assigned to a new virtual node which is added to the system.

The number of fragments regenerated per object repair will vary slightly from repair to repair but it will approximate $(\beta + \beta_v) N$, where $\beta_v N$ represents a number of virtualized nodes. The partially virtualized scheme is roughly similar to a basic liquid system with code rate $\frac{1 - \beta}{1 + \beta_v}$ and $(1 + \beta_v) N$ nodes. Consider in such a basic liquid system the portion of the repair queue comprising the bottom $\beta_v N$ nodes. The number of fragments stored in those nodes is a non-decreasing function of their height ($y$ position) in the repair queue. Let $x_v(t)$ denote the largest $x$ position of a fragment for the node in node position $\beta_v N - 1$.
There are therefore $x_v(t)\, n_{obj}$ objects that store fragments in this lower portion of the repair queue. In the partially virtualized scheme the bottom $\beta_v N$ nodes are virtual, i.e., they do not exist as actual nodes. The fragments that appear in these nodes in the repair queue are virtual and they are actually stored in an otherwise unused portion of the actual repair queue. The thus-stored virtual fragments will be referred to as transient fragments and, for transient fragments, the code symbol associated to the fragment is not the one associated to the node on which they are stored. As part of the repair process transient fragments are later moved (copied) to their intended final location on the node associated to their code symbol. Fragments that are written to their final location will be called settled fragments. Thus, when a transient fragment is copied to its intended location it becomes a settled fragment.

Fig. 2: Visualization of partially virtualized repair queue. The virtual fragments are stored on the transient nodes as indicated. The boundary between the transient nodes and the empty nodes (the yellow line) depicts the transitional node. (Figure labels: Transitional Node, Virtual Fragments, Transient (Stored Virtual) Fragments, Virtual Nodes, Incomplete Nodes, Empty Nodes, Access Nodes.)

A. The Repacking Scheme
We will now give a more detailed and formal description of the partially virtualized transient fragment packing scheme, which is depicted in Fig. 2. In the rough analogy to the basic liquid system made above we assumed $\beta_v N$ virtual nodes and $N$ actual nodes. The repair process proceeds much as in the basic liquid case from the perspective of the virtual queue, but in the actual system this entails the writing and moving of transient fragments and the promotion of virtual nodes to actual nodes. We assume that at any time there is at most one node in transition between the virtual state and the actualized state. We refer to this node as the transitional node. Transient fragments that are regenerated during the repair process are written exclusively to the transitional node. For purposes of system analysis we will treat the transitional node as virtual until the point in time where its initial transient and settled fragments have been completely written, at which point we say that the node is launched. We will discuss later how the writing of the transitional node can be additionally protected so that this assumption could be supported in practice, but its main purpose is to avoid analytical complications inherent in the potential failure of the transitional node. In practice, such a failure would be no more damaging than the failure of any other node.

At any time $t$ the nodes in the top $(1 - g_v(t)) N$ positions are actual (post launch) and the transitional node will be in node position $g_v(t) N - 1$. The transitional node must be physically real for (standard) repair to proceed, so if $g_v(t) N = 0$ then standard repair is suspended. (Later we will add an ancillary repair process that can proceed when $g_v(t) N = 0$.) Assuming standard repair proceeds, a transitional node eventually completes, is launched, and ceases to be the designated transitional node.
At that time the node immediately below in the queue becomes the transitional node and $g_v(t) N$ is decremented by $1$. Any actual nodes below the transitional node, i.e., those in node positions $[0 : g_v(t) N - 1)$, can be considered as physically empty although we depict them as holding virtual fragments. This implies, for the purposes of our model, that they cannot fail. The virtual fragments associated to those nodes are stored as transient fragments spread across those nodes in node positions $[g_v(t) N : (g_v(t) + h_v(t)) N)$, where we have introduced $h_v(t) N$ to represent the number of launched transient fragment carrying nodes. The repacking scheme uses $h_v(t) N$ consecutive nodes in the actual repair queue to store the transient fragments and these nodes occupy node positions $[g_v(t) N : g_v(t) N + h_v(t) N)$.

We introduce a dimensionless parameter $\kappa$, where those transient fragments associated to $\kappa \frac{n_{obj}}{N}$ consecutive objects will be stored on a single node. Correspondingly, the transitional node remains transitional through $\kappa \frac{n_{obj}}{N}$ object repairs, at which point it ceases to be the transitional node and is launched as an actual node. The time required to repair $\kappa \frac{n_{obj}}{N}$ objects under continuous standard repair ($g_v(t) > 0$) at the rate given by $T_{tot}$ will be denoted by

$$\Delta t := \frac{\kappa}{N} T_{tot}.$$

Hence, under constant rate continuous standard repair, a transitional node is completed and launches as an actual node every $\Delta t$ time units.

If a node that carries transient fragments fails, at time $t$ say, then those transient fragments are erased. The corresponding $\kappa \frac{n_{obj}}{N}$ contiguous objects in the repair queue lose their virtual fragments and the number of intact fragments for those objects is immediately reduced to $(1 - g_v(t-)) N$. We refer to this as a virtual fragment loss event.
Upon a virtual fragment loss event we move the affected $\kappa \frac{n_{obj}}{N}$ objects forward in the repair queue, placing them directly in front of those objects that still have virtualized fragments. Thus, these objects are moved to those consecutive positions ending at $x_v(t-)$, and $x_v(t+)$ is correspondingly reduced. While this reordering disrupts the invariant cyclic repair ordering of the basic liquid system, it preserves the fragment symbol ordering property of the repair queue, i.e., after the reordering the set of intact fragment code symbols for objects is ordered by inclusion along the queue.

To give a more precise description of the partial virtualization repacking scheme and to facilitate analysis we will assign specific repair queue positions to the transient fragments. We extend the coordinates of the basic liquid system so that the purely virtual node portion of the queue is represented by $y < 0$. The physical nodes and their storage are represented by the area $(x, y) \in [0, 1]^2$ just as in the basic liquid case (although the launched nodes are represented by the area $(x, y) \in [0, 1] \times [g_v(t), 1]$). We extend the definition of $f(x, t)$ to be consistent with this, i.e., $(1 - f(x, t)) N$ represents the number of intact fragments for an object in position $x$ at time $t$. For $x \le x_v(t)$ we therefore have $f(x, t) \le 0$.

The portion of the repair queue consisting of the settled fragments for objects in $x$-positions with $x > x_v(t)$, i.e., those without virtual fragments, behaves nearly identically to a reduced size basic liquid repair queue. While the basic liquid repair queue has a fixed number of objects and nodes, the non-virtualized portion of the repair queue has a fluctuating number of objects and nodes because of fluctuations in $x_v(t)$ and $g_v(t)$ respectively. For large systems, however, $x_v(t)$ and $g_v(t)$ will concentrate around their expected values.
Another difference with the basic liquid system is the presence of the transient fragments, i.e., the image of the virtualized fragments. The presence of the transient fragments has, however, no direct effect on the dynamics of the non-virtualized portion of the queue as long as there is sufficient capacity on each actual node to store both sets of fragments.

At the granularity of fragments it is natural to understand each settled and virtual fragment as occupying a storage area in the repair queue corresponding to a rectangle of size $\frac{1}{N} \times \frac{1}{n_{obj}}$. To specify positions for transient fragments and associate them to their corresponding virtual fragment positions we introduce a linear (for fixed $t$) map $\varphi_t : \mathbb{R}^2 \to \mathbb{R}^2$ defined by

$$\varphi_t(x, y) = \left(1 + \kappa\,(y - g_v(t)),\; g_v(t) + \kappa^{-1} x\right).$$

Note that $\varphi_t$ is area preserving, that $\varphi_t(x + \frac{\kappa}{N}, y) = \varphi_t(x, y) + (0, \frac{1}{N})$, and that $\varphi_t(x, y + \frac{1}{N}) = \varphi_t(x, y) + (\frac{\kappa}{N}, 0)$. Under the mapping $\varphi_t$ the $\frac{1}{N} \times \frac{1}{n_{obj}}$ rectangle associated to a virtual fragment is a rectangle of the same area but with size $\frac{1}{\kappa n_{obj}} \times \frac{\kappa}{N}$. We can conceive of the associated transient fragment as occupying this rectangle. While virtual fragment rectangles are packed horizontally along node strips, the transient fragments appear in stacks of height $\kappa$ on a node. Thus, if we consider the virtual fragments for objects in $x$ position $\frac{\kappa}{N}[k, k+1)$ virtually stored on a virtual node with associated strip $\frac{1}{N}[j, j+1)$ ($j < 0$), then these fragments occupy a virtual rectangular area of size $\frac{\kappa}{N} \times \frac{1}{N}$ and under the mapping $\varphi_t$ their image occupies a rectangle of precisely the same size on a single node. At the granularity of fragments, the mapping can be understood this way. We note that the $x$ position of a transient fragment has no physical significance, since it is only necessary that the fragment be stored on the corresponding node, and we allow the $x$ position to change with time.
The $y$ location of transient fragments, however, indicates the node on which the fragment is stored and the $y$ positions change correspondingly. See Fig. 2.

B. Node Failure and Object Repair
We now describe the failure and repair processes in greater detail.
1) Standard object repair:
Standard object repair can proceed only when $g_v(t) > 0$ so that the transitional node exists as a physical node. The object under repair at time $t$ has $(f(1, t) - g_v(t)) N$ fragments erased from among the top $(1 - g_v(t)) N$ nodes in the repair queue. During repair, these fragments are regenerated for the object and written to those nodes as settled fragments just as in the basic liquid system. One further settled fragment is regenerated for the transitional node. Note that this fragment may be viewed as either settled or transient, since as a 'transient' fragment it would later be copied to the transitional node to become a 'settled' fragment. In addition, a constant number $\beta_v N$ of virtual fragments are generated and written as transient fragments to the transitional node. These transient fragments correspond to virtual fragments on virtual nodes appearing in the repair queue below the transitional node. Thus, a total of $(f(1, t) - g_v(t)) N + 1 + \beta_v N$ fragments are regenerated. See Fig. 2.
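The fragment count for one standard repair can be tallied directly from the three contributions above. The following is a small sketch in which the function name and the numeric values are hypothetical; exact rational arithmetic is used so that the node counts stay integral.

```python
from fractions import Fraction

def regenerated_fragments(f_head, g_v, beta_v, N):
    """Fragments regenerated by one standard object repair.

    f_head : f(1, t), erased fraction for the object at the head of the queue
    g_v    : g_v(t), fraction of node positions not yet launched (must be > 0)
    beta_v : virtual overhead, so beta_v * N virtual fragments are generated
    """
    if g_v <= 0:
        raise ValueError("standard repair requires g_v(t) > 0")
    settled = (f_head - g_v) * N  # erased fragments on launched nodes
    transitional = 1              # one settled fragment for the transitional node
    transient = beta_v * N        # virtual fragments written as transient fragments
    return settled + transitional + transient

# Hypothetical state: N = 100 nodes, 30 erased at the head, 2 unlaunched positions,
# and 10 virtual nodes.
N = 100
total = regenerated_fragments(Fraction(30, N), Fraction(2, N), Fraction(10, N), N)
print(total)  # prints 39
```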
2) Atomic nature of Transitional node writing:
Assuming that the transitional node cannot fail is equivalent to treating the writing of the transitional node as an atomic event. In an actual implementation a failure of the transitional node could be handled by forwarding in the repair queue those among the $\kappa \frac{n_{obj}}{N}$ objects whose repair had been completed. Alternatively, if a copy of those fragments had been maintained elsewhere then the transitional node could be reconstituted on a replacement node. The transient fragments copied to realize virtual fragments would still be available, since the memory occupied by those fragments is not released until the transitional node is complete, so those fragments would be available for a replacement transitional node. In any case, the failure of a transitional node is a small perturbation in the repair process and could easily be absorbed in any practical implementation without significantly affecting the long term behavior of the repair process. When a transient fragment carrying node fails we view its transient fragments as lost. It is possible, however, that some of those fragments had already been copied to the transitional node. For simplicity we will ignore this possibility and treat those fragments as completely lost. This is equivalent to assuming that the copying process happens instantaneously upon completion of the transitional node.
3) Transitional node completion and actual node launch:
A particular transitional node is used to store transient fragments regenerated from $\kappa \frac{n_{obj}}{N}$ successive object repairs. Once those repairs are complete that node will store $\beta_v N \cdot \kappa \frac{n_{obj}}{N} = \beta_v \kappa n_{obj}$ transient fragments and we say that the transitional node has been completed. When a transitional node $n$ completes it becomes an actual (transient fragment carrying) node with node position $g_v N$. At this time the node fully enters the system and its lifetime properly begins, hence we refer to this as the node's launch. For a node $n$ we will use $\tau_L(n)$ to denote its time of launch. The node will fail some time later, denoted $\tau_F(n)$, and $\tau_F(n) - \tau_L(n)$ is an independent exponentially distributed random variable with rate $\lambda$. Note that actual nodes are ordered in the repair queue by launch sequence. We will introduce the notation $I_L(n)$ to denote the relative order of the launched nodes. Initially ($t = 0$) all actual nodes will possess a value $I_L(n) \le 0$ and they will be correspondingly ordered in the repair queue. Subsequent launches (for $t > 0$) will be indexed starting from $I_L(n) = 1$.

Upon completion the transitional node ceases to be the transitional node and $g_v N$ reduces by $1$. More precisely, we have $g_v(\tau_L(n)+) N = g_v(\tau_L(n)-) N - 1$ and the node position of $n$ at time $\tau_L(n)+$ is $g_v(\tau_L(n)+) N$. The node immediately below, in node position $g_v(\tau_L(n)+) N - 1 = g_v(\tau_L(n)-) N - 2$, assuming it exists as a physical node, now becomes the transitional node. (We will ignore the possibility of simultaneous node failure and node launch.) While the $\kappa \frac{n_{obj}}{N}$ object repairs associated to one transitional node are proceeding, the virtual fragments residing on the transitional node are realized, i.e., made settled, by copying the transient images of those fragments to the transitional node.
Completion of the transitional node also entails the completion of this copy process; we assume it is accomplished by the time the last object repair for that transitional node is completed. Treating the completion as an atomic event, the number of transient fragments copied to the transitional node is given by $h_v(\tau_L(n)-) N \cdot \kappa \frac{n_{obj}}{N} = h_v(\tau_L(n)-)\, \kappa n_{obj}$. Note that the number of initial settled fragments on the node upon launch is slightly larger, being given by $(h_v(\tau_L(n)-) N + 1)\, \kappa \frac{n_{obj}}{N} = (h_v(\tau_L(n)-) + \frac{1}{N})\, \kappa n_{obj}$. Upon transitional node completion the storage capacity associated to copied transient fragments is released, i.e., it is made available for overwriting. Conceptually, in the context of the repair queue, we can view those transient fragments as having been erased. By the definition of $\varphi_t$, the decrementing of $g_v N$ at completion times corresponds to increasing the $x$ positions of the remaining transient fragments by $\frac{\kappa}{N}$ (see Fig. 2).

For $t \ge \tau_L(n)$ the settled fragments on node $n$ occupy $x$-positions $[0, z_s(n, t)]$, where we have introduced the parameter $z_s(n, t)$ to indicate the right extreme position of the settled fragments. Hence $z_s(n, \tau_L(n)+) = \kappa\,(h_v(\tau_L(n)-) + \frac{1}{N})$. Similarly, for $t \ge \tau_L(n)$ the transient fragments on $n$ occupy $x$-positions $[z_v(n, t), 1]$, where we have introduced the parameter $z_v(n, t)$. At launch, the number of transient fragments on the node is $\beta_v N \cdot \kappa \frac{n_{obj}}{N} = \beta_v \kappa n_{obj}$, hence $z_v(n, \tau_L(n)+) = 1 - \beta_v \kappa$. At times of subsequent transitional node completions $z_v(n, t)$ increments by $\frac{\kappa}{N}$ until $z_v(n, t)$ reaches $1$, which occurs after $\beta_v N$ such launches. If a launched node $n$ has no transient fragments stored on it at time $t$ then we say $z_v(n, t) = 1+$. Let us denote the time that $z_s(n, t)$ reaches $1$ as $\tau_s(n)$.
Note that both $z_v(n, t)$ and $z_s(n, t)$ can be defined and uniquely determined even if $n$ has failed (they take the value they would have if some other node had failed instead) and we occasionally use this fact.

Note that we generally require $z_s(n, t) \le z_v(n, t)$, which limits the choice of $\beta_v$. In particular this requires $\kappa\,(h_v(\tau_L(n)-) + \frac{1}{N}) \le 1 - \beta_v \kappa$. Should this inequality be violated at a launch time for some node $n$, then let us stipulate that the right-most settled fragments targeted for $n$ will be dropped to reduce $z_s(n, t)$. In the analysis below we will choose system parameters so that this event is exponentially rare in $N$.

As standard repair proceeds the two values $z_s(n, t)$ and $z_v(n, t)$ increase essentially in lock step. If we treat transitional node completion as an atomic discrete event then the two positions increase exactly together by $\frac{\kappa}{N}$ upon each transitional node completion. If we view the generation of settled fragments at the granularity of objects then the settled fragments increase with finer granularity, but the two are re-synchronized upon transitional node completion. More precisely, $z_s(n, t)$ increases by $\frac{1}{n_{obj}}$ in intervals of length $\frac{T_{tot}}{n_{obj}}$, whereas $z_v(n, t)$ increases by $\frac{\kappa}{N}$ in intervals of length $\Delta t = \frac{\kappa}{N} T_{tot}$. Thus, assuming an initial negligible margin sufficient to store $\kappa \frac{n_{obj}}{N}$ fragments, there is always sufficient capacity on the node to store its assigned transient fragments. For convenience we will generally adopt the atomic event viewpoint.
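The launch-time bookkeeping above reduces to a handful of products. The sketch below collects them in one place under the atomic-event viewpoint; the function name `launch_state` and all numeric values are hypothetical and serve only to illustrate the capacity condition $z_s \le z_v$.

```python
from fractions import Fraction

def launch_state(kappa, beta_v, h_v_before, N, n_obj, T_tot):
    """Quantities at the launch of a transitional node, in the atomic-event view."""
    return {
        # Object repairs served by one transitional node.
        "objects_per_node": kappa * n_obj / N,
        # Time between launches under constant-rate standard repair.
        "delta_t": kappa * T_tot / N,
        # Transient fragments carried by the newly launched node.
        "transient_count": beta_v * kappa * n_obj,
        # Right edge of the settled fragments at launch.
        "z_s": kappa * (h_v_before + Fraction(1, N)),
        # Left edge of the transient fragments at launch.
        "z_v": 1 - beta_v * kappa,
    }

# Hypothetical parameters chosen only for illustration.
state = launch_state(kappa=Fraction(1), beta_v=Fraction(1, 10),
                     h_v_before=Fraction(3, 10), N=100, n_obj=10000,
                     T_tot=Fraction(1))
fits = state["z_s"] <= state["z_v"]  # capacity condition z_s <= z_v
print(state, fits)
```

If `fits` were false for some launch, the stipulation above applies: the right-most settled fragments targeted for the node would be dropped.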
4) Ancillary Repair:
When $g_v(t) = 0$ standard repair cannot proceed. The system could, however, continue to repair objects without generating transient fragments. More specifically, the ancillary repair process regenerates those $f(1,t)N$ fragments missing from actual nodes and writes them to the corresponding actual nodes. Thus, the repaired object will have $N$ intact fragments after repair. It can then be placed in the repair queue in position $x_v(t)$, immediately in front of those objects that possess virtual fragments. This process resembles basic liquid repair for the subsystem consisting of those objects in $x$-positions greater than $x_v(t)$. Note that this involves no change in the set of transient fragments.

Under ancillary repair the value of $z_s(n,t)$ increases for the bottom $f(1,t)N$ nodes while $z_v(n,t)$ remains unchanged. If for some such node it then arises that $z_s(n,t) = z_v(n,t)$ then there is no remaining room to write the regenerated fragments on that node. Let us stipulate, in that case, that the settled fragment corresponding to the object with maximal $x$-position among objects possessing settled fragments on $n$ will be overwritten by the regenerated fragment. With this stipulation the structure of the repair queue remains intact and we continue to have $z_s(n,t) = z_v(n,t)$.
5) Transient and Settled Fragment Processes:
At any time $t$ the number of launched nodes storing transient fragments is $h_v(t)N$ and the number of transient fragments on each node is an integer multiple of $\kappa \frac{n_{obj}}{N}$. At transitional node completion times each transient fragment carrying node releases the storage occupied by the $\kappa \frac{n_{obj}}{N}$ fragments that were copied to the transitional node. In addition, a new transient fragment carrying node is launched carrying $\beta_v \kappa n_{obj}$ transient fragments. The following result, which we state without further proof, captures the basic dependence.

Lemma 2.1:
Let $I_L(n) \le I_L(n')$ and assume at time $t$ that node $n$ stores transient fragments. Then $z_v(n,t) - z_v(n',t) = (I_L(n') - I_L(n)) \frac{\kappa}{N}$.

When a transient fragment carrying node fails, $h_v(t)N$ reduces by $1$. At a transitional node completion time the above Lemma implies that $z_v(n,t)$ may reach $1$ for at most one transient fragment carrying node. Thus $h_v(t)N$ will either remain unchanged or increase by $1$. Transitional node completion and transient carrying node failure are the only two events that modify the transient fragments.
Lemma 2.2 (Monotonicity):
Let $n$ and $n'$ be successively launched nodes, i.e., $I_L(n) = I_L(n') - 1$. Then, assuming neither node has failed and $t \ge \tau_L(n')$, we have

$z_s(n', \tau_L(n')+) - z_s(n, \tau_L(n)+) \le \frac{\kappa}{N}$,

$z_s(n,t) - z_s(n',t) \ge 0$.

Proof:
Consider the first inequality. If $z_s(n, \tau_L(n)+) = 1 - \beta_v \kappa$ then the result is immediate, so we assume $z_s(n, \tau_L(n)+) = \frac{\kappa}{N}(h_v(\tau_L(n)-)N + 1) < 1 - \beta_v \kappa$. Now, as described above, $h_v(t)N$ can increment (by $1$) only upon completion of a transitional node and can otherwise only decrease. Hence $\frac{\kappa}{N}(h_v(\tau_L(n')-)N + 1) \le \frac{\kappa}{N}(h_v(\tau_L(n)-)N + 2)$, which now gives the first result.

Between the launch of $n$ and the launch of $n'$ exactly $\kappa \frac{n_{obj}}{N}$ objects are repaired under standard repair. Hence $z_s(n, \tau_L(n')+) \ge z_s(n, \tau_L(n)+) + \frac{\kappa}{N} \ge z_s(n', \tau_L(n')+)$. (Note that the first inequality is an equality if no ancillary repair has occurred in the meantime.) For $t \in (\tau_L(n'), \tau_s(n)]$ the quantity $z_s(n,t) - z_s(n',t)$ is unchanging unless either $z_s(n,t) = z_v(n,t)$ or $z_s(n',t) = z_v(n',t)$. If $z_s(n,t) = z_v(n,t)$ then we have $z_s(n',t) \le z_v(n',t) < z_v(n,t)$. If it first occurs that $z_s(n',t) = z_v(n',t)$ then, after that, we have $z_s(n,t) \in [z_v(n',t), z_v(n,t)]$. Thus, the second inequality holds in all cases.
6) The Node Launch Position Process:
The value of $g_v(t)N$ increases by $1$ when a launched node fails. The value of $g_v(t)N$ decreases by $1$ upon a transitional node completion. The rate of launched node failure is given by $\lambda(1 - g_v(t))N$. Thus, under our current assumption the integer valued $g_v(t)N$ process follows closely the queue size of a machine interference problem (MIP) [3], [4]. If we assume constant rate standard repair then $g_v(t)N$ is precisely an MIP with constant service time. In Kendall notation this is an M/D/1//N queueing problem. In the MIP problem it is assumed that there are $N$ machines whose time to failure is an exponentially distributed random variable with rate $\lambda$. Upon failure the machines enter a queue for service and the repair time is another random variable, in our case assumed to be a deterministic constant. In some MIP models one assumes a finite capacity for the queue, but in our case this is immaterial.

More loosely, the $g_v(t)N$ process behaves much like the number of users in a single server queue. The service time of the queue is a constant under the assumption of fixed repair rate. A system designer, however, would be free to vary the repair rate as a function of the state of the system. The number of users in the queue cannot be arbitrarily large since having more than $N - k_c$ users in the queue implies data loss.
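The machine interference dynamics described above can be explored with a small event-driven simulation. This is an illustrative sketch only: the function name, the parameter values, and the choice to track the time-averaged and maximum queue size are assumptions made here, not taken from the paper. Failures arrive at rate $\lambda$ times the number of working nodes, and a single server repairs one node per fixed service time.

```python
import random


def simulate_mip(n_nodes, fail_rate, service_time, horizon, seed=0):
    """Simulate an M/D/1//N machine interference queue up to time `horizon`.

    Returns (time-averaged queue size, maximum queue size observed).
    """
    rng = random.Random(seed)
    t = 0.0
    queue = 0                     # failed nodes waiting for or under repair
    next_done = float("inf")      # completion time of the repair in service
    max_queue = 0
    area = 0.0                    # integral of the queue size over time
    while t < horizon:
        working = n_nodes - queue
        # Exponential lifetimes are memoryless, so the time to the next
        # failure can be resampled after every event.
        if working > 0:
            next_fail = t + rng.expovariate(fail_rate * working)
        else:
            next_fail = float("inf")
        t_next = min(next_fail, next_done, horizon)
        area += queue * (t_next - t)
        t = t_next
        if t >= horizon:
            break
        if next_done <= next_fail:
            queue -= 1            # repair completes
            next_done = t + service_time if queue > 0 else float("inf")
        else:
            queue += 1            # a working node fails
            if queue == 1:
                next_done = t + service_time
            max_queue = max(max_queue, queue)
    return area / horizon, max_queue


# Hypothetical load: N = 100, lambda = 1, service time 0.003 (about 30% load).
mean_queue, peak_queue = simulate_mip(100, 1.0, 0.003, 50.0, seed=1)
```

In this sketch the queue size plays the role of $g_v(t)N$, and a peak exceeding $N - k_c$ would correspond to data loss. A state-dependent repair rate could be modeled by making `service_time` a function of `queue`.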
7) Survivor process:
For a launched node $n$ and $t \ge \tau_L(n)$ let $\mathcal{S}(n;t)$ denote the set of nodes launched since $\tau_L(n)-$ (including $n$) that have not failed by time $t$. Thus, formally, $\mathcal{S}(n;t) = \{ n' : I_L(n') \ge I_L(n),\ \tau_F(n') > t \}$, where $\tau_F(n)$ denotes the failure time of node $n$.

Lemma 2.3:
For τ L ( n ) ≤ t < τ s ( n ) we have f ( z s ( n , t ) − , t ) ≤ |S ( n ; t ) | + g v ( t ) ≤ f ( z s ( n , t )+ , t ) Proof:
By Lemma 2.2, for any node n ′ ∈ S ( n ; t ) we have z s ( n ′ , t ) ≤ z s ( n , t ) . It follows that f ( z, t ) N ≥ |S ( n ; t ) | + g v ( t ) for any z > z s ( n , t ) . For any node n ′′ with I L ( n ′′ ) ≤ I L ( n ) we have z s ( n ′′ , t ) ≥ z s ( n , t ) . Hence, assuming at least one suchnode survives at time t, we have f ( z, t ) N ≤ |S ( n ; t ) | + g v ( t ) for z < z s ( n , t ) . If no such node survivesthen g v ( t ) N + |S ( n ; t ) | = N. We define a data loss event as a node loss that results in at least one object having fewer than k c intactfragments. (For purposes of analysis of system dynamics we generally assume that the repair proceedsregardless of this event.) Suppose a first data loss event occurs at some time t. This implies that a fullysettled node failed at time t and the node n that then enters node position N − k c is unsettled, i.e., τ s ( n ) > t. It follows that |S ( n ; t +) | + g v ( t +) = N − k c and since |S ( n ; t ) | + g v ( t ) can only increase with t we have |S ( n ; τ s ( n )) | + g v ( τ s ( n )) ≥ N − k c . Since τ s ( n ) − τ L ( n ) ≤ T tot this yields the following result. Lemma 2.4: If |S ( n ; τ s ( n )) | + g v ( τ s ( n )) < N − k c for all nodes with τ s ( n ) ≤ t then f (1 , s ) N < N − k c for all s ∈ [0 , t − T tot ] . C. Analysis of Continuous Standard Repair
In this section we present an analysis of a system under continuous standard repair at constant repairrate. We set the system parameters so that g v ( t ) N gravitates towards a value δ N for a small positivedesign parameter δ . We consider what is essentially a single busy period of continuous standard repairand show that its expected length (without data loss) is exponential in N. In this mode of operationthe storage system always has a non-zero number of empty physical nodes. In practice such a form ofoperation could be viable if there are many such systems running on a shared set of storage nodes, sothat the unused nodes could be aggregated across a much larger system and their number kept relativelysmall. For a single such system a more practical version would likely use faster repair with occasionalystanard repair suspension. We will show that in the large system limit ( N → ∞ ) that with properly chosenparameters we can have MTTDL → ∞ repair read rate → ( λT tot ) − ( β − ln(1 − β )) We conjecture that the factor β − ln(1 − β ) is optimal.Long-lasting continuous standard repair must balance the node production rate with the node loss rate.We will choose δ as a (small) positive target value for g v ( t ) . When g v ( t ) = δ the node production rateand the node loss rate will be equal, which results in κ = ((1 − δ ) λT tot ) − . A transitional node completes every ∆ t = κ T tot N time units, hence nodes are launched at rate / ∆ t whilethey fail at rate λ (1 − g v ( t )) N. Note that we have the relation λ ∆ t = δ N where δ = 1 − δ . After a node is launched β v N subsequent transitional node completions are required to clear out thetransient fragments and this will occur, under continuous standard repair, after an elapsed time β v N ∆ t = β v κT tot . Similarly, under continuous standard repair, we have τ s ( n ) − τ L ( n ) = (1 − z s ( n , τ L ( n )) T tot .
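The balance condition can be checked numerically. The sketch below is an illustration under assumed parameter values (the function name and the numbers are hypothetical): it computes $\kappa = ((1-\delta)\lambda T_{tot})^{-1}$ and $\Delta t = \kappa T_{tot}/N$, then compares the node launch rate $1/\Delta t$ with the node loss rate $\lambda(1-\delta)N$ at $g_v(t) = \delta$.

```python
import math


def balance_parameters(n_nodes, fail_rate, t_tot, delta):
    """Return (kappa, dt, launch_rate, loss_rate) for the target g_v = delta."""
    kappa = 1.0 / ((1.0 - delta) * fail_rate * t_tot)
    dt = kappa * t_tot / n_nodes              # time between transitional node completions
    launch_rate = 1.0 / dt                    # nodes launched per unit time
    loss_rate = fail_rate * (1.0 - delta) * n_nodes   # failures per unit time at g_v = delta
    return kappa, dt, launch_rate, loss_rate


# Hypothetical values: N = 1000, lambda = 0.01, T_tot = 20, delta = 0.05.
kappa, dt, launch_rate, loss_rate = balance_parameters(1000, 0.01, 20.0, 0.05)
rates_balanced = math.isclose(launch_rate, loss_rate)
```

The same quantities give $\lambda \Delta t = 1/((1-\delta)N)$, which follows directly from the two definitions and is the form used in the remainder of this sketch's assumptions.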
1) Initial Condition:
We aim to show extremely long operation of the system, but to be concrete weintroduce an initial condition set essentially to the expected behavior, in particular we assume g v (0) = . We construct the initial condition ( t = 0+ ) by supposing that nodes had been launched at times , − ∆ t , − t , . . . . Some of these nodes will be assumed to have failed by time t = 0 . Nominally, anode launched at time − k ∆ t would have survived to time t = 0 with probability e − λ ∆ t k . Among nodeslaunched at times , − ∆ t , − t , . . . , − d ∆ t the expected number of nodes surviving at time t = 0 is givenby S d := d X k =0 e − λ ∆ t k = 1 − e − λ ∆ t ( d +1) − e − λ ∆ t . (2)We will consider that the node launched at time − k ∆ t has failed by time t = 0 if ⌈ S k − ⌉ = ⌈ S k ⌉ . Thisimplies that the node launched at time still survives ( S = 1 ). The number of nodes surviving at time from − d ∆ t , ..., is ⌈ S d ⌉ . The node n in node-position δ N + j at t = 0+ has − I L ( n ) equal to the smallest d such that S d > j. Note that S ∞ > δ N so all initial operational nodes are assigned a launch index. It followsthat for launched node in position δ N + j at t = 0+ we have − I L ( n ) = ⌊− ( λ ∆ t ) − ln(1 − (1 − e − λ ∆ t ) j ) ⌋ . For all launched nodes we assume z s ( n , z v ( n , , where z v ( n , − β v κ ) − I L ( n )∆ t . Weconsider the initial nodes in node position δ N + j for j < , to have not been launched.
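The launch index assignment can be sanity-checked against a direct search. The sketch below assumes the reading $S_d = \sum_{k=0}^{d} e^{-\lambda \Delta t\, k}$ and $-I_L(n) = \lfloor -(\lambda \Delta t)^{-1} \ln(1 - (1 - e^{-\lambda \Delta t}) j) \rfloor$ of the formulas above; the function names and the value $\lambda \Delta t = 0.02$ are hypothetical.

```python
import math


def expected_survivors(d, a):
    """Closed form of S_d = sum_{k=0}^{d} exp(-a * k), where a = lambda * dt."""
    return (1.0 - math.exp(-a * (d + 1))) / (1.0 - math.exp(-a))


def launch_age_closed_form(j, a):
    """Smallest d with S_d > j, via the floor expression (requires j < S_infinity)."""
    return math.floor(-math.log(1.0 - (1.0 - math.exp(-a)) * j) / a)


def launch_age_search(j, a):
    """Smallest d with S_d > j, found by direct search."""
    d = 0
    while expected_survivors(d, a) <= j:
        d += 1
    return d


a = 0.02   # hypothetical value of lambda * dt
# j = 1 sits exactly on a boundary (S_0 = 1), where floating-point rounding
# can move the floor, so it is omitted from this comparison.
offsets = (0, 2, 5, 20, 40)
ages_closed = [launch_age_closed_form(j, a) for j in offsets]
ages_search = [launch_age_search(j, a) for j in offsets]
```

The two lists agree for these offsets, which is consistent with the closed form being the solution of $S_d > j$ for the smallest integer $d$.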
2) Parameters:
We define three fixed x -positions, Z m ≤ Z a ≤ Z v . These correspond respectivelyto three x -positions on launched nodes: a minimum desired position of initial settled fragments, i.e., aminimum desired z s ( n ) ; the expected value of z s ( n , τ L ( n )) , i.e., the expected value of κ ( h v ( τ L ( n ) − )+ N ); the beginning of transient fragments on a launched node, i.e., Z v = 1 − β v κ. Consider an object in position Z and assume that it has δ N settled fragments. Under continuous standardrepair it will reach the head of the repair queue after an elapsed time T tot (1 − Z ) and the expected numberof those settled fragments lost during that time is given by δ N (1 − e − λT tot (1 − Z ) ) . Assume continuous standard repair, and consider nodes launched at times k ∆ t for k = 0 , ..., K − . Theexpected number of these nodes surviving at time K ∆ t is given by K − X k =0 e − λ ( K − k )∆ t = 1 − e − λK ∆ t e λ ∆ t − − e − λK ∆ t e δ N − To simplify notation we introduce ξ N = ( e δ N − δ N = 1 + δ N + δ N ) + . . . ≃ . Let us define γ ( Z ) = 1 − e − λT tot (1 − Z ) ξ N . and set γ a = γ ( Z a ) , γ m = γ ( Z m ) , γ v = γ ( Z v ) . Assuming K ( Z ) := T tot (1 − Z ) / ∆ t is an integer, thequantity γ ( Z ) δ N is the expected number of survivors among launches in a time period of length (1 − Z ) T tot . By our definitions this implies Z a = κN ( γ ( Z v ) δ + 1 N ) , and the desired condition Z a < Z v reduces to the condition κN ( γ ( Z v ) δ + N ) ≤ Z v . We note the relation K ( Z ) = − δ N ln(1 − ξ N γ ( Z )) . With appropriate parameter choices we will have β > γ m > γ a > γ v > β/ . D. Stopping Time
Assuming continuous standard repair, the transitional node completion times are k ∆ t , k = 0 , , , ... and we will use the notation n k to indicate the node launched at time k ∆ t . For each i = 0 , , ... let usdefine J i = ⌊ s i / ∆ t ⌋ as the index of the node repair immediately preceeding the failure time s i . Assume K m = K ( Z m ) is integer valued. Define β δ = β − δ . A gk = { g ( k ∆ t +) N ∈ [1 , δ N − } A Zk = { z s ( n k − K m ) ≥ Z m } A Sk = {S ( n k − K m , k ∆ t ) ≤ β δ N } and define the stopping time I s = arg min i { A gi ∪ A Zi ∪ A Si } as the first launch instance at which at least one of these conditions fails to hold. Lemma 2.5:
No data loss can occur prior to s I s . Proof:
Let I f ∆ t denote the last launch time before the first data loss event which occurs at s F . Wewill show I s ≤ I f . If g v ( k ∆ t +) [1 , δ N − or z s ( n k − K m ) < Z m for any k ≤ I f then we have I s ≤ I f immediately. Assume now that g v ( k ∆ t +) N ∈ [1 , δ N − and z s ( n k − K m ) ≥ Z m for all k ≤ I f . Since f ( s F +) N > N − k c we now have |S ( n I f − K m , I f ∆ t ) | ≥ |S ( n I f − K m , s F ) |≥ ( f ( s F ) − g v ( s F )) N> N − k c − g v ( s F ) N ≥ N − k c − δ = β δ which implies I s ≤ I f . Finally, we show that the expectation of the stopping time I s is exponentially large in N. Proposition 2.6:
Assume the stated inital condition with g v ( t ) = δ and expected transient fragments.Assume β ≤ , δ ≤ β and δ N ≥ . Set γ m = δ β δ , γ a = δ β δ , and γ v = δ β δ Then we have E ( I s ) ≥ e δ N Proof:
In the appendix we prove the following: p ( A gk ) ≤ e − δ N (3) p ( A Zk ) ≤ e − δ N (4) p ( A Sk ) ≤ e − δ N . (5)for all i ≤ I s . From this we have p ( { A gi ∪ A Zi ∪ A Si } ) ≤ e − δ N and an elementary argument now yieldsthe stated result.The assumptions on β, δ and δ N are made largely to simplify constants in the proofs. They can be relaxedto obtain more general results of the same form.We first note that setting the three γ values determines Z and K and T tot . In particular we have K v = − N δ ln(1 − ξ N δ β δ ) (and which we assume to be integer valued), which implies β v = − δ ln(1 − ξ N δ β δ ) . Letting N → ∞ we can have δ → and obtain the asymptotic value β v = − ln(1 − β ) . The value of γ v determines Z a through Z a = κ ( γ v + N ) . The relation between γ a and Z a then yields λT tot = δ β δ + 1 N δ − ln(1 − ξ N δ β δ ) which is asymptotic to β − ln(1 − β ) . E. Immediate Repair and Ancillary Repair
While the above described system achieves arbitrarily large MTTDL with what we conjecture are asymptotically optimal repair read rates, there are various practical drawbacks. In particular, it may be undesirable to maintain a queue of incomplete nodes with a constant repair rate when an acceleration of the repair process could quickly clear the queue. Examination of the exponents in the above arguments indicates that quite large systems might be required to enable the described mode of operation with sufficient data protection, and accelerated repair could allow smaller systems. As will be discussed below, the design is also somewhat vulnerable to the probabilistic node failure assumptions. In particular, the system depends on the failure of relatively young unsettled nodes to ensure protection from data loss.

In this section we discuss a more practical mode of operation in which we view the storage of transient fragments as largely opportunistic, intended not to interfere with the ongoing basic liquid-like repair.

In practice the node failure rate is not precisely known and the assumption of exponentially distributed node lifetimes will not hold precisely. While liquid storage admits delayed repair, it is likely the case that practical repair operations can proceed relatively quickly once a node is declared permanently failed, faster than needed according to the node loss rate. In such a case, assuming an appropriate choice for $\kappa$, the system may reach $g_v(t)N = 0$ frequently. When $g_v(t)N$ reaches $0$ the standard repair process will simply stop, and it will restart only after a node failure. Ancillary repair, however, can continue while $g_v(t)N = 0$.

Without ancillary repair it is still possible for data loss to occur even if node repair is immediate, i.e., even if the repair rate is arbitrarily high. Indeed, assume that the node in position $g_v(t)N = 0$ fails repeatedly, i.e., each node failure occurs in position $0$. Then, eventually we have $h_v(t)N = 0$.
Consider a node $n$ launched under this condition; it has $z_s(n, \tau_L(n)) = \kappa/N \simeq 0$. Now suppose that subsequent to this node launch only fully settled nodes fail. Upon each subsequent node launch $z_s(n,t)$ will increase by $\kappa/N$ while the node position of $n$ will increase by $1$. Since $N\kappa^{-1} > N - k_c$ data loss is inevitable: we will eventually have $h_v(t)N > N - k_c$. More generally, if $h_v(t)N$ becomes quite small then the gap $z_a(n,t) - z_s(n,t)$ becomes large, and this leads to the data loss event outlined above. In this circumstance ancillary repair could increase $z_s(n,t)$ while leaving $z_a(n,t)$ fixed, thereby reducing the gap.

If $(h_v(t) + g_v(t))N$ exceeds $N - k_c$ then data loss occurs. This could occur even with $g_v(t)N = 0$ if $\beta_v N$ is sufficiently large and transient fragment carrying nodes do not fail. It may well be the case in practice that node failure rates are low while the nodes are relatively new and, in that case, this possibility would become a significant concern. Let us therefore consider a design in which we choose $\beta_v N < N - k_c$. This implies that $h_v(t)N < N - k_c$ for all $t$, so that $(h_v(t) + g_v(t))N > N - k_c$ can occur only with sufficiently large $g_v(t)N$. By controlling the rate of standard repair, and leaving some additional margin, one can control the value of $g_v(t)N$ and, with high probability, keep it sufficiently small. To give an indication of how this could be accomplished we note the following result. In an M/D/1 queue with repair time given by $\gamma/\lambda$, with $\gamma < 1$, the probability that the queue length exceeds $x$ during a busy period is upper bounded by $e^{-\nu x}$, where $\nu$ solves $e^{\nu} = 1 + \nu/\gamma$. (A proof may be found in the appendix.) For example, if $\gamma \simeq 0.31$ (repair time equal to roughly $0.31$ of node failure interarrival times) then $\nu = 2$. Hence the probability of exceeding $\delta N$ is less than $e^{-2\delta N}$. We note the significant improvement of the exponent as compared to the previous section.
There the critical exponents were of the form δ N.

III. A COMPLETE VIRTUALIZATION APPROACH
We now present an alternative approach in which the entire incomplete portion of the repair queue is virtualized. The height of the virtualized queue will be $r = \beta_v N$. Instead of temporarily using certain nodes for saving the overhead fragments, all actual nodes are used simultaneously for both access and overhead. The amount of overhead needed is $\beta N \simeq \beta_v N/2$ and the number of actual storage nodes is $N = k_c + \delta$. Here $\delta$ represents a margin which protects the system against data loss.

In the complete virtualization approach the objects in the repair queue are maintained in a fixed cyclic order and repaired according to that order in standard repair. The system requires, however, an ancillary repair process that operates with a different object order. From the perspective of the repair queue we will view the ancillary repair of objects as happening 'in place', meaning their position in the queue is not altered. Furthermore, we do not separate the two repair processes in time but rather assume a certain amount of synchronization between them. Unlike the partially virtualized repair method, the method outlined here possesses a unique 'complete' state to which the system will periodically return. This favors analysis of the mode of operation in which the repair rate is generally higher than needed and repair suspends when the complete state is reached.

The objects are partitioned into $N$ groups. Group membership is determined by object position in the (cyclic) repair queue modulo $N$. It is convenient, therefore, to assume that $n_{obj}$ is a multiple of $N$, so that all groups have precisely the same size and so that the group definition is invariant under cyclic shift of the repair queue. This assumption is not critical, but since it simplifies the description we will adopt it.

Each group of objects is uniquely associated to one of the $N$ nodes such that all virtual fragments belonging to objects in the group are stored as transient fragments on the associated node.
When a node fails it is replaced by a new empty node in node position $0$ and the group association of the failed node is transferred to the new node.

When a node fails all objects lose the settled fragment that had been stored on that node. In addition, the objects belonging to the group associated to that node each lose all of their transient/virtual fragments. The intact settled fragment ordering in the virtual repair queue is thereby violated since every $N$th object in the repair queue lost all of its virtual fragments. Whereas in the partially virtualized method the corresponding objects were advanced in the repair queue, in the completely virtualized approach we instead adopt an ancillary repair process that regenerates those missing fragments directly by repairing the objects in the affected group. The objects otherwise maintain their place in the queue and for each object in the group the lost transient fragments are simply regenerated and stored on the replacement node. In general this involves regenerating for each object in the group one settled fragment and a varying number of transient fragments. On average only $\beta_v N/2$ fragments are repaired per object, so the repair efficiency of this ancillary process is less than that of the main repair process by a factor of two. If the overhead $\beta$ is small, though, then this represents a small portion ($O(\beta)$) of the total needed repair.

A. The Complete State
It is most convenient to describe the system by first describing the complete state (in which repair suspends). In the complete state each object has $N$ settled access fragments stored one each on the $N$ access nodes. The virtualized portion of the repair queue has a staircase form, i.e., an asymptotically linear boundary. In particular we have $f^*(x)N = \lfloor -(\beta_v N)(1 - x) \rfloor$, where the superscript $*$ indicates the complete state and we use the same definition of $f$ as in the previous section. This function is a step function and each step has a width of $\frac{n_{obj}}{\beta_v N}$ objects. See Fig. 3 for an example. Note that the $x$-axis is now extended beyond $x = 1$, with $x > 1$ representing the overhead portion of the storage capacity.

It is convenient (but not critical) to assume that $\frac{n_{obj}}{\beta_v N}$ is an integer. In that case each group of objects comprises exactly the same number of virtual fragments. Since the incomplete repair queue is entirely virtualized, the actual overhead of the system is not $\beta_v$ but, approximately, $\beta_v/2$. A careful check of the definition of $f^*$ shows that the virtualized portion of the repair queue actually includes one completed node. It is possible in the complete virtualization approach to include zero or more than one completed nodes in the virtualization, e.g. by setting $f^*(x)N = \lfloor -s(1 - x) - (\beta_v N - s) \rfloor$ for some $s < \beta_v N$, and adding more complete virtual nodes would provide an additional buffer against bursty node losses at the cost of additional overhead, but, to simplify the presentation, we will not develop these variations.
1) Node Failure in the Complete State:
Consider a node failure while the system is in the complete state. One group in the virtualized portion of the repair group is erased. A new empty node is added to the system in node position $0$ in the repair queue. The surviving nodes in positions below the failed node are all advanced by $1$ in their node positions.

Conceptually, all virtual fragments are incremented by $1$ also in the node ordering. Note that this means that the top virtual node now coincides with the new empty physical node, and the associated transient fragments will be copied to the new physical node. An ancillary repair job is simultaneously commenced to regenerate the erased transient fragments, all of which will be written to the new node. In addition, each ancillary object repair regenerates the one missing settled fragment associated to the lost node. Because we assume one complete virtual node in the complete state, the new physical node will be complete once the ancillary repair job and the transient copying are complete. This does not, however, in itself recreate the complete state. In order to reach the complete state the standard repair process, with the virtual/transient fragments being written to their associated nodes, must also complete $\frac{n_{obj}}{\beta_v N}$ standard object repairs. Note that a small number of objects are scheduled for repair in both the ancillary and the regular repair process. Clearly, only one repair for those objects is required.

Fig. 3: Repair Queue in Complete State (regions labeled Virtual Nodes and Access Nodes). We assume 40 access nodes with $\beta_v N = 10$. The area associated to one node is indicated, along with its associated group.
2) Further Node Failure:
If additional nodes fail before reaching the complete state, then, for each additional failure, another $\frac{n_{obj}}{\beta_v N}$ standard object repairs are scheduled along with the ancillary repair for each failed node. Thus, each repair job involves at most $\frac{n_{obj}}{\beta_v N}$ standard object repairs and fewer than $\frac{n_{obj}}{N}$ ancillary object repairs. Each node repair also entails the copying of transient fragments to the transitional node. When the transitional node is complete the storage used for those transient fragments can be released.

We will consider two possible relationships between standard repair and ancillary repair. The first and simplest is to consider both repairs as associated to the transitional node repair. In this approach both types of repair are tied together in a single repair function and tied to the repair of the transitional node. When the transitional node completes its standard repair it may not be fully settled, since the objects belonging to groups associated to other failed nodes waiting to advance to the transitional node for repair will be missing fragments associated to those nodes. In the second approach the ancillary repairs are given priority. Since ancillary repairs per node are smaller (order $\beta$) than the standard repairs and this improves the resiliency of the system for a given $\delta$, this is likely the more practical approach. It does, however, lead to a more complicated analysis of the performance of the system.
3) Atomic Nature of Transitional Node Repair:
The transient fragments that are copied to the transitional node remain intact until the transitional node is complete. The number of settled fragments regenerated by the ancillary repair process for the transitional node is less than $\frac{n_{obj}}{N}$. With a negligible fraction of the storage capacity (roughly $N^{-1}$) these fragments could be written to both the transitional node and temporarily copied to other nodes. If the transitional node then fails while it is being written, it could be reconstituted with copying alone. In part to simplify the analysis, we will assume that transitional nodes cannot fail. Using the above mechanism, this could be effectively realized in an actual system with a small amount of additional overhead.

B. Analysis
We define $g_v(t)$ as in the partially virtualized case, i.e., $g_v(t)N$ denotes the node position immediately above the transitional node, assuming $g_v(t)N > 0$. Here $g_v(t) = 0$ implies the complete state with repair suspended. Let us first consider the case where ancillary repair is synchronized with standard repair.

Fig. 4: Result of Failure of Indicated node from Fig. 3 (regions labeled Virtual Nodes and Access Nodes). Transient fragments available for copying are indicated in dark blue. The white strips indicate erased fragments.

Fig. 5: Result of Three Node Failures (regions labeled Virtual Nodes and Access Nodes). The repair of the first failure is 70% complete. After 25% of the repair was complete a second node failed and after 60% a third node failed. Dark green indicates repaired fragments. Here we show the delayed copy model where transient fragments for transitional and virtual nodes are lost when another node fails.

With the assumption of no transitional node failure, the system behaves as a single server queue with $g_v(t)$ representing the number of users in the queuing system at time $t$. Under fixed repair time this is precisely an M/D/1//N machine interference problem. In particular, the arrival rate decreases as $g_v(t)$ increases. In practice, however, the value of $g_v(t)$ would be kept small, and in fact cannot exceed $\delta$ without data loss. The repair rate of the system could be adjusted to increase as $g_v(t)$ increases to control the probability of data loss.

The number of erased access fragments for objects in the system can depend on the group of that object. All objects are missing the $g_v(t)N$ fragments associated to the standard repair queue. In addition, objects belonging to groups associated to failed nodes awaiting repair will be missing additional fragments. When a node fails all of the virtual fragments associated to that node's group are lost. They will not be recovered by the ancillary repair until that node becomes and completes as the transitional node. All nodes that were in the repair queue at the time of failure will be missing those fragments until the node completes as a transitional node. If an object belongs to a group associated to a failed node that is in the node repair queue, then it is missing fragments for all nodes that were in the node repair queue at the time of its failure. Hence the number of erased fragments for an object is $g_v(t)N + E$, where $E$ is $0$ if the object's group-associated node is not in the node repair queue and is otherwise equal to the number of nodes that were in the node repair queue at the time of the node's failure. It follows that as long as $g_v(t)N < \delta N$ is maintained no data loss occurs. Thus, controlling the repair rate to ensure $g_v(t) < \delta$ provides data integrity.

For an M/D/1 queue with arrival rate $\lambda'$ and service time $D' = \gamma/\lambda'$ with $\gamma < 1$, we show in Appendix C that the probability of the queue exceeding $m$ during a busy period is less than $e^{-\nu m}$. Excursion probabilities are only smaller in the finite population case (where the arrival rate decreases with queue size).
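The exponent $\nu$ in this excursion bound can be computed numerically. The sketch below assumes that $\nu$ is the positive root of $e^{\nu} = 1 + \nu/\gamma$, which is our reading of the flattened equation and is consistent with the paper's example of $\nu = 2$ for $\gamma$ near $0.3$. The function names are hypothetical, and bisection is simply one convenient root finder.

```python
import math


def excursion_exponent(gamma, tol=1e-12):
    """Positive root nu of exp(nu) = 1 + nu / gamma, for 0 < gamma < 1."""
    if not 0.0 < gamma < 1.0:
        raise ValueError("gamma must lie strictly between 0 and 1")

    def f(nu):
        # Equivalent form of the equation: gamma * (exp(nu) - 1) - nu = 0.
        return gamma * math.expm1(nu) - nu

    lo = -math.log(gamma)      # minimiser of f; f(lo) = 1 - gamma + ln(gamma) < 0
    hi = lo + 1.0
    while f(hi) <= 0.0:        # expand until the root is bracketed
        hi *= 2.0
    while hi - lo > tol:       # bisection on [lo, hi]
        mid = 0.5 * (lo + hi)
        if f(mid) <= 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def excursion_bound(gamma, level):
    """Upper bound exp(-nu * level) on exceeding `level` during a busy period."""
    return math.exp(-excursion_exponent(gamma) * level)


# gamma chosen so that the root is exactly 2: gamma = 2 / (e^2 - 1), about 0.313.
gamma_example = 2.0 / math.expm1(2.0)
nu_example = excursion_exponent(gamma_example)
```

Because $f$ is convex with $f(0) = 0$ and $f'(0) = \gamma - 1 < 0$, there is exactly one positive root, so starting the bracket at the minimiser is safe. Smaller $\gamma$ (faster repair) gives a larger $\nu$ and hence a faster-decaying bound.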
The probability during a busy period that g v ( t ) > δN − is less that e − ν δ N where ν is given by e ν = 1 + νγ where the time for a node repair is γ/ ( λN ) . For example, if γ ≃ . then ν = 2 . Thus the expected number of busy periods until data loss is at least e ν ( δ N − − . The expected length of each busy period until this occurs as at least − µ where µ = 1 / ( δ N λD ) =1 / ( δ γ ) . Thus, MTTDL is greater than ( e ν δ N − δ γ D + λN ) . It follows that we have MTTDL → ∞ for any fixed γ < . Moreover, we can have γ → as N → ∞ . In this asymptote the read repair rate isgiven by ( λT tot ) − (2 β + O ( β )) which is optimal to first order in β. In the case where ancillary repair is given priority over standard repair the repair process can be viewedas a two stage queuing system where the first stage performs the ancillary repair and the second stageperforms the standard repair. We may assume a fixed overall repair rate that operates on one stage at atime, or a variable rate system that accelerates repair as queue size increases.A
PPENDIX AP ROOF OF P ROPOSITION M = δ N nodes at time t = 0 , according to the stated initial condition. At each time k ∆ t ,k = 1 , , ... a node is launched into, i.e. added to, the system. The duration from launch to failure of anode is an independent exponentially distributed random variable with rate λ. We will use the notation n k to indicate the node launched at time k ∆ t . To model the node failure process in an alternate way we adopt a point Poisson process with rate λN.
The arrival times of the Poisson process will be denoted $s_i$, $i = 0, 1, 2, \dots$ Here, $s_0 = 0$ and for $i > 0$ the differences $s_i - s_{i-1}$ are i.i.d. exponential random variables with rate $\lambda N$. For each $i > 0$ we further adopt an independent, uniformly distributed random variable $Y_i \in [0 : N-1]$. We interpret this to mean that the node in node position $g_v(s_i)N + Y_i$ is affected by the failure event associated to the Poisson process. If $Y_i + g_v(s_i)N \ge N$ then no node failure actually occurs and the system is unaffected; otherwise the node in node position $g_v(s_i)N + Y_i$ fails at time $s_i$.
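The alternate failure model just described can be sketched in Python as follows. This is a minimal simulation under stated assumptions: `g_v` is a caller-supplied function of time (hypothetical here, since its dynamics are defined elsewhere in the paper), and $g_v(s_i)N$ is rounded to an integer offset. The function name `sample_failure_events` is illustrative only.

```python
import random


def sample_failure_events(lam, n_nodes, g_v, horizon, seed=0):
    """Sample failure events from a Poisson process of rate lam * n_nodes.

    Returns a list of (time, position) pairs. The position is None when
    the event falls at or beyond node position N, in which case no node
    fails and the system is unaffected.
    """
    rng = random.Random(seed)
    events = []
    t = 0.0
    while True:
        # Inter-arrival times s_i - s_{i-1} are i.i.d. exponential, rate lam * N.
        t += rng.expovariate(lam * n_nodes)
        if t > horizon:
            break
        y = rng.randrange(n_nodes)               # Y_i uniform on [0 : N-1]
        position = round(g_v(t) * n_nodes) + y   # assumed integer offset g_v(s_i) N
        events.append((t, position if position < n_nodes else None))
    return events
```

For example, `sample_failure_events(0.01, 100, lambda t: 0.1, 50.0)` draws events over a horizon of 50 time units with a constant (assumed) $g_v = 0.1$. Events whose position would reach $N$ or beyond are kept in the list with position `None`, which mirrors the discarded events in the description above.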
1) Bounds on g v ): Let M k denote the number of surviving launched nodes in the system at time k ∆ t + , i.e., immediately after the k th node launch. Let us introduce the notation q = (1 − e − λ ∆ t ) and ¯ q = 1 − q. A node surviving at time k ∆ t will fail by time ( k + 1)∆ t with probability q. We note that q − = (1 − e − δ N ) − = δ N + ǫ N where ǫ N ∈ (0 , for δ N ≥ . Lemma A.1 (Proof of (3) ): For k ≥ we have P ( M k > N − ≤ e − δ N (6) P ( M k < (1 − δ ) N ) ≤ e − δ N (7) Proof:
For k ≥ the number of surviving launched nodes at time k ∆ t can be written as a sum ofindependent Bernoulli random variables. M k = δ N X j =1 ˜ b j + k − X j =0 b j ˜ b j indicates the survival of the j th initial node and b j indicates the survival of the node launchedat time ( k − j )∆ t . It follows that ˜ p j := E (˜ b j ) = ¯ q k and p j := E ( b j ) = ¯ q j and we obtain E ( M k ) = δ N ¯ q k + k − X j =0 ¯ q j = δ N ¯ q k + 1 − ¯ q k q = δ N + ǫ N,k where ǫ N,k = ǫ N (1 − ¯ q k ) ∈ (0 , . Define E = P δ Nj =1 min(˜ p j , − ˜ p j ) + P k − j =0 min( p j , − p j ) and set W = ⌈ − ln 2ln ¯ q ⌉ = ⌈ (ln 2) δ N ⌉ . Weclaim E ≤ W. Assume first that k ≥ W. Then E ≤ ( δ N )¯ q k + W − X j =0 (1 − ¯ q j ) + k − X j = W ¯ q j = ( δ N )¯ q k + W − − ¯ q W q + ¯ q W − ¯ q k q = ( δ N − q )(¯ q k ) + W − − q W q ≤ W where for the last step we use < ¯ q W ≤ ¯ q − ln 2ln ¯ q = . The argument for k ≤ W is similar and for thatcase we actually obtain the slightly stronger bound E ≤ k ≤ W. Applying Lemma (B.2) (Chernoff bound) we now obtain P ( M k > N − ≤ e − η W with η = N − − ( δ N + ǫ N,k ) = δ N − − ǫ N,k . Noting that W ≤ δ N ln(2) + 1 and using the conditionsof Proposition 2.6 we have ( η δ N ) = (1 − ǫ N,k δ N ) ≥ . ≥ δ (ln 2) + 1 N ≥ WN and we see that (6) holds.Similarly, Lemma (B.2) gives P ( M k < (1 − δ ) N ) ≤ e − η W with η = δ N + ǫ N,k so (7) follows from W ≤ ln(2) δ N + 1 ≤ N.
2) Bounds on Settled Nodes:
We now consider the number of nodes lost in a fixed time interval of length K ∆ t from among the launched nodes present at node launch time ( k − K )∆ t . The expected value takes the form M (1 − q ) where M is an initial number of nodes and q is the probability of one such node failing in the given time interval.

Lemma A.2:
Assume β δ − (1 − ¯ q K ) > (with ¯ q = e − λ ∆ t ). Let F be the number of settled nodes present at time ( k − K )∆ t that fail by time k ∆ t . Then ln P ( F ≥ β δ N ) ≤ − β δ N − N (1 − ¯ q K )) N (1 − ¯ q K ) (8)

Proof:
Assume first that $k - K \ge 0$ and that there are M launched nodes at time ( k − K )∆ t . The number of these nodes that fail by time k ∆ t is a random variable F given by F = P Mi =1 (1 − b i ) where the b i are i.i.d. Bernoulli with E ( b i ) = ¯ q K and so under these conditions E ( F ) = M (1 − ¯ q K ) . Applying Lemma B.2 we obtain ln P ( F ≥ β δ N ) ≤ − β δ N − M (1 − ¯ q K )) M (1 − ¯ q K ) . (9)
Under the stated assumptions the quantity on the right is decreasing in M for M ≤ N, and increasing in ¯ q , so the desired result holds for this case.

Assume now that $k - K < 0$. We have F = P Mi =1 (1 − b i ) where the b i are i.i.d. Bernoulli with E ( b i ) = ¯ q k , and so E ( F ) = M (1 − ¯ q k ) , where M is the number of initial nodes surviving from time ( k − K )∆ t , so M = ⌈ S K − k ⌉ < N. It follows that the above bound holds in this case as well.
3) Bounds on Transients:
We now consider the node survival process over relatively small launch windows. In particular we consider how many nodes launched from ( k − K )∆ t until k ∆ t survive at time k ∆ t + . We are generally interested in the case K = K v and in that case the number of survivors is equal to h v ( k ∆ t − ) N + 1 , which in turn gives the number of initial settled fragments placed on the transitional node.

Lemma A.3:
Assume K satisfies ¯ q K > and Kq < . Then for all k ≥ we have P ( S kk − K ≤ E ( S kk − K ) − η ) ≤ e − η ( K +12 ) q (10) Proof: If k − K ≥ then S kk − K is independent of the initial condition and is given by P Kj =0 b j where b j indicates the survival of the node launched at time ( k − j )∆ t . We have p i := P ( b j = 1) = ¯ q j so E ( P Kj =0 b j ) = P Kj =0 p j = − ¯ q K +1 q . With the assumptions on K we have − (1 − q ) K +1 q = ( K + 1) q − (cid:0) K +12 (cid:1) q + ...q ≥ ( K + 1) − (cid:18) K + 12 (cid:19) q . and it now follows that P Kj =0 (1 − p j ) ≤ (cid:0) K +12 (cid:1) q . In the case k − K < then S kk − K is given by P ⌈ S K − k ⌉ j =1 ˜ b j + P kj =0 b j where ˜ p i = P (˜ b j = 1) = ¯ q k . Now ⌈ S K − k ⌉ X j =1 ˜ p j + k X j =0 p j ≥ S K − k ¯ q k + k X j =1 ¯ q k − j = 1 − ¯ q K +1 q ≥ K + 1 − (cid:18) K + 12 (cid:19) q and, since ⌈ S K − k ⌉ + k < K + 1 we obtain P ⌈ S K − k ⌉ j =1 (1 − ˜ p j ) + P kj =0 (1 − p j ) ≤ (cid:0) K +12 (cid:1) q . Applying the Chernoff bounds, Lemma B.3 we now obtain the desired result.
4) Application of Bounds:
We note that for x ∈ [0 , we have x ≤ − ln(1 − x ) ≤ x (1 + x − x ) .

Lemma A.4: Assume δ β δ N ≥ , that δ βN > , that δδ N ≥ . , that β ≤ , and that γ ( Z ) ≤ δ β δ . Then $\binom{K(Z)+1}{2}$ qN ≤ γ ( Z )

Proof:
To simplify notation we will suppress dependence on Z. First, we note the bound (cid:18) K + 12 (cid:19) q ≤ ( K + 1 . δ N Since ξ N γ ≤ ξ N δ β δ ≤ β δ ≤ we have KN = − δ ln(1 − ξ N γ ) ≤ . δ ξ N γ which we can combine with ≤ N δ ξ N γ to obtain (Since γ ≥ β/ and βN ≥ we have N δ ξ N γ ≥ ) K + 1 . ≤ √ δ ξ N γN . Combining the above we now have (cid:18) K + 12 (cid:19) q ≤ δ ( ξ N γ ) N ≤ γ N where we used δ ξ N ≤ − δ + N ≤ . Lemma A.5 (Proof of (5) ): P ( S kk − K m ≥ β δ M ) ≤ e − δ N Proof:
Now we consider S kk − K m . Note that E |S ( n , I L ( n )+ K ∆ t ) | = γ m N. Hence β δ N − E |S ( n , ( I L ( n )+ K m )∆ t ) | = ( β δ − γ m ) N = δ β δ N. Thus, applying Lemma A.3 we see that the lemma will follow from ( β δ − γ m ) N $\binom{K_m+1}{2}$ q = δ β δ N $\binom{K_m+1}{2}$ q ≥ δ β δ N γ m ≥ δ N which follows from Lemma A.4 since γ m ≤ β δ .

Lemma A.6 (Proof of (4) ): P ( κ ( |S ( n , I L ( n ) + K v ) | + 1) < Z m N ) ≤ e − δ N

Proof:
Now, by definition of γ v we have E |S ( n , ( I L ( n ) + K v ∆ t ) | = γ v δ β δ and E ( κ ( |S ( n , ( I L ( n ) + K v )∆ t | + 1)) = Z a N. Lemma B.3 yields P (cid:0) |S ( n , ( I L ( n ) + K v )∆ t ) | − E |S ( n , ( I L ( n ) + K v )∆ t ) | ≤ − κ − ( Z a − Z m ) (cid:1) ≤ e −
38 ( Za − Zm )2 N κ ( Kv +12 ) q Thus, the desired result will follow from ( Z a − Z m ) N κ (cid:0) K v +12 (cid:1) q ≥ δ N
We have κ ( Z a − Z m ) = − δ ln ( − ( γ m − γ a ) ξ N − ξ N γ a ) ≥ δ ( γ m − γ a ) ξ N − γ a ξ N = δδ β δ − ξ N δ β δ ≥ δδ β δ > δ γ v and Lemma A.4 now gives the desired result.

Although not needed in the proof, a similar argument shows that the probability of z s reaching z v is exponentially small in N.

APPENDIX B
CHERNOFF BOUNDS
In this section we prove some standard inequalities in a form convenient for the proofs.
Lemma B.1:
Let $b$ be a Bernoulli random variable with $P(b = 1) = p$. Assuming $s \in (0, 3/4)$ we have
$$E\, e^{s(b-p)} \le e^{\frac{2}{3}(1-p)s^2}, \qquad E\, e^{-s(b-p)} \le e^{\frac{2}{3}(1-p)s^2}.$$
Proof:
For any real $s$ we have $E\, e^{s(1-b)} = 1 + (1-p)(e^{s} - 1)$. For $s \in (-3/4, 3/4)$ we have $1 + (1-p)(e^{s} - 1) \le e^{(1-p)(e^{s} - 1)}$, which then yields $E\, e^{-s(b-p)} \le e^{(1-p)(e^{s} - 1 - s)}$. The two inequalities now follow by bounding $e^{s} - 1 - s$ from above.

Combining the two inequalities and applying them to $1 - b$ we have the following corollary.

Corollary B.2: If $s \in [-3/4, 3/4]$ then $E\, e^{s(b-p)} \le e^{\frac{2}{3}\min\{p,\, 1-p\}\, s^2}$.

Lemma B.3:
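Let $b_k$, $k = 0, 1, \dots, K$ be independent Bernoulli random variables with $E(b_k) = p_k$, and let $B = \sum_{k=0}^{K} b_k$. Then
$$P(B - E B \ge \eta) \le e^{-\frac{3\eta^2}{8M}} \tag{11}$$
$$P(B - E B \le -\eta) \le e^{-\frac{3\eta^2}{8M}} \tag{12}$$
for any $M \ge \sum_{k=0}^{K} \min\{p_k, 1-p_k\}$ and $\eta \le M$.

Proof: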
Using Corollary B.2 we have, for any $s \in [-3/4, 3/4]$, $E\, e^{s(B - E B)} \le e^{\frac{2}{3} M s^2}$. Since $\eta \le M$ we now have from the Markov inequality
$$P(B - E B \ge \eta) \le e^{-s\eta}\, e^{\frac{2}{3} M s^2} \le e^{-\frac{3\eta^2}{8M}} \tag{13}$$
where the last step follows by choosing $s = \frac{3\eta}{4M}$. Similarly, we obtain
$$P(B - E B \le -\eta) \le e^{s\eta}\, e^{\frac{2}{3} M s^2} \le e^{-\frac{3\eta^2}{8M}} \tag{14}$$
where the last step follows by choosing $s = -\frac{3\eta}{4M}$.

APPENDIX C

Let $Q_1, Q_2, \dots$ be i.i.d. exponential random variables with rate $\gamma < 1$. Let $f(x)$ denote the probability that $x + \sum_{i=1}^{m}(Q_i - 1) \le 0$ for some $m = 0, 1, 2, \dots$ Let us extend the definition of $f(x)$ by setting $f(x) = 1$ for $x \in [-1, 0]$. It follows that for $x > 0$ the function $f(x)$ is the unique fixed point of the map $g \to I(g)$ defined for bounded non-negative non-increasing functions on $[-1, \infty)$ by
$$I(g)(x) = \begin{cases} g(x) & x \in [-1, 0] \\ \int_{x-1}^{\infty} \gamma e^{-\gamma(u - (x-1))} g(u)\, du & x > 0 \end{cases}$$
It is an easy exercise to show that iterating $I$ on $g$ converges to a unique solution $\delta$ that depends only on $g(x)$, $x \in [-1, 0]$. Furthermore, the solution is monotonic in $g(x)$, $x \in [-1, 0]$, i.e., given $g_1(x) \le g_2(x)$, $x \in [-1, 0]$, it follows that $\delta_1(x) \le \delta_2(x)$. We have $f(x) = \delta(x)$ for $g(x) = 1$, $x \in [-1, 0]$.

Let $\nu$ be the unique positive solution to $e^{\nu} = 1 + \nu/\gamma$. We claim that if $g(x) = e^{-\nu x}$, $x \in [-1, \infty)$, then $\delta = g$. In other words, $e^{-\nu x}$ is a fixed point of $I$. Indeed, for $x \ge 0$,
$$\int_{x-1}^{\infty} \gamma e^{-\gamma(u - (x-1))} e^{-\nu u}\, du = \int_{0}^{\infty} \gamma e^{-\gamma u} e^{-\nu(u + (x-1))}\, du = \frac{\gamma}{\gamma + \nu}\, e^{\nu} e^{-\nu x} = e^{-\nu x}.$$
By the monotonicity of $\delta$ as a function of $g(x)$, $x \in [-1, 0]$, we now have the following result:
$$e^{-\nu} e^{-\nu x} < f(x) < e^{-\nu x}.$$

REFERENCES
[1] Luby, M. Repair rate lower bounds for distributed storage. Accepted to IEEE Transactions on Information Theory (Jan. 2021).
[2] Luby, M., Padovani, R., Richardson, T. J., Minder, L., and Aggarwal, P. Liquid cloud storage. ACM Trans. Storage 15, 1 (Feb. 2019).
[3] Saaty, T. Elements of Queueing Theory: With Applications. McGraw-Hill, 1961.
[4] Stecke, K. E., and Aronson, J. E. Review of operator/machine interference models. International Journal of Production Research 23, 1 (1985), 129–151.