On Scheduling and Redundancy for P2P Backup
Laszlo Toka∗†, Matteo Dell'Amico∗, Pietro Michiardi∗
{laszlo.toka, matteo.dell-amico, pietro.michiardi}@eurecom.fr
∗ Eurecom, Sophia-Antipolis, France
† Budapest University of Technology and Economics, Hungary
Abstract—An online backup system should be quick and reliable in both saving and restoring users' data. To do so in a peer-to-peer implementation, data transfer scheduling and the amount of redundancy must be chosen wisely. We formalize the problem of exchanging multiple pieces of data with intermittently available peers, and we show that random scheduling completes transfers nearly optimally in terms of duration as long as the system is sufficiently large. Moreover, we propose an adaptive redundancy scheme that improves performance and decreases resource usage while keeping the risks of data loss low. Extensive simulations show that our techniques are effective in a realistic trace-driven scenario with heterogeneous bandwidth.
I. INTRODUCTION
The advent of cloud computing as a new paradigm that lets service providers deploy cost-effective solutions has favored the development of a range of new services, including online storage applications. Due to the economy of scale of cloud-based storage services, the costs incurred by end-users to hand over their data to a remote storage location in the Internet have approached the cost of ownership of commodity storage devices.

As such, online storage applications spare users most of the time-consuming nuisance of data backup: user interaction is minimal, and in case of data loss due to an accident, restoring the original data is a seamless operation. However, the long-term storage costs that are typical of a backup application may easily exceed those of traditional approaches to data backup. Additionally, while data availability is a key feature that large-scale data-center deployments guarantee, its durability is questionable, as reported recently [1].

For these reasons, peer-to-peer (P2P) storage systems are an alternative to cloud-based solutions. Storage costs are merely those of a commodity storage device, which is shared (together with some bandwidth resources) with a number of remote Internet users to form a distributed storage system. Such applications optimize the latency of individual file accesses: indeed, users hand over their data to the P2P system, which is used as a replacement of a local hard drive. In such a scenario, low access latency is difficult to achieve: the online behavior of users is unpredictable and, at large scale, crashes and failures are the norm rather than the exception. As a consequence, storage space is sacrificed for low access latency: a P2P application stores large amounts of redundant data to cope with such unfavorable events.

In this work we study a particular case of online storage: P2P backup applications. Data backup involves the bulk transfer of potentially large quantities of data, both during regular data backups and in case of data loss. As a consequence, low access latency is not an issue, while short backup and restore times seem a more reasonable goal.

Given these considerations, here we seek to optimize backup and restore times, while guaranteeing that data loss remains an unlikely event. There are two main design choices that affect these metrics: scheduling, i.e. deciding how to allocate data transfers between peers, and redundancy, i.e. the amount of data in the P2P system that guarantees a backup operation to be considered complete and safe. The endeavor of this work is to study and evaluate these two intertwined aspects.

First, we describe in detail our application scenario (Sec. II), and show why the assumptions underlying a backup application can simplify many problems addressed in the literature. We then set off to define the problem of scheduling in a full knowledge setting, and we show that it can be solved in polynomial time by reducing it to a maximal flow problem. Full knowledge of future peer uptime is obviously an unrealistic assumption: thus, we show that a randomized approach to scheduling yields near optimal results when the system scale is large, and we corroborate our findings using real availability traces from an instant messaging application (Sec. III).

We then move to study a novel redundancy policy that, rather than focusing on short-term data availability, targets short data restore times. As such, our method alleviates the storage burden of large amounts of redundant data on client machines (Sec. IV).
With a trace-driven simulation of a complete P2P backup system, we show that our technique is viable in practical scenarios and illustrate its benefits in terms of increased performance (Sec. V).

We conclude by studying a range of data maintenance policies when restore operations may undergo some natural delays. For example, detecting a faulty external hard drive may not be immediate, or obtaining new equipment upon a crash may require some time. We show that an "assisted" approach to data repair (which involves a cloud-based storage service) can significantly reduce the probability of data loss, at an affordable cost (Sec. VI).
II. APPLICATION SCENARIO
In this work, similarly to many online backup applications (e.g., Dropbox), we assume users to specify one local folder containing important data to backup. We also assume that backup data remains available locally to peers. This is an important trait that distinguishes backup from storage applications, in which data is only stored remotely.

Backup data consists of an opaque object, possibly representing an encrypted archive of changes to a set of files, that we term backup object. In the spirit of incremental backups, we consider that each backup object should be kept in the system indefinitely. Consolidation and deletion of obsolete backups are not taken into account in this work.

A backup object of size o is split into k original fragments of a fixed size f, with k = o/f. Since backup data is stored on unreliable machines characterized by an unpredictable online behavior, the original k fragments are encoded using erasure coding (e.g., Reed-Solomon). This creates n encoded fragments having size f, of which any k are sufficient to recover the original data. The redundancy rate is defined as r = n/k. Here we assume that encoded fragments reside on distinct remote peers, which ensures that a single disk failure cannot cause the loss of multiple fragments.
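To make these parameters concrete, the following sketch (ours, not part of the original system) derives k, n and the redundancy rate from an object size, a fragment size and a chosen rate; the Reed-Solomon encoding itself would be provided by an erasure-coding library.

import math

def coding_parameters(object_size, fragment_size, redundancy_rate):
    """Derive erasure-coding parameters for one backup object.

    object_size and fragment_size are in bytes; redundancy_rate is r = n/k.
    Any k of the n encoded fragments are enough to rebuild the object.
    """
    k = math.ceil(object_size / fragment_size)   # original fragments, k = o/f
    n = math.ceil(redundancy_rate * k)           # encoded fragments, one per remote peer
    return k, n

# Example with the sizes used later in the paper (o = 10 GB, f = 160 MB);
# the redundancy rate 2.0 is purely illustrative.
k, n = coding_parameters(10 * 2**30, 160 * 2**20, redundancy_rate=2.0)
print(k, n)  # 64 128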
Backup Phase:
The backup phase involves a data owner and a set of remote peers that eventually store encoded fragments for the data owner. We assume that any peer in the system can collect a list of remote peers with available storage space: this can be achieved by using known techniques, e.g. a centralized "tracker" or a decentralized data structure such as a distributed hash table.

Data backup requires a scheduling policy that drives the choice of where and when to upload encoded fragments to remote peers. Moreover, a redundancy policy determines when the data is safe, which completes the backup operation.
Maintenance Phase:
Once the backup phase is completed and encoded fragments reside on remote peers, the maintenance phase begins. Peer crashes and departures can cause the loss of some encoded fragments; during the maintenance phase, peers detect such losses and generate new encoded fragments to restore a redundancy level at which the backup is safe again.

For a generic P2P storage system, in which encoded fragments only reside in the network and peers do not keep a local copy of their data, the maintenance phase is critical. Indeed, peers need to first download the whole backup object from remote machines, then generate new encoded fragments and upload them to available peers. This problem has fostered the design of efficient coding schemes to mitigate the excessive network traffic caused by the maintenance operation (see e.g. [2], [3]).

In a backup application, the maintenance phase is less critical: the data owner can generate new encoded fragments using the local copy of the data, with no download required.
Restore Phase:
In the unfortunate case of a crash, the data owner initiates the restore phase. A peer contacts the remote machines holding encoded fragments, downloads at least k of them, and reconstructs the original backup data. Again, a scheduling policy drives the process.

Since the ability to successfully restore data upon a crash is the ultimate goal of any backup system, in our application the restore traffic receives higher priority than the backup and maintenance traffic.

A. Performance Metrics
We characterize the system performance in terms of the amount of time required to complete the backup and the restore phases, labelled time to backup (TTB) and time to restore (TTR). In the following Sections, we use baselines for backup and restore operations which lower-bound both TTB and TTR. Let us assume an ideal storage system with unlimited capacity and uninterrupted online time that backs up user data. In this case, TTB and TTR only depend on the backup object size and on the bandwidth and availability of the data owner. We label these ideal values minTTB and minTTR, and we define them formally in Sec. III.

Additionally, we consider the data loss probability, which accounts for the probability that a data owner is unable to restore backup data.

A P2P backup application may exact a high toll in terms of peer resources, including storage and bandwidth. In this work we gloss over metrics of the burden on individual peers and the network, considering a scenario in which the resources of peers are lost if left unused.
B. Availability Traces
The online behavior of users, i.e., their patterns of connection and disconnection over time, is difficult to capture analytically. In this work we perform our evaluations on a real application trace that exhibits both heterogeneity and correlated user behavior. Our traces capture user availability, in terms of login/logoff events, from an instant messaging (IM) server in Italy for a duration of 3 months. We argue that the behavior of regular IM users constitutes a representative case study. Indeed, for both an IM and an online backup application, users are generally signed in for as long as their machine is connected to the Internet.

In this work we only consider users that are online for an average of at least four hours per day, as done in the Wuala online storage application. Once this filter is applied, we obtain the trace of 376 users. User availabilities are strongly correlated, in the sense that many users connect or disconnect around the same time. As shown in Fig. 1(a), there are strong differences between the number of users connected during day and night and between workdays and weekends. Most users are online for less than 40% of the trace, while some of them are almost always connected (Fig. 1(b)).

Fig. 1. Availability trace: (a) online peers during a week; (b) empirical CDF of time spent online. Uptime is heterogeneous and strongly correlated.

III. THE SCHEDULING PROBLEM
Scheduling data transfers between peers is an important operation that affects the time required to complete a backup or a restore task, especially in a system involving unreliable machines with unpredictable online patterns. Because of churn, a node might not be able to find online nodes to exchange data with: hence, TTB and TTR can grow due to idle periods of time. Unexpected node disconnections require a method to handle partial fragments, which can be discarded or resumed. Moreover, the redundancy rate used to cope with failures and unavailability may decrease system performance. Finally, the available bandwidth between peers involved in a data transfer, which may be shared due to parallel transmissions, is another cause for slow backup and restore operations.

In this Section, we focus on the implications of churn alone. We simplify the scheduling problem by assuming the redundancy factor to be a given input parameter, and neglecting the possibility of congestion due to several different backup, restore or maintenance processes interfering. Furthermore, we do not consider interrupted fragment transfers. In Section IV, we define an adaptive scheme to compute the redundancy rate applied to a backup operation, and in Section V we relax all other assumptions.

We now define a reference scenario to bound TTB and TTR. Consider an ideal storage system (e.g. a cloud service) with unbounded bandwidth and 100% availability. A peer i with upload and download bandwidth u_i and d_i starting the backup of an object of size o at time t completes its backup at time t', after having spent o/u_i time online. Analogously, i restores a backup object with the same size at t'' after having spent o/d_i time online. We define minTTB(i, t) = t' − t and minTTR(i, t) = t'' − t. We use these reference values throughout the paper to compare the relative performance of our P2P application versus that of such an ideal system.

Because we neglect congestion issues, we can focus on a backup/restore operation as seen from a single peer in the system. Let us consider a generic peer p_0 and I remote peers p_1, ..., p_I used to store p_0's data. We assume time to be fractioned in time-slots of fixed length. Let a_{i,t} be an indicator variable such that a_{i,t} = 1 if and only if p_i is online at time t. Each peer i has integer upload and download capacities of respectively u_i and d_i fragments per time-slot.

We now proceed with a series of definitions used to formalize the scheduling problem.

Definition 1: A backup schedule is a set of (i, t) tuples representing the decision of uploading a fragment from p_0 to peer p_i, where i ∈ {1, ..., I}, at time-slot t. A valid backup schedule S satisfies the following properties:
1) ∀ t: |{i : (i, t) ∈ S}| ≤ u_0: no more than u_0 fragments per time-slot can be uploaded.
2) ∀ (i, t) ∈ S: a_{i,t} = a_{0,t} = 1: fragments are transferred only between online peers.
3) ∀ (i, t), (j, u) ∈ S with (i, t) ≠ (j, u): i ≠ j: no two fragments are stored on the same peer.

Definition 2: A restore schedule is a set of (i, t) tuples representing the decision of downloading a fragment from a set of remote peers p_i ∈ P at time t, where P is the set of storage peers that received a fragment during the backup phase.
A valid restore schedule S satisfies the following properties:
1) ∀ t: |{i : (i, t) ∈ S}| ≤ d_0: no more than d_0 fragments per time-slot can be recovered.
2) ∀ (i, t) ∈ S: a_{i,t} = a_{0,t} = 1.
3) ∀ (i, t), (j, u) ∈ S with (i, t) ≠ (j, u): i ≠ j: no two fragments are downloaded from the same peer.
4) ∀ (i, t) ∈ S: p_i ∈ P: fragments can only be retrieved from storage peers.
Definition 3: The completion time C of a schedule S is the last time-slot in which a transfer is performed, that is: C(S) = max{t : (i, t) ∈ S}.

In the following, we first consider a full information setting, and show how to compute an optimal schedule which minimizes completion time provided that the online behavior of peers is known a priori. Then, we compare optimal scheduling to a randomized policy that needs no knowledge of future peer uptime; via a numerical analysis, we show the conditions under which a randomized, uninformed approach achieves performance comparable to that of an optimal schedule.
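As a concrete reading of Definitions 1-3 above, here is a small sketch (our illustration, not the authors' code) that checks the validity of a backup schedule against an availability matrix and computes its completion time.

def is_valid_backup_schedule(schedule, online, u0):
    """schedule: set of (i, t) pairs; online[i][t] is True iff p_i is online in
    time-slot t, with index 0 denoting the data owner; u0 is p_0's upload
    capacity in fragments per time-slot."""
    peers = [i for (i, _) in schedule]
    if len(peers) != len(set(peers)):              # property 3: one fragment per peer
        return False
    if any(not (online[i][t] and online[0][t])     # property 2: both endpoints online
           for (i, t) in schedule):
        return False
    slots = [t for (_, t) in schedule]
    return all(slots.count(t) <= u0 for t in set(slots))  # property 1: upload capacity

def completion_time(schedule):
    """Definition 3: the last time-slot in which a transfer is performed."""
    return max(t for (_, t) in schedule)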
A. Full Information Setting
We cast the problem of finding the optimal schedule for both backup and restore operations as finding the minimum completion time to transfer a given number x of fragments. For backup, x will correspond to the number n of redundant encoded fragments; for restores, x will be equal to the number k of original fragments. We show that this problem can be reduced to finding the maximum number of fragments that can be transferred within a given time T. We then use a max-flow formulation and show that existing algorithms can solve the original problem in polynomial time.

Definition 4: An optimal schedule to backup/restore x fragments is one that achieves the minimum completion time to transfer at least x fragments. Let 𝒮 be the set of all valid schedules; the minimum completion time is:

O(x) = min{ C(S) : S ∈ 𝒮 ∧ |S| ≥ x }.    (1)

The following proposition shows that the optimal completion time can be obtained by computing the maximum number of fragments that can be transferred in T time-slots.
Proposition 1: Let 𝒮 be the set of all valid schedules and F(t) be the function denoting the maximum number of fragments that can be transferred within time-slot t, that is:

F(t) = max{ |S| : S ∈ 𝒮 ∧ C(S) ≤ t }.    (2)

The optimal completion time is: O(x) = min{t : F(t) ≥ x}.
Proof: Let t_1 = O(x) and t_2 = min{t : F(t) ≥ x}.
• t_1 ≥ t_2. By Eq. 1, an S ∈ 𝒮 exists such that C(S) = t_1 and |S| ≥ x, implying that F(t_1) ≥ x. Therefore, t_1 ≥ min{t : F(t) ≥ x} = t_2.
• t_1 ≤ t_2. By Eq. 2, an S ∈ 𝒮 exists such that C(S) ≤ t_2 and |S| ≥ x. This directly implies that t_1 = O(x) ≤ t_2.

We can now iteratively compute F(t) with growing values of t; the above Proposition guarantees that the first value T that satisfies F(T) ≥ x will be the desired result.

We now focus on a single instance of the problem of finding the maximum number of fragments F(T) that can be transferred within time-slot T, and show that it can be encoded as a max-flow problem on a flow network built as follows. First, we create a bipartite directed graph G' = (V', E') where V' = T ∪ P; the elements of T = {t_i : i = 1, ..., T} represent time-slots, the elements of P = {p_i : i = 1, ..., I} represent remote peers (only storage nodes for restores). An edge connects a time-slot to a peer if that peer is online during that particular time-slot: E' = {(t_i, p_j) : t_i ∈ T ∧ p_j ∈ P ∧ a_{j,i} = 1}. Source s and sink t nodes complete the bipartite graph G' and create a flow network G = (V, E). The source is connected to all the time-slots during which the data owner p_0 is online; all peers are connected to the sink.

The capacities on the edges are defined as follows: edges from the source to time-slots have capacity u_0 or d_0 (respectively, for backup and restore operations); edges between time-slots and peers have capacity d_i or u_i (respectively, for backup and restore operations); finally, edges between peers and the sink have capacity m. Note that in this work we assume individual fragments to be uploaded to distinct peers, hence m = 1. To simplify presentation, we assume integer capacities u_k = d_k = 1 for all k ∈ {0, ..., I}.

Fig. 2 illustrates an example of the whole procedure described above, for the case of a backup operation. Fig. 2(a) shows the online behavior, over time-slots t_1, ..., t_8, of the data owner and the remote peers (p_1, p_2, p_3) that can be selected as remote locations to backup data. The optimal schedule problem amounts to deciding which remote peer should be awarded a time-slot to transfer backup fragments, so that the operation can be completed within the shortest time. This problem is encoded in the graph of Fig. 2(b). Time-slots and remote peers are represented by the nodes of the inner bipartite graph. An edge of capacity 1 connects a time-slot to the set of online peers in that time-slot, as derived from Fig. 2(a). The source node has an edge of capacity u_0 = 1 to every time-slot in which the data owner is online (shaded time-slots in the figure indicate that p_0 is offline): this guarantees that only 1 fragment per time-slot can be transferred. The sink node has an incident edge with capacity m = 1 from every remote peer.

Each s-t flow represents a valid schedule. For backup operations, the schedule is valid because the three properties of Def. 1 are verified by construction of the flow network. Similarly, for restore operations the properties of Def. 2 are satisfied by construction, since only remote peers in the set P are part of the flow network.

Fig. 2. An example of a backup operation: (a) online behavior of the data owner and peers p_1, p_2, p_3 over time-slots t_1, ..., t_8; (b) equivalent flow network. The original problem of finding an optimal schedule, given the online behavior of peers, is transformed into a max-flow problem on an equivalent graph.
In the particular case of the example, the smallest value of t ensuring F(t) ≥ 3 is 3, corresponding to a flow graph that contains only the t_1, t_2, t_3 time-slot nodes. The resulting optimal schedule corresponds to the thick edges in Fig. 2(b).

For a flow network with V nodes and E edges, the max-flow can be computed with time complexity O(VE log(V²/E)) [4]. In our case, when we have p nodes and an optimal solution of t time-slots, V is O(p + t) and E is O(pt). The complexity of an instance of the algorithm is thus O(pt(p log(p/t) + t log(t/p))).

The original problem, i.e., finding an optimal schedule that minimizes the time to transfer x fragments, can be solved by performing O(log t) max-flow computations. In fact, an upper bound for the optimal completion time can be found in O(log t) instances of the max-flow algorithm by doubling the value of T at each step; then the optimal value can be obtained, again in O(log t) time, by using binary search. The computational complexity of determining an optimal schedule in a full information framework is thus O(pt log t (p log(p/t) + t log(t/p))).
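The reduction above is easy to prototype. The sketch below (an illustration under the unit-capacity assumption u_0 = d_i = m = 1, using the networkx library rather than the algorithm of [4]) computes F(T) for a backup operation and then finds O(x) by doubling T and binary-searching, as described in the text.

import networkx as nx

def max_fragments(online, T):
    """F(T): maximum number of fragments transferable within the first T time-slots.
    online[i][t] is True iff p_i is online in slot t; index 0 is the data owner."""
    g = nx.DiGraph()
    peers = range(1, len(online))
    for t in range(T):
        if not online[0][t]:
            continue                                  # data owner offline: slot unusable
        g.add_edge("src", ("slot", t), capacity=1)    # u_0 = 1 fragment per time-slot
        for i in peers:
            if online[i][t]:
                g.add_edge(("slot", t), ("peer", i), capacity=1)  # d_i = 1
    for i in peers:
        g.add_edge(("peer", i), "sink", capacity=1)   # m = 1 fragment per remote peer
    if "src" not in g or "sink" not in g:
        return 0
    value, _ = nx.maximum_flow(g, "src", "sink")
    return value

def optimal_completion_time(online, x):
    """O(x): the smallest T with F(T) >= x, via doubling followed by binary search."""
    horizon = len(online[0])
    hi = 1
    while max_fragments(online, min(hi, horizon)) < x:
        if hi >= horizon:
            return None                               # x fragments cannot be placed at all
        hi *= 2
    lo, hi = 1, min(hi, horizon)
    while lo < hi:
        mid = (lo + hi) // 2
        if max_fragments(online, mid) >= x:
            hi = mid
        else:
            lo = mid + 1
    return lo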
B. Random Scheduling

In practice, assuming complete knowledge of peers' online behavior is not realistic. We introduce a randomized scheduling policy which only requires knowing which peers are online at the time of the scheduling decision. In Sec. III-C, we compare optimal and randomized scheduling using real traces.
For backup operations, in each time-slot, fragments are uploaded from the data owner to no more than u_0 remote peers chosen at random among those that are currently online and that did not receive a fragment in previous time-slots. This satisfies Def. 1. For restore operations, in each time-slot, d_0 remote peers in the set P are randomly chosen among those that are currently online, and data is transferred back to the data owner. This satisfies Def. 2.

We now use Fig. 2 to illustrate a possible outcome of the randomized schedule defined here and compare it to the optimal schedule computed using the max-flow formalization. We focus on the backup operation of x = 3 fragments carried out by the data owner p_0. In Fig. 2(a), the data owner may randomly select, as the recipient of the first fragment, a peer that the optimal schedule would instead use in a later time-slot. Since we assume that m = 1 fragment can be stored on a distinct peer, such a choice can leave a subsequent time-slot with no eligible online peer, which is then "wasted", and the last fragment is uploaded only in a later time-slot.

The optimal schedule is obtained by computing the max-flow on the flow network in Fig. 2(b) (thick edges in the figure); the backup operation then requires only 3 time-slots to complete.
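A minimal sketch of the randomized backup policy just described (again our illustration): in each time-slot the data owner uploads to at most u_0 randomly chosen online peers that have not yet received a fragment.

import random

def random_backup_schedule(online, x, u0=1, rng=random):
    """online[i][t] as before; x fragments to place; returns a list of (peer, slot)
    pairs, or None if the trace ends before the backup completes."""
    schedule, used = [], set()
    for t in range(len(online[0])):
        if not online[0][t]:
            continue                                   # data owner offline in this slot
        candidates = [i for i in range(1, len(online))
                      if online[i][t] and i not in used]
        for i in rng.sample(candidates, min(u0, len(candidates))):
            schedule.append((i, t))
            used.add(i)
            if len(schedule) == x:
                return schedule
    return None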
C. Numerical Analysis

Here, we take a numerical perspective and compare optimal and randomized scheduling in terms of TTB and TTR. We focus on a single data owner p_0 involved in a backup operation. The input to the scheduling problem is the availability trace described in Sec. II-B, starting the backup at a random moment; we set the duration of a time-slot to one hour. Let u_0 = 1 fragment per time-slot be the upload rate of p_0. We report results for x ∈ {40, 60, 80} backup fragments, and vary the number I of randomly chosen remote peers (the x-axis of Fig. 3). We obtained each data point by averaging 1,000 runs of the experiment; furthermore, for each of those runs, we averaged the completion times of 1,000 random schedules in the same settings.

Fig. 3 illustrates the ratio between the TTB achieved respectively by optimal and randomized scheduling, normalized to the ideal backup time minTTB. We observe that both optimal and randomized scheduling approach minTTB when the number of remote peers available to store backup fragments increases: a large system improves transmission opportunities, and TTB approaches the ideal lower bound. However, when the number of backup fragments grows, which is a consequence of higher redundancy rates, randomized scheduling requires a larger pool of remote machines to approach the performance of the optimal scheduling. We also note that the heterogeneous and correlated behavior of users in the availability trace results in "idle" time-slots in which neither optimal nor randomized scheduling can transfer data.

This very same evaluation can be used to evaluate a restore operation, even if the parameters acquire a different meaning.
Fig. 3. Numerical analysis: a comparison between optimal and randomized scheduling, using real availability traces (TTB/minTTB as a function of the number of remote peers, for x ∈ {40, 60, 80}).
In this case, the number x of fragments that need to be transferred is the number of original fragments k, and the number of remote peers I corresponds to the number of encoded fragments n. For restores, as the redundancy rate n/k = I/x grows, restore operations become more efficient.

We conclude that randomized scheduling is a good choice for a P2P backup application, provided that:
• to have efficient backups, the ratio between the number of nodes in the system and the number of fragments to store is not very close to one;
• to have efficient restores, the redundancy rate is not very close to one.

As a heuristic threshold, in our analysis we obtain that a value of I/x moderately above one is sufficient to complete backup and restore within a tolerable (around 10%) deviation from minTTB or minTTR, respectively. In the following, we will therefore use randomized scheduling and make sure that such a ratio is reached in order to ensure that scheduling does not impose a too harsh penalty on TTB and TTR.

Birk and Kol [5] analyzed random backup scheduling by modeling peer uptime as a Markovian process. Albeit quantitatively different due to the absence of diurnal and weekly patterns in their model, their study reached a conclusion that is analogous to ours: in backups, the completion time of random scheduling converges to the optimal value as the system size grows.
IV. REDUNDANCY POLICY

In the literature, the redundancy rate is generally chosen a priori to ensure what we term prompt data availability. Given a system with average availability a, a target data availability t, and assuming the availability of each individual peer to be an independent random variable with probability a, a system-wide redundancy rate is computed as follows. The total number n of redundant fragments required to meet the target t, when k original fragments constitute the data to backup, is computed as [6]:

min { n ∈ ℕ : Σ_{i=k}^{n} (n choose i) a^i (1 − a)^{n−i} ≥ t }.    (3)

We label this method fixed-redundancy, and use it in the following as a baseline approach.

Ensuring prompt data availability is not our goal, since peers only retrieve their data upon (hopefully rare) crash events. Data downloads correspond to restore operations, which require a long time to complete because of the sheer size of backup data. Hence, we approach the design of our redundancy policy by taking into account the tradeoffs that a backup application has to face. On the one hand, with low redundancy the aggregate storage capacity of the system improves, TTB decreases, and maintenance costs drop. On the other hand, two factors discourage selecting excessively low redundancy rates. First, TTR increases, as fewer peers will be online to serve fragments during data restores; second, there is a higher risk of data loss.

Our redundancy policy operates as follows. During the backup phase, peers constantly estimate their TTR and the probability of losing data, and adjust the redundancy rate according to the characteristics of the remote peers that hold their data. In practice, data owners upload encoded fragments until the estimates of TTR and data loss probability are below an arbitrary threshold. When the threshold is crossed, the backup phase terminates.

Note that TTB is generally several times longer than TTR. First, in the restore phase, peers are not likely to disconnect from the Internet. Second, most peers have asymmetric lines with fast downlink and slow uplink; third, backups require uploading redundant data while restores involve downloading an amount of data equivalent to the original backup object. Because of this unbalance, we argue that it is reasonable to use a redundancy scheme that trades longer TTR (which affects only users that suffer a crash) for shorter TTB (which affects all users).

We now delve into the details of how to approximate TTR and the data loss probability.
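Returning to Eq. (3), the fixed-redundancy baseline can be computed directly; the sketch below (ours) returns the smallest n such that at least k out of n independently available fragments are online with probability at least t.

from math import comb

def fixed_redundancy(k, a, t):
    """Smallest n with P[at least k of n fragments are available] >= t,
    assuming each fragment is online independently with probability a (Eq. 3)."""
    n = k
    while True:
        p_avail = sum(comb(n, i) * a**i * (1 - a)**(n - i) for i in range(k, n + 1))
        if p_avail >= t:
            return n
        n += 1

# Illustrative call only; the paper derives a and t from its own traces and targets.
# fixed_redundancy(k=64, a=0.5, t=0.99)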
A. Approximating TTR

Similarly to the optimal scheduling problem, predicting the TTR accurately requires full knowledge of disk failure events and peer availability patterns. We obtain an estimate of the TTR with a heuristic approach; in Sec. V we show that our approximation is reasonable.

We assume that a data owner p_0 remains online during the whole restore process. The TTR can be bounded for two reasons: i) the download bandwidth d_0 of the data owner is a bottleneck; ii) the upload rate of remote peers holding p_0's data is a bottleneck. Let us focus on the second case: we define the expected upload rate of a generic remote peer p_i holding a backup fragment of p_0 as the product a_i u_i of the average availability and the upload bandwidth of p_i. The data owner needs k fragments to recover the backup object: suppose these fragments are served by the k "fastest" remote peers. In this case, the "bottleneck" upload rate is that of the k-th peer p_j, the one with the smallest expected upload rate among them. If we consider l parallel downloads and a backup object of size o, a peer computes an estimate of the TTR as

eTTR = max( o/d_0 , o/(l · a_j · u_j) ).    (4)
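Reading Eq. (4) directly, the estimate is the larger of the owner-side bottleneck o/d_0 and the remote-side bottleneck o/(l · a_j · u_j); a sketch follows (ours, assuming the storage peers' average availabilities and uplink capacities are known).

def estimate_ttr(o, d0, storage_peers, k, l):
    """eTTR = max(o/d0, o/(l * a_j * u_j))  (Eq. 4).

    o: backup object size; d0: download bandwidth of the data owner;
    l: number of parallel downloads; storage_peers: list of (a_i, u_i) pairs,
    i.e. average availability and upload bandwidth of each storage peer.
    a_j * u_j is the expected upload rate of the k-th fastest storage peer."""
    rates = sorted((a * u for (a, u) in storage_peers), reverse=True)
    bottleneck = rates[k - 1]          # k-th largest expected upload rate
    return max(o / d0, o / (l * bottleneck))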
B. Approximating the Data Loss Probability

Upon a crash, a peer with n fragments placed on remote peers can lose its data if more than n − k of them crash as well before the data is completely restored. Considering a delay w that can pass between the crash event and the beginning of the restore phase, we compute the data loss probability within a total delay of t = w + eTTR.

We consider disk crashes to be memoryless events, with constant probability for any peer and at any time. Disk lifetimes are thus exponentially distributed stochastic variables with a parametric average t_0: a peer crashes by time t with probability 1 − e^{−t/t_0}. The probability of data loss is then

Σ_{i=n−k+1}^{n} (n choose i) (1 − e^{−t/t_0})^i (e^{−t/t_0})^{n−i}.    (5)

Data loss probability needs to be monitored with great care. In Fig. 4, we plot the probability of losing data as a function of the redundancy rate and the delay t. Here we set t_0 = 90 days and k = 64; when the time without maintenance is in the order of magnitude of weeks, even a small decrease in redundancy can increase the probability of data loss by several orders of magnitude.

Fig. 4. Data loss probability as a function of the redundancy rate.

In summary, our redundancy policy triggers the end of the backup phase, and determines the redundancy rate applied to a backup object. Since we trade longer TTR for shorter TTB, our scheme ensures that data redundancy is enough to make the data loss probability small, and keeps TTR under a certain value. Finally, we remark that our approximation techniques require knowing the uplink capacity and the average availability of remote peers. While a decentralized approach to resource monitoring is an appealing research subject, it is common practice (e.g. in Wuala) to rely on a centralized infrastructure to monitor peer resources.
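Eq. (5) is likewise a one-liner; the sketch below (ours) evaluates the probability that more than n − k of the n storage peers crash within the exposure window t = w + eTTR, given exponentially distributed disk lifetimes with mean t_0.

from math import comb, exp

def data_loss_probability(n, k, t, t0):
    """Probability that more than n - k of n peers crash within time t, with
    i.i.d. exponential disk lifetimes of mean t0 (per-peer crash prob. 1 - exp(-t/t0))."""
    p_crash = 1 - exp(-t / t0)
    return sum(comb(n, i) * p_crash**i * (1 - p_crash)**(n - i)
               for i in range(n - k + 1, n + 1))

# Example with the parameters of Fig. 4 (k = 64, t0 = 90 days) and a two-week
# exposure; n = 96 is an arbitrary illustrative redundancy level.
# data_loss_probability(n=96, k=64, t=14, t0=90)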
V. SYSTEM SIMULATION
We proceed with a trace-driven system simulation, considering all the factors identified in Sec. III: churn, correlated uptime, peer bandwidth, congestion, and fragment granularity.
A. Simulation Settings
Our simulation covers three months, using the availability traces described in Sec. II-B, with the exception that peers remain online during restores. Uplink capacities of peers are obtained by sampling a real bandwidth distribution measured at more than 300,000 unique Internet hosts for a 48-hour period from roughly 3,500 distinct ASes across 160 countries [7]. These values have a highly skewed distribution, with a median of 77 kBps and a mean of 428 kBps. To represent typical asymmetric residential Internet lines, we assign to each peer a downlink speed equal to four times its uplink.

Our adaptive redundancy policy uses the following parameters: we set the threshold for the estimated TTR so that eTTR remains within max(1 day, a small multiple of minTTR), and we keep the estimated probability of data loss below a small fixed threshold, where w = 2 weeks is the maximum delay between crash and restore events (see Sec. IV-B).

Each node has 10 GB of data to backup, and dedicates 50 GB of storage space to the application. The high ratio between these two values lets us disregard issues due to insufficient storage capacity (which is considered to be cheap) and focus on the subjects of our investigation, i.e., scheduling and redundancy. The fragment size f is set to 160 MB, resulting in k = 64 original fragments per backup object.

We define peers' lifetimes to be exponentially distributed random variables with an expected value of 90 days. After they crash, peers return online immediately and start their restore process; in Sec. VI, we also consider a delay between crash events and restore operations. (Here we neglect the economics of the application, e.g. promoting user loyalty to the system; hence, we do not consider unanticipated user departures.)

As discussed in Sec. IV, we compare against a baseline redundancy policy that assigns a fixed redundancy rate. Here we set a high target data availability t and use the system-wide average availability a computed from our availability traces; we thus obtain a value n = 228 and a redundancy rate n/k ≈ 3.56.

Our simulations involve 376 peers. This is sufficient to ensure that the performance of randomized scheduling is close to optimality (see Sec. III).

For each set of parameters, the simulation results are obtained by averaging ten simulation runs.
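For reference, the simulation parameters described above, collected in one place (a sketch; the field names are ours, the values are those stated in the text):

SIMULATION_SETTINGS = {
    "trace_length_days": 90,          # three months of IM availability traces
    "num_peers": 376,
    "backup_object_gb": 10,           # data each node backs up
    "storage_quota_gb": 50,           # space each node dedicates to the application
    "fragment_size_mb": 160,          # f, hence k = 64 original fragments per object
    "k": 64,
    "downlink_over_uplink": 4,        # asymmetric residential lines
    "mean_peer_lifetime_days": 90,    # exponentially distributed crash times
    "max_restore_delay_w_weeks": 2,   # w in the data-loss estimate (Sec. IV-B)
    "fixed_baseline_n": 228,          # fixed-redundancy baseline, rate n/k ~= 3.56
    "runs_per_setting": 10,
}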
B. Results

Fig. 5 shows the cumulative distribution functions of minTTB and minTTR: these baseline values are deeply influenced by the bandwidth distribution we used, and their gap is justified by the asymmetry of the access bandwidth and the assumption that peers stay online during the restore process.

Fig. 5. Empirical CDFs of minTTB and minTTR.

We now verify the accuracy of our approximation of TTR, expressed as the ratio of estimated versus measured TTR. The values of TTR vary mostly due to the diurnal and weekly connectivity patterns of users in our traces, but in most cases eTTR is a sensible rough estimate of TTR.

The adaptive policy pays off, with an average redundancy rate well below the flat value of 3.56 used by the baseline approach (Fig. 6(a)); the maintenance traffic decreases accordingly, and the system almost doubles its storage capacity. In addition, TTB is roughly halved with the adaptive scheme (Fig. 6(b)); a price for this is paid by crashed peers, which will have longer TTR (Fig. 6(c)). As we argued in Sec. IV, we think this loss is tolerable and well offset by the benefits of reduced redundancy.
We observe tails where a minority of the nodes have very high TTB/minTTB and TTR/minTTR ratios. These are nodes with very high bandwidth, and therefore low values of minTTB and minTTR (see Fig. 5); their backup and restore speeds are limited by the bandwidth of remote nodes, which can be orders of magnitude smaller.

These results certify that our adaptive scheme beneficially affects performance. However, lower redundancy might result in higher risks of losing data: in the following Section, we analyze this risk.
VI. DATA LOSS AND DELAYED RESTORE
Our simulation settings put the system under exceptional stress: the peer crash rate is two orders of magnitude higher than what is reported for commodity hardware [8]. In such an adverse scenario, we study the likelihood and the causes of data loss, and their relation to the redundancy scheme.

In addition, we discuss the implications of delayed response to crashes, affecting both restore and maintenance operations. We consider the following scenarios:
• Immediate response: Peers start restores as soon as they crash. Moreover, they immediately alert the relevant peers to start their maintenance.
• Delayed response: Crashed peers return online after a random delay. If this delay exceeds a timeout, peers suffering from fragment loss start their maintenance.
• Delayed assisted response: After the above timeout, a third party intervenes to rescue crashed peers whose data is at risk, by maintaining it.

In our simulations, delays are exponentially distributed random variables with an average of one week; the timeout value is one week as well.

Fig. 6. System performance (empirical CDFs, adaptive vs. fixed redundancy): (a) redundancy rate, (b) TTB, (c) TTR.
TABLE I
CATEGORIZATION OF DATA LOSS

Red. policy   Restores            Unfinished backups   Unavoidable
Adaptive      Immediate           96%                  76%
Adaptive      Delayed             79%                  65%
Adaptive      Delayed assisted    94%                  78%
Fixed         Immediate           99%                  78%
Fixed         Delayed             92%                  75%
Fixed         Delayed assisted    94%                  76%

For performance reasons, assisted maintenance can be supported by an online storage provider, which is used as a temporary buffer. Here we assume a provider with 100% uptime, unlimited bandwidth and storage space: maintenance is triggered upon expiration of the timeout, conditioned on the estimated data loss probability exceeding the threshold set in Sec. V-A.

In our experiments, due to the inflated peer crash rates, between 11.4% and 14.6% of crashed peers could not recover their data. In Table I, we focus on those peers. The majority of data loss events affected peers that crashed before they completed their backups, according to the redundancy policy (unfinished backups column). This can be due to two reasons: the backup process is inherently time-consuming, due to the availability and bandwidth of data owners; or the backup system is inefficient.

To differentiate between these two cases, we consider unavoidable data loss events (rightmost column in the table). If a peer crashes before minTTB, no online backup system could have saved its data. Data backup takes time: this simple fact alone accounts for far more than all the limitations of a P2P approach. Users should worry more about completing their backup quickly than about the reliability of their peers.

The difference in redundancy between the high rate used by the fixed baseline and the adaptive approach does not impact the data loss rate significantly, except in the case of non-assisted delayed response. Assisted maintenance is an effective way to counter this effect.

In Fig. 7 we show the costs of assisted repairs in terms of data traffic. Given that prices on storage services are highly asymmetric (to date, July 2010, inbound traffic to Amazon S3 is free: http://aws.amazon.com/s3/), we only consider the outbound traffic, from provider to peers. Data volumes are expressed as fractions of the total size of backup objects in the system. There is a striking difference between the adaptive and fixed redundancy schemes: higher redundancy results in fewer emergency situations in which the server has to step in. The amount of data stored on the server has a peak load of less than 2.5% of the total backup size: the assisted repairs are quick, therefore only a small fraction of the peers need assistance simultaneously.

Fig. 7. Assisted maintenance: cumulative server upload, normalized to the total backup size, for the adaptive and fixed redundancy policies.

VII. RELATED WORK
Redundancy rates and data repair techniques in P2P backup systems have been investigated from various angles. Several works [9], [10] determine redundancy as a function of node failure rate in order to guarantee data durability at the expense of data availability. Many other approaches (e.g., [11]-[13]) adopt formulae similar to Equation 3 to guarantee low latency through prompt data availability, but require high redundancy rates in typical settings. Our proposal strives to provide both durability and performance at a low redundancy cost, relaxing prompt data availability by requiring that data become recoverable within a given time window.

A complete system design requires considering several problems that were not addressed in this paper; fortunately, many of them have been tackled in the literature.

When a full system needs to be backed up, convergent encryption [14], [15] can be used to ensure that storage space does not get wasted by saving multiple copies of the same file across the system.

Data maintenance is cheap in our scenario, where it is performed by a data owner holding a local copy. When maintenance is delegated to nodes that do not have a local copy of the backup objects, various coding schemes can be used [2], [3] to limit the amount of required data transit. For these settings, cryptographic protocols [16], [17] have been designed to verify the authenticity of stored data.

A recurrent problem for P2P applications is creating incentives that encourage nodes to contribute more resources. This can be done via reputation systems [18] or virtual currency [19]. Specifically for storage systems, an easy and efficient solution is segregating nodes into sub-networks with roughly homogeneous characteristics such as uptime and storage space [20], [21].

Backup objects, whose confidentiality can be ensured by standard encryption techniques, should encode incremental differences between archive versions. Recently, various techniques have been proposed to optimize the computational time and size of these differences [22].

It may happen that the resources offered by peers are just not sufficient to satisfy all user needs. In this case, a hybrid peer-assisted system can be developed where data is stored both on a centralized data center and on peers. This can result in scenarios with performance comparable to centralized systems, at a fraction of the costs [23].
VIII. CONCLUSION
The P2P paradigm applied to backup applications is a compelling alternative to centralized online solutions, which become costly for long-term storage.

In this work, we revisited P2P backup and argued that such an application is viable. Because the online behavior of users is unpredictable and, at large scale, crashes and failures are the norm rather than the exception, we showed that scheduling and redundancy policies are paramount to achieve short backup and restore times.

We gave a novel formalization of optimal scheduling and showed that, with full information, a problem that may appear combinatorial in nature can actually be solved efficiently by reducing it to a maximal flow problem. Without full information, optimal scheduling is unfeasible; however, we showed that as the system size grows, the gap between randomized and optimal scheduling policies diminishes rapidly.

Furthermore, we studied an adaptive scheme that strives to keep data redundancy small, which implies shorter backup times than a state-of-the-art approach that uses a system-wide, fixed redundancy rate. This comes at the expense of increased restore times, which we argued to be a reasonable price to pay, especially in light of our study on the probability of data loss. In fact, we determined that the vast majority of data loss episodes are due to incomplete backups. Our experiments illustrated that such events are largely unavoidable, as they are determined by the limitations of data owners alone: no online storage system could have avoided such unfortunate events. We conclude that short backup times are crucial, far more than the reliability of the P2P system itself. As such, the crux of a P2P backup application is to design mechanisms that optimize this metric.

Our research agenda includes the design and implementation of a fully fledged prototype of a P2P backup application. Additionally, we will extend the parameter space of our study, to include the natural heterogeneity of user demand in terms of storage requirements. To do so, we will collect measurements from both existing online storage systems and from a controlled deployment of our prototype implementation.
REFERENCES

[1] R. Wauters. (2009) Online backup company Carbonite loses customers' data, blames and sues suppliers. TechCrunch. [Online]. Available: http://tcrn.ch/dABxRn
[2] A. G. Dimakis, P. B. Godfrey, M. J. Wainwright, and K. Ramchandran, "Network coding for distributed storage systems," in IEEE INFOCOM, 2007.
[3] A. Duminuco and E. Biersack, "Hierarchical codes: How to make erasure codes attractive for peer-to-peer storage systems," in IEEE P2P, 2008.
[4] A. V. Goldberg and R. E. Tarjan, "A new approach to the maximum flow problem," in ACM STOC, 1986.
[5] Y. Birk and T. Kol, "Coding and scheduling considerations for peer-to-peer storage backup systems," in SNAPI, IEEE, 2007.
[6] R. Bhagwan, S. Savage, and G. Voelker, "Understanding availability," in Peer-to-Peer Systems II, Springer, 2003, pp. 256-267.
[7] M. Piatek, T. Isdal, T. Anderson, A. Krishnamurthy, and A. Venkataramani, "Do incentives build robustness in BitTorrent," in USENIX NSDI, 2007.
[8] B. Schroeder and G. A. Gibson, "Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you?" in USENIX FAST, 2007.
[9] B.-G. Chun, F. Dabek, A. Haeberlen, E. Sit, H. Weatherspoon, M. F. Kaashoek, J. Kubiatowicz, and R. Morris, "Efficient replica maintenance for distributed storage systems," in USENIX NSDI, 2006.
[10] L. Pamies-Juarez and P. Garcia-Lopez, "Maintaining data reliability without availability in P2P storage systems," in ACM SAC, 2010.
[11] J. Kubiatowicz, D. Bindel, Y. Chen, S. Czerwinski, P. Eaton, D. Geels, R. Gummadi, S. Rhea, H. Weatherspoon, W. Weimer, C. Wells, and B. Zhao, "OceanStore: An architecture for global-scale persistent storage," in ACM ASPLOS, 2000.
[12] M. L. Sameh, S. Elnikety, A. Birrell, M. Burrows, and M. Isard, "A cooperative internet backup scheme," in USENIX ATC, 2003.
[13] R. Bhagwan, K. Tati, Y.-C. Cheng, S. Savage, and G. M. Voelker, "Total Recall: System support for automated availability management," in USENIX NSDI, 2004.
[14] L. Cox and B. Noble, "Pastiche: Making backup cheap and easy," in USENIX OSDI, 2002.
[15] M. Landers, H. Zhang, and K.-L. Tan, "PeerStore: Better performance by relaxing in peer-to-peer backup," in IEEE P2P, 2004.
[16] N. Oualha, M. Önen, and Y. Roudier, "A security protocol for self-organizing data storage," in IFIP SEC, 2008.
[17] G. Ateniese, R. Di Pietro, L. Mancini, and G. Tsudik, "Scalable and efficient provable data possession," in ICST SecureComm, 2008.
[18] S. Kamvar, M. Schlosser, and H. Garcia-Molina, "The EigenTrust algorithm for reputation management in P2P networks," in ACM WWW, 2003.
[19] V. Vishnumurthy, S. Chandrakumar, and E. Sirer, "KARMA: A secure economic framework for peer-to-peer resource sharing," in P2P Econ, 2003.
[20] L. Pamies-Juarez, P. García-López, and M. Sánchez-Artigas, "Rewarding stability in peer-to-peer backup systems," in IEEE ICON, 2008.
[21] P. Michiardi and L. Toka, "Selfish neighbor selection in peer-to-peer backup and storage applications," in Euro-Par, 2009.
[22] K. Tangwongsan, H. Pucha, D. G. Andersen, and M. Kaminsky, "Efficient similarity estimation for systems exploiting data redundancy," in IEEE INFOCOM, 2010.
[23] L. Toka, M. Dell'Amico, and P. Michiardi, "Online data backup: a peer-assisted approach," in