An Analytical Study of a Structured Overlay in the presence of Dynamic Membership
Supriya Krishnamurthy, Sameh El-Ansary, Erik Aurell, Seif Haridi
Swedish Institute of Computer Science (SICS), Sweden
Department of Physics, KTH-Royal Institute of Technology, Sweden
IMIT, KTH-Royal Institute of Technology, Sweden
{supriya,sameh,eaurell,seif}@sics.se

Abstract. In this paper we present an analytical study of dynamic membership (aka churn) in structured peer-to-peer networks. We use a fluid model approach to describe steady-state or transient phenomena, and apply it to the Chord system. For any rate of churn and stabilization rates, and any system size, we accurately account for the functional form of the probability of network disconnection as well as the fraction of failed or incorrect successor and finger pointers. We show how we can use these quantities to predict both the performance and consistency of lookups under churn. All theoretical predictions match simulation results. The analysis includes both features that are generic to structured overlays deploying a ring as well as Chord-specific details, and opens the door to a systematic comparative analysis of, at least, ring-based structured overlay systems under churn.
I. INTRODUCTION

An intrinsic property of Peer-to-Peer systems is the process of never-ceasing dynamic membership. Structured Peer-to-Peer Networks (aka Distributed Hash Tables (DHTs)) have the underlying principle of arranging nodes in an overlay graph of known topology and diameter. This knowledge results in the provision of performance guarantees. However, dynamic membership continuously “corrupts/churns” the overlay graph and every DHT strives to provide a technique to “correct/maintain” the graph in the face of this perturbation.

Both theoretical and empirical studies have been conducted to analyze the performance of DHTs undergoing “churn” and simultaneously performing “maintenance”. Liben-Nowell et al. [11] prove a lower bound on the maintenance rate required for a network to remain connected in the face of a given dynamic membership rate. Aspnes et al. [3] give upper and lower bounds on the number of messages needed to locate a node/data item in a DHT in the presence of node or link failures. The value of such theoretical studies is that they provide insights neutral to the details of any particular DHT. Empirical studies have also been conducted to complement these theoretical studies by showing how, within the asymptotic bounds, the performance of a DHT may vary substantially depending on different DHT designs and implementation decisions. Examples include the work of Li et al. [10], Rhea et al. [14], and Rowstron et al. [5].

This work is funded by the European 6th FP EVERGROW project. © IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works, must be obtained from the IEEE.

In this paper, we present a fluid model of Chord [15], a specific DHT, under churn. Fluid models have been used to model data communication systems at least since the early 1980s [2], and in some sense since the work of Erlang [4]. More recently, in the context of P2P systems, they have been used to model the performance of BitTorrent [13] and the Squirrel caching system [6]. This technique has much in common with macroscopic and mesoscopic descriptions of physical and chemical phenomena (from where the term fluid has obviously been borrowed), and carries the same advantages of conciseness and computability relative to an underlying more exact description. Our analysis is directly based on the master equation approach of physical kinetics, see e.g. the textbook [12], which provides a scheme for taking the various dynamical processes involved systematically into account.

The fluid model requires the notion of a state of the system. This is just a listing of the quantities one would need to know for a description of the system at a given level of detail. For Chord, we use grosso modo a level of description which requires keeping track of how many nodes there are in the system and what the state (whether correct, incorrect or failed) of each of the pointers of those nodes is. This information is not enough to draw a unique graph of network connections because, for example, if we know that a given node has an 'incorrect' successor pointer, this still does not tell us which node it is pointing to.
However, as we will see, beginning at this level of description is sufficient to keep track of most of the details of the Chord protocols. Having defined a state, the fluid model is simply a set of equations for the evolution of the probability of finding the system in this state, given the details of the dynamics. The master equation approach is useful for keeping track of the contribution of all the events which can bring about changes in the probability in a micro-instant of time, i.e., evaluating all the terms in the dynamics leading to a gain or loss of this probability.

Using this formalism we investigate a probabilistic model in which peers arrive independently, distributed as a Poisson process, and lifetimes are exponentially distributed. While this setup is not necessarily fully realistic (more realistic models can also be analyzed using master equation techniques), it is standard in modeling, as it typically brings out the salient features of the system with as few obscuring details from the probabilistic model as possible. We then derive the functional forms of the following: (i) Chord-specific inter-node distribution properties and (ii) for every outgoing pointer of a Chord node, the probability that it is in any one of its possible states. This probability is different for each of the successor and finger pointers. We then use this information to predict other quantities such as (iii) the probability that the network gets disconnected, (iv) lookup consistency (number of failed lookups), and (v) lookup performance (latency). All quantities are computed as a function of the parameters involved and all results are verified by simulations.

II. RELATED WORK
Closest in spirit to our work is the informal derivation in the original Chord paper [15] of the average number of timeouts encountered by a lookup. This quantity was approximated there by the product of the average number of fingers used in a lookup and the probability that a given finger points to a departed node. Our methodology not only allows us to derive the latter quantity systematically but also demonstrates how this probability depends on which finger (or successor) is involved. Further, we are able to derive a precise relation between this probability and lookup performance and consistency, accurate at any value of the system parameters.

In the works of Aberer et al. [1] and Wang et al. [16], DHTs are analyzed under churn and the results are compared with simulations. These analyses can also be classified as fluid models. However, the main parameter is the probability that a randomly selected entry of a routing table is stale. In our analysis, we determine this quantity from system details and churn rates.

A brief announcement of the results presented in this paper has appeared earlier in [8].

III. OUR IMPLEMENTATION OF CHORD
The Chord Ring. The general philosophy of DHTs is to map a set of data items onto a set of nodes where the insertion and lookup of items is done using the unique keys that the items are given. Chord's realization of that philosophy is as follows. Peers and data items are given unique keys (usually obtained by a cryptographic hash of a unique attribute like the IP address or public key for nodes, and filename or checksum for items) drawn from a circular key space of size $K$. The Chord system dictates that the right place for storing an item is at the first alive node whose key succeeds the key of the item. Since we refer to nodes and items by their keys, the insertion and lookup of items becomes a matter of locating the right “successor” of a key. All nodes have successor and predecessor pointers. For $N$ nodes, using only the successor pointers to look up items requires on the order of $N$ hops on average.

Fingers. To reduce the average lookup path length, nodes keep $M = \log_2 K$ pointers known as the “fingers”. Using these fingers, a node can retrieve any key in $O(\log N)$ hops. The fingers of a node $n$ (where $n \in \{0, \ldots, K-1\}$) point to exponentially increasing distances of keys away from $n$. That is, $\forall i \in 1..M$, $n$ points to a node whose key is equal to $n + 2^{i-1}$. We denote that key by $n.fin_i.start$. However, for a certain $i$, there might not be a node in the network whose key is equal to $n + 2^{i-1}$. Therefore, $n$ points to the first successor of $n + 2^{i-1}$, which we denote by $n.fin_i.node$.

The Successor List. Moreover, each node keeps a list of the $S = O(\log N)$ immediate successors as backups for its first successor. We use the notation $n.s$ to refer to this list and $n.s_i$ to refer to the $i$th element in the list. Finally, we use the notation $n.p$ to refer to the predecessor.

Stabilization, Churn & Steady State.
To keep the pointers up-to-date in the presence of churn, each node performs periodic stabilization of its successors and fingers. In our analysis, we define $\lambda_j$ as the rate of joins per node, $\lambda_f$ the rate of failures per node and $\lambda_s$ the rate of stabilizations per node. The fraction of stabilizations which act on the successors is $\alpha$, such that the rate of successor stabilizations is $\alpha\lambda_s$, and the rate of finger stabilizations is $(1-\alpha)\lambda_s$. In all that follows, we impose the steady state condition $\lambda_j = \lambda_f$ unless otherwise stated. Further, it is useful to define $r \equiv \lambda_s/\lambda_f$, which is the relevant ratio on which all the quantities we are interested in will depend. For example, $r = 50$ means that a join/fail event takes place every half an hour for a stabilization which takes place once every 36 seconds. Throughout the paper we will use the terms $\lambda_j N \Delta t$, $\lambda_f N \Delta t$, $\alpha\lambda_s N \Delta t$ and $(1-\alpha)\lambda_s N \Delta t$ to denote the respective probabilities that a join, a failure, a successor stabilization, or a finger stabilization takes place anywhere on the ring during a micro period of time of length $\Delta t$.

Parameters.
The parameters of the problem are hence: $K$, $N$, $\alpha$ and $r$. All relevant measurable quantities should be entirely expressible in terms of these parameters.

Simulation. Since we are collecting statistics such as the probability that a particular finger pointer is wrong, each experiment has to be repeated before well-averaged results are obtained. The simulations for this paper were parallelized on a cluster of nodes, with $N = 1000$ and $S = 6$, over a range of values of $r$ and $\alpha$.

While the main outlines of the Chord protocol are provided by its authors in [15], an exact analysis requires a deeper level of detail and a set of adopted assumptions, which we provide in the following subsections.

A. Joins, Failures & Ring Stabilization
Initialization. Initially, a node knows its key and at least one node with key c that already exists in the network and is alive. The knowledge of such a node is assumed to be acquired through some out-of-band method. The predecessor p, successors (s_1..S) and fingers (fin_1..M.node) are all assigned to nil.

Joins (Fig. 1). A new node n joins by looking up its successor using the initial random contact node c. It also starts its first stabilization of the successors and initializes its fingers.

Stabilization of Successors (Fig. 1). The function fixSuccessors is triggered periodically with rate $\alpha\lambda_s$. A node n tells its first alive successor y that it believes itself to be y's predecessor and expects as an answer y's predecessor y.p and successors y.s. The response of y can lead to three actions:

Case A. Some node exists between n and y (i.e., n's belief is wrong), so n prepends y.p to its successor list as a first successor and retries fixSuccessors.

Case B. y confirms n's belief and informs n of y's old predecessor y.p. Therefore n considers y.p as an alternative/initial predecessor for n. Finally, n reconciles its successor list with y.s.

Case C. y agrees that n is its predecessor and the only task of n is to update its successor list by reconciling it with y.s.

By calling iThinkIamYourPred (Fig. 1), some node x informs n that it believes itself to be n's predecessor. If n's predecessor p is not alive or nil, then n accepts x as a predecessor and informs x about this agreement by returning x. Alternatively, if n's predecessor p is alive (how this is discovered is explained in Section III-C), then there are two possibilities. The first is that x is in the region between n and its current predecessor p, therefore n should accept x as a new predecessor and inform x about its old predecessor. The second is that p is already pointing to x, so the state is correct at both parties and n confirms that to x by informing it that x is the predecessor of n. In all cases the function returns a predecessor and a successor list.

The function firstAliveSuccessor (Fig. 1) iterates through the successor list. In each iteration, if the first successor s_1 is alive, it is returned. Otherwise, the dead successor is dropped from the list and nil is appended to the end of the list. If the first successor is nil, this means that all immediate successors are dead and that the ring is disconnected.

    n.join(c)
        s_1 = c.findSuccessor(n)
        fixSuccessors()
        initFingers(s_1)

    n.fixSuccessors()
        y = firstAliveSuccessor()
        {y.p, y.s} = y.iThinkIamYourPred(n)
        if (y.p ∈ (me, y))            // Case A
            prepend(y.p)
            fixSuccessors()
        elsif (y.p ∈ (y, me))         // Case B
            considerANewPred(y.p)
            reconcile(y.s)
        else                          // Case C: y.p == me
            reconcile(y.s)

    n.firstAliveSuccessor()
        while (true)
            if (s_1 == nil)
                // Broken Ring!!
            if (isAlive(s_1))
                return(s_1)
            ∀ i ∈ 1..(S−1): s_i = s_(i+1)
            s_S = nil

    n.iThinkIamYourPred(x)
        if (isNotAlive(p) or (p == nil))
            p = x
            return({s, x})
        if (x ∈ (p, me))
            oldp = p
            p = x
            return({s, oldp})
        else
            return({s, p})

    n.considerANewPred(x)
        if (isNotAlive(p) or (p == nil) or (x ∈ (p, n)))
            p = x

    n.reconcile(s′)
        for i = 1..(S−1)
            s_(i+1) = s′_i

    n.prepend(y)
        for i = S..2
            s_i = s_(i−1)
        s_1 = y

Fig. 1. Joins and ring stabilization algorithms.

    n.initFingers(s_1)
        f′ = s_1.f
        ∀ i ∈ 1..M s.th. (fin_i.start ∈ (n, s_1]), fin_i.node = s_1
        ∀ j ∈ 1..M s.th. (fin_j.start ∉ (n, s_1]), fin_j.node = localSuccessor(f′, fin_j.start)

    n.localSuccessor(f, k)
        for i = 1..M
            if (k ∈ (n, fin_i])
                return(fin_i)
        return(nil)

    n.fixFingers(k)
        i = random()                  // random finger index, i ≤ M
        fin_i.node = findSuccessor(fin_i.start)

Fig. 2. Initialization and stabilization of fingers.

B. Lookups and Stabilization of Fingers
Stabilization of Fingers (Fig. 2). Stabilization of fingers occurs at a rate $(1-\alpha)\lambda_s$. Each time the fixFingers function is triggered, a random finger fin_i is chosen, a lookup for fin_i.start is performed and the result is used to update fin_i.node.

Initialization of Fingers (Fig. 2). After having initialized its first successor s_1, a node n sets all fingers with starts between n and s_1 to s_1. The rest of the fingers are initialized by taking a copy of the finger table of s_1 and finding an approximate successor to every finger from that finger table.

Lookups (Fig. 3). A lookup operation is a fundamental operation that is used to find the successor of a key. It is used by many other routines and its performance and consistency are the main quantities of interest in the evaluation of any DHT. A node n looking up the successor of k runs the findSuccessor algorithm, which can lead to the following cases:

Case A. If k is equal to n then n is trivially the successor of k.

Case B. If k ∈ (n, s_1] then n has found the successor of k, but it could be that s_1 has failed and n has not yet discovered this. However, entries in the successor list can act as backups for the first successor. Therefore, the first alive successor of n is the successor of k. Note that, in this case, while we try to find the first alive successor, we do not change the entries in the successor list. This is mainly because, to simplify the analysis, we want the successor list to be changed at a fixed rate $\alpha\lambda_s$ only by the fixSuccessors function.

Case C. The lookup should be forwarded to a node closer to k, namely the closest alive finger preceding k in n's finger table. The call to the function closestAlivePrecedingFinger returns such a node if possible and the lookup is forwarded to it. However, it could be the case that all fingers preceding k are dead. In that case, we need to use the successor list as a last resort for the lookup. Therefore, we locate the first alive successor y and if k ∈ (n, y] then y is the successor of k. Otherwise, we locate the closest alive preceding successor to k and forward the lookup to it.

    n.findSuccessor(k)
        // Case A: k is exactly equal to n
        if (k == n)
            return(n)
        // Case B: k is between n and s_1
        if (k ∈ (n, s_1])
            return(firstAliveSuccessorNoChange())
        // Case C: Forward the lookup to
        // the closest preceding alive finger
        cpf = closestAlivePrecedingFinger(k)
        if (cpf == nil)
            y = firstAliveSuccessorNoChange()
            if (k ∈ (n, y])
                return(y)
            cpf = closestAlivePrecedingSucc(k)
            return(cpf.findSuccessor(k))
        else
            return(cpf.findSuccessor(k))

    n.firstAliveSuccessorNoChange()
        i = 1
        while (true)
            if (s_i == nil)
                // Broken Ring!!
            if (isAlive(s_i))
                return(s_i)
            i++

    n.closestAlivePrecedingFinger(k)
        for i = M..1
            if ((fin_i ∈ (n, k)) and (fin_i ≠ nil) and isAlive(fin_i))
                return(fin_i)
        return(nil)

    n.closestAlivePrecedingSucc(k)
        for i = S..1
            if ((s_i ∈ (n, k)) and (s_i ≠ nil) and isAlive(s_i))
                return(s_i)
        return(cpf)

Fig. 3. The lookup algorithm.

C. Failures
Throughout the code we use the calls isAlive and isNotAlive. A simple interpretation of these routines would be to equate them with performing a ping. However, a more accurate implementation is that liveness is discovered by performing the required operation itself. For instance, a call to firstAliveSuccessor in Fig. 1 is performed to retrieve a node y and then call y.iThinkIamYourPred, so alternatively the first alive successor could be discovered by iterating over the successor list and calling iThinkIamYourPred on each entry.

IV. THE ANALYSIS
A. Distributional Properties of Inter-Node Distances
In this section we will assume that all keys are populated by peers with independent and equal probability, and, furthermore, that this probability does not change with time. The first condition is a natural consequence of peers joining and leaving/failing independently. The last condition, on the other hand, does not hold strictly, since the number of peers present under churn is a fluctuating quantity. Nevertheless, it can be expected to hold to good accuracy in sufficiently large systems. A detailed analysis along these lines will be given elsewhere.
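As a rough illustration of the occupancy assumption above, the following sketch populates each of $K$ keys independently with probability $N/K$ and tabulates the ring distances between consecutive populated keys. Under independent occupancy these distances should be geometrically distributed, so the empirical frequencies are compared with $\rho^{x-1}(1-\rho)$ for $\rho = 1 - N/K$. The function names and the parameter values are hypothetical, chosen only to keep the run short; they are not taken from the simulations reported in this paper.

```python
import random
from collections import Counter


def sample_interval_lengths(K, N, seed=0):
    """Occupy each of K keys independently with probability N/K and
    return the ring distances between consecutive occupied keys."""
    rng = random.Random(seed)
    p = N / K
    occupied = [key for key in range(K) if rng.random() < p]
    m = len(occupied)
    lengths = []
    for i in range(m):
        gap = (occupied[(i + 1) % m] - occupied[i]) % K
        # A single occupied key wraps onto itself and spans the whole ring.
        lengths.append(gap if gap > 0 else K)
    return lengths


def geometric_pmf(x, rho):
    """Probability of x - 1 empty keys followed by one occupied key."""
    return rho ** (x - 1) * (1 - rho)


if __name__ == "__main__":
    K, N = 200_000, 20_000  # illustrative sizes, assumed for this sketch
    lengths = sample_interval_lengths(K, N)
    rho = 1 - N / K
    counts = Counter(lengths)
    for x in range(1, 6):
        print(x, counts[x] / len(lengths), geometric_pmf(x, rho))
```

The number of occupied keys fluctuates around $N$ from run to run, which mirrors the caveat above that the population size is not strictly constant.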
Definition 4.1: Given two keys $u, v \in \{0, \ldots, K-1\}$, the “distance” between them is $u - v$ (with modulo-$K$ arithmetic). We interchangeably say that $u$ and $v$ form an “interval” of length $u - v$. Hence the number of keys inside an interval of length $\ell$ is $\ell - 1$.

Property 4.1: The probability $P(x)$ of finding an interval of length $x$ is

$$P(x) = \rho^{x-1}(1-\rho), \qquad \rho = \frac{K-N}{K}.$$

Under the stated conditions, each key will be populated with the same probability $N/K = 1 - \rho$, for $N \ll K$. An interval of length $x$ then involves $x - 1$ consecutive unpopulated keys, and then one populated key, which explains the formula.

Fig. 4. (a) Case when $n$ and $p$ have the same value of $fin_k.node$. (b) Case where a newly joined node $p$ copies the $k$th entry of its successor node $n$ as the best approximation for its own $k$th entry (by the join protocol). In this case, there could be a node $o$ which is the 'correct' entry for $p.fin_k.node$. However, since $p$ is newly joined, the only information it has access to is the finger table of $n$.

We now derive some properties of this distribution which will be used in the ensuing analysis.

Property 4.2:
For any two keys $u$ and $v$, where $v = u + x$, let $b_i$ be the probability that the first node encountered in between these two keys is at $u + i$ (where $0 \le i < x$). Then $b_i \equiv \rho^{i}(1-\rho)$. The probability that there is definitely at least one node between $u$ and $v$ is $a(x) \equiv 1 - \rho^{x}$. Hence the conditional probability that the first node is at a distance $i$, given that there is at least one node in the interval, is $bc(i, x) \equiv b(i)/a(x)$.

Property 4.3:
The probability that a node and at least one of its immediate predecessors share the same $k$th finger is

$$p_1(k) \equiv \frac{\rho}{1+\rho}\left(1 - \rho^{2^{k}-2}\right).$$

The explanation for this property goes as follows. If the distance between node $n$ and its predecessor $p$ is $x$, the distance between $n.fin_k.start$ and $p.fin_k.start$ is also $x$ (see Fig. 4(a)). If there is no node in between $n.fin_k.start$ and $p.fin_k.start$, then $n.fin_k.node$ and $p.fin_k.node$ will share the same value. From Property 4.1, the probability that the distance between $n$ and $p$ is $x$ is $\rho^{x-1}(1-\rho)$. However, $x$ has to be less than $2^{k-1}$, otherwise $p.fin_k.node$ will be equal to $n$. The probability that no node exists between $n.fin_k.start$ and $p.fin_k.start$ is $\rho^{x}$ (by Property 4.2). Therefore the probability that $n.fin_k.node$ and $p.fin_k.node$ share the same value is

$$\sum_{x=1}^{2^{k-1}-1} \rho^{x-1}(1-\rho)\,\rho^{x} = \frac{\rho}{1+\rho}\left(1 - \rho^{2^{k}-2}\right).$$

It is straightforward (though tedious) to derive similar expressions for $p_2(k)$, the probability that a node and at least two of its immediate predecessors share the same $k$th finger, $p_3(k)$, and so on.

Property 4.4:
We can similarly assess the probability that the join protocol (see Section III-B) results in further replication of the $k$th pointer. Let us define $p_{join}(i, k)$ as the probability that a newly joined node chooses the $i$th entry of its successor's finger table for its own $k$th entry. Note that this is unambiguous even in the case that the successor's $i$th entry is repeated. All we are asking is: when is the $k$th entry of the new joinee the same as the $i$th entry of the successor? Clearly $i \le k$. In fact, for the larger fingers, we only need to consider $p_{join}(k, k)$, since $p_{join}(i, k) \sim 0$ for $i < k$. Using the interval distribution we find, for large $k$,

$$p_{join}(k, k) \sim \rho\left(1 - \rho^{2^{k-2}-2}\right) + (1-\rho)\left(1 - \rho^{2^{k-2}-2}\right) - (1-\rho)\,\rho\,\left(2^{k-2}-2\right)\rho^{2^{k-2}-3}.$$

This function goes to 1 for large $k$.

We can also analogously compute $p_{join}(i, k)$ for any $i$. The only trick here is to estimate the probability that, starting from $i$, the last distinct entry of $n$'s finger table does not give $p$ a better choice for its $k$th entry. This can again readily be computed using Property 4.2, but we do not do the computation here since for our purposes $p_{join}(k, k)$ suffices.

Fig. 5. Changes in $W_1$, the number of wrong (failed or outdated) $s_1$ pointers, due to joins, failures and stabilizations.

B. Successor Pointers
We now turn to estimating various quantities of interest for Chord. In all that follows we will evaluate various average quantities as a function of the parameters. To do this we need to understand how the dynamical evolution of the system affects these quantities.

In the case of Chord, we only need to consider one of three kinds of events happening at any micro-instant: a join, a failure or a stabilization. One assumption made in the following is that such a micro-instant of time exists, or in other words, that we can divide time until we have an interval small enough that in this interval only one of these three processes occurs anywhere in the system. Implicit in this is the assumption that a stabilization (either of successors or fingers) is done faster than the time-scales over which joins and fails occur.

Another aspect of this system which simplifies the analysis is that successor pointers of adjacent nodes are independent of each other. That is, the state of the first successor pointer of a given node does not affect the state of the first successor pointer of either its predecessor or its successor. The same logic also works for the state of the second successor pointers of adjacent nodes and so on. On the other hand, the state of the second successor pointer of a node is clearly related to the state of its first successor pointer as well as the state of the first successor pointer of the successor. This is taken into account in the analysis of second and higher successor pointers. In characterizing the states of higher successors, we look for the leading order behavior in terms of the parameter $r$.

Consider first the successor pointers. Let $w_k(r, \alpha)$ denote the fraction of nodes having a wrong $k$th successor pointer and $d_k(r, \alpha)$ the fraction of nodes having a failed $k$th successor pointer. Also, let $W_k(r, \alpha)$ be the number of nodes having a wrong $k$th successor pointer and $D_k(r, \alpha)$ the number of nodes having a failed $k$th successor pointer. A failed pointer is one which points to a departed node, while a wrong pointer points either to an incorrect node (alive but not correct) or to a dead one. As we will see, both these quantities play a role in predicting lookup consistency and lookup length.

By the protocol for stabilizing successors in Chord, a node periodically contacts its first successor, possibly correcting it and reconciling with its successor list. Therefore, the numbers of wrong $k$th successor pointers are not independent quantities but depend on the number of wrong first successor pointers.

We write an equation for $W_1(r, \alpha)$ by accounting for all the events that can change it in a micro period of time $\Delta t$. An illustration of the different cases in which changes in $W_1$ take place due to joins, failures and stabilizations is provided in Fig. 5. In some cases $W_1$ increases/decreases while in others it stays unchanged. For each increase/decrease, Table I provides the corresponding probabilities.

TABLE I. Gain and loss terms for $W_1(r, \alpha)$: the number of wrong first successors as a function of $r$ and $\alpha$.

    Change in W_1(r, α)             Probability of occurrence
    W_1(t+Δt) = W_1(t) + 1          c_1.1 = (λ_j N Δt)(1 − w_1)
    W_1(t+Δt) = W_1(t) + 1          c_1.2 = λ_f N (1 − w_1)^2 Δt
    W_1(t+Δt) = W_1(t) − 1          c_1.3 = λ_f N w_1^2 Δt
    W_1(t+Δt) = W_1(t) − 1          c_1.4 = α λ_s N w_1 Δt
    W_1(t+Δt) = W_1(t)              1 − (c_1.1 + c_1.2 + c_1.3 + c_1.4)

By our implementation of the join protocol, a new node $n_y$, joining between two nodes $n_x$ and $n_z$, always has a correct $s_1$ pointer after the join. However, the state of $n_x.s_1$ before the join makes a difference. If $n_x.s_1$ was correct (pointing to $n_z$) before the join, then after the join it will be wrong and therefore $W_1$ increases by 1. If $n_x.s_1$ was wrong before the join, then it will remain wrong after the join and $W_1$ is unaffected. Thus, we need to account for the former case only. The probability that $n_x.s_1$ is correct is $1 - w_1$, and term $c_{1.1}$ follows from this.

For failures, we have 4 cases. To illustrate them we use nodes $n_x$, $n_y$, $n_z$ and assume that $n_y$ is going to fail. First, if both $n_x.s_1$ and $n_y.s_1$ were correct, then the failure of $n_y$ will make $n_x.s_1$ wrong and hence $W_1$ increases by 1. Second, if $n_x.s_1$ and $n_y.s_1$ were both wrong, then the failure of $n_y$ will decrease $W_1$ by one, since one wrong pointer disappears. Third, if $n_x.s_1$ was wrong and $n_y.s_1$ was correct, then $W_1$ is unaffected. Fourth, if $n_x.s_1$ was correct and $n_y.s_1$ was wrong, then the wrong pointer of $n_y$ disappears and $n_x.s_1$ becomes wrong, therefore $W_1$ is unaffected. For the first case to happen, we need to pick two nodes with correct pointers; the probability of this is $(1 - w_1)^2$. For the second case to happen, we need to pick two nodes with wrong pointers; the probability of this is $w_1^2$. From these probabilities follow the terms $c_{1.2}$ and $c_{1.3}$.

Finally, a successor stabilization does not affect $W_1$, unless the stabilizing node had a wrong pointer. The probability of picking such a node is $w_1$. From this follows the term $c_{1.4}$.

Hence the equation for $W_1(r, \alpha)$ is:

$$\frac{dW_1}{N\,dt} = \lambda_j(1 - w_1) + \lambda_f(1 - w_1)^2 - \lambda_f w_1^2 - \alpha\lambda_s w_1.$$

Solving for $w_1$ in the steady state and putting $\lambda_j = \lambda_f$, we get:

$$w_1(r, \alpha) = \frac{2}{3 + r\alpha} \approx \frac{2}{r\alpha}. \tag{1}$$

This expression matches well with the simulation results, as shown in Fig. 6. $d_1(r, \alpha)$ is then $\approx \frac{1}{2} w_1(r, \alpha)$, since when $\lambda_j = \lambda_f$, about half the number of wrong pointers are incorrect and about half point to dead nodes. Thus $d_1(r, \alpha) \approx \frac{1}{r\alpha}$, which also matches the simulations well, as shown in Fig. 6.

[Figure 6: $w_1(r, \alpha)$ and $d_1(r, \alpha)$ plotted against the rate of stabilisation over the rate of failure ($r = \lambda_s/\lambda_f$). Simulation and theory curves for $w_1$ at $\alpha$ = 0.25, 0.5 and 0.75, and for $d_1$ at $\alpha$ = 0.75.]

Fig. 6. Theory and simulation for the probability of wrong 1st successor $w_1(r, \alpha)$ and failed 1st successor $d_1(r, \alpha)$.

The fraction of wrong second successors can be estimated in an analogous manner. Consider, for a node $n$, the possible states of the successor, $n.s_1$, the successor of the successor, $*(n.s_1).s_1$, and the second successor, $n.s_2$. In a fully correct state, $*(n.s_1).s_1$ and $n.s_2$ of course point to the same node. If in such a state either $n.s_1$ or $*(n.s_1).s_1$ becomes incorrect through the action of a join or a failure, then $n.s_2$ is also incorrect. On the other hand, $n.s_2$ cannot be corrected by the stabilization protocol unless both $n.s_1$ and $*(n.s_1).s_1$ are already corrected. Hence, $n.s_2$ is wrong if either $n.s_1$ or $*(n.s_1).s_1$ is wrong, and also if both $n.s_1$ and $*(n.s_1).s_1$ are correct but $n.s_2$ has not yet been corrected. If the number of such non-stabilized configurations is $N_2$ and the fraction is $n_2$, we have

$$w_2 = 2w_1 - w_1^2 + n_2. \tag{2}$$

To estimate $n_2$ we consider how these configurations might be gained or lost.
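Before turning to those gain and loss terms, equation (1) can be sanity-checked numerically. The sketch below is not part of the analysis or of the simulator used in this paper; it simply integrates the rate equation for $W_1/N$ with forward Euler steps, under the stated assumption $\lambda_j = \lambda_f$ and with $\lambda_s = r\lambda_f$, and compares the long-time value with $2/(3 + r\alpha)$. The function names, step size and number of steps are illustrative assumptions.

```python
def w1_rate(w1, r, alpha, lam_f=1.0):
    """Right-hand side of dW1/(N dt), with lam_j = lam_f and lam_s = r * lam_f."""
    lam_j = lam_f
    lam_s = r * lam_f
    return (lam_j * (1 - w1)              # join next to a correct pointer
            + lam_f * (1 - w1) ** 2       # failure, both pointers correct
            - lam_f * w1 ** 2             # failure, both pointers wrong
            - alpha * lam_s * w1)         # stabilization of a wrong pointer


def w1_steady_state(r, alpha):
    """Closed form of equation (1)."""
    return 2.0 / (3.0 + r * alpha)


def integrate_w1(r, alpha, w0=0.0, dt=1e-4, steps=50_000):
    """Forward-Euler integration of the rate equation from w1 = w0."""
    w = w0
    for _ in range(steps):
        w += dt * w1_rate(w, r, alpha)
    return w


if __name__ == "__main__":
    for r, alpha in [(50, 0.25), (100, 0.5), (1000, 0.75)]:
        print(r, alpha, integrate_w1(r, alpha), w1_steady_state(r, alpha))
```

Because the quadratic terms cancel, the right-hand side reduces to $2 - (3 + r\alpha)\,w_1$, so the relaxation toward the steady state is exponential with rate $(3 + r\alpha)\lambda_f$.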
The gain term arises from stabilizations of configurations where $n.s_1$ is correct but $*(n.s_1).s_1$ is wrong. A stabilization performed by node $n.s_1$ then results in the gain of an $N_2$ configuration. On the other hand, non-stabilized configurations are lost either by a stabilization performed by node $n$ (when it gets the correct successor list from its successor and hence corrects $n.s_2$), or by corrupting either $n.s_1$ or $*(n.s_1).s_1$ (by a join or failure). The latter possibility gives terms of order $1/r$ and we can ignore it in the limit that stabilizations happen on a much faster time scale than joins and failures (i.e., $r$ much larger than unity). The equation for $N_2$ is hence

$$\frac{dN_2}{N\,dt} \approx \alpha\lambda_s w_1(1 - w_1) - \alpha\lambda_s n_2, \tag{3}$$

which implies $n_2 \approx w_1$ to order $1/r$. Thus, we have $w_2 \approx 3w_1 \approx \frac{6}{\alpha r}$.

For higher successors we reason similarly by considering the state of the $(k-1)$st successor pointer of node $n$, the successor pointer of the $(k-1)$st successor, and the $k$th successor pointer of node $n$. We can write a recursion equation for $w_k$, the fraction of nodes with a wrong $k$th successor pointer:

$$w_k = w_1 + w_{k-1} - w_{k-1}w_1 + n_k, \tag{4}$$

where $n_k$ is the density of configurations where the $(k-1)$st successor pointer of node $n$ and the first successor pointer of the $(k-1)$st successor are both correct, but this information has not yet been used to correct the $k$th successor pointer of node $n$. If node $n$ does not as yet have the correct information about its $k$th successor, that means that either all the nodes in between $n$ and its $(k-1)$st successor have the correct information but node $n$ has not as yet stabilized, or that the stabilization has propagated back from the $(k-1)$st successor to some node in between but not as yet to $n.s_1$. To elaborate on this further, there is the case where the second successor pointer of the $(k-2)$nd successor has not been corrected, then the case where this has been done but the third successor pointer of the $(k-3)$rd successor has not been corrected, and so on. Each of these is analogous to $n_2$ and each occurs with density $(1 - w_{k-1})\,w_1$, if joins and failures are neglected compared to stabilizations. Hence, if to leading order in $1/r$ we have $w_k \sim \frac{c_k}{\alpha r}$, then

$$c_k = c_{k-1} + k\,c_1, \tag{5}$$

which leads to

$$w_k \approx \frac{k(k+1)}{\alpha r}. \tag{6}$$

We note that this expression obviously depends on the details of the stabilization scheme, and is in principle only valid up to $k \sim \sqrt{r}$. As shown in Fig. 7, the agreement between theory and simulation is however still quite reasonable at $k = 5$ and $r = 100$.

C. Break-up (Network Disconnection) Probability
We demonstrate below how calculating $d_k(r, \alpha)$, the fraction of nodes with dead $k$th pointers, helps in estimating the probability that the network gets disconnected for any value of $r$ and $\alpha$. Let $P_{bu}(n, r, \alpha)$ be the probability that $n$ consecutive nodes fail. If $n = \mathcal{S}$, the length of the successor list, then clearly the node whose successor list this is gets disconnected from the network and the network breaks up. For the range of $r$ considered in Fig. 6, $P_{bu}(\mathcal{S}, r, \alpha) \sim 0$. However, should we go lower, this starts becoming finite. The master equation analysis introduced here can be used to estimate $P_{bu}(n, r, \alpha)$ for any $1 \le n \le \mathcal{S}$. We indicate how this might be done by first considering the case $n = 2$. Let $N_{bu}(2, r, \alpha)$ be the number of configurations in which a node has both $s_1$ and $s_2$ dead and $P_{bu}(2, r, \alpha)$ be the fraction of such configurations. Table II indicates how this is estimated within the present framework.

Fig. 7. Theory and simulation for the probability of a wrong $k$th successor, $w_k(r, \alpha)$. [Plot: fraction of nodes with wrong $k$th successor versus rate of stabilisation of successors / rate of failure ($\alpha r = \alpha\lambda_s/\lambda_f$); simulation and theory curves at $\alpha = 0.5$.]

TABLE II. Gain and loss terms for $N_{bu}(2, r, \alpha)$: the number of nodes with dead first and second successors.

| Change in $N_{bu}(r, \alpha)$ | Probability of occurrence |
| $N_{bu}(t + \Delta t) = N_{bu}(t) + 1$ | $c_{2.1} = (\lambda_f N \Delta t)\, d_1(r, \alpha)$ |
| $N_{bu}(t + \Delta t) = N_{bu}(t) + 1$ | $c_{2.2} = \lambda_f N \Delta t\, (1 - d_1)\, d_2$ |
| $N_{bu}(t + \Delta t) = N_{bu}(t) - 1$ | $c_{2.3} = \alpha\lambda_s N \Delta t\, P_{bu}(2, r, \alpha)$ |
| $N_{bu}(t + \Delta t) = N_{bu}(t)$ | $1 - (c_{2.1} + c_{2.2} + c_{2.3})$ |

A join event does not affect this probability in any way. So we only need to consider the effect of failures or stabilization events. The term $c_{2.1}$ accounts for the situation when the first successor of a node is dead (which happens with probability $d_1(r, \alpha)$ as explained above). A failure event can then kill its second successor as well, and this happens with probability $c_{2.1}$. The second term is the situation that the first successor is alive (with probability $1 - d_1$) but the second successor is dead (with probability $d_2$). The logic used to estimate $d_2$ (or $d_k$ in general) is very similar to the reasoning we used to estimate the $w_k$'s. So we have

$$d_k = d_1 + (k-1)\, d_1 = k\, d_1 \qquad (7)$$

Thus the $k$th successor of a node is dead if the $(k-1)$st successor's successor is dead, or the $(k-1)$st successor's successor is not dead but the intermediate nodes think it is because they haven't stabilized. Hence $d_2 \sim 2/(\alpha r)$. This estimate for $d_2$ matches the simulation results very well, as shown in Fig. 8.

Fig. 8. Theory and simulation for the probability of failure of the 2nd successor, $d_2(r, \alpha)$. [Plot: $d_2(r, \alpha)$ versus rate of stabilisation / rate of failure ($r = \lambda_s/\lambda_f$); simulation and theory curves.]

Coming back to counting the gain and loss terms for $N_{bu}(2, r, \alpha)$, a stabilization event reduces the number of such configurations by one, if the node doing the stabilization had such a configuration to begin with.

Solving the equation for $N_{bu}(2, r, \alpha)$, one hence obtains that $P_{bu}(2, r, \alpha) \sim 3/(\alpha r)^2$. As Fig. 9 shows, this is a precise estimate.

We can similarly estimate the probabilities for three consecutive nodes failing, etc., and hence also the general disconnection probability $P_{bu}(\mathcal{S}, r, \alpha)$. In fact $P_{bu}(\mathcal{S}, r, \alpha)$ may be written in terms of the $d_k(r, \alpha)$ as:

$$P_{bu}(\mathcal{S}) = \frac{(\mathcal{S}-1)!\, \sum_{i=1}^{\mathcal{S}} d_i(r, \alpha)}{(\alpha r)^{\mathcal{S}-1}} \qquad (8)$$

The logic behind this equation is similar to that used for solving for $P_{bu}(2)$, namely that for $\mathcal{S}$ consecutive nodes to fail, any $\mathcal{S}-1$ of the $\mathcal{S}$ nodes should have failed first, and then a failure event kills the remaining node. Equation (8) is readily solved by substituting the values of the $d_k$'s to get

$$P_{bu}(\mathcal{S}) = \frac{(\mathcal{S}+1)!}{2\,(\alpha r)^{\mathcal{S}}} \qquad (9)$$

As mentioned above this is again correct only to leading order. Namely, there will be correction terms of the order $1/r^{\mathcal{S}+1}$ which we haven't computed at this level of approximation.

The Master Equation formalism thus affords the possibility of making a precise prediction for when the system runs the danger of getting disconnected, as a function of the parameters.

Lookup Consistency
By the lookup protocol, a lookup is inconsistent if the immediate predecessor of the sought key has a wrong $s_1$ pointer. However, we need only consider the case when the $s_1$ pointer is pointing to an alive (but incorrect) node, since our implementation of the protocol always requires the lookup to return an alive node as an answer to the query. The probability that a lookup is inconsistent, $I(r, \alpha)$, is hence $w_1(r, \alpha) - d_1(r, \alpha)$. This prediction matches the simulation results very well, as shown in Fig. 10.

Fig. 9. Theory and simulation for the break-up probability $P_{bu}(2, r, \alpha)$. [Plot: break-up probability versus rate of stabilisation of successors / rate of failure ($\alpha r = \alpha\lambda_s/\lambda_f$); simulation and theory curves for $\alpha = 0.25, 0.5, 0.75$.]

Fig. 10. Theory and simulation for inconsistent lookups $I(r, \alpha)$. [Plot: $I(r, \alpha)$ versus rate of stabilisation of successors / rate of failure ($\alpha r = \alpha\lambda_s/\lambda_f$); simulation curves for $\alpha = 0.25, 0.5, 0.75$ with the corresponding theory curves.]

D. Failure of Fingers
We now turn to estimating the fraction of finger pointers which point to failed nodes. As we will see, this is an important quantity for predicting lookups, since failed fingers cause timeouts and increase the lookup length. However, we only need to consider fingers pointing to dead nodes. Unlike members of the successor list, alive fingers, even if outdated, always bring a query closer to the destination and do not affect consistency or, substantially, even the lookup length. Therefore we consider fingers in only two states, alive or dead (failed). By our implementation of the stabilization protocol (see Sections III-A and III-B), fingers and successors are stabilized entirely independently of each other to simplify the analysis. Thus even though the first finger is also always the first successor, this information is not used by the node in updating the finger. Fingers of nodes far apart are independent of each other. Fingers of adjacent nodes can be correlated and we take this into account. The only assumption in this section is in connection with the join protocol as explained below.
Fig. 11. Changes in $F_k$, the number of failed $fin_k$ pointers, due to joins, failures and stabilizations.

TABLE III. The relevant gain and loss terms for $F_k$, the number of nodes whose $k$th fingers are pointing to a failed node, for $k > 1$.

| $F_k(t + \Delta t)$ | Probability of occurrence |
| $= F_k(t) + 1$ | $c_{3.1} = (\lambda_j N \Delta t) \sum_{i=1}^{k} p_{join}(i, k)\, f_i$ |
| $= F_k(t) - 1$ | $c_{3.2} = \frac{(1-\alpha)}{M}\, f_k\, (\lambda_s N \Delta t)$ |
| $= F_k(t) + 1$ | $c_{3.3} = (1 - f_k)^2\, [1 - p_1(k)]\, (\lambda_f N \Delta t)$ |
| $= F_k(t) + 2$ | $c_{3.4} = (1 - f_k)^2\, (p_1(k) - p_2(k))\, (\lambda_f N \Delta t)$ |
| $= F_k(t) + 3$ | $c_{3.5} = (1 - f_k)^2\, (p_2(k) - p_3(k))\, (\lambda_f N \Delta t)$ |
| $= F_k(t)$ | $1 - (c_{3.1} + c_{3.2} + c_{3.3} + c_{3.4} + c_{3.5})$ |

Let $f_k(r, \alpha)$ denote the fraction of nodes whose $k$th finger points to a failed node and $F_k(r, \alpha)$ denote the respective number. For notational simplicity, we write these as simply $F_k$ and $f_k$. We can predict this function for any $k$ by again estimating the gain and loss terms for this quantity, caused by a join, failure or stabilization event, and keeping only the most relevant terms. These are listed in Table III and illustrated in Fig. 11.

A join event can play a role here by increasing the number of $F_k$ pointers if the successor of the joinee had a failed $i$th pointer (occurs with probability $f_i$) and the joinee replicated this from the successor as the joinee's $k$th pointer (occurs with probability $p_{join}(i, k)$ from property 4.4). For large enough $k$, this probability is one only for $p_{join}(k, k)$, that is, the new joinee mostly only replicates the successor's $k$th pointer as its own $k$th pointer. This is what we consider here.

A stabilization evicts a failed pointer if there was one to begin with. The stabilization rate is divided by $M$, since a node stabilizes any one finger randomly, every time it decides to stabilize a finger at rate $(1 - \alpha)\lambda_s$.

Given a node $n$ with an alive $k$th finger (occurs with probability $1 - f_k$), when the node pointed to by that finger fails, the number of failed $k$th fingers ($F_k$) increases. The amount of this increase depends on the number of immediate predecessors of $n$ that were pointing to the failed node with their $k$th finger. That number of predecessors could be $0, 1, 2, \ldots$ etc. Using property 4.3 the respective probabilities of those cases are: $1 - p_1(k)$, $p_1(k) - p_2(k)$, $p_2(k) - p_3(k)$, ... etc.

Solving for $f_k$ in the steady state, we get:

$$f_k = \frac{\left[ 2\tilde{P}_{rep}(k) + 2 - p_{join}(k) + \frac{r(1-\alpha)}{M} \right]}{2\,(1 + \tilde{P}_{rep}(k))} - \frac{\sqrt{\left[ 2\tilde{P}_{rep}(k) + 2 - p_{join}(k) + \frac{r(1-\alpha)}{M} \right]^2 - 4\,(1 + \tilde{P}_{rep}(k))^2}}{2\,(1 + \tilde{P}_{rep}(k))} \qquad (10)$$

where $\tilde{P}_{rep}(k) = \sum_i p_i(k)$. In practice, it is enough to keep the first three terms in this sum. To first order in $1/r$ we have, in analogy to (6),

$$f_k \approx \frac{(1 + \tilde{P}_{rep}(k))\, M}{(1 - \alpha)\, r} \qquad (11)$$

This expression simply says that the fraction of dead fingers is inversely proportional to the rate of finger stabilizations, $(1 - \alpha) r$, and proportional to how many fingers there are to stabilize, $M$, with the proportionality factor $(1 + \tilde{P}_{rep}(k))$ depending only on $\rho$.

To sum up, the computation of the fraction of dead $k$th finger pointers is analogous to the calculation of the fraction of wrong first successor pointers, albeit a bit more involved. No recursion is involved, in contrast to the calculation of the fraction of wrong higher successor pointers. The above expression (10) matches the simulation results very well (Fig. 13).

E. Cost of Finger Stabilizations and Lookups
In this section, we demonstrate how the information about the failed fingers and successors can be used to predict the cost of stabilizations, lookups or in general the cost for reaching any key in the id space. By cost we mean the number of hops needed to reach the destination including the number of timeouts encountered en route. Timeouts occur every time a query is passed to a dead node. The node does not answer and the originator of the query has to use another finger instead. For this analysis, we consider timeouts and hops to add equally to the cost. We can easily generalize this analysis to investigate the case when a timeout costs some factor $\gamma$ times the cost of a hop.

Define $C_t(r, \alpha)$ (also denoted by $C_t$) to be the expected cost for a given node to reach some target key which is $t$ keys away from it (which means reaching the first successor of this key). For example, $C_1$ would then be the cost of looking up the adjacent key ($1$ key away). Since the adjacent key is always stored at the first alive successor, if the first successor is alive (which occurs with probability $1 - d_1$), the cost will be $1$ hop. If the first successor is dead but the second is alive (occurs with probability $d_1 (1 - d_2)$), the cost will be 1 hop + 1 timeout $= 2$ and the expected cost is $2 \times d_1 (1 - d_2)$, and so forth. Therefore, we have

$$C_1 = 1 - d_1 + 2 \times d_1 (1 - d_2) + 3 \times d_1 d_2 (1 - d_3) + \cdots \approx 1 + d_1 = 1 + 1/(\alpha r).$$

To find the expected cost for reaching a general distance $t$ we need to closely follow the Chord protocol, which would look up $t$ by first finding the closest preceding finger. For the purposes of the analysis, we will find it easier to think in terms of the closest preceding start. Let us hence define $\xi$ to be the start of the finger (say the $k$th) that most closely precedes $t$. Hence $\xi = 2^{k-1} + n$ and $t = \xi + m$, i.e., there are $m$ keys between the sought target $t$ and the start of the closest preceding finger.
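As a minimal sketch of the two ingredients introduced above (not the authors' code), the Python fragment below sums the series for $C_1$ using the leading-order estimate $d_k \approx k/(\alpha r)$ of Eq. (7), and splits a distance $t$ into the start $\xi$ of the closest preceding finger and the remainder $m$. The function names, the truncation of the series at the successor-list length, and the convention of measuring distances relative to node $n$ (so that $n$ drops out of $\xi = 2^{k-1} + n$) are assumptions made for illustration.

```python
def dead_successor_fraction(k, alpha, r):
    """Leading-order d_k = k / (alpha * r), Eq. (7); capped at 1 (assumption)."""
    return min(1.0, k / (alpha * r))


def cost_adjacent_key(alpha, r, succ_list_len):
    """Truncated series for C_1: reaching the j-th successor costs j
    (1 hop plus j-1 timeouts) and requires the first j-1 successors dead."""
    cost = 0.0
    all_previous_dead = 1.0
    for j in range(1, succ_list_len + 1):
        d_j = dead_successor_fraction(j, alpha, r)
        cost += j * all_previous_dead * (1.0 - d_j)
        all_previous_dead *= d_j
    return cost


def closest_preceding_start(t):
    """Split a distance t >= 1 (relative to node n) into (k, xi, m) with
    xi = 2**(k-1) the largest finger start not exceeding t, and t = xi + m."""
    k = t.bit_length()
    xi = 1 << (k - 1)
    return k, xi, t - xi


if __name__ == "__main__":
    print(cost_adjacent_key(alpha=0.5, r=100, succ_list_len=5))
    print(closest_preceding_start(13))
```

For $\alpha = 0.5$ and $r = 100$ the truncated series stays within about $10^{-3}$ of the approximation $1 + 1/(\alpha r) = 1.02$, and a target 13 keys away is split into $k = 4$, $\xi = 8$, $m = 5$.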
With that, we can write a recursion relation for $C_{\xi+m}$ as follows:

$$C_{\xi+m} = C_\xi\, [1 - a(m)] + (1 - f_k)\, a(m) \left[ 1 + \sum_{i=0}^{m-1} bc(i, m)\, C_{m-i} \right] + f_k\, a(m) \left[ 1 + \sum_{i=1}^{k-1} h_k(i) \sum_{l=0}^{\xi/2^i - 1} bc(l, \xi/2^i)\, \bigl( 1 + (i - 1) + C_{\xi_i - l + m} \bigr) + O(h_k(k)) \right] \qquad (12)$$

where $\xi_i \equiv \sum_{m=1}^{i} \xi/2^m$ and $h_k(i)$ is the probability that a node is forced to use its $(k-i)$th finger owing to the death of its $k$th finger. The probabilities $a$, $b$, $bc$ have already been introduced in Section IV, and we define the probability $h_k(i)$ below.

The lookup equation, though rather complicated at first sight, merely accounts for all the possibilities that a Chord lookup will encounter, and deals with them exactly as the protocol dictates.

The first term (Fig. 12 (a)) accounts for the eventuality that there is no node intervening between $\xi$ and $\xi + m$ (occurs with probability $1 - a(m)$). In this case, the cost of looking for $\xi + m$ is the same as the cost of looking for $\xi$.

The second term (Fig. 12 (b)) accounts for the situation when a node does intervene in between (with probability $a(m)$), and this node is alive (with probability $1 - f_k$). Then the query is passed on to this node (with $1$ added to register the increase in the number of hops) and then the cost depends on the length of the distance between this node and $t$.

The third term (Fig. 12 (c)) accounts for the case when the intervening node is dead (with probability $f_k$). Then the cost increases by $1$ (for a timeout) and the query needs to find an alternative lower finger that most closely precedes the target. Let the $(k-i)$th finger (for some $i$, $1 \le i \le k-1$) be such a finger. This happens with probability $h_k(i)$, i.e., $h_k(i)$ denotes the probability that the lookup is passed back to the $(k-i)$th finger either because the intervening fingers are dead or because they share the same finger table entry as the $k$th finger. The start of the $(k-i)$th finger is at $\xi/2^i$ and the distance between $\xi/2^i$ and $\xi$ is equal to $\sum_{m=1}^{i} \xi/2^m$, which we denote by $\xi_i$. Therefore, the distance from the start of the $(k-i)$th finger to the target is equal to $\xi_i + m$. However, note that $fin_{k-i}.node$ could be $l$ keys away (with probability $bc(l, \xi/2^i)$) from $fin_{k-i}.start$ (for some $l$, $0 \le l < \xi/2^i$). Therefore, after making one hop to $fin_{k-i}.node$, the remaining distance to the target is $\xi_i + m - l$. The increase in cost for this operation is $1 + (i - 1)$; the $1$ indicates the cost of taking up the query again by $fin_{k-i}.node$, and the $i - 1$ indicates the cost for trying and discarding each of the $i - 1$ intervening fingers. The probability $h_k(i)$ is easy to compute given property 4.2 and the expression for the $f_k$'s computed in the previous section:

$$h_k(i) = a(\xi/2^i)\, (1 - f_{k-i}) \times \prod_{s=1}^{i-1} \bigl( 1 - a(\xi/2^s) + a(\xi/2^s)\, f_{k-s} \bigr), \quad i < k$$
$$h_k(k) = \prod_{s=1}^{k-1} \bigl( 1 - a(\xi/2^s) + a(\xi/2^s)\, f_{k-s} \bigr) \qquad (13)$$

Fig. 12. Cases that a lookup can encounter with the respective probabilities and costs.

In (13) we account for all the reasons that a node may have to use its $(k-i)$th finger instead of its $k$th finger. This could happen because the intervening fingers were either dead or not distinct. The probabilities $h_k(i)$ satisfy the constraint $\sum_{i=1}^{k} h_k(i) = 1$ since clearly, either a node uses any one of its fingers or it doesn't. This latter probability is $h_k(k)$, that is, the probability that a node cannot use any earlier entry in its finger table. In this case, $n$ proceeds to its successor list. The query is now passed on to the first alive successor and the new cost is a function of the distance of this node from the target $t$. We indicate this case by the last term in (12), which is $O(h_k(k))$. This can again be computed from the inter-node distribution and from the functions $d_k(r, \alpha)$ computed earlier. However, in practice the probability for this is extremely small except for targets very close to $n$. Hence this does not significantly affect the value of general lookups and we ignore it in our analysis.

The cost for general lookups is hence

$$L(r, \alpha) = \frac{\sum_{i=1}^{\mathcal{K}-1} C_i(r, \alpha)}{\mathcal{K}}$$

The lookup equation is solved recursively numerically, given the coefficients and $C_1$. In Fig. 13, we compare theoretical results with simulation for $N = 1000$. It is seen that the theory matches the simulation results very well.

Fig. 13. Theory and simulation for the probability of failure of the $k$th finger $f_k(r, \alpha)$, and the lookup length $L(r, \alpha)$. [Left plot: $f_k(r, \alpha)$ versus rate of stabilisation of fingers / rate of failure ($(1-\alpha) r = (1-\alpha)\lambda_s/\lambda_f$), simulation and theory at $\alpha = 0.5$. Right plot: lookup latency (hops + timeouts) $L((1-\alpha) r)$ versus $(1-\alpha) r$, simulation and theory.]

In Fig. 14 we also show the theoretical predictions for some larger values of $N$. From the structure of Equation 12, it is clear that the dependence of the average lookup on churn comes entirely from the presence of the terms $f_k$. Since $f_k \sim f$ is independent of $k$ for large fingers, we can approximate the average lookup length by the functional form $L(r, \alpha) = A + Bf + Cf^2 + \cdots$. The coefficients $A$, $B$, $C$, etc. can be recursively computed by solving the lookup equation to the required order in $f$ and depend only on $N$ the number of nodes, $1 - \rho$ the density of peers and $b$ the base or equivalently the size of the finger table of each node. The advantage of writing the lookup length this way is that churn-specific details, such as how new joinees construct a finger table or how exactly stabilizations are done in the system, can be isolated in the expression for $f$. If we were to change our stabilization strategy, for example [9], we could immediately estimate the lookup length by plugging in the new expression for $f$ in the above relation.

Fig. 14. Lookup cost, theoretical curve, for 1000, 2000, 4000, 8000 and 16000 peers. The rationale for the fits is explained in the text. [Plot: $L((1-\alpha) r)$ from the lookup equation versus $(1-\alpha) r$, with fitted curves of the form $A + A\,(f + 3f^2)$ for $A = 5.846, 6.346, 6.846, 7.346, 7.846$.]

The coefficient $A$, which is the lookup cost without churn, can be obtained very precisely for any base $b$ from analyzing (12) in the zero-churn case. This analysis is rather laborious and will be presented elsewhere [9]. It confirms the well-known result $A = \frac{1}{2}\log_2 N$ and in addition reproduces small deviations from this behavior previously observed by us in numerical simulations [7]. The values of $A$ in Fig. 14 are taken from this analysis.

$B$ can be qualitatively estimated as follows: every sufficiently long finger is dead with some finite probability $f$ given by (10). If $A$ is the average value of the lookup length without churn, then each look-up encounters $fA$ dead fingers on average. This estimate predicts a look-up cost of approximately $A(1 + f)$, giving $B = A$ and $C$ and all other coefficients equal to $0$.

In Fig. 14 we show that the best fit to the data is obtained in fact by taking $B = A$ and $C = 3A$. The expression for $f$ is taken from (10) for large $k$ (for a system with fingers, the expression for $f_k$ becomes independent of $k$ for $k \geq$ ). In general, as mentioned earlier, $B$ and $C$ can be obtained accurately for any value of the system parameters by the numerical solution of Eq. 12 to the required order.

V. DISCUSSION AND CONCLUSION
In this paper we have presented a detailed theoretical analysis of a DHT-based P2P system, Chord, using a fluid model. The technique for deriving the fluid model has been borrowed from the master equation approach of physics, which helps in systematically taking different dynamical effects into account. This analysis differs from previous theoretical work done on DHTs in that it aims not at establishing bounds, but at a precise determination of the relevant quantities in this dynamically evolving system. From the match of our theory and the simulations, it can be seen that we can predict with an accuracy of greater than in most cases. Though this analysis is not exact, since it takes only some (and not all) correlations into account, it provides a methodology for keeping track of most of the relevant details of the system. We expect that a similar analysis can be done for most other DHTs, thus helping to establish quantitative guidelines for their comparison.

The main conclusions for the analysis of Chord in a statistically steady state are the following.

Property 5.1: As a function of $r$, the ratio of the rate of stabilizations to the rate of failures, the fraction of wrong pointers of any kind (successors or fingers) is, to leading order and good approximation, Const./$r$, where the constant depends on the pointer.

Property 5.2: The probability of break-up of a ring can be estimated from the knowledge of the fraction of wrong first successors, wrong second successors, etc. This probability is generally very low when every node has a sufficient number of successors, indicating that Chord is robust against ring break-up.
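Properties 5.1 and 5.2 can be made concrete with the leading-order expressions of Eqs. (5) to (9). The sketch below is illustrative only and is not the authors' code: the function names are hypothetical, the value $c_1 = 2$ is read off from Eq. (6) at $k = 1$, and all formulas are meaningful only while the resulting fractions remain much smaller than one.

```python
from math import factorial, isclose


def c_coeff(k, c1=2):
    """Coefficient c_k from the recursion of Eq. (5): c_k = c_{k-1} + k * c_1."""
    c = c1
    for j in range(2, k + 1):
        c += j * c1
    return c


def wrong_successor_fraction(k, alpha, r):
    """Leading-order w_k = c_k / (alpha * r); equals k(k+1)/(alpha r), Eq. (6)."""
    return c_coeff(k) / (alpha * r)


def dead_successor_fraction(k, alpha, r):
    """Leading-order d_k = k * d_1 = k / (alpha * r), Eq. (7)."""
    return k / (alpha * r)


def breakup_probability(S, alpha, r):
    """Eq. (8): (S-1)! * sum_i d_i / (alpha r)^(S-1)."""
    total = sum(dead_successor_fraction(i, alpha, r) for i in range(1, S + 1))
    return factorial(S - 1) * total / (alpha * r) ** (S - 1)


def breakup_probability_closed_form(S, alpha, r):
    """Eq. (9): (S+1)! / (2 (alpha r)^S)."""
    return factorial(S + 1) / (2 * (alpha * r) ** S)


def min_successor_list_length(target, alpha, r, max_len=64):
    """Smallest S (up to max_len) whose Eq. (9) estimate is at most `target`."""
    for S in range(1, max_len + 1):
        if breakup_probability_closed_form(S, alpha, r) <= target:
            return S
    return None


if __name__ == "__main__":
    alpha, r = 0.5, 100
    print([c_coeff(k) for k in range(1, 6)])
    print(breakup_probability(3, alpha, r), breakup_probability_closed_form(3, alpha, r))
    # Inconsistent-lookup probability I = w_1 - d_1
    print(wrong_successor_fraction(1, alpha, r) - dead_successor_fraction(1, alpha, r))
    print(min_successor_list_length(1e-5, alpha, r))
```

Under these assumptions, with $\alpha = 0.5$ and $r = 100$ the recursion reproduces $c_k = k(k+1)$, Eqs. (8) and (9) agree, and a successor list of length 4 is the shortest one for which the leading-order break-up estimate drops below $10^{-5}$.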
Property 5.3: At a given value of $r$, the fraction of wrong successors, $w_k$, and the fraction of dead fingers, $f_k$, increase with $k$. The fraction of wrong successors increases indefinitely, and becomes of order one at $k$ about $\sqrt{r}$ for the particular stabilization strategy that we have used. The fraction of dead fingers, on the other hand, tends to a constant for sufficiently large $k$.

Property 5.4: The look-up cost, which is the expected number of hops including time-outs, can be computed by numerical recursion. The fraction of incorrect finger pointers $f_k$ ($\sim f$ for large $k$) is a required input for this recursion. The lookup cost tends to the well-known average number of hops without churn when $f$ is small (or churn is low) and increases when $f$ is large. We show that it can be well described by the formula $A(1 + g(f))$, where $A$ is the value of the lookup cost without churn and $g(f)$ is well approximated by $f + 3f^2$ for $N \ll \mathcal{K}$. In general $g(f)$ can be obtained accurately to any desired order by solving Eq. 12 recursively to the required order in $f$.

Property 5.5: The preceding note brings out the following simple feature of Chord: under any state of churn, sufficiently long fingers are all dead with essentially the same probability. Hence, in a sufficiently large system, a look-up will almost surely encounter one or more dead fingers, leading to time-outs. For applications where time-outs should be the exception and not the norm, this paper helps in estimating how much stabilization is necessary under a given level of churn, to achieve such a level of performance.
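To illustrate how Properties 5.4 and 5.5 might be used in practice, the following sketch evaluates the dead-finger fraction from Eq. (10), its leading-order form Eq. (11), the fitted lookup cost $A(1 + f + 3f^2)$, and the stabilization ratio $r$ needed to reach a target $f$ according to Eq. (11). It is a hedged example rather than the authors' implementation: the function names are hypothetical, $p_{join}(k)$ is set to one as for long fingers, and the numerical values of $\tilde{P}_{rep}$, $M$ and $A$ in the usage lines are placeholders chosen only for illustration.

```python
from math import sqrt


def dead_finger_fraction(r, alpha, M, p_rep_sum, p_join=1.0):
    """Eq. (10): smaller root of the steady-state quadratic for f_k.
    Requires r * (1 - alpha) / M >= p_join so that the root is real."""
    a = 1.0 + p_rep_sum
    b = 2.0 * p_rep_sum + 2.0 - p_join + r * (1.0 - alpha) / M
    return (b - sqrt(b * b - 4.0 * a * a)) / (2.0 * a)


def dead_finger_fraction_leading(r, alpha, M, p_rep_sum):
    """Eq. (11): f_k ~ (1 + P_rep) * M / ((1 - alpha) * r)."""
    return (1.0 + p_rep_sum) * M / ((1.0 - alpha) * r)


def lookup_cost(A, f):
    """Fitted form A * (1 + f + 3 f^2) from Property 5.4."""
    return A * (1.0 + f + 3.0 * f * f)


def required_stabilization_ratio(f_target, alpha, M, p_rep_sum):
    """Invert Eq. (11) for r, given a target dead-finger fraction."""
    return (1.0 + p_rep_sum) * M / ((1.0 - alpha) * f_target)


if __name__ == "__main__":
    # Placeholder parameters (assumptions, not values from the paper).
    r, alpha, M, p_rep_sum, A = 1000.0, 0.5, 20, 0.5, 5.846
    f = dead_finger_fraction(r, alpha, M, p_rep_sum)
    print(f, dead_finger_fraction_leading(r, alpha, M, p_rep_sum))
    print(lookup_cost(A, f))
    print(required_stabilization_ratio(0.06, alpha, M, p_rep_sum))
```

With these placeholder inputs the exact root of Eq. (10) is close to, and slightly below, the leading-order value of Eq. (11), which is the regime in which the simple inversion for $r$ is a reasonable first guide.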
Property 5.6: The preceding note also brings out the additional feature that by writing the lookup cost in the above simplified form, we can isolate the effects of churn-specific details in the expression for $f$. Changing details in the join protocol or changing the maintenance strategy [9] merely causes a change in the expression for $f$. The lookup cost with this new strategy can then be immediately assessed for any $r$, by plugging in the new expression for $f$ in the expression for the lookup cost (as opposed to solving Eq. 12 each time for each value of $r$).

The impact of this work can be summarized as follows: given that periodic stabilization is a fundamental technique for topology maintenance in DHTs, the question "How often should a DHT node perform periodic stabilization?" is of great practical relevance. The answer to this question depends on several factors. First, we need to know where the DHT is deployed, in a LAN, in a cooperative milieu, or among public non-trusting partners, i.e., what is the expected join/failure rate (churn)? Secondly, since DHTs involve different types of stabilizations, we need to know which of these rates is of interest to optimize. For example, in the DHT studied in this paper, there is both ring stabilization as well as finger stabilization. Thirdly, we also need to know whether we have performance goals which require us to know how much stabilization is needed, or constraints on bandwidth which necessitate a knowledge of the expected performance. Previous analytical attempts (see Section II) have addressed these questions through the identification of general (algorithm/system-neutral) bounds on stabilization rates.

In this paper, we have taken another point of view. We have traded off generality for accuracy. That is, we have produced results that can describe to a very high degree of accuracy quantities like the probability of inconsistent look-ups and the expected look-up length as functions of the stabilization and churn rates. Many of the insights we get from this analysis, such as most of the points listed above, would be very hard to come by from simulations alone. So for instance, the formulae produced in this paper could directly be used by a system administrator or the person in charge of deploying a DHT as a guide for configuring stabilization rates. While the results are based on Chord, all analyses concerning the ring (break-up and inconsistency) are applicable to many other systems, since consistent hashing on a ring is a recurring component in many other DHTs.

VI. LIMITATIONS AND FUTURE WORK
The main limitation of this work stems from the fact that the results are inherently dependent on the intricate details of the analyzed algorithms. While some changes in the algorithms can be easily accommodated without redoing the analysis (as explained in 5.6), others such as a different lookup strategy or a different placement of fingers would necessitate recalculating all the quantities again. However, results concerning the ring-related aspects like successor lists, break-up probability and inter-node distributions are likely to be reusable in other variations of the Chord protocols as well as other systems using a ring geometry.

For the future, the authors' research agenda includes the introduction of extensions to the current model to be able to account for locality-awareness and different topology maintenance techniques. Some work towards the latter goal has already been done in [9]. Relatedly, a useful application for this work is to enable systems to dynamically self-tune their stabilization rates and choose the best maintenance technique to achieve a desired hop count.

REFERENCES

[1] Karl Aberer, Anwitaman Datta, and Manfred Hauswirth, Efficient, self-contained handling of identity in peer-to-peer systems, IEEE Transactions on Knowledge and Data Engineering (2004), no. 7, 858–869.
[2] D. Anick, D. Mitra, and M.M. Sondhi, Stochastic theory of data-handling systems with multiple sources, Bell Systems Technical Journal (1982), 1871–1894.
[3] James Aspnes, Zoë Diamadi, and Gauri Shah, Fault-tolerant routing in peer-to-peer systems, Proceedings of the twenty-first annual symposium on Principles of distributed computing, ACM Press, 2002, pp. 223–232.
[4] E. Brockmeyer, H.L. Halstrom, and Arne Jensen, The life and works of A.K. Erlang, The Copenhagen Telephone Company, 1948.
[5] Miguel Castro, Manuel Costa, and Antony Rowstron, Performance and dependability of structured peer-to-peer overlays, Proceedings of the 2004 International Conference on Dependable Systems and Networks (DSN'04), IEEE Computer Society, 2004.
[6] Florence Clévenot and Philippe Nain, A simple fluid model for the analysis of the squirrel peer-to-peer caching system, IEEE INFOCOM 2004, 2004.
[7] Sameh El-Ansary, Erik Aurell, and Seif Haridi, A physics-inspired performance evaluation of a structured peer-to-peer overlay network, The International Conference on Parallel and Distributed Computing and Networks (PDCN 2005), 2005.
[8] Supriya Krishnamurthy, Sameh El-Ansary, Erik Aurell, and Seif Haridi, A statistical theory of chord under churn, The 4th International Workshop on Peer-to-Peer Systems (IPTPS'05) (Ithaca, New York), February 2005.
[9] Supriya Krishnamurthy, Sameh El-Ansary, Erik Aurell, and Seif Haridi, Comparing maintenance strategies for overlays, Tech. report, Swedish Institute of Computer Science, in preparation 2007.
[10] Jinyang Li, Jeremy Stribling, Robert Morris, M. Frans Kaashoek, and Thomer M. Gil, A performance vs. cost framework for evaluating dht design tradeoffs under churn, Proceedings of the 24th Infocom (Miami, FL), March 2005.
[11] David Liben-Nowell, Hari Balakrishnan, and David Karger, Analysis of the evolution of peer-to-peer systems, ACM Conf. on Principles of Distributed Computing (PODC) (Monterey, CA), July 2002.
[12] N.G. van Kampen, Stochastic Processes in Physics and Chemistry, North-Holland Publishing Company, 1981, ISBN-0-444-86200-5.
[13] Dongyu Qiu and R. Srikant, Modeling and performance analysis of bittorrent-like peer-to-peer networks, SIGCOMM'04 (Portland, Oregon), August 2004.
[14] Sean Rhea, Dennis Geels, Timothy Roscoe, and John Kubiatowicz, Handling churn in a DHT, Proceedings of the 2004 USENIX Annual Technical Conference (USENIX '04) (Boston, Massachusetts, USA), June 2004.
[15] Ion Stoica, Robert Morris, David Liben-Nowell, David Karger, M. Frans Kaashoek, Frank Dabek, and Hari Balakrishnan,