Gossip-based Search in Multipeer Communication Networks

Abstract

We study a gossip-based algorithm for searching data objects in a multipeer communication network. All of the nodes in the network are able to communicate with each other. There exists an initiator node that starts a round of searches by randomly querying one or more of its neighbors for a desired object. The queried nodes can also be activated and look for the object. We examine several behavioural patterns of nodes with respect to their willingness to cooperate in the search. We derive mathematical models for the search process based on the balls and bins model, as well as known approximations for the rumour-spreading problem. All models are validated with simulations. We also evaluate the performance of the algorithm and examine the impact of search parameters.


arXiv [cs.NI]

Eva Jaho†, Ioannis Koukoutsidis*, Siyu Tang‡, Ioannis Stavrakakis†, and Piet Van Mieghem‡

† National & Kapodistrian University of Athens, Dept. Informatics and Telecommunications, Ilissia, 157 84 Athens, Greece. E-mail: {ejaho,ioannis}@di.uoa.gr
* University of Peloponnese, Dept. Telecommunications Science and Technology, End of Karaiskaki St., 22100, Tripolis, Greece. E-mail: [email protected]
‡ Delft University of Technology, 2600 GA Delft, The Netherlands. E-mail: {S.Tang,P.F.A.VanMieghem}@tudelft.nl

June 21, 2009

The term 'gossiping algorithm' encompasses any communication algorithm where messages between two nodes are exchanged opportunistically, with the intervention of other nodes that act as betweeners or forwarders of the message. It is inspired by the social sciences, in the same way that epidemic protocols were inspired by the spreading of infectious diseases [1]. These two communication paradigms are very similar, with differences focusing on the different ways that gossiping nodes and infected nodes could behave: gossiping nodes adopt human-like characteristics, while the behaviour of infected nodes is governed by the dynamics of the virus or disease.

Gossiping algorithms are suitable for communication in distributed systems, such as ad-hoc networks and generally systems with peer-to-peer or peer-to-multipeer communication. The latter communication paradigm is followed here, where a node can communicate with multiple peers, usually maintaining a short-time connection with each one. Attractive characteristics of gossiping algorithms include simplicity, scalability and robustness to failures, as well as a speed of dissemination that is easily configurable. Gossiping can be identified with the spreading of rumours in a network, the dynamics of which are investigated in [6, 5]. Additionally, gossiping protocols have been used for the computation of aggregate network quantities, such as sums, averages, or quantiles of certain node values [3].

In all of the above referenced works, gossiping is typically used for the dissemination of information, and performance metrics are oriented towards measuring the efficiency of information dissemination. In this paper, we model a specific gossip-based algorithm that aims at finding files (or data objects, in general) in nodes in a distributed network. The algorithm employs sequentially-generated parallel search procedures, in the following manner: We assume there exists a file in the network that may be located in different nodes.
An initiator node is interested in this file and starts a round of searches to find it, by randomly querying one or more of its peers (neighbours). The queried nodes can also be activated and look for the object. The search is considered successful when at least one copy of the file is found.

Apart from the initiator, the other nodes that assist the search can have different behavioural patterns. We distinguish between cooperative and non-cooperative nodes. Nodes in the first category always become active when queried, and generate query messages themselves in subsequent rounds. Non-cooperative nodes, on the other hand, are unwilling to participate in the search process themselves. We also consider stifler nodes; the term is borrowed from [5] and signifies nodes that were previously active, but from a certain point on lose interest in the dissemination of the query, and thus cease to participate in the search. Hence, it is a special case of cooperation. To avoid confusion in the paper, non-stifler nodes that are non-cooperative are also referred to as plain non-cooperative nodes. In all the above cases cooperation is considered only with respect to participating in the search; if a node has the file it always returns it.

We also derive different versions of the algorithm based on the level of knowledge that each node has about the progress of the search. We consider two extremes: at one extreme, each node has no knowledge whatsoever about the number or identities of nodes that have been previously queried in the network. At the other extreme, each node has complete knowledge about these facts and avoids sending messages to previously queried nodes in subsequent rounds. We call these cases blind search and smart search, respectively.

We mathematically model the blind search process based on a known approximation for the rumour spreading problem [6], which we extend here. The smart search process is modeled using a combinatorial approach, based on a generalization of the balls and bins model [2].
Based on these models, we are able to evaluate the performance of the search, as well as the impact of search parameters. The latter are the number of neighbours queried by a node and the number of copies of the file in the network. By changing these parameters, one can easily configure the speed and efficiency of the search, as will be shown later.

Both the blind and smart versions of the gossiping algorithm have been modeled exactly in [8]. Both information dissemination and search are investigated in that paper, while here we focus on the search process. The main algorithmic difference in [8] is that even the nodes that have the file can be non-cooperative. In this paper, we want to focus only on the effect that cooperation has on the forwarding of the message: only the intermediate nodes forwarding a query can be non-cooperative. In addition, the approximate model presented here for the blind search process is shown to be computationally simpler, while maintaining good accuracy. Finally, in this paper we present the stifling behavioural pattern, which is not included in [8].

The paper is structured as follows. In Section 2, the gossip-based search algorithm is described in more detail, and application scenarios that justify the study of the blind and smart search algorithms are discussed. In Sections 3 and 4 we present the mathematical modeling of the blind and smart search algorithms, respectively. The modeling in these sections covers cooperative and plain non-cooperative nodes. The stifling behavioural pattern is analysed in Section 5. In Section 6, we present results for the performance of the blind search algorithm for very large numbers of nodes, and derive useful scaling laws. The major conclusions from this work and issues for future research are presented in Section 7.

A model of the network in the form of a complete graph is considered. There is an initiator node $I$ and $N-1$ other nodes. A file $f$ is located in $m$ of the other nodes of the graph ($m \le N-1$), and the initiator wants to find it. The initiator starts a search by randomly querying a subset of its neighbours of size $k$ ($k \le N-1$). Queried nodes that become active repeat this procedure in subsequent rounds, and the search continues until $f$ is found.

We consider two search scenarios:

- Blind search: An active node searches "blindly" at each round, possibly querying nodes that have been queried before. This approach can model devices with small computational capabilities that cannot keep a log of queried nodes, or cases where the identities of the devices are not known. It is equally appropriate to model situations with random encounters between nodes. For instance, a number of mobility models have exponential meeting times between mobile nodes (such as the Random Walk, Random Waypoint and Random Direction models, as well as more realistic, synthetic models based on these [7]). In our model, the time until a node is queried approaches a geometric distribution, which is the discrete-time analogue of an exponential distribution.

- Smart search: An active node searches "smartly" at each round, by avoiding nodes that have been queried before either by itself or by other nodes. This demands knowledge of the identities of all queried nodes, and has a larger overhead compared to the blind search case. We do not define the exact algorithm by which the identities of all queried nodes are made known to an active node. We only assume that this knowledge can be obtained at a cost that is small compared to the cost of searching, and use this case mainly as a reference for the efficiency of the blind search algorithm. It is evident that smart search corresponds to the fastest version of the algorithm. Although it can be hard and costly to implement, there can exist schemes that approximate its performance. For example, a low-cost algorithm that could approximate smart search can be based on the routine that, at each peer-to-peer communication, nodes exchange the lists of peers they have queried.
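The blind search scenario described above can be explored with a small Monte Carlo sketch. This is an illustrative reading of the model, not the authors' simulator: the function name `simulate_blind_search`, the parameter `max_rounds`, and the choice to report the number of nodes that were active in the successful round are all assumptions made here. Each active node queries $k$ distinct peers chosen uniformly among the other $N-1$ nodes, and each queried inactive node becomes active with probability $c$.

```python
import random


def simulate_blind_search(n_nodes, k, m, c=1.0, rng=None, max_rounds=10_000):
    """Run one blind search on a complete graph.

    Node 0 is the initiator; the file sits on m of the other n_nodes - 1 nodes.
    Returns (round in which the file is found, nodes active in that round).
    """
    rng = rng or random.Random()
    others = list(range(1, n_nodes))
    holders = set(rng.sample(others, m))      # nodes holding a copy of the file
    active = {0}                              # only the initiator is active at first
    for round_no in range(1, max_rounds + 1):
        queried = set()
        for node in active:
            # Blind: k distinct peers, chosen without regard to earlier queries.
            peers = [x for x in range(n_nodes) if x != node]
            queried.update(rng.sample(peers, k))
        if queried & holders:                 # a node with the file always returns it
            return round_no, len(active)
        for node in queried - active:
            if rng.random() < c:              # cooperate with probability c
                active.add(node)
    raise RuntimeError("file not found within max_rounds")


if __name__ == "__main__":
    rng = random.Random(1)
    runs = [simulate_blind_search(50, 1, 1, c=0.8, rng=rng) for _ in range(2000)]
    print("mean rounds:", sum(r for r, _ in runs) / len(runs))
```

Averaging over many runs gives empirical counterparts of the mean number of rounds and of active nodes that the analytical models aim to predict.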

Approximate blind search model with cooperative or non-cooperative nodes

Each node that receives a search query will cooperate to forward the query with probability $c$ ($0 \le c \le 1$). The search proceeds in rounds $r = 1, 2, \ldots$ until the file is found. If at step $r$ there are $\hat{A}(r)$ active nodes then, provided the file is not yet found, the probability of finding it at the $r$th step, $S(r)$, is

$$S(r) = 1 - (1 - p_s)^{\hat{A}(r)}, \tag{1}$$

where $p_s$ is the probability that a single search (consisting of $k$ different random queries) succeeds.

To find $p_s$, notice that the problem is equivalent to the one where, in a set of $N-1$ nodes, there are $m$ marked nodes and we randomly select a group of $k$ nodes. We want to find the probability that at least one marked node is selected. The probability that our selection returns exactly $u$ marked nodes ($u \le \min(m, k)$) is

$$p_u = \frac{\binom{m}{u}\binom{N-1-m}{k-u}}{\binom{N-1}{k}}.$$

Indeed, the marked nodes can be chosen in $\binom{m}{u}$ different ways, the unmarked ones in $\binom{N-1-m}{k-u}$ ways, and the total number of ways to select $k$ nodes is $\binom{N-1}{k}$. Further, $p_s = 1 - p_0$, therefore

$$p_s = 1 - \frac{\binom{N-1-m}{k}}{\binom{N-1}{k}}. \tag{2}$$

The probability of finding the file $f$ at the $r$th step is

$$p(r) = S(r) \prod_{i=1}^{r-1} (1 - S(i)). \tag{3}$$

This formula is an approximation because it implicitly assumes that the rounds are independent of each other.

A deterministic approximation $\hat{A}(r)$ for the number of active nodes in each round can be found using the method presented in [6], which is extended to $k$ neighbours that are cooperative with probability $c$. Consider the process $\{I(r), r \ge 1\}$ of the number of inactive nodes in each round. Given that at round $r$ there are $i(r)$ inactive nodes and $A(r)$ active ones, the mean number of inactive nodes at round $r+1$ will be

$$E[I(r+1)] = i(r)\left[\left(1-\frac{k}{N-1}\right)^{A(r)} + \left[1-\left(1-\frac{k}{N-1}\right)^{A(r)}\right](1-c)\right] = i(r)\left[(1-c) + c\left(1-\frac{k}{N-1}\right)^{A(r)}\right].$$
For small $k$ and large $N$, we use a second-order expansion of $\left(1-\frac{k}{N-1}\right)^{A(r)}$, so that

$$\left(1-\frac{k}{N-1}\right)^{A(r)} \approx e^{-A(r)\left(\frac{k}{N-1}+\frac{k^2}{2(N-1)^2}\right)},$$

and

$$E[I(r+1)] = i(r)\left[(1-c) + c\,e^{-A(r)\left(\frac{k}{N-1}+\frac{k^2}{2(N-1)^2}\right)}\right].$$

From this, by assuming $I(r) = i(r)\ \forall r$, we derive the deterministic approximation

$$I(r+1) = I(r)\left[(1-c) + c\,e^{-A(r)\left(\frac{k}{N-1}+\frac{k^2}{2(N-1)^2}\right)}\right]. \tag{4}$$

Using that $A(r) = N - I(r)$, we finally obtain the recursion

$$A(r+1) = Nc + A(r)(1-c) - (N - A(r))\,c\,e^{-A(r)\left(\frac{k}{N-1}+\frac{k^2}{2(N-1)^2}\right)}, \tag{5}$$

with $A(1) = 1$. Since $A(r)$ is not an integer in general, we round it to the nearest integer, which we denote by $\hat{A}(r) = [A(r)]$.

Following a similar approach as in [6], it can be shown that the distribution of $I(r+1)$, given $i(r)$, is indeed concentrated sharply around $i(r)\left[(1-c) + c\exp\left(-A(r)\left(\frac{k}{N-1}+\frac{k^2}{2(N-1)^2}\right)\right)\right]$, and that the approximation becomes more accurate as $k/N \to 0$. We can then compute the mean number of rounds until the file is found and the mean number of nodes $A$ involved in the search (activated nodes):

$$E[r] = \sum_{r=1}^{\infty} r\,p(r), \tag{6}$$

$$E[A] = \sum_{r=1}^{\infty} \hat{A}(r+1)\,p(r). \tag{7}$$

(Notice that $\hat{A}(r+1)$ nodes will be activated approximately at round $r$.)

For numerical calculations, as an upper bound on the support of $r$ we take

$$r_{\max} = \min\left\{ r : \prod_{i=1}^{r-1} (1 - S(i)) < \epsilon \right\}, \tag{8}$$

where $\epsilon$ is a number close to zero.

Remark. In [8], we have derived an exact model for a slightly different version of the blind search algorithm. Generally, the exact approach for modeling the search algorithm requires the calculation of the $N \times N$ transition matrix $Q$, where the $(i, j)$-th value is the probability of going from $i$ to $j$ active nodes in one round. Then $Q^r(1, i)$ denotes the probability that there are $i$ active nodes in $r$ rounds.

The probability of finding the file in $r$ rounds, denoted here by $B(r)$, can be calculated as

$$B(r) = \sum_{i=1}^{N} \left[1 - \frac{\binom{N-i}{m}}{\binom{N-1}{m}}\right] Q^r(1, i). \tag{9}$$

This is a probability distribution, therefore the probability of finding the file exactly at round $r$ is given by

$$B(r) - B(r-1). \tag{10}$$

In the Appendix, we compare the complexity of the two models and examine the accuracy of the approximate one. It is shown that the reduction in computational cost is of the order of $O(N)$, while the relative accuracy of the approximation is higher than 95% in the majority of cases (the comparison holds when $c = 1$).

We validate our approximation by means of simulation. The simulations were performed with 100 instances of random file positions, and 100 random executions of the search in each instance, leading to a total of $10^4$ repetitions in each experiment. We evaluate the mean number of rounds and the mean number of activated (infected) nodes until at least one copy of the file $f$ is found, varying the number of nodes $N$ in the graph, the cooperation probability $c$ and the parameters $k$ and $m$. The value of $\epsilon$ in (8) was set to $10^{-[\ldots]}$. Results for different cases are shown in Fig. 1. These figures illustrate that the simulation results match those from the theoretical analysis very well, except for large values of $k$. As the size of the network increases, the number of active nodes increases linearly, while the increase in the mean number of rounds is sublinear, at a decreasing rate, as shown in Fig. 1(a) and 1(b). This implies that the number of rounds can be well fitted using a logarithmic function, as will be shown more clearly in Section 6.

We examine the impact of the search parameters $k$, $m$ in Fig. 1(c)-1(f). Note that the search can become faster by increasing either $m$ or $k$. By comparing Fig. 1(c) with 1(e), we note that the increase in speed is higher for large values of $k$. However, Fig. 1(d) and 1(f) show that increasing $k$ has the disadvantage of increasing the number of active nodes, and thus produces higher communication overhead.
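The approximate blind search model of equations (1)-(8) lends itself to a short numerical sketch. The code below is an illustration under stated assumptions, not the authors' implementation: the function name `blind_search_model` and the default `eps` are hypothetical, the recursion (5) is iterated on the unrounded $A(r)$ with rounding applied only when $\hat{A}(r)$ is needed, and the loop stops once the probability of not yet having found the file drops below `eps`, a stopping rule modelled on (8).

```python
from math import comb, exp


def blind_search_model(n_nodes, k, m, c=1.0, eps=1e-9, max_rounds=100_000):
    """Return (E[r], E[A]) for the approximate blind search model."""
    # Eq. (2): success probability of a single search of k distinct peers.
    p_s = 1.0 - comb(n_nodes - 1 - m, k) / comb(n_nodes - 1, k)
    # Second-order exponent used in eqs. (4) and (5).
    rate = k / (n_nodes - 1) + k**2 / (2 * (n_nodes - 1) ** 2)

    def step(a):
        # Eq. (5): deterministic update of the number of active nodes.
        return n_nodes * c + a * (1 - c) - (n_nodes - a) * c * exp(-a * rate)

    a = 1.0                 # A(1) = 1: only the initiator is active
    not_found = 1.0         # running product of (1 - S(i)) over earlier rounds
    mean_rounds = 0.0
    mean_active = 0.0
    for r in range(1, max_rounds + 1):
        a_hat = int(a + 0.5)                    # nearest integer, A_hat(r)
        s_r = 1.0 - (1.0 - p_s) ** a_hat        # eq. (1)
        p_r = s_r * not_found                   # eq. (3)
        a = step(a)                             # now holds A(r + 1)
        mean_rounds += r * p_r                  # eq. (6)
        mean_active += int(a + 0.5) * p_r       # eq. (7), uses A_hat(r + 1)
        not_found *= 1.0 - s_r
        if not_found < eps:                     # truncation in the spirit of eq. (8)
            break
    return mean_rounds, mean_active


if __name__ == "__main__":
    for c in (1.0, 0.8, 0.5):
        print(c, blind_search_model(50, 1, 1, c=c))
```

With $c = 0$ no queried node is ever activated, so the sketch reduces to a geometric distribution with mean $1/p_s$, which gives a quick sanity check on the implementation.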
For example, calculations based on the simulation results show that for $c = 1$ and $N = 50$ nodes, increasing $m$ to 3 yields a relative decrease in the mean number of rounds by 31%, and in the mean number of active nodes by 45%. On the other hand, increasing $k$ to 3 yields a higher relative decrease in the mean number of rounds by 48%, but an increase in the mean number of active nodes by 14%. Overall, we remark that increasing the number of queried neighbours results in a great redundancy in the number of nodes that participate in the search with only small gains in speed.

An analysis for the smart search process is presented below. An approximate analysis similar to the one for the blind search model fails here, due to the varying probabilities of successful query. Instead we adopt a direct combinatorial approach, by considering a generalization of the occupancy (balls and bins) problem [2].

The generalized balls and bins problem is defined as follows: In a population of $n$ bins, suppose we randomly distribute $r$ groups of $k$ balls, such that in each group no two balls go in the same bin, and successive distributions of groups of balls are independent. We want to find the probability that exactly $v$ bins remain empty, where $v = 0, 1, \ldots, n-k$ (it is assumed that $n > k$).

Footnote: It is noted that the improvement in accuracy by using the exact expression $\left(1-\frac{k}{N}\right)^{A(r)}$, or adding more terms in its expansion, is negligible.
Figure 1: Analytical and simulation results for the mean number of rounds and mean number of active nodes with varying network parameters, for the blind search algorithm. Panels (a), (b): versus number of nodes ($k=1$, $m=1$); panels (c), (d): versus $m$ ($N=50$, $k=1$); panels (e), (f): versus $k$ ($N=50$, $m=1$). Each panel shows analysis and simulation curves for $c = 1$, $0.8$ and $0.5$.

We follow the approach in [2, Section IV.2] for the classical occupancy problem (where $k = 1$). The total number of ways to distribute $r$ groups of balls in the way described above is $\binom{n}{k}^r$. Similarly, the total number of ways to assign them to $n-1$ bins is $\binom{n-1}{k}^r$, so that the probability that one given bin is empty is $\binom{n-1}{k}^r / \binom{n}{k}^r$. Generally, the probability that $v$ given bins are empty is $\binom{n-v}{k}^r / \binom{n}{k}^r$.

Therefore, the probability that at least one bin is empty is, by the inclusion-exclusion method,

$$\sum_{i=1}^{n-k} (-1)^{i-1} \binom{n}{i} \frac{\binom{n-i}{k}^r}{\binom{n}{k}^r}.$$
The probability that all bins are occupied, denoted by $p_0(r, k, n)$, is

$$p_0(r, k, n) = 1 - \sum_{i=1}^{n-k} (-1)^{i-1} \binom{n}{i} \frac{\binom{n-i}{k}^r}{\binom{n}{k}^r} = \sum_{i=0}^{n-k} (-1)^{i} \binom{n}{i} \frac{\binom{n-i}{k}^r}{\binom{n}{k}^r}. \tag{11}$$

Consider now the case where exactly $v$ non-given bins are empty. These $v$ bins can be chosen in $\binom{n}{v}$ different ways. The $k$ balls of each of the $r$ groups are distributed among the remaining $n-v$ bins such that exactly $n-v$ are occupied. The number of such distributions is

$$\binom{n-v}{k}^r p_0(r, k, n-v).$$

Dividing by the total number of possible configurations $\binom{n}{k}^r$ we obtain the probability $p_v(r, k, n)$ that exactly $v$ bins are empty:

$$p_v(r, k, n) = \binom{n}{v} \frac{\binom{n-v}{k}^r}{\binom{n}{k}^r} \sum_{i=0}^{n-k-v} (-1)^{i} \binom{n-v}{i} \frac{\binom{n-v-i}{k}^r}{\binom{n-v}{k}^r}. \tag{12}$$

Based on (12), we find transition probabilities of the form $p(x_i, x_j)$, which denotes the probability that if at a certain round of the algorithm there are $x_i$ active nodes, then at the next round there will be $x_j$ active ones ($x_j \ge x_i$). It is emphasized here that each round corresponds to one transition. In our terminology, "at (or in) a certain round" will have the meaning "after the transition that occurred in this round and before the next transition". The first round marks the transition from 1 active node (the initiator) to a maximum number of $k+1$ active nodes.

Since there are no repetitions for the smart search, the transition probabilities can be found by directly applying (12), substituting $n = N - x_i$, $v = N - x_j$, and $r = x_i$:

$$p(x_i, x_j) = p_{N-x_j}(x_i, k, N-x_i). \tag{13}$$

From (12), (13) we have

$$p(x_i, x_j) = \binom{N-x_i}{N-x_j} \frac{\binom{x_j-x_i}{k}^{x_i}}{\binom{N-x_i}{k}^{x_i}} \sum_{\ell=0}^{x_j-x_i-k} (-1)^{\ell} \binom{x_j-x_i}{\ell} \frac{\binom{x_j-x_i-\ell}{k}^{x_i}}{\binom{x_j-x_i}{k}^{x_i}}. \tag{14}$$

Each of the $x_j - x_i$ queried nodes will decide whether or not to be cooperative independently with probability $c$. The probability that $\alpha$ out of $x_j - x_i$ nodes will actually be activated is

$$B(x_j - x_i, \alpha, c) \stackrel{\text{def}}{=} \binom{x_j - x_i}{\alpha} c^{\alpha} (1-c)^{x_j - x_i - \alpha}.$$

Therefore, the probability that there will be $x_i + \alpha$ active nodes in the next round is

$$p(x_i, x_i + \alpha) = \sum_{x_j - x_i \ge \alpha} p(x_i, x_j)\, B(x_j - x_i, \alpha, c). \tag{15}$$

Based on the transition probabilities, we construct the $N \times N$ transition matrix $Q$ with entries $p(x_i, x_j)$ for $i, j = 1, \ldots, N$. The value of the $i$th element of the first row of the matrix $Q^r$ is the probability that there are $i$ active nodes at round $r$.

With $v$ active nodes, the probability that at least one of them finds a copy of the file is $1 - (1 - p_s(v))^v$. The probability $S(r)$ of finding the file by (and including) round $r$ is

$$S(r) = \sum_{v} Q^{(r-1)}(1, v) \left(1 - (1 - p_s(v))^{v}\right), \tag{16}$$

where $p_s(v)$ is the probability that a search by a single node finds a copy of the file, given that there are already $v$ active nodes.

To find $p_s(v)$, we take a similar approach as in Section 3. It is

$$p_s(v) = 1 - \frac{\binom{N-v-m}{k}}{\binom{N-v}{k}}. \tag{17}$$

Finally, the probability of finding the file at the $r$-th round is given by (3). We emphasize that this formula is again an approximation, since it is assumed that the rounds are independent of each other.

Based on the above distribution, we easily derive the expected number of rounds until a file is found, as in (6).
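The occupancy probability (12) and the substitution (13) can be checked numerically on small instances. The following sketch is illustrative only: the function names `p_empty`, `p_empty_brute_force` and `smart_transition` are hypothetical, exact rational arithmetic is used to avoid cancellation in the alternating sum, and the brute-force enumeration is only feasible for very small $n$ and $r$.

```python
from fractions import Fraction
from itertools import combinations, product
from math import comb


def p_empty(v, r, k, n):
    """Eq. (12): P(exactly v of n bins stay empty) after r groups of k balls.

    The factor C(n-v, k)^r in (12) cancels, leaving one alternating sum.
    """
    total = Fraction(comb(n, k)) ** r
    inner = sum(
        (-1) ** i * comb(n - v, i) * comb(n - v - i, k) ** r
        for i in range(0, n - k - v + 1)
    )
    return comb(n, v) * Fraction(inner) / total


def p_empty_brute_force(v, r, k, n):
    """Same probability by enumerating every placement of the r groups."""
    groups = list(combinations(range(n), k))
    hits = sum(
        1
        for choice in product(groups, repeat=r)
        if n - len(set().union(*choice)) == v
    )
    return Fraction(hits, len(groups) ** r)


def smart_transition(x_i, x_j, k, n_nodes):
    """Eq. (13): x_i active nodes query k unqueried nodes each (case c = 1)."""
    return p_empty(n_nodes - x_j, x_i, k, n_nodes - x_i)


if __name__ == "__main__":
    n, k, r = 5, 2, 3
    for v in range(0, n - k + 1):
        print(v, p_empty(v, r, k, n), p_empty_brute_force(v, r, k, n))
```

Summing `smart_transition(x_i, x_j, k, N)` over all reachable $x_j$ should give one, which is a convenient check when assembling the matrix $Q$.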
The mean number of nodes activated during the search process is

$$E[A] = \sum_{r=1}^{\infty} E[\alpha(r)]\, p(r), \tag{18}$$

where $E[\alpha(r)]$ is the mean number of active nodes in round $r$, derived from the distribution $Q^r(1, \cdot)$. (For the smart search, the summation index in (6), (18) is upper bounded by $\lceil (N-1)/k \rceil$.)

We take both analytical and simulation results for the same values of the parameters $N$, $c$, $k$ and $m$ as for the blind search algorithm. The simulations are conducted for the same number of repetitions as for blind search. Results are shown in Fig. 2. Generally, the model is extremely accurate for $c = 1$, but as the cooperation probability decreases it starts to deviate from the simulated behaviour. For small values of $c$, we remark that the model is less accurate than our model for blind search, even though it follows a combinatorial approach that is exact up to (16). We attribute this to the fact that the intermediate search steps are more correlated for the smart search algorithm. Thus, as is intuitively reasonable, the assumption of independence over rounds leads to worse results for the smart search than for the blind search case.

Footnote: Notice that (16) is not a cdf, so we cannot use $S(r) - S(r-1)$ to find the probability of successful search at round $r$.

Figure 2: Analytical and simulation results for the mean number of rounds and mean number of active nodes with varying network parameters, for the smart search algorithm. Panels (a), (b): versus number of nodes ($k=1$, $m=1$); panels (c), (d): versus $m$ ($N=50$, $k=1$); panels (e), (f): versus $k$ ($N=50$, $m=1$). Each panel shows analysis and simulation curves for $c = 1$, $0.8$ and $0.5$.

The same observations hold regarding the effect of the cooperation probability, the number of queried neighbours and the number of copies of the file, as in the blind search case. It is again emphasized that the behaviour with respect to increasing $k$ is an outcome of the tradeoff between the increased speed of discovery and the redundancy in the total number of messages sent.

We notice that there is only a small performance improvement of the smart over the blind search algorithm, expressed through the decrease in the mean number of rounds. This improvement becomes less pronounced as $k$ or $m$ increase. For example, based on the simulation results for $N = 50$ and $c = 1$, the relative reduction of smart search in the mean number of rounds is 13% when $k = 1, m = 1$, 3% when $k = 1, m = 3$, and 5% when $k = 3, m = 1$.
This improvement is relatively larger when the cooperation probability decreases: for $N = 50$ and $c = 0.5$, the corresponding reductions were 27%, 16% and 6%.

However, the mean number of active nodes may be greater for the smart search, due to the fact that we always query only inactive nodes. This was true in most of the derived results. For $N = 50$ nodes, the relative increase reached up to 10% for $c = 1$ and $k = 1, m = 3$, while for $c = 0.$[…] $k = 1, m = 3$. For the values of $N = 50$ and $c = 0.5$, only a slight decrease of 1% was observed when $k = 3, m = 1$.

The overall results illustrate that when comparing the two cases, the smart search does not offer a significant improvement. This leads us to the conclusion that if the overhead incurred in the smart search algorithm for informing all active nodes of the identities of queried nodes is not negligible compared to that of the search procedure, it is highly likely that there is not much to gain by such a scheme.

Another behavioural pattern that we consider is stifling. In this pattern, each of the nodes that are (or become) active at a certain round may cease to be active and not participate in the search process any more. This could express a node's loss of interest in spreading the query message further in the network.

We analyse this stifling behaviour based on the assumption that at each round of the search, each active node may become a stifler independently with probability $s$. A node that becomes a stifler is considered as inactive, and in a blind search it may become active again with probability $1 - s$, if queried. We consider that the initiator does not become a stifler, so the number of active nodes will always be greater than zero.

This stifling behaviour will be modeled only for the blind search case, based on our approximate method. In order to model the smart search case, one has to discriminate between active and queried nodes, which leads to a multi-dimensional Markov chain that is not easily amenable to analysis.

Given that there are $i(r)$ inactive nodes at round $r$, we are interested in finding the mean number of inactive nodes at round $r + 1$.
This consists of the mean number of active nodes at round $r$ that became inactive (excluding the initiator) and the mean number of inactive nodes at round $r$ that remained inactive. Hence,

$$E[I(r+1)] = (A(r) - 1)s + i(r)\left[\left(1-\frac{k}{N-1}\right)^{A(r)} + \left[1-\left(1-\frac{k}{N-1}\right)^{A(r)}\right]s\right] = (A(r) - 1)s + i(r)\left[s + (1-s)\left(1-\frac{k}{N-1}\right)^{A(r)}\right].$$

Assuming $I(r) = i(r)\ \forall r$, and using again that $\left(1-\frac{k}{N-1}\right)^{A(r)} \approx e^{-A(r)\left(\frac{k}{N-1}+\frac{k^2}{2(N-1)^2}\right)}$ and $A(r) = N - I(r)$, we finally obtain the deterministic approximation

$$A(r+1) = 1 + (N-1)(1-s) - (N - A(r))(1-s)\,e^{-A(r)\left(\frac{k}{N-1}+\frac{k^2}{2(N-1)^2}\right)}, \tag{19}$$

with $A(1) = 1$.

From there we can follow similar steps as in Section 3 to find performance measures of interest. Results based on simulation and the analytical approximation are shown in Fig. 3, for different values of the parameters $m$, $k$, and the stifling probability $s$. As $s$ increases, the performance of the search algorithm deteriorates.

Figure 3: Mean number of rounds (a) and mean number of active nodes upon discovery (b) with $k = 1$, $m = 1$ for the blind search algorithm with stiflers, versus number of nodes. Analysis and simulation curves are shown for $s = 0$, $0.2$, $0.4$, $0.6$ and $0.8$.

We observe that the approximate model follows the simulated behaviour very well, except for larger deviations in the mean number of rounds when the stifling probability is high. This is mainly due to the higher relative error that results from the rounding operation (see the analysis of Section 3). As the stifling probability gets higher, the number of active nodes in the network is rounded to one in the model, and therefore the mean number of rounds approaches the inverse of the probability of a successful query of this node (geometric distribution). For example, when $k = 1$, the mean number of rounds approaches the value of $(N-1)/m$.

A comparison of the speed of blind search between the stifling and plain non-cooperative case shows that the search algorithm performs worse in the presence of stifler nodes.
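The recursion (19) for the stifling case is straightforward to iterate. The sketch below is a minimal illustration; the function name `active_nodes_with_stiflers` and its return format are assumptions made here, and no rounding is applied to the trajectory.

```python
from math import exp


def active_nodes_with_stiflers(n_nodes, k, s, rounds):
    """Iterate recursion (19) and return [A(1), ..., A(rounds)]."""
    # Second-order exponent shared with the non-stifling model.
    rate = k / (n_nodes - 1) + k**2 / (2 * (n_nodes - 1) ** 2)
    a = 1.0                               # A(1) = 1: the initiator
    trajectory = [a]
    for _ in range(rounds - 1):
        a = (
            1
            + (n_nodes - 1) * (1 - s)
            - (n_nodes - a) * (1 - s) * exp(-a * rate)
        )
        trajectory.append(a)
    return trajectory


if __name__ == "__main__":
    for s in (0.0, 0.4, 0.8):
        print(s, [round(a, 2) for a in active_nodes_with_stiflers(50, 1, s, 12)])
```

Two limiting cases help to check the formula: for $s = 0$ the update coincides with recursion (5) at $c = 1$, and for $s = 1$ only the initiator ever stays active, so $A(r) = 1$ for all $r$.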
We may remarkfrom the results that the relative increase in the number of rounds when nodes behave as stiﬂers– rather than as plain non-cooperative nodes – becomes greater when the number of nodes in the12 M ean nu m be r o f r ound s Number of nodes(k=1, m=1)smart search (s=0)blind search (s=0)smart search (s=0.4)blind search (s=0.4)smart search (s=0.8)blind search (s=0.8) (a) M ean nu m be r o f a c t i v e node s upon d i sc o v e r y Number of nodes(k=1, m=1)smart search (s=0)blind search (s=0)smart search (s=0.4)blind search (s=0.4)smart search (s=0.8)blind search (s=0.8) (b)

network increase, or when the stifling probability increases. Since stifling is the opposite of cooperation, it makes sense to examine dual values of s and c, i.e. values such that s = 1 − c. For k = 1, m = 1, and N = 50, the mean number of rounds increases by 9% for nodes that behave as stiflers with probability s = 0.2, compared to plain non-cooperative nodes with c = 0.8. The relative increase is 5% when N = 10. When s = c = 0.5, the corresponding relative increase is much greater, amounting to 78%.

This difference becomes smaller as the search becomes faster, i.e. when either k or m increases. Regarding the relative influence of the parameters k and m on the efficiency of the search, the same observations hold as in all previous cases.

The mean number of active nodes calculated here is the mean number of nodes that are active upon discovery of the file; nodes that were active only in earlier rounds are not counted. This number is therefore only indicative of the communication overhead, since it leaves out the nodes that were active in intermediate rounds of the algorithm and their communication costs. Generally, our findings show that this number is much smaller than in the plain non-cooperative case, where active nodes remain in that state until the end. For k = 1, m = 1, and N = 50, the number of active nodes upon discovery is decreased by 39% in the stifling case with s = 0.[...] c = 0.[...]

[...] k = 1, m = 1, with different values of the stifling probability and increasing number of nodes.

We observe that smart search can yield a reduction in the number of rounds needed to discover the file, which becomes significant for high values of the stifling probability. For s = 0.[...] N = 50, the relative reduction is 37%. However, interestingly, for smaller values it may also yield a slight increase (see the curves in Fig. 4(a) for s = 0.[...] s > s [...] (s = 0.[...] s = 0.8. Therefore the highest gain in speed does not imply the highest reduction in redundant active nodes, and vice versa.

Figure 4: Comparison of smart search against blind search with stiflers for k = 1, m = 1, with different values of the stifling probability and increasing number of nodes.
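To make the plain non-cooperative baseline referred to above concrete, the following is a minimal Monte Carlo sketch of blind search in a complete graph. It is an illustration under assumptions that this section does not spell out: node 0 is the initiator, the m copies are placed uniformly at random on the other nodes, every active node queries k distinct random nodes per round with no memory of earlier queries, a queried inactive node becomes active with probability c, and active nodes stay active until the file is found. The function names and the round structure are hypothetical and are not taken from the paper's models; stifling is not modelled.

```python
import random


def blind_search_rounds(n, k=1, m=1, c=1.0, rng=random):
    """Simulate one blind search; return (rounds, active nodes at discovery).

    Assumed model (not the paper's exact one): complete graph on n nodes,
    node 0 initiates, m file copies sit on random other nodes, and active
    nodes never become inactive (plain non-cooperative profile).
    """
    holders = set(rng.sample(range(1, n), m))  # nodes holding a copy
    active = {0}                               # the initiator starts the search
    rounds = 0
    while True:
        rounds += 1
        queried = set()
        for node in active:
            # Blind search: any other node may be queried, even repeatedly.
            others = [v for v in range(n) if v != node]
            queried.update(rng.sample(others, k))
        if queried & holders:
            return rounds, len(active)
        # Queried nodes without the file cooperate with probability c.
        for v in queried - active:
            if rng.random() < c:
                active.add(v)


def mean_rounds(n, trials=2000, seed=0, **params):
    """Average number of rounds over independent simulated searches."""
    rng = random.Random(seed)
    total = sum(blind_search_rounds(n, rng=rng, **params)[0]
                for _ in range(trials))
    return total / trials
```

Comparing, for example, `mean_rounds(50, c=1.0)` with `mean_rounds(50, c=0.2)` shows how lower cooperation slows the search in this sketch; the values it produces need not match the figures reported in the paper.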

The low-complexity approximate model we have developed for the blind search algorithm enables us to study its performance in networks with very large numbers of nodes. We have obtained results for networks with up to 10^[...] nodes, for both behavioural profiles: plain non-cooperative and stifling. In Fig. 5, we plot the mean number of rounds and the mean number of active nodes as a function of N for plain non-cooperative nodes, while Fig. 6 shows the corresponding results for stifling nodes.

[Figure 5 plots: blind search (k=1, m=1), curves for c = 1, 0.8, 0.6, 0.4, 0.2; x-axis: number of nodes; y-axis: (a) mean number of rounds, (b) mean number of active nodes]

Figure 5: Scaling performance of blind search in the plain non-cooperative case, for k = 1, m = 1: (a) mean number of rounds, (b) mean number of active nodes

The x-axis in all plots is in log scale. Fig. 5(b) and 6(b) are in log-log scale. In Fig. 6(a), the curve for s = 0.[...] y-axis.

From these results, we observe that the scaling performance of blind search is remarkably simple. In the plain non-cooperative case, the mean number of rounds increases linearly with log N, while the mean number of active nodes increases linearly with N. This holds for almost the whole range of values of c. (A more accurate estimate would be a piecewise linear function, with a slightly smaller slope for N < [...] s < [...]. [...] s, the mean number of rounds increases proportionately to the increase in the number of nodes, while the mean number of active nodes upon discovery approaches 1.

[Figure 6 plots: blind search (k=1, m=1), curves for s = 0, 0.2, 0.4, 0.6; x-axis: number of nodes; y-axis: (a) mean number of rounds, (b) mean number of active nodes upon discovery]

Figure 6: Scaling performance of blind search in the stifling case, for k = 1, m = 1: (a) mean number of rounds, (b) mean number of active nodes upon discovery

Based on the observed behaviour, we can easily derive fitted functions using the least squares method. For example, for k = 1, m = 1 and c = 1, E[r] = 0.629 log N + 0.[...] E[A] = 0.[...] N + 0.[...]

This paper has focused on the mathematical modeling of the gossip-based search algorithm in a complete graph. We focused mainly on the blind search algorithm, which is totally ignorant of previous queries and can be implemented very easily, and compared its performance with that of a smart search algorithm, in which previously queried nodes are avoided.

Several conclusions can be drawn concerning trade-offs between speed, cost, and redundancy between “blind” and “smart” gossip-based search algorithms. We have investigated two extreme cases; many intermediate algorithms can be studied with different levels of knowledge of previous queries, trading speed of discovery against additional communication and processing overhead. An important observation is that speed also trades off against the redundancy in the number of query messages needed to locate a file. Our results strongly indicate that, when nodes have a plain non-cooperative profile, the additional overhead of designing a “smarter” algorithm is not worthwhile, since apart from the additional communication and processing cost, it induces high redundancy in the number of messages in the network.

For both the blind and smart search cases, we showed that the mean number of active nodes and the mean number of rounds increase roughly linearly with the number of nodes in the network and with its logarithm, respectively. For the blind search algorithm, we were able to confirm this behaviour for very large numbers of nodes, using the approximate model, which has very low complexity.

Another important observation concerns the relative impact on the search of the number of peers queried by each node, and of the number of copies of the data object in the network. Increasing either parameter increases the speed of discovery. The relative increase is greater when the number of queried peers increases, in both the blind and smart search algorithms. However, the corresponding increase in the number of active nodes that (in most cases) ensues is inappropriate, and hence it is preferable to keep the value of this parameter very small. The gossip-based search algorithm performs better when the requested data object is spread to many nodes in the network.

Other useful remarks from this research concern the effects of different behavioural profiles: cooperative, plain non-cooperative and stifling, with different degrees of cooperation. Stifling has a greater negative impact on the search performance, which becomes worse in large networks.

Future research issues we envisage are mostly related to the performance evaluation of the gossip-based search algorithm. The most significant direction is to examine the efficiency of the search in different types of networks. From the complete graph studied here, we can pass to more general graphs that are often encountered, such as Erdős-Rényi graphs, or graphs with power-law degree distribution (scale-free networks). Performance results in such networks, where each node is connected to different peers, will give more realistic evidence of the algorithm's efficiency and applicability. Finally, it is interesting to compare the performance of the algorithm with other distributed search schemes, in terms of speed and implementation cost.

References

[1] P. T. Eugster, R. Guerraoui, A.-M. Kermarrec, and L. Massoulié. Epidemic information dissemination in distributed systems. IEEE Computer, 37(5):60–67, 2004.

[2] W. Feller. An Introduction to Probability Theory and Its Applications, Volume 1. Wiley, January 1968.

[3] D. Kempe, A. Dobra, and J. Gehrke. Gossip-based computation of aggregate information. Foundations of Computer Science, Annual IEEE Symposium on, 0:482, 2003.

[4] J.-M. Muller. Elementary Functions, Algorithms and Implementation. Birkhäuser, Boston, 1997.

[5] M. Nekovee, Y. Moreno, G. Bianconi, and M. Marsili. Theory of rumour spreading in complex social networks. Physica A, 374:457, Jul 2008.

[6] B. Pittel. On spreading a rumor. SIAM J. Appl. Math., 47(1):213–223, 1987.

[7] T. Spyropoulos, K. Psounis, and C. S. Raghavendra. Efficient routing in intermittently connected mobile networks: the single-copy case. IEEE/ACM Trans. Netw., 16(1):63–76, 2008.

[8] S. Tang, E. Jaho, I. Stavrakakis, I. Koukoutsidis, and P. V. Mieghem. Modeling gossip-based content propagation and search in distributed p2p overlay. Computer Networks (under submission), 2009.

A Comparison between the approximate and the exact model for the blind search algorithm

We compare the approximate model for the blind search algorithm with the exact model developed in [8], when the cooperation probability c = 1. The two models are compared in terms of complexity and accuracy.

A.1 Comparison of complexity

We compare the complexity of the two models based on the computational cost of deriving the probability of locating the file at a certain round r, given by (3) in the approximate model and by (9) in the exact model. The computational cost is measured by the number of elementary steps needed to derive the location probability, where each step consists of a small number of elementary operations (addition, subtraction, multiplication or division). We use the O-notation as the asymptotic upper bound of the complexity.

To compute (3), equations (2) and (1) must be computed first. Equation (2) involves two binomial coefficients, which can be computed in O(kN) time using the well-known linear recursion formula. The recursive formula (5) involves a small number of multiplications and additions, and an exponential function. The exponential can be calculated easily by splitting the exponent into integer and fractional parts (the latter can be computed to high accuracy with only a few terms of a Taylor expansion, see e.g. [4]). Hence each execution of the recursion has order one, and computing A(r) or Â(r) takes time O(r). It holds that Â(r) ≤ N. Therefore, the computation of (1) and (3), given their input parameters, takes time O(N) and O(r) respectively, since the first involves the computation of Â(r) polynomials and the second of r polynomials, all of degree one.
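The linear recursion mentioned above is Pascal's rule, C(j, i) = C(j−1, i) + C(j−1, i−1). The following minimal Python sketch (the function name `binomial` is ours, not the paper's) keeps a single row of k + 1 entries and updates it n times, which gives an operation count of order kn, in line with the O(kN) bound quoted above.

```python
def binomial(n, k):
    """Binomial coefficient C(n, k) computed with Pascal's recursion."""
    if k < 0 or k > n:
        return 0
    # row[i] holds C(j, i) for the current j; start with j = 0.
    row = [1] + [0] * k
    for j in range(1, n + 1):
        # Update right to left so row[i - 1] still holds C(j - 1, i - 1).
        for i in range(min(j, k), 0, -1):
            row[i] += row[i - 1]
    return row[k]
```

Only additions are used, so the cost is counted in the same elementary steps as in the comparison above.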
The total complexity of deriving the probability of locating the file at round r using the approximate model is therefore O(kN + r).

To find the probability of finding at least one copy of the file with the exact model, we need to calculate the r-th power of the transition matrix Q and then solve equations (9) and (10) sequentially. The first computation involves multiplying N × N transition probability matrices, each multiplication having complexity at worst $O(N^3)$. For sufficiently large r, we can consider the sequence of matrices $Q, Q^2, Q^4, \ldots, Q^{2^k}$ instead of the sequence $Q, Q^2, \ldots, Q^r$, since the former reaches the required power in considerably fewer multiplications. We can therefore compute the matrix power in $O(\ln(r) N^3)$ steps. The computation of (9) involves two binomial coefficients with lower index m, each of which has complexity O(mN), so solving (9) takes a number of steps that is polynomial in m and N. The total computational complexity is thus dominated by the complexity of computing the matrix power, which is $O(\ln(r) N^3)$ in our case.

A.2 Comparison of accuracy

We next compare the relative accuracy of the approximate model in calculating the mean number of steps to find at least one copy of the file and the mean number of nodes activated in the search. The relative accuracy is calculated as (1 − |exact − approx| / exact) · 100%, and is reported with two decimal digits. Recall that the comparison is done for a cooperation probability equal to one. Results are shown in Table 1 below, where N stands for the total number of nodes in the network (including the initiator).

These results confirm that the approximation becomes more accurate as the number of nodes in the network increases. The greatest inaccuracy is observed for a relatively large (compared to N) number of queried neighbours k, as was also indicated in Fig. 1(f). Generally, the model proves to be credible, with a relative accuracy higher than 95% in the majority of the cases in Table 1.

Footnote (linear recursion formula for binomial coefficients): for $j \ge i > 0$, $\binom{j}{i} = \binom{j-1}{i} + \binom{j-1}{i-1}$, with $\binom{j}{0} = \binom{j}{j} = 1$.

(a) Mean number of rounds

                 N = 10    N = 20    N = 30    N = 40    N = 50
k = 1, m = 1     94.07     97.23     98.94     99.60     100
k = 1, m = 3     93.97     97.00     98.45     99.18     99.[...]
k = 3, m = 1     96.16     98.77     99.89     99.[...]  [...]

(b) Mean number of nodes activated in the search

                 N = 10    N = 20    N = 30    N = 40    N = 50
k = 1, m = 1     92.60     95.78     96.12     96.34     97.[...]
k = 1, m = 3     95.43     95.97     94.81     95.25     96.[...]
k = 3, m = 1     86.69     90.51     93.63     94.98     95.97
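As a small worked illustration of the relative accuracy formula used in this comparison, the Python sketch below evaluates (1 − |exact − approx| / exact) · 100% and rounds the result to two decimals. The function name and the sample inputs are hypothetical and are not taken from Table 1.

```python
def relative_accuracy(exact, approx):
    """Relative accuracy of approx with respect to exact, in percent."""
    # Mirrors (1 - |exact - approx| / exact) * 100%, reported to two decimals.
    return round((1 - abs(exact - approx) / exact) * 100, 2)


# Hypothetical values: an exact mean of 4.0 rounds against an approximate 3.8.
print(relative_accuracy(4.0, 3.8))
```

An approximation that coincides with the exact value gives 100, and the measure decreases linearly with the absolute error relative to the exact value.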