Statistical privacy-preserving message dissemination for peer-to-peer networks
arXiv [cs.NI]
David Mödinger, Jan-Hendrik Lorenz, Franz J. Hauck
Institute of Distributed Systems, Ulm University, Ulm, Germany
Institute of Theoretical Computer Science, Ulm University, Ulm, Germany
* [email protected]
Concerns about the privacy of communication are widely discussed in research and in society at large. For the public financial infrastructure of blockchains, this discussion encompasses the privacy of transaction data and its broadcasting throughout the network. To tackle this problem, we transform a discrete-time protocol for contact networks over infinite trees into a computer network protocol for peer-to-peer networks. Peer-to-peer networks are modelled as organically growing graphs. We show that the distribution of shortest paths in such a network can be modelled using a normal distribution N(µ, σ²). We determine statistical estimators for µ and σ via multivariate models. The model behaves logarithmically in the number of nodes n and proportionally to an inverse exponential in the number of added edges k. These results facilitate the computation of optimal forwarding probabilities during the dissemination phase for optimal privacy in a limited information environment.
An increasing number of data breaches and growing media coverage have led to a heightened awareness of privacy concerns among researchers and laypersons alike. Especially in financial contexts, such as cryptocurrencies, engineers and researchers have produced many privacy-improving proposals, either improving privacy on otherwise non-privacy-preserving systems [1] or implementing new systems with privacy-first practices [2–4].

Financial transactions are anonymized in these systems, but they still require broadcasting to all participants. This required broadcast provides additional challenges for privacy [5–9]. Attempts to solve these weaknesses include established protocols such as Tor or I2P [10] or new protocols [11–13]. These new protocols are tailor-made for broadcast applications, which was not a goal of classical protocols such as Tor.

Our previous proposal [13] considered adaptive diffusion [14] as an intermediate privacy-providing phase. Unfortunately, the attacker and network model of adaptive diffusion is not suitable for real-world computer networks.
February 9, 2021

To overcome these obstacles, in this paper we:
• Derive optimal forwarding strategies for adaptive diffusion based on a one-dimensional random walk, using the distribution of shortest paths of the underlying network.
• Model the distribution of shortest paths in k-growing graphs.
• Provide an approximation for this model based on the number of participants n and edges k.
• Transform adaptive diffusion into a network protocol for networks following a k-growing model.

In Section 3, we describe the scenario of this paper, as well as relevant background information, including the original adaptive diffusion protocol. Section 4 contains our resulting transformation of adaptive diffusion and its properties. In Sections 5 to 7 we discuss the details of the required changes to the protocol. Section 5 considers the changes in privacy and network assumptions of adaptive diffusion for the transformation to a network protocol. In Section 7 we derive an estimation of the shortest paths in expected networks to determine a concrete implementation of the probabilities involved with the protocol.
In this paper, we discuss the privacy of broadcasts within an unstructured peer-to-peer network. For some applications, e.g., broadcasts of financial transactions in a blockchain network, the sender of a broadcast message has an interest in not being revealed, even though the main goal is for everyone to receive the message. The goal is, therefore, to hide the originator of such a message.

The default solution to broadcasting a message in an unstructured network is a flood and prune broadcast. Here, the sender sends the message to all its neighbours. A node that has not received the message yet will send it to all of its neighbours, excluding the link over which it received the message. Broadcasting in this way produces a highly symmetrical dissemination pattern, leading to possible identification attacks.

Assume there are nodes which collaborate to identify the originator of such a message. Those nodes might be distributed throughout the network and can learn the topology of the network over time. These nodes can reliably estimate the identity of the sender of the message by determining the graph centre, or Jordan centre, of the nodes that already received the message. The centre of the graph is the node which has the smallest distance to all affected nodes. Here, the affected nodes are those that already received the message.

While varying network latencies might distort the result, the set of likely originators is small. Lastly, an attacker might also create connections to all nodes, always receiving the message as a neighbour of the true source.
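The centre-based estimate described above can be illustrated with a short Python sketch. The adjacency-dictionary representation, the function names and the toy path graph are assumptions made for this illustration only. The centre is taken as the affected node with the smallest maximum distance to all other affected nodes, distances are measured in the full (connected) topology, and flood and prune is idealised as synchronous rounds without latency.

```python
from collections import deque


def bfs_distances(adj, start):
    """Hop distances from start to every reachable node."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in adj[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


def flood(adj, source, rounds):
    """Nodes reached by an idealised flood and prune after `rounds` synchronous rounds."""
    return {node for node, d in bfs_distances(adj, source).items() if d <= rounds}


def jordan_centres(adj, affected):
    """Affected nodes whose maximum distance to any affected node is minimal."""
    affected = set(affected)
    eccentricity = {}
    for node in affected:
        dist = bfs_distances(adj, node)
        eccentricity[node] = max(dist[other] for other in affected)
    best = min(eccentricity.values())
    return sorted(node for node, e in eccentricity.items() if e == best)


# Toy topology (assumption): a path 0 - 1 - 2 - 3 - 4 - 5 - 6 with source 3.
adj = {i: [j for j in (i - 1, i + 1) if 0 <= j <= 6] for i in range(7)}
affected = flood(adj, source=3, rounds=2)
print(jordan_centres(adj, affected))  # prints [3]
```

In this symmetric toy case the estimate is exact; with latencies or partial observations an attacker would obtain a small candidate set instead.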
An unstructured peer-to-peer network does not determine the connections of its participants in an identification space. Such a network is usually built by new nodes connecting to either a name service or publicly known participants of the network. These provide a list of known participants via some gossip protocol. The new participants choose a number of those nodes and establish a connection to them. This number of connections is often fixed in the client source code.

To model this behaviour via graphs, we use an establishing algorithm. Let n nodes, or vertices, try to establish such a network, where each node establishes k edges, or connections, to previously existing nodes. No node establishes loops, i.e., edges with itself, or multiple edges. The result is, therefore, a simple graph. In this paper, we call this a k-growing graph with n nodes.

To introduce adaptive diffusion, we also require the concept of an infinite d-regular tree. Such a graph has no cycles, i.e., between any two nodes there exists exactly one path. Further, each node has a degree of d, i.e., each node is connected to exactly d neighbours via d edges. As the tree is infinite, there is no number n limiting the number of nodes and no maximum distance within the graph, often called the graph diameter. For a more in-depth introduction to graphs, see Jackson's Social and Economic Networks [15].

For this paper, we revisit some statistical fundamentals in the form of probability distributions. As we make heavy use of it, we focus on the normal distribution.

Probability distributions can be separated into two categories: discrete and continuous distributions [16, 17]. A discrete probability distribution is one that only takes a countable set of values. Continuous distributions, on the other hand, have a support (points where they are not zero) of uncountable size, e.g., the real numbers from zero to infinity.
In quite a few cases, a continuous distribution is better suited to model a discrete problem, with the result transformed back into the discrete space.

One of the most famous representatives of the continuous distributions is the so-called normal distribution, designated by N(µ, σ²), where µ represents the expected value and σ² the variance of the distribution. The probability distribution is defined by its probability density function (PDF)

$$\mathrm{PDF}(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}}.$$

The cumulative distribution function (CDF), the integral of the PDF, of the normal distribution is often written as

$$\mathrm{CDF}(x) = \frac{1}{2}\left[1 + \operatorname{erf}\left(\frac{x-\mu}{\sigma\sqrt{2}}\right)\right],$$

where erf represents the error function defined by

$$\operatorname{erf}(x) = \frac{2}{\sqrt{\pi}} \int_{0}^{x} e^{-t^{2}}\, dt.$$

The support, i.e., the non-zero values of the PDF, of the normal distribution is (−∞, ∞), i.e., all of the real numbers R. As this can be problematic for some applications, there exists the truncated normal distribution, limiting the distribution to some interval [a, b] ⊂ R. This requires re-normalisation of the result, as the integral of the PDF of the normal distribution over [a, b] is smaller than 1.

Another distribution based on the normal distribution is the log-normal distribution. Here, the domain is transformed, changing the default support to (0, +∞). Informally, this distribution is useful when the logarithm of the data is normally distributed.

Lastly, we would like to mention the Weibull distribution. The Weibull distribution is part of the family of extreme value distributions, modelling maxima or minima, with a support of [0, +∞).

Adaptive diffusion [14] is a protocol developed initially for infinite d-regular trees in a discrete-time model, i.e., with fixed steps. It breaks the symmetry present in regular flood-and-prune broadcasts by creating a virtual source token.
The owner of the virtual source token spreads the message in such a way that they are the Jordan centre of the graph of nodes that received the message so far.

Further, the owner of the virtual source token may forward the token. This forwarding allows the overall probability of any node being the original source to equalise, i.e., if n nodes received the message, the probability of any node $v_i$ that received the message being the true source v should be approximately $P[v_i = v] \approx \frac{1}{n}$. The probability of forwarding depends on the number of previous forwards h, the current step number t and the degree of the underlying tree d. Using these parameters, the probability of forwarding the virtual source token is derived by Fanti et al. as

$$
p_d(t, h) =
\begin{cases}
\dfrac{t - 2h + 2}{t + 2} & \text{if } d = 2, \\[2ex]
\dfrac{(d-1)^{t/2 - h + 1} - 1}{(d-1)^{t/2 + 1} - 1} & \text{if } d > 2.
\end{cases}
\qquad (1)
$$

The algorithm realising the protocol is given by Algorithm 1 and Algorithm 2 as it was presented originally.

Algorithm 1 Adaptive Diffusion as described by Fanti et al. [14].
Require: contact network G = (V, E), source v′, time T, degree d
Ensure: set of infected nodes V_T
    V_0 = {v′}, h = 0, v_0 = v′
    v′ selects one of its neighbors u at random
    V_1 = V_0 ∪ {u}, h = 1, v_1 = u
    let N(u) represent u's neighbors
    V_2 = V_1 ∪ N(u) \ {v′}, v_2 = v_1
    t = 3
    for t ≤ T do
        v_{t−1} selects a random variable X ∼ U(0, 1)
        if X ≤ p_d(t − 1, h) then
            for v ∈ N(v_{t−1}) do
                Infection Message(G, v_{t−1}, v, V_t)
            end for
        else
            v_{t−1} randomly selects u ∈ N(v_{t−1}) \ {v_{t−2}}
            h = h + 1
            v_t = u
            for v ∈ N(v_t) \ {v_{t−1}} do
                Infection Message(G, v_t, v, V_t)
                if t + 1 > T then
                    break
                end if
                Infection Message(G, v_t, v, V_t)
            end for
        end if
        t = t + 2
    end for

Algorithm 2 Infection Message algorithm as described by Fanti et al. [14].
Require: contact network G = (V, E), source v′, time T, degree d
Ensure: set of infected nodes V_T
    if v ∈ V_t then
        for w ∈ N(v) \ {u} do
            Infection Message(G, v, w, V_t)
        end for
    else
        V_t = V_{t−2} ∪ {v}
    end if

Although this approach is designed for cycle-free networks, Fanti et al. [14] provide evidence that it works well even for general networks. For a network protocol, adaptive diffusion poses a few challenges. A suitably powerful attacker can subvert the protocol by connecting to a large number of nodes, as a node informs all neighbours of new messages, reducing the privacy guarantees to distance one. Further, later messages deliver the hop count h of current forwards, eliminating all nodes of a distance not equal to h. Lastly, the discrete-time model needs to be transformed into a continuous-time model of real-world computing systems.

k-Growing η-Adaptive Diffusion

The general idea is still that of adaptive diffusion. The virtual source should forward messages so that it is the Jordan centre of the sub-graph created from all nodes that received the message. In detail, we apply some modifications to the protocol.

First, we limit the spread, i.e., the number of neighbours involved in the dissemination, to η neighbours. This change is reflected in the message handling sub-protocol, Algorithm 3, as nodes need to select neighbours. Further, as arbitrary networks may have multiple paths between nodes, a node must only react to messages received via the proper path.

Algorithm 3 η-adaptive diffusion message handling algorithm.
Input: Message m
Environment: Message sender v, Self u
    if v_{p,m} = ∅ then
        N = randomly select η neighbours out of N(u) \ {v}
        v_{p,m} = v
    else
        if v_{p,m} = v then
            send m to all N
        end if
    end if

The virtual source sub-protocol, Algorithm 4, requires further changes. The true source uses the message (v, t = 1, r = H(m)) to initiate the protocol. The hop counter h has been dropped, as it leaks the distance to the true source to possible attackers.

Algorithm 4 η-Adaptive Diffusion virtual source handling algorithm.
Input: Previous virtual source v_p, round identifier r, current timestep t
Environment: Neighbours N with |N| = η + 1, depth d
    for v ∈ N \ {v_p} do
        Send m to v
        if t + 1 ≤ d and t > … then
            Send m to v
        end if
    end for
    while t ≤ d do
        t = t + 1
        x ∼ U(0, 1)
        if x ≤ p_t then
            v_next ∼ U{N \ {v_p}}
            Send (v_self, t, r) to v_next, to call Algorithm 4
            break
        else
            Wait for ≈ …
            for v ∈ N do
                Send m to v
            end for
        end if
    end while

The virtual source token is forwarded with probability p_t and kept with probability 1 − p_t. The probability p_t can be computed based on the distance distribution f within the network, i.e., f(i) gives the expected number of nodes in distance i of any node. The exact computation is quite involved, compare Section 5 for details.

After a suitable threshold is reached for the privacy of the originator, i.e., the set of potential originators is large enough, the protocol switches to a flood and prune broadcast. This ensures delivery to all participants and increases efficiency, as further privacy gains are low, or non-existent once the full network is part of the set of potential originators.

To preserve the privacy of participants, nodes must monitor network latencies, as they have to artificially slow down the protocol when keeping a virtual source token. Further, every virtual source node must monitor the network for the progress of the protocol. A time-out will trigger retransmission to a different participant, as the previously selected one is considered as refusing cooperation or unreachable. The time-out will extend on messages received through the protocol but only stop when receiving a flood and prune message.

Some generalisations arise when considering general computer networks instead of infinite tree graphs. General networks may have cycles, i.e., multiple paths between participants, and a non-regular distance distribution.
To extend the model of adaptive diffusion to these circumstances, we replace the calculations based on properties of a tree with a more general distribution function f, i.e., there are f(i) nodes with distance i.

To prevent an attacker from learning additional information about the originator, we have to modify some aspects of the protocol. First, we have to remove the hop count h used in the protocol, as an attacker can otherwise infer the exact distance to the originator. As other participants may not know the distance distribution of the originator, and to keep the protocol general, we use a homogeneous distribution, i.e., all nodes use the same distribution f to compute their probabilities.

To model the process of passing the virtual source token, we use a time-inhomogeneous Markov chain, i.e., the probabilities involved may change based on the time t. For a network of diameter ∅, the chain has ∅ + 1 states 0, 1, …, ∅. Each state represents the current distance of the virtual source from the true source.
Fig 1. Time-inhomogeneous Markov chain of passing the virtual source token.

A node of distance h to the true source should pass the virtual source token to a more distant node with probability $p_t(h)$. Alternatively, the node keeps the distance the same with probability $1 - p_t(h)$. The Markov chain with these properties is visualised in Fig. 1.

As noted before, a participant may not know its actual distance h to the true source, so the probabilities $p_t(h)$ may not depend on the distance to the true source.

At time t, let the i-th row of the vector $P_t \in [0,1]^{t}$ describe the probability of the virtual source token being with a node of distance i from the true source. We have $P_1 = (1)$, as the true source has distance 0 to itself and has the token initially. Further, let $M_t \in [0,1]^{(t+1) \times t}$ be the column-stochastic matrix describing the transition from the t-th to the (t + 1)-th step, i.e., $P_{t+1} = M_t P_t = M_t M_{t-1} \cdots M_1 P_1$. Based on our Markov model, the matrix $M_t$ has the form:

$$
M_t = \begin{pmatrix}
1 - p_t(0) & 0 & \cdots & 0 \\
p_t(0) & 1 - p_t(1) & \ddots & \vdots \\
0 & p_t(1) & \ddots & 0 \\
\vdots & \ddots & \ddots & 1 - p_t(t-1) \\
0 & \cdots & 0 & p_t(t-1)
\end{pmatrix}.
$$

To solve for the probabilities $p_t(h)$, we define our goal state: the probability for any reachable node should be

$$\frac{1}{\sum_{s=0}^{t-1} f(s)}.$$

Using this, we can describe the probability of a node of distance i having the token at step t as

$$f_t(i) = \frac{f(i)}{\sum_{s=0}^{t-1} f(s)}.$$

Using this, we can write the goal of equal probability as

$$
P_t = \begin{pmatrix} f_t(0) \\ f_t(1) \\ \vdots \\ f_t(t-1) \end{pmatrix}
= \frac{1}{\sum_{s=0}^{t-1} f(s)} \begin{pmatrix} f(0) \\ f(1) \\ \vdots \\ f(t-1) \end{pmatrix}
= M_{t-1} M_{t-2} \cdots M_1 P_1.
$$

Unfortunately, the number of restrictions does not necessarily allow for a single solution perfectly fulfilling our goal. We can compute a possible solution $p_t$ based on the last row of our transition equation.

$$
M_{t-1} P_{t-1} = \begin{pmatrix}
1 - p_{t-1}(0) & & \\
p_{t-1}(0) & \ddots & \\
 & \ddots & 1 - p_{t-1}(t-2) \\
 & & p_{t-1}(t-2)
\end{pmatrix}
\begin{pmatrix} f_{t-1}(0) \\ f_{t-1}(1) \\ \vdots \\ f_{t-1}(t-2) \end{pmatrix}
= \begin{pmatrix} f_t(0) \\ f_t(1) \\ \vdots \\ f_t(t-1) \end{pmatrix} = P_t
$$

First line:

$$
\begin{aligned}
(1 - p_{t-1}(0))\, f_{t-1}(0) &= f_t(0) \\
\Leftrightarrow\quad 1 - p_{t-1}(0) &= \frac{f_t(0)}{f_{t-1}(0)} \\
\Leftrightarrow\quad p_{t-1}(0) &= 1 - \frac{f_t(0)}{f_{t-1}(0)} \\
\Leftrightarrow\quad p_t(0) &= 1 - \frac{f_{t+1}(0)}{f_t(0)}
\end{aligned}
$$

(i + 1)-th line:

$$
\begin{aligned}
f_t(i) &= p_{t-1}(i-1)\, f_{t-1}(i-1) + (1 - p_{t-1}(i))\, f_{t-1}(i) \\
\Leftrightarrow\quad (1 - p_{t-1}(i))\, f_{t-1}(i) &= f_t(i) - p_{t-1}(i-1)\, f_{t-1}(i-1) \\
\Leftrightarrow\quad p_{t-1}(i) &= 1 - \frac{f_t(i) - p_{t-1}(i-1)\, f_{t-1}(i-1)}{f_{t-1}(i)} \\
\Leftrightarrow\quad p_t(i) &= 1 - \frac{f_{t+1}(i) - p_t(i-1)\, f_t(i-1)}{f_t(i)}
\end{aligned}
$$

By induction we arrive at:

$$
\begin{aligned}
\underbrace{f_t(i)\, p_t(i)}_{a_i} &= f_t(i) - f_{t+1}(i) + \underbrace{p_t(i-1)\, f_t(i-1)}_{a_{i-1}} \\
\Leftrightarrow\quad f_t(i)\, p_t(i) &= \sum_{j=0}^{i} \left( f_t(j) - f_{t+1}(j) \right) \\
\Leftrightarrow\quad p_t(i) &= \frac{\sum_{j=0}^{i} \left( f_t(j) - f_{t+1}(j) \right)}{f_t(i)}
\end{aligned}
$$

As the actual distance of a node from the origin is unknown, we have to determine a single probability. As the distribution over h is known (it is our desired state $f_t$), we can combine it with the precomputed probabilities per distance. This achieves a single forwarding probability:

$$p_t = \sum_{h=0}^{t-1} f_t(h)\, p_t(h).$$

A node which did not forward the token could recompute the forwarding probability using its expected distance from the previous round to achieve better hiding.
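The closed form for $p_t(i)$ and the combined probability $p_t$ can be checked numerically. The following Python sketch is an illustration under stated assumptions: the function names are hypothetical, and the example distribution f (node counts per distance in a 3-regular tree) is chosen only because its values are easy to verify by hand. Exact fractions are used so that the comparison with the goal state is not affected by rounding.

```python
from fractions import Fraction


def goal_state(f, t):
    """Return [f_t(0), ..., f_t(t-1)]: f restricted to distances below t, normalised."""
    total = sum(f[:t])
    return [Fraction(f[i], total) for i in range(t)]


def forwarding_probabilities(f, t):
    """Per-distance probabilities p_t(i) = sum_{j<=i} (f_t(j) - f_{t+1}(j)) / f_t(i)."""
    f_t, f_next = goal_state(f, t), goal_state(f, t + 1)
    probs, acc = [], Fraction(0)
    for i in range(t):
        acc += f_t[i] - f_next[i]  # running sum over j = 0..i
        probs.append(acc / f_t[i])
    return probs


def apply_transition(state, probs):
    """Compute M_t P_t: mass at distance i moves to i + 1 with probability probs[i]."""
    nxt = [Fraction(0)] * (len(state) + 1)
    for i, (mass, p) in enumerate(zip(state, probs)):
        nxt[i] += (1 - p) * mass
        nxt[i + 1] += p * mass
    return nxt


def combined_probability(f, t):
    """Single forwarding probability p_t = sum_h f_t(h) * p_t(h)."""
    return sum(w * p for w, p in zip(goal_state(f, t), forwarding_probabilities(f, t)))


# Hypothetical distance distribution: node counts per distance in a 3-regular tree.
f = [1, 3, 6, 12, 24]
for t in range(1, 5):
    probs = forwarding_probabilities(f, t)
    reached = apply_transition(goal_state(f, t), probs)
    print(t, [str(p) for p in probs], reached == goal_state(f, t + 1))
```

For this f every $p_t(i)$ lies in [0, 1] and each transition reproduces the next goal state exactly; for other distributions the probabilities can leave [0, 1], in which case the goal state is not reachable in a single step.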
The ideal solution holds if and only if the next state $P_t$ is reachable from $P_{t-1}$ by a single increase or stay. The condition can be formalised with the following requirements, derived from the solution:

$$0 \le \frac{f_t(0)}{f_{t-1}(0)} \le 1 \qquad (2)$$

$$0 \le \frac{\sum_{j=0}^{i} \left( f_t(j) - f_{t+1}(j) \right)}{f_t(i)} \le 1 \qquad (3)$$

For $f(i) > 0$, Equation (2) always holds, since

$$\frac{f_t(0)}{f_{t-1}(0)} = \frac{f(0) \big/ \sum_{s=0}^{t-1} f(s)}{f(0) \big/ \sum_{s=0}^{t-2} f(s)} = \frac{\sum_{s=0}^{t-2} f(s)}{\sum_{s=0}^{t-1} f(s)} \le 1.$$

Equation (3) intuitively describes that the probability of a node in distance j possessing the token cannot exceed the probability of a node of the same distance possessing the token in the previous time step, in addition to the total change in lower distances.

If this condition is violated, we need to compensate in the distribution or the probabilities. Either way, the resulting distribution will provide non-optimal hiding. To minimise the deviation, we determine the final desired state of the protocol after t steps, with t ≤ ∅. We then compute a new $P'_i$, for all i ≤ t, as

$$P'_i = \begin{pmatrix} f'_i(0) \\ \vdots \\ f'_i(i-1) \end{pmatrix}.$$

Here, f′ is derived from f as:

$$
f'_t(i) = \begin{cases}
f_t(i) & \text{if } t \text{ is the max desired state,} \\
f_t(i) + \max(\chi_{t,i}, \delta_{t,i}) & \text{otherwise,}
\end{cases}
$$

$$\chi_{t,i} = \sum_{j=i+1}^{t} \left( f_t(j) - f'_t(j) \right),$$

$$\delta_{t,i} = f'_{t+1}(t) - f_t(i) + \sum_{j=i+1}^{t-1} \left( f'_{t+1}(j) - f'_t(j) \right).$$

The value $\delta_{t,i}$ represents the difference required to fulfil Equation (3). On the other hand, $\chi_{t,i}$ represents all changes made to later entries, i.e., propagating the changes made through δ.

Note that Equation (3) is equivalent to the following.

$$
0 \le \frac{\sum_{j=0}^{i} \left( f_t(j) - f_{t+1}(j) \right)}{f_t(i)} \le 1
\quad\Leftrightarrow\quad
0 \le \sum_{j=0}^{i} \left( f_t(j) - f_{t+1}(j) \right) \le f_t(i)
$$

Applying this equation to our goal state f′, and using $\sum_{j=0}^{k-1} f_k(j) = 1$, we find the generation of f′ through the following changes:

$$
\begin{aligned}
f'_t(i) &\ge \sum_{j=0}^{i} \left( f'_t(j) - f'_{t+1}(j) \right) \\
&= \underbrace{\sum_{j=0}^{i} f'_t(j)}_{= 1 - \sum_{j=i+1}^{t-1} f'_t(j)} - \underbrace{\sum_{j=0}^{i} f'_{t+1}(j)}_{= 1 - \sum_{j=i+1}^{t} f'_{t+1}(j)} \\
&= 1 - \sum_{j=i+1}^{t-1} f'_t(j) - \left( 1 - \sum_{j=i+1}^{t} f'_{t+1}(j) \right) \\
&= -\sum_{j=i+1}^{t-1} f'_t(j) + \sum_{j=i+1}^{t} f'_{t+1}(j) \\
&= -\sum_{j=i+1}^{t-1} f'_t(j) + \sum_{j=i+1}^{t-1} f'_{t+1}(j) + f'_{t+1}(t) \\
&= \sum_{j=i+1}^{t-1} \left( f'_{t+1}(j) - f'_t(j) \right) + f'_{t+1}(t).
\end{aligned}
$$

This leaves us with only known values, allowing us to compute the minimum difference required, i.e., $\delta_{t,i}$. Using these results, we showed that if it is possible to achieve an optimal result, probabilities derived from f′ are optimal. If such a result is not possible, probabilities derived from f′ will yield a result with as little deviation as possible for intermediate steps.

All previous discussions are in discrete time, i.e., the time t advances in steps indexed by natural numbers. A network protocol must operate in some form of continuous time, however. Fortunately, network protocols lend themselves to a simple conversion technique: network latency.

If there was no delay between messages, a token transfer to another node could be observed by all participants of the protocol so far. To prevent this observation, a node must insert an artificial latency when not forwarding the message. The latency should be drawn from a distribution indistinguishable from real network latencies. Therefore, a node must observe the latencies of its connections.

One remaining privacy problem of adaptive diffusion is the selection of all neighbours for dissemination.
If an attacker is a neighbour of the first recipient of the virtual source token, they will notice the broadcast as soon as possible without being the first virtual source recipient. An attacker can force this situation by creating connections to all participants in the network. Even with many unobtrusive attackers distributed throughout the network, the chance of selection is high.

To reduce privacy leaks via nodes that are not virtual sources, we reduce the spread of information by introducing the parameter η. Participants only select η neighbours to participate in the protocol. This reduces the chance of selecting at least one of α attackers from c connections to

$$\frac{\prod_{j=c-\alpha-\eta}^{c-\alpha} j}{c^{\eta}},$$

i.e., the chance of an attacker being selected shrinks the lower the parameter is.

Unfortunately, a node cannot reliably decide which edges increase or decrease the distance to the true source, as the source is unknown or an edge with the desired probability is not available. For every node, keeping the token will keep the distance the same. As a heuristic for early nodes, returning the token along the path it was received likely reduces the distance by one, while forwarding it to another node likely increases the distance. Knowledge about the neighbours of neighbours can increase the accuracy of this heuristic.

For networks observed in real-world peer-to-peer networks, the small-world property likely holds [15, 18]: the shortest distance between any two nodes is likely below or equal to 6. After six steps, the candidate pool for the true source is most of the network. Therefore, performing the analysis is sufficient for the early steps of the protocol.

Due to the lack of information stemming from the privacy requirements, the distribution of nodes holding the virtual source is distorted at every step, making the result less correct. Alternative approximations based on the distribution may perform better empirically. One improvement may be a better approximation by nodes holding the virtual source token. They can infer their distance to the true source to be at most t the moment they receive the token. Therefore, they have no reason to use the probabilities as if they were at distance t + 1 should they keep the token.

Lastly, the result of the previous section is based on a distribution of the shortest paths within the networks. This distribution is not generally known for most graph types and could not be empirically determined by a participant without knowledge of the topology.

To relieve the final limitation, the need for a concrete distribution, we analyse a likely graph creation algorithm for the expected distribution of shortest paths. This will allow a node to compute p_t based on the number of expected edges and nodes in its network.

We determine suitable distributions and their parameters by creating and analysing random graphs. To create the graphs, we use igraph, while the analysis is done using scipy [19].

We chose igraph's graph establishment function, which takes a number of nodes n and a number of edges k. The method creates a random graph by sequentially adding nodes. Each node creates k edges to already existing nodes. This scheme leads to a connected graph, where older nodes have a higher number of connections, while new nodes have at least k connections.

We chose this scheme as it is similar to the schemes used in peer-to-peer networks. A new node connects to publicly known nodes and asks for a set of participants. The new node then chooses some number of nodes to connect to. This model is a simplification, as it ignores churn, i.e., nodes leaving and joining the network again, but it is a close fit for real-world applications.

To reproduce the steps and results of this paper, we provide a repository of our data and scripts under the MIT open source license, including an interactive notebook for experimentation.

To model the observed behaviour, we chose various discrete and continuous distributions.
As discrete candidates, we looked at Poisson, Planck, binomial and geometric distributions. For continuous distributions, we considered the normal, log-normal, truncated normal and Weibull distributions. We evaluated the fit of continuous distributions by the overall shape, as the data is discrete.

The Weibull distribution was chosen as a candidate for the extreme value distributions, as shortest paths are calculated as minimums over paths. The normal distribution and its variations were chosen due to the central limit theorem, i.e., the normal distribution as the limit of independent samplings. The truncated normal distribution was selected as a candidate as its support can be limited to positive values, a sensible limitation for path lengths.

We estimated parameter fits for all distributions from many generated graphs. The parametrisation for the truncated normal was mostly indistinguishable from the produced normal distribution. Similarly, the log-normal distribution was transformed to mimic the normal fit closely. Therefore, we removed the truncated and log-normal distributions as candidates so as not to overcomplicate the model. Representative examples for continuous fits are shown in Fig. 2.

igraph: https://igraph.org/
Repository: https://github.com/vs-uulm/eta-adaptive
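The experiment described above can be approximated without external libraries. The following Python sketch is not the authors' script: it swaps in a hand-written k-growing generator and breadth-first search for igraph, and it replaces scipy's fitting routines with the weighted sample mean and standard deviation of the path lengths, which are the maximum likelihood estimates for a normal distribution. The function names, the seed and the sizes n = 300, k = 3 are assumptions chosen to keep the run short.

```python
import random
from collections import Counter, deque


def k_growing_graph(n, k, rng):
    """Each new node i connects to min(i, k) distinct, uniformly chosen older nodes."""
    adj = {0: set()}
    for node in range(1, n):
        adj[node] = set()
        for target in rng.sample(range(node), min(node, k)):
            adj[node].add(target)
            adj[target].add(node)
    return adj


def path_length_counts(adj):
    """Count ordered node pairs (a node paired with itself included) per shortest-path length."""
    counts = Counter()
    for start in adj:
        dist = {start: 0}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbour in adj[node]:
                if neighbour not in dist:
                    dist[neighbour] = dist[node] + 1
                    queue.append(neighbour)
        counts.update(dist.values())
    return counts


rng = random.Random(42)  # fixed seed so the run is repeatable
n, k = 300, 3
counts = path_length_counts(k_growing_graph(n, k, rng))

total = sum(counts.values())
mu = sum(d * c for d, c in counts.items()) / total
sigma = (sum(c * (d - mu) ** 2 for d, c in counts.items()) / total) ** 0.5

print("pairs per distance:", dict(sorted(counts.items())))
print("moment estimates: mu = %.3f, sigma = %.3f" % (mu, sigma))
```

Because every node links to at least one older node, the graph is connected, so the counts cover all n² ordered pairs. The counts at distance zero and one are fixed by n and k alone, independent of the random choices.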
Fig 2. Fitted continuous distributions from an example dataset, which was created using 2000 nodes and 6 edges. A log-normal and a truncated normal fit were plotted identically to the normal distribution and were, therefore, omitted.

Similarly, the results for the discrete distributions were not a good fit. Only the Poisson distribution produced a convincing fit for any graph, but limited to graphs with k = 1. We did not test other discrete distributions, as often no efficient maximum likelihood estimators exist or are implemented. Representative examples for discrete fits are shown in Fig. 3.
Fig 3. Fitted discrete distributions from an example dataset, which was created using 2000 nodes and 6 edges. Only the binomial estimation using the ceiling operator to discretise the parameters shows any resemblance to the desired data.

Finally, the results for the normal distribution produced good point-wise fits for graphs with k > 1. The normal distribution fits were especially accurate for the core section of the distribution. The most significant deviation from the data could be observed for the low end of the distribution: for distance 0 or 1. Fortunately, these values can be fit manually based on the parameters n, k, as the mass at a distance of zero should be $\frac{1}{n}$ and the expected mass at the distance of one should be $\frac{k(2n-k-1)}{n^2}$, i.e., the average degree divided by n.

To apply the normal distribution to our given problem, the resulting distribution needs to be discretised, i.e., turned from a continuous distribution into a discrete one. The main goal is to keep the properties of a probability distribution, i.e., the sum of all points not equal to zero needs to add up to 1.

A valid discretisation can be constructed based on the cumulative distribution function over intervals, capturing the full support of the distribution, e.g.,

$$f(x) = \mathrm{CDF}\left(x + \tfrac{1}{2}\right) - \mathrm{CDF}\left(x - \tfrac{1}{2}\right).$$

As we fit the distribution based on the points of data, the natural discretisation can be achieved by point-wise evaluation and re-normalisation of the result. Let PDF be the probability density function, then a new probability mass function (the discrete equivalent to a PDF) with evaluation points 0, 1, …, t is given by

$$f_t(x) = \frac{\mathrm{PDF}(x)}{\sum_{s=0}^{t} \mathrm{PDF}(s)}, \qquad x \in \{0, 1, \ldots, t\}.$$

This approach can easily accommodate special values at certain points. Therefore, the full discretisation of our normal distribution shall be:

$$
\begin{aligned}
f(0) &= \frac{1}{n}, \\
f(1) &= \frac{k(2n-k-1)}{n^2}, \\
f(x) &= \frac{\mathrm{PDF}(x)}{f(0) + f(1) + \sum_{s=2}^{t} \mathrm{PDF}(s)}.
\end{aligned}
$$
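The discretisations above can be sketched in Python using only the standard library. Two points are assumptions of this sketch rather than part of the text: the interval variant uses midpoint intervals and is re-normalised over 0, …, t, and the variant with fixed values at distance 0 and 1 rescales the remaining points so that the total mass is exactly one, which differs from the literal normalisation constant in the last formula. The values µ = 3.4 and σ = 0.6 are hypothetical placeholders, not fitted results.

```python
import math


def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def normal_cdf(x, mu, sigma):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def pointwise_pmf(mu, sigma, t):
    """f_t(x) = PDF(x) / sum_{s=0..t} PDF(s) for x = 0..t."""
    values = [normal_pdf(x, mu, sigma) for x in range(t + 1)]
    total = sum(values)
    return [v / total for v in values]


def interval_pmf(mu, sigma, t):
    """Mass of the midpoint interval [x - 1/2, x + 1/2], re-normalised over x = 0..t."""
    values = [normal_cdf(x + 0.5, mu, sigma) - normal_cdf(x - 0.5, mu, sigma)
              for x in range(t + 1)]
    total = sum(values)
    return [v / total for v in values]


def pmf_with_fixed_head(mu, sigma, t, n, k):
    """Fix f(0) and f(1) from n and k; spread the remaining mass over x = 2..t."""
    f0 = 1.0 / n
    f1 = k * (2 * n - k - 1) / n ** 2
    tail = [normal_pdf(x, mu, sigma) for x in range(2, t + 1)]
    scale = (1.0 - f0 - f1) / sum(tail)  # assumption: rescale so the total is one
    return [f0, f1] + [scale * v for v in tail]


# Hypothetical parameters for a graph with n = 2000 nodes and k = 6 edges per node.
mu, sigma, t, n, k = 3.4, 0.6, 8, 2000, 6
for name, pmf in (("point-wise", pointwise_pmf(mu, sigma, t)),
                  ("interval", interval_pmf(mu, sigma, t)),
                  ("fixed head", pmf_with_fixed_head(mu, sigma, t, n, k))):
    print(name, [round(p, 4) for p in pmf])
```

Each returned list can be used directly as the distribution f when computing forwarding probabilities, with the list index as the distance.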
The maximum point t should be chosen in such a way that the remaining error 1 − CDF(t) ≤ ε is small enough for the given purpose.

.4 Model Parameter Estimator

The previous section concluded that the distribution of shortest paths can be modelled using a discretised normal distribution. Building upon this conclusion, we are further interested in the parameters µ and σ of a normal distribution N(µ, σ²).

The parameters of the normal distribution should only depend on the parameters of our network, the number of nodes n and the number of edges k. We are interested in functions M, S, with a small error err, such that

$$\mu = M(n, k) + \mathrm{err}, \qquad \sigma = S(n, k) + \mathrm{err}.$$

These are statistical estimators. To determine these, we fit a large number of randomly generated graphs and stored the resulting values for µ and σ. The determined functions M, S are approximated using the functional equations:

$$
\begin{aligned}
M(n, k) &\approx \alpha \ln(\beta n)\, e^{\gamma k} + \delta \ln(\eta n) + \zeta e^{\gamma k} + \epsilon, \\
S(n, k) &\approx \mathfrak{a} \ln(\mathfrak{b}\, n) + \mathfrak{c}\, e^{\mathfrak{d} k} + \mathfrak{e}.
\end{aligned}
$$

Here, the Greek constants α to ε and the Fraktur constants $\mathfrak{a}$ through $\mathfrak{e}$ are determined by least-squares fitting of the function to the acquired data. Through a fit of experimental data we reached the following approximate functions:

$$
\begin{aligned}
M(n, k) &\approx \frac{{\ldots}.595\, \log(2.{\ldots}\, n)}{\exp(0.{\ldots}\, k)} + 0.341\, \log(1.{\ldots}\, n) + \frac{0.{\ldots}}{\exp({\ldots}\, k)} - {\ldots}, \\
S(n, k) &\approx {\ldots}\, \log({\ldots}\, n) + \frac{1.{\ldots}}{\exp({\ldots}\, k)} + 0.{\ldots}
\end{aligned}
$$

We used the fitted parameters for random graphs to determine the behaviour of the parameters. The ranges of the parameters depend on the size of the network n and the number of connections k created in each step.

By splitting the dimensions based on n and k, an initial estimation is possible. The dimension dependent on k shows a behaviour proportional to $e^{k}$. The dimension dependent on n, on the other hand, shows a behaviour proportional to log(n); a square root behaviour could be excluded, as fitted parameters easily overestimated the data. A random selection of the dimensional analysis is shown in Fig. 4.

Fig 4. Datapoints for various graph sizes split by number of edges k and number of nodes n. The µ estimate is fitted using $e^{k}$ for the number of edges dimension and using log(n) for the number of nodes dimension.

The estimation of values for σ shows much more pronounced residues in the form of a sawtooth function. The forms can be recognised from the similarly shaped, but smaller, residues of the µ estimations. The values of σ are within 0.3 to 0.7, even for large numbers of nodes n, e.g., n = 1 000 000. Further, the values seem to jump rapidly and descend slowly, forming a sawtooth pattern, which is hard to predict accurately. The pattern arises as additional nodes in the network are more likely to create shortcuts than to increase path length, until the overall network diameter increases by one, steeply widening the distribution and therefore increasing the variance, i.e., σ². The likelihood of such an increase follows its own probability distribution, which we did not determine for this paper.
Based on our one-dimensional evaluation, we construct two-dimensional estimator models M(n, k). These models are based on possible combinations of our one-dimensional approximations, i.e., the partial derivatives are the derivatives of our observations:

$$\frac{\partial M}{\partial n} \approx \frac{d}{dn} \log(n), \qquad \frac{\partial M}{\partial k} \approx \frac{d}{dk} e^{k}.$$

The constructed models have various constants, denoted by lower-case Greek symbols. These constants are not shared between the models but are fitted independently. The models are denoted by

$$M_1(n, k) = \alpha \ln(\beta n) + \gamma e^{\delta k} + \epsilon,$$
$$M_2(n, k) = \alpha \ln(\beta n)\, e^{\gamma k} + \delta \ln(\eta n) + \zeta e^{\gamma k} + \epsilon,$$
$$M_3(n, k) = \alpha \ln(\beta n)\, e^{\gamma k} + \epsilon,$$
M_4 ( n, k ) = α ln( βn ) e γk − αδke γk + ζe ηk + θ ln( ιn )( κn ) λn + ν ln( ξn ) .

We fit the parameters of the models based on our first dataset. The residuals of the fit, i.e., the differences between the true values and the estimates, show a sawtooth form. This arises as additional nodes are likely to create shortcuts until the overall diameter of the network grows.

To evaluate the performance beyond this observed error, we created a new independent dataset. We measured the difference between the values calculated by our models and the actual fitted parameters, i.e., the bias of the estimator. Fig. 5 shows the distribution of the bias of our models for µ.

Fig 5.
Boxplot of the bias distribution of the µ-estimator, using the four models, compared to the measured value.

For the µ estimation, models $M_2$ and $M_4$ perform the best. Model $M_2$ requires fewer parameters than $M_4$, i.e., it is simpler; therefore, we prefer model $M_2$.

For the estimation of σ, none of the models performs exceptionally well. In general, the observed σ values are small and within the range 0.3 to 0.7, even for graphs with one million nodes. As no model performed exceptionally well, but also not exceptionally badly, we stuck with the simplest model, i.e., model 1. Fig. 6 provides a discretization of an example prediction for a dataset based on 2000 nodes and 6 edges. We can now use these values to compute concrete $p_t$ for a given network of n nodes using a k-growing approach.

Fig 6. Discretization of an example dataset using 2000 nodes and 6 edges. The predicted normal distribution is based on parameters estimated using model 2 for µ and model 1 for σ, with a pointwise discretization and an interval discretization using midpoint intervals.

There are various smaller improvements possible to increase the privacy or efficiency of the protocol. First, instead of switching to flood and prune from the last virtual source node, the protocol could trigger the flood and prune broadcast from all leaf nodes. The last transmission message, see Algorithm 3, would instruct leaf nodes to start a flood and prune process. This would reduce leaks of information during the flood and prune protocol and improve the efficiency of the protocol.

To improve resistance against linkable broadcasts and to hinder an attacker in the first step, t may be randomized on initiation. This would also require the originator to use a spreading message first, as the initial transmission would otherwise be special in this regard.

The protocol has few guards against non-participation attacks, which could be mitigated through time-outs.
While allowing the protocol to finish, these would still reduce the efficiency of the protocol in selective non-participation attacks. As those do not prevent every connected node from receiving the message and do not diminish the privacy results, we did not tackle these in this paper.

In this paper, we transformed the adaptive diffusion protocol [14] into a protocol for peer-to-peer networks. To achieve this, we remodelled the virtual source passing probabilities in a more general way, based on the distance distribution of the underlying network. Further, we provide a privacy-friendly way to solve these equations, while smoothing out otherwise unachievable states.

We analysed expected k-growing network topologies, which are similar to real-world peer-to-peer network growth, for their distance distributions. The analysis showed the distances in the networks to be approximately normally distributed.

Lastly, we performed a parameter analysis of the resulting normal distributions, showing that µ and σ of the normal distribution can be approximated by a combination of logarithmic and inverse exponential terms. The models, based on the number of edges k and the number of nodes n, are: µ ( n, k ) ≈ .
595 log(2 . n )exp(0 . k ) + 0 .
341 log(1 . n ) + 0 . . k ) − . σ ( n, k ) ≈ . . n ) + 1 . . k ) + 0 . to evaluate and reproduce under a permissive open-source license. https://github.com/vs-uulm/eta-adaptive

References
1. Bonneau J, Narayanan A, Miller A, Clark J, Kroll JA, Felten EW. Mixcoin: Anonymity for Bitcoin with accountable mixes. In: Financial Cryptography and Data Sec. Springer; 2014. p. 486–504.
2. Ben Sasson E, Chiesa A, Garman C, Green M, Miers I, Tromer E, et al. Zerocash: Decentralized anonymous payments from bitcoin. In: IEEE Symp. on Sec. and Priv. (SP). IEEE; 2014. p. 459–474.
3. Kopp H, Mödinger D, Hauck FJ, Kargl F, Bösch C. Design of a Privacy-Preserving Decentralized File Storage with Financial Incentives. In: Proc. of IEEE Sec. & Priv. on the Blockch. (S&B) (aff. with EUROCRYPT 2017). IEEE; 2017.
4. Noether S, Noether S. Monero is Not That Mysterious; 2014.
5. Mastan ID, Paul S. A New Approach to Deanonymization of Unreachable Bitcoin Nodes; 2018. Crypt. ePrint Arch., Report 2018/243.
6. Biryukov A, Khovratovich D, Pustogarov I. Deanonymisation of Clients in Bitcoin P2P Network. In: Proc. of the ACM Conf. on Comp. and Comm. Sec. New York, NY, USA: ACM; 2014. p. 15–29.
7. Meiklejohn S, Pomarole M, Jordan G, Levchenko K, McCoy D, Voelker GM, et al. A fistful of bitcoins: characterizing payments among men with no names. In: Proc. of the Internet Meas. Conf. ACM; 2013. p. 127–140.
8. Ron D, Shamir A. Quantitative analysis of the full bitcoin transaction graph. In: Financial Cryptography and Data Security. Springer; 2013. p. 6–24.
9. Koshy P, Koshy D, McDaniel P. An analysis of anonymity in bitcoin using P2P network traffic. In: Proc. of the Int. Conf. on Finan. Cryp. and Data Sec. Springer; 2014. p. 469–485.
10. Conrad B, Shirazi F. A Survey on Tor and I2P. In: Ninth Int. Conf. on Internet Monitoring and Protection (ICIMP2014); 2014. p. 22–28.
11. Bojja Venkatakrishnan S, Fanti G, Viswanath P. Dandelion: Redesigning the Bitcoin Network for Anonymity. Proc of the ACM Measurement and Analysis of Comp Sys (POMACS). 2017;1(1):22:1–22:34.
12. Fanti G, Venkatakrishnan SB, Bakshi S, Denby B, Bhargava S, Miller A, et al. Dandelion++: Lightweight Cryptocurrency Networking with Formal Anonymity Guarantees. SIGMETRICS Perform Eval Rev. 2019;46(1):5–7.
13. Mödinger D, Kopp H, Kargl F, Hauck FJ. Towards Enhanced Network Privacy for Blockchains. In: Proc. of the 38th IEEE Int. Conf. on Distributed Computing Systems (ICDCS); 2018.
14. Fanti G, Kairouz P, Oh S, Viswanath P. Spy vs. Spy: Rumor Source Obfuscation. SIGMETRICS Perform Eval Rev. 2015;43(1):271–284.
15. Jackson MO. Social and Economic Networks. Princeton University Press; 2010.
16. Johnson NL, Kotz S, Balakrishnan N. Continuous Univariate Distributions, Vol. 1 of Wiley Series in Probability and Mathematical Statistics: Applied Probability and Statistics; 1994.
17. Johnson NL, Kotz S, Balakrishnan N. Continuous Univariate Distributions, Vol. 2 of Wiley Series in Probability and Mathematical Statistics: Applied Probability and Statistics; 1995.
18. Watts DJ, Strogatz SH. Collective dynamics of 'small-world' networks. Nature. 1998;393(6684):440–442.
19. Jones E, Oliphant T, Peterson P, et al. SciPy: Open source scientific tools for Python; 2001–. Available from: