Approximate Privacy-Preserving Neighbourhood Estimations
Alvaro Garcia-Recuero [email protected] NETWORKSMadrid, Leganes, Spain
ABSTRACT
Anonymous social networks present a number of new and challenging problems for existing Social Network Analysis techniques. Traditionally, existing methods for analysing graph structure, such as community detection, required global knowledge of the graph structure. That implies that a centralised entity must be given access to the edge list of each node in the graph. This is impossible for anonymous social networks and other settings where privacy is valued by its participants. In addition, using their graph structure inputs for learning tasks defeats the purpose of anonymity. In this work, we hypothesise that one can re-purpose the HyperANF algorithm, a.k.a. HyperBall (intended for approximate diameter estimation), for the task of privacy-preserving community detection for friend recommending systems that learn from an anonymous representation of the social network graph structure with limited privacy impacts. This is possible because the core data structure maintained by HyperBall is a HyperLogLog with a counter of the number of reachable neighbours from a given node. Exchanging this data structure in future decentralised learning deployments gives away no information about the neighbours of the node and therefore does preserve the privacy of the graph structure.
CCS CONCEPTS
• Information systems → Web mining.
KEYWORDS
approximate computing, graph estimation, social networks, HyperLogLog, Decentralisation
ACM Reference Format:
Alvaro Garcia-Recuero. 2021. Approximate Privacy-Preserving Neighbourhood Estimations. In . ACM, New York, NY, USA, 4 pages.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]. © 2021 Association for Computing Machinery.

Performing data analysis over large graphs with billions of edges is an important yet challenging task. Previous work has estimated the average distance between any two users of a large social networking site such as Facebook, resulting in 4.75 hops on average. Such insights on the structure of a graph are often difficult for the research community to investigate: it is not possible to obtain the neighbourhood sets of every node in a social network because of the network's terms and conditions, nor is it practical, in terms of performance, to process a complete graph of billions of nodes on a laptop. A practical solution is the emerging use of graph estimation techniques that approximate the value of graph properties at the node level with the aim of reducing computation overhead.

In the literature, estimation algorithms calculate neighbourhood information efficiently by scaling distributed computing to approximate massive graph data mining [10]. Boldi & Vigna focus on approximating geometric centrality metrics at scale [2] using HyperLogLog counters in their HyperANF algorithm [1], a.k.a. HyperBall. This allows approximation of geometric centrality metrics at very high speed and with acceptable accuracy for use cases that do not require exact values.
Contribution.
In this work we implement the HyperBall algorithm in Python to explore its utility for approximating neighbourhood information in a Decentralised Online Social Network (DOSN) with millions of edges. It is the first study of this kind to date with a DOSN dataset. We estimate the neighbourhood function of the graph over the Mastodon DOSN data provided by [11]. We translate the existing HyperBall algorithm into Python. The DOSN graph is loaded in memory and we use our neighbourhood function to adjust the error bounds of the estimation technique based on HyperLogLog [4] counters. The expectation is to see first how a smaller graph, in the order of millions of edges instead of billions, behaves in terms of performance. Our results show error bounds in the order of 20-30% in the most pessimistic cases of the graph traversal (with the exception of level 0 of the traversal, where only 1 node exists and an error is therefore always plausible because of the small-sample problem of HyperLogLog), and better errors, in line with the theoretical bounds of the original HyperLogLog approximation technique we employ here, in the remaining cases.

Sketching techniques such as HyperLogLog [4] counters are a tool for estimating the cardinality of large multisets. In short, they can efficiently count the number of distinct elements in a stream using a single pass over it. A HyperLogLog counter uses a hash function producing $b$ bits that maps each element uniformly to one of $2^b$ possibilities. For any produced hash value, say $h$, we use $\mathrm{Prefix}_t(h)$ to denote the first $t$ bits of $h$ and $\mathrm{Rest}_t(h)$ to denote the remaining bits after the first $t$ bits are removed. Further, for any bit-string $x$, let $\mathrm{LZ}(x)$ be the number of leading zeros in $x$. With these definitions in hand, we now present the functions for manipulating HyperLogLog counters in Algorithm 1.
In addition to functions that add an item to the count and extract the current value of the counters, HyperLogLog counters allow us to compute lossless unions. The intuition is that, given two compatible HyperLogLog counters, taking the register-wise maximum yields a counter equivalent to the one obtained by counting the union of the two multisets directly, and thus adds no error. The same is not true of taking the minimum for intersections, so we use a MinHash-based approach for intersections later in the paper.

HyperBall documentation: http://webgraph.di.unimi.it/docs/it/unimi/dsi/webgraph/algo/HyperBall.html

Algorithm 1: HyperLogLog Counter Manipulation
    $h : D \to \{0,1\}^b$, a hash function from the domain of items
    $M$, an array of $m = 2^t$ counters, each initialised to $-\infty$
    $\alpha_m$, a constant that depends on the number of counters

    function ADDITEM(M: counter, x: item)
        $i \leftarrow \mathrm{Prefix}_t(h(x))$
        $M[i] \leftarrow \max\{M[i], \mathrm{LZ}(\mathrm{Rest}_t(h(x)))\}$

    function GETCOUNT(M: counter)
        $Z \leftarrow \bigl(\sum_{i=0}^{m-1} 2^{-M[i]}\bigr)^{-1}$
        $E \leftarrow \alpha_m m^2 Z$
        if $E \le \frac{5}{2} m$ then
            let $V$ be the number of registers equal to 0
            if $V \ne 0$ then
                $E \leftarrow m \log(m/V)$
        return $E$

    function UNION(M: counter, N: counter)
        x: hll_counter
        for i := 0 to m-1 do
            $x[i] \leftarrow \max\{M[i], N[i]\}$
        return x

Most crucially for this paper, HyperLogLog counters provide tight statistical bounds on the measured cardinality. In the limit $n \to \infty$, the returned count (say $\hat{n}$) is an almost unbiased estimator [4] of the true count (say $n$), with a relative standard deviation $\sigma/n$ of approximately $1.04/\sqrt{m}$. Increasing the number of counters therefore decreases the uncertainty in the measurement at the expense of space. For example, 128 counters of 8 bits each are enough to count the distinct items of a very large universe with a relative standard deviation of roughly 9%.

For small cardinalities the original algorithm switches to linear counting, $E = m \log(m/V)$, where $V$ is the number of zero registers (the small-range branch of GETCOUNT in Algorithm 1). In further work on making HyperLogLog counters sound for small counts, Heule et al. [6] recommend using a bias correction instead. Following the appendix of their work, we implement their recommended bias correction algorithm instead of linear counting at small ranges. We use the same bias correction at low counts for all datasets here (see equation 2 below).

[Diagram: a hash value split into a prefix of p bits and the remaining bits.]
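To make the three operations of Algorithm 1 concrete, the following is a minimal Python sketch, not the implementation evaluated in this paper. Several choices are assumptions made for illustration: the class name `HyperLogLog`, the 64-bit hash taken from SHA-1, registers initialised to 0 instead of $-\infty$, and a stored rank of leading zeros plus one (the common practical variant, under which a register equal to 0 means "never touched"). The small-range branch uses plain linear counting rather than the bias correction of [6].

```python
import hashlib
import math


class HyperLogLog:
    """Toy HyperLogLog counter with 2**t registers (illustrative only)."""

    HASH_BITS = 64  # assumed hash width b

    def __init__(self, t=7):
        self.t = t
        self.m = 1 << t
        self.registers = [0] * self.m  # 0 means the register was never touched
        # Usual closed form for alpha_m, valid for m >= 128.
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def _hash(self, item):
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def add(self, item):
        h = self._hash(item)
        rest_bits = self.HASH_BITS - self.t
        index = h >> rest_bits                    # Prefix_t(h)
        rest = h & ((1 << rest_bits) - 1)         # Rest_t(h)
        rank = rest_bits - rest.bit_length() + 1  # leading zeros of rest, plus one
        self.registers[index] = max(self.registers[index], rank)

    def count(self):
        z = 1.0 / sum(2.0 ** -r for r in self.registers)
        estimate = self.alpha * self.m * self.m * z
        if estimate <= 2.5 * self.m:
            zeros = self.registers.count(0)
            if zeros:
                # Small-range branch: linear counting.
                estimate = self.m * math.log(self.m / zeros)
        return estimate

    def union(self, other):
        assert self.t == other.t, "counters must use the same number of registers"
        merged = HyperLogLog(self.t)
        merged.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return merged
```

Because the union is a register-wise maximum, `a.union(b)` holds exactly the same registers as a counter that saw every item of both streams, which is the lossless property described above.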
The Mastodon DOSN dataset contains about 6.5 million direct edge relations and 566520 nodes, computed in Spark. In our HyperBall implementation [2] we choose a precision of p = 4 bits for HyperLogLog. The traversal of each dataset graph takes a depth parameter, which we set to no less than 5 so as to resemble the degrees of separation in a real social network graph (e.g., Facebook). We test our HyperBall implementation with two centralised social network datasets (Twitter, Facebook) from [7], one decentralised social network dataset (Mastodon) and the Enron email dataset for the sake of diversity.
Performance.
In Table 1 we observe that, as expected, all of the datasets perform worse with a brute-force approach. The HyperBall results show a considerable speedup: the balls are computed in the order of minutes, compared to hours with a brute-force Breadth First Search (BFS), on a commodity MacBook Pro running OS X Mojave with a 4-core CPU, 16GB of memory and a 500GB SATA disk. In addition, the initialisation and computation of the HyperBall counters consist of embarrassingly parallel tasks, which provides a further speedup opportunity; we use the joblib library with its threading backend and four parallel tasks. This cuts the computation time of the Mastodon graph's HyperBall by half. Overall, the theoretical probabilistic guarantees of HyperBall seem to reduce times further for larger graphs such as Mastodon, when approximations are based on the HyperLogLog counters.
Table 1: HyperBall computation times (hh:mins:secs)

Dataset   Nodes   Edges    BFS (sequential)  HyperBall (sequential)  HyperBall (parallel)
Twitter   81306   2420766  >1:00:00.00       0:02:54.874483          0:02:46.414789
Facebook  4039    88234    0:00:56.965906    0:00:08.249678          0:00:08.533745
Mastodon  566520  6493563  >1:00:00.00       0:11:16.773455
A key operation in community detection, such as the one in Figure 1, and in recommendation systems is calculating neighbourhood intersections. This can be performed by intersecting HyperBalls. Intersecting two HyperBalls has a shortcoming, however: the underlying HyperLogLog counters cannot compute intersections by themselves on the fly from the neighbouring summaries or fingerprints alone.

Fortunately, we can use state-of-the-art estimation techniques based on [3] for intersecting two HyperBalls. Our approach computes the product of the HyperLogLog estimate of the union with an approximation of the Jaccard similarity obtained from MinHashes. The MinHash estimate of the Jaccard similarity of the two HyperBalls is the number of MinHash values they share divided by a fixed parameter $k$ (equation 2), so the product leaves us with an approximate value of the intersection of the two sets (equation 3). In other words, rather than computing the Jaccard coefficient by dividing the intersection by the union, which would be more costly, we approximate it with the MinHashes, which effectively requires just the number of matching MinHash values divided by the parameter $k$. This may also have privacy properties because of the way the counters and fingerprints are built with HyperLogLog, using just an encoding of the inputs. It is easy to see from the decomposition below that the unions approximately cancel out when HyperLogLog and MinHashes are multiplied as in equation 3:

$$\frac{|\cap|}{\cancel{|\cup|}} \cdot \cancel{|\cup|} \tag{1}$$

$$\frac{|h_k(A_i) \cap h_k(B_i)|}{k} \tag{2}$$

$$\Bigl|\bigcap A_i\Bigr| = J(A_1, \ldots, A_n) \cdot \Bigl|\bigcup A_i\Bigr| \approx \mathrm{MinHash} \cdot \mathrm{HLL} \tag{3}$$
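The product in equation 3 can be illustrated with a short Python sketch. This is a sketch under stated assumptions and not the code used for the experiments: it uses the MinHash variant with $k$ salted hash functions (one minimum per function) rather than any specific scheme from [3], the function names are hypothetical, and the union size is passed in as a plain number so that an exact set size can stand in for the HyperLogLog union estimate.

```python
import hashlib


def minhash_signature(items, k=128):
    """Return k minima, one per salted hash function, over the given set."""
    signature = []
    for seed in range(k):
        salt = seed.to_bytes(8, "big")  # blake2b accepts salts of up to 16 bytes
        signature.append(min(
            hashlib.blake2b(str(x).encode("utf-8"), digest_size=8, salt=salt).digest()
            for x in items
        ))
    return signature


def jaccard_estimate(sig_a, sig_b):
    """Fraction of positions where both signatures agree (equation 2)."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


def intersection_estimate(sig_a, sig_b, union_size):
    """Jaccard estimate times union size (equation 3)."""
    return jaccard_estimate(sig_a, sig_b) * union_size


# Usage: the exact union size stands in for a HyperLogLog union estimate here.
a = set(range(0, 1000))
b = set(range(500, 1500))
estimate = intersection_estimate(
    minhash_signature(a), minhash_signature(b), len(a | b)
)
print(f"estimated intersection: {estimate:.0f} (true value: {len(a & b)})")
```

Only the signatures and a union cardinality are exchanged in this sketch; the element sets themselves are never compared directly.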
Figure 1: A small-world network community structure of the type considered in the paper, from Newman and Girvan [9], illustrating the approach to community detection.
Then, using the MinHashes $H$ mentioned above, if $H'(v) = H(v)$ holds initially, then for each $(v, u) \in E$ we set $H'(v) = H'(v) \cap H(u)$. Notably, this can be done while reading the edges in sequence.

Towards Decentralised Privacy Preservation with HyperBall.
Armed with our implementation of the HyperBall algorithm, we also study the path length probability distribution and the average path length over the same network datasets. For the latter metric we apply our implementation of the approximated algorithm described in HyperBall. This allows us to compute the so-called small-world coefficient for each of the networks we benchmark.
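The average path length computation over ball sizes can be sketched in a few lines of Python. The sketch is illustrative only and rests on assumptions: exact Python sets stand in for the HyperLogLog counters, the function names are hypothetical, every node is expected to appear as a key of the adjacency dictionary, and the final averaging step (weighting each distance $t$ by the number of nodes first reached at $t$) is our reading of how the per-distance counts are combined.

```python
def ball_sizes(adjacency, max_t):
    """Return sizes[v][t] = |B(v, t)| for t = 0..max_t.

    Exact sets replace the HyperLogLog counters of HyperBall here.
    """
    balls = {v: {v} for v in adjacency}  # B(v, 0) = {v}
    sizes = {v: [1] for v in adjacency}
    for _ in range(max_t):
        new_balls = {}
        for v, neighbours in adjacency.items():
            ball = set(balls[v])         # start from B(v, t-1)
            for u in neighbours:
                ball |= balls[u]         # union with B(u, t-1)
            new_balls[v] = ball
        balls = new_balls
        for v in adjacency:
            sizes[v].append(len(balls[v]))
    return sizes


def average_path_length(sizes):
    """Average distance over all reachable ordered pairs, from ball sizes only."""
    total_distance = 0
    total_pairs = 0
    for per_node in sizes.values():
        for t in range(1, len(per_node)):
            at_distance_t = per_node[t] - per_node[t - 1]  # nodes first reached at t
            total_distance += t * at_distance_t
            total_pairs += at_distance_t
    return total_distance / total_pairs if total_pairs else 0.0


# Usage on a path graph 0 - 1 - 2 - 3 (undirected, so edges are listed both ways).
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(average_path_length(ball_sizes(path, max_t=3)))
```

Note that `average_path_length` only consumes cardinalities, never the balls themselves, which is the property the privacy discussion later in the paper relies on.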
Algorithm 2: Average path length with HyperBall
    nr_paths: number of paths of each length per node.
    max_t: maximum distance of the HyperBall computations, equal to b.

    function HYPERBALL(G: graph, b: radius of ball, p: hll_c_prec)
        return HB

    function NUMNODESDISTFROM(v, t)
        if t = 0 then
            return
        else if t >= get_max_t then
            return
        else
            return balls[v][t].size() - balls[v][t - 1].size()

    function AVERAGE_PATH_LENGTH(G)
        for v ∈ G do
            for t ∈ do
                nr_paths[v] = HB.NUMNODESDISTFROM(v, t)

Performance.
From the computations of Table 2 we obtain the corresponding small-world coefficient by (i) computing a random graph with the same number of nodes and edges as our input graph, and (ii) computing a lattice graph with the same number of vertices. Once we have those values, we obtain the ratio $l$ of the average path length of the random graph $R$ to that of our actual input graph $G$. Similarly, the ratio of the average clustering coefficient of the input graph $G$ to that of the lattice graph $L$ gives $c$. We calculate the small-world coefficient as $l - c$. The coefficient should usually lie between -1 and 1 and indicates how strongly the small-world phenomenon occurs. A coefficient close to 0 indicates a strong small-world effect. Positive values indicate a graph with more random characteristics. Negative values indicate regular, lattice-like graphs.

We observe that the network properties of the Mastodon dataset again improve to some extent. From the standpoint of social networks, small-world networks have a short average path length and a high clustering coefficient (see Twitter in Table 2). A shorter average shortest path length indicates that information propagates more easily than in a random or regular lattice-type social network such as Twitter. For Mastodon this can be related to the relatively self-organising nature of the platform among users. For Twitter we would expect a higher average path length, as reported in previous work on the matter [8]. However, their Twitter dataset is more up to date than the one we use, an ego network from 2012 [7]. We also plan to experiment with larger datasets in a parallel framework such as Spark to compute these same metrics.

Note that the times in Table 2 correspond to the average path length computation, which is the main source of overhead for the small-world coefficient we provide. Some minor shifts exist in Twitter and Mastodon.
The times reported here are for the parallel version of HyperBall only, because initialising the underlying HyperLogLog counters in parallel costs a bit less than doing so sequentially, particularly for the Mastodon dataset, as we observed in the previous table. We compare our HyperANF-based approach to computing the path length with the state-of-the-art library NetworkX (nx.average_shortest_path_length(G)), which is expected to be more optimised than our prototype; surprisingly, ours still performs similarly or better in some cases.

Table 2: Results for Algorithm 2 with HyperBall vs NetworkX

Dataset   Nodes  Edges    NetworkX (hh:mins:secs)  HyperBall (hh:mins:secs)  Avg. shortest path length (R/G)  Clustering coeff. (G/L)  Small-world coeff. (l-c)
Twitter   81306  2420766  0:03:21.698368           0:03:34.580287            0.9072928958767731               0.565311468612065        0.3415490228344126
Facebook  4039   88234    0:00:12.997504

The privacy-preserving version of the average path length in Algorithm 2 eventually only needs to learn, as before, the cardinality $|B(v, t)|$ and not $B(v, t)$ itself, since the computation for the node $v$ in question only uses $|B(v, t)|$ and $|B(v, t-1)|$. Inside HyperBall this is done by first keeping a HyperLogLog counter $H(v)$ at each node $v \in V$. So if $B(v, t) = B(v, t-1)$ holds initially, then for each $(v, u) \in E$ we set $B(v, t) = B(v, t) \cup B(u, t-1)$.

Equipped with the previous theoretical vision and the empirical results here, we envision a future privacy-preserving version of HyperBall for decentralised social network deployments in which computing network properties requires exchanging these set cardinalities or summary sets over insecure channels, for instance to perform federated learning tasks.

We use approximate computing techniques based on summary sets (HyperBall, MinHashes) that can be applied to estimate properties of graphs, resulting in algorithms that require only a bounded amount of data, e.g., just a HyperLogLog set per vertex rather than a list of all reachable vertices. Summary sets appear to have anonymity properties: e.g., it is hard to deduce the actually reachable neighbours given a HyperLogLog encoding of the neighbours reachable in K steps (as in HyperBall).
The hypothesis here is that these algorithms are therefore a good candidate for privacy-preserving decentralised social networks.

Firstly, to apply our proposal to social recommendation systems in a social network, only summaries of the friend lists at each node are available. Therefore, one can intersect such lists to make friendship recommendations in zero-knowledge, attaining secure two-party computation by applying the existing Private Set Intersection protocols we have for cardinality estimation of that operation, possibly even in a decentralised manner [5]. Indeed, we envision integrating this new work with the former in order to intersect fingerprints of the HyperBalls in zero-knowledge under standard techniques for broadcasting local values.

Secondly, we have explored the feasibility of estimating well-known measures of closeness such as the path length probability distribution and the average path length, the latter using HyperBall for performance. Later we may also consider the variance-to-mean ratio of the shortest-paths distribution in the Mastodon dataset. These metrics are important in network geometry, as explained in [1], and serve as an indicator, for instance, to distinguish structural differences between a social network and a web graph. To the best of our knowledge, we are the first to compute them over a list of ego networks and a decentralised network using approximate or sketching techniques that encode privacy properties.
Limitations.
The summary sets obtained from the union of HyperLogLog counters at each vertex using the candidate set of Algorithm 2 will be analysed in terms of how vulnerable they are to attacks from an honest-but-curious participant in the protocols and systems we will develop. However, if estimated locally, these computations are secure, and thus only the broadcasting of the resulting local values to calculate a global one, as in the average path length, needs to be protected or secured for privacy reasons. Effectively, our network model from here on will assume a decentralised computation of such network metrics in future DOSN deployments such as Mastodon.
In upcoming work we will show that it is possible to perform efficient community detection with estimation algorithms that use a representation of the graph from the neighbouring nodes when computing intersections and unions in a federated manner.
We thank Eiko Yoneki and Amitabha Roy from the Cambridge Computer Laboratory for their input on HyperLogLog counters and MinHashes. We thank Ajitesh Srivastava for the proofreading.
REFERENCES
[1] Boldi, P., Rosa, M., and Vigna, S. Hyperanf: Approximating the neighbourhood function of very large graphs on a budget. In Proceedings of the 20th international conference on World wide web (2011), ACM, pp. 625–634.
[2] Boldi, P., and Vigna, S. In-core computation of geometric centralities with hyperball: A hundred billion nodes and beyond. In (2013), IEEE, pp. 621–628.
[3] Broder, A. Z., Charikar, M., Frieze, A. M., and Mitzenmacher, M. Min-wise independent permutations. In Proceedings of the thirtieth annual ACM symposium on Theory of computing (1998), ACM, pp. 327–336.
[4] Flajolet, P., Fusy, É., Gandouet, O., and Meunier, F. Hyperloglog: the analysis of a near-optimal cardinality estimation algorithm. DMTCS Proceedings, 1 (2008).
[5] García-Recuero, Á., Burdges, J., and Grothoff, C. Privacy-preserving abuse detection in future decentralised online social networks. In International Workshop on Data Privacy Management (2016), Springer, pp. 78–93.
[6] Heule, S., Nunkesser, M., and Hall, A. Hyperloglog in practice: Algorithmic engineering of a state of the art cardinality estimation algorithm. In Proceedings of the EDBT 2013 Conference (Genoa, Italy, 2013).
[7] McAuley, J. J., and Leskovec, J. Learning to discover social circles in ego networks. In NIPS (2012), vol. 2012, Citeseer, pp. 548–56.
[8] Myers, S. A., Sharma, A., Gupta, P., and Lin, J. Information network or social network? the structure of the twitter follow graph. In Proceedings of the 23rd International Conference on World Wide Web (2014), pp. 493–498.
[9] Newman, M. E., and Girvan, M. Finding and evaluating community structure in networks. Physical review E 69, 2 (2004), 026113.
[10] Palmer, C. R., Gibbons, P. B., and Faloutsos, C. Anf: A fast and scalable tool for data mining in massive graphs. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining (2002), ACM, pp. 81–90.
[11] Zignani, M., Gaito, S., and Rossi, G. P. Follow the “mastodon”: Structure and evolution of a decentralized online social network. In