PES: Priority Edge Sampling in Streaming Triangle Estimation
Roohollah Etemadi and Jianguo Lu
School of Computer Science, University of Windsor, Canada
{etemadir, jlu}@uwindsor.ca

Abstract—The number of triangles (hereafter denoted by ∆) is an important metric for analyzing massive graphs; it is also used to compute the clustering coefficient of a network. This paper proposes a new algorithm, called PES (Priority Edge Sampling), to estimate the number of triangles in the streaming model, where the memory window must be minimized. PES combines edge sampling and reservoir sampling. Compared with the state-of-the-art streaming algorithms, PES outperforms them consistently. The results are verified extensively on 48 large real-world networks from different domains and with different structures. The performance ratio can be as large as 11. More importantly, the ratio grows with data size almost exponentially. This is especially important in the era of big data: while we can tolerate existing algorithms on smaller datasets, our method is indispensable when sampling very large data. In addition to the empirical comparisons, we prove that the estimator is unbiased and derive its variance.

Index Terms—Graph sampling; triangles; streaming algorithms; variance.

1 INTRODUCTION

The number of triangles (hereafter denoted as ∆) is an important metric to reveal the complex structure of real-world networks. It has been used in many applications including community structure detection and graph clustering [1], link prediction [2], spam detection [3], DNA sequence analysis [4], microarray data analysis [5], word-learning [6], and many others. Exact algorithms to compute ∆ in a large network are costly: it was proven that the best algorithm has a complexity of Θ(M^{3/2}), where M is the number of edges in the input graph [7]. Therefore, various sampling-based algorithms have been proposed, e.g., in [8]–[18].

Sampling-based algorithms are especially important in the era of big and hidden data. There are numerous massive networks with billions of nodes; for example, Facebook as an online social network has over two billion users. Many networks are dynamic: both users and the connections between them change over time. Furthermore, networks are often hidden behind access interfaces, and the data in their entirety are not available. Therefore, it is essential to design sampling-based methods.

There are two types of methods that estimate triangles and the closely related clustering coefficient. One is direct sampling, which has random access to the nodes/edges of the input graph [8] [9] [19] [20]. The other is the streaming model, which scans the nodes/edges of the input graph in an arbitrary order over a stream. Note that one salient feature of streaming algorithms is that the arrival sequence may not be uniformly random. In the streaming model, a constant number of passes over the stream are used to estimate ∆; the key constraint is a limited memory window [10] [11] [15]–[17] [21] [22]. When there is no limit on the number of passes, the model is called semi-streaming [23]. This paper addresses the estimation of ∆ in the streaming model.

We propose a new streaming algorithm, called PES (Priority Edge Sampling). It is based on edge sampling [11] [14] [12] [8], and gives higher priority to edges that can form triangles. We prove that our estimator is unbiased, and derive the variance of the estimator so that a confidence interval can be obtained when an estimation is given.
Empirically, we compare PES with the state-of-the-art GPS-In [10], TRIEST [12], and MASCOT [14] algorithms, and demonstrate that PES outperforms them consistently on most of the 48 real networks we experimented with. More importantly, the performance gain increases with the size of the networks. The performance ratio can be as high as 11 for GPS-In, the best of the existing algorithms, meaning that GPS-In needs 11 times more samples to achieve the same accuracy.

The performance of sampling algorithms is often data dependent, especially on the structure of the graphs. To verify our result beyond empirical comparisons, we also conduct analytical comparisons. GPS-In cannot give an analytical variance of its estimation because its sampling probability changes at every step; hence the comparison between PES and GPS-In cannot be analytical. To understand the advantage of PES, we compare it with NES (Naive Edge Sampling), which was proposed in [8] [12] [14] and is similar to MASCOT and TRIEST. The analytical comparison between PES and NES sheds some light on the difference between PES and GPS-In.

To summarize, our main contributions are that we have: 1) given an efficient algorithm, PES; 2) proved the unbiasedness of the estimator and derived the variances for PES and NES; and 3) compared PES and NES analytically.

2 BACKGROUND AND RELATED WORK
Given a simple graph G(V, E), where V stands for the set of nodes and E for the set of edges, let N = |V| and M = |E|.

TABLE 1: Summary of the notations

Notation   Meaning
G(V, E)    Input graph (undirected and no self-edges)
g          A subgraph of G
N, M       Number of nodes and edges in G
p, q       Sampling probabilities
Λ          Number of wedges in G
∆          Number of triangles in G
σ          A wedge pool
n          Size of pool σ
m          Sample size
∆_σ        Number of closed wedges (triangles) identified in σ
Λ_c        Number of candidate wedges
∆_g        Number of triangles identified in g
Φ          Number of pairs of triangles sharing an edge in G
∆̂_NES      Naive edge sampling estimator
∆̂_PES      Priority edge sampling estimator
∆ and Λ denote the number of triangles and wedges in G, respectively. A wedge W is a path (u, v, w) of length two, where u, v, w ∈ V, (u, v) ∈ E, and (v, w) ∈ E. The wedge W is closed if (u, w) ∈ E; otherwise it is open. A closed wedge is also called a triangle. Note that each triangle contains three closed wedges. Table 1 summarizes the notations used in the rest of this paper.

Each sampling method takes some sample nodes or edges, or a combination of them, into a subgraph. Then the number of triangles in the subgraph is used to estimate the triangle count. Depending on how the samples are taken, the estimator and its variance change. Intuitively, we want to observe as many triangles as possible while keeping the sample size small. The bottom line is that we need to observe at least one triangle in order to give an estimate.

The most naive method of triangle estimation is to sample three random nodes as a potential triangle, then check the existence of the edges among those nodes over a stream. It is called triple sampling [16]. This approach needs to sample N³/(6ε²∆) triples to achieve an estimation in the interval ∆ ± ε∆ with 95% confidence. Intuitively, the cost comes from the O(N³) three-node combinations. Obviously, it is not a practical method because the sample size is too large to observe even one triangle; the cost is even higher than counting the triangles directly.

A more practical method is to sample a random edge and a random node [15], then check whether the three nodes (one random node plus the two nodes of the edge) form a triangle. This improves triple sampling by assuming that one edge of the triple always exists; hence only the existence of the other two edges needs to be checked. Still, it needs to take MN/(3ε²∆) triples to reach the same confidence interval as triple sampling.

Large real networks are mostly sparse, hence the probability of having a triangle is still low among two random pairs of nodes. One improvement over the above method is, instead of choosing a random node from the entire graph, to select a random node from the edge's neighbourhood. This is called neighborhood sampling [13] [17].

The most straightforward edge sampling is to take edges uniformly at random, then count the triangles in the resulting subgraph [9]. In the streaming model, the corresponding streaming version of the algorithm takes each edge with equal probability p over a stream and creates a subgraph g; the number of triangles in g is used to estimate ∆. Obviously, the sampling probability of a triangle in such a method is p³. The size of g needs to be on the order of M/(ε²∆)^{1/3} to obtain an estimation with an additive error ±ε∆ with 95% confidence [8]. When p is small, which is the case for very large graphs, this algorithm is not efficient.
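To make this baseline concrete, here is a minimal sketch of uniform edge sampling (our own illustration, not code from [8] or [9]; node ids are assumed hashable and comparable):

import random

def uniform_edge_sampling_estimate(edge_stream, p, seed=None):
    # Keep each edge with probability p (forming the subgraph g), then
    # count the triangles of g and scale by 1/p^3, since a triangle
    # survives only if all three of its edges are sampled.
    rng = random.Random(seed)
    adj = {}
    for u, v in edge_stream:
        if rng.random() < p:
            adj.setdefault(u, set()).add(v)
            adj.setdefault(v, set()).add(u)
    # Each triangle is counted once per edge, i.e. three times in total.
    closed = sum(len(adj[u] & adj[v]) for u in adj for v in adj[u] if u < v)
    return (closed // 3) / p**3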
Instead of using equal probability for the three edges, there are methods that assign higher probabilities to the second and/or the third edge. For instance, post-stream graph priority sampling (GPS-Post) [10] takes this approach by sampling the third edge with a higher probability if it is in a triangle. Another technique is to take edges from the neighborhood of already sampled edges with a higher probability. A pair of connected edges (called a wedge) in the sample is a potential triangle, and its closeness is checked in the rest of the stream [12] [14] [21]. Obviously, the probability of forming a wedge is p², because two edges need to be sampled. [11] improves this method as follows: when an edge closes a wedge in the sample, it is unconditionally added into the sample; if it is connected to some sampled edges, it is chosen with a higher probability q; otherwise it is taken with probability p. The number of triangles in the sample is used to estimate ∆. Obviously, this method samples triangles with different probabilities—various products of p, q, and 1, depending on the arrival order of the three edges. One shortcoming of this approach is how to determine q, the sampling probability of a neighbor edge. To overcome this issue, in our method q is dynamically adjusted using reservoir sampling [24].

More recently, another elegant approach, called in-stream priority sampling (GPS-In), was proposed by [10]. It preserves edges in the sample with different priorities: the number of sampled wedges closed by an edge is used as a measure to determine the priority of the edge being preserved in the sample. For each new edge e, GPS-In first counts the number of wedges closed by e in the sample and computes its priority; then the edge is added into the sample. If the number of edges in the sample exceeds the size limit, an edge with lower priority is removed from the sample. In each step, the estimator for ∆ is updated if edge e completes some wedges in the sample. It has been shown that GPS-In outperforms the existing methods [10]; therefore, we consider GPS-In the state-of-the-art method in this context.

When random access to the input graph is available, the ideal method is wedge sampling: select some wedges uniformly at random and check their closeness to estimate ∆. Unfortunately, taking a wedge uniformly at random in a large graph is costly, and three passes over an edge stream are required to implement wedge sampling in the streaming model [15]–[17].

Another direction is indirect sampling. Such methods are applied when the entire graph is not accessible; they use traversal-based sampling techniques to take a sample from the input graph [25] [26]. Moreover, several works compute the clustering coefficient, a metric closely related to ∆ [26]–[29].

3 NAIVE EDGE SAMPLING (NES)
As a starting point for understanding the PES algorithm described in the next section, we first present a naive algorithm based on edge sampling, called NES (Naive Edge Sampling). It is similar to TRIEST [12] and MASCOT [14]. Note that MASCOT was proposed to estimate the number of triangles for each node in a graph (local triangle counting); NES can be considered its modification for estimating ∆ (global triangle counting). The details of NES are shown in Alg. 1. For each edge in the stream, NES adds the edge into the subgraph g with probability p (Line 4). Then the same edge is used to check how many wedges in the current g are closed by it; ∆_g records the number of such closed wedges, i.e., triangles (Lines 5–7).

The algorithm differs from the one in [8] in that we do not count the triangles in g; instead, the closeness of wedges in g is checked during the streaming process. Clearly, the probability of forming a wedge in g is p². Note that the three edges of a closed wedge can appear in six different orders in a stream; in two of them, the closing edge appears after the other two, and the associated closed wedge can be observed. Thus, the probability of identifying a particular closed wedge is p²/3. Because each triangle has three closed wedges, and exactly one of them—the one whose closing edge arrives last—can be identified over a stream, the sampling probability of each triangle is 3 × p²/3 = p². Each closed wedge identified by NES is therefore counted as one triangle.

Suppose δ_i is an indicator for the i-th triangle in the original graph G: δ_i is one when the i-th triangle is identified over the stream, and zero otherwise. Recall that ∆_g is the number of triangles identified by NES based on g over a stream. The expectation of ∆_g is E(∆_g) = E(Σ_{i=1}^{∆} δ_i) = Σ_{i=1}^{∆} E(δ_i) = Σ_{i=1}^{∆} p² = p²∆. Thus, the unbiased estimator for ∆ using NES is ∆̂_NES = ∆_g/p².
Algorithm 1: Naive Edge Sampling (NES)
Input: p
Output: ∆̂_NES, RSE(∆̂_NES)
 1: begin
 2:   ∆_g = 0, g = {}
 3:   while new edge e do
 4:     add e into g with probability p
 5:     foreach wedge w ∈ g closed by e do
 6:       ∆_g += 1
 7:     end
 8:   end
 9:   ∆̂_NES = ∆_g/p²
10:   RSE(∆̂_NES) ≈ ∆_g^{−1/2}
11: end
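For concreteness, here is a minimal executable sketch of Alg. 1 (our own Python rendering, not code from the paper; node ids are assumed hashable):

import random

def nes(edge_stream, p, seed=None):
    # Alg. 1: for each incoming edge (u, v), count the wedges of g it
    # closes, i.e. the common neighbours of u and v in the sampled
    # subgraph g, then decide whether to keep the edge itself.
    rng = random.Random(seed)
    adj = {}                       # adjacency sets of the subgraph g
    delta_g = 0                    # closed wedges (triangles) identified
    for u, v in edge_stream:
        delta_g += len(adj.get(u, set()) & adj.get(v, set()))
        if rng.random() < p:       # Line 4: sample the edge into g
            adj.setdefault(u, set()).add(v)
            adj.setdefault(v, set()).add(u)
    est = delta_g / p**2           # unbiased estimator derived above
    rse = delta_g ** -0.5 if delta_g else float("inf")   # Theorem 1
    return est, rse

Whether the closure check happens before or after the sampling decision makes no difference, since the new edge cannot belong to a wedge that it closes.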
Next we need to understand the variance of ∆̂_NES. Although MASCOT gave a similar algorithm, it only provides an upper bound on the variance. We derive the variance of ∆̂_NES and present it in the form of the Relative Standard Error (RSE = √var/∆) in Theorem 1. We use RSE instead of the more commonly used variance because the variance depends on the ground truth ∆, which changes from dataset to dataset; this is especially inconvenient when evaluating multiple datasets, where a larger variance on one dataset may be better than a smaller variance on another.

The variance of NES is adapted from, but different from, that of the direct sampling algorithm in [8], to accommodate the streaming model. The main difference is that in NES, to identify a closed wedge over a stream, its two edges first need to be added into g, and then its third edge needs to be visited in the rest of the stream.
Theorem 1. The RSE of ∆̂_NES is approximated by
RSE(∆̂_NES) ≈ ∆_g^{−1/2}.    (1)

Proof.
See Appendix A.

Theorem 1 shows that the variance depends on the number of triangles in the sampled graph g. To reduce the RSE with the same subgraph size |g|, we need to sample more triangles while keeping the same sampling probability for the first edge. This prompts us to increase the sampling probability for the second edge of a triangle.
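Before moving on, Theorem 1 can be sanity-checked numerically. The simulation below is our own construction with arbitrary parameters (not an experiment from the paper); it reuses the nes function above, streams a small random graph k times, and compares the observed relative error with the prediction (p²∆)^{−1/2} ≈ ∆_g^{−1/2}:

import random
from itertools import combinations

def true_triangles(edges):
    # Brute-force ground truth for a small graph.
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return sum(len(adj[u] & adj[v]) for u, v in edges) // 3

rng = random.Random(0)
edges = [e for e in combinations(range(200), 2) if rng.random() < 0.05]
delta = true_triangles(edges)

k, p = 1000, 0.3
estimates = []
for i in range(k):
    random.Random(i).shuffle(edges)      # a fresh stream order per run
    estimates.append(nes(edges, p, seed=i)[0])
mu = sum(estimates) / k
rse_obs = (sum((x - mu) ** 2 for x in estimates) / k) ** 0.5 / delta
print(mu, "vs", delta, "; observed RSE", rse_obs,
      "vs predicted", (p * p * delta) ** -0.5)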
4 PRIORITY EDGE SAMPLING (PES)

PES improves NES by increasing the probability of capturing triangles in the sample graph. To do so, we maintain a pool of wedges in addition to the subgraph g: edges that can form a wedge with g have a higher priority of being sampled, hence the name Priority Edge Sampling. It is impossible, and unnecessary, to keep all the wedges; instead, we maintain a small fixed-size pool of wedges σ. For each triangle, the first edge is sampled with probability p, the same as in NES. The difference is in the second edge: when the second edge is scanned, the associated wedges are added into σ with probability q. Later we will show that q is normally much larger than p, especially when the graph is large. The closeness of the wedges in the pool is checked in the rest of the stream. Therefore, PES identifies a triangle with probability pq, which is greater than the p² of NES.

The details of PES are summarized in Alg. 2. Input p is the sampling probability of edges, and n is the pool size. In our experiments, we simply set n = |g| for the convenience of performance comparison. Λ_c counts the wedges formed based on g such that the first edge is in g but the second edge is not necessarily in g. Some of these wedges may be added to σ with a changing probability q; hence we call them candidate wedges, and their count is denoted by Λ_c. ∆_σ counts the triangles formed from σ and g. When a new edge e is visited, it is added into the subgraph g with probability p (Line 4). Then the closeness of the wedges in pool σ is checked (Lines 5–8); once a closed wedge is identified, the number of triangles ∆_σ captured so far is increased by 1 (Line 7). Next, each candidate wedge w(e, f) formed by the new edge e and an edge f in g is considered for addition into pool σ with probability q (Lines 9–21). Note that probability q is dynamically computed over the stream using n and Λ_c (Line 14). We explain the steps in the following illustrative example.
Algorithm 2: Priority Edge Sampling (PES)
Input: p, n
Output: ∆̂_PES, RSE(∆̂_PES)
 1: begin
 2:   Λ_c = 0, ∆_σ = 0, σ = {}, g = {}
 3:   while new edge e do
 4:     add edge e into g with probability p
 5:     foreach wedge w in σ closed by e do
 6:       label w as closed
 7:       ∆_σ += 1
 8:     end
 9:     foreach wedge w(e, f) where edge f ∈ g do
10:       Λ_c += 1
11:       if |σ| < n then
12:         σ = σ ∪ {w}
13:       else
14:         q = n/Λ_c
15:         if Random[0, 1) < q then
16:           select a random wedge w′ from σ
17:           if w′ is closed then ∆_σ −= 1
18:           σ = (σ − {w′}) ∪ {w}
19:         end
20:       end
21:     end
22:   end
23:   ∆̂_PES = ∆_σ/(pq)
24:   RSE(∆̂_PES) ≈ ∆_σ^{−1/2}
25: end

We illustrate PES with a toy graph in Fig. 1, with detailed steps.

Fig. 1: Steps of applying PES on a toy graph. (A) An example graph. (B) The per-step trace (edge e, subgraph g, candidate wedges w(e, f), pool σ, Λ_c, q, ∆_σ) with p = 0.2 and n = 2.

Each row of the table in Fig. 1 represents one step: column e shows the edge stream, and column g the sampled edges in subgraph g. In this example, each edge in the stream is added into g with probability p = 0.2. When the first edge arrives, PES adds it to g with probability p; suppose that it is not added, so g remains empty. The next edge in the stream is (6,8); suppose that this time it is added to g. It cannot yet form any wedge, since it is the only edge in g.

The third edge, (6,7), is not added into g, but we still check its neighbours in g for closed wedges and candidate wedges. Since edge (6,8) is already in the subgraph g, the wedge (7,6,8) is formed. For each wedge in the pool we keep a label showing its closeness; the open wedge (7,6,8) is denoted (7,6,8,−). Column Λ_c records the number of such candidate wedges; it can be larger than the pool size, because not every candidate wedge is added into the pool—a candidate wedge always increases Λ_c by one, whether or not it enters σ.

The pool has a fixed size and functions as a reservoir; in this example its capacity is n = 2. A candidate wedge is added into the pool unconditionally only while the pool is not yet full. Hence the wedge (7,6,8), and the wedge (1,6,8) formed in the subsequent step, are added into the pool.

When the pool is full, a candidate wedge replaces a random wedge in the pool with probability q. In step 9, edge (6,10) forms the candidate wedge (8,6,10) with edge (6,8). This fourth candidate wedge cannot be added into σ directly, because the pool has reached its limit of 2. Instead, it replaces one of the wedges in the pool with probability q = n/Λ_c = 2/4. Suppose that, by chance, this wedge replaces (7,6,8) in the pool. The candidate wedge in step 10 does not replace any wedge in the pool, by chance. For the candidate wedge (9,6,8) in step 11, suppose that it replaces the existing wedge (1,6,8) in the pool. Step 12 has another wedge replaced.

The last edge in the stream is (8,9). It closes the wedge (9,6,8,−) obtained in the previous steps; hence the label of this wedge changes to +, and ∆_σ is increased by 1. At this point Λ_c = 8: eight candidate wedges were identified in total over the stream, and the probability of preserving a wedge in σ is q = 2/8 = 0.25. Thus, the unbiased estimate for ∆ is

∆̂_PES = ∆_σ/(pq) = 1/(0.2 × 0.25) = 20.    (2)
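A compact Python sketch of Alg. 2 follows (our own rendering, not code from the paper; representing pool entries as mutable [endpoint, centre, endpoint, closed-flag] lists is our choice):

import random

def pes(edge_stream, p, n, seed=None):
    rng = random.Random(seed)
    adj = {}            # adjacency sets of the sampled subgraph g
    sigma = []          # wedge pool: [end, centre, end, closed?]
    lam_c = 0           # number of candidate wedges seen (Lambda_c)
    delta_sigma = 0     # closed wedges currently labelled in the pool
    for u, v in edge_stream:
        in_g = rng.random() < p                    # Line 4
        for w in sigma:                            # Lines 5-8
            if not w[3] and {w[0], w[2]} == {u, v}:
                w[3] = True
                delta_sigma += 1
        for centre, other in ((u, v), (v, u)):     # Lines 9-21
            for x in adj.get(centre, ()):
                if x == other:                     # skip degenerate wedges
                    continue
                lam_c += 1
                cand = [other, centre, x, False]
                if len(sigma) < n:                 # pool not yet full
                    sigma.append(cand)
                else:
                    q = n / lam_c                  # Line 14
                    if rng.random() < q:           # evict a random wedge
                        j = rng.randrange(n)
                        if sigma[j][3]:
                            delta_sigma -= 1
                        sigma[j] = cand
        if in_g:
            # Adding e to g after forming candidates is equivalent here,
            # since a wedge cannot pair an edge with itself.
            adj.setdefault(u, set()).add(v)
            adj.setdefault(v, set()).add(u)
    q = n / lam_c if lam_c > n else 1.0            # final q
    return delta_sigma / (p * q)

On the toy stream of Fig. 1, pes(stream, p=0.2, n=2) produces estimates of the form ∆_σ/(pq); the particular value 20 above depends on the random choices made in that run.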
We prove that ∆̂_PES is unbiased as follows. Let δ_i be the indicator function for the i-th triangle in the input graph: it is one when the i-th triangle is sampled, and zero otherwise. For each triangle, the probability of sampling the first edge is p, and the probability of sampling the second edge is q. Note that a wedge's closeness is checked at every step while it remains in the pool. Hence the probability of sampling a triangle is pq. The expectation of δ_i is pq, and the expectation of ∆_σ is

E(∆_σ) = E(Σ_{i=1}^{∆} δ_i) = Σ_{i=1}^{∆} E(δ_i) = Σ_{i=1}^{∆} pq = pq∆.    (3)

Thus, the unbiased estimator is as follows.
Theorem 2. The unbiased estimator for the PES algorithm is

∆̂_PES = ∆_σ/(pq).    (4)

An interesting aspect of the algorithm is that q decreases over time, and the sampling probability of the second edge in Eq. 4 is the q of the final step, not the larger q values of earlier steps. Intuitively, wedges sampled in earlier steps have a higher probability of being replaced during the process: the earlier a wedge is scanned, the larger q is at that moment, but the wedge also has a higher probability of being replaced at a later stage. Hence the overall probability is the same as the final q. The detailed proof is similar to that of reservoir sampling [24], using induction, as follows.

For the last candidate wedge, arriving at time Λ_c, it is easy to see that its second edge has sampling probability q = n/Λ_c. The wedges that arrived earlier also have sampling probability q, following reservoir sampling [24], as shown by the following induction. When Λ_c = n + 1, the sampling probability for a wedge that arrived before time n + 1 is

1 × ( 1/(n+1) + (n/(n+1)) × ((n−1)/n) ) = n/(n+1).    (5)

This is because there is a probability of 1/(n+1) that the new wedge does not replace any old wedge, and a probability of n/(n+1) that an old wedge is replaced; for each replacement, the probability that one particular wedge is not the one replaced is (n−1)/n. Now suppose that the old wedges are kept with probability n/(n+x) when Λ_c = n + x. When Λ_c = n + x + 1, the sampling probability for a wedge that arrived before time n + x + 1 is

(n/(n+x)) × ( 1/(n+x+1) + ((n+x)/(n+x+1)) × ((n+x−1)/(n+x)) ) = n/(n+x+1).    (6)

The variance of the estimator is complicated because two different sampling techniques are involved: uniform sampling and reservoir sampling. In PES, a wedge, as a possible triangle, is formed uniformly at random with probability p over the edge stream, and it is preserved with probability q in pool σ. Applying the variance operator to the estimator, we get

var(∆̂_PES) = var( ∆_σ/(pq) ) = var( (1/(pq)) Σ_{i=1}^{∆} δ_i )
           = (1/(pq)²) Σ_{i=1}^{∆} Σ_{j=1}^{∆} cov(δ_i, δ_j)
           = (1/(pq)²) ( Σ_{i=1}^{∆} var(δ_i) + Σ_{i≠j} cov(δ_i, δ_j) ).    (7)

Recall that δ_i is the indicator of the i-th triangle as defined before. By the definition of variance, var(δ_i) = E(δ_i²) − E(δ_i)², so the cost of the first term in Eq. 7 is ∆(pq − (pq)²). For the covariance, let Φ be the number of pairs of triangles with a common edge. To identify such a pair, PES must add the common edge into g, which happens with probability p; otherwise, identifying the two triangles is independent. Furthermore, the two wedges (one from each triangle) need to be preserved in pool σ, which happens with probability q′ = (n² − n)/((pΛ)² − pΛ). Thus, given a favorable arrival order, the probability of sampling such a dependent pair is pq′. Recall that n is the size of pool σ. Each dependent pair has five edges, and the common edge must be visited before the other four and be sampled with probability p. Clearly, the five edges can arrive in 120 different orders in a stream, and in one-fifth of them the common edge is the first one. Because the edges are assumed to arrive in a random order, each of the 120 orders is equally likely.
Note that each dependent pair (δ_i, δ_j) appears twice in the covariance term. Thus, the cost of the Φ dependent cases is 2Φ(pq′/5 − (pq)²). Because reservoir sampling is used to preserve the wedges in pool σ, we also need to account for the (∆² − 2Φ − ∆) independent pairs: the probability of selecting a pair of independent triangles is p²q′. By the definition of covariance, i.e., E(δ_iδ_j) − E(δ_i)E(δ_j), the cost of the independent cases is (∆² − 2Φ − ∆)(p²q′ − (pq)²). Substituting these costs into Eq. 7 and simplifying, the variance of the estimator is given by the following lemma.
Lemma 1. Let ∆ be the true number of triangles and ∆̂_PES its estimation by PES. The variance of ∆̂_PES is

var(∆̂_PES) = ∆(1 − pq)/(pq) + 2Φ(q′ − 5pq²)/(5pq²) + Φ′(q′ − q²)/q²,    (8)

where Φ is the number of pairs of triangles sharing an edge, q = n/(pΛ), q′ = (n² − n)/((pΛ)² − pΛ), and Φ′ = ∆² − 2Φ − ∆.

The variance of the estimator depends on several metrics, including ∆, Φ, p, and q. In practice we do not know these metrics—for example, ∆ is exactly what we are estimating. Hence, to assess the performance of the estimator, we need to estimate the variance. We therefore simplify the variance to gain better insight: we translate the variance into RSE and apply a big-data assumption, leading to the following theorem.
Theorem 3. The RSE of ∆̂_PES is approximated by
RSE(∆̂_PES) ≈ ∆_σ^{−1/2}.    (9)

Proof.
Translate var(∆̂_PES) into RSE = √var/∆. When the input graph is large, the approximations n − 1 ≈ n and pΛ − 1 ≈ pΛ are valid, so q′ ≈ q². After some algebra we get

RSE(∆̂_PES) ≈ [ (1/(∆pq)) ( 1 − pq + (2Φ/(5∆))(q − 5pq) ) ]^{1/2}.    (10)

When the graph is large, the sampling probabilities p and q are very small, and the terms −pq and (2Φ/(5∆))(q − 5pq) in Eq. 10 are negligible. Thus, Eq. 10 simplifies to (∆pq)^{−1/2}. Replacing ∆ with its estimation from Eq. 4, we obtain the theorem.

TABLE 2: Properties of the networks in our experiments, sorted by graph size N. ⟨d⟩ is the average degree and C is the global clustering coefficient. Types: OSN = online social network; COL = collaboration; CIT = citation; COA = coauthorship; ECO = e-communication; WEB = web graph; COP = co-purchasing; INT = internet topology.

    Dataset                   N (×10⁶)  ⟨d⟩      C       Type
 1. Ego-facebook [30]         0.004     43.69    0.519   OSN
 2. CA-GrQc [30]              0.005     5.52     0.629   COL
 3. Wiki-vote [30]            0.007     28.32    0.125   OSN
 4. AstroPh [31]              0.01      21.10    0.31    CIT
 5. CA-CondMat [30]           0.02      8.08     0.264   COA
 6. HepPh [31]                0.02      224.14   0.279   COA
 7. Enron-email [31]          0.03      10.02    0.085   ECO
 8. Brightkite [30]           0.05      7.35     0.110   OSN
 9. Facebook [31]             0.06      25.64    0.147   OSN
10. Epinions [31]             0.07      10.69    0.065   OSN
11. Slashdot-Zoo [31]         0.07      11.82    0.023   OSN
12. Livemocha [31]            0.1       42.13    0.014   OSN
13. Douban [31]               0.1       4.22     0.01    OSN
14. Gowalla [30]              0.1       9.66     0.023   OSN
15. Libimseti [31]            0.2       155.97   0.007   OSN
16. Digg [31]                 0.2       11.07    0.061   OSN
17. Dblp-Coau [30]            0.3       6.62     0.306   COA
18. Web-NotreDame [30]        0.3       6.69     0.087   WEB
19. Amazon [30]               0.3       5.53     0.205   COP
20. Actor [31]                0.3       78.68    0.166   COL
21. Citeseer [31]             0.3       9.03     0.049   CIT
22. Dogster [31]              0.4       40.03    0.014   OSN
23. Catster [31]              0.6       50.32    0.028   OSN
24. Web-Google [31]           0.8       9.87     0.055   WEB
25. Youtube [30]              1.1       5.27     0.006   OSN
26. Dblp [31]                 1.3       8.16     0.170   COA
27. Wiki-Polish [31]          1.5       55.17    0.01    WEB
28. Trec-wt10g [31]           1.6       8.33     0.014   WEB
29. Wiki-Portuguese [31]      1.6       48.19    0.022   WEB
30. Wiki-Japanese [31]        1.6       69.82    0.021   WEB
31. Pokec [31]                1.6       27.31    0.046   OSN
32. As-skitter [30]           1.6       13.08    0.005   INT
33. Wiki-Italian [31]         1.8       72.90    0.024   WEB
34. Hudong [31]               1.9       14.54    0.003   WEB
35. Hollywood [32], [33]      1.9       24.51    0.152   OSN
36. Flicker [31]              2.3       19.83    0.107   OSN
37. Flixster [31]             2.5       6.27     0.013   OSN
38. Wiki-Russian [31]         2.8       44.20    0.015   WEB
39. Wiki-French [31]          3.0       55.21    0.015   WEB
40. Orkut [31]                3.0       76.28    0.041   OSN
41. Wiki-German [31]          3.2       40.77    0.0088  WEB
42. USpatent [31]             3.7       8.75     0.067   CIT
43. LiveJournal [30]          3.9       17.35    0.125   OSN
44. DBpedia [31]              18        13.89    0.0016  WEB
45. Web-Arabic [32], [33]     22        48.70    0.031   WEB
46. Gsh-2015 [32], [33]       29        9.18     0.007   WEB
47. MicrosoftAc.G. [34]       46        22.61    0.015   CIT
48. Friendster [31]           65        55.06    0.017   OSN

Choosing a proper pool size is important for good performance of PES. Recall that PES increases the probability of identifying a triangle by storing candidate wedges in pool σ. According to Theorem 3, the error bound of an estimation using PES depends on the number of triangles in pool σ, i.e., ∆_σ. For example, to obtain an estimation within [∆ ± 0.4∆] with 95% confidence, PES needs to identify 25 triangles.

Under the PES sampling scheme, the number of triangles in pool σ depends on a structural property of the input graph measured by the global clustering coefficient C—the fraction of closed wedges in the graph: only a C-fraction of the wedges in the pool can form triangles. Thus, the size of pool σ needs to be n ≈ ∆_σ/C to observe ∆_σ triangles. For example, if C = 0.05 in the input graph, the pool needs to hold at least 500 wedges for PES to observe 25 triangles.

Note that C is unknown for the input graph, so it cannot be used directly to decide the best value of n in practice. One way to resolve this issue is to use an adaptive-size reservoir as the pool: the size of σ is adjusted during the sampling process so that the pool holds a specific number of triangles while still storing uniform random samples. We refer readers to [35] for details on adaptive-size reservoir sampling.
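The rule of thumb above (RSE ≈ ∆_σ^{−1/2}, n ≈ ∆_σ/C) can be packaged as a small planning helper; the function below is our own convenience sketch, not part of PES:

def plan_pool(target_rse, clustering_coefficient):
    # Theorem 3: RSE ~ Delta_sigma ** -0.5, so a target RSE requires
    # about target_rse**-2 triangles in the pool; since only a
    # C-fraction of pooled wedges close, n ~ triangles / C.
    triangles_needed = round(target_rse ** -2)
    pool_size = round(triangles_needed / clustering_coefficient)
    return triangles_needed, pool_size

print(plan_pool(0.2, 0.05))   # -> (25, 500), the worked example above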
5 EXPERIMENTS

We conduct experiments to 1) compare our algorithms with the state-of-the-art algorithms GPS-In and GPS-Post [10], TRIEST [12], and MASCOT [14] (other algorithms are not compared because they have already been shown to be inferior to GPS-In); and 2) verify the analytical results presented in Theorems 1 and 3. The verification is needed because there are approximations in the derivations: the precise results are long formulas that depend on the structure of the graph, such as the number of triangles (∆) and the count of dependent triangles (Φ), while Theorems 1 and 3 give more concise results by omitting some terms under the assumption that the graph is large and p is small. How good this approximation is needs to be evaluated empirically.

The code, along with all the data, including some intermediate data, is available at http://cs.uwindsor.ca/∼etemadir/PES.

Because the performance of sampling algorithms often varies from dataset to dataset, and especially depends on the structure of the graphs, we verify our results extensively on 48 real networks of different sizes from a variety of domains. The sizes range from 4 thousand to 65 million nodes; the domains include online social networks (OSN), web graphs, citation and co-authorship networks, etc. In some figures we only plot half of the datasets (24) to save space; the other datasets behave similarly.

It is computationally costly to obtain the ground truth of large graphs. Luckily, we have access to two servers, each with 24 cores and 256 GB RAM, to carry out such intensive computing. Table 2 summarizes the networks and their statistics. The graphs are sorted by their node size N; in the table, ⟨d⟩ is the average degree and C is the global clustering coefficient (C = 3∆/Λ).

To evaluate our method, we compare with all the related methods we are aware of: GPS-In [10], GPS-Post [10], TRIEST [12], and MASCOT [14]. For a fair comparison, we implemented these algorithms in the same framework. MASCOT was originally designed for local triangle counting, hence we modified it for global triangle counting, which makes it the same as NES.

We executed the estimators on the graphs and report the results, along with our observations, in the following sections. The results were obtained over 1000 independent runs per graph, except for the four largest graphs, which were repeated 500 times. The edges of the graphs were scanned in a random order; the edge list can be in the same or a different order in each run, but all the methods were fed the edge lists in the same order.
To compute the observed RSE, we repeat the estimation k times using the same sample size, each time obtaining an estimate ∆̂_i. Let µ = (1/k) Σ_{i=1}^{k} ∆̂_i. The observed RSE is

RSE = (1/∆) √( (1/k) Σ_{i=1}^{k} (∆̂_i − µ)² ).    (11)

In our experiments, k = 1000 for all graphs except the four largest ones, for which k = 500.
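Eq. 11 translates directly into code; a small sketch (ours), where estimates holds the k per-run estimates and truth is the ground-truth ∆:

def observed_rse(estimates, truth):
    # Population standard deviation of the estimates, relative to the
    # ground truth (Eq. 11).
    k = len(estimates)
    mu = sum(estimates) / k
    var = sum((x - mu) ** 2 for x in estimates) / k
    return var ** 0.5 / truth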
Fig. 2: Sample size ratios of our PES vs. GPS-Post (Panel A), GPS-In (Panel B), and NES (Panel C) when RSE = 0.2. In each panel, the Y-axis is the ratio m_method/m_PES and the X-axis is N × C.

Fig. 2 summarizes the comparison of PES with the state-of-the-art methods GPS-Post (Panel A) and GPS-In (Panel B) [10]; Panel C compares NES with our PES algorithm. We set the sampling probabilities of the estimators so that they obtain the same RSE, and report the ratios between sample sizes when RSE = 0.2; similar phenomena are observed for other RSEs. In each panel, the Y-axis is the ratio and the X-axis is the graph size, represented by the node count N multiplied by the global clustering coefficient. For all the methods, m is the 'sample size'. The algorithms differ in the definition of 'sample size', because some maintain a reservoir of wedges in addition to the subgraph g, or use extra memory per sampled edge to store information about the sampled edges in g. NES has g only, so its sample size m is the number of sampled edges (denoted by |g|), which equals pM. PES maintains a wedge pool σ, so its sample size is |g| + |σ|. GPS-In and GPS-Post also store the subgraph g plus two additional values per edge in g; however, we count their sample sizes as |g|. The parameters were set based on Eq. 8 for PES and Eq. 14 for NES to achieve an estimate with RSE = 0.2, while the parameters of GPS-Post and GPS-In were obtained manually from experimental results. Then the average sizes of subgraph g and pool σ over the k independent runs were used as the sample sizes of the methods.

In the panels, each marker represents one of the 48 graphs described in Table 2. From the figure we make several observations:

• Our PES outperforms GPS-In and GPS-Post consistently in terms of sample size. All the ratios are above one in Panel A, meaning that PES needs fewer samples than GPS-Post on every dataset. For instance, Orkut (labeled 40) in Panel A has ratio 73, meaning that GPS-Post needs 73 times more sampled edges than our PES. Compared with GPS-In, our PES also needs a smaller sample size on most of the graphs: for example, LiveJournal (labeled 43) in Panel B has ratio 5.4, meaning that GPS-In requires 5.4 times more sampled edges than PES to obtain an estimation with the same RSE. The improvement margin is higher for GPS-Post, which is expected since GPS-In improves on GPS-Post: for the same LiveJournal data, the ratio in Panel A is 66, much higher than 5.4.

• The ratio is positively correlated with data size. In other words, compared with PES, the sample size of the other algorithms grows polynomially with graph size. This has strong implications for very large graphs: although the other algorithms can deal with current data, their performance deteriorates polynomially with graph size. The Pearson correlation coefficient between the ratios and the data size is 0.82 for GPS-In, 0.80 for GPS-Post, and 0.79 for NES.

• The performance of our PES depends on both the graph size (N) and the structure (captured by the global clustering coefficient). When the graph is large, the other methods need to sample a large fraction of the edges to observe pairs of connected edges (wedges) as potential structures for identifying triangles. In contrast, PES uses the wedge pool σ to increase the chance of identifying triangles during the sampling process. The required size of σ depends on C: only a C-fraction of the wedges in σ are used to identify triangles, so when C is small, PES needs to store more wedges in σ.
Fig. 3: Our PES uses less memory than the other methods to obtain an estimation with the same RSE = 0.2 on most of the graphs. Note that for our PES the sample size includes both the size of the subgraph and the reservoir, and for GPS-In the extra memory per sampled edge is counted, giving 2|g|.

Fig. 3 compares the actual sample sizes of the three methods side by side; the sample sizes are those needed to achieve the same RSE = 0.2. Take the Friendster data for example: the sample size of PES is 8,132, meaning that the subgraph size is 4,066 and the reservoir size is 4,066. By contrast, the sample size of GPS-In is 177,308 (i.e., |g| = 88,654), and the sample size of NES is 133,355 (i.e., |g| = 133,355). As shown in the figure, our PES outperforms the other methods on most of the graphs, and the performance ratios grow with the size of the graphs. For example, all the methods need almost the same sample size on the Ego-facebook graph (the smallest graph in our collection) to achieve RSE = 0.2, whereas on Friendster, the largest graph in the collection, PES needs 10.9 times less sample size than GPS-In.

Next, we investigate how the performance ratios between the methods change as the accuracy of the estimators increases (i.e., as the RSE decreases). To do so, we set the parameters of PES to obtain RSEs between 0.1 and 0.4; then the other methods were run using the same sample sizes as PES. Note that both the size of the subgraph g and the pool σ are counted in the sample size of PES. We report the observed RSEs of our PES vs. the baseline methods, i.e., GPS-In [10], TRIEST [12], and NES (the adaptation of MASCOT [14] to global triangle counting), in Fig. 4.

Fig. 4: Our PES outperforms the existing methods in terms of RSE when the methods use the same sample sizes (RSE vs. sample size, 24 representative graphs). Note that for our PES, the sample size includes both the size of subgraph g and the reservoir size |σ|. For GPS-In [10], we only counted the size of subgraph g as the sample size, ignoring the two additional values per sampled edge in g. In TRIEST [12] and NES (MASCOT [14]) the sample size is |g|.

In the plots, when the RSE is greater than 1, the corresponding method obtains a zero estimate most of the time.
It can be seen that as the sample size increases, the gap between the RSEs of the methods diminishes. Still, PES outperforms the existing methods in terms of accuracy at the same sample sizes on large graphs, as shown in the last row of the figure. The performances of GPS-In and TRIEST are almost the same. On a few graphs, i.e., Epinions, Gowalla, and Digg, PES is outperformed by the other methods as the sample size increases. The reason is that in those graphs the global clustering coefficient is very small compared to their sizes; therefore, most of the candidate wedges in pool σ will never be closed, and to identify a closed wedge (triangle) in the pool, PES needs to store more wedges.

We also conduct experiments to verify the approximations used in the derivations of Theorems 1 and 3. The sampling probabilities p of PES and NES were initialized so that the estimators achieve RSEs between 0.1 and 0.4, i.e., estimations in the range [∆ ± 0.2∆] to [∆ ± 0.8∆] with 95% confidence. The observed and estimated RSEs are reported in the plots of Figs. 5 and 6. We report the results for 24 representative graphs; similar patterns are observed for the remaining datasets.

As shown in the plots, the approximations in both theorems work very well: our estimated RSEs (blue lines in the plots) fit the observed ones (red lines with circle markers) closely, not only for large graphs but also for small ones. Thus, in practice the theorems can be used to control the accuracy of the estimators. Moreover, they can be used to quantify the performance ratio between the methods, as in the following section.

Fig. 5: The observed RSEs of ∆̂_PES (as a function of p, 24 representative graphs) support our estimated RSEs based on Eq. 9. Note that the mean of the ∆_σ values over the k independent runs was used in Eq. 9 to compute the estimated RSEs; k = 1000 for all graphs except the four largest ones, for which k = 500.

We now use Theorems 1 and 3 to quantify the performance ratio between NES and PES. Suppose p_N and p_D are the sampling probabilities of NES and PES, respectively, needed to achieve the same RSE. Using the results of Theorems 1 and 3, we need ∆_σ^{−1/2} ≈ ∆_g^{−1/2}. Replace ∆_σ = p_D q ∆ and ∆_g = p_N² ∆. Recall that q is the probability of preserving candidate wedges in pool σ. Suppose the pool σ has the same size as g, i.e., |σ| = |g|; then q ≈ M/Λ, since |σ| = p_D M and Λ_c ≈ p_D Λ. After some simplification, we get the following corollary.
Corollary 1. Suppose the pool size is |σ| = |g| = p_D M in PES. The ratio between the sampling probabilities of NES and PES needed to achieve the same RSE is

p_N/p_D ≈ M/(p_N Λ).    (12)

Corollary 1 says that the sample-size ratio between PES and NES depends on M, Λ, and the sampling probability of NES (p_N); recall that M and Λ are the numbers of edges and wedges in the input graph.

To verify the corollary, the parameters of the methods were set to achieve the same RSEs between 0.1 and 0.4, with the pool size set to p_D M in PES, i.e., |σ| = |g|. The observed and estimated ratios based on Eq. 12 are reported in Fig. 7.
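Eq. 12 is straightforward to evaluate for a given graph; the helper below is our own sketch, and the numbers in the usage example are hypothetical, not measurements from the paper:

def nes_over_pes_ratio(m_edges, wedges, p_nes):
    # Eq. 12: p_N / p_D ~ M / (p_N * Lambda), assuming |sigma| = |g|.
    return m_edges / (p_nes * wedges)

# Hypothetical graph: M = 1e8 edges, Lambda = 1e10 wedges, and NES
# needing p_N = 1e-4 for the target RSE. PES then achieves the same
# RSE with a sampling probability about 100 times smaller.
print(nes_over_pes_ratio(1e8, 1e10, 1e-4))   # -> 100.0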
Fig. 6: The observed RSEs of ∆̂_NES (as a function of p, 24 representative graphs) fit our estimated RSEs based on Eq. 1 very well. Note that the mean of the ∆_g values over the k independent runs was used in Eq. 1 to compute the estimated RSEs; k = 1000 for all graphs except the four largest ones, for which k = 500.
Fig. 7: Our PES outperforms NES: the observed and estimated ratios between p_N and p_D (as a function of the RSE, 24 representative graphs) when the methods achieve the same RSEs between 0.1 and 0.4. The estimated ratios are obtained using Eq. 12.

It can be seen that the observed ratios support our theoretical result in Eq. 12: the estimated ratios based on Eq. 12 fit the observed values very well on most of the representative graphs. As expected, there is a small gap between the observed and estimated ratios in a few cases.

6 CONCLUSION AND DISCUSSION
This paper proposes a streaming algorithm called PES. It improves NES by increasing the chance of observing a triangle over a stream from p² in NES to pq, where q is greater than p and is automatically adjusted over the stream. PES outperforms GPS-In consistently on all the datasets that were tested; the performance ratio can be as high as 11. An important observation is that the performance ratio grows with data size, indicating that we can expect higher performance gains on even larger datasets. We tested on networks with up to 65 million nodes; due to the prohibitive cost of calculating the ground truths (triangles, wedges, and shared wedges and triangles) of very large graphs, we did not experiment with even larger networks. We note that real networks often have billions of nodes, much larger than our experimental data; we expect our algorithm to be particularly useful for such very large networks.

In retrospect, the key to improving performance is to identify as many triangles as possible during the sampling process. In the streaming model, we need to scan each edge anyway; thus NES fits naturally with the streaming model, because the closeness check comes almost for free, especially since the sample is small compared with the original graph. PES improves NES further by increasing the sampling probability of the second edge of a triangle. It improves on GPS-In because GPS-In does not always add the second edge as PES does.

Most algorithms are compared only empirically. This is limited, and the conclusions may not hold on other datasets. We compare NES and PES analytically and quantify the performance gain. The analytical comparison also gives a deeper understanding of when PES is better: PES hinges on the value of q, and q becomes larger than p as the graph grows.

ACKNOWLEDGMENTS
The research is supported by an NSERC Discovery grant.

REFERENCES

[1] H. Yin, A. R. Benson, J. Leskovec, and D. F. Gleich, "Local higher-order graph clustering," in Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2017, pp. 555–564.
[2] Y. Liu, C. Zhao, X. Wang, Q. Huang, X. Zhang, and D. Yi, "The degree-related clustering coefficient and its application to link prediction," Physica A: Statistical Mechanics and its Applications, vol. 454, pp. 24–33, 2016.
[3] P. O. Boykin and V. P. Roychowdhury, "Leveraging social networks to fight spam," Computer, vol. 38, no. 4, pp. 61–68, 2005.
[4] A.-L. Barabasi and Z. N. Oltvai, "Network biology: understanding the cell's functional organization," Nature Reviews Genetics, vol. 5, no. 2, p. 101, 2004.
[5] G. Kalna and D. J. Higham, "A clustering coefficient for weighted networks, with application to gene expression data," AI Communications, vol. 20, no. 4, pp. 263–271, 2007.
[6] R. Goldstein and M. S. Vitevitch, "The influence of clustering coefficient on word-learning: how groups of similar sounding words facilitate acquisition," Frontiers in Psychology, vol. 5, p. 1307, 2014.
[7] M. Latapy, "Main-memory triangle computations for very large (sparse (power-law)) graphs," Theoretical Computer Science, vol. 407, no. 1, pp. 458–473, 2008.
[8] R. Etemadi, J. Lu, and Y. H. Tsin, "Efficient estimation of triangles in very large graphs," in Proceedings of the 25th ACM International Conference on Information and Knowledge Management (CIKM '16). ACM, 2016, pp. 1251–1260.
[9] C. E. Tsourakakis, U. Kang, G. L. Miller, and C. Faloutsos, "Doulion: counting triangles in massive graphs with a coin," in Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2009.
[10] N. K. Ahmed, N. Duffield, T. L. Willke, and R. A. Rossi, "On sampling from massive graph streams," Proceedings of the VLDB Endowment, vol. 10, no. 11, pp. 1430–1441, 2017.
[11] N. K. Ahmed, N. Duffield, J. Neville, and R. Kompella, "Graph sample and hold: A framework for big-graph analytics," in Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2014.
[12] L. De Stefani, A. Epasto, M. Riondato, and E. Upfal, "TRIEST: Counting local and global triangles in fully-dynamic streams with fixed memory size," in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, pp. 825–834.
[13] A. Pavan, K. Tangwongsan, S. Tirthapura, and K.-L. Wu, "Counting and sampling triangles from a graph stream," Proceedings of the VLDB Endowment, vol. 6, no. 14, pp. 1870–1881, 2013.
[14] Y. Lim and U. Kang, "MASCOT: Memory-efficient and accurate sampling for counting local triangles in graph streams," in Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2015.
[15] L. S. Buriol, G. Frahling, S. Leonardi, A. Marchetti-Spaccamela, and C. Sohler, "Counting triangles in data streams," in Proceedings of the Twenty-fifth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems. ACM, 2006.
[16] Z. Bar-Yossef, R. Kumar, and D. Sivakumar, "Reductions in streaming algorithms, with an application to counting triangles in graphs," in Proceedings of the Thirteenth Annual ACM-SIAM Symposium on Discrete Algorithms. Society for Industrial and Applied Mathematics, 2002.
[17] H. Jowhari and M. Ghodsi, "New streaming algorithms for counting triangles in graphs," in International Computing and Combinatorics Conference. Springer, 2005.
[18] M. Al Hasan and V. S. Dave, "Triangle counting in large networks: a review," Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, vol. 8, no. 2, p. e1226, 2018.
[19] R. Pagh and C. E. Tsourakakis, "Colorful triangle counting and a MapReduce implementation," Information Processing Letters, vol. 112, no. 7, pp. 277–281, 2012.
[20] B. Wu, K. Yi, and Z. Li, "Counting triangles in large graphs by random sampling," IEEE Transactions on Knowledge and Data Engineering, vol. 28, no. 8, pp. 2013–2026, 2016.
[21] M. Jha, C. Seshadhri, and A. Pinar, "A space-efficient streaming algorithm for estimating transitivity and triangle counts using the birthday paradox," ACM Trans. Knowl. Discov. Data, vol. 9, no. 3, pp. 15:1–15:21, Feb. 2015.
[22] P. Wang, Y. Qi, Y. Sun, X. Zhang, J. Tao, and X. Guan, "Approximately counting triangles in large graph streams including edge duplicates with a fixed memory usage," Proceedings of the VLDB Endowment, vol. 11, no. 2, pp. 162–175, 2017.
[23] L. Becchetti, P. Boldi, C. Castillo, and A. Gionis, "Efficient semi-streaming algorithms for local triangle counting in massive graphs," in Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2008, pp. 16–24.
[24] J. S. Vitter, "Random sampling with a reservoir," ACM Transactions on Mathematical Software (TOMS), vol. 11, no. 1, pp. 37–57, 1985.
[25] M. Rahman and M. A. Hasan, "Sampling triples from restricted networks using MCMC strategy," in Proceedings of the 23rd ACM International Conference on Information and Knowledge Management (CIKM '14). New York, NY, USA: ACM, 2014, pp. 1519–1528.
[26] S. J. Hardiman and L. Katzir, "Estimating clustering coefficients and size of social networks via random walk," in Proceedings of the 22nd International Conference on World Wide Web. ACM, 2013, pp. 539–550.
[27] R. Etemadi and J. Lu, "Bias correction in clustering coefficient estimation," Dec. 2017, pp. 606–615.
[28] T. Schank and D. Wagner, "Approximating clustering coefficient and transitivity," Journal of Graph Algorithms and Applications, vol. 9, no. 2, pp. 265–275, 2005.
[29] C. Seshadhri, A. Pinar, and T. G. Kolda, "Triadic measures on graphs: The power of wedge sampling," in SIAM International Conference on Data Mining (SDM). SIAM, 2013, pp. 10–18.
[30] J. Leskovec and A. Krevl, "SNAP Datasets: Stanford large network dataset collection," http://snap.stanford.edu/data, Jun. 2014.
[31] J. Kunegis, "KONECT - the Koblenz network collection," http://konect.uni-koblenz.de/networks, May 2016.
[32] P. Boldi, M. Rosa, M. Santini, and S. Vigna, "Layered label propagation: A multiresolution coordinate-free ordering for compressing social networks," in Proceedings of the 20th International Conference on WWW. ACM, 2011.
[33] P. Boldi and S. Vigna, "The WebGraph framework I: Compression techniques," in Proceedings of the Thirteenth International World Wide Web Conference. ACM, 2004.
[34] A. Sinha, Z. Shen, Y. Song, H. Ma, D. Eide, B.-J. P. Hsu, and K. Wang, "An overview of Microsoft Academic Service (MAS) and applications," in Proceedings of the 24th International Conference on World Wide Web. ACM, 2015, pp. 243–246.
[35] M. Al-Kateb, B. S. Lee, and X. S. Wang, "Adaptive-size reservoir sampling over data streams," in 19th International Conference on Scientific and Statistical Database Management (SSDBM 2007).