About Weighted Random Sampling in Preferential Attachment Models
Giorgos Stamatelatos, Pavlos S. Efraimidis
Dept. of Electrical and Computer Engineering, Democritus University of Thrace, Kimmeria, Xanthi 67100, Greece
Abstract
The Barabasi-Albert model is a very popular model for creating random scale-free graphs. Despite its widespread use, there is a subtle ambiguity in the definition of the model and, consequently, the dependent models and applications. This ambiguity is a result of the model's tight relation with the field of unequal probability random sampling, which dictates the exact process of edge creation after a newborn node has been added. In this work, we identify the ambiguity, assess its impact and propose a more precise definition of the Barabasi-Albert model that is expressed as an application of unequal probability random sampling.
Keywords:
Barabasi-Albert Model, Unequal Probability Random Sampling, Preferential Attachment, Scale-Free Graphs
1. Introduction
The Barabasi-Albert model (BA model) [2] is a very popular model for generating undirected scale-free networks, i.e. graphs with power-law degree distribution. This property is achieved via a growing scheme with a preferential attachment mechanism, in which each newborn node gives links to existing nodes and the more connected a node is, the more likely it is to receive a new link. More formally, at each time step t the new node t that is born gains m different edges to existing nodes. According to the original paper, for the selection of the m different nodes we (quoted with adjusted notation)

    randomly select a vertex i and connect it with probability m · d_i(t) / Σ_{j=1}^{t} d_j(t) to vertex t in the system,

where d_i(t) is the degree of vertex i at time t. The BA model is also described in textbooks; for example, Jackson [13] offers a slight variation of the original definition:

    the probability that an existing node i gets a link from the newborn node at time t is m times i's degree relative to the overall degree of all existing nodes at time t, or m · d_i(t) / Σ_{j=1}^{t} d_j(t).

The model is widely used in research works with tens of thousands of citations and is implemented in popular network libraries, such as NetworkX (Python), JGraphT (Java) and igraph (C++).

In this paper, we show that there exists an ambiguity in the definition of the BA model by identifying the process of selecting m nodes from the population based on their degrees. This process is called unequal probability random sampling or weighted random sampling, which more formally consists in selecting m items from a population of t items, where the probabilities of selecting any two items may not necessarily be equal.
Typically, in the preferential attachment models, these probabilities are functions of the elements' sizes; in the case of the BA model, the elements' degrees. To the best of our knowledge, this is the first time that the growing preferential attachment model is studied in relation to the unequal probability random sampling schemes and, as a result, the first time that preferential attachment is identified as an application of unequal probability random sampling.

There exist, however, many unequal probability random sampling designs and each one may interpret the relevant probabilities in a different way. In particular, concerning the common definitions of the BA model, the expression "select node i with probability p_i" may be interpreted in at least 3 different ways:

1. Each individual node i gets a link with an independent probability p_i.
2. Each node i has a selection probability p_i in a scheme where the m nodes are selected one by one.
3. The inclusion probability of i appearing in the collection of m nodes is p_i.

These interpretations are just 3 examples that are not equivalent to each other but all abide by the "probability proportional to degree" requirement imposed by the BA model. As a result, this distinction affects the BA algorithm, because different random sampling designs can lead to graphs with different expected properties.

This distinction among the weighted random sampling designs appears to be less known in this scientific field, where most of the work is based on the assumption that the proportionality refers to interpretation (2) of the above list. For example, Bollobás et al. [5] state that

    for m > 1 we add the m edges from u_t one at a time . . . ,

which points to a variation of interpretation (2), as the edges are being added one by one with consecutive draws. In the same paper, the authors identify an ambiguity in the BA model:

    it is not clear how the process is supposed to get started,

referring to the situation of the starting m_0 nodes.
This ambiguity, however, is not related to the fundamental method of edge selection during the growing process. In addition, Batagelj and Brandes [3] attempt a connection of the BA model with random sampling, but not unequal probability random sampling. They propose a very efficient algorithm for scale-free graphs which is in use in all major open source frameworks and also refers to interpretation (2). Another efficient implementation of the BA model is presented in [10]. The authors mention that

    the computationally intensive part of the algorithm is the degree-proportionate selection, a.k.a. roulette wheel selection.

Random sampling does not necessarily imply roulette wheel selections; in fact, only one particular type of weighted random sampling is equivalent to this: the selection probabilities of interpretation (2).

For these reasons, we explore the unequal probability random sampling designs, demonstrate their differences with respect to the BA model, and provide a clarification that addresses the ambiguity by incorporating random sampling terminology. Our contribution can be summarized as follows:

1. We associate two scientific fields, unequal probability random sampling and growing preferential attachment models, by identifying the latter as an application of the former.
2. Using this association, we identify a subtle ambiguity in the definition of the BA model and demonstrate its impact.
3. We establish a strict definition of the BA model that considers unequal probability random sampling.

With this work, we deliver a perspective of the preferential attachment process around unequal probability random sampling and hope to motivate their further integration in future studies.
2. Unequal Probability Random Sampling
Initially, we establish the notation by defining the problem of unequal probability random sampling as selecting m elements from a population of t elements based on their sizes x_1, x_2, ..., x_t. The values x are also called parameters or weights and, in this context, they refer to vertex degrees. In addition, the definitions and abbreviations in the rest of this paper are commonly found in the literature [12, 15, 16].

The weighted random sampling designs can be categorized based on differences found on multiple levels. In the case of growing preferential attachment, during time t when there are t elements in the system, we have to select exactly m discrete elements based on their existing sizes (degrees). It is, therefore, evident that this refers to random sampling without replacement and with fixed sample size. As stated earlier, however, a major difference among the designs is the exact process via which the parameters x_1, x_2, ..., x_t are being utilized. In particular, each design interprets the sizes x_i in a different way, which leads to differences in the respective inclusion probabilities. More formally, the value π_i is defined as the first order inclusion probability of element i, i.e. the probability that element i is included in the sample. While the inclusion probabilities are a function of the sizes, a special case of random sampling design is the strπps (inclusion probability strictly proportional to size), for which they are exactly proportional: π_i ∝ x_i.

Several unequal probability sampling designs are given by Hanif and Brewer [12], Tillé [16], Berger and Tillé [4] and Grafström [9]. In this work, we discuss the preferential attachment mechanism in relation to three relevant designs, but our arguments are more general and hold for a larger selection of sampling methods:

1. The conditional Poisson design [11].
2. The draw-by-draw selection [17].
3. The strπps scheme.

These designs all refer to unequal probability sampling without replacement and with constant sample size, but the interpretation of the weights is differentiated. Specifically, the conditional Poisson design is equivalent to the process of generating Bernoulli samples until the desirable sample size is achieved via the rejection of invalid samples. The Yates-Grundy draw-by-draw design is the process of selecting the individual units with unequal probabilities and without replacement until the sample is of the desired size. Efficient algorithms implementing the draw-by-draw design have been suggested, for example in [8], for which the equivalence has been proven by Li [14]. The cases of the draw-by-draw and strπps designs are also discussed in [7].

The conditional Poisson and draw-by-draw methods are approximations of the strπps design and, under certain conditions, the choice of sampling design can have negligible consequences on the application. However, depending on the sensitivity of the application, they might not be suitable to use interchangeably. Specifically, Table 1 shows an analytical example of n elements with sizes x_1 = 2 and x_i = 1, i = 2, ..., n, i.e. one heavy element with double the size of the other n − 1 elements. The table also displays the quantified differences among the designs, via which it can easily be shown that these random sampling schemes converge asymptotically for large n.

[Table 1 appears here — columns: Design; Inclusion probability of node 1; Difference from strπps; rows: strπps, Conditional Poisson.]
Table 1: An analytical unequal probability example of n elements with x_1 = 2 and x_i = 1, i = 2, ..., n. The table displays the inclusion probability of the heavy node (node 1) and the difference among the sampling designs.
It is worth mentioning that for single item samples, all these designs are identical.

Finally, for the interpretations that were mentioned in Section 1, the correspondence can now be stated: interpretation (1) refers to conditional Poisson, interpretation (2) refers to the draw-by-draw method, and interpretation (3) is a strπps scheme. All three of these interpretations can be fitted into the description of "selecting m vertices with probability proportional to their degree", but all refer to semantically different probabilities that do not coincide with each other.
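The non-equivalence of the three interpretations can be checked exactly on a toy population by enumerating every possible sample. The following sketch is our illustration, not code from the paper; the weights mirror the heavy-element example of Table 1 with n = 4 and m = 2, and the script prints the first-order inclusion probability of the heavy element under each design:

```python
from itertools import permutations, combinations

w = [2.0, 1.0, 1.0, 1.0]  # one heavy element with double the size of the rest
m = 2                     # sample size
W = sum(w)
n = len(w)

# Interpretation (3), strpips: inclusion probability strictly
# proportional to size, pi_i = m * x_i / sum(x).
pi_strpips = [m * wi / W for wi in w]

# Interpretation (2), draw-by-draw: m units drawn one by one without
# replacement, each draw proportional to the remaining weights.
# Inclusion probabilities are computed exactly over all ordered draws.
pi_dbd = [0.0] * n
for order in permutations(range(n), m):
    p, rem = 1.0, W
    for i in order:
        p *= w[i] / rem
        rem -= w[i]
    for i in order:
        pi_dbd[i] += p

# Interpretation (1), conditional Poisson: independent Bernoulli trials
# with p_i = m * x_i / sum(x), conditioned on exactly m successes
# (equivalently, rejecting samples of the wrong size).
pi_cp = [0.0] * n
total = 0.0
for s in combinations(range(n), m):
    pr = 1.0
    for i in range(n):
        pr *= pi_strpips[i] if i in s else 1.0 - pi_strpips[i]
    total += pr
    for i in s:
        pi_cp[i] += pr
pi_cp = [x / total for x in pi_cp]

print(round(pi_strpips[0], 4), round(pi_dbd[0], 4), round(pi_cp[0], 4))
# prints: 0.8 0.7 0.8571
```

The heavy element is included with probability 0.8 under strπps, 0.7 under draw-by-draw and roughly 0.857 under conditional Poisson: the same "probability proportional to degree" wording yields three different inclusion probabilities, with conditional Poisson favoring the heavy element the most and draw-by-draw the least.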
3. Preferential Attachment and Random Sampling
These differences are examined in this section with respect to the BA model. In particular, we identify three perspectives under which the properties of the generated graphs are different and demonstrate their impact in the BA model. Specifically, the divergent properties are the degree distribution (Section 3.1), the expectations of individual nodes occupying specific ranks in the degree hierarchy (Section 3.2), and the probabilities of individual vertices overthrowing other vertices during the growing BA model process (Section 3.3). These perspectives are experimentally studied in detail and the findings provide evidence that the ambiguity of the BA model has an impact on both a theoretical level and practical applications.

For our experiments, the exact algorithms that we use for the three sampling designs mentioned are the method directly produced by the definition of conditional Poisson, the algorithm of Batagelj and Brandes [3] for the draw-by-draw design, and Chao's algorithm [6] as the strπps scheme. We also use the abbreviations CP, DbD and πPS to refer to these in respective order. Lastly, regarding the initial state of the algorithm, we start the process with a complete graph of m_0 = m nodes. This is important to specify as it addresses another ambiguity that has been previously identified [5].

[Figure 1 appears here — axes: Element Degree / Size (x), C.C. Frequency (y); series: πPS, DbD, CP and the reference c · x^{−2}.]
Figure 1: Log-log plot of the complementary cumulative degree distributions of the different random sampling designs. The settings are n = 100, 1000, 10000 and m = 3.

The first impact of the BA model ambiguity is on the degree distribution of the resulting graph. It is known that the expected degree distribution of the BA model follows a power law; an experimental demonstration is shown in Figure 1. The figure is a log-log plot of the complementary cumulative degree distributions for m = 3 and three values for the number of vertices n, 100, 1000 and 10000, which are displayed from bottom to top. The plot also displays the reference distribution x^{−2} as the complementary cumulative distribution of the theoretical x^{−3} power law, because with a continuum approximation it holds that the complementary cumulative distribution P(x) of c · x^{−γ} is

    P(x) = ∫_x^∞ c · u^{−γ} du = (c / (γ − 1)) · x^{−(γ−1)} ∼ x^{−(γ−1)}.

The same plot also displays the results for the 3 different random sampling designs, with the tail (the values below the 3.5 y-value) being impacted by the very few occurrences.

The differences among the random sampling designs can be better observed in Figure 2, which shows at a zoom level the distribution for n = 100 and m = 3 in linear axes.

[Figure 2 appears here — axes: Element Degree / Size (x), Frequency (y); series: πPS, DbD, CP, in two panels.]
Figure 2: Plot of frequencies of smallest (left) and largest (right) degrees for n = 100 and m = 3.

The distributions exhibit an interesting behavior regarding the probability mass of the lower, intermediate and higher degrees. In particular, the conditional Poisson design appears to have more vertices with very low or very high degree, while the draw-by-draw method results in a distribution with more intermediate degree vertices. Overall, the experiment demonstrates the existence of minor differences in the degree distributions of BA graphs when different random sampling designs are applied.

Nodes in the BA model graph are usually anonymous, i.e. only the degree distribution is relevant or important. However, there are many graphs that correspond to the same degree distribution and each one has distinct properties. Often, it is desirable to study the graph at the individual vertex level. In this experiment we measure how the degrees are allocated to vertices, and more specifically the degrees of the older nodes, which are usually more important. This scenario would, for example, be applicable when studying the most influential nodes in a social network.

According to the BA model, the older nodes are expected to have higher degrees than newly born nodes. We treat the oldest as named nodes and experimentally measure their probabilities to occupy the highest degree rank (the node with the largest number of connections) with respect to the different sampling schemes. We used the settings n = 1000, m_0 = m = 2, performed 1,000,000 iterations of the experiment in order to achieve statistical stability, and display the average probabilities in Table 2.

Node    DbD         πPS         CP
0       23.8922%    24.7511%    26.5150%
1       23.5828%    24.5591%    26.2667%
2       23.4610%    24.0638%    26.1037%
3       11.2638%    07.7320%    09.5224%
4       06.2067%    05.6924%    04.5706%
5       03.7253%    03.3494%    02.5317%

Table 2: Probabilities of older nodes to occupy the highest degree rank.

The results provide evidence that the probabilities of the oldest nodes to occupy specific degree ranks are dependent upon the sampling scheme. As a result, in certain cases, for example where the subject of study is the most influential nodes, it is important that there is no ambiguity in the graph generation process. The first 3 nodes have approximately the same probability due to the symmetry produced by the initial clique. The table displays only the first 6 nodes to demonstrate the differences of the most important vertices; the probability drops rapidly as the node ID increases. Profound differences can be observed in these results, as conditional Poisson sampling appears to generate graphs with "heavier" old nodes, i.e. nodes that are more likely to remain in the top degree rank after the arrival of newer vertices.
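A small-scale version of this experiment can be sketched as follows. The draw-by-draw selection is implemented here via a repeated-nodes list in the spirit of Batagelj and Brandes [3]: sampling a uniform entry of the list picks a node with probability proportional to its degree. The parameters are deliberately much smaller than the n = 1000 and 1,000,000 iterations used for Table 2, so the printed frequencies are illustrative only:

```python
import random
from collections import Counter

def ba_degrees_dbd(n, m, rng):
    """Grow a BA graph with draw-by-draw selection and return the degrees.
    The process starts from a complete graph on m nodes; `targets` holds
    each node once per incident edge, so a uniform pick from it is a
    degree-proportional draw. Duplicate picks within one step are rejected,
    which makes the m draws of the step a without-replacement sample."""
    deg = [m - 1] * m
    targets = [v for v in range(m) for _ in range(m - 1)]
    for t in range(m, n):
        chosen = set()
        while len(chosen) < m:
            chosen.add(rng.choice(targets))
        for v in chosen:
            deg[v] += 1
            targets.append(v)
        deg.append(m)               # the newborn node arrives with m edges
        targets.extend([t] * m)
    return deg

# Estimate how often each of the oldest nodes holds the top degree rank.
rng = random.Random(0)
trials = 2000
top = Counter()
for _ in range(trials):
    deg = ba_degrees_dbd(200, 2, rng)
    top[deg.index(max(deg))] += 1   # ties resolved toward the oldest node
for node in range(6):
    print(node, top[node] / trials)
```

Swapping the body of the `while` loop for a conditional Poisson or strπps sampler is what differentiates the three designs; everything else in the growth process stays the same.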
Moreover, the same pattern of strπps being in-between emerges here as well regarding the ranks of the oldest nodes, while the draw-by-draw design results in the "lightest" old nodes.

[Figure 3 appears here — axes: Graph Size (x), Probability of overthrow (y); series: πPS, DbD, CP and the reference c · x^{−1}.]
Figure 3: Log-log plot of the overthrow probability distribution for n = 1000, m = 2.

Another subtle perspective via which we study the differences among the random sampling designs targets the growing component of the process and, more specifically, how the highest degree node changes over time. First, we define a random variable which shows the number of vertices in the growing graph at the point of an overthrow. An overthrow indicates a change in the top degree rank and is, more formally, the situation where the vertex with the highest degree is unique (clear heaviest node) and at the exact previous time step there was more than one vertex occupying the highest degree rank (tied heaviest nodes). As a result, an overthrow, as defined here, also includes the situations where a unique top rank of the same node is repeated directly after a tied top rank.

It is our expectation that the probability of an overthrow will decline with respect to time. This is because, as the node with the highest degree accumulates more edges, the probability of it gaining more edges is further increased and, as a result, the probability of an overthrow declines over time. Our hypothesis is confirmed by the experiments for n = 1000, m = 2 and 2,000,000 repetitions for statistical stability, the results of which are shown in Figure 3. The figure shows the probability distribution of an overthrow (y-axis) occurring at specific time points (x-axis) of the BA model growing process. The momentary fluctuations, which are especially noticeable at very small x values, are because of the definition of an overthrow that disallows the existence of consecutive overthrows.
It is possible that an alternative definition that is not bound by this property would not have this impact on the distribution. Furthermore, the figure also highlights the differences among the random sampling designs, with the draw-by-draw design resulting in a more "unstable" process, i.e. a process with relatively more overthrows. The strπps design is again in the intermediate position, which is consistent with the previous findings, and the conditional Poisson generates a BA growing process with the least amount of overthrows.

Interestingly, the overthrow probability distribution is also a power law. In fact, the distribution corresponding to the strπps design appears remarkably close to the power law distribution c · x^{−1}, with a high correlation ρ computed for x values greater than or equal to 20 to avoid interference from the initial fluctuations mentioned previously. This observation deserves some attention as, if true, such a complicated process involving the BA model, the weighted random sampling scheme and the overthrow mechanic produces such a simple and intuitive outcome.

Another interesting property of the approximate value −1 of the exponent is that it constitutes a strict boundary on the behavior of another measure, the expected number of overthrows in the asymptotic state, i.e. when t approaches infinity, which is given by

    Σ_{t=1}^{∞} c · t^{−γ},

where γ is the exponent of the overthrow distribution. This expectation converges when γ > 1 and diverges when γ ≤ 1. As a result, the hypothesized value γ = 1 lies directly on the boundary between an asymptotically finite and an infinite number of overthrows in a BA graph creation process. We leave the relation of the overthrow distribution with the power law distribution x^{−1} as an open problem; whether they can be analytically linked should be pursued in a future study.
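The overthrow statistic itself is straightforward to collect during the growth. The following is a hedged sketch (our illustration, again using draw-by-draw selection via a repeated-nodes list, with parameters far smaller than the 2,000,000 repetitions behind Figure 3):

```python
import random

def overthrow_sizes(n, m, rng):
    """Grow a draw-by-draw BA graph and return the graph sizes at which an
    overthrow occurs, i.e. the time steps with a unique highest-degree node
    immediately after a step in which the top degree rank was tied."""
    deg = [m - 1] * m
    targets = [v for v in range(m) for _ in range(m - 1)]
    tied_before = True           # the initial clique is fully tied
    sizes = []
    for t in range(m, n):
        chosen = set()
        while len(chosen) < m:   # without-replacement draws by rejection
            chosen.add(rng.choice(targets))
        for v in chosen:
            deg[v] += 1
            targets.append(v)
        deg.append(m)
        targets.extend([t] * m)
        top = max(deg)
        unique_top = deg.count(top) == 1
        if unique_top and tied_before:
            sizes.append(t + 1)  # number of vertices at the overthrow
        tied_before = not unique_top
    return sizes

rng = random.Random(0)
sizes = overthrow_sizes(1000, 2, rng)
print(len(sizes), sizes[:5])     # overthrow count and the earliest sizes
```

Aggregating `sizes` over many repetitions gives the empirical overthrow probability distribution plotted in Figure 3.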
4. Concise definition of the BA model
Our experimental study shows that, to a certain extent, the choice of the unequal probability random sampling design has an impact on the outcome of the BA model. A natural question is whether there exists a "correct" scheme that has to be applied to the BA model. According to Jackson [13, Section 5.2], the rate at which a node gains edges is

    d d_i(t) / dt = d_i(t) / (2t),

which points to the strπps model, as it implies that the inclusion probability is strictly proportional to the degree of the respective vertex. The same can be concluded via the master equation approach mentioned by Albert and Barabási [1, Section VII.B], where the probability of a vertex gaining an edge after a newborn node has been added is proportional to its degree. This observation implies that the majority of the BA algorithms do not follow the BA model strictly as imprinted in the analytical formulation of the model; most of the theoretical algorithms in the literature, as well as the implementations in major open source frameworks, consider the draw-by-draw scheme.

Despite non-strπps models resulting in graphs with different properties, there is no basis for labeling them wrong. It is instead appropriate to consider them as producing unexpected results if not properly specified, rather than wrong. Thus, we propose that the BA model, and the preferential attachment mechanism in general, be defined more strictly in order to account for the different sampling designs. Typically, the BA model is defined via the parameters n, m_0, m and, less commonly, in non-linear models, the a parameter. This definition is not sufficient to adequately describe the process and is open to interpretation. As a result, it is required that the BA model be extended with another parameter: the exact unequal probability random sampling design used in the edge selection after a newborn node has been added.
Typically, the sampling design is without replacement and with constant sample size, in order to abide by the constant m parameter and result in a simple graph, i.e. a graph without loops or multiple edges.
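Jackson's rate equation above can be integrated directly; a standard continuum-approximation sketch (with t_i denoting the birth time of node i and the initial condition d_i(t_i) = m) makes the strict proportionality explicit:

```latex
\frac{\mathrm{d}\, d_i(t)}{\mathrm{d}t} = \frac{d_i(t)}{2t}
\;\Longrightarrow\;
\int_{t_i}^{t} \frac{\mathrm{d}\, d_i}{d_i} = \int_{t_i}^{t} \frac{\mathrm{d}u}{2u}
\;\Longrightarrow\;
d_i(t) = m \left( \frac{t}{t_i} \right)^{1/2}.
```

The left-hand side is exactly the strπps reading of the model: since the total degree at time t is approximately 2mt, the expected number of edges node i gains per step is m · d_i(t) / Σ_j d_j(t) = m · d_i(t) / (2mt) = d_i(t) / (2t), i.e. node i's first-order inclusion probability is strictly proportional to its degree.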
5. Conclusions
In this paper we have identified a subtle ambiguity in the definition of the Barabasi-Albert model by taking into consideration the field of unequal probability random sampling. We showed that the definition of the model, in particular the part where m nodes are selected based on their degrees, is open to interpretation and can lead to graphs with unexpected properties. For this reason, we proposed an extension to the definition in the form of a new parameter that dictates the exact weighted random sampling scheme. We ultimately hope to inspire the use of unequal probability random sampling in the preferential attachment model in general.

Future lines of research may focus on the asymptotic analysis of the random sampling scheme when applied to the BA model. An example would be whether different random sampling schemes would converge to the same degree distribution as the number of vertices increases. This perspective is particularly interesting because traditionally the BA model has been thought of as a stationary distribution model [1]. This work may also give rise to the algorithmic study of the BA model in a random sampling context and the development of efficient algorithms besides the draw-by-draw method, which appears to be dominant on both a theoretical level and in open source implementations. Regarding the latter, unexpected implications of existing implementations on practical applications should also be assessed, even if their impact is negligible.

Acknowledgements
This work has been co-financed by the European Union and Greek national funds through the Operational Program Competitiveness, Entrepreneurship and Innovation, under the call RESEARCH – CREATE – INNOVATE (project code: T1EDK-02474, grant no.: MIS 5030446).
References

[1] Albert, R. and Barabási, A.-L. (2002). Statistical mechanics of complex networks. Reviews of Modern Physics, 74(1):47–97.
[2] Barabási, A.-L. and Albert, R. (1999). Emergence of Scaling in Random Networks. Science, 286(5439):509–512.
[3] Batagelj, V. and Brandes, U. (2005). Efficient generation of large random networks. Physical Review E, 71(3):036113.
[4] Berger, Y. G. and Tillé, Y. (2009). Sampling with unequal probabilities. In Handbook of Statistics, volume 29, pages 39–54. Elsevier.
[5] Bollobás, B., Riordan, O., Spencer, J., and Tusnády, G. (2001). The degree sequence of a scale-free random graph process. Random Structures & Algorithms, 18(3):279–290.
[6] Chao, M. T. (1982). A general purpose unequal probability sampling plan. Biometrika, 69(3):653–656.
[7] Efraimidis, P. S. (2015). Weighted random sampling over data streams. In Algorithms, Probability, Networks, and Games, pages 183–195. Springer.
[8] Efraimidis, P. S. and Spirakis, P. G. (2006). Weighted random sampling with a reservoir. Information Processing Letters, 97(5):181–185.
[9] Grafström, A. (2010). On unequal probability sampling designs. PhD thesis, Umeå University. OCLC: 648090592.
[10] Hadian, A., Nobari, S., Minaei-Bidgoli, B., and Qu, Q. (2016). ROLL: Fast In-Memory Generation of Gigantic Scale-free Networks. In Proceedings of the 2016 International Conference on Management of Data (SIGMOD '16), pages 1829–1842, San Francisco, California, USA. ACM Press.
[11] Hajek, J. (1964). Asymptotic Theory of Rejective Sampling with Varying Probabilities from a Finite Population. The Annals of Mathematical Statistics, 35(4):1491–1523.
[12] Hanif, M. and Brewer, K. R. W. (1980). Sampling with Unequal Probabilities without Replacement: A Review. International Statistical Review / Revue Internationale de Statistique, 48(3):317.
[13] Jackson, M. O. (2008). Social and Economic Networks. Princeton University Press, Princeton, NJ.
[14] Li, K.-H. (1994). A computer implementation of the Yates-Grundy draw by draw procedure. Journal of Statistical Computation and Simulation, 50(3-4):147–151.
[15] Rosén, B. (1997). On sampling with probability proportional to size. Journal of Statistical Planning and Inference, 62(2):159–191.
[16] Tillé, Y. (2006). Sampling Algorithms. Springer Series in Statistics. Springer, New York.
[17] Yates, F. and Grundy, P. M. (1953). Selection Without Replacement from Within Strata with Probability Proportional to Size.