Diffusion-Aware Sampling and Estimation in Information Diffusion Networks

Abstract

Partially observed data collected by sampling methods is often studied to obtain the characteristics of information diffusion networks. However, these methods usually do not consider the behavior of the diffusion process. In this paper, we propose a novel two-step (sampling/estimation) measurement framework that utilizes the characteristics of the diffusion process. To this end, we propose a link-tracing based sampling design which uses the infection times as local information, without any knowledge of the latent structure of the diffusion network. To correct the bias of the sampled data, we introduce three estimators for different categories of characteristics: link-based, node-based, and cascade-based. To the best of our knowledge, this is the first attempt to introduce a complete measurement framework for diffusion networks. We also show that the estimator plays an important role in correcting the bias of sampling from diffusion networks. Our comprehensive empirical analysis over large synthetic and real datasets demonstrates that, on average, the proposed framework outperforms the common BFS and RW sampling methods in terms of link-based characteristics by about 37% and 35%, respectively.


arXiv preprint [cs.SI]

Motahareh Eslami Mehdiabadi
Sharif University of Technology
Email: [email protected]

Hamid R. Rabiee
Sharif University of Technology
Email: [email protected]

Mostafa Salehi
Sharif University of Technology
Email: mostafa [email protected]


I. INTRODUCTION

Information diffusion is one of the important topics that has been considered in large Online Social Networks (OSNs) such as Facebook, Twitter, and YouTube. These networks, which provide information in different formats such as posts, tweets, and videos, are called “information diffusion networks”. In recent years, the tremendous growth of these networks has resulted in the creation of large information networks. For example, in March 2011, Twitter users were sending 50 million tweets per day [2]. Moreover, the latent structure of diffusion networks makes their analysis considerably difficult. Although we can usually observe the time at which people obtain a piece of information, we cannot easily find its source. Similarly, in epidemic diseases, an infection shows itself when somebody becomes infected, without revealing who infected whom [1]. Therefore, it may be impossible or costly to obtain the complete structure of a large and latent diffusion network.

Partially observed network data is often studied to obtain the characteristics of these networks. The network resulting from such measurements may be thought of as a sample from a larger underlying network. As a result, the accuracy of studies on diffusion network analysis depends on the estimation of the characteristics based on the sampled network data.

The measurement of the network characteristics can be achieved in two steps: 1) Sampling, and 2) Estimation. In the first step, data is collected from the network by using a sampling method. The essential property of a sampling method that makes it appropriate for network inference is that its visiting probabilities should be known for all the network elements. This allows sampled data to be weighted so that they accurately represent the network data.
In the estimation step, an estimator is used to obtain the network characteristics. An estimator is a function that takes a summary of the sampled data as input and estimates the unknown parameters of the population that generated the input. However, sampling and estimation in the context of networks may introduce some potential complications.

In recent years, a considerable amount of research has been done on analyzing the topological characteristics of large OSNs based on data sampled from different networks such as Facebook [4], [5], Twitter [6], YouTube [7], and other large networks [8], [9]. However, sampling approaches for studying the diffusion behavior of social networks, apart from their topologies, remain an open issue that should be addressed. Previous work on diffusion data collection [12], [3], [13], [20] has used well-known sampling methods such as Breadth-First Search (BFS) and Random Walk (RW), without considering the behavior of the diffusion process. This leads to gathering redundant data and losing parts of the diffusion data, which consequently decreases the performance of these sampling methods (refer to Figure 1). On the other hand, it is often not feasible to work directly with the diffusion networks, because the structure of many large real systems cannot be discovered. Moreover, previous studies assert that the characteristics of a sampled diffusion network are indicative of the same characteristics for the whole network. However, it should be noted that the obtained characteristics represent the sampled graph instead of the original graph. Such problems can be compensated for in many cases by using an appropriate estimator.

In this paper, we propose a novel two-step (sampling/estimation) framework, called “DNS”, to measure the characteristics of diffusion networks.
To this end, we propose a link-tracing based sampling method that utilizes diffusion process properties to traverse the network more accurately. Specifically, this method samples the underlying network by moving from a node to one of its neighbors through an outgoing link based on the probability of spreading infection. We calculate this infection probability by considering the behavior of cascades in the diffusion networks. It is noteworthy that the algorithm only uses the infection times as local information, without any knowledge of the latent structure of the diffusion network. Moreover, we extend the well-known Hansen-Hurwitz estimator [38] to correct the bias of the sampled data. We propose three efficient estimators related to different categories of network characteristics: link-based, node-based, and cascade-based. To the best of our knowledge, this is the first attempt to introduce a complete measurement framework for diffusion networks.

Fig. 1. Illustration of different sampling designs in diffusion networks: (a) Common Network Sampling Design, (b) Diffusion-Aware Network Sampling Design. The regions specified by dotted lines show the diffusion networks. The red and green areas demonstrate the sampled networks obtained by common and diffusion-aware network sampling methods, respectively. As shown, the diffusion-aware network sampling design can cover the diffusion network more accurately.

We evaluated the proposed framework over large synthetic and real datasets by comparing it with the BFS and RW sampling methods. The experimental results demonstrated that DNS outperforms these common sampling methods in terms of link-based characteristics by about 37% and 35%, respectively, on average. Moreover, DNS decreased the bias of the sampled data by 30% compared to the sampling design without estimation.
The results confirm that finding an appropriate estimator plays an important role in correcting the bias of sampling methods. Furthermore, the results show that the proposed framework performs well even at low sampling rates. We also analyzed the effect of the diffusion rate on the performance of DNS. The analysis showed that the proposed framework is independent of the diffusion process behavior. Hence, we can use DNS in various diffusion networks with different diffusion patterns without any performance loss.

In summary, our main contributions are as follows:
• Proposing a novel sampling design for gathering data from a diffusion network by utilizing the properties of the diffusion process.
• Proposing three estimators for correcting the bias of sampled data by computing the visiting probabilities of different types of diffusion characteristics (link-based, node-based, and cascade-based).
• Decreasing the bias of measuring link-based characteristics compared to other common sampling methods.

The rest of the paper is organized as follows. Section II presents a classification of data collection approaches in the field of information diffusion networks. The problem formulation is given in Section III. The proposed measurement framework is presented in Section IV. Section V elaborates on the experimental evaluation, and concluding remarks are provided in Section VI.

II. RELATED WORK

Diffusion processes, as a fundamental phenomenon over OSNs, have attracted great attention in recent years [1], [32], [14], [15], [11], [16], [36], [13], [18]. Here, we provide a comprehensive survey of the approaches used for collecting diffusion process data.

Complete Data:

The most fundamental approach is to collect the complete diffusion data. Many diffusion studies try to generate some diffusion paths and use them for analysis. Following the Iraq war petitions circulated by e-mail [11], [19], studying e-mail communication events between faculty and staff of a university [15], and tracking the flow of information by extracting short textual phrases [16] are some examples of this approach. However, gathering diffusion data in many areas creates problems such as missing data, privacy policies, and the impossibility of tracing all paths of diffusion. Moreover, the large scale of diffusion networks is one of the most important obstacles to gathering the complete diffusion data. These problems have led researchers to use sampling methods to obtain partial diffusion data.

Partial Data:

Sampling methods can be considered an efficient way to tackle the problem of large-scale diffusion data. Using these methods to collect diffusion data has been studied in some recent work [12], [3], [13], [20]. The majority of these works have utilized one of the most common sampling methods: Breadth-First Search (BFS). BFS is a basic graph-based sampling method that has been used extensively for sampling networks in various domains [7], [4], [21], [6]. At each iteration of BFS, the earliest explored node is selected next. This method discovers all nodes within some distance from the starting node. Inferring diffusion topics from the DBLP database [20] and sampling the Twitter network to study the resulting diffusion network [12], [13] are some examples which use BFS to collect the diffusion data. However, BFS leads to a bias towards high-degree nodes [35], and this bias has not been analyzed for arbitrary graphs [22]. Despite the popularity of BFS, the problem of computing the visiting probabilities of network elements (such as nodes and links) in the BFS sampling design is still unsolved, because sampling without replacement in BFS introduces complex dependencies between the sampled elements. To the best of our knowledge, no estimator has been introduced to correct the sampling bias of BFS in an arbitrary network.

Despite the considerable amount of research on analyzing the topological characteristics of networks in various areas [4], [5], [6], [7], [8], [9], little attention has been paid to gathering partial data based on the diffusion behavior. Random Walk (RW) [23] is also one of the most important and widely used link-tracing sampling methods in different kinds of network contexts, such as uniformly sampling Web pages from the Internet [24], estimating content density in peer-to-peer networks [33], [34], and estimating degree distributions of the Facebook social graph [4], [5] and of large graphs in general [8].
A classic RW samples a graph by moving from a node $u$ to a neighboring node $v$ through an outgoing link $(u, v)$, chosen uniformly at random from the neighbors of node $u$. The probability of selecting the next node determines the probability that nodes are sampled. In any given connected and non-bipartite graph $G$, the probability of being at a node $u$ converges at equilibrium to the stationary distribution $\pi(u) = \deg(u) / (2|E|)$, where $\deg(u)$ is the degree of node $u$ and $E$ is the set of links of the network graph. Moreover, the probability that a link is visited is $1/|E|$ (i.e., links are visited uniformly at random) [23].

Using these sampling methods without any attention to diffusion paths results in redundant data that is not related to the diffusion process. Removing this unnecessary data decreases the efficiency of these sampling methods [3]. No previous work has considered diffusion process characteristics in the sampling strategy. In this paper, we propose diffusion-aware sampling and estimation methods which use only local information of the underlying network. To the best of our knowledge, this is the first study to introduce a complete measurement framework for diffusion networks. We use BFS and RW as baseline methods for comparison.

III. PROBLEM FORMULATION

A. Basic Notations and Definitions

Let $G = (V, E)$ be the underlying network, where $V$ is the set of nodes and $E$ is the set of links, with $n = |V|$ and $m = |E|$. In a diffusion process, some diffusible chunks such as information and epidemic diseases propagate over $G$. These diffusible chunks are called “infections”, and each path of an infection builds a “cascade” [1], [18]. When the cascades spread over the underlying network, the diffusion network $G^*$ is formed.

We define $G_s = (V_s, E_s)$ as the induced sub-graph of $G$ obtained with sampling rate $\mu$, where $V_s \subset V$ and $E_s \subset E$. In order to analyze the diffusion process, we should measure the diffusion characterization metrics from the sampled diffusion data. Since the diffusion phenomenon covers many elements of the network (such as nodes, links, and cascades), we define an “element set”, $T$, as a set of diffusion network elements [3]. Let $L$ be a finite set of element labels. A label can be, for instance, the degree of a node, the weight of a link, or the length of a cascade. A label $l_e$ is assigned to each element $e \in T$ by a target function $f : T \to L$, i.e. $f = \{(e, l_e) \mid e \in T, l_e \in L\}$. For example, infection is a label for each node that shows whether this node is infected during the diffusion process or not. The target function $f$ for this label maps nodes $u \in V$ to the set $L = \{0, 1\}$ ($f(u) = 0$ if node $u$ is not infected and $f(u) = 1$ otherwise).

Almost all network characterization metrics we are aware of can be expressed as some aggregate function. In this paper, we focus on the measurement of diffusion network characteristics. To this end, we consider the average function ($\eta$) over diffusion elements:

$$\eta(f(G)) = \frac{\sum_{e \in T} f(e)}{|T|} \qquad (1)$$

In the above infection example, this average gives the fraction of nodes of the underlying network that are infected by the diffusion process.

B. Problem Definition

Our goal is to propose a diffusion-aware measurement framework that collects diffusion data in an efficient way. The diffusion process measurement procedure consists of two steps: (1) sampling from the underlying network and computing the desired target function $f$ on the sampled elements, and (2) computing an estimate of $\mathrm{Avg}(f)$ by finding an appropriate estimator $M$. To evaluate the measurement framework, we define the bias metric as:

$$\rho = \frac{|\eta(f(G^*)) - \eta(f(M))|}{\eta(f(G^*))} \qquad (2)$$

Our problem now becomes finding a measurement framework which minimizes the bias $\rho$.

IV. PROPOSED FRAMEWORK

In this section, we propose, for the first time, a diffusion-aware probabilistic measurement framework called “DNS” (Diffusion Network Sampling).

A. Sampling Design

In existing sampling methods such as BFS and RW, we begin at a starting node and recursively visit one or more of its neighbors as next nodes, without considering the diffusion paths. Here, we utilize the properties of the diffusion process to traverse the network more accurately. By computing the probability of spreading infections over the links of the underlying network, we can direct the sampling design toward diffusion paths without any prior knowledge of the diffusion network structure. Therefore, we can cover a greater part of the unknown diffusion network and reduce redundant data, such as nodes and links which do not take part in the diffusion process.

To calculate the probability of spreading infection over a link, we focus on the behavior of cascades in the diffusion networks. Each cascade $c$ can be assigned a time vector $t_c = \{t_1, t_2, \cdots, t_n\}$ which shows the infection times of the nodes by $c$. If cascade $c$ does not infect a node, the infection time of this node is considered to be $\infty$ [18]. The cascades with the same structure that propagate over the underlying network are denoted by the set $C$ with $N_c$ members. We define $C_T = \{t_1, t_2, \cdots, t_{N_c}\}$ as the set of the cascades' time vectors. We assume that the transmission model of the cascades follows the independent cascade model [37]. In this model, a node gets the chance to transmit information to its neighbors at each time episode, independently.

When a node decides to infect one of its neighbors, it performs the transmission according to a waiting time model, which describes how long it takes for a node to infect a chosen neighbor. In the proposed sampling method, we use the exponential model [1] as the waiting time model. By defining $\Delta = t_v - t_u$, the infection transmission probability over link $e(u, v)$ in cascade $c$ can be computed as follows:

$$P_c(e) = e^{-\Delta / \alpha} \qquad \text{(Exponential Model)} \qquad (3)$$

where $\alpha$ is an adjustment parameter which determines how fast a cascade spreads.
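The exponential transmission probability of Equation 3 can be illustrated with a short Python sketch. The function name `transmission_probability` and the handling of non-positive or infinite time gaps (returning 0) are assumptions made for this illustration and are not specified in the text.

```python
import math


def transmission_probability(t_u, t_v, alpha):
    """Equation 3: P_c(e) = exp(-delta / alpha), with delta = t_v - t_u.

    t_u and t_v are the infection times of nodes u and v in one cascade
    (math.inf if the cascade never infected the node).
    """
    if math.isinf(t_u) or math.isinf(t_v):
        # Assumption: a node the cascade never reached cannot take part.
        return 0.0
    delta = t_v - t_u
    if delta <= 0:
        # Assumption: an infection cannot travel backwards in time.
        return 0.0
    return math.exp(-delta / alpha)


# A larger time gap gives a smaller probability; a larger alpha slows the decay.
print(transmission_probability(0.0, 1.0, alpha=1.0))
print(transmission_probability(0.0, 3.0, alpha=1.0))
print(transmission_probability(0.0, 3.0, alpha=2.0))
```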
As can be seen, the probability of spreading an infection has an inverse relation with $\Delta$. This reflects a simple fact: when you receive an interesting e-mail, the passing of time decreases the probability of forwarding it to your friends. Since a diffusion network contains many cascades, each link $e(u, v)$ can take part in more than one cascade. Therefore, $C_e$ is defined as the set of cascades which pass over link $e$. Now, for each link $e$ we define the infection probability $P_e$ as its average probability over the cascades it takes part in:

$$P_e = \frac{\sum_{c \in C_e} P_c(e)}{|C_e|} \qquad (4)$$

The pseudo code of the proposed sampling design is shown in Algorithm IV.1. This method samples the underlying network by moving from a node $u$ to a neighboring node $v$ through an outgoing link with the infection probability $P_e$. It is noteworthy that the algorithm only uses the infection times (i.e. $C_T$) as local information, without any prior knowledge of the latent structure of the diffusion network.

Algorithm IV.1: THE SAMPLING DESIGN (Seed, C_T, k, α)
    v := Seed                              % v is the current node
    while (|E_s| < k)                      % k is the sampling size
      do
        for each u ∈ Neighbors(v)
          do
            e := (v, u)
            V_s ← V_s ∪ u
            E_s ← E_s ∪ e
            P_e := 0
            for each c ∈ C_e
              do
                Δ = t_u − t_v
                P_c(e) = e^(−Δ/α)
                P_e = P_e + P_c(e)
            P_e = P_e / |C_e|
        v ← u with probability of P_e
    G_s := (V_s, E_s)
    return (G_s)

B. Estimation Approach

The selection bias of a sampling method can be corrected by re-weighting the measured values. This can be done using the Hansen-Hurwitz estimator [38], i.e. elements are weighted inversely proportional to their visiting probability. For any target function $f : T \to L$ that defines a characteristic (refer to Section III-A), the estimator of Equation 5 provides an asymptotic estimate of the population mean of $f$ [39]:

$$\hat{\eta} = \frac{\sum_{i=0}^{k-1} f(X_i) / \pi(X_i)}{\sum_{i=0}^{k-1} 1 / \pi(X_i)} \qquad (5)$$

where $X_i$ and $\pi(X_i)$ are the visited element (which could be a node, link, or cascade) and its visiting probability on the $i$-th draw of the sampling method, respectively. Therefore, to use this estimator, we should compute the probability of visiting each element in the proposed sampling procedure. In the following, we address this issue and extend the above estimator for three different categories of elements: link-based, node-based, and cascade-based.
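As a minimal sketch, the re-weighted estimate of Equation 5 and the bias metric of Equation 2 could be computed as follows. The function names `hansen_hurwitz_mean` and `relative_bias` and the toy numbers are hypothetical and chosen only for illustration; the sketch assumes every visiting probability is strictly positive.

```python
def hansen_hurwitz_mean(values, visit_probs):
    """Equation 5: ratio of inverse-probability weighted sums.

    values[i] is f(X_i) and visit_probs[i] is pi(X_i) for the i-th draw.
    """
    if not values or len(values) != len(visit_probs):
        raise ValueError("values and visit_probs must be non-empty and aligned")
    if any(p <= 0 for p in visit_probs):
        raise ValueError("visiting probabilities must be strictly positive")
    numerator = sum(f / p for f, p in zip(values, visit_probs))
    denominator = sum(1.0 / p for p in visit_probs)
    return numerator / denominator


def relative_bias(true_mean, estimate):
    """Equation 2: |eta(f(G*)) - eta(f(M))| / eta(f(G*))."""
    return abs(true_mean - estimate) / true_mean


# Toy sample: elements that are easy to reach (high pi) are down-weighted.
values = [1, 0, 1]
visit_probs = [0.5, 0.25, 0.5]
estimate = hansen_hurwitz_mean(values, visit_probs)
naive = sum(values) / len(values)
print(estimate, naive)
```

In this toy sample the naive sample mean over-represents the two frequently visited elements, while the weighted estimate gives the rarely visited element a larger share.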

1) Link-based Characteristics:

Links play a major role in spreading infection over networks. Information that is gained without any connection to others for propagation is not valuable in a network. Therefore, link-based characteristics are the most important ones in the diffusion process. “Link Attendance”, as an example of a link-based characteristic, shows the amount of presence of a link in the diffusion process. Links with high attendance are significant in some applications, such as finding potential paths of infection propagation in epidemic spreading [3].

Since in the proposed sampling method we move over links with the probability $P_e$ (Equation 4), the visiting probability of link $e$ is equal to this probability, i.e. $\pi(e) = P_e$. We can use these visiting probabilities in Equation 5 to estimate the real value of link-based characteristics. As mentioned before, we only use local knowledge to compute the visiting probabilities of the links.
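The link-tracing walk of Algorithm IV.1 and the link visiting probabilities $\pi(e) = P_e$ it yields can be sketched in Python as below. This is an illustrative reading rather than the authors' implementation: the names `link_infection_probability` and `dns_sample`, the dictionary-based inputs, choosing the next node with probability proportional to $P_e$, the uniform fallback when all probabilities are zero, and the `max_steps` guard are all assumptions.

```python
import math
import random


def link_infection_probability(v, u, cascade_times, alpha):
    """Equation 4: average of exp(-delta/alpha) over cascades passing over (v, u)."""
    probs = []
    for times in cascade_times:  # one dict {node: infection time} per cascade
        t_v = times.get(v, math.inf)
        t_u = times.get(u, math.inf)
        if math.isinf(t_v) or math.isinf(t_u) or t_u <= t_v:
            continue  # assumption: this cascade does not pass over (v, u)
        probs.append(math.exp(-(t_u - t_v) / alpha))
    return sum(probs) / len(probs) if probs else 0.0


def dns_sample(neighbors, seed, cascade_times, k, alpha, rng=None, max_steps=10_000):
    """Walk from seed until at least k links are sampled; return (V_s, {link: P_e})."""
    rng = rng or random.Random()
    v = seed
    sampled_nodes = {seed}
    link_probs = {}  # sampled link -> P_e, reusable as pi(e) in Equation 5
    for _ in range(max_steps):  # guard against walks that can never reach k links
        if len(link_probs) >= k:
            break
        candidates = neighbors.get(v, [])
        if not candidates:
            break  # dead end: no outgoing links to follow
        weights = []
        for u in candidates:
            p_e = link_infection_probability(v, u, cascade_times, alpha)
            link_probs[(v, u)] = p_e
            sampled_nodes.add(u)
            weights.append(p_e)
        if sum(weights) > 0:
            # Assumption: next node is drawn with probability proportional to P_e.
            v = rng.choices(candidates, weights=weights, k=1)[0]
        else:
            v = rng.choice(candidates)  # assumption: uniform step as a fallback
    return sampled_nodes, link_probs


neighbors = {"a": ["b", "c"], "b": ["c"], "c": []}
cascades = [{"a": 0.0, "b": 1.0, "c": 2.0}]
nodes, probs = dns_sample(neighbors, "a", cascades, k=2, alpha=1.0, rng=random.Random(0))
print(sorted(nodes), probs)
```

Links whose $P_e$ is zero would have to be excluded before applying Equation 5, since the estimator divides by the visiting probability.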

2) Node-based Characteristics:

The number of “Seeds” (the initiators of an infection) [3] and “participation” (the fraction of users involved in the information diffusion) [12] are some examples of node-based characteristics. The diffusion process can be modeled as a Markov random walk over the underlying network $G$ (the details can be found in our previous paper [18]). Therefore, the visiting probability of node $u$ in the proposed sampling method can be defined as:

$$\pi(u) = \sum_{v \in N(u)} \pi(v)\, \pi(e_{vu}) \qquad (6)$$

where $N(u)$ is the set of neighbors of node $u$. The infection of node $u$ at time $t_u$ depends on the infection of its neighbors at times $t_v$, where $t_v < t_u$. If we define $\pi = \{\pi(0), \pi(1), \cdots, \pi(n-1)\}$, calculating $\pi$ requires global knowledge of the network, as it is the stationary distribution of the mentioned Markov chain. Since we do not have a global view of the network in the sampling procedure, finding the exact value of $\pi$ is not possible in real systems. Therefore, finding an approximation of $\pi$ can be considered a direction for future research.
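To illustrate why Equation 6 needs global knowledge, the sketch below approximates $\pi$ by power iteration when every link probability is available. It is not part of the proposed framework: the name `stationary_distribution`, the row-normalisation of the link probabilities into transition probabilities, the treatment of nodes without outgoing probability, and the toy link weights are assumptions, and the iteration is only expected to converge for chains that are connected and aperiodic.

```python
def stationary_distribution(link_probs, nodes, iterations=1000, tol=1e-12):
    """Approximate pi(u) = sum_v pi(v) * pi(e_vu) by repeated propagation.

    link_probs maps a directed link (v, u) to its probability P_e.
    """
    # Assumption: P_e values are normalised over each node's outgoing links.
    out_total = {v: 0.0 for v in nodes}
    for (v, _u), p in link_probs.items():
        out_total[v] += p

    pi = {v: 1.0 / len(nodes) for v in nodes}  # start from the uniform distribution
    for _ in range(iterations):
        nxt = {v: 0.0 for v in nodes}
        for (v, u), p in link_probs.items():
            if out_total[v] > 0:
                nxt[u] += pi[v] * p / out_total[v]
        for v in nodes:
            if out_total[v] == 0:
                nxt[v] += pi[v]  # assumption: mass stays on nodes with no way out
        if max(abs(nxt[v] - pi[v]) for v in nodes) < tol:
            return nxt
        pi = nxt
    return pi


# Every link probability of the whole graph is needed, which a sampler does not have.
links = {
    ("a", "b"): 0.6, ("a", "c"): 0.2,
    ("b", "a"): 0.3, ("b", "c"): 0.3,
    ("c", "a"): 0.5, ("c", "b"): 0.1,
}
print(stationary_distribution(links, ["a", "b", "c"]))
```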

3) Cascade-based Characteristics:

Cascades, as the building blocks of a diffusion network, can determine many characteristics of a diffusion process. For instance, the depth of a spreading phenomenon can be determined by the length of its cascades [3], [12]. Since each cascade $c$ spreads over a series of links, its visiting probability depends on visiting all of its links [1]. Therefore, we can define $\pi(c)$ as:

$$\pi(c) = \prod_{e \in c} P_c(e) \qquad (7)$$

This formula can be calculated given all the links of a cascade. Since the probability of visiting a cascade requires global knowledge of the network, it should be approximated by using local information.

V. EXPERIMENTAL EVALUATION

A. Setup

As discussed in Section IV-B, the visiting probabilities of network elements should be computed using local information. Since calculating the visiting probabilities of nodes and cascades requires global knowledge of the underlying network structure, we evaluate Link Attendance as a link-based characteristic. We use BFS and RW as baseline methods for comparison with DNS.

To build the diffusion network, many homogeneous cascades are generated with the same structure over the underlying network. The transmission speed of the cascades is determined by $\alpha$. To control the distance over which a cascade can propagate, we use the parameter $\beta$ [1]. The fraction of the underlying network $G$ which is covered by the diffusion network $G^*$ is defined as the diffusion rate $\delta$.

B. Dataset

We utilize seven synthetic and real networks with differentstructures. The properties and cascade generation settings ofthe datasets are provided in Table I.

1) Synthetic Dataset:

We use the following models to generate synthetic data:
• The Forest Fire model [26], generated with the parameter matrix [5; 0.12; 0.1; 1; 0], where the entries are the number of starting nodes, forward burning probability, backward burning probability, decay probability, and probability of orphan nodes, respectively.
• The Kronecker graph [25], generated with three different Kronecker parameter matrices: the Random graph [27] (Kronecker parameter matrix [0. , .1; 0. , . ]), the hierarchical network [28] ([0. , .5; 0. , . ]), and the Core-Periphery network [29] ([0. , .5; 0. , . ]).

TABLE I
THE NETWORK AND CASCADE GENERATION PARAMETERS.

Network          n      m      α      β      δ
Forest Fire
Core-Periphery
Hierarchical
Random(ER)
PolBlog
Football         115    615
NetScience

2) Real Dataset:

We used three real-world networks for evaluation purposes. The first network is based on the links and posts of blogs in the political blogosphere around the time of the 2004 US presidential election [30]. The second is a network of American football games between Division IA colleges during the regular season of Fall 2000 [40]. The last is a co-authorship network of network theory scientists [31], which we refer to as NetScience.

C. Speed of Cascade

As mentioned before, the speed of cascade propagation over the underlying network is controlled by $\alpha$. In the DNS framework, we use this parameter in calculating $P_e$ to determine the direction of sampling and to correct the bias. As the diffusion network structure is unknown in most large real systems, the speed of cascades is not available for use in the sampling and estimation approach. Therefore, we evaluate the DNS performance by measuring the link-attendance characteristic for different values of $\alpha$ ( . < $\alpha$ < ) over the synthetic and real networks at a fixed sampling rate ( $\mu$ = 0. ). As Figure 2 illustrates, all the networks show similar behavior with respect to $\alpha$. Moreover, most networks achieve the minimum bias (below ) in measuring the link-attendance characteristic when . ≤ $\alpha$ ≤ . .

Fig. 2. Speed of Cascade

The behavior of the political blog network differs to some extent from that of the other networks. Analyzing the structure of this network reveals that this different behavior is the result of a difference in network density [26]. Comparing the density of the political blog network with that of the other networks shows that a larger density needs a larger $\alpha$ in the sampling and estimation procedure. In fact, a higher density necessitates a higher speed of cascade transmission to visit the elements of the network in a time episode. Therefore, the political blog network achieves the least bias when $\alpha$ = 1. . The best value of $\alpha$ for each network is provided in Table I. We use these values as the input parameters for DNS in our experimental evaluations.

D. Performance Evaluation

In this section, we evaluate the performance of the DNS framework in three respects. First, we compare the bias of DNS with the baseline methods (BFS and RW) in measuring link-based characteristics. Second, we study the importance of the estimation approach in the proposed framework. Finally, we analyze the behavior of these methods at different sampling rates.

Figure 3 shows the link-attendance bias for different sampling rates. As can be observed, the proposed framework can measure this characteristic with very low bias ( , on average). We summarize the average performance difference between DNS and BFS and RW over all networks in Table II. It can be seen that DNS on average outperforms BFS and RW in terms of link attendance by about 37% and 35%, respectively.

Interestingly, the proposed framework decreases the bias by 30% compared to the sampling design of DNS without the proposed estimation approach. These results confirm that the characteristics obtained from sampled data represent the properties of the sampled graph, not the original graph. Therefore, an estimator plays an important role in correcting the bias of sampling frameworks. However, this issue has not been considered in previous work on gathering diffusion data. We also measured the node-based (Seed) and cascade-based (Depth) characteristics using the sampling design of DNS without estimation. The results show that the proposed sampling design alone cannot perform as well as DNS with estimation. Specifically, on average it can only improve the bias by about , and compared to BFS and RW, respectively.

TABLE II
THE AVERAGE PERFORMANCE DIFFERENCE OF DNS WITH BFS, RW AND DNS WITHOUT ESTIMATION (DNS-WOE).

Network          BFS    RW     DNS-WoE
Forest Fire      49%    21%    14%
Core-Periphery   22%    24%    20%
Hierarchical     43%    45%    37%
Random(ER)       44%    45%    39%
PolBlog          34%    31%    31%
Football         35%    46%    37%
NetScience       30%    31%    22%
Average          37%    35%    30%

Moreover, Figure 3 demonstrates that the proposed method performs very well even at low sampling rates. DNS decreases the bias of measuring diffusion characteristics to when $\mu$ < . . This promising result provides an appropriate sampling and estimation framework for large real networks where only low sampling rates are available.

E. Diffusion Behaviour Analysis

The diffusion rate ($\delta$) of infection over the underlying network plays a significant role in gathering diffusion data. As this rate decreases, smaller parts of the underlying network are covered by the infection. Therefore, collecting the diffusion data becomes more difficult. Here, we analyze the effect of the diffusion rate on the performance of the proposed method. Figure 4 illustrates that DNS leads to low bias even at low diffusion rates. Additionally, these results demonstrate the independence of the proposed framework from the diffusion process behavior. Hence, we can use DNS in various diffusion networks with different diffusion patterns without any loss in performance.

VI. CONCLUSIONS

In this paper, we introduced a novel two-step framework, DNS, to measure the characteristics of large-scale and latent diffusion networks. We proposed a sampling algorithm that samples the underlying network by moving from a node to one of its neighboring nodes through an outgoing link by considering the infection probability. Moreover, we proposed three estimators for correcting the bias of sampled data by extending the well-known Hansen-Hurwitz estimator. To this end, we computed the visiting probabilities of three types of diffusion characteristics: link-based, node-based, and cascade-based.

Our experiments showed that, on average, the proposed method outperforms BFS and RW in terms of link attendance by about 37% and 35%, respectively. Moreover, we showed that the proposed estimator can improve the performance of the sampling design by about 30%. Therefore, an appropriate estimator plays an important role in correcting the bias. Furthermore, the results demonstrated that the proposed method performs very well even at low sampling rates. Additionally, our studies on the diffusion process behavior showed that DNS leads to low bias even at low diffusion rates.

Fig. 3. Link Attendance characteristic evaluation in different sampling rates: (a) Core Periphery Network, (b) Hierarchical Network, (c) Random Network, (d) Forest Fire Network, (e) Political Blogosphere Network, (f) Football Network, (g) Co-authorship Network.

We believe that our results provide a promising step towards understanding sampling approaches in the analysis and evaluation of diffusion processes. There are several interesting directions for future work. Approximating the visiting probabilities of node-based and cascade-based characteristics is one of our main future goals.

VII. ACKNOWLEDGMENTS

This research has been partially supported by ITRC (Iran Telecommunication Research Center) under grant number 6479/500 (90/4/22).

Fig. 4. Analysis of diffusion rate over sampling frameworks. Panels: (a) Core Periphery Network, (b) Hierarchical Network, (c) Random Network, (d) Forest Fire Network, (e) Political Blogsphere Network, (f) Football Network, (g) Co-authorship Network.