Outward Influence and Cascade Size Estimation in Billion-scale Networks
H. T. Nguyen, T. P. Nguyen
Virginia Commonwealth Univ., Richmond, VA 23220
{hungnt,trinpm}@vcu.edu
T. N. Vu
Univ. of Colorado, Boulder & UC Denver, Boulder, CO
[email protected]
T. N. Dinh
Virginia Commonwealth Univ.Richmond, VA [email protected]
ABSTRACT
Estimating cascade size and nodes' influence is a fundamental task in social, technological, and biological networks. Yet this task is extremely challenging due to the sheer size and the structural heterogeneity of networks. We investigate a new influence measure, termed outward influence (OI), defined as the (expected) number of nodes that a subset of nodes S will activate, excluding the nodes in S. Thus, OI equals the influence spread of S, the de facto standard measure, minus |S|. OI is not only more informative for nodes with small influence, but also critical in designing new effective sampling and statistical estimation methods.

Based on OI, we propose SIEA/SOIEA, novel methods to estimate influence spread/outward influence at scale and with rigorous theoretical guarantees. The proposed methods are built on two novel components: 1) IICP, an importance sampling method for outward influence; and 2) RSA, a robust mean estimation method that minimizes the number of samples through analyzing the variance and range of the random variables. Compared to the state-of-the-art for influence estimation, SIEA is Ω(log n) times faster in theory and up to several orders of magnitude faster in practice. For the first time, the influence of nodes in networks of billions of edges can be estimated with high accuracy within a few minutes. Our comprehensive experiments on real-world networks also give evidence against the popular practice of using a fixed number, e.g., 10K or 20K, of samples to compute the "ground truth" for influence spread.

KEYWORDS
Outward influence; FPRAS; Approximation Algorithm
ACM Reference format:
H. T. Nguyen, T. P. Nguyen, T. N. Vu, and T. N. Dinh. 2017. Outward Influence and Cascade Size Estimation in Billion-scale Networks. In Proceedings of ACM Conference, Washington, DC, USA, July 2017 (Conference'17), 16 pages. DOI: 10.1145/nnnnnnn.nnnnnnn
In the past decade, a massive amount of data on human interactions has shed light on various cascading processes, from the propagation of information and influence [17] to the outbreak of diseases
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Conference'17, Washington, DC, USA © 2017 ACM. 978-x-xxxx-xxxx-x/YY/MM...$15.00 DOI: 10.1145/nnnnnnn.nnnnnnn

S | Influence I(S) | Outward Inf. I_out(S)
{u} | 1 + p + p^2 = 1.11 | p + p^2 = 0.11
{v} | 1 + 2p = 1.20 | 2p = 0.20
{w} | 1.00 | 0.00

Figure 1:
Left: the influence of nodes under the IC model. The influences of all nodes are roughly the same, despite the fact that w is much less influential than u and v. Right: Outward influence is better at reflecting the relative influence of the nodes. w has the least outward influence, 0.00, while v's is nearly twice that of u.

[21]. These cascading processes can be modeled in graph theory through the abstraction of the network as a graph G = (V, E) and a diffusion model that describes how the cascade proceeds in the network from a prescribed subset of nodes. A fundamental task in analyzing those cascades is to estimate the cascade size, also known as influence spread in social networks. This task is the foundation of solutions for many applications, including viral marketing [17, 28, 31, 32], estimating users' influence [12, 23], optimal vaccine allocation [30], identifying critical nodes in the network [11], and many others. Yet this task becomes computationally challenging in the face of today's social networks, which may consist of billions of nodes and edges.

Most of the existing work on network cascades uses stochastic diffusion models and estimates the influence spread through sampling [8, 11, 17, 23, 29, 31]. The common practice is to use a fixed number of samples, e.g., 10K or 20K [8, 17, 29, 31], to estimate the expected size of the cascade, aka the influence spread. Not only is there no single sample size that works well for all networks of different sizes and topologies, but those approaches also do not provide any accuracy guarantees. Recently, Lucier et al. [23] introduced INFEST, the first estimation method that comes with accuracy guarantees. Unfortunately, our experiments suggest that
INFEST does not perform well in practice, taking hours on networks with only a few thousand nodes.
Will there be a rigorous method to estimate the cascade size in billion-scale networks?
In this paper, we investigate efficient estimation methods for nodes' influence under stochastic cascade models [10, 12, 17]. First, we introduce a new influence measure, called outward influence and defined as I_out(S) = I(S) − |S|, where I(S) denotes the influence spread. The new measure excludes the self-influence artifact in influence spread, making it more effective in comparing the relative influence of nodes. As shown in Fig. 1, the influence spreads of the nodes are roughly the same, about 1. In contrast, the outward influences of nodes u, v and w are 0.11, 0.20, and 0.00, respectively. Those values correctly reflect the intuition that w is the least influential node and v is nearly twice as influential as u.

More importantly, the outward influence measure inspires novel methods, termed SIEA/SOIEA, to estimate influence spread/outward influence at scale and with rigorous theoretical guarantees. Both SOIEA and
SIEA guarantee arbitrarily small relative error with high probability within O(n) observed influence. The proposed methods are built on two novel components: 1) IICP, an importance sampling method for outward influence; and 2) RSA, a robust mean estimation method that minimizes the number of samples through analyzing the variance and range of the random variables. IICP focuses only on non-trivial cascades, in which at least one node outside the seed set must be activated. As each IICP sample generates a cascade of size at least two and outward influence of at least one, it leads to smaller variance and much faster convergence to the mean value. Under the well-known independent cascade model [17], SOIEA is Ω(log n) times faster than the state-of-the-art INFEST [23] in theory and is four to five orders of magnitude faster than both INFEST and the naive Monte-Carlo sampling. For other stochastic models, such as the continuous-time diffusion model [12], the LT model [17], SI, SIR, and their variations [10], RSA can be applied directly to estimate the influence spread, given a Monte-Carlo sampling procedure, or, better, with an extension of
IICP to the model.

Our contributions are summarized as follows.
• We introduce a new influence measure, called Outward Influence, which is more effective in differentiating nodes' influence. We investigate the characteristics of this new measure, including non-monotonicity, submodularity, and hardness of computation.
• Two fully polynomial-time randomized approximation schemes (FPRAS), SIEA and SOIEA, that provide (ϵ, δ)-approximates for influence spread and outward influence with only O(n) observed influence in total. Particularly, SOIEA, our algorithm to estimate outward influence, is Ω(log n) times faster than the state-of-the-art INFEST [23] in theory and is four to five orders of magnitude faster than both INFEST and the naive Monte-Carlo sampling.
• The robust mean estimation algorithm, termed RSA, a building block of SIEA, can be used to estimate influence spread under other stochastic diffusion models or, in general, the mean of bounded random variables of unknown distribution.
• We perform comprehensive experiments on both real-world and synthesized networks with sizes of up to 65 million nodes and billions of edges. Our experiments indicate the superiority of our algorithms in terms of both accuracy and running time in comparison to the naive Monte-Carlo and the state-of-the-art methods. The results also give evidence against the practice of using a fixed number of samples to estimate the cascade size. For example, using 10,000 samples to estimate the influence can deviate up to 240% from the ground truth in a Twitter subnetwork. In contrast, our algorithm can provide a (pseudo) ground truth with guaranteed small relative error (e.g., 0.5%). Thus it is a more concrete benchmark tool for research on network cascades.
Organization.
The rest of the paper is organized as follows: In Section 2, we introduce the diffusion model and the definition of outward influence with its properties. We propose an FPRAS for outward influence estimation in Section 3. Applications in influence estimation are presented in Section 5, followed by the experimental results in Section 6. We cover the most recent related work in Section 7 and conclude in Section 8.
In this section, we introduce stochastic diffusion models and the new measure of Outward Influence, and showcase its properties under the popular Independent Cascade (IC) model [17].
Diffusion model.
Consider a network abstracted as a graph G = (V, E), where V and E are the sets of nodes and edges, respectively. For example, in a social network, V and E correspond to the set of users and their social relationships, respectively. Assume that there is a cascade starting from a subset of nodes S ⊆ V, called the seed set. How the cascade progresses is described by a diffusion model (aka cascade model) M that dictates how nodes get activated/influenced. In a stochastic diffusion model, the cascade is dictated by a random vector θ in a sample space Ω_θ. Describing the diffusion model is then equivalent to specifying the distribution P of θ.

Let r_θ(S) be the size of the cascade, i.e., the number of activated nodes in the end. The influence spread of S, denoted by I(S), under diffusion model M is the expected size of the cascade, i.e.,

I(S) = Σ_{θ ∈ Ω_θ} r_θ(S) Pr[θ] for discrete Ω_θ, and I(S) = ∫_{θ ∈ Ω_θ} r_θ(S) dP(θ) for continuous Ω_θ. (1)

For example, we describe below the random vector θ and its distribution for the most popular diffusion models.
• Information diffusion models, e.g., Independent Cascade (IC), Linear Threshold (LT), and the general triggering model [17]: θ ∈ {0, 1}^{|E|}, and for all (u, v) ∈ E, θ(u, v) is a Bernoulli random variable that indicates whether u activates/influences v. That is, for a given w(u, v) ∈ (0, 1), θ(u, v) = 1, i.e., u activates v, with probability w(u, v), and θ(u, v) = 0 otherwise.
• Epidemic cascading models, e.g., Susceptible-Infected (SI) [10, 26] and its variations: θ ∈ N^{|E|}, and for all (u, v) ∈ E, θ(u, v) is a random variable following a geometric distribution; θ(u, v) indicates how long it takes u to activate v after u is activated.
• Continuous-time models [12]: θ ∈ R^{|E|}, and θ(u, v) is a continuous random variable with density function π_{u,v}(t).
θ(u, v) also indicates the transmission time (the time until u activates v) as in the SI model; however, the transmission times on different edges follow different distributions. Outward Influence.
We introduce the notion of
Outward Influence, which captures the influence of a subset of nodes towards the rest of the network. Outward influence excludes the self-influence of the seed nodes from the measure.

Definition 1 (Outward Influence).
Given a graph G = (V, E), a set S ⊆ V and a diffusion model M, the Outward Influence of S, denoted by I_out(S), is

I_out(S) = I(S) − |S|. (2)

Thus, influence and outward influence of a seed set S differ exactly by the number of nodes in S. Influence Spread/Outward Influence Estimations.
A fundamental task in network science is to estimate the influence of a given seed set S. Since the exact computation is #P-hard, we seek randomized approximations. Definition 2 (Influence Estimation). Given a graph G and a set S ⊆ V, the problem asks for an (ϵ, δ)-estimate Î(S) of the influence spread I(S), i.e.,

Pr[(1 − ϵ)I(S) ≤ Î(S) ≤ (1 + ϵ)I(S)] ≥ 1 − δ. (3)

The outward influence estimation problem is stated similarly: Definition 3 (Outward Influence Estimation). Given a graph G and a set S ⊆ V, the problem asks for an (ϵ, δ)-estimate Î_out(S) of the outward influence I_out(S), i.e.,

Pr[(1 − ϵ)I_out(S) ≤ Î_out(S) ≤ (1 + ϵ)I_out(S)] ≥ 1 − δ. (4)

A common approach for estimation is to generate independent Monte-Carlo samples and take the average. However, one faces two major challenges:
• How to achieve a minimum number of samples to get an (ϵ, δ)-approximate?
• How to effectively generate samples with small variance and, thus, reduce the number of samples?
For simplicity, we focus on the well-known
Independent Cascade (IC) model and provide the extension of our approaches to other cascade models in Subsection 5.3.
Given a probabilistic graph G = (V, E) in which each edge (u, v) ∈ E is associated with a number w(u, v) ∈ (0, 1): w(u, v) indicates the probability that node u will successfully activate v once u is activated. In practice, the probability w(u, v) can be mined from interaction frequency [17, 32] or learned from action logs [13]. Cascading Process.
The cascade starts from a subset of nodes S ⊆ V, called the seed set, and happens in discrete rounds t = 0, 1, ..., |V|. At round 0, only nodes in S are active and the others are inactive. When a node u becomes active, it has a single chance to activate (aka influence) each neighbor v of u with probability w(u, v). An active node remains active until the end of the cascade process, which stops when no more nodes get activated. Sample Graph.
Associate with each edge (u, v) ∈ E a biased coin that lands heads with probability w(u, v) and tails with probability 1 − w(u, v). Deciding the outcome when u attempts to activate v is then equivalent to flipping the coin. If the coin lands heads, the activation attempt succeeds and we call (u, v) a live-edge. Since all the activations on the edges are independent in the IC model, it does not matter when we flip the coins; that is, we can flip all the coins associated with the edges (u, v) at the same time instead of waiting until node u becomes active. We call the graph g that contains the nodes V and all the live-edges a sample graph of G. Note that the model parameter θ for the IC model is a random vector indicating the states of the edges, i.e., live-edge or not. In other words, Ω_θ corresponds to the space of all possible sample graphs of G, denoted by Ω_G. Probabilistic Space.
The graph G can be seen as a generative model. The set of all sample graphs generated from G, together with their probabilities, defines a probabilistic space Ω_G. Recall that each sample graph g ∈ Ω_G can be generated by flipping coins on all the edges to determine whether or not each edge is live, i.e., appears in g. Each edge (u, v) is present in a sample graph with probability w(u, v). Thus, the probability that a sample graph g = (V, E′ ⊆ E) is generated from G is

Pr[g ∼ G] = Π_{(u,v) ∈ E′} w(u, v) · Π_{(u,v) ∈ E\E′} (1 − w(u, v)). (5)

Influence Spread and Outward Influence.
In a sample graph g ∈ Ω_G, let r_g(S) be the set of nodes reachable from S. The influence spread in Eq. 1 can be rewritten as

I(S) = Σ_{g ∈ Ω_G} |r_g(S)| Pr[g ∼ G], (6)

and the outward influence is defined according to Eq. 2,

I_out(S) = I(S) − |S|. (7)

We show the properties of outward influence under the IC model.
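To make the sample-graph view concrete, here is a minimal Monte-Carlo sketch (the adjacency-list representation and function names are our own, not from the paper): each run implicitly draws a sample graph g ∼ G by flipping the edge coins lazily during a BFS, and averaging |r_g(S)| over runs estimates I(S) as in Eq. (6).

```python
import random
from collections import deque

def ic_cascade_size(graph, seeds, rng):
    """Simulate one IC cascade from `seeds` and return its size |r_g(S)|.
    `graph` maps each node u to a list of (v, w_uv) out-edges; every
    newly active node u gets a single chance to activate each neighbor v."""
    active = set(seeds)
    queue = deque(seeds)
    while queue:
        u = queue.popleft()
        for v, w_uv in graph.get(u, ()):
            if v not in active and rng.random() < w_uv:
                active.add(v)          # v becomes, and stays, active
                queue.append(v)
    return len(active)

def naive_influence(graph, seeds, num_samples=10000, seed=0):
    """Naive Monte-Carlo estimate of I(S): the average cascade size."""
    rng = random.Random(seed)
    total = sum(ic_cascade_size(graph, seeds, rng) for _ in range(num_samples))
    return total / num_samples
```

Subtracting |S| from the same average gives a naive estimate of I_out(S); the following sections show why this naive estimator needs far too many samples when I_out(S) is small.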
Better Influence Discrepancy.
As illustrated in Fig. 1, the elimination of the nominal constant |S| helps to differentiate the "actual influence" of the seed nodes on the other nodes in the network. In the extreme case when p = o(1), the ratio between the influence spreads of u and v is (1 + p + p^2)/(1 + 2p) ≈ 1, suggesting u and v have the same influence. However, outward influence can capture the fact that v can influence roughly twice the number of nodes as u, since I_out(u)/I_out(v) = (p + p^2)/(2p) ≈ 1/2. Non-monotonicity.
Outward influence as a function of the seed set S is non-monotone, unlike the influence spread. In Figure 1, I_out({u}) < I_out({u, v}); however, I_out({u}) > I_out({u, w}). That is, adding nodes to the seed set may increase or decrease the outward influence.
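Non-monotonicity is easy to verify exactly on a small instance by brute-force enumeration of live-edge graphs (Eqs. 5 and 6). The toy graph below is our own illustration, not the paper's Figure 1: edges u→w, w→x, v→c, v→d, each live with probability p = 0.1; adding v to {u} raises the outward influence, while adding w lowers it.

```python
from itertools import product

# Hypothetical toy graph (our own example): each directed edge is live
# with probability p, as in the IC live-edge view.
p = 0.1
edges = {('u', 'w'): p, ('w', 'x'): p, ('v', 'c'): p, ('v', 'd'): p}

def influence(seeds):
    """Exact I(S): sum |r_g(S)| * Pr[g ~ G] over all 2^|E| live-edge graphs."""
    edge_list = list(edges)
    total = 0.0
    for states in product([0, 1], repeat=len(edge_list)):
        prob, live = 1.0, set()
        for e, s in zip(edge_list, states):
            if s:
                prob *= edges[e]
                live.add(e)
            else:
                prob *= 1.0 - edges[e]
        reached, frontier = set(seeds), list(seeds)
        while frontier:                     # BFS over live edges only
            node = frontier.pop()
            for (a, b) in live:
                if a == node and b not in reached:
                    reached.add(b)
                    frontier.append(b)
        total += prob * len(reached)
    return total

def outward(seeds):
    return influence(seeds) - len(seeds)   # Eq. (2)

print(outward({'u'}))       # p + p^2      = 0.11
print(outward({'u', 'v'}))  # p + p^2 + 2p = 0.31 (adding v increases it)
print(outward({'u', 'w'}))  # p            = 0.10 (adding w decreases it)
```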
Submodularity.
A submodular function expresses the diminishing-returns behavior of set functions and is suitable for many applications, including approximation algorithms and machine learning. If Ω is a finite set, a submodular function is a set function f : 2^Ω → R, where 2^Ω denotes the power set of Ω, which satisfies that for every X, Y ⊆ Ω with X ⊆ Y and every x ∈ Ω \ Y, we have

f(X ∪ {x}) − f(X) ≥ f(Y ∪ {x}) − f(Y). (8)

Similar to influence spread, outward influence, as a function of the seed set S, is also submodular.

Lemma 1. Given a network G = (V, E, w), the outward influence function I_out(S), for S ⊆ V, is a submodular function.

If we can compute the outward influence of S, the influence spread of S can be obtained by adding |S| to it. Since computing the influence spread is #P-hard, so is computing the outward influence:

Lemma 2. Given a probabilistic graph G = (V, E, w) and a seed set S ⊆ V, it is #P-hard to compute I_out(S).

However, while influence spread is lower-bounded by one, the outward influence of any set S can be arbitrarily small (or even zero). Take the example in Figure 1: node u has influence I({u}) = 1 + p + p^2 ≥ 1. However, u's outward influence I_out({u}) = p + p^2 can be exponentially small, e.g., if p = 1/2^n. This makes estimating outward influence challenging, as the number of samples needed to estimate the mean of random variables is inversely proportional to the mean.

Monte-Carlo estimation. A typical approach to obtain an (ϵ, δ)-approximation of a random variable's mean is Monte-Carlo estimation: taking the average over many samples of that random variable. Through Bernstein's inequality [9], we have the following lemma.

Lemma 3. Given a set X_1, X_2, ... of i.i.d. random variables having a common mean µ_X, there exists a Monte-Carlo estimation which gives an (ϵ, δ)-approximate of the mean µ_X and uses T = O((b/(ϵ^2 µ_X)) ln(2/δ)) random variables, where b is an upper bound of the X_i, i.e., X_i ≤ b.
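Lemma 3's estimator can be sketched as follows; the constant factor of 3 and the stand-in lower bound `mu_low` for the unknown µ_X are our own illustrative assumptions, since the lemma only fixes the asymptotic form of T.

```python
import math
import random

def monte_carlo_mean(sample, b, mu_low, eps, delta, seed=0):
    """Average T i.i.d. samples of a variable in [0, b] whose mean is at
    least mu_low. T follows Lemma 3's form (b / (eps^2 * mu)) * ln(2/delta),
    with an illustrative constant factor of 3 (the lemma only states big-O)."""
    rng = random.Random(seed)
    T = math.ceil(3 * b * math.log(2 / delta) / (eps ** 2 * mu_low))
    return sum(sample(rng) for _ in range(T)) / T
```

With b = 1 and the mean near its lower bound, T grows as 1/ϵ²; when the mean is far below b, which is precisely the situation of outward influence, T blows up as b/µ_X. This is why the guarantee Y(S) ≥ 1 provided by IICP in the next section matters.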
To estimate the influence spread I(S), existing work often simulates the cascade process using a BFS-like procedure and takes the average of the cascade sizes as the influence spread. The number of samples needed to obtain an (ϵ, δ)-approximation is O((1/ϵ^2) log(1/δ) · n/I(S)). Since I(S) ≥ 1, in the worst case we need only a polynomial number of samples, O((1/ϵ^2) log(1/δ) · n).

Unfortunately, the same argument does not apply for I_out(S), since I_out(S) can be arbitrarily close to zero. For the same reason, the recent advances in influence estimation in [3, 23] cannot be adapted to obtain a polynomial-time algorithm to compute an (ϵ, δ)-approximation (aka FPRAS) for outward influence. We shall address this challenging task in the next section. We summarize the frequently used notations in Table 1.
Table 1: Table of notations
Notation | Description
n, m | The numbers of nodes and edges of G = (V, E, w).
I(S) | Influence spread of seed set S ⊆ V.
I_out(S) | Outward influence of seed set S ⊆ V.
N_out(u) | The set of out-neighbors of u: N_out(u) = {v ∈ V | (u, v) ∈ E}.
N_S^out | N_S^out = ∪_{u ∈ S} N_out(u) \ S.
A_i | The event that v_i is active and v_1, ..., v_{i-1} are not active after round 1.
β | β = Σ_{i=1}^{l} Pr[A_i] = 1 − Pr[A_{l+1}].
c(ϵ, δ) | c(ϵ, δ) = (2 + 2ϵ/3) ln(2/δ)/ϵ^2.
ϵ′ | ϵ′ = ϵ(1 − ϵb/((1 + ϵ) ln(2/δ)(b − a))) ≈ ϵ(1 − O(1/n)) for δ = 1/n.
Υ | Υ = (1 + ϵ) c(ϵ′, δ)(b − a).

We propose a Fully Polynomial Randomized Approximation Scheme (
FPRAS) to estimate the outward influence of a given set S. Given two precision parameters ϵ, δ ∈ (0, 1), our FPRAS algorithm guarantees to return an (ϵ, δ)-approximate Î_out(S) of the outward influence I_out(S):

Pr[(1 − ϵ)I_out(S) ≤ Î_out(S) ≤ (1 + ϵ)I_out(S)] ≥ 1 − δ. (9)
Our starting point is the observation that a cascade triggered by a seed set with small influence spread often stops right at round 0. The probability of such cascades, termed trivial cascades, can be computed exactly. Thus, if we sample only the non-trivial cascades, we obtain a better sampling method to estimate the outward influence: the "outward influence" associated with a non-trivial cascade is lower-bounded by one, so we can again apply the argument from the previous section on the polynomial number of samples. Given a graph G and a seed set S, we introduce our importance sampling strategy to generate such non-trivial cascades. It consists of two stages: (1) guarantee that at least one neighbor of S will be activated, through a biased selection towards the cascades with at least one node outside of S; and (2) continue to simulate the cascade using the standard procedure following the diffusion model. This importance sampling strategy is general across diffusion models. In the following, we illustrate our importance sampling under the IC model. We propose
Importance IC Polling (IICP) to sample non-trivial cascades in Algorithm 1.
Figure 2: Neighbors of nodes in S

First, we "merge" all the nodes in S and define a "unified neighborhood" of S. Specifically, let N_out(u) = {v | (u, v) ∈ E} be the set of out-neighbors of u and N_S^out = ∪_{u ∈ S} N_out(u) \ S the set of out-neighbors of S excluding S. For each v ∈ N_S^out,

P_{S,v} = 1 − Π_{u ∈ S} (1 − w(u, v)), (10)

is the probability that v is activated directly by one (or more) node(s) in S. Without loss of generality, assume that P_{S,v} < 1 (otherwise, we can simply merge v into S). Assume an order on the neighborhood of S, that is, N_S^out = {v_1, v_2, ..., v_l}, where l = |N_S^out|. For each i = 1..l, let A_i be the event that v_i is the "first" node that gets activated directly by S:

A_i = {v_1, ..., v_{i-1} are not active and v_i is active after round 1}.

The probability of A_i is

Pr[A_i] = P_{S,v_i} Π_{j=1}^{i-1} (1 − P_{S,v_j}). (11)

For consistency, we also denote by A_{l+1} the event that none of the neighbors are activated, i.e.,

Pr[A_{l+1}] = 1 − Σ_{i=1}^{l} Pr[A_i]. (12)

Note that A_{l+1} is also the event that the cascade stops right at round 0. Such a cascade is termed a trivial cascade. As we can compute the probability of trivial cascades exactly, we do not need to sample those cascades but focus only on the non-trivial ones. Denote by β the probability of having at least one node among v_1, ..., v_l activated by S, i.e.,

β = Σ_{i=1}^{l} Pr[A_i] = 1 − Pr[A_{l+1}]. (13)

Algorithm 1:
IICP - Importance IC Polling
Input: A graph G = (V, E, w) and a seed set S
Output: Y(S) - size of a random outward cascade from S
// Stage 1: Sample non-trivial neighbors of set S
Precompute Pr[A_i], i = 1, ..., l + 1
Select one neighbor v_i among v_1, ..., v_l, with the probability of selecting v_i being Pr[A_i]/β
Queue R ← {v_i}; Y(S) = 1; Mark v_i and all nodes in S visited
for j = i + 1 to l do
    With a probability P_{S,v_j} do
        Add v_j into R; Y(S) ← Y(S) + 1; Mark v_j visited
// Stage 2: Sample from newly influenced nodes
while R is non-empty do
    u ← R.pop()
    foreach unvisited neighbor v of u do
        With a probability w(u, v):
            Add v to R; Y(S) ← Y(S) + 1; Mark v visited
return Y(S);
IICP ), summarized in Alg. 1. The algorithm outputs the sizeof the cascade minus the seed set size. We term the output of
IICP the outer size of the cascade. The algorithm consists of two stages.
Stage 1. By definition, the events A_1, A_2, ..., A_l, A_{l+1} are disjoint and form a partition of the sample space. To generate a non-trivial cascade, we first select in the first round a node v_i, i = 1, ..., l, with probability Pr[A_i]/β (excluding A_{l+1}). This guarantees that at least one of the neighbors of S will be activated. Let v_i be the selected node; after the first round, v_i becomes active and v_1, ..., v_{i-1} remain inactive. The nodes v_j among v_{i+1}, ..., v_l are then activated independently with probability P_{S,v_j} (Eq. 10). Stage 2.
After the first stage of sampling neighbors of S, we get a non-trivial set of nodes directly influenced by S. For each of those nodes and later influenced nodes, we sample a set of its neighbors by the naive BFS-like IC polling scheme [17]. When sampling the neighbors of a newly influenced node u, each neighbor v_j ∈ N_out(u) is influenced by u with probability w(u, v_j). The neighbors of those influenced nodes are next to be sampled in the same fashion. In addition, we keep track of the newly influenced nodes using a queue R and the number of active nodes outside S using Y(S).

The following lemma shows how to estimate the (expected) cascade size through the (expected) outer size of non-trivial cascades.

Lemma 4. Given a seed set S ⊆ V, let Y(S) be the random variable associated with the output of the IICP algorithm. The following properties hold:
• 1 ≤ Y(S) ≤ n − |S|,
• I_out(S) = E[Y(S)] · β.

Further, let Ω_W be the probability space of non-trivial cascades and Ω_Y the probability space for the outer size of non-trivial cascades, i.e., Y(S). The probability of Y(S) ∈ [1, n − |S|] is given by

Pr[Y(S) ∈ Ω_Y] = Σ_{W(S) ∈ Ω_W, |W(S)| = Y(S)} Pr[W(S) ∈ Ω_W].

From Lemma 4, we can obtain an estimate Î_out(S) of I_out(S) through getting an estimate Ê[Y(S)] of E[Y(S)] by

Pr[(1 − ϵ)E[Y(S)] ≤ Ê[Y(S)] ≤ (1 + ϵ)E[Y(S)]]
= Pr[(1 − ϵ)E[Y(S)]β ≤ Ê[Y(S)]β ≤ (1 + ϵ)E[Y(S)]β]
= Pr[(1 − ϵ)I_out(S) ≤ Î_out(S) ≤ (1 + ϵ)I_out(S)], (14)

where the estimate Î_out(S) = Ê[Y(S)] · β.
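A runnable sketch of IICP (Alg. 1) and the rescaling from Lemma 4, under our own graph representation (a dict mapping each node to {neighbor: weight}); the sketch assumes the seed set has at least one out-neighbor outside S, so that β > 0:

```python
import random
from collections import deque

def iicp(graph, S, rng):
    """One IICP sample. Returns (Y, beta): Y is the outer size of a
    non-trivial cascade, beta as in Eq. (13)."""
    # Unified neighborhood and activation probabilities P_{S,v} (Eq. 10)
    P = {}
    for u in S:
        for v, w in graph.get(u, {}).items():
            if v not in S:
                P[v] = 1.0 - (1.0 - P.get(v, 0.0)) * (1.0 - w)
    order = list(P)                          # v_1, ..., v_l
    pr_A, none_active = [], 1.0              # Pr[A_i] (Eq. 11)
    for v in order:
        pr_A.append(P[v] * none_active)
        none_active *= 1.0 - P[v]
    beta = 1.0 - none_active                 # Eq. (13)
    # Stage 1: pick the "first" activated neighbor v_i with prob Pr[A_i]/beta
    r = rng.random() * beta
    i = 0
    while i < len(pr_A) - 1 and r > pr_A[i]:
        r -= pr_A[i]
        i += 1
    visited = set(S) | {order[i]}
    R = deque([order[i]])
    Y = 1
    for v in order[i + 1:]:                  # remaining neighbors: independent coins
        if rng.random() < P[v]:
            visited.add(v); R.append(v); Y += 1
    # Stage 2: BFS-style IC polling from newly influenced nodes
    while R:
        u = R.popleft()
        for v, w in graph.get(u, {}).items():
            if v not in visited and rng.random() < w:
                visited.add(v); R.append(v); Y += 1
    return Y, beta

def estimate_outward_influence(graph, S, num_samples=10000, seed=0):
    """Monte-Carlo estimate of I_out(S) = E[Y(S)] * beta (Lemma 4)."""
    rng = random.Random(seed)
    total, beta = 0, 1.0
    for _ in range(num_samples):
        y, beta = iicp(graph, S, rng)
        total += y
    return (total / num_samples) * beta
```

Averaging Y over samples and multiplying by β estimates I_out(S); adding |S| then gives I(S). Because Y ≥ 1 by construction, the blow-up the naive sampler suffers for small I_out(S) is avoided.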
Thus, finding an (ϵ, δ)-approximation of I_out(S) is equivalent to finding an (ϵ, δ)-approximate Ê[Y(S)] of E[Y(S)]. The advantage of this approach is that estimating E[Y(S)], in which the random variable Y(S) has value at least 1, requires only a polynomial number of samples; the same argument on the number of samples to estimate influence spread in Subsection 2.3 applies. Let Y(S)_1, Y(S)_2, ... be the random variables denoting the outputs of IICP. We can apply Lemma 3 on this set of random variables satisfying 1 ≤ Y(S)_i ≤ |V| − |S|. Since each random variable Y(S)_i is at least 1, µ_Y = E[Y(S)] ≥ 1, and T = O((1/ϵ^2) ln(2/δ)(n − |S|)) random variables suffice for the Monte-Carlo estimation. Since IICP has a worst-case time complexity of O(m + n), the Monte-Carlo estimation using IICP is an FPRAS for estimating outward influence.

Theorem 3.1. Given arbitrary 0 < ϵ, δ < 1 and a set S, the Monte-Carlo estimation using IICP returns an (ϵ, δ)-approximation of I_out(S) using O((1/ϵ^2) ln(2/δ)(n − |S|)) samples.

In Section 5, we will show that both outward influence and influence spread can be estimated by a more powerful algorithm, saving a factor of more than 1/ϵ random variables compared to this FPRAS estimation. The algorithm is built upon our mean estimation algorithms for bounded random variables proposed in the following.
In this section, we propose an efficient mean estimation algorithm for bounded random variables. This is the core of our algorithms for accurately and efficiently estimating the outward influence and influence spread in Section 5. We first propose an 'intermediate' algorithm:
Generalized Stopping Rule Estimation (GSRA), which relies on a simple stopping rule and returns an (ϵ, δ)-approximate of the mean of lower-bounded random variables. GSRA simultaneously generalizes and fixes an error in the Stopping Rule Algorithm [9], which only aims to estimate the mean of [0, 1] random variables and has a technical error in its proof. The main mean estimation algorithm, namely the Robust Sampling Algorithm (RSA) presented in Alg. 3, effectively takes into account both the mean and the variance of the random variables. It uses
GSRA as a subroutine to estimate the mean value and variance at different granularity levels.
We aim at obtaining an (ϵ, δ)-approximate of the mean of random variables X_1, X_2, .... Specifically, the random variables are required to satisfy the following conditions:
• a ≤ X_i ≤ b, ∀i = 1, 2, ...
• E[X_{i+1} | X_1, X_2, ..., X_i] = µ_X, ∀i = 1, 2, ...
where 0 ≤ a < b are fixed constants and µ_X is unknown.

Our algorithm generalizes the stopping rule estimation in [9], which provides an (ϵ, δ)-estimation of the mean of i.i.d. random variables X_1, X_2, ... ∈ [0, 1]. The notable differences are the following:
• We discover and amend an error in the stopping rule algorithm in [9]: the number of samples drawn by that algorithm may not be sufficient to guarantee the (ϵ, δ)-approximation.
• We allow estimating the mean of random variables that are possibly dependent and/or with different distributions. Our algorithm works as long as the random variables have the same mean. In contrast, the algorithm in [9] can only be applied to i.i.d. random variables.
• Our proposed algorithm obtains an unbiased estimator of the mean, i.e., E[µ̂_X] = µ_X, while [9] returns a biased one.
• Our algorithm is faster than the one in [9] whenever the lower bound of the random variables a > 0.

Algorithm 2:
Generalized Stopping Rule Alg. (GSRA)
Input: Random variables X_1, X_2, ... and 0 < ϵ, δ < 1
Output: An (ϵ, δ)-approximate µ̂_X of µ_X = E[X_i]
1: If b − a < ϵb, return µ̂_X = a.
2: Compute ϵ′ = ϵ(1 − ϵb/((1 + ϵ) ln(2/δ)(b − a))); Υ = (1 + ϵ) c(ϵ′, δ)(b − a);
3: Initialize h = 0, T = 0
4: while h < Υ do
5:     T ← T + 1; h ← h + X_T
6: return µ̂_X = h/T;
Our Generalized Stopping Rule Algorithm (
GSRA) is described in detail in Alg. 2. Denote c(ϵ, δ) = (2 + 2ϵ/3) ln(2/δ)/ϵ^2. The algorithm contains two main steps: 1) compute the stopping threshold Υ (Line 2), which relies on the value ϵ′ computed from the given precision parameters ϵ, δ and the range [a, b] of the random variables; 2) consecutively acquire the random variables until the sum of their outcomes exceeds Υ (Lines 4-5). Finally, it returns the average of the outcomes, µ̂_X = h/T (Line 6), as an estimate of the mean µ_X. Notice that Υ in GSRA depends on (b − a); thus, getting tighter bounds on the range of the random variables holds a key to the efficiency of GSRA from an application perspective. The approximation guarantee and the number of necessary samples are stated in the following theorem. Theorem 4.1.
The Generalized Stopping Rule Algorithm (
GSRA) returns an (ϵ, δ)-approximate µ̂_X of µ_X, i.e.,

Pr[(1 − ϵ)µ_X ≤ µ̂_X ≤ (1 + ϵ)µ_X] > 1 − δ, (15)

and the number of samples T satisfies

Pr[T ≤ (1 + ϵ)Υ/µ_X] > 1 − δ/2. (16)
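A Python sketch of GSRA; the exact constants follow our reconstruction of the damaged formulas (c(ϵ, δ) = (2 + 2ϵ/3) ln(2/δ)/ϵ^2 and Υ = (1 + ϵ) c(ϵ′, δ)(b − a)), so treat the threshold as an assumption; only the stopping-rule structure is taken from Alg. 2.

```python
import math

def gsra(samples, a, b, eps, delta):
    """Generalized Stopping Rule sketch: draw variables in [a, b] with a
    common (unknown) mean from the iterator `samples` until their running
    sum reaches the threshold Upsilon, then return the plain average."""
    if b - a < eps * b:              # the range already pins the mean down
        return a
    eps1 = eps * (1 - eps * b / ((1 + eps) * math.log(2 / delta) * (b - a)))
    c = (2 + 2 * eps1 / 3) * math.log(2 / delta) / eps1 ** 2
    upsilon = (1 + eps) * c * (b - a)
    h, T = 0.0, 0
    while h < upsilon:               # stopping rule: sample until sum >= Upsilon
        h += next(samples)
        T += 1
    return h / T
```

Note the dependence on (b − a): tightening the known range of the variables directly shrinks the threshold and hence the number of samples drawn, which is the point of the discussion that follows.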
The estimation algorithm in [9] for estimating the mean of random variables in the range [0, 1] is also based on a main stopping rule condition, as in our GSRA. It computes a threshold

    Υ₁ = 1 + 4(1 + ε)(e − 2) ln(2/δ)/ε²,    (17)

where e is the base of the natural logarithm, and generates samples X_j until Σ_{j=1}^T X_j ≥ Υ₁. The algorithm returns µ̂_X = Υ₁/T as a biased estimate of µ_X.

Unfortunately, the threshold Υ₁ that determines the stopping time does not fully account for the fact that the necessary number of samples must exceed the expected one in order to provide high solution guarantees. This causes a flaw in their later proof of correctness.

To amend the algorithm, we slightly strengthen the stopping condition by replacing the ε in the formula of Υ with ε′ = ε(1 − εb/((1 + ε) ln(2/δ)(b − a))) (Line 2, Alg. 2). Since εb < b − a (otherwise the algorithm returns µ̂_X = a) and assuming w.l.o.g. that δ < 1/2, it follows that ε′ ≥ 0.27ε. Thus, compared to the stopping rule algorithm in [9], the number of samples increases by at most a constant factor.

Benefit of considering the lower bound a. By dividing the random variables by b, one can apply the stopping rule algorithm in [9] to the normalized random variables. The corresponding value of Υ is then

    Υ′ = (1 + ε) c(ε′, δ) b.    (18)

The Υ in our proposed algorithm is, however, smaller by a multiplicative factor of (b − a)/b; thus GSRA is faster than the algorithm in [9] by a factor of b/(b − a) on average. Note that in the case of estimating the influence, we have a = |S| + 1 and b = n. Compared to applying [9] directly, our GSRA algorithm therefore reduces the number of generated samples by a factor of (b − a)/b = (n − |S| − 1)/n = 1 − (|S| + 1)/n < 1.

Martingale theory to cope with weakly-dependent random variables.
To prove Theorem 4.1, we need a stronger Chernoff-like bound that can deal with the general random variables X_1, X_2, ... in the range [a, b], presented in the following.

Define random variables Y_i = Σ_{j=1}^i (X_j − µ_X) for all i ≥ 1. The random variables Y_1, Y_2, ... form a martingale [24] since

    E[Y_i | Y_1, ..., Y_{i−1}] = E[Y_{i−1}] + E[X_i − µ_X] = E[Y_{i−1}].

Then we can apply the following lemma from [7].

Lemma 5. Let Y_1, ..., Y_i, ... be a martingale such that |Y_1| ≤ α, |Y_j − Y_{j−1}| ≤ α for all j = 2, ..., i, and

    Var[Y_1] + Σ_{j=2}^i Var[Y_j | Y_1, ..., Y_{j−1}] ≤ β.    (19)

Then, for any λ ≥ 0,

    Pr[Y_i − E[Y_i] ≥ λ] ≤ exp(−λ² / (2(αλ/3 + β))).    (20)

In our case, we have |Y_1| = |X_1 − µ_X| ≤ b − a, |Y_j − Y_{j−1}| = |X_j − µ_X| ≤ b − a, Var[Y_1] = Var[X_1 − µ_X] = Var[X], and Var[Y_j | Y_1, ..., Y_{j−1}] = Var[X_j − µ_X] = Var[X]. Applying Lemma 5 with i = T and λ = εTµ_X, we have

    Pr[Σ_{j=1}^T X_j ≥ (1 + ε)µ_X T] ≤ exp(−ε²µ_X²T² / (2((b − a)εµ_X T/3 + Var[X]·T))).    (21)

Then, since Var[X] ≤ µ_X(b − µ_X) ≤ µ_X(b − a) (Bernoulli-like random variables with the same mean µ_X have the maximum variance), we also obtain

    Pr[Σ_{j=1}^T X_j ≥ (1 + ε)µ_X T] ≤ exp(−ε²µ_X T / ((2 + 2ε/3)(b − a))).    (22)

Similarly, −Y_1, ..., −Y_i, ... also form a martingale, and applying Lemma 5 gives the following probabilistic inequality,

    Pr[Σ_{j=1}^T X_j ≤ (1 − ε)µ_X T] ≤ exp(−ε²µ_X T / ((2 + 2ε/3)(b − a))).    (23)

Algorithm 3: Robust Sampling Algorithm (RSA)
Input: two streams of i.i.d. random variables X_1, X_2, ... and X′_1, X′_2, ..., and 0 < ε, δ < 1
Output: an (ε, δ)-approximate µ̂_X of µ_X
// Step 1: obtain a rough estimate µ̂′_X of µ_X
1: if ε ≥ 1/4 then return µ̂_X ← GSRA(⟨X_1, X_2, ...⟩, ε, δ)
2: µ̂′_X ← GSRA(⟨X_1, X_2, ...⟩, √ε, δ/3)
// Step 2: estimate the variance σ̂²_X (Υ is defined as in Alg. 2)
3: Υ₂ = (1 + √ε)/(1 − √ε) · (1 + ln(3/2)/ln(2/δ)) · Υ; N_σ = Υ₂ · ε/µ̂′_X; ∆ = 0
4: for i = 1 to N_σ do ∆ ← ∆ + (X′_{2i−1} − X′_{2i})²/2
5: ρ̂_X = max{σ̂²_X = ∆/N_σ, εµ̂′_X(b − a)}
// Step 3: estimate µ_X
6: Set T = Υ₂ · ρ̂_X/(µ̂′²_X (b − a)), S ← 0
7: for i = 1 to T do S ← S + X_i
8: return µ̂_X = S/T

Our previously proposed 
GSRA algorithm may have problems estimating the means of random variables with small variances. An important tool that we rely on to prove the approximation guarantee of GSRA is the Chernoff-like bound in Eqs. 22 and 23. However, from the inequality in Eq. 21 we can also derive the following stronger inequality,

    Pr[Σ_{j=1}^T X_j ≥ (1 + ε)µ_X T] ≤ exp(−ε²µ_X²T² / (2((b − a)εµ_X T/3 + Var[X]·T)))
                                      ≤ exp(−ε²µ_X²T / ((2 + 2/3) · max{εµ_X(b − a), Var[X]})).    (24)

In many cases, the random variables have small variances and hence max{εµ_X(b − a), Var[X]} = εµ_X(b − a). Thus, Eq. 24 is much stronger than Eq. 22 and can save a factor of ε in terms of the required observed influence, which translates into the sample requirement. However, neither the mean nor the variance is available in advance.

To achieve a robust sampling algorithm in terms of sample complexity, we adopt and improve the AA algorithm in [9] for the general case of [a, b] random variables. The robust sampling algorithm (RSA) estimates both the mean and the variance in three steps: 1) roughly estimate the mean with a larger error (√ε or a constant); 2) use the estimated mean to compute the number of samples necessary for estimating the variance; 3) use both the estimated mean and variance to refine the number of samples required to estimate the mean with the desired error (ε, δ).

Let X_1, X_2, ... and X′_1, X′_2, ... be two streams of i.i.d. random variables. Our robust sampling algorithm (RSA) is described in Alg. 3. It consists of three main steps:

1) If ε ≥ 1/4, run GSRA with parameters ε, δ and return the result (Line 1). Otherwise, assume ε < 1/4 and obtain a rough estimate µ̂′_X with parameters ε′ = √ε < 1/2 and δ′ = δ/3 (Line 2).

2) Use the rough estimate µ̂′_X from Step 1 to compute the necessary number of samples, N_σ, to estimate the variance of X_i, σ̂²_X. Note that this estimation uses the second stream of samples, X′_1, X′_2, ...

3) Use both µ̂′_X from Step 1 and σ̂²_X from Step 2 to compute the actual number of samples, T, necessary to approximate the mean µ_X. Note that this step uses the same stream of samples X_1, X_2, ... as the first step.

The numbers of samples used in the first two steps are always less than a constant times Υ · ε/µ_X, which is the minimum number of samples achievable using the variance: the first step takes the error parameter √ε, which is coarser than ε, and the second step uses N_σ = Υ₂ · ε/µ̂′_X samples.

At the end, the algorithm returns the estimate µ̂_X, which is the average over the T samples, µ̂_X = S/T. The estimation guarantees are stated in the following theorem.

Theorem 4.2. Let 𝒳 be the probability distribution that X_1, X_2, ... and X′_1, X′_2, ... are drawn from. Let µ̂_X be the estimate of E[X] returned by Alg. 3, and let T be the number of samples drawn by Alg. 3 w.r.t. ε, δ. We have:
(1) Pr[µ_X(1 − ε) ≤ µ̂_X ≤ (1 + ε)µ_X] ≥ 1 − δ;
(2) there is a universal constant c′ such that

    Pr[T > c′Υρ_X/(µ_X²(b − a))] ≤ δ,    (25)

where ρ_X = max{εµ_X(b − a), Var[X]}.

Compared to the AA algorithm in [9], we first replace their stopping rule algorithm with GSRA; we also change the computation of Υ₂, which is always smaller than the corresponding threshold in [9] by a constant factor when ε ≤ 1/4.

This section applies our 
RSA algorithm to estimate both the outward influence and the traditional influence spread.

We directly apply the RSA algorithm to two streams of i.i.d. random variables Y_1(S), Y_2(S), ... and Y′_1(S), Y′_2(S), ..., generated by the IICP sampling algorithm, with precision parameters ε, δ. The algorithm, called the Scalable Outward Influence Estimation Algorithm (SOIEA), is presented in Alg. 4: it generates the two streams of random variables (Line 1) and applies the RSA algorithm to them (Line 2). Note that the outward influence estimate is obtained by scaling the mean estimate µ̂_Y by β (Lemma 4).

Algorithm 4: SOIEA: Alg. to estimate outward influence
Input: a probabilistic graph G, a set S, and ε, δ
Output: Î_out(S), an (ε, δ)-estimate of I_out(S)
1: Generate two streams of i.i.d. random variables Y_1(S), Y_2(S), ... and Y′_1(S), Y′_2(S), ... by the IICP algorithm.
2: return Î_out(S) ← β · RSA(⟨Y_1(S), ...⟩, ⟨Y′_1(S), ...⟩, ε, δ)

We obtain the following theoretical result by combining Theorem 4.2 of RSA with the properties of IICP samples.

Theorem 5.1. The SOIEA algorithm gives an (ε, δ) outward influence estimation. The observed outward influence (the sum of the Y_i(S)) and the number of generated random variables are in O(ln(2/δ)/ε² · ρ_Y · β/I_out(S)) and O(ln(2/δ)/ε² · ρ_Y · (β/I_out(S))²), respectively, where ρ_Y = max{ε I_out(S)(n − |S| − 1)/β, Var[Y_i(S)]}. Note that E[Y(S)] = I_out(S)/β ≥ 1.

Not only is the concept of outward influence helpful in discriminating the relative influence of nodes, but also its sampling technique, 
IICP, can help scale up the estimation of influence spread to billion-scale networks.

Naive approach. A naive approach is to 1) obtain an (ε, δ)-approximation Î_out(S) of I_out(S) using Monte-Carlo estimation, and 2) return Î_out(S) + |S|. It is easy to show that this approach returns an (ε, δ)-approximation of I(S). This approach requires O(ln(2/δ)/ε² · n) IICP random samples.

However, the naive approach is not optimized for estimating influence, for several reasons: 1) it relies on a loose bound µ_Y = E[Y(S)] ≥ 1; 2) converting an (ε, δ)-approximation of outward influence into an (ε, δ)-approximation of influence introduces a gap that can be exploited to improve the estimation guarantees. We next propose more efficient algorithms based on Importance IC Sampling to achieve (ε, δ)-approximates of both outward influence and influence spread. Our methods are based on the two effective mean estimation algorithms above.

Our approach. We rely on the following observations:

• 1 ≤ Y(S) ≤ n − |S|, i.e., we have better bounds on Y(S) than on the cascade size, which lies in the range [1, n].
• Since we want an (ε, δ)-approximation of Y(S)·β + |S|, the fixed add-on |S| can be leveraged to reduce the number of samples.

We combine the effective RSA algorithm with our Importance IC Polling (IICP) to estimate the influence spread of a set S. We analyze random variables based on samples generated by our Importance IC Polling scheme and use them to devise an influence estimation algorithm.

Since outward influence and influence spread differ by an additive factor of |S|, for each outward sample Y(S) generated by IICP, define a corresponding variable Z(S),

    Z(S) = Y(S)·β + |S|,    (26)

where β is defined in Eq. 13. We obtain:

• |S| + β ≤ Z(S) ≤ |S| + β(n − |S|);
• E[Z(S)] = E[Y(S)]·β + |S| = I_out(S) + |S| = I(S),

and thus we can approximate I(S) by estimating E[Z(S)].

Recall that to estimate the influence I(S) of a seed set S, all previous works [6, 17, 21] resort to simulating many influence cascades from S and taking the average size of the generated cascades. Call M(S) the random variable representing the size of such an influence cascade; then E[M(S)] = I(S). Although both Z(S) and M(S) can be used to estimate the influence, they have different variances, which leads to different convergence speeds when estimating their means. The relation between the variances of Z(S) and M(S) is stated as follows.

Lemma 6. Let Z(S) be defined as in Eq. 26 and let M(S) be the random variable for the size of an influence cascade. The variances of Z(S) and M(S) satisfy

    Var[Z(S)] = β·Var[M(S)] − (1 − β)·I_out(S)².    (27)

Note that 0 ≤ β ≤ 1 and I(S) ≥ |S|. Thus, the variance of Z(S) is much smaller than that of M(S). Our proposed RSA makes use of the variances of the random variables and thus benefits from the small variance of Z(S), compared to running the same algorithm on the previously used random variables M(S).

Algorithm 5: 
SIEA: Alg. to estimate influence spread
Input: a probabilistic graph G, a set S, and ε, δ
Output: Î(S), an (ε, δ)-estimate of I(S)
1: Generate two streams of i.i.d. random variables Y_1(S), Y_2(S), ... and Y′_1(S), Y′_2(S), ... by the IICP algorithm.
2: Compose two streams Z_1(S), Z_2(S), ... and Z′_1(S), Z′_2(S), ... from Y_1(S), Y_2(S), ... and Y′_1(S), Y′_2(S), ... using Eq. 26.
3: return Î(S) ← RSA(⟨Z_1(S), ...⟩, ⟨Z′_1(S), ...⟩, ε, δ)

Thus, we apply RSA to random variables generated by IICP to develop the Scalable Influence Estimation Algorithm (SIEA). SIEA is described in Alg. 5 and consists of two main steps: 1) generate i.i.d. random variables by IICP, and 2) convert those variables for use in RSA to estimate the influence of S. The results are stated as follows.

Theorem 5.2. The SIEA algorithm gives an (ε, δ) influence spread estimation. The observed influence (the sum of the random variables Z_i(S)) and the number of generated random variables are in O(ln(2/δ)/ε² · ρ_Z/I(S)) and O(ln(2/δ)/ε² · ρ_Z/I(S)²), respectively, where ρ_Z = max{ε I(S) β(n − |S| − 1), Var[Z_i(S)]}.

Comparison to 
INFEST [23]. Compared to the most recent state-of-the-art influence estimation algorithm in [23], which requires O(n log²(n)/ε²) observed influences, the SIEA algorithm, incorporating IICP sampling with RSA, saves at least a factor of log(n). This is because the necessary observed influence in SIEA is bounded by O(ln(2/δ)/ε² · ρ_Z/I(S)). Since Var[Z_i(S)] ≤ I(S)(|S| + β(n − |S|) − I(S)) ≤ I(S)(n − |S| − 1), we have ρ_Z ≤ I(S)(n − |S| − 1); when δ = 1/n as in [23], the observed influence is then

    O(ln(2/δ)/ε² · ρ_Z/I(S)) ≤ O(n log(1/δ)/ε²) ≤ O(n log(n)/ε²).    (28)

Considering ε and δ as constants, the observed influence is O(n).

We can easily apply the RSA estimation algorithm to obtain an (ε, δ)-estimate of the influence spread under other cascade models, as long as there is a Monte-Carlo sampling procedure that generates the sizes of random cascades. For most stochastic diffusion models, including discrete-time models, e.g., the popular LT model with the naive sample generator described in [17], SI and SIR [10] or their variants with deadlines [26], and continuous-time models [12], designing such a Monte-Carlo sampling procedure is straightforward. Since influence cascade sizes are at least the seed set size, we always need at most O(n) samples.

To obtain more efficient sampling procedures, we can extend the idea of sampling non-trivial cascades in IICP to other models. Such sampling procedures generally result in random variables with smaller variances and tighter bounds on their ranges. In turn, RSA, which benefits from smaller variance and range, will require fewer samples for estimation.
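The estimators above all reduce to stopping-rule mean estimation over a stream of cascade samples: draw samples until their sum crosses a threshold, then return the average. The following Python sketch illustrates that pattern with simplified Bernstein-style constants; it is not the paper's exact GSRA (in particular, the ε′ correction and the RSA variance step are omitted), and the toy sampler is hypothetical.

```python
import math
import random

def stopping_rule_mean(sample, a, b, eps, delta):
    """Stopping-rule estimate of the mean of random variables in [a, b].
    `sample` is a zero-argument function returning one observation.
    Simplified constants; a sketch of the GSRA idea, not the exact algorithm."""
    assert 0 <= a < b and 0 < eps < 1 and 0 < delta < 1
    # Bernstein-style threshold: wider ranges (b - a) demand more samples.
    c = (2 + 2 * eps / 3) * math.log(2 / delta) / eps ** 2
    upsilon = (1 + eps) * c * (b - a)
    total, t = 0.0, 0
    while total < upsilon:        # keep drawing until the summed outcomes
        total += sample()         # exceed the stopping threshold
        t += 1
    return total / t              # average over a random number of samples

# Usage: estimate the mean of a toy cascade-size sampler with values in [1, 10].
random.seed(7)
est = stopping_rule_mean(lambda: random.choice([1, 1, 1, 4, 10]),
                         a=1, b=10, eps=0.2, delta=0.1)
```

Note how the sample count adapts to the data: streams with larger outcomes cross the threshold sooner, which is exactly why tighter ranges and smaller variances (as produced by IICP-style sampling) translate into fewer samples.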
We develop parallel versions of our algorithms to speed up the computation and to demonstrate the easy-to-parallelize property of our methods. The main idea is that the random variable generation by IICP can run in parallel; in particular, the random variables used in each step of the core RSA algorithm can be generated simultaneously. Recall that IICP only needs to store a queue of newly activated nodes, an array marking the active nodes, and a single variable Y(S). In total, each thread requires space on the order of the number of active nodes in its simulation, O(Y(S)), which is at most linear in the size of the graph, O(n). In fact, due to the stopping condition of a linear number of observed influences, the total memory across all threads is bounded by O(n), assuming the number of threads is small relative to n.

Moreover, our algorithms can be implemented efficiently, in terms of communication cost, in distributed environments. This is because the output of the IICP algorithm is just a single number Y(S); worker nodes in a distributed environment only communicate that single number back to the node running the estimation task, while each IICP worker holds a copy of the graph. However, the programming model needs to be chosen carefully: as pointed out in many studies, the famous MapReduce is not a good fit for iterative graph processing algorithms [14, 22].
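The worker pattern described above can be sketched as follows. This toy code uses plain Monte-Carlo IC cascades in place of IICP (the β-weighted importance sampling is omitted); each worker keeps only its own RNG and active-set state and ships back one number per cascade, mirroring the low communication cost.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def simulate_once(graph, seeds, rng):
    """One IC cascade from `seeds`; returns the single number the estimator
    needs (the count of activated nodes). `graph` maps each node to a list
    of (neighbor, activation_probability) pairs."""
    active = set(seeds)
    frontier = list(seeds)
    while frontier:
        u = frontier.pop()
        for v, p in graph.get(u, []):
            if v not in active and rng.random() < p:
                active.add(v)
                frontier.append(v)
    return len(active)

def parallel_cascade_sizes(graph, seeds, n_samples, n_workers=4):
    """Each worker runs independent simulations with its own RNG
    (no shared mutable state) and returns a list of cascade sizes."""
    def worker(worker_id):
        rng = random.Random(worker_id)  # per-worker RNG stream
        return [simulate_once(graph, seeds, rng)
                for _ in range(n_samples // n_workers)]
    with ThreadPoolExecutor(max_workers=n_workers) as ex:
        chunks = ex.map(worker, range(n_workers))
    return [size for chunk in chunks for size in chunk]

# Toy line graph 0 -> 1 -> 2 with certain activation.
g = {0: [(1, 1.0)], 1: [(2, 1.0)]}
sizes = parallel_cascade_sizes(g, seeds=[0], n_samples=8)
```

In a real distributed deployment, `worker` would run on a remote machine holding its own copy of the graph, and only the list of sizes (or their running sum) would travel over the network.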
We will experimentally show that our outward influence estimation algorithm (SOIEA) and influence estimation algorithm (SIEA) are not only several orders of magnitude faster than existing state-of-the-art methods but also consistently return much smaller errors. We present empirical validation of our methods on both real-world and synthetic networks.
Algorithms. We compare the performance of SOIEA and SIEA with the following algorithms:

• INFEST [23]: a recent influence estimation algorithm by Lucier et al. [23] in KDD'15 that provides approximation guarantees. We reimplemented the algorithm in C++ according to the description in [23].¹
• MC10K, MC100K: variants of the Monte-Carlo method that generate 10,000 and 100,000 traditional influence cascades [17, 21], respectively, to estimate (outward) influence spread.
• MCε,δ: the Monte-Carlo method that uses traditional influence cascades and guarantees an (ε, δ)-estimation. Following [23], MCε,δ is only used for measuring the running time the normal Monte-Carlo method needs to provide the same (ε, δ)-approximation guarantee. In particular, we obtain the running time of MCε,δ by interpolating from that of MC10K, i.e., scaling Time(MC10K) by the ratio between the number of samples needed for an (ε, δ)-guarantee, Θ(ln(2/δ)·n/ε²), and the 10⁴ samples of MC10K.

Table 2: Datasets' Statistics

Dataset       #Nodes   #Edges   Avg. degree
NetHEP        15K      59K      4.1
NetPHY        37K      181K     13.4
Epinions      75K      841K     13.4
DBLP          -        -        -
Orkut         3M       117M     78.0
Twitter [20]  41.7M    1.5G     70.5
Friendster    65.6M    -        -

Datasets from http://snap.stanford.edu.
Datasets. We use both real-world networks and synthetic networks generated by GTgraph [2]. For real-world networks, we choose a set of 7 datasets with sizes ranging from tens of thousands to 65.6 million nodes. Table 2 gives a summary. GTgraph generates synthetic graphs with varying numbers of nodes and edges.

Metrics. We compare the performance of the algorithms in terms of solution quality and running time. To compare solution quality, we adopt the relative error, which shows how far the estimated value is from the "ground truth". The relative error of outward influence is computed as |Î_out(S)/I_out(S) − 1|·100%, where Î_out(S) is the outward influence of seed set S estimated by the algorithm and I_out(S) is the "ground truth" for S. Similarly, the relative error of influence spread is |Î(S)/I(S) − 1|·100%.

¹Through communication with the authors of [23]: the released code has some problems and is not ready for testing.
Table 3: Comparing performance of algorithms in estimating outward influence

                       Avg. Rel. Error (%)      Max. Rel. Error (%)       Running time (sec)
Dataset  Edge model    SOIEA  MC10K  MC100K     SOIEA  MC10K  MC100K      SOIEA  MC10K  MC100K  MCε,δ
NetHEP   p = 0.01      0.0    4.5    1.6        0.2    20.2   9.2         0.2    0.1    0.1     8.8
NetHEP   p = 0.001     0.0    19.2   4.6        0.1    100.0  26.4        0.2    0.1    0.1     8.5
NetPHY   p = 0.01      0.0    5.5    1.7        0.2    30.4   10.7        0.6    0.1    0.1     25.0
NetPHY   p = 0.001     0.0    19.1   5.1        0.0    80.0   28.1        0.7    0.1    0.1     24.0

Figure 3: Error distributions (histograms) of the approximation errors of SOIEA, MC10K, and MC100K on NetHEP under (a) the WC model, (b) p = 0.1, (c) p = 0.01, and (d) p = 0.001.

We report the average relative error (Avg. Rel. Error) and the maximum relative error (Max. Rel. Error).

Ground-truth computation.
We use estimates of influence and outward influence with a very small error, corresponding to a setting with very small ε and δ = 1/n. We note that previous research [23, 31] computes the "ground truth" by running Monte-Carlo with 10,000 samples, which is not sufficient, as we will show later in our experiments.

Parameter settings. For each of the datasets, we consider two common edge weighting models:

• Weighted Cascade (WC): the weight of edge (u, v) is calculated as w(u, v) = 1/d_in(v), where d_in(v) denotes the in-degree of node v, as in [6, 8, 28, 31, 32].
• Constant model: all edges have the same constant probability p, as in [6, 8, 17]. We consider three different values of p: 0.1, 0.01, and 0.001.

By default, we set ε = 0.1 and δ = 1/n for SOIEA and SIEA, unless explicitly stated otherwise.

Environment. All algorithms are implemented in C++ and compiled using GCC 4.8.5. We conduct all experiments on a CentOS 7 workstation with two Intel Xeon 2.30GHz CPUs, adding up to 20 physical cores, and 250GB RAM.
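The Weighted Cascade assignment described above can be sketched in a few lines; the edge-list representation and the function name are ours, not the paper's.

```python
from collections import Counter

def wc_weights(edges):
    """Weighted Cascade model: each directed edge (u, v) gets weight
    1/d_in(v), where d_in(v) is the in-degree of the target node v.
    `edges` is a list of directed (u, v) pairs."""
    indeg = Counter(v for _, v in edges)          # in-degree of each target
    return {(u, v): 1.0 / indeg[v] for u, v in edges}

# Usage: node 2 has in-degree 2, so both of its incoming edges get 0.5;
# node 3 has in-degree 1, so its incoming edge gets 1.0.
w = wc_weights([(0, 2), (1, 2), (2, 3)])
```

Under WC, the incoming weights of every node sum to 1, so a node's expected number of activating in-neighbors is at most one; this is what keeps cascades sparse on high-degree networks.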
We compare SOIEA against MC10K and MC100K under four different edge models on the NetHEP and NetPHY datasets. The results are presented in Table 3 and Figure 3. Table 3 shows that the outward influences computed by SOIEA consistently have much smaller errors, both on average and in the worst case, than those of MC10K and MC100K in all edge models. In particular, on NetHEP with the p = 0.001 edge model, SOIEA has an average relative error close to 0%, while it is 19.2% and 4.6% for MC10K and MC100K, respectively; the maximum relative errors of MC10K and MC100K in this case are 100% and 26.4%, much higher than SOIEA's 0.1%. As expected, MC100K has a smaller error rate than MC10K since it uses 10 times more samples.

Figure 3 shows the error distributions of SOIEA, MC10K, and MC100K on NetHEP. In all considered edge models, SOIEA's error concentrates tightly around 0%, while the errors of MC10K and MC100K spread out over a very large spectrum. In particular, SOIEA has a huge spike at zero error, while both MC10K and MC100K exhibit two heavy tails on either side of their error distributions. Moreover, as p gets smaller, the tails get larger, since more and more empty influence simulations are generated by the traditional method.

From Table 3, the running times of MC10K and MC100K are close to that of SOIEA, while MCε,δ is up to 700 times slower than the others. Thus, to achieve the same approximation guarantee as SOIEA, the naive Monte-Carlo method would need 700 times more time than SOIEA.

Overall, SOIEA achieves significantly better solution quality and runs substantially faster than the Monte-Carlo method. With a larger number of samples, the Monte-Carlo method can improve its quality, but the running time suffers severely.
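The empty-simulation effect noted above can be seen directly with a naive Monte-Carlo estimator of outward influence. This sketch (our own toy code, not the paper's MC implementation) also reports the fraction of trivial cascades that activate nobody outside the seed set; that fraction grows as p shrinks, which is what inflates the relative error of the traditional method.

```python
import random

def naive_outward_mc(graph, seeds, n_sims, rng):
    """Naive Monte-Carlo estimate of outward influence: the average number
    of nodes activated outside the seed set, plus the fraction of 'empty'
    simulations in which no outside node is activated."""
    total, empty = 0, 0
    for _ in range(n_sims):
        active, frontier = set(seeds), list(seeds)
        while frontier:
            u = frontier.pop()
            for v, p in graph.get(u, []):
                if v not in active and rng.random() < p:
                    active.add(v)
                    frontier.append(v)
        out = len(active) - len(seeds)   # nodes activated outside S
        total += out
        empty += (out == 0)
    return total / n_sims, empty / n_sims

# A hub with 100 out-edges at p = 0.01: true outward influence is 1.0,
# yet a large share of simulations activate nobody at all.
star = {0: [(i, 0.01) for i in range(1, 101)]}
est, empty_frac = naive_outward_mc(star, seeds=[0], n_sims=2000,
                                   rng=random.Random(1))
```

Because so many samples contribute 0, the estimator's variance per unit of mean is high at small p; IICP avoids exactly these trivial cascades by construction.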
This experiment evaluates SIEA by comparing its performance with the most recent state-of-the-art, INFEST, and naive Monte-Carlo influence estimation. Here, we use the WC model to assign probabilities to the edges. We set the ε parameter of INFEST to 0.4, since we could not run this algorithm with smaller values of ε. Note that the guarantee of INFEST degrades multiplicatively in (1 + ε), which at ε = 0.4 is equivalent to a maximum relative error of 320%. For a fair comparison, we also run SIEA with ε = 0.4. We use the gold-standard 10,000 samples for the Monte-Carlo method (MC10K). We set a time limit of 6 hours for all algorithms.

Table 4 presents the solution quality of the algorithms when estimating seed sets of size 1, i.e., |S| = 1. It shows that SIEA consistently achieves substantially higher solution quality
Table 4: Comparing performance of algorithms in estimating influence spread in the WC model (seed set size |S| = 1)

            Avg. Rel. Error (%)     Max. Rel. Error (%)      Running time (sec)
Dataset     SIEA  MC10K  INFEST     SIEA  MC10K  INFEST      SIEA    SIEA (16 cores)  MC10K   MCε,δ   INFEST
NetHEP      0.2   1.2    17.7       1.5   6.6    82.7        0.1     0.1              0.0     0.8     3417.6
NetPHY      0.1   0.4    22.9       0.6   5.3    43.0        0.1     0.1              0.0     2.6     8517.7
Epinions    0.9   5.3    n/a        5.2   19.7   n/a         0.2     0.1              0.0     21.9    n/a
DBLP        0.3   1.2    n/a        1.9   8.7    n/a         2.8     1.3              0.1     770.4   n/a
Orkut       0.5   3.0    n/a        3.2   16.0   n/a         54.2    4.76             2.9     -       n/a
Twitter     1.0   37.1   n/a        3.1   -      n/a         106.2   7.9              1272.3  -       n/a
Friendster  0.1   3.1    n/a        0.6   23.6   n/a         1510.1  -                -       -       n/a

Table 5: Comparing performance of algorithms in estimating influence spread in the WC model (seed set size |S| = 5%·|V|)

            Avg. Rel. Error (%)     Max. Rel. Error (%)      Running time (sec)
Dataset     SIEA  MC10K  INFEST     SIEA  MC10K  INFEST      SIEA    SIEA (16 cores)  MC10K   MCε,δ   INFEST
NetHEP      0.1   0.0    11.1       0.4   0.2    14.1        0.1     0.1              2.1     191.7   600.5
NetPHY      0.1   0.0    24.4       0.2   0.1    26.3        0.1     0.1              5.3     1297.1  3326.4
Epinions    0.2   0.1    20.2       0.4   0.2    23.8        0.3     0.1              20.1    -       -
Orkut       0.1   0.0    n/a        0.7   0.1    n/a         51.6    4.6              5322.8  -       n/a
Twitter     0.2   n/a    n/a        0.5   n/a    n/a         1061.6  93.5             n/a     n/a     n/a
Friendster  0.1   n/a    n/a        0.2   n/a    n/a         2068.8  -                n/a     n/a     n/a

than both INFEST and MC10K. Note that INFEST can only run on NetHEP and NetPHY under the time limit. The average relative error of 
INFEST is 88 to 229 times higher than that of SIEA, while its maximum relative error reaches 82% of the ground truth. The large relative error of INFEST is explained by its loose guaranteed relative error (320%). Meanwhile, the average relative error of MC10K is up to 37 times higher than that of SIEA. The maximum relative error of MC10K is up to 240% higher than the ground truth on the Twitter dataset, which demonstrates the insufficiency of using 10,000 traditional influence samples to obtain the ground truth.

Differing from Table 4, Table 5 shows the results of estimating the influence of seed sets of size 5% of the total number of nodes. Under the 6-hour limit, INFEST can only run on NetHEP, NetPHY, and Epinions, while MC10K could not handle the large Twitter and Friendster graphs. INFEST still has a very high error compared to the other two, while SIEA and MC10K return solutions of similar quality. This is because 5% of the nodes is already an enormous seed set, which diminishes the advantage of avoiding trivial cascades in IICP.

For both seed set sizes, SIEA vastly outperforms MCε,δ and INFEST by several orders of magnitude. INFEST is tens of thousands of times slower than SIEA and can only run on the small networks, i.e., NetHEP, NetPHY, and Epinions. Compared with MCε,δ, the speedup factor of SIEA is likewise several orders of magnitude; correspondingly, MC10K cannot run on the two largest networks, Twitter and Friendster, in the case |S| = 5%·|V|.

We also test the parallel version of SIEA. With 16 cores, SIEA runs about 12 times faster than on a single core on large networks, achieving a parallel efficiency of around 75%.

Overall, SIEA consistently achieves much better solution quality and runs significantly faster than INFEST and the naive MC method. Surprisingly, under the time limit of 6 hours, INFEST can only handle small networks and has very high error. The MC method achieves better accuracy for large seed sets; however, its running time increases dramatically, resulting in failure to run on the large datasets.
Figure 4: Running time of SIEA on synthetic networks with average degrees d ∈ {10, 20, 30} and 1, 4, or 16 cores: (a) running time on a linear scale; (b) running time on a log scale.

Figure 5: Comparing SIEA, MC10K, and MCε,δ on Twitter: (a) relative error; (b) running time.

We test the scalability of the single-core and parallel versions of our method on synthetic networks generated by the well-known GTgraph with various network sizes. We also carry out the same tests on the real-world Twitter network in comparison with MC10K. We generate synthetic graphs using GTgraph [2], a standard graph generator widely used in large-scale experiments on graph algorithms [1, 4, 15]. We generate graphs with four different numbers of nodes, up to 100 million. For each size n, we generate 3 different graphs with average degree d ∈ {10, 20, 30}. We use the WC model to assign edge weights. We run 
SIEA with different numbers of cores, C ∈ {1, 4, 16}.

Table 6: Comparing performance of algorithms in estimating influence spread in the LT model (seed set size |S| = 1)

            Avg. Rel. Error (%)        Max. Rel. Error (%)         Running time (sec)
Dataset     SIEA_LT  MC10K  MC100K     SIEA_LT  MC10K  MC100K      SIEA_LT  SIEA_LT (16 cores)  MC10K  MC100K  MCε,δ
NetHEP      1.6      1.6    0.6        8.4      7.9    2.5         0.0      0.0                 0.0    0.1     1.0
NetPHY      1.2      0.5    0.3        12.7     4.4    1.4         0.0      0.0                 0.0    0.1     2.9
Epinions    1.5      4.3    2.2        7.0      17.4   7.4         0.7      0.4                 0.0    0.4     24.5
DBLP        0.4      1.0    0.5        5.7      11.4   2.2         2.4      0.4                 0.3    2.5     1530.4
Orkut       0.5      3.3    1.1        1.9      22.1   5.9         249.4    25.0                8.5    84.2    -
Twitter     2.4      36.1   20.7       7.1      97.5   85.6        6820.0   548.6               32.2   287.6   -
Friendster  0.2      3.1    1.4        2.4      16.5   9.0         6183.9   701.8               20.4   137.8   -
SIEA spent to estimate influence spreadof seed set of size 1. With the same number of nodes, we see thatthe running time of
SIEA does not significantly increase as theaverage degree increases. Figure 4b views Figure 4a in logarithmicscale to show the linear increase of running time with respect tothe increases of nodes. As expected,
SIEA speeds up proportionallyto number of cores used. As a result,
SIEA with 16 cores is able toestimate influence spread of a random node on a synthetic graphof 100 million nodes and 1.5 billion of edges in just 5 minutes.
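The quoted speedup and efficiency follow from a simple computation on the reported timings; the helper below is ours, and the example numbers are the single-core and 16-core SIEA times on Twitter from Table 4.

```python
def parallel_efficiency(t_single, t_parallel, cores):
    """Speedup is the single-core time over the parallel time;
    efficiency is the speedup normalized by the number of cores."""
    speedup = t_single / t_parallel
    return speedup, speedup / cores

# Twitter, |S| = 1: 106.2s on 1 core vs. 7.9s on 16 cores.
speedup, eff = parallel_efficiency(t_single=106.2, t_parallel=7.9, cores=16)
```

An efficiency near 1.0 would indicate perfectly linear scaling; values in the 0.75 to 0.85 range are consistent with the roughly 12x speedups reported above.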
Figure 5 evaluates the performance of SIEA in comparison with MC10K on various seed set sizes, |S| ∈ {1, 10, 100, 1k, 10k}, on the Twitter dataset. On all seed set sizes, SIEA consistently keeps its average and maximum relative errors below 10% (Figure 5a). The maximum relative error of MC10K goes up to 244% at seed set size |S| = 1. As observed in the experiments with large seed sets, SIEA and MC10K have similar error rates at the largest seed set sizes.

SIEA's running time increases at a much lower pace, e.g., by a few hundred seconds, while MCε,δ consumes proportionally more time (Figure 5b). Figure 5b also evaluates the parallel implementation of SIEA by varying the number of CPU cores, C ∈ {1, 2, 4, 8, 16}. The running time of SIEA drops by almost half every time the number of cores doubles, confirming an almost linear speedup.

Altogether, the parallel implementation of SIEA shows linear speedup behavior with respect to the number of cores used. On the same network, as the seed set size grows linearly, SIEA requires only slightly more time to estimate the influence spread, while Monte-Carlo shows a linearly growing runtime. Throughout the experiments, SIEA always guarantees a small error rate within ε.

We illustrate the generality of our algorithms across diffusion models by adapting 
SIEA for the LT model by replacing only IICP with the sampling algorithm for the LT model [17]. The resulting algorithm is named SIEA_LT. The setting is similar to the IC case. We present the results of SIEA_LT compared with MC10K, MC100K, and MCε,δ in Table 6. INFEST was proposed for the IC model only; thus, results for INFEST under the LT model are not available.

The results are mostly consistent with those observed under the IC model. SIEA_LT obtains significantly smaller errors and runs orders of magnitude faster than its counterparts. The results again confirm that the estimation quality of MC using 10K samples is not good enough to be considered a gold-standard quality benchmark.

In a seminal paper [17], Kempe et al. formulated and generalized two important influence diffusion models, Independent Cascade (IC) and Linear Threshold (LT). This work has motivated a large number of follow-up studies on information diffusion [3, 6, 8, 18, 23, 29] and applications in multiple disciplines [16, 19, 21]. Kempe et al. [17] proved the monotonicity and submodularity of influence as a function of sets of nodes. Later, Chen et al. [6] proved that computing influence under these diffusion models is #P-hard. The method in [23] repeatedly guesses an interval that the true influence falls in and verifies whether the guess is right with high probability. However, this approach is not scalable, due to a main drawback: the guessed intervals are very small, so the number of guesses, as well as of verifications, is huge. As a result, the method in [23] can only run on small datasets and still takes hours to estimate a single seed set. The authors also developed a distributed version on MapReduce; however, graph algorithms on MapReduce have various serious issues [14, 22].

Influence estimation oracles are developed in [8, 29]; they take advantage of sketches of the influence to preprocess the graph for fast queries. Cohen et al. [8] use the novel bottom-k min-hash sketch to build combined reachability sketches, while Ohsaka et al. [29] adopt reverse influence sketches. [29] also introduces a reachability-tree-based technique to deal with dynamic changes in the graphs. However, these methods require days of preprocessing in order to achieve fast responses to multiple queries.

There has also been increasing interest in many related problems. 
[5, 13] focus on designing data mining or machine learning algorithms to extract influence cascade model parameters from real datasets, e.g., action logs. Influence Maximization, which finds a seed set of a certain size with the maximum influence among all sets of the same size, has found many real-world applications and has attracted a lot of research work [3, 6, 17, 21, 25, 27, 28, 31].

CONCLUSION

This paper investigates a new measure, called Outward Influence, for nodes' influence in social networks. Outward influence inspires new statistical algorithms, namely Importance IC Polling (IICP) and Robust Mean Estimation (RSA), to estimate influence of nodes under various stochastic diffusion models. Under the popular IC model, IICP leads to an FPRAS for estimating outward influence and to SIEA for estimating influence spread. SIEA is Ω(log n) times faster than the most recent state-of-the-art and experimentally outperforms the other methods by several orders of magnitude. As previous approaches to computing ground-truth influence can result in high error and long computation time, our algorithms provide concrete and scalable tools to estimate ground-truth influence for research on network cascades and social influence.

REFERENCES
[1] V. Agarwal, F. Petrini, D. Pasetto, and D. A. Bader. 2010. Scalable graph exploration on multicore processors. In SC. IEEE, 1–11.
[2] D. A. Bader and K. Madduri. 2006. GTgraph: A synthetic graph generator suite. Atlanta, GA, February (2006).
[3] C. Borgs, M. Brautbar, J. Chayes, and B. Lucier. 2014. Maximizing social influence in nearly optimal time. In SODA. SIAM, 946–957.
[4] A. Campan. 2008. A clustering approach for data and structural anonymity in social networks. PinKDD (2008), 54.
[5] M. Cha, A. Mislove, and K. P. Gummadi. 2009. A measurement-driven analysis of information propagation in the Flickr social network. In WWW. ACM, 721–730.
[6] W. Chen, C. Wang, and Y. Wang. 2010. Scalable influence maximization for prevalent viral marketing in large-scale social networks. In KDD. ACM, 1029–1038.
[7] F. Chung and L. Lu. 2006. Concentration inequalities and martingale inequalities: a survey. Internet Mathematics (2006), 79–127.
[8] E. Cohen, D. Delling, T. Pajor, and R. F. Werneck. 2014. Sketch-based influence maximization and computation: Scaling up with guarantees. In CIKM. ACM, 629–638.
[9] P. Dagum, R. Karp, M. Luby, and S. Ross. 2000. An optimal algorithm for Monte Carlo estimation. SICOMP (2000), 1484–1496.
[10] D. J. Daley, J. Gani, and J. M. Gani. 2001. Epidemic Modelling: An Introduction. Vol. 15. Cambridge University Press.
[11] T. N. Dinh and M. T. Thai. 2015. Assessing attack vulnerability in networks with uncertainty. In INFOCOM. IEEE, 2380–2388.
[12] N. Du, L. Song, M. Gomez-Rodriguez, and H. Zha. 2013. Scalable influence estimation in continuous-time diffusion networks. In NIPS. 3147–3155.
[13] A. Goyal, F. Bonchi, and L. V. S. Lakshmanan. 2010. Learning influence probabilities in social networks. In WSDM. ACM, 241–250.
[14] P. Gupta, A. Goel, J. Lin, A. Sharma, D. Wang, and R. Zadeh. 2013. WTF: The who to follow service at Twitter. In WWW. ACM, 505–514.
[15] S. Hong, S. K. Kim, T. Oguntebi, and K. Olukotun. 2011. Accelerating CUDA graph algorithms at maximum warp. In SIGPLAN Notices. ACM, 267–276.
[16] Y. M. Ioannides and L. L. Datcher. 2004. Job information networks, neighborhood effects, and inequality. Journal of Economic Literature (2004), 1056–1093.
[17] D. Kempe, J. Kleinberg, and É. Tardos. 2003. Maximizing the spread of influence through a social network. In KDD. 137–146.
[18] D. Kempe, J. Kleinberg, and É. Tardos. 2005. Influential nodes in a diffusion model for social networks. In ICALP. 1127–1138.
[19] A. Krause, A. Singh, and C. Guestrin. 2008. Near-optimal sensor placements in Gaussian processes: Theory, efficient algorithms and empirical studies. JMLR (2008), 235–284.
[20] H. Kwak, C. Lee, H. Park, and S. Moon. 2010. What is Twitter, a social network or a news media?. In WWW. ACM, 591–600.
[21] J. Leskovec, A. Krause, C. Guestrin, C. Faloutsos, J. VanBriesen, and N. Glance. 2007. Cost-effective outbreak detection in networks. In KDD. ACM, 420–429.
[22] J. Lin and M. Schatz. 2010. Design patterns for efficient graph algorithms in MapReduce. In MLG. ACM, 78–85.
[23] B. Lucier, J. Oren, and Y. Singer. 2015. Influence at scale: Distributed computation of complex contagion in networks. In KDD. ACM, 735–744.
[24] M. Mitzenmacher and E. Upfal. 2005. Probability and Computing: Randomized Algorithms and Probabilistic Analysis. Cambridge University Press.
[25] D. T. Nguyen, H. Zhang, S. Das, M. T. Thai, and T. N. Dinh. 2013. Least cost influence in multiplex social networks: Model representation and analysis. In ICDM. 567–576.
[26] H. T. Nguyen, P. Ghosh, M. L. Mayo, and T. N. Dinh. 2016. Multiple infection sources identification with provable guarantees. In CIKM. ACM, 1663–1672.
[27] H. T. Nguyen, M. T. Thai, and T. N. Dinh. 2016. Cost-aware targeted viral marketing in billion-scale networks. In INFOCOM. IEEE, 1–9.
[28] H. T. Nguyen, M. T. Thai, and T. N. Dinh. 2016. Stop-and-Stare: Optimal sampling algorithms for viral marketing in billion-scale networks. In SIGMOD. ACM, 695–710.
[29] N. Ohsaka, T. Akiba, Y. Yoshida, and K. Kawarabayashi. 2016. Dynamic influence analysis in evolving networks. VLDB (2016), 1077–1088.
[30] V. M. Preciado, M. Zargham, C. Enyioha, A. Jadbabaie, and G. Pappas. 2013. Optimal vaccine allocation to control epidemic outbreaks in arbitrary networks. In CDC. IEEE, 7486–7491.
[31] Y. Tang, Y. Shi, and X. Xiao. 2015. Influence maximization in near-linear time: A martingale approach. In SIGMOD. ACM, 1539–1554.
[32] Y. Tang, X. Xiao, and Y. Shi. 2014. Influence maximization: Near-optimal time complexity meets practical efficiency. In SIGMOD. ACM, 75–86.
Proof of Lemma 1
Recall that on a sampled graph g ∼ G, for a set S ⊆ V, we denote r_g^(o)(S) to be the set of nodes, excluding the ones in S, that are reachable from S through live edges in g, i.e., r_g^(o)(S) = r_g(S) \ S. Alternatively, r_g^(o)(S) is called the outward influence cascade of S on sample graph g, and, consequently, we have
I_out(S) = Σ_{g∼G} |r_g^(o)(S)| · Pr[g ∼ G]. (31)
It is sufficient to show that |r_g^(o)(S)| is submodular, as I_out(S) is a linear combination of submodular functions. Consider a sample graph g ∼ G, two sets S, T such that S ⊆ T ⊆ V, and v ∈ V \ T. We have three possible cases:
• Case v ∈ r_g^(o)(S): then v ∈ r_g^(o)(T) since S ⊆ T and v ∉ T. Thus, we have the following:
|r_g^(o)(S ∪ {v})| − |r_g^(o)(S)| = |r_g^(o)(T ∪ {v})| − |r_g^(o)(T)| = −1. (32)
• Case v ∉ r_g^(o)(S) but v ∈ r_g^(o)(T): we have that
|r_g^(o)(S ∪ {v})| − |r_g^(o)(S)| = |r_g^(o)({v}) \ (r_g^(o)(S) ∪ S)| ≥ 0, (33)
while |r_g^(o)(T ∪ {v})| − |r_g^(o)(T)| = −1. Thus,
|r_g^(o)(S ∪ {v})| − |r_g^(o)(S)| > |r_g^(o)(T ∪ {v})| − |r_g^(o)(T)|. (34)
• Case v ∉ r_g^(o)(T): since for every u ∈ r_g^(o)(S) ∪ S, we have either u ∈ r_g^(o)(T) or u ∈ T, i.e., r_g^(o)(S) ∪ S ⊆ r_g^(o)(T) ∪ T, it follows that
|r_g^(o)(S ∪ {v})| − |r_g^(o)(S)| = |r_g^(o)({v}) \ (r_g^(o)(S) ∪ S)| ≥ |r_g^(o)({v}) \ (r_g^(o)(T) ∪ T)| = |r_g^(o)(T ∪ {v})| − |r_g^(o)(T)|. (35)
In all three cases, we have
|r_g^(o)(S ∪ {v})| − |r_g^(o)(S)| ≥ |r_g^(o)(T ∪ {v})| − |r_g^(o)(T)|. (36)
Applying Eq. 36 on all possible g ∼ G and taking the sum over all of these inequalities gives
Σ_{g∼G} (|r_g^(o)(S ∪ {v})| − |r_g^(o)(S)|) Pr[g ∼ G] ≥ Σ_{g∼G} (|r_g^(o)(T ∪ {v})| − |r_g^(o)(T)|) Pr[g ∼ G],
or,
I_out(S ∪ {v}) − I_out(S) ≥ I_out(T ∪ {v}) − I_out(T). (37)
That completes the proof.

Proof of Lemma 4
Let Ω⁺_W be the probability space of all possible cascades from S. For any cascade W(S) ⊇ S, the probability of that cascade in Ω⁺_W is given by
Pr[W(S) ∈ Ω⁺_W] = Σ_{g∈Ω_G, g⇝W(S)} Pr[g ∈ Ω_G],
where g ⇝ W(S) means that W(S) is the set of nodes reachable from S in g.
Let Ω_W be the probability space of non-trivial cascades. According to Stage 1 in IICP, the probability of the trivial cascade is Pr[S ∈ Ω_W] = 0. Compared to the mass of cascades in Ω⁺_W, the probability mass of the trivial cascade S in Ω_W is redistributed proportionally to the other cascades in Ω_W. Specifically, according to line 2 in IICP, the probability mass of each non-trivial cascade in Ω_W is multiplied by a factor 1/β. Thus,
Pr[W(S) ∈ Ω⁺_W] = Pr[W(S) ∈ Ω_W] · β, for all W(S) ≠ S.
It follows that
I_out(S) = Σ_{W(S)∈Ω⁺_W} |W(S) \ S| · Pr[W(S) ∈ Ω⁺_W] (38)
= Σ_{W(S)∈Ω_W} |W(S) \ S| · Pr[W(S) ∈ Ω_W] · β (39)
= E[|W(S) \ S|] · β = E[Y(S)] · β. (40)
We note that for W(S) = S, |W(S) \ S| = 0. Thus the difference in the probability masses between the two probability spaces does not affect the second step.
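The rescaling identity I_out(S) = E[Y(S)] · β above can be checked numerically. The following is a minimal Python sketch of an IICP-style sampler under the IC model; it is illustrative only (the graph representation, the name iicp_estimate, and the parameters are ours, not the paper's implementation). Stage 1 draws the index of the first live out-edge of S conditioned on the cascade being non-trivial, Stage 2 continues an ordinary IC simulation, and the sample mean of Y(S) is finally rescaled by β.

```python
import random

def iicp_estimate(edges, p, S, n_samples=200000, rng=None):
    """Importance-sampling estimate of outward influence I_out(S) under
    the IC model: sample only non-trivial cascades and rescale by
    beta = Pr[at least one out-edge of S is live]."""
    rng = rng or random.Random(0)
    out = {}
    for (u, v) in edges:
        out.setdefault(u, []).append(v)
    S = set(S)
    # Out-edges of the seed set (to nodes outside S), in a fixed order.
    seed_edges = [(u, v) for u in S for v in out.get(u, []) if v not in S]
    q = 1.0
    for e in seed_edges:
        q *= 1.0 - p[e]
    beta = 1.0 - q                      # Pr[non-trivial cascade]
    if beta == 0.0:
        return 0.0
    total = 0
    for _ in range(n_samples):
        # Stage 1: first live seed out-edge, conditioned on >= 1 live:
        # Pr[first = i] = p_i * prod_{j<i}(1 - p_j) / beta.
        r = rng.random() * beta
        acc, prefix, first = 0.0, 1.0, 0
        for i, e in enumerate(seed_edges):
            acc += p[e] * prefix
            prefix *= 1.0 - p[e]
            if r <= acc:
                first = i
                break
        live_targets = {seed_edges[first][1]}
        for e in seed_edges[first + 1:]:    # later seed edges: independent
            if rng.random() < p[e]:
                live_targets.add(e[1])
        # Stage 2: ordinary IC spread from the activated neighbors.
        active = set(S) | live_targets
        frontier = list(live_targets)
        while frontier:
            u = frontier.pop()
            for v in out.get(u, []):
                if v not in active and rng.random() < p[(u, v)]:
                    active.add(v)
                    frontier.append(v)
        total += len(active - S)            # Y(S): outward cascade size
    return beta * total / n_samples         # I_out(S) = E[Y(S)] * beta
```

On the toy path a → b → c with edge probabilities 0.5, the exact value is I_out({a}) = 0.5 + 0.25 = 0.75, and the estimate concentrates around it as n_samples grows.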
Proof of Theorem 4.1
We will equivalently prove two probabilistic inequalities:
Pr[μ̂_X < (1 − ϵ)µ_X] ≤ δ/2, (41)
and
Pr[μ̂_X > (1 + ϵ)µ_X] ≤ δ/2. (42)

Prove Eq. 41. We first realize that, at the termination point of Alg. 2, due to the stopping condition h = Σ_{j=1}^T X_j ≥ ϒ and X_j ≤ b for all j, the following inequalities hold:
ϒ ≤ Σ_{j=1}^T X_j ≤ ϒ + b. (43)
The left-hand side of Eq. 41 is rewritten as follows:
Pr[μ̂_X < (1 − ϵ)µ_X] = Pr[(Σ_{j=1}^T X_j)/T < (1 − ϵ)µ_X] (44)
= Pr[Σ_{j=1}^T X_j < (1 − ϵ)µ_X T] (45)
≤ Pr[ϒ < (1 − ϵ)µ_X T]. (46)
The last inequality is due to our realization in Eq. 43. Assume that ϵ < 1 and µ_X > 0, and denote L = ⌈ϒ/((1 − ϵ)µ_X)⌉. We then have
L ≥ ϒ/((1 − ϵ)µ_X) ⇒ ϒ/L ≤ (1 − ϵ)µ_X, (47)
and
L > ϒ/µ_X > (1 + ϵ) ln(2/δ)(b − a)/(ϵ′² µ_X). (48)
Thus, from Eq. 46, we obtain
Pr[μ̂_X < (1 − ϵ)µ_X] ≤ Pr[L ≤ T] = Pr[Σ_{j=1}^L X_j ≤ Σ_{j=1}^T X_j]
≤ Pr[Σ_{j=1}^L X_j ≤ ϒ + b] (49)
≤ Pr[(Σ_{j=1}^L X_j)/L ≤ (ϒ + b)/L], (50)
where the second inequality is due to Eq. 43. Note that (Σ_{j=1}^L X_j)/L is an estimate of µ_X using the first L random variables X_1, ..., X_L. Furthermore, from Eq. 47, ϒ/L ≤ (1 − ϵ)µ_X, so we have
(ϒ + b)/L ≤ (1 − ϵ)µ_X + b/L = (1 − ϵ + b/(L µ_X)) µ_X. (51)
Since L > (1 + ϵ) ln(2/δ)(b − a)/(ϵ′² µ_X) from Eq. 48,
(ϒ + b)/L ≤ (1 − ϵ + ϵ′² b/((1 + ϵ) ln(2/δ)(b − a))) µ_X = (1 − ϵ′)µ_X. (52)
Plugging these into Eq. 50, we obtain
Pr[μ̂_X < (1 − ϵ)µ_X] ≤ Pr[Σ_{j=1}^L X_j ≤ (1 − ϵ′)µ_X L]. (53)
Now, apply the Chernoff-like bound in Eq. 23 with T = L and note that L > (1 + ϵ) ln(2/δ)(b − a)/(ϵ′² µ_X) > ln(2/δ)(b − a)/(ϵ′² µ_X); we achieve
Pr[μ̂_X < (1 − ϵ)µ_X] ≤ exp(−ϵ′² L µ_X/(b − a)) (54)
≤ exp(−(ϵ′² µ_X/(b − a)) · ln(2/δ)(b − a)/(ϵ′² µ_X)) = δ/2. (55)
That completes the proof of Eq. 41.

Prove Eq. 42.
The left-hand side of Eq. 42 is rewritten as follows:
Pr[μ̂_X > (1 + ϵ)µ_X] = Pr[Σ_{j=1}^T X_j > (1 + ϵ)µ_X T] (56)
≤ Pr[ϒ + b > (1 + ϵ)µ_X T], (57)
where the last inequality is due to our observation that Σ_{j=1}^T X_j ≤ ϒ + b. Under the same assumptions that ϵ < 1 and 0 < µ_X ≤ b, we denote L = ⌊(ϒ + b)/((1 + ϵ)µ_X)⌋. We then have
L ≥ ϒ/((1 + ϵ)µ_X) = (1 + ϵ) ln(2/δ)(b − a)/(ϵ′² µ_X), (58)
and
L ≤ (ϒ + b)/((1 + ϵ)µ_X) ⇒ (ϒ + b)/L ≥ (1 + ϵ)µ_X (59)
⇒ ϒ/L ≥ (1 + ϵ)µ_X − b/L = (1 + ϵ − b/(L µ_X)) µ_X (60)
⇒ ϒ/L ≥ (1 + ϵ − ϵ′² b/((1 + ϵ) ln(2/δ)(b − a))) µ_X = (1 + ϵ′)µ_X. (61)
Thus, from Eq. 57, we obtain
Pr[μ̂_X > (1 + ϵ)µ_X] ≤ Pr[L ≥ T] = Pr[Σ_{j=1}^L X_j ≥ Σ_{j=1}^T X_j]
≤ Pr[Σ_{j=1}^L X_j ≥ ϒ] = Pr[(Σ_{j=1}^L X_j)/L ≥ ϒ/L] (62)
≤ Pr[(Σ_{j=1}^L X_j)/L ≥ (1 + ϵ′)µ_X], (63)
where the last inequality follows from Eq. 61. By applying another Chernoff-like bound from Eq. 22, combined with the lower bound on L in Eq. 58, we achieve
Pr[μ̂_X > (1 + ϵ)µ_X] ≤ exp(−ϵ′² L µ_X/((1 + ϵ)(b − a))) = δ/2, (64)
which completes the proof of Eq. 42.
Following the same procedure as in the proof of Eq. 42, we obtain the second statement in the theorem, that
Pr[T ≤ (1 + ϵ)ϒ/µ_X] > 1 − δ/2, (65)
which completes the proof of the whole theorem.

More elaboration on the flaw in [9]. The stopping rule algorithm in [9] is described in Alg. 6.

Algorithm 6: Stopping Rule Algorithm [9]
Input: Random variables X_1, X_2, ..., and 0 < ϵ, δ < 1
Output: An (ϵ, δ)-approximate μ̂_X of µ_X = E[X_i]
Compute ϒ = 1 + 4(1 + ϵ)(e − 2) ln(2/δ)/ϵ²;
Initialize h = 0, T = 0;
while h < ϒ do T ← T + 1, h ← h + X_T;
return μ̂_X = ϒ/T;

The algorithm first computes ϒ and then generates samples X_j until the sum of their outcomes exceeds ϒ. Afterwards, it returns ϒ/T as the estimate. Apparently, ϒ/T is a biased estimate of µ_X since Σ_{j=1}^T X_j ≥ ϒ. An important realization for this algorithm, from our proof of Theorem 4.1, is that ϒ ≤ Σ_{j=1}^T X_j ≤ ϒ + b, with b = 1 for [0, 1]-valued random variables. In Section 5 of [9], following the proof of Pr[μ̂_X > (1 + ϵ)µ_X] ≤ δ/2 and, similarly, Pr[μ̂_X < (1 − ϵ)µ_X] ≤ δ/
2, there is a step that derives as follows:
Pr[L ≤ T] = Pr[Σ_{j=1}^L X_j ≤ Σ_{j=1}^T X_j] = Pr[Σ_{j=1}^L X_j ≤ ϒ],
where L is a predefined number, i.e., L = ⌊ϒ/((1 − ϵ)µ_X)⌋. However, since ϒ ≤ Σ_{j=1}^T X_j ≤ ϒ + b, the last equality does not hold. This is based on Eq. 49, with the correct expression being Pr[Σ_{j=1}^L X_j ≤ ϒ + b] instead of Pr[Σ_{j=1}^L X_j ≤ ϒ].

Proof of Theorem 4.2

[Proof of Part (1)] If ϵ ≥ 1/4, then RSA only runs GSRA, and hence, from Theorem 4.1, the returned solution satisfies the precision requirement. Otherwise, since the first step is literally applying GSRA with √ϵ < 1/2 and δ/3, we have
Pr[µ_X(1 − √ϵ) ≤ μ̂′_X ≤ µ_X(1 + √ϵ)] ≥ 1 − δ/3.
We next show that, with high probability, ρ̂_X ≥ ρ_X/2. Define the random variables ξ_i = (X′_{2i−1} − X′_{2i})²/2, i = 1, 2, ..., and thus E[ξ_i] = Var[X]. Consider the following two cases:
(1) If Var[X] ≥ ϵµ_X(b − a), consider two sub-cases:
(a) If Var[X] ≥ 2(1 − √ϵ)ϵµ_X(b − a), then, since N_σ = ϒϵ/μ̂′_X ≥ (1 − √ϵ)(1 + ln(3)/ln(2/δ))ϒϵ/µ_X, applying the Chernoff-like bound in Eq. 21 gives
Pr[Var[X]/2 ≤ ∆/N_σ] ≥ 1 − δ/3,
and hence ρ̂_X ≥ Var[X]/2 = ρ_X/2 with probability at least 1 − δ/3.
(b) If Var[X] ≤ 2(1 − √ϵ)ϵµ_X(b − a), then ϵµ_X(b − a) ≥ Var[X]/(2(1 − √ϵ)), and therefore
ρ̂_X ≥ ϵμ̂′_X(b − a) ≥ (1 − √ϵ)ϵµ_X(b − a) ≥ Var[X]/2 = ρ_X/2.
(2) If Var[X] ≤ ϵµ_X(b − a), it follows that ρ̂_X ≥ ϵμ̂′_X(b − a) ≥ ρ_X(1 − min{√ϵ, 1/2}) ≥ ρ_X/2 with probability at least 1 − δ/3.
Hence, in all cases, ρ̂_X ≥ ρ_X/2 with probability at least 1 − δ/3. In step 3, since T = ϒρ̂_X/(μ̂′²_X(b − a)) ≥ (1 + ln(3)/ln(2/δ))ϒρ_X/(2µ²_X(b − a)), applying the Chernoff-like bound in Eq. 24 again gives
Pr[µ_X(1 − ϵ) ≤ μ̂_X ≤ µ_X(1 + ϵ)] ≥ 1 − δ/3. (68)
Accumulating the probabilities, we finally obtain
Pr[µ_X(1 − ϵ) ≤ μ̂_X ≤ µ_X(1 + ϵ)] ≥ 1 − δ, (69)
which completes the proof of Part (1).
[Proof of Part (2)] The RSA algorithm may fail to terminate after using O(ϒρ_X/(µ²_X(b − a))) samples only if either:
(1) the GSRA algorithm fails to return a (√ϵ, δ/3)-approximate μ̂′_X, which happens with probability at most δ/2; or,
(2) in step 2, for Var[X] ≤ 2(1 − √ϵ)ϵµ_X(b − a), ρ̂_X is not O(ϵµ_X(b − a)), which happens with probability at most δ/2.
Since T = (1 + ϵ)ϒ/µ_X = O(ϒρ_X/(µ²_X(b − a))), the first case happens with probability at most δ/2. In addition, we can show, similarly to Theorem 4.1, that if Var[X] ≤ ϵµ_X(b − a), then
Pr[∆/T ≥ ϵµ_X(b − a)] ≤ exp(−Tϵµ_X(b − a)/2). (70)
Thus, for T ≥ ϒϵ/µ_X, we have Pr[∆/T ≥ ϵµ_X] ≤ δ/2.

Proof of Lemma 6
We start with the computation of Var[Z(S)], with a note that E[Z(S)] = I(S):
Var[Z(S)] = Σ_{z=β+|S|}^{(n−|S|)β+|S|} (z − E[Z(S)])² Pr[Z(S) = z]
= Σ_{y=1}^{n−|S|} (yβ + |S| − I(S))² Pr[Y(S) = y]
= Σ_{y=1}^{n−|S|} (yβ − I_out(S)β + I_out(S)β + |S| − I(S))² Pr[Y(S) = y]
= β² Σ_{y=1}^{n−|S|} (y − I_out(S))² Pr[Y(S) = y] + Σ_{y=1}^{n−|S|} (I_out(S)β + |S| − I(S))² Pr[Y(S) = y] + 2β Σ_{y=1}^{n−|S|} (y − I_out(S))(I_out(S)β + |S| − I(S)) Pr[Y(S) = y].
Since Y(S) ≥ 1 and Pr[Y(S) = y] = Pr[M(S) = y + |S|]/β, we have
Σ_{y=1}^{n−|S|} (y − I_out(S))² Pr[Y(S) = y] = (1/β) Σ_{m=1+|S|}^{n} (m − E[M(S)])² Pr[M(S) = m]
= (1/β) (Σ_{m=|S|}^{n} (m − E[M(S)])² Pr[M(S) = m] − I_out(S)²(1 − β))
= (1/β)(Var[M(S)] − I_out(S)²(1 − β)), (71)
and
2β Σ_{y=1}^{n−|S|} (y − I_out(S))(I_out(S)β + |S| − I(S)) Pr[Y(S) = y]
= 2β(I_out(S)β + |S| − I(S)) Σ_{y=1}^{n−|S|} (y − I_out(S)) Pr[Y(S) = y]
= 2β(I_out(S)β + |S| − I(S)) I_out(S)(1/β − 1). (72)
Plugging these back into Var[Z(S)], we obtain
Var[Z(S)] = β²·(1/β)(Var[M(S)] − I_out(S)²(1 − β)) + (I_out(S)β + |S| − I(S))² + 2β(I_out(S)β + |S| − I(S)) I_out(S)(1/β − 1).
Noting that I_out(S)β + |S| − I(S) = −(1 − β)I_out(S), the last two terms equal (1 − β)²I_out(S)² − 2(1 − β)²I_out(S)², and hence
Var[Z(S)] = β·Var[M(S)] − β(1 − β)I_out(S)² − (1 − β)²I_out(S)² = β·Var[M(S)] − (1 − β)I_out(S)².
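As a sanity check, the closed form Var[Z(S)] = β·Var[M(S)] − (1 − β)·I_out(S)² can be verified by exhaustive enumeration on a small instance. The Python snippet below is illustrative (the helper names cascade and exact_variances are ours, not the paper's code): it enumerates every live-edge realization g ∼ G of a toy IC graph, builds the exact distribution of M(S), derives from it β, I_out(S), and the distribution of Z(S) = Y(S)·β + |S|, and compares both sides of the identity.

```python
from itertools import product

def cascade(live_edges, S):
    """Nodes reachable from the seed set S via live edges (including S)."""
    out = {}
    for (u, v) in live_edges:
        out.setdefault(u, []).append(v)
    active, frontier = set(S), list(S)
    while frontier:
        u = frontier.pop()
        for v in out.get(u, []):
            if v not in active:
                active.add(v)
                frontier.append(v)
    return active

def exact_variances(edges, p, S):
    """Enumerate all live-edge graphs and return (beta, I_out, Var[M], Var[Z])
    for the variables of Lemma 6: M(S) = cascade size, Y(S) = non-trivial
    outward size, Z(S) = Y(S)*beta + |S|."""
    S = set(S)
    dist_m = {}                              # exact distribution of M(S)
    for mask in product([0, 1], repeat=len(edges)):
        pr, live = 1.0, []
        for bit, e in zip(mask, edges):
            pr *= p[e] if bit else 1.0 - p[e]
            if bit:
                live.append(e)
        m = len(cascade(live, S))
        dist_m[m] = dist_m.get(m, 0.0) + pr
    beta = 1.0 - dist_m.get(len(S), 0.0)     # Pr[non-trivial cascade]
    e_m = sum(m * q for m, q in dist_m.items())          # E[M(S)] = I(S)
    var_m = sum((m - e_m) ** 2 * q for m, q in dist_m.items())
    i_out = e_m - len(S)                     # I_out(S) = I(S) - |S|
    # Var[Z(S)]: Pr[Y=y] = Pr[M=y+|S|]/beta, Z = y*beta + |S|, E[Z] = I(S).
    var_z = sum(((m - len(S)) * beta + len(S) - e_m) ** 2 * q / beta
                for m, q in dist_m.items() if m > len(S))
    return beta, i_out, var_m, var_z
```

For instance, on the star a → b (p = 0.5), a → c (p = 0.3) with S = {a}, one gets β = 0.65, I_out = 0.8, and both sides of the identity agree exactly.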