Sketch-based Influence Maximization and Computation: Scaling up with Guarantees
Edith Cohen, Daniel Delling, Thomas Pajor, Renato F. Werneck
Affiliation (all authors): Microsoft. 2014
Abstract
Propagation of contagion through networks is a fundamental process. It is used to model the spread of information, influence, or a viral infection. Diffusion patterns can be specified by a probabilistic model, such as Independent Cascade (IC), or captured by a set of representative traces.

Basic computational problems in the study of diffusion are influence queries (determining the potency of a specified seed set of nodes) and Influence Maximization (identifying the most influential seed set of a given size). Answering each influence query involves many edge traversals, and does not scale when there are many queries on very large graphs. The gold standard for Influence Maximization is the greedy algorithm, which iteratively adds to the seed set a node maximizing the marginal gain in influence. Greedy has a guaranteed approximation ratio of at least $(1 - 1/e)$ and actually produces a sequence of nodes, with each prefix having an approximation guarantee with respect to the same-size optimum. Since Greedy does not scale well beyond a few million edges, for larger inputs one must currently use either heuristics or alternative algorithms designed for a pre-specified small seed set size.

We develop a novel sketch-based design for influence computation. Our greedy Sketch-based Influence Maximization (SKIM) algorithm scales to graphs with billions of edges, with one to two orders of magnitude speedup over the best greedy methods. It still has a guaranteed approximation ratio, and in practice its quality nearly matches that of exact greedy. We also present influence oracles, which use linear-time preprocessing to generate a small sketch for each node, allowing the influence of any seed set to be quickly answered from the sketches of its nodes.

The spread of contagion (information diffusion or spread of an infection) is a universal phenomenon that is extensively studied in the context of physical, biological, and social networks. Such cascades can have one or multiple sources (or seeds) and spread from infected nodes to neighbors through the link structure. A motivating application for the study of influence is viral marketing strategies [14, 23], in which the influence of a set $S$ of people in a social network is the number of adoptions triggered if we give $S$ free copies of a product.
The problem also has important applications beyond social graphs, such as placing sensors in water distribution networks for detecting contamination [20].

A popular model for information diffusion is Independent Cascade (IC), in which an independent random variable is associated with each (directed) edge $(u, v)$ to model the degree of influence of $u$ on $v$. A single propagation instance is obtained by instantiating all edge variables. We then study the distribution of a property of interest, such as the number of infected nodes, over these random instances.

The simplest and most studied IC model is binary IC, in which the range of the edge random variables is binary. A biased coin of probability $p_{uv}$ is flipped for each directed edge $(u, v)$. Accordingly, the edge can be either live, meaning that once $u$ is infected, $v$ is also infected, or null. This model was formalized in a seminal work by Kempe et al. [19] and is based on earlier studies by Goldenberg et al. [14]. Note that each direction of an undirected edge $\{u, v\}$ may have its own independent random variable, since influence is not necessarily symmetric. A particular propagation instance is specified by the set of live edges, and a node is infected by a seed set $S$ in this instance if and only if it is reachable from a seed node. The influence of $S$ is formally defined as the expectation, over instances, of the number of infected nodes.

Instead of working directly on this probabilistic IC model, Kempe et al. [19] proposed a simulation-based approach, in which a set $\{G^{(i)}\}$ of propagation instances (graphs) is generated in Monte Carlo fashion according to the influence model. The average influence of $S$ on $\{G^{(i)}\}$ is an unbiased estimate that converges to the expectation on the probabilistic model.
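As an illustration of the simulation-based view, the following minimal Python sketch samples binary IC instances by flipping one coin per directed edge and averages the number of nodes reachable from a seed set. The function names, the dictionary-based edge representation, and the toy model are assumptions made for this example; they do not come from the paper.

```python
import random
from collections import deque

def sample_instance(edges, rng):
    """Flip one biased coin per directed edge and keep the live edges.

    `edges` maps (u, v) -> probability p_uv (hypothetical input format).
    Returns adjacency lists of the sampled propagation instance.
    """
    live = {}
    for (u, v), p in edges.items():
        if rng.random() < p:
            live.setdefault(u, []).append(v)
    return live

def reachable(live, seeds):
    """Nodes reachable from the seed set over live edges (seeds included)."""
    seen = set(seeds)
    queue = deque(seeds)
    while queue:
        u = queue.popleft()
        for v in live.get(u, ()):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen

def average_influence(instances, seeds):
    """Average, over the instances, of the number of infected nodes."""
    return sum(len(reachable(g, seeds)) for g in instances) / len(instances)

# Toy model: a -> b always live, b -> c live half of the time, c -> d never live.
edges = {("a", "b"): 1.0, ("b", "c"): 0.5, ("c", "d"): 0.0}
rng = random.Random(0)
instances = [sample_instance(edges, rng) for _ in range(2000)]
estimate = average_influence(instances, {"a"})
print(estimate)  # should be close to the expectation 1 + 1 + 0.5 = 2.5
```

Each call to `average_influence` runs one graph search per instance, which is the per-query cost that motivates the sketches developed later.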
The ability to compute influence with respect to an arbitrary set of propagation instances has significant advantages, as it is useful for instances generated from traces or by more complex models [16, 1], which exhibit correlations between edges that cannot be captured by the simplified IC model [15]. Moreover, the average behavior of a probabilistic model on a small set of instances captures its "typical" behavior, which is often more relevant than the expected value when the variance is very high.

A basic primitive in the study of influence is the influence query: compute (or approximate) the influence of a query set $S$ of seed nodes. With binary influence, this amounts to performing graph searches from the seed set in multiple instances. Unfortunately, this does not scale well when many queries are posed over graphs with millions of nodes.

Even more computationally challenging is the fundamental Influence Maximization problem, which is finding the most potent seed set of a certain size or cost. The problem was formalized by Kempe et al. [19] and inspired by Richardson and Domingos [23]. Kempe et al. showed that, even when the influence function is deterministic (but the number $s$ of seeds is a parameter), the problem encodes the classic Max Cover problem and therefore is NP-hard [19]. Moreover, an inapproximability result of Feige [13] implies that any algorithm that can guarantee a solution that is at least $(1 - 1/e + \epsilon)$ times the optimum is likely to scale poorly with the number of seeds. Chen et al. [5] showed that computing the exact influence of a single seed in the binary IC model, even when edge probabilities are $p = 0.5$, is #P-hard.

In the simulation-based setting, the goal is to find a set $S$ of seeds with maximum average influence over a fixed set of propagation instances. A natural heuristic is to use the set of most influential individuals, say those with high degree or centrality [19], as seeds. This approach, however, cannot account for the dependence between seeds, missing the fact that two important nodes may "cover" essentially the same communities. Kempe et al. [19] proposed a greedy algorithm (Greedy) instead. It starts with an empty seed set and iteratively adds to $S$ the node with maximum marginal gain in influence (relative to the current seed set). Since our objective is monotone and submodular, a classical result from Nemhauser et al. [21] implies that the influence of the greedy solution with $s$ seeds is at least $1 - (1 - 1/s)^s \ge 1 - 1/e \approx 63\%$ of the best possible for any seed set of the same size. From Feige's inapproximability result, this is the best approximation ratio guarantee we can (asymptotically and realistically) hope for.
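The greedy selection described above can be sketched in a few lines of Python. This is a minimal, unoptimized illustration on a fixed set of instances, not the implementation evaluated in the paper; the names `greedy`, `influence`, and `greedy_ratio` and the adjacency-dictionary input format are assumptions for this example.

```python
import math
from collections import deque

def reachable(adj, seeds):
    """Nodes reachable from `seeds` in one instance (seeds included)."""
    seen = set(seeds)
    queue = deque(seeds)
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen

def influence(instances, seeds):
    """Average number of reachable nodes over all instances."""
    return sum(len(reachable(g, seeds)) for g in instances) / len(instances)

def greedy(instances, nodes, s):
    """Add, s times, the node with maximum marginal gain in influence."""
    seeds, gains, current = [], [], 0.0
    for _ in range(s):
        best, best_val = None, -1.0
        for v in nodes:
            if v in seeds:
                continue
            val = influence(instances, seeds + [v])
            if val > best_val:  # ties are broken by node order
                best, best_val = v, val
        seeds.append(best)
        gains.append(best_val - current)
        current = best_val
    return seeds, gains

def greedy_ratio(s):
    """Worst-case guarantee 1 - (1 - 1/s)^s for a greedy seed set of size s."""
    return 1.0 - (1.0 - 1.0 / s) ** s

# One deterministic instance: nodes 0 and 3 cover the same community {1, 2}.
instance = {0: [1, 2], 3: [1, 2], 4: [5]}
seeds, gains = greedy([instance], list(range(6)), 2)
print(seeds, gains)
print(greedy_ratio(50), 1 - 1 / math.e)
```

In the toy instance, nodes 0 and 3 have the same individual influence, but after 0 is chosen the marginal gain of 3 drops to one node, so the second seed is node 4. This is the dependence between seeds that degree-based heuristics miss.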
Greedy has become the gold standard for influence maximization, in terms of the quality of the results. Greedy, however, does not scale to modern real-world social networks. The issue is that evaluating the marginal contribution of each node requires a directed reachability computation in each instance (of which there can be hundreds). Several performance improvements to Greedy have thus been proposed. Leskovec et al. [20] proposed CELF, which uses "lazy" evaluations of the marginal contribution, performed only when a node is a candidate for the highest marginal contribution. Chen et al. [6] took a different approach, using the reachability sketches of Cohen [7] to speed up the reevaluation of the marginal contribution of all nodes. While these and other accelerations [17, 22] are effective, the best current implementations of Greedy do not scale to networks beyond $10^6$ edges [5], which are quite small by modern standards.

To support massive graphs, several studies proposed algorithms specific to the IC model, which work directly with the edge probabilities instead of with simulations and thus cannot be reliably applied to a set of arbitrary instances. Borgs et al. [3] recently proposed an algorithm based on reverse reachability searches from sampled nodes, similar in spirit to the approach used for reachability sketching [7]. Their algorithm provides theoretical guarantees on the approximation quality and has good asymptotic performance, but large "constants." Very recently, Tang et al. [25] developed TIM, which engineers the (mostly theoretical) algorithm of Borgs et al. [3] to obtain a scalable implementation with guarantees. A significant drawback of this approach is that it only works for a pre-specified seed set size $s$, whereas Greedy produces a sequence of nodes, with each prefix having an approximation guarantee with respect to the same-size optimum. In applications we are often interested not in a single point, but in a trade-off curve that allows us to find a sweet spot of influence per cost or characterize the network. TIM also scales very poorly with the seed set size $s$, and the evaluation only considered seed sets of up to 50 nodes.

The DegreeDiscount [6] heuristic refines the natural approach of adding the next highest degree node. MIA [5] converts the binary IC sampling probabilities $p_e$ to deterministic edge weights and works essentially with one deterministic instance. IRIE, by Jung et al. [18], is a heuristic approximation of greedy addition of seed nodes, and has the best performance we are aware of for an algorithm that produces a sequence of seed nodes. In each step, the probability of each node to be covered by the current seed set $S$ is estimated using another algorithm (or simulations). They then use eigenvector computations to approximate marginal contributions of all nodes. Of those approaches, the IRIE heuristic scales much better and is much more accurate than other heuristics. In particular, it performs nearly as well as Greedy on many research collaboration graphs [18].
Contributions.
We design a novel sketch-based approach for influence computation which offers scalability with performance guarantees. Our main contribution is SKIM (SKetch-based Influence Maximization), a highly scalable (approximate) implementation of the greedy algorithm for influence maximization. We also introduce influence oracles: after preprocessing that is almost linear, we can answer influence queries very efficiently, considering only the sketches of the query seed set.

We can apply our design on inputs specified as a fixed set of propagation instances, as in Kempe et al. [19], with influence defined as the average over them. We also handle inputs specified as an IC model, where influence is defined as the expectation. Our model is defined precisely in Section 2.

We now provide more details on our design. The exact computation of an influence query requires expensive graph searches from the query seed set $S$ on each of $\ell$ instances. The exact greedy algorithm for Influence Maximization requires a similar computation for each marginal contribution. We address this scalability issue by working with sketches.

At the core of our approach are per-node summary structures which we call combined reachability sketches. The sketch of a node compactly represents its influence "coverage" across $\ell$ instances; we call this coverage its combined reachability set. The combined reachability sketch of a node, precisely defined in Section 3, is the bottom-$k$ min-hash sketch [10, 8] of the combined reachability set of the node. This generalizes the reachability sketches of Cohen [7], which are defined for a single instance. The parameter $k$ is a small constant that determines the tradeoff between computation and accuracy. Bottom-$k$ sketches of sets support cardinality estimation, which means that we can estimate the influence (over all instances) of a node or of a set of nodes from their combined reachability sketches. The estimate has a small relative error and good concentration [7].
Our use of combination sketches and state-of-the-art optimal estimators is key to obtaining the best balance between sketch size and accuracy.

Our SKIM algorithm for influence maximization is presented in Section 4. It scales by running the greedy algorithm in "sketch space," always taking a node with the maximum estimated (rather than exact) marginal contribution.

SKIM computes combined reachability sketches, but only until the node with the maximum estimated influence is found. This node is then added to the seed set. We then update the sketches to be with respect to a residual problem in which the node that is selected into the seed set and its "influence" are no longer present. SKIM then resumes the sketch computation, starting with the residual sketches, but (again) stopping when a node with maximum estimated influence (in the current, residual, instance) is found. A new residual problem is then computed. This process is iterated until the seed set reaches the desired size. Since the residual problem becomes smaller with iterations, we can compute a very large seed set very efficiently. We also prove that the total overhead of the updates required to maintain the residual sketches is small. In particular, for a set $\{G^{(i)}\}$ of $\ell$ arbitrary instances, the algorithm can be run to exhaustion, producing a full permutation of the nodes in $O(\sum_{i \in [\ell]} |G^{(i)}| + m \epsilon^{-2} \log^2 n)$ time, where $m$ is the sum over nodes of the maximum indegree (over instances). For all $s \ge 1$, the first $s$ nodes we select have, with very high probability (at least $1 - 1/n^c$ for a constant $c$), influence that is at least $1 - (1 - 1/s)^s - \epsilon$ times the maximum influence of a seed set of the same size $s$. These are worst-case bounds. We propose an adaptive approach that exploits properties of actual networks, in particular a skewed influence distribution, to achieve faster running times with the same guarantees.

The use of residual instances by SKIM is the key to maintaining the accuracy of the greedy selection throughout the execution and to providing, with high probability, approximation ratio guarantees that nearly match those of exact Greedy.

Section 5 presents our influence oracles, which preprocess the input to compute combined reachability sketches for all nodes. For instances $\{G^{(i)}\}$ with $n$ nodes and $m^{(i)}$ edges, the sketches are built in $O(k \sum_i m^{(i)})$ total time. The influence of a set $S \subseteq V$ can then be approximated from the sketches of the nodes in $S$. The oracle applies the union cardinality estimator of Cohen and Kaplan [11] to estimate the union of the influence sets of the seed nodes. The query runs in time $O(|S| k \log |S|)$ and returns an unbiased estimate with a well-concentrated relative error of $\epsilon = 1/\sqrt{k}$. While preprocessing depends on the number of instances, the sketch size and the approximation quality only depend on the sketch parameter $k$.

The asymptotic bounds we obtain are novel also from a theoretical perspective, and significantly improve the state of the art, even for influence maximization on a single (deterministic) instance (select a seed set in a directed graph with maximum reachable set).

Section 6 presents an extensive experimental study. Besides demonstrating the scalability of our algorithms on real-world networks, we compare SKIM with existing approaches, including exact Greedy (when size allows), the state-of-the-art IRIE heuristic, and TIM.
We obtain IC models from networks by using the well-studied weighted and uniform [19] probabilities. Our algorithms scale up to very large graphs with barely any compromise on quality over exact Greedy, with theoretical guarantees. On instances generated by an IC model, we achieve more than an order of magnitude speedup over the best greedy heuristics, which are designed specifically for this model. Even for a fixed small seed set size, SKIM is significantly faster than TIM.

Moreover, our algorithm is efficient and accurate enough to be executed exhaustively, producing a full permutation of the nodes for networks with billions of edges. For the first time, we provide the full (approximate) Pareto front of influence versus seed set size. These relations showcase a basic property of the network, and the general pattern that a small fraction of nodes influences a large fraction of the network. In contrast, most previous studies we are aware of only considered seed sets with at most 50 nodes, revealing only a very restricted view of this relation.

A propagation instance $G = (V, E)$ is specified by the edge set $E$. The influence of a set of nodes $S$ in instance $G$ is the number of nodes reachable from $S$ using the edges $E$:

$$\mathrm{Inf}(G, S) = |\{ u \mid S \rightsquigarrow u \}|, \qquad (1)$$

where the predicate $S \rightsquigarrow u$ holds if $u \in S$ or if there is a forward path from a node in $S$ to the node $u$. Our input is specified as a set $\mathcal{G} = \{G^{(i)}\}$ of $\ell \ge 1$ propagation instances $G^{(i)} = (V, E^{(i)})$ on the same set of nodes. The influence of $S$ over all instances $\{G^{(i)}\}$ is the average single-instance influence:

$$\mathrm{Inf}(\mathcal{G}, S) = \mathrm{Inf}(\{G^{(i)}\}, S) = \frac{1}{\ell} \sum_{i \in [\ell]} \mathrm{Inf}(G^{(i)}, S). \qquad (2)$$

The set of propagation instances can be derived from cascade traces or generated by a probabilistic model.

The input can also be specified as a probabilistic model, such as Independent Cascade (IC) [19], which defines a distribution $\mathcal{G}$ over instances $G \sim \mathcal{G}$ that share a set $V$ of nodes. In this case, the influence of $S$ over $\mathcal{G}$ is defined as the expectation

$$\mathrm{Inf}(\mathcal{G}, S) = \mathbf{E}_{G \sim \mathcal{G}}\, \mathrm{Inf}(G, S). \qquad (3)$$

We are interested in influence oracles and in influence maximization.
Influence queries are specified by a seed set $S \subset V$ and the goal is to compute (or estimate) the influence $\mathrm{Inf}(\mathcal{G}, S)$. Influence oracles, after efficient preprocessing of the input, allow us to support very fast queries. Influence maximization is the problem of finding a seed set $S \subset V$ with maximum influence, where $|S| = s$ is given. We are interested in efficiently computing a seed set whose influence is close to the maximum one, as well as in computing a sequence of seeds so that each prefix has influence that is close to maximum for its size.

At the heart of our approach are combined reachability sketches, which are summary structures $X_u$ that we associate with each node $u$. The combined sketches can be defined with respect either to a set $\mathcal{G} = \{G^{(i)}\}$ of $\ell \ge 1$ propagation instances or to a probabilistic model $\mathcal{G}$.

We first consider as input a set of $\ell \ge 1$ propagation instances. We define the reachability set of a node $u$ in instance $G$ as $R(G, u) = \{ v \mid u \rightsquigarrow_G v \}$, where $u \rightsquigarrow_G v$ means that $v$ is reachable from $u$ in $G$. Considering all instances, the combined reachability set is a set of node-instance pairs: $R_u = \{ (v, i) \mid u \rightsquigarrow_{G^{(i)}} v \}$. The influence of a set of nodes $S$ on instances $\{G^{(i)}\}$ can thus be expressed as

$$\mathrm{Inf}(\{G^{(i)}\}, S) = \frac{1}{\ell} \sum_{i \in [\ell]} \Big| \bigcup_{u \in S} R(G^{(i)}, u) \Big| = \frac{1}{\ell} \Big| \bigcup_{u \in S} R_u \Big|. \qquad (4)$$

This is the average over the instances $\{G^{(i)}\}$ (with $i \in [\ell]$) of the number of nodes reachable from at least one node in $S$.

The combined reachability sketch of a node captures its reachability information across instances. The sketches we use are the bottom-$k$ min-hash sketches [7, 10] $X_v$ of the combined reachability sets $R_v$: We associate with each node-instance pair $(v, i)$ an independent random rank value $r_v^{(i)} \sim U[0, 1]$, where $U[0, 1]$ is the uniform distribution on $[0, 1]$. The combined reachability sketch of $u$ is the set of the $k$ smallest rank values amongst $\{ r_v^{(i)} \mid (v, i) \in R_u \}$:

$$X_u = \text{Bottom-}k \{ r_v^{(i)} \mid (v, i) \in R_u \}, \qquad (5)$$

where Bottom-$k$ of a set is its subset consisting of the $k$ smallest values. When there is a single instance ($\ell = 1$) the combined reachability sketches are the same as the reachability sketches of Cohen [7].

We define the threshold rank $\tau_u$ of each node $u$ as

$$\tau_u = k\text{th}\big( \{ r_v^{(i)} \mid (v, i) \in R_u \} \big), \qquad (6)$$

which is the $k$th lowest rank value in $R_u$. (For a set $Y$ of cardinality $|Y| < k$, we define $k\text{th}(Y) \equiv 1$.) When $|X_u| = k$ we have $\tau_u = \max\{X_u\}$, and $\tau_u = 1$ otherwise. The cardinality $|R_u|$ can be estimated from $X_u$ using a bottom-$k$ cardinality estimator. The estimate is $|X_u|$ if $\tau_u = 1$ (i.e., if $|X_u| < k$) and is $(k - 1)/\tau_u$ otherwise. This estimate has a Coefficient of Variation (CV), which is the ratio of the standard deviation to the mean, that is never more than $1/\sqrt{k - 2}$. For a constant $c > 1$, we obtain that using $k = (2 + c) \epsilon^{-2} \ln n$, the probability of having relative error larger than $\epsilon$ is at most $1/n^c$. Therefore, we can be correct with high probability on estimating the influence of all nodes.

Instead of using ranks drawn from $U[0, 1]$, we can work with integral ranks given by a random permutation of the $n\ell$ node-instance pairs. We can also structure the permutation so that each sequence in positions $in + 1$ to $(i + 1)n$, for integral $i \ge 0$, forms a chunk that contains each node exactly once; the instance paired with node $v$ in chunk $i$ is randomly selected from instances $j$ for which the pair $(v, j)$ does not have a permutation rank of $in$ or less (independently for each node). One can show that this can only improve estimation accuracy [8]. Only the first $\min\{k, \ell\}\, n$ positions can be included in combined reachability sketches of nodes.

When estimating influence, we can convert permutation ranks to random ranks using the exponential distribution [7]. We can also estimate the cardinality of a subset of the $D = n\ell$ elements directly from permutation ranks in $[D]$, using the unbiased estimator $1 + (k - 1)(D - 1)/(T - 1)$, where $T$ is the $k$th smallest permutation rank. This estimator can be interpreted as setting aside the element with permutation rank $T$, and estimating the fraction (of the other $D - 1$ elements) that belong to the subset, using the elements with rank below $T$, which is $(k - 1)/(T - 1)$.

We now define sketches with respect to a binary IC model $\mathcal{G}$, presented as a graph with probabilities $p_e$ associated with its edges. The influence of a set of nodes $S$ is

$$\mathrm{Inf}(\mathcal{G}, S) = \mathbf{E}_{G \sim \mathcal{G}} \Big| \bigcup_{u \in S} R(G, u) \Big|. \qquad (7)$$

The sketches we define for $\mathcal{G}$ also contain at most $k$ rank values, but provide approximation guarantees with respect to (7). The sketches can be interpreted as the sketches computed for $\ell$ instances generated according to the model $G \sim \mathcal{G}$ as $\ell \to \infty$. When doing so, at the limit, each unique rank value corresponds to a unique instance, so we do not need to explicitly represent "instances." We work with structured permutation ranks (Section 3.1).
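The bottom-$k$ sketch of Equation (5) and the cardinality estimator based on the threshold rank of Equation (6) can be illustrated with the following minimal Python sketch, which uses uniform random ranks. The helper names (`bottom_k`, `estimate_cardinality`, `merge`) are hypothetical. The `merge` function uses only the basic fact that the bottom-$k$ sketch of a union can be computed from the sketches of its parts; it is not the refined union estimator of Cohen and Kaplan [11].

```python
import random

def bottom_k(ranks, k):
    """Sorted list of the k smallest rank values of a set."""
    return sorted(ranks)[:k]

def estimate_cardinality(sketch, k):
    """Bottom-k estimate: exact when fewer than k ranks, else (k - 1) / tau."""
    if len(sketch) < k:
        return float(len(sketch))
    tau = sketch[k - 1]  # threshold rank: the k-th smallest rank
    return (k - 1) / tau

def merge(sketches, k):
    """Bottom-k sketch of the union of the sets behind the given sketches."""
    return sorted(set().union(*sketches))[:k]

# Demo: estimate the size of a set of 20000 elements with k = 256.
rng = random.Random(7)
rank_of = {x: rng.random() for x in range(30000)}  # one rank per element
set_a = range(0, 20000)
set_b = range(10000, 30000)
k = 256
sketch_a = bottom_k((rank_of[x] for x in set_a), k)
sketch_b = bottom_k((rank_of[x] for x in set_b), k)
estimate_a = estimate_cardinality(sketch_a, k)
estimate_union = estimate_cardinality(merge([sketch_a, sketch_b], k), k)
print(estimate_a, estimate_union)  # true sizes are 20000 and 30000
```

Because both sets draw their ranks from the same assignment, merging the two sketches gives exactly the sketch that would have been computed for the union, so the influence of a seed set can be estimated from the sketches of its nodes alone.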
Since it suffices to consider the first $kn$ ranks, this conveniently removes the dependence of the rank representation on $\ell$. We can similarly apply an estimator to the $k$th smallest rank $T \le kn - k$ to estimate influence: instead of estimating cardinality (which goes to infinity with $\ell$) and dividing by $\ell$ using the estimator $\frac{1}{\ell} + \frac{(k - 1)(n\ell - 1)}{\ell (T - 1)}$, we take the limit as $\ell \to \infty$ and estimate influence using $n(k - 1)/(T - 1)$.

In this section we present our Sketch-based Influence Maximization (SKIM) algorithm. We first review Greedy, the greedy algorithm for influence maximization (working with $\ell$ instances) presented by Kempe et al. [19]. Greedy is applied with respect to the influence objective $\mathrm{Inf}(\mathcal{G}, S)$, as defined in Equation (2). It starts with an empty seed set $S = \emptyset$. In each iteration, it adds to $S$ the node $v$ with maximum marginal gain,

$$\mathrm{Inf}(\mathcal{G}, S \cup \{v\}) - \mathrm{Inf}(\mathcal{G}, S) = \frac{1}{\ell} \Big| \bigcup_{u \in S \cup \{v\}} R_u \setminus \bigcup_{u \in S} R_u \Big|. \qquad (8)$$

This is the same as choosing $v$ maximizing $\mathrm{Inf}(\mathcal{G}, S \cup \{v\})$.

SKIM approximates exact Greedy by ensuring that at each iteration, with sufficiently high probability, or in expectation over iterations, the node we choose to add to the seed set has a marginal gain that is close to the maximum one. To do so, it suffices to compute sketches only to the point that the node with the maximum estimated marginal gain is revealed. To maintain accuracy, we maintain a residual problem and respective sketches.

SKIM constructs (partial) combined reachability sketches by adapting a construction of reachability sketches [7]: It processes node-instance pairs $(u, i)$ by increasing rank, performing a reverse reachability search in $G^{(i)}$ from $u$. The sketch $X_v$ of each visited node $v$ is augmented with the rank $r_u^{(i)}$ of the pair. For a given value of $k$, the first node $u$ whose sketch reaches size $k$ is also the node with maximum estimated influence. This is because the bottom-$k$ cardinality estimate of a node depends only on the $k$th smallest rank in $X_u$, $\tau_u$ (which is a complete sufficient statistic for cardinality estimation from the sketch [8]); see Equation (6). For the node $u$, $\tau_u$ is equal to the rank $r_u^{(i)}$ of the last processed pair $(u, i)$. For other nodes $v$ with incomplete sketches, we know that $\tau_v \ge r_u^{(i)}$, so their estimate is lower.

Sketch building is suspended once the node $v$ with maximum estimated influence is found. SKIM then adds $v$ to the seed set and generates a residual problem, with $v$ and all node-instance pairs it covers removed from the instances $\mathcal{G}$.
The (partially computed) sketches of each remaining node $u$ are updated using $X_u \leftarrow X_u \setminus X_v$, which deletes from the sketch the ranks of all covered node-instance pairs.

The process of building sketches is then resumed on the residual problem, working with updated partial sketches and instances. We continue processing node-instance pairs in increasing rank order, starting from the first rank that exceeds $\tau_v$ and skipping pairs that are already covered.

We provide pseudocode for SKIM as Algorithm 1. Instead of maintaining the actual partial sketches $X_v$, the algorithm only keeps their cardinalities size[$v$]. To support correct and efficient updates of the sketches, we maintain an inverted index index[$u, i$] that lists, for each rank value $r_u^{(i)}$ we processed, all nodes $v$ such that $r_u^{(i)} \in X_v$. The entry for rank $r_u^{(i)}$ is created and populated when we perform a reverse reachability search from pair $(u, i)$. The algorithm outputs the list seedlist of pairs $(\sigma_i, I_i)$, where $\{\sigma_i\}$ is a permutation of the nodes according to the order they are selected into the seed set, and $I_i$ is the marginal influence of $\sigma_i$. The surprising property of our construction is that this whole iterative process is very efficient. If we run SKIM with a fixed $k = c \epsilon^{-2} \log n$, Section 4.1 will show that we obtain the following worst-case performance guarantees:

Theorem 4.1.
SKIM runs in time $O(n\ell + \sum_i |E^{(i)}| + m \epsilon^{-2} \log^2 n)$, where $m = \sum_v \max_i \mathrm{InDeg}^{(i)}(v) \le |\bigcup_i E^{(i)}|$. The permutation $\{\sigma_i\}$ of nodes has the property that with probability $1 - 1/n^{\Omega(c)}$, for all $s \in [n]$, the set of seed nodes $S = \{\sigma_1, \ldots, \sigma_s\}$ has $\mathrm{Inf}(\{G^{(i)}\}, S) \ge (1 - 1/e - \epsilon) \max_{Z : |Z| \le s} \mathrm{Inf}(\{G^{(i)}\}, Z)$.

Algorithm 1: Sketch-based Influence Maximization

    // Initialization
    forall the pairs (u, i) do covered[u, i] ← false
    forall the nodes v do size[v] ← 0
    index ← hash map of node-instance pairs to nodes
    seedlist ← ∅                        // List of seeds & marg. influences
    rank ← 0
    shuffle the nℓ node-instance pairs (u, i)
    // Compute seed nodes
    while |seedlist| < n do
        while rank < nℓ do              // Build sketches
            rank ← rank + 1
            (u, i) ← rank-th pair in shuffled sequence
            if covered[u, i] = false then
                BFS from u in reverse graph G^(i), during which
                foreach scanned node v do
                    size[v] ← size[v] + 1
                    index[u, i] ← index[u, i] ∪ {v}
                    if size[v] = k then
                        x ← v           // Next seed node
                        abort sketch building
        if all nodes u have size[u] < k then
            x ← argmax_{u ∈ V} size[u]
        I_x ← 0                         // The coverage of x
        forall the instances i do       // Residual problem
            (forward) BFS from x in graph G^(i), during which
            foreach scanned node v do
                if covered[v, i] then prune
                I_x ← I_x + 1
                covered[v, i] ← true    // Cover v in i
                forall the nodes w in index[v, i] do
                    size[w] ← size[w] − 1
                index[v, i] ← ⊥         // Erase (v, i) from index
        I_x ← I_x / ℓ
        seedlist.append(x, I_x)
    return(seedlist)

It is not hard to show that the influence of a node $v$ in the residual problem of iteration $i$ is equal to its marginal influence with respect to $S = \{\sigma_1, \ldots, \sigma_{i-1}\}$ in the original problem. Therefore, $I_i$, which is the influence of $\sigma_i$ in the residual problem of iteration $i$, is the marginal influence of $\sigma_i$ with respect to $S = \{\sigma_1, \ldots, \sigma_{i-1}\}$ in the original problem. Thus, by definition, for all $s \in [n]$ and $S = \{\sigma_1, \ldots, \sigma_s\}$, $\mathrm{Inf}(\{G^{(i)}\}, S) = \sum_{i \in [s]} I_i$.

We also show that the partial sketches correctly capture a component of the sketches computed for the residual problem:

Lemma 4.1.
At the end of an iteration selecting $v$, each updated partial sketch $X_u$ is equal to the set of entries of the combined reachability sketch $X_u$ of $u$ in the residual problem that have rank value at most $\tau_v$.

Proof sketch. The content of each sketch $X_u$ before computing the residual is clearly a superset of all reachable node-instance pairs $(z, i)$ with rank $r_z^{(i)} \le \tau_v$ in the residual problem. We can then verify that the entries removed from $X_u$ are exactly those of the covered node-instance pairs with $r_z^{(i)} \le \tau_v$.

We now analyze the running time of SKIM. All updates of the residual problem together take time linear in the size of $\{G^{(i)}\}$, since nodes and edges that are covered by the current seed set are removed once visited and never considered again. The remaining component of the computation is determined by the number of times ranks are inserted into (and removed from) sketches. Inserting a value into $X_u$ involves a scan of all (remaining) incoming edges to $u$ in an instance. Removals of ranks can be charged to insertions. So we need to bound the total number of rank insertions:

Lemma 4.2.
The expected total number of rank insertions at a particular node is $O(k \ln n)$.

Proof sketch. Consider a sketch $X_v$. We can show, viewing the sketches as uniform samples of reaching pairs, that each rank value removal corresponds to the cardinality, and hence the influence (marginal gain), being reduced in expectation by a factor of $1 - 1/k$. The initial influence is at most $n$, so there are at most $k \ln(nk)$ insertions until the marginal influence is reduced below $1/k$, at which point we do not need to consider the node.

The running time is dominated by the sum over nodes $v$ of the number of times a rank is inserted into the sketch of $v$, times the in-degree of $v$ (the maximum over instances). From the lemma, we obtain a bound of $O(km \ln n)$ on the total cost of insertions. Thus, we obtain a bound of $O(km \ln n + \sum_i |G^{(i)}|)$ on the running time of the algorithm.

To obtain an approximation that is within $1 + \epsilon$ with good probability, we can choose a fixed $k = c \epsilon^{-2} \log n$, for some constant $c$. The relative error of each influence estimate of a node in an iteration is at most $\epsilon$ with probability at least $1 - 1/n^c$. Since we use polynomially many estimates (maximize influence among $n$ nodes in each of at most $n$ iterations), all estimates are within a relative error of $\epsilon$ with probability that is polynomially close to $1 - 1/n^{c-2}$. Lastly, we bound the approximation ratio of the "approximate" greedy algorithm we work with, which uses seeds with close to maximum instead of maximum marginal gain:

Lemma 4.3.
With any submodular and monotone objective function, approximate greedy, which iteratively chooses a node with marginal gain that is at least $(1 - \delta)$ of the maximum, has an approximation ratio of at least $1 - (1 - 1/s)^s - O(\delta)$. The same claim holds in expectation when the selection is well concentrated, that is, its probability of being below $(1 - a\delta)$ times the maximum decreases exponentially with $a > 1$.

Proof. The argument extends the analysis of exact greedy by Nemhauser et al. [21]. For any $s$, and after selecting any set $U$ of seeds, the maximum marginal gain by adding a single node is always at least $1/s$ of the maximum possible gain for $s$ nodes. When using the approximation, this is at least $(1 - \delta)/s$ of the maximum possible gain. Therefore, after approximate greedy selection of $s$ nodes, the influence is at least $1 - (1 - (1 - \delta)/s)^s \ge 1 - (1 - 1/s)^s - O(\delta)$ times the optimum, using the first order term of the Taylor expansion.

This worst-case analysis is too pessimistic, both for the approximation ratio and running time. In our experiments, we tested SKIM with a fixed $k$, and observed that the computed seed sets had influence that is much closer to the exact greedy selection than indicated by the worst-case bounds.

The explanation is that the influence distribution on real inputs is heavy-tailed, with the vast majority of nodes having a much smaller influence than the node of maximum influence. One factor of $O(\log n)$ in the worst-case running time is due to a "union bound" ensuring a relative error of $\epsilon$ for all nodes in all iterations, with high probability. With a heavy-tailed distribution, we can identify the maximum with a small error if we ensure a small error only on the few nodes that have influence close to the maximum. Furthermore, when the maximum influence is separated from other influence values, our approximate maximum is more likely to be the node with actual maximum influence.
Moreover, the estimation error over iterations averages out, so as the seed set gets larger we can work with lower accuracy and still guarantee a good approximation.

We propose incorporating error estimation that is adaptive rather than worst-case. This facilitates tighter confidence bounds on the estimation quality of our output. It also allows us to adjust the sketch parameter $k$ during computation in order to meet pre-specified accuracy and confidence levels.

Let the discrepancy in an iteration be the gap between the actual maximum and the marginal influence of the selected seed. We will bound the sum of discrepancies across iterations by maintaining a confidence distribution on this sum.

The estimation uses two components. (i) The exact marginal influence $I_i$ of the node selected in iteration $i$, as well as the sum $I = \sum_{i \le s} I_i$, which is the influence of our seed set. The value $I_i$ is computed when generating the residual problem. (ii) The size of the second largest sketch (excluding the last processed rank), noted in each iteration. Intuitively, if the second largest sketch is much smaller than the first one, it is more likely that the first one is the actual maximum.

We bound the discrepancy in a single iteration using Chernoff bounds. The probability that the sum $Z$ of independent Bernoulli trials falls below its expectation $\mu$ by more than $\nu\mu$ is

$$\Pr[Z < (1 - \nu)\mu] < \left( \frac{\exp(-\nu)}{(1 - \nu)^{(1 - \nu)}} \right)^{\mu}. \qquad (9)$$

We use this to bound the probability that the discrepancy exceeds $\Delta\epsilon$, where $\Delta$ is the exact marginal gain of our selected seed node. We consider the second largest sketch size, $k' \le k - 1$ ($\tau$ is not considered part of the sketch even if included). We use $Z = k'$, $\mu = \tau\Delta(1 + \epsilon)$, and $\nu = 1 - \frac{k'}{\tau\Delta(1 + \epsilon)}$ in Equation (9) to obtain a confidence level.

Finally, to maintain an upper bound on the confidence-error distribution of the sum of discrepancies, we take a convolution, after each iteration, of the current distribution with the distribution of the current iteration.

SKIM can be adapted for higher concurrency by running the sketch-building phases in batches of ranks. We can also adapt it to process inputs presented as an IC model instead of as a set of instances. This yields a more efficient implementation than generating a set of instances using simulations and running SKIM on them. In IC-model SKIM, the residual problem is a collection of partial models and sketch building is performed on the probabilistic model. We omit details due to space limitations.
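As an illustration of how Equation (9) turns into a per-iteration confidence level, the following minimal Python sketch evaluates the bound for the choice of $Z$, $\mu$, and $\nu$ given above. The function names and the handling of the degenerate cases (a vacuous bound of 1 when the runner-up sketch is not smaller than $\mu$) are assumptions made for this example, not part of SKIM's implementation.

```python
import math


def chernoff_lower_tail(mu: float, nu: float) -> float:
    """Right-hand side of Equation (9): a bound on Pr[Z < (1 - nu) * mu]."""
    if mu <= 0.0 or not 0.0 <= nu <= 1.0:
        raise ValueError("need mu > 0 and 0 <= nu <= 1")
    if nu == 1.0:
        # Limit of the expression as nu -> 1, since (1 - nu)^(1 - nu) -> 1.
        return math.exp(-mu)
    # Work in log space: log( exp(-nu) / (1 - nu)^(1 - nu) ).
    log_base = -nu - (1.0 - nu) * math.log(1.0 - nu)
    return math.exp(mu * log_base)


def discrepancy_bound(k2: int, tau: float, delta: float, eps: float) -> float:
    """Bound on the probability that the discrepancy exceeds delta * eps.

    k2    -- size of the second largest sketch (hypothetical name for k')
    tau   -- threshold rank
    delta -- exact marginal gain of the selected seed
    eps   -- target relative error
    """
    mu = tau * delta * (1.0 + eps)
    if k2 >= mu:
        return 1.0  # the runner-up is not small enough; the bound is vacuous
    nu = 1.0 - k2 / mu
    return chernoff_lower_tail(mu, nu)
```

For instance, `discrepancy_bound(32, 0.5, 200.0, 0.1)` is far smaller than `discrepancy_bound(64, 0.5, 200.0, 0.1)`: the smaller the runner-up sketch relative to $\tau\Delta(1+\epsilon)$, the higher the confidence that the selected node is close to the true maximum.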
We now present an accurate and efficient oracle for binary influence, which is based on precomputing a combined reachability sketch (as defined in Section 3) for each node. We preprocess a set of $\ell$ instances $\mathcal{G} = \{G^{(i)}\}$ using $O(k \sum_{i=1}^{\ell} |E^{(i)}|)$ computation and working storage of $O(k)$ per node. The preprocessing generates combined reachability sketches $X_v$ of size $O(k)$ for each node $v \in V$.

Theorem 5.1.
Given a set $\{X_v\}$ of combined reachability sketches for $\mathcal{G}$ with parameter $k$, influence queries $\mathrm{Inf}(\mathcal{G}, S)$ for a set $S$ of nodes can be estimated in $O(|S| k \log |S|)$ time from the sketches $\{X_u \mid u \in S\}$. The estimate is nonnegative and unbiased, has CV at most $1/\sqrt{k - 2}$, and is well concentrated, meaning that the probability that the relative error exceeds $a/\sqrt{k}$ decreases exponentially with $a$.

We next present the two components of our oracle: estimating the influence of $S$ from the sketches of the nodes in $S$, and efficiently computing all combined reachability sketches.

We show how to use the combined reachability sketches of a set of nodes $S$ to estimate the influence of $S$, as given in Equation (4). In graph terms, this means estimating the cardinality of the union $\bigcup_{u \in S} R_u$ from the sketches $X_u$, with $u \in S$. The influence $\mathrm{Inf}(\mathcal{G}, S)$ is the union cardinality divided by the number of instances $\ell$ and, accordingly, is estimated using $\widehat{|\bigcup_{v \in S} R_v|} / \ell$. Our estimators use the threshold rank $\tau_u$ of each node $u$; see Equation (6).

From the bottom-$k$ sketches of each set $R_u$ for $u \in S$ we can unbiasedly estimate the cardinality of the union $\bigcup_{u \in S} R_u$. One way to do this is to compute the bottom-$k$ sketch of the union [7], which has threshold value $\tau = k\text{th}\{\bigcup_{u \in S} X_u\}$, and apply the cardinality estimator $(k - 1)/\tau$. This would already conclude the proof of Theorem 5.1.

In our implementation, we use a strictly better union cardinality estimator that uses all the (at most $k|S|$) values in the set of sketches instead of just the $k$th smallest:

$$\widehat{\Bigl|\bigcup_{v \in S} R_v\Bigr|} = \sum_{z \,\in\, \bigcup_{v \in S} X_v \setminus \{\tau_v\}} \frac{1}{\max_{u \in S \,\mid\, z \in X_u \setminus \{\tau_u\}} \tau_u}. \qquad (10)$$

This estimator, proposed by Cohen and Kaplan [11], can be computed from the $|S|$ sketches in time $O(|S| k \log |S|)$, by first sorting the $|S|$ sketches by decreasing threshold, and then identifying for each distinct rank value the threshold of the first sketch that contains it. When the sets $R_u$ are all the same, the estimate is the same as applying an estimator to the bottom-$k$ sketch of the union, but Equation (10) can have up to a factor of $\sqrt{|S|}$ lower CV when the sets $R_u$ are sufficiently disjoint. Moreover, this estimator is an optimal sum estimator in that it minimizes variance given the information available in the sketches.

We can also derive a permutation version of Equation (10). The simplest way is to treat the permutation rank $T$ as a uniform rank $r = (T - 1)/(\ell n - 1)$, which is the probability that the rank of another node is smaller than $T$.

Algorithm 2: Combined reachability sketches
    forall the nodes u ∈ V do
        sketches[u] ← ∅    // Global sketches
        local[u] ← ∅       // Instance-local sketches
    shuffle the nℓ node-instance pairs (u, i)
    forall the instances i do
        // Build local sketches for instance i
        for pairs (u, j) with j = i by increasing rank r do
            BFS from u in reverse graph G^(i), during which
            foreach scanned node v do
                if |local[v]| = k then prune
                else local[v] ← local[v] ∪ {r}
        // Merge local sketches into global sketches
        forall the nodes u do
            // Both sketches[u] and local[u] are sorted
            sketches[u] ← merge(sketches[u], local[u])
            trim sketches[u] to size k
            local[u] ← ∅
    return(sketches)

When there is a single instance $G = (V, E)$, the combined sketches are simply reachability sketches [7, 10]. Reachability sketches $Y_v$ for all nodes can be computed very efficiently, using at most $mk$ edge traversals in total, where $m$ is the number of edges [7].

Algorithm 2 computes combined sketches by applying the pruned searches algorithm of Cohen [7] on each instance $G^{(i)}$, obtaining a sketch $Y_v^{(i)}$ for each node, and combining the results. The combined sketch $X_v$ is obtained by taking the bottom-$k$ values in the union of the $\ell$ sketches, defined as $X_v \leftarrow \text{bottom-}k\bigl(\bigcup_{i=1}^{\ell} Y_v^{(i)}\bigr)$. The algorithm runs in $O(k \sum_i |E^{(i)}|)$ time. Rather than storing all sets of sketches, we can compute and merge concurrently or sequentially, taking after each step the bottom-$k$ values in the current bottom-$k$ set and the newly computed sketch for instance $G^{(i)}$: $X_v \leftarrow \text{bottom-}k\{X_v, Y_v^{(i)}\}$. Therefore, the additional run-time storage requirement for sketches is $O(nk)$. This gives us the worst-case bounds on the computation stated in Theorem 5.1.

We implemented our algorithms in C++ using Visual Studio 2013 with full optimization. All experiments were run on a machine with two Intel Xeon E5-2690 CPUs and 384 GiB of DDR3-1066 RAM, running Windows 2008R2 Server. Each CPU has 8 cores (2.90 GHz, 8 ×
64 kiB L1, 8 × 256 kiB L2, and 20 MiB L3 cache), but all runs are sequential for consistency.

We ran our experiments on benchmark networks available as part of the SNAP [24] and WebGraph [2] projects. More specifically, we test social (Epinions, Slashdot, Gowalla, TwitterFollowers, LiveJournal, Orkut, Friendster, Twitter), collaboration (AstroPh), and web (Slovakia, Slovakia⊤) networks. Slovakia⊤ is obtained from Slovakia by reversing all arcs (influence follows the reverse direction of links).

Kempe et al. [19] proposed two natural ways of associating probabilities with edges in the binary IC model: the uniform scheme assigns a constant probability p to each directed edge (they used p = 0.1 and p = 0.01); in the weighted cascade (wc) scheme, the probability is the inverse of the degree of the head node (making the probability that a node is influenced less dependent on its number of neighbors). We consider the wc scheme by default, but we will also experiment with the uniform scheme (un). These two schemes are the most commonly tested in previous studies of scalability [20, 6, 5, 18, 22, 25].

This section evaluates SKIM, our new sketch-based influence maximization algorithm. By default we set the number of sampled instances to ℓ = 64 and compute sketches with k = 64 entries. (These choices will be justified in later experiments.) To evaluate the actual influence values of the seeds computed by SKIM, we use a set of 512 different sampled instances, in which we simply run BFSes a posteriori.

Table 1 summarizes the performance of our algorithm on several networks of varying sizes with up to almost two billion edges. Besides the network sizes, the table reports results for three seed set sizes s: 50, 1000, and n (that is, computing a full permutation). In each case, it reports the total running time of our algorithm as well as the total influence of the corresponding seed set as a percentage of n. (Note that for s = n this value is 100% by definition, so we omit it in the table.) For s = 50 and 1000, the table also reports the corresponding numbers for IRIE [18], one of the fastest available heuristics that can generate full permutations. We use our own implementation of IRIE, which is somewhat faster than the one evaluated in the original paper.
Except for s = n, we set an execution time limit of two hours; we report "DNF" and the corresponding number of computed seeds for those runs that did not finish.

The table shows that the influences computed by IRIE and SKIM are very close, with SKIM sometimes being better. However, SKIM is significantly faster, outperforming IRIE by several orders of magnitude on many instances. In particular, when computing 1000 instead of 50 seeds, SKIM's speedup over IRIE becomes more evident, as IRIE's running time grows linearly with the number of seed nodes, whereas with SKIM it decreases with the size of the residual problem. As a result, we can compute the 1000 most influential nodes on a graph with 65 million nodes and 1.8 billion edges (Friendster) in just 22 minutes. Similarly, computing a full influence ordering with SKIM takes less than 5.5 hours on all graphs.

We also compare SKIM to TIM+ [25], the fastest influence maximization algorithm we are aware of. We ran their implementation (kindly given to us by the authors) to report figures on our instances. As in their experiments, we set the ε parameter of TIM+ to 1.0. Table 2 reports the influence (as percentage of n) as well as the running time for 50 and 1000 seed nodes. We note that SKIM and TIM+ are extremely close in quality, with TIM+ tending to be slightly better. SKIM is faster than TIM+ on most instances except on Friendster, Twitter, and Slovakia⊤. Note that SKIM computes a sequence of nodes such that every prefix of this sequence also (approximately) maximizes the influence. In contrast, TIM+ must be rerun to obtain a smaller set of maximally influential nodes.

Table 1. Performance of SKIM and IRIE. SKIM uses k = 64, ℓ = 64, and we evaluate the influence on 512 (different) sampled instances. For all runs (except those for n seeds) we set a time limit of two hours. For the runs that did not finish (DNF), we report the influence of the seed set (its size is shown in parentheses after "DNF") computed within the time limit (*).

instance     |V| [·10^3]   |A| [·10^3]    influence [%], 50 seeds   influence [%], 1000 seeds
                                          SKIM     IRIE             SKIM     IRIE
Friendster   65 608.4      1 806 067.1    9.5      8.8              *        *
Twitter      41 652.2      1 468 364.9    21.1     21.1             38.0     25.3 *
Slovakia     50 636.2      1 930 292.9    5.4      4.8              14.8     10.1 *
Slovakia⊤    50 636.2      1 930 292.9    10.3     10.0             25.9     16.7 *

[Remaining Table 1 entries: rows AstroPh, Epinions, Slashdot, Gowalla, TwitterFollowers, LiveJournal, and Orkut; running time [sec] columns for 50 seeds (SKIM, IRIE), 1000 seeds (SKIM, IRIE), and n seeds (SKIM).]

Table 2. Comparing SKIM and TIM+ regarding influence and running time for 50 and 1000 seeds.

[Table 2 body. Columns: influence [%] and running time [sec], each for 50 seeds and 1000 seeds (SKIM, TIM+). Rows: AstroPh, Epinions, Slashdot, Gowalla, TwitterF's, LiveJournal, Orkut, Friendster, Slovakia, Slovakia⊤.]

We next argue why our parameter choices are reasonable. First, we evaluate the impact of the number ℓ of instances on the solution quality. Figure 1 (left) reports the quality of the seed nodes found by Greedy (GRE) when we use different ℓ values during the algorithm, but evaluate the quality of the resulting seed set on 4096 (different) instances. We observe that increasing ℓ does help quality, but only up to a certain point. In particular, values beyond 64 yield modest improvements. Since our running times depend on ℓ, we use this value by default.

[Figure 1 plots. Left: AstroPh, error versus seed set size for different values of ℓ. Right: average error versus number of instances ℓ for Slashdot, Epinions, and AstroPh.]
Figure 1. Evaluating different numbers of simulations (left) and evaluating the average error of our oracle on 1000 random seeds, subject to varying ℓ (right). The right plot is discussed in Section 6.2.

Figure 2 compares SKIM to GRE, IRIE, and DEG (including nodes by order of decreasing degree) on two inputs: Slashdot and TwitterFollowers. For SKIM, we test various values for k (4, 16, 64, 256). We report the influence error when compared to GRE (top) and the running time (bottom). We observe that the error for SKIM decreases as we increase k, with k = 64 being the sweet spot, after which solution quality does not improve by much. Running times increase for all algorithms with the size of the seed set, but SKIM is consistently the fastest algorithm for any size.

Figure 3 evaluates the performance of SKIM and IRIE on the two IC schemes (wc, un), using TwitterFollowers as input. We observe that SKIM matches the solution quality of IRIE but is significantly faster.

Finally, Figure 4 shows the influence (top) and running time (bottom) of SKIM when computing the full permutation. We plot the relative influence and running time (both as percentages) against the number of computed seed nodes as the algorithm progresses (also as a percentage of n). To the best of our knowledge, we are the first to compute (approximately) the full Pareto front of influence versus seed set size on graphs with billions of edges within only a few hours. The tradeoff seems to characterize the core of the network: on Slovakia⊤ and Twitter, 0.1% of the nodes already cover almost 50% of the entire graph, while on Slashdot and Friendster, 0.1% of the seeds only cover 25–30% of the graph, albeit with a faster growth. Other instances have a slower growth in influence, but on all instances 10% of the nodes cover at least 50% of the graph. Regarding running time, we observe that all instances exhibit similar behavior. In particular, more than 50% of the total running time is spent computing the first 10% of seed nodes.
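The evaluation protocol used throughout this section (sample instances of the IC model under the weighted cascade scheme, then run a BFS from the seed set in each instance) can be illustrated with a minimal Python sketch. This is not the C++ implementation used in the experiments; the graph representation (a dictionary mapping every node to the list of its in-neighbors) and all function names are assumptions made for this example.

```python
import random
from collections import deque


def sample_instance_wc(in_neighbors, rng):
    """Sample one instance under weighted cascade (wc).

    Arc (u, v) is kept independently with probability 1 / indeg(v).
    `in_neighbors` must contain every node as a key (possibly with an
    empty list). Returns out-adjacency lists of the sampled instance.
    """
    out_adj = {v: [] for v in in_neighbors}
    for v, preds in in_neighbors.items():
        for u in preds:
            if rng.random() < 1.0 / len(preds):
                out_adj[u].append(v)
    return out_adj


def reach_count(out_adj, seeds):
    """Number of nodes reachable from `seeds` (seeds included), via BFS."""
    seen = set(seeds)
    queue = deque(seen)
    while queue:
        u = queue.popleft()
        for v in out_adj[u]:
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return len(seen)


def estimate_influence(in_neighbors, seeds, num_instances=512, seed=0):
    """Average reachability of `seeds` over freshly sampled instances."""
    rng = random.Random(seed)
    total = 0
    for _ in range(num_instances):
        total += reach_count(sample_instance_wc(in_neighbors, rng), seeds)
    return total / num_instances
```

On a directed path 0 → 1 → 2, every node has in-degree at most one, so every arc is kept with probability 1 and `estimate_influence({0: [], 1: [0], 2: [1]}, [0])` is exactly 3.0. Dividing the result by the number of nodes gives the influence as a fraction of n, as reported in the tables.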
[Figure 2 plots. Top: error with respect to GRE versus seed set size, on Slashdot and TwitterF's. Bottom: running time [sec] versus seed set size, on Slashdot and TwitterF's. Legend: GRE, DEG, IRIE, SK-4, SK-16, SK-64, SK-256.]
Figure 2. Evaluating influence and running time for several algorithms. The legend applies to all plots.

[Figure 3 plots. Influence and running time [sec] versus seed set size, on TwitterF's. Legend: SKIM-wc, IRIE-wc, SKIM-un, IRIE-un.]
Figure 3. Evaluating SKIM and IRIE on the uniform (un) and weighted cascade (wc) models. The legend applies to both plots.
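For reference, the oracle query evaluated next combines the per-node sketches using the estimator of Equation (10). The Python sketch below is a simplified illustration rather than the implementation measured in the experiments: it uses a dictionary instead of the sorting procedure described in Section 5, and it assumes that a sketch with fewer than k ranks has threshold 1 (so all of its ranks are counted). The function names are hypothetical.

```python
def union_cardinality(sketches, k):
    """Estimate |union of R_u| from bottom-k sketches, as in Equation (10).

    `sketches` is a list of sorted lists of ranks in (0, 1), one per seed
    node. For a full sketch, the k-th smallest rank is the threshold tau_u
    and is not counted as a member.
    """
    best_tau = {}  # rank z -> largest threshold among sketches containing z
    for ranks in sketches:
        if len(ranks) >= k:
            tau = ranks[k - 1]
            members = ranks[:k - 1]
        else:
            tau = 1.0  # assumption: an incomplete sketch holds the whole set
            members = ranks
        for z in members:
            if tau > best_tau.get(z, 0.0):
                best_tau[z] = tau
    # Each distinct rank contributes the inverse of its largest threshold.
    return sum(1.0 / tau for tau in best_tau.values())


def estimate_influence_from_sketches(sketches, k, num_instances):
    """Influence estimate: union cardinality divided by the number of instances."""
    return union_cardinality(sketches, k) / num_instances
```

With a single full sketch, the estimate reduces to (k − 1)/τ: for k = 3 and ranks [0.1, 0.2, 0.4], it is 2/0.4 = 5. With identical sketches the result is unchanged, matching the observation in Section 5 that the estimator coincides with the bottom-k estimator of the union when all sets are the same.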
This section evaluates our influence oracle (cf. Section 5). We use the IC model (with wc probabilities) to generate a set of ℓ = 64 instances. We build combined reachability sketches of size k = 64 for this set of instances and evaluate the performance of our oracle (cf. Section 2).

Table 3 summarizes the performance of our oracle on several networks. It reports the time spent for preprocessing and the required space (in MiB) to store the combined sketches. Queries are evaluated for seed set sizes s of 1, 50, and 1000. For each s, we generate 100 seed sets whose nodes are selected uniformly at random. We report the average running time of the query (estimator) in microseconds and the relative error of the estimated influence when compared to the exact influence of the respective seed set.

We observe that preprocessing times are reasonable for all graphs, while space consumption is essentially linear in the number of nodes. For example, on LiveJournal (the biggest instance tested), the sketches require 2.3 GiB of space, which we computed in just 34 minutes. The influence of a single node can then be estimated in 1–2 µs, while for 1000 seed nodes we require 5.2 ms. Note that the query time is almost independent of the graph size. Using k = 64, the error stays well below 10% for one seed node, and decreases significantly for larger seed sets (to around 1% for s = 1000).

[Figure 4 plots. Top: influence [%] versus seed set size [%]. Bottom: running time [%] versus seed set size [%]. Legend: Epinions, Slashdot, TwitterF's, LiveJournal, Orkut, Friendster, Twitter, Slovakia⊤.]
Figure 4. Evaluating influence permutations (top) and running time (bottom) on several instances. The legend applies to both plots.

Table 3. Evaluating our influence oracle with ℓ = 64.

                    preproc.               queries, 1 seed        queries, 50 seeds      queries, 1000 seeds
instance            time     space [MiB]   time [µs]   err. [%]   time [µs]   err. [%]   time [µs]   err. [%]
AstroPh             …        …             …           …          …           …          …           …
Epinions            10       37.1          1.3         5.2        155.0       3.4        5 011.1     1.1
Slashdot            20       37.8          1.5         6.0        155.2       3.9        4 982.3     1.0
Gowalla             46       96.0          1.5         7.3        179.8       3.2        5 275.6     1.1
TwitterFollowers    229      223.0         2.1         7.0        190.2       3.3        5 061.8     0.8
LiveJournal         …        …             …           …          …           …          …           …

Figure 5 shows in detail how the error of the estimator (y axis) decreases when the seed set size increases (x axis). To better evaluate the performance of estimating the union of several reachability sets, we use the following neighborhood generator for queries. For each query, it first picks a node u at random with probability proportional to its degree. From u it exhaustively grows a BFS of the smallest depth l such that the tree contains at least s nodes. The nodes for the seed set are then uniformly sampled from this tree. With this generator, we expect the reachability sets of seed nodes to overlap heavily. Looking at the figure, we observe that the estimation error of our oracle decreases rapidly for increasing s. Also, running queries from the neighborhood generator (right) instead of the uniform one (left) has almost no effect on the estimation error; for 50 seed nodes it is even better on many instances.

[Figure 5 plots. Error versus seed set size, for the uniform (left) and neighborhood (right) query generators. Legend: AstroPh, Epinions, Slashdot, Gowalla, TwitterF's, LiveJournal.]
Figure 5. Evaluating our oracle for seed sets of varying size, which are selected uniformly at random (left) or with our BFS-based method (right).

Finally, Figure 1 (right) reports the performance of the oracle for fixed instances on the general IC model. We vary the number ℓ of instances generated by simulations when building the oracle, but compute the error on a different set of 8192 instances. Since our oracle implementation is optimized for fixed instances, we see a higher error with ℓ = 64. We can also see that the error decreases with the number of simulations. We conclude that for an IC model oracle, it is beneficial to construct sketches that have approximation guarantees with respect to the IC model itself (cf. Section 3.2) rather than work with simulations.

We presented highly scalable algorithms for binary influence computation. SKIM is a sketch-space implementation of the greedy influence maximization algorithm that scales it by several orders of magnitude, to graphs with billions of edges. SKIM computes a sequence of nodes such that each prefix has a probabilistic guarantee on approximation quality that is close to that of Greedy. We also presented sketch-based influence oracles, which after near-linear preprocessing of the instances can estimate influence queries in time proportional to the number of seeds. Our experimental study focused on instances generated by an IC model, since the fastest algorithms we compared with only apply in this model. Our experiments revealed that SKIM is accurate and faster than other algorithms by one to two orders of magnitude.

In future work, we plan to develop a SKIM-like algorithm for timed influence, where edges have lengths that are interpreted as transition times and we consider both the speed and scope of infection [15, 4, 9, 1, 12]. We also plan to use sketches to efficiently estimate the Jaccard similarity of the influence sets of two nodes, which we believe to be an effective similarity measure [9].
References

[1] B. D. Abrahao, F. Chierichetti, R. Kleinberg, and A. Panconesi. Trace complexity of network inference. In KDD, 2013.
[2] P. Boldi and S. Vigna. The WebGraph framework I: compression techniques. In WWW, 2004.
[3] C. Borgs, M. Brautbar, J. Chayes, and B. Lucier. Maximizing social influence in nearly optimal time. In SODA, 2014.
[4] W. Chen, W. Lu, and Y. Zhang. Time-critical influence maximization in social networks with time-delayed diffusion process. In AAAI, 2014.
[5] W. Chen, C. Wang, and Y. Wang. Scalable influence maximization for prevalent viral marketing in large-scale social networks. In KDD. ACM, 2010.
[6] W. Chen, Y. Wang, and S. Yang. Efficient influence maximization in social networks. In KDD. ACM, 2009.
[7] E. Cohen. Size-estimation framework with applications to transitive closure and reachability. J. Comput. System Sci., 55:441–453, 1997.
[8] E. Cohen. All-distances sketches, revisited: HIP estimators for massive graphs analysis. In PODS. ACM, 2014.
[9] E. Cohen, D. Delling, F. Fuchs, A. Goldberg, M. Goldszmidt, and R. Werneck. Scalable similarity estimation in social networks: Closeness, node labels, and random edge lengths. In COSN. ACM, 2013.
[10] E. Cohen and H. Kaplan. Summarizing data using bottom-k sketches. In ACM PODC, 2007.
[11] E. Cohen and H. Kaplan. Leveraging discarded samples for tighter estimation of multiple-set aggregates. In ACM SIGMETRICS, 2009.
[12] N. Du, L. Song, M. Gomez-Rodriguez, and H. Zha. Scalable influence estimation in continuous-time diffusion networks. In NIPS. Curran Associates, Inc., 2013.
[13] U. Feige. A threshold of ln n for approximating set cover. J. Assoc. Comput. Mach., 45:634–652, 1998.
[14] J. Goldenberg, B. Libai, and E. Muller. Talk of the network: A complex systems look at the underlying process of word-of-mouth. Marketing Letters, 12(3), 2001.
[15] M. Gomez-Rodriguez, D. Balduzzi, and B. Schölkopf. Uncovering the temporal dynamics of diffusion networks. In ICML, 2011.
[16] M. Gomez-Rodriguez, J. Leskovec, and A. Krause. Inferring networks of diffusion and influence. In KDD, 2010.
[17] A. Goyal, W. Lu, and L. V. S. Lakshmanan. Celf++: Optimizing the greedy algorithm for influence maximization in social networks. In WWW. ACM, 2011.
[18] K. Jung, W. Heo, and W. Chen. Irie: Scalable and robust influence maximization in social networks. In ICDM. ACM, 2012.
[19] D. Kempe, J. M. Kleinberg, and É. Tardos. Maximizing the spread of influence through a social network. In KDD. ACM, 2003.
[20] J. Leskovec, A. Krause, C. Guestrin, C. Faloutsos, J. VanBriesen, and N. Glance. Cost-effective outbreak detection in networks. In KDD. ACM, 2007.
[21] G. Nemhauser, L. Wolsey, and M. Fisher. An analysis of the approximations of maximizing submodular set functions. Mathematical Programming, 14, 1978.
[22] N. Ohsaka, T. Akiba, Y. Yoshida, and K. Kawarabayashi. Fast and accurate influence maximization on large networks with pruned Monte-Carlo simulations. In AAAI, 2014.
[23] M. Richardson and P. Domingos. Mining knowledge-sharing sites for viral marketing. In KDD. ACM, 2002.
[24] Stanford network analysis project. http://snap.stanford.edu.
[25] Y. Tang, X. Xiao, and Y. Shi. Influence maximization: Near-optimal time complexity meets practical efficiency. In