Sketch-based Influence Maximization and Computation: Scaling up with Guarantees

Abstract

Propagation of contagion through networks is a fundamental process. It is used to model the spread of information, influence, or a viral infection. Diffusion patterns can be specified by a probabilistic model, such as Independent Cascade (IC), or captured by a set of representative traces. Basic computational problems in the study of diffusion are influence queries (determining the potency of a specified seed set of nodes) and Influence Maximization (identifying the most influential seed set of a given size). Answering each influence query involves many edge traversals, and does not scale when there are many queries on very large graphs. The gold standard for Influence Maximization is the greedy algorithm, which iteratively adds to the seed set a node maximizing the marginal gain in influence. Greedy has a guaranteed approximation ratio of at least (1-1/e) and actually produces a sequence of nodes, with each prefix having approximation guarantee with respect to the same-size optimum. Since Greedy does not scale well beyond a few million edges, for larger inputs one must currently use either heuristics or alternative algorithms designed for a pre-specified small seed set size. We develop a novel sketch-based design for influence computation. Our greedy Sketch-based Influence Maximization (SKIM) algorithm scales to graphs with billions of edges, with one to two orders of magnitude speedup over the best greedy methods. It still has a guaranteed approximation ratio, and in practice its quality nearly matches that of exact greedy. We also present influence oracles, which use linear-time preprocessing to generate a small sketch for each node, allowing the influence of any seed set to be quickly answered from the sketches of its nodes.


EDITH COHEN (Microsoft), DANIEL DELLING (Microsoft), THOMAS PAJOR (Microsoft), RENATO F. WERNECK (Microsoft)

arXiv [cs.DS], Aug 2014

The spread of contagion (information diffusion or spread of an infection) is a universal phenomenon that is extensively studied in the context of physical, biological, and social networks. Such cascades can have one or multiple sources (or seeds) and spread from infected nodes to neighbors through the link structure. A motivating application for the study of influence is viral marketing strategies [14, 23], in which the influence of a set $S$ of people in a social network is the number of adoptions triggered if we give $S$ free copies of a product.
The problem also has important applications beyond social graphs, such as placing sensors in water distribution networks for detecting contamination [20].

A popular model for information diffusion is Independent Cascade (IC), in which an independent random variable is associated with each (directed) edge $(u, v)$ to model the degree of influence of $u$ on $v$. A single propagation instance is obtained by instantiating all edge variables. We then study the distribution of a property of interest, such as the number of infected nodes, over these random instances.

The simplest and most studied IC model is binary IC, in which the range of the edge random variables is binary. A biased coin of probability $p_{uv}$ is flipped for each directed edge $(u, v)$. Accordingly, the edge can be either live, meaning that once $u$ is infected, $v$ is also infected, or null. This model was formalized in a seminal work by Kempe et al. [19] and is based on earlier studies by Goldenberg et al. [14]. Note that each direction of an undirected edge $\{u, v\}$ may have its own independent random variable, since influence is not necessarily symmetric. A particular propagation instance is specified by the set of live edges, and a node is infected by a seed set $S$ in this instance if and only if it is reachable from a seed node. The influence of $S$ is formally defined as the expectation, over instances, of the number of infected nodes.

Instead of working directly on this probabilistic IC model, Kempe et al. [19] proposed a simulation-based approach, in which a set $\{G^{(i)}\}$ of propagation instances (graphs) is generated in Monte Carlo fashion according to the influence model. The average influence of $S$ on $\{G^{(i)}\}$ is an unbiased estimate that converges to the expectation on the probabilistic model.
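The simulation-based view of binary IC can be illustrated with a minimal Python sketch. The edge representation (a dict mapping `(u, v)` to a probability), the function names, and the toy graph are all assumptions made for illustration; they do not come from the paper.

```python
import random
from collections import deque

def sample_instance(edges, rng):
    """Flip one biased coin per directed edge and keep the live edges."""
    live = {}
    for (u, v), p in edges.items():
        if rng.random() < p:
            live.setdefault(u, []).append(v)
    return live

def reachable(live, seeds):
    """Nodes infected by `seeds` in one instance: BFS over live edges."""
    seen = set(seeds)
    queue = deque(seeds)
    while queue:
        u = queue.popleft()
        for v in live.get(u, ()):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen

def estimate_influence(edges, seeds, num_instances=1000, seed=0):
    """Average number of infected nodes over Monte Carlo instances."""
    rng = random.Random(seed)
    total = 0
    for _ in range(num_instances):
        total += len(reachable(sample_instance(edges, rng), seeds))
    return total / num_instances

# Hypothetical toy model: a -> b always, b -> c half the time, c -> d never.
toy_edges = {("a", "b"): 1.0, ("b", "c"): 0.5, ("c", "d"): 0.0}
print(estimate_influence(toy_edges, ["a"]))
```

In the toy model the expected influence of `a` is 2.5, so the printed average should be close to that value, with the gap shrinking as `num_instances` grows.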
The ability to compute influence with respect to an arbitrary set of propagation instances has significant advantages, as it is useful for instances generated from traces or by more complex models [16, 1], which exhibit correlations between edges that cannot be captured by the simplified IC model [15]. Moreover, the average behavior of a probabilistic model on a small set of instances captures its “typical” behavior, which is often more relevant than the expected value when the variance is very high.

A basic primitive in the study of influence is the influence query: compute (or approximate) the influence of a query set $S$ of seed nodes. With binary influence, this amounts to performing graph searches from the seed set in multiple instances. Unfortunately, this does not scale well when many queries are posed over graphs with millions of nodes.

Even more computationally challenging is the fundamental Influence Maximization problem, which is finding the most potent seed set of a certain size or cost. The problem was formalized by Kempe et al. [19] and inspired by Richardson and Domingos [23]. Kempe et al. showed that, even when the influence function is deterministic (but the number $s$ of seeds is a parameter), the problem encodes the classic Max Cover problem and therefore is NP-hard [19]. Moreover, an inapproximability result of Feige [13] implies that any algorithm that can guarantee a solution that is at least $(1 - 1/e + \epsilon)$ times the optimum is likely to scale poorly with the number of seeds. Chen et al. [5] showed that computing the exact influence of a single seed in the binary IC model, even when edge probabilities are $p = 0.5$, is #P-hard.

We consider the problem of finding a set $S$ of seeds with maximum average influence over a fixed set of propagation instances. A natural heuristic is to use the set of most influential individuals, say those with high degree or centrality [19], as seeds. This approach, however, cannot account for the dependence between seeds, missing the fact that two important nodes may “cover” essentially the same communities. Kempe et al. [19] proposed a greedy algorithm (Greedy) instead. It starts with an empty seed set and iteratively adds to $S$ the node with maximum marginal gain in influence (relative to the current seed set). Since our objective is monotone and submodular, a classical result from Nemhauser et al. [21] implies that the influence of the greedy solution with $s$ seeds is at least $1 - (1 - 1/s)^s \ge 1 - 1/e > 63\%$ of the best possible for any seed set of the same size. From Feige’s inapproximability result, this is the best approximation ratio guarantee we can (asymptotically and realistically) hope for.
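As an illustration of the greedy selection just described, here is a small Python sketch that runs exact greedy over a fixed set of instances. The instance format (one adjacency dict per instance), the function names, and the toy data are assumptions for this example only, not the implementation of Kempe et al.

```python
from collections import deque

def reach(adj, seeds):
    """Set of nodes reachable from `seeds` in one instance (BFS)."""
    seen = set(seeds)
    queue = deque(seeds)
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen

def influence(instances, seeds):
    """Average number of reachable nodes over all instances."""
    return sum(len(reach(adj, seeds)) for adj in instances) / len(instances)

def greedy(instances, nodes, s):
    """Pick s seeds, each maximizing the marginal gain; ties go to the earlier node."""
    seeds, gains, current = [], [], 0.0
    for _ in range(s):
        best, best_val = None, -1.0
        for v in nodes:
            if v in seeds:
                continue
            val = influence(instances, seeds + [v])
            if val > best_val:
                best, best_val = v, val
        seeds.append(best)
        gains.append(best_val - current)
        current = best_val
    return seeds, gains

# Hypothetical toy input: two instances over nodes 0..5.
instances = [{0: [1, 2], 3: [4]}, {0: [1], 3: [4, 5]}]
print(greedy(instances, range(6), 2))
```

Every candidate evaluation performs one graph search per instance, which is exactly the cost that makes exact greedy expensive on large inputs.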

Greedy has become the gold standard for influence maximization, in terms of the quality of the results. Greedy, however, does not scale to modern real-world social networks. The issue is that evaluating the marginal contribution of each node requires a directed reachability computation in each instance (of which there can be hundreds). Several performance improvements to Greedy have thus been proposed. Leskovec et al. [20] proposed CELF, which uses “lazy” evaluations of the marginal contribution, performed only when a node is a candidate for the highest marginal contribution. Chen et al. [6] took a different approach, using the reachability sketches of Cohen [7] to speed up the reevaluation of the marginal contribution of all nodes. Even with these and other accelerations [17, 22], the best current implementations of Greedy do not scale to networks beyond a few million edges [5], which are quite small by modern standards.

To support massive graphs, several studies proposed algorithms specific to the IC model, which work directly with the edge probabilities instead of with simulations and thus cannot be reliably applied to a set of arbitrary instances. Borgs et al. [3] recently proposed an algorithm based on reverse reachability searches from sampled nodes, similar in spirit to the approach used for reachability sketching [7]. Their algorithm provides theoretical guarantees on the approximation quality and has good asymptotic performance, but large “constants.” Very recently, Tang et al. [25] developed TIM, which engineers the (mostly theoretical) algorithm of Borgs et al. [3] to obtain a scalable implementation with guarantees. A significant drawback of this approach is that it only works for a pre-specified seed set size $s$, whereas Greedy produces a sequence of nodes, with each prefix having an approximation guarantee with respect to the same-size optimum. In applications we are often interested not in a single point, but in a trade-off curve that allows us to find a sweet spot of influence per cost or characterize the network. TIM also scales very poorly with the seed set size $s$, and the evaluation only considered seed sets of up to 50 nodes.

The DegreeDiscount [6] heuristic refines the natural approach of adding the next highest degree node. MIA [5] converts the binary IC sampling probabilities $p_e$ to deterministic edge weights and works essentially with one deterministic instance. IRIE, by Jung et al. [18], is a heuristic approximation of greedy addition of seed nodes, and has the best performance we are aware of for an algorithm that produces a sequence of seed nodes. In each step, the probability of each node to be covered by the current seed set $S$ is estimated using another algorithm (or simulations). They then use eigenvector computations to approximate marginal contributions of all nodes. Of those approaches, the IRIE heuristic scales much better and is much more accurate than other heuristics. In particular, it performs nearly as well as Greedy on many research collaboration graphs [18].

Contributions.

We design a novel sketch-based approach for influence computation which offers scalability with performance guarantees. Our main contribution is SKIM (SKetch-based Influence Maximization), a highly scalable (approximate) implementation of the greedy algorithm for influence maximization. We also introduce influence oracles: after preprocessing that is almost linear, we can answer influence queries very efficiently, considering only the sketches of the query seed set.

We can apply our design on inputs specified as a fixed set of propagation instances, as in Kempe et al. [19], with influence defined as the average over them. We also handle inputs specified as an IC model, where influence is defined as the expectation. Our model is defined precisely in Section 2.

We now provide more details on our design. The exact computation of an influence query requires expensive graph searches from the query seed set $S$ on each of $\ell$ instances. The exact greedy algorithm for Influence Maximization requires a similar computation for each marginal contribution. We address this scalability issue by working with sketches.

The core of our approach is a set of per-node summary structures which we call combined reachability sketches. The sketch of a node compactly represents its influence “coverage” across $\ell$ instances; we call this its combined reachability set. The combined reachability sketch of a node, precisely defined in Section 3, is the bottom-$k$ min-hash sketch [10, 8] of the combined reachability set of the node. This generalizes the reachability sketches of Cohen [7], which are defined for a single instance. The parameter $k$ is a small constant that determines the tradeoff between computation and accuracy. Bottom-$k$ sketches of sets support cardinality estimation, which means that we can estimate the influence (over all instances) of a node or of a set of nodes from their combined reachability sketches. The estimate has a small relative error and good concentration [7].
Our use of combination sketches and state-of-the-art optimal estimators is key to obtaining the best balance between sketch size and accuracy.

Our SKIM algorithm for influence maximization is presented in Section 4. It scales by running the greedy algorithm in “sketch space,” always taking a node with the maximum estimated (rather than exact) marginal contribution.

SKIM computes combined reachability sketches, but only until the node with the maximum estimated influence is found. This node is then added to the seed set. We then update the sketches to be with respect to a residual problem in which the node that is selected into the seed set and its “influence” are no longer present. SKIM then resumes the sketch computation, starting with the residual sketches, but (again) stopping when a node with maximum estimated influence (in the current, residual, instance) is found. A new residual problem is then computed. This process is iterated until the seed set reaches the desired size. Since the residual problem becomes smaller with iterations, we can compute a very large seed set very efficiently. We also prove that the total overhead of the updates required to maintain the residual sketches is small. In particular, for a set $\{G^{(i)}\}$ of $\ell$ arbitrary instances, the algorithm can be run to exhaustion, producing a full permutation of the nodes in $O(\sum_{i \in [\ell]} |G^{(i)}| + m\epsilon^{-2}\log^2 n)$ time, where $m$ is the sum over nodes of the maximum indegree (over instances). For all $s \ge 1$, the first $s$ nodes we select have, with very high probability (at least $1 - 1/n^c$ for a constant $c$), influence that is at least $1 - (1 - 1/s)^s - \epsilon$ times the maximum influence of a seed set of the same size $s$. These are worst-case bounds. We propose an adaptive approach that exploits properties of actual networks, in particular a skewed influence distribution, to achieve faster running times with the same guarantees.

Our use of the residual instances by SKIM is the key for maintaining the accuracy of the greedy selection through the execution and providing, with high probability, approximation ratio guarantees that nearly match those of exact Greedy.

Section 5 presents our influence oracles, which preprocess the input to compute combined reachability sketches for all nodes. For instances $\{G^{(i)}\}$ with $n$ nodes and $m^{(i)}$ edges, the sketches are built in $O(k \sum_i m^{(i)})$ total time. The influence of a set $S \subseteq V$ can then be approximated from the sketches of the nodes in $S$. The oracle applies the union cardinality estimator of Cohen and Kaplan [11] to estimate the union of the influence sets of the seed nodes. The query runs in time $O(|S| k \log |S|)$, and the estimate is unbiased, with a well-concentrated relative error of $\epsilon = 1/\sqrt{k}$. While preprocessing depends on the number of instances, the sketch size and the approximation quality only depend on the sketch parameter $k$.

The asymptotic bounds we obtain are novel also from a theoretical perspective, and significantly improve the state of the art, even for influence maximization on a single (deterministic) instance (select a seed set in a directed graph with maximum reachable set).

Section 6 presents an extensive experimental study. Besides demonstrating the scalability of our algorithms on real-world networks, we compare SKIM with existing approaches, including exact Greedy (when size allows), the state-of-the-art IRIE heuristic, and TIM.
We obtain IC models from networks by using the well-studied weighted and uniform [19] probabilities. Our algorithms scale up to very large graphs with barely any compromise on quality over exact Greedy, with theoretical guarantees. On instances generated by an IC model, we achieve more than an order of magnitude speedup over the best greedy heuristics, which are designed specifically for this model. Even for a fixed small seed set size, SKIM is significantly faster than TIM.

Moreover, our algorithm is efficient and accurate enough to be executed exhaustively, producing a full permutation of the nodes for networks with billions of edges. For the first time, we provide the full (approximate) Pareto front of influence versus seed set size. These relations showcase a basic property of the network, and the general pattern that a small fraction of nodes influences a large fraction of the network. In contrast, most previous studies we are aware of only considered seed sets with at most 50 nodes, revealing only a very restricted view of this relation.

A propagation instance $G = (V, E)$ is specified by the edge set $E$. The influence of a set of nodes $S$ in instance $G$ is the number of nodes reachable from $S$ using the edges $E$:

$$\mathrm{Inf}(G, S) = |\{u \mid S \leadsto u\}|, \tag{1}$$

where the predicate $S \leadsto u$ holds if $u \in S$ or if there is a forward path from a node in $S$ to the node $u$. Our input is specified as a set $\mathcal{G} = \{G^{(i)}\}$ of $\ell \ge 1$ propagation instances $G^{(i)} = (V, E^{(i)})$ on the same set of nodes. The influence of $S$ over all instances $\{G^{(i)}\}$ is the average single-instance influence:

$$\mathrm{Inf}(\mathcal{G}, S) = \mathrm{Inf}(\{G^{(i)}\}, S) = \frac{1}{\ell} \sum_{i \in [\ell]} \mathrm{Inf}(G^{(i)}, S). \tag{2}$$

The set of propagation instances can be derived from cascade traces or generated by a probabilistic model.

The input can also be specified as a probabilistic model, such as Independent Cascade (IC) [19], which defines a distribution $\mathcal{G}$ over instances $G \sim \mathcal{G}$ that share a set $V$ of nodes. In this case, the influence of $S$ is defined as the expectation

$$\mathrm{Inf}(\mathcal{G}, S) = \mathbb{E}_{G \sim \mathcal{G}}\, \mathrm{Inf}(G, S). \tag{3}$$

We are interested in influence oracles and in influence maximization.
Influence queries are specified by a seed set $S \subset V$ and the goal is to compute (or estimate) the influence $\mathrm{Inf}(\mathcal{G}, S)$. Influence oracles, after efficient preprocessing of the input, allow us to support very fast queries. Influence maximization is the problem of finding a seed set $S \subset V$ with maximum influence, where $|S| = s$ is given. We are interested in efficiently computing a seed set whose influence is close to the maximum one, as well as in computing a sequence of seeds so that each prefix has influence that is close to maximum for its size.

At the heart of our approach are combined reachability sketches, which are summary structures $X_u$ that we associate with each node $u$. The combined sketches can be defined with respect either to a set $\mathcal{G} = \{G^{(i)}\}$ of $\ell \ge 1$ instances or to a probabilistic model $\mathcal{G}$.

We first consider as input a set of $\ell \ge 1$ instances. We define the reachability set of a node $u$ in instance $G$ as $R(G, u) = \{v \mid u \leadsto_G v\}$, where $u \leadsto_G v$ means that $v$ is reachable from $u$ in $G$. Considering all instances, the combined reachability set is a set of node-instance pairs: $R_u = \{(v, i) \mid u \leadsto_{G^{(i)}} v\}$. The influence of a set of nodes $S$ on instances $\{G^{(i)}\}$ can thus be expressed as

$$\mathrm{Inf}(\{G^{(i)}\}, S) = \frac{1}{\ell} \sum_{i \in [\ell]} \Big| \bigcup_{u \in S} R(G^{(i)}, u) \Big| = \frac{1}{\ell} \Big| \bigcup_{u \in S} R_u \Big|. \tag{4}$$

This is the average over the instances $\{G^{(i)}\}$ (with $i \in [\ell]$) of the number of nodes reachable from at least one node in $S$.

The combined reachability sketch of a node captures its reachability information across instances. The sketches we use are the bottom-$k$ min-hash sketches [7, 10] $X_v$ of the combined reachability sets $R_v$: We associate with each node-instance pair $(v, i)$ an independent random rank value $r_v^{(i)} \sim U[0, 1]$, where $U[0, 1]$ is the uniform distribution on $[0, 1]$. The combined reachability sketch of $u$ is the set of the $k$ smallest rank values amongst $\{r_v^{(i)} \mid (v, i) \in R_u\}$:

$$X_u = \text{Bottom-}k\, \{r_v^{(i)} \mid (v, i) \in R_u\}, \tag{5}$$

where Bottom-$k$ of a set is its subset consisting of the $k$ smallest values. When there is a single instance ($\ell = 1$) the combined reachability sketches are the same as the reachability sketches of Cohen [7].

We define the threshold rank $\tau_u$ of each node $u$ as

$$\tau_u = k^{\text{th}}\big(\{r_v^{(i)} \mid (v, i) \in R_u\}\big), \tag{6}$$

which is the $k$th lowest rank value in $R_u$. (For a set $Y$ of cardinality $|Y| < k$, we define $k^{\text{th}}(Y) \equiv 1$.) When $|X_u| = k$ we have $\tau_u = \max\{X_u\}$, and $\tau_u = 1$ otherwise. The cardinality $|R_u|$ can be estimated from $X_u$ using a bottom-$k$ cardinality estimator. The estimate is $|X_u|$ if $\tau_u = 1$ (i.e., if $|X_u| < k$) and is $(k - 1)/\tau_u$ otherwise. This estimate has a Coefficient of Variation (CV), which is the ratio of the standard deviation to the mean, that is never more than $1/\sqrt{k - 2}$. Moreover, for any constant $c > 1$, we obtain that using $k = (2 + c)\epsilon^{-2} \ln n$, the probability of having relative error larger than $\epsilon$ is at most $1/n^c$. Therefore, we can be correct with high probability on estimating the influence of all nodes.

Instead of using ranks drawn from $U[0, 1]$, we can work with a random permutation of the $n\ell$ node-instance pairs. We can also structure the permutation so that each sequence in positions $in + 1$ to $(i + 1)n$, for integral $i \ge 0$, is a permutation of the $n$ nodes; the instance paired with node $v$ in chunk $i$ is randomly selected from instances $j$ for which the pair $(v, j)$ does not have a permutation rank of $in$ or less (independently for each node). One can show that this can only improve estimation accuracy [8]. Only the first $\min\{k, \ell\}\, n$ positions can be included in combined reachability sketches of nodes.

When estimating influence, we can convert permutation ranks to random ranks using the exponential distribution [7]. We can also estimate cardinality of a subset of the $D = n\ell$ elements directly from permutation ranks in $[D]$, using the unbiased estimator $1 + (k - 1)(D - 1)/(T - 1)$, where $T$ is the $k$th smallest permutation rank. This estimator can be interpreted as setting aside the element with permutation rank $T$, and estimating the fraction (of the other $D - 1$ elements) that belongs to the subset from the ranks below $T$, which is $(k - 1)/(T - 1)$.

We now define sketches with respect to a binary IC model $\mathcal{G}$, presented as a graph with probabilities $p_e$ associated with its edges. The influence of a set of nodes $S$ is

$$\mathrm{Inf}(\mathcal{G}, S) = \mathbb{E}_{G \sim \mathcal{G}} \Big| \bigcup_{u \in S} R(G, u) \Big|. \tag{7}$$

The sketches we define for $\mathcal{G}$ also contain at most $k$ rank values, but provide approximation guarantees with respect to (7). The sketches can be interpreted as the sketches computed for $\ell$ instances generated according to the model $G \sim \mathcal{G}$ as $\ell \to \infty$. When doing so, at the limit, each unique rank value corresponds to a unique instance, so we do not need to explicitly represent “instances.” We work with structured permutation ranks (Section 3.1). Since it suffices to consider the first $kn$ ranks, this conveniently removes the dependence of the rank representation on $\ell$.
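The bottom-$k$ estimator with uniform ranks can be tried out with a short Python sketch. All names and the synthetic data below are assumptions for illustration. The `merge` helper estimates a union by taking the bottom-$k$ of the merged sketches, which is a simpler substitute for the Cohen and Kaplan union estimator [11] that the oracle uses.

```python
import random

def bottom_k(ranks, k):
    """Sorted list of the k smallest rank values (the sketch)."""
    return sorted(ranks)[:k]

def estimate_cardinality(sketch, k):
    """|X| if the sketch is not full (threshold rank 1), else (k - 1) / tau."""
    if len(sketch) < k:
        return float(len(sketch))
    tau = sketch[k - 1]  # threshold rank: k-th smallest value
    return (k - 1) / tau

def merge(sketches, k):
    """Bottom-k sketch of the union of the sets behind the given sketches."""
    return sorted(set().union(*sketches))[:k]

# Synthetic example: one uniform rank per (node, instance) pair.
rng = random.Random(7)
n, ell, k = 2000, 5, 256
rank = {(v, i): rng.random() for v in range(n) for i in range(ell)}
set_a = [p for p in rank if p[0] < 1200]   # stands in for one combined reachability set
set_b = [p for p in rank if p[0] >= 800]   # overlaps with set_a
sketch_a = bottom_k([rank[p] for p in set_a], k)
sketch_b = bottom_k([rank[p] for p in set_b], k)
sketch_union = merge([sketch_a, sketch_b], k)
print(estimate_cardinality(sketch_a, k), len(set_a))
print(estimate_cardinality(sketch_union, k) / ell, n)  # influence estimate vs. truth
```

Dividing the union estimate by `ell` mirrors Equation (4): influence is the cardinality of the union of combined reachability sets, averaged over instances.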
We can similarly apply an estimator to the $k$th smallest rank $T \le kn$ to estimate influence: Instead of estimating cardinality (which goes to infinity with $\ell$) and dividing by $\ell$ using the estimator $\frac{1}{\ell} + \frac{(k - 1)(n\ell - 1)}{\ell (T - 1)}$, we take the limit as $\ell \to \infty$ and estimate influence using $n(k - 1)/(T - 1)$.

In this section we present our Sketch-based Influence Maximization (SKIM) algorithm. We first review Greedy, the greedy algorithm for influence maximization (working with $\ell$ instances) presented by Kempe et al. [19]. Greedy is applied with respect to the influence objective $\mathrm{Inf}(\mathcal{G}, S)$, as defined in Equation (2). It starts with an empty seed set $S = \emptyset$. In each iteration, it adds to $S$ the node $v$ with maximum marginal gain,

$$\mathrm{Inf}(\mathcal{G}, S \cup \{v\}) - \mathrm{Inf}(\mathcal{G}, S) = \frac{1}{\ell} \Big| \bigcup_{u \in S \cup \{v\}} R_u \setminus \bigcup_{u \in S} R_u \Big|. \tag{8}$$

This is the same as choosing $v$ maximizing $\mathrm{Inf}(\mathcal{G}, S \cup \{v\})$.

SKIM approximates exact Greedy by ensuring that at each iteration, with sufficiently high probability, or in expectation over iterations, the node we choose to add to the seed set has a marginal gain that is close to the maximum one. To do so, it suffices to compute sketches only to the point that the node with the maximum estimated marginal gain is revealed. To maintain accuracy, we maintain a residual problem and respective sketches.

SKIM constructs (partial) combined reachability sketches by adapting a construction of reachability sketches [7]: It processes node-instance pairs $(u, i)$ by increasing rank, performing a reverse reachability search in $G^{(i)}$ from $u$. The sketch $X_v$ of each visited node $v$ is augmented with the rank $r_u^{(i)}$ of the pair. For a given value of $k$, the first node $u$ whose sketch reaches size $k$ is also the node with maximum estimated influence. This is because the bottom-$k$ cardinality estimate of a node depends only on the $k$th smallest rank in $X_u$, $\tau_u$ (which is a complete sufficient statistic for cardinality estimation from the sketch [8]); see Equation (6). For the node $u$, $\tau_u$ is equal to the rank $r_u^{(i)}$ of the last processed pair $(u, i)$. For other nodes $v$ with incomplete sketches, we know that $\tau_v \ge r_u^{(i)}$, so their estimate is lower.

Sketch building is suspended once the node $v$ with maximum estimated influence is found. SKIM then adds $v$ to the seed set and generates a residual problem, with $v$ and all node-instance pairs it covers removed from the instances $\mathcal{G}$.
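The procedure described so far (rank-ordered reverse searches, stopping at the first sketch of size $k$, then covering everything the new seed reaches) can be sketched in Python as below. This is a simplified illustration under stated assumptions, not the authors' implementation: instances are adjacency dicts over nodes `0..n-1`, the shuffled position of a pair serves as its permutation rank, only sketch sizes are stored, and the name `skim` and its parameters are hypothetical.

```python
import random
from collections import deque

def skim(instances, n, k, num_seeds, seed=0):
    """Return [(seed node, marginal influence), ...] in selection order."""
    ell = len(instances)
    reverse = []                       # reverse adjacency for each instance
    for adj in instances:
        rev = {}
        for u, outs in adj.items():
            for v in outs:
                rev.setdefault(v, []).append(u)
        reverse.append(rev)

    pairs = [(u, i) for u in range(n) for i in range(ell)]
    random.Random(seed).shuffle(pairs) # list position = permutation rank
    covered, chosen = set(), set()
    size = [0] * n                     # current sketch size of every node
    index = {}                         # pair -> nodes whose sketch holds its rank
    seedlist, pos = [], 0

    while len(seedlist) < min(num_seeds, n):
        x = None
        while x is None and pos < len(pairs):     # build sketches
            u, i = pairs[pos]
            pos += 1
            if (u, i) in covered:
                continue
            holders = index.setdefault((u, i), [])
            seen, queue = {u}, deque([u])
            while queue:                          # reverse BFS from u
                v = queue.popleft()
                size[v] += 1
                holders.append(v)
                if size[v] == k:                  # first full sketch: next seed
                    x = v
                    break
                for w in reverse[i].get(v, ()):
                    if w not in seen:
                        seen.add(w)
                        queue.append(w)
        if x is None:                             # ranks exhausted: sizes are exact
            x = max((v for v in range(n) if v not in chosen),
                    key=lambda v: size[v])
        gain = 0
        for i, adj in enumerate(instances):       # residual problem
            if (x, i) in covered:
                continue
            covered.add((x, i))
            queue = deque([x])
            while queue:                          # forward BFS, pruned at covered pairs
                v = queue.popleft()
                gain += 1
                for w in index.pop((v, i), ()):   # drop this rank from all sketches
                    size[w] -= 1
                for w in adj.get(v, ()):
                    if (w, i) not in covered:
                        covered.add((w, i))
                        queue.append(w)
        chosen.add(x)
        seedlist.append((x, gain / ell))
    return seedlist

# Hypothetical toy input: a single instance with a star and a separate edge.
print(skim([{0: [1, 2, 3, 4], 5: [6]}], n=7, k=2, num_seeds=2))
```

The marginal influences returned with each seed are exact, because they are counted during the forward searches that build the residual problem; only the choice of the next seed relies on sketch estimates.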
The (partially computed) sketches of each remaining node $u$ are updated using $X_u \leftarrow X_u \setminus X_v$, which deletes from the sketch the ranks of all covered node-instance pairs.

The process of building sketches is then resumed on the residual problem, working with updated partial sketches and instances. We continue processing node-instance pairs in increasing rank order, starting from the first rank that exceeds $\tau_v$ and skipping pairs that are already covered.

We provide pseudocode for SKIM as Algorithm 1. Instead of maintaining the actual partial sketches $X_v$, the algorithm only keeps their cardinalities size[$v$]. To support correct and efficient updates of the sketches, we maintain an inverted index index[$u, i$] that lists, for each rank value $r_u^{(i)}$ we processed, all nodes $v$ such that $r_u^{(i)} \in X_v$. The entry for rank $r_u^{(i)}$ is created and populated when we perform a reverse reachability search from pair $(u, i)$. The algorithm outputs the list seedlist of pairs $(\sigma_i, I_i)$, where $\{\sigma_i\}$ is a permutation of the nodes according to the order they are selected into the seed set, and $I_i$ is the marginal influence of $\sigma_i$. The surprising property of our construction is that this whole iterative process is very efficient. If we run SKIM with a fixed $k = c\epsilon^{-2} \log n$, Section 4.1 will show that we obtain the following worst-case performance guarantees:

Theorem 4.1.

SKIM runs in time $O(n\ell + \sum_i |E^{(i)}| + m\epsilon^{-2} \log^2 n)$, where $m = \sum_v \max_i \mathrm{InDeg}^{(i)}(v) \le |\bigcup_i E^{(i)}|$. The permutation $\{\sigma_i\}$ of nodes has the property that with probability $1 - 1/n^{\Omega(c)}$, for all $s \in [n]$, the set of seed nodes $S = \{\sigma_1, \ldots, \sigma_s\}$ has

$$\mathrm{Inf}(\{G^{(i)}\}, S) \ge (1 - 1/e - \epsilon) \max_{Z : |Z| \le s} \mathrm{Inf}(\{G^{(i)}\}, Z).$$

Algorithm 1: Sketch-based Influence Maximization

    // Initialization
    forall the pairs (u, i) do covered[u, i] ← false
    forall the nodes v do size[v] ← 0
    index ← hash map of node-instance pairs to nodes
    seedlist ← ∅                          // List of seeds & marg. influences
    rank ← 0
    shuffle the nℓ node-instance pairs (u, i)

    // Compute seed nodes
    while |seedlist| < n do
        while rank < nℓ do                // Build sketches
            rank ← rank + 1
            (u, i) ← rank-th pair in shuffled sequence
            if covered[u, i] = false then
                BFS from u in reverse graph G^(i), during which
                foreach scanned node v do
                    size[v] ← size[v] + 1
                    index[u, i] ← index[u, i] ∪ {v}
                    if size[v] = k then
                        x ← v             // Next seed node
                        abort sketch building
        if all nodes u have size[u] < k then
            x ← argmax_{u ∈ V} size[u]
        I_x ← 0                           // The coverage of x
        forall the instances i do         // Residual problem
            (forward) BFS from x in graph G^(i), during which
            foreach scanned node v do
                if covered[v, i] then prune
                I_x ← I_x + 1
                covered[v, i] ← true      // Cover v in i
                forall the nodes w in index[v, i] do
                    size[w] ← size[w] − 1
                index[v, i] ← ⊥           // Erase (v, i) from index
        I_x ← I_x / ℓ
        seedlist.append(x, I_x)
    return seedlist

It is not hard to show that the influence of a node $v$ in the residual problem of iteration $i$ is equal to its marginal influence with respect to $S = \{\sigma_1, \ldots, \sigma_{i-1}\}$ in the original problem. Therefore, $I_i$, which is the influence of $\sigma_i$ in the residual problem of iteration $i$, is the marginal influence of $\sigma_i$ with respect to $S = \{\sigma_1, \ldots, \sigma_{i-1}\}$ in the original problem. Thus, by definition, for all $s \in [n]$ and $S = \{\sigma_1, \ldots, \sigma_s\}$, $\mathrm{Inf}(\{G^{(i)}\}, S) = \sum_{i \in [s]} I_i$.

We also show that the partial sketches correctly capture a component of the sketches computed for the residual problem:

Lemma 4.1.

At the end of an iteration selecting $v$, each updated partial sketch $X_u$ is equal to the set of entries of the combined reachability sketch $X_u$ of $u$ in the residual problem that have rank value at most $\tau_v$.

Proof sketch. The content of each sketch $X_u$ before computing the residual is clearly a superset of all reachable node-instance pairs $(z, i)$ with rank $r_z^{(i)} \le \tau_v$ in the residual problem. We can then verify that an entry is removed from $X_u$ if and only if it belongs to a covered node-instance pair with $r_z^{(i)} \le \tau_v$.

We now analyze the running time of SKIM. All updates of the residual problem together take time linear in the size of $\{G^{(i)}\}$, since nodes and edges that are covered by the current seed set are removed once visited and never considered again. The remaining component of the computation is determined by the number of times ranks are inserted into (and removed from) sketches. Inserting a value to $X_u$ involves a scan of all (remaining) incoming edges to $u$ in an instance. Removals of ranks can be charged to insertions. So we need to bound the total number of rank insertions:

Lemma 4.2.

The expected total number of rank insertions at a particular node is $O(k \ln n)$.

Proof sketch. Consider a sketch $X_v$. We can show, viewing the sketches as uniform samples of reaching pairs, that each rank value removal corresponds to the cardinality, and hence the influence (marginal gain), being reduced in expectation by a factor of $1 - 1/k$. The initial influence is at most $n$, so there are at most $k \ln(nk)$ insertions until the marginal influence is reduced below $1/k$, at which point we do not need to consider the node.

The running time is dominated by the sum over nodes $v$ of the number of times a rank is inserted to the sketch of $v$, times the in-degree of $v$ (the maximum over instances). From the lemma, we obtain a bound of $O(km \ln n)$ on the total number of insertions. Thus, we obtain a bound of $O(km \ln n + \sum_i |G^{(i)}|)$ on the running time of the algorithm.

To obtain an approximation that is within $1 + \epsilon$ with good probability, we can choose a fixed $k = c\epsilon^{-2} \log n$, for some constant $c$. The relative error of each influence estimate of a node in an iteration is at most $\epsilon$ with probability at least $1 - 1/n^c$. Since we use polynomially many estimates (maximize influence among $n$ nodes in each of at most $n$ iterations), all estimates are within a relative error of $\epsilon$ with probability at least $1 - 1/n^{c-2}$, which is polynomially close to 1. Lastly, we bound the approximation ratio of the “approximate” greedy algorithm we work with, which uses seeds with close to maximum instead of maximum marginal gain:

Lemma 4.3.

With any submodular and monotone objective function, approximate greedy, which iteratively chooses a node with marginal gain that is at least $(1 - \delta)$ of the maximum, has an approximation ratio of at least $1 - (1 - 1/s)^s - O(\delta)$. The same claim holds in expectation when the selection is well concentrated, that is, its probability of being below $(1 - a\delta)$ times the maximum decreases exponentially with $a > 1$.

Proof. The argument extends the analysis of exact greedy by Nemhauser et al. [21]. For any $s$, and after selecting any set $U$ of seeds, the maximum marginal gain by adding a single node is always at least $1/s$ of the maximum possible gain for $s$ nodes. When using the approximation, this is at least $(1 - \delta)/s$ of the maximum possible gain. Therefore, after approximate greedy selection of $s$ nodes, the influence is at least a fraction $1 - (1 - (1 - \delta)/s)^s \ge 1 - (1 - 1/s)^s - O(\delta)$ of the optimum, using the first-order term of the Taylor expansion.

This worst-case analysis is too pessimistic, both for the approximation ratio and for the running time. In our experiments, we tested SKIM with a fixed $k$, and observed that the computed seed sets had influence that is much closer to the exact greedy selection than indicated by the worst-case bounds.

The explanation is that the influence distribution on real inputs is heavy-tailed, with the vast majority of nodes having a much smaller influence than the node of maximum influence. One factor of $O(\log n)$ in the worst-case running time is due to a "union bound" ensuring a relative error of $\epsilon$ for all nodes in all iterations, with high probability. With a heavy-tailed distribution, we can identify the maximum with a small error if we ensure a small error only on the few nodes that have influence close to the maximum. Furthermore, when the maximum influence is separated from the other influence values, our approximate maximum is more likely to be the node with actual maximum influence.
Moreover, the estimation error over iterations averages out, so as the seed set gets larger we can work with lower accuracy and still guarantee good approximation.

We propose incorporating error estimation that is adaptive rather than worst-case. This facilitates tighter confidence bounds on the estimation quality of our output. It also allows us to adjust the sketch parameter $k$ during computation in order to meet pre-specified accuracy and confidence levels.

Let the discrepancy in an iteration be the gap between the actual maximum and the marginal influence of the selected seed. We will bound the sum of discrepancies across iterations by maintaining a confidence distribution on this sum.

The estimation uses two components. (i) The exact marginal influence $I_s$ of the selected node in each iteration, as well as the sum $I = \sum_{i \le s} I_i$, which is the influence of our seed set. The value $I_s$ is computed when generating the residual problem. (ii) The size of the second largest sketch in each iteration (excluding the last processed rank). Intuitively, if the second largest sketch is much smaller than the first one, it is more likely that the first one is the actual maximum.

We bound the discrepancy in a single iteration using Chernoff bounds. The probability that the sum $Z$ of independent Bernoulli trials falls below its expectation $\mu$ by more than $\nu\mu$ is

$$\Pr[Z < (1 - \nu)\mu] < \left( \frac{\exp(-\nu)}{(1 - \nu)^{(1 - \nu)}} \right)^{\mu}. \tag{9}$$

We use this to bound the probability that the discrepancy exceeds $\Delta\epsilon$, where $\Delta$ is the exact marginal gain of our selected seed node. We consider the second largest sketch size, $k' \le k - 1$ ($\tau$ is not considered part of the sketch even if included). We use $Z = k'$, $\mu = \tau\Delta(1 + \epsilon)$, and $\nu = 1 - \frac{k'}{\tau\Delta(1 + \epsilon)}$ in Equation (9) to obtain a confidence level.

Finally, to maintain an upper bound on the confidence-error distribution of the sum of discrepancies, we take a convolution, after each iteration, of the current distribution with the distribution of the current iteration.

SKIM can be adapted for higher concurrency by running the sketch-building phases in batches of ranks. We can also adapt it to process inputs presented as an IC model instead of as a set of instances. This yields a more efficient implementation than generating a set of instances using simulations and running SKIM on them. In IC-model SKIM, the residual problem is a collection of partial models and sketch building is performed on the probabilistic model. We omit details due to space limitations.
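The approximate greedy rule analyzed in Lemma 4.3 can be illustrated on a small coverage problem. The following Python sketch is only an illustration under stated assumptions: the function names (`approximate_greedy`, `best_coverage`, `lemma_bound`) and the toy coverage objective are hypothetical and are not part of SKIM, and the random choice among near-maximal candidates merely stands in for a selection driven by noisy sketch estimates.

```python
import itertools
import random


def approximate_greedy(sets, s, delta, rng):
    """Select s sets; each pick has marginal gain >= (1 - delta) * max gain.

    `sets` is a list of Python sets. Coverage (size of the union) is a
    monotone submodular objective, as required by Lemma 4.3.
    """
    chosen, covered = [], set()
    for _ in range(s):
        # Marginal gain of every set that has not been selected yet.
        gains = {i: len(sets[i] - covered)
                 for i in range(len(sets)) if i not in chosen}
        best = max(gains.values())
        # Hypothetical stand-in for noisy estimates: any candidate whose
        # gain is within a (1 - delta) factor of the best may be picked.
        candidates = [i for i, g in gains.items() if g >= (1 - delta) * best]
        pick = rng.choice(candidates)
        chosen.append(pick)
        covered |= sets[pick]
    return chosen, len(covered)


def best_coverage(sets, s):
    """Exact optimum for s sets by brute force (tiny inputs only)."""
    return max(len(set().union(*combo))
               for combo in itertools.combinations(sets, s))


def lemma_bound(s, delta):
    """Guaranteed fraction of the optimum: 1 - (1 - (1 - delta)/s)^s."""
    return 1.0 - (1.0 - (1.0 - delta) / s) ** s


if __name__ == "__main__":
    rng = random.Random(0)
    toy = [set(rng.sample(range(40), rng.randint(3, 12))) for _ in range(9)]
    picked, value = approximate_greedy(toy, 3, 0.2, rng)
    print(picked, value, lemma_bound(3, 0.2) * best_coverage(toy, 3))
```

On any such instance the covered size returned by `approximate_greedy` should be at least `lemma_bound(s, delta)` times the brute-force optimum, which is the statement of the lemma before the Taylor expansion is applied.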

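The single-iteration confidence computation based on Equation (9) can be sketched numerically. The Python code below is a minimal illustration, not the authors' implementation: the function names are hypothetical, and it assumes that the threshold $\tau$ is expressed as a uniform rank in $(0, 1)$, so that $\tau\Delta(1 + \epsilon)$ is an expected count of sketch entries.

```python
import math


def chernoff_lower_tail(mu, nu):
    """Right-hand side of Equation (9): (exp(-nu) / (1 - nu)^(1 - nu))^mu."""
    if nu <= 0.0:
        return 1.0  # the bound is vacuous when Z is not below its mean
    if nu >= 1.0:
        return math.exp(-mu)  # limit of the expression as nu -> 1
    # Evaluate in log space to avoid underflow for large mu.
    log_bound = mu * (-nu - (1.0 - nu) * math.log(1.0 - nu))
    return math.exp(log_bound)


def discrepancy_tail(k2, tau, gain, eps):
    """Bound for one iteration, with Z = k2, mu = tau * gain * (1 + eps).

    k2   -- size of the second largest sketch (k' in the text)
    tau  -- threshold rank, assumed here to be a fraction in (0, 1)
    gain -- exact marginal gain (Delta) of the selected seed
    eps  -- relative discrepancy being tested
    """
    mu = tau * gain * (1.0 + eps)
    nu = 1.0 - k2 / mu
    return chernoff_lower_tail(mu, nu)


if __name__ == "__main__":
    # Hypothetical numbers chosen only to exercise the functions.
    print(discrepancy_tail(k2=40, tau=0.01, gain=5000, eps=0.5))
```

A small returned value means that a node whose true marginal gain exceeds $\Delta(1 + \epsilon)$ would be unlikely to show only $k'$ sketch entries, which is how the text turns the second largest sketch size into a confidence level.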
We now present an accurate and efficient oracle for binary influence, which is based on precomputing a combined reachability sketch (as defined in Section 3) for each node. We preprocess a set of $\ell$ instances $\mathcal{G} = \{G^{(i)}\}$ using $O(k \sum_{i=1}^{\ell} |E^{(i)}|)$ computation and working storage of $O(k)$ per node. The preprocessing generates combined reachability sketches $X_v$ of size $O(k)$ for each node $v \in V$.

Theorem 5.1.

Given a set $\{X_v\}$ of combined reachability sketches for $\mathcal{G}$ with parameter $k$, influence queries $\mathrm{Inf}(\mathcal{G}, S)$ for a set $S$ of nodes can be estimated in $O(|S| k \log |S|)$ time from the sketches $\{X_u \mid u \in S\}$. The estimate is nonnegative and unbiased, has CV at most $1/\sqrt{k - 2}$, and is well concentrated, meaning that the probability that the relative error exceeds $a/\sqrt{k}$ decreases exponentially with $a > 1$.

We next present the two components of our oracle: estimating the influence of $S$ from the sketches of the nodes in $S$ and efficiently computing all combined reachability sketches.

We show how to use the combined reachability sketches of a set of nodes $S$ to estimate the influence of $S$, as given in Equation (4). In graph terms, this means estimating the cardinality of the union $\bigcup_{u \in S} R_u$ from the sketches $X_u$, with $u \in S$. The influence $\mathrm{Inf}(\mathcal{G}, S)$ is the union cardinality divided by the number of instances $\ell$ and, accordingly, is estimated using $\widehat{\bigl|\bigcup_{v \in S} R_v\bigr|} / \ell$. Our estimators use the threshold rank $\tau_u$ of each node $u$; see Equation (6).

From the bottom-$k$ sketches of each set $R_u$ for $u \in S$ we can unbiasedly estimate the cardinality of the union $\bigcup_{u \in S} R_u$. One way to do this is to compute the bottom-$k$ sketch of the union [7], which has threshold value $\tau = k^{\text{th}}\{\bigcup_{u \in S} X_u\}$, and apply the cardinality estimator $(k - 1)/\tau$. This would already conclude the proof of Theorem 5.1.

In our implementation, we use a strictly better union cardinality estimator that uses all the (at most $k|S|$) values in the set of sketches instead of just the $k$th smallest:

$$\widehat{\Bigl|\bigcup_{v \in S} R_v\Bigr|} = \sum_{z \in \bigcup_{v \in S} X_v \setminus \{\tau_v\}} \frac{1}{\max_{u \in S \mid z \in X_u \setminus \{\tau_u\}} \tau_u}. \tag{10}$$

This estimator, proposed by Cohen and Kaplan [11], can be computed from the $|S|$ sketches in time $O(|S| k \log |S|)$, by first sorting the $|S|$ sketches by decreasing threshold, and then identifying for each distinct rank value the threshold of the first sketch that contains it. When the sets $R_u$ are all the same, the estimate is the same as applying an estimator to the bottom-$k$ sketch of the union, but Equation (10) can have up to a factor of $\sqrt{|S|}$ lower CV when the sets $R_u$ are sufficiently disjoint. Moreover, this estimator is an optimal sum estimator in that it minimizes variance given the information available in the sketches.

We can also derive a permutation version of Equation (10). The simplest way is to treat the permutation rank $T$ as a uniform rank $r = (T - 1)/(\ell n - 1)$, which is the probability that the rank of another node is smaller than $T$.

When there is a single instance $G = (V, E)$, the combined sketches are simply reachability sketches [7, 10]. Reachability sketches $Y_v$ for all nodes can be computed very efficiently, using at most $mk$ edge traversals in total, where $m$ is the number of edges [7].

Algorithm 2 computes combined sketches by applying the pruned searches algorithm of Cohen [7] on each instance $G^{(i)}$, obtaining a sketch $Y^{(i)}_v$ for each node, and combining the results. The combined sketch $X_v$ is obtained by taking the bottom-$k$ values in the union of the $\ell$ sketches, defined as $X_v \leftarrow \text{bottom-}k\bigl(\bigcup_{i=1}^{\ell} Y^{(i)}_v\bigr)$.

The algorithm runs in $O(k \sum_i |E^{(i)}|)$ time. Rather than storing all sets of sketches, we can compute and merge concurrently or sequentially, but after each step take the bottom-$k$ values in the current bottom-$k$ set and the newly computed sketch for instance $G^{(i)}$: $X_v \leftarrow \text{bottom-}k\{X_v, Y^{(i)}_v\}$. Therefore, the additional run time storage requirement for sketches is $O(nk)$. This gives us the worst-case bounds on the computation stated in Theorem 5.1.

Algorithm 2: Combined reachability sketches

    forall the nodes u ∈ V do
        sketches[u] ← ∅    // Global sketches
        local[u] ← ∅       // Instance-local sketches
    shuffle the nℓ node-instance pairs (u, i)
    forall the instances i do
        // Build local sketches for instance i
        for pairs (u, j) with j = i by increasing rank r do
            BFS from u in reverse graph G^(i), during which
            foreach scanned node v do
                if |local[v]| = k then prune
                local[v] ← local[v] ∪ {r}
        // Merge local sketches into global sketches
        forall the nodes u do
            // Both sketches[u] and local[u] are sorted
            sketches[u] ← merge(sketches[u], local[u])
            trim sketches[u] to size k
            local[u] ← ∅
    return (sketches)

We implemented our algorithms in C++ using Visual Studio 2013 with full optimization. All experiments were run on a machine with two Intel Xeon E5-2690 CPUs and 384 GiB of DDR3-1066 RAM, running Windows 2008R2 Server. Each CPU has 8 cores (2.90 GHz, 8 ×
64 kiB L1, 8 × 256 kiB L2, and 20 MiB L3 cache), but all runs are sequential for consistency.

We ran our experiments on benchmark networks available as part of the SNAP [24] and WebGraph [2] projects. More specifically, we test social (Epinions, Slashdot, Gowalla, TwitterFollowers, LiveJournal, Orkut, Friendster, Twitter), collaboration (AstroPh), and web (
Slovakia, Slovakia$^\top$) networks. Slovakia$^\top$ is obtained from Slovakia by reversing all arcs (influence follows the reverse direction of links).

Kempe et al. [19] proposed two natural ways of associating probabilities with edges in the binary IC model: the uniform scheme assigns a constant probability $p$ to each directed edge (they used $p = 0.01$ and $p = 0.1$), whereas in the weighted cascade (wc) scheme the probability is the inverse of the degree of the head node (making the probability that a node is influenced less dependent on its number of neighbors). We consider the wc scheme by default, but we will also experiment with the uniform scheme (un). These two schemes are the most commonly tested in previous studies of scalability [20, 6, 5, 18, 22, 25].

This section evaluates SKIM, our new sketch-based influence maximization algorithm. By default we set the number of sampled instances to $\ell = 64$ and compute sketches with $k = 64$ entries. (These choices will be justified in later experiments.) To evaluate the actual influence values of the seeds computed by SKIM, we use a set of 512 different sampled instances, in which we simply run BFSes a posteriori.

Table 1 summarizes the performance of our algorithm on several networks of varying sizes with up to almost two billion edges. Besides the network sizes, the table reports results for three seed set sizes $s$: 50, 1000, and $n$, i.e., computing a full permutation. In each case, it reports the total running time of our algorithm as well as the total influence of the related seed set as a percentage of $n$. (Note that for $s = n$ this value is 100% by definition, so we omit it in the table.) For $s = 50$ and 1000, the table also reports the corresponding numbers for IRIE [18], one of the fastest available heuristics that can generate full permutations. We use our own implementation of IRIE, which is somewhat faster than the one evaluated in the original paper.
Except for $s = n$, we set an execution time limit of two hours; we report "DNF" and the corresponding number of computed seeds for those runs that did not finish.

The table shows that the influences computed by IRIE and SKIM are very close, with SKIM sometimes being better. However, SKIM is significantly faster, outperforming IRIE by several orders of magnitude on many instances. In particular, when computing 1000 instead of 50 seeds, SKIM's speedup over IRIE becomes more evident, as IRIE's running time grows linearly with the number of seed nodes, whereas SKIM's time per seed decreases as the residual problem shrinks. As a result, we can compute the 1000 most influential nodes on a graph with 65 million nodes and 1.8 billion edges (Friendster) in just 22 minutes. Similarly, computing a full influence ordering with SKIM takes less than 5.5 hours on all graphs.

We also compare SKIM to TIM$^+$ [25], the fastest influence maximization algorithm we are aware of. We ran their implementation (kindly given to us by the authors) to report figures on our instances. As in their experiments, we set the $\varepsilon$ parameter of TIM$^+$ to 1.0. Table 2 reports the influence (as percentage of $n$) as well as the running time for 50 and 1000 seed nodes. We note that SKIM and TIM$^+$ are extremely close in quality, with TIM$^+$ tending to be slightly better. SKIM is faster than TIM$^+$ on most instances except on Friendster, Twitter, and
Slovakia$^\top$.

Table 1. Performance of SKIM and IRIE. SKIM uses $k = 64$, $\ell = 64$, and we evaluate the influence on 512 (different) sampled instances. For all runs (except those for $n$ seeds) we set a time limit of two hours. For the runs that did not finish (DNF), we report the influence of the seed set (its size is shown in parentheses after "DNF") computed within the time limit (*).

Columns: instance; $|V|$ [$\cdot 10^3$]; $|A|$ [$\cdot 10^3$]; influence [%] for 50 seeds (SKIM, IRIE) and 1000 seeds (SKIM, IRIE); running time [sec] for 50 seeds (SKIM, IRIE), 1000 seeds (SKIM, IRIE), and $n$ seeds (SKIM).

Rows: AstroPh, Epinions, Slashdot, Gowalla, TwitterFollowers, LiveJournal, Orkut, Friendster, Twitter, Slovakia, Slovakia$^\top$.

Recoverable entries (influence values listed in column order; running times are not recoverable):

Friendster: $|V|$ = 65 608.4, $|A|$ = 1 806 067.1; influence 9.5, 8.8, *, *
Twitter: $|V|$ = 41 652.2, $|A|$ = 1 468 364.9; influence 21.1, 21.1, 38.0, 25.3 *
Slovakia: $|V|$ = 50 636.2, $|A|$ = 1 930 292.9; influence 5.4, 4.8, 14.8, 10.1 *
Slovakia$^\top$: $|V|$ = 50 636.2, $|A|$ = 1 930 292.9; influence 10.3, 10.0, 25.9, 16.7 *

Table 2.

Comparing SKIM and TIM$^+$ regarding influence and running time for 50 and 1000 seeds.

Columns: instance; influence [%] for 50 seeds (SKIM, TIM$^+$) and 1000 seeds (SKIM, TIM$^+$); running time [sec] for 50 seeds (SKIM, TIM$^+$) and 1000 seeds (SKIM, TIM$^+$).

Rows: AstroPh, Epinions, Slashdot, Gowalla, TwitterF's, LiveJournal, Orkut, Friendster, Twitter, Slovakia, Slovakia$^\top$. (The numeric entries are not recoverable.)

[...] sequence of nodes such that every prefix of this sequence also (approximately) maximizes the influence. In contrast, TIM$^+$ must be rerun to obtain a smaller set of maximally influential nodes.

We next argue why our parameter choices are reasonable. First, we evaluate the impact of the number $\ell$ of instances on the solution quality. Figure 1 (left) reports the quality of the seed nodes found by Greedy (GRE) when we use different $\ell$ values during the algorithm, but evaluate the quality of the resulting seed set on 4096 (different) instances. We observe that increasing $\ell$ does help quality, but only up to a certain point. In particular, values beyond 64 yield modest improvements. Since our running times depend on $\ell$, we use this value by default.

Figure 2 compares SKIM to GRE, IRIE, and DEG (including nodes by order of decreasing degree) on two inputs: Slashdot and
TwitterFollowers. For SKIM, we test various values for $k$ (4, 16, 64, 256). We report the influence error when compared to GRE (top) and the running time (bottom). We observe that the error for SKIM decreases as we increase $k$, with $k = 64$ being the sweet spot, after which solution quality does not improve by much anymore. Running times increase for all algorithms with the size of the seed set, but SKIM is consistently the fastest algorithm for any size.

Figure 1. Evaluating different numbers of simulations (left) and evaluating the average error of our oracle on 1000 random seeds, subject to varying $\ell$. The right plot is discussed in Section 6.2. [Left panel: AstroPh, error with respect to a reference $\ell$ versus seed set size. Right panel: average error versus number of instances $\ell$; legend: Slashdot, Epinions, AstroPh.]

Figure 3 evaluates the performance of SKIM and IRIE on the two IC schemes (wc, un), using TwitterFollowers as input. We observe that SKIM matches the solution quality of IRIE but is significantly faster.

Finally, Figure 4 shows the influence (top) and running time (bottom) of SKIM when computing the full permutation. We plot the relative influence and running time (both as percentage) subject to the number of computed seed nodes as the algorithm progresses (also as percentage of $n$). To the best of our knowledge, we are the first to compute (approximately) the full Pareto front of influence versus seed set size on graphs with billions of edges within only a few hours. The tradeoff seems to characterize the core of the network: on Slovakia$^\top$ and Twitter, 0.1% of the nodes already cover almost 50% of the entire graph, while on
Slashdot and Friendster, 0.1% of the seeds only cover 25–30% of the graph, albeit with a faster growth. Other instances have a slower growth in influence, but on all instances 10% of the nodes cover at least 50% of the graph. Regarding running time, we observe that all instances exhibit similar behavior. In particular, more than 50% of the total running time is spent computing the first 10% of seed nodes.

Figure 2. Evaluating influence and running time for several algorithms. The legend applies to all plots. [Panels: Slashdot and TwitterF's; error with respect to GRE (top) and running time [sec] (bottom) versus seed set size; legend: GRE, DEG, IRIE, SK-4, SK-16, SK-64, SK-256.]

Figure 3. Evaluating SKIM and IRIE on the uniform (un) and weighted cascade (wc) models. The legend applies to both plots. [Panels: TwitterF's; influence and running time [sec] versus seed set size; legend: SKIM-wc, IRIE-wc, SKIM-un, IRIE-un.]

This section evaluates our influence oracle (cf. Section 5). We use the IC model (with wc probabilities) to generate a set of $\ell = 64$ instances. We build combined reachability sketches of size $k = 64$ for this set of instances and evaluate the performance of our oracle (cf. Section 2).

Table 3 summarizes the performance of our oracle on several networks. It reports the time spent for preprocessing and the required space (in MiB) to store the combined sketches. Queries are evaluated for seed set sizes $s$ of 1, 50, and 1000. For each $s$, we generate 100 seed sets whose nodes are selected uniformly at random. We report the average running time of the query (estimator) in microseconds and the relative error of the estimated influence when compared to the exact influence of the respective seed set.

We observe that preprocessing times are reasonable for all graphs, while space consumption is essentially linear in the number of nodes. For example, on LiveJournal (the biggest instance tested), the sketches require 2.3 GiB of space, which we computed in just 34 minutes. The influence of a single node can then be estimated in 1–2 µs, while for 1000 seed nodes we require 5.2 ms. Note that the query time is almost independent of the graph size. Using $k = 64$, the error stays well below 10% for one seed node, and decreases significantly for larger seed sets (to around 1% for $s = 1000$).

Figure 4. Evaluating influence permutations (top) and running time (bottom) on several instances. The legend applies to both plots. [Axes: influence [%] and running time [%] versus seed set size [%]; legend: Epinions, Slashdot, TwitterF's, LiveJournal, Orkut, Friendster, Twitter, Slovakia$^\top$.]

Table 3. Evaluating our influence oracle with $\ell = 64$. (The entries for AstroPh and LiveJournal are not recoverable.)

| instance | preproc. time [sec] | preproc. space [MiB] | 1 seed: time [µs] | 1 seed: err. [%] | 50 seeds: time [µs] | 50 seeds: err. [%] | 1000 seeds: time [µs] | 1000 seeds: err. [%] |
|---|---|---|---|---|---|---|---|---|
| Epinions | 10 | 37.1 | 1.3 | 5.2 | 155.0 | 3.4 | 5 011.1 | 1.1 |
| Slashdot | 20 | 37.8 | 1.5 | 6.0 | 155.2 | 3.9 | 4 982.3 | 1.0 |
| Gowalla | 46 | 96.0 | 1.5 | 7.3 | 179.8 | 3.2 | 5 275.6 | 1.1 |
| TwitterFollowers | 229 | 223.0 | 2.1 | 7.0 | 190.2 | 3.3 | 5 061.8 | 0.8 |

Figure 5 shows in detail how the error of the estimator ($y$ axis) decreases when the seed set size increases ($x$ axis). To better evaluate the performance of estimating the union of several reachability sets, we use the following neighborhood generator for queries: for each query, it first picks a node $u$ at random with probability proportional to its degree. From $u$ it exhaustively grows a BFS of the smallest depth $l$ such that the tree contains at least $s$ nodes. The nodes for the seed set are then uniformly sampled from this tree. With this generator, we expect the reachability sets of seed nodes to overlap heavily. Looking at the figure, we observe that the estimation error of our oracle decreases rapidly for increasing $s$. Also, running queries from the neighborhood generator (right) instead of the uniform one (left) has almost no effect on the estimation error; for 50 seed nodes it is even better on many instances.

[Plots: error versus seed set size, uniform (left) and neighborhood (right); legend: AstroPh, Epinions, Slashdot, Gowalla, TwitterF's, LiveJournal.]
Figure 5. Evaluating our oracle for seed sets of varying size, which are selected uniformly at random (left) or with our BFS-based method (right).

Finally, Figure 1 (right) reports the performance of the oracle for fixed instances on the general IC model. We vary the number $\ell$ of instances generated by simulations when building the oracle, but compute the error on a different set of 8192 instances. Since our oracle implementation is optimized for fixed instances, we see a higher error with $\ell = 64$. We can also see that the error decreases with the number of simulations. We conclude that for an IC model oracle, it is beneficial to construct sketches that have approximation guarantees with respect to the IC model itself (cf. Section 3.2) rather than work with simulations.

We presented highly scalable algorithms for binary influence computation. SKIM is a sketch-space implementation of the greedy influence maximization algorithm that scales it by several orders of magnitude, to graphs with billions of edges. SKIM computes a sequence of nodes such that each prefix has a probabilistic guarantee on approximation quality that is close to that of Greedy. We also presented sketch-based influence oracles, which after a near-linear processing of the instances can estimate influence queries in time proportional to the number of seeds. Our experimental study focused on instances generated by an IC model, since the fastest algorithms we compared with only apply in this model. Our experiments revealed that SKIM is accurate and faster than other algorithms by one to two orders of magnitude.

In future work, we plan to develop a SKIM-like algorithm for timed influence, where edges have lengths that are interpreted as transition times and we consider both the speed and scope of infection [15, 4, 9, 1, 12]. We also plan to use sketches to efficiently estimate the Jaccard similarity of the influence sets of two nodes, which we believe to be an effective similarity measure [9].

References

[1] B. D. Abrahao, F. Chierichetti, R. Kleinberg, and A. Panconesi. Trace complexity of network inference. In KDD, 2013.
[2] P. Boldi and S. Vigna. The WebGraph framework I: Compression techniques. In WWW, 2004.
[3] C. Borgs, M. Brautbar, J. Chayes, and B. Lucier. Maximizing social influence in nearly optimal time. In SODA, 2014.
[4] W. Chen, W. Lu, and Y. Zhang. Time-critical influence maximization in social networks with time-delayed diffusion process. In AAAI, 2014.
[5] W. Chen, C. Wang, and Y. Wang. Scalable influence maximization for prevalent viral marketing in large-scale social networks. In KDD. ACM, 2010.
[6] W. Chen, Y. Wang, and S. Yang. Efficient influence maximization in social networks. In KDD. ACM, 2009.
[7] E. Cohen. Size-estimation framework with applications to transitive closure and reachability. J. Comput. System Sci., 55:441–453, 1997.
[8] E. Cohen. All-distances sketches, revisited: HIP estimators for massive graphs analysis. In PODS. ACM, 2014.
[9] E. Cohen, D. Delling, F. Fuchs, A. Goldberg, M. Goldszmidt, and R. Werneck. Scalable similarity estimation in social networks: Closeness, node labels, and random edge lengths. In COSN. ACM, 2013.
[10] E. Cohen and H. Kaplan. Summarizing data using bottom-k sketches. In ACM PODC, 2007.
[11] E. Cohen and H. Kaplan. Leveraging discarded samples for tighter estimation of multiple-set aggregates. In ACM SIGMETRICS, 2009.
[12] N. Du, L. Song, M. Gomez-Rodriguez, and H. Zha. Scalable influence estimation in continuous-time diffusion networks. In NIPS. Curran Associates, Inc., 2013.
[13] U. Feige. A threshold of ln n for approximating set cover. J. Assoc. Comput. Mach., 45:634–652, 1998.
[14] J. Goldenberg, B. Libai, and E. Muller. Talk of the network: A complex systems look at the underlying process of word-of-mouth. Marketing Letters, 12(3), 2001.
[15] M. Gomez-Rodriguez, D. Balduzzi, and B. Schölkopf. Uncovering the temporal dynamics of diffusion networks. In ICML, 2011.
[16] M. Gomez-Rodriguez, J. Leskovec, and A. Krause. Inferring networks of diffusion and influence. In KDD, 2010.
[17] A. Goyal, W. Lu, and L. V. S. Lakshmanan. CELF++: Optimizing the greedy algorithm for influence maximization in social networks. In WWW. ACM, 2011.
[18] K. Jung, W. Heo, and W. Chen. IRIE: Scalable and robust influence maximization in social networks. In ICDM. ACM, 2012.
[19] D. Kempe, J. M. Kleinberg, and É. Tardos. Maximizing the spread of influence through a social network. In KDD. ACM, 2003.
[20] J. Leskovec, A. Krause, C. Guestrin, C. Faloutsos, J. VanBriesen, and N. Glance. Cost-effective outbreak detection in networks. In KDD. ACM, 2007.
[21] G. Nemhauser, L. Wolsey, and M. Fisher. An analysis of the approximations of maximizing submodular set functions. Mathematical Programming, 14, 1978.
[22] N. Ohsaka, T. Akiba, Y. Yoshida, and K. Kawarabayashi. Fast and accurate influence maximization on large networks with pruned Monte-Carlo simulations. In AAAI, 2014.
[23] M. Richardson and P. Domingos. Mining knowledge-sharing sites for viral marketing. In KDD. ACM, 2002.
[24] Stanford network analysis project. http://snap.stanford.edu.
[25] Y. Tang, X. Xiao, and Y. Shi. Influence maximization: Near-optimal time complexity meets practical efficiency. In