Blocking Self-avoiding Walks Stops Cyber-epidemics: A Scalable GPU-based Approach
Hung Nguyen, Alberto Cano, and Thang Dinh
Virginia Commonwealth University, Richmond, VA 23220
{hungnt,acano,tndinh}@vcu.edu
Tam Vu
University of Colorado Denver, Denver, CO
[email protected]
ABSTRACT
Cyber-epidemics, the widespread diffusion of fake news or propaganda through social media, can cause devastating economic and political consequences. A common countermeasure against cyber-epidemics is to disable a small subset of suspected social connections or accounts to effectively contain the epidemics. An example is the recent shutdown of 125,000 ISIS-related Twitter accounts. Despite many proposed methods to identify such subsets, none is scalable enough to provide high-quality solutions in nowadays' billion-size networks.

To this end, we investigate the Spread Interdiction problems that seek the most effective links (or nodes) for removal under the well-known Linear Threshold model. We propose novel CPU-GPU methods that scale to networks with billions of edges, yet possess rigorous theoretical guarantees on the solution quality. At the core of our methods is an O(1)-space out-of-core algorithm to generate a new type of random walks, called Hitting Self-avoiding Walks (HSAWs). Such a low memory requirement enables the handling of big networks and, more importantly, hiding latency via the scheduling of millions of threads on GPUs. Comprehensive experiments on real-world networks show that our algorithms provide much higher quality solutions and are several orders of magnitude faster than the state of the art. Compared to the (single-core) CPU counterpart, our GPU implementations achieve significant speedup factors of up to 177x on a single GPU and 338x on a GPU pair.
KEYWORDS
Spread Interdiction, Approximation Algorithm, GPUs
Cyber-epidemics have caused significant economic and political consequences, and will even more so in the future due to the increasing popularity of social networks. Such widespread diffusion of fake news and propaganda has the potential to pose serious threats to global security. For example, through social media, terrorists have recruited thousands of supporters who have carried out terror acts, including bombings in the US and Europe, killing tens of thousands of innocents and creating worldwide anxiety [1]. The rumor of explosions at the White House injuring President Obama caused a $136.5 billion loss in the stock market [2], and the recent burst of fake news significantly influenced the 2016 election [12].

To contain those cyber-epidemics, one common strategy is to disable user accounts or social connections that could potentially be vessels for rumor propagation through the "word-of-mouth" effect. For example, Twitter has deleted 125,000 accounts linked to terrorism [3] since the middle of 2015, and U.S. officials have called for shutting down al-Shabab's Twitter accounts [4]. Obviously, removing too many accounts/links will negatively affect the legitimate user experience, possibly hindering the freedom of speech. Thus it is critical to identify small subsets of social links/user accounts whose removal effectively contains the epidemics.

Given a social network, which can be abstracted as a graph in which nodes represent users and edges represent their social connections, the above task is equivalent to the problem of identifying nodes and edges in the graph whose removal minimizes the (expected) spread of the rumors under a diffusion model. In a "blind interdiction" manner, [9, 34] investigate the problem when no information on the sources of the rumors is available. Given the infected source nodes, Kimura et al. [18] and [21] proposed heuristics to remove edges to minimize the spread from the sources. Remarkably, Khalil et al.
[16] propose the first (1 − 1/e − ϵ)-approximation algorithm for the edge removal problem under the linear threshold model [15]. However, the number of samples needed to provide the theoretical guarantee is too high for practical purposes. Moreover, none of the proposed methods can scale to large networks with billions of edges and nodes.

In this paper, we formulate and investigate two Spread Interdiction problems, namely Edge-based Spread Interdiction (eSI) and Node-based Spread Interdiction (nSI). The problems consider a graph representing a social network and a subset of suspected nodes that might be infected with the rumor. They seek a size-k set of edges (or nodes) whose removal minimizes the spread from the suspected nodes under the well-known linear threshold (LT) model [15]. Our major contribution is the two hybrid GPU-based algorithms, called eSIA and nSIA, that possess the following distinguishing characteristics:
• Scalability: Thanks to the highly efficient self-avoiding random walk generation on GPU(s), our algorithms run several orders of magnitude faster than their CPU counterparts as well as the state-of-the-art method [16]. The proposed methods take only seconds on networks with billions of edges and can work on even bigger networks by stretching the data across multiple GPUs.
• Rigorous quality guarantee: Through extensive analysis, we show that our methods return (1 − 1/e − ϵ)-approximate solutions w.h.p. Importantly, our methods can effectively determine a minimal number of HSAW samples to achieve the theoretical guarantee for given ϵ >
0. In practice, our solutions are consistently 10%-20% more effective than the runner-up when compared to centrality-based and influence-maximization-based methods and the current state of the art in [16].

The foundation of our proposed methods is a theoretical connection between Spread Interdiction and a new type of random walks, called Hitting Self-avoiding Walks (HSAWs). The connection allows us to find the most effective edges (nodes) for removal through finding those that appear most frequently on the
HSAWs. The bottleneck of this approach is, however, the generation of
HSAWs, which requires repeated generation of self-avoiding walks until one reaches a suspected node. Additionally, the standard approach to generate self-avoiding walks requires Ω(n) space per thread to store whether each node has been visited. This severely limits the number of threads that can be launched concurrently.

To tackle this challenge, we propose a novel O(1)-space out-of-core algorithm to generate HSAWs. Such a low memory requirement enables the handling of big networks on GPU and, more importantly, hiding latency via the scheduling of millions of threads. Compared to the (single-core) CPU counterpart, our GPU implementations achieve significant speedup factors of up to 177x on a single GPU and 388x on a GPU pair, making them several orders of magnitude faster than the state-of-the-art method [16].

Our contributions are summarized as follows:
• We formulate the problem of stopping cyber-epidemics by removing nodes and edges as two interdiction problems and establish an important connection between the Spread Interdiction problems and blocking Hitting Self-avoiding Walks (HSAWs).
• We propose an out-of-core O(1)-space HSAW sampling algorithm that allows the concurrent execution of millions of threads on GPUs. For big graphs that do not fit into a single GPU, we also provide distributed algorithms on multiple GPUs via the techniques of graph partitioning and node replication. Our sampling algorithm may be of particular interest for those sketching the influence dynamics of billion-scale networks.
• We propose two (1 − 1/e − ϵ)-approximation algorithms, namely eSIA and nSIA, for the edge and node versions of the Spread Interdiction problems. Our approaches bring together rigorous theoretical guarantees and practical efficiency.
• We conduct comprehensive experiments on real-world networks with up to 1.5 billion edges. The results suggest the superiority of our methods in terms of solution quality (10%-20% improvement) and running time (2-3 orders of magnitude faster).
Organization. We present the LT model and formulate the two Spread Interdiction problems on edges and nodes in Section 2. Section 3 introduces Hitting Self-avoiding Walks (HSAWs) and proves the monotonicity and submodularity, followed by the HSAW sampling algorithm in Section 4 with parallel and distributed implementations on GPUs. The complete approximation algorithms are presented in Section 5. Lastly, we present our experimental results in Section 6, related work in Section 7, and the conclusion in Section 8.
We consider a social network represented by a directed probabilistic graph G = (V, E, w) that contains |V| = n nodes and |E| = m weighted edges. Each edge (u, v) ∈ E is associated with an infection weight w(u, v) ∈ [0, 1] which indicates the likelihood that u will infect v once u gets infected.

Assume that we observe in the network a set of suspected nodes V_I that might be infected with misinformation or viruses. However, we do not know which ones are actually infected. Instead, the probability that a node v ∈ V_I is infected is given by a number p(v) ∈ [0, 1]. In a social network like Twitter, this probability can be obtained by analyzing the content of tweets to determine the likelihood of misinformation being spread. By the same token, in computer networks, remote scanning methods can be deployed to estimate the probability that a computer is infected by a virus.

The Spread Interdiction problems aim at selecting a set of nodes or edges whose removal results in the maximum influence suspension of the infected nodes. We assume a subset C of candidate nodes (or edges) that we can remove from the graph. The set C can be determined depending on the situation at hand. For example, C can contain (highly suspicious) nodes from V_I, or even nodes outside of V_I if we wish to contain the rumor rapidly. Similarly, C can contain edges that are incident to suspected nodes in V_I, or C = E if we wish to maximize the effect of the containment.

We consider that the infection spreads according to the well-known Linear Threshold (LT) diffusion model [15].
In the LT model, each user v selects an activation threshold θ_v uniformly at random from [0, 1]. The edge weights must satisfy the condition that, for each node, the sum of all incoming edge weights is at most 1, i.e., ∑_{u ∈ V} w(u, v) ≤ 1, ∀v ∈ V. The diffusion happens in discrete time steps t = 0, 1, 2, ..., n. At time t = 0, a set of users S ⊆ V, called the seed set, are infected and all other nodes are not. We also call the infected nodes active and the uninfected nodes inactive. An inactive node v at time t becomes active at time t + 1 if ∑_{active neighbors u of v} w(u, v) ≥ θ_v. The infection spreads until no more nodes become active.

Given G = (V, E, w) and a seed set S ⊂ V, the influence spread (or simply spread) of S, denoted by I_G(S), is the expected number of infected nodes at the end of the diffusion process. Here the expectation is taken over the randomness of all thresholds θ_v.

One of the most extensively studied problems is the influence maximization problem [15], which asks for a seed set S of k nodes that maximizes I_G(S). In contrast, this paper considers the case when the seed set (or the distribution over seed sets) is given and aims at identifying a few edges/nodes whose removal effectively reduces the influence spread.

LT live-edge model.
In [15], the LT model is shown to be equivalent to the live-edge model, where each node v ∈ V picks at most one incoming edge with probability equal to the edge weight. Specifically, a sample graph G = (V, E′ ⊂ E) is generated from G according to the following rules: 1) for each node v, at most one incoming edge is selected; 2) the probability of selecting edge (u, v) is w(u, v), and there is no incoming edge to v with probability (1 − ∑_{u ∈ N⁻(v)} w(u, v)).

The influence spread I_G(S) then equals the expected number of nodes reachable from S in a sample graph G, i.e.,
I_G(S) = ∑_{G ∼ G} |{nodes reachable from S in G}| · Pr[G ∼ G], (1)
where G ∼ G denotes a sample graph G induced from the stochastic graph G according to the live-edge model. For a sample graph G and a node v ∈ V, define χ_G(S, v) = 1 if v is reachable from S in G, and χ_G(S, v) = 0 otherwise. Then,
I_G(S) = ∑_{v ∈ V} ∑_{G ∼ G} χ_G(S, v) · Pr[G ∼ G] = ∑_{v ∈ V} I_G(S, v), (2)
where I_G(S, v) denotes the probability that node v ∈ V is eventually infected by the seed set S.

Learning Parameters from Real-world Traces.
Determining the infection weights in diffusion models is itself a hard problem and has been studied in various works [7, 13]. In practice, the infection weight w(u, v) between nodes u and v is usually estimated from the interaction frequency from u to v [15, 32] or learned from additional sources, e.g., action logs [13].

Denote by V_I = (V_I, p) the set of suspected nodes V_I together with their probabilities of being the sources. V_I defines a probability distribution over possible seed sets. The probability of a particular seed set X ⊆ V_I is given by
Pr[X ∼ V_I] = ∏_{u ∈ X} p(u) · ∏_{v ∈ V_I \ X} (1 − p(v)). (3)
By considering all possible seed sets X ∼ V_I, we further define the expected influence spread of V_I as follows:
I_G(V_I) = ∑_{X ∼ V_I} I_G(X) · Pr[X ∼ V_I]. (4)
We aim to remove k nodes/edges from the network to minimize the spread of infection from the suspected nodes V_I (defined in Eq. 4) in the residual network. Equivalently, the main goal in our formulations is to find a subset of edges (nodes) T that maximizes the influence suspension defined as
D(T, V_I) = I_G(V_I) − I_G′(V_I), (5)
where G′ is the residual network obtained from G by removing the edges (nodes) in T. When T is a set of nodes, all edges adjacent to nodes in T are also removed from G.

We formulate the two interdiction problems as follows.

Definition 1 (Edge-based Spread Interdiction (eSI)). Given G = (V, E, w), a set of suspected nodes and their probabilities of being infected V_I = (V_I, p), a candidate set C ⊆ E, and a budget 1 ≤ k ≤ |C|, the eSI problem asks for a k-edge set T̂_k ⊆ E that maximizes the influence suspension D(T_k, V_I):
T̂_k = argmax_{T_k ⊆ C, |T_k| = k} D(T_k, V_I), (6)
where
D(T_k, V_I) = I_G(V_I) − I_G′(V_I). (7)

Definition 2 (Node-based Spread Interdiction (nSI)).
Given a stochastic graph G = (V, E, w), a set of suspected nodes and their probabilities of being infected V_I = (V_I, p), a candidate set C ⊆ V, and a budget 1 ≤ k ≤ |C|, the nSI problem asks for a k-node set Ŝ_k that maximizes the influence suspension D^(n)(S_k, V_I):
Ŝ_k = argmax_{S_k ⊆ C, |S_k| = k} D^(n)(S_k, V_I), (8)
where
D^(n)(S_k, V_I) = I_G(V_I) − I_G′(V_I) (9)
is the influence suspension of S_k as defined in Eq. 5.

We abbreviate D^(n)(S_k, V_I) by D^(n)(S_k) and D(T_k, V_I) by D(T_k) when the context is clear.

Complexity and Hardness. The hardness results of the nSI and eSI problems are stated in the following theorem.

Theorem 2.1. nSI and eSI are NP-hard and cannot be approximated within 1 − 1/e − o(1) unless P = NP.

The proof is in our appendix. In the above definitions, the suspected nodes in V_I = (V_I, p) can be inferred from the frequency of suspicious behaviors or from their closeness to known threats. These probabilities are also affected by the seriousness of the threats.

Extension to Cost-aware Model.
One can generalize the nSI and eSI problems by replacing the candidate set C with an assignment of removal costs to edges (nodes). This can be done by incorporating the cost-aware version of the max-coverage problem in [17, 28]. For the sake of clarity, however, we opt for the uniform-cost version in this paper.

In this section, we first introduce a new type of
Self-avoiding Walk (SAW), called Hitting SAW (HSAW), under the LT model. These HSAWs and how to generate them are key to our proofs and algorithms. Specifically, we prove that the Spread Interdiction problems are equivalent to identifying the "most frequent edges/nodes" among a collection of HSAWs.

First, we define Hitting Self-avoiding Walks (HSAWs) for a sample graph G of G.

Definition 3 (Hitting Self-avoiding Walk (HSAW)).
Given a sample graph G = (V, E′) of a stochastic graph G = (V, E, w) under the live-edge model (for LT) and a sample set X ⊆ V_I, a walk h = ⟨v_0, v_1, ..., v_l⟩ is called a hitting self-avoiding walk if ∀i ∈ [0, l − 1], (v_{i+1}, v_i) ∈ E′; ∀i ≠ j ∈ [0, l], v_i ≠ v_j; and h ∩ X = {v_l}.

An HSAW h starts from a node v_0, called the source of h and denoted by src(h), and consecutively walks to an incoming neighboring node without visiting any node more than once. From the definition, the distribution of HSAWs depends on the distribution of the sample graphs G drawn from G following the live-edge model, the distribution of the infection sources V_I, and src(h). According to the live-edge model (for LT), each node has at most one incoming edge. This leads to three important properties.

Self-avoiding: An
HSAW has no duplicated nodes. Otherwise, there would be a loop in h, and at least one node on the loop (which cannot be v_l) would have at least two incoming edges, contradicting the live-edge model for LT.

Walk Uniqueness: Given a sample graph G ∼ G and X ∼ V_I, for any node v ∈ V \ X, there is at most one HSAW h that starts at node v. To see this, we can trace from v until reaching a node in X. As there is at most one incoming edge per node, the trace is unique.

Walk Probability: Given a stochastic graph G = (V, E, w) and V_I, the probability of having a particular HSAW h = ⟨v_0, v_1, ..., v_l⟩, where v_l ∈ V_I, is computed as follows:
Pr[h ∈ G] = p(v_l) · ∏_{u ∈ V_I ∩ h, u ≠ v_l} (1 − p(u)) · ∑_{G ∼ G} Pr[G ∼ G] · 1_{h ∈ G}
= p(v_l) · ∏_{u ∈ V_I ∩ h, u ≠ v_l} (1 − p(u)) · ∏_{i=0}^{l−1} w(v_{i+1}, v_i), (10)
where h ∈ G means that all the edges of h appear in the random sample G ∼ G.

Thus, based on the properties of HSAWs, we can define a probability space Ω_h whose elements are all possible HSAWs, with the probability of an HSAW computed from Eq. 10.

Spread Interdiction ↔ HSAW Blocking
From the probability space of HSAWs in a stochastic network G and the set V_I of suspected nodes, we prove the following important connection between the influence suspension of a set of edges and a random HSAW. We say T ⊆ E interdicts an HSAW h_j if T ∩ h_j ≠ ∅. When T interdicts h_j, removing T will disrupt h_j, leaving src(h_j) uninfected.

Theorem 3.1. Given a graph G = (V, E, w) and a set V_I, for any random HSAW h_j and any set T ⊆ E of edges, we have
D(T, V_I) = I_G(V_I) · Pr[T interdicts h_j]. (11)

The proof is presented in the extended version [5]. Theorem 3.1 states that the influence suspension of a set T ⊆ E is proportional to the probability that T intersects a random HSAW. Thus, to maximize the influence suspension, we find a set of edges that hits the most HSAWs. This motivates our sampling approach:
(1) Sample θ random HSAWs to build an estimator of the influence suspensions of many edge sets.
(2) Apply the Greedy algorithm over the set of HSAW samples to find a solution T̂_k that blocks the most HSAW samples.
The challenges in this approach are how to efficiently generate random HSAWs and what value of θ provides the guarantee.

As a corollary of Theorem 3.1, we obtain the monotonicity and submodularity of the influence suspension function D(T_k, V_I).

Corollary 3.2. The influence suspension function D(T), where T is a set of edges, under the LT model is monotone, i.e.,
∀T ⊆ T′, D(T) ≤ D(T′), (12)
and submodular, i.e., for any T ⊆ T′ and (u, v) ∉ T′,
D(T ∪ {(u, v)}) − D(T) ≥ D(T′ ∪ {(u, v)}) − D(T′). (13)

The proof is presented in the extended version [5]. The monotonicity and submodularity indicate that the above greedy approach returns (1 − 1/e − ϵ)-approximate solutions, where ϵ > 0 depends on the number of HSAW samples. To provide a good guarantee, a large number of HSAWs are needed, making the generation of HSAWs the bottleneck of this approach.
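The two-step scheme above can be sketched on the CPU in a few lines of Python. This is an illustrative sketch, not the paper's implementation: the function names `sample_hsaw` and `greedy_block`, and the toy graph used below, are our own. `sample_hsaw` rejection-samples one HSAW under the live-edge LT rule; `greedy_block` then picks k edges that interdict the most samples.

```python
import random
from collections import defaultdict

def sample_hsaw(in_nbrs, w, V_I, p, rng):
    """Rejection-sample one Hitting Self-Avoiding Walk (HSAW).

    in_nbrs[v]: in-neighbors of v; w[(u, v)]: LT edge weight (in-weights
    of each node sum to at most 1). V_I: suspected sources; p[v]: the
    probability that v is actually infected. Returns the walk's edge list.
    """
    nodes = list(in_nbrs)
    while True:                          # rejection loop: restart until a hit
        v = rng.choice(nodes)            # uniform random start
        walk, visited = [], {v}
        while True:
            # Live-edge rule: select at most one incoming edge of v,
            # edge (u, v) with prob. w(u, v), none with the leftover mass.
            r, u, acc = rng.random(), None, 0.0
            for cand in in_nbrs[v]:
                acc += w[(cand, v)]
                if r < acc:
                    u = cand
                    break
            if u is None:
                break                    # no live edge: reject and restart
            walk.append((u, v))
            if u in V_I and rng.random() <= p[u]:
                return walk              # hit an actually infected node
            if u in visited:
                break                    # closed a cycle: reject and restart
            visited.add(u)
            v = u

def greedy_block(samples, k):
    """Greedily pick k edges that interdict (cover) the most HSAW samples."""
    chosen, alive = [], [set(s) for s in samples]
    for _ in range(k):
        cov = defaultdict(int)
        for s in alive:
            for e in s:
                cov[e] += 1
        if not cov:
            break
        best = max(cov, key=cov.get)     # edge with maximum marginal gain
        chosen.append(best)
        alive = [s for s in alive if best not in s]
    return chosen
```

On a toy path a→b→c with a as the only (certainly infected) suspect, every sampled HSAW ends with the edge (a, b), so `greedy_block` selects it first, mirroring the intuition behind Theorem 3.1.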
We propose our sampling algorithm to generate HSAW samples on massively parallel GPU platforms. We begin with the simple CPU-based version of the algorithm.

4.1 Sampling HSAWs
Algorithm 1 describes our HSAW sampling procedure, which is based on the live-edge model in Section 2. The algorithm follows a rejection-sampling scheme which repeatedly generates a random SAW (Lines 2-14) until getting an HSAW (Line 1). The SAW sampling picks a random node v and follows the live-edge model to select an incoming edge (u, v) (Lines 5-14).

Algorithm 1:
HSAW Sampling Algorithm
Input: Graph G, suspect set V_I and p(v), ∀v ∈ V_I
Output: A random HSAW sample h_j
  while True do
    Pick a node v uniformly at random
    Initialize h_j = ∅
    while True do
      Use the live-edge model to select an edge (u, v) ∈ E
      if no edge is selected then break
      if edge (u, v) is selected then
        h_j = h_j ∪ {(u, v)} (h_j = h_j ∪ {u} for the node version)
        if u ∈ V_I and rand() ≤ p(u) then return h_j
        if u ∈ h_j then break
        Set v = u

The walk then replaces v with u and repeats the process until either: 1) no live edge is selected (Lines 7-8), indicating that h_j does not reach an infected node; 2) h_j hits a node in V_I and that node is actually infected (Lines 10-11); or 3) edge (u, v) is selected but u closes a cycle in h_j. Only in the second case does the algorithm terminate and return the found HSAW.

Figure 1: (a) no cycle, HSAW found; (b) a single cycle, no HSAW.

The algorithm is illustrated in Fig. 1. In (a), the simple path travels through several nodes and reaches an infected node. In (b), the algorithm detects a cycle.

4.2 HSAW Generation on GPU(s)
GPUs, with their massive parallel computing power, offer an attractive solution for generating HSAWs, the major bottleneck. As shown in the previous subsection, generating an HSAW requires repeatedly generating SAWs. Since the SAW samples are independent, if we can run millions of SAW-generation threads on GPUs, we can maximize the utilization of the GPUs' cores and minimize the stalls due to pipeline hazards or memory accesses, i.e., hide the memory latency. Moreover, only the hitting SAWs need to be transported back to the CPU, so the GPU-CPU communication is minimal.
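Two of the constant-space techniques detailed in the next subsection can be illustrated with a small CPU-side Python sketch. All names here are ours and the successor rule is a hypothetical stand-in for the live-edge selection: a PRNG-driven walk is fully determined by its initial seed, so it can be stored as just a (seed, length) pair and replayed on demand, and Brent's method detects a cycle in an iterated sequence using O(1) memory.

```python
import random

def replay_walk(seed, length, successor):
    """Reconstruct a random walk from its O(1)-space (seed, length) encoding.

    successor(node, rng) draws the next node from the walk's PRNG, so the
    whole walk is a deterministic function of the initial seed and can be
    replayed later instead of being stored.
    """
    rng = random.Random(seed)
    node, walk = None, []
    for _ in range(length):
        node = successor(node, rng)
        walk.append(node)
    return walk

def toy_successor(node, rng):
    """Hypothetical successor rule standing in for a live-edge selection."""
    return rng.randrange(100) if node is None else (node + 1 + rng.randrange(3)) % 100

def brent_cycle_length(f, x0, limit=10**6):
    """Brent's O(1)-space cycle detection on the sequence x0, f(x0), f(f(x0)), ...

    Only one checkpoint element is kept; each new element is compared against
    it, and the checkpoint advances when the step count reaches a power of two.
    Returns the cycle length, or None if no cycle is found within `limit`.
    """
    power = lam = 1
    checkpoint, x = x0, f(x0)
    while x != checkpoint:
        if lam == power:                      # advance the checkpoint
            checkpoint, power, lam = x, power * 2, 0
        x = f(x)
        lam += 1
        if power > limit:
            return None
    return lam
```

For example, `replay_walk(42, 8, toy_successor)` returns the same walk on every call, which is all a consumer of a (seed, length) stream needs, and `brent_cycle_length(lambda x: (x + 1) % 5, 7)` reports the cycle length 5.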
Challenges. Due to the special design of GPUs, with a massive number of parallel threads, in the ideal case we can speed up our algorithms vastly if memory accesses are coalesced and there is no warp divergence. However, designing such algorithms to fully utilize GPUs requires attention to the GPU architecture. Moreover, executing millions of parallel threads means that each thread has little memory to use. Unfortunately, the CPU-based algorithm to generate HSAWs (Alg. 1) can use up to Ω(n) space to track which nodes have been visited. For large networks, there is not enough memory to launch a large number of threads.

We tackle the above challenges and design a new lightweight HSAW generation algorithm. Our algorithm, presented in Alg. 2, requires only O(1) space per thread. Thus, millions of threads can be invoked concurrently to maximize efficiency. The algorithm ThreadSample in Alg. 2 combines three efficient techniques: O(1)-space Path Encoding, O(1)-space Infinite Cycle Detection, and Sliding-Window Early Termination.

O(1)-space Path Encoding. The first technique aims at generating SAW samples on GPU cores using only constant memory space. We take advantage of a typical feature of modern pseudo-random number generators: each random number is produced by a function whose input (seed) is the random number generated in the previous round,
r_i = f(r_{i−1}), i ≥ 1, (14)
where r_0 is the initial seed that can be set by users. Such generators are based on linear recurrences and are proven in [31] to be extremely fast while passing strong statistical tests. Thus, if we know the value of the random seed at the beginning of the SAW generation and the number of traversal steps, we can reconstruct the whole walk. As a result, the SAW sampling algorithm only needs to store the set of initial random seeds and the walk lengths. Alg. 2 is similar to Alg. 1 except that it does not return a SAW but only two numbers, Seed_h and Len_h, that encode the walk.

O(1)-space Infinite Cycle Detection. To detect cycles (Line 17 in Alg. 2), ThreadSample uses the following two heuristics, which detect most of the cycles. As the heuristics can produce false negatives (but not false positives), there is a small chance that ThreadSample will return some walks with cycles; however, the final cycle check in Alg. 3 makes sure that only valid HSAWs are returned. We adopt two constant-space cycle-detection algorithms: Floyd's [20] and Brent's [20]. Floyd's algorithm only requires space for two pointers that track a generating path. These pointers move at different speeds, i.e., one is twice as fast as the other. Floyd's algorithm is guaranteed to detect the cycle within the first traversal round of the slower pointer on the cycle and within the second round of the faster one. It maintains two sampling paths, pointed to by the two pointers, and thus needs two identical streams of random live-edge selections. Differently, Brent's algorithm cuts half of the computation of Floyd's by requiring a single stream of live-edge selections: it compares the node at position 2^i − 1, i ≥ 1, against each subsequently visited node, and runs in O(|h_j|) time. Brent's algorithm combined with path encoding results in an average speedup factor of 51x compared to a single CPU core.

Sliding-Window Early Termination. Cycles, when they exist, often have small size. Thus, we use a Cuckoo filter [11] of a small fixed size k to index and detect cycles among the last k visited nodes. Our experimental results (with k = 2) show that this short-cycle detection further improves the other acceleration techniques, to a speedup factor of 139x.

The combined algorithm for generating HSAWs on GPU is presented in Alg. 3, which generates a stream of HSAW samples h_1, h_2, .... The main component is a loop of multiple iterations. Each iteration launches thread_max threads, i.e., the maximum number of GPU threads used, to execute Alg. 2; each thread runs on a GPU core and generates at most l HSAW samples.
Algorithm 2: ThreadSample – Sampling on a GPU thread
Input: l and ThreadID; global pool H of HSAW samples
  Initialize a generator PRG with Seed ← PRG2(ThreadID)
  Advance PRG through its burn-in period
  for i = 1 to l do
    Seed_h = PRG.next(); Len_h = 0
    Use PRG to pick a node v uniformly at random
    while True do
      Use PRG to select an edge (u, v) ∈ E following the live-edge LT model
      if no edge is selected or Len_h ≥ n then break
      if edge (u, v) is selected then
        if u ∈ V_I then
          Use PRG: with probability p(u), H = H ∪ {⟨Seed_h, Len_h + 1⟩}; break
        if a cycle is detected at u then break
        Set v = u; Len_h = Len_h + 1

Algorithm 3: Parallel HSAW Sampling Algorithm on GPU
Input: Graph G, V_I, p(v), ∀v ∈ V_I
Output: R – a stream of HSAW samples
  i = 0
  while True do
    Initialize the global pool H = ∅ (l and thread_max depend on the GPU model)
    Call ThreadSample(ThreadID, l), ∀ThreadID = 1..thread_max
    foreach ⟨Seed_h, Len_h⟩ ∈ H do
      Reconstruct h from Seed_h and Len_h
      if h has no cycle then
        i ← i + 1; add R_i = {edges in h} to the stream R

Those samples are encoded by only two numbers, Seed_h and Len_h, which denote the starting seed of the random number generator and the length of that HSAW. Based on these two numbers, we can reconstruct the whole HSAW and recheck for the occurrence of a cycle. If no cycle is detected, a new HSAW R_i = {edges in h} is added to the stream. The small parameter l prevents thread divergence.

Recall that Alg. 2 is similar to Alg. 1 except that:
1) It only stores three numbers: the node v, Seed_h, and Len_h.
2) It uses two random number generators, PRG and PRG2, which are in the same class of linear recurrences (Eq. 14). PRG goes through a burn-in period to guarantee randomness (Lines 3-4).
3) The cycle detection in Line 17 can be Floyd's, Brent's, or the Cuckoo-filter heuristic (the latter requires rechecking in Alg. 3).
Thus, the algorithm requires only constant space and has the same time complexity as the HSAW sampling in Alg. 1.

4.3 Distributed Algorithm on Multiple GPUs
In case the graph cannot fit into the memory of a single GPU, we need to distribute the graph data across multiple GPUs. We refer to this approach as
the Distributed Algorithm on Multiple GPUs.

We use the folklore approach of partitioning the graph into smaller (potentially overlapping) partitions. Ideally, we aim at a partitioning of the graph that minimizes the inter-GPU communication. This is equivalent to minimizing the chance of an HSAW crossing different partitions. To do this, we first apply the standard METIS [24] graph partitioning technique to obtain p partitions, where p is the number of GPUs. Each GPU then receives a partition and generates samples from that subgraph. The number of samples generated by each GPU is proportional to the number of nodes in the received partition. We further reduce the number of crossing HSAWs by extending each partition to include nodes that are a few hops away. The number of hops is called the extension parameter, denoted by h; we use a small extension (e.g., h = 1).

This section focuses on the question of determining the minimal number of
HSAWs needed to guarantee the (1 − 1/e − ϵ) approximation, and on the complete presentation of eSIA. We adopt the recent Stop-and-Stare framework [29], which is proven to be efficient, i.e., it meets the theoretical lower bounds on the number of samples.

Algorithm 4:
Greedy algorithm for maximum coverage
Input: A set R_t of HSAW samples, C ⊆ E, and k
Output: A (1 − 1/e)-optimal solution T̂_k on the samples
  T̂_k = ∅
  for i = 1 to k do
    ê ← argmax_{e ∈ C \ T̂_k} (Cov_{R_t}(T̂_k ∪ {e}) − Cov_{R_t}(T̂_k))
    Add ê to T̂_k
  return T̂_k

Algorithm 5: Check algorithm for confidence level
Input: T̂_k, R_t, R′_t, ϵ, δ, and t
Output: True if the solution T̂_k meets the requirement
  Compute Λ₁ by Eq. 16
  if Cov_{R′_t}(T̂_k) ≥ Λ₁ then
    ϵ₁ = Cov_{R_t}(T̂_k) / Cov_{R′_t}(T̂_k) − 1
    ϵ₂ = ϵ · √( |R′_t|(1 + ϵ) / (2^{t−1} · Cov_{R′_t}(T̂_k)) )
    ϵ₃ = ϵ · √( |R′_t|(1 + ϵ)(1 − 1/e − ϵ)(1 + ϵ/3) / (2^{t−1} · Cov_{R′_t}(T̂_k)) )
    ϵ_t = (ϵ₁ + ϵ₂ + ϵ₁ϵ₂)(1 − 1/e − ϵ) + (1 − 1/e) · ϵ₃
    if ϵ_t ≤ ϵ then return True
  return False

Similar to [29, 32, 33], we first derive a threshold
θ = (1 − 1/e)(2 + ϵ) · Î_G(V_I) · (ln(2/δ) + ln (m choose k)) / (OPT_k · ϵ²). (15)
Using θ HSAW samples, the greedy algorithm (Alg. 4) is guaranteed to return a (1 − 1/e − ϵ)-approximate solution with probability at least 1 − δ/2.

Algorithm 6:
Edge Spread Interdiction Algorithm ( eSIA ) Input:
Graph G , V I , p ( v ) , ∀ v ∈ V I , k , C ⊆ E and 0 ≤ ϵ , δ ≤ Output: ˆ T k - An ( − / e − ϵ ) -near-optimal solution. Compute Λ (Eq. 18), N max (Eq. 17); t = A stream of
HSAW h , h , . . . is generated by Alg. 3 on GPU; repeat t = t + R t = { R , . . . , R Λ t − } ; R (cid:48) t = { R Λ t − + , . . . , R Λ t } ; ˆ T k ← Greedy (R t , C , k ) ; if Check ( ˆ T k , R t , R (cid:48) t , ϵ , δ ) = True then return ˆ T k ; until |R t | ≥ N max ; return ˆ T k ; Unfortunately, we cannot compute this threshold directly as itinvolves two unknowns ˆ I G (V I ) and OPT k . The Stop-and-Stareframework in [29] untangles this problem by utilizing two indepen-dent sets of samples: one for finding the candidate solution using Greedy algorithm and the second for out-of-sample verification ofthe candidate solution’s quality. This strategy guarantees to find a1 − / e − ϵ approximation solution within at most a (constant time)of any theoretical lower bounds such as the above θ (w.h.p.) eSIA Algorithm.
The complete algorithm eSIA is presented inAlg. 6. It has two sub-procedures:
Greedy , Alg. 4, and
Check , Alg. 5.
Greedy : Alg. 4 selects a candidate solution ˆ T k from a set of HSAW samples R t . This implements the greedy scheme that iterativelyselects from the set of candidate edges C an edge that maximizesthe marginal gain. The algorithm stops after selecting k edges. Check : Alg. 5 verifies if the candidate solution ˆ T k satisfies thegiven precision error ϵ . It computes the error bound providedin the current iteration of eSIA , i.e. ϵ t from ϵ , ϵ , ϵ (Lines 4-6),and compares that with the input ϵ . This algorithm consists of achecking condition (Line 2) that examines the coverage of ˆ T k onthe independent set R (cid:48) t of HSAW samples with Λ , Λ = + ( + ϵ )( + ϵ ) ln ( t max δ ) ϵ , (16)where t max = log (cid:16) N max ( + / ϵ ) ln ( δ / ) / ϵ (cid:17) is the maximum numberof iterations run by eSIA in Alg. 6 (bounded by O ( log n ) ). Thecomputations of ϵ , ϵ , ϵ are to guarantee the estimation qualityof ˆ T k and the optimal solution T ∗ k .The main algorithm in Alg. 6 first computes the upper-bound onneccessary HSAW samples N max i.e., N max = ( − e ) ( + ϵ ) m · ln ( / δ ) + ln (cid:0) mk (cid:1) kϵ , (17)and Λ i.e., Λ = ( + ϵ ) ln ( t max δ ) ϵ , (18)Then, it enters a loop of at most t max = O ( log n ) iterations. In eachiteration, eSIA uses the set R t of first Λ t − HSAW samples to find acandidate solution ˆ T k by the Greedy algorithm (Alg. 4). Afterwards,it checks the quality of ˆ T k by the Check procedure (Alg. 5). If the
Check returns
True, meaning that T̂_k meets the error requirement ϵ with high probability, then T̂_k is returned as the final solution. In case the Check procedure fails to verify the candidate solution T̂_k after t_max iterations, eSIA is terminated by the guarding condition |R_t| ≥ N_max (Line 9).

Optimal Guarantee Analysis.
We prove that eSIA returns a (1 − 1/e − ϵ)-approximate solution for the eSI problem with probability at least 1 − δ, where ϵ ∈ (0, 1) and 1/δ = O(n) are the given precision parameters.

Theorem 5.1. Given a graph G = (V, E, w), a probabilistic set V_I of suspected nodes, a candidate edge set C ⊆ E, 0 < ϵ < 1 and 1/δ = O(n) as the precision parameters, and a budget k, eSIA returns a (1 − 1/e − ϵ)-approximate solution T̂_k with probability at least 1 − δ,

Pr[D(T̂_k) ≥ (1 − 1/e − ϵ) OPT^(e)_k] ≥ 1 − δ.  (19)

Comparison to Edge Deletion in [16]. The recent work in [16] selects k edges to maximize the sum of the influence suspensions of the individual nodes in V_I, while our eSI problem considers V_I as a whole and maximizes the influence suspension of V_I. The formulation in [16] reflects the case where only a single node in V_I is the seed of the propaganda and each node is equally likely to be that seed. In contrast, eSI considers the more practical situation in which each node v ∈ V_I can be a seed independently with probability p(v). This condition is commonly found when the propaganda has been active for some time before triggering the detection system. In fact, the method in [16] can be applied to our problem and vice versa. However, [16] requires an impractically large number of samples to deliver the (1 − 1/e − ϵ) guarantee.

Similar to Theorem 3.1, we can also establish the connection between identifying nodes for removal and identifying nodes that appear frequently in
HSAWs.

Theorem 5.2.
Given G = (V, E, w), a random HSAW sample h_j, and a probabilistic set V_I, for any set S ⊆ V,

D^(n)(S, V_I) = I_G(V_I) · Pr[S interdicts h_j].

Thus, the nSIA algorithm for selecting k nodes to remove to maximize the influence suspension is similar to eSIA except that: 1) the Greedy algorithm selects nodes with maximum marginal gains into the candidate solution Ŝ_k; and 2) the maximum number of HSAW samples is computed as follows,

N_max = (2 − 2/e)(2 + (2/3)ϵ) · n · (ln(6/δ) + ln \binom{n}{k}) / (k ϵ^2).  (20)

The approximation guarantee is stated below.

Theorem 5.3. Given a graph G = (V, E, p), a probabilistic set V_I of possible seeds with their probabilities, C ⊆ V, 0 < ϵ < 1, 1/δ = O(n), and a budget k, nSIA returns a (1 − 1/e − ϵ)-approximate solution Ŝ_k with probability at least 1 − δ,

Pr[D^(n)(Ŝ_k) ≥ (1 − 1/e − ϵ) OPT^(n)_k] ≥ 1 − δ,  (21)

where S*_k is an optimal solution of k nodes. Both the complete algorithm for node-based Spread Interdiction and the proof of Theorem 5.3 are presented in our extended version [5].
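Theorem 5.2 (like Theorem 3.1 for edges) reduces influence suspension to an interdiction probability, which is exactly what a sample average over HSAWs estimates. A minimal Monte Carlo sketch of this estimator follows; the `influence` argument stands in for I_G(V_I), and all names are hypothetical:

```python
def estimate_suspension(walks, s, influence):
    """Monte Carlo form of Theorem 5.2: D(S) = I_G(V_I) * Pr[S interdicts h],
    estimated as `influence` times the fraction of HSAW samples touching S.

    walks: list of HSAW node sequences; s: set of candidate nodes (or edges);
    influence: stand-in value for the total influence spread I_G(V_I).
    """
    hit = sum(1 for h in walks if s.intersection(h))  # walks interdicted by S
    return influence * hit / len(walks)
```

In eSIA/nSIA the value I_G(V_I) cancels out of the greedy comparisons, so it never has to be computed explicitly; the sketch only illustrates the identity.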
In this section, we present the results of our comprehensive experi-ments on real-world networks. The results suggest the superiorityof nSIA and eSIA over the other methods.
Algorithms compared. For each of the studied problems, i.e., nSI and eSI, we compare three sets of algorithms: • nSIA and eSIA - our proposed algorithms, each of which has five implementations: single/multi-core CPU and single/parallel/distributed GPU accelerations. • InfMax-V and InfMax-V_I - algorithms for the Influence Maximization problem, which find the set of k nodes in C with the highest influence. For the edge version, we follow [16] and select the k edges that go into the highest-influence nodes. • GreedyCutting [16] for the edge deletion problem. • Baseline methods: we consider 3 common ranking measures: Pagerank, Max-Degree, and Randomized.
Datasets . Table 1 provides the summary of 5 datasets used.
Table 1:
Datasets' Statistics

Dataset | n | m | Avg. Deg. | Type
DBLP (*)
Pokec (*)
Skitter (*)
LiveJournal (*) | 4M | 34.7M | 8.7 | Social
Twitter [23] | 41.7M | 1.5G | 70.5 | Social

(*) http://snap.stanford.edu/data/index.html
Measurements. We measure the performance of each algorithm in two aspects: solution quality and scalability. To compute the influence suspension, we adapt the EIVA algorithm in [29] to find an (ϵ, δ)-estimate D̂(T, V_I),

Pr[|D̂(T, V_I) − D(T, V_I)| ≥ ϵ D(T, V_I)] ≤ δ,  (22)

where ϵ and δ are set to 0.01 and 1/n (see details in [29]).

Parameter Settings. We follow a common setting in [28, 29, 32] and set the weight of edge (u, v) to w(u, v) = 1/d_in(v), where d_in(v) denotes the in-degree of node v. For simplicity, we set C = E (or C = V) in the edge (or node) interdiction problem. For the nSIA and eSIA algorithms, we set the precision parameters to ϵ = 0.1 and δ = 1/n as a general setting. Following the approach in [16], the suspected set of nodes V_I contains 1000 randomly selected nodes with probabilities, randomized between 0 and 1, of being real suspects. The budget value k ranges from 100 to 1000. All experiments are carried out on a CentOS 7 machine with 2 Intel(R) Xeon(R) X5680 3.33GHz CPUs (6 cores each), 2 NVIDIA Titan X GPUs, and 100 GB of RAM. The algorithms are implemented in C++ (C++11) with the CUDA 8.0 toolkit.

The results comparing the solution quality, i.e., influence suspension, of the algorithms on the four larger network datasets, i.e., Pokec, Skitter, LiveJournal, and Twitter, are presented in Fig. 2 for eSI. Across all four datasets, we observe that eSIA significantly outperforms the other methods, with widening margins as k increases. eSIA performs twice as well as InfMax-V_I and many times better than the rest. Experiments on nSIA give similar observations, and the complete results are presented in our extended version.

Comparison with GreedyCutting [16]. We compare eSIA with GreedyCutting [16], which solves the slightly different Edge Deletion problem that interdicts the sum of nodes' influences

[Figure 2 panels (a) Pokec, (b) Skitter, (c) LiveJournal, (d) Twitter: Influence Suspension vs. Budgets (k) for eSIA*, E-InfMax-V_I, E-InfMax-V, E-Pagerank, E-Max-Degree, and E-Randomized]
Figure 2:
Interdiction Efficiency of different approaches on the eSI problem (eSIA* denotes the general eSIA)

Dataset | 1 CPU core (time(s), SpF) | 8 CPU cores (time(s), SpF) | 1 GPU (time(s), SpF) | par-2 GPUs (time(s), SpF)
DBLP
Pokec
Skitter
LiveJ
Table 2: Running time and Speedup Factor (SpF) of eSIA on various platforms (k = , par-2 GPUs refers to the parallel algorithm on 2 GPUs)
[Figure 3: Running time (s) vs. Budgets (k)]
Figure 3:
Running time on Twitter
[Figure 4: (a) Influence Suspension and (b) Running time (s) vs. Budgets (k), comparing eSIA* / eSIA-GPU (1 GPU) / eSIA-CPU (1 core) with GreedyCutting (100) and GreedyCutting (500)]
Figure 4:
Comparison between eSIA and GreedyCutting on the Edge Deletion Problem on the Skitter network.

while eSI minimizes the combined influence. Thus, to compare the methods on the two problems, we set the number of sources to 1. Since we are interested in interdicting nodes with high impact on networks, we select the top 10 nodes in the Skitter network with the highest degrees and randomize their probabilities. We carry out 10 experiments, each of which takes 1 of the 10 nodes to be the suspect. For GreedyCutting, we keep the default setting of 100 sample graphs and also test with 500 samples. We follow the edge probability setting in [16], which randomizes the edge probabilities and then normalizes them by

w(u, v) = w(u, v) / Σ_{(x,v)∈E} w(x, v)  (23)

so that the sum of edge probabilities into a node is 1. Afterwards, we take the average influence suspension and running time over all 10 tests; the results are drawn in Fig. 4. Results:
From Fig. 4, we see clearly that eSIA both obtains notably better solution quality, i.e., 10% to 50% higher, and runs substantially faster, by a factor of up to 20 (1 CPU core) and 1250 (1 GPU), than GreedyCutting. Comparing 100 and 500 sample graphs in GreedyCutting, we see improvements in influence suspension when 500 samples are used, showing the quality degradation caused by using insufficient graph samples. Skitter is the largest network on which we could run GreedyCutting, due to an unknown error returned by that algorithm.
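The per-node normalization of Eq. 23 used in this comparison can be sketched as follows (a sketch assuming the graph is given as a dict of directed edges; the function name is hypothetical):

```python
from collections import defaultdict

def normalize_in_weights(edges):
    """Normalize edge weights as in Eq. 23 so that, for every node v, the
    weights of the edges entering v sum to 1 (the LT-model requirement used
    in the GreedyCutting comparison). `edges` maps (u, v) -> raw weight."""
    in_sum = defaultdict(float)
    for (u, v), w in edges.items():
        in_sum[v] += w                     # total raw weight entering v
    return {(u, v): w / in_sum[v] for (u, v), w in edges.items()}
```

For example, two equal-weight edges into the same node each receive weight 0.5 after normalization.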
This set of experiments is devoted to evaluating the running time of the nSIA and eSIA implementations on multi-core CPUs and on single or multiple GPUs on large networks.
We experiment with the different parallel implementations, 1) single/multiple GPUs and 2) multi-core CPUs, to evaluate performance on various computational platforms. Due to the strong similarity between nSIA and eSIA in terms of performance, we only measure the time and speedup factor (SpF) for nSIA. We use two Titan X GPUs for testing multiple GPUs. The results are shown in Table 2 and Fig. 3.
Running time.
From Table 2, we observe that increasing the number of CPU cores running in parallel achieves an effective speedup of 80% per core, meaning that with 8 cores, nSIA runs 6.5 times faster than on a single CPU core. On the other hand, using one GPU reduces the running time by 100 to 200 times, while two parallel GPUs almost double that performance again, e.g., 200 vs. 123 times faster on Twitter. Fig. 3 confirms the speedups across different budgets.
[Figure 5: RWPS (millions) on CPU Intel(R) Xeon(R) E5-2670 v2 vs. GPU Titan X]
Figure 5:
Random walks per second on various platforms
[Figure 6: RWPS (millions) for GPU-PC1, GPU-PC2, and GPU-SPC]
Figure 6: Effects of different acceleration techniques
Random Walk Generating Rate.
We compare the rates of generating random walks (samples) on the different parallel platforms, i.e., GPU and CPU. The results are shown in Fig. 5. Unsurprisingly, the rate of random walk generation on the CPU depends linearly on the number of cores, achieving nearly 70% to 80% effectiveness per core. Between GPU and CPU, even with 16 CPU cores, only 10.8 million random walks are generated per second, around 13 times fewer than the 139 million on a single GPU.
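For intuition, one HSAW-style sample under the LT model can be generated by a reverse self-avoiding walk as sketched below. This sequential sketch keeps an explicit visited set, whereas the GPU kernels replace it with the O(1)-space path-encoding and cycle-detection techniques evaluated in the next experiment (graph layout and names are hypothetical):

```python
import random

def sample_saw(in_neighbors, start, hit_set, rng=random):
    """Sequential sketch of one reverse self-avoiding walk under the LT model.

    in_neighbors[v] lists (u, weight) pairs with weights summing to at most 1;
    the walk stops when it hits `hit_set` (the suspected nodes V_I), revisits
    a node (a cycle), or dead-ends. Only hitting walks (HSAWs) are returned.
    """
    walk, visited = [start], {start}
    v = start
    while v not in hit_set:
        r, u = rng.random(), None
        for cand, w in in_neighbors.get(v, []):  # pick one in-neighbor w.p. w
            r -= w
            if r < 0:
                u = cand
                break
        if u is None or u in visited:            # dead end or cycle: discard
            return None
        walk.append(u)
        visited.add(u)
        v = u
    return walk                                  # a hitting self-avoiding walk
```

On a deterministic chain 1 → 2 → 3 with hit set {1}, sampling from node 3 yields the walk [3, 2, 1]; on a 2-cycle with an unreachable hit set, the walk is discarded.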
Figure 7: Scalability tests on varying network sizes
[Figure 8: relative running time for h=1 and h=2]
Figure 8: Distributed algorithm on 2 GPUs
Scalability Test.
We carry out another test on the scalability of our GPU implementation. We create synthetic networks with GTgraph [6], with the number of nodes n increasing from tens of thousands (10^4) to hundreds of millions (10^8). For each value of n, we also test multiple densities, i.e., ratios of edges to nodes m/n. Specifically, we test with densities of 5, 10, and 15. Our results are plotted in Figure 7. They show that the running time of nSIA increases almost linearly with the size of the network.
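A sketch of producing synthetic inputs with a fixed edge-to-node ratio follows; it builds a plain uniform random digraph as a stand-in for GTgraph, which is not reproduced here:

```python
import random

def random_digraph(n, density, seed=0):
    """Build a random directed edge list with m = density * n distinct edges,
    mimicking the synthetic-network setup (a uniform stand-in for GTgraph)."""
    rng = random.Random(seed)
    m = int(density * n)
    edges = set()
    while len(edges) < m:                 # rejection-sample distinct edges
        u, v = rng.randrange(n), rng.randrange(n)
        if u != v:                        # no self-loops
            edges.add((u, v))
    return sorted(edges)
```

For instance, n = 100 with density 5 yields exactly 500 distinct directed edges.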
We implemented our distributed nSIA algorithm on two GPUs and compared its performance with that on a single GPU. For the distributed version, we test two values of the extension parameter, h = 1 and h = 2.

We experimentally evaluate the benefit of our acceleration techniques. We compare 3 different versions of nSIA and eSIA: 1) GPU-PC1, which employs O(1)-space Path-encoding and O(1)-space Cycle-detection via the slow Floyd's algorithm; 2) GPU-PC2, which employs O(1)-space Path-encoding and O(1)-space Cycle-detection via the fast Brent's algorithm; and 3) GPU-SPC, which applies all the techniques, including the empirical Sliding-window early termination. We run the four versions on all the datasets and compute the RWPS relative to that on a single-core CPU. The average results are shown in Fig. 6. The experimental results illustrate the huge effectiveness of the acceleration techniques. Specifically, O(1)-space Path-encoding combined with the slow Floyd's algorithm for Cycle-detection (GPU-PC1) boosts performance by 14x. When the fast Brent's algorithm is incorporated for O(1)-space Cycle-detection, the speedup further increases to 51x, while applying all the techniques improves the running time by up to 139x.

Several works have proposed removing/adding nodes/edges to minimize or maximize the influence of a node set in a network. [18, 21] proposed heuristic algorithms under the linear threshold model and its deterministic version. [14] studies the influence blocking problem under the competitive linear threshold model, which selects k nodes to initiate an inverse influence propagation that blocks the initial cascade. Misinformation containment has also been widely studied in the literature [22, 30]. Beyond the LT model, the node and edge interdiction problems have been studied under other diffusion models: [34] considers the SIR (Susceptible-Infected-Recovered) model, while [19] considers the IC model. The closest to our work is [16], in which the authors study two problems under the LT model: removing k edges to minimize the sum of the influences of the nodes in a set, and adding k edges to maximize that sum. They prove the monotonicity and submodularity of their objective functions and then develop two approximation algorithms for the two corresponding problems.
However, their algorithms do not provide a rigorous approximation factor, as they rely on a fixed number of simulations. In addition, there is no efficient implementation for billion-scale networks. Another closely related line of work is Influence Maximization [10, 15, 29, 32], which selects a set of seed nodes that maximizes the spread of influence over the network. Chen et al. [8] proved that estimating the influence of a set of nodes is #P-hard.

This paper aims at stopping an epidemic in a stochastic network G following the popular Linear Threshold model. The problems ask for a set of nodes (or edges) to remove from G such that the influence after removal is minimized, or, equivalently, the influence suspension is maximized. We draw an interesting connection between the Spread Interdiction problems and the concept of Self-avoiding Walks (SAWs). We then propose two near-optimal approximation algorithms. To accelerate the computation, we propose three acceleration techniques for parallel and distributed algorithms on GPUs. Our algorithms show large performance advantages in both solution quality and running time over the state-of-the-art methods.
REFERENCES
[6] GTgraph. ~kxm85/software/GTgraph/. (2017).
[7] M. Cha, H. Haddadi, F. Benevenuto, and P. K. Gummadi. 2010. Measuring User Influence in Twitter: The Million Follower Fallacy. ICWSM
10, 10-17 (2010), 30.[8] W. Chen, Y. Yuan, and L. Zhang. 2010. Scalable influence maximization in socialnetworks under the linear threshold model. In
ICDM . IEEE, 88–97.[9] T. N. Dinh, H. T. Nguyen, P. Ghosh, and M. L. Mayo. 2015. Social InfluenceSpectrum with Guarantees: Computing More in Less Time. In
CSoNet . Springer,84–103.[10] N. Du, L. Song, M. Gomez-R, and H. Zha. 2013. Scalable influence estimation incontinuous-time diffusion networks. In
NIPS . 3147–3155.[11] Bin Fan, Dave G Andersen, Michael Kaminsky, and Michael D Mitzenmacher.2014. Cuckoo filter: Practically better than bloom. In
CoNEXT . ACM, 75–88.[12] Hunt Allcott Matthew Gentzkow. 2017. Social Media and Fake News in the 2016Election. (2017).[13] A. Goyal, F. Bonchi, and L. V. S. Lakshmanan. 2010. Learning influence probabil-ities in social networks.
WSDM (2010), 241–250.[14] X. He, G. Song, W. Chen, and Q. Jiang. 2012. Influence Blocking Maximization inSocial Networks under the Competitive Linear Threshold Model.. In
SDM . SIAM,463–474.
[15] D. Kempe, J. Kleinberg, and E. Tardos. 2003. Maximizing the spread of influence through a social network. In
KDD . ACM, 137–146.[16] E. B. Khalil, B. Dilkina, and L. Song. 2014. Scalable diffusion-aware optimizationof network topology. In
KDD . ACM, 1226–1235.[17] S. Khuller, A. Moss, and JS Naor. 1999. The budgeted maximum coverage problem.
Info. Proc. Lett.
70, 1 (1999), 39–45.[18] M. Kimura, K. Saito, and H. Motoda. 2008. Solving the contamination minimiza-tion problem on networks for the linear threshold model. In
PRICAI . Springer,977–984.[19] M. Kimura, K. Saito, and H. Motoda. 2009. Blocking links to minimize contami-nation spread in a social network.
TKDD
3, 2 (2009), 9.[20] D Knuth. 1998. The Art of Computer Programming, Vol II: SeminumericalAlgorithms. (1998).[21] C. J. Kuhlman, G. Tuli, S. Swarup, M. V. Marathe, and S. S. Ravi. 2013. Blockingsimple and complex contagion by edge removal. In
ICDM . IEEE, 399–408.[22] K. K. Kumar and G. Geethakumari. 2014. Detecting misinformation in onlinesocial networks using cognitive psychology.
HCIS
4, 1 (2014), 1.[23] H. Kwak, C. Lee, H. Park, and S. Moon. 2010. What is Twitter, a social networkor a news media?. In
WWW . ACM, New York, NY, USA, 591–600.[24] Dominique LaSalle, Md Mostofa Ali Patwary, Nadathur Satish, Narayanan Sun-daram, Pradeep Dubey, and George Karypis. 2015. Improving graph partitioningfor modern graphs and architectures. In
IA3 . ACM, 14.[25] H. Liu, H. H. Huang, and Y. Hu. 2016. iBFS: Concurrent Breadth-First Search onGPUs. In
SIGMOD . ACM, 403–416. [26] X. Liu, M. Li, S. Li, S. Peng, X. Liao, and X. Lu. 2014. IMGPU: GPU-acceleratedinfluence maximization in large-scale social networks.
TPDS
25, 1 (2014), 136–145.[27] D. Merrill, M. Garland, and A. Grimshaw. 2012. Scalable GPU graph traversal. In
SIGPLAN Notices , Vol. 47. 117–128.[28] H. T. Nguyen, M. T. Thai, and T. N. Dinh. 2016. Cost-aware Targeted ViralMarketing in Billion-scale Networks. In
INFOCOM . IEEE.[29] H. T Nguyen, M. T Thai, and T. N Dinh. 2016. Stop-and-stare: Optimal samplingalgorithms for viral marketing in billion-scale networks. In
SIGMOD . ACM,695–710.[30] N. P. Nguyen, G. Yan, and M. T. Thai. Analysis of Misinformation Containmentin Online Social Networks.
Comp. Netw.
57, 10 (????), 2133–2146.[31] S.Vigna. 2017. xorshift*/xorshift+ generators and the PRNG shootout. (2017).http://xoroshiro.di.unimi.it/[32] Y. Tang, Y. Shi, and X. Xiao. 2015. Influence Maximization in Near-Linear Time:A Martingale Approach. In
SIGMOD . ACM, 1539–1554.[33] Y. Tang, X. Xiao, and Y. Shi. 2014. Influence maximization: Near-optimal timecomplexity meets practical efficiency. In
SIGMOD . ACM, 75–86.[34] H. Tong, B. A. Prakash, T. Eliassi-Rad, M. Faloutsos, and C. Faloutsos. 2012.Gelling, and Melting, Large Graphs by Edge Manipulation. In
CIKM . ACM, NewYork, NY, USA, 245–254.[35] R. Zenklusen. 2015. An O (1)-approximation for minimum spanning tree inter-diction. In
FOCS. IEEE, 709–728.

A PROOFS OF LEMMAS AND THEOREMS
We summarize the commonly used notations in Table 3.
Table 3: Table of notations
Notation | Description
n, m | The numbers of nodes and edges of 𝒢 = (V, E, w)
G ∼ 𝒢 | A sample graph G of the stochastic graph 𝒢
I_𝒢(V_I) | Influence spread of V_I in 𝒢
OPT_k | The maximum influence suspension by removing at most k edges
OPT^(n)_k | The maximum influence suspension by removing at most k nodes
T̂_k, Ŝ_k | The returned size-k edge and node sets of eSIA and nSIA
T*_k, S*_k | An optimal size-k set of edges and nodes
R_t, R′_t | Sets of random HSAW samples in iteration t
Cov_{R_t}(T) | The number of HSAWs h_j ∈ R_t intersecting T
Λ_1 | Λ_1 = (2 + (2/3)ϵ) ln(3 t_max/δ)/ϵ^2
Λ_2 | Λ_2 = 1 + (1 + ϵ_1)(1 + ϵ_2)(2 + (2/3)ϵ_3) ln(3 t_max/δ)/ϵ_3^2

Proof of Theorem 2.1
We prove that both nSI and eSI cannot be approximated within 1 − 1/e − o(1), which also implies the NP-hardness of these two problems.

nSI cannot be approximated within 1 − 1/e − o(1). We prove this by showing that the Influence Maximization problem [15] is a special case of nSI under a specific parameter setting. Given an instance of Influence Maximization, which finds a set of k nodes having the maximum influence on a probabilistic graph 𝒢 = (V, E, w), we construct an instance of nSI as follows: use the same graph 𝒢 with V_I = V, p(v) = 1/2 for all v ∈ V, and C = V. On the new instance, for a set S of k nodes, we have

D(S, V_I) = I_𝒢(V_I) − I_𝒢′(V_I) = Σ_{X∼V_I} (I_𝒢(X) − I_𝒢′(X)) Pr[X ∼ V_I].  (24)

Since p(v) = 1/2 for all v ∈ V, from Eq. 3, we have

Pr[X ∼ V_I] = 1/2^n.  (25)

Thus,

D(S, V_I) = (1/2^n) Σ_{X∼V_I} (I_𝒢(X) − I_𝒢′(X)) = (1/2^n) Σ_{X∼V_I} Σ_{v∈V} (I_𝒢(X, v) − I_𝒢′(X, v)),  (26)

where G, G′ denote graphs sampled from 𝒢 and 𝒢′. We say G is consistent with G′, denoted by G ∝ G′, if every edge (u, v) appearing in G′ is also realized in G. Thus, each sample graph G′ of 𝒢′ corresponds to a class of samples G of 𝒢. We define it as the consistency class of G′ in 𝒢, denoted by C_{G′} = {G ∼ 𝒢 | G ∝ G′}. More importantly, we have

Pr[G′ ∼ 𝒢′] = Σ_{G ∈ C_{G′}} Pr[G ∼ 𝒢].  (27)

Note that if G′_1 ≠ G′_2 are sample graphs of 𝒢′, then C_{G′_1} ∩ C_{G′_2} = ∅. Thus, we obtain

I_𝒢′(X, v) = Σ_{G′∼𝒢′} χ_{G′}(X, v) Pr[G′ ∼ 𝒢′] = Σ_{G∼𝒢} (χ_G(X, v) − χ_G(X, S, v)) Pr[G ∼ 𝒢],  (28)

where χ_G(X, S, v) = 1 if v is only reachable from X through S in G, and 0 otherwise. Hence, I_𝒢′(X, v) = I_𝒢(X, v) − Σ_{G∼𝒢} χ_G(X, S, v) Pr[G ∼ 𝒢]. Putting this into Eq. 26, we have

D(S, V_I) = (1/2^n) Σ_{G∼𝒢} Σ_{v∈V} Σ_{X∼V_I} χ_G(X, S, v) Pr[G ∼ 𝒢]
= (1/2^n) Σ_{G∼𝒢} Σ_{v∈I_G(S)} Σ_{X∼V_I} χ_G(X, S, v) Pr[G ∼ 𝒢]
= (1/2^n) Σ_{G∼𝒢} Σ_{v∈I_G(S)} 2^{n−1} Pr[G ∼ 𝒢] = (1/2) I_𝒢(S),  (29)

where the third equality is due to the property of the LT model that, for each node, at most one incoming edge is realized in any sample G ∼ 𝒢. Therefore, D(S, V_I) = (1/2) I_𝒢(S), where I_𝒢(S) is the influence function, which is well known to be NP-hard to maximize and cannot be approximated within 1 − 1/e − o(1). Thus, maximizing D(S, V_I) possesses the same hardness.

eSI cannot be approximated within 1 − 1/e − o(1).
Based on the hardness of nSI, we can easily prove that of eSI by a reduction as follows: given an instance of eSI on 𝒢 = (V, E, w) and V_I, for each (u, v) ∈ E, we add a node e_uv and set w(u, e_uv) = w(u, v) and w(e_uv, v) = 1, with the candidate set C = {e_uv | (u, v) ∈ E}. As such, eSI is converted to a restricted nSI problem, which is NP-hard and cannot be approximated within 1 − 1/e − o(1).

Proof of Theorem 3.1
From the definition of the influence suspension function for a set T_k of edges (Eq. 5), we have

D(T_k, V_I) = I_𝒢(V_I) − I_𝒢′(V_I) = Σ_{X∼V_I} [I_𝒢(X) − I_𝒢′(X)] Pr[X ∼ V_I] = Σ_{X∼V_I} Σ_{v∈V} [I_𝒢(X, v) − I_𝒢′(X, v)] Pr[X ∼ V_I].  (30)

In Eq. 30, each set X is a sample set of V_I and is deterministic. We will expand the term I_𝒢(X, v) − I_𝒢′(X, v) inside the double summation and then plug the expanded result back in. First, we define the notion of a collection of HSAWs from X to a node v.

Collection of
HSAWs. In the original stochastic graph 𝒢 with a set of source nodes X, for a node v, we define a collection P_{X,v} of HSAWs that includes all possible HSAWs h from a node in X to v,

P_{X,v} = {h = ⟨v = v_0, v_1, ..., v_l⟩ | h ∩ X = {v_l}}.  (31)

According to Eq. 2, the influence of a seed set X on a node v has an equivalent computation based on the sample graphs as follows,

I_𝒢(X, v) = Σ_{G∼𝒢} χ_G(X, v) Pr[G ∼ 𝒢],  (32)

where χ_G(X, v) is an indicator function with value 1 if v is reachable from X by a live-edge path in G, and 0 otherwise. We group the sample graphs according to the HSAWs from nodes in X to v such that Ω_h contains all the sample graphs having the path h. Then, since the set X is deterministic,

Pr_{X∼V_I}[h] = Σ_{G∈Ω_h} Pr[G ∼ 𝒢].  (33)

Due to the walk uniqueness property, the Ω_h for h ∈ P_{X,v} are completely disjoint, and their union equals the set of sample graphs of 𝒢 in which v is activated from nodes in X. Thus, Eq. 32 is rewritten as

I_𝒢(X, v) = Σ_{h∈P_{X,v}} Σ_{G∈Ω_h} Pr[G ∼ 𝒢] = Σ_{h∈P_{X,v}} Pr_{X∼V_I}[h].

We now compute the value of I_𝒢(X, v) − I_𝒢′(X, v) in the summation of Eq. 30. Since 𝒢′ is induced from 𝒢 by removing the edges in T_k, the set of all possible sample graphs of 𝒢′ is a subset of the sample graphs of 𝒢. Furthermore, if G ∼ 𝒢 and G cannot be sampled from 𝒢′, then Pr[G ∼ 𝒢′] = 0. Hence,

I_𝒢(X, v) − I_𝒢′(X, v) = Σ_{h∈P_{X,v}} Σ_{G∈Ω_h} (Pr[G ∼ 𝒢] − Pr[G ∼ 𝒢′])
= Σ_{h∈P_{X,v}: h∩T_k=∅} Σ_{G∈Ω_h} (Pr[G ∼ 𝒢] − Pr[G ∼ 𝒢′]) + Σ_{h∈P_{X,v}: h∩T_k≠∅} Σ_{G∈Ω_h} (Pr[G ∼ 𝒢] − Pr[G ∼ 𝒢′])
= Σ_{h∈P_{X,v}: h∩T_k=∅} (Pr_{X∼V_I}[h] − Pr_{X∼V_I}[h]) + Σ_{h∈P_{X,v}: h∩T_k≠∅} Pr_{X∼V_I}[h]
= Σ_{h∈P_{X,v}: h∩T_k≠∅} Pr_{X∼V_I}[h],  (34)

where the second equality is due to the division of P_{X,v} into two sub-collections of HSAWs, and the third and fourth equalities follow from Eq. 33 when X is deterministic.
Thus, we obtain

I_𝒢(X, v) − I_𝒢′(X, v) = Σ_{h∈P_{X,v}: h∩T_k≠∅} Pr_{X∼V_I}[h],  (35)

which is the summation over the probabilities of the HSAWs appearing in 𝒢 but not in 𝒢′, i.e., of h being suspended by T_k, given that X is deterministic. Plugging Eq. 35 back into Eq. 30, we obtain

D(T_k, V_I) = Σ_{X∼V_I} Σ_{v∈V} [I_𝒢(X, v) − I_𝒢′(X, v)] Pr[X ∼ V_I] = Σ_{X∼V_I} Σ_{v∈V} Σ_{h∈P_{X,v}: h∩T_k≠∅} Pr_{X∼V_I}[h] Pr[X ∼ V_I].  (36)

We define P_X to be the set of all HSAWs from a node in X to other nodes, and P to be the set of all HSAWs from nodes in the probabilistic set V_I to other nodes. Then Eq. 36 is rewritten as

D(T_k, V_I) = Σ_{X∼V_I} Σ_{h∈P_X: h∩T_k≠∅} Pr_{X∼V_I}[h] Pr[X ∼ V_I] = Σ_{h∈P: h∩T_k≠∅} Pr[h]  (37)
= I_𝒢(V_I) · (Σ_{h∩T_k≠∅} Pr[h]) / I_𝒢(V_I) = I_𝒢(V_I) · (Σ_{h∩T_k≠∅} Pr[h]) / (Σ_{h∈P} Pr[h])  (38)
= I_𝒢(V_I) · Pr[T_k interdicts h].  (39)

The last equality is obtained from Pr[T_k interdicts h] = (Σ_{h∩T_k≠∅} Pr[h]) / (Σ_{h∈P} Pr[h]), which holds since h is a random HSAW.

Proof of Monotonicity and Submodularity of D(T_k, V_I). The left-hand side of Eq. 37 is equivalent to the weighted coverage function of a set-cover system in which every
HSAW h ∈ P is an element of the universal set P, and the edges in E are the subsets: the subset corresponding to e ∈ E contains the elements whose HSAWs have e on their paths. The probability Pr[h] is the weight of element h. Since the weighted coverage function is monotone and submodular, it follows that D(T_k, V_I) has the same properties.

B PROOF OF THEOREM 5.1
Before proving Theorem 5.1, we need the following results. Let R_1, ..., R_N be the random HSAW samples generated in the eSIA algorithm. Given a subset of edges T ⊆ E, define X_j(T) = min{|R_j ∩ T|, 1}, a Bernoulli random variable with mean μ_X = E[X_j(T)] = D(T)/I_𝒢(V_I). Let μ̂_X = (1/N) Σ_{i=1}^N X_i(T) be an estimate of μ_X. Corollaries 1 and 2 in [32] state the following.

Lemma 1 ([32]). For N > 0 and ϵ > 0, it holds that

Pr[μ̂_X > (1 + ϵ)μ_X] ≤ exp(−N μ_X ϵ^2/(2 + (2/3)ϵ)),  (40)
Pr[μ̂_X < (1 − ϵ)μ_X] ≤ exp(−N μ_X ϵ^2/2).  (41)

The above lemma is used to prove the estimation guarantees of the candidate solution T̂_k found by the Greedy algorithm in each iteration and of the optimal solution T*_k. Recall that eSIA stops when either 1) the number of samples exceeds the cap, i.e., |R_t| ≥ N_max, or 2) ϵ_t ≤ ϵ for some t ≥ 1. N_max was chosen to guarantee that T̂_k is a (1 − 1/e − ϵ)-approximate solution w.h.p.

Lemma 2. Let B^(1) be the bad event that B^(1) = (|R_t| ≥ N_max) ∩ (D(T̂_k) < (1 − 1/e − ϵ) OPT^(e)_k). We have Pr[B^(1)] ≤ δ/3.

Proof. We prove the lemma in two steps:
(S1) With N_1 = (2 − 2/e)(2 + (2/3)ϵ) I_𝒢(V_I) (ln(6/δ) + ln \binom{m}{k}) / (OPT_k ϵ^2) HSAW samples, the returned solution T̂_k is a (1 − 1/e − ϵ)-approximate solution with probability at least 1 − δ/3.
(S2) N_max ≥ N_1.

Proof of (S1). Assume an optimal solution T*_k with maximum influence suspension OPT_k.
Use N_1 = (2 − 2/e)(2 + (2/3)ϵ) I_𝒢(V_I) (ln(6/δ) + ln \binom{m}{k}) / (OPT_k ϵ^2) HSAW samples and apply Lemma 1 to a set T_k of k edges; we obtain

Pr[ D_t(T_k) ≥ D(T_k) + ϵ/(2 − 1/e) · OPT_k ]  (42)
= Pr[ D_t(T_k) ≥ (1 + (ϵ OPT_k)/((2 − 1/e) D(T_k))) D(T_k) ]  (43)
≤ exp( −(N_1 D(T_k))/((2 + (2/3)ϵ) I_𝒢(V_I)) · ((ϵ OPT_k)/((2 − 1/e) D(T_k)))^2 )  (44)
≤ δ/(6 \binom{m}{k}).

There are at most \binom{m}{k} size-k sets of edges, and since T̂_k is one of those sets, we have

Pr[ D_t(T̂_k) ≥ D(T̂_k) + ϵ/(2 − 1/e) · OPT_k ] ≤ δ/6.

Similarly, consider the optimal solution T*_k and apply the second inequality in Lemma 1; we obtain

Pr[ D_t(T*_k) ≤ (1 − ϵ/(2 − 1/e)) OPT_k ] ≤ δ/6.

Thus, the two bad events
(1) D_t(T̂_k) ≥ D(T̂_k) + ϵ/(2 − 1/e) · OPT_k and
(2) D_t(T*_k) ≤ (1 − ϵ/(2 − 1/e)) OPT_k
happen with probability at most δ/6 + δ/6 = δ/3. Thus, in case neither of the two bad events happens, we have both
(1') D_t(T̂_k) ≤ D(T̂_k) + ϵ/(2 − 1/e) · OPT_k and
(2') D_t(T*_k) ≥ (1 − ϵ/(2 − 1/e)) OPT_k
with probability at least 1 − δ/3. Using (1') and (2'), we derive the approximation guarantee of T̂_k as follows:

D_t(T̂_k) ≤ D(T̂_k) + ϵ/(2 − 1/e) · OPT_k
⇔ D(T̂_k) ≥ D_t(T̂_k) − ϵ/(2 − 1/e) · OPT_k
≥ (1 − 1/e) D_t(T*_k) − ϵ/(2 − 1/e) · OPT_k
≥ (1 − 1/e)(1 − ϵ/(2 − 1/e)) OPT_k − ϵ/(2 − 1/e) · OPT_k
= (1 − 1/e − ((1 − 1/e) + 1) · ϵ/(2 − 1/e)) OPT_k
= (1 − 1/e − ϵ) OPT_k.  (48)

Thus, we achieve D(T̂_k) ≥ (1 − 1/e − ϵ) OPT_k with probability at least 1 − δ/3.

Proof of (S2).
It is sufficient to prove that k/m ≤ OPT_k/I_𝒢(V_I). This is trivial, since it is equivalent to OPT_k ≥ (k/m) I_𝒢(V_I), and the optimal solution of k edges must suspend at least a k/m fraction of the total influence I_𝒢(V_I). Note that there are m edges to select from, and the influence suspension of all m edges together is exactly I_𝒢(V_I). □

In the second case, the algorithm stops when ϵ_t ≤ ϵ for some 1 ≤ t ≤ t_max. The maximum number of iterations t_max is bounded by O(log_2 n), as stated below.

Lemma 3. The number of iterations in eSIA is at most t_max = O(log_2 n).

Proof. Since the number of
HSAW samples doubles in every iteration, and we start at Λ_1 and stop with at most 2N_max samples, the maximum number of iterations is

t_max = log_2(2 N_max/Λ_1)
= log_2( 2(2 − 2/e) m (ln(6/δ) + ln \binom{m}{k}) / (k ln(3 t_max/δ)) )
≤ log_2( 2(2 − 2/e) m (ln(6/δ) + k ln m) / (k ln(3/δ)) )
= O(log_2 n),  (49)

where the last equality is due to k ≤ m ≤ n^2 and the precision parameter 1/δ = O(n). □

For each iteration t, we bound the probabilities of the bad events that lead to inaccurate estimations of D(T̂_k) through R′_t and of D(T*_k) through R_t (Line 5 in Alg. 5). We obtain the following.

Lemma 4. For each 1 ≤ t ≤ t_max, let ϵ̂_t be the unique root of f(x) = δ/(3 t_max), where

f(x) = exp( −(N_t D(T̂_k))/I_𝒢(V_I) · x^2/(2 + (2/3)x) ),

and

ϵ*_t = ϵ √( I_𝒢(V_I) / ((1 + ϵ/3) 2^{t−1} OPT^(e)_k) ).

Consider the following bad events:

B^(2)_t = ( D_{t′}(T̂_k) ≥ (1 + ϵ̂_t) D(T̂_k) ),  B^(3)_t = ( D_t(T*_k) ≤ (1 − ϵ*_t) OPT^(e)_k ).

We have Pr[B^(2)_t], Pr[B^(3)_t] ≤ δ/(3 t_max).

Proof. One can verify that f(x) is a strictly decreasing function for x >
0. Moreover, f(0) = 1 and lim_{x→∞} f(x) =
0. Thus, the equation f(x) = δ/(3 t_max) has a unique solution for 0 < δ < 1 and t_max ≥ 1.

Bound the probability of B^(2)_t: Note that ϵ̂_t and the samples generated in R′_t are independent. Thus, we can apply the concentration inequality in Eq. (40):

Pr[D_{t′}(T̂_k) ≥ (1 + ϵ̂_t) D(T̂_k)] ≤ exp( −(N_t D(T̂_k) ϵ̂_t^2)/((2 + (2/3)ϵ̂_t) I_𝒢(V_I)) ) ≤ δ/(3 t_max).

The last inequality is due to the definition of ϵ̂_t.

Bound the probability of B^(3)_t: Since ϵ*_t is fixed and independent of the generated samples, we have

Pr[D_t(T*_k) ≤ (1 − ϵ*_t) OPT_k] ≤ exp( −(|R_t| OPT_k (ϵ*_t)^2)/(2 I_𝒢(V_I)) )
= exp( −(Λ_1 2^{t−1} OPT_k)/(2 I_𝒢(V_I)) · (ϵ^2 I_𝒢(V_I))/((1 + ϵ/3) 2^{t−1} OPT_k) )  (50)
= exp( −((2 + (2/3)ϵ) ln(3 t_max/δ))/ϵ^2 · ϵ^2/(2(1 + ϵ/3)) )
= exp( −ln(3 t_max/δ) ) = δ/(3 t_max),  (51)

which completes the proof of Lemma 4. □

Lemma 5.
Assume that none of the bad events B ( ) , B ( ) t , B ( ) t ( t = .. t max ) happen and eSIA stops with some ϵ t ≤ ϵ . We have ˆ ϵ t < ϵ and consequently (52) D t (cid:48) ( ˆ T k ) ≤ ( + ϵ ) D ( ˆ T k ) (53)Proof. Since the bad event B ( ) t does not happen, D t (cid:48) ( ˆ T k ) ≤ ( + ˆ ϵ t ) D ( ˆ T k ) (54) ⇔ Cov R (cid:48) I ( ˆ T k ) ≤ ( + ˆ ϵ t ) N t D ( ˆ T k ) I G (V I ) (55)When eSIA stops with ϵ t ≤ ϵ , it must satisfy the condition onLine 2 of Alg. 5, Cov R (cid:48) I ( ˆ T k ) ≥ Λ . Thus, we have ( + ˆ ϵ t ) N t D ( ˆ T k ) I G (V I ) ≥ Λ = + ( + ϵ ) + / ϵϵ ln 3 t max δ (56)From the definition of ˆ ϵ t , it follows that N t = + / ϵ t ˆ ϵ t ln (cid:18) t max δ (cid:19) I G (V I ) D ( ˆ T k ) (57)Substitute the above into (56) and simplify, we obtain: ( + ˆ ϵ t ) + / ϵ t ˆ ϵ t ln (cid:18) t max δ (cid:19) (58) ≥( + ϵ ) + / ϵϵ ln 3 t max δ + ( + x ) + / xx is a decreasing function for x > ϵ t < ϵ . (cid:3) We now prove the approximation guarantee of eSIA .Proof of Theorem 5.1. Apply union bound for the bad eventsin Lemmas 2 and 4. The probability that at least one of the badevents B ( ) , B ( ) t , B ( ) t ( t = .. t max ) happen is at most δ / + ( δ /( t max ) + δ /( t max )) × t max ≤ δ (60)In other words, the probability that none of the bad events hap-pen will be at least 1 − δ . Assume that none of the bad events happen, we shall show that the returned ˆ T k is a ( − / e − ϵ ) -approximationsolution.If eSIA stops with |R t | ≥ N max , ˆ T k is a ( − / e − ϵ ) -approximationsolution, since the bad event B ( ) does not happen.Otherwise, eSIA stops at some iteration t and ϵ t ≤ ϵ . We usecontradiction method. Assume that D ( ˆ T k ) < ( − / e − ϵ ) OPT k . (61)The proof will continue in the following order(A) D ( ˆ T k ) ≥ ( − / e − ϵ (cid:48) t ) OPT k where ϵ (cid:48) t = ( ϵ + ˆ ϵ t + ϵ ˆ ϵ t )( − / e − ϵ ) + ( − / e ) ϵ ∗ t .(B) ˆ ϵ t ≤ ϵ and ϵ ∗ t ≤ ϵ .(C) ϵ (cid:48) t ≤ ϵ t ≤ ϵ ⇒ D ( ˆ T k ) ≥ ( − e − ϵ ) OPT k ( contradiction ). Proof of (A) . 
Since the bad events $B^{(1)}_t$ and $B^{(2)}_t$ do not happen, we have
$$D'_t(\hat{T}_k) \leq (1+\hat{\epsilon}_t)\,D(\hat{T}_k), \text{ and} \qquad (62)$$
$$D_t(T^*_k) \geq (1-\epsilon^*_t)\,OPT_k. \qquad (63)$$
Since $\epsilon_1 \leftarrow \mathrm{Cov}_{R_t}(\hat{T}_k)/\mathrm{Cov}_{R'_t}(\hat{T}_k) - 1 = D_t(\hat{T}_k)/D'_t(\hat{T}_k) - 1$, it follows from (62) that
$$D_t(\hat{T}_k) = (1+\epsilon_1)\,D'_t(\hat{T}_k) \leq (1+\epsilon_1)(1+\hat{\epsilon}_t)\,D(\hat{T}_k).$$
Expanding the right-hand side and applying (61), we obtain
$$D(\hat{T}_k) \geq D_t(\hat{T}_k) - (\epsilon_1+\hat{\epsilon}_t+\epsilon_1\hat{\epsilon}_t)\,D(\hat{T}_k) \geq D_t(\hat{T}_k) - (\epsilon_1+\hat{\epsilon}_t+\epsilon_1\hat{\epsilon}_t)(1-1/e-\epsilon)\,OPT_k.$$
Since the Greedy algorithm for Max-Coverage guarantees a $(1-1/e)$-approximation, $D_t(\hat{T}_k) \geq (1-1/e)\,D_t(T^*_k)$. Thus,
$$D(\hat{T}_k) \geq (1-1/e)\,D_t(T^*_k) - (\epsilon_1+\hat{\epsilon}_t+\epsilon_1\hat{\epsilon}_t)(1-1/e-\epsilon)\,OPT_k$$
$$\geq (1-1/e)(1-\epsilon^*_t)\,OPT_k - (\epsilon_1+\hat{\epsilon}_t+\epsilon_1\hat{\epsilon}_t)(1-1/e-\epsilon)\,OPT_k \geq (1-1/e-\epsilon'_t)\,OPT_k,$$
where $\epsilon'_t = (\epsilon_1+\hat{\epsilon}_t+\epsilon_1\hat{\epsilon}_t)(1-1/e-\epsilon)+(1-1/e)\epsilon^*_t$.

Proof of (B). We first show that $\hat{\epsilon}_t \leq \epsilon_2$. Due to the computation $\epsilon_2 \leftarrow \epsilon\sqrt{\frac{|R'_t|(1+\epsilon)}{2^{t-1}\,\mathrm{Cov}_{R'_t}(\hat{T}_k)}}$, we have
$$\frac{1}{\epsilon_2^2} = \frac{2^{t-1}\,\mathrm{Cov}_{R'_t}(\hat{T}_k)}{\epsilon^2(1+\epsilon)\,|R'_t|} = \frac{2^{t-1}\,D'_t(\hat{T}_k)}{\epsilon^2(1+\epsilon)\,I_G(V_I)}.$$
Expanding the number of HSAW samples in iteration $t$, $N_t = 2^{t-1}\Lambda$, and applying the above equality, we have
$$N_t = 2^{t-1}\,\frac{2+\frac{2}{3}\epsilon}{\epsilon^2}\ln\frac{3t_{max}}{\delta} \qquad (64)$$
$$= \frac{2^{t-1}\,D'_t(\hat{T}_k)}{\epsilon^2(1+\epsilon)\,I_G(V_I)}\cdot\frac{(2+\frac{2}{3}\epsilon)(1+\epsilon)\,I_G(V_I)}{D'_t(\hat{T}_k)}\ln\frac{3t_{max}}{\delta} \qquad (65)$$
$$= \frac{(2+\frac{2}{3}\epsilon)(1+\epsilon)}{\epsilon_2^2}\cdot\frac{I_G(V_I)}{D'_t(\hat{T}_k)}\ln\frac{3t_{max}}{\delta}. \qquad (66)$$
On the other hand, according to Eq. (57), we also have
$$N_t = \frac{2+\frac{2}{3}\hat{\epsilon}_t}{\hat{\epsilon}_t^2}\ln\left(\frac{3t_{max}}{\delta}\right)\frac{I_G(V_I)}{D(\hat{T}_k)}. \qquad (67)$$
Thus,
$$\frac{(2+\frac{2}{3}\epsilon)(1+\epsilon)}{\epsilon_2^2\,D'_t(\hat{T}_k)} = \frac{2+\frac{2}{3}\hat{\epsilon}_t}{\hat{\epsilon}_t^2\,D(\hat{T}_k)} \;\Rightarrow\; \frac{\hat{\epsilon}_t^2}{\epsilon_2^2} = \frac{2+\frac{2}{3}\hat{\epsilon}_t}{2+\frac{2}{3}\epsilon}\cdot\frac{D'_t(\hat{T}_k)}{(1+\epsilon)\,D(\hat{T}_k)} \leq 1,$$
since $D'_t(\hat{T}_k) \leq (1+\epsilon)\,D(\hat{T}_k)$ and $\hat{\epsilon}_t \leq \epsilon$ by Lemma 5. Therefore, $\hat{\epsilon}_t \leq \epsilon_2$.

We next show that $\epsilon^*_t \leq \epsilon_3$. According to the definitions of $\epsilon^*_t$ and $\epsilon_3$, we have
$$\frac{(\epsilon^*_t)^2}{\epsilon_3^2} = \frac{\epsilon^2\,I_G(V_I)}{(1+\epsilon/3)\,2^{t-1}\,OPT_k} \Big/ \frac{\epsilon^2(1+\epsilon)(1-1/e-\epsilon)\,I_G(V_I)}{(1+\epsilon/3)\,2^{t-1}\,D'_t(\hat{T}_k)} = \frac{D'_t(\hat{T}_k)}{(1+\epsilon)(1-1/e-\epsilon)\,OPT_k}$$
$$\leq \frac{D(\hat{T}_k)}{(1-1/e-\epsilon)\,OPT_k} \leq 1,$$
due to $D'_t(\hat{T}_k) \leq (1+\epsilon)\,D(\hat{T}_k)$ and the assumption (61), respectively. Thus, $\epsilon^*_t \leq \epsilon_3$.

Proof of (C). Since $1+\epsilon_1 = D_t(\hat{T}_k)/D'_t(\hat{T}_k) \geq 0$, $\epsilon_2 \geq \hat{\epsilon}_t > 0$, and $\epsilon_3 \geq \epsilon^*_t > 0$, we have
$$\epsilon'_t = (\epsilon_1+\hat{\epsilon}_t+\epsilon_1\hat{\epsilon}_t)(1-1/e-\epsilon)+(1-1/e)\epsilon^*_t \qquad (68)$$
$$= (\epsilon_1+\hat{\epsilon}_t(1+\epsilon_1))(1-1/e-\epsilon)+(1-1/e)\epsilon^*_t \qquad (69)$$
$$\leq (\epsilon_1+\epsilon_2(1+\epsilon_1))(1-1/e-\epsilon)+(1-1/e)\epsilon_3 \qquad (70)$$
$$= \epsilon_t \leq \epsilon. \qquad (71)$$
This completes the proof. $\square$

B.1 Node-based Interdiction Algorithms

Algorithm 9: Node Spread Interdiction Algorithm (nSIA)
Input: Graph $G$, $V_I$, $p(v)\ \forall v \in V_I$, $k$, $C \subseteq V$, and $0 \leq \epsilon, \delta \leq 1$
Output: $\hat{S}_k$, a $(1-1/e-\epsilon)$-near-optimal solution.
1: Compute $\Lambda$ (Eq. 18) and $N_{max}$ (Eq. 72); $t = 0$;
2: Generate a stream of random samples $R_1, R_2, \ldots$, where each $R_j$ is the set of nodes in HSAW sample $h_j$ generated by Alg. 3;
3: repeat
4:   $t = t + 1$;
5:   $R_t = \{R_1, \ldots, R_{\Lambda 2^{t-1}}\}$;
6:   $R'_t = \{R_{\Lambda 2^{t-1}+1}, \ldots, R_{\Lambda 2^t}\}$;
7:   $\hat{S}_k \leftarrow$ GreedyNode$(R_t, C, k)$;
8:   if CheckNode$(\hat{S}_k, R_t, R'_t, \epsilon, \delta)$ = True then return $\hat{S}_k$;
9: until $|R_t| \geq N_{max}$;
10: return $\hat{S}_k$;
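As a concrete illustration, the doubling loop of nSIA together with the greedy max-coverage subroutine can be sketched in Python. This is a minimal sketch, not the paper's implementation: HSAW samples are modeled as plain sets of node ids, and `check` is a simplified stand-in for the CheckNode test (the names `cov`, `greedy_node`, and `nsia_sketch` are illustrative).

```python
def cov(samples, S):
    """Coverage of node set S: number of samples hit by at least one node in S."""
    return sum(1 for R in samples if R & S)

def greedy_node(samples, C, k):
    """Greedy max-coverage over HSAW samples: repeatedly add the
    candidate node with the largest marginal coverage gain."""
    S = set()
    for _ in range(k):
        base = cov(samples, S)
        v_best = max(C - S, key=lambda v: cov(samples, S | {v}) - base)
        S.add(v_best)
    return S

def nsia_sketch(stream, C, k, lam, n_max, check):
    """Doubling loop: iteration t uses the first lam*2^(t-1) samples as
    R_t for selection and the next lam*2^(t-1) as the independent
    verification set R'_t; `check` stands in for CheckNode."""
    t = 0
    while True:
        t += 1
        n_t = lam * 2 ** (t - 1)
        R_t, R_prime = stream[:n_t], stream[n_t:2 * n_t]
        S_hat = greedy_node(R_t, C, k)
        if check(S_hat, R_t, R_prime) or len(R_t) >= n_max:
            return S_hat
```

Keeping `R_t` and `R'_t` disjoint mirrors the proof: the candidate is selected on one set and its quality is verified on fresh, independent samples.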
Algorithm 7: GreedyNode algorithm for maximum coverage
Input: A set $R_t$ of HSAW samples, $C \subseteq V$, and $k$.
Output: A $(1-1/e)$-optimal solution $\hat{S}_k$ on the samples.
1: $\hat{S}_k = \emptyset$;
2: for $i = 1$ to $k$ do
3:   $\hat{v} \leftarrow \arg\max_{v \in C \setminus \hat{S}_k} \left(\mathrm{Cov}_{R_t}(\hat{S}_k \cup \{v\}) - \mathrm{Cov}_{R_t}(\hat{S}_k)\right)$;
4:   Add $\hat{v}$ to $\hat{S}_k$;
5: return $\hat{S}_k$;

Algorithm 8: CheckNode algorithm for confidence level
Input: $\hat{S}_k$, $R_t$, $R'_t$, $\epsilon$, $\delta$, and $t$.
Output: True if the solution $\hat{S}_k$ meets the requirement.
1: Compute $\Lambda_1$ by Eq. (16);
2: if $\mathrm{Cov}_{R'_t}(\hat{S}_k) \geq \Lambda_1$ then
3:   $\epsilon_1 = \mathrm{Cov}_{R_t}(\hat{S}_k)/\mathrm{Cov}_{R'_t}(\hat{S}_k) - 1$;
4:   $\epsilon_2 = \epsilon\sqrt{\frac{|R'_t|(1+\epsilon)}{2^{t-1}\,\mathrm{Cov}_{R'_t}(\hat{S}_k)}}$;
5:   $\epsilon_3 = \epsilon\sqrt{\frac{|R'_t|(1+\epsilon)(1-1/e-\epsilon)}{(1+\epsilon/3)\,2^{t-1}\,\mathrm{Cov}_{R'_t}(\hat{S}_k)}}$;
6:   $\epsilon_t = (\epsilon_1+\epsilon_2+\epsilon_1\epsilon_2)(1-1/e-\epsilon)+(1-1/e)\epsilon_3$;
7:   if $\epsilon_t \leq \epsilon$ then return True;
8: return False;

The maximum number of HSAW samples needed is
$$N_{max} = \frac{2(1-1/e)(2+\epsilon)\,n\left(\ln\frac{2}{\delta}+\ln\binom{n}{k}\right)}{k\epsilon^2}. \qquad (72)$$

Algorithm Description.
Alg. 9 uses two subroutines:
1) GreedyNode (Alg. 7), which selects a candidate solution $\hat{S}_k$ from a set $R_t$ of HSAW samples. It implements the greedy scheme that repeatedly selects the node with maximum marginal coverage gain and adds it to the solution until $k$ nodes have been selected.
2) CheckNode (Alg. 8), which checks whether the candidate solution $\hat{S}_k$ satisfies the given precision error $\epsilon$. It computes the error bound provided by the current iteration of nSIA, i.e., $\epsilon_t$ from $\epsilon_1, \epsilon_2, \epsilon_3$ (Lines 3-6), and compares it with the input $\epsilon$. The algorithm contains a checking condition (Line 2) that examines the coverage of $\hat{S}_k$ on a second set $R'_t$ of HSAW samples, independent from the set $R_t$ used in GreedyNode to find $\hat{S}_k$. The computations of $\epsilon_1, \epsilon_2, \epsilon_3$ guarantee the estimation quality of $\hat{S}_k$ and of the optimal solution $S^*_k$.

The main algorithm, Alg. 9, first computes $\Lambda$ and the upper bound $N_{max}$ on the necessary number of HSAW samples (Line 1). Then, it enters a loop of at most $t_{max}$ iterations. In each iteration, nSIA uses the set $R_t$ of the first $\Lambda\,2^{t-1}$ HSAW samples to find a candidate solution $\hat{S}_k$ by GreedyNode (Alg. 7). Afterwards, it checks the quality of $\hat{S}_k$ by the CheckNode procedure (Alg. 8). If CheckNode returns True, meaning that $\hat{S}_k$ meets the error requirement $\epsilon$ with high probability, $\hat{S}_k$ is returned as the final solution. In case CheckNode fails to verify a candidate solution after $t_{max}$ iterations, nSIA is terminated by the guarding condition $|R_t| \geq N_{max}$ (Line 9).

Figure 9: Interdiction efficiency of different approaches on the nSI problem (nSIA* refers to the general nSIA algorithm). The four panels report influence suspension versus budget $k$ on (a) Pokec, (b) Skitter, (c) LiveJournal, and (d) Twitter for nSIA*, InfMax-$V_I$, InfMax-$V$, Pagerank, Max-Degree, and Randomized.
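The error-bound computation that CheckNode performs, deriving $\epsilon_t$ from $\epsilon_1, \epsilon_2, \epsilon_3$, can be sketched numerically. The formulas below follow the reconstruction of Alg. 8 in this appendix and should be read as a sketch under those assumptions; `check_errors` and its argument names are illustrative.

```python
import math

def check_errors(cov_Rt, cov_Rpt, n_Rpt, t, eps):
    """Error bound eps_t of CheckNode (Lines 3-6, as reconstructed):
    cov_Rt / cov_Rpt are coverages of the candidate on R_t / R'_t,
    n_Rpt = |R'_t|, t is the iteration index, eps the target precision."""
    e1 = cov_Rt / cov_Rpt - 1
    e2 = eps * math.sqrt(n_Rpt * (1 + eps) / (2 ** (t - 1) * cov_Rpt))
    e3 = eps * math.sqrt(n_Rpt * (1 + eps) * (1 - 1 / math.e - eps)
                         / ((1 + eps / 3) * 2 ** (t - 1) * cov_Rpt))
    return (e1 + e2 + e1 * e2) * (1 - 1 / math.e - eps) + (1 - 1 / math.e) * e3
```

As the coverage on the verification set grows (and the two coverages agree), all three error terms shrink and the check $\epsilon_t \leq \epsilon$ passes.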
C PERFORMANCE ON NODE-BASED SPREAD INTERDICTION
The results comparing the solution quality, i.e., influence suspension, of the algorithms on the four larger network datasets, namely Pokec, Skitter, LiveJournal, and Twitter, are presented in Fig. 9 for nSI. Across all four datasets, we observe that nSIA significantly outperforms the other methods, with widening margins as $k$ increases. For example, on the largest network, Twitter, nSIA is about 20% better than the runner-up, InfMax-$V_I$, and 10 times better than the rest.

D ANALYZING NODES SELECTED FOR REMOVAL
We analyze which kinds of nodes, i.e., nodes in $V_I$ or popular nodes, are selected by the different algorithms.

Suspect Selection Ratio. We first analyze the solutions using the Suspect Selection Ratio (SSR), defined as the ratio of the number of suspected nodes selected to the budget $k$. We run the algorithms on all five datasets and take the average. Our results are presented in Table 4. We see that the majority of nodes selected by nSIA are suspected nodes, while the other methods, except InfMax-$V_I$, rarely select them. InfMax-$V_I$ restricts its selection to $V_I$ and thus always has an SSR of 1. The small fraction of nodes selected by nSIA that are not in $V_I$ are possibly border nodes between $V_I$ and the outside.

Interdiction Cost Analysis.
We measure the cost of removing a set $S$ of nodes by the following function:
$$\mathrm{Cost}(S) = \sum_{v \in S} (1 - p(v)) \log(d_{in}(v) + 1). \qquad (73)$$
The $\mathrm{Cost}(S)$ function combines two factors: 1) the probability of a node being suspected, and 2) the popularity of that node in the network, as implied by its in-degree. Interdicting nodes that have a higher suspicion probability, or that are less popular, results in a smaller cost, indicating less interruption to the operation of the network.

The interdiction costs of the solutions returned by the different algorithms are shown in Table 4. We observe that nSIA incurs the least cost, even less than that of InfMax-$V_I$, possibly because InfMax-$V_I$ ignores the probabilities of nodes being suspected. The other methods impose huge costs on the network since they target high-influence nodes.

Table 4: Suspect Selection Ratio (SSR) and Interdiction Cost of nSIA, InfMax-$V_I$, InfMax-$V$, Pagerank, Max-Degree, and Randomized for budgets $k = 200, 400, 600, 800, 1000$.
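The cost function of Eq. (73) is a direct sum over the removed nodes. A minimal sketch, assuming `p` and `d_in` are dictionaries mapping nodes to suspicion probabilities and in-degrees; the logarithm base is assumed natural here, which does not affect the relative comparison between methods.

```python
import math

def interdiction_cost(S, p, d_in):
    """Interdiction cost of Eq. (73): sum over removed nodes v of
    (1 - p(v)) * log(d_in(v) + 1). Highly suspected or unpopular
    nodes contribute less, i.e., are cheaper to remove."""
    return sum((1 - p[v]) * math.log(d_in[v] + 1) for v in S)
```

For two nodes with the same in-degree, the one with the higher suspicion probability is always cheaper to interdict.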