Online Influence Maximization under Independent Cascade Model with Semi-Bandit Feedback
Zheng Wen
Adobe Research [email protected]
Branislav Kveton
Adobe Research [email protected]
Michal Valko
SequeL team, INRIA Lille - Nord Europe [email protected]
Sharan Vaswani
University of British Columbia [email protected]
Abstract
We study the online influence maximization problem in social networks under the independent cascade model. Specifically, we aim to learn the set of "best influencers" in a social network online while repeatedly interacting with it. We address the challenges of (i) combinatorial action space, since the number of feasible influencer sets grows exponentially with the maximum number of influencers, and (ii) limited feedback, since only the influenced portion of the network is observed. Under a stochastic semi-bandit feedback, we propose and analyze IMLinUCB, a computationally efficient UCB-based algorithm. Our bounds on the cumulative regret are polynomial in all quantities of interest, achieve near-optimal dependence on the number of interactions, and reflect the topology of the network and the activation probabilities of its edges, thereby giving insights on the problem complexity. To the best of our knowledge, these are the first such results. Our experiments show that in several representative graph topologies, the regret of IMLinUCB scales as suggested by our upper bounds. IMLinUCB permits linear generalization and thus is both statistically and computationally suitable for large-scale problems. Our experiments also show that IMLinUCB with linear generalization can lead to low regret in real-world online influence maximization.
Social networks are increasingly important as media for spreading information, ideas, and influence. Computational advertising studies models of information propagation or diffusion in such networks [16, 6, 10].
Viral marketing aims to use this information propagation to spread awareness about a specific product. More precisely, agents (marketers) aim to select a fixed number of influencers (called seeds or source nodes) and provide them with free products or discounts. They expect that these users will influence their neighbours and, transitively, other users in the social network to adopt the product. This will thus result in information propagating across the network as more users adopt or become aware of the product. The marketer has a budget on the number of free products and must choose seeds in order to maximize the influence spread, which is the expected number of users that become aware of the product. This problem is referred to as influence maximization (IM) [16]. For IM, the social network is modeled as a directed graph with the nodes representing users, and the edges representing relations (e.g., friendships on Facebook, following on Twitter) between them. Each directed edge (i, j) is associated with an activation probability w(i, j) that models the strength of influence that user i has on user j. We say a node j is a downstream neighbor of node i if there is a directed edge (i, j) from i to j. The IM problem has been studied under a number of diffusion models [16, 13, 23]. The best known and studied are the models in [16], and in particular the independent cascade (IC) model. In this work, we assume that the diffusion follows the IC model and describe it next. After the agent chooses a set of source nodes S, the independent cascade model defines a diffusion (influence) process: at the beginning, all nodes in S are activated (influenced); subsequently, every activated node i can activate its downstream neighbor j with probability w(i, j) once, independently of the history of the process. This process runs until no activations are possible.
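The IC process described above is straightforward to simulate. The following sketch (function and variable names are ours, not from the paper) draws one realization of the diffusion from a seed set:

```python
import random
from collections import deque

def simulate_ic(graph, w, seeds, rng=random):
    """One realization of the independent cascade: `graph[u]` lists the
    downstream neighbors of u, and w[(u, v)] is the activation probability
    of the directed edge (u, v). Returns the set of influenced nodes."""
    influenced = set(seeds)
    frontier = deque(seeds)
    while frontier:
        u = frontier.popleft()
        for v in graph.get(u, []):
            # each activated node gets a single independent chance to
            # activate each of its downstream neighbors
            if v not in influenced and rng.random() < w[(u, v)]:
                influenced.add(v)
                frontier.append(v)
    return influenced
```

The realized number of influenced nodes is then just `len(simulate_ic(...))`; averaging it over many runs estimates the influence spread.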
In the IM problem, the goal of the agent is to maximize the expected number of the influenced nodes subject to a cardinality constraint on S. Finding the best set S is an NP-hard problem, but under common diffusion models including IC, it can be efficiently approximated to within a factor of 1 − 1/e [16]. In many social networks, however, the activation probabilities are unknown. One possibility is to learn these from past propagation data [25, 14, 24]. However, in practice, such data are hard to obtain and the large number of parameters makes this learning challenging. This motivates the learning framework of IM bandits [31, 28, 29], where the agent needs to learn to choose a good set of source nodes while repeatedly interacting with the network. Depending on the feedback to the agent, the IM bandits can have (1) full-bandit feedback, where only the number of influenced nodes is observed; (2) node semi-bandit feedback, where the identity of influenced nodes is observed; or (3) edge semi-bandit feedback, where the identity of influenced edges (edges going out from influenced nodes) is observed. In this paper, we give results for the edge semi-bandit feedback model, where we observe, for each influenced node, the downstream neighbors that this node influences. Such feedback is feasible to obtain in most online social networks. These networks track activities of users, for instance, when a user retweets a tweet of another user. They can thus trace the propagation (of the tweet) through the network, thereby obtaining edge semi-bandit feedback.
The IM bandits problem combines two main challenges. First, the number of actions (possible sets S) grows exponentially with the cardinality constraint on S. Second, the agent can only observe the influenced portion of the network as feedback. Although IM bandits have been studied in the past [21, 8, 31, 5, 29] (see Section 6 for an overview and comparison), there are a number of open challenges [28]. One challenge is to identify reasonable complexity metrics that depend on both the topology and activation probabilities of the network and characterize the information-theoretic complexity of the IM bandits problem. Another challenge is to develop learning algorithms such that (i) their performance scales gracefully with these metrics and (ii) they are computationally efficient and can be applied to large social networks with millions of users.
In this paper, we address these two challenges under the IC model with access to edge semi-bandit feedback. We refer to our model as an independent cascade semi-bandit (ICSB). We make four main contributions. First, we propose IMLinUCB, a UCB-like algorithm for ICSBs that permits linear generalization and is suitable for large-scale problems. Second, we define a new complexity metric, referred to as maximum observed relevance for ICSB, which depends on the topology of the network and is a non-decreasing function of activation probabilities. The maximum observed relevance C_* can also be upper bounded based on the network topology or the size of the network in the worst case. However, in real-world social networks, due to the relatively low activation probabilities [14], C_* attains much smaller values as compared to the worst-case upper bounds. Third, we bound the cumulative regret of IMLinUCB. Our regret bounds are polynomial in all quantities of interest and have near-optimal dependence on the number of interactions. They reflect the structure and activation probabilities of the network through C_* and do not depend on inherently large quantities, such as the reciprocal of the minimum probability of being influenced (unlike [8]) and the cardinality of the action set. Finally, we evaluate IMLinUCB on several problems. Our empirical results on simple representative topologies show that the regret of IMLinUCB scales as suggested by our topology-dependent regret bounds. We also show that IMLinUCB with linear generalization can lead to low regret in real-world online influence maximization problems.
In this section, we define notation and give the formal problem statement for the IM problem under the IC model. Consider a directed graph G = (V, E) with a set V = {1, 2, ..., L} of L = |V| nodes, a set E = {1, 2, ..., |E|} of directed edges, and an arbitrary binary weight function w: E → {0, 1}. We say that a node v2 ∈ V is reachable from a node v1 ∈ V under w if there is a directed path p = (e1, e2, ..., e_l) from v1 to v2 in G satisfying w(e_i) = 1 for all i = 1, 2, ..., l, where e_i is the i-th edge in p. For a given source node set S ⊆ V and w, we say that node v ∈ V is influenced if v is reachable from at least one source node in S under w; we denote the number of influenced nodes in G by f(S, w). By definition, the nodes in S are always influenced.
The influence maximization (IM) problem is characterized by a triple (G, K, w), where G is a given directed graph, K ≤ L is the cardinality of source nodes, and w: E → [0, 1] is a probability weight function mapping each edge e ∈ E to a real number w(e) ∈ [0, 1]. The agent needs to choose a set of K source nodes S ⊆ V based on (G, K, w). Then a random binary weight function w̃, which encodes the diffusion process under the IC model, is obtained by independently sampling a Bernoulli random variable w̃(e) ∼ Bern(w(e)) for each edge e ∈ E. The agent's objective is to maximize the expected number of the influenced nodes: max_{S: |S|=K} f(S, w), where f(S, w) := E_w̃[f(S, w̃)] is the expected number of influenced nodes when the source node set is S and w̃ is sampled according to w. It is well known that the (offline) IM problem is NP-hard [16], but can be approximately solved by approximation/randomized algorithms [6] under the IC model. In this paper, we refer to such algorithms as oracles to distinguish them from the machine learning algorithms discussed in the following sections.
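One standard such oracle (though not the only choice) is the greedy algorithm of [16], which adds seeds one at a time and relies on Monte Carlo estimates of the spread. A minimal sketch, with helper names of our own choosing:

```python
import random
from collections import deque

def spread(graph, w, seeds, n_sim=200, rng=None):
    """Monte Carlo estimate of the influence spread f(S, w)."""
    rng = rng or random.Random(0)
    total = 0
    for _ in range(n_sim):
        influenced, frontier = set(seeds), deque(seeds)
        while frontier:
            u = frontier.popleft()
            for v in graph.get(u, []):
                if v not in influenced and rng.random() < w[(u, v)]:
                    influenced.add(v)
                    frontier.append(v)
        total += len(influenced)
    return total / n_sim

def greedy_oracle(graph, w, K, nodes):
    """Greedy seed selection; by monotonicity and submodularity of the
    spread, this is a (1 - 1/e)-approximation, up to Monte Carlo error."""
    S = []
    for _ in range(K):
        best = max((v for v in nodes if v not in S),
                   key=lambda v: spread(graph, w, S + [v]))
        S.append(best)
    return S
```

In the terminology introduced next, this would be an approximation oracle with approximation factor roughly 1 − 1/e.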
Let S^opt be the optimal solution of this problem, and S* = ORACLE(G, K, w) be the (possibly random) solution of an oracle ORACLE. For any α, γ ∈ [0, 1], we say that ORACLE is an (α, γ)-approximation oracle for a given (G, K) if, for any w, f(S*, w) ≥ γ f(S^opt, w) with probability at least α. Notice that this further implies that E[f(S*, w)] ≥ αγ f(S^opt, w). We say an oracle is exact if α = γ = 1.
In this section, we first describe the IM semi-bandit problem. Next, we state the linear generalization assumption and describe IMLinUCB, our UCB-based semi-bandit algorithm.
The independent cascade semi-bandit (ICSB) problem is also characterized by a triple (G, K, w), but w is unknown to the agent. The agent interacts with the independent cascade semi-bandit for n rounds. At each round t = 1, 2, ..., n, the agent first chooses a source node set S_t ⊆ V with cardinality K based on its prior information and past observations. Influence then diffuses from the nodes in S_t according to the IC model. Similarly to the previous section, this can be interpreted as the environment generating a binary weight function w_t by independently sampling w_t(e) ∼ Bern(w(e)) for each e ∈ E. At round t, the agent receives the reward f(S_t, w_t), which is equal to the number of nodes influenced at that round. The agent also receives edge semi-bandit feedback from the diffusion process. Specifically, for any edge e = (u1, u2) ∈ E, the agent observes the realization of w_t(e) if and only if the start node u1 of the directed edge e is influenced in the realization w_t. The agent's objective is to maximize the expected cumulative reward over the n rounds.
Since the number of edges in real-world social networks tends to be in millions or even billions, we need to exploit some generalization model across activation probabilities to develop efficient and deployable learning algorithms. In particular, we assume that there exists a linear-generalization model for the probability weight function w. That is, each edge e ∈ E is associated with a known feature vector x_e ∈ ℝ^d (here d is the dimension of the feature vector) and there is an unknown coefficient vector θ* ∈ ℝ^d such that, for all e ∈ E, w(e) is "well approximated" by x_e^T θ*. Formally, we assume that ρ := max_{e ∈ E} |w(e) − x_e^T θ*| is small. In Section 5.2, we see that such a linear generalization leads to efficient learning in real-world networks. Note that all vectors in this paper are column vectors.
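Concretely, given one realized diffusion, the edge semi-bandit feedback of this protocol can be extracted as follows (a sketch with our own naming; `w_real` maps each edge to its binary realization w_t(e)):

```python
def edge_semibandit_feedback(graph, w_real, influenced):
    """Edge semi-bandit feedback: the realization w_t(e) of a directed edge
    e = (u, v) is observed iff its start node u was influenced this round."""
    return {(u, v): w_real[(u, v)]
            for u in influenced for v in graph.get(u, [])}
```

Note that edges out of influenced nodes are observed even when the activation failed (w_t(e) = 0), which is what makes this feedback richer than node-level feedback.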
As is standard in graph theory, a directed path is a sequence of directed edges connecting a sequence of distinct nodes, under the restriction that all edges are directed in the same direction. Notice that the two definitions of f are consistent in the sense that if the probability weight function takes values in {0, 1}^{|E|}, then the realized number of influenced nodes equals its expectation with probability 1.

Algorithm 1 IMLinUCB: Influence Maximization Linear UCB
Input: graph G, source node set cardinality K, oracle ORACLE, feature vectors x_e, and algorithm parameters σ, c > 0
Initialization: B_0 ← 0 ∈ ℝ^d, M_0 ← I ∈ ℝ^{d×d}
for t = 1, 2, ..., n do
  1. set θ̂_{t−1} ← σ^{−2} M_{t−1}^{−1} B_{t−1} and the UCBs as U_t(e) ← Proj_{[0,1]}( x_e^T θ̂_{t−1} + c √(x_e^T M_{t−1}^{−1} x_e) ) for all e ∈ E
  2. choose S_t ∈ ORACLE(G, K, U_t), and observe the edge-level semi-bandit feedback
  3. update statistics:
     (a) initialize M_t ← M_{t−1} and B_t ← B_{t−1}
     (b) for all observed edges e ∈ E, update M_t ← M_t + σ^{−2} x_e x_e^T and B_t ← B_t + x_e w_t(e)

Similar to the existing approaches for linear bandits [1, 9], we exploit the linear generalization to develop a learning algorithm for ICSB. Without loss of generality, we assume that ||x_e|| ≤ 1 for all e ∈ E. Moreover, we use X ∈ ℝ^{|E|×d} to denote the feature matrix, i.e., the row of X associated with edge e is x_e^T. Note that if a learning agent does not know how to construct good features, it can always choose the naïve feature matrix X = I ∈ ℝ^{|E|×|E|} and have no generalization model across edges. We refer to the special case X = I ∈ ℝ^{|E|×|E|} as the tabular case.

IMLinUCB algorithm
In this section, we propose Influence Maximization Linear UCB (IMLinUCB), detailed in Algorithm 1. Notice that IMLinUCB represents its past observations as a positive-definite matrix (the Gram matrix) M_t ∈ ℝ^{d×d} and a vector B_t ∈ ℝ^d. Specifically, let X_t be a matrix whose rows are the feature vectors of all edges observed in the first t rounds and Y_t be a binary column vector encoding the realizations of all edges observed in the first t rounds. Then M_t = I + σ^{−2} X_t^T X_t and B_t = X_t^T Y_t.
At each round t, IMLinUCB operates in three steps: First, it computes an upper confidence bound U_t(e) for each edge e ∈ E. Note that Proj_{[0,1]}(·) projects a real number into the interval [0, 1] to ensure that U_t ∈ [0, 1]^{|E|}. Second, it chooses a set of source nodes based on the given ORACLE and U_t, which is also a probability-weight function. Finally, it receives the edge semi-bandit feedback and uses it to update M_t and B_t. It is worth emphasizing that IMLinUCB is computationally efficient as long as ORACLE is computationally efficient. Specifically, at each round t, the computational complexities of both Steps 1 and 3 of IMLinUCB are O(|E| d^2).
It is worth pointing out that in the tabular case, IMLinUCB reduces to CUCB [7], in the sense that the confidence radii in IMLinUCB are the same as those in CUCB, up to logarithmic factors. That is, CUCB can be viewed as a special case of IMLinUCB with X = I.

Recall that the agent's objective is to maximize the expected cumulative reward, which is equivalent to minimizing the expected cumulative regret. The cumulative regret is the loss in reward (accumulated over rounds) due to the lack of knowledge of the activation probabilities. Observe that in each round t, IMLinUCB needs to use an approximation/randomized algorithm ORACLE to solve the offline IM problem. Naturally, this can lead to O(n) cumulative regret, since at each round there is a non-diminishing regret due to the approximation/randomized nature of ORACLE. To analyze the performance of IMLinUCB in such cases, we define a more appropriate performance metric, the scaled cumulative regret, as R^η(n) = Σ_{t=1}^{n} E[R^η_t], where n is the number of rounds, η > 0 is the scale, and R^η_t = f(S^opt, w_t) − (1/η) f(S_t, w_t) is the η-scaled realized regret at round t. When η = 1, R^η(n) reduces to the standard expected cumulative regret R(n).

Notice that in a practical implementation, we store M_t^{−1} instead of M_t. Moreover, the update M_t ← M_t + σ^{−2} x_e x_e^T is equivalent to the rank-one (Sherman-Morrison) update M_t^{−1} ← M_t^{−1} − (M_t^{−1} x_e x_e^T M_t^{−1}) / (x_e^T M_t^{−1} x_e + σ^2).

Figure 1: (a) Bar graph. (b) Star graph. (c) Ray graph. (d) Grid graph. Each undirected edge denotes two directed edges in opposite directions.

In this section, we give a regret bound for
IMLinUCB for the case when w(e) = x_e^T θ* for all e ∈ E, i.e., the linear generalization is perfect. Our main contribution is a regret bound that scales with a new complexity metric, maximum observed relevance, which depends on both the topology of G and the probability weight function w, and is defined in Section 4.1. We highlight this as most known results for this problem are worst case, and some of them do not depend on the probability weight function at all.

We start by defining some terminology. For a given directed graph G = (V, E) and source node set S ⊆ V, we say an edge e ∈ E is relevant to a node v ∈ V \ S under S if there exists a path p from a source node s ∈ S to v such that (1) e ∈ p and (2) p does not contain any source node other than s. Notice that with a given S, whether or not a node v ∈ V \ S is influenced only depends on the binary weights on its relevant edges. For any edge e ∈ E, we define N_{S,e} as the number of nodes in V \ S it is relevant to, and define P_{S,e} as the conditional probability that e is observed given S,

N_{S,e} := Σ_{v ∈ V\S} 1{e is relevant to v under S}  and  P_{S,e} := P(e is observed | S).  (1)

Notice that N_{S,e} only depends on the topology of G, while P_{S,e} depends on both the topology of G and the probability weight w. The maximum observed relevance C_* is defined as the maximum (over S) 2-norm of the N_{S,e}'s weighted by the P_{S,e}'s,

C_* := max_{S: |S|=K} √( Σ_{e ∈ E} N_{S,e}^2 P_{S,e} ).  (2)

As is detailed in the proof of Lemma 1 in Appendix A, C_* arises in the step where the Cauchy-Schwarz inequality is applied. Note that C_* also depends on both the topology of G and the probability weight w. However, C_* can be bounded from above based only on the topology of G or the size of the problem, i.e., L = |V| and |E|. Specifically, by defining C_G := max_{S: |S|=K} √( Σ_{e ∈ E} N_{S,e}^2 ), we have

C_* ≤ C_G = max_{S: |S|=K} √( Σ_{e ∈ E} N_{S,e}^2 ) ≤ (L − K) √|E| = O(L √|E|) = O(L^2),  (3)

where C_G is the maximum/worst-case (over w) value of C_* for the directed graph G, and the maximum is attained by setting w(e) = 1 for all e ∈ E. Since C_G is worst-case, it might be very far away from C_* if the activation probabilities are small. Indeed, this is what we expect in typical real-world situations. Notice also that if max_{e ∈ E} w(e) → 0, then P_{S,e} → 0 for all e ∉ E(S) and P_{S,e} = 1 for all e ∈ E(S), where E(S) is the set of edges with start node in S; hence we have C_* → C_G' := max_{S: |S|=K} √( Σ_{e ∈ E(S)} N_{S,e}^2 ). In particular, if K is small, C_G' is much less than C_G in many topologies. For example, in a complete graph with K = 1, C_G' = Θ(L^{3/2}) while C_G = Θ(L^2). Finally, it is worth pointing out that there exist situations (G, w) such that C_* = Θ(L^2). One such example is when G is a complete graph with L nodes and w(e) = L/(L + 1) for all edges e in this graph.

To give more intuition, in the rest of this subsection, we illustrate how C_G, the worst-case C_*, varies with the four graph topologies in Figure 1: bar, star, ray, and grid, as well as two other topologies: general tree and complete graph. We fix the node set V = {1, 2, ..., L} for all graphs. The bar graph (Figure 1a) is a graph where nodes i and i + 1 are connected when i is odd. The star graph (Figure 1b) is a graph where node 1 is central and all remaining nodes i ∈ V \ {1} are connected to it. The distance between any two of these nodes is 2. The ray graph (Figure 1c) is a star graph with k = ⌈√(L−1)⌉ arms, where node 1 is central and each arm contains either ⌈(L−1)/k⌉ or ⌊(L−1)/k⌋ nodes connected in a line.
The distance between any two nodes in this graph is O(√L). The grid graph (Figure 1d) is a classical non-tree graph with O(L) edges.
To see how C_G varies with the graph topology, we start with the simplified case of K = |S| = 1. In the bar graph (Figure 1a), only one edge is relevant to a node v ∈ V \ S and all the other edges are not relevant to any nodes; therefore, C_G = O(1). In the star graph (Figure 1b), for any source s, at most one edge is relevant to at most L − 1 nodes and the remaining edges are relevant to at most one node each. In this case, C_G ≤ √(L^2 + L) = O(L). In the ray graph (Figure 1c), for any source s, at most O(√L) edges are relevant to up to L − 1 nodes and the remaining edges are relevant to at most O(√L) nodes each. In this case, C_G = O(√(√L · L^2 + L · L)) = O(L^{5/4}). Finally, recall that for all graphs we can bound C_G by O(L √|E|), regardless of K. Hence, for the grid graph (Figure 1d) and the general tree graph, C_G = O(L^{3/2}) since |E| = O(L); for the complete graph, C_G = O(L^2) since |E| = O(L^2). Clearly, C_G varies widely with the topology of the graph. The second column of Table 1 summarizes how C_G varies with the above-mentioned graph topologies for general K = |S|.
Consider C_* defined in Section 4.1 and recall the worst-case upper bound C_* ≤ (L − K)√|E|; we have the following regret guarantees for IMLinUCB.
Theorem 1
Assume that (1) w(e) = x_e^T θ* for all e ∈ E and (2) ORACLE is an (α, γ)-approximation algorithm. Let D be a known upper bound on ||θ*||. If we apply IMLinUCB with σ = 1 and

c = √( d log(1 + n|E|/d) + 2 log( n(L + 1 − K) ) ) + D,  (4)

then we have

R^{αγ}(n) ≤ (c C_* / (αγ)) √( d n |E| log(1 + n|E|/d) ) + 1 = Õ( d C_* √(|E| n) / (αγ) )  (5)
          ≤ Õ( d (L − K) |E| √n / (αγ) ).  (6)

Moreover, if the feature matrix X = I ∈ ℝ^{|E|×|E|} (i.e., the tabular case), we have

R^{αγ}(n) ≤ (c C_* / (αγ)) √( n |E| log(1 + n) ) + 1 = Õ( |E| C_* √n / (αγ) )  (7)
          ≤ Õ( (L − K) |E|^{3/2} √n / (αγ) ).  (8)

Please refer to Appendix A for the proof of Theorem 1, which we outline in Section 4.3. We now briefly comment on the regret bounds in Theorem 1.
Topology-dependent bounds:
Since C_* is topology-dependent, the regret bounds in Equations 5 and 7 are also topology-dependent. Table 1 summarizes the regret bounds for each topology discussed in Section 4.1. Since the regret bounds in Table 1 are the worst-case regret bounds for a given topology, more general topologies have larger regret bounds. For instance, the regret bounds for the tree are larger than their counterparts for the star and the ray, since the star and the ray are special trees. The grid and the tree can also be viewed as special complete graphs obtained by setting w(e) = 0 for some e ∈ E; hence the complete graph has larger regret bounds. Again, in practice we expect C_* to be far smaller due to low activation probabilities.
The regret bound for the bar graph is based on Theorem 2 in the appendix, which is a stronger version of Theorem 1 for disconnected graphs.

topology        | C_G (worst-case C_*) | R^{αγ}(n) for general X      | R^{αγ}(n) for X = I
bar graph       | O(√K)                | Õ( dK √n / (αγ) )            | Õ( L √(Kn) / (αγ) )
star graph      | O(L√K)               | Õ( d L^{3/2} √(Kn) / (αγ) )  | Õ( L^2 √(Kn) / (αγ) )
ray graph       | O(L^{5/4}√K)         | Õ( d L^{7/4} √(Kn) / (αγ) )  | Õ( L^{9/4} √(Kn) / (αγ) )
tree graph      | O(L^{3/2})           | Õ( d L^2 √n / (αγ) )         | Õ( L^{5/2} √n / (αγ) )
grid graph      | O(L^{3/2})           | Õ( d L^2 √n / (αγ) )         | Õ( L^{5/2} √n / (αγ) )
complete graph  | O(L^2)               | Õ( d L^3 √n / (αγ) )         | Õ( L^4 √n / (αγ) )

Table 1: C_G and worst-case regret bounds for different graph topologies.
Tighter bounds in tabular case and under exact oracle:
Notice that for the tabular case with feature matrix X = I and d = |E|, Õ(√|E|)-tighter regret bounds are obtained in Equations 7 and 8. Also notice that the Õ(1/(αγ)) factor is due to the fact that ORACLE is an (α, γ)-approximation oracle. If ORACLE solves the IM problem exactly (i.e., α = γ = 1), then R^{αγ}(n) = R(n).
Tightness of our regret bounds:
First, note that our regret bound in the bar case with K = 1 matches the regret bound of the classic LinUCB algorithm. Specifically, with perfect linear generalization, this case is equivalent to a linear bandit problem with L arms and feature dimension d. From Table 1, our regret bound in this case is Õ(d√n), which matches the known regret bound of LinUCB that can be obtained by the technique of [1]. Second, we briefly discuss the tightness of the regret bound in Equation 6 for a general graph with L nodes and |E| edges. Note that the Õ(√n)-dependence on time is near-optimal, and the Õ(d)-dependence on feature dimension is standard in linear bandits [1, 33], since Õ(√d) results are only known for impractical algorithms. The Õ(L − K) factor is due to the fact that the reward in this problem ranges from K to L, rather than from 0 to 1. To explain the Õ(|E|) factor in this bound, notice that one Õ(√|E|) factor is due to the fact that up to |E| edges might be observed at each round (see Theorem 3), and is intrinsic to the problem, similarly to combinatorial semi-bandits [19]; the other Õ(√|E|) factor is due to linear generalization (see Lemma 1) and might be removed by a better analysis. We conjecture that our Õ( d(L − K)|E|√n / (αγ) ) regret bound in this case is at most Õ(√(|E| d)) away from being tight.
We now outline the proof of Theorem 1. For each round t ≤ n, we define the favorable event ξ_{t−1} = { |x_e^T (θ̂_{τ−1} − θ*)| ≤ c √( x_e^T M_{τ−1}^{−1} x_e ), ∀ e ∈ E, ∀ τ ≤ t }, and the unfavorable event ξ̄_{t−1} as the complement of ξ_{t−1}. If we decompose E[R^{αγ}_t], the (αγ)-scaled expected regret at round t, over the events ξ_{t−1} and ξ̄_{t−1}, and bound R^{αγ}_t on the event ξ̄_{t−1} using the naïve bound R^{αγ}_t ≤ L − K, then

E[R^{αγ}_t] ≤ P(ξ_{t−1}) E[R^{αγ}_t | ξ_{t−1}] + P(ξ̄_{t−1}) [L − K].
By choosing c as specified by Equation 4, we have P(ξ̄_{t−1})[L − K] < 1/n (see Lemma 2 in the appendix). On the other hand, notice that by the definition of ξ_{t−1}, we have w(e) ≤ U_t(e) for all e ∈ E under the event ξ_{t−1}. Using the monotonicity of f in the probability weight, and the fact that ORACLE is an (α, γ)-approximation algorithm, we have E[R^{αγ}_t | ξ_{t−1}] ≤ E[ f(S_t, U_t) − f(S_t, w) | ξ_{t−1} ] / (αγ).
The next observation is that, by the linearity of expectation, the gap f(S_t, U_t) − f(S_t, w) decomposes over the nodes v ∈ V \ S_t. Specifically, for any source node set S ⊆ V, any probability weight function w: E → [0, 1], and any node v ∈ V, we define f(S, w, v) as the probability that node v is influenced if the source node set is S and the probability weight is w. Hence, we have

f(S_t, U_t) − f(S_t, w) = Σ_{v ∈ V\S_t} [ f(S_t, U_t, v) − f(S_t, w, v) ].
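The node-level quantities f(S, w, v) in this decomposition have no closed form in general, but can be estimated by Monte Carlo; a quick sketch (our own helper, not from the paper):

```python
import random
from collections import deque

def node_influence_probs(graph, w, seeds, n_sim=1000, rng=None):
    """Monte Carlo estimates of f(S, w, v): the probability that each node v
    is influenced when the source set is S under probability weights w."""
    rng = rng or random.Random(0)
    counts = {}
    for _ in range(n_sim):
        influenced, frontier = set(seeds), deque(seeds)
        while frontier:
            u = frontier.popleft()
            for v in graph.get(u, []):
                if v not in influenced and rng.random() < w[(u, v)]:
                    influenced.add(v)
                    frontier.append(v)
        for v in influenced:
            counts[v] = counts.get(v, 0) + 1
    return {v: c / n_sim for v, c in counts.items()}
```

Summing the returned probabilities recovers the spread f(S, w), which is exactly the linearity-of-expectation decomposition above.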
Figure 2: Experimental results. (a) Stars and rays: log-log plots of the n-step regret of IMLinUCB in two graph topologies. We vary the number of nodes L and the mean edge weight ω. (b) Subgraph of the Facebook network: cumulative regret of CUCB and IMLinUCB with d = 10.

In the appendix, we show that under any weight function, the diffusion process from the source node set S_t to a target node v can be modeled as a Markov chain. Hence, the weight functions U_t and w give us two Markov chains with the same state space but different transition probabilities. The gap f(S_t, U_t, v) − f(S_t, w, v) can be recursively bounded based on the state diagram of the Markov chain under weight function w. With some algebra, Theorem 3 in Appendix A bounds f(S_t, U_t, v) − f(S_t, w, v) by the edge-level gaps U_t(e) − w(e) on the observed relevant edges for node v,

f(S_t, U_t, v) − f(S_t, w, v) ≤ Σ_{e ∈ E_{S_t,v}} E[ 1{O_t(e)} [U_t(e) − w(e)] | H_{t−1}, S_t ],  (9)

for any t, any "history" (past observations) H_{t−1} and S_t such that ξ_{t−1} holds, and any v ∈ V \ S_t, where E_{S_t,v} is the set of edges relevant to v and O_t(e) is the event that edge e is observed at round t. Based on Equation 9, we can prove Theorem 1 using standard linear-bandit techniques (see Appendix A).

In this section, we present a synthetic experiment in order to empirically validate our upper bounds on the regret. Next, we evaluate our algorithm on a real-world Facebook subgraph.
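For reference, the star and ray topologies used in the synthetic experiment (Figure 1) can be generated as follows; this is our own construction matching the description in Section 4.1, with node 1 as the center:

```python
import math

def star_edges(L):
    """Undirected star on nodes 1..L: every node is attached to center 1.
    Each returned pair stands for two directed edges, one per direction."""
    return [(1, v) for v in range(2, L + 1)]

def ray_edges(L):
    """Undirected ray on nodes 1..L: k = ceil(sqrt(L - 1)) arms attached to
    center 1, each a line of (nearly) equal length."""
    k = math.ceil(math.sqrt(L - 1))
    edges, node = [], 2
    for arm in range(k):
        prev = 1
        # split the L - 1 non-central nodes as evenly as possible
        for _ in range((L - 1 + arm) // k):
            edges.append((prev, node))
            prev, node = node, node + 1
    return edges
```

Both constructors return L − 1 undirected edges, i.e., 2(L − 1) directed edges once each pair is expanded in both directions.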
In the first experiment, we evaluate IMLinUCB on undirected stars and rays (Figure 1) and validate that the regret grows with the number of nodes L and the maximum observed relevance C_* as shown in Table 1. We focus on the tabular case (X = I) with K = |S| = 1, where the IM problem can be solved exactly. We vary the number of nodes L and the edge weight w(e) = ω, which is the same for all edges e. We run IMLinUCB for n steps and verify that it converges to the optimal solution in each experiment. We report the n-step regret of IMLinUCB for 8 ≤ L ≤ 32 in Figure 2a. Recall from Table 1 that R(n) = Õ(L^2) for the star and R(n) = Õ(L^{9/4}) for the ray.
We numerically estimate the growth of regret in L, i.e., the exponent of L, in the log-log space of L and regret. In particular, since log(f(L)) = p log(L) + log(c) for any f(L) = c L^p and c > 0, both p and log(c) can be estimated by linear regression in the new space. For star graphs with ω = 0.8 and ω = 0.7, our estimated growth is in both cases close to the expected Õ(L^2). For ray graphs with ω = 0.8 and ω = 0.7, our estimated growth is again close to the expected Õ(L^{9/4}). This shows that the maximum observed relevance C_* proposed in Section 4.1 is a reasonable complexity metric for these two topologies.
In the second experiment, we demonstrate the potential performance gain of
IMLinUCB in real-world influence maximization semi-bandit problems by exploiting linear generalization across edges.Specifically, we compare
IMLinUCB with
CUCB in a subgraph of Facebook network from [22]. Thesubgraph has L = |V| = 327 nodes and |E| = 5038 directed edges. Since the true probability weight8unction w is not available, we independently sample w ( e ) ’s from the uniform distribution U (0 , . and treat them as ground-truth. Note that this range of probabilities is guided by empirical evidencein [14, 3]. We set n = 5000 and K = 10 in this experiment. For IMLinUCB , we choose d = 10 and generate edge feature x e ’s as follows: we first use node2vec algorithm [15] to generate a nodefeature in (cid:60) d for each node v ∈ V ; then for each edge e , we generate x e as the element-wise productof node features of the two nodes connected to e . Note that the linear generalization in this experimentis imperfect in the sense that min θ ∈(cid:60) d max e ∈E | w ( e ) − x Te θ | > . For both CUCB and
IMLinUCB, we choose
ORACLE to be the state-of-the-art offline IM algorithm proposed in [27]. To compute the cumulative regret, we compare against a fixed seed set S∗ obtained by using the true w as input to the oracle of [27]. We average the empirical cumulative regret over independent runs and plot the results in Figure 2b. The experimental results show that, compared with CUCB, IMLinUCB can significantly reduce the cumulative regret by exploiting linear generalization across the w(e)'s.

There exist prior results on IM semi-bandits [21, 8, 31]. First, Lei et al. [21] gave algorithms for the same feedback model as ours. These algorithms are not analyzed and cannot solve large-scale problems because they estimate each edge weight independently. Second, our setting is a special case of a stochastic combinatorial semi-bandit with a submodular reward function and stochastically observed edges [8]; this is the closest related work. Their gap-dependent and gap-free bounds are both problematic because they depend on the reciprocal of the minimum observation probability p∗ of an edge: consider a line graph with |E| edges where all edge weights are 1/2; then 1/p∗ is 2^{|E|−1}. In contrast, our regret bounds in Theorem 1 are polynomial in all quantities of interest. A very recent result of Wang and Chen [32] removes the 1/p∗ factor of [8] in the tabular case and presents a worst-case bound of Õ(L|E|√n), which in the tabular complete-graph case improves over our result by Õ(L). On the other hand, their analysis does not give the structural guarantees that we provide with the maximum observed relevance C∗, which yields potentially much better results for the case at hand and gives insights into the complexity of IM bandits. Moreover, neither Chen et al. [8] nor Wang and Chen [32] consider generalization models across edges or nodes, and therefore their proposed algorithms are unlikely to be practical for real-world social networks.
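For concreteness, the edge-feature construction described in the Facebook experiment above can be sketched as follows. This is a minimal sketch with hypothetical names; the random vectors below are placeholders standing in for node2vec embeddings, which would be computed from the graph in the actual experiment.

```python
import numpy as np

# Sketch of the edge-feature construction used above: each edge feature x_e is
# the element-wise product of the features of the edge's two endpoints.
# The random node embeddings are placeholders for real node2vec output.
def edge_features(edges, node_emb):
    """edges: iterable of directed (u, v) pairs; node_emb: dict node -> (d,) array."""
    return {(u, v): node_emb[u] * node_emb[v] for (u, v) in edges}

d = 10
rng = np.random.default_rng(0)
node_emb = {v: rng.normal(size=d) for v in range(4)}  # placeholder embeddings
edges = [(0, 1), (1, 2), (2, 3)]
x = edge_features(edges, node_emb)
assert all(feat.shape == (d,) for feat in x.values())
```

With such features, each weight w(e) is modeled as x_e^T θ∗, so only d parameters need to be learned rather than one parameter per edge.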
In contrast, our proposed algorithm scales to large problems by exploiting linear generalization across edges.

IM bandits for different influence models and settings:
There exist a number of extensions and related results for IM bandits; we mention only the most related ones (see [28] for a recent survey). Vaswani et al. [31] proposed a learning algorithm for a different and more challenging feedback model, where the learning agent observes the influenced nodes but not the edges; however, they do not give any guarantees. Carpentier and Valko [5] give a minimax-optimal algorithm for IM bandits, but consider only a local model of influence with a single source, where a cascade of influences never happens. In related networked bandits [11], the learner chooses a node and its reward is the sum of the rewards of the chosen node and its neighborhood. The problem becomes more challenging when the influence probabilities are allowed to change [2], when the seed set is chosen adaptively [30], or when a continuous model is considered [12]. Furthermore, Singla et al. [26] treat the IM setting with additional observability constraints, where we face a restriction on which nodes we can choose at each round. This setting is also related to volatile multi-armed bandits, where the set of possible arms changes [4]. Vaswani et al. [29] proposed a diffusion-independent algorithm for IM semi-bandits that applies to a wide range of diffusion models, based on the maximum-reachability approximation. Despite its wide applicability, the maximum-reachability approximation introduces an additional approximation factor into the scaled regret bounds; as they discuss, this factor can be large in some cases. Lagrée et al. [20] treat a persistent extension of IM bandits in which some nodes become persistent over the rounds and no longer yield rewards. This work is also a generalization and extension of recent work on cascading bandits [17, 18, 34], since cascading bandits can be viewed as variants of online influence maximization problems with special topologies (chains).
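To make the feedback model concrete, the following minimal sketch (our own illustration, not the authors' code) simulates one independent-cascade episode and returns the edge-level semi-bandit feedback: which edges were observed and which of those were active.

```python
import random

def simulate_ic(edges, w, seeds, rng):
    """One independent-cascade episode on directed `edges` with activation
    probabilities `w`. Returns (active_nodes, observed_edges, active_edges):
    an edge is observed iff its start node is influenced; an observed edge is
    active iff its independent activation attempt succeeds."""
    active, frontier = set(seeds), set(seeds)
    observed, activated = [], []
    while frontier:
        next_frontier = set()
        for u in sorted(frontier):            # each influenced node attempts its out-edges once
            for (a, b) in edges:
                if a != u:
                    continue
                observed.append((a, b))       # start node influenced => edge observed
                if rng.random() < w[(a, b)]:  # independent activation attempt
                    activated.append((a, b))
                    if b not in active:
                        active.add(b)
                        next_frontier.add(b)
        frontier = next_frontier
    return active, observed, activated

# A chain 0 -> 1 -> 2 with all weights 1 influences every node from seed {0}.
spread, obs, act = simulate_ic([(0, 1), (1, 2)], {(0, 1): 1.0, (1, 2): 1.0}, {0},
                               random.Random(0))
assert spread == {0, 1, 2}
```

In the semi-bandit setting, only `observed` and the activation outcomes on those edges are revealed to the learner; unobserved edges give no feedback.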
Acknowledgements
The research presented was supported by French Ministry of Higher Education and Research, Nord-Pas-de-Calais Regional Council, Inria and Universität Potsdam associated-team north-European project Allocate, and French National Research Agency projects ExTra-Learn (n.ANR-14-CE24-0010-01) and BoB (n.ANR-16-CE23-0003). We would also like to thank Dr. Wei Chen and Mr. Qinshi Wang for pointing out a mistake in an earlier version of this paper.

References

[1] Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. In Neural Information Processing Systems, 2011.
[2] Yixin Bao, Xiaoke Wang, Zhi Wang, Chuan Wu, and Francis C. M. Lau. Online influence maximization in non-stationary social networks. In International Symposium on Quality of Service, 2016.
[3] Nicola Barbieri, Francesco Bonchi, and Giuseppe Manco. Topic-aware social influence propagation models. Knowledge and Information Systems, 37(3):555–584, 2013.
[4] Zahy Bnaya, Rami Puzis, Roni Stern, and Ariel Felner. Social network search as a volatile multi-armed bandit problem. Human Journal, 2(2):84–98, 2013.
[5] Alexandra Carpentier and Michal Valko. Revealing graph bandits for maximizing local influence. In International Conference on Artificial Intelligence and Statistics, 2016.
[6] Wei Chen, Chi Wang, and Yajun Wang. Scalable influence maximization for prevalent viral marketing in large-scale social networks. In Knowledge Discovery and Data Mining, 2010.
[7] Wei Chen, Yajun Wang, and Yang Yuan. Combinatorial multi-armed bandit: General framework, results and applications. In International Conference on Machine Learning, 2013.
[8] Wei Chen, Yajun Wang, and Yang Yuan. Combinatorial multi-armed bandit and its extension to probabilistically triggered arms. Journal of Machine Learning Research, 17, 2016.
[9] Varsha Dani, Thomas P. Hayes, and Sham M. Kakade. Stochastic linear optimization under bandit feedback. In Conference on Learning Theory, 2008.
[10] David Easley and Jon Kleinberg. Networks, Crowds, and Markets: Reasoning About a Highly Connected World. Cambridge University Press, 2010.
[11] Meng Fang and Dacheng Tao. Networked bandits with disjoint linear payoffs. In International Conference on Knowledge Discovery and Data Mining, 2014.
[12] Mehrdad Farajtabar, Xiaojing Ye, Sahar Harati, Le Song, and Hongyuan Zha. Multistage campaigning in social networks. In Neural Information Processing Systems, 2016.
[13] Manuel Gomez-Rodriguez and Bernhard Schölkopf. Influence maximization in continuous time diffusion networks. In International Conference on Machine Learning, 2012.
[14] Amit Goyal, Francesco Bonchi, and Laks V. S. Lakshmanan. Learning influence probabilities in social networks. In Web Search and Data Mining, pages 241–250, 2010.
[15] Aditya Grover and Jure Leskovec. node2vec: Scalable feature learning for networks. In Knowledge Discovery and Data Mining, 2016.
[16] David Kempe, Jon Kleinberg, and Éva Tardos. Maximizing the spread of influence through a social network. In Knowledge Discovery and Data Mining, 2003.
[17] Branislav Kveton, Csaba Szepesvári, Zheng Wen, and Azin Ashkan. Cascading bandits: Learning to rank in the cascade model. In International Conference on Machine Learning, 2015.
[18] Branislav Kveton, Zheng Wen, Azin Ashkan, and Csaba Szepesvári. Combinatorial cascading bandits. In Neural Information Processing Systems, pages 1450–1458, 2015.
[19] Branislav Kveton, Zheng Wen, Azin Ashkan, and Csaba Szepesvári. Tight regret bounds for stochastic combinatorial semi-bandits. In International Conference on Artificial Intelligence and Statistics, 2015.
[20] Paul Lagrée, Olivier Cappé, Bogdan Cautis, and Silviu Maniu. Effective large-scale online influence maximization. In International Conference on Data Mining, 2017.
[21] Siyu Lei, Silviu Maniu, Luyi Mo, Reynold Cheng, and Pierre Senellart. Online influence maximization. In Knowledge Discovery and Data Mining, 2015.
[22] Jure Leskovec and Andrej Krevl. SNAP Datasets: Stanford large network dataset collection. http://snap.stanford.edu/data, 2014.
[23] Yanhua Li, Wei Chen, Yajun Wang, and Zhi-Li Zhang. Influence diffusion dynamics and influence maximization in social networks with friend and foe relationships. In Web Search and Data Mining, 2013.
[24] Praneeth Netrapalli and Sujay Sanghavi. Learning the graph of epidemic cascades. In ACM SIGMETRICS Performance Evaluation Review, volume 40, pages 211–222, 2012.
[25] Kazumi Saito, Ryohei Nakano, and Masahiro Kimura. Prediction of information diffusion probabilities for independent cascade model. In Knowledge-Based Intelligent Information and Engineering Systems, pages 67–75, 2008.
[26] Adish Singla, Eric Horvitz, Pushmeet Kohli, Ryen White, and Andreas Krause. Information gathering in networks via active exploration. In International Joint Conference on Artificial Intelligence, 2015.
[27] Youze Tang, Xiaokui Xiao, and Yanchen Shi. Influence maximization: Near-optimal time complexity meets practical efficiency. In SIGMOD, 2014.
[28] Michal Valko. Bandits on Graphs and Structures. Habilitation, École normale supérieure de Cachan, 2016.
[29] Sharan Vaswani, Branislav Kveton, Zheng Wen, Mohammad Ghavamzadeh, Laks V. S. Lakshmanan, and Mark Schmidt. Model-independent online learning for influence maximization. In International Conference on Machine Learning, 2017.
[30] Sharan Vaswani and Laks V. S. Lakshmanan. Adaptive influence maximization in social networks: Why commit when you can adapt? Technical report, 2016.
[31] Sharan Vaswani, Laks V. S. Lakshmanan, and Mark Schmidt. Influence maximization with bandits. In NIPS Workshop on Networks in the Social and Information Sciences, 2015.
[32] Qinshi Wang and Wei Chen. Improving regret bounds for combinatorial semi-bandits with probabilistically triggered arms and its applications. In Neural Information Processing Systems, 2017.
[33] Zheng Wen, Branislav Kveton, and Azin Ashkan. Efficient learning in large-scale combinatorial semi-bandits. In International Conference on Machine Learning, 2015.
[34] Shi Zong, Hao Ni, Kenny Sung, Nan Rosemary Ke, Zheng Wen, and Branislav Kveton. Cascading bandits for large-scale recommendation problems. In Uncertainty in Artificial Intelligence, 2016.

Appendix
A Proof of Theorem 1
In the appendix, we prove a slightly stronger version of Theorem 1, which also uses another complexity metric E∗, defined as follows. Assume that the graph G = (V, E) consists of m disconnected subgraphs G_1 = (V_1, E_1), G_2 = (V_2, E_2), ..., G_m = (V_m, E_m), sorted in descending order of the number of edges |E_i|. We define E∗ as the number of edges in the first min{m, K} subgraphs:

E∗ = Σ_{i=1}^{min{m,K}} |E_i|.   (10)

Note that by definition, E∗ ≤ |E|. Based on E∗, we have the following slightly stronger version of Theorem 1.

Theorem 2
Assume that (1) w(e) = x_e^T θ∗ for all e ∈ E and (2) ORACLE is an (α, γ)-approximation algorithm. Let D be a known upper bound on ‖θ∗‖₂. If we apply IMLinUCB with σ = 1 and

c ≥ √( d log(1 + nE∗/d) + 2 log(n(L + 1 − K)) ) + D,   (11)

then we have

R^{αγ}(n) ≤ (2cC∗/(αγ)) √( dnE∗ log₂(1 + nE∗/d) ) + 1 = Õ( dC∗ √(E∗ n) / (αγ) ).   (12)

Moreover, if the feature matrix is of the form X = I ∈ ℝ^{|E|×|E|} (i.e., the tabular case), we have

R^{αγ}(n) ≤ (2cC∗/(αγ)) √( n|E| log₂(1 + n) ) + 1 = Õ( |E| C∗ √n / (αγ) ).   (13)

Since E∗ ≤ |E|, Theorem 2 implies Theorem 1. We prove Theorem 2 in the remainder of this section. We now define some notation to simplify the exposition throughout this section.

Definition 1
For any source node set
S ⊆ V, any probability weight function w : E → [0, 1], and any node v ∈ V, we define f(S, w, v) as the probability that node v is influenced when the source node set is S and the probability weight function is w. Notice that by definition, f(S, w) = Σ_{v∈V} f(S, w, v) always holds. Moreover, if v ∈ S, then f(S, w, v) = 1 for any w, by the definition of the influence model.

Definition 2
For any round t and any directed edge e ∈ E, we define the event

O_t(e) = {edge e is observed at round t}.

Note that by definition, a directed edge e is observed if and only if its start node is influenced; observed does not necessarily mean that the edge is active.

A.1 Proof of Theorem 2

Proof:
Let H_t be the history (σ-algebra) of past observations and actions by the end of round t. By the definition of R^{αγ}_t, we have

E[R^{αγ}_t | H_{t−1}] = f(S_opt, w) − (1/(αγ)) E[f(S_t, w) | H_{t−1}],   (14)

where the expectation is over the possible randomness of S_t, since ORACLE might be a randomized algorithm. Notice that the randomness coming from the edge activations is already taken care of in the definition of f. For any t ≤ n, we define the event ξ_{t−1} as

ξ_{t−1} = { |x_e^T(θ̄_{τ−1} − θ∗)| ≤ c √(x_e^T M_{τ−1}^{−1} x_e), ∀e ∈ E, ∀τ ≤ t },   (15)

and ξ̄_{t−1} as the complement of ξ_{t−1}. Notice that ξ_{t−1} is H_{t−1}-measurable. Hence we have

E[R^{αγ}_t] ≤ P(ξ_{t−1}) E[ f(S_opt, w) − f(S_t, w)/(αγ) | ξ_{t−1} ] + P(ξ̄_{t−1}) [L − K].

Notice that under the event ξ_{t−1}, w(e) ≤ U_t(e) for all e ∈ E and all t ≤ n; thus we have

f(S_opt, w) ≤ f(S_opt, U_t) ≤ max_{S : |S|=K} f(S, U_t) ≤ (1/(αγ)) E[f(S_t, U_t) | H_{t−1}],

where the first inequality follows from the monotonicity of f in the probability weight, and the last inequality follows from the fact that ORACLE is an (α, γ)-approximation algorithm. Thus, we have

E[R^{αγ}_t] ≤ (1/(αγ)) P(ξ_{t−1}) E[ f(S_t, U_t) − f(S_t, w) | ξ_{t−1} ] + P(ξ̄_{t−1}) [L − K].   (16)

Notice that based on Definition 1, we have

f(S_t, U_t) − f(S_t, w) = Σ_{v ∈ V∖S_t} [ f(S_t, U_t, v) − f(S_t, w, v) ].

Recall that for a given graph G = (V, E) and a given source node set S ⊆ V, we say that an edge e ∈ E and a node v ∈ V∖S are relevant if there exists a path p from a source node s ∈ S to v such that (1) e ∈ p and (2) p does not contain another source node other than s. We use E_{S,v} ⊆ E to denote the set of edges relevant to node v under the source node set S, and V_{S,v} ⊆ V to denote the set of nodes connected to at least one edge in E_{S,v}.
Notice that G_{S,v} ≜ (V_{S,v}, E_{S,v}) is a subgraph of G, and we refer to it as the relevant subgraph of node v under the source node set S. Based on the notion of relevant subgraph, we have the following theorem, which bounds f(S_t, U_t, v) − f(S_t, w, v) by the edge-level gaps U_t(e) − w(e) on the observed edges in the relevant subgraph G_{S_t,v} of node v.

Theorem 3
For any t, any history H_{t−1} and S_t such that ξ_{t−1} holds, and any v ∈ V∖S_t, we have

f(S_t, U_t, v) − f(S_t, w, v) ≤ Σ_{e ∈ E_{S_t,v}} E[ 1{O_t(e)} [U_t(e) − w(e)] | H_{t−1}, S_t ],

where E_{S_t,v} is the edge set of the relevant subgraph G_{S_t,v}. Please refer to Section A.2 for the proof of Theorem 3. Notice that under the favorable event ξ_{t−1}, we have U_t(e) − w(e) ≤ 2c √(x_e^T M_{t−1}^{−1} x_e) for all e ∈ E. Therefore, we have

E[R^{αγ}_t] ≤ (2c/(αγ)) P(ξ_{t−1}) E[ Σ_{v∈V∖S_t} Σ_{e∈E_{S_t,v}} 1{O_t(e)} √(x_e^T M_{t−1}^{−1} x_e) | ξ_{t−1} ] + P(ξ̄_{t−1}) [L − K]
≤ (2c/(αγ)) E[ Σ_{v∈V∖S_t} Σ_{e∈E_{S_t,v}} 1{O_t(e)} √(x_e^T M_{t−1}^{−1} x_e) ] + P(ξ̄_{t−1}) [L − K]
= (2c/(αγ)) E[ Σ_{e∈E} 1{O_t(e)} √(x_e^T M_{t−1}^{−1} x_e) Σ_{v∈V∖S_t} 1{e ∈ E_{S_t,v}} ] + P(ξ̄_{t−1}) [L − K]
= (2c/(αγ)) E[ Σ_{e∈E} 1{O_t(e)} N_{S_t,e} √(x_e^T M_{t−1}^{−1} x_e) ] + P(ξ̄_{t−1}) [L − K],   (17)

where N_{S_t,e} = Σ_{v∈V∖S_t} 1{e ∈ E_{S_t,v}} is defined in Equation 1. Thus we have

R^{αγ}(n) ≤ (2c/(αγ)) E[ Σ_{t=1}^n Σ_{e∈E} 1{O_t(e)} N_{S_t,e} √(x_e^T M_{t−1}^{−1} x_e) ] + [L − K] Σ_{t=1}^n P(ξ̄_{t−1}).   (18)

In the following lemma, we give a worst-case bound on Σ_{t=1}^n Σ_{e∈E} 1{O_t(e)} N_{S_t,e} √(x_e^T M_{t−1}^{−1} x_e).

Lemma 1
For any round t = 1, 2, ..., n, we have

Σ_{t=1}^n Σ_{e∈E} 1{O_t(e)} N_{S_t,e} √(x_e^T M_{t−1}^{−1} x_e) ≤ √( ( Σ_{t=1}^n Σ_{e∈E} 1{O_t(e)} N²_{S_t,e} ) · dE∗ log(1 + nE∗/(dσ²)) / log(1 + σ^{−2}) ).

Moreover, if X = I ∈ ℝ^{|E|×|E|}, then we have

Σ_{t=1}^n Σ_{e∈E} 1{O_t(e)} N_{S_t,e} √(x_e^T M_{t−1}^{−1} x_e) ≤ √( ( Σ_{t=1}^n Σ_{e∈E} 1{O_t(e)} N²_{S_t,e} ) · |E| log(1 + n/σ²) / log(1 + σ^{−2}) ).

Please refer to Section A.3 for the proof of Lemma 1. Finally, notice that for any t,

E[ Σ_{e∈E} 1{O_t(e)} N²_{S_t,e} | S_t ] = Σ_{e∈E} N²_{S_t,e} E[1{O_t(e)} | S_t] = Σ_{e∈E} N²_{S_t,e} P_{S_t,e} ≤ C∗²;

thus, taking the expectation over the possibly randomized oracle and applying Jensen's inequality, we get

E[ √( Σ_{t=1}^n Σ_{e∈E} 1{O_t(e)} N²_{S_t,e} ) ] ≤ √( Σ_{t=1}^n E[ Σ_{e∈E} 1{O_t(e)} N²_{S_t,e} ] ) ≤ √( Σ_{t=1}^n C∗² ) = C∗ √n.   (19)

Combining the above with Lemma 1 and (18), we obtain

R^{αγ}(n) ≤ (2cC∗/(αγ)) √( dnE∗ log(1 + nE∗/(dσ²)) / log(1 + σ^{−2}) ) + [L − K] Σ_{t=1}^n P(ξ̄_{t−1}).   (20)

For the special case when X = I, we have

R^{αγ}(n) ≤ (2cC∗/(αγ)) √( n|E| log(1 + n/σ²) / log(1 + σ^{−2}) ) + [L − K] Σ_{t=1}^n P(ξ̄_{t−1}).   (21)

Finally, we need to bound the probability that the upper confidence bounds fail, Σ_{t=1}^n P(ξ̄_{t−1}). We prove the following bound on P(ξ̄_{t−1}):

Lemma 2
For any t = 1, 2, ..., n, any σ > 0, any δ ∈ (0, 1), and any

c ≥ σ √( d log(1 + nE∗/(dσ²)) + 2 log(1/δ) ) + ‖θ∗‖₂,

we have P(ξ̄_{t−1}) ≤ δ. Please refer to Section A.4 for the proof of Lemma 2. From Lemma 2, for a known upper bound D on ‖θ∗‖₂, if we choose σ = 1 and c ≥ √( d log(1 + nE∗/d) + 2 log(n(L + 1 − K)) ) + D, which corresponds to δ = 1/(n(L + 1 − K)) in Lemma 2, then we have

[L − K] Σ_{t=1}^n P(ξ̄_{t−1}) < 1.   □

A.2 Proof of Theorem 3
Recall that we use G_{S_t,v} = (V_{S_t,v}, E_{S_t,v}) to denote the relevant subgraph of node v under the source node set S_t. Since Theorem 3 concerns the influence from S_t to v, and by definition all paths from S_t to v are in G_{S_t,v}, it is sufficient to restrict attention to G_{S_t,v} and ignore the other parts of G in this analysis. We start by defining some useful notation.

Influence Probability with Removed Nodes:
Recall that for any weight function w : E → [0, 1], any source node set S ⊂ V, and any target node v ∈ V, f(S, w, v) is the probability that S will influence v under weight w (see Definition 1). We now define a similar notion for the influence probability with removed nodes. Specifically, for any disjoint node sets V₁, V₂ ⊆ V_{S_t,v}, we define h(V₁, V₂, w) as follows:

• First, we remove the nodes in V₂, as well as all edges connected to or from V₂, from G_{S_t,v}, and obtain a new graph G′.
• h(V₁, V₂, w) is the probability that V₁ will influence the target node v in the graph G′ under the weight (activation probability) w(e) for all e ∈ G′.

A mathematically equivalent way to define h(V₁, V₂, w) is as the probability that V₁ will influence v in G_{S_t,v} under a new weight w̃, defined as

w̃(e) = 0 if e is from or to a node in V₂, and w̃(e) = w(e) otherwise.

Note that by definition, f(S_t, w, v) = h(S_t, ∅, w). Also note that h(V₁, V₂, w) implicitly depends on v, but we omit v from this notation to simplify the exposition.

Edge Set E(V₁, V₂): For any two node sets V₁, V₂ ⊆ V_{S_t,v}, we define the edge set E(V₁, V₂) as

E(V₁, V₂) = { e = (u₁, u₂) : e ∈ E_{S_t,v}, u₁ ∈ V₁, and u₂ ∉ V₂ }.

That is, E(V₁, V₂) is the set of edges in G_{S_t,v} from V₁ to V_{S_t,v} ∖ V₂.

Diffusion Process:
Under any edge activation realization w̄(e) ∈ {0, 1}, e ∈ E_{S_t,v}, on the relevant subgraph G_{S_t,v} (we write w̄ for a realization to distinguish it from the weight function w), we define a finite-length sequence of disjoint node sets S₀, S₁, ..., S_τ̃ by S₀ = S_t and

S_{τ+1} ≜ { u ∈ V_{S_t,v} : u ∉ ∪_{τ′=0}^{τ} S_{τ′} and ∃ e = (u₁, u) ∈ E_{S_t,v} s.t. u₁ ∈ S_τ and w̄(e) = 1 },   (22)

for all τ = 0, ..., τ̃ − 1. That is, under the realization w̄(e), e ∈ E_{S_t,v}, S_{τ+1} is the set of nodes directly activated by S_τ. Specifically, any node u ∈ S_{τ+1} satisfies u ∉ ∪_{τ′=0}^{τ} S_{τ′} (i.e., it was not activated before), and there exists an activated edge e from S_τ to u (i.e., it is activated by some node in S_τ). We define S_τ̃ as the first node set in the sequence such that either S_τ̃ = ∅ or v ∈ S_τ̃, and the sequence terminates at S_τ̃. Note that by definition, τ̃ ≤ |V_{S_t,v}| always holds. We refer to each τ = 0, 1, ..., τ̃ as a diffusion step in this section. To simplify the exposition, we also define S̄_τ ≜ ∪_{τ′=0}^{τ} S_{τ′} for all τ ≥ 0, and S̄_{−1} = ∅. Since w̄ is random, (S_τ)_{τ=0}^{τ̃} is a stochastic process, which we refer to as the diffusion process. Note that τ̃ is also random; in particular, it is a stopping time. Based on the shorthand notation defined above, we have the following lemma for the diffusion process (S_τ)_{τ=0}^{τ̃} under any weight function w.

Lemma 3 For any weight function w : E → [0, 1], any step τ = 0, 1, ..., τ̃, and any S_τ and S̄_{τ−1}, we have

h(S_τ, S̄_{τ−1}, w) = 1 if v ∈ S_τ;  h(S_τ, S̄_{τ−1}, w) = 0 if S_τ = ∅;  and h(S_τ, S̄_{τ−1}, w) = E[ h(S_{τ+1}, S̄_τ, w) | (S_τ, S̄_{τ−1}) ] otherwise,

where the expectation is over S_{τ+1} under weight w. Note that the tuple (S_τ, S̄_{τ−1}) in the conditional expectation means that S_τ is the source node set and the nodes in S̄_{τ−1} have been removed.
Notice that by definition, h(S_τ, S̄_{τ−1}, w) = 1 if v ∈ S_τ, and h(S_τ, S̄_{τ−1}, w) = 0 if S_τ = ∅. Also note that in these two cases, τ̃ = τ. Otherwise, we prove that h(S_τ, S̄_{τ−1}, w) = E[ h(S_{τ+1}, S̄_τ, w) | (S_τ, S̄_{τ−1}) ]. Recall that by definition, h(S_τ, S̄_{τ−1}, w) is the probability that v will be influenced conditioned on

the source node set S_τ and the removed node set S̄_{τ−1},   (23)

that is,

h(S_τ, S̄_{τ−1}, w) = E[ 1(v is influenced) | (S_τ, S̄_{τ−1}) ].   (24)

Let w̄(e), ∀e ∈ E(S_τ, S̄_τ), be any possible realization of the edge activations (w̄(e) ∈ {0, 1}). We now analyze the probability that v will be influenced conditioned on

the source node set S_τ, the removed node set S̄_{τ−1}, and w̄(e) for all e ∈ E(S_τ, S̄_τ).   (25)

Specifically, conditioned on Equation 25, we can define a new weight function w′ as

w′(e) = w̄(e) if e ∈ E(S_τ, S̄_τ), and w′(e) = w(e) otherwise.   (26)

Then h(S_τ, S̄_{τ−1}, w′) is the probability that v will be influenced conditioned on Equation 25. That is,

h(S_τ, S̄_{τ−1}, w′) = E[ 1(v is influenced) | (S_τ, S̄_{τ−1}), w̄(e) ∀e ∈ E(S_τ, S̄_τ) ],   (27)

for any possible realization w̄(e), ∀e ∈ E(S_τ, S̄_τ). Notice that on the left-hand side of Equation 27, w′ encodes the conditioning on w̄(e) for all e ∈ E(S_τ, S̄_τ) (see Equation 26). From here to Equation 29, we focus on an arbitrary but fixed realization w̄(e), ∀e ∈ E(S_τ, S̄_τ) (or, equivalently, an arbitrary but fixed w′). Based on the definition of S_{τ+1}, conditioned on Equation 25, S_{τ+1} is deterministic, and all nodes in S_{τ+1} can also be treated as source nodes.
Thus, we have h(S_τ, S̄_{τ−1}, w′) = h(S_τ ∪ S_{τ+1}, S̄_{τ−1}, w′), conditioned on Equation 25. On the other hand, conditioned on Equation 25, we can treat any edge e ∈ E(S_τ, S̄_τ) with w̄(e) = 0 as having been removed. Since the nodes in S̄_{τ−1} have also been removed and v ∉ S̄_τ, if there is a path from S_τ to v, then it must go through S_{τ+1}, and the last node of the path in S_{τ+1} must come after the last node of the path in S_τ (note that the path might return to S_τ several times). Hence, conditioned on Equation 25, if the nodes in S_{τ+1} are also treated as source nodes, then S_τ is irrelevant for the influence on v and can be removed. So we have

h(S_τ, S̄_{τ−1}, w′) = h(S_τ ∪ S_{τ+1}, S̄_{τ−1}, w′) = h(S_{τ+1}, S̄_τ, w).   (28)

Note that in the last equation we change the weight function back to w, since the edges in E(S_τ, S̄_τ) have been removed. Thus, conditioned on Equation 25, we have

h(S_{τ+1}, S̄_τ, w) = h(S_τ, S̄_{τ−1}, w′) = E[ 1(v is influenced) | (S_τ, S̄_{τ−1}), w̄(e) ∀e ∈ E(S_τ, S̄_τ) ].   (29)

Notice again that Equation 29 holds for any possible realization w̄(e), ∀e ∈ E(S_τ, S̄_τ). Finally, we have

h(S_τ, S̄_{τ−1}, w) =(a) E[ 1(v is influenced) | (S_τ, S̄_{τ−1}) ]
=(b) E[ E[ 1(v is influenced) | (S_τ, S̄_{τ−1}), w̄(e) ∀e ∈ E(S_τ, S̄_τ) ] | (S_τ, S̄_{τ−1}) ]
=(c) E[ h(S_{τ+1}, S̄_τ, w) | (S_τ, S̄_{τ−1}) ],   (30)

where (a) follows from Equation 24, (b) follows from the tower rule, and (c) follows from Equation 29. This concludes the proof.   □

Consider two weight functions
U, w : E → [0, 1] such that U(e) ≥ w(e) for all e ∈ E. The following lemma bounds the difference h(S_τ, S̄_{τ−1}, U) − h(S_τ, S̄_{τ−1}, w) in a recursive way.

Lemma 4
For any two weight functions w, U : E → [0, 1] such that U(e) ≥ w(e) for all e ∈ E, any step τ = 0, 1, ..., τ̃, and any S_τ and S̄_{τ−1}, we have h(S_τ, S̄_{τ−1}, U) − h(S_τ, S̄_{τ−1}, w) = 0 if v ∈ S_τ or S_τ = ∅; and otherwise

h(S_τ, S̄_{τ−1}, U) − h(S_τ, S̄_{τ−1}, w) ≤ Σ_{e ∈ E(S_τ, S̄_τ)} [U(e) − w(e)] + E[ h(S_{τ+1}, S̄_τ, U) − h(S_{τ+1}, S̄_τ, w) | (S_τ, S̄_{τ−1}) ],

where the expectation is over S_{τ+1} under weight w. Recall that the tuple (S_τ, S̄_{τ−1}) in the conditional expectation means that S_τ is the source node set and the nodes in S̄_{τ−1} have been removed.

Proof:
First, note that if v ∈ S_τ or S_τ = ∅, then h(S_τ, S̄_{τ−1}, U) − h(S_τ, S̄_{τ−1}, w) = 0 follows directly from Lemma 3. Otherwise, to simplify the exposition, we overload the notation and use w(S_{τ+1}) to denote the conditional probability of S_{τ+1} given (S_τ, S̄_{τ−1}) under the weight function w, and similarly for U(S_{τ+1}). That is,

w(S_{τ+1}) ≜ Prob[ S_{τ+1} | (S_τ, S̄_{τ−1}); w ],  U(S_{τ+1}) ≜ Prob[ S_{τ+1} | (S_τ, S̄_{τ−1}); U ],   (31)

where the tuple (S_τ, S̄_{τ−1}) in the conditional probability means that S_τ is the source node set and the nodes in S̄_{τ−1} have been removed, and the w or U after the semicolon indicates the weight function. Then, from Lemma 3, we have

h(S_τ, S̄_{τ−1}, U) = Σ_{S_{τ+1}} U(S_{τ+1}) h(S_{τ+1}, S̄_τ, U),  h(S_τ, S̄_{τ−1}, w) = Σ_{S_{τ+1}} w(S_{τ+1}) h(S_{τ+1}, S̄_τ, w),

where the sums are over all possible realizations of S_{τ+1}. Hence we have

h(S_τ, S̄_{τ−1}, U) − h(S_τ, S̄_{τ−1}, w)
= Σ_{S_{τ+1}} [ U(S_{τ+1}) h(S_{τ+1}, S̄_τ, U) − w(S_{τ+1}) h(S_{τ+1}, S̄_τ, w) ]
= Σ_{S_{τ+1}} [ U(S_{τ+1}) h(S_{τ+1}, S̄_τ, U) − w(S_{τ+1}) h(S_{τ+1}, S̄_τ, U) ] + Σ_{S_{τ+1}} [ w(S_{τ+1}) h(S_{τ+1}, S̄_τ, U) − w(S_{τ+1}) h(S_{τ+1}, S̄_τ, w) ]
= Σ_{S_{τ+1}} [ U(S_{τ+1}) − w(S_{τ+1}) ] h(S_{τ+1}, S̄_τ, U) + Σ_{S_{τ+1}} w(S_{τ+1}) [ h(S_{τ+1}, S̄_τ, U) − h(S_{τ+1}, S̄_τ, w) ],   (32)

where the sums in the above equations are also over all possible realizations of S_{τ+1}.
Notice that by definition, we have

E[ h(S_{τ+1}, S̄_τ, U) − h(S_{τ+1}, S̄_τ, w) | (S_τ, S̄_{τ−1}) ] = Σ_{S_{τ+1}} w(S_{τ+1}) [ h(S_{τ+1}, S̄_τ, U) − h(S_{τ+1}, S̄_τ, w) ],   (33)

where the expectation on the left-hand side is over S_{τ+1} under weight w or, equivalently, over w̄(e) for all e ∈ E(S_τ, S̄_τ) under weight w. Thus, to prove Lemma 4, it is sufficient to prove that

Σ_{S_{τ+1}} [ U(S_{τ+1}) − w(S_{τ+1}) ] h(S_{τ+1}, S̄_τ, U) ≤ Σ_{e ∈ E(S_τ, S̄_τ)} [U(e) − w(e)].   (34)

Notice that

Σ_{S_{τ+1}} [ U(S_{τ+1}) − w(S_{τ+1}) ] h(S_{τ+1}, S̄_τ, U)
≤(a) Σ_{S_{τ+1}} [ U(S_{τ+1}) − w(S_{τ+1}) ] h(S_{τ+1}, S̄_τ, U) 1[ U(S_{τ+1}) ≥ w(S_{τ+1}) ]
≤(b) Σ_{S_{τ+1}} [ U(S_{τ+1}) − w(S_{τ+1}) ] 1[ U(S_{τ+1}) ≥ w(S_{τ+1}) ]
=(c) (1/2) Σ_{S_{τ+1}} | U(S_{τ+1}) − w(S_{τ+1}) |,   (35)

where (a) holds since

Σ_{S_{τ+1}} [ U(S_{τ+1}) − w(S_{τ+1}) ] h(S_{τ+1}, S̄_τ, U) = Σ_{S_{τ+1}} [ U(S_{τ+1}) − w(S_{τ+1}) ] h(S_{τ+1}, S̄_τ, U) 1[ U(S_{τ+1}) ≥ w(S_{τ+1}) ] + Σ_{S_{τ+1}} [ U(S_{τ+1}) − w(S_{τ+1}) ] h(S_{τ+1}, S̄_τ, U) 1[ U(S_{τ+1}) < w(S_{τ+1}) ],

and the second sum is nonpositive; (b) holds since 0 ≤ h(S_{τ+1}, S̄_τ, U) ≤ 1 by definition. To prove (c), we define the shorthand notations

A⁺ = Σ_{S_{τ+1}} [ U(S_{τ+1}) − w(S_{τ+1}) ] 1[ U(S_{τ+1}) ≥ w(S_{τ+1}) ],  A⁻ = Σ_{S_{τ+1}} [ U(S_{τ+1}) − w(S_{τ+1}) ] 1[ U(S_{τ+1}) < w(S_{τ+1}) ].

Then we have A⁺ + A⁻ = Σ_{S_{τ+1}} [ U(S_{τ+1}) − w(S_{τ+1}) ] = 0, since by definition Σ_{S_{τ+1}} U(S_{τ+1}) = Σ_{S_{τ+1}} w(S_{τ+1}) = 1.
Moreover, we also have A⁺ − A⁻ = Σ_{S_{τ+1}} | U(S_{τ+1}) − w(S_{τ+1}) |, and hence A⁺ = (1/2) Σ_{S_{τ+1}} | U(S_{τ+1}) − w(S_{τ+1}) |. Thus, to prove Lemma 4, it is sufficient to prove that

(1/2) Σ_{S_{τ+1}} | U(S_{τ+1}) − w(S_{τ+1}) | ≤ Σ_{e ∈ E(S_τ, S̄_τ)} [U(e) − w(e)].   (36)

Let w̃ ∈ {0, 1}^{|E(S_τ, S̄_τ)|} be an arbitrary edge activation realization for the edges in E(S_τ, S̄_τ). With a slight abuse of notation, we use w(w̃) to denote the probability of w̃ under weight w. Notice that

w(w̃) = Π_{e ∈ E(S_τ, S̄_τ)} w(e)^{w̃(e)} [1 − w(e)]^{1 − w̃(e)},

and U(w̃) is defined similarly. Recall that by definition, S_{τ+1} is a deterministic function of the source node set S_τ, the removed nodes S̄_{τ−1}, and w̃. Hence, for any possible realized S_{τ+1}, let W(S_{τ+1}) denote the set of w̃'s that lead to this S_{τ+1}; then we have

U(S_{τ+1}) = Σ_{w̃ ∈ W(S_{τ+1})} U(w̃)  and  w(S_{τ+1}) = Σ_{w̃ ∈ W(S_{τ+1})} w(w̃).

Thus, we have

(1/2) Σ_{S_{τ+1}} | U(S_{τ+1}) − w(S_{τ+1}) | = (1/2) Σ_{S_{τ+1}} | Σ_{w̃ ∈ W(S_{τ+1})} [U(w̃) − w(w̃)] | ≤ (1/2) Σ_{S_{τ+1}} Σ_{w̃ ∈ W(S_{τ+1})} | U(w̃) − w(w̃) | = (1/2) Σ_{w̃} | U(w̃) − w(w̃) |.   (37)

Finally, we prove that

(1/2) Σ_{w̃} | U(w̃) − w(w̃) | ≤ Σ_{e ∈ E(S_τ, S̄_τ)} [U(e) − w(e)]   (38)

by mathematical induction. Without loss of generality, we order the edges in E(S_τ, S̄_τ) as 1, 2, ..., |E(S_τ, S̄_τ)|. For any k = 1, ...
$|\mathcal{E}(S_\tau,\mathcal{S}_\tau)|$, we use $\widetilde{w}^k \in \{0,1\}^k$ to denote an arbitrary edge activation realization for edges $1, \ldots, k$. Then, we prove
\[
\frac{1}{2} \sum_{\widetilde{w}^k} \left| U(\widetilde{w}^k) - w(\widetilde{w}^k) \right| \le \sum_{e=1}^{k} \left[ U(e) - w(e) \right] \tag{39}
\]
for all $k = 1, \ldots, |\mathcal{E}(S_\tau,\mathcal{S}_\tau)|$ by mathematical induction. Notice that when $k = 1$, we have
\[
\frac{1}{2} \sum_{\widetilde{w}^1} \left| U(\widetilde{w}^1) - w(\widetilde{w}^1) \right|
= \frac{1}{2} \big[ \left| U(1) - w(1) \right| + \left| (1 - U(1)) - (1 - w(1)) \right| \big] = U(1) - w(1),
\]
since $U(1) \ge w(1)$. Now assume that the induction hypothesis holds for $k$; we prove that it also holds for $k+1$. Note that
\[
\frac{1}{2} \sum_{\widetilde{w}^{k+1}} \left| U(\widetilde{w}^{k+1}) - w(\widetilde{w}^{k+1}) \right|
= \frac{1}{2} \sum_{\widetilde{w}^{k}} \big[ \left| U(\widetilde{w}^k)\, U(k{+}1) - w(\widetilde{w}^k)\, w(k{+}1) \right| + \left| U(\widetilde{w}^k)\,(1 - U(k{+}1)) - w(\widetilde{w}^k)\,(1 - w(k{+}1)) \right| \big]
\]
\[
\overset{(a)}{\le} \frac{1}{2} \sum_{\widetilde{w}^{k}} \big[ \left| U(\widetilde{w}^k)\,U(k{+}1) - w(\widetilde{w}^k)\,U(k{+}1) \right| + \left| w(\widetilde{w}^k)\,U(k{+}1) - w(\widetilde{w}^k)\,w(k{+}1) \right|
+ \left| U(\widetilde{w}^k)\,(1-U(k{+}1)) - w(\widetilde{w}^k)\,(1-U(k{+}1)) \right| + \left| w(\widetilde{w}^k)\,(1-U(k{+}1)) - w(\widetilde{w}^k)\,(1-w(k{+}1)) \right| \big]
\]
\[
= \frac{1}{2} \sum_{\widetilde{w}^{k}} \big[ U(k{+}1)\,\left| U(\widetilde{w}^k) - w(\widetilde{w}^k) \right| + w(\widetilde{w}^k)\,\left| U(k{+}1) - w(k{+}1) \right|
+ (1-U(k{+}1))\,\left| U(\widetilde{w}^k) - w(\widetilde{w}^k) \right| + w(\widetilde{w}^k)\,\left| U(k{+}1) - w(k{+}1) \right| \big]
\]
\[
= \frac{1}{2} \sum_{\widetilde{w}^{k}} \left| U(\widetilde{w}^k) - w(\widetilde{w}^k) \right| + \left[ U(k{+}1) - w(k{+}1) \right]
\overset{(b)}{\le} \sum_{e=1}^{k} \left[ U(e) - w(e) \right] + \left[ U(k{+}1) - w(k{+}1) \right]
= \sum_{e=1}^{k+1} \left[ U(e) - w(e) \right], \tag{40}
\]
where (a) follows from the triangle inequality and (b) follows from the induction hypothesis. Hence, we have proved Equation (39) by induction. As shown above, this is sufficient to prove Lemma 4. $\Box$
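The induction above shows that the total variation distance between two product Bernoulli distributions is at most the sum of the per-edge gaps $U(e) - w(e)$. This can be sanity-checked numerically by brute-force enumeration; the following is a sketch in plain Python (the edge probabilities are arbitrary illustrative values, not from the paper):

```python
import itertools

def prob(params, realization):
    """Probability of a 0/1 realization under independent Bernoulli(params)."""
    p = 1.0
    for q, x in zip(params, realization):
        p *= q if x == 1 else 1.0 - q
    return p

def half_l1(U, w):
    """(1/2) * sum over all realizations w~ of |U(w~) - w(w~)|."""
    k = len(U)
    return 0.5 * sum(abs(prob(U, r) - prob(w, r))
                     for r in itertools.product((0, 1), repeat=k))

# With U(e) >= w(e) edge-wise, the half-L1 distance is at most the sum of gaps.
w = [0.2, 0.5, 0.1, 0.7]
U = [0.4, 0.6, 0.3, 0.7]
assert half_l1(U, w) <= sum(u - v for u, v in zip(U, w)) + 1e-12
```

For a single edge the bound is tight, matching the base case $k = 1$: `half_l1([0.4], [0.2])` is exactly $0.2$.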
Lemma 5
For any two weight functions $w, U: \mathcal{E} \to [0,1]$ such that $U(e) \ge w(e)$ for all $e \in \mathcal{E}$, we have
\[
f(S_t, U, v) - f(S_t, w, v) \le \mathbb{E}\left[ \sum_{\tau=0}^{\widetilde{\tau}-1} \sum_{e \in \mathcal{E}(S_\tau, \mathcal{S}_\tau)} \left[ U(e) - w(e) \right] \,\middle|\, S_t \right],
\]
where $\widetilde{\tau}$ is the stopping time at which $S_{\widetilde{\tau}} = \emptyset$ or $v \in S_{\widetilde{\tau}}$, and the expectation is under the weight function $w$.

Proof:
Recall that the diffusion process $(S_\tau)_{\tau=0}^{\widetilde{\tau}}$ is a stochastic process. Note that by definition, if we treat the pair $(S_\tau, \mathcal{S}_{\tau-1})$ as the state of the diffusion process at diffusion step $\tau$, and assume that $\mathbf{w}(e) \sim \mathrm{Bern}(w(e))$ are independently sampled for all $e \in \mathcal{E}_{S_t,v}$, then the sequence $(S_0, \mathcal{S}_{-1}), (S_1, \mathcal{S}_0), \ldots, (S_{\widetilde{\tau}}, \mathcal{S}_{\widetilde{\tau}-1})$ follows a Markov chain. Specifically:

- For any state $(S_\tau, \mathcal{S}_{\tau-1})$ such that $v \notin S_\tau$ and $S_\tau \ne \emptyset$, its transition probabilities to the next state $(S_{\tau+1}, \mathcal{S}_\tau)$ depend on $\mathbf{w}(e)$'s for $e \in \mathcal{E}(S_\tau, \mathcal{S}_\tau)$.
- Any state $(S_\tau, \mathcal{S}_{\tau-1})$ such that $v \in S_\tau$ or $S_\tau = \emptyset$ is a terminal state, and the state transitions terminate once such a state is visited. Recall that, by the definition of the stopping time $\widetilde{\tau}$, the state transitions terminate at $\widetilde{\tau}$.

We define $h(S_\tau, \mathcal{S}_{\tau-1}, U) - h(S_\tau, \mathcal{S}_{\tau-1}, w)$ as the "value" at state $(S_\tau, \mathcal{S}_{\tau-1})$. Note also that the states of this Markov chain are topologically sortable, in the sense that the chain never revisits a state. Hence, we can compute $h(S_\tau, \mathcal{S}_{\tau-1}, U) - h(S_\tau, \mathcal{S}_{\tau-1}, w)$ via a backward induction from the terminal states, based on a valid topological order. Thus, from Lemma 4, we have
\[
f(S_t, U, v) - f(S_t, w, v)
\overset{(a)}{=} h(S_0, \emptyset, U) - h(S_0, \emptyset, w)
\overset{(b)}{\le} \mathbb{E}\left[ \sum_{\tau=0}^{\widetilde{\tau}-1} \sum_{e\in\mathcal{E}(S_\tau,\mathcal{S}_\tau)} \left[ U(e) - w(e) \right] \,\middle|\, S_0 \right], \tag{41}
\]
where (a) follows from the definition of $h$ and (b) follows from the backward induction. Since $S_0 = S_t$ by definition, we have proved Lemma 5. $\Box$

Finally, we prove Theorem 3 based on Lemma 5. Recall that the favorable event at round $t-1$ is defined as
\[
\xi_{t-1} = \left\{ \left| x_e^{\top}\left(\theta_{\tau-1} - \theta_*\right) \right| \le c \sqrt{x_e^{\top} M_{\tau-1}^{-1} x_e}, \ \forall e \in \mathcal{E}, \ \forall \tau \le t \right\}.
\]
Also, based on Algorithm 1, we have $0 \le w(e) \le U_t(e) \le 1$ for all $e \in \mathcal{E}$.
Thus, from Lemma 5, we have
\[
f(S_t, U_t, v) - f(S_t, w, v) \le \mathbb{E}\left[ \sum_{\tau=0}^{\widetilde{\tau}-1} \sum_{e\in\mathcal{E}(S_\tau,\mathcal{S}_\tau)} \left[ U_t(e) - w(e) \right] \,\middle|\, S_t, \mathcal{H}_{t-1} \right],
\]
where the expectation is based on the weight function $w$. Recall that $O_t(e)$ is the event that edge $e$ is observed at round $t$. By definition, all edges in $\mathcal{E}(S_\tau, \mathcal{S}_\tau)$ are observed at round $t$ (since they go out from an influenced node in $S_\tau$; see Definition 2) and belong to $\mathcal{E}_{S_t,v}$, so we have
\[
f(S_t, U_t, v) - f(S_t, w, v)
\le \mathbb{E}\left[ \sum_{\tau=0}^{\widetilde{\tau}-1} \sum_{e\in\mathcal{E}(S_\tau,\mathcal{S}_\tau)} \left[ U_t(e) - w(e) \right] \,\middle|\, S_t, \mathcal{H}_{t-1} \right]
\le \mathbb{E}\left[ \sum_{e\in\mathcal{E}_{S_t,v}} \mathbb{1}\{O_t(e)\}\,\left[ U_t(e) - w(e) \right] \,\middle|\, S_t, \mathcal{H}_{t-1} \right]. \tag{42}
\]
This completes the proof of Theorem 3.

A.3 Proof of Lemma 1

Proof:
To simplify the exposition, we define $z_{t,e} = \sqrt{x_e^{\top} M_{t-1}^{-1} x_e}$ for all $t = 1, \ldots, n$ and all $e \in \mathcal{E}$, and use $\mathcal{E}_t^{o}$ to denote the set of edges observed at round $t$. Recall that
\[
M_t = M_{t-1} + \frac{1}{\sigma^2} \sum_{e\in\mathcal{E}} x_e x_e^{\top}\, \mathbb{1}\{O_t(e)\} = M_{t-1} + \frac{1}{\sigma^2} \sum_{e\in\mathcal{E}_t^{o}} x_e x_e^{\top}. \tag{43}
\]
Thus, for all $(t,e)$ such that $e \in \mathcal{E}_t^{o}$ (i.e., edge $e$ is observed at round $t$), we have
\[
\det[M_t] \ge \det\left[ M_{t-1} + \frac{1}{\sigma^2} x_e x_e^{\top} \right]
= \det\left[ M_{t-1}^{\frac12}\left( I + \frac{1}{\sigma^2} M_{t-1}^{-\frac12} x_e x_e^{\top} M_{t-1}^{-\frac12} \right) M_{t-1}^{\frac12} \right]
= \det[M_{t-1}] \det\left[ I + \frac{1}{\sigma^2} M_{t-1}^{-\frac12} x_e x_e^{\top} M_{t-1}^{-\frac12} \right]
= \det[M_{t-1}]\left( 1 + \frac{x_e^{\top} M_{t-1}^{-1} x_e}{\sigma^2} \right)
= \det[M_{t-1}]\left( 1 + \frac{z_{t,e}^2}{\sigma^2} \right).
\]
Thus, we have
\[
\left(\det[M_t]\right)^{|\mathcal{E}_t^{o}|} \ge \left(\det[M_{t-1}]\right)^{|\mathcal{E}_t^{o}|} \prod_{e\in\mathcal{E}_t^{o}} \left( 1 + \frac{z_{t,e}^2}{\sigma^2} \right).
\]

Remark 1. Notice that when the feature matrix $X = I$, the $M_t$'s are always diagonal matrices, and we have
\[
\det[M_t] = \det[M_{t-1}] \prod_{e\in\mathcal{E}_t^{o}} \left( 1 + \frac{z_{t,e}^2}{\sigma^2} \right),
\]
which leads to a tighter bound in the tabular ($X = I$) case.

Since (1) $\det[M_t] \ge \det[M_{t-1}]$ from Equation (43) and (2) $|\mathcal{E}_t^{o}| \le E^*$, where $E^*$ is defined in Equation (10) and the bound $|\mathcal{E}_t^{o}| \le E^*$ follows from its definition, we have
\[
\left(\det[M_t]\right)^{E^*} \ge \left(\det[M_{t-1}]\right)^{E^*} \prod_{e\in\mathcal{E}_t^{o}} \left( 1 + \frac{z_{t,e}^2}{\sigma^2} \right).
\]
Therefore, we have
\[
\left(\det[M_n]\right)^{E^*} \ge \left(\det[M_0]\right)^{E^*} \prod_{t=1}^{n}\prod_{e\in\mathcal{E}_t^{o}} \left( 1 + \frac{z_{t,e}^2}{\sigma^2} \right) = \prod_{t=1}^{n}\prod_{e\in\mathcal{E}_t^{o}} \left( 1 + \frac{z_{t,e}^2}{\sigma^2} \right),
\]
since $M_0 = I$. On the other hand, we have
\[
\operatorname{trace}(M_n) = \operatorname{trace}\left( I + \frac{1}{\sigma^2} \sum_{t=1}^{n}\sum_{e\in\mathcal{E}_t^{o}} x_e x_e^{\top} \right) = d + \frac{1}{\sigma^2}\sum_{t=1}^{n}\sum_{e\in\mathcal{E}_t^{o}} \|x_e\|_2^2 \le d + \frac{nE^*}{\sigma^2},
\]
where the last inequality follows from the facts that $\|x_e\|_2 \le 1$ and $|\mathcal{E}_t^{o}| \le E^*$.
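The potential-function step that follows relies on the scalar fact that $y \le \log(1+y/\sigma^2)/\log(1+1/\sigma^2) \triangleq \kappa(y)$ for all $y \in [0,1]$, since $\kappa$ is concave with $\kappa(0)=0$ and $\kappa(1)=1$. A quick numerical check of this inequality, as a sketch in plain Python (the grid and the values of $\sigma^2$ are arbitrary choices, not from the paper):

```python
import math

def kappa(y, sigma2):
    """kappa(y) = log(1 + y/sigma^2) / log(1 + 1/sigma^2).
    Concave in y with kappa(0) = 0 and kappa(1) = 1, so kappa(y) >= y on [0, 1]."""
    return math.log(1.0 + y / sigma2) / math.log(1.0 + 1.0 / sigma2)

# Check y <= kappa(y) on a grid, for a few values of sigma^2.
for sigma2 in [0.25, 1.0, 4.0]:
    for i in range(101):
        y = i / 100.0
        assert y <= kappa(y, sigma2) + 1e-12
```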
From the trace-determinant inequality, we have $\frac{1}{d}\operatorname{trace}(M_n) \ge \left[\det(M_n)\right]^{1/d}$; thus we have
\[
\left[ 1 + \frac{nE^*}{d\sigma^2} \right]^{dE^*} \ge \left[ \frac{1}{d}\operatorname{trace}(M_n) \right]^{dE^*} \ge \left[\det(M_n)\right]^{E^*} \ge \prod_{t=1}^{n}\prod_{e\in\mathcal{E}_t^{o}} \left( 1 + \frac{z_{t,e}^2}{\sigma^2} \right).
\]
Taking the logarithm on both sides, we have
\[
d E^* \log\left[ 1 + \frac{nE^*}{d\sigma^2} \right] \ge \sum_{t=1}^{n}\sum_{e\in\mathcal{E}_t^{o}} \log\left( 1 + \frac{z_{t,e}^2}{\sigma^2} \right). \tag{44}
\]
Notice that $z_{t,e}^2 = x_e^{\top} M_{t-1}^{-1} x_e \le x_e^{\top} M_0^{-1} x_e = \|x_e\|_2^2 \le 1$; thus we have
\[
z_{t,e}^2 \le \frac{\log\left( 1 + z_{t,e}^2/\sigma^2 \right)}{\log\left( 1 + 1/\sigma^2 \right)}.
\]
Hence we have
\[
\sum_{t=1}^{n}\sum_{e\in\mathcal{E}_t^{o}} z_{t,e}^2
\le \frac{1}{\log\left( 1 + 1/\sigma^2 \right)} \sum_{t=1}^{n}\sum_{e\in\mathcal{E}_t^{o}} \log\left( 1 + \frac{z_{t,e}^2}{\sigma^2} \right)
\le \frac{d E^* \log\left[ 1 + \frac{nE^*}{d\sigma^2} \right]}{\log\left( 1 + 1/\sigma^2 \right)}. \tag{45}
\]

Remark 2.
When the feature matrix $X = I$, we have $d = |\mathcal{E}|$,
\[
\det[M_n] = \prod_{t=1}^{n}\prod_{e\in\mathcal{E}_t^{o}} \left( 1 + \frac{z_{t,e}^2}{\sigma^2} \right),
\]
and
\[
|\mathcal{E}| \log\left[ 1 + \frac{nE^*}{|\mathcal{E}|\sigma^2} \right] \ge \sum_{t=1}^{n}\sum_{e\in\mathcal{E}_t^{o}} \log\left( 1 + \frac{z_{t,e}^2}{\sigma^2} \right).
\]
Since $E^* \le |\mathcal{E}|$, this implies that
\[
\sum_{t=1}^{n}\sum_{e\in\mathcal{E}_t^{o}} z_{t,e}^2 \le \frac{|\mathcal{E}| \log\left( 1 + n/\sigma^2 \right)}{\log\left( 1 + 1/\sigma^2 \right)}. \tag{46}
\]
(Footnote: notice that for any $y \in [0,1]$, we have $y \le \frac{\log(1 + y/\sigma^2)}{\log(1 + 1/\sigma^2)} \triangleq \kappa(y)$. To see this, notice that $\kappa(y)$ is a strictly concave function with $\kappa(0) = 0$ and $\kappa(1) = 1$.)

Finally, from the Cauchy-Schwarz inequality, we have
\[
\sum_{t=1}^{n}\sum_{e\in\mathcal{E}} \mathbb{1}\{O_t(e)\}\, N_{S_t,e} \sqrt{x_e^{\top} M_{t-1}^{-1} x_e}
= \sum_{t=1}^{n}\sum_{e\in\mathcal{E}_t^{o}} N_{S_t,e}\, z_{t,e}
\le \sqrt{ \sum_{t=1}^{n}\sum_{e\in\mathcal{E}_t^{o}} N_{S_t,e}^2 \;\; \sum_{t=1}^{n}\sum_{e\in\mathcal{E}_t^{o}} z_{t,e}^2 }
= \sqrt{ \left( \sum_{t=1}^{n}\sum_{e\in\mathcal{E}} \mathbb{1}\{O_t(e)\}\, N_{S_t,e}^2 \right) \sum_{t=1}^{n}\sum_{e\in\mathcal{E}_t^{o}} z_{t,e}^2 }. \tag{47}
\]
Combining this inequality with the above bounds on $\sum_{t=1}^{n}\sum_{e\in\mathcal{E}_t^{o}} z_{t,e}^2$ (see Equations (45) and (46)), we obtain the statement of the lemma. $\Box$

A.4 Proof of Lemma 2

Proof:
We use $\mathcal{E}_t^{o}$ to denote the set of edges observed at round $t$. The first observation is that we can order the edges in $\mathcal{E}_t^{o}$ by a breadth-first search (BFS) from the source nodes $S_t$, as described in Algorithm 2, where $\pi_t(S_t)$ is an arbitrary conditionally deterministic order of $S_t$. We say that a node $u \in \mathcal{V}$ is a downstream neighbor of a node $v \in \mathcal{V}$ if there is a directed edge $(v, u)$. We also assume that there is a fixed order of the downstream neighbors of any node $v \in \mathcal{V}$.
Breadth-First Sort of Observed Edges
Input: graph $G$, $\pi_t(S_t)$, and $\mathbf{w}_t$
Initialization: node queue queueN $\leftarrow \pi_t(S_t)$, edge queue queueE $\leftarrow \emptyset$, dictionary of influenced nodes dictN $\leftarrow S_t$
while queueN is not empty do
    node $v \leftarrow$ queueN.dequeue()
    for all downstream neighbors $u$ of $v$ do
        queueE.enqueue($(v, u)$)
        if $\mathbf{w}_t(v, u) = 1$ and $u \notin$ dictN then
            queueN.enqueue($u$) and dictN $\leftarrow$ dictN $\cup \{u\}$
Output: edge queue queueE
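Algorithm 2 can be sketched in Python as follows. This is an illustrative implementation under the assumption that the graph is given as adjacency lists with a fixed neighbor order and the realized activations $\mathbf{w}_t$ as a 0/1 map on edges; the function and variable names are ours, not from the paper:

```python
from collections import deque

def bfs_sort_observed_edges(downstream, sources, w_t):
    """Breadth-first sort of the edges observed in one round (cf. Algorithm 2).

    downstream: dict node -> list of downstream neighbors, in a fixed order
    sources:    list of source nodes in the order pi_t(S_t)
    w_t:        dict edge (v, u) -> realized activation in {0, 1}
    """
    queue_n = deque(sources)   # queueN: influenced nodes still to expand
    queue_e = []               # queueE: observed edges, in BFS order
    dict_n = set(sources)      # dictN: influenced nodes seen so far
    while queue_n:
        v = queue_n.popleft()
        for u in downstream.get(v, []):
            queue_e.append((v, u))  # every out-edge of an influenced node is observed
            if w_t[(v, u)] == 1 and u not in dict_n:
                queue_n.append(u)
                dict_n.add(u)
    return queue_e
```

For example, on the path $1 \to 2 \to 3 \to 4$ with source node $1$, if edge $(1,2)$ activates but $(2,3)$ does not, then only $(1,2)$ and $(2,3)$ are observed, in that order.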
Let $J_t = |\mathcal{E}_t^{o}|$. Based on Algorithm 2, we order the observed edges in $\mathcal{E}_t^{o}$ as $a_1^t, a_2^t, \ldots, a_{J_t}^t$. We start by defining some useful notation. For any $t = 1, 2, \ldots$ and any $j = 1, 2, \ldots, J_t$, we define
\[
\eta_{t,j} = \mathbf{w}_t(a_j^t) - w(a_j^t).
\]
One key observation is that the $\eta_{t,j}$'s form a martingale difference sequence (MDS). (Here the notion of "time" (or a round) is indexed by the pair $(t,j)$ and follows the lexicographic order; based on Algorithm 2, at the beginning of round $(t,j)$, $a_j^t$ is conditionally deterministic, and the conditional mean of $\mathbf{w}_t(a_j^t)$ is $w(a_j^t)$.) Moreover, the $\eta_{t,j}$'s are bounded in $[-1, 1]$, and hence they are conditionally sub-Gaussian with constant $R = 1$. We further define
\[
V_t = \sigma^2 M_t = \sigma^2 I + \sum_{\tau=1}^{t}\sum_{j=1}^{J_\tau} x_{a_j^\tau} x_{a_j^\tau}^{\top},
\quad \text{and} \quad
Y_t = \sum_{\tau=1}^{t}\sum_{j=1}^{J_\tau} x_{a_j^\tau}\, \eta_{\tau,j}
= B_t - \sum_{\tau=1}^{t}\sum_{j=1}^{J_\tau} x_{a_j^\tau}\, w(a_j^\tau)
= B_t - \sum_{\tau=1}^{t}\sum_{j=1}^{J_\tau} x_{a_j^\tau} x_{a_j^\tau}^{\top} \theta_*.
\]
As we will see, we define $V_t$ and $Y_t$ so that we can use the self-normalized bound developed in [1] (see Theorem 1 of [1]). Notice that
\[
M_t \theta_t = \frac{1}{\sigma^2} B_t = \frac{1}{\sigma^2} Y_t + \frac{1}{\sigma^2} \sum_{\tau=1}^{t}\sum_{j=1}^{J_\tau} x_{a_j^\tau} x_{a_j^\tau}^{\top} \theta_* = \frac{1}{\sigma^2} Y_t + \left[ M_t - I \right] \theta_*,
\]
where the last equality is based on the definition of $M_t$. Hence we have
\[
\theta_t - \theta_* = M_t^{-1}\left[ \frac{1}{\sigma^2} Y_t - \theta_* \right].
\]
Thus, for any $e \in \mathcal{E}$, we have
\[
\left| \langle x_e, \theta_t - \theta_* \rangle \right|
= \left| x_e^{\top} M_t^{-1}\left[ \frac{1}{\sigma^2} Y_t - \theta_* \right] \right|
\le \|x_e\|_{M_t^{-1}} \left\| \frac{1}{\sigma^2} Y_t - \theta_* \right\|_{M_t^{-1}}
\le \|x_e\|_{M_t^{-1}} \left[ \left\| \frac{1}{\sigma^2} Y_t \right\|_{M_t^{-1}} + \|\theta_*\|_{M_t^{-1}} \right],
\]
where the first inequality follows from the Cauchy-Schwarz inequality and the second inequality follows from the triangle inequality. Notice that $\|\theta_*\|_{M_t^{-1}} \le \|\theta_*\|_{M_0^{-1}} = \|\theta_*\|_2$, and $\left\| \frac{1}{\sigma^2} Y_t \right\|_{M_t^{-1}} = \frac{1}{\sigma} \|Y_t\|_{V_t^{-1}}$ (since $M_t^{-1} = \sigma^2 V_t^{-1}$); therefore we have
\[
\left| \langle x_e, \theta_t - \theta_* \rangle \right| \le \|x_e\|_{M_t^{-1}} \left[ \frac{1}{\sigma} \|Y_t\|_{V_t^{-1}} + \|\theta_*\|_2 \right]. \tag{48}
\]
Notice that the above inequality always holds. We now provide a high-probability bound on $\|Y_t\|_{V_t^{-1}}$ based on the self-normalized bound proved in [1]. From Theorem 1 of [1], we know that for any $\delta \in (0,1)$, with probability at least $1-\delta$, we have
\[
\|Y_t\|_{V_t^{-1}} \le \sqrt{ 2 \log\left( \frac{\det(V_t)^{1/2} \det(V_0)^{-1/2}}{\delta} \right) }, \quad \forall t = 0, 1, \ldots.
\]
Notice that $\det(V_0) = \det(\sigma^2 I) = \sigma^{2d}$.
Moreover, from the trace-determinant inequality, we have
\[
\left[ \det(V_t) \right]^{1/d} \le \frac{\operatorname{trace}(V_t)}{d} = \sigma^2 + \frac{1}{d} \sum_{\tau=1}^{t}\sum_{j=1}^{J_\tau} \|x_{a_j^\tau}\|_2^2 \le \sigma^2 + \frac{t E^*}{d} \le \sigma^2 + \frac{n E^*}{d},
\]
where the second inequality follows from the assumption that $\|x_{a_j^\tau}\|_2 \le 1$ and the fact that $J_\tau = |\mathcal{E}_\tau^{o}| \le E^*$, and the last inequality follows from $t \le n$. Thus, with probability at least $1-\delta$, we have
\[
\|Y_t\|_{V_t^{-1}} \le \sqrt{ d \log\left( 1 + \frac{n E^*}{d \sigma^2} \right) + 2 \log\left( \frac{1}{\delta} \right) }, \quad \forall t = 0, 1, \ldots, n-1.
\]
That is, with probability at least $1-\delta$, we have
\[
\left| \langle x_e, \theta_t - \theta_* \rangle \right| \le \|x_e\|_{M_t^{-1}} \left[ \frac{1}{\sigma} \sqrt{ d \log\left( 1 + \frac{n E^*}{d \sigma^2} \right) + 2 \log\left( \frac{1}{\delta} \right) } + \|\theta_*\|_2 \right]
\]
for all $t = 0, 1, \ldots, n-1$ and all $e \in \mathcal{E}$. Recall that by the definition of the event $\xi_{t-1}$, the above inequality implies that, for any $t = 1, 2, \ldots, n$, if
\[
c \ge \frac{1}{\sigma} \sqrt{ d \log\left( 1 + \frac{n E^*}{d \sigma^2} \right) + 2 \log\left( \frac{1}{\delta} \right) } + \|\theta_*\|_2,
\]
then $P(\xi_{t-1}) \ge 1 - \delta$; that is, $P(\overline{\xi}_{t-1}) \le \delta$. $\Box$
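The trace-determinant inequality $\left[\det(V)\right]^{1/d} \le \operatorname{trace}(V)/d$ used above is the AM-GM inequality applied to the eigenvalues of a positive-definite matrix. A minimal numerical sketch (diagonal matrices suffice, since both sides depend only on the eigenvalues; the random grid below is our choice, not from the paper):

```python
import math
import random

# For a positive-definite V with eigenvalues eigs:
#   det(V)^(1/d)  = geometric mean of eigs,
#   trace(V)/d    = arithmetic mean of eigs,
# so the trace-determinant inequality is exactly AM-GM.
random.seed(0)
for _ in range(1000):
    d = random.randint(1, 10)
    eigs = [random.uniform(0.1, 5.0) for _ in range(d)]
    det_root = math.exp(sum(math.log(x) for x in eigs) / d)  # det(V)^(1/d)
    trace_avg = sum(eigs) / d                                # trace(V)/d
    assert det_root <= trace_avg + 1e-9
```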