A Random Algorithm for Profit Maximization with Multiple Adoptions in Online Social Networks
Tiantian Chen, Bin Liu, Wenjing Liu, Qizhi Fang, Jing Yuan, Weili Wu
Tiantian Chen (a), Bin Liu (a, *), Wenjing Liu (a), Qizhi Fang (a), Jing Yuan (b), Weili Wu (b)
(a) School of Mathematical Sciences, Ocean University of China, Qingdao 266100, China
(b) Department of Computer Science, The University of Texas at Dallas, Richardson, TX 75080, USA
Abstract
Online social networks have been one of the most effective platforms for marketing and advertising. Through "word of mouth" effects, information or product adoption can spread from a few influential individuals to millions of users in a social network. Given a social network $G$ and a constant $k$, the influence maximization problem seeks $k$ nodes in $G$ that can influence the largest number of nodes. This problem has found important applications, and a large body of work has been devoted to identifying the few most influential users. However, most existing works focus only on the diffusion of a single idea or product in social networks. In reality, one company may produce multiple kinds of products and one user may also have multiple adoptions. Given multiple kinds of products with different activation costs and profits, it is crucial for the company to distribute its limited budget among the products in order to maximize profit. The Profit Maximization with Multiple Adoptions (PM$^2$A) problem seeks a seed set within the budget that maximizes the overall profit. In this paper, a Randomized Modified Greedy (RMG) algorithm based on the Reverse Influence Sampling (RIS) technique is presented for the PM$^2$A problem, which achieves a $(1 - 1/e - \varepsilon)$-approximate solution with high probability. Compared with the algorithm proposed in [16] that achieves a $\frac{1}{2}(1 - 1/e^2)$-approximate solution, our algorithm provides a better performance ratio, which is also the best performance ratio of the PM$^2$A problem. Comprehensive experiments on three real-world social networks are conducted, and the results demonstrate that our RMG algorithm outperforms the algorithm proposed in [16] and other heuristics in terms of profit maximization, and allocates the budget better.
Keywords:
Profit maximization, Social network, Approximation algorithm, Sampling
1. Introduction
With the increasing popularity of online social and information networks such as Facebook, Twitter, and LinkedIn, many researchers have studied the diffusion phenomenon in social networks, including the diffusion of news, ideas, and innovations and the adoption of new products. Such diffusion is driven by influence propagation throughout the social network. One topic that has been extensively studied is the Influence Maximization (IM) problem [5, 6, 7, 11]. The goal of the IM problem is to find a small subset of influential nodes that can attract the largest number of members in a social network, according to an influence propagation model. The IM problem has been mainly studied under two classical influence propagation models: the Independent Cascade (IC) model [1, 2] and the Linear Threshold (LT) model [3, 4]. Both are probabilistic models that characterize how influence propagates in the social network starting from an initial set of seed nodes.

* Corresponding author. Email address: [email protected] (Bin Liu)
Preprint submitted to Elsevier, January 18, 2021

The objective functions of IM-related problems are usually complicated to compute due to the randomness of the probabilistic diffusion models. In fact, computing the expected influence of a given seed set is #P-hard. The need for efficiency has motivated a large amount of research on the IM problem in the past decade [7, 8, 9, 10, 11, 12, 14]. However, most of these methods either trade performance guarantees for practical efficiency, or vice versa. Exceptions include TIM/TIM+ [19] and IMM [20], which are scalable methods with performance guarantees for the IM problem. They use the Reverse Influence Sampling (RIS) technique introduced by Borgs et al. [17] and obtain $(1 - 1/e - \varepsilon)$-approximate solutions with high probability.

Most existing works focus on the IM problem with a single diffusion, i.e., only one product is considered. In reality, however, one company may produce multiple products and people may purchase multiple kinds of products at one time. For example, Apple sells the iPhone, iPad, MacBook, etc. Many people have an iPhone and an iPad at the same time, and plenty of people own both a laptop and a desktop. Therefore, given a limited budget and several kinds of items with different activation costs and profits, a crucial question is how to allocate the budget to maximize the overall profit. Recently, Zhang et al. [16] formulated the Profit Maximization with Multiple Adoptions (PM$^2$A) problem, which seeks a seed set within the limited budget that influences as many customers as possible and maximizes profit, and presented a $\frac{1}{2}(1 - 1/e^2)$-approximation algorithm. Moreover, they proposed another algorithm, called PMIS, and stated that PMIS could produce a solution within a factor of $\alpha \cdot (1 - 1/e)$, where $\alpha$ may be made arbitrarily close to 1. However, $\alpha$ is obtained by using CPLEX to solve the Multiple-Choice Knapsack problem.

In this paper, we present an efficient algorithm called the Randomized Modified Greedy (RMG) algorithm for the PM$^2$A problem. Based on the RIS technique, the RMG algorithm returns a $(1 - 1/e - \varepsilon)$-approximate solution with probability at least $1 - (nq)^{-l} - qn^{-l'}$, where $q$ is the number of products, $n$ is the number of nodes in the social network, and $\varepsilon, l, l' > 0$. The RMG algorithm can be implemented with tunable parameters, which makes it flexible in balancing running time against accuracy. Experimental results show that the RMG algorithm not only produces high-quality seed sets but also takes much less time than the greedy algorithm with Monte Carlo simulations.

The contributions of this paper are summarized as follows:
• We present a Randomized Modified Greedy (RMG) algorithm for the PM$^2$A problem that achieves a $(1 - 1/e - \varepsilon)$-approximation ratio with high probability, which significantly improves upon prior works in terms of performance guarantee and is also the best performance ratio of the PM$^2$A problem even for one product.
• We extend the RIS technique to accommodate profit estimation over multiple products, and design an OPT estimation scheme to speed up the sampling process, where OPT is the optimum of the PM$^2$A problem.
• We conduct comprehensive experiments on real-world social networks that verify the superiority of RMG in providing an effective budget distribution over multiple products.

Related works.
The Influence Maximization (IM) problem has been studied intensively in the past decade. Kempe et al. [11] first formulated IM as a combinatorial optimization problem and presented a general greedy algorithm that yields a $(1 - 1/e - \varepsilon)$-approximation for all diffusion models considered. Later, a large body of work aimed to improve the efficiency and scalability of the seed selection algorithm. However, these methods were either heuristics without performance guarantees [6, 7] or prohibitively slow on billion-scale networks [9, 14]. Borgs et al. [17] made a theoretical breakthrough with the RIS technique, which guarantees a $(1 - 1/e - \varepsilon)$-approximation and significantly reduces the expected running time. Subsequently, [19, 20, 21] proposed algorithms that are very efficient even on large networks with millions of nodes and billions of edges.

Recently, research on influence or profit maximization with a limited budget has also emerged. Nguyen et al. [22] consider the budgeted influence maximization problem in which each node can have an arbitrary cost, but they focus on the diffusion of a single kind of product, which differs from our work. Multiple products are considered in [23], where separate budget constraints are imposed on different products and their diffusions are separated as well. In the PM$^2$A problem, by contrast, all of the products share an overall activation budget. The PM$^2$A problem was proposed by Zhang et al. [16], who discussed a $\frac{1}{2}(1 - 1/e^2)$-approximation algorithm under the Multiple Thresholds (MT) model, an extension of the LT model. They proposed another algorithm, called PMIS, and stated that PMIS could produce a solution within a factor of $\alpha \cdot (1 - 1/e)$, where $\alpha$ may be made arbitrarily close to 1. However, $\alpha$ is obtained by using CPLEX to solve the Multiple-Choice Knapsack problem. The work in [25] investigates the cost-aware targeted viral marketing problem, in which only one product spreads in the network and each node has its own selection cost and benefit, and aims to find a seed set with total cost no more than a budget that maximizes the expected total profit. The PM$^2$A problem is actually a special case of the problem considered in [25], but [25] only proposes a $(1 - 1/\sqrt{e} - \varepsilon)$-approximation algorithm. In this paper, we present a $(1 - 1/e - \varepsilon)$-approximation algorithm based on the RIS technique under the IC model, which significantly improves the solution of the PM$^2$A problem.
Organization.
The rest of the paper is organized as follows. In Section 2, we introduce the diffusion model and the definition of the PM$^2$A problem. Key ideas for solving the PM$^2$A problem and the framework of the algorithm are presented in Section 3. Section 4 is dedicated to the RMG algorithm along with the analysis of its performance ratio and time complexity. Section 5 shows our experimental results and Section 6 concludes the paper.
2. Problem Formulation
This work aims to design a marketing strategy for allocating the budget among multiple products in a social network. In this section, we present the diffusion model and give the definition of the Profit Maximization with Multiple Adoptions (PM$^2$A) problem.

A social network is usually represented as a digraph $G = (V, E)$ with nodes in $V$ representing users and edges in $E$ representing relationships between users, where $|V| = n$ and $|E| = m$. Assume that each directed edge $e$ in $G$ is associated with a propagation probability $p(e) \in [0, 1]$.

2.1. Diffusion Model

Many diffusion models have been studied in the literature. The diffusion model considered in this paper is the Independent Cascade (IC) model, investigated in the context of marketing by Goldenberg et al. [1, 2]. Given a social network $G$, the IC model considers a timestamped influence propagation process as follows:

1. At timestamp 1, we activate a selected node set $S \subseteq V$, and set all of the remaining nodes inactive.
2. If a node $v$ is first activated at timestamp $t$, then for each directed edge $e$ pointing from $v$ to an inactive node $u$, $v$ has probability $p(e)$ of activating $u$ at timestamp $t + 1$. After timestamp $t + 1$, $v$ has no chance to activate any node.
3. Once a node becomes active, it remains active in all following timestamps.

Let $I(S)$ be the number of nodes that are activated when the above diffusion process terminates. We refer to $S$ as the seed set, and to $I(S)$ as the spread of $S$. Let $\sigma(S)$ be the expected spread of $S$, that is, $\sigma(S) = E[I(S)]$.

In reality, one company may produce multiple different products and one user may purchase multiple kinds of products at one time. Therefore, given a limited budget and several kinds of items with different activation costs and profits, it is crucial for the company to allocate the budget wisely to maximize the overall profit. This problem is called the Profit Maximization with Multiple Adoptions (PM$^2$A) problem, which was introduced by Zhang et al. [16].
Definition 1 (Profit Maximization with Multiple Adoptions (PM$^2$A)). Given a social network $G = (V, E)$ with propagation probability $p : E \to [0, 1]$, suppose there are $q$ different kinds of products spreading independently in $G$ and each node can adopt multiple products. For $i = 1, 2, \ldots, q$, let $c_i$ be the cost of initially activating a node to adopt product $i$, and $p_i$ be the profit obtained when a node is activated to adopt product $i$, where $c_i, p_i > 0$. The PM$^2$A problem asks to identify a seed set for each product, with overall activation cost at most $B$, such that the expected total profit is maximized.
Obviously, the Influence Maximization (IM) problem is a special case of the PM$^2$A problem when $q = 1$. Since the IM problem is NP-hard for the IC model [11] and cannot be approximated better than $1 - 1/e$ unless P = NP [13], we have the following result.
Claim 1.
The PM$^2$A problem is NP-hard for the IC model, and for any $\varepsilon > 0$, it cannot be approximated in polynomial time within a ratio of $(1 - 1/e + \varepsilon)$ unless P = NP.
Since each node may adopt multiple products, the PM$^2$A problem can be characterized as $q$ independent diffusion processes under a common budget constraint in $G$. The most crucial point here is how to allocate the budget among the multiple products. To address this issue, we give another version of the PM$^2$A problem in the next section.
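To make the objects of this section concrete, the following Python sketch simulates the IC process of Section 2.1 and estimates the expected total profit of $q$ independently spreading products by plain Monte Carlo simulation. It is an illustration only, not the method used later in the paper: the function names (`simulate_ic`, `expected_profit`, `activation_cost`), the adjacency-list representation and the number of runs are assumptions made for this example.

```python
import random

def simulate_ic(out_edges, seeds, rng):
    """Run one IC diffusion and return the set of activated nodes.

    out_edges maps a node to a list of (neighbor, probability) pairs
    (a hypothetical representation chosen for this sketch).
    """
    active = set(seeds)
    frontier = list(seeds)  # nodes first activated at the current timestamp
    while frontier:
        next_frontier = []
        for v in frontier:
            for u, p in out_edges.get(v, []):
                # v gets exactly one chance to activate each inactive out-neighbor
                if u not in active and rng.random() < p:
                    active.add(u)
                    next_frontier.append(u)
        frontier = next_frontier
    return active

def expected_profit(out_edges, seed_sets, profits, runs=1000, seed=0):
    """Monte Carlo estimate of sum_i p_i * sigma(S_i) for independent products."""
    rng = random.Random(seed)
    total = 0.0
    for seeds, p_i in zip(seed_sets, profits):
        spread = sum(len(simulate_ic(out_edges, seeds, rng)) for _ in range(runs))
        total += p_i * spread / runs
    return total

def activation_cost(seed_sets, costs):
    """Overall activation cost sum_i c_i * |S_i|, to be compared with the budget B."""
    return sum(c_i * len(seeds) for seeds, c_i in zip(seed_sets, costs))
```

A seed allocation is feasible when `activation_cost(seed_sets, costs)` is at most the budget $B$. Each call to `expected_profit` costs `runs` full simulations per product, which is the expense that the sampling approach of the later sections is designed to avoid.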
3. Key Ideas for Solving the PM$^2$A Problem
In this section, we transform the PM$^2$A problem into an equivalent problem, called PM-$\widetilde{G}$, and present the framework for solving the PM-$\widetilde{G}$ problem.

3.1. Reformulation of the Problem

Definition 2 ($q$-component copy graph $\widetilde{G}$). Given a social network $G = (V, E)$ with propagation probability $p : E \to [0, 1]$, let $\widetilde{G} = (\widetilde{V}, \widetilde{E}) = G^{(1)} \cup G^{(2)} \cup \cdots \cup G^{(q)}$ be a graph composed of $q$ components, each of which (denoted by $G^{(i)} = (V^{(i)}, E^{(i)})$) is a copy of $G$. For any node $u \in V$, let $u^{(i)}$ be the copy node of $u$ in $G^{(i)}$. Copy nodes $u^{(i)}$ and $u^{(j)}$ of $u$ in different components $G^{(i)}$ and $G^{(j)}$, $i \neq j$, are regarded as different nodes in $\widetilde{G}$. Then any node set $S \subseteq \widetilde{V}$ can be written as $S = S^{(1)} \cup S^{(2)} \cup \cdots \cup S^{(q)}$, where $S^{(i)} \subseteq V^{(i)}$, $i = 1, 2, \ldots, q$.

Figure 1: An example of the $q$-component copy graph $\widetilde{G}$.

Thus, $q$ different products spreading in $G$ are transformed into a single product spreading in $\widetilde{G}$, subject to the condition that both the activation cost and the profit differ across components:
• Cost in $G^{(i)}$: the cost of initially activating a node in component $G^{(i)}$ to adopt the product is $c_i$.
• Profit in $G^{(i)}$: the profit when a node in component $G^{(i)}$ is activated to adopt the product is $p_i$.

Since different components are not connected to each other, for a seed set $S = S^{(1)} \cup S^{(2)} \cup \cdots \cup S^{(q)}$, $S^{(i)}$ can only influence the nodes in $G^{(i)}$. For $i = 1, 2, \ldots, q$, denote by $\sigma(S^{(i)})$ the expected spread of $S^{(i)}$, and by $\rho(S^{(i)})$ the expected profit gained by initially activating the nodes in $S^{(i)}$, i.e., $\rho(S^{(i)}) = p_i \cdot \sigma(S^{(i)})$. Let $\rho(S)$ be the expected total profit gained by initially activating all the nodes in $S$; then
$$\rho(S) = \sum_{i=1}^{q} p_i \cdot \sigma(S^{(i)}).$$
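As a small illustration of the copy-graph construction, the sketch below builds $\widetilde{G}$ from $G$ and splits a seed set into its per-component parts $S^{(1)}, \ldots, S^{(q)}$. Representing the copy node $u^{(i)}$ as the pair `(i, u)` and storing edges in a dictionary are assumptions of this example, not notation from the paper.

```python
def build_copy_graph(nodes, edges, q):
    """Return (copy_nodes, copy_edges) of the q-component copy graph.

    nodes: list of node ids of G; edges: dict mapping (u, v) to p(u, v).
    The copy node u^(i) is represented as the pair (i, u), i = 1..q.
    """
    copy_nodes = [(i, u) for i in range(1, q + 1) for u in nodes]
    # every edge is duplicated inside each component; no edge crosses components
    copy_edges = {((i, u), (i, v)): p
                  for i in range(1, q + 1)
                  for (u, v), p in edges.items()}
    return copy_nodes, copy_edges

def split_by_component(seed_set, q):
    """Partition S into {i: S^(i)} according to the component index."""
    parts = {i: set() for i in range(1, q + 1)}
    for i, u in seed_set:
        parts[i].add(u)
    return parts
```

Because no edge connects two components, a diffusion started from $S^{(i)}$ stays inside $G^{(i)}$, which is why the total profit decomposes into a sum over components.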
Let $c(S)$ be the activation cost of $S$, that is, $c(S) = \sum_{i=1}^{q} c_i |S^{(i)}|$.

Definition 3 (Profit Maximization on $\widetilde{G}$ (PM-$\widetilde{G}$)). The PM-$\widetilde{G}$ problem asks for a seed set $S \subseteq \widetilde{V}$ with activation cost at most $B$ such that the expected total profit is maximized:
$$\max\ \rho(S) = \sum_{i=1}^{q} p_i \cdot \sigma(S^{(i)}) \quad \text{s.t.} \quad \sum_{i=1}^{q} c_i |S^{(i)}| \le B. \tag{1}$$

It is easy to see that the PM$^2$A problem is equivalent to the PM-$\widetilde{G}$ problem.

Let $\Omega = \{S \subseteq \widetilde{V} \mid c(S) \le B\}$, that is, $\Omega$ is the feasible set of the PM-$\widetilde{G}$ problem. Let $S^*$ be the optimal solution of the PM-$\widetilde{G}$ problem, and $\mathrm{OPT} = \rho(S^*)$ be the optimum.

Since $\rho(S) = \sum_{i=1}^{q} p_i \cdot \sigma(S^{(i)})$ and $\sigma(S^{(i)})$ has been proved to be nondecreasing as well as submodular [11], the following result holds.

Proposition 1. The profit function $\rho(\cdot)$ in the PM-$\widetilde{G}$ problem is nonnegative, nondecreasing and submodular.

Given products with different costs and profits, and a limited budget, one may resort to algorithms for the classical knapsack problem, such as the greedy algorithm and dynamic programming. However, those methods do not perform well here, since the profit function $\rho(\cdot)$ in the PM-$\widetilde{G}$ problem is submodular rather than linear: selecting any node from the candidate node set may change the marginal gain of choosing the next seed, unlike the static weight associated with each item in the knapsack problem. These facts make the PM-$\widetilde{G}$ problem difficult.

For the problem of maximizing a nonnegative, nondecreasing submodular set function $f(\cdot)$ subject to a knapsack constraint, Sviridenko [18] proposed a modified greedy algorithm which guarantees a $(1 - 1/e)$-approximation ratio and is based on a value oracle model for $f(\cdot)$. That is, for a given set $S$, the algorithm can query an oracle to find its value $f(S)$. Our task in this work, however, is to solve the PM-$\widetilde{G}$ problem without the value oracle model, which is complicated by the difficulty of computing $\rho(\cdot)$, because the computation of $\sigma(\cdot)$ has been shown to be #P-hard.

3.2. Framework for Solving the PM-$\widetilde{G}$ Problem
Before we give the algorithm for solving the PM-$\widetilde{G}$ problem in detail, we first describe the main idea and framework of the algorithm.

To tackle the intractability of computing $\rho(\cdot)$, we obtain an estimate $\hat{\rho}(\cdot)$ of $\rho(\cdot)$ that has a small error with high probability and can be computed in polynomial time. Then we substitute $\hat{\rho}(S)$ for $\rho(S)$, which translates the original PM-$\widetilde{G}$ problem into maximizing $\hat{\rho}(S)$ under the budget constraint. Using the modified greedy algorithm [18], a $(1 - 1/e)$-approximate solution $S_A$ for the problem $\max_{S \in \Omega} \hat{\rho}(S)$ is obtained, which can be proved to be a $(1 - 1/e - \varepsilon)$-approximate solution for the PM-$\widetilde{G}$ problem with high probability.

In the estimation of $\rho(S)$, we use the Reverse Influence Sampling (RIS) technique introduced by Borgs et al. [17], which significantly improves the time complexity of algorithms for the IM problem.

In summary, the PM-$\widetilde{G}$ problem can be solved by the following steps.
• Estimate $\rho(S)$: Use the RIS technique to obtain an estimate $\hat{\rho}(\cdot)$ of $\rho(\cdot)$ such that for any $S \in \Omega$, $|\hat{\rho}(S) - \rho(S)| < \frac{\varepsilon}{2} \cdot \mathrm{OPT}$ holds with high probability, where $0 < \varepsilon < 1$.
• Solve problem $\max_{S \in \Omega} \hat{\rho}(S)$: Prove that $\hat{\rho}(S)$ is nonnegative, nondecreasing and submodular, then use the Modified Greedy algorithm to solve the problem $\max_{S \in \Omega} \hat{\rho}(S)$. Let $S_A$ be the solution returned by the algorithm; then we can show that $S_A$ is a $(1 - 1/e - \varepsilon)$-approximate solution for the PM-$\widetilde{G}$ problem with high probability (w.h.p.).

The framework for solving the PM-$\widetilde{G}$ problem is shown in Fig. 2.

Figure 2: Overview of algorithms.
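The step from a $(1 - 1/e)$-approximation of the estimated problem to a $(1 - 1/e - \varepsilon)$-approximation of the original one can be checked with a short chain of inequalities. As a sketch, assume that the estimate satisfies $|\hat{\rho}(S) - \rho(S)| < \frac{\varepsilon}{2} \cdot \mathrm{OPT}$ for every $S \in \Omega$ (the factor $\frac{1}{2}$ is the assumption that makes the chain close) and that $\hat{\rho}(S_A) \ge (1 - 1/e) \max_{S \in \Omega} \hat{\rho}(S)$:

```latex
\begin{align*}
\rho(S_A) &> \hat{\rho}(S_A) - \tfrac{\varepsilon}{2}\,\mathrm{OPT} \\
          &\ge (1 - 1/e)\,\hat{\rho}(S^*) - \tfrac{\varepsilon}{2}\,\mathrm{OPT} \\
          &> (1 - 1/e)\bigl(\mathrm{OPT} - \tfrac{\varepsilon}{2}\,\mathrm{OPT}\bigr) - \tfrac{\varepsilon}{2}\,\mathrm{OPT} \\
          &= \bigl(1 - 1/e - \varepsilon + \tfrac{\varepsilon}{2e}\bigr)\,\mathrm{OPT} \\
          &\ge (1 - 1/e - \varepsilon)\,\mathrm{OPT}.
\end{align*}
```

The first and third lines apply the estimation error to $S_A$ and to the optimal solution $S^*$, respectively; the second line uses that $S^*$ is feasible for the estimated problem.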
4. Algorithm and Its Analysis
In this section, the Randomized Modified Greedy (RMG) algorithm for the PM-$\widetilde{G}$ problem is presented, which achieves a $(1 - 1/e - \varepsilon)$-approximation ratio with high probability. Before introducing the algorithm, we list some notation for convenience. Let $p_{\min} = \min_{1 \le i \le q}\{p_i\}$, $p_{\max} = \max_{1 \le i \le q}\{p_i\}$, $c_{\min} = \min_{1 \le i \le q}\{c_i\}$ and $k_i = \lfloor B / c_i \rfloor$. Let $k^* = \lfloor B / c_{\min} \rfloor$, i.e., the maximum number of seed nodes that can be chosen. As the budget $B$ is often limited, the seed set cannot be too large. Thus in the following, we assume that $k^* \le \lfloor nq/2 \rfloor$.

4.1. Estimation of $\rho(S)$

We are now in a position to give the estimate of $\rho(S)$. In this work, the Reverse Influence Sampling (RIS) technique is used, which captures the influence landscape of the social network by generating a set of random Reverse Reachable (RR) sets [19].
Random Reverse Reachable (RR) set. Given a social network $\widetilde{G} = (\widetilde{V}, \widetilde{E})$ with propagation probability $p : E \to [0, 1]$, a random RR set $R$ is generated by 1) obtaining a random graph $g$ from $\widetilde{G}$ by removing each edge $e$ in $\widetilde{G}$ with probability $1 - p(e)$; 2) selecting a node $v$ from $g$ uniformly at random; 3) returning $R$ as the set of nodes that can reach $v$ in $g$.

Figure 3: An example of generating random RR sets under the IC model. Three random RR sets $R_1$, $R_2$ and $R_3$ are generated for three nodes $c^{(1)}$, $a^{(1)}$ and $d^{(1)}$, respectively.

Intuitively, if a node $u$ appears in an RR set generated for another node $v$, then $u$ can reach $v$ through a certain path in $\widetilde{G}$. Thus, a propagation process from a seed set containing $u$ should have a certain probability of activating $v$. The result of [17] attests to this observation.

Using the reverse Breadth First Search (BFS) algorithm in [26], we can generate a set of random RR sets $\mathcal{R} = \{R_1, R_2, \ldots, R_\theta\}$. Given a node set $S \subseteq \widetilde{V}$, we say that $S$ covers an RR set $R_j$ if and only if $S \cap R_j \neq \emptyset$. Define $F_{\mathcal{R}}(S)$ as the fraction of RR sets in $\mathcal{R}$ covered by $S$, that is,
$$F_{\mathcal{R}}(S) = \frac{|\{R_j \in \mathcal{R},\ 1 \le j \le \theta \mid S \cap R_j \neq \emptyset\}|}{\theta}.$$

Recalling that $S = S^{(1)} \cup S^{(2)} \cup \cdots \cup S^{(q)} \subseteq \widetilde{V}$, it is clear that $F_{\mathcal{R}}(S) = \sum_{i=1}^{q} F_{\mathcal{R}}(S^{(i)})$. Based on the results of Tang et al. [19], for any $S^{(i)} \subseteq V^{(i)}$, $i = 1, 2, \ldots, q$, the expected value of $nq \cdot F_{\mathcal{R}}(S^{(i)})$ equals the expected spread of $S^{(i)}$ in $\widetilde{G}$. This implies the following lemma.

Lemma 1.
For any node set $S^{(i)} \subseteq V^{(i)}$, $E[nq \cdot F_{\mathcal{R}}(S^{(i)})] = \sigma(S^{(i)})$, $i = 1, 2, \ldots, q$.

Denote $\hat{\sigma}(S^{(i)}) = nq \cdot F_{\mathcal{R}}(S^{(i)})$. Then according to Lemma 1, $\hat{\sigma}(S^{(i)})$ is an unbiased estimate of $\sigma(S^{(i)})$. Define $\hat{\rho}(S^{(i)}) = p_i \hat{\sigma}(S^{(i)})$ and, for any $S \subseteq \widetilde{V}$, let $\hat{\rho}(S) = \sum_{i=1}^{q} \hat{\rho}(S^{(i)}) = \sum_{i=1}^{q} p_i \hat{\sigma}(S^{(i)})$. Obviously, $\hat{\rho}(S^{(i)})$ and $\hat{\rho}(S)$ are unbiased estimates of $\rho(S^{(i)})$ and $\rho(S)$, respectively.

Corollary 1.
For any $S \subseteq \widetilde{V}$, $E[\hat{\rho}(S^{(i)})] = \rho(S^{(i)})$ $(i = 1, 2, \ldots, q)$, and $E[\hat{\rho}(S)] = \rho(S)$.

For any $S \in \Omega$, we use the following Algorithms 1 and 2 to obtain the value of $\hat{\rho}(S)$. In Algorithm 1, we generate a set of $\theta$ random RR sets, denoted by $\mathcal{R}$. In Algorithm 2, we first identify the nodes in $S$ and partition them into $S^{(1)}, S^{(2)}, \ldots, S^{(q)}$, where $S^{(i)} \subseteq V^{(i)}$ (Lines 2-6), then compute the fraction of RR sets in $\mathcal{R}$ covered by $S^{(i)}$, denoted by $F_{\mathcal{R}}(S^{(i)})$ (Lines 7-10). Summing up all the $q$ items, we obtain an estimate $\hat{\rho}(S)$.

By Chernoff bounds [24], we show that for any $S \in \Omega$, the result obtained by Algorithm 2 is an accurate estimate of $\rho(S)$ with high probability when $\theta$ is sufficiently large.

Algorithm 1 RR Sets Generation
Input: Graph $\widetilde{G}$ and a positive integer $\theta$.
Output: a set of $\theta$ random RR sets $\mathcal{R}$.
1: Initialize: $\mathcal{R} = \emptyset$;
2: Use the reverse Breadth First Search algorithm to generate $\theta$ random RR sets and insert them into $\mathcal{R}$;
3: return $\mathcal{R}$.

Algorithm 2 Profit-Estimate
Input: A seed set $S = \{\bar{v}_1, \bar{v}_2, \cdots, \bar{v}_{|S|}\} \in \Omega$, $\mathcal{R} = \{R_1, R_2, \cdots, R_\theta\}$, $0 < \varepsilon < 1$.
Output: $\hat{\rho}(S)$ such that $|\hat{\rho}(S) - \rho(S)| < \frac{\varepsilon}{2} \cdot \mathrm{OPT}$ with probability at least $1 - (nq)^{-l} (k^*)^{-1} / \binom{nq}{k^*}$.
1: Initialize: $\hat{\rho}(S) = 0$;
2: for $i$ from 1 to $q$ do
3:     $S^{(i)} \leftarrow \emptyset$;
4:     for $j$ from 1 to $|S|$ do
5:         if $\bar{v}_j \in V^{(i)}$ then
6:             $S^{(i)} = S^{(i)} \cup \{\bar{v}_j\}$;
7: for $i$ from 1 to $q$ do
8:     Initialize: $F_{\mathcal{R}}(S^{(i)}) = 0$;
9:     for $k$ from 1 to $\theta$ do
10:         $F_{\mathcal{R}}(S^{(i)}) = F_{\mathcal{R}}(S^{(i)}) + \min\{|S^{(i)} \cap R_k|, 1\} / \theta$;
11: return $\hat{\rho}(S) = \sum_{i=1}^{q} nq\, p_i \cdot F_{\mathcal{R}}(S^{(i)})$.

Lemma 2.
Suppose $\theta$ satisfies
$$\theta \ge (8q + 2\varepsilon)\, n q^2 p_{\max} \cdot \frac{l \log(nq) + \log(2qk^*) + \log\binom{nq}{k^*}}{\varepsilon^2 \cdot \mathrm{OPT}}. \tag{2}$$
Then for any set $S \in \Omega$, the following inequality holds with probability at least $1 - (nq)^{-l} (k^*)^{-1} / \binom{nq}{k^*}$:
$$|\hat{\rho}(S) - \rho(S)| < \frac{\varepsilon}{2} \cdot \mathrm{OPT}, \tag{3}$$
where $l > 0$, $0 < \varepsilon < 1$ and $k^* = \lfloor B / c_{\min} \rfloor$.

Since $\hat{\rho}(S) \approx \rho(S)$ with high probability and $\hat{\rho}(S)$ can be computed in polynomial time, we now turn to solving the following problem:
$$\max\ \hat{\rho}(S) \quad \text{s.t.} \quad \sum_{i=1}^{q} c_i |S^{(i)}| \le B. \tag{4}$$

4.2. Solving $\max_{S \in \Omega} \hat{\rho}(S)$

In this section, we provide a Modified Greedy algorithm for problem (4) which achieves a $(1 - 1/e)$-approximate solution $S_A$. Then we show that $S_A$ is a $(1 - 1/e - \varepsilon)$-approximate solution for the original PM-$\widetilde{G}$ problem with high probability.

Algorithm 3 Modified Greedy Algorithm
Input: Graph $\widetilde{G}$, a budget $B$ and $0 < \varepsilon < 1$.
Output: A $(1 - 1/e - \varepsilon)$-approximate solution for the PM-$\widetilde{G}$ problem, with probability at least $1 - (nq)^{-l}$.
1: Initialize: $\mathcal{U} = \emptyset$, $\mathcal{S} = \emptyset$, $\widehat{V} = \widetilde{V}$, $\mathcal{V} = \emptyset$;
2: for all $U \in \Omega$ with $|U| = 1$ or $|U| = 2$ do
3:     $\hat{\rho}(U) \leftarrow$ Profit-Estimate($\mathcal{R}$, $U$, $\varepsilon$);
4:     Insert $U$ into $\mathcal{U}$;
5: $U^* = \arg\max_{U \in \mathcal{U}} \{\hat{\rho}(U)\}$;
6: for all $S_0 \in \Omega$, $|S_0| = 3$ do
7:     $S \leftarrow S_0$;
8:     while $c(S) \le B$ do
9:         $\widehat{V} = \widehat{V} \setminus S$;
10:         $\hat{\rho}(S) \leftarrow$ Profit-Estimate($\mathcal{R}$, $S$, $\varepsilon$);
11:         for all $v \in \widehat{V}$ do
12:             if $c(S \cup \{v\}) \le B$ then
13:                 Insert $v$ into $\mathcal{V}$;
14:                 $\hat{\rho}(S \cup \{v\}) \leftarrow$ Profit-Estimate($\mathcal{R}$, $S \cup \{v\}$, $\varepsilon$);
15:         $v^* = \arg\max_{v \in \mathcal{V}} \left\{ \frac{\hat{\rho}(S \cup \{v\}) - \hat{\rho}(S)}{c(\{v\})} \right\}$;
16:         $S = S \cup \{v^*\}$;
17:     Insert $S$ into $\mathcal{S}$;
18: $S^* = \arg\max_{S \in \mathcal{S}} \{\hat{\rho}(S)\}$;
19: return $S_A = \arg\max \{\hat{\rho}(U^*), \hat{\rho}(S^*)\}$.

Lemma 3. $\hat{\rho}(\cdot)$ is nonnegative, nondecreasing and submodular.

Motivated by the design of the main algorithm in [18], we propose a Modified Greedy algorithm for problem (4). The sketch of the algorithm is as follows. We first enumerate all the feasible seed sets containing one or two nodes, to avoid the extreme situation in which nodes with high profit and high cost are not included in the solution (Lines 2-5). Then we start from each feasible seed set consisting of three nodes and greedily add a node that does not destroy the feasibility of the set (Lines 6-18). Finally, we output the better of the two cases (Line 19).
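The sampling and estimation steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the paper's implementation: copy nodes are represented as pairs `(i, u)`, `in_edges` is a hypothetical reverse adjacency list, and `ratio_greedy` implements only the cost-effective greedy loop, omitting the enumeration of small starting sets that Algorithm 3 relies on for its approximation guarantee.

```python
import random
from collections import deque

def random_rr_set(nodes, in_edges, rng):
    """Sample one RR set by reverse BFS from a uniformly random node,
    keeping each incoming edge independently with its probability."""
    root = rng.choice(nodes)
    rr, queue = {root}, deque([root])
    while queue:
        v = queue.popleft()
        for u, p in in_edges.get(v, []):
            if u not in rr and rng.random() < p:
                rr.add(u)
                queue.append(u)
    return rr

def profit_estimate(seed_set, rr_sets, profits, num_nodes):
    """Estimate rho(S) as nq * sum_i p_i * F_R(S^(i)).

    Nodes are (i, u) pairs, profits[i] is p_i and num_nodes is n*q.
    An RR set lies inside one component, so a covered RR set
    contributes the profit of that component exactly once."""
    total = 0.0
    for rr in rr_sets:
        hit = seed_set & rr
        if hit:
            component = next(iter(hit))[0]
            total += profits[component]
    return num_nodes * total / len(rr_sets)

def ratio_greedy(candidates, rr_sets, profits, costs, budget, num_nodes):
    """Repeatedly add the affordable node with the best estimated
    marginal profit per unit cost (simplified greedy phase only)."""
    chosen, spent, current = set(), 0.0, 0.0
    remaining = set(candidates)
    while True:
        best, best_ratio = None, 0.0
        for v in remaining:
            cost = costs[v[0]]
            if spent + cost > budget:
                continue
            gain = profit_estimate(chosen | {v}, rr_sets, profits, num_nodes) - current
            if gain / cost > best_ratio:
                best, best_ratio = v, gain / cost
        if best is None:  # nothing affordable improves the estimate
            return chosen
        chosen.add(best)
        spent += costs[best[0]]
        current = profit_estimate(chosen, rr_sets, profits, num_nodes)
        remaining.discard(best)
```

In use, one would draw $\theta$ RR sets with `random_rr_set` over the nodes of $\widetilde{G}$ and pass them to `ratio_greedy` together with the per-product profits and costs. Evaluating a candidate only requires scanning the sampled RR sets, so no diffusion simulation is run during seed selection.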
Theorem 1. Given a graph $\widetilde{G}$, a positive number $B$, $0 < \varepsilon < 1$, $l > 0$ and $\theta$ that satisfies inequality (2), Algorithm 3 returns a $(1 - 1/e - \varepsilon)$-approximate solution for the PM-$\widetilde{G}$ problem with probability at least $1 - (nq)^{-l}$.

4.3. Estimation of the Parameter $\theta$

To guarantee that the solution returned by Algorithm 3 is a $(1 - 1/e - \varepsilon)$-approximate solution for the PM-$\widetilde{G}$ problem with high probability, the number $\theta$ of random RR sets generated in Algorithm 1 should satisfy inequality (2). For simplicity, we define
$$\lambda = (8q + 2\varepsilon)\, n q^2 p_{\max} \cdot \left( l \log(nq) + \log(2qk^*) + \log\binom{nq}{k^*} \right) \cdot \varepsilon^{-2} \tag{5}$$
and rewrite (2) as
$$\theta \ge \lambda / \mathrm{OPT}. \tag{6}$$

However, since OPT is unknown in advance, it is difficult to set $\theta$ directly based on (6). Inspired by the technique used in [19], we address this challenge by finding an estimate $u$ of OPT which is also a lower bound of OPT. Then, by setting $\theta = \lambda / u$, we can guarantee that $\theta$ satisfies inequality (6). On the other hand, $\theta$ should be kept reasonably small in order to avoid time overhead, which requires the lower bound $u$ to be as close to OPT as possible.
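As a numerical illustration of setting $\theta = \lambda / u$, the sketch below evaluates $\lambda$ and the resulting number of RR sets. It assumes that $\lambda$ has the form $(8q + 2\varepsilon)\, n q^2 p_{\max} \left( l \log(nq) + \log(2qk^*) + \log\binom{nq}{k^*} \right) \varepsilon^{-2}$; the constants should be checked against (5) before relying on the numbers. The function names are hypothetical.

```python
import math

def log_binomial(n, k):
    """Natural log of C(n, k), computed via log-gamma to avoid overflow."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

def rr_sets_needed(n, q, p_max, k_star, eps, l, u):
    """Return ceil(lambda / u), where u is a lower bound of OPT.

    The constants in lambda are an assumption of this sketch."""
    log_terms = (l * math.log(n * q)
                 + math.log(2 * q * k_star)
                 + log_binomial(n * q, k_star))
    lam = (8 * q + 2 * eps) * n * q ** 2 * p_max * log_terms / eps ** 2
    return math.ceil(lam / u)
```

The function makes the trade-off visible: halving $\varepsilon$ roughly quadruples the number of RR sets, while a lower bound $u$ that is twice as tight halves it.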
In this section, an estimation of OPT is presented, which is based on the results in [19]. Though this estimation is not tight enough, we retain it here in order to evaluate the time complexity of our algorithms.

Define the width of an RR set $R$, denoted by $\omega(R)$, as the number of directed edges in $\widetilde{G}$ which point to the nodes in $R$. That is, $\omega(R) = \sum_{v \in R}$ (the in-degree of $v$ in $\widetilde{G}$). Obviously, if an edge is examined in the generation of $R$, then it must point to a node in $R$. Let $EW$ be the expected width of a random RR set, that is, the expected number of coin tosses required to generate a random RR set. It is then easy to verify that the expected time complexity of Algorithm 1 is $O(\theta \cdot EW)$.

The connection between $EW$ and the expected spread of any node in $V^*$ is formalized in the following lemma [19].

Lemma 4. $\frac{n}{m} EW = E[I(\{v^*\})]$, where the expectation of $I(\{v^*\})$ is taken over the randomness in $v^*$ and the influence propagation process.

Lemma 4 implies that $p_{\min} \cdot \frac{n}{m} EW \le \mathrm{OPT}$, since $E[I(S^*)]$ is the expected spread of at least $\lfloor B / c_{\max} \rfloor$ seed nodes and the profit of activating any node in $E[I(S^*)]$ is at least $p_{\min}$. As $p_{\min} \cdot \frac{n}{m} EW$ is easy to estimate, we can choose $u = p_{\min} \cdot \frac{n}{m} EW$ as a lower bound of OPT. However, when $|S^*| \gg 1$, $u = p_{\min} \cdot \frac{n}{m} EW$ renders $\theta = \lambda / u$ unnecessarily large, which makes it an unfavorable choice of $u$.

Now we consider another, closer estimation of OPT. For $i = 1, 2, \ldots, q$, we consider an extreme situation of the PM-$\widetilde{G}$ problem in which the seed set only contains nodes in $G^{(i)}$. In this situation, the seed set consists of no more than $k_i = \lfloor B / c_i \rfloor$ nodes and the PM-$\widetilde{G}$ problem reduces to seeking a size-$k_i$ seed set with the maximum profit, which is equivalent to the $k_i$-size IM problem in $G^{(i)}$. According to the results in [19], the IM problem in $G$ under the IC model can be solved by an algorithm called TIM+.

The TIM+ algorithm, based on the RIS technique, consists of two phases. The first phase, called parameter estimation, computes an estimate $KPT^+$ of the optimum and uses it to compute $\theta'$, the number of random RR sets that need to be generated. The second phase, called node selection, samples $\theta'$ random RR sets from $G$ and applies the greedy algorithm to derive a size-$k$ node set $S_k$ covering a large number of RR sets. (The details of the TIM+ algorithm can be found in the appendix.)

We use the TIM+ algorithm to solve the $k_i$-size IM problem in $G^{(i)}$, and obtain an approximate solution (denoted by $S_{k_i}$) and the estimate of its expected spread (denoted by $\hat{\sigma}(S_{k_i})$). Then we compute the corresponding estimate of profit when $S_{k_i}$ is used as the seed set. Based on the analysis in [19], we set $u_i = p_i \cdot \hat{\sigma}(S_{k_i}) / (1 + \varepsilon'/2)$ to ensure that $u_i \le \mathrm{OPT}$ with high probability. We then take $\max_{1 \le i \le q}\{u_i\}$ as an estimate of OPT, which is also a lower bound of OPT with high probability. The estimating procedure is detailed in Algorithm 4.

Algorithm 4 OPT Estimation
Input: Graph $\widetilde{G}$, a budget $B$, $c_1, c_2, \ldots, c_q$, $p_1, p_2, \ldots, p_q$, $0 < \varepsilon' < 1$ and $l' > 0$.
Output: A lower bound $u^*$ of OPT.
1: for $i$ from 1 to $q$ do
2:     $k_i = \lfloor B / c_i \rfloor$;
3:     $\hat{\sigma}(S_{k_i}) \leftarrow$ TIM+($G^{(i)}$, $k_i$, $\varepsilon'$, $l'$);
4:     $u_i = p_i \cdot \hat{\sigma}(S_{k_i}) / (1 + \varepsilon'/2)$;
5: return $u^* = \max_{1 \le i \le q}\{u_i\}$.

The main result of TIM+ is as follows.
• Input: A graph $G$, the size constraint $k$ of the seed set, a constant $l' > 0$ and $\varepsilon' \in (0, 1)$.
• Output: A seed set $S_k$ and the estimate $\hat{\sigma}(S_k)$ of its expected spread.
• Approximation: $S_k$ is a $(1 - 1/e - \varepsilon')$-approximate solution of the IM problem, with probability at least $1 - n^{-l'}$.
• Time complexity: $O\left((k + l')(m + n) \log n / (\varepsilon')^2\right)$.

Let $S_k^*$ be the optimal solution, and $\sigma(S_k^*)$ be the optimum of the $k$-size IM problem in $G$. Based on Lemma 7 and Lemma 8 in [19], we have:

Lemma 5.
Let $S_k$ be the solution returned by the TIM+ algorithm for the $k$-size IM problem, and $\hat{\sigma}(S_k)$ be the estimate of the expected spread of $S_k$; then
$$\Pr\left[ (1 - 1/e)(1 - \varepsilon'/2)\, \sigma(S_k^*) < \hat{\sigma}(S_k) < (1 + \varepsilon'/2)\, \sigma(S_k^*) \right] > 1 - n^{-l'}.$$

Theorem 2.
Algorithm 4 returnsu ∗ ∈ (cid:34) (1 − / e )(1 − ε (cid:48) / + ε (cid:48) / q OPT , OPT (cid:35) with at least − qn − l (cid:48) probability and runs in O (cid:0) ( k ∗ + l (cid:48) )( m + n ) q log n / ( ε (cid:48) ) (cid:1) expected time,where l (cid:48) > and < ε (cid:48) < .4.3.3. Refined Estimation of OPT In this section, we present another method to estimate
OPT. Clearly, the efficiency of the RMG algorithm highly depends on the value of $u^*$ obtained by Algorithm 4. Although we can ensure that the output $u^*$ of Algorithm 4 is no smaller than $\frac{(1 - 1/e)(1 - \varepsilon'/2)}{(1 + \varepsilon'/2)\, q} \cdot OPT$ with high probability, $u^*$ may be much smaller than $OPT$ in experiments.

We propose an efficient solution (Algorithm 5) to this problem, which adds an intermediate step between Algorithm 4 and Algorithm 3. First, we construct two matrices, the profit matrix $P$ and the seed set matrix $A$, as follows (Algorithm 5, Lines 1-9). Both matrices have $q$ rows and $k^*$ columns. For $i = 1, 2, \ldots, q$ and $j = 1, 2, \ldots, k_i$, the entry $a_{ij}$ of $A$ denotes the seed set that achieves the maximum profit when selecting $j$ seed nodes from $G^{(i)}$, and the entry $p_{ij}$ of $P$ is the estimated profit obtained by using $a_{ij}$ as the seed set. For $i = 1, 2, \ldots, q$ and $j = k_i + 1, \ldots, k^*$, we set $p_{ij} = 0$ and $a_{ij} = \emptyset$. Then, each time, we select the seed set $a_{ij}$ with the maximum ratio of profit $p_{ij}$ to its activation cost $c_i \cdot j$ in the entire matrix; that is, we add the node set $a_{i^* j^*}$ with

$$(i^*, j^*) = \arg\max_{1 \le i \le q,\; 1 \le j \le k^*} \left\{ \frac{p_{ij}}{c_i \cdot j} \right\}$$

to the current seed set, while still ensuring that the activation cost of the updated seed set is no more than $B$. After that, we set all the entries in row $i^*$ to 0. We repeat this process until the overall activation cost of the seed set would exceed $B$. Denote by $\hat{S}$ the final seed set obtained, and let $u^{**} = \hat{\rho}(\hat{S})$ (Lines 10-15). The final output of Algorithm 5 is $u' = \max\{u^*, u^{**}\}$, a new lower bound of $OPT$ (Line 16).

Algorithm 5 Refine OPT Estimation
Input: Graph $\widetilde{G}$, a budget $B$, costs $c_1, c_2, \ldots, c_q$, profits $p_1, p_2, \ldots, p_q$, and parameters $l'$ and $\varepsilon'$ in the ranges required by Theorem 2.
Output: A lower bound $u'$ of $OPT$.
 1: Initialize two matrices $P$ and $A$;
 2: for $i$ from 1 to $q$ do
 3:     for $j$ from 1 to $k_i$ do
 4:         $\hat{\sigma}(S_j^{(i)}) \leftarrow$ TIM+$(G^{(i)}, j, \varepsilon', l')$;
 5:         $a_{ij} = S_j^{(i)}$;
 6:         $p_{ij} = p_i \cdot \hat{\sigma}(S_j^{(i)})$;
 7:     for $j$ from $k_i + 1$ to $k^*$ do
 8:         $a_{ij} = \emptyset$;
 9:         $p_{ij} = 0$;
10: $\hat{S} \leftarrow \emptyset$;
11: while $c(\hat{S}) \le B$ do
12:     $(i^*, j^*) = \arg\max_{1 \le i \le q,\; 1 \le j \le k^*} \{ p_{ij} / (c_i \cdot j) \}$;
13:     $\hat{S} = \hat{S} \cup a_{i^* j^*}$;
14:     Set entries of row $i^*$ in matrix $P$ to 0;
15: $u^{**} = \hat{\rho}(\hat{S})$;
16: return $u' = \max\{u^*, u^{**}\}$.
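The selection phase of Algorithm 5 (Lines 10-16) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `refine_opt_estimate` and its arguments are hypothetical, the matrices $P$ and $A$ are assumed to be precomputed (for example by TIM+), an entry is only taken if it still fits into the budget, and $u^{**}$ is approximated by summing the selected entries $p_{ij}$, which relies on the profit being additive over products.

```python
from typing import Dict, List, Set, Tuple


def refine_opt_estimate(
    profit: List[List[float]],    # P: profit[i][j-1] is the estimate p_ij
    seeds: List[List[Set[int]]],  # A: seeds[i][j-1] is the seed set a_ij
    cost: List[float],            # c_i: cost of one seed for product i
    budget: float,                # B
    u_star: float,                # u*: lower bound from the earlier step
) -> Tuple[float, Dict[int, Set[int]]]:
    """Greedy pass over the profit matrix; returns (u', chosen sets)."""
    open_rows = set(range(len(profit)))  # rows not yet set to 0
    chosen: Dict[int, Set[int]] = {}
    spent = 0.0
    u_2star = 0.0
    while open_rows:
        best = None  # (ratio, i, j)
        for i in open_rows:
            for j in range(1, len(profit[i]) + 1):
                p, c = profit[i][j - 1], cost[i] * j
                # Assumption: skip empty entries and entries that
                # would push the total cost above the budget.
                if p <= 0 or spent + c > budget:
                    continue
                ratio = p / c
                if best is None or ratio > best[0]:
                    best = (ratio, i, j)
        if best is None:
            break  # nothing affordable is left
        _, i, j = best
        chosen[i] = seeds[i][j - 1]
        spent += cost[i] * j
        u_2star += profit[i][j - 1]  # stands in for rho_hat(S_hat)
        open_rows.discard(i)         # "set row i* to 0"
    return max(u_star, u_2star), chosen
```

With two products, `profit = [[3.0, 5.0], [4.0, 0.0]]`, `cost = [1.0, 2.0]` and `budget = 4.0`, the sketch first takes the one-seed set of the first product (ratio 3) and then the one-seed set of the second product (ratio 2).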
In summary, our RMG algorithm for the PM$^2$A problem works as follows. Given the social network $G$, the budget $B$, the costs $c_1, c_2, \ldots, c_q$, the profits $p_1, p_2, \ldots, p_q$, and the parameters $\varepsilon$, $\varepsilon'$, $l$ and $l'$, we first construct the graph $\widetilde{G} = G^{(1)} \cup G^{(2)} \cup \ldots \cup G^{(q)}$. Then RMG runs Algorithm 4 and obtains a value $u^*$. Next, RMG computes $\theta = \lambda / u^*$, where $\lambda$ is defined in (5), and invokes Algorithm 1 to generate a set $R$ of random RR sets. Finally, we run Algorithm 3 with $\widetilde{G}$, $\varepsilon$, $B$ and $\theta$ as the input and take its output $S_A$ as the final result of the PM$^2$A problem.

In the rest of this section, we discuss the time complexity of the
RMG algorithm. Based onprevious discussions, the expected time complexity of Algorithm 1 is O ( θ · EW ). In 4.3.1, wehave obtained that u = p min · nm EW is a lower bound of OPT . By setting θ = λ/ u , we can obtainthat Algorithm 1 has an expected time complexity of O ( θ · EW ) = O ( m λ np min ) = O (cid:0) ( k ∗ + l + m + n ) q p max log( nq ) / ( p min · ε ) (cid:1) . Clearly, Algorithm 2 runs in O ( q θ ) = O (cid:0) ( k ∗ + l + m + n ) q p max log( nq ) / ( p min · ε ) (cid:1) expectedtime.Now we are in the position to analyse the expected running time of Algorithm 3. For any S ∈ Ω , ˆ ρ ( S ) is computed by Algorithm 2 which runs in O (cid:0) ( k ∗ + l + m + n ) q p max log( nq ) / ( p min · ε ) (cid:1) nq ) times. The second part of Algorithm 3 from line 7 to 22 invokes Algorithm 2 at most k ∗ · ( nq ) times. Thus, Algorithm 3 has an expected time complexity of O (cid:0) k ∗ ( k ∗ + l + m + n ) n q p max log( nq ) / ( p min · ε ) (cid:1) .By Theorems 1 and 2, RMG runs in O (cid:0) k ∗ ( k ∗ + l + m + n ) n q p max · log( nq ) / ( p min · ε ) (cid:1) expected time and returns a (1 − / e − ε )-approximate solution with at least 1 − ( nq ) − l − qn − l (cid:48) probability.
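To make the sampling-based estimate used throughout this section concrete, the following Python sketch computes the sample count $\theta = \lambda / u^*$ and the estimator $\hat{\rho}(S) = nq \sum_{i=1}^{q} p_i F_R(S^{(i)})$ from a given collection of RR sets. It is an illustration under assumptions, not the paper's code: nodes of $\widetilde{G}$ are encoded as hypothetical `(product index, vertex id)` pairs, the RR sets are assumed to be already generated, and $\theta$ is rounded up to an integer.

```python
import math
from typing import FrozenSet, Sequence, Set, Tuple

Node = Tuple[int, int]  # (product index i, vertex id); hypothetical encoding


def num_rr_sets(lam: float, u_star: float) -> int:
    """theta = lambda / u*, rounded up to an integer sample count."""
    return math.ceil(lam / u_star)


def coverage_fraction(rr_sets: Sequence[FrozenSet[Node]], seeds: Set[Node]) -> float:
    """F_R(S): fraction of RR sets that contain at least one seed."""
    if not rr_sets:
        return 0.0
    hit = sum(1 for rr in rr_sets if not seeds.isdisjoint(rr))
    return hit / len(rr_sets)


def estimate_profit(
    rr_sets: Sequence[FrozenSet[Node]],
    seeds: Set[Node],
    unit_profit: Sequence[float],  # p_1, ..., p_q
    n: int,                        # number of nodes of the original graph G
) -> float:
    """rho_hat(S) = n * q * sum_i p_i * F_R(S^(i))."""
    q = len(unit_profit)
    total = 0.0
    for i, p_i in enumerate(unit_profit):
        seeds_i = {s for s in seeds if s[0] == i}  # S^(i): seeds of product i
        total += p_i * coverage_fraction(rr_sets, seeds_i)
    return n * q * total
```

A smaller lower bound $u^*$ directly inflates the number of RR sets, which is why a tighter estimate of $OPT$ improves the running time.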
5. Experimental Evaluation
In this section, we show the effectiveness of our proposed algorithm on three social network datasets. The goal of the experiments is threefold. First, we evaluate the performance of the RMG algorithm as measured by the achieved expected total profit. Second, we evaluate how tightly the estimated OPT (Algorithm 4) and the refined OPT (Algorithm 5) lower-bound the optimal profit, which indirectly controls the efficiency of the profit maximization algorithm. Finally, we show the distribution of budget and profit produced by our algorithm for different products on multiple datasets to examine the advantages of our algorithm in more depth.

Table 1: Dataset characteristics
Columns: Dataset, n, m, Type, Average degree. Rows: NetHEPT, wikiVote, Epinions.
Datasets. We conduct extensive experiments on three real benchmark social networks, NetHEPT, wikiVote and Epinions, to examine the effectiveness of the RMG algorithm. Basic statistics of the datasets are summarized in Table 1, where $n$ denotes the number of nodes and $m$ denotes the number of edges in the social graph. For undirected graphs, we replace each undirected edge with two directed edges, one in each direction; note that the number of edges is doubled in this case. All datasets used in our experiments are publicly available at [15].

Influence Model. In this work, we adopt the standard Independent Cascade (IC) model as the influence model, which is widely used in the literature [19, 20]. For the IC model, we set the propagation probability of each directed edge to the reciprocal of the in-degree of the node that the edge points to. Specifically, for each edge $e$, we first identify the node $v$ that $e$ points to, and then set $p(e) = 1/d(v)$, where $d(v)$ denotes the in-degree of $v$. This setting of $p(e)$ is widely used in prior works [19, 9, 10].

Algorithms.
In addition to our proposed algorithm, we use three baseline algorithms for comparison, namely Random, Greedy, and PMCE [16]. Random is a baseline that randomly selects nodes from the network and assigns a random product to each node while satisfying the budget constraint. Greedy is an iterative procedure: in each round, it selects the pair of node and product with the maximum ratio of the marginal increase in expected profit to the cost, until the budget is exhausted. PMCE is the Profit Maximization with Cost Effectiveness algorithm proposed in [16]. It first constructs two candidate solutions and then selects the better one as the final result. The first candidate is built by an iterative greedy process: in each round, the node with the maximum ratio of the marginal profit increase to the square of the cost is selected, and the process runs until the budget is used up. The second candidate is found by a similar iterative greedy process with a different selection rule: in each iteration, the node with the maximum marginal profit increase is selected. The intuition behind PMCE is to consider both the case that emphasizes the importance of product cost (first candidate) and the case that ignores it (second candidate). Note that PMCE is designed under the Linear Threshold (LT) model only, but we incorporate the triggering model generalization technique [19] into PMCE and extend it to the IC model.
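The Greedy baseline described above can be sketched in a few lines of Python. This is only an illustration under assumptions: `ratio_greedy` and `toy_profit` are hypothetical names, and the expected profit is supplied by a caller-provided oracle `profit_fn` (in the experiments this role is played by Monte Carlo simulation, which is not reproduced here).

```python
from typing import Callable, Dict, FrozenSet, Sequence, Set, Tuple

Pair = Tuple[int, int]  # (node, product index); hypothetical encoding


def ratio_greedy(
    nodes: Sequence[int],
    cost: Sequence[float],  # cost[i]: cost of one seed for product i
    budget: float,
    profit_fn: Callable[[FrozenSet[Pair]], float],
) -> Set[Pair]:
    """Repeatedly add the affordable pair with the best gain/cost ratio."""
    chosen: Set[Pair] = set()
    spent = 0.0
    current = profit_fn(frozenset())
    while True:
        best = None  # (ratio, pair, profit after adding the pair)
        for i, c in enumerate(cost):
            if spent + c > budget:
                continue  # product i no longer fits into the budget
            for v in nodes:
                pair = (v, i)
                if pair in chosen:
                    continue
                value = profit_fn(frozenset(chosen | {pair}))
                ratio = (value - current) / c
                if ratio > 0 and (best is None or ratio > best[0]):
                    best = (ratio, pair, value)
        if best is None:
            return chosen
        _, pair, current = best
        chosen.add(pair)
        spent += cost[pair[1]]


def toy_profit(reach: Dict[int, Set[int]], unit_profit: Sequence[float]):
    """Deterministic stand-in oracle: profit = sum_i p_i * |nodes reached for i|."""
    def profit(pairs: FrozenSet[Pair]) -> float:
        total = 0.0
        for i, p_i in enumerate(unit_profit):
            covered: Set[int] = set()
            for v, prod in pairs:
                if prod == i:
                    covered |= reach[v]
            total += p_i * len(covered)
        return total
    return profit
```

Because each round evaluates the oracle for every remaining pair, the cost of this baseline is dominated by the profit estimation, which is the motivation for the RR-set based estimator used by RMG.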
Table 2: Product Statistics
Dataset    Product  Profit  Cost  Ratio
NetHEPT    P1       0.39    0.36  1.08
NetHEPT    P2       0.55    0.48  1.15
NetHEPT    P3       0.67    0.65  1.03
wikiVote   P1       0.45    0.12  3.75
wikiVote   P2       0.65    0.20  3.25
wikiVote   P3       0.41    0.80  0.51
Epinions   P1       0.45    0.08  5.63
Epinions   P2       0.20    0.65  0.31
Epinions   P3       0.06    0.78  0.08
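As a quick sanity check of Table 2, the sketch below recomputes the Ratio column as profit divided by cost and derives the per-product seed limit $k_i = \lfloor B / c_i \rfloor$ for a given budget. The dictionary `PRODUCTS` and the helper names are hypothetical; exact decimal arithmetic with half-up rounding is an assumption about how the two-digit ratios were rounded.

```python
from decimal import Decimal, ROUND_HALF_UP

# (profit, cost) per product P1, P2, P3, copied from Table 2.
PRODUCTS = {
    "NetHEPT":  [("0.39", "0.36"), ("0.55", "0.48"), ("0.67", "0.65")],
    "wikiVote": [("0.45", "0.12"), ("0.65", "0.20"), ("0.41", "0.80")],
    "Epinions": [("0.45", "0.08"), ("0.20", "0.65"), ("0.06", "0.78")],
}


def ratio(profit: str, cost: str) -> Decimal:
    """Profit-over-cost ratio rounded to two decimals (half-up)."""
    value = Decimal(profit) / Decimal(cost)
    return value.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


def max_seeds(budget: str, cost: str) -> int:
    """k_i = floor(B / c_i): how many seeds of one product the budget buys."""
    return int(Decimal(budget) // Decimal(cost))


for name, rows in PRODUCTS.items():
    ratios = [str(ratio(p, c)) for p, c in rows]
    limits = [max_seeds("15", c) for _, c in rows]
    print(name, ratios, limits)
```

The limits show how unevenly a fixed budget translates into seeds: a cheap product admits far more seeds than an expensive one, which is what the budget distribution experiments examine.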
Parameters. Unless otherwise specified, we set $\varepsilon$, $\varepsilon'$ and $\bar{\varepsilon}$ to the same value ($\bar{\varepsilon}$ is another error parameter in the TIM+ algorithm) and $B = 15$. For our solutions, we set $l$ and $l'$ in a way that ensures a success probability of $1 - 1/n$. For the baseline algorithms, we set the number $r$ of Monte Carlo simulations following the standard practice in the literature [19]. In our experiments, we randomly generate the profit and cost of the products for each dataset in three cases. First, the profit-over-cost ratio is similar among different products (NetHEPT). Second, we set two products with a higher profit-over-cost ratio than the other product (wikiVote). Third, we set a single product with higher cost than the other two products (Epinions). The product statistics are presented in Table 2. In later paragraphs, we provide a detailed analysis of the different budget distribution patterns shown in these cases and explore the advantages of our proposed algorithm in more depth.

All experiments were run on a machine with an Intel Xeon 2.40GHz CPU and 64GB memory, running 64-bit RedHat Linux server. For each set of experiments, we run the simulation for 100 rounds and report the average results below.

Figure 4: Expected total profit vs. amount of budget.
Expected Total Profit. Our first set of experiments compares our solution with the baseline algorithms Random, Greedy and PMCE in terms of expected total profit. Figure 4 shows the expected total profit yielded by each method on all tested datasets, with $B$ varying from 1 to 15. The x-axis holds the amount of budget and the y-axis holds the expected total profit. We observe that the trend of PMCE is almost in line with the trend of Greedy, and
RMG consistentlyoutperforms all baseline algorithms. In particular, when B = RMG leads
PMCE by a gain of over 20% on all datasets, and the gap between RMG and the baseline algorithms becomes larger as the budget increases.

Figure 5: Comparison of total profit with estimated OPT (Algorithm 4) and refined OPT (Algorithm 5).
Estimation on OPT. Figure 5 compares the expected total profit yielded by RMG with the estimates of OPT produced by Algorithm 4 (OPTEst) and Algorithm 5 (RefOPT). The x-axis holds the amount of budget. For RMG, the y-axis holds the expected total profit; for OPTEst and RefOPT, the y-axis holds the estimated lower bound of OPT. The budget $B$ ranges from 1 to 15. We observe that RefOPT produces a tighter estimate of the lower bound of OPT on all datasets across the whole budget range. This indicates that it is beneficial to account for the maximum profit achievable over the possible distributions of the budget among multiple products. Thus Algorithm 5 provides a more sophisticated yet effective estimate of the lower bound of OPT, leading to a higher efficiency of RMG.

Figure 6: Budget & profit distributions.
Budget & Profit Distribution. We take a further step to explore the distribution of budget and profit produced by RMG with a varying budget. Figure 6 illustrates how the budget is distributed over multiple products and the corresponding profit gained from each product, with the budget varying from 1 to 15. We observe that with a limited budget at the very beginning, all the budget is spent on promoting the product with the highest profit-to-cost ratio. As the budget increases, spending more on a single profitable product is no longer preferred, and gradually adjusting the budget distribution over multiple products becomes crucial. Thus RMG balances cost and profit in the long run and produces a distribution that maximizes the profit.

In summary, our experiments under various settings demonstrate that the RMG algorithm is effective, producing solutions far superior to those of the baselines.

6. Conclusion

The traditional influence maximization problem focuses on the diffusion of a single product or piece of information in the social network, aiming to find a small node set of maximum influence. However, in reality, one company may produce several products to meet the demand of customers. The PM$^2$A problem considers the diffusion of multiple different products in the social network, and seeks a seed set within the limited budget to achieve the goal of profit maximization. Therefore, how to allocate the limited budget among multiple products is crucial for the company in designing commercial activities.

In this paper, we propose the RMG algorithm for the PM$^2$A problem. The algorithm runs in $O\big(k^*(k^* + l + m + n)\, n q\, p_{\max} \log(nq) / (p_{\min} \cdot \varepsilon)\big)$ expected time and returns a $(1 - 1/e - \varepsilon)$-approximate solution with at least $1 - (nq)^{-l} - qn^{-l'}$ probability, which significantly improves upon prior works in terms of performance guarantee and is also the best performance ratio of the PM$^2$A problem even for one product. Experimental results on real-world social networks show that our RMG algorithm outperforms the algorithm proposed in [16] and other heuristics in terms of profit maximization, and allocates the budget better. For future work, we plan to improve the time complexity of the RMG algorithm and investigate the case in which multiple products spread in the social network and compete with each other.
Acknowledgement. This work was supported in part by the National Natural Science Foundation of China (11501316), the China Postdoctoral Science Foundation (2016M600556), the Qingdao Postdoctoral Application Research Project (2016156), and the Natural Science Foundation of Shandong Province of China (ZR2017QA010).
7. References
[1] J. Goldenberg, B. Libai, E. Muller, “Talk of the network: a complex systems look at the underlying process of word-of-mouth,” in Marketing Letters, 2001, pp. 211–223.
[2] J. Goldenberg, B. Libai, E. Muller, “Using complex systems analysis to advance marketing theory development,” in Academy of Marketing Science Review, 2001.
[3] M. Granovetter, “Threshold models of collective behavior,” in American Journal of Sociology, 1978, pp. 1420–1443.
[4] T. Schelling, “Micromotives and Macrobehavior,” in Norton, 1978.
[5] A. Borodin, Y. Filmus, and J. Oren, “Threshold models for competitive influence in social networks,” in WINE, 2010, pp. 539–550.
[6] W. Chen, C. Wang, and Y. Wang, “Scalable influence maximization for prevalent viral marketing in large-scale social networks,” in KDD, 2010, pp. 1029–1038.
[7] W. Chen, Y. Wang, and S. Yang, “Efficient influence maximization in social networks,” in KDD, 2009, pp. 199–208.
[8] W. Chen, Y. Yuan, and L. Zhang, “Scalable influence maximization in social networks under the linear threshold model,” in ICDM, 2010, pp. 88–97.
[9] A. Goyal, W. Lu and L. V. S. Lakshmanan, “CELF++: Optimizing the greedy algorithm for influence maximization in social networks,” in WWW, 2011, pp. 47–48.
[10] K. Jung, W. Heo and W. Chen, “IRIE: A scalable influence maximization algorithm for independent cascade model and its extensions,” in ICDM, 2012, pp. 1–20.
[11] D. Kempe, J. M. Kleinberg, and V. Tardos, “Maximizing the spread of influence through a social network,” in KDD, 2003, pp. 137–146.
[12] D. Kempe, J. M. Kleinberg, and V. Tardos, “Influential nodes in a diffusion model for social networks,” in ICALP, 2005, pp. 1127–1138.
[13] D. Kempe, J. M. Kleinberg, and V. Tardos, “Maximizing the spread of influence through a social network,” in Theory, 2015, pp. 105–147.
[14] J. Leskovec, A. Krause, C. Guestrin, C. Faloutsos, J. VanBriesen and N. Glance, “Cost-effective outbreak detection in networks,” in KDD, 2007, pp. 420–429.
[15] J. Leskovec and A. Krevl, “SNAP Datasets: Stanford large network dataset collection,” http://snap.stanford.edu/data, 2014.
[16] H. Y. Zhang, H. L. Zhang, A. Kuhnle, M. T. Thai, “Profit maximization for multiple products in online social networks,” in INFOCOM, 2016, pp. 1–9.
[17] C. Borgs, M. Brautbar, J. T. Chayes, and B. Lucier, “Maximizing social influence in nearly optimal time,” in SODA, 2014, pp. 946–957.
[18] M. Sviridenko, “A note on maximizing a submodular set function subject to a knapsack constraint,” in Operations Research Letters, 2004, pp. 41–43.
[19] Y. Tang, X. Xiao, and Y. Shi, “Influence maximization: near-optimal time complexity meets practical efficiency,” in SIGMOD, 2014, pp. 75–86.
[20] Y. Tang, Y. Shi, and X. Xiao, “Influence maximization in near-linear time: A martingale approach,” in SIGMOD, 2015, pp. 1539–1554.
[21] H. T. Nguyen, M. T. Thai, and T. N. Dinh, “Stop-and-stare: Optimal sampling algorithms for viral marketing in billion-scale networks,” in SIGMOD, 2016, pp. 695–710.
[22] H. Nguyen and R. Zheng, “On budgeted influence maximization in social networks,” in IEEE J. Sel. Area Comm. 31, 2013, pp. 1084–1094.
[23] N. Du, Y. Liang, M. F. Balcan and L. Song, “Budgeted influence maximization for multiple products,” arXiv:1312.2164.
[24] R. Motwani and P. Raghavan, Randomized Algorithms, Cambridge University Press, 1995, pp. 68–70.
[25] H. T. Nguyen, M. T. Thai, and T. N. Dinh, “A billion-scale approximation algorithm for maximizing benefit in viral marketing,” in IEEE/ACM Transactions on Networking, 2017, pp. 2419–2429.
[26] E. F. Moore, “The shortest path through a maze,” in the proceeding of Int. Symp. Switching Theory, 1959, pp. 285–292.
Appendix A. Proof of some conclusions

Proof of Lemma 2. For any seed set $S \in \Omega$, let $\mu_i = \sigma(S^{(i)})/(nq) = \mathbb{E}[F_R(S^{(i)})]$, which represents the probability that $S^{(i)}$ overlaps with a random RR set. Then $\theta \cdot F_R(S^{(i)})$ can be regarded as the sum of $\theta$ i.i.d. Bernoulli variables with mean $\mu_i$. Thus, we have

$$\Pr\left[\, |\hat{\rho}(S^{(i)}) - \rho(S^{(i)})| \ge \frac{\varepsilon}{2q} \cdot OPT \right] = \Pr\left[\, |\theta \cdot F_R(S^{(i)}) - \theta \cdot \mu_i| \ge \frac{\varepsilon \cdot OPT}{2nq^2 p_i \mu_i} \cdot \theta\mu_i \right]. \tag{A.1}$$

Let $\delta = \frac{\varepsilon \cdot OPT}{2nq^2 p_i \mu_i}$. By Chernoff bounds, inequality (2) and the fact that $\rho(S^{(i)}) = p_i \sigma(S^{(i)}) = p_i \cdot nq\mu_i \le OPT$, the following inequality holds for the right hand side (r.h.s.) of (A.1):

$$\begin{aligned}
\text{r.h.s. of (A.1)} &< 2\exp\left\{ -\frac{\delta^2}{2 + \delta} \cdot \theta\mu_i \right\} \\
&= 2\exp\left\{ -\frac{\varepsilon^2 \cdot OPT^2}{2nq^2 p_i (4nq^2 p_i \mu_i + \varepsilon \cdot OPT)} \cdot \theta \right\} \\
&\le 2\exp\left\{ -\frac{\varepsilon^2 \cdot OPT^2}{2nq^2 p_i (4q \cdot OPT + \varepsilon \cdot OPT)} \cdot \theta \right\} \\
&\le (nq)^{-l} (qk^*)^{-1} \Big/ \binom{nq}{k^*}.
\end{aligned} \tag{A.2}$$

Furthermore, we have

$$\begin{aligned}
\Pr\left[\, |\hat{\rho}(S) - \rho(S)| < \frac{\varepsilon}{2} \cdot OPT \right] &= \Pr\left[\, \left| \sum_{i=1}^{q} \big( \hat{\rho}(S^{(i)}) - \rho(S^{(i)}) \big) \right| < \frac{\varepsilon}{2} \cdot OPT \right] \\
&\ge \Pr\left[\, \sum_{i=1}^{q} \left| \hat{\rho}(S^{(i)}) - \rho(S^{(i)}) \right| < \frac{\varepsilon}{2} \cdot OPT \right] \\
&\ge \Pr\left[\, q \cdot \max_{1 \le i \le q} |\hat{\rho}(S^{(i)}) - \rho(S^{(i)})| < \frac{\varepsilon}{2} \cdot OPT \right] \\
&= \Pr\left[\, |\hat{\rho}(S^{(1)}) - \rho(S^{(1)})| < \frac{\varepsilon}{2q} \cdot OPT, \ldots, |\hat{\rho}(S^{(q)}) - \rho(S^{(q)})| < \frac{\varepsilon}{2q} \cdot OPT \right].
\end{aligned} \tag{A.3}$$

By the union bound and Equation (A.2), we have

$$\text{r.h.s. of (A.3)} \ge \sum_{i=1}^{q} \Pr\left[\, |\hat{\rho}(S^{(i)}) - \rho(S^{(i)})| < \frac{\varepsilon}{2q} \cdot OPT \right] - (q - 1) \ge 1 - (nq)^{-l} (k^*)^{-1} \Big/ \binom{nq}{k^*}.$$

Therefore, the lemma is proved.
Proof of Lemma 3. For any node set $S \subseteq \widetilde{V}$, adding nodes from $\widetilde{V} \setminus S$ into $S$ can never decrease $F_R(S^{(i)})$ ($i = 1, 2, \ldots, q$). Hence, $\hat{\rho}(S) = nq \sum_{i=1}^{q} p_i F_R(S^{(i)})$ is nondecreasing since $p_i > 0$. For any $S \subseteq T \subseteq \widetilde{V}$ and $y \in \widetilde{V} \setminus T$, we will prove the following inequality (A.4), which implies that $F_R(\cdot)$ is submodular:

$$F_R(S \cup \{y\}) - F_R(S) \ge F_R(T \cup \{y\}) - F_R(T). \tag{A.4}$$

Let $W_1$, $W_2$, $W_3$ and $W_4$ be the sets of RR sets in $R$ covered by $T \cup \{y\}$, $T$, $S \cup \{y\}$ and $S$, respectively. Then $W_1 \setminus W_2$ represents the set of RR sets which can be covered by $\{y\}$ but are not covered by $T$, and $W_3 \setminus W_4$ represents the set of RR sets which can be covered by $\{y\}$ but are not covered by $S$. Since $S \subseteq T$, we have $(W_1 \setminus W_2) \subseteq (W_3 \setminus W_4)$. By the definition of $F_R(\cdot)$, $F_R(S \cup \{y\}) - F_R(S)$ represents the proportion of RR sets in $R$ which can be covered by $\{y\}$ but are not covered by $S$. It follows that inequality (A.4) holds, according to the relationship between $(W_1 \setminus W_2)$ and $(W_3 \setminus W_4)$. Therefore, $\hat{\rho}(S) = nq \sum_{i=1}^{q} p_i F_R(S^{(i)})$ is submodular.

Proof of Theorem 1.
Let $S_A$ be the node set returned by Algorithm 3, and $S_p^*$ be the optimal solution of problem (4). As $S_A$ is obtained by a $(1 - 1/e)$-approximation algorithm for problem (4) [14], we have $\hat{\rho}(S_A) \ge (1 - 1/e)\,\hat{\rho}(S_p^*)$. Recall that $S^*$ is the optimal solution for the PM-$\widetilde{G}$ problem and $OPT = \rho(S^*)$; we have $\hat{\rho}(S_p^*) \ge \hat{\rho}(S^*)$, leading to $\hat{\rho}(S_A) \ge (1 - 1/e)\,\hat{\rho}(S^*)$.

According to Lemma 2, inequality (3) holds with at least $1 - (nq)^{-l} (k^*)^{-1} / \binom{nq}{k^*}$ probability for a given seed set $S \in \Omega$. By the assumption that $k^* \le \lfloor nq/2 \rfloor$, we can obtain that $|\Omega| \le \sum_{1 \le j \le k^*} \binom{nq}{j} \le k^* \cdot \binom{nq}{k^*}$. Then, by the union bound, inequality (3) holds simultaneously for all node sets belonging to $\Omega$ with at least $1 - (nq)^{-l}$ probability. In that case, we have

$$\begin{aligned}
\rho(S_A) &> \hat{\rho}(S_A) - \frac{\varepsilon}{2} \cdot OPT \\
&\ge \left(1 - \frac{1}{e}\right) \hat{\rho}(S^*) - \frac{\varepsilon}{2} \cdot OPT \\
&> \left(1 - \frac{1}{e}\right)\left(1 - \frac{\varepsilon}{2}\right) \cdot OPT - \frac{\varepsilon}{2} \cdot OPT \\
&> \left(1 - \frac{1}{e} - \varepsilon\right) \cdot OPT.
\end{aligned}$$

Thus, Theorem 1 is proved.
Proof of Lemma 5.
According to Lemma 7 and Lemma 8 in [19], we have thatPr (cid:20)
KPT ∗ ∈ (cid:20) KPT , σ ( S ∗ k ) (cid:21)(cid:21) ≥ − n − l (cid:48) , and Pr (cid:20) KPT + ∈ (cid:104) KPT ∗ , σ (cid:16) S ∗ k (cid:17)(cid:105) | KPT ∗ ∈ (cid:20) KPT , σ ( S ∗ k ) (cid:21)(cid:21) ≥ − n − l (cid:48) . Thus, Pr (cid:104)
KPT + ≤ σ ( S ∗ k ) (cid:105) ≥ Pr (cid:104) KPT + ∈ (cid:104) KPT ∗ , σ (cid:16) S ∗ k (cid:17)(cid:105)(cid:105) ≥ − n − l (cid:48) . Let λ (cid:48) = (8 + ε (cid:48) ) n ( l (cid:48) log n + log 2 + log (cid:16) nk (cid:17) ) · ( ε (cid:48) ) − and θ (cid:48) = λ (cid:48) / KPT + , then Pr (cid:104) θ (cid:48) ≥ λ (cid:48) /σ ( S ∗ k ) (cid:105) ≥ − n − l (cid:48) .For any size- k node set S k , suppose that θ (cid:48) satisfies θ (cid:48) ≥ λ (cid:48) /σ ( S ∗ k ), then | ˆ σ ( S k ) − σ ( S k ) | < ( ε (cid:48) / · σ ( S ∗ k ) holds with at least 1 − n − l (cid:48) / (cid:16) nk (cid:17) probability.It follows that when θ (cid:48) satisfies θ (cid:48) ≥ λ (cid:48) /σ ( S ∗ k ),Pr (cid:34) ˆ σ ( S k ) < (1 + ε (cid:48) σ ( S ∗ k ) (cid:35) ≥ − n − l (cid:48) / (cid:32) nk (cid:33) and Pr (cid:34) ˆ σ ( S k ) > (1 − e )(1 − ε (cid:48) σ ( S ∗ k ) (cid:35) ≥ − n − l (cid:48) / (cid:32) nk (cid:33) . Thus, Pr (cid:34) (1 − e )(1 − ε (cid:48) σ ( S ∗ k ) < ˆ σ ( S k ) < (1 + ε (cid:48) σ ( S ∗ k ) | θ (cid:48) ≥ λ (cid:48) σ ( S ∗ k ) (cid:35) ≥ − n − l (cid:48) / (cid:16) nk (cid:17) . Therefore, we havePr (cid:104) (1 − / e )(1 − ε (cid:48) / σ ( S ∗ k ) < ˆ σ ( S k ) < (1 + ε (cid:48) / σ ( S ∗ k ) (cid:105) ≥ (1 − n − l (cid:48) )(1 − n − l (cid:48) / (cid:16) nk (cid:17) ) > − n − l (cid:48) . Proof of Theorem 2.
Recall that the optimal solution of the PM- (cid:101) G problem is S ∗ , denote S ∗ = S (1) ∪ ¯ S (2) ∪ · · · ∪ ¯ S ( q ) . Then, we obtain that OPT = ρ ( S ∗ ) = q (cid:88) i = ρ ( ¯ S ( i ) ) = q (cid:88) i = p i · σ ( ¯ S ( i ) ) . By the definition of the optimal solution, we have ¯ S ( i ) is the optimal solution of the | ¯ S ( i ) | -sizeIM problem in G ( i ) . For the k i -size IM problem in G ( i ) ( i = , , . . . , q ), let S k i and ˆ σ ( S k i ) be the(1 − / e − ε (cid:48) )-approximate solution and the corresponding value returned by TIM + , respectively.Let S ∗ k i be the optimal solution, and σ ( S ∗ k i ) be its expected spread. Based on Lemma 5, we havePr (cid:104) (1 − / e )(1 − ε (cid:48) / σ ( S ∗ k i ) < ˆ σ ( S k i ) < (1 + ε (cid:48) / σ ( S ∗ k i ) (cid:105) > − n − l (cid:48) . It follows directly that | S ∗ k i | ≥ | ¯ S ( i ) | by the definitions of k i and ¯ S ( i ) . Recall that S ∗ k i and ¯ S ( i ) are the optimal solutions of the k i -size and | ¯ S ( i ) | -size influence maximization problem in G ( i ) ,respectively. Then we have σ ( S ∗ k i ) ≥ σ ( ¯ S ( i ) ) since σ ( · ) is nondecreasing.Let u i = p i · ˆ σ ( S k i ) / (1 + ε (cid:48) /
2) and u ∗ = max ≤ i ≤ q { u i } , thenPr (cid:34) (1 − / e )(1 − ε (cid:48) / + ε (cid:48) / ρ ( S ∗ k i ) < u i ≤ OPT (cid:35) > − n − l (cid:48) . (A.5)Therefore, (A.5) holds simultaneously for all i = , , . . . , q with at least 1 − qn − l (cid:48) probability,which means that OPT ≤ q (cid:88) i = p i · σ ( S ∗ k i ) = q (cid:88) i = ρ ( S ∗ k i ) < (1 + ε (cid:48) / q (1 − / e )(1 − ε (cid:48) / u ∗ , and u ∗ ≤ OPT hold simultaneously with at least 1 − qn − l (cid:48) probability. In conclusion, u ∗ ∈ (cid:104) (1 − / e )(1 − ε (cid:48) / + ε (cid:48) / q OPT , OPT (cid:105) with at least 1 − qn − l (cid:48) probability. TIM + runs in O (cid:0) ( k i + l (cid:48) )( m + n ) log n / ( ε (cid:48) ) (cid:1) expected time, where k i = (cid:98) B / c i (cid:99) is the numberbudget of the seed set [19]. Therefore, Algorithm 4 runs in O (cid:0) ( k ∗ + l (cid:48) )( m + n ) q log n / ( ε (cid:48) ) (cid:1) ex-pected time. Appendix B. Brief Introduction of the
TIM+ Algorithm

In this section, we give an outline of the TIM+ algorithm. The TIM+ algorithm, based on the RIS technique, consists of two phases. The first phase, called parameter estimation, derives an estimate $KPT^+$ of the optimum and uses it to compute $\theta'$, the number of RR sets that need to be generated. The second phase, called node selection, samples $\theta'$ RR sets from $G$ and applies the greedy algorithm to derive a size-$k$ node set $S_k$ covering a large number of RR sets. We put all the algorithms of the entire process together in Algorithm 6.

Algorithm 6 TIM+
Input: Graph $G$, a positive integer $k$, and error parameters $\bar{\varepsilon}, \varepsilon' > 0$.
Output: A value $\hat{\sigma}(S_k)$, where $S_k$ is a $(1 - 1/e - \varepsilon')$-approximate solution of the $k$-size IM problem, with at least $1 - n^{-l'}$ probability.
for $i$ from 1 to $\log_2 n - 1$ do
    Let $c_i = (6 l \log n + 6 \log(\log_2 n)) \cdot 2^i$;
    Let $sum = 0$;
    for $j$ from 1 to $c_i$ do
        Generate a random RR set $R$;
        $sum = sum + \kappa(R)$;
    if $sum / c_i > 1/2^i$ then
        return $KPT^* = n \cdot sum / (2 \cdot c_i)$.
return $KPT^* = 1$;
Let $R'$ be the set of all RR sets generated in the last iteration of the above loop;
Initialize $S'_k = \emptyset$;
for $i$ from 1 to $k$ do
    Identify the node $v_i$ that covers the most RR sets in $R'$;
    Add $v_i$ into $S'_k$;
    Remove from $R'$ all RR sets that are covered by $v_i$;
Let $\bar{\lambda} = (2 + \bar{\varepsilon})\, l' n \log n \cdot (\bar{\varepsilon})^{-2}$;
Let $\bar{\theta} = \bar{\lambda} / KPT^*$;
Initialize a set $R'' = \emptyset$;
Generate $\bar{\theta}$ random RR sets and put them into $R''$;
Let $\bar{f}$ be the fraction of the RR sets in $R''$ that are covered by $S'_k$;
Let $KPT' = \bar{f} \cdot n / (1 + \bar{\varepsilon})$;
$KPT^+ = \max\{KPT', KPT^*\}$;
Let $\lambda' = (8 + 2\varepsilon')\, n \cdot \big(l' \log n + \log 2 + \log \binom{n}{k}\big) \cdot (\varepsilon')^{-2}$;
Let $\theta' = \lambda' / KPT^+$;
Initialize a set $R^* = \emptyset$;
Generate $\theta'$ random RR sets and insert them into $R^*$;
Initialize a node set $S_k = \emptyset$;
for $i$ from 1 to $k$ do
    Identify the node $v_i$ that covers the most RR sets in $R^*$;
    Add $v_i$ into $S_k$;
    Remove from $R^*$ all RR sets that are covered by $v_i$;
Let $f$ be the fraction of the RR sets in $R^*$ that are covered by $S_k$;
return $\hat{\sigma}(S_k) = n \cdot f$.
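The node selection loop at the end of Algorithm 6 is a greedy maximum coverage over the sampled RR sets. The sketch below illustrates that loop in Python under assumptions: the RR sets are taken as given (their generation and the parameter estimation phase are omitted), ties are broken by the smallest node id, and the function names are hypothetical.

```python
from typing import Dict, List, Sequence, Set, Tuple


def greedy_max_coverage(rr_sets: Sequence[Set[int]], k: int) -> Tuple[List[int], float]:
    """Pick up to k nodes greedily; return (seeds, fraction of RR sets covered)."""
    remaining = [set(rr) for rr in rr_sets]  # RR sets not yet covered
    total = len(remaining)
    seeds: List[int] = []
    for _ in range(k):
        counts: Dict[int, int] = {}
        for rr in remaining:
            for v in rr:
                counts[v] = counts.get(v, 0) + 1
        if not counts:
            break  # every non-empty RR set is already covered
        # Node covering the most remaining RR sets (smallest id on ties).
        best = max(counts, key=lambda v: (counts[v], -v))
        seeds.append(best)
        remaining = [rr for rr in remaining if best not in rr]
    covered = total - len(remaining)
    frac = covered / total if total else 0.0
    return seeds, frac


def estimate_spread(n: int, frac: float) -> float:
    """sigma_hat(S_k) = n * f, with f the covered fraction of RR sets."""
    return n * frac
```

This version recounts node frequencies in every round for clarity; an implementation aiming at the stated time complexity would maintain the counts incrementally instead.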