A Random Algorithm for Profit Maximization with Multiple Adoptions in Online Social Networks

Abstract

Online social networks have become one of the most effective platforms for marketing and advertising. Through "word of mouth" effects, information or product adoption can spread from a few influential individuals to millions of users in social networks. Given a social network $G$ and a constant $k$, the influence maximization problem seeks $k$ nodes in $G$ that can influence the largest number of nodes. This problem has found important applications, and a large body of work has been devoted to identifying the few most influential users. However, most existing works focus only on the diffusion of a single idea or product, whereas in reality one company may produce multiple kinds of products and one user may adopt several of them. Given multiple products with different activation costs and profits, it is crucial for the company to distribute the limited budget among the products in order to maximize profit. The Profit Maximization with Multiple Adoptions (PM$^{2}$A) problem seeks a seed set within the budget that maximizes the overall profit. In this paper, a Randomized Modified Greedy (RMG) algorithm based on the Reverse Influence Sampling (RIS) technique is presented for the PM$^{2}$A problem, which achieves a $(1-1/e-\varepsilon)$-approximate solution with high probability. Compared with the algorithm proposed in [16], which achieves a $\frac{1}{2}(1-1/e^{2})$-approximate solution, our algorithm provides a better performance ratio, which is also the best performance ratio for the PM$^{2}$A problem. Comprehensive experiments on three real-world social networks are conducted, and the results demonstrate that our RMG algorithm outperforms the algorithm proposed in [16] and other heuristics in terms of profit maximization, and allocates the budget better.


Tiantian Chen$^{a}$, Bin Liu$^{a,*}$, Wenjing Liu$^{a}$, Qizhi Fang$^{a}$, Jing Yuan$^{b}$, Weili Wu$^{b}$

$^{a}$ School of Mathematical Sciences, Ocean University of China, Qingdao 266100, China
$^{b}$ Department of Computer Science, The University of Texas at Dallas, Richardson, TX 75080, USA


Keywords:

Profit maximization, Social network, Approximation algorithm, Sampling

1. Introduction

With the increasing popularity of online social and information networks such as Facebook, Twitter, and LinkedIn, many researchers have studied diffusion phenomena in social networks, including the diffusion of news, ideas, and innovations and the adoption of new products. Such diffusion is driven by influence propagation throughout the social network. One topic that has been extensively studied is the Influence Maximization (IM) problem [5, 6, 7, 11]. The goal of the IM problem is to find a small subset of influential nodes that can attract the largest number of members in a social network, according to an influence propagation model. The IM problem has been mainly studied under two classical influence propagation models: the Independent Cascade (IC) model [1, 2] and the Linear Threshold (LT) model [3, 4]. Both the IC model and the LT model are probabilistic models that characterize how influence propagates in the social network starting from an initial set of seed nodes.

$^{*}$ Corresponding author. Email address: [email protected] (Bin Liu)
Preprint submitted to Elsevier, January 18, 2021 (arXiv, cs.SI)

The objective functions of IM-related problems are usually complicated to compute due to the randomness of the probabilistic diffusion models. In fact, computing the expected influence of a given seed set is #P-hard, and the pursuit of efficiency has motivated a large amount of research on the IM problem in the past decade [7, 8, 9, 10, 11, 12, 14]. However, most of these methods either trade performance guarantees for practical efficiency, or vice versa. There are exceptions such as TIM/TIM+ [19] and IMM [20], which are scalable methods with performance guarantees for the IM problem. They utilize the Reverse Influence Sampling (RIS) technique introduced by Borgs et al. [17] and obtain $(1-1/e-\varepsilon)$-approximate solutions with high probability.

Most existing works focus on the IM problem with a single diffusion, i.e., only one product is considered. In reality, however, one company may produce multiple products and people may purchase multiple kinds of products at one time. For example, Apple sells the iPhone, iPad, MacBook, etc. Many people have an iPhone and an iPad at the same time, and plenty of people own both a laptop and a desktop. Therefore, given a limited budget and several different items with different activation costs and profits, a crucial question is how to allocate the budget to maximize the overall profit. Recently, Zhang et al. [16] formulated the Profit Maximization with Multiple Adoptions (PM$^{2}$A) problem, which seeks a seed set within the limited budget that influences as many customers as possible and thereby maximizes profit. They presented a $\frac{1}{2}(1-1/e^{2})$-approximation algorithm. Moreover, they proposed another algorithm, called PMIS, and stated that PMIS could produce a solution within a factor of $\alpha\cdot(1-1/e)$, where $\alpha$ may be made arbitrarily close to 1.

The RMG algorithm presented in this paper, which is based on the RIS technique, returns a $(1-1/e-\varepsilon)$-approximate solution with probability at least $1-(nq)^{-l}-qn^{-l'}$, where $q$ is the number of products, $n$ is the number of nodes in the social network, and $\varepsilon, l, l' > 0$. The RMG algorithm can be implemented with tunable parameters, which makes it flexible in balancing running time against accuracy. Experimental results show that the RMG algorithm not only produces high-quality seed sets but also takes much less time than the greedy algorithm with Monte-Carlo simulations.

The contributions of this paper are summarized as follows:

• We present a Randomized Modified Greedy (RMG) algorithm for the PM$^{2}$A problem that achieves a $(1-1/e-\varepsilon)$-approximation ratio with high probability, which significantly improves upon prior works in terms of performance guarantee and is also the best performance ratio for the PM$^{2}$A problem even for one product.
• We extend the RIS technique to accommodate profit estimation over multiple products, and an OPT estimation scheme is designed to speed up the sampling process, where OPT is the optimum of the PM$^{2}$A problem.
• We conduct comprehensive experiments on real-world social networks that verify the superiority of RMG in providing an effective budget distribution over multiple products.

Related works.

The Influence Maximization (IM) problem has been studied intensively in the past decade. Kempe et al. [11] first formulated IM as a combinatorial optimization problem and presented a general greedy algorithm that yields a $(1-1/e-\varepsilon)$-approximation for all diffusion models considered. Later, a large body of work aimed to improve the efficiency and scalability of seed selection algorithms. However, these methods were either heuristics without performance guarantees [6, 7] or prohibitively slow on billion-scale networks [9, 14]. Borgs et al. [17] made a theoretical breakthrough with the RIS technique, which guarantees a $(1-1/e-\varepsilon)$-approximation and significantly reduces the expected running time. Subsequently, [19, 20, 21] proposed algorithms that are very efficient even on large networks with millions of nodes and billions of edges.

Recently, research on influence or profit maximization with a limited budget has also emerged. Nguyen et al. [22] consider the budgeted influence maximization problem, in which each node can have an arbitrary cost, but they focus on the diffusion of a single kind of product, which differs from our work. Multiple products are considered in [23], where separate budget constraints are imposed on different products and their diffusions are separated as well. In the PM$^{2}$A problem, by contrast, all of the products share an overall activation budget. The PM$^{2}$A problem was proposed by Zhang et al. [16], who discussed a $\frac{1}{2}(1-1/e^{2})$-approximation algorithm under the Multiple Thresholds (MT) model, an extension of the LT model. They proposed another algorithm, called PMIS, and stated that PMIS could produce a solution within a factor of $\alpha\cdot(1-1/e)$, where $\alpha$ may be made arbitrarily close to 1. However, $\alpha$ is obtained by using CPLEX to solve the Multiple-Choice Knapsack problem. The work in [25] investigates the cost-aware targeted viral marketing problem, in which only one product spreads in the network and each node has its own selection cost and benefit; the aim is to find a seed set with total cost no more than a budget that maximizes the expected total profit. The PM$^{2}$A problem is actually a special case of the problem considered in [25], but only a $(1-1/\sqrt{e}-\varepsilon)$-approximation algorithm is proposed there. In this paper, we present a $(1-1/e-\varepsilon)$-approximation algorithm based on the RIS technique under the IC model, which significantly improves the solution of the PM$^{2}$A problem.

Organization. The rest of the paper is organized as follows. In Section 2, we introduce the diffusion model and the definition of the PM$^{2}$A problem. Key ideas for solving the PM$^{2}$A problem and the framework of the algorithm are presented in Section 3. Section 4 is dedicated to the RMG algorithm along with the analysis of its performance ratio and time complexity. Section 5 shows our experimental results and Section 6 concludes the paper.

2. Problem Formulation

This work aims to design a marketing strategy for allocating the budget among multiple products in a social network. In this section, we present the diffusion model and give the definition of the Profit Maximization with Multiple Adoptions (PM$^{2}$A) problem.

A social network is usually represented as a digraph $G = (V, E)$ with nodes in $V$ representing users and edges in $E$ representing relationships between users, where $|V| = n$, $|E| = m$. Assume that each directed edge $e$ in $G$ is associated with a propagation probability $p(e) \in [0, 1]$.

2.1. Diffusion Model

There are many diffusion models studied in the literature. The diffusion model considered in this paper is the Independent Cascade (IC) model, investigated in the context of marketing by Goldenberg et al. [1, 2]. Given a social network $G$, the IC model considers a timestamped influence propagation process as follows:

1. At timestamp 1, we activate a selected node set $S \subseteq V$, and set all of the remaining nodes inactive.
2. If a node $v$ is first activated at timestamp $t$, then for each directed edge $e$ pointing from $v$ to an inactive node $u$, $v$ has a probability $p(e)$ to activate $u$ at timestamp $t+1$. After timestamp $t+1$, $v$ has no chance to activate any node.
3. Once a node becomes active, it remains active in the following timestamps.

Let $I(S)$ be the number of nodes that are activated when the above diffusion process terminates. Refer to $S$ as the seed set, and $I(S)$ as the spread of $S$. Let $\sigma(S)$ be the expected spread of $S$, that is, $\sigma(S) = E[I(S)]$.

In reality, one company may produce multiple different products and one user may purchase multiple kinds of products at one time. Therefore, given a limited budget and several different items with different activation costs and profits, it is crucial for the company to wisely allocate the budget to maximize the overall profit. This problem is called the Profit Maximization with Multiple Adoptions (PM$^{2}$A) problem, which was introduced by Zhang et al. [16].
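The IC diffusion process of Section 2.1 can be illustrated with a short Monte-Carlo sketch. This is only an illustration under stated assumptions, not code from the paper: the adjacency format `out_edges` and the function names `simulate_ic` and `estimate_spread` are hypothetical, and the spread $\sigma(S)$ is approximated by averaging $I(S)$ over repeated cascades.

```python
import random


def simulate_ic(out_edges, seeds, rng):
    """Run one IC cascade and return the set of activated nodes.

    out_edges (assumed format) maps a node to a list of
    (neighbor, probability) pairs, one per outgoing edge.
    """
    active = set(seeds)
    frontier = list(seeds)  # nodes first activated at the current timestamp
    while frontier:
        next_frontier = []
        for v in frontier:
            for u, p in out_edges.get(v, []):
                # v gets exactly one chance to activate each inactive neighbor u
                if u not in active and rng.random() < p:
                    active.add(u)
                    next_frontier.append(u)
        frontier = next_frontier
    return active


def estimate_spread(out_edges, seeds, runs=1000, seed=0):
    """Monte-Carlo estimate of the expected spread sigma(S)."""
    rng = random.Random(seed)
    total = sum(len(simulate_ic(out_edges, seeds, rng)) for _ in range(runs))
    return total / runs
```

Each activated node is placed in the frontier exactly once, which matches rule 2 above: a node may try to activate its out-neighbors only in the timestamp right after its own activation.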

Definition 1 (Profit Maximization with Multiple Adoptions (PM$^{2}$A)). Given a social network $G = (V, E)$ with propagation probability $p : E \to [0, 1]$, suppose there are $q$ different kinds of products spreading independently in $G$ and each node can adopt multiple products. For $i = 1, 2, \ldots, q$, let $c_i$ be the cost of initially activating a node to adopt product $i$, and $p_i$ be the profit obtained when a node is activated to adopt product $i$, where $c_i, p_i > 0$. The PM$^{2}$A problem asks to identify a seed set for each product, with overall activation cost at most a given budget $B$, such that the expected total profit is maximized.

Obviously, the Influence Maximization (IM) problem is a special case of the PM$^{2}$A problem when $q = 1$. Since the IM problem is NP-hard for the IC model [11] and cannot be approximated better than $1 - 1/e$ unless P = NP [13], we have the following result.

Claim 1.

The PM$^{2}$A problem is NP-hard for the IC model, and for any $\varepsilon > 0$, it cannot be approximated in polynomial time within a ratio of $(1 - 1/e + \varepsilon)$ unless P = NP.

Since each node may adopt multiple products, the PM$^{2}$A problem can be characterized as $q$ independent diffusion processes under a common budget constraint in $G$. The most crucial point here is how to allocate the budget among the multiple products. To address this issue, we shall give another version of the PM$^{2}$A problem in the next section.

3. Key Ideas for Solving the PM$^{2}$A Problem

In this section, we transform the PM$^{2}$A problem into an equivalent problem, called PM-$\widetilde{G}$, and present the framework for solving the PM-$\widetilde{G}$ problem.

3.1. Reformulation of the Problem

Definition 2 ($q$-component copy graph $\widetilde{G}$). Given a social network $G = (V, E)$ with propagation probability $p : E \to [0, 1]$, let $\widetilde{G} = (\widetilde{V}, \widetilde{E}) = G^{(1)} \cup G^{(2)} \cup \cdots \cup G^{(q)}$ be a graph composed of $q$ components, each of which (denoted by $G^{(i)} = (V^{(i)}, E^{(i)})$) is a copy of $G$. For any node $u \in V$, let $u^{(i)}$ be the copy node of $u$ in $G^{(i)}$. Copy nodes $u^{(i)}$ and $u^{(j)}$ of $u$ in different components $G^{(i)}$ and $G^{(j)}$, $i \neq j$, are regarded as different nodes in $\widetilde{G}$. Then any node set $S \subseteq \widetilde{V}$ can be described as $S = S^{(1)} \cup S^{(2)} \cup \cdots \cup S^{(q)}$, where $S^{(i)} \subseteq V^{(i)}$, $i = 1, 2, \ldots, q$.

Figure 1: An example of a $q$-component copy graph $\widetilde{G}$.

Thus, $q$ different products spreading in $G$ are transformed into a single product spreading in $\widetilde{G}$, subject to the condition that both the activation cost and the profit differ across components:

• Cost in $G^{(i)}$: the cost of initially activating a node in component $G^{(i)}$ to adopt the product is $c_i$.
• Profit in $G^{(i)}$: the profit when a node in component $G^{(i)}$ is activated to adopt the product is $p_i$.

Since different components are not connected to each other, for a seed set $S = S^{(1)} \cup S^{(2)} \cup \cdots \cup S^{(q)}$, $S^{(i)}$ can only influence the nodes in $G^{(i)}$. For $i = 1, 2, \ldots, q$, denote by $\sigma(S^{(i)})$ the expected spread of $S^{(i)}$, and by $\rho(S^{(i)})$ the expected profit gained by initially activating the nodes in $S^{(i)}$, i.e., $\rho(S^{(i)}) = p_i \cdot \sigma(S^{(i)})$. Let $\rho(S)$ be the expected total profit gained by initially activating all the nodes in $S$; then

$$\rho(S) = \sum_{i=1}^{q} p_i \cdot \sigma(S^{(i)}).$$

Let $c(S)$ be the activation cost of $S$, that is, $c(S) = \sum_{i=1}^{q} c_i |S^{(i)}|$.

Definition 3 (Profit Maximization on $\widetilde{G}$ (PM-$\widetilde{G}$)). The PM-$\widetilde{G}$ problem asks for a seed set $S \subseteq \widetilde{V}$ with activation cost at most $B$ such that the expected total profit is maximized:

$$\max\ \rho(S) = \sum_{i=1}^{q} p_i \cdot \sigma(S^{(i)}) \quad \text{s.t.} \quad \sum_{i=1}^{q} c_i |S^{(i)}| \le B. \qquad (1)$$

It is easy to see that the PM$^{2}$A problem is equivalent to the PM-$\widetilde{G}$ problem.

Let $\Omega = \{S \subseteq \widetilde{V} \mid c(S) \le B\}$, that is, $\Omega$ is the feasible set of the PM-$\widetilde{G}$ problem. Let $S^{*}$ be the optimal solution of the PM-$\widetilde{G}$ problem, and $OPT = \rho(S^{*})$ be the optimum.

Since $\rho(S) = \sum_{i=1}^{q} p_i \cdot \sigma(S^{(i)})$ and $\sigma(S^{(i)})$ has been proved to be nondecreasing as well as submodular [11], the following result holds.

Proposition 1. The profit function $\rho(\cdot)$ in the PM-$\widetilde{G}$ problem is nonnegative, nondecreasing and submodular.

Given products with different costs and profits, and a limited budget, one may resort to algorithms for the classical knapsack problem, such as the greedy algorithm and dynamic programming. However, those methods cannot perform well here, since the profit function $\rho(\cdot)$ in the PM-$\widetilde{G}$ problem is submodular rather than linear: selecting any node from the candidate node set may change the marginal gain of choosing the next seed, unlike the static weight associated with each item in the knapsack problem. These facts make the PM-$\widetilde{G}$ problem difficult.

For the problem of maximizing a nonnegative, nondecreasing submodular set function $f(\cdot)$ subject to a knapsack constraint, Sviridenko [18] proposed a modified greedy algorithm which guarantees a $(1 - 1/e)$-approximation ratio and is based on a value oracle model for $f(\cdot)$. That is, for a given set $S$, the algorithm can query an oracle to find its value $f(S)$. Our task in this work, however, is to solve the PM-$\widetilde{G}$ problem without using the value oracle model, and it is accompanied by the difficulty of computing $\rho(\cdot)$, because the computation of $\sigma(\cdot)$ has been shown to be #P-hard.

3.2. Framework for Solving the PM-$\widetilde{G}$ Problem

Before we give the algorithm for solving the PM-$\widetilde{G}$ problem in detail, we first describe the main idea and framework of the algorithm.

To tackle the intractability of computing $\rho(\cdot)$, we obtain an estimate $\hat{\rho}(\cdot)$ of $\rho(\cdot)$ that has a small error with high probability and can be computed in polynomial time. We then substitute $\hat{\rho}(S)$ for $\rho(S)$, which translates the original PM-$\widetilde{G}$ problem into maximizing $\hat{\rho}(S)$ under the budget constraint. Using the modified greedy algorithm [18], a $(1 - 1/e)$-approximate solution $S_A$ for the problem $\max_{S \in \Omega} \hat{\rho}(S)$ is obtained, which can be proved to be a $(1 - 1/e - \varepsilon)$-approximate solution for the PM-$\widetilde{G}$ problem with high probability.

In the estimation of $\rho(S)$, we utilize the Reverse Influence Sampling (RIS) technique introduced by Borgs et al. [17], which significantly improves the time complexity of algorithms for the IM problem.

In summary, the PM-$\widetilde{G}$ problem can be solved by the following steps.

• Estimate $\rho(S)$: Use the RIS technique to obtain an estimate $\hat{\rho}(\cdot)$ of $\rho(\cdot)$ such that for any $S \in \Omega$, $|\hat{\rho}(S) - \rho(S)| < \varepsilon \cdot OPT$ holds with high probability, where $0 < \varepsilon < 1$.
• Solve problem $\max_{S \in \Omega} \hat{\rho}(S)$: Prove that $\hat{\rho}(S)$ is nonnegative, nondecreasing and submodular, then use the Modified Greedy algorithm to solve the problem $\max_{S \in \Omega} \hat{\rho}(S)$. Let $S_A$ be the solution returned by the algorithm; then we can show that $S_A$ is a $(1 - 1/e - \varepsilon)$-approximate solution for the PM-$\widetilde{G}$ problem with high probability (w.h.p.).

The framework for solving the PM-$\widetilde{G}$ problem is shown in Fig. 2.

Figure 2: Overview of algorithms.
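The reformulation of Section 3.1 can be made concrete with a small sketch. It is illustrative only and rests on assumptions that are not in the paper: a copy node $u^{(i)}$ is represented as the pair `(u, i)`, edges are given as a dictionary from `(u, v)` to $p(e)$, and the helper names (`build_copy_graph`, `split_by_component`, `activation_cost`, `is_feasible`) are hypothetical.

```python
def build_copy_graph(nodes, edges, q):
    """Return (copy_nodes, copy_edges) for the q-component copy graph.

    nodes: iterable of node ids of G.
    edges: dict mapping (u, v) to the propagation probability p(e).
    The copy of node u in component i is the pair (u, i), i = 1..q.
    """
    nodes = list(nodes)
    copy_nodes = [(u, i) for i in range(1, q + 1) for u in nodes]
    # Edges are duplicated inside each component; no edge crosses components.
    copy_edges = {((u, i), (v, i)): p
                  for i in range(1, q + 1)
                  for (u, v), p in edges.items()}
    return copy_nodes, copy_edges


def split_by_component(seed_set, q):
    """Partition S into S^(1), ..., S^(q)."""
    parts = {i: set() for i in range(1, q + 1)}
    for (u, i) in seed_set:
        parts[i].add((u, i))
    return parts


def activation_cost(seed_set, cost):
    """c(S) = sum_i c_i * |S^(i)|, with cost[i] = c_i."""
    return sum(cost[i] for (_, i) in seed_set)


def is_feasible(seed_set, cost, budget):
    """Membership test for the feasible set Omega."""
    return activation_cost(seed_set, cost) <= budget
```

Representing copy nodes as pairs keeps the component index available wherever a cost $c_i$ or profit $p_i$ has to be looked up, which is the only place where the components differ.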

4. Algorithm and Its Analysis

In this section, the Randomized Modified Greedy (RMG) algorithm for the PM-$\widetilde{G}$ problem is presented, which achieves a $(1 - 1/e - \varepsilon)$-approximation ratio with high probability. Before introducing the algorithm, we list some notation for convenience. Let $p_{\min} = \min_{1 \le i \le q}\{p_i\}$, $p_{\max} = \max_{1 \le i \le q}\{p_i\}$, $c_{\min} = \min_{1 \le i \le q}\{c_i\}$ and $k_i = \lfloor B / c_i \rfloor$. Let $k^{*} = \lfloor B / c_{\min} \rfloor$, i.e., the maximum number of seed nodes that can be chosen. As the budget $B$ is often limited, the size of the seed set cannot be too large. Thus in the following, we assume that $k^{*} \le \lfloor nq/2 \rfloor$.

4.1. Estimation of $\rho(S)$

We are now in a position to give the estimation of $\rho(S)$. In this work, the Reverse Influence Sampling (RIS) technique is used, which captures the influence landscape of the social network by generating a set of random Reverse Reachable (RR) sets [19].

Random Reverse Reachable (RR) set. Given a social network $\widetilde{G} = (\widetilde{V}, \widetilde{E})$ with propagation probability $p : E \to [0, 1]$, a random RR set $R$ is generated by 1) constructing a graph $g$ from $\widetilde{G}$ by removing each edge $e$ in $\widetilde{G}$ with probability $1 - p(e)$; 2) selecting a node $v$ from $g$ uniformly at random; 3) returning $R$ as the set of nodes that can reach $v$ in $g$.

Figure 3: An example of generating random RR sets under the IC model. Three random RR sets $R_1$, $R_2$ and $R_3$ are generated for three nodes $c^{(1)}$, $a^{(1)}$ and $d^{(1)}$, respectively.

Intuitively, if a node $u$ appears in an RR set generated for another node $v$, then $u$ can reach $v$ through a certain path in $\widetilde{G}$. Thus, a propagation process from a seed set containing $u$ should have a certain probability of activating $v$. The result of [17] attests to this observation.

Using the reverse Breadth First Search (BFS) algorithm in [26], we can generate a set of random RR sets $\mathcal{R} = \{R_1, R_2, \ldots, R_\theta\}$. Given a node set $S \subseteq \widetilde{V}$, we say that $S$ covers an RR set $R_j$ if and only if $S \cap R_j \neq \emptyset$. Define $F_{\mathcal{R}}(S)$ as the fraction of RR sets in $\mathcal{R}$ covered by $S$, that is,

$$F_{\mathcal{R}}(S) = \frac{|\{R_j \in \mathcal{R},\ 1 \le j \le \theta \mid S \cap R_j \neq \emptyset\}|}{\theta}.$$

Recall that $S = S^{(1)} \cup S^{(2)} \cup \cdots \cup S^{(q)} \subseteq \widetilde{V}$; it is clear that $F_{\mathcal{R}}(S) = \sum_{i=1}^{q} F_{\mathcal{R}}(S^{(i)})$, because every RR set lies within a single component. Based on the results of Tang et al. [19], for any $S^{(i)} \subseteq V^{(i)}$, $i = 1, 2, \ldots, q$, the expected value of $nq \cdot F_{\mathcal{R}}(S^{(i)})$ equals the expected spread of $S^{(i)}$ in $\widetilde{G}$. This implies the following lemma.

Lemma 1. For any node set $S^{(i)} \subseteq V^{(i)}$, $E[nq \cdot F_{\mathcal{R}}(S^{(i)})] = \sigma(S^{(i)})$, $i = 1, 2, \ldots, q$.

Denote $\hat{\sigma}(S^{(i)}) = nq \cdot F_{\mathcal{R}}(S^{(i)})$. Then according to Lemma 1, $\hat{\sigma}(S^{(i)})$ is an unbiased estimate of $\sigma(S^{(i)})$. Define $\hat{\rho}(S^{(i)}) = p_i \hat{\sigma}(S^{(i)})$ and for any $S \subseteq \widetilde{V}$, let $\hat{\rho}(S) = \sum_{i=1}^{q} \hat{\rho}(S^{(i)}) = \sum_{i=1}^{q} p_i \hat{\sigma}(S^{(i)})$. Obviously, $\hat{\rho}(S^{(i)})$ and $\hat{\rho}(S)$ are unbiased estimates of $\rho(S^{(i)})$ and $\rho(S)$, respectively.

Corollary 1. For any $S \subseteq \widetilde{V}$, $E[\hat{\rho}(S^{(i)})] = \rho(S^{(i)})$ $(i = 1, 2, \ldots, q)$, and $E[\hat{\rho}(S)] = \rho(S)$.

For any $S \in \Omega$, we use the following Algorithms 1 and 2 to obtain the value of $\hat{\rho}(S)$. In Algorithm 1, we generate a set of $\theta$ random RR sets, denoted by $\mathcal{R}$. In Algorithm 2, we first identify the nodes in $S$ and partition them into $S^{(1)}, S^{(2)}, \ldots, S^{(q)}$, where $S^{(i)} \subseteq V^{(i)}$ (Lines 2-6), then compute the fraction of RR sets in $\mathcal{R}$ covered by $S^{(i)}$, denoted by $F_{\mathcal{R}}(S^{(i)})$ (Lines 7-10). Summing up all the $q$ items, we obtain an estimate $\hat{\rho}(S)$.

By Chernoff bounds [24], we show that for any $S \in \Omega$, the result obtained by Algorithm 2 is an accurate estimate of $\rho(S)$ with high probability when $\theta$ is sufficiently large.

Algorithm 1 RR Sets Generation
Input: Graph $\widetilde{G}$ and a positive integer $\theta$.
Output: A set of $\theta$ random RR sets $\mathcal{R}$.
1: Initialize: $\mathcal{R} = \emptyset$;
2: Use the reverse Breadth First Search algorithm to generate $\theta$ random RR sets and insert them into $\mathcal{R}$;
3: return $\mathcal{R}$.

Algorithm 2 Profit-Estimate
Input: A seed set $S = \{\bar{v}_1, \bar{v}_2, \cdots, \bar{v}_{|S|}\} \in \Omega$, $\mathcal{R} = \{R_1, R_2, \cdots, R_\theta\}$, $0 < \varepsilon < 1$.
Output: $\hat{\rho}(S)$ such that $|\hat{\rho}(S) - \rho(S)| < \varepsilon \cdot OPT$ with probability at least $1 - (nq)^{-l} (k^{*})^{-1} / \binom{nq}{k^{*}}$.
1: Initialize: $\hat{\rho}(S) = 0$;
2: for $i$ from 1 to $q$ do
3:     $S^{(i)} \leftarrow \emptyset$;
4:     for $j$ from 1 to $|S|$ do
5:         if $\bar{v}_j \in V^{(i)}$ then
6:             $S^{(i)} = S^{(i)} \cup \{\bar{v}_j\}$;
7: for $i$ from 1 to $q$ do
8:     Initialize: $F_{\mathcal{R}}(S^{(i)}) = 0$;
9:     for $k$ from 1 to $\theta$ do
10:         $F_{\mathcal{R}}(S^{(i)}) = F_{\mathcal{R}}(S^{(i)}) + \min\{|S^{(i)} \cap R_k|, 1\} / \theta$;
11: return $\hat{\rho}(S) = \sum_{i=1}^{q} nq\, p_i \cdot F_{\mathcal{R}}(S^{(i)})$.

Lemma 2. Suppose $\theta$ satisfies

$$\theta \ge (8q + \varepsilon)\, nq\, p_{\max} \cdot \frac{l \log(nq) + \log(2qk^{*}) + \log\binom{nq}{k^{*}}}{\varepsilon^{2} \cdot OPT}. \qquad (2)$$

Then for any set $S \in \Omega$, the following inequality holds with probability at least $1 - (nq)^{-l} (k^{*})^{-1} / \binom{nq}{k^{*}}$:

$$|\hat{\rho}(S) - \rho(S)| < \varepsilon \cdot OPT, \qquad (3)$$

where $l > 0$, $0 < \varepsilon < 1$ and $k^{*} = \lfloor B / c_{\min} \rfloor$.

Since $\hat{\rho}(S) \approx \rho(S)$ with high probability and $\hat{\rho}(S)$ can be computed in polynomial time, we now turn to solving the following problem:

$$\max\ \hat{\rho}(S) \quad \text{s.t.} \quad \sum_{i=1}^{q} c_i |S^{(i)}| \le B. \qquad (4)$$

4.2. Solving $\max_{S \in \Omega} \hat{\rho}(S)$

In this section, we provide a Modified Greedy algorithm for problem (4) which achieves a $(1 - 1/e)$-approximate solution $S_A$. Then we show that $S_A$ is a $(1 - 1/e - \varepsilon)$-approximate solution for the original PM-$\widetilde{G}$ problem with high probability.

Algorithm 3 Modified Greedy Algorithm

Input: Graph $\widetilde{G}$, a budget $B$ and $0 < \varepsilon < 1$.
Output: A $(1 - 1/e - \varepsilon)$-approximate solution for the PM-$\widetilde{G}$ problem, with probability at least $1 - (nq)^{-l}$.
1: Initialize: $\mathcal{U} = \emptyset$, $\mathcal{S} = \emptyset$, $\widehat{V} = \widetilde{V}$, $V' = \emptyset$;
2: for all $U \in \Omega$, $|U| \le 2$ do
3:     $\hat{\rho}(U) \leftarrow$ Profit-Estimate$(\mathcal{R}, U, \varepsilon)$;
4:     Insert $U$ into $\mathcal{U}$;
5: $U^{*} = \arg\max_{U \in \mathcal{U}} \{\hat{\rho}(U)\}$;
6: for all $S_0 \in \Omega$, $|S_0| = 3$ do
7:     $S \leftarrow S_0$;
8:     while $c(S) \le B$ do
9:         $\widehat{V} = \widehat{V} \setminus S$;
10:         $\hat{\rho}(S) \leftarrow$ Profit-Estimate$(\mathcal{R}, S, \varepsilon)$;
11:         for all $v \in \widehat{V}$ do
12:             if $c(S \cup \{v\}) \le B$ then
13:                 Insert $v$ into $V'$;
14:                 $\hat{\rho}(S \cup \{v\}) \leftarrow$ Profit-Estimate$(\mathcal{R}, S \cup \{v\}, \varepsilon)$;
15:         $v^{*} = \arg\max_{v \in V'} \left\{ \frac{\hat{\rho}(S \cup \{v\}) - \hat{\rho}(S)}{c(\{v\})} \right\}$;
16:         $S = S \cup \{v^{*}\}$;
17:     Insert $S$ into $\mathcal{S}$;
18: $S^{*} = \arg\max_{S \in \mathcal{S}} \{\hat{\rho}(S)\}$;
19: return $S_A = \arg\max_{S \in \{U^{*}, S^{*}\}} \hat{\rho}(S)$.

Lemma 3. $\hat{\rho}(\cdot)$ is nonnegative, nondecreasing and submodular.

Motivated by the design of the main algorithm in [18], we propose a Modified Greedy algorithm for problem (4). The sketch of the algorithm is as follows. We first enumerate all the feasible seed sets containing one or two nodes, to avoid the extreme situation in which nodes with high profit and cost are not included in the solution (Lines 2-5). We then start from each feasible seed set consisting of three nodes and greedily add the node with the largest ratio of marginal estimated profit to cost among the nodes that do not destroy the feasibility of the set (Lines 6-18). Finally, we output the better of the two cases (Line 19).
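The pipeline formed by Algorithms 1 to 3 can be sketched in a few dozen lines. The sketch below is an illustration under stated assumptions rather than the authors' implementation: copy nodes are pairs `(u, i)`, `in_edges` is a hypothetical reverse adjacency map, all function names are invented for this example, and the greedy loop stops explicitly once no feasible node remains, which the pseudocode leaves implicit. The enumeration over all triples is kept for fidelity and is only practical on very small inputs.

```python
import random
from itertools import combinations


def random_rr_set(nodes, in_edges, rng):
    """Sample one RR set by a reverse graph search from a random root.

    in_edges maps a node v to a list of (u, p) pairs, one per edge u -> v.
    Each incoming edge is kept with probability p; the traversal order
    does not affect the resulting set.
    """
    root = rng.choice(nodes)
    rr, frontier = {root}, [root]
    while frontier:
        v = frontier.pop()
        for u, p in in_edges.get(v, []):
            if u not in rr and rng.random() < p:
                rr.add(u)
                frontier.append(u)
    return rr


def profit_estimate(seeds, rr_sets, profit, n_copy_nodes):
    """rho_hat(S) = sum_i (nq) * p_i * F_R(S^(i)), with n_copy_nodes = nq."""
    total = 0.0
    for i, p_i in profit.items():
        s_i = {v for v in seeds if v[1] == i}
        covered = sum(1 for rr in rr_sets if not s_i.isdisjoint(rr))
        total += n_copy_nodes * p_i * covered / len(rr_sets)
    return total


def modified_greedy(nodes, cost, budget, estimate):
    """Enumerate small sets, then extend each feasible triple greedily."""
    def c(seed_set):
        return sum(cost[i] for (_, i) in seed_set)

    best, best_val = frozenset(), 0.0

    def consider(candidate):
        nonlocal best, best_val
        val = estimate(candidate)
        if val > best_val:
            best, best_val = frozenset(candidate), val

    # Phase 1: all feasible seed sets with one or two nodes.
    for size in (1, 2):
        for small in combinations(nodes, size):
            if c(small) <= budget:
                consider(small)

    # Phase 2: start from every feasible triple, add nodes by gain/cost ratio.
    for triple in combinations(nodes, 3):
        if c(triple) > budget:
            continue
        current = set(triple)
        while True:
            feasible = [v for v in nodes
                        if v not in current and c(current) + cost[v[1]] <= budget]
            if not feasible:
                break
            base = estimate(current)
            v_star = max(feasible,
                         key=lambda v: (estimate(current | {v}) - base) / cost[v[1]])
            current.add(v_star)
        consider(current)
    return best, best_val
```

A typical call would first draw `rr_sets = [random_rr_set(nodes, in_edges, rng) for _ in range(theta)]` and then pass `lambda S: profit_estimate(S, rr_sets, profit, len(nodes))` as the `estimate` argument, so that the same sample of RR sets is reused for every candidate seed set.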

Theorem 1. Given a graph $\widetilde{G}$, a positive number $B$, $0 < \varepsilon < 1$, $l > 0$ and $\theta$ that satisfies inequality (2), Algorithm 3 returns a $(1 - 1/e - \varepsilon)$-approximate solution for the PM-$\widetilde{G}$ problem with probability at least $1 - (nq)^{-l}$.

4.3. Estimation of the Parameter $\theta$

To guarantee that the solution returned by Algorithm 3 is a $(1 - 1/e - \varepsilon)$-approximate solution for the PM-$\widetilde{G}$ problem with high probability, the number $\theta$ of random RR sets generated in Algorithm 1 should satisfy inequality (2). For simplicity, we define

$$\lambda = (8q + \varepsilon)\, nq\, p_{\max} \cdot \left( l \log(nq) + \log(2qk^{*}) + \log\binom{nq}{k^{*}} \right) \cdot \varepsilon^{-2} \qquad (5)$$

and rewrite (2) as

$$\theta \ge \lambda / OPT. \qquad (6)$$

However, since $OPT$ is unknown in advance, it is difficult to set $\theta$ directly based on (6). Inspired by the technique used in [19], we address this challenge by finding an estimate $u$ of $OPT$ which is also a lower bound of $OPT$. Then, by setting $\theta = \lambda / u$, we can guarantee that $\theta$ satisfies inequality (6). On the other hand, $\theta$ should be kept reasonably small in order to avoid time overhead, which requires the lower bound $u$ to be as close to $OPT$ as possible.
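How $\theta$ follows from a lower bound $u$ can be sketched numerically. The snippet assumes that $\lambda$ has the form $(8q+\varepsilon)\, nq\, p_{\max}\,(l\log(nq) + \log(2qk^{*}) + \log\binom{nq}{k^{*}})\,\varepsilon^{-2}$; the constants come from a damaged extraction and should be checked against the published version before being relied on. The function names are hypothetical, and `math.lgamma` is used so that the binomial term never has to be formed as a huge integer.

```python
import math


def log_binom(n, k):
    """Natural log of C(n, k), computed via lgamma to avoid huge integers."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def lam(n, q, p_max, k_star, eps, l):
    """lambda in the assumed form stated in the lead-in (an assumption)."""
    nq = n * q
    log_terms = (l * math.log(nq)
                 + math.log(2 * q * k_star)
                 + log_binom(nq, k_star))
    return (8 * q + eps) * nq * p_max * log_terms / eps ** 2


def num_rr_sets(n, q, p_max, k_star, eps, l, u):
    """theta = ceil(lambda / u) for a lower bound u <= OPT."""
    return math.ceil(lam(n, q, p_max, k_star, eps, l) / u)
```

Because $\theta$ scales as $1/u$, doubling the lower bound roughly halves the number of RR sets to generate, which is why a tight estimate of $OPT$ matters for running time.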

In this section, an estimation of $OPT$ is presented, which is based on the results in [19]. Though this estimation is not good enough, we retain it here in order to evaluate the time complexity of our algorithms.

Define the width of an RR set $R$, denoted by $\omega(R)$, as the number of directed edges in $\widetilde{G}$ which point to the nodes in $R$. That is, $\omega(R) = \sum_{v \in R} (\text{the in-degree of } v \text{ in } \widetilde{G})$. Obviously, if an edge is examined in the generation of $R$, then it must point to a node in $R$. Let $EW$ be the expected width of a random RR set, that is, the expected number of coin tosses required to generate a random RR set. Therefore, it is easy to verify that the expected time complexity of Algorithm 1 is $O(\theta \cdot EW)$.

The connection between $EW$ and the expected spread of any node in $V^{*}$ is formalized in the following lemma [19].

Lemma 4. $\frac{n}{m} EW = E[I(\{v^{*}\})]$, where the expectation of $I(\{v^{*}\})$ is taken over the randomness in $v^{*}$ and the influence propagation process.

Lemma 4 implies that $p_{\min} \cdot \frac{n}{m} EW \le OPT$, since $E[I(S^{*})]$ is the expected spread of at least $\lfloor B / c_{\max} \rfloor$ seed nodes and the profit of activating any node counted in $E[I(S^{*})]$ is at least $p_{\min}$. As $p_{\min} \cdot \frac{n}{m} EW$ is easy to estimate, we can choose $u = p_{\min} \cdot \frac{n}{m} EW$ as a lower bound of $OPT$. However, when $|S^{*}| \gg 1$, this choice renders $\theta = \lambda / u$ unnecessarily large, which makes $u = p_{\min} \cdot \frac{n}{m} EW$ an unfavorable choice of $u$.

Now we consider another, closer estimation of $OPT$. For $i = 1, 2, \ldots, q$, we consider an extreme situation of the PM-$\widetilde{G}$ problem in which the seed set only contains nodes in $G^{(i)}$. In this situation, the seed set consists of no more than $k_i = \lfloor B / c_i \rfloor$ nodes and the PM-$\widetilde{G}$ problem reduces to seeking a size-$k_i$ seed set with the maximum profit, which is equivalent to the $k_i$-size IM problem in $G^{(i)}$. According to the results in [19], the IM problem in $G$ under the IC model can be solved by an algorithm called TIM+.

The TIM+ algorithm, which is based on the RIS technique, consists of two phases. The first phase, called parameter estimation, computes an estimate $KPT^{+}$ of the optimum and uses it to compute $\theta'$, the number of random RR sets that need to be generated. The second phase, called node selection, samples $\theta'$ random RR sets from $G$ and applies the greedy algorithm to derive a size-$k$ node set $S_k$ covering a large number of RR sets. (The details of the TIM+ algorithm can be seen in the appendix.)

We use the TIM+ algorithm to solve the $k_i$-size IM problem in $G^{(i)}$, and obtain an approximate solution (denoted by $S_{k_i}$) and the estimation of its expected spread (denoted by $\hat{\sigma}(S_{k_i})$). Then we compute the corresponding estimation of profit (denoted by $u_i$) when $S_{k_i}$ is used as the seed set. Based on the analysis in [19], we replace $u_i$ by $u_i / (1 + \varepsilon'/2)$ to ensure that $u_i \le OPT$ with high probability. Then we take $\max_{1 \le i \le q}\{u_i\}$ as an estimation of $OPT$, which is also a lower bound of $OPT$ with high probability. The estimating procedure is detailed in Algorithm 4, and the main result of TIM+ is presented as follows.

Algorithm 4 OPT Estimation

Input:

Graph (cid:101) G , a budget B , c , c , . . . , c q , p , p , . . . , p q and 0 < ε (cid:48) < l (cid:48) > Output:

A lower bound u ∗ of OPT . for i from 1 to q do ¡¡¡¡ k i = (cid:98) Bc i (cid:99) ;¡¡¡¡ ¡¡¡¡ ˆ σ ( S k i ) ← TIM + ( G ( i ) , k i , ε (cid:48) , l (cid:48) );¡¡¡¡ ¡¡¡¡ u i = p i · ˆ σ ( S k i ) / (1 + ε (cid:48) / return u ∗ = max ≤ i ≤ q { u i } . • Input :

A graph G , the constraint number k of the seed set, a constant l (cid:48) > ε (cid:48) ∈ (0 , • Output :

A seed set S k and the estimation ˆ σ ( S k ) of its expected spread. • Approximation : S k is a (1 − / e − ε (cid:48) )-approximate solution of the IM problem, with atleast 1 − n − l (cid:48) probability. • Time complexity : O (cid:0) ( k + l (cid:48) )( m + n ) log n / ( ε (cid:48) ) (cid:1) .Let S ∗ k be the optimal solution, and σ ( S ∗ k ) be the optimum of the k -size IM problem in G .Based on Lemma 7 and Lemma 8 in [19], we have: Lemma 5.

Let S k be the solution returned by the TIM + algorithm for the k-size IM problem, and ˆ σ ( S k ) be the estimation of the expected spread of S k , then Pr (cid:104) (1 − / e )(1 − ε (cid:48) / σ ( S ∗ k ) < ˆ σ ( S k ) < (1 + ε (cid:48) / σ ( S ∗ k ) (cid:105) > − n − l (cid:48) . Theorem 2.

Algorithm 4 returnsu ∗ ∈ (cid:34) (1 − / e )(1 − ε (cid:48) / + ε (cid:48) / q OPT , OPT (cid:35) with at least − qn − l (cid:48) probability and runs in O (cid:0) ( k ∗ + l (cid:48) )( m + n ) q log n / ( ε (cid:48) ) (cid:1) expected time,where l (cid:48) > and < ε (cid:48) < .4.3.3. Reﬁned Estimation of OPT In this section, we present another method to estimate

OPT. Clearly, the efficiency of the RMG algorithm highly depends on the value of $u^*$ obtained by Algorithm 4. Though we can ensure that the output $u^*$ of Algorithm 4 is no smaller than $\frac{(1 - 1/e)(1 - \varepsilon'/2)}{(1 + \varepsilon'/2)\, q} \cdot OPT$ with high probability, $u^*$ may be much smaller than OPT in experiments.

We propose an efficient remedy (Algorithm 5), which adds an intermediate step between Algorithm 4 and Algorithm 3. First, we construct two matrices, $P$ called the profit matrix and $A$ called the seed set matrix (Algorithm 5, Lines 1-9). Both matrices have $q$ rows and $k^*$ columns. For $i = 1, 2, \ldots, q$ and $j = 1, 2, \ldots, k_i$, the entry $a_{ij}$ of $A$ denotes the seed set which achieves the maximum profit of selecting $j$ seed nodes from $G^{(i)}$, and the entry $p_{ij}$ of $P$ is the estimation of profit obtained by using $a_{ij}$ as the seed set. For $i = 1, 2, \ldots, q$ and $j = k_i + 1, \ldots, k^*$, we set $p_{ij} = 0$ and $a_{ij} = \emptyset$. Then each time we select the seed set $a_{ij}$ with the maximum ratio of profit $p_{ij}$ to its activating cost $c_i \cdot j$ in the entire matrix, which means that we add the node set $a_{i^* j^*}$ whose indices satisfy

$$(i^*, j^*) = \arg\max_{1 \le i \le q,\; 1 \le j \le k^*} \left\{ \frac{p_{ij}}{c_i \cdot j} \right\}$$

to the current seed set, while still ensuring that the activation cost of the updated seed set is no more than $B$. After that, we set all the entries in row $i^*$ of $P$ to 0. We repeat the above process until the budget $B$ is exhausted. Denote by $\hat{S}$ the final seed set obtained, and let $u^{**} = \hat{\rho}(\hat{S})$ (Lines 10-15). The final output of Algorithm 5 is $u' = \max\{u^*, u^{**}\}$, a new lower bound of OPT (Line 16).

Algorithm 5 Refine OPT Estimation
Input: Graph $\widetilde{G}$, a budget $B$, $c_1, c_2, \ldots, c_q$, $p_1, p_2, \ldots, p_q$, $l' > 0$ and $0 < \varepsilon' < 1$
Output: A lower bound $u'$ of OPT.
 1: Initialize two matrices $P$ and $A$;
 2: for $i$ from 1 to $q$ do
 3:     for $j$ from 1 to $k_i$ do
 4:         $\hat{\sigma}(S_j^{(i)}) \leftarrow$ TIM+$(G^{(i)}, j, \varepsilon', l')$;
 5:         $a_{ij} = S_j^{(i)}$;
 6:         $p_{ij} = p_i \cdot \hat{\sigma}(S_j^{(i)})$;
 7:     for $j$ from $k_i + 1$ to $k^*$ do
 8:         $a_{ij} = \emptyset$;
 9:         $p_{ij} = 0$;
10: $\hat{S} \leftarrow \emptyset$;
11: while $c(\hat{S}) \le B$ do
12:     $(i^*, j^*) = \arg\max_{1 \le i \le q,\; 1 \le j \le k^*} \{ p_{ij} / (c_i \cdot j) \}$;
13:     $\hat{S} = \hat{S} \cup a_{i^* j^*}$;
14:     Set entries of row $i^*$ in matrix $P$ to 0;
15: $u^{**} = \hat{\rho}(\hat{S})$;
16: return $u' = \max\{u^*, u^{**}\}$.
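The selection loop of the refined estimation can be illustrated with a small sketch. It is not the authors' implementation: the function name `refine_selection`, the list-of-lists layout of the profit matrix, and the explicit feasibility check against the remaining budget are assumptions made for illustration. The profit entries are taken as given (in the paper they come from TIM+), and the final evaluation of $\hat{\rho}(\hat{S})$ on RR sets is left out.

```python
def refine_selection(profit, costs, budget):
    """Greedily pick (product, seed-count) pairs by profit/cost ratio.

    profit[i][j] is the estimated profit of the best (j + 1)-node seed set
    for product i, or 0.0 where that many seeds are not affordable.
    costs[i] is the activation cost of one seed for product i.
    """
    rows = [row[:] for row in profit]  # work on a copy of the matrix
    chosen, spent = [], 0.0
    while True:
        best, best_ratio = None, 0.0
        for i, row in enumerate(rows):
            for j, p in enumerate(row):
                cost = costs[i] * (j + 1)
                # Only consider entries that still fit in the budget.
                if p > 0 and spent + cost <= budget and p / cost > best_ratio:
                    best, best_ratio = (i, j + 1, cost), p / cost
        if best is None:
            break
        i, j, cost = best
        chosen.append((i, j))
        spent += cost
        rows[i] = [0.0] * len(rows[i])  # each product row is used at most once
    return chosen, spent
```

Each returned pair `(i, j)` says "use the best `j`-node seed set of product `i`"; zeroing the row mirrors the step that sets all entries of row $i^*$ to 0.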

In summary, our RMG algorithm for the PM^{2}A problem works as follows. Given the social network $G$, the budget $B$, the costs $c_1, c_2, \ldots, c_q$, the profits $p_1, p_2, \ldots, p_q$, and the parameters $\varepsilon$, $\varepsilon'$, $l$ and $l'$, we first construct the graph $\widetilde{G} = G^{(1)} \cup G^{(2)} \cup \ldots \cup G^{(q)}$. Then RMG runs Algorithm 4 and obtains a value $u^*$. Next, RMG computes $\theta = \lambda / u^*$, where $\lambda$ is defined in (5), and invokes Algorithm 1 to generate a set $R$ of random RR sets. Finally, we run Algorithm 3 with $\widetilde{G}$, $\varepsilon$, $B$ and $\theta$ as the input and take its output $S_A$ as the final result of the PM^{2}A problem.

In the rest of this section, we discuss the time complexity of

RMG algorithm. Based onprevious discussions, the expected time complexity of Algorithm 1 is O ( θ · EW ). In 4.3.1, wehave obtained that u = p min · nm EW is a lower bound of OPT . By setting θ = λ/ u , we can obtainthat Algorithm 1 has an expected time complexity of O ( θ · EW ) = O ( m λ np min ) = O (cid:0) ( k ∗ + l + m + n ) q p max log( nq ) / ( p min · ε ) (cid:1) . Clearly, Algorithm 2 runs in O ( q θ ) = O (cid:0) ( k ∗ + l + m + n ) q p max log( nq ) / ( p min · ε ) (cid:1) expectedtime.Now we are in the position to analyse the expected running time of Algorithm 3. For any S ∈ Ω , ˆ ρ ( S ) is computed by Algorithm 2 which runs in O (cid:0) ( k ∗ + l + m + n ) q p max log( nq ) / ( p min · ε ) (cid:1) nq ) times. The second part of Algorithm 3 from line 7 to 22 invokes Algorithm 2 at most k ∗ · ( nq ) times. Thus, Algorithm 3 has an expected time complexity of O (cid:0) k ∗ ( k ∗ + l + m + n ) n q p max log( nq ) / ( p min · ε ) (cid:1) .By Theorems 1 and 2, RMG runs in O (cid:0) k ∗ ( k ∗ + l + m + n ) n q p max · log( nq ) / ( p min · ε ) (cid:1) expected time and returns a (1 − / e − ε )-approximate solution with at least 1 − ( nq ) − l − qn − l (cid:48) probability.
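To make the role of the random RR sets more concrete, the following is a minimal sketch of how one reverse-reachable set can be sampled under the IC model. It illustrates the general RIS idea rather than the paper's Algorithm 1: the function name, the `in_neighbors` and `prob` data structures, and the uniform choice of the root are assumptions for illustration.

```python
import random
from collections import deque

def random_rr_set(in_neighbors, prob, rng):
    """Sample one reverse-reachable (RR) set under the IC model.

    in_neighbors[v] lists the nodes u with an edge u -> v, and
    prob[(u, v)] is the propagation probability of that edge.
    """
    root = rng.choice(sorted(in_neighbors))  # uniformly random root node
    rr, queue = {root}, deque([root])
    while queue:
        v = queue.popleft()
        for u in in_neighbors[v]:
            # Keep each incoming edge independently with probability p(u, v).
            if u not in rr and rng.random() < prob[(u, v)]:
                rr.add(u)
                queue.append(u)
    return rr
```

Calling this function $\theta$ times yields a collection of RR sets; the cost of each call is proportional to the number of edges examined, which is why the expected running time is expressed through the expected width of an RR set.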

5. Experimental Evaluation

In this section, we show the effectiveness of our proposed algorithm on three social network datasets. The goal of the experiments is threefold. First, we evaluate the performance of the RMG algorithm as measured by the achieved expected total profit. Second, we evaluate how tightly the estimated OPT and the refined OPT bound the profit from below, which indirectly controls the efficiency of the profit maximization algorithm. Finally, we show the distribution of budget and profit produced by our algorithm for different products on multiple datasets.

Table 1: Dataset characteristics
Columns: Dataset, n, m, Type, Average degree
Rows: NetHEPT, wikiVote, Epinions

Datasets. We conduct experiments on three real benchmark social networks, NetHEPT, wikiVote and Epinions, to examine the effectiveness of the RMG algorithm. Basic statistics of the datasets are summarized in Table 1, where $n$ denotes the number of nodes and $m$ denotes the number of edges in the social graph. For undirected graphs, we replace each undirected edge with two directed edges, one in each direction. Note that the number of edges is doubled in this case. All datasets used in our experiments are publicly available at [15].

Influence Model. In this work, we adopt the standard Independent Cascade (IC) model as the influence model, which is widely used in the literature [19, 20]. For the IC model, we set the propagation probability of each directed edge to the reciprocal of the in-degree of the node that the edge points to. Specifically, for each edge $e$ we first identify the node $v$ that $e$ points to, and then set $p(e) = 1/d(v)$, where $d(v)$ denotes the in-degree of $v$. This setting of $p(e)$ is widely used in prior works [19, 9, 10].

Algorithms. In addition to our proposed algorithm, we use three baseline algorithms for comparison, namely Random, Greedy, and PMCE [16]. Random is a baseline that randomly selects nodes from the network and assigns a random product to each node while satisfying the budget constraint. Greedy is an iterative procedure that, in each round, selects the pair of node and product with the maximum ratio of marginal increase in expected profit to cost, until the budget is exhausted. PMCE is the Profit Maximization with Cost Effectiveness algorithm proposed in [16]. It first constructs two candidate solutions and then selects the better one as the final result. The first candidate is selected by an iterative greedy process: in each round, the node with the maximum ratio of marginal profit increase to the square of cost is selected, and the process runs until the budget is used up. The second candidate is found by a similar iterative greedy process with a different selection rule: in each iteration, the node with the maximum marginal profit increase is selected. The intuition behind PMCE is to cover both the case that emphasizes product cost (first candidate) and the case that ignores it (second candidate). Note that PMCE is designed under the Linear Threshold (LT) model only, but we incorporate the triggering model generalization technique [19] into PMCE and extend it to the IC model.
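The graph preprocessing and the Monte Carlo spread estimation used by the baselines can be sketched as follows. This is an illustrative sketch, not the code used in the experiments: the function names, the edge-list representation and the fixed random seed are assumptions. It shows the conversion of undirected edges, the setting $p(e) = 1/d(v)$, and the averaging of simulated IC cascades.

```python
import random
from collections import defaultdict

def to_directed(undirected_edges):
    """Turn each undirected edge {u, v} into the two arcs (u, v) and (v, u)."""
    return [arc for u, v in undirected_edges for arc in ((u, v), (v, u))]

def weighted_cascade(edges):
    """Give each directed edge (u, v) the probability 1 / in-degree(v)."""
    indeg = defaultdict(int)
    for _, v in edges:
        indeg[v] += 1
    return {(u, v): 1.0 / indeg[v] for u, v in edges}

def simulate_ic(out_edges, seeds, rng):
    """Run one IC cascade and return the number of activated nodes."""
    active, frontier = set(seeds), list(seeds)
    while frontier:
        u = frontier.pop()
        for v, p in out_edges.get(u, []):
            # A newly active node gets one chance to activate each neighbor.
            if v not in active and rng.random() < p:
                active.add(v)
                frontier.append(v)
    return len(active)

def expected_spread(prob, seeds, rounds, seed=0):
    """Average the cascade size over `rounds` Monte Carlo simulations."""
    out_edges = defaultdict(list)
    for (u, v), p in prob.items():
        out_edges[u].append((v, p))
    rng = random.Random(seed)
    return sum(simulate_ic(out_edges, seeds, rng) for _ in range(rounds)) / rounds
```

A profit estimate for one product is then the product's unit profit multiplied by the estimated spread of its seed set.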

Table 2: Product statistics

Dataset    Product       Profit              Cost                Ratio
NetHEPT    P1; P2; P3    0.39; 0.55; 0.67    0.36; 0.48; 0.65    1.08; 1.15; 1.03
wikiVote   P1; P2; P3    0.45; 0.65; 0.41    0.12; 0.20; 0.80    3.75; 3.25; 0.51
Epinions   P1; P2; P3    0.45; 0.20; 0.06    0.08; 0.65; 0.78    5.63; 0.31; 0.08

Parameters. Unless otherwise specified, we set $\varepsilon$, $\varepsilon'$ and $\bar{\varepsilon}$ (another error parameter in the TIM+ algorithm) to the same value, and $B = 15$. For our solutions, we set $l$ and $l'$ in a way that ensures a success probability of $1 - 1/n$. For the baseline algorithms, we set the number $r$ of Monte Carlo simulations following the standard practice in the literature [19]. In our experiments, we randomly generate the profit and cost of the products for each dataset in three cases. First, the profit-to-cost ratio is similar among different products (NetHEPT). Second, two products have a higher profit-to-cost ratio than the third product (wikiVote). Third, a single product has a higher profit-to-cost ratio than the other two products (Epinions). The product statistics are presented in Table 2. In later paragraphs, we provide a detailed analysis of the different budget distribution patterns shown in these cases.

All experiments were run on a machine with an Intel Xeon 2.40GHz CPU and 64GB memory, running 64-bit RedHat Linux server. For each set of experiments, we run the simulation for 100 rounds and report the average results.

Figure 4: Expected total profit vs. amount of budget.

Expected Total Profit. Our first set of experiments compares our solutions with the baseline algorithms Random, Greedy and PMCE in terms of expected total profit. Figure 4 shows the expected total profit yielded by each method on all tested datasets, with $B$ varying from 1 to 15. The $x$-axis holds the amount of budget and the $y$-axis holds the expected total profit. We observe that the trend of PMCE is almost in line with that of Greedy, and RMG consistently outperforms all baseline algorithms. In particular, when B = , RMG leads PMCE by over 20% on all datasets, and the gap between RMG and the baseline algorithms becomes larger as the budget increases.

Figure 5: Comparison of total profit with estimated OPT (Algorithm 4) and refined OPT (Algorithm 5).

Estimation on OPT. Figure 5 compares the expected total profit yielded by RMG with the estimated OPT yielded by Algorithm 4 (OPTEst) and Algorithm 5 (RefOPT), respectively. The $x$-axis holds the amount of budget. For RMG, the $y$-axis holds the expected total profit; for OPTEst and RefOPT, the $y$-axis holds the estimated lower bound of OPT. The budget $B$ ranges from 1 to 15. We observe that RefOPT produces a tighter estimation of the lower bound of OPT on all datasets over a varying budget. This indicates that it is beneficial to incorporate the computation of the maximum profit that can be achieved when budget distributions over multiple products are taken into account. Thus Algorithm 5 provides a more sophisticated yet effective estimation of the lower bound of OPT, leading to a higher efficiency of RMG.

Figure 6: Budget & profit distributions.

Budget & Profit Distribution. We take a further step to explore the distribution of budget and profit produced by RMG with a varying budget. Figure 6 illustrates how the budget is distributed over multiple products and the corresponding profit gained from each product, with the budget varying from 1 to 15. We observe that with a limited budget at the very beginning, all the budget is spent on promoting the product with the highest profit-to-cost ratio. As the budget increases, spending more on a single profitable product is no longer preferred, and gradually adjusting the budget distribution over multiple products becomes crucial. Thus RMG balances cost and profit in the long run and produces a distribution that maximizes the profit.

In summary, our experiments on various settings demonstrate that the RMG algorithm is effective, producing solutions superior to those of the baselines.

6. Conclusion

The traditional influence maximization problem focuses on the diffusion of a single product or piece of information in the social network, aiming to find a small node set of maximum influence. However, in reality, one company may produce several products to meet the demand of customers. The PM^{2}A problem considers the diffusion of multiple different products in the social network, and seeks a seed set within the limited budget to achieve the goal of profit maximization. Therefore, how to allocate the limited budget among multiple products is crucial for the company in designing commercial activities.

In this paper, we propose the RMG algorithm for the PM^{2}A problem. The algorithm runs in $O\big(k^* (k^* + l + m + n)\, n\, q\, p_{\max} \log(nq) / (p_{\min} \cdot \varepsilon)\big)$ expected time and returns a $(1 - 1/e - \varepsilon)$-approximate solution with at least $1 - (nq)^{-l} - q n^{-l'}$ probability, which significantly improves upon prior works in terms of performance guarantee and is also the best performance ratio of the PM^{2}A problem even for one product. Experimental results on real-world social networks show that our RMG algorithm outperforms the algorithm proposed in [16] and other heuristics in terms of profit maximization, and allocates the budget better. For future work, we plan to improve the time complexity of the RMG algorithm, and to investigate the case in which multiple products spread in the social network and compete with each other.

Acknowledgement. This work was supported in part by the National Natural Science Foundation of China (11501316), the China Postdoctoral Science Foundation (2016M600556), the Qingdao Postdoctoral Application Research Project (2016156), and the Natural Science Foundation of Shandong Province of China (ZR2017QA010).

7. References

[1] J. Goldenberg, B. Libai, E. Muller, “Talk of the network: a complex systems look at the underlying process of word-of-mouth,” in Marketing Letters, 2001, pp. 211–223.
[2] J. Goldenberg, B. Libai, E. Muller, “Using complex systems analysis to advance marketing theory development,” in Academy of Marketing Science Review, 2001.
[3] M. Granovetter, “Threshold models of collective behavior,” in American Journal of Sociology, 1978, pp. 1420–1443.
[4] T. Schelling, “Micromotives and Macrobehavior,” in Norton, 1978.
[5] A. Borodin, Y. Filmus, and J. Oren, “Threshold models for competitive influence in social networks,” in WINE, 2010, pp. 539–550.
[6] W. Chen, C. Wang, and Y. Wang, “Scalable influence maximization for prevalent viral marketing in large-scale social networks,” in KDD, 2010, pp. 1029–1038.
[7] W. Chen, Y. Wang, and S. Yang, “Efficient influence maximization in social networks,” in KDD, 2009, pp. 199–208.
[8] W. Chen, Y. Yuan, and L. Zhang, “Scalable influence maximization in social networks under the linear threshold model,” in ICDM, 2010, pp. 88–97.
[9] A. Goyal, W. Lu and L. V. S. Lakshmanan, “CELF++: Optimizing the greedy algorithm for influence maximization in social networks,” in WWW, 2011, pp. 47–48.
[10] K. Jung, W. Heo and W. Chen, “IRIE: A scalable influence maximization algorithm for independent cascade model and its extensions,” in ICDM, 2012, pp. 1–20.
[11] D. Kempe, J. M. Kleinberg, and V. Tardos, “Maximizing the spread of influence through a social network,” in KDD, 2003, pp. 137–146.
[12] D. Kempe, J. M. Kleinberg, and V. Tardos, “Influential nodes in a diffusion model for social networks,” in ICALP, 2005, pp. 1127–1138.
[13] D. Kempe, J. M. Kleinberg, and V. Tardos, “Maximizing the spread of influence through a social network,” in Theory, 2015, pp. 105–147.
[14] J. Leskovec, A. Krause, C. Guestrin, C. Faloutsos, J. VanBriesen and N. Glance, “Cost-effective outbreak detection in networks,” in KDD, 2007, pp. 420–429.
[15] J. Leskovec and A. Krevl, “SNAP Datasets: Stanford large network dataset collection,” http://snap.stanford.edu/data, 2014.
[16] H. Y. Zhang, H. L. Zhang, A. Kuhnle, M. T. Thai, “Profit maximization for multiple products in online social networks,” in INFOCOM, 2016, pp. 1–9.
[17] C. Borgs, M. Brautbar, J. T. Chayes, and B. Lucier, “Maximizing social influence in nearly optimal time,” in SODA, 2014, pp. 946–957.
[18] M. Sviridenko, “A note on maximizing a submodular set function subject to a knapsack constraint,” in Operations Research Letters, 2004, pp. 41–43.
[19] Y. Tang, X. Xiao, and Y. Shi, “Influence maximization: near-optimal time complexity meets practical efficiency,” in SIGMOD, 2014, pp. 75–86.
[20] Y. Tang, Y. Shi, and X. Xiao, “Influence maximization in near-linear time: A martingale approach,” in SIGMOD, 2015, pp. 1539–1554.
[21] H. T. Nguyen, M. T. Thai, and T. N. Dinh, “Stop-and-stare: Optimal sampling algorithms for viral marketing in billion-scale networks,” in SIGMOD, 2016, pp. 695–710.
[22] H. Nguyen and R. Zheng, “On budgeted influence maximization in social networks,” in IEEE J. Sel. Area Comm. 31, 2013, pp. 1084–1094.
[23] N. Du, Y. Liang, M. F. Balcan and L. Song, “Budgeted influence maximization for multiple products,” arXiv:1312.2164.
[24] R. Motwani and P. Raghavan, Randomized Algorithms, Cambridge University Press, 1995, pp. 68–70.
[25] H. T. Nguyen, M. T. Thai, and T. N. Dinh, “A billion-scale approximation algorithm for maximizing benefit in viral marketing,” in IEEE/ACM Transactions on Networking, 2017, pp. 2419–2429.
[26] E. F. Moore, “The shortest path through a maze,” in the proceeding of Int. Symp. Switching Theory, 1959, pp. 285–292.

Appendix A. Proof of some conclusions

Proof of Lemma 2. For any seed set $S \in \Omega$, let $\mu_i = \sigma(S^{(i)}) / (nq) = E[F_R(S^{(i)})]$, which represents the probability that $S^{(i)}$ overlaps with a random RR set. Then $\theta \cdot F_R(S^{(i)})$ can be regarded as the sum of $\theta$ i.i.d. Bernoulli variables with mean $\mu_i$. Thus, we have

$$\Pr\left[\,|\hat{\rho}(S^{(i)}) - \rho(S^{(i)})| \ge \frac{\varepsilon}{2q} \cdot OPT\right] = \Pr\left[\,|\theta \cdot F_R(S^{(i)}) - \theta \cdot \mu_i| \ge \frac{\varepsilon \cdot OPT}{2 n q^2 p_i \mu_i} \cdot \theta \mu_i\right]. \quad (A.1)$$

Let $\delta = \frac{\varepsilon \cdot OPT}{2 n q^2 p_i \mu_i}$. By Chernoff bounds, inequality (2) and the fact that $\rho(S^{(i)}) = p_i \sigma(S^{(i)}) = p_i \cdot nq \mu_i \le OPT$, the following inequality holds for the right hand side (r.h.s.) of (A.1):

$$\text{r.h.s. of (A.1)} < 2\exp\left\{-\frac{\delta^2}{2 + \delta} \cdot \theta \mu_i\right\} = 2\exp\left\{-\frac{\varepsilon^2 \cdot OPT^2}{2 n q^2 p_i (4 n q^2 p_i \mu_i + \varepsilon \cdot OPT)} \cdot \theta\right\} \le 2\exp\left\{-\frac{\varepsilon^2 \cdot OPT^2}{2 n q^2 p_i (4 q \cdot OPT + \varepsilon \cdot OPT)} \cdot \theta\right\} \le (nq)^{-l} (q k^*)^{-1} \Big/ \binom{nq}{k^*}. \quad (A.2)$$

Furthermore, we have

$$\Pr\left[\,|\hat{\rho}(S) - \rho(S)| < \frac{\varepsilon}{2} \cdot OPT\right] = \Pr\left[\,\left|\sum_{i=1}^{q} \big(\hat{\rho}(S^{(i)}) - \rho(S^{(i)})\big)\right| < \frac{\varepsilon}{2} \cdot OPT\right]$$
$$\ge \Pr\left[\,\sum_{i=1}^{q} \big|\hat{\rho}(S^{(i)}) - \rho(S^{(i)})\big| < \frac{\varepsilon}{2} \cdot OPT\right] \ge \Pr\left[\,q \cdot \max_{1 \le i \le q} |\hat{\rho}(S^{(i)}) - \rho(S^{(i)})| < \frac{\varepsilon}{2} \cdot OPT\right]$$
$$= \Pr\left[\,|\hat{\rho}(S^{(1)}) - \rho(S^{(1)})| < \frac{\varepsilon}{2q} \cdot OPT, \ldots, |\hat{\rho}(S^{(q)}) - \rho(S^{(q)})| < \frac{\varepsilon}{2q} \cdot OPT\right]. \quad (A.3)$$

By the union bound and inequality (A.2), we have

$$\text{r.h.s. of (A.3)} \ge \sum_{i=1}^{q} \Pr\left[\,|\hat{\rho}(S^{(i)}) - \rho(S^{(i)})| < \frac{\varepsilon}{2q} \cdot OPT\right] - (q - 1) \ge 1 - (nq)^{-l} (k^*)^{-1} \Big/ \binom{nq}{k^*}.$$

Therefore, the lemma is proved.

Proof of Lemma 3. For any node set $S \subseteq \widetilde{V}$, adding nodes from $\widetilde{V} \setminus S$ into $S$ can never decrease $F_R(S^{(i)})$ ($i = 1, 2, \ldots, q$). Hence, $\hat{\rho}(S) = nq \sum_{i=1}^{q} p_i F_R(S^{(i)})$ is nondecreasing since $p_i > 0$. For any $S \subseteq T \subseteq \widetilde{V}$ and $y \in \widetilde{V} \setminus T$, we will prove the following inequality (A.4), which implies that $F_R(\cdot)$ is submodular:

$$F_R(S \cup \{y\}) - F_R(S) \ge F_R(T \cup \{y\}) - F_R(T). \quad (A.4)$$

Let $W_1$, $W_2$, $W_3$ and $W_4$ be the sets of RR sets in $R$ covered by $T \cup \{y\}$, $T$, $S \cup \{y\}$ and $S$, respectively. Then $W_1 \setminus W_2$ represents the set of RR sets which are covered by $\{y\}$ but not covered by $T$, and $W_3 \setminus W_4$ represents the set of RR sets which are covered by $\{y\}$ but not covered by $S$. Since $S \subseteq T$, we have $(W_1 \setminus W_2) \subseteq (W_3 \setminus W_4)$. By the definition of $F_R(\cdot)$, $F_R(S \cup \{y\}) - F_R(S)$ represents the proportion of RR sets in $R$ which are covered by $\{y\}$ but not covered by $S$. It follows that inequality (A.4) holds, according to the relationship between $(W_1 \setminus W_2)$ and $(W_3 \setminus W_4)$. Therefore, $\hat{\rho}(S) = nq \sum_{i=1}^{q} p_i F_R(S^{(i)})$ is submodular.

Proof of Theorem 1. Let $S_A$ be the node set returned by Algorithm 3, and $S_p^*$ be the optimal solution of problem (4). As $S_A$ is obtained by a $(1 - 1/e)$-approximation algorithm for problem (4) [14], we have $\hat{\rho}(S_A) \ge (1 - 1/e)\, \hat{\rho}(S_p^*)$. Recall that $S^*$ is the optimal solution for the PM-$\widetilde{G}$ problem and $OPT = \rho(S^*)$; we have $\hat{\rho}(S_p^*) \ge \hat{\rho}(S^*)$, leading to $\hat{\rho}(S_A) \ge (1 - 1/e)\, \hat{\rho}(S^*)$.

According to Lemma 2, inequality (3) holds with at least $1 - (nq)^{-l} (k^*)^{-1} / \binom{nq}{k^*}$ probability for a given seed set $S \in \Omega$. By the assumption that $k^* \le \lfloor nq/2 \rfloor$, we obtain $|\Omega| \le \sum_{1 \le j \le k^*} \binom{nq}{j} \le k^* \cdot \binom{nq}{k^*}$. Then, by the union bound, inequality (3) holds simultaneously for all node sets belonging to $\Omega$ with at least $1 - (nq)^{-l}$ probability. In that case, we have

$$\rho(S_A) > \hat{\rho}(S_A) - \frac{\varepsilon}{2} \cdot OPT \ge \left(1 - \frac{1}{e}\right) \hat{\rho}(S^*) - \frac{\varepsilon}{2} \cdot OPT > \left(1 - \frac{1}{e}\right)\left(1 - \frac{\varepsilon}{2}\right) \cdot OPT - \frac{\varepsilon}{2} \cdot OPT > \left(1 - \frac{1}{e} - \varepsilon\right) \cdot OPT.$$

Thus, Theorem 1 is proved.
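The estimator $\hat{\rho}(S) = nq \sum_{i=1}^{q} p_i F_R(S^{(i)})$ used in these proofs can be illustrated with a short sketch. The representation is an assumption made for illustration: nodes of $\widetilde{G}$ are modeled as `(node, product)` pairs, RR sets are Python sets of such pairs, and the function names are hypothetical.

```python
def coverage_fraction(rr_sets, seeds):
    """F_R(S): fraction of RR sets that contain at least one seed."""
    if not rr_sets:
        return 0.0
    hit = sum(1 for rr in rr_sets if not seeds.isdisjoint(rr))
    return hit / len(rr_sets)

def estimated_profit(rr_sets, seeds, profits, n):
    """rho_hat(S) = n * q * sum_i p_i * F_R(S^(i)).

    seeds is a set of (node, product) pairs, profits[i] is the unit
    profit p_i of product i, and n is the number of nodes in G.
    """
    q = len(profits)
    total = 0.0
    for i, p_i in enumerate(profits):
        seeds_i = {s for s in seeds if s[1] == i}  # S^(i): seeds of product i
        total += p_i * coverage_fraction(rr_sets, seeds_i)
    return n * q * total
```

On any fixed collection of RR sets, adding a seed can only increase each coverage fraction, and the gain from adding a seed to a smaller set is at least the gain from adding it to a superset, which is the monotonicity and submodularity argued in the proof of Lemma 3.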

Proof of Lemma 5. According to Lemma 7 and Lemma 8 in [19], we have that

$$\Pr\left[KPT^* \in \left[\frac{KPT}{4},\, \sigma(S_k^*)\right]\right] \ge 1 - n^{-l'},$$

and

$$\Pr\left[KPT^{+} \in \big[KPT^*,\, \sigma(S_k^*)\big] \;\Big|\; KPT^* \in \left[\frac{KPT}{4},\, \sigma(S_k^*)\right]\right] \ge 1 - n^{-l'}.$$

Thus,

$$\Pr\big[KPT^{+} \le \sigma(S_k^*)\big] \ge \Pr\big[KPT^{+} \in [KPT^*,\, \sigma(S_k^*)]\big] \ge 1 - n^{-l'}.$$

Let $\lambda' = (8 + 2\varepsilon')\, n \left(l' \log n + \log 2 + \log \binom{n}{k}\right) \cdot (\varepsilon')^{-2}$ and $\theta' = \lambda' / KPT^{+}$; then $\Pr\big[\theta' \ge \lambda' / \sigma(S_k^*)\big] \ge 1 - n^{-l'}$.

For any size-$k$ node set $S_k$, suppose that $\theta'$ satisfies $\theta' \ge \lambda' / \sigma(S_k^*)$; then $|\hat{\sigma}(S_k) - \sigma(S_k)| < (\varepsilon'/2) \cdot \sigma(S_k^*)$ holds with at least $1 - n^{-l'} / \binom{n}{k}$ probability. It follows that when $\theta'$ satisfies $\theta' \ge \lambda' / \sigma(S_k^*)$,

$$\Pr\left[\hat{\sigma}(S_k) < \left(1 + \frac{\varepsilon'}{2}\right) \sigma(S_k^*)\right] \ge 1 - n^{-l'} \Big/ \binom{n}{k}$$

and

$$\Pr\left[\hat{\sigma}(S_k) > \left(1 - \frac{1}{e}\right)\left(1 - \frac{\varepsilon'}{2}\right) \sigma(S_k^*)\right] \ge 1 - n^{-l'} \Big/ \binom{n}{k}.$$

Thus,

$$\Pr\left[\left(1 - \frac{1}{e}\right)\left(1 - \frac{\varepsilon'}{2}\right) \sigma(S_k^*) < \hat{\sigma}(S_k) < \left(1 + \frac{\varepsilon'}{2}\right) \sigma(S_k^*) \;\Big|\; \theta' \ge \frac{\lambda'}{\sigma(S_k^*)}\right] \ge 1 - n^{-l'} \Big/ \binom{n}{k}.$$

Therefore, we have

$$\Pr\Big[(1 - 1/e)(1 - \varepsilon'/2)\,\sigma(S_k^*) < \hat{\sigma}(S_k) < (1 + \varepsilon'/2)\,\sigma(S_k^*)\Big] \ge \big(1 - n^{-l'}\big)\left(1 - n^{-l'} \Big/ \binom{n}{k}\right) > 1 - n^{-l'}.$$

Proof of Theorem 2.

Recall that the optimal solution of the PM-$\widetilde{G}$ problem is $S^*$, and denote $S^* = \bar{S}^{(1)} \cup \bar{S}^{(2)} \cup \cdots \cup \bar{S}^{(q)}$. Then, we obtain that

$$OPT = \rho(S^*) = \sum_{i=1}^{q} \rho(\bar{S}^{(i)}) = \sum_{i=1}^{q} p_i \cdot \sigma(\bar{S}^{(i)}).$$

By the definition of the optimal solution, $\bar{S}^{(i)}$ is the optimal solution of the $|\bar{S}^{(i)}|$-size IM problem in $G^{(i)}$. For the $k_i$-size IM problem in $G^{(i)}$ ($i = 1, 2, \ldots, q$), let $S_{k_i}$ and $\hat{\sigma}(S_{k_i})$ be the $(1 - 1/e - \varepsilon')$-approximate solution and the corresponding value returned by TIM+, respectively. Let $S_{k_i}^*$ be the optimal solution, and $\sigma(S_{k_i}^*)$ be its expected spread. Based on Lemma 5, we have

$$\Pr\Big[(1 - 1/e)(1 - \varepsilon'/2)\,\sigma(S_{k_i}^*) < \hat{\sigma}(S_{k_i}) < (1 + \varepsilon'/2)\,\sigma(S_{k_i}^*)\Big] > 1 - n^{-l'}.$$

It follows directly that $|S_{k_i}^*| \ge |\bar{S}^{(i)}|$ by the definitions of $k_i$ and $\bar{S}^{(i)}$. Recall that $S_{k_i}^*$ and $\bar{S}^{(i)}$ are the optimal solutions of the $k_i$-size and $|\bar{S}^{(i)}|$-size influence maximization problems in $G^{(i)}$, respectively. Then we have $\sigma(S_{k_i}^*) \ge \sigma(\bar{S}^{(i)})$ since $\sigma(\cdot)$ is nondecreasing.

Let $u_i = p_i \cdot \hat{\sigma}(S_{k_i}) / (1 + \varepsilon'/2)$ and $u^* = \max_{1 \le i \le q} \{u_i\}$; then

$$\Pr\left[\frac{(1 - 1/e)(1 - \varepsilon'/2)}{1 + \varepsilon'/2}\, \rho(S_{k_i}^*) < u_i \le OPT\right] > 1 - n^{-l'}. \quad (A.5)$$

Therefore, (A.5) holds simultaneously for all $i = 1, 2, \ldots, q$ with at least $1 - q n^{-l'}$ probability, which means that

$$OPT \le \sum_{i=1}^{q} p_i \cdot \sigma(S_{k_i}^*) = \sum_{i=1}^{q} \rho(S_{k_i}^*) < \frac{(1 + \varepsilon'/2)\, q}{(1 - 1/e)(1 - \varepsilon'/2)}\, u^*$$

and $u^* \le OPT$ hold simultaneously with at least $1 - q n^{-l'}$ probability. In conclusion, $u^* \in \left[\frac{(1 - 1/e)(1 - \varepsilon'/2)}{(1 + \varepsilon'/2)\, q}\, OPT,\; OPT\right]$ with at least $1 - q n^{-l'}$ probability.

TIM+ runs in $O\big((k_i + l')(m + n) \log n / (\varepsilon')^2\big)$ expected time, where $k_i = \lfloor B / c_i \rfloor$ is the size budget of the seed set [19]. Therefore, Algorithm 4 runs in $O\big((k^* + l')(m + n)\, q \log n / (\varepsilon')^2\big)$ expected time.

Appendix B. Brief Introduction of the TIM+ Algorithm

In this section, we give an outline of the TIM+ algorithm. The TIM+ algorithm, which is based on the RIS technique, consists of two phases. The first phase, called parameter estimation, obtains an estimate $KPT^{+}$ of the optimum and uses it to compute $\theta'$, the number of RR sets that need to be generated. The second phase, called node selection, samples $\theta'$ RR sets from $G$ and applies the greedy algorithm to derive a size-$k$ node set $S_k$ covering a large number of RR sets. We put all the algorithms of the entire process together in Algorithm 6.

Algorithm 6 TIM+
Input: Graph $G$, a positive integer $k$ and $0 < \bar{\varepsilon}, \varepsilon' < 1$
Output: A value $\hat{\sigma}(S_k)$ where $S_k$ is a $(1 - 1/e - \varepsilon')$-approximate solution of the $k$-size IM problem, with at least $1 - n^{-l'}$ probability.
  for $i$ from 1 to $\log_2 n - 1$ do
      Let $c_i = (6\, l \log n + 6 \log(\log_2 n)) \cdot 2^i$;
      Let $sum = 0$;
      for $j$ from 1 to $c_i$ do
          Generate a random RR set $R$;
          $sum = sum + \kappa(R)$;
      if $sum / c_i > 1 / 2^i$ then
          return $KPT^* = n \cdot sum / (2 \cdot c_i)$.
  return $KPT^* = 1$
  Let $R'$ be the set of all RR sets generated in the last iteration of the above loop;
  Initialize: $S'_k = \emptyset$;
  for $i$ from 1 to $k$ do
      Identify the node $v_i$ that covers the most RR sets in $R'$;
      Add $v_i$ into $S'_k$;
      Remove from $R'$ all RR sets that are covered by $v_i$;
  Let $\bar{\lambda} = (2 + \bar{\varepsilon})\, l'\, n \log n \cdot (\bar{\varepsilon})^{-2}$;
  Let $\bar{\theta} = \bar{\lambda} / KPT^*$;
  Initialize a set $R'' = \emptyset$;
  Generate $\bar{\theta}$ random RR sets and put them into $R''$;
  Let $\bar{f}$ be the fraction of the RR sets in $R''$ that is covered by $S'_k$;
  Let $KPT' = \bar{f} \cdot n / (1 + \bar{\varepsilon})$;
  $KPT^{+} = \max\{KPT', KPT^*\}$;
  Let $\lambda' = (8 + 2\varepsilon')\, n \cdot \left(l' \log n + \log 2 + \log \binom{n}{k}\right) \cdot (\varepsilon')^{-2}$;
  Let $\theta' = \lambda' / KPT^{+}$;
  Initialize a set $R^* = \emptyset$;
  Generate $\theta'$ random RR sets and insert them into $R^*$;
  Initialize a node set $S_k = \emptyset$;
  for $i$ from 1 to $k$ do
      Identify the node $v_i$ that covers the most RR sets in $R^*$;
      Add $v_i$ into $S_k$;
      Remove from $R^*$ all RR sets that are covered by $v_i$;
  Let $f$ be the fraction of the RR sets in $R^*$ that is covered by $S_k$;
  return $\hat{\sigma}(S_k) = n \cdot f$
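The node selection phase, which repeatedly picks the node covering the most remaining RR sets, can be sketched as follows. This is a simple quadratic-time illustration, not the optimized procedure of [19]: the function name is hypothetical, ties are broken by the smallest node label for determinism, and the RR sets are assumed to be already generated.

```python
def greedy_max_coverage(rr_sets, k):
    """Pick up to k nodes that greedily cover as many RR sets as possible.

    Returns the chosen nodes and the fraction f of RR sets they cover;
    n * f is then the estimate of the expected spread.
    """
    remaining = [set(rr) for rr in rr_sets]
    seeds, covered = [], 0
    for _ in range(k):
        counts = {}
        for rr in remaining:
            for v in rr:
                counts[v] = counts.get(v, 0) + 1
        if not counts:
            break  # every RR set is already covered
        # Node covering the most remaining RR sets (smallest label on ties).
        best = max(sorted(counts), key=counts.get)
        seeds.append(best)
        covered += counts[best]
        # Drop the RR sets that the new seed covers.
        remaining = [rr for rr in remaining if best not in rr]
    fraction = covered / len(rr_sets) if rr_sets else 0.0
    return seeds, fraction
```

For example, with `n = 4` nodes and the four RR sets `{1, 2}`, `{2, 3}`, `{3}`, `{4}`, choosing `k = 2` selects nodes 2 and 3, covers three of the four sets, and gives the spread estimate `4 * 0.75 = 3.0`.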