Non-monotone DR-Submodular Function Maximization (Full version)
Tasuku Soma
The University of Tokyo
tasuku [email protected]
Yuichi Yoshida
National Institute of Informatics, and
Preferred Infrastructure, Inc.
[email protected]
Abstract
We consider non-monotone DR-submodular function maximization, where DR-submodularity (diminishing return submodularity) is an extension of submodularity for functions over the integer lattice based on the concept of the diminishing return property. Maximizing non-monotone DR-submodular functions has many applications in machine learning that cannot be captured by submodular set functions. In this paper, we present a $\frac{1}{2+\varepsilon}$-approximation algorithm with a running time of roughly $O(\frac{n}{\varepsilon}\log^2 B)$, where $n$ is the size of the ground set, $B$ is the maximum value of a coordinate, and $\varepsilon > 0$ is a parameter. The approximation ratio is almost tight, and the dependency of the running time on $B$ is exponentially smaller than that of the naive greedy algorithm. Experiments on synthetic and real-world datasets demonstrate that our algorithm outputs almost the best solution compared to other baseline algorithms, whereas its running time is several orders of magnitude faster.

Introduction

Submodular functions have played a key role in various tasks in machine learning, statistics, social science, and economics. A set function $f: 2^E \to \mathbb{R}$ with a ground set $E$ is submodular if $f(X \cup \{e\}) - f(X) \ge f(Y \cup \{e\}) - f(Y)$ for arbitrary sets $X, Y \subseteq E$ with $X \subseteq Y$ and an element $e \in E \setminus Y$. The importance and usefulness of submodularity in these areas are due to the fact that submodular functions naturally capture the diminishing return property. Various important functions in these areas, such as the entropy function, coverage function, and utility functions, satisfy this property. See, e.g., (Krause and Golovin 2014; Fujishige 2005).

Recently, maximizing (non-monotone) submodular functions has attracted particular interest in the machine learning community. In contrast to minimizing submodular functions, which can be done in polynomial time, maximizing submodular functions is NP-hard in general. However, we can achieve a constant factor approximation for various settings. Notably, (Buchbinder et al. 2012) presented a very elegant double greedy algorithm for (unconstrained) submodular function maximization, which was the first algorithm achieving $\frac{1}{2}$-approximation, and this approximation ratio is tight (Feige, Mirrokni, and Vondrak 2011). Applications of non-monotone submodular function maximization include efficient sensor placement (Krause, Singh, and Guestrin 2008), privacy in online services (Krause and Horvitz 2008), and maximum entropy sampling (Ko, Lee, and Queyranne 1995).

The models and applications mentioned so far are built upon submodular set functions. Although set functions are fairly powerful for describing problems such as variable selection, we sometimes face difficult situations that cannot be cast with set functions. For example, in the budget allocation problem (Alon, Gamzu, and Tennenholtz 2012), we would like to decide how much budget should be set aside for each ad source, rather than whether we use the ad source or not. A similar issue arises when we consider models allowing multiple choices of an element in the ground set.

To deal with such situations, several generalizations of submodularity have been proposed. (Soma et al. 2014) devised a general framework for maximizing monotone submodular functions on the integer lattice, and showed that the budget allocation problem and its variants fall into this framework. In their framework, functions are defined over the integer lattice $\mathbb{Z}_+^E$ and therefore effectively represent discrete allocations of budget.
Regarding the original motivation for the diminishing return property, one can naturally consider its generalization to the integer lattice: a function $f: \mathbb{Z}_+^E \to \mathbb{R}$ satisfying
\[ f(x + \chi_e) - f(x) \ge f(y + \chi_e) - f(y) \]
for $x \le y$ and $e \in E$, where $\chi_e \in \mathbb{R}^E$ is the vector with $\chi_e(e) = 1$ and $\chi_e(a) = 0$ for every $a \ne e$. Such functions are called diminishing return submodular (DR-submodular) functions (Soma and Yoshida 2015) or coordinate-wise concave submodular functions (Milgrom and Strulovici 2009). DR-submodular functions have found various applications in generalized sensor placement (Soma and Yoshida 2015) and (a natural special case of) the budget allocation problem (Soma et al. 2014).

As a related notion, a function is said to be lattice submodular if
\[ f(x) + f(y) \ge f(x \vee y) + f(x \wedge y) \]
for arbitrary $x$ and $y$, where $\vee$ and $\wedge$ denote the coordinate-wise max and min, respectively. Note that DR-submodularity is stronger than lattice submodularity in general (see, e.g., (Soma et al. 2014)). Nevertheless, we consider DR-submodularity to be a "natural definition" of submodularity, at least for the applications mentioned so far, because the diminishing return property is crucial in these real-world scenarios.
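To make the lattice definition concrete, the following brute-force check of the DR inequality on a small box may help; this is a minimal sketch, and the separable objective `budget_value` and the tiny bounds are illustrative assumptions, not part of the paper.

```python
import itertools
import math

def is_dr_submodular(f, E, B):
    """Brute-force check of f(x + chi_e) - f(x) >= f(y + chi_e) - f(y)
    for all x <= y and all e in E, restricted to the box 0 <= x <= B."""
    box = [range(b + 1) for b in B]
    for x in itertools.product(*box):
        for y in itertools.product(*box):
            if not all(xi <= yi for xi, yi in zip(x, y)):
                continue  # only pairs with coordinate-wise x <= y matter
            for i in range(len(E)):
                if y[i] + 1 > B[i]:
                    continue  # keep y + chi_e inside the box
                xe = tuple(v + (j == i) for j, v in enumerate(x))
                ye = tuple(v + (j == i) for j, v in enumerate(y))
                if f(xe) - f(x) < f(ye) - f(y) - 1e-9:
                    return False
    return True

# A separable sum of concave per-coordinate gains is DR-submodular.
def budget_value(x):
    return sum(math.sqrt(v) for v in x)

print(is_dr_submodular(budget_value, E=["a", "b"], B=(2, 2)))  # True
```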
Our contributions

We design a novel polynomial-time approximation algorithm for maximizing (non-monotone) DR-submodular functions. More precisely, we consider the optimization problem
\[ \text{maximize } f(x) \quad \text{subject to } 0 \le x \le B, \tag{1} \]
where $f: \mathbb{Z}_+^E \to \mathbb{R}_+$ is a non-negative DR-submodular function, $0$ is the zero vector, and $B \in \mathbb{Z}_+^E$ is a vector representing the maximum value of each coordinate. When $B$ is the all-ones vector, this is equivalent to the original (unconstrained) submodular function maximization. We assume that $f$ is given as an evaluation oracle; when we specify $x \in \mathbb{Z}_+^E$, the oracle returns the value of $f(x)$.

Our algorithm achieves $\frac{1}{2+\varepsilon}$-approximation for any constant $\varepsilon > 0$ in $O(\frac{|E|}{\varepsilon} \cdot \log(\frac{\Delta}{\delta})\log B \cdot (\theta + \log B))$ time, where $\delta$ and $\Delta$ are the minimum positive marginal gain and maximum positive value, respectively, of $f$, $B = \|B\|_\infty := \max_{e \in E} B(e)$, and $\theta$ is the running time of evaluating (the oracle for) $f$. To our knowledge, this is the first polynomial-time algorithm achieving (roughly) $\frac{1}{2}$-approximation.

We also conduct numerical experiments on the revenue maximization problem using real-world networks. The experimental results show that the solution quality of our algorithm is comparable to that of other algorithms. Furthermore, our algorithm runs several orders of magnitude faster than the other algorithms when $B$ is large.

DR-submodularity is necessary for obtaining polynomial-time algorithms with a meaningful approximation guarantee; if $f$ is only lattice submodular, then we cannot obtain a constant approximation in polynomial time. To see this, it suffices to observe that an arbitrary univariate function is lattice submodular, and therefore finding an (approximate) maximum value must invoke $\Omega(B)$ queries. We note that representing an integer $B$ requires $\lceil \log B \rceil$ bits. Hence, a running time of $O(B)$ is pseudopolynomial rather than polynomial.
Fast simulation of the double greedy algorithm

Naturally, one can reduce problem (1) to the maximization of a submodular set function by simply duplicating each element $e$ in the ground set into $B(e)$ distinct copies and defining a set function over the set of all the copies. One can then run the double greedy algorithm (Buchbinder et al. 2012) to obtain $\frac{1}{2}$-approximation. This reduction is simple but has one large drawback: the size of the new ground set is $\sum_{e \in E} B(e)$, which is pseudopolynomial in $B$ (e.g., with $|E| = 100$ and $B(e) = 10^6$, the new ground set has $10^8$ elements). Therefore, this naive double greedy algorithm does not scale to situations where $B$ is large.

For scalability, we need an additional trick that reduces the pseudopolynomial running time to a polynomial one. In monotone submodular function maximization on the integer lattice, (Soma and Yoshida 2015; Soma and Yoshida 2016) provide such a speedup trick, which effectively combines the decreasing threshold technique (Badanidiyuru and Vondrák 2014) with binary search. However, a similar technique does not apply to our setting, because the double greedy algorithm works differently from (single) greedy algorithms for monotone submodular function maximization. The double greedy algorithm examines each element in a fixed order, and marginal gains are used to decide whether to include the element or not. In contrast, the greedy algorithm chooses each element in decreasing order of marginal gains, and this property is crucial for the decreasing threshold technique.

We resolve this issue by splitting the set of all marginal gains into polynomially many small intervals. For each interval, we approximately execute multiple steps of the double greedy algorithm at once, as long as the marginal gains remain in the interval. Because the marginal gains do not change (much) within the interval, this simulation can be done with polynomially many queries and polynomial-time overhead. To our knowledge, this speedup technique is not known in the literature and is therefore of more general interest.

Very recently, (Ene and Nguyen 2016) pointed out that a DR-submodular function $f: \{0, 1, \ldots, B\}^E \to \mathbb{R}_+$ can be expressed as a submodular set function $g$ over a polynomial-sized ground set, which turns out to be $E \times \{0, 1, \ldots, k-1\}$, where $k = \lceil \log_2(B+1) \rceil$. Their idea is to represent $x(e)$ in binary form for each $e \in E$; the bits in the binary representations form the new ground set (a small sketch of this encoding is given below). We may want to apply the double greedy algorithm to $g$ in order to get a polynomial-time approximation algorithm. However, this strategy has the following two drawbacks: (i) The value of $g(E \times \{0, 1, \ldots, k-1\})$ is defined as $f(x)$, where $x(e) = 2^k - 1$ for every $e \in E$. This means that we have to extend the domain of $f$. (ii) More crucially, the double greedy algorithm on $g$ may return a large set such as $E \times \{0, 1, \ldots, k-1\}$ whose corresponding vector $x \in \mathbb{Z}_+^E$ may violate the constraint $x \le B$. Although we can resolve these issues by introducing a knapsack constraint, this is not a practical solution because existing algorithms for knapsack constraints (Lee et al. 2009; Chekuri, Vondrák, and Zenklusen 2014) are slow and have approximation ratios worse than $\frac{1}{2}$.
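The following is a minimal sketch of the bit encoding behind this reduction; the helper names are ours, and the last line exhibits drawback (ii): decoding the full bit set can exceed $B$.

```python
import math

def bit_ground_set(E, B):
    """New ground set E x {0, ..., k-1}, where k = ceil(log2(B + 1))."""
    k = math.ceil(math.log2(B + 1))
    return [(e, i) for e in E for i in range(k)], k

def decode(S, E):
    """Map a set S of (element, bit) pairs to the vector x with
    x(e) = sum of 2^i over bits (e, i) in S."""
    x = {e: 0 for e in E}
    for e, i in S:
        x[e] += 1 << i
    return x

E = ["a", "b"]
ground, k = bit_ground_set(E, B=5)      # k = 3 bits per element
print(decode({("a", 0), ("a", 2)}, E))  # {'a': 5, 'b': 0}
print(decode(set(ground), E))           # {'a': 7, 'b': 7} -- violates x <= B = 5
```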
Notations

For an integer $n \in \mathbb{N}$, $[n]$ denotes the set $\{1, \ldots, n\}$. For vectors $x, y \in \mathbb{Z}^E$, we define $f(x \mid y) := f(x + y) - f(y)$. The $\ell_1$-norm and $\ell_\infty$-norm of a vector $x \in \mathbb{Z}^E$ are defined as $\|x\|_1 := \sum_{e \in E} |x(e)|$ and $\|x\|_\infty := \max_{e \in E} |x(e)|$, respectively.

Related work

As mentioned above, there have been many efforts to maximize submodular functions on the integer lattice. Perhaps the work most related to ours is (Gottschalk and Peis 2015), in which the authors considered maximizing lattice submodular functions over the bounded integer lattice and designed a $\frac{1}{3}$-approximation pseudopolynomial-time algorithm. Their algorithm was also based on the double greedy algorithm, but does not include a speedup technique such as the one proposed in this paper.

In addition, there are several studies on the constrained maximization of submodular functions (Feige, Mirrokni, and Vondrak 2011; Buchbinder et al. 2014; Buchbinder and Feldman 2016), although we focus on the unconstrained case. Many algorithms for maximizing submodular functions are randomized, but a very recent work (Buchbinder and Feldman 2016) devised a derandomized version of the double greedy algorithm. (Gotovos, Karbasi, and Krause 2015) considered maximizing non-monotone submodular functions in the adaptive setting, a concept introduced in (Golovin and Krause 2011). A continuous analogue of DR-submodular functions is considered in (Bian et al. 2016).

Algorithms

In this section, we present a polynomial-time approximation algorithm for maximizing (non-monotone) DR-submodular functions. We first explain a simple adaptation of the double greedy algorithm for (set) submodular functions to our setting, which runs in pseudopolynomial time. Then, we show how to achieve a polynomial number of oracle calls. Finally, we provide an algorithm with a polynomial running time (details are placed in Appendix A).
Pseudopolynomial-time algorithm
Algorithm 1 is an immediate extension of the double greedy algorithm for maximizing submodular (set) functions (Buchbinder et al. 2012) to our setting. We start with $x = 0$ and $y = B$, and then, for each $e \in E$, we tighten the gap between $x(e)$ and $y(e)$ until they become exactly the same. Let $\alpha = f(\chi_e \mid x)$ and $\beta = f(-\chi_e \mid y)$. We note that
\[ \alpha + \beta = f(x + \chi_e) - f(x) - \big(f(y) - f(y - \chi_e)\big) \ge 0 \]
holds by the DR-submodularity of $f$. Hence, if $\beta < 0$, then $\alpha > 0$ must hold, and we increase $x(e)$ by one. Similarly, if $\alpha < 0$, then $\beta > 0$ must hold, and we decrease $y(e)$ by one. When both of them are non-negative, we increase $x(e)$ by one with probability $\frac{\alpha}{\alpha+\beta}$, or decrease $y(e)$ by one with the complement probability $\frac{\beta}{\alpha+\beta}$.

Algorithm 1 Pseudopolynomial-time algorithm
Input: $f: \mathbb{Z}_+^E \to \mathbb{R}_+$, $B \in \mathbb{Z}_+^E$. Assumption: $f$ is DR-submodular.
 1: $x \leftarrow 0$, $y \leftarrow B$.
 2: for $e \in E$ do
 3:   while $x(e) < y(e)$ do
 4:     $\alpha \leftarrow f(\chi_e \mid x)$ and $\beta \leftarrow f(-\chi_e \mid y)$.
 5:     if $\beta < 0$ then $x(e) \leftarrow x(e) + 1$.
 6:     else if $\alpha < 0$ then $y(e) \leftarrow y(e) - 1$.
 7:     else $x(e) \leftarrow x(e) + 1$ with probability $\frac{\alpha}{\alpha+\beta}$ and $y(e) \leftarrow y(e) - 1$ with the complement probability $\frac{\beta}{\alpha+\beta}$. If $\alpha = \beta = 0$, we assume $\frac{\alpha}{\alpha+\beta} = 1$.
 8:     end if
 9:   end while
10: end for
11: return $x$.

Theorem 1. Algorithm 1 is a $\frac{1}{2}$-approximation algorithm for (1) with time complexity $O(\|B\|_1 \theta + \|B\|_1)$, where $\theta$ is the running time of evaluating $f$.

We omit the proof as it is a simple modification of the analysis of the original algorithm.
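A direct transcription of Algorithm 1 into Python may be helpful; this is a sketch under the stated DR-submodularity assumption, and the dictionary-based oracle interface and toy objective are ours, not the authors' implementation.

```python
import random

def double_greedy_lattice(f, E, B, seed=0):
    """Algorithm 1: pseudopolynomial-time double greedy on {0,...,B(e)}^E.

    f is a value oracle taking a dict x; B is a dict of upper bounds.
    By DR-submodularity, alpha + beta >= 0 in every iteration.
    """
    rng = random.Random(seed)
    x = {e: 0 for e in E}
    y = dict(B)
    for e in E:
        while x[e] < y[e]:
            alpha = f({**x, e: x[e] + 1}) - f(x)  # f(chi_e | x)
            beta = f({**y, e: y[e] - 1}) - f(y)   # f(-chi_e | y)
            if beta < 0:
                x[e] += 1
            elif alpha < 0:
                y[e] -= 1
            else:
                p = 1.0 if alpha + beta == 0 else alpha / (alpha + beta)
                if rng.random() < p:
                    x[e] += 1
                else:
                    y[e] -= 1
    return x

# Example: a separable non-monotone DR-submodular f over {0,...,3}^2.
f = lambda z: 4 * sum(z.values()) - sum(v * v for v in z.values())
print(double_greedy_lattice(f, ["a", "b"], {"a": 3, "b": 3}))
```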
Algorithm with polynomially many oracle calls
In this section, we present an algorithm with a polynomial number of oracle calls.

Our strategy is to simulate Algorithm 1 without evaluating the input function $f$ many times. A key observation is that, at Line 4 of Algorithm 1, we do not need to know the exact values of $f(\chi_e \mid x)$ and $f(-\chi_e \mid y)$; good approximations to them are sufficient to achieve an approximation guarantee close to $\frac{1}{2}$. To exploit this observation, we first design an algorithm that outputs (sketches of) approximations to the functions $g(b) := f(\chi_e \mid x + b\chi_e)$ and $h(b) := f(-\chi_e \mid y - b\chi_e)$. Note that $g$ and $h$ are non-increasing functions in $b$ because of the DR-submodularity of $f$.

To illustrate this idea, let us consider a non-increasing function $\phi: \{0, 1, \ldots, B-1\} \to \mathbb{R}$, and suppose for now that $\phi$ takes a positive value ($\phi$ is either $g$ or $h$ later on). Let $\delta$ and $\Delta$ be the minimum and the maximum positive values of $\phi$, respectively. Then, for each $\delta \le \tau \le \Delta$ of the form $\delta(1+\varepsilon)^k$, we find the minimum $b_\tau$ such that $\phi(b_\tau) < \tau$ (we regard $\phi(B) = -\infty$). From the non-increasing property of $\phi$, we then have $\phi(b) \ge \tau$ for any $b < b_\tau$. Using the set of pairs $\{(b_\tau, \tau)\}_\tau$, we can obtain a good approximation to $\phi$. The details are provided in Algorithm 2.

Algorithm 2 Sketching subroutine for Algorithm 3
Input: $\phi: \{0, 1, \ldots, B-1\} \to \mathbb{R}$, $\varepsilon > 0$. Assumption: $\phi$ is non-increasing. We regard $\phi(B) = -\infty$.
 1: $S \leftarrow \emptyset$.
 2: Find the minimum $b \in \{0, 1, \ldots, B\}$ with $\phi(b) \le 0$ by binary search.
 3: if $b \ge 1$ then ($\phi$ has a positive value)
 4:   $\Delta \leftarrow \phi(0)$ and $\delta \leftarrow \phi(b-1)$.
 5:   for ($\tau \leftarrow \delta$; $\tau \le \Delta$; $\tau \leftarrow (1+\varepsilon)\tau$) do
 6:     Find the minimum $b_\tau \in \{0, 1, \ldots, B\}$ with $\phi(b_\tau) < \tau$ by binary search.
 7:     $S \leftarrow S \cup \{(b_\tau, \tau)\}$.
 8:   end for
 9:   if $b_\delta \ne B$ then $S \leftarrow S \cup \{(B, 0)\}$.
10: else ($\phi$ is non-positive)
11:   $S \leftarrow S \cup \{(B, 0)\}$.
12: end if
13: return $S$.
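The following Python sketch mirrors Algorithm 2 and the reconstruction procedure of Lemma 2 below; the linear scan in `reconstruct` could be replaced by a binary search over the sorted pairs to obtain the $O(\log B)$ bound. The function names and the example $\phi$ are illustrative assumptions.

```python
def sketch(phi, B, eps):
    """Algorithm 2: sketch a non-increasing phi on {0,...,B-1} as pairs
    (b_tau, tau); phi(B) is regarded as -infinity."""
    def min_b(pred):
        # minimum b in {0,...,B} with pred(phi(b)) true; binary search is
        # valid because phi is non-increasing, so pred is monotone in b
        lo, hi = 0, B
        while lo < hi:
            mid = (lo + hi) // 2
            if mid == B or pred(phi(mid)):
                hi = mid
            else:
                lo = mid + 1
        return lo

    S = []
    b0 = min_b(lambda v: v <= 0)           # minimum b with phi(b) <= 0
    if b0 >= 1:                            # phi takes a positive value
        delta, Delta = phi(b0 - 1), phi(0)
        tau = delta
        while tau <= Delta:
            S.append((min_b(lambda v, t=tau: v < t), tau))
            tau *= 1 + eps
        if S[0][0] != B:                   # b_delta != B: add the sentinel
            S.append((B, 0.0))
    else:                                  # phi is non-positive everywhere
        S.append((B, 0.0))
    return S

def reconstruct(S, b):
    """Lemma 2: return v with v <= phi(b) < (1+eps) v if phi(b) > 0, else 0."""
    bt_min = min(bt for bt, _ in S if b < bt)
    return max(tau for bt, tau in S if bt == bt_min and b < bt)

phi = lambda b: 10.0 - b                  # non-increasing on {0,...,15}
S = sketch(phi, B=16, eps=0.5)
print(reconstruct(S, 3), phi(3))          # 5.0625 7.0 : 5.0625 <= 7.0 < 7.59375
```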
Lemma 2. For any $\phi: \{0, 1, \ldots, B-1\} \to \mathbb{R}$ and $\varepsilon > 0$, Algorithm 2 outputs a set of pairs $\{(b_\tau, \tau)\}_\tau$ from which, for any $b \in \{0, 1, \ldots, B-1\}$, we can reconstruct a value $v$ in $O(\log B)$ time such that $v \le \phi(b) < (1+\varepsilon)v$ if $\phi(b) > 0$ and $v = 0$ otherwise. The time complexity of Algorithm 2 is $O(\frac{1}{\varepsilon}\log(\frac{\Delta}{\delta})\log B \cdot \theta)$ if $\phi$ has a positive value, where $\delta$ and $\Delta$ are the minimum and maximum positive values of $\phi$, respectively, and $\theta$ is the running time of evaluating $\phi$, and is $O(\log B \cdot \theta)$ otherwise.

Proof. Let $S = \{(b_\tau, \tau)\}_\tau$ be the set of pairs output by Algorithm 2. Our reconstruction algorithm is as follows: given $b \in \{0, 1, \ldots, B-1\}$, let $(b_{\tau^*}, \tau^*)$ be the pair with the minimum $b_{\tau^*}$ such that $b < b_{\tau^*}$. Note that such a $b_{\tau^*}$ always exists because a pair of the form $(B, \cdot)$ is always added to $S$. We then output $\tau^*$. The time complexity of this reconstruction algorithm is clearly $O(\log B)$.

We now show the correctness of the reconstruction algorithm. If $\phi(b) > 0$, then, in particular, we have $\phi(b) \ge \delta$. Then, $\tau^*$ is the maximum value of the form $\delta(1+\varepsilon)^k$ at most $\phi(b)$. Hence, we have $\tau^* \le \phi(b) < (1+\varepsilon)\tau^*$. If $\phi(b) \le 0$, then $(b_{\tau^*}, \tau^*) = (B, 0)$ and we output zero.

Finally, we analyze the time complexity of Algorithm 2. Each binary search requires $O(\log B)$ time. The number of binary searches performed is $O(\log_{1+\varepsilon}\frac{\Delta}{\delta}) = O(\frac{1}{\varepsilon}\log\frac{\Delta}{\delta})$ when $\phi$ has a positive value, and $1$ when $\phi$ is non-positive. Hence, we have the desired time complexity. ∎

We can regard Algorithm 2 as providing a value oracle for a function $\tilde\phi: \{0, 1, \ldots, B-1\} \to \mathbb{R}_+$ that is an approximation to the input function $\phi: \{0, 1, \ldots, B-1\} \to \mathbb{R}$.

We now describe our algorithm for maximizing DR-submodular functions. The basic idea is similar to Algorithm 1, but when we need $f(\chi_e \mid x)$ and $f(-\chi_e \mid y)$, we use approximations to them instead. Let $\alpha$ and $\beta$ be approximations to $f(\chi_e \mid x)$ and $f(-\chi_e \mid y)$, respectively, obtained by Algorithm 2. Then, we increase $x(e)$ by one with probability $\frac{\alpha}{\alpha+\beta}$ and decrease $y(e)$ by one with the complement probability $\frac{\beta}{\alpha+\beta}$. The details are given in Algorithm 3.

Algorithm 3 Algorithm with polynomially many queries
Input: $f: \mathbb{Z}_+^E \to \mathbb{R}_+$, $B \in \mathbb{Z}_+^E$, $\varepsilon > 0$. Assumption: $f$ is DR-submodular.
 1: $x \leftarrow 0$, $y \leftarrow B$.
 2: for $e \in E$ do
 3:   Define $g, h: \{0, 1, \ldots, B(e)-1\} \to \mathbb{R}$ as $g(b) = f(\chi_e \mid x + b\chi_e)$ and $h(b) = f(-\chi_e \mid y - b\chi_e)$.
 4:   Let $\tilde g$ and $\tilde h$ be approximations to $g$ and $h$, respectively, given by Algorithm 2.
 5:   while $x(e) < y(e)$ do
 6:     $\alpha \leftarrow \tilde g(x(e))$ and $\beta \leftarrow \tilde h(B(e) - y(e))$.
 7:     $x(e) \leftarrow x(e) + 1$ with probability $\frac{\alpha}{\alpha+\beta}$ and $y(e) \leftarrow y(e) - 1$ with the complement probability $\frac{\beta}{\alpha+\beta}$. If $\alpha = \beta = 0$, we assume $\frac{\alpha}{\alpha+\beta} = 1$.
 8:   end while
 9: end for
10: return $x$.
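Combining the previous two snippets gives a compact simulation of Algorithm 3; again a sketch — it assumes the `sketch`/`reconstruct` helpers above and the same dictionary-based oracle.

```python
import random

def fast_double_greedy(f, E, B, eps, seed=0):
    """Algorithm 3: double greedy driven by sketched marginal gains."""
    rng = random.Random(seed)
    x = {e: 0 for e in E}
    y = dict(B)
    for e in E:
        xb, yb = dict(x), dict(y)  # bases frozen when g and h are defined
        g = lambda b: f({**xb, e: b + 1}) - f({**xb, e: b})
        h = lambda b: f({**yb, e: B[e] - b - 1}) - f({**yb, e: B[e] - b})
        gs, hs = sketch(g, B[e], eps), sketch(h, B[e], eps)
        while x[e] < y[e]:
            alpha = reconstruct(gs, x[e])
            beta = reconstruct(hs, B[e] - y[e])
            p = 1.0 if alpha + beta == 0 else alpha / (alpha + beta)
            if rng.random() < p:
                x[e] += 1
            else:
                y[e] -= 1
    return x

f = lambda z: 4 * sum(z.values()) - sum(v * v for v in z.values())
print(fast_double_greedy(f, ["a", "b"], {"a": 3, "b": 3}, eps=0.3))
```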
We now analyze Algorithm 3. An iteration refers to an iteration of the while loop from Line 5. We have $\|B\|_1$ iterations in total. For $k \in \{1, \ldots, \|B\|_1\}$, let $x_k$ and $y_k$ be $x$ and $y$, respectively, right after the $k$-th iteration. Note that $x_{\|B\|_1} = y_{\|B\|_1}$ is the output of the algorithm. We define $x_0 = 0$ and $y_0 = B$ for convenience.

Let $o$ be an optimal solution. For $k \in \{0, 1, \ldots, \|B\|_1\}$, we then define $o_k = (o \vee x_k) \wedge y_k$. Note that $o_0 = o$ holds and $o_{\|B\|_1}$ equals the output of the algorithm. We have the following key lemma.

Lemma 3. For every $k \in [\|B\|_1]$, we have
\[ \mathbf{E}[f(o_{k-1}) - f(o_k)] \le \frac{1+\varepsilon}{2}\,\mathbf{E}[f(x_k) - f(x_{k-1}) + f(y_k) - f(y_{k-1})]. \tag{2} \]
Proof. Fix $k \in [\|B\|_1]$ and let $e$ be the element of interest in the $k$-th iteration. Let $\alpha$ and $\beta$ be the values in Line 6 in the $k$-th iteration. We then have
\[
\mathbf{E}[f(x_k) - f(x_{k-1}) + f(y_k) - f(y_{k-1})]
= \frac{\alpha}{\alpha+\beta} f(\chi_e \mid x_{k-1}) + \frac{\beta}{\alpha+\beta} f(-\chi_e \mid y_{k-1})
\ge \frac{\alpha}{\alpha+\beta}\alpha + \frac{\beta}{\alpha+\beta}\beta = \frac{\alpha^2+\beta^2}{\alpha+\beta}, \tag{3}
\]
where we use the guarantee in Lemma 2 in the inequality.

We next establish an upper bound on $\mathbf{E}[f(o_{k-1}) - f(o_k)]$. As $o_k = (o \vee x_k) \wedge y_k$, conditioned on a fixed $o_{k-1}$, we obtain
\[
\mathbf{E}[f(o_{k-1}) - f(o_k)]
= \frac{\alpha}{\alpha+\beta}\Big(f(o_{k-1}) - f\big(o_{k-1} \vee x_k(e)\chi_e\big)\Big)
+ \frac{\beta}{\alpha+\beta}\Big(f(o_{k-1}) - f\big(o_{k-1} \wedge y_k(e)\chi_e\big)\Big). \tag{4}
\]

Claim 4. $(4) \le \frac{(1+\varepsilon)\alpha\beta}{\alpha+\beta}$.

Proof. We show this claim by considering the following three cases.

If $x_k(e) \le o_{k-1}(e) \le y_k(e)$, then (4) is zero.

If $o_{k-1}(e) < x_k(e)$, then $o_k(e) = o_{k-1}(e) + 1$, and the first term of (4) is
\[
f(o_{k-1}) - f\big(o_{k-1} \vee x_k(e)\chi_e\big) = f(o_{k-1}) - f(o_{k-1} + \chi_e)
\le f(y_{k-1} - \chi_e) - f(y_{k-1}) = f(-\chi_e \mid y_{k-1}) \le (1+\varepsilon)\beta.
\]
Here, the first inequality uses the DR-submodularity of $f$ and the fact that $o_{k-1} \le y_{k-1} - \chi_e$, and the second inequality uses the guarantee in Lemma 2. The second term of (4) is zero, and hence we have $(4) \le \frac{(1+\varepsilon)\alpha\beta}{\alpha+\beta}$.

If $y_k(e) < o_{k-1}(e)$, then, by a similar argument, we have $(4) \le \frac{(1+\varepsilon)\alpha\beta}{\alpha+\beta}$. ∎

We now return to proving Lemma 3. By Claim 4,
\[
(4) \le \frac{(1+\varepsilon)\alpha\beta}{\alpha+\beta} \le \frac{1+\varepsilon}{2} \cdot \frac{\alpha^2+\beta^2}{\alpha+\beta} \le \frac{1+\varepsilon}{2} \cdot (3),
\]
which indicates the desired result. ∎
Theorem 5. Algorithm 3 is a $\frac{1}{2+\varepsilon}$-approximation algorithm for (1) with time complexity $O(\frac{|E|}{\varepsilon} \cdot \log(\frac{\Delta}{\delta})\log\|B\|_\infty \cdot \theta + \|B\|_1 \log\|B\|_\infty)$, where $\delta$ and $\Delta$ are the minimum positive marginal gain and the maximum positive value, respectively, of $f$, and $\theta$ is the running time of evaluating $f$.

Proof. Summing up (2) for $k \in [\|B\|_1]$, we get
\[
\sum_{k=1}^{\|B\|_1} \mathbf{E}[f(o_{k-1}) - f(o_k)] \le \frac{1+\varepsilon}{2} \sum_{k=1}^{\|B\|_1} \mathbf{E}[f(x_k) - f(x_{k-1}) + f(y_k) - f(y_{k-1})].
\]
The above sums are telescoping, and hence we obtain
\[
\mathbf{E}[f(o_0) - f(o_{\|B\|_1})] \le \frac{1+\varepsilon}{2}\,\mathbf{E}[f(x_{\|B\|_1}) - f(x_0) + f(y_{\|B\|_1}) - f(y_0)]
\le \frac{1+\varepsilon}{2}\,\mathbf{E}[f(x_{\|B\|_1}) + f(y_{\|B\|_1})] = (1+\varepsilon)\,\mathbf{E}[f(x_{\|B\|_1})].
\]
The second inequality uses the fact that $f$ is non-negative, and the last equality uses $y_{\|B\|_1} = x_{\|B\|_1}$. Because $\mathbf{E}[f(o_0) - f(o_{\|B\|_1})] = f(o) - \mathbf{E}[f(x_{\|B\|_1})]$, we obtain that $\mathbf{E}[f(x_{\|B\|_1})] \ge \frac{1}{2+\varepsilon} f(o)$.

We now analyze the time complexity. We only query the input function $f$ inside Algorithm 2, and the number of oracle calls is $O(\frac{|E|}{\varepsilon}\log(\frac{\Delta}{\delta})\log B)$ by Lemma 2. Note that we invoke Algorithm 2 with $g$ and $h$, and the minimum positive values of $g$ and $h$ are at least the minimum positive marginal gain $\delta$ of $f$. The number of iterations is $\|B\|_1$, and we need $O(\log B)$ time to access $\tilde g$ and $\tilde h$. Hence, the total time complexity is as stated. ∎
Remark 6. We note that even if $f$ is not a non-negative function, the proof of Theorem 5 works as long as $f(x_0) \ge 0$ and $f(y_0) \ge 0$, that is, $f(0) \ge 0$ and $f(B) \ge 0$. Hence, given a DR-submodular function $f: \mathbb{Z}_+^E \to \mathbb{R}$ and $B \in \mathbb{Z}_+^E$, we can obtain a $\frac{1}{2+\varepsilon}$-approximation algorithm for the following problem:
\[ \text{maximize } f(x) - \min\{f(0), f(B)\} \quad \text{subject to } 0 \le x \le B. \tag{5} \]
This observation is useful, as the objective function often takes negative values in real-world applications.
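As a tiny illustration of Remark 6 (the names are ours), one can wrap the oracle with the shift from (5) to restore the non-negativity needed at the endpoints:

```python
def shift_objective(f, f_zero, f_B):
    """Remark 6: maximize f(x) - min{f(0), f(B)} instead of f(x).
    The shifted objective is 0 at one endpoint and non-negative at both,
    which is all the proof of Theorem 5 uses."""
    c = min(f_zero, f_B)
    return lambda x: f(x) - c

# e.g., f(0) = -3 and f(B) = 1: the shifted objective maps them to 0 and 4.
```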
Polynomial-time algorithm
In many applications, the running time needed to evaluate the input function is a bottleneck, and hence Algorithm 3 is already satisfactory. However, it is theoretically interesting to reduce the total running time to a polynomial, and we show the following. The proof is deferred to Appendix A.
Theorem 7.
There exists a $\frac{1}{2+\varepsilon}$-approximation algorithm with time complexity $\tilde O(\frac{|E|}{\varepsilon}\log(\frac{\Delta}{\delta})\log\|B\|_\infty \cdot (\theta + \log\|B\|_\infty))$, where $\delta$ and $\Delta$ are the minimum positive marginal gain and the maximum positive value, respectively, of $f$, and $\theta$ is the running time of evaluating $f$. Here, $\tilde O(T)$ means $O(T \log^c T)$ for some $c \in \mathbb{N}$.

Experiments

In this section, we show our experimental results and the superiority of our algorithm with respect to other baseline algorithms.
Experimental setting
We conducted experiments on a Linux server with an Intel Xeon E5-2690 (2.90 GHz) processor and 256 GB of main memory. All the algorithms were implemented in C++. We compared the following algorithms:

• Single Greedy (SG): We start with $x = 0$. For each element $e \in E$, as long as the marginal gain of adding $\chi_e$ to the current solution $x$ is positive, we add it to $x$. The reason that we do not choose the element with the maximum marginal gain is to reduce the number of oracle calls, and our preliminary experiments showed that such a tweak does not improve the solution quality.
• Double Greedy (DG, Algorithm 1).
• Lattice Double Greedy (Lattice-DG): The $\frac{1}{3}$-approximation algorithm for maximizing non-monotone lattice submodular functions (Gottschalk and Peis 2015).
• Double Greedy with a polynomial number of oracle calls with error parameter $\varepsilon > 0$ (Fast-DG$_\varepsilon$, Algorithm 3).

We measure the efficiency of an algorithm by the number of oracle calls instead of the total running time. Indeed, the running time for evaluating the input function is the dominant factor of the total running time, because objective functions in typical machine learning tasks contain sums over all data points, which is time-consuming. Therefore, we do not consider the polynomial-time algorithm (Theorem 7) here.
Revenue maximization
In this application, we consider revenue maximization on an (undirected) social network $G = (V, W)$, where $W = (w_{ij})_{i,j \in V}$ represents the weights of edges. The goal is to offer for free or advertise a product to users so that the revenue increases through their word-of-mouth effect on others. If we invest $x$ units of cost in a user $i \in V$, the user becomes an advocate of the product (independently from other users) with probability $1 - (1-p)^x$, where $p \in (0, 1)$ is a parameter. This means that, by investing a unit cost in $i$, we have an extra chance that the user $i$ becomes an advocate with probability $p$. Let $S \subseteq V$ be the set of users who advocate the product. Note that $S$ is a random set. Following a simplified version of the model introduced by (Hartline, Mirrokni, and Sundararajan 2008), the revenue is defined as $\sum_{i \in S}\sum_{j \in V \setminus S} w_{ij}$. Let $f: \mathbb{Z}_+^V \to \mathbb{R}$ be the expected revenue obtained in this model, that is,
\[
f(x) = \mathbf{E}_S\Big[\sum_{i \in S}\sum_{j \in V \setminus S} w_{ij}\Big]
= \sum_{i \in V}\sum_{j \in V \setminus \{i\}} w_{ij}\,\big(1 - (1-p)^{x(i)}\big)(1-p)^{x(j)}.
\]
It is not hard to show that $f$ is a non-monotone DR-submodular function (see Appendix B for the proof); a small sketch of this objective in code is given below.
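A direct implementation of this expected-revenue objective may clarify the model; the adjacency-dictionary format is an assumption of this sketch.

```python
def expected_revenue(W, x, p):
    """f(x) = sum over i != j of w_ij (1 - (1-p)^x(i)) (1-p)^x(j).
    W: dict of dicts of edge weights; x: dict of per-user investments."""
    q = 1.0 - p
    total = 0.0
    for i, nbrs in W.items():
        for j, w in nbrs.items():
            if i != j:
                total += w * (1.0 - q ** x[i]) * (q ** x[j])
    return total

# Tiny example on a single undirected edge {u, v} with unit weight.
W = {"u": {"v": 1.0}, "v": {"u": 1.0}}
print(expected_revenue(W, {"u": 2, "v": 0}, p=0.3))  # (1 - 0.7^2) * 1 = 0.51
```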
In our experiment, we used three networks: Adolescent health (2,539 vertices and 12,969 edges), Advogato (6,541 vertices and 61,127 edges), and Twitter lists (23,370 vertices and 33,101 edges), all taken from (Kunegis 2013). We regard all the networks as undirected. We used a fixed value of $p$, and set $w_{ij} = 1$ when an edge exists between $i$ and $j$ and $w_{ij} = 0$ otherwise. We imposed the constraint $0 \le x(e) \le B$ for every $e \in E$, where $B$ is chosen from a range of powers of $10$.

[Table 1: Objective values obtained by each method (our methods are highlighted in gray). For each network, rows correspond to the choices of $B$ and columns to SG, DG, Lattice-DG, and Fast-DG$_\varepsilon$ for three values of $\varepsilon$.]

Table 1 shows the objective values obtained by each method. As can be seen, except for Lattice-DG, which is clearly the worst, the choice of a method does not much affect the obtained objective value for all the networks. Notably, even for the largest value of $\varepsilon$ we tried, the objective values obtained by Fast-DG$_\varepsilon$ are almost the same as those of SG and DG.

[Figure 1: Number of oracle calls as a function of $B$ for SG, DG, Lattice-DG, and Fast-DG$_\varepsilon$ on (a) Adolescent health, (b) Advogato, and (c) Twitter lists.]

Figure 1 illustrates the number of oracle calls of each method. The number of oracle calls of DG and Lattice-DG is linear in $B$, whereas that of Fast-DG$_\varepsilon$ grows slowly. Although the number of oracle calls of SG also grows slowly, it is always orders of magnitude larger than that of Fast-DG$_\varepsilon$ with the two smaller values of $\varepsilon$.

In summary, Fast-DG$_\varepsilon$ achieves almost the best objective value, whereas its number of oracle calls is two or three orders of magnitude smaller than those of the other methods when $B$ is large.

Conclusions

In this paper, we proposed a polynomial-time $\frac{1}{2+\varepsilon}$-approximation algorithm for non-monotone DR-submodular function maximization. Our experimental results on the revenue maximization problem showed the superiority of our method against other baseline algorithms.

Maximizing a submodular set function under constraints is well studied (Lee et al. 2009; Gupta et al. 2010; Chekuri, Vondrák, and Zenklusen 2014; Mirzasoleiman et al. 2016). An intriguing open question is whether we can obtain polynomial-time algorithms for maximizing DR-submodular functions under constraints such as cardinality constraints, polymatroid constraints, and knapsack constraints.
Acknowledgments
T. S. is supported by a JSPS Grant-in-Aid for Research Activity Start-up. Y. Y. is supported by JSPS Grant-in-Aid for Young Scientists (B) (No. 26730009), MEXT Grant-in-Aid for Scientific Research on Innovative Areas (No. 24106003), and JST, ERATO, Kawarabayashi Large Graph Project.

References

[Alon, Gamzu, and Tennenholtz 2012] Alon, N.; Gamzu, I.; and Tennenholtz, M. 2012. Optimizing budget allocation among channels and influencers. In WWW, 381–388.
[Badanidiyuru and Vondrák 2014] Badanidiyuru, A., and Vondrák, J. 2014. Fast algorithms for maximizing submodular functions. In SODA, 1497–1514.
[Bian et al. 2016] Bian, Y.; Mirzasoleiman, B.; Buhmann, J. M.; and Krause, A. 2016. Guaranteed non-convex optimization: Submodular maximization over continuous domains. CoRR abs/1606.05615.
[Buchbinder and Feldman 2016] Buchbinder, N., and Feldman, M. 2016. Deterministic algorithms for submodular maximization problems. In SODA, 392–403.
[Buchbinder et al. 2012] Buchbinder, N.; Feldman, M.; Naor, J. S.; and Schwartz, R. 2012. A tight linear time (1/2)-approximation for unconstrained submodular maximization. In FOCS, 649–658.
[Buchbinder et al. 2014] Buchbinder, N.; Feldman, M.; Naor, J. S.; and Schwartz, R. 2014. Submodular maximization with cardinality constraints. In SODA, 1433–1452.
[Chekuri, Vondrák, and Zenklusen 2014] Chekuri, C.; Vondrák, J.; and Zenklusen, R. 2014. Submodular function maximization via the multilinear relaxation and contention resolution schemes. SIAM Journal on Computing.
[Devroye 1986] Devroye, L. 1986. Non-Uniform Random Variate Generation. Springer.
[Ene and Nguyen 2016] Ene, A., and Nguyen, H. L. 2016. A reduction for optimizing lattice submodular functions with diminishing returns. CoRR abs/1606.08362.
[Feige, Mirrokni, and Vondrak 2011] Feige, U.; Mirrokni, V. S.; and Vondrak, J. 2011. Maximizing non-monotone submodular functions. SIAM Journal on Computing.
[Fujishige 2005] Fujishige, S. 2005. Submodular Functions and Optimization. Elsevier, 2nd edition.
[Golovin and Krause 2011] Golovin, D., and Krause, A. 2011. Adaptive submodularity: Theory and applications in active learning and stochastic optimization. Journal of Artificial Intelligence Research.
[Gotovos, Karbasi, and Krause 2015] Gotovos, A.; Karbasi, A.; and Krause, A. 2015. Non-monotone adaptive submodular maximization. In AAAI, 1996–2003.
[Gottschalk and Peis 2015] Gottschalk, C., and Peis, B. 2015. Submodular function maximization on the bounded integer lattice. In WAOA, 133–144.
[Gupta et al. 2010] Gupta, A.; Roth, A.; Schoenebeck, G.; and Talwar, K. 2010. Constrained non-monotone submodular maximization: offline and secretary algorithms. In WINE, 246–257.
[Hartline, Mirrokni, and Sundararajan 2008] Hartline, J.; Mirrokni, V.; and Sundararajan, M. 2008. Optimal marketing strategies over social networks. In WWW, 189–198.
[Ko, Lee, and Queyranne 1995] Ko, C. W.; Lee, J.; and Queyranne, M. 1995. An exact algorithm for maximum entropy sampling. Operations Research.
[Krause and Golovin 2014] Krause, A., and Golovin, D. 2014. Submodular function maximization. In Tractability: Practical Approaches to Hard Problems. Cambridge University Press. 71–104.
[Krause and Horvitz 2008] Krause, A., and Horvitz, E. 2008. A utility-theoretic approach to privacy and personalization. In AAAI, 1181–1188.
[Krause, Singh, and Guestrin 2008] Krause, A.; Singh, A.; and Guestrin, C. 2008. Near-optimal sensor placements in Gaussian processes: theory, efficient algorithms and empirical studies. The Journal of Machine Learning Research.
[Kunegis 2013] Kunegis, J. 2013. KONECT: the Koblenz network collection. In WWW Companion, 1343–1350.
[Lee et al. 2009] Lee, J.; Mirrokni, V. S.; Nagarajan, V.; and Sviridenko, M. 2009. Non-monotone submodular maximization under matroid and knapsack constraints. In STOC, 323–332.
[Milgrom and Strulovici 2009] Milgrom, P., and Strulovici, B. 2009. Substitute goods, auctions, and equilibrium. Journal of Economic Theory.
[Mirzasoleiman et al. 2016] Mirzasoleiman, B.; Badanidiyuru, A.; and Karbasi, A. 2016. Fast constrained submodular maximization: Personalized data summarization. In ICML'16: Proceedings of the 33rd International Conference on Machine Learning (ICML).
[Soma and Yoshida 2015] Soma, T., and Yoshida, Y. 2015. A generalization of submodular cover via the diminishing return property on the integer lattice. In NIPS, 847–855.
[Soma and Yoshida 2016] Soma, T., and Yoshida, Y. 2016. Maximizing submodular functions with the diminishing return property over the integer lattice. In IPCO, 325–336.
[Soma et al. 2014] Soma, T.; Kakimura, N.; Inaba, K.; and Kawarabayashi, K. 2014. Optimal budget allocation: Theoretical guarantee and efficient algorithm. In ICML, 351–359.
A Proof of Theorem 7
A key observation for obtaining an approximation algorithm with polynomial time complexity is that the approximate functions $\tilde g$ and $\tilde h$ used in Algorithm 3 are piecewise constant. Hence, while $x(e)$ and $y(e)$ lie in intervals on which $\tilde g$ and $\tilde h$, respectively, are constant, the values of $\alpha$ and $\beta$ do not change. This means that we repeat the same random process in the while loop of Algorithm 3 as long as $x(e)$ and $y(e)$ lie in these intervals. We will show that we can simulate the entire random process in polynomial time. Because the number of possible values of $\tilde g$ and $\tilde h$ is bounded by $O(\frac{1}{\varepsilon}\log\frac{\Delta}{\delta})$, we obtain a polynomial-time algorithm.

As the model of computation, we assume that we can perform an elementary arithmetic operation on real numbers in constant time, and that we can sample a uniform $[0, 1]$ random variable in constant time.
Algorithm 4 Subroutine to simulate random processes for Algorithm 5
Input: $p \in [0, 1]$, $\ell_a, \ell_b, \ell_{a+b} \in \mathbb{Z}_+$, $\eta \in (0, 1)$.
 1: $q \leftarrow 1 - p$.
 2: $N \leftarrow O\big(\log\big(\frac{1}{\eta}\log(\ell_a + \ell_b + \ell_{a+b})\big)\big)$.
 3: $a \leftarrow 0$, $b \leftarrow 0$.
 4: while $a < \ell_a$, $b < \ell_b$, and $a + b < \ell_{a+b}$ do
 5:   while $\ell_a - a \le N$ or $\ell_{a+b} - (a+b) \le N$ do
 6:     $s \leftarrow$ a value sampled from $\mathcal{G}(q)$.
 7:     if $b + s \le \ell_b$ and $a + b + s \le \ell_{a+b}$ then
 8:       $a \leftarrow a + 1$, $b \leftarrow b + s$.
 9:     else
10:       return $(a, \min\{\ell_b, \ell_{a+b} - a\})$.
11:     end if
12:   end while
13:   while $\ell_b - b \le N$ do
14:     $s \leftarrow$ a value sampled from $\mathcal{G}(p)$.
15:     if $a + s \le \ell_a$ and $a + b + s \le \ell_{a+b}$ then
16:       $a \leftarrow a + s$, $b \leftarrow b + 1$.
17:     else
18:       return $(\min\{\ell_a, \ell_{a+b} - b\}, b)$.
19:     end if
20:   end while
21:   $n \leftarrow \min\{\lfloor\frac{\ell_a - a}{p}\rfloor, \lfloor\frac{\ell_b - b}{q}\rfloor, \ell_{a+b} - (a+b)\}$, $m \leftarrow \lfloor n/2 \rfloor$.
22:   $s \leftarrow$ a value sampled from $\mathcal{B}(m, p)$.
23:   $a \leftarrow a + s$, $b \leftarrow b + m - s$.
24:   if $a > \ell_a$, $b > \ell_b$, or $a + b > \ell_{a+b}$ then
25:     Fail.
26:   end if
27: end while
28: return $(a, b)$.

The first two ingredients for simulating the random process are sampling procedures for a binomial distribution and a geometric distribution. For $n \in \mathbb{N}$ and $p \in [0, 1]$, let $\mathcal{B}(n, p)$ be the binomial distribution with mean $np$ and variance $np(1-p)$. For $p \in [0, 1]$, let $\mathcal{G}(p)$ be the geometric distribution with mean $1/p$, that is, $\Pr_{X \sim \mathcal{G}(p)}[X = k] = (1-p)^{k-1}p$ for $k \ge 1$. We then have the following:

Lemma 8 (See, e.g., (Devroye 1986)). For any $n \in \mathbb{N}$ and $p \in [0, 1]$, we can sample a value from the binomial distribution $\mathcal{B}(n, p)$ in $O(\log n)$ time.

Lemma 9 (See, e.g., (Devroye 1986)). For any $p \in [0, 1]$, we can sample a value from the geometric distribution $\mathcal{G}(p)$ in $O(1)$ time.

We consider the following random process parameterized by $p \in [0, 1]$ and integers $\ell_a, \ell_b, \ell_{a+b} \in \mathbb{Z}_+$, which we denote by $P(p, \ell_a, \ell_b, \ell_{a+b})$: We start with $a = b = 0$. While $a < \ell_a$, $b < \ell_b$, and $a + b < \ell_{a+b}$, we increment $a$ with probability $p$, and $b$ with the complement probability $1 - p$. Note that, at the end of the process, we have $a = \ell_a$, $b = \ell_b$, or $a + b = \ell_{a+b}$. Let $D(p, \ell_a, \ell_b, \ell_{a+b})$ be the distribution of the pair $(a, b)$ generated by $P(p, \ell_a, \ell_b, \ell_{a+b})$.

We introduce an efficient procedure (Algorithm 4) that succeeds in simulating the process $P(p, \ell_a, \ell_b, \ell_{a+b})$ with high probability. To prove the correctness of Algorithm 4, we use the following form of Chernoff's bound.

Lemma 10 (Chernoff's bound). Let $X_1, \ldots, X_n$ be independent random variables taking values in $\{0, 1\}$. Let $X = \sum_{i=1}^n X_i$ and $\mu = \mathbf{E}[X]$. Then, for any $\delta \in (0, 1]$, we have $\Pr[X \ge (1+\delta)\mu] \le \exp(-\delta^2\mu/3)$.
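To clarify what Algorithm 4 must reproduce, here is a naive one-flip-at-a-time simulator of the process $P(p, \ell_a, \ell_b, \ell_{a+b})$, together with the binomial batching step that Algorithm 4 uses to advance $m$ steps at once (via Lemma 8). This is a sketch using numpy's samplers, not the authors' code.

```python
import random
import numpy as np

def simulate_process_naive(p, la, lb, lab, seed=0):
    """The process P(p, la, lb, l_{a+b}): increment a w.p. p, else b,
    until a = la, b = lb, or a + b = lab. Takes Theta(la + lb) flips."""
    rng = random.Random(seed)
    a = b = 0
    while a < la and b < lb and a + b < lab:
        if rng.random() < p:
            a += 1
        else:
            b += 1
    return a, b

def binomial_jump(rng, m, p):
    """Advance m steps of the process at once: s of them increment a and
    m - s increment b, where s ~ B(m, p) (sampled in O(log m) by Lemma 8)."""
    s = int(rng.binomial(m, p))
    return s, m - s

print(simulate_process_naive(0.3, la=5, lb=8, lab=10))
print(binomial_jump(np.random.default_rng(0), m=10, p=0.3))
```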
Lemma 11. We have the following:
• Algorithm 4 succeeds in returning a pair $(a, b)$ with probability at least $1 - \eta$.
• Conditioned on returning a pair, the pair $(a, b)$ output by Algorithm 4 is distributed according to $D(p, \ell_a, \ell_b, \ell_{a+b})$.
• The pair $(a, b)$ output by Algorithm 4 satisfies at least one of the following: $a = \ell_a$, $b = \ell_b$, or $a + b = \ell_{a+b}$.
• The time complexity of Algorithm 4 is $O\big(\log(\frac{1}{\eta}\log\ell) \cdot \log\ell\big)$, where $\ell = \ell_a + \ell_b + \ell_{a+b}$.

Proof. We first note that once (i) $\ell_a - a \le N$, (ii) $\ell_b - b \le N$, or (iii) $\ell_{a+b} - (a+b) \le N$ holds, we stay in the while loop from Line 5 or from Line 13 until the end of the algorithm.

We check the first claim. Suppose that none of (i), (ii), and (iii) holds. Then, we reach Line 21. Here, we intend to simulate the process $P(p, \ell_a, \ell_b, \ell_{a+b})$ to a point where we will have incremented $a$ and $b$ $m = \lfloor n/2 \rfloor$ times in total. By the union bound and Chernoff's bound, the probability that we fail can be bounded by
\[
\Pr\big[s > \ell_a - a \,\vee\, m - s > \ell_b - b\big]
\le \exp\big(-\Omega(\ell_a - a)\big) + \exp\big(-\Omega(\ell_b - b)\big)
\le 2^{-\Omega(N)},
\]
where we used $\mathbf{E}[s] = pm \le (\ell_a - a)/2$ and $\mathbf{E}[m - s] = qm \le (\ell_b - b)/2$ by the choice of $n$ (note that $m \le \ell_{a+b} - (a+b)$ always holds), together with $\ell_a - a, \ell_b - b > N$ at Line 21.

When we do not fail, at least one of the following three values shrinks by half: $\ell_a - a$, $\ell_b - b$, and $\ell_{a+b} - (a+b)$. Hence, after $O(\log(\ell_a + \ell_b + \ell_{a+b}))$ iterations of the while loop (from Line 4), at least one of (i), (ii), or (iii) is satisfied. Once this happens, we do not fail and output a pair $(a, b)$. By the union bound, the failure probability is at most $2^{-\Omega(N)} \cdot O(\log(\ell_a + \ell_b + \ell_{a+b})) \le \eta$, by choosing the hidden constant in $N$ large enough.

Next, we check the second claim. From the argument above, as long as none of (i), (ii), and (iii) is satisfied, we exactly simulate the process $P(p, \ell_a, \ell_b, \ell_{a+b})$. Hence, suppose that (i) is satisfied. Then, until $a$ reaches $\ell_a$, we sample $s$ from the geometric distribution $\mathcal{G}(q) = \mathcal{G}(1-p)$; if $b + s \le \ell_b$ and $a + b + s \le \ell_{a+b}$, then we update $a$ by $a + 1$ and $b$ by $b + s$, and otherwise we output the pair $(a, \min\{\ell_b, \ell_{a+b} - a\})$.
If $a$ reaches $\ell_a$, then we output the pair $(a = \ell_a, b)$. This can be seen as an efficient simulation of the process $P(p, \ell_a, \ell_b, \ell_{a+b})$. The cases in which (ii) or (iii) is satisfied can be analyzed similarly, and the second claim holds.

The third and fourth claims are obvious from the definition of the algorithm. ∎

Our idea for simulating Algorithm 3 efficiently is as follows. Suppose we have $\alpha = \tilde g(x(e))$ and $\beta = \tilde h(B(e) - y(e))$ for the current $x$ and $y$. Let $s = \max\{b \mid \tilde g(b) = \alpha\}$ and $t = B(e) - \max\{b \mid \tilde h(b) = \beta\}$. Then, $\tilde g$ and $\tilde h$ are constant on the intervals $[x(e), \ldots, s]$ and $[B(e) - y(e), \ldots, B(e) - t]$, respectively. By running Algorithm 4 with $p = \alpha/(\alpha+\beta)$, $\ell_a = s - x(e)$, $\ell_b = y(e) - t$, and $\ell_{a+b} = y(e) - x(e)$, we can simulate Algorithm 3 to a point where at least one of the following happens: $x(e)$ reaches $s$, $y(e)$ reaches $t$, or $x(e)$ is equal to $y(e)$. When Algorithm 4 fails to output a pair, we output an arbitrary feasible solution, say, the zero vector $0$. Algorithm 5 presents a formal description of the algorithm.

Algorithm 5 Polynomial-time approximation algorithm
Input: $f: \mathbb{Z}_+^E \to \mathbb{R}_+$, $B \in \mathbb{Z}_+^E$, $\varepsilon > 0$. Assumption: $f$ is DR-submodular.
 1: $x \leftarrow 0$, $y \leftarrow B$.
 2: for $e \in E$ do
 3:   Define $g, h: \{0, 1, \ldots, B(e)-1\} \to \mathbb{R}$ as $g(b) = f(\chi_e \mid x + b\chi_e)$ and $h(b) = f(-\chi_e \mid y - b\chi_e)$.
 4:   Let $\tilde g$ and $\tilde h$ be approximations to $g$ and $h$, respectively, given by Algorithm 2.
 5:   while $x(e) < y(e)$ do
 6:     $\alpha \leftarrow \tilde g(x(e))$ and $\beta \leftarrow \tilde h(B(e) - y(e))$.
 7:     if $\beta = 0$ then
 8:       $x(e) \leftarrow y(e)$ and break.
 9:     else if $\alpha = 0$ then
10:       $y(e) \leftarrow x(e)$ and break.
11:     else
12:       $s \leftarrow \max\{b \mid \tilde g(b) = \alpha\}$ and $t \leftarrow B(e) - \max\{b \mid \tilde h(b) = \beta\}$.
13:       Call Algorithm 4 with $p = \frac{\alpha}{\alpha+\beta}$, $\ell_a = s - x(e)$, $\ell_b = y(e) - t$, $\ell_{a+b} = y(e) - x(e)$, and $\eta = O\big(\frac{\varepsilon^2}{(2+\varepsilon)|E|\log(\Delta/\delta)}\big)$.
14:       if Algorithm 4 returned a pair $(a, b)$ then
15:         $x(e) \leftarrow x(e) + a$ and $y(e) \leftarrow y(e) - b$.
16:       else
17:         return $0$.
18:       end if
19:     end if
20:   end while
21: end for
22: return $x$.
Proof of Theorem 7. We first analyze the failure probability. Since the number of possible values of $\tilde g$ and $\tilde h$ is bounded by $O(\frac{1}{\varepsilon}\log\frac{\Delta}{\delta})$ for each $e \in E$, we call Algorithm 4 $O(\frac{|E|}{\varepsilon}\log\frac{\Delta}{\delta})$ times by the third claim of Lemma 11. Hence, by the first claim of Lemma 11 and the union bound, the failure probability is at most $\frac{\varepsilon}{2+2\varepsilon}$ if the hidden constant in $\eta$ at Line 13 is chosen to be small enough.

Let $D$ and $D'$ be the distributions of the outputs of Algorithms 3 and 5, respectively. Conditioned on the event that Algorithm 4 never fails (recall that upon failure we output the zero vector, whose value $f(0)$ is non-negative), $D'$ exactly matches $D$ by the second claim of Lemma 11. Letting $o$ be the optimal solution, we have, by Theorem 5, that
\[
\mathbf{E}_{x \sim D'}[f(x)] \ge \Big(1 - \frac{\varepsilon}{2+2\varepsilon}\Big)\,\mathbf{E}_{x \sim D}[f(x)] \ge \frac{1}{2+2\varepsilon}\,f(o).
\]
Hence, we have an approximation factor of $\frac{1}{2+2\varepsilon}$; replacing $\varepsilon$ with $\varepsilon/2$ yields the claimed factor of $\frac{1}{2+\varepsilon}$.

The number of oracle calls is exactly the same as in Algorithm 3. The total time spent inside Algorithm 4 is
\[
O\Big(\frac{|E|}{\varepsilon}\log\frac{\Delta}{\delta}\Big) \cdot O\Big(\log\Big(\frac{|E|}{\varepsilon}\log\frac{\Delta}{\delta}\log\|B\|_\infty\Big)\log\|B\|_\infty\Big)
= \tilde O\Big(\frac{|E|}{\varepsilon}\log\frac{\Delta}{\delta}\log\|B\|_\infty\Big)
\]
by the fourth claim of Lemma 11. Because evaluating $\tilde g$ and $\tilde h$ takes $O(\log\|B\|_\infty)$ time and computing $s$ and $t$ also takes $O(\log\|B\|_\infty)$ time, we need
\[
O\Big(\frac{|E|}{\varepsilon}\log\frac{\Delta}{\delta}\Big) \cdot O(\log\|B\|_\infty) = O\Big(\frac{|E|}{\varepsilon}\log\frac{\Delta}{\delta}\log\|B\|_\infty\Big)
\]
time. Summing them up, we obtain the stated time complexity. ∎

We again note that, even if the given DR-submodular function $f: \mathbb{Z}_+^E \to \mathbb{R}$ is not non-negative, we can obtain a $\frac{1}{2+\varepsilon}$-approximation algorithm for (5), as stated in Remark 6.

B DR-submodularity of functions used in experiments
In this section, we show that the objective function used in the experiments is indeed DR-submodular. Recall that our objective function for revenue maximization is
\[
f(x) = \sum_{i \in V}\sum_{j \in V \setminus \{i\}} w_{ij}\,\big(1 - (1-p)^{x(i)}\big)(1-p)^{x(j)},
\]
where $w_{ij}$ is a non-negative weight and $p \in [0, 1]$ is a parameter. Since DR-submodular functions are closed under non-negative linear combination, it suffices to check that $g(x) = (1 - q^{x(i)})q^{x(j)}$ is DR-submodular, where $q = 1 - p$. To see the DR-submodularity of $g$, we need to check that
\[
g(\chi_i \mid x + \chi_j) \le g(\chi_i \mid x) \qquad (x \in \mathbb{Z}_+^E,\; i, j \in E).
\]
Note that $i$ and $j$ may be identical. By direct algebra,
\[
g(\chi_i \mid x) = (1 - q^{x(i)+1})q^{x(j)} - (1 - q^{x(i)})q^{x(j)} = q^{x(i)+x(j)}(1-q),
\]
\[
g(\chi_i \mid x + \chi_j) = q^{x(i)+x(j)+1}(1-q) = q\,g(\chi_i \mid x).
\]
Since $q \in [0, 1]$, we obtain $g(\chi_i \mid x + \chi_j) \le g(\chi_i \mid x)$, as desired.
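A quick numeric check of the displayed identities (a sketch; the loop ranges and the value of $q$ are arbitrary):

```python
def g_marginal(q, xi, xj):
    """g(chi_i | x) for g(x) = (1 - q^x(i)) q^x(j); equals q^(x(i)+x(j)) (1-q)."""
    return (1 - q ** (xi + 1)) * q ** xj - (1 - q ** xi) * q ** xj

q = 0.7
for xi in range(5):
    for xj in range(5):
        lhs = g_marginal(q, xi, xj + 1)   # g(chi_i | x + chi_j), i != j
        rhs = q * g_marginal(q, xi, xj)   # q * g(chi_i | x)
        assert abs(lhs - rhs) < 1e-12
        assert lhs <= g_marginal(q, xi, xj) + 1e-12  # the DR inequality
print("g(chi_i | x + chi_j) = q * g(chi_i | x) <= g(chi_i | x): verified")
```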