Continuous Submodular Function Maximization
Yatao Bian∗ [email protected]
Tencent AI Lab, Shenzhen, China 518057
Joachim M. Buhmann [email protected]
Department of Computer Science, ETH Zurich, 8092 Zurich, Switzerland
Andreas Krause [email protected]
Department of Computer Science, ETH Zurich, 8092 Zurich, Switzerland
Abstract
Continuous submodular functions are a category of generally non-convex/non-concave functions with a wide spectrum of applications. The celebrated property of this class of functions, continuous submodularity, enables both exact minimization and approximate maximization in polynomial time. Continuous submodularity is obtained by generalizing the notion of submodularity from discrete domains to continuous domains. It intuitively captures a repulsive effect amongst different dimensions of the defined multivariate function. In this paper, we systematically study continuous submodularity and a class of non-convex optimization problems: continuous submodular function maximization. We start with a thorough characterization of the class of continuous submodular functions, and show that continuous submodularity is equivalent to a weak version of the diminishing returns (DR) property. Thus we also derive a subclass of continuous submodular functions, termed continuous DR-submodular functions, which enjoys the full DR property. Then we present operations that preserve continuous (DR-)submodularity, thus yielding general rules for composing new submodular functions. We establish intriguing properties for the problem of constrained DR-submodular maximization, such as the local-global relation, which captures the relationship between locally (approximately) stationary points and global optima. We identify several applications of continuous submodular optimization, ranging from influence maximization with general marketing strategies and MAP inference for DPPs to mean field inference for probabilistic log-submodular models. For these applications, continuous submodularity formalizes valuable domain knowledge relevant for optimizing this class of objectives. We present inapproximability results and provable algorithms for two problem settings: constrained monotone DR-submodular maximization and constrained non-monotone DR-submodular maximization. Finally, we extensively evaluate the effectiveness of the proposed algorithms on different problem instances, such as influence maximization with marketing strategies and revenue maximization with continuous assignments.
Keywords:
Continuous submodularity, Continuous DR-submodularity, Submodular function maximization, Provable non-convex optimization, Revenue maximization

∗. Corresponding author. Most of the work was conducted while Y. Bian was at ETH Zurich. Y. Bian's ORCID id is: orcid.org/0000-0002-2368-4084

© https://creativecommons.org/licenses/by/4.0/
1. Introduction
Submodularity is a structural property usually associated with set functions, with important implications for optimization (Nemhauser et al., 1978). The general setup requires a ground set V containing n items, which could be, for instance, the set of all features in a given supervised learning problem (Das and Kempe, 2011), or the set of all users in the influence maximization problem (Kempe et al., 2003). Usually, we have an objective function that maps a subset of V to a real value: F(X) : 2^V → R_+. This function often quantifies utility, coverage, relevance, diversity etc. Equivalently, one can express any subset X as a binary vector x ∈ {0, 1}^n. Hereby, for component i of x, x_i = 1 means that item i is inside X; otherwise item i is outside of X. This binary representation associates the power set of V with all vertices of an n-dimensional hypercube. Because of this, we also call submodularity of set functions "submodularity over binary domains" or "binary submodularity".

Over binary domains, there are two well-known definitions of submodularity: the lattice definition and the diminishing returns (DR) definition.

Definition 1 (Lattice definition)
A set function F : 2^V → R_+ is submodular iff ∀ X, Y ⊆ V it holds that

F(X) + F(Y) ≥ F(X ∪ Y) + F(X ∩ Y).    (1)

One can easily show that it is equivalent to the following DR definition:

Definition 2 (DR definition)
A set function F(X) : 2^V → R_+ is submodular iff ∀ A ⊆ B ⊆ V and ∀ v ∈ V \ B, it holds that

F(A ∪ {v}) − F(A) ≥ F(B ∪ {v}) − F(B).    (2)

Optimizing submodular set functions has found numerous applications in machine learning, including variable selection (Krause and Guestrin, 2005), dictionary learning (Krause and Cevher, 2010; Das and Kempe, 2011), sparsity inducing regularizers (Bach, 2010), summarization (Gomes and Krause, 2010; Lin and Bilmes, 2011a; Mirzasoleiman et al., 2013) and variational inference (Djolonga and Krause, 2014b). Submodular set functions can be efficiently minimized (Iwata et al., 2001), and there are strong guarantees for approximate maximization (Nemhauser et al., 1978; Krause and Golovin, 2012).

Even though submodularity is most widely considered in the discrete setting, the notion can be generalized to arbitrary lattices (Fujishige, 2005). Of particular interest are lattices over real vectors, which can be used to define submodularity over continuous domains (Topkis, 1978; Bach, 2015; Bian et al., 2017b). But one may wonder: why do we need continuous submodularity?

In summary, there are two motivations for studying continuous submodularity: i) It is an important modeling ingredient for many real-world applications; ii)
It captures a subclass of well-behaved non-convex optimization problems, which admits guaranteed optimization with algorithms running in polynomial time. In the following, we will informally illustrate these two aspects.
Natural Prior Knowledge for Modeling. In order to illustrate the first motivation, let us consider a stylized scenario. Suppose you got stuck in the desert one day, and became extremely thirsty. After two days of exploration you found a bottle of water. What is even better is that you also found a bottle of soda.

We will use a two-dimensional function f([x_1; x_2]) to quantify the "happiness" gained by having a quantity x_1 of water and a quantity x_2 of soda. Let δ = [50 ml water; 50 ml soda]. Now it is natural to see that the following inequality shall hold: f([1 ml; 1 ml] + δ) − f([1 ml; 1 ml]) ≥ f([100 ml; 100 ml] + δ) − f([100 ml; 100 ml]). The LHS of the inequality measures the marginal gain of happiness from having δ more [water, soda] in a small context ([1 ml; 1 ml]), while the RHS measures the marginal gain in a large context ([100 ml; 100 ml]). This is a typical example of the well-known diminishing returns (DR) phenomenon, which will be formally defined in Section 3.1. The DR property models the context-sensitive expectation that adding one more unit of resource contributes more in a small context than in a large context.

This example illustrates that diminishing returns effects naturally occur in continuous domains, not only discrete ones. While related to concavity, we will see that continuous submodularity yields complementary means of modeling diminishing returns effects over continuous domains. Real-world examples comprise user preferences in recommender systems, customer satisfaction, influence in social advertisements etc.
Non-Convex Structure enabling Provable Optimization. Non-convex optimization is a core challenge in machine learning, and arises in numerous learning tasks, from training deep neural networks (Bottou et al., 2018) to latent variable models (Anandkumar et al., 2014). A fundamental problem in non-convex optimization is to reach a stationary point, assuming smoothness of the objective, for unconstrained optimization (Sra, 2012; Li and Lin, 2015; Reddi et al., 2016a; Allen-Zhu and Hazan, 2016) or constrained optimization problems (Ghadimi et al., 2016; Lacoste-Julien, 2016). However, without further assumptions, a stationary point may in general have an arbitrarily poor objective value. It thus remains a challenging problem to understand which classes of non-convex objectives can be tractably optimized.

In pursuit of solving this challenging problem, we show that continuous submodularity provides a natural structure for provable non-convex optimization. It arises in various important non-convex objectives. Let us look at a simple example by considering a classical quadratic program (QP): f(x) = ½ x^⊤ H x + h^⊤ x + c. When H is symmetric, we know that the Hessian matrix is ∇²f = H. Consider a specific two-dimensional example where all entries of H are non-positive; then f is a DR-submodular function (see definitions in Section 3). In this paper, we propose polynomial-time solvers for optimizing such objectives with strong approximation guarantees. Further examples of submodular objectives include the Lovász (Lovász, 1983) and multilinear extensions (Calinescu et al., 2007) of submodular set functions, or the softmax extension (Gillenwater et al., 2012) for DPP (determinantal point process) MAP inference.
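To make the quadratic example concrete, the following Python/NumPy sketch builds such a DR-submodular quadratic and checks both the second-order condition and the antitone-gradient behavior discussed in Section 3; the specific matrix values are hypothetical, chosen only so that all entries of the Hessian are non-positive.

import numpy as np

# A DR-submodular quadratic f(x) = 0.5 * x^T H x + h^T x + c:
# all entries of the Hessian H are non-positive (hypothetical values).
H = np.array([[-1.0, -2.0],
              [-2.0, -1.0]])
h = np.array([3.0, 4.0])
c = 0.0

def f(x):
    return 0.5 * x @ H @ x + h @ x + c

def grad_f(x):
    return H @ x + h

# Second-order check: all Hessian entries are <= 0, so f is DR-submodular.
assert np.all(H <= 0)

# First-order check: the gradient is antitone, i.e.
# a <= b (coordinate-wise) implies grad_f(a) >= grad_f(b).
a = np.array([0.2, 0.1])
b = np.array([0.8, 0.9])
assert np.all(grad_f(a) >= grad_f(b))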
Organization of the Paper. We will present a brief background on submodular optimization, the classical Frank-Wolfe algorithm and existing structures for non-convex optimization in Section 2. In Section 3 we give a thorough characterization of the class of continuous submodular and DR-submodular functions. Section 4 presents general composition rules that preserve continuous (DR-)submodularity, along with exemplary applications of these rules, such as for designing deep submodular functions. Section 5 discusses intriguing properties of the problem of constrained DR-submodular maximization in both monotone and non-monotone settings, such as the local-global relation. In Section 6 we illustrate representative applications of continuous submodular optimization. In the next two sections we discuss hardness results and algorithmic techniques for constrained DR-submodular maximization in different settings: Section 7 illustrates how to maximize monotone continuous DR-submodular functions, and Section 8 provides techniques for maximizing non-monotone DR-submodular functions with a down-closed convex constraint. We present experimental results on three representative problems in Section 9. Lastly, Section 10 discusses and concludes the paper.
2. Background and Related Work
In this section we give a brief introduction to the background of submodular optimization.
Notation.
Throughout this work we assume V = {v_1, v_2, ..., v_n} is the ground set of n elements, and e_i ∈ R^n is the characteristic vector of element v_i (also the standard i-th basis vector). We use boldface letters x ∈ R^V and x ∈ R^n interchangeably to indicate an n-dimensional vector, where x_i is the i-th entry of x. We use a boldface capital letter A ∈ R^{m×n} to denote an m by n matrix and A_{ij} to denote its ij-th entry. By default, f(·) is used to denote a continuous function, and F(·) to represent a set function. For a differentiable function f(·), ∇f(·) denotes its gradient, and for a twice differentiable function f(·), ∇²f(·) denotes its Hessian. [n] := {1, ..., n} for an integer n ≥ 1. ‖·‖ means the Euclidean norm by default. Given two vectors x, y, x ≤ y means x_i ≤ y_i, ∀ i. x ∨ y and x ∧ y denote the coordinate-wise maximum and the coordinate-wise minimum, respectively. x|_i(k) is the operation of setting the i-th element of x to k, while keeping all other elements unchanged, i.e., x|_i(k) = x − x_i e_i + k e_i.

As a discrete analogue of convexity, submodularity provides a computationally effective structure, so that many discrete problems with this property can be efficiently solved or approximated. Of particular interest is a (1 − 1/e)-approximation for maximizing a monotone submodular set function subject to a cardinality, a matroid, or a knapsack constraint (Nemhauser et al., 1978; Vondrák, 2008; Sviridenko, 2004). For maximizing non-monotone submodular functions, a 0.325-approximation under cardinality and matroid constraints (Gharan and Vondrák, 2011), and a 0.2-approximation under a knapsack constraint have been shown (Lee et al., 2009). Another result pertains to unconstrained maximization of non-monotone submodular set functions, for which Buchbinder et al. (2012) propose the deterministic double greedy algorithm with a 1/3 approximation guarantee, and the randomized double greedy algorithm that achieves the tight 1/2 approximation guarantee.
1. A DR-submodular function is a submodular function with the additional diminishing returns (DR) property, which will be formally defined in Section 3.

Although most commonly associated with set functions, in many practical scenarios it is natural to consider generalizations of submodular set functions, including bisubmodular functions, k-submodular functions, tree-submodular functions, adaptive submodular functions, as well as submodular functions defined over integer lattices.

Golovin and Krause (2011) introduce the notion of adaptive submodularity to generalize submodular set functions to adaptive policies. Kolmogorov (2011) studies tree-submodular functions and presents a polynomial-time algorithm for minimizing them. For distributive lattices, it is well known that the combinatorial polynomial-time algorithms for minimizing a submodular set function can be adopted to minimize a submodular function over a bounded integer lattice (Fujishige, 2005).

Approximation algorithms for maximizing bisubmodular functions and k-submodular functions have been proposed by Singh et al. (2012); Ward and Zivny (2014). Recently, maximizing a submodular function over integer lattices has attracted considerable attention. In particular, Soma et al. (2014) develop a (1 − 1/e)-approximation algorithm for maximizing a monotone DR-submodular integer function under a knapsack constraint. For non-monotone submodular functions over the bounded integer lattice, Gottschalk and Peis (2015) provide a 1/3-approximation algorithm. Recently, Soma and Yoshida (2018) present a continuous non-smooth extension for maximizing monotone integer submodular functions.

Even though submodularity is most widely considered in the discrete realm, the notion can be generalized to arbitrary lattices (Fujishige, 2005). Wolsey (1982) considers maximizing a special class of continuous submodular functions subject to one knapsack constraint, in the context of solving location problems. That class of functions is additionally required to be monotone, piecewise linear and concave. Calinescu et al. (2007) and Vondrák (2008) discuss a subclass of continuous submodular functions, termed smooth submodular functions, to describe the multilinear extension of a submodular set function. They propose the continuous greedy algorithm, which has a (1 − 1/e) approximation guarantee for maximizing a smooth submodular function under a down-closed polytope constraint. Bach (2015) considers the problem of minimizing continuous submodular functions, and proves that efficient techniques from convex optimization may be used for minimization (Fujishige, 2005).

Ene and Nguyen (2016) provide an approach for reducing integer DR-submodular function maximization problems to submodular set function maximization problems. This approach suggests a way to approximately optimize continuous submodular functions over simple continuous constraints: discretize the continuous function and constraint to obtain an integer instance, and then optimize it using the reduction. However, for monotone DR-submodular function maximization, this method cannot handle the general continuous constraints discussed in this work, i.e., arbitrary down-closed convex sets. Moreover, for general submodular function maximization, this method cannot be applied, since the reduction needs the additional diminishing returns property. Therefore we focus on explicitly continuous methods in this work.
2. A function f : [0, 1]^n → R is smooth submodular if it has second partial derivatives everywhere and all entries of its Hessian matrix are non-positive.
Algorithm 1: Classical Frank-Wolfe algorithm for constrained convex optimization (Frank and Wolfe, 1956)
Input: problem min_{x ∈ R^n, x ∈ D} f(x); initializer x_0 ∈ D
for t = 0, 1, ..., T − 1 do
    Compute s_t := argmin_{s ∈ D} ⟨s, ∇f(x_t)⟩;    // LMO
    Choose step size γ_t ∈ (0, 1];
    Update x_{t+1} := (1 − γ_t) x_t + γ_t s_t;
Output: x_T

Recently, Niazadeh et al. (2018) present optimal algorithms for non-monotone submodular maximization with a box constraint. Continuous submodular maximization is also well studied in the stochastic setting (Karimi et al., 2017; Hassani et al., 2017; Mokhtari et al., 2018b), the online setting (Chen et al., 2018), the bandit setting (Dürr et al., 2019) and the decentralized setting (Mokhtari et al., 2018a).

Since the workhorse algorithms for continuous DR-submodular maximization are Frank-Wolfe style algorithms, we give a brief introduction to classical Frank-Wolfe algorithms in this section. The Frank-Wolfe algorithm (Frank and Wolfe, 1956) (also known as the conditional gradient algorithm or the projection-free algorithm) is one of the classical algorithms for constrained convex optimization. It has received renewed interest in recent years due to its projection-free nature and its ability to exploit structured constraints (Jaggi, 2013b). The Frank-Wolfe algorithm solves the following constrained optimization problem:

min_{x ∈ R^n, x ∈ D} f(x),    (3)

where f is differentiable with L-Lipschitz gradients and the constraint D is convex and compact.

A sketch of the Frank-Wolfe algorithm is presented in Algorithm 1. It needs an initializer x_0 ∈ D. Then it runs for T iterations. In each iteration it does the following: in Step 2 it solves a linear minimization problem whose objective is defined by the current gradient ∇f(x_t). This step is often called the linear minimization/maximization oracle (LMO). In Step 3 a step size γ is chosen. Then it updates the solution x to be a convex combination of the current solution and the LMO output s.

There are several popular rules to choose the step size in Step 3. For a short summary: i) γ_t := 2/(t + 2), which is often called the "oblivious" rule since it does not depend on any information of the optimization problem; ii) γ_t := min{1, g_t / (L ‖s_t − x_t‖²)}, where g_t := −⟨∇f(x_t), s_t − x_t⟩ is the so-called Frank-Wolfe gap, which is an upper bound on the suboptimality if f is convex; iii) the line search rule: γ_t := argmin_{γ ∈ [0,1]} f(x_t + γ(s_t − x_t)).
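As a concrete reference, the following Python sketch mirrors Algorithm 1; the linear minimization oracle is passed in as a function and the oblivious step size 2/(t + 2) is used. The test objective and the box constraint at the bottom are illustrative assumptions, not taken from the paper.

import numpy as np

def frank_wolfe(grad_f, lmo, x0, T=100):
    # Classical Frank-Wolfe (conditional gradient) for min_{x in D} f(x).
    # grad_f: callable returning the gradient at x.
    # lmo:    callable returning argmin_{s in D} <s, g> for a gradient g.
    # x0:     feasible initializer in D.
    x = np.array(x0, dtype=float)
    for t in range(T):
        g = grad_f(x)
        s = lmo(g)                       # linear minimization oracle
        gamma = 2.0 / (t + 2.0)          # "oblivious" step-size rule
        x = (1.0 - gamma) * x + gamma * s
    return x

# Illustrative usage: minimize a convex quadratic over the box [0, 1]^2.
A = np.array([[2.0, 0.5], [0.5, 1.0]])
b = np.array([1.0, 1.0])
grad_f = lambda x: A @ x - b
# For a box constraint, the LMO picks each coordinate's bound by gradient sign.
lmo_box = lambda g: np.where(g > 0, 0.0, 1.0)
x_hat = frank_wolfe(grad_f, lmo_box, x0=np.zeros(2))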
Frank-Wolfe Algorithm for Non-Convex Optimization. Recently, Frank-Wolfe algorithms have been extended to smooth non-convex optimization problems with constraints. Lacoste-Julien (2016) analyzes the Frank-Wolfe method for general constrained non-convex optimization problems, where he uses the Frank-Wolfe gap as the non-stationarity measure.
3. This work appeared later than the release of Bian et al. (2019).

Reddi et al. (2016b) study Frank-Wolfe methods for non-convex stochastic and finite-sum optimization problems. They also use the Frank-Wolfe gap as the non-stationarity measure.

Optimizing non-convex continuous functions has received considerable interest in the last decades. There are two widespread structures for non-convex optimization: quasi-convexity and geodesic convexity, both of which are based on relaxations of the classical convexity definition.
Quasi-Convexity.
A function f : D ↦ R defined on a convex subset D of a real vector space is quasi-convex if for all x, y ∈ D and λ ∈ [0, 1] it holds that

f(λ x + (1 − λ) y) ≤ max{f(x), f(y)}.    (4)

Quasi-convex optimization problems appear in different areas, such as industrial organization (Wolfstetter, 1999) and computer vision (Ke and Kanade, 2007). Quasi-convex optimization problems can be solved by a series of convex feasibility problems (Boyd and Vandenberghe, 2004). Hazan et al. (2015a) study stochastic quasi-convex optimization, where they prove that a stochastic version of normalized gradient descent converges to a global minimum for quasi-convex functions that are locally Lipschitz.
Geodesic Convexity. Geodesically convex functions are a class of generally non-convex functions in Euclidean space. However, they still enjoy the nice property that local optimality implies global optimality. Sra and Hosseini (2016) provide an introduction to geodesically convex optimization with machine learning applications. Recently, Vishnoi (2018) studies various aspects of geodesically convex optimization.
Definition 3 (Geodesically convex functions)
Let (M, g) be a Riemannian manifold and K ⊆ M be a totally convex set with respect to g. A function f : K → R is a geodesically convex function with respect to g if for all p, q ∈ K and every geodesic γ_pq : [0, 1] → K that joins p to q, it holds that, for all t ∈ [0, 1],

f(γ_pq(t)) ≤ (1 − t) f(p) + t f(q).    (5)

Various applications with non-convex objectives in Euclidean space can be solved with geodesically convex optimization methods, such as Gaussian mixture models (Hosseini and Sra, 2015), metric learning (Zadeh et al., 2016) and matrix square root (Sra, 2015). By deriving explicit expressions for the smooth manifold structure, such as inner products, gradients, vector transport and the Hessian, various optimization methods have been developed. Jeuris et al. (2012) present conjugate gradient, BFGS and trust-region methods. Qi et al. (2010) propose the Riemannian BFGS (RBFGS) algorithm for general retraction and vector transport. Ring and Wirth (2012) prove its local superlinear rate of convergence. Sra and Hosseini (2015) present a limited-memory version of RBFGS.
Other Non-convex Structures. Tensor methods have been used in various non-convex problems, e.g., learning latent variable models (Anandkumar et al., 2014) and training neural networks (Janzamin et al., 2015). A fundamental problem in non-convex optimization is to reach a stationary point assuming smoothness of the objective (Sra, 2012; Li and Lin, 2015; Reddi et al., 2016a; Allen-Zhu and Hazan, 2016). With extra assumptions, certain global convergence results can be obtained. For example, for functions with Lipschitz continuous Hessians, the regularized Newton scheme of Nesterov and Polyak (2006) achieves global convergence results for functions with an additional star-convexity property or with an additional gradient-dominance property (Polyak, 1963). Hazan et al. (2015b) introduce the family of σ-nice functions and propose a graduated-optimization-based algorithm that provably converges to a global optimum for this family of non-convex functions. However, it is typically difficult to verify whether these assumptions hold in real-world problems.

To the best of our knowledge, this work is the first to systematically study continuous submodularity and its maximization algorithms. Our main contributions are:
Thorough characterizations of submodularity. By lifting the notion of submodularity to continuous domains, we identify a subclass of tractable non-convex optimization problems: continuous submodular optimization. We provide a thorough characterization of continuous submodularity, which results in 0th order, 1st order and 2nd order definitions.
Continuous submodularity preserving operations. We study general principles for maintaining continuous (DR-)submodularity. These enable: i) convenient ways of recognizing new continuous submodular objectives; ii) generic rules for designing new continuous or discrete submodular objectives, such as deep submodular functions.
Properties of constrained DR-submodular maximization.
We discover intriguing properties of the general constrained DR-submodular maximization problem, such as the local-global relation (Proposition 22), which relates (approximately) stationary points and the global optimum, thus allowing us to incorporate progress from the area of non-convex optimization research.
Provable algorithms for DR-submodular maximization.
We establish hardness results and propose provable algorithms for constrained DR-submodular maximization in two settings: i) maximizing monotone functions with down-closed convex constraints; ii) maximizing non-monotone functions with down-closed convex constraints.
Applications with (DR)-submodular objectives.
We formulate representative applications with (DR-)submodular objectives from various areas, such as machine learning, data mining and combinatorial optimization.
Extensive experimental evaluations.
We present representative applications with the studied continuous submodular objectives, and extensively evaluate the proposed algorithms on these applications.
4. This journal paper is partially based on the previous conference papers Bian et al. (2017b) and Bian et al. (2017a), as well as the thesis Bian (2019).

Table 1: Comparison of definitions of continuous submodular and convex functions

Definitions | Continuous submodular function f(·) | Convex function g(·), ∀ λ ∈ [0, 1]
0th order | f(x) + f(y) ≥ f(x ∨ y) + f(x ∧ y) | λ g(x) + (1 − λ) g(y) ≥ g(λ x + (1 − λ) y)
1st order | weak DR property (Definition 6), or ∇f(·) is a weak antitone mapping (Lemma 8) | g(y) ≥ g(x) + ⟨∇g(x), y − x⟩
2nd order | ∂²f(x)/∂x_i ∂x_j ≤ 0, ∀ i ≠ j | ∇²g(x) ⪰ 0
3. Characterizations of Continuous Submodular Functions
Continuous submodular functions are defined on subsets of R^n: X = Π_{i=1}^n X_i, where each X_i is a compact subset of R (Topkis, 1978; Bach, 2015). A function f : X → R is submodular iff for all (x, y) ∈ X × X,

f(x) + f(y) ≥ f(x ∨ y) + f(x ∧ y),    (submodularity)    (6)

where ∧ and ∨ are the coordinate-wise minimum and maximum operations, respectively. Specifically, X_i could be a finite set, such as {0, 1} (in which case f(·) is called a set function), or {0, ..., k_i − 1} (in which case f(·) is called an integer function), where the notion of continuity is vacuous; X_i can also be an interval, which is referred to as a continuous domain. In this section, we consider the interval by default, but it is worth noting that the properties introduced in this section also apply when X_i is a general compact subset of R.

When twice-differentiable, f(·) is submodular iff all off-diagonal entries of its Hessian matrix are non-positive (Topkis, 1978; Bach, 2015),

∀ x ∈ X,  ∂²f(x)/∂x_i ∂x_j ≤ 0,  ∀ i ≠ j.    (7)

The class of continuous submodular functions contains a subset of both convex and concave functions, and shares some useful properties with them (illustrated in Figure 1). Examples include submodular and convex functions of the form φ_ij(x_i − x_j) for φ_ij convex; submodular and concave functions of the form x ↦ g(Σ_{i=1}^n λ_i x_i) for g concave and λ_i non-negative. Lastly, indefinite quadratic functions of the form f(x) = ½ x^⊤ H x + h^⊤ x + c with all off-diagonal entries of H non-positive are examples of submodular but non-convex/non-concave functions. Interestingly, the characterizations of continuous submodular functions are in correspondence with those of convex functions, which are summarized in Table 1.
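As a quick numerical illustration of the 0th-order characterization (6), the Python/NumPy sketch below checks the lattice inequality on random points; the particular choice φ(x_1 − x_2) = (x_1 − x_2)² is one instance of the convex-φ family mentioned above and is used here purely for illustration.

import numpy as np

# f(x) = phi(x_1 - x_2) with phi convex is continuous submodular.
f = lambda x: (x[0] - x[1]) ** 2

rng = np.random.default_rng(0)
for _ in range(1000):
    x, y = rng.uniform(0, 1, 2), rng.uniform(0, 1, 2)
    lhs = f(x) + f(y)
    rhs = f(np.maximum(x, y)) + f(np.minimum(x, y))  # f(x ∨ y) + f(x ∧ y)
    assert lhs >= rhs - 1e-12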
5. Notice that an equivalent definition of (6) is that ∀ x ∈ X, ∀ i ≠ j and a_i, a_j ≥ 0 such that x_i + a_i ∈ X_i and x_j + a_j ∈ X_j, it holds that f(x + a_i e_i) + f(x + a_j e_j) ≥ f(x) + f(x + a_i e_i + a_j e_j). With a_i and a_j approaching zero, one gets (7).
Figure 1: Venn diagram for concavity, convexity, submodularity and DR-submodularity.
The diminishing returns (DR) property was introduced when studying set and integer functions. We generalize the DR property to general functions defined over X. It will soon be clear that the DR property defines a subclass of submodular functions. All of the proofs can be found in Appendix A.

Definition 4 (DR/IR property, DR-submodular/IR-supermodular functions)
A function f(·) defined over X satisfies the diminishing returns (DR) property if ∀ a ≤ b ∈ X, ∀ i ∈ [n], ∀ k ∈ R_+ such that (k e_i + a) and (k e_i + b) are still in X, it holds that

f(k e_i + a) − f(a) ≥ f(k e_i + b) − f(b).    (8)

Such a function f(·) is called a DR-submodular function. If −f(·) is DR-submodular, we call f(·) an IR-supermodular function, where IR stands for "Increasing Returns".

One immediate observation is that for a differentiable DR-submodular function f(·), we have that ∀ a ≤ b ∈ X, ∇f(a) ≥ ∇f(b), i.e., the gradient ∇f(·) is an antitone mapping from R^n to R^n. This observation can be formalized as follows:

Lemma 5 (Antitone mapping) If f(·) is continuously differentiable, then f(·) is DR-submodular iff ∇f(·) is an antitone mapping from R^n to R^n, i.e., ∀ a ≤ b ∈ X, ∇f(a) ≥ ∇f(b).

Recently, the DR property has been explored by Eghbali and Fazel (2016) to achieve the worst-case competitive ratio for an online concave maximization problem. The DR property is also closely related to a sufficient condition on a concave function g(·) (Bilmes and Bai, 2017, Section 5.2) that ensures submodularity of the corresponding set function generated by giving g(·) boolean input vectors.
6. Note that the DR property implies submodularity, and thus the name "DR-submodular" contains redundant information about the submodularity of a function; we keep this terminology to be consistent with previous literature on integer submodular functions.

It is well known that for set functions the DR property is equivalent to submodularity, while for integer functions submodularity does not in general imply the DR property (Soma et al., 2014; Soma and Yoshida, 2015a,b). However, it was unclear whether there exists a diminishing-returns-style characterization that is equivalent to submodularity of integer functions. In this work we give a positive answer to this question by proposing the weak diminishing returns (weak DR) property for general functions defined over X, and prove that weak DR gives a sufficient and necessary condition for a general function to be submodular.

Definition 6 (Weak DR property)
A function f(·) defined over X has the weak diminishing returns property (weak DR) if ∀ a ≤ b ∈ X, ∀ i ∈ V such that a_i = b_i, ∀ k ∈ R_+ such that (k e_i + a) and (k e_i + b) are still in X, it holds that

f(k e_i + a) − f(a) ≥ f(k e_i + b) − f(b).    (9)

The following proposition shows that for all set functions, as well as integer and continuous functions, submodularity is equivalent to the weak DR property. All the proofs can be found in Appendix A.

Proposition 7 ((submodularity) ⇔ (weak DR)) A function f(·) defined over X is submodular iff it satisfies the weak DR property.

Given Proposition 7, one can treat weak DR as the first-order definition of submodularity: notice that for a continuously differentiable function f(·) with the weak DR property, we have that ∀ a ≤ b ∈ X, ∀ i ∈ V s.t. a_i = b_i, it holds that ∇_i f(a) ≥ ∇_i f(b), i.e., ∇f(·) is a weak antitone mapping. Formally,

Lemma 8 (Weak antitone mapping) If f(·) is continuously differentiable, then f(·) is submodular iff ∇f(·) is a weak antitone mapping from R^n to R^n, i.e., ∀ a ≤ b ∈ X, ∀ i ∈ V s.t. a_i = b_i, ∇_i f(a) ≥ ∇_i f(b).

Now we show that the DR property is stronger than the weak DR property, and that the class of DR-submodular functions is a proper subset of that of submodular functions, as indicated by Figure 1.

Proposition 9 ((submodular/weak DR) + (coordinate-wise concave) ⇔ (DR)) A function f(·) defined over X satisfies the DR property iff f(·) is submodular and coordinate-wise concave, where the coordinate-wise concave property is defined as: ∀ x ∈ X, ∀ i ∈ V, ∀ k, l ∈ R_+ s.t. (k e_i + x), (l e_i + x), ((k + l) e_i + x) are still in X, it holds that

f(k e_i + x) − f(x) ≥ f((k + l) e_i + x) − f(l e_i + x),    (10)

or equivalently (if twice differentiable) ∂²f(x)/∂x_i² ≤ 0, ∀ i ∈ V.

Proposition 9 shows that a twice differentiable function f(·) is DR-submodular iff ∀ x ∈ X, ∂²f(x)/∂x_i ∂x_j ≤ 0, ∀ i, j ∈ V, which does not necessarily imply the concavity of f(·). Given Proposition 9, we also have the characterizations of continuous DR-submodular functions, which are summarized in Table 2.

Table 2: Summary of definitions of continuous DR-submodular functions

Definitions | Continuous DR-submodular function f(·), ∀ x, y ∈ X
0th order | f(x) + f(y) ≥ f(x ∨ y) + f(x ∧ y), and f(·) is coordinate-wise concave (see (10))
1st order | DR property (Definition 4), or ∇f(·) is an antitone mapping (Lemma 5)
2nd order | ∂²f(x)/∂x_i ∂x_j ≤ 0, ∀ i, j (all entries of the Hessian matrix are non-positive)

Figure 2 shows the contour of a 2-D continuous submodular function (a weighted combination of a convex term in (x_1 − x_2) and several Gaussian bump terms) and of a 2-D DR-submodular function, the softmax extension

x ↦ log det(diag(x)(L − I) + I),  x ∈ [0, 1]²,    (11)

where L is a fixed 2×2 kernel matrix.

Figure 2: Left: contour of a 2-D continuous submodular function. Right: contour of a 2-D continuous DR-submodular function, the softmax extension x ↦ log det(diag(x)(L − I) + I), x ∈ [0, 1]².
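For readers who want to evaluate a function like the one in the right panel of Figure 2, the sketch below computes the softmax extension (11) and its gradient in closed form; the kernel matrix used here is a hypothetical positive semidefinite example, not the matrix from the figure.

import numpy as np

# Hypothetical 2x2 PSD kernel of a DPP (not the matrix used in Figure 2).
L = np.array([[2.0, 1.0],
              [1.0, 2.0]])
n = L.shape[0]
I = np.eye(n)

def softmax_ext(x):
    # f(x) = log det(diag(x)(L - I) + I),  x in [0, 1]^n
    return np.linalg.slogdet(np.diag(x) @ (L - I) + I)[1]

def softmax_ext_grad(x):
    # d/dx_i log det M(x) = [(L - I) M(x)^{-1}]_{ii},  M(x) = diag(x)(L - I) + I
    M_inv = np.linalg.inv(np.diag(x) @ (L - I) + I)
    return np.diag((L - I) @ M_inv)

x0 = np.array([0.4, 0.7])
val, grad = softmax_ext(x0), softmax_ext_grad(x0)

# Sanity check of the gradient against forward finite differences.
eps = 1e-6
fd = np.array([(softmax_ext(x0 + eps * np.eye(n)[i]) - val) / eps for i in range(n)])
assert np.allclose(grad, fd, atol=1e-4)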
4. Operations that Preserve Continuous (DR-)Submodularity
Continuous submodularity is preserved under various operations. For example, the sum of two continuous submodular functions is submodular, non-negative combinations of continuous submodular functions are still submodular, and a continuous submodular function multiplied by a positive scalar is still submodular. In this section, we study some general submodularity-preserving operations from the perspective of function composition. Then we look at some exemplary applications resulting from these rules.

Observation 10 (Bach (2015))
Let f be a DR-submodular function over X = Π_{i=1}^n X_i. Let f̃ be the function defined by restricting f to a product of subsets of the X_i. Then f̃ is DR-submodular.

Observation 10 will be useful when we try to obtain discrete submodular functions by discretizing continuous submodular functions.
Suppose there are two functions h : R^m → R^n and f : R^n → R. Consider the composed function g(x) := f(h(x)) = (f ∘ h)(x). We are interested in what properties are needed from f and h such that the composed function g is DR-submodular.

Here h is a multivariate vector-valued function. Equivalently, we can express h as n multivariate functions h_k : R^m → R, k = 1, ..., n. We use ∇h to denote the n × m Jacobian matrix of h. Let y = h(x), so y_k = h_k(x).

For a vector-valued function, we define its (DR-)submodularity as follows:

Definition 11 ((DR-)submodularity for vector-valued functions)
Let h : R^m → R^n be a multivariate vector-valued function, and h_k : R^m → R be the k-th entry of the output, k = 1, ..., n. Then we say h is (DR-)submodular iff h_k is (DR-)submodular, ∀ k ∈ [n].

Assume for simplicity that both f and h are twice differentiable. Applying the chain rule twice, one can verify that

∇²g(x) = ∇h(x)^⊤ ∇²f(y) ∇h(x) + Σ_{k=1}^n (∂f(y)/∂y_k) ∇²h_k(x),    (12)

where the product above is the standard matrix multiplication. After some manipulation, one can see that the (i, j)-th entry of ∇²g(x) is

∂²g(x)/∂x_i ∂x_j = Σ_{s,t=1}^n (∂²f(y)/∂y_s ∂y_t) (∂h_s(x)/∂x_i) (∂h_t(x)/∂x_j) + Σ_{k=1}^n (∂f(y)/∂y_k) (∂²h_k(x)/∂x_i ∂x_j).    (13)

Maintaining DR-submodularity (or IR-supermodularity) means maintaining the sign of ∂²g(x)/∂x_i ∂x_j. From Equation (13), one can see that if we want ∂²g(x)/∂x_i ∂x_j to be non-positive, h must in general be monotone. h could be either nondecreasing or nonincreasing; in both cases we have (∂h_s(x)/∂x_i)(∂h_t(x)/∂x_j) ≥ 0.

Theorem 12 (DR-submodularity preserving conditions on function composition)
Suppose h : R^m → R^n is monotone (nondecreasing or nonincreasing) and f : R^n → R. The following statements about the composed function g(x) := f(h(x)) = (f ∘ h)(x) hold:

1. If f is DR-submodular, nondecreasing, and h is DR-submodular, then g is DR-submodular;
2. If f is DR-submodular, nonincreasing, and h is IR-supermodular, then g is DR-submodular;
3. If f is IR-supermodular, nondecreasing, and h is IR-supermodular, then g is IR-supermodular;
4. If f is IR-supermodular, nonincreasing, and h is DR-submodular, then g is IR-supermodular.

If f and h are both twice differentiable, Theorem 12 can be directly proved by examining the (i, j)-th entry of ∇²g(x) in Equation (13). Furthermore, the above conclusions can also be rigorously proved when the functions are non-differentiable.
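As a small numerical illustration of statement 1 (not a proof), the sketch below composes a nondecreasing DR-submodular inner map with a concave, separable, nondecreasing outer function and checks that all entries of a finite-difference Hessian of g = f ∘ h are non-positive at a sample point; the concrete functions and weights are illustrative choices.

import numpy as np

W = np.array([[1.0, 2.0],
              [0.5, 1.5]])          # non-negative weights: h(x) = W x is
h = lambda x: W @ x                 # nondecreasing and DR-submodular
f = lambda y: np.sum(np.log1p(y))   # concave, separable, nondecreasing
g = lambda x: f(h(x))               # Theorem 12, statement 1: DR-submodular

x0, eps = np.array([0.3, 0.6]), 1e-4
hess = np.zeros((2, 2))
for i in range(2):
    for j in range(2):
        ei, ej = np.eye(2)[i] * eps, np.eye(2)[j] * eps
        hess[i, j] = (g(x0 + ei + ej) - g(x0 + ei) - g(x0 + ej) + g(x0)) / eps**2
assert np.all(hess <= 1e-6)   # all entries of the Hessian are non-positive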
Below we give an exemplary proof of statement 1 in Theorem 12. The other proofs are omitted due to their high similarity.
Proof [Proof of statement 1 in Theorem 12 when the functions are non-differentiable] To prove the DR-submodularity of g, it suffices to show that ∀ x ≤ y, ∀ i ∈ [m], ∀ k ≥ 0,

g(x + k e_i) − g(x) ≥ g(y + k e_i) − g(y).    (14)

Due to the DR-submodularity of h,

h(x + k e_i) − h(x) ≥ h(y + k e_i) − h(y).    (15)

I) Let us consider the case when h is nondecreasing. It holds that

h(x) ≤ h(y).    (16)

Then,

g(x + k e_i) − g(x)    (17)
= f[h(x + k e_i)] − f[h(x)]    (18)
≥ f[h(x) + h(y + k e_i) − h(y)] − f[h(x)]    (by (15) and f nondecreasing)    (19)
≥ f[h(y + k e_i)] − f[h(y)]    (by (16) and f DR-submodular)    (20)
= g(y + k e_i) − g(y).    (21)

Thus we have proved Equation (14), i.e., the DR-submodularity of g.

II) Let us consider the case when h is nonincreasing. It holds that

h(x + k e_i) ≥ h(y + k e_i).    (22)

Thus,

g(y) − g(y + k e_i)    (23)
= f[h(y)] − f[h(y + k e_i)]    (24)
≥ f[h(y + k e_i) + h(x) − h(x + k e_i)] − f[h(y + k e_i)]    (by (15) and f nondecreasing)    (25)
≥ f[h(x)] − f[h(x + k e_i)]    (by (22) and f DR-submodular)    (26)
= g(x) − g(x + k e_i).    (27)

Rearranging yields Equation (14), which completes the proof.

Figure 3: Layers l − 1 and l of the DSF.

By examining the (i, j)-th entry of ∇²g(x) in Equation (13), we can also prove the following conclusion:

Lemma 13
Suppose h is monotone (nondecreasing or nonincreasing). In addition, assume h is separable, that is, m = n and h_k(x) = h_k(x_k), k = 1, ..., n. Then f(h(x)) maintains the submodularity (supermodularity) of f.

The detailed proof can be found in Appendix B.1. It is worth noting that under the same setting as in Lemma 13, f(h(x)) might not maintain the DR-submodularity (IR-supermodularity) of f.

Using the conclusions in this section, we obtain a general way of composing discrete DSFs: i) We build a continuous deep submodular function f : X → R utilizing the composition rules; ii) By restricting f to the binary lattice {0, 1}^n, we obtain a deep submodular set function. Similarly, by restricting f to the integer lattice {0, 1, 2, ..., k}^n, we get a deep submodular integer function. This step is ensured by the restriction rule (Observation 10).

For a specific example, we can easily prove that the DSFs composed by nesting SCMMs with concave functions (Bilmes and Bai, 2017) are binary submodular. Firstly, we can prove that the continuous functions composed by nesting SCMMs with concave functions (Bilmes and Bai, 2017) are continuous submodular. Let the original input vector be x ∈ R^n, which serves as the input vector of the 0th layer. As shown by Figure 3, let the output of the i-th neuron in the l-th layer be o_i^l. So

o_i^l = σ(W^l_{i,1} o^{l−1}_1 + W^l_{i,2} o^{l−1}_2 + ... + W^l_{i,d_{l−1}} o^{l−1}_{d_{l−1}}),    (28)

where we assume there are d_l neurons in layer l, and use W^l ∈ R^{d_l × d_{l−1}}_+ to denote the weight matrix between layer l − 1 and layer l. σ is the activation function, which is concave and nondecreasing in the positive orthant.

Now let us proceed by induction. When l = 0, o_i^0(x) is DR-submodular and nondecreasing since o_i^0(x) = x_i. Let us assume for the (l − 1)-th layer that o_i^{l−1}(x) is DR-submodular and nondecreasing w.r.t. x. It is left to prove that o_i^l(x) is DR-submodular and nondecreasing w.r.t. x. According to Equation (28), o_i^l(x) = σ(W^l_{i,1} o^{l−1}_1(x) + W^l_{i,2} o^{l−1}_2(x) + ... + W^l_{i,d_{l−1}} o^{l−1}_{d_{l−1}}(x)). Since W^l is non-negative, the function h(x) = W^l_{i,1} o^{l−1}_1(x) + W^l_{i,2} o^{l−1}_2(x) + ... + W^l_{i,d_{l−1}} o^{l−1}_{d_{l−1}}(x) is DR-submodular and nondecreasing. σ is also DR-submodular and nondecreasing. According to Theorem 12, o_i^l(x) is also DR-submodular w.r.t. x. Thus we finish the induction.

Given that continuous DSFs are continuous submodular, by means of the restriction operation we obtain binary or integer deep submodular functions. However, the above principles offer more general ways of designing DSFs than nesting SCMMs with concave activation functions (as proposed by Bilmes and Bai (2017)). As long as the resultant continuous map is DR-submodular, the discrete function obtained by restriction will be DR-submodular.

With the principles proved in this section, one can immediately recognize that the following applications enjoy continuous submodular objectives (more details will be discussed in the corresponding sections).
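To make the construction tangible, here is a minimal one-hidden-layer sketch of such a DSF in Python; the weights, layer sizes and the concave activation √· are illustrative assumptions. By Theorem 12 the resulting continuous map is DR-submodular, and restricting its inputs to {0, 1}^n gives a deep submodular set function.

import numpy as np

# Non-negative weights of a tiny one-hidden-layer DSF (illustrative values).
W1 = np.array([[1.0, 0.5, 0.0],
               [0.0, 2.0, 1.0]])     # layer 1: R^3 -> R^2
w2 = np.array([1.0, 3.0])            # layer 2: R^2 -> R
sigma = np.sqrt                       # concave, nondecreasing activation

def dsf(x):
    # Nested non-negative sums with concave activations: DR-submodular by Theorem 12.
    o1 = sigma(W1 @ x)
    return sigma(w2 @ o1)

# Restriction to the binary lattice yields a deep submodular set function.
def dsf_set(S, n=3):
    x = np.zeros(n)
    x[list(S)] = 1.0
    return dsf(x)

value = dsf_set({0, 2})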
Influence Maximization with Marketing Strategies. One can easily see that the objective in Equation (46) is the composition of a nondecreasing multilinear extension and a monotone activation function. So it is DR-submodular according to Theorem 12.
Revenue Maximization with Continuous Assignments.
One can also verify that the revenue maximization problem in Equation (52) is the composition of a non-monotone DR-submodular multilinear extension and a separable monotone function, so it is still DR-submodular according to Lemma 13.
5. Properties of Constrained DR-Submodular Maximization
In this section, we first formulate the constrained DR-submodular maximization problem, and then establish several of its properties. In particular, we show properties related to concavity of the objective along certain directions, and establish the relation between locally stationary points and the global optimum (hence the name "local-global relation"). These properties will be used to derive guarantees for the algorithms in the following sections. All omitted proofs are in Appendix B.
The general setup of constrained continuous submodular function maximization is

max_{x ∈ P ⊆ X} f(x),    (P)

where f : X → R is continuous submodular or DR-submodular and X = [u, ū] (Bian et al., 2017b). One can assume f is non-negative over X, since otherwise one just needs to find a lower bound for the minimum function value of f over X (because box-constrained submodular minimization can be solved to arbitrary precision in polynomial time (Bach, 2015)). Let the lower bound be f_min; then working with a new function f′(x) := f(x) − f_min will not change the solution structure of the original problem (P).

The constraint set P ⊆ X is assumed to be a down-closed convex set, since without this property one cannot reach any constant factor approximation guarantee for problem (P) (Vondrák, 2013). Formally, down-closedness of a convex set is defined as follows:
Definition 14 (Down-closedness)
A down-closed convex set is a convex set P associated with a lower bound u ∈ P, such that:

1. ∀ y ∈ P, u ≤ y;
2. ∀ y ∈ P, x ∈ R^n, u ≤ x ≤ y implies that x ∈ P.

Without loss of generality, we assume P lies in the positive orthant and has the lower bound 0. Otherwise we can always define a new set P′ = {x | x = y − u, y ∈ P} in the positive orthant, and a corresponding continuous submodular function f′(x) := f(x + u), and all properties of the function are still preserved.

The diameter of P is D := max_{x,y ∈ P} ‖x − y‖, and it holds that D ≤ ‖ū‖. We use x* to denote the global maximum of (P). In some applications we know that f satisfies the monotonicity property:

Definition 15 (Monotonicity)
A function f(·) is monotone nondecreasing if, ∀ a ≤ b,

f(a) ≤ f(b).    (29)

In the sequel, by "monotonicity" we mean monotone nondecreasing by default.
We also assume that f has Lipschitz gradients:

Definition 16 (Lipschitz gradients)
A differentiable function f(·) has L-Lipschitz gradients if for all x, y ∈ X it holds that

‖∇f(x) − ∇f(y)‖ ≤ L ‖x − y‖.    (30)

According to Nesterov (2013, Lemma 1.2.3), if f(·) has L-Lipschitz gradients, then

|f(x + v) − f(x) − ⟨∇f(x), v⟩| ≤ (L/2) ‖v‖².    (31)

For Frank-Wolfe style algorithms, the notion of curvature usually gives a tighter bound than just using the Lipschitz gradients.

Definition 17 (Curvature of a continuously differentiable function)
The curvature of a differentiable function f(·) w.r.t. a constraint set P is

C_f(P) := sup_{x,v ∈ P, γ ∈ (0,1], y = x + γ(v − x)} (2/γ²) [f(y) − f(x) − (y − x)^⊤ ∇f(x)].    (32)

If a differentiable function f(·) has L-Lipschitz gradients, one can easily show that C_f(P) ≤ L D², given Nesterov (2013, Lemma 1.2.3).
A continuous DR-submodular function f(·) is concave along any non-negative direction v ≥ 0, and along any non-positive direction v ≤ 0.

Notice that DR-submodularity is a stronger condition than concavity along directions v ∈ ±R^n_+: for instance, a concave function is concave along any direction, but it may not be a DR-submodular function.
Strong DR-submodularity. DR-submodular objectives may be strongly concave along directions v ∈ ±R^n_+, e.g., DR-submodular quadratic functions. We will show that such additional structure may be exploited to obtain stronger guarantees for the local-global relation.

Definition 19 (Strong DR-submodularity)
A function f is µ-strongly DR-submodular (µ ≥ 0) if for all x ∈ X and v ∈ ±R^n_+, it holds that

f(x + v) ≤ f(x) + ⟨∇f(x), v⟩ − (µ/2) ‖v‖².    (33)

We know that for unconstrained optimization problems, ‖∇f(x)‖ is often used as the non-stationarity measure of a point x. What should be a proper non-stationarity measure for a general constrained optimization problem? We advocate the non-stationarity measure proposed by Lacoste-Julien (2016) and Reddi et al. (2016b), which can be calculated for free within Frank-Wolfe-style algorithms (e.g., Algorithm 2).

Non-stationarity measure.
For any constraint set Q ⊆ X, the non-stationarity of a point x ∈ Q is

g_Q(x) := max_{v ∈ Q} ⟨v − x, ∇f(x)⟩    (non-stationarity).    (34)
It always holds that g_Q(x) ≥ 0, and x is defined to be a stationary point in Q iff g_Q(x) = 0, so (34) is a natural generalization of the non-stationarity measure for unconstrained optimization problems. We start with the following proposition involving the non-stationarity measure.

Proposition 20 (Bian et al. (2017a)) If f is µ-strongly DR-submodular, then for any two points x, y in X, it holds that

(y − x)^⊤ ∇f(x) ≥ f(x ∨ y) + f(x ∧ y) − 2 f(x) + (µ/2) ‖x − y‖².    (35)

Proposition 20 implies that if x is stationary in P (i.e., g_P(x) = 0), then 2 f(x) ≥ f(x ∨ y) + f(x ∧ y) + (µ/2) ‖x − y‖², which gives an implicit relation between x and y.

As the following statements show, g_Q(x) plays an important role in characterizing the local-global relation in both the monotone and non-monotone settings.
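Since g_Q(x) in (34) is itself a linear program over Q, it can be computed with the same linear oracle used by Frank-Wolfe-style methods. Below is a small sketch for a box constraint Q = [0, 1]^n; both the constraint and the quadratic objective are illustrative assumptions.

import numpy as np

# Non-stationarity g_Q(x) = max_{v in Q} <v - x, grad f(x)> for Q = [0, 1]^n.
def non_stationarity_box(grad, x):
    v = np.where(grad > 0, 1.0, 0.0)   # maximizer of <v, grad> over the box
    return float((v - x) @ grad)

# Illustrative DR-submodular quadratic objective (hypothetical values).
H = np.array([[-1.0, -0.5], [-0.5, -2.0]])
h = np.array([2.0, 3.0])
grad_f = lambda x: H @ x + h

x = np.array([0.3, 0.8])
g = non_stationarity_box(grad_f(x), x)
assert g >= -1e-12    # always non-negative; g == 0 iff x is stationary in Q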
Corollary 21 (Local-Global Relation: Monotone Setting) Let x be a point in P with non-stationarity g_P(x). If f is monotone nondecreasing and µ-strongly DR-submodular, then it holds that

f(x) ≥ ½ [f(x*) − g_P(x)] + (µ/4) ‖x − x*‖².    (36)
Corollary 21 indicates that any stationary point is a 1/2 approximation, which was also found by Hassani et al. (2017) (with µ = 0). Furthermore, if f is µ-strongly DR-submodular, the quality of x will be improved considerably: if x is close to x*, it should be close to being optimal since f is smooth; if x is far away from x*, the term (µ/4) ‖x − x*‖² will boost the approximation bound significantly. We provide here a very succinct proof based on Proposition 20.

Proof [Proof of Corollary 21] Let y = x* in Proposition 20; one can easily obtain

f(x) ≥ ½ [f(x* ∨ x) + f(x* ∧ x) − g_P(x)] + (µ/4) ‖x − x*‖².    (37)
Because of monotonicity and x* ∨ x ≥ x*, we know that f(x* ∨ x) ≥ f(x*). From non-negativity, f(x* ∧ x) ≥ 0. Then we reach the conclusion.
Proposition 22 (Local-Global Relation: Non-Monotone Setting) Let x be a point in P with non-stationarity g_P(x), and Q := P ∩ {y | y ≤ ū − x}. Let z be a point in Q with non-stationarity g_Q(z). It holds that

max{f(x), f(z)} ≥ ¼ [f(x*) − g_P(x) − g_Q(z)] + (µ/8) (‖x − x*‖² + ‖z − z*‖²),    (38)

where z* := x ∨ x* − x.

Figure 4 provides a two-dimensional visualization of Proposition 22. Notice that the smaller constraint Q is generated after the first stationary point x is calculated.

Proof sketch of Proposition 22:
The proof uses Proposition 20, the non-stationarity in (34), and a key observation in the following claim. The detailed proof is deferred to Appendix B.4.
Claim 23
Under the setting of Proposition 22, it holds that

f(x ∨ x*) + f(x ∧ x*) + f(z ∨ z*) + f(z ∧ z*) ≥ f(x*).    (39)

Note that Chekuri et al. (2014) and Gillenwater et al. (2012) propose a similar relation for the special cases of the multilinear/softmax extensions, by mainly proving the same conclusion as in Claim 23. Their relation does not incorporate the properties of non-stationarity or strong DR-submodularity. They both use the proof idea of constructing a specialized auxiliary set function tailored to specific DR-submodular functions (the considered extensions). We present a different proof method by directly utilizing the DR property on carefully constructed auxiliary points (e.g., (x + z) ∨ x* in the proof of Claim 23), which is arguably more succinct and straightforward than that of Chekuri et al. (2014) and Gillenwater et al. (2012).
6. Exemplary Applications of Continuous Submodular Optimization
Continuous submodularity naturally finds applications in various domains, ranging from influence and revenue maximization to DPP MAP inference and mean field inference of probabilistic graphical models. We discuss several concrete problem instances in this section.
Non-convex/non-concave QP problems of the form f(x) = ½ x^⊤ H x + h^⊤ x + c under convex constraints naturally arise in many applications, including scheduling (Skutella, 2001), inventory theory, and free boundary problems. A special class of QP is the submodular QP (the minimization of which was studied in Kim and Kojima (2003)), in which all off-diagonal entries of H are required to be non-positive. Price optimization with continuous prices is a DR-submodular quadratic program (Ito and Fujimaki, 2016).

Another representative class of DR-submodular quadratic objectives arises when computing the stability number s(G) of a graph G = (V, E): s(G)^{-1} = min_{x ∈ Δ} x^⊤ (A + I) x, where A is the adjacency matrix of the graph G and Δ is the standard simplex (Motzkin and Straus, 1965). This instance is a convex-constrained monotone DR-submodular maximization problem.
The Lovász extension (Lovász, 1983), used for submodular set function minimization, is both submodular and convex (see Appendix A of Bach (2015)).

The multilinear extension (Calinescu et al., 2007) is extensively used for submodular set function maximization. It is the expected value of F(S) under the fully factorized surrogate distribution q(S|x) := Π_{i∈S} x_i Π_{j∉S} (1 − x_j), x ∈ [0, 1]^V:

f_mt(x) := E_{q(S|x)}[F(S)] = Σ_{S⊆V} F(S) Π_{i∈S} x_i Π_{j∉S} (1 − x_j).    (40)

f_mt(x) is DR-submodular and coordinate-wise linear (Bach, 2015). The partial derivative of f_mt(x) can be expressed as

∇_i f_mt(x) = E_{q(S|x, x_i=1)}[F(S)] − E_{q(S|x, x_i=0)}[F(S)]    (41)
= f_mt(x|_i(1)) − f_mt(x|_i(0))
= Σ_{S⊆V, S∋i} F(S) Π_{j∈S\{i}} x_j Π_{j′∉S} (1 − x_{j′}) − Σ_{S⊆V\{i}} F(S) Π_{j∈S} x_j Π_{j′∉S, j′≠i} (1 − x_{j′}).

At first glance, evaluating the multilinear extension in Equation (40) costs an exponential number of operations. However, in practice one can often use sampling techniques to estimate its value and gradient. Furthermore, it is worth noting that for several classes of practical submodular set functions, their multilinear extensions f_mt(·) admit closed form expressions. We present details in the following.

Let us use v ∈ {0, 1}^V to equivalently denote the n binary random variables in a Gibbs random field. F(v) corresponds to the negative energy function in Gibbs random fields. If the energy function is parameterized with a finite order of interactions, i.e., F(v) = Σ_{s∈V} θ_s v_s + Σ_{(s,t)∈V×V} θ_{s,t} v_s v_t + ... + Σ_{(s_1,s_2,...,s_d)} θ_{s_1,s_2,...,s_d} v_{s_1} ··· v_{s_d}, d < ∞, then one can verify that its multilinear extension has the following closed form:

f_mt(x) = Σ_{s∈V} θ_s x_s + Σ_{(s,t)∈V×V} θ_{s,t} x_s x_t + ... + Σ_{(s_1,s_2,...,s_d)} θ_{s_1,s_2,...,s_d} x_{s_1} ··· x_{s_d}.    (42)

The gradient of this expression can also be easily derived. Given this observation, one can quickly derive the multilinear extensions of a large category of energy functions of Gibbs random fields, e.g., graph cut, hypergraph cut, Ising models, etc. Specifically:
Undirected MaxCut. For undirected MaxCut, the objective is F(v) = Σ_{(i,j)∈E} w_{ij} (v_i + v_j − 2 v_i v_j), v ∈ {0, 1}^V. One can verify that its multilinear extension is f_mt(x) = Σ_{(i,j)∈E} w_{ij} (x_i + x_j − 2 x_i x_j), x ∈ [0, 1]^V.
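The closed form above is easy to sanity-check against the defining expectation (40). The sketch below does this for a small, hypothetical weighted graph, comparing the closed form with a Monte Carlo estimate of E_{q(S|x)}[F(S)].

import numpy as np

# Hypothetical weighted undirected graph on 3 nodes: (i, j, weight) edges.
edges = [(0, 1, 1.0), (1, 2, 2.0), (0, 2, 0.5)]

def cut_value(v):
    # F(v) = sum_{(i,j) in E} w_ij (v_i + v_j - 2 v_i v_j)
    return sum(w * (v[i] + v[j] - 2 * v[i] * v[j]) for i, j, w in edges)

def multilinear_closed_form(x):
    return sum(w * (x[i] + x[j] - 2 * x[i] * x[j]) for i, j, w in edges)

x = np.array([0.2, 0.7, 0.5])
rng = np.random.default_rng(0)
samples = (rng.uniform(size=(200000, 3)) < x).astype(float)   # S ~ q(S | x)
mc_estimate = np.mean([cut_value(v) for v in samples])
assert abs(mc_estimate - multilinear_closed_form(x)) < 0.02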
Directed MaxCut. For directed MaxCut, the objective is F(v) = Σ_{(i,j)∈E} w_{ij} v_i (1 − v_j), v ∈ {0, 1}^V. Its multilinear extension is f_mt(x) = Σ_{(i,j)∈E} w_{ij} x_i (1 − x_j), x ∈ [0, 1]^V.
Ising models. For Ising models (Ising, 1925) with non-positive pairwise interactions (antiferromagnetic interactions), F(v) = Σ_{s∈V} θ_s v_s + Σ_{(s,t)∈E} θ_{st} v_s v_t, v ∈ {0, 1}^V; this objective can be easily verified to be submodular. Its multilinear extension is

f_mt(x) = Σ_{s∈V} θ_s x_s + Σ_{(s,t)∈E} θ_{st} x_s x_t,  x ∈ [0, 1]^V.    (43)

FLID is a diversity model (Tschiatschek et al., 2016) that has been designed as a computationally efficient alternative to DPPs (Kulesza et al., 2012). It is based on the facility location objective. Let W ∈ R^{|V|×D}_+ be the weights; each row corresponds to the latent representation of an item, with D as the dimensionality. Then

F(S) := Σ_{i∈S} u_i + Σ_{d=1}^D (max_{i∈S} W_{i,d} − Σ_{i∈S} W_{i,d}) = Σ_{i∈S} u′_i + Σ_{d=1}^D max_{i∈S} W_{i,d},    (44)

which models both coverage and diversity, and u′_i = u_i − Σ_{d=1}^D W_{i,d}. If u′_i = 0, one recovers the facility location objective. The computational complexity of evaluating its partition function is O(|V|^{D+1}) (Tschiatschek et al., 2016), which is exponential in terms of D.

We now show a technique by which f_mt(x) and ∇_i f_mt(x) can be evaluated in O(Dn²) time. Firstly, for each d ∈ [D], let us sort the W_{i,d} such that W_{i_d(1),d} ≤ W_{i_d(2),d} ≤ ··· ≤ W_{i_d(n),d}. After this sorting, there are D permutations to record: i_d(l), l = 1, ..., n, ∀ d ∈ [D]. Now, one can verify that

f_mt(x) = Σ_{i∈[n]} u′_i x_i + Σ_d Σ_{S⊆V} (max_{i∈S} W_{i,d}) Π_{m∈S} x_m Π_{m′∉S} (1 − x_{m′})
= Σ_{i∈[n]} u′_i x_i + Σ_d Σ_{l=1}^n W_{i_d(l),d} x_{i_d(l)} Π_{m=l+1}^n [1 − x_{i_d(m)}].

Sorting costs O(Dn log n), and from the above expression one can see that the cost of evaluating f_mt(x) is O(Dn²). By the relation ∇_i f_mt(x) = f_mt(x|_i(1)) − f_mt(x|_i(0)), the cost of evaluating the gradient is also O(Dn²).

Suppose there are |C| concepts C = {c_1, ..., c_{|C|}} and n items in V. Given a set S ⊆ V, Γ(S) denotes the set of concepts covered by S. Given a modular function m : 2^C ↦ R_+, the set cover function is defined as F(S) = m(Γ(S)). This function models coverage in maximization problems, and also the notion of complexity in minimization problems (Lin and Bilmes, 2011b). Let us define an inverse map Γ^{−1}, such that for each concept c, Γ^{−1}(c) denotes the set of items v with v ∈ Γ^{−1}(c), i.e., the items covering c. The multilinear extension is then

f_mt(x) = Σ_{S⊆V} m(Γ(S)) Π_{m∈S} x_m Π_{m′∉S} (1 − x_{m′}) = Σ_{c∈C} m_c [1 − Π_{i∈Γ^{−1}(c)} (1 − x_i)].    (45)

The last equality is obtained by considering the situations in which a concept c is covered. One can observe that both f_mt(x) and ∇_i f_mt(x) can be evaluated in O(n|C|) time.

In the most general case, one may only have access to the function values of F(S). In this scenario, one can use a polynomial number of samples to estimate f_mt(x) and its gradient. Specifically: 1) Sample S_1, ..., S_k i.i.d. from q(S|x) and evaluate their function values, resulting in F(S_1), ..., F(S_k). 2) Return the average (1/k) Σ_{i=1}^k F(S_i).
According to the Hoeffding bound (Hoeffding, 1963), one can easily show that $\frac{1}{k}\sum_{i=1}^{k} F(S^i)$ gets arbitrarily close to $f_{mt}(x)$ with increasingly more samples: with probability at least $1 - 2\exp(-k\epsilon^2/2)$, it holds that $\big|\frac{1}{k}\sum_{i=1}^{k} F(S^i) - f_{mt}(x)\big| \le \epsilon \max_S |F(S)|$, for all $\epsilon > 0$.

Kempe et al. (2003) propose a general marketing strategy for influence maximization. They assume that there exists a number $m$ of different marketing actions $M_i$, each of which may affect some subset of nodes by increasing their probabilities of being activated. A natural requirement is that the more we spend on any one action, the stronger its effect should be. Formally, one invests $x_i$ units in marketing action $M_i$, so a marketing strategy is an $m$-dimensional vector $x \in \mathbb{R}^m$. The probability that node $i$ becomes activated is described by the activation function $a_i(x): \mathbb{R}^m \to [0,1]$. Given a marketing strategy $x$, a node $i$ becomes active with probability $a_i(x)$, so the expected influence is

$f(x) = \sum_{S\subseteq\mathcal{V}} F(S) \prod_{i\in S} a_i(x) \prod_{j\notin S} (1 - a_j(x)).$   (46)

Here $F(S)$ is the influence of the seed set $S$. It is submodular for many influence models, such as the Linear Threshold model and the Independent Cascade model of Kempe et al. (2003). One can easily see that Equation (46) is DR-submodular by viewing it as a composition of the multilinear extension of $F(S)$ and the activation functions $a(x)$.

For the activation function $a_i(x)$, we consider two realizations:

1. Independent marketing actions. Here we provide one action for each customer, and different actions are independent. So we have $m = |\mathcal{V}|$ actions, and for customer $i$ there exists an activation function $a_i(x_i)$, which is a one-dimensional nondecreasing DR-submodular function. A specific instance is $a_i(x_i) = 1 - (1 - p_i)^{x_i}$, where
$p_i \in [0,1]$ is the probability of customer $i$ becoming activated with one unit of investment.

2. Bipartite marketing actions. Suppose there are $m$ marketing actions and $|\mathcal{V}|$ customers. The influence relationship between actions and customers is modeled as a bipartite graph $(M, \mathcal{V}; W)$, where $M$ and $\mathcal{V}$ are the collections of marketing actions and customers, respectively, and $W$ is the collection of weights. The edge weight $p_{st} \in W$ represents the influence probability of action $s$ on customer $t$ given one unit of investment in action $s$. So with a marketing strategy $x$, the probability of a customer $t$ being activated is $a_t(x) = 1 - \prod_{(s,t)\in W} (1 - p_{st})^{x_s}$. This is a nondecreasing DR-submodular function. One may notice that the independent marketing actions are a special case of the bipartite marketing actions.

Optimal budget allocation is a special case of the influence maximization problem. It can be modeled as a bipartite graph $(
S, T; W)$, where $S$ and $T$ are collections of advertising channels and customers, respectively. The edge weight $p_{st} \in W$ represents the influence probability of channel $s$ on customer $t$. The goal is to distribute the budget (e.g., time for a TV advertisement, or space for an inline ad) among the source nodes, so as to maximize the expected influence on the potential customers (Soma et al., 2014; Hatano et al., 2015). The total influence on customer $t$ from all channels can be modeled by a proper monotone DR-submodular function $I_t(x)$, e.g., $I_t(x) = 1 - \prod_{(s,t)\in W}(1 - p_{st})^{x_s}$, where $x \in \mathbb{R}_+^S$ is the budget assignment among the advertising channels. For a set of $k$ advertisers, let $x^i \in \mathbb{R}_+^S$ be the budget assignment for advertiser $i$, and let $x := [x^1, \cdots, x^k]$ denote the assignments for all the advertisers. The overall objective is

$g(x) = \sum_{i=1}^{k} \alpha_i f(x^i)$ with   (47)

$f(x^i) := \sum_{t\in T} I_t(x^i), \quad 0 \le x^i \le \bar u^i, \ \forall i = 1, \ldots, k,$   (48)

which is monotone DR-submodular.

A concrete application arises when advertisers bid for search marketing, i.e., where vendors bid for the right to appear alongside the results of different search keywords. Here, $x^i_s$ is the volume of advertisement space allocated to advertiser $i$ to show its ad alongside query keyword $s$. The search engine company needs to distribute the budget (advertising space) to all vendors to maximize their influence on the customers, while respecting various constraints. For example, each vendor has a specified budget limit for advertising, and the ad space associated with each search keyword cannot be too large. All such constraints can be formulated as a down-closed polytope $\mathcal{P}$, hence the Submodular FW algorithm (Algorithm 4 in Section 7) can be used to find an approximate solution for the problem $\max_{x\in\mathcal{P}} g(x)$. Note that one can flexibly add regularizers when designing $I_t(x^i)$ as long as it remains monotone DR-submodular. For example, adding separable regularizers of the form $\sum_s \phi(x^i_s)$ does not change the off-diagonal entries of the Hessian, and hence maintains submodularity. Alternatively, bounding the second-order derivative of $\phi(x^i_s)$ ensures DR-submodularity.

Determinantal point processes (DPPs) are probabilistic models of repulsion, which have been used to model diversity in machine learning (Kulesza et al., 2012). The constrained MAP (maximum a posteriori) inference problem of a DPP is an NP-hard combinatorial problem in general. Currently, the methods with the best approximation guarantees are based on either maximizing the multilinear extension (Calinescu et al., 2007) or the softmax extension (Gillenwater et al., 2012), both of which are continuous DR-submodular functions. The multilinear extension is given as an expectation over the original set function values, so evaluating the objective of this extension requires expensive sampling in general. In contrast, the softmax extension has a closed-form expression, which is more appealing from a computational perspective. Let $L$ be the positive semidefinite kernel matrix of a DPP; its softmax extension is

$f(x) = \log\det\big[\mathrm{diag}(x)(L - I) + I\big], \quad x \in [0,1]^n,$   (49)

where $I$ is the identity matrix and $\mathrm{diag}(x)$ is the diagonal matrix with diagonal elements set to $x$.
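A small numerical sketch (with a made-up PSD kernel) of evaluating the softmax extension via a log-determinant:

```python
import numpy as np

def softmax_extension(x, L):
    """Softmax extension f(x) = log det(diag(x)(L - I) + I) of a DPP with kernel L."""
    n = L.shape[0]
    M = np.diag(x) @ (L - np.eye(n)) + np.eye(n)
    sign, logdet = np.linalg.slogdet(M)
    assert sign > 0, "matrix should have positive determinant for x in [0, 1]^n"
    return logdet

# Toy PSD kernel (illustrative values).
A = np.array([[1.0, 0.4, 0.1], [0.2, 1.0, 0.3], [0.0, 0.5, 1.0]])
L = A @ A.T
x = np.array([0.5, 0.2, 0.9])
print(softmax_extension(x, L))
```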
Its DR-submodularity can be established by directly applying Lemma 3 of Gillenwater et al. (2012), which immediately implies that all entries of the Hessian $\nabla^2 f(x)$ are non-positive, so $f(x)$ is continuous DR-submodular. The problem of MAP inference in DPPs corresponds to the problem $\max_{x\in\mathcal{P}} f(x)$, where $\mathcal{P}$ is a down-closed convex constraint, e.g., a matroid polytope or a matching polytope.

Probabilistic log-submodular models (Djolonga and Krause, 2014a) are a class of probabilistic models over subsets of a ground set $\mathcal{V} = [n]$, where the log-densities are submodular set functions $F(S)$: $p(S) \propto \exp(F(S))$. The partition function $Z = \sum_{S\subseteq\mathcal{V}} \exp(F(S))$ is typically hard to evaluate. One can use mean field inference to approximate $p(S)$ by some factorized distribution $q(S|x) := \prod_{i\in S} x_i \prod_{j\notin S}(1 - x_j)$, $x \in [0,1]^n$, by minimizing the Kullback-Leibler divergence between $q$ and $p$, i.e., $\sum_{S\subseteq\mathcal{V}} q(S|x)\log\frac{q(S|x)}{p(S)}$. It is

$\mathrm{KL}(x) = -\sum_{S\subseteq\mathcal{V}} F(S)\prod_{i\in S} x_i \prod_{j\notin S}(1 - x_j) + \sum_{i=1}^{n}\big[x_i\log x_i + (1 - x_i)\log(1 - x_i)\big] + \log Z.$   (50)

$\mathrm{KL}(x)$ is IR-supermodular w.r.t. $x$. To see this: the first term is the negative of a multilinear extension, so it is IR-supermodular. The second term is separable and coordinate-wise convex, so it does not affect the off-diagonal entries of $\nabla^2\mathrm{KL}(x)$; it only contributes to the diagonal entries. Now one can see that all entries of $\nabla^2\mathrm{KL}(x)$ are non-negative, so $\mathrm{KL}(x)$ is IR-supermodular w.r.t. $x$. Minimizing the Kullback-Leibler divergence $\mathrm{KL}(x)$ amounts to maximizing a DR-submodular function.

Given a social connection graph with nodes denoting $n$ users and edges encoding their connection strength, viral marketing suggests choosing a small subset of buyers and giving them the product for free, in order to trigger a cascade of further adoptions through "word-of-mouth" effects and maximize the total revenue (Hartline et al., 2008). For some products (e.g., software), the seller usually gives away the product in the form of a trial, to be used for free for a limited time period. In this task, besides deciding whether to choose a user or not, the seller also needs to decide how large the free assignment should be; hence the assignments should be modeled as continuous variables. We call this problem revenue maximization with continuous assignments.

We use a directed graph $G = (\mathcal{V}, E; W)$ to represent the social connection graph. $\mathcal{V}$ contains all the $n$ users, $E$ is the edge set, and $W$ is the adjacency matrix. We treat an undirected social connection graph as a special case of the directed graph, by replacing each undirected edge with two directed edges of the same weight.

One model with "discrete" product assignments is considered by Soma and Yoshida (2017) and Dürr et al. (2019), motivated by the observation that giving a user more free products increases the likelihood that the user will advocate the product. It can be treated as a simplified variant of the Influence-and-Exploit (IE) strategy of Hartline et al. (2008). Specifically:
- Influence stage: Each user $i$ who is given $x_i$ units of the product for free becomes an advocate of the product with probability $1 - q^{x_i}$ (independently of other users), where $q \in (0,1)$ is a parameter. This is consistent with the intuition that the more free products a user is assigned, the more likely the user is to advocate the product.
- Exploit stage: Suppose that a set $S$ of users advocate the product while the complement set $\mathcal{V}\setminus S$ of users do not. Now the revenue comes from the users in $\mathcal{V}\setminus S$, since they will be influenced by the advocates with probability proportional to the edge weights. We use a simplified concave graph model (Hartline et al., 2008) for the value function, i.e., $v_j(S) = \sum_{i\in S} W_{ij}$, $j \in \mathcal{V}\setminus S$. Assume for simplicity that the users of $\mathcal{V}\setminus S$ are visited independently of each other. Then the revenue is

$R(S) = \sum_{j\in\mathcal{V}\setminus S} v_j(S) = \sum_{j\in\mathcal{V}\setminus S}\sum_{i\in S} W_{ij}.$   (51)

Notice that $S$ is a random set drawn according to the distribution specified by the continuous assignment $x$.

With this Influence-and-Exploit (IE) strategy, the expected revenue is a function $f: \mathbb{R}_+^{\mathcal{V}} \to \mathbb{R}_+$, as shown below:

$f(x) = \mathbb{E}_S[R(S)] = \mathbb{E}_S\Big[\sum_{i\in S}\sum_{j\in\mathcal{V}\setminus S} W_{ij}\Big] = \sum_{i\in\mathcal{V}}\sum_{j\in\mathcal{V}\setminus\{i\}} W_{ij}\,(1 - q^{x_i})\, q^{x_j}.$   (52)

According to Lemma 13, one can see that the above objective is submodular, since it is the composition of the multilinear extension of $R(S)$ (which is continuous submodular) and the separable function $h: \mathbb{R}^{\mathcal{V}} \to \mathbb{R}^{\mathcal{V}}$ with $h_i(x_i) = 1 - q^{x_i}$.

In addition to the Influence-and-Exploit (IE) model, we also consider an alternative model. Assume there are $q$ products and $n$ buyers/users; let $x^i \in \mathbb{R}_+^n$ be the assignments of product $i$ to the $n$ users, and let $x := [x^1, \cdots, x^q]$ denote the assignments for the $q$ products. The revenue can be modeled as $g(x) = \sum_{i=1}^{q} f(x^i)$ with

$f(x^i) := \alpha_i\sum_{s:\, x_{is}=0} R_s(x^i) + \beta_i\sum_{t:\, x_{it}\neq 0}\phi(x_{it}) + \gamma_i\sum_{t:\, x_{it}\neq 0}\bar R_t(x^i), \quad 0 \le x^i \le \bar u^i,$   (53)

where $x_{it}$ is the assignment of product $i$ to user $t$ for free, e.g., the amount of free trial time or the amount of the product itself. $R_s(x^i)$ models the revenue gain from user $s$ who did not receive the free assignment; it can be some non-negative, non-decreasing submodular function. $\phi(x_{it})$ models the revenue gain from user $t$ who received the free assignment, since the more a user tries the product, the more likely he/she will buy it after the trial period. $\bar R_t(x^i)$ models the revenue loss from user $t$ (during the free trial period the seller cannot get profits), which can be some non-positive, non-increasing submodular function. For products with continuous assignments, usually the cost of the product does not increase with its amount (e.g., when the product is software), so we only have a box constraint on each assignment. The objective in Equation (53) is generally non-concave/non-convex, and non-monotone submodular (see Appendix D for more details).

Lemma 24 If $R_s(x^i)$ is non-decreasing submodular and $\bar R_t(x^i)$ is non-increasing submodular, then $f(x^i)$ in Equation (53) is submodular.
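A minimal sketch of the expected-revenue objective in Equation (52); the weight matrix W and the parameter q below are made-up illustrative values.

```python
import numpy as np

def expected_revenue(x, W, q):
    """Expected revenue f(x) = sum_i sum_{j != i} W_ij * (1 - q**x_i) * q**x_j
    of the Influence-and-Exploit model with continuous assignments x."""
    adv = 1.0 - q ** x          # probability that each user becomes an advocate
    non = q ** x                # probability that each user stays a potential buyer
    total = adv @ W @ non       # sums W_ij * adv_i * non_j over all (i, j)
    total -= np.sum(np.diag(W) * adv * non)   # remove the j = i terms
    return total

W = np.array([[0.0, 2.0, 1.0], [0.5, 0.0, 0.0], [1.0, 3.0, 0.0]])  # illustrative weights
print(expected_revenue(np.array([1.0, 0.0, 2.0]), W, q=0.5))
```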
Many discrete submodular problems can be naturally generalized to the continuous setting with continuous submodular objectives. The maximum coverage problem and the problem of text summarization with submodular objectives are among the examples (Lin and Bilmes, 2010). We provide details in the sequel.

Submodularity-based objective functions for text summarization perform well in practice (Lin and Bilmes, 2010). Let $C$ be the set of all concepts and $\mathcal{V}$ be the set of all sentences. As a typical example, concept-based summarization aims to find a subset $S$ of the sentences that maximizes the total credit of the concepts covered by $S$. Soma et al. (2014) considered extending the submodular text summarization model to one that incorporates the "confidence" of a sentence, which takes discrete values, and modeled the objective as an integer submodular function. It is perhaps even more natural to consider continuous confidence values $x_i \in [0,1]$. Using $p_i(x_i)$ to denote the set of concepts covered when selecting sentence $i$ with confidence level $x_i$, $p_i: \mathbb{R}_+ \to 2^C$ can be a monotone covering function for each $i \in \mathcal{V}$. Then the objective function of the extended model is $f(x) = \sum_{j\in\cup_i p_i(x_i)} c_j$, where $c_j \in \mathbb{R}_+$ is the credit of concept $j$. It can be verified that this objective is a monotone continuous submodular function.

In the maximum coverage problem, there are $n$ subsets $C_1, \ldots, C_n$ of the ground set $\mathcal{V}$. Subset $C_i$ can be chosen with a "confidence" level $x_i \in [0,1]$; the elements of $C_i$ covered with confidence $x_i$ can be modeled with a monotone normalized covering function $p_i: \mathbb{R}_+ \to 2^{\mathcal{V}}$, $i = 1, \ldots, n$. The target is to choose confidence levels for the subsets $C_1, \ldots, C_n$ so as to maximize the number of covered elements $|\cup_{i=1}^{n} p_i(x_i)|$, while respecting the budget constraint $\sum_i c_i x_i \le b$ (where $c_i$ is the cost of choosing subset $C_i$). This problem generalizes the classical maximum coverage problem. It is easy to see that the objective function is monotone submodular, and the constraint is a down-closed polytope.

For cost-sensitive outbreak detection in sensor networks (Leskovec et al., 2007), one needs to place sensors in a subset of locations selected from all the possible locations $\mathcal{V}$, in order to quickly detect a set of contamination events $E$, while respecting the cost constraints of the sensors. For each location $v \in \mathcal{V}$ and each event $e \in E$, a value $t(v, e)$ is provided as the time it takes for the sensor placed at $v$ to detect event $e$. Soma and Yoshida (2015a) considered sensors with discrete energy levels. It is natural to model the energy levels of the sensors as a continuous variable $x \in \mathbb{R}_+^{\mathcal{V}}$. For a sensor with energy level $x_v$, the probability that it successfully detects the event is $1 - (1 - p)^{x_v}$, which models the fact that by spending one unit of energy one has an extra chance of detecting the event with probability $p$. In this model, beyond deciding whether to place a sensor or not, one also needs to decide the optimal energy levels. Let $t_\infty = \max_{e\in E, v\in\mathcal{V}} t(v, e)$, and let $v_e$ be the first sensor that detects event $e$ ($v_e$ is a random variable). One can define the objective as the expected detection time that can be saved,

$f(x) := \mathbb{E}_{e\in E}\,\mathbb{E}_{v_e}\big[t_\infty - t(v_e, e)\big],$   (54)

which is a monotone DR-submodular function. Maximizing $f(x)$ subject to the cost constraints pursues the goal of finding the optimal energy levels of the sensors, so as to maximize the expected detection time that can be saved.

Suppose we have a collection of items, e.g., images $\mathcal{V} = \{v_1, \ldots, v_n\}$. We follow the strategy of extracting a representative summary, where representativeness is defined w.r.t. a submodular set function $F: 2^{\mathcal{V}} \to \mathbb{R}$. However, instead of returning a single set, our goal is to obtain summaries at multiple levels of detail or resolution. One way to achieve this goal is to assign each item $v_i$ a nonnegative score $x_i$. Given a user-tunable threshold $\tau$, the resulting summary $S_\tau = \{v_i \mid x_i \ge \tau\}$ is the set of items with scores exceeding $\tau$.
Thus, instead of solving the discrete problem of selecting a fixed set $S$, we pursue the goal of optimizing over the scores, e.g., using the following continuous submodular function,

$f(x) = \sum_{i\in\mathcal{V}}\sum_{j\in\mathcal{V}} \phi(x_j)\, s_{i,j} - \sum_{i\in\mathcal{V}}\sum_{j\in\mathcal{V}} x_i x_j\, s_{i,j},$   (55)

where $s_{i,j} \ge 0$ measures the similarity between items $i$ and $j$, and $\phi(\cdot)$ is a non-decreasing concave function.

The classical discrete facility location problem can be generalized to the continuous case, where the scale of a facility is determined by a continuous value in the interval $[0, \bar u]$. For a set of facilities $\mathcal{V}$, let $x \in \mathbb{R}_+^{\mathcal{V}}$ be the scales of all facilities. The goal is to decide how large each facility should be in order to optimally serve a set $T$ of customers. For a facility $s$ of scale $x_s$, let $p_{st}(x_s)$ be the value of the service it can provide to customer $t \in T$, where $p_{st}(x_s)$ is a normalized monotone function ($p_{st}(0) = 0$). Assuming each customer chooses the facility with the highest value, the total service provided to all customers is $f(x) = \sum_{t\in T}\max_{s\in\mathcal{V}} p_{st}(x_s)$. It can be shown that $f$ is monotone submodular.
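A minimal sketch of this continuous facility location objective, assuming for illustration the normalized monotone service function $p_{st}(x_s) = W_{st}(1 - e^{-x_s})$ (a made-up choice):

```python
import numpy as np

def facility_location(x, W):
    """Continuous facility location value f(x) = sum_t max_s p_st(x_s),
    with the (assumed) normalized monotone service p_st(x_s) = W[s, t] * (1 - exp(-x_s))."""
    service = W * (1.0 - np.exp(-x))[:, None]   # p_st(x_s), shape (facilities, customers)
    return service.max(axis=0).sum()            # each customer picks the best facility

W = np.array([[1.0, 0.2, 0.0], [0.3, 0.9, 0.8]])   # illustrative facility-customer values
print(facility_location(np.array([2.0, 0.5]), W))
```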
7. Algorithms for Monotone DR-Submodular Maximization
In this section, we present two classes of algorithms for maximizing a monotone continuous DR-submodular function subject to a down-closed convex constraint. The detailed proofs can be found in Appendix C. Even with the monotonicity assumption, solving the problem to optimality is still a very challenging task. In fact, we prove the following hardness result:
Proposition 25 (Hardness and Inapproximability)
The problem of maximizing a monotone nondecreasing continuous DR-submodular function subject to a general down-closed polytope constraint is NP-hard. For any $\epsilon > 0$, it cannot be approximated in polynomial time within a ratio of $(1 - 1/e + \epsilon)$ (up to low-order terms), unless RP = NP.

Proposition 25 can be proved by a reduction from the problem of maximizing a monotone submodular set function subject to a cardinality constraint. The proof relies on the techniques of the multilinear extension (Calinescu et al., 2007; Calinescu et al., 2011) and pipage rounding (Ageev and Sviridenko, 2004), and also on the hardness results of Feige (1998); Calinescu et al. (2007).
Remark 26
Due to the NP-hardness of converging to the global optimum for Problem (P), in the following, by "convergence" we mean converging to a solution point that has a constant-factor approximation guarantee with respect to the global optimum.

Non-convex FW and
PGA
The first class of algorithms directly utilize the local-global relation of Corollary 21. Weknow that any stationary point is a 1/2 approximate solution. Thus any solver that obtainsa stationary point yields a solution with a 1/2 approximation guarantee. We give twoconcrete examples below.
Non-convex FW
Algorithm
For the sake of completeness, we summarize the
Non-convex FW algorithm in Algorithm 2.
Algorithm 2:
Non-convex FW$(f, \mathcal{P}, K, \epsilon, x^0)$ (Lacoste-Julien, 2016) for maximizing a smooth objective

Input: $\max_{x\in\mathcal{P}} f(x)$; $f$: a smooth function; $\mathcal{P}$: convex set; $K$: number of iterations; $\epsilon$: stopping tolerance
for $k = 0, \ldots, K$ do
    find $v_k$ s.t. $\langle v_k, \nabla f(x_k)\rangle \ge \max_{v\in\mathcal{P}}\langle v, \nabla f(x_k)\rangle$;   // LMO
    $d_k \leftarrow v_k - x_k$, $g_k := \langle d_k, \nabla f(x_k)\rangle$;   // $g_k$: non-stationarity measure
    if $g_k \le \epsilon$ then return $x_k$;
    Option I: $\gamma_k \in \arg\max_{\gamma\in[0,1]} f(x_k + \gamma d_k)$;  Option II: $\gamma_k \leftarrow \min\{g_k/C, 1\}$ for $C \ge C_f(\mathcal{P})$;
    $x_{k+1} \leftarrow x_k + \gamma_k d_k$;
Output: $x_{k'}$ and $g_{k'} = \min_{0\le k\le K} g_k$   // modified output solution compared to that of Lacoste-Julien (2016)

Algorithm 2 is modified from Lacoste-Julien (2016). The only difference lies in the output: we return the solution $x_{k'}$ with the minimum non-stationarity, which is needed to invoke the local-global relation. In contrast, Lacoste-Julien (2016) outputs the solution from the last iteration. Since $C_f(\mathcal{P})$ is generally hard to evaluate, we use the classical oblivious step size rule ($\gamma_k = 2/(k+2)$) and the Lipschitz step size rule ($\gamma_k = \min\{1, g_k/(L\|d_k\|^2)\}$, where $g_k$ is the so-called Frank-Wolfe gap) in the experiments (Section 9).
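A compact Python sketch of this Frank-Wolfe loop with the oblivious step size; the linear maximization oracle `lmo` is assumed to be supplied by the caller (e.g., an LP solver over P):

```python
import numpy as np

def nonconvex_fw(grad, lmo, x0, K=100, eps=1e-6):
    """Frank-Wolfe for maximizing a smooth function over a convex set.
    grad: x -> gradient; lmo: g -> argmax_{v in P} <v, g>.
    Returns the iterate with the smallest Frank-Wolfe gap (non-stationarity)."""
    x, best_x, best_gap = x0.copy(), x0.copy(), np.inf
    for k in range(K):
        v = lmo(grad(x))
        d = v - x
        gap = float(d @ grad(x))          # non-stationarity measure g_k
        if gap < best_gap:
            best_x, best_gap = x.copy(), gap
        if gap <= eps:
            break
        x = x + (2.0 / (k + 2.0)) * d     # oblivious step size 2/(k+2)
    return best_x, best_gap
```

Any LMO can be plugged in; for a box constraint $[0, \bar u]$ it is simply $v_i = \bar u_i$ when the $i$-th gradient entry is positive and $v_i = 0$ otherwise.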
Hassani et al. (2017) show that the Projected Gradient Ascent algorithm (PGA) with constant step size $1/L$ converges to a stationary point, so it has a 1/2 approximation guarantee. We can also show that the Non-convex FW of Lacoste-Julien (2016) has a 1/2 approximation guarantee according to the local-global relation:

Corollary 27
The non-convex Frank-Wolfe algorithm (abbreviated as
Non-convex FW) of Lacoste-Julien (2016) has a 1/2 approximation guarantee, and a $1/\sqrt{k}$ rate of convergence for solving Problem (P) when the objective is monotone nondecreasing.

PGA Algorithm
Algorithm 3 is reproduced from Hassani et al. (2017) for completeness. It takes a smooth DR-submodular function $f$ and a convex constraint $\mathcal{P}$, and runs for $K$ iterations. In each iteration, we first choose a step size $\gamma_k$, then update the current solution using the current gradient to obtain a point $y_{k+1}$. Lastly, we project $y_{k+1}$ onto the convex set $\mathcal{P}$, which amounts to solving a constrained quadratic program. After $K$ iterations, we output the solution with the maximal function value, which is slightly different from that of Hassani et al. (2017).

Algorithm 3: PGA for maximizing a monotone DR-submodular objective (Hassani et al., 2017)

Input: $\max_{x\in\mathcal{P}} f(x)$; $f$: a smooth DR-submodular function; $\mathcal{P}$: convex set; $K$: number of iterations; $x^0 \in \mathcal{P}$
for $k = 0, \ldots, K-1$ do
    Set step size $\gamma_k$;   // i): "Lipschitz" rule $1/L$; ii): adaptive rule $C/\sqrt{k}$
    $y_{k+1} \leftarrow x_k + \gamma_k\nabla f(x_k)$;
    $x_{k+1} \leftarrow \arg\min_{x\in\mathcal{P}}\|x - y_{k+1}\|$;   // Projection
Output: $x_{k'}$ with $k' = \arg\max_{0\le k\le K} f(x_k)$   // modified output compared to that of Hassani et al. (2017)

The resulting algorithm has a 1/2 approximation guarantee and a sublinear rate of convergence:
Theorem 28 (Hassani et al. (2017)) For Algorithm 3, if one chooses $\gamma_k = 1/L$, then after $K$ iterations,

$f(x_K) \ge \frac{f(x^*)}{2} - \frac{D^2 L}{2K}.$   (56)

It is worth noting that, in general, the smoothness parameter $L$ is difficult to estimate, so the "Lipschitz" step size rule $\gamma_k = 1/L$ poses a challenge for implementation. In experiments, Hassani et al. (2017) also suggest the adaptive step size rule $\gamma_k = C/\sqrt{k}$, where $C$ is a constant.
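A minimal sketch of projected gradient ascent for the special case of a box constraint $[0, \bar u]$, where the projection reduces to clipping (a general convex P would need a quadratic-programming projection); the quadratic objective in the usage example is a made-up DR-submodular instance:

```python
import numpy as np

def pga_box(f, grad, u_bar, K=200, step_rule=lambda k: 0.5 / np.sqrt(k + 1), x0=None):
    """Projected gradient ascent over the box [0, u_bar] (projection = clipping).
    Returns the iterate with the largest function value, as in Algorithm 3."""
    x = np.zeros_like(u_bar) if x0 is None else x0.copy()
    best_x, best_f = x.copy(), f(x)
    for k in range(K):
        y = x + step_rule(k) * grad(x)            # gradient ascent step
        x = np.clip(y, 0.0, u_bar)                # projection onto the box
        if f(x) > best_f:
            best_x, best_f = x.copy(), f(x)
    return best_x

# Example: f(x) = <a, x> - 0.5 x^T H x with H >= 0 entrywise is DR-submodular.
a, H = np.array([1.0, 2.0]), np.array([[0.5, 0.3], [0.3, 0.5]])
f = lambda x: a @ x - 0.5 * x @ H @ x
grad = lambda x: a - H @ x
print(pga_box(f, grad, u_bar=np.array([1.0, 1.0])))
```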
Submodular FW: Follow Concave Directions

For DR-submodular maximization, one key property is that, while such functions are non-convex/non-concave in general, they are concave along any non-negative direction (cf. Proposition 18). Thus, if we design an algorithm that follows a non-negative direction in each update step, we ensure that it makes progress along a concave direction. As a consequence, its function value is guaranteed to grow by a certain increment. Based on this intuition, we present the Submodular FW algorithm, which is a generalization of the continuous greedy algorithm of Vondrák (2008) and of the classical Frank-Wolfe algorithm (Frank and Wolfe, 1956; Jaggi, 2013a).

Algorithm 4 summarizes the details. Since it is a variant of the convex Frank-Wolfe algorithm for DR-submodular maximization, we call it Submodular FW. In iteration $k$, it uses the linearization of $f(\cdot)$ as a surrogate, and moves in the direction of the maximizer of this surrogate function, i.e., $v_k = \arg\max_{v\in\mathcal{P}}\langle v, \nabla f(x_k)\rangle$. Intuitively, it searches for the direction in which one can maximize the improvement in the function value and still remain feasible. Finding such a direction requires maximizing a linear objective in each iteration. Meanwhile, it eliminates the need to project back onto the feasible set in each iteration, which is an essential step for methods such as projected gradient ascent (PGA). The Submodular FW algorithm updates the solution in each iteration using step size $\gamma_k$, which can simply be set to a prespecified constant $\gamma$.

Algorithm 4: Submodular FW for monotone DR-submodular maximization (Bian et al., 2017b)

Input: $\max_{x\in\mathcal{P}} f(x)$; $\mathcal{P}$ is a down-closed convex set in the positive orthant with lower bound $0$; prespecified step size $\gamma \in (0,1]$; error levels $\alpha$ and $\delta$; number of iterations $K$
$x^0 \leftarrow 0$, $t \leftarrow 0$, $k \leftarrow 0$   // $k$: iteration index, $t$: cumulative step size
while $t < 1$ do
    find step size $\gamma_k \in (0,1]$, e.g., $\gamma_k \leftarrow \gamma$; set $\gamma_k \leftarrow \min\{\gamma_k, 1 - t\}$;
    find $v_k$ s.t. $\langle v_k, \nabla f(x_k)\rangle \ge \alpha\max_{v\in\mathcal{P}}\langle v, \nabla f(x_k)\rangle - \tfrac{1}{2}\delta\gamma_k L D^2$;   // $\alpha \in (0,1]$ is the multiplicative error level, $\delta \in [0,\bar\delta]$ is the additive error level
    $x_{k+1} \leftarrow x_k + \gamma_k v_k$, $t \leftarrow t + \gamma_k$, $k \leftarrow k + 1$;
Output: $x_K$

Note that Submodular FW can tolerate both a multiplicative error $\alpha$ and an additive error $\delta$ when solving the LMO subproblem (Step 4 of Algorithm 4). Setting $\alpha = 1$ and $\delta = 0$ recovers the error-free case.
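A minimal sketch of the Submodular FW loop in the error-free case with constant step size $\gamma = 1/K$; as before, the LMO over P is assumed to be supplied by the caller:

```python
import numpy as np

def submodular_fw(grad, lmo, n, K=100):
    """Submodular FW / continuous greedy: start at 0, repeatedly move by
    gamma * v_k where v_k maximizes <v, grad f(x_k)> over P, until the
    cumulative step size reaches 1."""
    x, t, gamma = np.zeros(n), 0.0, 1.0 / K
    while t < 1.0:
        step = min(gamma, 1.0 - t)
        v = lmo(grad(x))          # LMO: argmax_{v in P} <v, grad f(x)>
        x = x + step * v          # move along v (not v - x, unlike classical FW)
        t += step
    return x
```

Since the step sizes sum to one, the output is a convex combination of LMO outputs and hence lies in P.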
Remark 29 The main difference between Submodular FW in Algorithm 4 and the classical Frank-Wolfe algorithm in Algorithm 1 lies in the update direction being used: for Algorithm 4, the update direction (in Step 5) is $v_k$, while for classical Frank-Wolfe it is $v_k - x_k$, i.e., $x_{k+1} \leftarrow x_k + \gamma_k(v_k - x_k)$.

To prove the approximation guarantee, we first derive the following lemma.
Lemma 30
The output solution $x_K$ lies in $\mathcal{P}$. Assuming $x^*$ to be the optimal solution, one has

$\langle v_k, \nabla f(x_k)\rangle \ge \alpha\big[f(x^*) - f(x_k)\big] - \tfrac{1}{2}\delta\gamma_k L D^2, \quad \forall k = 0, \ldots, K-1.$   (57)

Theorem 31 (Approximation guarantee)
For error levels $\alpha \in (0,1]$, $\delta \in [0,\bar\delta]$, after $K$ iterations Algorithm 4 outputs $x_K \in \mathcal{P}$ such that

$f(x_K) \ge (1 - e^{-\alpha})\, f(x^*) - \frac{L D^2 (1+\delta)}{2}\sum_{k=0}^{K-1}\gamma_k^2 + e^{-\alpha} f(0).$   (58)

Theorem 31 gives the approximation guarantee for any step sizes $\gamma_k$. By observing that $\sum_{k=0}^{K-1}\gamma_k = 1$ and $\sum_{k=0}^{K-1}\gamma_k^2 \ge 1/K$ (see the proof in Appendix C.5), with a constant step size we obtain the following "tightest" approximation bound,

Corollary 32
For a fixed number of iterations $K$ and constant step size $\gamma_k = \gamma = 1/K$, Algorithm 4 provides the following approximation guarantee:

$f(x_K) \ge (1 - e^{-\alpha})\, f(x^*) - \frac{L D^2 (1+\delta)}{2K} + e^{-\alpha} f(0).$   (59)

Corollary 32 implies that, with a constant step size $\gamma$: 1) when $\gamma \to 0$ ($K \to \infty$), Algorithm 4 outputs a solution with the worst-case guarantee $(1 - 1/e) f(x^*)$ in the error-free case if $f(0) = 0$; and 2) Submodular FW has a sub-linear convergence rate for monotone DR-submodular maximization over any down-closed convex constraint.
Remarks on computational cost.
It can be seen that, when using a constant step size, Algorithm 4 needs $O(1/\epsilon)$ iterations to get $\epsilon$-close to the best-possible function value $(1 - e^{-1}) f(x^*)$ in the error-free case. When $\mathcal{P}$ is a polytope in the positive orthant, one iteration of Algorithm 4 costs approximately the same as solving a positive LP, for which a nearly-linear time solver exists (Allen-Zhu and Orecchia, 2015).
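To make the LMO step concrete, here is a sketch that poses it as a small linear program with SciPy; the constraint data below are made-up placeholders:

```python
import numpy as np
from scipy.optimize import linprog

def lmo_polytope(g, A, b, u_bar):
    """Solve the LMO max_v <v, g> s.t. A v <= b, 0 <= v <= u_bar
    (a linear program; linprog minimizes, so we negate the objective)."""
    res = linprog(-g, A_ub=A, b_ub=b, bounds=[(0.0, ub) for ub in u_bar], method="highs")
    return res.x

# Toy example: one cardinality-style constraint 1^T v <= 1 inside the unit box.
g = np.array([0.2, 1.0, 0.5])
A, b, u_bar = np.ones((1, 3)), np.array([1.0]), np.array([1.0, 1.0, 1.0])
print(lmo_polytope(g, A, b, u_bar))   # expected: all mass on the largest gradient entry
```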
8. Algorithms for Non-Monotone DR-Submodular Maximization
In this section, we present algorithms for the problem of non-monotone DR-submodular maximization; all omitted proofs can be found in Appendix E. Non-monotone DR-submodular maximization is strictly harder than the monotone setting. Even for the simple situation with only one hyperrectangle constraint ($\mathcal{P} = [0,1]^n$), we have the following hardness result:
The problem of maximizing a generally non-monotone continuous DR-submodular function subject to a hyperrectangle constraint is NP-hard. Furthermore, there is no $(1/2 + \epsilon)$-approximation for any $\epsilon > 0$, unless RP = NP.

The above results can be proved through a reduction from the problem of maximizing an unconstrained non-monotone submodular set function. The proof depends on the techniques of Calinescu et al. (2007); Buchbinder et al. (2012) and the hardness results of Feige et al. (2011); Dobzinski and Vondrák (2012). We propose two algorithms: the first is based on the local-global relation, and the second is a
Frank-Wolfe variant adapted for the non-monotone setting. All the omittedproofs are deferred to Appendix E.
Two-Phase Algorithm: Applying the Local-Global Relation

Algorithm 5: The Two-Phase algorithm (Bian et al., 2017a)

Input: $\max_{x\in\mathcal{P}} f(x)$; stopping tolerances $\epsilon_1, \epsilon_2$; numbers of iterations $K_1, K_2$
$x \leftarrow$ Non-convex Frank-Wolfe$(f, \mathcal{P}, K_1, \epsilon_1, x^0)$;   // $x^0 \in \mathcal{P}$
$\mathcal{Q} \leftarrow \mathcal{P} \cap \{y \in \mathbb{R}_+^n \mid y \le \bar u - x\}$;
$z \leftarrow$ Non-convex Frank-Wolfe$(f, \mathcal{Q}, K_2, \epsilon_2, z^0)$;   // $z^0 \in \mathcal{Q}$
Output: $\arg\max\{f(x), f(z)\}$

By directly applying the local-global relation in Section 5.3, we present the Two-Phase algorithm in Algorithm 5. It generalizes the "two-phase" method of Chekuri et al. (2014); Gillenwater et al. (2012). It invokes a non-convex solver (we use the Non-convex FW of Lacoste-Julien (2016); pseudocode is included in Algorithm 2 of Section 7.1.1) to find approximately stationary points in $\mathcal{P}$ and $\mathcal{Q}$, respectively, and then returns the solution with the larger function value.
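A minimal sketch of the two-phase scheme for the special case of a box constraint, reusing the `nonconvex_fw` sketch given after Algorithm 2 (both the box assumption and the reuse are simplifications for illustration):

```python
import numpy as np

def two_phase_box(f, grad, u_bar, K=100, eps=1e-6):
    """Two-Phase algorithm over P = [0, u_bar]: run Frank-Wolfe on P, then on
    Q = {y >= 0 : y <= u_bar - x}, and return the better of the two solutions."""
    lmo_p = lambda g: np.where(g > 0, u_bar, 0.0)             # LMO over the box P
    x, _ = nonconvex_fw(grad, lmo_p, x0=np.zeros_like(u_bar), K=K, eps=eps)
    lmo_q = lambda g: np.where(g > 0, u_bar - x, 0.0)         # LMO over the shrunken box Q
    z, _ = nonconvex_fw(grad, lmo_q, x0=np.zeros_like(u_bar), K=K, eps=eps)
    return x if f(x) >= f(z) else z
```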
Though we use Non-convex FW as a subroutine here, it is noteworthy that any algorithm that is guaranteed to find an approximately stationary point can be plugged into Algorithm 5 as a subroutine. We give an improved approximation bound by considering more properties of DR-submodular functions. Building on the results of Lacoste-Julien (2016), we obtain the following
Theorem 34
The output of Algorithm 5 satisfies

$\max\{f(x), f(z)\} \ge \mu\big(\|x - x^*\|^2 + \|z - z^*\|^2\big) + \frac{1}{4}\Big[f(x^*) - \min\Big\{\frac{\max\{h_1, C_f(\mathcal{P})\}}{\sqrt{K_1 + 1}}, \epsilon_1\Big\} - \min\Big\{\frac{\max\{h_2, C_f(\mathcal{Q})\}}{\sqrt{K_2 + 1}}, \epsilon_2\Big\}\Big],$   (60)

where $h_1 := \max_{x\in\mathcal{P}} f(x) - f(x^0)$ and $h_2 := \max_{z\in\mathcal{Q}} f(z) - f(z^0)$ are the initial suboptimalities, $C_f(\mathcal{P}) := \sup_{x,v\in\mathcal{P},\,\gamma\in(0,1],\, y = x + \gamma(v-x)} \frac{2}{\gamma^2}\big(f(y) - f(x) - (y - x)^\top\nabla f(x)\big)$ is the curvature of $f$ w.r.t. $\mathcal{P}$, and $z^* = x \vee x^* - x$.

Theorem 34 indicates that Algorithm 5 has a 1/4 approximation guarantee with a $1/\sqrt{k}$ rate of convergence. However, it has good empirical performance, as demonstrated by the practical experiments. Informally, this can be partially explained by the term $\mu\big(\|x - x^*\|^2 + \|z - z^*\|^2\big)$ in (60): if $x$ strongly deviates from $x^*$, then this term will augment the bound; if $x$ is close to $x^*$, then by the smoothness of $f$ it should be close to optimal.

Shrunken FW: Follow Concavity and Shrink Constraint
Algorithm 6: The Shrunken FW algorithm for non-monotone DR-submodular maximization (Bian et al., 2017a)

Input: $\max_{x\in\mathcal{P}} f(x)$; $K$; step size $\gamma = 1/K$
$x^0 \leftarrow 0$, $t_0 \leftarrow 0$, $k \leftarrow 0$   // $k$: iteration index, $t_k$: cumulative step size
while $t_k < 1$ do
    $v_k \leftarrow \arg\max_{v\in\mathcal{P},\, v\le\bar u - x_k}\langle v, \nabla f(x_k)\rangle$;   // shrunken LMO
    use the uniform step size $\gamma_k = \gamma$; set $\gamma_k \leftarrow \min\{\gamma_k, 1 - t_k\}$;
    $x_{k+1} \leftarrow x_k + \gamma_k v_k$, $t_{k+1} \leftarrow t_k + \gamma_k$, $k \leftarrow k + 1$;
Output: $x_K$   // suppose there are $K$ iterations in total

Algorithm 6 summarizes the Shrunken FW variant, which is inspired by the unified continuous greedy algorithm of Feldman et al. (2011) for maximizing the multilinear extension of a submodular set function. It initializes the solution $x^0$ to $0$, and maintains $t_k$ as the cumulative step size. At iteration $k$, it maximizes the linearization of $f$ over a "shrunken" constraint set $\{v \mid v\in\mathcal{P}, v\le\bar u - x_k\}$, which differs from the classical LMO of Frank-Wolfe-style algorithms (hence we refer to it as the "shrunken LMO"). Then it takes an update step in the direction $v_k$ chosen by the LMO, with a uniform step size $\gamma_k = \gamma$. The cumulative step size $t_k$ is used to ensure that the overall step sizes sum to one; thus the output solution $x_K$ is a convex combination of the LMO outputs, and hence also lies in $\mathcal{P}$.

The shrunken LMO (Step 3) is the key difference compared to the Submodular FW variant of Bian et al. (2017b) (detailed in Algorithm 4); therefore, we call Algorithm 6 Shrunken FW. The extra constraint $v \le \bar u - x_k$ is added to prevent too rapid a growth of the solution, since in the non-monotone setting such fast growth may hurt the overall performance. The next theorem states the guarantees of Shrunken FW in Algorithm 6.
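A minimal sketch of this update loop; the shrunken LMO is abstracted as a callable that receives the gradient and the remaining headroom $\bar u - x_k$ (an assumption of this sketch):

```python
import numpy as np

def shrunken_fw(grad, shrunken_lmo, u_bar, K=100):
    """Shrunken FW: like Submodular FW, but the LMO is restricted to
    {v in P : v <= u_bar - x_k} to slow down the growth of the solution."""
    x, t, gamma = np.zeros_like(u_bar), 0.0, 1.0 / K
    while t < 1.0:
        step = min(gamma, 1.0 - t)
        v = shrunken_lmo(grad(x), u_bar - x)   # argmax_{v in P, v <= u_bar - x} <v, grad>
        x = x + step * v
        t += step
    return x
```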
Theorem 35
Consider Algorithm 6 with uniform step size $\gamma$. For $k = 1, \ldots, K$ it holds that

$f(x_k) \ge t_k e^{-t_k} f(x^*) - \frac{L D^2}{2}\, k\gamma^2 - O(\gamma^2)\, f(x^*).$   (61)

By observing that $t_K = 1$ and applying Theorem 35, we get the following corollary:

Corollary 36
The output of Algorithm 6 satisfies

$f(x_K) \ge \frac{1}{e} f(x^*) - \frac{L D^2}{2K} - O\Big(\frac{1}{K^2}\Big) f(x^*).$   (62)

Corollary 36 shows that Algorithm 6 enjoys a sublinear convergence rate towards some point $x_K$ inside $\mathcal{P}$, with a $1/e$ approximation guarantee.

Proof sketch of Theorem 35:
The proof is by induction. To prepare the building blocks, we first show that the growth of $x_k$ is indeed bounded:

Lemma 37 (Bounding the growth of $x_k$) Assume $x^0 = 0$. For $k = 0, \ldots, K-1$, it holds that

$x^k_i \le \bar u_i\big[1 - (1-\gamma)^{t_k/\gamma}\big], \quad \forall i \in [n].$   (63)

The following lemma then provides a lower bound which depends on the global optimum:

Lemma 38 (Generalized from Lemma 7 of Chekuri et al. (2015))
Given $\theta \in (0, \bar u]$, let $\lambda' := \max_{i\in[n]} \theta_i/\bar u_i$. Then for all $x \in [0, \theta]$, it holds that

$f(x \vee x^*) \ge (1 - \lambda')\, f(x^*).$   (64)

The key ingredient for the induction is then the relation between $f(x_{k+1})$ and $f(x_k)$ given by the following claim:

Claim 39
For $k = 0, \ldots, K-1$ it holds that

$f(x_{k+1}) \ge (1-\gamma)\, f(x_k) + \gamma(1-\gamma)^{t_k/\gamma} f(x^*) - \frac{L D^2}{2}\gamma^2.$   (65)

It is derived by combining the quadratic lower bound in Equation (31), Lemma 37 and Lemma 38.

Notice that though the
Two-Phase algorithm has an inferior guarantee compared to
Shrunken FW, it is still of interest: i) it preserves flexibility in using a wide range of existing solvers for finding an (approximately) stationary point; ii) the guarantees that we present rely on a worst-case analysis. The empirical performance of the
Two-Phase algorithm is oftencomparable or better than that of
Shrunken FW. This suggests exploring further properties of concrete problems that may favor the
Two-Phase algorithm.
9. Experimental Evaluation
Following the application in Section 6.3, we consider the following simplified influence model for the experiments. The resulting problem is an instance of the monotone DR-submodular maximization problem.
Simplified Influence Model for Experiments.
For general influence models, it is hardto evaluate Equation (46). To simplify the experiments, we consider F ( S ) to be a facilitylocation objective, for which the expected influence has a closed-form expression, as shownby Bian et al. (2019, Section 4.2). Here each customer may represent an “opinion leader”in social networks, and there is a bipartite graph describing the influence strength of eachopinion leader to the population. Dataset.
We used the UC Irvine forum dataset as the real-world bipartite graph. It is a bipartite network containing user posts to forums. The users are students at the University of California, Irvine. An edge represents a forum message on a specific forum. It has in total 899 users, 522 forums and 33,720 edges (posts on the forums). For a specific (user, forum) pair, we determine the edge weight as the number of posts from that user on the forum. This weighting reflects that the more a user has posted on a forum, the more he has influenced that particular forum. With this processing, we have 7,089 unique edges between users and forums. We experimented with the independent marketing actions of Section 6.3.1 for simplicity. For a customer $i$, we set the parameter
$p_i \in [0,1]$ based on the following heuristic: first, we calculate the "degree" of customer $i$ as the number of forums he has posted on, $d_i = \|W_{i:}\|_0$. Then we set $p_i = \sigma(-d_i)$, where $\sigma(\cdot)$ is the logistic sigmoid function. Remember that $p_i$ is the probability of customer $i$ becoming activated with one unit of investment, so this heuristic means that the more influence power a user has, the more difficult it is to activate him, because he might charge more than other users with less influence power. Since it is too time consuming to experiment on the whole bipartite graph, we experimented on different subgraphs of the original bipartite graph.

Figure 5: Expected influence w.r.t. iterations of different algorithms (Submodular FW; PGA with the Lipschitz and $C/\sqrt{k+1}$ step size rules; Non-convex FW with the $2/(k+2)$ and Lipschitz step size rules) on real-world subgraphs with (a) 50 users, 10 forums; (b) 100 users, 10 forums; (c) 150 users, 20 forums; (d) 200 users, 20 forums. Submodular FW has a stable performance; it does not need to tune the step sizes or any hyperparameters. PGA algorithms are sensitive to the quality of the tuned step sizes. Non-convex FW with the Lipschitz step size rule also needs a careful tuning of the Lipschitz parameter.

Figure 5 documents the trajectories of expected influence of the different algorithms. We can see that
Submodular FW has a very stable performance: it always reaches a fairly good solution, regardless of the problem setting, and it does not need to tune the step sizes or any hyperparameters. One drawback is that it converges relatively slowly in the beginning. For
PGA algorithms, we tested two step size rules: the Lipschitz rule ($1/L$), which has the 1/2 approximation guarantee, and the diminishing step size rule ($C/\sqrt{k+1}$), which does not have a formal theoretical guarantee. One general observation is that both step size rules need a careful tuning of hyperparameters, and the performance crucially depends on the quality of the hyperparameters. For example, for PGA, if the step size is too small, it may converge too slowly; if the step sizes are too large, it tends to fluctuate. For Non-convex FW algorithms, we also tested two step size rules: the "oblivious" rule ($2/(k+2)$) and the Lipschitz rule. Apparently the Lipschitz step size rule needs a careful tuning of the Lipschitz parameter $L$, while the oblivious rule does not. With a careful tuning of $L$, both Non-convex FW variants converge very fast and reach the highest function value.
Maximizing softmax extensions for DPP MAP inference is an important instance of the non-monotone DR-submodular maximization problem. One can obtain the derivative of the softmax extension in Equation (49) as

$\nabla_i f(x) = \mathrm{tr}\big(\big[\mathrm{diag}(x)(L - I) + I\big]^{-1}(L - I)_i\big), \quad \forall i \in [n],$   (66)

where $(L - I)_i$ denotes the matrix obtained by zeroing all entries except for the $i$th row of $(L - I)$. Let $C := (\mathrm{diag}(x)(L - I) + I)^{-1}$ and $D := L - I$; one can then see that $\nabla_i f(x) = D_{i\cdot} C_{\cdot i}$, where $D_{i\cdot}$ is the $i$-th row of $D$ and $C_{\cdot i}$ is the $i$-th column of $C$. This gives an efficient way to calculate the gradient $\nabla f(x)$.
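A small NumPy sketch of this gradient formula (the kernel L below is a made-up PSD matrix):

```python
import numpy as np

def softmax_extension_grad(x, L):
    """Gradient of f(x) = log det(diag(x)(L - I) + I):
    grad_i = <i-th row of (L - I), i-th column of (diag(x)(L - I) + I)^{-1}>."""
    n = L.shape[0]
    D = L - np.eye(n)
    C = np.linalg.inv(np.diag(x) @ D + np.eye(n))
    return np.einsum("ij,ji->i", D, C)   # grad_i = D[i, :] @ C[:, i]

L = np.array([[1.2, 0.3, 0.1], [0.3, 1.1, 0.4], [0.1, 0.4, 1.3]])  # illustrative PSD kernel
print(softmax_extension_grad(np.array([0.5, 0.2, 0.9]), L))
```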
Results on Synthetic Data. We generate the softmax objectives (see (49)) in the following way: first generate the $n$ eigenvalues $d \in \mathbb{R}_+^n$, each evenly distributed in $[0,1]$, and let $D = \mathrm{diag}(d)$. After generating a random unitary matrix $U$, we set $L = UDU^\top$. One can verify that $L$ is positive semidefinite with the entries of $d$ as its eigenvalues. Then we generate one cardinality constraint of the form $Ax \le b$, where $A$ is the all-ones row vector and $b$ is a constant fraction of $n$. Function value trajectories returned by different solvers on problem instances with different dimensionalities are shown in Figure 6. One can observe that Two-Phase FW has the fastest convergence.
Shrunken FW converges slower, however it always eventually returnsa high function value. The performance of
PGA highly depends on the hyperparameters ofthe step sizes.
Figure 6: Trajectories of different solvers on softmax instances with one cardinality constraint. Left: $n = 50$; Middle: $n = 130$; Right: $n = 210$. Two-Phase FW has the fastest convergence. Shrunken FW converges slower, yet it always eventually returns a high function value. The performance of PGA highly depends on the hyperparameters of the step sizes.

We experiment with the model from Section 6.7.1 on several real-world graphs. (The UC Irvine forum dataset used above is available at http://konect.uni-koblenz.de/networks/opsahl-ucforum .) Note that the objective of the simplified revenue maximization model is in general continuous submodular, and the resulting optimization problem is a continuous submodular maximization problem with a down-closed convex constraint. For this problem setting, the studied algorithms might not have a formal theoretical approximation guarantee. Yet, due to the practical usage of this application, it is worthwhile to use it as a robustness test for the algorithms in this section.
The real-world graphs are from the Konect network collection (Kunegis, 2013) and the SNAP datasets. The graph datasets and corresponding experimental parameters are recorded in Table 3. We tested with the constraint that is the intersection of one box constraint ($0 \le x_i \le u$) and one cardinality constraint $\mathbf{1}^\top x \le b$.

Table 3: Graph datasets and the corresponding experimental parameters. $n$ is the number of nodes, $m$ is the number of edges, $q$ is the parameter of the model in Section 6.7.1, $u$ is the upper bound of the box constraint, and $b$ is the budget (a fixed fraction of $nu$).

Dataset name            n      m                        q     u
"Reality Mining"        96     1,086,404 (multiedge)    0.75  10
"Residence hall"        217    2,672                    0.75  10
"Infectious"            410    17,298                   0.7   20
"U. Rovira i Virgili"   1,133  5,451                    0.8   20
"ego Facebook"          4,039  88,234                   0.9   40

(Konect: http://konect.uni-koblenz.de/networks ; SNAP: http://snap.stanford.edu/ )

For a specific example, the "Reality Mining" (Eagle and Pentland, 2006) dataset contains the contact data of 96 persons, collected by tracking 100 mobile phones. The dataset was collected in 2004 over the course of nine months and represents approximately 500,000 hours of data on users' location, communication and device usage behavior. Here one contact could mean a phone call, Bluetooth sensor proximity or physical location proximity. We use the number of contacts as the weight of an edge, assuming that the more contacts happen between two persons, the stronger the connection strength should be.

Results on a Small Graph for Visualization.
First, we tested on a small graph, in order to clearly visualize the results. We select a subgraph from the "Reality Mining" dataset by taking the first five users/nodes; the nodes and the numbers of contacts among them are shown in Figure 7a. For illustration, we label the five users as "A, B, C, D, E". One can see that there are different levels of contact between different users; for example, there are 22,194 contacts between A and B, while there are only 82 contacts between E and C. Figure 7b traces the trajectories of the different algorithms when maximizing the revenue objective. They were all run for 20 iterations. One can see that
Shrunken FW and
Two-Phase FW reach higher revenue than
PGA algorithms. Notice that
Shrunken FW and
Two-Phase FW with oblivious step sizes do not need to tune any hyperparameters, while the others need to adapt the Lipschitz parameter L and the constant C to determine the step sizes.

Figure 7: Results on the "Reality Mining" subgraph with one cardinality constraint, where $u = 10$ and $b$ is a fixed fraction of $n\cdot u$. (a) The "Reality Mining" subgraph; (b) trajectories of the algorithms (Shrunken FW, Two-Phase FW with the Lipschitz and $2/(k+2)$ step sizes, PGA with the $1/L$ and $C/\sqrt{k+1}$ step sizes) over 20 iterations.

One may ask: how do the assignments look for the different algorithms?
In order to show this behavior, we visualize the assignments in Figure 8. One can see that
Shrunken FW assigns user A the most free products (6.1), followed by user C (3.3), then user E (0.6). All other users get 0 assignment. This is consistent with the intuition: one can observe that user A most strongly influences the other users (with total contacts of 22,194 + 410 + 143), while user D exerts zero influence on others. Two-Phase FW provides a similar result, while PGA is conservative in assigning free products to users. (The "Reality Mining" data is available at http://konect.uni-koblenz.de/networks/mit and http://realitycommons.media.mit.edu/realitymining.html .)

Figure 8: Assignments to the users returned by different algorithms: Shrunken FW (A: 6.1, B: 0.0, C: 3.3, D: 0.0, E: 0.6), Two-Phase FW with the Lipschitz step size (A: 6.3, B: 0.0, C: 3.3, D: 0.0, E: 0.3), and PGA with the $C/\sqrt{k+1}$ step size (A: 4.6, B: 0.0, C: 1.9, D: 0.0, E: 0.5). PGA is more conservative in terms of assigning free products to users than the other two algorithms: Shrunken FW and Two-Phase FW.

Results on Big Graphs.
Then we looked at the behavior of the algorithms on the original big graphs, plotted in Figure 9, for real-world graphs with up to $n = 4{,}039$ nodes. Usually the Two-Phase FW algorithm achieves the highest objective value, and also converges at the fastest rate. Shrunken FW converges slower than Two-Phase FW, but it always reaches a competitive function value. PGA algorithms need to tune the step-size parameters, and converge to lower objective values.
10. Conclusion
In this work, we have systematically studied continuous submodularity and the problem of continuous (DR-)submodular maximization. With rigorous characterizations and a study of composition rules, we established important properties of this class of functions. Based on geometric properties of continuous DR-submodular maximization, we proposed provable algorithms for both the monotone and non-monotone settings. We also identified representative applications and demonstrated the effectiveness of the proposed algorithms on both synthetic and real-world experiments.

Figure 9: Trajectories of different algorithms on real-world graphs: (a) "Residence hall" ($n = 217$); (b) "Infectious" ($n = 410$); (c) "U. Rovira i Virgili" ($n = 1{,}133$); (d) "ego Facebook" ($n = 4{,}039$). Usually Two-Phase FW achieves the highest objective value, and also converges at the fastest rate. Shrunken FW converges slower than Two-Phase FW, but it always reaches a competitive function value. PGA algorithms need to tune the step-size parameters, and converge to lower objective values.

References
Alexander A Ageev and Maxim I Sviridenko. Pipage rounding: A new method of construct-ing algorithms with proven performance guarantee.
Journal of Combinatorial Optimiza-tion , 8(3):307–328, 2004.Zeyuan Allen-Zhu and Elad Hazan. Variance reduction for faster non-convex optimization.In
International Conference on Machine Learning (ICML) , pages 699–707, 2016.Zeyuan Allen-Zhu and Lorenzo Orecchia. Nearly-linear time positive lp solver with fasterconvergence rate. In
Proceedings of the Forty-Seventh Annual ACM on Symposium onTheory of Computing , pages 229–236. ACM, 2015.Animashree Anandkumar, Rong Ge, Daniel Hsu, Sham M Kakade, and Matus Telgarsky.Tensor decompositions for learning latent variable models.
JMLR , 15(1):2773–2832, 2014.Francis Bach. Submodular functions: from discrete to continous domains. arXiv:1511.00394 , 2015.Francis R Bach. Structured sparsity-inducing norms through submodular functions. In
NIPS , pages 118–126, 2010.An Bian, Kfir Y. Levy, Andreas Krause, and Joachim M. Buhmann. Continuous dr-submodular maximization: Structure and algorithms. In
Advances in Neural InformationProcessing Systems (NIPS) , pages 486–496, 2017a.Andrew An Bian, Baharan Mirzasoleiman, Joachim M. Buhmann, and Andreas Krause.Guaranteed non-convex optimization: Submodular maximization over continuous do-mains. In
International Conference on Artificial Intelligence and Statistics (AISTATS) ,pages 111–120, 2017b.Yatao A. Bian, Joachim M. Buhmann, and Andreas Krause. Optimal continuous dr-submodular maximization and applications to provable mean field inference. In
Interna-tional Conference on Machine Learning (ICML) , pages 644–653, 2019.Yatao An Bian.
Provable Non-Convex Optimization and Algorithm Validation via Submod-ularity . PhD thesis, ETH Zurich, 2019.Jeffrey Bilmes and Wenruo Bai. Deep submodular functions. arXiv preprintarXiv:1701.08939 , 2017.L´eon Bottou, Frank E Curtis, and Jorge Nocedal. Optimization methods for large-scalemachine learning.
Siam Review , 60(2):223–311, 2018.Stephen Boyd and Lieven Vandenberghe.
Convex optimization . Cambridge university press,2004.Niv Buchbinder, Moran Feldman, Joseph Seffi Naor, and Roy Schwartz. A tight lineartime (1/2)-approximation for unconstrained submodular maximization. In
FOCS, pages 649–658. IEEE, 2012. Gruia Calinescu, Chandra Chekuri, Martin Pál, and Jan Vondrák. Maximizing a submodular set function subject to a matroid constraint. In
Integer programming and combinatorialoptimization , pages 182–196. Springer, 2007.Gruia Calinescu, Chandra Chekuri, Martin P´al, and Jan Vondr´ak. Maximizing a monotonesubmodular function subject to a matroid constraint.
SIAM J. Comput. , 40(6):1740–1766,2011.Chandra Chekuri, Jan Vondr´ak, and Rico Zenklusen. Submodular function maximizationvia the multilinear relaxation and contention resolution schemes.
SIAM Journal on Com-puting , 43(6):1831–1879, 2014.Chandra Chekuri, TS Jayram, and Jan Vondr´ak. On multiplicative weight updates forconcave and submodular function maximization. In
Proceedings of the 2015 Conferenceon Innovations in Theoretical Computer Science , pages 201–210. ACM, 2015.Lin Chen, Hamed Hassani, and Amin Karbasi. Online continuous submodular maximiza-tion. In
International Conference on Artificial Intelligence and Statistics (AISTATS) ,pages 1896–1905, 2018.Abhimanyu Das and David Kempe. Submodular meets spectral: Greedy algorithmsfor subset selection, sparse approximation and dictionary selection. arXiv preprintarXiv:1102.3975 , 2011.Josip Djolonga and Andreas Krause. From map to marginals: Variational inference inbayesian submodular models. In
Neural Information Processing Systems (NIPS) , pages244–252, 2014a.Josip Djolonga and Andreas Krause. From map to marginals: Variational inference inbayesian submodular models. In
NIPS , pages 244–252, 2014b.Shahar Dobzinski and Jan Vondr´ak. From query complexity to computational complexity.In
Proceedings of the forty-fourth annual ACM symposium on Theory of computing , pages1107–1116. ACM, 2012.Christoph D¨urr, Nguyen Kim Thang, Abhinav Srivastav, and L´eo Tible. Non-monotonedr-submodular maximization: Approximation and regret guarantees. arXiv preprintarXiv:1905.09595 , 2019.Nathan Eagle and Alex Sandy Pentland. Reality mining: sensing complex social systems.
Personal and ubiquitous computing , 10(4):255–268, 2006.Reza Eghbali and Maryam Fazel. Designing smoothing functions for improved worst-casecompetitive ratio in online optimization. In
NIPS , pages 3279–3287. 2016.Alina Ene and Huy L Nguyen. A reduction for optimizing lattice submodular functionswith diminishing returns. arXiv preprint arXiv:1606.08362 , 2016.Uriel Feige. A threshold of ln n for approximating set cover.
Journal of the ACM (JACM), 45(4):634–652, 1998. Uriel Feige, Vahab S Mirrokni, and Jan Vondrak. Maximizing non-monotone submodular functions.
SIAM Journal on Computing , 40(4):1133–1153, 2011.Moran Feldman, Joseph Naor, and Roy Schwartz. A unified continuous greedy algorithmfor submodular maximization. In
Foundations of Computer Science (FOCS), 2011 IEEE52nd Annual Symposium on , pages 570–579. IEEE, 2011.Marguerite Frank and Philip Wolfe. An algorithm for quadratic programming.
Navalresearch logistics quarterly , 3(1-2):95–110, 1956.Satoru Fujishige.
Submodular functions and optimization , volume 58. Elsevier, 2005.Saeed Ghadimi, Guanghui Lan, and Hongchao Zhang. Mini-batch stochastic approximationmethods for nonconvex stochastic composite optimization.
Mathematical Programming ,155(1-2):267–305, 2016.Shayan Oveis Gharan and Jan Vondr´ak. Submodular maximization by simulated anneal-ing. In
Proceedings of the twenty-second annual ACM-SIAM symposium on DiscreteAlgorithms , pages 1098–1116. Society for Industrial and Applied Mathematics, 2011.Jennifer Gillenwater, Alex Kulesza, and Ben Taskar. Near-optimal map inference for deter-minantal point processes. In
Advances in Neural Information Processing Systems , pages2735–2743, 2012.Daniel Golovin and Andreas Krause. Adaptive submodularity: Theory and applications inactive learning and stochastic optimization.
Journal of Artificial Intelligence Research ,pages 427–486, 2011.Ryan Gomes and Andreas Krause. Budgeted nonparametric learning from data streams.In
ICML , volume 1, page 3, 2010.Corinna Gottschalk and Britta Peis. Submodular function maximization on the boundedinteger lattice. In
Approximation and Online Algorithms , pages 133–144. Springer, 2015.Jason Hartline, Vahab Mirrokni, and Mukund Sundararajan. Optimal marketing strategiesover social networks. In
Proceedings of the 17th international conference on World WideWeb , pages 189–198. ACM, 2008.Hamed Hassani, Mahdi Soltanolkotabi, and Amin Karbasi. Gradient methods for submod-ular maximization. In
Advances in Neural Information Processing Systems (NIPS) , pages5837–5847, 2017.Daisuke Hatano, Takuro Fukunaga, Takanori Maehara, and Ken-ichi Kawarabayashi. La-grangian decomposition algorithm for allocating marketing channels. In
AAAI , pages1144–1150, 2015.Elad Hazan, Kfir Levy, and Shai Shalev-Shwartz. Beyond convexity: Stochastic quasi-convex optimization. In
Advances in Neural Information Processing Systems, pages 1594–1602, 2015a. Elad Hazan, Kfir Y Levy, and Shai Shalev-Swartz. On graduated optimization for stochastic non-convex problems. arXiv preprint arXiv:1503.03712, 2015b. Wassily Hoeffding. Probability inequalities for sums of bounded random variables.
Reshad Hosseini and Suvrit Sra. Matrix manifold optimization for Gaussian mixtures. In Advances in Neural Information Processing Systems, pages 910–918, 2015.

Ernst Ising. Contribution to the theory of ferromagnetism. Z. Phys., 31:253–258, 1925.

Shinji Ito and Ryohei Fujimaki. Large-scale price optimization via network flow. In Advances in Neural Information Processing Systems (NIPS), pages 3855–3863, 2016.

Satoru Iwata, Lisa Fleischer, and Satoru Fujishige. A combinatorial strongly polynomial algorithm for minimizing submodular functions. Journal of the ACM, 48(4):761–777, 2001.

Martin Jaggi. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In ICML 2013, pages 427–435, 2013a.

Martin Jaggi. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pages 427–435, 2013b.

Majid Janzamin, Hanie Sedghi, and Anima Anandkumar. Beating the perils of non-convexity: Guaranteed training of neural networks using tensor methods. CoRR, abs/1506.08473, 2015.

Ben Jeuris, Raf Vandebril, and Bart Vandereycken. A survey and comparison of contemporary algorithms for computing the matrix geometric mean. Electronic Transactions on Numerical Analysis, 39:379–402, 2012.

Mohammad Karimi, Mario Lucic, Hamed Hassani, and Andreas Krause. Stochastic submodular maximization: The case of coverage functions. In Advances in Neural Information Processing Systems, pages 6853–6863, 2017.

Qifa Ke and Takeo Kanade. Quasiconvex optimization for robust geometric reconstruction. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(10):1834–1847, 2007.

David Kempe, Jon Kleinberg, and Éva Tardos. Maximizing the spread of influence through a social network. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 137–146. ACM, 2003.

Sunyoung Kim and Masakazu Kojima. Exact solutions of some nonconvex quadratic optimization problems via SDP and SOCP relaxations. Computational Optimization and Applications, 26(2):143–154, 2003.

Vladimir Kolmogorov. Submodularity on a tree: Unifying L♮-convex and bisubmodular functions. In Mathematical Foundations of Computer Science, pages 400–411. Springer, 2011.

Andreas Krause and Volkan Cevher. Submodular dictionary selection for sparse representation. In ICML, pages 567–574, 2010.

Andreas Krause and Daniel Golovin. Submodular function maximization. Tractability: Practical Approaches to Hard Problems, 3:19, 2012.

Andreas Krause and Carlos Guestrin. Near-optimal nonmyopic value of information in graphical models. In UAI, pages 324–331, 2005.

Alex Kulesza, Ben Taskar, et al. Determinantal point processes for machine learning. Foundations and Trends in Machine Learning, 5(2–3):123–286, 2012.

Jérôme Kunegis. KONECT: the Koblenz network collection. In Proceedings of the 22nd International Conference on World Wide Web, pages 1343–1350. ACM, 2013.

Simon Lacoste-Julien. Convergence rate of Frank-Wolfe for non-convex objectives. arXiv preprint arXiv:1607.00345, 2016.

Jon Lee, Vahab S. Mirrokni, Viswanath Nagarajan, and Maxim Sviridenko. Non-monotone submodular maximization under matroid and knapsack constraints. In Theory of Computing, pages 323–332. ACM, 2009.
Jure Leskovec, Andreas Krause, Carlos Guestrin, Christos Faloutsos, Jeanne VanBriesen, and Natalie Glance. Cost-effective outbreak detection in networks. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 420–429, 2007.

Huan Li and Zhouchen Lin. Accelerated proximal gradient methods for nonconvex programming. In NIPS, pages 379–387, 2015.

Hui Lin and Jeff Bilmes. Multi-document summarization via budgeted maximization of submodular functions. In Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 912–920, 2010.

Hui Lin and Jeff Bilmes. A class of submodular functions for document summarization. In HLT, 2011a.

Hui Lin and Jeff Bilmes. Optimal selection of limited vocabulary speech corpora. In Twelfth Annual Conference of the International Speech Communication Association, 2011b.

László Lovász. Submodular functions and convexity. In Mathematical Programming: The State of the Art, pages 235–257. Springer, 1983.

Baharan Mirzasoleiman, Amin Karbasi, Rik Sarkar, and Andreas Krause. Distributed submodular maximization: Identifying representative elements in massive data. In NIPS, pages 2049–2057, 2013.

Aryan Mokhtari, Hamed Hassani, and Amin Karbasi. Decentralized submodular maximization: Bridging discrete and continuous settings. arXiv preprint arXiv:1802.03825, 2018a.

Aryan Mokhtari, Hamed Hassani, and Amin Karbasi. Stochastic conditional gradient methods: From convex minimization to submodular maximization. arXiv preprint arXiv:1804.09554, 2018b.

Theodore S. Motzkin and Ernst G. Straus. Maxima for graphs and a new proof of a theorem of Turán. Canad. J. Math., 17(4):533–540, 1965.

George L. Nemhauser, Laurence A. Wolsey, and Marshall L. Fisher. An analysis of approximations for maximizing submodular set functions – I. Mathematical Programming, 14(1):265–294, 1978.

Yurii Nesterov. Introductory Lectures on Convex Optimization: A Basic Course, volume 87. Springer Science & Business Media, 2013.

Yurii Nesterov and Boris T. Polyak. Cubic regularization of Newton method and its global performance. Mathematical Programming, 108(1):177–205, 2006.

Rad Niazadeh, Tim Roughgarden, and Joshua R. Wang. Optimal algorithms for continuous non-monotone submodular and DR-submodular maximization. arXiv preprint arXiv:1805.09480, 2018.

B. T. Polyak. Gradient methods for the minimisation of functionals. USSR Computational Mathematics and Mathematical Physics, 3(4):864–878, 1963.

Chunhong Qi, Kyle A. Gallivan, and P.-A. Absil. Riemannian BFGS algorithm with applications. In Recent Advances in Optimization and its Applications in Engineering, pages 183–192. Springer, 2010.

Sashank J. Reddi, Suvrit Sra, Barnabas Poczos, and Alex Smola. Fast stochastic methods for nonsmooth nonconvex optimization. arXiv preprint arXiv:1605.06900, 2016a.

Sashank J. Reddi, Suvrit Sra, Barnabás Póczos, and Alex Smola. Stochastic Frank-Wolfe methods for nonconvex optimization. In , pages 1244–1251. IEEE, 2016b.

Wolfgang Ring and Benedikt Wirth. Optimization methods on Riemannian manifolds and their application to shape space. SIAM Journal on Optimization, 22(2):596–627, 2012.
Ajit P. Singh, Andrew Guillory, and Jeff Bilmes. On bisubmodular maximization. In International Conference on Artificial Intelligence and Statistics, pages 1055–1063, 2012.

Martin Skutella. Convex quadratic and semidefinite programming relaxations in scheduling. J. ACM, 2001.

Tasuku Soma and Yuichi Yoshida. A generalization of submodular cover via the diminishing return property on the integer lattice. In NIPS, pages 847–855, 2015a.

Tasuku Soma and Yuichi Yoshida. Maximizing submodular functions with the diminishing return property over the integer lattice. arXiv preprint arXiv:1503.01218, 2015b.

Tasuku Soma and Yuichi Yoshida. Non-monotone DR-submodular function maximization. In AAAI, volume 17, pages 898–904, 2017.

Tasuku Soma and Yuichi Yoshida. Maximizing monotone submodular functions over the integer lattice. Mathematical Programming, 172(1-2):539–563, 2018.

Tasuku Soma, Naonori Kakimura, Kazuhiro Inaba, and Ken-ichi Kawarabayashi. Optimal budget allocation: Theoretical guarantee and efficient algorithm. In Proceedings of the 31st International Conference on Machine Learning, pages 351–359, 2014.

Suvrit Sra. Scalable nonconvex inexact proximal splitting. In Advances in Neural Information Processing Systems (NIPS), pages 530–538, 2012.

Suvrit Sra. On the matrix square root via geometric optimization. arXiv preprint arXiv:1507.08366, 2015.

Suvrit Sra and Reshad Hosseini. Conic geometric optimization on the manifold of positive definite matrices. SIAM Journal on Optimization, 25(1):713–739, 2015.

Suvrit Sra and Reshad Hosseini. Geometric optimization in machine learning. In Algorithmic Advances in Riemannian Geometry and Applications, pages 73–91. Springer, 2016.

Maxim Sviridenko. A note on maximizing a submodular set function subject to a knapsack constraint. Operations Research Letters, 32(1):41–43, 2004.

Donald M. Topkis. Minimizing a submodular function on a lattice. Operations Research, 26(2):305–321, 1978.

Sebastian Tschiatschek, Josip Djolonga, and Andreas Krause. Learning probabilistic submodular diversity models via noise contrastive estimation. In Proc. International Conference on Artificial Intelligence and Statistics (AISTATS), 2016.

Nisheeth K. Vishnoi. Geodesic convex optimization: Differentiation on manifolds, geodesics, and convexity. arXiv preprint arXiv:1806.06373, 2018.

Jan Vondrák. Optimal approximation for the submodular welfare problem in the value oracle model. In Proceedings of the 40th Annual ACM Symposium on Theory of Computing, pages 67–74, 2008.

Jan Vondrák. Symmetry and approximability of submodular maximization problems. SIAM Journal on Computing, 42(1):265–304, 2013.

Justin Ward and Stanislav Zivny. Maximizing bisubmodular and k-submodular functions. In SODA 2014, pages 1468–1481, 2014.

Elmar Wolfstetter. Topics in Microeconomics: Industrial Organization, Auctions, and Incentives. Cambridge University Press, 1999.

Laurence A. Wolsey. Maximising real-valued submodular functions: Primal and dual heuristics for location problems. Math. Oper. Res., 7(3):410–425, 1982.

Pourya Zadeh, Reshad Hosseini, and Suvrit Sra. Geometric mean metric learning. In International Conference on Machine Learning, pages 2464–2471, 2016.

Appendix
Appendix A. Proofs of Characterizations of Continuous Submodular Functions
Since $\mathcal{X}_i$ is a compact subset of $\mathbb{R}$, we denote its lower and upper bounds by $\underline{u}_i$ and $\bar{u}_i$, respectively.

A.1 Proofs of Lemma 5 and Lemma 8

Proof [Proof of Lemma 5]
Sufficiency: For any dimension $i$,
$$\nabla_i f(a) = \lim_{k \to 0} \frac{f(k e_i + a) - f(a)}{k} \ge \lim_{k \to 0} \frac{f(k e_i + b) - f(b)}{k} = \nabla_i f(b). \quad (67)$$
Necessity: Firstly, we show that for any $c \ge 0$, the function $g(x) := f(c + x) - f(x)$ is monotonically non-increasing:
$$\nabla g(x) = \nabla f(c + x) - \nabla f(x) \le 0. \quad (68)$$
Taking $c = k e_i$, since $g(a) \ge g(b)$ for all $a \le b$, we reach the DR-submodularity definition.

Proof [Proof of Lemma 8] Similarly to the proof of Lemma 5, we have the following.
Sufficiency: For any dimension $i$ s.t. $a_i = b_i$,
$$\nabla_i f(a) = \lim_{k \to 0} \frac{f(k e_i + a) - f(a)}{k} \ge \lim_{k \to 0} \frac{f(k e_i + b) - f(b)}{k} = \nabla_i f(b). \quad (69)$$
Necessity: We show that for any $k \ge 0$, the function $g(x) := f(k e_i + x) - f(x)$ is monotonically non-increasing:
$$\nabla g(x) = \nabla f(k e_i + x) - \nabla f(x) \le 0. \quad (70)$$
Since $g(a) \ge g(b)$ for $a \le b$, we reach the weak DR definition.
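The gradient characterizations above are easy to probe numerically. The following minimal sketch (not part of the original analysis) assumes a toy quadratic DR-submodular function $f(x) = \alpha^\top x - \tfrac{1}{2} x^\top B x$, with $B$ entrywise non-negative, symmetric, and zero on the diagonal, so that every Hessian entry is non-positive; it verifies the antitone-gradient condition of Lemma 5 on random ordered pairs $a \le b$.

import numpy as np

# Toy DR-submodular function: f(x) = alpha^T x - 0.5 * x^T B x, where B is
# entrywise non-negative, symmetric, with zero diagonal.  Every entry of the
# Hessian (-B) is then non-positive, so f is DR-submodular on [0, 1]^n.
rng = np.random.default_rng(0)
n = 5
alpha = rng.uniform(1.0, 2.0, size=n)
B = rng.uniform(0.0, 1.0, size=(n, n))
B = (B + B.T) / 2.0
np.fill_diagonal(B, 0.0)

def grad_f(x):
    return alpha - B @ x

# Lemma 5 (gradient characterization): a <= b coordinate-wise should imply
# grad f(a) >= grad f(b) coordinate-wise.
for _ in range(1000):
    a = rng.uniform(0.0, 1.0, size=n)
    b = np.minimum(a + rng.uniform(0.0, 1.0, size=n), 1.0)  # guarantees a <= b
    assert np.all(grad_f(a) >= grad_f(b) - 1e-12)
print("gradient antitonicity holds on 1000 random ordered pairs")

Replacing grad_f with the gradient of any candidate objective gives a quick falsification test for DR-submodularity before attempting a formal proof.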
A.2 Alternative Formulation of the weak DR Property

First of all, we will prove that weak DR has the following alternative formulation, which will be used to prove Proposition 7.

Lemma 40 (Alternative formulation of weak DR) The weak DR property (Equation (9), denoted as Formulation I) has the following equivalent formulation (Equation (71), denoted as Formulation II): $\forall a \le b \in \mathcal{X}$, $\forall i \in \{i' \mid a_{i'} = b_{i'} = \underline{u}_{i'}\}$, $\forall k' \ge l' \ge 0$ s.t. $(k' e_i + a)$, $(l' e_i + a)$, $(k' e_i + b)$ and $(l' e_i + b)$ are still in $\mathcal{X}$, the following inequality is satisfied,
$$f(k' e_i + a) - f(l' e_i + a) \ge f(k' e_i + b) - f(l' e_i + b). \quad \text{(Formulation II)} \quad (71)$$

Proof
Let $D_1 = \{i \mid a_i = b_i = \underline{u}_i\}$, $D_2 = \{i \mid \underline{u}_i < a_i = b_i < \bar{u}_i\}$, and $D_3 = \{i \mid a_i = b_i = \bar{u}_i\}$.

1) Formulation II $\Rightarrow$ Formulation I:
When $i \in D_1$, setting $l' = 0$ in Formulation II, one can get $f(k' e_i + a) - f(a) \ge f(k' e_i + b) - f(b)$.
When $i \in D_2$, $\forall k \ge 0$, let $l' = a_i - \underline{u}_i = b_i - \underline{u}_i > 0$, $k' = k + l' = k + (a_i - \underline{u}_i)$, and let $\bar{a} = (a \,|\, i(\underline{u}_i))$, $\bar{b} = (b \,|\, i(\underline{u}_i))$, i.e., $a$ and $b$ with the $i$-th coordinate set to $\underline{u}_i$. It is easy to see that $\bar{a} \le \bar{b}$, and $\bar{a}_i = \bar{b}_i = \underline{u}_i$. Then from Formulation II,
$$f(k' e_i + \bar{a}) - f(l' e_i + \bar{a}) = f(k e_i + a) - f(a) \quad (72)$$
$$\ge f(k' e_i + \bar{b}) - f(l' e_i + \bar{b}) = f(k e_i + b) - f(b).$$
When $i \in D_3$, Equation (9) holds trivially.
The above three situations prove Formulation I.

2) Formulation II $\Leftarrow$ Formulation I:
$\forall a \le b$, $\forall i \in D_1$, one has $a_i = b_i = \underline{u}_i$. $\forall k' \ge l' \ge 0$, let $\hat{a} = l' e_i + a$, $\hat{b} = l' e_i + b$, and let $k = k' - l' \ge 0$. It can be verified that $\hat{a} \le \hat{b}$ and $\hat{a}_i = \hat{b}_i$, so from Formulation I,
$$f(k e_i + \hat{a}) - f(\hat{a}) = f(k' e_i + a) - f(l' e_i + a) \quad (73)$$
$$\ge f(k e_i + \hat{b}) - f(\hat{b}) = f(k' e_i + b) - f(l' e_i + b),$$
which proves Formulation II.

A.3 Proof of Proposition 7

Proof
submodularity $\Rightarrow$ weak DR: Let us prove Formulation II (Equation (71)) of weak DR, which is, $\forall a \le b \in \mathcal{X}$, $\forall i \in \{i' \mid a_{i'} = b_{i'} = \underline{u}_{i'}\}$, $\forall k' \ge l' \ge$
0, the following inequality holds, f ( k (cid:48) e i + a ) − f ( l (cid:48) e i + a ) ≥ f ( k (cid:48) e i + b ) − f ( l (cid:48) e i + b ) . (74)And f is a submodular function iff ∀ x , y ∈ X , f ( x ) + f ( y ) ≥ f ( x ∨ y ) + f ( x ∧ y ), so f ( y ) − f ( x ∧ y ) ≥ f ( x ∨ y ) − f ( x ).Now ∀ a ≤ b ∈ X , one can set x = l (cid:48) e i + b and y = k (cid:48) e i + a . It can be easilyverified that x ∧ y = l (cid:48) e i + a and x ∨ y = k (cid:48) e i + b . Substituting all the above equalities into f ( y ) − f ( x ∧ y ) ≥ f ( x ∨ y ) − f ( x ) one can get f ( k (cid:48) e i + a ) − f ( l (cid:48) e i + a ) ≥ f ( k (cid:48) e i + b ) − f ( l (cid:48) e i + b ).2) submodularity ⇐ weak DR :Let us use Formulation I (Equation (9)) of weak DR to prove the submodularity property. ∀ x , y ∈ X , let D := { e , · · · , e d } be the set of elements for which y e > x e , let k e i := y e i − x e i . Now set a := x ∧ y , b := x and a i = ( a i − | e i ( y e i )) = k e i e i + a i − , b i =( b i − | e i ( y e i )) = k e i e i + b i − , for i = 1 , · · · , d .One can verify that a i ≤ b i , a ie i (cid:48) = b ie i (cid:48) for all i (cid:48) ∈ D, i = 0 , · · · , d , and that a d = y , b d = x ∨ y . ontinuous Submodular Function Maximization Applying Equation (9) of the weak DR property for i = 1 , · · · , d one can get f ( k e e e + a ) − f ( a ) ≥ f ( k e e e + b ) − f ( b ) (75) f ( k e e e + a ) − f ( a ) ≥ f ( k e e e + b ) − f ( b ) (76) · · · f ( k e d e e d + a d − ) − f ( a d − ) ≥ f ( k e d e e d + b d − ) − f ( b d − ) . (77)Taking a sum over all the above d inequalities, one can get f ( k e d e e d + a d − ) − f ( a ) ≥ f ( k e d e e d + b d − ) − f ( b ) (78) ⇔ f ( y ) − f ( x ∧ y ) ≥ f ( x ∨ y ) − f ( x ) (79) ⇔ f ( x ) + f ( y ) ≥ f ( x ∨ y ) + f ( x ∧ y ) , (80)which proves the submodularity property. A.4 Proof of Proposition 9Proof submodular + coordinate-wise concave ⇒ DR :From coordinate-wise concavity we have f ( a + k e i ) − f ( a ) ≥ f ( a + ( b i − a i + k ) e i ) − f ( a + ( b i − a i ) e i ). Therefore, to prove DR it suffices to show that f ( a + ( b i − a i + k ) e i ) − f ( a + ( b i − a i ) e i ) ≥ f ( b + k e i ) − f ( b ) . (81)Let x := b , y := ( a + ( b i − a i + k ) e i ), so x ∧ y = ( a + ( b i − a i ) e i ) , x ∨ y = ( b + k e i ). Fromsubmodularity, one can see that inequality (81) holds.2) DR ⇒ submodular + coordinate-wise concave :From DR property, the weak DR (Equation (9)) property is implied, which equivalentlyproves the submodularity property.To prove coordinate-wise concavity , one just need to set b := a + l e i , then we have f ( a + k e i ) − f ( a ) ≥ f ( a + ( k + l ) e i ) − f ( a + l e i ). Appendix B. Proofs for Properties of Continuous DR-SubmodularMaximization
B.1 Proof of Lemma 13Proof [Proof of Lemma 13]Supppose for simplicity that f and h are both twice differentiable. Note that when f and h are not differentiable, one can similarly prove the conclusion using zero th orderdefinition of continuous submodularity. ian and Buhmann and Krause Without loss of generality, let us prove that f ( h ( x )) maintains submodularity of f . Onejust need to show that the term ∂ g ( x ) ∂x i ∂x j in Equation (13) is non-positive when i (cid:54) = j .Firstly, let us consider the term (cid:80) nk =1 ∂f ( y ) ∂y k ∂ h k ( x ) ∂x i ∂x j . Since h is separable as stated above, ∂ h k ( x ) ∂x i ∂x j is always zero, so (cid:80) nk =1 ∂f ( y ) ∂y k ∂ h k ( x ) ∂x i ∂x j is always zero.Then it remains to show that the term (cid:80) ns,t =1 ∂ f ( y ) ∂y s ∂y t ∂h s ( x ) ∂x i ∂h t ( x ) ∂x j is non-positive. Thereare two situations: 1) s = t . Since i (cid:54) = j , there must be one term out of ∂h s ( x ) ∂x i and ∂h t ( x ) ∂x j that are zero (because h is separable). 2) s (cid:54) = t . Since f is submodular, it holdsthat ∂ f ( y ) ∂y s ∂y t ≤
0. Because h is monotone, it also holds that ∂h s ( x ) ∂x i ∂h t ( x ) ∂x j ≥
0. So the term (cid:80) ns,t =1 ∂ f ( y ) ∂y s ∂y t ∂h s ( x ) ∂x i ∂h t ( x ) ∂x j is non-positive in the above two situations.Now we reach the conclusion that f ( h ( x )) maintains submodularity of f . B.2 Proof of Proposition 18Proof [Proof of Proposition 18] Consider a univariate function g ( ξ ) := f ( x + ξ v ∗ ) , ξ ≥ , v ∗ ≥ . (82)We know that dg ( ξ ) dξ = (cid:104) v ∗ , ∇ f ( x + ξ v ∗ ) (cid:105) . (83)It can be verified that: g ( ξ ) is concave ⇔ d g ( ξ ) dξ = ( v ∗ ) (cid:62) ∇ f ( x + ξ v ∗ ) v ∗ = (cid:88) i (cid:54) = j v ∗ i v ∗ j ∇ ij f + (cid:88) i ( v ∗ i ) ∇ ii f ≤ . (84)The non-positiveness of ∇ ij f is ensured by submodularity of f ( · ), and the non-positivenessof ∇ ii f results from the coordinate-wise concavity of f ( · ).The proof of concavity along any non-positive direction is similar, which is omitted here. B.3 Proof of Proposition 20Proof [Proof of Proposition 20] Since f is DR-submodular, so it is concave along anydirection v ∈ ± R n + . We know that x ∨ y − x ≥ and x ∧ y − x ≤ , so from the strongDR-submodularity in (33), f ( x ∨ y ) − f ( x ) ≤ (cid:104)∇ f ( x ) , x ∨ y − x (cid:105) − µ (cid:107) x ∨ y − x (cid:107) , (85) f ( x ∧ y ) − f ( x ) ≤ (cid:104)∇ f ( x ) , x ∧ y − x (cid:105) − µ (cid:107) x ∧ y − x (cid:107) . (86) ontinuous Submodular Function Maximization Summing the above two inequalities and notice that x ∨ y + x ∧ y = x + y , we arrive,( y − x ) (cid:62) ∇ f ( x ) (87) ≥ f ( x ∨ y ) + f ( x ∧ y ) − f ( x ) + µ (cid:107) x ∨ y − x (cid:107) + (cid:107) x ∧ y − x (cid:107) )= f ( x ∨ y ) + f ( x ∧ y ) − f ( x ) + µ (cid:107) y − x (cid:107) , (88)the last equality holds since (cid:107) x ∨ y − x (cid:107) + (cid:107) x ∧ y − x (cid:107) = (cid:107) y − x (cid:107) . B.4 Proof of Proposition 22Proof [Proof of Proposition 22] Consider the point z ∗ := x ∨ x ∗ − x = ( x ∗ − x ) ∨ . One cansee that: 1) ≤ z ∗ ≤ x ∗ ; 2) z ∗ ∈ P (down-closedness); 3) z ∗ ∈ Q (because of z ∗ ≤ ¯ u − x ).From Proposition 20, (cid:104) x ∗ − x , ∇ f ( x ) (cid:105) + 2 f ( x ) ≥ f ( x ∨ x ∗ ) + f ( x ∧ x ∗ ) + µ (cid:107) x − x ∗ (cid:107) , (89) (cid:104) z ∗ − z , ∇ f ( z ) (cid:105) + 2 f ( z ) ≥ f ( z ∨ z ∗ ) + f ( z ∧ z ∗ ) + µ (cid:107) z − z ∗ (cid:107) . (90)Let us first of all prove the following key Claim. Claim 23
Under the setting of Proposition 22, it holds that, f ( x ∨ x ∗ ) + f ( x ∧ x ∗ ) + f ( z ∨ z ∗ ) + f ( z ∧ z ∗ ) ≥ f ( x ∗ ) . (39) Proof [Proof of Claim 23] Firstly, we are going to prove that f ( x ∨ x ∗ ) + f ( z ∨ z ∗ ) ≥ f ( z ∗ ) + f (( x + z ) ∨ x ∗ ) , (91)which is equivalent to f ( x ∨ x ∗ ) − f ( z ∗ ) ≥ f (( x + z ) ∨ x ∗ ) − f ( z ∨ z ∗ ). It can be shown that x ∨ x ∗ − z ∗ = ( x + z ) ∨ x ∗ − z ∨ z ∗ . Combining this with the fact that z ∗ ≤ z ∨ z ∗ , and usingthe DR property (see Definition 4) implies (91). Then we establish, x ∨ x ∗ − z ∗ = ( x + z ) ∨ x ∗ − z ∨ z ∗ . (92)We will show that both the RHS and LHS of the above equation are equal to x : for theLHS of (92) we can write x ∨ x ∗ − z ∗ = x ∨ x ∗ − ( x ∨ x ∗ − x ) = x . For the RHS of (92) letus consider any coordinate i ∈ [ n ],( x i + z i ) ∨ x ∗ i − z i ∨ z ∗ i =( x i + z i ) ∨ x ∗ i − (( x i + z i ) − x i ) ∨ (( x i ∨ x ∗ i ) − x i ) = x i , (93)where the last equality holds easily for the two situations: ( x i + z i ) ≥ x ∗ i and ( x i + z i ) < x ∗ i .Next, we are going to prove that, f ( z ∗ ) + f ( x ∧ x ∗ ) ≥ f ( x ∗ ) + f ( ) . (94) ian and Buhmann and Krause It is equivalent to f ( z ∗ ) − f ( ) ≥ f ( x ∗ ) − f ( x ∧ x ∗ ), which can be done similarly by the DRproperty: Notice that x ∗ − x ∧ x ∗ = x ∨ x ∗ − x = z ∗ − and ≤ x ∧ x ∗ . (95)Thus (94) holds from the DR property. Combining (91) and (94) one can get, f ( x ∨ x ∗ ) + f ( z ∨ z ∗ ) + f ( x ∧ x ∗ ) + f ( z ∧ z ∗ ) ≥ f ( x ∗ ) + f ( ) + f (( x + z ) ∨ x ∗ ) + f ( z ∧ z ∗ ) (96) ≥ f ( x ∗ ) . (non-negativity of f )Combining (89) and (90) and Claim 23 it reads, (cid:104) x ∗ − x , ∇ f ( x ) (cid:105) + (cid:104) z ∗ − z , ∇ f ( z ) (cid:105) + 2( f ( x ) + f ( z )) (97) ≥ f ( x ∗ ) + µ (cid:107) x − x ∗ (cid:107) + (cid:107) z − z ∗ (cid:107) ) . (98)From the definition of non-stationarity in (34) one can get, g P ( x ) := max v ∈P (cid:104) v − x , ∇ f ( x ) (cid:105) x ∗ ∈P ≥ (cid:104) x ∗ − x , ∇ f ( x ) (cid:105) , (99) g Q ( z ) := max v ∈Q (cid:104) v − z , ∇ f ( z ) (cid:105) z ∗ ∈Q ≥ (cid:104) z ∗ − z , ∇ f ( z ) (cid:105) . (100)Putting together Equations (97), (99) and (100) we can get,2( f ( x ) + f ( z )) ≥ f ( x ∗ ) − g P ( x ) − g Q ( z ) + µ (cid:107) x − x ∗ (cid:107) + (cid:107) z − z ∗ (cid:107) ) . (101)So it arrives max { f ( x ) , f ( z ) } ≥ (102)14 [ f ( x ∗ ) − g P ( x ) − g Q ( z )] + µ (cid:107) x − x ∗ (cid:107) + (cid:107) z − z ∗ (cid:107) ) . (103) Appendix C. Additional Details for Monotone DR-SubmodularMaximization
C.1 Proof of Proposition 25Proof [Proof of Proposition 25]On a high level, the proof idea follows from the reduction from the problem of maximizinga monotone submodular set function subject to cardinality constraints. ontinuous Submodular Function Maximization Let us denote Π as the problem of maximizing a monotone submodular set functionsubject to cardinality constraints, and Π as the problem of maximizing a monotone contin-uous DR-submodular function under general down-closed polytope constraints. FollowingCalinescu et al. (2011), there exist an algorithm A for Π that consists of a polynomialtime computation in addition to polynomial number of subroutine calls to an algorithm forΠ . For details on A see the following.First of all, the multilinear extension (Calinescu et al., 2007) of a monotone submodularset function is a monotone continuous submodular function, and it is coordinate-wise linear,thus falls into a special case of monotone continuous DR-submodular functions. Evaluat-ing the multilinear extension and its gradients can be done using sampling methods, thusresulted in a randomized algorithm.So the algorithm A shall be: 1) Maximize the multilinear extension of the submodularset function over the matroid polytope associated with the cardinality constraint, whichcan be achieved by solving an instance of Π . We call the solution obtained the fractionalsolution; 2) Round the fractional solution to a feasible integeral solution using polynomialtime rounding technique in Ageev and Sviridenko (2004); Calinescu et al. (2007) (called thepipage rounding). Thus we prove the reduction from Π to Π .Our reduction algorithm A implies the NP-hardness and inapproximability of problemΠ .For the NP-hardness, because Π is well-known to be NP-hard (Calinescu et al., 2007;Feige, 1998), so Π is NP-hard as well.For the inapproximability: Assume there exists a polynomial algorithm B that can solveΠ better than 1 − /e , then we can use B as the subroutine algorithm in the reduction,which implies that one can solve Π better than 1 − /e . Now we slightly adapt the proofof inapproximability on max-k-cover of Feige (1998), since max-k-cover is a special case ofΠ . According to the proof of Theorem 5.3 in Feige (1998) and our reduction A , we havea reduction from approximating 3SAT–5 to problem Π . Using the rest proof of Theorem5.3 in Feige (1998), we reach the result that one cannot solve Π better than 1 − /e , unlessRP = NP. C.2 Proof of Corollary 27Proof [Proof of Corollary 27] Firstly, according to Theorem 1 of Lacoste-Julien (2016),
Non-convex FW is known to converge to a stationary point at a rate of $1/\sqrt{k}$. Then, according to Corollary 21, any stationary point is a 1/2-approximate solution.

C.3 Proof of Lemma 30

Proof
It is easy to see that x K is a convex combination of points in P , so x K ∈ P .Consider the point v ∗ := ( x ∗ ∨ x ) − x = ( x ∗ − x ) ∨ ≥ . Because v ∗ ≤ x ∗ and P isdown-closed, we get v ∗ ∈ P .By monotonicity, f ( x + v ∗ ) = f ( x ∗ ∨ x ) ≥ f ( x ∗ ). ian and Buhmann and Krause Consider the function g ( ξ ) := f ( x + ξ v ∗ ) , ξ ≥ dg ( ξ ) dξ = (cid:104) v ∗ , ∇ f ( x + ξ v ∗ ) (cid:105) . FromProposition 18, g ( ξ ) is concave, hence g (1) − g (0) = f ( x + v ∗ ) − f ( x ) ≤ dg ( ξ ) dξ (cid:12)(cid:12)(cid:12) ξ =0 × (cid:104) v ∗ , ∇ f ( x ) (cid:105) . (104)Then one can get (cid:104) v , ∇ f ( x ) (cid:105) ( a ) ≥ α (cid:104) v ∗ , ∇ f ( x ) (cid:105) − δγLD ≥ (105) α ( f ( x + v ∗ ) − f ( x )) − δγLD ≥ α ( f ( x ∗ ) − f ( x )) − δγLD , (106)where ( a ) is resulted from the LMO step of Algorithm 4. C.4 Proof of Theorem 31Proof [Proof of Theorem 31] From the Lipschitz assumption of f (Equation (30)): f ( x k +1 ) − f ( x k ) = f ( x k + γ k v k ) − f ( x k ) (107) ≥ γ k (cid:104) v k , ∇ f ( x k ) (cid:105) − L γ k (cid:107) v k (cid:107) (Lipschitz smoothness) ≥ γ k α [ f ( x ∗ ) − f ( x k )] − γ k δLD − L γ k D . (Lemma 30)After rearrangement, f ( x k +1 ) − f ( x ∗ ) ≥ (1 − αγ k )[ f ( x k ) − f ( x ∗ )] − LD γ k (1 + δ )2 . (108)Therefore, f ( x K ) − f ( x ∗ ) ≥ K − (cid:89) k =0 (1 − αγ k )[ f ( ) − f ( x ∗ )] − LD (1 + δ )2 K − (cid:88) k =0 γ k . (109)One can observe that (cid:80) K − k =0 γ k = 1, and since 1 − y ≤ e − y when y ≥ f ( x ∗ ) − f ( x K ) ≤ [ f ( x ∗ ) − f ( )] e − α (cid:80) K − k =0 γ k + LD (1 + δ )2 K − (cid:88) k =0 γ k (110)= [ f ( x ∗ ) − f ( )] e − α + LD (1 + δ )2 K − (cid:88) k =0 γ k . (111)After rearrangement, we get, f ( x K ) ≥ (1 − /e α ) f ( x ∗ ) − LD (1 + δ )2 K − (cid:88) k =0 γ k + e − α f ( ) . (112) ontinuous Submodular Function Maximization C.5 Proof of Corollary 32Proof [Proof of Corollary 32] Fixing K , to reach the tightest bound in Equation (58)amounts to solving the following problem:min K − (cid:88) k =0 γ k (113)s.t. K − (cid:88) k =0 γ k = 1 , γ k ≥ . Using Lagrangian method, let λ be the Lagrangian multiplier, then L ( γ , · · · , γ K − , λ ) = K − (cid:88) k =0 γ k + λ (cid:34) K − (cid:88) k =0 γ k − (cid:35) . (114)It can be easily verified that when γ = · · · = γ K − = K − , (cid:80) K − k =0 γ k reaches the minimum(which is K − ). Therefore we obtain the tightest worst-case bound in Corollary 32. Appendix D. Details of Revenue Maximization with ContinuousAssignments
D.1 More Details About the Model
As discussed in the main text, $R_s(x^i)$ should be some non-negative, non-decreasing, submodular function; therefore, we set $R_s(x^i) := \sqrt{\sum_{t: x_{it} \neq 0} x_{it} w_{st}}$, where $w_{st}$ is the weight of the edge connecting users $s$ and $t$. The first part in the R.H.S. of Equation (53) models the revenue from users who have not received free assignments, while the second and third parts model the revenue from users who have received free assignments. We use $w_{tt}$ to denote the "self-activation rate" of user $t$: given a certain amount of free trial to user $t$, how probable it is that he/she will buy after the trial. The intuition for modeling the second part in the R.H.S. of Equation (53) is: giving the users more free assignments, they are more likely to buy the product after using it. Therefore, we model the expected revenue in this part by $\phi(x_{it}) = w_{tt} x_{it}$. The intuition for modeling the third part in the R.H.S. of Equation (53) is: giving the users more free assignments, the revenue could decrease, since the users use the product for free for a longer period. As a simple example, the decrease in revenue can be modeled as $\gamma \sum_{t: x_{it} \neq 0} (-x_{it})$.

D.2 Proof of Lemma 24

Proof
First of all, we prove that g ( x ) := (cid:80) s : x s =0 R s ( x ) is a non-negative submodular function.It is easy to see that g ( x ) is non-negative. To prove that g ( x ) is submodular, one justneed, g ( a ) + g ( b ) ≥ g ( a ∨ b ) + g ( a ∧ b ) , ∀ a , b ∈ [ , ¯ u ] . (115) ian and Buhmann and Krause Let A := supp ( a ) , B := supp ( b ), where supp ( x ) := { i | x i (cid:54) = 0 } is the support of the vector x . First of all, because R s ( x ) is non-decreasing, and b ≥ a ∧ b , a ≥ a ∧ b , (cid:88) s ∈ A \ B R s ( b ) + (cid:88) s ∈ B \ A R s ( a ) ≥ (cid:88) s ∈ A \ B R s ( a ∧ b ) + (cid:88) s ∈ B \ A R s ( a ∧ b ) . (116)By submodularity of R s ( x ), and summing over s ∈ V\ ( A ∪ B ), (cid:88) s ∈V\ ( A ∪ B ) R s ( a ) + (cid:88) s ∈V\ ( A ∪ B ) R s ( b ) ≥ (cid:88) s ∈V\ ( A ∪ B ) R s ( a ∨ b ) + (cid:88) s ∈V\ ( A ∪ B ) R s ( a ∧ b ) . (117)Summing Equations 116 and 117 one can get (cid:88) s ∈V\ A R s ( a ) + (cid:88) s ∈V\ B R s ( b ) ≥ (cid:88) s ∈V\ ( A ∪ B ) R s ( a ∨ b ) + (cid:88) s ∈V\ ( A ∩ B ) R s ( a ∧ b )which is equivalent to Equation (115).Then we prove that h ( x ) := (cid:80) t : x t (cid:54) =0 ¯ R t ( x ) is submodular. Because ¯ R t ( x ) is non-increasing, and a ≤ a ∨ b , b ≤ a ∨ b , (cid:88) t ∈ A \ B ¯ R t ( a ) + (cid:88) t ∈ B \ A ¯ R t ( b ) ≥ (cid:88) t ∈ A \ B ¯ R t ( a ∨ b ) + (cid:88) t ∈ B \ A ¯ R t ( a ∨ b ) . (118)By submodularity of ¯ R t ( x ), and summing over t ∈ A ∩ B , (cid:88) t ∈ A ∩ B ¯ R t ( a ) + (cid:88) t ∈ A ∩ B ¯ R t ( b ) ≥ (cid:88) t ∈ A ∩ B ¯ R t ( a ∨ b ) + (cid:88) t ∈ A ∩ B ¯ R t ( a ∧ b ) . (119)Summing Equations 118, 119 we get, (cid:88) t ∈ A ¯ R t ( a ) + (cid:88) t ∈ B ¯ R t ( b ) ≥ (cid:88) t ∈ A ∪ B ¯ R t ( a ∨ b ) + (cid:88) t ∈ A ∩ B ¯ R t ( a ∧ b ) (120)which is equivalent to h ( a ) + h ( b ) ≥ h ( a ∨ b ) + h ( a ∧ b ), ∀ a , b ∈ [ , ¯ u ], thus proving thesubmodularity of h ( x ).Finally, because f ( x ) is the sum of two submodular functions and one modular function,so it is submodular. Appendix E. Proofs for Non-Monotone DR-Submodular Maximization
E.1 Proof for Hardness and InapproximabilityProof [Proof of Proposition 33] The main proof follows from the reduction from the problemof maximizing an unconstrained non-monotone submodular set function.Let us denote Π as the problem of maximizing an unconstrained non-monotone sub-modular set function, and Π as the problem of maximizing a box constrained non-monotonecontinuous DR-submodular function. Following the Appendix A of Buchbinder et al. (2012), ontinuous Submodular Function Maximization there exist an algorithm A for Π that consists of a polynomial time computation in addi-tion to polynomial number of subroutine calls to an algorithm for Π . For details see thefollowing.Given a submodular set function F : 2 V → R + , its multilinear extension (Calinescu et al.,2007) is a function f : [0 , V → R + , whose value at a point x ∈ [0 , V is the expected valueof F over a random subset R ( x ) ⊆ V , where R ( x ) contains each element e ∈ V independentlywith probability x e . Formally, f ( x ) := E [ R ( x )] = (cid:80) S ⊆V F ( S ) (cid:81) e ∈ S x e (cid:81) e (cid:48) / ∈ S (1 − x e (cid:48) ). Itcan be easily seen that f ( x ) is a non-monotone DR-submodular function.Then the algorithm A can be: 1) Maximize the multilinear extension f ( x ) over thebox constraint [0 , V , which can be achieved by solving an instance of Π . Obtain thefractional solution ˆ x ∈ [0 , n ; 2) Return the random set R (ˆ x ). According to the definitionof multilinear extension, the expected value of F ( R (ˆ x )) is f (ˆ x ). Thus proving the reductionfrom Π to Π .Given the reduction, the hardness result follows from the hardness of unconstrainednon-monotone submodular set function maximization.The inapproximability result comes from that of the unconstrained non-monotone sub-modular set function maximization in Feige et al. (2011) and Dobzinski and Vondr´ak (2012). E.2 Proof of Theorem 34Proof [Proof of Theorem 34]Let g P ( x ) , g Q ( z ) to the non-stationarity of x and z , respectively. Since we are using the Non-convex FW (Algorithm 2) as subroutine, according to Lacoste-Julien (2016, Theorem1), one can get, g P ( x ) ≤ min (cid:26) max { h , C f ( P ) }√ K + 1 , (cid:15) (cid:27) , (121) g Q ( z ) ≤ min (cid:26) max { h , C f ( Q ) }√ K + 1 , (cid:15) (cid:27) . (122)Plugging the above into Proposition 22 we reach the conclusion in (60). E.3 Detailed Proofs for Theorem 35
E.3.1 Proof of Lemma 37
Lemma 37 (Bounding the growth of $x^k$) Assume $x^0 = 0$. For $k = 0, \ldots, K-1$, it holds that
$$x^k_i \le \bar{u}_i \left[1 - (1 - \gamma)^{t_k/\gamma}\right], \ \forall i \in [n]. \quad (63)$$

Proof [Proof of Lemma 37] We prove the claim by induction. First of all, it holds when $k = 0$, since $x^0_i = 0$ and $t_0 = 0$ as well. Assume it holds for $k$. Then for $k + 1$, we have
$$x^{k+1}_i = x^k_i + \gamma v^k_i \quad (123)$$
$$\le x^k_i + \gamma (\bar{u}_i - x^k_i) \quad \text{(constraint of shrunken LMO)} \quad (124)$$
$$= (1 - \gamma) x^k_i + \gamma \bar{u}_i \le (1 - \gamma) \bar{u}_i \left[1 - (1 - \gamma)^{t_k/\gamma}\right] + \gamma \bar{u}_i \quad \text{(induction)} \quad (125)$$
$$= \bar{u}_i \left[1 - (1 - \gamma)^{t_{k+1}/\gamma}\right].$$
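As a quick illustration (not part of the original proof), the following sketch simulates the recursion $x^{k+1} = x^k + \gamma v^k$ with an arbitrary feasible output of the shrunken LMO, $0 \le v^k \le \bar{u} - x^k$, and checks the bound of Lemma 37 with $t_k = k\gamma$; the dimension, horizon, and random choices are illustrative assumptions.

import numpy as np

# Simulate the coordinate-wise recursion from the proof of Lemma 37 and check
# the claimed bound x_i^k <= ubar_i * (1 - (1 - gamma)^(t_k/gamma)), t_k = k*gamma.
rng = np.random.default_rng(1)
n, K, gamma = 4, 50, 1.0 / 50
ubar = rng.uniform(0.5, 2.0, size=n)
x = np.zeros(n)
for k in range(K):
    t_k = k * gamma
    bound = ubar * (1.0 - (1.0 - gamma) ** (t_k / gamma))
    assert np.all(x <= bound + 1e-12)
    v = rng.uniform(0.0, 1.0, size=n) * (ubar - x)   # any feasible shrunken-LMO output
    x = x + gamma * v
print("growth bound of Lemma 37 holds along the simulated trajectory")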
E.3.2 Proof of Lemma 38

Lemma 38 (Generalized from Lemma 7 of Chekuri et al. (2015))
Given $\theta \in (0, \bar{u}]$, let $\lambda' = \min_{i \in [n]} \frac{\bar{u}_i}{\theta_i}$. Then for all $x \in [0, \theta]$, it holds that
$$f(x \vee x^*) \ge \left(1 - \tfrac{1}{\lambda'}\right) f(x^*). \quad (64)$$

Proof [Proof of Lemma 38] Consider $r(\lambda) = x^* + \lambda (x \vee x^* - x^*)$; it is easy to see that $r(\lambda) \ge 0$ for all $\lambda \ge 0$, and that $\lambda' \ge 1$. Let $y = r(\lambda') = x^* + \lambda' (x \vee x^* - x^*)$; it is easy to see that $y \ge 0$, and it also holds that $y \le \bar{u}$: consider one coordinate $i$; 1) if $x_i \ge x^*_i$, then $y_i = x^*_i + \lambda'(x_i - x^*_i) \le \lambda' x_i \le \lambda' \theta_i \le \bar{u}_i$; 2) if $x_i < x^*_i$, then $y_i = x^*_i \le \bar{u}_i$. So $f(y) \ge 0$. Since
$$x \vee x^* = \left(1 - \tfrac{1}{\lambda'}\right) x^* + \tfrac{1}{\lambda'} y = \left(1 - \tfrac{1}{\lambda'}\right) r(0) + \tfrac{1}{\lambda'} r(\lambda'), \quad (126)$$
and $f$ is concave along $r(\lambda)$, it holds that
$$f(x \vee x^*) \ge \left(1 - \tfrac{1}{\lambda'}\right) f(x^*) + \tfrac{1}{\lambda'} f(y) \ge \left(1 - \tfrac{1}{\lambda'}\right) f(x^*). \quad (127)$$
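A small numerical sanity check of the bound in Lemma 38 is sketched below (an illustration, not a proof). It assumes a toy non-negative DR-submodular quadratic on $[0, \bar{u}]$ — the coefficient vector alpha is chosen so that $f \ge 0$ on the box, matching the non-negativity used in the last step of the proof — and draws random $\theta \in (0, \bar{u}]$, $x \in [0, \theta]$, and $x^* \in [0, \bar{u}]$.

import numpy as np

# Toy non-negative DR-submodular quadratic f(x) = alpha^T x - 0.5 x^T B x on [0, ubar];
# alpha >= 0.5 * B @ ubar guarantees f >= 0 on the box.
rng = np.random.default_rng(2)
n = 4
ubar = rng.uniform(0.5, 2.0, size=n)
B = rng.uniform(0.0, 1.0, size=(n, n)); B = (B + B.T) / 2; np.fill_diagonal(B, 0.0)
alpha = 0.5 * B @ ubar + rng.uniform(0.1, 0.5, size=n)

def f(x):
    return alpha @ x - 0.5 * x @ B @ x

for _ in range(2000):
    x_star = rng.uniform(0.0, 1.0, size=n) * ubar
    theta = rng.uniform(0.05, 1.0, size=n) * ubar      # theta in (0, ubar]
    x = rng.uniform(0.0, 1.0, size=n) * theta          # x in [0, theta]
    lam = np.min(ubar / theta)                          # lambda' = min_i ubar_i / theta_i
    assert f(np.maximum(x, x_star)) >= (1.0 - 1.0 / lam) * f(x_star) - 1e-9
print("Lemma 38 bound verified on random instances")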
E.3.3 Proof of Theorem 35

Proof [Proof of Theorem 35] First of all, let us prove the following claim:
Claim 39
For k = 0 , ..., K − it holds, f ( x k +1 ) ≥ (1 − γ ) f ( x k ) + γ (1 − γ ) t k /γ f ( x ∗ ) − LD γ . (65) Proof [Proof of Claim 39] Consider a point z k := x k ∨ x ∗ − x k , one can observe that:1) z k ≤ ¯ u − x k ; 2) since x k ≥ , x ∗ ≥ , so z k ≤ x ∗ , which implies that z k ∈ P (fromdown-closedness of P ). So z k is a candidate solution for the shrunken LMO (Step 3 inAlgorithm 6). We have, f ( x k +1 ) − f ( x k ) ≥ γ (cid:104)∇ f ( x k ) , v k (cid:105) − L γ (cid:107) v k (cid:107) (Quadratic lower bound of (31)) (128) ontinuous Submodular Function Maximization ≥ γ (cid:104)∇ f ( x k ) , v k (cid:105) − L γ D (diameter of P ) (129) ≥ γ (cid:104)∇ f ( x k ) , z k (cid:105) − L γ D (shrunken LMO) (130) ≥ γ ( f ( x k + z k ) − f ( x k )) − L γ D (concave along z k ) (131)= γ [ f ( x k ∨ x ∗ ) − f ( x k )] − L γ D (132) ≥ γ [(1 − λ (cid:48) ) f ( x ∗ ) − f ( x k )] − L γ D (Lemma 38) (133)= γ [(1 − γ ) t k /γ f ( x ∗ ) − f ( x k )] − L γ D , (134)where the last equality comes from setting θ := ¯ u (1 − (1 − γ ) t k /γ ) according to Lemma 37,thus λ (cid:48) = min i ¯ u i θ i = (1 − (1 − γ ) t k /γ ) − .After rearrangement, we reach the claim.Then, let us prove Theorem 35 by induction .First of all, it holds when k = 0 (notice that t = 0). Assume that it holds for k .Then for k + 1, considering the fact e − t − O ( γ ) ≤ (1 − γ ) t/γ when 0 < γ ≤ t ≤ f ( x k +1 ) (135) ≥ (1 − γ ) f ( x k ) + γ (1 − γ ) t k /γ f ( x ∗ ) − LD γ (136) ≥ (1 − γ ) f ( x k ) + γ [ e − t k − O ( γ )] f ( x ∗ ) − LD γ (137) ≥ (1 − γ )[ t k e − t k f ( x ∗ ) − LD kγ − O ( γ ) f ( x ∗ )] + γ [ e − t k − O ( γ )] f ( x ∗ ) − LD γ = [(1 − γ ) t k e − t k + γe − t k ] f ( x ∗ ) − LD γ [(1 − γ ) k + 1] − [(1 − γ ) O ( γ ) + γO ( γ )] f ( x ∗ ) ≥ [(1 − γ ) t k e − t k + γe − t k ] f ( x ∗ ) − LD γ ( k + 1) − O ( γ ) f ( x ∗ ) . (138)Let us consider the term [(1 − γ ) t k e − t k + γe − t k ] f ( x ∗ ). We know that the function g ( t ) = te − t is concave in [0 , g ( t k + γ ) − g ( t k ) ≤ γg (cid:48) ( t k ), which amounts to,[(1 − γ ) t k e − t k + γe − t k ] f ( x ∗ ) ≥ ( t k + γ ) e − ( t k + γ ) f ( x ∗ ) (139)= t k +1 e − t k +1 f ( x ∗ ) . (140)Plugging Equation (140) into Equation (138) we get, f ( x k +1 ) ≥ t k +1 e − t k +1 f ( x ∗ ) − LD γ ( k + 1) − O ( γ ) f ( x ∗ ) . (141)Thus proving the induction, and proving the theorem as well. ian and Buhmann and Krause Appendix F. Miscellaneous Results
F.1 Verifying DR-Submodularity of the Objectives

Softmax extension.
For the softmax extension, the objective is
$$f(x) = \log\det(\mathrm{diag}(x)(L - I) + I), \quad x \in [0, 1]^n.$$
Its DR-submodularity can be established by directly applying Lemma 3 in Gillenwater et al. (2012): Gillenwater et al. (2012, Lemma 3) immediately implies that all entries of $\nabla^2 f$ are non-positive, so $f(x)$ is DR-submodular.
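A finite-difference spot check of this claim is sketched below; it is only a numerical illustration, under the assumption of a randomly generated PSD kernel L, using central differences to estimate the Hessian entries of the softmax extension on the interior of the box.

import numpy as np

# Check (numerically, not as a proof) that all Hessian entries of
# f(x) = log det(diag(x)(L - I) + I) are non-positive for a random PSD kernel L.
rng = np.random.default_rng(3)
n = 4
A = rng.normal(size=(n, n))
L = A @ A.T                                   # random PSD DPP kernel

def f(x):
    return np.linalg.slogdet(np.diag(x) @ (L - np.eye(n)) + np.eye(n))[1]

eps = 1e-4
E = np.eye(n)
for _ in range(20):
    x = rng.uniform(0.1, 0.9, size=n)
    for i in range(n):
        for j in range(n):
            # central mixed difference approximating the (i, j) Hessian entry
            hij = (f(x + eps*E[i] + eps*E[j]) - f(x + eps*E[i] - eps*E[j])
                   - f(x - eps*E[i] + eps*E[j]) + f(x - eps*E[i] - eps*E[j])) / (4 * eps**2)
            assert hij <= 1e-5
print("all sampled Hessian entries of the softmax extension are non-positive")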
Multilinear extension. The DR-submodularity of the multilinear extension can be directly recognized by considering the conclusion in Appendix A.2 of Bach (2015) and the fact that the multilinear extension is coordinate-wise linear.

KL(x). The Kullback-Leibler divergence between $q_x$ and $p$, i.e., $\sum_{S \subseteq \mathcal{V}} q_x(S) \log \frac{q_x(S)}{p(S)}$, is
$$\mathrm{KL}(x) = -\sum_{S \subseteq \mathcal{V}} \prod_{i \in S} x_i \prod_{j \notin S} (1 - x_j)\, F(S) + \sum_{i=1}^{n} \left[x_i \log x_i + (1 - x_i)\log(1 - x_i)\right] + \log Z.$$
The first term is the negative of a multilinear extension, so it is DR-supermodular. The second term is separable and coordinate-wise convex, so it will not affect the off-diagonal entries of $\nabla^2 \mathrm{KL}(x)$; it will only contribute to the diagonal entries. Now, one can see that all entries of $\nabla^2 \mathrm{KL}(x)$ are non-negative, so $\mathrm{KL}(x)$ is DR-supermodular w.r.t. $x$.
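To illustrate the two facts used above, the following brute-force sketch evaluates the multilinear extension of a small, hypothetical coverage function by enumerating all subsets, and checks that it is linear in each coordinate and that its mixed second differences are non-positive; the ground set and coverage sets are illustrative assumptions, and the enumeration is only feasible for tiny instances.

import numpy as np
from itertools import chain, combinations

V = range(4)
cover = {0: {1, 2}, 1: {2, 3}, 2: {3}, 3: {1, 4}}    # hypothetical coverage sets

def F(S):                                             # coverage value (submodular)
    return len(set().union(*[cover[e] for e in S])) if S else 0

subsets = list(chain.from_iterable(combinations(V, r) for r in range(len(V) + 1)))

def multilinear(x):
    # expected value of F(R(x)), computed by explicit enumeration over subsets
    return sum(np.prod([x[e] if e in S else 1 - x[e] for e in V]) * F(S)
               for S in subsets)

rng = np.random.default_rng(4)
x = rng.uniform(0.0, 1.0, size=4)

# coordinate-wise linearity: the extension is affine in each coordinate
x0, x1, xm = x.copy(), x.copy(), x.copy()
x0[0], x1[0], xm[0] = 0.0, 1.0, 0.5
assert abs(multilinear(xm) - 0.5 * (multilinear(x0) + multilinear(x1))) < 1e-9

# mixed second differences over any two distinct coordinates are non-positive
for i in range(4):
    for j in range(4):
        if i == j:
            continue
        corners = [x.copy() for _ in range(4)]
        corners[0][[i, j]] = 1, 1; corners[1][[i, j]] = 1, 0
        corners[2][[i, j]] = 0, 1; corners[3][[i, j]] = 0, 0
        diff = (multilinear(corners[0]) - multilinear(corners[1])
                - multilinear(corners[2]) + multilinear(corners[3]))
        assert diff <= 1e-9
print("coordinate-wise linearity and non-positive mixed differences verified")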