Two-Sided Weak Submodularity for Matroid Constrained Optimization and Regression
Theophile Thiery∗   Justin Ward∗

Abstract
The concept of weak submodularity and the related submodularity ratio considers the behavior of a set function as elements are added to some current solution. By considering the submodularity ratio, strong guarantees have been obtained for maximizing various non-submodular objectives subject to a cardinality constraint via the standard greedy algorithm. Here, we give a natural complement to the notion of weak submodularity by considering how a function changes as elements are removed from the solution. We show that a combination of these two notions can be used to obtain strong guarantees for maximizing non-submodular objectives subject to an arbitrary matroid constraint via both standard and distorted local search algorithms. Our guarantees improve on the state of the art whenever γ is moderately large, and agree with known guarantees for submodular objectives when γ = 1. As motivation, we consider both the subset selection problem and the Bayesian A-optimal design problem for linear regression, both of which were previously studied in the context of weak submodularity. We show that these problems satisfy our complementary notion of approximate submodularity as well, allowing us to apply our new guarantees.

∗School of Mathematical Sciences, Queen Mary University of London, United Kingdom. Emails: {t.f.thiery, justin.ward}@qmul.ac.uk

A submodular function f : 2^X → R is a set function such that f(e | A) ≥ f(e | B) for all A ⊆ B ⊆ X and e ∈ X \ B, where f(e | S) := f(S ∪ {e}) − f(S) denotes the marginal contribution of element e with respect to some set S. Submodularity has long played a central role in combinatorial optimization and operational research, because it provides a discrete analogue of convexity and also captures a natural diminishing returns property. More recently, a wide range of applications have also been found in data science and machine learning. Here, optimization of submodular functions has been used as a general technique for subset selection problems [9, 11, 28], sensor placement [28], document summarization [2, 15, 30], and influence maximization in social networks [25].

All of these applications consider the problem of extracting a small descriptive subset of elements from some larger set, and so consider a simple cardinality constraint. In some application areas, it may be desirable to introduce a notion of fairness requiring that our selection is made up of a few parts from different portions of the dataset. For example, if we wish to summarize a video, we would like to select k images that recap it. But to ensure a sufficient palette of images, we desire to select images spread across the different portions of the video, taking only a few from each; such requirements are naturally modeled by a matroid constraint. For monotone submodular objectives, the standard greedy algorithm gives a (1 − e^{−1})-approximation for a cardinality constraint (i.e. a uniform matroid). This is known to be the best possible, both under the assumption that P ≠ NP and in the model in which the algorithm may only access the function f by making (a polynomial number of) value queries [14, 31]. For a general matroid constraint, the greedy algorithm attains only a 1/2-approximation, but it is still possible to obtain a (1 − e^{−1})-approximation algorithm via the continuous greedy algorithm. For more complex constraints, the best known guarantees are provided by local search algorithms [29, 16, 38], which iteratively improve a current solution by repeated, relatively small exchanges.

Owing to their simplicity, both greedy and local search algorithms are also often used as heuristics in practice, even for functions that are not submodular. One such setting is subset selection via forward regression.
Here the goal is to select a small set of variables that together "best" predict some target quantity. The forward regression algorithm selects these variables in a greedy, stepwise fashion. In this setting, Das and Kempe [10] showed that the squared multiple correlation objective function, also known as the R² objective, is submodular if suppressors are not present among the predictor variables, thus justifying the use of the greedy algorithm. In later work [11] they showed that even in the presence of suppressors, a weaker property (appropriately) deemed weak submodularity often holds. They introduced the submodularity ratio γ, which measures how far a function deviates from submodularity when considering the effect of adding individual elements. Their definition is directly motivated by the greedy algorithm for a cardinality constraint, for which they derive an approximation guarantee of (1 − e^{−γ}). This guarantee, which depends on the objective function's submodularity ratio γ ∈ [0, 1], degrades gracefully as the objective moves further from submodularity. For maximizing a function with submodularity ratio γ under an arbitrary matroid constraint, the best known guarantee is a randomized (1 + γ^{−1})^{−2}-approximation algorithm via the residual random greedy algorithm due to Chen et al. [6]. However, as γ tends to 1 (i.e. as the function becomes closer to a submodular function), the approximation tends to only 1/4, while the standard greedy algorithm is known to provide a 1/2-approximation when the objective is submodular.
Figure 1: Guarantees for maximizing a (γ, 1/γ)-weakly submodular objective subject to a matroid constraint: Residual Random Greedy [6], Local Search (Section 4), and Distorted Local Search (Section 5).

In contrast, local search algorithms and the rounding procedure used in conjunction with the continuous greedy algorithm require considering the effect of swaps of elements on the value of f. Therefore, it is natural to ask whether many of the real-world applications displaying weak submodularity may have some further inherent property which allows these techniques to be used. Here, we answer this question affirmatively, by giving a natural extension that complements the submodularity ratio γ. We introduce the notion of an upper submodularity ratio β, which bounds how far a function deviates from submodularity when considering the effect of removing single elements. Intuitively, the parameter β compares the loss incurred by removing an entire set with the aggregate of the individual losses incurred by removing each of its elements. We call functions for which both ratios are bounded (γ, β)-weakly submodular. To motivate our definition, we show that the objective for the subset selection problem considered by Das and Kempe [11] is (γ, β)-weakly submodular with β = γ^{−1} (where γ can be bounded using a sparse eigenvalue parameter λ_min of the covariance matrix, as in [11]). We also consider the problem of Bayesian A-optimal design, which has been previously studied via weak submodularity [3, 21, 22]. In Appendix 8.1, we show that our parameter β can be bounded by γ^{−1} for this problem as well. Given the relationship between γ and β for the problems we consider, it is natural to conjecture that β may be bounded in some generic fashion for every weakly submodular function. In Appendix 9, we show that this is not the case by exhibiting a function on a ground set of size k for which β must be Θ(k^{1−γ}).

Using our new definition, we are able to derive guarantees (as a function of γ and β) for matroid constrained maximization problems via both a deterministic local search algorithm and a randomized, distorted local search algorithm that is guided by an auxiliary potential. For objectives that are submodular, our algorithms have guarantees of 1/2 − O(ε) and 1 − e^{−1} − O(ε), respectively, and these guarantees deteriorate smoothly as a function of the objective's deviation from submodularity. For both the subset selection and A-optimal design problems (in which β = γ^{−1}), our first algorithm improves on the current state of the art for all γ > (√5 − 1)/2 ≈ 0.62, and our second for all γ greater than approximately 0.41.

Our distorted local search algorithm builds on the distorted-potential approach of [17]. In the analysis of [17], submodularity of f implies submodularity of the potential g, which is used to derive the bounds on g necessary for convergence and sampling, as well as the crucial bound linking the local optimality of g to the value of f. Here, however, since f is only approximately submodular, these techniques will not work, and so we require a more delicate analysis for each of these components. A further complication in our setting is that the correct potential g depends on the values of γ and β, which may not be known a priori. One approach to handle this in practice would be to simply enumerate a sufficiently fine sequence of guesses for both parameters. Here, we give a more efficient approach that is based on guessing the value of a joint parameter in γ and β. Each such guess gives a different distorted potential.
Inspired (broadly) by simulated annealing, we show that if each such new potential is initialized by the local optimum of the previous potential, then the overall running time can be amortized over all guesses. To highlight the main ideas, we present a simplified version of the algorithm in Section 5, and defer the full analysis to Appendices 6 and 7.

Following the pioneering work of Das and Kempe [11], there has been a vast literature on other notions of approximate submodularity. Horel and Singer [24] define a function F to be ε-approximately submodular if F is within a (1 + ε) multiplicative factor of some submodular function f, and investigate the relationship between ε and the number of queries needed to optimize it. Chierichetti et al. [7] focus on ε-additive approximately submodular functions. As noted in their work, their definition is incomparable with the notion of weak submodularity based on the submodularity ratio. Gong et al. [20] and Nong [32] use a generic submodularity ratio γ_gen and design algorithms for maximizing set functions when γ_gen is bounded. Specifically, Gong et al. [20] obtained a (1 − e^{−1})(1 − e^{−γ_gen})-approximation under a matroid constraint. All the definitions introduced here assume that the objective is monotone. Recently, Santiago and Yoshida have proposed a notion of approximate submodularity that extends to the non-monotone case [35].

The submodularity ratio was first introduced to analyze the forward regression and orthogonal matching pursuit algorithms for linear regression [11]. It was later related to restricted strong convexity [13], leading to similar guarantees for generalized linear model (GLM) likelihoods, graphical model learning objectives, and arbitrary M-estimators. The strength of the submodularity ratio is that for all of these applications it suffices to estimate the value of γ to obtain guarantees for greedy algorithms, even in other models of computation [27, 12]. This has led to new guarantees for sensor placement problems [23], experimental design [21, 3], and low-rank optimization [26]. Weak submodularity and related algorithmic techniques have also been used for document summarization [6] and interpretation of neural networks [12].

(A set function has generic submodularity ratio γ_gen if f(e | A) ≥ γ_gen · f(e | B) for all A ⊆ B and e ∈ X \ B. We note that this definition is more restrictive than ours, i.e., γ ≥ γ_gen. In particular, for the subset selection problem, it is easy to formulate instances for which γ_gen = 0, so that the generic ratio provides no approximation guarantee, but γ > 0.)

Another related notion is that of curvature, which can be used to strengthen approximation bounds both for general submodular functions [8, 37, 36, 39, 19] and in combination with weak submodularity [3]. Inspired by insights from continuous optimization, Pokutta et al. [33] have recently introduced a new notion of sharpness, which provides further explanation for the empirically good performance of the greedy algorithm on submodular objectives.

Throughout the remainder of the paper, all set functions f : 2^X → R_{≥0} that we consider will be monotone, satisfying f(B) ≥ f(A) for all A ⊆ B. Because we are focusing on maximization problems, we will further assume without loss of generality that our objective functions are normalized, so f(∅) = 0. When there is no risk of confusion, we will use the shorthands A + e for A ∪ {e} and A − e for A \ {e}.

The problems that we consider will be constrained by an arbitrary matroid M = (X, I).
Here I ⊆ 2^X is a family of independent sets, satisfying: ∅ ∈ I; A ⊆ B ∈ I implies A ∈ I; and, for all A, B ∈ I with |A| < |B|, there exists e ∈ B \ A such that A + e ∈ I. The maximal independent sets of I are called bases of M, and the last condition implies that they all have the same size, which is the rank of M and which we typically denote by k. Our goal will be to find some S ∈ I that maximizes the objective f. Since f is monotone, we can assume without loss of generality that the maximizer is a base of M. We will make use of a standard result, first given by Brualdi [4], which shows that for any pair of bases A, B ∈ I, we can index the elements a_i ∈ A and b_i ∈ B so that A − a_i + b_i ∈ I for all i ∈ [k].

It can be shown that a set function f : 2^X → R_{≥0} is submodular if and only if

    f(B) − f(A) ≤ Σ_{e ∈ B\A} f(e | A)

for any A ⊆ B ⊆ X. The definition of weak submodularity relaxes this inequality to obtain:

    γ (f(B) − f(A)) ≤ Σ_{e ∈ B\A} f(e | A),   (1)

for any sets A ⊆ B, where γ ∈ [0, 1] is called the submodularity ratio of f.
If γ ≥ 1, then f is submodular. (We note that the definition given here, which is also used in [3, 6, 12, 21, 35], is slightly adapted from the original definition given in [11].) Since the standard analysis of the greedy algorithm works by bounding the marginal gains of the optimal elements not selected during the algorithm, judicious application of (1) gives guarantees for the greedy algorithm even for arbitrary fixed γ.

Submodularity can equivalently be characterized by the inequality

    f(B) − f(A) ≥ Σ_{e ∈ B\A} f(e | B − e)

for any A ⊆ B ⊆ X. Thus, another natural approach is to consider functions that satisfy:

    β (f(B) − f(A)) ≥ Σ_{e ∈ B\A} f(e | B − e),   (2)

for some β ≥ 0.
We call this property β-weak submodularity from above, to distinguish it from (1), which we will now refer to as γ-weak submodularity from below. Here, if β ≤ 1, then f is submodular. Note that by monotonicity, f(B) − f(A) ≥ f(B) − f(B − e) = f(e | B − e) for each e ∈ B \ A, and so every monotone function f satisfies (2) for β = |B|.

Combining the above notions, we call a set function f (γ, β)-weakly submodular if it is γ-weakly submodular from below and β-weakly submodular from above (and so both (1) and (2) hold).
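As a concrete illustration of the two ratios, the following Python sketch computes, by brute-force enumeration over all pairs A ⊆ B, the largest γ for which (1) holds and the smallest β for which (2) holds on a small ground set. The toy objective (modular weights plus a small "synergy" bonus) is a hypothetical example chosen only for illustration; because of the supermodular bonus it should report γ < 1 and β > 1.

from itertools import combinations

# Hypothetical toy objective: monotone and normalized, with a small
# supermodular "synergy" term, so it is weakly but not fully submodular.
GROUND = frozenset(range(5))
WEIGHTS = {0: 1.0, 1: 0.8, 2: 0.6, 3: 0.9, 4: 0.7}
SYNERGIES = {(0, 1), (2, 3)}  # element pairs that give a bonus when both are selected

def f(S):
    S = frozenset(S)
    bonus = 0.3 * sum(1 for (a, b) in SYNERGIES if a in S and b in S)
    return sum(WEIGHTS[e] for e in S) + bonus

def subsets(U):
    U = sorted(U)
    for r in range(len(U) + 1):
        yield from (frozenset(c) for c in combinations(U, r))

# gamma: largest value with gamma*(f(B)-f(A)) <= sum_{e in B\A} f(e | A)      -- inequality (1)
# beta : smallest value with beta*(f(B)-f(A)) >= sum_{e in B\A} f(e | B - e)  -- inequality (2)
gamma, beta = 1.0, 1.0
for B in subsets(GROUND):
    for A in subsets(B):
        gap = f(B) - f(A)
        if gap <= 1e-12:
            continue
        added = sum(f(A | {e}) - f(A) for e in B - A)
        removed = sum(f(B) - f(B - {e}) for e in B - A)
        gamma = min(gamma, added / gap)
        beta = max(beta, removed / gap)

print(f"submodularity ratio (from below): gamma ~ {gamma:.3f}")
print(f"upper submodularity ratio (from above): beta ~ {beta:.3f}")

Such a brute-force check is of course only feasible on tiny ground sets; the point of the bounds derived in the following sections is precisely to certify γ and β analytically.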
For our results, we will make use of the following simple lemma, which follows directly from (1).

Lemma 1. Suppose that f is γ-weakly submodular from below. Then, for any set T and any a, b ∈ X \ T,

    f(b | T) ≥ γ f(b | T + a) − (1 − γ) f(a | T).

Proof.
Applying inequality (1) to the sets T and T ∪ {a, b} gives f(a | T) + f(b | T) ≥ γ f({a, b} | T). Since f({a, b} | T) = f(b | T + a) + f(a | T), rearranging yields f(b | T) ≥ γ f(b | T + a) − (1 − γ) f(a | T), as claimed.

Let us now consider the subset selection problem, in which we have a random variable Z and a set of n random variables X = {X_1, ..., X_n} (here and throughout this section we use calligraphic letters to denote sets of random variables, to avoid confusion). We suppose that Z and all X_i have been normalized to have mean 0 and variance 1, and we let C_X be the n × n covariance matrix of the variables X_i. Our goal is to find a constrained set S ⊆ X that gives the best linear predictor for Z. We measure the fitness of the linear predictor using the squared multiple correlation, also known as the R² function. Given a subset S, we define the R² objective for S as:

    R²_{Z,S} = (Var(Z) − E[(Z − Z′)²]) / Var(Z),   (3)

where Z′ = Σ_{X_i ∈ S} α_i X_i is the linear predictor over S that minimizes the mean square prediction error for Z. Given a set S and Z, the coefficients of the best linear predictor are given by α = C_S^{−1} b_{S,Z}, where C_S is the principal submatrix of C_X corresponding to the variables in S, i.e., (C_S)_{i,j} = Cov(X_i, X_j), and b_{S,Z} is the vector of covariances between Z and each X_i ∈ S, i.e., (b_{S,Z})_i = Cov(X_i, Z). Therefore, if we let X_S denote a vector containing the random variables in S, then the best linear predictor can be written as Z′ = X_S^T C_S^{−1} b_{S,Z}. Because Z has unit variance, the objective simplifies to R²_{Z,S} = 1 − E[(Z − Z′)²], and so the R² objective can be regarded as a measure of the fraction of the variance of Z that is explained by S. In addition, we can define the residual of Z with respect to this predictor as the random variable Res(Z, S) = Z − Z′ = Z − X_S^T C_S^{−1} b_{S,Z}. Therefore the R² objective is equivalent to

    R²_{Z,S} = 1 − Var(Res(Z, S)) = b_{S,Z}^T C_S^{−1} b_{S,Z}.

Das and Kempe [10, 11] consider the problem of maximizing R²_{Z,S} subject to the constraint |S| ≤ k. They showed that although the R² objective is not submodular in general, it is submodular in the absence of suppressors [10]. Moreover, they motivate the use of the greedy algorithm in practice by showing that the R² objective satisfies (1) for all A ⊆ B ⊆ X with γ ≥ λ_min(C_X, |B|) ≥ λ_min(C_X), where λ_min(C_X) is the smallest eigenvalue of C_X and λ_min(C_X, |B|) is the smallest |B|-sparse eigenvalue of C_X [11]. In order to analyze the greedy algorithm, it suffices to let B \ A and A both be subsets containing at most k variables, so |B| = 2k.

Here, we extend this result by deriving an analogous bound on the weak submodularity of the R² objective from above.
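To make the objective concrete, the following NumPy sketch evaluates R²_{Z,S} = b_{S,Z}^T C_S^{−1} b_{S,Z} from a normalized covariance matrix and runs the greedy forward-regression heuristic under a cardinality constraint. The covariance data is a hypothetical toy instance invented for illustration, not an example from the paper.

import numpy as np

def r_squared(C_X, b, S):
    """R^2_{Z,S} = b_S^T C_S^{-1} b_S for a list S of selected variable indices.

    C_X : (n, n) covariance matrix of the normalized predictors X_1, ..., X_n
    b   : (n,)  vector with b[i] = Cov(X_i, Z)
    """
    if not S:
        return 0.0
    idx = np.array(S)
    C_S = C_X[np.ix_(idx, idx)]
    b_S = b[idx]
    return float(b_S @ np.linalg.solve(C_S, b_S))

def forward_regression(C_X, b, k):
    """Greedy (forward regression) selection of at most k predictors."""
    S, remaining = [], set(range(len(b)))
    for _ in range(k):
        gains = {i: r_squared(C_X, b, S + [i]) - r_squared(C_X, b, S) for i in remaining}
        best = max(gains, key=gains.get)
        if gains[best] <= 0:
            break
        S.append(best)
        remaining.remove(best)
    return S

# Hypothetical toy instance with four normalized predictors.
C_X = np.array([[1.0, 0.6, 0.2, 0.0],
                [0.6, 1.0, 0.1, 0.0],
                [0.2, 0.1, 1.0, 0.3],
                [0.0, 0.0, 0.3, 1.0]])
b = np.array([0.5, 0.45, 0.4, 0.3])   # Cov(X_i, Z)

S = forward_regression(C_X, b, k=2)
print("greedy selection:", S, "R^2 =", round(r_squared(C_X, b, S), 3))

Theorem 2 below bounds how far this objective can deviate from submodularity "from above", complementing the bound on γ recalled above.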
Theorem 2. For any B ⊆ X, the R² objective satisfies (2) with β ≤ 1/λ_min(C_X, |B|) ≤ 1/λ_min(C_X).

In particular, this shows that the R² objective is (γ, 1/γ)-weakly submodular for γ = λ_min(C_X). We show in the next section that this implies guarantees for local search algorithms when S is now constrained to be an independent set of some matroid of rank k. In fact, because our bound on β then depends on the smallest k-sparse eigenvalue of C_X, rather than the smallest 2k-sparse eigenvalue that is used to bound γ, our guarantees in practice may be stronger than shown in Figure 1.

In order to prove Theorem 2, we will use the following lemmas from [11].

Lemma 3.
Given two sets of random variables S = {X_1, ..., X_n} and A, and a random variable Z, we have:

    Res(Z, A ∪ S) = Res(Res(Z, A), {Res(X_i, A)}_{i : X_i ∈ S}).

Lemma 4.
Given two sets of random variables S = {X_1, ..., X_n} and A, and a random variable Z, we have:

    R²_{Z, S∪A} = R²_{Z, A} + R²_{Z, {Res(X_i, A)}_{i : X_i ∈ S}}.

We define the following quantities, which we use for the rest of the section. Let A, B be some fixed sets of random variables with A ⊆ B, and suppose that
B\A := { X , . . . , X t } .For each X i , we define Y i to be the random variable Res( X i , B \ A ). Suppose furtherthat each Y i has been renormalized to have unit variance. Then, we let S = { Y i , . . . , Y t } ,and define C to be the covariance matrix for the variables S and b to be the vector ofcovariances between Z and each Y i ∈ S .We fix a single random variable X i . For ease of notation, in the next two Lemmas weassume without loss of generality that C , b have been permuted so that X i correspondsto the last row and column of C . Then, we define A − i = A \ { X i } , S − i = S \ { Y i } and let Y − i denote the vector containing the variables of S − i (ordered as in C and b ).We then let C − i be the principle submatrix of C by excluding the row and columncorresponding to Y i (i.e., the last row and column), and b − i be the vector obtainedfrom b by excluding the entry for Y i (i.e., the last entry). Finally, we let u i be thevector of covariances between Y i and each Y j ∈ S − i . Note that u i corresponds to thelast column of C with its last entry (corresponding to Var( Y i )) removed.We begin by computing the loss in R Z, B when removing X i from B . Lemma 5. R Z, B − R Z, B\{ X i } = Cov( Z, Res( Y i , S − i )) / Var(Res( Y i , S − i )) = b T H i b /s i ,where H i = (cid:18) C − − i u i u Ti C − − i − C − − i u i − u Ti C − i (cid:19) and s i = 1 − u Ti C − − i u i . roof. By Lemma 4 and Lemma 3, respectively: R Z, B − R Z, B\{ X i } = R Z, Res( X i , A − i ∪ ( B\A )) = R Z, Res(Res( X i , B\A ) , { Res( X j , B\A ) } Xj ∈A− i ) . Recall that each Y i is obtained from Res( X i , B \ A ) by renormalization and S − i = { Y j } X j ∈A − i . Thus, Res( Y i , S − i ) is simply a rescaling of Res(Res( X i , B \A ) , { Res( X i , B \A ) } X i ∈A − i ). Since the R objective is invariant under scaling of the predictor variables,we have: R Z, B − R Z, B\{ X i } = R Z, Res( Y i , S − i ) = Cov( Z, Res( Y i , S − i )) / Var(Res( Y i , S − i )) , (4)where the last line follows directly from the definition of the R objective. It remainsto express (4) in terms of C , b and u . By definition, Res( Y i , S − i ) = Y i − Y T − i C − − i u i .Hence,Var(Res( Y i , S − i )) = Var( Y i − Y T − i C − − i u i )= E (cid:2) Y i (cid:3) − u Ti C − − i u i + u Ti C − − i E (cid:2) Y − i Y T − i (cid:3) C − − i u i = 1 − u Ti C − − i u i , where the last equality follows from normalization of Y i and E (cid:2) Y − i Y T − i (cid:3) = C − i . Fur-thermore,Cov( Z, Res( Y i , S − i )) = Cov( Z, Y i − Y T − i C − − i u i ) = (cid:0) Cov(
Z, Y i ) − Cov(
Z, Y T − i C − i u i ) (cid:1) = (cid:0) b i − b T − i C − − i u i (cid:1) = b T (cid:18) C − − i u i u Ti C − − i − C − − i u i − u Ti C − − i (cid:19) b . Substituting the above 2 expressions into (4) completes the proof.In the next lemma we show that this can be simplified for eigenvectors of C − . Lemma 6.
Let ( λ, v ) , ( µ, w ) be any 2 eigenpairs of C − . Then, v T H i w = λµs i v i w i ,where H i and s i are as defined in the statement of Lemma 5.Proof. Applying the formula for block matrix inversion (Lemma 18) to C − , we have C − = (cid:18) C − i u i u Ti (cid:19) − = (cid:18) C − − i
00 0 (cid:19) + 11 − u Ti C − − i u i (cid:18) C − − i u i u Ti C − − i − C − − i u i − u Ti C − − i (cid:19) . (5)Next, we look at the i th linear equation defined by the product between C − and w .Because ( µ, w ) is an eigenpair of C − , we must have ( C − w ) i = µw i . By (5), thisis equivalent to ( − u Ti C − − i w − i + w i ) /s i = µw i (where, as usual, we let w − i be thevector obtained from w by discarding its i th entry). This in turn is equivalent to u i C − − i w − i = w i (1 − µs i ). Since C − is symmetric, the same argument implies that v T − i C − − i u i = v i (1 − λs i ). Thus, v T H i w = v T − i C − − i u i u Ti C − − i w − i − w i ( v T − i C − − i u i ) − v i ( u Ti C − − i w − i ) + v i w i , = v i w i (1 − λs i )(1 − µs i ) − v i w i (1 − λs i ) − v i w i (1 − µs i )+ v i w i , = v i w i ((1 − λs i )(1 − µs i ) − (1 − λs i ) − (1 − µs i ) + 1) = λµs i v i w i . roof of Theorem 2. Let { v , . . . , v i } be an eigenbasis of C − with corresponding eigen-values λ , . . . , λ i . Let V be a matrix with columns given by these v i . Since C − is a sym-metric positive semidefinite matrix, the matrix V is orthonormal. Hence, we can write b = V y for some vector y . By Lemma 5, for any i , b T H i b = Cov( Z, Res( Y i , S − i )) ≥ s i = Var(Res( Y i , S − i )) ≤
1. Thus, by Lemma 5, for each i = 1 , . . . , t , R Z, B − R Z, B\{ X i } = b T H i b /s i ≤ b T H i b /s i = y T V T H i V y /s i . By Lemma 6, ( V T H i V ) l,m = λ l λ m s i ( v l ) i ( v m ) i . Thus, summing over all i we have: t X i =1 R Z, B − R Z, B\{ X i } ≤ t X i =1 t X l,m =1 ( y l y m λ l λ m )( v l ) i ( v m ) i . = t X l,m =1 ( y l y m λ l λ m ) t X i =1 ( v l ) i ( v m ) i = t X i =1 y i λ i ≤ λ max t X i =1 y i λ i , (6)where the last equation follows from the orthonormality of the eigenvectors v i . More-over, R Z, B − R Z, A = b T C − b = t X i =1 y i λ i . (7)Combining (7) and (6), we have P i ∈S R Z, B − R Z, B\{ X i } ≤ λ max ( C − )[ R Z, B − R A ] andso inequality (2) is satisfied for β = λ max ( C − ) = 1 /λ min ( C ). It remains to bound1 /λ min ( C ) in terms of the eigenvalues of C . We show that λ min ( C ) ≥ λ min ( C, |B| ).Recall that C is a normalized covariance matrix for the random variables { Res( X i , B \A ) } X i ∈A . As shown in Lemma 20, this implies that λ min ( C ) ≥ λ min ( C ( B\A ) ∪A ) ≥ λ min ( C X , |B| ) ≥ λ min ( C X ), and the claimed bound follows. In this section, we derive a first novel algorithm to optimize ( γ, β )-weakly submodularfunctions. The algorithm is a simple local-search procedure. It exchanges elementsas long it yields an improvement in the objective function. Each exchange preservesthe feasibility of the set. To ensure a convergence in polynomial time, the algorithmperforms the exchange only if the objective value increases by a (1 + ε/k )-multiplicativefactor , and terminates after making at most ⌈ log ε/k k ⌉ improvements. In order toensure that a good solution is obtained after at most this number of improvements, werequire a good starting solution. Here, we take the single element e ∈ X with largestvalue f ( e ), and complete { e } arbitrarily to a base of M .The next lemma gives the guarantees of the algorithm 1 for maximizing a ( γ, β )-weakly submodular function. Motivated by the R objective (see section 3), for ( γ, /γ )-weakly submodular function and assuming ε = 0, we recover a 1 / /γ ) − for all γ > √ − and sufficientlysmall ǫ . Theorem 7.
Given a matroid M = (X, I), a (γ, β)-weakly submodular function f, and ε > 0, the local-search algorithm (Algorithm 1) runs in time O(nk² log(k)/ε) and returns a solution S such that

    f(S) ≥ γ² / ((2 − γ)β + γ² + ε) · f(O)

for any solution O ∈ I.

Algorithm 1: The local-search algorithm
procedure LocalSearch(M, f, ε)
    Let e_max = argmax_{e ∈ X} f(e)
    S ← an arbitrary base of M containing e_max
    for i ← 1 to ⌈log_{1+ε/k} k⌉ do
        if ∃ e ∈ S and e′ ∈ X \ S with S − e + e′ ∈ I and (1 + ε/k) f(S) < f(S − e + e′) then
            S ← S − e + e′
        else
            return S
    return S
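The following Python sketch mirrors Algorithm 1 for a partition matroid (a hypothetical constraint chosen only for illustration): it starts from a base containing the best singleton and repeatedly performs single-element exchanges that improve the objective by at least a (1 + ε/k) factor. The objective is assumed to be monotone and normalized; this is a sketch of the procedure above, not an optimized implementation.

import math

def is_independent(S, groups, capacities):
    counts = {}
    for e in S:
        g = groups[e]
        counts[g] = counts.get(g, 0) + 1
        if counts[g] > capacities[g]:
            return False
    return True

def base_containing(e_max, ground, groups, capacities):
    """Complete {e_max} to a base of the partition matroid arbitrarily."""
    S = {e_max}
    for e in ground:
        if e not in S and is_independent(S | {e}, groups, capacities):
            S.add(e)
    return S

def local_search(ground, groups, capacities, f, eps=0.1):
    """Sketch of Algorithm 1 over a partition matroid constraint."""
    k = sum(capacities.values())                       # rank of the matroid
    e_max = max(ground, key=lambda e: f({e}))
    S = base_containing(e_max, ground, groups, capacities)
    max_steps = math.ceil(math.log(k) / math.log(1.0 + eps / k))
    for _ in range(max_steps):
        improved = False
        for e in list(S):
            for e2 in ground:
                if e2 in S:
                    continue
                T = (S - {e}) | {e2}
                if is_independent(T, groups, capacities) and f(T) > (1.0 + eps / k) * f(S):
                    S, improved = T, True
                    break
            if improved:
                break
        if not improved:            # local optimum: no (1 + eps/k)-improving swap exists
            return S
    return S

# Hypothetical toy instance: six elements in three groups, at most one per group,
# with a simple monotone coverage-style objective.
groups = {0: "a", 1: "a", 2: "b", 3: "b", 4: "c", 5: "c"}
capacities = {"a": 1, "b": 1, "c": 1}
covers = {0: {1, 2}, 1: {2, 3}, 2: {3, 4}, 3: {4, 5}, 4: {5, 6}, 5: {1, 6}}

def f(S):
    return float(len(set().union(*(covers[e] for e in S)))) if S else 0.0

print(local_search(list(groups), groups, capacities, f, eps=0.1))

By Theorem 7, when f is (γ, β)-weakly submodular, the base returned by such a procedure is within a factor of roughly γ²/((2 − γ)β + γ²) of the optimal base.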
Proof. Let S_init be the arbitrary base of M containing argmax_{e∈X} f(e) that we select at the start of the algorithm. Consider any feasible solution O ∈ I. Then, by monotonicity, normalization of f, and γ-weak submodularity of f, we have f(O) ≤ (1/γ) Σ_{o∈O} f(o | ∅) ≤ (k/γ) f(e_max) ≤ (k/γ) f(S_init). Each time the algorithm makes an exchange, the value of the solution S increases by a factor of at least (1 + ε/k). Thus, if the algorithm terminates after ⌈log_{1+ε/k} k⌉ improvements, then f(S) ≥ k · f(S_init) ≥ γ f(O) ≥ γ²/((2 − γ)β + γ² + ε) · f(O), as required.

Suppose now that the algorithm returns S before making ⌈log_{1+ε/k} k⌉ improvements. Let S = {s_1, ..., s_k}. Since M is a matroid, we can index the elements o_i ∈ O and s_i ∈ S so that S − s_i + o_i is a feasible set for all 1 ≤ i ≤ k. Since the algorithm terminated without making any further improvements, (1 + ε/k) f(S) ≥ f(S − s_i + o_i) for all 1 ≤ i ≤ k. Subtracting f(S − s_i) from each side and applying Lemma 1 gives:

    f(s_i | S − s_i) + (ε/k) f(S) ≥ f(o_i | S − s_i) ≥ γ f(o_i | S) − (1 − γ) f(s_i | S − s_i),

for all 1 ≤ i ≤ k. Summing over all i and rearranging, we then have:

    (2 − γ) Σ_{i=1}^{k} f(s_i | S − s_i) + ε f(S) ≥ γ Σ_{i=1}^{k} f(o_i | S).

Now, we use β-weak submodularity to upper-bound the left-hand side, and γ-weak submodularity and monotonicity of f to lower-bound the right-hand side:

    ((2 − γ)β + ε) f(S) ≥ γ² (f(O ∪ S) − f(S)) ≥ γ² (f(O) − f(S)).

Rearranging this then gives the claimed guarantee. The bound on the running time follows from the fact that the algorithm makes at most ⌈log_{1+ε/k} k⌉ = O(k log(k)/ε) improvements, each of which can be found in time O(nk).

Here, we present a more sophisticated local search procedure that is guided by an auxiliary potential function derived from the objective f. We call this procedure distorted local search (Algorithm 2). In each step, Algorithm 2 performs a swap if and only if it improves the following auxiliary potential function:

Algorithm 2: The (simplified) distorted local-search algorithm
procedure DistortedLocalSearch(M, f)
    Suppose that f is (γ, β)-weakly submodular and let φ = γ² + β(1 − γ)
    S ← an arbitrary base of M
    while ∃ a ∈ S, b ∈ X \ S with S − a + b ∈ I and g_φ(S − a + b) > g_φ(S) do
        S ← S − a + b
    return S

The potential is defined, for a given parameter φ ∈ R_{≥0}, as

    g_φ(A) = ∫_0^1 (φ e^{φp} / (e^φ − 1)) Σ_{B⊆A} p^{|B|−1} (1 − p)^{|A|−|B|} f(B) dp = Σ_{B⊆A} m^{(φ)}_{|A|−1, |B|−1} f(B),

where we define m^{(φ)}_{a,b} = ∫_0^1 (φ e^{φp} / (e^φ − 1)) p^b (1 − p)^{a−b} dp.

Here we focus on bounding the quality of the local optimum produced by Algorithm 2 when φ is chosen appropriately.
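Because g_φ(A) involves f(B) for every B ⊆ A, it cannot be evaluated exactly in general; as shown in Appendix 6.4, its marginals can instead be estimated by a two-step sampling procedure: draw p from the density D_φ(p) = φe^{φp}/(e^φ − 1) on [0, 1], then include each element of A independently with probability p. The following Python sketch implements this estimator for g_φ(e | A); the toy objective and sample size are illustrative assumptions, not values taken from the paper.

import math
import random

def sample_p(phi):
    """Inverse-transform sample from the density D_phi(p) = phi*e^(phi*p)/(e^phi - 1) on [0, 1]."""
    u = random.random()
    # The CDF is (e^(phi*p) - 1) / (e^phi - 1); invert it.
    return math.log(1.0 + u * (math.exp(phi) - 1.0)) / phi

def estimate_g_marginal(f, A, e, phi, num_samples=2000):
    """Monte-Carlo estimate of g_phi(e | A) = sum_{B subset of A} m_{|A|,|B|} * f(e | B)."""
    A = list(A)
    total = 0.0
    for _ in range(num_samples):
        p = sample_p(phi)
        B = frozenset(x for x in A if random.random() < p)   # keep each element with prob. p
        total += f(B | {e}) - f(B)                            # marginal contribution f(e | B)
    return total / num_samples

# Hypothetical toy objective: a simple monotone coverage-style function.
covers = {0: {1, 2}, 1: {2, 3}, 2: {4}, 3: {1, 5}}

def f(S):
    return float(len(set().union(*(covers[x] for x in S)))) if S else 0.0

phi = 1.0   # in the algorithm, phi would be set using gamma and beta, e.g. gamma^2 + beta*(1 - gamma)
print(estimate_g_marginal(f, A={0, 1, 2}, e=3, phi=phi))

The sampling distribution matches the coefficients m^{(φ)}_{|A|,|B|}, since the probability of drawing a particular B ⊆ A is exactly ∫_0^1 D_φ(p) p^{|B|}(1 − p)^{|A|−|B|} dp.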
Our main theorem is the following:

Theorem 8. Suppose that f is (γ, β)-weakly submodular, and let φ = φ(γ, β) := γ² + β(1 − γ). Then, for any bases A, O ∈ I:

    (φ e^φ / (e^φ − 1)) f(A) ≥ γ² f(O) + Σ_{i=1}^{|A|} [g_φ(A) − g_φ(A − a_i + o_i)].

At a local optimum, we necessarily have g_φ(A) − g_φ(A − a_i + o_i) ≤ 0 for every i, so Theorem 8 implies that the solution S returned by Algorithm 2 satisfies f(S) ≥ γ²(1 − e^{−φ})/φ · f(O) when φ = γ² + β(1 − γ). Note that in both the subset selection problem and Bayesian A-optimal design we can set β = γ^{−1}. This gives a guarantee of

    γ² (1 − e^{−(γ² + γ^{−1} − 1)}) / (γ² + γ^{−1} − 1),

which is better than existing guarantees for all γ greater than approximately 0.41.

In general, we cannot evaluate g_φ(A) directly, as it depends on the values f(B) for all subsets B of A. In Appendix 6.4 we show that we can efficiently estimate g_φ via a simple sampling procedure. Linking f and g_φ is the key to understanding the quality of the final solution; thus, in Appendix 6.2 we show that the value g_φ(A) can be bounded in terms of f(A) for any set A. To ensure convergence, we require that each improvement makes a significant change in g_φ; we can then bound the number of improvements required to reach a local optimum of g_φ. Finally, we address the fact that γ and β may not be known in advance. By initializing the algorithm with a solution produced by the residual random greedy algorithm, we can bound the range of values for φ that must be considered and then guess φ from this range. In Appendix 6.3 we show that small changes in φ result in small changes to g_φ(A), and so by initializing the run for each subsequent guess of φ with the solution produced for the previous guess, we can amortize the total number of improvements required across all guesses. The final algorithm, presented in Appendix 7, has the same guarantee as Algorithm 2 minus a small O(ε) term, and requires Õ(nk⁴ε^{−3}) evaluations of f.

We now turn to the proof of Theorem 8. Similarly to the analysis of [17], one can show that the coefficients m^{(φ)}_{a,b} satisfy the following properties (we give a full proof in Appendix 6.1).

Lemma 9. For any φ > 0, the coefficients m^{(φ)}_{a,b} satisfy the following:
1. g_φ(e | A) = Σ_{B⊆A} m^{(φ)}_{|A|,|B|} f(e | B), for any A ⊆ X and e ∉ A.
2. Σ_{b=0}^{a} (a choose b) m^{(φ)}_{a,b} = 1 for all a ≥ 0; equivalently, Σ_{B⊆A} m^{(φ)}_{|A|,|B|} = 1.
3. m^{(φ)}_{a,b} = m^{(φ)}_{a+1,b+1} + m^{(φ)}_{a+1,b} for all 0 ≤ b ≤ a.
4. φ m^{(φ)}_{a,b} = −b m^{(φ)}_{a−1,b−1} + (a − b) m^{(φ)}_{a−1,b} − (φ/(e^φ − 1)) · 1_{b=0} + (φe^φ/(e^φ − 1)) · 1_{b=a} for all a > 0 and 0 ≤ b ≤ a.

In the analysis of [17], it is shown that if f is submodular, its associated potential g is as well, and this plays a crucial role in the analysis. Here, however, f is only weakly submodular, which means we must carry out an alternative analysis to bound the quality of a local optimum for g_φ. As in the previous section, we consider two arbitrary bases A and O of a matroid M. As before, we index the elements a_i ∈ A and o_i ∈ O so that A − a_i + o_i is a base for all 1 ≤ i ≤ |A|. Then, since g_φ(A − a_i + o_i) − g_φ(A) = g_φ(o_i | A − a_i) − g_φ(a_i | A − a_i):

    Σ_{i=1}^{|A|} g_φ(a_i | A − a_i) = Σ_{i=1}^{|A|} [g_φ(A) − g_φ(A − a_i + o_i)] + Σ_{i=1}^{|A|} g_φ(o_i | A − a_i).   (8)

The next lemma bounds the final term in (8).

Lemma 10. If f is (γ, β)-weakly submodular, then

    Σ_{i=1}^{|A|} g_φ(o_i | A − a_i) ≥ γ² f(O) − (γ² + β(1 − γ)) Σ_{B⊆A} m^{(φ)}_{|A|,|B|} f(B).

Proof.
By claims 1 and 3 of Lemma 9, we have g φ ( o i | A − a i ) = X B ⊆ A − a i m ( φ ) | A |− , | B | f ( o i | B ) = X B ⊆ A − a i [ m ( φ ) | A | , | B | +1 f ( o i | B ) + m ( φ ) | A | , | B | f ( o i | B )] . Since f is γ -weakly submodular from below, Lemma 1 implies that this is at least X B ⊆ A − a i m ( φ ) | A | , | B | +1 [ γf ( o i | B + a i ) − (1 − γ ) f ( a i | B )] + m ( φ ) | A | , | B | f ( o i | B ) = P + Q, where: P = γ X B ⊆ A − a i h m ( φ ) | A | , | B | +1 f ( o i | B + a i ) + m ( φ ) | A | , | B | f ( o i | B ) i = γ X B ⊆ A m ( φ ) | A | , | B | f ( o i | B ) Q = (1 − γ ) X B ⊆ A − a i h m ( φ ) | A | , | B | f ( o i | B ) − m ( φ ) | A | , | B | +1 f ( a i | B ) i ≥ − (1 − γ ) X B ⊆ A − a i m ( φ ) | A | , | B | +1 f ( a i | B )Summing the resulting lower bound over each a i ∈ A we then have | A | X i =1 g φ ( o i | A − a i ) ≥ γ | A | X i =1 X B ⊆ A m ( φ ) | A | , | B | f ( o i | B ) − (1 − γ ) | A | X i =1 X B ⊆ A − a i m ( φ ) | A | , | B | +1 f ( a i | B ) . (9)12ince f is γ -weakly submodular from below and monotone, | A | X i =1 X B ⊆ A m ( φ ) | A | , | B | f ( o i | B ) = X B ⊆ A | A | X i =1 m ( φ ) | A | , | B | f ( o i | B ) ≥ X B ⊆ A m ( φ ) | A | , | B | γ [ f ( O ∪ B ) − f ( B )] ≥ γ X B ⊆ A m ( φ ) | A | , | B | [ f ( O ) − f ( B )] = γf ( O ) − γ X B ⊆ A m ( φ ) | A | , | B | f ( B ) , where the last equation follows from part 2 of Lemma 9. Similarly, since f is β -weaklysubmodular from above: | A | X i =1 X B ⊆ A − a i m ( φ ) | A | , | B | +1 ( f ( B + a i ) − f ( B )) = X T ⊆ A | A | X i =1 m ( φ ) | A | , | T | ( f ( T ) − f ( T − a i )) ≤ X T ⊆ A m ( φ ) | A | , | T | β ( f ( T ) − f ( ∅ )) = β X B ⊆ A m ( φ ) | A | , | B | f ( B )where the first equation can be verified by substituting B = T − a i for each a i ∈ T andnoting that | T | = | B | + 1. Using the two previous inequalities to bound the right-handside of (9), we have: | A | X i =1 g φ ( o i | A − a i ) ≥ γ f ( O ) − (cid:0) γ + β (1 − γ ) (cid:1) X B ⊆ A m ( φ ) | A | , | B | f ( B ) . Using the bound from the previous lemma, we can now complete the proof of The-orem 8.
Proof of Theorem 8.
Applying Lemma 10 to the last term in (8) and rearranging gives: | A | X i =1 g φ ( a i | A − a i )+ (cid:0) γ + β (1 − γ ) (cid:1) X B ⊆ A m ( φ ) | A | , | B | f ( B ) ≥ γ f ( O )+ | A | X i =1 [ g φ ( A ) − g φ ( A − a i + o i )] . (10)From part 1 of Lemma 9, | A | X i =1 g φ ( a i | A − a i ) = | A | X i =1 X B ⊆ A − a i m ( φ ) | A |− , | B | ( f ( B + a i ) − f ( B ))= X T ⊆ A | T | m ( φ ) | A |− , | T |− f ( T ) − ( | A | − | T | ) m ( φ ) | A |− , | T | f ( T ) , where the last equation follows from the fact that each T ⊆ A appears once as T = B + a i for each a i ∈ T —in which case it has coefficient m ( φ ) | A |− , | B | = m ( φ ) | A |− , | T |− —and once as T = B for each a i T —in which case it has coefficient m ( φ ) | A |− , | B | = m ( φ ) | A |− , | T | . Thus,we can rewrite (10) as: X B ⊆ A (cid:16) | B | m ( φ ) | A |− , | B |− − ( | A | − | B | ) m ( φ ) | A |− , | B | + (cid:0) γ + β (1 − γ ) (cid:1) m ( φ ) | A | , | B | (cid:17) f ( B ) ≥ γ f ( O ) + | A | X i =1 [ g φ ( A ) − g φ ( A − a i + o i )] . (11)13ince φ = γ + β (1 − γ ), the recurrence in part 4 of Lemma 9 implies that the left-handside vanishes for all B except B = ∅ , in which case it is φe φ − f ( ∅ ) = 0 or B = A , inwhich case it is φe φ e φ − f ( A ). The claimed theorem then follows. g φ Here we give further properties of the potential g φ ( A ) = Z φe φp e φ − X B ⊆ A p | B |− (1 − p ) | A |−| B | f ( B ) dp = X B ⊆ A m ( φ ) | A |− , | B |− f ( B )defined in Section 5. We recall that the coefficients m ( φ ) a,b for 0 ≤ b ≤ a are defined by m ( φ ) a,b = Z φe φp e φ − p b (1 − p ) a − b dp. If we consider a continuous distribution D φ on [0 ,
1] with density function: D φ ( x ) = φe φx e φ − m ( φ ) a,b = E p ∼D φ [ p a (1 − p ) b ]. Forconvenience, we will define m ( φ ) a,b = 0 if either a < b < h : R → R by h ( x ) = xe x e x − .Then, by Bernoulli’s inequality dhdx = e x ( e x − − x )( e x − ≥ h is an increasing function.Note that our algorithm’s claimed guarantee can then be expressed as γ h ( φ ( γ,β )) where φ ( γ, β ) = γ + β (1 − γ ). Finally, for k ∈ Z + , we let H k denote the k th harmonic number H k = P ki =1 /i = Θ(log k ). m ( φ ) a,b Here we provide a proof for the following properties m ( φ ) a,b used in Section 5. We give aproof of each claim in turn. Lemma 9.
For any φ > , the coefficients m ( φ ) a,b satisfy the following:1. g φ ( e | A ) = P B ⊆ A m ( φ ) | A | , | B | f ( e | B ) , for any A ⊆ X and e A .2. P ab =0 m ( φ ) a,b = 1 for all a ≥ .3. m ( φ ) a,b = m ( φ ) a +1 ,b +1 + m ( φ ) a +1 ,b for all ≤ b ≤ a .4. φm ( φ ) a,b = − bm ( φ ) a − ,b − + ( a − b ) m ( φ ) a − ,b + ( φ/ ( e φ − )) b =0 + ( φe φ / ( e φ − b = a forall a > and ≤ b ≤ a . laim 1. Note that by the definition of g φ : g φ ( e | A ) = X B ⊆ A + e m ( φ ) | A | , | B |− f ( B ) − X B ⊆ A m ( φ ) | A |− , | B |− f ( B )= X B ⊆ A h ( m ( φ ) | A | , | B |− − m ( φ ) | A |− , | B |− ) f ( B ) + m ( φ ) | A | , | B | f ( B + e ) i . It thus suffices to show ( m ( φ ) | A | , | B |− − m ( φ ) | A |− , | B |− ) f ( B ) = − m ( φ ) | A | , | B | f ( B ). For B = ∅ ,we have f ( ∅ ) = 0 and so ( m ( φ ) | A | , − − m ( φ ) | A |− , − ) f ( ∅ ) = 0 = − m ( φ ) | A | , f ( ∅ ). When | B | ≥ m ( φ ) | A | , | B |− − m ( φ ) | A |− , | B |− = E p ∼D φ h p | B |− (1 − p ) | A |−| B | +1 − p | B |− (1 − p ) | A |−| B | i = E p ∼D φ h − p | B | (1 − p ) | A |−| B | i = − m ( φ ) | A | , | B | . Claim 2.
By linearity of expectation: a X b =0 m ( φ ) a,b = a X b =0 E p ∼ D φ (cid:2) p b (1 − p ) a − b (cid:3) = 1 . Claim 3.
When 0 ≤ b ≤ a , the definition of m ( φ ) a,b immediately gives: m ( φ ) a,b = E p ∼ D φ p b (1 − p ) a − b = E p ∼ P [ p b (1 − p ) a − b p + p b (1 − p ) a − b (1 − p )] = m ( φ ) a +1 ,b +1 + m ( φ ) a +1 ,b . Claim 4.
For a > b ≤ a , integration by parts gives: m ( φ ) a,b = Z D φ ( p ) p b (1 − p ) a − b dp = D φ ( p ) φ p b (1 − p ) a − b (cid:12)(cid:12)(cid:12)(cid:12) p =1 p =0 − Z D φ ( p ) φ (cid:0) bp b − (1 − p ) a − b − ( a − b ) p b (1 − p ) a − b − (cid:1) dp . Which is equivalent to: φm ( φ ) a,b = − bm ( φ ) a − ,b − + ( a − b ) m ( φ ) a − ,b + D φ ( p ) p b (1 − p ) a − b (cid:12)(cid:12)(cid:12) p =1 p =0 . This holds by the definition of m ( φ ) a,b when b >
0. When b = 0 it follows from − bm ( φ ) a − ,b − = 0 and bp b − (1 − p ) a − b = 0. To complete the claim, we note thatlim p → + D φ ( p ) p b (1 − p ) a − b is D φ (0) = φ/ ( e φ −
1) if b = 0 and 0 if b >
0, andlim p → − D φ ( p ) p b (1 − p ) a − b is D φ (1) = φe φ / ( e φ −
1) if a = b , and 0 if 0 ≤ b < a . g Here we show that the value of g φ ( A ) in terms of f ( A ) for any set A . In the analysisof [17], this follows from submodularity of g , which is inherited from the submodularity15f f . Here, we must again adopt a different approach. We begin by proving the followingclaim. Fix some set A ⊆ X and for all 0 ≤ j ≤ | A | define F j = P B ∈ ( Aj ) f ( B ) as thetotal value of all subsets of A of size j . Note that since we suppose f is normalized, F = f ( ∅ ) = 0. Lemma 11. If f is γ -weakly submodular from below, then F i ≥ (cid:0) | A |− i − (cid:1) γf ( A ) for all ≤ i ≤ | A | .Proof. Let k = | A | . Since f is γ -weakly submodular, P e ∈ A \ B ( f ( B + e ) − f ( B )) ≥ γ ( f ( A ) − f ( B )) for all B ⊆ A . Rearranging this we have X e ∈ A \ B f ( B + e ) ≥ γf ( A ) + ( | A | − | B | − γ ) f ( B ) ≥ γf ( A ) + ( | A | − | B | − f ( B ) , (12)for all B ⊆ A . Summing (12) over all (cid:0) kj (cid:1) possible subsets B of size j , we obtain( j + 1) F j +1 ≥ γ (cid:0) kj (cid:1) f ( A ) + ( k − j − F j , (13)since each set T of size j + 1 appears once as B + e on the left-hand side of (12) foreach of the j + 1 distinct choices of e ∈ T with B = T − e .We now show that F i ≥ (cid:0) k − i − (cid:1) γf ( A ) for all 1 ≤ i ≤ k . The proof is by inductionon i . For i = 1, the claim follows immediately from (13) with j = 0, since then (cid:0) kj (cid:1) = 1 = (cid:0) k − i − (cid:1) and ( k − j − F j = ( k − F = 0. For the induction step, (13) andthe induction hypothesis imply: F i +1 ≥ i +1 (cid:16) γ (cid:0) ki (cid:1) f ( A ) + ( k − i − F i (cid:17) ≥ i +1 (cid:16) γ (cid:0) ki (cid:1) f ( A ) + ( k − i − γ (cid:0) k − i − (cid:1) f ( A ) (cid:17) = γi +1 (cid:16) ki (cid:0) k − i − (cid:1) f ( A ) + ( k − i − (cid:0) k − i − (cid:1) f ( A ) (cid:17) = γi +1 (cid:16) k + ki − i − ii (cid:17) (cid:0) k − i − (cid:1) f ( A )= γi +1 k ( i +1) − i ( i +1) i (cid:0) k − i − (cid:1) f ( A ) = γ k − ii (cid:0) k − i − (cid:1) f ( A ) = γ (cid:0) k − i (cid:1) f ( A ) . Using the above claim, we now bound the value of g φ ( A ) for any set A . Lemma 12. If f is γ -weakly submodular, then for all A ⊆ X , γf ( A ) ≤ g φ ( A ) ≤ h ( φ ) H | A | f ( A ) .Proof. Let k = | A | . We begin with the lower bound for g φ ( A ). By the definition of thecoefficients m ( φ ) a,b and Lemma 11: g φ ( A ) = X B ⊆ A E p ∼D φ h p | B |− (1 − p ) | A |−| B | i f ( B ) = E p ∼D φ (cid:20) k X i =1 p i − (1 − p ) k − i F i (cid:21) ≥ E p ∼D φ (cid:20) k X i =1 γp i − (1 − p ) k − i (cid:0) k − i − (cid:1) f ( A ) (cid:21) = E p ∼D φ (cid:20) γf ( A ) k − X i =0 (cid:0) k − i (cid:1) p i (1 − p ) k − i − (cid:21) = E p ∼D φ [ γf ( A )] = γf ( A ) . g φ ( A ) = E p ∼D φ (cid:20) k X i =1 p i − (1 − p ) k − i F i (cid:21) ≤ E p ∼D φ (cid:20) k X i =1 p i − (1 − p ) k − i F k (cid:21) = Z φe φp e φ − P ki =1 (cid:0) ki (cid:1) p i (1 − p ) k − i f ( A ) p dp = Z φe φp e φ − − (1 − p ) k p f ( A ) dp ≤ φe φ e φ − Z − (1 − p ) k p f ( A ) dp = h ( φ ) Z k − X j =0 (1 − p ) j f ( A ) dp = h ( φ ) k − X j =0 j + 1 f ( A ) = h ( φ ) H | A | f ( A ) , where the first inequality follows from monotonicity of f and the second inequality from − (1 − p ) k p > p ∈ (0 ,
1] and that h ( φ ) = φe φ e φ − is an increasing function of p . g φ to φ The following lemma shows that small changes in the parameter φ produce relativelysmall changes in the value g φ ( A ) for any set A . Lemma 13.
For all φ , ǫ > , and S ⊆ X ,1. g φ (1 − ǫ ) ( S ) ≥ e − φǫ g φ ( S ) h ( φ ) ≤ e φǫ h ( φ (1 − ǫ ))Since algorithm 3 runs multiple local-searches with different guesses of φ , the lemmais key to initialize the next run with the solution returned by the previous local-searchwhich had a slightly perturbed potential. Proof.
Both claims will follow from the inequality φ (1 − ǫ ) e φ (1 − ǫ ) p e φ (1 − ǫ ) − ≥ e − φǫ φe φp e φ − , (14)which we show is valid for all p ∈ [0 ,
1] and ǫ >
0. Indeed, under these assumptions, φ (1 − ǫ ) e φ (1 − ǫ ) p e φ (1 − ǫ ) − · e φ − φe φp = (1 − ǫ ) e − φǫp e φ − e φ e − φǫ − − ǫ ) e − φǫp e φ − e φ (1 + ( e − φ − ǫ − ≥ (1 − ǫ ) e − φǫp e φ − e φ (1 + ǫ ( e − φ − − − ǫ ) e − φǫp e φ − − ǫ )( e φ −
1) = e − φǫp ≥ e − φǫ . Here the first inequality follows from the generalized Bernoulli inequality (1 + x ) t ≤ (1 + tx ), which holds for all x ≥ − ≤ t ≤
1, and the second inequality followsfrom p ∈ [0 , g φ (1 − ǫ ) ( A ) = Z φ (1 − ǫ ) e φ (1 − ǫ ) p e φ (1 − ǫ ) p − X B ⊆ A p | B |− (1 − p ) | A |−| B | f ( B ) dp ≥ Z e − φǫ φe φp e φ − X B ⊆ A p | B |− (1 − p ) | A |−| B | f ( B ) dp = e − φǫ g φ ( A ) , as required. For the second claim, setting p = 1 in (14) gives h ( φ (1 − ǫ )) ≥ e − φǫ h ( φ )or, equivalently, h ( φ ) ≤ e φǫ h ( φ (1 − ǫ )). g φ via sampling The definition of g φ requires evaluating f ( B ) on all B ⊆ A , which requires n | A | calls tothe value oracle for f . In this section, we show that efficiently estimate g φ using only apolynomial number of value queries to f . Our sampling procedure is based on the samegeneral ideas described in [17], but here we focus on evaluating only the marginals of g φ , which results in a considerably simpler implementation. In particular, our algorithmdoes not require computation of the coefficients m ( φ ) a,b . Lemma 14.
For any φ , N , there is a procedure for estimating g φ ( e | A ) using N randomvalue oracle queries to f , such that for any δ > , P [ | g ( e | A ) − ˜ g ( e | A ) | ≥ δf ( A + e ) ] < e − δ N . Proof.
We consider the following 2-step procedure given as an interpretation of g in [17]:we first sample p ∼ D φ , then construct a random B ⊆ A by taking each element of A independently with probability p . The probability that any given B ⊆ A is selected bythe procedure is then precisely Z φe φp e φ − p | B | (1 − p ) | A | dp = m | A | , | B | . Thus, if we sample a random ˜ B in this fashion, E [ f ( e | ˜ B )] = P B ⊆ A m | A | , | B | f ( e | B ) = g ( e | A ), by part 1 of Lemma 9. We remark that for the particular distributions D φ we consider, the first step of the procedure can easily by implemented with inversetransform sampling.Suppose now that we draw N independent random samples { B i } Ni =1 in this fashion,and define the random variables Y i = g ( e | A ) − f ( e | B i ) f ( A + e ) . Then, E [ Y i ] = 0 for all i . Moreover,by monotonicity of f , 0 ≤ f ( e | B ) ≤ f ( B + e ) ≤ f ( A + e ) for all B ⊆ A and so also0 ≤ g ( e | B ) ≤ f ( A + e ). Thus, | Y i | ∈ [0 ,
1] for all i .Let ˜ g φ ( e | A ) = N P Ni =1 f ( e | B i ). Applying the Chernoff bound (Lemma 21), for any δ > P [ | g ( e | A ) − ˜ g ( e | A ) | ≥ δf ( A + e ) ] ≤ P hP Ni =1 Y i > δN i < e − δ N . A randomized, polynomial time distorted local-search algorithm
Our final algorithm is shown in Algorithm 3. Before presenting it in detail, we describe the main concerns involved in its formulation.
We initialize the algorithm with a solution S by using the residual random greedyalgorithm of [6]. Their analysis shows that the expected value of the solution producedby the algorithm is at least γ − ) f ( O ), where O is an optimal solution to the problem.Here, however, we will require a guarantee that holds with high probability. This is easilyensured by independently running the residual random greedy algorithm a sufficientnumber of times and taking the best solution found.Formally, suppose we set ε ′ = min( ε, ) and run the residual random greedyalgorithm G = log( n )2 ǫ ′ = ˜ O ( ǫ − ) times independently. For each 1 ≤ l ≤ G , let T l bethe solution produced by the l th instance of the residual random greedy algorithm.Define the random variables Z l = f ( S l ) f ( O ) − γ − ) , where O ∈ I is the optimal solution.Then, E [ Z l ] = 0 and | Z l | ≤ l . Let S = arg max { f ( T l ) : 1 ≤ l ≤ G } . Then, bythe Chernoff bound, we then have P (cid:2) f ( S ) < (cid:0) (1 + γ − ) − − ǫ ′ (cid:1) f ( O ) (cid:3) ≤ P h G P Gl =1 f ( T l ) < (cid:0) (1 + γ − ) − − ǫ ′ (cid:1) f ( O ) i = P hP Gl =1 Z l > Gǫ ′ i < e − ǫ ′ G . Thus, with probability at least 1 − n = 1 − o (1), f ( S ) ≥ (cid:16) γ − ) − ǫ ′ (cid:17) f ( O ). φ In Theorem 8, we considered a ( γ, β )-weakly submodular function f , and used thepotential g φ with φ = φ ( γ, β ) = γ + β (1 − γ ) to guide the search. In general, however,the values of γ and β may not be known in advance. One approach to coping with thiswould be to make an appropriate series of guesses for each of the values, then run ourthe algorithm for each guess and return the best solution obtained.Here we describe an alternative and more efficient approach: we guess the value of φ ( γ, β ) directly from an appropriate range of values. Moreover, for each subsequentguess, we initialize the algorithm using the solution produced by the previous guess.Combined with the bounds from Lemmas 15 and 13, this allows us to amortize the timespent searching for improvements over all guesses.In the next lemma, we show that if γ or φ ( γ, β ) is very small, then guarantee for therandom residual greedy algorithm is stronger than that required by our analysis (andso S is already a good solution). This will allow us to bound the range of values forboth φ and γ that we must consider in our algorithm. Lemma 15.
For all γ ∈ (0 , and β ≥ , φ ( γ, β ) ≥ . Moreover, if φ ( γ, β ) > or γ < , then γ − ) > γ (1 − e − φ ( γ,β ) ) φ ( γ,β ) . roof. First, we show that φ ( γ, β ) ≥ / γ ∈ (0 ,
1] and β ≥
1. Notethat ∂φ∂β = 1 − γ ≥
0, for all γ ∈ [0 , φ ( γ, β ) sets β = 1.Moreover, ∂φ∂γ = 2 γ − β and ∂ φ∂γ = 2 so a φ ( γ, β ) is minimized by γ = β = . It followsthat φ ( γ, β ) ≥ φ (cid:0) , (cid:1) = for all γ ∈ [0 ,
1] and β ≥ φ ( γ, β ) >
4. Then, the claim follows, since γ (1 − e − φ ( γ,β ) ) φ ( γ, β ) < γ ≤ γ (1 + γ ) = 1(1 + γ − ) . It remains to consider the case in which γ < . Recall that h ( x ) , xe x e x − is an increasingand so h ( φ ( γ, β )) ≥ h ( ) > (where the last inequality follows directly by computationof h ( ) > . γ < . Then, γ (1 − e − φ ( γ,β ) ) φ ( γ, β ) = γ h ( φ ( γ, β )) < γ . Comparing the previous estimation to the approximation ratio of Chen et al. [6] andusing that γ < /
7, we have γ (1 + γ − ) − = 34 ( γ + 1) < (cid:18) (cid:19) < . Thus, γ < γ − ) and again the claim follows.Lemma 15 shows that it suffices to consider φ ( γ, β ) ∈ [3 / ,
4] and γ > /
7, sinceotherwise the starting solution already satisfies the claimed guarantee. Thus, our algo-rithm considers a geometrically decreasing sequence of guesses for the value φ ∈ [3 / , φ j = 4(1 − ε ) j , where 0 ≤ j ≤ ⌈ log − ε ⌉ . For each subsequent, we ini-tialize S with the solution produced by the previous guess (or, in the first iteration,the solution S produced by the residual random greedy algorithm). We then repeat-edly search for single element swaps that significantly improve the potential g φ ( S ).Specifically, we will exchange an element a S with an element b ∈ S whenever˜ g φ j ( a | S b ) > ˜ g φ j ( b | S − b ) + ∆ f ( S ), where ˜ g φ j ( ·| S − b ) is an estimate of g φ j ( ·| S − b )computed using N samples as described in Section 6.4 and ∆ is an appropriately chosenparameter. We show that by setting N appropriately, we can ensure that with highprobability an approximate local optimum of every g φ is reached after at most sometotal number M of improvements across all guesses. Our final algorithm is shown in Algorithm 3. Let M = ( X, I ) be a matroid, and f : 2 X → R ≥ be a ( γ, β )-weakly submodular function. Given some 0 < ǫ ≤ We remark that the use of the residual random greedy algorithm is not strictly necessary for ourresults. One can instead initialize the algorithm with a base containing the best singleton as in thestandard local search procedure to obtain a guarantee of γ/k for the initial solution. The remainingarguments can then be modified at the cost of a larger running time dependence on the parameter k . lgorithm 3: Distorted Local Search ImplementationLet ∆ = ǫk , δ = ∆4 h (4) · H k = ǫ h (4) · H k k , M = (1 + δ − )(37 + ln( H k )), N = 28 δ − ln( M kn ), G = log( n ) / (2 min( ε, ) ); S ← the best output produced by G independent runs of the RandomResidual Greedy algorithm applied to f and M ; S max ← S ; i ← for ≤ j ≤ ⌈ log − ǫ / ⌉ do φ ← − ǫ ) j ; S ← S j ; repeat isLocalOpt ← true ; foreach b ∈ S and a ∈ X \ S with S − b + a ∈ I do Compute ˜ g φ j ( a | S − b ) and ˜ g φ j ( b | S − b ) using N random samples; if ˜ g φ j ( a | S − b ) > ˜ g φ j ( b | S − b ) + ∆ f ( S ) then S ← S − b + a ; i ← i + 1;isLocalOpt ← false ; breakuntil isLocalOpt or i ≥ M ; S j +1 ← S ; if f ( S j +1 ) > f ( S max ) then S max ← S j +1 ; return S max the parameters:∆ = ǫk (threshold for accepting improvements) δ = ∆4 h (4) · H k = ǫ h (4) · H k k = Θ( ǫ/ ( k log k )) (bound on sampling accuracy) L = 1 + ⌈ log − ε ⌉ = O ( ε − ) (number of guesses for φ ) M = log δ (7 · · e Lǫ h (4) H k ) = ˜ O ( δ − ) = ˜ O ( kε − )(total number of improvements) N = 4 · δ − ln( M kn ) = ˜ O ( δ − ) = ˜ O ( k ε − ) (number of samples to estimate g φ )In Algorithm 3, we evaluate potential improvements using an estimate ˜ g φ j ( ·| S − b )for the marginals of g that is computed using N samples. By Lemma 14, we thenhave | ˜ g φ j ( e | A ) − g φ j ( e | A ) | ≤ γδf ( A + e ) for any A, e considered by the algorithm withprobability at least 1 − e − δ γ N . If γ ≥ /
7, this is at least 1 − e − δ · N = 1 − Mkn ) .In our algorithm we will limit the total number of improvements made across all guessesfor φ to be at most M . Note that any improvement can be found by testing at most kn marginal values, so we must estimate at most M kn marginal values across thealgorithm. By a union bound, we then have | ˜ g φ j ( e | A ) − g φ j ( e | A ) | ≤ γδf ( A + e ) for all A, e considered by Algorithm 3 with probability at least 1 − o (1) whenever γ ≥ / S after making M improvements, we must in fact have an optimal solution with high21robability. Lemma 16.
Suppose that γ ≥ / . Then, if Algorithm 3 makes M improvements, thenthe set S it returns satisfies f ( S ) ≥ f ( O ) with probability − o (1) .Proof. With probability 1 − o (1) we have (cid:12)(cid:12) ˜ g φ j ( e | A ) − g φ j ( e | A ) (cid:12)(cid:12) ≤ γδf ( A + e ) for any e, A considered by Algorithm 3. Whenever the algorithm exchanges some a ∈ X \ S for b ∈ S for some guess φ j , we have ˜ g φ j ( a | S − b ) − ˜ g φ j ( b | S − b ) ≥ ∆ f ( S ) and so g φ j ( S − b + a ) − g φ j ( S ) = g φ j ( a | S − b ) − g φ j ( b | S − b ) ≥ ˜ g φ j ( a | S − b ) − δγf ( S − b + a ) − ˜ g φ j ( b | S − b ) − δγf ( S ) ≥ ˜ g φ j ( a | S − b ) − δg φ j ( S − b + a ) − ˜ g φ j ( b | S − b ) − δg φ j ( S ) ≥ ∆ f ( S ) − δg φ j ( S − b + a ) − δg φ j ( S ) , where the second inequality follows from the lower bound in Lemma (12). Rearrangingand using the upper bound on g φ j ( S ) from Lemma 12, together with the definition of δ and ∆, we obtain: g φ j ( S − b + a ) ≥ ∆ f ( S ) + (1 − δ ) g φ j ( S )1 + δ ≥ ǫk h ( φ j ) · H k + 1 − δ δ g φ j ( S ) ≥ ǫk h (4) · H k + 1 − δ δ g φ j ( S ) = 1 + 3 δ δ g φ j ( S ) ≥ (1 + δ ) g φ j ( S ) , where the last inequality follows from x x ≥ (1+ x ) x for all 0 ≤ x ≤ f ( S ) ≥ ((1 + γ − ) − − ε ′ ) f ( O ), which we have shown also occurswith high probability 1 − o (1). Then, since γ ≥ and ε ′ = min( , ε ), we have f ( S ) ≥ f ( O ). Suppose that i = M when the algorithm is considering some guess φ l . We consider how the current value of g φ j ( S ) changes throughout Algorithm 3, bothas improvements are made and as j increases. As we have just shown, each of our M improvements increases this value by a factor of (1 + δ ). Moreover, as shown inLemma 13, g φ j ( S ) = g (1 − ǫ ) φ j − ( S ) ≥ e − φ j − ǫ g φ j − ( S ) ≥ e − ǫ g φ j − ( S ) , for any set S . Thus, each time j is incremented, the value g φ j ( S ) decreases by a factorof at most e ǫ . Since we made M improvements, we then have: g φ ℓ ( S ℓ +1 ) ≥ (1 + δ ) M e − ℓǫ g φ ( S ) ≥ (1 + δ ) M e − ℓǫ γf ( S ) ≥ (1 + δ ) M e − ℓǫ
17 1128 f ( O ) , where the second inequality follows from the lower bound on g given in Lemma 12, andthe second from γ ≥ . The upper bound on g given by Lemma 12, together withthe fact that h is increasing and φ ℓ ≤
4, implies that: g φ ℓ ( S ℓ +1 ) ≤ h ( φ ℓ ) H k f ( S ℓ +1 ) ≤ h (4) H k f ( S ℓ +1 ). Thus, f ( S ℓ +1 ) ≥ (1 + δ ) M e − ℓǫ · · h (4) H k f ( O ) . Since ℓ ≤ L and M = log δ ( e Lǫ · · h (4) H k ), the set S max returned by the algorithmthus has f ( S max ) ≥ f ( S ℓ +1 ) ≥ f ( O ), as claimed.22 heorem 17. Let M = ( X, I ) be a matroid, f : 2 X → R ≥ be a ( γ, β ) -weakly submodu-lar function, and ǫ > . Then, with probability − o (1) , the set S returned by Algorithm 3satisfies f ( S ) ≥ γ (1 − e − φ ( γ,β ) ) φ ( γ,β ) f ( O ) for any solution O ∈ I , where φ ( γ, β ) = γ + β (1 − γ ) .The algorithm runs in time ˜ O ( nk ε − ) .Proof. We have shown that f ( S ) ≥ (cid:0) (1 + γ − ) − − ε ′ (cid:1) f ( O ) with probability 1 − o (1),where ε ′ = min( ε, ). If γ < / φ ( γ, β ) [3 / , γ − ) − > γ (1 − e − φ ( γ,β ) ) φ ( γ,β ) , and so the claim follows as f ( S max ) ≥ f ( S ). Thus, we supposethat γ ≥ / φ ( γ, β ) ∈ [3 / , M improvements,Lemma 16 implies that the set returned by the algorithm is optimal with probability atleast 1 − o (1).In the remaining case, we have φ ( γ, β ) ∈ [3 / , γ ≥ /
7, and each set S j +1 produced by the algorithm must have ˜ g φ j ( o l | S j +1 − s l ) ≤ ˜ g ˜ φ j ( s l | S j +1 − s l ) + ∆ f ( S )for every s l ∈ S and o l ∈ O . Since γ ≥ /
7, with probability 1 − o (1), we have (cid:12)(cid:12) ˜ g φ j ( e | A ) − g φ j ( e | A ) (cid:12)(cid:12) ≤ γδf ( A + e ) for all sets e, A and all φ j considered by the algo-rithm. Thus, g φ j ( S j +1 − s l + o l ) − g φ j ( S j +1 )= g φ j ( o l | S j +1 − s l ) − g φ j ( s l | S j +1 − s l ) ≤ ˜ g φ j ( o l | S j +1 − s l ) + δγf ( S j +1 − s l + o l ) − ˜ g φ j ( s l | S j +1 − s l ) + δγf ( S j +1 ) ≤ ∆ f ( S j +1 ) + δγf ( S j +1 ) + δγf ( S j +1 − s l + o l ) ≤ (∆ + 2 δ ) f ( O )= O (cid:0) ǫk (cid:1) · f ( O ) . Consider the smallest j such that φ j +1 , − ε ) j +1 < φ ( γ, β ). Then, φ j +1 < φ ( γ, β ) ≤ φ j +1 / (1 − ε ) , φ j . Let ˜ β = φ j − γ − γ . Then, φ ( γ, ˜ β ) = γ + φ j − γ − γ (1 − γ ) = φ j and˜ β ≥ φ ( γ,β ) − γ − γ = β , so f is also ( γ, ˜ β )-weakly submodular and so Theorem 8 implies: f ( S j +1 ) ≥ γ h ( φ j ) f ( O ) + k X i =1 (cid:2) g φ j ( S ) − g φ j ( S − s l + o l ) (cid:3) ≥ (cid:18) γ h ( φ j ) − O ( ǫ ) (cid:19) f ( O )By Lemma 13 part 2, our choice of j , and φ j ≤ h ( φ j ) ≤ e φ j ε h ((1 − ε ) φ j ) ≤ e φ j ε h ( φ ( γ, β )) ≤ h ( φ ( γ, β ))1 − φ j ε ≤ h ( φ ( γ, β ))1 − ε . Thus, f ( S j +1 ) ≥ (cid:16) γ h ( φ ( γ,β )) − O ( ε ) (cid:17) f ( O ) = (cid:16) γ (1 − e − φ ( γ,β ) ) φ ( γ,β ) − O ( ε ) (cid:17) f ( O ).The running time of the algorithm is dominated by the number of value oracle queriesmade to f . The initialization requires running the residual random greedy ˜ O ( ε − ) times,which requires O ( nk ) value queries. We make at most M = ˜ O ( ε − k ) improvements,each requiring at most N nk = ˜ O ( nk ε − ) value queries. Altogether the running timeis thus at most ˜ O ( nk ε − ). We make use of the following basic results related to matrix inverses (see e.g. [34,Section A.3]) 23 emma 18 (Block Matrix Inverse) . Let B and A be matrices of dimension k × k and h × h , respectively, and let U and V be matrices of dimension k × h and h × k respectively.Then, B UV A − = B − + B − U SV B − − B − U S − SV B − S where S = ( A − V B − U ) − is the Schur complement of B . Lemma 19 (Sherman-Morrisson-Woodbury formula) . Let
We make use of the following basic results related to matrix inverses (see e.g. [34, Section A.3]).

Lemma 18 (Block Matrix Inverse). Let B and A be matrices of dimension k × k and h × h, respectively, and let U and V be matrices of dimension k × h and h × k, respectively. Then,

[ B  U ]^{−1}     [ B^{−1} + B^{−1}U S V B^{−1}    −B^{−1}U S ]
[ V  A ]       =  [ −S V B^{−1}                      S        ]

where S = (A − V B^{−1}U)^{−1} is the Schur complement of B.

Lemma 19 (Sherman–Morrison–Woodbury formula). Let A, U, C, V be matrices of conformable sizes. Then,

(A + UCV)^{−1} = A^{−1} − A^{−1}U(C^{−1} + V A^{−1}U)^{−1}V A^{−1}.
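As a quick numerical sanity check of Lemma 19 (ours, not from the paper), the identity can be verified directly with NumPy on a random instance; the proof of Theorem 23 below applies it with a single column x_i, i.e. with a rank-one update.

```python
import numpy as np

rng = np.random.default_rng(0)
p, r = 5, 2
A = np.diag(rng.uniform(1.0, 2.0, size=p))    # invertible p x p matrix
C = np.diag(rng.uniform(1.0, 2.0, size=r))    # invertible r x r matrix
U = rng.normal(size=(p, r))
V = rng.normal(size=(r, p))

lhs = np.linalg.inv(A + U @ C @ V)
A_inv = np.linalg.inv(A)
rhs = A_inv - A_inv @ U @ np.linalg.inv(np.linalg.inv(C) + V @ A_inv @ U) @ V @ A_inv
assert np.allclose(lhs, rhs)                  # the Woodbury identity holds
```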
We also make use of the following bound on the eigenvalues of normalized covariance matrices, given in [11, Lemmas 2.5 and 2.6]:

Lemma 20.
Let L and S = {X_1, X_2, . . . , X_n} be two disjoint sets of zero-mean random variables, each of which has variance at most 1. Let C be the covariance matrix of the set L ∪ S. Let C_ρ be the covariance matrix of the set {Res(X_1, L), . . . , Res(X_n, L)} after normalization of the random variables to have unit variance. Then λ_min(C_ρ) ≥ λ_min(C).

Finally, we use the following form of the Chernoff bound, given in [1, Theorem A.1.16]:
Lemma 21 (Chernoff Bound). Let X_i, 1 ≤ i ≤ n, be mutually independent random variables with E[X_i] = 0 and |X_i| ≤ 1 for all i. Set S = X_1 + · · · + X_n. Then for any a > 0, Pr[S > a] < e^{−a²/2n}.

In Bayesian linear regression, we suppose the data is generated by a linear model y = X^T θ + ε, where y ∈ R^n, X ∈ R^{p×n}, and ε ∼ N(0, σ²I), where I is the identity matrix. Here, X = [x_1 x_2 · · · x_n], where each x_i ∈ R^p is a data vector, and y is the vector of corresponding observations of the response variable. The variable ε represents Gaussian noise with mean 0 and variance σ². When the number of columns n (i.e., the number of potential observations) is very large, experimental design focuses on selecting a small subset S ⊂ {1, 2, . . . , n} of columns of X so as to maximally reduce the variance of the estimator of θ. Let X_S (respectively, y_S) be the matrix X (respectively, the vector y) restricted to the columns (respectively, rows) indexed by S. From classical statistical theory, the optimal choice of parameters for any such S is given by θ̂_S = (X_S X_S^T)^{−1} X_S y_S and satisfies Var(θ̂_S) = σ²(X_S X_S^T)^{−1}. Because the variance of θ̂_S is a matrix, there is no single universal function that one tries to minimize to find the appropriate set S. Instead, there are multiple objective functions, depending on the context, leading to different optimality criteria. As in [28, 3, 21], we consider the A-optimal design objective. We suppose our prior probability distribution is θ ∼ N(0, Λ). We start by stating a standard result from Bayesian linear regression.
Lemma 22. Given the previous assumptions and the prior θ ∼ N(0, Λ), the posterior distribution of θ is normal: p(θ | y_S) ∼ N(σ^{−2} M_S^{−1} X_S y_S, M_S^{−1}), where M_S^{−1} = (σ^{−2} X_S X_S^T + Λ^{−1})^{−1}.

In A-optimal design, our objective function seeks to reduce the variance of the posterior distribution of θ by reducing the trace of M_S^{−1}, i.e., the sum of the variances of the regression coefficients. Mathematically, we seek to maximize the following objective function:

F(S) = tr(Λ) − tr(M_S^{−1}) = tr(Λ) − tr((Λ^{−1} + σ^{−2} X_S X_S^T)^{−1}).   (15)

The function F is not submodular, as shown in [28]. The current tightest estimate of the lower weak-submodularity ratio of F is due to Harshaw et al. [21], who show that γ ≥ (1 + s²σ^{−2}λ_max(Λ))^{−1}, where s = max_{i∈[n]} ‖x_i‖. Here we give a bound on the upper weak-submodularity ratio β.
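To make the objective concrete, the following sketch (ours; the synthetic instance and variable names are illustrative) computes F(S) from (15), together with the marginal gain of adding one more observation.

```python
import numpy as np

rng = np.random.default_rng(1)
p, n, sigma2 = 4, 30, 0.5
X = rng.normal(size=(p, n))                   # columns x_i are the candidate observations
Lam = np.diag(rng.uniform(0.5, 2.0, size=p))  # prior covariance of theta ~ N(0, Lam)

def F(S):
    """A-optimal objective F(S) = tr(Lam) - tr((Lam^{-1} + sigma^{-2} X_S X_S^T)^{-1}), as in (15)."""
    XS = X[:, sorted(S)]
    M_S = np.linalg.inv(Lam) + XS @ XS.T / sigma2
    return np.trace(Lam) - np.trace(np.linalg.inv(M_S))

S = {0, 3, 7}
print(F(S), F(S | {12}) - F(S))               # current value and marginal gain of column 12
```

The matrix inverted in the last line of F is exactly the posterior covariance M_S^{−1} of Lemma 22, so maximizing F(S) amounts to minimizing the total posterior variance of the regression coefficients.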
Theorem 23. Assume a prior distribution θ ∼ N(0, Λ), and let s = max_{i∈[n]} ‖x_i‖. The function F is (1/c, c)-weakly submodular with c = 1 + s²σ^{−2}λ_max(Λ).

Observe that, as for the R² objective, our upper bound for β is the inverse of the lower bound for γ.
Proof. The lower bound on γ is shown in [21]. It remains to prove the upper bound on β. Let B be some set of observations and A ⊆ B with k = |A|, and for convenience define T = B ∖ A. Since M_A = M_B − σ^{−2} X_T X_T^T, the matrix inversion lemma (Lemma 19) gives

F(B) − F(A) = tr(M_A^{−1}) − tr(M_B^{−1})
            = tr((M_B − σ^{−2} X_T X_T^T)^{−1}) − tr(M_B^{−1})
            = tr(M_B^{−1} + M_B^{−1} X_T (σ²I − X_T^T M_B^{−1} X_T)^{−1} X_T^T M_B^{−1}) − tr(M_B^{−1})
            = tr(M_B^{−1} X_T (σ²I − X_T^T M_B^{−1} X_T)^{−1} X_T^T M_B^{−1})
            = tr((σ²I − X_T^T M_B^{−1} X_T)^{−1} X_T^T M_B^{−2} X_T).   (16)

The third equality uses the linearity of the trace, while the last equality uses the cyclic property of the trace. We use the previous equation to derive an upper bound for the numerator and a lower bound for the denominator of the submodularity ratio, respectively. Applying (16) with A = B ∖ {i} (and so T = {i}) we obtain

F(B) − F(B − i) = tr(x_i^T M_B^{−2} x_i) / (σ² − x_i^T M_B^{−1} x_i).

Let ⪯ be the Loewner ordering of positive semidefinite matrices, where A ⪯ B if and only if B − A ⪰ 0. Note that Λ^{−1} ⪯ M_R for any set R, which implies that Λ ⪰ M_R^{−1}. Using the Sherman–Morrison–Woodbury formula (Lemma 19) a second time, together with the previous observation, we get

(σ² − x_i^T M_B^{−1} x_i)^{−1} = σ^{−2} + σ^{−4} x_i^T (M_B − σ^{−2} x_i x_i^T)^{−1} x_i
                               = σ^{−2} + σ^{−4} x_i^T M_{B∖{i}}^{−1} x_i
                               ≤ σ^{−2} + σ^{−4} x_i^T Λ x_i
                               ≤ σ^{−2} + σ^{−4} λ_max(Λ) s²,

where s = max_i ‖x_i‖ and the last inequality follows by the Courant–Fischer min–max theorem. Summing over all i ∈ T = B ∖ A and using the linearity of the trace, we have

Σ_{i∈T} F(i | B − i) = Σ_{i∈T} tr(x_i^T M_B^{−2} x_i) / (σ² − x_i^T M_B^{−1} x_i)
                     ≤ (σ^{−2} + s²σ^{−4}λ_max(Λ)) Σ_{i∈T} tr(x_i^T M_B^{−2} x_i)
                     = (σ^{−2} + s²σ^{−4}λ_max(Λ)) tr(X_T^T M_B^{−2} X_T).   (17)

Returning to the expression for F(B) − F(A), we note that M_B is positive definite, which implies that M_B^{−1} is positive definite. This in turn implies that 0 ⪯ X_T^T M_B^{−1} X_T, and hence σ²I − X_T^T M_B^{−1} X_T ⪯ σ²I. Thus, (σ²I − X_T^T M_B^{−1} X_T)^{−1} ⪰ σ^{−2}I ≻ 0. Therefore,

tr((σ²I − X_T^T M_B^{−1} X_T)^{−1} X_T^T M_B^{−2} X_T) ≥ tr(σ^{−2} X_T^T M_B^{−2} X_T) = σ^{−2} tr(X_T^T M_B^{−2} X_T).

Combining this with the bound (17), we have:

Σ_{i∈T} F(i | B − i) / (F(B) − F(A)) ≤ (σ^{−2} + s²σ^{−4}λ_max(Λ)) tr(X_T^T M_B^{−2} X_T) / (σ^{−2} tr(X_T^T M_B^{−2} X_T)) = 1 + s²σ^{−2}λ_max(Λ).

Recalling that T = B ∖ A, this completes the proof.

How large can β be?

We have shown that the R² objective (Section 3) and the A-optimal design objective for Bayesian linear regression (Section 8.1) are (1/c, c)-weakly submodular for some parameter c. A natural question to ask is whether, given γ > 0, there is a small non-trivial bound for β that is independent of the size of the ground set. Here we show that this is not true in general.
Theorem 24. For any γ > 0 and k > 1, there exists a function on a ground set of size k that is γ-weakly submodular from below but not β-weakly submodular from above for any β < C(k − γ, k − 1) = Θ(k^{1−γ}), where C(k − γ, k − 1) denotes the (generalized) binomial coefficient.

Note that as γ → 0 the bound β → k, while as γ → 1 it tends to 1. We conjecture that this bound on β is in fact the best achievable.

The intuition behind the construction is simple. We build a set function recursively with lower submodularity ratio exactly γ. The recurrence relation holds until the (k − 1)-th marginal, which allows us to have a large value for the final marginal and thus increase β.
Proof of Theorem 24. We start by constructing a monotone set function f on a ground set of k elements. The elements are indistinguishable, meaning that for any given set S, two elements e, e′ ∈ X ∖ S have the same marginal contribution. Therefore, because elements are indistinguishable, the value of a set is a function of its size alone. Let x_i be the value of any set of size i = 0, 1, . . . , k. Additionally, let x_0 = f(∅) = 0 and x_k = 1. We define x_i inductively with the following recurrence for i = 0, 1, . . . , k − 2:

x_{i+1} = ((k − i − γ)/(k − i)) x_i + γ/(k − i),   or equivalently   x_{i+1} − x_i = (γ/(k − i))(1 − x_i).   (18)

It can easily be shown (by induction) that the described sequence is valid, i.e. it is monotone and each x_i ∈ [0, 1]. Moreover,

1 − x_{i+1} = 1 − ( ((k − i − γ)/(k − i)) x_i + γ/(k − i) ) = (1 − γ/(k − i))(1 − x_i),   (19)

for all i = 0, 1, . . . , k − 2. We first show that f has lower submodularity ratio at least γ: we prove that, for any B and A ⊂ B with |B| = j and |A| = i,

Σ_{e∈B∖A} f(e | A) / (f(B) − f(A)) = (j − i)(x_{i+1} − x_i) / (x_j − x_i) ≥ γ.   (20)

First, we consider the case in which j = k. If i = k − 1, then the left-hand side of (20) is 1. If i ≤ k − 2, then applying the identity (18) and recalling that x_k = 1 gives

(k − i)(x_{i+1} − x_i) / (x_k − x_i) = (k − i) · (γ/(k − i))(1 − x_i) / (1 − x_i) = γ

for any i ≤ k − 2. Next, suppose that j ≤ k − 1 and i ≤ k − 2. Then, by applying the identity (19) recursively, we obtain

x_j − x_i = (1 − x_i) − (1 − x_j) = (1 − x_i) ( 1 − ∏_{ℓ=i}^{j−1} (1 − γ/(k − ℓ)) ).   (21)

Since γ ∈ [0, 1] and (1 − 1/n)^x ≤ 1 − x/n for x ∈ [0, 1], we have

∏_{ℓ=i}^{j−1} (1 − γ/(k − ℓ)) ≥ ∏_{ℓ=i}^{j−1} (1 − 1/(k − ℓ))^γ = ((k − j)/(k − i))^γ = (1 − (j − i)/(k − i))^γ ≥ 1 − (j − i)/(k − i),

where in the last inequality we used the fact that (1 − x)^γ ≥ 1 − x for x ∈ [0, 1] and γ ∈ [0, 1]. Hence,

(j − i)(x_{i+1} − x_i) / (x_j − x_i) = (j − i) · (γ/(k − i))(1 − x_i) / ( (1 − x_i)(1 − ∏_{ℓ=i}^{j−1}(1 − γ/(k − ℓ))) ) ≥ (j − i) · (γ/(k − i))(1 − x_i) / ( (1 − x_i) · (j − i)/(k − i) ) = γ,

where we have used (18) and (21) in the first equation. Combining these cases, we find that f is γ-weakly submodular from below.

To complete the proof, we now show that f is not β-weakly submodular from above for any β < C(k − γ, k − 1). Here we consider the case in which A = ∅ and B = X and show that

Σ_{e∈B} f(e | B − e) / (f(B) − f(∅)) = k(x_k − x_{k−1}) / (x_k − x_0) = C(k − γ, k − 1).

Recall that x_k = 1 and x_0 = 0, which implies that the denominator is equal to 1. Recursively applying the identity (19) gives

k(1 − x_{k−1}) = k · ∏_{ℓ=0}^{k−2} (1 − γ/(k − ℓ)) = k ∏_{ℓ=2}^{k} (1 − γ/ℓ) = ∏_{ℓ=2}^{k} (ℓ − γ) / ∏_{ℓ=1}^{k−1} ℓ = ∏_{ℓ=1}^{k−1} (ℓ + 1 − γ) / ∏_{ℓ=1}^{k−1} ℓ = C(k − γ, k − 1).
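The construction is easy to check numerically. The sketch below (ours, with hypothetical helper names) builds the sequence x_0, …, x_k from recurrence (18), then verifies both the lower-ratio bound (20) and the final-marginal ratio C(k − γ, k − 1).

```python
import math

def build_x(k, gamma):
    """Values x_i = f(any set of size i) from recurrence (18), with x_0 = 0 and x_k = 1."""
    x = [0.0] * (k + 1)
    for i in range(k - 1):                      # recurrence is used for i = 0, ..., k-2
        x[i + 1] = x[i] + gamma / (k - i) * (1.0 - x[i])
    x[k] = 1.0                                  # the final marginal is set directly
    return x

k, gamma = 10, 0.3
x = build_x(k, gamma)

# Lower ratio (20): min over i < j of (j - i)(x_{i+1} - x_i) / (x_j - x_i), which should be >= gamma.
lower = min((j - i) * (x[i + 1] - x[i]) / (x[j] - x[i])
            for i in range(k) for j in range(i + 1, k + 1))
# Ratio for A = empty set, B = ground set: k (x_k - x_{k-1}), which should equal C(k - gamma, k - 1).
upper = k * (x[k] - x[k - 1])
binom = math.prod(l + 1 - gamma for l in range(1, k)) / math.factorial(k - 1)
print(lower, gamma)        # the lower ratio is (up to rounding) at least gamma
print(upper, binom)        # these two values agree
```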
References

[1] Noga Alon and Joel H. Spencer. The Probabilistic Method. Wiley Publishing, 4th edition, 2016.

[2] Ramakrishna Bairi, Rishabh Iyer, Ganesh Ramakrishnan, and Jeff Bilmes. Summarization of multi-document topic hierarchies using submodular mixtures. In Proc. 53rd Annual Meeting of the Association for Computational Linguistics and 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 553–563, 2015.

[3] Andrew An Bian, Joachim M. Buhmann, Andreas Krause, and Sebastian Tschiatschek. Guarantees for greedy maximization of non-submodular functions with applications. In Proc. 34th ICML, pages 498–507, 2017.

[4] Richard A. Brualdi. Comments on bases in dependence structures. Bull. of the Australian Mathematical Society, 1(2):161–167, 1969.

[5] Gruia Calinescu, Chandra Chekuri, Martin Pál, and Jan Vondrák. Maximizing a monotone submodular function subject to a matroid constraint. SIAM J. Computing, 40(6):1740–1766, 2011.

[6] Lin Chen, Moran Feldman, and Amin Karbasi. Weakly submodular maximization beyond cardinality constraints: Does randomization help greedy? In Proc. 35th ICML, pages 804–813, 2018.

[7] Flavio Chierichetti, Anirban Dasgupta, and Ravi Kumar. On additive approximate submodularity. arXiv preprint arXiv:2010.02912, 2020.

[8] Michele Conforti and Gérard Cornuéjols. Submodular set functions, matroids and the greedy algorithm: Tight worst-case bounds and some generalizations of the Rado-Edmonds theorem. Discrete Applied Mathematics, 7(3):251–274, 1984.

[9] Abhimanyu Das, Anirban Dasgupta, and Ravi Kumar. Selecting diverse features via spectral regularization. Proc. 25th NeurIPS, pages 1583–1591, 2012.

[10] Abhimanyu Das and David Kempe. Algorithms for subset selection in linear regression. In Proc. 40th STOC, pages 45–54, 2008.

[11] Abhimanyu Das and David Kempe. Submodular meets spectral: Greedy algorithms for subset selection, sparse approximation and dictionary selection. In Proc. 28th ICML, 2011.

[12] Ethan R. Elenberg, Alexandros G. Dimakis, Moran Feldman, and Amin Karbasi. Streaming weak submodularity: Interpreting neural networks on the fly. In Proc. 31st NeurIPS, pages 4044–4054, 2017.

[13] Ethan R. Elenberg, Rajiv Khanna, Alexandros G. Dimakis, Sahand Negahban, et al. Restricted strong convexity implies weak submodularity. The Annals of Statistics, 46(6B):3539–3568, 2018.

[14] Uriel Feige. A threshold of ln n for approximating set cover. J. of the ACM, 45(4):634–652, 1998.

[15] Moran Feldman, Amin Karbasi, and Ehsan Kazemi. Do less, get more: Streaming submodular maximization with subsampling. In Proc. 32nd NeurIPS, pages 732–742, 2018.

[16] Moran Feldman, Joseph (Seffi) Naor, Roy Schwartz, and Justin Ward. Improved approximations for k-exchange systems. In Proc. 19th ESA, pages 784–798, 2011.

[17] Yuval Filmus and Justin Ward. A tight combinatorial algorithm for submodular maximization subject to a matroid constraint. SIAM J. Computing, 43(2):514–542, 2014.

[18] M. L. Fisher, G. L. Nemhauser, and L. A. Wolsey. An analysis of approximations for maximizing submodular set functions I. Mathematical Programming, pages 265–294, 1978.

[19] Tobias Friedrich, Andreas Göbel, Frank Neumann, Francesco Quinzan, and Ralf Rothenberger. Greedy maximization of functions with bounded curvature under partition matroid constraints. In Proc. of AAAI, volume 33(1), pages 2272–2279, 2019.

[20] Suning Gong, Qingqin Nong, Wenjing Liu, and Qizhi Fang. Parametric monotone function maximization with matroid constraints. J. Global Optimization, 75(3):833–849, 2019.

[21] Chris Harshaw, Moran Feldman, Justin Ward, and Amin Karbasi. Submodular maximization beyond non-negativity: Guarantees, fast algorithms, and applications. In Proc. 36th ICML, volume 97, pages 2634–2643, 2019.

[22] Abolfazl Hashemi, Mahsa Ghasemi, Haris Vikalo, and Ufuk Topcu. Submodular observation selection and information gathering for quadratic models. In Proc. 36th ICML, pages 2653–2662, 2019.

[23] Abolfazl Hashemi, Mahsa Ghasemi, Haris Vikalo, and Ufuk Topcu. Randomized greedy sensor selection: Leveraging weak submodularity. IEEE Trans. on Automatic Control, 66(1):199–212, 2020.

[24] Thibaut Horel and Yaron Singer. Maximization of approximately submodular functions. Proc. 29th NeurIPS, pages 3045–3053, 2016.

[25] David Kempe, Jon Kleinberg, and Éva Tardos. Maximizing the spread of influence through a social network. In Proc. 9th KDD, pages 137–146, 2003.

[26] Rajiv Khanna, Ethan Elenberg, Alexandros G. Dimakis, and Sahand Negahban. On approximation guarantees for greedy low rank optimization. In Proc. 34th ICML, pages 1837–1846, 2017.

[27] Rajiv Khanna, Ethan Elenberg, Alexandros G. Dimakis, Sahand Negahban, and Joydeep Ghosh. Scalable greedy feature selection via weak submodularity. In Proc. 20th AISTATS, pages 1560–1568, 2017.

[28] Andreas Krause, Ajit Paul Singh, and Carlos Guestrin. Near-optimal sensor placements in Gaussian processes: Theory, efficient algorithms and empirical studies. J. Machine Learning Research, 9:235–284, 2008.

[29] Jon Lee, Maxim Sviridenko, and Jan Vondrák. Submodular maximization over multiple matroids via generalized exchange properties. Math. of Operations Research, 35(4):795–806, 2010.

[30] Marko Mitrovic, Ehsan Kazemi, Morteza Zadimoghaddam, and Amin Karbasi. Data summarization at scale: A two-stage submodular approach. In Proc. 35th ICML, pages 3596–3605, 2018.

[31] G. L. Nemhauser and L. A. Wolsey. Best algorithms for approximating the maximum of a submodular set function. Mathematics of Operations Research, 3(3):177–188, 1978.

[32] Qingqin Nong, Tao Sun, Suning Gong, Qizhi Fang, Dingzhu Du, and Xiaoyu Shao. Maximize a monotone function with a generic submodularity ratio. In Proc. International Conference on Algorithmic Applications in Management, pages 249–260, 2019.

[33] Sebastian Pokutta, Mohit Singh, and Alfredo Torrico. On the unreasonable effectiveness of the greedy algorithm: Greedy adapts to sharpness. In Proc. 37th ICML, pages 7772–7782, 2020.

[34] Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. The MIT Press, 2005.

[35] Richard Santiago and Yuichi Yoshida. Weakly submodular function maximization using local submodularity ratio. arXiv preprint arXiv:2004.14650, 2020.

[36] Maxim Sviridenko, Jan Vondrák, and Justin Ward. Optimal approximation for submodular and supermodular optimization with bounded curvature. In Proc. 26th SODA, pages 1134–1148, 2015.

[37] Jan Vondrák. Submodularity and curvature: the optimal algorithm. RIMS Kôkyûroku Bessatsu, B23:253–266, 2010.

[38] Justin Ward. A (k+3)/2-approximation algorithm for monotone submodular k-set packing and general k-exchange systems. In Proc. 29th STACS, volume 14, pages 42–53. LIPIcs, 2012.

[39] Yuichi Yoshida. Maximizing a monotone submodular function with a bounded curvature under a knapsack constraint.