Evolutionary Algorithms and Submodular Functions: Benefits of Heavy-Tailed Mutations
Tobias Friedrich, Andreas Göbel, Francesco Quinzan, Markus Wagner
Hasso Plattner Institute, Potsdam, Germany; University of Adelaide, Adelaide, Australia
Abstract.
A core feature of evolutionary algorithms is their mutation operator. Recently, much attention has been devoted to the study of mutation operators with dynamic and non-uniform mutation rates. Following up on this line of work, we propose a new mutation operator and analyze its performance on the (1+1) Evolutionary Algorithm (EA). Our analyses show that this mutation operator competes with pre-existing ones, when used by the (1+1) EA on classes of problems for which results on the other mutation operators are available. We show that the (1+1) EA using our mutation operator finds a (1/3 − ε/n)-approximation of the optimum on the class of non-negative submodular functions, as well as on the class of symmetric submodular functions under a single matroid constraint, in expected polynomial time.

Keywords:
Evolutionary algorithms, mutation operators, submodular functions, matroids.
A key procedure of the (1+1) EA that affects its performance is the mutation operator, i.e., the operator that determines at each step how the potential new solution is generated. In the past several years there has been a huge effort, both from a theoretical and an experimental point of view, towards understanding how this procedure influences the performance of the (1+1) EA and which is the optimal way of choosing this parameter (e.g., see [1,2]).

The most common mutation operator on n-bit strings is the static uniform mutation operator. This operator, unif_p, flips each bit of the current solution independently with probability p(n). This probability, p(n), is called the static mutation rate and remains the same throughout the run of the algorithm. The most common choice for p(n) is 1/n; thus, mutated solutions differ in expectation in one bit from their predecessors. Witt [3] shows that this choice of p(n) is optimal for all pseudo-Boolean linear functions. Doerr et al. [4] further observe that changing p(n) by a constant factor can lead to large variations of the overall run time of the (1+1) EA. They also show the existence of functions for which this choice of p(n) is not optimal.

Static mutation rates are not the only ones studied in the literature. Jansen et al. [5] propose a mutation rate which at time step t flips each bit independently with probability 2^((t−1) mod ⌈log n⌉)/n. Doerr et al. [6] observe that this mutation rate is equivalent to a mutation rate of the form α/n, where α is chosen uniformly at random (u.a.r.) from the set {2^((t−1) mod ⌈log n⌉) | t ∈ {1, ..., ⌈log n⌉}}. Doerr et al. [7,8] have proposed a simple on-the-fly mechanism that can approximate optimal mutation rates well for two unimodal functions.

Doerr et al. [6] notice that the choice of p(n) = 1/n is a result of over-tailoring the mutation rates to commonly studied simple unimodal problems. They propose a non-static mutation operator fmut_β, which chooses a mutation rate α/n, with α ≤ n/2 sampled from a power-law distribution with exponent β. Friedrich et al. [9] propose a new mutation operator. Their operator cMut(p) chooses at each step with constant probability p to flip 1 bit of the solution, chosen uniformly at random. With the remaining probability 1 − p, the operator chooses k ∈ {2, ..., n} uniformly at random and flips k bits of the solution chosen uniformly at random. This operator performs well in optimizing pseudo-Boolean functions, as well as combinatorial problems such as the minimum vertex cover and the maximum cut. Experiments suggest that this operator outperforms the mutation operator of Doerr et al. [6] when run on functions that exhibit large deceptive basins of attraction, i.e., local optima whose Hamming distance from the global optimum is in Θ(n).

As evolutionary algorithms are used extensively in real-world applications, it is important to extend the theoretical analysis of their performance to more general classes of functions. To improve the performance of the (1+1) EA in more complex landscapes, and inspired by the recent results of Doerr et al. [6] and Friedrich et al. [9], we propose a new mutation operator pmut_β. Our operator mutates n-bit string solutions as follows. At each step, pmut_β chooses k ∈ {1, ..., n} from a power-law distribution. Then k bits of the current solution are chosen uniformly at random and then flipped.
During a run of the (1+1) EA using pmut_β, the majority of mutations consist of flipping a small number of bits, but occasionally a large number of bit flips, up to n, can be performed. In comparison to the mutations of fmut_β, the mutations of pmut_β have a considerably higher likelihood of performing more than n/2 bit flips.

Run-Time Comparison on Artificial Landscapes
Our analysis of the (1+1) EA using pmut_β starts by considering artificial landscapes. More specifically, in Section 3.1 we show that the (1+1) EA using pmut_β finds the optimum of any pseudo-Boolean function within expected exponential time. When run on the OneMax function, the (1+1) EA with pmut_β finds the optimum solution in expected polynomial time.

In Section 3.2 we consider the problem of maximizing the n-dimensional jump function Jump_{m,n}(x), first introduced by Droste et al. [10]. We show that for any value of the parameters m, n with m constant or n − m constant, the expected run time of the (1+1) EA using pmut_β remains polynomial. This is not the case for the (1+1) EA using unif_p, for which Droste et al. [10] showed a run time of Θ(n^m + n log n) in expectation. Doerr et al. [6] are able to derive polynomial bounds for the expected run time of the (1+1) EA using their mutation operator fmut_β, but their results limit the jump parameter to m ≤ n/2.

Optimization of Submodular Functions
Our main focus in this article is to study the performance of the (1+1) EA when optimizing submodular functions. Submodularity is a property that captures the notion of diminishing returns. Thus, submodular functions find applicability in a large variety of problems. Examples include: maximum facility location problems [11], maximum cut and maximum directed cut [12], and restricted SAT instances [13]. Submodular functions under a single matroid constraint arise in artificial intelligence and are connected to probabilistic fault diagnosis problems [14,15].

Submodular functions exhibit additional properties in some cases, such as symmetry and monotonicity. These properties can be exploited to derive run time bounds for local randomized search heuristics such as the (1+1) EA. In particular, Friedrich and Neumann [16] give run time bounds for the (1+1) EA and GSEMO on this problem, assuming either monotonicity or symmetry.

We show (Section 4.1) that the (1+1) EA with pmut_β on any non-negative submodular function yields a (1/3 − ε/n)-approximation in expectation.¹ We further show (Section 4.2) that the (1+1) EA outperforms the local search of Feige et al. [17] at least with constant probability (w.c.p.).

Additionally, we evaluate the performance of the (1+1) EA using pmut_β on the maximum directed cut problem experimentally, on real-world graphs of different origins, with up to 6.6 million vertices and 56 million edges. Our experiments show that pmut_β outperforms unif_p and the uniform mutation operator on these instances. This analysis appears in Section 6.1.

In Section 5 we consider the problem of maximizing a symmetric submodular function under a single matroid constraint. Our analysis shows that the (1+1) EA using pmut_β finds a (1/3 − ε/n)-approximation of the global optimum in expected polynomial time, and we compare its performance with that of unif_p.

To establish our results empirically, in Section 6.2 we consider the symmetric mutual information problem under a cardinality constraint. We consider an air pollution data set over a four-month interval and use the (1+1) EA to identify the highly informative random variables of this data set. We observe that pmut_β performs better than the uniform mutation operator and unif_p for a small time budget and a small cardinality constraint, but for a large cardinality constraint all mutation operators have similar performance.

A comparison of the previously known performance of deterministic local search algorithms on submodular functions and our results on the (1+1) EA can be found in Table 1.

¹ The expected run time in the unconstrained case is given in Section 4.1, whereas the improved upper bound is in Section 4.2. The expected run time bounds for the (1+1) EA in the constrained case are discussed in Section 5. Previous run time bounds for deterministic local search algorithms are discussed in Feige et al. [17] and Lee et al. [15]. We remark that local operations for the deterministic local search correspond to favorable moves in the analysis of the (1+1) EA. Hence, they are the same unit of measurement.
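Before turning to the formal definitions, the overall setting can be made concrete with a short sketch. The following is our own minimal illustration (not the paper's code; the exponent β = 1.5, problem size, and iteration budget are arbitrary choices): an elitist (1+1) EA whose mutation flips k bits, with k drawn from a power-law distribution, run on OneMax.

```python
import random

random.seed(1)

def pmut(x, beta=1.5):
    # heavy-tailed mutation: sample k in {1, ..., n} with Pr(k) proportional
    # to k^(-beta), then flip exactly k positions chosen uniformly at random
    n = len(x)
    weights = [k ** -beta for k in range(1, n + 1)]
    k = random.choices(range(1, n + 1), weights=weights)[0]
    y = list(x)
    for i in random.sample(range(n), k):
        y[i] = 1 - y[i]
    return y

def one_plus_one_ea(f, n, iters):
    # elitist (1+1) EA: keep the offspring only if it is at least as fit
    x = [random.randint(0, 1) for _ in range(n)]
    for _ in range(iters):
        y = pmut(x)
        if f(y) >= f(x):
            x = y
    return x

onemax = sum  # OneMax(x) = number of ones in the bit string
x = one_plus_one_ea(onemax, 16, 20000)
print(onemax(x))
```

With this generous iteration budget the small instance is solved to optimality with overwhelming probability; most mutations flip a single bit, while the heavy tail occasionally produces large jumps.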
We study the run time of the simple (1+1) Evolutionary Algorithm under various configurations. This algorithm requires a bit-string of fixed length n as input. An offspring is then generated by the mutation operator, an operator that resembles asexual reproduction. The fitness of the solution is then computed and the less desirable result is discarded. This algorithm is elitist in the sense that the solution quality never decreases throughout the process. Pseudo-code for the (1+1) EA is given in Algorithm 1.

Algorithm 1: The (1+1) EA
    input: a fitness function f: 2^V → R_{≥0};
    output: an (approximate) global maximum of the function f;
    // sample initial solution
    choose x ∈ {0,1}^n uniformly at random;
    while convergence criterion not met do
        // apply mutation operator
        y ← Mutation(x);
        // perform selection
        if f(y) ≥ f(x) then x ← y;
    return x;

In the (1+1) EA the offspring generated in each iteration depends on the mutation operator. The standard choice for Mutation(·) is to flip each bit of an input string x = (x_1, ..., x_n) independently with probability 1/n. In a slightly more general setting, the mutation operator unif_p(·) flips each bit of x independently with probability p/n, where p ∈ [0, n/2]. We refer to p as the mutation rate.

Uniform mutations can be further generalized by sampling the mutation rate p at each step, as in the power-law mutation fmut_β of Doerr et al. [6]. fmut_β chooses the mutation rate according to a power-law distribution on [1, n/2] with exponent β. More formally, denote with X the r.v. (random variable) that returns the mutation rate at a given step. The power-law operator fmut_β uses a probability distribution D^β_{n/2} s.t. Pr(X = k) = (H^β_{n/2})^{-1} k^{-β}, where H^β_ℓ = Σ_{j=1}^{ℓ} 1/j^β. The H^β_ℓ are known in the literature as generalized harmonic numbers. Interestingly, generalized harmonic numbers can be approximated with the Riemann zeta function, as ζ(β) = lim_{ℓ→+∞} H^β_ℓ. In particular, the harmonic numbers H^β_{n/2} are always upper-bounded by a constant, for increasing problem size and for a fixed β > 1.

In this paper we consider an alternative approach to the non-uniform mutation operators described above. For a given probability distribution P: [1, ..., n] → R, the proposed mutation operator samples an element k ∈ [1, ..., n] according to the distribution P, and flips exactly k bits of an input string x = (x_1, ..., x_n), chosen uniformly at random among all possibilities. This framework depends on the distribution P, which we always assume fixed throughout the optimization process.

Based on the results of Doerr et al. [6], we study a specialization of our non-uniform framework that uses a distribution of the form P = D^β_n. We refer to this operator as pmut_β, and pseudo-code is given in Algorithm 2. This operator uses a power-law distribution on the probability of performing exactly k bit flips in one iteration. That is, for x ∈ {0,1}^n and all k ∈ {1, ..., n},

    Pr(H(x, pmut_β(x)) = k) = (H^β_n)^{-1} k^{-β}.    (1)

We remark that with this operator, for any two points x, y ∈ {0,1}^n, the probability Pr(y = pmut_β(x)) only depends on their Hamming distance H(x, y).

Although both operators, fmut_β and pmut_β, are defined in terms of a power-law distribution, their behavior differs. We note that, for any choice of the constant β > 1 and x ∈ {0,1}^n, Pr(H(x, fmut_β(x)) = 0) > 0, while Pr(H(x, pmut_β(x)) = 0) = 0. We discuss the advantages and disadvantages of these two operators in Section 3.

Algorithm 2:
The mutation operator pmut_β(x)
    input: a pseudo-Boolean array x;
    output: a mutated pseudo-Boolean array y;
    y ← x;
    choose k ∈ [1, ..., n] with distribution D^β_n;
    flip k bits of y chosen uniformly at random;
    return y;

Submodular set functions intuitively capture the notion of diminishing returns, i.e., the more you acquire, the less your marginal gain. More formally, the following definition holds.
Definition 1.
A set function f: 2^V → R_{≥0} is submodular if it holds that f(S) + f(T) ≥ f(S ∪ T) + f(S ∩ T) for all S, T ⊆ V.

We remark that in this context V is always a finite set. It is well-known that the defining axiom in Definition 1 is equivalent to the requirement

    f(S ∪ {x}) − f(S) ≥ f(T ∪ {x}) − f(T),    (2)

for all S, T ⊆ V such that S ⊆ T and x ∈ V \ T (see e.g. Welsh [19]). We say that a set function f: 2^V → R_{≥0} is symmetric if it holds that f(S) = f(V \ S) for all S ⊆ V.

In some cases, feasible solutions are characterized as the independent sets of a matroid with base set V, as in the following definition.

Definition 2.
Given a set V, a matroid M = (V, I) with base set V consists of a collection I of subsets of V with the following properties:
• ∅ ∈ I;
• if T ∈ I, then S ∈ I for all subsets S ⊆ T;
• if S, T ∈ I and |S| < |T|, then there exists a point x ∈ T \ S s.t. S ∪ {x} ∈ I.

From the axioms in Definition 2, it follows that two maximal independent sets always have the same number of elements. This number is called the rank of a matroid. It is possible to generalize this notion, as in the following definition.
Definition 3.
Consider a matroid M = (V, I). For any subset S ⊆ V, the rank function r(S) returns the size of the largest independent set contained in S, i.e. r(S) = max{|T| : T ⊆ S and T ∈ I}.

We introduce a basic probabilistic inequality that is useful in the run time analysis in Section 4.2. This simple tool is commonly referred to as Markov's inequality. We use the following variation of it.
Lemma 4 (Markov).
Let X be a random variable with X ∈ [0, 1]. Then it holds that

    Pr(X ≤ c) ≤ (1 − E[X]) / (1 − c),

for all 0 ≤ c ≤ E[X]. For a discussion of Lemma 4, see e.g. Mitzenmacher and Upfal [20, Theorem 3.1].
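The inequality of Lemma 4 can be checked exactly on small discrete distributions. The following is our own sanity check, not part of the paper; it uses exact rational arithmetic and a few hand-picked probability vectors over a grid of values in [0, 1].

```python
from fractions import Fraction as F

def reverse_markov_bound(mean, c):
    # Lemma 4: for X in [0, 1] and 0 <= c <= E[X],
    # Pr(X <= c) <= (1 - E[X]) / (1 - c)
    return (1 - mean) / (1 - c)

def holds_for(values, probs, c):
    # verify the inequality exactly for a discrete distribution on [0, 1]
    mean = sum(v * p for v, p in zip(values, probs))
    if not (0 <= c <= mean):
        return True  # c outside the range covered by the lemma
    p_le_c = sum(p for v, p in zip(values, probs) if v <= c)
    return p_le_c <= reverse_markov_bound(mean, c)

grid = [F(i, 4) for i in range(5)]  # values {0, 1/4, 1/2, 3/4, 1}
prob_vectors = [
    [F(1, 5)] * 5,
    [F(1, 2), 0, 0, 0, F(1, 2)],
    [0, F(1, 4), F(1, 4), F(1, 4), F(1, 4)],
]
ok = all(holds_for(grid, probs, F(i, 8))
         for probs in prob_vectors for i in range(8))
print(ok)
```

For instance, a fair coin on {0, 1} has E[X] = 1/2, and Pr(X ≤ 1/4) = 1/2 is indeed at most (1 − 1/2)/(1 − 1/4) = 2/3.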
The Multiplicative Drift theorem is a powerful tool to analyze the expected run time of randomized algorithms such as the (1+1) EA. Intuitively, for a fitness function f: 2^V → R_{≥0} we view the run time of the (1+1) EA as a Markov chain {X_t}_{t≥0}, where X_t depends on the f-value reached at time step t. The Multiplicative Drift theorem gives an upper bound on the expected value of the first hitting time T = inf{t : X_t = 0}, provided that the change of the average value of the process {X_t}_{t≥0} is within a multiplicative factor of the previous solution. The following theorem holds.

Theorem 5 (Theorem 3 in Doerr et al. [21]).
Let {X_t}_{t≥0} be a random variable describing a Markov process over a finite state space S ⊆ R. Let T be the random variable that denotes the earliest point in time t ∈ N such that X_t = 0. Suppose that there exist δ > 0, c_min > 0, and c_max > 0 such that
• E[X_t − X_{t+1} | X_t] ≥ δ X_t;
• X_t ∈ [c_min, c_max] ∪ {0};
for all t < T. Then it holds that E[T] ≤ (2/δ) ln(1 + c_max/c_min).

In this section we bound from above the run time of the (1+1) EA using the mutation operator pmut_β on any fitness function f: {0,1}^n → R. It is well-known that the (1+1) EA using uniform mutation on any such fitness function has expected run time at most n^n. This upper bound is tight, in the sense that there exists a function f s.t. the expected run time of the (1+1) EA using uniform mutation to find the global optimum of f is Ω(n^n). For a discussion of these bounds see Droste et al. [10]. Doerr et al. [6] prove that on any fitness function f: {0,1}^n → R the (1+1) EA using the mutation operator fmut_β has expected run time at most O(H^β_{n/2} 2^n n^β). Similarly, we derive a general upper bound on the run time of the (1+1) EA using mutation pmut_β.

Lemma 6.
On any fitness function f: {0,1}^n → R the (1+1) EA with mutation pmut_β finds the optimum solution after expected O(H^β_n (2e)^{n/2} n^β) fitness evaluations, with the constant implicit in the asymptotic notation independent of β.

Proof. Without loss of generality we assume n to be even. We proceed by identifying a general lower bound on the probability of reaching any point from any other point. To this end, let x, y ∈ {0,1}^n be any two points and let k = H(x, y) be their Hamming distance. Then the probability of reaching the point y in one iteration from x is

    Pr(y = pmut_β(x)) = (n choose k)^{-1} Pr(H(x, pmut_β(x)) = k).

From (1) we have that Pr(H(x, pmut_β(x)) = k) = (H^β_n)^{-1} k^{-β} ≥ (H^β_n)^{-1} n^{-β} for all choices of x ∈ {0,1}^n and k = 1, ..., n. Using a known upper bound on the binomial coefficient, (n choose k) ≤ (n choose n/2) ≤ (en/(n/2))^{n/2} = (2e)^{n/2}, we have that

    (n choose k)^{-1} ≥ (n choose n/2)^{-1} ≥ (2e)^{-n/2},

from which it follows that

    Pr(y = pmut_β(x)) ≥ (H^β_n)^{-1} (2e)^{-n/2} n^{-β},

for any choice of x and y. We can roughly estimate the run time as a geometric distribution with probability of success Pr(y = pmut_β(x)). Hence, we conclude by taking the inverse of the estimate above, which yields the claimed upper bound on the expected run time on any fitness function. □

We consider the
OneMax function, defined as
OneMax(x_1, ..., x_n) = |x| = Σ_{j=1}^n x_j. This simple linear function of unitation returns the number of ones in a pseudo-Boolean input string. The (1+1) EA with mutation operators unif_p and fmut_β finds the global optimum after O(n log n) fitness evaluations (see [22,10,6]). It can be easily shown that the (1+1) EA with mutation operator pmut_β achieves similar performance on this instance.

Lemma 7.
The (1+1) EA with mutation pmut_β finds the global optimum of OneMax after expected O(H^β_n n log n) fitness evaluations, for all β > 1 and with the constant implicit in the asymptotic notation independent of β.

Proof. We use the fitness level method outlined in Wegener [23]. Define the levels A_i = {x ∈ {0,1}^n : f(x) = i}, and consider the quantities s_i = (n − i)(n H^β_n)^{-1}, for all i = 0, ..., n − 1. Then each s_i is a lower bound on the probability of reaching a higher fitness in one iteration from the level A_i. Denote with T_{pmut_β}(f) the run time of the (1+1) EA with mutation pmut_β on the function f = OneMax. By the fitness level theorem, we obtain an upper bound on the run time as

    T_{pmut_β}(f) ≤ Σ_{i=0}^{n−1} 1/s_i ≤ H^β_n n ∫_0^{n−1} dx/(n − x) ≤ H^β_n n log n,

and the claim follows. □

Droste et al. [10] defined the following jump function.
Jump_{m,n}(x) =
    m + |x|    if |x| ≤ n − m or |x| = n;
    n − |x|    otherwise.

For 1 < m < n this function exhibits a single local maximum and a single global maximum. The first parameter of Jump_{m,n} determines the Hamming distance between the local and the global optimum, while the second parameter denotes the size of the input. We present a general upper bound on the run time of the (1+1) EA on Jump_{m,n} with mutation operator pmut_β. Then, following the footsteps of Doerr et al. [6], we compare the performance of pmut_β with static mutation operators on jump functions for all m ≤ n/2.

Lemma 8.
Consider a jump function f = Jump_{m,n} and denote with T_{pmut_β}(f) the expected run time of the (1+1) EA using the mutation pmut_β on the function f. Then T_{pmut_β}(f) = H^β_n (n choose m) O(m^β), where the constant implicit in the asymptotic notation is independent of m and β.

Proof. We use the fitness level method. Define the levels A_i = {x ∈ {0,1}^n : f(x) = i} for all i = 1, ..., n, and consider the quantities

    s_i = (n − i)(n H^β_n)^{-1},                  for 0 ≤ i ≤ n − m − 1;
    s_i = (n choose m)^{-1} (H^β_n)^{-1} m^{-β},  for i = n − m;
    s_i = i (n H^β_n)^{-1},                       for n − m + 1 ≤ i ≤ n − 1.

Each s_i is a lower bound for the probability of reaching a higher fitness in one iteration from the level A_i. By the fitness level theorem we obtain an upper bound on the run time as

    T_{pmut_β}(f) ≤ (n choose m) H^β_n m^β + Σ_{i=0}^{n−m−1} n H^β_n/(n − i) + Σ_{i=n−m+1}^{n−1} n H^β_n/i
                 ≤ (n choose m) H^β_n m^β + 2 n H^β_n ∫_m^n dx/x
                 = (n choose m) H^β_n m^β + 2 n H^β_n ln(n/m),

for any choice of β > 1. Since 1 < m < n and hence m ≥ 2, it follows that 2 n H^β_n ln(n/m) ≤ 2 n H^β_n ln n ≤ 2 H^β_n (n choose m), and the lemma follows. □

Note that the upper bound on the run time given in Lemma 8 yields polynomial run time on all functions
Jump_{m,n} with m constant for increasing problem size, and also with n − m constant for increasing problem size.

Following the analysis of Doerr et al. [6], we can compare the run time of the (1+1) EA with mutation pmut_β with that of the (1+1) EA with uniform mutations, on the jump function Jump_{m,n} for m ≤ n/2.

Corollary 9.
Consider a jump function f = Jump_{m,n} with m ≤ n/2, and denote with T_{pmut_β}(f) the run time of the (1+1) EA using the mutation pmut_β on the function f. Similarly, denote with T_opt(f) the run time of the (1+1) EA using the best possible static uniform mutation on the function f. Then it holds that T_{pmut_β}(f) ≤ c m^{β−0.5} H^β_n T_opt(f), for a constant c independent of m and β.

The result above holds because Doerr et al. [6] prove that the best possible optimization time for a static mutation rate on a function f = Jump_{m,n} with m ≤ n/2 satisfies n^m m^{−m} (n/(n − m))^{n−m} ≤ T_opt(f).

We study the problem of maximizing a non-negative submodular function f: 2^V → R_{≥0} with no side constraints. More formally, we study the problem

    argmax_{C ⊆ V} f(C).    (3)

This problem is APX-complete. That is, this problem is NP-hard and does not admit a polynomial time approximation scheme (PTAS), unless P = NP (see Nemhauser and Wolsey [24]). We denote with opt any solution of Problem (3), and we denote with n the size of V.

We prove that the (1+1) EA with mutation pmut_β is a (1/3 − ε/n)-approximation algorithm for Problem (3). In our analysis we assume neither monotonicity nor symmetry. We approach this problem by searching for (1 + α)-local optima, which we define below.

Definition 10.
Let f: 2^V → R_{≥0} be any submodular function. A set S ⊆ V is a (1 + α)-local optimum if it holds that (1 + α) f(S) ≥ f(S \ {u}) for all u ∈ S, and (1 + α) f(S) ≥ f(S ∪ {v}) for all v ∈ V \ S, for a given α > 0.

This definition is useful in the analysis because it can be proved that either (1 + α)-local optima or their complement always yield a good approximation of the global maximum, as in the following theorem.

Theorem 11 (Theorem 3.4 in Feige et al. [17]).
Consider a non-negative submodular function f: 2^V → R_{≥0} and let S be a (1 + ε/n²)-local optimum as in Definition 10. Then either S or V \ S is a (1/3 − ε/n)-approximation of the global maximum of f.

It is possible to construct examples of submodular functions that exhibit (1 + ε/n²)-local optima with arbitrarily bad approximation ratios. Thus, (1 + ε/n²)-local optima alone do not yield any approximation guarantee for Problem (3), unless the fitness function is symmetric.

We can use Theorem 11 to estimate the run time of the (1+1) EA using mutation pmut_β to maximize a given submodular function. Intuitively, it is always possible to find a (1 + ε/n²)-local optimum in polynomial time using single bit-flips. It is then possible to compare the approximate local solution S with its complement V \ S by flipping all bits in one iteration.

We do not perform the analysis on a given submodular function f directly, but consider a corresponding potential function g_{f,ε} instead. We define potential functions as in the following lemma.

Lemma 12.
Consider a non-negative submodular function f: 2^V → R_{≥0}. Consider the function g_{f,ε}(U) = f(U) + ε·opt/n, for all U ⊆ V. The following conditions hold:
(1) g_{f,ε}(U) is submodular;
(2) g_{f,ε}(U) ≥ ε·opt/n, for all subsets U ⊆ V;
(3) suppose that a solution U ⊆ V is a δ-approximation for g_{f,ε}, for a constant 0 < δ < 1; then U is a (δ − ε/n)-approximation for f.

Proof. (1) The submodularity of g_{f,ε}(U) follows immediately from the fact that f(U) is submodular, together with the fact that the term ε·opt/n is constant. (2) The property follows directly from the definition of g_{f,ε}(U), together with the assumption that f is non-negative. (3) Fix a subset U ⊆ V that is a δ-approximation for g_{f,ε}. Then we have that

    g_{f,ε}(U) ≥ δ (opt + ε·opt/n)  ⟹  f(U) ≥ δ (opt + ε·opt/n) − ε·opt/n.

It follows that

    f(U) ≥ δ·opt − (1 − δ)·ε·opt/n ≥ δ·opt − ε·opt/n,

where the last inequality follows from the assumption that 0 < δ < 1. The lemma follows. □
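The three properties of Lemma 12 can be verified exhaustively on a tiny instance. The following sketch is our own illustration (not from the paper): it uses the cut function of a small arbitrary graph as the non-negative submodular function f and checks the potential g_{f,ε} = f + ε·opt/n by brute force.

```python
from itertools import chain, combinations
from fractions import Fraction as F

# cut function of a small undirected graph: non-negative and submodular
V = range(5)
EDGES = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 4), (3, 4)]

def cut(S):
    S = set(S)
    return sum(1 for u, v in EDGES if (u in S) != (v in S))

def subsets(ground):
    items = list(ground)
    return chain.from_iterable(combinations(items, r) for r in range(len(items) + 1))

def is_submodular(f, ground):
    # check f(S) + f(T) >= f(S | T) + f(S & T) over all pairs of subsets
    subs = [frozenset(s) for s in subsets(ground)]
    return all(f(S) + f(T) >= f(S | T) + f(S & T) for S in subs for T in subs)

n = len(list(V))
opt = max(cut(S) for S in subsets(V))
eps = F(1, 2)  # arbitrary choice for the illustration

def g(U):
    # potential function of Lemma 12
    return cut(U) + eps * opt / n

print(is_submodular(g, V), all(g(U) >= eps * opt / n for U in subsets(V)))
```

Property (3) can be checked the same way: every subset U with g(U) ≥ δ·(opt + ε·opt/n) also satisfies cut(U) ≥ (δ − ε/n)·opt, exactly as the proof predicts.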
Using potential functions and their properties, we can prove the following result.
Theorem 13.
The (1+1) EA with mutation pmut_β is a (1/3 − ε/n)-approximation algorithm for Problem (3). Its expected run time is O(ε^{-1} n³ log(n/ε) + n^β).

Proof. We prove that for all ε > 0, the (1+1) EA with mutation pmut_β finds a (1/3 − ε/n)-approximation of g_{f,ε} (as in Lemma 12) within expected O(ε^{-1} n³ log(n/ε) + n^β) fitness evaluations. We then use this knowledge to conclude that the (1+1) EA with mutation pmut_β finds a (1/3 − ε/n)-approximation of f within O(ε^{-1} n³ log(n/ε) + n^β) fitness evaluations, and the theorem follows. We divide the run time into two phases. During Phase 1, the (1+1) EA finds a (1 + ε/n²)-local optimum of g_{f,ε}. During Phase 2 the algorithm finds a (1/3 − ε/n)-approximation of the global optimum of g_{f,ε} using the heavy-tailed mutation.

(Phase 1) We use the multiplicative increase method. Denote with x_t the solution found by the (1+1) EA at time step t, for all t ≥ 0. Then for any solution x_t it is always possible to improve the fitness to at least (1 + ε/n²) g_{f,ε}(x_t) in the next iteration, by adding or removing a single vertex, unless x_t is already a (1 + ε/n²)-local optimum. We refer to any single bit-flip that yields such an improvement of the fitness as a favorable bit-flip. We give an upper bound on the number of favorable bit-flips k needed to reach a (1 + ε/n²)-local optimum, by solving the following inequality:

    (1 + ε/n²)^k · ε·opt/n ≤ opt + ε·opt/n  ⟺  (1 + ε/n²)^k ≤ n/ε + 1,

where we have used that for the initial solution x_0, g_{f,ε}(x_0) ≥ ε·opt/n (see Lemma 12(2)). From solving this inequality it follows that the (1+1) EA with mutation pmut_β reaches a (1 + ε/n²)-local optimum after at most k = O(ε^{-1} n² log(n/ε)) favorable bit-flips. Since the probability of performing a single chosen bit-flip is at least (H^β_n)^{-1} n^{-1} = Ω(1/n), the expected waiting time for a favorable bit-flip to occur is O(n), and we can upper-bound the expected run time of this initial phase as O(ε^{-1} n³ log(n/ε)).

(Phase 2) Assume that a (1 + ε/n²)-local optimum has been found. Then from Theorem 11 it follows that either this local optimum or its complement is a (1/3 − ε/n)-approximation of the global maximum. Thus, if the solution found in Phase 1 does not yield the desired approximation ratio, an n-bit flip is sufficient to find a (1/3 − ε/n)-approximation of the global optimum of g_{f,ε}. The probability of this event occurring is at least (H^β_n)^{-1} n^{-β} = Ω(n^{-β}) by (1). After an additional phase of expected O(n^β) fitness evaluations the (1+1) EA with mutation pmut_β reaches the desired approximation of the global maximum. □

We prove that the (1+1) EA with mutation pmut_β yields an improved upper bound on the run time over that of Theorem 13, at least with constant probability.
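The two-phase behavior analyzed in Theorem 13 can be observed on a toy instance. The following sketch is our own illustration (graph, seed, β, and iteration budget are arbitrary, and this is not the paper's experimental code): it runs the (1+1) EA with a heavy-tailed mutation on the cut function of a small graph, which is non-negative and submodular, and checks the solution against the (1/3)·opt threshold via brute force.

```python
import random
from itertools import chain, combinations

random.seed(0)
BETA = 1.5
N = 6
EDGES = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0), (0, 3), (1, 4)]

def cut(x):
    # cut value of bit-string x: edges with endpoints on different sides
    return sum(1 for u, v in EDGES if x[u] != x[v])

def pmut(x, beta=BETA):
    # flip k bits, k drawn with Pr(k) proportional to k^(-beta)
    n = len(x)
    weights = [k ** -beta for k in range(1, n + 1)]
    k = random.choices(range(1, n + 1), weights=weights)[0]
    y = list(x)
    for i in random.sample(range(n), k):
        y[i] = 1 - y[i]
    return y

def one_plus_one_ea(f, n, iters):
    x = [random.randint(0, 1) for _ in range(n)]
    for _ in range(iters):
        y = pmut(x)
        if f(y) >= f(x):
            x = y
    return x

# brute-force optimum over all 2^N subsets
opt = max(cut([1 if i in S else 0 for i in range(N)])
          for S in chain.from_iterable(combinations(range(N), r) for r in range(N + 1)))
best = cut(one_plus_one_ea(cut, N, 2000))
print(best >= opt / 3)
```

Single-bit flips drive the hill-climbing of Phase 1, while the occasional n-bit flip corresponds to the complement step of Phase 2.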
This upper bound yields an improvement over the run time analysis of a standard deterministic Local Search (LS) algorithm (see Theorem 3.4 in Feige et al. [17]), at least with constant probability.

Theorem 14 (Theorem 2.1 in Feige et al. [17]).
Let f: 2^V → R_{≥0} be a submodular function, and denote with R ⊆ V a set chosen uniformly at random. Then E[f(R)] ≥ opt/4.

We exploit this result to obtain an improved upper bound on the run time. Intuitively, the initial solution sampled by the (1+1) EA yields a constant-factor approximation guarantee at least with constant probability. We can use this result to prove the following theorem.
Theorem 15.
The (1+1) EA with mutation pmut_β is a (1/3 − ε/n)-approximation algorithm for Problem (3) after O(ε^{-1} n³ + n^β) fitness evaluations, at least w.c.p.

Proof. This proof is similar to that of Theorem 13. We denote with x_t a solution reached by the (1+1) EA at time step t. We first prove that the definition of submodularity implies that, with constant probability, the initial solution x_0 yields a constant-factor approximation guarantee. We then perform a run time analysis as in Theorem 13, by counting the expected time until the fittest individual is chosen for selection and a local improvement of at least a factor (1 + ε/n²) is made, assuming that the initial solution yields a constant-factor approximation guarantee.

Denote with R ⊆ V a set chosen uniformly at random and fix a constant δ > 4. We combine Theorem 14 with Lemma 4, by choosing X = f(R)/opt, and obtain

    Pr(f(R) ≤ opt/δ) = Pr(X ≤ 1/δ) ≤ (1 − 1/4)/(1 − 1/δ) = 3δ/(4(δ − 1)),

where the last inequality follows by applying Theorem 14 and linearity of expectation to the r.v. X = f(R)/opt, to obtain that E[X] ≥ 1/4. We have

    Pr(f(x_0) > opt/δ) ≥ 1 − Pr(f(R) ≤ opt/δ) ≥ 1 − 3δ/(4(δ − 1)).

In the following, for a fixed constant δ > 4, we perform the run time analysis as in Theorem 13 conditional on the event A = {f(x_0) > opt/δ}, which occurs at least w.c.p.

Again, we divide the run time into two phases. During Phase 1, the (1+1) EA finds a (1 + ε/n²)-local optimum of f. During Phase 2 the algorithm finds a (1/3 − ε/n)-approximation of the global optimum of f using the heavy-tailed mutation.

(Phase 1) For any solution x_t it is always possible to improve the fitness to at least (1 + ε/n²) f(x_t) in the next iteration, by adding or removing a single vertex (the favorable bit-flip), unless x_t is already a (1 + ε/n²)-local optimum. Again, we give an upper bound on the number of favorable bit-flips k needed to reach a (1 + ε/n²)-local optimum, by solving the following inequality:

    (1 + ε/n²)^k · opt/δ ≤ opt  ⟺  (1 + ε/n²)^k ≤ δ,

from which it follows that the (1+1) EA with mutation pmut_β reaches a (1 + ε/n²)-local optimum after at most k = O(ε^{-1} n²) favorable moves. Since the probability of performing a single chosen bit-flip is at least (H^β_n)^{-1} n^{-1} = Ω(1/n), the expected waiting time for a favorable bit-flip to occur is O(n), and we can upper-bound the expected run time of this initial phase as O(ε^{-1} n³).

(Phase 2) We conclude by applying the heavy-tailed mutation step: if the solution found in Phase 1 does not yield the desired approximation ratio, an n-bit flip is sufficient to find a (1/3 − ε/n)-approximation of the global optimum of f. The probability of this event occurring is at least (H^β_n)^{-1} n^{-β} = Ω(n^{-β}) by (1). After an additional phase of expected O(n^β) fitness evaluations the (1+1) EA with mutation pmut_β performs an n-bit flip, thus reaching the desired approximation ratio. □

In this section we consider the problem of maximizing a non-negative submodular function f: 2^V → R_{≥0} under a single matroid constraint M = (V, I). More formally, we study the problem

    argmax_{C ∈ I} f(C).
(4)

We denote with opt any solution of Problem (4), and we denote with n the size of V. Note that this definition of opt differs from that of Section 4. We approach this problem by maximizing the following fitness function

z_f(C) = { f(C) if C ∈ I; r(C) − |C| otherwise; } (5)

with r the rank function as in Definition 3. If a solution C is unfeasible, then z_f(C) returns a negative number, whereas if C is feasible, then z_f(C) outputs a non-negative number. When studying additional constraints on the solution space the problem becomes more involved, so we require a different notion of local optimality.

Definition 16.
Let f : 2^V → ℝ≥0 be a submodular function, let M = (V, I) be a matroid and let α > 0. A set S ∈ I is a (1 + α)-local optimum if the following hold:
• (1 + α) f(S) ≥ f(S \ {u}) for all u ∈ S;
• (1 + α) f(S) ≥ f(S ∪ {v}) for all v ∈ V \ S s.t. S ∪ {v} ∈ I;
• (1 + α) f(S) ≥ f((S \ {u}) ∪ {v}) for all u ∈ S and v ∈ V \ S s.t. (S \ {u}) ∪ {v} ∈ I.

We prove that, in the case of a symmetric submodular function, a (1 + α)-local optimum as in Definition 16 yields a constant-factor approximation ratio. To this end, we make use of the following well-known result.

Theorem 17 (Theorem 1 in Lee et al. [15]).
Let M = (V, I) be a matroid and I, J ∈ I be two independent sets. Then there is a mapping π : J \ I → (I \ J) ∪ {∅} such that
• (I \ {π(b)}) ∪ {b} ∈ I for all b ∈ J \ I;
• |π⁻¹(e)| ≤ 1 for all e ∈ I \ J.

As a mild revision of Lemma 1 and Theorem 3 in Lee et al. [15], the following lemma holds.
Lemma 18.
Consider a non-negative symmetric submodular function f : 2^V → ℝ≥0 and a matroid M = (V, I), and let S be a (1 + ε/n²)-local optimum as in Definition 16. Then S is a (1/3 − ε/n)-approximation for Problem (4).

Proof
Fix a constant ε > 0 and let C ∈ I. Consider a mapping π : C \ S → (S \ C) ∪ {∅} as in Theorem 17. Since S is a (1 + ε/n²)-local optimum it holds that

(1 + ε/n²) f(S) ≥ f((S \ {π(b)}) ∪ {b}) (6)

for all b ∈ C \ S. Thus, it holds that

f(S ∪ {b}) − f(S) ≤ f((S \ {π(b)}) ∪ {b}) − f(S \ {π(b)}) ≤ (1 + ε/n²) f(S) − f(S \ {π(b)}),

where the first inequality follows from (2), and the second one follows from (6). Summing these inequalities for each b ∈ C \ S and using submodularity as in (2) we obtain

f(S ∪ C) − f(S) ≤ Σ_{b ∈ C\S} [f(S ∪ {b}) − f(S)] ≤ Σ_{b ∈ C\S} [(1 + ε/n²) f(S) − f(S \ {π(b)})].

Consider a given order of the elements in C \ S, i.e. C \ S = {b₁, …, b_k}. Then it holds that

Σ_{b ∈ C\S} [(1 + ε/n²) f(S) − f(S \ {π(b)})]
= Σ_{j=1}^{k} [f(S) − f(S \ {π(b_j)})] + k (ε/n²) f(S)
≤ Σ_{j=2}^{k} f( (S ∩ C) ∪ ⋃_{ℓ=1}^{j} {π(b_ℓ)} ) − Σ_{j=2}^{k} f( (S ∩ C) ∪ ⋃_{ℓ=1}^{j−1} {π(b_ℓ)} ) + f((S ∩ C) ∪ {π(b₁)}) − f(S ∩ C) + k (ε/n²) f(S)
≤ (1 + ε/n) f(S) − f(S ∩ C),

where the first inequality follows from (2) and the second inequality follows by taking the telescopic sum together with the fact that k ≤ n. Thus, it follows that

2 (1 + ε/n) f(S) ≥ f(S ∪ C) + f(S ∩ C).

Since f is symmetric, f(S) = f(V \ S) and we have that

3 (1 + ε/n) f(S) ≥ f(S) + f(S ∪ C) + f(S ∩ C) ≥ f(C \ S) + f(C ∩ S) ≥ f(C).

The claim follows by choosing C = opt. □

We use Lemma 18 to perform a run time analysis of the (1+1) EA. We consider the case of the pmut_β mutation, although our proof easily extends to the standard uniform mutation and fmut_β. We experimentally compare these operators in Section 6.2.
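For concreteness, the pmut_β operator analyzed here can be sketched as follows. This is a minimal sketch, not the authors' implementation; we assume, consistent with the probability bounds used in the proofs, that pmut_β first draws the number k of bits to flip from a power-law distribution Pr[K = k] ∝ k^(−β) on {1, …, n} (normalized by H_n^β = Σ_{i=1}^{n} i^(−β)) and then flips a uniformly chosen set of k distinct positions.

```python
import random

def pmut(x, beta, rng=random):
    """Heavy-tailed mutation sketch: flip k bits of the bit string x, where
    k is drawn with probability proportional to k**(-beta) on {1, ..., n}."""
    n = len(x)
    ks = list(range(1, n + 1))
    weights = [k ** (-beta) for k in ks]           # unnormalized k^(-beta) law
    k = rng.choices(ks, weights=weights)[0]        # normalization is implicit
    y = list(x)
    for i in rng.sample(range(n), k):              # k distinct positions, uniform
        y[i] = 1 - y[i]
    return y
```

Under this sketch, a specific single bit-flip has probability (H_n^β)⁻¹/n = Ω(1/n) and the all-bits flip (k = n) has probability (H_n^β)⁻¹ n^(−β) = Ω(n^(−β)), matching the bounds used in the analysis above.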
We perform the analysis by estimating the expected run time until a (1 + ε/n²)-local optimum is reached, and apply Lemma 18 to obtain the desired approximation guarantee. Our analysis yields an improved upper-bound on the run time over that of Friedrich and Neumann [16]. The following theorem holds.

Theorem 19.
The (1+1) EA with mutation pmut_β is a (1/3 − ε/n)-approximation algorithm for Problem (4). Its expected run time is O(ε⁻¹ n⁴ log(n/ε)).

Proof
We perform the analysis assuming that a fitness function as in (5) is used. We divide the run time in two phases. During (Phase 1) the (1+1) EA finds a feasible solution, whereas in (Phase 2) it finds a (1 + ε/n²)-local optimum, given that an independent set has been found.

(Phase 1) Assuming that the initial solution is not an independent set, the (1+1) EA maximizes the function r(C) − |C| until a feasible solution is found. This is equivalent to minimizing the function |C| − r(C). We estimate the run time using the multiplicative drift theorem (Theorem 5). Denote with x_t a solution found by the (1+1) EA after t steps, consider the Markov chain X_t = |x_t| − r(x_t), and consider the first hitting time T = min{ t : X_t = 0 }. Then it holds that X_t ∈ {0} ∪ [1, n]. Moreover, since the probability of removing a single chosen element from the current solution is at least 1/(en), we have

E[X_t − X_{t+1} | X_t] ≥ X_t/(en).

Theorem 5 now yields

E[T] ≤ en log(1 + n).

We conclude that we can upper-bound the run time in this initial phase as O(n log n).

(Phase 2) We estimate the run time in this phase with the multiplicative increase method. Assuming that a feasible solution is reached, all subsequent solutions are feasible, since z_f(C) ≥ 0 for every feasible solution C, whereas z_f(C) < 0 otherwise. We do not optimize f directly but we consider the potential function g_{f,ε} from Lemma 12 (recall that in this case opt is not the global optimum of f, but the highest f-value among all feasible solutions). We prove that for all ε > 0 the (1+1) EA with mutation pmut_β finds a (1/3 − ε/n)-approximation of

g_{f,ε}(S) = f(S) + ε opt/n,

within expected O(ε⁻¹ n⁴ log(n/ε)) fitness evaluations. We apply Lemma 12(3) and conclude that the (1+1) EA with mutation pmut_β finds a (1/3 − ε/n)-approximation of f within O(ε⁻¹ n⁴ log(n/ε)) fitness evaluations.

Denote with y_t the solution found by the (1+1) EA at time step t + ℓ, for all t ≥
0, with ℓ the number of steps in Phase 1. In other words, y₀ is the first feasible solution found by the (1+1) EA, and y_t is the solution found after t additional steps. Again, the solutions y_t are independent sets for all t ≥
0. For any solution y_t it is always possible to improve the fitness to (1 + ε/n²) g_{f,ε}(y_t) in the next iteration, by adding or removing a single vertex, or by swapping two bits, unless y_t is already a (1 + ε/n²)-local optimum. Again, we refer to any single bit-flip or swap that yields such an improvement of the fitness as a favorable move. We give an upper-bound on the number of favorable moves k to reach a (1 + ε/n²)-local optimum, by solving the following equation

(1 + ε/n²)^k (ε opt/n) ≤ opt + ε opt/n ⟺ (1 + ε/n²)^k ≤ n/ε + 1,

where we have used that for the initial solution y₀ it holds that g_{f,ε}(y₀) ≥ ε opt/n (see Lemma 12(2)). From solving the inequality it follows that the (1+1) EA reaches a (1 + ε/n²)-local maximum after at most k = O(ε⁻¹ n² log(n/ε)) favorable moves. Since the probability of performing a single chosen bit-flip or a swap is at least (H_n^β)⁻¹ 2⁻β n⁻² = Ω(n⁻²), the expected waiting time for a favorable move to occur is at most O(n²), hence we can upper-bound the expected run time in Phase 2 as O(ε⁻¹ n⁴ log(n/ε)). □

Given a directed graph G = (V, E), we consider the problem of finding a subset U ⊆ V of nodes such that the number of edges leaving U is maximal. This problem is the maximum directed cut problem (Max-Di-Cut) and it is known to be NP-complete.

For each subset of nodes U ⊆ V, consider the set ∆(U) = {(e₁, e₂) ∈ E : e₁ ∈ U and e₂ ∉ U} of all edges leaving U. We define the cut function f : 2^V → ℝ≥0 as

f(U) = |∆(U)|. (7)

The Max-Di-Cut can be approached by maximizing the cut function as in (7). Note that this function is non-negative. Moreover, it is always submodular and, in general, non-monotone (see e.g. Feige et al. [17] and Friedrich et al. [16]). Hence, this approach to the Max-Di-Cut can be formalized as in Problem (3) in Section 4.

We select the 123 large instances used by Wagner et al.
[25]; the number of vertices ranges from about 379 to over 6.6 million, and the number of edges ranges from 914 to over 56 million. All 123 instances are available online [26].

The instances come from a wide range of origins. For example, there are 14 collaboration networks (ca-*, from various sources such as Citeseer, DBLP, and also Hollywood productions), five infrastructure networks (inf-*), six interaction networks (ia-*, e.g. about email exchange), 21 general social networks (soc-*, e.g., Flickr, LastFM, Twitter, Youtube), 44 subnets of Facebook (socfb-*, mostly from different American universities), and 14 web graphs (web-*, showing the state of various subsets of the Internet at particular points in time). We take these graphs and run Algorithm 1 with seven mutation operators: fmut_β and pmut_β with β ∈ {1.5, 2.5, 3.5}, and unif. We use an intuitive bit-string representation based on vertices, and we initialize uniformly at random. Each edge has a weight of 1.

For each instance–mutation pair, we perform 100 independent runs (100 000 evaluations each) with an overall computation budget of 72 hours per pair. Out of the initial 123 instances, 67 finish their 100 repetitions per instance within this time limit. We report on these 67 in the following. We use the average cut size achieved in the 100 runs as the basis for our analyses.

Firstly, we rank the seven approaches based on the average cut size achieved (best rank is 1, worst rank is 7). Table 2 shows the average rank achieved by the different mutation approaches. unif performs best at the lower budget and worst at the higher budget, which we take as a strong indication that few bit-flips are initially helpful to quickly improve the cut size, while more flips are helpful later in the search to escape local optima. At the higher budget, both fmut_β and pmut_β perform better than unif, independent of the parameter chosen.
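The cut objective (7) and the basic (1+1) EA loop underlying these experiments can be sketched as follows. This is a simplified, unit-weight sketch, not the authors' Algorithm 1; the function names and the "at least one flip" uniform mutation variant are illustrative.

```python
import random

def cut_value(edges, x):
    """Cut function (7): number of directed edges (u, v) leaving the
    selected set U = {i : x[i] = 1}, i.e. with u in U and v outside U."""
    return sum(1 for (u, v) in edges if x[u] == 1 and x[v] == 0)

def unif_mutation(x, rng):
    """Uniform mutation: flip each bit independently with probability 1/n,
    resampling until at least one bit actually flips."""
    n = len(x)
    while True:
        y = [b ^ (rng.random() < 1.0 / n) for b in x]
        if y != x:
            return y

def one_plus_one_ea(n, edges, mutate, evaluations, rng):
    """(1+1) EA: keep the mutated solution iff it does not decrease fitness."""
    x = [rng.randint(0, 1) for _ in range(n)]      # uniform random initialization
    fx = cut_value(edges, x)
    for _ in range(evaluations):
        y = mutate(x, rng)
        fy = cut_value(edges, y)
        if fy >= fx:                               # accept on ties as well
            x, fx = y, fy
    return x, fx
```

Swapping `unif_mutation` for a heavy-tailed or fmut-style operator reproduces the seven configurations compared in Table 2.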
In particular, pmut_β clearly performs better than fmut_β at both budgets; however, which β performs best for pmut_β differs between the two budgets (β = 1.5 at one, β = 3.5 at the other).

Table 2: Average ranks (based on mean cut size) of the seven mutation operators (fmut_β and pmut_β with β ∈ {1.5, 2.5, 3.5}, and unif) at t = 10 000 and t = 100 000 iterations (lower ranks are better).

To investigate the relative performance difference and the statistical significance thereof, we perform a Nemenyi two-tailed test (see Figure 1). This test performs all-pairs comparisons on Friedman-type ranked data. The results are as expected and consistent with the average ranks reported in Table 2.

Across the 67 instances, the achieved cut sizes vary significantly (see Table 3). For example, the average gap between the worst and the best approach is 42.1% at 10 000 iterations, and it still is 7.4% at 100 000 iterations. Also, when we compare the best fmut_β and pmut_β configurations (as per Table 3), then we can see that (i) pmut_β is better than or equal to fmut_β, and (ii) the performance

Footnote: In contrast to our earlier work [27], we are comparing against unif, which performs at least one flip, thus making it a fairer comparison.
Footnote: Source categories of the 67 instances: 2x bio-*, 6x ca-*, 5x ia-*, 2x inf-*, 1x soc-*, 40x socfb-*, 4x tech-*, 7x web-*. The largest graph is socfb-Texas84 with 36 364 vertices and 1 590 651 edges.

(a) 10 000 evaluations (b) 100 000 evaluations
Fig. 1: Critical Distance (CD) diagram based on a Nemenyi two-tailed test using the average rankings. CD (top left) shows the critical distance. Distances larger than CD correspond to a statistically significant difference in the ranking. Relationships within a critical distance are marked with a horizontal bar.

              t = 10k                          t = 100k
          total   best pmut_β vs fmut_β    total   best pmut_β vs fmut_β
min gap    0.8%          1.1%               0.0%          0.0%
mean gap  13.0%          2.3%               1.9%          0.8%
max gap   42.1%          4.7%               7.4%          6.3%

Table 3: Summary of cut-size differences. "total" refers to the gap between the best and worst performing mutation out of all seven. The two highlighted pairs compare the best fmut_β and pmut_β values listed in Table 2.

advantage of pmut_β over fmut_β is 2.3% and 0.8% on average, with a maximum of 4.7% and 6.3% (i.e., for 10 000 and 100 000 evaluations, respectively).

To investigate the extent to which mutation performance and instance features are correlated, we perform a 2D projection using a principal component analysis of the instance feature space, based on the features collected from [26]. We then consider the performance of the seven mutation operators at a budget of 100 000 evaluations, and we visualize it in the 2D space (see Figure 2). In these projections, the very dense cluster in the top left is formed exclusively by the socfb-* instances, and the ridge from the very top left to the bottom left is made up of (from top to bottom) ia-*, tech-*, web-*, and ca-* instances. The "outlier" on the right is web-BerkStan, due to its extremely high values of the average vertex degree, the number of triangles formed by three edges (3-cliques), the maximum number of triangles formed by an edge, and the maximum i-core number, where an i-core of a graph is a maximal induced subgraph in which each vertex has degree at least i.

Interestingly, the performance seems to be correlated with the instance features and thus, indirectly, with their origin.
For example, we can see in Figure 2a that unif does not reach a cut size that is within 1% of the best observed average for many of the socfb-* instances (shown as many black dots in the tight socfb-* cluster). In contrast to this, the corresponding footprint of pmut (Figure 2b) shows only red dots, indicating that it always performs within 1% of the best observed.

Lastly, we summarize the results in Figure 2c based on the concept of instance difficulty. Here, the color denotes, for each instance, the number of mutation operators that achieve a cut size within 1% of the best observed average. Interestingly, many ia-*, ca-*, web-* and tech-* instances are solved well by many mutation operators. In contrast to this, many socfb-* instances are blue, meaning that they are solved well by just very few mutation operators – in particular, by our pmut_β.

We study an instance of the general feature selection problem: given a set of observations, find a subset of relevant features (variables, predictors) for use in model construction.

We consider the following framework. Suppose that n time series X⁽¹⁾, …, X⁽ⁿ⁾ are given, each one representing a sequence of temporal observations. For each sequence X⁽ⁱ⁾, define the corresponding temporal variation as a sequence Y⁽ⁱ⁾ with Y⁽ⁱ⁾_j = X⁽ⁱ⁾_j − X⁽ⁱ⁾_{j−1}.

We perform feature selection on the variables Y⁽ⁱ⁾, assuming that the joint probability distribution p(Y⁽¹⁾, …, Y⁽ⁿ⁾) is Gaussian. Specifically, given a cardinality constraint k, we search for

(a) unif footprint (b) pmut footprint (c) Instance difficulty
Fig. 2: Mutation operator footprints (left and middle plots): instances are marked red if the mutation is at most 1% away from the best-observed performance. Instance difficulty (right-most plot): the color encodes the number of algorithms that perform within 1% of the oracle performance. Note: a principal component analysis is used for the projection of the instances from the feature space into 2D.

a subset S ⊆ [n] of size at most k s.t.
the corresponding series χ_S := {Y⁽ⁱ⁾ : i ∈ S} are optimal predictors for the overall variation in the model. Variations of this setting are found in many applications (see e.g. Singh et al. [28], Zhu and Stein [29], and Zimmerman [30]).

We use the mutual information as an optimization criterion for identifying highly informative random variables among the {Y⁽ⁱ⁾} (see Caselton and Zidek [31]). For a subset S ⊆ [n], we define the corresponding mutual information as

MI(S) = −(1/2) Σ_i log(1 − ρ_i²), (8)

where the ρ_i are the canonical correlations between χ_S and χ_{V\S}. It is well-known that the mutual information as in (8) is a symmetric non-negative submodular function (see Krause et al. [32]). Note also that a cardinality constraint k is equivalent to a matroid constraint, with independent sets all subsets S ⊆ [n] of cardinality at most k. Hence, this approach to feature selection consists of maximizing a non-negative symmetric submodular function under a matroid constraint, as in Problem (4). Following the framework outlined in Section 5, we approach this problem by maximizing the following fitness function

z_MI(S) = { MI(S) if |S| ≤ k; k − |S| otherwise. } (9)

We apply this methodology to perform feature selection on an air pollution dataset (see Rohde and Muller [33]). This dataset consists of hourly air NO data from over 1500 sites, during a four month interval from April 5, 2014 to August 5, 2014.

For a fixed cardinality constraint k = 200, …, 800 and varying time budget, we run the (1+1) EA with uniform mutation, pmut_β and fmut_β with β = 1.5, 2.5, 3.
5. The results are displayed in Figure 3. We observe that for a small time budget and small k, heavy-tailed mutations outperform the standard uniform mutation and fmut_β. For large k, all mutation operators achieve similar performance. These results suggest that for a small time budget and small k larger jumps are beneficial, whereas standard mutation operators may be sufficient to achieve a good approximation of the optimum, given more resources.
(panels: Time Budget = 1K, 2.5K, 5K; x-axis: size of the cardinality constraint k; y-axis: fitness value [sample mean])
Fig. 3: Solution quality achieved by the (1+1) EA with various mutation rates on a fitness function as in (9), for fixed cardinality constraint k and varying time budget. We consider the (1+1) EA with uniform mutation, pmut_β and fmut_β with β = 1.5, 2.5, 3.
5. Each dot corresponds to the sample mean of 100 independent runs.
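For jointly Gaussian variables, the mutual information (8) can equivalently be computed from covariance determinants, since −(1/2) Σ_i log(1 − ρ_i²) = (1/2) log( det Σ_{S,S} · det Σ_{S̄,S̄} / det Σ ), where Σ is the joint covariance matrix. The objective (8) and the fitness function (9) can then be sketched as follows; this is a pure-Python, small-matrix sketch, and all helper names are illustrative rather than taken from the paper.

```python
import math

def det(m):
    """Determinant by Gaussian elimination with partial pivoting
    (sufficient for the small covariance blocks used here)."""
    a = [row[:] for row in m]
    n, d = len(a), 1.0
    for i in range(n):
        p = max(range(i, n), key=lambda r: abs(a[r][i]))
        if abs(a[p][i]) < 1e-12:
            return 0.0
        if p != i:
            a[i], a[p] = a[p], a[i]
            d = -d
        d *= a[i][i]
        for r in range(i + 1, n):
            f = a[r][i] / a[i][i]
            for c in range(i, n):
                a[r][c] -= f * a[i][c]
    return d

def submatrix(sigma, rows, cols):
    return [[sigma[i][j] for j in cols] for i in rows]

def mutual_information(sigma, S):
    """MI(S) as in (8), via 1/2 * log(det Σ_SS * det Σ_S̄S̄ / det Σ)."""
    n = len(sigma)
    Sbar = [i for i in range(n) if i not in S]
    if not S or not Sbar:
        return 0.0
    return 0.5 * math.log(det(submatrix(sigma, S, S))
                          * det(submatrix(sigma, Sbar, Sbar)) / det(sigma))

def z_mi(sigma, S, k):
    """Fitness (9): MI(S) under the cardinality constraint, else k - |S| < 0."""
    return mutual_information(sigma, S) if len(S) <= k else k - len(S)
```

For a 2x2 covariance with correlation ρ, this reduces to −(1/2) log(1 − ρ²), the single-canonical-correlation case of (8); `z_mi` penalizes constraint violations exactly as (9) does.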
In the pursuit of optimizers for complex landscapes that arise in industrial problems, we have identified a new mutation operator. This operator allows for good performance of the classical (1+1) EA when optimizing not only simple artificial test functions, but the whole class of non-negative submodular functions, as well as symmetric submodular functions under a matroid constraint. As submodular functions find applications in a variety of natural settings, it is interesting to consider the potential utility of heavy-tailed operators as building blocks for optimizers of more complex landscapes, where submodularity can be identified in parts of these landscapes.
Acknowledgment
Markus Wagner has been supported by ARC Discovery Early Career Researcher Award DE160100850.