Satisfiability and Evolution
Adi Livnat, Christos Papadimitriou, Aviad Rubinstein, Gregory Valiant, Andrew Wan
Abstract

We show that, if truth assignments on n variables reproduce through recombination so that satisfaction of a particular Boolean function confers a small evolutionary advantage, then a polynomially large population over polynomially many generations (polynomial in n and the inverse of the initial satisfaction probability) will end up almost surely consisting exclusively of satisfying truth assignments. We argue that this theorem sheds light on the problem of the evolution of complex adaptations.

Affiliations: Adi Livnat, Department of Biological Sciences, Virginia Tech. Christos Papadimitriou, Computer Science Division, University of California at Berkeley, CA 94720, USA, [email protected]. Aviad Rubinstein, Computer Science Division, University of California at Berkeley, CA 94720, USA, [email protected]. Gregory Valiant, Computer Science Department, Stanford University, CA 94305. Andrew Wan, The Simons Institute for the Theory of Computing, UC Berkeley; parts of this work were completed while at Harvard, and at IIIS, Tsinghua University.
Introduction
The TCS community has a long history of applying its perspectives and tools to better understand the processes around us, from learning to multi-agent systems, game theory, and mechanism design. By and large, the efforts to understand these areas from a rigorous and algorithmic perspective have been very successful, leading to both rich theories and practical contributions. Evolution is, perhaps, one of the most blatantly algorithmic processes, yet our computational understanding of it is still in its infancy (see [16] for a pioneering study), and we currently lack a computational theory explaining its apparent success. Algorithmically, how plausible are the origins of evolution and the emergence of self-replication? Is evolution surprisingly efficient or surprisingly inefficient? What are the necessary criteria for evolution-like algorithms to yield rich, interesting, and diverse ecosystems? Why is recombination (i.e., sexual reproduction) more successful than asexual reproduction? Given the reshuffling of genomes that occurs through recombination, how can complex traits that depend on many different genes arise and spread in a population?

In this work, we begin to tackle this last question: why are complex traits that may depend on many different genes able to arise efficiently in polynomial populations with recombination? In the standard view of evolution, a variant of a particular gene is more likely to spread across a population if it makes its own contribution to the overall fitness, independent of the contributions of variants of other genes. How, then, can complex, multi-gene traits spread in a population? This seems especially problematic for multi-gene traits whose contribution to fitness does not decompose into small additive components associated with each gene variant: traits with the property that, if even one gene variant differs from the one needed for the right combination, there is no benefit, or even a net harm.
Here, we provide one rigorous argument for how such complex traits can efficiently spread throughout a population. While we consider this question in a model that makes considerable (but justifiable) simplifications, the model yields a theoretically rigorous contribution to the fundamental problem of how evolution can produce complexity.
Motivating example: Waddington’s experiment.
In 1953 the great experimentalist Conrad Waddington exposed the pupae of a population of Drosophila melanogaster to a heat shock, and noticed that in some of the adults that developed, the appearance of the wings had changed (they lacked a complete posterior crossvein) [17]. He then maintained a population of flies in which only those with altered wings were allowed to reproduce. By repeating the procedure of heat shock and selection over the generations, the percentage of flies with altered wings increased over time to values close to one. Even more interestingly, beginning at generation fourteen, some flies exhibited the new trait even without having been treated with heat shock.

At first sight, this surprising phenomenon, known as genetic assimilation, recalls Lamarck's now discredited belief that acquired traits can be inherited. However, Boolean functions provide a purely genetic explanation, which extends the idea originally offered informally by Stern [15] (see also [2, 5]): Suppose that the phenotype "altered wings" is a Boolean function of n genes x_1, ..., x_n with two alleles (variants) each, thought of as {−1, 1} values, together with a {−1, 1} environmental variable h (standing for "high temperature"): the wings are altered exactly when

x_1 + x_2 + · · · + x_n + (1 + h) · 2k ≥ n,

for some integer k. Let µ_i^t be the average value of x_i in the population at time t, and assume the genotype frequencies at time t are distributed according to a distribution µ^t (the reason for denoting the distribution this way will become clear). If mating occurs at random with free recombination then, in expectation, the average value of each x_i in the next generation is given by

µ_i^{t+1} = E_{µ^t}[f(x) · x_i] / E_{µ^t}[f],   (1)

where f(x) = 1 exactly when a fly having genotype x will develop altered wings (i.e., the above inequality is satisfied) and f(x) = 0 otherwise. We then assume that the next generation will be distributed according to a product distribution µ^{t+1}, where each x_i has expectation µ_i^{t+1}. By approximating the genotype frequencies of the population for each generation in this way, it can be shown by calculation that a trait with this genotypic specification (a) is very rare in the population under normal temperature h = −1; (b) becomes much more common under high temperature h = 1; (c) jumps to just above 50% after the first breeding under h = 1; (d) is nearly fixed after successive breedings with h = 1; and (e) remains quite common if h subsequently returns to −1.
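The calculation just described can be reproduced directly by iterating recurrence (1) over all 2^n genotypes. The following is a minimal Python sketch; the values n = 10 and k = 1 are illustrative choices (the paper's exact choices are garbled in this copy), so the percentages differ from those quoted above, but the qualitative stages (a)–(e) are reproduced:

```python
import itertools

# A minimal sketch of the genetic-assimilation calculation: iterate
# recurrence (1), with f in {0, 1}, over all 2^n genotypes.
n, k = 10, 1  # illustrative sizes, not the paper's

def altered(x, h):
    # the trait appears iff x_1 + ... + x_n + (1 + h) * 2k >= n
    return sum(x) + (1 + h) * 2 * k >= n

def prob(x, mu):
    p = 1.0
    for xi, m in zip(x, mu):
        p *= (1 + m) / 2 if xi == 1 else (1 - m) / 2
    return p

def trait_freq(mu, h):
    return sum(prob(x, mu) for x in itertools.product((-1, 1), repeat=n)
               if altered(x, h))

def breed(mu, h):
    # recurrence (1) with f in {0, 1}: condition on the trait, then read off
    # the new mean of each x_i
    genos = [x for x in itertools.product((-1, 1), repeat=n) if altered(x, h)]
    w = sum(prob(x, mu) for x in genos)
    return [sum(prob(x, mu) * x[i] for x in genos) / w for i in range(n)]

mu = [0.0] * n              # initial population: uniform
print(trait_freq(mu, -1))   # (a) very rare at normal temperature
print(trait_freq(mu, +1))   # (b) far more common under heat shock
for _ in range(12):         # (c), (d) selective breeding under heat shock
    mu = breed(mu, +1)
print(trait_freq(mu, +1))   # near fixation
print(trait_freq(mu, -1))   # (e) still common with the heat shock removed
```
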
Note:
Our interpretation of Waddington's experiment is a simplification. First, we consider only the distribution of genotypes in each generation to determine the distribution of the next; instead, we could first take a finite sample according to the present distribution and use that sample to calculate the distribution of the next generation. Such an approximation can only become exact when the population size is infinite, but it is a standard and useful one in population genetics (and we shall eventually consider finite populations for our main result). We also assumed that each individual of the new generation is produced by sampling each gene independently of the other genes, with probability equal to the frequencies of the two alleles of this gene in the parent population (the adults of the previous generation with altered wings). This assumption turns out to be justified in the settings that we will consider, as will be discussed in the following section.
Populations of truth assignments
This way of looking at Waddington's experiment brings about a very natural question: Is this amplification of satisfying truth assignments (outcomes (c) and (d) of the experiment described above) a property of threshold functions, or is it more general? Does it hold for all monotone functions, for example?
For all Boolean functions? Consider an arbitrary Boolean function f : {−1, 1}^n → {0, 1} of n binary genes (in the absence of the environmental variable h, which was crucial in Waddington's phenomenon); here we assume for simplicity haploid organisms, that is, each individual has only one copy of each gene (see the next section for any unfamiliar terms and concepts from evolution). What if genotypes satisfying this Boolean function had a slight advantage under natural selection? (In Waddington's experiment, they had an absolute advantage because of the experimental design.) For example, imagine that genotypes satisfying f survive to adulthood more than the others, in expectation, by a factor of (1 + ǫ), for some small ǫ > 0. Would this trait (that is, satisfaction of the Boolean function f) eventually be fixed in the population? And, if so, could this be a subtle mechanism for introducing complex adaptations in a population?

To reflect our assumption that satisfaction of f confers only an ǫ-advantage, we may take a function f : {−1, 1}^n → {1, 1 + ǫ}, where we regard the value 1 + ǫ as "satisfied" and the value 1 as "unsatisfied". We track the allele frequencies from generation to generation as in Waddington's experiment: Equation (1) gives us the average value µ_i^{t+1} of each x_i in the next generation, and we describe the next generation by the product distribution µ^{t+1}. Suppose that we continue this process, starting from distribution µ^0, and defining {µ_i^1}, {µ_i^2}, ..., {µ_i^t}, ... as above. Consider the average fitness of the population at time t, defined as µ^t(f) = Pr_{µ^t}[f(x) = 1 + ǫ]. The question is, when does µ^t(f) approach one? Our first result states that, for monotone functions, it does after O(n / (ǫ µ^0(f))) steps:

Theorem 1. If f is monotone, then µ^t(f) ≥ 1 − n(1 + ǫ) / (ǫ t µ^0(f)).

Note:
This nontrivial result also serves to illustrate one point: this work is not about satisfiability heuristics (monotone functions are not an impressive benchmark in this regard...). Heuristics are about finding good individuals in a population. In contrast, evolution is about creating good populations. This is our focus here.

Our ambition is to prove the same result for all Boolean functions. Immediately we see that this is impossible if we insist on an infinite population: Consider the function f = x_1 ⊕ x_2: starting with the uniform distribution at time t = 0, the above dynamics would leave the distribution unchanged, for all time, and hence µ^t(f) = 1/2 for all t. The parity function is not the only Boolean function with this property: for example the function "Σ_{i=1}^n x_i = k", if started at µ_i = k/n, will stay at that spot forever, and will always have µ^t(f) = O(1/√k). However, experimentation shows that these "spurious fixpoints" are not absorbing, and evolution pulls the distribution away from them and towards satisfaction. That is, this disappointing phenomenon is an artifact of the infinite-population simplification. Indeed, random genetic drift due to sampling effects has been considered to be a significant component of evolution at the molecular level (it is possible for an allele to become fixed in the population even in the absence of selection). Thus, we need to make the model more realistic.

We adopt a model consisting of the following process: At each generation t we create a large population of N individuals (we call this the "sampling" step) by sampling N times from the product distribution µ^t to obtain y^(1), ..., y^(N) (N is assumed to remain constant from one generation to the next, which is a standard assumption in population genetics [6]). The empirical allele frequencies of the sample are given by a vector ν^t, where for each i we have:

ν_i^t = (1/N) Σ_{j=1}^N y_i^(j).
We write ν^t ∼ B(µ^t) to denote a draw of these empirical frequencies, and we also use ν^t to denote the implied product (Bernoulli) distribution. We then enforce the assumed selection advantage of satisfaction to obtain the "in-expectation" frequencies of the subsequent generation:

µ_i^{t+1} = E_{ν^t}[f(x) · x_i] / E_{ν^t}[f].   (2)

We show that when selection is weak, any satisfiable Boolean function will almost surely be always satisfied after polynomially many time steps.

Theorem 2 (informal statement). For any satisfiable Boolean function f of n variables and any sufficiently small ǫ > 0, after T generations of N individuals µ^T(f) = 1 with probability arbitrarily close to one, where T and N are polynomial functions of n, 1/ǫ, and 1/µ^0(f).

The proof of Theorem 2 shows why the population does not become stuck at the previously discussed "spurious fixpoints": sampling effects ensure movement over sufficiently many generations, and selection ensures that movement is made towards satisfaction.
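The two-step process (a sampling step followed by the selection update of Equation (2)) is easy to simulate. Below is a minimal sketch for the parity example f = x_1 ⊕ x_2, started exactly at its spurious fixpoint; the population size, selection strength, generation cap, and seed are all illustrative assumptions, not the bounds of Theorem 2:

```python
import random

# A minimal sketch of the sampling + selection process for the parity
# function: fitness 1 + eps if x1 XOR x2 is satisfied (x1 * x2 = -1), else 1.
random.seed(0)
N, eps, T = 500, 0.2, 20000  # illustrative parameters

def fitness(x):
    return 1 + eps if x[0] * x[1] == -1 else 1  # satisfied iff x1 != x2

mu = [0.0, 0.0]  # the spurious fixpoint of the infinite-population dynamics
for t in range(T):
    # sampling step: N genotypes drawn from the product distribution mu
    pop = [tuple(1 if random.random() < (1 + m) / 2 else -1 for m in mu)
           for _ in range(N)]
    # selection step, equation (2), applied to the empirical sample
    W = sum(fitness(x) for x in pop)
    mu = [sum(fitness(x) * x[i] for x in pop) / W for i in range(2)]
    if abs(mu[0]) == 1.0 and abs(mu[1]) == 1.0:
        break  # the population has fixed at a vertex of the hypercube

print(t, mu)
```

Drift moves the empirical frequencies off the fixpoint, selection then amplifies the anticorrelation between the two genes, and the run typically ends with the whole population at one of the two satisfying vertices (1, −1) or (−1, 1), in line with Theorem 2.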
Outline of the paper
In the next section we introduce some basic concepts from population genetics, define and justify our simplified model, and present a result due to Nagylaki [12] implying that, if selection is weak, then one can assume that the genotype distribution is a product distribution. In Section 3, we prove Theorem 1 on monotone functions. Our main result is given in Section 4, and its proof is outlined there; the full proof is detailed in the Appendix. In Section 5, we conclude with a discussion of our result and a number of open problems.
The model

The genetic makeup of an organism is its genotype, which specifies one allele (gene variant) for each genetic site, or "locus," in the haploid case. We shall be focusing on n specific genes of interest (say, a few dozen out of the many thousands of genes of the species). At each locus, we assume that there are two alleles segregating in the population (hence the relevance of Boolean functions). Thus, a genotype will be a vector in {±1}^n. We assume the species reproduces sexually (this is crucial; see the discussion in the last section). In a sexual species reproduction proceeds through recombination, that is, the formation of a new genotype by choosing alleles from two parental genotypes in the previous generation. To produce each generation, the individuals mate at random (we also assume no bipartition into sexes) and there is no generation overlap (that is, the new generation is produced en masse just before the death of the previous one). We assume that the population size is constant at some large number N (expressed as a function of n, the number of genes of interest, which is the basic parameter). Each genotype g ∈ {±1}^n is assumed to have a fitness value equal to the expected number of offspring this genotype will produce. We also assume that the genes recombine freely, that is, for any two genes i, j of an offspring, the probability that their alleles come from the same parent is exactly one half (and not larger, as is the case if the two genes are linked).

These assumptions are simplifications of the standard model of population genetics used broadly in the literature, and generally trusted to preserve the essence of selection in sexual populations. The Boolean assumption is of course meant to bring into play mathematical insights from that field, but we believe that it is not restrictive. Crucially, we assume that each individual of the new generation is produced by sampling each of its n genes independently, with probability equal to the probability of occurrence of that allele in the parent generation.
That is, we assume that the distribution of the genotypes in a generation is a product distribution. This situation is called in the population genetics literature linkage equilibrium, or the Wright manifold [19, 20]. In general, genotype frequencies are known to be correlated, and this correlation (the distance from the product distribution) is called linkage disequilibrium [7]; it is of importance and interest in the study of evolution. However, in the absence of selection, a standard argument shows that the distribution of a population quickly approaches linkage equilibrium (arguments exist both for finite and infinite populations). Our previous assumption places our experiment in a regime known as weak selection. Weak selection means that the fitness values lie in a small interval [1 − ǫ, 1 + ǫ], where ǫ is called the selection strength. An elegant and powerful result due to Thomas Nagylaki [12] states that, under weak selection, evolution proceeds to a point very close to linkage equilibrium. In particular, assume that a population evolves as we described above in a regime of weak selection of strength ǫ, and let m be the total number of alleles (this is 2n in our case; in fact, Nagylaki's theorem also holds under diploidy and partial recombination). By linkage disequilibrium we mean formally the L∞ distance between the genotype distribution and the product distribution:

Theorem 3 (Nagylaki's Theorem, see [12]). Under weak selection, and after O(log m · log 1/ǫ) generations, linkage disequilibrium is O(ǫ).

In our setting ǫ is minuscule, so Nagylaki's Theorem motivates our assumption that populations are formed "by independent sampling of the genetic soup." We strongly believe that our theorem is true for large ǫ as well, but this remains open, as discussed in the last section.

Monotone functions

In this section we give a self-contained proof of Theorem 1. The proof is simple, once a connection is made to discrete Fourier analysis.
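Before turning to the proof, Theorem 1 can be sanity-checked numerically by iterating recurrence (1), with fitness values {1, 1 + ǫ}, on a small monotone function; a minimal sketch (majority on five variables, all parameters illustrative):

```python
import itertools

# Numerical illustration of Theorem 1 (not part of the proof): iterate the
# in-expectation dynamics (1) with fitness values {1, 1+eps} on majority,
# a monotone function, and watch the satisfaction probability approach 1.
n, eps, T = 5, 0.1, 2000
cube = list(itertools.product((-1, 1), repeat=n))

def f(x):
    return 1 + eps if sum(x) > 0 else 1  # majority of five {-1,1} genes

def pr(x, mu):
    p = 1.0
    for xi, m in zip(x, mu):
        p *= (1 + m) / 2 if xi == 1 else (1 - m) / 2
    return p

mu = [0.0] * n  # uniform start, so mu^0(f) = 1/2
for t in range(T):
    Ef = sum(pr(x, mu) * f(x) for x in cube)
    mu = [sum(pr(x, mu) * f(x) * x[i] for x in cube) / Ef for i in range(n)]

sat = sum(pr(x, mu) for x in cube if f(x) > 1)
print(sat)  # close to one
```
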
In what follows, we assume familiarity with Fourier analysis over the Boolean cube for product distributions. We briefly review some basic facts and describe the notation used in our proofs.

For µ = (µ_1, ..., µ_n) ∈ [−1, 1]^n and a function f : {−1, 1}_µ^n → R, where {−1, 1}_µ^n denotes the Boolean cube endowed with the product distribution given by µ_i = E[x_i], we consider the µ-biased Fourier decomposition of f. Let σ_i^2 = 1 − µ_i^2 be the variance of each bit. We denote the µ-biased Fourier coefficients by f̂(S; µ) = E_µ[f · φ_S^µ], where φ_S^µ = Π_{i∈S} (x_i − µ_i)/σ_i. Let D_i^(µ) f = (σ_i/2)(f_{i=1} − f_{i=−1}) be the (scaled) difference operator for functions over {−1, 1}_µ^n, where f_{i=b} denotes f with x_i fixed to b. We have that

D_i^(µ) f = Σ_{S∋i} f̂(S; µ) φ_{S\{i}}^µ,   (3)

and in particular E_µ[D_i^(µ) f] = f̂(i; µ), which we will use repeatedly throughout our proofs.

Our first step is to observe that the change in allele frequencies from one generation to the next may be expressed in terms of f's linear Fourier coefficients. Let µ be the vector which specifies the allele frequencies of the population at time t. Then, letting µ′ be the allele frequency vector at time t + 1 and using the selection specified by Equation (1), we have that

µ′_i − µ_i = σ_i f̂(i; µ) / E_µ[f].   (4)

This follows immediately from the definitions:

σ_i · f̂(i; µ) = σ_i · E_µ[f · φ_i^µ] = E_µ[f · x_i] − E_µ[f] · µ_i = E_µ[f] · µ′_i − E_µ[f] · µ_i.

Our proof uses the following well-known facts, which are easily derived from the basic notions (see Chapter 2.3, [13]). First, the influences of a monotone function are given by its linear coefficients. (For a function f : {−1, 1}_µ^n → R, we denote its influence in direction i by Σ_{S∋i} f̂(S; µ)^2.) Next, the inequality of Poincaré lower bounds the total influence of a function by its variance.
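Equation (4) and the identity behind it can be checked by direct enumeration. A small sketch, using a randomly chosen f with values in {1, 1 + ǫ} and arbitrary biases (all choices illustrative):

```python
import itertools, math, random

# Direct verification of equation (4): mu'_i - mu_i = sigma_i * fhat(i; mu) / E[f],
# where fhat(i; mu) = E[f * phi_i] is the i-th biased linear Fourier coefficient.
random.seed(1)
n, eps = 4, 0.3
cube = list(itertools.product((-1, 1), repeat=n))
table = {x: random.choice((1, 1 + eps)) for x in cube}  # a random {1, 1+eps} function
mu = [0.2, -0.5, 0.0, 0.7]                              # arbitrary biases
sigma = [math.sqrt(1 - m * m) for m in mu]

def pr(x):
    return math.prod((1 + m) / 2 if xi == 1 else (1 - m) / 2
                     for xi, m in zip(x, mu))

Ef = sum(pr(x) * table[x] for x in cube)
for i in range(n):
    fhat_i = sum(pr(x) * table[x] * (x[i] - mu[i]) / sigma[i] for x in cube)
    mu_next_i = sum(pr(x) * table[x] * x[i] for x in cube) / Ef  # selection step
    assert abs((mu_next_i - mu[i]) - sigma[i] * fhat_i / Ef) < 1e-12
print("equation (4) verified")
```
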
The versions below have been scaled to our setting, and can be obtained by applying the original facts to a Boolean function g : {−1, 1}_µ^n → {−1, 1} and setting f(x) = 1 + (ǫ/2)(1 + g(x)).

Proposition 4.
Let f : {−1, 1}_µ^n → {1, 1 + ǫ} be monotone. Then for all i ∈ [n]:

Σ_{S∋i} f̂(S; µ)^2 = (ǫ σ_i / 2) · f̂(i; µ).

Proposition 5.
Let f : {−1, 1}_µ^n → {1, 1 + ǫ} and Var[f] = E_µ[f^2] − E_µ[f]^2. Then

Σ_{i∈[n]} Σ_{S∋i} f̂(S; µ)^2 = Σ_{S⊆[n]} |S| f̂(S; µ)^2 ≥ Var[f].

Equation (4) tells us that the bias of each bit i increases according to the corresponding coefficient f̂(i; µ). Proposition 4 tells us that for monotone f, the linear coefficients correspond to the influences of f. Finally, the inequality of Poincaré tells us that the linear coefficients must be large. We may now prove Theorem 1.

Theorem 1.
Let f : {−1, 1}^n → {1, 1 + ǫ} be monotone. Then µ^t(f) ≥ 1 − n(1 + ǫ) / (ǫ t µ^0(f)).

Proof. Combining Equation (4) with Propositions 4 and 5 tells us that the sum of the biases increases at each step:

Σ_{i∈[n]} (µ′_i − µ_i) = (2 / (ǫ E_µ[f])) Σ_{i∈[n]} Σ_{S∋i} f̂(S; µ)^2 ≥ (2 / (ǫ E_µ[f])) Var[f] = (2ǫ / E_µ[f]) µ(f)(1 − µ(f)),

since Var[f] = ǫ^2 µ(f)(1 − µ(f)). Let µ^t(f) = 1 − δ. Since f is monotone, every f̂(i; µ) is nonnegative, so each bias µ_i, and with it µ^{t′}(f), is nondecreasing in t′; hence µ^{t′}(f) ≥ µ^0(f) and 1 − µ^{t′}(f) ≥ δ for all t′ ≤ t. Therefore, for all t′ ≤ t,

Σ_{i=1}^n µ_i^{t′+1} − Σ_{i=1}^n µ_i^{t′} ≥ (2ǫ / E_{µ^{t′}}[f]) µ^{t′}(f)(1 − µ^{t′}(f)) ≥ 2ǫ µ^{t′}(f) δ / (1 + ǫ) ≥ 2δǫ µ^0(f) / (1 + ǫ).

On the other hand, we know that −n ≤ Σ_{i=1}^n µ_i^{t′} ≤ n for all t′, so t ≤ n(1 + ǫ) / (δǫ µ^0(f)).

We remark that Theorem 1 (with worse parameters) can also be proven using a generalization of the Russo-Margulis lemma to product distributions, which states that the gradient of E_µ[f] (as a function of µ) corresponds to the influences of f (see Appendix B.1).

The main result

For a function f : {−1, 1}^n → {1, 1 + ǫ}, consider the multilinear extension f̃ : [−1, 1]^n → [1, 1 + ǫ] defined by f̃(µ) = E_{x∼µ}[f(x)]. Our goal is to understand when f̃(µ) = 1 + ǫ. We start with the precise statement of the main result (compare with Theorem 2):

Theorem 6.
Let β = (ǫN(1 − nǫ))^{−1/4}. If f̃(µ^0) > 1 + √(β ln(2/β)), then there is some constant C such that for any T ≥ C · (n/ǫ) · N/(1 − nǫ):

Pr[f̃(µ^(T)) = 1 + ǫ] ≥ 1 − β − 2/n.

Note that the conditions in Theorem 6 imply restrictions on the initial probability of satisfaction and on the strength of selection. In particular, the selection coefficient must be smaller than 1/n and larger than some inverse polynomial in N (we discuss this restriction in the next section), and the initial probability of satisfaction must be at least an inverse polynomial in N. The full proof of the theorem is given in the Appendix; in this section we sketch its salient points.

One first difficulty in the proof is this: The convergence proof gauges the improvement in average population fitness obtained during the second of the two steps per generation (the fitness step). However, the first of the two steps (the sampling step) introduces variance, and we must establish that this variance is insignificant in comparison with the increase in fitness. Our first lemma (Lemma 7) establishes that the expected squared difference between the average fitness of the sample and the average fitness (that is to say, the variance introduced) is bounded from above by the expected increase in average fitness obtained in the fitness step:

E_{ν∼B(µ)}[(f̃(ν) − f̃(µ))^2] ≤ E_{ν∼B(µ)}[f̃(µ′) − f̃(ν)] / [(N − 1)(1 − nǫ)].   (5)

Here we focus on one generation, so µ denotes the product distribution from which the sampling is made, ν the empirical product distribution of the sample (note that f̃(ν) is a random variable with expectation f̃(µ)), and µ′ the product distribution resulting from the selection (or fitness) step. Thus, µ′ is the initial product distribution of the next generation.

To establish inequality (5), we first show that the right-hand side is lower bounded by the total mass of the singleton Fourier coefficients of the biased transform (Lemma 8):

f̃(µ′) − f̃(ν) ≥ (1 − nǫ) Σ_{i=1}^n f̂(i; ν)^2.
(6)

The intuition in the proof of (6) is that the fitness step is very close to an ǫ-long step of gradient ascent on the average fitness function (this intuition is very accurate away from the boundary of the hypercube). Gradient ascent in each coordinate is captured by the corresponding singleton coefficient squared. But then there is an analytical complication of approximating the overall ascent by the sum of sequential coordinate-wise ascents; the difficulty is, of course, that the partial derivatives change after each small ascent, and the change must be bounded (Lemma 10).

This establishes that the fitness increase in the selection step is larger than the linear Fourier mass, and hence nonnegative when ǫ is small. However, the linear Fourier mass may be zero, as is the case for the exclusive-or function under the uniform distribution (recall the discussion a few lines after Theorem 1). Here, sampling effects will ensure that progress is made in expectation. We show that, on average, the linear Fourier mass is much larger than the variance (Lemma 9):

E_{ν∼B(µ)}[(f̃(µ) − f̃(ν))^2] ≤ (1/(N − 1)) E_{ν∼B(µ)}[Σ_{i=1}^n f̂(i; ν)^2].   (7)

The rather involved proof of (7) takes place entirely within the biased Fourier domain (see Appendix B.3). Now notice that (7), combined with (6), completes the proof of inequality (5) and Lemma 7.

Note that the upper bound on the variance in (5) includes in the denominator a factor of (1 − nǫ) · (N − 1). This immediately tells us that our technique is sharpest when the population N is large and the selection strength ǫ is small; in particular, ǫ must be smaller than 1/n. This latter point is a rather puzzling limitation of our result: Why does a theorem about the effectiveness of natural selection become harder to prove when selection is stronger?
One intuitive explanation is that in this case selection works very much like gradient ascent, and it is well known that the convergence of gradient ascent is harder to establish when the ascent step is large, as a large step can "skip over" the stationary point sought. Is this upper limit on ǫ necessary? This is an intriguing open question discussed in the last section.

Next, we establish that the total effect of the sampling steps is small: For any α ≥ √(β ln(2/β)),

Pr[|Σ_{t=1}^T (f̃(ν^t) − f̃(µ^{t−1}))| ≥ α] ≤ β,

where β = (ǫN(1 − nǫ))^{−1/4}. It is not hard to see that the sum Σ_{t=1}^T (f̃(ν^t) − f̃(µ^{t−1})) constitutes a martingale, albeit one with no obvious upper bound on each step. In Lemma 15 we bound the total effect of the sampling steps by resorting to a rather exotic martingale inequality derived from a generalization of Bernstein's inequality to martingales with unbounded jumps, proved in [4] (in fact, a specialization stated in Appendix C as Lemma 14).

Incidentally, notice that this is the place where it is proved, quite indirectly, that the sampling step succeeds in getting the process unstuck from spurious fixpoints such as (0, 0, ..., 0) for the exclusive-or function: Since the total effect of sampling is limited, the increase in average fitness must eventually prevail.

Finally, when the process is near a vertex of the hypercube, fitness increases are too small to help finish the argument, but here we rely on the fact that the process is very likely to drift so close to a vertex that it will eventually get stuck there (Lemma 16), completing the proof of the main result.

Discussion

We proved a novel and highly nontrivial aspect of Boolean satisfiability: By randomly crossing assignments and favoring satisfaction slightly, one can breed a population of pure satisfying truth assignments.
We argued that this rather curious property seems important in understanding one intriguing aspect of evolution: how complex traits controlled by many genes can emerge.

There are many roads of mathematical inquiry opened by this theorem. First, can the limitations and restrictions of our model be relaxed so that it better reflects the realities of life? Some of the assumptions in our model are arguably unrealistic (haploidy, fixed population size, random mating, partly in-expectation fitness calculation), but these follow widely accepted practices in population genetics needed for mathematical simplification. We also make the assumption of weak selection, but this too is a very defensible one for unlinked loci.

There are, however, a few further restrictions of our model that call for discussion:

• Two alleles per gene.
The motivation is, of course, that this assumption ushers in the powerful analytical toolbox of Boolean functions. We have no doubt that similar results hold for more alleles, but they would require a great number of technical adjustments.

• Fitness landscape.
We assumed a very specialized fitness landscape, with values 1 and 1 + ǫ only. This is a natural simplification that facilitates the connection to Boolean functions, but we do not believe it is an essential one. We believe that this result can be extended to much richer landscapes with a small gap, for example to situations in which fitness values lie in [1 − δ, 1] ∪ [1 + ǫ, 1 + ǫ + δ] for some small δ > 0.

• Selection strength. What happens when ǫ is larger? As we have mentioned, this is an analytical challenge with roots in the difficulty of the analysis of gradient ascent. Of course, a constant gap would bring us outside the realm of weak selection, and render our approximation by product distributions baseless. There are two ways we can proceed: One is to prove that the exact recurrence equations of genotype frequencies yield eventual satisfaction. This seems possible but challenging.

Another avenue, which we have followed for some time, is to work with product distributions anyway. In particular, what if the fitness landscape has values {0, 1}, that is to say, non-satisfying truth assignments are removed from the population, as in Waddington's experiment? This is a realistic approximation if, for example, this selection does not happen in every generation but every O(log n) generations (because breeding without selection is known to take the population close to the Wright manifold). In such a setting, our quadratic bound for the in-expectation process of monotone functions no longer requires any dependence on the initial probability of satisfaction µ^0(f). For the process with sampling, we have the following conjecture.

Conjecture:
If the fitness landscape has values {0, 1}, then the process reaches near-universal satisfaction with probability approaching one as the population size goes to infinity.

We now want to point out an obvious and yet surprising aspect of our work: In the traditional framework of adaptive evolution, each allele spreads in the population mainly either due to an additive contribution to fitness that it makes in and of itself (let us call this "traditional propagation") or due to random genetic drift [1, 19, 20, 18]. In our model, however, alleles at different genes spread in the population as governed by the complex interactions between them that are continually subject to selection. Thus, a population can change dramatically through a novel process involving subtle changes in genetic statistics and simultaneous gradual emergence in the whole population [8, 11], and not by traditional propagation.

Furthermore, notice that since recombination is a crucial ingredient of our analysis, our results inform the question of the role of sex in evolution. In this regard they add to recent works that have begun to examine the role of sex while giving full weight to the importance of genetic interactions [3, 9].

Finally, can our bounds be improved? For the monotone case, it is easy to see that the TRIBES function with appropriate fan-in provides a matching lower bound. As for the general case, we feel that the very generous bounds of the main result can be improved substantially. For example, the assumed time bound is only necessary in order to finish the last part of the argument (convergence to a vertex) once the vast majority of the population is already satisfying; more analysis is needed to investigate this subtle phenomenon.

Our proof that the population converges to a single satisfying truth assignment may seem a troubling aspect of our result. Two remarks: First, the loss of genetic diversity should not be surprising in itself.
With drift alone, for each locus, one allele will become fixed eventually (where the probability that a particular allele will be the fixed one is proportional to its current frequency in the population). Second, in our process many satisfying truth assignments are likely to survive for a very long time before the random walk clears the picture. This fact may be more relevant than the characteristics of equilibrium; after all, evolution happens in the transient.

Acknowledgments:
We are grateful to Yu Liu of Tsinghua University for some very interesting conversations at the beginning of this research.
References

[1] R A Fisher. The Genetical Theory of Natural Selection. Oxford: The Clarendon Press, 1930.
[2] K G Bateman. The genetic assimilation of four venation phenocopies. J Genet, 56:443–447, 1959.
[3] E Chastain, A Livnat, C Papadimitriou, and U Vazirani. Algorithms, games, and evolution. Proceedings of the National Academy of Sciences, 111(29):10620–10623, 2014.
[4] K. Dzhaparidze and J.H. van Zanten. On Bernstein-type inequalities for martingales. Stochastic Processes and their Applications, 93(1):109–117, 2001.
[5] D S Falconer. Introduction to Quantitative Genetics. Oliver and Boyd, Edinburgh, 1960.
[6] J H Gillespie. Population Genetics: A Concise Guide. JHU Press, 2004.
[7] R.C. Lewontin. The interaction of selection and linkage. I. General considerations; heterotic models. Genetics, 49:49–67, 1964.
[8] A Livnat. Interactions-based evolution: how natural selection and nonrandom mutation work together. Biology Direct, 8:24, 2013.
[9] A Livnat, C Papadimitriou, J Dushoff, and M W Feldman. A mixability theory for the role of sex in evolution. Proceedings of the National Academy of Sciences, 105(50):19803–19808, 2008.
[10] G. Margulis. Probabilistic characteristics of graphs with large connectivity. Prob. Peredachi Inform., 10:101–108, 1974.
[11] E Mayr. Animal Species and Evolution. Belknap Press, 1963.
[12] T. Nagylaki. The evolution of multilocus systems under weak selection. Genetics, 134(2):627–647, 1993.
[13] Ryan O'Donnell. Analysis of Boolean Functions. Cambridge University Press, 2014.
[14] L. Russo. On the critical percolation probabilities. Z. Wahrsch. verw. Gebiete, 43:39–48, 1978.
[15] C. Stern. Selection for sub-threshold differences and the origin of pseudoexogenous adaptations. The American Naturalist, 92:313–316, 1958.
[16] Leslie G Valiant. Evolvability. Journal of the ACM (JACM), 56(1):3, 2009.
[17] C. H. Waddington. Genetic assimilation of an acquired character. Evolution, 7(2):118–126, 1953.
[18] M J Wade and C J Goodnight. Perspective: the theories of Fisher and Wright in the context of metapopulations: when nature does many small experiments. Evolution, 52(6):1537–1553, 1998.
[19] S. Wright. Evolution in Mendelian populations. Genetics, 16:97–159, 1931.
[20] S. Wright. The roles of mutation, inbreeding, crossbreeding and selection in evolution. In Proc. 6th International Congress of Genetics, volume 1, pages 356–366, 1932.
A Outline of proof
In the following sections, we prove Theorem 6:
Theorem. Let $\beta = \big(2\epsilon/((N-1)(1-2n\epsilon))\big)^{1/2}$. If $\tilde f(\mu^{(0)}) > 1 + \sqrt{2\beta \ln(2/\beta)}$, then there is some constant $C$ such that for any $T \ge C \cdot \epsilon n^8 N^4/(1-2n\epsilon)$:
\[ \Pr\big[\tilde f(\mu^{(T)}) = 1+\epsilon\big] \ge 1 - 2\beta - 3/n. \]

Our proof of Theorem 6 is structured as follows. In Section B, we consider the average fitness from one generation to the next. As described in Section 1, each generation consists of two steps: the sampling step, which begins with a product distribution $\mu$ and results in an empirical product distribution $\nu$, and the fitness step, resulting in a distribution $\mu'$ (which becomes the initial distribution for the next generation). The main lemma (and the most involved step of our proof) of Section B is Lemma 7, which upper bounds the variance of $\tilde f(\nu)$ by a small fraction of $\mathbb{E}[\tilde f(\mu') - \tilde f(\nu)]$, the expected increase in average fitness due to the fitness step.

In Section C, we apply Lemma 7 together with a martingale inequality to prove Lemma 15, which states that the total fitness decrease will be small with high probability. Finally, we complete the proof of the main theorem in Section D by arguing (Lemma 16) that for $T$ as stated in the theorem, $\mu^{(T)}$ will reach a vertex of the hypercube (and hence $\tilde f(\mu^{(T)}) \in \{1, 1+\epsilon\}$) with high probability.

B Selection vs sampling effects
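As a purely illustrative companion to the analysis in this section, the two steps of a generation (sampling, then selection) are easy to simulate. The sketch below uses hypothetical parameters and the AND function as the satisfying predicate; none of the specific choices ($n$, $N$, $\epsilon$, $T$, the target function, the seed) come from the paper, and nothing here is used in the proofs.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n, N, eps, T = 3, 4000, 0.1, 500          # hypothetical parameters, for illustration only

X = np.array(list(itertools.product((-1, 1), repeat=n)), dtype=float)  # all 2^n assignments
fvals = np.where((X == 1).all(axis=1), 1.0 + eps, 1.0)  # AND of all literals as the target

def fitness_step(m):
    """Exact selection on a product distribution with means m: m'_i = E[x_i f(x)] / E[f(x)]."""
    p = np.prod((1 + X * m) / 2, axis=1)   # probability of each assignment under m
    w = p * fvals                          # fitness-weighted probabilities
    return (X * w[:, None]).sum(axis=0) / w.sum()

mu = np.zeros(n)                           # uniform start: satisfaction probability 1/2^n
for _ in range(T):
    # sampling step: empirical means of N offspring drawn from the product distribution B(mu)
    sample = np.where(rng.random((N, n)) < (1 + mu) / 2, 1.0, -1.0)
    nu = sample.mean(axis=0)
    # fitness step: weak selection with advantage eps for satisfying assignments
    mu = fitness_step(nu)

p_sat = float(np.prod((1 + mu) / 2))       # probability a random offspring satisfies AND
print(round(p_sat, 3))                     # well above the initial 1/8
```

In runs of this sketch the satisfaction probability climbs from $1/2^n$ toward $1$, mirroring (at toy scale) the convergence behavior that Theorem 6 establishes for polynomial populations and times.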
In this section we consider just one step of the process. Let $\mu$ be the initial product distribution of a generation, $\nu$ be the empirical product distribution from the sampling step, and $\mu'$ be the product distribution after the fitness step. Our main goal in this section is to show that the variance of $\tilde f(\nu)$, the average fitness of the population after the sampling step, is small compared to the expected increase in average fitness from the subsequent selection step, $\tilde f(\mu') - \tilde f(\nu)$. The main lemma we will prove is the following:

Lemma 7. Let $\nu$ be the vector of expectations of allele frequencies in the population sample of size $N$, drawn according to $B(\mu)$. Then:
\[ \mathbb{E}_{\nu \sim B}\big[(\tilde f(\nu) - \tilde f(\mu))^2\big] \le \frac{\mathbb{E}_{\nu \sim B}\big[\tilde f(\mu') - \tilde f(\nu)\big]}{(N-1)(1-2n\epsilon)}. \]

We will prove Lemma 7 by proving two intermediate lemmas. First, we show that the fitness increase from the selection step, $\tilde f(\mu') - \tilde f(\nu)$, is nearly as large as the $\nu$-biased Fourier weight of the linear coefficients of $f$ (and hence non-negative), provided that $\epsilon$ is sufficiently small.

Lemma 8.
Let $\mu'$ be the expectations of the process after selection from the population $\nu$. Then:
\[ \tilde f(\mu') - \tilde f(\nu) \ge (1 - 2n\epsilon) \sum_{i=1}^n \hat f(i;\nu)^2. \]

Next, we show that the variance of $\tilde f(\nu)$ is at most a small fraction of the expected linear $\nu$-biased Fourier mass of $f$, $\sum_{i=1}^n \hat f(i;\nu)^2$ (here the expectation is taken over the choice of $\nu$).

Lemma 9.
\[ \mathbb{E}_{\nu \sim B}\big[(\tilde f(\mu) - \tilde f(\nu))^2\big] \le \frac{1}{N-1}\, \mathbb{E}_{\nu \sim B}\Big[\sum_{i=1}^n \hat f(i;\nu)^2\Big]. \]

Combining Lemmas 8 and 9 gives us Lemma 7.
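The inequality of Lemma 8 (with the constant $(1-2n\epsilon)$ as reconstructed here) can be spot-checked by exact enumeration on small instances. The sketch below draws arbitrary functions into $\{1, 1+\epsilon\}$ and arbitrary product distributions $\nu$; all parameter choices are hypothetical and nothing here is part of the proof.

```python
import itertools
import math
import random

random.seed(4)
n, eps = 3, 0.05                   # 2*n*eps < 1, the regime assumed in Lemma 8
points = list(itertools.product((-1, 1), repeat=n))

def weight(x, m):                  # Pr[x] under the product distribution with means m
    return math.prod((1 + b * mi) / 2 for b, mi in zip(x, m))

min_slack = float("inf")
for _ in range(200):
    f = {x: random.choice([1.0, 1.0 + eps]) for x in points}       # arbitrary f into {1, 1+eps}
    nu = [random.uniform(-0.9, 0.9) for _ in range(n)]
    Ef = sum(weight(x, nu) * f[x] for x in points)                 # E_nu[f] (always >= 1)
    # selection step: mu'_i = E_nu[x_i f(x)] / E_nu[f]
    mu_p = [sum(weight(x, nu) * x[i] * f[x] for x in points) / Ef for i in range(n)]
    gain = sum(weight(x, mu_p) * f[x] for x in points) - Ef        # ftilde(mu') - ftilde(nu)
    # nu-biased singleton Fourier mass: hat f(i; nu) = Cov_nu(x_i, f) / sigma_i(nu)
    mass = sum((sum(weight(x, nu) * f[x] * (x[i] - nu[i]) for x in points)
                / math.sqrt(1 - nu[i] ** 2)) ** 2
               for i in range(n))
    min_slack = min(min_slack, gain - (1 - 2 * n * eps) * mass)
print(min_slack >= -1e-12)         # True: the selection gain dominates the linear Fourier mass
```

The observed slack is comfortably positive on non-constant instances, consistent with the bound being generous (the first-order gain is $\sum_i \hat f(i;\nu)^2/\mathbb{E}_\nu[f]$).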
B.1 Preliminaries
Throughout our proofs, we will use the notation and basic facts established at the beginning of Section 3. At times, we will use biased Fourier analysis with different product distributions $\mu$ and $\mu'$ at the same time. To prevent ambiguity, we will refer to the standard deviation of the $i$'th bit as $\sigma_i(\mu) = \sqrt{1 - \mu_i^2}$ (similarly for $\sigma_i(\mu')$), but we will use $\sigma_i$ when the context makes the distribution clear.

Recall that the extension of $f : \{-1,1\}^n \to \{1, 1+\epsilon\}$,
\[ \tilde f(\mu) = \mathbb{E}_{x \sim \mu}[f(x)] = \sum_{S \subseteq [n]} \hat f(S) \prod_{j \in S} \mu_j, \]
is multilinear. Note that its derivative in the $i$'th direction is given by
\[ \frac{\partial \tilde f(\mu)}{\partial \mu_i} = \sum_{S \ni i} \hat f(S) \prod_{j \in S \setminus i} \mu_j = \mathbb{E}_\mu\big[D_i^{(1/2)} f\big] = \frac{1}{\sigma_i}\, \mathbb{E}_\mu\big[D_i^{(\mu)} f\big] = \frac{\hat f(i;\mu)}{\sigma_i}. \tag{8} \]
Here (8) is a straightforward generalization of the Russo-Margulis Lemma [10, 14] to product distributions. Thus, we may write the change in allele frequency from the fitness step as:
\[ \mu_i' - \nu_i = \frac{\sigma_i(\nu)\, \hat f(i;\nu)}{\mathbb{E}_\nu[f]} = \frac{\sigma_i(\nu)^2}{\tilde f(\nu)}\, \frac{\partial \tilde f(\nu)}{\partial \nu_i}, \tag{9} \]
where the first equality holds by Equation (4) (derived in Section 3).

B.2 Proof of Lemma 8

We will compute the fitness change as each coordinate changes. Consider the hybrid distributions given by the expectations $w^i = (\mu_1', \ldots, \mu_i', \nu_{i+1}, \ldots, \nu_n)$, so that $w^0 = \nu$ and $w^n = \mu'$. We have that
\[ \tilde f(\mu') - \tilde f(\nu) = \sum_{i=1}^n \tilde f(w^i) - \tilde f(w^{i-1}). \]
Observe that the first increment is easily computed as
\[ \tilde f(w^1) - \tilde f(\nu) = (\mu_1' - \nu_1)\, \frac{\partial \tilde f(\nu)}{\partial \nu_1} = \frac{\hat f(1;\nu)^2}{\mathbb{E}_\nu[f]}. \]
However, this formula will not be valid for subsequent hybrids, as the derivatives of $\tilde f$ have changed. We start by showing that the derivatives in each direction do not change much between the hybrid distributions. The lemma below will allow us to approximate the derivatives of the hybrids by the derivative of $\tilde f$ at $\nu$.

Lemma 10.
Let $\nu' \ge \nu \in [-1,+1]^n$ differ only on the $j$-th coordinate, i.e., $\nu_\ell = \nu'_\ell$ for all $\ell \ne j$ and $\nu'_j > \nu_j$. Then for any $i \ge 2$ and $j < i$,
\[ \frac{\partial \tilde f(\nu')}{\partial \nu'_i} - \frac{\partial \tilde f(\nu)}{\partial \nu_i} = (\nu'_j - \nu_j)\, \frac{\hat f(\{i,j\};\nu)}{\sigma_i \sigma_j}. \]
In particular, for the hybrid distributions above with $i \ge 2$:
\[ \frac{\partial \tilde f(w^{i-1})}{\partial w_i^{i-1}} - \frac{\partial \tilde f(\nu)}{\partial \nu_i} = \frac{1}{\mathbb{E}_\nu[f]\, \sigma_i} \sum_{j=1}^{i-1} \hat f(j;\nu)\, \hat f(\{i,j\};\nu). \]

Proof.
Expanding the derivatives in terms of the unbiased Fourier coefficients, we have that
\[ \frac{\partial \tilde f(\nu')}{\partial \nu'_i} - \frac{\partial \tilde f(\nu)}{\partial \nu_i} = \sum_{S \ni i} \hat f(S) \Big( \prod_{\ell \in S \setminus i} \nu'_\ell - \prod_{\ell \in S \setminus i} \nu_\ell \Big) = \sum_{S \supseteq \{j,i\}} \hat f(S)\, (\nu'_j - \nu_j) \prod_{\ell \in S \setminus \{j,i\}} \nu_\ell = (\nu'_j - \nu_j)\, \mathbb{E}_\nu\big[D_i^{(1/2)} D_j^{(1/2)} f\big]. \]
The proof of the first equality in the lemma is completed by noting that:
\[ \mathbb{E}_\nu\big[D_i^{(1/2)} D_j^{(1/2)} f\big] = \frac{1}{\sigma_i \sigma_j}\, \mathbb{E}_\nu\big[D_i^{(\nu)} D_j^{(\nu)} f\big]. \]
For the second equality of the lemma, we may write the difference between the derivative in the $i$-th direction under the hybrid distribution $w^{i-1}$ and under the original distribution as a telescoping sum:
\[ \frac{\partial \tilde f(w^{i-1})}{\partial w_i^{i-1}} - \frac{\partial \tilde f(\nu)}{\partial \nu_i} = \sum_{j=1}^{i-1} (w_j^j - w_j^{j-1})\, \frac{\hat f(\{i,j\};\nu)}{\sigma_i \sigma_j} = \sum_{j=1}^{i-1} \frac{\sigma_j \hat f(j;\nu)}{\mathbb{E}_\nu[f]} \cdot \frac{\hat f(\{i,j\};\nu)}{\sigma_i \sigma_j} = \frac{1}{\mathbb{E}_\nu[f]\, \sigma_i} \sum_{j=1}^{i-1} \hat f(j;\nu)\, \hat f(\{i,j\};\nu). \]

We are now ready to prove Lemma 8:
Lemma (8). Let $\mu'$ be the expectations of the process after selection from the population $\nu$. Then:
\[ \tilde f(\mu') - \tilde f(\nu) \ge (1 - 2n\epsilon) \sum_{i=1}^n \hat f(i;\nu)^2. \]

Proof.
We will bound each of the differences between the hybrid densities in the summation:
\[ \tilde f(\mu') - \tilde f(\nu) = \sum_{i=1}^n \tilde f(w^i) - \tilde f(w^{i-1}). \]
Since $\tilde f$ is multilinear, we have that for $i \ge 2$:
\begin{align*}
\tilde f(w^i) - \tilde f(w^{i-1}) &= (\mu_i' - \nu_i)\, \frac{\partial \tilde f(w^{i-1})}{\partial w_i^{i-1}} \\
&= \frac{\sigma_i \hat f(i;\nu)}{\mathbb{E}_\nu[f]} \left( \frac{\partial \tilde f(\nu)}{\partial \nu_i} + \frac{\partial \tilde f(w^{i-1})}{\partial w_i^{i-1}} - \frac{\partial \tilde f(\nu)}{\partial \nu_i} \right) \\
&= \frac{\sigma_i \hat f(i;\nu)}{\mathbb{E}_\nu[f]} \left( \frac{\hat f(i;\nu)}{\sigma_i} + \frac{1}{\mathbb{E}_\nu[f]\, \sigma_i} \sum_{j=1}^{i-1} \hat f(j;\nu)\, \hat f(\{i,j\};\nu) \right).
\end{align*}
Recalling that $\hat f(\{i,j\};\nu) = \sigma_i \sigma_j\, \mathbb{E}_\nu[D_i^{(1/2)} D_j^{(1/2)} f]$, and using the facts that $|D_i^{(1/2)} D_j^{(1/2)} f(x)| \le \epsilon/2$ for all $x$ and $\mathbb{E}_\nu[f] \ge 1$, we have:
\[ \frac{1}{\mathbb{E}_\nu[f]\, \sigma_i} \left| \sum_{j=1}^{i-1} \hat f(j;\nu)\, \hat f(\{i,j\};\nu) \right| \le \frac{\epsilon}{2} \sum_{j=1}^{i-1} \sigma_j\, |\hat f(j;\nu)|. \tag{10} \]
Assume WLOG that $\sigma_i |\hat f(i;\nu)| \ge \sigma_j |\hat f(j;\nu)|$ for all $j < i$, so that (10) is at most $\frac{\epsilon}{2}(i-1)\, \sigma_i |\hat f(i;\nu)|$. Substituting into the telescoping sum, we have:
\begin{align*}
\sum_{i=1}^n \tilde f(w^i) - \tilde f(w^{i-1}) &\ge \frac{1}{\mathbb{E}_\nu[f]} \sum_{i=1}^n \sigma_i \hat f(i;\nu) \left( \frac{\hat f(i;\nu)}{\sigma_i} - \frac{\epsilon}{2}(i-1)\, \sigma_i |\hat f(i;\nu)| \right) \\
&\ge \frac{1}{1+\epsilon} \sum_{i=1}^n \hat f(i;\nu)^2 \Big(1 - \frac{\epsilon}{2}(i-1)\, \sigma_i^2\Big) \\
&\ge (1 - 2n\epsilon) \sum_{i=1}^n \hat f(i;\nu)^2.
\end{align*}

B.3 Proof of Lemma 9
Our goal is to show that:
\[ \mathbb{E}_{\nu \sim B}\Big[\sum_{i=1}^n \hat f(i;\nu)^2\Big] \ge (N-1) \cdot \mathbb{E}_{\nu \sim B}\big[(\tilde f(\mu) - \tilde f(\nu))^2\big]. \]
The first key observation is that the Fourier basis with respect to the product distribution $\mu$ is still orthogonal with respect to $B$, i.e., we have $\mathbb{E}_{\nu \sim B}[\phi^\mu_S(\nu)\, \phi^\mu_T(\nu)] = 0$ for $S \ne T$, because $B$ is a product distribution and $\mathbb{E}_{\nu \sim B}[\phi^\mu_S(\nu)] = 0$ for $S \ne \emptyset$. In particular, Parseval's identity holds for the extension of $f$ with respect to $B$:

Claim 11.
\[ \mathbb{E}_{\nu \sim B}\big[\tilde f(\nu)^2\big] = \mathbb{E}_B\Big[\Big(\sum_S \hat f(S;\mu)\, \phi^\mu_S(\nu)\Big)^2\Big] = \sum_S \hat f(S;\mu)^2\, \mathbb{E}_{\nu \sim B}\big[\phi^\mu_S(\nu)^2\big]. \]

Our approach will be to consider both sides of the inequality using the $\mu$-biased Fourier basis of $f$. This is straightforward for the variance of $\tilde f(\nu)$, using Parseval's. For the right hand side, we observe that the $\nu$-biased linear coefficients may be viewed as functions in $\nu$. In fact, each linear coefficient can be viewed as the extension of the $\mu$-biased difference operator to $[-1,1]^n$, modulo a normalization factor.

Claim 12.
\[ \hat f(i;\nu) = \frac{\sigma_i(\nu)}{\sigma_i(\mu)} \sum_{S \ni i} \hat f(S;\mu)\, \phi^\mu_{S \setminus i}(\nu). \]

Proof.
We rewrite the linear $\nu$-biased Fourier coefficient in terms of the $\mu$-biased difference operator:
\begin{align*}
\hat f(i;\nu) = \mathbb{E}_{x \sim \nu}\big[D_i^{(\nu)} f\big] &= \sigma_i(\nu)\, \mathbb{E}_\nu\big[D_i^{(1/2)} f\big] \\
&= \frac{\sigma_i(\nu)}{\sigma_i(\mu)}\, \mathbb{E}_\nu\big[D_i^{(\mu)} f\big] \\
&= \frac{\sigma_i(\nu)}{\sigma_i(\mu)} \sum_{S \ni i} \hat f(S;\mu)\, \mathbb{E}_\nu\big[\phi^\mu_{S \setminus i}\big] \\
&= \frac{\sigma_i(\nu)}{\sigma_i(\mu)} \sum_{S \ni i} \hat f(S;\mu)\, \phi^\mu_{S \setminus i}(\nu),
\end{align*}
where the third equality expands $D_i^{(\mu)} f$ in the $\mu$-biased basis, and the final equality holds because $\nu$ is a product distribution.

Finally, we will use the fact that the variance of $\phi^\mu_i(\nu)$ grows smaller as the sample size increases:

Fact 13.
\[ \mathbb{E}_{\nu \sim B}\big[\phi^\mu_{S \setminus i}(\nu)^2\big] = \mathbb{E}_{\nu \sim B}\big[\phi^\mu_S(\nu)^2\big] \cdot N. \]

Proof.
Because $B$ is a product distribution, we have:
\[ \mathbb{E}_{\nu \sim B}\big[\phi^\mu_S(\nu)^2\big] = \mathbb{E}_{\nu \sim B}\big[\phi^\mu_{S \setminus i}(\nu)^2\big] \cdot \mathbb{E}_{\nu \sim B}\big[\phi^\mu_i(\nu)^2\big]. \]
Then
\[ \mathbb{E}_{\nu \sim B}\big[\phi^\mu_i(\nu)^2\big] = \mathbb{E}_{\nu \sim B}\Big[\big((\nu_i - \mu_i)/\sigma_i(\mu)\big)^2\Big] = \frac{\mathbb{E}_{\nu \sim B}\big[(\nu_i - \mu_i)^2\big]}{1 - \mu_i^2} = \frac{\mathbb{E}_B[\nu_i^2] - \mu_i^2}{1 - \mu_i^2} = \frac{N^{-1} + \mu_i^2 - N^{-1}\mu_i^2 - \mu_i^2}{1 - \mu_i^2} = \frac{1}{N}. \]

With the previous claims in hand, we are ready to prove Lemma 9:
Lemma (9).
\[ \mathbb{E}_{\nu \sim B}\big[(\tilde f(\mu) - \tilde f(\nu))^2\big] \le \frac{1}{N-1}\, \mathbb{E}_{\nu \sim B}\Big[\sum_{i=1}^n \hat f(i;\nu)^2\Big]. \]

Proof. We first consider the expected $\nu$-biased linear Fourier weight. Applying Claims 12 and 11, and summing over $i$, we have:
\begin{align*}
\sum_{i=1}^n \mathbb{E}_{\nu \sim B}\big[\hat f(i;\nu)^2\big] &= \sum_{i=1}^n \frac{\mathbb{E}_B[\sigma_i(\nu)^2]}{\sigma_i(\mu)^2} \sum_{S \ni i} \hat f(S;\mu)^2\, \mathbb{E}_B\big[\phi^\mu_{S \setminus i}(\nu)^2\big] \\
&= \Big(1 - \frac{1}{N}\Big) \sum_{i=1}^n \sum_{S \ni i} \hat f(S;\mu)^2\, \mathbb{E}_B\big[\phi^\mu_{S \setminus i}(\nu)^2\big] \\
&= \Big(1 - \frac{1}{N}\Big) \sum_{i=1}^n \sum_{S \ni i} \hat f(S;\mu)^2\, \mathbb{E}_B\big[\phi^\mu_S(\nu)^2\big] \cdot N \\
&\ge (N-1) \cdot \sum_{S \ne \emptyset} \hat f(S;\mu)^2\, \mathbb{E}\big[\phi^\mu_S(\nu)^2\big].
\end{align*}
For the first equality, $\sigma_i(\nu)^2$ depends only on $\nu_i$, while $\phi^\mu_{S \setminus i}$ does not, so the expectations may be taken separately. For the second equality, we calculate
\[ \mathbb{E}_B\big[\sigma_i(\nu)^2\big] = 1 - \mathbb{E}_B[\nu_i^2] = 1 - \Big(\mu_i^2 + \frac{1 - \mu_i^2}{N}\Big), \]
which gives that $\mathbb{E}_B[\sigma_i(\nu)^2]/\sigma_i(\mu)^2 = 1 - 1/N$. The third equality holds by Fact 13, and the final inequality holds since each non-empty coefficient appears in the sum at least once.

Using Claim 11 and rewriting the variance of $\tilde f(\nu)$ using the $\mu$-biased Fourier basis for $f$, we have:
\[ \mathbb{E}_{\nu \sim B}\big[(\tilde f(\nu) - \tilde f(\mu))^2\big] = \mathbb{E}_B\Big[\Big(\sum_{S \ne \emptyset} \hat f(S;\mu)\, \phi^\mu_S(\nu)\Big)^2\Big] = \sum_{S \ne \emptyset} \hat f(S;\mu)^2\, \mathbb{E}_B\big[\phi^\mu_S(\nu)^2\big]. \]

C Bounding the cumulative effect of sampling
We saw in Lemma 8 that the selection step always results in a non-negative change in fitness when $\epsilon$ is sufficiently small. The sampling steps, however, may decrease fitness. In this section we show that the cumulative effect of sampling on fitness will be small. We use $\mu^t$ to denote the initial distribution of the generation at time $t$, and $\nu^t$ to denote the product distribution after the sampling step, drawn according to $B(\mu^t)$. Then the selection step determines the population at time $t+1$, which we write as $\mu^{t+1}$ (determined by $\nu^t$). By Lemma 7, at each stage the variance of $\tilde f(\nu)$ is a small fraction of the expected fitness increase after selection. Summing over all generations, the total variance from the sampling steps is a small fraction of the total fitness increase from the selection steps. Finally, we bound from above the total fitness decrease due to sampling effects; for this last step we need the following generalization of Bernstein's inequality to martingales with unbounded jumps, due to Dzhaparidze and van Zanten:

Lemma 14 (Theorem 3.3, [4]). Let $\{\mathcal{F}_t\}_{t=0,1,\ldots}$ be a filtration, and let $\zeta_1, \zeta_2, \ldots$ be a martingale difference sequence w.r.t. $\{\mathcal{F}_t\}$. Consider the martingale $S_T = \sum_{t=1}^T \zeta_t$. Define:
\[ H_T = \sum_{t \le T} \zeta_t^2 + \sum_{t \le T} \mathbb{E}\big[\zeta_t^2 \mid \mathcal{F}_{t-1}\big]. \]
Then, for each stopping time $\tau$,
\[ \Pr\Big[\max_{T \le \tau} |S_T| > z,\; H_\tau \le L\Big] \le 2\exp\Big(-\frac{z^2}{2L}\Big). \]
(This is a special case of their Theorem 3.3, which corresponds, in their notation, to the limit as $a \to 0$.)

Lemma 15.
Let $\beta = \big(2\epsilon/((N-1)(1-2n\epsilon))\big)^{1/2}$ and $\alpha = \sqrt{2\beta \ln(2/\beta)}$. Then
\[ \Pr\Big[\Big|\sum_{t=0}^{T} \tilde f(\nu^t) - \tilde f(\mu^t)\Big| \ge \alpha\Big] \le 2\beta. \]

Proof.
For each sequence $(\mu^t)_{t=0}^T$ of populations up to time $T$, define its congruence class as a subset of infinite sequences:
\[ \big[(\mu^t)_{t=0}^T\big] = \big\{ (w^t)_{t=0}^\infty : w^t = \mu^t \;\forall t \le T \big\}. \]
Now, for a time $T$, consider the space of possible sequences of populations: $\mathcal{F}_T = \big\{ \big[(\mu^t)_{t=0}^T\big] \big\}$. Then $\mathcal{F}_0 \subset \mathcal{F}_1 \subset \cdots$ is a filtration. We will consider the following martingale:
\[ S_T = \sum_{t=0}^T \zeta_t = \sum_{t=0}^T \tilde f(\nu^t) - \tilde f(\mu^t). \]
Notice that this is indeed a martingale because
\[ \mathbb{E}[S_T \mid \mathcal{F}_{T-1}] = \mathbb{E}\Big[\sum_{t=0}^T \tilde f(\nu^t) - \tilde f(\mu^t) \,\Big|\, \mathcal{F}_{T-1}\Big] = S_{T-1} + \mathbb{E}\big[\tilde f(\nu^T) - \tilde f(\mu^T) \mid \mathcal{F}_{T-1}\big] = S_{T-1}. \]
To apply Lemma 14, we also define the following sequences:
\[ M_T = \sum_{t=0}^T \big(\tilde f(\nu^t) - \tilde f(\mu^t)\big)^2, \qquad V_T = \sum_{t=0}^T \mathbb{E}\Big[\big(\tilde f(\nu^t) - \tilde f(\mu^t)\big)^2 \,\Big|\, \mathcal{F}_{t-1}\Big], \qquad H_T = M_T + V_T. \]
For each $T$, we show that $\Pr[H_T \ge \beta] \le \beta$ by bounding $\mathbb{E}[H_T]$ and applying Markov's inequality. Applying Lemma 7, we have that
\[ \mathbb{E}[M_T] = \mathbb{E}\Big[\sum_{t=0}^T \big(\tilde f(\nu^t) - \tilde f(\mu^t)\big)^2\Big] \le \frac{1}{(N-1)(1-2n\epsilon)} \cdot \mathbb{E}\Big[\sum_{t=0}^T \tilde f(\mu^{t+1}) - \tilde f(\nu^t)\Big] \le \frac{\epsilon}{(N-1)(1-2n\epsilon)}, \]
where the last inequality holds because
\[ \mathbb{E}\Big[\sum_{t=0}^T \tilde f(\mu^{t+1}) - \tilde f(\nu^t) + \sum_{t=0}^T \tilde f(\nu^t) - \tilde f(\mu^t)\Big] = \mathbb{E}\big[\tilde f(\mu^{T+1}) - \tilde f(\mu^0)\big] \le \epsilon, \]
and $\mathbb{E}[S_T] = \mathbb{E}\big[\sum_{t=0}^T \tilde f(\nu^t) - \tilde f(\mu^t)\big] = 0$. Similarly, we may apply Lemma 7 to each term of $V_T$:
\[ \mathbb{E}\big[(\tilde f(\nu^t) - \tilde f(\mu^t))^2 \mid \mathcal{F}_{t-1}\big] = \mathbb{E}_{\nu^t \sim B(\mu^t)}\big[(\tilde f(\nu^t) - \tilde f(\mu^t))^2\big] \le \frac{\mathbb{E}_{\nu^t \sim B(\mu^t)}\big[\tilde f(\mu^{t+1}) - \tilde f(\nu^t)\big]}{(N-1)(1-2n\epsilon)}. \]
Summing over $t$ and taking the expectation, we again have that $\mathbb{E}[V_T] \le \epsilon/((N-1)(1-2n\epsilon))$. Thus, we have that $\mathbb{E}[H_T] = \mathbb{E}[M_T + V_T] \le \frac{2\epsilon}{(N-1)(1-2n\epsilon)} = \beta^2$, so Markov's inequality gives $\Pr[H_T \ge \beta] \le \beta$. Finally, applying Lemma 14 to $S_T$, we have
\[ \Pr\Big[\max_{T \le \tau} |S_T| \ge \alpha,\; H_T \le \beta\Big] \le 2\exp\Big(-\frac{\alpha^2}{2\beta}\Big) \le \beta. \]
Combining this with the bound $\Pr[H_T \ge \beta] \le \beta$ gives the lemma.

D Proof of the main theorem
To complete the proof of Theorem 6, we first show (Lemma 16 below) that for sufficiently large $T$, the population $\mu^T$ is at a vertex of the Boolean cube with high probability. Finally, we combine this with Lemma 15, which lets us bound the probability that $\tilde f(\mu^T) \ne 1 + \epsilon$.

Lemma 16.
There is a constant $C > 0$ such that for any $T \ge C \cdot \epsilon n^8 N^4/(1-2n\epsilon)$, we have:
\[ \Pr\big[\mu^T \notin \{-1,1\}^n\big] < 3/n. \]

Proof.
Note that if $|\nu^{t'}_j| = 1$ for some time $t'$, then $\nu^t_j = \nu^{t'}_j$ for every $t \ge t'$. Observe also that if $|\mu^{t'}_j| > 1 - (n^2 N)^{-1}$ (in this case we say $j$ is $\alpha$-determined, with $\alpha = n^{-2} N^{-1}$), we have by Markov's inequality:
\[ \Pr\big[|\nu^{t'}_j| < 1\big] \le \frac{1}{2n^2}. \]
We will show that after enough time, it is unlikely that there is any coordinate that was never $\alpha$-determined. More precisely, let $A_{j;t}$ be the event that coordinate $j$ was not $\alpha$-determined for $\mu^0, \ldots, \mu^t$. To prove the lemma, the above reasoning tells us that it suffices to show that for $T$ as set in the condition of the lemma:
\[ \Pr\Big[\bigvee_{j=1}^n A_{j;T}\Big] \le \frac{1}{n}. \]
We will consider each coordinate separately and show that for each $j$, $\Pr[A_{j;T}] \le n^{-2}$. Our proof will use the following simple claims relating $\Pr[A_{j;T}]$ to the selection steps of the process.

Claim 17. Fix any time $t_0$ and an interval length $T_1$ such that $t_0 + T_1 \le T$. Then:
\[ \mathbb{E}\Big[\Big(\sum_{t=t_0}^{t_0+T_1} \nu^t_j - \mu^t_j\Big)^2\Big] \ge \frac{\alpha \Pr[A_{j;T}]\, T_1}{N}. \]
Proof.
\[ \mathbb{E}\Big[\Big(\sum_{t=t_0}^{t_0+T_1} \nu^t_j - \mu^t_j\Big)^2\Big] = \sum_{t=t_0}^{t_0+T_1} \mathbb{E}_{\nu^t \sim B(\mu^t)}\big[(\nu^t_j - \mu^t_j)^2\big] \ge \sum_{t=t_0}^{t_0+T_1} \Pr[A_{j;t}]\, \mathbb{E}_{\nu^t}\big[(\nu^t_j - \mu^t_j)^2 \mid A_{j;t}\big] \ge \frac{\alpha \Pr[A_{j;T}]\, T_1}{N}. \]
Note that for $t < t'$, the outcome of $(\nu^{t'}_j - \mu^{t'}_j)$ has expectation $0$, even given any information about time $t$; this gives the first equality (the cross terms vanish). The last inequality holds because $\Pr[A_{j;t}] \ge \Pr[A_{j;T}]$ for $t \le T$, and because
\[ \mathbb{E}_{\nu^t \sim B(\mu^t)}\big[(\nu^t_j - \mu^t_j)^2 \mid A_{j;t}\big] = \frac{\sigma_j(\mu^t)^2}{N} \ge \frac{\alpha}{N}. \]
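The sampling identity at the end of this proof (also used in Fact 13), $\mathbb{E}[(\nu_j - \mu_j)^2] = \sigma_j(\mu)^2/N$ for $\nu_j$ the empirical mean of $N$ independent $\pm 1$ bits with mean $\mu_j$, can be checked by direct simulation. The parameters below are arbitrary, chosen only for illustration.

```python
import random

random.seed(2)
mu_j, N, trials = 0.4, 50, 20000   # arbitrary allele frequency, sample size, repetitions
acc = 0.0
for _ in range(trials):
    # one sampling step: empirical mean of N independent +/-1 bits with mean mu_j
    nu_j = sum(1 if random.random() < (1 + mu_j) / 2 else -1
               for _ in range(N)) / N
    acc += (nu_j - mu_j) ** 2
est = acc / trials                  # Monte Carlo estimate of E[(nu_j - mu_j)^2]
exact = (1 - mu_j ** 2) / N         # sigma_j(mu)^2 / N
print(abs(est - exact) < 0.002)     # True: the estimate matches sigma^2 / N
```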
The next claim tells us that for any interval of time $t_0, \ldots, t_0 + T_1$, the change in $\mu_j$ due to the sampling steps cannot be much more than the change from the selection steps.

Claim 18.
\[ \Big(\sum_{t=t_0}^{t_0+T_1} \nu^t_j - \mu^t_j\Big)^2 \le 2\Big(\sum_{t=t_0}^{t_0+T_1} \mu^{t+1}_j - \nu^t_j\Big)^2 + 8. \]

Proof.
Observe that
\[ \mu^{t_0+T_1+1}_j - \mu^{t_0}_j = \Big(\sum_{t=t_0}^{t_0+T_1} \nu^t_j - \mu^t_j\Big) + \Big(\sum_{t=t_0}^{t_0+T_1} \mu^{t+1}_j - \nu^t_j\Big) \]
has magnitude at most $2$, which gives:
\[ \Big|\sum_{t=t_0}^{t_0+T_1} \nu^t_j - \mu^t_j\Big| \le \Big|\sum_{t=t_0}^{t_0+T_1} \mu^{t+1}_j - \nu^t_j\Big| + 2. \]
Squaring both sides, the claim follows from the fact that $2x^2 + 8 \ge (|x| + 2)^2$.

We now complete the proof of the lemma. First, combining Claims 17 and 18 tells us that for $t_0 + T_1 \le T$:
\[ \mathbb{E}\Big[\Big(\sum_{t=t_0}^{t_0+T_1} \mu^{t+1}_j - \nu^t_j\Big)^2\Big] \ge \frac{\alpha \Pr[A_{j;T}]\, T_1}{2N} - 4. \]
By applying Cauchy-Schwarz and Lemma 8, we can relate the quantity inside the expectation to the change in fitness:
\[ \Big(\sum_{t=t_0}^{t_0+T_1} \mu^{t+1}_j - \nu^t_j\Big)^2 \le T_1 \sum_{t=t_0}^{t_0+T_1} \big(\mu^{t+1}_j - \nu^t_j\big)^2 = T_1 \sum_{t=t_0}^{t_0+T_1} \Big(\frac{\sigma_j(\nu^t)\, \hat f(j;\nu^t)}{\tilde f(\nu^t)}\Big)^2 \le \frac{T_1}{1-2n\epsilon} \sum_{t=t_0}^{t_0+T_1} \tilde f(\mu^{t+1}) - \tilde f(\nu^t). \]
Taking expectations on both sides, we conclude that for a sufficiently long interval, the expected change in fitness on that interval is a good upper bound on $\frac{\alpha \Pr[A_{j;T}]}{2N}$:
\[ \frac{1}{1-2n\epsilon}\, \mathbb{E}\Big[\sum_{t=t_0}^{t_0+T_1} \tilde f(\mu^{t+1}) - \tilde f(\nu^t)\Big] \ge \frac{\alpha \Pr[A_{j;T}]}{2N} - \frac{4}{T_1}. \tag{11} \]
We can now amplify this bound $T_2$ times, using the fact that the total fitness change is at most $\epsilon$ ($T_1$ and $T_2$ will be set below, with $T = T_1 T_2$):
\[ T_2 \Big(\frac{\alpha \Pr[A_{j;T}]}{2N} - \frac{4}{T_1}\Big) \le \frac{1}{1-2n\epsilon} \sum_{\ell=0}^{T_2-1} \mathbb{E}\Big[\sum_{t=\ell T_1}^{(\ell+1)T_1} \tilde f(\mu^{t+1}) - \tilde f(\nu^t)\Big] = \frac{1}{1-2n\epsilon}\, \mathbb{E}\Big[\sum_{t=0}^{T} \tilde f(\mu^{t+1}) - \tilde f(\nu^t) + \tilde f(\nu^t) - \tilde f(\mu^t)\Big] = \frac{\mathbb{E}\big[\tilde f(\mu^{T+1}) - \tilde f(\mu^0)\big]}{1-2n\epsilon} \le \frac{\epsilon}{1-2n\epsilon}. \]
Finally, we have
\[ \Pr[A_{j;T}] \le \frac{2N\epsilon}{\alpha T_2 (1-2n\epsilon)} + \frac{8N}{\alpha T_1}. \]
Taking $T_1 = 16 N^2 n^4$ and $T_2 = 4\epsilon N^2 n^4/(1-2n\epsilon)$ bounds the probability by $1/n^2$.

Lemma 16 tells us that with probability at least $1 - 3/n$, the population vector will be at a vertex after at most $T = O\big(\epsilon n^8 N^4/(1-2n\epsilon)\big)$ steps, in which case $\tilde f(\mu^T) \in \{1, 1+\epsilon\}$. On the other hand, Lemma 15 tells us that for any $T$, the probability that the total negative effect of the sampling exceeds $\alpha$ is at most $2\beta$; since the fitness change from each selection step is non-negative (for our choice of $\epsilon$), we have that
\[ \Pr\big[\tilde f(\mu^T) \ne 1+\epsilon\big] = \Pr\big[\tilde f(\mu^T) < \tilde f(\mu^{(0)}) - \alpha\big] \le 2\beta \]
when $\tilde f(\mu^{(0)}) > 1 + \alpha$.