Satisfiability and Evolution
Adi Livnat, Christos Papadimitriou, Aviad Rubinstein, Gregory Valiant, Andrew Wan
aa r X i v : . [ c s . CC ] A ug Satisfiability and Evolution
Adi Livnat ∗ Christos Papadimitriou † Aviad Rubinstein ‡ Gregory Valiant § Andrew Wan ¶ Abstract
We show that, if truth assignments on n variables reproduce through recombinationso that satisfaction of a particular Boolean function confers a small evolutionary ad-vantage, then a polynomially large population over polynomially many generations(polynomial in n and the inverse of the initial satisfaction probability) will end up al-most surely consisting exclusively of satisfying truth assignments. We argue that thistheorem sheds light on the problem of the evolution of complex adaptations. ∗ Department of Biological Sciences, Virginia Tech. † Computer Science Division, University of California at Berkeley, CA, 94720 USA, [email protected]. ‡ Computer Science Division, University of California at Berkeley, CA, 94720 USA,[email protected]. § Computer Science Department, Stanford University, CA, 94305. ¶ The Simons Institute for the Theory of Computing, UC Berkeley. Parts of this work were completedwhile at Harvard, and at IIIS, Tsinghua University.
Introduction
The TCS community has a long history of applying its perspectives and tools to betterunderstand the processes around us; from learning to multi-agent systems, game theory andmechanism design. By and large, the efforts to understand these areas from a rigorous andalgorithmic perspective have been very successful, leading to both rich theories and practicalcontributions. Evolution is, perhaps, one of the most blatantly algorithmic processes, yet ourcomputational understanding of it is still in its infancy (see [16] for a pioneering study), andwe currently lack a computational theory explaining its apparent success. Algorithmically,how plausible are the origins of evolution and the emergence of self-replication? Is evolutionsurprisingly efficient or surprisingly inefficient? What are the necessary criteria for evolution-like algorithms to yield rich, interesting, and diverse ecosystems? Why is recombination (i.e.,sexual reproduction) more successful than asexual reproduction? Given the reshuffling ofgenomes that occurs through recombination, how can complex traits that depend on manydifferent genes arise and spread in a population?In this work, we begin to tackle this last question of why complex traits that may de-pend on many different genes are able to efficiently arise in polynomial populations withrecombination. In the standard view of evolution, a variant of a particular gene is morelikely to spread across a population if it makes its own contribution to the overall fitness,independent of the contributions of variants of other genes. How can complex, multi-genetraits spread in a population? This may seem to be especially problematic for multi-genetraits whose contribution to fitness does not decompose into small additive components as-sociated with each gene variant —traits with the property that even if one gene variant isdifferent from the one that is needed for the right combination, there is no benefit, or evena net negative benefit. Here, we provide one rigorous argument for how such complex traitscan efficiently spread throughout a population. While we consider this question in a modelthat makes considerable (but justifiable) simplifications, this model makes a theoreticallyrigorous contribution to the fundamental problem of how evolution can produce complexity.
Motivating example: Waddington’s experiment.
In 1953 the great experimentalistConrad Waddington exposed the pupae of a population of
Drosophilia melanogaster to aheat shock, and noticed that in some of the adults that developed, the appearance of thewings had changed (they lacked a complete posterior crossvein) [17]. He then maintaineda population of flies where only those with altered wings were allowed to reproduce. Byrepeating the procedure of heat shock and selection over the generations, the percentage offlies with altered wings increased over time to values close to one. Even more interestingly,beginning at generation fourteen, some flies exhibited the new trait even without havingbeen treated with heat shock.At first sight, this surprising phenomenon — known as genetic assimilation — recallsLamarck’s now discredited belief that acquired traits can be inherited. However, Booleanfunctions provide a purely genetic explanation, which extends the idea originally offeredinformally by Stern [15] (see also [2, 5]): Suppose that the phenotype “altered wings” is aBoolean function of n genes x , . . . , x n with two alleles (variants) each, thought as {− , } {− , } variable h (standing for “high temperature”). x + x + · · · + x n + (1 + h )2 · k ≥ n, for some integer k (think of n ≈
10 , and k ≈ n/ µ ti be the averagevalue of x i in the population at time t , and assume the genotype frequencies at time t aredistributed according to a distribution µ t (the reason for denoting the distribution this waywill become clear). If mating occurs at random with free recombination then, in expectation,the average value of each x i in the next generation is given by: µ t +1 i = E µ t [ f ( x ) · x i ] E µ t [ f ] , (1)where f ( x ) = 1 exactly when a fly having genotype x will develop altered wings (i.e.,the above inequality is satisfied) and f ( x ) = 0 otherwise. We then assume that the nextgeneration will be distributed according to a product distribution µ t +1 , where each x i hasexpectation µ t +1 i . By approximating the genotype frequencies of the population for eachgeneration in this way, it can be shown by calculation that a trait with this genotypicspecification (a) is very rare in the population under normal temperature h = 1; (b) itbecomes much more common under high temperature h = 1; (c) jumps to just above 50%after the first breeding under h = 1; (d) after successive breedings with h = 1 it is nearlyfixed; and (e) if after this h becomes −
1, the trait is still quite common.
Note:
Our interpretation of Waddington’s experiment is a simplification. First, we con-sider only the distribution of genotypes in each generation to determine the distribution ofthe next; instead, we could first take a finite sample according to the present distribution anduse that sample to calculate the distribution of the next generation. Such an approximationcan only become exact when the population size is infinite, but it is a standard and usefulone in population genetics (and we shall eventually consider finite populations for our mainresult). We also assumed that each individual of the new generation is produced by samplingeach gene independently of the other genes, and with probability equal to the frequencies ofthe two alleles of this gene in the parent population (the adults of the previous generationwith altered wings). This assumption turns out to be justified in the settings that we willconsider, as will be discussed in the following section.
Populations of truth assignments
This way of looking at Waddington’s experiment brings about a very natural question: Is thisamplification of satisfying truth assignments (outcomes (c) and (d) of experiment describedabove) a property of threshold functions, or is it more general? Does it hold for all monotonefunctions, for example?
For all Boolean functions? Here we assume for simplicity haploid organisms, that is, each individual has only one copy of each gene. See the next section for any unfamiliar terms and concepts from evolution. f : {− , } n → { , } of n binary genes (in theabsence of the environmental variable h which was crucial in Waddington’s phenomenon).What if genotypes satisfying this Boolean function had a slight advantage under naturalselection? (In Waddington’s experiment, they had an absolute advantage because of the ex-perimental design.) For example, imagine that genotypes satisfying f survive to adulthoodmore than the others, in expectation, by a factor of (1 + ǫ ), for some small ǫ >
0. Wouldthis trait (that is, satisfaction of the Boolean function f ) be eventually fixed in the popula-tion? And, if so, could this be a subtle mechanism for introducing complex adaptation in apopulation?To reflect our assumption that satisfaction of f confers only an ǫ -advantage, we may takea function f : {− , } n → { , ǫ } , where we regard the value 1 + ǫ as “satisfied”, and thevalue 1 as “unsatisfied”. We track the allele frequencies from generation to generation asin Waddington’s experiment: Equation (1) gives us the average value µ t +1 i of each x i in thenext generation, and we describe the next generation by the product distribution µ t +1 .Suppose that we continue this process, starting from distribution µ , and defining { µ i } , { µ i } , . . . , { µ ti } . . . as above. Consider the average fitness of the population at time t ,defined as µ t ( f ) = Pr µ t [ f ( x ) = 1 + ǫ ]. The question is, when does µ t ( f ) approach one? Ourfirst result states that, for monotone functions, it does after O (cid:16) nǫµ ( f ) (cid:17) steps: Theorem 1. If f is monotone, then µ t ( f ) ≥ − n (1+ ǫ ) ǫtµ ( f ) . Note:
This nontrivial result also serves to illustrate one point: The work is not about sat-isfiability heuristics (monotone functions are not an impressive benchmark in this regard...).Heuristics are about finding good individuals in a population. In contrast, evolution is aboutcreating good populations . This is our focus here.Our ambition is to prove the same result for all Boolean functions. Immediately wesee that this is impossible if we insist on an infinite population: Consider the function f = x ⊕ x : starting with the uniform distribution at time t = 0, the above dynamics wouldleave the distribution unchanged, for all time, and hence µ t ( f ) = 1 / t . The parityfunction is not the only Boolean function with this property: for example the function“ P ni =1 x i = k ”, if started at µ i = kn , will stay at that spot forever, and will always have µ t ( f ) = O (1 / √ k ). However, experimentation shows that these “spurious fixpoints” are notabsorbing, and evolution pulls the distribution away from them and towards satisfaction.That is, this disappointing phenomenon is an artifact of the infinite population simplification.Indeed, random genetic drift due to sampling effects has been considered to be a significantcomponent of evolution at the molecular level (it is possible for an allele to become fixed inthe population even in the absence of selection). Thus, we need to make the model morerealistic.We adopt a model, consisting of the following process: At each generation t we create alarge population of N individuals (we call this the “sampling” step) by sampling N timesfrom the product distribution µ t to obtain y (1) , . . . , y ( N ) ( N is assumed to remain constantfrom one generation to the next, which is a standard assumption in population genetics [6]).The empirical allele frequencies of the sample are given by a vector ν t , where for each i we3ave: ν ti = 1 N N X j =1 y ( j ) i . We write ν t ∼ B ( µ t ) to denote a draw from this distribution and use ν t to denote the impliedBernoulli distribution.We then enforce the assumed selection advantage of satisfaction to obtain the “in-expectation” frequencies of the subsequent generation: µ t +1 i = E ν t [ f ( x ) · x i ] E ν t [ f ] . (2)We show that when selection is weak, any satisfiable Boolean function will almost surely bealways satisfied after polynomially many time steps. Theorem 2. (informal statement)
For any satisfiable Boolean function f of n variablesand any sufficiently small ǫ > , after T generations of N individuals µ T ( f ) = 1 withprobability arbitrarily close to one, where T and N are polynomial functions of n , ǫ , and µ ( f ) . The proof of Theorem 2 shows why the population does not become stuck at the pre-viously discussed “spurious fixpoints;” sampling effects ensure movement over sufficientlymany generations, and selection ensures movement is made towards satisfaction.
Outline of the paper
In the next section we introduce some basic concepts from population genetics, we defineand justify our simplified model, and we present a result due to Nagylaki [12] implyingthat, if selection is weak, then one can assume that the genotype distribution is a productdistribution. In Section 3, we show Theorem 1 on monotone functions. Our main resultis given in Section 4, and its proof outlined; the full proof is detailed in the Appendix. InSection 5, we conclude with a discussion of our result, and a number of open problems.
The genetic makeup of an organism is its genotype , which specifies one allele (gene variant)for each genetic site, or “locus,” in the haploid case. We shall be focusing on n specificgenes of interest (say, a few dozen out of the many thousands of genes of the species). Ateach locus, we assume that there are two alleles segregating in the population (hence therelevance of Boolean functions). Thus, a genotype will be a vector in {± } n . We assumethe species reproduces sexually (this is crucial, see the discussion in the last section). Ina sexual species reproduction proceeds through recombination, that is, the formation of anew genotype by choosing alleles from two parental genotypes in the previous generation.To produce each generation, the individuals mate at random (we also assume no bipartitioninto sexes) and there is no generation overlap (that is, the new generation is produced enmasse just before death of the previous one). We assume that the population size is constantat some large number N (expressed as a function of n , the number of genes of interest, which4s the basic parameter). Each genotype g ∈ {± } n is assumed to have a fitness value equalto the expected number of offsprings this genotype will produce. We also assume that thegenes recombine freely , that is, for any two genes i, j of an offspring, the probability that thealleles come from the same parent is exactly half (and not larger, as is the case if the twogenes are linked).These assumptions are simplifications of the standard model of population genetics usedbroadly in the literature, and generally trusted to preserve the essence of selection in sexualpopulations. The Boolean assumption is of course meant to bring into play mathematicalinsights from that field, but we believe that it is not restrictive (for example, allele − ǫ , where ǫ > n genes, with probability equal to the probability ofoccurrence of that allele in the parent generation. That is, we assume that the distributionof the genotypes in a generation is a product distribution . This situation is called in thepopulation genetics literature linkage equilibrium , or the Wright manifold [19, 20]. In general,genotype frequencies are known to be correlated, and this correlation — the distance from theproduct distribution — is called linkage disequilibrium [7] and is of importance and interestin the study of evolution. However, in the absence of selection, a standard argument showsthat the distribution of a population quickly reaches linkage equilibrium (arguments existboth for finite and infinite populations). Our previous assumption places our experimentin a regime known as weak selection . Weak selection means that the fitness values are in asmall interval [1 − ǫ, ǫ ], where ǫ is called the selection strength . An elegant and powerfulresult due to Thomas Nagylaki [12] states that, under weak selection, evolution proceeds toa point very close to linkage equilibrium. In particular, assume that a population evolvesas we described above in a regime of weak selection of strength ǫ , and let m be the totalnumber of alleles (this is 2 n in our case; actually, Nagylaki’s Theorem also holds underdiploid and partial recombination). By linkage disequilibrium we mean formally the L ∞ distance between the genotype distribution and the product distribution: Theorem 3. (Nagylaki’s Theorem, see [12]) Under weak selection, and after O (log m · log 1 /ǫ ) generations, linkage disequilibrium is O ( ǫ ) . In our setting ǫ is minuscule, so Nagylaki’s Theorem motivates our assumption thatpopulations are formed “by independent sampling of the genetic soup.” We strongly believethat our theorem is true for large ǫ as well, but this remains open, as discussed in the lastsection. In this section we give a self-contained proof of Theorem 1. The proof is simple, once aconnection is made to discrete Fourier analysis. In what follows, we assume familiarity withFourier analysis over the Boolean cube for product distributions. We briefly review somebasic facts and describe the notation used in our proofs.5or µ = ( µ , . . . , µ n ) ∈ [ − , n and a function f : {− , } nµ → R , where {− , } nµ denotesthe Boolean cube endowed with the product distribution given by µ i = E [ x i ], we considerthe µ -biased Fourier decomposition of f . Let σ i = 1 − µ i be the variance of each bit. Wedenote the µ -biased Fourier coefficients by ˆ f ( S ; µ ) = E µ [ f · φ µS ], where φ µS = Q i ∈ S x i − µ i σ i . Let D ( µ ) i f = σ i ( f i =1 − f i = − ) be the difference operator for Boolean functions over {− , } nµ . Wehave that D ( µ ) i f = X S ∋ i ˆ f ( S ; µ ) φ µS \{ i } , (3)and in particular, E µ [ D ( µ ) i f ] = ˆ f ( i ; µ ), which we will use repeatedly throughout our proofs.Our first step will be to observe that the change in allele frequencies from one generationto the next may be expressed in terms of f ’s linear Fourier coefficients. Let µ be the vectorwhich specifies the allele frequency of the population at time t . Then, letting µ ′ be the allelefrequency vector at time t + 1 and using the selection specified by Equation (1), we havethat µ ′ i − µ i = σ i ˆ f ( i ; µ ) E µ [ f ] . (4)This follows immediately from the definitions: σ i · ˆ f ( i ; µ ) = σ i · E µ [ f · φ µi ]= E µ [ f · x i ] − E µ [ f ] · µ i = E µ [ f ] · µ ′ i − E µ [ f ] · µ i . Our proof uses the following well-known facts, which are easily derived from the basicnotions (see Chapter 2.3, [13] ). First, we have that the influences of a monotone functionare given by its linear coefficients. (For a function f : {− , } nµ → R , we denote its influencein direction i by P S ∋ i ˆ f ( S ; µ ) .) Next, the inequality of Poincar´e lower bounds the totalinfluence of a function by its variance. The versions below have been scaled to our setting andcan be obtained by applying the original facts to a Boolean function g : {− , } n → {− , } and setting f ( x ) = 1 + ǫ (1 + g ). Proposition 4.
Let f : {− , } nµ → { , ǫ } be monotone. Then for all i ∈ [ n ] : X S ∋ i ˆ f ( S ; µ ) = ǫσ i · ˆ f ( i ; µ ) . Proposition 5.
Let f : {− , } nµ → { , ǫ } and Var[ f ] = E µ [ f ] − E µ [ f ] . Then X i ∈ [ n ] X S ∋ i ˆ f ( S ; µ ) = X S ⊆ [ n ] | S | ˆ f ( S ; µ ) ≥ Var[ f ] . Equation (4) tells us that the bias of each bit i increases according to the correspondingcoefficient ˆ f ( i ). Proposition 4 tells us that for monotone f , the linear coefficients correspondto the influences of f . Finally, the inequality of Poincar´e tells us that the linear coefficientsmust be large.We may now prove Theorem 1. Theorem 1.
Let f : {− , } n → { , ǫ } be monotone. Then µ t ( f ) ≥ − n (1+ ǫ ) ǫtµ ( f ) .6 roof. Combining Equation (4) with Propositions 4 and 5 tells us that the sum of the biasesincreases at each step: X i ∈ [ n ] ( µ ′ i − µ i ) = 2 ǫ · E µ [ f ] X i ∈ [ n ] X S ∋ i ˆ f ( S ; µ ) ≥ ǫ · E µ [ f ] Var[ f ]= 2 ǫ · E µ [ f ] ǫ µ ( f )(1 − µ ( f ))Let µ t ( f ) = 1 − δ. Then for all t ′ ≤ t , the sum of the biases increases at each step: n X i =1 µ t ′ +1 i − n X i =1 µ t ′ i ≥ ǫµ t ′ ( f )(1 − µ t ′ ( f )) E t ′ µ [ f ] ≥ ǫµ t ′ ( f ) δ ǫ ≥ δǫµ ( f )1 + ǫ . On the other hand, we know that − n ≤ P ni =1 µ t ′ i ≤ n for all t ′ , so t ≤ n (1+ ǫ ) δǫµ ( f ) . We remark that Theorem 1 (with worse parameters) can also be proven using a general-ization of the Russo-Margulis lemma to product distributions, which states that the gradientof E µ [ f ] (as a function of µ ) corresponds to the influences of f (see Appendix B.1). For a function f : {− , } n → { , ǫ } , consider the multilinear extension ˜ f : [ − , n → [1 , ǫ ] defined by ˜ f ( µ ) = E x ∼ µ [ f ( x )]. Our goal is to understand when ˜ f ( µ ) = 1 + ǫ . Westart with the precise statement of the main result (compare with Theorem 2): Theorem 6.
Let β = q ǫN (1 − nǫ ) . If ˜ f (cid:0) µ (cid:1) > r β ln 2 β then there is some constant C such that for any T ≥ C · ǫn · N − nǫ : Pr h ˜ f (cid:0) µ ( T ) (cid:1) = 1 + ǫ i ≥ − β − /n. Note that the conditions in Theorem 6 imply restrictions on the initial probability ofsatisfaction and the strength of selection. In particular, the selection coefficient must be inthe range 1 /N / < ǫ < /n (we discuss this restriction in the next section), and the initialprobability of satisfaction must be at least N − / . The full proof of the theorem is given inthe Appendix; in this section we sketch its salient points.One first difficulty in the proof is this: The convergence proof gauges the improvement inaverage population fitness obtained during the second of the two steps per generation (the7tness step). However, the first of the two steps (the sampling step) introduces variance, andwe must establish that this variance is insignificant in comparison with the increase in fitness.Our first lemma (Lemma 7) establishes that the difference between the average fitness of thesample and the average fitness, squared (that is to say, the variance introduced), is boundedfrom above by the increase in average fitness obtained in the fitness step: E ν ∼ B [( ˜ f ( ν ) − ˜ f ( µ )) ] ≤ E ν ∼ B [ ˜ f ( µ ′ ) − ˜ f ( ν )] / [( N − · (1 − nǫ )] . (5)Here we focus on one generation, so µ denotes the product distribution from which thesampling is made, ν the empirical product distribution of the sample (note that ˜ f ( ν ) is arandom variable with expectation ˜ f ( µ )), and µ ′ the product distribution resulting from theselection (or fitness) step. Thus, µ ′ is the initial product distribution in the next generation.To establish inequality (5), we first show that the right-hand side is lower bounded bythe total mass of the singleton Fourier coefficients of the biased transform (Lemma 8):˜ f ( µ ′ ) − ˜ f ( ν ) ≥ (1 − nǫ ) n X i =1 ˆ f ( i ; ν ) . (6)The intuition in the proof of (6) is that the fitness step is very close to an ǫ -long stepof the gradient ascent of the average fitness function (this intuition is very accurate awayfrom the boundary of the hypercube). Gradient ascent in each coordinate is captured bythe corresponding singleton coefficient squared. But then there is an analytical complicationof approximating the overall ascent by the sum of sequential coordinate-wise ascents; thedifficulty is, of course, that the partial derivatives change after each small ascent, and thechange must be bounded (Lemma 10).This establishes that the fitness increase in the selection step is larger than the linearFourier mass, and hence nonnegative when ǫ is small. However, the linear Fourier mass maybe zero, as is the case for the exclusive-or function under the uniform distribution (recall thediscussion a few lines after Theorem 1). Here, sampling effects will ensure that progress ismade in expectation. We show that, on average, the linear Fourier mass is much larger thanthe variance (Lemma 9): E ν ∼ B h ( ˜ f ( µ ) − ˜ f ( ν )) i ≤ N − E ν ∼ B " n X i =1 ˆ f ( i ; ν ) (7)The rather involved proof of (7) takes place entirely within the biased Fourier domain (seeAppendix B.3). Now notice that (7), combined with (6), completes the proof of inequality(5) and Lemma 7.Note that the upper bound on the variance in (5) includes in the denominator a factor of(1 − ǫn ) · N . This immediately tells us that our technique is sharpest when the population N is large and the selection strength ǫ is small — in particular, it must be smaller than n .This latter point is a rather puzzling limitation of our result: Why does a theorem about theeffectiveness of natural selection become harder to prove when selection is stronger? Oneintuitive explanation is that in this case selection works very much like gradient ascent, andit is well known that the convergence of gradient ascent is harder to establish when the ascentstep is large, as a large step can “skip over” the stationary point sought. Is this upper limiton ǫ necessary? This is an intriguing open question discussed in the last section.8ext, we establish that the total effect of the sampling steps is small: For any α > p β ln 2 β − , Pr[ | T X t =1 ˜ f ( ν t ) − ˜ f ( µ t − ) | ≥ α ] ≤ β, where β = (cid:16) ǫN (1 − nǫ ) (cid:17) / .It is not hard to see that the sum P Tt =1 ˜ f ( ν t ) − ˜ f ( µ t − ) constitutes a martingale, albeitone with no obvious upper bound on each step. In Lemma 15 we bound the total effectof the sampling step by resorting to a rather exotic martingale inequality derived from ageneralization of Bernstein’s inequality to martingales with unbounded jumps and proved in[4] (in fact, a specialization stated in Appendix C as Lemma 14).Incidentally, notice that this is the place where it is proved, quite indirectly, that the sam-pling step succeeds in getting the process unstuck from spurious fixpoints such as ( , , . . . , )for the exclusive-or function: Since the total effect of sampling is limited, the increase inaverage fitness must eventually prevail.Finally, when the process is near a vertex of the hypercube, fitness increases are too smallto help finish the argument, but here we rely on the fact that the process is very likely todrift so close to a vertex that it will eventually get stuck there (Lemma 16), completing theproof of the main result. We proved a novel and highly nontrivial aspect of Boolean satisfiability: By randomly cross-ing assignments and favoring satisfaction slightly, one can breed a population of pure sat-isfying truth assignments. We argued that this rather curious property seems important inunderstanding one intriguing aspect of evolution: how complex traits controlled by manygenes can emerge.There are many roads of mathematical inquiry opened by this theorem. First, can thelimitations/restrictions of our model be relaxed so that it better reflects the realities of life?Some of the assumptions in our model are arguably unrealistic (haploidy, fixed populationsize, random mating, partly in-expectation fitness calculation), but these follow widely ac-cepted practices in population genetics needed for mathematical simplification. We alsomake the assumption of weak selection, but this is also a very defensible one for unlinkedloci.There are, however, a few further restrictions of our model that call for discussion: • Two alleles per gene.
The motivation is, of course, that this assumption ushers in thepowerful analytical toolbox of Boolean functions. We have no doubt that similar resultshold for more alleles, but would require a great number of technical adjustments. • Fitness landscape.
We assumed a very specialized fitness landscape with values 1 and1 + ǫ only. This is a natural simplification that facilitates the connection to Booleanfunctions, but we do not believe it is an essential one. We believe that this result canbe extended to much richer landscapes with a small gap, for example to situations inwhich fitness values are in [1 − δ, ∪ [1 + ǫ, ǫ + δ ] for some small δ > ǫ is larger? As we have mentioned,this is an analytical challenge with roots in the difficulty of the analysis of gradient descent.Of course, a constant gap would bring us outside the realm of weak selection, and renderour approximation by product distribution baseless. There are two ways we can proceed:One is to prove that the exact recurrence equations of genotype frequencies yield eventualsatisfaction. This seems possible but challenging.Another avenue, which we have followed for some time, is to work with product distri-butions anyway. In particular, what if the fitness landscape has values { , } — that is tosay, non-satisfying truth assignments are removed from the population, as in Waddington’sexperiment? This is a realistic approximation if, for example, this selection does not happenin every generation but every O (log n ) generations (because breeding without selection isknown to take you close to the Wright manifold). In such a setting, our quadratic bound forthe in-expectation process of monotone functions no longer requires any dependence on theinitial probability of satisfaction µ ( f ). For the process with sampling, we have the followingconjecture. Conjecture:
If the fitness landscape has values { , } , then the process reaches near uni-versal satisfaction with probability approaching as the population size goes to infinity. We now want to point out an obvious and yet surprising aspect of our work: In thetraditional framework of adaptive evolution, each allele spreads in the population mainlyeither due to an additive contribution to fitness that it makes in and of itself (let us callthis “traditional propagation”) or due to random genetic drift [1, 19, 20, 18]. In our model,however, alleles at different genes are spreading in the population as governed by the complexinteractions between them that are continually subject to selection. Thus, a population canchange dramatically through a novel process involving subtle changes in genetic statisticsand simultaneous gradual emergence in the whole population [8, 11], and not by traditionalpropagation.Furthermore, notice that since recombination is a crucial ingredient of our analysis, ourresults inform the question of the role of sex in evolution. In this regard they add to recentworks that have begun to examine the role of sex while giving full weight to the importanceof genetic interactions [3, 9].Finally, can our bounds be improved? For the monotone case, it is easy to see thatthe TRIBES function with appropriate fan-in provides a matching lower bound. As for thegeneral case, we feel that the very generous bounds of the main result can be improvedsubstantially. For example, the assumed time bound is only necessary in order to finish thelast part of the argument (convergence to a vertex) once the vast majority of the populationis already satisfying; more analysis is needed to investigate this subtle phenomenon.Our proof that the population converges to a single satisfying truth assignment may seema troubling aspect of our result. Two remarks: First, the loss of genetic diversity shouldnot be surprising in itself. With drift alone, for each locus, one allele will become fixedeventually (where the probability that a particular allele will be the fixed allele is proportionalto its current frequency in the population). Second, in our process many satisfying truthassignments are likely to survive for a very long time before the random walk clears thepicture. This fact may be more relevant than the characteristics of eqilibrium; after all,evolution happens in the transient. 10 cknowledgments:
We are grateful to Yu Liu of Tsinghua University for some veryinteresting conversations in the beginning of this research.
References [1]
The Genetical Theory of Natural Selection . Oxford: The Clarendon Press, 1930.[2] K G Bateman. The genetic assimilation of four venation phenocopies.
J Genet , 56:443–447, 1959.[3] E Chastain, A Livnat, C Papadimitriou, and U Vazirani. Algorithms, games, andevolution.
Proceedings of the National Academy of Sciences , 111(29):10620–10623, 2014.[4] K. Dzhaparidze and J.H. van Zanten. On Bernstein-type inequalities for martingales.
Stochastic Processes and their Applications , 93(1):109 – 117, 2001.[5] D S Falconer.
Introduction to Quantitative Genetics . Oliver and Boyd, Edinburgh, 1960.[6] J H Gillespie.
Population Genetics: A Concise Guide . JHU Press, 2004.[7] R.C. Lewontin. The interaction of selection and linkage. I. General considerations:heterotic models.
Genetics , 49:49–67, 1964.[8] A Livnat. Interactions-based evolution: how natural selection and nonrandom mutationwork together.
Biology Direct , 8:24, 2013.[9] A Livnat, C Papadimitriou, J Dushoff, and M W Feldman. A mixability theory for therole of sex in evolution.
Proceedings of the National Academy of Sciences , 105(50):19803–19808, 2008.[10] G. Margulis. Probabilistic characteristics of graphs with large connectivity.
Prob.Peredachi Inform. , 10:101–108, 1974.[11] E Mayr.
Animal Species and Evolution . Belknap Press, 1963.[12] T. Nagylaki. The evolution of multilocus systems under weak selection.
Genetics ,134(2):627–647, 1993.[13] Ryan O’Donnell.
Analysis of boolean functions . Cambridge University Press, 2014.[14] L. Russo. On the critical percolation probabilities.
Z. Wahrsch. werw. Gebiete , 43:39–48,1978.[15] C. Stern. Selection for sub-threshold differences and the origin of the pseudoexogenousadaptations.
The American Naturalist , 92, 313-316 1958.[16] Leslie G Valiant. Evolvability.
Journal of the ACM (JACM) , 56(1):3, 2009.[17] C. H. Waddington. Genetic assimilation of an acquired character.
Evolution , 7(2):118–126, 1953. 1118] M J Wade and C J Goodnight. Perspective: the theories of fisher and wright in thecontext of metapopulations: when nature does many small experiments.
Evolution ,52(6):1537–1553, 1998.[19] S. Wright. Evolution in Mendelian populations.
Genetics , 16:97–159, 1931.[20] S. Wright. The roles of mutation, inbreeding, crossbeeding and selection in evolution.In
Proc. 6th International Congress of Genetics , volume 1, pages 356–366, 1932.
A Outline of proof
In the following sections, we prove Theorem 6:
Theorem.
Let β = q ǫN (1 − nǫ ) . If ˜ f (cid:0) µ (cid:1) > r β ln 2 β then there is some constant C such that for any T ≥ C · ǫn · N − nǫ : Pr h ˜ f (cid:0) µ ( T ) (cid:1) = 1 + ǫ i ≥ − β − /n. Our proof of Theorem 6 is structured as follows. In Section B, we consider the averagefitness from one generation to the next. As described in Section 1, each generation consistsof two steps: the sampling step, which begins with a product distribution µ and results inan empirical product distribution ν , and the fitness step resulting in a distribution µ ′ (whichbecomes the initial distribution for the next generation). The main lemma (and the mostinvolved step of our proof) of Section B is Lemma 7, which upper bounds the variance of˜ f ( ν ), by a small fraction of E [ ˜ f ( µ ′ ) − ˜ f ( ν )], the expected increase in average fitness by thefitness step.In Section C, we apply Lemma 7 with the martingale inequality to prove Lemma 15,which states that the total fitness decrease will be small with high probability. Finally, wecomplete the proof of the main theorem in Section D by arguing (Lemma 16) that for T asstated in the theorem, µ T will reach a vertex of the hypercube (and hence f ( µ T ) ∈ { , ǫ } )with high probability. B Selection vs sampling effects
In this section we consider just one step of the process. Let µ be the initial product dis-tribution of a generation, ν be the empirical product distribution from the sampling step,and µ ′ be the product distribution after the fitness step. Our main goal in this section is toshow that the variance of ˜ f ( ν ), the average fitness of the population after the sampling step,is small compared to the expected increase in average fitness from the subsequent selectionstep, ˜ f ( µ ′ ) − ˜ f ( ν ). The main lemma we will prove is the following:12 emma 7. Let ν be the vector of expectations of allele frequencies in the population sampleof size N , drawn according to B ( µ ) . Then: E ν ∼ B [( ˜ f ( ν ) − ˜ f ( µ )) ] ≤ E ν ∼ B [ ˜ f ( µ ′ ) − ˜ f ( ν )] / ( N − · (1 − nǫ ) . We will prove Lemma 7 by proving two intermediate lemmas. First, we show that fitnessincrease by the selection step ˜ f ( µ ′ ) − ˜ f ( ν ) is nearly as large as the ν -biased Fourier weightof the linear coefficients of f (and hence non-negative), provided that ǫ is sufficiently small. Lemma 8.
Let µ ′ be the expectations of the process after selection from the population ν .Then: ˜ f ( µ ′ ) − ˜ f ( ν ) ≥ (1 − nǫ ) n X i =1 ˆ f ( i ; ν ) . Next, we show that the variance of ˜ f ( ν ) is at most a small fraction of the expected linear ν -biased Fourier mass of f : P ni =1 ˆ f ( i ; ν ) —here the expectation is taken over the choice of ν. Lemma 9. E ν ∼ B h ( ˜ f ( µ ) − ˜ f ( ν )) i ≤ N − E ν ∼ B " n X i =1 ˆ f ( i ; ν ) . Combining Lemmas 8 and 9 gives us Lemma 7.
B.1 Preliminaries
Throughout our proofs, we will use the notation and basic facts established at the beginningof Section 3. At times, we will use biased Fourier analysis with different product distributions µ and µ ′ at the same time. To prevent ambiguity, we will refer to the standard deviationof the i ’th bit as σ i ( µ ) = 1 − µ i (similarly for σ i ( µ ′ )), but we will use σ i when the contextmakes the distribution clear.Recall that the exentsion of f : {− , } n → { , ǫ } ,˜ f ( µ ) = E x ∼ µ [ f ( x )] = X S ⊆ [ n ] ˆ f ( S ) Y j ∈ S µ j is multilinear. Note that its derivative in the i ’th direction is given by ∂ ˜ f ( µ ) ∂µ i = X S ∋ i ˆ f ( S ) Y j ∈ S \ i µ j = E µ [ D (1 / i f ]= 1 σ i E µ [ D ( µ ) i f ]= ˆ f ( i ; µ ) σ i . (8)Here (8) is a straightforward generalization of the Russo-Margulis Lemma [10, 14] for productdistributions. Thus, we may write the change in allele frequency from the fitness step as: µ ′ i − ν i = σ i ( ν ) ˆ f ( i ; ν ) E ν [ f ] = σ i ( ν )˜ f ( ν ) ∂ ˜ f ( ν ) ∂ν i , (9)where the first equality holds by Equation (4) (derived in Section 3).13 .2 Proof of Lemma 8 We will compute the fitness change as each coordinate changes. Consider the hybrid distri-butions given by the expectations w i = ( µ ′ , . . . , µ ′ i , ν i +1 , . . . , ν n ) so that w = ν and w n = µ ′ .We have that ˜ f ( µ ′ ) − ˜ f ( ν ) = n X i =1 ˜ f ( w i ) − ˜ f ( w i − ) . Observe that the first increment is easily computed as˜ f ( w ) − ˜ f ( ν ) = ( µ ′ − ν ) ∂ ˜ f ( ν ) ∂ν = ˆ f (1; ν ) E ν [ f ] . However, this formula will not be valid for subsequent hybrids as the derivatives of ˜ f havechanged. We start by showing that the derivatives in each direction do not change muchbetween the hybrid distributions. The lemma below will allow us to approximate the deriva-tives of the hybrids by the derivative of ˜ f at ν . Lemma 10.
Let ν ′ ≥ ν ∈ [+1 , − n differ only on the j -th coordinate, i.e., ν ℓ = ν ′ ℓ for all ℓ = j and ν ′ j > ν j . Then for any i ≥ and j < i , ∂ ˜ f ( ν ′ ) ∂ν ′ i − ∂ ˜ f ( ν ) ∂ν i = ( ν ′ j − ν j ) ˆ f ( { i, j } ; ν ) σ i σ j . In particular, for the hybrid distributions above with i ≥ : ∂ ˜ f ( w i − ) ∂w i − i − ∂ ˜ f ( ν ) ∂ν i = 1 E ν [ f ] σ i i − X j =1 ˆ f ( j ; ν ) ˆ f ( { i, j } ; ν ) . Proof.
Expanding the derivatives in terms of the unbiased Fourier coefficients, we have that ∂ ˜ f ( ν ′ ) ∂ν ′ i − ∂ ˜ f ( ν ) ∂ν i = X S ∋ i ˆ f ( S ) Y ℓ ∈ S \ i ν ′ ℓ − Y ℓ ∈ S \ i ν ℓ = X S ∋{ j,i } ˆ f ( S )( ν ′ j − ν j ) Y ℓ ∈ S \{ j,i } ν ℓ = ( ν ′ j − ν j ) E ν [ D (1 / i D (1 / j f ] . The proof of the first equality in the lemma is completed by noting that: E ν [ D (1 / i D (1 / j f ] = 1 σ i σ j E ν [ D ( ν ) i D ( ν ) j f ] . For the second equality of the lemma, we may write the difference between the derivative inthe i ’-th direction under the hybrid distribution w i − and under the original distribution as14 telescoping sum: ∂ ˜ f ( w i − ) ∂w i − i − ∂ ˜ f ( ν ) ∂ν i = i − X j =1 ( w jj − w j − j ) ˆ f ( { i, j } ); ν ) σ i σ j = i − X j =1 σ j ˆ f ( j ; ν ) E ν [ f ] ˆ f ( { i, j } ; ν ) σ i σ j = 1 E ν [ f ] σ i i − X j =1 ˆ f ( j ; ν ) ˆ f ( { i, j } ; ν ) . We are now ready to prove Lemma 8:
Lemma (8) . Let µ ′ be the expectations of the process after selection from the population ν .Then: ˜ f ( µ ′ ) − ˜ f ( ν ) ≥ (1 − nǫ ) n X i =1 ˆ f ( i ; ν ) . Proof.
We will bound each of the differences between the hybrid densities in the summation:˜ f ( µ ′ ) − ˜ f ( ν ) = n X i =1 ˜ f ( w i ) − ˜ f ( w i − ) . Since ˜ f is multilinear, we have that for i ≥ f ( w i ) − ˜ f ( w i − ) = ( µ ′ i − ν i ) ∂ ˜ f ( w i − ) ∂w i − i = σ i ˆ f ( i ; ν ) E ν [ f ] ∂ ˜ f ( ν ) ∂ν i + ∂ ˜ f ( w i − ) ∂w i − i − ∂ ˜ f ( ν ) ∂ν i ! = σ i ˆ f ( i ; ν ) E ν [ f ] ˆ f ( i ; ν ) σ i + 1 E ν [ f ] σ i i − X j =1 ˆ f ( j ; ν ) ˆ f ( { i, j } ; ν ) ! . Recalling that ˆ f ( { i, j } ; ν ) = σ i σ j E ν [ D (1 / i D (1 / j f ] and using the fact that | D (1 / i D (1 / j f ( x ) | ≤ ǫ/ x and E ν [ f ] ≥
1, we have:1 E ν [ f ] σ i (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) i − X j =1 ˆ f ( j ; ν ) ˆ f ( { i, j } , ν ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ ǫ i − X j =1 σ j | ˆ f ( j ; ν ) | . (10)Assume WLOG that σ i | ˆ f ( i ; ν ) | ≥ σ j | ˆ f ( j ; ν ) | for all j < i . so that (10) is at most ǫ ( i − σ i | ˆ f ( i ; ν ) | . Substituting into the telescoping sum, we have: n X i =1 ˜ f ( w i ) − ˜ f ( w i − ) ≥ E ν [ f ] n X i =1 σ i ˆ f ( i ; ν ) ˆ f ( i ; ν ) σ i − ǫ i − σ i | ˆ f ( i ; ν ) | ! ≥
11 + ǫ n X i =1 ˆ f ( i ; ν ) (1 − ǫ i − σ i ) ≥ (1 − nǫ ) n X i =1 ˆ f ( i ; ν ) . B.3 Proof of Lemma 9
Our goal is to show that: E ν ∼ B " n X i =1 ˆ f ( i ; ν ) ≥ ( N − · E ν ∼ B h ( ˜ f ( µ ) − ˜ f ( ν )) i . The first key observation is that the Fourier basis with respect to the product distribution µ is still orthogonal with respect to B , i.e., we have E ν ∼ B [ φ µS ( ν ) φ µT ( ν )] = 0 for S = T , because B is a product distribution and E ν ∼ B [ φ µS ( ν )] = 0 for S = ∅ . In particular, Parseval’s holdsfor the extension of f with respect to B : Claim 11.E ν ∼ B [ ˜ f ( ν ) ] = E B X S ˆ f ( S ; µ ) φ µS ( ν ) ! = X S ˆ f ( S ; µ ) E ν ∼ B [ φ µS ( ν ) ] . Our approach will be to consider both sides of the inequality using the µ -biased Fourierbasis of f . This is straightforward for the variance of ˜ f ( ν ) using Parseval’s. For the righthand side, we observe that the ν -biased linear coefficients may be viewed as functions in ν . In fact, each linear coefficient can be viewed as the extension of the µ -biased differenceoperator to [ − , n , modulo a normalization factor. Claim 12. ˆ f ( i ; ν ) = σ i ( ν ) σ i ( µ ) X S ∋ i ˆ f ( S ; µ ) φ µS \ i ( ν ) . Proof.
We rewrite the linear ν -biased Fourier coefficient in terms of the µ -biased differenceoperator: ˆ f ( i ; ν ) = E x ∼ ν [ D ( ν ) i f ] = σ i ( ν ) E ν [ D (1 / i f ]= σ i ( ν ) σ i ( µ ) E ν h D ( µ ) i f i = σ i ( ν ) σ i ( µ ) X S ∋ i ˆ f ( S ; µ ) E ν [ φ µS \ i ]= σ i ( ν ) σ i ( µ ) X S ∋ i ˆ f ( S ; µ ) φ µS \ i ( ν ) , D ( µ ) i , and the final holds because ν is a product distribution.Finally, we will use the fact that the variance of φ µi ( ν ) grows smaller as the sample sizeincreases: Fact 13. E ν ∼ B [ φ µS \ i ( ν ) ] = E ν ∼ B [ φ µS ( ν ) ] · N. Proof.
Because B is a product distribution, we have: E ν ∼ B [ φ µS ( ν ) ] = E ν ∼ B [ φ µS \ i ( ν ) ] · E ν ∼ B [ φ µi ( ν ) ] . Then E ν ∼ B [ φ µi ( ν ) ] = E ν ∼ B [( ν i − µ i ) /σ i ( µ )]= E ν ∼ B [( ν i − µ i ) ]1 − µ i = E B [ ν i ] − µ i − µ i = N − + µ i − N − µ i − µ i − µ i . With the previous claims in hand, we are ready to prove Lemma 9:
Lemma (9) . E ν ∼ B h ( ˜ f ( µ ) − ˜ f ( ν )) i ≤ N − E ν ∼ B " n X i =1 ˆ f ( i ; ν ) . Proof.
We first consider the expected ν -biased linear Fourier weight. Applying Claims 12and 11, and summing over i we have: n X i =1 E ν ∼ B [ ˆ f ( i ; ν ) ] = n X i =1 E B [ σ i ( ν ) ] σ i ( µ ) X S ∋ i ˆ f ( S ; µ ) E B [ φ µS \ i ( ν ) ]= (cid:18) − N (cid:19) n X i =1 X S ∋ i ˆ f ( S ; µ ) E B [ φ µS \ i ( ν ) ]= (cid:18) − N (cid:19) n X i =1 X S ∋ i ˆ f ( S ; µ ) E B [ φ µS ( ν ) ] N ≥ ( N − · X S = ∅ ˆ f ( S ; µ ) E [ φ µS ( ν ) ] . σ i ( ν ) depends only on ν i , while φ µS \ i does not, so theexpectations may be taken separately. For the next equality, we calculate E B [ σ i ( ν ) ] = 1 − E B [ ν i ]= 1 − (cid:18) µ i + 1 − µ i N (cid:19) , which gives that E B [ σ i ( ν ) ] σ i ( µ ) = 1 − /N. The third equality holds by Fact 13, and the final inequality holds since each non-emptycoefficient appears in the sum at least once.Using Claim 11 and rewriting the variance of ˜ f ( ν ) using the µ -biased Fourier basis for f , we have: E ν ∼ B [( ˜ f ( ν ) − ˜ f ( µ )) ] = E B [( X S = ∅ ˆ f ( S ; µ ) φ µS ( ν )) ]= X S = ∅ ˆ f ( S ; µ ) E B [ φ µS ( ν ) ] . C Bounding the cumulative effect of sampling
We saw in Lemma 8 that the selection step always results in a non-negative change in fitnesswhen ǫ is sufficiently small. The sampling steps, however, may decrease fitness. In thissection we show that the cumulative effect of sampling on fitness will be small. We use µ t to denote the initial distribution of the generation at time t , and ν t to denote the productdistribution after the sampling step. according to B ( µ t ). Then the selection step determinesthe population at time t + 1 which we write as µ t +1 (determined by ν t ). By Lemma 7 at eachstage the variance of ˜ f ( ν ) is a small fraction of the expected fitness increase after selection.Summing over all generations, the total variance from the sampling steps is a small fractionof the total fitness increase from the selection steps. Finally, we bound from above the totalfitness decrease due to sampling effects; for this last step we need the following generalizationof Bernstein’s inequality to martingales with unbounded jumps by Dzhaparidze and Zanten: Lemma 14. (Theorem 3.3 , [4]) Let {F t } t =0 , ,... be a filtration, and let ζ , ζ , . . . be a mar-tingale difference sequence w.r.t. {F t } . Consider the martingale S T = P Tt =1 ζ t . Define: H T = X ζ t + X E (cid:2) ζ t | F t − (cid:3) Then, for each stopping time τ , Pr (cid:20) max T ≤ τ | S T | > z, H τ ≤ L (cid:21) ≤ (cid:18) − z L (cid:19) This is a special case of their Theorem 3.3, which corresponds, in their notation, to the limit as a → Lemma 15.
Let β = (cid:16) ǫ ( N − − nǫ ) (cid:17) / and α = q β ln β . Then Pr[ | T X t =1 ˜ f ( ν t ) − ˜ f ( µ t − ) | ≥ α ] ≤ β. Proof.
For each sequence ( µ t ) Tt =0 of populations up to time T , define its congruence class asa subset of infinite sequences: h(cid:0) µ t (cid:1) Tt =0 i = (cid:8)(cid:0) w t (cid:1) ∞ t =0 : w t = µ t ∀ t ≤ T (cid:9) Now, for a time T , consider the space of possible sequences of populations: F T = nh(cid:0) µ t (cid:1) Tt =0 io . Then F ⊂ F ⊂ . . . is a filtration. We will consider the following martingale: S T = T X t =0 ζ t = T X t =0 ˜ f (cid:0) ν t (cid:1) − ˜ f (cid:0) µ t (cid:1) Notice that this is indeed a martingale becauseE [ S T | F T − ] = E " T X t =0 ˜ f (cid:0) ν t (cid:1) − ˜ f (cid:0) µ t (cid:1) | F T − = S T − + E h ˜ f (cid:0) ν T (cid:1) − ˜ f (cid:0) µ T (cid:1) | F T − i = S T − . To apply Lemma 14, we also define the following sequences: M T = T X t =0 (cid:16) ˜ f (cid:0) ν t (cid:1) − ˜ f (cid:0) µ t (cid:1)(cid:17) V T = T X t =0 E (cid:20)(cid:16) ˜ f (cid:0) ν t (cid:1) − ˜ f (cid:0) µ t (cid:1)(cid:17) | F t − (cid:21) H T = M T + V T For each T , we show that Pr[ H T ≥ β ] ≤ β by bounding E [ H T ] and applying Markov’sinequality. Applying Lemma 7, we have that E [ M T ] = E " T X t =0 ( ˜ f ( ν t ) − ˜ f ( µ t )) ≤ N − − nǫ ) · E " T X t =0 ˜ f ( µ t +1 ) − ˜ f ( ν t ) ≤ ǫ ( N − − nǫ ) , E " T X t =0 ˜ f ( µ t +1 ) − ˜ f ( ν t ) + T X t =0 ˜ f ( ν t ) − ˜ f ( µ t ) = E [ ˜ f ( µ T +1 ) − ˜ f ( µ )] ≤ ǫ, and E [ S T ] = E [ P Tt =0 ˜ f ( ν t ) − ˜ f ( µ t )] = 0 . Similarly we may apply Lemma 7 to each term of V T : E h ( ˜ f ( ν t ) − ˜ f ( µ t )) |F t − i = E ν t ∼ B ( µ t ) h ( ˜ f ( ν t ) − ˜ f ( µ t )) i ≤ N − − nǫ ) · E ν t ∼ B ( µ t ) h ˜ f ( µ t +1 ) − ˜ f ( ν t ) i . Summing over t and taking the expectation, we again have that E [ V T ] ≤ ǫ/ ( N − − nǫ ).Thus, we have that E [ H T ] = E [ M T + V T ] ≤ ǫ ( N − − nǫ ) ≤ β . Finally, applying Lemma 14to S T , we have Pr[max T ≤ τ | S T | ≥ α, H T ≤ β ] ≤ (cid:18) − α β (cid:19) ≤ β. Combining this with the bound Pr[ H T ≥ β ] ≤ β gives the lemma. D Proof of the main theorem
To complete the proof of Theorem 6, we first show (Lemma 16 below) that for sufficientlylarge T , the population µ T is at a vertex of the Boolean cube with high probability. Finally,we combine this with Lemma 15, which bounds the probability that ˜ f ( µ T ) = 1 + ǫ . Lemma 16.
There is a constant
C > such that for any T ≥ C · ǫn N − nǫ , we have: Pr[ µ T / ∈ {− , } n ] < /n. Proof.
Note that if | ν t ′ j | = 1 for some time t ′ , then ν tj = ν t ′ j for every t ≥ t ′ . Observe alsothat if | µ t ′ j | > − ( n N ) − (in this case we say j is α -determined with α = n − N − ), wehave by Markov’s inequality: Pr[ | ν t ′ j | < ≤ /n . We will show that after enough time, it is unlikely that there is any coordinate that was never α -determined. More precisely, let A j ; t be the event that coordinate j was not α -determinedfor µ , . . . , µ t . To prove the lemma, the above reasoning tells us that it suffices to show thatfor T as set in the condition of the lemma:Pr[ n _ j =1 A j ; T ] ≤ /n. We will consider each coordinate separately and show that for each j , Pr[ A j ; T ] ≤ n − . Ourproof will use the following simple claims relating Pr[ A j ; T ] to the selection steps of theprocess. 20 laim 17. Fix any time t and an interval T , such that t + T ≤ T . Then: E t + T X t = t ν tj − µ tj ! ≥ α Pr [ A j ; T ] T N .
Proof. E t + T X t = t ν tj − µ tj ! = t + T X t = t E ν t ∼ B ( µ t ) h(cid:0) ν tj − µ tj (cid:1) i ≥ t + T X t = t Pr [ A j ; t ] E ν t h(cid:0) ν tj − µ tj (cid:1) | A j ; t i ≥ α Pr [ A j ; T ] T N .
Note that for t < t ′ , the outcome of ( ν ( t ′ ) j − µ ( t ′ ) j ) has expctation 0, even given any informationabout time t ; this gives the first equality. The last inequality holds because Pr[ A j ; t ] ≤ Pr[ A j ; t ′ ] for t ≥ t ′ and because E ν t ∼ B ( µ t ) h(cid:0) ν tj − µ tj (cid:1) | A j ; t i = σ j ( µ t ) N ≥ α N .
The next claim tells us that for any interval of time t . . . ( t + T ) , the change in µ j dueto the sampling steps cannot be much more than the change from the selection steps. Claim 18. t + T X t = t ν tj − µ tj ! ≤ t + T X t = t µ t +1 j − ν tj ! + 4 . Proof.
Observe that µ t + T +1 j − µ t j = t + T X t = t ν tj − µ tj ! + t + T X t = t µ t +1 j − ν tj ! has magnitude at most 2, which gives: (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) t + T X t = t ν tj − µ tj (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) t + T X t = t µ t +1 j − ν tj (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) + 2 . Squaring both sides, the claim follows from the fact that 2 | x | + 8 ≥ ( | x | + 2) . We now complete the proof of the lemma. First, combining Claims 17 and 18 tells usthat for t + T ≤ T : E t + T X t = t µ t +1 j − ν tj ! ≥ α Pr [ A j ; T ] T N − .
21y applying Cauchy-Schwarz and Lemma 8, we can relate the quantity inside the expectationto the change in fitness: t + T X t = t µ t +1 j − ν tj ! ≤ T t + T X t = t (cid:0) µ t +1 j − ν tj (cid:1) = T t + T X t = t σ i ( ν t ) ˆ f ( j ; ν t )˜ f ( ν t ) ! ≤ T − nǫ t + T X t = t ˜ f ( µ t +1 ) − ˜ f ( ν t )Taking expectations on both sides, we conclude that for a sufficiently long interval, theexpected change in fitness on that interval is a good upper bound on α Pr[ A j ; T ]4 N :11 − nǫ E " t + T X t = t ˜ f ( µ t +1 ) − ˜ f ( ν t ) ≥ α Pr[ A j ; T ]4 N − /T . (11)We can now amplify this bound T times while using the fact that the total fitness changeis at most ǫ ( T and T will be set after). T (cid:18) α Pr[ A j ; T ]4 N − /T (cid:19) ≤ T − X ℓ =0 E ( ℓ +1) ∗ T X t = ℓ ∗ T ˜ f ( µ t +1 ) − ˜ f ( ν t ) − nǫ = E " T ∗ T X t =0 ˜ f ( µ t +1 ) − ˜ f ( ν t ) − nǫ = E " T ∗ T X t =0 ˜ f ( µ t +1 ) − ˜ f ( ν t ) + ˜ f ( ν t ) − ˜ f ( µ t ) − nǫ = E [ ˜ f ( µ T +1 ) − ˜ f ( µ )] 11 − nǫ ≤ ǫ − nǫ Finally, we have Pr[ A j ; T ] ≤ N ǫT (1 − nǫ ) α + 8 NαT . Taking T = 16 N n and T = ǫ N n − nǫ bounds the probability by 1 /n . Lemma 16 tells us that with probability at least 1 − /n , the population vector will beat a vertex after at most T = O (cid:16) ǫn N − nǫ (cid:17) steps, in which case ˜ f ( µ T ) ∈ { , ǫ } . On theother hand, Lemma 15 tells us that for any T , the probability that the total negative effectof the sampling exceeds α is at most β ; since the fitness change for each selection step isnon-negative (for our choice of ǫ ), we have thatPr[ ˜ f ( µ T ) = 1 + ǫ ] = Pr[ ˜ f ( µ T ) < ˜ f ( µ (0) ) − α ] ≤ β when ˜ f ( µ ) > αα