Accessibility Percolation on Cartesian Power Graphs
aa r X i v : . [ q - b i o . P E ] D ec Accessibility Percolation on Cartesian Power Graphs
Benjamin Schmiegelt ∗ and Joachim Krug † Institute for Biological Physics, University of CologneDecember 18, 2019
Abstract
A fitness landscape is a mapping from the space of genotypes to the real numbers. Apath in a fitness landscape is a sequence of genotypes connected by single mutational steps.Such a path is said to be accessible if the fitness values of the genotypes encountered alongthe path increase monotonically. We study accessible paths on random fitness landscapesof the House-of-Cards type, which implies that the fitness values are independent, identi-cally and continuously distributed random variables. The genotype space is taken to be aCartesian power graph A L , where L is the number of genetic loci and the allele graph A encodes the possible allelic states and mutational transitions on one locus. The probabilityof existence of accessible paths between two genotypes at a distance linear in L displays asharp transition from 0 to 1 at a threshold value β ∗ of the fitness difference between theinitial and final genotype. We derive a lower bound on β ∗ for general A and conjecture thatthis bound is tight for a large class of allele graphs. Our results generalize previous resultsfor accessibility percolation on the biallelic hypercube, and compare favorably to publishednumerical results for multiallelic Hamming graphs. A key tool of our analysis is the con-cept of quasi-accessibility, which eliminates the need to account for the self-avoidance ofaccessible paths. In the strong-selection weak-mutation (SSWM) regime evolutionary dynamics reduces to anadaptive walk on what is known as a fitness landscape, the map from genotypes to fitness values[17]. For low mutation rates the nearly monomorphic population can be represented by a singlemajority genotype moving through the space of genotypes by individual mutations that fix with aprobability depending on the fitness of the mutant relative to the parental genotype [7, 16]. Understrong selection, the movement of such a walker is additionally constrained towards increasingfitness values, making it an adaptive walk [9]. This limits the number of selectively accessiblepaths a population can take through the genotype space [18, 4, 6].Here we investigate the impact that the mutational structure of the genotype space has onthe number of evolutionary paths available to SSWM dynamics. We use a simple stochasticmodel for fitness landscapes known as the House-of-Cards (HoC) model, in which each genotype g is assigned an i.i.d. continuous random fitness value F g [10, 9]. As the property of a path to be ∗ [email protected] † [email protected] alleles and can be mutated individually. For simplicity, but withoutloss of generality, we will assume that all loci have the same set of possible states. Thereforegenotypes are sequences g = ( g , . . . , g L ), with L determining the number of loci. Individual(point) mutations, which are the only ones to be considered here, mutate only one of the loci.The mutational structure of the system determines whether every state of one locus is able tomutate to any other or whether some restrictions apply. For example, whereas point mutations inthe DNA sequence can mutate any nucleotide base into any other, the genetic code constrains thepossible one-step transitions between amino acids. To accomodate general mutational structureswe describe the loci by a simple directed allele graph A . The vertex set V of this graph is theset of all alleles, and its arrows indicate possible one-step mutations between alleles. Again, forsimplicity but without loss of generality, we assume the allele graph to be the same on all loci.The genotype space can then be described as the Cartesian graph product A L [14], a directedgraph whose vertices form the genotypes and with arrows between genotypes that can be reachedvia one-step mutations. The fitness landscape constrains which of these arrows may be takenby an adaptive walker and we call the directed sub-graph of the genotype graph obtained byremoving arrows which do not point towards increasing fitness the fitness graph [5].Our goal is to determine, for a given pair of genotypes a and b , how likely it is possible toreach b from a on the fitness graph when the fitness landscape is of HoC type. In particular thisquestion is non-trivial in the case of the directed distance d ab from a to b being of linear orderin L on the genotype graph. From previous work, it is known that for the case of two alleles, V = { , } , and a linear distance d ab ∼ αL , there is a critical value β ∗ depending on α for thefitness difference β = F b − F a , such that for fixed β > β ∗ , the likelihood of b being accessiblefrom a converges to 1 in L , while for β < β ∗ , the likelihood converges to 0 [8, 3, 13, 2, 12]. Thetransition occurring at β = β ∗ has been referred to as accessibility percolation [15, 11]. Apartfrom a computational study [20], so far accessibility percolation has been studied only for thebiallelic case where the genotype space is the L -dimensional hypercube. Results related to thosepresented here have been obtained in the context of first-passage percolation [14], which is linkedto accessibility percolation through a mapping described in [13]. For our main results let the largest degree of A be bounded by some constant ∆ (or alternativelylet ∆ be the order of the allele graph, if it is finite). If the degrees are not sufficiently bounded,then their number would simply become so large that the accessibility problem becomes trivialbecause of an overwhelming choice of paths.In the following we denote the number of accessible paths from a to b given a fitness difference β between them by Z ab ( β ). We refer to the probability P [ Z ab ( β ) ≥
1] as accessibility , and studyits behavior in the large- L limit. Therefore a sequence of endpoints ( a, b ) needs to be determinedas L increases. For general such sequences, calculations may become tediously complex. Here welimit ourself to sequences that converge linearly in the following sense: Let there be constants p xy with x, y ∈ V and then choose a and b in such a way that |{ a l = x ∧ b l = y | l = 1 . . . L }| = Lp xy + O (1) . (1)This guarantees that all start-end-pairs on loci occur linearly often in L without any non-boundedcorrections, e.g. of order log L . Such intermediate order terms would need to be accounted foradditionally in the following considerations, although the procedure to do so is straight-forward.2n terms of possible values of p xy , they need to sum to 1 in total and we further restrictthem to have at least one non-zero off-diagonal term, only finitely many non-zero terms andall non-zero terms either on the diagonal or on non-trivial allele pairs. We consider an allelepair ( x , y ) trivial iff there is no walk of non-zero length from x to y . These pairs would make itimpossible to find any walk between a and b at all for x = y and for x = y they would have noeffect on walks between a and b . The latter case is not problematic because of how we set upthe linear limit. If there were however intermediate order terms in the limit, they would need tobe accounted for.These requirements imply in particular that the Hamming distance d Hab is asymptotically δL + O (1), where δ = 1 − X x ∈ V p xx >
0. Furthermore the requirements imply that there is a pair x, y ∈ V with non-zero p xy on which the directed distance between x and y is maximal, so thatthe (directed) distance between a and b on the full graph is linearly bounded from above. Sinceit is also bounded from below by the Hamming distance, then d ab = Θ( L ).In order to succinctly state our results, we introduce the following quantities for pairs ofalleles x, y ∈ V : Γ xy ( β ) = ln (cid:16)(cid:0) e βA (cid:1) xy (cid:17) , (2) γ xy ( β ) = ∂ β Γ xy ( β ) = (cid:0) A e βA (cid:1) xy (e βA ) xy , (3)¯ γ xy ( β ) = ∂ β Γ xy ( β ) = (cid:0) A e βA (cid:1) xy (e βA ) xy − (cid:0) A e βA (cid:1) xy (e βA ) xy ! . (4)Here and in the following A stands for the adjacency matrix of the allele graph. The logarithmis well-defined for all alleles, because the matrix exponential can become zero only for trivial off-diagonal loci, which we explicitly excluded. We define the same quantities for pairs of genotypes a, b ∈ V L : Γ ab ( β ) = h Γ a l b l ( β ) i l , (5) γ ab ( β ) = h γ a l b l ( β ) i l , (6)¯ γ ab ( β ) = h ¯ γ a l b l ( β ) i l , (7)where h X l i l = 1 L L X l =1 X l .Given the way we defined our limit process and assuming that β is bounded by a constant,these quantities can be asymptotically expressed in terms of the p xy with additional correctionterms of order L − accounting for rounding effects,Γ ab ( β ) = h Γ a l b l ( β ) i pa,b + O (cid:0) L − (cid:1) , (8) γ ab ( β ) = h γ a l b l ( β ) i pa,b + O (cid:0) L − (cid:1) , (9)¯ γ ab ( β ) = h ¯ γ a l b l ( β ) i pa,b + O (cid:0) L − (cid:1) , (10)where h X a l b l i pa,b = X x,y ∈ V p xy X xy .For β →
0, Γ ab ( β ) → −∞ , because we are assuming p xy to have at least one off-diagonalterm. Further, for β → ∞ , Γ ab ( β ) → ∞ , for the same reason. Since Γ ab ( β ) is also monotonically3ncreasing in β , there must then be a unique positive root of Γ ab ( β ), which we will denote by β ∗ . Further we introduce γ ⋆ = lim L →∞ γ ab ( β ∗ ) = h γ a l b l ( β ∗ ) i pa,b , (11)¯ γ ⋆ = lim L →∞ ¯ γ ab ( β ∗ ) = h ¯ γ a l b l ( β ∗ ) i pa,b . (12)Our first main result states that for all β ( L ) = β ∗ − γ ⋆ − ln LL − | ω (1) | L :lim L →∞ P [ Z ab ( β ( L )) ≥
1] = 0 . (13)This is an upper bound on the threshold function of accessibility and will be derived throughthe asymptotic behavior of the mean number of quasi-accessible paths, which vanishes for β ( L )below this threshold. The concept of quasi-accessibility is introduced below in Sect. 4.For a lower bound on the accessibility, it will be necessary to introduce the following functionfor x, y ∈ V and 0 ≤ r, s ≤ M xy ( β, s, r ) = D ln (cid:0) e sβA (cid:1) vw E r,sx,y , (14)where ¯ r = 1 − r and ¯ s = 1 − s and h X vw i r,sx,y = P v,w ∈ V (cid:0) e ¯ srβA (cid:1) xv (cid:0) e sβA (cid:1) vw X vw (cid:0) e ¯ s ¯ rβA (cid:1) wy (e βA ) xy . (15)Note that this is a proper mean, because by matrix multiplication h i r,sx,y = 1. As before, wedefine the same function for a, b ∈ V L as the mean over l and again in the limit of L → ∞ , thisfunction converges uniformly for bounded β with corrections of order O (cid:0) L − (cid:1) , M ab ( β, s, r ) = h M a l b l ( β, s, r ) i pa,b + O (cid:0) L − (cid:1) . (16)We also define M ∗ ( s, r ) = h M a l b l ( β ∗ , s, r ) i pa,b . (17)We say that the problem setup is of weakly normal type if M ∗ ( s, r ) ≤ M ∗ ( s, r )is always zero for s = 0 and s = 1. If however M ∗ ( s, r ) < (strongly) normal type . Many setups are strongly normal, however[14] shows that there are setups which are not weakly normal, with the smallest graph with thatproperty being of order 5. It is possible to generate a weakly normal, but not strongly normalsetup, by adjusting p xy carefully starting from a non-normal setup.We conjecture that lim L →∞ P [ Z ab ( β ( L )) ≥
1] = 1 (18)for all β ( L ) = β ∗ + γ ⋆ − ln LL + | ω (1) | L for normal setups and that furthermorelim L →∞ P [ Z ab ( β ) ≥
1] = 0 (19)4or some β > β ∗ for non weakly normal setups. This means that the previously obtained boundon the critical value of β ∗ is tight only in the (weakly) normal setups, with the window betweenthe two bounds on the threshold function being of order L − in the strongly normal case. Aproof of this statement will be added in a later version of this manuscript.Finally as our third result we show that, given that accessible paths exist at β = β ∗ − γ ⋆ − ln LL + O (cid:18) L (cid:19) , they must have asymptotically almost surely length β ∗ γ ⋆ L ± ω (cid:16) √ L (cid:17) . The simplest application is to the complete graph on A alleles, which leads to genotype spacesknown as Hamming graphs. By symmetry, in this case there are only two choices for the initialand final allele on a vertex, either a l = b l or a l = b l . Therefore the setup can be fully describedby just the relative Hamming distance δ = d ab L . One obtainsΓ ab ( β ) = − ln A − β + δ ln (cid:0) − A β (cid:1) + ¯ δ ln (cid:0) A − A β (cid:1) , (20) γ ⋆ = − A e A β ∗ (cid:18) δ − A β ∗ + ¯ δ A − A β ∗ (cid:19) , (21)where ¯ δ = 1 − δ . In the biallelic case A = 2 the condition Γ ab ( β ∗ ) = 0 reduces to the relationsinh( β ∗ ) δ cosh( β ∗ ) ¯ δ = 1 which was first conjectured in [2] and proved in [13, 12]. At full distance δ = 1 Γ ab ( β ) = − ln A − β + ln (cid:0) − A β (cid:1) , (22) γ ⋆ = ( A − A β ∗ + 1e A β ∗ − . (23)The values of β ∗ and γ ⋆ for small A are shown in the following table: A β ∗ γ ⋆ β ∗ γ ⋆ ≈ . √ ≈ . ≈ .
253 ln (cid:16) π (cid:17) ≈ .
631 1 + 2 cos (cid:18) π (cid:19) ≈ . ≈ .
824 ln √ r √ − ! ≈ . ≈ . ≈ . β ∗ is the unique positive solution of the polynomial equation (cid:16) e β ∗ (cid:17) A − A e β ∗ − . (24)5or A ≥ A → ∞ as β ∗ = ln AA + 1 + ln AA + O (cid:18) ln AA (cid:19) , (25) γ ⋆ = A + O (cid:18) A (cid:19) , (26) β ∗ γ ⋆ = ln A + 1 + ln AA + O (cid:18) ln AA (cid:19) . (27)As the number of alleles increases, accessibility increases and the required fitness differencebetween the start and end point decreases. In fact this quantity vanishes to zero for A → ∞ .At the same time the length of accessible walks close to the critical fitness difference increases,but slowly. The minimal length of a path covering the full distance d ab is L , and hence β ∗ γ ⋆ − a ) and sideways steps (where a mutation occurs to an allele that is part ofneither the initial nor the target genotype) [19]. The fraction of all alleles on a given locus thatappear along an accessible path close to the critical point is given by A − β ∗ γ ⋆ which decreaseswith increasing A . Zargorski, Burda and Waclaw carried out simulations of this model, giving β ∗ with two digit precision for different values of A [20]. Their results match the values derivedhere up to ± . We can modify the complete graph slightly to disallow mutations back to the allele that waspresent in the initial genotype (the wild type allele), while still allowing mutations between allother alleles. In this case the expressions simplify significantly to β ∗ = ln ( A − A − , (28) γ ⋆ = A − . (29)The asymptotic behavior is the same as for the complete graph. In the biallelic case A =2 this describes accessibility percolation on the directed hypercube, which was considered byHegarty and Martinsson in [8]. In this case β ∗ = 1, which implies that the directed hypercubeis marginally accessible under the HoC model [6]. The complete graph is in some sense the best-case scenario for accessibility. On the opposite sideof the spectrum of possible allele graphs one can choose the path graph on A vertices. In thiscase the distance between the two end points increases with the number of alleles and there is aunique order in which mutations on a locus must be applied. This causes accessibility to becomevery low. For A = 2 the path graph is identical to the complete graph. However, already for A = 3 we find β ∗ = ln (cid:0) √ (cid:1) √ ≈ . . (30)Since β ∗ represents a fitness quantile which must lie between 0 and 1, this value implies thatthe path graph on three vertices can never be accessible for any fitness difference if (almost) all6oci need to mutate from one end of the graph to the other. For higher A this effect becomesmore pronounced. As a possible biological application of the path graph the description ofcopy-number variants of genes can be mentioned [1].Since the complete graph on two vertices without reversions has β ∗ = 1 as shown before in(28) and adding edges can only decrease β ∗ , it is actually required that the distance between a l and b l on the allele graph is at least 2 in order for β ∗ > We base the proof for the upper bound on the mean number of accessible paths, or rather themean number of quasi-accessible walks. We define the term quasi-accessible as a generalizationof the notion of accessibility used up to now.Suppose we assign to each genotype multiple i.i.d. fitness values indexed by integers, inaddition to the first fitness value specified by the HoC model. We then say that a path whichvisits a genotype multiple times is quasi-accessible, if it would be accessible assuming that it“sees” the n -th fitness value at its n -th visit to the genotype. This guarantees that all paths ofequal length are equally likely to be quasi-accessible. In contrast, to study accessibility in theusual sense we would need to treat non-self-avoiding walks separately, because they can never beaccessible (as two fitness values on the walk would be identical and therefore not in monotonicallyincreasing order). Further note that if there exists a quasi-accessible path, there must also existan accessible one, obtained by removal of all cycles from the former path, and conversely anyaccessible path is also quasi-accessible. This implies that we can restrict our investigation toquasi-accessible paths, the number of which we will denote by ˜ Z ab ( β ).In order to prove the upper bound, we will consider the mean number of quasi-accessiblewalks from a to b . Note that each walk of length N is accessible with probability β N − ( N − F a to F b and the denumerator accounts for the increasing orderof these values. The mean number of walks taking n steps from x to y on one locus is given by( A n ) xy . A walk of length N could take each step on any of the loci, so that the total number ofwalks of length N can be written as X n + ... + n L = N (cid:18) Nn , . . . , n L (cid:19) L Y l =1 ( A n l ) a l b l (31)where (cid:18) Nn , . . . , n L (cid:19) is the multinomial coefficient accounting for the different orderings of stepson individual loci. Multiplication of this expression with the probability of quasi-accessibility ofeach such walk gives the mean number of quasi-accessible paths E (cid:2) ˜ Z ab ( β ) (cid:3) = ∞ X N =0 X n + ... + n L = N (cid:18) Nn , . . . , n L (cid:19) β N − ( N − L Y l =1 ( A n l ) a l b l . (32)The term ( N − N ! by introduction of a derivative E (cid:2) ˜ Z ab ( β ) (cid:3) = ∂ β ∞ X N =0 X n + ... + n L = N (cid:18) Nn , . . . , n L (cid:19) β N N ! L Y l =1 ( A n l ) a l b l (33)7nd redistributing all the factorials and β N into the product yields E (cid:2) ˜ Z ab ( β ) (cid:3) = ∂ β ∞ X N =0 X n + ... + n L = N L Y l =1 β n l n l ! ( A n l ) a l b l . (34)Finally the sums and the product can be interchanged and E (cid:2) ˜ Z ab ( β ) (cid:3) = ∂ β L Y l =1 ∞ X n =0 β n n ! ( A n ) a l b l = ∂ β L Y l =1 (cid:0) e βA (cid:1) a l b l = ∂ β e L Γ ab ( β ) = Lγ ab ( β )e L Γ ab ( β ) . (35)This function is monotonically increasing in β and by expansion around β ∗ , we find E (cid:2) ˜ Z ab ( β ) (cid:3) = L e Lγ ⋆ ( β − β ∗ )+ O (1)+ O ( L ( β − β ∗ ) ) . (36)Therefore at β ( L ) = β ∗ − γ ⋆ − ln LL the mean becomes Θ(1) and subtracting an additional term | ω (1) | L from β ( L ) causes it to decrease to zero, proving our first main result by application ofMarkov’s inequality.This upper bound can be strengthened by considering intervals of walk lengths. For thiswe calculate the contribution of walks of given length N to the overall mean number of quasi-accessible walks. In principle this is done by restriction of the sum in the previous derivation forthe mean number of quasi-accessible walks. However, we can also write the contribution as E (cid:2) ˜ Z ab ( β ) (cid:12)(cid:12) N − ≤ N ≤ N + (cid:3) = P [ N − ≤ N ≤ N + ] E (cid:2) ˜ Z ab ( β ) (cid:3) , (37)where the probability is with respect to the choice of a walk weighted by the probability β N − ( N − N − µL √ L under this distribution, its unnormalized characteristic function is ∞ X N =1 X n + ... + n L = N e iz N − µL √ L (cid:18) Nn , . . . , n L (cid:19) β N − ( N − L Y l =1 ( A n l ) a l b l (38)= e − izµ √ L Lγ ab (cid:16) e iz √ L β (cid:17) e L Γ ab (cid:16) e iz √ L β (cid:17) (39)or in the normalized form E h e iz N − µL √ L i = e − izµ √ L γ ab (cid:16) e iz √ L β (cid:17) γ ab ( β ) e L (cid:16) Γ ab (cid:16) e iz √ L β (cid:17) − Γ ab ( β ) (cid:17) . (40)Asymptotically, pointwise, e iz √ L = 1 + iz √ L − z L + . . . and, assuming β is bounded by a constantwe have E h e iz N − µL √ L i = exp (cid:18) − izµ √ L + izβ √ Lγ ab ( β ) − z βγ ab ( β ) − z β ¯ γ ab ( β ) + . . . (cid:19) . (41)For µ = β ∗ γ ⋆ this function converges pointwise and so the random variable converges in distri-bution to a normal random variable with variance σ = β ∗ γ ⋆ + β ∗ ¯ γ ⋆ .8herefore, at the potential critical point β = β ∗ − γ ⋆ − ln LL + O (cid:0) L − (cid:1) , where the meannumber of walks of any length is bounded by a constant, the number of quasi-accessible walksof lengths not in the interval β ∗ γ ⋆ L ± ω (cid:16) √ L (cid:17) vanishes. All quasi-accessible walks, if they exist,must therefore have lengths in this interval almost surely. Acknowledgements
This work was supported by DFG within SPP 1590
Probabilistic structures in evolution . Wethank David Augustin for his contributions to the early stages of this project.
References [1] Altenberg, L.: Fundamental properties of the evolution of mutational robustness. PreprintarXiv:1508.07866 (2015)[2] Berestycki, J., Brunet, É., Shi, Z.: The number of accessible paths in the hypercube.Bernoulli , 653–680 (2016)[3] Berestycki, J., Brunet, É., Shi, Z.: Accessibility percolation with backsteps. ALEA, Lat.Am. J. Probab. Math. Stat. , 45–62 (2017)[4] Carneiro, M., Hartl, D.L.: Adaptive landscapes and protein evolution. Proc. Nat. Acad.Sci. USA , 1747–1751 (2010)[5] Crona, K., Greene, D., Barlow, M.: The peaks and geometry of fitness landscapes. J.Theor. Biol. , 1–10 (2013)[6] Franke, J., Klözer, A., de Visser, J.A.G.M., Krug, J.: Evolutionary accessibility of muta-tional pathways. PLoS Comp. Biol. (8), e1002,134 (2011)[7] Gillespie, J.H.: Molecular evolution over the mutational landscape. Evolution , 1116–1129 (1984)[8] Hegarty, P., Martinsson, A.: On the existence of accessible paths in various models offitness landscapes. Ann. Appl. Probab. , 1375–1395 (2014)[9] Kauffman, S., Levin, S.: Towards a general theory of adaptive walks on rugged landscapes.Journal of Theoretical Biology (1), 11–45 (1987)[10] Kingman, J.F.C.: A simple model for the balance between selection and mutation. Journalof Applied Probability (1), 1–12 (1978)[11] Krug, J.: Accessibility percolation in random fitness landscapes. Preprint arXiv:1903.11913(2019)[12] Li, L.: Phase transition for accessibility percolation on hypercubes. J. Theor. Prob. ,2072–2111 (2018)[13] Martinsson, A.: Accessibility percolation and first-passage site percolation on the unori-ented binary hypercube. Preprint arXiv:1501.02206 (2015)914] Martinsson, A.: First-passage percolation on Cartesian power graphs. Ann. Prob. ,1004–1041 (2018)[15] Nowak, S., Krug, J.: Accessibility percolation on n -trees. Europhys. Lett. , 66,004(2013)[16] Orr, H.A.: The population genetics of adaptation: the adaptation of DNA sequences.Evolution , 1317–1330 (2002)[17] de Visser, J.A.G.M., Krug, J.: Empirical fitness landscapes and the predictability of evo-lution. Nature Reviews Genetics , 480–490 (2014)[18] Weinreich, D.M., Watson, R.A., Chao, L.: Sign epistasis and genetic constraint on evolu-tionary trajectories. Evolution , 1165–1174 (2005)[19] Wu, N.C., Dai, L., Olson, C.A., Lloyd-Smith, J.O., Sun, R.: Adaptation in protein fitnesslandscapes is facilitated by indirect paths. eLife , 16,965 (2016)[20] Zagorski, M., Burda, Z., Waclaw, B.: Beyond the hypercube: evolutionary accessibility offitness landscapes with realistic mutational networks. PLoS Comp. Biol.12