Revisiting the Neutral Dynamics Derived Limiting Guanine-Cytosine Content Using the Human De Novo Point Mutation Data

Abstract

We revisit the topic of human genome guanine-cytosine content under neutral evolution. For this study, de novo mutation data within humans are used to estimate the mutational rate, instead of base substitution data between related species. We then define a new measure of mutation bias which separates the de novo mutation counts from the background guanine-cytosine content itself, making comparison between different datasets easier. We derive a new formula for calculating the limiting guanine-cytosine content by treating CpG-involved mutational events as an independent variable. When CpG-involved mutations are considered in this formula, the guanine-cytosine content drops less severely in the limit of neutral dynamics. We provide evidence, under certain assumptions, that an isochore-like structure might remain as a limiting configuration of the neutral mutational dynamics.


arXiv [q-bio.GN]

Wentian Li , Yannis Almirantis , Astero Provata

1. The Robert S. Boas Center for Genomics and Human Genetics, The Feinstein Institutes for Medical Research, Northwell Health, Manhasset, NY, USA
2. Theoretical Biology and Computational Genomics Laboratory, Institute of Bioscience and Applications, National Center for Scientific Research “Demokritos”, 15341 Athens, Greece
3. Statistical Mechanics and Dynamical Systems Laboratory, Institute of Nanoscience and Nanotechnology, National Center for Scientific Research “Demokritos”, 15341 Athens, Greece

Keywords: de novo mutation, human genome, neutral evolution, limiting G+C content, CpG dinucleotide

Introduction

It is well known that in the genomes of many species, from fish (Costantini et al., 2007) to human (Bernardi, 1989; Costantini et al., 2006), there is an alternation of DNA segments with high and low guanine-plus-cytosine (G+C) content, called isochores (Bernardi, 1985, 2006). Such change between high and low G+C content leads to a higher variance with respect to DNA walk properties than expected from a simple homogeneous stochastic model (Fickett et al., 1992); thus the term heterogeneity has been adopted (Sueoka, 1962; Li et al., 1998). The physical spatial arrangement of G+C content in genomes can be more complicated. Commonly observed examples of this spatial complexity are the variable length of isochores, the isochore-within-isochore phenomenon, and a slower-than-exponential decay of the base-to-base autocorrelation function, called long-range correlation (Li and Kaneko, 1992; Peng et al., 1992). An isochore is not simply a statistical signal: it also carries functional information. The GC-richest isochores tend to be gene rich (Zerial et al., 1986), and genes tend to have higher G+C content (Clay et al., 1996). G+C content may also be associated with the inner or outer region of the chromatin structure (Jabbari and Bernardi, 2017), though the cause-effect relationship between the two is not completely clear (Li, 2013). The G+C content of the genomes of more than one thousand species is presently known (Wang, 2018).

Based on the synonymous substitution rates in protein-coding regions obtained from different mammalian species, it was concluded that because the G/C → A/T substitution rate (from ancestral outgroup to descendant ingroup) is higher than the A/T → G/C rate, all genomic regions will move towards a lower equilibrium G+C content (Duret et al., 2002; Belle et al., 2004), though counter-evidence was also presented (Alvarez-Valin et al., 2004; Gu and Li, 2006; Romiguier et al., 2010). These data from different diverging species contain information from a very long time scale, and mix various factors from mutation rates to selection.

Another observation is that the ratio of the G/C → A/T to the A/T → G/C substitution rate is a function of the G+C content, based on the substitutions in DNA transposons, a type of repetitive sequence (International Human Genome Sequencing Consortium, 2001). This result is not inconsistent with Duret et al. (2002), where the ratio of the absolute count of G/C → A/T substitutions to the A/T → G/C substitution count increases with G+C content. However, there is a difference between the two approaches: the ratio in International Human Genome Sequencing Consortium (2001) is based on conditional probability (conditional on G+C content), whereas the ratio in Duret et al. (2002) is unconditional. This difference is not usually emphasized, which can cause confusion when different results are compared.

If we want to disentangle different causes and focus only on the consequence of short-time-scale point mutations, it would be ideal to catch mutational events in “real time”. Fortunately, in the genetic study of human diseases, the de novo mutation (a genetic variant absent in the parents' genomes but present in the offspring's genome) has been extensively investigated in both diseased and normal samples (Kong et al., 2012; Francioli et al., 2015). These data become more important as there is evidence that de novo mutations might play a role in many human diseases, such as autism spectrum disorders (Sebat et al., 2007), schizophrenia (Girard et al., 2011), intellectual disability (Vessers et al., 2010; Hamdan et al., 2014), developmental disorders (Deciphering Developmental Disorders Study, 2018), and others (Veltman and Brunner, 2012; Samocha et al., 2014).

There are several distinct features of de novo mutation data compared to the substitutions in mammalian species: (1) a mutation occurs within one species, Homo sapiens, rather than leading to base differences between species (i.e. substitution); (2) a mutation occurs in the current time, so we do not deal with ancestral mutational events in the past, which were likely to be different as they occurred in a potentially different environment; (3) for a normal, disease-free offspring, a de novo mutation is unlikely to be deleterious, thus the requirement of neutral evolution is satisfied. We also do not need to assume the spread and fixation of a mutational allele, as is required when substitutions between species are used; (4) a mutation is more directly observed, by comparing the parents' and offspring's genomes, not by comparing an inferred ancestral genome with the descendant genomes; (5) due to the availability of the human reference genome, not only can whole-genome data be readily obtained, but the genomic context of a mutation location is also well annotated, unlike the situation for many other mammalian genomes.

The large number of de novo studies in the human genome in recent years (Pranckeniene et al., 2018; Jonsson et al., 2018; Goldmann, 2018; Wang et al., 2019; Kessler et al., 2020) may make the data collection daunting. Fortunately, there exist de novo databases which we can use directly. Note that, unlike most researchers, who are more interested in functional de novo mutations that cause genetic diseases, here we are interested in the neutral de novo mutations in the general population.

The study is organized as follows: we first focus on the theoretical framework relating the de novo mutation rate and G+C content. We re-derive the formula of the limiting G+C content in single-base-mutation-driven neutral dynamics in terms of mutation counts between strong and weak bases. We define coefficients $\alpha_1$, $\alpha_2$ based on mutation counts (not on mutation rate); their deviation from the value of 1/2 immediately reveals which type of base will increase or decrease in the limiting dynamics. We then expand our formula to dynamics of three variables (weak bases, strong bases not involved in CpG dinucleotides, and strong bases involved in CpG dinucleotides). We define coefficients $\beta_1$, $\beta_2$, $\beta_3$ based on mutation counts among these three variables; a deviation of their values from 1/3 immediately predicts their limiting dynamics. We carry out a data analysis using the de novo mutation database to estimate the $\alpha$'s and $\beta$'s in different location types, as well as in different G+C backgrounds.

Theoretical formulation

Formula of limiting G+C content based on mutation-driven neutral dynamics

Assume that a certain number ($n_{WS}$) of weak (A or T) to strong (C or G) de novo mutation events is observed, and similarly that there are $n_{SW}$ strong-to-weak mutation events. Denote the G+C content as $x$; assume that there are $N$ genomic positions to be considered, and that $M$ is the number of persons from whom de novo mutation events were collected. Then there are $M \cdot N \cdot x$ positions occupied by strong bases, and $M \cdot N \cdot (1-x)$ positions occupied by weak bases. We then compose the mutation (and non-mutation) event count matrix as:

$$
\begin{array}{c|cc}
 & W = A/T & S = C/G \\ \hline
W = A/T & MN(1-x) - n_{WS} & n_{WS} \\
S = C/G & n_{SW} & MNx - n_{SW}
\end{array}
\tag{1}
$$

The diagonal elements in Eq.(1) are mostly not directly counted, simply because (e.g.) A → A is not reported as a mutation event (though (e.g.) A → T is). However, we can infer them from the total number of base positions $N$, the total number of samples $M$, the current G+C content $x$, and the mutation counts away from the base type.

Normalizing the matrix in Eq.(1) by row sum, we obtain the conditional probabilities (the transition probabilities of a Markov chain):

$$
\begin{array}{c|cc}
 & W = A/T & S = C/G \\ \hline
W = A/T & 1 - \dfrac{n_{WS}}{MN(1-x)} & p_{W \to S} \equiv \dfrac{n_{WS}}{MN(1-x)} \\
S = C/G & p_{S \to W} \equiv \dfrac{n_{SW}}{MNx} & 1 - \dfrac{n_{SW}}{MNx}
\end{array}
\tag{2}
$$

From the two weak ↔ strong (conditional) transition probabilities, it is well known that the limiting G+C (strong) content is (Sueoka, 1962; Petrov and Hartle, 1999; Lynch, 2007, 2010; International Human Genome Sequencing Consortium, 2001; Li, 2011, 2013):

$$
x' = \frac{1}{\dfrac{p_{S \to W}}{p_{W \to S}} + 1}. \tag{3}
$$

An easy derivation is to consider the “detailed balance”: $x' \, p_{S \to W} = (1 - x') \, p_{W \to S}$.

We define two new coefficients based on the mutational event counts: $\alpha_1 = n_{SW}/(n_{SW} + n_{WS})$ and $\alpha_2 = n_{WS}/(n_{SW} + n_{WS})$. Note that $\sum_i \alpha_i = 1$. The conditional transition probabilities in Eq.(3) can be replaced by the de novo mutational event counts ($n_{SW}$ and $n_{WS}$) or the $\alpha$'s:

$$
x' = \frac{1}{\dfrac{n_{SW}}{n_{WS}} \cdot \dfrac{1-x}{x} + 1} = \frac{1}{\dfrac{\alpha_1(x)}{\alpha_2(x)} \cdot \dfrac{1-x}{x} + 1}. \tag{4}
$$

Both the total number of bases $N$ and the number of persons $M$ cancel in Eq.(4), thus we do not need to know their values. Eq.(4) shows how the limiting G+C content ($x'$) depends on the current G+C content ($x$) and on two (actually one) mutational-count-based coefficients $\alpha_1$ ($\alpha_2 = 1 - \alpha_1$). Eq.(4) can be written in a more symmetric form:

$$
\frac{x'}{1 - x'} = \frac{n_{WS}}{n_{SW}} \cdot \left( \frac{x}{1-x} \right) = \frac{\alpha_2(x)}{\alpha_1(x)} \cdot \left( \frac{x}{1-x} \right). \tag{5}
$$

As long as $n_{WS} < n_{SW}$ (or $\alpha_2 < \alpha_1$, i.e. more C/G to A/T mutation events than in the opposite direction), $x/(1-x)$ will decrease, and so will the G+C content $x$.

Note that Eq.(5) is not a one-time iteration from time $t$ to time $t+1$, as typically seen in the field of dynamical systems (May, 1976; Li and Yorke, 1975; Feigenbaum, 1978). Eq.(5) maps directly from the current G+C content to the limiting G+C content in one step. The base transition counts ($n_{SW}$ and $n_{WS}$) or their normalized values ($\alpha_1$ and $\alpha_2$) are not constant, but change as a function of the current G+C content. If the G+C content is reduced, we should also see a lower value of $\alpha_1$. To emphasize this point, we write the functional dependence of $\alpha_1$ and $\alpha_2$ on $x$ in Eqs.(4,5) explicitly.

In the literature, the mutational bias towards W=A+T is defined as $m = p_{S \to W}/p_{W \to S} = v/u$ (Lynch, 2007, page 126), or as the equilibrium constant in the direction of W base pairs, $K$ (International Human Genome Sequencing Consortium, 2001, page 886), where the dependence on G+C content is not obvious. In our notation, $K = [\alpha_1(x)/\alpha_2(x)] \cdot [(1-x)/x]$ is expressed in two parts, which makes it explicit that the first part is derived purely from the mutation counts, $\alpha_1/\alpha_2 = n_{SW}/n_{WS}$, whereas the second part is unrelated to mutation counts and depends purely on base composition.

There is another advantage of using $\alpha_{1,2}$ instead of the mutational bias $K$. When the values of $\alpha_{1,2}$ are compared to 0.5, we immediately know the direction of the base type change: if $\alpha_1 > 0.5$, the (A+T) content will increase from its current value to its limiting value; similarly, if $\alpha_2 > 0.5$, the (G+C) content will increase. This advantage becomes clearer in the next subsection, where three variables are considered.
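As a rough illustration of Eq.(4), the following Python sketch computes the two count-based coefficients (the S→W and W→S shares of all weak↔strong events) and the limiting G+C content from a pair of counts and the current G+C content. The function names and the example counts are hypothetical and are not taken from the de novo data analyzed in this study.

```python
def alpha_coefficients(n_sw, n_ws):
    """Return the S->W and W->S shares of all weak<->strong mutation events."""
    total = n_sw + n_ws
    if total <= 0:
        raise ValueError("need at least one weak<->strong mutation event")
    return n_sw / total, n_ws / total


def limiting_gc(n_sw, n_ws, x):
    """Limiting G+C content x' of Eq.(4) for current G+C content x (0 < x < 1)."""
    if not 0.0 < x < 1.0:
        raise ValueError("x must lie strictly between 0 and 1")
    if n_ws == 0:
        return 0.0  # no W->S events at all: strong bases can only be lost
    # Mutational bias toward A/T: count ratio times base-composition ratio.
    k = (n_sw / n_ws) * (1.0 - x) / x
    return 1.0 / (k + 1.0)


# Hypothetical counts: 200 S->W events, 100 W->S events, current G+C = 40%.
a_sw, a_ws = alpha_coefficients(200, 100)
x_limit = limiting_gc(200, 100, 0.4)
print(a_sw, a_ws, x_limit)
```

With these made-up counts, the S→W share is 2/3 (above 1/2), K is 3, and the limiting G+C content is approximately 0.25, i.e. lower than the current 0.4, as the sign of the deviation from 1/2 predicts.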

Formula of limiting G+C content when CpG is considered separately

Now we specifically consider a subset of strong bases within the 5'-CpG-3' dinucleotide context. The base C next to a base G in the downstream (3') direction is known to have a much higher mutation rate (in particular, to base T). The dinucleotide on the opposite strand of 5'-CpG-3' is also 5'-CpG-3', but there it is the base G that is expected to have a higher mutation rate. Let us denote these strong bases as $S_p$ ($p$ indicates the phosphodiester bond between C and G) and the other G/C bases not in this context as $S_n$. We also assume that among strong G/C bases, a proportion $y$ is in $S_p$. Though not common, it is still possible to have a mutation from one $S_p$ base to another $S_p$ base, e.g., 5'-CGG-3' → 5'-CCG-3'.

Similar to Eq.(1), the mutation counts for the three types of base ($W$, $S_n$, $S_p$) are:

$$
\begin{array}{c|ccc}
 & W & S_n & S_p \\ \hline
W & MN(1-x) - n_{WS_n} - n_{WS_p} & n_{WS_n} & n_{WS_p} \\
S_n & n_{S_nW} & MNx(1-y) - n_{S_nW} - n_{S_nS_p} & n_{S_nS_p} \\
S_p & n_{S_pW} & n_{S_pS_n} & MNxy - n_{S_pW} - n_{S_pS_n}
\end{array}
$$

and again, the row-normalized matrix is the transition matrix:

$$
\begin{array}{c|ccc}
 & W & S_n & S_p \\ \hline
W & 1 - \dfrac{n_{WS_n} + n_{WS_p}}{MN(1-x)} & \dfrac{n_{WS_n}}{MN(1-x)} & \dfrac{n_{WS_p}}{MN(1-x)} \\
S_n & \dfrac{n_{S_nW}}{MNx(1-y)} & 1 - \dfrac{n_{S_nW} + n_{S_nS_p}}{MNx(1-y)} & \dfrac{n_{S_nS_p}}{MNx(1-y)} \\
S_p & \dfrac{n_{S_pW}}{MNxy} & \dfrac{n_{S_pS_n}}{MNxy} & 1 - \dfrac{n_{S_pW} + n_{S_pS_n}}{MNxy}
\end{array}
\tag{6}
$$

The limiting composition of $W$, $S_n$, $S_p$ is proportional to the eigenvector of the transpose (switching rows and columns) of Eq.(6) corresponding to the eigenvalue equal to 1 (which is the largest eigenvalue of a Markov transition matrix) (see Appendix). We obtain such an (unnormalized) eigenvector for the transpose of Eq.(6) with Wolfram Alpha as:

$$
\begin{pmatrix}
(n_{S_pW} n_{S_nW} + n_{S_nS_p} n_{S_pW} + n_{S_pS_n} n_{S_nW}) \, (1-x) \\
(n_{WS_n} n_{S_pS_n} + n_{WS_p} n_{S_pS_n} + n_{S_pW} n_{WS_n}) \, x(1-y) \\
(n_{WS_p} n_{S_nS_p} + n_{WS_n} n_{S_nS_p} + n_{S_nW} n_{WS_p}) \, xy
\end{pmatrix}
\propto
\begin{pmatrix}
\beta_1(x) \cdot (1-x) \\
\beta_2(x) \cdot x(1-y) \\
\beta_3(x) \cdot xy
\end{pmatrix}.
\tag{7}
$$

Although Eq.(7) looks complicated, it can be memorized with the illustration in Fig.1. Note that again the genome size $N$ and the number of persons $M$ are not present in the limiting composition formula Eq.(7).

We introduce new coefficients $\beta_i$ ($i = 1, 2, 3$) that are proportional to the coefficients in the left column array in Eq.(7), but normalized (i.e., divided by the sum of all products of two transition counts). Note that $\sum_i \beta_i = 1$ does not mean that the right column array itself in Eq.(7) is normalized. Our introduction of the $\beta_i$ coefficients makes the comparison between different data easier, because they are based purely on mutational counts. Also, if $\beta_1 = \beta_2 = \beta_3 = 1/3$, the limiting composition equals the current one, so a deviation of a $\beta_i$ from 1/3 easily points to the direction of change in the composition. To emphasize the fact that the $\{\beta_i\}$ are not constant in the dynamics, we write their dependence on G+C content explicitly in Eq.(7).

Data analyses

Filtering neutral de novo mutation events

We use denovo-db v1.6.1 (http://denovo-db.gs.washington.edu/denovo-db/, August 19, 2018). The files denovo-db.ssc-samples.variants.tsv and denovo-db.non-ssc-samples.variants.tsv are used. Each line in these files is a mutational event in a person with a particular annotation. Therefore, for a mutation in a coding region with multiple transcripts, each mutation event may occupy multiple lines. There are 628,234 lines in the two files. The genomic coordinates are in hg19/GRCh37.

We filter the de novo mutations by the following criteria: (1) the mutation is a bi-allelic single-nucleotide polymorphism (SNP); (2) the person's phenotype is a normal “control”; (3) Y-chromosome variants are excluded; (4) the base is consistent with the hg19/GRCh37 reference genome. After filtering, the de novo mutation event counts by study are: Turner2017 (83187, 75.0%) (Turner et al., 2017), GONL (15896, 14.3%) (Genome of the Netherlands Consortium, 2014), Turner2016 (3541, 3.2%) (Turner et al., 2016), Iossifov2014 (3521, 3.2%) (Iossifov et al., 2014), Werling2018 (3302, 3.0%) (Werling et al., 2018), Krumm (1014, 0.9%) (Krumm et al., 2015), Yuen2017 (181, 0.16%) (Yuen et al., 2017), Gulsuner2013 (170, 0.15%) (Gulsuner et al., 2013), Conrad2011 (59, 0.05%) (Conrad et al., 2011), Besenbacher2014 (52, 0.047%) (Besenbacher et al., 2015), Rauch2012 (50, 0.045%) (Rauch et al., 2012), and ASD3 (15, 0.014%).

Besides the information provided by denovo-db, we added the following information using the hg19/GRCh37 reference genome: (1) the G+C base and CpG dinucleotide counts of the 2kb window centered at the SNP; (2) the G+C base and CpG dinucleotide counts of the 20kb window centered at the SNP; (3) the triplet context of the SNP; and (4) the triplet context after the mutation.

denovo-db provides 18 location types, which we condense to 9 types: intron (and intron-near-splice): 58824 lines; intergenic: 40375 lines; upstream-gene and downstream-gene: 4832 lines; missense (and missense-near-splice): 3235 lines; 5' and 3' UTR: 1514 lines; synonymous (and synonymous-near-splice): 1336 lines; non-coding-exon (and non-coding-exon-near-splice): 634 lines; stop-lost and stop-gain: 149 lines; splice-acceptor and splice-donor: 81 lines.

Fig.2 shows the distribution of the CADD (combined annotation dependent depletion) value (Kircher et al., 2014), the percentage of non-repetitive sequence (uppercase letters), the 2kb-window G+C content, the 20kb-window G+C content, the 2kb-window CpG%/(G+C)%, and the 20kb-window CpG%/(G+C)% for all 9 location types. Most of the results in Fig.2 are known. For example, the functional impact of variants is the highest for stop-gain/lost, followed by missense; intergenic regions contain more repetitive sequences or transposons; genic regions can be of high (G+C) content; etc. We further show that larger-window (20kb) statistics have narrower distributions, and that intron regions (even more so than intergenic regions) avoid CpG dinucleotides.

De novo mutation derived α and β coefficients

Table 1 shows the raw counts of different types of de novo mutations in the 9 variant types described in the last section. The two coefficients $\alpha_1$, $\alpha_2$ for the two-base-type model and the three coefficients $\beta_1$, $\beta_2$, $\beta_3$ for the three-base-type model are also listed in Table 1. Although we cannot assume neutral dynamics for variants in the functional categories, whether an $\alpha_i$ coefficient is larger or smaller than 1/2, and whether a $\beta_i$ coefficient is larger or smaller than 1/3, indicates in which direction the mutational force is pushing. In all functional categories, the W (A+T) base content will be pushed higher by mutation, and the S (G+C) and CpG content will be pushed lower. The stop-gain/loss and splice acceptor/donor categories contain very few de novo mutation counts. However, the mutational force would drive the CpG content higher in splice sites, while depleting CpG from stop sites.

Bases in the intergenic regions can be assumed to follow neutral evolution without constraints.
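Counting mutation types as in Table 1 requires assigning each SNP to a (from, to) pair among the three base classes W, S_n and S_p. One possible way to do this from the triplet context is sketched below in Python. The function names and the input format (a reference triplet plus the alternative base) are assumptions made for illustration, not the pipeline used in this study, and only the class of the mutated base itself is tracked: a change in the class of a neighboring base, for example when a new CpG is created, is ignored.

```python
from collections import Counter

WEAK = {"A", "T"}
STRONG = {"C", "G"}


def base_class(five_prime, base, three_prime):
    """Classify the middle base of a triplet as 'W', 'Sn' or 'Sp'."""
    if base in WEAK:
        return "W"
    if base not in STRONG:
        raise ValueError(f"unexpected base: {base!r}")
    # A strong base is in CpG context if it is the C of 5'-CG-3'
    # or the G of 5'-CG-3' (the C of the CpG on the opposite strand).
    in_cpg = (base == "C" and three_prime == "G") or (base == "G" and five_prime == "C")
    return "Sp" if in_cpg else "Sn"


def classify_snp(ref_triplet, alt_base):
    """Return (class before, class after) for the middle base of ref_triplet."""
    five, ref, three = ref_triplet.upper()
    alt = alt_base.upper()
    if alt == ref:
        raise ValueError("alternative base equals reference base")
    return base_class(five, ref, three), base_class(five, alt, three)


def count_events(snps):
    """Tally (from, to) class pairs over an iterable of (ref_triplet, alt_base)."""
    return Counter(classify_snp(triplet, alt) for triplet, alt in snps)


# Hypothetical events; the second one is the Sp -> Sp example 5'-CGG-3' -> 5'-CCG-3'.
print(count_events([("ACG", "T"), ("CGG", "C"), ("ATA", "G")]))
```

Pairs such as ('W', 'W') can also occur (e.g. A → T); they do not change the composition among the three classes and do not enter the α or β coefficients.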
We further partition the de novo mutational events in intergenic regions according to their surrounding (2kb) G+C content, and the $\alpha$, $\beta$ coefficients are calculated in each G+C content quantile. The results are shown in Table 2. We can see that not only are $\alpha_1 > 1/2$ and $\beta_1 > 1/3$, but their values also increase with the surrounding G+C content. This result is consistent with previous publications (Duret et al., 2002; International Human Genome Sequencing Consortium, 2001). Table 2 also shows that $\beta_3 < 1/3$, and that it decreases with the surrounding G+C content. Dependence of the CpG mutation rate on local G+C content has also been reported before (Fryxell and Moon, 2005).
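The three-variable quantities can be computed directly from the six off-diagonal counts of Eq.(6). The Python sketch below evaluates the β coefficients and the normalized limiting composition of Eq.(7); the function names, the dictionary layout of the counts, and the example numbers are hypothetical and are not values from Table 1 or Table 2.

```python
def beta_coefficients(n):
    """Return (beta_1, beta_2, beta_3) for the classes (W, Sn, Sp).

    n maps (from_class, to_class) pairs to event counts; missing pairs count as 0.
    """
    def c(a, b):
        return n.get((a, b), 0)

    # Sums of products of two counts, in the order of the left array of Eq.(7).
    w = c("Sp", "W") * c("Sn", "W") + c("Sn", "Sp") * c("Sp", "W") + c("Sp", "Sn") * c("Sn", "W")
    sn = c("W", "Sn") * c("Sp", "Sn") + c("W", "Sp") * c("Sp", "Sn") + c("Sp", "W") * c("W", "Sn")
    sp = c("W", "Sp") * c("Sn", "Sp") + c("W", "Sn") * c("Sn", "Sp") + c("Sn", "W") * c("W", "Sp")
    total = w + sn + sp
    if total == 0:
        raise ValueError("not enough mutation events to define the coefficients")
    return w / total, sn / total, sp / total


def limiting_composition(n, x, y):
    """Normalized limiting fractions of (W, Sn, Sp) from the right array of Eq.(7).

    x is the current G+C content, y the current share of strong bases in CpG context.
    """
    b1, b2, b3 = beta_coefficients(n)
    raw = (b1 * (1.0 - x), b2 * x * (1.0 - y), b3 * x * y)
    norm = sum(raw)
    return tuple(v / norm for v in raw)


# Hypothetical counts for one G+C bracket with x = 0.40 and y = 0.05.
counts = {("W", "Sn"): 30, ("W", "Sp"): 5, ("Sn", "W"): 60,
          ("Sn", "Sp"): 4, ("Sp", "W"): 50, ("Sp", "Sn"): 6}
w_lim, sn_lim, sp_lim = limiting_composition(counts, 0.40, 0.05)
print("limiting G+C:", sn_lim + sp_lim)
print("limiting share of strong bases in CpG context:", sp_lim / (sn_lim + sp_lim))
```

In this made-up example β_1 is about 0.62 (above 1/3), so the weak fraction grows and the limiting G+C content falls below the current 0.40. When all six counts are equal, the three coefficients are 1/3 and the function returns the current composition unchanged.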

Evidence of two diﬀerent limiting G+C contents

To further examine the prediction of our neutral mutational dynamics, using the mutation rates based on the de novo mutational event counts as a function of the current G+C content, we expand the previous six G+C content quantiles to eight, with the highest G+C range split into three G+C regions. This partition leads to around 6000-7000 intergenic de novo mutational events in each of the lower G+C brackets, but 2000-3000 intergenic mutational events in each of the last three high G+C brackets. The mutation counts of various types, the calculated $\alpha(x)$'s and $\beta(x)$'s, the predicted limiting G+C content (by either the two-variable or the three-variable equation) and the limiting CpG/(G+C) are shown in Table 2. The current intergenic G+C content and CpG/(G+C) values, calculated directly from the hg19 intergenic sequences (an intergenic sequence longer than 10kb is partitioned into pieces of 10kb length), are shown in Fig.3. We notice that CpG/(G+C) is positively correlated with (G+C)%, as it involves the product of two strong bases.

Fig.4 depicts the mutational data as a function of the current (G+C)% from various perspectives. Fig.4(A) shows that $p(S \to W)/p(W \to S)$ is not constant, but decreases at high (G+C) content. By the prediction in Eq.(3), the limiting (G+C)% will be higher for the currently (G+C)-rich intergenic regions, as shown in Fig.4(B). Interestingly, the three-variable prediction (which considers CpG-containing strong bases as a distinct variable) leads to a slightly higher limiting (G+C)% than the two-variable prediction. Fig.4(C), however, shows that the limiting CpG/(G+C) may not be very far off from the current CpG/(G+C) value. Fig.4(D) is yet another way to look at the same data. If the mutational rate is the same in regions with different current (G+C)%, the proportion of G or C bases among those that experience a de novo mutation will be linearly proportional to the current (G+C)%. Nonlinearity of this “substituted-base G+C content” in bacterial genomes has been observed (Bohlin et al., 2018). The fact that our Fig.4(D) is further away from the diagonal line (see the grey dashed line) indicates that the human genome is not in a base composition equilibrium state as in bacterial genomes.

The results in Fig.4 may indicate that isochores are maintained by neutral mutational dynamics if the mutational rate is estimated from the de novo mutational events. However, there are still two possibilities: (1) the relatively low mutational AT-driving force observed in the current high G+C region is assumed to remain low when the G+C content in the same region becomes lower in time. We may justify this assumption by the hypothesis that the mutation rate in this region is determined more by the three-dimensional chromatin structure than by the G+C content. (2) Our high G+C intergenic region might be embedded in high G+C genic regions, which are protected from G+C decay by selection. In that case, the relatively low AT-driving force in the intergenic region is not really neutral. The limiting isochore conclusion may not be reached if we assume that the mutational driving force in the current high G+C regions becomes stronger with time, due to the lower G+C content in the future. However, there is no way to prove this with the current data.
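The role of assumption (1) can be made concrete with a toy iteration. The Python sketch below assumes that the per-base rates p(S→W) and p(W→S) are fixed properties of a region and do not change as its G+C content drifts; under that assumption, two regions with different rate ratios settle at different limiting G+C contents, each equal to the value given by Eq.(3). The function names and the rates are hypothetical and deliberately exaggerated so that the iteration converges in a few thousand steps; they are not estimates from the de novo data.

```python
def equilibrium_gc(p_sw, p_ws):
    """Limiting G+C content of Eq.(3) for fixed per-base rates."""
    return 1.0 / (p_sw / p_ws + 1.0)


def iterate_gc(x0, p_sw, p_ws, steps):
    """Iterate the two-state dynamics with rates held constant over time."""
    x = x0
    for _ in range(steps):
        # Gain of strong bases from weak ones minus loss of strong bases.
        x += (1.0 - x) * p_ws - x * p_sw
    return x


# Two hypothetical regions whose rates are regional properties (assumption 1).
regions = {
    "GC-poor": {"x0": 0.35, "p_sw": 0.020, "p_ws": 0.008},
    "GC-rich": {"x0": 0.55, "p_sw": 0.015, "p_ws": 0.010},
}
for name, r in regions.items():
    x_end = iterate_gc(r["x0"], r["p_sw"], r["p_ws"], steps=5000)
    print(name, x_end, equilibrium_gc(r["p_sw"], r["p_ws"]))
```

Both toy regions lose G+C, but they converge to different values, so the contrast between them persists in the limit. If the rates were instead made to depend on the evolving G+C content, as in the alternative scenario described above, the two limits could move closer together; the sketch does not model that case.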

Discussion

In this study, we revisit a topic that was popular in the past, base composition change (Duret et al., 2002; Gu and Li, 2006; Alvarez-Valin et al., 2004; Romiguier et al., 2010), but focus on one species only, Homo sapiens. To this end, we rely on the mutational events observed in humans only, i.e., de novo mutations identified by comparing the genomic sequences of parents and offspring. Reliable data on both the mutation rate and its context dependence are clearly important. Previous work obtained this information by comparing the orthologous regions between human and chimpanzee, considering chimp as ancestral and counting any single-nucleotide change from chimp to human as a mutational event (Supplementary text of Samocha et al., 2014). This approach may obtain more counts, but the directionality of the mutational events can be questioned. The data we use guarantee the mutational direction (from parents to offspring), which is an important piece of information for context analysis.

Rare variants might be another type of data with which to study mutation rates and context effects (Chakraborty, 1981; Kimura, 1983; Neel et al., 1986). However, this approach has to deal with sequencing errors, and private variants (i.e., variants found in only one person) should be validated. Also, population-specific reference genomes should be available, because a so-called rare variant according to the standard reference genome might not be so rare in a particular population, and multiple mutations at the same site should be corrected for. Considering the importance of estimating the background neutral mutational rate in the assessment of excess mutation in a particular gene (Samocha et al., 2014), it could be interesting to compare different approaches, with the anticipation of further complexity due to factors such as gender and age (Jonsson et al., 2018), population (Mathieson and Reich, 2017; Narasimhan et al., 2017), and chromosome regions (Harpak et al., 2016).

Our conclusion that an isochore-like structure, i.e., different regions having different G+C contents, can be maintained in the limiting configuration of the neutral dynamics has already been implied on page 886 of International Human Genome Sequencing Consortium (2001): “if K is the equilibrium constant . . . then the equilibrium GC content should be 1/(1+K) . . . (K) varies as a function of local GC content”. Because the currently (G+C)-rich regions have lower $K$ (equivalent to our $[\alpha_1(x)/\alpha_2(x)] \cdot [(1-x)/x]$ in Eq.(4)), they should also have a higher G+C content in the limiting equilibrium state. However, our conclusion is reached based on the more realistic three-variable dynamics (Eq.(7)). We also caution about an assumption required for reaching this conclusion, i.e., that the mutational rate is a chromosome regional property and may not be a property of the G+C content itself. To confirm or reject this assumption, it might be necessary to follow the temporal base composition dynamics in an intergenic G+C-rich region of the human genome.

Acknowledgment

W.L. thanks the Robert Boas Center for Genomics and Human Genetics for financial support.

Appendix

A. Derivation of the limiting composition based on mutation rate

The master equation, or continuous-time Markov process, for the dynamics of a genomic unit with multiple ($m$) states is (superscript $T$ denotes the transpose):

$$
\frac{d\vec{P}}{dt} = (M^T - I) \, \vec{P} \tag{8}
$$

where $\vec{P}$ is the composition array with $m$ elements, $M_{m \times m} = \{M_{ij}\} = \{P_{i \to j}\}$ ($i, j = 1 \ldots m$) is the $m \times m$ transition matrix, with $P_{i \to j}$ the unit-time probability for state $i$ to change to state $j$, and $I$ is the $m \times m$ identity matrix. The value of $m$ is 4 for nucleotide bases, 16 for dinucleotides, 20 for amino acids, 64 for codons, and any value in between or beyond when degenerate/equivalent states of the genomic unit are combined. For example, if strand symmetry is considered, $m = 2$; if A and T are combined into weak and C and G are combined into strong, $m = 2$; if C or G within a CpG dinucleotide is distinguished from those not within one, $m = 3$; and if stop codons are excluded from all codons, $m = 61$, etc.

Eq.(8) can be derived by examining the source of the state-$j$ frequency change at time $t + dt$ from that at time $t$:

$$
\begin{aligned}
p_j(t + dt) &= p_j(t) + \sum_{i \ne j} p_i(t) P_{i \to j} \, dt - p_j(t) \sum_{k \ne j} P_{j \to k} \, dt \\
&= p_j(t) + \sum_{i \ne j} p_i(t) P_{i \to j} \, dt - p_j(t) (1 - P_{j \to j}) \, dt \\
&= p_j(t) + \sum_{i} p_i(t) P_{i \to j} \, dt - p_j(t) \, dt
\end{aligned}
\tag{9}
$$

then, relabeling the indices ($i \to j'$, $j \to i'$) in the last two steps,

$$
\begin{aligned}
\frac{p_j(t + dt) - p_j(t)}{dt} &= \sum_{i} p_i(t) P_{i \to j} - p_j(t) \\
&= \sum_{j'} P_{j' \to i'} \, p_{j'}(t) - p_{i'}(t) \\
&= \sum_{j'} M^T_{i'j'} \, p_{j'}(t) - p_{i'}(t)
\end{aligned}
\tag{10}
$$

which is Eq.(8) in the $dt \to 0$ limit.

The equilibrium composition is the solution of $d\vec{P}/dt = 0 = (M^T - I) \vec{P}$, which is an eigenvalue/eigenvector problem. This type of dynamical system is also called a (multi-)compartmental system (Jacquez, 1972), and it is known that the only non-negative eigenvalue of the compartmental matrix $M^T - I$ is zero (the largest eigenvalue of the matrix $M^T$ is one) (Jacquez, 1972). In other words, the limiting composition is the (normalized) eigenvector corresponding to the eigenvalue 1 of the transpose of the Markov transition matrix.

References

F Alvarez-Valin, O Clay, S Cruveiller, G Bernardi (2004), Inaccurate reconstruction of ancestral GC levelscreates a “vanishing isochores eﬀect, Mol. Phylogenet. and Evol., 31:788-793.EMS Belle, L Duret, N Gatlier, A Eyre-Walker (2004), The decline of isochores in mammals: an assessmentof the GC content variation along the mammalian phylogeny, J. Mol. Evol., 58:653-660.G Bernardi (1989), The isochore organization of the human genome, Ann. Rev. Genet., 23:637-661.G Bernardi (2006),

Structural and Evolutionary Genomics: Natural Selection in Genome Evolution (Elsevier).G Bernardi, B Olofsson, J Filipski, M Zerial, J Salinas, G Cuny, M Meunier-Rotival, F Rodier (1985), Themosaic genome of warm-blooded vertebrates, Science, 228:953-958.S Besenbacher, S Liu, JM Izarzugaza, J Grove, K Belling, J Bork-Jensen, S Huang, TD Als, S Li, R Yadav, etal. (2015), Novel variation and de novo mutation rates in population-wide de novo assembled Danish trios,Nature Commun., 6:5969.J Bohlin, V Eldholm, O Brynildsrud, JH Petterson, K Alfsnes (2018), Modeling of the GC content of thesubstituted bases in bacterial core genomes,

BMC Genomics , 19:589.R Chakraborty (1981), Estimation of mutation rates from the number of rare alleles in a sample,

Ann. Hum.Biol. , 8:221-230.O Clay, S Cacci´o, S Zoubak, D Mouchiroud, G Bernardi (1996), Human coding and noncoding DNA: compo-sitional correlations, Mol. Phylogenet. and Evol., 5:2-12.DF Conrad, JE Keebler, MA DePristo, SJ Lindsay, Y Zhang, F Casals, Y Idaghdour, CL Hartl, C Torroja,KV Garimella, M Zilversmit, R Cartwright, GA Rouleau, M Daly, EA Stone, ME Hurles, P Awadalla,1000 Genomes Project (2011), Variation in genome-wide mutation rates within and between human families,Nature Genet., 43:712-714.M Costantini, F Auletta, G Bernardi (2007), Isochore patterns and gene distributions in ﬁsh genomes, Ge-nomics, 90:364-371.M Costantini, O Clay, F Auletta, G Bernardi (2006), An isochore map of human chromosomes, Genome Res.,16:536-541.Deciphering Developmental Disorders Study (2018), Prevalence and architecture of de novo mutations in de-velopmental disorders, Nature, 542:433-438. i et al.

L Duret, M Semon, G Piganeau, D Mouchiroud, N Galtier (2002), Vanishing GC-rich isochores in mammalian genomes, Genetics, 162:1837-1847.

MJ Feigenbaum (1978), Quantitative universality for a class of nonlinear transformations, J. Stat. Phys., 19:25-52.

JW Fickett, DC Torney, DR Wolf (1992), Base compositional structure of genomes, Genomics, 13:1056-1064.

LC Francioli, PP Polak, A Koren, A Menelaou, S Chun, I Renkens, Genome of the Netherlands Consortium, CM van Duijn, M Swertz, C Wijmenga, et al. (2015), Genome-wide patterns and properties of de novo mutations in humans, Nature Genet., 47:822-826.

KJ Fryxell and WJ Moon (2005), CpG mutation rates in the human genome are highly dependent on local GC content, Mol. Biol. and Evol., 22:650-658.

Genome of the Netherlands Consortium (2014), Whole-genome sequence variation, population structure and demographic history of the Dutch population, Nature Genet., 46:818-825.

SL Girard, J Gauthier, A Noreau, L Xiong, S Zhou, L Jouan, A Dionne-Laporte, D Spiegelman, E Henrion, O Diallo, P Thibodeau, I Bachand, JYJ Bao, AHY Tong, CH Lin, B Millet, N Jaafari, R Joober, PA Dion, S Lok, MO Krebs, GA Rouleau (2011), Increased exonic de novo mutation rate in individuals with schizophrenia, Nature Genet., 43:860-863.

JM Goldmann (2018), Characterization of de novo mutations in the human germline (Ph.D Thesis, Dept of Human Genetics, Radboud University Medical Center).

J Gu and WH Li (2006), Are GC-rich isochores vanishing in mammals? Gene, 385:50-56.

S Gulsuner, T Walsh, AC Watts, MK Lee, AM Thornton, S Casadei, C Rippey, H Shahin, Consortium on the Genetics of Schizophrenia (COGS), PAARTNERS Study Group (2013), Spatial and temporal mapping of de novo mutations in schizophrenia to a fetal prefrontal cortical network, Cell, 154:518-529.

FF Hamdan, M Srour, JM Capo-Chichi, H Daoud, C Nassif, L Patry, C Massicotte, A Ambalavanan, D Spiegelman, O Diallo, E Henrion, A Dionne-Laporte, A Fougerat, AV Pshezhetsky, S Venkateswaran, GA Rouleau, JL Michaud (2014), De novo mutations in moderate or severe intellectual disability, PLoS Genet., 10:e1004772.

A Harpak, A Bhaskar, JK Pritchard (2016), Mutation rate variation is a primary determinant of the distribution of allele frequencies in humans, PLoS Genet., 12:e1006489.

International Human Genome Sequencing Consortium (2001), Initial sequencing and analysis of the human genome, Nature, 409:860-921.

I Iossifov, BJ O’Roak, SJ Sanders, M Ronemus, N Krumm, D Levy, HA Stessman, KT Witherspoon, L Vives, KE Patterson, et al. (2014), The contribution of de novo coding mutations to autism spectrum disorder, Nature, 515:216-221.

K Jabbari and G Bernardi (2017), An isochore framework underlies chromatin architecture, PLoS ONE, 12:e0168023.

JA Jacquez (1972), Compartmental Analysis in Biology and Medicine: Kinetics of Distribution of Tracer-Labeled Materials (Elsevier).

H Jónsson, P Sulem, GA Arnadottir, Pálsson, HP Eggertsson, S Kristmundsdottir, F Zink, B Kehr, KE Hjorleifsson, B Jensson, et al. (2018), Multiple transmissions of de novo mutations in families, Nature Genet., 50:1674-1680.

MD Kessler, DP Loesch, JA Perry, NL Heard-Costa, D Taliun, BE Cade, H Wang, M Daya, J Ziniti, S Datta, et al. (2020), De novo mutations across 1465 diverse genomes reveal mutational insights and reductions in the Amish founder population, Proc. Natl. Acad. Sci., 117:2560-2569.

M Kimura (1983), Rare variant alleles in the light of the neutral theory, Mol. Biol. and Evol., 1:84-93.

M Kircher, DM Witten, P Jain, BJ O’Roak, GM Cooper, J Shendure (2014), A general framework for estimating the relative pathogenicity of human genetic variants, Nature Genet., 46:310-315.

A Kong, ML Frigge, G Masson, S Besenbacher, P Sulem, G Magnusson, SA Gudjonsson, A Sigurdsson, A Jonasdottir, A Jonasdottir (2012), Rate of de novo mutations and the importance of father's age to disease risk, Nature, 488:471-475.

N Krumm, TN Turner, C Baker, L Vives, K Mohajeri, K Witherspoon, A Raja, BP Coe, HA Stessman, ZX He, et al. (2015), Excess of rare, inherited truncating mutations in autism, Nature Genet., 47:582-588.

TY Li and JA Yorke (1975), Period three implies chaos, Am. Math. Monthly, 82:985-992.

W Li (2011), On parameters of the human genome, Journal of Theoretical Biology, 288:92-104.

W Li (2013), G+C content evolution in the human genome, eLS (John Wiley & Sons, Ltd: Chichester). doi:10.1002/9780470015902.a0021751

W Li and K Kaneko (1992), Long-range correlation and partial 1/f spectrum in a non-coding DNA sequence, Europhys. Lett., 17:655-660.

W Li, G Stolovitzky, P Bernaola-Galvan, JL Oliver (1998), Compositional heterogeneity within, and uniformity between, DNA sequences of yeast chromosomes, Genome Res., 8:916-928.

M Lynch (2007), The Origins of Genome Architecture (Sinauer Associates, Inc.: Sunderland, MA).

M Lynch (2010), Rate, molecular spectrum, and consequences of human mutation, Proc. Natl. Acad. Sci., 107:961-968.

RM May (1976), Simple mathematical models with very complicated dynamics, Nature, 261:459-467.

VM Narasimhan, R Rahbari, A Scally, A Wuster, D Mason, Y Xue, J Wright, RC Trembath, ER Maher, DA van Heel, A Auton, ME Hurles, C Tyler-Smith, R Durbin (2017), Estimating the human mutation rate from autozygous segments reveals population differences in human mutational processes, Nature Comm., 8:303.

I Mathieson and D Reich (2017), Differences in the rare variant spectrum among human populations, PLoS Genet., 13:e1006581.

JV Neel, HW Mohrenweiser, ED Rothman, JM Naidu (1986), A revised indirect estimate of mutation rates in Amerindians, Am. J. Hum. Genet., 38:649-666.

CK Peng, SV Buldyrev, AL Goldberger, S Havlin, F Sciortino, M Simons, HE Stanley (1992), Long-range correlations in nucleotide sequences, Nature, 356:168-170.

DA Petrov and DL Hartl (1999), Patterns of nucleotide substitution in Drosophila and mammalian genomes, Proc. Natl. Acad. Sci., 96:1475-1479.

L Pranckėnienė, A Jakaitienė, L Ambrozaitytė, I Kavaliauskienė, V Kučinskas, Insight into de novo mutation variation in Lithuanian exome, Front. Genet., 9:315.

A Rauch, D Wieczorek, E Graf, T Wieland, S Endele, T Schwarzmayr, B Albrecht, D Bartholdi, J Beygo, N Di Donato, et al. (2012), Range of genetic mutations associated with severe non-syndromic sporadic intellectual disability: an exome sequencing study, Lancet, 380:1674-1682.

J Romiguier, V Ranwez, EJP Douzery, N Galtier (2010), Contrasting GC-content dynamics across 33 mammalian genomes: Relationship with life-history traits and chromosome sizes, Genome Res., 20:1001-1009.

KE Samocha, EB Robinson, SJ Sanders, C Stevens, A Sabo, LM McGrath, JA Kosmicki, K Rehnström, S Mallick, A Kirby, et al. (2014), A framework for the interpretation of de novo mutation in human disease, Nature Genet., 46:944-950.

J Sebat, B Lakshmi, D Malhotra, J Troge, C Lese-Martin, T Walsh, B Yamrom, S Yoon, A Krasnitz, J Kendall, et al. (2007), Strong association of de novo copy number mutations with autism, Science, 316:445-449.

N Sueoka (1962), On the genetic basis of variation and heterogeneity of DNA base composition, Proc. Natl. Acad. Sci., 48:582-592.

De novo mutations in human genetic disease, Nature Rev. Genet., 13:565-575.

LE Vissers, J de Ligt, C Gilissen, I Janssen, M Steehouwer, P de Vries, B van Lier, P Arts, N Wieskamp, M del Rosario, BW van Bon, A Hoischen, BB de Vries, HG Brunner, JA Veltman (2010), A de novo paradigm for mental retardation, Nature Genet., 42:1109-1112.

D Wang (2018), GCevobase: an evolution-based database for GC content in eukaryotic genomes, Bioinformatics, 34:2129-2131.

W Wang, R Corominas, GN Lin (2019), De novo mutations from whole exome sequencing in neurodevelopmental and psychiatric disorders: from discovery to application,

Front. Genet., 10:258.

DM Werling, H Brand, et al. (2018), An analytical framework for whole-genome sequence association studies and its implications for autism spectrum disorder, Nature Genet., 50:727-736.

RKC Yuen, D Merico, et al. (2017), Whole genome sequencing resource identifies 18 new candidate genes for autism spectrum disorder, Nature Neurosci., 20:602-611.

M Zerial, J Salinas, J Filipski, G Bernardi (1986), Gene distribution and nucleotide sequence organization in the human genome, Eur. J. Biochem., 160:479-485.

β. This expression is the sum of three terms: $n_{S_pW}\,n_{S_nW}$, $n_{S_nS_p}\,n_{S_pW}$, and $n_{S_pS_n}\,n_{S_nW}$, where W is a weak base (A or T), $S_n$ is a strong base (C or G) not involved in a CpG context, and $S_p$ is a strong base in a CpG context. These three terms can be represented by the subplots (1), (2) and (3).

[Figure 2 panels: CADD for SNPs in denovoDB; uppercase (non-repetitive) % (2kb); x = GC% (2kb); x = GC% (20kb); y = CpG%/CG% (2kb); y = CpG%/CG% (20kb). Legend: stop (n=149), missense (n=3235), splice (n=81), synonymous (n=1336), UTR (n=1514), non-coding-exon (n=634), intron (n=58824), up/downstream-gene (n=4832), intergenic (n=58824).]

Figure 2: Distribution of various statistics of the de novo mutations according to nine different categories: stop-gain/loss, missense, splice donor/acceptor, synonymous, 3’/5’-UTR, non-coding-exon, intron, up/downstream-gene, intergenic.
(1) CADD; (2) percentage of non-repetitive sequence in the 2kb window centered at the mutation site; (3) (G+C)-content in the 2kb window; (4) (G+C)-content in the 20kb window; (5) CpG/(G+C) in the 2kb window; (6) CpG/(G+C) in the 20kb window.

[Figure 3 plot: intergenic, window size 10-100kb; x = (C+G)%, y = CpG%/(C+G)%; points carry numeric labels 1 to 23.]

Figure 3: Each point is an intergenic region in the human genome (regions longer than 100kb are split into multiple 100kb windows), where x is G+C content, y is the CpG/(G+C) proportion. Larger windows/regions ( >

[Figure 4 panels: (A) S->W to W->S rates ratio, x = current (G+C)%, y = p(S->W)/p(W->S) = [alpha1/alpha2] * [(1-x)/x]; (B) limiting (G+C)%, x = current (G+C)%, y = limiting (G+C)%, legend: two-variable prediction, three-variable prediction; (C) limiting CpG/(G+C), x = current CpG/(G+C), y = limiting CpG/(G+C); (D) (G+C)% in de novo bases, x = current (G+C)%, y = (G+C)% among de novo bases.]

Figure 4: Evidence that current isochore structure might be maintained in the equilibrium configuration in neutral dynamics. Each bin represents a collection of de novo events in intergenic regions with specific 2kb window G+C content. The first five G+C bin points contain 6000-7000 mutational events each, whereas the last three high G+C bin points contain only 2000-3000 mutational events each. (A) $p_{S \to W}/p_{W \to S}$ (Eq.(3)) or $\frac{\alpha_1(x)}{\alpha_2(x)} \cdot \frac{1-x}{x}$ (Eq.(4)) as a function of current 2kb G+C content; (B) limiting G+C by Eq.(4) or Eq.(7) as a function of current 2kb G+C content; (C) limiting CpG/(G+C) by Eq.(7) as a function of current 2kb CpG/(G+C); (D) proportion of G+C among bases which have experienced de novo mutation. The grey lines in (B), (C), (D) represent slope=1 diagonal lines.
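The quantities named in the Figure 1 and Figure 4 captions and tabulated in Table 1 can be illustrated with a short Python sketch. Eq.(3), Eq.(4) and Eq.(7) are not reproduced in this part of the text, so everything below is a reconstruction under stated assumptions rather than the authors' code: α1 and α2 are taken as the shares of S->W and W->S events among all strong/weak changes, β1, β2, β3 as the normalized sums of the two-factor products listed in the Figure 1 caption (one sum each for W, S_n and S_p), and the two-variable limiting G+C is assumed to follow the standard two-state equilibrium 1/(1 + ratio), with the ratio from panel (A) of Figure 4. The function names, the dictionary keys and the value x = 0.41 are hypothetical.

```python
# Sketch only: reconstructed definitions, not the paper's own implementation.

def alpha_from_counts(c):
    """Assumed alpha1, alpha2: shares of S->W and W->S events (CpG and non-CpG pooled)."""
    s_to_w = c["SW"] + c["SpW"]
    w_to_s = c["WS"] + c["WSp"]
    total = s_to_w + w_to_s
    return s_to_w / total, w_to_s / total

def beta_from_counts(c):
    """Assumed beta1..beta3: normalized sums of two-factor products for W, S_n, S_p."""
    into_w = c["SpW"] * c["SW"] + c["SSp"] * c["SpW"] + c["SpS"] * c["SW"]
    into_sn = c["WS"] * c["SpS"] + c["WSp"] * c["SpS"] + c["SpW"] * c["WS"]
    into_sp = c["WSp"] * c["SSp"] + c["WS"] * c["SSp"] + c["SW"] * c["WSp"]
    total = into_w + into_sn + into_sp
    return into_w / total, into_sn / total, into_sp / total

def rate_ratio(alpha1, alpha2, x):
    """p(S->W)/p(W->S) = (alpha1/alpha2) * (1-x)/x, as in Figure 4(A)."""
    return (alpha1 / alpha2) * ((1.0 - x) / x)

def limiting_gc_two_state(alpha1, alpha2, x):
    """Assumed two-state equilibrium: limiting G+C = 1 / (1 + p(S->W)/p(W->S))."""
    return 1.0 / (1.0 + rate_ratio(alpha1, alpha2, x))

# Counts from the intron row of Table 1.
intron = {"WS": 14100, "WSp": 5926, "SW": 18051,
          "SSp": 1108, "SpW": 11646, "SpS": 523}

a1, a2 = alpha_from_counts(intron)
b1, b2, b3 = beta_from_counts(intron)
print(round(a1, 3), round(a2, 3))
print(round(b1, 3), round(b2, 3), round(b3, 3))

x = 0.41  # hypothetical current G+C fraction, chosen only for illustration
print(limiting_gc_two_state(a1, a2, x))
```

With the intron counts, these reconstructed definitions reproduce the tabulated α (0.597, 0.403) and β (0.434, 0.326, 0.241) to three decimals, which is the only support offered here for the assumed forms. Under the two-state assumption the limiting G+C falls below the current value x whenever α1 > α2.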
type               n      n(WS)  n(WSp)  n(SW)  n(SSp)  n(SpW)  n(SpS)  α1     α2     β1     β2     β3
intron             58824  14100  5926    18051  1108    11646   523     0.597  0.403  0.434  0.326  0.241
intergenic         40375  9813   4018    13277  732     6705    303     0.591  0.409  0.423  0.302  0.274
missense           3235   493    272     875    70      1083    59      0.719  0.281  0.553  0.298  0.150
up/downstream      4832   1041   513     1505   123     991     44      0.616  0.384  0.449  0.294  0.257
synonymous         1336   167    112     431    24      508     10      0.771  0.229  0.623  0.232  0.145
UTR                1514   359    136     464    34      307     32      0.609  0.391  0.449  0.337  0.214
non-coding-exon    634    98     68      169    19      171     16      0.672  0.328  0.506  0.282  0.213
stop               149    3      1       61     0       66      0       0.969  0.031  0.940  0.046  0.006
splice             81     4      16      42     0       1       0       0.683  0.317  0.058  0.006  0.936

Table 1: Statistics of the de novo mutational events used. type: 9 functional groups are used by simplifying the original 18 groups; n: number of mutational events; n(WS): number of de novo mutations from W (A or T) to S (G or C) bases outside a CpG-containing triplet; n(WSp): number of de novo mutations from W to S bases within a CpG-containing triplet; n(SW): number of de novo mutations from S (non-CpG) to W bases; n(SSp): number of de novo mutations from S (non-CpG) to S (CpG-containing) bases; α1, α2: defined in Eq.(3), satisfying α1 + α2 = 1; β1, β2, β3: defined in Eq.(7), satisfying β1 + β2 + β3 = 1.

de novo mutation in intergenic regions
G+C range  n  n(WS)  n(WSp)  n(SW)  n(SSp)  n(SpW)  n(SpS)  α1  α2  β1  β2  β3