Human-chimpanzee alignment: Ortholog Exponentials and Paralog Power Laws
HHuman-chimpanzee alignment: Ortholog Exponentialsand Paralog Power Laws
Kun Gao and Jonathan Miller
Physics and Biology Unit, Okinawa Institute of Science and Technology GraduateUniversity, Okinawa, Japan
Abstract
Genomic subsequences conserved between closely related species such ashuman and chimpanzee exhibit an exponential length distribution, in con-trast to the algebraic length distribution observed for sequences shared be-tween distantly related genomes. We find that the former exponential canbe further decomposed into an exponential component primarily composedof orthologous sequences, and a truncated algebraic component primarilycomposed of paralogous sequences.
Keywords: evolution, genomic alignment, length distribution, exponentialand power-law, orthology and paralogy
1. Introduction
Sequence conservation is defined by similar or identical nucleotide se-quences within or among genomes at frequencies beyond those expected onneutral evolution. Within most neutral models of evolution, the probabil-ity that a sequence appears in two unrelated genomes decays exponentiallywith its length, so that sufficiently long sequences common to more than onegenome are expected to derive from a common ancestor [15]. Sequence dupli-cation represents a primary mechanism through which new genetic materialcan arise [23, 28]. When identical sequences are observed within a singlegenome at levels exceeding those expected on an independent site model ofevolution, sequence duplication is one candidate for their origin. Similarityamong sequences beyond that expected within an independent site model,whether multiple occurrences within a single genome or simultaneous oc-currence in multiple genomes, is known as “sequence homology” and may
Preprint submitted to Computational Biology and Chemistry September 5, 2018 a r X i v : . [ q - b i o . GN ] A ug ndicate common ancestry [4].Because sequence conservation and sequence duplication are often in-ferred from sequence length and identity, we believe that a systematic un-derstanding of the latter two features may elucidate rules underlying sequenceevolution and lead to more faithful models of neutral evolution.The set of sequences shared within a genome or between two genomesmay be summarised in its “length distribution:” a histogram with length L on the x -axis and number L ) of shared sequences of length L on the y -axis. These length distributions can exhibit distinctive characteristics that weaim to account for within some model of sequence evolution. Henceforth, weabbreviate “length distribution” to ‘distribution,” as all distributions referredto in this manuscript are histograms of the form indicated above. Strong conservation among distantly related genomes: algebraic distributionwith exponent ≈ − −
4. Thus “ultra-conserved” sequences exhibit thispower law, but the same exponent also governs pairwise conserved sequencesthat are not necessarily shared by a third genome.
Sequence identity among closely related genomes: exponential distribution
Sequences conserved between closely related species such as human andcertain primates display an exponential distribution, rather than a power-law[26]. “Closely related” is defined empirically as sufficiently recent branchingfrom a common ancestor. “Ultra-duplication” within single genomes: algebraic distribution with expo-nent − − − Figure 1: Distributions of identically conserved or duplicated sequences in (a) human chro-mosome 1 self-alignment; (b) human chromosome 1 – chimpanzee chromosome 1 alignmentand (c) human chromosome 1 – mouse chromosome 1 alignment. Distributions shown aregenerated by repeat-masked whole-chromosome LASTZ [13] net alignments obtained di-rectly from UCSC Genome Bioinformatics. Log-log plots enclose semi-log insets.
Figure 1 recapitulates the three cases mentioned above. With increasingevolutionary distance between species, distributions of identical sequencesobtained from LASTZ net alignment cross over from algebraic (with ex-ponent −
3) to exponential, and then again to algebraic (with exponent inthe neighborhood of − Materials and Methods ); henceforth – with the exception of the“
Materials and Methods ” section – we refer to “alignment” and for themost part we omit the qualifier “LASTZ,” which is tacitly implied unlessotherwise indicated explicitly.In the following, we apply whole-genome/whole-chromosome alignmentbetween human and chimpanzee to investigate the origin of the exponentialdistribution and disentangle it from the algebraic distribution. For closely3elated species, quantitative relationships emerge between orthologous se-quences and the exponential distribution, and between paralogous sequencesand the algebraic distribution.
2. Materials and Methods
We compare genomic sequences with the LASTZ pairwise alignment tool[13]. LASTZ alignment comprises several stages of which we rely mainly ontwo: raw alignment and net alignment. Raw alignment is the immediateproduct of LASTZ and may include multiple and positionally overlappingmatches for each aligned sequence. A subsequent net alignment removespositional overlaps among matched sequences, chains them, and discards allbut the highest-scoring chains, yielding a single match for each position inthe genome. One function of net alignment is to extract homologous elementsfrom the raw alignment [14].LASTZ is obtained from ; we useLASTZ default options for raw alignment. The UCSC Genome Browser( http://hgdownload.cse.ucsc.edu/admin/jksrc.zip ) provides additionaltools ( axtChain , chainNet and netToAxt ) for producing the net alignment.Standard procedures that we follow for LASTZ alignment (both raw and net)are described at: http://genomewiki.cse.ucsc.edu/index.php/Whole_genome_alignment_howto . Genome sequences
Soft repeat-masked ( http://repeatmasker.org ) genome sequences areobtained as fasta files from the Ensembl FTP Server (e.g. hg
19 as version74; ftp://ftp.ensembl.org/pub/release-74/fasta/homo_sapiens/dna/Homo_sapiens.GRCh37.74.dna.chromosome.1.fa.gz ). We use for the mostpart the hg
19 and panT ro ), yielding gorilla chromosome 1, orangutan4hromosome 1 (reverse strand), macaca chromosome 1, and marmoset chro-mosome 7 as orthologous to human chromosome 1.Mouse ( mm
9) chromosome 1 is downloaded from Ensembl and alignedto human chromosome 1. Mouse chromosome 1 carries a plurality (closeto 1 /
4) of orthologous elements shared between human chromosome 1 andthe mouse whole genome. The Venter genome [21] is obtained from UCSCGenome Bioinformatics ( http://hgdownload.cse.ucsc.edu/goldenPath/venter1/bigZips/venter1.2bit ) and aligned to hg Repeat-masking
Repetitive sequence elements may constitute close to half the genome;unless these repeats are explicitly identified, most if not all large-scale align-ment methods may fail to complete on eukaryotic genomes. A commonpractice is to identify these elements prior to alignment with a software toolsuch as Repeatmasker ( http://repeatmasker.org ) that demarcates repet-itive sequence in lower-case letters (“soft [repeat-]masking”) in contrast tothe upper-case letters that designate non-repetitive sequence.
Unless otherwise indicated, all LASTZ alignments represented here areperformed on soft repeat-masked genome sequences.
When aligning sequences, LASTZ excludes soft-masked bases from its“seed” stage but reintroduces them in later stages, when alignments of un-masked sequence can be extended into soft-masked regions (see webpage: ), so that LASTZ can in principle align cer-tain masked sequences. For example, just over 50% of each of human ( hg .
74) chromosome 1 and chimpanzee (Ensembl CHIMP2 . . . Following [9], for a given pairwise alignment we study continuous (unin-terrupted) matching runs of bases (CMRs), wherein a contiguous series ofmatching nucleotides is terminated at mismatches or indels. Unless otherwise5ndicated, all CMRs discussed here represent exact matches that we refer tointerchangeably with these two terms.A histogram L ) (or (length) distribution) describes the number ofCMRs of a given length L . Pairwise alignment of genomes yields conservedor duplicated sequences within or between genomes; we expect that distri-butions of these conserved or duplicated sequences reflect certain features ofgenome evolution. Forward alignment and reverse alignment
DNA is composed of complementary strands so that for two DNA se-quences pairwise alignment can be implemented in either of two relativeorientations, “forward” or “reverse.” Matches to the reverse orientation arethought to arise by inversion or inverted duplication/transposition [1]. Weperform both forward and reverse alignments; however, we subsequently com-bine their products before further calculation except where it is informativeto keep them separate.
Dot plot
A two-dimensional similarity matrix between two sequences is displayedas a dot plot [10], in which one sequence of an aligned pair lies along thehorizontal and the other along the vertical axis. Dot plots are commonlyused to visualise sequence similarity and to display homologous matches be-tween genomes. “Syntenic dot plots” exhibit synteny (see webpage: http://genomevolution.org/wiki/index.php/Syntenic_dotplots ). In this pa-per, we apply them to display orthologous sequences between human andchimpanzee genomes.In some of our dot plots, prominent horizontal or vertical white bandsappear that correspond to sequence that has not yet been reliably determinedand is therefore represented by “N” in the assemblies from Ensembl; suchbases are excluded from alignments [13].
3. Results
An alignment of a numbered human chromosome to the chromosomesof chimpanzee yields the exponential distribution of CMRs shown in figure1 for the correspondingly numbered chimpanzee chromosome, but not for6ther chimpanzee chromosomes. In the latest releases of the chimpanzeegenome assembly, chimpanzee chromosomes have been renumbered to re-flect common ancestry with the corresponding human chromosomes [25], sothat chromosomes sharing the same number can be thought of as “ortholo-gous chromosomes.” In figure 2, human chromosome 1 is separately alignedagainst each chimpanzee chromosome. With the exception of the alignmentof human chromosome 1 with chimpanzee chromosome 1, all the alignments(raw and net) between human chromosome 1 and chimpanzee chromosomesyield approximately algebraic distributions. The dot plots corresponding tothese alignments can be found in figure S1 . Figure 2: Distributions of exact matches (CMRs) in raw (blue) and net (red) alignmentsof human chromosome 1 against all chimpanzee chromosomes. Main figures show log-logplots; insets semi-log plots for the same data. For purposes of comparison, lines with slope k on the log-log scale as indicated have been drawn into each figure. H f rag H f rag C f rag
C f rag S2 a syn-tenic dot plot. As can be seen in figure 3, alignments between homologous(heterologous) fragments exhibit exponential (algebraic) distributions. Figure 3: Distributions of exact matches in raw alignments between different fragments ofhuman chromosome 1 and chimpanzee chromosome 1. We extract the first and last thirds f each these two chromosomes, and align all four resulting fragment pairs. These figuresindicate that even within a single chromosome, the exponential distribution correlates withorthology: only alignments between orthologous fragments show exponential distributions.Figure 4: Orthology map among different fragments of human chromosome 1 and chim-panzee chromosome 1. Horizontal dark grey bars (largely obscured in (a) by maroonvertical bars) show human chromosome 1 and chimpanzee chromosome 1; light grey barsindicate fragments H f rag H f rag C f rag
C f rag
We describe further evidence below that the exponential distribution ob-served in the human-chimpanzee alignment correlates with sequence orthol-ogy.
Although alignment of orthologous human and chimpanzee chromosomesyields an exponential distribution overall, distributions of aligned subfrag-ments of these genomes are not necessarily exponential. In figures 2 and 3 itis seen that whole-chromosome alignments between human and chimpanzeecontain both exponential and power-law components.In this subsection, we illustrate how the human chromosome 1 – chim-panzee chromosome 1 alignment naturally decomposes into algebraic andexponential subsets. Based on the observations in figure 2 and figure 3, wehypothesise that 9 o a first approximation, the exponential and (approximately)power-law components correspond to orthologous sequence andparalogous sequence, respectively.
To perform this decomposition we develop several methods, each of whichis related to this hypothesis in a slightly different way. With the exceptionof a “local” method based on “nested” and “non-nested” matches that isparameter-free, they involve parameter choices and sometimes further ma-nipulations whose justification is not always readily apparent. Nevertheless,it turns out that these methods yield very similar outcomes.It may be worth remarking that a length distribution alone contains noinformation about location in a genome, so that it is impossible to partitionan alignment into exponential and algebraic components solely on the basisof aligned fragment or CMR lengths; nevertheless, the content of the previoussubsection suggests that a partition can be extracted from the dot plot.
As evident from figure 2, figure 3 and figure S1 , alignments betweenhuman and chimpanzee genomes with an exponential distribution exhibitdense accumulations of sequences within the dot plot. For closely relatedspecies like human and chimpanzee, it is well known that one such highdensity zone ordinarily forms a band near the diagonal of the dot plot thatwe refer to as the “diagonal band.” We will see that the diagonal band is amajor contributor to the exponential distribution.Figure 5 shows the distributions of exact matches from the diagonal bandand from its complement within the dot plot of the human chromosome 1 –chimpanzee chromosome 1 raw alignment. We crudely take into account thelength difference between human chromosome 1 and chimpanzee chromosome1, on the order of δ C bases, by defining a region around the diagonalof width δ C
1, so that sequences offset by as many as δ C igure 5: Dot plots and distributions of exact matches on diagonal and off diagonal inhuman chromosome 1 – chimpanzee chromosome 1 raw alignment. The diagonal bandwidth is chosen as δ C bases (see text). For the reverse alignment, we excise not thewhole diagonal band, but rather only two fragments (see the insets in the lower-left panel)that in the dot plot correspond to large inversions. The exponential distribution of CMRsin this diagonal band and the algebraic distribution of off-diagonal CMRs is evident in theright-most panels. In contrast to the other methods described here, in this subsection wetreat the forward and reverse alignment separately. As we have discussedabove, the orthologous sequences between human chromosome 1 and chim-panzee chromosome 1 concentrate primarily in the forward strands; we ex-tract the entire diagonal band from the forward alignment. In the reversealignment, two large and distinct inversions appear on the dot plot (see in-sets in lower-left panel of Figure 5); by extracting these inversions we findempirically that we can neatly separate exponential from power-law in thereverse alignment. This can be understood if these large inversions are re-cent events, so that in contrast to the rest of these two chromosomes, theorthologous orientation is reversed. 11 .2.2. “Genetic clock” method: Separating high-similarity alignment blocksfrom low-similarity alignment blocks
The raw alignment is composed of a set of alignment blocks, each rep-resenting a local alignment whose score is higher than a pre-establishedthreshold. One way to characterise similarity in an alignment block is tocompute the number of mismatches it contains, yielding a Hamming dis-tance. The ratio of Hamming distance to sequence length then representsa (time-integrated) rate of variation per base. In the absence of selection,and under customary idealisations, this ratio reflects the time elapsed sub-sequent to divergence, constituting a crude “genetic clock” [34]. Accordingto the definition of ortholog and paralog (see section ), paralogs divergebefore orthologs and should exhibit greater ratios of Hamming distance tosequence length than orthologs. Figure 6 left panel shows Hamming distanceversus alignment block length for all alignment blocks in the human chromo-some 1 – chimpanzee chromosome 1 raw alignment; each dot corresponds toan alignment block. Evidently, these alignment blocks comprise two majorbranches; alignment blocks in the lower-right branch exhibit lower rates ofvariation (greater similarity) than those in the upper-left branch.Figure 6 left panel shows a natural partition of alignment blocks; a linewith slope in the neighborhood of 0 .
08 from the origin is sufficient to eluci-date a partition into an upper-left branch and a lower-right branch (leftmostpanel). Middle panels in figure 6 show respective dot plots for the alignmentblocks in the upper-left branch and lower-right branch, and the rightmostpanel the (approximately) algebraic distribution of the upper-left branch andthe exponential distribution of the lower-right branch.In this “genetic clock” method, we treat the forward and reverse align-ments identically, thus in figure 6, we display the combined product of for-ward and reverse alignment; they are exhibited separately in figure S3 .12 igure 6: Dot plots and distributions of exact matches depend on the accumulated vari-ation (see text) within the alignment blocks from which they are derived. From rawalignment between human chromosome 1 and chimpanzee chromosome 1, the left panelshows the Hamming distance (number of mismatches and indels) as a function of align-ment block length; each dot corresponds to a distinct alignment block. The dashed linescrudely partition alignment blocks into an upper-left branch and a lower-right branch. Dotplots of exact matches for alignment blocks within each branch are shown in the middlepanels, and their respective distributions in the right panel, exhibiting decomposition into(approximately) algebraic and exponential components. LASTZ alignment is performed in stages, with “raw” alignment the imme-diate product. Raw alignment contains all matches between sequences whosealignment scores exceed a predetermined threshold. Aligned fragments oftenoverlap within the raw alignment; one location in the target sequence canmatch multiple locations in the query sequence, and vice versa. Net align-ment scans the target sequence and selects from the aligned fragments ineach region the pair with the highest alignment score, discarding all pairswith lower scores, eliminating overlaps and returning a unique optimal chainof aligned fragments [14].Since the exponential distribution of CMRs in human-chimpanzee align-ment comprises primarily of high-similarity sequence pairs, one would expectthe net alignment to extract such pairs from the raw alignment. We definea “raw minus net” (RMN) alignment as the residual of the raw alignmentonce all fragments also in the net alignment have been removed. Thus thenet alignment and the RMN alignment represent complementary subsets ofthe raw alignment.Kent et al. designed the net alignment to align orthologous sequences[14], so it is not surprising that the LASTZ net alignment between humanchromosome 1 and chimpanzee chromosome 1 consists primarily of exponen-tial components.Figure 7 exhibits dot plots and distributions of exact matches for raw,net and RMN alignments between human chromosome 1 and chimpanzeechromosome 1; and it can be seen there that the net alignment extracts anexponential component from the raw alignment; the RMN alignment distillsan (approximately) algebraic component. Figure S4 exhibits these plots anddistributions separately for forward alignment and for reverse alignment.13 igure 7: Dot plots and distributions of exact matches for the raw, net and raw minusnet alignments between human chromosome 1 and chimpanzee chromosome 1. The netalignment extracts an exponential component from the raw alignment; the RMN alignmentdistills an (approximately) algebraic component. We define “nested-CMRs” and “non-nested-CMRs” as two complemen-tary subsets of the CMRs within an alignment: a CMR is said to be “nested”if it is a subsequence of another CMR. In more detail,
Definition 1: If seq: [ i , i ] denotes a sequence that starts at location i and ends at location i in a genome (here i ≥ i are coordinates in thegenome, both relative to the plus strand), then for two sequences extractedfrom a same genome, seqA: [ x , x ] and seqB: [ y , y ] , we say “ seqA isnested in seqB ” if both these conditions are satisfied:1. y − y ≥ x − x ;2. y ≤ x ;3. y ≥ x ; Definition 2:
Given two different CMRs within an alignment, when thequery or target sequence of one CMR is nested in the corresponding queryor target sequence of the other, we say the first CMR is nested in the secondCMR, and the overlap between these two CMRs is called a “nested overlap.”
Definition 3:
A CMR that is nested in another CMR is called a nested-CMR ; otherwise it is a non-nested-CMR .14hese definitions of nested and non-nested CMR apply to any alignment,including – but not limited to – LASTZ raw [13] and LASTZ net [14] .Below, we apply these definitions to LASTZ raw alignment, and study thedistributions exhibited by nested and non-nested CMRs. Figure 8: Dot plots and distributions of exact matches for nested-CMRs and non-nested-CMRs in human chromosome 1 – chimpanzee chromosome 1 raw alignment.
Figure 8 exhibits dot plots and distributions for the nested-CMRs andnon-nested-CMRs in human chromosome 1 – chimpanzee chromosome 1 rawalignment (for forward and reverse alignments alone see figure S5 ). The(approximately) algebraic character of nested-CMRs versus the exponentialcharacter of non-nested-CMRs is evident. This outcome is plausible if onerecalls that orthologs tend to be more similar to one another than are par-alogs (see section ), so that subsequences of paralogs are likely to benested within subsequences of orthologs. This method requires no chainingof alignment blocks, and is further distinguished from netting because it isparameter-free. Aside from their common reliance on the raw alignment, these four meth-ods ( – ) are independent of one another; however, the distributionsof the corresponding subsets extracted by each of these four methods are However, please note that here our definitions of nested and non-nested CMRs aredifferent from those of nested and non-nested local maxmers by E Taillefer and J Millerin [31]. extracts theentire diagonal band, discarding all off-diagonal elements. In contrast, theother methods all retain some on-diagonal and some off-diagonal elements.Figure 9 schematically displays the consistency of these methods, indi-cating that exponential subsets extracted by different methods consist over-whelmingly of shared CMRs; in particular our “global” and “local” methodsshare close to 95 ∼
98% of the CMRs (figure 9 left panel).
Figure 9: Schematic illustrations of consistency among different methods – described in the text. Circles in the figure indicate the exponential subsets extracted fromhuman chromosome 1 – chimpanzee chromosome 1 raw alignment. Numerals in the figureshow the number of CMRs in different subsets and percentages in the brackets show theproportions of shared CMRs. As evident in the right panel of figure 9, the set of CMRs common to allfour methods contains at least 70% of the CMRs obtained by each methodalone. Although each of these four methods yields some CMRs that arenot obtained by any of the other methods, the proportions of such CMRsare small: 1 .
6% of the net alignment CMRs, 0 .
4% of the non-nested CMRs,6 .
2% of the low-branch CMRs and 7 .
8% of the on-diagonal CMRs.
To account qualitatively for the ortholog contribution to the exponen-tial distribution, we apply a random uncorrelated point mutation (RUPM)model. As a simple model of neutral evolution, a RUPM model consistsof site-independent point mutations (here, single-base substitutions) only,where the rate of these mutations is homogeneous across the genome.16s two identical copies of a common ancestor genome evolve indepen-dently under a neutral RUPM model, CMR lengths follow an exponentialdistribution. For sufficiently short times, long CMRs can be assigned to cor-responding positions within the two genomes, and lie on the diagonal; longsegmental duplications present in the common ancestor remain well con-served. Matches among these segmental duplications in different locationsof the genomes yield a distribution similar to that of the common ancestor:any differences can only be accounted for by random, uncorrelated pointmutation.
We perform a numerical simulation of neutral evolution under the RUPMmodel. Human chromosome 1 was selected as a common ancestor sequencecontaining algebraically distributed segmental duplications. Starting fromtwo identical copies of the ancestral genome, random uncorrelated point mu-tations are introduced independently. We apply 0 .
5% mutations per baseand generate a raw alignment between the descendent genomes. The distri-butions of exact matches are displayed in figure 10.
Figure 10: Distributions of exact matches in raw alignment between two “synthetic”descendants of a common ancestor genome. Introducing random uncorrelated point mu- ations with frequencies 0 .
5% into two copies of an ancestral genome consisting of humanchromosome 1 generates two synthetic descendent genomes. In the figure, solid cyan cir-cles indicate self-alignment of the original (un-mutated) sequence; solid blue circles all theCMRs in the alignment between the mutated sequences; red crosses the “orthologs;” openmaroon circles the “paralogs.” “Orthologs” correspond to matches that share commonlocations between the two descendent genomes’ “paralogs” to matches with a differentlocation in each of the two descendent genomes.
Under the RUPM model, we identify matches between sequences havingidentical coordinates within the respective mutated sequences as “orthologs”and all other matches as “paralogs.” In figure 10 these orthologs exhibitan exponential distribution, whereas paralogs exhibit an (approximately)algebraic distribution that resembles the algebraic distribution of the self-alignment of the original (un-mutated) sequence, but falls a little short inthe tail.For comparison, a parallel simulation on a random sequence is performed;see
Supplementary Text S1 . Figure 11: Distributions of the “orthologs” and “paralogs” in our “synthetic” alignment,separated by different methods.
Because evolution is simulated according to the RUPM, the orthologsand paralogs in this synthetic alignment can be identified solely by their lo-18 ethods Subsets numbers oforthologs numbers ofparalogs Error (%)
On-diagonal 2456358 0 0 “Geometrical”
Off-diagonal 0 12169905 0 “Genetic clock”
Lower branch 2445738 62405 2 . ratio threshold: 0.025 ) Upper branch 10642 12107501 0 . . “Global” RMN alignment 7 12169893 0 . . “Local” Nested-CMRs 44073 12158052 0 . Table 1: Identification of “orthologs” and “paralogs” in the synthetic alignment by meth-ods – . cations within the aligned sequences and we can use this synthetic alignmentto examine the reliability of the methods – above. Figure 11 illus-trates the distributions of the “orthologs” and “paralogs” from our syntheticalignment, as separated by each of our four methods; evidently all of themare effective at separating the exponential from the power-law, as can alsobe seen from Table 1. Relative to the “geometrical” method , which is– for the RUPM model – perfect, the other methods also perform well. The calculations above were performed on human chromosome 1 andchimpanzee chromosome 1. Figure S6 exhibits distributions of exact matchesfrom net and raw minus net alignments of all pairs of orthologous chromo-somes from human and chimp. Exponential distributions characterise thenet alignments, and algebraic most of the raw minus net. Some chromosomepairs show exponential tails in raw minus net, for example, chromosome 16and chromosome Y; it happens that these two chromosomes appear to con-tain more repetitive sequences than other chromosomes (data not shown);however, further understanding awaits future research. Heretofore we have addressed only the human-chimpanzee alignment.Whether our conclusions apply equally well to other genome pairs with sim-ilar evolutionary distances remains to be seen. Figure 12 shows the distribu-tions of exact matches in alignments between human ( hg
19) chromosome 119nd orthologous chromosomes selected from the Venter, chimpanzee, gorilla,orangutan, macaca and marmoset genomes. We choose for each species theorthologous cognate as the chromosome that shares the most orthologousgenes with human chromosome 1 according to Ensembl Biomart (data notshown). For a more distant genome, mouse chromosome 1 is aligned to hu-man chromosome 1; it carries on the order of 1 / igure 12: Distributions of exact matches from raw, net and raw minus net alignments of human chromosome 1 versus thecorresponding orthologous chromosomes of respectively Venter, chimpanzee, gorilla, orangutan, macaca, marmoset; and in therightmost panel versus mouse chromosome 1. . Discussion The quantitative study of monoscale substitution/duplication dynamicswas revitalised by the work of H.C. Lee and collaborators with their aptcharacterisation of “nature as the blind plagiariser” [6]. Although these au-thors did not investigate the steady state duplication length distributionsyielded by their models, subsequent research revealed that similar classes ofmodels yield algebraic length distributions that resemble those often exhib-ited by duplicated sequence in self-alignment and self-intersection of naturalgenomes [9, 16, 24]. Algebraic distributions of conserved sequence lengthsamong distantly related genomes had been observed earlier.This manuscript extends the characterisation of sequence length distri-butions to a pair of closely related genomes, those of human and chimp,where both conserved sequence lengths and duplicated sequence lengths canbe simultaneously computed. In
Results we demonstrated that the humanchromosome 1 – chimpanzee chromosome 1 alignment can be decomposedinto two subsets, one with an exponential length distribution, the other an(approximately) algebraic length distribution. Our calculations also suggestthat the algebraic length distribution is composed primarily of duplicated se-quence including but not limited to paralogous genes, whereas the exponen-tial length distribution is mainly composed of matches between orthologouschromosomal regions.A neutral substitution model in the absence of selection is expected toyield an exponential length distribution for sequence conserved between twogenomes. The phenomenon is quantitatively and conservatively thought ofas a Bernoulli process; the exponential arises from the length distributionof head runs when flipping a biased coin [3], and the exponential under-lies most null models of sequence similarity in comparative genomics. It isnot understood the extent to which an exponential is expected in (say) hu-man/chimpanzee alignment, or whether – since we are not chimpanzees – anexponential is unexpected because of selection.
Chromosomal regions, within or across species, that have common an-cestry are said to be homologs [4]. Homologs can be further identified as orthologs if they diverged via evolutionary speciation, or paralogs if they di-verged via sequence duplication [2, 7]. Orthology and paralogy can in princi-ple be defined for all sequences within a genome, but in practice most on-line22atabases consist only of protein-coding genes. Because of gene duplicationand genome rearrangement, the ancestry of a given gene may be difficultto ascertain with high confidence, and ortholog/paralog classification can beambiguous. Phylogenetic analysis of the gene lineage is customarily believedto enable the strongest discrimination between orthology and paralogy.A standard approach to orthology and paralogy is to argue that within agiven genome pair, orthologs tend to be those homologs that diverged least[2]. Duplication subsequent to speciation generates “mother” and “daughter”copies (known as “in-paralogs”) that exhibit congruent divergence from theircognate orthologs. This sequence of events yields so-called “co-orthology”among in-paralogs [19]. Co-orthology can be further refined to “primaryorthology” and “secondary orthology” [12]. Our preliminary calculationssuggest that in the human-chimpanzee alignment, primary orthologs domi-nate the exponential length distribution, but secondary orthologs merge withparalogs into the power-law length distribution.
In the plots above we study continuous (uninterrupted) matching runs ofbases (CMRs), where continuous matching runs are by definition terminatedat mismatches or indels; these are exact matches; however, CMRs may alsobe defined according to approximate matching criteria. The following criteriaare listed in order of decreasing stringency:I : Exact matches: Each of the four nucleotides (A,T,G,C) matches itselfonly; a mismatch or indel terminates a run of matches;II : A=G, C=T: In addition to the exact matches, A and G, C and T arealso matched pairs; an indel or any mismatch involving other than anA/G or T/C pair terminates the run;III : Indel-terminated matches: aligned but gap/insert-free sequence is takenas matching; only an indel terminates the run;IV : Alignment blocks: High similarity local alignments returned by LASTZthat are separated from one another by un-alignable sequence. Theyspan exact matches, mismatches and indels.CMR distributions obtained with criteria I through IV display sufficientqualitative similarity to one another that only exact match distributions are23isplayed in this manuscript. An example for human-chimpanzee alignmentcan be found in the supplement (see figure S9 in Supplementary Text S2 );for other genome pairs corresponding plots may be found in [9, 17, 26, 30, 31].It was observed for distant inter -genome comparisons in [30] and [26]that criterion II matches – in contrast to all other inexact base substitutionmatching conditions – displace the algebraic distribution of exact matches tonumbers and lengths greater by an order-of-magnitude, with minimal impacton the shape of the curve. Were these C ⇒ T/G ⇒ A substitutions neutral, anexponential would have been anticipated. Yet, a qualitatively similar phe-nomenon (criterion II shifts algebraic criterion I curves to larger numbers andgreater lengths, with minimal impact on shape) is observed for duplications within a genome [9, 16, 31].
The qualitative parallels between distributions of exact and inexact matchesin duplicated sequence versus conserved sequence – discussed in the previ-ous section – suggest to us that the mechanisms behind them share commonfeatures. Subsequent to our original computations [26, 30], the portfolioof fully sequenced genomes has expanded vastly, and a variety of genomepairs exhibiting exact match length distributions with power laws close to − igure 13: Schematic illustration of how a steady source of homologous sequence, sub-ject to local mutation such as random base substitution, can lead to stationary algebraicdistributions of homologous sequence length. (a) Solid coloured blocks represent newlyduplicated sequence within a single lineage. The fading colour indicates the loss of homol-ogy between a duplicate sequence and its source as random local mutations fragment thematches. The time elapsed between the given duplication event and the present, governsthe extent of fragmentation of the given duplicate. (b) Solid coloured blocks represent se-quence – not necessarily duplicated – on which constraint has been newly lost. The fadedcolour indicates the loss of homology over time, as the newly unconstrained sequence ac-cumulates random local mutations that fragment the matches. The time elapsed betweenthe loss of constraint on the given sequence and the present, governs the extent of fragmen-tation of the given sequence. The braces indicate how this time is reflected in evolutionarydistance: nearby leaves (black brace) are dominated by recent mutations of unconstrainedsequence and yield an exponential distribution; intermediate distances (dark grey brace)are dominated by an algebraic distribution arising from successive losses of constraint; atstill greater distances (light grey brace) the distribution exhibits increasingly steep tailsas the overall amplitude attenuates into noise. pair of present-day descendants of a common ancestor,fragmentation could be misinterpreted as representing the average constrainton the sequence over all time; sequences that lost their constraints earlier ap-pear subject to less constraint overall (are more fragmented) than sequencesthat lost their constraints more recently. Presumably, only suitable outgroupgenomes can resolve this potential ambiguity.Observe that, in accord with figure 13 (b), recently diverged sequences(nearby branches) are expected to share exponentially distributed exact matchlengths (because all the mutations breaking the matches occurred subsequent to divergence); an intermediate regime to share algebraically distributedmatch lengths (the integral of fragment lengths arising from mutations thatoccurred before divergence), in principle with power −
3; as the branches sep-arate further the amplitude of the distribution diminishes until matches aretoo sparse to infer its form.In summary, the parallel between the algebraic distributions of duplicatedand conserved sequence is that they both represent a signature of perpetualsequence turnover ; for conserved sequence, the turnover of functional se-quence in a continual process of expropriation, exploitation, and extinction.The latter conception is hardly novel, but the prospect of a quantitativemeasure of it (the exponent, presumably) could be illuminating.
5. Conclusion
Exponential length distributions between similar species and algebraic(power-law) length distributions between more distantly related species andwithin the alignment of a genome to itself have been previously observed.We have studied here the distribution of lengths of identical (and nearlyidentical) sequence shared between closely related organisms. A key con-tribution of our study is that the exponential distribution between closelyrelated genomes turns out to be composed of two types of sequences: (1) or-thologous sequences, which have an exponential distribution; (2) paralogoussequences, which have an algebraic (power-law) distribution.Comparing human and chimpanzee, we explicitly distinguish orthologousfrom non-orthologous regions in a number of different ways, including knownchromosome orthology; annotated orthologous regions in chromosomes; diag-onal versus non-diagonal sectors of a dot plot; alignment similarity betweenhuman and chimpanzee; optimal chains of fragments aligned between orthol-ogous chromosomes. For all such characterisations, we demonstrate expo-27entially distributed length segments for orthologous regions, and algebraic(power-law) distributed length segments for non-orthologous regions. Finally,we provide an in silico demonstration of how such length distributions couldhave arisen through neutral evolution.Recent models of neutral evolution proposed to explain algebraic distribu-tions of duplicated sequence lengths often observed in natural genomes leadone to ask whether they can shed light on the evolution of duplications overevolutionary time scales [17, 24]. Addressing this question suggests the in-vestigation of duplicated sequences common to at least two different species.At the same time, observations from almost ten years ago of algebraic dis-tributions of sequences conserved among multiple divergent species remainunaccounted for [30].In this paper, we take some first steps in studying the evolution of thedistribution of duplicated sequence lengths from self-alignment to alignmentof two nearby species, human and chimpanzee. We describe a parameter-freemethod of extracting paralogs from LASTZ raw alignment of human andchimpanzee, based on nested and non-nested matches, that seems to recon-stitute an approximately algebraic distribution of shared duplicate sequencelengths traceable to the self-alignment. Finally, we exhibit the evolutionof orthologous sequence length distributions over a range of increasingly di-vergent species that spans the exponential and the algebraic, for which amechanism is conjectured.As observed in [30] (twenty years after the rigorous mathematics of [3])pure exponentials may not be so easy to come by in natural genome se-quences. Once that has been recognized, the relevant question shifts to “un-der what circumstances do exponentials actually occur, and why or whynot?” And if not, what takes their place and what does it tell us about se-quence evolution? We hope that the work presented here will eventually leadto further insights into these questions.
Acknowledgement
The authors gratefully acknowledge generous support from the OkinawaInstitute of Science and Technology Graduate University to jm.28 eferencesReferences [1] A Albrecht-Buehler. Inversions and inverted transpositions as the ba-sis for an almost universal “format” of genome sequences. Genomics,90:297–305, 2007.[2] AM Altenhoff and C Dessimoz. Inferring orthology and paralogy.In M Anisimova, editor, Evolutionary Genomics: Statistical andComputational Methods, chapter 9. Springer Science+Business Media,2012.[3] R Arratia and MS Waterman. Critical phenomena in sequence matching.Annals of Probability, 13(4):1236–49, 1985.[4] TA Brown. Molecular phylogenetics. In Genomes, 2nd Edition, chap-ter 16. Oxford: Wiley-Liss, 2002.[5] JA Capra, KS Pollard, and M Singh. Novel genes exhibit distinct pat-terns of function acquisition and network integration. Genome Biology,11(R127), 2010.[6] HD Chen, WL Fan, SG Kong, and HC Lee. Universal global imprintsof genome growth and evolution: equivalent length and cumulative mu-tation density. PLoS ONE, 5(4)(e9844), 2010.[7] W Fitch. Distinguishing homologous from analogous proteins. Syst Zool,19(2):99–113, 1970.[8] KJ Fryxell and E Zuckerkandl. Cytosine deamination plays a pri-mary role in the evolution of mammalian isochores. Mol. Biol. Evol.,17(9):13711383, 2000.[9] K Gao and J Miller. Algebraic distribution of segmental duplica-tion lengths in whole-genome sequence self-alignments. PLoS One,6(7)(e18464), 2011.[10] AJ Gibbs and GA. McIntyre. The diagram, a method for comparingsequences. its use with amino acid and nucleotide sequences. Eur. J.Biochem, 16:1–11, 1970. 2911] D Grauer and WH Li. Fundamentals of Molecular Evolution. Sinauer,2000.[12] M V Han, J P Demuth, C L McGrath, C Casola, and M W Hahn. Adap-tive evolution of young gene duplicates in mammals. Genome Research,19:859–867, 2009.[13] RS Harris. Improved pairwise alignment of genomic DNA. PhD thesis,The Pennsylvania State University, 2007.[14] W J Kent, R Baertsch, A Hinrichs, W Miller, and D Haussler. Evolutionscauldron: Duplication, deletion, and rearrangement in the mouse andhuman genomes. Proc Natl Acad Sci USA, 100(20):11484–11489, 2003.[15] EV Koonin and YI. Wolf. The common ancestry of life. Biology Direct,5(64), 2010.[16] MV Koroteev and J Miller. Scale-free duplication dynamics: a modelfor ultraduplication. Phys. Rev. E., 84(061919), 2011.[17] MV Koroteev and J Miller. Fragmentation dynamics of dna sequenceduplications. arXiv: 1304.1409v3 [math-ph], 2013.[18] PL Krapivsky, S Redner, and E Ben-Naim. A Kinetic View of StatisticalPhysics. Cambridge University Press, 2010.[19] D M Kristensen, Y I Wolf, A R Mushegian, and E V Koonin. Computa-tional methods for gene orthology inference. Briefings in Bioinformatics,12(5):379–391, 2011.[20] W Kuhn. Uber die kinetik des abbaues hochmolekularer ketten. Ber.Dtsch. Chem. Ges., 63(1530), 1930.[21] S Levy, G Sutton, PC Ng, L Feuk, AL Halpern, andet al. The diploid genome sequence of an individual human,doi:10.1371/journal.pbio.0050254. PLoS Biology, 5(10), e254, 2007.[22] G Lunter, CP Ponting, and J Hein. Genome-wide identification of hu-man functional dna using a neutral indel model. PLoS Comput. Biol.,2(1)(e5), 2006.[23] M Lynch. The Origins of Genome Architecture. Sinauer, 2007.3024] F Massip and Arndt P F. Neutral evolution of duplicated dna: An evolu-tionary stick-breaking process causes scale-invariant behavior. PhysicalReview Letters, 110(148101), 2013.[25] E H McConkey. Orthologous numbering of great ape and humanchromosomes is essential for comparative genomics. Cytogenetic andGenome Research, 105:157–158, 2004.[26] J Miller. Colossal and super-colossal ultraconservation. IEICE TechnicalReport, Neurocomputing, 109(53), 2009.[27] DL Nelson and MM Cox. Lehninger Principles of Biochemistry (6thEdition). W.H. Freeman, 2012.[28] S Ohno. Evolution by Gene Duplication. Springer-verlag, New York.Heidelberg. Berlin, 1970.[29] CP Ponting, C Nellaker, and S Meader. Rapid turnover of functionalsequence in human and other genomes. Annu. Rev. Genom. HumanGenet., 12:275–299, 2011.[30] W Salerno, P Havlak, and J Miller. Scale-invariant structure of stronglyconserved sequence in genomic intersections and alignments. Proc NatlAcad Sci USA, 103(35):13121–13125, 2006.[31] E Taillefer and J Miller. Exhaustive computation of exact duplicationsvia super and non-nested local maximal repeats. J Bioinform ComputBiol., 12(1)(1350018), 2014.[32] G Varani and WH McClain. The g × u wobble base pair: A funda-mental building block of rna structure crucial to rna function in diversebiological systems. EMBO Rep., 1(1):1823, 2000.[33] RM Ziff and ED McGrady. The kinetics of cluster fragmentation anddepolymerisation. J. Phys. A, 18(3027), 1985.[34] E Zuckerkandl and LB Pauling. Molecular disease, evolution, and genicheterogeneity. Horizons in Biochemistry. Academic Press, New York.,1962. 31 upplementary figures Figure S1 : Dot plots for (soft repeat-masked LASTZ) raw alignments between humanchromosome 1 and each chimpanzee chromosome. igure S2 : Syntenic dot plot for the CDS (protein-coding nucleotide sequences) betweenhuman chromosome 1 and chimpanzee chromosome 1, created by the SynMap tool in CoGe( http://genomevolution.org/CoGe/index.pl ). The horizontal blue and vertical violet barsindicate the locations of fragments H f rag H f rag C f rag
C f rag igure S3 : Same as figure 6 in the main text, but displaying separately the forward andreverse alignments. igure S4 : Same as figure 7 in the main text, but displaying separately the forward andreverse alignments.Figure S5 : Same as figure 8 in the main text, but displaying separately the forward andreverse alignments. igure S6 : (Length) distribution of exact matches in the net and RMN (raw minus net)alignments of chromosomes orthologous between human and chimpanzee. Supplementary Text
Supplementary Text S1 . Control simulation for the synthetic alignmentin section 3.3.1
At the insistence of one of the referees, to confirm our interpretation ofthe simulation in section we applied the numerical procedure describedthere to a random sequence of the same length as human chromosome 1 andwith lower-case letters at the same positions as they appear in soft-maskedhuman chromosome 1. Our simulation preserved the case of each letter, be-cause unless this soft-masking was maintained, LASTZ was unable to com-plete an alignment of the descendent genomes. We applied 0 .
5% substitutions36er base independently to each of two copies of the random sequence and gen-erated a LASTZ raw alignment between the mutated descendent genomes.“Orthologs” and “other matches” were identified via each of the methods – . The distributions of exact matches are displayed in figure S7 . Figure S7 : A control simulation to figure 10. Random uncorrelated point mutations at arate of 0 .
5% per base are applied to produce two independent realisations of the RUPMon a randomly generated “ancestral genome” of the same length as human chromosome1 and that are soft-masked at the same locations as in human chromosome 1. Since theancestral genome consists solely of independent uncorrelated random sequence, a power-law distribution of paralogs does not appear in this control simulation.
Figure S8 and table S1 for the control simulation correspond respectivelyto figure 11 and table 1 of the text. “Orthologs” in the synthetic and con-trol simulations share qualitatively similar features, but because the randomancestral genome contains no segmental duplications, paralogs are absent inthe control simulation (see inset in upper-right panel of figure S8 , wherethe upper-left branch is missing). In figure S7 any non-orthologous matchesappear only by chance. 37 igure S8 : Distributions of “orthologs” and “non-orthologous matches” in the controlsimulation as identified by each of the methods – . Methods Subsets numbers oforthologs numbers ofnon-orthologs Error (%)
On-diagonal 2454994 0 0 “Geometrical”
Off-diagonal 0 124 0 “Genetic clock”
Lower branch 2455118 0 0 . ratio threshold: 0.025 ) Upper branch 0 0 0Net alignment 2454994 2 0 . “Global” RMN alignment 2 122 1 . . “Local” Nested-CMRs 1 120 0 . Table S1 : Identifications of orthologs and paralogs in the control simulation by methods – . Supplementary Text S2 . Exponential distributions of CMRs counted bydifferent matching criteria
Matching criteria I through IV described in section successively relaxthe matching condition. CMRs counted according to a stricter criterion are38ontained within those counted according to a more relaxed criterion; there-fore, locally and within an alignment block, CMRs counted according todifferent criteria exhibit a nested or hierarchical structure. Different match-ing criteria yield qualitatively similar length distributions, which suggests tous that the latter reflect some intrinsic features of the genomes rather thanartefacts of matching criteria.
Figure S9 : Distributions of the contiguously matched runs counted by different matchingcriteria in the (soft repeat-masked LASTZ) raw alignment between human chromosome 1and chimpanzee chromosome 1. Figure S9 shows the distributions of CMRs counted by different match-ing criteria from the human chromosome 1 – chimpanzee chromosome 1LASTZ raw alignment. Evidently, exact matches, A=G/C=T runs and indel-terminated runs all show exponential tails. As described in [9], contiguousindels (successive insertions or deletions) yield algebraic length distributions.The biological significance of matching criterion II, transition (G ⇔ A,C ⇔ T), has been recognised since the discovery of the genetic code in the mid20th century [27]. In the genetic code, it is evident that the 3rd base “wobble”displays enhanced tolerance for transitions (they tend not to alter the aminoacid encoded by a codon) over other kinds of substitutions. Furthermore, inRNA (DNA) secondary structure, the G:U (G:T) base-pair hydrogen bondplays a central role in stabilising duplex structures formed by all classesof RNAs [32]. Thus the tolerance in these contexts for G ⇔ A and C ⇔ Tsubstitutions is reflected by functional selection.39n the other hand, the C ⇒ T mutation rate (and via subsequent mis-match repair, the G ⇒ A rate) is enhanced by an order of magnitude overother substitutions by the (selectively neutral) chemical process of deamina-tion [11]. The effective pressure for C ⇒ T/G ⇒ A substitution is so strongthat it is believed that certain specific mechanisms, such as A ⇒⇒