CoHSI V: Identical multiple scale-independent systems within genomes and computer software

Abstract

A mechanism-free and symbol-agnostic conservation principle, the Conservation of Hartley-Shannon Information (CoHSI) is predicted to constrain the structure of discrete systems regardless of their origin or function. Despite their distinct provenance, genomes and computer software share a simple structural property; they are linear symbol-based discrete systems, and thus they present an opportunity to test in a comparative context the predictions of CoHSI. Here, without any consideration of, or relevance to, their role in specifying function, we identify that 10 representative genomes (from microbes to human) and a large collection of software contain identically structured nested subsystems. In the case of base sequences in genomes, CoHSI predicts that if we split the genome into n-tuples (a 2-tuple is a pair of consecutive bases; a 3-tuple is a trio and so on), without regard for whether or not a region is coding, then each collection of n-tuples will constitute a homogeneous discrete system and will obey a power-law in frequency of occurrence of the n-tuples. We consider 1-, 2-, 3-, 4-, 5-, 6-, 7- and 8-tuples of ten species and demonstrate that the predicted power-law behavior is emphatically present, and furthermore as predicted, is insensitive to the start window for the tuple extraction i.e. the reading frame is irrelevant. We go on to provide a proof of Chargaff's second parity rule and on the basis of this proof, predict higher order tuple parity rules which we then identify in the genome data. CoHSI predicts precisely the same behavior in computer software. This prediction was tested and confirmed using 2-, 3- and 4-tuples of the hexadecimal representation of machine code in multiple computer programs, underlining the fundamental role played by CoHSI in defining the landscape in which discrete symbol-based systems must operate.


Les Hatton ∗, Gregory Warr †
February 26, 2019


Statement of reproducibility

This paper adheres to the transparency and reproducibility principles espoused by [Pop59, Zio82, CK92, HR94, DMR+09, IHGC12] and includes references to all methods and source code necessary to reproduce the results presented. These are referred to here as the reproducibility deliverables and will be available initially at https://leshatton.org. Each reproducibility deliverable allows all results, tables and diagrams to be reproduced individually for that paper, as well as performing verification checks on machine environment, availability of essential open source packages, quality of arithmetic and regression testing of the outputs [HW16]. Note that these packages are designed to run on Linux machines for no other reason than to guarantee the absence of any closed source and therefore potentially opaque contributions to these results.

Introduction

In considering both genomes and computer software we emphasize that the properties we examine in this study have nothing to do with the function of these systems in conveying information respectively to living organisms and to computers. Although interesting, the analogy of cells as living computers [Bre12, WB12, CKL+18] is therefore without any relevance to the properties we consider here. The conservation principle (CoHSI, Conservation of Hartley-Shannon Information) that we have described is applicable to all discrete systems; CoHSI accurately predicts, amongst other global properties, the length distribution of components in all discrete systems, including proteomes and other qualifying discrete systems such as software, written texts and musical compositions, at all levels of aggregation [HW15, HW17, HW18a, HW18c]. Thus, of necessity in describing systems of such diverse provenance, CoHSI is symbol agnostic (in our usage "symbol" is synonymous with "sign" or "token") and the use of Hartley-Shannon Information, which eschews any meaning associated with symbols, is integral to the theory. In proteins, the symbols are the amino acids (including any post-translational modifications) and in software they are the human-readable programming language tokens familiar to computer programmers. As noted above, CoHSI is agnostic with respect to symbol but requires that the categorization of symbols within a system should be consistent. In other words, we should see CoHSI-constrained properties regardless of the type or level of categorization used. This we can examine in both proteins and in computer software because there is a deeper and very well known layer of categorization in both these systems; we refer respectively to the bases of DNA and to machine-readable code.

∗ Emeritus Professor, Kingston University, KT1 2EE, U.K.
† Emeritus Professor, Medical University of South Carolina, 171 Ashley Ave, Charleston, SC 29425, USA.

Proteins are encoded in DNA by (A)denine, (T)hymine, (G)uanine and (C)ytosine, whereby, read sequentially, each triplet of bases encodes an amino acid. Genomes consist of linear polymers of the bases and in complex organisms not all DNA is coding.
Whereas in bacteria over 90 percent of the genome can encode proteins, in humans for example the coding regions are typically considered to be between 1-2 percent of the total, and the function of the remaining DNA, which includes substantial amounts of repetitive sequences as well as regions regulating gene expression, has stirred controversy [Doo13]. Note that we do not consider here the non-canonical bases, for example, the modified guanine 7-methylguanine. The 17 known such modifications at the time of writing would of course increase the base alphabet significantly from just 4, but sufficient data is not yet available for a comprehensive analysis incorporating this extended base alphabet. For the record, since our theory is independent of the nature of the bases, we would not anticipate any changes to our conclusions as the base alphabet was extended.

Similarly, computer software systems are written in a higher-level human-readable form in one or more programming languages such as C, C++, perl or python. However, before it can actually be run on a computer, the human-readable form must be converted to a machine code (also known as machine-readable code), which means something to the central processing unit (CPU) of the computer. This process is called either interpretation or compilation depending on the programming language.

These two highly disparate but deeper levels of categorization we now consider in more detail.

The basic alphabet of life

The coding and other functional regions of DNA are ipso facto non-random in sequence, but the overall sequence of bases in the genome is also demonstrably non-random, as exemplified by innumerable features, including the following examples. First, there are wide variations in the C+G content of genomes [MC10, LD14]. Second, there is underrepresentation of CG dinucleotide repeats [DB11, YO89]. Third, Chargaff's second parity rule notes the occurrence of (approximately) equivalent numbers of nucleotides (G = C; A = T) and of nucleotide motifs with their reverse complement on the same strand [AB06]. Fourth, there are many classes of repeated DNA sequences distributed throughout, and accounting for significant proportions of, the genome [BOHH15].

Typically these imbalances in DNA sequences are explained by specific mechanisms. For example they have been described as adaptive responses to the environment in prokaryotes [MC10]. In the case of CG dinucleotides, their involvement in controlling gene expression (as CpG islands), their particular secondary structure, and their role as sites for mutation [DB11, KKR+16] are all considered reasons for their underrepresentation. In the case of repeated DNA, many of these sequences are mobile genetic elements that can be either transposed or copied and transposed in the genome, leading to an accumulation of these elements in the genome. Over 40 percent of the human genome and over 80 percent of the maize genome consist of transposable elements [THGRI11].

Thus, the structure of the genome represents the outcome of multiple mechanisms operating over evolutionary timescales - the generation of coding and transcriptional control sequences, the response to diverse biochemical and environmental pressures, and the invasion and spread of transposable elements. However, the predictions of CoHSI transcend all of these essentially local […]

Footnote: https://en.wikipedia.org/wiki/DNA, accessed 17-Feb-2019.

The basic alphabet of software

In computers, the most fundamental level of arithmetic has 2 symbols in contrast to the 4 symbols of DNA. Machine-readable code appears as long strings of 1s and 0s but in practice, for display purposes and to compress it, it is shown in hexadecimal format (i.e. 4 bits at a time), numbered 0-9,a-f, with a-f corresponding to decimal numbers from 10 to 15 respectively. The hexadecimal character e, corresponding to the decimal value 14, is therefore an abbreviated form for a bit pattern of 1110 ($2^3 + 2^2 + 2^1$), whereas a hexadecimal character in the range 0-9 is the same as its decimal form and likewise abbreviates a 4-bit pattern. A typical string of hexadecimal characters from a binary dump on a Linux machine of a computer program looks like

...4102301400e08a2625010b32000e0242...

What these characters mean and where they appear is the function of the systems software of a computer (c.f. ELF format). Some parts contain only data and some parts are control sections which direct the CPU and the computer hardware to do something. There is a weak analogy with the coding and non-coding regions of DNA, but it would be easy to exaggerate this and we will avoid simple but potentially misleading analogies wherever possible. In spite of their rapidly growing size and ubiquity, computer programs barely even hint at the levels of complexity we observe in the genome - we understand precisely how coding and data sections in a computer work, but we have an incomplete understanding of how the genome integrates its functions to maintain and propagate life.
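The hexadecimal display described above can be illustrated with a minimal Python sketch. It is not part of the original analysis; the function names `hex_to_bits` and `bytes_to_hex` are hypothetical and chosen only for this illustration.

```python
def hex_to_bits(hex_string: str) -> str:
    """Expand each hexadecimal character into the 4-bit pattern it abbreviates."""
    return "".join(format(int(ch, 16), "04b") for ch in hex_string)


def bytes_to_hex(data: bytes) -> str:
    """Render raw bytes as one uninterrupted string of hexadecimal characters."""
    return data.hex()


if __name__ == "__main__":
    # The character e stands for decimal 14, i.e. 2^3 + 2^2 + 2^1.
    print(int("e", 16), hex_to_bits("e"))
    # The first four bytes of the dump fragment quoted in the text.
    print(bytes_to_hex(bytes([0x41, 0x02, 0x30, 0x14])))
```

Reading a real program would only require passing the bytes of the binary file to `bytes_to_hex`; the resulting string is the kind of symbol sequence analyzed in this study.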
We should also note that the mechanisms of life work upwards from the genome to higher-level objects (cells, tissues and organs, organisms) whereas the process of preparing a computer program to run goes in the opposite direction from human cognition to higher-level human readable instructions down to the machine-readable binary format understood by the computer hardware.

We have then two kinds of system of distinct provenance in which we have previously demonstrated CoHSI patterns at the level of categorization that recognizes components [HW16, HW17, HW18a, HW18c]; in the case of the genome these components are the encoded proteins (composed of amino acids) and in the case of computer programs the components are computer functions or routines (written in source code). In the study reported here we test the predictions of CoHSI at the lowest level of categorization for these two systems; the four bases of the genome and the hexadecimal format of computer machine-readable code. We emphasize that this is an entirely different level of abstraction from the categorization used to study proteins and computer functions. In contrast to the clear distinction made between components in systems of proteins or of computer functions, here we treat the genome and machine code simply as long uninterrupted strings of symbols, either the bases of DNA or the hexadecimal characters of machine-readable code respectively. The only boundaries we will recognize are the start and end of the genome of a species and the start and end of the machine code of a computer program, and even these we will ignore when we examine the patterns predicted in aggregates of genomes or of computer programs. Furthermore, we also examine for both proteins and computer software whether the observed structural patterns are independent of reading frame, i.e. the position in the string of symbols at which we start reading, or even how many symbols we include in the reading window (we consider 1, 2, 3, 4, 5, 6, 7 and 8).

Other authors have considered the tuple occurrence problem. Specifically, [GWH11] use a model of preferential attachment as a generator for power-laws and consider very long tuples, whereas in this study we test the predictions of the CoHSI conservation principle.

As we describe in more detail below, CoHSI predicts that in such linear strings of symbols as DNA or machine-readable code, motifs consisting of strings (tuples) of symbols will occur with a frequency, when rank ordered, that corresponds to a power law (i.e. Zipf's Law).

Footnote: https://en.wikipedia.org/wiki/Executable_and_Linkable_Format, accessed 28-Nov-2018.

Methods

Theoretical background

Some kind of underpinning theoretical model is essential when approaching datasets of the size of genomes or of computer programs looking for patterns, otherwise it is all too easy to be overwhelmed by their gigantic and growing complexity, not to mention "p-hacking" if the observations are not tested in the context of a clear prediction. The theory leading to CoHSI is described in [Hat14, HW15] and fully developed in [HW17], but it will be useful to spend a few paragraphs describing exactly how this methodology of CoHSI can be used to approach problems of categorization without any consideration of specific mechanisms.

The Classification of Discrete Systems and the Predictions of CoHSI

In [HW17], we noted that discrete systems (i.e. systems whose pieces can be counted as integers) can be divided into two distinct types. The indivisible pieces can be considered synonymously as symbols, signs or as tokens, with token being our preferred term in discussing proteins and software. In Heterogeneous systems the indivisible tokens taken from an alphabet of tokens are assembled sequentially, in distinguishable order, into larger units that we term components. The concept of components can be illustrated by considering the proteome (which is the collection of all the proteins expressed by a species) as a discrete system. The components in this case are the proteins, which are themselves assembled sequentially and in distinguishable order from amino acids (the tokens). Another example is provided by the computer functions that are assembled sequentially and in distinguishable order from computer source code (the tokens) and that constitute the components of a computer program. The second type of discrete system that we distinguish is Homogeneous. Homogeneous discrete systems are simply those in which the indivisible units can be counted without reference to any order of assembly - a classical example of a homogeneous system is the frequency of words (equivalent here to tokens) in a written text. In this example the assignment of words is to bins, where each bin (the equivalent here of a component) contains only the occurrences of a specific word. When the bins are rank ordered by the frequencies of the words they contain, a power law relationship (the classical Zipf's Law [Zip35]) emerges.

The above discussion implies that a single discrete system might be categorized as both a heterogeneous and a homogeneous system. CoHSI theory predicts not only different behaviours for heterogeneous and homogeneous systems, but also that both behaviours can co-exist simultaneously in a single system, dependent solely upon the categorization used [HW17]. For example in the heterogeneous categorization of a text such as a book the words are components and the letters are tokens and the distribution of word lengths thus follows the predicted canonical CoHSI length distribution [HW17] for heterogeneous systems. When the same book is re-categorized as a homogeneous system, i.e. when the words themselves are designated as tokens and binned according to their frequency of occurrence, CoHSI predicts that homogeneous systems will be overwhelmingly likely to obey a simple power-law (in fact CoHSI provides a proof of Zipf's law [HW17]). The same coincident heterogeneous and homogeneous behaviour is not unique to texts - it was also observed in the currently known set of proteins [HW18c], whereby on one hand the length distribution of proteins measured in amino acids follows the canonical heterogeneous form while on the other hand, the distribution of frequency of protein re-use (seen when gene transfer has occurred between species) is power-law by rank as predicted for the homogeneous categorization.

The full development of CoHSI theory is in [HW17], and for the homogeneous case for reasonably large values of the contents $t_i$ of the $i$-th component (where Stirling's approximation is valid), the distribution is the solution of

$$\log t_i = -\alpha - \beta \left( \frac{d}{dt_i} \log N(t_i, a_i; a_i) \right) \qquad (1)$$

The derivation of this uses the simplest form of Stirling's approximation, $\log(t_i!) \approx t_i \log t_i - t_i$, and in the homogeneous case the information function $N(t_i, a_i; a_i)$ of [HW17] is simply $r^{t_r}$, where $r$ is the rank order, and so the frequency distribution is given by

$$\log t_r = -\alpha - \beta \left( \frac{d}{dt_r} \log r^{t_r} \right) = -\alpha - \beta \log r \qquad (2)$$

The solution of this is simply Zipf's power-law for all ranks [HW17].

Slicing and Dicing the Genome

We analyzed the genomes of the ten species shown in Table 1, downloaded from various public sites (the Sanger Institute and the NCBI). These genomes varied in size from 0.8 MB (Megabyte) to 881.2 MB and from approximately 10% to 98% non-protein coding sequence. We have preserved the file names to identify uniquely the particular assemblies. The sequences were analyzed for non-overlapping n-tuples (n=1-8) read sequentially from the beginning of the sequence to the end, except where offsets (changes in reading frame) are specifically noted. We did not use pre-existing databases of tuple sequences - the bins were defined by the tuple sequences that emerged from the analysis.

Species        Genome file                                            Size (MB)
Ciona          Ciona_intestinalis.KH.dna.toplevel.fa.gz               34.4
Fruitfly       Drosophila_melanogaster.BDGP6.dna.toplevel.fa.gz       42.3
Rice           GCA_001889745.1_Rice_IR8_v1.7_genomic.fna.gz           122.6
Cyanobacteria  GCA_003555505.1_ASM355550v1_genomic.fna.gz             0.8
Mushroom       GCA_000300555.1_Agabi_varbur_1_genomic.fna.gz          9.7
Yeast          Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.fa.gz    3.8
Nematode       Caenorhabditis_elegans.WBcel235.dna.toplevel.fa.gz     30.3
Human          Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz         881.2
Gorilla        Gorilla_gorilla.gorGor4.dna.toplevel.fa.gz             873.4
Thale cress    GCA_000001735.2_TAIR10.1_genomic.fna.gz                37.5

Table 1: The ten genomes used in our study and their sources. The mushroom sequence is of Agaricus bisporus and the cyanobacterial sequence is of Thermosynechococcus elongatus PKUAC-SCTE542.
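The tuple extraction described above can be sketched in Python as follows. This is an illustrative sketch rather than the authors' reproducibility code: the function names are hypothetical, the sequence is assumed to have been read already into a single string with FASTA header lines and newlines removed, and any trailing fragment shorter than n is assumed to be discarded.

```python
from collections import Counter


def count_tuples(sequence: str, n: int, offset: int = 0) -> Counter:
    """Count non-overlapping n-tuples read sequentially from `offset`.

    Bins are not predefined: every distinct tuple that emerges from the
    sequence gets its own bin. A trailing fragment shorter than n is
    dropped (an assumption made for this sketch).
    """
    usable_end = offset + ((len(sequence) - offset) // n) * n
    return Counter(sequence[i:i + n] for i in range(offset, usable_end, n))


def rank_ordered(counts: Counter) -> list:
    """Return the bin occupancies sorted from most to least frequent."""
    return sorted(counts.values(), reverse=True)


if __name__ == "__main__":
    toy = "ACGTACGTAC"  # toy stand-in for a genome
    print(count_tuples(toy, 2))            # reading frame starts at base 0
    print(count_tuples(toy, 2, offset=1))  # reading frame shifted by one base
    print(rank_ordered(count_tuples(toy, 2)))
```

Because the bins are created on the fly, ambiguous base codes present in an assembly would simply appear as additional low-occupancy bins.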

Results and Discussion

Testing Predictions Using Ten Genome Sequences

Our straightforward prediction using homogeneous CoHSI theory is that the occurrence rate of n-tuples will follow a power-law in rank order of occurrence, i.e. Zipf's law, for any n.

First we attempted to falsify this prediction by analyzing the haploid human genome as cited in Table 1 for n = 1, 2, 3, 4, 5, 6, 7 and 8. The results are shown in Fig. 1; the presence of the predicted power-law (linearity on a log-log plot) is clear and seen emphatically with the higher tuples, where the number of bins has a potential maximum of 65,536 in the case of the 8-tuples. R lm() reports that the associated p-value matching the power-law linearity in the 8-tuple ccdf of Fig. 1 is < 2.[…] × 10^−[…] over the 3-decade range […], with an adjusted R-squared value of […]. The slope is −[…] ± […].

The same analysis of 1-8 tuples was carried out with the other 9 genomes (Appendix Figs 12 and 13), and in every case power law distributions of tuple frequency (as observed with the human genome) were seen regardless of species.

The following points can be noted:

Figure 1: The occurrence rates of n-tuples in the human genome for n = 1, 2, 3, 4, 5, 6, 7 and 8. The plots are of the complementary cumulative distribution function (ccdf). (Axes: tuple occurrences against rank.)

• All 10 genomes showed identical power law (Zipfian) behaviour for all n-tuples regardless of variation in the size of the genomes (which spans more than 3 decades between the cyanobacterium and human) or the proportion of coding sequence they contain, which ranges from 1-2% in human and gorilla to 20-30% for the fruitfly, rice, thale cress, Ciona and C. elegans, to approximately 65 - 85% in the two fungal species (yeast and mushroom) and to 90% in the cyanobacterium.
• The rapid fall-off (droop) at the extreme tail was observed in all genomes and all tuples from 2-8 and occurs in the region where the rarer tuples are not statistically well-represented. Although the drooping tail represents only a small fraction of the genome (<1%), nevertheless it does not appear random, and we will discuss the potential origins of this feature of the tail in more detail below.
• In all species there is a significant departure from CoHSI in the highest ranked tuples (typically the first and second but sometimes extending to additional ranks) where they occur much more frequently than predicted. In the cyanobacterium this deviation is apparent only for the 6-, 7- and 8-tuple analyses. This observation will also be discussed below.
• The scale independence. The power-law slopes are almost parallel for all n-tuples in the same genome as shown in Fig. 2. To see the similarity from a different viewpoint, the 1-8 tuples for each species we studied are shown individually in the Appendix Figs 12 and 13.
• Since CoHSI is token-agnostic, we predicted that the result would be unaffected (apart from a two-fold scale change) when we generated the reverse complementary strand (using A ⇌ T and C ⇌ G, and reading 5' to 3') and concatenated it with the first strand. We show that this is indeed the case for the human genome in Fig. 3.
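The power-law check referred to in the points above amounts to testing linearity on log-log axes. A minimal Python sketch of such a fit is given below. It is not the R lm() analysis used in the paper: it fits the raw rank-ordered counts rather than the ccdf, the function name `loglog_slope` is hypothetical, and the input is synthetic Zipf-like data rather than genome data.

```python
import math


def loglog_slope(rank_ordered_counts):
    """Ordinary least-squares fit of log(count) against log(rank).

    Returns (slope, intercept). For a pure power law, count = C * rank**slope,
    the fit recovers the exponent and log(C).
    """
    xs = [math.log(rank) for rank in range(1, len(rank_ordered_counts) + 1)]
    ys = [math.log(count) for count in rank_ordered_counts]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Synthetic stand-in data: an exact power law with exponent -1.
synthetic_counts = [1_000_000 / rank for rank in range(1, 1001)]
slope, intercept = loglog_slope(synthetic_counts)
print(f"fitted slope = {slope:.6f}, intercept = {intercept:.6f}")
```

On real tuple counts the over-represented top ranks and the drooping tail noted above would pull the fit away from a straight line, so in practice the fit would be restricted to the central range of ranks.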

The droopy tail

Initially, we thought the droopy tail was simply a statistical phenomenon caused by the lowest ranked bins having insufficient occupancy, something also noted by [GWH11]. It is however not random. This can be seen by combining the data for the genomes of all ten species we considered into a single large file that was then scanned in one pass from beginning to end as shown in Figure 4. The droopy tail is plainly evident and with a similar qualitative nature to that seen in each of the individual species. This systematic nature of the droopy tail of the homogeneous distribution initially concerned us as CoHSI predicts a pure Zipf's law as we have seen above. However, we then realized that the pure Zipf's law predicted by CoHSI arises directly from the use of Stirling's approximation. Wherever Stirling's approximation is good, which is for most of the distribution, a near linear power-law in rank is found as expected. However for the lowest ranked (i.e. least frequently occurring) items, the basic assumption underlying Stirling's approximation of well-occupied categories no longer holds.

Figure 2: Each slope with error bounds extracted from R analysis of each tuple for each species. The data are shown in tuple order (1-8) for each species. (Axes: slope against tuples, grouped by species.)

What then happens to the predicted Zipf's law for the least frequently occurring items if Stirling's approximation is not used? We explored this first using Ramanujan's form [Ram88] for a better approximation than Stirling to $\log(t_r!)$, and this is given by

$$\log t_r + \frac{1 + 8 t_r + 24 t_r^2}{6 (t_r + 4 t_r^2 + 8 t_r^3)} = -\alpha - \beta \log r \qquad (3)$$

This becomes identical to Stirling's approximation for the highest occupancy ranks as can be seen by letting $t_r \to \infty$ in (3) and comparing with (2), but how does it depart from Zipf's law as the occupancy of ranks falls and they appear later in the rank-ordering? The result can be seen in Fig. 5, where a droop begins to appear. Encouraged by this, we therefore supplemented the Stirling and Ramanujan approximations by computing $\log(t_r!)$ exactly as $\sum_{k=1}^{t_r} \log k$ when $t_r$ is below some suitable value (100 in our studies). The droop increases and is not therefore an artifact of the approximation we used. This process reveals that the correct CoHSI prediction is that the Zipf's law relation for rank ordering will naturally break down (droop) for the highest values of the rank (i.e. the least occupied bins) as shown in Fig. 5.
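To make the comparison of the three forms concrete, the following Python sketch evaluates the simple Stirling form, Ramanujan's form and the exact sum for a range of occupancies. It does not reproduce Fig. 5; the function names are hypothetical, and Ramanujan's formula is written here without its small additive 1/30 correction inside the logarithm, which is an assumption made because that is the version whose derivative matches the left-hand side of equation (3).

```python
import math


def log_factorial_stirling(t):
    """Simplest Stirling form: log(t!) ~ t log t - t."""
    return t * math.log(t) - t


def log_factorial_ramanujan(t):
    """Ramanujan's form (without the 1/30 term; an assumption of this sketch)."""
    return (t * math.log(t) - t
            + math.log(t * (1 + 4 * t * (1 + 2 * t))) / 6
            + math.log(math.pi) / 2)


def log_factorial_exact(t):
    """Exact value of log(t!) as a sum of logarithms, for integer t >= 1."""
    return sum(math.log(k) for k in range(1, t + 1))


def ramanujan_lhs(t):
    """Left-hand side of equation (3): d/dt of Ramanujan's form."""
    return math.log(t) + (1 + 8 * t + 24 * t * t) / (6 * (t + 4 * t * t + 8 * t ** 3))


if __name__ == "__main__":
    for t in (1, 2, 5, 10, 100, 1000):
        exact = log_factorial_exact(t)
        print(t,
              round(exact - log_factorial_stirling(t), 6),   # Stirling error
              round(exact - log_factorial_ramanujan(t), 6))  # Ramanujan error
```

The absolute error of the simple Stirling form is largest relative to log(t!) when t is small, which is the regime of the least occupied bins where the droop appears.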

Figure 3: The occurrence rates of n-tuples in the human genome, with the forward and reverse sequences of the haploid genome concatenated, for n = 1, 2, 3, 4, 5, 6, 7 and 8. Data are plotted as the complementary cumulative distribution function. (Axes: tuple occurrences against rank.)

Figure 4: The occurrence rates of n-tuples in the combined genomes of all ten species for n = 1, 2, 3, 4, 5, 6, 7 and 8. The anomalous very low frequency bins that are evident close to the x-axis after a break in the drooping tail arise from ambiguous sequence reads in the Thale cress (see Table 2 and Appendix Fig 13). Data are plotted as the complementary cumulative distribution function. (Axes: tuple occurrences against rank.)

However, this is clearly not the whole story as the droop is systematic in the tail even when bins in this region are relatively well-occupied and it is observable in all genomes regardless of scale in a qualitatively identical manner. It seems likely that some systematic departure is present which causes a departure from the equilibrium power-law and the modest droop predicted by CoHSI; further investigation is necessary.

Figure 5: Illustrating the droop: the departures from the exact Zipf's law prediction of CoHSI as the $t_r$ (occupancy rates of the bins) become small (in essence, this is the y-axis) using Stirling's approximation (which gives exactly Zipf's law), Ramanujan's approximation which begins to depart from Zipf's law, and an exact computation which shows a greater departure when the $t_r$ are smallest. The graph illustrates a homogeneous distribution with alpha = -0.001, beta = 2.0. (Axes: occurrences against rank; curves: Ramanujan, Stirling, Exact.)

Over-Representation Of The Most Frequent Tuples

Table 2 gives the four most common and four least common 8-tuples for each of the 10 genomes analyzed, and in all species except Ciona, the two most highly represented 8-tuples occur more frequently than predicted by CoHSI (see Figures 1 and 2, the analyses of n-tuple frequencies for all 10 genomes). Several points are notable. First, in all 9 of the eukaryotic species the poly-A and poly-T motifs are always amongst the top 4. Second, the results support Chargaff's Second Parity Rule, even for octanucleotide motifs, in all 10 of the species examined; the gorilla is the only eukaryotic species in which the top four 8-tuples do not constitute two pairs of reverse complementary sequences. We will address below the manner in which CoHSI provides proof of Chargaff's Second Parity Rule. Third, the cyanobacterial sequence is interesting in that the most frequent 8-tuple is its own reverse complement, with the next two most frequent 8-tuples forming a reverse complementary pair. Fourth, in the nine eukaryotic species the most common 8-tuple motifs do not occur at random, being shared between genomes. Of the 20 most frequent motifs observed when all genomes combined are analysed (Table 3) all of them occur in more than one species, being represented in a number of eukaryotic genomes ranging from 2-9. This sharing of common motifs did not extend to the prokaryote - none of these motifs (Table 3) was shared with the cyanobacterial genome.

Sensitivity analysis

CoHSI is a token-agnostic and mechanism-agnostic theory. We would therefore expect the emphatic power-law behaviour for the n-tuples reported above to be independent of the start of the reading frame. This was tested on all genomes for the largest tuple considered, the 8-tuple. The results (data not shown) were identical, with no sensitivity noted for start offsets of 1, 2, 3 and 4 bases. A typical example is shown for the human genome in Figure 6. The curves for the 4 analyses are so similar that they overlie one another almost perfectly, except for a small number of very minor differences at the extreme ends of the distribution.

It is certainly correct that starting the tuple analysis with non-zero offsets of the reading frame could identify different orderings of tuples, but we must recall that the homogeneous model is in rank order. CoHSI simply leads in each case to a re-ordered list which, when arranged by rank, leads to precisely the same predicted power-law. It is worth noting that this result is apparently independent of any distinction between protein-coding and non-protein coding regions of the genome. The start codon for any protein could randomly be at an offset of 0, 1 or 2 from the tuple reading frame (or even in 3'-5' orientation because it is read from the second strand), the effects of which would tend to obscure any contribution to the results of protein reading frame. However, it is perhaps more important to recall that estimates for the non-protein coding proportion of the genome range from around 17% for yeast to as high as 98% for primates, and the results that we observe are qualitatively identical for all 10 genomes considered individually or in aggregate, i.e. as CoHSI predicts, the proportion of non-protein coding sequence in a genome has no effect on the distribution of tuples.

Species         Most common           Least common
Ciona           ATATATAT (7191)       GACGGGCT (7)
                TTTTTTTT (7156)       CCCCCGGA (8)
                AAAAAAAA (7062)       CCTAGCCG (8)
                TATATATA (7014)       CGGGGCGC (8)
Cyanobacteria   GCGATCGC (509)        AAAACGAC (1)
                CGATCGCC (321)        AAAACGCA (1)
                GGCGATCG (284)        AAAAGAAC (1)
                GATCGCCC (147)        AAACAAGG (1)
Fruit Fly       TTTTTTTT (16876)      CCTAGGGG (24)
                AAAAAAAA (16803)      CGCTAGGG (30)
                ATATATAT (6853)       GGTACCCG (30)
                TATATATA (6137)       CCCCTACG (32)
Gorilla         TTTTTTTT (559235)     CGCGTACG (29)
                AAAAAAAA (550006)     CGTCGACG (30)
                ATTCCATT (146972)     GTCGATCG (31)
                TTCCATTC (143536)     CGTACGCG (34)
Human           TTTTTTTT (573882)     CGTACGCG (22)
                AAAAAAAA (570162)     CGCGTACG (25)
                ATATATAT (117335)     TCGCGTCG (31)
                TATATATA (111608)     CGCGTCGA (34)
Mushroom        TTTTTTTT (1115)       GGGCCCCT (4)
                AAAAAAAA (936)        CCCCTAAG (7)
                GAAGAAGA (444)        GCCCCTAG (7)
                TCTTCTTC (419)        GGGGCCCG (7)
Nematode        TTTTTTTT (18378)      CGCCCGGG (8)
                AAAAAAAA (18372)      GGGCCGCT (8)
                ATTTTTTT (9642)       GGGGCCCT (8)
                AAAAAAAT (9621)       CTAAGGGG (9)
Rice            TATATATA (38933)      CGCGCTTA (63)
                ATATATAT (38420)      GCGTTACG (65)
                AAAAAAA (30448)       CGTAACGC (67)
                TTTTTTT (30367)       CGCGTTAC (79)
Thale Cress     AAAAAAAA (16120)      AAACKAGA (1)
                TTTTTTTT (15921)      AAAGMMWC (1)
                ATATATAT (7491)       AAAMCCTA (1)
                TATATATA (7199)       AACAWWWT (1)
Yeast           AAAAAAAA (1172)       AACACGCG (1)
                TTTTTTTT (1099)       AACCGCGC (1)
                ATATATAT (536)        ACACGCGT (1)
                TATATATA (506)        ACACGGTC (1)

Table 2: The four most and least common 8-tuples in each species, with their frequencies of occurrence in parentheses. We note that the 4 least frequent motifs in Thale Cress have all arisen artefactually from ambiguous sequence reads.

[Table 3 columns: Rank, Motif, Frequency]

[Figure 6 plot: tuple occurrences against rank on logarithmic axes, with one curve each for offset = 0, 1, 2 and 3.]

Figure 6: The occurrence rates of 8-tuples in the Human genome for offsets of 0, 1, 2 and 3. The 4 curves overlie one another almost exactly. Note that this graph has been decimated to reduce its size. Only every 10th point is shown for offsets = 1, 2 and 3 after the first 100 points. Data are plotted as the complementary cumulative distribution function.
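The offset test described above can be illustrated with a minimal sketch. It is not the code used for the analyses in this paper; it assumes that tuples are extracted as consecutive, non-overlapping windows, and the function names `count_tuples` and `rank_ordered` and the toy sequence are hypothetical. It shows that shifting the reading frame changes which tuples are counted, while the comparison relevant to CoHSI is made on the rank-ordered frequencies only.

```python
from collections import Counter


def count_tuples(sequence: str, n: int, offset: int = 0) -> Counter:
    """Count non-overlapping n-tuples, starting the reading frame at `offset`.

    Assumption: tuples are consecutive windows of length n; a trailing
    fragment shorter than n is ignored.
    """
    counts = Counter()
    for start in range(offset, len(sequence) - n + 1, n):
        counts[sequence[start:start + n]] += 1
    return counts


def rank_ordered(counts: Counter) -> list[int]:
    """Return the frequencies sorted from most to least common (rank order)."""
    return sorted(counts.values(), reverse=True)


if __name__ == "__main__":
    # Toy sequence, purely illustrative; a real genome would be read from file.
    toy = "ATATATATGCGCATATTTTTAAAAATATGCGC" * 50
    for offset in range(4):
        ranked = rank_ordered(count_tuples(toy, 2, offset))
        print(offset, ranked[:4])
```

On short sequences the rank-ordered lists can differ between offsets; the insensitivity reported above is a property of large assemblies.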

Proof of Chargaff's Second Parity Rule

It can be noted that the analysis of the 20 most frequent motifs that occur in all 10 genomes combined (Table 3) includes 8 pairs of motifs that are reverse complements of one another, and 3 motifs that are reverse complements of themselves (dyadic repeats). We extended this analysis to the top 50 most common motifs in all 10 genomes combined (data not shown) and observed that 44 of these motifs form 22 reverse complementary pairs, in addition to the 3 above-noted motifs that are dyadic repeats (ATATATAT, TATATATA and TTTTAAAA). This result was unexpected: it suggests that Chargaff's Second Parity Rule might extend beyond a single genome to larger aggregates of genomes. We calculated the frequency ratios for each of these top 22 pairs of motifs in the 10 genomes combined with the following result. Seven of these pairs showed a ratio of 1.00; ten pairs showed a ratio of 1.01; four pairs showed a ratio of 1.02 and one pair had a ratio of 1.07. Thus these 8-tuple motifs show the clear hallmarks of Chargaff's Second Parity Rule: reverse complementary sequences and almost identical frequency of occurrence on a single strand of the DNA. Although our analysis was less than fully comprehensive of all the 8-tuple motifs (a daunting task), the fact that the examined motifs are distributed across multiple genomes (Table 3) already suggests that Chargaff's Second Parity Rule may embrace aspects of the global properties of genomes. This prompted us to examine if one of the predictions of CoHSI (which as we have shown above constrains the global properties of genomes) might include Chargaff's Second Parity Rule. The results of this investigation are outlined below.

Chargaff's rules are empirical and refer to the relative frequencies of A, T, G, and C bases in DNA, which we denote as %A, %T, %C and %G. The first rule holds that %A = %T and %G = %C globally in the double-stranded DNA. This observation, which predated Watson and Crick's ground-breaking discovery, helped guide them in uncovering the structure of the double helix. Once the structure of the DNA double helix is known, the first of Chargaff's rules is self-evident.

However, the second empirical rule holds that %A = %T and %G = %C individually on each strand of a double-stranded DNA molecule. It is observed that it does not apply to single-stranded DNA or any RNA, or relatively short DNA sequences (see https://en.wikipedia.org/wiki/Chargaff%27s_rules). Our work with CoHSI allows us to present a proof of Chargaff's Second Parity Rule.

We suppose that there are four bases W, X, Y and Z with complementarity rules $W \rightleftharpoons X$ and $Y \rightleftharpoons Z$.

In any assembly, the CoHSI Homogeneous equation states that their frequencies will obey a power-law. Considering strand 1 of a double-helix pair, the frequency of occurrence of 4 bases in 2 complementary pairs can be written down in six unique ways:

$$\%W \ge \%X \ge \%Y \ge \%Z \tag{4}$$
$$\%W \ge \%Y \ge \%X \ge \%Z \tag{5}$$
$$\%W \ge \%Y \ge \%Z \ge \%X \tag{6}$$
$$\%X \ge \%W \ge \%Y \ge \%Z \tag{7}$$
$$\%X \ge \%Y \ge \%W \ge \%Z \tag{8}$$
$$\%X \ge \%Y \ge \%Z \ge \%W \tag{9}$$

Consider (4). If we apply complementarity to this, we get for strand 2

$$\%X \ge \%W \ge \%Z \ge \%Y \tag{10}$$

However, the CoHSI homogeneous law applies to any significantly sized assembly, so strand 2 must have the same distribution as strand 1:

$$\%W \ge \%X \ge \%Y \ge \%Z \tag{11}$$

Combining (10) and (11), the only solution for strand 2 (and therefore strand 1 by symmetry) is

$$\%W = \%X \ge \%Y = \%Z \tag{12}$$

So for (4), CoHSI and complementarity jointly imply that four bases will appear in one strand in two pairs, with each pair separated in frequency by a power-law (12).

Consider now (5). If we apply complementarity to this, we get for strand 2

$$\%X \ge \%Z \ge \%W \ge \%Y \tag{13}$$

However, applying the CoHSI homogeneous law, again strand 2 must have the same distribution as strand 1:

$$\%W \ge \%Y \ge \%X \ge \%Z \tag{14}$$

Combining (13) and (14), the only solution for strand 2 (and therefore strand 1 by symmetry) is then

$$\%W = \%X = \%Y = \%Z \tag{15}$$

Continuing like this, we find that (4) and (7) allow a two complementary pair solution, whereas (5), (6), (8) and (9) only allow a solution where all four bases occur with the same frequency, but this breaks CoHSI since they must distribute as a power-law, implying that the only solution is two complementary pairs on each strand as represented by (12).

We note:

1. In the genome, these pairs are W = T, X = A and Y = C, Z = G.
2. For haploid strands, to which no complementarity applies, (4) must hold but not (10), as paired bases need not occur. The CoHSI power-law will still hold, however, which explains why single-stranded RNA, for example, does not obey Chargaff's Second Parity Rule.
3. For smaller assemblies, CoHSI is an asymptotic approximation; consequently we can expect this proof to break down for haploid strands and also for smaller diploid strands.
4. Finally, CoHSI knows nothing of tokens, so it cannot say that any particular complementary pair will be jointly dominant over the other complementary pair. All it can say is that if there are four bases in a diploid strand with complementarity in force, they will manifest themselves as two pairs of complementary bases with a pair ratio obeying a power-law (there are only two points but the slope will be essentially the same as for the higher tuples, see Fig 2), or as four identically-occurring bases. The latter does not conform to a power law relationship and is thus ruled out by CoHSI.

We close this section by showing in Fig. 7 the clear two-pairs structure for the 1-tuples in all species tested.

Extended Chargaff-like symmetries
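As a sketch of how the parity between a motif and its reverse complement might be checked for tuples of any length, the following computes the frequency ratio for each reverse-complementary pair in a table of counts. The helper names `reverse_complement` and `parity_ratios` are hypothetical and this is not the code used for the analyses in this paper; the example counts are the human 8-tuple frequencies from Table 2.

```python
from collections import Counter

# Watson-Crick complements; ambiguity codes such as K, M or W are left unchanged.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(motif: str) -> str:
    """Return the reverse complement of a motif, e.g. ATTCCATT -> AATGGAAT."""
    return motif.translate(COMPLEMENT)[::-1]


def parity_ratios(counts: Counter) -> dict[tuple[str, str], float]:
    """Ratio (larger count / smaller count) for each reverse-complementary pair.

    Self-complementary motifs (dyadic repeats) are skipped, as are motifs
    whose partner does not appear in `counts`. Counts are assumed positive.
    """
    ratios = {}
    for motif, freq in counts.items():
        partner = reverse_complement(motif)
        # `motif < partner` visits each pair once and excludes dyadic repeats.
        if motif < partner and partner in counts:
            other = counts[partner]
            ratios[(motif, partner)] = max(freq, other) / min(freq, other)
    return ratios


if __name__ == "__main__":
    # Human 8-tuple frequencies taken from Table 2.
    human = Counter({
        "TTTTTTTT": 573882,
        "AAAAAAAA": 570162,
        "ATATATAT": 117335,
        "TATATATA": 111608,
    })
    for pair, ratio in parity_ratios(human).items():
        print(pair, round(ratio, 2))
```

In this example only poly(A) and poly(T) form a pair, since ATATATAT and TATATATA are each their own reverse complement.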

Chargaff's Second Parity Rule was originally framed in terms of the frequencies of the individual four bases, and the proof presented above can be extended to the symmetries of higher-tuple motifs in diploid genomes subject to complementarity.

The nature of the above proof shows that if an n-tuple and its complement are the most commonly occurring in a haploid strand, then CoHSI implies that their frequency will be identical within statistical uncertainty as they will isolate themselves as a pairing, as was the case with 1-tuples in (12). This is abundantly obvious in Table 2 with poly(A) and its complement poly(T). It can also be seen in all tuples from 1 to 8 in Fig 4 and in Table 3.

More intriguing is that if two n-tuples and their complements happen to appear as the four highest frequencies of occurrence, then CoHSI implies that they can isolate themselves as a 4-group which have the same frequency within statistical uncertainty. This is actually visible in the 8-tuple data of Table 2 for Ciona.

Following the proof above for 1-tuples, this line of reasoning with respect to Chargaff's Second Parity Rule and the global properties of motifs within and across genomes could clearly be taken much further. However, when the sample size is just 10 genomes there is sufficient statistical noise to make it challenging to undertake such a comprehensive study. Regardless of this, we consider that the theory and examples presented here clearly point to both an explanation of Chargaff's Second Parity Rule and to its wider application in understanding the structure of genomes.

[Figure 7 plot: tuple occurrences against rank, with one curve each for Ciona, Cyanobacteria, Fruitfly, Gorilla, Human, Mushroom, Nematode, Rice, Thalecress and Yeast.]

Figure 7: The occurrence rates of 1-tuples in all species. The clear two-pair structure whereby %A = %T and %G = %C is visible in each species although the differentiation differs.

Testing Predictions using Computer Software

Machine-readable computer software is in hexadecimal (base 16) rather than base 4 as in the genome, so a 3-tuple in computer software has as many possibilities as a 6-tuple in the genome. Fig. 8 shows the results of tuple analysis for 2-, 3- and 4-tuples using the open source planetarium software kstars. The slopes are steeper than those for any of the genomes shown but are still self-similar and with linearity over 2-3 decades depending on the tuple. These analyses of machine code show the same droop in the tail of the distributions as was seen with the genome distributions and which, as discussed above, is integral to the CoHSI prediction.
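The equivalence between hexadecimal 3-tuples and genomic 6-tuples follows from $16^3 = 4^6 = 4096$. A minimal sketch of tuple extraction from machine code is given below, assuming that the hexadecimal representation is simply two hex characters per byte and that tuples are non-overlapping windows of hex characters. The function name `hex_tuple_counts` is hypothetical, this is not the code used for the analyses in this paper, and the sample bytes are illustrative rather than taken from any of the programs analysed.

```python
from collections import Counter


def hex_tuple_counts(data: bytes, n: int, offset: int = 0) -> Counter:
    """Count non-overlapping n-tuples of hexadecimal characters in `data`.

    Assumption: each byte contributes two lowercase hex characters,
    e.g. b"\x43\x00" becomes "4300"; `offset` shifts the reading window.
    """
    text = data.hex()
    return Counter(text[i:i + n] for i in range(offset, len(text) - n + 1, n))


if __name__ == "__main__":
    # A 3-tuple over 16 symbols has as many possibilities as a 6-tuple over 4.
    print(16 ** 3, 4 ** 6)

    # Illustrative bytes only; for a real program one could instead use
    # pathlib.Path(some_path).read_bytes(), where some_path is the binary.
    sample = bytes.fromhex("00000000ffffffff43001dfa")
    for n in (2, 3, 4):
        print(n, hex_tuple_counts(sample, n).most_common(3))
```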

[Figure 8 plot: frequency against rank-ordered hex tuples, with one curve each for 2-tuples, 3-tuples and 4-tuples.]

Figure 8: The planetarium software kstars for 2, 3 and 4-tuples. Data are plotted as the complementary cumulative distribution function.

In Fig. 9, we show 4-tuples for three different programs, the planetarium software kstars used above, supplemented with the embedded version of the MySQL database software and a version of the well known image editing software, gimp. These programs have a very different function but still exhibit the strongly self-similar near linear behaviour with a droop in the lowest-occupied ranks that we have already consistently seen. Note that these programs are much smaller than the genomes used earlier, giving a little more statistical fluctuation, and that the slope of the power-law is rather steeper. The significance of this is unclear, as in CoHSI models the power-slope is an undetermined Lagrange parameter.

[Figure 9 plot: frequency against rank-ordered hex 4-tuples, with one curve each for kstars, mysql embedded and gimp-2.8.]

Figure 9: The occurrence rates of 4-tuples of machine-readable code in three computer programs, kstars, MySQL and gimp. Data are plotted as the complementary cumulative distribution function.

R lm() reports that the associated p-value matching the power-law linearity for gimp in the 4-tuple ccdf of Fig. 9 is < (2 . × − over the 3-decade range − . , with an adjusted R-squared value of . . The slope is − . ± . , approximately 3 times steeper than the equivalent slope for the genomic tuples.

Finally, in Fig. 10, when we combine together all 5,307 binary programs in a typical Ubuntu 18.04 LTS distribution, totalling some 670MB of binary hexadecimal code combined into a single system, the underlying CoHSI power-law is abundantly visible.
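For readers without R, a comparable straight-line fit can be sketched with an ordinary least-squares regression of log frequency on log rank. This is an illustration only: it fits the rank-ordered frequencies directly rather than the ccdf used in the figures, it returns the plain rather than the adjusted R-squared, it does not compute a p-value, and the function name `loglog_fit` is hypothetical.

```python
import math


def loglog_fit(frequencies: list[float]) -> tuple[float, float, float]:
    """Least-squares fit of log10(frequency) against log10(rank).

    Frequencies are sorted into rank order first; non-positive values are
    dropped. Returns (slope, intercept, r_squared).
    """
    ranked = sorted((f for f in frequencies if f > 0), reverse=True)
    if len(ranked) < 2:
        raise ValueError("need at least two positive frequencies")
    xs = [math.log10(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log10(f) for f in ranked]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    syy = sum((y - mean_y) ** 2 for y in ys)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # With constant frequencies the line fits exactly, so report 1.0.
    r_squared = (sxy * sxy) / (sxx * syy) if syy > 0 else 1.0
    return slope, intercept, r_squared


if __name__ == "__main__":
    # Synthetic frequencies following an exact power law, for illustration.
    synthetic = [1000.0 / rank ** 2 for rank in range(1, 51)]
    print(loglog_fit(synthetic))
```

In practice the fit would be restricted to the linear range of the distribution, since the drooping tail would otherwise bias the slope.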

R lm() reports that the associated p-value matching the power-law linearity in the 4-tuple ccdf of Fig. 10 is < (2 . × − over the 4-decade range − . , with an adjusted R-squared value of . . The slope is − . ± . .

Sensitivity analysis

A sensitivity analysis of the machine code utilizing offsets of the tuple reading window revealed no significant differences, just as in the genome analysis earlier, and is therefore not shown.

Conclusions

The results presented here are an integral part of an ongoing project to test as exhaustively as possible the predictions of a novel conservation principle [Hat14, HW15, HW17] that has been postulated to constrain the structure of all qualifying discrete systems. This principle (the Conservation of Hartley-Shannon Information, or CoHSI) considers discrete systems constrained not only by their total size, but also by their total information content. By embedding Hartley-Shannon information (in which the symbols are free of meaning) in a statistical mechanical framework, two equations emerge that predict, respectively, the behaviour of the two types of discrete system that have been discovered so far [HW17]. The first of these systems is termed heterogeneous, and the smallest pieces of the system (tokens, symbols or signs) are assembled sequentially in distinguishable order into larger structures (termed components). Three examples of such heterogeneous systems are proteomes (where amino acids are the tokens and proteins are the components), natural languages (where letters are the tokens and words are the components) and computer software, in which the programming tokens are assembled into routines or sub-routines, which are the components. The second type of discrete system is termed homogeneous, and such systems are characterized simply by the frequency of occurrence of their constituent pieces, as delineated by some consistent categorization, without regard to any distinguishable order. An example of such a system is also provided by natural language texts, when the frequency of each word in a text is counted. Amongst the predictions of CoHSI for heterogeneous systems are scale independence, an overwhelmingly likely (canonical) distribution of component lengths which causes average length to be highly conserved, and the inevitability of very long components, due to a high fidelity power-law tail. The predictions of CoHSI for homogeneous systems reduce simply to a power law relationship with a natural drooping tail for poorly represented categories in the frequencies of the tokens, i.e. Zipf's Law (of which CoHSI provides an alternative proof).

[Figure 10 plot: frequency against rank-ordered hex 4-tuples for all programs combined.]

Figure 10: The occurrence rates of 4-tuples of machine-readable (hexadecimal) code in 5,307 binary programs located in /usr/bin and /bin of a standard Ubuntu 18.04 LTS distribution treated as a single combined system. Data are plotted as the complementary cumulative distribution function.

In previous work [HW17] we have shown that proteomes and computer software (as programming languages) both conform closely to the predictions of CoHSI for heterogeneous systems. However, proteomes and software are both higher-level representations of more fundamental symbol-based systems.
In the case of proteomes this is the genome (DNA) and for software source code it is the machine-readable code. While we do not wish to pursue the analogy of the genome as computer software, DNA and machine code are both simply long strings of symbols (4 bases in DNA, and the 16 symbols of machine-readable code). Thus DNA and machine-readable code can both be analyzed as homogeneous discrete systems in terms of the frequency of occurrence of tuples, or short motifs of the symbols of which they are composed.

The predictions of CoHSI for the tuple frequency of DNA and machine-readable code were examined using the sequenced genomes of 10 organisms, ranging from bacteria to human, and the binary programs of an Ubuntu 18.04 LTS distribution. The results were in all cases emphatically as predicted by CoHSI, and for DNA were also independent of offset (reading frame), the proportion of non-protein coding DNA in the genome, or indeed the pooling of the 10 genomes. For binary computer programs, the results are independent of offset and we made no distinction between control sections of the code and data sections.

Discrete systems and their associated Zipfian properties have been subjected to extensive investigation and several studies have identified similar tuple frequency behaviour to that described here. In this paper, we have restricted our discussion of CoHSI to applications in the genome and in machine code; however, we mentioned earlier work by [SB10] on text analysis of 4-letter words in the works of Jane Austen and others, and this contains interesting and relevant insights. These authors are particularly interested in semantics and distinguish 4-letter words into several groups: a) those which appear in their text collections, b) those which are genuine English words but are not used in their text collections and c) those words which comprise any of the ways of arranging the letters of the alphabet for a 4-letter word.
Of these, c) dwarfs a) and b) in magnitude but, as they show (their Fig. 3), the populations with semantic meaning, a) and a)+b), obey the power-law with a droop, but the amount of droop is inversely proportional to the size of the population of words. When all patterns are included, the result is a power-law with a slight droop as predicted by CoHSI. Of course, the systems we considered here are considerably larger than the corpus considered by [SB10].

The results of [SB10] support our interpretation throughout this and previous papers [HW18b] that the CoHSI distribution represents in some sense an equilibrium distribution to which all qualifying discrete systems are drawn, and that departures from it represent semantic pressure through local mechanisms such as natural selection or human volition.

The power law behaviour of discrete systems was originally observed empirically, and Zipf's Law remains controversial with respect to its underlying causes [Pia14]. In considering the large literature on discrete systems and Zipfian phenomena, it is clear that while many studies have identified the signature of behaviors constrained by CoHSI, the authors have explained the results by an invocation of a wide variety of models [New06, Pia14]; in contrast, we interpret these studies as supporting the conclusion we advance here for the universal operation of the constraints imposed by CoHSI, even given its extreme parsimony as a theory.
In this context it is important to note that CoHSI is not an assumed model we are trying to fit; it is a predictive theory we are trying to falsify.

In discrete systems that conform to Zipf's Law one might predict that a power law relationship should prevail across the full data set from a qualifying system; however, this is rarely if ever observed in natural or experimental data sets and the results that we present here are no exception. We see deviations both at the highest tuple frequencies (typically higher than predicted frequencies) and also at the lowest tuple frequencies, where all of the curves show a pronounced drooping tail. Considering first the highest frequency ranks, two points can be made. First, CoHSI is not a straitjacket, and although it will always guide a system towards an equilibrium state, local departures from the equilibrium can be forced through pressures such as natural selection (in biological systems), semantic pressures (in language) and syntactic and semantic conventions (as in computer programs). If we consider the overpopulation of the highest frequency ranks in the genome, it is noticeable that collectively the most frequent 8-tuple in the eukaryotic species (Table 2) is $(A)_8$ or $(T)_8$. Poly(A) tracts are known to be associated with the Alu repetitive DNA element in primates [LP84] and it is not surprising to find that poly(A) and poly(T) tracts are found in eukaryotic genomes that contain significant numbers of transposable elements that have reverse transcribed intermediates. Transposable elements constitute large portions of eukaryotic genomes, and their continued expansion [SRMA16] constitutes, we propose, an evolutionary pressure against which CoHSI is unable to re-establish equilibrium.
In a similar way, the over-representation of certain tuples in machine-readable code stems from normal practice in operating system software of loading binary computer programs into memory which has been flash-filled with tracts of 00000000 or ffffffff, many of which remain when the program is dumped for analysis as we have done here.

Considering now the drooping tails (i.e. the under-population of the lowest frequency ranks), although we cannot exclude the influence of stochastic effects we note that the mathematics of the CoHSI equation implies that such a droop will occur in the lowest frequency ranks, and this was confirmed by exact calculation.

The results reported here and in prior studies both by ourselves and others, as discussed above, provide a wealth of data which support the notion that CoHSI constrains the properties of discrete systems in the manner described. Indeed, although CoHSI is not a straitjacket, we have been unable to falsify any of its predictions thus far, even with systems of such radically different provenance as the genome and executable machine code.

Completing the circle

In this paper, we have examined at the lowest accessible level of categorization two discrete systems of distinct provenance (bases in the genome and machine-readable hexadecimal characters in computer software) and demonstrated that they have the same CoHSI-predicted global properties. Although these systems appear to be of entirely different provenance, arising from the evolution of life (genomes) and from human cognitive processes (computer software), they are inextricably linked at two profound levels. At one level, as illustrated in Fig. 11, the genome is transcribed and translated into proteins, proteins are one expression of the living systems that include Homo sapiens, a species that has developed sufficient intelligence to be able to design silicon-based machines to analyze its own genetic code and to design novel organisms (including humans, [NAoSM17]) based on customized and edited DNA sequences, thus completing an iterative cycle that promises an interesting future [Har16]. The second level at which genomes and computer software are linked has to do with CoHSI itself. CoHSI is a mechanism-free and token-agnostic ergodic theory, and there is no reason to suppose that the predictions of CoHSI will not constrain all discrete systems, of which the universe is the largest instance [HW17] whereas genomes and machine-code are rather smaller examples.

[Figure 11 diagram: example fragments of the three linked systems, a protein sequence (..KLFSPKESF..), a DNA sequence (..ACTGAGG..) and machine code (.. 43 001d fa56...).]

Figure 11: Showing the remarkable relationship between 3 separate but intimately linked CoHSI systems; the known set of proteins ([HW15]), a heterogeneous system; and the genome and computer software which are homogeneous systems as analyzed here.

Appendix

Here we present the remarkable visual similarity between the 1-8 tuples for the haploid version of each of the remaining genomes analysed in this study, emphatically illustrating the unseen hand of CoHSI in their development. These are shown alphabetically by species as Figs 12a - 12f and Figs 13a - 13c.

[Figure 12 plots, panels A to F: tuple occurrences against rank, with one curve each for 1-tuples to 8-tuples.]

Figure 12: n-tuple frequencies for (A) Ciona (B) Cyanobacteria (C) Fruitfly (D) Gorilla (E) Mushroom and (F) Nematode. Data are plotted as the complementary cumulative distribution function.

These figures have many interesting features, quite apart from their extraordinary similarity; for example, the entire Gorilla distribution, Fig. 12d, is identical to that of the human, Fig. 1, including the fine structure detail in the tail. This is also clearly visible when they are compared in Fig. 2.

Acknowledgements

We owe a continuing debt to countless volunteers who provide high quality open source software (we used Linux and the redoubtable perl), and also the scientists who enabled open access to the datasets. Without open source and open data, science would unquestionably descend into a new Dark Age overseen by tribal influences. In various discussions, Gillian Libretto provided the insights which led to Figure 11 and we are indebted also to Bob Chapman for numerous penetrating insights and relevant background knowledge of which we were generally ignorant.

Sub-images in Fig. 11 are displayed under the Creative Commons Licence of Wikipedia.

Correspondence: Correspondence and requests for materials should be addressed to Les Hatton (email: [email protected]).

[Figure 13 plots, panels A to C: tuple occurrences against rank, with one curve each for 1-tuples to 8-tuples.]

Figure 13: n-tuple frequencies for (A) Rice (B) Thale Cress (C) Yeast. Data are plotted as the complementary cumulative distribution function.

Authors' contributions

LH performed the analyses, LH and GW developed the arguments, discussed the results andcontributed to the text of the manuscript.

Competing interests

The authors declare no competing ﬁnancial interests.

Funding

This work was funded by the authors.

References

[AB06] Guenter Albrecht-Buehler. Asymptotically increasing compliance of genomes with Chargaff's Second Parity Rules through inversions and inverted transpositions. Proceedings of the National Academy of Sciences, 103(47):17828–17833, 2006.

[BOHH15] Maria Assunta Biscotti, Ettore Olmo, and J. S. (Pat) Heslop-Harrison. Repetitive DNA in eukaryotic genomes. Chromosome Research, 23(3):415–420, Sep 2015.

[Bre12] Sydney Brenner. Life's code script. Nature, 482:461, Feb 2012.

[CK92] Jon F. Claerbout and Martin Karrenbach. Electronic documents give reproducibility a new meaning. In Proc. 62nd Ann. Int. Meeting, pages 601–604. Soc. of Exploration Geophysics, 1992.

[CKL+18] Anne Condon, Helene Kirchner, Damien Lariviere, Wallace Marshall, Vincent Noireaux, Tsvi Tlusty, and Eric Fourmentin. Will biologists become computer scientists? EMBO reports, 19(9):e46628, 2018.

[DB11] Aimee M. Deaton and Adrian Bird. CpG islands and the regulation of transcription. Genes and Dev., 25:1010–1022, 2011.

[DMR+09] David L. Donoho, Arian Maleki, Inam Ur Rahman, Morteza Shahram, and Victoria Stodden. Reproducible research in computational harmonic analysis. Computing in Science and Engineering, 8(18), 2009.

[Doo13] W. Ford Doolittle. Is junk DNA bunk? a critique of encode. Proceedings of the National Academy of Sciences, 110(14):5294–5300, 2013.

[GWH11] Xiaocong Gan, Dahui Wang, and Zhangang Han. A growth model that generates an n-tuple Zipf law. Physica A Statistical and Theoretical Physics, 390:792–800, 03 2011.

[Har16] Yuval Noah Harari. Homo deus: a brief history of tomorrow. Harvill Secker, 2016.

[Hat14] Les Hatton. Conservation of Information: Software's Hidden Clockwork. IEEE Transactions on Software Engineering, 40(5):450–460, May 2014. 10.1109/TSE.2014.2316158.

[HR94] Les Hatton and Andy Roberts. How accurate is scientific software? IEEE Transactions on Software Engineering, 20(10), 1994.

[HW15] Les Hatton and Greg Warr. Protein Structure and Evolution: Are They Constrained Globally by a Principle Derived from Information Theory? PLOS ONE, 2015. doi:10.1371/journal.pone.0125663.

[HW16] Les Hatton and Greg Warr. Full Computational Reproducibility in Biological Science: Methods, Software and a Case Study in Protein Biology. ArXiv, August 2016. http://arxiv.org/abs/1608.06897 [q-bio.QM].

[HW17] Les Hatton and Greg Warr. Information theory and the length distribution of all discrete systems. arXiv, Sep 2017. http://arxiv.org/pdf/1709.01712 [q-bio.OT].

[HW18a] Les Hatton and Greg Warr. CoHSI I; Detailed properties of the Canonical Distribution for Discrete Systems such as the Proteome. arXiv, Jun 2018. https://arxiv.org/pdf/1806.08785 [q-bio.OT].

[HW18b] Les Hatton and Greg Warr. CoHSI II; The average length of proteins, evolutionary pressure and eukaryotic fine structure. arXiv, Jul 2018. https://arxiv.org/pdf/1807.11076 [q-bio.OT].

[HW18c] Les Hatton and Greg Warr. CoHSI IV; Unifying Horizontal and Vertical gene transfer - is mechanism irrelevant? arXiv, Nov 2018. https://arxiv.org/pdf/1811.02526 [q-bio.OT].

[IHGC12] Darrell C. Ince, Leslie Hatton, and John Graham-Cumming. The case for open program code. Nature, 482:485–488, February 2012. doi:10.1038/nature10836.

[KKR+16] Mahima Kaushik, Shikha Kaushik, Kapil Roy, Anju Singh, Swati Mahendru, Mohan Kumar, Swati Chaudhary, Saami Ahmed, and Shrikant Kukreti. A bouquet of DNA structures: Emerging diversity. Biochemistry and Biophysics Reports, 5:388–395, 2016.

[LD14] Xiu-Qing Li and Donglei Du. Variation, Evolution, and Correlation Analysis of C+G Content and Genome or Chromosome Size in Different Kingdoms and Phyla. PLOS ONE, 9(2):1–18, 02 2014.

[LP84] Arthur J. Lustig and Thomas D. Petes. Long poly(A) tracts in the human genome are associated with the Alu family of repeated elements. Journal of Molecular Biology, 180(3):753–759, 1984.

[MC10] Scott Mann and Yi-Ping Phoebe Chen. Bacterial genomic G+C composition-eliciting environmental adaptation. Genomics, 95(1):7–15, 2010.

[NAoSM17] National Academies of Sciences, Engineering, and Medicine. Human Genome Editing: Science, Ethics, and Governance. Washington, DC, 2017.

[New06] M. E. J. Newman. Power laws, Pareto distributions and Zipf's law. Contemporary Physics, 46:323–351, 2006.

[Pia14] Steven T. Piantadosi. Zipf's word frequency law in natural language: A critical review and future directions. Psychon Bull Rev, 21:1112–1130, 10 2014.

[Pop59] Karl Popper. The Logic of Scientific Discovery. Routledge, 1959.

[Ram88] Srinivasa Ramanujan. The lost notebook and other unpublished papers. Springer-Verlag, 1988. ISBN 978-3-540-18726-4.

[SB10] Greg J. Stephens and William Bialek. Statistical mechanics of letters in words. Phys Review E, 81(6):219–229, 2010.

[SRMA16] Michael Sheinman, Anna Ramisch, Florian Massip, and Peter Arndt. Evolutionary dynamics of selfish DNA explains the abundance distribution of genomic subsequences. Scientific Reports, 2016.

[THGRI11] Maud I. Tenaillon, Matthew B. Hufford, Brandon S. Gaut, and Jeffrey Ross-Ibarra. Genome Size and Transposable Element Content as Determined by High-Throughput Sequencing in Maize and Zea luxurians. Genome Biology and Evolution, 3:219–229, 2011.

[WB12] Guenther Witzany and Frantisek Baluska. Life's code script does not code itself. the machine metaphor for living organisms is outdated. EMBO reports, 13, 11 2012.

[YO89] Tetsuya Yomo and Susumu Ohno. Concordant evolution of coding and noncoding regions of DNA made possible by the universal rule of TA/CG deficiency-TG/CT excess. Proceedings of the National Academy of Sciences, 86(21):8452–8456, 1989.

[Zio82] Anton M. Ziolkowski. Further Thoughts on Popperian Geophysics–the Example of Deconvolution. Geophysical Prospecting, 30:p.155–165, 1982.

[Zip35] George K. Zipf.