CoHSI II; The average length of proteins, evolutionary pressure and eukaryotic fine structure

Abstract

The CoHSI (Conservation of Hartley-Shannon Information) distribution is at the heart of a wide class of discrete systems, defining (amongst other properties) the length distribution of their components. Discrete systems such as the known proteome, computer software and texts are all known to fit this distribution accurately. In a previous paper, we explored the properties of this distribution in detail. Here we will use these properties to show why the average length of components in general and proteins in particular is highly conserved, howsoever measured, demonstrating this on various aggregations of proteins taken from the UniProt database. We will go on to define departures from this equilibrium state, identifying fine structure in the average length of eukaryotic proteins that results from evolutionary processes.


arXiv [q-bio.OT]

Les Hatton, Gregory Warr

July 31, 2018


To support falsifiability in the Popperian sense, this paper is accompanied by a complete computational reproducibility suite including all software source code, data references and the various glue scripts necessary to reproduce each figure, table and statistical analysis and then regress local results against a gold standard embedded within the suite, to help build confidence in the theory and results we are reporting. This follows the methods broadly described by [IHGC12] and exemplified in a tutorial and case study [HW16]. These reproducibility suites are currently available at http://leshatton.org/ until a suitable public archive appears, to which they may be transferred.

In previous papers [HW18], we have derived and explored the properties of a differential equation which accurately models the global length distribution of components of discrete systems. Examples would include i) the lengths of all known proteins, either for a species (the proteome) or indeed for all organisms (the pan-proteome) and ii) the lengths of functions in computer programs written in any programming language. Throughout we will consider a discrete system as a set of components, each of which comprises a number of indivisible tokens or symbols in which order is significant, chosen from a unique alphabet of tokens. In such a system, the $i$th component is taken to have $t_i$ tokens and a unique alphabet of $a_i$ tokens. The total size of the system is $T = \sum_{i=1}^{M} t_i$ where $M$ is the number of components and is taken to be reasonably large ($M > \ldots$, say). This leads to the heterogeneous model of [HW17], which is directly applicable to the proteome.

With this nomenclature, the differential equation describing the length distribution as an implicit pdf $\sim a_i(t_i)$ is

$$\log t_i + \frac{1}{6}\left(\frac{1 + 8t_i + 24t_i^2}{\frac{1}{30} + t_i + 4t_i^2 + 8t_i^3}\right) = -\alpha - \beta\left(\frac{d}{dt_i}\log N(t_i, a_i; a_i)\right). \qquad (1)$$

Here $N(t_i, a_i; a_i)$ is a function describing the Hartley-Shannon information content of the $i$th component as described in detail in [HW18]. The parameters $\alpha, \beta$ are Lagrangian undetermined parameters. In other words the Statistical Mechanics methodology we used has nothing to say about their value - they are an unknown function of the discrete dataset being studied and we will discuss their relevance later.

Equation (1) arises naturally for systems which have the same measure of Hartley-Shannon Information for a fixed size in total number of symbols or tokens. It is both scale-independent (it does not depend on $T$ providing $T$ is reasonably large in the sense of Statistical Mechanics [GW01]) and token-agnostic (Hartley-Shannon information specifically avoids any associated meaning to the tokens [Har28, Sha48, Che63]).

In this paper, we focus on the implications of this for the average length of a component in such a system. In the proteome, this would be the average length of a protein in amino acids and in a computer function, this is the average length in the discrete tokens of the programming language in which it is written [HW17].

A typical solution of (1) is shown as Figure 1.

Figure 1: Illustrating a typical solution of the CoHSI equation described in [HW17]. (Axes: frequency of $t_i$ against $t_i$; curve labelled "CoHSI solution".)

This can be compared with actual length distributions such as those of collections of computer software and also the collection of all known sequences (the pan-proteome). Figure 2 shows a plot of the frequency of occurrence of different lengths of functions measured in programming language tokens for 80 million lines of open source software written in the programming language ISO C and Figure 3 shows the length of proteins measured in amino acids for the known proteome as defined by TrEMBL version 15-07 [TrE18]. As well as their striking visual similarity, they are mathematically very similar; both are characterized by a sharp unimodal peak with almost linear slopes, asymptoting to an extraordinarily accurate power-law [HW15, HW17].

Figure 2: Illustrating the length distribution of the functions measured in programming language symbols for 80 million lines of open source C. (Axes: components with value $t_i$ against $t_i$ in tokens.)

The fact that such a model exists and accurately models measured length distributions in such disparate systems means that we can throw light on some interesting questions relating to the average length of a protein (including what average even means for this distribution) in light of the continuous changes in protein sequences that the processes of evolution are acknowledged to generate. For the remainder of this paper, we will therefore focus on the proteins.

Figure 3: Illustrating the length distribution of the proteins measured in amino acids in TrEMBL version 15-07. (Axes: abundance of proteins in TrEMBL against protein length.)

Researchers have previously noted the highly conserved nature of the average protein length in different aggregations. For example, [WHL05] showed that there was a general tendency for protein lengths to be conserved across the eukaryotic domain, whilst noting that protein orthologs were of different average length across domains. [XCH+06] noted that in their collections, the average lengths of genes are highly conserved in both prokaryotes and eukaryotes although the average lengths in the two domains are rather different. Given the fact that CoHSI is a constraint that acts on the global properties (including the lengths) of proteins in all organisms, the variations in average length introduced above are of obvious interest. By utilizing the properties of the underlying CoHSI length distribution as described in [HW18], we will examine these observations; in particular we will introduce the notion of departure from the equilibrium or most likely state of average protein length as a measure of evolutionary divergence. Initially we will explore the concept of the "average length" of proteins in different aggregations of protein sequences taken from a recent version of the best-annotated database, SwissProt (version 18-02, https://uniprot.org; we have used various versions of SwissProt to demonstrate the robustness of our results).

The distribution of Figure 1 as exemplified in the real systems of Figures 2 and 3 has interesting properties. Although it is the solution of a differential equation, for components which are large compared with their unique alphabet of tokens (e.g. amino acids, programming language tokens ...), it asymptotes quickly to a power-law (typically from $t_i > \ldots$, although it depends on the size of the unique alphabet of tokens used by the system under consideration). In the real data of Figures 2 and 3 the power-law is astonishingly accurately produced (giving adjusted $R^2$ values in R of > 0.99 [HW17]). For smaller components, the distribution is characterized by a sharp unimodal peak with almost linear slopes.

The presence of the power-law means that the distribution is long-tailed and, along with the sharp unimodal peak at small $t_i$, leads to a distribution which is palpably right-skewed. This significantly complicates the meaning of the word average. Statisticians use the words location and spread to study distributions. Location is in a very general sense what laymen call the average. It is a description of "where" a distribution appears to be concentrated. Spread refers to how much the distribution spreads around this location. Hence when we use the word average, we have in mind the most likely value if presented with a member of the population we are considering.

All readers will be familiar with the normal distribution or bell curve, which is a symmetric unimodal distribution and therefore its most common value is in the middle at the peak. For such a distribution, the standard measures of average are the mean, median and mode. For a symmetric, unimodal distribution the values of these three measures are coincident, thus fitting in comfortably with our notion of average. However, for a skewed distribution such as shown in Figure 1, the mean, median and mode values are not coincident.
In the case of the proteome, this important distinction has not escaped some previous researchers [Zha00] but it is a point worth making when trying to understand what exactly is highly conserved.

By way of example and to show that mean, median and mode can differ substantially, if we compute each of these for the length distributions of the domains of life and viruses in TrEMBL release 15-07 of Figure 3 (data derived from https://uniprot.org), we get the values shown as Table 1.

Table 1: Measures of average length of proteins in the domains of life and viruses

Domain of Life    Mean    Median    Mode
Archaea            287       246     130
Bacteria           312       272     156
Eukaryota          435       350     379
Viruses            451       289     252

The effects of the skew are obvious and, as can be seen, the mean is probably the least satisfactory way of measuring the average. The median (in terms of the spread of its values in Table 1) is less sensitive to the skew than are the mean and the mode, and is similar across all domains of life (even for viruses) although with the highest value for the Eukaryota. While there are other statistically appropriate ways of dealing with this skew under the general banner of robust methods such as the trimmed mean (cf. [Tuk77]), whereby we remove some percentage of the population at both the low and the high end of the distribution, we will not consider these further here. The data shown in Table 1 give clear support for the observations previously made by others that conventional measures of the "average" are indeed well conserved in the proteins; however, here we can ascribe this property to the conservation principle (CoHSI) that constrains the nature of the underlying length distributions.

As noted above, Table 1 has been interpreted as suggesting a significant difference in the mean length found in prokaryota and eukaryota in [XCH+06]. In other words, this observation is robust with respect to the measures of average used.
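To make the separation of mean, median and mode on a right-skewed sample concrete, the following is a minimal Python sketch. The list `lengths` is a small hypothetical sample invented for illustration (it is not UniProt data), and `trimmed_mean` is an illustrative helper for the robust estimator mentioned above rather than anything taken from the reproducibility suite.

```python
import statistics


def trimmed_mean(values, fraction):
    """Mean after discarding `fraction` of the points from each end."""
    ordered = sorted(values)
    k = int(len(ordered) * fraction)  # number of points dropped per end
    if 2 * k >= len(ordered):
        raise ValueError("fraction too large for this sample")
    kept = ordered[k:len(ordered) - k]
    return statistics.mean(kept)


# Hypothetical right-skewed sample of protein lengths (amino acids).
lengths = [120, 130, 130, 130, 150, 180, 220, 300, 450, 900, 2500]

summary = {
    "mean": statistics.mean(lengths),
    "median": statistics.median(lengths),
    "mode": statistics.mode(lengths),
    "trimmed_mean_10pct": trimmed_mean(lengths, 0.10),
}

for name, value in summary.items():
    print(f"{name:>20}: {value:.1f}")
```

In this sample the long tail pulls the mean well above the median, which in turn sits above the mode; trimming a single point from each end already moves the mean a long way back towards the median.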

This allows us to use a useful graphical property of the mean.

In this section, we will explore how protein lengths vary across aggregations of protein data taken at different levels in the taxonomic hierarchy. Following the sage advice of Tukey [Tuk77], we will begin by doing this visually. One very useful device for this is to plot the total number of proteins $N$ against the total concatenated length $L$ of those proteins (in amino acids) for each species in an aggregation. If we use the mean $m$ as a measure of the average length of proteins in a species, then $L = mN$, and so plotting $L$ against $N$ gives a straight line of gradient $m$ if the average length is the same for all species in the plot.

Figure 4: A plot of the total length of all proteins in a species against the number of proteins in a species for the whole of the SwissProt dataset version 18-02. (Axes: total size in aa against number of proteins.)

Figure 4 illustrates this for the whole of the SwissProt annotated subset of TrEMBL, release v18-02, for all species with up to 1,000 proteins. Each point corresponds to a species and represents the total length of all the proteins in that species plotted against the total number of proteins for that species. Similar data are shown for SwissProt release 13-11 in [HW15]. As can be seen, the resulting distribution is highly linear, indicating that the mean length of proteins is indeed strongly conserved across the entire SwissProt release.
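The relationship $L = mN$ can be checked numerically with an ordinary least-squares fit of $L$ against $N$. The sketch below is a plain-Python stand-in for that kind of fit, written under the assumption that the regression includes an intercept; the function name `linear_fit` and the five data points are hypothetical and are not SwissProt values.

```python
def linear_fit(n_proteins, total_lengths):
    """Ordinary least squares of total length L on protein count N.

    Returns (slope, intercept, adjusted_r2). The slope estimates the
    mean protein length shared by the species in the aggregation.
    """
    n = len(n_proteins)
    if n < 3 or n != len(total_lengths):
        raise ValueError("need at least three matched (N, L) pairs")
    mean_x = sum(n_proteins) / n
    mean_y = sum(total_lengths) / n
    sxx = sum((x - mean_x) ** 2 for x in n_proteins)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(n_proteins, total_lengths))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2
                 for x, y in zip(n_proteins, total_lengths))
    ss_tot = sum((y - mean_y) ** 2 for y in total_lengths)
    r2 = 1.0 - ss_res / ss_tot
    # Adjusted R^2 for a model with one predictor plus an intercept.
    adjusted_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - 2)
    return slope, intercept, adjusted_r2


# Hypothetical species: protein counts and total lengths scattered
# around a common mean length of 350 amino acids.
counts = [10, 50, 200, 400, 1000]
totals = [350 * c + d for c, d in zip(counts, [30, -40, 100, -60, 20])]

slope, intercept, adjusted_r2 = linear_fit(counts, totals)
print(f"slope={slope:.2f} intercept={intercept:.1f} adj R^2={adjusted_r2:.6f}")
```

A slope close to the common mean together with an adjusted $R^2$ close to 1 is what a strongly conserved average length looks like in this representation; a second band of species with a different mean length would show up as a lower adjusted $R^2$.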

Figures 5 and 6 show the equivalent plots for the archaea and the bacteria of the SwissProt release v18-02 respectively. These are subsets of Figure 4.

Figure 5: A plot of the total length of all proteins in a species against the number of proteins in a species for the archaea of the SwissProt dataset version 18-02. (Axes: total size in aa against number of proteins.)

Figure 6: A plot of the total length of all proteins in a species against the number of proteins in a species for the bacteria of the SwissProt dataset version 18-02. (Axes: total size in aa against number of proteins.)

In contrast, Figure 7 shows the equivalent plot for the eukaryota. Although the overall behaviour is strongly linear as predicted, there appear interesting departures, e.g. the broadening tending to bifurcation of the plot, which do not seem to be present in either archaea or bacteria.

For completeness, we also include the viruses as Figure 8. We will however say little more about these data, as the status of viruses as non-living agents that infect cells (and co-opt their biochemical and cellular machinery) across the whole spectrum of the domains of life introduces additional complications and questions that we do not address. In particular, the viruses do not represent a natural aggregation; rather they are a heterogeneous collection that might best be considered as a component of the domain of life that they specifically infect.

Note that we do not expect the power-law slopes of the tails of these different aggregations to be the same. In terms of our model, they correspond to different values of $\alpha, \beta$ but our theory has nothing to say about their actual values in any system as they are fundamentally undetermined in the Statistical Mechanics methodology used.

Figure 7: A plot of the total length of all proteins in a species against the number of proteins in a species for the eukaryota of the SwissProt dataset version 18-02. (Axes: total size in aa against number of proteins.)

Figure 8: A plot of the total length of all proteins in a species against the number of proteins in a species for the viruses of the SwissProt dataset version 18-02. (Axes: total size in aa against number of proteins.)

We give an R lm() linearity analysis for each of the aggregations in Figs 4 - 8 as Table 2. Note that these are somewhat different from those quoted in Table 1 which used v.15-07 of the SwissProt annotated subset.

Table 2: Measures of mean length of proteins in the domains of life and viruses taken from the SwissProt annotated subset of TrEMBL v.18-02.

Aggregation    Slope    Adjusted R²    Std. Error

The adjusted $R^2$ values of Table 2 confirm our visual analysis with prokaryotic domains being very close to the equilibrium value (which corresponds to 1.0). Viruses show the biggest departure.

Evolutionary divergence and fine structure in the eukaryota

We will now attempt to interpret the visual patterns of Figures 4, 5, 6, 7 and, although the viruses are not considered a domain of life (but still manifestly a qualifying discrete system), Figure 8, in terms of our theory.

As described at length in [Hat14, HW17], our theory is scale-independent and token-agnostic. As a result, we argue with very substantial measurement support [HW15], that discrete systems inter alia share common properties associated with the distribution of the lengths of their components. These properties should therefore by definition be decoupled from the particular nature of the system. In other words, in the case of the proteome, it is unnecessary to postulate that particular length distributions need to have been generated by evolution, which is brought about by a variety of local processes, only some of which rely on the meaning and therefore function of tokens (amino acids). However, we note that evolutionary processes can perturb the equilibrium distribution constrained by CoHSI - a measure of the degree of this perturbation is the divergence from linearity in the plots of Figs 4-8, as expressed for example in the adjusted $R^2$ values of Table 2.

It is helpful to elaborate on the concept of evolutionary divergence as we use it here, in order to emphasize that we do not see evolution as being driven solely by the pressure of natural selection. The modern synthesis (or neo-Darwinian theory) regards evolution (simply, the change of life forms with time) as a process with a primary mechanism of natural selection. Naturally arising or pre-existing variants in a population are acted on by natural selection, such that the best-adapted organisms (through the survival of the fittest) pass their conditionally superior genes to the next generation. However, genetic variation in a population can also be the basis for evolution by mechanisms distinct from natural selection.
These mechanisms operate in one case by the fixation in a population of particular genetic variants through what are essentially stochastic processes. Because the variants fixed in this random manner are postulated to be without substantial benefits or handicaps in terms of natural selection, this theory is termed genetic drift, or the neutral theory of evolution. Neutral evolution may even explain the majority of genetic variation that is seen at the molecular level [Kim89, Sto12].

A second mechanism that acts to fix particular genetic variants in a population and that is not driven by natural selection is termed molecular drive. In this theory [Dov82], intrinsic mechanistic, molecular and biochemical biases in cellular functions lead to an outcome where particular genetic variants become fixed in a population without the process being driven by natural selection. Processes that are susceptible to these biases include gene conversion (which can show directional preferences), crossing-over that occurs preferentially at certain chromosomal sites bearing specific allelic variants, and chromosomal rearrangements such as transpositions, which can show preferences for certain sites. Note that we cannot eliminate the role of any particular mechanism of evolution in the observed patterns of protein length - such a proposal would not be falsifiable. All we can say is that it is not necessary to invoke specific evolutionary mechanisms to explain what is observed, in an analogous way to the role of the ether in early 20th century discussions of relativity.

Returning to our discussion of the aggregations of the previous section, there is a clear visual difference between the length distributions of the prokaryotic domains of life (archaea and bacteria) when compared with the eukaryotic domain of life. The former are more or less exactly what we expect of a system which closely adheres to the CoHSI principle.
In [HW18], we explored the general properties of (1), in particular, the nature of the solutions as the Lagrangian undetermined parameters $\alpha, \beta$ were varied and the effects of these variations on three measures of the average, i.e. the mean, the median and the mode. We showed that all three measures were robust to variations in $\alpha, \beta$ consistent within ranges of values observed in real data and specifically how they were related to the power-law slope of the tail of the length distribution. In other words, any system obeying (1) would be expected to exhibit a strongly conserved average component length across different aggregations, howsoever measured, simply because variations in the disposable parameters $\alpha, \beta$ for those different aggregations have relatively little effect on the mean, median and mode. Indeed, strongly conserved average component length is a property of CoHSI systems.

How then should we interpret the data for eukaryota shown in Figure 7 in terms of the apparent bifurcation of their mean protein length? Using an analogy from physical systems, we can think of the solution of (1) as the equilibrium state and any departures from it as due, in the case of the proteome, to evolutionary pressure. We could then interpret the notable visual features of Figure 7 either as characteristic of a less efficient domain of life where return to the equilibrium state (exactly conserved average length) is more sluggish, or as a domain of life where evolutionary processes are exerting a greater pressure. At this point we merely note these two possible explanations, and prefer not to engage in untestable speculation about defining in quantitative terms what efficiency could mean in terms of the cellular and molecular processes of the different domains of life, or the relative rigors of selective pressure experienced by archaea, bacteria and eukaryota.

Returning then to the prokaryotic domains, we note that the archaea and bacteria both closely adhere to the expected behavior for a discrete system, that of close adherence to the conservation of average protein length. In other words, both prokaryotic domains of life appear close to the equilibrium state with little evidence of departures other than what appear to be minor random fluctuations.
While this outcome would be consistent with speculation that prokaryotes are both 1) subject to strong selective pressure and 2) possess highly efficient mechanisms of response to such pressure, a consequence of the approach that we take here is that there is no prima facie reason to look for evolutionary implications when the only departures from the equilibrium state constrained by CoHSI appear both minor and random.

When looking at the eukaryota of Figure 7 however, our visual impression is that the departures are not random and there appears evidence of fine structure in the form of bifurcated embedded regions of linearity with different gradients, almost as if the eukaryota could be further sub-divided. There is also a hint of this in the full dataset itself as shown in Figure 4 with a zone of steeper linearity evident with a gradient corresponding to an average protein length of around 440. Now we recall [HW17] that although the CoHSI principle is overwhelmingly likely to lead to a pdf which is the solution of (1), it is not a straitjacket.

We expect in a CoHSI system that average component length will be strongly conserved across different aggregations, so it is of considerable interest to investigate apparently systematic departures from this as appear to occur in the eukaryota.

We therefore propose that systematic departures from CoHSI both identify and provide a measure of evolutionary divergence, which we now test.

Before we begin to explore the dataset itself, it is important to discuss what we mean by systematic departures in these large datasets. All experimental datasets include various kinds of noise, such as pseudo-random noise perhaps caused by data which has not yet been curated properly, but also systematic noise caused for example by researchers choosing to study only the small proteins of a species. We can do some simple analyses to explore these points.

For example, if we took a minimum qualifying number of proteins even as low as 500 in order to measure how well researchers have covered a particular species, then more than 98% of the species appearing in the better curated SwissProt dataset would not qualify. Even if the very much larger TrEMBL v 18-02 is used (larger by about a factor of 100), almost 85% of the species would not qualify either.

Figure 9 is a plot of all species in the less well curated superset TrEMBL v 18-02. It can be directly compared with the better curated SwissProt 18-02 subset of Figure 4 as the x-axis scale is the same in both (although not the y-axis).

Comparing Figures 4 and 9 gives some intriguing insights into this question. First of all it is clear that the bifurcation observed in Figure 4 is completely obscured by the additional noise in the full TrEMBL distribution of Figure 9. If we interpret Figure 4 as being indicative of genuine biological signal, there are hints of systematic behaviour. As we will see, however, we have to reduce our target qualifying number of proteins dramatically to only 40 before species emerge, and the question is then whether they do so consistently.
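The effect of a minimum qualifying number of proteins can be sketched as a simple filter over per-species protein counts. The code below is an illustration only: the dictionary `protein_counts` and its species names are hypothetical placeholders, not SwissProt or TrEMBL entries, and the two helper functions are names introduced here.

```python
def fraction_not_qualifying(protein_counts, min_proteins):
    """Fraction of species with fewer than `min_proteins` sequenced proteins."""
    if not protein_counts:
        raise ValueError("no species supplied")
    below = sum(1 for count in protein_counts.values() if count < min_proteins)
    return below / len(protein_counts)


def qualifying_species(protein_counts, min_proteins):
    """Sorted names of species with at least `min_proteins` proteins."""
    return sorted(name for name, count in protein_counts.items()
                  if count >= min_proteins)


# Hypothetical coverage: species name -> number of proteins in the dataset.
protein_counts = {
    "species_a": 3,
    "species_b": 45,
    "species_c": 620,
    "species_d": 18,
    "species_e": 31,
}

print(fraction_not_qualifying(protein_counts, 500))
for threshold in (40, 30, 20):
    print(threshold, qualifying_species(protein_counts, threshold))
```

Lowering the threshold admits more species, at the price of each admitted species being represented by fewer proteins, which is the trade-off discussed in the text.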

CoHSI is a theory about tokens in discrete systems and the direct implications for how length distributions of proteins, in the case of the proteome, will behave. In our analysis of fine structure, we will therefore use graphical properties of the length distributions themselves, notably box and whiskers diagrams which show the outliers, the quartiles and the median, and give more insight into the skewed nature of the distribution.

Figure 9: A plot of the total length of all proteins in a species against the number of proteins in a species for all species in the TrEMBL dataset version 18-02. (Axes: total size in aa against number of proteins, 100 to 1000.)

We observe in our analyses what appear to be two patterns of local linearity in Figure 4 associated with the eukaryota, as seen more clearly in Figure 7, where the subset of data for the eukaryota are plotted separately. Two populations emerge, clearly distinguishable by their average length.

1. High average protein length eukaryota, … ± … aa.
2. Population average protein length eukaryota, … ± … aa.

The large majority of points in Figure 7 (each corresponding to a species) lie on or near what we interpret as the equilibrium state for the system and which we have named the Population average protein length, in this case using the mean. This is where the vast majority of species are located. However, the second population of interest also shows marked linearity and is termed the High average protein length; this we will focus on. It is visible as an approximately linear band with systematically higher gradient than the population average. Since it appears to have a consistent quality in spite of the relatively low number of qualifying proteins we have been forced to use by incompleteness, we would expect that a biological signal would correspond to some systematic set of species occupying this zone of higher average length. On the other hand, if the zone of linearity appears to have no consistency with respect to species, we would have much less confidence in it as a distinguishable population.
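The quantities a box and whiskers diagram displays (quartiles, median, whiskers and outliers) can be computed directly. The sketch below assumes the common 1.5 × IQR convention for flagging outliers and uses Python's `statistics.quantiles` with its default method; the plotting software used for the figures may use a different quartile convention, and the sample values are hypothetical.

```python
import statistics


def box_stats(values):
    """Summary behind a box and whiskers diagram (1.5 * IQR outlier rule)."""
    ordered = sorted(values)
    q1, median, q3 = statistics.quantiles(ordered, n=4)
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    inside = [v for v in ordered if low_fence <= v <= high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": inside[0],    # smallest point within the fences
        "whisker_high": inside[-1],  # largest point within the fences
        "outliers": [v for v in ordered if v < low_fence or v > high_fence],
    }


# Hypothetical mean protein lengths (aa) for the species in one taxon.
mean_lengths = [410, 430, 445, 450, 455, 460, 470, 480, 700]

stats = box_stats(mean_lengths)
for key, value in stats.items():
    print(f"{key:>12}: {value}")
```

Because the median and quartiles ignore the magnitude of extreme points, a summary of this kind remains informative for the strongly right-skewed length distributions discussed earlier.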

Figure 10 shows a box and whiskers diagram for the range of eukaryota with the higher average protein length of … ± … aa, with the minimum number of qualifying proteins set at 40. We see that the species occupying this zone of linearity are indeed consistent and are members of the kingdom Fungi and specifically the subkingdom dikarya and the phylum ascomycota. In spite of the heavy qualification resulting from the incompleteness, this is sufficiently promising that we will continue dropping the minimum number of qualifying proteins to see at what point species from a different kingdom intrude, clouding our picture.

Figure 10: A box and whiskers diagram of all eukaryota with an average protein length of … ± … aa for a minimum qualifying number of proteins of 40. (Categories shown: Ascomycota, Fungi, Dikarya.)

Figure 11 is the same data but with the addition of species with a qualifying number of proteins set to 30 or greater. This time a new subphylum pezizomycotina appears; notably, this is the largest subphylum of the ascomycota fungi (http://tolweb.org/Pezizomycotina/29296).

Once again dropping the minimum qualifying number of proteins to 20, Figure 12 results. Again only members of the kingdom fungi are introduced, this time eurotiomycetes, a class of pezizomycotina, and eurotiomycetidae, a subclass of the eurotiomycetes.

Only when we reduce below this already low level of qualifying proteins (data not shown) are additional species from taxa other than the ascomycota (such as the Metazoa) identified.

Our results suggest that deviation from the equilibrium protein length distribution constrained by CoHSI can potentially identify evolutionary divergence, as is the case documented here for the ascomycete fungi. The fact that we could reduce the minimum number of qualifying proteins for a particular taxon down to only 20 whilst preserving the result shows a level of consistency which is certainly promising and might be indicative of a biological signal.
The nature of any underlying evolutionary mechanism and possible significance for the ecology and evolution of the ascomycota could be a fruitful area of investigation.

We have identified above that it seems possible to identify particular species associated with differing average protein length within their domain aggregation. What might be the cause of this? At this stage, we will do no more than identify one further interesting property in information based discrete systems which is not shared with the classical Boltzmann systems in the Kinetic Theory of Gases. Recall from [HW17] that in the review of classical Boltzmann systems, the parameter $\alpha$ controls the total size $T$ of the system in tokens whilst the parameter $\beta$ controls the exponential shape of the energy distribution. However in that theory, $\alpha, \beta$ are tightly coupled. Once $\beta$ is chosen, $\alpha$ is set automatically as it simply normalises the distribution, including subsets of $T$. In other words, subsets have the same shape if they have the same $\beta$. (Note that in classical Boltzmann systems, $\beta = 1/(k_B \Theta)$ where $k_B$, $\Theta$ are Boltzmann's constant and the temperature respectively.)

In information-based systems however, $\alpha, \beta$ decouple [HW18]. The parameter $\alpha$ still controls the system size $T$ and the parameter $\beta$ controls how the information is distributed, but their functionality now overlaps. $\beta$ is still determined asymptotically by the shape of the distribution (the distribution of the unique alphabet of amino acids in this case), but this does not automatically determine $\alpha$. Instead there are a range of distribution shapes for the smaller components corresponding to different values of $\alpha$ for the same value of $\beta$ (Figure 6 of [HW18]). This corresponds to varying the system size for the same total information content - the different distributions naturally lead to different values of the average for subsets even with the same $\beta$. There is no analog for this in classical energy conserving systems.

Figure 11: A box and whiskers diagram of all eukaryota with an average protein length of … ± … aa for a minimum qualifying number of proteins of 30. (Categories shown: Ascomycota, Pezizomycotina, Fungi, Dikarya.)

We can illustrate this by considering only the Bacteria domain of life in version 18-02 of the SwissProt dataset. We have already seen in Figure 6 that this dataset appears to adhere very closely to the highly conserved average protein length described earlier as an equilibrium state, with no obvious fine structure, unlike that observed in the eukaryota. If this were a classical system, subsets of this dataset would have the same statistical properties and we would expect them to have the same, albeit somewhat noisier, estimated average length. However, this is not the case with CoHSI systems which have a richer set of subtle behaviours. We can see this by extracting subsets based on their unique amino acid alphabet count, one of the key parameters of discrete systems to emerge in [HW17]. Now for fixed $\beta$, smaller subsets will correspond to smaller $\alpha$ since this parameter controls the size, but as we showed in [HW18] the CoHSI equation then implies a shorter average protein length.

We can clearly see this prediction fulfilled qualitatively in Figures 13 and 14 (they are typical of analyses we conducted for 12-24 unique amino acids in increments of 1). According to an R lm() analysis, the 18 unique amino acid count proteins have an average length of 152.0, adjusted $R^2$ of 0.955 on 1241 species, whilst the 20 unique amino acid count proteins have an average length of 365.5, adjusted $R^2$ of 0.995 on 2544 species. This is a good point to re-emphasize the token-agnostic nature of CoHSI. It simply does not matter which amino acids are actually used since they have no intrinsic meaning.
The only thing that matters in the information measure is the unique amino acid count [HW15, HW17].

We will defer any further discussion of average protein length and unique alphabet size as it properly belongs in a full discussion of the complex topic of protein alphabets and PTM (Post-Translational Modification), which we will consider in detail in a later paper in this series. We do however consider Figures 13 and 14 as further supporting evidence of the CoHSI theory.

Figure 12: A box and whiskers diagram of all eukaryota with an average protein length of ± aa for a minimum qualifying number of proteins of 20. Taxa shown: Ascomycota, Eurotiomycetes, Pezizomycotina, Fungi, Eurotiomycetidae, Dikarya.

Figure 13: A plot of the total length of all proteins in a species against the number of proteins in a species for all bacteria of the SwissProt dataset version 18-02 with exactly 18 unique amino acids. Axes: concatenated length of proteins in species; number of proteins in species.
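The subset analysis behind Figures 13 and 14 can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the R code used for the paper. The records are invented toy sequences, the helper names (`unique_aa_count`, `per_species_totals`, `fit_line`) are hypothetical, the fit includes an intercept (whether the original lm() call did is not stated here), and reading the fitted slope as the average protein length is an interpretation of the text. The toy data use k = 3 distinct residues to keep the example small, whereas the analysis in the text uses values such as 18 and 20.

```python
from collections import defaultdict


def unique_aa_count(sequence):
    """Number of distinct residue letters in a protein sequence."""
    return len(set(sequence))


def per_species_totals(records, k):
    """Map species -> (number of proteins, concatenated length),
    counting only proteins with exactly k distinct residues."""
    totals = defaultdict(lambda: [0, 0])
    for species, sequence in records:
        if unique_aa_count(sequence) == k:
            totals[species][0] += 1
            totals[species][1] += len(sequence)
    return {s: (n, length) for s, (n, length) in sorted(totals.items())}


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x.

    Returns (a, b, adjusted R^2). Needs more than two points and
    at least two distinct x values.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    r2 = 1.0 - ss_res / ss_tot
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - 2)  # one predictor
    return a, b, adj_r2


# Invented toy records: (species, sequence). Not real UniProt data.
records = [
    ("sp1", "ACDA"), ("sp1", "ACDDC"), ("sp1", "ACDE"),
    ("sp2", "ACD"),
    ("sp3", "AACCDD"), ("sp3", "ACDACD"), ("sp3", "DCA"),
]

totals = per_species_totals(records, k=3)
xs = [n for n, _ in totals.values()]             # proteins per species
ys = [length for _, length in totals.values()]   # concatenated length
intercept, slope, adj_r2 = fit_line(xs, ys)
print(f"slope = {slope:.1f}, adjusted R^2 = {adj_r2:.3f}")
```

Repeating the fit for each value of k gives one slope per unique amino acid count, which is the comparison the text draws between the 18 and 20 count subsets.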

Figure 14: A plot of the total length of all proteins in a species against the number of proteins in a species for the bacteria of the SwissProt dataset version 18-02 with exactly 20 unique amino acids. Axes: concatenated length of proteins in species; number of proteins in species.

In this paper, we have discussed various measures of average length in CoHSI systems, showing that observed properties of the proteome, notably the strong conservation of average lengths of proteins, should be expected whatever standard measure of average length we use, be it mean, median or mode, as indicated in [HW18].

Choosing the mean and using a simple graphical property, we then demonstrated that this leads to a consistent visual method of identifying related taxa in an average length plot. The archaea, bacteria and eukaryota each show distinct average protein length distributions, and a closer examination of the eukaryota revealed that fungi showed a longer than average protein length. Examining the fungi in more detail suggested that this property of the fungi might be ascribed specifically to the phylum ascomycota, although we caution that the current coverage of protein sequencing for many species is quite minimal. This situation improves all the time as sequencing and annotation efforts continue, and the availability of more comprehensive and reliable datasets for proteins will improve the confidence in results obtained through the methods outlined here.

We also introduced the idea of evolutionary divergence (or pressure) acting against an equilibrium state defined by CoHSI. This was particularly well exemplified in comparing the rather noisy full length distributions of all species in TrEMBL v 18-02 with those in the better curated and much smaller subset, SwissProt v 18-02. This latter dataset allowed us to identify what we consider as biologically-related fine structure in the eukaryota in the form of banded zones of linearity in the average protein length of species. This was not visible in archaea or bacteria, which we hypothesize are much closer to the CoHSI-constrained equilibrium.

Finally, we speculated that there may be some relationship with unique amino acid alphabet as predicted by [HW17] and we were able to demonstrate the existence of a subtle relationship directly predicted by CoHSI on the Bacteria domain of life in version 18-02 of SwissProt, whereby different sized subsets based on unique amino acid count of this domain have different average lengths. We will pursue the nature of this relationship and the post-translational modification of proteins in a later paper in this series.

References

[Che63] C. Cherry. On Human Communication. John Wiley Science Editions, 1963. Library of Congress 56-9820.

[Dov82] Gabriel Dover. Molecular drive: a cohesive mode of species evolution. Nature, 299:111–117, 1982.

[GW01] A.M. Glazer and J.S. Wark. Statistical Mechanics. A survival guide. OUP, 2001.

[Har28] R.V.L. Hartley. Transmission of information. Bell System Tech. Journal, 7:535, 1928.

[Hat14] L. Hatton. Conservation of Information: Software's Hidden Clockwork. IEEE Transactions on Software Engineering, 40(5):450–460, May 2014. 10.1109/TSE.2014.2316158.

[HW15] L. Hatton and G. Warr. Protein Structure and Evolution: Are They Constrained Globally by a Principle Derived from Information Theory? PLOS ONE, 2015. doi:10.1371/journal.pone.0125663.

[HW16] L. Hatton and G. Warr. Full Computational Reproducibility in Biological Science: Methods, Software and a Case Study in Protein Biology. ArXiv, August 2016. http://arxiv.org/abs/1608.06897 [q-bio.QM].

[HW17] L. Hatton and G. Warr. Information theory and the length distribution of all discrete systems. arXiv, Sep 2017. http://arxiv.org/pdf/1709.01712 [q-bio.OT].

[HW18] L. Hatton and G. Warr. CoHSI I; Detailed properties of the Canonical Distribution for Discrete Systems such as the Proteome. arXiv, Jun 2018. https://arxiv.org/pdf/1806.08785 [q-bio.OT].

[IHGC12] D.C. Ince, L. Hatton, and J. Graham-Cumming. The case for open program code. Nature, 482:485–488, February 2012. doi:10.1038/nature10836.

[Kim89] Motoo Kimura. The Neutral Theory of Evolution and the World View of the Neutralists. Genome, 31:24–31, 1989.

[Sha48] C.E. Shannon. A mathematical theory of communication. Bell System Tech. Journal, 27:379–423, July 1948.

[Sto12] Arlin Stoltzfus. Constructive neutral evolution: exploring evolutionary theory's curious disconnect. Biology Direct

Exploratory Data Analysis. Addison-Wesley, 1977.

[WHL05] D.Y. Wang, M.F. Hsieh, and W.H. Li. A general tendency for conservation of protein length across eukaryotic kingdom. Molecular Biology and Evolution, 22:142–147, 2005. 10.1093/molbev/msh263.

[XCH+06] L. Xu, H. Chen, X. Hu, R. Zhang, Z. Zhang, and Z.W. Luo. Average gene length is highly conserved in prokaryotes and eukaryotes and diverges only between the two kingdoms. Molecular Biology and Evolution, 23(6):1107–1108, June 2006. 10.1093/molbev/msk019.

[Zha00] J. Zhang. Protein-length distributions for the three domains of life.