On the verge of life: Distribution of nucleotide sequences in viral RNAs

Abstract

The aim of the study is to analyze viruses using parameters obtained from distributions of nucleotide sequences in the viral RNA. To keep the input data homogeneous, we analyze single-stranded RNA viruses only. Two approaches are used to obtain the nucleotide sequences. In the first one, chunks of equal length (four nucleotides) are considered. In the second approach, the whole RNA genome is divided into parts using adenine or the most frequent nucleotide as a "space". Rank–frequency distributions are studied in both cases. Within the first approach, the Pólya and the negative hypergeometric distributions yield the best fit. For the distributions obtained within the second approach, we have calculated a set of parameters, including entropy, mean sequence length, and its dispersion. The calculated parameters became the basis for the classification of viruses. We observed that proximity of viruses on planes spanned by various pairs of parameters corresponds to related species. In certain cases, such proximity is observed for unrelated species as well, which calls for expanding the set of parameters used in the classification. We also observed that the fourth most frequent nucleotide sequences obtained within the second approach are of a different nature in the case of human coronaviruses (different nucleotides for MERS, SARS-CoV, and SARS-CoV-2 versus identical nucleotides for four other coronaviruses). We expect that our findings will be useful as a supplementary tool in the classification of diseases caused by RNA viruses with respect to severity and contagiousness.



Mykola Husev · Andrij Rovenchak



Keywords

RNA virus · Coronavirus · Nucleotide sequence · Rank–frequency distribution

M. Husev · A. Rovenchak
Department for Theoretical Physics, Ivan Franko National University of Lviv, 12 Drahomanov St, UA-79005, Lviv, Ukraine
arXiv preprint [q-bio.OT]

Studies of genomes based on linguistic approaches date back a few decades [Brendel et al. 1986; Pevzner et al. 1989; Searls 1992; Botstein & Cherry 1997; Gimona 2006; Faltýnek et al. 2019; Ji 2020]. An interplay with methods of statistical physics and the theory of complex systems has brought new insights into biology [Dehmer & Emmert-Streib 2009; Qian 2013]. Studies range from attempted n-gram-based classification of genomes [Tomović et al. 2006; Huang & Yu 2016] to algorithms for optimal segmentation of RNAs in secondary structure predictions [Licon et al. 2010] and analysis of substitution rates of coding genes during evolution [Lin et al. 2019], to mention just a few. Recently, neural networks and deep learning algorithms have emerged as new tools to analyze nucleotide sequences [Fang et al. 2019; Singh et al. 2019; Melkus et al. 2020; Ren et al. 2020], offering wider prospects for studies of genomes. Viruses, balancing on the fuzzy border between non-alive and alive and hence remaining on the verge of life [Villarreal 2004; Kolb 2007; Carsetti 2020], are among the most interesting subjects of study.

The aim of the present Letter is to draw attention to simple treatments of nucleotide sequences in viral RNAs by means of new parameters, which can be immediately extracted from genome data. We expect that such parameters can potentially be used as an auxiliary tool in the classification of viruses, cf., in particular, [Wang 2013]. The idea of this study is linked to the recent COVID-19 outbreak, and the analysis started from comparing human coronaviruses [Su et al. 2016; Wu et al. 2020] and some other viruses. To achieve relative homogeneity of the material, we restrict our sample to single-stranded RNA viruses only. Both positive- and negative-sense RNAs are considered. For future reference, we also include two retroviruses, HIV-1 and HIV-2.

The paper is organized as follows. A summary of the data and a description of the methods are given in Section 2. Results are presented in Section 3. Finally, a brief discussion is given in Section 4. Detailed tables of numerical results are placed in the Appendix.

Table 1

Viruses analyzed in the work.

No. | Short name | Full name | Type (a) | Size (bases) | NCBI source (b)
1 | A/H1N1 | Influenza A virus (H1N1) | (−) | 13371 | HE584753.1…HE584760.1
2 | Ball python nidovirus | Ball python nidovirus 1 | (+) | 33452 | 674660326
3 | Dengue | Dengue virus 2 | (+) | 10723 | 158976983
4 | Ebola | Zaire ebolavirus | (−) | 18962 | MK672824.1
5 | Feline-CoV | Feline infectious peritonitis virus | (+) | 29355 | 315192962
6 | HCoV-229E | Human coronavirus 229E | (+) | 27317 | 12175745
7 | HCoV-HKU1 | Human coronavirus HKU1 | (+) | 29926 | 85667876
8 | HCoV-NL63 | Human coronavirus NL63 | (+) | 27553 | 49169782
9 | HCoV-OC43 | Human coronavirus OC43 | (+) | 30741 | 1578871709
10 | Hepatitis A | Hepatovirus A | (+) | 7478 | NC_001489.1
11 | Hepatitis C | Hepatitis C virus genotype 1 | (+) | 9646 | 22129792
12 | Hepatitis D | Hepatitis delta virus | (−) | 1682 | 13277517
13 | Hepatitis E | Hepatitis E virus | (+) | 7176 | NC_001434.1
14 | HIV-1 | Human immunodeficiency virus 1 | (retro) | 9181 | 9629357
15 | HIV-2 | Human immunodeficiency virus 2 | (retro) | 10359 | 9628880
16 | HRV-A | Human rhinovirus A1 | (+) | 7137 | 1464306962
17 | HRV-B | Human rhinovirus B3 | (+) | 7208 | 1464306975
18 | HRV-C | Human rhinovirus NAT001 | (+) | 6944 | 1464310212
19 | Marburg | Lake Victoria marburgvirus - Ravn | (−) | 19114 | DQ447649.1
20 | Measles | Measles virus strain Edmonston | (−) | 15894 | AF266290.1
21 | MERS | Middle East respiratory syndrome coronavirus | (+) | 30119 | 667489388
22 | Norovirus | Norovirus Hu/GI.1/CHA6A003_20091104/2009/USA | (+) | 7600 | KF039737.1
23 | Phage MS2 | Enterobacteria phage MS2 | (+) | 3569 | 176120924
24 | Planidovirus | Planarian secretory cell nidovirus | (+) | 41178 | 1571803928
25 | Polio | Poliovirus (Enterovirus C) | (+) | 7440 | NC_002058.3
26 | Rabies | Rabies virus strain SRV9 | (−) | 11928 | AF499686.2
27 | SARS | Severe acute respiratory syndrome coronavirus | (+) | 29751 | 30271926
28 | SARS-CoV-2 | Severe acute respiratory syndrome coronavirus 2 | (+) | 29903 | NC_045512
29 | Yellow fever | Yellow fever virus | (+) | 10862 | NC_002031.1 g-max
30 | Zika | Zika virus | (+) | 10794 | 226377833 g-max

Notes: (a) Negative-sense RNA (−), positive-sense RNA (+) or retro. (b) […]

[…] than those of DNA ones, which may vary by about four orders of magnitude [Campillo-Balderas et al. 2015].

We use two approaches to define nucleotide sequences. The first one is based on cutting an RNA genome into chunks of equal length of $n$ nucleotides. The second approach is rooted in linguistics: the most frequent nucleotide is treated as a "space" dividing an RNA into "words" of different lengths [Rovenchak 2018]. Note also the distantly related units applied in the analysis of human DNA, the so-called motifs [Liang 2014].

To demonstrate the first approach, with equal-length chunks, let us consider the Ebolavirus genome, which starts with the following nucleotide sequence:

GGACACACAAAAAGAAAGAAGAATTTTTAGGATCTTTTGT… (1)

Choosing the chunk length $n = 4$, we obtain:

GGAC ACAC AAAA AGAA AGAA GAAT TTTT AGGA TCTT TTGT … (2)

If the RNA length is not a multiple of four, the last chunk has one to three nucleotides. The number of all possible 4-nucleotide combinations is $4^4 = 256$. Longer chunks would yield a much higher variety of combinations, with frequencies distributed very smoothly. On the other hand, we would like to avoid shorter chunks, such as three-nucleotide sequences corresponding to codons. So, the length $n = 4$ seems optimal for our analysis.

In the second approach, the same Ebolavirus sequence (1) can be split, using the most frequent nucleotide (adenine) as a "space", into the following:

GG C C C X X X X G X X G X G X TTTTT GG TCTTTTGT … (3)

Here "X" stands for a zero-length element inserted between two consecutive "A"s.

We have also applied a special treatment to the Influenza A virus (H1N1) by adding spaces between each of the eight segments of its RNA in both the first and the second approach.

In both approaches, we calculate the frequencies of the obtained nucleotide chunks within a given genome split in the respective manner and compile the rank–frequency distributions. The latter are obtained in the standard manner: the most frequent item has rank 1, the second most frequent one has rank 2, and so on. Items with equal frequencies are given consecutive ranks in a random order, which is not relevant.

The rank–frequency distributions obtained using the first approach (with 4-nucleotide chunks) were analyzed using special software, AltmannFitter 2.1 [Altmann 2000].
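The two ways of splitting a genome and the ranking rule described above can be sketched in a few lines of Python. This is only an illustration, not the authors' code: the function names `equal_chunks`, `split_words` and `rank_frequency` are hypothetical, and the handling of a "space" at the very beginning or end of a genome (it produces an "X" here) is an assumption the text does not specify.

```python
from collections import Counter

def equal_chunks(seq, n=4):
    """Cut seq into consecutive chunks of n nucleotides; the last one may be shorter."""
    return [seq[i:i + n] for i in range(0, len(seq), n)]

def split_words(seq, space=None):
    """Split seq into "words" at every occurrence of the "space" nucleotide.

    If space is None, the most frequent nucleotide of seq is used.
    Zero-length pieces between two consecutive spaces are written as "X".
    """
    if space is None:
        space = Counter(seq).most_common(1)[0][0]
    return [piece if piece else "X" for piece in seq.split(space)]

def rank_frequency(items):
    """Return (item, frequency) pairs by decreasing frequency; rank 1 comes first."""
    return Counter(items).most_common()

# Opening of the Ebolavirus genome, sequence (1)
ebola_start = "GGACACACAAAAAGAAAGAAGAATTTTTAGGATCTTTTGT"
chunks = equal_chunks(ebola_start)   # reproduces (2)
words = split_words(ebola_start)     # reproduces (3); adenine is the most frequent here

for rank, (word, freq) in enumerate(rank_frequency(words), start=1):
    print(rank, word, freq)
```

Items with equal frequencies are ranked here in order of first appearance rather than randomly; as noted above, the order of ties is not relevant.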
We found that two discrete distributions describe the obtained data with the highest precision: the so-called 1-displaced negative hypergeometric distribution [Grzybek 2007; Wilson 2013],

$$ p_r = \frac{\binom{M+r-2}{r-1}\binom{K-M+n-r}{n-r+1}}{\binom{K+n-1}{n}}, \qquad r = 1, 2, 3, \ldots, \tag{4} $$

and the Pólya distribution [Wimmer & Altmann 1999; Johnson et al. 2005],

$$ p_r = \frac{\binom{-p/s}{r-1}\binom{(p-1)/s}{n-r+1}}{\binom{-1/s}{n}}, \qquad r = 1, 2, 3, \ldots \tag{5} $$

Absolute frequencies are obtained by multiplying $p_r$ by the sample size $N$. In most cases, the discrepancy coefficient $C = \chi^2/N$ is smaller than 0.02, which is considered a good fit [Mačutek 2008]. Typical rank–frequency distributions and respective fits are shown in Figure 1. Complete data are summarized in Table 3 in the Appendix and visualized in Figure 2.
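As a numerical illustration of Eq. (4) and of the discrepancy coefficient, the sketch below evaluates the 1-displaced negative hypergeometric probabilities with the MERS parameters from Table 3. It is not a replacement for AltmannFitter and performs no fitting. The helper names are hypothetical, the binomial coefficients are extended to non-integer arguments through the gamma function, and $\chi^2$ is assumed to be Pearson's statistic, which the text does not state explicitly.

```python
from math import exp, lgamma

def log_binom(a, k):
    """Logarithm of the generalized binomial coefficient C(a, k).

    Valid when a + 1 > 0 and a - k + 1 > 0, which holds in Eq. (4) for K > M > 0.
    """
    return lgamma(a + 1) - lgamma(k + 1) - lgamma(a - k + 1)

def neg_hypergeom_pmf(r, K, M, n):
    """1-displaced negative hypergeometric probability p_r for r = 1, ..., n + 1."""
    return exp(log_binom(M + r - 2, r - 1)
               + log_binom(K - M + n - r, n - r + 1)
               - log_binom(K + n - 1, n))

def discrepancy(freqs, probs):
    """Discrepancy coefficient C = chi^2 / N (Pearson's chi^2 assumed)."""
    N = sum(freqs)
    chi2 = sum((f - N * p) ** 2 / (N * p) for f, p in zip(freqs, probs))
    return chi2 / N

# Parameters of the MERS row in Table 3 (negative hypergeometric fit)
K, M, n = 2.4687, 0.8665, 256
probs = [neg_hypergeom_pmf(r, K, M, n) for r in range(1, n + 2)]
print(sum(probs))  # the probabilities over r = 1..n+1 should add up to one
```

Given observed absolute frequencies `freqs` ordered by rank, `discrepancy(freqs, probs)` returns the value to compare with the 0.02 threshold quoted above.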

Fig. 1

Typical rank–frequency distributions and respective fits (rank $r$ on the horizontal axis, $N p_r$ on the vertical axis). The left panel shows the data for MERS and the fit with the negative hypergeometric distribution, which is one of the best ($C = 0.0011$, cf. Table 3). […]

Fig. 2

Location of viruses on the $K$–$M$ plane (negative hypergeometric fit, left panel) and the $s$–$p$ plane (Pólya fit, right panel).

The first immediate observation from Figure 2 is that the genome length has no particular influence on the fitting parameters. Indeed, the shortest genome (Hepatitis D) and the two longest ones (Ball python nidovirus and Planidovirus) have close values of the $M$ or $s$ parameters. On the other hand, for genomes of similar lengths (coronaviruses) a clear separation is seen with respect to the $M$ and $p$ parameters. It is even more pronounced in the former case, corresponding to the negative hypergeometric distribution: lower values for the HCoV viruses (229E, HKU1, NL63, and OC43) and higher ones for MERS, SARS, and SARS-CoV-2.

Rank–frequency distributions were also compiled for nucleotide "words" obtained using the second approach and used to calculate certain parameters, such as entropy, mean length (first moment), length dispersion (second central moment), and some others. Previous studies [Rovenchak 2018] showed that the entropy and mean lengths of nucleotide sequences in mitochondrial DNA can be used to distinguish species and genera of mammals. It appears, however, that even better results are achieved with the "entropy – length dispersion" pair of variables, cf. Figure 3.

The parameters are defined as follows. Entropy is given by

$$ S = -\sum_{r=1}^{r_{\max}} p_r \ln p_r, \tag{6} $$

where the upper summation limit corresponds to the total number of different "words" in the list and the relative frequencies $p_r$ are

$$ p_r = \frac{f_r}{N}, \qquad \text{where} \quad N = \sum_r f_r, \tag{7} $$

and $f_r$ are the absolute frequencies at rank $r$. The mean length and the length dispersion are

$$ m_1 = \frac{1}{N} \sum_i x_i, \qquad m_2 = \frac{1}{N} \sum_i (x_i - m_1)^2, \tag{8} $$

where the summations run over all the "words" of the analyzed genome. The length $x_i$ of a particular word is counted as the number of its nucleotides, except for "X", which has length zero.

One should note that from the similarity of species one can expect proximity of points but not vice versa: it would be too bold to expect species to be distinguishable from only two parameters.

Fig. 3

Grouping of mammal species on the $m_2$–$S$ plane. The red-shaded area corresponds to Felidae, the blue one denotes Ursidae, and the green one corresponds to Hominidae. Calculations are made using mitochondrial DNAs with adenine as a "space".

This second approach can be divided into two sub-branches: (a) adenine, which is the most frequent nucleotide in most species studied in the present work, is used as a "space"; (b) the most frequent nucleotide is used as a "space". The latter is mostly relevant for RNAs where low frequencies of adenine yield too long "words", thus significantly distorting the expected dependencies. The respective results are shown in Figures 4–6. All the data are summarized in Table 4.

In Figure 6, we can observe in particular that the α-coronaviruses, HCoV-229E and HCoV-NL63, have very close values of the parameters (the respective points nearly overlap). A similar situation holds for the β-coronaviruses HCoV-OC43 and HCoV-HKU1. Two other β-coronaviruses, SARS and SARS-CoV-2, are located close to HCoV-OC43 and HCoV-HKU1, while MERS occupies an intermediate position. The latter virus also differs significantly in the entropy value, see Figure 4. On the other hand, calculations with the most frequent nucleotide used as a "space" (T for the analyzed coronaviruses) do not exhibit such a grouping, see Figure 5.

Fig. 4

Location of viruses on the $m_2$–$S$ plane. Calculations are made using RNAs with adenine as a "space", hence entropy is denoted $S_A$.

Fig. 5

Location of viruses on the $m_2$–$S$ plane. Calculations are made using RNAs with the most frequent nucleotide as a "space", hence entropy is denoted $S_{\mathrm{m.f.}}$.

Fig. 6

Location of viruses on the $m_2$–$S/N$ plane. Calculations are made using RNAs with adenine as a "space", hence entropy is denoted $S_A$. The vertical axis thus represents the entropy divided by the number of nucleotide sequences separated by adenine in the respective genome.

Looking in detail into the rank–frequency distributions corresponding to coronaviruses, we have discovered the following pattern: the first rank is always occupied by "X", followed by three single-nucleotide "words" with ranks 2–4, while the fifth rank (that is, the fourth most frequent non-empty sequence) is occupied by a two-nucleotide sequence with either the same (4-same) or different (4-diff) nucleotides, see Table 2. Curiously, different nucleotides correspond to coronaviruses causing much more severe diseases. This observation is yet to be extended to a wider material, but the preliminary data for the analyzed human viruses are as follows:

– 4-same: Dengue, HCoV-229E, HCoV-HKU1, HCoV-NL63, HCoV-OC43, HIV-1, HIV-2, HRV-A, HRV-B, HRV-C, Polio;
– 4-diff: A/H1N1, Ebola, Hepatitis A, Hepatitis C, Hepatitis E, Marburg, Measles, MERS, Norovirus, Rabies, SARS, SARS-CoV-2.

Three other viruses, Hepatitis D, Yellow fever, and Zika, do not follow either pattern, having a two-nucleotide sequence at a rank as low as 3 or 4.
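The entropy, mean length and length dispersion of Eqs. (6)–(8), as well as the rank pattern behind the 4-same / 4-diff grouping, can be sketched as follows. The function names are hypothetical and the code is only an illustration under stated assumptions: "words" are given as a list of strings with "X" marking a zero-length element, and ties in frequency are resolved by order of first appearance, which may differ from the ordering used for Table 2.

```python
from collections import Counter
from math import log

def word_length(word):
    """Number of nucleotides in a word; the element "X" has length zero."""
    return 0 if word == "X" else len(word)

def word_statistics(words):
    """Return entropy S (Eq. 6), mean length m1 and length dispersion m2 (Eq. 8)."""
    N = len(words)                       # total number of "words", Eq. (7)
    freqs = Counter(words)               # absolute frequencies f_r
    S = -sum((f / N) * log(f / N) for f in freqs.values())
    lengths = [word_length(w) for w in words]
    m1 = sum(lengths) / N
    m2 = sum((x - m1) ** 2 for x in lengths) / N
    return S, m1, m2

def rank5_pattern(words):
    """Classify the top of the rank list as "4-same", "4-diff" or "other"."""
    ranked = [w for w, _ in Counter(words).most_common(5)]
    if len(ranked) < 5 or ranked[0] != "X":
        return "other"
    if any(len(w) != 1 for w in ranked[1:4]):   # ranks 2-4: single nucleotides
        return "other"
    fifth = ranked[4]
    if len(fifth) != 2:                          # rank 5: two-nucleotide "word"
        return "other"
    return "4-same" if fifth[0] == fifth[1] else "4-diff"

# "Words" of example (3), the opening of the Ebolavirus genome
words = ["GG", "C", "C", "C", "X", "X", "X", "X", "G", "X", "X",
         "G", "X", "G", "X", "TTTTT", "GG", "TCTTTTGT"]
S, m1, m2 = word_statistics(words)
print(S, m1, m2, rank5_pattern(words))
```

Such a short fragment is far too small for a meaningful classification; the values in Table 4 and the grouping above refer to complete genomes.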

Table 2

Top-ranked nucleotide sequences in the genomes of the human coronaviruses.

Columns: rank $r$, followed by the "word" and its frequency $f_r$ for each of MERS, SARS, SARS-CoV-2, HCoV-229E, HCoV-HKU1, HCoV-NL63, and HCoV-OC43. […]

We have presented several possible approaches to a simple parametrization of RNA viruses based on the analysis of nucleotide sequences in viral genomes. They are based on discrete distributions (negative hypergeometric and Pólya) for equal-length (4-nucleotide) chunks and on the pair "entropy – length dispersion" for distributions of sequences separated by adenine or another most frequent nucleotide. Related viruses are characterized by close values of the calculated parameters. In some cases, similar values are also obtained for unrelated viruses. This is not surprising, as representing viruses on a plane means a two-parameter projection of points that are certainly described by more than two variables. We consider our study as a preliminary step in discovering such variables.

Observations regarding peculiarities of the rank–frequency distributions, with the fourth most frequent sequence containing two nucleotides that are either the same or different (4-same vs 4-diff), suggest that the 4-diff cases correspond to viruses causing potentially more severe diseases when dealing with the seven human coronaviruses. This tendency is generally preserved if the analyzed set is expanded by the other viruses studied in this work. A caveat concerns, in particular, the two HIV types, which fall into the 4-same category while certainly being extremely dangerous. However, HIV are not strictly RNA viruses but retroviruses, so we suggest that the reported peculiarities might be specific to RNA viruses only. "False-positive" alerts (cf. Norovirus in the 4-diff category) are not problematic, but the rate of "false-negative" results (severe diseases in the 4-same category) is yet to be identified. Expansion of the analyzed material in future studies would help to clarify the relevance of this observation.

To establish relations between peculiarities of the rank–frequency distributions in virus genomes and disease severity, a formalization of the latter is required. Initially we planned to use the case fatality rate (CFR) indicator [Reich et al. 2012; Kim et al. 2020] but were not able to find a study with data for different viruses based on a unified approach, similar, e.g., to [GBD 2017].

The main expected outcome of our reported analysis is a call for collaboration to expand the dataset and consistently classify diseases caused by RNA viruses, in particular with respect to severity and contagiousness. If some simple patterns could be established in the nucleotide distributions, this might help in alerting healthcare systems, which seems to become a very topical issue from this year on.

Conﬂict of interest

The authors, Mykola Husev and Andrij Rovenchak, declare that they have noconﬂict of interest.

Ethical approval

This article does not contain any studies with human participants or animalsperformed by any of the authors.

References

1. Altmann, G. (2000) Altmann Fitter 2.1. Lüdenscheid: RAM-Verlag.
2. Brendel, V., Beckmann, J. S. and Trifonov, E. N. (1986) Linguistics of nucleotide sequences: Morphology and comparison of vocabularies. Journal of Biomolecular Structure & Dynamics, 4, 011–021.
3. Botstein, D. and Cherry, J. M. (1997) Molecular linguistics: Extracting information from gene and protein sequences. Proc. Natl. Acad. Sci. USA, 94, 5506–5507.
4. Campillo-Balderas, J. A., Lazcano, A., and Becerra, A. (2015) Viral genome size distribution does not correlate with the antiquity of the host lineages. Front. Ecol. Evol., 3, 143. https://doi.org/10.3389/fevo.2015.00143
5. Carsetti, A. (2020) On the verge of life: Looking for a new scientific paradigm. In: Metabiology. Non-standard Models, General Semantics and Natural Evolution (Studies in Applied Philosophy, Epistemology and Rational Ethics, vol 50), 1–25. Cham: Springer. https://doi.org/10.1007/978-3-030-32718-7_1
6. de Smit, M. H. and van Duin, J. (1993) Translational initiation at the coat-protein gene of phage MS2: native upstream RNA relieves inhibition by local secondary structure. Molecular Microbiology, 9, 1079–1088. https://doi.org/10.1111/j.1365-2958.1993.tb01237.x
7. Dehmer, M. and Emmert-Streib, F. (eds.) (2009) Analysis of Complex Networks: From Biology to Linguistics. Weinheim: Wiley.
8. Faltýnek, D., Matlach, V. and Lacková, Ľ. (2019) Bases are not letters: On the analogy between the genetic code and natural language by sequence analysis. Biosemiotics, 12, 289–304. https://doi.org/10.1007/s12304-019-09353-z
9. Fang, C., Moriwaki, Y., Li, C. and Shimizu, K. (2019) MoRFPred_en: Sequence-based prediction of MoRFs using an ensemble learning strategy. Journal of Bioinformatics and Computational Biology, 17, 1940015. https://doi.org/10.1142/S0219720019400158
10. GBD 2017 Causes of Death Collaborators (Gregory A. Roth et al.) (2018) Global, regional, and national age-sex-specific mortality for 282 causes of death in 195 countries and territories, 1980–2017: a systematic analysis for the Global Burden of Disease Study 2017. Lancet, 392, 1736–1788. https://doi.org/10.1016/S0140-6736(18)32203-7
11. Gimona, M. (2006) Protein linguistics – a grammar for modular protein assembly? Nature Rev. Mol. Cell. Biol., 7, 68–73. https://doi.org/10.1038/nrm1785
12. Gorbalenya, A. E., Enjuanes, L., Ziebuhr, J. and Snijder, E. J. (2006) Nidovirales: Evolving the largest RNA virus genome. Virus Research, 117, 17–37. https://doi.org/10.1016/j.virusres.2006.01.017
13. Grzybek, P. (2007) On the systematic and system-based study of grapheme frequencies: A reanalysis of German letter frequencies. Glottometrics, 15, 82–91.
14. Huang, C.-R. and Lo, S. J. (2010) Evolution and diversity of the human Hepatitis D virus genome. Advances in Bioinformatics, 2010, 323654. https://doi.org/10.1155/2010/323654
15. Huang, H.-H. and Yu, C. (2016) Clustering DNA sequences using the out-of-place measure with reduced n-grams. Journal of Theoretical Biology, 406, 61–72. https://doi.org/10.1016/j.jtbi.2016.06.029
16. Ji, S. (2020) The molecular linguistics of DNA: Letters, words, sentences, texts, and their meanings. In Burgin, M. and Dodig-Crnkovic, G. (eds.), Theoretical Information Studies: Information in the World. Singapore: World Scientific, 187–231.
17. Johnson, N. L., Kemp, A. W., and Kotz, S. (2005) Univariate Discrete Distributions, 3rd edition. John Wiley & Sons, Inc., Hoboken, New Jersey.
18. Kim, D.-H., Choe, Y. J., and Jeong, J.-Y. (2020) Understanding and interpretation of Case Fatality Rate of Coronavirus Disease 2019. J. Korean Med. Sci., 35, e137. https://doi.org/10.3346/jkms.2020.35.e137
19. Kolb, V. M. (2007) On the applicability of the Aristotelian principles to the definition of life. International Journal of Astrobiology, 6, 51–57. https://doi.org/10.1017/S147355040700356
20. Liang, Y. (2014) Analysis of DNA motifs in the human genome. PhD dissertation, The City University of New York; CUNY Academic Works. https://academicworks.cuny.edu/gc_etds/63
21. Licon, A., Taufer, M., Leung, M.-Y., and Johnson, K. L. (2010) A dynamic programming algorithm for finding the optimal segmentation of an RNA sequence in secondary structure predictions. In 2nd Int. Conf. Bioinform. Comput. Biol., 165–170.
22. Lin, J.-J., Bhattacharjee, M. J., Yu, C.-P., Tseng, Y. Y., and Li, W.-H. (2019) Many human RNA viruses show extraordinarily stringent selective constraints on protein evolution. Proc. Natl Acad. Sci., 116, 19009–19018. https://doi.org/10.1073/pnas.1907626116
23. Mačutek, J. (2008) A generalization of the geometric distribution and its application in quantitative linguistics. Romanian Reports in Physics, 60, 501–509.
24. Melkus, G., Rucevskis, P., Celms, E., Čerāns, K., Freivalds, K., Kikusts, P., Lace, L., Opmanis, M., Rituma, D. and Viksna, J. (2020) Network motif-based analysis of regulatory patterns in paralogous gene pairs. Journal of Bioinformatics and Computational Biology, 18, 2040008. https://doi.org/10.1142/S0219720020400089
25. Pevzner, P. A., Borodovsky, M. Yu. and Mironov, A. A. (1989) Linguistics of nucleotide sequences I: The significance of deviations from mean statistical characteristics and prediction of the frequencies of occurrence of words. Journal of Biomolecular Structure and Dynamics, 6, 1013–1026. https://doi.org/10.1080/07391102.1989.10506528
26. Qian, H. (2013) Stochastic physics, complex systems and biology. Quantitative Biology, 1, 50–53. https://doi.org/10.1007/s40484-013-0002-6
27. Reich, N. G., Lessler, J., Cummings, D. A. T., and Brookmeyer, R. (2012) Estimating absolute and relative case fatality ratios from infectious disease surveillance data. Biometrics, 68, 598–606. https://doi.org/10.1111/j.1541-0420.2011.01709.x
28. Ren, J., Song, K., Deng, C., Ahlgren, N. A., Fuhrman, J. A., Li, Y., Xie, X., Poplin, R., and Sun, F. (2020) Identifying viruses from metagenomic data using deep learning. Quantitative Biology, 8, 64–77. https://doi.org/10.1007/s40484-019-0187-4
29. Rovenchak, A. (2018) Telling apart Felidae and Ursidae from the distribution of nucleotides in mitochondrial DNA. Modern Physics Letters B, 32, 1850057. https://doi.org/10.1142/S0217984918500574
30. Saberi, A., Gulyaeva, A. A., Brubacher, J. L., Newmark, P. A. and Gorbalenya, A. E. (2018) A planarian nidovirus expands the limits of RNA genome size. PLOS Pathogens, 14, e1007314. https://doi.org/10.1371/journal.ppat.1007314
31. Saldanha, J. A., Thomas, H. C., and Monjardino, J. P. (1990) Cloning and sequencing of RNA of hepatitis delta virus isolated from human serum. Journal of General Virology, 71, 1603–1606. https://doi.org/10.1099/0022-1317-71-7-1603
32. Searls, D. B. (1992) The linguistics of DNA. American Scientist, 80, 579–591.
33. Singh, S., Yang, Y., Póczos, B., and Ma, J. (2019) Predicting enhancer-promoter interaction from genomic sequence with deep neural networks. Quantitative Biology, 7, 122–137. https://doi.org/10.1007/s40484-019-0154-0
34. Su, S., Wong, G., Shi, W., Liu, J., Lai, A. C. K., Zhou, J., Liu, W., Bi, Y., and Gao, G. F. (2016) Epidemiology, Genetic Recombination, and Pathogenesis of Coronaviruses. Trends in Microbiology, 24, 490–502. https://doi.org/10.1016/j.tim.2016.03.003
35. Tomović, A., Janičić, P., and Kešelj, V. (2006) n-Gram-based classification and unsupervised hierarchical clustering of genome sequences. Computer Methods and Programs in Biomedicine, 81, 137–153. https://doi.org/10.1016/j.cmpb.2005.11.007
36. Villarreal, L. P. (2004) Are viruses alive? Sci. Amer. December, 100–105.
37. Wang, J.-D. (2013) Comparing virus classification using genomic materials according to different taxonomic levels. Journal of Bioinformatics and Computational Biology, 11, 1343003. https://doi.org/10.1142/S0219720013430038
38. Wilson, A. (2013) Probability distributions of grapheme frequencies in Irish and Manx. Journal of Quantitative Linguistics, 20, 169–177. https://doi.org/10.1080/09296174.2013.799919
39. Wimmer, G. and Altmann, G. (1999) Thesaurus of univariate discrete probability distributions, 1st ed. Essen: Stamm.
40. Wu, F., Zhao, S., Yu, B., Chen, Y.-M., Wang, W., Song, Z.-G., Hu, Y., Tao, Z.-W., Tian, J.-H., Pei, Y.-Y., et al. (2020) A new coronavirus associated with human respiratory disease in China. Nature, 579, 265–269. https://doi.org/10.1038/s41586-020-2008-3

A Tables of data

Table 3

Fitting parameters for the distributions of four-nucleotide chunks. Columns 4–7 refer to the negative hypergeometric (NH) distribution, columns 8–11 to the Pólya distribution.

Virus | Entropy S | Size (chunks) | NH K | NH M | NH n | NH C | Pólya s | Pólya p | Pólya n | Pólya C
A/H1N1 | 5.3515 | 3345 | 2.4536 | 0.7918 | 258 | 0.0057 | 0.4156 | 0.3209 | 259 | 0.006
Ball python nidov. | 5.3385 | 8363 | 2.1873 | 0.7087 | 256 | 0.0043 | 0.4711 | 0.3227 | 256 | 0.005
Dengue | 5.3219 | 2681 | 2.3958 | 0.7561 | 256 | 0.0051 | 0.427 | 0.3144 | 256 | 0.0055
Ebola | 5.4002 | 4741 | 2.3911 | 0.8336 | 256 | 0.0008 | 0.4154 | 0.3462 | 256 | 0.001
Feline-CoV | 5.3357 | 7339 | 2.7359 | 0.8759 | 257 | 0.0011 | 0.3647 | 0.3186 | 257 | 0.0011
HCoV-229E | 5.2899 | 6830 | 2.6172 | 0.7907 | 256 | 0.0014 | 0.3772 | 0.3007 | 257 | 0.0013
HCoV-HKU1 | 5.1491 | 7482 | 2.7836 | 0.7153 | 260 | 0.0057 | 0.3707 | 0.2506 | 261 | 0.007
HCoV-NL63 | 5.1738 | 6889 | 2.7545 | 0.7344 | 256 | 0.0035 | 0.3754 | 0.2644 | 255 | 0.0042
HCoV-OC43 | 5.2854 | 7686 | 2.651 | 0.7918 | 258 | 0.0027 | 0.3879 | 0.2975 | 257 | 0.0031
Hepatitis A | 5.1923 | 1870 | 2.7578 | 0.818 | 239 | 0.0079 | 0.3546 | 0.2877 | 243 | 0.0075
Hepatitis C | 5.3871 | 2412 | 2.4158 | 0.8378 | 254 | 0.0029 | 0.4173 | 0.3454 | 254 | 0.003
Hepatitis D | 4.9309 | 421 | 2.1249 | 0.6739 | 178 | 0.0333 | 0.4566 | 0.3217 | 178 | 0.0342
Hepatitis E | 5.3405 | 1794 | 2.3837 | 0.7811 | 254 | 0.007 | 0.4297 | 0.3251 | 254 | 0.0077
HIV-1 | 5.2425 | 2296 | 2.509 | 0.7853 | 239 | 0.006 | 0.409 | 0.3099 | 239 | 0.0066
HIV-2 | 5.3114 | 2590 | 2.5607 | 0.8015 | 256 | 0.003 | 0.3892 | 0.3112 | 256 | 0.0029
HRV-A | 5.2618 | 1785 | 2.753 | 0.8492 | 248 | 0.0081 | 0.3418 | 0.2981 | 254 | 0.0077
HRV-B | 5.2793 | 1802 | 2.6419 | 0.8706 | 238 | 0.0043 | 0.3774 | 0.3262 | 239 | 0.0042
HRV-C | 5.3165 | 1736 | 2.5766 | 0.8688 | 243 | 0.0033 | 0.389 | 0.3353 | 243 | 0.0033
Marburg | 5.3418 | 4779 | 2.5061 | 0.8225 | 252 | 0.002 | 0.4025 | 0.3267 | 253 | 0.0021
Measles | 5.4293 | 3974 | 2.3932 | 0.8767 | 256 | 0.0022 | 0.4186 | 0.3655 | 256 | 0.0022
MERS | 5.4040 | 7530 | 2.4687 | 0.8665 | 256 | 0.0011 | 0.402 | 0.3498 | 257 | 0.0013
Norovirus | 5.4015 | 1900 | 2.3835 | 0.8461 | 253 | 0.0046 | 0.4244 | 0.3536 | 254 | 0.0045
Phage-MS2 | 5.3680 | 893 | 2.31 | 0.8084 | 249 | 0.0107 | 0.4436 | 0.3496 | 249 | 0.0114
Planidovirus | 5.0360 | 10295 | 3.0466 | 0.7017 | 261 | 0.0183 | 0.2449 | 0.1863 | 315 | 0.0161
Polio | 5.3837 | 1860 | 2.43 | 0.8423 | 254 | 0.0044 | 0.4172 | 0.3452 | 254 | 0.0047
Rabies | 5.3802 | 2982 | 2.4359 | 0.8425 | 252 | 0.0027 | 0.4148 | 0.3443 | 253 | 0.0028
SARS | 5.3825 | 7438 | 2.6599 | 0.9058 | 256 | 0.002 | 0.3716 | 0.3395 | 257 | 0.0019
SARS-CoV-2 | 5.3330 | 7476 | 2.826 | 0.9014 | 258 | 0.0022 | 0.3546 | 0.3168 | 258 | 0.0021
Yellow fever | 5.3430 | 2716 | 2.6168 | 0.8421 | 258 | 0.005 | 0.3962 | 0.3206 | 257 | 0.0059
Zika | 5.3377 | 2699 | 2.2919 | 0.73 | 257 | 0.0081 | 0.4142 | 0.3181 | 258 | 0.0053

Note: Entropies S are calculated for the distributions of four-nucleotide chunks using Equation (6).

Table 4

Parameters for the distributions of nucleotide sequences separated by a specific nucleotide.

Virus | Entropy S | Size ("words") | Size (bases) | Mean length m_1 | Length dispersion m_2

A considered a "space" even if not being the most frequent:

A/H1N1 | 3.5446 | 4456 | 13371 | 2.0025 | 6.4378
Ball python nidovirus | 3.6911 | 11118 | 33452 | 2.0089 | 5.6785
Dengue | 3.5204 | 3554 | 10723 | 2.0174 | 6.2980
Ebola | 3.7703 | 6056 | 18962 | 2.1313 | 6.8261
Feline-CoV | 4.0381 | 8572 | 29355 | 2.4246 | 8.8222
HCoV-229E | 4.1411 | 7421 | 27317 | 2.6812 | 11.8059
HCoV-HKU1 | 4.0653 | 8332 | 29926 | 2.5918 | 9.7058
HCoV-NL63 | 4.2082 | 7254 | 27553 | 2.7985 | 11.8171
HCoV-OC43 | 4.1871 | 8503 | 30741 | 2.6154 | 9.6729
Hepatitis A | 3.7125 | 2189 | 7478 | 2.4166 | 9.6428
Hepatitis C | 4.7418 | 1890 | 9646 | 4.0349 | 23.7224
Hepatitis D | 3.7600 | 340 | 1682 | 3.9500 | 30.1122
Hepatitis E | 4.9569 | 1231 | 7176 | 4.8302 | 30.5829
HIV-1 | 3.3022 | 3273 | 9181 | 1.8054 | 5.3819
HIV-2 | 3.4121 | 3507 | 10359 | 1.9541 | 6.3979
HRV-A | 3.4610 | 2389 | 7137 | 1.9879 | 6.3770
HRV-B | 3.5822 | 2339 | 7208 | 2.0821 | 6.3131
HRV-C | 3.6362 | 2177 | 6944 | 2.1902 | 7.4778
Marburg | 3.6623 | 6256 | 19114 | 2.0555 | 6.5991
Measles | 4.0685 | 4639 | 15894 | 2.4264 | 7.8423
MERS | 4.3936 | 7901 | 30119 | 2.8122 | 11.0646
Norovirus | 3.9312 | 2094 | 7600 | 2.6299 | 10.4595
Phage MS2 | 4.1385 | 836 | 3569 | 3.2703 | 15.0130
Planidovirus | 3.0356 | 16361 | 41178 | 1.5169 | 3.6413
Polio | 3.8237 | 2207 | 7440 | 2.3715 | 8.3277
Rabies | 3.9758 | 3419 | 11928 | 2.4890 | 8.9910
SARS | 4.1112 | 8482 | 29751 | 2.5077 | 9.4794
SARS-CoV-2 | 3.9369 | 8955 | 29903 | 2.3394 | 8.6559
Yellow fever | 3.9853 | 2964 | 10862 | 2.6650 | 11.2174
Zika | 4.0105 | 2992 | 10794 | 2.6080 | 9.3647

C is the most frequent:

Hepatitis C | 3.8192 | 2894 | 9646 | 2.3334 | 8.7827
Hepatitis D | 3.1128 | 505 | 1682 | 2.3327 | 13.0101
Hepatitis E | 3.4866 | 2305 | 7176 | 2.1137 | 8.2778
Phage MS2 | 4.0693 | 934 | 3569 | 2.8223 | 9.9534

G is the most frequent:

Yellow fever | 3.8711 | 3088 | 10862 | 2.5178 | 10.0036
Zika | 3.8499 | 3140 | 10794 | 2.4379 | 9.3213