Dinucleotide repeats in coronavirus SARS-CoV-2 genome: evolutionary implications
Changchuan Yin ∗
Department of Mathematics, Statistics, and Computer Science, University of Illinois at Chicago, Chicago, IL 60607, USA
[email protected]
Abstract
The ongoing global pandemic of the infectious disease COVID-19, caused by the 2019 novel coronavirus (SARS-CoV-2, formerly 2019-nCoV), has presented critical threats to public health and the economy since it was identified in China in December 2019. The genome of SARS-CoV-2 has been sequenced and structurally annotated, yet little is known of the intrinsic organization and evolution of the genome. To this end, we present a mathematical method for the genomic spectrum, a kind of barcode, of SARS-CoV-2 and common human coronaviruses. The genomic spectrum is constructed according to the periodic distributions of nucleotides and therefore reflects the unique characteristics of the genome. The results demonstrate that coronavirus SARS-CoV-2 exhibits dinucleotide TT islands in the non-structural proteins 3, 4, 5, and 6. Further analysis of the dinucleotide regions suggests that the dinucleotide repeats increased during evolution and may confer evolutionary fitness on the virus. The special dinucleotide regions in the SARS-CoV-2 genome identified in this study may become diagnostic and pharmaceutical targets in monitoring and curing the COVID-19 disease.

Keywords: COVID-19, SARS-CoV-2, genomic spectrum, dinucleotide repeats, evolution fitness, virulence
Highlights
• We present the genomic spectrum of the SARS-CoV-2 genome as a genomic signature. The genomic spectrum illustrates the periodic distributions of nucleotides in the genome.
• The SARS-CoV-2 genomic spectrum displays pronounced dinucleotide TT repeat islands in ORF1a (nsps 3-6).
• Dinucleotide TT repeat islands in ORF1a (nsps 3-6) of the SARS-CoV-2 genome correlate with evolution fitness and can be considered pathogen-associated molecular patterns (PAMPs).
∗ Corresponding author, [email protected]. arXiv.org, June 2, 2020.

The current global pandemic of COVID-19 caused by the novel coronavirus SARS-CoV-2, formerly 2019-nCoV, has been a severe threat to public health and the economy since it emerged in Wuhan, China, in December 2019. As of May 29, 2020, 5.7 million COVID-19 cases have been confirmed worldwide, and 357,688 deaths have occurred from COVID-19 disease (WHO, 2020). SARS-CoV-2 is the aetiological agent and is responsible for a large-scale outbreak of fatal disease. Understanding the notable features of the SARS-CoV-2 genome with respect to its zoonotic origin and evolutionary trend is important for revealing intervention targets and for disease control and prevention.

Coronaviruses (CoVs) are the largest group of enveloped, positive-sense, single-stranded RNA viruses. Taxonomically, coronavirus SARS-CoV-2 belongs to the order Nidovirales, family Coronaviridae, subfamily Coronavirinae, and genus beta-CoV (Fehr and Perlman, 2015). SARS-CoV-2 is highly pathogenic, with pathogenicity similar to or lower than that of severe acute respiratory syndrome (SARS) coronavirus (SARS-CoV) in 2002–2003 and lower than that of Middle-East respiratory syndrome coronavirus (MERS-CoV) in 2012. Nevertheless, SARS-CoV-2 is highly transmissible in humans. To date, seven human CoVs (HCoVs) have been identified. Two of them, HCoV-229E and HCoV-NL63, are alpha-CoVs. The remaining five are beta-CoVs: HCoV-OC43, HCoV-HKU1, SARS-CoV, MERS-CoV, and SARS-CoV-2. The four common human CoVs, HCoV-229E, HCoV-OC43, HCoV-NL63, and HCoV-HKU1, usually cause mild symptoms such as the common cold and/or diarrhea (Su et al., 2016). In contrast, SARS-CoV, MERS-CoV, and the current SARS-CoV-2 are particularly pathogenic, causing SARS, MERS, and COVID-19, respectively.

Because of the resemblance of SARS-CoV-2 to SARS-like coronaviruses, bats are likely to act as natural reservoir hosts of the progenitor of SARS-CoV-2 (Andersen et al., 2020).
It is established that SARS-CoV was transmitted from palm civets to humans (Wang et al., 2006) and MERS-CoV from dromedary camels to humans, yet the zoonotic origin of SARS-CoV-2 is still unclear. Genome sequence analysis shows that the closest relative of SARS-CoV-2 is the SARS-like coronavirus SLCoV/RaTG13, found in the horseshoe bat (Rhinolophus affinis) from Yunnan, China (Zhou et al., 2020). SARS-CoV-2 has 96.2% overall genome sequence identity to SLCoV/RaTG13 (Zhou et al., 2020). Therefore, the natural reservoir of SARS-CoV-2 could be the horseshoe bat. Recent evidence suggests pangolins as host candidates (Lam et al., 2020); however, pangolins have not been firmly established as hosts. Not all bat SARS-like CoVs can infect humans. For example, a bat coronavirus, swine acute diarrhoea syndrome coronavirus (SADS-CoV), found in 2017 (Zhou et al., 2018), caused millions of piglet deaths but no human cases. We may postulate that SARS-CoV-2 evolved from its progenitor through mutation and gains in evolutionary fitness during host shift and adaptation. SARS-CoV-2 is a fast-evolving pathogen that continuously accumulates mutations over generations of infection in the host (Yin, 2020; Wang et al., 2020; Korber et al., 2020). This fact suggests that SARS-CoV-2 had evolved mutations in critical proteins and genome structures prior to establishing human infection. Understanding how SARS-CoV-2 jumped from animals and adapted to infect humans is important for the surveillance of virus evolution and diversity, and ultimately for controlling the COVID-19 pandemic and preventing future SARS-like outbreaks.

SARS-CoV-2 is an enveloped positive-strand RNA virus with an exceptionally long (29.9 kb) genome (Zhou et al., 2020). The genome consists of a 5' leader cap sequence and a 3' poly(A) tail, genes encoding non-structural proteins (nsps) and structural proteins, and several accessory proteins.
Approximately two-thirds of the genome comprises two large overlapping open reading frames (ORF1a and ORF1ab), encoding polyproteins that are subsequently cleaved by viral proteases to generate 16 non-structural proteins (nsp1 to nsp16). Non-structural proteins are essential for RNA replication, transcription, and immune evasion. Nsp3, a large multi-domain and multi-functional protein, plays essential roles in virus replication. The papain-like protease (PLpro) activity of nsp3 is responsible for the initial processing of the ORF1a protein. In addition, nsp3, together with nsp4 and nsp6, recruits intracellular membranes to form double-membrane vesicles (DMVs) that support viral RNA replication. Nsp5 is a second viral protease (3C-like protease, 3CLpro) that cleaves both the ORF1a and ORF1ab proteins. The downstream regions of the genome encode the structural proteins: the spike (S) protein, the nucleocapsid (N) protein, the envelope (E) protein, and the membrane (M) protein. The four structural proteins are all required to produce a structurally complete viral particle. The S protein mediates viral attachment to the host ACE2 receptor, and the subsequent fusion between the viral and host cell membranes enables the virus to enter host cells. The nucleocapsid (N) protein, one of the most abundant viral proteins, can bind to the RNA genome and participates in replication, assembly, and the host cell response during viral infection (McBride et al., 2014).

The SARS-CoV-2 genome has been sequenced and annotated, but little is known about the complex structure of the genome in light of evolutionary fitness and host infectivity. Conventionally, RNA viruses use encoded proteins to interact with the components of the cellular response.
However, numerous discoveries show that virus RNA structures, determined by genome composition, may also play central roles in maximizing virus replication and evolutionary fitness (Jensen and Thomsen, 2012). For example, recent studies show that increased CG and TA dinucleotides in both coding and non-coding regions of echovirus 7 inhibit replication initiation during post-entry in several cell lines (Fros et al., 2017). Therefore, RNA viruses simulate host mRNA composition, for example, the dinucleotide frequencies (Fros et al., 2017). Animal genomes have a bias in their dinucleotide composition, and the heavy under-representation of CG and TA dinucleotides is especially well known. Most animal RNA and small DNA viruses suppress genomic CG and TA dinucleotide frequencies, apparently mimicking host mRNA composition (Di Giallonardo et al., 2017). If a virus RNA composition or structure is very different from host mRNA, RIG-I (retinoic acid-inducible gene I)-like receptors may detect RNA molecules that are absent from the uninfected host (Goubau et al., 2013). Detecting evolutionarily conserved microbial structures known as pathogen-associated molecular patterns (PAMPs) is an important feature of the innate immune system. Host cells possess intrinsic defense pathways that prevent replication of viruses with increased CG and TA frequencies through mechanisms independent of codon usage (Belalov and Lukashev, 2013).

The genomic spectrum demonstrates dinucleotide, trinucleotide, and multi-nucleotide distributions. The dinucleotide distributions are often considered the signature of a genome (Karlin and Burge, 1995), and the 3-periodicity patterns are distinguishing characteristics of protein-coding regions (Tsonis et al., 1991). The strengths of the 2-periodicity and 3-periodicity in a genome are determined by the perfect levels and copy numbers of dinucleotide and trinucleotide repeats, respectively.
Because 2-periodicity and 3-periodicity are the essential characteristics of a genome, in this study we examine only these two periodicities from the genomic spectrum when analyzing the genome.

In this study, we present the genomic spectrum of SARS-CoV-2 and identify the dinucleotide repeats in ORF1a (nsps 3-6) that have been instrumental in interacting with host immune systems and in pathogenicity and host infectivity. These dinucleotide repeats are essential to the survival and infectivity of the microbe and can be considered one of the PAMPs in SARS-CoV-2. These genomic elements are evidence of the evolutionary fitness of SARS-CoV-2 in host shift and adaptation. Tracking the evolution of these elements may provide insights into the zoonotic origin of SARS-CoV-2 and the control of COVID-19 disease.
To inspect the intrinsic traits of the SARS-CoV-2 genome, we use our periodicity analysis method to survey the nucleotide distributions and the resulting periodicities in the genome. We previously proposed the periodicity analysis method to quantitatively detect nucleotide repeats and periodicities in a genome (Yin, 2017). The method employs nucleotide distributions on periodic positions in a genome and identifies approximate repeat structures as the signatures of the genome. Because we have added functionality to the original method, such as smoothing the periodicity profile, we describe the method in detail here, although the technical algorithms were delineated previously (Yin, 2017). Our computer programs for the periodicity analysis of a genome are available to the public at the GitHub repository https://github.com/cyinbox/DNADU.
The nucleotide distributions at the periodic positions of a DNA sequence can be represented by a congruence derivative (CD) vector (Yin and Wang, 2016; Yin, 2017). The CD vector of a nucleotide for a specific periodicity is constructed from the cumulative frequencies of the nucleotide at these periodic positions (Definition 2.1).
Definition 2.1.
For a DNA sequence of length $n$, let $u_\alpha(k) = 1$ when the nucleotide $\alpha$ appears at position $k$, and $u_\alpha(k) = 0$ otherwise, where $\alpha \in \{A, T, C, G\}$ and $k = 1, \cdots, n$. The congruence derivative vector of the nucleotide $\alpha$ of the DNA sequence for periodicity $p$ is defined as

$$f_{\alpha,j} = \sum_{\operatorname{mod}(k,p) = j} u_\alpha(k), \quad j = 1, \cdots, p, \quad k = 1, \cdots, n, \tag{1}$$

where $\operatorname{mod}(k, p)$ is the modulo operation and returns the remainder after division of $k$ by $p$ (a remainder of 0 is assigned to $j = p$), and $f_\alpha = (f_{\alpha,1}, f_{\alpha,2}, \cdots, f_{\alpha,p})$.

The four congruence derivative vectors $f_\alpha$ of periodicity $p$ for nucleotides A, T, C, and G form a congruence derivative (CD) matrix of size $4 \times p$. The columns of the CD matrix indicate nucleotide frequencies at the periodic positions $k = pt - q$, where $k$ is the position index of a DNA sequence, $t = 1, 2, \ldots$, and $q = p-1, \ldots, 2, 1, 0$. For example, in the CD matrix of periodicity 5 for a DNA sequence, the first column shows the nucleotide frequencies at periodic positions $k = 1, 6, 11, \ldots, 5t-4$; the second column shows the nucleotide frequencies at periodic positions $k = 2, 7, 12, \ldots, 5t-3$; the third column shows the nucleotide frequencies at periodic positions $k = 3, 8, 13, \ldots, 5t-2$; and so on. The CD matrix of a DNA sequence describes nucleotide frequencies at all periodic positions and can be used to efficiently compute the Fourier power spectrum and determine periodicities in the DNA sequence (Yin and Wang, 2016). Therefore, the CD vector reflects the arrangement of repetitive sequence elements and inner periodicities in the DNA sequence.

Since the CD matrix contains the nucleotide frequencies on periodic positions, the variance of the matrix elements can measure the nucleotide distribution. For the CD matrix of periodicity $p$, the sum of the $4p$ elements of the matrix equals the length $n$ of the DNA sequence, and the mean of the elements of the CD matrix is $\frac{n}{4p}$.
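The construction in Definition 2.1 can be illustrated with a minimal Python sketch. It is not the author's DNADU implementation; the function name `cd_matrix`, the dictionary layout, and the choice to skip ambiguous symbols such as N are assumptions made here for illustration. Columns are stored 0-based, so column index 0 holds the counts for positions 1, 1+p, 1+2p, and so on.

```python
from typing import Dict, List

NUCLEOTIDES = "ATCG"  # row order of the illustrative CD matrix


def cd_matrix(seq: str, p: int) -> Dict[str, List[int]]:
    """Congruence derivative matrix of periodicity p (sketch of Definition 2.1).

    Row alpha, column j (0-based) counts how often nucleotide alpha occurs at
    the 1-based positions k = j+1, j+1+p, j+1+2p, ...
    """
    if p < 1:
        raise ValueError("periodicity p must be >= 1")
    matrix = {a: [0] * p for a in NUCLEOTIDES}
    for k, base in enumerate(seq.upper(), start=1):
        if base in matrix:  # assumption: ambiguous symbols (e.g. N) are skipped
            matrix[base][(k - 1) % p] += 1
    return matrix


# A perfect AT repeat puts every A in the first column and every T in the second.
print(cd_matrix("ATATATATAT", 2))
# {'A': [5, 0], 'T': [0, 5], 'C': [0, 0], 'G': [0, 0]}
```

For a sequence without ambiguous symbols, the entries of the returned matrix sum to the sequence length, matching the statement that the $4p$ elements sum to $n$.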
To quantify the nucleotide distribution over the periodic positions, we define the normalized distribution uniformity (NDU) of a DNA sequence using the CD matrix (Definition 2.2).

Definition 2.2.
For a DNA sequence of length $n$, let $f_{i,j}$ be an element of the CD matrix of periodicity $p$. The normalized distribution uniformity of periodicity $p$ of the DNA sequence is defined as

$$NDU(p) = \frac{1}{n} \sum_{i=1}^{4} \sum_{j=1}^{p} \left( f_{i,j} - \frac{n}{4p} \right)^2. \tag{2}$$

From Definition 2.2, we notice that the normalized distribution uniformity at periodicity $p$ is an intuitive description of the level of imbalance of nucleotide frequencies on periodic positions. It is a quadratic function of the nucleotide frequencies and depends on the sequence length and the periodicity length. NDU(p) can be used to indicate the existence of the periodicity $p$ in a DNA sequence. This method offers an elaboration of the repetitive elements, such as the repeat consensus, copy number, and perfect level (Yin, 2017).

When using a sliding window along a genome, the periodicities in the range of 2 to 10 are calculated in each window segment. Therefore, a two-dimensional periodicity spectrum is formed for the genome. The two-dimensional spectrum can be considered the genomic signature or the barcode signature of the genome.

To locate the positions of a repeat region in a genome, we smooth and filter the corresponding sliding-window periodicity by moving average convolution (De Jong, 1989). Then the peaks in the periodicity profile are detected using the Z-score algorithm (Brakel, 2020). The peak positions are used to demarcate repeats in a genome.

In a nutshell, to compute the distribution uniformities of different periodicities of a DNA sequence, we first scan the sequence in different periodicity sizes, construct the congruence derivative matrix of each periodicity, and compute the NDU(p) of these periodicities $p$. The periodicity with the maximum distribution uniformity reflects the predominant pattern of repetitive elements. The NDU values of periodicities indicate the perfect levels and copy numbers of the corresponding repeat regions.
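The NDU computation and the sliding-window spectrum summarized above can be sketched as follows. This is a hedged illustration rather than the published DNADU code: the function names `ndu` and `spectrum` are hypothetical, the non-overlapping step of 250 bp is an assumption (the text states only the window size), and the normalization follows the $1/n$ and $n/(4p)$ terms as reconstructed in Equation (2).

```python
from typing import Iterable, List

NUCLEOTIDES = "ATCG"


def ndu(seq: str, p: int) -> float:
    """Normalized distribution uniformity NDU(p), following Definition 2.2."""
    seq = seq.upper()
    n = len(seq)
    if n == 0 or p < 1:
        raise ValueError("need a non-empty sequence and p >= 1")
    # CD matrix: counts[alpha][j] = occurrences of alpha at positions j, j+p, ...
    counts = {a: [0] * p for a in NUCLEOTIDES}
    for k, base in enumerate(seq):
        if base in counts:  # assumption: ambiguous symbols are skipped
            counts[base][k % p] += 1
    mean = n / (4 * p)
    return sum((f - mean) ** 2 for row in counts.values() for f in row) / n


def spectrum(genome: str, window: int = 250, step: int = 250,
             periods: Iterable[int] = range(2, 11)) -> List[List[float]]:
    """Two-dimensional spectrum: one row per window, one column per periodicity."""
    periods = list(periods)
    rows = []
    for start in range(0, len(genome) - window + 1, step):
        segment = genome[start:start + window]
        rows.append([ndu(segment, p) for p in periods])
    return rows


# A perfect dinucleotide repeat has a much stronger 2-periodicity than 3-periodicity.
print(ndu("AT" * 50, 2))  # 37.5
print(ndu("AT" * 50, 2) > ndu("AT" * 50, 3))  # True
```

Plotting the rows of `spectrum(...)` as a heat map (window index against periodicity) would give a barcode-like picture of the kind described in the text; the absolute scale depends on the window length, so values are comparable only between windows of equal size.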
Figure 1: Genomic spectra of SARS-CoV-2 and SARS-related CoVs (SLCoVs). (a) SARS-CoV-2. (b) SLCoV/RaTG13. (c) SARS-CoV/Tor2. (d) MERS-CoV. The sliding window is 250 bp.
To identify the signature features of the coronavirus SARS-CoV-2 genome, we employ the periodicity spectrum analysis to identify the characteristic periodicities in the genome. We create the genomic spectrum (barcode) of SARS-CoV-2 (Fig. 1(a)) using the sliding-window NDU method and compare it with the counterparts of SLCoV/RaTG13 (Fig. 1(b)), SARS-CoV/Tor2 (Fig. 1(c)), and MERS-CoV (Fig. 1(d)). From the spectrum comparison, we observe that SARS-CoV-2 and SLCoV/RaTG13 both have pronounced 2-periodicity in four regions, while SARS-CoV/Tor2 and MERS-CoV have only an extremely low level of 2-periodicity in the corresponding regions. The strong dinucleotide signal in SARS-CoV-2 encouraged us to investigate the causes in detail.

To locate the regions rich in dinucleotide repeats, we verify that 2-periodicity and 3-periodicity are strong signals among all genomic periodicities (Fig. 2(a, d)) and detect the peaks of the sliding-window periodicity profiles (Fig. 2(b, c)). The peak positions are used to demarcate the dinucleotide repeat regions in the genome. The dinucleotide repeat regions (dinucleotide islands) are in ORF1a, and the corresponding genes are listed in Table 1.

From the genomic spectrum analysis, the relative abundance of the dinucleotide repeats, particularly dinucleotide TT, is mapped to the genes of ORF1a (nsp3, nsp4, and nsp6) (Fig. 3 and Table 1). However, these dinucleotide repeat signals are weak or imperceptible in the corresponding regions of the SARS-CoV and MERS-CoV genomes.
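The step of smoothing a sliding-window periodicity profile and demarcating peak regions can be sketched as below. This is a simplified stand-in, not the procedure used in the paper: it applies a centered moving average and a single global z-score threshold, whereas the cited Z-score algorithm (Brakel, 2020) is a smoothed, lag-based variant. The names `moving_average` and `peak_regions`, the smoothing width, and the threshold value are all assumptions chosen for illustration.

```python
from statistics import mean, pstdev
from typing import List, Tuple


def moving_average(values: List[float], width: int) -> List[float]:
    """Centered moving average; the averaging window shrinks at the edges."""
    half = width // 2
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - half): i + half + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


def peak_regions(profile: List[float], width: int = 3,
                 z_threshold: float = 1.0) -> List[Tuple[int, int]]:
    """Return (first, last) window indices of runs whose smoothed value lies
    more than z_threshold standard deviations above the profile mean."""
    smooth = moving_average(profile, width)
    mu, sigma = mean(smooth), pstdev(smooth)
    if sigma == 0:  # flat profile: nothing stands out
        return []
    regions, start = [], None
    for i, value in enumerate(smooth):
        above = (value - mu) / sigma > z_threshold
        if above and start is None:
            start = i
        elif not above and start is not None:
            regions.append((start, i - 1))
            start = None
    if start is not None:
        regions.append((start, len(smooth) - 1))
    return regions


# Toy 2-periodicity profile: 25 windows with one elevated block in the middle.
toy_profile = [1] * 10 + [10] * 5 + [1] * 10
print(peak_regions(toy_profile))  # [(10, 14)]
```

With a window step of `step` bp and a window length of `window` bp, a region `(first, last)` would correspond roughly to genome coordinates `first * step` through `last * step + window`; that mapping is likewise an assumption about how window indices are laid out.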
Figure 2: Genomic spectra of the region 6 kb - 12 kb of the SARS-CoV-2/Wuhan-Hu-1 genome (GenBank: NC_045512). (a) The periodicity magnitudes. (b) The 2-periodicity spectrum of 250 bp sliding windows. (c) The 3-periodicity spectrum of 250 bp sliding windows. (d) The genomic spectra of 250 bp sliding windows.

Table 1: The dinucleotide repeats in the SARS-CoV-2/Wuhan-Hu-1 genome (GenBank: NC_045512).

region        location      2-periodicity  consensus  perfection level  protein
sub-region 1  6227:7886     6.9320         TT         0.3599            nsp3
sub-region 2  9101:10273    7.2362         TT         0.3696            nsp4
sub-region 3  10850:12000   6.4728         TT         0.3783            3CLPro and nsp6

In the coronavirus SARS-CoV-2 RNA genome, the replicase gene of 20 kb encodes two overlapping polyproteins, ORF1a (replicase 1a) and ORF1ab (replicase 1ab). The genome structure and the corresponding dinucleotide regions identified are illustrated in Fig. 3. The two polyproteins are responsible for viral replication and transcription (Chen et al., 2020). The expression of the C-proximal portion of pp1ab requires (-1) ribosomal frame-shifting. The first dinucleotide repeat is in the coding region of the papain-like proteinase (PL proteinase, non-structural protein 3, nsp3). Nsp3 is the largest essential component of the replication and transcription complex. The PL proteinase in nsp3 cleaves nsps 1-3 and blocks the host innate immune response, promoting cytokine expression (Lei et al., 2018; Serrano et al., 2009). The second dinucleotide repeat is in the coding region of non-structural protein 4 (nsp4). Nsp4 is responsible for forming double-membrane vesicles (DMVs). The third dinucleotide repeat is in the coding region of the C-terminus of the 3CLPro protease (3 chymotrypsin-like proteinase, 3CLpro) and nsp6. The 3CLPro protease is essential for RNA replication. The 3CLPro proteinase is responsible for processing the C-terminus of nsp4 through nsp16 in all coronaviruses (Anand et al., 2003).
Therefore, the conserved structure and catalytic sites of 3CLpro may serve as attractive targets for antiviral drugs (Kim et al., 2012). Together, nsp3, nsp4, and nsp6 can induce DMVs (Angelini et al., 2013).

Figure 3: Paradigm of dinucleotide repeats in the SARS-CoV-2 genome organization. The paradigm is drawn according to the reference genomes of SARS-CoV-2 (NC_045512) and SARS-CoV/Tor2 (NC_004718).

In summary, the dinucleotide repeat islands found in this study are located in the host-interaction regions of the SARS-CoV-2 genome. These special dinucleotide repeat regions in ORF1a most likely contribute to the adaptive immune response, therefore implying evolution fitness.

Coincidentally, previous work on MERS-CoV using co-evolution analysis revealed that nsp3 represents a preferential selection target in the adaptive evolution of zoonotic MERS-CoV to a new host (Forni et al., 2016). Our finding that nsp3 is involved in evolution fitness is consistent with the discovery in MERS-CoV. We investigate the correlation between the contents of dinucleotides AA and TA in SARS-CoV genomes and virulence. We examine the dinucleotides in the genomic region 6 kb - 12 kb. The increased 2-periodicity in the genomic regions of SARS-CoV-2 and SARS-like CoVs is the result of the unbalanced distributions of dinucleotides.
To understand the evolutionary tendency of coronavirus genomes, we examine the genomic spectra of four major bat SARS-like coronaviruses (SLCoVs), pangolin-SLCoV, SLCoV/ZXC21, SLCoV/WIV1, and SLCoV/Shaanxi2011, all of which naturally live in horseshoe bats (Rhinolophidae). Because pangolin-SLCoV was found to be similar to SARS-CoV-2, the pangolin was tentatively postulated as an intermediate animal host of SARS-CoV-2 (). SLCoV/ZXC21 is the second most similar strain to SARS-CoV-2, with 82% similarity (Hu et al., 2018). SLCoV/WIV1 is closely related to SARS-CoV/Tor2 in terms of genome identity and ACE2 binding in human cells (Ge et al., 2013). SLCoV/Shaanxi2011 was found in 2011 (Yang et al., 2013).

The results show that pangolin-SLCoV displays three major dinucleotide repeats in nsp3 and nsp4 but lacks the corresponding dinucleotide repeats in 3CLPro and nsp6 that are found in SARS-CoV-2 (Fig. 4(a)). SARS-CoV-2 is therefore closest to SLCoV/RaTG13 (Fig. 1(b)) and SLCoV/ZXC21 (Fig. 4(b)), not to pangolin-SLCoV. If pangolins are the intermediary hosts of SARS-CoV-2, and SARS-CoV-2 indeed evolved from pangolin-SLCoV, we may infer that pangolin-SLCoV would have needed to evolve dinucleotide repeats in 3CLPro and nsp6 to gain evolutionary fitness before infecting human hosts.
Figure 4: Genomic spectra of SARS-like CoVs (SLCoVs). (a) pangolin-SLCoV. (b) SLCoV/ZXC21. (c) SLCoV/WIV1. (d) SLCoV/Shaanxi2011. The sliding window is 250 bp.

We also observed an evolutionary trend between SLCoV/WIV1 (Fig. 4(c)) and SLCoV/Shaanxi2011 (Fig. 4(d)). The genomic spectrum of SLCoV/Shaanxi2011 is similar to that of SLCoV/WIV1 but has an additional, increased dinucleotide repeat in 3CLPro and nsp6. This new dinucleotide repeat in the region of 3CLPro and nsp6 in SLCoV/Shaanxi2011 is consistent with the regions found in SARS-CoV-2, SLCoV/RaTG13, and SLCoV/ZXC21. Therefore, the dinucleotide repeat in the region of 3CLPro and nsp6 probably plays an important role in the evolutionary fitness of SARS-CoV-2 in human hosts.

The results show that only SARS-CoV and SARS-like CoVs have low dinucleotide repeats. Low dinucleotide content can be regarded as an early stage of evolutionary fitness in interaction with the human immune system. Low dinucleotide content may then render high virus virulence, because the virus has not adapted to the host immune system and the host immune system reacts intensely.
To investigate the correlation between dinucleotide repeats and the pathogenicity of coronaviruses, we produce and compare the genomic spectra of four common human coronaviruses (Fig. 5). Classical human coronavirus 229E (HCoV-229E) and human coronavirus OC43 (HCoV-OC43) were identified in 2004. The two viruses are close relatives, and their characteristics are similar with respect to human pathogenicity. Both HCoV-229E and HCoV-OC43 can infect young children, the elderly, and people with low immune function. Almost 100% of children are infected in early childhood, mainly with self-limiting upper respiratory infections, such as the common cold, and intestinal infections. Symptoms caused by the HCoV-OC43 strain are generally more severe than those caused by the HCoV-229E virus. From the genomic spectrum analysis, we observe higher 2-periodicity in HCoV-229E than in HCoV-OC43 (Fig. 5(a, c)). These dinucleotide repeat regions correspond to the three sub-regions in SARS-CoV-2. A high 2-periodicity value may attenuate virus replication and therefore reduce severe virulence. This agrees with the correlation between dinucleotide repeats and pathogenicity described previously.

Figure 5: Genomic spectra of common human coronaviruses. (a) HCoV-229E. (b) HCoV-NL63. (c) HCoV-OC43. (d) HCoV-HKU1. The sliding window is 250 bp.

The spectra of HCoV-NL63 and HCoV-HKU1 demonstrate extremely high 2-periodicity, as well as 3-periodicity, in the corresponding regions (Fig. 5(b, d)). HCoV-NL63 and HCoV-HKU1 are the most common human CoVs and cause only mild cold symptoms or no symptoms (Pyrc et al., 2007). Again, these high dinucleotide repeats may contribute to the light pathogenicity of these two viruses.
We wish to know whether SARS-CoV-2 originally evolved from SARS-CoV, or whether SARS-CoV-2 would evolve toward SARS-CoV. The answer to this question may help us predict the evolution of the SARS-CoV-2 virus for better disease prediction and control. Because genomes of SARS-related coronaviruses spanning a long time period are rarely available, we infer the evolution of coronaviruses by tracking the trend of HCoV-229E coronaviruses over the last six decades, starting from the first human-infecting HCoV-229E identified in 1962 (Thiel et al., 2001). The HCoV-229E virus causes the common cold, but occasionally it can be associated with more severe respiratory infections in children, the elderly, and persons with underlying illness. Using the measurements of dinucleotides in the genomic regions, the trend of HCoV-229E coronaviruses may be used to infer the evolutionary stages of bat coronaviruses. With a similar method, we may then determine the origin of SARS-CoV-2 by comparing the trend of SARS-CoV-2 with that of SARS-like coronaviruses.

The human coronavirus HCoV-229E strains used in the dinucleotide trend analysis are from different historical periods. The reference genome of HCoV-229E in the evolutionary analysis was obtained from the infectious HCoV-229E, the 1973-deposited laboratory-adapted prototype strain of HCoV-229E (VR-740). The HCoV-229E prototype strain was originally isolated in 1962 from a patient in Chicago. The first clinical HCoV-229E isolate from a US patient in 2012 was included in this study (Farsani et al., 2012) (GenBank: JX503060). The HCoV-229E (SC3112) isolate from 2015 is included (GenBank: KY983587). A new human coronavirus HCoV-229E strain was isolated from plasma collected from a Haitian child in 2016 (Bonny et al., 2017) (GenBank: MF542265).

The periodicity trends of the HCoV-229E coronaviruses demonstrate that both the 2-periodicity (Fig. 6(a)) and the 3-periodicity (Fig. 6(b)) in the coronaviruses increase with evolutionary time. SARS-CoV-2 has relatively high 2-periodicity and 3-periodicity.
This result suggests that the 2-periodicity and 3-periodicity increase with time. The evolutionary origin of coronaviruses can be inferred from the trends of 2-periodicity and 3-periodicity. Therefore, we may compare the 2-periodicity and 3-periodicity in SARS-CoV-2 and SARS-like coronaviruses to understand the evolutionary origin of SARS-CoV-2.

To investigate the correlation of dinucleotide repeats with virus virulence in SARS-related coronaviruses, we compare the spectra of the genomic region at coordinates 6k-12k bp of five SARS-related coronaviruses. The genomic regions contain abundant dinucleotide repeats. The region 6k-12k of the genome contains three dinucleotide repeats approximately located in the 6k-8k, 8k-10k, and 10k-12k sub-regions. The five SARS-related coronaviruses have different levels of virulence. The most virulent virus is MERS-CoV, followed by SARS-CoV. The 2-periodicity magnitudes, which reflect the distribution of dinucleotides, are compared and shown in Fig. 7(a, b). We see that the MERS-CoV genomic region has the lowest dinucleotide level in all three dinucleotide sub-regions. SARS-CoV also has a low dinucleotide level, but it is higher than that of MERS-CoV. We may therefore infer that a low dinucleotide level correlates with high virulence: the lower the dinucleotide level, the higher the virus virulence. This postulation is supported by the observation of the dinucleotides in three SARS-related coronaviruses. Compared with SARS-CoV, the SARS-like bat SLCoV/Rp3 has similar dinucleotide distributions in sub-regions 1 (6k-8k) and 3 (10k-12k), but a higher dinucleotide distribution in sub-region 2 (8k-10k). SLCoV/Rp3 has lower virulence than SARS-CoV.
The SARS-like SLCoV/ZXC21, which shares the highest sequence identity with SARS-CoV-2, shows slightly lower dinucleotide distributions in sub-regions 2 and 3, and a slightly higher dinucleotide distribution in sub-region 1.

We notice that the 2-periodicity and 3-periodicity increase from MERS-CoV through SARS-CoV and SARS-like CoVs to SARS-CoV-2 (Fig. 7(a, b)). Based on the previous analysis of the periodicity trends of coronaviruses at different times, we may infer that SARS-CoV-2 originates from the SARS-like CoVs.
It is well established that dinucleotide composition bias in RNA viruses may impact virus replication, specifically by attenuating or strengthening virus virulence during evolution (Fros et al., 2017; Gu et al., 2019). Clinical evidence suggests that SARS-CoV-2 has lower virulence than SARS-CoV. To investigate whether the increased dinucleotide repeats correlate with virus virulence, we compare the dinucleotide frequencies in the three dinucleotide-rich islands of SARS-CoV-2 and SARS-CoV/Tor2. The result shows that both SARS-CoV-2 and SARS-CoV have an abundance of dinucleotides TT and TA in the whole genome and in the three prominent dinucleotide repeat regions (Fig. 8(a, b, c, d)), and an extreme CG deficiency in sub-region 3 (Fig. 8(d)). The dinucleotides TT and TA are increased in SARS-CoV-2 compared with SARS-CoV (Fig. 8(a, b, c, d)).

Figure 6: Comparison of periodicities in the genomic regions (6k-12k) of human HCoV-229E coronaviruses. (a) 2-periodicity. (b) 3-periodicity.

Figure 7: Comparison of periodicities in the genomic regions (6 kb - 12 kb) of SARS-CoV and SARS-like coronaviruses. (a) 2-periodicity. (b) 3-periodicity.

Figure 8: The dinucleotide frequency distributions in the whole genome and the three genomic sub-regions of SARS-CoV-2 and SARS-CoV/Tor2. (a) Whole genome. (b) Sub-region 1. (c) Sub-region 2. (d) Sub-region 3. The sub-regions are listed in Table 1.

The role of the increased dinucleotides TT and TA in the SARS-CoV-2 genome is possibly to attenuate virus replication. One mechanism for attenuating a virus by dinucleotide bias is that the dinucleotide regions fold into a special structure that is a target for cellular RNA cleavage, which is a fundamental host response for controlling viral infections (Zhou et al., 1993). The 2',5'-oligoadenylate synthetase/RNase L system is an innate immunity pathway that responds to a pathogen-associated molecular pattern (PAMP) to induce degradation of viral and cellular RNAs, thereby blocking viral infection. In higher vertebrates, this process is often regulated by interferons (IFNs). Ribonuclease L (RNase L; L is for latent) is an interferon (IFN)-induced antiviral ribonuclease which, upon activation, destroys all RNA within the cell, both cellular and viral (Silverman, 2007). RNase L cleaves hepatitis C virus (HCV) RNA at single-stranded TT and TA dinucleotides throughout the open reading frame (ORF). An interesting discovery is that in the bacterium Mycoplasma pneumoniae, which is a respiratory infection agent, the genome also shows relative-abundance extremes for dinucleotides TT and TA (Karlin, 1998). Therefore, we may postulate that the dinucleotide TT and TA regions in SARS-CoV-2 are possibly cleaved by RNase L during infection.
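The dinucleotide frequency comparison and the notion of relative abundance used above can be made concrete with a short Python sketch. It is illustrative only: the function names are hypothetical, and the odds ratio $\rho_{xy} = f_{xy} / (f_x f_y)$ is the standard single-strand relative-abundance measure associated with Karlin and Burge, which is assumed here and may differ from the exact statistic behind the paper's figures.

```python
from collections import Counter
from itertools import product
from typing import Dict

BASES = "ACGT"


def dinucleotide_frequencies(seq: str) -> Dict[str, float]:
    """Relative frequency of each of the 16 overlapping dinucleotides."""
    seq = seq.upper().replace("U", "T")  # treat RNA input as its DNA alphabet
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    valid = [d for d in pairs if d[0] in BASES and d[1] in BASES]
    counts = Counter(valid)
    total = len(valid)
    return {a + b: (counts[a + b] / total if total else 0.0)
            for a, b in product(BASES, repeat=2)}


def odds_ratios(seq: str) -> Dict[str, float]:
    """Relative abundance rho_xy = f_xy / (f_x * f_y) for every dinucleotide."""
    seq = seq.upper().replace("U", "T")
    mono = Counter(b for b in seq if b in BASES)
    n = sum(mono.values())
    di = dinucleotide_frequencies(seq)
    ratios = {}
    for pair, f_xy in di.items():
        f_x = mono[pair[0]] / n if n else 0.0
        f_y = mono[pair[1]] / n if n else 0.0
        ratios[pair] = f_xy / (f_x * f_y) if f_x and f_y else 0.0
    return ratios


freqs = dinucleotide_frequencies("TTTTACGT")
print(round(freqs["TT"], 4))  # 0.4286 (3 of the 7 overlapping pairs)
```

Applied to the sub-regions in Table 1 of two genomes, `dinucleotide_frequencies` would yield the kind of per-dinucleotide comparison shown for SARS-CoV-2 and SARS-CoV/Tor2, while a ratio well above or below 1 from `odds_ratios` flags an over- or under-represented dinucleotide relative to base composition.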
In this study, we identify the unique dinucleotide repeats in the SARS-CoV-2 genome. The dinucleotide repeats in the genomic spectra are revealed by the periodicity analysis. We discover that the strength of these repeats correlates with the evolutionary fitness of the virus to a human host, maximizing its survival in epidemics instead of destroying the host. Therefore, RNA viruses simulate host mRNA composition, such as the dinucleotide composition (Fros et al., 2017). Most vertebrate RNA and small DNA viruses suppress genomic CG and TA dinucleotide frequencies, apparently mimicking host mRNA composition. The abundant dinucleotides TT and TA are most likely common pathogenicity islands in microbial genomes. This study on SARS-CoV-2 provides additional evidence that the increase in dinucleotides TT and TA in SARS-CoV-2 is the result of interaction with the host during virus evolution. We consider these three dinucleotide abundance regions to be pathogenicity islands of the SARS-CoV-2 genome. In addition, these special regions may contribute to RNA replication and can be recognized by cellular RNase L for RNA degradation in the immune response. However, this study is only a theoretical analysis of the genomes. The actual functional consequences of these dinucleotide repeats and their impacts on transmissibility and pathogenesis should be determined by biochemical experiments and animal models.

In humans and mammals, APOBEC (apolipoprotein B mRNA editing enzyme, catalytic polypeptide-like 3) systems help protect the organisms from viral infections. For the origin of dinucleotide repeats in the SARS-CoV-2 genome, we speculate that the molecular mechanism of increased dinucleotide repeats during evolution fitness is possibly APOBEC3-mediated editing of viral RNA, in which cytosine (C) is often mutated to uracil (U) by deamination (Bishop et al., 2004).
Therefore, many dinucleotide TT repeats could be generated as the 2-periodicity island in the RNA genome by the APOBEC defense system.

Cross-species transmission of coronaviruses from wildlife reservoirs may lead to disease outbreaks in humans, posing a severe threat to human health. To date, most studies on the zoonotic origin of SARS-CoV-2 have focused primarily on the Spike protein, which is essential for the entry of virus particles into the cell. Mutations or acquisition of a potential cleavage site for furin proteases in the Spike protein may confer zoonotic transmissibility on SARS-CoV-2 (Andersen et al., 2020; Hoffmann et al., 2020). However, the Spike protein is required but not sufficient for zoonotic coronavirus transmission. For instance, the Spike protein of SARS-like SHC014-CoV can only enable the chimeric virus SHC014-MA15 to infect human cells when the Spike protein is integrated into a wild-type SARS-CoV backbone (Menachery et al., 2015). The SARS-CoV backbone contains the ORF1ab replication components. Our study provides evidence for the importance of ORF1ab replication regions in evolutionary fitness. Consequently, these ORF1ab replication components are critical for zoonotic transmission.

This study on the genomic spectrum of SARS-CoV-2 reveals that regions high in dinucleotide TT in ORF1a might contribute to evolution fitness in host immune evasion. Accordingly, efforts to monitor SARS-CoV-2 and develop antiviral drugs should consider the molecular characteristics and changes of nsps 3-6, which are vital components in interacting with human hosts. These special dinucleotide repeat regions should be investigated in detail for their functions and phenotype changes in SARS-CoV-2. Importantly, these dinucleotide repeat regions may be prophylactic and therapeutic targets for controlling COVID-19.
Acknowledgments
Abbreviations
• ACE2: angiotensin converting enzyme 2
• APOBEC: apolipoprotein B mRNA editing enzyme, catalytic polypeptide-like 3
• COVID-19: coronavirus disease 2019
• CD: congruence derivative
• DMV: double-membrane vesicle
• MERS: Middle-East respiratory syndrome
• NDU: normalized distribution uniformity
• PAMPs: pathogen-associated molecular patterns
• RIG-I: retinoic acid-inducible gene I
• SARS: severe acute respiratory syndrome
• SARS-CoV-2: severe acute respiratory syndrome coronavirus 2
Supplementary materials
References