Inverted repeats in coronavirus SARS-CoV-2 genome and implications in evolution
Changchuan Yin ∗, Department of Mathematics, Statistics, and Computer Science, University of Illinois at Chicago, Chicago, IL 60607, USA
Stephen S.-T. Yau †, Department of Mathematical Sciences, Tsinghua University, Beijing 100084, China
Abstract
The coronavirus disease (COVID-19) pandemic, caused by the coronavirus SARS-CoV-2, has caused 60 million infections and 1.38 million fatalities. Genomic analysis of SARS-CoV-2 can provide insights on drug design and vaccine development for controlling the pandemic. Inverted repeats in a genome greatly impact the stability of the genome structure and regulate gene expression. Inverted repeats are involved in cellular evolution and genetic diversity, genome arrangements, and diseases. Here, we investigate the inverted repeats in the coronavirus SARS-CoV-2 genome. We found that the SARS-CoV-2 genome has an abundance of inverted repeats. The inverted repeats are mainly located in the gene of the Spike protein. This result suggests that the Spike protein gene undergoes recombination events and is therefore essential for fast evolution. Comparison of the inverted repeat signatures in human and bat coronaviruses suggests that SARS-CoV-2 is most closely related to the SARS-related coronavirus SARSr-CoV/RaTG13. The study also reveals that the recent SARS-related coronavirus, SARSr-CoV/RmYN02, has a high number of inverted repeats in the spike protein gene. In addition, this study demonstrates that the inverted repeat distribution in a genome can be considered as a genomic signature. This study highlights the significance of inverted repeats in the evolution of SARS-CoV-2 and presents the inverted repeats as a genomic signature in genome analysis.
Keywords: COVID-19, SARS-CoV-2, 2019-nCoV, coronavirus, inverted repeat, genome, evolution
∗ † Corresponding authors. Submitted to arXiv.org, November 26, 2020.
The novel human coronavirus SARS-CoV-2 (formerly 2019-nCoV), the causative agent of the Coronavirus Disease 2019 (COVID-19) pandemic, first emerged in Wuhan, China, in December 2019 and had claimed 1.38 million lives worldwide as of Nov. 24, 2020 (Max Roser and Hasell, 2020). Understanding the molecular structure and evolution of the SARS-CoV-2 genome is urgently needed for tracing the origin of the virus and provides insights on vaccine development and drug design for controlling the current COVID-19 pandemic.
Human coronaviruses (CoVs) are common viral respiratory pathogens that cause mild to moderate upper-respiratory tract illnesses. Two common CoVs, 229E and OC43, were identified in 1965 and can cause the common cold. Four human CoVs found in more recent years are Severe Acute Respiratory Syndrome Coronavirus (SARS-CoV) in 2002, NL63 in 2004, HKU1 in 2005, and Middle East respiratory syndrome coronavirus (MERS-CoV) in 2012. Among these human CoVs, SARS-CoV and MERS-CoV are highly pathogenic and have caused severe and fatal infections. MERS symptoms are very severe, usually including fever, cough, and shortness of breath, which often progress to pneumonia. About 30% of patients with MERS died. SARS symptoms often include fever, chills, and body aches, which usually progress to pneumonia. About 10% of patients with SARS died. The current coronavirus SARS-CoV-2, which causes the worldwide COVID-19 pandemic, is milder than SARS-CoV but can cause severe syndromes and fatality in people with cardiopulmonary disease, people with weakened immune systems, infants, and older adults.
SARS-CoV-2 is a beta coronavirus, like MERS-CoV and SARS-CoV. All three of these coronaviruses have their origins in bats. Yet the zoonotic origin of SARS-CoV-2 is still unconfirmed. Zhou et al. (2020c,b) showed that the bat SARS-related coronavirus strain SARSr-CoV/RaTG13, identified from a bat Rhinolophus affinis in Yunnan province, China, in July 2012, shares 96.2% nucleotide identity with SARS-CoV-2. A recent study identified a new SARSr-CoV/RmYN02 (2019) from Rhinolophus malayanus, which is closely related to SARS-CoV-2 (Zhou et al., 2020a). SARSr-CoV/RmYN02 shares 93.3% nucleotide identity with SARS-CoV-2 and comprises natural insertions at the S1/S2 cleavage site of the Spike protein. The unique S1/S2 cleavage site in the Spike protein of SARS-CoV-2 may confer the zoonotic spread of SARS-CoV-2. However, the relationship of origin among these CoVs is not entirely clear.
The SARS-CoV-2 coronavirus contains a linear single-stranded positive-sense RNA genome (Fig. 1). The SARS-CoV-2 RNA genome of 29.9 kb has a total of 11 genes with 11 open reading frames (ORFs) (Yoshimoto, 2020), consisting of the leader sequence (5’UTR), the coding regions, and the 3’UTR pseudoknot stem-loop (Wu et al., 2020). The coding regions include ORF1ab and genes encoding 16 non-structural proteins (Finkel et al., 2020), structural proteins (spike (S), envelope (E), membrane (M), and nucleocapsid (N)) (Gordon et al., 2020), and several accessory proteins.
ORF1ab encodes replicase polyproteins required for viral RNA replication and transcription (Chen et al., 2020b; Cavasotto et al., 2020). Nonstructural protein 1 (nsp1) likely inhibits host translation by interacting with the 40S ribosomal subunit, leading to host mRNA degradation through cleavage near their 5’UTRs. Nsp1 promotes viral gene expression and immunoevasion in part by interfering with interferon-mediated signaling. Nonstructural protein 2 (nsp2) interacts with the host factors prohibitin 1 and prohibitin 2, which are involved in many cellular processes including mitochondrial biogenesis. The third non-structural protein (nsp3) is the papain-like proteinase. Nsp3 is an essential component, and the largest, of the replication and transcription complex. The papain-like proteinase cleaves non-structural proteins 1-3 and blocks the host’s innate immune response, promoting cytokine expression (Serrano et al., 2009; Lei et al., 2018). Nsp4, encoded in ORF1ab, is responsible for forming double-membrane vesicles (DMVs). The other non-structural proteins are the 3-chymotrypsin-like proteinase (3CLpro) and nsp6. 3CLpro is essential for RNA replication and is responsible for processing the C-terminus of nsp4 through nsp16 in coronaviruses (Anand et al., 2003).
Together, nsp3, nsp4, and nsp6 can induce DMVs (Angelini et al., 2013).
SARS-coronavirus has a unique RNA replication facility, including two RNA-dependent RNA polymerases (RNA pol). The first RNA polymerase is the primer-dependent non-structural protein 12 (nsp12), and the second RNA polymerase is nsp8, which has primase capacity for de novo replication initiation without primers (Te Velthuis et al., 2012). Nsp7 and nsp8 are essential proteins in the replication and transcription of SARS-CoV-2. Nsp7 is responsible for nuclear transport. The SARS-coronavirus nsp7-nsp8 complex is a multimeric RNA polymerase for both de novo initiation and primer extension (Prentice et al., 2004; Te Velthuis et al., 2012). Nsp8 also interacts with the ORF6 accessory protein. The nsp9 replicase protein of SARS-coronavirus binds RNA and interacts with nsp8 for its functions (Sutton et al., 2004). Helicase (nsp13) possesses helicase activity, thus catalyzing the unwinding of dsRNA or structured RNA into single strands. Importantly, nsp14 may function as a proofreading exoribonuclease for virus replication; hence, the SARS-CoV-2 mutation rate remains low.
Figure 1: The structural diagram of the SARS-CoV-2 genome (GenBank: NC_045512). The diagram of the SARS-CoV-2 genome was made using DNA Feature Viewer (Zulkower and Rosser, 2020).
Furthermore, the SARS-CoV-2 genome encodes several structural proteins. The structural proteins possess much higher immunogenicity for T cell responses than the non-structural proteins (Li et al., 2008). The structural proteins include spike (S), envelope (E), membrane protein (M), and nucleoprotein (N) (Marra et al., 2003; Ruan et al., 2003). The Spike glycoprotein has two domains, S1 and S2. Spike protein S1 attaches the virion to the host cell membrane through the receptor ACE2, initiating the infection (Wan et al., 2020; Wong et al., 2004). After being internalized into the endosomes of the cells, the S glycoprotein is then cleaved by cathepsin CTSL.
The spike protein domain S2 mediates fusion of the virion and cellular membranes by acting as a class I viral fusion protein. Notably, the spike glycoprotein of coronavirus SARS-CoV-2 contains a furin-like cleavage site (Coutard et al., 2020). A recent study indicates that SARS-CoV-2 is more infectious than SARS-CoV according to the changes in S protein-ACE2 binding affinity (Chen et al., 2020a). The envelope (E) protein interacts with the membrane protein M in the budding compartment of the host cell. The M protein holds dominant cellular immunogenicity (Liu et al., 2010). Nucleoprotein (ORF9a) packages the positive-strand viral RNA genome into a helical ribonucleocapsid (RNP) during virion assembly through its interactions with the viral genome and the membrane protein M (He et al., 2004). Nucleoprotein plays an important role in enhancing the efficiency of subgenomic viral RNA transcription and viral replication.
In addition to the coding regions, the SARS-CoV-2 genome contains hidden structures that can retain genome stability, regulate gene replication and expression, and control the virus life cycle. The non-coding genome structures include leader sequences, transcriptional regulatory sequences (TRS), G-quadruplex structures, frame-shifting regions, and repeats. The first non-coding structure, the 5’ leader sequence of about 265 bp, is a unique characteristic of coronavirus replication and plays critical roles in the gene expression of coronaviruses during discontinuous sub-genomic replication (Li et al., 2005).
SARS-CoV-2 contains G-quadruplex structures (Ji et al., 2020). It is well established that sequences with G-blocks (adjacent runs of guanines) can potentially form non-canonical G-quadruplex (G4) structures (Choi and Majima, 2011; Métifiot et al., 2014). The G4 structures are formed by stacking two or more G-tetrads held by Hoogsteen hydrogen bonds and often are sites of genomic instability, serving one or more biological functions (Bochman et al., 2012).
An inverted repeat is a single-stranded sequence of nucleotides followed downstream by its reverse complement. The intervening sequence between the initial sequence and the reverse complement is called a spacer. When the spacer length is zero, the inverted repeat is called a palindrome. For example, the inverted repeat 5’-ATTCGCGAAT-3’ is a palindrome: the palindrome-first sequence is 5’-ATTCG-3’, and the palindrome-second sequence is 5’-CGAAT-3’. When the spacer in an inverted repeat is non-zero, the repeat is a general inverted repeat. In a general inverted repeat, we still denote the initial sequence as the palindrome-first sequence and the downstream reverse complement as the palindrome-second sequence. For example, in the general inverted repeat 5’-TTTAGGT...ACCTAAA-3’, the palindrome-first sequence is 5’-TTTAGGT-3’, and the palindrome-second sequence is 5’-ACCTAAA-3’.
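As a minimal illustration of the definitions above, the following Python sketch checks the two examples from the text. The helper names `reverse_complement` and `is_palindrome` are chosen here for illustration and are not part of the study; sequences are written in DNA notation (T rather than U), as in the examples.

```python
# Watson-Crick complements in DNA notation (A/T/C/G), matching the examples in the text.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a nucleotide sequence."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def is_palindrome(seq: str) -> bool:
    """True if seq equals its own reverse complement (inverted repeat with zero spacer)."""
    return seq.upper() == reverse_complement(seq)


# Palindrome example: the second half is the reverse complement of the first half.
first, second = "ATTCG", "CGAAT"
print(reverse_complement(first) == second, is_palindrome(first + second))

# General inverted repeat example: the two arms pair, but an arm alone is not a palindrome.
print(reverse_complement("TTTAGGT") == "ACCTAAA", is_palindrome("TTTAGGT"))
```

The same check applies to any reported pair of palindrome-first and palindrome-second sequences: the pair is a perfect inverted repeat exactly when one arm is the reverse complement of the other.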
Through self-complementary base pairing, an inverted repeat can form a stem-loop (hairpin) structure in an RNA molecule, where the palindrome-first and palindrome-second sequences make a stem, and the spacer sequence makes a loop. It should be noted that an inverted repeat may not have perfect complementary base pairing between the palindrome-first and palindrome-second sequences, so the stem formed by an imperfect inverted repeat can have mismatches, insertions, or deletions. Inverted repetitive sequences are principal components of the archaeal and bacterial CRISPR-Cas systems (Mojica et al., 2005), which function as adaptive antiviral defense systems.
Inverted repeats have important biological functions in viruses. Inverted repeats delimit the boundaries of transposons in genome evolution and form stem-loop structures that influence genome stability and flexibility. Inverted repeats are described as hotspots of eukaryotic and prokaryotic genomic instability (Voineagu et al., 2008), replication (Pearson et al., 1996), and gene silencing (Selker, 1999). Therefore, inverted repeats are involved in cellular evolution and genetic diversity, mutations, and diseases.
Despite the paramount roles of the non-coding structures, they are not as immediately visible as the coding regions. This study aims to identify one of the crucial non-coding structures, inverted repeats, in the SARS-CoV-2 genome and to investigate the relationship between the inverted repeats and virus evolution.
The complete genomes of coronaviruses were scanned for inverted repeats using Palindrome analyzer (Brázda et al., 2016). Palindrome analyzer (http://bioinformatics.ibp.cz/) is a web-based server for retrieving palindromic and inverted repeats in DNA or RNA sequences. The Palindrome analyzer server describes the features of inverted repeats, including similarity analysis, localization, and visualization.
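To make the notion of scanning a genome for perfect inverted repeats concrete, the sketch below indexes every k-mer and looks up its reverse complement downstream. This is a naive illustration under simplifying assumptions (a fixed stem length `k`, 0-based positions, no limit on spacer length); it is not the algorithm implemented by Palindrome analyzer, and the function name `find_perfect_inverted_repeats` is hypothetical.

```python
from collections import defaultdict

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(seq):
    """Return the reverse complement of a sequence over A/T/C/G."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def find_perfect_inverted_repeats(genome, k):
    """Return 0-based (first_start, second_start) pairs of perfect inverted repeats.

    A pair is reported when the k-mer starting at second_start is the exact
    reverse complement of the k-mer starting at first_start and the two arms
    do not overlap (second_start >= first_start + k). Spacer length is unbounded.
    """
    genome = genome.upper()
    # Index the start positions of every k-mer in the sequence.
    positions = defaultdict(list)
    for i in range(len(genome) - k + 1):
        positions[genome[i:i + k]].append(i)

    hits = []
    for i in range(len(genome) - k + 1):
        kmer = genome[i:i + k]
        if any(base not in COMPLEMENT for base in kmer):
            continue  # skip windows with ambiguous bases such as N
        for j in positions.get(reverse_complement(kmer), []):
            if j >= i + k:
                hits.append((i, j))
    return hits


# Toy sequence (hypothetical) embedding the general inverted repeat TTTAGGT...ACCTAAA.
toy = "GGTTTAGGTCCCCACCTAAAGG"
print(find_perfect_inverted_repeats(toy, 7))
```

Running the scan for several values of `k` and discarding hits contained in a longer hit would approximate the filtering of nested repeats, although that step is not shown here.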
To ensure consistency in comparing coronavirus genomes, we only extracted the inverted repeats with perfect complementary base pairing of the palindrome-first and palindrome-second sequences. Note that a short inverted repeat of length P can lie inside a long inverted repeat of length Q (Q > P); in this case, we extracted only the inverted repeat of length Q and excluded the inverted repeat of length P.
The retrieved inverted repeats were mapped on the protein genes in a genome according to the positions of the palindrome-first and palindrome-second sequences of the inverted repeats.
The distributions of inverted repeats on protein genes in the different genomes were assessed by the Wasserstein distance, also known as the earth mover’s distance. The Wasserstein distance corresponds to the minimum amount of work required to transform one distribution into the other. The p-th Wasserstein distance between two probability distributions µ and ν is defined as follows (Vallender, 1974),
$$ W_p(\mu, \nu) = \left( \inf_{\pi \in \Gamma(\mu, \nu)} \int_{\mathbb{R} \times \mathbb{R}} |x - y|^p \, d\pi(x, y) \right)^{1/p}, $$
where Γ(µ, ν) denotes the set of probability distributions on R × R with marginals µ and ν.
The following complete genomes of SARS-CoVs and SARS-related coronaviruses (SARSr-CoVs) were downloaded from NCBI GenBank: SARS-CoV-2 (GenBank: NC_045512.2) (Wu et al., 2020), SARS-CoV/BJ01 (GenBank: AY278488), SARSr-CoV/RaTG13 (GenBank: MN996532) (Zhou et al., 2020c), SARSr-CoV/RmYN02 (GISAID: EPI_ISL_412977) (Zhou et al., 2020a; Shu and McCauley, 2017), and MERS-CoV (GenBank: NC_019843) (Zaki et al., 2012).
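For one-dimensional empirical distributions with the same number of equally weighted points, the infimum in the Wasserstein definition is attained by matching sorted values, so the distance reduces to an average over sorted differences. The sketch below uses this shortcut. How the per-gene repeat numbers enter the distance in this study is not spelled out, so treating them as two equal-size samples is an assumption, and the count vectors shown are hypothetical placeholders rather than results.

```python
def wasserstein_p(u, v, p=1):
    """p-th Wasserstein distance between two equal-size 1-D empirical samples.

    For equally weighted points on the real line, the optimal coupling pairs
    the i-th smallest value of u with the i-th smallest value of v, giving
    W_p = (mean(|x_(i) - y_(i)|^p))^(1/p).
    """
    if len(u) != len(v) or len(u) == 0:
        raise ValueError("samples must be non-empty and of equal size")
    xs, ys = sorted(u), sorted(v)
    total = sum(abs(x - y) ** p for x, y in zip(xs, ys))
    return (total / len(xs)) ** (1.0 / p)


# Hypothetical repeat numbers per protein gene for two genomes (placeholders only).
counts_genome_a = [4, 0, 12, 2, 7]
counts_genome_b = [3, 1, 9, 2, 10]
print(wasserstein_p(counts_genome_a, counts_genome_b, p=1))
```

Because the sorted matching discards which gene each count belongs to, this form compares the overall spread of repeat numbers; a gene-aware comparison would instead place the normalized counts on a gene index axis.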
Results
Long inverted repeats are deemed to greatly influence the stability of the genomes of various organisms. The longest inverted repeat identified in the SARS-CoV-2 genome is a 15 bp sequence: the palindrome-first sequence 5’-ACTTACCTTTTAAGT-3’ is at 8474-8489 (nsp3 gene), and the palindrome-second sequence 5’-ACTTAAAAGGTAAGT-3’ is at 13295-13310 (nsp10 gene). The repeats of 11-15 bp are predominantly located in the gene of the Spike (S) protein (Fig. 2(a) and (b)). Three other protein genes (nsp3, RdRp, and N protein) are also enriched with long inverted repeats.
Long inverted repeats often contribute to the stability of a genome because of the stable stems they form. The results also suggest that recombinations took place in the gene of the Spike protein during evolution. Together, the four protein genes with abundant inverted repeats (S, nsp3, RdRp, and N protein) are evolving dramatically and are critical for virus survival, and therefore can be pharmaceutical targets (Gao et al., 2020).
The relation of virus genomes may provide insights on the zoonotic origin and evolution of the viruses. To examine the close relevance of human and bat CoVs, we evaluate and compare the distributions of inverted repeats of 11-15 bp in five CoV genomes: SARS-CoV-2 (Fig. 2(a)), SARS-CoV (Fig. 3(a)), MERS-CoV (Fig. 4(a)), SARSr-CoV/RaTG13 (Fig. 5(a)), and SARSr-CoV/RmYN02 (Fig. 6(a)). The repeat numbers of the inverted repeats of 11-15 bp on each protein gene in the genomes are shown in Fig. 2(b), Fig. 3(b), Fig. 4(b), Fig. 5(b), and Fig. 6(b).
The repeat numbers are counted by both the palindrome-first and palindrome-second sequences of the inverted repeats.
Taking account of inverted repeats over the wider range of 8-15 bp, we computed the pairwise Wasserstein distances of the repeat numbers of protein genes in three closely related SARSr-CoVs: the distance between SARS-CoV-2 and SARSr-CoV/RaTG13 is 6.8571, the distance between SARS-CoV-2 and SARSr-CoV/RmYN02 is 5.7143, and the distance between SARSr-CoV/RaTG13 and SARSr-CoV/RmYN02 is 6.3571. Therefore, we conclude that the SARS-CoV-2 strain is more closely related to SARSr-CoV/RaTG13 (2013) than to SARSr-CoV/RmYN02 (2019). Both SARS-CoV-2 and SARSr-CoV/RmYN02 may have evolved from SARSr-CoV/RaTG13. We also observe that the Spike protein gene in SARSr-CoV/RmYN02 (Fig. 6(b)) has more long inverted repeats than its counterparts in SARS-CoV-2 (Fig. 2(b)) and SARSr-CoV/RaTG13 (Fig. 5(b)). Unsurprisingly, the Spike protein in SARSr-CoV/RmYN02 contains natural insertions at the S1/S2 cleavage site. This cleavage site may originate from recombination events of the Spike genes as the result of inverted repeats.
The total frequencies of inverted repeats of different lengths in the human and bat CoVs also suggest that SARS-CoV-2 is closely related to SARSr-CoV/RaTG13 (Fig. 7). Notably, Fig. 7 shows that the inverted repeats of all lengths increase from SARS-CoV (in 2003) to SARS-CoV-2 (in 2019). From these repeat analyses, we may infer that, during evolution, recombinations may occur and produce accumulated inverted repeats under natural selection. We see that recombinations can be one of the driving forces for fast evolution.
The COVID-19 pandemic has caused substantial health emergencies and economic stress in the world. Vaccine development is critical to mitigating the pandemic. The finding in this study that the genes of three proteins, nsp3, RdRp, and the Spike protein, are rich in inverted repeats suggests that these three proteins are functionally significant for virus survival and should be targets of drug design and vaccine development.
If we relax the matching pairs in the inverted repeats, we expect that much longer inverted repeats can be identified, and the number of inverted repeats in the virus genome will increase significantly. The imperfect inverted repeats are the natural forms of the repeats that maintain the genome structures. Because the distribution and types of perfect inverted repeats in a genome are unique, and extracting the perfect inverted repeats is parameter-free, the perfect inverted repeats can be considered as a genomic signature. The signatures from perfect inverted repeats are consistent and can therefore be used for distinguishing closely related viruses and differing virus mutation variants. The quantitative comparison of the signatures can also provide phylogenetic taxonomy when appropriate numerical metrics for the signatures are realized. Therefore, the perfect inverted repeats can be an effective barcode to delimit species and genotypes.
Figure 2: Distributions of inverted repeats consisting of palindrome-first and palindrome-second sequences on the SARS-CoV-2 genome (NC_045512). (a) Inverted repeats of 11-15 bp. (b) Repeat numbers of inverted repeats of 12-15 bp in the protein genes of the genome. In (b), the repeat numbers are counted by both palindrome-first and palindrome-second sequences.
Figure 3: Distributions of inverted repeats consisting of palindrome-first and palindrome-second sequences on the SARS-CoV genome (AY278488). (a) Inverted repeats of 11-15 bp. (b) Repeat numbers of inverted repeats of 12-15 bp in the protein genes of the genome.
Figure 4: Distributions of inverted repeats consisting of palindrome-first and palindrome-second sequences on the MERS-CoV genome (NC_019843). (a) Inverted repeats of 11-15 bp. (b) Repeat numbers of inverted repeats of 12-15 bp in the protein genes of the genome.
Figure 5: Distributions of inverted repeats consisting of palindrome-first and palindrome-second sequences on the SARSr-CoV/RaTG13 genome (MN996532). (a) Inverted repeats of 11-15 bp. (b) Repeat numbers of inverted repeats of 12-15 bp in the protein genes of the genome.
Figure 6: Distributions of inverted repeats consisting of palindrome-first and palindrome-second sequences on the SARSr-CoV/RmYN02 genome (EPI_ISL_412977). (a) Inverted repeats of 11-15 bp. (b) Repeat numbers of inverted repeats of 12-15 bp in the protein genes of the genome.
Figure 7: Frequencies of inverted repeats of different lengths in the coronavirus genomes: SARS-CoV-2, SARS-CoV, MERS-CoV, SARSr-CoV/RaTG13, and SARSr-CoV/RmYN02. The repeat numbers are counted by palindrome-first sequences only.
Acknowledgments
Competing interests
We declare we have no competing interests.
Abbreviations
• COVID-19: coronavirus disease 2019
• SARS: severe acute respiratory syndrome
• SARS-CoV-2: severe acute respiratory syndrome coronavirus 2
• MERS-CoV: Middle East respiratory syndrome coronavirus
• CRISPR: clustered regularly interspaced short palindromic repeats
• ACE2: angiotensin-converting enzyme 2
• NCBI: National Center for Biotechnology Information (USA)
References