A comparative analysis for SARS-CoV-2
Göksel Mısırlı*, School of Computing and Mathematics, Keele University
E-mail: [email protected]
Abstract
COVID-19 has affected the world tremendously. It is critical that biological experiments and clinical designs are informed by computational approaches for time- and cost-effective solutions. Comparative analyses in particular can play a key role in revealing structural changes in proteins due to mutations, which can lead to behavioural changes, such as the increased binding of the SARS-CoV-2 surface glycoprotein to human ACE2 receptors. The aim of this report is to provide an easy-to-follow tutorial for biologists and others without delving into different bioinformatics tools. More complex analyses, such as the use of large-scale computational methods, can then be utilised. Starting with a SARS-CoV-2 genome sequence, the report shows how to visualise DNA sequence features, derive amino acid sequences, and align different genomes to analyse mutations and differences. The report provides further insights into how the SARS-CoV-2 surface glycoprotein mutated for higher binding affinity to human ACE2 receptors, compared to the SARS-CoV protein, by integrating existing 3D protein models.
Keywords
SARS-CoV-2, SARS-CoV, Comparative Genomics, Surface Glycoprotein, CLC Genomics Workbench

arXiv [q-bio.BM]

Introduction

Severe acute respiratory syndrome (SARS)-related coronaviruses have previously caused two pandemics. The recent SARS Coronavirus 2 (SARS-CoV-2) outbreak has had unprecedented effects so far. It is essential to develop data integration mechanisms in order to gain insights using data that already exists. For example, genome-wide comparisons can be used to inform subsequent computational analyses, which can potentially be used to search for drugs and to develop computational models.

Taxonomy classifications offer a systematic approach to link data from different organisms. The taxonomy id for SARS-CoV-2 is reported as 2697049 by the National Center for Biotechnology Information (NCBI) taxonomy browser, which also lists synonyms that can be used when searching for information in different databases. Moreover, the taxonomy id 694009 is used to group all SARS-related coronavirus taxonomies. This list can be especially useful when searching for related sequences in order to infer information via comparative genomics approaches. For example, the NCBI Virus database can be queried with these taxonomy ids to retrieve SARS-CoV-2 related nucleotide and protein sequences.

The SARS-CoV-2 genome sequences can also be accessed directly from the NCBI's GenBank database. Related information includes where a virus was isolated from. Some of these GenBank files include only sequences and some of them provide annotations about genome locations of important sequence features.

This tutorial initially uses two genome sequences: one for SARS-CoV-2 and another one for SARS-CoV. Regarding the former, the GenBank entry LC528232 was selected. LC528232 was created for a strain which was isolated in Japan.
Regarding the latter, the GenBank entry AP006557, which was deposited in 2006, was selected.

Genomic features

The GenBank files include annotations about sequence features, which can be visualised and analysed using various tools. In this report, the analyses were carried out using CLC Genomics Workbench 20.0.3. GenBank files were initially imported into this tool, which was then used to analyse the annotated sequence features in more detail. For example, the surface glycoprotein denoted as ‘S’ is shown in Figure 1.

Figure 1: SARS-CoV-2 sequence features that are captured in GenBank files are shown.

In order to understand the mutations in the surface glycoprotein, a new entry for the related coding sequence (CDS) was created in CLC Genomics Workbench. The corresponding CDS feature was selected and the nucleotides were copied to create this new entry. CLC Genomics Workbench was then used to display restriction sites and protein translation information (Figure 2).

Figure 2: Restriction sites in the surface glycoprotein CDS, labelled as ‘S’.
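The translation and restriction-site views described above are produced through the CLC Genomics Workbench interface. For readers who want to see the underlying idea, the following is a minimal Python sketch, not the tool's implementation. It assumes the standard genetic code (NCBI translation table 1), uses a short made-up sequence rather than the real ‘S’ CDS, and scans for only three well-known enzymes. The function names `translate` and `find_restriction_sites` are hypothetical.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), written in the
# conventional TCAG base order; '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Recognition sequences of three common restriction enzymes.
# A real analysis would use a much larger enzyme list.
ENZYMES = {"EcoRI": "GAATTC", "BamHI": "GGATCC", "HindIII": "AAGCTT"}


def translate(cds: str) -> str:
    """Translate a coding sequence in frame 1, stopping at the first stop codon."""
    cds = cds.upper().replace("U", "T")
    protein = []
    usable_length = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    for i in range(0, usable_length, 3):
        amino_acid = CODON_TABLE.get(cds[i:i + 3], "X")  # 'X' for ambiguous codons
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


def find_restriction_sites(seq: str, enzymes=ENZYMES) -> dict:
    """Return 1-based start positions of each recognition site on the given strand."""
    seq = seq.upper()
    hits = {}
    for name, site in enzymes.items():
        positions = []
        start = seq.find(site)
        while start != -1:
            positions.append(start + 1)  # convert to 1-based coordinates
            start = seq.find(site, start + 1)
        hits[name] = positions
    return hits


if __name__ == "__main__":
    toy_cds = "ATGTTTGTTTTTCTTGAATTCTAA"  # illustrative only, not the real S CDS
    print(translate(toy_cds))
    print(find_restriction_sites(toy_cds))
```

The sketch searches the forward strand only. This is sufficient for the three palindromic sites listed here, but enzymes with non-palindromic recognition sequences would also require a scan of the reverse complement.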
Predicting secondary structures
Understanding secondary structure formations and where mutations occur can provide insights into the effects of these mutations in the SARS-CoV-2 surface glycoprotein CDS. Hence, a new entry for the corresponding amino acid sequences was created in CLC Genomics Workbench by using the ‘Toolbox - Classical Sequence Analysis - Nucleotide Analysis - Translate to Protein’ option. Alpha-helix and beta-strand formations were predicted using the ‘Toolbox - Classical Sequence Analysis - Protein Analysis - Predict Secondary Structure’ option. The inferred information was then incorporated as annotations into the entry for the amino acid sequences (Figure 3).

Figure 3: The surface glycoprotein secondary structure predictions. Red arrows represent beta-strands and blue arrows represent alpha-helices.

Sequence alignment
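As background for the pairwise alignments discussed in this section, the following is a minimal sketch of textbook global alignment (Needleman-Wunsch). It is not the algorithm used by CLC Genomics Workbench, and it assumes a simple match/mismatch/linear-gap scoring scheme instead of an amino acid substitution matrix. The function name `needleman_wunsch`, the scoring values, and the toy sequences are all illustrative assumptions.

```python
def needleman_wunsch(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2):
    """Globally align two sequences; return (aligned_a, aligned_b, score)."""
    n, m = len(a), len(b)

    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def pair_score(x: str, y: str) -> int:
        return match if x == y else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diagonal = score[i - 1][j - 1] + pair_score(a[i - 1], b[j - 1])
            up = score[i - 1][j] + gap      # gap inserted in b
            left = score[i][j - 1] + gap    # gap inserted in a
            score[i][j] = max(diagonal, up, left)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    aligned_a, aligned_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0
                and score[i][j] == score[i - 1][j - 1] + pair_score(a[i - 1], b[j - 1])):
            aligned_a.append(a[i - 1])
            aligned_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            aligned_a.append(a[i - 1])
            aligned_b.append("-")
            i -= 1
        else:
            aligned_a.append("-")
            aligned_b.append(b[j - 1])
            j -= 1

    return "".join(reversed(aligned_a)), "".join(reversed(aligned_b)), score[n][m]


if __name__ == "__main__":
    top, bottom, total = needleman_wunsch("GATTACA", "GATACA")
    print(top)
    print(bottom)
    print("score:", total)
```

When several alignments share the best score, the traceback returns only one of them, so the exact gap placement can differ between implementations even though the score is the same.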
It is reported that the SARS-CoV-2 surface glycoprotein is optimised to bind to human ACE2 receptors. In order to analyse the effects of the underlying mutations further, the protein sequence can be aligned to previously known similar sequences. Figure 4 shows the alignment of the surface glycoprotein amino acid sequences from SARS-CoV-2 (LC528232) and SARS-CoV (AP006557). Although the alignment shows high similarity between the amino acid sequences, gaps and mutations also exist.

Figure 4: Aligned surface glycoprotein amino acid sequences from SARS-CoV-2 (LC528232) and SARS-CoV (AP006557).

A more detailed view of the alignment of the two protein sequences can be seen in Figure 5. Secondary structures are integrated into the view. Surface-exposed regions can be compared using options such as the Kyte-Doolittle scale. Additional options can be configured from the ‘Alignment Settings - Protein info’ tab.

Figure 5: Aligned surface glycoprotein amino acid sequences from SARS-CoV-2 (LC528232) and SARS-CoV (AP006557). Red blocks at the top represent beta-strands and blue blocks represent alpha-helices.

Wan and co-workers reported that mutations in five amino acids of the surface glycoprotein can play a critical role in binding to the human ACE2 receptors. Andersen and co-workers provide their own insights, including a sixth critical amino acid. Here, we integrated secondary structure predictions and realigned the sequences in order to show how these five mutations may affect the binding of the SARS-CoV-2 protein (Figure 6). The alignment of secondary structure predictions shows the addition of three beta strands and the loss of one beta strand. These changes may affect the folding of the protein and hence its binding.

Figure 6: Five key mutations are shown using the blue boxes. These mutations are reported to change the binding affinity of the surface glycoprotein to the human ACE2 receptors. Comparisons are displayed for SARS-CoV-2 (LC528232) and SARS-CoV (AP006557).

Predictions for secondary structures may reveal some information about the binding of the surface glycoprotein. However, 3D models may help in understanding the surfaces that are exposed and are likely to bind to other molecules. In order to understand the effects of the mutated region in Figure 6 (shown using the dashed box), an existing 3D model of the SARS-CoV protein was searched for using CLC Genomics Workbench's ‘Toolbox - Classical Sequence Analysis - Protein Analysis - Find and Model Structure’ option. The top-ranked entry, with the highest ‘Match identity’ and ‘Coverage’, was selected. This entry is listed in the Protein Data Bank (PDB) as ‘5X58 Prefusion Structure of SARS-CoV Spike Glycoprotein, Conformation 1’. Using CLC Genomics Workbench's ‘Project Settings - Sequence tools - Show Sequence’ option, the ‘Chain C’ sequence of 5X58 was first displayed and was then aligned to the SARS-CoV protein sequence (AP006557) using the ‘Align to Existing Sequence’ option. The left picture in Figure 7 shows the structure of the chain. The SARS-CoV amino acid sequences from the area with the key mutations (shown using the dashed box in Figure 6) were highlighted using the sequence editor. The right picture in Figure 7 shows the corresponding amino acid areas highlighted in the 3D structural view.

Figure 7: The binding region in SARS-CoV. The left picture shows the SARS-CoV Chain C, which binds to the human ACE2 receptors. The right picture shows the area corresponding to the highlighted sequence below, which represents the mutation area shown using the dashed box in Figure 6.

A similar analysis was also carried out for the SARS-CoV-2 Chain C. The ‘6W41’ entry, including the details about the crystal structure of the SARS-CoV-2 receptor binding domain, was downloaded from the Protein Data Bank. A collection of related entries can be found at the European Bioinformatics Institute's COVID-19 page. The sequence from the ‘6W41’ model was then aligned to the SARS-CoV-2 sequence (LC528232). Compared to the SARS-CoV secondary structures, both the CLC Genomics Workbench predictions and the SARS-CoV-2 ‘6W41’ model reveal additional beta strands in the mutated region (shown using the dashed box in Figure 6) of the surface glycoprotein. The 3D view of the binding region is shown in Figure 8. The right picture in Figure 8 shows the corresponding amino acid areas highlighted in the 3D structural view.

CLC Genomics Workbench was then used to align the SARS-CoV-2 and SARS-CoV 3D structures using the ‘Project Settings - Structure tools - Align Protein Structure’ option. Figure 9 shows the alignment using red and blue colours for SARS-CoV-2 and SARS-CoV respectively. The third picture, on the right, shows the region that aligns with the ACE2 receptors during binding. The middle picture shows the four key amino acid sequences that mutated in SARS-CoV-2. These key mutations affect the binding affinity.

Discussion

In this report, insights from the sequence analysis of SARS-CoV-2 were presented. Genome-wide comparisons between SARS-CoV-2 and SARS-CoV shed further light on the potential effects of key mutations in the surface glycoprotein amino acid sequences. Our analysis is in line with the current reports.
Although some of the mutations may play an important role in binding to the human ACE2 receptors, there are several additional mutations that may have strengthened the binding affinity of the surface glycoprotein. Both the predictions and the 3D models reveal additional beta strands and hydrogen bonds. These additional mutations may cause changes in structural conformations and lead to a higher binding affinity.

This report has been prepared in a tutorial style using CLC Genomics Workbench. The approach can be adopted by biologists and others to gain an initial understanding of SARS-CoV-2 mutations, and their relationships to SARS-CoV and other closely related viruses. Initial findings can then be explored further by using other specialised tools and approaches.

Computational methods are especially promising. Machine learning, artificial intelligence, data integration and mining, visualisation, and computational and mathematical modelling for key biochemical interactions and disease control mechanisms can usefully be applied to provide solutions in a cost- and time-effective manner. We hope that this report is useful to those who wish to understand essential information about SARS-CoV-2 for subsequent analyses.
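As one small example of the kind of follow-up analysis mentioned above, the sketch below lists substitutions and gaps between two already-aligned protein sequences and reports a simple identity value. It is an illustration under stated assumptions, not part of the CLC Genomics Workbench workflow: the function name `compare_aligned` is hypothetical, the input sequences are made up, and identity is defined here as matching columns divided by the alignment length, which is only one of several conventions in use.

```python
def compare_aligned(reference: str, query: str) -> dict:
    """Summarise differences between two aligned sequences of equal length.

    Positions are 1-based and counted along the ungapped reference sequence.
    """
    if len(reference) != len(query):
        raise ValueError("aligned sequences must have the same length")

    substitutions = []  # e.g. 'K2R': K in the reference, R in the query, position 2
    gaps = []           # (reference position, reference symbol, query symbol)
    matches = 0
    ref_position = 0

    for ref_symbol, query_symbol in zip(reference, query):
        if ref_symbol != "-":
            ref_position += 1
        if ref_symbol == "-" or query_symbol == "-":
            # For an insertion in the query, ref_position is the last
            # reference residue before the inserted column.
            gaps.append((ref_position, ref_symbol, query_symbol))
        elif ref_symbol == query_symbol:
            matches += 1
        else:
            substitutions.append(f"{ref_symbol}{ref_position}{query_symbol}")

    return {
        "matches": matches,
        "substitutions": substitutions,
        "gaps": gaps,
        "identity": matches / len(reference) if reference else 0.0,
    }


if __name__ == "__main__":
    # Toy aligned sequences, not taken from LC528232 or AP006557.
    summary = compare_aligned("MKT-AYL", "MRTGAYV")
    print(summary["substitutions"])
    print(summary["gaps"])
    print(round(summary["identity"], 3))
```

Counting positions along the reference keeps the reported substitutions comparable with residue numbers quoted for that reference, regardless of how many gap columns the alignment introduces.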
Acknowledgements
We thank QIAGEN for providing a two-month extended licence beyond the two-week trial licence in order to utilise the CLC Genomics Workbench tool. The extended licence was advertised by QIAGEN and was subsequently requested by the author.

References

(1) Zhou, P.; Yang, X.-L.; Wang, X.-G.; Hu, B.; Zhang, L.; Zhang, W.; Si, H.-R.; Zhu, Y.; Li, B.; Huang, C.-L.; Chen, H.-D.; Chen, J.; Luo, Y.; Guo, H.; Jiang, R.-D.; Liu, M.-Q.; Chen, Y.; Shen, X.-R.; Wang, X.; Zheng, X.-S.; Zhao, K.; Chen, Q.-J.; Deng, F.; Liu, L.-L.; Yan, B.; Zhan, F.-X.; Wang, Y.-Y.; Xiao, G.-F.; Shi, Z.-L. A pneumonia outbreak associated with a new coronavirus of probable bat origin. Nature, 270–273.
(2) The National Center for Biotechnology Information, Taxonomy Browser.
(3) The National Center for Biotechnology Information, Virus Database.
(4) Andersen, K. G.; Rambaut, A.; Lipkin, W. I.; Holmes, E. C.; Garry, R. F. The proximal origin of SARS-CoV-2. Nature Medicine.
(5) Wan, Y.; Shang, J.; Graham, R.; Baric, R. S.; Li, F. Receptor Recognition by the Novel Coronavirus from Wuhan: an Analysis Based on Decade-Long Structural Studies of SARS Coronavirus. Journal of Virology.
(6) Berman, H.; Henrick, K.; Nakamura, H. Announcing the worldwide Protein Data Bank. Nature Structural & Molecular Biology, 980.
(7) Yuan, Y.; Cao, D.; Zhang, Y.; Ma, J.; Qi, J.; Wang, Q.; Lu, G.; Wu, Y.; Yan, J.; Shi, Y.; Zhang, X.; Gao, G. F. Cryo-EM structures of MERS-CoV and SARS-CoV spike glycoproteins reveal the dynamic receptor binding domains. Nature Communications, 15092.
(8) Protein Data Bank, Structure of 2019-nCoV chimeric receptor-binding domain complexed with its receptor human ACE2.
(9) The European Bioinformatics Institute, COVID-19 Outbreak.
(10) Shang, J.; Ye, G.; Shi, K.; Wan, Y.; Luo, C.; Aihara, H.; Geng, Q.; Auerbach, A.; Li, F. Structural basis of receptor recognition by SARS-CoV-2. Nature.
(11) Wu, F.; Zhao, S.; Yu, B.; Chen, Y.-M.; Wang, W.; Song, Z.-G.; Hu, Y.; Tao, Z.-W.; Tian, J.-H.; Pei, Y.-Y.; Yuan, M.-L.; Zhang, Y.-L.; Dai, F.-H.; Liu, Y.; Wang, Q.-M.; Zheng, J.-J.; Xu, L.; Holmes, E. C.; Zhang, Y.-Z. A new coronavirus associated with human respiratory disease in China. Nature, 579.