bioRxiv | 2021

Unsupervised classification of SARS-CoV-2 genomic sequences uncovers hidden genetic diversity and suggests an efficient strategy for genomic surveillance


Abstract


Accurate and timely monitoring of emerging genomic diversity is crucial for limiting the spread of potentially more transmissible or pathogenic strains of SARS-CoV-2. At the time of writing, over 1.8M distinct viral genome sequences have been made publicly available, and a sophisticated nomenclature system based on phylogenetic evidence and expert manual curation has allowed the relatively rapid classification of emerging lineages of potential concern. Here, we propose a complementary approach that integrates fine-grained spatiotemporal estimates of allele frequency with unsupervised clustering of viral haplotypes. We demonstrate that multiple highly frequent genetic variants, arising within large and/or rapidly expanding SARS-CoV-2 lineages, have highly biased geographic distributions and are not adequately captured by current SARS-CoV-2 nomenclature standards. Our results argue for a partial revision of the methods currently used to track SARS-CoV-2 genomic diversity and highlight the importance of strategies based on the systematic analysis and integration of regional data. The framework we provide is fully automated and reproducible; it maps genetic diversity over time and across geographic regions, and prioritizes virus variants of potential concern. We believe that the approach outlined in this study will contribute to advances in current genomic surveillance methods.
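To make the idea of spatiotemporal allele-frequency estimates concrete, the following is a minimal sketch, not the authors' pipeline. It assumes a hypothetical input format in which each genome is a `(region, period, variants)` tuple; the function names, the variant labels and the `high`/`low` thresholds are all illustrative assumptions. The unsupervised clustering of haplotypes described in the abstract is not shown.

```python
from collections import defaultdict


def allele_frequencies(records):
    """Return {(region, period): {variant: frequency}}.

    `records` is an iterable of (region, period, variants) tuples, one per
    genome, where `variants` is a collection of variant labels.
    This input format is an assumption made for illustration.
    """
    genomes = defaultdict(int)                         # genomes per stratum
    carriers = defaultdict(lambda: defaultdict(int))   # carriers per variant
    for region, period, variants in records:
        key = (region, period)
        genomes[key] += 1
        for variant in set(variants):
            carriers[key][variant] += 1
    return {
        key: {v: n / genomes[key] for v, n in carriers[key].items()}
        for key in genomes
    }


def regionally_biased(freqs, high=0.5, low=0.05):
    """List variants frequent (>= high) in one stratum and rare (<= low) in another.

    The thresholds are arbitrary placeholders, not values from the study.
    """
    all_variants = {v for stratum in freqs.values() for v in stratum}
    biased = []
    for variant in all_variants:
        # A variant absent from a stratum has frequency 0 there.
        values = [stratum.get(variant, 0.0) for stratum in freqs.values()]
        if max(values) >= high and min(values) <= low:
            biased.append(variant)
    return sorted(biased)


# Toy data with made-up labels: "v_local" is seen only in region A.
records = [
    ("A", "W1", {"v_shared", "v_local"}),
    ("A", "W1", {"v_shared", "v_local"}),
    ("B", "W1", {"v_shared"}),
    ("B", "W1", {"v_shared"}),
]
freqs = allele_frequencies(records)
biased = regionally_biased(freqs)
```

In this toy input, `v_shared` is carried by every genome in both regions, whereas `v_local` is fixed in region A and absent from region B, so only `v_local` is flagged as having a biased distribution.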

DOI 10.1101/2021.06.23.449558
Language English
Journal bioRxiv

Full Text