bioRxiv | 2021
Cluster-specific gene markers enhance Shigella and Enteroinvasive Escherichia coli in silico serotyping
Abstract
Shigella and enteroinvasive Escherichia coli (EIEC) cause human bacillary dysentery through similar invasion mechanisms and share similar physiological, biochemical and genetic characteristics. Differentiating Shigella from EIEC is important for clinical diagnosis and epidemiological investigation, but existing genetic signatures may not discriminate between them. Phylogenetically, however, Shigella and EIEC strains are composed of multiple clusters and represent different forms of E. coli. In this study, we identified 10 Shigella clusters, 7 EIEC clusters and 53 sporadic types of EIEC by examining over 17,000 publicly available Shigella/EIEC genomes. We compared Shigella and EIEC accessory genomes to identify cluster-specific gene markers or marker sets for the 17 clusters and 53 sporadic types. The gene markers showed 99.63% accuracy and more than 97.02% specificity. In addition, we developed a freely available in silico serotyping pipeline, Shigella EIEC Cluster Enhanced Serotype Finder (ShigEiFinder), which incorporates the cluster-specific gene markers together with established Shigella/EIEC serotype-specific O antigen genes and modification genes. ShigEiFinder can process either paired-end Illumina sequencing reads or assembled genomes, and it almost perfectly differentiated Shigella from EIEC, with cluster assignment accuracy of 99.70% for assembled genomes and 99.81% for mapped reads. ShigEiFinder was able to serotype over 59 Shigella serotypes and 22 EIEC serotypes with high specificity: 99.40% for assembled genomes and 99.38% for mapped reads. The cluster markers and our new serotyping tool, ShigEiFinder (https://github.com/LanLab/ShigEiFinder), will be useful for epidemiological and diagnostic investigations.
Data summary
Sequencing data have been deposited at the National Center for Biotechnology Information under BioProject number PRJNA692536.
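To illustrate the idea of cluster assignment from cluster-specific gene markers or marker sets described in the abstract, the following is a minimal Python sketch. It is not the ShigEiFinder implementation. The cluster names, gene identifiers, the `assign_cluster` function, and the rule that exactly one complete marker set must be present are all assumptions made for illustration.

```python
from typing import Dict, Optional, Set

# Hypothetical marker sets: the cluster names and gene identifiers below are
# placeholders, not the markers reported in the study.
CLUSTER_MARKERS: Dict[str, Set[str]] = {
    "cluster_1": {"gene_a"},            # a single-gene marker
    "cluster_2": {"gene_b", "gene_c"},  # a marker set of two genes
}


def assign_cluster(
    genes_present: Set[str],
    markers: Dict[str, Set[str]] = CLUSTER_MARKERS,
) -> Optional[str]:
    """Return the one cluster whose full marker set is present, else None.

    Assumed rule: a cluster matches only when every gene in its marker set
    is detected, and an assignment is made only when exactly one cluster
    matches; ambiguous or empty results are reported as None.
    """
    hits = [
        name
        for name, marker_set in markers.items()
        if marker_set <= genes_present  # subset test: all markers detected
    ]
    return hits[0] if len(hits) == 1 else None
```

Under these assumptions, a genome carrying `gene_b` and `gene_c` would be assigned to `cluster_2`, whereas a genome carrying only `gene_b`, or one matching both hypothetical clusters, would be left unassigned.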