Boris Fedorov
National Institutes of Health
Publication
Featured research published by Boris Fedorov.
Nucleic Acids Research | 2014
Tatiana Tatusova; Stacy Ciufo; Boris Fedorov; Kathleen O’Neill; Igor Tolstoy
The source of the microbial genomic sequences in the RefSeq collection is the set of primary sequence records submitted to the International Nucleotide Sequence Database public archives. These can be accessed through the Entrez search and retrieval system at http://www.ncbi.nlm.nih.gov/genome. Next-generation sequencing has enabled researchers to perform genomic sequencing at rates that were unimaginable in the past. Microbial genomes can now be sequenced in a matter of hours, which has led to a significant increase in the number of assembled genomes deposited in the public archives. This huge increase in DNA sequence data presents new challenges for the annotation, analysis and visualization bioinformatics tools. New strategies have been developed for the annotation and representation of reference genomes and sequence variations derived from population studies and clinical outbreaks.
Nucleic Acids Research | 2009
William Klimke; Richa Agarwala; Azat Badretdin; Slava Chetvernin; Stacy Ciufo; Boris Fedorov; Boris Kiryutin; Kathleen O’Neill; Wolfgang Resch; Sergei Resenchuk; Susan C. Schafer; Igor Tolstoy; Tatiana Tatusova
Rapid increases in DNA sequencing capabilities have led to a vast increase in the data generated from prokaryotic genomic studies, which has been a boon to scientists studying micro-organism evolution and to those who wish to understand the biological underpinnings of microbial systems. The NCBI Protein Clusters Database (ProtClustDB) has been created to efficiently maintain and keep the deluge of data up to date. ProtClustDB contains both curated and uncurated clusters of proteins grouped by sequence similarity. The May 2008 release contains a total of 285 386 clusters derived from over 1.7 million proteins encoded by 3806 nt sequences from the RefSeq collection of complete chromosomes and plasmids from four major groups: prokaryotes, bacteriophages and the mitochondrial and chloroplast organelles. There are 7180 clusters containing 376 513 proteins with curated gene and protein functional annotation. PubMed identifiers and external cross references are collected for all clusters and provide additional information resources. A suite of web tools is available to explore more detailed information, such as multiple alignments, phylogenetic trees and genomic neighborhoods. ProtClustDB provides an efficient method to aggregate gene and protein annotation for researchers and is available at http://www.ncbi.nlm.nih.gov/sites/entrez?db=proteinclusters.
Nucleic Acids Research | 2015
Tatiana Tatusova; Stacy Ciufo; Scott Federhen; Boris Fedorov; Richard McVeigh; Kathleen O’Neill; Igor Tolstoy; Leonid Zaslavsky
The NCBI RefSeq genome collection (http://www.ncbi.nlm.nih.gov/genome) represents all three major domains of life (Eukarya, Bacteria and Archaea) as well as Viruses. Prokaryotic genome sequences are the most rapidly growing part of the collection. During 2014, more than 10 000 microbial genome assemblies were publicly released, bringing the total number of prokaryotic genomes close to 30 000. We continue to improve the quality and usability of the microbial genome resources by providing easy access to the data and the results of the pre-computed analysis, and by improving analysis and visualization tools. A number of improvements have been incorporated into the Prokaryotic Genome Annotation Pipeline. Several new features have been added to the RefSeq prokaryotic genome data processing pipeline, including the calculation of genome groups (clades) and the optimization of protein cluster generation using a pan-genome approach.
Standards in Genomic Sciences | 2011
William Klimke; Claire O’Donovan; Owen White; J. Rodney Brister; Karen Clark; Boris Fedorov; Ilene Mizrachi; Kim D. Pruitt; Tatiana Tatusova
The promise of genome sequencing was that the vast undiscovered country would be mapped out by comparison of the multitude of sequences available and would aid researchers in deciphering the role of each gene in every organism. Researchers recognize that there is a need for high quality data. However, different annotation procedures, numerous databases, and a diminishing percentage of experimentally determined gene functions have resulted in a spectrum of annotation quality. NCBI in collaboration with sequencing centers, archival databases, and researchers, has developed the first international annotation standards, a fundamental step in ensuring that high quality complete prokaryotic genomes are available as gold standard references. Highlights include the development of annotation assessment tools, community acceptance of protein naming standards, comparison of annotation resources to provide consistent annotation, and improved tracking of the evidence used to generate a particular annotation. The development of a set of minimal standards, including the requirement for annotated complete prokaryotic genomes to contain a full set of ribosomal RNAs, transfer RNAs, and proteins encoding core conserved functions, is an historic milestone. The use of these standards in existing genomes and future submissions will increase the quality of databases, enabling researchers to make accurate biological discoveries.
Archive | 2016
Leonid Zaslavsky; Stacy Ciufo; Boris Fedorov; Boris Kiryutin; Igor Tolstoy; Tatiana Tatusova
Recent technological innovations have ignited an explosion in microbial genome sequencing that has fundamentally changed our understanding of biology of microbes and profoundly impacted public health policy. This huge increase in DNA sequence data presents new challenges for the annotation, analysis, and visualization bioinformatics tools. New strategies have been designed to bring an order to this genome sequence shockwave and improve the usability of associated data. Genomes are organized in a hierarchical distance tree using single-copy ribosomal protein marker distances for distance calculation. Protein distance measures dissimilarity between markers of the same type and the subsequent genomic distance averages over the majority of marker-distances, ignoring the outliers. More than 30,000 genomes from public archives have been organized in a marker distance tree resulting in 6,438 species-level clades representing 7,597 taxonomic species. This computational infrastructure provides a foundation for prokaryotic gene and genome analysis, allowing easy access to pre-calculated genome groups at various distance levels. One of the most challenging problems in the current data deluge is the presentation of the relevant data at an appropriate resolution for each application, eliminating data redundancy but keeping biologically interesting variations.
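The outlier-tolerant averaging described in the abstract above can be illustrated with a minimal sketch. It is not the published method: the abstract only says the genomic distance averages over "the majority" of marker distances, so the function name `genomic_distance`, the `keep_fraction` parameter, and the choice to rank values by their deviation from the median are all assumptions made for illustration.

```python
from statistics import mean, median


def genomic_distance(marker_distances, keep_fraction=0.75):
    """Average the majority of per-marker distances, dropping outliers.

    Assumption: "outliers" are the values farthest from the median, and
    `keep_fraction` (hypothetical) sets how large the retained majority is.
    """
    if not marker_distances:
        raise ValueError("need at least one marker distance")
    if not 0.5 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must describe a majority")

    center = median(marker_distances)
    # Rank markers by how far their distance lies from the median.
    ranked = sorted(marker_distances, key=lambda d: abs(d - center))
    # Keep the closest majority; always retain at least one value.
    keep = max(1, int(len(ranked) * keep_fraction))
    return mean(ranked[:keep])


# Toy example: three consistent markers and one aberrant marker.
toy = [0.10, 0.12, 0.11, 0.90]
result = genomic_distance(toy)
```

In the toy input, the aberrant value 0.90 is discarded and the result is the mean of the three remaining distances, whereas a plain mean would be pulled strongly toward the outlier.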
BMC Bioinformatics | 2016
Leonid Zaslavsky; Stacy Ciufo; Boris Fedorov; Tatiana Tatusova
Background: Microbial genomes at the National Center for Biotechnology Information (NCBI) represent a large collection of more than 35,000 assemblies. There are several complexities associated with the data: a great variation in sampling density, since human pathogens are densely sampled while other bacteria are less represented; different protein families occur in annotations with different frequencies; and the quality of genome annotation varies greatly. In order to extract useful information from these sophisticated data, the analysis needs to be performed at multiple levels of phylogenomic resolution and protein similarity, with an adequate sampling strategy.
Results: Protein clustering is used to construct meaningful and stable groups of similar proteins to be used for analysis and functional annotation. Our approach is to create protein clusters at three levels. First, tight clusters in groups of closely-related genomes (species-level clades) are constructed using a combined approach that takes into account both sequence similarity and genome context. Second, clustroids of conservative in-clade clusters are organized into seed global clusters. Finally, global protein clusters are built around the seed clusters. We propose filtering strategies that allow limiting the protein set included in global clustering. The in-clade clustering procedure, subsequent selection of clustroids and organization into seed global clusters provide a robust representation and a high rate of compression. Seed protein clusters are further extended by adding related proteins. Extended seed clusters include a significant part of the data and represent all major known cell machinery. The remaining part, coming from either non-conservative (unique) or rapidly evolving proteins, from rare genomes, or resulting from low-quality annotation, does not group together well. Processing these proteins requires significant computational resources and results in a large number of questionable clusters.
Conclusion: The developed filtering strategies make it possible to identify and exclude such peripheral proteins, limiting the protein dataset in global clustering. Overall, the proposed methodology allows the relevant data at different levels of detail to be obtained and data redundancy to be eliminated while keeping biologically interesting variations.
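The second level of the approach above relies on clustroids, the representative members of in-clade clusters. As a rough sketch, a clustroid is commonly taken to be the member with the smallest total distance to all other members of its cluster; whether the paper uses exactly this criterion is not stated in the abstract, so treat it as an assumption. The function names `clustroid` and `hamming` are hypothetical, and the Hamming distance on equal-length toy strings is only a stand-in for a real protein similarity measure.

```python
def hamming(a, b):
    """Count mismatched positions between two equal-length sequences.

    Stand-in distance for illustration only; real protein clustering
    would use an alignment-based similarity measure.
    """
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def clustroid(members, distance=hamming):
    """Return the member with the smallest summed distance to all members."""
    if not members:
        raise ValueError("cluster must not be empty")
    return min(
        members,
        key=lambda m: sum(distance(m, other) for other in members),
    )


# Toy cluster: one sequence differs from each of the others by a single
# position, so it is the most central member.
toy_cluster = ["ACGT", "ACGA", "ACTT", "TCGT"]
representative = clustroid(toy_cluster)
```

Keeping one representative per in-clade cluster is what gives the compression mentioned in the abstract: later, global steps compare representatives rather than every protein.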
bioinformatics and biomedicine | 2011
Leonid Zaslavsky; Vyacheslav Chetvernin; Dmitry Dernovoy; Boris Fedorov; William Klimke; Alexandre Souvorov; Igor Tolstoy; Tatiana Tatusova; David J. Lipman
From the beginning of the microbial genome sequencing era, researchers have shown a commendable commitment to phylogenetic diversity. The completion of one genome from each prokaryotic division or phylum is still a frequently articulated community goal. However, largely because of the interest in human pathogens and advances in sequencing technologies, there are also now a number of very closely related genomes whose organization and gene content can be directly compared. Studying genetic variability of pathogenic bacteria using whole-genome sequencing provides a way to understand the mechanism of bacterial adaptation to rapid environmental changes and can be a source of useful information on virulence mechanisms. The bacterial genome datasets available in public archives represent a large collection of genomes at different levels of sequence quality and assembly. A fast and reliable method of phylogenetic classification based on genome sequences provides a necessary foundation for a more detailed comparative analysis. NCBI has developed an approach of grouping bacterial organisms into phylogenetic clades using a genome dissimilarity measure based on the comparison of universally conserved markers. Special adjustments have been made to compensate for data inaccuracy and incompleteness. Tests performed on complete and draft genomes from phylum Proteobacteria demonstrated that the proposed robust genomic distance allows stable and reliable species-level clustering and can be used for forming phylogenetic clades. Since the tradeoff for the increased robustness of the method is its limited sensitivity at a very fine level, a phylogenomic refinement could be done within each constructed clade when fine-level phylogenetic resolution of close genomes is necessary.
Archive | 2014
Tatiana Tatusova; Stacy Ciufo; Boris Fedorov; Kathleen O’Neill; Igor Tolstoy; Leonid Zaslavsky
Archive | 2014
Tatiana Tatusova; Leonid Zaslavsky; Boris Fedorov; Diana Haddad; Anjana Vatsan; Danso Ako-adjei; Olga Blinkova; Hassan Ghazal