Boris Kiryutin | Researchain

Publication

Featured research published by Boris Kiryutin.

BMC Bioinformatics | 2003

The COG database: an updated version includes eukaryotes

Roman L. Tatusov; Natalie D. Fedorova; John D. Jackson; Aviva R. Jacobs; Boris Kiryutin; Eugene V. Koonin; Dmitri M. Krylov; Raja Mazumder; Sergei L. Mekhedov; Anastasia N. Nikolskaya; B Sridhar Rao; Sergei Smirnov; Alexander V. Sverdlov; Sona Vasudevan; Yuri I. Wolf; Jodie J. Yin; Darren A. Natale

Background: The availability of multiple, essentially complete genome sequences of prokaryotes and eukaryotes spurred both the demand and the opportunity for the construction of an evolutionary classification of genes from these genomes. Such a classification system based on orthologous relationships between genes appears to be a natural framework for comparative genomics and should facilitate both functional annotation of genomes and large-scale evolutionary studies.

Results: We describe here a major update of the previously developed system for delineation of Clusters of Orthologous Groups of proteins (COGs) from the sequenced genomes of prokaryotes and unicellular eukaryotes and the construction of clusters of predicted orthologs for 7 eukaryotic genomes, which we named KOGs after eukaryotic orthologous groups. The COG collection currently consists of 138,458 proteins, which form 4873 COGs and comprise 75% of the 185,505 (predicted) proteins encoded in 66 genomes of unicellular organisms. The eukaryotic orthologous groups (KOGs) include proteins from 7 eukaryotic genomes: three animals (the nematode Caenorhabditis elegans, the fruit fly Drosophila melanogaster and Homo sapiens), one plant, Arabidopsis thaliana, two fungi (Saccharomyces cerevisiae and Schizosaccharomyces pombe), and the intracellular microsporidian parasite Encephalitozoon cuniculi. The current KOG set consists of 4852 clusters of orthologs, which include 59,838 proteins, or ~54% of the analyzed eukaryotic 110,655 gene products. Compared to the coverage of the prokaryotic genomes with COGs, a considerably smaller fraction of eukaryotic genes could be included into the KOGs; addition of new eukaryotic genomes is expected to result in substantial increase in the coverage of eukaryotic genomes with KOGs. Examination of the phyletic patterns of KOGs reveals a conserved core represented in all analyzed species and consisting of ~20% of the KOG set. This conserved portion of the KOG set is much greater than the ubiquitous portion of the COG set (~1% of the COGs). In part, this difference is probably due to the small number of included eukaryotic genomes, but it could also reflect the relative compactness of eukaryotes as a clade and the greater evolutionary stability of eukaryotic genomes.

Conclusion: The updated collection of orthologous protein sets for prokaryotes and eukaryotes is expected to be a useful platform for functional annotation of newly sequenced genomes, including those of complex eukaryotes, and genome-wide evolutionary studies.
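The coverage percentages quoted in this abstract follow from its protein counts, and the "phyletic pattern" of a cluster is simply its presence or absence across the analyzed species. The Python sketch below checks the arithmetic and illustrates the idea on invented data. The counts come from the abstract; the species codes, the three toy clusters, and the helper names `coverage` and `phyletic_pattern` are hypothetical and are not taken from the COG software.

```python
# Counts are taken from the abstract; species codes and toy clusters are invented.

def coverage(clustered, total):
    """Percentage of proteins that were assigned to a cluster."""
    return 100.0 * clustered / total

cog_coverage = coverage(138_458, 185_505)  # COGs: 66 genomes of unicellular organisms
kog_coverage = coverage(59_838, 110_655)   # KOGs: 7 eukaryotic genomes

# Hypothetical three-letter codes for the 7 eukaryotic genomes, in a fixed order.
SPECIES = ["Cel", "Dme", "Hsa", "Ath", "Sce", "Spo", "Ecu"]

def phyletic_pattern(members):
    """'+' where a species has a protein in the cluster, '-' where it does not."""
    return "".join("+" if s in members else "-" for s in SPECIES)

# Toy clusters (not real KOGs): one universal, one animal-only, one plant/fungal.
toy_kogs = {
    "KOG_a": {"Cel", "Dme", "Hsa", "Ath", "Sce", "Spo", "Ecu"},
    "KOG_b": {"Cel", "Dme", "Hsa"},
    "KOG_c": {"Ath", "Sce", "Spo"},
}

patterns = {name: phyletic_pattern(members) for name, members in toy_kogs.items()}
# The conserved core is the set of clusters present in every analyzed species.
core = [name for name, pattern in patterns.items() if "-" not in pattern]
core_fraction = len(core) / len(toy_kogs)

print(f"COG coverage: {cog_coverage:.1f}%, KOG coverage: {kog_coverage:.1f}%")
print(patterns, f"core fraction: {core_fraction:.2f}")
```

In the real data set the core fraction is reported as roughly 20% of KOGs; the toy value here has no biological meaning.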

Nucleic Acids Research | 2001

The COG database: new developments in phylogenetic classification of proteins from complete genomes

Roman L. Tatusov; Darren A. Natale; Igor Garkavtsev; Tatiana Tatusova; Uma Shankavaram; Bachoti S. Rao; Boris Kiryutin; Michael Y. Galperin; Natalie D. Fedorova; Eugene V. Koonin

The database of Clusters of Orthologous Groups of proteins (COGs), which represents an attempt at a phylogenetic classification of the proteins encoded in complete genomes, currently consists of 2791 COGs including 45 350 proteins from 30 genomes of bacteria, archaea and the yeast Saccharomyces cerevisiae (http://www.ncbi.nlm.nih.gov/COG). In addition, a supplement to the COGs is available, in which proteins encoded in the genomes of two multicellular eukaryotes, the nematode Caenorhabditis elegans and the fruit fly Drosophila melanogaster, and shared with bacteria and/or archaea were included. The new features added to the COG database include information pages with structural and functional details on each COG and literature references, improvements of the COGNITOR program that is used to fit new proteins into the COGs, and classification of genomes and COGs constructed by using principal component analysis.

Journal of Virology | 2008

The Influenza Virus Resource at the National Center for Biotechnology Information

Yiming Bao; Pavel Bolotov; Dmitry Dernovoy; Boris Kiryutin; Leonid Zaslavsky; Tatiana Tatusova; Jim Ostell; David J. Lipman

Influenza epidemics cause morbidity and mortality worldwide (4). Each year in the United States, more than 200,000 patients are admitted to hospitals because of influenza and there are approximately 36,000 influenza-related deaths (14). In recent years, several subtypes of avian influenza viruses have jumped host species to infect humans. The H5N1 subtype, in particular, has been reported in 328 human cases and has caused 200 human deaths in 12 countries (World Health Organization, http://www.who.int/csr/disease/avian_influenza/country/cases_table_2007_09_10/en/index.html). These viruses have the potential to cause a pandemic in humans. Antiviral drugs and vaccines must be developed to minimize the damage that such a pandemic would bring. To achieve this, it is vital that researchers have free access to viral sequences in a timely fashion, and sequence analysis tools need to be readily available. Historically, the number of influenza virus sequences in public databases has been far less than those of some well-studied viruses, such as human immunodeficiency virus. The number of complete influenza virus genomes has been even smaller. In addition, many of the sequences were collected in the course of influenza surveillance programs that prioritized antigenically novel isolates. Although collecting antigenically novel isolates is appropriate for surveillance, it results in biased samples of sequenced isolates that are not representative of community cases of influenza (2, 13). Therefore, in 2004, the National Institute of Allergy and Infectious Diseases (NIAID) launched the Influenza Genome Sequencing Project (7), which aims to rapidly sequence influenza viruses from samples collected all over the world. Viral sequences were generated at the J. Craig Venter Institute, annotated at the National Center for Biotechnology Information (NCBI), and deposited in GenBank. 
In just over 2 years after the initiation of the project, more than 2,000 complete genomes of influenza viruses A and B had been deposited in GenBank. To help the research community to make full use of the wealth of information from such a large amount of data, which will be increasing continuously, the Influenza Virus Resource was created at NCBI in 2004.

Nucleic Acids Research | 2009

The National Center for Biotechnology Information's Protein Clusters Database

William Klimke; Richa Agarwala; Azat Badretdin; Slava Chetvernin; Stacy Ciufo; Boris Fedorov; Boris Kiryutin; Kathleen O’Neill; Wolfgang Resch; Sergei Resenchuk; Susan C. Schafer; Igor Tolstoy; Tatiana Tatusova

Rapid increases in DNA sequencing capabilities have led to a vast increase in the data generated from prokaryotic genomic studies, which has been a boon to scientists studying micro-organism evolution and to those who wish to understand the biological underpinnings of microbial systems. The NCBI Protein Clusters Database (ProtClustDB) has been created to efficiently maintain and keep the deluge of data up to date. ProtClustDB contains both curated and uncurated clusters of proteins grouped by sequence similarity. The May 2008 release contains a total of 285 386 clusters derived from over 1.7 million proteins encoded by 3806 nucleotide sequences from the RefSeq collection of complete chromosomes and plasmids from four major groups: prokaryotes, bacteriophages and the mitochondrial and chloroplast organelles. There are 7180 clusters containing 376 513 proteins with curated gene and protein functional annotation. PubMed identifiers and external cross references are collected for all clusters and provide additional information resources. A suite of web tools is available to explore more detailed information, such as multiple alignments, phylogenetic trees and genomic neighborhoods. ProtClustDB provides an efficient method to aggregate gene and protein annotation for researchers and is available at http://www.ncbi.nlm.nih.gov/sites/entrez?db=proteinclusters.

Nucleic Acids Research | 2014

Virus Variation Resource—recent updates and future directions

J. Rodney Brister; Yiming Bao; Sergey A. Zhdanov; Yuri Ostapchuck; Vyacheslav Chetvernin; Boris Kiryutin; Leonid Zaslavsky; Michael Kimelman; Tatiana Tatusova

Virus Variation (http://www.ncbi.nlm.nih.gov/genomes/VirusVariation/) is a comprehensive, web-based resource designed to support the retrieval and display of large virus sequence datasets. The resource includes a value-added database, a specialized search interface and a suite of sequence data displays. Virus-specific sequence annotation and database loading pipelines produce consistent protein and gene annotation and capture sequence descriptors from sequence records, then map these metadata to a controlled vocabulary. The database supports a metadata-driven, web-based search interface where sequences can be selected using a variety of biological and clinical criteria. Retrieved sequences can then be downloaded in a variety of formats or analyzed using a suite of tools and displays. Over the past 2 years, the pre-existing influenza and Dengue virus resources have been combined into a single construct and West Nile virus added to the resultant resource. A number of improvements were incorporated into the sequence annotation and database loading pipelines, and the virus-specific search interfaces were updated to support more advanced functions. Several new features have also been added to the sequence download options, and a new multiple sequence alignment viewer has been incorporated into the resource tool set. Together, these enhancements should improve usability and support the inclusion of new viruses in the future.
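The abstract mentions mapping descriptors captured from sequence records to a controlled vocabulary but gives no details. As a rough illustration only, the Python sketch below normalizes free-text country descriptors to a small vocabulary. The vocabulary table, the function name `normalize_country`, and the assumption that a descriptor may look like "country: region" are all hypothetical and do not describe the actual NCBI pipeline.

```python
import re

# Hypothetical controlled vocabulary: lower-cased variants -> canonical term.
COUNTRY_VOCAB = {
    "usa": "USA",
    "united states": "USA",
    "u.s.a.": "USA",
    "viet nam": "Viet Nam",
    "vietnam": "Viet Nam",
}

def normalize_country(raw):
    """Map a free-text country descriptor to a canonical term, or None if unknown.

    Assumes (for this sketch) that any text after the first ':' is a
    sub-region and can be ignored for the country lookup.
    """
    country_part = raw.split(":", 1)[0]
    key = re.sub(r"\s+", " ", country_part.strip().lower())
    return COUNTRY_VOCAB.get(key)

examples = ["  United  States ", "Viet Nam: Hanoi", "vietnam", "Atlantis"]
normalized = [normalize_country(text) for text in examples]
print(normalized)
```

Returning `None` for unmapped values, rather than guessing, keeps unknown descriptors visible so that they can be reviewed and added to the vocabulary.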

Nucleic Acids Research | 2007

FLAN: a web server for influenza virus genome annotation.

Yiming Bao; Pavel Bolotov; Dmitry Dernovoy; Boris Kiryutin; Tatiana Tatusova

FLAN (short for FLu ANnotation), the NCBI web server for genome annotation of influenza virus (http://www.ncbi.nlm.nih.gov/genomes/FLU/Database/annotation.cgi) is a tool for user-provided influenza A virus or influenza B virus sequences. It can validate and predict protein sequences encoded by an input flu sequence. The input sequence is BLASTed against a database containing influenza sequences to determine the virus type (A or B), segment (1 through 8) and subtype for the hemagglutinin and neuraminidase segments of influenza A virus. For each segment/subtype of the viruses, a set of sample protein sequences is maintained. The input sequence is then aligned against the corresponding protein set with a ‘Protein to nucleotide alignment tool’ (ProSplign). The translated product from the best alignment to the sample protein sequence is used as the predicted protein encoded by the input sequence. The output can be a feature table that can be used for sequence submission to GenBank (by Sequin or tbl2asn), a GenBank flat file, or the predicted protein sequences in FASTA format. A message showing the length of the input sequence, the predicted virus type, segment and subtype for the hemagglutinin and neuraminidase segments of Influenza A virus will also be displayed.
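The classification step described in this abstract (use the best database hit to decide virus type, segment and, where applicable, subtype, then pick the matching sample protein set) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not FLAN's code: the `Hit` record, `classify` and `protein_set_key` are hypothetical names, the hits are invented, and the BLAST search and ProSplign alignment themselves are not performed. The sketch also assumes the standard segment numbering in which segment 4 encodes hemagglutinin and segment 6 encodes neuraminidase.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Hit:
    """One database hit. Hypothetical fields, not BLAST's real output format."""
    virus_type: str          # "A" or "B"
    segment: int             # 1 through 8
    subtype: Optional[str]   # e.g. "H3" or "N2"; None if not annotated
    score: float

def classify(hits):
    """Return (virus type, segment, subtype) taken from the highest-scoring hit."""
    if not hits:
        raise ValueError("no hits: input does not resemble influenza A or B")
    best = max(hits, key=lambda h: h.score)
    # Subtype is reported only for the HA (4) and NA (6) segments of influenza A.
    is_ha_or_na = best.segment in (4, 6)
    subtype = best.subtype if best.virus_type == "A" and is_ha_or_na else None
    return best.virus_type, best.segment, subtype

def protein_set_key(virus_type, segment, subtype):
    """Key for looking up the sample protein set used in the alignment step."""
    key = f"{virus_type}/seg{segment}"
    return f"{key}/{subtype}" if subtype else key

# Invented hits for an input that resembles an influenza A H3 hemagglutinin segment.
hits = [
    Hit("A", 4, "H3", 950.0),
    Hit("A", 4, "H1", 610.0),
    Hit("B", 4, None, 120.0),
]
result = classify(hits)
print(result, protein_set_key(*result))
```

In the real server, the selected protein set would then be aligned to the input nucleotide sequence with ProSplign, and the translation from the best alignment reported as the predicted protein.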

BMC Microbiology | 2009

Virus variation resources at the National Center for Biotechnology Information: dengue virus

Wolfgang Resch; Leonid Zaslavsky; Boris Kiryutin; Michael Rozanov; Yiming Bao; Tatiana Tatusova

Background: There is an increasing number of complete and incomplete virus genome sequences available in public databases. This large body of sequence data harbors information about epidemiology, phylogeny, and virulence. Several specialized databases, such as the NCBI Influenza Virus Resource or the Los Alamos HIV database, offer sophisticated query interfaces along with integrated exploratory data analysis tools for individual virus species to facilitate extracting this information. Thus far, there has not been a comprehensive database for dengue virus, a significant public health threat.

Results: We have created an integrated web resource for dengue virus. The technology developed for the NCBI Influenza Virus Resource has been extended to process non-segmented dengue virus genomes. In order to allow efficient processing of the dengue genome, which is large in comparison with individual influenza segments, we developed an offline pre-alignment procedure which generates a multiple sequence alignment of all dengue sequences. The pre-calculated alignment is then used to rapidly create alignments of sequence subsets in response to user queries. This improvement in technology will also facilitate the incorporation of additional virus species in the future. The set of virus-specific databases at NCBI, which will be referred to as Virus Variation Resources (VVR), allows users to build complex queries against virus-specific databases and then apply exploratory data analysis tools to the results. The metadata is automatically collected where possible, and extended with data extracted from the literature.

Conclusion: The NCBI Dengue Virus Resource integrates dengue sequence information with relevant metadata (sample collection time and location, disease severity, serotype, sequenced genome region) and facilitates retrieval and preliminary analysis of dengue sequences using integrated web analysis and visualization tools.
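The abstract does not say how subset alignments are derived from the pre-calculated alignment. One common way to do this, shown here purely as an assumption, is to take the requested rows from the master alignment and drop any column that contains only gaps in the selected rows. The function name `subset_alignment` and the toy sequences below are hypothetical.

```python
def subset_alignment(master, wanted):
    """Extract rows from a precomputed alignment and drop all-gap columns.

    master: dict mapping sequence name -> aligned string (all the same length)
    wanted: names of the sequences selected by a query
    """
    rows = {name: master[name] for name in wanted}
    if not rows:
        return {}
    length = len(next(iter(rows.values())))
    # Keep a column only if at least one selected row has a residue there.
    keep = [i for i in range(length)
            if any(seq[i] != "-" for seq in rows.values())]
    return {name: "".join(seq[i] for i in keep) for name, seq in rows.items()}

# Toy master alignment (invented sequences, 9 columns each).
master = {
    "den1": "ATG-CC--A",
    "den2": "ATGACC--A",
    "den3": "AT--CCGGA",
    "den4": "ATG-CC--A",
}

pair_a = subset_alignment(master, ["den1", "den4"])
pair_b = subset_alignment(master, ["den1", "den3"])
print(pair_a)
print(pair_b)
```

Because no realignment is needed, the cost of answering a query in this sketch grows with the size of the selected subset rather than with the cost of a fresh multiple alignment.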

Archive | 2016

Dealing with the Data Deluge – New Strategies in Prokaryotic Genome Analysis

Leonid Zaslavsky; Stacy Ciufo; Boris Fedorov; Boris Kiryutin; Igor Tolstoy; Tatiana Tatusova

Recent technological innovations have ignited an explosion in microbial genome sequencing that has fundamentally changed our understanding of biology of microbes and profoundly impacted public health policy. This huge increase in DNA sequence data presents new challenges for the annotation, analysis, and visualization bioinformatics tools. New strategies have been designed to bring an order to this genome sequence shockwave and improve the usability of associated data. Genomes are organized in a hierarchical distance tree using single-copy ribosomal protein marker distances for distance calculation. Protein distance measures dissimilarity between markers of the same type and the subsequent genomic distance averages over the majority of marker-distances, ignoring the outliers. More than 30,000 genomes from public archives have been organized in a marker distance tree resulting in 6,438 species-level clades representing 7,597 taxonomic species. This computational infrastructure provides a foundation for prokaryotic gene and genome analysis, allowing easy access to pre-calculated genome groups at various distance levels. One of the most challenging problems in the current data deluge is the presentation of the relevant data at an appropriate resolution for each application, eliminating data redundancy but keeping biologically interesting variations.
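The abstract states that the genomic distance averages over the majority of marker distances while ignoring outliers, but it does not give the exact rule. The Python sketch below shows one possible reading: keep the fraction of marker distances closest to the median and average those. The function name `genome_distance`, the `keep_fraction` parameter and its default of 0.75, and the example distances are all assumptions made for illustration.

```python
from statistics import median

def genome_distance(marker_distances, keep_fraction=0.75):
    """Average the marker distances closest to the median, ignoring the rest.

    keep_fraction must be above 0.5 so that a majority of markers is kept;
    the value 0.75 is an arbitrary choice for this sketch.
    """
    if not marker_distances:
        raise ValueError("at least one marker distance is required")
    if not 0.5 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0.5, 1.0]")
    mid = median(marker_distances)
    # Rank distances by how far they are from the median; outliers sort last.
    ranked = sorted(marker_distances, key=lambda d: abs(d - mid))
    n_keep = max(1, round(keep_fraction * len(ranked)))
    kept = ranked[:n_keep]
    return sum(kept) / len(kept)

# Invented distances for 8 ribosomal protein markers; the last one is an outlier.
distances = [0.10, 0.12, 0.11, 0.09, 0.13, 0.10, 0.11, 0.90]
robust = genome_distance(distances)
naive = sum(distances) / len(distances)
print(f"robust: {robust:.3f}, naive mean: {naive:.3f}")
```

With these toy values the single outlying marker roughly doubles the plain mean, whereas the trimmed average stays within the range of the other seven markers.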

Archive | 2008