Publication


Featured research published by Michael DiCuccio.


Nucleic Acids Research | 2014

RefSeq: an update on mammalian reference sequences

Kim D. Pruitt; Garth Brown; Susan M. Hiatt; Françoise Thibaud-Nissen; Alexander Astashyn; Olga Ermolaeva; Catherine M. Farrell; Jennifer Hart; Melissa J. Landrum; Kelly M. McGarvey; Michael R. Murphy; Nuala A. O’Leary; Shashikant Pujar; Bhanu Rajput; Sanjida H. Rangwala; Lillian D. Riddick; Andrei Shkeda; Hanzhen Sun; Pamela Tamez; Raymond E. Tully; Craig Wallin; David Webb; Janet Weber; Wendy Wu; Michael DiCuccio; Paul Kitts; Donna Maglott; Terence Murphy; James Ostell

The National Center for Biotechnology Information (NCBI) Reference Sequence (RefSeq) database is a collection of annotated genomic, transcript and protein sequence records derived from data in public sequence archives and from computation, curation and collaboration (http://www.ncbi.nlm.nih.gov/refseq/). We report here on growth of the mammalian and human subsets, changes to NCBI’s eukaryotic annotation pipeline and modifications affecting transcript and protein records. Recent changes to NCBI’s eukaryotic genome annotation pipeline provide higher throughput, and the addition of RNAseq data to the pipeline results in a significant expansion of the number of transcripts and novel exons annotated on mammalian RefSeq genomes. Recent annotation changes include reporting supporting evidence for transcript records, modification of exon feature annotation and the addition of a structured report of gene and sequence attributes of biological interest. We also describe a revised protein annotation policy for alternatively spliced transcripts with more divergent predicted proteins and we summarize the current status of the RefSeqGene project.


Nucleic Acids Research | 2016

Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation

Nuala A. O'Leary; Mathew W. Wright; J. Rodney Brister; Stacy Ciufo; Diana Haddad; Richard McVeigh; Bhanu Rajput; Barbara Robbertse; Brian Smith-White; Danso Ako-adjei; Alexander Astashyn; Azat Badretdin; Yiming Bao; Olga Blinkova; Vyacheslav Brover; Vyacheslav Chetvernin; Jinna Choi; Eric Cox; Olga Ermolaeva; Catherine M. Farrell; Tamara Goldfarb; Tripti Gupta; Daniel H. Haft; Eneida Hatcher; Wratko Hlavina; Vinita Joardar; Vamsi K. Kodali; Wenjun Li; Donna Maglott; Patrick Masterson

The RefSeq project at the National Center for Biotechnology Information (NCBI) maintains and curates a publicly available database of annotated genomic, transcript, and protein sequence records (http://www.ncbi.nlm.nih.gov/refseq/). The RefSeq project leverages the data submitted to the International Nucleotide Sequence Database Collaboration (INSDC) against a combination of computation, manual curation, and collaboration to produce a standard set of stable, non-redundant reference sequences. The RefSeq project augments these reference sequences with current knowledge including publications, functional features and informative nomenclature. The database currently represents sequences from more than 55 000 organisms (>4800 viruses, >40 000 prokaryotes and >10 000 eukaryotes; RefSeq release 71), ranging from a single record to complete genomes. This paper summarizes the current status of the viral, prokaryotic, and eukaryotic branches of the RefSeq project, reports on improvements to data access and details efforts to further expand the taxonomic representation of the collection. We also highlight diverse functional curation initiatives that support multiple uses of RefSeq data including taxonomic validation, genome annotation, comparative genomics, and clinical testing. We summarize our approach to utilizing available RNA-Seq and other data types in our manual curation process for vertebrate, plant, and other species, and describe a new direction for prokaryotic genomes and protein name management.


Genome Research | 2009

The consensus coding sequence (CCDS) project: Identifying a common protein-coding gene set for the human and mouse genomes

Kim D. Pruitt; Jennifer Harrow; Rachel A. Harte; Craig Wallin; Mark Diekhans; Donna Maglott; Steve Searle; Catherine M. Farrell; Jane Loveland; Barbara J. Ruef; Elizabeth Hart; Marie-Marthe Suner; Melissa J. Landrum; Bronwen Aken; Sarah Ayling; Robert Baertsch; Julio Fernandez-Banet; Joshua L. Cherry; Val Curwen; Michael DiCuccio; Manolis Kellis; Jennifer M. Lee; Michael F. Lin; Michael Schuster; Andrew Shkeda; Clara Amid; Garth Brown; Oksana Dukhanina; Adam Frankish; Jennifer Hart

Effective use of the human and mouse genomes requires reliable identification of genes and their products. Although multiple public resources provide annotation, different methods are used that can result in similar but not identical representation of genes, transcripts, and proteins. The collaborative consensus coding sequence (CCDS) project tracks identical protein annotations on the reference mouse and human genomes with a stable identifier (CCDS ID), and ensures that they are consistently represented on the NCBI, Ensembl, and UCSC Genome Browsers. Importantly, the project coordinates on manually reviewing inconsistent protein annotations between sites, as well as annotations for which new evidence suggests a revision is needed, to progressively converge on a complete protein-coding set for the human and mouse reference genomes, while maintaining a high standard of reliability and biological accuracy. To date, the project has identified 20,159 human and 17,707 mouse consensus coding regions from 17,052 human and 16,893 mouse genes. Three evaluation methods indicate that the entries in the CCDS set are highly likely to represent real proteins, more so than annotations from contributing groups not included in CCDS. The CCDS database thus centralizes the function of identifying well-supported, identically-annotated, protein-coding regions.


Nucleic Acids Research | 2016

NCBI prokaryotic genome annotation pipeline.

Tatiana Tatusova; Michael DiCuccio; Azat Badretdin; Vyacheslav Chetvernin; Eric P. Nawrocki; Leonid Zaslavsky; Alexandre Lomsadze; Kim D. Pruitt; Mark Borodovsky; James Ostell

Recent technological advances have opened unprecedented opportunities for large-scale sequencing and analysis of populations of pathogenic species in disease outbreaks, as well as for large-scale diversity studies aimed at expanding our knowledge across the whole domain of prokaryotes. To meet the challenge of timely interpretation of structure, function and meaning of this vast genetic information, a comprehensive approach to automatic genome annotation is critically needed. In collaboration with Georgia Tech, NCBI has developed a new approach to genome annotation that combines alignment-based methods with methods of predicting protein-coding and RNA genes and other functional elements directly from sequence. A new gene finding tool, GeneMarkS+, uses the combined evidence of protein and RNA placement by homology as an initial map of annotation to generate and modify ab initio gene predictions across the whole genome. Thus, NCBI's new Prokaryotic Genome Annotation Pipeline (PGAP) relies more on sequence similarity when confident comparative data are available, while it relies more on statistical predictions in the absence of external evidence. The pipeline provides a framework for generation and analysis of annotation on the full breadth of prokaryotic taxonomy. For additional information on PGAP see https://www.ncbi.nlm.nih.gov/genome/annotation_prok/ and the NCBI Handbook, https://www.ncbi.nlm.nih.gov/books/NBK174280/.


PLOS Biology | 2009

Lineage-Specific Biology Revealed by a Finished Genome Assembly of the Mouse

Deanna M. Church; Leo Goodstadt; LaDeana W. Hillier; Michael C. Zody; Steve Goldstein; Xinwei She; Richa Agarwala; Joshua L. Cherry; Michael DiCuccio; Wratko Hlavina; Yuri Kapustin; Peter Meric; Donna Maglott; Zoë Birtle; Ana C. Marques; Tina Graves; Shiguo Zhou; Brian Teague; Konstantinos Potamousis; Chris Churas; Michael Place; Jill Herschleb; Ron Runnheim; Dan Forrest; James M. Amos-Landgraf; David C. Schwartz; Ze Cheng; Kerstin Lindblad-Toh; Evan E. Eichler; Chris P. Ponting

A finished clone-based assembly of the mouse genome reveals extensive recent sequence duplication during recent evolution and rodent-specific expansion of certain gene families. Newly assembled duplications contain protein-coding genes that are mostly involved in reproductive function.


Nature Genetics | 2010

Public data archives for genomic structural variation

Deanna M. Church; Ilkka Lappalainen; Tam P. Sneddon; Jonathan Hinton; Michael Maguire; John Lopez; John Garner; Justin Paschall; Michael DiCuccio; Eugene Yaschenko; Stephen W. Scherer; Lars Feuk; Paul Flicek

To the Editor: When the road map for sequencing the human genome was laid out, the study of genetic variation was deemed a critical component [1], with the mapping of SNPs initially being a priority. The availability of a high quality human reference assembly [2] facilitated the discovery and characterization of structural variation of DNA, with copy number variation being its most abundant form [3–5]. As other multicellular organisms were sequenced, structural variation was observed to be a ubiquitous feature of genomes [6,7]. The dbSNP [8] database was created early in 1998 to manage SNP and small-scale variation data, but it was not designed for larger and more complex structural variation data. The explosion of data from diverse structural variation studies now necessitates the development of a public data archive. Here, we describe two official companion databases, dbVar and DGVa, serving this community role. Public data archives play a major role in supporting the scientific community. They provide stable and traceable identifiers and allow for a single point of access for data collections, facilitating data download and meta-analysis across studies. The Database of Genomic Variants (DGV) was developed in 2004 to support public access to human genomic structural variation data for use in biomedical studies. DGV has served a very important role in collecting and analyzing structural variation studies, but it is not designed to provide a comprehensive and perpetual archive and has discontinued accepting direct submissions. Instead, DGV will work in partnership with the new archives and use them as the primary source of structural variation information, ensuring that the data in DGV is synchronized with accessioned structural variations in dbVar and DGVa.
The main role of DGV going forward will be to curate and visualize selected studies to facilitate interpretation of structural variation data, including implementing the highest-level quality standards required by the clinical and diagnostic communities. The complex nature of structural variation often makes it less amenable than other genomic data (for example, SNPs) to single-step electronic mapping to reference genomes, necessitating significant manual curation efforts, a critical contribution which DGV will continue to perform (Fig. 1). Efforts are ongoing to populate dbVar and DGVa with the historical studies from DGV, and 29 of these studies, including efforts to provide a comprehensive description of the structural variation landscape, such as study estd20 (Conrad et al.) [9], have been loaded as of August 2010, with the …


Nucleic Acids Research | 2016

Assembly: a resource for assembled genomes at NCBI.

Paul Kitts; Deanna M. Church; Françoise Thibaud-Nissen; Jinna Choi; Vichet Hem; Victor Sapojnikov; Robert G. Smith; Tatiana Tatusova; Charlie Xiang; Andrey Zherikov; Michael DiCuccio; Terence Murphy; Kim D. Pruitt; Avi Kimchi

The NCBI Assembly database (www.ncbi.nlm.nih.gov/assembly/) provides stable accessioning and data tracking for genome assembly data. The model underlying the database can accommodate a range of assembly structures, including sets of unordered contig or scaffold sequences, bacterial genomes consisting of a single complete chromosome, or complex structures such as a human genome with modeled allelic variation. The database provides an assembly accession and version to unambiguously identify the set of sequences that make up a particular version of an assembly, and tracks changes to updated genome assemblies. The Assembly database reports metadata such as assembly names, simple statistical reports of the assembly (number of contigs and scaffolds, contiguity metrics such as contig N50, total sequence length and total gap length) as well as the assembly update history. The Assembly database also tracks the relationship between an assembly submitted to the International Nucleotide Sequence Database Collaboration (INSDC) and the assembly represented in the NCBI RefSeq project. Users can find assemblies of interest by querying the Assembly Resource directly or by browsing available assemblies for a particular organism. Links in the Assembly Resource allow users to easily download sequence and annotations for current versions of genome assemblies from the NCBI genomes FTP site.


Standards in Genomic Sciences | 2016

Meeting report: GenBank microbial genomic taxonomy workshop (12–13 May, 2015)

Scott Federhen; Ramon Rosselló-Móra; Hans-Peter Klenk; Brian J. Tindall; Konstantinos T. Konstantinidis; William B. Whitman; Daniel R. Brown; David P. Labeda; David W. Ussery; George M Garrity; Rita R. Colwell; Nur A. Hasan; Joerg Graf; Aidan Parte; Pablo Yarza; Brittany Goldberg; Heike Sichtig; Ilene Karsch-Mizrachi; Karen Clark; Richard McVeigh; Kim D. Pruitt; Tatiana Tatusova; Robert Falk; Sean Turner; Thomas L. Madden; Paul Kitts; Avi Kimchi; William Klimke; Richa Agarwala; Michael DiCuccio

Many genomes are incorrectly identified at GenBank. We developed a plan to find and correct misidentified genomes using genomic comparison statistics together with a scaffold of reliably identified genomes from type material. A workshop was organized with broad representation from the bacterial taxonomic community to review the proposal, the GenBank Microbial Genomic Taxonomy Workshop, Bethesda MD, May 12–13, 2015.


Nucleic Acids Research | 2018

RefSeq: an update on prokaryotic genome annotation and curation

Daniel H. Haft; Michael DiCuccio; Azat Badretdin; Vyacheslav Brover; Vyacheslav Chetvernin; Kathleen O’Neill; Wenjun Li; Farideh Chitsaz; Myra K. Derbyshire; Noreen R. Gonzales; Marc Gwadz; Fu Lu; Gabriele H. Marchler; James S. Song; Narmada Thanki; Roxanne A. Yamashita; Chanjuan Zheng; Françoise Thibaud-Nissen; Lewis Y. Geer; Kim D. Pruitt

The Reference Sequence (RefSeq) project at the National Center for Biotechnology Information (NCBI) provides annotation for over 95 000 prokaryotic genomes that meet standards for sequence quality, completeness, and freedom from contamination. Genomes are annotated by a single Prokaryotic Genome Annotation Pipeline (PGAP) to provide users with a resource that is as consistent and accurate as possible. Notable recent changes include the development of a hierarchical evidence scheme, a new focus on curating annotation evidence sources, the addition and curation of protein profile hidden Markov models (HMMs), release of an updated pipeline (PGAP-4), and comprehensive re-annotation of RefSeq prokaryotic genomes. Antimicrobial resistance proteins have been reannotated comprehensively, improved structural annotation of insertion sequence transposases and selenoproteins is provided, curated complex domain architectures have given upgraded names to millions of multidomain proteins, and we introduce a new kind of annotation rule—BlastRules. Continual curation of supporting evidence, and propagation of improved names onto RefSeq proteins ensures that the functional annotation of genomes is kept current. An increasing share of our annotation now derives from HMMs and other sets of annotation rules that are portable by nature, and available for download and for reuse by other investigators. RefSeq is found at https://www.ncbi.nlm.nih.gov/refseq/.


International Journal of Systematic and Evolutionary Microbiology | 2018

Using average nucleotide identity to improve taxonomic assignments in prokaryotic genomes at the NCBI

Stacy Ciufo; Sivakumar Kannan; Shobha Sharma; Azat Badretdin; Karen Clark; Sean Turner; Slava Brover; Conrad L. Schoch; Avi Kimchi; Michael DiCuccio

Average nucleotide identity analysis is a useful tool to verify taxonomic identities in prokaryotic genomes, for both complete and draft assemblies. Using optimum threshold ranges appropriate for different prokaryotic taxa, we have reviewed all prokaryotic genome assemblies in GenBank with regard to their taxonomic identity. We present the methods used to make such comparisons, the current status of GenBank verifications, and recent developments in confirming species assignments in new genome submissions.

Collaboration


Dive into Michael DiCuccio's collaborations.

Top Co-Authors

Paul Kitts

National Institutes of Health

Terence Murphy

National Institutes of Health

Kim D. Pruitt

National Institutes of Health

Avi Kimchi

National Institutes of Health

Azat Badretdin

National Institutes of Health

Deanna M. Church

National Institutes of Health

Donna Maglott

National Institutes of Health

Catherine M. Farrell

National Institutes of Health

Daniel H. Haft

National Institutes of Health
