Paul Kitts
National Institutes of Health
Publication
Featured research published by Paul Kitts.
Nucleic Acids Research | 2014
Kim D. Pruitt; Garth Brown; Susan M. Hiatt; Françoise Thibaud-Nissen; Alexander Astashyn; Olga Ermolaeva; Catherine M. Farrell; Jennifer Hart; Melissa J. Landrum; Kelly M. McGarvey; Michael R. Murphy; Nuala A. O’Leary; Shashikant Pujar; Bhanu Rajput; Sanjida H. Rangwala; Lillian D. Riddick; Andrei Shkeda; Hanzhen Sun; Pamela Tamez; Raymond E. Tully; Craig Wallin; David Webb; Janet Weber; Wendy Wu; Michael DiCuccio; Paul Kitts; Donna Maglott; Terence Murphy; James Ostell
The National Center for Biotechnology Information (NCBI) Reference Sequence (RefSeq) database is a collection of annotated genomic, transcript and protein sequence records derived from data in public sequence archives and from computation, curation and collaboration (http://www.ncbi.nlm.nih.gov/refseq/). We report here on growth of the mammalian and human subsets, changes to NCBI’s eukaryotic annotation pipeline and modifications affecting transcript and protein records. Recent changes to NCBI’s eukaryotic genome annotation pipeline provide higher throughput, and the addition of RNAseq data to the pipeline results in a significant expansion of the number of transcripts and novel exons annotated on mammalian RefSeq genomes. Recent annotation changes include reporting supporting evidence for transcript records, modification of exon feature annotation and the addition of a structured report of gene and sequence attributes of biological interest. We also describe a revised protein annotation policy for alternatively spliced transcripts with more divergent predicted proteins and we summarize the current status of the RefSeqGene project.
Nucleic Acids Research | 2016
Nuala A. O'Leary; Mathew W. Wright; J. Rodney Brister; Stacy Ciufo; Diana Haddad; Richard McVeigh; Bhanu Rajput; Barbara Robbertse; Brian Smith-White; Danso Ako-adjei; Alexander Astashyn; Azat Badretdin; Yiming Bao; Olga Blinkova; Vyacheslav Brover; Vyacheslav Chetvernin; Jinna Choi; Eric Cox; Olga Ermolaeva; Catherine M. Farrell; Tamara Goldfarb; Tripti Gupta; Daniel H. Haft; Eneida Hatcher; Wratko Hlavina; Vinita Joardar; Vamsi K. Kodali; Wenjun Li; Donna Maglott; Patrick Masterson
The RefSeq project at the National Center for Biotechnology Information (NCBI) maintains and curates a publicly available database of annotated genomic, transcript, and protein sequence records (http://www.ncbi.nlm.nih.gov/refseq/). The RefSeq project leverages the data submitted to the International Nucleotide Sequence Database Collaboration (INSDC) against a combination of computation, manual curation, and collaboration to produce a standard set of stable, non-redundant reference sequences. The RefSeq project augments these reference sequences with current knowledge including publications, functional features and informative nomenclature. The database currently represents sequences from more than 55 000 organisms (>4800 viruses, >40 000 prokaryotes and >10 000 eukaryotes; RefSeq release 71), ranging from a single record to complete genomes. This paper summarizes the current status of the viral, prokaryotic, and eukaryotic branches of the RefSeq project, reports on improvements to data access and details efforts to further expand the taxonomic representation of the collection. We also highlight diverse functional curation initiatives that support multiple uses of RefSeq data including taxonomic validation, genome annotation, comparative genomics, and clinical testing. We summarize our approach to utilizing available RNA-Seq and other data types in our manual curation process for vertebrate, plant, and other species, and describe a new direction for prokaryotic genomes and protein name management.
Genome Biology | 2015
Deanna M. Church; Valerie Schneider; Karyn Meltz Steinberg; Michael C. Schatz; Aaron R. Quinlan; Chen Shan Chin; Paul Kitts; Bronwen Aken; Gabor T. Marth; Michael M. Hoffman; Javier Herrero; M. Lisandra Zepeda Mendoza; Richard Durbin; Paul Flicek
The human genome reference assembly is crucial for aligning and analyzing sequence data, and for genome annotation, among other roles. However, the models and analysis assumptions that underlie the current assembly need revising to fully represent human sequence diversity. Improved analysis tools and updated data reporting formats are also required.
Genome Research | 2017
Valerie Schneider; Tina A. Graves-Lindsay; Kerstin Howe; Nathan Bouk; Hsiu-Chuan Chen; Paul Kitts; Terence Murphy; Kim D. Pruitt; Françoise Thibaud-Nissen; Derek Albracht; Robert S. Fulton; Milinn Kremitzki; Vincent Magrini; Chris Markovic; Sean McGrath; Karyn Meltz Steinberg; Kate Auger; William Chow; Joanna Collins; Glenn Harden; Tim Hubbard; Sarah Pelan; Jared T. Simpson; Glen Threadgold; James Torrance; Jonathan Wood; Laura Clarke; Sergey Koren; Matthew Boitano; Paul Peluso
The human reference genome assembly plays a central role in nearly all aspects of today's basic and clinical research. GRCh38 is the first coordinate-changing assembly update since 2009; it reflects the resolution of roughly 1000 issues and encompasses modifications ranging from thousands of single base changes to megabase-scale path reorganizations, gap closures, and localization of previously orphaned sequences. We developed a new approach to sequence generation for targeted base updates and used data from new genome mapping technologies and single haplotype resources to identify and resolve larger assembly issues. For the first time, the reference assembly contains sequence-based representations for the centromeres. We also expanded the number of alternate loci to create a reference that provides a more robust representation of human population variation. We demonstrate that the updates render the reference an improved annotation substrate, alter read alignments in unchanged regions, and impact variant interpretation at clinically relevant loci. We additionally evaluated a collection of new de novo long-read haploid assemblies and conclude that although the new assemblies compare favorably to the reference with respect to continuity, error rate, and gene completeness, the reference still provides the best representation for complex genomic regions and coding sequences. We assert that the collected updates in GRCh38 make the newer assembly a more robust substrate for comprehensive analyses that will promote our understanding of human biology and advance our efforts to improve health.
Nucleic Acids Research | 2016
Paul Kitts; Deanna M. Church; Françoise Thibaud-Nissen; Jinna Choi; Vichet Hem; Victor Sapojnikov; Robert G. Smith; Tatiana Tatusova; Charlie Xiang; Andrey Zherikov; Michael DiCuccio; Terence Murphy; Kim D. Pruitt; Avi Kimchi
The NCBI Assembly database (www.ncbi.nlm.nih.gov/assembly/) provides stable accessioning and data tracking for genome assembly data. The model underlying the database can accommodate a range of assembly structures, including sets of unordered contig or scaffold sequences, bacterial genomes consisting of a single complete chromosome, or complex structures such as a human genome with modeled allelic variation. The database provides an assembly accession and version to unambiguously identify the set of sequences that make up a particular version of an assembly, and tracks changes to updated genome assemblies. The Assembly database reports metadata such as assembly names, simple statistical reports of the assembly (number of contigs and scaffolds, contiguity metrics such as contig N50, total sequence length and total gap length) as well as the assembly update history. The Assembly database also tracks the relationship between an assembly submitted to the International Nucleotide Sequence Database Collaboration (INSDC) and the assembly represented in the NCBI RefSeq project. Users can find assemblies of interest by querying the Assembly Resource directly or by browsing available assemblies for a particular organism. Links in the Assembly Resource allow users to easily download sequence and annotations for current versions of genome assemblies from the NCBI genomes FTP site.
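The contig N50 mentioned in the abstract above is a standard contiguity metric: the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the total assembly length. The following is a minimal sketch of that definition, not the Assembly database's own code; the function name `contig_n50` and the example lengths are illustrative assumptions.

```python
def contig_n50(lengths):
    """Return the N50 of a collection of contig lengths.

    Hypothetical helper for illustration: sorts contigs from longest to
    shortest and returns the length of the contig at which the running
    total first reaches half of the summed length.
    """
    if not lengths:
        raise ValueError("need at least one contig length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Made-up contig lengths: total is 290, so half is 145.
# 80 + 70 = 150 reaches the halfway point, so the N50 is 70.
print(contig_n50([80, 70, 50, 40, 30, 20]))  # prints 70
```

A larger N50 indicates that more of the assembly sits in long contigs, which is why the metric is reported alongside contig counts and total gap length.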
Standards in Genomic Sciences | 2016
Scott Federhen; Ramon Rosselló-Móra; Hans-Peter Klenk; Brian J. Tindall; Konstantinos T. Konstantinidis; William B. Whitman; Daniel R. Brown; David P. Labeda; David W. Ussery; George M Garrity; Rita R. Colwell; Nur A. Hasan; Joerg Graf; Aidan Parte; Pablo Yarza; Brittany Goldberg; Heike Sichtig; Ilene Karsch-Mizrachi; Karen Clark; Richard McVeigh; Kim D. Pruitt; Tatiana Tatusova; Robert Falk; Sean Turner; Thomas L. Madden; Paul Kitts; Avi Kimchi; William Klimke; Richa Agarwala; Michael DiCuccio
Many genomes are incorrectly identified at GenBank. We developed a plan to find and correct misidentified genomes using genomic comparison statistics together with a scaffold of reliably identified genomes from type. A workshop was organized with broad representation from the bacterial taxonomic community to review the proposal, the GenBank Microbial Genomic Taxonomy Workshop, Bethesda MD, May 12–13, 2015.
Bioinformatics | 2018
Alejandro A. Schäffer; Eric P. Nawrocki; Yoon Choi; Paul Kitts; Ilene Karsch-Mizrachi; Richard McVeigh
Motivation: Nucleic acid sequences in public databases should not contain vector contamination, but many sequences in GenBank do (or did) contain vectors. The National Center for Biotechnology Information uses the program VecScreen to screen submitted sequences for contamination. Additional tools are needed to distinguish true-positive (contamination) from false-positive (not contamination) VecScreen matches. Results: A principal reason for false-positive VecScreen matches is that the sequence and the matching vector subsequence originate from closely related or identical organisms (for example, both originate in Escherichia coli). We collected information on the taxonomy of sources of vector segments in the UniVec database used by VecScreen. We used that information in two overlapping software pipelines for retrospective analysis of contamination in GenBank and for prospective analysis of contamination in new sequence submissions. Using the retrospective pipeline, we identified and corrected over 8000 contaminated sequences in the nonredundant nucleotide database. The prospective analysis pipeline has been in production use since April 2017 to evaluate some new GenBank submissions. Availability and implementation: Data on the sources of UniVec entries were included in release 10.0 (ftp://ftp.ncbi.nih.gov/pub/UniVec/). The main software is freely available at https://github.com/aaschaffer/vecscreen_plus_taxonomy. Contact: [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.
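The idea stated in the abstract above, that a vector match is more likely a false positive when the sequence and the matching vector segment come from closely related organisms, can be illustrated with a toy check. This is only a sketch under stated assumptions and does not reproduce the logic of the published pipelines: the function name `likely_false_positive`, the root-to-leaf lineage lists, and the `min_shared_depth` threshold are all hypothetical.

```python
def likely_false_positive(sequence_lineage, vector_lineage, min_shared_depth):
    """Flag a vector match as a likely false positive.

    Hypothetical rule for illustration only: both lineages are lists of
    taxon names ordered from root to leaf, and the match is flagged when
    the two lineages share at least `min_shared_depth` leading taxa.
    """
    shared = 0
    for seq_taxon, vec_taxon in zip(sequence_lineage, vector_lineage):
        if seq_taxon != vec_taxon:
            break
        shared += 1
    return shared >= min_shared_depth


# Illustrative lineages (assumed inputs, not drawn from UniVec).
e_coli = ["Bacteria", "Pseudomonadota", "Gammaproteobacteria",
          "Enterobacterales", "Enterobacteriaceae", "Escherichia",
          "Escherichia coli"]
human = ["Eukaryota", "Metazoa", "Chordata", "Mammalia", "Primates",
         "Hominidae", "Homo", "Homo sapiens"]

# An E. coli sequence matching an E. coli-derived vector segment is flagged;
# a human sequence matching the same segment is not.
print(likely_false_positive(e_coli, e_coli, 6))  # prints True
print(likely_false_positive(human, e_coli, 6))   # prints False
```

The depth threshold is a stand-in for "closely related"; a real implementation would need a principled relatedness criterion and the taxonomy data distributed with UniVec.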
Nucleic Acids Research | 1990
Paul Kitts; Martin D. Ayres; Robert D. Possee
Plant and Animal Genome XX Conference (January 14-18, 2012) | 2013
Françoise Thibaud-Nissen; Alexander Souvorov; Terence Murphy; Michael DiCuccio; Paul Kitts
Virology | 1995
Rachael E. Hawtin; Kevin Arnold; Martin D. Ayres; Paolo Marinho de Andrade Zanotto; Stephen C. Howard; Graham W. Gooday; L. H. Chappell; Paul Kitts; Linda A. King; Robert D. Possee