James K. Bonfield
Wellcome Trust Sanger Institute
Publication
Featured research published by James K. Bonfield.
Bioinformatics | 2010
James K. Bonfield; Andrew Whitwham
Motivation: Existing sequence assembly editors struggle with the volumes of data now readily available from the latest generation of DNA sequencing instruments. Results: We describe the Gap5 software along with the data structures and algorithms used that allow it to be scalable. We demonstrate this with an assembly of 1.1 billion sequence fragments and compare the performance with several other programs. We analyse the memory, CPU, I/O usage and file sizes used by Gap5. Availability and Implementation: Gap5 is part of the Staden Package and is available under an Open Source licence from http://staden.sourceforge.net. It is implemented in C and Tcl/Tk. Currently it works on Unix systems only. Supplementary information: Supplementary data are available at Bioinformatics online.
PLOS ONE | 2013
James K. Bonfield; Matthew V. Mahoney
Storage and transmission of the data produced by modern DNA sequencing instruments have become a major concern, which prompted the Pistoia Alliance to pose the SequenceSqueeze contest for compression of FASTQ files. We present several compression entries from the competition, Fastqz and Samcomp/Fqzcomp, including the winning entry. These are compared against existing algorithms for both reference-based compression (CRAM, Goby) and non-reference-based compression (DSRC, BAM) and other recently published competition entries (Quip, SCALCE). The tools are shown to be the new Pareto frontier for FASTQ compression, offering state-of-the-art ratios at affordable CPU costs. All programs are freely available on SourceForge. Fastqz: https://sourceforge.net/projects/fastqz/, fqzcomp: https://sourceforge.net/projects/fqzcomp/, and samcomp: https://sourceforge.net/projects/samcomp/.
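The "Pareto frontier" claim in the abstract above can be made concrete with a small sketch. A tool is on the frontier if no other tool is at least as good on both compressed size and CPU time and strictly better on one. The tool names and numbers below are hypothetical placeholders, not results from the paper, and the `Result` type and `pareto_frontier` function are illustrative names only.

```python
from typing import List, NamedTuple


class Result(NamedTuple):
    tool: str
    size_ratio: float   # compressed size / original size; lower is better
    cpu_seconds: float  # CPU time to compress; lower is better


def pareto_frontier(results: List[Result]) -> List[Result]:
    """Return the results that no other result dominates."""
    frontier = []
    for r in results:
        dominated = any(
            o.size_ratio <= r.size_ratio
            and o.cpu_seconds <= r.cpu_seconds
            and (o.size_ratio < r.size_ratio or o.cpu_seconds < r.cpu_seconds)
            for o in results
        )
        if not dominated:
            frontier.append(r)
    # Order from fastest to slowest for readability.
    return sorted(frontier, key=lambda r: r.cpu_seconds)


# Hypothetical measurements, for illustration only.
results = [
    Result("tool_a", 0.30, 10.0),
    Result("tool_b", 0.25, 40.0),
    Result("tool_c", 0.35, 20.0),   # larger and slower than tool_a
    Result("tool_d", 0.20, 120.0),
]

for r in pareto_frontier(results):
    print(r.tool, r.size_ratio, r.cpu_seconds)
```

In this made-up data set, `tool_c` is excluded because `tool_a` produces smaller output in less CPU time; the remaining three each represent a different trade-off between ratio and speed.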
Nucleic Acids Research | 2009
Guy Cochrane; Ruth Akhtar; James K. Bonfield; Lawrence Bower; Fehmi Demiralp; Nadeem Faruque; Richard Gibson; Gemma Hoad; Tim Hubbard; Chris Hunter; Mikyung Jang; Szilveszter Juhos; Rasko Leinonen; Steven Leonard; Quan Lin; Rodrigo Lopez; Dariusz Lorenc; Hamish McWilliam; Gaurab Mukherjee; Sheila Plaister; Rajesh Radhakrishnan; Stephen Robinson; Siamak Sobhany; Petra ten Hoopen; Robert Vaughan; Vadim Zalunin; Ewan Birney
Dramatic increases in the throughput of nucleotide sequencing machines, and the promise of ever greater performance, have thrust bioinformatics into the era of petabyte-scale data sets. Sequence repositories, which provide the feed for these data sets into the worldwide computational infrastructure, are challenged by the impact of these data volumes. The European Nucleotide Archive (ENA; http://www.ebi.ac.uk/embl), comprising the EMBL Nucleotide Sequence Database and the Ensembl Trace Archive, has identified challenges in the storage, movement, analysis, interpretation and visualization of petabyte-scale data sets. We present here our new repository for next generation sequence data, a brief summary of contents of the ENA and provide details of major developments to submission pipelines, high-throughput rule-based validation infrastructure and data integration approaches.
Nature Genetics | 2005
David J. Adams; Emmanouil T. Dermitzakis; Tony Cox; James Smith; Robert Davies; Ruby Banerjee; James K. Bonfield; James C. Mullikin; Yeun Jun Chung; Jane Rogers; Allan Bradley
Inbred mouse strains provide the foundation for mouse genetics. By selecting for phenotypic features of interest, inbreeding drives genomic evolution and eliminates individual variation, while fixing certain sets of alleles that are responsible for the trait characteristics of the strain. Mouse strains 129Sv (129S5) and C57BL/6J, two of the most widely used inbred lines, diverged from common ancestors within the last century, yet very little is known about the genomic differences between them. By comparative genomic hybridization and sequence analysis of 129S5 short insert libraries, we identified substantial structural variation, a complex fine-scale haplotype pattern with a continuous distribution of diversity blocks, and extensive nucleotide variation, including nonsynonymous coding SNPs and stop codons. Collectively, these genomic changes denote the level and direction of allele fixation that has occurred during inbreeding and provide a basis for defining what makes these mouse strains unique.
Nucleic Acids Research | 2007
Guy Cochrane; Ruth Akhtar; Philippe Aldebert; Nicola Althorpe; Alastair Baldwin; Kirsty Bates; Sumit Bhattacharyya; James K. Bonfield; Lawrence Bower; Paul Browne; Matias Castro; Tony Cox; Fehmi Demiralp; Ruth Y. Eberhardt; Nadeem Faruque; Gemma Hoad; Mikyung Jang; Tamara Kulikova; Alberto Labarga; Rasko Leinonen; Steven Leonard; Quan Lin; Rodrigo Lopez; Dariusz Lorenc; Hamish McWilliam; Gaurab Mukherjee; Francesco Nardone; Sheila Plaister; Stephen Robinson; Siamak Sobhany
The Ensembl Trace Archive (http://trace.ensembl.org/) and the EMBL Nucleotide Sequence Database (http://www.ebi.ac.uk/embl/), known together as the European Nucleotide Archive, continue to see growth in data volume and diversity. Selected major developments of 2007 are presented briefly, along with data submission and retrieval information. In the face of increasing requirements for nucleotide trace, sequence and annotation data archiving, data capture priority decisions have been taken at the European Nucleotide Archive. Priorities are discussed in terms of how reliably information can be captured, the long-term benefits of its capture and the ease with which it can be captured.
Nucleic Acids Research | 2010
Rasko Leinonen; Ruth Akhtar; Ewan Birney; James K. Bonfield; Lawrence Bower; Matthew Corbett; Ying Cheng; Fehmi Demiralp; Nadeem Faruque; Neil Goodgame; Richard Gibson; Gemma Hoad; Chris Hunter; Mikyung Jang; Steven Leonard; Quan Lin; Rodrigo Lopez; Michael Maguire; Hamish McWilliam; Sheila Plaister; Rajesh Radhakrishnan; Siamak Sobhany; Guy Slater; Petra ten Hoopen; Franck Valentin; Robert Vaughan; Vadim Zalunin; Daniel R. Zerbino; Guy Cochrane
The European Nucleotide Archive (ENA; http://www.ebi.ac.uk/ena) is Europe’s primary nucleotide sequence archival resource, safeguarding open nucleotide data access, engaging in worldwide collaborative data exchange and integrating with the scientific publication process. ENA has made significant contributions to the collaborative nucleotide archival arena as an active proponent of extending the traditional collaboration to cover capillary and next-generation sequencing information. We have continued to co-develop data and metadata representation formats with our collaborators for both data exchange and public data dissemination. In addition to the DDBJ/EMBL/GenBank feature table format, we share metadata formats for capillary and next-generation sequencing traces and are using and contributing to the NCBI SRA Toolkit for the long-term storage of the next-generation sequence traces. During the course of 2009, ENA has significantly improved sequence submission, search and access functionalities provided at EMBL–EBI. In this article, we briefly describe the content and scope of our archive and introduce major improvements to our services.
Archive | 2003
Rodger Staden; David Phillip Judge; James K. Bonfield
Methods for managing large-scale sequencing projects are available through our GAP4 package, and the applications to which it can link are described. This main assembly and editing program also provides a graphical user interface to the assembly engines CAP3, FAKII, and PHRAP. Because of the diversity of working practices in the large number of laboratories where the package is used, these methods are very flexible and are readily tailored to suit local needs. For example, the Sanger Centre in the UK and the Whitehead Institute in the United States have both made major contributions to the human genome project using the package in different ways. The manual for the current (2001.0) version of the package is over 500 pages when printed, so this chapter is a brief overview of some of its most important components. We have tried to show a logical route through the methods in the package: pre-processing, assembly, contig ordering using read-pairs, contig joining using sequence comparison, assembly checking, automated experiment suggestions for extending contigs and solving problems, and ending with editing and consensus file generation. Before this overview, two important aspects of the package are outlined: the file formats used, and the displays and powerful user interface of GAP4. The package runs on UNIX and Microsoft Windows platforms, is entirely free to academic users, and can be downloaded from http://www.mrc-lmb.cam.ac.uk/pubseq.
Nature Methods | 2016
Ibrahim Numanagić; James K. Bonfield; Faraz Hach; Jan Voges; Jörn Ostermann; Claudio Alberti; Marco Mattavelli; S. Cenk Sahinalp
High-throughput sequencing (HTS) data are commonly stored as raw sequencing reads in FASTQ format or as reads mapped to a reference, in SAM format, both with large memory footprints. Worldwide growth of HTS data has prompted the development of compression methods that aim to significantly reduce HTS data size. Here we report on a benchmarking study of available compression methods on a comprehensive set of HTS data using an automated framework.
Bioinformatics | 2002
James K. Bonfield; Rodger Staden
Motivation: To produce an open and extensible file format for DNA trace data which produces compact files suitable for large-scale storage and efficient use of internet bandwidth. Results: We have created an extensible format named ZTR. For a set of data taken from an ABI-3700 the ZTR format produces trace files which require 61.6% of the disk space used by gzipped SCFv3, and which can be written and read at greater speed. The compression algorithms used for the trace amplitudes are used within the National Center for Biotechnology Information (NCBI) trace archive. lmb.cam.ac.uk/pub/staden/io_lib/test_data.
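The abstract above does not spell out how the trace amplitudes are compressed. One general technique for smooth sampled signals is to store differences between consecutive samples before applying a general-purpose compressor. The sketch below illustrates that idea with Python's `zlib` and `struct` modules; it is not the ZTR specification, and the synthetic trace and all function names are assumptions made for illustration.

```python
import math
import struct
import zlib


def delta_encode(samples):
    """Replace each sample with its difference from the previous one."""
    deltas, prev = [], 0
    for s in samples:
        deltas.append(s - prev)
        prev = s
    return deltas


def delta_decode(deltas):
    """Invert delta_encode by accumulating the differences."""
    samples, prev = [], 0
    for d in deltas:
        prev += d
        samples.append(prev)
    return samples


def pack16(values):
    """Serialise integers as big-endian signed 16-bit values."""
    return struct.pack(f">{len(values)}h", *values)


def unpack16(data):
    """Inverse of pack16."""
    return list(struct.unpack(f">{len(data) // 2}h", data))


def synthetic_trace(n=2000, spacing=12, width=3.0, height=1000):
    """Hypothetical smooth signal: evenly spaced Gaussian-shaped peaks."""
    trace = []
    for i in range(n):
        offset = (i % spacing) - spacing / 2
        trace.append(round(height * math.exp(-((offset / width) ** 2))))
    return trace


trace = synthetic_trace()
raw = zlib.compress(pack16(trace))
filtered = zlib.compress(pack16(delta_encode(trace)))

# The transform is lossless: decoding recovers the original samples.
restored = delta_decode(unpack16(zlib.decompress(filtered)))
assert restored == trace

print("zlib only:", len(raw), "bytes; delta + zlib:", len(filtered), "bytes")
```

Whether the delta step helps depends on the signal, so the script prints both sizes rather than assuming an outcome. Real trace data would be needed to compare against the figures quoted in the abstract.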
Scientific Reports | 2017
Francesca Giordano; Louise Aigrain; Michael A. Quail; Paul Coupland; James K. Bonfield; Robert Davies; German Tischler; David K. Jackson; Thomas M. Keane; Jing Li; Jia-Xing Yue; Gianni Liti; Richard Durbin; Zemin Ning
Long-read sequencing technologies such as Pacific Biosciences and Oxford Nanopore MinION are capable of producing long sequencing reads with average fragment lengths of over 10,000 base-pairs and maximum lengths reaching 100,000 base-pairs. Compared with short reads, the assemblies obtained from long-read sequencing platforms have much higher contig continuity and genome completeness as long fragments are able to extend paths into problematic or repetitive regions. Many successful assembly applications of the Pacific Biosciences technology have been reported ranging from small bacterial genomes to large plant and animal genomes. Recently, genome assemblies using Oxford Nanopore MinION data have attracted much attention due to the portability and low cost of this novel sequencing instrument. In this paper, we re-sequenced a well-characterized genome, the Saccharomyces cerevisiae S288C strain, using three different platforms: MinION, PacBio and MiSeq. We present a comprehensive metric comparison of assemblies generated by various pipelines and discuss how the platform-associated data characteristics affect the assembly quality. With a given read depth of 31X, the assemblies from both Pacific Biosciences and Oxford Nanopore MinION show excellent continuity and completeness for the 16 nuclear chromosomes, but not for the mitochondrial genome, whose reconstruction still represents a significant challenge.
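The abstract above does not list the metrics it compares. A widely used measure of contig continuity is N50: the length of the contig at which the running total, taken from longest to shortest, first reaches half of the total assembly length. The sketch below shows that calculation on made-up contig lengths; the function name and the numbers are assumptions for illustration, not data from the paper.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of contig lengths."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("at least one contig length is required")
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Stop at the first contig that brings the running sum to >= 50%.
        if 2 * running >= total:
            return length
    raise AssertionError("unreachable: the full sum always reaches 50%")


# Hypothetical contig lengths; the total is 280, so the half-way mark is 140.
contigs = [100, 80, 50, 30, 20]
print(n50(contigs))  # 100 alone is below 140; 100 + 80 = 180 reaches it, so N50 is 80
```

A more fragmented assembly of the same total length yields a lower N50, which is why the statistic is often read as a proxy for continuity. It says nothing about correctness or completeness, which need separate checks.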