Daniel R. Zerbino
European Bioinformatics Institute
Network
Latest external collaboration on country level. Dive into details by clicking on the dots.
Publication
Featured researches published by Daniel R. Zerbino.
Bioinformatics | 2012
Marcel H. Schulz; Daniel R. Zerbino; Martin Vingron; Ewan Birney
Motivation: High-throughput sequencing has made the analysis of new model organisms more affordable. Although assembling a new genome can still be costly and difficult, it is possible to use RNA-seq to sequence mRNA. In the absence of a known genome, it is necessary to assemble these sequences de novo, taking into account possible alternative isoforms and the dynamic range of expression values. Results: We present a software package named Oases designed to heuristically assemble RNA-seq reads in the absence of a reference genome, across a broad spectrum of expression values and in presence of alternative isoforms. It achieves this by using an array of hash lengths, a dynamic filtering of noise, a robust resolution of alternative splicing events and the efficient merging of multiple assemblies. It was tested on human and mouse RNA-seq data and is shown to improve significantly on the transABySS and Trinity de novo transcriptome assemblers. Availability and implementation: Oases is freely available under the GPL license at www.ebi.ac.uk/~zerbino/oases/ Contact: [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.
Genome Research | 2011
Dent Earl; Keith Bradnam; John St. John; Aaron E. Darling; Dawei Lin; Joseph Fass; Hung On Ken Yu; Vince Buffalo; Daniel R. Zerbino; Mark Diekhans; Ngan Nguyen; Pramila Ariyaratne; Wing-Kin Sung; Zemin Ning; Matthias Haimel; Jared T. Simpson; Nuno A. Fonseca; Inanc Birol; T. Roderick Docking; Isaac Ho; Daniel S. Rokhsar; Rayan Chikhi; Dominique Lavenier; Guillaume Chapuis; Delphine Naquin; Nicolas Maillet; Michael C. Schatz; David R. Kelley; Adam M. Phillippy; Sergey Koren
Low-cost short read sequencing technology has revolutionized genomics, though it is only just becoming practical for the high-quality de novo assembly of a novel large genome. We describe the Assemblathon 1 competition, which aimed to comprehensively assess the state of the art in de novo assembly methods when applied to current sequencing technologies. In a collaborative effort, teams were asked to assemble a simulated Illumina HiSeq data set of an unknown, simulated diploid genome. A total of 41 assemblies from 17 different groups were received. Novel haplotype aware assessments of coverage, contiguity, structure, base calling, and copy number were made. We establish that within this benchmark: (1) It is possible to assemble the genome to a high level of coverage and accuracy, and that (2) large differences exist between the assemblies, suggesting room for further improvements in current methods. The simulated benchmark, including the correct answer, the assemblies, and the code that was used to evaluate the assemblies is now public and freely available from http://www.assemblathon.org/.
PLOS ONE | 2009
Daniel R. Zerbino; Gayle K. McEwen; Elliott H. Margulies; Ewan Birney
Background Despite the short length of their reads, micro-read sequencing technologies have shown their usefulness for de novo sequencing. However, especially in eukaryotic genomes, complex repeat patterns are an obstacle to large assemblies. Principal Findings We present a novel heuristic algorithm, Pebble, which uses paired-end read information to resolve repeats and scaffold contigs to produce large-scale assemblies. In simulations, we can achieve weighted median scaffold lengths (N50) of above 1 Mbp in Bacteria and above 100 kbp in more complex organisms. Using real datasets we obtained a 96 kbp N50 in Pseudomonas syringae and a unique 147 kbp scaffold of a ferret BAC clone. We also present an efficient algorithm called Rock Band for the resolution of repeats in the case of mixed length assemblies, where different sequencing platforms are combined to obtain a cost-effective assembly. Conclusions These algorithms extend the utility of short read only assemblies into large complex genomes. They have been implemented and made available within the open-source Velvet short-read de novo assembler.
Cell | 2016
Biola-Maria Javierre; Oliver Burren; Steven P. Wilder; Kreuzhuber R; Hill Sm; Sven Sewitz; Jonathan Cairns; Steven W. Wingett; Csilla Várnai; Thiecke Mj; Frances Burden; Samantha Farrow; Antony J. Cutler; Karola Rehnström; Kate Downes; Luigi Grassi; Myrto Kostadima; Paula Freire-Pritchett; Wang F; Hendrik G. Stunnenberg; John A. Todd; Daniel R. Zerbino; Oliver Stegle; Willem H. Ouwehand; Mattia Frontini; Chris Wallace; Mikhail Spivakov; Peter Fraser
Summary Long-range interactions between regulatory elements and gene promoters play key roles in transcriptional regulation. The vast majority of interactions are uncharted, constituting a major missing link in understanding genome control. Here, we use promoter capture Hi-C to identify interacting regions of 31,253 promoters in 17 human primary hematopoietic cell types. We show that promoter interactions are highly cell type specific and enriched for links between active promoters and epigenetically marked enhancers. Promoter interactomes reflect lineage relationships of the hematopoietic tree, consistent with dynamic remodeling of nuclear architecture during differentiation. Interacting regions are enriched in genetic variants linked with altered expression of genes they contact, highlighting their functional role. We exploit this rich resource to connect non-coding disease variants to putative target promoters, prioritizing thousands of disease-candidate genes and implicating disease pathways. Our results demonstrate the power of primary cell promoter interactomes to reveal insights into genomic regulatory mechanisms underlying common diseases.
Nature Genetics | 2016
G. David Poznik; Yali Xue; Fernando L. Mendez; Thomas Willems; Andrea Massaia; Melissa A. Wilson Sayres; Qasim Ayub; Shane McCarthy; Apurva Narechania; Seva Kashin; Yuan Chen; Ruby Banerjee; Juan L. Rodriguez-Flores; Maria Cerezo; Haojing Shao; Melissa Gymrek; Ankit Malhotra; Sandra Louzada; Rob DeSalle; Graham R. S. Ritchie; Eliza Cerveira; Tomas Fitzgerald; Erik Garrison; Anthony Marcketta; David Mittelman; Mallory Romanovitch; Chengsheng Zhang; Xiangqun Zheng-Bradley; Gonçalo R. Abecasis; Steven A. McCarroll
We report the sequences of 1,244 human Y chromosomes randomly ascertained from 26 worldwide populations by the 1000 Genomes Project. We discovered more than 65,000 variants, including single-nucleotide variants, multiple-nucleotide variants, insertions and deletions, short tandem repeats, and copy number variants. Of these, copy number variants contribute the greatest predicted functional impact. We constructed a calibrated phylogenetic tree on the basis of binary single-nucleotide variants and projected the more complex variants onto it, estimating the number of mutations for each class. Our phylogeny shows bursts of extreme expansion in male numbers that have occurred independently among each of the five continental superpopulations examined, at times of known migrations and technological innovations.
Genome Biology | 2015
Daniel R. Zerbino; Steven P. Wilder; Nathan Johnson; Thomas Juettemann; Paul Flicek
Most genomic variants associated with phenotypic traits or disease do not fall within gene coding regions, but in regulatory regions, rendering their interpretation difficult. We collected public data on epigenetic marks and transcription factor binding in human cell types and used it to construct an intuitive summary of regulatory regions in the human genome. We verified it against independent assays for sensitivity. The Ensembl Regulatory Build will be progressively enriched when more data is made available. It is freely available on the Ensembl browser, from the Ensembl Regulation MySQL database server and in a dedicated track hub.
Genome Biology | 2016
Jonathan Cairns; Paula Freire-Pritchett; Steven W. Wingett; Csilla Várnai; Andrew Dimond; Vincent Plagnol; Daniel R. Zerbino; Stefan Schoenfelder; Biola M. Javierre; Cameron S. Osborne; Peter Fraser; Mikhail Spivakov
Capture Hi-C (CHi-C) is a method for profiling chromosomal interactions involving targeted regions of interest, such as gene promoters, globally and at high resolution. Signal detection in CHi-C data involves a number of statistical challenges that are not observed when using other Hi-C-like techniques. We present a background model and algorithms for normalisation and multiple testing that are specifically adapted to CHi-C experiments. We implement these procedures in CHiCAGO (http://regulatorygenomicsgroup.org/chicago), an open-source package for robust interaction detection in CHi-C. We validate CHiCAGO by showing that promoter-interacting regions detected with this method are enriched for regulatory features and disease-associated SNPs.
Genome Research | 2010
Andrew L. Young; Hatice Ozel Abaan; Daniel R. Zerbino; James C. Mullikin; Ewan Birney; Elliott H. Margulies
We have developed a novel approach for using massively parallel short-read sequencing to generate fast and inexpensive de novo genomic assemblies comparable to those generated by capillary-based methods. The ultrashort (<100 base) sequences generated by this technology pose specific biological and computational challenges for de novo assembly of large genomes. To account for this, we devised a method for experimentally partitioning the genome using reduced representation (RR) libraries prior to assembly. We use two restriction enzymes independently to create a series of overlapping fragment libraries, each containing a tractable subset of the genome. Together, these libraries allow us to reassemble the entire genome without the need of a reference sequence. As proof of concept, we applied this approach to sequence and assembled the majority of the 125-Mb Drosophila melanogaster genome. We subsequently demonstrate the accuracy of our assembly method with meaningful comparisons against the current available D. melanogaster reference genome (dm3). The ease of assembly and accuracy for comparative genomics suggest that our approach will scale to future mammalian genome-sequencing efforts, saving both time and money without sacrificing quality.
Database | 2016
Daniel R. Zerbino; Nathan Johnson; Thomas Juetteman; Dan Sheppard; Steven P. Wilder; Ilias Lavidas; Michael Nuhn; Emily Perry; Quentin Raffaillac-Desfosses; Daniel Sobral; Damian Keefe; Stefan Gräf; Ikhlak Ahmed; Rhoda Kinsella; Bethan Pritchard; Simon Brent; Ridwan Amode; Anne Parker; Steven Trevanion; Ewan Birney; Ian Dunham; Paul Flicek
New experimental techniques in epigenomics allow researchers to assay a diversity of highly dynamic features such as histone marks, DNA modifications or chromatin structure. The study of their fluctuations should provide insights into gene expression regulation, cell differentiation and disease. The Ensembl project collects and maintains the Ensembl regulation data resources on epigenetic marks, transcription factor binding and DNA methylation for human and mouse, as well as microarray probe mappings and annotations for a variety of chordate genomes. From this data, we produce a functional annotation of the regulatory elements along the human and mouse genomes with plans to expand to other species as data becomes available. Starting from well-studied cell lines, we will progressively expand our library of measurements to a greater variety of samples. Ensembl’s regulation resources provide a central and easy-to-query repository for reference epigenomes. As with all Ensembl data, it is freely available at http://www.ensembl.org, from the Perl and REST APIs and from the public Ensembl MySQL database server at ensembldb.ensembl.org. Database URL: http://www.ensembl.org
Bioinformatics | 2013
Glenn Hickey; Benedict Paten; Dent Earl; Daniel R. Zerbino; David Haussler
Motivation: Large multiple genome alignments and inferred ancestral genomes are ideal resources for comparative studies of molecular evolution, and advances in sequencing and computing technology are making them increasingly obtainable. These structures can provide a rich understanding of the genetic relationships between all subsets of species they contain. Current formats for storing genomic alignments, such as XMFA and MAF, are all indexed or ordered using a single reference genome, however, which limits the information that can be queried with respect to other species and clades. This loss of information grows with the number of species under comparison, as well as their phylogenetic distance. Results: We present HAL, a compressed, graph-based hierarchical alignment format for storing multiple genome alignments and ancestral reconstructions. HAL graphs are indexed on all genomes they contain. Furthermore, they are organized phylogenetically, which allows for modular and parallel access to arbitrary subclades without fragmentation because of rearrangements that have occurred in other lineages. HAL graphs can be created or read with a comprehensive C++ API. A set of tools is also provided to perform basic operations, such as importing and exporting data, identifying mutations and coordinate mapping (liftover). Availability: All documentation and source code for the HAL API and tools are freely available at http://github.com/glennhickey/hal. Contact: [email protected] or [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.