Derek Aguiar
Brown University
Publications
Featured research published by Derek Aguiar.
Journal of Computational Biology | 2012
Derek Aguiar; Sorin Istrail
Genome assembly methods produce haplotype phase ambiguous assemblies due to limitations in current sequencing technologies. Determining the haplotype phase of an individual is computationally challenging and experimentally expensive. However, haplotype phase information is crucial in many bioinformatics workflows such as genetic association studies and genomic imputation. Current computational methods for determining haplotype phase from sequence data, known as haplotype assembly, have difficulty producing accurate results for large (1000 Genomes-type) data or operate on restricted optimizations that are unrealistic given modern high-throughput sequencing technologies. We present a novel algorithm, HapCompass, for haplotype assembly of densely sequenced human genome data. The HapCompass algorithm operates on a graph where single nucleotide polymorphisms (SNPs) are nodes and edges are defined by sequence reads, viewed as supporting evidence of co-occurring SNP alleles in a haplotype. In our graph model, haplotype phasings correspond to spanning trees. We define the minimum weighted edge removal optimization on this graph and develop an algorithm based on cycle-basis local optimizations for resolving conflicting evidence. We then estimate the amount of sequencing required to produce a complete haplotype assembly of a chromosome. Using these estimates together with metrics borrowed from genome assembly and haplotype phasing, we compare the accuracy of HapCompass, the Genome Analysis ToolKit, and HapCut on 1000 Genomes Project and simulated data. We show that HapCompass performs significantly better across a variety of data and metrics. HapCompass is freely available for download (www.brown.edu/Research/Istrail_Lab/).
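To make the graph model concrete, here is a minimal sketch of building such an evidence graph and reading a phasing off a spanning tree. It assumes a simplified read representation (dicts mapping SNP index to allele) and a networkx dependency; it illustrates the model only, not the published HapCompass implementation and its cycle-basis optimizations.

```python
# Sketch of the graph model: SNPs are nodes, and each read covering two
# heterozygous sites adds signed evidence to an edge (+1 when the read shows
# the same allele at both sites, -1 otherwise). Illustrative only.
from itertools import combinations

import networkx as nx  # assumed dependency for the spanning-tree step

# Hypothetical reads: dicts mapping SNP index -> observed allele (0/1).
reads = [
    {0: 0, 1: 0},
    {0: 0, 1: 0, 2: 1},
    {1: 1, 2: 0},
]

G = nx.Graph()
for read in reads:
    for i, j in combinations(sorted(read), 2):
        w = 1 if read[i] == read[j] else -1
        if G.has_edge(i, j):
            G[i][j]["weight"] += w
        else:
            G.add_edge(i, j, weight=w)

# A spanning tree fixes a relative phasing; keep the most strongly
# supported edges by total |evidence|.
for u, v, d in G.edges(data=True):
    d["conf"] = abs(d["weight"])
tree = nx.maximum_spanning_tree(G, weight="conf")

# Propagate phase from an arbitrary root: a positive edge means "same
# haplotype allele", a negative edge means "opposite".
root = next(iter(tree.nodes))
phase = {root: 0}
for u, v in nx.dfs_edges(tree, source=root):
    phase[v] = phase[u] if G[u][v]["weight"] > 0 else 1 - phase[u]
print(phase)  # -> {0: 0, 1: 0, 2: 1}: haplotypes 001 and 110
```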
Bioinformatics | 2013
Derek Aguiar; Sorin Istrail
Motivation: Genome-wide haplotype reconstruction from sequence data, or haplotype assembly, is at the center of major challenges in molecular biology and life sciences. For complex eukaryotic organisms like humans, the genome is vast and the population samples are growing so rapidly that algorithms processing high-throughput sequencing data must scale favorably in terms of both accuracy and computational efficiency. Furthermore, current models and methodologies for haplotype assembly (i) do not consider individuals sharing haplotypes jointly, which reduces the size and accuracy of assembled haplotypes, and (ii) are unable to model genomes having more than two sets of homologous chromosomes (polyploidy). Polyploid organisms are increasingly becoming the target of many research groups interested in the genomics of disease, phylogenetics, botany and evolution but there is an absence of theory and methods for polyploid haplotype reconstruction. Results: In this work, we present a number of results, extensions and generalizations of compass graphs and our HapCompass framework. We prove the theoretical complexity of two haplotype assembly optimizations, thereby motivating the use of heuristics. Furthermore, we present graph theory–based algorithms for the problem of haplotype assembly using our previously developed HapCompass framework for (i) novel implementations of haplotype assembly optimizations (minimum error correction), (ii) assembly of a pair of individuals sharing a haplotype tract identical by descent and (iii) assembly of polyploid genomes. We evaluate our methods on 1000 Genomes Project, Pacific Biosciences and simulated sequence data. Availability and Implementation: HapCompass is available for download at http://www.brown.edu/Research/Istrail_Lab/. Contact: [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.
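The minimum error correction objective referenced in (i) admits a compact statement: each read is charged the number of mismatches against the haplotype it fits best. Below is a small sketch of that score, written to accommodate more than two haplotypes as in the polyploid setting; the read representation and function name are illustrative assumptions, not the HapCompass implementation.

```python
# Minimum error correction (MEC) score for a candidate assembly.
# Reads are dicts mapping SNP index -> observed allele (an assumption
# made for illustration).

def mec_score(haplotypes, reads):
    """haplotypes: list of allele lists; more than two entries covers the
    polyploid case. reads: list of dicts mapping SNP index -> allele."""
    return sum(
        min(sum(read[i] != h[i] for i in read) for h in haplotypes)
        for read in reads
    )

# Two complementary diploid haplotypes over three SNPs; the last read
# carries one sequencing error.
h = [[0, 0, 1], [1, 1, 0]]
reads = [{0: 0, 1: 0}, {1: 1, 2: 0}, {0: 0, 2: 0}]
print(mec_score(h, reads))  # -> 1
```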
PLOS ONE | 2014
Ian C. McDowell; Chamilani Nikapitiya; Derek Aguiar; Christopher E. Lane; Sorin Istrail; Marta Gomez-Chiarri
The American oyster Crassostrea virginica, an ecologically and economically important estuarine organism, can suffer high mortalities in areas of the northeastern United States due to Roseovarius Oyster Disease (ROD), caused by the gram-negative bacterial pathogen Roseovarius crassostreae. The goals of this research were to provide insights into: 1) the responses of American oysters to R. crassostreae, and 2) potential mechanisms of resistance or susceptibility to ROD. The responses of oysters to bacterial challenge were characterized by exposing oysters from ROD-resistant and ROD-susceptible families to R. crassostreae, followed by high-throughput sequencing of cDNA samples from various timepoints after disease challenge. Sequence data were assembled into a reference transcriptome and analyzed through differential gene expression and functional enrichment analyses to uncover genes and processes potentially involved in responses to ROD in the American oyster. While susceptible oysters experienced constant levels of mortality when challenged with R. crassostreae, resistant oysters showed levels of mortality similar to non-challenged oysters. Oysters exposed to R. crassostreae showed differential expression of transcripts involved in immune recognition, signaling, protease inhibition, detoxification, and apoptosis. Transcripts involved in metabolism were enriched in susceptible oysters, suggesting that bacterial infection places a large metabolic demand on these oysters. Transcripts differentially expressed in resistant oysters in response to infection included the immune modulators IL-17 and arginase, as well as several genes involved in extracellular matrix remodeling. The identification of potential genes and processes responsible for defense against R. crassostreae in the American oyster provides insights into potential mechanisms of disease resistance.
Research in Computational Molecular Biology | 2010
Bjarni V. Halldórsson; Derek Aguiar; Ryan Tarpine; Sorin Istrail
A phase transition is taking place today. The amount of data generated by genome resequencing technologies is so large that in some cases it is now less expensive to repeat the experiment than to store the information generated by the experiment. In the next few years, it is quite possible that millions of Americans will have been genotyped. The question then arises of how to make the best use of this information and jointly estimate the haplotypes of all these individuals. The premise of the paper is that long shared genomic regions (or tracts) are unlikely unless the haplotypes are identical by descent (IBD), in contrast to short shared tracts, which may be identical by state (IBS). Here we estimate for populations, using the US as a model, what sample size of genotyped individuals would be necessary to have sufficiently long shared haplotype regions (tracts) that are identical by descent, at a statistically significant level. These tracts can then be used as input for a Clark-like phasing method to obtain a complete phasing solution of the sample. We estimate in this paper that for a population like the US, with about 1% of the people genotyped (approximately 2 million), tracts of about 200 SNPs long are shared between pairs of individuals IBD with high probability, which assures the success of the Clark phasing method. We show on simulated data that the algorithm will obtain an almost perfect solution if the number of individuals being SNP arrayed is large enough, and that the correctness of the algorithm grows with the number of individuals being genotyped. We also study a related problem that connects copy number variation with phasing algorithm success. A loss of heterozygosity (LOH) event occurs when, by the laws of Mendelian inheritance, an individual should be heterozygous but, due to a deletion polymorphism, is not. Such polymorphisms are difficult to detect using existing algorithms, but they play an important role in the genetics of disease and will confuse haplotype phasing algorithms if not accounted for. We present an algorithm for detecting LOH regions across the genomes of thousands of individuals. The design of the long-range phasing algorithm and the loss of heterozygosity inference algorithms was inspired by our analysis of the Multiple Sclerosis (MS) GWAS dataset of the International Multiple Sclerosis Genetics Consortium, and we present in this paper results similar to those obtained from the MS data.
Journal of Computational Biology | 2011
Bjarni V. Halldórsson; Derek Aguiar; Ryan Tarpine; Sorin Istrail
A phase transition is taking place today. The amount of data generated by genome resequencing technologies is so large that in some cases it is now less expensive to repeat the experiment than to store the information generated by the experiment. In the next few years, it is quite possible that millions of Americans will have been genotyped. The question then arises of how to make the best use of this information and jointly estimate the haplotypes of all these individuals. The premise of this article is that long shared genomic regions (or tracts) are unlikely unless the haplotypes are identical by descent. These tracts can be used as input for a Clark-like phasing method to obtain a phasing solution of the sample. We show on simulated data that the algorithm will get an almost perfect solution if the number of individuals being genotyped is large enough, and that the correctness of the algorithm grows with the number of individuals being genotyped. We also study a related problem that connects copy number variation with phasing algorithm success. A loss of heterozygosity (LOH) event occurs when, by the laws of Mendelian inheritance, an individual should be heterozygous but, due to a deletion polymorphism, is not. Such polymorphisms are difficult to detect using existing algorithms, but they play an important role in the genetics of disease and will confuse haplotype phasing algorithms if not accounted for. We present an algorithm for detecting LOH regions across the genomes of thousands of individuals. The design of the long-range phasing algorithm and the loss of heterozygosity inference algorithms was inspired by our analysis of the Multiple Sclerosis (MS) GWAS dataset of the International Multiple Sclerosis Genetics Consortium. We present similar results to those obtained from the MS data.
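The tract-based premise lends itself to a compact illustration. The sketch below flags maximal runs of identical alleles between one pair of haplotypes as candidate IBD segments, using the roughly 200-SNP length scale from the estimate above; the data representation and thresholding are simplified assumptions, not the paper's statistical test.

```python
# Candidate IBD detection for one pair of haplotypes: report maximal runs
# of identical alleles at least min_len SNPs long. Illustrative sketch.
import random

def shared_tracts(h1, h2, min_len=200):
    """Return maximal [start, end) runs where h1 and h2 agree for at
    least min_len consecutive SNPs."""
    tracts, start = [], None
    for i, (a, b) in enumerate(zip(h1, h2)):
        if a == b:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_len:
                tracts.append((start, i))
            start = None
    if start is not None and len(h1) - start >= min_len:
        tracts.append((start, len(h1)))
    return tracts

random.seed(0)
h1 = [random.randint(0, 1) for _ in range(1000)]
h2 = h1[:]
h2[:300] = [1 - a for a in h1[:300]]   # agree only from SNP 300 onward
print(shared_tracts(h1, h2))           # -> [(300, 1000)]
```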
Bioinformatics | 2012
Derek Aguiar; Bjarni V. Halldórsson; Eric M. Morrow; Sorin Istrail
Motivation: The understanding of the genetic determinants of complex disease is undergoing a paradigm shift. Genetic heterogeneity of rare mutations with deleterious effects is more commonly being viewed as a major component of disease. Autism is an excellent example where research is active in identifying matches between the phenotypic and genomic heterogeneities. A considerable portion of autism appears to be correlated with copy number variation, which is not directly probed by single nucleotide polymorphism (SNP) array or sequencing technologies. Identifying the genetic heterogeneity of small deletions remains a major unresolved computational problem partly due to the inability of algorithms to detect them. Results: In this article, we present an algorithmic framework, which we term DELISHUS, that implements three exact algorithms for inferring regions of hemizygosity containing genomic deletions of all sizes and frequencies in SNP genotype data. We implement an efficient backtracking algorithm—that processes a 1 billion entry genome-wide association study SNP matrix in a few minutes—to compute all inherited deletions in a dataset. We further extend our model to give an efficient algorithm for detecting de novo deletions. Finally, given a set of called deletions, we also give a polynomial time algorithm for computing the critical regions of recurrent deletions. DELISHUS achieves significantly lower false-positive rates and higher power than previously published algorithms partly because it considers all individuals in the sample simultaneously. DELISHUS may be applied to SNP array or sequencing data to identify the deletion spectrum for family-based association studies. Availability: DELISHUS is available at http://www.brown.edu/Research/Istrail_Lab/. Contact: [email protected] and [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.
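To give a feel for the hemizygosity signal DELISHUS searches for, here is a toy single-trio check, assuming genotypes coded as alternate-allele counts in {0, 1, 2}; it is a hedged illustration of the Mendelian-inconsistency idea, not the exact genome-wide backtracking, de novo, or critical-region algorithms of the paper.

```python
# Toy hemizygosity signal in a single trio: where one parent is homozygous
# reference and the other homozygous alternate, the child must be
# heterozygous; an apparent homozygote is consistent with an inherited
# deletion. Illustrative only, not the DELISHUS algorithms.

def candidate_deletion_sites(mother, father, child):
    sites = []
    for i, (m, f, c) in enumerate(zip(mother, father, child)):
        if {m, f} == {0, 2} and c != 1:
            sites.append(i)  # Mendelian-inconsistent: candidate deletion
    return sites

mother = [0, 2, 1, 0, 2]
father = [2, 0, 1, 2, 0]
child  = [1, 0, 1, 1, 1]   # site 1 should be heterozygous but is not
print(candidate_deletion_sites(mother, father, child))  # -> [1]
```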
Methods of Molecular Biology | 2010
Sorin Istrail; Ryan Tarpine; Kyle Schutter; Derek Aguiar
The CYRENE Project focuses on the study of cis-regulatory genomics and gene regulatory networks (GRN) and has three components: a cisGRN-Lexicon, a cisGRN-Browser, and the Virtual Sea Urchin software system. The project has been done in collaboration with Eric Davidson and is deeply inspired by his experimental work in genomic regulatory systems and gene regulatory networks. The current CYRENE cisGRN-Lexicon contains the regulatory architecture of 200 transcription factor-encoding genes and 100 other regulatory genes in eight species: human, mouse, fruit fly, sea urchin, nematode, rat, chicken, and zebrafish, with higher priority on the first five species. The only regulatory genes included in the cisGRN-Lexicon (CYRENE genes) are those whose regulatory architecture is validated by what we call the Davidson Criterion: they contain sites functionally authenticated by site-specific mutagenesis, conducted in vivo and followed by gene transfer and functional testing. This is recognized as the most stringent experimental validation criterion to date for such a genomic regulatory architecture. The CYRENE cisGRN-Browser is a full genome browser tailored for cis-regulatory annotation and investigation. It began as a branch of the Celera Genome Browser (available as open source at http://sourceforge.net/projects/celeragb/) and has been transformed into a genome browser fully devoted to regulatory genomics. Its access paradigm for genomic data is zoom-to-the-DNA-base in real time. A more recent component of the CYRENE project is the Virtual Sea Urchin system (VSU), an interactive visualization tool that provides a four-dimensional (spatial and temporal) map of the gene regulatory networks of the sea urchin embryo.
Pacific Symposium on Biocomputing | 2013
Derek Aguiar; Wendy S.W. Wong; Sorin Istrail
The growing availability of inexpensive high-throughput sequence data is enabling researchers to sequence tumor populations within a single individual at high coverage. However, cancer genome sequence evolution and mutational phenomena like driver mutations and gene fusions are difficult to investigate without first reconstructing tumor haplotype sequences. Haplotype assembly of single individual tumor populations is an exceedingly difficult task complicated by tumor haplotype heterogeneity, contamination by tumor or normal cell sequences, polyploidy, and complex patterns of variation. While computational and experimental haplotype phasing of diploid genomes has seen much progress in recent years, haplotype assembly in cancer genomes remains uncharted territory. In this work, we describe HapCompass-Tumor, a computational modeling and algorithmic framework for haplotype assembly of copy number variable cancer genomes containing haplotypes at different frequencies and complex variation. We extend our polyploid haplotype assembly model and present novel algorithms for (1) modeling complex variations, including copy number changes, as varying numbers of disjoint paths in an associated graph; (2) handling variable haplotype frequencies and contamination; and (3) computing tumor haplotypes using simple cycles of the compass graph, which constrain the space of haplotype assembly solutions. The model and algorithms are implemented in the software package HapCompass-Tumor, which is available for download from http://www.brown.edu/Research/Istrail_Lab/.
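The role cycles play can be seen on a toy evidence graph: around any cycle, the pairwise phase relations implied by reads must multiply out consistently, and conflicting cycles are where assembly decisions are forced. Below is a hedged sketch of that consistency check over a cycle basis; the edge signs stand in for aggregated read evidence and networkx is an assumed dependency, so this illustrates the constraint rather than the HapCompass-Tumor algorithms.

```python
# Cycle consistency on a small evidence graph: multiplying edge signs
# (+1 same allele, -1 opposite) around any cycle must give +1; otherwise
# some read evidence must be discarded. Illustrative assumption only.
import networkx as nx

G = nx.Graph()
G.add_edge(0, 1, sign=+1)
G.add_edge(1, 2, sign=-1)
G.add_edge(0, 2, sign=+1)   # conflicts with the path 0-1-2

for cycle in nx.cycle_basis(G):
    prod = 1
    for u, v in zip(cycle, cycle[1:] + cycle[:1]):
        prod *= G[u][v]["sign"]
    print(cycle, "consistent" if prod == 1 else "conflicting")
# -> the triangle [0, 1, 2] (in some rotation) is reported as conflicting
```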
Research in Computational Molecular Biology | 2014
Derek Aguiar; Eric M. Morrow; Sorin Istrail
In this work we present graph-theoretic algorithms for the identification of all identical-by-descent (IBD) multi-shared haplotype tracts for an m × n haplotype matrix. We introduce Tractatus, an exact algorithm for computing all IBD haplotype tracts in time linear in the size of the input, O(mn). Tractatus resolves a long-standing open problem, optimally breaking the worst-case quadratic time barrier of O(m^2 n) of previous methods, often cited as a bottleneck in haplotype analysis of genome-wide association study-sized data. This advance in algorithmic efficiency makes an impact in a number of areas of population genomics rooted in the seminal Li-Stephens framework for modeling multi-locus linkage disequilibrium (LD) patterns, with applications to the estimation of recombination rates, imputation, haplotype-based LD mapping, and haplotype phasing. We extend the Tractatus algorithm to include computation of haplotype tracts with allele mismatches and shared homozygous haplotypes in a set of genotypes. Lastly, we present a comparison of algorithmic runtime, power to infer IBD tracts, and false positive rates for simulated data, and computations of homozygous haplotypes in genome-wide association study data of autism. The Tractatus algorithm is available for download at http://www.brown.edu/Research/Istrail_Lab/.
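For intuition about the object being computed, the following naive sketch enumerates maximal multi-shared tracts by refining row groups from each start column. It runs in O(m n^2) time and illustrates the output only; Tractatus computes the same tracts in time linear in the matrix size. The encoding (rows as 0/1 allele lists) and function name are illustrative assumptions.

```python
# Naive O(m * n^2) tract enumeration: from each start column s, refine
# groups of rows that remain identical; emit (start, end, rows) when a
# group of >= 2 rows can no longer be extended right, keeping only groups
# that also cannot be extended left.

def multi_shared_tracts(H, min_len=2):
    m, n = len(H), len(H[0])
    tracts = []
    for s in range(n):
        groups = [list(range(m))]
        for e in range(s, n + 1):
            nxt = []
            for g in groups:
                sub = {}
                if e < n:
                    for r in g:
                        sub.setdefault(H[r][e], []).append(r)
                if len(sub) == 1:          # still identical: extend right
                    nxt.extend(sub.values())
                    continue
                left_maximal = s == 0 or len({H[r][s - 1] for r in g}) > 1
                if len(g) >= 2 and e - s >= min_len and left_maximal:
                    tracts.append((s, e, g))
                nxt.extend(v for v in sub.values() if len(v) >= 2)
            groups = nxt
    return tracts

H = [[0, 1, 1, 0],
     [0, 1, 1, 1],
     [1, 1, 1, 0]]
print(multi_shared_tracts(H))
# -> [(0, 3, [0, 1]), (1, 3, [0, 1, 2]), (1, 4, [0, 2])]
```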
International Conference on Computational Advances in Bio and Medical Sciences | 2012
Derek Aguiar; Sorin Istrail
Genetic heterogeneity of rare mutations with severe effects is increasingly being viewed as a major component of disease [1]. Autism is an excellent example where research is active in identifying matches between the phenotypic and genomic heterogeneities. A substantial portion of autism appears to be correlated with copy number variation, which is not directly probed by high-throughput next-generation sequencing (NGS) or single nucleotide polymorphism (SNP) array technologies [2]. Furthermore, de novo and mapping-based genome assembly methods produce phase-ambiguous assemblies due to the limitations of current sequencing technologies. As a result, phase-dependent interactions between SNP variants may hide complex genetic heterogeneities associated with disease. Thus, identifying the genetic heterogeneity of complex disease remains a major unresolved computational problem due, in part, to the inability of algorithms to detect small deletions and phase single nucleotide polymorphisms. In the first part of this talk, we will present an algorithmic framework, termed DELISHUS, that implements a highly efficient algorithm for inferring genomic deletions of all sizes and frequencies in SNP array data. The core of the algorithm is a de facto polynomial-time backtracking algorithm, which finishes on a 1 billion entry genome-wide association study SNP matrix in a few minutes, that computes all potential inherited deletions in a dataset. With very few modifications, DELISHUS may also infer regions that contain de novo deletions. We will show that DELISHUS has significantly higher sensitivity and specificity than previously developed methods and present a genome-wide deletion map of autism. DELISHUS may be run with SNP array or NGS data. In the second part of the talk, we will present our recent work on haplotype assembly of NGS data using our HapCompass algorithm. We suggest two new metrics for evaluating the quality of a haplotype assembly that do not require knowledge of the true haplotypes. Finally, we will show that HapCompass performs significantly better than the Genome Analysis ToolKit [3] and HapCut [4] for 1000 Genomes data as well as simulated data for a variety of metrics.