
Publication


Featured research published by Hamid Mohamadi.


Plant Journal | 2015

Improved white spruce (Picea glauca) genome assemblies and annotation of large gene families of conifer terpenoid and phenolic defense metabolism

René L. Warren; Christopher I. Keeling; Macaire Man Saint Yuen; Anthony Raymond; Greg Taylor; Benjamin P. Vandervalk; Hamid Mohamadi; Daniel Paulino; Readman Chiu; Shaun D. Jackman; Gordon Robertson; Chen Yang; Brian Boyle; Margarete Hoffmann; Detlef Weigel; David R. Nelson; Carol Ritland; Nathalie Isabel; Barry Jaquish; Alvin Yanchuk; Jean Bousquet; Steven J.M. Jones; John MacKay; Inanc Birol; Joerg Bohlmann

White spruce (Picea glauca), a gymnosperm tree, has been established as one of the models for conifer genomics. We describe the draft genome assemblies of two white spruce genotypes, PG29 and WS77111, innovative tools for the assembly of very large genomes, and the conifer genomics resources developed in this process. The two white spruce genotypes originate from distant geographic regions of western (PG29) and eastern (WS77111) North America, and represent elite trees in two Canadian tree-breeding programs. We present an update (V3 and V4) for a previously reported PG29 V2 draft genome assembly and introduce a second white spruce genome assembly for genotype WS77111. Assemblies of the PG29 and WS77111 genomes confirm the reconstructed white spruce genome size in the 20 Gbp range, and show broad synteny. Using the PG29 V3 assembly and additional white spruce genomics and transcriptomics resources, we performed MAKER-P annotation and meticulous expert annotation of very large gene families of conifer defense metabolism, the terpene synthases and cytochrome P450s. We also comprehensively annotated the white spruce mevalonate, methylerythritol phosphate and phenylpropanoid pathways. These analyses highlighted the large extent of gene and pseudogene duplications in a conifer genome, in particular for genes of secondary (i.e. specialized) metabolism, and the potential for gain and loss of function for defense and adaptation.


Bioinformatics | 2014

BioBloom tools: fast, accurate and memory-efficient host species sequence screening using bloom filters

Justin Chu; Sara Sadeghi; Anthony Raymond; Shaun D. Jackman; Ka Ming Nip; Richard Mar; Hamid Mohamadi; Yaron S N Butterfield; A. Gordon Robertson; Inanc Birol

Large datasets can be screened for sequences from a specific organism, quickly and with low memory requirements, by a data structure that supports time- and memory-efficient set membership queries. Bloom filters offer such queries but require that false positives be controlled. We present BioBloom Tools, a Bloom filter-based sequence-screening tool that is faster than BWA, Bowtie 2 (popular alignment algorithms) and FACS (a membership query algorithm). It delivers accuracies comparable with these tools, controls false positives and has low memory requirements. Availability and implementation: www.bcgsc.ca/platform/bioinfo/software/biobloomtools Contact: [email protected] or [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.
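The screening primitive behind this abstract is simple: hash each k-mer of a read with several hash functions and test the corresponding bits of a filter built from the reference. Below is a minimal, self-contained C++ sketch of a Bloom filter for k-mer membership queries; the hash mixing, sizing, and the `BloomFilter` interface are illustrative assumptions, not BioBloom Tools' actual implementation.

```cpp
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

// Minimal Bloom filter for k-mer membership queries (illustrative only).
class BloomFilter {
public:
    BloomFilter(size_t bits, unsigned numHashes)
        : bits_(bits), numHashes_(numHashes), table_((bits + 63) / 64, 0) {}

    void insert(const std::string& kmer) {
        for (unsigned i = 0; i < numHashes_; ++i) {
            size_t pos = hash(kmer, i) % bits_;
            table_[pos / 64] |= (1ULL << (pos % 64));
        }
    }

    // True if the k-mer is possibly in the set: false positives are
    // possible, false negatives are not.
    bool contains(const std::string& kmer) const {
        for (unsigned i = 0; i < numHashes_; ++i) {
            size_t pos = hash(kmer, i) % bits_;
            if (!(table_[pos / 64] & (1ULL << (pos % 64)))) return false;
        }
        return true;
    }

private:
    // Seeded FNV-1a variant; a placeholder for the multiple hash values a
    // real tool would derive more efficiently (e.g. via ntHash).
    static uint64_t hash(const std::string& s, unsigned seed) {
        uint64_t h = 1469598103934665603ULL ^ (seed * 0x9E3779B97F4A7C15ULL);
        for (char c : s) { h ^= static_cast<uint8_t>(c); h *= 1099511628211ULL; }
        return h;
    }

    size_t bits_;
    unsigned numHashes_;
    std::vector<uint64_t> table_;
};

int main() {
    BloomFilter bf(1 << 20, 4);           // ~1 Mbit filter, 4 hash functions
    bf.insert("ACGTACGTACG");             // k-mers from the host reference
    std::cout << bf.contains("ACGTACGTACG") << '\n';  // 1: likely present
    std::cout << bf.contains("TTTTTTTTTTT") << '\n';  // 0: definitely absent
}
```

A read would then be flagged as host-derived when a sufficient fraction of its k-mers hit the filter, with the filter size and hash count chosen to keep the false positive rate at a target level.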


Genome Biology and Evolution | 2016

Organellar Genomes of White Spruce (Picea glauca): Assembly and Annotation

Shaun D. Jackman; René L. Warren; Ewan A. Gibb; Benjamin P. Vandervalk; Hamid Mohamadi; Justin Chu; Anthony Raymond; Stephen Pleasance; Robin Coope; Mark R. Wildung; Carol Ritland; Jean Bousquet; Steven J.M. Jones; Joerg Bohlmann; Inanc Birol

The genome sequences of the plastid and mitochondrion of white spruce (Picea glauca) were assembled from whole-genome shotgun sequencing data using ABySS. The sequencing data contained reads from both the nuclear and organellar genomes, and reads of the organellar genomes were abundant in the data because each cell harbors hundreds of mitochondria and plastids. Hence, assembly of the 123-kb plastid and 5.9-Mb mitochondrial genomes was accomplished by analyzing data sets primarily representing low coverage of the nuclear genome. The assembled organellar genomes were annotated for their coding genes, ribosomal RNA, and transfer RNA. Transcript abundances of the mitochondrial genes were quantified in three developmental tissues and five mature tissues using data from RNA-seq experiments. C-to-U RNA editing was observed in the majority of mitochondrial genes, and in four genes, editing events were noted to modify ACG codons to create cryptic AUG start codons. The informatics methodology presented in this study should prove useful to assemble organellar genomes of other plant species using whole-genome shotgun sequencing data.


Bioinformatics | 2016

Innovations and challenges in detecting long read overlaps: an evaluation of the state-of-the-art

Justin Chu; Hamid Mohamadi; René L. Warren; Chen Yang; Inanc Birol

Identifying overlaps between error-prone long reads, specifically those from Oxford Nanopore Technologies (ONT) and Pacific Biosciences (PB), is essential for certain downstream applications, including error correction and de novo assembly. Though akin to the read-to-reference alignment problem, read-to-read overlap detection is a distinct problem that can benefit from specialized algorithms that perform efficiently and robustly on high error rate long reads. Here, we review the current state-of-the-art read-to-read overlap tools for error-prone long reads, including BLASR, DALIGNER, MHAP, GraphMap and Minimap. These specialized bioinformatics tools differ not just in their algorithmic designs and methodology, but also in their robustness of performance on a variety of datasets, time and memory efficiency and scalability. We highlight the algorithmic features of these tools, as well as their potential issues and biases when utilizing any particular method. To supplement our review of the algorithms, we benchmarked these tools, tracking their resource needs and computational performance, and assessed the specificity and precision of each. In the versions of the tools tested, we observed that Minimap is the most computationally efficient, specific and sensitive method on the ONT datasets tested; whereas GraphMap and DALIGNER are the most specific and sensitive methods on the tested PB datasets. The concepts surveyed may apply to future sequencing technologies, as scalability is becoming more relevant with increased sequencing throughput. Contact: [email protected], [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.


BMC Medical Genomics | 2015

Konnector v2.0: pseudo-long reads from paired-end sequencing data.

Benjamin P. Vandervalk; Chen Yang; Zhuyi Xue; Karthika Raghavan; Justin Chu; Hamid Mohamadi; Shaun D. Jackman; Readman Chiu; René L. Warren; Inanc Birol

Background: Reading the nucleotides from two ends of a DNA fragment is called paired-end tag (PET) sequencing. When the fragment length is longer than the combined read length, a gap of unsequenced nucleotides remains between the read pairs. If the target in such experiments is sequenced at a level that provides redundant coverage, it may be possible to bridge these gaps using bioinformatics methods. Konnector is a local de novo assembly tool that addresses this problem. Here we report on version 2.0 of our tool. Results: Konnector uses a probabilistic and memory-efficient data structure called a Bloom filter to represent a k-mer spectrum - all possible sequences of length k in an input file, such as the collection of reads in a PET sequencing experiment. It performs look-ups in this data structure to construct an implicit de Bruijn graph, which describes (k-1) base pair overlaps between adjacent k-mers. It traverses this graph to bridge the gap between a given pair of flanking sequences. Conclusions: Here we report the performance of Konnector v2.0 on simulated and experimental datasets, and compare it against other tools with similar functionality. We note that, by representing k-mers with 1.5 bytes of memory on average, Konnector can scale to very large genomes. With our parallel implementation, it can also process over a billion bases on commodity hardware.
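The traversal the abstract describes — extend a flank one base at a time, keeping only extensions whose k-mers are in the spectrum, until the other flank is reached — can be sketched compactly. In the illustrative C++ below an `unordered_set` stands in for Konnector's Bloom filter, and a simple depth-first search returns the first bridging path found; the tool's actual traversal, error handling, and paired-end bookkeeping are more involved.

```cpp
#include <iostream>
#include <string>
#include <unordered_set>

// Depth-first walk over the implicit de Bruijn graph: from `path`'s last
// (k-1)-mer, try each base whose extension k-mer is in the spectrum, and
// stop when the right flank's k-mer is reached.
// `spectrum` stands in for Konnector's Bloom filter of read k-mers.
static bool bridge(const std::unordered_set<std::string>& spectrum,
                   std::string& path, const std::string& goalKmer,
                   size_t k, size_t maxLen) {
    if (path.size() >= k && path.compare(path.size() - k, k, goalKmer) == 0)
        return true;                          // reached the right flank
    if (path.size() >= maxLen) return false;  // give up on runaway paths
    for (char base : {'A', 'C', 'G', 'T'}) {
        std::string next = path.substr(path.size() - (k - 1)) + base;
        if (spectrum.count(next)) {           // (k-1)-overlap edge exists
            path.push_back(base);
            if (bridge(spectrum, path, goalKmer, k, maxLen)) return true;
            path.pop_back();                  // backtrack
        }
    }
    return false;
}

int main() {
    const size_t k = 4;
    // Toy k-mer spectrum, as if extracted from paired-end reads.
    std::unordered_set<std::string> spectrum =
        {"ACGT", "CGTA", "GTAC", "TACC", "ACCG"};
    std::string path = "ACGT";                // left flank
    if (bridge(spectrum, path, "ACCG", k, 50))
        std::cout << "bridged: " << path << '\n';   // bridged: ACGTACCG
}
```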


Bioinformatics | 2016

ntHash: recursive nucleotide hashing.

Hamid Mohamadi; Justin Chu; Benjamin P. Vandervalk; Inanc Birol

Motivation: Hashing has been widely used for indexing, querying and rapid similarity search in many bioinformatics applications, including sequence alignment, genome and transcriptome assembly, k-mer counting and error correction. Hence, expediting hashing operations would have a substantial impact in the field, making bioinformatics applications faster and more efficient. Results: We present ntHash, a hashing algorithm tuned for processing DNA/RNA sequences. It performs best when calculating hash values for adjacent k-mers in an input sequence, operating an order of magnitude faster than the best-performing alternatives in typical use cases. Availability and implementation: ntHash is available online at http://www.bcgsc.ca/platform/bioinfo/software/nthash and is free for academic use. Contact: [email protected] or [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.
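The recursive scheme referred to here derives each k-mer's hash from its predecessor in O(1) by rotating out the departing base and XORing in the arriving one. The C++ sketch below implements that rotate-and-XOR recurrence for the forward strand only; the per-base seed constants are placeholders, and the real ntHash additionally provides canonical (strand-neutral) hashing and multiple hash values per k-mer.

```cpp
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

static inline uint64_t rol(uint64_t x, unsigned r) {
    r %= 64;
    return r ? (x << r) | (x >> (64 - r)) : x;
}

// Placeholder 64-bit seeds for A/C/G/T (not ntHash's published constants).
static uint64_t seed(char b) {
    switch (b) {
        case 'A': return 0x3C8BFBB395C60474ULL;
        case 'C': return 0x3193C18562A02B4CULL;
        case 'G': return 0x20323ED082572324ULL;
        default:  return 0x295549F54BE24456ULL;  // 'T'
    }
}

// Hash every k-mer of `s` recursively: the first k-mer is folded in base by
// base, and each later hash is derived from its predecessor in O(1).
std::vector<uint64_t> rollingHashes(const std::string& s, size_t k) {
    std::vector<uint64_t> out;
    if (s.size() < k) return out;
    uint64_t h = 0;
    for (size_t j = 0; j < k; ++j)        // h = XOR_j rol^(k-1-j)(seed(s[j]))
        h ^= rol(seed(s[j]), static_cast<unsigned>(k - 1 - j));
    out.push_back(h);
    for (size_t i = 1; i + k <= s.size(); ++i) {
        // Rotate the window, drop the outgoing base, fold in the new one.
        h = rol(h, 1) ^ rol(seed(s[i - 1]), static_cast<unsigned>(k))
                      ^ seed(s[i + k - 1]);
        out.push_back(h);
    }
    return out;
}

int main() {
    for (uint64_t h : rollingHashes("ACGTGCA", 4))
        std::cout << std::hex << h << '\n';   // one hash per 4-mer
}
```

The payoff is that hashing all k-mers of a length-n sequence costs O(n) total instead of O(nk), which is why downstream consumers such as Bloom filters and counters benefit so directly.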


Bioinformatics | 2017

ntCard: A streaming algorithm for cardinality estimation in genomics data

Hamid Mohamadi; Hamza N. Khan; Inanc Birol

Motivation: Many bioinformatics algorithms are designed for the analysis of sequences of some uniform length, conventionally referred to as k-mers. These include de Bruijn graph assembly methods and sequence alignment tools. An efficient algorithm to enumerate the number of unique k-mers, or even better, to build a histogram of k-mer frequencies would be desirable for these tools and their downstream analysis pipelines. Among other applications, estimated frequencies can be used to predict genome sizes, measure sequencing error rates, and tune runtime parameters for analysis tools. However, calculating a k-mer histogram from large volumes of sequencing data is a challenging task. Results: Here, we present ntCard, a streaming algorithm for estimating the frequencies of k-mers in genomics datasets. At its core, ntCard uses the ntHash algorithm to efficiently compute hash values for streamed sequences. It then samples the calculated hash values to build a reduced-representation multiplicity table describing the sample distribution. Finally, it uses a statistical model to reconstruct the population distribution from the sample distribution. We compared the performance of ntCard and other cardinality estimation algorithms on three datasets of 480 GB, 500 GB and 2.4 TB in size; the first two represent whole-genome shotgun sequencing experiments on the human genome and the last one on the white spruce genome. Results show that ntCard estimates k-mer coverage frequencies >15× faster than the state-of-the-art algorithms, using a similar amount of memory and with higher accuracy. Thus, our benchmarks demonstrate ntCard as a potentially enabling technology for large-scale genomics applications. Availability and Implementation: ntCard is written in C++ and is released under the GPL license. It is freely available at https://github.com/bcgsc/ntCard. Contact: [email protected] or [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.
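The sampling step described above can be illustrated in a few lines: hash each k-mer, keep only hashes whose top s bits are zero (a 2^-s sample of distinct k-mers), and tally multiplicities. The C++ sketch below then scales counts by 2^s, a naive stand-in for ntCard's statistical reconstruction model; the hash is a simple placeholder rather than ntHash.

```cpp
#include <cstdint>
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

// Simple stand-alone k-mer hash (a placeholder for ntHash).
static uint64_t kmerHash(const std::string& s, size_t pos, size_t k) {
    uint64_t h = 1469598103934665603ULL;
    for (size_t j = 0; j < k; ++j) {
        h ^= static_cast<uint8_t>(s[pos + j]);
        h *= 1099511628211ULL;
    }
    return h;
}

int main() {
    const size_t k = 4;
    const unsigned s = 2;             // sample distinct k-mers at rate 2^-s
    std::vector<std::string> reads = {"ACGTACGTAA", "ACGTACGTAC", "TTGCACGTAC"};

    // Multiplicity table over the sampled hash values only. On a toy input
    // like this the sample may well be empty; the technique is aimed at
    // streams far too large to count exhaustively.
    std::unordered_map<uint64_t, uint64_t> sampleCounts;
    for (const auto& r : reads)
        for (size_t i = 0; i + k <= r.size(); ++i) {
            uint64_t h = kmerHash(r, i, k);
            if (h >> (64 - s) == 0)   // keep hashes whose top s bits are 0
                ++sampleCounts[h];
        }

    // Histogram of the sample: hist[m] = sampled k-mers seen m times.
    std::unordered_map<uint64_t, uint64_t> hist;
    for (const auto& kv : sampleCounts) ++hist[kv.second];

    // Naive scale-up by 2^s; ntCard instead fits a statistical model here.
    for (const auto& [m, f] : hist)
        std::cout << "multiplicity " << m << ": ~" << (f << s)
                  << " distinct k-mers\n";
}
```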


PLOS ONE | 2015

DIDA: Distributed Indexing Dispatched Alignment

Hamid Mohamadi; Benjamin P. Vandervalk; Anthony Raymond; Shaun D. Jackman; Justin Chu; Clay P. Breshears; Inanc Birol

One essential application in bioinformatics that is affected by the high-throughput sequencing data deluge is the sequence alignment problem, where nucleotide or amino acid sequences are queried against targets to find regions of close similarity. When queries are too many and/or targets are too large, the alignment process becomes computationally challenging. This is usually addressed by preprocessing techniques, where the queries and/or targets are indexed for easy access while searching for matches. When the target is static, such as in an established reference genome, the cost of indexing is amortized by reusing the generated index. However, when the targets are non-static, such as contigs in the intermediate steps of a de novo assembly process, a new index must be computed for each run. To address such scalability problems, we present DIDA, a novel framework that distributes the indexing and alignment tasks into smaller subtasks over a cluster of compute nodes. It provides a workflow beyond the common practice of embarrassingly parallel implementations. DIDA is a cost-effective, scalable and modular framework for the sequence alignment problem in terms of memory usage and runtime. It can be employed in large-scale alignments to draft genomes and intermediate stages of de novo assembly runs. The DIDA source code, sample files and user manual are available through http://www.bcgsc.ca/platform/bioinfo/software/dida. The software is released under the British Columbia Cancer Agency License (BCCA), and is free for academic use.
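The framework's two stages — index smaller target partitions independently, then dispatch each query only to partitions that might contain it — can be mocked up as follows. In this C++ sketch, targets are partitioned round-robin and a per-partition k-mer set serves as the dispatch filter; the partition count, the k-mer dispatch rule, and all names are illustrative assumptions, and the per-partition alignment step itself is elided.

```cpp
#include <iostream>
#include <string>
#include <unordered_set>
#include <vector>

const size_t K = 5;

// One indexed partition of the target set; the k-mer set plays the role of
// a cheap dispatch filter (a real framework would also hold an alignment
// index for the partition).
struct Partition {
    std::vector<std::string> targets;
    std::unordered_set<std::string> kmers;

    void add(const std::string& t) {
        targets.push_back(t);
        for (size_t i = 0; i + K <= t.size(); ++i)
            kmers.insert(t.substr(i, K));
    }
    bool mightContain(const std::string& q) const {
        for (size_t i = 0; i + K <= q.size(); ++i)
            if (kmers.count(q.substr(i, K))) return true;
        return false;
    }
};

int main() {
    // Stage 1: distribute targets round-robin and index each partition.
    std::vector<std::string> targets = {"ACGTACGTGGA", "TTGCAATTGGC",
                                        "GGCCTTAAGGA", "CACACACATTT"};
    std::vector<Partition> parts(2);
    for (size_t i = 0; i < targets.size(); ++i) parts[i % 2].add(targets[i]);

    // Stage 2: dispatch each query only to partitions that might match it;
    // each partition would then align its queries independently.
    std::vector<std::string> queries = {"GTACGTG", "CACACAT"};
    for (const auto& q : queries)
        for (size_t p = 0; p < parts.size(); ++p)
            if (parts[p].mightContain(q))
                std::cout << q << " -> align on partition " << p << '\n';
}
```

The point of the dispatch filter is that re-indexing a changed target set (e.g. assembly contigs between iterations) is paid per partition and in parallel, rather than as one monolithic indexing run.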


bioRxiv | 2016

Overlapping long sequence reads: Current innovations and challenges in developing sensitive, specific and scalable algorithms

Justin Chu; Hamid Mohamadi; René L. Warren; Chen Yang; Inanc Birol

Identifying overlaps between error-prone long reads, specifically those from Oxford Nanopore Technologies (ONT) and Pacific Biosciences (PB), is essential for certain downstream applications, including error correction and de novo assembly. Though akin to the read-to-reference alignment problem, read-to-read overlap detection is a distinct problem that can benefit from specialized algorithms that perform efficiently and robustly on high error rate long reads. Here, we review the current state-of-the-art read-to-read overlap tools for error-prone long reads, including BLASR, DALIGNER, MHAP, GraphMap, and Minimap. These specialized bioinformatics tools differ not just in their algorithmic designs and methodology, but also in their robustness of performance on a variety of datasets, time and memory efficiency, and scalability. We highlight the algorithmic features of these tools, as well as their potential issues and biases when utilizing any particular method. We benchmarked these tools, tracking their resource needs and computational performance, and assessed the specificity and precision of each. The concepts surveyed may apply to future sequencing technologies, as scalability is becoming more relevant with increased sequencing throughput. Contact: [email protected]; [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.


Comparative and Functional Genomics | 2015

Spaced Seed Data Structures for De Novo Assembly

Inanc Birol; Justin Chu; Hamid Mohamadi; Shaun D. Jackman; Karthika Raghavan; Benjamin P. Vandervalk; Anthony Raymond; René L. Warren

De novo assembly of the genome of a species is essential in the absence of a reference genome sequence. Many scalable assembly algorithms use the de Bruijn graph (DBG) paradigm to reconstruct genomes, where a table of subsequences of a certain length is derived from the reads, and their overlaps are analyzed to assemble sequences. Although longer subsequences unlock longer genomic features for assembly, the associated increase in compute resources limits the practicality of DBGs compared with other assembly archetypes already designed for longer reads. Here, we revisit the DBG paradigm to adapt it to the changing sequencing technology landscape and introduce three data structure designs for spaced seeds in the form of paired subsequences. These data structures address memory and run time constraints imposed by longer reads. We observe that when a fixed distance separates seed pairs, sequence specificity increases with gap length. Further, we note that Bloom filters would be suitable to implicitly store spaced seeds and be tolerant to sequencing errors. Building on this concept, we describe a data structure for tracking the frequencies of observed spaced seeds. These data structure designs will have applications in genome, transcriptome and metagenome assemblies, and read error correction.
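A spaced seed in the sense above is a pair of short subsequences separated by a fixed-length gap of ignored bases. The C++ sketch below extracts such paired seeds from a sequence and tallies their frequencies; a plain hash map is used where the paper proposes Bloom-filter-backed designs, and the seed and gap lengths are arbitrary choices for illustration.

```cpp
#include <iostream>
#include <string>
#include <unordered_map>

int main() {
    const size_t seedLen = 4;   // length of each paired subsequence
    const size_t gapLen  = 3;   // ignored bases between the pair
    const std::string seq = "ACGTACGTACGTACGT";

    // Count each spaced seed: two seedLen-mers with gapLen skipped bases in
    // between. (The paper's Bloom-filter designs would store these
    // implicitly and error-tolerantly; a map keeps the sketch simple.)
    std::unordered_map<std::string, unsigned> counts;
    const size_t span = 2 * seedLen + gapLen;
    for (size_t i = 0; i + span <= seq.size(); ++i) {
        std::string seed = seq.substr(i, seedLen) + "-" +
                           seq.substr(i + seedLen + gapLen, seedLen);
        ++counts[seed];
    }
    for (const auto& [seed, n] : counts)
        std::cout << seed << " x" << n << '\n';
}
```

Skipping the gapped bases is what buys error tolerance: a sequencing error inside the gap leaves the seed unchanged, while the fixed pair distance preserves specificity over a longer genomic span than a single contiguous k-mer.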

Collaboration


Dive into Hamid Mohamadi's collaborations.

Top Co-Authors

Inanc Birol
University of British Columbia

Justin Chu
University of British Columbia

Benjamin P. Vandervalk
University of British Columbia

Shaun D. Jackman
University of British Columbia

Anthony Raymond
University of British Columbia

Joerg Bohlmann
University of British Columbia

Steven J.M. Jones
University of British Columbia

Carol Ritland
University of British Columbia