Haixu Tang

Indiana University Bloomington

Publication

Featured research published by Haixu Tang.

Proceedings of the National Academy of Sciences of the United States of America | 2001

An Eulerian path approach to DNA fragment assembly

Pavel A. Pevzner; Haixu Tang; Michael S. Waterman

For the last 20 years, fragment assembly in DNA sequencing followed the “overlap–layout–consensus” paradigm that is used in all currently available assembly tools. Although this approach proved useful in assembling clones, it faces difficulties in genomic shotgun assembly. We abandon the classical “overlap–layout–consensus” approach in favor of a new EULER algorithm that, for the first time, resolves the 20-year-old “repeat problem” in fragment assembly. Our main result is the reduction of the fragment assembly to a variation of the classical Eulerian path problem that allows one to generate accurate solutions of large-scale sequencing problems. EULER, in contrast to the Celera assembler, does not mask such repeats but uses them instead as a powerful fragment assembly tool.
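The reduction to an Eulerian path problem can be illustrated with a toy sketch. This is not the EULER implementation: it assumes error-free reads, a single strand, and a graph that admits exactly one Eulerian path. The function names and the choice of k are hypothetical. Each distinct k-mer becomes an edge from its (k-1)-base prefix to its (k-1)-base suffix, and a path that uses every edge once spells the sequence.

```python
from collections import defaultdict


def de_bruijn_edges(reads, k):
    """Map each (k-1)-mer to the (k-1)-mers it points to.

    Every distinct k-mer found in the reads contributes one edge,
    from its prefix of length k-1 to its suffix of length k-1.
    """
    kmers = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmers.add(read[i:i + k])
    graph = defaultdict(list)
    for kmer in sorted(kmers):
        graph[kmer[:-1]].append(kmer[1:])
    return graph


def eulerian_path(graph):
    """Hierholzer's algorithm; assumes an Eulerian path exists."""
    in_deg = defaultdict(int)
    for targets in graph.values():
        for v in targets:
            in_deg[v] += 1
    nodes = set(graph) | set(in_deg)
    # Start at the node with one more outgoing than incoming edge, if any.
    start = next(
        (n for n in sorted(nodes) if len(graph.get(n, [])) - in_deg[n] == 1),
        min(nodes),
    )
    remaining = {u: list(vs) for u, vs in graph.items()}
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if remaining.get(u):
            stack.append(remaining[u].pop())
        else:
            path.append(stack.pop())
    return path[::-1]


def assemble(reads, k):
    """Spell the sequence along the Eulerian path of the k-mer graph."""
    path = eulerian_path(de_bruijn_edges(reads, k))
    return path[0] + "".join(node[-1] for node in path[1:])


reads = ["ATGGCGT", "GGCGTGC", "CGTGCA"]
print(assemble(reads, 4))  # ATGGCGTGCA
```

With real data the graph contains errors and repeats, so several Eulerian paths may exist; the sketch does not attempt to choose among them.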

Science | 2011

The ecoresponsive genome of Daphnia pulex

John K. Colbourne; Michael E. Pfrender; Donald L. Gilbert; W. Kelley Thomas; Abraham Tucker; Todd H. Oakley; Shin-ichi Tokishita; Andrea Aerts; Georg J. Arnold; Malay Kumar Basu; Darren J Bauer; Carla E. Cáceres; Liran Carmel; Claudio Casola; Jeong Hyeon Choi; John C. Detter; Qunfeng Dong; Serge Dusheyko; Brian D. Eads; Thomas Fröhlich; Kerry A. Geiler-Samerotte; Daniel Gerlach; Phil Hatcher; Sanjuro Jogdeo; Jeroen Krijgsveld; Evgenia V. Kriventseva; Dietmar Kültz; Christian Laforsch; Erika Lindquist; Jacqueline Lopez

The Daphnia genome reveals a multitude of genes and shows adaptation through gene family expansions. We describe the draft genome of the microcrustacean Daphnia pulex, which is only 200 megabases and contains at least 30,907 genes. The high gene count is a consequence of an elevated rate of gene duplication resulting in tandem gene clusters. More than a third of Daphnia’s genes have no detectable homologs in any other available proteome, and the most amplified gene families are specific to the Daphnia lineage. The coexpansion of gene families interacting within metabolic pathways suggests that the maintenance of duplicated genes is not random, and the analysis of gene expression under different environmental conditions reveals that numerous paralogs acquire divergent expression patterns soon after duplication. Daphnia-specific genes, including many additional loci within sequenced regions that are otherwise devoid of annotations, are the most responsive genes to ecological challenges.

Nucleic Acids Research | 2010

FragGeneScan: predicting genes in short and error-prone reads

Mina Rho; Haixu Tang; Yuzhen Ye

The advances of next-generation sequencing technology have facilitated metagenomics research that attempts to determine directly the whole collection of genetic material within an environmental sample (i.e. the metagenome). Identification of genes directly from short reads has become an important yet challenging problem in annotating metagenomes, since the assembly of metagenomes is often not available. Gene predictors developed for whole genomes (e.g. Glimmer) and recently developed for metagenomic sequences (e.g. MetaGene) show a significant decrease in performance as the sequencing error rates increase, or as reads get shorter. We have developed a novel gene prediction method FragGeneScan, which combines sequencing error models and codon usages in a hidden Markov model to improve the prediction of protein-coding region in short reads. The performance of FragGeneScan was comparable to Glimmer and MetaGene for complete genomes. But for short reads, FragGeneScan consistently outperformed MetaGene (accuracy improved ∼62% for reads of 400 bases with 1% sequencing errors, and ∼18% for short reads of 100 bases that are error free). When applied to metagenomes, FragGeneScan recovered substantially more genes than MetaGene predicted (>90% of the genes identified by homology search), and many novel genes with no homologs in current protein sequence database.

Proceedings of the National Academy of Sciences of the United States of America | 2012

Rate and molecular spectrum of spontaneous mutations in the bacterium Escherichia coli as determined by whole-genome sequencing

Heewook Lee; Ellen Popodi; Haixu Tang; Patricia L. Foster

Knowledge of the rate and nature of spontaneous mutation is fundamental to understanding evolutionary and molecular processes. In this report, we analyze spontaneous mutations accumulated over thousands of generations by wild-type Escherichia coli and a derivative defective in mismatch repair (MMR), the primary pathway for correcting replication errors. The major conclusions are (i) the mutation rate of a wild-type E. coli strain is ∼1 × 10⁻³ per genome per generation; (ii) mutations in the wild-type strain have the expected mutational bias for G:C > A:T mutations, but the bias changes to A:T > G:C mutations in the absence of MMR; (iii) during replication, A:T > G:C transitions preferentially occur with A templating the lagging strand and T templating the leading strand, whereas G:C > A:T transitions preferentially occur with C templating the lagging strand and G templating the leading strand; (iv) there is a strong bias for transition mutations to occur at 5′ApC3′/3′TpG5′ sites (where bases 5′A and 3′T are mutated) and, to a lesser extent, at 5′GpC3′/3′CpG5′ sites (where bases 5′G and 3′C are mutated); (v) although the rate of small (≤4 nt) insertions and deletions is high at repeat sequences, these events occur at only 1/10th the genomic rate of base-pair substitutions. MMR activity is genetically regulated, and bacteria isolated from nature often lack MMR capacity, suggesting that modulation of MMR can be adaptive. Thus, comparing results from the wild-type and MMR-defective strains may lead to a deeper understanding of factors that determine mutation rates and spectra, how these factors may differ among organisms, and how they may be shaped by environmental conditions.

Bioinformatics | 2012

RAPSearch2: a fast and memory-efficient protein similarity search tool for next-generation sequencing data

Yongan Zhao; Haixu Tang; Yuzhen Ye

Summary: With the wide application of next-generation sequencing (NGS) techniques, fast tools for protein similarity search that scale well to large query datasets and large databases are highly desirable. In a previous work, we developed RAPSearch, an algorithm that achieved a ~20–90-fold speedup relative to BLAST while still achieving similar levels of sensitivity for short protein fragments derived from NGS data. RAPSearch, however, requires a substantial memory footprint to identify alignment seeds, due to its use of a suffix array data structure. Here we present RAPSearch2, a new memory-efficient implementation of the RAPSearch algorithm that uses a collision-free hash table to index a similarity search database. The utilization of an optimized data structure further speeds up the similarity search—another 2–3 times. We also implemented multi-threading in RAPSearch2, and the multi-thread modes achieve significant acceleration (e.g. 3.5X for 4-thread mode). RAPSearch2 requires up to 2G memory when running in single thread mode, or up to 3.5G memory when running in 4-thread mode. Availability and implementation: Implemented in C++, the source code is freely available for download at the RAPSearch2 website: http://omics.informatics.indiana.edu/mg/RAPSearch2/. Contact: [email protected] Supplementary information: Available at the RAPSearch2 website.
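The idea of indexing a database by hashed seeds can be sketched as follows. This is an illustration only and not RAPSearch2's data structure: it uses a plain Python dictionary (which is not collision-free), exact-match seeds, and a hypothetical seed length of 6 residues; the function names and example sequences are made up.

```python
from collections import defaultdict


def build_seed_index(database, seed_len=6):
    """Index every substring of length seed_len in each database sequence.

    Returns a mapping from seed string to a list of (name, position) pairs.
    """
    index = defaultdict(list)
    for name, seq in database.items():
        for pos in range(len(seq) - seed_len + 1):
            index[seq[pos:pos + seed_len]].append((name, pos))
    return index


def find_seed_hits(query, index, seed_len=6):
    """Return (query_pos, name, db_pos) for every exact seed match."""
    hits = []
    for qpos in range(len(query) - seed_len + 1):
        seed = query[qpos:qpos + seed_len]
        for name, dpos in index.get(seed, ()):
            hits.append((qpos, name, dpos))
    return hits


database = {"p1": "MKTAYIAKQR", "p2": "GGGAYIAKQW"}
index = build_seed_index(database)
print(find_seed_hits("TAYIAKQ", index))
# [(0, 'p1', 2), (1, 'p1', 3), (1, 'p2', 3)]
```

In a full search tool each seed hit would then be extended into an alignment and scored; that step is omitted here.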

Bioinformatics | 2004

Fragment assembly with short reads

Mark Chaisson; Pavel A. Pevzner; Haixu Tang

MOTIVATION Current DNA sequencing technology produces reads of about 500-750 bp, with typical coverage under 10x. New sequencing technologies are emerging that produce shorter reads (length 80-200 bp) but allow one to generate significantly higher coverage (30x and higher) at low cost. Modern assembly programs and error correction routines have been tuned to work well with current read technology but were not designed for assembly of short reads. RESULTS We analyze the limitations of assembling reads generated by these new technologies and present a routine for base-calling in reads prior to their assembly. We demonstrate that while it is feasible to assemble such short reads, the resulting contigs will require significant (if not prohibitive) finishing efforts. AVAILABILITY Available from the web at http://www.cse.ucsd.edu/groups/bioinformatics/software.html

Intelligent Systems in Molecular Biology | 2006

A computational approach toward label-free protein quantification using predicted peptide detectability

Haixu Tang; Randy J. Arnold; Pedro Alves; Zhiyin Xun; David E. Clemmer; Milos V. Novotny; James P. Reilly; Predrag Radivojac

We propose here a new concept of peptide detectability which could be an important factor in explaining the relationship between a protein's quantity and the peptides identified from it in a high-throughput proteomics experiment. We define peptide detectability as the probability of observing a peptide in a standard sample analyzed by a standard proteomics routine and argue that it is an intrinsic property of the peptide sequence and neighboring regions in the parent protein. To test this hypothesis we first used publicly available data and data from our own synthetic samples in which quantities of model proteins were controlled. We then applied machine learning approaches to demonstrate that peptide detectability can be predicted from its sequence and the neighboring regions in the parent protein with satisfactory accuracy. The utility of this approach for protein quantification is demonstrated by peptides with higher detectability generally being identified at lower concentrations over those with lower detectability in the synthetic protein mixtures. These results establish a direct link between protein concentration and peptide detectability. We show that for each protein there exists a level of peptide detectability above which peptides are detected and below which peptides are not detected in an experiment. We call this level the minimum acceptable detectability for identified peptides (MDIP) which can be calibrated to predict protein concentration. Triplicate analysis of a biological sample showed that these MDIP values are consistent among the three data sets.

PLOS ONE | 2007

Whole-Genome Sequencing and Assembly with High-Throughput, Short-Read Technologies

Andreas Sundquist; Mostafa Ronaghi; Haixu Tang; Pavel A. Pevzner; Serafim Batzoglou

While recently developed short-read sequencing technologies may dramatically reduce the sequencing cost and eventually achieve the $1000 goal for re-sequencing, their limitations prevent the de novo sequencing of eukaryotic genomes with the standard shotgun sequencing protocol. We present SHRAP (SHort Read Assembly Protocol), a sequencing protocol and assembly methodology that utilizes high-throughput short-read technologies. We describe a variation on hierarchical sequencing with two crucial differences: (1) we select a clone library from the genome randomly rather than as a tiling path and (2) we sample clones from the genome at high coverage and reads from the clones at low coverage. We assume that 200 bp read lengths with a 1% error rate and inexpensive random fragment cloning on whole mammalian genomes is feasible. Our assembly methodology is based on first ordering the clones and subsequently performing read assembly in three stages: (1) local assemblies of regions significantly smaller than a clone size, (2) clone-sized assemblies of the results of stage 1, and (3) chromosome-sized assemblies. By aggressively localizing the assembly problem during the first stage, our method succeeds in assembling short, unpaired reads sampled from repetitive genomes. We tested our assembler using simulated reads from D. melanogaster and human chromosomes 1, 11, and 21, and produced assemblies with large sets of contiguous sequence and a misassembly rate comparable to other draft assemblies. Tested on D. melanogaster and the entire human genome, our clone-ordering method produces accurate maps, thereby localizing fragment assembly and enabling the parallelization of the subsequent steps of our pipeline. Thus, we have demonstrated that truly inexpensive de novo sequencing of mammalian genomes will soon be possible with high-throughput, short-read technologies using our methodology.

Trends in Genetics | 2000

Regulation of adjacent yeast genes

Semyon Kruglyak; Haixu Tang

Genes & Development | 2012

Spatial and functional relationships among Pol V-associated loci, Pol IV-dependent siRNAs, and cytosine methylation in the Arabidopsis epigenome

Andrzej T. Wierzbicki; Ross Cocklin; Anoop Mayampurath; Ryan Lister; M. Jordan Rowley; Brian D. Gregory; Joseph R. Ecker; Haixu Tang

We thank L. Kruglyak and E. Hubbell for valuable comments and feedback. We were supported in part by NSF grant DBI 9504393 and NIH grant R01 GM36230. We thank S. Tavare and M. Waterman for useful discussions and for providing us with the resources for this project. We also thank the referees for their valuable comments.