Publication


Featured research published by Yuzhen Ye.


Nucleic Acids Research | 2005

The Subsystems Approach to Genome Annotation and its Use in the Project to Annotate 1000 Genomes

Ross Overbeek; Tadhg P. Begley; Ralph Butler; Jomuna V. Choudhuri; Han-Yu Chuang; Matthew Cohoon; Valérie de Crécy-Lagard; Naryttza N. Diaz; Terry Disz; Robert D. Edwards; Michael Fonstein; Ed D. Frank; Svetlana Gerdes; Elizabeth M. Glass; Alexander Goesmann; Andrew C. Hanson; Dirk Iwata-Reuyl; Roy A. Jensen; Neema Jamshidi; Lutz Krause; Michael Kubal; Niels Bent Larsen; Burkhard Linke; Alice C. McHardy; Folker Meyer; Heiko Neuweger; Gary J. Olsen; Robert Olson; Andrei L. Osterman; Vasiliy A. Portnoy

The release of the 1000th complete microbial genome will occur in the next two to three years. In anticipation of this milestone, the Fellowship for Interpretation of Genomes (FIG) launched the Project to Annotate 1000 Genomes. The project is built around the principle that the key to improved accuracy in high-throughput annotation technology is to have experts annotate single subsystems over the complete collection of genomes, rather than having an annotation expert attempt to annotate all of the genes in a single genome. Using the subsystems approach, all of the genes implementing the subsystem are analyzed by an expert in that subsystem. An annotation environment was created where populated subsystems are curated and projected to new genomes. A portable notion of a populated subsystem was defined, and tools developed for exchanging and curating these objects. Tools were also developed to resolve conflicts between populated subsystems. The SEED is the first annotation environment that supports this model of annotation. Here, we describe the subsystem approach, and offer the first release of our growing library of populated subsystems. The initial release of data includes 180 177 distinct proteins with 2133 distinct functional roles. This data comes from 173 subsystems and 383 different organisms.


Science | 2011

The ecoresponsive genome of Daphnia pulex

John K. Colbourne; Michael E. Pfrender; Donald L. Gilbert; W. Kelley Thomas; Abraham Tucker; Todd H. Oakley; Shin-ichi Tokishita; Andrea Aerts; Georg J. Arnold; Malay Kumar Basu; Darren J Bauer; Carla E. Cáceres; Liran Carmel; Claudio Casola; Jeong Hyeon Choi; John C. Detter; Qunfeng Dong; Serge Dusheyko; Brian D. Eads; Thomas Fröhlich; Kerry A. Geiler-Samerotte; Daniel Gerlach; Phil Hatcher; Sanjuro Jogdeo; Jeroen Krijgsveld; Evgenia V. Kriventseva; Dietmar Kültz; Christian Laforsch; Erika Lindquist; Jacqueline Lopez

The Daphnia genome reveals a multitude of genes and shows adaptation through gene family expansions. We describe the draft genome of the microcrustacean Daphnia pulex, which is only 200 megabases and contains at least 30,907 genes. The high gene count is a consequence of an elevated rate of gene duplication resulting in tandem gene clusters. More than a third of Daphnia’s genes have no detectable homologs in any other available proteome, and the most amplified gene families are specific to the Daphnia lineage. The coexpansion of gene families interacting within metabolic pathways suggests that the maintenance of duplicated genes is not random, and the analysis of gene expression under different environmental conditions reveals that numerous paralogs acquire divergent expression patterns soon after duplication. Daphnia-specific genes, including many additional loci within sequenced regions that are otherwise devoid of annotations, are the most responsive genes to ecological challenges.


Genome Research | 2008

The amphioxus genome illuminates vertebrate origins and cephalochordate biology

Linda Z. Holland; Ricard Albalat; Kaoru Azumi; Èlia Benito-Gutiérrez; Matthew J. Blow; Marianne Bronner-Fraser; Frédéric Brunet; Thomas Butts; Simona Candiani; Larry J. Dishaw; David E. K. Ferrier; Jordi Garcia-Fernàndez; Jeremy J. Gibson-Brown; Carmela Gissi; Adam Godzik; Finn Hallböök; Dan Hirose; Kazuyoshi Hosomichi; Tetsuro Ikuta; Hidetoshi Inoko; Masanori Kasahara; Jun Kasamatsu; Takeshi Kawashima; Ayuko Kimura; Masaaki Kobayashi; Zbynek Kozmik; Kaoru Kubokawa; Vincent Laudet; Gary W. Litman; Alice C. McHardy

Cephalochordates, urochordates, and vertebrates evolved from a common ancestor over 520 million years ago. To improve our understanding of chordate evolution and the origin of vertebrates, we intensively searched for particular genes, gene families, and conserved noncoding elements in the sequenced genome of the cephalochordate Branchiostoma floridae, commonly called amphioxus or lancelets. Special attention was given to homeobox genes, opsin genes, genes involved in neural crest development, nuclear receptor genes, genes encoding components of the endocrine and immune systems, and conserved cis-regulatory enhancers. The amphioxus genome contains a basic set of chordate genes involved in development and cell signaling, including a fifteenth Hox gene. This set includes many genes that were co-opted in vertebrates for new roles in neural crest development and adaptive immunity. However, where amphioxus has a single gene, vertebrates often have two, three, or four paralogs derived from two whole-genome duplication events. In addition, several transcriptional enhancers are conserved between amphioxus and vertebrates--a very wide phylogenetic distance. In contrast, urochordate genomes have lost many genes, including a diversity of homeobox families and genes involved in steroid hormone function. The amphioxus genome also exhibits derived features, including duplications of opsins and genes proposed to function in innate immunity and endocrine systems. Our results indicate that the amphioxus genome is elemental to an understanding of the biology and evolution of nonchordate deuterostomes, invertebrate chordates, and vertebrates.


Nucleic Acids Research | 2010

FragGeneScan: predicting genes in short and error-prone reads

Mina Rho; Haixu Tang; Yuzhen Ye

The advances of next-generation sequencing technology have facilitated metagenomics research that attempts to determine directly the whole collection of genetic material within an environmental sample (i.e. the metagenome). Identification of genes directly from short reads has become an important yet challenging problem in annotating metagenomes, since the assembly of metagenomes is often not available. Gene predictors developed for whole genomes (e.g. Glimmer) and recently developed for metagenomic sequences (e.g. MetaGene) show a significant decrease in performance as the sequencing error rates increase, or as reads get shorter. We have developed a novel gene prediction method, FragGeneScan, which combines sequencing error models and codon usages in a hidden Markov model to improve the prediction of protein-coding regions in short reads. The performance of FragGeneScan was comparable to Glimmer and MetaGene for complete genomes. For short reads, however, FragGeneScan consistently outperformed MetaGene (accuracy improved ∼62% for reads of 400 bases with 1% sequencing errors, and ∼18% for short reads of 100 bases that are error free). When applied to metagenomes, FragGeneScan recovered substantially more genes than MetaGene predicted (>90% of the genes identified by homology search), and many novel genes with no homologs in current protein sequence databases.


PLOS ONE | 2012

A core human microbiome as viewed through 16S rRNA sequence clusters

Susan M. Huse; Yuzhen Ye; Yanjiao Zhou; Anthony A. Fodor

We explore the microbiota of 18 body sites in over 200 individuals using sequences amplified from the V1–V3 and V3–V5 small subunit ribosomal RNA (16S) hypervariable regions as part of the NIH Common Fund Human Microbiome Project. The body sites with the greatest number of core OTUs, defined as OTUs shared amongst 95% or more of the individuals, were the oral sites (saliva, tongue, cheek, gums, and throat) followed by the nose, stool, and skin, while the vaginal sites had the fewest number of OTUs shared across subjects. We found that commonalities between samples based on taxonomy could sometimes belie variability at the sub-genus OTU level. This was particularly apparent in the mouth where a given genus can be present in many different oral sites, but the sub-genus OTUs show very distinct site selection, and in the vaginal sites, which are consistently dominated by the Lactobacillus genus but have distinctly different sub-genus V1–V3 OTU populations across subjects. Different body sites show approximately a ten-fold difference in estimated microbial richness, with stool samples having the highest estimated richness, followed by the mouth, throat and gums, then by the skin, nasal and vaginal sites. Richness as measured by the V1–V3 primers was consistently higher than richness measured by V3–V5. We also show that when such a large cohort is analyzed at the genus level, most subjects fit the stool “enterotype” profile, but other subjects are intermediate, blurring the distinction between the enterotypes. When analyzed at the finer-scale OTU level, there was little or no segregation into stool enterotypes, but in the vagina distinct biotypes were apparent. Finally, we note that even OTUs present in nearly every subject, or that dominate in some samples, showed orders of magnitude variation in relative abundance, emphasizing the highly variable nature across individuals.
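The core-OTU definition in this abstract (an OTU shared by 95% or more of the individuals) can be illustrated with a small sketch. The function name `core_otus`, the input layout (one set of OTU IDs per individual for a single body site), and the toy data are assumptions made for illustration only; they are not taken from the paper.

```python
from collections import Counter

def core_otus(samples, threshold=0.95):
    """Return the OTUs present in at least `threshold` of the individuals.

    `samples` maps an individual ID to the set of OTU IDs observed
    at one body site (hypothetical input layout).
    """
    if not samples:
        return set()
    # Count, for each OTU, how many individuals carry it.
    counts = Counter(otu for otus in samples.values() for otu in set(otus))
    n = len(samples)
    return {otu for otu, c in counts.items() if c / n >= threshold}

# Toy data: 20 hypothetical individuals.
samples = {f"subj{i}": {"OTU_A", "OTU_B"} for i in range(19)}
samples["subj19"] = {"OTU_A", "OTU_C"}

print(sorted(core_otus(samples)))  # ['OTU_A', 'OTU_B']
```

Here OTU_B is found in 19 of 20 individuals (exactly 95%) and so counts as core, while OTU_C, seen in one individual, does not.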


Bioinformatics | 2012

RAPSearch2: a fast and memory-efficient protein similarity search tool for next-generation sequencing data

Yongan Zhao; Haixu Tang; Yuzhen Ye

Summary: With the wide application of next-generation sequencing (NGS) techniques, fast tools for protein similarity search that scale well to large query datasets and large databases are highly desirable. In a previous work, we developed RAPSearch, an algorithm that achieved a ~20–90-fold speedup relative to BLAST while still achieving similar levels of sensitivity for short protein fragments derived from NGS data. RAPSearch, however, requires a substantial memory footprint to identify alignment seeds, due to its use of a suffix array data structure. Here we present RAPSearch2, a new memory-efficient implementation of the RAPSearch algorithm that uses a collision-free hash table to index a similarity search database. The utilization of an optimized data structure speeds up the similarity search by a further 2–3 times. We also implemented multi-threading in RAPSearch2, and the multi-thread modes achieve significant acceleration (e.g. 3.5X for 4-thread mode). RAPSearch2 requires up to 2 GB of memory when running in single-thread mode, or up to 3.5 GB of memory when running in 4-thread mode. Availability and implementation: Implemented in C++, the source code is freely available for download at the RAPSearch2 website: http://omics.informatics.indiana.edu/mg/RAPSearch2/. Supplementary information: Available at the RAPSearch2 website.
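The idea of indexing a database with a hash table to find alignment seeds can be sketched as follows. This is a simplified illustration, not the RAPSearch2 implementation: it uses an ordinary Python dictionary rather than a collision-free hash table, and the seed length `k`, the function names, and the toy sequences are all assumptions.

```python
from collections import defaultdict

def build_seed_index(db, k=3):
    """Map every k-mer in the database to the (sequence name, offset) pairs where it occurs."""
    index = defaultdict(list)
    for name, seq in db.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].append((name, i))
    return index

def find_seeds(query, index, k=3):
    """Return (query offset, sequence name, database offset) for each exact k-mer match."""
    hits = []
    for q in range(len(query) - k + 1):
        # .get avoids inserting empty entries into the defaultdict on a miss.
        for name, pos in index.get(query[q:q + k], ()):
            hits.append((q, name, pos))
    return hits

# Toy protein database (hypothetical sequences).
db = {"protA": "MKVLAAGIV", "protB": "GIVMKV"}
index = build_seed_index(db)
hits = find_seeds("AGIVM", index)
print(hits)
```

Each hit is only a candidate seed; a real search tool would extend seeds into scored alignments, which this sketch leaves out.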


Nucleic Acids Research | 2004

FATCAT: a web server for flexible structure comparison and structure similarity searching

Yuzhen Ye; Adam Godzik

Protein structure comparison, an important problem in structural biology, has two main applications: (i) comparing two protein structures in order to identify the similarities and differences between them, and (ii) searching for structures similar to a query structure. Many web-based resources for both applications are available, but all are based on rigid structural alignment algorithms. FATCAT server implements the recently developed flexible protein structure comparison algorithm FATCAT, which automatically identifies hinges and internal rearrangements in two protein structures. The server provides access to two algorithms: FATCAT-pairwise for pairwise flexible structure comparison and FATCAT-search for database searching for structurally similar proteins. Given two protein structures [in the Protein Data Bank (PDB) format], FATCAT-pairwise reports their structural alignment and the corresponding statistical significance of the similarity measured as a P-value. Users can view the superposition of the structures online in web browsers that support the Chime plug-in, or download the superimposed structures in PDB format. In FATCAT-search, users provide one query structure and the server returns a list of protein structures that are similar to the query, ordered by the P-values. In addition, FATCAT server can report the conformational changes of the query structure as compared to other proteins in the structure database. FATCAT server is available at http://fatcat.burnham.org.


Bioinformatics | 2005

Multiple flexible structure alignment using partial order graphs

Yuzhen Ye; Adam Godzik

Motivation: Existing comparisons of protein structures are not able to describe structural divergence and flexibility in the structures being compared because they focus on identifying a common invariant core and ignore parts of the structures outside this core. Understanding the structural divergence and flexibility is critical for studying the evolution of functions and specificities of proteins. Results: A new method of multiple protein structure alignment, POSA (Partial Order Structure Alignment), was developed using a partial order graph representation of multiple alignments. POSA has two unique features: (1) it identifies and classifies regions that are conserved only in a subset of input structures and (2) it allows internal rearrangements in protein structures. POSA outperforms other programs in the cases where structural flexibilities exist and provides new insights by visualizing the mosaic nature of multiple structural alignments. POSA is an ideal tool for studying the variation of protein structures within diverse structural families. Availability: POSA is freely available for academic users on a Web server at http://fatcat.burnham.org/POSA


PLOS Computational Biology | 2009

A Parsimony Approach to Biological Pathway Reconstruction/Inference for Genomes and Metagenomes

Yuzhen Ye; Thomas G. Doak

A common biological pathway reconstruction approach -- as implemented by many automatic biological pathway services (such as the KAAS and RAST servers) and the functional annotation of metagenomic sequences -- starts with the identification of protein functions or families (e.g., KO families for the KEGG database and the FIG families for the SEED database) in the query sequences, followed by a direct mapping of the identified protein families onto pathways. Given a predicted patchwork of individual biochemical steps, some metric must be applied in deciding what pathways actually exist in the genome or metagenome represented by the sequences. Commonly, and straightforwardly, a complete biological pathway can be identified in a dataset if at least one of the steps associated with the pathway is found. We report, however, that this naïve mapping approach leads to an inflated estimate of biological pathways, and thus overestimates the functional diversity of the sample from which the DNA sequences are derived. We developed a parsimony approach, called MinPath (Minimal set of Pathways), for biological pathway reconstructions using protein family predictions, which yields a more conservative, yet more faithful, estimation of the biological pathways for a query dataset. MinPath identified far fewer pathways for the genomes collected in the KEGG database -- as compared to the naïve mapping approach -- eliminating some obviously spurious pathway annotations. Results from applying MinPath to several metagenomes indicate that the common methods used for metagenome annotation may significantly overestimate the biological pathways encoded by microbial communities.
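The contrast between naive mapping and a parsimony approach can be made concrete with a toy sketch. The brute-force search below finds a smallest set of pathways that explains every observed protein family; it is an illustrative stand-in chosen for clarity, not the MinPath implementation, and the pathway names, family IDs, and function names are hypothetical.

```python
from itertools import combinations

def naive_pathways(families, pathway_to_families):
    """Naive mapping: report a pathway if at least one of its families is observed."""
    return {p for p, fams in pathway_to_families.items() if fams & families}

def minimal_pathways(families, pathway_to_families):
    """Smallest pathway set explaining all observed, mappable families (brute force)."""
    candidates = sorted(naive_pathways(families, pathway_to_families))
    to_explain = {f for p in candidates for f in pathway_to_families[p] & families}
    # Try subsets in order of increasing size; the first full cover is minimal.
    for size in range(len(candidates) + 1):
        for subset in combinations(candidates, size):
            covered = set()
            for p in subset:
                covered |= pathway_to_families[p] & families
            if covered == to_explain:
                return set(subset)
    return set(candidates)

# Hypothetical pathway definitions and observed families.
pathway_to_families = {
    "P1": {"K1", "K2", "K3"},
    "P2": {"K3", "K4"},
    "P3": {"K2", "K9"},
}
observed = {"K1", "K2", "K3", "K4"}

print(sorted(naive_pathways(observed, pathway_to_families)))    # ['P1', 'P2', 'P3']
print(sorted(minimal_pathways(observed, pathway_to_families)))  # ['P1', 'P2']
```

In this toy case P3 is reported by naive mapping only because it shares K2 with P1, so the parsimony view drops it. Brute force is exponential in the number of candidate pathways and is only practical for small examples like this one.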


Research in Computational Molecular Biology | 2010

A novel abundance-based algorithm for binning metagenomic sequences using l-tuples

Yu-Wei Wu; Yuzhen Ye

Metagenomics is the study of microbial communities sampled directly from their natural environment, without prior culturing. Among the computational tools recently developed for metagenomic sequence analysis, binning tools attempt to classify all (or most) of the sequences in a metagenomic dataset into different bins (i.e., species), based on various DNA composition patterns (e.g., the tetramer frequencies) of various genomes. Composition-based binning methods, however, cannot be used to classify very short fragments, because of the substantial variation of DNA composition patterns within a single genome. We developed a novel approach (AbundanceBin) for metagenomics binning by utilizing the different abundances of species living in the same environment. AbundanceBin is an application of the Lander-Waterman model to metagenomics, which is based on the l-tuple content of the reads. AbundanceBin achieved accurate, unsupervised clustering of metagenomic sequences into different bins, such that the reads classified in a bin belong to species of identical or very similar abundances in the sample. In addition, AbundanceBin gave accurate estimations of species abundances, as well as their genome sizes, two important parameters for characterizing a microbial community. We also show that AbundanceBin performed well when the sequence lengths are very short (e.g., 75 bp) or have sequencing errors.
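To illustrate how abundance alone can separate sequences, the sketch below counts l-tuples and fits a two-component Poisson mixture to tuple counts with expectation-maximization. The abstract does not spell out the fitting procedure, so the Poisson mixture, the EM loop, the starting values, and the toy counts are all assumptions of this example rather than a description of AbundanceBin itself.

```python
import math
from collections import Counter

def count_ltuples(reads, l):
    """Count every l-tuple (substring of length l) across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - l + 1):
            counts[read[i:i + l]] += 1
    return counts

def poisson_logpmf(k, lam):
    return k * math.log(lam) - lam - math.lgamma(k + 1)

def em_poisson_mixture(counts, lambdas, iters=50):
    """Fit mixture weights and Poisson means to a list of counts by EM."""
    lambdas = list(lambdas)
    weights = [1.0 / len(lambdas)] * len(lambdas)
    resp = []
    for _ in range(iters):
        # E-step: responsibility of each component for each count.
        resp = []
        for k in counts:
            logs = [math.log(w) + poisson_logpmf(k, lam)
                    for w, lam in zip(weights, lambdas)]
            m = max(logs)
            probs = [math.exp(x - m) for x in logs]
            s = sum(probs)
            resp.append([p / s for p in probs])
        # M-step: re-estimate weights and means (floored to stay positive).
        for j in range(len(lambdas)):
            total = max(sum(r[j] for r in resp), 1e-12)
            weights[j] = max(total / len(counts), 1e-12)
            lambdas[j] = max(sum(r[j] * k for r, k in zip(resp, counts)) / total, 1e-9)
    return weights, lambdas, resp

# Toy tuple counts: six from a low-abundance source, six from a high-abundance one.
tuple_counts = [2, 3, 2, 4, 3, 2, 20, 22, 19, 21, 18, 20]
weights, lambdas, resp = em_poisson_mixture(tuple_counts, lambdas=(1.0, 10.0))
bins = [r.index(max(r)) for r in resp]
print(bins)
```

With well-separated abundances like these, the fitted means settle near the two group averages and each count is assigned to the matching bin; closely spaced abundances would be much harder to separate.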

Collaboration


Top Co-Authors

Haixu Tang, Indiana University Bloomington
Thomas G. Doak, Indiana University Bloomington
Qingjiang Wu, Chinese Academy of Sciences
Zhongwei Wang, Chinese Academy of Sciences
Jianfeng Zhou, Chinese Academy of Sciences
Mingjie Wang, Indiana University Bloomington
Quan Zhang, Indiana University Bloomington
Yu Wei Wu, Indiana University Bloomington
Anthony A. Fodor, University of North Carolina at Charlotte