Youngik Yang
J. Craig Venter Institute
Publication
Featured research published by Youngik Yang.
Proceedings of the National Academy of Sciences of the United States of America | 2013
Jeffrey S. McLean; Mary-Jane Lombardo; Jonathan H. Badger; Anna Edlund; Mark Novotny; Joyclyn Yee-Greenbaum; Nikolay Vyahhi; Adam P Hall; Youngik Yang; Christopher L. Dupont; Michael G. Ziegler; Hamidreza Chitsaz; Andrew E. Allen; Shibu Yooseph; Glenn Tesler; Pavel A. Pevzner; Robert Friedman; Kenneth H. Nealson; J. C. Venter; Roger S. Lasken
Significance: This research highlights the discovery and genome reconstruction of a member of the globally distributed yet uncultivated candidate phylum TM6 (designated TM6SC1). Apart from the 16S rRNA gene, no other genomic information is available for this cosmopolitan phylum. This report also introduces a mini-metagenomic approach based on the use of high-throughput single-cell genomics techniques and assembly tools that address a widely recognized issue: how to effectively capture and sequence the currently uncultivated bacterial species that make up the “dark matter of life.” Amplifying and sequencing random pools of 100 events enabled an estimated 90% recovery of the TM6SC1 genome.
The “dark matter of life” describes microbes and even entire divisions of bacterial phyla that have evaded cultivation and have yet to be sequenced. We present a genome from the globally distributed but elusive candidate phylum TM6 and uncover its metabolic potential. TM6 was detected in a biofilm from a sink drain within a hospital restroom by analyzing cells using a highly automated single-cell genomics platform. We developed an approach for increasing throughput and effectively improving the likelihood of sampling rare events based on forming small random pools of single-flow–sorted cells, amplifying their DNA by multiple displacement amplification and sequencing all cells in the pool, creating a “mini-metagenome.” A recently developed single-cell assembler, SPAdes, in combination with contig binning methods, allowed the reconstruction of genomes from these mini-metagenomes. A total of 1.07 Mb was recovered in seven contigs for this member of TM6 (JCVI TM6SC1), estimated to represent 90% of its genome. High nucleotide identity between a total of three TM6 genome drafts generated from pools that were independently captured, amplified, and assembled provided strong confirmation of a correct genomic sequence. TM6 is likely a Gram-negative organism and possibly a symbiont of an unknown host (nonfree living) in part based on its small genome, low-GC content, and lack of biosynthesis pathways for most amino acids and vitamins. Phylogenomic analysis of conserved single-copy genes confirms that TM6SC1 represents a deeply branching phylum.
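The abstract's point that pooling improves the likelihood of sampling rare events can be illustrated with a small calculation. The sketch below is not from the paper: it assumes each sorted cell independently belongs to the target taxon with probability `p` (a hypothetical relative abundance), in which case a pool of `n` cells contains at least one target cell with probability 1 - (1 - p)^n. The function name and the example abundance are illustrative only.

```python
def prob_pool_contains(p: float, n: int) -> float:
    """Probability that a pool of n cells holds at least one target cell.

    Assumes (for illustration only) that each cell is an independent draw
    and is a target cell with probability p.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if n < 0:
        raise ValueError("n must be non-negative")
    return 1.0 - (1.0 - p) ** n


if __name__ == "__main__":
    # Hypothetical relative abundance of 1%; compare single cells with pools.
    for pool_size in (1, 10, 100):
        print(pool_size, prob_pool_contains(0.01, pool_size))
```

Under these assumptions, a taxon at 1% abundance is rarely hit by a single sorted cell but is present in most pools of 100 cells, which is the intuition behind the mini-metagenome design.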
Proceedings of the National Academy of Sciences of the United States of America | 2015
Alyson E. Santoro; Christopher L. Dupont; R. Alex Richter; Matthew T. Craig; Paul Carini; Matthew R. McIlvin; Youngik Yang; William D. Orsi; Dawn M. Moran; Mak A. Saito
Significance: Thaumarchaeota are among the most abundant microbial cells in the ocean, but to date, complete genome sequences for marine Thaumarchaeota are lacking. Here, we report the 1.23-Mbp genome of the pelagic ammonia-oxidizing thaumarchaeon “Candidatus Nitrosopelagicus brevis” str. CN25. We present the first proteomic data, to our knowledge, from this phylum, which show a high proportion of proteins translated in oligotrophic conditions. Metagenomic fragment recruitment using data from the open ocean indicates the ubiquitous presence of Ca. N. brevis-like sequences in the surface ocean and suggests Ca. N. brevis as a model system for understanding the ecology and evolution of pelagic marine Thaumarchaeota.
Thaumarchaeota are among the most abundant microbial cells in the ocean, but difficulty in cultivating marine Thaumarchaeota has hindered investigation into the physiological and evolutionary basis of their success. We report here a closed genome assembled from a highly enriched culture of the ammonia-oxidizing pelagic thaumarchaeon CN25, originating from the open ocean. The CN25 genome exhibits strong evidence of genome streamlining, including a 1.23-Mbp genome, a high coding density, and a low number of paralogous genes. Proteomic analysis recovered nearly 70% of the predicted proteins encoded by the genome, demonstrating that a high fraction of the genome is translated. In contrast to other minimal marine microbes that acquire, rather than synthesize, cofactors, CN25 encodes and expresses near-complete biosynthetic pathways for multiple vitamins. Metagenomic fragment recruitment indicated the presence of DNA sequences >90% identical to the CN25 genome throughout the oligotrophic ocean. We propose the provisional name “Candidatus Nitrosopelagicus brevis” str. CN25 for this minimalist marine thaumarchaeon and suggest it as a potential model system for understanding archaeal adaptation to the open ocean.
Mbio | 2013
Anna Edlund; Youngik Yang; Adam P Hall; Lihong Guo; Renate Lux; Xuesong He; Karen E. Nelson; Kenneth H. Nealson; Shibu Yooseph; Wenyuan Shi; Jeffrey S. McLean
Background: Our knowledge of microbial diversity in the human oral cavity has vastly expanded during the last two decades of research. However, much of what is known about the behavior of oral species to date derives from pure culture approaches and studies combining several cultivated species, which likely do not fully reflect their function in complex microbial communities. It has been shown in studies with a limited number of cultivated species that early oral biofilm development occurs in a successional manner and that continuous low pH can lead to an enrichment of aciduric species. Observations that in vitro grown plaque biofilm microcosms can maintain similar pH profiles in response to carbohydrate addition as plaque in vivo suggest that a complex microbial community can be established in the laboratory. In light of this, our primary goal was to develop a robust in vitro biofilm-model system from a pooled saliva inoculum in order to study the stability, reproducibility, and development of the oral microbiome, and its dynamic response to environmental changes from the community to the molecular level.
Results: Comparative metagenomic analyses confirmed a high similarity of metabolic potential in biofilms to recently available oral metagenomes from healthy subjects as part of the Human Microbiome Project. A time-series metagenomic analysis of the taxonomic community composition in biofilms revealed that the proportions of major species at 3 hours of growth are maintained during 48 hours of biofilm development. By employing deep pyrosequencing of the 16S rRNA gene to investigate this biofilm model with regard to bacterial taxonomic diversity, we show a high reproducibility of the taxonomic carriage and proportions between: 1) individual biofilm samples; 2) biofilm batches grown at different dates; 3) DNA extraction techniques and 4) research laboratories.
Conclusions: Our study demonstrates that we now have the capability to grow stable oral microbial in vitro biofilms containing more than one hundred operational taxonomic units (OTU) which represent 60-80% of the original inoculum OTU richness. Previously uncultivated Human Oral Taxa (HOT) were identified in the biofilms and contributed approximately one-third of the total captured 16S rRNA gene diversity. To our knowledge, this represents the highest oral bacterial diversity reported for an in vitro model system so far. This robust model will help investigate currently uncultivated species and the known virulence properties for many oral pathogens not solely restricted to pure culture systems, but within multi-species biofilms.
Proceedings of the National Academy of Sciences of the United States of America | 2015
Lihong Guo; Jeffrey S. McLean; Youngik Yang; Randal Eckert; Christopher W. Kaplan; Pierre Kyme; Omid Sheikh; Brian C. Varnum; Renate Lux; Wenyuan Shi; Xuesong He
Significance: The lack of effective tools to modulate individual species within a complex microbial community poses a major barrier to studying the human microbiome and its associated diseases. We showed targeted removal of human cariogenic Streptococcus mutans within an in vitro oral multispecies community using a high-efficacy antimicrobial peptide, C16G2, as well as drastic reconstruction of the microbial structure following treatment. The findings suggest the potential of using targeted antimicrobials to modulate the microbiome and study the ecological function of specific bacterial species.
One major challenge to studying the human microbiome and its associated diseases is the lack of effective tools to achieve targeted modulation of individual species and study their ecological function within multispecies communities. Here, we show that C16G2, a specifically targeted antimicrobial peptide, was able to selectively kill the cariogenic pathogen Streptococcus mutans with high efficacy within a human saliva-derived in vitro oral multispecies community. Importantly, a significant shift in the overall microbial structure of the C16G2-treated community was revealed after a 24-h recovery period: several bacterial species with metabolic dependency or physical interactions with S. mutans suffered drastic reduction in their abundance, whereas S. mutans’ natural competitors, including health-associated Streptococci, became dominant. This study demonstrates the use of targeted antimicrobials to modulate the microbiome structure allowing insights into the key community role of specific bacterial species and also indicates the therapeutic potential of C16G2 to achieve a healthy oral microbiome.
The ISME Journal | 2015
Anna Edlund; Youngik Yang; Shibu Yooseph; Adam P Hall; Don D. Nguyen; Pieter C. Dorrestein; Karen E. Nelson; Xuesong He; Renate Lux; Wenyuan Shi; Jeffrey S. McLean
Dental caries, one of the most globally widespread infectious diseases, is intimately linked to pH dynamics. In supragingival plaque, after the addition of a carbohydrate source, bacterial metabolism decreases the pH, which subsequently recovers. Molecular mechanisms supporting this important homeostasis are poorly characterized, in part because there are hundreds of active species in dental plaque. Only a few mechanisms (for example, lactate fermentation, the arginine deiminase system) have been identified and studied in detail. Here, we conducted what is, to our knowledge, the first full transcriptome and metabolome analysis of a diverse oral plaque community by using a functionally and taxonomically robust in vitro model system containing more than 100 species. Differential gene expression analyses from the complete transcriptome of 14 key community members revealed highly varied regulation of both known and previously unassociated pH-neutralizing pathways as a response to the pH drop. Unique expression and metabolite signatures from 400 detected metabolites were found for each stage along the pH curve, suggesting it may be possible to define healthy and diseased states of activity. Importantly, for the maintenance of healthy plaque pH, gene transcription activity of known and previously unrecognized pH-neutralizing pathways was associated with the genera Lactobacillus, Veillonella and Streptococcus during the pH recovery phase. Our in vitro study provides a baseline for defining healthy and disease-like states and highlights the power of moving beyond single and dual species applications to capture key players and their orchestrated metabolic activities within a complex human oral microbiome model.
IEEE Transactions on Nanobioscience | 2012
Hyungro Lee; Youngik Yang; Heejoon Chae; Seungyoon Nam; Donghoon Choi; Patanachai Tangchaisin; Chathura Herath; Suresh Marru; Kenneth P. Nephew; Sun Kim
MicroRNAs, by regulating the expression of hundreds of target genes, play critical roles in developmental biology and the etiology of numerous diseases, including cancer. As a vast amount of microRNA expression profile data is now publicly available, the integration of microRNA expression data sets with gene expression profiles is a key research problem in life science research. However, conducting genome-wide microRNA-mRNA (gene) integration currently requires sophisticated, high-end informatics tools and significant expertise in bioinformatics and computer science to carry out the complex integration analysis. In addition, increased computing infrastructure capabilities are essential in order to accommodate large data sets. In this study, we have extended the BioVLAB cloud workbench to develop an environment for the integrated analysis of microRNA and mRNA expression data, named BioVLAB-MMIA. The workbench facilitates computations on the Amazon EC2 and S3 resources orchestrated by the XBaya Workflow Suite. The advantages of BioVLAB-MMIA over the web-based MMIA system include: 1) it is readily expanded as new computational tools become available; 2) it is easily modifiable by re-configuring graphic icons in the workflow; 3) on-demand cloud computing resources can be used on an “as needed” basis; 4) distributed orchestration supports complex and long running workflows asynchronously. We believe that BioVLAB-MMIA will be an easy-to-use computing environment for researchers who plan to perform genome-wide microRNA-mRNA (gene) integrated analysis tasks.
Nucleic Acids Research | 2015
Cuncong Zhong; Youngik Yang; Shibu Yooseph
Protein sequences predicted from metagenomic datasets are annotated by identifying their homologs via sequence comparisons with reference or curated proteins. However, a majority of metagenomic protein sequences are partial-length, arising as a result of identifying genes on sequencing reads or on assembled nucleotide contigs, which themselves are often very fragmented. The fragmented nature of metagenomic protein predictions adversely impacts homology detection and, therefore, the quality of the overall annotation of the dataset. Here we present a novel algorithm called GRASP that accurately identifies the homologs of a given reference protein sequence from a database consisting of partial-length metagenomic proteins. Our homology detection strategy is guided by the reference sequence, and involves the simultaneous search and assembly of overlapping database sequences. GRASP was compared to three commonly used protein sequence search programs (BLASTP, PSI-BLAST and FASTM). Our evaluations using several simulated and real datasets show that GRASP has a significantly higher sensitivity than these programs while maintaining a very high specificity. GRASP can be a very useful program for detecting and quantifying taxonomic and protein family abundances in metagenomic datasets. GRASP is implemented in GNU C++, and is freely available at http://sourceforge.net/projects/grasp-release.
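To make the idea of assembling overlapping partial-length sequences more concrete, here is a toy sketch of greedy suffix-prefix overlap extension. It is not the GRASP algorithm: it omits the reference-guided search and scoring that the abstract describes, uses exact matches only, and all function names, the `min_len` threshold, and the example peptides are hypothetical.

```python
def overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of a equal to a prefix of b (>= min_len), else 0."""
    if min_len < 1:
        raise ValueError("min_len must be at least 1")
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0


def extend_right(seed: str, fragments: list[str], min_len: int = 3) -> str:
    """Greedily extend seed to the right using exact suffix-prefix overlaps."""
    contig = seed
    unused = [f for f in fragments if f != seed]
    while True:
        best_k, best_f = 0, None
        for f in unused:
            k = overlap(contig, f, min_len)
            # Require that the fragment actually adds new residues.
            if k > best_k and k < len(f):
                best_k, best_f = k, f
        if best_f is None:
            return contig
        contig += best_f[best_k:]
        unused.remove(best_f)


if __name__ == "__main__":
    # Made-up partial-length peptide fragments.
    print(extend_right("MKTAY", ["TAYIAK", "IAKQR", "XXXX"]))
```

A real homology-guided assembler would tolerate mismatches and use the reference sequence to decide which extensions to follow; the sketch only shows how overlapping fragments can be chained into a longer sequence.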
Nucleic Acids Research | 2013
Youngik Yang; Shibu Yooseph
The metagenomic paradigm allows for an understanding of the metabolic and functional potential of microbes in a community via a study of their proteins. The substrate for protein identification is either the set of individual nucleotide reads generated from metagenomic samples or the set of contig sequences produced by assembling these reads. However, a read-based strategy using reads generated by next-generation sequencing (NGS) technologies results in an overwhelming majority of partial-length protein predictions. A nucleotide assembly-based strategy does not fare much better, as metagenomic assemblies are typically fragmented and also leave a large fraction of reads unassembled. Here, we present a method for reconstructing complete protein sequences directly from NGS metagenomic data. Our framework is based on a novel short peptide assembler (SPA) that assembles protein sequences from their constituent peptide fragments identified on short reads. The SPA algorithm is based on informed traversals of a de Bruijn graph, defined on an amino acid alphabet, to identify probable paths that correspond to proteins. Using large simulated and real metagenomic data sets, we show that our method outperforms the alternative approach of identifying genes on nucleotide sequence assemblies and generates longer protein sequences that can be more effectively analysed.
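A de Bruijn graph over an amino acid alphabet can be sketched in a few lines. The example below is a minimal illustration, not the SPA implementation: the k-mer size, the edge-count weighting, and the greedy "follow the most frequent edge" traversal are assumptions chosen for simplicity, whereas the abstract describes informed traversals whose details are not given here.

```python
from collections import defaultdict


def build_de_bruijn(peptides: list[str], k: int) -> dict:
    """Nodes are (k-1)-mers; each k-mer adds one count to the edge prefix -> suffix."""
    graph = defaultdict(lambda: defaultdict(int))
    for pep in peptides:
        for i in range(len(pep) - k + 1):
            kmer = pep[i:i + k]
            graph[kmer[:-1]][kmer[1:]] += 1
    return graph


def greedy_path(graph: dict, start: str, max_steps: int = 1000) -> str:
    """Walk from start, always taking the highest-count edge; stop at dead ends or cycles."""
    path, node, seen = start, start, {start}
    for _ in range(max_steps):
        successors = graph.get(node)
        if not successors:
            break
        # Sorting first makes ties resolve alphabetically, so output is deterministic.
        nxt = max(sorted(successors), key=successors.get)
        if nxt in seen:
            break
        path += nxt[-1]
        seen.add(nxt)
        node = nxt
    return path


if __name__ == "__main__":
    # Made-up short peptides that overlap one another.
    g = build_de_bruijn(["MKTAY", "TAYIA", "YIAKQ"], k=3)
    print(greedy_path(g, "MK"))
```

Edge counts here act as a crude stand-in for coverage; a practical assembler would need to handle branching, sequencing errors, and repeated motifs rather than stopping at the first revisited node.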
PLOS Computational Biology | 2016
Cuncong Zhong; Anna Edlund; Youngik Yang; Jeffrey S. McLean; Shibu Yooseph
Analyses of metagenome data (MG) and metatranscriptome data (MT) are often challenged by a paucity of complete reference genome sequences and the uneven/low sequencing depth of the constituent organisms in the microbial community, which respectively limit the power of reference-based alignment and de novo sequence assembly. These limitations make accurate protein family classification and abundance estimation challenging, which in turn hampers downstream analyses such as abundance profiling of metabolic pathways, identification of differentially encoded/expressed genes, and de novo reconstruction of complete gene and protein sequences from the protein family of interest. The profile hidden Markov model (HMM) framework enables the construction of very useful probabilistic models for protein families that allow for accurate modeling of position specific matches, insertions, and deletions. We present a novel homology detection algorithm that integrates a banded Viterbi algorithm for profile HMM parsing with an iterative simultaneous alignment and assembly computational framework. The algorithm searches a given profile HMM of a protein family against a database of fragmentary MG/MT sequencing data and simultaneously assembles complete or near-complete gene and protein sequences of the protein family. The resulting program, HMM-GRASPx, demonstrates superior performance in aligning and assembling homologs when benchmarked on both simulated marine MG and real human saliva MG datasets. On real supragingival plaque and stool MG datasets that were generated from healthy individuals, HMM-GRASPx accurately estimates the abundances of the antimicrobial resistance (AMR) gene families and enables accurate characterization of the resistome profiles of these microbial communities. For real human oral microbiome MT datasets, using the HMM-GRASPx estimated transcript abundances significantly improves detection of differentially expressed (DE) genes. Finally, HMM-GRASPx was used to reconstruct comprehensive sets of complete or near-complete protein and nucleotide sequences for the query protein families. HMM-GRASPx is freely available online from http://sourceforge.net/projects/hmm-graspx.
Bioinformatics | 2015
Youngik Yang; Cuncong Zhong; Shibu Yooseph
The determination of protein sequences from a metagenomic dataset enables the study of metabolism and functional roles of the organisms that are present in the sampled microbial community. We had previously introduced an algorithm and software for the accurate reconstruction of protein sequences from short peptides identified on nucleotide reads in a metagenomic dataset. Here, we present significant computational improvements to the short peptide assembly algorithm that make it practical to reconstruct proteins from large metagenomic datasets containing several hundred million reads, while maintaining accuracy. The improved computational efficiency is achieved using a suffix array data structure that allows for fast querying during the assembly process, and a significant redesign of assembly steps that enables multi-threaded execution.
Availability and implementation: The program is available under the GPLv3 license from sourceforge.net/projects/spa-assembler.
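As a rough illustration of why a suffix array allows fast querying, the sketch below builds a suffix array over a single string and finds all occurrences of a pattern by binary search. It is not the program's implementation: the naive construction (sorting full suffixes) is only suitable for small inputs, and the function names and example sequence are hypothetical.

```python
def build_suffix_array(text: str) -> list[int]:
    """Start positions of all suffixes of text, in lexicographic order (naive build)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text: str, sa: list[int], pattern: str) -> list[int]:
    """All start positions of pattern in text, found with two binary searches."""
    m = len(pattern)
    # Lower bound: first suffix whose length-m prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    # Upper bound: first suffix whose length-m prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])


if __name__ == "__main__":
    seq = "MKTAYIAKTAY"  # made-up peptide string
    sa = build_suffix_array(seq)
    print(find_occurrences(seq, sa, "TAY"))
```

Once the array is built, each query costs a logarithmic number of string comparisons rather than a scan of the whole text, which is the property that makes repeated lookups during assembly cheap.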