Yanni Sun | Researchain

Publication

Featured researches published by Yanni Sun.

Nucleic Acids Research | 2014

Ribosomal Database Project: data and tools for high throughput rRNA analysis

James R. Cole; Qiong Wang; Jordan A. Fish; Benli Chai; Donna M. McGarrell; Yanni Sun; C. Titus Brown; Andrea Porras-Alfaro; Cheryl R. Kuske; James M. Tiedje

Ribosomal Database Project (RDP; http://rdp.cme.msu.edu/) provides the research community with aligned and annotated rRNA gene sequence data, along with tools to allow researchers to analyze their own rRNA gene sequences in the RDP framework. RDP data and tools are utilized in fields as diverse as human health, microbial ecology, environmental microbiology, nucleic acid chemistry, taxonomy and phylogenetics. In addition to aligned and annotated collections of bacterial and archaeal small subunit rRNA genes, RDP now includes a collection of fungal large subunit rRNA genes. RDP tools, including Classifier and Aligner, have been updated to work with this new fungal collection. The use of high-throughput sequencing to characterize environmental microbial populations has exploded in the past several years, and as sequence technologies have improved, the sizes of environmental datasets have increased. With release 11, RDP is providing an expanded set of tools to facilitate analysis of high-throughput data, including both single-stranded and paired-end reads. In addition, most tools are now available as open source packages for download and local use by researchers with high-volume needs or who would like to develop custom analysis pipelines.

PLOS Genetics | 2012

Genome, Functional Gene Annotation, and Nuclear Transformation of the Heterokont Oleaginous Alga Nannochloropsis oceanica CCMP1779

Astrid Vieler; Guangxi Wu; Chia Hong Tsai; Blair Bullard; Adam J. Cornish; Christopher M. Harvey; Ida Barbara Reca; Chelsea K. Thornburg; Rujira Achawanantakun; Christopher J. Buehl; Michael S. Campbell; David Cavalier; Kevin L. Childs; Teresa J. Clark; Rahul R. Deshpande; Erika Erickson; Ann A. Ferguson; Witawas Handee; Que Kong; Xiaobo Li; Bensheng Liu; Steven Lundback; Cheng Peng; Rebecca L. Roston; Sanjaya; Jeffrey P. Simpson; Allan D. TerBush; Jaruswan Warakanont; Simone Zäuner; Eva M. Farré

Unicellular marine algae have promise for providing sustainable and scalable biofuel feedstocks, although no single species has emerged as a preferred organism. Moreover, adequate molecular and genetic resources prerequisite for the rational engineering of marine algal feedstocks are lacking for most candidate species. Heterokonts of the genus Nannochloropsis naturally have high cellular oil content and are already in use for industrial production of high-value lipid products. First success in applying reverse genetics by targeted gene replacement makes Nannochloropsis oceanica an attractive model to investigate the cell and molecular biology and biochemistry of this fascinating organism group. Here we present the assembly of the 28.7 Mb genome of N. oceanica CCMP1779. RNA sequencing data from nitrogen-replete and nitrogen-depleted growth conditions support a total of 11,973 genes, of which in addition to automatic annotation some were manually inspected to predict the biochemical repertoire for this organism. Among others, more than 100 genes putatively related to lipid metabolism, 114 predicted transcription factors, and 109 transcriptional regulators were annotated. Comparison of the N. oceanica CCMP1779 gene repertoire with the recently published N. gaditana genome identified 2,649 genes likely specific to N. oceanica CCMP1779. Many of these N. oceanica–specific genes have putative orthologs in other species or are supported by transcriptional evidence. However, because similarity-based annotations are limited, functions of most of these species-specific genes remain unknown. Aside from the genome sequence and its analysis, protocols for the transformation of N. oceanica CCMP1779 are provided. 
The availability of genomic and transcriptomic data for Nannochloropsis oceanica CCMP1779, along with efficient transformation protocols, provides a blueprint for future detailed gene functional analysis and genetic engineering of Nannochloropsis species by a growing academic community focused on this genus.

Frontiers in Microbiology | 2013

FunGene: the functional gene pipeline and repository

Jordan A. Fish; Benli Chai; Qiong Wang; Yanni Sun; C. Titus Brown; James M. Tiedje; James R. Cole

Ribosomal RNA genes have become the standard molecular markers for microbial community analysis for good reasons, including universal occurrence in cellular organisms, availability of large databases, and ease of rRNA gene region amplification and analysis. As markers, however, rRNA genes have some significant limitations. The rRNA genes are often present in multiple copies, unlike most protein-coding genes. The slow rate of change in rRNA genes means that multiple species sometimes share identical 16S rRNA gene sequences, while many more species share identical sequences in the short 16S rRNA regions commonly analyzed. In addition, the genes involved in many important processes are not distributed in a phylogenetically coherent manner, potentially due to gene loss or horizontal gene transfer. While rRNA genes remain the most commonly used markers, key genes in ecologically important pathways, e.g., those involved in carbon and nitrogen cycling, can provide important insights into community composition and function not obtainable through rRNA analysis. However, working with ecofunctional gene data requires some tools beyond those required for rRNA analysis. To address this, our Functional Gene Pipeline and Repository (FunGene; http://fungene.cme.msu.edu/) offers databases of many common ecofunctional genes and proteins, as well as integrated tools that allow researchers to browse these collections and choose subsets for further analysis, build phylogenetic trees, test primers and probes for coverage, and download aligned sequences. Additional FunGene tools are specialized to process coding gene amplicon data. For example, FrameBot produces frameshift-corrected protein and DNA sequences from raw reads while finding the most closely related protein reference sequence. These tools can help provide better insight into microbial communities by directly studying key genes involved in important ecological processes.
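As a rough illustration of what "testing primers for coverage" can mean, the sketch below counts the fraction of reference sequences that contain an exact match to a degenerate primer. It is not FunGene's implementation: the function names, the toy sequences, and the choice of exact forward-strand matching with no mismatches are all assumptions made for this example.

```python
import re

# IUPAC nucleotide codes mapped to regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def primer_regex(primer):
    """Compile a degenerate primer into a regular expression."""
    return re.compile("".join(IUPAC[base] for base in primer.upper()))


def coverage(primer, sequences):
    """Fraction of sequences with an exact forward-strand primer match."""
    if not sequences:
        return 0.0
    pattern = primer_regex(primer)
    hits = sum(1 for seq in sequences if pattern.search(seq.upper()))
    return hits / len(sequences)


# Hypothetical reference fragments and primer, for illustration only.
references = ["AAGGCTTACGT", "AAGGCCTACGT", "AAGGCATACGT"]
print(coverage("GGCYTA", references))
```

A practical coverage test would usually also allow a bounded number of mismatches and check the reverse complement; both are left out here to keep the sketch short.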

Plant Physiology | 2014

MAKER-P: A Tool Kit for the Rapid Creation, Management, and Quality Control of Plant Genome Annotations

Michael S. Campbell; MeiYee Law; Carson Holt; Joshua C. Stein; Gaurav D. Moghe; David E. Hufnagel; Jikai Lei; Rujira Achawanantakun; Dian Jiao; Carolyn J. Lawrence; Doreen Ware; Shin Han Shiu; Kevin L. Childs; Yanni Sun; Ning Jiang; Mark Yandell

MAKER-P annotates the entire Arabidopsis and maize genomes in less than 3 h with comparable quality to the current TAIR10 and maize V2 annotation builds. We have optimized and extended the widely used annotation engine MAKER in order to better support plant genome annotation efforts. New features include better parallelization for large repeat-rich plant genomes, noncoding RNA annotation capabilities, and support for pseudogene identification. We have benchmarked the resulting software tool kit, MAKER-P, using the Arabidopsis (Arabidopsis thaliana) and maize (Zea mays) genomes. Here, we demonstrate the ability of the MAKER-P tool kit to automatically update, extend, and revise the Arabidopsis annotations in light of newly available data and to annotate pseudogenes and noncoding RNAs absent from The Arabidopsis Information Resource 10 (TAIR10) build. Our results demonstrate that MAKER-P can be used to manage and improve the annotations of even Arabidopsis, perhaps the best-annotated plant genome. We have also installed and benchmarked MAKER-P at the Texas Advanced Computing Center. We show that this public resource can de novo annotate the entire Arabidopsis and maize genomes in less than 3 h and produce annotations of comparable quality to those of the current TAIR10 and maize V2 annotation builds.

mBio | 2013

Ecological Patterns of nifH Genes in Four Terrestrial Climatic Zones Explored with Targeted Metagenomics Using FrameBot, a New Informatics Tool

Qiong Wang; John F. Quensen; Jordan A. Fish; Tae Kwon Lee; Yanni Sun; James M. Tiedje; James R. Cole

ABSTRACT Biological nitrogen fixation is an important component of sustainable soil fertility and a key component of the nitrogen cycle. We used targeted metagenomics to study the nitrogen fixation-capable terrestrial bacterial community by targeting the gene for nitrogenase reductase (nifH). We obtained 1.1 million nifH 454 amplicon sequences from 222 soil samples collected from 4 National Ecological Observatory Network (NEON) sites in Alaska, Hawaii, Utah, and Florida. To accurately detect and correct frameshifts caused by indel sequencing errors, we developed FrameBot, a tool for frameshift correction and nearest-neighbor classification, and compared its accuracy to that of two other rapid frameshift correction tools. We found FrameBot was, in general, more accurate as long as a reference protein sequence with 80% or greater identity to a query was available, as was the case for virtually all nifH reads for the 4 NEON sites. Frameshifts were present in 12.7% of the reads. Those nifH sequences related to the Proteobacteria phylum were most abundant, followed by those for Cyanobacteria in the Alaska and Utah sites. Predominant genera with nifH sequences similar to reads included Azospirillum, Bradyrhizobium, and Rhizobium, the latter two without obvious plant hosts at the sites. Surprisingly, 80% of the sequences had greater than 95% amino acid identity to known nifH gene sequences. These samples were grouped by site and correlated with soil environmental factors, especially drainage, light intensity, mean annual temperature, and mean annual precipitation. FrameBot was tested successfully on three ecofunctional genes but should be applicable to any. IMPORTANCE High-throughput phylogenetic analysis of microbial communities using rRNA-targeted sequencing is now commonplace; however, such data often allow little inference with respect to either the presence or the diversity of genes involved in most important ecological processes. 
To study the gene pool for these processes, it is more straightforward to assess the genes directly responsible for the ecological function (ecofunctional genes). However, analyzing these genes involves technical challenges beyond those seen for rRNA. In particular, frameshift errors cause garbled downstream protein translations. Our FrameBot tool described here both corrects frameshift errors in query reads and determines their closest matching protein sequences in a set of reference sequences. We validated this new tool with sequences from defined communities and demonstrated the tool’s utility on nifH gene fragments sequenced from soils in well-characterized and major terrestrial ecosystem types.
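The remark that frameshift errors garble downstream protein translations can be seen in a toy example. The sketch below translates a short DNA string with the standard genetic code, then deletes one base and translates again. The sequence and the helper name `translate` are invented for illustration, and nothing here reflects how FrameBot itself detects or corrects frameshifts.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = ["".join(triplet) for triplet in product(BASES, repeat=3)]
CODON_TABLE = dict(zip(CODONS, AMINO))


def translate(dna):
    """Translate DNA in frame 0, ignoring any trailing partial codon."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


# Hypothetical read; a single-base deletion simulates an indel error.
original = "ATGGCTGAAAAATTT"
shifted = original[:5] + original[6:]  # drop the base at index 5

print(translate(original))
print(translate(shifted))
```

The two translations agree only up to the codon containing the deletion; every residue after it is read in the wrong frame, which is why a nearest-reference search on the uncorrected read can fail.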

BMC Bioinformatics | 2011

HMM-FRAME: accurate protein domain classification for metagenomic sequences containing frameshift errors

Yuan Zhang; Yanni Sun

Background: Protein domain classification is an important step in metagenomic annotation. The state-of-the-art method for protein domain classification is profile HMM-based alignment. However, the relatively high rates of insertions and deletions in homopolymer regions of pyrosequencing reads create frameshifts, causing conventional profile HMM alignment tools to generate alignments with marginal scores. This makes error-containing gene fragments unclassifiable with conventional tools. Thus, there is a need for an accurate domain classification tool that can detect and correct sequencing errors. Results: We introduce HMM-FRAME, a protein domain classification tool based on an augmented Viterbi algorithm that can incorporate error models from different sequencing platforms. HMM-FRAME corrects sequencing errors and classifies putative gene fragments into domain families. It achieved high error detection sensitivity and specificity in a data set with annotated errors. We applied HMM-FRAME in Targeted Metagenomics and a published metagenomic data set. The results showed that our tool can correct frameshifts in error-containing sequences, generate much longer alignments with significantly smaller E-values, and classify more sequences into their native families. Conclusions: HMM-FRAME provides a complementary protein domain classification tool to conventional profile HMM-based methods for data sets containing frameshifts. Its current implementation is best used for small-scale metagenomic data sets. The source code of HMM-FRAME can be downloaded at http://www.cse.msu.edu/~zhangy72/hmmframe/ and at https://sourceforge.net/projects/hmm-frame/.

Nucleic Acids Research | 2014

The insect-phase gRNA transcriptome in Trypanosoma brucei

Donna J. Koslowsky; Yanni Sun; Jordan Hindenach; Terence Theisen; Jasmin Lucas

One of the most striking examples of small RNA regulation of gene expression is the process of RNA editing in the mitochondria of trypanosomes. In these parasites, RNA editing involves extensive uridylate insertions and deletions within most of the mitochondrial messenger RNAs (mRNAs). Over 1200 small guide RNAs (gRNAs) are predicted to be responsible for directing the sequence changes that create start and stop codons, correct frameshifts and for many of the mRNAs generate most of the open reading frame. In addition, alternative editing creates the opportunity for unprecedented protein diversity. In Trypanosoma brucei, the vast majority of gRNAs are transcribed from minicircles, which are approximately one kilobase in size, and encode between three and four gRNAs. The large number (5000–10 000) and their concatenated structure make them difficult to sequence. To identify the complete set of gRNAs necessary for mRNA editing in T. brucei, we used Illumina deep sequencing of purified gRNAs from the procyclic stage. We report a near complete set of gRNAs needed to direct the editing of the mRNAs.

Bioinformatics | 2014

miR-PREFeR: an accurate, fast and easy-to-use plant miRNA prediction tool using small RNA-Seq data

Jikai Lei; Yanni Sun

Summary: Plant microRNA prediction tools that use small RNA-sequencing data are emerging quickly. These existing tools have at least one of the following problems: (i) a high false-positive rate; (ii) a long running time; (iii) they work only for genomes in their databases; (iv) they are hard to install or use. We developed miR-PREFeR (miRNA PREdiction From small RNA-Seq data), which uses expression patterns of miRNA and follows the criteria for plant microRNA annotation to accurately predict plant miRNAs from one or more small RNA-Seq data samples of the same species. We tested miR-PREFeR on several plant species. The results show that miR-PREFeR is sensitive, accurate, and fast, and has a low memory footprint. Availability and implementation: https://github.com/hangelwen/miR-PREFeR

Plant Physiology | 2015

Automated Update, Revision, and Quality Control of the Maize Genome Annotations Using MAKER-P Improves the B73 RefGen_v3 Gene Models and Identifies New Genes

Mei Yee Law; Kevin L. Childs; Michael S. Campbell; Joshua C. Stein; Andrew Olson; Carson Holt; Nicholas Panchy; Jikai Lei; Dian Jiao; Carson M. Andorf; Carolyn J. Lawrence; Doreen Ware; Shin Han Shiu; Yanni Sun; Ning Jiang; Mark Yandell

Reannotation of the maize genome using MAKER-P results in many revised and new gene models. The large size and relative complexity of many plant genomes make creation, quality control, and dissemination of high-quality gene structure annotations challenging. In response, we have developed MAKER-P, a fast and easy-to-use genome annotation engine for plants. Here, we report the use of MAKER-P to update and revise the maize (Zea mays) B73 RefGen_v3 annotation build (5b+) in less than 3 h using the iPlant Cyberinfrastructure. MAKER-P identified and annotated 4,466 additional, well-supported protein-coding genes not present in the 5b+ annotation build, added additional untranslated regions to 1,393 5b+ gene models, identified 2,647 5b+ gene models that lack any supporting evidence (despite the use of large and diverse evidence data sets), identified 104,215 pseudogene fragments, and created an additional 2,522 noncoding gene annotations. We also describe a method for de novo training of MAKER-P for the annotation of newly sequenced grass genomes. Collectively, these results lead to the 6a maize genome annotation and demonstrate the utility of MAKER-P for rapid annotation, management, and quality control of grasses and other difficult-to-annotate plant genomes.

Bioinformatics | 2015

Reconstructing 16S rRNA genes in metagenomic data

Cheng Yuan; Jikai Lei; James R. Cole; Yanni Sun

Metagenomic data, which contain sequenced DNA reads of uncultured microbial species from environmental samples, provide a unique opportunity to thoroughly analyze microbial species that have never been identified before. Reconstructing 16S ribosomal RNA, a phylogenetic marker gene, is usually required to analyze the composition of the metagenomic data. However, the massive volume of data, high sequence similarity between related species, skewed microbial abundance and lack of reference genes make 16S rRNA reconstruction difficult. Generic de novo assembly tools are not optimized for assembling 16S rRNA genes. In this work, we introduce a targeted rRNA assembly tool, REAGO (REconstruct 16S ribosomal RNA Genes from metagenOmic data). It addresses the above challenges by combining secondary structure-aware homology search, properties of rRNA genes and de novo assembly. Our experimental results show that our tool can correctly recover more rRNA genes than several popular generic metagenomic assembly tools and specially designed rRNA construction tools. Availability and implementation: The source code of REAGO is freely available at https://github.com/chengyuan/reago.
