John C. Mu | Researchain

Publication

Featured research published by John C. Mu.

Cell | 2012

Mutation of a U2 snRNA Gene Causes Global Disruption of Alternative Splicing and Neurodegeneration

Yichang Jia; John C. Mu; Susan L. Ackerman

Although uridine-rich small nuclear RNAs (U-snRNAs) are essential for pre-mRNA splicing, little is known regarding their function in the regulation of alternative splicing or of the biological consequences of their dysfunction in mammals. Here, we demonstrate that mutation of Rnu2-8, one of the mouse multicopy U2 snRNA genes, causes ataxia and neurodegeneration. Coincident with the observed pathology, the level of mutant U2 RNAs was highest in the cerebellum and increased after granule neuron maturation. Furthermore, neuron loss was strongly dependent on the dosage of mutant and wild-type snRNA genes. Comprehensive transcriptome analysis identified a group of alternative splicing events, including the splicing of small introns, which were disrupted in the mutant cerebellum. Our results suggest that the expression of mammalian U2 snRNA genes, previously presumed to be ubiquitous, is spatially and temporally regulated, and dysfunction of a single U2 snRNA causes neuron degeneration through distortion of pre-mRNA splicing.

Bioinformatics | 2012

Fast and accurate read alignment for resequencing

John C. Mu; Hui Jiang; Amirhossein Kiani; Marghoob Mohiyuddin; Narges Bani Asadi; Wing Hung Wong

Motivation: Next-generation sequence analysis has become an important task in both laboratory and clinical settings. A key stage in the majority of sequence analysis workflows, such as resequencing, is the alignment of genomic reads to a reference genome. The accurate alignment of reads with large indels is a computationally challenging task for researchers. Results: We introduce SeqAlto as a new algorithm for read alignment. For reads longer than or equal to 100 bp, SeqAlto is up to 10× faster than existing algorithms, while retaining high accuracy and the ability to align reads with large (up to 50 bp) indels. This improvement in efficiency is particularly important in the analysis of future sequencing data where the number of reads approaches many billions. Furthermore, SeqAlto uses less than 8 GB of memory to align against the human genome. SeqAlto is benchmarked against several existing tools with both real and simulated data. Availability: Linux and Mac OS X binaries free for academic use are available at http://www.stanford.edu/group/wonglab/seqalto Contact: [email protected]

Bioinformatics | 2015

MetaSV: An accurate and integrative structural-variant caller for next generation sequencing

Marghoob Mohiyuddin; John C. Mu; Jian Li; Narges Bani Asadi; Mark Gerstein; Alexej Abyzov; Wing Hung Wong; Hugo Y. K. Lam

Summary: Structural variations (SVs) are large genomic rearrangements that vary significantly in size, making them challenging to detect with the relatively short reads from next-generation sequencing (NGS). Different SV detection methods have been developed; however, each is limited to specific kinds of SVs with varying accuracy and resolution. Previous works have attempted to combine different methods, but they still suffer from poor accuracy particularly for insertions. We propose MetaSV, an integrated SV caller which leverages multiple orthogonal SV signals for high accuracy and resolution. MetaSV proceeds by merging SVs from multiple tools for all types of SVs. It also analyzes soft-clipped reads from alignment to detect insertions accurately since existing tools underestimate insertion SVs. Local assembly in combination with dynamic programming is used to improve breakpoint resolution. Paired-end and coverage information is used to predict SV genotypes. Using simulation and experimental data, we demonstrate the effectiveness of MetaSV across various SV types and sizes. Availability and implementation: Code in Python is at http://bioinform.github.io/metasv/. Contact: [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.
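
The merging step described above can be illustrated with a minimal sketch. This is not MetaSV's code: the abstract does not state its merging criteria, so the rule used here (merge calls of the same type on the same chromosome whenever their intervals overlap) is an assumption, and the `SVCall` fields and the tool names `toolA`, `toolB`, and `toolC` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SVCall:
    tool: str     # hypothetical name of the caller that produced the call
    sv_type: str  # e.g. "DEL", "INS"
    chrom: str
    start: int    # inclusive
    end: int      # exclusive


def merge_calls(calls):
    """Merge overlapping calls that share SV type and chromosome.

    Returns (sv_type, chrom, start, end, tools) tuples, where `tools`
    records which callers supported the merged interval.
    """
    merged = []
    # Sorting puts calls that can merge next to each other.
    for call in sorted(calls, key=lambda c: (c.sv_type, c.chrom, c.start, c.end)):
        if merged:
            sv_type, chrom, start, end, tools = merged[-1]
            same_group = sv_type == call.sv_type and chrom == call.chrom
            if same_group and call.start < end:
                # Extend the previous interval and remember the supporting tool.
                merged[-1] = (sv_type, chrom, start, max(end, call.end),
                              tools | {call.tool})
                continue
        merged.append((call.sv_type, call.chrom, call.start, call.end,
                       frozenset({call.tool})))
    return merged


calls = [
    SVCall("toolA", "DEL", "chr1", 100, 200),
    SVCall("toolB", "DEL", "chr1", 150, 260),
    SVCall("toolA", "INS", "chr1", 150, 151),
    SVCall("toolC", "DEL", "chr2", 100, 200),
]
merged = merge_calls(calls)
for record in merged:
    print(record)
```

Keeping the set of supporting tools on each merged record is one way to let later steps weigh agreement between callers; a real pipeline would likely use a stricter overlap criterion than simple intersection.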

Bioinformatics | 2015

VarSim: a high-fidelity simulation and validation framework for high-throughput genome sequencing with cancer applications

John C. Mu; Marghoob Mohiyuddin; Jian Li; Narges Bani Asadi; Mark Gerstein; Alexej Abyzov; Wing Hung Wong; Hugo Y. K. Lam

Summary: VarSim is a framework for assessing alignment and variant calling accuracy in high-throughput genome sequencing through simulation or real data. In contrast to simulating a random mutation spectrum, it synthesizes diploid genomes with germline and somatic mutations based on a realistic model. This model leverages information such as previously reported mutations to make the synthetic genomes biologically relevant. VarSim simulates and validates a wide range of variants, including single nucleotide variants, small indels and large structural variants. It is an automated, comprehensive compute framework supporting parallel computation and multiple read simulators. Furthermore, we developed a novel map data structure to validate read alignments, a strategy to compare variants binned in size ranges and a lightweight, interactive, graphical report to visualize validation results with detailed statistics. Thus far, it is the most comprehensive validation tool for secondary analysis in next generation sequencing. Availability and implementation: Code in Java and Python along with instructions to download the reads and variants is at http://bioinform.github.io/varsim. Contact: [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.

Genome Biology | 2015

An ensemble approach to accurately detect somatic mutations using SomaticSeq

Li Tai Fang; Pegah Tootoonchi Afshar; Aparna Chhibber; Marghoob Mohiyuddin; Yu Fan; John C. Mu; Greg Gibeling; Sharon Barr; Narges Bani Asadi; Mark Gerstein; Daniel C. Koboldt; Wenyi Wang; Wing Hung Wong; Hugo Y. K. Lam

SomaticSeq is an accurate somatic mutation detection pipeline implementing a stochastic boosting algorithm to produce highly accurate somatic mutation calls for both single nucleotide variants and small insertions and deletions. The workflow currently incorporates five state-of-the-art somatic mutation callers, and extracts over 70 individual genomic and sequencing features for each candidate site. A training set is provided to an adaptively boosted decision tree learner to create a classifier for predicting mutation statuses. We validate our results with both synthetic and real data. We report that SomaticSeq is able to achieve better overall accuracy than any individual tool incorporated.
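
To make the idea of an adaptively boosted classifier over per-site features concrete, the following is a toy sketch of plain discrete AdaBoost with one-feature decision stumps. It is not SomaticSeq's implementation: the two features (number of callers flagging a site and variant allele fraction), the six training rows, and all function names are assumptions chosen for illustration, whereas the abstract describes over 70 features and five callers.

```python
import math


def best_stump(X, y, w):
    """Return (weighted_error, feature, threshold, polarity) of the best stump."""
    best = None
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X}):
            for polarity in (1, -1):
                pred = [polarity if row[j] >= thr else -polarity for row in X]
                err = sum(wi for wi, p, yi in zip(w, pred, y) if p != yi)
                if best is None or err < best[0]:
                    best = (err, j, thr, polarity)
    return best


def stump_predict(stump, row):
    _, j, thr, polarity = stump
    return polarity if row[j] >= thr else -polarity


def adaboost(X, y, rounds):
    """Train an ensemble of (alpha, stump) pairs; labels must be +1 or -1."""
    n = len(X)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        stump = best_stump(X, y, w)
        err = min(max(stump[0], 1e-10), 1 - 1e-10)  # avoid log(0)
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, stump))
        # Up-weight misclassified rows, down-weight correct ones, renormalize.
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, row))
             for wi, yi, row in zip(w, y, X)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble


def predict(ensemble, row):
    score = sum(alpha * stump_predict(stump, row) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1


# Hypothetical candidate sites: [callers flagging the site, variant allele fraction]
X = [[5, 0.30], [3, 0.20], [4, 0.15], [1, 0.25], [4, 0.02], [1, 0.01]]
y = [1, 1, 1, -1, -1, -1]  # +1 = true mutation, -1 = false positive

ensemble = adaboost(X, y, rounds=3)
train_pred = [predict(ensemble, row) for row in X]
print(train_pred)
```

No single stump separates this toy set (one site has many callers but a tiny allele fraction, another the reverse), which is why the reweighting between rounds matters: each new stump concentrates on the rows the earlier ones got wrong.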

Journal of Computational and Graphical Statistics | 2016

Computational Aspects of Optional Pólya Tree

Hui Jiang; John C. Mu; Kun Yang; Chao Du; Luo Lu; Wing Hung Wong

Optional Pólya tree (OPT) is a flexible nonparametric Bayesian prior for density estimation. Despite its merits, the computation for OPT inference is challenging. In this article, we present time complexity analysis for OPT inference and propose two algorithmic improvements. The first improvement, named limited-lookahead optional Pólya tree (LL-OPT), aims at accelerating the computation for OPT inference. The second improvement modifies the output of OPT or LL-OPT and produces a continuous piecewise linear density estimate. We demonstrate the performance of these two improvements using simulated and real data examples.

Physiological Reports | 2014

Exploring the physiologic role of human gastroesophageal reflux by analyzing time-series data from 24-h gastric and esophageal pH recordings

Luo Lu; John C. Mu; Sheldon Sloan; Philip B. Miner; Jerry D. Gardner

Our previous finding of a fractal pattern for gastric pH and esophageal pH plus the statistical association of sequential pH values for up to 2 h led to our hypothesis that the fractal pattern encodes information regarding gastric acidity and that depending on the value of gastric acidity, the esophagus can signal the stomach to alter gastric acidity by influencing gastric secretion of acid or bicarbonate. Under our hypothesis values of gastric pH should provide information regarding values of esophageal pH and vice versa. We used vector autoregression, a theory‐free set of inter‐related linear regressions used to measure relationships that can change over time, to analyze data from 24‐h recordings of gastric pH and esophageal pH. We found that in pH records from normal subjects, as well as from subjects with gastroesophageal reflux disease alone and after treatment with a proton pump inhibitor, gastric pH values provided important information regarding subsequent values of esophageal pH and values of esophageal pH provided important information regarding subsequent values of gastric pH. The ability of gastric pH and esophageal pH to provide information regarding subsequent values of each other was reduced in subjects with gastroesophageal reflux disease compared to normal subjects. Our findings are consistent with the hypothesis that depending on the value of gastric acidity, the esophagus can signal the stomach to alter gastric acidity, and that this ability is impaired in subjects with gastroesophageal reflux disease.
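
As a rough illustration of what a vector autoregression fits, the sketch below estimates a first-order, two-variable model by least squares. It makes several assumptions not taken from the abstract: a single lag, no intercept (the series are treated as mean-centered), and synthetic noiseless input rather than pH recordings. The names `fit_var1` and `simulate` are hypothetical.

```python
def fit_var1(x, y):
    """Least-squares fit of
        x[t] = a11 * x[t-1] + a12 * y[t-1]
        y[t] = a21 * x[t-1] + a22 * y[t-1]
    Returns [(a11, a12), (a21, a22)].
    """
    px, py = x[:-1], y[:-1]  # lagged predictors
    sxx = sum(u * u for u in px)
    syy = sum(v * v for v in py)
    sxy = sum(u * v for u, v in zip(px, py))
    det = sxx * syy - sxy * sxy
    if det == 0:
        raise ValueError("lagged series are collinear")
    rows = []
    for target in (x[1:], y[1:]):
        bx = sum(u * t for u, t in zip(px, target))
        by = sum(v * t for v, t in zip(py, target))
        # Solve the 2x2 normal equations with Cramer's rule.
        rows.append(((bx * syy - by * sxy) / det,
                     (by * sxx - bx * sxy) / det))
    return rows


def simulate(a, x0, y0, steps):
    """Generate a noiseless two-variable VAR(1) trajectory."""
    xs, ys = [x0], [y0]
    for _ in range(steps):
        x, y = xs[-1], ys[-1]
        xs.append(a[0][0] * x + a[0][1] * y)
        ys.append(a[1][0] * x + a[1][1] * y)
    return xs, ys


# Synthetic stand-in series, not measured data.
true_a = [[0.5, 0.2], [0.3, 0.4]]
series_a, series_b = simulate(true_a, 1.0, -1.0, 20)
estimated = fit_var1(series_a, series_b)
print(estimated)
```

The off-diagonal coefficients (`a12` and `a21`) measure how much the past of one series helps predict the next value of the other, which is the kind of cross-series information the abstract refers to.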

Bioinformatics | 2016

LongISLND: in silico sequencing of lengthy and noisy datatypes

Bayo Lau; Marghoob Mohiyuddin; John C. Mu; Li Tai Fang; Narges Bani Asadi; Carolina Dallett; Hugo Y. K. Lam

Summary: LongISLND is a software package designed to simulate sequencing data according to the characteristics of third generation, single-molecule sequencing technologies. The general software architecture is easily extendable, as demonstrated by the emulation of Pacific Biosciences (PacBio) multi-pass sequencing with P5 and P6 chemistries, producing data in FASTQ, H5, and the latest PacBio BAM format. We demonstrate its utility by downstream processing with consensus building and variant calling. Availability and Implementation: LongISLND is implemented in Java and available at http://bioinform.github.io/longislnd Contact: [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.

Scientific Reports | 2015

Leveraging long read sequencing from a single individual to provide a comprehensive resource for benchmarking variant calling methods

John C. Mu; Pegah Tootoonchi Afshar; Marghoob Mohiyuddin; Xi Chen; Jian Li; Narges Bani Asadi; Mark Gerstein; Wing Hung Wong; Hugo Y. K. Lam

A high-confidence, comprehensive human variant set is critical in assessing accuracy of sequencing algorithms, which are crucial in precision medicine based on high-throughput sequencing. Although recent works have attempted to provide such a resource, they still do not encompass all major types of variants including structural variants (SVs). Thus, we leveraged the massive high-quality Sanger sequences from the HuRef genome to construct by far the most comprehensive gold set of a single individual, which was cross validated with deep Illumina sequencing, population datasets, and well-established algorithms. It was a necessary effort to completely reanalyze the HuRef genome as its previously published variants were mostly reported five years ago, suffering from compatibility, organization, and accuracy issues that prevent their direct use in benchmarking. Our extensive analysis and validation resulted in a gold set with high specificity and sensitivity. In contrast to the current gold sets of the NA12878 or HS1011 genomes, our gold set is the first that includes small variants, deletion SVs and insertion SVs up to a hundred thousand base-pairs. We demonstrate the utility of our HuRef gold set to benchmark several published SV detection tools.

Cancer Research | 2015

Abstract LB-306: An ensemble approach to accurately detect somatic mutations via adaptive boosting

Li Tai Fang; Pegah Tootoonchi Afshar; John C. Mu; Narges Bani Asadi; Wing Hung Wong; Hugo Y. K. Lam

Identifying somatic mutations is a key analysis in cancer research. The challenge lies in the impure and heterogeneous nature of tumor samples. Oftentimes, an algorithm works well for one tumor but poorly for another. Here, we present an ensemble approach that integrates multiple algorithms and demonstrate its performance and high accuracy with validation from both synthetic and real data. Our approach incorporates state-of-the-art callers including MuTect, SomaticSniper, VarScan2, JointSNVMix2, and VarDict for somatic mutation detection. Each of these algorithms has its unique strength and is capable of detecting variants that are missed by some of the others. The call sets are combined based on 70 independent sequencing and genomic features, which are then used by an adaptively boosted decision tree learner. The learner is trained with sophisticated simulated data to discriminate true mutations from the very noisy data of tumor samples. In our latest submission to the ICGC-TCGA DREAM Mutation Calling Challenge (the Challenge), our approach obtained an unprecedented somatic SNV detection accuracy of 97.1% with a recall of 94.2% and a precision of 99.9%. The synthetic data was a tumor-normal pair of samples with 30x sequencing depth each. The tumor sample was synthesized by spiking in a whole spectrum of variants ranging from SNVs/Indels to SVs, resulting in an SNV variant allele frequency (VAF) of 25%. We further validated our approach with “in silico titration”. The titration mixed two different real genomes at different proportions with validated ground truths to generate different sample conditions, ranging from the simplest case where the normal and tumor were pure to the more challenging case where the tumor and normal tissues were cross-contaminated. At VAFs of 50%, 25%, and 15%, our approach achieved accuracies of 95.7%, 92.5%, and 85.3%, respectively, based on cross validation, consistent with the results from the Challenge.
Finally, we validated our approach with three widely used, published cancer datasets obtained from TCGA and EGA: a whole-genome sequenced malignant melanoma cell line, a whole-genome sequenced chronic lymphocytic leukemia cell line, and a whole-exome sequenced colon adenocarcinoma patient sample with experimentally validated somatic mutations. Our approach was trained on the data from the Challenge and applied to the aforementioned samples to measure its accuracy. We achieved recalls of 98.9%, 89.1%, and 87.9%, respectively. Although precision on real data cannot be measured without a comprehensive whole-genome experimental validation, our call sets were smaller than those of all other methods considered, implying that our approach has the highest precision among them. We extended our study of the above three validation approaches, namely synthetic genomes, in silico titration, and real samples, to compare accuracy against all five individual callers. We found that our approach had the highest accuracy when compared to any individual caller. To conclude, our approach is shown to have high accuracy in different types and conditions of tumor samples and to be by far the best in its class.
Citation Format: Li Tai Fang, Pegah T. Afshar, John C. Mu, Narges Bani Asadi, Wing H. Wong, Hugo Y. K. Lam. An ensemble approach to accurately detect somatic mutations via adaptive boosting. [abstract]. In: Proceedings of the 106th Annual Meeting of the American Association for Cancer Research; 2015 Apr 18-22; Philadelphia, PA. Philadelphia (PA): AACR; Cancer Res 2015;75(15 Suppl):Abstract nr LB-306. doi:10.1158/1538-7445.AM2015-LB-306
