Yalu Wen
Michigan State University
Publication
Featured research published by Yalu Wen.
Nucleic Acids Research | 2009
Lin Wan; Kelian Sun; Qi Ding; Yuehua Cui; Ming Li; Yalu Wen; Robert C. Elston; Minping Qian; Wenjiang J. Fu
Affymetrix SNP arrays have been widely used for single-nucleotide polymorphism (SNP) genotype calling and DNA copy number variation inference. Although numerous methods have achieved high accuracy in these fields, most studies have paid little attention to the modeling of hybridization of probes to off-target allele sequences, which can affect the accuracy greatly. In this study, we address this issue and demonstrate that hybridization with mismatch nucleotides (HWMMN) occurs in all SNP probe-sets and has a critical effect on the estimation of allelic concentrations (ACs). We study sequence binding through binding free energy and then binding affinity, and develop a probe intensity composite representation (PICR) model. The PICR model allows the estimation of ACs at a given SNP through statistical regression. Furthermore, we demonstrate with cell-line data of known true copy numbers that the PICR model can achieve reasonable accuracy in copy number estimation at a single SNP locus, by using the ratio of the estimated AC of each sample to that of the reference sample, and can reveal subtle genotype structure of SNPs at abnormal loci. We also demonstrate with HapMap data that the PICR model yields accurate SNP genotype calls consistently across samples, laboratories and even across array platforms.
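The ratio-based copy number estimate mentioned in this abstract can be illustrated with a minimal sketch. It assumes the allelic concentrations (ACs) for alleles A and B have already been estimated, that the two are summed per sample, and that the reference sample carries two copies at the locus. The function name, the summation, and the diploid default are illustrative assumptions, not the authors' implementation.

```python
def copy_number_from_ratio(ac_sample, ac_reference, reference_copies=2.0):
    """Estimate copy number at one SNP from estimated allelic concentrations.

    ac_sample, ac_reference: (AC_A, AC_B) pairs for the test and reference samples.
    reference_copies: assumed copy number of the reference sample
                      (hypothetical default: diploid).
    """
    total_sample = sum(ac_sample)
    total_reference = sum(ac_reference)
    if total_reference <= 0:
        raise ValueError("reference allelic concentration must be positive")
    # Scale the sample-to-reference ratio by the reference copy number.
    return reference_copies * total_sample / total_reference
```

Under these assumptions, a sample whose total AC is 1.5 times that of a diploid reference would be assigned three copies.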
Bioinformatics | 2016
Yalu Wen; Fushun Chen; Qingzheng Zhang; Yan Zhuang; Zhiguang Li
MOTIVATION: DNA methylation is an important epigenetic modification that has an essential role in gene regulation, cell differentiation and cancer development. Bisulfite sequencing is a widely used technique to obtain genome-wide DNA methylation profiles, and one of the key tasks in analyzing bisulfite sequencing data is to detect differentially methylated regions (DMRs) among samples under different treatment conditions. Although numerous tools have been proposed to detect differentially methylated single CpG sites (DMCs) between samples, methods for direct DMR detection, especially for complex study designs, are largely limited. RESULTS: We present a new software tool, GetisDMR, for direct DMR detection. We use beta-binomial regression to model whole-genome bisulfite sequencing data, accounting for variations in methylation levels and for confounding effects. To detect DMRs, we employ a region-wise test statistic that is derived from local Getis-Ord statistics and considers the spatial correlation between nearby CpG sites. Unlike existing methods that attempt to infer DMRs from DMCs based on empirical criteria, we provide statistical inference for direct DMR detection. Through extensive simulations and an application to two mouse datasets, we demonstrate that GetisDMR achieves better sensitivity and positive predictive value, more exact locations, and better agreement of DMRs with current biological knowledge. AVAILABILITY AND IMPLEMENTATION: It is available at https://github.com/DMU-lilab/GetisDMR. CONTACT: [email protected] or [email protected]. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
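For readers unfamiliar with local Getis-Ord statistics, the following is a minimal sketch of the standard Gi* z-score applied to per-CpG values along a chromosome. It is not GetisDMR's region-wise statistic, which is built on the beta-binomial regression described above. The binary distance weights, the `bandwidth` parameter, and the function name are assumptions made for illustration.

```python
import math


def getis_ord_gi_star(positions, values, bandwidth):
    """Standard local Getis-Ord Gi* z-score for each site, with binary weights.

    Site j counts as a neighbour of site i (i itself included) when
    |positions[i] - positions[j]| <= bandwidth.
    """
    n = len(values)
    if n < 2 or len(positions) != n:
        raise ValueError("need at least two sites with matching positions")
    mean = sum(values) / n
    # Population standard deviation; clamp tiny negative rounding errors.
    s = math.sqrt(max(sum(v * v for v in values) / n - mean * mean, 0.0))
    scores = []
    for i in range(n):
        w = [1.0 if abs(positions[i] - positions[j]) <= bandwidth else 0.0
             for j in range(n)]
        w_sum = sum(w)
        w_sq_sum = sum(x * x for x in w)
        numerator = sum(wj * vj for wj, vj in zip(w, values)) - mean * w_sum
        denominator = s * math.sqrt((n * w_sq_sum - w_sum ** 2) / (n - 1))
        # A zero denominator means no local contrast can be measured.
        scores.append(numerator / denominator if denominator > 0 else 0.0)
    return scores
```

Positive scores mark clusters of high values and negative scores mark clusters of low values, which is the spatial-clustering idea that a region-wise test can build on.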
Health Promotion Practice | 2015
Katherine Alaimo; Shannon Oleksyk; Diane Golzynski; Nick Drzal; Jennifer Lucarelli; Melissa Reznar; Yalu Wen; Karen Krabill Yoder
The Michigan Healthy School Action Tools (HSAT) is an online self-assessment and action planning process for schools seeking to improve their health policies and practices. The School Nutrition Advances Kids study, a 2-year quasi-experimental intervention with low-income middle schools, evaluated whether completing the HSAT with facilitator assistance and small grant funding resulted in (1) improvements in school nutrition practices and policies and (2) improvements in student dietary intake. A total of 65 low-income Michigan middle schools participated in the study. The Block Youth Food Frequency Questionnaire was completed by 1,176 seventh-grade students at baseline and in eighth grade (during intervention). Schools reported nutrition-related policies and practices/education using the School Environment and Policy Survey. Schools completing the HSAT were compared to schools that did not complete the HSAT with regard to number of policy and practice changes and student dietary intake. Schools that completed the HSAT made significantly more nutrition practice/education changes than schools that did not complete the HSAT, and students in those schools made dietary improvements in fruit, fiber, and cholesterol intake. The Michigan HSAT process is an effective strategy to initiate improvements in nutrition policies and practices within schools, and to improve student dietary intake.
Genetic Epidemiology | 2013
Yalu Wen; Qing Lu
The translation of human genome discoveries into health practice is one of the major challenges in the coming decades. The use of emerging genetic knowledge for early disease prediction, prevention, and pharmacogenetics will advance genome medicine and lead to more effective prevention/treatment strategies. For this reason, studies to assess the combined role of genetic and environmental discoveries in early disease prediction represent high priority research projects, as manifested in the multiple risk prediction studies now underway. However, the risk prediction models formed to date lack sufficient accuracy for clinical use. Converging evidence suggests that diseases with the same or similar clinical manifestations could have different pathophysiological and etiological processes. When heterogeneous subphenotypes are treated as a single entity, the effect size of predictors can be reduced substantially, leading to a low‐accuracy risk prediction model. The use of more refined subphenotypes facilitates the identification of new predictors and leads to improved risk prediction models. To account for the phenotypic heterogeneity, we have developed a multiclass likelihood‐ratio approach, which simultaneously determines the optimum number of subphenotype groups and builds a risk prediction model for each group. Simulation results demonstrated that the new approach had more accurate and robust performance than existing approaches under various underlying disease models. The empirical study of type II diabetes (T2D) by using data from the Genes and Environment Initiatives suggested heterogeneous etiology underlying obese and nonobese T2D patients. Considering phenotypic heterogeneity in the analysis leads to improved risk prediction models for both obese and nonobese T2D subjects.
Genetic Epidemiology | 2016
Yalu Wen; Qing Lu
Although compelling evidence suggests that the genetic etiology of complex diseases could be heterogeneous in subphenotype groups, little attention has been paid to phenotypic heterogeneity in genetic association analysis of complex diseases. Simply ignoring phenotypic heterogeneity in association analysis could result in attenuated estimates of genetic effects and low power of association tests if subphenotypes with similar clinical manifestations have heterogeneous underlying genetic etiologies. To facilitate the family‐based association analysis allowing for phenotypic heterogeneity, we propose a clustered multiclass likelihood‐ratio ensemble (CMLRE) method. The proposed method provides an alternative way to model the complex relationship between disease outcomes and genetic variants. It allows for heterogeneous genetic causes of disease subphenotypes and can be applied to various pedigree structures. Through simulations, we found CMLRE outperformed the commonly adopted strategies in a variety of underlying disease scenarios. We further applied CMLRE to a family‐based dataset from the International Consortium to Identify Genes and Interactions Controlling Oral Clefts (ICOC) to investigate the genetic variants and interactions predisposing to subphenotypes of oral clefts. The analysis suggested that two subphenotypes, nonsyndromic cleft lip without palate (CL) and cleft lip with palate (CLP), shared similar genetic etiologies, while cleft palate only (CP) had its own genetic mechanism. The analysis further revealed that rs10863790 (IRF6), rs7017252 (8q24), and rs7078160 (VAX1) were jointly associated with CL/CLP, while rs7969932 (TBK1), rs227731 (17q22), and rs2141765 (TBK1) jointly contributed to CP.
Frontiers in Genetics | 2015
Yalu Wen; Qing Lu
Oral clefts are common birth defects that have a major impact on affected individuals, their families and society. Worldwide, the incidence of oral clefts is 1/700 live births, making them the most common craniofacial birth defects. The successful prediction of oral clefts may help identify subpopulations at high risk and promote new diagnostic and therapeutic strategies. Nevertheless, developing a clinically useful risk prediction model for oral clefts remains a great challenge. Compelling evidence suggests that the etiologies of oral clefts are highly heterogeneous, and accounting for phenotypic heterogeneity may improve the accuracy of a risk prediction model. In this study, we applied a previously developed statistical method to investigate risk prediction for sub-phenotypes of oral clefts. Our results suggested that subtypes of cleft lip (CL) and palate have similar genetic etiologies (AUC = 0.572) to subtypes of CL only (AUC = 0.589), while the subtypes of cleft palate only (CPO) have heterogeneous underlying mechanisms (AUCs for soft CPO and hard CPO are 0.617 and 0.623, respectively). This highlights the possibility that the hard and soft forms of CPO have their own mechanisms despite sharing some genetic risk factors. Compared with conventional methods for risk prediction modeling, our method considers the phenotypic heterogeneity of a disease, which potentially improves the accuracy of predicting each sub-phenotype of oral clefts.
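As context for the AUC values quoted above, the following sketch computes the area under the ROC curve in its rank-based form: the probability that a randomly chosen case receives a higher predicted risk than a randomly chosen control, with ties counted as one half. The function and argument names are illustrative, and the sketch is independent of the prediction method used in the study.

```python
def auc(scores_cases, scores_controls):
    """Rank-based AUC: P(case score > control score), ties counted as 1/2."""
    if not scores_cases or not scores_controls:
        raise ValueError("both groups must be non-empty")
    wins = 0.0
    for case_score in scores_cases:
        for control_score in scores_controls:
            if case_score > control_score:
                wins += 1.0
            elif case_score == control_score:
                wins += 0.5
    return wins / (len(scores_cases) * len(scores_controls))
```

An AUC of 0.5 corresponds to chance-level discrimination and 1.0 to perfect separation, so values near 0.6 indicate modest predictive ability.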
PLOS ONE | 2013
Ming Li; Yalu Wen; Qing Lu; Wenjiang J. Fu
Oligonucleotide microarrays are commonly adopted for detecting and quantifying the abundance of molecules in biological samples. Analysis of microarray data starts with recording and interpreting hybridization signals from CEL images. However, many CEL images may be blemished by noise from various sources, observed as “bright spots”, “dark clouds”, “shadowy circles”, and so on. It is crucial that these image defects are correctly identified and properly processed. Existing approaches mainly focus on detecting defect areas and removing affected intensities. In this article, we propose to use a mixed effect model for imputing the affected intensities. The proposed imputation procedure is a single-array-based approach that does not require any biological replicate or between-array normalization. We further examine its performance by using Affymetrix high-density SNP arrays. The results show that this imputation procedure significantly reduces genotyping error rates. We also discuss the necessary adjustments for its potential extension to other oligonucleotide microarrays, such as those used for gene expression profiling. The R source code for the implementation of the approach is freely available upon request.
Current Genomics | 2016
Ming Li; Changshuai Wei; Yalu Wen; Tong Wang; Qing Lu
Many complex diseases, such as psychiatric and behavioral disorders, are commonly characterized through various measurements that reflect physical, behavioral and psychological aspects of diseases. While it remains a great challenge to find a unified measurement to characterize a disease, the available multiple phenotypes can be analyzed jointly in a genetic association study. Simultaneously testing these phenotypes has many advantages, including considering different aspects of the disease in the analysis and utilizing correlated phenotypes to improve the power of detecting disease-associated variants. Furthermore, complex diseases are likely caused by the interplay of multiple genetic variants through complicated mechanisms. Considering gene-gene interactions in the joint association analysis of complex diseases could further increase our ability to discover genetic variants involved in complex disease pathways. In this article, we propose a stepwise U-test for joint association analysis of multiple loci and multiple phenotypes. Through simulations, we demonstrated that testing multiple phenotypes simultaneously could attain higher power than testing one single phenotype at a time, especially when there are shared genes contributing to multiple phenotypes. We also illustrated the proposed method with an application to Nicotine Dependence (ND), using datasets from the Study of Addiction, Genetics and Environment (SAGE). The joint analysis of three ND phenotypes identified two SNPs, rs10508649 and rs2491397, and reached a nominal P-value of 3.79e-13. The association was further replicated in two independent datasets with P-values of 2.37e-05 and 7.46e-05.
Journal of Computational Biology | 2013
Yalu Wen; Ming Li; Wenjiang J. Fu
The genomic wave has been identified as a major artifact in genome data and is highly correlated with the sequence GC content. Although statistical methods have been developed to filter this artifact, the mechanism underlying the genomic wave has not yet been studied. Understanding the artifact, specifically its sources, may lead to successful separation of biological signals from the artifact and improve array design, modeling, and association studies. We develop an approach to capturing the genomic wave in oligonucleotide single-nucleotide polymorphism (SNP) arrays by separating biological signals from the array baseline background through modeling sequence binding with a newly developed probe intensity composite representation (PICR) model. The PICR model decomposes the probe intensity of each SNP probe set into the target sequence concentrations, SNP-specific background (nonsignal) and measurement error, and identifies the biological signals through the target concentration for each allele. We demonstrate with the Affymetrix GeneChip 500K HapMap data and the Wellcome Trust Case-Control Study data that the genomic wave is captured through the SNP-specific background term of the PICR model, and is separated successfully from the allelic target concentrations, which are the biological signals. We further identify two important sources of the genomic wave, the GC content and the fragment length (FL) of the sequence, and conclude that (1) the genomic wave artifact can be removed from the genome data with the PICR model, and (2) in addition to the GC content, the genomic wave also has a component due to a nonlinear effect of the FL.
Scientific Reports | 2016
Yalu Wen; Zihuai He; Ming Li; Qing Lu
With advances in high-throughput sequencing technology, it is feasible to investigate the role of common and rare variants in disease risk prediction. While the new technology holds great promise to improve disease prediction, the massive amount of data and the low frequency of rare variants pose great analytical challenges for risk prediction modeling. In this paper, we develop a forward random field method (FRF) for risk prediction modeling using sequencing data. In FRF, subjects’ phenotypes are treated as stochastic realizations of a random field on a genetic space formed by subjects’ genotypes, and an individual’s phenotype can be predicted by adjacent subjects with similar genotypes. The FRF method allows for multiple similarity measures and candidate genes in the model, and adaptively chooses the optimal similarity measure and disease-associated genes to reflect the underlying disease model. It also avoids specifying a threshold for rare variants and allows for different directions and magnitudes of genetic effects. Through simulations, we demonstrate that the FRF method attains accuracy higher than or comparable to that of commonly used support vector machine based methods under various disease models. We further illustrate the FRF method with an application to the sequencing data obtained from the Dallas Heart Study.
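The idea that an individual's phenotype can be predicted from subjects with similar genotypes can be illustrated with a deliberately simplified sketch: a similarity-weighted average of the other subjects' phenotypes. This is not the FRF estimator, which adaptively selects among similarity measures and genes. The identity-by-state similarity, the 0/1/2 genotype coding, and both function names are assumptions chosen for illustration.

```python
def ibs_similarity(g1, g2):
    """Identity-by-state similarity between two genotype vectors coded 0/1/2.

    Returns 1.0 for identical genotypes and 0.0 for opposite homozygotes
    at every variant.
    """
    if len(g1) != len(g2) or not g1:
        raise ValueError("genotype vectors must be non-empty and equal in length")
    return sum(2 - abs(a - b) for a, b in zip(g1, g2)) / (2.0 * len(g1))


def predict_phenotype(target_genotype, genotypes, phenotypes):
    """Predict a phenotype as the similarity-weighted mean of known phenotypes."""
    weights = [ibs_similarity(target_genotype, g) for g in genotypes]
    total = sum(weights)
    if total == 0:
        # No subject shares any allele with the target: fall back to the plain mean.
        return sum(phenotypes) / len(phenotypes)
    return sum(w * y for w, y in zip(weights, phenotypes)) / total
```

Because every variant is weighted equally here, no frequency threshold separating rare from common variants is needed, although a practical method would likely weight variants differently.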