Jiangwen Sun
University of Connecticut
Publication
Featured research published by Jiangwen Sun.
Addictive Behaviors | 2012
Jiangwen Sun; Jinbo Bi; Grace Chan; David W. Oslin; Lindsay A. Farrer; Joel Gelernter; Henry R. Kranzler
Although there is evidence that opioid dependence (OD) is heritable, efforts to identify genes contributing to risk for the disorder have been hampered by its complex etiology and variable clinical manifestations. Decomposition of a complex set of opioid users into homogeneous subgroups could enhance genetic analysis. We applied a series of data mining techniques, including multiple correspondence analysis, variable selection and cluster analysis, to 69 opioid-related measures from 5390 subjects aggregated from family-based and case-control genetic studies to identify homogeneous subtypes and estimate their heritability. Novel aspects of this work include our use of (1) heritability estimates of specific clinical features of OD to enhance the heritability of the subtypes and (2) a k-medoids clustering method in combination with hierarchical clustering to yield replicable clusters that are less sensitive to noise than previous methods. We identified five homogeneous groups, including two large groups comprised of 762 and 1353 heavy opioid users, with estimated heritability of 0.69 and 0.76, respectively. These methods represent a promising approach to the identification of highly heritable subtypes in complex, heterogeneous disorders.
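The k-medoids step mentioned in the abstract above can be illustrated with a minimal sketch. This is not the authors' implementation: the Euclidean distance, the deterministic farthest-point initialization, the function name `k_medoids`, and the toy data are all assumptions made for illustration, and the combination with hierarchical clustering described in the abstract is not shown.

```python
import numpy as np

def k_medoids(X, k, n_iter=100):
    """Alternating k-medoids on the rows of X (Euclidean distance assumed)."""
    # Pairwise distances between all rows, shape (n, n).
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # Hypothetical deterministic initialization: start from the most central
    # point, then repeatedly add the point farthest from the chosen medoids.
    medoids = [int(np.argmin(D.sum(axis=1)))]
    while len(medoids) < k:
        nearest = D[:, medoids].min(axis=1)
        medoids.append(int(np.argmax(nearest)))
    medoids = np.array(medoids)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest medoid.
        labels = np.argmin(D[:, medoids], axis=1)
        new_medoids = medoids.copy()
        for j in range(k):
            members = np.flatnonzero(labels == j)
            if members.size == 0:
                continue  # keep the old medoid for an empty cluster
            # Update step: the medoid is the member with the smallest
            # total distance to the other members of its cluster.
            within = D[np.ix_(members, members)].sum(axis=1)
            new_medoids[j] = members[np.argmin(within)]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    labels = np.argmin(D[:, medoids], axis=1)
    return medoids, labels

# Toy data: two well-separated groups of three points each.
X = np.array([[0, 0], [0, 1], [1, 0],
              [10, 10], [10, 11], [11, 10]], dtype=float)
medoids, labels = k_medoids(X, k=2)
```

Because medoids are actual observations rather than averages, a few extreme points move the cluster representatives less than they would in k-means, which is one common reason for preferring k-medoids on noisy clinical measures.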
American Journal of Medical Genetics | 2014
Jinbo Bi; Joel Gelernter; Jiangwen Sun; Henry R. Kranzler
Because DSM‐IV cocaine dependence (CD) is heterogeneous, it is not an optimal phenotype to identify genetic variation contributing to risk for cocaine use and related behaviors (CRBs). We used a cluster analytic method to differentiate homogeneous, highly heritable subtypes of CRBs and to compare their utility with that of the DSM‐IV CD as traits for genetic association analysis. Clinical features of CRBs and co‐occurring disorders were obtained via a poly‐diagnostic interview administered to 9,965 participants in genetic studies of substance dependence. A subsample of subjects (N = 3,443) were genotyped for 1,350 single nucleotide polymorphisms (SNPs) selected from 130 candidate genes related to addiction. Cluster analysis of clinical features of the sample yielded five subgroups, two of which were characterized by heavy cocaine use and high heritability: a heavy cocaine use, infrequent intravenous injection group and an early‐onset, heavy cocaine use, high comorbidity group. The utility of these traits was compared with the CD diagnosis through association testing of 2,320 affected subjects and 480 cocaine‐exposed controls. Analyses examined both single SNP (main) and SNP–SNP interaction (epistatic) effects, separately for African‐Americans and European‐Americans. The two derived subtypes showed more significant P values for 6 of 8 main effects and 7 of 8 epistatic effects. Variants in the CLOCK gene were significantly associated with the heavy cocaine use, infrequent intravenous injection group, but not with the DSM‐IV diagnosis of CD. These results support the utility of subtypes based on CRBs to detect risk variants for cocaine addiction.
BMC Genetics | 2014
Jiangwen Sun; Jinbo Bi; Henry R. Kranzler
Background: Accurate classification of patients with a complex disease into subtypes has important implications for medicine and healthcare. Using more homogeneous disease subtypes in genetic association analysis will facilitate the detection of new genetic variants that are not detectible using the non-differentiated disease phenotype. Subtype differentiation can also improve diagnostic classification, which can in turn inform clinical decision making and treatment matching. Currently, the most sophisticated methods for disease subtyping perform cluster analysis using patients’ clinical features. Without guidance from genetic information, the resultant subtypes are likely to be suboptimal and efforts at genetic association may fail. Results: We propose a multi-view matrix decomposition approach that integrates clinical features with genetic markers to detect confirmatory evidence for a disease subtype. This approach groups patients into clusters that are consistent between the clinical and genetic dimensions of data; it simultaneously identifies the clinical features that define the subtype and the genotypes associated with the subtype. A simulation study validated the proposed approach, showing that it identified hypothesized subtypes and associated features. In comparison to the latest biclustering and multi-view data analytics using real-life disease data, the proposed approach identified clinical subtypes of a disease that differed from each other more significantly in the genetic markers, thus demonstrating the superior performance of the proposed approach. Conclusions: The proposed algorithm is an effective and superior alternative to the disease subtyping methods employed to date. Integration of phenotypic features with genetic markers in the subtyping analysis is a promising approach to identify concurrently disease subtypes and their genetic associations.
ACM Transactions on Intelligent Systems and Technology | 2013
Jinbo Bi; Jiangwen Sun; Yu Wu; Howard Tennen; Stephen Armeli
Alcohol misuse is one of the most serious public health problems facing adolescents and young adults in the United States. National statistics show that nearly 90% of alcohol consumed by youth under 21 years of age involves binge drinking and 44% of college students engage in high-risk drinking activities. Conventional alcohol intervention programs, which aim at instilling either an alcohol reduction norm or prohibition against underage drinking, have yielded little progress in controlling college binge drinking over the years. Existing alcohol studies are deductive: data are collected to investigate a psychological/behavioral hypothesis, and statistical analysis is applied to the data to confirm the hypothesis. Due to this confirmatory manner of analysis, the resulting statistical models are cohort-specific and typically fail to replicate on a different sample. This article presents two machine learning approaches for a secondary analysis of longitudinal data collected in college alcohol studies sponsored by the National Institute on Alcohol Abuse and Alcoholism. Our approach aims to discover knowledge, from multiwave cohort-sequential daily data, which may or may not align with the original hypothesis but yields predictive models that are more likely to generalize to new samples. We first propose a so-called temporally-correlated support vector machine to construct a classifier as a function of daily moods, stress, and drinking expectancies to distinguish days with nighttime binge drinking from days without for individual students. We then propose a combination of cluster analysis and feature selection, where cluster analysis is used to identify drinking patterns based on averaged daily drinking behavior and feature selection is used to identify risk factors associated with each pattern. We evaluate our methods on two cohorts totaling 530 college students recruited during the Spring and Fall semesters, respectively. Cross-validation on these two cohorts, and further on 100 random partitions of all students, demonstrates that our methods improve model generalizability in comparison with traditional multilevel logistic regression. The discovered risk factors and the interactions among them delineated in our models can provide a basis for, and offer insights into, the design of more effective college alcohol interventions.
Knowledge Discovery and Data Mining | 2013
Jiangwen Sun; Jinbo Bi; Henry R. Kranzler
Identifying genetic variation underlying a complex disease is important. Many complex diseases have heterogeneous phenotypes and are products of a variety of genetic and environmental factors acting in concert. Deriving highly heritable quantitative traits of a complex disease can improve the identification of genetic risk of the disease. The most sophisticated methods so far perform unsupervised cluster analysis on phenotypic features, and then derive a quantitative trait from each resultant cluster. Heritability is estimated to assess the validity of the derived quantitative traits. However, none of these methods explicitly maximizes the heritability of the derived traits. We propose a quadratic optimization approach that directly utilizes heritability as an objective during the derivation of quantitative traits of a disease. This method maximizes an objective function that is formulated by decomposing the traditional maximum likelihood method for estimating heritability of a quantitative trait. We demonstrate the effectiveness of the proposed method on both synthetic data and real-world problems. We apply our algorithm to identify highly heritable traits of complex human-behavior disorders including opioid and cocaine use disorders, and highly heritable traits of dairy cattle that are economically important. Our approach outperforms standard cluster analysis and several previous methods.
Knowledge Discovery and Data Mining | 2015
Tingyang Xu; Jiangwen Sun; Jinbo Bi
Longitudinal analysis is important in many disciplines, such as the study of behavioral transitions in social science. Only very recently has feature selection drawn adequate attention in the context of longitudinal modeling. Standard techniques, such as generalized estimating equations, have been modified to select features by imposing sparsity-inducing regularizers. However, they do not explicitly model how a dependent variable relies on features measured at proximal time points. Recent graphical Granger modeling can select features in lagged time points but ignores the temporal correlations within an individual's repeated measurements. We propose an approach to automatically and simultaneously determine both the relevant features and the relevant temporal points that impact the current outcome of the dependent variable. Meanwhile, the proposed model takes into account the non-i.i.d. nature of the data by estimating the within-individual correlations. This approach decomposes model parameters into a summation of two components and imposes separate block-wise LASSO penalties on each component when building a linear model in terms of the past τ measurements of features. One component is used to select features whereas the other is used to select temporal contingent points. An accelerated gradient descent algorithm is developed to efficiently solve the related optimization problem with detailed convergence analysis and asymptotic analysis. Computational results on both synthetic and real world problems demonstrate the superior performance of the proposed approach over existing techniques.
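A block-wise LASSO penalty, as named in the abstract above, is usually handled inside gradient-based solvers through a group soft-thresholding (proximal) step. The sketch below shows only that step, under assumptions that are not stated in the abstract: the parameters are laid out as a features-by-lags matrix, the feature-selecting component is penalized row by row, and the time-selecting component is penalized column by column. The function name `group_soft_threshold` and the example values are hypothetical.

```python
import numpy as np

def group_soft_threshold(M, lam, axis):
    """Proximal operator of lam * (sum of Euclidean norms of the blocks of M).

    axis=1 treats each row as one block (assumed here: one row per feature);
    axis=0 treats each column as one block (assumed here: one column per lag).
    Blocks whose norm is at most lam are set to zero as a whole.
    """
    norms = np.linalg.norm(M, axis=axis, keepdims=True)
    # Guard against division by zero for all-zero blocks.
    scale = np.maximum(0.0, 1.0 - lam / np.maximum(norms, 1e-12))
    return M * scale

# Component penalized by rows: the weak second feature is removed entirely.
A = np.array([[3.0, 4.0],
              [0.1, 0.1]])
A_shrunk = group_soft_threshold(A, lam=1.0, axis=1)

# Component penalized by columns: the weak second lag is removed entirely.
B = np.array([[3.0, 0.1],
              [4.0, 0.1]])
B_shrunk = group_soft_threshold(B, lam=1.0, axis=0)

# The combined parameter matrix is the sum of the two components.
W = A_shrunk + B_shrunk
```

Zeroing whole rows or whole columns, rather than individual entries, is what lets one component act as a feature selector and the other as a time-point selector.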
PLOS ONE | 2015
Jiangwen Sun; Henry R. Kranzler; Jinbo Bi
Multivariate phenotypes may be characterized collectively by a variety of low-level traits, such as in the diagnosis of a disease that relies on multiple disease indicators. Such multivariate phenotypes are often used in genetic association studies. Identifying highly heritable components of a multivariate phenotype can maximize the likelihood of finding genetic associations. Existing methods for phenotype refinement perform unsupervised cluster analysis on low-level traits and hence do not assess heritability. Existing heritable component analytics either cannot utilize general pedigrees or have to estimate the entire covariance matrix of low-level traits from limited samples, which leads to inaccurate estimates and is often computationally prohibitive. It is also difficult for these methods to exclude fixed effects from other covariates such as age, sex and race, in order to identify truly heritable components. We propose to search for a combination of low-level traits and directly maximize the heritability of this combined trait. A quadratic optimization problem is thus derived where the objective function is formulated by decomposing the traditional maximum likelihood method for estimating the heritability of a quantitative trait. The proposed approach can generate linearly-combined traits of high heritability that have been corrected for the fixed effects of covariates. The effectiveness of the proposed approach is demonstrated in simulations and by a case study of cocaine dependence. Our approach was computationally efficient and derived traits of higher heritability than those by other methods. Additional association analysis with the derived cocaine-use trait identified genetic markers that were replicated in an independent sample, further confirming the utility and advantage of the proposed approach.
Bioinformatics and Biomedicine | 2012
Jiangwen Sun; Jinbo Bi; Henry R. Kranzler
Identifying genetic variations that underlie human disease is very important to advance our understanding of disease pathophysiology and promote personalized treatment. However, many disease phenotypes have complex clinical manifestations and a complicated etiology. Gene finding efforts for complex diseases have had limited success to date. Research results suggest that one way to enhance these efforts is to differentiate subtypes of a complex multifactorial disease phenotype. Existing subtyping methods rely on cluster analysis using only clinical features of a disorder without guidance from genetic data, resulting in subtypes for which genotype association may be limited. In this work, we seek to derive a novel computational method based on multi-objective programming that is capable of clinically categorizing a disease phenotype so as to discover genetically different subtypes. Our approach optimizes two objectives: (1) the cluster-derived subtypes should differ significantly on clinical features; (2) these subtypes can be well separated using candidate genes. This work has been motivated by clinical studies of opioid dependence, a serious, prevalent disorder that is heterogeneous phenotypically. Analyses on a sample of 1,470 European American subjects aggregated from multiple genetic studies of opioid dependence show that the proposed algorithm is superior to existing subtyping methods.
Bioinformatics and Biomedicine | 2013
Jiangwen Sun; Jinbo Bi; Henry R. Kranzler
Complex disorders exhibit great heterogeneity in both clinical manifestation and genetic etiology. This heterogeneity substantially limits the identification of genotype-phenotype associations. Differentiating homogeneous subtypes of a complex phenotype will enable the detection of genetic variants contributing to the effect of subtypes that cannot be detected by the non-differentiated phenotype. However, the most sophisticated subtyping methods available so far perform unsupervised cluster analysis or latent class analysis on only phenotypic features. Without guidance from the genetic dimension, the resultant subtypes can be suboptimal and genetic associations may fail. We propose a multi-view biclustering approach that integrates phenotypic features and genetic markers to detect confirming evidence in the two views for a disease subtype. This approach groups subjects in clusters that are consistent between the phenotypic and genetic views, and simultaneously identifies the phenotypic features that are used to define a subtype and the genotypes that are associated with the subtype. Our simulation study validates this approach, and our extensive comparison with several biclustering and multi-view data analytics on real-life disease data demonstrates the superior performance of the proposed approach.
Bioinformatics | 2016
Jiangwen Sun; Z. Jiang; X.C. Tian; Jinbo Bi
Motivation: A growing number of studies have explored the process of pre-implantation embryonic development of multiple mammalian species. However, the conservation and variation among different species in their developmental programming are poorly defined due to the lack of effective computational methods for detecting co-regulated genes that are conserved across species. The most sophisticated method to date for identifying conserved co-regulated genes is a two-step approach. This approach first identifies gene clusters for each species by a cluster analysis of gene expression data, and subsequently computes the overlaps of clusters identified from different species to reveal common subgroups. This approach deals poorly with the noise in the expression data introduced by the complicated procedures in quantifying gene expression. Furthermore, due to the sequential nature of the approach, the gene clusters identified in the first step may have little overlap among different species in the second step, making it difficult to detect conserved co-regulated genes. Results: We propose a cross-species bi-clustering approach which first denoises the gene expression data of each species into a data matrix. The rows of the data matrices of different species represent the same set of genes that are characterized by their expression patterns over the developmental stages of each species as columns. A novel bi-clustering method is then developed to cluster genes into subgroups by a joint sparse rank-one factorization of all the data matrices. This method decomposes a data matrix into a product of a column vector and a row vector where the column vector is a consistent indicator across the matrices (species) to identify the same gene cluster and the row vector specifies for each species the developmental stages in which the clustered genes are co-regulated. An efficient optimization algorithm has been developed, with convergence analysis. This approach was first validated on synthetic data and compared to the two-step method and several recent joint clustering methods. We then applied this approach to two real world datasets of gene expression during the pre-implantation embryonic development of the human and mouse. Co-regulated genes consistent between the human and mouse were identified, offering insights into conserved functions, as well as similarities and differences in genome activation timing between the human and mouse embryos. Availability and Implementation: The R package containing the implementation of the proposed method in C++ is available at: https://github.com/JavonSun/mvbc.git and also at the R platform https://www.r-project.org/.
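The idea of a joint sparse rank-one factorization with a shared column vector can be sketched as follows. This is an illustrative heuristic (alternating updates with soft-thresholding), not the algorithm from the paper or the code in the authors' R package; the function names, the thresholds `lam_u` and `lam_v`, and the synthetic matrices are all assumptions.

```python
import numpy as np

def soft(x, lam):
    """Element-wise soft-thresholding, which sets small entries to zero."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def joint_sparse_rank_one(mats, lam_u, lam_v, n_iter=50):
    """Approximate every matrix X_s in mats by outer(u, v_s) with a shared u.

    All matrices must have the same rows (genes); columns (stages) may differ.
    """
    # Initialize the shared vector from the leading left singular vector
    # of the column-wise concatenation of all matrices.
    u = np.linalg.svd(np.hstack(mats), full_matrices=False)[0][:, 0]
    for _ in range(n_iter):
        # Per-matrix row vectors: sparse loadings over the columns of X_s.
        vs = [soft(X.T @ u, lam_v) for X in mats]
        # Shared column vector: sparse indicator over the common rows.
        u_new = soft(sum(X @ v for X, v in zip(mats, vs)), lam_u)
        norm = np.linalg.norm(u_new)
        if norm == 0.0:
            break  # thresholds too large; keep the previous u
        u = u_new / norm
    vs = [soft(X.T @ u, lam_v) for X in mats]
    return u, vs

# Synthetic example: rows 0-2 form one cluster that is active in columns 0-1
# of the first matrix and in columns 1-2 of the second matrix.
rng = np.random.default_rng(0)
u_true = np.array([1, 1, 1, 0, 0, 0, 0, 0], dtype=float)
X_a = np.outer(u_true, [2.0, 2.0, 0.0, 0.0]) + 0.01 * rng.standard_normal((8, 4))
X_b = np.outer(u_true, [0.0, 3.0, 3.0]) + 0.01 * rng.standard_normal((8, 3))
u, (v_a, v_b) = joint_sparse_rank_one([X_a, X_b], lam_u=2.0, lam_v=0.5)

rows_in_cluster = np.flatnonzero(u)       # rows selected by the shared vector
cols_a = np.flatnonzero(v_a)              # active columns in the first matrix
cols_b = np.flatnonzero(v_b)              # active columns in the second matrix
```

Sharing `u` across matrices is what ties the row cluster together across data sets, while each `v_s` is free to pick different columns; in the abstract's setting the rows would be genes and the columns developmental stages of each species.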