Publication


Featured research published by Xingjie Shi.


Briefings in Bioinformatics | 2015

Combining multidimensional genomic measurements for predicting cancer prognosis: observations from TCGA

Qing Zhao; Xingjie Shi; Yang Xie; Jian Huang; BenChang Shia; Shuangge Ma

With accumulating research on the interconnections among different types of genomic regulations, researchers have found that multidimensional genomic studies outperform one-dimensional studies in multiple aspects. Among many sources of multidimensional genomic data, The Cancer Genome Atlas (TCGA) provides the public with comprehensive profiling data on >30 cancer types, making it an ideal test bed for conducting and comparing different analyses. In this article, the analysis goal is to apply several existing methods and associate multidimensional genomic measurements with cancer outcomes, in particular prognosis, with a special focus on the predictive power of genomic signatures. We exploit clinical data and four types of genomic measurements (mRNA gene expression, DNA methylation, microRNA and copy number alterations) for breast invasive carcinoma, glioblastoma multiforme, acute myeloid leukemia and lung squamous cell carcinoma collected by TCGA. To accommodate the high dimensionality, we extract important features using Principal Component Analysis, Partial Least Squares and Least Absolute Shrinkage and Selection Operator (Lasso), which are representative of dimension reduction and variable selection techniques and have been extensively adopted, and fit Cox survival models with combined important features. We calibrate the predictive power of each type of genomic measurement for the prognosis of four cancer types and find that the results vary across cancers. Our analysis also suggests that for most of the cancers in our study and the adopted methods, there is no substantial improvement in prediction when adding other genomic measurements after gene expression and clinical covariates have been included in the model. This is consistent with the findings that molecular features measured at the transcription level affect clinical outcomes more directly than those measured at the DNA/epigenetic level.


Genetic Epidemiology | 2014

A penalized robust method for identifying gene-environment interactions

Xingjie Shi; Jin Liu; Jian Huang; Yong Zhou; Yang Xie; Shuangge Ma

In high‐throughput studies, an important objective is to identify gene–environment interactions associated with disease outcomes and phenotypes. Many commonly adopted methods assume specific parametric or semiparametric models, which may be subject to model misspecification. In addition, they usually use the significance level as the criterion for selecting important interactions. In this study, we adopt rank‐based estimation, which is much less sensitive to model specification than some of the existing methods and includes several commonly encountered data and models as special cases. Penalization is adopted for the identification of gene–environment interactions. It achieves simultaneous estimation and identification and does not rely on a significance level. For computational feasibility, a smoothed rank estimation is further proposed. Simulation shows that under certain scenarios, for example, with contaminated or heavy‐tailed data, the proposed method can significantly outperform the existing alternatives with more accurate identification. We analyze a lung cancer prognosis study with gene expression measurements under the AFT (accelerated failure time) model. The proposed method identifies interactions different from those using the alternatives. Some of the identified genes have important implications.
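As a rough illustration of the smoothing idea mentioned in this abstract, the sketch below computes a rank-correlation objective for uncensored data and a sigmoid-smoothed counterpart that is differentiable in the coefficients. The function names, the choice of a sigmoid smoother, and the bandwidth `sigma` are assumptions made for illustration only; the paper's estimator also handles censoring and penalization, which are not reproduced here.

```python
import math


def _sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def _scores(beta, X):
    """Linear scores x_i' beta for every row of X."""
    return [sum(b * x for b, x in zip(beta, row)) for row in X]


def rank_objective(beta, X, y):
    """Share of ordered pairs (i, j) with y[i] > y[j] and score[i] > score[j]."""
    n = len(y)
    s = _scores(beta, X)
    total = 0
    for i in range(n):
        for j in range(n):
            if i != j and y[i] > y[j] and s[i] > s[j]:
                total += 1
    return total / (n * (n - 1))


def smoothed_rank_objective(beta, X, y, sigma=0.1):
    """Same objective with the score indicator replaced by a sigmoid.

    sigma is a hypothetical bandwidth: smaller values track the indicator
    more closely, larger values give a smoother surface.
    """
    n = len(y)
    s = _scores(beta, X)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i != j and y[i] > y[j]:
                total += _sigmoid((s[i] - s[j]) / sigma)
    return total / (n * (n - 1))
```

Because the objective depends on the outcome only through its ordering, a few extreme outcome values change it very little, which is the intuition behind the robustness claim.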


Briefings in Bioinformatics | 2014

Similarity of markers identified from cancer gene expression studies: observations from GEO

Xingjie Shi; Shihao Shen; Jin Liu; Jian Huang; Yong Zhou; Shuangge Ma

Gene expression profiling has been extensively conducted in cancer research. The analysis of multiple independent cancer gene expression datasets may provide additional information and complement single-dataset analysis. In this study, we conduct multi-dataset analysis and are interested in evaluating the similarity of cancer-associated genes identified from different datasets. The first objective of this study is to briefly review some statistical methods that can be used for such evaluation. Both marginal analysis and joint analysis methods are reviewed. The second objective is to apply those methods to 26 Gene Expression Omnibus (GEO) datasets on five types of cancers. Our analysis suggests that for the same cancer, the marker identification results may vary significantly across datasets, and different datasets share few common genes. In addition, datasets on different cancers share few common genes. The shared genetic basis of datasets on the same or different cancers, which has been suggested in the literature, is not observed in the analysis of GEO data.


Bioinformatics | 2015

Deciphering the associations between gene expression and copy number alteration using a sparse double Laplacian shrinkage approach

Xingjie Shi; Qing Zhao; Jian Huang; Yang Xie; Shuangge Ma

MOTIVATION: Both gene expression levels (GEs) and copy number alterations (CNAs) have important biological implications. GEs are partly regulated by CNAs, and much effort has been devoted to understanding their relations. The regulation analysis is challenging with one gene expression possibly regulated by multiple CNAs and one CNA potentially regulating the expressions of multiple genes. The correlations among GEs and among CNAs make the analysis even more complicated. The existing methods have limitations and cannot comprehensively describe the regulation. RESULTS: A sparse double Laplacian shrinkage method is developed. It jointly models the effects of multiple CNAs on multiple GEs. Penalization is adopted to achieve sparsity and identify the regulation relationships. Network adjacency is computed to describe the interconnections among GEs and among CNAs. Two Laplacian shrinkage penalties are imposed to accommodate the network adjacency measures. Simulation shows that the proposed method outperforms the competing alternatives with more accurate marker identification. The Cancer Genome Atlas data are analysed to further demonstrate advantages of the proposed method. AVAILABILITY AND IMPLEMENTATION: R code is available at http://works.bepress.com/shuangge/49/.
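The Laplacian shrinkage idea can be illustrated with the standard graph Laplacian quadratic form. For a symmetric adjacency matrix A with zero diagonal and degree matrix D, the quantity β'(D − A)β equals the sum over pairs j < k of a_jk (β_j − β_k)^2, so coefficients of strongly connected variables are pulled toward each other. The sketch below is a minimal illustration under that assumption; `laplacian_penalty` and `pairwise_form` are hypothetical names, and the paper's exact penalty, adjacency definition, and tuning are not reproduced.

```python
def laplacian_penalty(beta, adjacency):
    """Return beta' (D - A) beta for a symmetric adjacency matrix A.

    D is the diagonal degree matrix whose j-th entry is the j-th row sum of A.
    """
    p = len(beta)
    degree = [sum(adjacency[j]) for j in range(p)]
    value = 0.0
    for j in range(p):
        value += degree[j] * beta[j] ** 2
        for k in range(p):
            value -= adjacency[j][k] * beta[j] * beta[k]
    return value


def pairwise_form(beta, adjacency):
    """Equivalent form: sum over j < k of a_jk * (beta_j - beta_k)^2."""
    p = len(beta)
    return sum(
        adjacency[j][k] * (beta[j] - beta[k]) ** 2
        for j in range(p)
        for k in range(j + 1, p)
    )
```

The pairwise form makes the effect visible: the penalty is zero when connected coefficients are equal and grows with the adjacency weight of any pair whose coefficients differ.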


Genetic Epidemiology | 2014

Integrative Analysis of High-throughput Cancer Studies With Contrasted Penalization

Xingjie Shi; Jin Liu; Jian Huang; Yong Zhou; Benchang Shia; Shuangge Ma

In cancer studies with high‐throughput genetic and genomic measurements, integrative analysis provides a way to effectively pool and analyze heterogeneous raw data from multiple independent studies and outperforms “classic” meta‐analysis and single‐dataset analysis. When marker selection is of interest, the genetic basis of multiple datasets can be described using the homogeneity model or the heterogeneity model. In this study, we consider marker selection under the heterogeneity model, which includes the homogeneity model as a special case and can be more flexible. Penalization methods have been developed in the literature for marker selection. This study advances beyond the published ones by introducing contrast penalties, which can accommodate the within‐ and across‐dataset structures of covariates/regression coefficients and, by doing so, further improve marker selection performance. Specifically, we develop a penalization method that accommodates the across‐dataset structures by smoothing over regression coefficients. An effective iterative algorithm, which calls an inner coordinate descent iteration, is developed. Simulation shows that the proposed method outperforms the benchmark with more accurate marker identification. The analysis of breast cancer and lung cancer prognosis studies with gene expression measurements shows that the proposed method identifies genes different from those using the benchmark and has better prediction performance.
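To make the coordinate descent step mentioned in this abstract concrete, here is a minimal sketch for the plain Lasso with squared-error loss, minimizing (1/(2n))‖y − Xβ‖² + λ‖β‖₁. This is a stand-in chosen for illustration, not the contrasted penalty or the outer iterative algorithm from the paper; the function names and the fixed number of sweeps are assumptions.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam; values within [-lam, lam] become 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_sweeps=100):
    """Cyclic coordinate descent for (1/(2n))||y - X b||^2 + lam * ||b||_1."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    residual = list(y)  # equals y - X beta while beta is all zeros
    for _ in range(n_sweeps):
        for j in range(p):
            col = [X[i][j] for i in range(n)]
            norm_sq = sum(v * v for v in col) / n
            if norm_sq == 0.0:
                continue  # an all-zero feature cannot be updated
            # Correlation of feature j with the partial residual, i.e. the
            # residual with feature j's current contribution added back.
            rho = sum(col[i] * residual[i] for i in range(n)) / n
            rho += norm_sq * beta[j]
            new_bj = soft_threshold(rho, lam) / norm_sq
            delta = new_bj - beta[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= delta * col[i]
                beta[j] = new_bj
    return beta
```

Each coordinate update has a closed form, which is why coordinate descent is a common inner loop for penalized marker selection; weak features are set exactly to zero by the soft-thresholding step.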


Genomics | 2016

Integrated analysis of multidimensional omics data on cutaneous melanoma prognosis

Yu Jiang; Xingjie Shi; Qing Zhao; Michael Krauthammer; Bonnie E. Gould Rothberg; Shuangge Ma

Multiple types of genetic, epigenetic, and genomic changes have been implicated in cutaneous melanoma prognosis. Many of the existing studies are limited to analyzing a single type of omics measurement and cannot comprehensively describe the biological processes underlying prognosis. As a result, the obtained prognostic models may be less satisfactory, and the identified prognostic markers may be less informative. The recently collected TCGA (The Cancer Genome Atlas) data are of high quality and include comprehensive omics measurements, making it possible to model prognosis more comprehensively and more accurately. In this study, we first describe the statistical approaches that can integrate multiple types of omics measurements with the assistance of variable selection and dimension reduction techniques. Data analysis suggests that, for cutaneous melanoma, integrating multiple types of measurements leads to prognostic models with an improved prediction performance. Informative individual markers and pathways are identified, which can provide valuable insights into melanoma prognosis.


Statistics in Medicine | 2015

A penalized robust semiparametric approach for gene-environment interactions

Cen Wu; Xingjie Shi; Yuehua Cui; Shuangge Ma

In genetic and genomic studies, gene-environment (G×E) interactions have important implications. Some of the existing G×E interaction methods are limited by analyzing a small number of G factors at a time, by assuming linear effects of E factors, by assuming no data contamination, and by adopting ineffective selection techniques. In this study, we propose a new approach for identifying important G×E interactions. It jointly models the effects of all E and G factors and their interactions. A partially linear varying coefficient model is adopted to accommodate possible nonlinear effects of E factors. A rank-based loss function is used to accommodate possible data contamination. Penalization, which has been extensively used with high-dimensional data, is adopted for selection. The proposed penalized estimation approach can automatically determine if a G factor has an interaction with an E factor, a main effect but no interaction, or no effect at all. The proposed approach can be effectively realized using a coordinate descent algorithm. Simulation shows that it has satisfactory performance and outperforms several competing alternatives. The proposed approach is used to analyze a lung cancer study with gene expression measurements and clinical variables.


Genetic Epidemiology | 2016

Analyzing Association Mapping in Pedigree-Based GWAS Using a Penalized Multitrait Mixed Model

Jin Liu; Can Yang; Xingjie Shi; Cong Li; Jian Huang; Hongyu Zhao; Shuangge Ma

Genome‐wide association studies (GWAS) have led to the identification of many genetic variants associated with complex diseases in the past 10 years. Penalization methods, with significant numerical and statistical advantages, have been extensively adopted in analyzing GWAS. This study has been partly motivated by the analysis of Genetic Analysis Workshop (GAW) 18 data, which have two notable characteristics. First, the subjects are from a small number of pedigrees and hence related. Second, for each subject, multiple correlated traits have been measured. Most of the existing penalization methods assume independence between subjects and traits and can be suboptimal. There are a few methods in the literature based on mixed modeling that can accommodate correlations. However, they cannot fully accommodate the two types of correlations while conducting effective marker selection. In this study, we develop a penalized multitrait mixed modeling approach. It accommodates the two different types of correlations and includes several existing methods as special cases. Effective penalization is adopted for marker selection. Simulation demonstrates its satisfactory performance. The GAW 18 data are analyzed using the proposed method.


Genetic Epidemiology | 2017

Analysis of cancer gene expression data with an assisted robust marker identification approach

Hao Chai; Xingjie Shi; Qingzhao Zhang; Qing Zhao; Yuan Huang; Shuangge Ma

Gene expression (GE) studies have been playing a critical role in cancer research. Despite tremendous effort, the analysis results are still often unsatisfactory, because of the weak signals and high data dimensionality. Analysis is often further challenged by the long‐tailed distributions of the outcome variables. In recent multidimensional studies, data have been collected on GEs as well as their regulators (e.g., copy number alterations (CNAs), methylation, and microRNAs), which can provide additional information on the associations between GEs and cancer outcomes. In this study, we develop an ARMI (assisted robust marker identification) approach for analyzing cancer studies with measurements on GEs as well as regulators. The proposed approach borrows information from regulators and can be more effective than analyzing GE data alone. A robust objective function is adopted to accommodate long‐tailed distributions. Marker identification is effectively realized using penalization. The proposed approach has an intuitive formulation and is computationally affordable. Simulation shows its satisfactory performance under a variety of settings. TCGA (The Cancer Genome Atlas) data on melanoma and lung cancer are analyzed, which leads to biologically plausible marker identification and superior prediction.


Computational Statistics & Data Analysis | 2018

A Forward and Backward Stagewise algorithm for nonconvex loss functions with adaptive Lasso

Xingjie Shi; Yuan Huang; Jian Huang; Shuangge Ma

Penalization is a popular tool for multi- and high-dimensional data. Most of the existing computational algorithms have been developed for convex loss functions. Nonconvex loss functions can sometimes generate more robust results and have important applications. Motivated by the BLasso algorithm, this study develops the Forward and Backward Stagewise (Fabs) algorithm for nonconvex loss functions with the adaptive Lasso (aLasso) penalty. It is shown that each point along the Fabs paths is a δ-approximate solution to the aLasso problem and the Fabs paths converge to the stationary points of the aLasso problem when δ goes to zero, given that the loss function has second-order derivatives bounded from above. This study exemplifies the Fabs with an application to the penalized smooth partial rank (SPR) estimation, for which an effective algorithm is still lacking. Extensive numerical studies are conducted to demonstrate the benefit of penalized SPR estimation using Fabs, especially under high-dimensional settings. Application to the smoothed 0-1 loss in binary classification is introduced to demonstrate its capability to work with other differentiable nonconvex loss functions.
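The forward and backward stagewise idea described in this abstract can be sketched in a few lines. The code below is a simplified, BLasso-style illustration with a plain L1 penalty and a user-supplied loss function; it is not the authors' Fabs implementation, it omits the adaptive Lasso weights, and the names `stagewise_path`, `eps`, and `xi` are assumptions for this sketch.

```python
def stagewise_path(loss, p, eps=0.1, xi=1e-6, max_steps=200):
    """Trace a forward/backward stagewise path for loss(beta) + lam*||beta||_1.

    Coefficients are stored as integer multiples of eps so that a backward
    step can return a coefficient exactly to zero. Returns a list of
    (lam, beta) pairs.
    """
    steps = [0] * p  # beta[j] = steps[j] * eps

    def beta_of(k):
        return [v * eps for v in k]

    def shifted(k, j, d):
        out = list(k)
        out[j] += d
        return out

    def best_forward(k):
        # Try moving each coordinate by +eps or -eps; keep the lowest loss.
        return min((loss(beta_of(shifted(k, j, d))), j, d)
                   for j in range(p) for d in (1, -1))

    # The initial forward step also fixes the starting value of lam.
    start_loss = loss(beta_of(steps))
    new_loss, j, d = best_forward(steps)
    lam = (start_loss - new_loss) / eps
    if lam <= 0:
        return []
    steps = shifted(steps, j, d)
    path = [(lam, beta_of(steps))]

    for _ in range(max_steps):
        current = loss(beta_of(steps))
        # Backward step: shrink one active coefficient toward zero if that
        # lowers the penalized objective by more than the tolerance xi.
        backward = [
            (loss(beta_of(shifted(steps, j, -1 if steps[j] > 0 else 1))), j)
            for j in range(p) if steps[j] != 0
        ]
        if backward:
            back_loss, j = min(backward)
            if back_loss - current - lam * eps < -xi:
                steps = shifted(steps, j, -1 if steps[j] > 0 else 1)
                path.append((lam, beta_of(steps)))
                continue
        # Forward step: otherwise take the best eps-move and relax lam.
        new_loss, j, d = best_forward(steps)
        new_lam = min(lam, (current - new_loss) / eps)
        if new_lam <= 0:
            break  # no eps-move decreases the loss any further
        lam = new_lam
        steps = shifted(steps, j, d)
        path.append((lam, beta_of(steps)))
    return path
```

Only loss evaluations are needed, which is why this style of algorithm extends naturally to losses where closed-form coordinate updates are unavailable. For example, calling `stagewise_path(lambda b: (b[0] - 1.0) ** 2 + b[1] ** 2, 2)` traces a path along which the penalty level never increases and only the first coefficient becomes active.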

Collaboration


Dive into Xingjie Shi's collaborations.

Top Co-Authors

Jin Liu

National University of Singapore

Yang Xie

University of Texas Southwestern Medical Center

Yong Zhou

Shanghai University of Finance and Economics

Can Yang

Hong Kong Baptist University
