Yitan Zhu
NorthShore University HealthSystem
Publication
Featured research published by Yitan Zhu.
Nature Methods | 2014
Yitan Zhu; Peng Qiu; Yuan Ji
To the Editor: The Cancer Genome Atlas (TCGA) has been generating multi-modal genomics, epigenomics, and proteomics data for thousands of tumor samples across more than 20 types of cancer. While access to most level-1 and level-2 TCGA data is restricted, all level-3 TCGA data and some level-1 clinical data (e.g., survival and drug treatments) are publicly available. The public data include genome-wide measurements of different genetic characterizations, such as DNA copy number, DNA methylation, and mRNA expression for the same genes, providing unprecedented opportunities for systematic investigation of cancer mechanisms at multiple molecular and regulatory layers [1-3]. Few integrative data-mining tools for TCGA exist, partly because tools to acquire and assemble the large-scale TCGA data are lacking. Specifically, the level-3 TCGA data are stored as hundreds of thousands of sample- and platform-specific files, accessible through HTTP directories on the servers of the TCGA Data Coordinating Center (DCC) [4]. Navigating through all of the files manually is impossible. Although Firehose [5] assembles and publishes TCGA data, it does not share the program code used for data assembly. The community currently has no access to open-source data-retrieval tools for automatic and flexible data acquisition, which severely hinders progress in systematic data integration and reproducible computational analysis using TCGA data.
To meet these challenges, we introduce TCGA-Assembler, a software package that automates and streamlines the retrieval, assembly, and processing of public TCGA data. TCGA-Assembler gives users the ability to produce Firehose-type TCGA data with open-source, freely available program scripts. It opens a door for the development of data-mining and data-analysis tools that generate fully reproducible results, including data acquisition.
TCGA-Assembler consists of two modules (Fig. 1a), both written in R (http://www.r-project.org). Module A streamlines data downloading and quality checks, and module B processes the downloaded data for subsequent analyses (Supplementary Methods). In particular, module A takes advantage of the informative naming mechanism of the TCGA data file system (Supplementary Fig. 1) and applies a recursive algorithm to retrieve the URLs of all data files. By string matching on the URLs, module A allows users to download most TCGA public data (Supplementary Table 1) across genomic features and cancer types. For each genomics feature (such as gene expression from RNA-Seq), a data matrix combining multiple samples (Fig. 1b) is produced, with rows representing genomics units (such as genes) and columns representing samples. Module B provides convenient and important data preprocessing functions, such as mega-data assembly, data cleaning, and quantification of various measurements. Users interested in integrative analysis [6] require a mega data matrix (Fig. 1c) that matches different types of genomics measurements for the same genes across samples. Module B provides a function, "CombineMultiPlatformData", to fulfill this requirement (Supplementary Methods); it involves intricate data-matching steps to overcome the feature-labeling discrepancies caused by different lab protocols and biotechnologies in the experiments. Other data-processing functions are also provided to facilitate downstream analysis (Supplementary Methods).
Figure 1: TCGA-Assembler as a tool for acquiring, assembling, and processing public TCGA data. (a) Flowchart of TCGA-Assembler. Module A acquires data from TCGA DCC. Module B processes the obtained data using various functions. (b) Illustration of a data matrix ...
Other big data tools for TCGA are available [5, 7, 8]. In particular, level-3 TCGA data can also be obtained from Firehose [5] at the MIT Broad Institute in the same format as in Fig. 1b, one matrix for each cancer type and genomics platform. Module A of TCGA-Assembler not only provides the same type of data matrices but also distributes the R functions and associated programs that produce them. Equipped with the open-source tool, users can independently control which TCGA data are acquired locally and when. More importantly, quantitatively advanced users may integrate our open-source programs with downstream data analysis tools to realize reproducible and automated data analysis for TCGA. Unique to TCGA-Assembler is module B, which provides critical functions for data cleaning and processing. For example, the mega data table (Fig. 1c) can be obtained with a single function, behind which substantial effort has been directed to ensure the validity of the process, for example by checking and correcting gene symbol discrepancies. Lastly, TCGA-Assembler is fully compatible with Firehose in that the data processing functions in module B can directly process data files downloaded from Firehose. This compatibility is crucial to those who want to take advantage of both software pipelines.
TCGA-Assembler will remain freely available and open-source. In the future, more data processing and analysis functions will be added to TCGA-Assembler based on user feedback and new research needs. The authors request acknowledgment of the use of TCGA-Assembler in published works.
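The two ideas described in the letter (recursive retrieval of file URLs followed by string matching, and matching genes across platforms into one mega table) can be sketched in a few lines. The following is only an illustration in Python, not the R code of TCGA-Assembler: the in-memory directory tree, the function names `collect_urls`, `select_urls`, and `combine_platforms`, and the toy values are all hypothetical, and the sketch does not attempt the gene-symbol reconciliation that the real package performs.

```python
def collect_urls(tree, prefix=""):
    """Recursively walk a directory tree and return every file path.

    `tree` is a hypothetical stand-in for an HTTP directory listing:
    a dict is a directory, any other value is a file.
    """
    urls = []
    for name, node in sorted(tree.items()):
        path = f"{prefix}/{name}"
        if isinstance(node, dict):
            urls.extend(collect_urls(node, path))  # descend into subdirectory
        else:
            urls.append(path)
    return urls


def select_urls(urls, *keywords):
    """Keep only the URLs that contain every keyword (plain string matching)."""
    return [u for u in urls if all(k in u for k in keywords)]


def combine_platforms(platforms):
    """Build a toy mega table from {platform: {gene: {sample: value}}}.

    Only genes and samples present on every platform are kept. Rows are
    keyed by (gene, platform); columns follow the returned sample order.
    """
    gene_sets = [set(table) for table in platforms.values()]
    sample_sets = [
        {s for row in table.values() for s in row} for table in platforms.values()
    ]
    genes = sorted(set.intersection(*gene_sets))
    samples = sorted(set.intersection(*sample_sets))
    mega = {}
    for gene in genes:
        for platform in sorted(platforms):
            row = platforms[platform][gene]
            mega[(gene, platform)] = [row.get(s) for s in samples]
    return samples, mega


# Hypothetical directory layout; real TCGA paths differ.
demo_tree = {
    "brca": {
        "rnaseqv2": {"sampleA.genes.txt": "file", "sampleB.genes.txt": "file"},
        "methylation": {"sampleA.meth.txt": "file"},
    },
    "coad": {"rnaseqv2": {"sampleC.genes.txt": "file"}},
}
all_urls = collect_urls(demo_tree)
brca_rnaseq = select_urls(all_urls, "/brca/", "rnaseqv2")

# Toy measurements for two platforms.
platforms = {
    "expression": {"TP53": {"s1": 5.0, "s2": 7.5}, "BRCA1": {"s1": 2.0, "s2": 3.0}},
    "copy_number": {
        "TP53": {"s1": -0.4, "s2": 0.1, "s3": 0.0},
        "MYC": {"s1": 1.2, "s2": 0.9, "s3": 0.3},
    },
}
samples, mega = combine_platforms(platforms)
print(brca_rnaseq)
print(samples, mega)
```

Taking the intersection of genes and samples is one simple design choice for the sketch; a real pipeline may instead keep unmatched entries and mark them as missing.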
NPJ breast cancer | 2016
Hui Li; Yitan Zhu; Elizabeth S. Burnside; Erich Huang; Karen Drukker; Katherine A. Hoadley; Cheng Fan; Suzanne D. Conzen; Margarita L. Zuley; Jose M. Net; Elizabeth J. Sutton; Gary J. Whitman; Elizabeth A. Morris; Charles M. Perou; Yuan Ji; Maryellen L. Giger
Using quantitative radiomics, we demonstrate that computer-extracted magnetic resonance (MR) image-based tumor phenotypes can be predictive of the molecular classification of invasive breast cancers. Radiomics analysis was performed on 91 MRIs of biopsy-proven invasive breast cancers from National Cancer Institute’s multi-institutional TCGA/TCIA. Immunohistochemistry molecular classification was performed including estrogen receptor, progesterone receptor, human epidermal growth factor receptor 2, and for 84 cases, the molecular subtype (normal-like, luminal A, luminal B, HER2-enriched, and basal-like). Computerized quantitative image analysis included: three-dimensional lesion segmentation, phenotype extraction, and leave-one-case-out cross validation involving stepwise feature selection and linear discriminant analysis. The performance of the classifier model for molecular subtyping was evaluated using receiver operating characteristic analysis. The computer-extracted tumor phenotypes were able to distinguish between molecular prognostic indicators; area under the ROC curve values of 0.89, 0.69, 0.65, and 0.67 in the tasks of distinguishing between ER+ versus ER−, PR+ versus PR−, HER2+ versus HER2−, and triple-negative versus others, respectively. Statistically significant associations between tumor phenotypes and receptor status were observed. More aggressive cancers are likely to be larger in size with more heterogeneity in their contrast enhancement. Even after controlling for tumor size, a statistically significant trend was observed within each size group (P=0.04 for lesions ⩽2 cm; P=0.02 for lesions >2 to ⩽5 cm) as with the entire data set (P-value=0.006) for the relationship between enhancement texture (entropy) and molecular subtypes (normal-like, luminal A, luminal B, HER2-enriched, basal-like). 
In conclusion, computer-extracted image phenotypes show promise for high-throughput discrimination of breast cancer subtypes and may yield a quantitative predictive signature for advancing precision medicine.
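The evaluation scheme in this abstract (leave-one-case-out cross-validation scored by the area under the ROC curve) can be illustrated with a deliberately simplified sketch. The assumptions are stated plainly: the study used stepwise feature selection with linear discriminant analysis, whereas the code below swaps in a one-feature discriminant based on class means, and the feature values and labels are invented toy numbers, not study data.

```python
from statistics import mean


def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def leave_one_out_scores(features, labels):
    """Score each case with a model fitted on all the other cases.

    The model is a one-feature discriminant (a simplification, not the
    study's classifier): the signed distance from the midpoint of the two
    class means, oriented so that larger means "more positive".
    """
    scores = []
    for i, x in enumerate(features):
        train = [(f, y) for j, (f, y) in enumerate(zip(features, labels)) if j != i]
        mean_pos = mean([f for f, y in train if y == 1])
        mean_neg = mean([f for f, y in train if y == 0])
        midpoint = (mean_pos + mean_neg) / 2
        direction = 1.0 if mean_pos >= mean_neg else -1.0
        scores.append(direction * (x - midpoint))
    return scores


# Invented toy data: one image feature per case, 1 = receptor positive.
features = [1.0, 1.2, 0.9, 2.0, 2.2, 1.1, 2.5, 1.9]
labels = [0, 0, 0, 1, 1, 0, 1, 1]
cv_scores = leave_one_out_scores(features, labels)
print(auc(cv_scores, labels))
```

Because each case is scored by a model that never saw it, the resulting AUC estimates performance on unseen cases, which matters for a small data set such as the 91 cases described above.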
Scientific Reports | 2016
Yitan Zhu; Hui Li; Wentian Guo; Karen Drukker; Li Lan; Maryellen L. Giger; Yuan Ji
Magnetic Resonance Imaging (MRI) has been routinely used for the diagnosis and treatment of breast cancer. However, the relationship between the MRI tumor phenotypes and the underlying genetic mechanisms remains under-explored. We integrated multi-omics molecular data from The Cancer Genome Atlas (TCGA) with MRI data from The Cancer Imaging Archive (TCIA) for 91 breast invasive carcinomas. Quantitative MRI phenotypes of tumors (such as tumor size, shape, margin, and blood flow kinetics) were associated with their corresponding molecular profiles (including DNA mutation, miRNA expression, protein expression, pathway gene expression and copy number variation). We found that transcriptional activities of various genetic pathways were positively associated with tumor size, blurred tumor margin, and irregular tumor shape and that miRNA expressions were associated with the tumor size and enhancement texture, but not with other types of radiomic phenotypes. We provide all the association findings as a resource for the research community (available at http://compgenome.org/Radiogenomics/). These findings pave potential paths for the discovery of genetic mechanisms regulating specific tumor phenotypes and for improving MRI techniques as potential non-invasive approaches to probe the cancer molecular status.
Journal of medical imaging | 2015
Wentian Guo; Hui Li; Yitan Zhu; Li Lan; Shengjie Yang; Karen Drukker; Elizabeth A. Morris; Elizabeth S. Burnside; Gary J. Whitman; Maryellen L. Giger; Yuan Ji
Abstract. Genomic and radiomic imaging profiles of invasive breast carcinomas from The Cancer Genome Atlas and The Cancer Imaging Archive were integrated and a comprehensive analysis was conducted to predict clinical outcomes using the radiogenomic features. Variable selection via LASSO and logistic regression were used to select the most-predictive radiogenomic features for the clinical phenotypes, including pathological stage, lymph node metastasis, and status of estrogen receptor (ER), progesterone receptor (PR), and human epidermal growth factor receptor 2 (HER2). Cross-validation with receiver operating characteristic (ROC) analysis was performed and the area under the ROC curve (AUC) was employed as the prediction metric. Higher AUCs were obtained in the prediction of pathological stage, ER, and PR status than for lymph node metastasis and HER2 status. Overall, the prediction performances by genomics alone, radiomics alone, and combined radiogenomics features showed statistically significant correlations with clinical outcomes; however, improvement on the prediction performance by combining genomics and radiomics data was not found to be statistically significant, most likely due to the small sample size of 91 cancer cases with 38 radiomic features and 144 genomic features.
Journal of the American Statistical Association | 2013
Juhee Lee; Peter Müller; Yitan Zhu; Yuan Ji
We propose a nonparametric Bayesian local clustering (NoB-LoC) approach for heterogeneous data. NoB-LoC implements inference for nested clusters as posterior inference under a Bayesian model. Using protein expression data as an example, the NoB-LoC model defines a protein (column) cluster as a set of proteins that give rise to the same partition of the samples (rows). In other words, the sample partitions are nested within protein clusters. The common clustering of the samples gives meaning to the protein clusters. Any pair of samples might belong to the same cluster for one protein set but to different clusters for another protein set. These local features are different from features obtained by global clustering approaches such as hierarchical clustering, which create only one partition of samples that applies for all the proteins in the dataset. In addition, the NoB-LoC model is different from most other local or nested clustering methods, which define clusters based on common parameters in the sampling model. As an added and important feature, the NoB-LoC method probabilistically excludes sets of irrelevant proteins and samples that do not meaningfully cocluster with other proteins and samples, thus improving the inference on the clustering of the remaining proteins and samples. Inference is guided by a joint probability model for all the random elements. We provide a simulation study and a motivating example to demonstrate the unique features of the NoB-LoC model. Supplementary materials for this article are available online.
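The nested structure described in this abstract can be made concrete with a small data-structure sketch. It only represents a possible clustering result and says nothing about the posterior inference itself; the class name `ProteinCluster`, the function `co_clustered`, and the toy labels are hypothetical, with `None` standing in for a sample excluded from a partition.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ProteinCluster:
    """A set of proteins together with the sample partition they induce."""
    proteins: List[str]
    # sample name -> cluster label; None marks a sample left out of this partition
    sample_partition: Dict[str, Optional[int]]


def co_clustered(cluster, sample_a, sample_b):
    """True if both samples are active and share a label within this protein set."""
    a = cluster.sample_partition.get(sample_a)
    b = cluster.sample_partition.get(sample_b)
    return a is not None and a == b


# Toy result: the same samples are partitioned differently by each protein set.
set_one = ProteinCluster(["P1", "P2"], {"s1": 0, "s2": 0, "s3": 1, "s4": None})
set_two = ProteinCluster(["P3"], {"s1": 0, "s2": 1, "s3": 1, "s4": 1})

print(co_clustered(set_one, "s1", "s2"))
print(co_clustered(set_two, "s1", "s2"))
```

Here samples s1 and s2 share a cluster under the first protein set but not under the second, which is the local behavior that a single global partition cannot express.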
Cancer | 2016
Elizabeth S. Burnside; Karen Drukker; Hui Li; Ermelinda Bonaccio; Margarita L. Zuley; Marie A. Ganott; Jose M. Net; Elizabeth J. Sutton; Kathleen R. Brandt; Gary J. Whitman; Suzanne D. Conzen; Li Lan; Yuan Ji; Yitan Zhu; C. Carl Jaffe; Erich P. Huang; John Freymann; Justin S. Kirby; Elizabeth A. Morris; Maryellen L. Giger
The objective of this study was to demonstrate that computer‐extracted image phenotypes (CEIPs) of biopsy‐proven breast cancer on magnetic resonance imaging (MRI) can accurately predict pathologic stage.
Nucleic Acids Research | 2016
Subhajit Sengupta; Kamalakar Gulukota; Yitan Zhu; Carole Ober; Katherine Naughton; William Wentworth-Sheilds; Yuan Ji
Somatic mosaicism refers to the existence of somatic mutations in a fraction of somatic cells in a single biological sample. Its importance has mainly been discussed in theory although experimental work has started to emerge linking somatic mosaicism to disease diagnosis. Through novel statistical modeling of paired-end DNA-sequencing data using blood-derived DNA from healthy donors as well as DNA from tumor samples, we present an ultra-fast computational pipeline, LocHap that searches for multiple single nucleotide variants (SNVs) that are scaffolded by the same reads. We refer to scaffolded SNVs as local haplotypes (LH). When an LH exhibits more than two genotypes, we call it a local haplotype variant (LHV). The presence of LHVs is considered evidence of somatic mosaicism because a genetically homogeneous cell population will not harbor LHVs. Applying LocHap to whole-genome and whole-exome sequence data in DNA from normal blood and tumor samples, we find wide-spread LHVs across the genome. Importantly, we find more LHVs in tumor samples than in normal samples, and more in older adults than in younger ones. We confirm the existence of LHVs and somatic mosaicism by validation studies in normal blood samples. LocHap is publicly available at http://www.compgenome.org/lochap.
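The definition of a local haplotype variant can be illustrated with a counting sketch: collect the alleles that each read carries at a set of nearby SNV positions, and flag the locus when more than two distinct allele combinations appear. This is not LocHap's statistical model. The function names, the `min_reads` support threshold, and the toy reads are assumptions made for illustration, and the sketch ignores sequencing error, base quality, and read pairing.

```python
from collections import Counter


def local_haplotypes(reads, snv_positions):
    """Count allele combinations over reads that span every SNV position.

    `reads` is a list of (start, sequence) pairs in reference coordinates;
    reads that miss any of the positions are skipped.
    """
    counts = Counter()
    for start, seq in reads:
        end = start + len(seq)
        if all(start <= p < end for p in snv_positions):
            counts["".join(seq[p - start] for p in snv_positions)] += 1
    return counts


def is_local_haplotype_variant(counts, min_reads=2):
    """Flag the locus if more than two combinations have enough read support.

    `min_reads` is a hypothetical threshold added here to avoid counting
    combinations seen in a single read.
    """
    supported = [h for h, c in counts.items() if c >= min_reads]
    return len(supported) > 2


# Toy reads around two SNVs at positions 102 and 105.
snvs = [102, 105]
reads = [
    (100, "ACGTACGT"), (100, "ACGTACGT"),  # combination GC
    (100, "ACTTAAGT"), (100, "ACTTAAGT"),  # combination TA
    (100, "ACGTAAGT"), (100, "ACGTAAGT"),  # combination GA
    (103, "TACGT"),                        # does not cover position 102
]
counts = local_haplotypes(reads, snvs)
print(counts, is_local_haplotype_variant(counts))
```

A genetically homogeneous diploid population can show at most two combinations at such a locus, so a third well-supported combination is the signal of interest.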
Bioinformatics | 2018
Lin Wei; Zhilin Jin; Shengjie Yang; Yanxun Xu; Yitan Zhu; Yuan Ji
Motivation: The Cancer Genome Atlas (TCGA) program has produced huge amounts of cancer genomics data, providing unprecedented opportunities for research. In 2014, we developed TCGA-Assembler, a software pipeline for retrieval and processing of public TCGA data. In 2016, TCGA data were transferred from the TCGA data portal to the Genomic Data Commons (GDC), which is supported by a different set of data storage and retrieval mechanisms. In addition, new proteomics data of TCGA samples have been generated by the Clinical Proteomic Tumor Analysis Consortium (CPTAC) program, which were not available for downloading through TCGA-Assembler. It is desirable to acquire and integrate data from both GDC and CPTAC. Results: We develop TCGA-Assembler 2 (TA2) to automatically download and integrate data from GDC and CPTAC. We make substantial improvements to the functionality of TA2 to enhance user experience and software performance. TA2, together with its previous version, has helped more than 2000 researchers from 64 countries to access and utilize TCGA and CPTAC data in their research. Availability of TA2 will continue to allow existing and new users to conduct reproducible research based on TCGA and CPTAC data. Availability and implementation: http://www.compgenome.org/TCGA-Assembler/ or https://github.com/compgenome365/TCGA-Assembler-2. Supplementary information: Supplementary data are available at Bioinformatics online.
Cancer Informatics | 2014
Yanxun Xu; Yitan Zhu; Peter Müller; Riten Mitra; Yuan Ji
The Cancer Genome Atlas (TCGA) generates comprehensive genomic data for thousands of patients over more than 20 cancer types. TCGA data are typically whole-genome measurements of multiple genomic features, such as DNA copy numbers, DNA methylation, and gene expression, providing unique opportunities for investigating cancer mechanisms at multiple molecular and regulatory layers. We propose a Bayesian graphical model to systematically integrate multi-platform TCGA data for inference of the interactions between different genomic features, either within a gene or between multiple genes. The presence or absence of edges in the graph indicates the presence or absence of conditional dependence between genomic features. The inference is restricted to genes within a known biological network, but can be extended to any set of genes. Applying the model to the same genes using patient samples in two different cancer types, we identify network components that are common as well as different between cancer types. The examples and code are available at https://www.ma.utexas.edu/users/yxu/software.html.
Biometrics | 2018
Yang Ni; Peter Müller; Yitan Zhu; Yuan Ji
We develop novel hierarchical reciprocal graphical models to infer gene networks from heterogeneous data. In the case of data that can be naturally divided into known groups, we propose to connect graphs by introducing a hierarchical prior across group-specific graphs, including a correlation on edge strengths across graphs. Thresholding priors are applied to induce sparsity of the estimated networks. In the case of unknown groups, we cluster subjects into subpopulations and jointly estimate cluster-specific gene networks, again using similar hierarchical priors across clusters. We illustrate the proposed approach by simulation studies and three applications with multiplatform genomic data for multiple cancers.