Jie Tan | Researchain

Publication

Featured research published by Jie Tan.

Journal of Cellular Physiology | 2014

Big Data Bioinformatics

Casey S. Greene; Jie Tan; Matthew Ung; Jason H. Moore; Chao Cheng

Recent technological advances allow for high throughput profiling of biological systems in a cost‐efficient manner. The low cost of data generation is leading us to the “big data” era. The availability of big data provides unprecedented opportunities but also raises new challenges for data mining and analysis. In this review, we introduce key concepts in the analysis of big data, including both “machine learning” algorithms as well as “unsupervised” and “supervised” examples of each. We note packages for the R programming language that are available to perform machine learning analyses. In addition to programming based solutions, we review webservers that allow users with limited or no programming background to perform these analyses on large data compendia. J. Cell. Physiol. 229: 1896–1900, 2014.

Pacific Symposium on Biocomputing | 2014

Unsupervised Feature Construction and Knowledge Extraction from Genome-Wide Assays of Breast Cancer with Denoising Autoencoders

Jie Tan; Matthew Ung; Chao Cheng; Casey S. Greene

Big data bring new opportunities for methods that efficiently summarize and automatically extract knowledge from such compendia. While both supervised learning algorithms and unsupervised clustering algorithms have been successfully applied to biological data, they are either dependent on known biology or limited to discerning the most significant signals in the data. Here we present denoising autoencoders (DAs), which employ a data-defined learning objective independent of known biology, as a method to identify and extract complex patterns from genomic data. We evaluate the performance of DAs by applying them to a large collection of breast cancer gene expression data. Results show that DAs successfully construct features that contain both clinical and molecular information. There are features that represent tumor or normal samples, estrogen receptor (ER) status, and molecular subtypes. Features constructed by the autoencoder generalize to an independent dataset collected using a distinct experimental platform. By integrating data from ENCODE for feature interpretation, we discover a feature representing ER status through association with key transcription factors in breast cancer. We also identify a feature highly predictive of patient survival and it is enriched by FOXM1 signaling pathway. The features constructed by DAs are often bimodally distributed with one peak near zero and another near one, which facilitates discretization. In summary, we demonstrate that DAs effectively extract key biological principles from gene expression data and summarize them into constructed features with convenient properties.
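The core idea of a denoising autoencoder can be illustrated with a minimal sketch: corrupt the input, encode it into a small number of hidden nodes, and train the network to reconstruct the clean input. The code below is not the authors' implementation. The class name, tied weights, masking noise, sigmoid activations, cross-entropy gradient, and all hyperparameters are assumptions chosen for illustration, and inputs are assumed to be scaled to the range 0 to 1.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class DenoisingAutoencoder:
    """Tiny tied-weight denoising autoencoder (illustrative sketch only)."""

    def __init__(self, n_visible, n_hidden, seed=0):
        self.rng = random.Random(seed)
        # W[i][j] links input i to hidden node j; decoding reuses W (tied weights).
        self.W = [[self.rng.uniform(-0.1, 0.1) for _ in range(n_hidden)]
                  for _ in range(n_visible)]
        self.b_hidden = [0.0] * n_hidden
        self.b_visible = [0.0] * n_visible

    def corrupt(self, x, noise_level):
        # Masking noise: zero each input value with probability noise_level.
        return [0.0 if self.rng.random() < noise_level else v for v in x]

    def encode(self, x):
        return [sigmoid(sum(x[i] * self.W[i][j] for i in range(len(x))) + b)
                for j, b in enumerate(self.b_hidden)]

    def decode(self, h):
        return [sigmoid(sum(h[j] * self.W[i][j] for j in range(len(h))) + b)
                for i, b in enumerate(self.b_visible)]

    def train_step(self, x, noise_level=0.1, lr=0.5):
        """One SGD step: reconstruct the clean x from a corrupted copy."""
        x_noisy = self.corrupt(x, noise_level)
        h = self.encode(x_noisy)
        x_hat = self.decode(h)
        # Cross-entropy loss with sigmoid outputs gives this output-layer gradient.
        d_out = [x_hat[i] - x[i] for i in range(len(x))]
        # Backpropagate through the decoder into the hidden pre-activations.
        d_hid = [sum(d_out[i] * self.W[i][j] for i in range(len(x)))
                 * h[j] * (1.0 - h[j]) for j in range(len(h))]
        for i in range(len(x)):
            for j in range(len(h)):
                # Tied weights receive gradient from both decoder and encoder.
                self.W[i][j] -= lr * (d_out[i] * h[j] + x_noisy[i] * d_hid[j])
            self.b_visible[i] -= lr * d_out[i]
        for j in range(len(h)):
            self.b_hidden[j] -= lr * d_hid[j]

    def reconstruction_error(self, x):
        # Mean squared error between a clean input and its reconstruction.
        x_hat = self.decode(self.encode(x))
        return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
```

After training, `encode(sample)` returns one activation per hidden node. These activations play the role of the constructed features described in the abstract, which the authors then relate to clinical and molecular annotations.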

mSystems | 2016

ADAGE-Based Integration of Publicly Available Pseudomonas aeruginosa Gene Expression Data with Denoising Autoencoders Illuminates Microbe-Host Interactions.

Jie Tan; John H. Hammond; Deborah A. Hogan; Casey S. Greene

ABSTRACT The increasing number of genome-wide assays of gene expression available from public databases presents opportunities for computational methods that facilitate hypothesis generation and biological interpretation of these data. We present an unsupervised machine learning approach, ADAGE (analysis using denoising autoencoders of gene expression), and apply it to the publicly available gene expression data compendium for Pseudomonas aeruginosa. In this approach, the machine-learned ADAGE model contained 50 nodes which we predicted would correspond to gene expression patterns across the gene expression compendium. While no biological knowledge was used during model construction, cooperonic genes had similar weights across nodes, and genes with similar weights across nodes were significantly more likely to share KEGG pathways. By analyzing newly generated and previously published microarray and transcriptome sequencing data, the ADAGE model identified differences between strains, modeled the cellular response to low oxygen, and predicted the involvement of biological processes based on low-level gene expression differences. ADAGE compared favorably with traditional principal component analysis and independent component analysis approaches in its ability to extract validated patterns, and based on our analyses, we propose that these approaches differ in the types of patterns they preferentially identify. We provide the ADAGE model with analysis of all publicly available P. aeruginosa GeneChip experiments and open source code for use with other species and settings. Extraction of consistent patterns across large-scale collections of genomic data using methods like ADAGE provides the opportunity to identify general principles and biologically important patterns in microbial biology. This approach will be particularly useful in less-well-studied microbial species.
IMPORTANCE The quantity and breadth of genome-scale data sets that examine RNA expression in diverse bacterial and eukaryotic species are increasing more rapidly than for curated knowledge. Our ADAGE method integrates such data without requiring gene function, gene pathway, or experiment labeling, making practical its application to any large gene expression compendium. We built a Pseudomonas aeruginosa ADAGE model from a diverse set of publicly available experiments without any prespecified biological knowledge, and this model was accurate and predictive. We provide ADAGE results for the complete P. aeruginosa GeneChip compendium for use by researchers studying P. aeruginosa and source code that facilitates ADAGE’s application to other species and data types.

PeerJ | 2016

Cross-platform normalization of microarray and RNA-seq data for machine learning applications

Jeffrey A. Thompson; Jie Tan; Casey S. Greene

Large, publicly available gene expression datasets are often analyzed with the aid of machine learning algorithms. Although RNA-seq is increasingly the technology of choice, a wealth of expression data already exist in the form of microarray data. If machine learning models built from legacy data can be applied to RNA-seq data, larger, more diverse training datasets can be created and validation can be performed on newly generated data. We developed Training Distribution Matching (TDM), which transforms RNA-seq data for use with models constructed from legacy platforms. We evaluated TDM, as well as quantile normalization, nonparanormal transformation, and a simple log2 transformation, on both simulated and biological datasets of gene expression. Our evaluation included both supervised and unsupervised machine learning approaches. We found that TDM exhibited consistently strong performance across settings and that quantile normalization also performed well in many circumstances. We also provide a TDM package for the R programming language.
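Two of the baseline transformations named in this abstract, the log2 transformation and quantile normalization, are simple enough to sketch directly. The code below does not implement TDM itself. The function names, the pseudocount of 1, and the linear interpolation between reference quantiles are assumptions made for illustration.

```python
import math


def log2_transform(counts, pseudocount=1.0):
    """Log2-transform expression values; the pseudocount avoids log2(0)."""
    return [math.log2(c + pseudocount) for c in counts]


def quantile_normalize_to_reference(target, reference):
    """Map target values onto the reference distribution by rank.

    The smallest target value receives the smallest reference value, the
    largest receives the largest, and ranks in between are interpolated
    linearly. Tied target values are ranked in their original order, so
    ties may receive slightly different outputs in this simplified version.
    """
    if not target or not reference:
        raise ValueError("target and reference must be non-empty")
    ref_sorted = sorted(reference)
    n = len(target)
    order = sorted(range(n), key=lambda i: target[i])
    result = [0.0] * n
    for rank, idx in enumerate(order):
        # Quantile of this rank in the target, in [0, 1].
        q = rank / (n - 1) if n > 1 else 0.5
        # Fractional position of that quantile in the sorted reference.
        pos = q * (len(ref_sorted) - 1)
        lo, hi = math.floor(pos), math.ceil(pos)
        frac = pos - lo
        result[idx] = ref_sorted[lo] * (1.0 - frac) + ref_sorted[hi] * frac
    return result
```

In a cross-platform setting, `reference` would hold values from the legacy platform used to train the model and `target` would hold the new RNA-seq values for one sample, so that the transformed sample follows the distribution the model was trained on.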

European Conference on Artificial Life | 2013

Rapid Rule Compaction Strategies for Global Knowledge Discovery in a Supervised Learning Classifier System

Jie Tan; Jason H. Moore; Ryan J. Urbanowicz

Michigan-style learning classifier systems have availed themselves as a promising modeling and data mining strategy for bioinformaticists seeking to connect predictive variables with disease phenotypes. The resulting ‘model’ learned by these algorithms is comprised of an entire population of rules, some of which will inevitably be redundant or poor predictors. Rule compaction is a post-processing strategy for consolidating this rule population with the goal of improving interpretation and knowledge discovery. However, existing rule compaction strategies tend to reduce overall rule population performance along with population size, especially in the context of noisy problem domains such as bioinformatics. In the present study we introduce and evaluate two new rule compaction strategies (QRC, PDRC) and a simple rule filtering method (QRF), and compare them to three existing methodologies. These new strategies are tuned to fit with a global approach to knowledge discovery in which less emphasis is placed on minimizing rule population size (to facilitate manual rule inspection) and more is placed on preserving performance. This work identified the strengths and weaknesses of each approach, suggesting PDRC to be the most balanced approach trading a minimal loss in testing accuracy for significant gains or consistency in all other performance statistics.

Cell Systems | 2017

Unsupervised Extraction of Stable Expression Signatures from Public Compendia with an Ensemble of Neural Networks

Jie Tan; Georgia Doing; Kimberley A Lewis; Courtney E Price; Kathleen M Chen; Kyle C Cady; Barret Perchuk; Michael T. Laub; Deborah A. Hogan; Casey S. Greene

Cross-experiment comparisons in public data compendia are challenged by unmatched conditions and technical noise. The ADAGE method, which performs unsupervised integration with denoising autoencoder neural networks, can identify biological patterns, but because ADAGE models, like many neural networks, are over-parameterized, different ADAGE models perform equally well. To enhance model robustness and better build signatures consistent with biological pathways, we developed an ensemble ADAGE (eADAGE) that integrated stable signatures across models. We applied eADAGE to a compendium of Pseudomonas aeruginosa gene expression profiling experiments performed in 78 media. eADAGE revealed a phosphate starvation response controlled by PhoB in media with moderate phosphate and predicted that a second stimulus provided by the sensor kinase, KinB, is required for this PhoB activation. We validated this relationship using both targeted and unbiased genetic approaches. eADAGE, which captures stable biological patterns, enables cross-experiment comparisons that can highlight measured but undiscovered relationships.

bioRxiv | 2015

ADAGE analysis of publicly available gene expression data collections illuminates Pseudomonas aeruginosa-host interactions

Jie Tan; John H. Hammond; Deborah A. Hogan; Casey S. Greene

The growth in genome-scale assays of gene expression for different species in publicly available databases presents new opportunities for computational methods that aid in hypothesis generation and biological interpretation of these data. Here, we present an unsupervised machine-learning approach, ADAGE (Analysis using Denoising Autoencoders of Gene Expression) and apply it to the interpretation of all of the publicly available gene expression data for Pseudomonas aeruginosa, an important opportunistic bacterial pathogen. In post-hoc positive control analyses using curated knowledge, the P. aeruginosa ADAGE model found that co-operonic genes often participated in similar processes and accurately predicted which genes had similar functions. By analyzing newly generated data and previously published microarray and RNA-seq data, the ADAGE model identified gene expression differences between strains, modeled the cellular response to low oxygen, and predicted the involvement of biological processes despite low level expression differences in directly involved genes. Comparison of ADAGE with PCA and ICA revealed that ADAGE extracts distinct signals. We provide the ADAGE model with analysis of all publicly available P. aeruginosa GeneChip experiments, and we provide open source code for use in other species and settings.

Evolutionary Computation, Machine Learning and Data Mining in Bioinformatics | 2013

Time-point specific weighting improves coexpression networks from time-course experiments

Jie Tan; Gavin D. Grant; Michael L. Whitfield; Casey S. Greene

Integrative systems biology approaches build, evaluate, and combine data from thousands of diverse experiments. These strategies rely on methods that effectively identify and summarize gene-gene relationships within individual experiments. For gene-expression datasets, the Pearson correlation is often applied to build coexpression networks because it is both easily interpretable and quick to calculate. Here we develop and evaluate weighted Pearson correlation approaches that better summarize gene expression data into coexpression networks for synchronized cell cycle time-course experiments. These methods use experimental measurements of cell cycle synchrony to estimate appropriate weights through either sliding window or linear regression approaches. We show that these weights improve our ability to build coexpression networks capable of identifying phase-specific functional relationships between genes. We evaluate our method on diverse experiments and find that both weighted strategies outperform the traditional method. This weighted correlation approach is implemented in the Sleipnir library, an open source library used for integrative systems biology. Integrative approaches using properly weighted time-course experiments will provide a more detailed understanding of the processes studied in such experiments.
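A weighted Pearson correlation can be written in a few lines. The sketch below assumes the per-time-point weights are already available. It does not reproduce the sliding window or linear regression procedures the authors use to estimate weights from synchrony measurements, and it is not the Sleipnir implementation. The function name and error handling are illustrative.

```python
import math


def weighted_pearson(x, y, weights):
    """Pearson correlation in which each time point contributes by its weight."""
    if not (len(x) == len(y) == len(weights)):
        raise ValueError("x, y and weights must have the same length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    # Weighted means of the two expression profiles.
    mean_x = sum(w * a for w, a in zip(weights, x)) / total
    mean_y = sum(w * b for w, b in zip(weights, y)) / total
    # Weighted covariance and variances (the common 1/total factor cancels).
    cov = sum(w * (a - mean_x) * (b - mean_y) for w, a, b in zip(weights, x, y))
    var_x = sum(w * (a - mean_x) ** 2 for w, a in zip(weights, x))
    var_y = sum(w * (b - mean_y) ** 2 for w, b in zip(weights, y))
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / math.sqrt(var_x * var_y)
```

With equal weights this reduces to the ordinary Pearson correlation. Lowering the weight of a time point, for example a late one where synchrony has decayed, reduces its influence on the resulting coexpression edge.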

BioData Mining | 2018

PathCORE-T: identifying and visualizing globally co-occurring pathways in large transcriptomic compendia

Kathleen M Chen; Jie Tan; Gregory P. Way; Georgia Doing; Deborah A. Hogan; Casey S. Greene

Background: Investigators often interpret genome-wide data by analyzing the expression levels of genes within pathways. While this within-pathway analysis is routine, the products of any one pathway can affect the activity of other pathways. Past efforts to identify relationships between biological processes have evaluated overlap in knowledge bases or evaluated changes that occur after specific treatments. Individual experiments can highlight condition-specific pathway-pathway relationships; however, constructing a complete network of such relationships across many conditions requires analyzing results from many studies. Results: We developed the PathCORE-T framework by implementing existing methods to identify pathway-pathway transcriptional relationships evident across a broad data compendium. PathCORE-T is applied to the output of feature construction algorithms; it identifies pairs of pathways observed in features more than expected by chance as functionally co-occurring. We demonstrate PathCORE-T by analyzing an existing eADAGE model of a microbial compendium and building and analyzing NMF features from the TCGA dataset of 33 cancer types. The PathCORE-T framework includes a demonstration web interface, with source code, that users can launch to (1) visualize the network and (2) review the expression levels of associated genes in the original data. PathCORE-T creates and displays the network of globally co-occurring pathways based on features observed in a machine learning analysis of gene expression data. Conclusions: The PathCORE-T framework identifies transcriptionally co-occurring pathways from the results of unsupervised analysis of gene expression data and visualizes the relationships between pathways as a network. PathCORE-T recapitulated previously described pathway-pathway relationships and suggested experimentally testable additional hypotheses that remain to be explored.

bioRxiv | 2017

Unsupervised extraction of stable expression signatures from public compendia with eADAGE

Jie Tan; Georgia Doing; Kimberley A Lewis; Courtney E Price; Kathleen M Chen; Kyle C Cady; Barret Perchuk; Michael T. Laub; Deborah A. Hogan; Casey S. Greene

Cross-experiment comparisons in public data compendia are challenged by unmatched conditions and technical noise. The ADAGE method, which performs unsupervised integration with neural networks, can effectively identify biological patterns, but because ADAGE models, like many neural networks, are over-parameterized, different ADAGE models perform equally well. To enhance model robustness and better build signatures consistent with biological pathways, we developed an ensemble ADAGE (eADAGE) that integrated stable signatures across models. We applied eADAGE to a Pseudomonas aeruginosa compendium containing experiments performed in 78 media. eADAGE revealed a phosphate starvation response controlled by PhoB. While we expected PhoB activity in limiting phosphate conditions, our analyses found PhoB activity in other media with moderate phosphate and predicted that a second stimulus provided by the sensor kinase, KinB, is required for PhoB activation in this setting. We validated this relationship using both targeted and unbiased genetic approaches. eADAGE, which captures stable biological patterns, enables cross-experiment comparisons that can highlight measured but undiscovered relationships.
