Featured Research

Genomics

CLEAR: Coverage-based Limiting-cell Experiment Analysis for RNA-seq

Direct cDNA preamplification protocols developed for single-cell RNA-seq have enabled transcriptome profiling of precious clinical samples and rare cells without sample pooling or RNA extraction. Currently, there is no algorithm optimized to reveal and remove noisy transcripts in limiting-cell RNA-seq (lcRNA-seq) data for downstream analyses. Herein, we present CLEAR, a workflow that identifies reliably quantifiable transcripts in lcRNA-seq data for differentially expressed gene (DEG) analysis. Libraries at three input amounts of FACS-derived CD5+ and CD5- cells from a chronic lymphocytic leukemia patient were used to develop CLEAR. Compared with using all transcripts, using CLEAR transcripts in downstream analyses revealed more shared transcripts across different input RNA amounts, improved Principal Component Analysis (PCA) separation, and yielded more DEGs between cell types. As proof-of-principle, CLEAR was applied to an in-house lcRNA-seq dataset and two public datasets. When imputation is used, CLEAR is also adaptable to large clinical studies and to single-cell analyses.

Genomics

COCACOLA: binning metagenomic contigs using sequence COmposition, read CoverAge, CO-alignment, and paired-end read LinkAge

The advent of next-generation sequencing (NGS) technologies enables researchers to sequence complex microbial communities directly from the environment. Since assembly typically produces only genome fragments, also known as contigs, instead of entire genomes, it is crucial to group them into operational taxonomic units (OTUs) for further taxonomic profiling and downstream functional analysis. OTU clustering is also referred to as binning. We present COCACOLA, a general framework that automatically bins contigs into OTUs based upon sequence composition and coverage across multiple samples. The effectiveness of COCACOLA is demonstrated on both simulated and real datasets in comparison to state-of-the-art binning approaches such as CONCOCT, GroopM, MaxBin and MetaBAT. The superior performance of COCACOLA relies on two aspects. One is employing L1 distance instead of Euclidean distance for better taxonomic identification during initialization. More importantly, COCACOLA takes advantage of both hard clustering and soft clustering through sparsity regularization. In addition, the COCACOLA framework seamlessly incorporates customized knowledge to improve binning accuracy. In our study, we have investigated two types of additional knowledge, the co-alignment to reference genomes and the linkage of contigs provided by paired-end reads, as well as the ensemble of both. We find that both co-alignment and linkage information further improve binning in the majority of cases. COCACOLA is scalable and faster than CONCOCT, GroopM, MaxBin and MetaBAT. The software is available at this https URL
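
To make the distinction between L1 and Euclidean distance concrete, here is a minimal sketch in Python. It is not taken from the COCACOLA software; the function names and the two four-component composition profiles are hypothetical and chosen only for illustration.

```python
import math


def l1_distance(p, q):
    """L1 (Manhattan) distance: sum of absolute component differences."""
    return sum(abs(a - b) for a, b in zip(p, q))


def euclidean_distance(p, q):
    """Euclidean (L2) distance: square root of summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


# Hypothetical normalized composition profiles for two contigs
# (real profiles would have many more components).
contig_a = [0.25, 0.25, 0.25, 0.25]
contig_b = [0.40, 0.10, 0.40, 0.10]

print("L1:", l1_distance(contig_a, contig_b))
print("Euclidean:", euclidean_distance(contig_a, contig_b))
```

L1 distance weights every component difference linearly, whereas Euclidean distance squares differences and so emphasizes the largest ones. Which behaves better for a given dataset is an empirical question; the abstract reports only that L1 worked better during initialization.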

Genomics

CTCF Degradation Causes Increased Usage of Upstream Exons in Mouse Embryonic Stem Cells

Transcriptional repressor CTCF is an important regulator of chromatin 3D structure, facilitating the formation of topologically associating domains (TADs). However, its direct effects on gene regulation are less well understood. Here, we utilize previously published ChIP-seq and RNA-seq data to investigate the effects of CTCF on alternative splicing of genes with CTCF sites. We compared the amount of RNA-seq signal in exons upstream and downstream of binding sites following auxin-induced degradation of CTCF in mouse embryonic stem cells. We found that changes in gene expression following CTCF depletion were significant, with a general increase in the presence of upstream exons. We infer that a possible mechanism by which CTCF binding contributes to alternative splicing is by causing pauses in the transcription machinery, during which splicing elements are able to act concurrently on upstream exons already transcribed into RNA.

Genomics

Can artificial neural networks supplant the polygene risk score for risk prediction of complex disorders given very large sample sizes?

Genome-wide association studies (GWAS) provide a means of examining the common genetic variation underlying a range of traits and disorders. In addition, it is hoped that GWAS may provide a means of differentiating affected from unaffected individuals. This has potential applications in the area of risk prediction. Current attempts to address this problem focus on using the polygene risk score (PRS) to predict case-control status on the basis of GWAS data. However, this approach has so far had limited success for complex traits such as schizophrenia (SZ). This is essentially a classification problem. Artificial neural networks (ANNs) have been shown in recent years to be highly effective in such applications. Here we apply an ANN to the problem of distinguishing SZ patients from unaffected controls. We compare the effectiveness of the ANN with the PRS in classifying individuals by case-control status based only on genetic data from a GWAS. We use the schizophrenia dataset from the Psychiatric Genomics Consortium (PGC) for this study. Our analysis indicates that the ANN is more sensitive to sample size than the PRS. As ever larger sample sizes become available, we suggest that ANNs are a promising alternative to the PRS for classification and risk prediction for complex genetic disorders.
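
For readers unfamiliar with the PRS, the sketch below shows its usual form: a weighted sum of risk-allele counts, with weights taken from GWAS effect-size estimates. The variant weights and the genotype are invented for illustration and do not come from the PGC dataset; the function name is likewise hypothetical.

```python
def polygenic_risk_score(allele_counts, effect_sizes):
    """Weighted sum of risk-allele counts (0, 1, or 2 per variant).

    effect_sizes are per-variant weights, e.g. log odds ratios
    estimated in a GWAS.
    """
    if len(allele_counts) != len(effect_sizes):
        raise ValueError("one effect size is required per variant")
    return sum(count * weight for count, weight in zip(allele_counts, effect_sizes))


# Hypothetical weights for three variants and one individual's genotype.
effect_sizes = [0.10, -0.05, 0.20]
individual = [2, 1, 0]

score = polygenic_risk_score(individual, effect_sizes)
print("PRS:", score)
```

Because the score is linear in the genotype, it cannot represent interactions between variants. A neural network can, in principle, but has far more parameters to fit, which is consistent with the abstract's observation that the ANN is more sensitive to sample size.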

Genomics

Cancer Gene Profiling through Unsupervised Discovery

Precision medicine is a paradigm shift in healthcare that relies heavily on genomics data. However, the complexity of biological interactions, the large number of genes, and the lack of comparisons on the analysis of data remain a tremendous bottleneck to clinical adoption. In this paper, we introduce a novel, automatic and unsupervised framework to discover low-dimensional gene biomarkers. Our method is based on the LP-Stability algorithm, a high-dimensional center-based unsupervised clustering algorithm that is modular with respect to metric functions, scales well, and automatically determines the best number of clusters. Our evaluation includes both mathematical and biological criteria. The recovered signature is applied to a variety of biological tasks, including screening of biological pathways and functions, and characterization of its relevance to tumor types and subtypes. Quantitative comparisons among different distance metrics, commonly used clustering methods and a referential gene signature used in the literature confirm the state-of-the-art performance of our approach. In particular, our signature, which is based on 27 genes, reports at least 30 times better mathematical significance (average Dunn's Index) and 25% better biological significance (average Enrichment in Protein-Protein Interaction) than those produced by other referential clustering methods. Finally, our signature reports promising results on distinguishing immune inflammatory and immune desert tumors, while reporting a high balanced accuracy of 92% on tumor type classification and an averaged balanced accuracy of 68% on tumor subtype classification, which represent, respectively, 7% and 9% higher performance compared to the referential signature.
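
The Dunn index mentioned above can be sketched as follows. This uses one common definition (smallest distance between points in different clusters, divided by the largest within-cluster diameter) with Euclidean distance; the paper may use a different variant or metric, and the toy clusters are hypothetical.

```python
import math
from itertools import combinations


def dunn_index(clusters):
    """Dunn index: min inter-cluster separation / max intra-cluster diameter.

    clusters is a list of clusters, each a list of coordinate tuples.
    Higher values indicate compact, well-separated clusters.
    """
    # Largest distance between any two points that share a cluster.
    max_diameter = max(
        (math.dist(p, q) for c in clusters for p, q in combinations(c, 2)),
        default=0.0,
    )
    if max_diameter == 0:
        raise ValueError("at least one cluster must contain two distinct points")
    # Smallest distance between any two points in different clusters.
    min_separation = min(
        math.dist(p, q)
        for a, b in combinations(clusters, 2)
        for p in a
        for q in b
    )
    return min_separation / max_diameter


# Two hypothetical, well-separated clusters in the plane.
clusters = [[(0, 0), (0, 1)], [(5, 0), (5, 1)]]
print("Dunn index:", dunn_index(clusters))
```

This brute-force version compares every pair of points, so it is only practical for small examples.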

Genomics

Cancer classification and pathway discovery using non-negative matrix factorization

Extracting genetic information from a full range of sequencing data is important for understanding diseases. We propose a novel method to effectively explore the landscape of genetic mutations and aggregate them to predict cancer type. We used multinomial logistic regression, nonsmooth non-negative matrix factorization (nsNMF), and support vector machine (SVM) to utilize the full range of sequencing data, aiming at better aggregating genetic mutations and improving their power in predicting cancer types. Specifically, we introduced a classifier to distinguish cancer types using somatic mutations obtained from whole-exome sequencing data. Mutations were identified from multiple cancers and scored using SIFT, PP2, and CADD, and grouped at the individual gene level. The nsNMF was then applied to reduce dimensionality and to obtain coefficient and basis matrices. A feature matrix was derived from the obtained matrices to train a classifier for cancer type classification with the SVM model. We have demonstrated that the classifier was able to distinguish the cancer types with reasonable accuracy. In five-fold cross-validations using mutation counts as features, the average prediction accuracy was 77.1% (SEM=0.1%), significantly outperforming baselines and outperforming models using mutation scores as features. Using the factor matrices derived from the nsNMF, we identified multiple genes and pathways that are significantly associated with each cancer type. This study presents a generic and complete pipeline to study the associations between somatic mutations and cancers. The discovered genes and pathways associated with each cancer type can lead to biological insights. The proposed method can be adapted to other studies for disease classification and pathway discovery.

Genomics

Cell Identity Codes: Understanding Cell Identity from Gene Expression Profiles using Deep Neural Networks

Understanding cell identity is an important task in many biomedical areas. Expression patterns of specific marker genes have been used to characterize a limited number of cell types, but exclusive markers are not available for many cell types. A second approach is to use machine learning to discriminate cell types based on whole gene expression profiles (GEPs). The accuracies of simple classification algorithms such as linear discriminators or support vector machines are limited due to the complexity of biological systems. We used deep neural networks to analyze 1040 GEPs from 16 different human tissues and cell types. After comparing different architectures, we identified a specific structure of deep autoencoders that can encode a GEP into a vector of 30 numeric values, which we call the cell identity code (CIC). The original GEP can be reproduced from the CIC with an accuracy comparable to technical replicates of the same experiment. Although we use an unsupervised approach to train the autoencoder, we show that different values of the CIC are connected to different biological aspects of the cell, such as different pathways or biological processes. This network can use the CIC to reproduce the GEP of cell types it never saw during training. It can also tolerate some noise in the measurement of the GEP. Furthermore, we introduce the classifier autoencoder, an architecture that can accurately identify cell type based on the GEP or the CIC.

Genomics

Cell Type Identification from Single-Cell Transcriptomic Data via Semi-supervised Learning

Cell type identification from single-cell transcriptomic data is a common goal of single-cell RNA sequencing (scRNAseq) data analysis. Neural networks have been employed to identify cell types from scRNAseq data with high performance. However, building the identification models requires a large number of individual cells with accurately and unbiasedly annotated types. Unfortunately, labeling scRNAseq data is cumbersome and time-consuming, as it involves manual inspection of marker genes. To overcome this challenge, we propose a semi-supervised learning model that uses unlabeled scRNAseq cells together with a limited number of labeled scRNAseq cells to perform cell identification. First, we transform the scRNAseq cells into "gene sentences", an approach inspired by similarities between natural language systems and gene systems. The genes in these sentences are then represented as gene embeddings to reduce data sparsity. With these embeddings, we implement a semi-supervised learning model based on recurrent convolutional neural networks (RCNN), which includes a shared network, a supervised network and an unsupervised network. The proposed model is evaluated on macosko2015, a large-scale single-cell transcriptomic dataset with ground truth for individual cell types. The proposed model achieves encouraging performance by learning from a very limited number of labeled scRNAseq cells together with a large number of unlabeled scRNAseq cells.

Genomics

Cell lineage tracing using nuclease barcoding

Lineage tracing, the determination and mapping of progeny arising from single cells, is an important approach enabling the elucidation of mechanisms underlying diverse biological processes ranging from development to disease. We developed a dynamic sequence-based barcode for lineage tracing and have demonstrated its performance in C. elegans, a model organism whose lineage tree is well established. The strategy we use creates lineage trees based upon the introduction of specific mutations into cells and the propagation of these mutations to daughter cells at each cell division. We present an experimental proof of concept along with a corresponding simulation and analytical model for deeper understanding of the coding capacity of the system. By introducing mutations in a predictable manner using CRISPR/Cas9, our technology will enable more complete investigations of cellular processes.
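
The principle that mutations introduced in a cell are propagated to its daughters can be illustrated with a toy simulation. This is a sketch under stated assumptions, not the authors' model: the barcode length, the editing probability, the number of edit outcomes, and the rule of at most one new edit per daughter per division are all hypothetical.

```python
import random

UNEDITED = 0  # marker for a barcode site that has not been mutated yet


def divide(barcode, rng, edit_prob=0.5, n_outcomes=4):
    """Return two daughter barcodes.

    Each daughter inherits every edit of the parent and, with probability
    edit_prob, gains one new edit at a randomly chosen unedited site.
    """
    daughters = []
    for _ in range(2):
        child = list(barcode)
        open_sites = [i for i, state in enumerate(child) if state == UNEDITED]
        if open_sites and rng.random() < edit_prob:
            child[rng.choice(open_sites)] = rng.randint(1, n_outcomes)
        daughters.append(tuple(child))
    return daughters


def grow(n_sites, generations, seed=0):
    """Simulate divisions from one unedited cell; return one list of cells per generation."""
    rng = random.Random(seed)
    tree = [[(UNEDITED,) * n_sites]]
    for _ in range(generations):
        tree.append([d for cell in tree[-1] for d in divide(cell, rng)])
    return tree


def inherits(parent, child):
    """True if every edited site of the parent is unchanged in the child."""
    return all(p == UNEDITED or p == c for p, c in zip(parent, child))


tree = grow(n_sites=6, generations=3)
for leaf in tree[-1]:
    print(leaf)
```

Because edits are irreversible in this toy model, cells that share an edit at the same site are likely to share an ancestor, which is the signal used to reconstruct the lineage tree. The number of sites and outcomes bounds how many divisions can be distinguished, which is one way to think about the coding capacity the abstract refers to.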

Genomics

Cell-to-cell variability and robustness in S-phase duration from genome replication kinetics

Genome replication, a key process for a cell, relies on stochastic initiation by replication origins, causing variability in replication timing from cell to cell. While stochastic models of eukaryotic replication are widely available, the link between the key parameters and overall replication timing has not been addressed systematically. We use a combined analytical and computational approach to calculate how the positions and strengths of many origins lead to a given cell-to-cell variability in the total duration of replication of a large region, a chromosome or the entire genome. Specifically, the total replication timing can be framed as an extreme-value problem, since it is set by the last region to replicate in each cell. Our calculations identify two regimes based on the spread between the characteristic completion times of all inter-origin regions of a genome. For widely different completion times, timing is set by the single specific region that is typically the last to replicate in all cells. Conversely, when the completion times of all regions are comparable, an extreme-value estimate shows that the cell-to-cell variability of genome replication timing has universal properties. Comparison with available data shows that the replication program of three yeast species falls in this extreme-value regime.
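
The extreme-value framing can be illustrated with a small simulation in which the replication duration of each cell is the maximum of its regions' completion times. This is only a sketch: the Gaussian noise model, the relative spread, and the mean completion times are assumptions made for illustration and are not the paper's model.

```python
import random
from collections import Counter


def simulate_cells(region_means, n_cells=1000, rel_spread=0.1, seed=0):
    """Simulate total replication duration for many cells.

    Each region's completion time is drawn from a Gaussian around its
    mean (an assumed noise model). The duration for a cell is the time
    of the last region to finish, i.e. the maximum over regions.
    """
    rng = random.Random(seed)
    durations, last_regions = [], []
    for _ in range(n_cells):
        times = [rng.gauss(mean, rel_spread * mean) for mean in region_means]
        total = max(times)
        durations.append(total)
        last_regions.append(times.index(total))
    return durations, last_regions


def dominant_fraction(last_regions):
    """Fraction of cells in which the most frequent last-finishing region is last."""
    return Counter(last_regions).most_common(1)[0][1] / len(last_regions)


# Regime 1: one region is much slower than the others (hypothetical means).
_, last_dominated = simulate_cells([10, 10, 10, 30])
# Regime 2: all regions have comparable completion times.
_, last_comparable = simulate_cells([10, 10, 10, 10])

print("dominated regime:", dominant_fraction(last_dominated))
print("comparable regime:", dominant_fraction(last_comparable))
```

In the first regime a single region is last in essentially every cell, so it alone sets the timing. In the second, the identity of the last region varies from cell to cell, and the duration is governed by the statistics of the maximum over many similar regions.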
