Featured Research

Genomics

A Pipeline for Integrated Theory and Data-Driven Modeling of Genomic and Clinical Data

High-throughput genome profiling technologies such as RNA-Seq and microarrays have the potential to transform clinical decision making and biomedical research by enabling measurements of the genome at a granular level. However, to truly understand the causes of disease and the effects of medical interventions, these data must be integrated with phenotypic, environmental, and behavioral data from individuals. Further, effective knowledge discovery methods that can infer relationships between these data types are required. In this work, we propose a pipeline for knowledge discovery from integrated genomic and clinical data. The pipeline begins with a novel variable selection method and then uses a probabilistic graphical model to capture the relationships between features in the data. We demonstrate how this pipeline can improve breast cancer outcome prediction models and provide a biologically interpretable view of sequencing data.

A Public Website for the Automated Assessment and Validation of SARS-CoV-2 Diagnostic PCR Assays

Summary: Polymerase chain reaction (PCR)-based assays are the current gold standard for detecting and diagnosing SARS-CoV-2. However, as SARS-CoV-2 mutates, we need to constantly assess whether existing PCR-based assays will continue to detect all known viral strains. To enable the continuous monitoring of SARS-CoV-2 assays, we have developed a web-based assay validation algorithm that checks existing PCR-based assays against the ever-expanding genome databases for SARS-CoV-2 using both thermodynamic and edit-distance metrics. The assay screening results are displayed as a heatmap showing the number of mismatches between each detection assay and each SARS-CoV-2 genome sequence. Using a mismatch threshold to define detection failure, assay performance is summarized with the true positive rate (recall) to simplify assay comparisons. Availability: this https URL. Contact: Jason Gans and Patrick Chain.
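
The threshold-and-recall summary described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the function names, the toy primer and binding sites, and the threshold of one mismatch are hypothetical, and plain position-wise (Hamming) mismatch counting stands in for the edit-distance and thermodynamic metrics the tool actually uses.

```python
from typing import List


def count_mismatches(primer: str, site: str) -> int:
    """Count position-wise mismatches between a primer and an equal-length site.

    Simplification: Hamming distance only, used here as a stand-in for the
    edit-distance and thermodynamic metrics mentioned in the abstract.
    """
    if len(primer) != len(site):
        raise ValueError("primer and site must have equal length")
    return sum(a != b for a, b in zip(primer, site))


def assay_recall(mismatches: List[int], max_mismatches: int) -> float:
    """Fraction of genomes the assay detects.

    A genome counts as detected when its mismatch count does not exceed the
    threshold. Every genome in the database is a true SARS-CoV-2 sequence,
    so this fraction is the true positive rate (recall).
    """
    if not mismatches:
        raise ValueError("need at least one genome")
    detected = sum(m <= max_mismatches for m in mismatches)
    return detected / len(mismatches)


# Hypothetical toy data: one primer and its binding site in four genomes.
primer = "ACGTACGT"
sites = ["ACGTACGT", "ACGTACGA", "ACCTACGA", "TCCTACGA"]

# One row of the heatmap: mismatches between this assay and each genome.
mismatch_row = [count_mismatches(primer, s) for s in sites]

# Hypothetical threshold: more than one mismatch counts as detection failure.
recall = assay_recall(mismatch_row, max_mismatches=1)
print(mismatch_row, recall)
```

With this toy input, two of the four genomes fall within the threshold, so the assay's recall is 0.5. A real comparison would compute one such row per assay and rank assays by recall.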

A Quadratically Regularized Functional Canonical Correlation Analysis for Identifying the Global Structure of Pleiotropy with NGS Data

Investigating the pleiotropic effects of genetic variants can increase statistical power, provide important information for a deeper understanding of the complex genetic structures of disease, and offer powerful tools for designing effective treatments with fewer side effects. However, the current multiple phenotype association analysis paradigm lacks breadth (the number of phenotypes and genetic variants jointly analyzed at the same time) and depth (the hierarchical structure of phenotypes and genotypes). A key issue for high-dimensional pleiotropic analysis is to effectively extract informative internal representations and features from high-dimensional genotype and phenotype data. To explore multiple levels of representation of genetic variants, learn their internal patterns involved in disease development, and overcome critical barriers to developing novel statistical methods and computational algorithms for genetic pleiotropic analysis, we propose a new framework for association analysis, referred to as quadratically regularized functional CCA (QRFCCA), which combines three approaches: (1) quadratically regularized matrix factorization, (2) functional data analysis, and (3) canonical correlation analysis (CCA). Large-scale simulations show that QRFCCA has much higher power than nine competing statistics while retaining appropriate type I error rates. To further evaluate performance, QRFCCA and the nine other statistics are applied to the whole genome sequencing dataset from the TwinsUK study. Using QRFCCA, we identify a total of 79 genes with rare variants and 67 genes with common variants significantly associated with the 46 traits. The results show that QRFCCA substantially outperforms the nine other statistics.

A Robust and Precise ConvNet for small non-coding RNA classification (RPC-snRC)

Functional or non-coding RNAs are attracting more attention because they are now considered potentially valuable resources for developing new drugs to treat several human diseases. Identifying drugs that target the regulatory circuits of a functional RNA depends on knowing its family, a task known as RNA sequence classification. State-of-the-art small non-coding RNA classification methodologies take secondary structural features as input. However, the feature extraction approaches used in such classification take only global characteristics into account and completely overlook the correlated effects of local structures. Furthermore, secondary-structure-based approaches work with a high-dimensional feature space, which proves computationally expensive. This paper proposes a novel Robust and Precise ConvNet (RPC-snRC) methodology that classifies small non-coding RNA sequences into their relevant families using only the primary sequence of the RNAs. The RPC-snRC methodology learns a hierarchical representation of features from the position and occurrence information of nucleotides. To avoid exploding and vanishing gradient problems, we use an approach similar to DenseNet, in which gradients can flow directly from subsequent layers to previous layers. To assess the effectiveness of deeper architectures for small non-coding RNA classification, we also adapted two ResNet architectures with different numbers of layers. Experimental results on a benchmark small non-coding RNA dataset show that our proposed methodology not only outperforms existing small non-coding RNA classification approaches by a significant margin of 10% but also outperforms the adapted ResNet architectures.

A Stochastic Automata Network Description for Spatial DNA-Methylation Models

DNA methylation is an important biological mechanism to regulate gene expression and control cell development. Mechanistic modeling has become a popular approach to enhance our understanding of the dynamics of methylation pattern formation in living cells. Recent findings suggest that the methylation state of a cytosine base can be influenced by its DNA neighborhood. Therefore, it is necessary to generalize existing mathematical models that consider only one cytosine and its partner on the opposite DNA-strand (CpG), in order to include such neighborhood dependencies. One approach is to describe the system as a stochastic automata network (SAN) with functional transitions. We show that single-CpG models can successfully be generalized to multiple CpGs using the SAN description and verify the results by comparing them to results from extensive Monte-Carlo simulations.

A Stochastic Model for the Formation of Spatial Methylation Patterns

DNA methylation is an epigenetic mechanism whose important role in development has been widely recognized. This epigenetic modification results in heritable changes in gene expression not encoded by the DNA sequence. The underlying mechanisms controlling DNA methylation are only partly understood and recently different mechanistic models of enzyme activities responsible for DNA methylation have been proposed. Here we extend existing Hidden Markov Models (HMMs) for DNA methylation by describing the occurrence of spatial methylation patterns over time and propose several models with different neighborhood dependencies. We perform numerical analysis of the HMMs applied to bisulfite sequencing measurements and accurately predict wild-type data. In addition, we find evidence that the enzymes' activities depend on the left 5' neighborhood but not on the right 3' neighborhood.

A Systematic Review of Mutations Associated with Isoniazid Resistance Points to Lower Diagnostic Sensitivity for Common Mutations and Increased Incidence of Uncommon Mutations in Clinical Strains of Mycobacterium tuberculosis

Molecular testing is rapidly becoming integral to the global tuberculosis (TB) control effort. Uncommon mechanisms of resistance can escape detection by these platforms and lead to the development of multidrug-resistant (MDR) strains. This article is a systematic review of published articles that reported isoniazid (INH) resistance-conferring mutations between September 2013 and December 2019. The aims were to catalogue mutations associated with INH resistance, estimate their global prevalence and co-occurrence, and assess their utility in molecular diagnostics. The genes commonly associated with INH resistance (katG, inhA, and fabG1) and the intergenic region oxyR-ahpC were considered in this review. In total, 52 articles were included, describing 5,632 INH-resistant (INHR) clinical isolates from 31 countries. The three most frequently mutated loci continue to be katG315 (4,100), inhA-15 (786), and inhA-8 (105). However, the diagnostic value of inhA-8 is far lower than previously thought: it appeared in only 25 (0.4%) INHR isolates that lacked a mutation at the first two loci. Importantly, of the four katG loci recommended for diagnostics by the previous systematic review, only katG315 was observed in our INHR isolates. This indicates continued evolution and regional differences in INH resistance. We have identified 58 loci (common to both systematic reviews) in three genomic regions as a reliable basis for molecular diagnostics. We also report 49 new loci associated with INH resistance. Including all observed mutations provides a cumulative sensitivity of 85.1%. Most disconcerting is the remaining 14.9% of isolates, which harbor an unknown mechanism of resistance, will escape molecular detection, and will likely convert to MDR-TB, further complicating treatment. Integrating the information cataloged in this and other similar studies into current diagnostic tools is essential for combating the emergence of MDR-TB.
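
The counts quoted above can be turned into per-locus frequencies with a short calculation. This is a minimal sketch that uses only the figures stated in the abstract. The variable and function names are illustrative. Because one isolate can carry mutations at several loci, these frequencies overlap and should not be summed to reproduce the reported cumulative sensitivity.

```python
# Figures taken from the abstract above; names are illustrative only.
TOTAL_ISOLATES = 5632
locus_counts = {"katG315": 4100, "inhA-15": 786, "inhA-8": 105}


def percent(count: int, total: int) -> float:
    """Share of isolates, as a percentage rounded to one decimal place."""
    return round(100 * count / total, 1)


# Frequency of each locus among all INH-resistant isolates.
# These overlap: an isolate may carry mutations at more than one locus.
frequencies = {locus: percent(n, TOTAL_ISOLATES) for locus, n in locus_counts.items()}

# Isolates with an inhA-8 mutation but none at katG315 or inhA-15.
inha8_only = percent(25, TOTAL_ISOLATES)

# Share of isolates left unexplained by any observed mutation.
unexplained = round(100 - 85.1, 1)

print(frequencies, inha8_only, unexplained)
```

The calculation reproduces the 0.4% figure for inhA-8 in isolates lacking the other two mutations and shows that katG315 alone is present in roughly 73% of isolates.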

A bioinformatics pipeline for the identification of CHO cell differential gene expression from RNA-Seq data

In recent years, the publication of genome sequences for the Chinese hamster and Chinese hamster ovary (CHO) cell lines has facilitated the study of these biopharmaceutical cell factories with unprecedented resolution. Our understanding of the CHO cell transcriptome, in particular, has rapidly advanced through the application of next-generation sequencing (NGS) technology to characterise RNA expression (RNA-Seq). In this chapter, we present a computational pipeline for the analysis of CHO cell RNA-Seq data from the Illumina platform to identify differentially expressed genes. The example data and bioinformatics workflow required to run this analysis are freely available at this http URL.

A community-based transcriptomics classification and nomenclature of neocortical cell types

To understand the function of cortical circuits, it is necessary to classify their underlying cellular diversity. Traditional attempts based on comparing anatomical or physiological features of neurons and glia, while productive, have not resulted in a unified taxonomy of neural cell types. The recent development of single-cell transcriptomics has enabled, for the first time, systematic high-throughput profiling of large numbers of cortical cells and the generation of datasets that hold the promise of being complete, accurate and permanent. Statistical analyses of these data have revealed the existence of clear clusters, many of which correspond to cell types defined by traditional criteria and are conserved across cortical areas and species. To capitalize on these innovations and advance the field, we, the Copenhagen Convention Group, propose that the community adopt a transcriptome-based taxonomy of the cell types in the adult mammalian neocortex. This core classification should be ontological and hierarchical, and should use a standardized nomenclature. It should be configured to flexibly incorporate new data from multiple approaches, developmental stages and a growing number of species, enabling improvement and revision of the classification. This community-based strategy could serve as a common foundation for future detailed analysis and reverse engineering of cortical circuits, and as an example for cell type classification in other parts of the nervous system and other organs.

A deep learning classifier for local ancestry inference

Local ancestry inference (LAI) identifies the ancestry of each segment of an individual's genome and is an important step in medical and population genetic studies of diverse cohorts. Several techniques have been used for LAI, including Hidden Markov Models and Random Forests. Here, we formulate the LAI task as an image segmentation problem and develop a new LAI tool using a deep convolutional neural network with an encoder-decoder architecture. We train our model using complete genome sequences from 982 unadmixed individuals from each of five continental ancestry groups, and we evaluate it using simulated admixed data derived from an additional 279 individuals selected from the same populations. We show that our model is able to learn admixture as a zero-shot task, yielding ancestry assignments that are nearly as accurate as those from the existing gold standard tool, RFMix.
