Quantitative Biology Genomics - Researchain

Featured Research

Apollo: A Sequencing-Technology-Independent, Scalable, and Accurate Assembly Polishing Algorithm

Long reads produced by third-generation sequencing technologies are used to construct an assembly (i.e., the subject's genome), which is further used in downstream genome analysis. Unfortunately, long reads have high sequencing error rates and a large proportion of bps in these long reads are incorrectly identified. These errors propagate to the assembly and affect the accuracy of genome analysis. Assembly polishing algorithms minimize such error propagation by polishing or fixing errors in the assembly by using information from alignments between reads and the assembly (i.e., read-to-assembly alignment information). However, assembly polishing algorithms can only polish an assembly using reads either from a certain sequencing technology or from a small assembly. Such technology-dependency and assembly-size dependency require researchers to 1) run multiple polishing algorithms and 2) use small chunks of a large genome to use all available read sets and polish large genomes. We introduce Apollo, a universal assembly polishing algorithm that scales well to polish an assembly of any size (i.e., both large and small genomes) using reads from all sequencing technologies (i.e., second- and third-generation). Our goal is to provide a single algorithm that uses read sets from all available sequencing technologies to improve the accuracy of assembly polishing and that can polish large genomes. Apollo 1) models an assembly as a profile hidden Markov model (pHMM), 2) uses read-to-assembly alignment to train the pHMM with the Forward-Backward algorithm, and 3) decodes the trained model with the Viterbi algorithm to produce a polished assembly. Our experiments with real read sets demonstrate that Apollo is the only algorithm that 1) uses reads from any sequencing technology within a single run and 2) scales well to polish large assemblies without splitting the assembly into multiple parts.
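The decoding step named in the abstract can be illustrated with a minimal Viterbi sketch on a plain two-state HMM. This is not Apollo's implementation: a profile HMM has match, insertion, and deletion states per assembly position, and its parameters would come from Forward-Backward training on read alignments. The state names, probabilities, and the `viterbi` helper below are hypothetical and chosen only to keep the example small.

```python
import math

def viterbi(obs, states, log_start, log_trans, log_emit):
    """Return the most probable hidden-state path for the observed symbols."""
    # best[s] = best log-probability of any path that ends in state s
    best = {s: log_start[s] + log_emit[s][obs[0]] for s in states}
    back = []  # one dict of back-pointers per observed symbol after the first
    for symbol in obs[1:]:
        prev, best, pointers = best, {}, {}
        for s in states:
            # best predecessor of s, given the scores of the previous column
            p = max(states, key=lambda q: prev[q] + log_trans[q][s])
            best[s] = prev[p] + log_trans[p][s] + log_emit[s][symbol]
            pointers[s] = p
        back.append(pointers)
    # trace back from the best final state
    path = [max(states, key=lambda s: best[s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]

# Hypothetical toy model (assumption): AT-rich versus GC-rich regions.
states = ["AT_rich", "GC_rich"]
log_start = {s: math.log(0.5) for s in states}
log_trans = {
    "AT_rich": {"AT_rich": math.log(0.9), "GC_rich": math.log(0.1)},
    "GC_rich": {"AT_rich": math.log(0.1), "GC_rich": math.log(0.9)},
}
log_emit = {
    "AT_rich": {b: math.log(p) for b, p in zip("ACGT", (0.4, 0.1, 0.1, 0.4))},
    "GC_rich": {b: math.log(p) for b, p in zip("ACGT", (0.1, 0.4, 0.4, 0.1))},
}
path = viterbi("AAAAGGGG", states, log_start, log_trans, log_emit)
```

Working in log space avoids numerical underflow on long sequences, which matters for any HMM that spans a whole assembly.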

Genomics

Applied Category Theory for Genomics -- An Initiative

The ultimate secret of all life on Earth is hidden in genomes, the totality of an organism's DNA sequences. We currently know the whole genome sequence of many organisms, yet our understanding of genome architecture at a systematic level remains rudimentary. Applied category theory opens a promising way to integrate the vast amount of heterogeneous information in genomics, to advance our knowledge of genome organization, and to provide a deep and holistic view of our own genomes. In this work we explain why applied category theory carries such hope, and we go on to show how it could actually do so, albeit in baby steps. The manuscript is intended to be readable by both mathematicians and biologists, so no prior knowledge is required from either side.

Genomics

Approximate Search for Known Gene Clusters in New Genomes Using PQ-Trees

We define a new problem in comparative genomics, denoted PQ-Tree Search, that takes as input a PQ-tree $T$ representing the known gene orders of a gene cluster of interest, a gene-to-gene substitution scoring function $h$, integer parameters $d_T$ and $d_S$, and a new genome $S$. The objective is to identify in $S$ approximate new instances of the gene cluster that could vary from the known gene orders by genome rearrangements that are constrained by $T$, by gene substitutions that are governed by $h$, and by gene deletions and insertions that are bounded from above by $d_T$ and $d_S$, respectively. We prove that the PQ-Tree Search problem is NP-hard and propose a parameterized algorithm that solves the optimization variant of PQ-Tree Search in $O^*(2^{\gamma})$ time, where $\gamma$ is the maximum degree of a node in $T$ and $O^*$ is used to hide factors polynomial in the input size. The algorithm is implemented as a search tool, denoted PQFinder, and applied to search for instances of chromosomal gene clusters in plasmids, within a dataset of 1,487 prokaryotic genomes. We report on 29 chromosomal gene clusters that are rearranged in plasmids, where the rearrangements are guided by the corresponding PQ-tree. One of these results, coding for a heavy metal efflux pump, is further analysed to exemplify how PQFinder can be harnessed to reveal interesting new structural variants of known gene clusters. The code for the tool as well as all the data needed to reconstruct the results are publicly available on GitHub (this http URL).
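To make the role of the PQ-tree concrete, here is a brute-force sketch that lists every gene order a small PQ-tree permits. It is not the parameterized algorithm from the abstract and ignores substitutions, deletions, and insertions entirely. The tuple encoding, the `frontiers` helper, and the gene names are assumptions made for illustration: a P-node allows its children in any order, a Q-node only in the given order or its reverse.

```python
from itertools import permutations, product

def frontiers(node):
    """Return the set of leaf orders (tuples of genes) allowed by a PQ-tree."""
    if isinstance(node, str):            # leaf: a single gene
        return {(node,)}
    kind, children = node
    if kind == "P":                      # P-node: children in any order
        orders = permutations(children)
    else:                                # Q-node: given order or its reverse
        orders = [tuple(children), tuple(reversed(children))]
    result = set()
    for order in orders:
        # combine one allowed frontier from each child, left to right
        for parts in product(*(frontiers(child) for child in order)):
            result.add(tuple(gene for part in parts for gene in part))
    return result

# Hypothetical cluster: genes b, c, d may permute freely between a and e.
tree = ("Q", ["a", ("P", ["b", "c", "d"]), "e"])
allowed = frontiers(tree)
```

The enumeration grows factorially with the degree of a P-node, which hints at why the maximum node degree is a natural parameter for the running time.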

Genomics

Assembly ASM291031v2 (Genbank: GCA_002910315.2) identified as assembly of the Northern Dolly Varden (Salvelinus malma malma) genome, and not the Arctic char (S. alpinus) genome

To date, twelve complete genomes representing eleven species from six genera have been sequenced in salmonids. For the genus Salvelinus, the intended target was the genome of Arctic char, one of the most variable vertebrate species. Sequencing was carried out (Christensen et al., 2018) using tissues of the female IW2-2015, obtained from Icy Waters Ltd., a company engaged in industrial aquaculture of chars. The company uses two broodstocks of its own, NL and TR, which originate from chars of Nauyuk Lake and the Tree River (Nunavut, Canada). Since the complete mitochondrial genome of the female IW2-2015 was absent from the published assembly ASM291031v2, we determined its type and complete sequence from the sequence read archives in GenBank. The female's mitogenome was found to belong to the BERING haplogroup, which is characteristic of Northern Dolly Varden S. malma malma. Analysis of other unlinked diagnostic loci encoded by nuclear DNA (ITS1, RAG1, SFO-12, SFO-18, SMM-21) also revealed distinctive characters of Northern Dolly Varden in female IW2-2015. We conclude that the genomic assembly ASM291031v2 was obtained not from an individual of Arctic char S. alpinus but from an individual of a related species, Northern Dolly Varden S. malma malma. Diagnostic-locus characteristics identical to those of the IW2-2015 female were found in other individuals from the TR broodstock. Apparently, the TR broodstock is entirely a strain derived from Northern Dolly Varden. Because assembly ASM291031v2 was obtained from a specimen originating from a marginal population of Northern Dolly Varden (Tree R.) that is isolated from the main range of the species and shows some traces of introgressive hybridization, this assembly can hardly be considered a description of a typical genome of S. malma malma.

Genomics

Assessment of Multiple-Biomarker Classifiers: fundamental principles and a proposed strategy

The multiple-biomarker classifier problem and its assessment are reviewed against the background of some fundamental principles from the fields of statistical pattern recognition, machine learning, or what has recently been called "data science". A narrow reading of that literature has led many authors to neglect the contribution of the finite training sample to the total uncertainty of performance assessment. Yet the latter is a fundamental indicator of the stability of a classifier; thus its neglect may be contributing to the problematic status of many studies. A three-level strategy is proposed for moving forward in this field. The lowest level is that of construction, where candidate features are selected and the choice of classifier architecture is made. At that point, the effective dimensionality of the classifier is estimated and used to size the next level of analysis, a pilot study on previously unseen cases. The total (training and testing) uncertainty resulting from the pilot study is, in turn, used to size the highest level of analysis, a pivotal study with a target level of uncertainty. Some resources available in the literature for implementing this approach are reviewed. Although the concepts explained in the present article may be fundamental and straightforward for many researchers in the machine learning community, they are subtle for many practitioners, for whom we provided general best-practice advice in \cite{Shi2010MAQCII} and on which we elaborate in the present paper.

Genomics

Assessment of P-value variability in the current replicability crisis

Increased availability of data and accessibility of computational tools in recent years have created unprecedented opportunities for scientific research driven by statistical analysis. Inherent limitations of statistics impose constraints on the reliability of conclusions drawn from data, but misuse of statistical methods is a growing concern. Significance, hypothesis testing, and the accompanying P-values are being scrutinized as among the most widely applied and abused practices. One line of critique is that P-values are inherently unfit to fulfill their ostensible role as measures of the credibility of a scientific hypothesis. It has also been suggested that while P-values may have a role as summary measures of effect, researchers underappreciate the degree of randomness in the P-value. High variability of P-values would suggest that, having obtained a small P-value in one study, one is nevertheless likely to obtain a much larger P-value in a similarly powered replication study. Thus, "replicability of P-value" is itself questionable. To characterize P-value variability one can use prediction intervals whose endpoints reflect the likely spread of P-values that could have been obtained by a replication study. Unfortunately, the intervals currently in use, the P-intervals, are based on unrealistic implicit assumptions. Namely, P-intervals are constructed with assumptions that imply a substantial chance of encountering large effect sizes in an observational study, which leads to bias. As an alternative to P-intervals, we develop a method that gives researchers flexibility by providing them with the means to control these assumptions. Unlike endpoints of P-intervals, endpoints of our intervals are directly interpreted as probabilistic bounds for replication P-values and are resistant to selection bias, contingent upon approximate prior knowledge of the effect size distribution.
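A small calculation shows how wide a prediction interval for a replication P-value can be. The sketch below assumes a normally distributed test statistic with unit variance, one-sided P-values, and a flat prior on the standardized effect, which is one reading of the assumptions the abstract criticizes; it is not the authors' proposed method. Under those assumptions the replication statistic is distributed as N(z_obs, 2). The function name `flat_prior_p_interval` is hypothetical.

```python
from statistics import NormalDist

def flat_prior_p_interval(p_obs, coverage=0.80):
    """Prediction interval for a one-sided replication P-value.

    Assumes (illustration only) a unit-variance normal test statistic and a
    flat prior on the effect, so the replication z-score ~ N(z_obs, 2).
    """
    std_normal = NormalDist()
    z_obs = std_normal.inv_cdf(1.0 - p_obs)
    # half-width of the central interval for z_rep: quantile times sqrt(2)
    half_width = std_normal.inv_cdf(0.5 + coverage / 2.0) * 2 ** 0.5
    lower_p = 1.0 - std_normal.cdf(z_obs + half_width)
    upper_p = 1.0 - std_normal.cdf(z_obs - half_width)
    return lower_p, upper_p

# One-sided p = 0.025 corresponds to a two-sided p of 0.05.
lower_p, upper_p = flat_prior_p_interval(0.025)
```

For this input the interval runs from below 0.0001 to above 0.4, which illustrates how little a single small P-value constrains the next one when the prior on the effect is left uninformative.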

Genomics

Attention based convolutional neural network for predicting RNA-protein binding sites

RNA-binding proteins (RBPs) play crucial roles in many biological processes, e.g. gene regulation. Computational identification of RBP binding sites on RNAs is urgently needed. In particular, RBPs bind to RNAs by recognizing sequence motifs. Thus, quickly locating those motifs on RNA sequences is crucial for efficiently determining whether the RNAs interact with the RBPs. In this study, we present an attention-based convolutional neural network, iDeepA, to predict RNA-protein binding sites from raw RNA sequences. We first one-hot encode the RNA sequences. Next, we design a deep learning model with a convolutional neural network (CNN) and an attention mechanism, which automatically searches for important positions, e.g. binding motifs, to learn discriminative high-level features for predicting RBP binding sites. We evaluate iDeepA on publicly available gold-standard RBP binding sites derived from CLIP-seq data. The results demonstrate that iDeepA achieves performance comparable to other state-of-the-art methods.
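The encoding step mentioned above can be sketched in a few lines. The channel order A, C, G, U and the all-zero row for unknown bases such as N are assumptions for illustration; the abstract does not say how iDeepA orders channels or handles ambiguous bases. The helper name `one_hot_rna` is hypothetical.

```python
def one_hot_rna(seq, alphabet="ACGU"):
    """Encode an RNA sequence as a list of one-hot rows, one row per base."""
    index = {base: i for i, base in enumerate(alphabet)}
    encoded = []
    for base in seq.upper():
        row = [0] * len(alphabet)
        if base in index:        # unknown bases (e.g. N) stay all-zero
            row[index[base]] = 1
        encoded.append(row)
    return encoded

matrix = one_hot_rna("acgUN")
```

The resulting matrix has one row per nucleotide and one column per base, the shape a one-dimensional convolution over sequence positions would typically consume.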

Genomics

Autism spectrum disorder: a neuro-immunometabolic hypothesis of the developmental origins

Fetal neuroinflammation and prenatal stress (PS) may contribute to lifelong neurological disabilities. Astrocytes and microglia play a pivotal role, but the mechanisms are poorly understood. Here, we test the hypothesis that via gene-environment interactions, fetal neuroinflammation and PS may reprogram glial immunometabolic phenotypes which impact neurodevelopment and neurobehavior. This glial-neuronal interplay increases the risk for clinical manifestation of autism spectrum disorder (ASD) in at-risk children. Drawing on genomic data from the recently published series of ovine and rodent glial transcriptome analyses with fetuses exposed to neuroinflammation or PS, we conducted a secondary analysis against the Simons Foundation Autism Research Initiative (SFARI) Gene database. We confirmed 21 gene hits. Using unsupervised statistical network analysis, we then identified six clusters of probable protein-protein interactions mapping onto the immunometabolic and stress response networks and epigenetic memory. These findings support our hypothesis. We discuss the implications for ASD etiology, early detection, and novel therapeutic approaches.

Genomics

Automated deconvolution of structured mixtures from bulk tumor genomic data

Motivation: As cancer researchers have come to appreciate the importance of intratumor heterogeneity, much attention has focused on the challenges of accurately profiling heterogeneity in individual patients. Experimental technologies for directly profiling genomes of single cells are rapidly improving, but they are still impractical for large-scale sampling. Bulk genomic assays remain the standard for population-scale studies, but conflate the influences of mixtures of genetically distinct tumor, stromal, and infiltrating immune cells. Many computational approaches have been developed to deconvolute these mixed samples and reconstruct the genomics of genetically homogeneous clonal subpopulations. All such methods, however, are limited to reconstructing only coarse approximations to a few major subpopulations. In prior work, we showed that one can improve deconvolution of genomic data by leveraging substructure in cellular mixtures through a strategy called simplicial complex inference. This strategy, however, is also limited by the difficulty of inferring mixture structure from sparse, noisy assays. Results: We improve on past work by introducing enhancements to automate learning of substructured genomic mixtures, with specific emphasis on genome-wide copy number variation (CNV) data. We introduce methods for dimensionality estimation to better decompose mixture model substructure; fuzzy clustering to better identify substructure in sparse, noisy data; and automated model inference methods for other key model parameters. We show that these improvements lead to more accurate inference of cell populations and mixture proportions in simulated scenarios. We further demonstrate their effectiveness in identifying mixture substructure in real tumor CNV data. Availability: Source code is available at this http URL

Genomics

Automatic learning of pre-miRNAs from different species

Discovery of microRNAs (miRNAs) relies on predictive models for characteristic features of miRNA precursors (pre-miRNAs). The short length of miRNA genes and the lack of pronounced sequence features complicate this task. To accommodate the peculiarities of plant and animal miRNA systems, tools for the two systems have evolved differently. However, these tools are biased towards the species for which they were primarily developed and, consequently, their predictive performance on data sets from other species of the same kingdom might be lower. While these biases are intrinsic to the species, characterizing their occurrence can lead to computational approaches able to diminish their negative effect on the accuracy of pre-miRNA predictive models. In this study, we investigate how 45 predictive models, induced from data sets of 45 species distributed across eight subphyla, perform when applied to a species different from the one used in their induction. Our computational experiments show that the separability of pre-miRNA and pseudo pre-miRNA instances is species-dependent and that no feature set performs well for all species, even within the same subphylum. To mitigate this species dependency, we show that an ensemble of classifiers reduces the classification errors for all 45 species. As the ensemble members were obtained using meaningful yet computationally viable feature sets, the ensembles also have a lower computational cost than individual classifiers that rely on energy stability parameters, which are prohibitively expensive to compute in large-scale applications. In this study, the combination of multiple pre-miRNA feature sets and multiple learning biases enhanced the predictive accuracy of pre-miRNA classifiers for 45 species. This is a promising approach to incorporate into miRNA discovery tools, making them more accurate and less species-dependent.
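One simple way to combine several classifiers is an unweighted majority vote over their predicted labels. The abstract does not state how its ensemble aggregates member outputs, so the voting rule, the `majority_vote` helper, and the example predictions below are assumptions for illustration only.

```python
from collections import Counter

def majority_vote(predictions):
    """Combine per-classifier label lists (all the same length) by majority."""
    combined = []
    for votes in zip(*predictions):      # votes for one instance at a time
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined

# Hypothetical outputs of three classifiers (1 = pre-miRNA, 0 = pseudo).
clf_a = [1, 0, 1, 1]
clf_b = [1, 1, 0, 1]
clf_c = [0, 0, 1, 1]
ensemble = majority_vote([clf_a, clf_b, clf_c])
```

Using an odd number of members with binary labels avoids ties; each member here errs on a different instance, so the vote can be correct even where an individual classifier is not.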
