Featured Research

Genomics

*K-means and Cluster Models for Cancer Signatures

We present the *K-means clustering algorithm and source code, expanding statistical clustering methods previously applied to quantitative finance. *K-means is statistically deterministic without specifying initial centers, etc. We apply *K-means to extracting cancer signatures from genome data without using nonnegative matrix factorization (NMF). *K-means' computational cost is a fraction of NMF's. Using 1,389 published samples for 14 cancer types, we find that 3 cancers (liver cancer, lung cancer and renal cell carcinoma) stand out and do not have cluster-like structures. Two clusters have especially high within-cluster correlations with 11 other cancers, indicating common underlying structures. Our approach opens a novel avenue for studying such structures. *K-means is universal and can be applied in other fields. We discuss some potential applications in quantitative finance.

A $4,000 Workstation for Mammalian Genome Assembly with Long Reads

Long-read sequencing has enabled the de novo assembly of several mammalian genomes, but at a high computational cost. Here, we demonstrate de novo assembly of a mammalian genome using long reads on an efficient and inexpensive workstation.

A Common Gene Expression Signature Analysis Method for Multiple Types of Cancer

Mining gene expression profiles has proven valuable for identifying signatures serving as surrogates of cancer phenotypes. However, the similarities of such signatures across different cancer types have not been strong enough to conclude that they represent a universal biological mechanism shared among multiple cancer types. Here we describe a network-based approach that explores gene-to-gene connections in multiple cancer datasets while maximizing the overall association of the subnetwork with clinical outcomes. Using data from The Cancer Genome Atlas (TCGA), we studied the characteristics of common gene expression in three types of cancer: rectum adenocarcinoma (READ), breast invasive carcinoma (BRCA) and colon adenocarcinoma (COAD). By analyzing several pairs of highly correlated genes after filtering and clustering, we found that the genes co-expressed across multiple types of cancer point to particular biological mechanisms related to cancer cell progression, suggesting that they represent important attributes of cancer that need to be elucidated for potential applications in diagnostic, prognostic and therapeutic products applicable to multiple cancer types.
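
As a rough illustration of the "pairs of highly correlated genes" step, the sketch below lists gene pairs whose Pearson correlation across samples exceeds a cutoff. It is only a minimal example under stated assumptions: the function name `correlated_gene_pairs`, the 0.9 threshold, and the toy data are hypothetical, and the abstract does not say which correlation measure or cutoff the authors used.

```python
import numpy as np

def correlated_gene_pairs(expr, gene_names, threshold=0.9):
    """Return (gene_a, gene_b, r) for gene pairs with |Pearson r| >= threshold.

    expr: array of shape (n_genes, n_samples); each row is one gene.
    The threshold is an illustrative assumption, not a value from the paper.
    """
    corr = np.corrcoef(expr)  # np.corrcoef treats each row as a variable
    n = corr.shape[0]
    pairs = []
    for i in range(n):
        for j in range(i + 1, n):  # upper triangle only: each pair once
            if abs(corr[i, j]) >= threshold:
                pairs.append((gene_names[i], gene_names[j], float(corr[i, j])))
    return pairs

# Toy data (hypothetical): G2 closely tracks G1, G3 is independent noise.
rng = np.random.default_rng(0)
base = rng.normal(size=50)
expr = np.vstack([
    base,
    2.0 * base + 0.01 * rng.normal(size=50),
    rng.normal(size=50),
])
pairs = correlated_gene_pairs(expr, ["G1", "G2", "G3"])
```

On real expression data the matrix would first be filtered (for example, by removing low-variance genes) so that the pairwise scan stays tractable.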

A Comparison of Microbial Genome Web Portals

Microbial genome web portals have a broad range of capabilities that address a number of information-finding and analysis needs for scientists. This article compares the capabilities of the major microbial genome web portals to aid researchers in determining which portal(s) are best suited to solving their information-finding and analytical needs. We assessed both the bioinformatics tools and the data content of BioCyc, KEGG, Ensembl Bacteria, KBase, IMG, and PATRIC. For each portal, our assessment compared and tallied the available capabilities. The strengths of BioCyc include its genomic and metabolic tools, multi-search capabilities, table-based analysis tools, regulatory network tools and data, omics data analysis tools, breadth of data content, and large amount of curated data. The strengths of KEGG include its genomic and metabolic tools. The strengths of Ensembl Bacteria include its genomic tools and large number of genomes. The strengths of KBase include its genomic tools and metabolic models. The strengths of IMG include its genomic tools, multi-search capabilities, large number of genomes, table-based analysis tools, and breadth of data content. The strengths of PATRIC include its large number of genomes, table-based analysis tools, metabolic models, and breadth of data content.

A Graph Auto-Encoder for Haplotype Assembly and Viral Quasispecies Reconstruction

Reconstructing components of a genomic mixture from data obtained by means of DNA sequencing is a challenging problem encountered in a variety of applications including single individual haplotyping and studies of viral communities. High-throughput DNA sequencing platforms oversample mixture components to provide massive amounts of reads whose relative positions can be determined by mapping the reads to a known reference genome; assembly of the components, however, requires discovery of the reads' origin -- an NP-hard problem that the existing methods struggle to solve with the required level of accuracy. In this paper, we present a learning framework based on a graph auto-encoder designed to exploit structural properties of sequencing data. The algorithm is a neural network which essentially trains to ignore sequencing errors and infers the posterior probabilities of the origin of sequencing reads. Mixture components are then reconstructed by finding the consensus of the reads determined to originate from the same genomic component. Results on realistic synthetic as well as experimental data demonstrate that the proposed framework reliably assembles haplotypes and reconstructs viral communities, often significantly outperforming state-of-the-art techniques.
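
The final consensus step can be sketched as a per-position majority vote over the reads assigned to each component. This is a minimal sketch, not the authors' implementation: it assumes the read-to-component assignments are already available (in the paper they come from the graph auto-encoder), and the function name, the `(start, sequence)` read format, and the toy reads are hypothetical.

```python
from collections import Counter, defaultdict

def consensus_by_component(reads, assignments, length, gap="-"):
    """Majority-vote consensus per component.

    reads: list of (start, sequence) pairs, positions relative to the reference.
    assignments: component index for each read (assumed given here).
    length: length of the region to reconstruct.
    Positions not covered by any read are filled with `gap`.
    """
    votes = defaultdict(lambda: [Counter() for _ in range(length)])
    for (start, seq), comp in zip(reads, assignments):
        for offset, base in enumerate(seq):
            pos = start + offset
            if 0 <= pos < length:
                votes[comp][pos][base] += 1
    return {
        comp: "".join(col.most_common(1)[0][0] if col else gap for col in columns)
        for comp, columns in votes.items()
    }

# Toy example: two components; the third read carries one sequencing error (T at position 2).
reads = [
    (0, "ACGT"), (2, "GTAC"), (0, "ACTT"), (1, "CGT"),  # component 0
    (0, "TCGA"), (2, "GATT"), (1, "CGA"),               # component 1
]
assignments = [0, 0, 0, 0, 1, 1, 1]
haplotypes = consensus_by_component(reads, assignments, length=6)
```

Because the erroneous base is outvoted by the three other reads covering that position, the consensus for component 0 is unaffected by it.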

A Hybrid HMM Approach for the Dynamics of DNA Methylation

Understanding the mechanisms that control epigenetic changes is an important research area in modern functional biology. Epigenetic modifications such as DNA methylation are in general very stable over many cell divisions. DNA methylation can, however, be subject to specific and fast changes over a short time scale even in non-dividing (i.e. non-replicating) cells. Such dynamic DNA methylation changes are caused by a combination of active demethylation and de novo methylation processes which have not been investigated in integrated models. Here we present a hybrid (hidden) Markov model to describe the cycle of methylation and demethylation over (short) time scales. Our hybrid model describes several molecular events, some happening at deterministic time points (i.e. mechanisms that occur only during cell division) and others occurring at random time points. We test our model on mouse embryonic stem cells using time-resolved data. We predict methylation changes and estimate the efficiencies of the different modification steps related to DNA methylation and demethylation.

A Markovian genomic concatenation model guided by persymmetric matrices

The aim of this work is to provide a rigorous mathematical analysis of a stochastic concatenation model presented by Sobottka and Hart (2011) which allows approximation of the first-order stochastic structure in bacterial DNA by means of a stationary Markov chain. Two probabilistic constructions that rigorously formalize the model are presented. Necessary and sufficient conditions for a Markov chain to be generated by the model are given, as well as the theoretical background needed for designing new algorithms for statistical analyses of real bacterial genomes. It is shown that the model encompasses the Markov chains satisfying intra-strand parity, a property observed in most DNA sequences.
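
To make "first-order stochastic structure" concrete, the sketch below estimates a 4x4 nucleotide transition matrix from a sequence and computes its stationary distribution. It illustrates only the generic Markov-chain approximation, not the concatenation model of Sobottka and Hart; the function names and the toy sequence are hypothetical.

```python
import numpy as np

BASES = "ACGT"

def transition_matrix(seq):
    """Maximum-likelihood first-order transition matrix over A, C, G, T.

    Symbols outside ACGT are skipped; a base that is never followed by
    another base gets a uniform row (an arbitrary choice for this sketch).
    """
    idx = {b: i for i, b in enumerate(BASES)}
    counts = np.zeros((4, 4))
    for a, b in zip(seq, seq[1:]):
        if a in idx and b in idx:
            counts[idx[a], idx[b]] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    safe = np.where(row_sums > 0, row_sums, 1.0)  # avoid division by zero
    return np.where(row_sums > 0, counts / safe, 0.25)

def stationary_distribution(P):
    """Left eigenvector of P for the eigenvalue closest to 1, normalized to sum to 1."""
    vals, vecs = np.linalg.eig(P.T)
    k = np.argmin(np.abs(vals - 1.0))
    pi = np.real(vecs[:, k])
    return pi / pi.sum()

seq = "AACGTTGCATGCCGATAGGCTA"  # toy sequence, not real bacterial DNA
P = transition_matrix(seq)
pi = stationary_distribution(P)
```

For a real genome the same estimate would be computed over millions of bases, and the resulting chain could then be checked against properties such as intra-strand parity.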

A Model for Competition for Ribosomes in the Cell

Large-scale simultaneous mRNA translation and the resulting competition for the available ribosomes have important implications for the cell's functioning and evolution. Developing a better understanding of the intricate correlations between these simultaneous processes, rather than focusing on the translation of a single isolated transcript, should help in gaining a better understanding of mRNA translation regulation and the way elongation rates affect organismal fitness. A model of simultaneous translation is particularly important when dealing with highly expressed genes, as these consume more resources. In addition, such a model can lead to the more accurate predictions that are needed for interconnecting translational modules in synthetic biology. We develop and analyze a general model for large-scale simultaneous mRNA translation and competition for ribosomes. This is based on combining several ribosome flow models (RFMs) interconnected via a pool of free ribosomes. We prove that the compound system always converges to a steady state and that it always entrains or phase locks to periodically time-varying transition rates in any of the mRNA molecules. We use this model to explore the interactions between the various mRNA molecules and ribosomes at steady state. We show that increasing the length of an mRNA molecule decreases the production rate of all the mRNAs. Increasing any of the codon translation rates in a specific mRNA molecule yields a local effect: an increase in the translation rate of this mRNA, and also a global effect: the translation rates in the other mRNA molecules all increase or all decrease. These results suggest that the effect of codon decoding rates of endogenous and heterologous mRNAs on protein production is more complicated than previously thought.
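
The structure of such a network can be sketched numerically. The code below is a minimal sketch under assumptions not stated in the abstract: each mRNA is modeled with the standard RFM equations (the flow from site i to site i+1 is rate × density of site i × free fraction of site i+1), initiation is scaled by a saturating function tanh(z) of the pool size z, and integration uses a simple explicit Euler scheme. All names, rates, and initial values are hypothetical.

```python
import numpy as np

def rfm_network_step(xs, z, rates, dt):
    """One explicit Euler step for several RFMs sharing a pool of free ribosomes.

    xs:    list of 1-D arrays; xs[j][i] is the ribosome density at site i of mRNA j.
    z:     number of free ribosomes in the pool.
    rates: list of 1-D arrays of length n_j + 1 (initiation rate, then site rates).
    The pool feedback tanh(z) is an assumed choice of saturating function.
    """
    g = np.tanh(z)
    dz = 0.0
    new_xs = []
    for x, lam in zip(xs, rates):
        n = len(x)
        flow = np.empty(n + 1)
        flow[0] = lam[0] * g * (1.0 - x[0])            # initiation, limited by the pool
        flow[1:n] = lam[1:n] * x[:-1] * (1.0 - x[1:])  # elongation between neighboring sites
        flow[n] = lam[n] * x[-1]                       # termination; ribosome returns to the pool
        new_xs.append(x + dt * (flow[:-1] - flow[1:]))
        dz += flow[n] - flow[0]
    return new_xs, z + dt * dz

def simulate(xs, z, rates, dt=0.01, steps=20000):
    """Integrate the network forward for steps * dt time units."""
    for _ in range(steps):
        xs, z = rfm_network_step(xs, z, rates, dt)
    return xs, z

# Toy network: two mRNAs of lengths 3 and 5, unit rates, 4 ribosomes initially in the pool.
xs0 = [np.zeros(3), np.zeros(5)]
rates = [np.ones(4), np.ones(6)]
z0 = 4.0
xs_end, z_end = simulate(xs0, z0, rates)
total_end = z_end + sum(x.sum() for x in xs_end)
```

Every ribosome that leaves the pool enters an mRNA and every ribosome that terminates returns to the pool, so the total number of ribosomes is conserved; this coupling is what lets a change in one mRNA affect the translation of all the others.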

A Multi-Trait Approach Identified Genetic Variants Including a Rare Mutation in RGS3 with Impact on Abnormalities of Cardiac Structure/Function

Heart failure is a major cause of premature death. Given the heterogeneity of the heart failure syndrome, identifying genetic determinants of cardiac function and structure may provide greater insights into heart failure. Despite progress in understanding the genetic basis of heart failure through genome-wide association studies, the heritability of heart failure is not well understood. Gaining further insights into mechanisms that contribute to heart failure requires systematic approaches that go beyond single-trait analysis. We integrated a Bayesian multi-trait approach and Bayesian networks for the analysis of 10 correlated traits of cardiac structure and function measured for 3387 individuals with whole exome sequence data. While single-trait approaches did not find any significant genetic variant, the integrative Bayesian multi-trait approach identified 3 novel variants, located in the genes RGS3, CHD3, and MRPL38, with significant impact on cardiac traits such as left ventricular volume index, parasternal long axis interventricular septum thickness, and mean left ventricular wall thickness. Among these, the rare variant NC_000009.11:g.116346115C>A (rs144636307) in RGS3 showed a pleiotropic effect on left ventricular mass index, left ventricular volume index and maximum left atrial anterior-posterior diameter; notably, RGS3 can inhibit TGF-beta signaling associated with left ventricle dilation and systolic dysfunction.

A Pipeline for Insertion Sequence Detection and Study for Bacterial Genome

Insertion sequences (ISs) are small DNA segments that can move themselves within genomes. These mobile genetic elements (MGEs) seem to play an essential role in the rearrangement and evolution of prokaryotic genomes, but tools that discover ISs efficiently and accurately are still few and not fully precise. Two main factors strongly affect IS discovery: gene annotation and functionality prediction. Indeed, specific genes encoding enzymes called "transposases" are responsible for producing and catalyzing such transposition, but there is currently no fully accurate method to decide whether a given predicted gene is a real transposase. For this reason, the authors of this article aim to design a novel pipeline for IS detection and classification that embeds the most recent tools available in this field, namely OASIS (Optimized Annotation System for Insertion Sequence) and the ISFinder database (an up-to-date and accurate repository of known insertion sequences). Because this detection depends on predicted coding sequences, the proposed pipeline also encompasses various bacterial gene annotation tools (namely Prokka, BASys, and Prodigal). A complete IS detection and classification pipeline is then proposed and tested on a set of 23 complete genomes of Pseudomonas aeruginosa. This pipeline can also be used to assess the performance of annotation tools, which has led us to conclude that Prodigal is the best software for IS prediction. A deeper study of IS elements in P. aeruginosa was then conducted, leading to the conclusion that closely related genomes within this species also have similar numbers of IS families and groups.
