Featured Research

Genomics

BAUM: A DNA Assembler by Adaptive Unique Mapping and Local Overlap-Layout-Consensus

Genome assembly from high-throughput sequencing (HTS) reads is a fundamental yet challenging computational problem. An intrinsic challenge is the uncertainty caused by widespread repetitive elements. Here we get around the uncertainty using the notion of uniquely mapped (UM) reads, which motivated the design of a new assembler, BAUM. It consists mainly of two types of iterations. The first type constructs initial contigs from a reference, say a genome of a species that could be quite distant, by adaptive read mapping, filtration by the reference's unique regions, and reference updating. A statistical test is proposed to split the layouts at possible structural variation sites. The second type includes mapping, scaffolding/contig-extension, and contig merging. We extend each contig by locally assembling the reads whose mates are uniquely mapped to an end of the contig. Instead of the de Bruijn graph method, we adopt the overlap-layout-consensus (OLC) paradigm. The OLC is implemented with parallel computation and has linear complexity with respect to the number of contigs. Adjacent extended contigs are merged if their alignment is confirmed by the adjusted gap distance. Throughout the assembly, the mapping criterion is selected by probabilistic calculations. These innovations can complement existing de novo assemblers. Applying this method to the assembly of the wild rice Oryza longistaminata genome, we achieved a much improved contig N50 of 18.8k compared with other assemblers. The assembly was further validated by contigs constructed from an independent library of long 454 reads.
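
The contig N50 quoted above is a standard summary of assembly contiguity: the length of the contig at which the running total, taken from the longest contig downward, first reaches half of the total assembled length. The following is a minimal sketch of that calculation, not code from BAUM; the function name `n50` and the example lengths are illustrative assumptions.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths (0 if empty)."""
    # Sort from longest to shortest so the running total grows fastest first.
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        # The first contig that pushes the running total to at least
        # half of the assembly length defines the N50.
        if running >= half_total:
            return length
    return 0  # only reached for an empty input


# Made-up contig lengths, for illustration only.
example_lengths = [80, 70, 50, 40, 30, 20]
print(n50(example_lengths))
```

A larger N50 means that more of the assembly sits in long contigs, which is why it is commonly used to compare assemblers on the same data.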

BOAssembler: a Bayesian Optimization Framework to Improve RNA-Seq Assembly Performance

High-throughput sequencing of RNA (RNA-Seq) can provide us with millions of short fragments of RNA transcripts from a sample. Recovering the original RNA transcripts from those fragments (RNA-Seq assembly) is still a difficult task. For example, RNA-Seq assembly tools typically require hyper-parameter tuning to achieve good performance on particular datasets. This kind of tuning is usually unintuitive and time-consuming. Consequently, users often resort to default parameters, which do not guarantee consistently good performance across datasets. Here we propose BOAssembler (this https URL), a framework that enables end-to-end automatic tuning of RNA-Seq assemblers, based on Bayesian Optimization principles. Experiments show that this data-driven approach is effective in improving the overall assembly performance. The approach would be helpful for downstream (e.g. gene, protein, cell) analysis, and more broadly, for future bioinformatics benchmark studies.

Bacterial protein interaction networks: connectivity is ruled by gene conservation, essentiality and function

Protein-protein interaction (PPI) networks are the backbone of all processes in living cells. In this work we relate conservation, essentiality and functional repertoire of a gene to the connectivity k (i.e., the number of interaction links) of the corresponding protein in the PPI network. On a set of 42 bacterial genomes of different sizes, and with reasonably separated evolutionary trajectories, we investigate three issues: i) whether the distribution of connectivities changes between PPI subnetworks of essential and nonessential genes; ii) how gene conservation, measured both by the evolutionary retention index (ERI) and by evolutionary pressures, is related to the connectivity of the corresponding protein; iii) how PPI connectivities are modulated by evolutionary and functional relationships, as represented by the Clusters of Orthologous Genes (COGs). We show that conservation, essentiality and functional specialisation of genes constrain the connectivity of the corresponding proteins in bacterial PPI networks. In particular, we isolate a core of highly connected proteins (with connectivities k ≥ 40), which is ubiquitous among the species considered here -- though mostly visible in the degree distributions of bacteria with small genomes (fewer than 1000 genes). The genes that belong to this highly connected core are conserved, essential and, in most cases, belong to the COG cluster J, related to ribosomal functions and to the processing of genetic information.
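
To make the connectivity k concrete, the sketch below counts the number of distinct interaction partners of each protein in an undirected edge list and then selects those at or above a threshold such as the k ≥ 40 mentioned above. It is an illustration under stated assumptions, not the authors' pipeline: the function names, the edge-list input format, and the choice to ignore self-interactions and duplicate edges are all assumptions made here.

```python
from collections import defaultdict


def connectivity(edges):
    """Map each protein to its connectivity k (number of distinct partners)."""
    neighbours = defaultdict(set)
    for a, b in edges:
        if a == b:
            continue  # assumption: self-interactions do not count towards k
        # Sets make repeated or reversed edges count only once.
        neighbours[a].add(b)
        neighbours[b].add(a)
    return {protein: len(partners) for protein, partners in neighbours.items()}


def highly_connected(edges, k_min=40):
    """Return the proteins whose connectivity is at least k_min."""
    return {p for p, k in connectivity(edges).items() if k >= k_min}


# Toy edge list with hypothetical protein names; ("B", "A") repeats ("A", "B").
toy_edges = [("A", "B"), ("A", "C"), ("B", "A"), ("C", "D")]
print(connectivity(toy_edges))
print(highly_connected(toy_edges, k_min=2))
```

On real networks the same degree counts would be compared between essential and nonessential genes, or across COG categories, as the abstract describes.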

Bacteriophage classification for assembled contigs using Graph Convolutional Network

Motivation: Bacteriophages (aka phages), which mainly infect bacteria, play key roles in the biology of microbes. Phages are the most abundant biological entities on the planet, and the number discovered so far is only the tip of the iceberg. Recently, many new phages have been revealed using high-throughput sequencing, particularly metagenomic sequencing. Compared to the fast accumulation of phage-like sequences, there is a serious lag in taxonomic classification of phages. The high diversity and abundance of phages, together with the limited number of known phages, pose great challenges for taxonomic analysis. In particular, alignment-based tools have difficulty in classifying the fast-accumulating contigs assembled from metagenomic data. Results: In this work, we present a novel semi-supervised learning model, named PhaGCN, to conduct taxonomic classification for phage contigs. In this learning model, we construct a knowledge graph by combining the DNA sequence features learned by a convolutional neural network (CNN) and the protein sequence similarity obtained from a gene-sharing network. Then we apply a graph convolutional network (GCN) to utilize both the labeled and unlabeled samples in training to enhance the learning ability. We tested PhaGCN on both simulated and real sequencing data. The results clearly show that our method competes favorably against available phage classification tools.

Bayesian estimation of Differential Transcript Usage from RNA-seq data

Next generation sequencing allows the identification of genes consisting of differentially expressed transcripts, a term which usually refers to changes in the overall expression level. A specific type of differential expression is differential transcript usage (DTU), which targets changes in the relative within-gene expression of a transcript. The contributions of this paper are to: (a) extend cjBitSeq, a previously introduced Bayesian model originally designed for identifying changes in overall expression levels, to the DTU context and (b) propose a Bayesian version of DRIMSeq, a frequentist model for inferring DTU. cjBitSeq is a read-based model and performs fully Bayesian inference by MCMC sampling on the space of latent states of each transcript per gene. BayesDRIMSeq is a count-based model and estimates the Bayes Factor of a DTU model against a null model using Laplace's approximation. The proposed models are benchmarked against the existing ones using a recent independent simulation study as well as a real RNA-seq dataset. Our results suggest that the Bayesian methods exhibit similar performance to DRIMSeq in terms of precision/recall but offer better calibration of the False Discovery Rate.
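
The distinction between a change in overall expression and a change in transcript usage can be illustrated with a few lines of arithmetic. The sketch below computes the within-gene relative usage of each transcript from raw counts; it is not cjBitSeq or BayesDRIMSeq, and the function name, transcript labels, and counts are made-up assumptions for illustration.

```python
def transcript_usage(counts_by_transcript):
    """Return each transcript's share of its gene's total count."""
    total = sum(counts_by_transcript.values())
    if total == 0:
        # Assumption: report zero usage when the gene has no reads at all.
        return {t: 0.0 for t in counts_by_transcript}
    return {t: c / total for t, c in counts_by_transcript.items()}


# Hypothetical counts for one gene with two transcripts in two conditions.
condition_a = {"t1": 30, "t2": 70}
condition_b = {"t1": 60, "t2": 140}   # overall expression doubled
condition_c = {"t1": 70, "t2": 30}    # same total as A, usage swapped

print(transcript_usage(condition_a))
print(transcript_usage(condition_b))
print(transcript_usage(condition_c))
```

In this toy example, conditions A and B differ in overall expression but have the same usage proportions, so there is no DTU between them, while A and C have the same total but different proportions, which is the kind of change DTU methods aim to detect.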

Biases in differential expression analysis of RNA-seq data: A matter of replicate type

In differential expression (DE) analysis of RNA-seq count data, it is known that genes with a larger read number are more likely to be differentially expressed. This bias has a profound effect on the subsequent Gene Ontology (GO) analysis by perturbing the ranks of gene-sets. Another known bias is that the commonly used parametric DE analysis methods (e.g., edgeR, DESeq and baySeq) tend to yield more DE genes as the sequencing depth is increased. We nevertheless show that these biases are in fact confined to data of the technical replicate type. We also show that GO or gene-set enrichment analysis methods applied to technical replicate data result in a considerable number of false positives. In conclusion, the current DE and enrichment analysis methods can be confidently used for biological replicate count data, while caution should be exercised when analysing technical replicate data.

Bipartite Community Structure of eQTLs

Genome Wide Association Studies (GWAS) and eQTL analyses have produced a large and growing number of genetic associations linked to a wide range of human phenotypes. As of 2013, there were more than 11,000 SNPs associated with a trait as reported in the NHGRI GWAS Catalog. However, interpreting the functional roles played by these SNPs remains a challenge. Here we describe an approach that uses the inherent bipartite structure of eQTL networks to place SNPs into a functional context. Using genotyping and gene expression data from 163 lung tissue samples in a study of Chronic Obstructive Pulmonary Disease (COPD) we calculated eQTL associations between SNPs and genes and cast significant associations (FDR < 0.1) as links in a bipartite network. To our surprise, we discovered that the highly-connected "hub" SNPs within the network were devoid of disease-associations. However, within the network we identified 35 highly modular communities, which comprise groups of SNPs associated with groups of genes; 13 of these communities were significantly enriched for distinct biological functions (P < 5 × 10⁻⁴) including COPD-related functions. Further, we found that GWAS-significant SNPs were enriched at the cores of these communities, including previously identified GWAS associations for COPD, asthma, and pulmonary function, among others. These results speak to our intuition: rather than single SNPs influencing single genes, we see groups of SNPs associated with the expression of families of functionally related genes and that disease SNPs are associated with the perturbation of those functions. These methods are not limited in their application to COPD and can be used in the analysis of a wide variety of disease processes and other phenotypic traits.
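
The step of casting significant associations as links in a bipartite network can be sketched as follows. The abstract does not say which FDR procedure was used, so the Benjamini-Hochberg adjustment below is an assumption, as are the function names, the `(snp, gene, p_value)` input format, and the toy identifiers and p-values.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone adjusted values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def bipartite_edges(associations, fdr=0.1):
    """Keep (snp, gene) pairs whose adjusted p-value is below the FDR cutoff.

    `associations` is a list of (snp, gene, p_value) tuples (assumed format).
    """
    adjusted = benjamini_hochberg([p for _, _, p in associations])
    return [(snp, gene)
            for (snp, gene, _), q in zip(associations, adjusted)
            if q < fdr]


# Hypothetical SNP-gene tests; identifiers and p-values are made up.
tests = [("rs1", "G1", 0.01), ("rs1", "G2", 0.04),
         ("rs2", "G2", 0.03), ("rs3", "G3", 0.5)]
print(bipartite_edges(tests))
```

Each retained pair links a node on the SNP side to a node on the gene side, so the resulting edge list is bipartite by construction and can be passed to a community detection method of choice.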

Bivariate Causal Discovery and its Applications to Gene Expression and Imaging Data Analysis

The mainstream of research in genetics, epigenetics and imaging data analysis focuses on statistical association or exploring statistical dependence between variables. Despite significant progress in genetic research, understanding the etiology and mechanism of complex phenotypes remains elusive. Using association analysis as the major analytical platform for complex data analysis is a key issue that hampers the theoretical development of genomic science and its application in practice. Causal inference is an essential component for the discovery of mechanistic relationships among complex phenotypes. Many researchers suggest making the transition from association to causation. Despite its fundamental role in science, engineering and biomedicine, the traditional methods for causal inference require at least three variables. However, quantitative genetic analysis such as QTL, eQTL, mQTL, and genomic-imaging data analysis requires exploring the causal relationships between two variables. This paper will focus on bivariate causal discovery. We will introduce independence of cause and mechanism (ICM) as a basic principle for causal inference, and algorithmic information theory and the additive noise model (ANM) as major tools for bivariate causal discovery. Large-scale simulations will be performed to evaluate the feasibility of the ANM for bivariate causal discovery. To further evaluate its performance for causal inference, the ANM will be applied to the construction of gene regulatory networks. The ANM will also be applied to trait-imaging data analysis to illustrate three scenarios: causation with association, association without causation, and causation without association between two variables.

CAMIRADA: Cancer microRNA association discovery algorithm, a case study on breast cancer

In recent studies, non-protein-coding RNAs known as microRNAs have been identified as biomarkers for the early diagnosis and treatment of cancer, which can decrease cancer mortality. A microRNA may target hundreds or thousands of genes and a gene may regulate several microRNAs, so determining which microRNA is associated with which cancer is a big challenge. Many computational methods have been developed to detect associations between microRNAs and cancer, but more effort is needed to achieve higher accuracy. A growing body of research has shown that the relationship between microRNAs and TFs plays a significant role in the diagnosis of cancer. Therefore, we developed a new computational framework (CAMIRADA) to identify cancer-related microRNAs based on the relationship between microRNAs and disease genes (DG) in the protein network, the functional relationships between microRNAs and Transcription Factors (TF) on the co-expression network, and the relationship between microRNAs and the Differential Expression Gene (DEG) on the co-expression network. CAMIRADA was applied to assess breast cancer data from two databases, HMDD and miR2Disease. In this study, the AUC for the top 65 microRNAs in the list was 0.95, which was more accurate than similar methods used to detect microRNAs associated with cancer.

CD44 alternative splicing is a sensor of intragenic DNA methylation in tumors

DNA methylation (meDNA) is a suspected modulator of alternative splicing, while splicing in turn is involved in tumour formation nearly as frequently as DNA mutations. Yet, the impact of meDNA on tumorigenesis via its effect on splicing has not been thoroughly explored. Here, we find that HCT116 colon carcinoma cells inactivated for the DNA methylases DNMT1 and DNMT3b undergo a partial epithelial to mesenchymal transition (EMT) associated with alternative splicing of the CD44 transmembrane receptor. The skipping of CD44 variant exons is in part explained by altered expression or splicing of splicing and chromatin factors. A direct effect of meDNA on alternative splicing was supported by transient depletion of DNMT1 and the methyl-binding genes MBD1, MBD2, and MBD3. Yet, local changes in intragenic meDNA also altered recruitment of MBD1 protein and of the chromatin factor HP1γ, known to alter transcriptional pausing and alternative splicing decisions. We further tested whether the meDNA level has a sufficiently strong direct impact on the outcome of alternative splicing to have predictive value in the MCF10A model for breast cancer progression and in patients with acute lymphoblastic leukemia (B-ALL). We found that a small number of differentially spliced genes, mostly involved in splicing and signal transduction, are systematically correlated with local meDNA. Altogether, our observations suggest that, although DNA methylation has multiple avenues to alternative splicing, its indirect effect may also be mediated through alternative splicing isoforms of these sensors of meDNA.
