Featured Research

Genomics

Accurate self-correction of errors in long reads using de Bruijn graphs

New long-read sequencing technologies, like PacBio SMRT and Oxford Nanopore, can produce sequencing reads up to 50,000 bp long, but with an error rate of at least 15%. Reducing the error rate is necessary for subsequent utilisation of the reads in, e.g., de novo genome assembly. The error correction problem has been tackled either by aligning the long reads against each other or by a hybrid approach that uses the more accurate short reads produced by second-generation sequencing technologies to correct the long reads. We present an error correction method that uses long reads only. The method consists of two phases: first, we use an iterative alignment-free correction method based on de Bruijn graphs with increasing k-mer lengths, and second, the corrected reads are further polished using long-distance dependencies found through multiple alignments. According to our experiments, the proposed method is the most accurate of those relying on long reads only for read sets with high coverage. Furthermore, when the coverage of the read set is at least 75x, the throughput of the new method is at least 20% higher. LoRMA is freely available at this http URL.
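
As a concrete illustration of the first phase, the sketch below implements one simplified round of de Bruijn graph correction: k-mers are counted across the read set, and weak (low-abundance) k-mers in a read are replaced by their most abundant solid one-mismatch neighbour. It is a minimal stand-in, not LoRMA's actual algorithm; the function names and the solidity threshold are illustrative assumptions.

```python
# Minimal sketch of one round of de-Bruijn-graph-based read correction,
# in the spirit of (but not identical to) LoRMA's iterative approach.
# All parameter values and helper names are illustrative assumptions.
from collections import Counter
from itertools import product

def kmer_counts(reads, k):
    """Count all k-mers across the read set."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def correct_read(read, counts, k, solid_threshold=3):
    """Replace each weak (low-abundance) k-mer by its most abundant
    solid one-mismatch neighbour, a simplified form of graph-path
    correction."""
    read = list(read)
    for i in range(len(read) - k + 1):
        kmer = "".join(read[i:i + k])
        if counts[kmer] >= solid_threshold:
            continue  # already solid, nothing to fix
        best, best_count = None, solid_threshold
        for j, base in product(range(k), "ACGT"):
            cand = kmer[:j] + base + kmer[j + 1:]
            if counts[cand] >= best_count:
                best, best_count = cand, counts[cand]
        if best is not None:
            read[i:i + k] = best  # splice in the corrected k-mer
    return "".join(read)

# Iterating with increasing k (e.g. 19, 25, 31) lets early rounds fix
# dense errors and later rounds exploit longer, more specific context.
```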

Read more
Genomics

Accurate, Fast and Lightweight Clustering of de novo Transcriptomes using Fragment Equivalence Classes

Motivation: De novo transcriptome assembly of non-model organisms is the first major step for many RNA-seq analysis tasks. Current methods for de novo assembly often report a large number of contiguous sequences (contigs), which may be fractured and incomplete sequences instead of full-length transcripts. Dealing with a large number of such contigs can slow and complicate downstream analysis. Results: We present a method for clustering contigs from de novo transcriptome assemblies based upon the relationships exposed by multi-mapping sequencing fragments. Specifically, we cast the problem of clustering contigs as one of clustering a sparse graph that is induced by equivalence classes of fragments that map to subsets of the transcriptome. Leveraging recent developments in efficient read mapping and transcript quantification, we have developed RapClust, a tool implementing this approach that is capable of accurately clustering most large de novo transcriptomes in a matter of minutes, while simultaneously providing accurate estimates of expression for the resulting clusters. We compare RapClust against a number of tools commonly used for de novo transcriptome clustering. Using de novo assemblies of organisms for which reference genomes are available, we assess the accuracy of these different methods in terms of the quality of the resulting clusterings, and the concordance of differential expression tests with those based on ground truth clusters. We find that RapClust produces clusters of comparable or better quality than existing state-of-the-art approaches, and does so substantially faster. RapClust also confers a large benefit in terms of space usage, as it produces only succinct intermediate files - usually on the order of a few megabytes - even when processing hundreds of millions of reads.
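
To make the equivalence-class idea concrete, here is a minimal sketch of how such classes can induce a contig graph that is then clustered. RapClust itself applies a dedicated graph clustering step with weighted edges; the connected-components shortcut, the input layout, and the weight threshold below are simplifying assumptions.

```python
# Minimal sketch: cluster contigs from fragment equivalence classes.
# An equivalence class is the set of contigs a fragment maps to equally
# well; contigs that co-occur in a class likely come from the same
# transcript. Input layout and thresholds are illustrative assumptions.
from collections import defaultdict

def cluster_contigs(equiv_classes, min_weight=2):
    """equiv_classes: list of (contig_tuple, fragment_count) pairs."""
    # Accumulate edge weights between co-occurring contigs.
    weight = defaultdict(int)
    for contigs, count in equiv_classes:
        for i in range(len(contigs)):
            for j in range(i + 1, len(contigs)):
                edge = tuple(sorted((contigs[i], contigs[j])))
                weight[edge] += count

    # Union-find over edges that pass the weight threshold.
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (a, b), w in weight.items():
        if w >= min_weight:
            parent[find(a)] = find(b)

    clusters = defaultdict(set)
    for contigs, _ in equiv_classes:
        for c in contigs:
            clusters[find(c)].add(c)
    return list(clusters.values())
```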

Read more
Genomics

Addressing Ancestry Disparities in Genomic Medicine: A Geographic-aware Algorithm

With declining sequencing costs, a promising and affordable tool is emerging in cancer diagnostics: genomics. By using association studies, genomic variants that predispose patients to specific cancers can be identified, while by using tumor genomics, cancer types can be characterized for targeted treatment. However, a severe disparity is rapidly emerging in this new area of precision cancer diagnosis and treatment planning, one which separates a few genetically well-characterized populations (predominantly European) from all other global populations. Here we discuss the problem of population-specific genetic associations, which is driving this disparity, and present a novel solution, coordinate-based local ancestry, to help address it. We demonstrate our boosting-based method on whole genome data from divergent groups across Africa and in the process observe signals that may stem from the transcontinental Bantu expansion.
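
The sketch below illustrates the general idea of coordinate-based local ancestry, under the assumption (our reading, not the authors' published code) that a boosted model regresses each genomic window directly onto the latitude and longitude of reference individuals. All data shapes and model parameters are illustrative.

```python
# Illustrative sketch only: regressing the geographic origin of one
# genomic window onto map coordinates with gradient boosting, the
# general idea behind a coordinate-based (rather than label-based)
# local ancestry assignment. Data layout and parameters are assumed.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.multioutput import MultiOutputRegressor

# X: (n_samples, n_snps_in_window) genotypes coded 0/1/2 for one window
# y: (n_samples, 2) latitude/longitude of each reference individual
rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(200, 50)).astype(float)
y = rng.uniform([-35, -20], [15, 50], size=(200, 2))  # toy coordinates

model = MultiOutputRegressor(
    GradientBoostingRegressor(n_estimators=100, max_depth=3))
model.fit(X, y)

# For a query haplotype window, the model emits a point on the map
# instead of a discrete population label.
print(model.predict(X[:1]))  # -> [[lat, lon]]
```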

Read more
Genomics

AirLift: A Fast and Comprehensive Technique for Remapping Alignments between Reference Genomes

As genome sequencing tools and techniques improve, researchers are able to incrementally assemble more accurate reference genomes, which enable more sensitive read mapping and downstream analysis such as variant calling. A more sensitive downstream analysis is critical for a better understanding of the genome donor (e.g., health characteristics). Therefore, read sets from sequenced samples should ideally be mapped to the latest available reference genome that represents the most relevant population. Unfortunately, the increasingly large amount of available genomic data makes it prohibitively expensive to fully re-map each read set to its respective reference genome every time the reference is updated. There are several tools that attempt to accelerate the process of updating a read data set from one reference to another (i.e., remapping). However, if a read maps to a region in the old reference that does not appear with a reasonable degree of similarity in the new reference, the read cannot be remapped. We find that, as a result of this drawback, a significant portion of annotations are lost when using state-of-the-art remapping tools. To address this major limitation in existing tools, we propose AirLift, a fast and comprehensive technique for remapping alignments from one genome to another. Compared to the state-of-the-art baseline of fully mapping reads, AirLift reduces 1) the number of reads that need to be fully mapped to the new reference by up to 99.99% and 2) the overall execution time to remap read sets between two reference genome versions by 6.7x, 6.6x, and 2.8x for large (human), medium (C. elegans), and small (yeast) reference genomes, respectively. We validate our remapping results with GATK and find that AirLift provides similar accuracy in identifying ground truth SNP and INDEL variants as the baseline of fully mapping a read set.
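
Conceptually, the speedup comes from translating alignments that fall in regions shared between the two references and paying for full mapping only elsewhere. The sketch below captures this split; the region representation, the Read type, and the stubbed full_mapper are hypothetical simplifications of what AirLift actually does.

```python
# Schematic sketch of the remapping idea: translate alignments that
# fall in regions shared verbatim between the two references, and fall
# back to full re-mapping only for the rest. Region extraction and the
# mapper call are stubbed; interval handling is simplified.
from collections import namedtuple

Read = namedtuple("Read", "name pos length")

def remap(reads, shared_regions, full_mapper):
    """shared_regions: list of (old_start, old_end, new_start) tuples
    describing intervals identical in both reference versions."""
    remapped, needs_full_mapping = [], []
    for read in reads:
        placed = False
        for old_start, old_end, new_start in shared_regions:
            if old_start <= read.pos and read.pos + read.length <= old_end:
                # Same sequence context: shift coordinates, keep alignment.
                new_pos = new_start + (read.pos - old_start)
                remapped.append((read, new_pos))
                placed = True
                break
        if not placed:
            needs_full_mapping.append(read)
    # Only this (ideally tiny) remainder pays the cost of full mapping.
    remapped += full_mapper(needs_full_mapping)
    return remapped
```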

Read more
Genomics

Algorithmic Methods to Infer the Evolutionary Trajectories in Cancer Progression

The genomic evolution inherent to cancer has brought a renewed focus on voluminous next-generation sequencing (NGS) data and on machine learning for inferring explanatory models of how (epi)genomic events are choreographed in cancer initiation and development. However, despite the increasing availability of multiple additional -omics data, this quest has been frustrated by various theoretical and technical hurdles, mostly stemming from the dramatic heterogeneity of the disease. In this paper, we build on our recent work on the "selective advantage" relation among driver mutations in cancer progression and investigate its applicability to the modeling problem at the population level. Here, we introduce PiCnIc (Pipeline for Cancer Inference), a versatile, modular and customizable pipeline to extract ensemble-level progression models from cross-sectional sequenced cancer genomes. The pipeline has many translational implications as it combines state-of-the-art techniques for sample stratification, driver selection, identification of fitness-equivalent exclusive alterations and progression model inference. We demonstrate PiCnIc's ability to reproduce much of the current knowledge on colorectal cancer progression, as well as to suggest novel experimentally verifiable hypotheses.
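
Methods in this family typically screen candidate "selective advantage" edges with Suppes-style conditions: an event that is more frequent than a second event, and whose presence raises the probability of that event, is a candidate ancestor of it. The sketch below shows such a screen over a binary mutation matrix; the thresholds and the absence of bootstrap testing are simplifications relative to the full pipeline.

```python
# Minimal sketch of a "selective advantage" screen: event a is a
# candidate ancestor of event b when a is more frequent than b
# (temporal priority) and observing a raises the probability of b
# (probability raising). No statistical testing, for brevity.
import numpy as np

def selective_advantage_edges(M, genes, eps=1e-9):
    """M: binary (samples x genes) mutation matrix."""
    p = M.mean(axis=0)  # marginal frequency of each event
    edges = []
    for a in range(M.shape[1]):
        for b in range(M.shape[1]):
            if a == b:
                continue
            has_a = M[:, a] == 1
            p_b_given_a = M[has_a, b].mean() if has_a.any() else 0.0
            p_b_given_not_a = M[~has_a, b].mean() if (~has_a).any() else 0.0
            # Temporal priority and probability raising must both hold.
            if p[a] > p[b] and p_b_given_a > p_b_given_not_a + eps:
                edges.append((genes[a], genes[b]))
    return edges
```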

Read more
Genomics

Aligning 415 519 proteins in less than two hours on a PC

The rapid development of modern sequencing platforms has enabled unprecedented growth of protein family databases. The abundance of sets composed of hundreds of thousands of sequences is a great challenge for multiple sequence alignment algorithms. In this article we introduce FAMSA, a new progressive algorithm designed for fast and accurate alignment of thousands of protein sequences. Its features include the utilisation of the longest common subsequence measure for determining pairwise similarities, a novel method of evaluating gap costs, and a new iterative refinement scheme. Importantly, its implementation is highly optimised and parallelised to make the most of modern computer platforms. Thanks to the above, quality indicators, namely sum-of-pairs and total-column scores, show FAMSA to be superior to competing algorithms such as Clustal Omega or MAFFT for datasets exceeding a few thousand sequences. This quality comes without compromising time and memory requirements, which are an order of magnitude lower than those of existing solutions. For example, a family of 415 519 sequences was analysed in less than two hours and required only 8 GB of RAM. FAMSA is freely available at this http URL.
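
The pairwise similarity that drives guide-tree construction is based on the longest common subsequence (LCS). FAMSA computes LCS lengths with highly optimised bit-parallel code; the quadratic dynamic program below is a didactic equivalent, and normalising by the shorter sequence is one plausible choice rather than necessarily the paper's exact formula.

```python
# Didactic LCS-based pairwise similarity for guide-tree construction.
# FAMSA uses a bit-parallel LCS algorithm; this is the plain dynamic
# program, and the normalisation is an illustrative assumption.
def lcs_length(s, t):
    prev = [0] * (len(t) + 1)
    for a in s:
        curr = [0]
        for j, b in enumerate(t, 1):
            curr.append(prev[j - 1] + 1 if a == b
                        else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def similarity(s, t):
    # Identical sequences score 1.0; unrelated ones score near 0.
    return lcs_length(s, t) / min(len(s), len(t))

print(similarity("MKVLATT", "MKIATT"))  # -> 5/6 = 0.833...
```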

Read more
Genomics

Alpha7 nicotinic acetylcholine receptor signaling modulates the ovine fetal brain astrocyte transcriptome in response to endotoxin

Neuroinflammation in utero may result in lifelong neurological disabilities. Astrocytes play a pivotal role, but the mechanisms are poorly understood. No early postnatal treatment strategies exist to enhance the neuroprotective potential of astrocytes. We hypothesized that agonism of the alpha7 nicotinic acetylcholine receptor (alpha7nAChR) in fetal astrocytes would augment their neuroprotective transcriptome profile, while antagonistic stimulation of alpha7nAChR would achieve the opposite. Using an in vivo - in vitro model of developmental programming of neuroinflammation induced by lipopolysaccharide (LPS), we validated this hypothesis in primary fetal sheep astrocyte cultures re-exposed to LPS in the presence of a selective alpha7nAChR agonist or antagonist. Our RNA-seq findings show that the pro-inflammatory astrocyte transcriptome phenotype acquired in vitro by LPS stimulation is reversed by agonistic alpha7nAChR stimulation. Conversely, antagonistic alpha7nAChR stimulation potentiates the pro-inflammatory astrocytic transcriptome phenotype. Furthermore, we conduct a secondary transcriptome analysis against the identical alpha7nAChR experiments in fetal sheep primary microglia cultures and discuss the implications for neurodevelopment.

Read more
Genomics

An Extension of Deep Pathway Analysis: A Pathway Route Analysis Framework Incorporating Multi-dimensional Cancer Genomics Data

Recent breakthroughs in cancer research have come via the up-and-coming field of pathway analysis. By applying statistical methods to prior known gene and protein regulatory information, pathway analysis provides a meaningful way to interpret genomic data. While many gene/protein regulatory relationships have been studied, never before has such a significant amount of data been made available in the organized form of gene/protein regulatory networks and pathways. However, pathway analysis research is still in its infancy, especially when applied to practical problems. In this paper we propose a new method of studying biological pathways, one that cross-analyzes mutation information, transcriptome and proteomics data. Using this outcome, we identify routes of aberrant pathways potentially responsible for the etiology of disease. Each pathway route is encoded as a Bayesian network, initialized with a sequence of conditional probabilities specifically designed to encode the directionality of the regulatory relationships in the pathways. Far more complex interactions in the pathways, such as phosphorylation and methylation, among others, can be modeled using this approach. The effectiveness of our model is demonstrated through its ability to distinguish real pathways from decoys on TCGA mRNA-seq, mutation, copy number variation and phosphorylation data for both breast cancer and ovarian cancer studies. The majority of the distinguished pathways can be confirmed in the biological literature. Moreover, the proportion of correctly identified pathways is higher than in previous work that incorporated only mRNA-seq and mutation data for breast cancer patients. Consequently, such an in-depth pathway analysis incorporating more diverse data can improve the accuracy of perturbed pathway detection.
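
The following toy sketch shows how a single pathway route can be scored as a chain-shaped Bayesian network whose conditional probability tables encode regulatory direction. The probabilities and the binarised-activity input are illustrative assumptions; the paper designs its tables per interaction type and learns from multi-dimensional data.

```python
# Toy sketch of scoring one pathway route as a chain-shaped Bayesian
# network. The table's asymmetry encodes an activation edge; the
# numbers are illustrative, not the paper's parameters.
import math

# P(child_active | parent state): parent on -> 0.9, parent off -> 0.1
ACTIVATION = {1: 0.9, 0: 0.1}

def route_log_likelihood(route_states, p_root_active=0.5):
    """route_states: list of 0/1 activity calls (from binarised
    mRNA/protein/phospho data) along one route, root first."""
    root = route_states[0]
    ll = math.log(p_root_active if root else 1 - p_root_active)
    for parent, child in zip(route_states, route_states[1:]):
        p_child_active = ACTIVATION[parent]
        ll += math.log(p_child_active if child else 1 - p_child_active)
    return ll

# A coherent cascade scores higher than an incoherent one, which is
# what lets real routes be separated from decoys:
print(route_log_likelihood([1, 1, 1, 1]))  # coherent activation
print(route_log_likelihood([1, 0, 1, 0]))  # incoherent, much lower
```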

Read more
Genomics

An Immune-related lncRNA Model for the Prognosis of SKCM Patients Based on Cox Regression and Co-expression Analysis

SKCM (skin cutaneous melanoma) is the most dangerous form of skin cancer: it is highly malignant and the leading cause of skin cancer deaths. It responds poorly to radiotherapy and chemotherapy, so mortality is high. Because of its complex molecular and cellular heterogeneity, existing prediction models of skin cancer risk are not ideal. In this study, we developed an immune-related lncRNA model to predict the prognosis of patients with SKCM. We screened TCGA for differentially expressed SKCM-related lncRNAs and identified immune-related lncRNAs and lncRNA-related mRNAs using a co-expression approach. Through univariate and multivariate analyses, an immune-related lncRNA model was established to analyze the prognosis of SKCM patients. A 4-lncRNA skin cancer prediction model was constructed, comprising MIR155HG, AL137003.2, AC011374.2, and AC009495.2. According to the model, SKCM samples were divided into a high-risk group and a low-risk group, and the survival of the two groups was predicted over 30 years. The area under the ROC curve is 0.749, indicating good predictive performance. We constructed a 4-lncRNA model to predict the prognosis of patients with SKCM, suggesting that these lncRNAs may play a unique role in the carcinogenesis of SKCM.
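
Here is a sketch of the modelling recipe, using the lifelines library as an assumed stand-in for the authors' unstated toolchain: fit a multivariate Cox model on the four reported lncRNAs, derive a per-patient risk score, and split the cohort at the median score. The synthetic data only demonstrates the mechanics.

```python
# Sketch of a 4-lncRNA Cox risk model with lifelines (an assumed
# stand-in; the authors' exact toolchain is not stated). Real input
# would be TCGA-SKCM expression plus follow-up data.
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(1)
n = 100  # toy cohort size
df = pd.DataFrame(
    rng.lognormal(size=(n, 4)),
    columns=["MIR155HG", "AL137003.2", "AC011374.2", "AC009495.2"])
df["time"] = rng.exponential(1000, n)   # follow-up time
df["event"] = rng.integers(0, 2, n)     # 1 = death observed

# Multivariate Cox regression over the four lncRNAs.
cph = CoxPHFitter()
cph.fit(df, duration_col="time", event_col="event")

# Risk score per patient; median split gives high/low-risk groups.
risk = cph.predict_partial_hazard(df)
df["group"] = np.where(risk > risk.median(), "high-risk", "low-risk")
# Kaplan-Meier curves per group and a time-dependent ROC (AUC = 0.749
# in the paper) would follow from here.
```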

Read more
Genomics

An Improved Filtering Algorithm for Big Read Datasets

For single-cell or metagenomic sequencing projects, it is necessary to sequence with a very high mean coverage in order to make sure that all parts of the sample DNA get covered by the reads produced. This leads to huge datasets with lots of redundant data. Filtering this data prior to assembly is advisable. Titus Brown et al. (2012) presented the algorithm Diginorm for this purpose, which filters reads based on the abundance of their k-mers. We present Bignorm, a faster and quality-conscious read filtering algorithm. An important new feature is the use of phred quality scores together with a detailed analysis of the k-mer counts to decide which reads to keep. With recommended parameters, we remove a median of 97.15% of the reads while keeping the mean phred score of the filtered dataset high. Using the SPAdes assembler, we produce assemblies of high quality from these filtered datasets in a fraction of the time needed for an assembly from the datasets filtered with Diginorm. We conclude that read filtering is a practical method for reducing read data and for speeding up the assembly process. Our Bignorm algorithm allows assemblies of competitive quality in comparison to Diginorm, while being much faster. Bignorm is available for download at this https URL
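
The sketch below shows a simplified streaming filter in the Diginorm/Bignorm spirit: a read is kept only if it still contributes enough k-mers not yet seen often, counting only k-mers whose bases clear a phred cutoff. All thresholds are illustrative; Bignorm's actual decision rule analyses the k-mer count profile in more detail.

```python
# Simplified streaming read filter combining k-mer abundance with
# phred quality scores. Thresholds are illustrative assumptions.
from collections import Counter

def filter_reads(reads, k=20, min_phred=20, abundance_cap=20,
                 min_novel=5):
    """reads: iterable of (sequence, phred_scores) pairs."""
    counts = Counter()
    for seq, quals in reads:
        novel = 0
        kmers = []
        for i in range(len(seq) - k + 1):
            if min(quals[i:i + k]) < min_phred:
                continue  # skip k-mers containing low-quality calls
            kmer = seq[i:i + k]
            kmers.append(kmer)
            if counts[kmer] < abundance_cap:
                novel += 1  # this k-mer is still under-represented
        if novel >= min_novel:
            counts.update(kmers)  # accept the read, remember its k-mers
            yield seq, quals
        # Otherwise the read is redundant and is dropped.
```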

Read more
