Featured Research

Genomics

Combining exome and gene expression datasets in one graphical model of disease to empower the discovery of disease mechanisms

Identifying genes associated with complex human diseases is one of the main challenges of human genetics and computational medicine. To address this challenge, millions of genetic variants are screened to identify the few that matter. To increase the power of identifying genes associated with diseases and to account for other potential sources of protein function aberrations, we propose a novel factor-graph-based model, where much of the biological knowledge is incorporated through factors and priors. Our extensive simulations show that our method has superior sensitivity and precision compared to variant-aggregating and differential expression methods. Our integrative approach identified important genes in breast cancer, including genes that had coding aberrations in some patients and regulatory abnormalities in others, emphasizing the importance of data integration to explain the disease in a larger number of patients.

Genomics

Combining human cell line transcriptome analysis and Bayesian inference to build trustworthy machine learning models for prediction of animal toxicity in drug development

Biomedical data, particularly in the field of genomics, has characteristics which make it challenging for machine learning applications - it can be sparse, high dimensional and noisy. Biomedical applications also present challenges to model selection - whilst powerful, accurate predictions are necessary, they alone are not sufficient for a model to be deemed useful. Due to the nature of the predictions, a model must also be trustworthy and transparent, empowering a practitioner with confidence that its use is appropriate and reliable. In this paper, we propose that this can be achieved through the use of judiciously built feature sets coupled with Bayesian models, specifically Gaussian processes. We apply Gaussian processes to drug discovery, using inexpensive transcriptomic profiles from human cell lines to predict animal kidney and liver toxicity after treatment with specific chemical compounds. This approach has the potential to reduce invasive and expensive animal testing during clinical trials if in vitro human cell line analysis can accurately predict model animal phenotypes. We compare results across a range of feature sets and models, to highlight model importance for medical applications.
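
As a rough illustration of why a Gaussian process suits this setting, the sketch below computes both a prediction and its uncertainty from a handful of training points. It is a minimal pure-Python sketch, not the authors' model: the RBF kernel, the noise level, the function names, and the toy one-dimensional "expression feature" and "toxicity score" values are all assumptions made for illustration.

```python
import math

def rbf(a, b, length_scale=1.0):
    """Squared-exponential (RBF) kernel between two feature vectors."""
    sq = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-0.5 * sq / length_scale ** 2)

def cholesky(A):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(A[i][i] - s)
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]
    return L

def solve_lower(L, b):
    """Forward substitution: solve L y = b."""
    y = []
    for i in range(len(b)):
        y.append((b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i])
    return y

def solve_lower_transposed(L, y):
    """Back substitution: solve L^T x = y."""
    n = len(y)
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(L[k][i] * x[k] for k in range(i + 1, n))
        x[i] = (y[i] - s) / L[i][i]
    return x

def gp_predict(X, y, x_new, noise=1e-2):
    """Posterior mean and variance of the latent function at x_new."""
    K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(X)]
         for i, a in enumerate(X)]
    L = cholesky(K)
    alpha = solve_lower_transposed(L, solve_lower(L, y))  # K^{-1} y
    k_star = [rbf(a, x_new) for a in X]
    mean = sum(k * a for k, a in zip(k_star, alpha))
    v = solve_lower(L, k_star)
    var = rbf(x_new, x_new) - sum(t * t for t in v)
    return mean, var

# Hypothetical toy data: one expression feature per sample, one toxicity score.
X = [[0.0], [1.0], [2.0]]
y = [0.0, 1.0, 0.0]
mean_near, var_near = gp_predict(X, y, [1.0])   # close to the training data
mean_far, var_far = gp_predict(X, y, [10.0])    # far from any training point
print(mean_near, var_near)
print(mean_far, var_far)
```

The variance is small near observed samples and returns to the prior variance far from them, which is the kind of signal a practitioner could use to judge whether a prediction should be trusted.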

Genomics

Comparative genomic analysis of the human gut microbiome reveals a broad distribution of metabolic pathways for the degradation of host-synthetized mucin glycans

The colonic mucus layer is a dynamic and complex structure formed by secreted and transmembrane mucins, which are high-molecular-weight and heavily glycosylated proteins. Colonic mucus consists of a loose outer layer and a dense epithelium-attached layer. The outer layer is inhabited by various representatives of the human gut microbiota (HGM). Glycans of the colonic mucus can be used by the HGM as a source of carbon and energy when dietary fibers are not sufficiently available. Here, we analyzed 397 individual HGM genomes to identify pathways for the cleavage of host-synthetized mucin glycans to monosaccharides as well as for the catabolism of the derived monosaccharides. Our key results are as follows: (i) Genes for the cleavage of mucin glycans were found in 86% of the analyzed genomes, whereas genes for the catabolism of derived monosaccharides were found in 89% of the analyzed genomes. (ii) Comparative genomic analysis identified four alternative forms of the monosaccharide-catabolizing enzymes and four alternative forms of monosaccharide transporters. (iii) Eighty-five percent of the analyzed genomes may be involved in exchange pathways for the monosaccharides derived from cleaved mucin glycans. (iv) The analyzed genomes demonstrated different abilities to degrade known mucin glycans. Overall, the ability to degrade at least one type of mucin glycan was predicted for 81% of the analyzed genomes. (v) Eighty-two percent of the analyzed genomes can form mutualistic pairs that are able to degrade mucin glycans that neither member of the pair can degrade alone. Taken together, these findings provide further insight into the inter-microbial communications of the HGM as well as into host-HGM interactions.
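
The notion of a mutualistic pair in result (v) can be sketched with plain set operations: a pair qualifies for a glycan when the union of its enzyme repertoires covers the glycan's requirements but neither genome does so alone. This is only an illustration under assumed inputs; the genome names, the glycan name, and the enzyme-to-glycan mapping below are hypothetical and are not taken from the study.

```python
from itertools import combinations

def mutualistic_pairs(genomes, glycans):
    """Return (genome_a, genome_b, glycan) triples where the pair can
    degrade the glycan together but neither genome can do so alone."""
    pairs = []
    for a, b in combinations(sorted(genomes), 2):
        combined = genomes[a] | genomes[b]
        for name, required in glycans.items():
            alone = required <= genomes[a] or required <= genomes[b]
            if required <= combined and not alone:
                pairs.append((a, b, name))
    return pairs

# Hypothetical enzyme repertoires per genome (illustrative only).
genomes = {
    "genome_A": {"sialidase", "fucosidase"},
    "genome_B": {"galactosidase"},
    "genome_C": {"sialidase", "fucosidase", "galactosidase"},
}
# Hypothetical enzymes required to fully cleave one glycan.
glycans = {"glycan_1": {"sialidase", "galactosidase"}}

result = mutualistic_pairs(genomes, glycans)
print(result)  # [('genome_A', 'genome_B', 'glycan_1')]
```

Pairs involving genome_C are excluded because genome_C can already degrade the glycan on its own.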

Genomics

Comparing copy-number profiles under multi-copy amplifications and deletions

During cancer progression, malignant cells accumulate somatic mutations that can lead to genetic aberrations. In particular, evolutionary events akin to segmental duplications or deletions can alter the copy-number profile (CNP) of a set of genes in a genome. Our aim is to compute the evolutionary distance between two cells for which only CNPs are known. This asks for the minimum number of segmental amplifications and deletions needed to turn one CNP into another. This was recently formalized into a model where each event is assumed to alter a copy-number by 1 or −1, even though these events can affect large portions of a chromosome. We propose a general cost framework where an event can modify the copy-number of a gene by larger amounts. We show that any cost scheme that allows segmental deletions of arbitrary length makes computing the distance strongly NP-hard. We then devise a factor-2 approximation algorithm for the problem when copy-numbers are non-zero and provide an implementation called cnp2cnp. We evaluate our approach experimentally by reconstructing simulated cancer phylogenies from the pairwise distances inferred by cnp2cnp and compare it against two alternatives, namely the MEDICC distance and the Euclidean distance. The experimental results show that our distance yields more accurate phylogenies on average than these alternatives if the given CNPs are error-free, but that the MEDICC distance is slightly more robust against error in the data. In all cases, our experiments show that either our approach or the MEDICC approach should be preferred over the Euclidean distance.
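
To make the objects concrete, the sketch below represents a CNP as a list of integers, applies segmental events that change a contiguous run of genes by some amount, and computes the Euclidean baseline distance. It does not implement cnp2cnp or its approximation algorithm. The function names are hypothetical, and the rule that a gene with zero copies stays at zero is an assumption stated here for illustration.

```python
import math

def apply_event(cnp, start, end, delta):
    """Apply one segmental event to genes start..end (inclusive).

    delta > 0 is an amplification, delta < 0 a deletion.
    Assumption: a gene that has reached zero copies cannot be regained,
    and copy-numbers never drop below zero.
    """
    result = list(cnp)
    for i in range(start, end + 1):
        if result[i] > 0:
            result[i] = max(0, result[i] + delta)
    return result

def euclidean_distance(u, v):
    """Baseline distance that ignores the segmental structure of events."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

source = [2, 2, 2, 2]
step1 = apply_event(source, 1, 2, +2)   # amplify genes 1..2 by two copies
target = apply_event(step1, 3, 3, -1)   # delete one copy of gene 3
print(target)                            # [2, 4, 4, 1]
print(euclidean_distance(source, target))  # 3.0
```

Here two events transform the source into the target, while the Euclidean distance only reflects per-gene differences and carries no information about how many events were involved.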

Genomics

Comprehensive assessment of error correction methods for high-throughput sequencing data

The advent of DNA and RNA sequencing has revolutionized the study of genomics and molecular biology. Next-generation sequencing (NGS) technologies such as Illumina, Ion Torrent, and SOLiD have brought about a quick and cheap way to sequence genomes. Recently, third-generation sequencing (TGS) technologies such as PacBio and Oxford Nanopore Technology (ONT) have also been developed. Different technologies use different underlying methods for sequencing and are prone to different error rates. Though many tools exist for error correction of sequencing data from NGS and TGS methods, no standard method is yet available to evaluate the accuracy and effectiveness of these error-correction tools. In this study, we present a Software Package for Error Correction Tool Assessment on nuCLEic acid sequences (SPECTACLE), which provides comprehensive algorithms to evaluate error-correction methods for DNA and RNA sequencing on NGS and TGS platforms. We also present a compilation of sequencing datasets for Illumina, PacBio and ONT platforms that present challenging scenarios for error-correction tools. Using these datasets and SPECTACLE, we evaluate the performance of 23 different error-correction tools and present unique and helpful insights into their strengths and weaknesses. We hope that our methodology will standardize the evaluation of DNA and RNA error-correction tools in the future.
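
One common way to score an error-correction tool, sketched below, is to compare each base of the raw and corrected read against a known true sequence and count fixed errors, introduced errors, and missed errors. This is not SPECTACLE's algorithm; it is a simplified illustration that assumes substitution errors only (so all three sequences have the same length), and the function name and toy reads are hypothetical.

```python
def evaluate_correction(truth, raw, corrected):
    """Per-base evaluation of a corrected read against the true sequence.

    Assumes substitution errors only, so all sequences share one length.
    TP: erroneous base fixed; FP: correct base broken;
    FN: erroneous base still wrong; TN: correct base left correct.
    """
    if not (len(truth) == len(raw) == len(corrected)):
        raise ValueError("sequences must have equal length")
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for t, r, c in zip(truth, raw, corrected):
        if r != t:
            counts["TP" if c == t else "FN"] += 1
        else:
            counts["TN" if c == t else "FP"] += 1
    errors = counts["TP"] + counts["FN"]
    # Sensitivity: fraction of original errors that were fixed.
    counts["sensitivity"] = counts["TP"] / errors if errors else 0.0
    # Gain: net fraction of errors removed; negative if the tool adds errors.
    counts["gain"] = (counts["TP"] - counts["FP"]) / errors if errors else 0.0
    return counts

truth     = "ACGTACGT"
raw       = "ACCTACGA"   # errors at positions 2 and 7
corrected = "ACGTTCGA"   # fixes position 2, breaks position 4, misses 7
report = evaluate_correction(truth, raw, corrected)
print(report)
```

In this toy case the tool fixes one error and introduces another, so the gain is zero even though the sensitivity is 0.5, which shows why a single metric can be misleading.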

Genomics

Comprehensive overview and assessment of miRNA target prediction tools in human and Drosophila melanogaster

MicroRNAs (miRNAs) are small non-coding RNAs that control gene expression at the post-transcriptional level through complementary base pairing with the target mRNA, leading to mRNA degradation and blocked translation. Dysfunctions of these small regulatory molecules have been linked with the development and progression of several diseases. Therefore, it is necessary to reliably predict potential miRNA targets. A large number of computational prediction tools have been developed that provide a faster way to find putative miRNA targets, but their results are often inconsistent. Hence, finding a reliable, functional miRNA target is still a challenging task. Also, each tool uses different algorithms, and it is difficult for biologists to know which tool is the best choice for their study. This paper briefly describes the fundamentals of miRNA target prediction algorithms, discusses frequently used prediction tools, and assesses their performance using experimentally validated, high-confidence mature miRNAs and their targets for two organisms, human and Drosophila melanogaster. Tools supporting Drosophila melanogaster and tools supporting human were evaluated separately to find the best-performing tool for each organism. In the human dataset, TargetScan showed the best results, followed by miRmap and microT, whereas in the D. melanogaster dataset, microT showed the best performance, followed by TargetScan.
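
The base-pairing idea behind target prediction can be sketched as a search for sites in an mRNA that are the reverse complement of part of the miRNA. The example below uses the widely used simplification of matching only the "seed" (nucleotides 2 to 8 from the miRNA 5' end); the seed rule is an assumption added here and is not stated in the abstract, and the sequences and function names are toy examples. The tools compared in the paper combine many more features than this.

```python
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}

def reverse_complement(rna):
    """Reverse complement of an RNA string (Watson-Crick pairs only)."""
    return "".join(COMPLEMENT[base] for base in reversed(rna))

def seed_sites(mirna, utr, seed_start=1, seed_end=8):
    """Return 0-based positions in `utr` that pair perfectly with the seed.

    The seed is mirna[seed_start:seed_end], i.e. nucleotides 2-8 by default
    (an assumed, simplified site definition).
    """
    site = reverse_complement(mirna[seed_start:seed_end])
    width = len(site)
    return [i for i in range(len(utr) - width + 1) if utr[i:i + width] == site]

# Toy sequences for illustration only.
mirna = "UGAGGUAGUAGGUUGUAUAGUU"
utr = "AAACUACCUCAGGG"
sites = seed_sites(mirna, utr)
print(reverse_complement(mirna[1:8]))  # CUACCUC
print(sites)                           # [3]
```

Because a seven-nucleotide match occurs often by chance, a pure seed search yields many false positives, which is one reason different tools that add different filters produce inconsistent predictions.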

Genomics

Computational Drug Repositioning and Elucidation of Mechanism of Action of Compounds against SARS-CoV-2

The COVID-19 crisis called for a rapid reaction from all fields of biomedical research. Traditional drug development involves time-consuming pipelines that conflict with the urgency of identifying effective therapies during a health and economic emergency. Drug repositioning, that is, the discovery of new clinical applications for drugs already approved for different therapeutic contexts, could provide an effective shortcut to bring COVID-19 treatments to the bedside in a timely manner. Moreover, computational approaches can help accelerate the process even further. Here we present the application of computational drug repositioning tools based on transcriptomics data to identify drugs that are potentially able to counteract SARS-CoV-2 infection, and also to provide insights into their mode of action. We believe that mucolytics and HDAC inhibitors warrant further investigation. In addition, we found that the DNA mismatch repair pathway is strongly modulated by drugs with experimental in vitro activity against SARS-CoV-2 infection. Both full results and methods are publicly available.

Genomics

Computational Performance of a Germline Variant Calling Pipeline for Next Generation Sequencing

With the rapid growth of next-generation sequencing technology and its implementation in clinical practice and life science research, the need for faster and more efficient data analysis methods has become pressing. Here we report on the evaluation of an optimized germline mutation calling pipeline, HummingBird, by assessing its performance against the widely accepted BWA-GATK pipeline. We found that the HummingBird pipeline can significantly reduce the running time of the primary data analysis for whole genome sequencing and whole exome sequencing without significantly sacrificing variant calling accuracy. Thus, we conclude that wider use of such software will help to improve the efficiency of primary data analysis for next-generation sequencing.

Genomics

Computational and molecular dissection of an X-box cis-Regulatory module

Ciliopathies are a class of human diseases marked by dysfunction of cilia, a cellular organelle. While many of the molecular components that make up cilia have been identified and studied, comparatively little is understood about the transcriptional regulation of genes encoding these components. The conserved transcription factor Regulatory Factor X (RFX)/DAF-19, which acts through binding to the cis-regulatory motif known as the X-box, has been shown to regulate ciliary genes in many animals from Caenorhabditis elegans to humans. However, accumulating evidence suggests that RFX is unable to initiate transcription on its own. Therefore, other factors and cis-regulatory elements are likely required. One such element, a DNA motif called the C-box, has recently been identified in C. elegans. It is still unclear whether the X-box and C-boxes are the only regulatory elements involved and how they interact. To address this, I analyzed the transcriptional regulation of dyf-5, the C. elegans ortholog of the human ciliopathy gene Male-Associated Kinase (MAK). Using computational methods, I confirmed the presence of the previously reported X-box and C-boxes and identified an additional C-box. By sequentially mutating each of the identified motifs, I determined the role each motif plays in transcriptional regulation of dyf-5. My results showed that the X-box and the three C-boxes alone are necessary and sufficient to drive transcription, with the X-box and the centre C-box being the major contributors and the other two C-boxes enhancing expression. This study advances the knowledge of gene regulation in general and will further our understanding of ciliopathies and the mutations that cause them.

Genomics

Computational genomic algorithms for miRNA-based diagnosis of lung cancer: the potential of machine learning

The advent of large-scale, high-throughput genomic screening has introduced a wide range of tests for diagnostic purposes. Prominent among them are tests using miRNA expression levels. Genomics and proteomics now provide expression levels of hundreds of miRNAs at a time. However, for actual diagnostic tools to become reality, methods must be developed in parallel to interpret the large amounts of miRNA expression data that can be generated from a single patient sample. Because these data are in numeric form, quantitative methods must be developed. Statistics such as p-values and log fold change give some insight, but the diagnostic effectiveness of each miRNA test must first be evaluated. Here, the author has developed a traditional, sensitivity- and specificity-based algorithm, as well as a modern machine learning algorithm, and evaluated their diagnostic potential for lung cancer against a publicly available database. The findings suggest that the machine learning algorithm achieves higher accuracy (97% for cancerous and 73% for normal samples), in addition to providing confidence intervals that could provide valuable diagnostic support. The machine learning algorithm also has significant potential for expansion to more complex diagnoses of lung cancer sub-types, to other cancers, and to diseases beyond cancer. Both algorithms are available on GitHub.
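
The sensitivity- and specificity-based evaluation of a single miRNA test can be sketched as follows: classify a sample as cancerous when its expression exceeds a cutoff, then measure how often that call is right for cancerous and for normal samples. This is not the author's algorithm. The function names, the toy expression values, the assumption that higher expression indicates cancer, and the use of Youden's J to pick the cutoff are all illustrative choices.

```python
def sensitivity_specificity(values, labels, threshold):
    """Score a cutoff rule: predict cancer when value >= threshold.

    labels[i] is True for a cancerous sample, False for a normal one.
    """
    tp = fn = tn = fp = 0
    for value, is_cancer in zip(values, labels):
        predicted = value >= threshold
        if is_cancer and predicted:
            tp += 1
        elif is_cancer:
            fn += 1
        elif predicted:
            fp += 1
        else:
            tn += 1
    return tp / (tp + fn), tn / (tn + fp)

def best_threshold(values, labels):
    """Pick the observed value that maximizes Youden's J = sens + spec - 1."""
    def youden(threshold):
        sens, spec = sensitivity_specificity(values, labels, threshold)
        return sens + spec - 1
    return max(sorted(set(values)), key=youden)

# Hypothetical expression levels of one miRNA across seven samples.
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
labels = [False, False, False, True, False, True, True]

cutoff = best_threshold(values, labels)
sens, spec = sensitivity_specificity(values, labels, cutoff)
print(cutoff, sens, spec)  # 4.0 1.0 0.75
```

Repeating this per miRNA gives a ranking of individual tests by diagnostic effectiveness, which a p-value or fold change alone does not provide.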
