Featured Research

Genomics

A reproducible effect size is more useful than an irreproducible hypothesis test to analyze high throughput sequencing datasets

Motivation: P values derived from the null hypothesis significance testing framework are strongly affected by sample size and are known to be irreproducible in underpowered studies, yet no suitable replacement has been proposed. Results: Here we present implementations of non-parametric standardized median effect size estimates, dNEF, for high-throughput sequencing datasets. Case studies are shown for transcriptome and tag-sequencing datasets. The dNEF measure is shown to be more reproducible and robust than P values, and it requires sample sizes as small as 3 to reproducibly identify differentially abundant features. Availability: Source code and binaries are freely available at: this https URL, omicplotR, and this https URL.
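To make the idea of a non-parametric standardized median effect size concrete, here is a minimal sketch. It is not the dNEF estimator itself, whose definition the abstract does not give. It assumes one generic form: the difference of group medians scaled by the larger within-group median absolute deviation (MAD). The function names `mad` and `median_effect_size` are hypothetical.

```python
from statistics import median


def mad(values):
    """Median absolute deviation from the median (a robust spread measure)."""
    centre = median(values)
    return median(abs(v - centre) for v in values)


def median_effect_size(group_a, group_b):
    """Difference of medians scaled by the larger within-group MAD.

    Assumed illustrative form only; NOT the dNEF definition from the paper.
    """
    spread = max(mad(group_a), mad(group_b))
    if spread == 0:
        raise ValueError("within-group spread is zero; effect size undefined")
    return (median(group_a) - median(group_b)) / spread


# Three samples per group, the smallest size mentioned in the abstract.
print(median_effect_size([5.0, 6.0, 7.0], [1.0, 2.0, 3.0]))
```

Unlike a P value, a ratio of this kind does not shrink or grow simply because more samples are added. It only becomes more precisely estimated.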

Genomics

A robust and generalizable transcriptomic signature for sepsis diagnostics

Motivation: High-throughput sequencing can detect tens of thousands of genes in parallel, providing opportunities to improve the diagnostic accuracy of multiple diseases, including sepsis, an aggressive inflammatory response to infection that can cause organ failure and death. Early screening for sepsis is essential in the clinic, but no effective diagnostic biomarkers are available yet. Results: We present a novel method, Recurrent Logistic Regression (RLR), to identify diagnostic biomarkers for sepsis from blood transcriptome data. Immune-related gene expression profiles are studied in 1,144 sepsis samples and 240 normal samples across three detection platforms: Affymetrix Human Genome U133 Plus 2.0, Affymetrix Human Genome U219 Array, and Agilent Human Gene Expression 4x44K v2 Microarray. A panel of five genes, LRRN3, IL2RB, FCER1A, TLR5, and S100A12, is identified as a set of diagnostic biomarkers (LIFTS) for sepsis on two discovery cohorts. LIFTS discriminates patients with sepsis from normal controls with high accuracy (AUROC = 0.9959 on average; CI = [0.9722-1.0]) on nine validation cohorts across three independent platforms, outperforming existing markers. Each individual gene has some discriminative power for detecting sepsis but cannot match the performance of the entire panel. Network and functional analysis showed that the interactors of the five genes interact closely and are commonly involved in several biological functions, including growth hormone synthesis, the chemokine signaling pathway, and the B cell receptor signaling pathway.

Genomics

A single step protein assay that is both detergent and reducer compatible: The cydex blue assay

Determination of protein concentration is often an absolute prerequisite in preparing samples for biochemical and proteomic analyses. However, current protein assay methods are not compatible with both reducers and detergents, even though the two are present simultaneously in most denaturing extraction buffers used in proteomics and electrophoresis, and in SDS electrophoresis in particular. We found that including cyclodextrins in a Coomassie blue-based assay made it compatible with detergents, as cyclodextrins complex detergents in a 1:1 molecular ratio. As this type of assay is intrinsically resistant to reducers, we have thus developed a single-step assay that is compatible with both detergents and reducers. Depending on the type and concentration of detergents present in the sample buffer, either beta-cyclodextrin or alpha-cyclodextrin can be used: the former complexes a wider range of detergents, and the latter complexes higher amounts of detergents because of its greater solubility in water. Cyclodextrins are used at final concentrations of 2-10 mg/mL in the assay mix. This typically allows measurement of samples containing as little as 0.1 mg/mL protein, in the presence of up to 2% detergent and reducers such as 5% mercaptoethanol or 50 mM DTT, in a single step with a simple spectrophotometric assay.

Genomics

A spectral algorithm for fast de novo layout of uncorrected long nanopore reads

Motivation: New long-read sequencers promise to transform sequencing and genome assembly by producing reads tens of kilobases long. However, their high error rate significantly complicates assembly and requires expensive correction steps to lay out the reads using standard assembly engines. Results: We present an original and efficient spectral algorithm to lay out the uncorrected nanopore reads, and its seamless integration into a straightforward overlap/layout/consensus (OLC) assembly scheme. The method is shown to assemble Oxford Nanopore reads from several bacterial genomes into good-quality (~99% identity to the reference) genome-sized contigs, while yielding more fragmented assemblies from a Saccharomyces cerevisiae reference strain. Availability and implementation: this http URL Contact: [email protected]

Genomics

A step towards a reinforcement learning de novo genome assembler

Reinforcement learning has proven very promising for solving complex tasks without human supervision during the learning process. However, its successful applications are predominantly focused on fictional and entertainment problems, such as games. This work therefore aims to shed light on the application of reinforcement learning to a relevant real-world problem: genome assembly. By expanding the only approach found in the literature that addresses this problem, we carefully explored the aspects of intelligent agent learning, performed by the Q-learning algorithm, to understand its suitability for scenarios whose characteristics are closer to those faced by real genome projects. The improvements proposed here include changing the previously proposed reward system and adding state-space exploration optimization strategies based on dynamic pruning and mutual collaboration with evolutionary computing. These investigations were carried out on 23 new environments with larger inputs than those used previously. All these environments are freely available on the internet so that the scientific community can advance this research. The results suggest consistent performance gains from the proposed improvements; however, they also demonstrate the limitations of those improvements, especially those related to the high dimensionality of the state and action spaces. Finally, we outline paths toward tackling genome assembly efficiently in real scenarios, drawing on recent successful reinforcement learning applications, including deep reinforcement learning, from other domains dealing with high-dimensional inputs.
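For readers unfamiliar with Q-learning, the following minimal sketch shows the standard tabular update rule the algorithm is built on. How the paper encodes states, actions, and rewards is not given in the abstract, so the read labels (`"r1"`, `"r2"`), state names, and the function name `q_update` are hypothetical, as are the learning rate and discount values.

```python
from collections import defaultdict


def q_update(q, state, action, reward, next_state, next_actions,
             alpha=0.1, gamma=0.9):
    """Apply one tabular Q-learning update and return the new Q-value.

    Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))
    """
    # Value of the best action available in the next state (0 if terminal).
    best_next = max((q[(next_state, a)] for a in next_actions), default=0.0)
    target = reward + gamma * best_next
    q[(state, action)] += alpha * (target - q[(state, action)])
    return q[(state, action)]


# Hypothetical toy setting: a state is the layout built so far,
# an action is the next read to append.
q = defaultdict(float)
q_update(q, "s0", "r1", reward=1.0, next_state="s1",
         next_actions=["r2", "r3"], alpha=0.5)
print(q[("s0", "r1")])
```

Because the table holds one entry per state-action pair, its size grows quickly with the number of reads, which is consistent with the dimensionality limits the abstract reports.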

Genomics

A word recurrence based algorithm to extract genomic dictionaries

Genomes may be analyzed from an information viewpoint as very long strings, containing functional elements of variable length, which have been assembled by evolution. In this work an innovative information theory based algorithm is proposed, to extract significant (relatively small) dictionaries of genomic words. Namely, conceptual analyses are here combined with empirical studies, to open up a methodology for the extraction of variable length dictionaries from genomic sequences, based on the information content of some factors. Its application to human chromosomes highlights an original inter-chromosomal similarity in terms of factor distributions.

Genomics

AFDP: An Automated Function Description Prediction Approach to Improve Accuracy of Protein Function Predictions

With the rapid growth of high-throughput biological sequencing technologies, and consequently of the amount of omics data produced, it is essential to develop automated methods to annotate the function of unknown genes and proteins. Existing tools such as AHRD use the characterization of known proteins to annotate unknown ones. Other algorithms, such as eggNOG, use orthologous groups of proteins to detect the most probable function. However, while the available tools focus on detecting the most similar characterization, they are not able to generalize and integrate information from multiple homologs while maintaining accuracy. Here, we devise AFDP, an integrated approach for protein function prediction that combines two available tools, AHRD and eggNOG, to predict the function of novel proteins and to produce more precise human-readable descriptions by applying our stCFExt algorithm. stCFExt creates function descriptions from the manually curated descriptions available in Swiss-Prot. Using a benchmark dataset, we show that the annotations predicted by our approach are more accurate than eggNOG and AHRD annotations.

Genomics

ASB1 differential methylation in ischaemic cardiomyopathy. Relationship with left ventricular performance in end stage heart failure patients

Aims: Ischaemic cardiomyopathy (ICM) leads to impaired contraction and ventricular dysfunction, causing high rates of morbidity and mortality. Epigenomics allows the identification of epigenetic signatures in human diseases. We analyse the differential epigenetic patterns of the ASB gene family in ICM patients and relate these alterations to their haemodynamic and functional status. Methods and Results: Epigenomic analysis was carried out using 16 left ventricular (LV) tissue samples, 8 from ICM patients undergoing heart transplantation and 8 from control (CNT) subjects without cardiac disease. We increased the sample size to 13 ICM and 10 CNT for RNA-sequencing and to 14 ICM for pyrosequencing analyses. We found a hypermethylated profile (cg11189868) in the ASB1 gene that showed a differential methylation of 0.26 beta difference, P < 0.05. This result was validated by pyrosequencing (0.23 beta difference, P < 0.05). Notably, the methylation pattern was strongly related to LV ejection fraction (r = -0.849, P = 0.008), stroke volume (r = -0.929, P = 0.001), and end-systolic and diastolic LV diameters (r = -0.743, P = 0.035 for both). ASB1 was downregulated at the mRNA level (-1.2 fold, P < 0.05). Conclusion: Our findings link a specific ASB1 methylation pattern to LV structure and performance in end-stage ICM, opening new therapeutic opportunities and providing new insights into which part of the genome is functionally relevant in the ischaemic failing myocardium. Keywords: ischaemic cardiomyopathy; epigenomics; heart failure; left ventricular dysfunction; stroke volume; ASB1.

Genomics

Accelerating the Understanding of Life's Code Through Better Algorithms and Hardware Design

Calculating the similarities between a pair of genomic sequences is one of the most fundamental computational steps in genomic analysis. This step -- called sequence alignment -- is the computational bottleneck because: (1) it is implemented using quadratic-time dynamic programming algorithms and (2) the majority of candidate locations in the reference genome do not align with a given read due to high dissimilarity. Calculating the alignment of such incorrect candidate locations consumes an overwhelming majority of a modern read mapper's execution time. In this thesis, we introduce four new algorithms (GateKeeper, Shouji, MAGNET, and SneakySnake) that function as a pre-alignment step and aim to filter out most incorrect candidate locations. The first key idea of our pre-alignment filters is to provide high filtering accuracy by correctly detecting all similar segments shared between two sequences. The second key idea is to exploit the massively parallel architecture of modern FPGAs for accelerating our filtering algorithms. We also develop an efficient CPU implementation of the SneakySnake algorithm for commodity desktops and servers. We evaluate the benefits and downsides of our pre-alignment filtering approach in detail using 12 real datasets. In our evaluation, we demonstrate that our hardware pre-alignment filters show two to three orders of magnitude speedup over their equivalent CPU implementations. We also demonstrate that integrating our hardware pre-alignment filters with the state-of-the-art read aligners reduces the aligner's execution time by up to 21.5x. Finally, we show that efficient CPU implementation of pre-alignment filtering still provides significant benefits. We show that SneakySnake on average reduces the execution time of the best performing CPU-based read aligners Edlib and Parasail, by up to 43x and 57.9x, respectively.
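The pre-alignment idea can be illustrated with a minimal sketch: compute a cheap lower bound on the edit distance first, and run the quadratic-time dynamic programming only when the bound does not already exceed the allowed number of edits. This is not GateKeeper, Shouji, MAGNET, or SneakySnake. The base-count bound used here is a simpler stand-in technique, and all function names are hypothetical.

```python
from collections import Counter


def edit_distance(a, b):
    """Levenshtein distance via quadratic-time dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # match or substitution
        prev = curr
    return prev[-1]


def count_lower_bound(a, b):
    """Cheap linear-time lower bound on edit distance from base counts.

    A single edit removes at most one surplus base from each sequence,
    so the larger surplus is a valid lower bound.
    """
    count_a, count_b = Counter(a), Counter(b)
    surplus_a = sum((count_a - count_b).values())
    surplus_b = sum((count_b - count_a).values())
    return max(surplus_a, surplus_b)


def align_if_plausible(read, candidate, max_edits):
    """Return the edit distance if within max_edits, else None.

    The filter never rejects a true match because it uses a lower bound;
    it only skips the expensive step for clearly dissimilar pairs.
    """
    if count_lower_bound(read, candidate) > max_edits:
        return None  # filtered out without running the DP
    distance = edit_distance(read, candidate)
    return distance if distance <= max_edits else None


print(align_if_plausible("ACGT", "ACGA", max_edits=1))
print(align_if_plausible("AAAA", "CCCC", max_edits=1))
```

The design point is the same as in the abstract: most candidate locations are dissimilar, so a filter that is much cheaper than alignment and produces no false rejections can remove most of the work.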

Genomics

Accurate Genomic Prediction Of Human Height

We construct genomic predictors for heritable and extremely complex human quantitative traits (height, heel bone density, and educational attainment) using modern methods in high dimensional statistics (i.e., machine learning). Replication tests show that these predictors capture, respectively, ∼ 40, 20, and 9 percent of total variance for the three traits. For example, predicted heights correlate ∼ 0.65 with actual height; actual heights of most individuals in validation samples are within a few cm of the prediction. The variance captured for height is comparable to the estimated SNP heritability from GCTA (GREML) analysis, and seems to be close to its asymptotic value (i.e., as sample size goes to infinity), suggesting that we have captured most of the heritability for the SNPs used. Thus, our results resolve the common SNP portion of the "missing heritability" problem -- i.e., the gap between prediction R-squared and SNP heritability. The ∼ 20k activated SNPs in our height predictor reveal the genetic architecture of human height, at least for common SNPs. Our primary dataset is the UK Biobank cohort, comprised of almost 500k individual genotypes with multiple phenotypes. We also use other datasets and SNPs found in earlier GWAS for out-of-sample validation of our results.
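The two height figures quoted above are consistent with each other if "variance captured" is read as the squared correlation between predicted and actual values. That reading is an assumption, since the abstract does not state how the percentage was computed. A quick check:

```python
# Assumption: variance captured is taken as the squared correlation (r^2).
r_height = 0.65                      # correlation reported in the abstract
variance_explained = r_height ** 2   # 0.4225

print(f"{variance_explained:.2f}")   # prints 0.42, i.e. roughly 40 percent
```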
