Featured Research

Quantitative Methods

Integrative Data Analytic Framework to Enhance Cancer Precision Medicine

With the advancement of high-throughput biotechnologies, we are accumulating ever more biomedical data about diseases, especially cancer. There is a need for computational models and methods to sift through, integrate, and extract new knowledge from the diverse available data to improve the mechanistic understanding of diseases and patient care. To uncover molecular mechanisms and drug indications for specific cancer types, we develop an integrative framework able to harness a wide range of diverse molecular and pan-cancer data. We show that our approach outperforms competing methods and can identify new associations. Furthermore, through the joint integration of data sources, our framework can also uncover links between cancer types and molecular entities for which no prior knowledge is available. Our new framework is flexible and can be easily reformulated to study other biomedical problems.

Quantitative Methods

Interactive exploration of population scale pharmacoepidemiology datasets

Population-scale drug prescription data linked with adverse drug reaction (ADR) data supports the fitting of models large enough to detect drug use and ADR patterns that are not detectable using traditional methods on smaller datasets. However, detecting ADR patterns in large datasets requires tools for scalable data processing, machine learning for data analysis, and interactive visualization. To our knowledge, no existing pharmacoepidemiology tool supports all three requirements. We have therefore created a tool for interactive exploration of patterns in prescription datasets with millions of samples. We use Spark to preprocess the data for machine learning and for analyses using SQL queries. We have implemented models in Keras and the scikit-learn framework. The model results are visualized and interpreted using live Python coding in Jupyter. We apply our tool to explore a data set of 384 million prescriptions from the Norwegian Prescription Database combined with 62 million prescriptions for elderly patients who were hospitalized. We preprocess the data in two minutes, train models in seconds, and plot the results in milliseconds. Our results show the power of combining computational power, short computation times, and ease of use for analysis of population-scale pharmacoepidemiology datasets. The code is open source and available at: this https URL

Quantitative Methods

Interpretation of the Area Under the ROC Curve for Risk Prediction Models

The area under the curve (AUC) of the receiver operating characteristic (ROC) curve evaluates the separation between the risk distributions of patients and nonpatients, that is, discrimination. For risk prediction models, these two risk distributions can both be derived from the population risk distribution, so they are not independent as they are in diagnosis. A ROC curve AUC formula based on the underlying population risk distribution clarifies how discrimination is defined mathematically and shows that generating the equivalent c-statistic effects a Monte Carlo integration of the formula. For a selection of continuous risk distributions, exact analytic formulas or numerical results for the ROC curve AUC and overlap measure are presented and demonstrate a linear or near-linear dependence on their standard deviation. The ROC curve AUC is also shown to be highly dependent on the mean population risk, in contrast to diagnostic tests, where it is independent of disease prevalence. The converse of discrimination, overlap, has been quantified by the overlap measure, which appears to provide equivalent information. As achieving wider population risk distributions is the goal of risk prediction modeling for clinical risk stratification, interpreting the ROC curve AUC as a measure of dispersion, rather than discrimination, when comparing risk prediction models may be more relevant.
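
The relationship between the population risk distribution, the AUC, and the c-statistic can be illustrated with a minimal sketch. It assumes a perfectly calibrated model, so that a person with risk r becomes a patient with probability r; under that assumption the patient risk density is proportional to r f(r) and the nonpatient density to (1 - r) f(r), where f is the population risk density. The function names, the Beta(2, 8) example distribution, and the grid size are illustrative choices, not taken from the paper.

```python
import random


def auc_from_population_risk(density, n_grid=2000):
    """AUC implied by a population risk density on (0, 1).

    ASSUMPTION: perfect calibration, so patients have density r*f(r)/mu
    and nonpatients (1-r)*f(r)/(1-mu), with mu the mean population risk.
    `density` may be unnormalized; it is normalized on the grid.
    """
    risks = [(i + 0.5) / n_grid for i in range(n_grid)]  # midpoint grid
    weights = [density(r) for r in risks]
    total = sum(weights)
    weights = [w / total for w in weights]
    mu = sum(r * w for r, w in zip(risks, weights))

    auc = 0.0
    nonpatients_below = 0.0  # running P(nonpatient risk < current r)
    for r, w in zip(risks, weights):
        patient = r * w / mu
        nonpatient = (1.0 - r) * w / (1.0 - mu)
        # P(patient risk > nonpatient risk), grid ties counted as one half.
        auc += patient * (nonpatients_below + 0.5 * nonpatient)
        nonpatients_below += nonpatient
    return auc


def c_statistic(risks, outcomes):
    """Rank-based (Mann-Whitney) c-statistic; assumes no tied risks."""
    order = sorted(range(len(risks)), key=risks.__getitem__)
    n_case = sum(outcomes)
    n_ctrl = len(outcomes) - n_case
    rank_sum = sum(rank + 1 for rank, i in enumerate(order) if outcomes[i])
    return (rank_sum - n_case * (n_case + 1) / 2) / (n_case * n_ctrl)


def simulated_c_statistic(a, b, n=20000, seed=0):
    """Draw risks from Beta(a, b), outcomes from Bernoulli(risk)."""
    rng = random.Random(seed)
    risks = [rng.betavariate(a, b) for _ in range(n)]
    outcomes = [1 if rng.random() < r else 0 for r in risks]
    return c_statistic(risks, outcomes)


# Uniform population risk: the double integral evaluates to 5/6.
auc_uniform = auc_from_population_risk(lambda r: 1.0)

# Beta(2, 8) population risk: formula versus simulated c-statistic.
auc_formula = auc_from_population_risk(lambda r: r * (1.0 - r) ** 7)
auc_simulated = simulated_c_statistic(2, 8)
print(auc_uniform, auc_formula, auc_simulated)
```

The simulated c-statistic approximates the same quantity by sampling, which is the sense in which computing it amounts to a Monte Carlo integration of the formula.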

Quantitative Methods

Intervention fatigue is the primary cause of strong secondary waves in the COVID-19 pandemic

As of November 2020, the number of COVID-19 cases is increasing rapidly in many countries. In Europe, the virus spread slowed considerably in the late spring due to strict lockdown, but a second wave of the pandemic grew throughout the fall. In this study, we first reconstruct the time evolution of the effective reproduction number $R(t)$ for each country by integrating the equations of the classic SIR model. We cluster countries based on the estimated $R(t)$ through a suitable time series dissimilarity. The result suggests that simple dynamical mechanisms determine how countries respond to changes in COVID-19 case counts. Inspired by these results, we extend the SIR model to include a social response to explain the number $X(t)$ of new confirmed daily cases. As a first-order model, we assume that the social response is of the form $d_t R = -\nu (X - X^*)$, where $X^*$ is a threshold for response. The response rate $\nu$ depends on whether $X$ is below or above this threshold, on three parameters $\nu_1$, $\nu_2$, $\nu_3$, and on $t$. When $X < X^*$, $\nu = \nu_1$ describes the effect of relaxed intervention when the incidence rate is low. When $X > X^*$, $\nu = \nu_2 \exp(-\nu_3 t)$ models the impact of interventions when the incidence rate is high. The parameter $\nu_3$ represents the fatigue, i.e., the reduced effect of intervention as time passes. The proposed model reproduces typical evolving patterns of COVID-19 epidemic waves observed in many countries. Estimating the parameters $\nu_1$, $\nu_2$, $\nu_3$ and initial conditions, such as $R_0$, for different countries helps to identify important dynamics in their social responses. One conclusion is that the leading cause of the strong second wave in Europe in the fall of 2020 was not the relaxation of interventions during the summer, but rather the general fatigue to interventions developing in the fall.
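
The social response equation can be explored numerically with a small sketch. The abstract does not give the full coupled system, so the incidence equation below is an assumption: new daily cases X grow or decay exponentially at rate (R - 1)/tau, with tau a hypothetical mean infectious period, and depletion of susceptibles is ignored. All parameter values are illustrative and are not fitted to any country.

```python
import math


def simulate_social_response(nu1, nu2, nu3, x_star, x0, r0,
                             tau=5.0, days=120.0, dt=0.05):
    """Integrate dR/dt = -nu (X - X*) together with a simplified incidence law.

    nu = nu1 when X < X*, and nu = nu2 * exp(-nu3 * t) otherwise, as in the
    abstract. ASSUMPTION (not from the paper): d(ln X)/dt = (R - 1) / tau.
    Returns a list of (t, X, R) tuples.
    """
    t, x, r = 0.0, x0, r0
    history = [(t, x, r)]
    for _ in range(int(round(days / dt))):
        # Multiplicative update keeps the incidence strictly positive.
        x *= math.exp(dt * (r - 1.0) / tau)
        # Piecewise response rate with fatigue above the threshold.
        nu = nu1 if x < x_star else nu2 * math.exp(-nu3 * t)
        r += dt * (-nu * (x - x_star))
        t += dt
        history.append((t, x, r))
    return history


# Same hypothetical setting with and without fatigue (nu3 = 0 disables it).
common = dict(nu1=1e-4, nu2=1e-3, x_star=100.0, x0=10.0, r0=1.5)
no_fatigue = simulate_social_response(nu3=0.0, **common)
with_fatigue = simulate_social_response(nu3=0.05, **common)

peak_no_fatigue = max(x for _, x, _ in no_fatigue)
peak_with_fatigue = max(x for _, x, _ in with_fatigue)
print(peak_no_fatigue, peak_with_fatigue)
```

In this toy setting the response above the threshold weakens over time when $\nu_3 > 0$, so R returns to 1 later and the incidence peak is higher than in the run without fatigue.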

Quantitative Methods

Investigating 3D Atomic Environments for Enhanced QSAR

Predicting bioactivity and physical properties of molecules is a longstanding challenge in drug design. Most approaches use molecular descriptors based on a 2D representation of molecules as a graph of atoms and bonds, abstracting away the molecular shape. A difficulty in accounting for 3D shape is designing molecular descriptors that can precisely capture molecular shape while remaining invariant to rotations and translations. We describe a novel alignment-free 3D QSAR method using Smooth Overlap of Atomic Positions (SOAP), a well-established formalism developed for interpolating potential energy surfaces. We show that this approach rigorously describes local 3D atomic environments to compare molecular shapes in a principled manner. This method performs competitively with traditional fingerprint-based approaches as well as state-of-the-art graph neural networks on pIC50 ligand-binding prediction in both random and scaffold split scenarios. We illustrate the utility of SOAP descriptors by showing that their inclusion in an ensemble of diverse representations yields a statistically significant improvement in performance, demonstrating that incorporating 3D atomic environments could lead to enhanced QSAR for cheminformatics.

Quantitative Methods

Investigating the Product Profiles and Structural Relationships of New Levansucrases with Conventional and Non-Conventional Substrates

The synthesis of complex oligosaccharides is desired for their potential as prebiotics and their role in the pharmaceutical and food industries. Levansucrase (LS, EC this http URL), a fructosyl-transferase, can catalyze the synthesis of these compounds. LS acquires a fructosyl residue from a donor molecule and performs a non-Leloir transfer to an acceptor molecule, via β-(2→6)-glycosidic linkages. Genome mining was used to uncover new LS enzymes with increased transfructosylating activity and wider acceptor promiscuity, with an initial screening revealing five LS enzymes. The product profiles and activities of these enzymes were examined after their incubation with sucrose. Alternate acceptor molecules were also incubated with the enzymes to study their consumption. LSs from Gluconobacter oxydans and Novosphingobium aromaticivorans synthesized fructooligosaccharides (FOSs) of up to 13 units in length. Alignment of their amino acid sequences and substrate docking with homology models identified structural elements causing differences in their product spectra. Raffinose, rather than sucrose, was the preferred donor molecule for the LSs from Vibrio natriegens, N. aromaticivorans, and Paraburkholderia graminis. The LSs examined were found to have wide acceptor promiscuity, utilizing monosaccharides, disaccharides, and two alcohols to a high degree.

Quantitative Methods

Investigation of the Cyprus donkey milk bacterial diversity by 16SrDNA high-throughput sequencing in a Cyprus donkey farm

The interest in milk originating from donkeys is growing worldwide due to its claimed functional and nutritional properties, especially for sensitive population groups, such as infants with cow milk protein allergy. The current study aimed to assess the microbiological quality of donkey milk produced in a donkey farm in Cyprus using culture-based and high-throughput sequencing (HTS) techniques. The culture-based microbiological analysis showed very low microbial counts, while important food-borne pathogens were not detected in any sample. In addition, HTS was applied to characterize the bacterial communities of donkey milk samples. Donkey milk was composed mostly of: Gram-negative Proteobacteria, including Sphingomonas, Pseudomonas, Mesorhizobium and Acinetobacter; lactic acid bacteria, including Lactobacillus and Streptococcus; the endospore-forming Clostridium; and the environmental genera Flavobacterium and Ralstonia, detected in lower relative abundances. The results of the study support existing findings that donkey milk contains mostly Gram-negative bacteria. Moreover, the study raises questions regarding the contribution: a) of antimicrobial agents (i.e., lysozyme, peptides) in shaping the microbial communities and b) of the bacterial microbiota to the functional value of donkey milk.

Quantitative Methods

JigSaw: A tool for discovering explanatory high-order interactions from random forests

Machine learning is revolutionizing biology by facilitating the prediction of outcomes from complex patterns found in massive data sets. Large biological data sets, like those generated by transcriptome or microbiome studies, measure many relevant components that interact in vivo with one another in modular ways. Identifying the high-order interactions that machine learning models use to make predictions would facilitate the development of hypotheses linking combinations of measured components to outcome. By using the structure of random forests, a new algorithmic approach, termed JigSaw, was developed to aid in the discovery of patterns that could explain predictions made by the forest. By examining the patterns of individual decision trees, JigSaw identifies high-order interactions between measured features that are strongly associated with a particular outcome and identifies the relevant decision thresholds. JigSaw's effectiveness was tested in simulation studies, where it was able to recover multiple ground truth patterns, even in the presence of significant noise. It was then used to find patterns associated with outcomes in two real-world data sets. It was first used to identify patterns in clinical measurements associated with heart disease. It was then used to find patterns associated with breast cancer using metabolites measured in the blood. In heart disease, JigSaw identified several three-way interactions that combine to explain most of the heart disease records (66%) with high precision (93%). In breast cancer, three two-way interactions were recovered that can be combined to explain almost all records (92%) with good precision (79%). JigSaw is an efficient method for exploring high-dimensional feature spaces for rules that explain statistical associations with a given outcome and can inspire the generation of testable hypotheses.
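
The general idea of mining a random forest's tree structure for feature interactions can be sketched as follows. This is not the JigSaw algorithm, whose details are not given in the abstract. It is a simplified stand-in that walks every root-to-leaf path in a scikit-learn forest, keeps the paths ending in a leaf that predicts the outcome of interest, and counts which feature combinations co-occur on those paths, weighted by the number of training samples in the leaf. Decision thresholds are ignored here. The helper names and the synthetic data are hypothetical.

```python
from collections import Counter
from itertools import combinations

import numpy as np
from sklearn.ensemble import RandomForestClassifier


def leaf_feature_sets(tree, class_index):
    """Yield (features_on_path, n_samples) for leaves predicting class_index."""
    stack = [(0, frozenset())]  # (node id, features split on so far)
    while stack:
        node, feats = stack.pop()
        left, right = tree.children_left[node], tree.children_right[node]
        if left == right:  # both are -1 at a leaf
            if int(np.argmax(tree.value[node][0])) == class_index:
                yield feats, int(tree.n_node_samples[node])
            continue
        feats = feats | {int(tree.feature[node])}
        stack.append((left, feats))
        stack.append((right, feats))


def rank_interactions(forest, class_index, order=2):
    """Rank feature combinations of a given order by weighted path counts."""
    counts = Counter()
    for estimator in forest.estimators_:
        for feats, n in leaf_feature_sets(estimator.tree_, class_index):
            for combo in combinations(sorted(feats), order):
                counts[combo] += n
    return counts.most_common()


# Synthetic data: the outcome depends only on a two-way interaction
# between features 0 and 1; features 2-4 are noise.
rng = np.random.default_rng(0)
X = rng.random((600, 5))
y = ((X[:, 0] > 0.5) & (X[:, 1] > 0.5)).astype(int)

forest = RandomForestClassifier(n_estimators=50, max_depth=4, random_state=0)
forest.fit(X, y)

ranked = rank_interactions(forest, class_index=1, order=2)
top_pair = ranked[0][0]
print(ranked[:3])
```

Weighting by leaf size favors combinations that explain many records over combinations that appear only on small, possibly overfitted leaves.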

Quantitative Methods

Knowledge transfer across cell lines using Hybrid Gaussian Process models with entity embedding vectors

To date, a large number of experiments are performed to develop a biochemical process. The generated data is used only once, to make decisions during development. If we could exploit data from already developed processes to make predictions for a novel process, we could significantly reduce the number of experiments needed. Processes for different products exhibit differences in behaviour; typically, only a subset behave similarly. Therefore, effective learning on process data spanning multiple products requires a sensible representation of the product identity. We propose to represent the product identity (a categorical feature) by embedding vectors that serve as input to a Gaussian Process regression model. We demonstrate how the embedding vectors can be learned from process data and show that they capture an interpretable notion of product similarity. The improvement in performance is compared to traditional one-hot encoding on a simulated cross-product learning task. All in all, the proposed method could enable significant reductions in wet-lab experiments.
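
The difference between embedding vectors and one-hot encoding as Gaussian Process inputs can be illustrated with a minimal NumPy sketch. It does not learn the embeddings, which the paper does from process data; the two embedding tables below are fixed by hand for illustration. The product kernel, the lengthscales, the noise level, and the toy sine-shaped process are all assumptions.

```python
import numpy as np


def rbf(a, b, lengthscale):
    """Squared-exponential kernel between the rows of a and b."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * d2 / lengthscale**2)


def gp_predict(x_tr, e_tr, y_tr, x_te, e_te, ls_x=1.0, ls_e=0.3, noise=1e-2):
    """GP posterior mean with kernel k_x(process input) * k_e(product vector)."""
    k_tr = rbf(x_tr, x_tr, ls_x) * rbf(e_tr, e_tr, ls_e)
    k_te = rbf(x_te, x_tr, ls_x) * rbf(e_te, e_tr, ls_e)
    alpha = np.linalg.solve(k_tr + noise * np.eye(len(y_tr)), y_tr)
    return k_te @ alpha


# Toy data: product 0 is well characterized, product 1 has two experiments.
x_a = np.linspace(0.0, 6.0, 20)
x_b = np.array([0.5, 1.0])
x_train = np.concatenate([x_a, x_b])[:, None]
y_train = np.concatenate([np.sin(x_a), np.sin(x_b) + 0.1])
product_ids = np.array([0] * len(x_a) + [1] * len(x_b))

# Hand-set embeddings placing the two products close together (illustrative),
# versus one-hot vectors, which make every pair of products equally distant.
embedding_table = np.array([[0.0, 0.0], [0.1, 0.0]])
one_hot_table = np.eye(2)

x_test = np.array([[4.5]])          # far from product 1's own experiments
test_id = np.array([1])
truth = np.sin(4.5) + 0.1

pred_embedding = gp_predict(x_train, embedding_table[product_ids], y_train,
                            x_test, embedding_table[test_id])[0]
pred_one_hot = gp_predict(x_train, one_hot_table[product_ids], y_train,
                          x_test, one_hot_table[test_id])[0]

err_embedding = abs(pred_embedding - truth)
err_one_hot = abs(pred_one_hot - truth)
print(err_embedding, err_one_hot)
```

With nearby embeddings the kernel lets product 0's experiments inform the prediction for product 1, whereas with one-hot vectors the cross-product kernel value is close to zero and the prediction falls back toward the prior mean.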

Quantitative Methods

Label-free Raman spectroscopy and machine learning enables sensitive evaluation of differential response to immunotherapy

Cancer immunotherapy provides durable clinical benefit in only a small fraction of patients, particularly due to a lack of reliable biomarkers for accurate prediction of treatment outcomes and evaluation of response. Here, we demonstrate the first application of label-free Raman spectroscopy for elucidating biochemical changes induced by immunotherapy in the tumor microenvironment. We used CT26 murine colorectal cancer cells to grow tumor xenografts and subjected them to treatment with anti-CTLA-4 and anti-PD-L1 antibodies. Multivariate curve resolution-alternating least squares (MCR-ALS) decomposition of the Raman spectral dataset obtained from the treated and control tumors revealed subtle differences in lipid, nucleic acid, and collagen content due to therapy. Our supervised classification analysis using support vector machines and random forests provided excellent prediction accuracies for both immune checkpoint inhibitors and delineated important spectral markers specific to each therapy, consistent with their differential mechanisms of action. Our findings pave the way for in vivo studies of response to immunotherapy in clinical patients using label-free Raman spectroscopy and machine learning.

