Sarah Elshal
Katholieke Universiteit Leuven
Publication
Featured research published by Sarah Elshal.
Nucleic Acids Research | 2016
Léon-Charles Tranchevent; Amin Ardeshirdavani; Sarah Elshal; Daniel Alcaide; Jan Aerts; Didier Auboeuf; Yves Moreau
Genomic studies and high-throughput experiments often produce large lists of candidate genes among which only a small fraction are truly relevant to the disease, phenotype or biological process of interest. Gene prioritization tackles this problem by ranking candidate genes by profiling candidates across multiple genomic data sources and integrating this heterogeneous information into a global ranking. We describe an extended version of our gene prioritization method, Endeavour, now available for six species and integrating 75 data sources. The performance (Area Under the Curve) of Endeavour on cross-validation benchmarks using ‘gold standard’ gene sets varies from 88% (for human phenotypes) to 95% (for worm gene function). In addition, we have also validated our approach using a time-stamped benchmark derived from the Human Phenotype Ontology, which provides a setting close to prospective validation. With this benchmark, using 3854 novel gene–phenotype associations, we observe a performance of 82%. Altogether, our results indicate that this extended version of Endeavour efficiently prioritizes candidate genes. The Endeavour web server is freely available at https://endeavour.esat.kuleuven.be/.
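The idea of combining per-source rankings into one global ranking can be sketched as follows. This is a minimal illustration that averages rank ratios across sources; it is a simplification chosen for clarity and is not claimed to be Endeavour's actual fusion procedure. The gene names, scores, and function names are hypothetical.

```python
from statistics import mean

def rank_ratios(scores):
    """Map each gene to rank / N for one data source (higher score = better rank)."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    n = len(ordered)
    return {gene: (i + 1) / n for i, gene in enumerate(ordered)}

def fuse_rankings(sources):
    """Average the per-source rank ratios and return genes sorted best first."""
    per_source = [rank_ratios(s) for s in sources]
    # Keep only genes that are scored in every source.
    genes = set.intersection(*(set(r) for r in per_source))
    fused = {g: mean(r[g] for r in per_source) for g in genes}
    # Lower mean rank ratio is better; the gene name breaks ties deterministically.
    return sorted(fused, key=lambda g: (fused[g], g))

# Hypothetical similarity scores of three candidates in three data sources.
expression = {"GENE_A": 0.9, "GENE_B": 0.4, "GENE_C": 0.1}
literature = {"GENE_A": 0.7, "GENE_B": 0.8, "GENE_C": 0.2}
interaction = {"GENE_A": 0.6, "GENE_B": 0.3, "GENE_C": 0.5}

global_ranking = fuse_rankings([expression, literature, interaction])
print(global_ranking)  # ['GENE_A', 'GENE_B', 'GENE_C']
```

Averaging rank ratios rather than raw scores keeps sources with different score scales comparable, which is the main reason rank-based fusion is convenient for heterogeneous data.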
Nucleic Acids Research | 2016
Sarah Elshal; Léon-Charles Tranchevent; Alejandro Sifrim; Amin Ardeshirdavani; Jesse Davis; Yves Moreau
Disease-gene identification is a challenging process that has multiple applications within functional genomics and personalized medicine. Typically, this process involves both finding genes known to be associated with the disease (through literature search) and carrying out preliminary experiments or screens (e.g. linkage or association studies, copy number analyses, expression profiling) to determine a set of promising candidates for experimental validation. This requires extensive time and monetary resources. We describe Beegle, an online search and discovery engine that attempts to simplify this process by automating the typical approaches. It starts by mining the literature to quickly extract a set of genes known to be linked with a given query, then it integrates the learning methodology of Endeavour (a gene prioritization tool) to train a genomic model and rank a set of candidate genes to generate novel hypotheses. In a realistic evaluation setup, Beegle has an average recall of 84% in the top 100 returned genes as a search engine, which improves the discovery engine by 12.6% in the top 5% prioritized genes. Beegle is publicly available at http://beegle.esat.kuleuven.be/.
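The recall figure quoted above is a top-k measure. A minimal sketch of how recall in the top k returned genes could be computed is shown below; the gene identifiers and the function name are hypothetical, and the evaluation setup of the paper itself is not reproduced here.

```python
def recall_at_k(ranked_genes, known_genes, k):
    """Fraction of the known genes that appear among the first k ranked genes."""
    known = set(known_genes)
    if not known:
        raise ValueError("known_genes must not be empty")
    top_k = set(ranked_genes[:k])
    return len(top_k & known) / len(known)

# Hypothetical search result and hypothetical set of genes known for the query.
ranked = ["G1", "G7", "G3", "G9", "G2"]
known = {"G1", "G2", "G3", "G4"}

print(recall_at_k(ranked, known, 3))  # 0.5
print(recall_at_k(ranked, known, 5))  # 0.75
```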
international conference on bioinformatics and biomedical engineering | 2016
Sarah Elshal; Jaak Simm; Adam Arany; Pooya Zakeri; Jesse Davis; Yves Moreau
Text mining is popular in biomedical applications because it allows retrieving highly relevant information. For us in particular, it is a practical way to link diseases to the genes involved in them. However, text mining involves multiple challenges, such as (1) recognizing named entities (e.g., diseases and genes) inside the text, (2) constructing specific vocabularies that efficiently represent the available text, and (3) applying the correct statistical criteria to link biomedical entities with each other. We have previously developed Beegle, a tool that allows prioritizing genes for any search query of interest. The method starts with a search phase, where relevant genes are identified via the literature. Once known genes are identified, a second phase allows prioritizing novel candidate genes through a data fusion strategy. Many aspects of our method could potentially be improved. Here we evaluate two MEDLINE annotators that recognize biomedical entities inside a given abstract using different dictionaries and annotation strategies. We compare the contribution of each of the two annotators in associating genes with diseases under different vocabulary settings. Somewhat surprisingly, with fewer recognized entities and a more compact vocabulary, we obtain better associations between genes and diseases. We also propose a novel but simple association criterion to link genes with diseases, which relies on recognizing only gene entities inside the biomedical text. These refinements significantly improve the performance of our method.
bioinformatics and biomedicine | 2015
Pooya Zakeri; Sarah Elshal; Yves Moreau
In biology there is often a need to identify, among a large list of candidate genes, the most promising ones for further investigation. While a single data source might not be effective enough, integrating several complementary genomic data sources leads to more accurate prediction. We propose a kernel-based gene prioritization framework using geometric kernel fusion, which we have recently developed as a powerful tool for protein fold classification [1]. Taking more involved geometric means of the corresponding kernel matrices has been shown to be less sensitive to complementary and noisy kernel matrices than standard multiple kernel learning methods. Since genomic kernels often encode complementary characteristics of biological data, this leads us to investigate the application of geometric kernel fusion to the gene prioritization task. We utilize an unbiased and prospective benchmark based on the OMIM [2] associations. Experimental results on our prospective benchmark show that our model can improve the accuracy of the state-of-the-art gene prioritization model.
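For orientation, the contrast between simple kernel averaging and a geometric-style mean can be written as below. The log-Euclidean mean is used here as one standard example of a geometric mean of symmetric positive-definite matrices; that the paper uses exactly this mean is an assumption, since the abstract does not specify it. The symbols $K_i$ (the kernel matrix of data source $i$) and $m$ (the number of sources) are introduced for illustration.

```latex
K_{\mathrm{arith}} = \frac{1}{m} \sum_{i=1}^{m} K_i ,
\qquad
K_{\mathrm{logE}} = \exp\!\left( \frac{1}{m} \sum_{i=1}^{m} \log K_i \right)
```

Here $\log$ and $\exp$ denote the matrix logarithm and matrix exponential, so each $K_i$ must be strictly positive definite; in practice a small multiple of the identity is often added to a kernel matrix to guarantee this.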
intelligent systems in molecular biology | 2018
Pooya Zakeri; Jaak Simm; Adam Arany; Sarah Elshal; Yves Moreau
Motivation: Most gene prioritization methods model each disease or phenotype individually, but this fails to capture patterns common to several diseases or phenotypes. To overcome this limitation, we formulate the gene prioritization task as the factorization of a sparsely filled gene-phenotype matrix, where the objective is to predict the unknown matrix entries. To deliver more accurate gene-phenotype matrix completion, we extend classical Bayesian matrix factorization to work with multiple side information sources. The availability of side information allows us to make non-trivial predictions for genes for which no previous disease association is known. Results: A novel aspect of our gene prioritization method is that it integrates not only data sources describing genes, but also data sources describing Human Phenotype Ontology terms. Experimental results on our benchmarks show that our proposed model can effectively improve accuracy over the well-established gene prioritization method, Endeavour. In particular, compared to Endeavour, our proposed method offers promising results on diseases of the nervous system; diseases of the eye and adnexa; endocrine, nutritional and metabolic diseases; and congenital malformations, deformations and chromosomal abnormalities. Availability and implementation: The Bayesian data fusion method is implemented as a Python/C++ package: https://github.com/jaak-s/macau. It is also available as a Julia package: https://github.com/jaak-s/BayesianDataFusion.jl. All data and benchmarks generated or analyzed during this study can be downloaded at https://owncloud.esat.kuleuven.be/index.php/s/UGb89WfkZwMYoTn.
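The matrix-completion view of gene prioritization can be illustrated with a small sketch. The code below uses plain regularized matrix factorization trained by stochastic gradient descent, without side information and without Bayesian inference, so it is a deliberately simpler stand-in for the method described above and does not use the macau package. The toy matrix, hyperparameters, and function names are all hypothetical.

```python
import random

def predict(genes, phenos, g, p):
    """Predicted association: dot product of a gene factor and a phenotype factor."""
    return sum(u * v for u, v in zip(genes[g], phenos[p]))

def factorize(observed, n_genes, n_phenotypes, rank=2, lr=0.05, reg=0.01,
              epochs=500, seed=0):
    """Fit latent factors to the observed (gene, phenotype) entries by SGD."""
    rng = random.Random(seed)
    genes = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(n_genes)]
    phenos = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(n_phenotypes)]
    for _ in range(epochs):
        for (g, p), target in observed.items():
            error = target - predict(genes, phenos, g, p)
            for k in range(rank):
                u, v = genes[g][k], phenos[p][k]
                # Gradient step on squared error with L2 regularization.
                genes[g][k] += lr * (error * v - reg * u)
                phenos[p][k] += lr * (error * u - reg * v)
    return genes, phenos

def mean_squared_error(genes, phenos, observed):
    """Average squared error over the observed entries only."""
    return sum((y - predict(genes, phenos, g, p)) ** 2
               for (g, p), y in observed.items()) / len(observed)

# Sparsely filled toy matrix: keys are (gene index, phenotype index).
observed = {(0, 0): 1.0, (0, 1): 1.0, (1, 0): 1.0, (1, 2): 0.0,
            (2, 1): 0.0, (2, 2): 1.0, (3, 2): 1.0}

untrained = factorize(observed, n_genes=4, n_phenotypes=3, epochs=0)
trained = factorize(observed, n_genes=4, n_phenotypes=3)

# Score for an entry that was never observed (gene 1, phenotype 1).
unknown_score = predict(*trained, 1, 1)
```

Note that gene 3 has only one observed entry here; a gene with no observed entries at all would keep its random initial factor, which is the gap that side information about genes is meant to close in the method described above.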
bioinformatics and biomedicine | 2016
Sarah Elshal; Mithila Mathad; Jaak Simm; Jesse Davis; Yves Moreau
The massive growth of biomedical text makes it very challenging for researchers to review all relevant work and generate all possible hypotheses in a reasonable amount of time. Many text mining methods have been developed to simplify this process and quickly present the researcher with a learned set of biomedical hypotheses that could potentially be validated. Previously, we focused on the task of identifying genes that are linked with a given disease by text mining PubMed abstracts. We applied a word-based concept profile similarity to learn patterns between disease and gene entities and hence identify links between them. In this work, we study an alternative approach based on topic modelling to learn different patterns between the disease and gene entities, and we measure how this affects the identified links. We investigated multiple input corpora, word representations, topic parameters, and similarity measures. On the one hand, our results show that when we (1) learn the topics from a gene-clustered set of abstracts, and (2) apply the dot-product similarity measure, we succeed in improving on our original methods and identify more correct disease-gene links. On the other hand, the results also show that the learned topics remain limited to the diseases existing in our vocabulary, so that scaling the methodology to new disease queries becomes nontrivial.
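The dot-product similarity step mentioned above can be sketched as follows, assuming each disease and each gene has already been represented as a vector of topic weights. The topic vectors, gene names, and function names are hypothetical; how the topics are learned is outside this sketch.

```python
def dot(a, b):
    """Dot product of two equal-length topic-weight vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of topics")
    return sum(x * y for x, y in zip(a, b))

def rank_genes(disease_topics, gene_topics):
    """Rank genes by decreasing dot-product similarity to the disease."""
    scores = {gene: dot(disease_topics, vec) for gene, vec in gene_topics.items()}
    return sorted(scores, key=lambda g: (-scores[g], g))

# Hypothetical topic distributions over three topics.
disease = [0.7, 0.2, 0.1]
genes = {
    "GENE_A": [0.6, 0.3, 0.1],
    "GENE_B": [0.1, 0.1, 0.8],
    "GENE_C": [0.3, 0.5, 0.2],
}

print(rank_genes(disease, genes))  # ['GENE_A', 'GENE_C', 'GENE_B']
```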
intelligent systems in molecular biology | 2015
Pooya Zakeri; Jaak Simm; Adam Arany; Sarah Elshal; Yves Moreau
bioinformatics and biomedicine | 2016
Sarah Elshal; M Mathad; Jaak Simm; Jesse Davis; Yves Moreau
Lecture Notes in Computer Science | 2016
Sarah Elshal; Jaak Simm; Adam Arany; Pooya Zakeri; Jesse Davis; Yves Moreau
intelligent systems in molecular biology | 2015
Sarah Elshal; Léon-Charles Tranchevent; Jesse Davis; Yves Moreau