Publication


Featured research published by Dietrich Rebholz-Schuhmann.


Database | 2015

PhenoMiner: from text to a database of phenotypes associated with OMIM diseases

Nigel Collier; Tudor Groza; Damian Smedley; Peter N. Robinson; Anika Oellrich; Dietrich Rebholz-Schuhmann

Analysis of scientific and clinical phenotypes reported in the experimental literature has been curated manually to build high-quality databases such as the Online Mendelian Inheritance in Man (OMIM). However, the identification and harmonization of phenotype descriptions struggle with the diversity of human expressivity. We introduce a novel automated extraction approach called PhenoMiner that exploits full parsing and conceptual analysis. Apriori association mining is then used to identify relationships to human diseases. We applied PhenoMiner to the BMC open access collection and identified 13 636 phenotype candidates. We identified 28 155 phenotype-disorder hypotheses covering 4898 phenotypes and 1659 Mendelian disorders. Analysis showed: (i) the semantic distribution of the extracted terms against linked ontologies; (ii) a comparison of term overlap with the Human Phenotype Ontology (HP); (iii) moderate support for phenotype-disorder pairs in both OMIM and the literature; (iv) strong associations of phenotype-disorder pairs to known disease-gene pairs using PhenoDigm. The full list of PhenoMiner phenotypes (S1), phenotype-disorder associations (S2), association-filtered linked data (S3) and user database documentation (S5) are available as supplementary data and can be downloaded at http://github.com/nhcollier/PhenoMiner under a Creative Commons Attribution 4.0 license. Database URL: phenominer.mml.cam.ac.uk
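As a rough illustration of the Apriori association mining step mentioned in this abstract, the sketch below counts frequent term pairs and a simple confidence score. It is not PhenoMiner's implementation: the function names, the toy documents and the support threshold are all assumptions made for this example.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(transactions, min_support):
    """Return {(a, b): support_count} for item pairs meeting min_support.

    Follows the Apriori principle: a pair can only be frequent if both of
    its members are frequent on their own, so infrequent items are pruned
    before any pair is counted.
    """
    # Pass 1: count single items (each item at most once per transaction).
    item_counts = Counter(item for t in transactions for item in set(t))
    frequent_items = {i for i, c in item_counts.items() if c >= min_support}

    # Pass 2: count only pairs built from frequent items.
    pair_counts = Counter()
    for t in transactions:
        kept = sorted(set(t) & frequent_items)
        pair_counts.update(combinations(kept, 2))

    return {p: c for p, c in pair_counts.items() if c >= min_support}


def confidence(transactions, antecedent, consequent):
    """Fraction of transactions containing antecedent that also contain consequent."""
    with_a = [t for t in transactions if antecedent in t]
    if not with_a:
        return 0.0
    return sum(consequent in t for t in with_a) / len(with_a)


# Toy data, invented for illustration: each set holds the terms
# co-mentioned in one hypothetical document.
docs = [
    {"disorder:A", "phenotype:x", "phenotype:y"},
    {"disorder:A", "phenotype:x"},
    {"disorder:B", "phenotype:y"},
    {"disorder:A", "phenotype:x", "phenotype:z"},
]

pairs = frequent_pairs(docs, min_support=2)
print(pairs)
print(confidence(docs, "phenotype:y", "disorder:A"))
```

Pruning single items first keeps the number of candidate pairs small, which is the main reason to prefer this over counting every co-occurring pair directly.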


Journal of Biomedical Semantics | 2017

SAFE: SPARQL federation over RDF data cubes with access control

Yasar Khan; Muhammad Saleem; Muntazir Mehdi; Aidan Hogan; Qaiser Mehmood; Dietrich Rebholz-Schuhmann; Ratnesh Sahay

Background: Several query federation engines have been proposed for accessing public Linked Open Data sources. However, in many domains, resources are sensitive and access to these resources is tightly controlled by stakeholders; consequently, privacy is a major concern when federating queries over such datasets. In the Healthcare and Life Sciences (HCLS) domain real-world datasets contain sensitive statistical information: strict ownership is granted to individuals working in hospitals, research labs, clinical trial organisers, etc. Therefore, the legal and ethical concerns on (i) preserving the anonymity of patients (or clinical subjects); and (ii) respecting data ownership through access control; are key challenges faced by the data analytics community working within the HCLS domain. Likewise, statistical data play a key role in the domain, where the RDF Data Cube Vocabulary has been proposed as a standard format to enable the exchange of such data. However, to the best of our knowledge, no existing approach has looked to optimise federated queries over such statistical data. Results: We present SAFE: a query federation engine that enables policy-aware access to sensitive statistical datasets represented as RDF data cubes. SAFE is designed specifically to query statistical RDF data cubes in a distributed setting, where access control is coupled with source selection, user profiles and their access rights. SAFE proposes a join-aware source selection method that avoids wasteful requests to irrelevant and unauthorised data sources. In order to preserve anonymity and enforce stricter access control, SAFE’s indexing system does not hold any data instances; it stores only predicates and endpoints. The resulting data summary has a significantly lower index generation time and size compared to existing engines, which allows for faster updates when sources change. Conclusions: We validate the performance of the system with experiments over real-world datasets provided by three clinical organisations as well as legacy linked datasets. We show that SAFE enables granular graph-level access control over distributed clinical RDF data cubes and efficiently reduces the source selection and overall query execution time when compared with general-purpose SPARQL query federation engines in the targeted setting.
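The idea of coupling source selection with access control over a predicate-only index can be sketched roughly as follows. This is a simplified illustration, not SAFE's algorithm: it ignores the join-aware part, and the index layout, endpoint URLs, graph names and predicates are hypothetical.

```python
def select_sources(query_predicates, index, user_graphs):
    """Pick (endpoint, graph) pairs that are both relevant and authorised.

    index       : {(endpoint, graph): set of predicates}, no data instances
    user_graphs : set of (endpoint, graph) pairs the user may read
    """
    wanted = set(query_predicates)
    selected = {}
    for source, predicates in index.items():
        if source not in user_graphs:
            continue  # unauthorised source: never contacted
        matched = wanted & predicates
        if matched:  # irrelevant sources are skipped as well
            selected[source] = matched
    return selected


# Hypothetical index: predicates per named graph per endpoint.
index = {
    ("http://example.org/ep1", "g1"): {"ex:age", "ex:count"},
    ("http://example.org/ep1", "g2"): {"ex:count"},
    ("http://example.org/ep2", "g1"): {"ex:region"},
}

# Hypothetical access rights for one user profile.
user_graphs = {
    ("http://example.org/ep1", "g1"),
    ("http://example.org/ep2", "g1"),
}

chosen = select_sources(["ex:count"], index, user_graphs)
print(chosen)
```

Because the index holds only predicates and endpoint/graph identifiers, the selection step itself never touches patient-level data in this sketch.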


Briefings in Bioinformatics | 2018

Improving data workflow systems with cloud services and use of open data for bioinformatics research

Dietrich Rebholz-Schuhmann; Ratnesh Sahay; Audrey M. Michel; Md. Rezaul Karim; Pavel V. Baranov; Achille Zappa

Data workflow systems (DWFSs) enable bioinformatics researchers to combine components for data access and data analytics, and to share the final data analytics approach with their collaborators. Increasingly, such systems have to cope with large‐scale data, such as full genomes (about 200 GB each), public fact repositories (about 100 TB of data) and 3D imaging data at even larger scales. As moving the data becomes cumbersome, the DWFS needs to embed its processes into a cloud infrastructure, where the data are already hosted. As the standardized public data play an increasingly important role, the DWFS needs to comply with Semantic Web technologies. This advancement to DWFS would reduce overhead costs and accelerate the progress in bioinformatics research based on large‐scale data and public resources, as researchers would require less specialized IT knowledge for the implementation. Furthermore, the high data growth rates in bioinformatics research drive the demand for parallel and distributed computing, which then imposes a need for scalability and high‐throughput capabilities onto the DWFS. As a result, requirements for data sharing and access to public knowledge bases suggest that compliance of the DWFS with Semantic Web standards is necessary. In this article, we will analyze the existing DWFS with regard to their capabilities toward public open data use as well as large‐scale computational and human interface requirements. We untangle the parameters for selecting a preferable solution for bioinformatics research with particular consideration to using cloud services and Semantic Web technologies. Our analysis leads to research guidelines and recommendations toward the development of future DWFS for the bioinformatics research community.


Journal of Biomedical Semantics | 2017

Towards precision medicine: discovering novel gynecological cancer biomarkers and pathways using linked data

Alokkumar Jha; Yasar Khan; Muntazir Mehdi; Rezaul Karim; Qaiser Mehmood; Achille Zappa; Dietrich Rebholz-Schuhmann; Ratnesh Sahay

Background: Next Generation Sequencing (NGS) is playing a key role in therapeutic decision making for cancer prognosis and treatment. The NGS technologies are producing a massive amount of sequencing datasets. Often, these datasets are published from isolated and different sequencing facilities. Consequently, the process of sharing and aggregating multisite sequencing datasets is thwarted by issues such as the need to discover relevant data from different sources, build scalable repositories, the automation of data linkage, the volume of the data, efficient querying mechanisms, and information-rich intuitive visualisation. Results: We present an approach to link and query different sequencing datasets (TCGA, COSMIC, REACTOME, KEGG and GO) to indicate risks for four cancer types – Ovarian Serous Cystadenocarcinoma (OV), Uterine Corpus Endometrial Carcinoma (UCEC), Uterine Carcinosarcoma (UCS), Cervical Squamous Cell Carcinoma and Endocervical Adenocarcinoma (CESC) – covering the 16 healthy tissue-specific genes from Illumina Human Body Map 2.0. The differentially expressed genes from Illumina Human Body Map 2.0 are analysed together with the gene expressions reported in COSMIC and TCGA repositories, leading to the discovery of potential biomarkers for a tissue-specific cancer. Conclusion: We analyse the tissue expression of genes, copy number variation (CNV), somatic mutation, and promoter methylation to identify associated pathways and find novel biomarkers. We discovered twenty (20) mutated genes and three (3) potential pathways causing promoter changes in different gynaecological cancer types. We propose a data-interlinked platform called BIOOPENER that glues together heterogeneous cancer and biomedical repositories. The key approach is to find correspondences (or data links) among genetic, cellular and molecular features across isolated cancer datasets, giving insight into cancer progression from normal to diseased tissues. The proposed BIOOPENER platform enriches mutations by filling in missing links from TCGA, COSMIC, REACTOME, KEGG and GO datasets and provides an interlinking mechanism to understand cancer progression from normal to diseased tissues with pathway components, which in turn helped to map mutations, associated phenotypes, pathways, and mechanisms.


European Semantic Web Conference | 2018

Assessing FAIR Data Principles Against the 5-Star Open Data Principles

Ali Hasnain; Dietrich Rebholz-Schuhmann

Access to biomedical data is increasingly important to enable data-driven science in the research community. The Linked Open Data (LOD) principles (by Tim Berners-Lee) have been suggested to judge the quality of data by its accessibility (open data access), by its format and structures, and by its interoperability with other data sources. The objective is to use interoperable data sources across the Web with ease.


Data Warehousing and Knowledge Discovery | 2018

FedS: Towards Traversing Federated RDF Graphs

Qaiser Mehmood; Alokkumar Jha; Dietrich Rebholz-Schuhmann; Ratnesh Sahay

Traversing paths within a graph is a well-studied problem and highly intractable, especially with large-scale graphs. In the case of multiple graphs, the standard practice is to merge distinct graphs in a centralised way to evaluate the existence of paths between given entities (or nodes). In the biomedical domain, counting and retrieving the paths (or edges) that connect two biological entities is a highly desirable feature expected from graph databases. Therefore, non-standard solutions exist that count and retrieve paths from a single graph database. From the standard perspective, SPARQL 1.1 provides the navigational feature called Property Paths (PP), which is limited to a single RDF graph where path existence can be evaluated between a pair of nodes. In this paper, we propose a federated approach – called FedS – that retrieves paths from multiple RDF triple stores. Our key idea is to partially delegate computational load to a set of federated RDF triple stores in a peer-to-peer manner, thus reducing the computational burden on a centralised query processing server. In our preliminary investigation, we evaluate FedS against the state-of-the-art approaches that provide the path counting feature over a single RDF graph. We compare FedS against these approaches in terms of performance (overall path retrieval time) and result completeness, i.e., number of paths retrieved.
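To make the notion of retrieving paths across several stores concrete, here is a minimal sketch in which each store only answers "which edges leave this node?" and no merged graph is ever built. It is not the FedS algorithm: the traversal still runs on one coordinator rather than being delegated peer-to-peer, and the store contents, names and hop limit are invented for illustration.

```python
def outgoing(stores, node):
    """Ask every store for the edges leaving `node` (stand-in for a remote call)."""
    for name, graph in stores.items():
        for predicate, target in graph.get(node, []):
            yield name, predicate, target


def find_paths(stores, start, goal, max_hops):
    """Enumerate simple paths from start to goal using at most max_hops edges."""
    paths = []

    def walk(node, visited, path):
        if node == goal and path:
            paths.append(list(path))
            return
        if len(path) == max_hops:
            return
        for store, predicate, target in outgoing(stores, node):
            if target in visited:
                continue  # keep paths simple (no repeated nodes)
            path.append((store, node, predicate, target))
            walk(target, visited | {target}, path)
            path.pop()

    walk(start, {start}, [])
    return paths


# Two hypothetical triple stores, each modelled as {subject: [(predicate, object)]}.
stores = {
    "s1": {"geneA": [("interactsWith", "proteinB")]},
    "s2": {
        "proteinB": [("partOf", "pathwayC")],
        "geneA": [("involvedIn", "pathwayC")],
    },
}

paths = find_paths(stores, "geneA", "pathwayC", max_hops=3)
for p in paths:
    print(p)
```

The first path found crosses both stores, which is exactly the case a single-graph Property Path query cannot cover without merging the data first.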


Journal of Biomedical Semantics | 2018

Disease mentions in airport and hospital geolocations expose dominance of news events for disease concerns

Joana M. Barros; Jim Duggan; Dietrich Rebholz-Schuhmann

Background: In recent years, Twitter has been applied to monitor diseases through its facility to monitor users’ comments and concerns in real-time. The analysis of tweets for disease mentions should reflect not only user-specific concerns but also disease outbreaks. This requires the use of standard terminological resources and can be focused on selected geographic locations. In our study, we differentiate between hospital and airport locations to better distinguish disease outbreaks from background mentions of disease concerns. Results: Our analysis covers all geolocated tweets over a 6-month time period, uses SNOMED-CT as a standard medical terminology, and explores language patterns (as well as MetaMap) to identify mentions of diseases in reference to the geolocation of tweets. Contrary to our expectation, hospital and airport geolocations are not suitable to collect significant portions of tweets concerned with disease outcomes. Overall, geolocated tweets exposed a large number of messages commenting on disease-related news articles. Furthermore, the geolocated messages exposed an over-representation of non-communicable diseases in contrast to infectious diseases. Conclusions: Our findings suggest that disease mentions on Twitter not only serve the purpose to share personal statements but also to share concerns about news articles. In particular, our assumption about the relevance of hospital and airport geolocations for an increased frequency of disease mentions has not been met. To further address the linguistic cues, we propose the study of health forums to understand how a change in medium affects the language applied by the users. Finally, our research on the language use may provide essential clues to distinguish complementary trends in the use of language in Twitter when analysing health-related topics.


Briefings in Bioinformatics | 2018

Where to search top-K biomedical ontologies?

Daniela Oliveira; Anila Sahar Butt; Armin Haller; Dietrich Rebholz-Schuhmann; Ratnesh Sahay

Motivation: Searching for precise terms and terminological definitions in the biomedical data space is problematic, as researchers find overlapping, closely related and even equivalent concepts in a single or multiple ontologies. Search engines that retrieve ontological resources often suggest an extensive list of search results for a given input term, which leads to the tedious task of selecting the best-fit ontological resource (class or property) for the input term and reduces user confidence in the retrieval engines. A systematic evaluation of these search engines is necessary to understand their strengths and weaknesses in different search requirements. Result: We have implemented seven comparable Information Retrieval ranking algorithms to search through ontologies and compared them against four search engines for ontologies. Free-text queries have been performed, the outcomes have been judged by experts and the ranking algorithms and search engines have been evaluated against the expert-based ground truth (GT). In addition, we propose a probabilistic GT that is developed automatically to provide deeper insights and confidence to the expert-based GT as well as evaluating a broader range of search queries. Conclusion: The main outcome of this work is the identification of key search factors for biomedical ontologies together with search requirements and a set of recommendations that will help biomedical experts and ontology engineers to select the best-suited retrieval mechanism in their search scenarios. We expect that this evaluation will allow researchers and practitioners to apply the current search techniques more reliably and that it will help them to select the right solution for their daily work. Availability: The source code (of seven ranking algorithms), ground truths and experimental results are available at https://github.com/danielapoliveira/bioont-search-benchmark


International Conference on Big Data | 2017

Querying web polystores

Yasar Khan; Antoine Zimmermann; Alokkumar Jha; Dietrich Rebholz-Schuhmann; Ratnesh Sahay

The database, semantic web, and linked data communities have proposed solutions that federate queries over multiple data sources using a single data model. Nowadays, the data retrieval requirements originating from versatile and broad domains like healthcare and life sciences (HCLS) are changing this conventional trend of federating queries over a single data model, primarily due to the simultaneous use of different data models (CSV, JSON, RDB, RDF, XML, etc.) in a real-life scenario. It is now impractical to assume that the variety (graph, key-value, stream, text, table, tree, etc.) of high-volume data residing in specialised storage engines will first be converted to a common data model, stored in a general-purpose data storage engine, and finally be queried over the Web. Nevertheless, in this era where genomics datasets are growing from petascale to exascale, it is now important to exploit such vast domain resources in their native data models. The key approach is to query the vast data resources from their native data models and specialised storage engines. In this paper, we propose a Web-based query federation mechanism, called PolyWeb, that unifies query answering over multiple native data models (CSV, RDB, and RDF). We demonstrate PolyWeb on a cancer genomics use-case where it is often the case that descriptions of biological and chemical entities (e.g., gene, disease, drug, pathways) span multiple data models. In order to assess the benefits and limitations of evaluating queries over native data models, we evaluate PolyWeb against a state-of-the-art query federation engine in terms of result completeness, source selection, and overall query execution time.


Very Large Data Bases | 2016

Drug Dosage Balancing Using Large Scale Multi-omics Datasets

Alokkumar Jha; Muntazir Mehdi; Yasar Khan; Qaiser Mehmood; Dietrich Rebholz-Schuhmann; Ratnesh Sahay

Cancer is a disease of biological and cell cycle processes, driven by dosage of the limited set of drugs, resistance, mutations, and side effects. The identification of such a limited set of drugs and their targets, pathways, and effects based on large-scale, multi-dimensional multi-omics datasets is one of the key challenges in data-driven cancer genomics. This paper demonstrates the use of public databases associated with Drug-TargetGene/Protein-Disease for an in-depth analysis of approved cancer drugs, their genetic associations, and their pathways, in order to establish a dosage balancing mechanism. This paper will also help to understand cancer in terms of disease-associated pathways and the effect of drug treatment on cancer cells. We employ the Semantic Web approach to provide an integrated knowledge discovery process and a network of integrated datasets. The approach is employed to address biological questions involving (1) associated drugs and their omics signatures; (2) identification of gene associations with integrated Drug-Target databases; (3) mutations, variants, and alterations from these targets; (4) their PPI interactions and associated oncogenic pathways; and (5) associated biological processes aligned with these mutations and pathways, to identify the IC-50 level of each drug along with adverse events and alternate indications. In principle, this large semantically integrated database of around 30 databases will serve as Semantic Linked Association Prediction in drug discovery to explore and expand dosage balancing and drug re-purposing.

Collaboration

Top Co-Authors

Ratnesh Sahay, National University of Ireland
Qaiser Mehmood, National University of Ireland
Achille Zappa, National University of Ireland
Ali Hasnain, National University of Ireland
Alokkumar Jha, National University of Ireland
Yasar Khan, National University of Ireland
Muntazir Mehdi, National University of Ireland
Muhammad Saleem, University of Agriculture
Durre Zehra, National University of Ireland