Network


Latest external collaboration on country level. Dive into details by clicking on the dots.

Hotspot


Dive into the research topics where Helena F. Deus is active.

Publication


Featured researches published by Helena F. Deus.


PLOS ONE | 2008

A Semantic Web management model for integrative biomedical informatics.

Helena F. Deus; Romesh Stanislaus; Diogo F. Veiga; Carmen Behrens; Ignacio I. Wistuba; John D. Minna; Harold R. Garner; Stephen G. Swisher; Jack A. Roth; Arlene M. Correa; Bradley M. Broom; Kevin R. Coombes; Allen Chang; Lynn H. Vogel; Jonas S. Almeida

Background Data, data everywhere. The diversity and magnitude of the data generated in the Life Sciences defies automated articulation among complementary efforts. The additional need in this field for managing property and access permissions compounds the difficulty very significantly. This is particularly the case when the integration involves multiple domains and disciplines, even more so when it includes clinical and high throughput molecular data. Methodology/Principal Findings The emergence of Semantic Web technologies brings the promise of meaningful interoperation between data and analysis resources. In this report we identify a core model for biomedical Knowledge Engineering applications and demonstrate how this new technology can be used to weave a management model where multiple intertwined data structures can be hosted and managed by multiple authorities in a distributed management infrastructure. Specifically, the demonstration is performed by linking data sources associated with the Lung Cancer SPORE awarded to The University of Texas MDAnderson Cancer Center at Houston and the Southwestern Medical Center at Dallas. A software prototype, available with open source at www.s3db.org, was developed and its proposed design has been made publicly available as an open source instrument for shared, distributed data management. Conclusions/Significance The Semantic Web technologies have the potential to addresses the need for distributed and evolvable representations that are critical for systems Biology and translational biomedical research. As this technology is incorporated into application development we can expect that both general purpose productivity software and domain specific software installed on our personal computers will become increasingly integrated with the relevant remote resources. In this scenario, the acquisition of a new dataset should automatically trigger the delegation of its analysis.


Journal of Web Semantics | 2014

Big linked cancer data

Muhammad Saleem; Maulik R. Kamdar; Aftab Iqbal; Shanmukha Sampath; Helena F. Deus; Axel-Cyrille Ngonga Ngomo

The amount of bio-medical data available on the Web grows exponentially with time. The resulting large volume of data makes manual exploration very tedious. Moreover, the velocity at which this data changes and the variety of formats in which bio-medical data is published makes it difficult to access them in an integrated form. Finally, the lack of an integrated vocabulary makes querying this data more difficult. In this paper, we advocate the use of Linked Data to integrate, query and visualize bio-medical data. The resulting Big Linked Data allows discovering knowledge distributed across manifold sources, making it viable for the serendipitous discovery of novel knowledge. We present the concept of Big Linked Data by showing how the constant stream of new bio-medical publications can be integrated with the Linked Cancer Genome Atlas dataset (TCGA) within a virtual integration scenario. We ensure the scalability of our approach through the novel TopFed federated query engine, which we evaluate by comparing the query execution time of our system with that of FedX on Linked TCGA. Then, we show how we can harness the value hidden in the underlying integrated data by making it easier to explore through a user-friendly interface. We evaluate the usability of the interface by using the standard system usability questionnaire as well as a customized questionnaire designed for the users of our system. Our overall result of 77 suggests that our interface is easy to use and can thus lead to novel insights.


PLOS ONE | 2008

Exploratory analysis of the copy number alterations in glioblastoma multiforme

Pablo R. Freire; Marco Vilela; Helena F. Deus; Yong Wan Kim; Dimpy Koul; Howard Colman; Kenneth D. Aldape; Oliver Bögler; W. K. Alfred Yung; Kevin R. Coombes; Gordon B. Mills; Ana Tereza Ribeiro de Vasconcelos; Jonas S. Almeida

Background The Cancer Genome Atlas project (TCGA) has initiated the analysis of multiple samples of a variety of tumor types, starting with glioblastoma multiforme. The analytical methods encompass genomic and transcriptomic information, as well as demographic and clinical data about the sample donors. The data create the opportunity for a systematic screening of the components of the molecular machinery for features that may be associated with tumor formation. The wealth of existing mechanistic information about cancer cell biology provides a natural reference for the exploratory exercise. Methodology/Principal Findings Glioblastoma multiforme DNA copy number data was generated by The Cancer Genome Atlas project for 167 patients using 227 aCGH experiments, and was analyzed to build a catalog of aberrant regions. Genome screening was performed using an information theory approach in order to quantify aberration as a deviation from a centrality without the bias of untested assumptions about its parametric nature. A novel Cancer Genome Browser software application was developed and is made public to provide a user-friendly graphical interface in which the reported results can be reproduced. The application source code and stand alone executable are available at http://code.google.com/p/cancergenome and http://bioinformaticstation.org, respectively. Conclusions/Significance The most important known copy number alterations for glioblastoma were correctly recovered using entropy as a measure of aberration. Additional alterations were identified in different pathways, such as cell proliferation, cell junctions and neural development. Moreover, novel candidates for oncogenes and tumor suppressors were also detected. A detailed map of aberrant regions is provided.


Journal of Biomedical Informatics | 2010

Exposing the cancer genome atlas as a SPARQL endpoint

Helena F. Deus; Diogo F. Veiga; Pablo R. Freire; John N. Weinstein; Gordon B. Mills; Jonas S. Almeida

The Cancer Genome Atlas (TCGA) is a multidisciplinary, multi-institutional effort to characterize several types of cancer. Datasets from biomedical domains such as TCGA present a particularly challenging task for those interested in dynamically aggregating its results because the data sources are typically both heterogeneous and distributed. The Linked Data best practices offer a solution to integrate and discover data with those characteristics, namely through exposure of data as Web services supporting SPARQL, the Resource Description Framework query language. Most SPARQL endpoints, however, cannot easily be queried by data experts. Furthermore, exposing experimental data as SPARQL endpoints remains a challenging task because, in most cases, data must first be converted to Resource Description Framework triples. In line with those requirements, we have developed an infrastructure to expose clinical, demographic and molecular data elements generated by TCGA as a SPARQL endpoint by assigning elements to entities of the Simple Sloppy Semantic Database (S3DB) management model. All components of the infrastructure are available as independent Representational State Transfer (REST) Web services to encourage reusability, and a simple interface was developed to automatically assemble SPARQL queries by navigating a representation of the TCGA domain. A key feature of the proposed solution that greatly facilitates assembly of SPARQL queries is the distinction between the TCGA domain descriptors and data elements. Furthermore, the use of the S3DB management model as a mediator enables queries to both public and protected data without the need for prior submission to a single data source.


international conference on semantic systems | 2013

Linked cancer genome atlas database

Muhammad Saleem; Shanmukha S. Padmanabhuni; Axel-Cyrille Ngonga Ngomo; Jonas S. Almeida; Stefan Decker; Helena F. Deus

The Cancer Genome Atlas (TCGA) is a multidisciplinary, multi-institutional pilot project to create an atlas of genetic mutations responsible for cancer. One of the aims of this project is to develop an infrastructure for making the cancer related data publicly accessible, to enable cancer researchers anywhere around the world to make and validate important discoveries. However, data in the cancer genome atlas are organized as text archives in a set of directories. Devising bioinformatics applications to analyse such data is still challenging, as it requires downloading very large archives and parsing the relevant text files in order to collect the critical co-variates necessary for analysis. Furthermore, the various types of experimental results are not connected biologically, i.e. in order to truly exploit the data in the genome-wide context in which the TCGA project was devised, the data needs to be converted into a structured representation and made publicly available for remote querying and virtual integration. In this work, we address these issues by RDFizing data from TCGA and linking its elements to the Linked Open Data (LOD) Cloud. The outcome is the largest LOD data source (to the best of our knowledge) comprising of over 30 billion triples. This data source can be exploited through publicly available SPARQL endpoints, thus providing an easy-to-use, time-efficient, and scalable solution to accessing the Cancer Genome Atlas. We also describe showcases which are enabled by the new linked data representation of the Cancer Genome Atlas presented in this paper.


international semantic web conference | 2014

Linked Biomedical Dataspace: Lessons Learned Integrating Data for Drug Discovery

Ali Hasnain; Maulik R. Kamdar; Panagiotis Hasapis; Dimitris Zeginis; Claude N. Warren; Helena F. Deus; Dimitrios Ntalaperas; Konstantinos A. Tarabanis; Muntazir Mehdi; Stefan Decker

The increase in the volume and heterogeneity of biomedical data sources has motivated researchers to embrace Linked Data (LD) technologies to solve the ensuing integration challenges and enhance information discovery. As an integral part of the EU GRANATUM project, a Linked Biomedical Dataspace (LBDS) was developed to semantically interlink data from multiple sources and augment the design of in silico experiments for cancer chemoprevention drug discovery. The different components of the LBDS facilitate both the bioinformaticians and the biomedical researchers to publish, link, query and visually explore the heterogeneous datasets. We have extensively evaluated the usability of the entire platform. In this paper, we showcase three different workflows depicting real-world scenarios on the use of LBDS by the domain users to intuitively retrieve meaningful information from the integrated sources. We report the important lessons that we learned through the challenges encountered and our accumulated experience during the collaborative processes which would make it easier for LD practitioners to create such dataspaces in other domains. We also provide a concise set of generic recommendations to develop LD platforms useful for drug discovery.


Journal of Biomedical Informatics | 2014

ReVeaLD: A user-driven domain-specific interactive search platform for biomedical research

Maulik R. Kamdar; Dimitris Zeginis; Ali Hasnain; Stefan Decker; Helena F. Deus

Bioinformatics research relies heavily on the ability to discover and correlate data from various sources. The specialization of life sciences over the past decade, coupled with an increasing number of biomedical datasets available through standardized interfaces, has created opportunities towards new methods in biomedical discovery. Despite the popularity of semantic web technologies in tackling the integrative bioinformatics challenge, there are many obstacles towards its usage by non-technical research audiences. In particular, the ability to fully exploit integrated information needs using improved interactive methods intuitive to the biomedical experts. In this report we present ReVeaLD (a Real-time Visual Explorer and Aggregator of Linked Data), a user-centered visual analytics platform devised to increase intuitive interaction with data from distributed sources. ReVeaLD facilitates query formulation using a domain-specific language (DSL) identified by biomedical experts and mapped to a self-updated catalogue of elements from external sources. ReVeaLD was implemented in a cancer research setting; queries included retrieving data from in silico experiments, protein modeling and gene expression. ReVeaLD was developed using Scalable Vector Graphics and JavaScript and a demo with explanatory video is available at http://www.srvgal78.deri.ie:8080/explorer. A set of user-defined graphic rules controls the display of information through media-rich user interfaces. Evaluation of ReVeaLD was carried out as a game: biomedical researchers were asked to assemble a set of 5 challenge questions and time and interactions with the platform were recorded. Preliminary results indicate that complex queries could be formulated under less than two minutes by unskilled researchers. The results also indicate that supporting the identification of the elements of a DSL significantly increased intuitiveness of the platform and usability of semantic web technologies by domain users.


Bioinformatics | 2013

A self-updating road map of The Cancer Genome Atlas

David E. Robbins; Alexander Grüneberg; Helena F. Deus; Murat M. Tanik; Jonas S. Almeida

Motivation: Since 2011, The Cancer Genome Atlas’ (TCGA) files have been accessible through HTTP from a public site, creating entirely new possibilities for cancer informatics by enhancing data discovery and retrieval. Significantly, these enhancements enable the reporting of analysis results that can be fully traced to and reproduced using their source data. However, to realize this possibility, a continually updated road map of files in the TCGA is required. Creation of such a road map represents a significant data modeling challenge, due to the size and fluidity of this resource: each of the 33 cancer types is instantiated in only partially overlapping sets of analytical platforms, while the number of data files available doubles approximately every 7 months. Results: We developed an engine to index and annotate the TCGA files, relying exclusively on third-generation web technologies (Web 3.0). Specifically, this engine uses JavaScript in conjunction with the World Wide Web Consortium’s (W3C) Resource Description Framework (RDF), and SPARQL, the query language for RDF, to capture metadata of files in the TCGA open-access HTTP directory. The resulting index may be queried using SPARQL, and enables file-level provenance annotations as well as discovery of arbitrary subsets of files, based on their metadata, using web standard languages. In turn, these abilities enhance the reproducibility and distribution of novel results delivered as elements of a web-based computational ecosystem. The development of the TCGA Roadmap engine was found to provide specific clues about how biomedical big data initiatives should be exposed as public resources for exploratory analysis, data mining and reproducible research. These specific design elements align with the concept of knowledge reengineering and represent a sharp departure from top-down approaches in grid initiatives such as CaBIG. They also present a much more interoperable and reproducible alternative to the still pervasive use of data portals. Availability: A prepared dashboard, including links to source code and a SPARQL endpoint, is available at http://bit.ly/TCGARoadmap. A video tutorial is available at http://bit.ly/TCGARoadmapTutorial. Contact: [email protected]


international semantic technology conference | 2014

A Roadmap for Navigating the Life Sciences Linked Open Data Cloud

Ali Hasnain; Syeda Sana e Zainab; Maulik R. Kamdar; Qaiser Mehmood; Claude N. Warren; Qurratal Ain Fatimah; Helena F. Deus; Muntazir Mehdi; Stefan Decker

Multiple datasets that add high value to biomedical research have been exposed on the web as a part of the Life Sciences Linked Open Data (LSLOD) Cloud. The ability to easily navigate through these datasets is crucial for personalized medicine and the improvement of drug discovery process. However, navigating these multiple datasets is not trivial as most of these are only available as isolated SPARQL endpoints with very little vocabulary reuse. The content that is indexed through these endpoints is scarce, making the indexed dataset opaque for users. In this paper, we propose an approach for the creation of an active Linked Life Sciences Data Roadmap, a set of congurable rules which can be used to discover links (roads) between biological entities (cities) in the LSLOD cloud. We have catalogued and linked concepts and properties from 137 public SPARQL endpoints. Our Roadmap is primarily used to dynamically assemble queries retrieving data from multiple SPARQL endpoints simultaneously. We also demonstrate its use in conjunction with other tools for selective SPARQL querying, semantic annotation of experimental datasets and the visualization of the LSLOD cloud. We have evaluated the performance of our approach in terms of the time taken and entity capture. Our approach, if generalized to encompass other domains, can be used for road-mapping the entire LOD cloud.


Nature Biotechnology | 2006

Data integration gets 'Sloppy'

Jonas S. Almeida; Chuming Chen; Robert Gorlitsky; Romesh Stanislaus; Marta Aires-De-Sousa; Pedro Eleutério; João A. Carriço; António Maretzek; Andreas Bohn; Allen Chang; Fan Zhang; Rahul Mitra; Gordon B. Mills; Xiaoshu Wang; Helena F. Deus

VOLUME 24 NUMBER 9 SEPTEMBER 2006 NATURE BIOTECHNOLOGY we emphatically recommend all users of dbSNP to refer to the ‘validation status’ tag and use a simple SNP classification scheme, as described above, that aims at extracting RefSNPs with lower error rates. According to our classification, dbSNP (version 124) contains in C1, C2 and C3 2,077,680, 2,946,840 and 3,470,166 entries, respectively. To investigate the differences between those three classes, we extracted the available confidence information. C1 and C2 RefSNPs have higher average values (both 51.4) than SNPs in C3 (43.2, Supplementary Notes online). Furthermore, about 87% in C1 and C2 have confidence values of at least 40, in contrast to only 63% in C3 (Fig. 1d). As a low confidence value indicates a potential sequencing error, we recommend that bioinformatics and/or experimental efforts either use only C1 and C2 RefSNPs or find a way of excluding from C3 all dbSNP entries with Phred <40 (ref. 11).

Collaboration


Dive into the Helena F. Deus's collaboration.

Top Co-Authors

Avatar

Jonas S. Almeida

University of Alabama at Birmingham

View shared research outputs
Top Co-Authors

Avatar

Stefan Decker

National University of Ireland

View shared research outputs
Top Co-Authors

Avatar

Ali Hasnain

National University of Ireland

View shared research outputs
Top Co-Authors

Avatar

Ronan Fox

National University of Ireland

View shared research outputs
Top Co-Authors

Avatar

Eric Prud'hommeaux

Massachusetts Institute of Technology

View shared research outputs
Top Co-Authors

Avatar

Gordon B. Mills

University of Texas MD Anderson Cancer Center

View shared research outputs
Top Co-Authors

Avatar
Top Co-Authors

Avatar
Top Co-Authors

Avatar
Top Co-Authors

Avatar

Muhammad Saleem

University of Agriculture

View shared research outputs
Researchain Logo
Decentralizing Knowledge