Publication


Featured research published by Hugo P. Bastos.


BMC Bioinformatics | 2008

Metrics for GO based protein semantic similarity: a systematic evaluation

Catia Pesquita; Daniel Faria; Hugo P. Bastos; António E. N. Ferreira; André O. Falcão; Francisco M. Couto

Background: Several semantic similarity measures have been applied to gene products annotated with Gene Ontology terms, providing a basis for their functional comparison. However, it is still unclear which is the best approach to semantic similarity in this context, since there is no conclusive evaluation of the various measures. Another issue is whether electronic annotations should be used in semantic similarity calculations.

Results: We conducted a systematic evaluation of GO-based semantic similarity measures using the relationship with sequence similarity as a means to quantify their performance, and assessed the influence of electronic annotations by testing the measures in the presence and absence of these annotations. We verified that the relationship between semantic and sequence similarity is not linear, but can be well approximated by a rescaled Normal cumulative distribution function. Given that the majority of the semantic similarity measures capture an identical behaviour, but differ in resolution, we used the latter as the main criterion of evaluation.

Conclusions: This work has provided a basis for the comparison of several semantic similarity measures, and can aid researchers in choosing the most adequate measure for their work. We have found that the hybrid simGIC was the measure with the best overall performance, followed by Resnik's measure using a best-match average combination approach. We have also found that the average and maximum combination approaches are problematic since both are inherently influenced by the number of terms being combined. We suspect that there may be a direct influence of data circularity in the behaviour of the results including electronic annotations, as a result of functional inference from sequence similarity.
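To make the two best-performing measures named in this abstract concrete, the following is a minimal sketch of simGIC and of Resnik's measure combined with a best-match average. It is not the authors' implementation: the toy ontology, the term names, and the information content (IC) values are all hypothetical, and the best-match average shown is one common formulation (the mean of the two directional averages of best matches).

```python
# Hypothetical toy ontology: term -> set of direct parents (not real GO data).
PARENTS = {
    "root": set(),
    "binding": {"root"},
    "catalysis": {"root"},
    "ion_binding": {"binding"},
    "kinase": {"catalysis"},
}

# Hypothetical information content values (higher means more specific).
IC = {"root": 0.0, "binding": 1.0, "catalysis": 1.2, "ion_binding": 2.5, "kinase": 3.0}


def ancestors(term):
    """Return the term together with all of its ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def extended(terms):
    """Close a set of annotation terms under the ancestor relation."""
    closed = set()
    for term in terms:
        closed |= ancestors(term)
    return closed


def sim_gic(terms_a, terms_b):
    """IC-weighted Jaccard index over the ancestor-closed annotation sets."""
    a, b = extended(terms_a), extended(terms_b)
    union_ic = sum(IC[t] for t in a | b)
    if union_ic == 0:
        return 0.0
    return sum(IC[t] for t in a & b) / union_ic


def resnik(term_a, term_b):
    """IC of the most informative common ancestor of two terms."""
    return max(IC[t] for t in ancestors(term_a) & ancestors(term_b))


def best_match_average(terms_a, terms_b):
    """Average each term's best match in the other set, in both directions.

    Assumes both term sets are non-empty.
    """
    rows = [max(resnik(a, b) for b in terms_b) for a in terms_a]
    cols = [max(resnik(a, b) for a in terms_a) for b in terms_b]
    return (sum(rows) / len(rows) + sum(cols) / len(cols)) / 2
```

For example, `sim_gic({"ion_binding", "kinase"}, {"ion_binding"})` compares the shared IC with the total IC of both proteins, while `best_match_average` on the same inputs only pairs each term with its closest counterpart, so the result does not grow simply because one protein carries more terms.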


PLOS ONE | 2012

Mining GO Annotations for Improving Annotation Consistency

Daniel Faria; Andreas Schlicker; Catia Pesquita; Hugo P. Bastos; António E. N. Ferreira; Mario Albrecht; André O. Falcão

Despite the structure and objectivity provided by the Gene Ontology (GO), the annotation of proteins is a complex task that is subject to errors and inconsistencies. Electronically inferred annotations in particular are widely considered unreliable. However, given that manual curation of all GO annotations is unfeasible, it is imperative to improve the quality of electronically inferred annotations. In this work, we analyze the full GO molecular function annotation of UniProtKB proteins, and discuss some of the issues that affect their quality, focusing particularly on the lack of annotation consistency. Based on our analysis, we estimate that 64% of the UniProtKB proteins are incompletely annotated, and that inconsistent annotations affect 83% of the protein functions and at least 23% of the proteins. Additionally, we present and evaluate a data mining algorithm, based on the association rule learning methodology, for identifying implicit relationships between molecular function terms. The goal of this algorithm is to assist GO curators in updating GO and correcting and preventing inconsistent annotations. Our algorithm predicted 501 relationships with an estimated precision of 94%, whereas the basic association rule learning methodology predicted 12,352 relationships with a precision below 9%.
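As an illustration of the basic association rule learning idea mentioned in this abstract, the sketch below computes support and confidence for pairwise rules between annotation terms. It is a simplified assumption-laden example, not the authors' refined algorithm: the function name `mine_rules`, the thresholds, and the protein and term names are hypothetical.

```python
from collections import Counter
from itertools import permutations


def mine_rules(annotations, min_support=0.4, min_confidence=0.9):
    """Find rules "a implies b" between terms that co-occur on proteins.

    annotations maps a protein identifier to its set of terms.
    Support is the fraction of proteins annotated with both terms;
    confidence is the fraction of proteins with a that also have b.
    """
    n = len(annotations)
    single = Counter()
    pair = Counter()
    for terms in annotations.values():
        for term in terms:
            single[term] += 1
        for a, b in permutations(sorted(terms), 2):
            pair[(a, b)] += 1

    rules = []
    for (a, b), count in pair.items():
        support = count / n
        confidence = count / single[a]
        if support >= min_support and confidence >= min_confidence:
            rules.append((a, b, support, confidence))
    return sorted(rules)


# Hypothetical toy annotation data (not taken from UniProtKB).
toy = {
    "P1": {"ATP_binding", "kinase_activity"},
    "P2": {"ATP_binding", "kinase_activity"},
    "P3": {"ATP_binding"},
    "P4": {"ATP_binding", "kinase_activity"},
    "P5": {"metal_binding"},
}
rules = mine_rules(toy)
```

On this toy data only the rule from `kinase_activity` to `ATP_binding` passes both thresholds. A protein annotated with `kinase_activity` but lacking `ATP_binding` would then be a candidate inconsistency for a curator to review.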


International Scholarly Research Notices | 2012

Chemical Entity Recognition and Resolution to ChEBI

Tiago Grego; Catia Pesquita; Hugo P. Bastos; Francisco M. Couto

Chemical entities are ubiquitous throughout the biomedical literature, and the development of text-mining systems that can efficiently identify those entities is required. Due to the lack of available corpora and data resources, the community has focused its efforts on the development of gene and protein named entity recognition systems, but with the release of ChEBI and the availability of an annotated corpus, this task can be addressed. We developed a machine-learning-based method for chemical entity recognition and a lexical-similarity-based method for chemical entity resolution and compared them with Whatizit, a popular dictionary-based method. Our methods outperformed the dictionary-based method in all tasks, yielding an improvement in F-measure of 20% for the entity recognition task, 2–5% for the entity-resolution task, and 15% for combined entity recognition and resolution tasks.
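The F-measure used to report these improvements can be sketched as follows. This is a generic illustration assuming exact-match scoring of entity spans; the function name and the example spans are hypothetical and do not reproduce the paper's evaluation setup.

```python
def precision_recall_f1(predicted, gold):
    """Score predicted entity spans against gold spans by exact match."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical (start, end) character offsets of chemical mentions.
gold_spans = {(0, 7), (12, 19), (30, 37), (40, 45)}
predicted_spans = {(0, 7), (12, 19), (50, 55)}
precision, recall, f1 = precision_recall_f1(predicted_spans, gold_spans)
```

Here two of the three predictions are correct and two of the four gold mentions are found, so the F-measure balances a precision of 2/3 against a recall of 1/2.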


Methods in Molecular Biology | 2011

Application of gene ontology to gene identification

Hugo P. Bastos; Bruno Tavares; Catia Pesquita; Daniel Faria; Francisco M. Couto

Candidate gene identification deals with associating genes with underlying biological phenomena, such as diseases and specific disorders. It has been shown that classes of diseases with similar phenotypes are caused by functionally related genes. Currently, a fair amount of knowledge about functional characterization can be found across several public databases; however, functional descriptors can be ambiguous, domain specific, and context dependent. In order to cope with these issues, the Gene Ontology (GO) project developed a bio-ontology of broad scope and wide applicability. Thus, the structured and controlled vocabulary of terms provided by the GO project to describe the biological roles of gene products can be very helpful in candidate gene identification approaches. The method presented here uses GO annotation data to identify the most meaningful functional aspects occurring in a given set of related gene products. The method measures this meaningfulness by calculating an e-value based on the frequency of annotation of each GO term in the set of gene products versus the total frequency of annotation. Then, after a GO term related to the underlying biological phenomenon being studied is selected, the method uses semantic similarity to rank the given gene products that are annotated to the term. This enables the user to further narrow down the list of gene products and identify those that are more likely to be of interest.
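The abstract does not give the exact e-value formula, so the following sketch is only one plausible way to compare a term's frequency in a gene product set with its overall frequency. It assumes a hypergeometric tail probability multiplied by the number of terms tested; the function names, term identifiers, and counts are hypothetical.

```python
from math import comb


def hypergeom_tail(k, n, K, N):
    """P(X >= k) when drawing n gene products from N, of which K carry the term."""
    upper = min(n, K)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)


def term_e_values(set_counts, background_counts, set_size, background_size):
    """Assumed e-value: tail probability scaled by the number of terms tested."""
    n_tests = len(set_counts)
    return {
        term: n_tests
        * hypergeom_tail(k, set_size, background_counts[term], background_size)
        for term, k in set_counts.items()
    }


# Hypothetical counts: 2 gene products in the set, 10 in the background.
e_values = term_e_values(
    set_counts={"GO:A": 2, "GO:B": 1},
    background_counts={"GO:A": 3, "GO:B": 8},
    set_size=2,
    background_size=10,
)
```

Under these assumptions a rare background term that annotates every member of the set (`GO:A`) receives a much lower e-value than a term that is common everywhere (`GO:B`), which matches the intuition of picking the most meaningful functional aspects before ranking by semantic similarity.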


Frontiers in Genetics | 2013

Annotation extension through protein family annotation coherence metrics

Hugo P. Bastos; Luka A. Clarke; Francisco M. Couto

Protein functional annotation consists of associating proteins with textual descriptors elucidating their biological roles. The bulk of annotation is done via automated procedures that ultimately rely on annotation transfer. Despite the large number of existing protein annotation procedures, the ever-growing protein space is never completely annotated. One of the facets of annotation incompleteness derives from annotation uncertainty. Often, when protein function cannot be predicted with enough specificity, the protein is instead conservatively annotated with more generic terms. In a scenario of protein families or functionally related (or even dissimilar) sets, this makes it more difficult to use annotations to compare the extent of functional relatedness among all family or set members. However, we postulate that identifying subsets of functionally coherent proteins annotated at a very specific level can help the annotation extension of other incompletely annotated proteins within the same family or functionally related set. As an example, we analyse the status of annotation of a set of CAZy families belonging to the Polysaccharide Lyase class. We show that through the use of visualization methods and semantic-similarity-based metrics it is possible to identify families, and respective annotation terms within them, that are suitable for possible annotation extension. Based on our analysis, we then propose a semi-automatic methodology leading to the extension of single annotation terms within these partially annotated protein sets or families.


International Conference on Digital Information Management | 2008

Identifying bioentity recognition errors of rule-based text-mining systems

Francisco M. Couto; Tiago Grego; Hugo P. Bastos; Catia Pesquita; Rafael Torres; Pablo Sanchez; Leandro Pascual; Christian Blaschke

An important research topic in Bioinformatics involves the exploration of vast amounts of biological and biomedical scientific literature (BioLiterature). Over the last few decades, text-mining systems have exploited this BioLiterature to reduce the time spent by researchers in its analysis. However, state-of-the-art approaches are still far from reaching performance levels acceptable to curators, and remain below the performance obtained in other domains, such as personal name recognition or news text. To achieve high levels of performance, it is essential that text-mining tools effectively recognize bioentities present in BioLiterature. This paper presents FIBRE (Filtering Bioentity Recognition Errors), a system for automatically filtering misannotations generated by rule-based systems that automatically recognize bioentities in BioLiterature. FIBRE aims at using different sets of automatically generated annotations to identify the main features that characterize an annotation as being of a certain type. These features are then used to filter misannotations using a confidence threshold. The assessment of FIBRE was performed on a set of more than 17,000 documents, previously annotated by Text Detective, a state-of-the-art rule-based name bioentity recognition system. Curators evaluated the gene annotations given by Text Detective that FIBRE classified as non-gene annotations, and we found that FIBRE was able to filter more than 600 misannotations with a precision above 92%, requiring minimal human effort, which demonstrates the effectiveness of FIBRE in a realistic scenario.


PLOS ONE | 2015

GRYFUN: A Web Application for GO Term Annotation Visualization and Analysis in Protein Sets

Hugo P. Bastos; Lisete Sousa; Luka A. Clarke; Francisco M. Couto

Functional context for biological sequences is provided in the form of annotations. However, within a group of similar sequences there can be annotation heterogeneity in terms of coverage and specificity. This in turn can introduce issues regarding the interpretation of actual functional similarity and overall functional coherence of such a group. One way to mitigate such issues is through the use of visualization and statistical techniques. Therefore, in order to help interpret this annotation heterogeneity, we created a web application that generates Gene Ontology annotation graphs for protein sets and their associated statistics, from simple frequencies to enrichment values and Information Content based metrics. The publicly accessible website http://xldb.di.fc.ul.pt/gryfun/ currently accepts lists of UniProt accession numbers in order to create user-defined protein sets for subsequent annotation visualization and statistical assessment. GRYFUN is a freely available web application that allows GO annotation visualization of protein sets and can be used for annotation coherence and cohesiveness analysis and for annotation extension assessments within under-annotated protein sets.


Journal of Biomedical Semantics | 2016

Functional coherence metrics in protein families

Hugo P. Bastos; Lisete Sousa; Luka A. Clarke; Francisco M. Couto

Background: Biological sequences, such as proteins, have been provided with annotations that assign functional information. These functional annotations are associations of proteins (or other biological sequences) with descriptors characterizing their biological roles. However, not all proteins are fully (or even at all) annotated. This annotation incompleteness limits our ability to make sound assertions about the functional coherence within sets of proteins. Annotation incompleteness is a problematic issue when measuring semantic functional similarity of biological sequences, since such measures can only capture a limited amount of all the semantic aspects the sequences may encompass.

Methods: Instead of relying uniquely on single (reductive) metrics, this work proposes a comprehensive approach for assessing functional coherence within protein sets. The approach entails using visualization and term enrichment techniques anchored in specific domain knowledge, such as a protein family. For that purpose we evaluate two novel functional coherence metrics, mUI and mGIC, which combine aspects of semantic similarity measures and term enrichment.

Results: These metrics were used to effectively capture and measure the local similarity cores within protein sets. Hence, these metrics coupled with visualization tools allow an improved grasp of three important functional annotation aspects: completeness, agreement and coherence.

Conclusions: Measuring the functional similarity between proteins based on their annotations is a non-trivial task. Several metrics exist, but due both to characteristics intrinsic to the nature of graphs and to extrinsic factors related to the process of annotation, each measure can only capture certain functional annotation aspects of proteins. Hence, when trying to measure the functional coherence of a set of proteins, a single metric is too reductive. Therefore, it is valuable to be aware of how each employed similarity metric works and what similarity aspects it can best capture. Here we test the behaviour and resilience of some similarity metrics.


Distributed Computing and Artificial Intelligence | 2009

GREAT: Gene Regulation EvAluation Tool

Catia M. Machado; Hugo P. Bastos; Francisco M. Couto

Our understanding of biological systems is highly dependent on the study of the mechanisms that regulate genetic expression. In this paper we present a tool to evaluate scientific papers that potentially describe Saccharomyces cerevisiae gene regulations, following the identification of transcription factors in abstracts using text mining techniques. GREAT evaluates the probability of a given gene-transcription factor pair corresponding to a gene regulation based on data retrieved from public biological databases.


ACM Symposium on Applied Computing | 2008

BOLOS: BLAST & Ontology Linked-hOmologue Stars

Hugo P. Bastos; Catia Pesquita; Daniel Faria; André O. Falcão

The growing number of protein sequences in databases has led to, among other issues, an increase in redundancy, since many of the new sequences are similar to others already in the databases. While a comprehensive view of the protein space is essential, redundancy hampers large-scale searches (such as BLAST) or studies (comparative genomics, functional genomics). One of the solutions to eliminate redundancy in the protein space is through clustering methods, which group redundant proteins into single entities (clusters). In this work we present BOLOS, a web tool aimed at functional genomics that combines a clustered protein space with molecular function GO term annotations. It allows searches over the cluster space with a raw sequence, a Swiss-Prot protein or a molecular function GO term, and provides three GO-based parameters which allow the assessment of cluster quality and biological validity. The user can also choose one of the twelve different significance level cluster spaces available. For each cluster, BOLOS provides the essential information (name and accession number) about the proteins and GO terms it contains, as well as relevant statistics, while linking to external databases to allow users to access further information. BOLOS is available at: http://xldb.di.fc.ul.pt/biotools/bolos/
