Publication


Featured research published by Catalina O. Tudor.


Database | 2013

An overview of the BioCreative 2012 Workshop Track III: interactive text mining task.

Cecilia N. Arighi; Ben Carterette; K. Bretonnel Cohen; Martin Krallinger; W. John Wilbur; Petra Fey; Robert Dodson; Laurel Cooper; Ceri E. Van Slyke; Wasila M. Dahdul; Paula M. Mabee; Donghui Li; Bethany Harris; Marc Gillespie; Silvia Jimenez; Phoebe M. Roberts; Lisa Matthews; Kevin G. Becker; Harold J. Drabkin; Susan M. Bello; Luana Licata; Andrew Chatr-aryamontri; Mary L. Schaeffer; Julie Park; Melissa Haendel; Kimberly Van Auken; Yuling Li; Juancarlos Chan; Hans-Michael Müller; Hong Cui

In many databases, biocuration primarily involves literature curation, which usually involves retrieving relevant articles, extracting information that will translate into annotations and identifying new incoming literature. As the volume of biological literature increases, the use of text mining to assist in biocuration becomes increasingly relevant. A number of groups have developed tools for text mining from a computer science/linguistics perspective, and there are many initiatives to curate some aspect of biology from the literature. Some biocuration efforts already make use of a text mining tool, but there have not been many broad-based systematic efforts to study which aspects of a text mining tool contribute to its usefulness for a curation task. Here, we report on an effort to bring together text mining tool developers and database biocurators to test the utility and usability of tools. Six text mining systems presenting diverse biocuration tasks participated in a formal evaluation, and appropriate biocurators were recruited for testing. The performance results from this evaluation indicate that some of the systems were able to improve efficiency of curation by speeding up the curation task significantly (∼1.7- to 2.5-fold) over manual curation. In addition, some of the systems were able to improve annotation accuracy when compared with the performance on the manually curated set. In terms of inter-annotator agreement, the factors that contributed to significant differences for some of the systems included the expertise of the biocurator on the given curation task, the inherent difficulty of the curation and attention to annotation guidelines. After the task, annotators were asked to complete a survey to help identify strengths and weaknesses of the various systems. 
The analysis of this survey highlights how important task completion is to the biocurators’ overall experience of a system, regardless of the system’s high score on design, learnability and usability. In addition, strategies to refine the annotation guidelines and systems documentation, to adapt the tools to the needs and query types the end user might have and to evaluate performance in terms of efficiency, user interface, result export and traditional evaluation metrics have been analyzed during this task. This analysis will help to plan for a more intense study in BioCreative IV.


Bioinformatics and Biomedicine | 2012

iSimp: A sentence simplification system for biomedical text

Yifan Peng; Catalina O. Tudor; Manabu Torii; Cathy H. Wu; K. Vijay-Shanker

Text mining applications using natural language processing are often confronted with long and complicated sentences. This is observed particularly in the abstracts of scientific articles where authors summarize, in a few sentences, the various facts described throughout the manuscript. Being rich in novel and important information, the abstract has been the primary target of biomedical text mining applications. In this work, we aim to simplify complex sentences in abstracts of biomedical text so that they can be readily processed by text mining applications. We focus on syntactic constructs that are frequently encountered in the biomedical literature, such as coordinations, relative clauses, and appositions, with emphasis on their boundary detection. Our approach yielded good detection performance (average F-measure between 86.5% and 92.7%), and aided in improving two biomedical text mining applications, RLIMS-P and RankPref.


BMC Bioinformatics | 2010

eGIFT: Mining Gene Information from the Literature

Catalina O. Tudor; Carl J. Schmidt; K. Vijay-Shanker

Background: With the biomedical literature continually expanding, searching PubMed for information about specific genes becomes increasingly difficult. Not only can thousands of results be returned, but gene name ambiguity leads to many irrelevant hits. As a result, it is difficult for life scientists and gene curators to rapidly get an overall picture about a specific gene from documents that mention its names and synonyms. Results: In this paper, we present eGIFT (http://biotm.cis.udel.edu/eGIFT), a web-based tool that associates informative terms, called iTerms, and sentences containing them, with genes. To associate iTerms with a gene, eGIFT ranks iTerms about the gene, based on a score which compares the frequency of occurrence of a term in the gene's literature to its frequency of occurrence in documents about genes in general. To retrieve a gene's documents (Medline abstracts), eGIFT considers all gene names, aliases, and synonyms. Since many of the gene names can be ambiguous, eGIFT applies a disambiguation step to remove matches that do not correspond to this gene. An additional filtering process is applied to retain those abstracts that focus on the gene rather than mention it in passing. eGIFT's information for a gene is pre-computed and users of eGIFT can search for genes by using a name or an EntrezGene identifier. iTerms are grouped into different categories to facilitate a quick inspection. eGIFT also links an iTerm to sentences mentioning the term to allow users to see the relation between the iTerm and the gene. We evaluated the precision and recall of eGIFT's iTerms for 40 genes; between 88% and 94% of the iTerms were marked as salient by our evaluators, and 94% of the UniProtKB keywords for these genes were also identified by eGIFT as iTerms. Conclusions: Our evaluations suggest that iTerms capture highly relevant aspects of genes. 
Furthermore, by showing sentences containing these terms, eGIFT can provide a quick description of a specific gene. eGIFT helps not only life scientists survey results of high-throughput experiments, but also annotators to find articles describing gene aspects and functions.
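
The abstract above describes ranking terms by comparing how often a term occurs in a gene's literature with how often it occurs in documents about genes in general, but it does not give the formula. The following is a minimal sketch of that idea only, not eGIFT's actual score: the function name `rank_terms`, the whitespace tokenization, the use of document frequency, and the add-one smoothing are all assumptions made for illustration.

```python
from collections import Counter


def rank_terms(gene_docs, background_docs):
    """Rank terms by relative document frequency: gene literature vs. background.

    Hypothetical scoring, not the published eGIFT formula.
    """
    def doc_freq(docs):
        # Count each term once per document (document frequency).
        counts = Counter()
        for doc in docs:
            counts.update(set(doc.lower().split()))
        return counts

    gene_df = doc_freq(gene_docs)
    back_df = doc_freq(background_docs)
    n_gene, n_back = len(gene_docs), len(background_docs)

    scores = {}
    for term, df in gene_df.items():
        gene_rate = df / n_gene
        # Add-one smoothing so terms unseen in the background do not divide by zero.
        back_rate = (back_df[term] + 1) / (n_back + 1)
        scores[term] = gene_rate / back_rate

    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


if __name__ == "__main__":
    gene_docs = [
        "bad promotes apoptosis",
        "apoptosis regulated by bad phosphorylation",
    ]
    background_docs = [
        "gene expression in liver",
        "gene regulated by kinase",
        "protein expression",
    ]
    for term, score in rank_terms(gene_docs, background_docs):
        print(f"{term}\t{score:.2f}")
```

In this toy input, terms that appear in every gene document and never in the background rank first, while terms that are equally common in both sets score close to 1. A real system would also need the name disambiguation and "in passing" filtering steps that the abstract mentions, which this sketch omits.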


Database | 2012

The eFIP system for text mining of protein interaction networks of phosphorylated proteins

Catalina O. Tudor; Cecilia N. Arighi; Qinghua Wang; Cathy H. Wu; K. Vijay-Shanker

Protein phosphorylation is a central regulatory mechanism in signal transduction involved in most biological processes. Phosphorylation of a protein may lead to activation or repression of its activity, alternative subcellular location and interaction with different binding partners. Extracting this type of information from scientific literature is critical for connecting phosphorylated proteins with kinases and interaction partners, along with their functional outcomes, for knowledge discovery from phosphorylation protein networks. We have developed the Extracting Functional Impact of Phosphorylation (eFIP) text mining system, which combines several natural language processing techniques to find relevant abstracts mentioning phosphorylation of a given protein together with indications of protein–protein interactions (PPIs) and potential evidence for impact of phosphorylation on the PPIs. eFIP integrates our previously developed tools, Extracting Gene Related ABstracts (eGRAB) for document retrieval and name disambiguation and Rule-based LIterature Mining System for Protein Phosphorylation (RLIMS-P) for extraction of phosphorylation information, with a PPI module to detect PPIs involving phosphorylated proteins and an impact module for relation extraction. The text mining system has been integrated into the curation workflow of the Protein Ontology (PRO) to capture knowledge about phosphorylated proteins. The eFIP web interface accepts gene/protein names or identifiers, or PubMed identifiers as input, and displays results as a ranked list of abstracts with sentence evidence and summary table, which can be exported in a spreadsheet upon result validation. 
As a participant in the BioCreative-2012 Interactive Text Mining track, the performance of eFIP was evaluated on document retrieval (F-measures of 78–100%), sentence-level information extraction (F-measures of 70–80%) and document ranking (normalized discounted cumulative gain measures of 93–100% and mean average precision of 0.86). The utility and usability of the eFIP web interface were also evaluated during the BioCreative Workshop. The use of the eFIP interface provided a significant speed-up (∼2.5-fold) for time to completion of the curation task. Additionally, eFIP significantly simplifies the task of finding relevant articles on PPI involving phosphorylated forms of a given protein. Database URL: http://proteininformationresource.org/pirwww/iprolink/eFIP.shtml
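
The document-ranking results above are reported as normalized discounted cumulative gain (nDCG) and mean average precision (MAP). As a reminder of what those measures compute, here is a minimal sketch using common textbook definitions. The abstract does not say which variants the evaluation used, so the log2 discount and the normalization of average precision by the number of relevant documents found in the ranked list are assumptions, and the function names are illustrative.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain with a log2(rank + 1) discount (one common variant)."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))


def ndcg(relevances):
    """DCG divided by the DCG of the ideal (descending) ordering of the same list."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


def average_precision(relevant_flags):
    """Mean of precision@k over the ranks k at which a relevant document appears."""
    hits, total = 0, 0.0
    for rank, is_relevant in enumerate(relevant_flags, start=1):
        if is_relevant:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def mean_average_precision(queries):
    """Average of average_precision over several ranked lists (one per query)."""
    return sum(average_precision(q) for q in queries) / len(queries)


if __name__ == "__main__":
    # Graded relevance for one ranked list; relevant documents ranked first give nDCG = 1.
    print(ndcg([1, 1, 0]))
    # Binary relevance for two ranked lists.
    print(mean_average_precision([[True], [False, True]]))
```

Both measures reward placing relevant abstracts near the top of the list; nDCG additionally supports graded relevance, while average precision treats relevance as binary.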


Database | 2015

Construction of phosphorylation interaction networks by text mining of full-length articles using the eFIP system

Catalina O. Tudor; Karen E. Ross; Gang Li; K. Vijay-Shanker; Cathy H. Wu; Cecilia N. Arighi

Protein phosphorylation is a reversible post-translational modification where a protein kinase adds a phosphate group to a protein, potentially regulating its function, localization and/or activity. Phosphorylation can affect protein–protein interactions (PPIs), abolishing interaction with previous binding partners or enabling new interactions. Extracting phosphorylation information coupled with PPI information from the scientific literature will facilitate the creation of phosphorylation interaction networks of kinases, substrates and interacting partners, toward knowledge discovery of functional outcomes of protein phosphorylation. Increasingly, PPI databases are interested in capturing the phosphorylation state of interacting partners. We have previously developed the eFIP (Extracting Functional Impact of Phosphorylation) text mining system, which identifies phosphorylated proteins and phosphorylation-dependent PPIs. In this work, we present several enhancements for the eFIP system: (i) text mining for full-length articles from the PubMed Central open-access collection; (ii) the integration of the RLIMS-P 2.0 system for the extraction of phosphorylation events with kinase, substrate and site information; (iii) the extension of the PPI module with new trigger words/phrases describing interactions and (iv) the addition of the iSimp tool for sentence simplification to aid in the matching of syntactic patterns. We enhance the website functionality to: (i) support searches based on protein roles (kinases, substrates, interacting partners) or using keywords; (ii) link protein entities to their corresponding UniProt identifiers if mapped and (iii) support visual exploration of phosphorylation interaction networks using Cytoscape. The evaluation of eFIP on full-length articles achieved 92.4% precision, 76.5% recall and 83.7% F-measure on 100 article sections. 
To demonstrate eFIP for knowledge extraction and discovery, we constructed phosphorylation-dependent interaction networks involving 14-3-3 proteins identified from cancer-related versus diabetes-related articles. Comparison of the phosphorylation interaction network of kinases, phosphoproteins and interactants obtained from eFIP searches, along with enrichment analysis of the protein set, revealed several shared interactions, highlighting common pathways discussed in the context of both diseases. Database URL: http://proteininformationresource.org/efip
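
The precision, recall and F-measure reported above are mutually consistent if the F-measure is the balanced F1 score (the harmonic mean of precision and recall). That reading is an assumption, since the abstract does not state the weighting; the short check below uses a hypothetical helper `f_measure` to show the arithmetic.

```python
def f_measure(precision, recall, beta=1.0):
    """F-beta score; beta=1.0 gives the harmonic mean of precision and recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


if __name__ == "__main__":
    # Figures quoted in the abstract: 92.4% precision, 76.5% recall.
    print(round(100 * f_measure(0.924, 0.765), 1))  # prints 83.7
```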


BMC Systems Biology | 2013

A framework for biomedical figure segmentation towards image-based document retrieval

Luis D. Lopez; Jingyi Yu; Cecilia N. Arighi; Catalina O. Tudor; Manabu Torii; Hongzhan Huang; K. Vijay-Shanker; Cathy H. Wu

The figures included in many of the biomedical publications play an important role in understanding the biological experiments and facts described within. Recent studies have shown that it is possible to integrate the information that is extracted from figures in classical document classification and retrieval tasks in order to improve their accuracy. One important observation about the figures included in biomedical publications is that they are often composed of multiple subfigures or panels, each describing different methodologies or results. The use of these multimodal figures is a common practice in bioscience, as experimental results are graphically validated via multiple methodologies or procedures. Thus, for a better use of multimodal figures in document classification or retrieval tasks, as well as for providing the evidence source for derived assertions, it is important to automatically segment multimodal figures into subfigures and panels. This is a challenging task, however, as different panels can contain similar objects (e.g., bar charts and line charts) with multiple layouts. Also, certain types of biomedical figures are text-heavy (e.g., images of DNA and protein sequences) and they differ from traditional images. As a result, classical image segmentation techniques based on low-level image features, such as edges or color, are not directly applicable to robustly partition multimodal figures into single modal panels. In this paper, we describe a robust solution for automatically identifying and segmenting unimodal panels from a multimodal figure. Our framework starts by robustly harvesting figure-caption pairs from biomedical articles. We base our approach on the observation that the document layout can be used to identify encoded figures and figure boundaries within PDF files. Taking into consideration the document layout allows us to correctly extract figures from the PDF document and associate their corresponding caption. 
We combine pixel-level representations of the extracted images with information gathered from their corresponding captions to estimate the number of panels in the figure. Thus, our approach simultaneously identifies the number of panels and the layout of figures. In order to evaluate the approach described here, we applied our system on documents containing protein-protein interactions (PPIs) and compared the results against a gold standard that was annotated by biologists. Experimental results showed that our automatic figure segmentation approach surpasses pure caption-based and image-based approaches, achieving a 96.64% accuracy. To allow for efficient retrieval of information, as well as to provide the basis for integration into document classification and retrieval systems among others, we further developed a web-based interface that lets users easily retrieve panels containing the terms specified in the user queries.


Archive | 2017

iPTMnet: Integrative Bioinformatics for Studying PTM Networks

Karen E. Ross; Hongzhan Huang; Jia Ren; Cecilia N. Arighi; Gang Li; Catalina O. Tudor; Mengxi Lv; Jung-Youn Lee; Sheng-Chih Chen; K. Vijay-Shanker; Cathy H. Wu

Protein post-translational modification (PTM) is an essential cellular regulatory mechanism, and disruptions in PTM have been implicated in disease. PTMs are an active area of study in many fields, leading to a wealth of PTM information in the scientific literature. There is a need for user-friendly bioinformatics resources that capture PTM information from the literature and support analyses of PTMs and their functional consequences. This chapter describes the use of iPTMnet ( http://proteininformationresource.org/iPTMnet/ ), a resource that integrates PTM information from text mining, curated databases, and ontologies and provides visualization tools for exploring PTM networks, PTM crosstalk, and PTM conservation across species. We present several PTM-related queries and demonstrate how they can be addressed using iPTMnet.


Methods in Molecular Biology | 2011

eFIP: A Tool for Mining Functional Impact of Phosphorylation from Literature

Cecilia N. Arighi; Amy Y. Siu; Catalina O. Tudor; Jules Nchoutmboube; Cathy H. Wu; Vijay K. Shanker

Technologies and experimental strategies have improved dramatically in the fields of genomics and proteomics, facilitating analysis of cellular and biochemical processes, as well as of protein networks. Based on numerous such analyses, there has been a significant increase in publications in life sciences and biomedicine. In this respect, knowledge bases are struggling to cope with the literature volume and they may not be able to capture in detail certain aspects of proteins and genes. One important aspect of proteins is their phosphorylated states and their implication in protein function and protein interaction networks. For this reason, we developed eFIP, a web-based tool, which helps scientists quickly find abstracts mentioning phosphorylation of a given protein (including site and kinase), coupled with mentions of interactions and functional aspects of the protein. eFIP combines information provided by applications such as eGRAB, RLIMS-P, eGIFT and AIIAGMT, to rank abstracts mentioning phosphorylation, and to display the results in a highlighted and tabular format for a quick inspection. In this chapter, we present a case study of results returned by eFIP for the protein BAD, which is a key regulator of apoptosis that is posttranslationally modified by phosphorylation.


Database | 2012

Developing a biocuration workflow for AgBase, a non-model organism database

Lakshmi R. Pillai; Philippe Chouvarine; Catalina O. Tudor; Carl J. Schmidt; K. Vijay-Shanker; Fiona M. McCarthy

AgBase provides annotation for agricultural gene products using the Gene Ontology (GO) and Plant Ontology, as appropriate. Unlike model organism species, agricultural species have a body of literature that does not just focus on gene function; to improve efficiency, we use text mining to identify literature for curation. The first component of our annotation interface is the gene prioritization interface that ranks gene products for annotation. Biocurators select the top-ranked gene and mark annotation for these genes as ‘in progress’ or ‘completed’; links enable biocurators to move directly to our biocuration interface (BI). Our BI includes all current GO annotation for gene products and is the main interface to add/modify AgBase curation data. The BI also displays Extracting Genic Information from Text (eGIFT) results for each gene product. eGIFT is a web-based, text-mining tool that associates ranked, informative terms (iTerms) and the articles and sentences containing them, with genes. Moreover, iTerms are linked to GO terms, where they match either a GO term name or a synonym. This enables AgBase biocurators to rapidly identify literature for further curation based on possible GO terms. Because most agricultural species do not have standardized literature, eGIFT searches all gene names and synonyms to associate articles with genes. As many of the gene names can be ambiguous, eGIFT applies a disambiguation step to remove matches that do not correspond to this gene, and filtering is applied to remove abstracts that mention a gene in passing. The BI is linked to our Journal Database (JDB) where corresponding journal citations are stored. Just as importantly, biocurators also add to the JDB citations that have no GO annotation. The AgBase BI also supports bulk annotation upload to facilitate our Inferred from Electronic Annotation (IEA) of agricultural gene products. All annotations must pass standard GO Consortium quality checking before release in AgBase. 
Database URL: http://www.agbase.msstate.edu/


Bioinformatics and Biomedicine | 2012

Robust segmentation of biomedical figures for image-based document retrieval

Luis D. Lopez; Jingyi Yu; Catalina O. Tudor; Cecilia N. Arighi; Hongzhan Huang; K. Vijay-Shanker; Cathy H. Wu

Figures play an important role in illustrating concepts, methodology and results in biomedical literature. However, figures in biomedical literature are often composed of multiple subfigures (panels), which may illustrate diverse methodologies or results. Robust and accurate panel partitioning is crucial to support article categorization based on methods or experimental results and to provide the evidence source for derived assertions, but it is a challenging task. In this paper, we present a comprehensive framework for harvesting multimodal panels in biomedical literature, and demonstrate its application to protein-protein interaction (PPI)-related literature as a use case. A unique feature of our solution is that we combine pixel-level representations of images with figure captions. Our approach first analyzes figure captions to identify the label style used to mark panels. We then use pixel-level representations to partition a figure into a set of bounding boxes of connected components. We also perform a lexical analysis on the text within the figure to locate panel labels that match the caption analysis results. Finally, we estimate the optimal panel layout and use the layout to partition the figure. We tested our system on a dataset provided by the Molecular INTeraction database (MINT), and show that our approach surpasses pure caption-based and pure image-based approaches, achieving a 96.64% precision.
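
One step named in the abstract above is partitioning a figure into bounding boxes of connected components. The sketch below illustrates that generic step on a binary mask using a breadth-first flood fill; it is not the authors' implementation. The input representation (a list of rows of 0/1 values), the choice of 4-connectivity, and the function name `component_bounding_boxes` are assumptions made for illustration.

```python
from collections import deque


def component_bounding_boxes(mask):
    """Return (top, left, bottom, right) boxes of 4-connected foreground regions.

    `mask` is a list of rows, where truthy cells are foreground pixels.
    Boxes are reported in row-major order of each region's first pixel.
    """
    rows = len(mask)
    cols = len(mask[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    boxes = []

    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or seen[r][c]:
                continue
            # Start a new component and grow it with a breadth-first search.
            top, left, bottom, right = r, c, r, c
            queue = deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


if __name__ == "__main__":
    figure = [
        [1, 1, 0, 0],
        [1, 1, 0, 1],
        [0, 0, 0, 1],
    ]
    print(component_bounding_boxes(figure))
```

In a pipeline like the one described, such boxes would be only an intermediate result: the abstract says they are then reconciled with panel labels found in the figure text and with the label style inferred from the caption before a panel layout is chosen.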

Collaboration

Top Co-Authors

Cathy H. Wu, University of Delaware
Karen E. Ross, Georgetown University Medical Center
Gang Li, University of Delaware
Jia Ren, University of Delaware
Liang Sun, University of Delaware