Helen L. Johnson | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Helen L. Johnson is active.

Explore More

Publication

Featured researches published by Helen L. Johnson.

BMC Bioinformatics | 2008

OpenDMAP: An open source, ontology-driven concept analysis engine, with applications to capturing knowledge regarding protein transport, protein interactions and cell-type-specific gene expression

Lawrence Hunter; Zhiyong Lu; James Firby; William A. Baumgartner; Helen L. Johnson; Philip V. Ogren; K. Bretonnel Cohen

BackgroundInformation extraction (IE) efforts are widely acknowledged to be important in harnessing the rapid advance of biomedical knowledge, particularly in areas where important factual information is published in a diverse literature. Here we report on the design, implementation and several evaluations of OpenDMAP, an ontology-driven, integrated concept analysis system. It significantly advances the state of the art in information extraction by leveraging knowledge in ontological resources, integrating diverse text processing applications, and using an expanded pattern language that allows the mixing of syntactic and semantic elements and variable ordering.ResultsOpenDMAP information extraction systems were produced for extracting protein transport assertions (transport), protein-protein interaction assertions (interaction) and assertions that a gene is expressed in a cell type (expression). Evaluations were performed on each system, resulting in F-scores ranging from .26 – .72 (precision .39 – .85, recall .16 – .85). Additionally, each of these systems was run over all abstracts in MEDLINE, producing a total of 72,460 transport instances, 265,795 interaction instances and 176,153 expression instances.ConclusionOpenDMAP advances the performance standards for extracting protein-protein interaction predications from the full texts of biomedical research articles. Furthermore, this level of performance appears to generalize to other information extraction tasks, including extracting information about predicates of more than two arguments. The output of the information extraction system is always constructed from elements of an ontology, ensuring that the knowledge representation is grounded with respect to a carefully constructed model of reality. The results of these efforts can be used to increase the efficiency of manual curation efforts and to provide additional features in systems that integrate multiple sources for information extraction. The open source OpenDMAP code library is freely available at http://bionlp.sourceforge.net/

BMC Bioinformatics | 2010

The structural and content aspects of abstracts versus bodies of full text journal articles are different

K. Bretonnel Cohen; Helen L. Johnson; Karin Verspoor; Christophe Roeder; Lawrence Hunter

BackgroundAn increase in work on the full text of journal articles and the growth of PubMedCentral have the opportunity to create a major paradigm shift in how biomedical text mining is done. However, until now there has been no comprehensive characterization of how the bodies of full text journal articles differ from the abstracts that until now have been the subject of most biomedical text mining research.ResultsWe examined the structural and linguistic aspects of abstracts and bodies of full text articles, the performance of text mining tools on both, and the distribution of a variety of semantic classes of named entities between them. We found marked structural differences, with longer sentences in the article bodies and much heavier use of parenthesized material in the bodies than in the abstracts. We found content differences with respect to linguistic features. Three out of four of the linguistic features that we examined were statistically significantly differently distributed between the two genres. We also found content differences with respect to the distribution of semantic features. There were significantly different densities per thousand words for three out of four semantic classes, and clear differences in the extent to which they appeared in the two genres. With respect to the performance of text mining tools, we found that a mutation finder performed equally well in both genres, but that a wide variety of gene mention systems performed much worse on article bodies than they did on abstracts. POS tagging was also more accurate in abstracts than in article bodies.ConclusionsAspects of structure and content differ markedly between article abstracts and article bodies. A number of these differences may pose problems as the text mining field moves more into the area of processing full-text articles. However, these differences also present a number of opportunities for the extraction of data types, particularly that found in parenthesized text, that is present in article bodies but not in article abstracts.

BMC Bioinformatics | 2012

A corpus of full-text journal articles is a robust evaluation tool for revealing differences in performance of biomedical natural language processing tools

Karin Verspoor; Kevin Bretonnel Cohen; Arrick Lanfranchi; Colin Warner; Helen L. Johnson; Christophe Roeder; Jinho D. Choi; Christopher S. Funk; Yuriy Malenkiy; Miriam Eckert; Nianwen Xue; William A. Baumgartner; Michael Bada; Martha Palmer; Lawrence Hunter

BackgroundWe introduce the linguistic annotation of a corpus of 97 full-text biomedical publications, known as the Colorado Richly Annotated Full Text (CRAFT) corpus. We further assess the performance of existing tools for performing sentence splitting, tokenization, syntactic parsing, and named entity recognition on this corpus.ResultsMany biomedical natural language processing systems demonstrated large differences between their previously published results and their performance on the CRAFT corpus when tested with the publicly available models or rule sets. Trainable systems differed widely with respect to their ability to build high-performing models based on this data.ConclusionsThe finding that some systems were able to train high-performing models based on this corpus is additional evidence, beyond high inter-annotator agreement, that the quality of the CRAFT corpus is high. The overall poor performance of various systems indicates that considerable work needs to be done to enable natural language processing systems to work well when the input is full-text journal articles. The CRAFT corpus provides a valuable resource to the biomedical natural language processing community for evaluation and training of new models for biomedical full text publications.

north american chapter of the association for computational linguistics | 2009

High-precision biological event extraction with a concept recognizer

K. Bretonnel Cohen; Karin Verspoor; Helen L. Johnson; Christophe Roeder; Philip V. Ogren; William A. Baumgartner; Elizabeth K. White; Lawrence Hunter

We approached the problems of event detection, argument identification, and negation and speculation detection as one of concept recognition and analysis. Our methodology involved using the OpenDMAP semantic parser with manually-written rules. We achieved state-of-the-art precision for two of the three tasks, scoring the highest of 24 teams at precision of 71.81 on Task 1 and the highest of 6 teams at precision of 70.97 on Task 2. The OpenDMAP system and the rule set are available at bionlp.sourceforge.net.

pacific symposium on biocomputing | 2005

EVALUATION OF LEXICAL METHODS FOR DETECTING RELATIONSHIPS BETWEEN CONCEPTS FROM MULTIPLE ONTOLOGIES

Helen L. Johnson; K. Bretonnel Cohen; William A. Baumgartner; Zhiyong Lu; Michael Bada; Todd Kester; Hyunmin Kim; Lawrence Hunter

We used exact term matching, stemming, and inclusion of synonyms, implemented via the Lucene information retrieval library, to discover relationships between the Gene Ontology and three other OBO ontologies: ChEBI, Cell Type, and BRENDA Tissue. Proposed relationships were evaluated by domain experts. We discovered 91,385 relationships between the ontologies. Various methods had a wide range of correctness. Based on these results, we recommend careful evaluation of all matching strategies before use, including exact string matching. The full set of relationships is available at compbio.uchsc.edu/dependencies.

Genome Biology | 2008

Concept recognition for extracting protein interaction relations from biomedical text

William A. Baumgartner; Zhiyong Lu; Helen L. Johnson; J. Gregory Caporaso; Jesse Paquette; Anna Lindemann; Elizabeth K. White; Olga Medvedeva; K. Bretonnel Cohen; Lawrence Hunter

Background:Reliable information extraction applications have been a long sought goal of the biomedical text mining community, a goal that if reached would provide valuable tools to benchside biologists in their increasingly difficult task of assimilating the knowledge contained in the biomedical literature. We present an integrated approach to concept recognition in biomedical text. Concept recognition provides key information that has been largely missing from previous biomedical information extraction efforts, namely direct links to well defined knowledge resources that explicitly cement the concepts semantics. The BioCreative II tasks discussed in this special issue have provided a unique opportunity to demonstrate the effectiveness of concept recognition in the field of biomedical language processing.Results:Through the modular construction of a protein interaction relation extraction system, we present several use cases of concept recognition in biomedical text, and relate these use cases to potential uses by the benchside biologist.Conclusion:Current information extraction technologies are approaching performance standards at which concept recognition can begin to deliver high quality data to the benchside biologist. Our system is available as part of the BioCreative Meta-Server project and on the internet http://bionlp.sourceforge.net.

Journal of Biomedical Discovery and Collaboration | 2007

Corpus Refactoring: a Feasibility Study

Helen L. Johnson; William A. Baumgartner; Martin Krallinger; K. Bretonnel Cohen; Lawrence Hunter

BackgroundMost biomedical corpora have not been used outside of the lab that created them, despite the fact that the availability of the gold-standard evaluation data that they provide is one of the rate-limiting factors for the progress of biomedical text mining. Data suggest that one major factor affecting the use of a corpus outside of its home laboratory is the format in which it is distributed. This paper tests the hypothesis that corpus refactoring – changing the format of a corpus without altering its semantics – is a feasible goal, namely that it can be accomplished with a semi-automatable process and in a time-effcient way. We used simple text processing methods and limited human validation to convert the Protein Design Group corpus into two new formats: WordFreak and embedded XML. We tracked the total time expended and the success rates of the automated steps.ResultsThe refactored corpus is available for download at the BioNLP SourceForge website http://bionlp.sourceforge.net. The total time expended was just over three person-weeks, consisting of about 102 hours of programming time (much of which is one-time development cost) and 20 hours of manual validation of automatic outputs. Additionally, the steps required to refactor any corpus are presented.ConclusionWe conclude that refactoring of publicly available corpora is a technically and economically feasible method for increasing the usage of data already available for evaluating biomedical language processing systems.

computational intelligence | 2011

HIGH-PRECISION BIOLOGICAL EVENT EXTRACTION: EFFECTS OF SYSTEM AND OF DATA.

K. Bretonnel Cohen; Karin Verspoor; Helen L. Johnson; Christophe Roeder; Philip V. Ogren; William A. Baumgartner; Elizabeth K. White; Hannah Tipney; Lawrence Hunter

We approached the problems of event detection, argument identification, and negation and speculation detection in the BioNLP’09 information extraction challenge through concept recognition and analysis. Our methodology involved using the OpenDMAP semantic parser with manually written rules. The original OpenDMAP system was updated for this challenge with a broad ontology defined for the events of interest, new linguistic patterns for those events, and specialized coordination handling. We achieved state‐of‐the‐art precision for two of the three tasks, scoring the highest of 24 teams at precision of 71.81 on Task 1 and the highest of 6 teams at precision of 70.97 on Task 2. We provide a detailed analysis of the training data and show that a number of trigger words were ambiguous as to event type, even when their arguments are constrained by semantic class. The data is also shown to have a number of missing annotations. Analysis of a sampling of the comparatively small number of false positives returned by our system shows that major causes of this type of error were failing to recognize second themes in two‐theme events, failing to recognize events when they were the arguments to other events, failure to recognize nontheme arguments, and sentence segmentation errors. We show that specifically handling coordination had a small but important impact on the overall performance of the system. The OpenDMAP system and the rule set are available at http://bionlp.sourceforge.net.

pacific symposium on biocomputing | 2006

A fault model for ontology mapping, alignment, and linking systems.

Helen L. Johnson; K. Bretonnel Cohen; Lawrence Hunter

There has been much work devoted to the mapping, alignment, and linking of ontologies (MALO), but little has been published about how to evaluate systems that do this. A fault model for conducting fine-grained evaluations of MALO systems is proposed, and its application to the system described in Johnson et al. [15] is illustrated. Two judges categorized errors according to the model, and inter-judge agreement was calculated by error category. Overall inter-judge agreement was 98% after dispute resolution, suggesting that the model is consistently applicable. The results of applying the model to the system described in [15] reveal the reason for a puzzling set of results in that paper, and also suggest a number of avenues and techniques for improving the state of the art in MALO, including the development of biomedical domain specific language processing tools, filtering of high frequency matching results, and word sense disambiguation.

intelligent systems in molecular biology | 2009

Mining protein-protein interactions from GeneRIFs with OpenDMAP

Andrew D. Fox; William A. Baumgartner; Helen L. Johnson; Lawrence Hunter; Donna K. Slonim

We applied the OpenDMAP [1] and BioNLP-UIMA [2] NLP systems to the task of mining protein-protein interactions (PPIs) from GeneRIFs. Our goal was to assess and improve system performance on GeneRIF text. We identified several classes of errors in the systems output on a training dataset (most notably difficulty recognizing protein complexes) and modified the system to improve performance based on these observations. To improve recognition of protein complex interactions, we implemented a new protein-complex-resolution UIMA component. We added a custom entity identification engine that uses GeneRIF metadata to annotate proteins that may have been missed by the other engines. These changes simultaneously improved both recall and precision, resulting in an overall improvement in F-measure (from 0.23 to 0.48). Results confirm that the targeted enhancements described here lead to a substantial improvement in performance. Availability: Annotated data sets and source code for the new UIMA components can be found at http://bcb.cs.tufts.edu/GeneRIFs/

Explore More