Publication


Featured research published by Lawrence Hunter.


Pacific Symposium on Biocomputing | 1999

EDGAR: Extraction of Drugs, Genes And Relations from the Biomedical Literature

Thomas C. Rindflesch; Lorraine K. Tanabe; John N. Weinstein; Lawrence Hunter

EDGAR (Extraction of Drugs, Genes and Relations) is a natural language processing system that extracts information about drugs and genes relevant to cancer from the biomedical literature. This automatically extracted information has remarkable potential to facilitate computational analysis in the molecular biology of cancer, and the technology is straightforwardly generalizable to many areas of biomedicine. This paper reports on the mechanisms for automatically generating such assertions and on a simple application, conceptual clustering of documents. The system uses a stochastic part of speech tagger, generates an underspecified syntactic parse and then uses semantic and pragmatic information to construct its assertions. The system builds on two important existing resources: the MEDLINE database of biomedical citations and abstracts and the Unified Medical Language System, which provides syntactic and semantic information about the terms found in biomedical abstracts.


Genome Biology | 2008

Overview of BioCreative II gene mention recognition

Larry Smith; Lorraine K. Tanabe; Rie Johnson nee Ando; Cheng-Ju Kuo; I-Fang Chung; Chun-Nan Hsu; Yu-Shi Lin; Roman Klinger; Christoph M. Friedrich; Kuzman Ganchev; Manabu Torii; Hongfang Liu; Barry Haddow; Craig A. Struble; Richard J. Povinelli; Andreas Vlachos; William A. Baumgartner; Lawrence Hunter; Bob Carpenter; Richard Tzong-Han Tsai; Hong-Jie Dai; Feng Liu; Yifei Chen; Chengjie Sun; Sophia Katrenko; Pieter W. Adriaans; Christian Blaschke; Rafael Torres; Mariana Neves; Preslav Nakov

Nineteen teams presented results for the Gene Mention Task at the BioCreative II Workshop. In this task participants designed systems to identify substrings in sentences corresponding to gene name mentions. A variety of different methods were used and the results varied with a highest achieved F1 score of 0.8721. Here we present brief descriptions of all the methods used and a statistical analysis of the results. We also demonstrate that, by combining the results from all submissions, an F score of 0.9066 is feasible, and furthermore that the best result makes use of the lowest scoring submissions.
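
The F1 scores quoted above are the harmonic mean of precision and recall. As a minimal sketch, assuming mention-level counts of true positives, false positives and false negatives are already available (the function name and its parameters are illustrative and are not taken from the BioCreative II evaluation software), the score could be computed as follows:

```python
def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Harmonic mean of precision and recall for a set of predicted mentions.

    Illustrative helper only; the official BioCreative II scorer may treat
    alternative mention boundaries differently.
    """
    predicted = true_positives + false_positives   # mentions the system returned
    actual = true_positives + false_negatives      # mentions in the gold standard
    if predicted == 0 or actual == 0:
        return 0.0
    precision = true_positives / predicted
    recall = true_positives / actual
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Because the harmonic mean is dominated by the smaller of the two values, a system cannot reach a high F1 by maximizing precision or recall alone.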


Intelligent Systems in Molecular Biology | 2007

Manual curation is not sufficient for annotation of genomic databases

William A. Baumgartner; K. Bretonnel Cohen; Lynne M. Fox; George K. Acquaah-Mensah; Lawrence Hunter

MOTIVATION: Knowledge base construction has been an area of intense activity and great importance in the growth of computational biology. However, there is little or no history of work on the subject of evaluation of knowledge bases, either with respect to their contents or with respect to the processes by which they are constructed. This article proposes the application of a metric from software engineering known as the found/fixed graph to the problem of evaluating the processes by which genomic knowledge bases are built, as well as the completeness of their contents. RESULTS: Well-understood patterns of change in the found/fixed graph are found to occur in two large publicly available knowledge bases. These patterns suggest that the current manual curation processes will take far too long to complete the annotations of even just the most important model organisms, and that at their current rate of production, they will never be sufficient for completing the annotation of all currently available proteomes.


Journal of Mammary Gland Biology and Neoplasia | 2003

Functional Development of the Mammary Gland: Use of Expression Profiling and Trajectory Clustering to Reveal Changes in Gene Expression During Pregnancy, Lactation, and Involution

Michael C. Rudolph; James L. McManaman; Lawrence Hunter; Tzulip Phang; Margaret C. Neville

To characterize the molecular mechanisms by which progesterone withdrawal initiates milk secretion, we examined global gene expression during pregnancy and lactation in mice, focusing on the period around parturition. Trajectory clustering was used to profile the expression of 1358 genes that changed significantly between pregnancy day 12 and lactation day 9. Predominantly downward trajectories included stromal and proteasomal genes and genes for the enzymes of fatty acid degradation. Milk protein gene expression increased throughout pregnancy, whereas the expression of genes for lipid synthesis increased sharply at the onset of lactation. Examination of regulatory genes with profiles similar or complementary to those of lipid synthesis genes led to a model in which progesterone stimulates synthesis of TGF-β, Wnt 5b, and IGFBP-5 during pregnancy. These factors are suggested to repress secretion by interfering with PRL and IGF-1 signaling. With progesterone withdrawal, PRL and IGF-1 signaling are activated, in turn activating Akt/PKB and the SREBPs, leading to increased lipid synthesis.


PLOS Computational Biology | 2008

Getting started in text mining.

K. Bretonnel Cohen; Lawrence Hunter

Text mining is the use of automated methods for exploiting the enormous amount of knowledge available in the biomedical literature. There are at least as many motivations for doing text mining work as there are types of bioscientists. Model organism database curators have been heavy participants in the development of the field due to their need to process large numbers of publications in order to populate the many data fields for every gene in their species of interest. Bench scientists have built biomedical text mining applications to aid in the development of tools for interpreting the output of high-throughput assays and to improve searches of sequence databases (see [1] for a review). Bioscientists of every stripe have built applications to deal with the dual issues of the double-exponential growth in the scientific literature over the past few years and of the unique issues in searching PubMed/MEDLINE for genomics-related publications. A surprising phenomenon can be noted in the recent history of biomedical text mining: although several systems have been built and deployed in the past few years—Chilibot, Textpresso, and PreBIND (see Text S1 for these and most other citations), for example—the ones that are seeing high usage rates and are making productive contributions to the working lives of bioscientists have been built not by text mining specialists, but by bioscientists. We speculate on why this might be so below. Three basic types of approaches to text mining have been prevalent in the biomedical domain. Co-occurrence–based methods do no more than look for concepts that occur in the same unit of text—typically a sentence, but sometimes as large as an abstract—and posit a relationship between them. (See [2] for an early co-occurrence–based system.) For example, if such a system saw that BRCA1 and breast cancer occurred in the same sentence, it might assume a relationship between breast cancer and the BRCA1 gene. 
Some early biomedical text mining systems were co-occurrence–based, but such systems are highly error prone, and are not commonly built today. In fact, many text mining practitioners would not consider them to be text mining systems at all. Co-occurrence of concepts in a text is sometimes used as a simple baseline when evaluating more sophisticated systems; as such, they are nontrivial, since even a co-occurrence–based system must deal with variability in the ways that concepts are expressed in human-produced texts. For example, BRCA1 could be referred to by any of its alternate symbols, IRIS, PSCP, BRCAI, BRCC1, or RNF53 (or by any of their many spelling variants, which include BRCA1, BRCA-1, and BRCA 1), or by any of the variants of its full name, viz. breast cancer 1, early onset (its official name per Entrez Gene and the Human Gene Nomenclature Committee), as breast cancer susceptibility gene 1, or as the latter's variant breast cancer susceptibility gene-1. Similarly, breast cancer could be referred to as breast cancer, carcinoma of the breast, or mammary neoplasm. These variability issues challenge more sophisticated systems, as well; we discuss ways of coping with them in Text S1. Two more common (and more sophisticated) approaches to text mining exist: rule-based or knowledge-based approaches, and statistical or machine-learning-based approaches. The variety of types of rule-based systems is quite wide. In general, rule-based systems make use of some sort of knowledge. This might take the form of general knowledge about how language is structured, specific knowledge about how biologically relevant facts are stated in the biomedical literature, knowledge about the sets of things that bioscientists talk about and the kinds of relationships that they can have with one another, and the variant forms by which they might be mentioned in the literature, or any subset or combination of these.
(See [3] for an early rule-based system, and [4] for a discussion of rule-based approaches to various biomedical text mining tasks.) At one end of the spectrum, a simple rule-based system might use hard-coded patterns (for example, "plays a role in" or "is associated with") to find explicit statements about the classes of things in which the researcher is interested. At the other end of the spectrum, a rule-based system might use sophisticated linguistic and semantic analyses to recognize a wide range of possible ways of making assertions about those classes of things. It is worth noting that useful systems have been built using technologies at both ends of the spectrum, and at many points in between. In contrast, statistical or machine-learning–based systems operate by building classifiers that may operate on any level, from labelling part of speech to choosing syntactic parse trees to classifying full sentences or documents. (See [5] for an early learning-based system, and [4] for a discussion of learning-based approaches to various biomedical text mining tasks.) Rule-based and statistical systems each have their advantages and disadvantages. For example, rule systems are often assumed (not necessarily correctly) to take a significant amount of time to develop. Statistical systems typically require large amounts of expensive-to-get labelled training data. In practice, statistical and rule-based systems can be fruitfully combined. For example, a statistical system that classifies documents as to whether or not they are relevant to the subject of genetic variation in mouse genes might use the output of a rule-based mutation recognizer as one of its feature extractors. Many systems also employ an initial statistical processing step, followed by rule-based post-processing. A primary problem that either type of system must deal with is the issue of ambiguity: the existence of multiple relationships between language and meanings or categories.
Ambiguity exists at every level of linguistic structure, from the part of speech of words to subtle issues in pragmatics. A common example of ambiguity in genomics text is related to gene names and symbols. Consider the string fat: is it an adjective, or a noun? Either part of speech is entirely plausible in biomedical texts, and PubMed returns almost 112 K hits for that single-word query (and more than 13 K even if we try to restrict the query to genomics by including the disjunction (gene OR genetic OR genetics)). This ambiguity is relatively easy to resolve, but fat also turns out to be the name or symbol of a number of different genes: humans, mice, rats, Drosophila, zebrafish, chickens, M. mulatta, and two Lactobacilli have at least one gene whose name, official symbol, or alias is fat. Even if the species whose gene is being referred to can be determined, the ambiguity may still not be resolved: in humans, fat is the official symbol of Entrez Gene entry 2195 and an alternate symbol for Entrez Gene entry 948. The distinction is not trivial. The former is a cadherin, and is associated with tumor suppression and with bipolar disorder, while the latter is a thrombospondin receptor associated with atherosclerosis, platelet glycoprotein deficiency, hyperlipidemia, and insulin resistance, to name just a few phenotypes. These ambiguities are not trivial: if your analysis is wrong, you miss or erroneously extract information on relations between molecular biology and human disease.
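
The co-occurrence approach described in this abstract can be illustrated with a short sketch. It assumes small, hand-written variant dictionaries built from the BRCA1 and breast cancer examples in the text; the dictionary contents, the function names and the naive sentence splitter are all hypothetical and do not come from any of the cited systems.

```python
import re
from itertools import product

# Hypothetical variant lists, seeded from the examples in the abstract.
GENE_VARIANTS = {"BRCA1": ["BRCA1", "BRCA-1", "BRCA 1", "RNF53"]}
DISEASE_VARIANTS = {
    "breast cancer": ["breast cancer", "carcinoma of the breast", "mammary neoplasm"]
}


def find_concepts(sentence, variants):
    """Return the canonical concepts whose variant names appear in the sentence."""
    lowered = sentence.lower()
    found = set()
    for concept, names in variants.items():
        for name in names:
            # Require non-word characters (or string edges) around the name.
            pattern = r"(?<!\w)" + re.escape(name.lower()) + r"(?!\w)"
            if re.search(pattern, lowered):
                found.add(concept)
                break
    return found


def cooccurrences(text):
    """Posit a gene-disease relation whenever both occur in one sentence."""
    pairs = set()
    # Naive splitter: break after ., ! or ? followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        genes = find_concepts(sentence, GENE_VARIANTS)
        diseases = find_concepts(sentence, DISEASE_VARIANTS)
        pairs.update(product(genes, diseases))
    return pairs
```

Even this baseline has to normalize name variants before it can count anything, which is the point the abstract makes about co-occurrence systems being nontrivial. It makes no attempt to resolve ambiguous symbols such as fat.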


Behavioral and Brain Sciences | 1986

Transcending inductive category formation in learning

Roger C. Schank; Gregg C. Collins; Lawrence Hunter

The inductive category formation framework, an influential set of theories of learning in psychology and artificial intelligence, is deeply flawed. In this framework a set of necessary and sufficient features is taken to define a category. Such definitions are not functionally justified, are not used by people, and are not inducible by a learning system. Inductive theories depend on having access to all and only relevant features, which is not only impossible but begs a key question in learning. The crucial roles of other cognitive processes (such as explanation and credit assignment) are ignored or oversimplified. Learning necessarily involves pragmatic considerations that can only be handled by complex cognitive processes. We provide an alternative framework for learning according to which category definitions must be based on category function. The learning system invokes other cognitive processes to accomplish difficult tasks, makes inferences, analyses and decides among potential features, and specifies how and when categories are to be generated and modified. We also examine the methodological underpinnings of the two approaches and compare their motivations.


BMC Bioinformatics | 2008

OpenDMAP: An open source, ontology-driven concept analysis engine, with applications to capturing knowledge regarding protein transport, protein interactions and cell-type-specific gene expression

Lawrence Hunter; Zhiyong Lu; James Firby; William A. Baumgartner; Helen L. Johnson; Philip V. Ogren; K. Bretonnel Cohen

Background: Information extraction (IE) efforts are widely acknowledged to be important in harnessing the rapid advance of biomedical knowledge, particularly in areas where important factual information is published in a diverse literature. Here we report on the design, implementation and several evaluations of OpenDMAP, an ontology-driven, integrated concept analysis system. It significantly advances the state of the art in information extraction by leveraging knowledge in ontological resources, integrating diverse text processing applications, and using an expanded pattern language that allows the mixing of syntactic and semantic elements and variable ordering. Results: OpenDMAP information extraction systems were produced for extracting protein transport assertions (transport), protein-protein interaction assertions (interaction) and assertions that a gene is expressed in a cell type (expression). Evaluations were performed on each system, resulting in F-scores ranging from .26 – .72 (precision .39 – .85, recall .16 – .85). Additionally, each of these systems was run over all abstracts in MEDLINE, producing a total of 72,460 transport instances, 265,795 interaction instances and 176,153 expression instances. Conclusion: OpenDMAP advances the performance standards for extracting protein-protein interaction predications from the full texts of biomedical research articles. Furthermore, this level of performance appears to generalize to other information extraction tasks, including extracting information about predicates of more than two arguments. The output of the information extraction system is always constructed from elements of an ontology, ensuring that the knowledge representation is grounded with respect to a carefully constructed model of reality. The results of these efforts can be used to increase the efficiency of manual curation efforts and to provide additional features in systems that integrate multiple sources for information extraction.
The open source OpenDMAP code library is freely available at http://bionlp.sourceforge.net/


Bioinformatics | 2007

MutationFinder: a high-performance system for extracting point mutation mentions from text

J. Gregory Caporaso; William A. Baumgartner; David A. Randolph; K. Bretonnel Cohen; Lawrence Hunter

Discussion of point mutations is ubiquitous in biomedical literature, and manually compiling databases or literature on mutations in specific genes or proteins is tedious. We present an open-source, rule-based system, MutationFinder, for extracting point mutation mentions from text. On blind test data, it achieves nearly perfect precision and a markedly improved recall over a baseline. AVAILABILITY: MutationFinder, along with a high-quality gold standard data set, and a scoring script for mutation extraction systems have been made publicly available. Implementations, source code and unit tests are available in Python, Perl and Java. MutationFinder can be used as a stand-alone script, or imported by other applications. PROJECT URL: http://bionlp.sourceforge.net.
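
To make the idea of rule-based point mutation extraction concrete, here is a minimal sketch. It is not MutationFinder's code or rule set: the single regular expression, the `extract_point_mutations` name and the output tuple format are assumptions chosen for illustration, covering only the compact wild-type/position/mutant notation with one-letter or three-letter amino acid codes.

```python
import re

# Standard three-letter to one-letter amino acid codes.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}
ONE = "".join(sorted(THREE_TO_ONE.values()))
THREE = "|".join(THREE_TO_ONE)

# Hypothetical rule: <wild-type residue><position><mutant residue>,
# e.g. "A123T" or "Gly12Val". Real systems use many more patterns.
PATTERN = re.compile(
    rf"\b(?:([{ONE}])([1-9]\d*)([{ONE}])|({THREE})([1-9]\d*)({THREE}))\b"
)


def extract_point_mutations(text):
    """Return (wild_type, position, mutant) tuples using one-letter codes."""
    mutations = []
    for match in PATTERN.finditer(text):
        if match.group(1):  # one-letter form
            wild, pos, mutant = match.group(1), match.group(2), match.group(3)
        else:  # three-letter form, normalized to one-letter codes
            wild = THREE_TO_ONE[match.group(4)]
            pos = match.group(5)
            mutant = THREE_TO_ONE[match.group(6)]
        if wild != mutant:  # a substitution must change the residue
            mutations.append((wild, int(pos), mutant))
    return mutations
```

A pattern this simple will also match other identifiers that happen to have the same letter-number-letter shape, so a usable system needs additional rules or filtering to reach the precision the abstract reports.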


BMC Bioinformatics | 2012

Concept annotation in the CRAFT corpus

Michael Bada; Miriam Eckert; Donald Evans; Kristin Garcia; Krista Shipley; Dmitry Sitnikov; William A. Baumgartner; K. Bretonnel Cohen; Karin Verspoor; Judith A. Blake; Lawrence Hunter

Background: Manually annotated corpora are critical for the training and evaluation of automated methods to identify concepts in biomedical text. Results: This paper presents the concept annotations of the Colorado Richly Annotated Full-Text (CRAFT) Corpus, a collection of 97 full-length, open-access biomedical journal articles that have been annotated both semantically and syntactically to serve as a research resource for the biomedical natural-language-processing (NLP) community. CRAFT identifies all mentions of nearly all concepts from nine prominent biomedical ontologies and terminologies: the Cell Type Ontology, the Chemical Entities of Biological Interest ontology, the NCBI Taxonomy, the Protein Ontology, the Sequence Ontology, the entries of the Entrez Gene database, and the three subontologies of the Gene Ontology. The first public release includes the annotations for 67 of the 97 articles, reserving two sets of 15 articles for future text-mining competitions (after which these too will be released). Concept annotations were created based on a single set of guidelines, which has enabled us to achieve consistently high interannotator agreement. Conclusions: As the initial 67-article release contains more than 560,000 tokens (and the full set more than 790,000 tokens), our corpus is among the largest gold-standard annotated biomedical corpora. Unlike most others, the journal articles that comprise the corpus are drawn from diverse biomedical disciplines and are marked up in their entirety. Additionally, with a concept-annotation count of nearly 100,000 in the 67-article subset (and more than 140,000 in the full collection), the scale of conceptual markup is also among the largest of comparable corpora. The concept annotations of the CRAFT Corpus have the potential to significantly advance biomedical text mining by providing a high-quality gold standard for NLP systems.
The corpus, annotation guidelines, and other associated resources are freely available at http://bionlp-corpora.sourceforge.net/CRAFT/index.shtml.


BMC Bioinformatics | 2010

The structural and content aspects of abstracts versus bodies of full text journal articles are different

K. Bretonnel Cohen; Helen L. Johnson; Karin Verspoor; Christophe Roeder; Lawrence Hunter

Background: An increase in work on the full text of journal articles and the growth of PubMedCentral have the opportunity to create a major paradigm shift in how biomedical text mining is done. However, until now there has been no comprehensive characterization of how the bodies of full text journal articles differ from the abstracts that until now have been the subject of most biomedical text mining research. Results: We examined the structural and linguistic aspects of abstracts and bodies of full text articles, the performance of text mining tools on both, and the distribution of a variety of semantic classes of named entities between them. We found marked structural differences, with longer sentences in the article bodies and much heavier use of parenthesized material in the bodies than in the abstracts. We found content differences with respect to linguistic features. Three out of four of the linguistic features that we examined were statistically significantly differently distributed between the two genres. We also found content differences with respect to the distribution of semantic features. There were significantly different densities per thousand words for three out of four semantic classes, and clear differences in the extent to which they appeared in the two genres. With respect to the performance of text mining tools, we found that a mutation finder performed equally well in both genres, but that a wide variety of gene mention systems performed much worse on article bodies than they did on abstracts. POS tagging was also more accurate in abstracts than in article bodies. Conclusions: Aspects of structure and content differ markedly between article abstracts and article bodies. A number of these differences may pose problems as the text mining field moves more into the area of processing full-text articles.
However, these differences also present a number of opportunities for the extraction of data types, particularly that found in parenthesized text, that is present in article bodies but not in article abstracts.

Collaboration

Lawrence Hunter's top co-authors.

K. Bretonnel Cohen
University of Colorado Denver

Michael Bada
University of Colorado Denver

Anis Karimpour-Fard
University of Colorado Denver

Helen L. Johnson
University of Colorado Denver

Christophe Roeder
University of Colorado Denver