Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where K. Bretonnel Cohen is active.

Publication


Featured research published by K. Bretonnel Cohen.


Meeting of the Association for Computational Linguistics | 2007

A shared task involving multi-label classification of clinical free text

John Pestian; Chris Brew; Pawel Matykiewicz; D. J. Hovermale; Neil Johnson; K. Bretonnel Cohen; Włodzisław Duch

This paper reports on a shared task involving the assignment of ICD-9-CM codes to radiology reports. Two features distinguished this task from previous shared tasks in the biomedical domain. One is that it resulted in the first freely distributable corpus of fully anonymized clinical text. This resource is permanently available and will (we hope) facilitate future research. The other key feature of the task is that it required categorization with respect to a large and commercially significant set of labels. The number of participants was larger than in any previous biomedical challenge task. We describe the data production process and the evaluation measures, and give a preliminary analysis of the results. Many systems performed at levels approaching the inter-coder agreement, suggesting that human-like performance on this task is within the reach of currently available technologies.
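At its core, the task described here is multi-label document classification: each radiology report may receive several ICD-9-CM codes at once. As a rough illustration of that setup (not a description of any system submitted to the shared task), the sketch below uses scikit-learn's one-vs-rest strategy over TF-IDF features; the tiny example reports and codes are invented placeholders.

```python
# Minimal multi-label classification sketch in the spirit of the shared task:
# each clinical report may be assigned several ICD-9-CM codes at once.
# The example documents and codes below are invented placeholders, not task data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

reports = [
    "chest x-ray shows right lower lobe pneumonia",
    "no acute cardiopulmonary disease",
    "cough and fever, infiltrate consistent with pneumonia",
    "normal chest radiograph",
]
codes = [["486"], [], ["486", "780.60"], []]  # one label set per report

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(codes)  # reports x codes indicator matrix

clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    OneVsRestClassifier(LogisticRegression(max_iter=1000)),
)
clf.fit(reports, Y)

pred = clf.predict(["follow-up film shows resolving pneumonia"])
print(mlb.inverse_transform(pred))  # predicted ICD-9-CM code set(s)
```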


PLOS Computational Biology | 2008

Getting started in text mining.

K. Bretonnel Cohen; Lawrence Hunter

Text mining is the use of automated methods for exploiting the enormous amount of knowledge available in the biomedical literature. There are at least as many motivations for doing text mining work as there are types of bioscientists. Model organism database curators have been heavy participants in the development of the field due to their need to process large numbers of publications in order to populate the many data fields for every gene in their species of interest. Bench scientists have built biomedical text mining applications to aid in the development of tools for interpreting the output of high-throughput assays and to improve searches of sequence databases (see [1] for a review). Bioscientists of every stripe have built applications to deal with the dual issues of the double-exponential growth in the scientific literature over the past few years and of the unique issues in searching PubMed/MEDLINE for genomics-related publications. A surprising phenomenon can be noted in the recent history of biomedical text mining: although several systems have been built and deployed in the past few years—Chilibot, Textpresso, and PreBIND (see Text S1 for these and most other citations), for example—the ones that are seeing high usage rates and are making productive contributions to the working lives of bioscientists have been built not by text mining specialists, but by bioscientists. We speculate on why this might be so below.

Three basic types of approaches to text mining have been prevalent in the biomedical domain. Co-occurrence–based methods do no more than look for concepts that occur in the same unit of text—typically a sentence, but sometimes as large as an abstract—and posit a relationship between them. (See [2] for an early co-occurrence–based system.) For example, if such a system saw that BRCA1 and breast cancer occurred in the same sentence, it might assume a relationship between breast cancer and the BRCA1 gene. Some early biomedical text mining systems were co-occurrence–based, but such systems are highly error prone, and are not commonly built today. In fact, many text mining practitioners would not consider them to be text mining systems at all. Co-occurrence of concepts in a text is sometimes used as a simple baseline when evaluating more sophisticated systems; as such, they are nontrivial, since even a co-occurrence–based system must deal with variability in the ways that concepts are expressed in human-produced texts. For example, BRCA1 could be referred to by any of its alternate symbols—IRIS, PSCP, BRCAI, BRCC1, or RNF53 (or by any of their many spelling variants, which include BRCA1, BRCA-1, and BRCA 1)—or by any of the variants of its full name, viz. breast cancer 1, early onset (its official name per Entrez Gene and the Human Gene Nomenclature Committee), as breast cancer susceptibility gene 1, or as the latter's variant breast cancer susceptibility gene-1. Similarly, breast cancer could be referred to as breast cancer, carcinoma of the breast, or mammary neoplasm. These variability issues challenge more sophisticated systems, as well; we discuss ways of coping with them in Text S1.

Two more common (and more sophisticated) approaches to text mining exist: rule-based or knowledge-based approaches, and statistical or machine-learning-based approaches. The variety of types of rule-based systems is quite wide. In general, rule-based systems make use of some sort of knowledge. This might take the form of general knowledge about how language is structured, specific knowledge about how biologically relevant facts are stated in the biomedical literature, knowledge about the sets of things that bioscientists talk about and the kinds of relationships that they can have with one another, and the variant forms by which they might be mentioned in the literature, or any subset or combination of these. (See [3] for an early rule-based system, and [4] for a discussion of rule-based approaches to various biomedical text mining tasks.) At one end of the spectrum, a simple rule-based system might use hard-coded patterns—for example, plays a role in or is associated with—to find explicit statements about the classes of things in which the researcher is interested. At the other end of the spectrum, a rule-based system might use sophisticated linguistic and semantic analyses to recognize a wide range of possible ways of making assertions about those classes of things. It is worth noting that useful systems have been built using technologies at both ends of the spectrum, and at many points in between.

In contrast, statistical or machine-learning–based systems operate by building classifiers that may operate on any level, from labelling part of speech to choosing syntactic parse trees to classifying full sentences or documents. (See [5] for an early learning-based system, and [4] for a discussion of learning-based approaches to various biomedical text mining tasks.) Rule-based and statistical systems each have their advantages and disadvantages. For example, rule systems are often assumed (not necessarily correctly) to take a significant amount of time to develop. Statistical systems typically require large amounts of expensive-to-get labelled training data. In practice, statistical and rule-based systems can be fruitfully combined. For example, a statistical system that classifies documents as to whether or not they are relevant to the subject of genetic variation in mouse genes might use the output of a rule-based mutation recognizer as one of its feature extractors. Many systems also employ an initial statistical processing step, followed by rule-based post-processing.

A primary problem that either type of system must deal with is the issue of ambiguity: the existence of multiple relationships between language and meanings or categories. Ambiguity exists at every level of linguistic structure, from the part of speech of words to subtle issues in pragmatics. A common example of ambiguity in genomics text is related to gene names and symbols. Consider the string fat: is it an adjective, or a noun? Either part of speech is entirely plausible in biomedical texts, and PubMed returns almost 112 K hits for that single-word query (and more than 13 K even if we try to restrict the query to genomics by including the disjunction (gene OR genetic OR genetics)). This ambiguity is relatively easy to resolve, but fat also turns out to be the name or symbol of a number of different genes—humans, mice, rats, Drosophila, zebrafish, chickens, M. mulatta, and two Lactobacilli have at least one gene whose name, official symbol, or alias is fat. Even if the species whose gene is being referred to can be determined, the ambiguity may still not be resolved—in humans, fat is the official symbol of Entrez Gene entry 2195 and an alternate symbol for Entrez Gene entry 948. The distinction is not trivial. The former is a cadherin, and is associated with tumor suppression and with bipolar disorder, while the latter is a thrombospondin receptor associated with atherosclerosis, platelet glycoprotein deficiency, hyperlipidemia, and insulin resistance, to name just a few phenotypes. These ambiguities are not trivial: if your analysis is wrong, you miss or erroneously extract information on relations between molecular biology and human disease.
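As a concrete (and deliberately naive) illustration of the co-occurrence approach and of the synonym problem discussed above, the sketch below posits a gene-disease relationship whenever a gene mention and a disease mention appear in the same sentence, after normalizing a few of the BRCA1 and breast-cancer variants; the small synonym lists are illustrative, not a real lexicon.

```python
# A deliberately naive co-occurrence "relation extractor": posit a relationship
# whenever a normalized gene mention and disease mention share a sentence.
# The synonym lists are tiny illustrations, not a complete lexicon.
import re

GENE_SYNONYMS = {
    "BRCA1": ["brca1", "brca-1", "brca 1", "breast cancer 1, early onset",
              "breast cancer susceptibility gene 1",
              "breast cancer susceptibility gene-1"],
}
DISEASE_SYNONYMS = {
    "breast cancer": ["breast cancer", "carcinoma of the breast", "mammary neoplasm"],
}

def find_mentions(sentence, synonyms):
    """Return canonical names whose variants occur in the sentence."""
    s = sentence.lower()
    return {canon for canon, variants in synonyms.items()
            if any(v in s for v in variants)}

def cooccurrence_relations(text):
    relations = set()
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        genes = find_mentions(sentence, GENE_SYNONYMS)
        diseases = find_mentions(sentence, DISEASE_SYNONYMS)
        # co-occurrence in one sentence is taken as evidence of a relation
        relations.update((g, d) for g in genes for d in diseases)
    return relations

text = ("Germline mutations in BRCA-1 increase the risk of carcinoma of the breast. "
        "Unrelated sentence about platelets.")
print(cooccurrence_relations(text))  # {('BRCA1', 'breast cancer')}
```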


Computational Linguistics | 2011

Amazon Mechanical Turk: Gold mine or coal mine?

Karën Fort; Gilles Adda; K. Bretonnel Cohen

Recently heard at a tutorial in our field: “It cost me less than one hundred bucks to annotate this using Amazon Mechanical Turk!” Assertions like this are increasingly common, but we believe they should not be stated so proudly; they ignore the ethical consequences of using MTurk (Amazon Mechanical Turk) as a source of labor. Manually annotating corpora or manually developing any other linguistic resource, such as a set of judgments about system outputs, represents such a high cost that many researchers are looking for alternative solutions to the standard approach. MTurk is becoming a popular one. However, as in any scientific endeavor involving humans, there is an unspoken ethical dimension involved in resource construction and system evaluation, and this is especially true of MTurk. We would like here to raise some questions about the use of MTurk. To do so, we will define precisely what MTurk is and what it is not, highlighting the issues raised by the system. We hope that this will point out opportunities for our community to deliberately value ethics above cost savings.


BMC Bioinformatics | 2008

OpenDMAP: An open source, ontology-driven concept analysis engine, with applications to capturing knowledge regarding protein transport, protein interactions and cell-type-specific gene expression

Lawrence Hunter; Zhiyong Lu; James Firby; William A. Baumgartner; Helen L. Johnson; Philip V. Ogren; K. Bretonnel Cohen

Background: Information extraction (IE) efforts are widely acknowledged to be important in harnessing the rapid advance of biomedical knowledge, particularly in areas where important factual information is published in a diverse literature. Here we report on the design, implementation and several evaluations of OpenDMAP, an ontology-driven, integrated concept analysis system. It significantly advances the state of the art in information extraction by leveraging knowledge in ontological resources, integrating diverse text processing applications, and using an expanded pattern language that allows the mixing of syntactic and semantic elements and variable ordering.

Results: OpenDMAP information extraction systems were produced for extracting protein transport assertions (transport), protein-protein interaction assertions (interaction) and assertions that a gene is expressed in a cell type (expression). Evaluations were performed on each system, resulting in F-scores ranging from 0.26 to 0.72 (precision 0.39 to 0.85, recall 0.16 to 0.85). Additionally, each of these systems was run over all abstracts in MEDLINE, producing a total of 72,460 transport instances, 265,795 interaction instances and 176,153 expression instances.

Conclusion: OpenDMAP advances the performance standards for extracting protein-protein interaction predications from the full texts of biomedical research articles. Furthermore, this level of performance appears to generalize to other information extraction tasks, including extracting information about predicates of more than two arguments. The output of the information extraction system is always constructed from elements of an ontology, ensuring that the knowledge representation is grounded with respect to a carefully constructed model of reality. The results of these efforts can be used to increase the efficiency of manual curation efforts and to provide additional features in systems that integrate multiple sources for information extraction. The open source OpenDMAP code library is freely available at http://bionlp.sourceforge.net/
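OpenDMAP's own pattern language is far richer than anything shown here, but the flavor of mixing literal text with semantic slots can be suggested with a toy slot-filling example: a small gazetteer stands in for ontology terms, and a single "transport" template is matched against sentences. The gazetteer entries and the pattern itself are invented for illustration and are not the system's actual rules.

```python
# Toy illustration of ontology-driven slot filling in the spirit of OpenDMAP
# (the real system uses a much richer pattern language and full ontologies).
# Gazetteers and the single "transport" pattern below are invented examples.
import re

PROTEINS = ["importin alpha", "Ran", "NF-kB"]           # stand-ins for ontology terms
COMPARTMENTS = ["nucleus", "cytoplasm", "mitochondrion"]

def _alt(terms):
    return "|".join(re.escape(t) for t in terms)

# Pattern mixing literal text ("is transported to the") with semantic slots.
TRANSPORT = re.compile(
    rf"(?P<transported>{_alt(PROTEINS)})\s+is transported to the\s+"
    rf"(?P<destination>{_alt(COMPARTMENTS)})",
    re.IGNORECASE,
)

def extract_transport(sentence):
    m = TRANSPORT.search(sentence)
    if m:
        # The extracted assertion is built from the slot names, mirroring the idea
        # that output is grounded in a predefined conceptual template.
        return {"predicate": "protein-transport",
                "transported": m.group("transported"),
                "destination": m.group("destination")}
    return None

print(extract_transport("NF-kB is transported to the nucleus upon stimulation."))
```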


Bioinformatics | 2007

MutationFinder: a high-performance system for extracting point mutation mentions from text

J. Gregory Caporaso; William A. Baumgartner; David A. Randolph; K. Bretonnel Cohen; Lawrence Hunter

Discussion of point mutations is ubiquitous in biomedical literature, and manually compiling databases or literature on mutations in specific genes or proteins is tedious. We present an open-source, rule-based system, MutationFinder, for extracting point mutation mentions from text. On blind test data, it achieves nearly perfect precision and a markedly improved recall over a baseline.

Availability: MutationFinder, along with a high-quality gold standard data set and a scoring script for mutation extraction systems, has been made publicly available. Implementations, source code and unit tests are available in Python, Perl and Java. MutationFinder can be used as a stand-alone script, or imported by other applications.

Project URL: http://bionlp.sourceforge.net.
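MutationFinder itself ships a carefully evaluated rule set; the fragment below is only a much-simplified, hypothetical illustration of how point-mutation mentions in the common wNm style (e.g. A123T or Ala123Thr) can be picked up and normalized with regular expressions.

```python
# Much-simplified illustration of regex-based point-mutation extraction
# (MutationFinder's actual rules are larger and more carefully evaluated).
import re

AA3_TO_1 = {"ala": "A", "arg": "R", "asn": "N", "asp": "D", "cys": "C",
            "gln": "Q", "glu": "E", "gly": "G", "his": "H", "ile": "I",
            "leu": "L", "lys": "K", "met": "M", "phe": "F", "pro": "P",
            "ser": "S", "thr": "T", "trp": "W", "tyr": "Y", "val": "V"}

ONE_LETTER = re.compile(r"\b([ACDEFGHIKLMNPQRSTVWY])(\d+)([ACDEFGHIKLMNPQRSTVWY])\b")
THREE_LETTER = re.compile(r"\b(%s)(\d+)(%s)\b" % ("|".join(AA3_TO_1), "|".join(AA3_TO_1)),
                          re.IGNORECASE)

def extract_point_mutations(text):
    """Return normalized (wild-type, position, mutant) tuples, e.g. ('A', 123, 'T')."""
    mutations = set()
    for wt, pos, mut in ONE_LETTER.findall(text):
        mutations.add((wt, int(pos), mut))
    for wt, pos, mut in THREE_LETTER.findall(text):
        mutations.add((AA3_TO_1[wt.lower()], int(pos), AA3_TO_1[mut.lower()]))
    return mutations

print(extract_point_mutations(
    "The A123T substitution (also written Ala123Thr) abolished binding."))
# {('A', 123, 'T')}
```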


Database | 2012

Text mining for the biocuration workflow

Lynette Hirschman; Gully A. P. C. Burns; Martin Krallinger; Cecilia N. Arighi; K. Bretonnel Cohen; Alfonso Valencia; Cathy H. Wu; Andrew Chatr-aryamontri; Karen G. Dowell; Eva Huala; Anália Lourenço; Robert Nash; Anne-Lise Veuthey; Thomas C. Wiegers; Andrew Winter

Molecular biology has become heavily dependent on biological knowledge encoded in expert curated biological databases. As the volume of biological literature increases, biocurators need help in keeping up with the literature; (semi-) automated aids for biocuration would seem to be an ideal application for natural language processing and text mining. However, to date, there have been few documented successes for improving biocuration throughput using text mining. Our initial investigations took place for the workshop on ‘Text Mining for the BioCuration Workflow’ at the third International Biocuration Conference (Berlin, 2009). We interviewed biocurators to obtain workflows from eight biological databases. This initial study revealed high-level commonalities, including (i) selection of documents for curation; (ii) indexing of documents with biologically relevant entities (e.g. genes); and (iii) detailed curation of specific relations (e.g. interactions); however, the detailed workflows also showed considerable variability. Following the workshop, we conducted a survey of biocurators. The survey identified biocurator priorities, including the handling of full text indexed with biological entities and support for the identification and prioritization of documents for curation. It also indicated that two-thirds of the biocuration teams had experimented with text mining and almost half were using text mining at that time. Analysis of our interviews and survey provides a set of requirements for the integration of text mining into the biocuration workflow. These can guide the identification of common needs across curated databases and encourage joint experimentation involving biocurators, text mining developers and the larger biomedical research community.
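The common stages identified above (document selection, entity indexing, detailed curation) lend themselves to simple triage support. As a purely hypothetical sketch of the first two stages, the snippet below indexes documents with gene mentions from a small gazetteer and prioritizes entity-rich documents for curation; the gazetteer and example documents are invented.

```python
# Hypothetical triage sketch for a biocuration workflow: index documents with
# entities from a small gene gazetteer, then prioritize the entity-rich ones
# for detailed curation. The gazetteer and documents are invented examples.
GENE_GAZETTEER = {"brca1", "tp53", "cftr"}

def index_document(doc_id, text):
    tokens = {tok.strip(".,;()").lower() for tok in text.split()}
    return {"doc_id": doc_id, "genes": sorted(tokens & GENE_GAZETTEER)}

def prioritize(documents):
    indexed = [index_document(doc_id, text) for doc_id, text in documents]
    # Documents mentioning more distinct curatable entities are curated first.
    return sorted(indexed, key=lambda d: len(d["genes"]), reverse=True)

docs = [
    ("PMID:1", "BRCA1 and TP53 are frequently mutated in this cohort."),
    ("PMID:2", "A review of imaging protocols."),
    ("PMID:3", "CFTR variants were genotyped."),
]
for entry in prioritize(docs):
    print(entry)
```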


Biomedical Informatics Insights | 2012

Sentiment Analysis of Suicide Notes: A Shared Task.

John Pestian; Pawel Matykiewicz; Michelle Linn-Gust; Brett R. South; Özlem Uzuner; Jan Wiebe; K. Bretonnel Cohen; John F. Hurdle; Chris Brew

This paper reports on a shared task involving the assignment of emotions to suicide notes. Two features distinguished this task from previous shared tasks in the biomedical domain. One is that it resulted in a corpus of fully anonymized clinical text and annotated suicide notes. This resource is permanently available and will (we hope) facilitate future research. The other key feature of the task is that it required categorization with respect to a large set of labels. The number of participants was larger than in any previous biomedical challenge task. We describe the data production process and the evaluation measures, and give a preliminary analysis of the results. Many systems performed at levels approaching the inter-coder agreement, suggesting that human-like performance on this task is within the reach of currently available technologies.


BMC Bioinformatics | 2012

Concept annotation in the CRAFT corpus

Michael Bada; Miriam Eckert; Donald Evans; Kristin Garcia; Krista Shipley; Dmitry Sitnikov; William A. Baumgartner; K. Bretonnel Cohen; Karin Verspoor; Judith A. Blake; Lawrence Hunter

Background: Manually annotated corpora are critical for the training and evaluation of automated methods to identify concepts in biomedical text.

Results: This paper presents the concept annotations of the Colorado Richly Annotated Full-Text (CRAFT) Corpus, a collection of 97 full-length, open-access biomedical journal articles that have been annotated both semantically and syntactically to serve as a research resource for the biomedical natural-language-processing (NLP) community. CRAFT identifies all mentions of nearly all concepts from nine prominent biomedical ontologies and terminologies: the Cell Type Ontology, the Chemical Entities of Biological Interest ontology, the NCBI Taxonomy, the Protein Ontology, the Sequence Ontology, the entries of the Entrez Gene database, and the three subontologies of the Gene Ontology. The first public release includes the annotations for 67 of the 97 articles, reserving two sets of 15 articles for future text-mining competitions (after which these too will be released). Concept annotations were created based on a single set of guidelines, which has enabled us to achieve consistently high interannotator agreement.

Conclusions: As the initial 67-article release contains more than 560,000 tokens (and the full set more than 790,000 tokens), our corpus is among the largest gold-standard annotated biomedical corpora. Unlike most others, the journal articles that comprise the corpus are drawn from diverse biomedical disciplines and are marked up in their entirety. Additionally, with a concept-annotation count of nearly 100,000 in the 67-article subset (and more than 140,000 in the full collection), the scale of conceptual markup is also among the largest of comparable corpora. The concept annotations of the CRAFT Corpus have the potential to significantly advance biomedical text mining by providing a high-quality gold standard for NLP systems. The corpus, annotation guidelines, and other associated resources are freely available at http://bionlp-corpora.sourceforge.net/CRAFT/index.shtml.
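Since the abstract emphasizes consistently high interannotator agreement on the concept markup, a minimal sketch of how span-level agreement is often scored (exact-match precision, recall, and F between two annotators' concept spans) may be useful; the toy spans below are invented, and this is not the CRAFT project's own scoring code.

```python
# Minimal sketch of exact-match inter-annotator agreement over concept spans:
# each annotation is (start_offset, end_offset, concept_id). Toy data only;
# this is not the CRAFT project's own agreement-scoring code.
def agreement(annotator_a, annotator_b):
    a, b = set(annotator_a), set(annotator_b)
    matched = len(a & b)
    precision = matched / len(a) if a else 0.0   # treating A as "system"
    recall = matched / len(b) if b else 0.0      # and B as "reference"
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

ann_a = {(0, 5, "CL:0000540"), (12, 17, "GO:0008283"), (30, 35, "CHEBI:15377")}
ann_b = {(0, 5, "CL:0000540"), (12, 17, "GO:0008283"), (40, 48, "PR:000004803")}

print(agreement(ann_a, ann_b))  # (0.666..., 0.666..., 0.666...)
```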


BMC Bioinformatics | 2010

The structural and content aspects of abstracts versus bodies of full text journal articles are different

K. Bretonnel Cohen; Helen L. Johnson; Karin Verspoor; Christophe Roeder; Lawrence Hunter

Background: An increase in work on the full text of journal articles and the growth of PubMed Central have the opportunity to create a major paradigm shift in how biomedical text mining is done. However, until now there has been no comprehensive characterization of how the bodies of full text journal articles differ from the abstracts that until now have been the subject of most biomedical text mining research.

Results: We examined the structural and linguistic aspects of abstracts and bodies of full text articles, the performance of text mining tools on both, and the distribution of a variety of semantic classes of named entities between them. We found marked structural differences, with longer sentences in the article bodies and much heavier use of parenthesized material in the bodies than in the abstracts. We found content differences with respect to linguistic features. Three out of four of the linguistic features that we examined were statistically significantly differently distributed between the two genres. We also found content differences with respect to the distribution of semantic features. There were significantly different densities per thousand words for three out of four semantic classes, and clear differences in the extent to which they appeared in the two genres. With respect to the performance of text mining tools, we found that a mutation finder performed equally well in both genres, but that a wide variety of gene mention systems performed much worse on article bodies than they did on abstracts. POS tagging was also more accurate in abstracts than in article bodies.

Conclusions: Aspects of structure and content differ markedly between article abstracts and article bodies. A number of these differences may pose problems as the text mining field moves more into the area of processing full-text articles. However, these differences also present a number of opportunities for the extraction of data types, particularly that found in parenthesized text, that is present in article bodies but not in article abstracts.
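A rough sketch of two of the structural measures described in the Results (mean sentence length and the density of parenthesized material per 1,000 words) is given below; the short "abstract" and "body" strings are invented stand-ins rather than data from the study.

```python
# Rough sketch of two structural measures discussed above: mean sentence length
# and density of parenthesized spans per 1,000 words, compared between an
# "abstract" and a "body" sample. The sample texts are invented stand-ins.
import re

def structural_profile(text):
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    words = text.split()
    parens = re.findall(r"\([^)]*\)", text)
    return {
        "mean_sentence_length": len(words) / len(sentences),
        "parens_per_1000_words": 1000 * len(parens) / len(words),
    }

abstract = "We present a method for gene name recognition. It improves recall."
body = ("The classifier (a conditional random field) was trained on GENIA "
        "(Kim et al., 2003). Performance (F = 0.82) exceeded the baseline.")

print("abstract:", structural_profile(abstract))
print("body:    ", structural_profile(body))
```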


BMC Bioinformatics | 2009

Text mining and manual curation of chemical-gene-disease networks for the Comparative Toxicogenomics Database (CTD)

Thomas C. Wiegers; Allan Peter Davis; K. Bretonnel Cohen; Lynette Hirschman; Carolyn J. Mattingly

Background: The Comparative Toxicogenomics Database (CTD) is a publicly available resource that promotes understanding about the etiology of environmental diseases. It provides manually curated chemical-gene/protein interactions and chemical- and gene-disease relationships from the peer-reviewed, published literature. The goals of the research reported here were to establish a baseline analysis of current CTD curation, develop a text-mining prototype from readily available open source components, and evaluate its potential value in augmenting curation efficiency and increasing data coverage.

Results: Prototype text-mining applications were developed and evaluated using a CTD data set consisting of manually curated molecular interactions and relationships from 1,600 documents. Preliminary results indicated that the prototype found 80% of the gene, chemical, and disease terms appearing in curated interactions. These terms were used to re-rank documents for curation, resulting in increases in mean average precision (63% for the baseline vs. 73% for a rule-based re-ranking), and in the correlation coefficient of rank vs. number of curatable interactions per document (baseline 0.14 vs. 0.38 for the rule-based re-ranking).

Conclusion: This text-mining project is unique in its integration of existing tools into a single workflow with direct application to CTD. We performed a baseline assessment of the inter-curator consistency and coverage in CTD, which allowed us to measure the potential of these integrated tools to improve prioritization of journal articles for manual curation. Our study presents a feasible and cost-effective approach for developing a text mining solution to enhance manual curation throughput and efficiency.
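The two evaluation figures reported here, mean average precision for the re-ranked document list and a correlation between ranking and the number of curatable interactions per document, can be computed as sketched below. The rankings, scores, and interaction counts are invented examples, and Spearman's rank correlation is an assumed choice of coefficient (the abstract does not specify which coefficient was used).

```python
# Sketch of the two evaluation measures mentioned above: mean average precision
# over ranked document lists, and a rank correlation between a re-ranker's score
# and the number of curatable interactions per document. The data are invented,
# and Spearman correlation is an assumed (not stated) choice of coefficient.
from scipy.stats import spearmanr

def average_precision(relevance):
    """relevance: 0/1 flags in ranked order (1 = document yielded curated content)."""
    hits, score = 0, 0.0
    for i, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            score += hits / i
    return score / hits if hits else 0.0

def mean_average_precision(rankings):
    return sum(average_precision(r) for r in rankings) / len(rankings)

# Two example "queries" (e.g., chemicals), each with a re-ranked document list.
rankings = [[1, 1, 0, 1, 0], [0, 1, 1, 0, 0]]
print("MAP:", mean_average_precision(rankings))

# Correlation between the re-ranker's score and curatable interactions found.
scores = [0.9, 0.8, 0.6, 0.3, 0.1]   # re-ranker scores for five documents
interactions = [9, 7, 7, 2, 0]       # curatable interactions per document
rho, _ = spearmanr(scores, interactions)
print("Spearman rho:", rho)
```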

Collaboration


Dive into K. Bretonnel Cohen's collaborations.

Top Co-Authors

Lawrence Hunter (University of Colorado Denver)

Helen L. Johnson (University of Colorado Denver)

Lynette Hirschman (University of Colorado Denver)

John Pestian (Cincinnati Children's Hospital Medical Center)

Zhiyong Lu (National Institutes of Health)

Christophe Roeder (University of Colorado Denver)

Michael Bada (University of Colorado Denver)