Publication


Featured research published by Karin Verspoor.


BMC Bioinformatics | 2011

The gene normalization task in BioCreative III

Zhiyong Lu; Hung Yu Kao; Chih-Hsuan Wei; Minlie Huang; Jingchen Liu; Cheng-Ju Kuo; Chun-Nan Hsu; Richard Tzong-Han Tsai; Hong-Jie Dai; Naoaki Okazaki; Han-Cheol Cho; Martin Gerner; Illés Solt; Shashank Agarwal; Feifan Liu; Dina Vishnyakova; Patrick Ruch; Martin Romacker; Fabio Rinaldi; Sanmitra Bhattacharya; Padmini Srinivasan; Hongfang Liu; Manabu Torii; Sérgio Matos; David Campos; Karin Verspoor; Kevin Livingston; W. John Wilbur

Background: We report the Gene Normalization (GN) challenge in BioCreative III where participating teams were asked to return a ranked list of identifiers of the genes detected in full-text articles. For training, 32 fully and 500 partially annotated articles were prepared. A total of 507 articles were selected as the test set. Due to the high annotation cost, it was not feasible to obtain gold-standard human annotations for all test articles. Instead, we developed an Expectation Maximization (EM) algorithm approach for choosing a small number of test articles for manual annotation that were most capable of differentiating team performance. Moreover, the same algorithm was subsequently used for inferring ground truth based solely on team submissions. We report team performance on both gold standard and inferred ground truth using a newly proposed metric called Threshold Average Precision (TAP-k).

Results: We received a total of 37 runs from 14 different teams for the task. When evaluated using the gold-standard annotations of the 50 articles, the highest TAP-k scores were 0.3297 (k=5), 0.3538 (k=10), and 0.3535 (k=20), respectively. Higher TAP-k scores of 0.4916 (k=5, 10, 20) were observed when evaluated using the inferred ground truth over the full test set. When combining team results using machine learning, the best composite system achieved TAP-k scores of 0.3707 (k=5), 0.4311 (k=10), and 0.4477 (k=20) on the gold standard, representing improvements of 12.4%, 21.8%, and 26.6% over the best team results, respectively.

Conclusions: By using full text and being species non-specific, the GN task in BioCreative III has moved closer to a real literature curation task than similar tasks in the past and presents additional challenges for the text mining community, as revealed in the overall team results. By evaluating teams using the gold standard, we show that the EM algorithm allows team submissions to be differentiated while keeping the manual annotation effort feasible. Using the inferred ground truth we show measures of comparative performance between teams. Finally, by comparing team rankings on gold standard vs. inferred ground truth, we further demonstrate that the inferred ground truth is as effective as the gold standard for detecting good team performance.
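The improvement figures quoted in the Results can be checked with a few lines of arithmetic. The sketch below assumes the percentages are plain relative gains of the composite system over the best single team at each k; the function name and dictionary layout are illustrative, not taken from the paper.

```python
def relative_improvement(baseline: float, improved: float) -> float:
    """Return the gain of `improved` over `baseline`, in percent."""
    return (improved - baseline) / baseline * 100.0


# TAP-k scores on the gold standard, as quoted in the abstract.
best_team = {5: 0.3297, 10: 0.3538, 20: 0.3535}
composite = {5: 0.3707, 10: 0.4311, 20: 0.4477}

# Assumption: the reported percentages are relative gains rounded to one decimal.
improvements = {
    k: round(relative_improvement(best_team[k], composite[k]), 1)
    for k in best_team
}

for k, gain in improvements.items():
    print(f"k={k}: +{gain}%")
```

Under that assumption the computed gains agree with the 12.4%, 21.8%, and 26.6% reported above.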


Journal of Cheminformatics | 2015

The CHEMDNER corpus of chemicals and drugs and its annotation principles

Martin Krallinger; Obdulia Rabal; Florian Leitner; Miguel Vazquez; David Salgado; Zhiyong Lu; Robert Leaman; Yanan Lu; Donghong Ji; Daniel M. Lowe; Roger A. Sayle; Riza Theresa Batista-Navarro; Rafal Rak; Torsten Huber; Tim Rocktäschel; Sérgio Matos; David Campos; Buzhou Tang; Hua Xu; Tsendsuren Munkhdalai; Keun Ho Ryu; S. V. Ramanan; Senthil Nathan; Slavko Žitnik; Marko Bajec; Lutz Weber; Matthias Irmer; Saber A. Akhondi; Jan A. Kors; Shuo Xu

The automatic extraction of chemical information from text requires the recognition of chemical entity mentions as one of its key steps. When developing supervised named entity recognition (NER) systems, the availability of a large, manually annotated text corpus is desirable. Furthermore, large corpora permit the robust evaluation and comparison of different approaches that detect chemicals in documents. We present the CHEMDNER corpus, a collection of 10,000 PubMed abstracts that contain a total of 84,355 chemical entity mentions labeled manually by expert chemistry literature curators, following annotation guidelines specifically defined for this task. The abstracts of the CHEMDNER corpus were selected to be representative of all major chemical disciplines. Each of the chemical entity mentions was manually labeled according to its structure-associated chemical entity mention (SACEM) class: abbreviation, family, formula, identifier, multiple, systematic and trivial. The difficulty and consistency of tagging chemicals in text were measured using an agreement study between annotators, obtaining a percentage agreement of 91%. For a subset of the CHEMDNER corpus (the test set of 3,000 abstracts) we provide not only the Gold Standard manual annotations, but also mentions automatically detected by the 26 teams that participated in the BioCreative IV CHEMDNER chemical mention recognition task. In addition, we release the CHEMDNER silver standard corpus of automatically extracted mentions from 17,000 randomly selected PubMed abstracts. A version of the CHEMDNER corpus in the BioC format has been generated as well. We propose a standard for required minimum information about entity annotations for the construction of domain specific corpora on chemical and drug entities. The CHEMDNER corpus and annotation guidelines are available at: http://www.biocreative.org/resources/biocreative-iv/chemdner-corpus/


BMC Bioinformatics | 2012

Concept annotation in the CRAFT corpus

Michael Bada; Miriam Eckert; Donald Evans; Kristin Garcia; Krista Shipley; Dmitry Sitnikov; William A. Baumgartner; K. Bretonnel Cohen; Karin Verspoor; Judith A. Blake; Lawrence Hunter

Background: Manually annotated corpora are critical for the training and evaluation of automated methods to identify concepts in biomedical text.

Results: This paper presents the concept annotations of the Colorado Richly Annotated Full-Text (CRAFT) Corpus, a collection of 97 full-length, open-access biomedical journal articles that have been annotated both semantically and syntactically to serve as a research resource for the biomedical natural-language-processing (NLP) community. CRAFT identifies all mentions of nearly all concepts from nine prominent biomedical ontologies and terminologies: the Cell Type Ontology, the Chemical Entities of Biological Interest ontology, the NCBI Taxonomy, the Protein Ontology, the Sequence Ontology, the entries of the Entrez Gene database, and the three subontologies of the Gene Ontology. The first public release includes the annotations for 67 of the 97 articles, reserving two sets of 15 articles for future text-mining competitions (after which these too will be released). Concept annotations were created based on a single set of guidelines, which has enabled us to achieve consistently high interannotator agreement.

Conclusions: As the initial 67-article release contains more than 560,000 tokens (and the full set more than 790,000 tokens), our corpus is among the largest gold-standard annotated biomedical corpora. Unlike most others, the journal articles that comprise the corpus are drawn from diverse biomedical disciplines and are marked up in their entirety. Additionally, with a concept-annotation count of nearly 100,000 in the 67-article subset (and more than 140,000 in the full collection), the scale of conceptual markup is also among the largest of comparable corpora. The concept annotations of the CRAFT Corpus have the potential to significantly advance biomedical text mining by providing a high-quality gold standard for NLP systems. The corpus, annotation guidelines, and other associated resources are freely available at http://bionlp-corpora.sourceforge.net/CRAFT/index.shtml.


BMC Bioinformatics | 2010

The structural and content aspects of abstracts versus bodies of full text journal articles are different

K. Bretonnel Cohen; Helen L. Johnson; Karin Verspoor; Christophe Roeder; Lawrence Hunter

Background: An increase in work on the full text of journal articles and the growth of PubMedCentral have the potential to create a major paradigm shift in how biomedical text mining is done. However, until now there has been no comprehensive characterization of how the bodies of full text journal articles differ from the abstracts that have so far been the subject of most biomedical text mining research.

Results: We examined the structural and linguistic aspects of abstracts and bodies of full text articles, the performance of text mining tools on both, and the distribution of a variety of semantic classes of named entities between them. We found marked structural differences, with longer sentences in the article bodies and much heavier use of parenthesized material in the bodies than in the abstracts. We found content differences with respect to linguistic features. Three out of four of the linguistic features that we examined were statistically significantly differently distributed between the two genres. We also found content differences with respect to the distribution of semantic features. There were significantly different densities per thousand words for three out of four semantic classes, and clear differences in the extent to which they appeared in the two genres. With respect to the performance of text mining tools, we found that a mutation finder performed equally well in both genres, but that a wide variety of gene mention systems performed much worse on article bodies than they did on abstracts. POS tagging was also more accurate in abstracts than in article bodies.

Conclusions: Aspects of structure and content differ markedly between article abstracts and article bodies. A number of these differences may pose problems as the text mining field moves more into the area of processing full-text articles. However, these differences also present a number of opportunities for the extraction of data types, particularly those found in parenthesized text, that are present in article bodies but not in article abstracts.


Database | 2013

BioC: a minimalist approach to interoperability for biomedical text processing

Donald C. Comeau; Rezarta Islamaj Doğan; Paolo Ciccarese; Kevin Bretonnel Cohen; Martin Krallinger; Florian Leitner; Zhiyong Lu; Yifan Peng; Fabio Rinaldi; Manabu Torii; Alfonso Valencia; Karin Verspoor; Thomas C. Wiegers; Cathy H. Wu; W. John Wilbur

A vast amount of scientific information is encoded in natural language text, and the quantity of such text has become so great that it is no longer economically feasible to have a human as the first step in the search process. Natural language processing and text mining tools have become essential to facilitate the search for and extraction of information from text. This has led to vigorous research efforts to create useful tools and to create humanly labeled text corpora, which can be used to improve such tools. To encourage combining these efforts into larger, more powerful and more capable systems, a common interchange format to represent, store and exchange the data in a simple manner between different language processing systems and text mining tools is highly desirable. Here we propose a simple extensible mark-up language format to share text documents and annotations. The proposed annotation approach allows a large number of different annotations to be represented including sentences, tokens, parts of speech, named entities such as genes or diseases and relationships between named entities. In addition, we provide simple code to hold this data, read it from and write it back to extensible mark-up language files and perform some sample processing. We also describe completed as well as ongoing work to apply the approach in several directions. Code and data are available at http://bioc.sourceforge.net/. Database URL: http://bioc.sourceforge.net/
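To make the idea of a shared XML interchange format concrete, the following sketch reads annotations from a small BioC-style document using only the Python standard library. The element names (collection, document, passage, annotation, infon, location) follow the commonly described BioC layout, but the sample document is invented for illustration and heavily simplified, and read_annotations is a hypothetical helper rather than part of the code distributed by the authors.

```python
import xml.etree.ElementTree as ET

# Invented, simplified sample; real BioC files carry more metadata.
SAMPLE = """<collection>
  <source>example</source>
  <date>2013-01-01</date>
  <key>example.key</key>
  <document>
    <id>doc1</id>
    <passage>
      <offset>0</offset>
      <text>BRCA1 is linked to breast cancer.</text>
      <annotation id="a1">
        <infon key="type">gene</infon>
        <location offset="0" length="5"/>
        <text>BRCA1</text>
      </annotation>
      <annotation id="a2">
        <infon key="type">disease</infon>
        <location offset="19" length="13"/>
        <text>breast cancer</text>
      </annotation>
    </passage>
  </document>
</collection>"""


def read_annotations(xml_text):
    """Return (document id, annotation id, type, covered text) tuples."""
    root = ET.fromstring(xml_text)
    results = []
    for document in root.iter("document"):
        doc_id = document.findtext("id")
        for passage in document.iter("passage"):
            passage_offset = int(passage.findtext("offset"))
            passage_text = passage.findtext("text")
            for ann in passage.iter("annotation"):
                loc = ann.find("location")
                # Locations are document-level offsets; shift into the passage.
                start = int(loc.get("offset")) - passage_offset
                end = start + int(loc.get("length"))
                ann_type = ann.findtext("infon[@key='type']")
                results.append(
                    (doc_id, ann.get("id"), ann_type, passage_text[start:end])
                )
    return results


annotations = read_annotations(SAMPLE)
for row in annotations:
    print(row)
```

Recovering the covered text from the offsets, rather than trusting the stored annotation text, is a simple way to check that an annotation and its passage are still aligned.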


Meeting of the Association for Computational Linguistics | 2016

Findings of the 2016 Conference on Machine Translation.

Ondřej Bojar; Rajen Chatterjee; Christian Federmann; Yvette Graham; Barry Haddow; Matthias Huck; Antonio Jimeno Yepes; Philipp Koehn; Varvara Logacheva; Christof Monz; Matteo Negri; Aurélie Névéol; Mariana L. Neves; Martin Popel; Matt Post; Raphael Rubino; Carolina Scarton; Lucia Specia; Marco Turchi; Karin Verspoor; Marcos Zampieri

This paper presents the results of the WMT16 shared tasks, which included five machine translation (MT) tasks (standard news, IT-domain, biomedical, multimodal, pronoun), three evaluation tasks (metrics, tuning, run-time estimation of MT quality), an automatic post-editing task, and a bilingual document alignment task. This year, 102 MT systems from 24 institutions (plus 36 anonymized online systems) were submitted to the 12 translation directions in the news translation task. The IT-domain task received 31 submissions from 12 institutions in 7 directions and the Biomedical task received 15 submissions from 5 institutions. Evaluation was both automatic and manual (relative ranking and 100-point scale assessments). The quality estimation task had three subtasks, with a total of 14 teams, submitting 39 entries. The automatic post-editing task had a total of 6 teams, submitting 11 entries.


Protein Science | 2006

A categorization approach to automated ontological function annotation.

Karin Verspoor; Judith D. Cohn; Susan M. Mniszewski; Cliff Joslyn

Automated function prediction (AFP) methods increasingly use knowledge discovery algorithms to map sequence, structure, literature, and/or pathway information about proteins whose functions are unknown into functional ontologies, typically (a portion of) the Gene Ontology (GO). While there are a growing number of methods within this paradigm, the general problem of assessing the accuracy of such prediction algorithms has not been seriously addressed. We present first an application for function prediction from protein sequences using the POSet Ontology Categorizer (POSOC) to produce new annotations by analyzing collections of GO nodes derived from annotations of protein BLAST neighborhoods. We then also present hierarchical precision and hierarchical recall as new evaluation metrics for assessing the accuracy of any predictions in hierarchical ontologies, and discuss results on a test set of protein sequences. We show that our method provides substantially improved hierarchical precision (a measure of how many of the predictions made are correct) when applied to the nearest BLAST neighbors of target proteins, as compared with simply imputing that neighborhood's annotations to the target. Moreover, when our method is applied to a broader BLAST neighborhood, hierarchical precision is enhanced even further. In all cases, such increased hierarchical precision performance is purchased at a modest expense of hierarchical recall (a measure of how many of the known annotations get predicted at all).
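The abstract does not spell out how hierarchical precision and recall are computed. The sketch below uses one common set-based formulation, in which predicted and gold annotations are each extended with all of their ancestors before the overlap is measured; this is an assumption and may differ from the definition in the paper. The toy ontology and its node names are invented.

```python
# Toy ontology (invented): each node maps to its direct parents.
PARENTS = {
    "root": [],
    "binding": ["root"],
    "catalysis": ["root"],
    "protein_binding": ["binding"],
    "dna_binding": ["binding"],
    "kinase": ["catalysis"],
}


def ancestors(node):
    """Return the node together with all of its ancestors."""
    seen = {node}
    stack = [node]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def up_closure(nodes):
    """Extend a set of nodes with every ancestor of every member."""
    closed = set()
    for node in nodes:
        closed |= ancestors(node)
    return closed


def hierarchical_pr(predicted, gold):
    """Set-based hierarchical precision and recall (assumed formulation)."""
    p = up_closure(predicted)
    g = up_closure(gold)
    overlap = len(p & g)
    return overlap / len(p), overlap / len(g)


hp, hr = hierarchical_pr({"dna_binding", "kinase"}, {"protein_binding"})
print(f"hierarchical precision={hp:.2f}, hierarchical recall={hr:.2f}")
```

In this example neither prediction is exactly right, yet "dna_binding" earns partial credit because it shares the ancestors "binding" and "root" with the gold annotation. Giving credit for near misses in the hierarchy is the motivation for such metrics.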


Journal of Biomedical Semantics | 2012

BioLemmatizer: a lemmatization tool for morphological processing of biomedical text

Haibin Liu; Tom Christiansen; William A. Baumgartner; Karin Verspoor

Background: The wide variety of morphological variants of domain-specific technical terms contributes to the complexity of performing natural language processing of the scientific literature related to molecular biology. For morphological analysis of these texts, lemmatization has been actively applied in recent biomedical research.

Results: In this work, we developed a domain-specific lemmatization tool, BioLemmatizer, for the morphological analysis of biomedical literature. The tool focuses on the inflectional morphology of English and is based on the general English lemmatization tool MorphAdorner. The BioLemmatizer is further tailored to the biological domain through incorporation of several published lexical resources. It retrieves lemmas based on the use of a word lexicon, and defines a set of rules that transform a word to a lemma if it is not encountered in the lexicon. An innovative aspect of the BioLemmatizer is the use of a hierarchical strategy for searching the lexicon, which enables the discovery of the correct lemma even if the input Part-of-Speech information is inaccurate. The BioLemmatizer achieves an accuracy of 97.5% in lemmatizing an evaluation set prepared from the CRAFT corpus, a collection of full-text biomedical articles, and an accuracy of 97.6% on the LLL05 corpus. The contribution of the BioLemmatizer to accuracy improvement of a practical information extraction task is further demonstrated when it is used as a component in a biomedical text mining system.

Conclusions: The BioLemmatizer outperforms other tools when compared with eight existing lemmatizers. The BioLemmatizer is released as open source software and can be downloaded from http://biolemmatizer.sourceforge.net.
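The lookup strategy described in the Results (lexicon first, with progressively looser Part-of-Speech matching, then rules) can be illustrated with a toy lemmatizer. This is only a sketch of the general idea: the lexicon entries, the suffix rules, and the exact back-off order are invented for illustration and are not BioLemmatizer's actual resources or algorithm.

```python
# Invented toy lexicon keyed by (word, Penn Treebank style POS tag).
LEXICON = {
    ("phosphorylated", "VBN"): "phosphorylate",
    ("phosphorylated", "VBD"): "phosphorylate",
    ("mice", "NNS"): "mouse",
    ("binding", "VBG"): "bind",
    ("binding", "NN"): "binding",
}

# Invented fallback rules, tried in order: (suffix, replacement).
SUFFIX_RULES = [("ies", "y"), ("sses", "ss"), ("s", ""), ("ing", ""), ("ed", "")]


def lemmatize(word, pos):
    """Return (lemma, strategy) using a hierarchical lexicon search."""
    word = word.lower()
    # 1. Exact match on word and POS tag.
    if (word, pos) in LEXICON:
        return LEXICON[(word, pos)], "lexicon"
    # 2. Relax to the coarse POS category (first two letters of the tag).
    for (w, p), lemma in LEXICON.items():
        if w == word and p[:2] == pos[:2]:
            return lemma, "lexicon-coarse-pos"
    # 3. Ignore the POS tag entirely, in case the tagger was wrong.
    for (w, _), lemma in LEXICON.items():
        if w == word:
            return lemma, "lexicon-any-pos"
    # 4. Fall back to suffix rules for words missing from the lexicon.
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)] + replacement, "rule"
    return word, "unchanged"


for token, tag in [("mice", "NNS"), ("mice", "VB"), ("kinases", "NNS")]:
    print(token, tag, lemmatize(token, tag))
```

The second call shows the benefit claimed in the abstract: even with an incorrect verb tag, "mice" still resolves to "mouse" because the search backs off to a POS-independent lookup before trying any rule.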


BMC Bioinformatics | 2012

A corpus of full-text journal articles is a robust evaluation tool for revealing differences in performance of biomedical natural language processing tools

Karin Verspoor; Kevin Bretonnel Cohen; Arrick Lanfranchi; Colin Warner; Helen L. Johnson; Christophe Roeder; Jinho D. Choi; Christopher S. Funk; Yuriy Malenkiy; Miriam Eckert; Nianwen Xue; William A. Baumgartner; Michael Bada; Martha Palmer; Lawrence Hunter

Background: We introduce the linguistic annotation of a corpus of 97 full-text biomedical publications, known as the Colorado Richly Annotated Full Text (CRAFT) corpus. We further assess the performance of existing tools for performing sentence splitting, tokenization, syntactic parsing, and named entity recognition on this corpus.

Results: Many biomedical natural language processing systems demonstrated large differences between their previously published results and their performance on the CRAFT corpus when tested with the publicly available models or rule sets. Trainable systems differed widely with respect to their ability to build high-performing models based on this data.

Conclusions: The finding that some systems were able to train high-performing models based on this corpus is additional evidence, beyond high inter-annotator agreement, that the quality of the CRAFT corpus is high. The overall poor performance of various systems indicates that considerable work needs to be done to enable natural language processing systems to work well when the input is full-text journal articles. The CRAFT corpus provides a valuable resource to the biomedical natural language processing community for evaluation and training of new models for biomedical full text publications.


BMC Bioinformatics | 2014

Large-scale biomedical concept recognition: an evaluation of current automatic annotators and their parameters.

Christopher S. Funk; William A. Baumgartner; Benjamin Garcia; Christophe Roeder; Michael Bada; K. Bretonnel Cohen; Lawrence Hunter; Karin Verspoor

Background: Ontological concepts are useful for many different biomedical tasks. Concepts are difficult to recognize in text due to a disconnect between what is captured in an ontology and how the concepts are expressed in text. There are many recognizers for specific ontologies, but a general approach for concept recognition is an open problem.

Results: Three dictionary-based systems (MetaMap, NCBO Annotator, and ConceptMapper) are evaluated on eight biomedical ontologies in the Colorado Richly Annotated Full-Text (CRAFT) Corpus. Over 1,000 parameter combinations are examined, and best-performing parameters for each system-ontology pair are presented.

Conclusions: Baselines for concept recognition by three systems on eight biomedical ontologies are established (F-measures range from 0.14–0.83). Out of the three systems we tested, ConceptMapper is generally the best-performing system; it produces the highest F-measure on seven out of eight ontologies. Default parameters are not ideal for most systems on most ontologies; by changing parameters, F-measure can be increased by up to 0.4. In addition to the best-performing parameters, we present suggestions for choosing parameters based on ontology characteristics.
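The effect of parameter choice on a dictionary-based recognizer can be shown with a toy example. The matcher below is not MetaMap, NCBO Annotator, or ConceptMapper; it is a minimal stand-in with two invented parameters (case sensitivity and plural stripping), a two-entry dictionary with made-up concept IDs, and a hand-labelled gold standard, all assumed purely for illustration.

```python
import re
from itertools import product

# Invented dictionary: concept ID -> preferred term.
DICTIONARY = {"C1": "neuron", "C2": "B cell"}
TEXT = "Neurons and B cells were isolated; each neuron was imaged."
TOKENS = re.findall(r"\w+", TEXT)
# Hand-labelled gold annotations as (concept ID, start token, end token).
GOLD = {("C1", 0, 1), ("C2", 2, 4), ("C1", 7, 8)}


def normalize(token, case_sensitive, strip_plural):
    if not case_sensitive:
        token = token.lower()
    if strip_plural and token.endswith("s") and len(token) > 3:
        token = token[:-1]
    return token


def annotate(tokens, case_sensitive, strip_plural):
    """Match one- and two-token dictionary terms against the text."""
    norm = [normalize(t, case_sensitive, strip_plural) for t in tokens]
    index = {
        " ".join(normalize(w, case_sensitive, strip_plural) for w in term.split()): cid
        for cid, term in DICTIONARY.items()
    }
    found = set()
    for n in (2, 1):
        for i in range(len(norm) - n + 1):
            key = " ".join(norm[i : i + n])
            if key in index:
                found.add((index[key], i, i + n))
    return found


def f_measure(predicted, gold):
    tp = len(predicted & gold)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(predicted), tp / len(gold)
    return 2 * precision * recall / (precision + recall)


# Exhaustive sweep over the (case_sensitive, strip_plural) grid.
scores = {
    params: f_measure(annotate(TOKENS, *params), GOLD)
    for params in product([True, False], repeat=2)
}
best_params = max(scores, key=scores.get)
print(scores, best_params)
```

Even on this tiny example the strict default-like setting (case sensitive, no plural handling) finds only one of three mentions, while the loosest setting finds all of them, which mirrors the abstract's point that defaults are rarely the best choice.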

Collaboration


Dive into Karin Verspoor's collaborations.

Top Co-Authors

Lawrence Hunter

University of Colorado Denver

Justin Zobel

University of Melbourne

Christophe Roeder

University of Colorado Denver

Christopher S. Funk

University of Colorado Denver