Publications


Featured research published by Christopher S. Funk.


BMC Bioinformatics | 2012

A corpus of full-text journal articles is a robust evaluation tool for revealing differences in performance of biomedical natural language processing tools

Karin Verspoor; Kevin Bretonnel Cohen; Arrick Lanfranchi; Colin Warner; Helen L. Johnson; Christophe Roeder; Jinho D. Choi; Christopher S. Funk; Yuriy Malenkiy; Miriam Eckert; Nianwen Xue; William A. Baumgartner; Michael Bada; Martha Palmer; Lawrence Hunter

Background: We introduce the linguistic annotation of a corpus of 97 full-text biomedical publications, known as the Colorado Richly Annotated Full Text (CRAFT) corpus. We further assess the performance of existing tools for performing sentence splitting, tokenization, syntactic parsing, and named entity recognition on this corpus. Results: Many biomedical natural language processing systems demonstrated large differences between their previously published results and their performance on the CRAFT corpus when tested with the publicly available models or rule sets. Trainable systems differed widely with respect to their ability to build high-performing models based on this data. Conclusions: The finding that some systems were able to train high-performing models based on this corpus is additional evidence, beyond high inter-annotator agreement, that the quality of the CRAFT corpus is high. The overall poor performance of various systems indicates that considerable work needs to be done to enable natural language processing systems to work well when the input is full-text journal articles. The CRAFT corpus provides a valuable resource to the biomedical natural language processing community for evaluation and training of new models for biomedical full-text publications.
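
As a rough illustration of the kind of tool benchmarking described above, the sketch below scores a tool's predicted token spans against gold-standard spans with precision, recall, and F1. The span values and the `span_prf` helper are invented for this example and are not part of the CRAFT evaluation code.

```python
# Illustrative sketch (not the paper's evaluation code): scoring predicted
# spans against gold-standard spans, as one would when benchmarking
# sentence splitters, tokenizers, or NER tools on a corpus like CRAFT.

def span_prf(gold_spans, predicted_spans):
    """Compute precision, recall, and F1 over exact-match character spans."""
    gold = set(gold_spans)            # e.g. {(0, 5), (6, 13), ...}
    pred = set(predicted_spans)
    tp = len(gold & pred)             # spans the tool got exactly right
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Example: a tokenizer trained on newswire often splits biomedical tokens
# such as "IL-2" or "p53(+/-)" differently from a full-text gold standard.
gold = [(0, 4), (5, 9), (10, 18)]
pred = [(0, 4), (5, 7), (7, 9), (10, 18)]
print(span_prf(gold, pred))  # precision drops because of the extra split
```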


BMC Bioinformatics | 2014

Large-scale biomedical concept recognition: an evaluation of current automatic annotators and their parameters.

Christopher S. Funk; William A. Baumgartner; Benjamin Garcia; Christophe Roeder; Michael Bada; K. Bretonnel Cohen; Lawrence Hunter; Karin Verspoor

Background: Ontological concepts are useful for many different biomedical tasks. Concepts are difficult to recognize in text due to a disconnect between what is captured in an ontology and how the concepts are expressed in text. There are many recognizers for specific ontologies, but a general approach for concept recognition is an open problem. Results: Three dictionary-based systems (MetaMap, NCBO Annotator, and ConceptMapper) are evaluated on eight biomedical ontologies in the Colorado Richly Annotated Full-Text (CRAFT) Corpus. Over 1,000 parameter combinations are examined, and the best-performing parameters for each system-ontology pair are presented. Conclusions: Baselines for concept recognition by three systems on eight biomedical ontologies are established (F-measures range from 0.14 to 0.83). Of the three systems tested, ConceptMapper is generally the best-performing; it produces the highest F-measure for seven of the eight ontologies. Default parameters are not ideal for most systems on most ontologies; changing parameters can increase F-measure by up to 0.4. In addition to the best-performing parameters, suggestions for choosing parameters based on ontology characteristics are provided.
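
The sketch below illustrates the general shape of such a parameter sweep: try every combination in a grid and keep the one with the highest F-measure against gold-standard annotations. The parameter names and the `run_annotator` callback are hypothetical stand-ins, not the actual MetaMap, NCBO Annotator, or ConceptMapper interfaces.

```python
# Hypothetical sketch of a parameter sweep for a dictionary-based annotator,
# keeping the best F-measure against a gold standard such as CRAFT.

from itertools import product

def f_measure(gold, predicted):
    tp = len(gold & predicted)
    p = tp / len(predicted) if predicted else 0.0
    r = tp / len(gold) if gold else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

# Illustrative parameter grid; real systems expose different options.
param_grid = {
    "case_sensitive": [True, False],
    "stemmer": ["none", "porter"],
    "order_independent_lookup": [True, False],
}

def best_parameters(gold_annotations, run_annotator):
    """Return (best F-measure, parameter combination) for one ontology."""
    best_f, best_params = -1.0, None
    for values in product(*param_grid.values()):
        params = dict(zip(param_grid.keys(), values))
        predicted = run_annotator(**params)   # set of (span, concept ID) pairs
        f = f_measure(gold_annotations, predicted)
        if f > best_f:
            best_f, best_params = f, params
    return best_f, best_params
```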


BMC Bioinformatics | 2013

Combining heterogeneous data sources for accurate functional annotation of proteins

Artem Sokolov; Christopher S. Funk; Kiley Graim; Karin Verspoor; Asa Ben-Hur

Combining heterogeneous sources of data is essential for accurate prediction of protein function. The task is complicated by the fact that while sequence-based features can be readily compared across species, most other data are species-specific. In this paper, we present a multi-view extension to GOstruct, a structured-output framework for function annotation of proteins. The extended framework can learn from disparate data sources, with each data source provided to the framework in the form of a kernel. Our empirical results demonstrate that the multi-view framework is able to utilize all available information, yielding better performance than sequence-based models trained across species and models trained from collections of data within a given species. This version of GOstruct participated in the recent Critical Assessment of Functional Annotations (CAFA) challenge; since then we have significantly improved the natural language processing component of the method, which now provides performance that is on par with that provided by sequence information. The GOstruct framework is available for download at http://strut.sourceforge.net.
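
As a simplified illustration of the multi-view idea (GOstruct itself is a structured-output method, not a per-term binary classifier), the sketch below sums precomputed kernel matrices from several data sources and trains a support vector classifier on the combined kernel. The kernel matrices and labels are assumed inputs.

```python
# Simplified multi-view sketch: each data source contributes an
# n_proteins x n_proteins kernel matrix over the same proteins; the kernels
# are summed and the combined kernel is used to train a classifier.

from sklearn.svm import SVC

def combine_kernels(kernels, weights=None):
    """Sum (optionally weighted) kernel matrices from different data sources."""
    weights = weights or [1.0] * len(kernels)
    return sum(w * K for w, K in zip(weights, kernels))

def train_go_term_classifier(kernels, labels):
    """kernels: e.g. [K_sequence, K_interactions, K_literature] (assumed inputs)."""
    K = combine_kernels(kernels)
    clf = SVC(kernel="precomputed")   # scikit-learn accepts precomputed kernels
    clf.fit(K, labels)                # labels: 1 if the protein has the GO term
    return clf
```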


Journal of Biomedical Semantics | 2015

Evaluating a variety of text-mined features for automatic protein function prediction with GOstruct

Christopher S. Funk; Indika Kahanda; Asa Ben-Hur; Karin Verspoor

Most computational methods that predict protein function do not take advantage of the large amount of information contained in the biomedical literature. In this work we evaluate both ontology term co-mention and bag-of-words features mined from the biomedical literature and analyze their impact in the context of a structured output support vector machine model, GOstruct. We find that even simple literature-based features are useful for predicting human protein function (F-max: Molecular Function = 0.408, Biological Process = 0.461, Cellular Component = 0.608). One advantage of using literature features is their ability to offer easy verification of automated predictions. Through manual inspection of misclassifications we find that some false positive predictions could be biologically valid, based on support extracted from the literature. Additionally, we present a "medium-throughput" pipeline that was used to annotate a large subset of co-mentions; we suggest that this strategy could help to speed up the rate at which proteins are curated.
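
The sketch below illustrates, in simplified form, the two kinds of literature features evaluated above: bag-of-words counts over sentences mentioning a protein, and counts of GO terms co-mentioned with it. The sentence lists and the `recognize_go_terms` function are hypothetical inputs, not the paper's actual pipeline.

```python
# Illustrative feature extraction, assuming we already have the sentences in
# which a protein is mentioned and a concept recognizer that tags GO terms.

from collections import Counter

def bag_of_words(sentences):
    """Simple bag-of-words counts over all sentences mentioning a protein."""
    counts = Counter()
    for sentence in sentences:
        counts.update(token.lower() for token in sentence.split())
    return counts

def go_comention_counts(sentences, recognize_go_terms):
    """Count how often each GO term is mentioned alongside the protein."""
    counts = Counter()
    for sentence in sentences:
        counts.update(recognize_go_terms(sentence))  # e.g. {"GO:0006915", ...}
    return counts

# Either feature vector can then be supplied to a model such as GOstruct,
# separately or concatenated with other feature sets.
```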


F1000Research | 2015

PHENOstruct: Prediction of human phenotype ontology terms using heterogeneous data sources.

Indika Kahanda; Christopher S. Funk; Karin Verspoor; Asa Ben-Hur

The Human Phenotype Ontology (HPO) was recently developed as a standardized vocabulary for describing the phenotypic abnormalities associated with human diseases. At present, only a small fraction of human protein-coding genes have HPO annotations, but researchers believe that a large portion of currently unannotated genes are related to disease phenotypes. It is therefore important to predict gene-HPO term associations using accurate computational methods. In this work we demonstrate the performance advantage of the structured SVM approach, previously shown to be highly effective for Gene Ontology term prediction, over several baseline methods. Furthermore, we highlight a collection of informative data sources suitable for the problem of predicting gene-HPO associations, including large-scale literature mining data.
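
For context, the sketch below shows one simple kind of baseline against which a structured approach can be compared: an independent binary classifier per HPO term. PHENOstruct itself uses a structured SVM over all terms jointly; the feature matrix `X` and label matrix `Y` here are assumed inputs built from the heterogeneous data sources.

```python
# Minimal per-term baseline sketch (not PHENOstruct): one binary classifier
# per HPO term, trained on the same gene features.

from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

def train_per_term_baseline(X, Y):
    """Y is an (n_genes x n_HPO_terms) binary matrix of known associations."""
    model = OneVsRestClassifier(LinearSVC())
    model.fit(X, Y)
    return model   # model.decision_function(X_new) scores every HPO term
```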


GigaScience | 2015

A close look at protein function prediction evaluation protocols

Indika Kahanda; Christopher S. Funk; Fahad Ullah; Karin Verspoor; Asa Ben-Hur

Background: The recently held Critical Assessment of Function Annotation challenge (CAFA2) required its participants to submit predictions for a large number of target proteins regardless of whether they had previous annotations or not. This is in contrast to the original CAFA challenge, in which participants were asked to submit predictions only for proteins with no existing annotations. The CAFA2 task is more realistic, in that it more closely mimics the accumulation of annotations over time. In this study we compare these tasks in terms of their difficulty and determine whether cross-validation provides a good estimate of performance. Results: The CAFA2 task is a combination of two subtasks: making predictions on previously annotated proteins and making predictions on previously unannotated proteins. In this study we analyze the performance of several function prediction methods in these two scenarios. Our results show that several methods (structured support vector machine, binary support vector machines, and guilt-by-association methods) do not usually achieve the same level of accuracy on these two tasks as that achieved by cross-validation, and that predicting novel annotations for previously annotated proteins is a harder problem than predicting annotations for uncharacterized proteins. We also find that different methods have different performance characteristics in these tasks, and that cross-validation is not adequate for estimating performance and ranking methods. Conclusions: These results have implications for the design of computational experiments in the area of automated function prediction and can provide useful insight for the understanding and design of future CAFA competitions.
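
The sketch below outlines, under stated assumptions, the kind of protocol comparison described above: the same model is scored by cross-validation on the training snapshot and then separately on targets that did or did not have prior annotations. The `evaluate` callback and the `had_prior_annotations` attribute are hypothetical stand-ins for a CAFA-style metric (such as F-max) and for the target bookkeeping.

```python
# Hypothetical protocol comparison: cross-validation estimate versus
# held-out performance on previously annotated and novel target proteins.

from sklearn.model_selection import cross_val_score

def compare_protocols(model, X_train, y_train, targets, evaluate):
    # Cross-validation estimate computed on the training snapshot.
    cv_estimate = cross_val_score(model, X_train, y_train, cv=5).mean()

    previously_annotated = [t for t in targets if t.had_prior_annotations]
    novel = [t for t in targets if not t.had_prior_annotations]

    return {
        "cross_validation": cv_estimate,
        "new_annotations_for_annotated_proteins": evaluate(model, previously_annotated),
        "first_annotations_for_novel_proteins": evaluate(model, novel),
    }
```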


Handbook of Linguistic Annotation | 2017

The Colorado Richly Annotated Full Text (CRAFT) Corpus: Multi-Model Annotation in the Biomedical Domain

K. Bretonnel Cohen; Karin Verspoor; Karën Fort; Christopher S. Funk; Michael Bada; Martha Palmer; Lawrence Hunter

The Colorado Richly Annotated Full Text (CRAFT) corpus consists of full-text journal articles. The primary motivation for the annotation project was the accumulating body of evidence indicating that the bodies of journal articles contain much information that is not present in the abstracts, and that the textual and structural characteristics of article bodies differ from those of abstracts. The development of CRAFT was characterized by a "multi-model" annotation task. The sample population was all journal articles that had been used by the Mouse Genome Informatics group as evidence for at least one Gene Ontology or Mammalian Phenotype Ontology annotation. The linguistic annotation is represented in the widely known Penn Treebank format (Marcus et al., Comput. Linguist. 19(2), 313–330, 1993), with the addition of a small number of tags and phrasal categories to accommodate the idiosyncrasies of the domain.
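
As a small illustration of the Penn Treebank bracketing format mentioned above, the snippet below parses an invented biomedical-flavored example with NLTK; the sentence and its parse are not drawn from the CRAFT corpus.

```python
# Toy Penn Treebank-style bracketing, parsed and displayed with NLTK.
from nltk.tree import Tree

ptb = "(S (NP (NN Brca1) (NNS mutants)) (VP (VBP show) (NP (JJ abnormal) (NNS phenotypes))))"
tree = Tree.fromstring(ptb)
tree.pretty_print()    # draws the constituency tree as ASCII art
print(tree.leaves())   # ['Brca1', 'mutants', 'show', 'abnormal', 'phenotypes']
```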


Pacific Symposium on Biocomputing | 2013

Combining heterogenous data for prediction of disease related and pharmacogenes.

Christopher S. Funk; Lawrence Hunter; K. Bretonnel Cohen

Identifying genetic variants that affect drug response or play a role in disease is an important task for clinicians and researchers. Before individual variants can be explored efficiently for effects on drug response or disease relationships, specific candidate genes must be identified. While many methods rank candidate genes through the use of sequence features and network topology, only a few exploit the information contained in the biomedical literature. In this work, we train and test a classifier on known pharmacogenes from PharmGKB and present a classifier that predicts pharmacogenes on a genome-wide scale using only Gene Ontology annotations and simple features mined from the biomedical literature, achieving a performance of F = 0.86 and AUC = 0.860. The top 10 predicted genes are analyzed, and a set of Gene Ontology concepts enriched among pharmacogenes is produced.
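
The sketch below shows one plausible way to set up such an experiment with scikit-learn, reporting F-measure and AUC on a held-out split. The choice of random forest and the feature matrix `X` (GO annotations plus literature-mined counts) are assumptions for illustration, not the paper's exact pipeline.

```python
# Illustrative pharmacogene classification experiment (not the paper's code):
# train on a feature matrix built from GO annotations and literature counts,
# then report F-measure and AUC on a held-out test split.

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

def train_and_score(X, y):
    """y: 1 for genes listed as pharmacogenes in PharmGKB, 0 otherwise."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=0)
    clf = RandomForestClassifier(n_estimators=500, random_state=0)
    clf.fit(X_train, y_train)
    scores = clf.predict_proba(X_test)[:, 1]
    return {
        "F": f1_score(y_test, scores > 0.5),
        "AUC": roc_auc_score(y_test, scores),
    }
```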


Journal of Biomedical Semantics | 2016

Gene Ontology synonym generation rules lead to increased performance in biomedical concept recognition.

Christopher S. Funk; K. Bretonnel Cohen; Lawrence Hunter; Karin Verspoor

Background: Gene Ontology (GO) terms represent the standard for annotation and representation of molecular functions, biological processes, and cellular components, but a large gap exists between the way concepts are represented in the ontology and how they are expressed in natural language text. The construction of highly specific GO terms is formulaic, consisting of parts and pieces from simpler terms. Results: We present two different types of manually generated rules to help capture the variation in how GO terms can appear in natural language text. The first set of rules takes into account the compositional nature of GO and recursively decomposes the terms into their smallest constituent parts. The second set of rules generates derivational variations of these smaller terms and compositionally combines all generated variants to form the original term. By applying both types of rules, new synonyms are generated for two-thirds of all GO terms, and F-measure for recognition of GO concepts on the CRAFT corpus increases from 0.498 to 0.636. Additionally, we evaluated the combination of both types of rules over one million full-text documents from Elsevier; manual validation and error analysis based on random sampling of annotations show that we are able to recognize GO concepts with reasonable accuracy (88%). Conclusions: In this work we present a set of simple synonym generation rules that exploit the highly compositional and formulaic nature of Gene Ontology concepts. We illustrate how the generated synonyms improve recognition of GO concepts on two different biomedical corpora. We discuss other applications of our rules for GO quality assurance, explore the issue of overgeneration, and provide examples of how similar methodologies could be applied to other biomedical terminologies. Additionally, we provide all generated synonyms for use by the text-mining community.
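
The toy sketch below captures the flavor of such rules for "X of Y"-shaped terms: reorder the constituents and substitute derivational verb forms of the head noun. The actual rule set is manually curated and far more extensive; the `DERIVATIONAL_VERBS` table here is a small invented example.

```python
# Toy synonym generation for compositional GO terms (not the paper's rules).

DERIVATIONAL_VERBS = {
    "regulation": ["regulates", "regulating", "regulated"],
    "activation": ["activates", "activating", "activated"],
}

def generate_synonyms(term):
    """Generate simple variants for 'X of Y'-shaped GO terms."""
    if " of " not in term:
        return {term}
    head, remainder = term.split(" of ", 1)      # "regulation" / "cell migration"
    variants = {term, f"{remainder} {head}"}     # "cell migration regulation"
    for verb in DERIVATIONAL_VERBS.get(head, []):
        variants.add(f"{verb} {remainder}")      # "regulates cell migration"
    return variants

print(generate_synonyms("regulation of cell migration"))
```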


Archive | 2015

Supporting data for: A close look at protein function prediction evaluation protocols

Indika Kahanda; Karin Verspoor; Christopher S. Funk; Asa Ben-Hur

Collaboration


Dive into Christopher S. Funk's collaborations.

Top Co-Authors

Asa Ben-Hur, Colorado State University
Indika Kahanda, Montana State University
Lawrence Hunter, University of Colorado Denver
K. Bretonnel Cohen, University of Colorado Denver
Michael Bada, University of Colorado Denver
Artem Sokolov, University of California
Christophe Roeder, University of Colorado Denver
Kiley Graim, University of California
Martha Palmer, University of Colorado Boulder