Colin R. Batchelor
Royal Society of Chemistry
Publication
Featured research published by Colin R. Batchelor.
empirical methods in natural language processing | 2009
Simone Teufel; Advaith Siddharthan; Colin R. Batchelor
Argumentative Zoning (AZ) is an analysis of the argumentative and rhetorical structure of a scientific paper. It has been shown to be reliably used by independent human coders, and has proven useful for various information access tasks. Annotation experiments have however so far been restricted to one discipline, computational linguistics (CL). Here, we present a more informative AZ scheme with 15 categories in place of the original 7, and show that it can be applied to the life sciences as well as to CL. We use a domain expert to encode basic knowledge about the subject (such as terminology and domain specific rules for individual categories) as part of the annotation guidelines. Our results show that non-expert human coders can then use these guidelines to reliably annotate this scheme in two domains, chemistry and computational linguistics.
Genome Biology | 2010
Martin G. Reese; Barry Moore; Colin R. Batchelor; Fidel Salas; Fiona Cunningham; Gabor T. Marth; Lincoln Stein; Paul Flicek; Mark Yandell; Karen Eilbeck
Here we describe the Genome Variation Format (GVF) and the 10Gen dataset. GVF, an extension of Generic Feature Format version 3 (GFF3), is a simple tab-delimited format for DNA variant files, which uses the Sequence Ontology to describe genome variation data. The 10Gen dataset, ten human genomes in GVF format, is freely available for community analysis from the Sequence Ontology website and from an Amazon elastic block storage (EBS) snapshot for use in Amazon's EC2 cloud computing environment.
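As a rough illustration of what "tab-delimited" means here, the sketch below splits one GFF3-style feature line into its nine columns and expands the final attributes column into key/value pairs. The example record, its source name, and the helper name `parse_gvf_line` are invented for illustration and are not taken from the 10Gen dataset; the sketch also ignores pragma and comment lines.

```python
# Column names follow the nine-column GFF3 layout that GVF extends.
GFF3_COLUMNS = ["seqid", "source", "type", "start", "end",
                "score", "strand", "phase", "attributes"]


def parse_gvf_line(line):
    """Split one tab-delimited feature line into a dict.

    The attributes column (semicolon-separated key=value pairs) is
    expanded into its own dict. Hypothetical helper, not an official API.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(GFF3_COLUMNS):
        raise ValueError(f"expected 9 columns, got {len(fields)}")
    record = dict(zip(GFF3_COLUMNS, fields))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    attributes = {}
    for pair in record["attributes"].split(";"):
        if pair:
            key, _, value = pair.partition("=")
            attributes[key] = value
    record["attributes"] = attributes
    return record


# Hypothetical record, invented for illustration only.
example = ("chr1\texample_source\tSNV\t1000\t1000\t.\t+\t.\t"
           "ID=var1;Variant_seq=A;Reference_seq=G")
record = parse_gvf_line(example)
```

The `type` column is where a Sequence Ontology term such as `SNV` would appear, which is how the format ties each variant to the ontology.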
Bioinformatics | 2012
Maria Liakata; Shyamasree Saha; Simon Dobnik; Colin R. Batchelor; Dietrich Rebholz-Schuhmann
Motivation: Scholarly biomedical publications report on the findings of a research investigation. Scientists use a well-established discourse structure to relate their work to the state of the art, express their own motivation and hypotheses and report on their methods, results and conclusions. In previous work, we have proposed ways to explicitly annotate the structure of scientific investigations in scholarly publications. Here we present the means to facilitate automatic access to the scientific discourse of articles by automating the recognition of 11 categories at the sentence level, which we call Core Scientific Concepts (CoreSCs). These include: Hypothesis, Motivation, Goal, Object, Background, Method, Experiment, Model, Observation, Result and Conclusion. CoreSCs provide the structure and context to all statements and relations within an article and their automatic recognition can greatly facilitate biomedical information extraction by characterizing the different types of facts, hypotheses and evidence available in a scientific publication. Results: We have trained and compared machine learning classifiers (support vector machines and conditional random fields) on a corpus of 265 full articles in biochemistry and chemistry to automatically recognize CoreSCs. We have evaluated our automatic classifications against a manually annotated gold standard, and have achieved promising accuracies with ‘Experiment’, ‘Background’ and ‘Model’ being the categories with the highest F1-scores (76%, 62% and 53%, respectively). We have analysed the task of CoreSC annotation both from a sentence classification as well as sequence labelling perspective and we present a detailed feature evaluation. The most discriminative features are local sentence features such as unigrams, bigrams and grammatical dependencies while features encoding the document structure, such as section headings, also play an important role for some of the categories. 
We discuss the usefulness of automatically generated CoreSCs in two biomedical applications as well as work in progress. Availability: A web-based tool for the automatic annotation of articles with CoreSCs and corresponding documentation is available online at http://www.sapientaproject.com/software. http://www.sapientaproject.com also contains detailed information pertaining to CoreSC annotation and links to annotation guidelines as well as a corpus of manually annotated articles, which served as our training data. Contact: [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.
Journal of Biomedical Semantics | 2011
Joanne S. Luciano; Bosse Andersson; Colin R. Batchelor; Olivier Bodenreider; Timothy W.I. Clark; Christine Denney; Christopher Domarew; Thomas Gambet; Lee Harland; Anja Jentzsch; Vipul Kashyap; Peter Kos; Julia Kozlovsky; Timothy Lebo; Scott M Marshall; James P. McCusker; Deborah L. McGuinness; Chimezie Ogbuji; Elgar Pichler; Robert L Powers; Eric Prud’hommeaux; Matthias Samwald; Lynn M. Schriml; Peter J. Tonellato; Patricia L. Whetzel; Jun Zhao; Susie Stephens; Michel Dumontier
Background: Translational medicine requires the integration of knowledge using heterogeneous data from health care to the life sciences. Here, we describe a collaborative effort to produce a prototype Translational Medicine Knowledge Base (TMKB) capable of answering questions relating to clinical practice and pharmaceutical drug discovery. Results: We developed the Translational Medicine Ontology (TMO) as a unifying ontology to integrate chemical, genomic and proteomic data with disease, treatment, and electronic health records. We demonstrate the use of Semantic Web technologies in the integration of patient and biomedical data, and reveal how such a knowledge base can aid physicians in providing tailored patient care and facilitate the recruitment of patients into active clinical trials. Thus, patients, physicians and researchers may explore the knowledge base to better understand therapeutic options, efficacy, and mechanisms of action. Conclusions: This work takes an important step in using Semantic Web technologies to facilitate integration of relevant, distributed, external sources and progress towards a computational platform to support personalized medicine. Availability: TMO can be downloaded from http://code.google.com/p/translationalmedicineontology and TMKB can be accessed at http://tm.semanticscience.org/sparql.
Journal of Biomedical Informatics | 2011
Christopher J. Mungall; Colin R. Batchelor; Karen Eilbeck
The Sequence Ontology is an established ontology, with a large user community, for the purpose of genomic annotation. We are reforming the ontology to provide better terms and relationships to describe the features of biological sequence, for both genomic and derived sequence. The SO is working within the guidelines of the OBO Foundry to provide interoperability between SO and the other related OBO ontologies. Here, we report changes and improvements made to SO including new relationships to better define the mereological, spatial and temporal aspects of biological sequence.
meeting of the association for computational linguistics | 2007
Peter T. Corbett; Colin R. Batchelor; Simone Teufel
We describe the annotation of chemical named entities in scientific text. A set of annotation guidelines defines 5 types of named entities, and provides instructions for the resolution of special cases. A corpus of fulltext chemistry papers was annotated, with an inter-annotator agreement F score of 93%. An investigation of named entity recognition using LingPipe suggests that F scores of 63% are possible without customisation, and scores of 74% are possible with the addition of custom tokenisation and the use of dictionaries.
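For readers unfamiliar with the metric, the F score quoted above is the harmonic mean of precision and recall. The sketch below computes it from counts of matching and non-matching annotations; the function name and the counts are hypothetical, chosen only so that the result comes out at 0.93, and are not the paper's data. When the score is used for inter-annotator agreement, one annotator is assumed to be treated as the reference.

```python
def f_score(true_positives, false_positives, false_negatives):
    """Balanced F score (F1): harmonic mean of precision and recall."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    if predicted == 0 or actual == 0:
        return 0.0
    precision = true_positives / predicted
    recall = true_positives / actual
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 93 entities marked by both annotators,
# 7 marked only by the first, 7 marked only by the second.
agreement = f_score(93, 7, 7)
```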
Applied Ontology | 2011
Robert Hoehndorf; Colin R. Batchelor; Thomas Bittner; Michel Dumontier; Karen Eilbeck; Rob Knight; Christopher J. Mungall; Jane S. Richardson; Jesse Stombaugh; Eric Westhof; Craig L. Zirbel; Neocles B. Leontis
Biomedical ontologies integrate diverse biomedical data, enable intelligent data-mining and help translate basic research into useful clinical knowledge. We present the RNA Ontology (RNAO), an ontology for integrating diverse RNA data, including RNA sequences and sequence alignments, three-dimensional structures, and biochemical and functional data. For example, individual atomic resolution RNA structures have broader significance as representatives of classes of homologous molecules, which can differ significantly in sequence while sharing core structural features and common roles or functions. Thus, structural data gain value by being linked to homologous sequences in genomic data and databases of sequence alignments. Likewise, the value of genomic data is enhanced by annotation of shared structural features, especially when these can be linked to specific functions. Moreover, the significance of biochemical, functional and mutational analyses of RNA molecules is most fully understood when linked to molecular structures and phylogenies. To achieve these goals, RNAO provides logically rigorous definitions of the components of RNA primary, secondary and tertiary structure and the relations between these entities. RNAO is being developed to comply with the developing standards of the Open Biomedical Ontologies (OBO) Consortium. The RNAO can be accessed at http://code.google.com/p/rnao/.
Journal of Cheminformatics | 2015
Gang Fu; Colin R. Batchelor; Michel Dumontier; Janna Hastings; Egon Willighagen; Evan Bolton
Background: PubChem is an open repository for chemical structures, biological activities and biomedical annotations. Semantic Web technologies are emerging as an increasingly important approach to distribute and integrate scientific data. Exposing PubChem data to Semantic Web services may help enable automated data integration and management, as well as facilitate interoperable web applications. Description: This work, one of a series covering the PubChemRDF project, describes an approach to translate PubChem Substance and Compound information into Resource Description Framework (RDF) format. Basic examples are provided to demonstrate its use. The aim of this effort is to provide two new primary benefits to researchers in a cost-effective manner. Firstly, we aim to remove the inherent limitations of using the web-based resource PubChem by allowing a researcher to use readily available semantic technologies (namely, RDF triple stores and their corresponding SPARQL query engines) to query and analyze PubChem data on local computing resources. Secondly, this work intends to help improve data sharing, analysis, and integration of PubChem data to resources external to NCBI and across scientific domains, by means of the association of PubChem data to existing ontological frameworks, including CHEMical INFormation ontology, Semanticscience Integrated Ontology, and others. Conclusions: With the goal of semantically describing information available in the PubChem archive, pre-existing ontological frameworks were used, rather than creating new ones. Semantic relationships between compounds and substances, chemical descriptors associated with compounds and substances, interrelationships between chemicals, as well as provenance and attribute metadata of substances are described.
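To make the idea of querying RDF triples concrete, here is a minimal in-memory sketch of a triple store with wildcard pattern matching, which is the basic operation a SPARQL engine performs on a single triple pattern. Everything in it is an assumption for illustration: the `ex:` identifiers are invented and are not PubChemRDF vocabulary, and a real deployment would use an actual triple store and SPARQL rather than this toy function.

```python
# Each fact is a (subject, predicate, object) triple.
# All "ex:" names are hypothetical and not part of PubChemRDF.
triples = [
    ("ex:substance1", "ex:hasCompound", "ex:compound1"),
    ("ex:substance2", "ex:hasCompound", "ex:compound1"),
    ("ex:compound1", "ex:hasDescriptor", "ex:descriptor1"),
    ("ex:descriptor1", "ex:hasValue", "C2H6O"),
]


def match(store, subject=None, predicate=None, obj=None):
    """Return every triple that fits the pattern; None acts as a wildcard,
    much like an unbound variable in a SPARQL triple pattern."""
    return [
        (s, p, o)
        for (s, p, o) in store
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]


# Which substances point at compound1?
substances = [s for (s, _, _) in match(triples,
                                       predicate="ex:hasCompound",
                                       obj="ex:compound1")]
```

Because the pattern is evaluated against a local list, the same query runs without any call to a web service, which is the benefit the abstract describes for local triple stores.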
BMC Genomics | 2013
David P. Hill; Nico Adams; Mike Bada; Colin R. Batchelor; Tanya Z. Berardini; Heiko Dietze; Harold J. Drabkin; Marcus Ennis; Rebecca E. Foulger; Midori A. Harris; Janna Hastings; Namrata Kale; Paula de Matos; Christopher J. Mungall; Gareth Owen; Paola Roncaglia; Christoph Steinbeck; Steve Turner; Jane Lomax
Background: The Gene Ontology (GO) facilitates the description of the action of gene products in a biological context. Many GO terms refer to chemical entities that participate in biological processes. To facilitate accurate and consistent systems-wide biological representation, it is necessary to integrate the chemical view of these entities with the biological view of GO functions and processes. We describe a collaborative effort between the GO and the Chemical Entities of Biological Interest (ChEBI) ontology developers to ensure that the representation of chemicals in the GO is both internally consistent and in alignment with the chemical expertise captured in ChEBI. Results: We have examined and integrated the ChEBI structural hierarchy into the GO resource through computationally-assisted manual curation of both GO and ChEBI. Our work has resulted in the creation of computable definitions of GO terms that contain fully defined semantic relationships to corresponding chemical terms in ChEBI. Conclusions: The set of logical definitions using both the GO and ChEBI has already been used to automate aspects of GO development and has the potential to allow the integration of data across the domains of biology and chemistry. These logical definitions are available as an extended version of the ontology from http://purl.obolibrary.org/obo/go/extensions/go-plus.owl.
meeting of the association for computational linguistics | 2007
Colin R. Batchelor; Peter T. Corbett
We describe the semantic enrichment of journal articles with chemical structures and biomedical ontology terms using Oscar, a program for chemical named entity recognition (NER). We describe how Oscar works and how it can be adapted for general NER. We discuss its implementation in a real publishing workflow and possible applications for enriched articles.