Publication


Featured research published by John McNaught.


Panhellenic Conference on Informatics | 2005

Developing a robust part-of-speech tagger for biomedical text

Yoshimasa Tsuruoka; Yuka Tateishi; Jin-Dong Kim; Tomoko Ohta; John McNaught; Sophia Ananiadou; Jun’ichi Tsujii

This paper presents a part-of-speech tagger which is specifically tuned for biomedical text. We have built the tagger with maximum entropy modeling and a state-of-the-art tagging algorithm. The tagger was trained on a corpus containing newspaper articles and biomedical documents so that it would work well on various types of biomedical text. Experimental results on the Wall Street Journal corpus, the GENIA corpus, and the PennBioIE corpus revealed that adding training data from a different domain does not hurt the performance of a tagger, and our tagger exhibits very good precision (97% to 98%) on all these corpora. We also evaluated the robustness of the tagger using recent MEDLINE articles.


BMC Bioinformatics | 2009

Construction of an annotated corpus to support biomedical information extraction

Paul Thompson; Syed Amir Iqbal; John McNaught; Sophia Ananiadou

Background: Information Extraction (IE) is a component of text mining that facilitates knowledge discovery by automatically locating instances of interesting biomedical events from huge document collections. As events are usually centred on verbs and nominalised verbs, understanding the syntactic and semantic behaviour of these words is highly important. Corpora annotated with information concerning this behaviour can constitute a valuable resource in the training of IE components and resources.
Results: We have defined a new scheme for annotating sentence-bound gene regulation events, centred on both verbs and nominalised verbs. For each event instance, all participants (arguments) in the same sentence are identified and assigned a semantic role from a rich set of 13 roles tailored to biomedical research articles, together with a biological concept type linked to the Gene Regulation Ontology. To our knowledge, our scheme is unique within the biomedical field in terms of the range of event arguments identified. Using the scheme, we have created the Gene Regulation Event Corpus (GREC), consisting of 240 MEDLINE abstracts, in which events relating to gene regulation and expression have been annotated by biologists. A novel method of evaluating various different facets of the annotation task showed that average inter-annotator agreement rates fall within the range of 66% to 90%.
Conclusion: The GREC is a unique resource within the biomedical field, in that it annotates not only core relationships between entities, but also a range of other important details about these relationships, e.g., location, temporal, manner and environmental conditions. As such, it is specifically designed to support bio-specific tool and resource development. It has already been used to acquire semantic frames for inclusion within the BioLexicon (a lexical, terminological resource to aid biomedical text mining). Initial experiments have also shown that the corpus may viably be used to train IE components, such as semantic role labellers. The corpus and annotation guidelines are freely available for academic purposes.


Systematic Reviews | 2015

Using text mining for study identification in systematic reviews: a systematic review of current approaches

Alison O’Mara-Eves; James Thomas; John McNaught; Makoto Miwa; Sophia Ananiadou

Background: The large and growing number of published studies, and their increasing rate of publication, makes the task of identifying relevant studies in an unbiased way for inclusion in systematic reviews both complex and time consuming. Text mining has been offered as a potential solution: through automating some of the screening process, reviewer time can be saved. The evidence base around the use of text mining for screening has not yet been pulled together systematically; this systematic review fills that research gap. Focusing mainly on non-technical issues, the review aims to increase awareness of the potential of these technologies and promote further collaborative research between the computer science and systematic review communities.
Methods: Five research questions led our review: what is the state of the evidence base; how has workload reduction been evaluated; what are the purposes of semi-automation and how effective are they; how have key contextual problems of applying text mining to the systematic review field been addressed; and what challenges to implementation have emerged? We answered these questions using standard systematic review methods: systematic and exhaustive searching, quality-assured data extraction and a narrative synthesis to synthesise findings.
Results: The evidence base is active and diverse; there is almost no replication between studies or collaboration between research teams and, whilst it is difficult to establish any overall conclusions about best approaches, it is clear that efficiencies and reductions in workload are potentially achievable. On the whole, most studies suggested that a saving in workload of between 30% and 70% might be possible, though sometimes the saving in workload is accompanied by the loss of 5% of relevant studies (i.e. a 95% recall).
Conclusions: Using text mining to prioritise the order in which items are screened should be considered safe and ready for use in ‘live’ reviews. Text mining may also be used cautiously as a ‘second screener’. The use of text mining to eliminate studies automatically should be considered promising, but not yet fully proven. In highly technical/clinical areas, it may be used with a high degree of confidence; but more developmental and evaluative work is needed in other disciplines.


Bioinformatics | 2007

Learning string similarity measures for gene/protein name dictionary look-up using logistic regression

Yoshimasa Tsuruoka; John McNaught; Jun’ichi Tsujii; Sophia Ananiadou

Motivation: One of the bottlenecks of biomedical data integration is variation of terms. Exact string matching often fails to associate a name with its biological concept, i.e. ID or accession number in the database, due to seemingly small differences of names. Soft string matching potentially enables us to find the relevant ID by considering the similarity between the names. However, the accuracy of soft matching highly depends on the similarity measure employed.
Results: We used logistic regression for learning a string similarity measure from a dictionary. Experiments using several large-scale gene/protein name dictionaries showed that the logistic regression-based similarity measure outperforms existing similarity measures in dictionary look-up tasks.
Availability: A dictionary look-up system using the similarity measures described in this article is available at http://text0.mib.man.ac.uk/software/mldic/.
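
To make the idea of a learned similarity measure concrete, here is a minimal sketch in Python. It is not the authors' system: the two features (character-bigram Dice overlap and relative length difference), the toy training pairs, and the plain stochastic gradient training loop are all assumptions chosen for illustration. The abstract does not say which features or optimiser were used.

```python
import math

def bigrams(s):
    """Set of lowercase character bigrams of a string."""
    s = s.lower()
    return {s[i:i + 2] for i in range(len(s) - 1)}

def features(a, b):
    """Hypothetical feature vector for a name pair (not the paper's features)."""
    ba, bb = bigrams(a), bigrams(b)
    dice = 2 * len(ba & bb) / (len(ba) + len(bb)) if ba and bb else 0.0
    len_diff = abs(len(a) - len(b)) / max(len(a), len(b))
    return [1.0, dice, len_diff]  # the leading 1.0 acts as the bias term

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(pairs, epochs=2000, lr=0.5):
    """Fit logistic regression weights with simple stochastic gradient ascent."""
    w = [0.0, 0.0, 0.0]
    for _ in range(epochs):
        for a, b, y in pairs:
            x = features(a, b)
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
            w = [wi + lr * (y - p) * xi for wi, xi in zip(w, x)]
    return w

def similarity(w, a, b):
    """Learned similarity: modelled probability that a and b name the same concept."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, features(a, b))))

# Toy labelled pairs (1 = same concept, 0 = different); purely illustrative.
TOY_PAIRS = [
    ("interleukin-2", "interleukin 2", 1),
    ("tumor necrosis factor", "tumour necrosis factor", 1),
    ("NF-kappaB", "NF kappa B", 1),
    ("interleukin-2", "insulin", 0),
    ("tumor necrosis factor", "p53", 0),
    ("NF-kappaB", "hemoglobin", 0),
]
WEIGHTS = train(TOY_PAIRS)
```

In a dictionary look-up setting, a query name would be scored against candidate entries with `similarity` and the highest-scoring entry's ID returned. A realistic system would need far more training pairs and richer features than this sketch.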


Research Synthesis Methods | 2011

Applications of text mining within systematic reviews

James Thomas; John McNaught; Sophia Ananiadou

Systematic reviews are a widely accepted research method. However, it is increasingly difficult to conduct them to fit with policy and practice timescales, particularly in areas which do not have well indexed, comprehensive bibliographic databases. Text mining technologies offer one possible way forward in reducing the amount of time systematic reviews take to conduct. They can facilitate the identification of relevant literature, its rapid description or categorization, and its summarization. In this paper, we describe the application of four text mining technologies, namely, automatic term recognition, document clustering, classification and summarization, which support the identification of relevant studies in systematic reviews. The contributions of text mining technologies to improve reviewing efficiency are considered and their strengths and weaknesses explored. We conclude that these technologies do have the potential to assist at various stages of the review process. However, they are relatively unknown in the systematic reviewing community, and substantial evaluation and methods development are required before their possible impact can be fully assessed.


BMC Bioinformatics | 2011

Enriching a biomedical event corpus with meta-knowledge annotation

Paul Thompson; Raheel Nawaz; John McNaught; Sophia Ananiadou

Background: Biomedical papers contain rich information about entities, facts and events of biological relevance. To discover these automatically, we use text mining techniques, which rely on annotated corpora for training. In order to extract protein-protein interactions, genotype-phenotype/gene-disease associations, etc., we rely on event corpora that are annotated with classified, structured representations of important facts and findings contained within text. These provide an important resource for the training of domain-specific information extraction (IE) systems, to facilitate semantic-based searching of documents. Correct interpretation of these events is not possible without additional information, e.g., does an event describe a fact, a hypothesis, an experimental result or an analysis of results? How confident is the author about the validity of her analyses? These and other types of information, which we collectively term meta-knowledge, can be derived from the context of the event.
Results: We have designed an annotation scheme for meta-knowledge enrichment of biomedical event corpora. The scheme is multi-dimensional, in that each event is annotated for 5 different aspects of meta-knowledge that can be derived from the textual context of the event. Textual clues used to determine the values are also annotated. The scheme is intended to be general enough to allow integration with different types of bio-event annotation, whilst being detailed enough to capture important subtleties in the nature of the meta-knowledge expressed in the text. We report here on both the main features of the annotation scheme and its application to the GENIA event corpus (1000 abstracts with 36,858 events). High levels of inter-annotator agreement have been achieved, falling in the range of 0.84-0.93 Kappa.
Conclusion: By augmenting event annotations with meta-knowledge, more sophisticated IE systems can be trained, which allow interpretative information to be specified as part of the search criteria. This can assist in a number of important tasks, e.g., finding new experimental knowledge to facilitate database curation, enabling textual inference to detect entailments and contradictions, etc. To our knowledge, our scheme is unique within the field with regard to the diversity of meta-knowledge aspects annotated for each event.
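
For readers unfamiliar with the Kappa statistic quoted in this abstract, the following Python sketch shows how a chance-corrected agreement score can be computed. It assumes Cohen's kappa for two annotators; the abstract does not state which Kappa variant was used, and the label values in the usage example are illustrative only.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance from each annotator's
    label distribution.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        # Both annotators used one identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)

# Illustrative labels only; not taken from the corpus.
annotator_1 = ["Fact"] * 5 + ["Hypothesis"] * 5
annotator_2 = ["Fact"] * 4 + ["Hypothesis"] * 6
kappa = cohens_kappa(annotator_1, annotator_2)
```

Here the annotators agree on 9 of 10 items (observed agreement 0.9) while chance agreement is 0.5, so kappa is approximately 0.8.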


Languages in Biology and Medicine | 2008

Normalizing biomedical terms by minimizing ambiguity and variability

Yoshimasa Tsuruoka; John McNaught; Sophia Ananiadou

Background: One of the difficulties in mapping biomedical named entities, e.g. genes, proteins, chemicals and diseases, to their concept identifiers stems from the potential variability of the terms. Soft string matching is a possible solution to the problem, but its inherent heavy computational cost discourages its use when the dictionaries are large or when real time processing is required. A less computationally demanding approach is to normalize the terms by using heuristic rules, which enables us to look up a dictionary in constant time regardless of its size. The development of good heuristic rules, however, requires extensive knowledge of the terminology in question and thus is the bottleneck of the normalization approach.
Results: We present a novel framework for discovering a list of normalization rules from a dictionary in a fully automated manner. The rules are discovered in such a way that they minimize the ambiguity and variability of the terms in the dictionary. We evaluated our algorithm using two large dictionaries: a human gene/protein name dictionary built from BioThesaurus and a disease name dictionary built from UMLS.
Conclusions: The experimental results showed that automatically discovered rules can perform comparably to carefully crafted heuristic rules in term mapping tasks, and the computational overhead of rule application is small enough that a very fast implementation is possible. This work will help improve the performance of term-concept mapping tasks in biomedical information extraction, especially when good normalization heuristics for the target terminology are not fully known.
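
The normalization approach described in this abstract can be sketched as follows in Python. The rules below are hand-written placeholders (lowercasing, treating hyphens, underscores and slashes as spaces, collapsing whitespace), whereas the paper's contribution is discovering such rules automatically. The dictionary entries and concept IDs are hypothetical.

```python
import re

# Hand-written example rules; the paper discovers its rules automatically.
RULES = [
    (re.compile(r"[-_/]"), " "),   # treat common separators as spaces
    (re.compile(r"\s+"), " "),     # collapse runs of whitespace
]

def normalize(term):
    """Apply each rewrite rule in order to a lowercased term."""
    term = term.lower()
    for pattern, replacement in RULES:
        term = pattern.sub(replacement, term)
    return term.strip()

def build_index(dictionary):
    """Map each normalized term to the set of concept IDs it may denote."""
    index = {}
    for term, concept_id in dictionary.items():
        index.setdefault(normalize(term), set()).add(concept_id)
    return index

def look_up(index, term):
    """Hash look-up on the normalized form; average cost is independent of dictionary size."""
    return index.get(normalize(term), set())

# Hypothetical dictionary entries and IDs, for illustration only.
TOY_DICTIONARY = {
    "Interleukin-2": "GENE:0001",
    "IL-2": "GENE:0001",
    "Tumor necrosis factor": "GENE:0002",
}
INDEX = build_index(TOY_DICTIONARY)
```

With this index, `look_up(INDEX, "interleukin 2")` and `look_up(INDEX, "IL_2")` both resolve to the same concept ID. A normalized key that maps to several IDs signals ambiguity, and a concept reachable only through many distinct keys signals residual variability, the two quantities the abstract says the discovered rules aim to minimize.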


BMC Bioinformatics | 2012

Extracting semantically enriched events from biomedical literature

Makoto Miwa; Paul Thompson; John McNaught; Douglas B. Kell; Sophia Ananiadou

Background: Research into event-based text mining from the biomedical literature has been growing in popularity to facilitate the development of advanced biomedical text mining systems. Such technology permits advanced search, which goes beyond document or sentence-based retrieval. However, existing event-based systems typically ignore additional information within the textual context of events that can determine, amongst other things, whether an event represents a fact, hypothesis, experimental result or analysis of results, whether it describes new or previously reported knowledge, and whether it is speculated or negated. We refer to such contextual information as meta-knowledge. The automatic recognition of such information can permit the training of systems allowing finer-grained searching of events according to the meta-knowledge that is associated with them.
Results: Based on a corpus of 1,000 MEDLINE abstracts, fully manually annotated with both events and associated meta-knowledge, we have constructed a machine learning-based system that automatically assigns meta-knowledge information to events. This system has been integrated into EventMine, a state-of-the-art event extraction system, in order to create a more advanced system (EventMine-MK) that not only extracts events from text automatically, but also assigns five different types of meta-knowledge to these events. The meta-knowledge assignment module of EventMine-MK performs with macro-averaged F-scores in the range of 57-87% on the BioNLP’09 Shared Task corpus. EventMine-MK has been evaluated on the BioNLP’09 Shared Task subtask of detecting negated and speculated events. Our results show that EventMine-MK can outperform other state-of-the-art systems that participated in this task.
Conclusions: We have constructed the first practical system that extracts both events and associated, detailed meta-knowledge information from biomedical literature. The automatically assigned meta-knowledge information can be used to refine search systems, in order to provide an extra search layer beyond entities and assertions, dealing with phenomena such as rhetorical intent, speculations, contradictions and negations. This finer-grained search functionality can assist in several important tasks, e.g., database curation (by locating new experimental knowledge) and pathway enrichment (by providing information for inference). To allow easy integration into text mining systems, EventMine-MK is provided as a UIMA component that can be used in the interoperable text mining infrastructure, U-Compare.


International Conference on Computational Linguistics | 2004

Enhancing automatic term recognition through recognition of variation

Goran Nenadić; Sophia Ananiadou; John McNaught

Terminological variation is an integral part of the linguistic ability to realise a concept in many ways, but it is typically considered an obstacle to automatic term recognition (ATR) and term management. We present a method that integrates term variation in a hybrid ATR approach, in which term candidates are recognised by a set of linguistic filters and termhood assignment is based on joint frequency of occurrence of all term variants. We evaluate the effectiveness of incorporating specific types of term variation by comparing it to the performance of a baseline method that treats term variants as separate terms. We show that ATR precision is enhanced by considering joint termhoods of all term variants, while recall benefits by the introduction of new candidates through consideration of different variation types. On a biomedical test corpus we show that precision can be increased by 20-70% for the top ranked terms, while recall improves generally by 2-25%.


BMC Bioinformatics | 2008

How to Make the Most of NE Dictionaries in Statistical NER

Yutaka Sasaki; Yoshimasa Tsuruoka; John McNaught; Sophia Ananiadou

Background: When term ambiguity and variability are very high, dictionary-based Named Entity Recognition (NER) is not an ideal solution even though large-scale terminological resources are available. Many studies on statistical NER have tried to cope with these problems. However, it is not straightforward how to exploit existing and additional Named Entity (NE) dictionaries in statistical NER. Presumably, addition of NEs to an NE dictionary leads to better performance. However, in reality, the retraining of NER models is required to achieve this. We chose protein name recognition as a case study because it suffers most from the problems related to heavy term variation and ambiguity.
Methods: We have established a novel way to improve the NER performance by adding NEs to an NE dictionary without retraining. In our approach, first, known NEs are identified in parallel with Part-of-Speech (POS) tagging based on a general word dictionary and an NE dictionary. Then, statistical NER is trained on the POS/PROTEIN tagger outputs with correct NE labels attached.
Results: We evaluated performance of our NER on the standard JNLPBA-2004 data set. The F-score on the test set has been improved from 73.14 to 73.78 after adding protein names appearing in the training data to the POS tagger dictionary without any model retraining. The performance further increased to 78.72 after enriching the tagging dictionary with test set protein names.
Conclusion: Our approach has demonstrated high performance in protein name recognition, which indicates how to make the most of known NEs in statistical NER.

Collaboration


Dive into John McNaught's collaborations.

Top Co-Authors

Paul Thompson

University of Manchester

Gerardo Sierra

National Autonomous University of Mexico

Iain Buchan

University of Manchester

Sarah Thew

University of Manchester

Yutaka Sasaki

University of Manchester
