Mike Conway

National Institute of Informatics

Publication

Featured research published by Mike Conway.

Bioinformatics | 2008

BioCaster: detecting public health rumors with a Web-based text mining system

Nigel Collier; Son Doan; Ai Kawazoe; Reiko Matsuda Goodwin; Mike Conway; Yoshio Tateno; Quoc Hung Ngo; Dinh Dien; Asanee Kawtrakul; Koichi Takeuchi; Mika Shigematsu; Kiyosu Taniguchi

Summary: BioCaster is an ontology-based text mining system for detecting and tracking the distribution of infectious disease outbreaks from linguistic signals on the Web. The system continuously analyzes documents reported from over 1700 RSS feeds, classifies them for topical relevance and plots them onto a Google map using geocoded information. The background knowledge for bridging the gap between layman's terms and formal coding systems is contained in the freely available BioCaster ontology, which includes information in eight languages focused on the epidemiological role of pathogens as well as geographical locations with their latitudes/longitudes. The system consists of four main stages: topic classification, named entity recognition (NER), disease/location detection and event recognition. Higher order event analysis is used to detect more precisely specified warning signals, which can then be sent to registered users via email alerts. Evaluation of the system for topic recognition and entity identification is conducted on a gold standard corpus of annotated news articles. Availability: The BioCaster map and ontology are freely available via a web portal at http://www.biocaster.org.

International Journal of Medical Informatics | 2009

Classifying disease outbreak reports using n-grams and semantic features.

Mike Conway; Son Doan; Ai Kawazoe; Nigel Collier

INTRODUCTION: This paper explores the benefits of using n-grams and semantic features for the classification of disease outbreak reports, in the context of the BioCaster disease outbreak report text mining system. A novel feature of this work is the use of a general purpose semantic tagger (the USAS tagger) to generate features. BACKGROUND: We outline the application context for this work (the BioCaster epidemiological text mining system), before going on to describe the experimental data used in our classification experiments (the 1000 document BioCaster corpus). FEATURE SETS: Three broad groups of features are used in this work: Named Entity based features, n-gram features, and features derived from the USAS semantic tagger. METHODOLOGY: Three standard machine learning algorithms (Naïve Bayes, the Support Vector Machine algorithm, and the C4.5 decision tree algorithm) were used for classifying experimental data (that is, the BioCaster corpus). Feature selection was performed using the chi-squared (χ²) feature selection algorithm. Standard text classification performance metrics (Accuracy, Precision, Recall, Specificity and F-score) are reported. RESULTS: A feature representation composed of unigrams, bigrams, trigrams and features derived from a semantic tagger, in conjunction with the Naïve Bayes algorithm and feature selection, yielded the highest classification accuracy (and F-score). This result was statistically significant compared to a baseline unigram representation and to previous work on the same task. However, it was feature selection rather than semantic tagging that contributed most to the improved performance. CONCLUSION: This study has shown that for the classification of disease outbreak reports, a combination of bag-of-words, n-grams and semantic features, in conjunction with feature selection, increases classification accuracy at a statistically significant level compared to previous work in this domain.
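As a rough illustration of the n-gram and chi-squared steps named in this abstract, the sketch below extracts unigram, bigram and trigram presence features and ranks them with the standard 2x2 chi-squared statistic. It is not the authors' implementation: the function names (`ngrams`, `chi2_score`, `select_features`) and the four toy documents are hypothetical, and the semantic tagger features and the classifiers themselves are left out.

```python
from collections import Counter

def ngrams(tokens, n_max=3):
    """Return the set of all 1..n_max-grams in a token list (binary presence features)."""
    feats = set()
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            feats.add(" ".join(tokens[i:i + n]))
    return feats

def chi2_score(a, b, c, d):
    """Chi-squared statistic for a 2x2 contingency table.

    a: relevant docs containing the feature, b: non-relevant docs containing it,
    c: relevant docs without it,            d: non-relevant docs without it.
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom

def select_features(docs, labels, k):
    """Rank n-gram features by chi-squared score and keep the top k."""
    doc_feats = [ngrams(d.lower().split()) for d in docs]
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    pos_counts, neg_counts = Counter(), Counter()
    for feats, y in zip(doc_feats, labels):
        (pos_counts if y else neg_counts).update(feats)
    scores = {}
    for f in set(pos_counts) | set(neg_counts):
        a, b = pos_counts[f], neg_counts[f]
        scores[f] = chi2_score(a, b, n_pos - a, n_neg - b)
    # Sort by descending score, breaking ties alphabetically for determinism.
    return sorted(scores, key=lambda f: (-scores[f], f))[:k]

# Hypothetical toy corpus: 1 = outbreak report, 0 = other news.
docs = [
    "cholera outbreak reported in the province",
    "new outbreak of avian influenza confirmed",
    "football team wins the cup",
    "the market closed higher today",
]
labels = [1, 1, 0, 0]
top = select_features(docs, labels, 1)
```

On this toy corpus the single highest-scoring feature is the unigram "outbreak", the only feature that occurs in both outbreak reports and in neither of the other documents. The selected features would then serve as the input representation for a classifier such as Naïve Bayes.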

Journal of the American Medical Informatics Association | 2010

Developing syndrome definitions based on consensus and current use.

Wendy W. Chapman; John N. Dowling; Atar Baer; David L. Buckeridge; Dennis Cochrane; Mike Conway; Peter L. Elkin; Jeremy U. Espino; J. E. Gunn; Craig M. Hales; Lori Hutwagner; Mikaela Keller; Catherine A. Larson; Rebecca S. Noe; Anya Okhmatovskaia; Karen L. Olson; Marc Paladini; Matthew J. Scholer; Carol Sniegoski; David A. Thompson; Bill Lober

OBJECTIVE: Standardized surveillance syndromes do not exist but would facilitate sharing data among surveillance systems and comparing the accuracy of existing systems. The objective of this study was to create reference syndrome definitions from a consensus of investigators who currently have or are building syndromic surveillance systems. DESIGN: Clinical condition-syndrome pairs were catalogued for 10 surveillance systems across the United States and the representatives of these systems were brought together for a workshop to discuss consensus syndrome definitions. RESULTS: Consensus syndrome definitions were generated for the four syndromes monitored by the majority of the 10 participating surveillance systems: respiratory, gastrointestinal, constitutional, and influenza-like illness (ILI). An important element in coming to consensus quickly was the development of a sensitive and specific definition for respiratory and gastrointestinal syndromes. After the workshop, the definitions were refined and supplemented with keywords and regular expressions, the keywords were mapped to standard vocabularies, and a web ontology language (OWL) ontology was created. LIMITATIONS: The consensus definitions have not yet been validated through implementation. CONCLUSION: The consensus definitions provide an explicit description of the current state-of-the-art syndromes used in automated surveillance, which can subsequently be systematically evaluated against real data to improve the definitions. The method for creating consensus definitions could be applied to other domains that have diverse existing definitions.

Journal of Biomedical Informatics | 2009

Towards role-based filtering of disease outbreak reports

Son Doan; Ai Kawazoe; Mike Conway; Nigel Collier

This paper explores the role of named entities (NEs) in the classification of disease outbreak reports. In the annotation schema of BioCaster, a text mining system for public health protection, important concepts that reflect information about infectious diseases were conceptually analyzed with a formal ontological methodology and classified into types and roles. Types are specified as NE classes, and roles are integrated into NEs as attributes, for example whether a chemical is being used as a therapy for some infectious disease. We focus on the roles of NEs and explore different ways to extract, combine and use them as features in a text classifier. In addition, we investigate the combination of roles with semantic categories of disease-related nouns and verbs. Experimental results using naïve Bayes and Support Vector Machine (SVM) algorithms show that: (1) roles in combination with NEs improve performance in text classification, (2) roles in combination with semantic categories of noun and verb features contribute substantially to the improvement of text classification. Both of these results were statistically significant compared to the baseline raw text representation. We discuss in detail the effects of roles on each NE and on semantic categories of noun and verb features in terms of accuracy, precision/recall and F-score measures for the text classification task.

Literary and Linguistic Computing | 2010

Mining a corpus of biographical texts using keywords

Mike Conway

Using statistically derived keywords to characterize texts has become an important research method for digital humanists and corpus linguists in areas such as literary analysis and the exploration of genre difference. Keywords (and the associated concepts of keyness and key-keyness) have inspired conferences and workshops, many and varied research papers, and are central to several modern corpus processing tools. In this article, we present evidence that (at least for the task of biographical sentence classification) frequent words characterize texts better than keywords or key-keywords. Using the naive Bayes learning algorithm in conjunction with frequency-, keyword-, and key-keyword-based text representation to classify a corpus of biographical sentences, we discovered that the use of frequent words alone provided a classification accuracy better than either the keyword or key-keyword representations at a statistically significant level. This result suggests that (for the biographical sentence classification task at least) frequent words characterize texts better than keywords derived using more computationally intensive methods.

North American Chapter of the Association for Computational Linguistics | 2009

Using Hedges to Enhance a Disease Outbreak Report Text Mining System

Mike Conway; Son Doan; Nigel Collier

Identifying serious infectious disease outbreaks in their early stages is an important task, both for national governments and international organizations like the World Health Organization. Text mining and information extraction systems can provide an important, low-cost and timely early warning system in these circumstances by identifying the first signs of an outbreak automatically from online textual news. One interesting characteristic of disease outbreak reports (which to the best of our knowledge has not been studied before) is their use of speculative language (hedging) to describe uncertain situations. This paper describes two uses of hedging to enhance the BioCaster disease outbreak report text mining system.

Journal of Medical Internet Research | 2010

Developing a disease outbreak event corpus.

Mike Conway; Ai Kawazoe; Hutchatai Chanlekha; Nigel Collier

Background: In recent years, there has been a growth in work on the use of information extraction technologies for tracking disease outbreaks from online news texts, yet publicly available evaluation standards (and associated resources) for this new area of research have been noticeably lacking. Objective: This study seeks to create a “gold standard” data set against which to test how accurately disease outbreak information extraction systems can identify the semantics of disease outbreak events. Additionally, we hope that the provision of an annotation scheme (and associated corpus) to the community will encourage open evaluation in this new and growing application area. Methods: We developed an annotation scheme for identifying infectious disease outbreak events in news texts. An event, in the context of our annotation scheme, consists minimally of geographical (e.g., country and province) and disease name information. However, the scheme also allows for the rich encoding of other domain-salient concepts (e.g., international travel, species, and food contamination). Results: The work resulted in a 200-document corpus of event-annotated disease outbreak reports that can be used to evaluate the accuracy of event detection algorithms (in this case, for the BioCaster biosurveillance online news information extraction system). In the 200 documents, 394 distinct events were identified (mean 1.97 events per document, range 0-25 events per document). We also provide a download script and graphical user interface (GUI)-based event browsing software to facilitate corpus exploration. Conclusion: In summary, we present an annotation scheme and corpus that can be used in the evaluation of disease outbreak event extraction algorithms. The annotation scheme and corpus were designed with both the particular evaluation requirements of the BioCaster system and the wider need for further evaluation resources in this growing research area in mind.

International Health Informatics Symposium | 2010

Leveraging the semantic web and natural language processing to enhance drug-mechanism knowledge in drug product labels

Richard D. Boyce; Henk Harkema; Mike Conway

Multiple studies indicate that drug-drug interactions (DDIs) are a significant source of preventable adverse drug events (ADEs). Factors contributing to the occurrence of preventable ADEs resulting from DDIs include a lack of knowledge of the patient's concurrent medications and inaccurate or inadequate knowledge of interactions by health care providers. FDA-approved drug product labeling is a major source of information intended to help clinicians prescribe drugs in a safe and effective manner. Unfortunately, drug product labeling has been identified as often lagging behind emerging drug knowledge, especially when it has been several years since a drug was released to the market. In this paper we report on a novel approach that explores employing Semantic Web technology and natural language processing to identify drug mechanism information that may update or expand upon statements present in product labeling.

International Conference on Computational Linguistics | 2010