Berry de Bruijn
National Research Council
Publications
Featured research published by Berry de Bruijn.
BMC Bioinformatics | 2003
Ian M. Donaldson; Joel D. Martin; Berry de Bruijn; Cheryl Wolting; Vicki Lay; Brigitte Tuekam; Shudong Zhang; Berivan Baskin; Gary D. Bader; Katerina Michalickova; Tony Pawson; Christopher W. V. Hogue
Background The majority of experimentally verified molecular interaction and biological pathway data are present in the unstructured text of biomedical journal articles, where they are inaccessible to computational methods. The Biomolecular Interaction Network Database (BIND) seeks to capture these data in a machine-readable format. We hypothesized that the formidable size of the task of backfilling the database could be reduced by using support vector machine technology to first locate interaction information in the literature. We present an information extraction system that was designed to locate protein-protein interaction data in the literature and present these data to curators and the public for review and entry into BIND. Results Cross-validation estimated that the support vector machine's test-set precision, accuracy, and recall for classifying abstracts describing interaction information were 92%, 90%, and 92%, respectively. We estimated that the system would be able to recall up to 60% of all non-high-throughput interactions present in another yeast protein-interaction database. Finally, this system was applied to a real-world curation problem and its use was found to reduce the task duration by 70%, thus saving 176 days. Conclusions Machine learning methods are useful as tools to direct interaction and pathway database back-filling; however, this potential can only be realized if these techniques are coupled with human review and entry into a factual database such as BIND. The PreBIND system described here is available to the public at http://bind.ca. Current capabilities allow searching for human, mouse, and yeast protein-interaction information.
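The classification step described above can be illustrated with off-the-shelf tools. The following is a minimal sketch, not the original PreBIND code, assuming a small set of abstracts with binary interaction labels; it uses scikit-learn's TfidfVectorizer and LinearSVC and reports cross-validated precision, recall, and accuracy in the spirit of the estimates quoted above.

```python
# Minimal sketch: classify abstracts as describing protein-protein
# interactions or not, with cross-validated performance estimates.
# Abstracts and labels are toy examples, not PreBIND training data.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_validate

abstracts = [
    "Yeast two-hybrid assays show that Abc1 binds Xyz2 in vivo ...",
    "We report that KinA phosphorylates and interacts with RegB ...",
    "We describe the crystal structure of a membrane channel ...",
    "Expression of the gene was measured under heat stress ...",
]
labels = [1, 1, 0, 0]  # 1 = describes an interaction, 0 = does not

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2), min_df=1),
                    LinearSVC())

# Cross-validation yields precision/recall/accuracy estimates of the
# kind reported in the abstract above.
scores = cross_validate(clf, abstracts, labels, cv=2,
                        scoring=["precision", "recall", "accuracy"])
print({k: v.mean() for k, v in scores.items() if k.startswith("test_")})
```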
Journal of the American Medical Informatics Association | 2011
Berry de Bruijn; Colin Cherry; Svetlana Kiritchenko; Joel D. Martin; Xiaodan Zhu
Objective As clinical text mining continues to mature, its potential as an enabling technology for innovations in patient care and clinical research is becoming a reality. A critical part of that process is rigid benchmark testing of natural language processing methods on realistic clinical narrative. In this paper, the authors describe the design and performance of three state-of-the-art text-mining applications from the National Research Council of Canada on evaluations within the 2010 i2b2 challenge. Design The three systems perform three key steps in clinical information extraction: (1) extraction of medical problems, tests, and treatments, from discharge summaries and progress notes; (2) classification of assertions made on the medical problems; (3) classification of relations between medical concepts. Machine learning systems performed these tasks using large-dimensional bags of features, as derived from both the text itself and from external sources: UMLS, cTAKES, and Medline. Measurements Performance was measured per subtask, using micro-averaged F-scores, as calculated by comparing system annotations with ground-truth annotations on a test set. Results The systems ranked high among all submitted systems in the competition, with the following F-scores: concept extraction 0.8523 (ranked first); assertion detection 0.9362 (ranked first); relationship detection 0.7313 (ranked second). Conclusion For all tasks, we found that the introduction of a wide range of features was crucial to success. Importantly, our choice of machine learning algorithms allowed us to be versatile in our feature design, and to introduce a large number of features without overfitting and without encountering computing-resource bottlenecks.
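The "large-dimensional bags of features" approach can be sketched as follows for the assertion subtask. This is an illustrative reconstruction, not the NRC system: the feature templates (concept string plus a small left/right context window), the training examples, and the use of logistic regression are assumptions standing in for the machine learning components used in the challenge.

```python
# Minimal sketch of a sparse bag-of-features assertion classifier:
# each mention of a medical problem is represented by lexical and
# context features and fed to a linear model. All examples are toy data.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def features(sentence_tokens, concept_span):
    start, end = concept_span
    feats = {"concept=" + " ".join(sentence_tokens[start:end]).lower(): 1}
    for i in range(max(0, start - 3), start):                 # left context
        feats["left=" + sentence_tokens[i].lower()] = 1
    for i in range(end, min(len(sentence_tokens), end + 3)):  # right context
        feats["right=" + sentence_tokens[i].lower()] = 1
    return feats

X_dicts = [features("The patient denies chest pain .".split(), (3, 5)),
           features("She has a history of asthma .".split(), (5, 6))]
y = ["absent", "present"]

vec = DictVectorizer()
clf = LogisticRegression(max_iter=1000).fit(vec.fit_transform(X_dicts), y)
print(clf.predict(vec.transform([features("He denies any fever .".split(), (3, 4))])))
```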
Journal of the American Medical Informatics Association | 2013
Colin Cherry; Xiaodan Zhu; Joel D. Martin; Berry de Bruijn
Objective An analysis of the timing of events is critical for a deeper understanding of the course of events within a patient record. The 2012 i2b2 NLP challenge focused on the extraction of temporal relationships between concepts within textual hospital discharge summaries. Materials and methods The team from the National Research Council Canada (NRC) submitted three system runs to the second track of the challenge: typifying the time-relationship between pre-annotated entities. The NRC system was designed around four specialist modules containing statistical machine learning classifiers. Each specialist targeted distinct sets of relationships: local relationships, ‘sectime’-type relationships, non-local overlap-type relationships, and non-local causal relationships. Results The best NRC submission achieved a precision of 0.7499, a recall of 0.6431, and an F1 score of 0.6924, resulting in a statistical tie for first place. Post hoc improvements led to a precision of 0.7537, a recall of 0.6455, and an F1 score of 0.6954, giving the highest scores reported on this task to date. Discussion and conclusions Methods for general relation extraction extended well to temporal relations, and gave top-ranked state-of-the-art results. Careful ordering of predictions within result sets proved critical to this success.
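The specialist-module design can be pictured as a routing layer that sends each candidate entity pair to a dedicated classifier. The sketch below is illustrative only: the routing rules, relation labels, and stubbed predictions are assumptions, not the NRC system's actual modules.

```python
# Minimal sketch of routing candidate pairs to specialist classifiers
# (local, sectime-type, non-local). Rules and predictions are illustrative.
from dataclasses import dataclass

@dataclass
class Pair:
    source: str
    target: str
    same_sentence: bool
    target_is_sectime: bool

def route(pair: Pair) -> str:
    if pair.target_is_sectime:
        return "sectime_model"
    if pair.same_sentence:
        return "local_model"
    return "nonlocal_model"

def classify(pair: Pair) -> str:
    # Each specialist would be a trained statistical classifier; here each
    # is stubbed with a constant prediction for illustration.
    stub_predictions = {"sectime_model": "BEFORE",
                        "local_model": "OVERLAP",
                        "nonlocal_model": "AFTER"}
    return stub_predictions[route(pair)]

print(classify(Pair("admission", "discharge date", False, True)))
```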
Biomedical Informatics Insights | 2012
Colin Cherry; Saif M. Mohammad; Berry de Bruijn
This paper describes the National Research Council of Canada's submission to the 2011 i2b2 NLP challenge on the detection of emotions in suicide notes. In this task, each sentence of a suicide note is annotated with zero or more emotions, making it a multi-label sentence classification task. We employ two distinct large-margin models capable of handling multiple labels. The first uses one classifier per emotion, and is built to simplify label balance issues and to allow extremely fast development. This approach is very effective, scoring an F-measure of 55.22 and placing fourth in the competition, making it the best system that does not use web-derived statistics or re-annotated training data. Second, we present a latent sequence model, which learns to segment the sentence into a number of emotion regions. This model is intended to gracefully handle sentences that convey multiple thoughts and emotions. Preliminary work with the latent sequence model shows promise, resulting in comparable performance using fewer features.
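The first model, one large-margin classifier per emotion (binary relevance), can be sketched with scikit-learn as follows. The sentences, emotion labels, and feature choices are illustrative; the competition system's features and tuning are not reproduced here.

```python
# Minimal sketch of one-classifier-per-emotion multi-label classification:
# each emotion gets its own linear SVM, and a sentence can receive zero
# or more labels. Toy sentences and labels only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

sentences = ["I am so sorry for everything I have done.",
             "Please take care of the children.",
             "I love you all and I am grateful for your kindness."]
labels = [{"guilt"}, {"instructions"}, {"love", "thankfulness"}]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(labels)          # binary indicator matrix, one column per emotion
X = TfidfVectorizer().fit_transform(sentences)

clf = OneVsRestClassifier(LinearSVC()).fit(X, Y)
print(mlb.inverse_transform(clf.predict(X)))
```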
Journal of Biomedical Informatics | 2013
Xiaodan Zhu; Colin Cherry; Svetlana Kiritchenko; Joel D. Martin; Berry de Bruijn
This paper addresses an information-extraction problem that aims to identify semantic relations among medical concepts (problems, tests, and treatments) in clinical text. The objectives of the paper are twofold. First, we extend an earlier one-page description (appearing as part of [5]) of a top-ranked model in the 2010 i2b2 NLP Challenge to the necessary level of detail, in the belief that feature design was the most crucial factor in the success of our system and hence deserves a more detailed discussion. We present a precise quantification of the contributions of a wide variety of knowledge sources. In addition, we show the end-to-end results obtained on the noisy output of a top-ranked concept detector, which could help construct a more complete view of the state of the art in a real-world scenario. As the second major objective, we reformulate our models into a composite-kernel framework and present what is, to our knowledge, the best result on the same dataset.
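A composite kernel in this sense combines several kernels, for example one over flat feature vectors and one over structural representations, into a single kernel matrix for an SVM. The sketch below is a generic illustration with random stand-in data and arbitrary weights, not the kernels or weights used in the paper.

```python
# Minimal sketch of a composite kernel: combine two kernel matrices as a
# weighted sum and train an SVM on the precomputed result.
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import linear_kernel, rbf_kernel

X_flat = np.random.RandomState(0).rand(6, 10)    # stand-in flat feature vectors
X_struct = np.random.RandomState(1).rand(6, 10)  # stand-in structural vectors
y = np.array([0, 1, 0, 1, 0, 1])

K = 0.6 * linear_kernel(X_flat) + 0.4 * rbf_kernel(X_struct)  # composite kernel

clf = SVC(kernel="precomputed").fit(K, y)
print(clf.predict(K))  # training-set predictions, for illustration only
```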
Biomedical Digital Libraries | 2006
Jeffrey Demaine; Joel D. Martin; Lynn Wei; Berry de Bruijn
Background This paper examines how the adoption of a subject-specific library service has changed the way in which its users interact with a digital library. The LitMiner text-analysis application was developed to enable biologists to explore gene relationships in the published literature. The application features a suite of interfaces that enable users to search PubMed as well as local databases, to view document abstracts, to filter terms, to select gene name aliases, and to visualize the co-occurrences of genes in the literature. At each of these stages, LitMiner offers the functionality of a digital library. Documents that are accessible online are identified by an icon. Users can also order documents from their institution's library collection from within the application. In so doing, LitMiner aims to integrate digital library services into the research process of its users. Methods Case study. Results This integration of digital library services into the research process of biologists results in increased access to the published literature. Conclusion In order to make better use of their collections, digital libraries should customize their services to suit the research needs of their patrons.
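The gene co-occurrence view at the heart of such a tool can be approximated by counting how often pairs of gene names appear in the same abstract. The gene list and abstracts below are illustrative, and simple substring matching stands in for LitMiner's alias handling.

```python
# Minimal sketch of gene co-occurrence counting over a set of abstracts.
# Gene names and abstracts are illustrative only.
from collections import Counter
from itertools import combinations

genes = {"BRCA1", "TP53", "MDM2"}
abstracts = ["BRCA1 and TP53 interact in the DNA damage response ...",
             "MDM2 is a negative regulator of TP53 ..."]

cooccurrence = Counter()
for text in abstracts:
    present = sorted(g for g in genes if g in text)
    cooccurrence.update(combinations(present, 2))

for (a, b), n in cooccurrence.most_common():
    print(f"{a} - {b}: {n} abstract(s)")
```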
Systematic Reviews | 2016
Ian J Saldanha; Christopher H. Schmid; Joseph Lau; Kay Dickersin; Jesse A. Berlin; Jens Jap; Bryant T Smith; Simona Carini; Wiley Chan; Berry de Bruijn; Byron C. Wallace; Susan Hutfless; Ida Sim; M. Hassan Murad; Sandra A. Walsh; Elizabeth J. Whamond; Tianjing Li
Background Data abstraction, a critical systematic review step, is time-consuming and prone to errors. Current standards for approaches to data abstraction rest on a weak evidence base. We developed the Data Abstraction Assistant (DAA), a novel software application designed to facilitate the abstraction process by allowing users to (1) view study article PDFs juxtaposed to electronic data abstraction forms linked to a data abstraction system, (2) highlight (or “pin”) the location of the text in the PDF, and (3) copy relevant text from the PDF into the form. We describe the design of a randomized controlled trial (RCT) that compares the relative effectiveness of (A) DAA-facilitated single abstraction plus verification by a second person, (B) traditional (non-DAA-facilitated) single abstraction plus verification by a second person, and (C) traditional independent dual abstraction plus adjudication, to ascertain the accuracy and efficiency of abstraction. Methods This is an online, randomized, three-arm, crossover trial. We will enroll 24 pairs of abstractors (i.e., a sample size of 48 participants), each pair comprising one less experienced and one more experienced abstractor. Pairs will be randomized to abstract data from six articles, two under each of the three approaches. Abstractors will complete pre-tested data abstraction forms using the Systematic Review Data Repository (SRDR), an online data abstraction system. The primary outcomes are (1) the proportion of abstracted data items that constitute an error (compared with an answer key) and (2) the total time taken to complete abstraction (by the two abstractors in the pair, including verification and/or adjudication). Discussion The DAA trial uses a practical design to test a novel software application as a tool to help improve the accuracy and efficiency of the data abstraction process during systematic reviews. Findings from the DAA trial will provide much-needed evidence to strengthen current recommendations for data abstraction approaches. Trial registration The trial is registered at the National Information Center on Health Services Research and Health Care Technology (NICHSR) under Registration # HSRP20152269: https://wwwcf.nlm.nih.gov/hsr_project/view_hsrproj_record.cfm?NLMUNIQUE_ID=20152269&SEARCH_FOR=Tianjing%20Li. All items from the World Health Organization Trial Registration Data Set are covered at various locations in this protocol. Protocol version and date: This is version 2.0 of the protocol, dated September 6, 2016. As needed, we will communicate any protocol amendments to the Institutional Review Boards (IRBs) of Johns Hopkins Bloomberg School of Public Health (JHBSPH) and Brown University. We also will make appropriate as-needed modifications to the NICHSR website in a timely fashion.
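The two primary outcomes can be expressed as simple computations against an answer key. The sketch below uses entirely hypothetical data items and timings, only to make the outcome definitions concrete.

```python
# Minimal sketch of the two primary outcomes: (1) proportion of abstracted
# items that are errors relative to an answer key, and (2) total abstraction
# time for the pair. All values are hypothetical.
answer_key = {"sample_size": "120", "mean_age": "54.2", "outcome": "mortality"}
abstracted = {"sample_size": "120", "mean_age": "45.2", "outcome": "mortality"}

errors = sum(1 for item, truth in answer_key.items()
             if abstracted.get(item) != truth)
error_proportion = errors / len(answer_key)

minutes_abstractor_1, minutes_abstractor_2 = 38.0, 21.5  # incl. verification
total_time = minutes_abstractor_1 + minutes_abstractor_2

print(f"error proportion: {error_proportion:.2f}, total minutes: {total_time}")
```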
Journal of Clinical Epidemiology | 2016
Margaret Sampson; Berry de Bruijn; Christine Urquhart; Kaveh G Shojania
OBJECTIVES To maximize the proportion of relevant studies identified for inclusion in systematic reviews (recall), complex, time-consuming Boolean searches across multiple databases are common. Although MEDLINE provides excellent coverage of health science evidence, it has proved challenging to achieve high levels of recall through Boolean searches alone. STUDY DESIGN AND SETTING Recall of one Boolean search method, the clinical query (CQ), combined with a ranking method, either a support vector machine (SVM) or PubMed's related articles, was tested against a gold standard of studies added to 6 updated Cochrane reviews and 10 Agency for Healthcare Research and Quality (AHRQ) evidence reviews. For the AHRQ sample, precision and temporal stability were examined for each method. RESULTS Recall of new studies was 0.69 for the CQ, 0.66 for related articles, 0.50 for the SVM, 0.91 for the combination of CQ and related articles, and 0.89 for the combination of CQ and SVM. Precision was 0.11 for CQ and related articles combined, and 0.11 for CQ and SVM combined. Related articles showed the least stability over time. CONCLUSIONS The complementary combination of a Boolean search strategy and a ranking strategy appears to provide a robust method for identifying relevant studies in MEDLINE.
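The complementary combination amounts to taking the union of the Boolean result set and the top of the ranked list, then measuring recall against the gold standard of included studies. The PMIDs and result sets below are invented for illustration.

```python
# Minimal sketch of combining a Boolean filter with a ranking method and
# measuring recall against a gold standard of included studies.
gold_standard = {"101", "102", "103", "104", "105"}

clinical_query_hits = {"101", "103", "200", "201"}   # Boolean filter results
svm_ranked_top_k = {"102", "104", "202", "203"}      # top-ranked by the SVM

def recall(retrieved, relevant):
    return len(retrieved & relevant) / len(relevant)

print("CQ alone:      ", recall(clinical_query_hits, gold_standard))
print("CQ + SVM top-k:", recall(clinical_query_hits | svm_ranked_top_k,
                                gold_standard))
```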
Proceedings of The Asist Annual Meeting | 2005
Jeffrey Demaine; Joel D. Martin; Berry de Bruijn
This paper describes the EurekaSeek bibliometric technique for automated linked-literature analysis. The MEDLINE database of biomedical literature is iteratively searched in order to identify research opportunities in the form of conceptual linkages between terms. As a tool for identifying undiscovered public knowledge, EurekaSeek is a variation on the techniques of Swanson and Smalheiser. EurekaSeek uses medical subject headings instead of text analysis in a fully automated search process, thereby eliminating the reliance on expert input during the process of linking literatures. In this paper, the EurekaSeek process is tested by retroactively examining the co-occurrence of terms in the published literature. The hypothesis tested in this paper is whether this tool, had it existed in the past, could have identified conceptual linkages that occurred only later in the literature. In addition, EurekaSeek is compared against a process that considers all potential term-to-term relationships. The list of terms that EurekaSeek produces is a subset of all potential linked literature terms. The experiment shows that EurekaSeek produces a higher percentage of likely hypotheses than when all terms are considered. While the proportion of identified linkages generated is still too small for the process to be a practical aid to research, statistically significant results were achieved. Metaphorically speaking, EurekaSeek identifies a higher proportion of needles per haystack.
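The underlying idea follows Swanson-style A-B-C linking over subject headings: starting from term A, collect intermediate terms B that co-occur with it, then propose candidate terms C that co-occur with some B but never directly with A. The sketch below uses the well-known Raynaud's disease and fish oil illustration with a hand-made co-occurrence table; it is not MEDLINE data or EurekaSeek's actual procedure.

```python
# Minimal sketch of A-B-C linked-literature discovery over subject headings.
# The co-occurrence table is hand-made for illustration.
cooccurs = {
    "Raynaud Disease": {"Blood Viscosity", "Vasoconstriction"},
    "Blood Viscosity": {"Raynaud Disease", "Fish Oils"},
    "Vasoconstriction": {"Raynaud Disease", "Calcium Channel Blockers"},
    "Fish Oils": {"Blood Viscosity"},
    "Calcium Channel Blockers": {"Vasoconstriction"},
}

def candidate_links(a):
    b_terms = cooccurs.get(a, set())
    candidates = set()
    for b in b_terms:
        candidates |= cooccurs.get(b, set())
    # keep only terms never directly linked to A (and not A or a B term)
    return candidates - b_terms - {a}

print(candidate_links("Raynaud Disease"))
```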
Journal of the American Medical Informatics Association | 2018
Abeed Sarker; Maksim Belousov; Jasper Friedrichs; Kai Hakala; Svetlana Kiritchenko; Farrokh Mehryary; Sifei Han; Tung Tran; Anthony Rios; Ramakanth Kavuluru; Berry de Bruijn; Filip Ginter; Debanjan Mahata; Saif M. Mohammad; Goran Nenadic; Graciela Gonzalez-Hernandez
Objective We executed the Social Media Mining for Health (SMM4H) 2017 shared tasks to enable the community-driven development and large-scale evaluation of automatic text processing methods for the classification and normalization of health-related text from social media. An additional objective was to publicly release manually annotated data. Materials and Methods We organized 3 independent subtasks: automatic classification of self-reports of (1) adverse drug reactions (ADRs) and (2) medication consumption from medication-mentioning tweets, and (3) normalization of ADR expressions. Training data consisted of 15 717 annotated tweets for (1), 10 260 for (2), and 6650 ADR phrases and identifiers for (3), and exhibited typical properties of social-media-based health-related texts. Systems were evaluated using 9961, 7513, and 2500 instances for the 3 subtasks, respectively. We evaluated the performance of classes of methods and of ensembles of system combinations following the shared tasks. Results Among 55 system runs, the best system scores for the 3 subtasks were 0.435 (ADR class F1-score) for subtask-1, 0.693 (micro-averaged F1-score over two classes) for subtask-2, and 88.5% (accuracy) for subtask-3. Ensembles of system combinations obtained best scores of 0.476, 0.702, and 88.7%, outperforming individual systems. Discussion Among individual systems, support vector machines and convolutional neural networks showed high performance. Performance gains achieved by ensembles of system combinations suggest that such strategies may be suitable for operational systems relying on difficult text classification tasks (eg, subtask-1). Conclusions Data imbalance and lack of context remain challenges for natural language processing of social media text. Annotated data from the shared task have been made available as reference standards for future studies (http://dx.doi.org/10.17632/rxwfb3tysd.1).
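The ensemble results reported above combine the predictions of multiple system runs; a simple way to do this is majority voting followed by micro-averaged F1 scoring. The runs, labels, and gold standard below are illustrative.

```python
# Minimal sketch of a majority-vote ensemble over system runs, scored with
# a micro-averaged F1. System outputs and gold labels are toy data.
from collections import Counter
from sklearn.metrics import f1_score

system_runs = [
    ["ADR", "noADR", "ADR", "noADR"],    # run 1
    ["ADR", "ADR",   "ADR", "noADR"],    # run 2
    ["noADR", "noADR", "ADR", "noADR"],  # run 3
]
gold = ["ADR", "noADR", "ADR", "noADR"]

ensemble = [Counter(votes).most_common(1)[0][0] for votes in zip(*system_runs)]
print("ensemble:", ensemble)
print("micro-F1:", f1_score(gold, ensemble, average="micro"))
```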