
Publication


Featured research published by Guy Divita.


Journal of the American Medical Informatics Association | 2013

Validating a strategy for psychosocial phenotyping using a large corpus of clinical text

Adi V. Gundlapalli; Andrew Redd; Marjorie E. Carter; Guy Divita; Shuying Shen; Miland Palmer; Matthew H. Samore

OBJECTIVE To develop algorithms to improve efficiency of patient phenotyping using natural language processing (NLP) on text data. Of a large number of note titles available in our database, we sought to determine those with highest yield and precision for psychosocial concepts. MATERIALS AND METHODS From a database of over 1 billion documents from US Department of Veterans Affairs medical facilities, a random sample of 1500 documents from each of 218 enterprise note titles was chosen. Psychosocial concepts were extracted using a UIMA-AS-based NLP pipeline (v3NLP), using a lexicon of relevant concepts with negation and template format annotators. Human reviewers evaluated a subset of documents for false positives and sensitivity. High-yield documents were identified by hit rate and precision. Reasons for false positivity were characterized. RESULTS A total of 58 707 psychosocial concepts were identified from 316 355 documents for an overall hit rate of 0.2 concepts per document (median 0.1, range 0-1.6). Of 6031 concepts reviewed from a high-yield set of note titles, the overall precision for all concept categories was 80%, with variability among note titles and concept categories. Reasons for false positivity included templating, negation, context, and alternate meaning of words. The sensitivity of the NLP system was noted to be 49% (95% CI 43% to 55%). CONCLUSIONS Phenotyping using NLP need not involve the entire document corpus. Our methods offer a generalizable strategy for scaling NLP pipelines to large free text corpora with complex linguistic annotations in attempts to identify patients of a certain phenotype.
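As a rough illustration of the lexicon-plus-negation extraction the abstract describes, the sketch below matches a tiny hypothetical lexicon against text and flags concepts preceded by a negation cue. The lexicon, cue list, and window size are illustrative assumptions, not the actual v3NLP annotators:

```python
import re

# Hypothetical mini-lexicon and negation cues; the study's actual lexicon
# and v3NLP annotators are far larger and more sophisticated.
LEXICON = {"homeless", "unemployed", "divorced"}
NEGATIONS = {"no", "not", "denies", "without"}

def extract_concepts(text, window=3):
    """Return (concept, negated) pairs found in `text`.

    A concept is flagged as negated when a negation cue appears within
    `window` tokens before it (a NegEx-style heuristic).
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    hits = []
    for i, tok in enumerate(tokens):
        if tok in LEXICON:
            negated = any(t in NEGATIONS for t in tokens[max(0, i - window):i])
            hits.append((tok, negated))
    return hits

print(extract_concepts("Patient denies being homeless but is unemployed."))
# → [('homeless', True), ('unemployed', False)]
```

A fixed look-back window is a common simplification; the paper's error analysis (templating, context, alternate word meanings) shows why production systems need much more than this.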


Journal of Health and Medical Informatics | 2013

Characterizing Clinical Text and Sublanguage: A Case Study of the VA Clinical Notes

Qing T. Zeng; Doug Redd; Guy Divita; Samah Jarad; Cynthia Brandt; Jonathan R. Nebeker

Objective: To characterize text and sublanguage in medical records to better address challenges within Natural Language Processing (NLP) tasks such as information extraction, word sense disambiguation, information retrieval, and text summarization. The text and sublanguage analysis is needed to scale up the NLP development for large and diverse free-text clinical data sets. Design: This is a quantitative descriptive study which analyzes the text and sublanguage characteristics of a very large Veterans Affairs (VA) clinical note corpus (569 million notes) to guide the customization of natural language processing (NLP) of VA notes. Methods: We randomly sampled 100,000 notes from the top 100 most frequently appearing document types. We examined surface features and used those features to identify sublanguage groups using unsupervised clustering. Results: Using the text features we were able to characterize each of the 100 document types and identify 16 distinct sublanguage groups. The identified sublanguages reflect different clinical domains and types of encounters within the sample corpus. We also found much variance within each of the document types. Such characteristics will facilitate the tuning and crafting of NLP tools. Conclusion: Using a diverse and large sample of clinical text, we were able to show there are a relatively large number of sublanguages and variance both within and between document types. These findings will guide NLP development to create more customizable and generalizable solutions across medical domains and sublanguages.
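The surface-feature clustering approach can be sketched as follows. The two features and the use of k-means (rather than the richer feature set and hierarchical-style clustering in the study) are simplifying assumptions for illustration:

```python
import math
import random

def surface_features(note):
    """Crude surface features of a note: average token length and
    punctuation density (stand-ins for the study's richer feature set)."""
    tokens = note.split()
    avg_len = sum(len(t) for t in tokens) / len(tokens)
    punct = sum(note.count(c) for c in ":;|=") / len(note)
    return (avg_len, punct)

def kmeans(points, k, iters=20, seed=0):
    """Minimal k-means over feature tuples; returns a cluster label per point."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                  for p in points]
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # recompute centroid as the mean of its members
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels

notes = [
    "BP: 120/80 | HR: 72 | Temp: 98.6",                        # template-like
    "Pulse: 88 | RR: 16 | SpO2: 97",                           # template-like
    "Patient reports feeling well and denies any pain today.",  # narrative
    "He states his mood has improved since the last visit.",    # narrative
]
labels = kmeans([surface_features(n) for n in notes], k=2)
print(labels)
```

With these four invented notes, the two template-like vitals lines land in one cluster and the two narrative sentences in the other, mirroring the kind of sublanguage separation the study reports at scale.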


Studies in health technology and informatics | 2014

Detecting earlier indicators of homelessness in the free text of medical records.

Andrew Redd; Marjorie E. Carter; Guy Divita; Shuying Shen; Miland Palmer; Matthew H. Samore; Adi V. Gundlapalli

Early warning indicators to identify US Veterans at risk of homelessness are currently only inferred from administrative data. References to indicators of risk or instances of homelessness in the free text of medical notes written by Department of Veterans Affairs (VA) providers may precede formal identification of Veterans as being homeless. This represents a potentially untapped resource for early identification. Using natural language processing (NLP), we investigated the idea that concepts related to homelessness written in the free text of the medical record precede the identification of homelessness by administrative data. We found that homeless Veterans were much higher utilizers of VA resources producing approximately 12 times as many documents as non-homeless Veterans. NLP detected mentions of either direct or indirect evidence of homelessness in a significant portion of Veterans earlier than structured data.


Journal of the American Medical Informatics Association | 2014

The pattern of name tokens in narrative clinical text and a comparison of five systems for redacting them

Mehmet Kayaalp; Allen C. Browne; Fiona M. Callaghan; Zeyno A. Dodd; Guy Divita; Selcuk Ozturk; Clement J. McDonald

Objective To understand the factors that influence success in scrubbing personal names from narrative text. Materials and methods We developed a scrubber, the NLM Name Scrubber (NLM-NS), to redact personal names from narrative clinical reports, hand tagged words in a set of gold standard narrative reports as personal names or not, and measured the scrubbing success of NLM-NS and that of four other scrubbing/name recognition tools (MIST, MITdeid, LingPipe, and ANNIE/GATE) against the gold standard reports. We ran three comparisons which used increasingly larger name lists. Results The test reports contained more than 1 million words, of which 2388 were patient and 20 160 were provider name tokens. NLM-NS failed to scrub only 2 of the 2388 instances of patient name tokens. Its sensitivity was 0.999 on both patient and provider name tokens, and it missed fewer instances of patient name tokens than the other scrubbers in all comparisons. MIST produced the best all token specificity and F-measure for name instances in our most relevant study (study 2), with values of 0.997 and 0.938, respectively. In that same comparison, NLM-NS was second best, with values of 0.986 and 0.748, respectively, and MITdeid was a close third, with values of 0.985 and 0.796, respectively. With the addition of the Clinical Center name list to their native name lists, LingPipe, MITdeid, MIST, and ANNIE/GATE all improved substantially. MITdeid and LingPipe gained the most, reaching patient name sensitivity of 0.995 (F-measure=0.705) and 0.989 (F-measure=0.386), respectively. Discussion The privacy risk due to two name tokens missed by NLM-NS was statistically negligible, since neither individual could be distinguished among more than 150 000 people listed in the US Social Security Registry. Conclusions The nature and size of name lists have substantial influences on scrubbing success. The use of very large name lists with frequency statistics accounts for much of NLM-NS scrubbing success.
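A name-list scrubber in miniature, assuming a tiny hypothetical token list (the real NLM-NS relies on very large name lists with frequency statistics, which this sketch omits entirely):

```python
import re

# Hypothetical name list; NLM-NS uses very large name lists with
# frequency statistics, which this sketch does not attempt to model.
NAME_TOKENS = {"smith", "jones", "garcia", "mehmet"}

def scrub_names(text, placeholder="[NAME]"):
    """Replace any word token found in the name list with a placeholder."""
    def repl(match):
        word = match.group(0)
        return placeholder if word.lower() in NAME_TOKENS else word
    return re.sub(r"[A-Za-z]+", repl, text)

print(scrub_names("Dr. Smith examined the patient with Dr. Garcia."))
# → Dr. [NAME] examined the patient with Dr. [NAME].
```

Even this toy version shows the paper's central trade-off: everything hinges on name-list coverage, and common words that double as surnames drive the specificity differences measured across the five systems.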


eGEMs (Generating Evidence & Methods to improve patient outcomes) | 2016

v3NLP Framework: Tools to Build Applications for Extracting Concepts from Clinical Text.

Guy Divita; Marjorie E. Carter; Le-Thuy T. Tran; Doug Redd; Qing T. Zeng; Scott L. DuVall; Matthew H. Samore; Adi V. Gundlapalli

Introduction: Substantial amounts of clinically significant information are contained only within the narrative of the clinical notes in electronic medical records. The v3NLP Framework is a set of “best-of-breed” functionalities developed to transform this information into structured data for use in quality improvement, research, population health surveillance, and decision support. Background: MetaMap, cTAKES and similar well-known natural language processing (NLP) tools do not have sufficient scalability out of the box. The v3NLP Framework evolved out of the necessity to scale up these tools and provide a framework to customize and tune techniques that fit a variety of tasks, including document classification, tuned concept extraction for specific conditions, patient classification, and information retrieval. Innovation: Beyond scalability, several v3NLP Framework-developed projects have been efficacy tested and benchmarked. While the v3NLP Framework includes annotators, pipelines and applications, its functionalities enable developers to create novel annotators and to place annotators into pipelines and scaled applications. Discussion: The v3NLP Framework has been successfully utilized in many projects including general concept extraction, risk factors for homelessness among veterans, and identification of mentions of the presence of an indwelling urinary catheter. Projects as diverse as predicting colonization with methicillin-resistant Staphylococcus aureus and extracting references to military sexual trauma are being built using v3NLP Framework components. Conclusion: The v3NLP Framework is a set of functionalities and components that provide Java developers with the ability to create novel annotators and to place those annotators into pipelines and applications to extract concepts from clinical text. There are scale-up and scale-out functionalities to process large numbers of records.


Studies in health technology and informatics | 2014

Recognizing Questions and Answers in EMR Templates Using Natural Language Processing.

Guy Divita; Shuying Shen; Marjorie E. Carter; Andrew Redd; Tyler Forbush; Miland Palmer; Matthew H. Samore; Adi V. Gundlapalli

Templated boilerplate structures pose challenges to natural language processing (NLP) tools used for information extraction (IE). Routine error analyses while performing an IE task using Veterans Affairs (VA) medical records identified templates as an important cause of false positives. The baseline NLP pipeline (V3NLP) was adapted to recognize negation, questions and answers (QA) in various template types by adding a negation and slot:value identification annotator. The system was trained using a corpus of 975 documents developed as a reference standard for extracting psychosocial concepts. Iterative processing using the baseline tool and baseline+negation+QA revealed loss of numbers of concepts with a modest increase in true positives in several concept categories. Similar improvement was noted when the adapted V3NLP was used to process a random sample of 318,000 notes. We demonstrate the feasibility of adapting an NLP pipeline to recognize templates.
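A minimal sketch of the slot:value (question-and-answer) recognition the abstract describes; the regex and the list of negative answers are illustrative assumptions, not the V3NLP annotator:

```python
import re

# A "slot:value" line such as "Tobacco use: denies" is template boilerplate;
# naive concept extraction would flag "Tobacco use" as a positive mention.
SLOT_VALUE = re.compile(r"^\s*(?P<slot>[^:\n]{1,40}):\s*(?P<value>.*)$")
NEGATIVE_ANSWERS = {"no", "none", "denies", "n/a", "negative"}

def parse_template_line(line):
    """Split a templated line into (slot, value, asserted), or None if the
    line is not in slot:value form.

    `asserted` is False when the answer is a negative response, so the
    slot's concept should not count as a positive mention.
    """
    m = SLOT_VALUE.match(line)
    if not m:
        return None
    slot, value = m.group("slot").strip(), m.group("value").strip()
    asserted = value.lower() not in NEGATIVE_ANSWERS
    return slot, value, asserted

print(parse_template_line("Homeless: denies"))
# → ('Homeless', 'denies', False)
print(parse_template_line("The patient lives alone."))
# → None
```

Filtering out unasserted slot answers is exactly the mechanism that trades raw concept counts for the increase in true positives the study reports.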


Journal of Biomedical Informatics | 2017

A pilot study of a heuristic algorithm for novel template identification from VA electronic medical record text.

Andrew Redd; Adi V. Gundlapalli; Guy Divita; Marjorie E. Carter; Le Thuy Tran; Matthew H. Samore

RATIONALE Templates in text notes pose challenges for automated information extraction algorithms. We propose a method that identifies novel templates in plain text medical notes. The identification can then be used to either include or exclude templates when processing notes for information extraction. METHODS The two-module method is based on the framework of information foraging and addresses the hypothesis that documents containing templates and the templates within those documents can be identified by common features. The first module takes documents from the corpus and groups those with common templates. This is accomplished through a binned word count hierarchical clustering algorithm. The second module extracts the templates. It uses the groupings and performs a longest common subsequence (LCS) algorithm to obtain the constituent parts of the templates. The method was developed and tested on a random document corpus of 750 notes derived from a large database of US Department of Veterans Affairs (VA) electronic medical notes. RESULTS The grouping module, using hierarchical clustering, identified 23 groups with 3 documents or more, consisting of 120 documents from the 750 documents in our test corpus. Of these, 18 groups had at least one common template that was present in all documents in the group for a positive predictive value of 78%. The LCS extraction module performed with 100% positive predictive value, 94% sensitivity, and 83% negative predictive value. The human review determined that in 4 groups the template covered the entire document, with the remaining 14 groups containing a common section template. Among documents with templates, the number of templates per document ranged from 1 to 14. The mean and median number of templates per group was 5.9 and 5, respectively. DISCUSSION The grouping method was successful in finding like documents containing templates. Of the groups of documents containing templates, the LCS module was successful in deciphering text belonging to the template and text that was extraneous. Major obstacles to improved performance included documents composed of multiple templates, templates that included other templates embedded within them, and variants of templates. We demonstrate proof of concept of the grouping and extraction method of identifying templates in electronic medical records in this pilot study and propose methods to improve performance and scaling up.
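The LCS step of the extraction module is the textbook dynamic-programming algorithm. Below is a sketch over token sequences, with two invented vitals snippets standing in for a group of documents sharing a template:

```python
def lcs(a, b):
    """Longest common subsequence of two token lists (dynamic programming)."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if a[i] == b[j]
                                else max(dp[i][j + 1], dp[i + 1][j]))
    # Backtrack through the table to recover the subsequence itself.
    out, i, j = [], m, n
    while i and j:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i, j = i - 1, j - 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return out[::-1]

# Two invented notes sharing a template; only the filled-in values differ.
doc1 = "Vitals BP : 130/85 HR : 80 Pain score : 3".split()
doc2 = "Vitals BP : 118/76 HR : 64 Pain score : 0".split()
print(" ".join(lcs(doc1, doc2)))
# → Vitals BP : HR : Pain score :
```

The surviving subsequence is the template skeleton; the tokens it drops (the patient-specific values) are exactly the "extraneous" text the extraction module separates out.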


Journal of Biomedical Informatics | 2017

Detecting the presence of an indwelling urinary catheter and urinary symptoms in hospitalized patients using natural language processing.

Adi V. Gundlapalli; Guy Divita; Andrew Redd; Marjorie E. Carter; Danette Ko; Michael A. Rubin; Matthew H. Samore; Judith Strymish; Sarah L. Krein; Kalpana Gupta; Anne Sales

OBJECTIVE To develop a natural language processing pipeline to extract positively asserted concepts related to the presence of an indwelling urinary catheter in hospitalized patients from the free text of the electronic medical note. The goal is to assist infection preventionists and other healthcare professionals in determining whether a patient has an indwelling urinary catheter when a catheter-associated urinary tract infection is suspected. Currently, data on indwelling urinary catheters is not consistently captured in the electronic medical record in structured format and thus cannot be reliably extracted for clinical and research purposes. MATERIALS AND METHODS We developed a lexicon of terms related to indwelling urinary catheters and urinary symptoms based on domain knowledge, prior experience in the field, and review of medical notes. A reference standard of 1595 randomly selected documents from inpatient admissions was annotated by human reviewers to identify all positively and negatively asserted concepts related to indwelling urinary catheters. We trained a natural language processing pipeline based on the V3NLP framework using 1050 documents and tested on 545 documents to determine agreement with the human reference standard. Metrics reported are positive predictive value and recall. RESULTS The lexicon contained 590 terms related to the presence of an indwelling urinary catheter in various categories including insertion, care, change, and removal of urinary catheters and 67 terms for urinary symptoms. Nursing notes were the most frequent inpatient note titles in the reference standard document corpus; these also yielded the highest number of positively asserted concepts with respect to urinary catheters. Comparing the performance of the natural language processing pipeline against the human reference standard, the overall recall was 75% and positive predictive value was 99% on the training set; on the testing set, the recall was 72% and positive predictive value was 98%. CONCLUSIONS We have shown that it is possible to identify the presence of an indwelling urinary catheter and urinary symptoms from the free text of electronic medical notes from inpatients using natural language processing. These are two key steps in developing automated protocols to assist humans in large-scale review of patient charts for catheter-associated urinary tract infection. The challenges associated with extracting indwelling urinary catheter-related concepts also inform the design of electronic medical record templates to reliably and consistently capture data on indwelling urinary catheters.
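The reported metrics follow the standard definitions of positive predictive value and recall. The counts in the sketch are illustrative, chosen only to reproduce the reported test-set figures, and are not taken from the study's actual confusion matrix:

```python
def ppv_recall(tp, fp, fn):
    """Positive predictive value (precision) and recall from raw counts."""
    return tp / (tp + fp), tp / (tp + fn)

# Illustrative counts only (not from the paper): 720 true positives,
# 15 false positives, 280 false negatives.
ppv, recall = ppv_recall(720, 15, 280)
print(round(ppv, 2), round(recall, 2))
# → 0.98 0.72
```

The asymmetry is typical of lexicon-driven extraction: what the lexicon matches is almost always right (high PPV), while phrasings outside the lexicon are simply missed (lower recall).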


Journal of Medical Systems | 2017

An Evolving Ecosystem for Natural Language Processing in Department of Veterans Affairs

Jennifer H. Garvin; Megha Kalsy; Cynthia Brandt; Stephen L. Luther; Guy Divita; Gregory Coronado; Doug Redd; Carrie M. Christensen; Brent Hill; Natalie Kelly; Qing Zeng-Treitler

In an ideal clinical Natural Language Processing (NLP) ecosystem, researchers and developers would be able to collaborate with others, undertake validation of NLP systems, components, and related resources, and disseminate them. We captured requirements and formative evaluation data from the Veterans Affairs (VA) Clinical NLP Ecosystem stakeholders using semi-structured interviews and meeting discussions. We developed a coding rubric to code interviews. We assessed inter-coder reliability using percent agreement and the kappa statistic. We undertook 15 interviews and held two workshop discussions. The main areas of requirements related to design and functionality, resources, and information. Stakeholders also confirmed the vision of the second generation of the Ecosystem; recommendations included adding mechanisms to better understand terms, measuring collaboration to demonstrate value, and datasets/tools to navigate spelling errors with consumer language, among others. Stakeholders also recommended the capability to communicate with developers working on the next version of the VA electronic health record (VistA Evolution), a mechanism to automatically monitor downloads of tools, and automatic summaries of those downloads for Ecosystem contributors and funders. After three rounds of coding and discussion, we determined the percent agreement of two coders to be 97.2% and the kappa to be 0.7851. The vision of the VA Clinical NLP Ecosystem met stakeholder needs. Interviews and discussion provided key requirements that inform the design of the VA Clinical NLP Ecosystem.
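Inter-coder reliability via Cohen's kappa can be computed directly from two coders' labels. The ten coded segments below are hypothetical, not the study's interview data:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same items:
    (observed agreement - chance agreement) / (1 - chance agreement)."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / n ** 2
    return (observed - expected) / (1 - expected)

# Hypothetical codes assigned by two coders to ten interview segments.
a = ["design", "resource", "info", "design", "info",
     "resource", "design", "info", "design", "resource"]
b = ["design", "resource", "info", "design", "design",
     "resource", "design", "info", "info", "resource"]
print(round(cohens_kappa(a, b), 3))
# → 0.697
```

Kappa discounts agreement expected by chance, which is why it runs well below raw percent agreement (here 80% agreement yields kappa of about 0.70, a pattern similar to the 97.2% / 0.7851 split the study reports).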


Studies in health technology and informatics | 2016

Finding 'Evidence of Absence' in Medical Notes: Using NLP for Clinical Inferencing.

Marjorie E. Carter; Guy Divita; Andrew Redd; Michael A. Rubin; Matthew H. Samore; Kalpana Gupta; Adi V. Gundlapalli

Extracting evidence of the absence of a target of interest from medical text can be useful in clinical inferencing. The purpose of our study was to develop a natural language processing (NLP) pipeline to identify the presence of indwelling urinary catheters from electronic medical notes to aid in detection of catheter-associated urinary tract infections (CAUTI). Finding clear evidence that a patient does not have an indwelling urinary catheter is useful in making a determination regarding CAUTI. We developed a lexicon of seven core concepts to infer the absence of a urinary catheter. Of the 990,391 concepts extracted by NLP from a large corpus of 744,285 electronic medical notes from 5589 hospitalized patients, 63,516 were labeled as evidence of absence. Human review revealed three primary causes for false negatives. The lexicon and NLP pipeline were refined using this information, resulting in outputs with an acceptable false positive rate of 11%.

Collaboration


Guy Divita's top co-authors.

Allen C. Browne

National Institutes of Health


Alla Keselman

National Institutes of Health
