Publication


Featured research published by Guergana Savova.


Journal of the American Medical Informatics Association | 2010

Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES): architecture, component evaluation and applications

Guergana Savova; James J. Masanz; Philip V. Ogren; Jiaping Zheng; Sunghwan Sohn; Karin Kipper-Schuler; Christopher G. Chute

We aim to build and evaluate an open-source natural language processing system for information extraction from electronic medical record clinical free-text. We describe and evaluate our system, the clinical Text Analysis and Knowledge Extraction System (cTAKES), released open-source at http://www.ohnlp.org. The cTAKES builds on existing open-source technologies: the Unstructured Information Management Architecture framework and the OpenNLP natural language processing toolkit. Its components, specifically trained for the clinical domain, create rich linguistic and semantic annotations. Performance of individual components: sentence boundary detector accuracy=0.949; tokenizer accuracy=0.949; part-of-speech tagger accuracy=0.936; shallow parser F-score=0.924; named entity recognizer and system-level evaluation F-score=0.715 for exact and 0.824 for overlapping spans, and accuracy for concept mapping, negation, and status attributes for exact and overlapping spans of 0.957, 0.943, 0.859, and 0.580, 0.939, and 0.839, respectively. Overall performance is discussed against five applications. The cTAKES annotations are the foundation for methods and modules for higher-level semantic processing of clinical free-text.
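The abstract above reports named entity F-scores separately for exact and overlapping spans. The following is a minimal sketch of how such a distinction is commonly computed, not the cTAKES evaluation code: the function names, the character-offset span representation, and the toy spans are all assumptions made for illustration.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive (assumed convention)


def overlaps(a: Span, b: Span) -> bool:
    """True when the two spans share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def span_f_score(gold: List[Span], predicted: List[Span], exact: bool = True) -> float:
    """F-score of predicted spans against gold spans.

    exact=True counts a match only when boundaries are identical;
    exact=False counts any overlap as a match.
    """
    if exact:
        def match(g: Span, p: Span) -> bool:
            return g == p
    else:
        match = overlaps

    matched_pred = sum(any(match(g, p) for g in gold) for p in predicted)
    matched_gold = sum(any(match(g, p) for p in predicted) for g in gold)
    precision = matched_pred / len(predicted) if predicted else 0.0
    recall = matched_gold / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy data (hypothetical): one exact hit, one partial hit, one miss on each side.
gold_spans = [(0, 5), (10, 20), (30, 40)]
pred_spans = [(0, 5), (12, 20), (50, 55)]

exact_f = span_f_score(gold_spans, pred_spans, exact=True)
overlap_f = span_f_score(gold_spans, pred_spans, exact=False)
```

On the toy data the partial hit (12, 20) counts only under overlap matching, which is why an overlapping-span score is never lower than the exact-span score for the same output.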


Journal of the American Medical Informatics Association | 2011

Overcoming barriers to NLP for clinical text: the role of shared tasks and the need for additional creative solutions

Wendy W. Chapman; Prakash M. Nadkarni; Lynette Hirschman; Leonard W. D'Avolio; Guergana Savova; Özlem Uzuner

This issue of JAMIA focuses on natural language processing (NLP) techniques for clinical-text information extraction. Several articles are offshoots of the yearly ‘Informatics for Integrating Biology and the Bedside’ (i2b2) (http://www.i2b2.org) NLP shared-task challenge, introduced by Uzuner et al (see page 552) [1] and co-sponsored by the Veterans Administration for the last 2 years. This shared task follows long-running challenge evaluations in other fields, such as the Message Understanding Conference (MUC) for information extraction [2], TREC [3] for text information retrieval, and CASP [4] for protein structure prediction. Shared tasks in the clinical domain are recent and include annual i2b2 Challenges that began in 2006, a challenge for multi-label classification of radiology reports sponsored by Cincinnati Children's Hospital in 2007 [5], a 2011 Cincinnati Children's Hospital challenge on suicide notes [6], and the 2011 TREC information retrieval shared task involving retrieval of clinical cases from narrative records [7]. Although NLP research in the clinical domain has been active since the 1960s, progress in the development of NLP applications for clinical text has been slow and lags behind progress in the general NLP domain. There are several barriers to NLP development in the clinical domain, and shared tasks like the i2b2/VA Challenge address some of these barriers. Nevertheless, many barriers remain and unless the community takes a more active role in developing novel approaches for addressing the barriers, advancement and innovation will continue to be slow. Historically, there have been substantial barriers to NLP development in the clinical domain. These barriers are not unique to the clinical domain: they also occur in the fields of software engineering and general NLP.

### Lack of access to shared data

Because of concerns regarding patient privacy and worry about revealing unfavorable institutional practices, hospitals and clinics have been extremely reluctant to allow access to clinical data for researchers from outside …

Correspondence to Dr Wendy W Chapman, Department of Biomedical Informatics, University of California San Diego, 9500 Gilman Dr, Bldg 2 #0728, La Jolla, California, USA; wwchapman{at}ucsd.edu


Clinical Pharmacology & Therapeutics | 2011

The emerging role of electronic medical records in pharmacogenomics.

Russell A. Wilke; Hua Xu; Joshua C. Denny; Dan M. Roden; Ronald M. Krauss; Catherine A. McCarty; Robert L. Davis; Todd C. Skaar; Jatinder K. Lamba; Guergana Savova

Health‐care information technology and genotyping technology are both advancing rapidly, creating new opportunities for medical and scientific discovery. The convergence of these two technologies is now facilitating genetic association studies of unprecedented size within the context of routine clinical care. As a result, the medical community will soon be presented with a number of novel opportunities to bring functional genomics to the bedside in the area of pharmacotherapy. By linking biological material to comprehensive medical records, large multi‐institutional biobanks are now poised to advance the field of pharmacogenomics through three distinct mechanisms: (i) retrospective assessment of previously known findings in a clinical practice‐based setting, (ii) discovery of new associations in huge observational cohorts, and (iii) prospective application in a setting capable of providing real‐time decision support. This review explores each of these translational mechanisms within a historical framework.


Journal of the American Medical Informatics Association | 2010

Leveraging informatics for genetic studies: use of the electronic medical record to enable a genome-wide association study of peripheral arterial disease

Iftikhar J. Kullo; Jin Fan; Jyotishman Pathak; Guergana Savova; Zeenat Ali; Christopher G. Chute

BACKGROUND There is significant interest in leveraging the electronic medical record (EMR) to conduct genome-wide association studies (GWAS). METHODS A biorepository of DNA and plasma was created by recruiting patients referred for non-invasive lower extremity arterial evaluation or stress ECG. Peripheral arterial disease (PAD) was defined as a resting/post-exercise ankle-brachial index (ABI) less than or equal to 0.9, a history of lower extremity revascularization, or having poorly compressible leg arteries. Controls were patients without evidence of PAD. Demographic data and laboratory values were extracted from the EMR. Medication use and smoking status were established by natural language processing of clinical notes. Other risk factors and comorbidities were ascertained based on ICD-9-CM codes, medication use and laboratory data. RESULTS Of 1802 patients with an abnormal ABI, 115 had non-atherosclerotic vascular disease such as vasculitis, Buerger's disease, trauma and embolism (phenocopies) based on ICD-9-CM diagnosis codes and were excluded. The PAD cases (66±11 years, 64% men) were older than controls (61±8 years, 60% men) but had similar geographical distribution and ethnic composition. Among PAD cases, 1444 (85.6%) had an abnormal ABI, 233 (13.8%) had poorly compressible arteries and 10 (0.6%) had a history of lower extremity revascularization. In a random sample of 95 cases and 100 controls, risk factors and comorbidities ascertained from EMR-based algorithms had good concordance compared with manual record review; the precision ranged from 67% to 100% and recall from 84% to 100%. CONCLUSION This study demonstrates use of the EMR to ascertain phenocopies, phenotype heterogeneity and relevant covariates to enable a GWAS of PAD. Biorepositories linked to EMR may provide a relatively efficient means of conducting GWAS.
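The PAD case definition in the abstract above is a three-way disjunction with a phenocopy exclusion, which can be sketched as a small predicate. This is an illustrative sketch only: the record type and its field names are hypothetical and do not come from the study. The second half checks that the reported subgroup counts and percentages are mutually consistent.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VascularLabRecord:
    """Hypothetical per-patient record; field names are illustrative only."""
    abi: Optional[float]           # resting/post-exercise ankle-brachial index, None if unavailable
    prior_revascularization: bool  # history of lower extremity revascularization
    poorly_compressible: bool      # poorly compressible leg arteries
    non_atherosclerotic: bool      # phenocopy flagged from ICD-9-CM diagnosis codes


def is_pad_case(r: VascularLabRecord) -> bool:
    """Apply the abstract's criteria: exclude phenocopies, then any one criterion qualifies."""
    if r.non_atherosclerotic:
        return False
    low_abi = r.abi is not None and r.abi <= 0.9
    return low_abi or r.prior_revascularization or r.poorly_compressible


# Consistency check on the counts reported in the abstract.
case_counts = {
    "abnormal ABI": 1444,
    "poorly compressible arteries": 233,
    "prior revascularization": 10,
}
total_cases = sum(case_counts.values())
shares = {name: round(100 * n / total_cases, 1) for name, n in case_counts.items()}
remaining_after_exclusion = 1802 - 115
```

The three subgroups sum to 1687, which matches 1802 minus the 115 excluded phenocopies, and the computed shares reproduce the reported 85.6%, 13.8%, and 0.6%.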


Inflammatory Bowel Diseases | 2013

Normalization of plasma 25-hydroxy vitamin D is associated with reduced risk of surgery in Crohn's disease.

Ashwin N. Ananthakrishnan; Vivian S. Gainer; Tianxi Cai; Su Chun Cheng; Guergana Savova; Pei Chen; Peter Szolovits; Zongqi Xia; Philip L. De Jager; Stanley Y. Shaw; Susanne Churchill; Elizabeth W. Karlson; Isaac S. Kohane; Robert M. Plenge; Shawn N. Murphy; Katherine P. Liao

Background: Vitamin D may have an immunologic role in Crohn’s disease (CD) and ulcerative colitis (UC). Retrospective studies suggested a weak association between vitamin D status and disease activity but have significant limitations. Methods: Using a multi-institution inflammatory bowel disease cohort, we identified all patients with CD and UC who had at least one measured plasma 25-hydroxy vitamin D (25(OH)D). Plasma 25(OH)D was considered sufficient at levels ≥30 ng/mL. Logistic regression models adjusting for potential confounders were used to identify impact of measured plasma 25(OH)D on subsequent risk of inflammatory bowel disease–related surgery or hospitalization. In a subset of patients where multiple measures of 25(OH)D were available, we examined impact of normalization of vitamin D status on study outcomes. Results: Our study included 3217 patients (55% CD; mean age, 49 yr). The median lowest plasma 25(OH)D was 26 ng/mL (interquartile range, 17–35 ng/mL). In CD, on multivariable analysis, plasma 25(OH)D <20 ng/mL was associated with an increased risk of surgery (odds ratio, 1.76; 95% confidence interval, 1.24–2.51) and inflammatory bowel disease–related hospitalization (odds ratio, 2.07; 95% confidence interval, 1.59–2.68) compared with those with 25(OH)D ≥30 ng/mL. Similar estimates were also seen for UC. Furthermore, patients with CD who had initial levels <30 ng/mL but subsequently normalized their 25(OH)D had a reduced likelihood of surgery (odds ratio, 0.56; 95% confidence interval, 0.32–0.98) compared with those who remained deficient. Conclusion: Low plasma 25(OH)D is associated with increased risk of surgery and hospitalizations in both CD and UC, and normalization of 25(OH)D status is associated with a reduction in the risk of CD-related surgery.


Journal of Biomedical Informatics | 2009

Automatically extracting cancer disease characteristics from pathology reports into a Disease Knowledge Representation Model

Anni Coden; Guergana Savova; Igor L. Sominsky; James J. Masanz; Karin Schuler; James W. Cooper; Wei Guan; Piet C. de Groen

We introduce an extensible and modifiable knowledge representation model to represent cancer disease characteristics in a comparable and consistent fashion. We describe a system, MedTAS/P, which automatically instantiates the knowledge representation model from free-text pathology reports. MedTAS/P is based on an open-source framework and its components use natural language processing principles, machine learning and rules to discover and populate elements of the model. To validate the model and measure the accuracy of MedTAS/P, we developed a gold-standard corpus of manually annotated colon cancer pathology reports. MedTAS/P achieves F1-scores of 0.97-1.0 for instantiating classes in the knowledge representation model such as histologies or anatomical sites, and F1-scores of 0.82-0.93 for primary tumors or lymph nodes, which require the extraction of relations. An F1-score of 0.65 is reported for metastatic tumors, a lower score predominantly due to a very small number of instances in the training and test sets.


Journal of the American Medical Informatics Association | 2013

Towards comprehensive syntactic and semantic annotations of the clinical narrative

Daniel Albright; Arrick Lanfranchi; Anwen Fredriksen; Will Styler; Colin Warner; Jena D. Hwang; Jinho D. Choi; Dmitriy Dligach; Rodney D. Nielsen; James H. Martin; Wayne H. Ward; Martha Palmer; Guergana Savova

Objective To create annotated clinical narratives with layers of syntactic and semantic labels to facilitate advances in clinical natural language processing (NLP). To develop NLP algorithms and open source components. Methods Manual annotation of a clinical narrative corpus of 127 606 tokens following the Treebank schema for syntactic information, PropBank schema for predicate-argument structures, and the Unified Medical Language System (UMLS) schema for semantic information. NLP components were developed. Results The final corpus consists of 13 091 sentences containing 1772 distinct predicate lemmas. Of the 766 newly created PropBank frames, 74 are verbs. There are 28 539 named entity (NE) annotations spread over 15 UMLS semantic groups, one UMLS semantic type, and the Person semantic category. The most frequent annotations belong to the UMLS semantic groups of Procedures (15.71%), Disorders (14.74%), Concepts and Ideas (15.10%), Anatomy (12.80%), Chemicals and Drugs (7.49%), and the UMLS semantic type of Sign or Symptom (12.46%). Inter-annotator agreement results: Treebank (0.926), PropBank (0.891–0.931), NE (0.697–0.750). The part-of-speech tagger, constituency parser, dependency parser, and semantic role labeler are built from the corpus and released open source. A significant limitation uncovered by this project is the need for the NLP community to develop a widely agreed-upon schema for the annotation of clinical concepts and their relations. Conclusions This project takes a foundational step towards bringing the field of clinical NLP up to par with NLP in the general domain. The corpus creation and NLP components provide a resource for research and application development that would have been previously impossible.


Inflammatory Bowel Diseases | 2013

Improving case definition of Crohn's disease and ulcerative colitis in electronic medical records using natural language processing: a novel informatics approach.

Ashwin N. Ananthakrishnan; Tianxi Cai; Guergana Savova; Su Chun Cheng; Pei Chen; Raul Guzman Perez; Vivian S. Gainer; Shawn N. Murphy; Peter Szolovits; Zongqi Xia; Stanley Y. Shaw; Susanne Churchill; Elizabeth W. Karlson; Isaac S. Kohane; Robert M. Plenge; Katherine P. Liao

Background: Previous studies identifying patients with inflammatory bowel disease using administrative codes have yielded inconsistent results. Our objective was to develop a robust electronic medical record–based model for classification of inflammatory bowel disease leveraging the combination of codified data and information from clinical text notes using natural language processing. Methods: Using the electronic medical records of 2 large academic centers, we created data marts for Crohn’s disease (CD) and ulcerative colitis (UC) comprising patients with ≥1 International Classification of Diseases, 9th edition, code for each disease. We used codified (i.e., International Classification of Diseases, 9th edition codes, electronic prescriptions) and narrative data from clinical notes to develop our classification model. Model development and validation were performed in a training set of 600 randomly selected patients for each disease with medical record review as the gold standard. Logistic regression with the adaptive LASSO penalty was used to select informative variables. Results: We confirmed 399 CD cases (67%) in the CD training set and 378 UC cases (63%) in the UC training set. For both, a combined model including narrative and codified data had better accuracy (area under the curve for CD 0.95; UC 0.94) than models using only disease International Classification of Diseases, 9th edition codes (area under the curve 0.89 for CD; 0.86 for UC). Addition of natural language processing narrative terms to our final model resulted in classification of 6% to 12% more subjects with the same accuracy. Conclusions: Inclusion of narrative concepts identified using natural language processing improves the accuracy of electronic medical records case definition for CD and UC while simultaneously identifying more subjects compared with models using codified data alone.


Journal of the American Medical Informatics Association | 2011

Drug side effect extraction from clinical narratives of psychiatry and psychology patients.

Sunghwan Sohn; Jean Pierre A Kocher; Christopher G. Chute; Guergana Savova

OBJECTIVE To extract physician-asserted drug side effects from electronic medical record clinical narratives. MATERIALS AND METHODS Pattern matching rules were manually developed through examining keywords and expression patterns of side effects to discover an individual side effect and causative drug relationship. A combination of machine learning (C4.5) using side effect keyword features and pattern matching rules was used to extract sentences that contain side effect and causative drug pairs, enabling the system to discover most side effect occurrences. Our system was implemented as a module within the clinical Text Analysis and Knowledge Extraction System. RESULTS The system was tested in the domain of psychiatry and psychology. The rule-based system extracting side effects and causative drugs produced an F score of 0.80 (0.55 excluding allergy section). The hybrid system identifying side effect sentences had an F score of 0.75 (0.56 excluding allergy section) but covered more side effect and causative drug pairs than individual side effect extraction. DISCUSSION The rule-based system was able to identify most side effects expressed by clear indication words. More sophisticated semantic processing is required to handle complex side effect descriptions in the narrative. We demonstrated that our system can be trained to identify sentences with complex side effect descriptions that can be submitted to a human expert for further abstraction. CONCLUSION Our system was able to extract most physician-asserted drug side effects. It can be used in either an automated mode for side effect extraction or semi-automated mode to identify side effect sentences that can significantly simplify abstraction by a human expert.


Journal of the American Medical Informatics Association | 2015

Evaluating the state of the art in disorder recognition and normalization of the clinical narrative

Sameer Pradhan; Noémie Elhadad; Brett R. South; David Martinez; Lee M. Christensen; Amy Vogel; Hanna Suominen; Wendy W. Chapman; Guergana Savova

Objective The ShARe/CLEF eHealth 2013 Evaluation Lab Task 1 was organized to evaluate the state of the art on the clinical text in (i) disorder mention identification/recognition based on Unified Medical Language System (UMLS) definition (Task 1a) and (ii) disorder mention normalization to an ontology (Task 1b). Such a community evaluation has not been previously executed. Task 1a included a total of 22 system submissions, and Task 1b included 17. Most of the systems employed a combination of rules and machine learners. Materials and methods We used a subset of the Shared Annotated Resources (ShARe) corpus of annotated clinical text—199 clinical notes for training and 99 for testing (roughly 180 K words in total). We provided the community with the annotated gold standard training documents to build systems to identify and normalize disorder mentions. The systems were tested on a held-out gold standard test set to measure their performance. Results For Task 1a, the best-performing system achieved an F1 score of 0.75 (0.80 precision; 0.71 recall). For Task 1b, another system performed best with an accuracy of 0.59. Discussion Most of the participating systems used a hybrid approach by supplementing machine-learning algorithms with features generated by rules and gazetteers created from the training data and from external resources. Conclusions The task of disorder normalization is more challenging than that of identification. The ShARe corpus is available to the community as a reference standard for future studies.
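The Task 1a result above quotes an F1 score alongside its precision and recall. As a quick sanity check, the balanced F-measure can be recomputed from those two figures. The function name below is an assumption for illustration, and the inputs are the rounded values from the abstract, so agreement is only expected to two decimal places.

```python
def f1_score(precision: float, recall: float) -> float:
    """Balanced F-measure: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded precision and recall reported for the best Task 1a system.
task_1a_f1 = f1_score(0.80, 0.71)
print(round(task_1a_f1, 2))  # 0.75
```

Task 1b is scored by accuracy rather than F1, so its 0.59 figure is not derivable from this formula.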

Collaboration


Dive into Guergana Savova's collaborations.

Top Co-Authors

Dmitriy Dligach

Loyola University Chicago

Timothy A. Miller

Boston Children's Hospital

Elizabeth W. Karlson

Brigham and Women's Hospital

Steven Bethard

University of Alabama at Birmingham

Chen Lin

Boston Children's Hospital
