Dingcheng Li
Mayo Clinic
Publications
Featured research published by Dingcheng Li.
Journal of the American Medical Informatics Association | 2013
Jyotishman Pathak; Kent R. Bailey; Calvin Beebe; Steven Bethard; David Carrell; Pei J. Chen; Dmitriy Dligach; Cory M. Endle; Lacey Hart; Peter J. Haug; Stanley M. Huff; Vinod Kaggal; Dingcheng Li; Hongfang D Liu; Kyle Marchant; James J. Masanz; Timothy A. Miller; Thomas A. Oniki; Martha Palmer; Kevin J. Peterson; Susan Rea; Guergana Savova; Craig Stancl; Sunghwan Sohn; Harold R. Solbrig; Dale Suesse; Cui Tao; David P. Taylor; Les Westberg; Stephen T. Wu
RESEARCH OBJECTIVE To develop scalable informatics infrastructure for normalization of both structured and unstructured electronic health record (EHR) data into a unified, concept-based model for high-throughput phenotype extraction. MATERIALS AND METHODS Software tools and applications were developed to extract information from EHRs. Representative and convenience samples of both structured and unstructured data from two EHR systems (Mayo Clinic and Intermountain Healthcare) were used for development and validation. Extracted information was standardized and normalized to meaningful use (MU) conformant terminology and value set standards using Clinical Element Models (CEMs). These resources were used to demonstrate semi-automatic execution of MU clinical-quality measures modeled using the Quality Data Model (QDM) and an open-source rules engine. RESULTS Using CEMs and open-source natural language processing and terminology services engines, namely Apache clinical Text Analysis and Knowledge Extraction System (cTAKES) and Common Terminology Services (CTS2), we developed a data-normalization platform that ensures data security, end-to-end connectivity, and reliable data flow within and across institutions. We demonstrated the applicability of this platform by executing a QDM-based MU quality measure that determines the percentage of patients between 18 and 75 years with diabetes whose most recent low-density lipoprotein cholesterol test result during the measurement year was <100 mg/dL, applied to a randomly selected cohort of 273 Mayo Clinic patients. The platform identified 21 and 18 patients for the denominator and numerator of the quality measure, respectively. Validation results indicate that all identified patients meet the QDM-based criteria. CONCLUSIONS End-to-end automated systems for extracting clinical information from diverse EHR systems require extensive use of standardized vocabularies and terminologies, as well as robust information models for storing, discovering, and processing that information. This study demonstrates the application of modular and open-source resources for enabling secondary use of EHR data through normalization into a standards-based, comparable, and consistent format for high-throughput phenotyping to identify patient cohorts.
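To make the measure logic concrete, here is a minimal sketch (not the authors' platform) of how the described denominator and numerator could be computed over already-normalized patient records. The record fields (age, has_diabetes, ldl_results) are hypothetical stand-ins for CEM-normalized data.

```python
# Illustrative sketch: evaluating the described QDM-style quality measure over
# already-normalized patient records. Field names are hypothetical.
from dataclasses import dataclass, field
from datetime import date
from typing import List, Tuple

@dataclass
class Patient:
    age: int
    has_diabetes: bool
    # (test date, LDL value in mg/dL) pairs from structured or NLP-normalized data
    ldl_results: List[Tuple[date, float]] = field(default_factory=list)

def evaluate_measure(patients, year):
    """Return (denominator, numerator) counts for the LDL control measure."""
    denominator, numerator = 0, 0
    for p in patients:
        # Denominator: patients aged 18-75 with diabetes
        if not (18 <= p.age <= 75 and p.has_diabetes):
            continue
        denominator += 1
        # Numerator: most recent LDL result in the measurement year is < 100 mg/dL
        in_year = [(d, v) for d, v in p.ldl_results if d.year == year]
        if in_year and max(in_year)[1] < 100:
            numerator += 1
    return denominator, numerator

cohort = [Patient(age=63, has_diabetes=True,
                  ldl_results=[(date(2012, 3, 1), 128.0), (date(2012, 9, 15), 92.0)])]
print(evaluate_measure(cohort, 2012))  # -> (1, 1)
```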
Journal of the American Medical Informatics Association | 2013
Sunghwan Sohn; Kavishwar B. Wagholikar; Dingcheng Li; Siddhartha Jonnalagadda; Cui Tao; Ravikumar Komandur Elayavilli; Hongfang Liu
BACKGROUND Temporal information detection systems have been developed by the Mayo Clinic for the 2012 i2b2 Natural Language Processing Challenge. OBJECTIVE To construct automated systems for EVENT/TIMEX3 extraction and temporal link (TLINK) identification from clinical text. MATERIALS AND METHODS The i2b2 organizers provided 190 annotated discharge summaries as the training set and 120 discharge summaries as the test set. Our Event system used a conditional random field classifier with a variety of features including lexical information, natural language elements, and medical ontology. The TIMEX3 system employed a rule-based method using regular expression pattern matching and systematic reasoning to determine normalized values. The TLINK system employed both rule-based reasoning and machine learning. All three systems were built in an Apache Unstructured Information Management Architecture framework. RESULTS Our TIMEX3 system performed the best (F-measure of 0.900, value accuracy 0.731) among the challenge teams. The Event system produced an F-measure of 0.870, and the TLINK system an F-measure of 0.537. CONCLUSIONS Our TIMEX3 system demonstrated that regular expression rules can effectively extract and normalize time information. The Event and TLINK machine learning systems required well-defined feature sets to perform well. We could also leverage expert knowledge as part of the machine learning features to further improve TLINK identification performance.
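The rule-based TIMEX3 idea can be illustrated with a small sketch: regular-expression matching plus simple reasoning against an anchor date to produce normalized values. The patterns and normalization rules below are toy examples, not the challenge system's.

```python
# Minimal sketch of regex-based time expression extraction and normalization.
import re
from datetime import date, timedelta

DATE_PATTERN = re.compile(r"\b(\d{1,2})/(\d{1,2})/(\d{4})\b")
RELATIVE_PATTERN = re.compile(r"\b(yesterday|today|tomorrow)\b", re.IGNORECASE)

def normalize_timex(text, anchor):
    """Yield (matched span, ISO-8601 value) pairs anchored to a reference date."""
    for m in DATE_PATTERN.finditer(text):
        month, day, year = map(int, m.groups())
        yield m.group(0), date(year, month, day).isoformat()
    offsets = {"yesterday": -1, "today": 0, "tomorrow": 1}
    for m in RELATIVE_PATTERN.finditer(text):
        delta = offsets[m.group(1).lower()]
        yield m.group(0), (anchor + timedelta(days=delta)).isoformat()

sentence = "The patient was admitted on 03/14/2012 and discharged yesterday."
for span, value in normalize_timex(sentence, anchor=date(2012, 3, 20)):
    print(span, "->", value)   # 03/14/2012 -> 2012-03-14, yesterday -> 2012-03-19
```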
Journal of the American Medical Informatics Association | 2012
Siddhartha Jonnalagadda; Dingcheng Li; Sunghwan Sohn; Stephen T. Wu; Kavishwar B. Wagholikar; Manabu Torii; Hongfang Liu
OBJECTIVE This paper describes the coreference resolution system submitted by Mayo Clinic for the 2011 i2b2/VA/Cincinnati shared task Track 1C. The goal of the task was to construct a system that links the markables corresponding to the same entity. MATERIALS AND METHODS The task organizers provided progress notes and discharge summaries that were annotated with the markables of treatment, problem, test, person, and pronoun. We used a multi-pass sieve algorithm that applies deterministic rules in decreasing order of precision and simultaneously gathers information about the entities in the documents. Our system, MedCoref, also uses a state-of-the-art machine learning framework as an alternative to the final, rule-based pronoun resolution sieve. RESULTS The best system that uses a multi-pass sieve has an overall score of 0.836 (average of the B³, MUC, BLANC, and CEAF F-scores) for the training set and 0.843 for the test set. DISCUSSION A supervised machine learning system that typically uses a single function to find coreferents cannot accommodate irregularities encountered in the data, especially given an insufficient number of examples. On the other hand, a completely deterministic system could lead to a decrease in recall (sensitivity) when the rules are not exhaustive. The sieve-based framework allows one to combine reliable machine learning components with rules designed by experts. CONCLUSION Using relatively simple rules, part-of-speech information, and semantic type properties, an effective coreference resolution system can be designed. The source code of the system described is available at https://sourceforge.net/projects/ohnlp/files/MedCoref.
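The multi-pass sieve idea can be sketched in a few lines: deterministic passes are applied in order of decreasing precision, and each pass merges clusters of markables it judges coreferent. The sieves below are toy rules for illustration only, not MedCoref's.

```python
# Illustrative multi-pass sieve: one full clustering pass per sieve.
def exact_match_sieve(a, b):
    return a["text"].lower() == b["text"].lower()

def head_match_sieve(a, b):
    return a["text"].lower().split()[-1] == b["text"].lower().split()[-1]

SIEVES = [exact_match_sieve, head_match_sieve]  # most precise first

def resolve(markables):
    """Each markable starts in its own cluster; each pass merges clusters
    containing mentions the current sieve judges coreferent."""
    clusters = [[m] for m in markables]
    for sieve in SIEVES:                          # one full pass per sieve
        merged = []
        for cluster in clusters:
            target = next((c for c in merged
                           if any(sieve(a, b) for a in cluster for b in c)), None)
            if target is not None:
                target.extend(cluster)            # merge into an existing cluster
            else:
                merged.append(cluster)
        clusters = merged
    return clusters

notes = [{"text": "the chest pain"}, {"text": "an ECG"}, {"text": "The chest pain"}]
print([[m["text"] for m in c] for c in resolve(notes)])
# -> [['the chest pain', 'The chest pain'], ['an ECG']]
```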
Biomedical Informatics Insights | 2012
Sunghwan Sohn; Manabu Torii; Dingcheng Li; Kavishwar B. Wagholikar; Stephen T. Wu; Hongfang Liu
This paper describes the sentiment classification system developed by the Mayo Clinic team for the 2011 i2b2/VA/Cincinnati Natural Language Processing (NLP) Challenge. The sentiment classification task is to assign any pertinent emotion to each sentence in suicide notes. We implemented three systems trained on suicide notes provided by the i2b2 challenge organizers: a machine learning system, a rule-based system, and a combination of both. Our machine learning system was trained on re-annotated data in which apparently inconsistent emotion assignments were adjusted. The RIPPER and multinomial Naïve Bayes classifiers, the manual pattern-matching rules, and the combination of the two systems were then tested on their ability to determine the emotions within sentences. The combination of the machine learning and rule-based systems performed best and produced a micro-average F-score of 0.5640.
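A hedged sketch of a hybrid rule-plus-classifier setup in the spirit of the system above: high-precision pattern rules take precedence, and a multinomial Naïve Bayes classifier handles the remaining sentences. The training sentences, emotion labels, and patterns are invented for illustration; scikit-learn is assumed.

```python
# Toy combination of pattern rules with a multinomial Naive Bayes classifier.
import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

RULES = [
    (re.compile(r"\bthank(s| you)\b", re.IGNORECASE), "thankfulness"),
    (re.compile(r"\b(sorry|forgive me)\b", re.IGNORECASE), "guilt"),
]

train_sentences = ["I am so sorry for everything", "Thank you for all your help",
                   "I cannot go on any longer", "I love you all very much"]
train_labels = ["guilt", "thankfulness", "hopelessness", "love"]

nb = make_pipeline(CountVectorizer(ngram_range=(1, 2)), MultinomialNB())
nb.fit(train_sentences, train_labels)

def classify(sentence):
    """Rules take precedence; the statistical classifier handles the rest."""
    for pattern, emotion in RULES:
        if pattern.search(sentence):
            return emotion
    return nb.predict([sentence])[0]

print(classify("Please forgive me for what I have done"))  # rule fires -> guilt
print(classify("I love you forever"))                      # falls back to the classifier
```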
Scientific Reports | 2016
Yue Yu; Jun Chen; Dingcheng Li; Liwei Wang; Wei Wang; Hongfang Liu
Increasing evidence has shown that sex differences exist in Adverse Drug Events (ADEs). Identifying those sex differences in ADEs could reduce the experience of ADEs for patients and could be conducive to the development of personalized medicine. In this study, we analyzed a normalized US Food and Drug Administration Adverse Event Reporting System (FAERS). A chi-squared test was conducted to discover which treatment regimens or drugs had sex differences in adverse events. Moreover, the reporting odds ratio (ROR) and P value were calculated to quantify the signals of sex differences for specific drug-event combinations. Logistic regression was applied to remove the confounding effect of the baseline sex difference of the events. We found that, among the 668 drugs in the 20 most frequent treatment regimens in the United States, 307 drugs showed sex differences in ADEs. In addition, we identified 736 unique drug-event combinations with significant sex differences. After removing the confounding effect of the baseline sex difference of the events, 266 combinations remained. Drug labels or previous studies verified some of these combinations, while others warrant further investigation.
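The disproportionality computation can be illustrated with a small worked sketch: a reporting odds ratio and chi-squared test on a 2x2 table comparing one drug-event combination between female and male reports. The counts are invented, and scipy is assumed to be available; this is not the study's code.

```python
# ROR and chi-squared test for one hypothetical drug-event combination.
import math
from scipy.stats import chi2_contingency

# 2x2 table:
#                 event reported    other events reported
# female reports        a                    b
# male reports          c                    d
a, b, c, d = 120, 880, 60, 940

ror = (a / b) / (c / d)
se_log_ror = math.sqrt(1/a + 1/b + 1/c + 1/d)        # standard error of log(ROR)
ci_low = math.exp(math.log(ror) - 1.96 * se_log_ror)
ci_high = math.exp(math.log(ror) + 1.96 * se_log_ror)

chi2, p_value, _, _ = chi2_contingency([[a, b], [c, d]])

print(f"ROR = {ror:.2f} (95% CI {ci_low:.2f}-{ci_high:.2f}), chi-squared p = {p_value:.3g}")
```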
Biomedical Informatics Insights | 2016
Vinod Kaggal; Ravikumar Komandur Elayavilli; Saeed Mehrabi; Joshua J. Pankratz; Sunghwan Sohn; Yanshan Wang; Dingcheng Li; Majid Mojarad Rastegar; Sean P. Murphy; Jason L. Ross; Rajeev Chaudhry; James D. Buntrock; Hongfang Liu
The concept of optimizing health care by understanding and generating knowledge from previous evidence, i.e., the Learning Health-care System (LHS), has gained momentum and now has national prominence. Meanwhile, the rapid adoption of electronic health records (EHRs) enables the data collection required to form the basis for facilitating the LHS. A prerequisite for using EHR data within the LHS is an infrastructure that enables access to EHR data longitudinally for health-care analytics and in real time for knowledge delivery. Additionally, significant clinical information is embedded in free text, making natural language processing (NLP) an essential component in implementing an LHS. Herein, we share our institutional implementation of a big data-empowered clinical NLP infrastructure, which not only enables health-care analytics but also has real-time NLP processing capability. The infrastructure has been utilized for multiple institutional projects, including the MayoExpertAdvisor, an individualized care recommendation solution for clinical care. We compared the big data infrastructure with two other computing environments; it significantly outperformed them in computing speed, demonstrating its value in making the LHS a possibility in the near future.
Studies in health technology and informatics | 2015
Saeed Mehrabi; Anand Krishnan; Alexandra M. Roch; Heidi Schmidt; Dingcheng Li; Joe Kesterson; Chris Beesley; Paul R. Dexter; C. Max Schmidt; Mathew J. Palakal; Hongfang Liu
In this study, we developed a rule-based natural language processing (NLP) system to identify patients with a family history of pancreatic cancer. The algorithm was developed in an Unstructured Information Management Architecture (UIMA) framework and consisted of section segmentation, relation discovery, and negation detection. The system was evaluated on data from two institutions. Family history identification precision was consistent across the institutions, shifting from 88.9% on the Indiana University (IU) dataset to 87.8% on the Mayo Clinic dataset. Customizing the algorithm on the Mayo Clinic data increased its precision to 88.1%. The family member relation discovery achieved a precision, recall, and F-measure of 75.3%, 91.6%, and 82.6%, respectively. Negation detection resulted in a precision of 99.1%. The results show that rule-based NLP approaches for specific information extraction tasks are portable across institutions; however, customization of the algorithm for the new dataset improves its performance.
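The shape of such a rule-based pipeline can be sketched briefly: locate the family history section, find pancreatic-cancer mentions, attach the nearest family-member term, and check a simple negation cue. The patterns below are toy examples, not the published system's rules.

```python
# Illustrative section segmentation, relation discovery, and negation detection.
import re

SECTION = re.compile(r"FAMILY HISTORY:\s*(.*?)(?=\n[A-Z ]+:|\Z)", re.DOTALL)
CANCER = re.compile(r"pancreatic (cancer|carcinoma)", re.IGNORECASE)
RELATIVE = re.compile(r"\b(mother|father|brother|sister|aunt|uncle|grandmother|grandfather)\b",
                      re.IGNORECASE)
NEGATION = re.compile(r"\b(no|denies|negative for)\b", re.IGNORECASE)

def extract_family_history(note):
    """Return (relative, negated) pairs for pancreatic-cancer mentions in the note."""
    findings = []
    for section in SECTION.findall(note):
        for sentence in re.split(r"(?<=[.;])\s+", section):
            if CANCER.search(sentence):
                relative = RELATIVE.search(sentence)
                negated = bool(NEGATION.search(sentence))
                findings.append((relative.group(0) if relative else None, negated))
    return findings

note = ("FAMILY HISTORY: Mother died of pancreatic cancer at age 62. "
        "Father denies any history of cancer.\nSOCIAL HISTORY: Nonsmoker.")
print(extract_family_history(note))  # -> [('Mother', False)]
```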
ieee embs international conference on biomedical and health informatics | 2014
Feichen Shen; Dingcheng Li; Hongfang Liu; Yugyung Lee; Jyotishman Pathak; Christopher G. Chute; Cui Tao
In this paper, we introduce our efforts to apply semantic web technologies to phenotyping algorithms over Electronic Health Record (EHR) data, with the goal of facilitating intelligent reasoning and inference about patient groups.
bioinformatics and biomedicine | 2013
Yuji Zhang; Dingcheng Li; Cui Tao; Feichen Shen; Hongfang Liu
A huge number of association relationships among biological entities (e.g., diseases, drugs, and genes) are scattered throughout the biomedical literature. Extracting and analyzing such heterogeneous data remains a challenging task for most researchers in the biomedical field. Natural language processing (NLP) has the potential to extract associations among biological entities from the literature. However, association information extracted through NLP can be large, noisy, and redundant, which poses significant challenges for biomedical researchers who want to use it. To address this challenge, we propose a computational framework to facilitate the use of NLP results. We apply Latent Dirichlet Allocation (LDA) to discover topics based on associations. The network extracted from each topic provides a disease-specific network for downstream bioinformatics analysis of its associations. We illustrated the framework through the construction of disease-specific networks from Semantic MEDLINE, an NLP-generated association database, followed by the analysis of network properties, such as hub nodes and degree distribution. The results demonstrate that (1) the LDA-based approach can group related diseases into the same disease topic; and (2) the disease-specific association networks follow the scale-free property, with hub nodes enriched in related diseases, genes, and drugs.
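Under simplified assumptions, the framework's two steps can be sketched as follows: run LDA over bags of associated entities to group diseases into topics, then build a per-topic association network and inspect its hub nodes. The toy association records stand in for Semantic MEDLINE output; scikit-learn and networkx are assumed.

```python
# Illustrative LDA topic grouping followed by per-topic network construction.
import networkx as nx
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Each pseudo-document lists the entities associated with one disease.
associations = {
    "type 2 diabetes": ["metformin", "insulin", "TCF7L2", "obesity"],
    "obesity":         ["leptin", "insulin", "metformin", "exercise"],
    "asthma":          ["albuterol", "IL13", "corticosteroid"],
}

docs = [" ".join(entities) for entities in associations.values()]
counts = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)
topic_of = lda.transform(counts).argmax(axis=1)   # dominant topic per disease

# Build one association network per topic and report its highest-degree (hub) nodes.
for topic in set(topic_of):
    g = nx.Graph()
    for (disease, entities), t in zip(associations.items(), topic_of):
        if t == topic:
            g.add_edges_from((disease, e) for e in entities)
    hubs = sorted(g.degree, key=lambda kv: kv[1], reverse=True)[:3]
    print(f"topic {topic}: hubs = {hubs}")
```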
international conference on bioinformatics | 2015
Dingcheng Li; Majid Rastegar-Mojarad; Ravikumar Komandur Elayavilli; Yanshan Wang; Saeed Mehrabi; Yue Yu; Sunghwan Sohn; Yanpeng Li; Naveed Afzal; Hongfang Liu
Clinical natural language processing (NLP) has become indispensable in the secondary use of electronic medical records (EMRs). However, current clinical NLP tools face portability problems across institutions. An ideal solution to this problem is cross-institutional data sharing. However, legal requirements prohibiting the disclosure of protected health information (PHI) obstruct this practice, even with state-of-the-art de-identification tools available. In this paper, we investigated the use of a frequency-filtering approach to extract PHI-free sentences using the Enterprise Data Trust (EDT), a large collection of EMRs at Mayo Clinic. Our approach is based on the assumption that sentences appearing frequently tend to contain no PHI. This assumption originates from the observation that the EDT contains many redundant descriptions of similar patient conditions. Both manual and automatic evaluations of the set of sentences with frequencies higher than one found no PHI. The promising results demonstrate the potential of sharing highly frequent sentences among institutions.
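A minimal sketch of the frequency-filtering assumption: sentences that recur verbatim across many notes are unlikely to contain PHI, so only sentences above a corpus-frequency threshold are released. The notes below are invented; a real run would stream millions of EMR sentences.

```python
# Toy frequency filter for selecting candidate PHI-free sentences.
import re
from collections import Counter

def sentences(text):
    """Split on sentence-final punctuation and normalize whitespace/case."""
    return [re.sub(r"\s+", " ", s).strip().lower()
            for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

notes = [
    "The patient denies chest pain. John Smith was seen on 03/14/2012.",
    "The patient denies chest pain. Follow up in two weeks.",
    "The patient denies chest pain. Follow up in two weeks.",
]

freq = Counter(s for note in notes for s in sentences(note))
shareable = [s for s, count in freq.items() if count > 1]   # frequency threshold
print(shareable)
# -> ['the patient denies chest pain.', 'follow up in two weeks.']
```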