Yi Guan
Harbin Institute of Technology
Publication
Featured research published by Yi Guan.
Journal of Biomedical Informatics | 2014
Xinbo Lv; Yi Guan; Benyang Deng
Machine learning methods usually assume that training data and test data are drawn from the same distribution. However, this assumption often cannot be satisfied in the task of clinical concept extraction. The main aim of this paper was to use training data from one institution to build a concept extraction model for data from another institution with a different distribution. An instance-based transfer learning method, TrAdaBoost, was applied in this work. To prevent negative transfer with TrAdaBoost, we integrated it with Bagging, which provides a softer weight-update mechanism while requiring only a tiny amount of training data from the target domain. Two data sets named BETH and PARTNERS from the 2010 i2b2/VA challenge, as well as BETHBIO, a data set we constructed ourselves, were employed to show the transfer ability of our method. Our method outperforms the baseline model by 2.3% and 4.4% in the two experiments BETH vs. PARTNERS and BETHBIO vs. PARTNERS, respectively, where the baseline model is trained on the combined training data from the source domain and the target domain. Additionally, confidence intervals for the performance metrics suggest that our method's results are statistically significant. Moreover, we explore the applicability of our method in further experiments. With our method, only a tiny amount of labeled data from the target domain is required to build a concept extraction model that produces better performance.
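As a rough illustration of the instance re-weighting behind TrAdaBoost, the sketch below implements one round of the weight update from the standard TrAdaBoost formulation. It is not the Bagging-integrated variant described in the abstract; the function name, its arguments, and the clamping of the error rate are assumptions made for this example.

```python
import math

def tradaboost_update(w_src, w_tgt, err_src, err_tgt, n_rounds):
    """One round of the standard TrAdaBoost weight update (illustrative).

    w_src, w_tgt     : current weights of source / target instances
    err_src, err_tgt : 0/1 flags, 1 where the weak learner misclassified
    n_rounds         : total number of boosting rounds planned
    """
    n = len(w_src)
    # Weighted error measured on the target-domain instances only.
    eps = sum(w * e for w, e in zip(w_tgt, err_tgt)) / sum(w_tgt)
    # Assumption for this sketch: clamp eps so that beta_t stays in (0, 1).
    eps = min(max(eps, 1e-10), 0.499)
    beta_t = eps / (1.0 - eps)
    beta = 1.0 / (1.0 + math.sqrt(2.0 * math.log(n) / n_rounds))
    # Misclassified source instances lose weight (they look unlike the target).
    new_src = [w * beta ** e for w, e in zip(w_src, err_src)]
    # Misclassified target instances gain weight, as in AdaBoost.
    new_tgt = [w * beta_t ** (-e) for w, e in zip(w_tgt, err_tgt)]
    return new_src, new_tgt

new_src, new_tgt = tradaboost_update(
    [1.0, 1.0], [1.0, 1.0, 1.0, 1.0], [1, 0], [1, 0, 0, 0], n_rounds=10
)
```

Source instances that the weak learner gets wrong are down-weighted, while target instances it gets wrong are up-weighted, which is the asymmetry that lets a small target set steer the model.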
2016 New York Scientific Data Summit (NYSDS) | 2016
Xishuang Dong; Lijun Qian; Yi Guan; Lei Huang; Qiubin Yu; Jinfeng Yang
Research on named entity recognition (NER) in electronic medical records (EMRs) has focused on verifying whether NER methods developed for traditional texts are also effective for EMRs, and no model has been proposed for enhancing NER performance via deep learning from the perspective of multiclass classification. In this paper, we annotate a real EMR corpus for model training and evaluation. We then present a convolutional neural network (CNN) based multiclass classification method for mining named entities from EMRs. The method consists of two phases. In phase 1, EMRs are pre-processed to represent samples with word embeddings. In phase 2, the method is built by segmenting the training data into many subsets and training a CNN binary classification model on each subset. Experimental results showed the effectiveness of our method.
Journal of Biomedical Informatics | 2017
Bin He; Bin Dong; Yi Guan; Jinfeng Yang; Zhipeng Jiang; Qiubin Yu; Jianyi Cheng; Chunyan Qu
OBJECTIVE: To build a comprehensive corpus covering syntactic and semantic annotations of Chinese clinical texts, with corresponding annotation guidelines and methods, and to develop tools trained on the annotated corpus that supply baselines for research on Chinese texts in the clinical domain.
MATERIALS AND METHODS: An iterative annotation method was proposed to train annotators and to develop annotation guidelines. Then, by using annotation quality assurance measures, a comprehensive corpus was built, containing annotations of part-of-speech (POS) tags, syntactic tags, entities, assertions, and relations. Inter-annotator agreement (IAA) was calculated to evaluate the annotation quality, and a Chinese clinical text processing and information extraction system (CCTPIES) was developed based on our annotated corpus.
RESULTS: The syntactic corpus consists of 138 Chinese clinical documents with 47,426 tokens and 2612 full parsing trees, while the semantic corpus includes 992 documents annotated with 39,511 entities, their assertions, and 7693 relations. IAA evaluation shows that this comprehensive corpus is of good quality and that the system modules are effective.
DISCUSSION: The annotated corpus makes a considerable contribution to natural language processing (NLP) research into Chinese texts in the clinical domain. However, this corpus has a number of limitations. Some additional types of clinical text should be introduced to improve corpus coverage, and active learning methods should be utilized to improve annotation efficiency.
CONCLUSIONS: In this study, several annotation guidelines and an annotation method for Chinese clinical texts were proposed, and a comprehensive corpus with its NLP modules was constructed, providing a foundation for further study of applying NLP techniques to Chinese texts in the clinical domain.
Computer Methods and Programs in Biomedicine | 2017
Chao Zhao; Jingchi Jiang; Zhiming Xu; Yi Guan
BACKGROUND AND OBJECTIVE: Electronic medical records (EMRs) contain medical knowledge that can be used for clinical decision support. We attempt to integrate this medical knowledge into a complex network, and then implement a diagnosis model based on this network.
METHODS: The dataset of our study contains 992 records, which are uniformly sampled from different departments of the hospital. To integrate the knowledge of these records, an EMR-based medical knowledge network (EMKN) is constructed. This network takes medical entities as nodes and co-occurrence relationships between two entities as edges. Selected properties of this network are analyzed. To make use of this network, a basic diagnosis model is implemented. Seven hundred records are randomly selected to re-construct the network, and the remaining 292 records are used as test records. The vector space model is applied to represent the relationships between diseases and symptoms. Because a record may contain more than one actual disease, the recall rate of the first ten results and the average precision are adopted as evaluation measures.
RESULTS: Compared with a random network of the same size, this network has a similar average path length but a much higher clustering coefficient. Additionally, there are direct correlations between the community structure and the real department classes in the hospital. For the diagnosis model, the vector space model using disease as a base obtains the best result. At least one correct disease appears among the first ten results in 73.27% of the records.
CONCLUSION: We constructed an EMR-based medical knowledge network by extracting the medical entities. This network has the small-world and scale-free properties. Moreover, the community structure showed that entities in the same department tend to cluster together. Based on this network, a diagnosis model was proposed. This model uses only the symptoms as inputs and is not restricted to a specific disease. The experiments demonstrated that EMKN is a simple and universal technique for integrating different medical knowledge from EMRs, and that it can be used for clinical decision support.
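The network construction described in this abstract (entities as nodes, co-occurrence within a record as edges) and the clustering coefficient it reports can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the sample records are made up, and the clustering coefficient shown is the common unweighted local definition averaged over nodes, which may differ from the exact measure used in the paper.

```python
from collections import Counter
from itertools import combinations

def build_cooccurrence_network(records):
    """Return {(a, b): count} for every unordered entity pair sharing a record."""
    edges = Counter()
    for entities in records:
        # Deduplicate and sort so each pair has one canonical key.
        for a, b in combinations(sorted(set(entities)), 2):
            edges[(a, b)] += 1
    return edges

def average_clustering(edges):
    """Mean local clustering coefficient over all nodes (edge weights ignored)."""
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    total = 0.0
    for nbrs in neighbours.values():
        k = len(nbrs)
        if k < 2:
            continue  # nodes with fewer than two neighbours contribute 0
        links = sum(1 for u, v in combinations(sorted(nbrs), 2) if v in neighbours[u])
        total += 2.0 * links / (k * (k - 1))
    return total / len(neighbours) if neighbours else 0.0

# Made-up records, each reduced to the entities extracted from it.
records = [
    ["cough", "fever", "pneumonia"],
    ["cough", "fever", "influenza"],
    ["chest pain", "hypertension"],
]
network = build_cooccurrence_network(records)
clustering = average_clustering(network)
```

Storing each edge under a sorted pair keeps the graph undirected, and the edge counts can later serve as co-occurrence weights for a vector-space representation of diseases and symptoms.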
Journal of Biomedical Informatics | 2017
Zhipeng Jiang; Chao Zhao; Bin He; Yi Guan; Jingchi Jiang
The CEGS N-GRID 2016 Shared Task 1 in Clinical Natural Language Processing focuses on the de-identification of psychiatric evaluation records. This paper describes two participating systems of our team, based on conditional random fields (CRFs) and long short-term memory networks (LSTMs). A pre-processing module was introduced for sentence detection and tokenization before de-identification. For CRFs, manually extracted rich features were utilized to train the model. For LSTMs, a character-level bi-directional LSTM network was applied to represent tokens and classify tags for each token, following which a decoding layer was stacked to decode the most probable protected health information (PHI) terms. The LSTM-based system attained an i2b2 strict micro-F1 measure of 0.8986, which was higher than that of the CRF-based system.
International Conference on Systems Engineering | 2015
Jinfeng Yang; Xishuang Dong; Yi Guan
Words are the basic structural units of language and combine with each other to form sentences. Learning the combination strengths between words is of key importance for sentence structure analysis. Inspired by the analogies between words and lymphocytes, a multi-word-agent autonomous learning model based on an artificial immune system is proposed to learn word combination strengths. The model is constructed using cellular automata, and words are modeled as B cell word agents and as antigen word agents. Furthermore, combination strengths between words are viewed as affinities of specific recognition relations between B cells and antigens. This research provides a new perspective on language and words and introduces biological inspiration from the immune system into the proposed model. The most significant advantages of the model are its ability to learn continuously and its concise implementation. The effectiveness of the model is verified by sentence dependency parsing. Experimental results on Penn Chinese Treebank 5.1 indicate that our model can learn word combination strengths effectively and continuously.
Computer Methods and Programs in Biomedicine | 2018
Jingchi Jiang; Jing Xie; Chao Zhao; Jia Su; Yi Guan; Qiubin Yu
BACKGROUND AND OBJECTIVE: The application of medical knowledge strongly affects the performance of intelligent diagnosis, and the method of learning the weights of medical knowledge plays a substantial role in probabilistic graphical models (PGMs). The purpose of this study is to investigate a discriminative weight-learning method based on a medical knowledge network (MKN).
METHODS: We propose a training model called the maximum margin medical knowledge network (M3KN), which is strictly derived for calculating the weight of medical knowledge. Using the definition of a reasonable margin, the weight learning can be transformed into a margin optimization problem. To solve the optimization problem, we adopt a sequential minimal optimization (SMO) algorithm and the clique property of a Markov network. Ultimately, M3KN not only incorporates the inference ability of PGMs but also deals with high-dimensional logic knowledge.
RESULTS: The experimental results indicate that M3KN obtains a higher F-measure score than the maximum likelihood learning algorithm of MKN for both Chinese Electronic Medical Records (CEMRs) and Blood Examination Records (BERs). Furthermore, the proposed approach clearly outperforms some classical machine learning algorithms for medical diagnosis. To demonstrate the importance of domain knowledge, we numerically verify that the diagnostic accuracy of M3KN gradually improves as the number of learned CEMRs, which contain important medical knowledge, increases.
CONCLUSIONS: Our experimental results show that the proposed method performs reliably for learning the weights of medical knowledge. M3KN outperforms other existing methods by achieving an F-measure of 0.731 for CEMRs and 0.4538 for BERs. This further illustrates that M3KN can facilitate investigations of intelligent healthcare.
Artificial Intelligence in Medicine | 2018
Chao Zhao; Jingchi Jiang; Yi Guan; Xitong Guo; Bin He
OBJECTIVE: Electronic medical records (EMRs) contain medical knowledge that can be used for clinical decision support (CDS). Our objective is to develop a general system that can extract and represent knowledge contained in EMRs to support three CDS tasks (test recommendation, initial diagnosis, and treatment plan recommendation) given the condition of a patient.
METHODS: We extracted four kinds of medical entities from records and constructed an EMR-based medical knowledge network (EMKN), in which nodes are entities and edges reflect their co-occurrence in a record. Three bipartite subgraphs (bigraphs) were extracted from the EMKN, one to support each task. One part of the bigraph was the given condition (e.g., symptoms), and the other was the condition to be inferred (e.g., diseases). Each bigraph was regarded as a Markov random field (MRF) to support the inference. We proposed three graph-based energy functions and three likelihood-based energy functions. Two of these functions are based on knowledge representation learning and can provide distributed representations of medical entities. Two EMR datasets and three metrics were utilized to evaluate the performance.
RESULTS: As a whole, the evaluation results indicate that the proposed system outperformed the baseline methods. The distributed representation of medical entities does reflect similarity relationships with respect to knowledge level.
CONCLUSION: Combining EMKN and MRF is an effective approach for general medical knowledge representation and inference. Different tasks, however, require individually designed energy functions.
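The bigraph extraction step in this abstract can be pictured with a small sketch: starting from weighted co-occurrence edges and a type label per entity, keep only the edges that join a "given" entity type to a "to be inferred" type. The function name, the edge and type dictionaries, and the sample entities are all hypothetical, and the energy functions defined over the resulting bigraph are not reproduced here.

```python
from collections import defaultdict

def extract_bigraph(edges, entity_type, given_type, target_type):
    """Keep only edges that join a `given_type` node to a `target_type` node.

    edges       : {(a, b): weight} undirected co-occurrence edges
    entity_type : {entity: type label}
    Returns {given_entity: {target_entity: weight}}.
    """
    bigraph = defaultdict(dict)
    for (a, b), weight in edges.items():
        # An undirected edge may be stored in either orientation.
        for given, target in ((a, b), (b, a)):
            if (entity_type.get(given) == given_type
                    and entity_type.get(target) == target_type):
                bigraph[given][target] = weight
    return dict(bigraph)

# Made-up entities and weights, for illustration only.
edges = {
    ("cough", "pneumonia"): 3,
    ("cough", "fever"): 5,
    ("chest x-ray", "pneumonia"): 2,
    ("fever", "influenza"): 4,
}
entity_type = {
    "cough": "symptom", "fever": "symptom",
    "pneumonia": "disease", "influenza": "disease",
    "chest x-ray": "test",
}
symptom_disease = extract_bigraph(edges, entity_type, "symptom", "disease")
```

Calling the same function with a different pair of types (for example disease and test) would yield the bigraph for another task, which matches the one-bigraph-per-task design described above.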
BMC Medical Informatics and Decision Making | 2017
Jia Su; Bin He; Yi Guan; Jingchi Jiang; Jinfeng Yang
Background: Cardiovascular disease (CVD) has become the leading cause of death in China, and most cases can be prevented by controlling risk factors. The goal of this study was to build a corpus of CVD risk factor annotations based on Chinese electronic medical records (CEMRs). This corpus is intended to be used to develop a risk factor information extraction system that, in turn, can serve as a foundation for further study of the progress of risk factors and CVD.
Results: We designed a light annotation task to capture CVD risk factors with indicators, temporal attributes and assertions that were explicitly or implicitly displayed in the records. The task included: 1) preparing data; 2) creating guidelines for capturing annotations (these were created with the help of clinicians); 3) proposing an annotation method that includes drafting the guidelines, training the annotators and updating the guidelines, and constructing the corpus. We also proposed several novel annotation guidelines: (1) under-threshold medical examination values were annotated for our purpose of studying the progress of risk factors and CVD; (2) possible and negative risk factors were considered for the same reason, and we created assertions for annotations; (3) we added four temporal attributes to CVD risk factors in CEMRs for constructing long-term variations. Then, a risk factor annotated corpus based on de-identified discharge summaries and progress notes from 600 patients was developed. Built with the help of clinicians, this corpus has an inter-annotator agreement (IAA) F1-measure of 0.968, indicating high reliability.
Conclusion: To the best of our knowledge, this is the first annotated corpus concerning CVD risk factors in CEMRs, and guidelines for capturing CVD risk factor annotations from CEMRs were proposed. The obtained document-level annotations can be applied in future studies to monitor risk factors and CVD over the long term.
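An IAA F1-measure such as the one reported above is commonly computed by treating one annotator's labels as the reference and the other's as the response. The sketch below shows that calculation under the assumption that each annotation can be represented as a hashable tuple and that only exact matches count; the function name, the tuple layout, and the sample annotations are hypothetical, and the paper's exact matching criteria may differ.

```python
def iaa_f1(annotations_a, annotations_b):
    """F1 agreement between two annotators under exact matching.

    Annotator A is treated as the reference and B as the response;
    the resulting F1 is the same if the roles are swapped.
    """
    a, b = set(annotations_a), set(annotations_b)
    matched = len(a & b)
    if matched == 0:
        return 0.0
    precision = matched / len(b)
    recall = matched / len(a)
    return 2 * precision * recall / (precision + recall)

# Made-up annotations: (document id, risk factor, assertion).
annotator_a = [
    ("doc1", "hypertension", "present"),
    ("doc1", "smoking", "absent"),
    ("doc2", "diabetes", "possible"),
]
annotator_b = [
    ("doc1", "hypertension", "present"),
    ("doc1", "smoking", "present"),
    ("doc2", "diabetes", "possible"),
]
agreement = iaa_f1(annotator_a, annotator_b)
```

Here the two annotators disagree only on the assertion for one risk factor, so two of the three annotations on each side match.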
International Conference on Information Science and Technology | 2015
Jinfeng Yang; Yi Guan; Xishuang Dong
Striking analogies between human language and the immune system were first suggested by Jerne, an immunologist, in his Nobel lecture. Inspired by Jerne's view, we revisit these analogies and propose that words can be regarded as B cells. Based on this analogy, a multi-word-agent autonomous learning model (MWAALM) is designed and implemented to learn the strengths of combinative relations between words, which are the rules governing the order in which words compose a sentence. The model is validated on the tasks of dependency parsing and word similarity computation. Results show that MWAALM performs well on both tasks, and the analogies between words and B cells are shown to be plausible.