Two-stage Federated Phenotyping and Patient Representation Learning
Dianbo Liu
CHIP, Boston Children's Hospital, Harvard Medical School, Boston, MA, USA
[email protected]

Dmitriy Dligach
Loyola University Chicago, Chicago, IL, USA
[email protected]

Timothy Miller
CHIP, Boston Children's Hospital, Harvard Medical School, Boston, MA, USA
[email protected]
Abstract
A large percentage of medical information is in unstructured text format in electronic medical record systems. Manual extraction of information from clinical notes is extremely time consuming. Natural language processing (NLP) has been widely used in recent years for automatic information extraction from medical texts. However, algorithms trained on data from a single healthcare provider are not generalizable and are error-prone due to the heterogeneity and uniqueness of medical documents. We developed a two-stage federated natural language processing method that enables utilization of clinical notes from different hospitals or clinics without moving the data, and demonstrated its performance using obesity and comorbidity phenotyping as the medical task. This approach not only improves the quality of a specific clinical task but also facilitates knowledge progression in the whole healthcare system, which is an essential part of a learning health system. To the best of our knowledge, this is the first application of federated machine learning in clinical NLP.
Introduction

Clinical notes and other unstructured data in plain text are valuable resources for medical informatics studies and machine learning applications in healthcare. In clinical settings, more than 70% of information is stored as unstructured text. Converting the unstructured data into useful structured representations will not only help data analysis but also improve efficiency in clinical practice (Jagannathan et al., 2009; Kreimeyer et al., 2017; Ford et al., 2016; Demner-Fushman et al., 2009; Murff et al., 2011; Friedman et al., 2004). Manual extraction of information from the vast volume of notes in electronic health record (EHR) systems is too time consuming.

To automatically retrieve information from unstructured notes, natural language processing (NLP) has been widely used. NLP is a subfield of computer science, developing for more than 50 years, that focuses on intelligent processing of human languages (Manning et al., 1999). A combination of hard-coded rules and machine learning methods has been used in the field, with machine learning currently being the dominant paradigm.

Automatic phenotyping is a task in clinical NLP that aims to identify cohorts of patients that match a predefined set of criteria. Supervised machine learning is currently the main approach to phenotyping, but the availability of annotated data hinders progress on this task. In this work, we consider a scenario where multiple institutions have access to relatively small amounts of annotated data for a particular phenotype, and this amount is not sufficient for training an accurate classifier. On the other hand, combining data from these institutions can lead to a high-accuracy classifier, but direct data sharing is not possible due to operational and privacy concerns.

Another problem we consider is learning patient representations that can be used to train accurate phenotyping classifiers.
The goal of patient representation learning is mapping the text of notes for a patient to a fixed-length dense vector (embedding). Patient representation learning has been done in supervised (Dligach and Miller, 2018) and unsupervised (Miotto et al., 2016) settings. In both cases, patient representation learning requires massive amounts of data. As in the scenario we outlined in the previous paragraph, combining data from several institutions can lead to higher quality patient representations, which in turn will improve the accuracy of phenotyping classifiers. However, direct data sharing, again, is difficult or impossible.

To tackle the challenges mentioned above, we developed a federated machine learning method to utilize clinical notes from multiple sources, both for learning patient representations and for training phenotype classifiers. Federated machine learning is a framework in which machine learning models are trained in a distributed and collaborative manner without centralized data (Liu et al., 2018a; McMahan et al., 2016; Bonawitz et al., 2019; Konečný et al., 2016; Huang et al., 2018; Huang and Liu, 2019). The strategy of federated learning has recently been adopted in the medical field for machine learning tasks based on structured data (Liu et al., 2018a; Huang et al., 2018; Liu et al., 2018b). However, to the best of our knowledge, this work is the first time a federated learning strategy has been used in medical NLP.

We developed our two-stage federated natural language processing method based on previous work on patient representations (Dligach and Miller, 2018). The first stage of our proposed federated learning scheme is supervised patient representation learning. Machine learning models are trained using medical notes from a large number of hospitals or clinics without moving or aggregating the notes. The notes used in this stage need not be directly relevant to a specific medical task of interest.
At the second stage, representations from the clinical notes directly related to the phenotyping task are extracted using the model obtained from stage 1, and a machine learning model specific to the medical task is trained.

Clinicians spend a significant amount of time reviewing clinical notes. This time can be saved or reduced with reasonably designed NLP technologies. One such task is phenotyping from medical notes. In this study, we demonstrate, using phenotyping from clinical notes as a clinical task (Conway et al., 2011; Dligach and Miller, 2018), that the method we developed makes it possible to utilize notes from a wide range of hospitals without moving the data.

The ability to utilize clinical notes distributed across different healthcare providers not only benefits a specific clinical practice task but also facilitates building a learning healthcare system, in which meaningful use of knowledge in distributed clinical notes will speed up the progression of medical knowledge to translational research, tool development, and healthcare quality assessment (Friedman et al., 2010; Blumenthal and Tavenner, 2010). Without the need for data movement, the speed of information flow can approach real time and make a rapid learning healthcare system possible (Slutsky, 2007; Friedman et al., 2014; Abernethy et al., 2010).

Data

Two datasets were used in this study. The MIMIC-III corpus (Johnson et al., 2016) was used for representation learning. This corpus contains information for more than 58,000 admissions of more than 45,000 patients admitted to Beth Israel Deaconess Medical Center in Boston between 2001 and 2012. Relevant to this study, MIMIC-III includes clinical notes, ICD9 diagnostic codes, ICD9 procedure codes, and CPT codes. The notes were processed with cTAKES (https://ctakes.apache.org) to extract UMLS concept unique identifiers (CUIs). Following the cohort selection protocol from (Dligach and Miller, 2018), patients with over 10,000 CUIs were excluded from this study.
We obtained a cohort of 44,211 patients in total.

The Informatics for Integrating Biology to the Bedside (i2b2) Obesity challenge dataset was used to train phenotyping models (Uzuner, 2009). The dataset consists of 1237 discharge summaries from Partners HealthCare in Boston. Patients in this cohort were annotated with respect to obesity and its comorbidities. In this study we consider the more challenging intuitive version of the task. The discharge summaries were annotated with obesity and its 15 most common comorbidities; the presence, absence, or uncertainty (questionable) of each was used as the ground truth label in the phenotyping task. Table 1 shows the number of examples of each class for each phenotype. Thus, we build phenotyping models for 16 different diseases.

At the representation learning stage (stage 1), all notes for a patient were aggregated into a single document. CUIs extracted from the text were used as input features. ICD-9 and CPT codes for the patient were used as labels for supervised representation learning.

Table 1: i2b2 cohort of obesity comorbidities

Disease                  Present  Absent  Questionable
Asthma                   86       596     0
CAD                      391      265     5
CHF                      308      318     1
Depression               142      555     0
Diabetes                 473      205     5
GERD                     144      447     1
Gallstones               101      609     0
Gout                     94       616     2
Hypercholesterolemia     315      287     1
Hypertension             511      127     0
Hypertriglyceridemia     37       665     0
OA                       117      554     1
OSA                      99       606     8
Obesity                  285      379     1
PVD                      110      556     1
Venous Insufficiency     54       577     0

At the phenotyping stage (stage 2), CUIs extracted from the discharge summaries were used as input features. Annotations of present, absent, or questionable for each of the 16 diagnoses for each patient were used as multi-class classification labels.
Methods

We envision that clinical textual data can be useful in at least two ways: (1) for pre-training patient representation models, and (2) for training phenotyping models.

In this study, a patient representation refers to a fixed-length vector derived from clinical notes that encodes all essential information about the patient. A patient representation model trained on massive amounts of text data can be useful for a wide range of clinical applications. A phenotyping model, on the other hand, captures the way a specific medical condition works, by learning a function that can predict a disease (e.g., asthma) from the text of the notes.

Until recently, phenotyping models have been trained from scratch, omitting stage (1), but recent work (Dligach and Miller, 2018) included a pre-training step, which derived dense patient representations from data linking large amounts of patient notes to ICD codes. Their work showed that including the pre-training step led to learning patient representations that were more accurate for a number of phenotyping tasks.

Our goal here is to develop methods for federated learning for both (1) pre-training patient representations and (2) phenotyping tasks. These methods will allow researchers and clinicians to utilize data from multiple healthcare providers, without the need to share the data directly, obviating issues related to data transfer and privacy.

To achieve this goal, we design a two-stage federated NLP approach (Figure 1). In the first stage, following (Dligach and Miller, 2018), we pre-train a patient representation model by training an artificial neural network (ANN) to predict ICD and CPT codes from the text of the notes. We extend the methods from (Dligach and Miller, 2018) to facilitate federated training. In the second stage, a phenotyping machine learning model is trained in a federated manner using clinical notes that are distributed across multiple sites for the target phenotype.
In this stage, the notes mapped to fixed-length representations from stage (1) are used as input features, and whether the patient has a certain disease is used as a label with one of three classes: present, absent, or questionable.

In the following sections, we first describe a simple note pre-processing step. We then discuss the method for pre-training patient representations and the method for training phenotyping models. Finally, we describe our framework for performing the latter two steps in a federated manner.

Figure 1: Two-stage federated natural language processing for clinical note phenotyping. In the first stage, a patient representation model was trained using an artificial neural network (ANN) to predict ICD and CPT codes from the text of the notes from a wide range of healthcare providers. The model without its output layer was then used as a "representation extractor" in the next stage. In the second stage, a phenotyping support vector machine model was trained in a federated manner using clinical notes for the target phenotype distributed across multiple silos.
Pre-processing

All of our models rely on standardized medical vocabulary automatically extracted from the text of the notes rather than on raw text. To obtain medically relevant information from clinical notes, Unified Medical Language System (UMLS) concept unique identifiers (CUIs) were extracted from each note using Apache cTAKES (https://ctakes.apache.org). UMLS is a resource that brings together many health and biomedical vocabularies and standardizes them to enable interoperability between computer systems. The Metathesaurus is a large, multi-purpose, and multi-lingual vocabulary that contains information about biomedical and health related concepts, their various names, and the relationships among them. The Metathesaurus structure has four layers: Concept Unique Identifiers (CUIs), Lexical Unique Identifiers (LUIs), String Unique Identifiers (SUIs), and Atom Unique Identifiers (AUIs). In this study, we focus on CUIs, each of which represents a single medical meaning. Our models use UMLS CUIs as input.
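To make the input format concrete, the following is a minimal sketch (not the authors' code) of how per-patient CUI lists could be turned into fixed-length integer sequences for an embedding-based model. The CUI strings, vocabulary scheme, and padding convention are illustrative assumptions.

```python
# Minimal sketch: per-patient CUI lists -> fixed-length integer id sequences.
# CUI strings and the vocabulary scheme are illustrative assumptions.

def build_vocab(patients):
    """Map every CUI seen in the corpus to an integer id (0 is reserved for padding)."""
    vocab = {}
    for cuis in patients:
        for cui in cuis:
            vocab.setdefault(cui, len(vocab) + 1)
    return vocab

def encode(cuis, vocab, max_len):
    """Convert one patient's CUI list to a fixed-length id sequence, right-padded with 0."""
    ids = [vocab.get(c, 0) for c in cuis[:max_len]]
    return ids + [0] * (max_len - len(ids))

patients = [["C0004096", "C0011849"], ["C0020538"]]  # illustrative CUIs
vocab = build_vocab(patients)
print(encode(patients[0], vocab, 4))  # -> [1, 2, 0, 0]
```

Unknown CUIs map to the padding id here; a real pipeline might instead reserve a dedicated out-of-vocabulary id.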
Patient representation learning

We adapted the architecture from (Dligach and Miller, 2018) for pre-training patient representations: a deep averaging network (DAN) that consists of an embedding layer, an average pooling layer, a dense layer, and multiple sigmoid outputs, where each output corresponds to an ICD or CPT code being predicted. This architecture takes CUIs as input and is trained using a binary cross-entropy loss function to predict ICD and CPT codes. After the model is trained, the dense layer can be used to represent a patient as follows: the model weights are frozen and the notes of a new patient are fed into the network; the patient representation is collected from the values of the units of the dense layer. Thus, the text of the notes is mapped to a fixed-length vector using a pre-trained deep averaging network.

Stage 1
  Input: MIMIC-III clinical notes distributed at 10 simulated sites; representation learning model
  Output: model predicting 174 ICD or CPT codes
  Extract CUIs from each patient's clinical notes using cTAKES
  for t = 1 to T do
      for k = 1 to K in parallel do
          train patient representation learning model f_k
      end
      aggregate models from all sites by W_ag^t = sum_{k=1}^{K} (n_k / N) w_k^t
  end

Stage 2
  Input: i2b2 clinical notes for obesity comorbidities distributed at 3 sites; phenotyping machine learning model
  Output: phenotyping model
  for t = 1 to T' do
      for k = 1 to K' in parallel do
          train phenotyping model f'_k
      end
      aggregate models from all sites by W'_ag^t = sum_{k=1}^{K'} (n'_k / N') w'_k^t
  end

Algorithm 1: Two-stage federated natural language processing
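The DAN forward pass described above can be sketched in a few lines of numpy. This is an illustrative forward-only sketch, not the authors' implementation: all dimensions, the tanh activation on the dense layer, and the random weights are assumptions.

```python
import numpy as np

# Illustrative numpy sketch of the deep averaging network (DAN) forward pass:
# embed CUI ids, average-pool over the note, apply a dense layer (whose
# activations serve as the patient representation), then one sigmoid per code.
rng = np.random.default_rng(0)
vocab_size, emb_dim, hidden_dim, n_codes = 1000, 16, 8, 174

E = rng.normal(size=(vocab_size, emb_dim))    # embedding table
W_h = rng.normal(size=(emb_dim, hidden_dim))  # dense layer weights
W_o = rng.normal(size=(hidden_dim, n_codes))  # one sigmoid output per ICD/CPT code

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dan_forward(cui_ids):
    """Return (patient_representation, code_probabilities) for one patient."""
    pooled = E[cui_ids].mean(axis=0)   # average pooling over the patient's CUIs
    rep = np.tanh(pooled @ W_h)        # dense-layer activations = patient representation
    probs = sigmoid(rep @ W_o)         # predicted probability for each code
    return rep, probs

rep, probs = dan_forward([3, 17, 42])
print(rep.shape, probs.shape)  # -> (8,) (174,)
```

After training, only `dan_forward`'s `rep` would be kept, with the output layer discarded, matching the "representation extractor" role described above.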
Phenotyping

A linear kernel support vector machine (SVM) taking as input the representations generated using the pre-trained model from stage 1 was used as the classifier for each phenotype of interest. No regularization was used for the SVM, and stochastic gradient descent was used as the optimization algorithm.
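The combination above (linear SVM, hinge loss, plain SGD, no regularization term) can be sketched for the binary case as follows; the paper's task is three-class, which would typically be handled one-vs-rest. The toy data, learning rate, and epoch count are illustrative assumptions.

```python
import numpy as np

# Illustrative sketch (not the authors' code) of a linear SVM trained with
# stochastic gradient descent on the hinge loss, with no regularization term.
def train_linear_svm(X, y, lr=0.1, epochs=100):
    """X: (n, d) feature matrix; y: labels in {-1, +1}. Returns (w, b)."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        for i in range(n):
            margin = y[i] * (X[i] @ w + b)
            if margin < 1:              # hinge loss sub-gradient step
                w += lr * y[i] * X[i]
                b += lr * y[i]
    return w, b

# Toy linearly separable data
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])
w, b = train_linear_svm(X, y)
print(np.sign(X @ w + b))  # -> [ 1.  1. -1. -1.]
```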
Federated training

To train the ANN model in either stage 1 or stage 2, we simulated sending out models with identical initial parameters to all sites, such as hospitals or clinics. At each site, a model was trained using only data from that site. Only the parameters of the trained models, not the original training data, were then sent back to the analyzer for aggregation. An updated model is generated by averaging the parameters of the distributively trained models, weighted by sample size (Konečný et al., 2016; McMahan et al., 2016). In this study, sample size is defined as the number of patients. After model aggregation, the updated model was sent out to all sites again to repeat the global training cycle (Algorithm 1). Formally, the weight update is specified by:

W_ag^t = sum_{k=1}^{K} (n_k / N) W_k^t    (1)

where W_ag^t is the parameters of the aggregated model at the analyzer site, K is the number of data sites (in this study, the number of simulated healthcare providers or clinics), n_k is the number of samples at the k-th site, N is the total number of samples across all sites, W_k^t is the parameters learned from the k-th data site alone, and t is the global cycle number in the range [1, T]. The algorithm tries to minimize the following objective function:

argmin_f ( - sum_{j=1}^{N} sum_{p=1}^{M} [ y_jp log f(x_j)_p + (1 - y_jp) log(1 - f(x_j)_p) ] )

where x_j is the feature vector of CUIs for the j-th patient, y_jp is the class label for output p, M is the total number of outputs, and f is the machine learning model, such as an artificial neural network or SVM. Code that accompanies this article can be found at our GitHub repository (https://github.com/kaiyuanmifen/FederatedNLP).

Experiments

To imitate a real-world medical setting where data are distributed across different healthcare providers, we randomly split the patients in the MIMIC-III data into 10 sites for stage 1 (federated representation learning). The training data of i2b2 were split into 3 sites for stage 2 (phenotype learning) to mimic obesity-related notes distributed across three different healthcare providers.
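The aggregation step in equation (1) reduces to a sample-size-weighted average of per-site parameters, which can be sketched as follows. The parameter arrays and site sizes below are illustrative assumptions.

```python
import numpy as np

# Sketch of the aggregation in equation (1): W_ag = sum_k (n_k / N) * W_k,
# where n_k is the number of patients at site k and N is the total.
def federated_average(site_weights, site_sizes):
    """site_weights: list of per-site parameter arrays (same shape);
    site_sizes: number of patients at each site."""
    N = sum(site_sizes)
    agg = np.zeros_like(site_weights[0], dtype=float)
    for W_k, n_k in zip(site_weights, site_sizes):
        agg += (n_k / N) * W_k
    return agg

# Three sites with different sample sizes: larger sites pull the average harder.
weights = [np.array([1.0, 1.0]), np.array([2.0, 2.0]), np.array([4.0, 4.0])]
sizes = [10, 30, 60]
print(federated_average(weights, sizes))  # -> [3.1 3.1]
```

In a full training loop this average would be broadcast back to all sites as the starting point of the next global cycle t.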
The i2b2 notes were not included in representation learning because, in clinical settings, information exchange routes for disease-specific records are often not the same as for general medical information, and ICD/CPT codes were not available for the i2b2 dataset.

Experiments were designed to answer three questions:

1. Whether clinical notes distributed in different silos can be utilized for patient representation learning without data sharing
2. Whether utilizing data from a wide range of sources will help improve performance of phenotyping from clinical notes
3. Whether models trained in a two-stage federated manner will have inferior performance to models trained with centralized data

To answer these questions, two-stage NLP algorithms were trained. Performance of models trained using only i2b2 notes from one of the three sites was compared with two-stage federated NLP results. Furthermore, performance of machine learning models using distributed or centralized data at the patient representation learning stage or the phenotyping stage was compared.

We looked at scenarios where no representation learning was performed. In those cases, standard TF-IDF weighted sparse bag-of-CUIs vectors were used to represent i2b2 notes. The sparse vectors were used as input into the phenotyping SVM model. We also looked at scenarios where representation learning was performed by predicting ICD codes. For each of these conditions, we trained our phenotyping models using centralized vs. federated learning. Finally, we considered a scenario where the phenotyping model was trained using the notes from a single site (the metrics we report were averaged across the three sites).

To summarize, seven experiments were conducted:

1. No representation learning + centralized phenotyping learning
2. No representation learning + federated phenotyping learning, where i2b2 training data were randomly split into 3 silos
3. No representation learning + single-source phenotyping learning, where i2b2 data were randomly split into 3 silos, but the phenotyping algorithm was trained using data from only one of the silos
4. Centralized representation learning + centralized phenotyping learning
5. Centralized representation learning + federated phenotyping learning
6. Federated representation learning + centralized phenotyping learning, where MIMIC-III data were randomly split into 10 silos
7. Federated representation learning + federated phenotyping learning, where MIMIC-III data were randomly split into 10 silos and i2b2 data into 3 silos (Table 2)

Results

Table 2: Performance of different experiments
Experiment    Patient representations    Phenotyping    Precision    Recall    F1
Table 3: Performance of two-stage federated NLP in obesity comorbidity phenotyping by disease

Disease                  Prec   Rec    F1
Asthma                   0.941  0.919  0.930
CAD                      0.605  0.606  0.605
CHF                      0.583  0.588  0.585
Depression               0.844  0.774  0.801
Diabetes                 0.879  0.873  0.876
GERD                     0.578  0.543  0.558
Gallstones               0.775  0.619  0.650
Gout                     0.948  0.929  0.938
Hypercholesterolemia     0.891  0.894  0.892
Hypertension             0.877  0.854  0.865
Hypertriglyceridemia     0.725  0.519  0.524
OA                       0.531  0.520  0.525
OSA                      0.627  0.594  0.609
Obesity                  0.900  0.894  0.897
PVD                      0.590  0.604  0.596
Venous Insufficiency     0.763  0.712  0.734
Average
Conclusion

In this article, we presented a two-stage method that conducts patient representation learning and obesity comorbidity phenotyping, both in a federated manner. The experimental results suggest that federated training of machine learning models on distributed datasets does improve the performance of NLP on clinical notes compared with algorithms trained on data from a single site. In this study, we used CUIs as input features for the machine learning models, but the same federated learning strategies can also be applied to raw text.
Acknowledgments

Research reported in this publication was supported by the National Library of Medicine of the National Institutes of Health under award number R01LM012973. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
References
Amy P Abernethy, Lynn M Etheredge, Patricia A Ganz, Paul Wallace, Robert R German, Chalapathy Neti, Peter B Bach, and Sharon B Murphy. 2010. Rapid-learning system for cancer care. Journal of Clinical Oncology, 28(27):4268.

David Blumenthal and Marilyn Tavenner. 2010. The meaningful use regulation for electronic health records. New England Journal of Medicine, 363(6):501–504.

Keith Bonawitz, Hubert Eichner, Wolfgang Grieskamp, Dzmitry Huba, Alex Ingerman, Vladimir Ivanov, Chloe Kiddon, Jakub Konecny, Stefano Mazzocchi, H Brendan McMahan, et al. 2019. Towards federated learning at scale: System design. arXiv preprint arXiv:1902.01046.

Mike Conway, Richard L Berg, David Carrell, Joshua C Denny, Abel N Kho, Iftikhar J Kullo, James G Linneman, Jennifer A Pacheco, Peggy Peissig, Luke Rasmussen, et al. 2011. Analyzing the heterogeneity and complexity of electronic health record oriented phenotyping algorithms. In AMIA Annual Symposium Proceedings, volume 2011, page 274. American Medical Informatics Association.

Dina Demner-Fushman, Wendy W Chapman, and Clement J McDonald. 2009. What can natural language processing do for clinical decision support? Journal of Biomedical Informatics, 42(5):760–772.

Dmitriy Dligach and Timothy Miller. 2018. Learning patient representations from text. arXiv preprint arXiv:1805.02096.

Elizabeth Ford, John A Carroll, Helen E Smith, Donia Scott, and Jackie A Cassell. 2016. Extracting information from the text of electronic medical records to improve case detection: a systematic review. Journal of the American Medical Informatics Association, 23(5):1007–1015.

Carol Friedman, Lyudmila Shagina, Yves Lussier, and George Hripcsak. 2004. Automated encoding of clinical documents based on natural language processing. Journal of the American Medical Informatics Association, 11(5):392–402.

Charles Friedman, Joshua Rubin, Jeffrey Brown, Melinda Buntin, Milton Corn, Lynn Etheredge, Carl Gunter, Mark Musen, Richard Platt, William Stead, et al. 2014. Toward a science of learning systems: a research agenda for the high-functioning learning health system. Journal of the American Medical Informatics Association, 22(1):43–50.

Charles P Friedman, Adam K Wong, and David Blumenthal. 2010. Achieving a nationwide learning health system. Science Translational Medicine, 2(57):57cm29.

Li Huang and Dianbo Liu. 2019. Patient clustering improves efficiency of federated machine learning to predict mortality and hospital stay time using distributed electronic medical records. arXiv preprint arXiv:1903.09296.

Li Huang, Yifeng Yin, Zeng Fu, Shifa Zhang, Hao Deng, and Dianbo Liu. 2018. LoAdaBoost: Loss-based AdaBoost federated machine learning on medical data. arXiv preprint arXiv:1811.12629.

Vasudevan Jagannathan, Charles J Mullett, James G Arbogast, Kevin A Halbritter, Deepthi Yellapragada, Sushmitha Regulapati, and Pavani Bandaru. 2009. Assessment of commercial NLP engines for medication information extraction from dictated clinical notes. International Journal of Medical Informatics, 78(4):284–291.

Alistair EW Johnson, Tom J Pollard, Lu Shen, H Lehman Li-wei, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G Mark. 2016. MIMIC-III, a freely accessible critical care database. Scientific Data, 3:160035.

Jakub Konečný, H Brendan McMahan, Felix X Yu, Peter Richtárik, Ananda Theertha Suresh, and Dave Bacon. 2016. Federated learning: Strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492.

Kory Kreimeyer, Matthew Foster, Abhishek Pandey, Nina Arya, Gwendolyn Halford, Sandra F Jones, Richard Forshee, Mark Walderhaug, and Taxiarchis Botsis. 2017. Natural language processing systems for capturing and standardizing unstructured clinical information: a systematic review. Journal of Biomedical Informatics, 73:14–29.

Dianbo Liu, Timothy Miller, Raheel Sayeed, and Kenneth Mandl. 2018a. FADL: Federated-autonomous deep learning for distributed electronic health record. arXiv preprint arXiv:1811.11400.

Dianbo Liu, Nestor Sepulveda, and Ming Zheng. 2018b. Artificial neural networks condensation: A strategy to facilitate adaption of machine learning in medical settings by reducing computational burden. arXiv preprint arXiv:1812.09659.

Christopher D Manning and Hinrich Schütze. 1999. Foundations of Statistical Natural Language Processing. MIT Press.

H Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, et al. 2016. Communication-efficient learning of deep networks from decentralized data. arXiv preprint arXiv:1602.05629.

Riccardo Miotto, Li Li, Brian A Kidd, and Joel T Dudley. 2016. Deep Patient: an unsupervised representation to predict the future of patients from the electronic health records. Scientific Reports, 6:26094.

Harvey J Murff, Fern FitzHenry, Michael E Matheny, Nancy Gentry, Kristen L Kotter, Kimberly Crimin, Robert S Dittus, Amy K Rosen, Peter L Elkin, Steven H Brown, et al. 2011. Automated identification of postoperative complications within an electronic medical record using natural language processing. JAMA, 306(8):848–855.

Jean R Slutsky. 2007. Moving closer to a rapid-learning health care system. Health Affairs, 26(2):w122–w124.

Özlem Uzuner. 2009. Recognizing obesity and comorbidities in sparse data.