Two-stage Federated Phenotyping and Patient Representation Learning
Dianbo Liu
CHIP, Boston Children's Hospital, Harvard Medical School, Boston, MA, USA
[email protected]

Dmitriy Dligach
Loyola University Chicago, Chicago, IL, USA
[email protected]

Timothy Miller
CHIP, Boston Children's Hospital, Harvard Medical School, Boston, MA, USA
[email protected]
Abstract
A large percentage of medical information is in unstructured text format in electronic medical record systems. Manual extraction of information from clinical notes is extremely time consuming. Natural language processing (NLP) has been widely used in recent years for automatic information extraction from medical texts. However, algorithms trained on data from a single healthcare provider are not generalizable and are error-prone due to the heterogeneity and uniqueness of medical documents. We developed a two-stage federated natural language processing method that enables utilization of clinical notes from different hospitals or clinics without moving the data, and demonstrated its performance using obesity and comorbidity phenotyping as the medical task. This approach not only improves the quality of a specific clinical task but also facilitates knowledge progression in the whole healthcare system, which is an essential part of a learning health system. To the best of our knowledge, this is the first application of federated machine learning in clinical NLP.
Introduction

Clinical notes and other unstructured data in plain text are valuable resources for medical informatics studies and machine learning applications in healthcare. In clinical settings, more than 70% of information is stored as unstructured text. Converting the unstructured data into useful structured representations will not only help data analysis but also improve efficiency in clinical practice (Jagannathan et al., 2009; Kreimeyer et al., 2017; Ford et al., 2016; Demner-Fushman et al., 2009; Murff et al., 2011; Friedman et al., 2004). Manual extraction of information from the vast volume of notes in electronic health record (EHR) systems is too time consuming.

To automatically retrieve information from unstructured notes, natural language processing (NLP) has been widely used. NLP is a subfield of computer science, developing for more than 50 years, that focuses on intelligent processing of human languages (Manning et al., 1999). A combination of hard-coded rules and machine learning methods has been used in the field, with machine learning currently being the dominant paradigm.

Automatic phenotyping is a task in clinical NLP that aims to identify cohorts of patients that match a predefined set of criteria. Supervised machine learning is currently the main approach to phenotyping, but the availability of annotated data hinders progress on this task. In this work, we consider a scenario where multiple institutions have access to relatively small amounts of annotated data for a particular phenotype, and this amount is not sufficient for training an accurate classifier. On the other hand, combining data from these institutions can lead to a high-accuracy classifier, but direct data sharing is not possible due to operational and privacy concerns.

Another problem we consider is learning patient representations that can be used to train accurate phenotyping classifiers.
The goal of patient representation learning is mapping the text of notes for a patient to a fixed-length dense vector (embedding). Patient representation learning has been done in supervised (Dligach and Miller, 2018) and unsupervised (Miotto et al., 2016) settings. In both cases, patient representation learning requires massive amounts of data. As in the scenario we outlined in the previous paragraph, combining data from several institutions can lead to higher quality patient representations, which in turn will improve the accuracy of phenotyping classifiers. However, direct data sharing, again, is difficult or impossible.

To tackle the challenges mentioned above, we developed a federated machine learning method to utilize clinical notes from multiple sources, both for learning patient representations and for training phenotype classifiers. Federated machine learning is a framework in which machine learning models are trained in a distributed and collaborative manner without centralized data (Liu et al., 2018a; McMahan et al., 2016; Bonawitz et al., 2019; Konečný et al., 2016; Huang et al., 2018; Huang and Liu, 2019). The strategy of federated learning has recently been adopted in the medical field for machine learning tasks based on structured data (Liu et al., 2018a; Huang et al., 2018; Liu et al., 2018b). However, to the best of our knowledge, this work is the first time a federated learning strategy has been used in medical NLP.

We developed our two-stage federated natural language processing method based on previous work on patient representations (Dligach and Miller, 2018). The first stage of our proposed federated learning scheme is supervised patient representation learning. Machine learning models are trained using medical notes from a large number of hospitals or clinics without moving or aggregating the notes. The notes used in this stage need not be directly relevant to a specific medical task of interest.
At the second stage, representations from the clinical notes directly related to the phenotyping task are extracted using the model obtained from stage 1, and a machine learning model specific to the medical task is trained.

Clinicians spend a significant amount of time reviewing clinical notes. This time can be saved or reduced with reasonably designed NLP technologies. One such task is phenotyping from medical notes. In this study, we demonstrate, using phenotyping from clinical notes as a clinical task (Conway et al., 2011; Dligach and Miller, 2018), that the method we developed makes it possible to utilize notes from a wide range of hospitals without moving the data.

The ability to utilize clinical notes distributed across different healthcare providers not only benefits a specific clinical practice task but also facilitates building a learning healthcare system, in which meaningful use of knowledge in distributed clinical notes will speed up the progression of medical knowledge to translational research, tool development, and healthcare quality assessment (Friedman et al., 2010; Blumenthal and Tavenner, 2010). Without the need for data movement, the speed of information flow can approach real time and make a rapid learning healthcare system possible (Slutsky, 2007; Friedman et al., 2014; Abernethy et al., 2010).

Data

Two datasets were used in this study. The MIMIC-III corpus (Johnson et al., 2016) was used for representation learning. This corpus contains information for more than 58,000 admissions of more than 45,000 patients admitted to Beth Israel Deaconess Medical Center in Boston between 2001 and 2012. Relevant to this study, MIMIC-III includes clinical notes, ICD9 diagnostic codes, ICD9 procedure codes, and CPT codes. The notes were processed with cTAKES (https://ctakes.apache.org) to extract UMLS concept unique identifiers (CUIs). Following the cohort selection protocol from (Dligach and Miller, 2018), patients with over 10,000 CUIs were excluded from this study.
We obtained a cohort of 44,211 patients in total.

The Informatics for Integrating Biology to the Bedside (i2b2) Obesity challenge dataset was used to train phenotyping models (Uzuner, 2009). The dataset consists of 1237 discharge summaries from Partners HealthCare in Boston. Patients in this cohort were annotated with respect to obesity and its comorbidities. In this study we consider the more challenging intuitive version of the task. The discharge summaries were annotated with obesity and its 15 most common comorbidities; the presence, absence, or uncertainty (questionable) of each was used as the ground truth label in the phenotyping task. Table 1 shows the number of examples of each class for each phenotype. Thus, we build phenotyping models for 16 different diseases.

At the representation learning stage (stage 1), all notes for a patient were aggregated into a single document. CUIs extracted from the text were used as input features. ICD-9 and CPT codes for the patient were used as labels for supervised representation learning.

Table 1: i2b2 cohort of obesity comorbidities

Disease                  Present  Absent  Questionable
Asthma                   86       596     0
CAD                      391      265     5
CHF                      308      318     1
Depression               142      555     0
Diabetes                 473      205     5
GERD                     144      447     1
Gallstones               101      609     0
Gout                     94       616     2
Hypercholesterolemia     315      287     1
Hypertension             511      127     0
Hypertriglyceridemia     37       665     0
OA                       117      554     1
OSA                      99       606     8
Obesity                  285      379     1
PVD                      110      556     1
Venous Insufficiency     54       577     0

At the phenotyping stage (stage 2), CUIs extracted from the discharge summaries were used as input features. Annotations of present, absent, or questionable for each of the 16 diagnoses for each patient were used as multi-class classification labels.
Methods

We envision that clinical textual data can be useful in at least two ways: (1) for pre-training patient representation models, and (2) for training phenotyping models.

In this study, a patient representation refers to a fixed-length vector derived from clinical notes that encodes all essential information about the patient. A patient representation model trained on massive amounts of text data can be useful for a wide range of clinical applications. A phenotyping model, on the other hand, captures the way a specific medical condition works, by learning a function that can predict a disease (e.g., asthma) from the text of the notes.

Until recently, phenotyping models have been trained from scratch, omitting stage (1), but recent work (Dligach and Miller, 2018) included a pre-training step, which derived dense patient representations from data linking large amounts of patient notes to ICD codes. Their work showed that including the pre-training step led to learning patient representations that were more accurate for a number of phenotyping tasks.

Our goal here is to develop methods for federated learning for both (1) pre-training patient representations and (2) phenotyping tasks. These methods will allow researchers and clinicians to utilize data from multiple healthcare providers, without the need to share the data directly, obviating issues related to data transfer and privacy.

To achieve this goal, we design a two-stage federated NLP approach (Figure 1). In the first stage, following (Dligach and Miller, 2018), we pre-train a patient representation model by training an artificial neural network (ANN) to predict ICD and CPT codes from the text of the notes. We extend the methods from (Dligach and Miller, 2018) to facilitate federated training. In the second stage, a phenotyping machine learning model is trained in a federated manner using clinical notes that are distributed across multiple sites for the target phenotype.
In this stage, the notes mapped to fixed-length representations from stage (1) are used as input features, and whether the patient has a certain disease is used as a label with one of three classes: present, absent, or questionable.

In the following sections, we first describe a simple note pre-processing step. We then discuss the method for pre-training patient representations and the method for training phenotyping models. Finally, we describe our framework for performing the latter two steps in a federated manner.

Figure 1: Two-stage federated natural language processing for clinical note phenotyping. In the first stage, a patient representation model was trained using an artificial neural network (ANN) to predict ICD and CPT codes from the text of the notes from a wide range of healthcare providers. The model without its output layer was then used as a "representation extractor" in the next stage. In the second stage, a phenotyping support vector machine model was trained in a federated manner using clinical notes for the target phenotype distributed across multiple silos.
Pre-processing

All of our models rely on standardized medical vocabulary automatically extracted from the text of the notes rather than on raw text. To obtain medically relevant information from clinical notes, Unified Medical Language System (UMLS) concept unique identifiers (CUIs) were extracted from each note using Apache cTAKES (https://ctakes.apache.org). UMLS is a resource that brings together many health and biomedical vocabularies and standardizes them to enable interoperability between computer systems. The Metathesaurus is a large, multi-purpose, and multi-lingual vocabulary that contains information about biomedical and health related concepts, their various names, and the relationships among them. The Metathesaurus structure has four layers: Concept Unique Identifiers (CUIs), Lexical Unique Identifiers (LUIs), String Unique Identifiers (SUIs), and Atom Unique Identifiers (AUIs). In this study, we focus on CUIs, each of which represents a single medical meaning. Our models use UMLS CUIs as input.
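To make the input format concrete, the following is a minimal sketch (not the authors' code) of how per-patient CUI lists could be turned into fixed-length integer sequences for an embedding-based model. The CUI strings, vocabulary scheme, and padding convention are illustrative assumptions.

```python
# Minimal sketch: per-patient CUI lists -> fixed-length integer id sequences.
# CUI strings and the vocabulary scheme are illustrative assumptions.

def build_vocab(patients):
    """Map every CUI seen in the corpus to an integer id (0 is reserved for padding)."""
    vocab = {}
    for cuis in patients:
        for cui in cuis:
            vocab.setdefault(cui, len(vocab) + 1)
    return vocab

def encode(cuis, vocab, max_len):
    """Convert one patient's CUI list to a fixed-length id sequence, right-padded with 0."""
    ids = [vocab.get(c, 0) for c in cuis[:max_len]]
    return ids + [0] * (max_len - len(ids))

patients = [["C0004096", "C0011849"], ["C0020538"]]  # illustrative CUIs
vocab = build_vocab(patients)
print(encode(patients[0], vocab, 4))  # -> [1, 2, 0, 0]
```

Unknown CUIs map to the padding id here; a real pipeline might instead reserve a dedicated out-of-vocabulary id.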
Patient representation learning

We adapted the architecture from (Dligach and Miller, 2018) for pre-training patient representations: a deep averaging network (DAN) that consists of an embedding layer, an average pooling layer, a dense layer, and multiple sigmoid outputs, where each output corresponds to an ICD or CPT code being predicted. This architecture takes CUIs as input and is trained using a binary cross-entropy loss function to predict ICD and CPT codes. After the model is trained, the dense layer can be used to represent a patient as follows: the model weights are frozen and the notes of a new patient are fed into the network; the patient representation is collected from the values of the units of the dense layer. Thus, the text of the notes is mapped to a fixed-length vector using a pre-trained deep averaging network.

Stage 1
  Input: MIMIC-III clinical notes distributed at 10 simulated sites; representation learning model
  Output: model predicting 174 ICD or CPT codes
  Extract CUIs from each patient's clinical notes using cTAKES
  for t = 1 to T do
      for k = 1 to K in parallel do
          train patient representation learning model f_k
      end
      aggregate models from all sites by W_ag^t = sum_{k=1}^{K} (n_k / N) w_k^t
  end

Stage 2
  Input: i2b2 clinical notes for obesity comorbidities distributed at 3 sites; phenotyping machine learning model
  Output: phenotyping model
  for t = 1 to T' do
      for k = 1 to K' in parallel do
          train phenotyping model f'_k
      end
      aggregate models from all sites by W'_ag^t = sum_{k=1}^{K'} (n'_k / N') w'_k^t
  end

Algorithm 1: Two-stage federated natural language processing
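The DAN forward pass described above can be sketched in a few lines of numpy. This is an illustrative forward-only sketch, not the authors' implementation: all dimensions, the tanh activation on the dense layer, and the random weights are assumptions.

```python
import numpy as np

# Illustrative numpy sketch of the deep averaging network (DAN) forward pass:
# embed CUI ids, average-pool over the note, apply a dense layer (whose
# activations serve as the patient representation), then one sigmoid per code.
rng = np.random.default_rng(0)
vocab_size, emb_dim, hidden_dim, n_codes = 1000, 16, 8, 174

E = rng.normal(size=(vocab_size, emb_dim))    # embedding table
W_h = rng.normal(size=(emb_dim, hidden_dim))  # dense layer weights
W_o = rng.normal(size=(hidden_dim, n_codes))  # one sigmoid output per ICD/CPT code

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dan_forward(cui_ids):
    """Return (patient_representation, code_probabilities) for one patient."""
    pooled = E[cui_ids].mean(axis=0)   # average pooling over the patient's CUIs
    rep = np.tanh(pooled @ W_h)        # dense-layer activations = patient representation
    probs = sigmoid(rep @ W_o)         # predicted probability for each code
    return rep, probs

rep, probs = dan_forward([3, 17, 42])
print(rep.shape, probs.shape)  # -> (8,) (174,)
```

After training, only `dan_forward`'s `rep` would be kept, with the output layer discarded, matching the "representation extractor" role described above.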
Phenotyping

A linear kernel support vector machine (SVM) taking as input the representations generated using the pre-trained model from stage 1 was used as the classifier for each phenotype of interest. No regularization was used for the SVM, and stochastic gradient descent was used as the optimization algorithm.
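The combination above (linear SVM, hinge loss, plain SGD, no regularization term) can be sketched for the binary case as follows; the paper's task is three-class, which would typically be handled one-vs-rest. The toy data, learning rate, and epoch count are illustrative assumptions.

```python
import numpy as np

# Illustrative sketch (not the authors' code) of a linear SVM trained with
# stochastic gradient descent on the hinge loss, with no regularization term.
def train_linear_svm(X, y, lr=0.1, epochs=100):
    """X: (n, d) feature matrix; y: labels in {-1, +1}. Returns (w, b)."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        for i in range(n):
            margin = y[i] * (X[i] @ w + b)
            if margin < 1:              # hinge loss sub-gradient step
                w += lr * y[i] * X[i]
                b += lr * y[i]
    return w, b

# Toy linearly separable data
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])
w, b = train_linear_svm(X, y)
print(np.sign(X @ w + b))  # -> [ 1.  1. -1. -1.]
```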
Federated training

To train the ANN model in either stage 1 or stage 2, we simulated sending out models with identical initial parameters to all sites, such as hospitals or clinics. At each site, a model was trained using only data from that site. Only the parameters of the trained models, not the original training data, were then sent back to the analyzer for aggregation. An updated model is generated by averaging the parameters of the distributively trained models, weighted by sample size (Konečný et al., 2016; McMahan et al., 2016). In this study, sample size is defined as the number of patients. After model aggregation, the updated model was sent out to all sites again to repeat the global training cycle (Algorithm 1). Formally, the weight update is specified by:

W_ag^t = sum_{k=1}^{K} (n_k / N) W_k^t    (1)

where W_ag^t is the parameters of the aggregated model at the analyzer site, K is the number of data sites (in this study, the number of simulated healthcare providers or clinics), n_k is the number of samples at the k-th site, N is the total number of samples across all sites, W_k^t is the parameters learned from the k-th data site alone, and t is the global cycle number in the range [1, T]. The algorithm tries to minimize the following objective function:

argmin_f ( - sum_{j=1}^{N} sum_{p=1}^{M} [ y_jp log f(x_j)_p + (1 - y_jp) log(1 - f(x_j)_p) ] )

where x_j is the feature vector of CUIs for the j-th patient, y_jp is the class label for output p, M is the total number of outputs, and f is the machine learning model, such as an artificial neural network or SVM. Code that accompanies this article can be found at our GitHub repository (https://github.com/kaiyuanmifen/FederatedNLP).

Experiments

To imitate a real-world medical setting where data are distributed across different healthcare providers, we randomly split the patients in the MIMIC-III data into 10 sites for stage 1 (federated representation learning). The training data of i2b2 were split into 3 sites for stage 2 (phenotype learning) to mimic obesity-related notes distributed across three different healthcare providers.
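The aggregation step in equation (1) reduces to a sample-size-weighted average of per-site parameters, which can be sketched as follows. The parameter arrays and site sizes below are illustrative assumptions.

```python
import numpy as np

# Sketch of the aggregation in equation (1): W_ag = sum_k (n_k / N) * W_k,
# where n_k is the number of patients at site k and N is the total.
def federated_average(site_weights, site_sizes):
    """site_weights: list of per-site parameter arrays (same shape);
    site_sizes: number of patients at each site."""
    N = sum(site_sizes)
    agg = np.zeros_like(site_weights[0], dtype=float)
    for W_k, n_k in zip(site_weights, site_sizes):
        agg += (n_k / N) * W_k
    return agg

# Three sites with different sample sizes: larger sites pull the average harder.
weights = [np.array([1.0, 1.0]), np.array([2.0, 2.0]), np.array([4.0, 4.0])]
sizes = [10, 30, 60]
print(federated_average(weights, sizes))  # -> [3.1 3.1]
```

In a full training loop this average would be broadcast back to all sites as the starting point of the next global cycle t.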
The i2b2 notes were not included in representation learning because, in clinical settings, information exchange routes for disease-specific records are often not the same as for general medical information, and ICD/CPT codes were not available for the i2b2 dataset.

Experiments were designed to answer three questions:

1. Whether clinical notes distributed in different silos can be utilized for patient representation learning without data sharing
2. Whether utilizing data from a wide range of sources will help improve performance of phenotyping from clinical notes
3. Whether models trained in a two-stage federated manner will have inferior performance to models trained with centralized data

To answer these questions, two-stage NLP algorithms were trained. Performance of models trained using only i2b2 notes from one of the three sites was compared with two-stage federated NLP results. Furthermore, performance of machine learning models using distributed or centralized data at the patient representation learning stage or the phenotyping stage was compared.

We looked at scenarios where no representation learning was performed. In those cases, standard TF-IDF weighted sparse bag-of-CUIs vectors were used to represent i2b2 notes. The sparse vectors were used as input into the phenotyping SVM model. We also looked at scenarios where representation learning was performed by predicting ICD codes. For each of these conditions, we trained our phenotyping models using centralized vs. federated learning. Finally, we considered a scenario where the phenotyping model was trained using the notes from a single site (the metrics we report were averaged across the three sites).

To summarize, seven experiments were conducted:

1. No representation learning + centralized phenotyping learning
2. No representation learning + federated phenotyping learning, where i2b2 training data were randomly split into 3 silos
3. No representation learning + single-source phenotyping learning, where i2b2 data were randomly split into 3 silos, but the phenotyping algorithm was trained using data from only one of the silos
4. Centralized representation learning + centralized phenotyping learning
5. Centralized representation learning + federated phenotyping learning
6. Federated representation learning + centralized phenotyping learning, where MIMIC-III data were randomly split into 10 silos
7. Federated representation learning + federated phenotyping learning, where MIMIC-III data were randomly split into 10 silos and i2b2 data into 3 silos (Table 2)

Results

Table 2: Performance of different experiments
Experiment    Patient representations    Phenotyping    Precision    Recall    F1
Table 3: Performance of two-stage federated NLP in obesity comorbidity phenotyping by disease

Disease                  Prec   Rec    F1
Asthma                   0.941  0.919  0.930
CAD                      0.605  0.606  0.605
CHF                      0.583  0.588  0.585
Depression               0.844  0.774  0.801
Diabetes                 0.879  0.873  0.876
GERD                     0.578  0.543  0.558
Gallstones               0.775  0.619  0.650
Gout                     0.948  0.929  0.938
Hypercholesterolemia     0.891  0.894  0.892
Hypertension             0.877  0.854  0.865
Hypertriglyceridemia     0.725  0.519  0.524
OA                       0.531  0.520  0.525
OSA                      0.627  0.594  0.609
Obesity                  0.900  0.894  0.897
PVD                      0.590  0.604  0.596
Venous Insufficiency     0.763  0.712  0.734
Average
Conclusion

In this article, we presented a two-stage method that conducts patient representation learning and obesity comorbidity phenotyping, both in a federated manner. The experimental results suggest that federated training of machine learning models on distributed datasets does improve the performance of NLP on clinical notes compared with algorithms trained on data from a single site. In this study, we used CUIs as input features for the machine learning models, but the same federated learning strategies can also be applied to raw text.
Acknowledgments

Research reported in this publication was supported by the National Library of Medicine of the National Institutes of Health under award number R01LM012973. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
References
Amy P Abernethy, Lynn M Etheredge, Patricia A Ganz, Paul Wallace, Robert R German, Chalapathy Neti, Peter B Bach, and Sharon B Murphy. 2010. Rapid-learning system for cancer care. Journal of Clinical Oncology, 28(27):4268.

David Blumenthal and Marilyn Tavenner. 2010. The meaningful use regulation for electronic health records. New England Journal of Medicine, 363(6):501–504.

Keith Bonawitz, Hubert Eichner, Wolfgang Grieskamp, Dzmitry Huba, Alex Ingerman, Vladimir Ivanov, Chloe Kiddon, Jakub Konecny, Stefano Mazzocchi, H Brendan McMahan, et al. 2019. Towards federated learning at scale: System design. arXiv preprint arXiv:1902.01046.

Mike Conway, Richard L Berg, David Carrell, Joshua C Denny, Abel N Kho, Iftikhar J Kullo, James G Linneman, Jennifer A Pacheco, Peggy Peissig, Luke Rasmussen, et al. 2011. Analyzing the heterogeneity and complexity of electronic health record oriented phenotyping algorithms. In AMIA Annual Symposium Proceedings, volume 2011, page 274. American Medical Informatics Association.

Dina Demner-Fushman, Wendy W Chapman, and Clement J McDonald. 2009. What can natural language processing do for clinical decision support? Journal of Biomedical Informatics, 42(5):760–772.

Dmitriy Dligach and Timothy Miller. 2018. Learning patient representations from text. arXiv preprint arXiv:1805.02096.

Elizabeth Ford, John A Carroll, Helen E Smith, Donia Scott, and Jackie A Cassell. 2016. Extracting information from the text of electronic medical records to improve case detection: a systematic review. Journal of the American Medical Informatics Association, 23(5):1007–1015.

Carol Friedman, Lyudmila Shagina, Yves Lussier, and George Hripcsak. 2004. Automated encoding of clinical documents based on natural language processing. Journal of the American Medical Informatics Association, 11(5):392–402.

Charles Friedman, Joshua Rubin, Jeffrey Brown, Melinda Buntin, Milton Corn, Lynn Etheredge, Carl Gunter, Mark Musen, Richard Platt, William Stead, et al. 2014. Toward a science of learning systems: a research agenda for the high-functioning learning health system. Journal of the American Medical Informatics Association, 22(1):43–50.

Charles P Friedman, Adam K Wong, and David Blumenthal. 2010. Achieving a nationwide learning health system. Science Translational Medicine, 2(57):57cm29.

Li Huang and Dianbo Liu. 2019. Patient clustering improves efficiency of federated machine learning to predict mortality and hospital stay time using distributed electronic medical records. arXiv preprint arXiv:1903.09296.

Li Huang, Yifeng Yin, Zeng Fu, Shifa Zhang, Hao Deng, and Dianbo Liu. 2018. LoAdaBoost: Loss-based AdaBoost federated machine learning on medical data. arXiv preprint arXiv:1811.12629.

Vasudevan Jagannathan, Charles J Mullett, James G Arbogast, Kevin A Halbritter, Deepthi Yellapragada, Sushmitha Regulapati, and Pavani Bandaru. 2009. Assessment of commercial NLP engines for medication information extraction from dictated clinical notes. International Journal of Medical Informatics, 78(4):284–291.

Alistair EW Johnson, Tom J Pollard, Lu Shen, H Lehman Li-wei, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G Mark. 2016. MIMIC-III, a freely accessible critical care database. Scientific Data, 3:160035.

Jakub Konečný, H Brendan McMahan, Felix X Yu, Peter Richtárik, Ananda Theertha Suresh, and Dave Bacon. 2016. Federated learning: Strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492.

Kory Kreimeyer, Matthew Foster, Abhishek Pandey, Nina Arya, Gwendolyn Halford, Sandra F Jones, Richard Forshee, Mark Walderhaug, and Taxiarchis Botsis. 2017. Natural language processing systems for capturing and standardizing unstructured clinical information: a systematic review. Journal of Biomedical Informatics, 73:14–29.

Dianbo Liu, Timothy Miller, Raheel Sayeed, and Kenneth Mandl. 2018a. FADL: Federated-autonomous deep learning for distributed electronic health record. arXiv preprint arXiv:1811.11400.

Dianbo Liu, Nestor Sepulveda, and Ming Zheng. 2018b. Artificial neural networks condensation: A strategy to facilitate adaption of machine learning in medical settings by reducing computational burden. arXiv preprint arXiv:1812.09659.

Christopher D Manning and Hinrich Schütze. 1999. Foundations of Statistical Natural Language Processing. MIT Press.

H Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, et al. 2016. Communication-efficient learning of deep networks from decentralized data. arXiv preprint arXiv:1602.05629.

Riccardo Miotto, Li Li, Brian A Kidd, and Joel T Dudley. 2016. Deep Patient: an unsupervised representation to predict the future of patients from the electronic health records. Scientific Reports, 6:26094.

Harvey J Murff, Fern FitzHenry, Michael E Matheny, Nancy Gentry, Kristen L Kotter, Kimberly Crimin, Robert S Dittus, Amy K Rosen, Peter L Elkin, Steven H Brown, et al. 2011. Automated identification of postoperative complications within an electronic medical record using natural language processing. JAMA, 306(8):848–855.

Jean R Slutsky. 2007. Moving closer to a rapid-learning health care system. Health Affairs, 26(2):w122–w124.

Özlem Uzuner. 2009. Recognizing obesity and comorbidities in sparse data.