Deep Representation Learning of Electronic Health Records to Unlock Patient Stratification at Scale
Isotta Landi, Benjamin S. Glicksberg, Hao-Chih Lee, Sarah Cherng, Giulia Landi, Matteo Danieletto, Joel T. Dudley, Cesare Furlanello, Riccardo Miotto
Cesare Furlanello and Riccardo Miotto share senior authorship.
(1) Bruno Kessler Institute, Via Sommarive 18, 38123 Povo (TN), Italy
(2) Department of Psychology and Cognitive Science, University of Trento, Corso Bettini 84, 38068 Rovereto (TN), Italy
(3) Hasso Plattner Institute for Digital Health at Mount Sinai
(4) Institute for Next Generation Healthcare
(5) Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, 1 Gustave L. Levy Place, New York, NY 10029, USA
(6) Department of Mental Health and Pathological Addiction, Azienda USL Centro “Santi”, Via Vasari 13, 43100 Parma, Italy
(7) HK3 Lab, Via Castel Morrone 14, 20129 Milan, Italy
Corresponding author:
Riccardo Miotto, PhD
Hasso Plattner Institute for Digital Health at Mount Sinai
Department of Genetics and Genomic Sciences
Icahn School of Medicine at Mount Sinai
1 Gustave L. Levy Place
New York, NY 10029
USA
email: [email protected]
arXiv [q-bio.QM]
Abstract
Deriving disease subtypes from electronic health records (EHRs) can guide next-generation personalized medicine. However, challenges in summarizing and representing patient data prevent widespread practice of scalable EHR-based stratification analysis. Here we present an unsupervised framework based on deep learning to process heterogeneous EHRs and derive patient representations that can efficiently and effectively enable patient stratification at scale. We considered EHRs of 1 , , 741 patients from a diverse hospital cohort comprising a total of 57,464 clinical concepts. We introduce a representation learning model based on word embeddings, convolutional neural networks, and autoencoders (i.e., ConvAE) to transform patient trajectories into low-dimensional latent vectors. We evaluated these representations as broadly enabling patient stratification by applying hierarchical clustering to different multi-disease and disease-specific patient cohorts. ConvAE significantly outperformed several baselines in a clustering task to identify patients with different complex conditions, with 2.61 entropy and 0 .
Introduction
Electronic health records (EHRs) are collected as part of routine care across the vast majority of healthcare institutions. They consist of heterogeneous structured and unstructured data elements, including demographic information, diagnoses, laboratory results, medication prescriptions, free text clinical notes, and images. EHRs provide snapshots of a patient’s state of health and have created unprecedented opportunities to investigate the properties of clinical events across large populations using data-driven approaches and machine learning. At the individual level, patient trajectories can foster personalized medicine; across a population, EHRs can provide a vital resource to understand population health management and help make better decisions for healthcare operation policies [1].
Personalized medicine focuses on the use of patient-specific data to tailor treatment to an individual’s unique health characteristics. However, even seemingly simple diseases can show different degrees of complexity that can create challenges for identification, treatment, and prognosis, despite equivalence at the diagnostic level [2, 3]. Heterogeneity among patients is particularly evident for complex disorders, where the etiology is due to an amalgamation of multiple genetic, environmental, and lifestyle factors. Several different conditions have been referred to as complex, such as Parkinson’s disease (PD) [4], multiple myeloma (MM) [5], and type 2 diabetes (T2D) [6]. Patients with complex disorders may differ on multiple systemic layers (e.g., different clinical measurements or comorbidity landscape) and in response to treatments, making these conditions difficult to model. Multiple data types in patient longitudinal EHR histories offer a way to examine disease complexity and present an opportunity to refine diseases into subtypes and tailor personalized treatments. This task is usually referred to as “EHR-based patient stratification”.
This follows a common approach in clinical research, where attempts to identify latent patterns within a cohort of patients can contribute to the development of improved personalized therapies [7].
From a computational perspective, patient stratification is a data-driven, unsupervised learning task that groups patients according to their clinical characteristics [8]. Previous work in this domain aggregates clinical data at a patient level, representing each patient as a multi-dimensional vector, and derives subtypes within a disease-specific population via clustering (e.g., in autism [9]) or topological analysis (e.g., for T2D [10]). Deep learning has been applied to derive more robust patient representations to improve disease subtyping [8, 11]. Baytas et al. used time-aware long short-term memory (LSTM) networks to leverage stratification of longitudinal data of PD patients [8]. Similarly, Zhang et al. used LSTM to identify three subgroups of patients with idiopathic PD that differ in disease progression patterns and symptom severity [11]. These studies, however, only focused on curated and small disease-specific cohorts, with ad hoc manually selected features. This approach not only limits scalability and generalizability, but also hinders the possibility to discover unknown patterns that might characterize a condition. Because EHRs tend to be incomplete, using a diverse cohort of patients to derive disease-specific subgroups can adequately capture the features of heterogeneity within the disease of interest [12]. However, it is challenging to create large-scale computational models from EHRs because of data quality issues, such as high dimensionality, heterogeneity, sparseness, random errors, and systematic biases. Advances in machine learning, specifically in representation learning [13] and deep learning [14], are introducing different computational models to leverage EHRs for personalized healthcare [15, 16].
This work fits into this landscape by presenting an unsupervised patient stratification pipeline that aims to automatically detect clinically meaningful subtypes within any condition by using patient representations learned from a heterogeneous and large cohort of EHRs.
In particular, this paper proposes a general framework for identifying disease subtypes at scale (see Figure 1a). We first propose an unsupervised deep learning architecture to derive vector-based patient representations from a large and domain-free collection of EHRs. This model (i.e., ConvAE) combines 1) embeddings to contextualize medical concepts, 2) convolutional neural networks (CNNs) to loosely model the temporal aspects of patient data, and 3) autoencoders (AEs) to enable the application of an unsupervised architecture. Second, we show that ConvAE-based representations learned from real-world EHRs of about 1.6M patients from the Mount Sinai Health System in New York improve clustering of patients with different disorders compared to several commonly used baselines. Last, we demonstrate that ConvAE leads to effective patient stratification with minimal effort. To this end, we used the encodings learned from domain-free and heterogeneous EHRs to derive subtypes for different complex disorders and provide a qualitative analysis to determine their clinical relevance.
This architecture enables patient stratification at scale by eliminating the need for manual feature engineering and explicit labeling of events within patient care timelines, and processes the whole EHR sequence regardless of the length of patient history. By generating disease subgroups from large-scale EHR data, this architecture can help disentangle clinical heterogeneity and identify high-impact patterns within complex disorders, whose effect may be masked in case-control studies [17]. The specific properties of the different subgroups can then potentially inform personalized treatments and improve patient care.
Results
We first evaluated the extent to which ConvAE-based patient representations can be used to identify different clinical diagnoses in the EHRs (i.e., disease phenotyping [18]). To this end, we performed clustering analysis using patients with the following eight complex disorders: T2D, MM, PD, Alzheimer’s disease (AD), Crohn’s disease (CD), breast cancer (BC), prostate cancer (PC), and attention deficit hyperactivity disorder (ADHD). We used SNOMED–CT (Systematized nomenclature of medicine – clinical terms) [19] to find all patients in the data warehouse diagnosed with these conditions; see Supplementary Table 2 and the “Multi-disease clustering analysis” subsection in “Methods” for more details.
Evaluation was organized as a 2-fold cross-validation experiment to show model generalizability and to assess replication of the stratification results. To this aim, we randomly split the dataset in half, obtaining two independent cohorts of about 800,000 patients, each used in turn to train and to test the models. While we used all patients in each cohort for training, in the test sets we retained only the patients diagnosed with one of the eight disorders under consideration, obtaining about 94,000 test patients per fold (see the “Dataset” subsection in “Methods” for more details).
Table 1 shows the results using hierarchical clustering for different ConvAE architectures (one, two, and multikernel CNN layers) and baselines in terms of entropy and purity scores averaged over the 2-fold cross-validation experiment. ConvAE performed significantly better than other models widely used in healthcare for representation learning, including Deep Patient [20], for both entropy and purity scores (ps < . . 50, based on purity score analysis). It is worth noting that, without a predictive theory of clustering [21, 22], validation metrics frequently fail to correlate with clustering errors [23]. However, such a theoretical structure is not applicable in this context because the heterogeneity of the external complex disorder classes does not provide a reliable probabilistic framework. For this reason, rather than estimation error analysis, we used transparent external metrics, such as entropy and purity scores, which evaluate cluster composition and also account for possible subgroups of complex diseases [24].
Figure 2 visualizes the distribution of the different patient representations along with their disease cohort labels obtained using UMAP (Uniform manifold approximation and projection for dimension reduction [25]). ConvAE captures hidden patterns of overlapping phenotypes while still displaying identifiable groups of patients with distinct disorders. Figure 3 shows the same patient distribution, highlighting clustering labels and the purity percentage score of each cluster's dominant disease. These figures refer to only one of the cross-validation splits; results for the second split are similar and are available in Supplementary Figures 1 and 2. ConvAE (with one CNN layer) also led to visually better clustering than all baselines. Patients with ADHD were the most separated and were detected with 80% purity by hierarchical clustering. Visible clusters with >50% purity were also identified for T2D, PC, and PD. Comparing the encoding projections (Figure 2) to the clustering visualization (Figure 3), we observe that patients whose disease is not correctly identified by clusters tend not to separate clearly in this low-dimensional space. As an example, AD patients were randomly scattered in the plot and did not lead to distinguishable clusters. This might be due to factors such as sex and age, intrinsic biases, or noise, but it might also reflect a shared phenotypic characterization that drives the learning process into displaying these patient EHR progressions closely together irrespective of disease labels.
We then evaluated the use of ConvAE representations for patient stratification at scale and the identification of clinically relevant disease subtypes. We considered six diseases: T2D, PD, AD, MM, PC, and BC. These are all age-related complex disorders with late onset (i.e., on average, increased prevalence after 60 years of age) [26, 27, 28, 29, 30, 31]. We decided to focus on these conditions to avoid, to some extent, the confounding effect of age that could affect learning and the evaluation of different subtypes. Figure 4 shows the results of running hierarchical clustering on the ConvAE-based patient representations of each disease cohort. To determine the optimal number of clusters, we empirically selected the smallest number of clusters that minimizes the increase in explained variance (i.e., Elbow method). We were able to identify different subtypes for each disease with no additional feature selection and using representations derived from a domain-free cohort of patients. Supplementary Table 3 reports the number of patients in each cohort and the number of subgroups identified.
Similar results were obtained for the second split and are reported in Supplementary Figure 3.
In the following sections, we present the clinical characterization of T2D, PD, and AD subgroups via enrichment analysis of medical concept occurrences (see Supplementary Material for the characterization of the other conditions). We compare T2D and PD results to related studies based on ad hoc cohorts [10, 11]. Conversely, there are no published EHR-based stratification studies for AD, MM, PC, and BC to use for comparison. All subtypes were reviewed by a clinical expert to highlight meaningful descriptors, and we used multiple pairwise chi-squared tests to assess group differences. For each disease, Supplementary Tables 4-9 list sex and age statistics of the cohort (between-group comparisons are performed via multiple pairwise chi-squared tests and t-tests), as well as the five most frequent diagnoses, medications, laboratory tests, and procedures, ordered according to in-group and total frequencies. The results for the second split are reported in Supplementary Tables 10-15.
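The entropy and purity scores used in the multi-disease evaluation above can be illustrated with a minimal sketch. It assumes the common definitions (purity as the fraction of patients belonging to the dominant disease of their cluster, entropy as the size-weighted average of the per-cluster label entropy); the base-2 logarithm and the function name `cluster_purity_entropy` are assumptions made for this example, not details given in the text.

```python
import math
from collections import Counter, defaultdict


def cluster_purity_entropy(cluster_ids, disease_labels):
    """Return (purity, entropy) of a clustering against external disease labels.

    Assumed definitions: purity sums the dominant-label count of each cluster
    and divides by the number of patients; entropy is the cluster-size-weighted
    mean of the Shannon entropy (base 2, an assumption) of each cluster.
    """
    if len(cluster_ids) != len(disease_labels):
        raise ValueError("cluster_ids and disease_labels must have equal length")

    # Group the external labels by assigned cluster.
    members = defaultdict(list)
    for cluster, disease in zip(cluster_ids, disease_labels):
        members[cluster].append(disease)

    n = len(cluster_ids)
    purity = 0.0
    entropy = 0.0
    for labels in members.values():
        counts = Counter(labels)
        size = len(labels)
        purity += max(counts.values()) / n
        cluster_entropy = -sum(
            (k / size) * math.log2(k / size) for k in counts.values()
        )
        entropy += (size / n) * cluster_entropy
    return purity, entropy


if __name__ == "__main__":
    clusters = [0, 0, 0, 1, 1, 1]
    diseases = ["T2D", "T2D", "PD", "PD", "PD", "PD"]
    print(cluster_purity_entropy(clusters, diseases))
```

Higher purity and lower entropy both indicate clusters dominated by a single disease; a cluster that mixes several disorders lowers purity and raises entropy.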
Type 2 diabetes
Patients with T2D clustered into three different subgroups that relate to different stages of disease progression (see Figure 4a and Supplementary Table 4 for details).
Subgroup I included 18,325 patients and represents the mild symptom severity cohort, characterized by common T2D symptoms (e.g., metabolic syndrome), which were treated with Metformin, an oral hypoglycemic medication. It also included patients exposed to lifestyle risk factors, such as Obesity [6].
Subgroups II and III, which were composed of 22,659 and 7,704 patients, respectively, showed concomitant conditions associated with T2D progression and worsening symptoms. Specifically, subgroup II clustered patients characterized by microvascular problems, such as diabetic nephropathy, neuropathy, and/or peripheral artery disease. The significant presence of Creatinine and Urea nitrogen laboratory tests, which estimate renal function, suggests monitoring of kidney diseases, which are often related to T2D [32]. The presence of Pain in limb, combined with analgesic drugs (i.e., Paracetamol, Oxycodone), indicates the presence of vascular lesions at the peripheral level, manifested as ischemic rest pain or ulceration. This was confirmed by Peripheral vascular disease diagnoses, which account for 50% of terms in the T2D cohort.
Subgroup III showed severe cardiovascular problems, identified by a significant presence of medical concepts related to coronary artery diseases, e.g., Coronary atherosclerosis, Angina pectoris, which are serious risk factors for heart failure. These subjects were often treated with antiplatelet therapy (i.e., Acetylsalicylic acid, Clopidogrel) to prevent cardiovascular events (e.g., stroke) and were likely to receive invasive procedures to treat severe arteriopathy. For instance, 30% of patients in subgroup III underwent Percutaneous Transluminal Coronary Angioplasty, a procedure to open up blocked coronary arteries.
Our results confirm, in part, what was observed by Li et al. [10], who used topology analysis on an ad hoc cohort of T2D patients and identified three distinct subgroups characterized by 1) microvascular diabetic complications (i.e., diabetic nephropathy, diabetic retinopathy); 2) cancer of bronchus and lungs; and 3) cardiovascular diseases and psychiatric disorders. In particular, we detected the same microvascular and cardiovascular disease groups, which are consequences of T2D. In contrast, we were unable to detect a subgroup significantly characterized by cancer, an epiphenomenon that can be caused by secondary immunodeficiency in patients with T2D [33, 34]. See Supplementary Material for further description and a clustering comparison via Fowlkes-Mallows index.
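For readers unfamiliar with the Fowlkes-Mallows index mentioned above, the following sketch computes it from pair counts, using the standard definition FM = TP / sqrt((TP + FP)(TP + FN)), where TP counts patient pairs grouped together in both clusterings. The function name and the quadratic pair enumeration are choices made for this illustration only; how the comparison was actually computed in the Supplementary Material is not stated here.

```python
import math
from itertools import combinations


def fowlkes_mallows(labels_a, labels_b):
    """Fowlkes-Mallows index between two clusterings of the same patients.

    Pairs are enumerated explicitly (O(n^2)), which is fine for a small
    illustration but not for cohorts with tens of thousands of patients.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must label the same patients")

    tp = fp = fn = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        if same_a and same_b:
            tp += 1  # pair together in both clusterings
        elif same_a:
            fp += 1  # together only in the first clustering
        elif same_b:
            fn += 1  # together only in the second clustering

    if tp == 0:
        return 0.0
    return tp / math.sqrt((tp + fp) * (tp + fn))


if __name__ == "__main__":
    # Same partition with permuted cluster names: perfect agreement.
    print(fowlkes_mallows([0, 0, 1, 1], [1, 1, 0, 0]))
```

The index ranges from 0 (no pair grouped together in both clusterings) to 1 (identical partitions) and does not depend on how the clusters are named.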
Parkinson’s disease
Individuals diagnosed with PD divided into two groups (Figure 4b and Supplementary Table 5): one dominated by motor symptoms (1,368 patients) and another (1,684 patients) characterized by non-motor/independent features and a longer course of disease.
Subgroup I is characterized as a tremor-dominant cohort (i.e., manifested by motor symptoms) because of the significant presence of diagnoses such as Essential tremor, Anxiety state, and Dystonia. It is interesting to note that motor clinical features likely led to a common misdiagnosis of essential tremor, which is an action tremor that typically involves the hands. Parkinsonian tremor, on the contrary, although it can be present during postural maneuvers and action, is much more severe at rest and decreases with purposeful activities. However, when the tremor is severe, it is difficult to distinguish action tremor from resting tremor, leading to the aforementioned misdiagnosis [35]. Moreover, anxiety states, emotional excitement, and stressful situations can exacerbate the tremor and lead to a delayed PD diagnosis. Brain MRI, usually non-diagnostic in PD, was ordered for several patients in this subgroup (13%), suggesting its use for differential diagnosis, e.g., to investigate the presence of chronic/vascular encephalopathy.
Subgroup II included non-motor and independent symptoms, such as Constipation and Fatigue. Patients in subgroup II were significantly diagnosed with Coronary artery disease, which is prevalent in older patients (>50 years old). Constipation and fatigue are among the most common non-motor problems related to autonomic dysfunction, diminished activity level, and slowed intestinal transit time in PD [36, 37].
In their study of PD stratification with PPMI (Parkinson’s progression markers initiative) data, Zhang et al. [11] identified three distinct subgroups of patients based on severity of both motor and non-motor symptoms. In particular, one subgroup included patients with moderate functional decay in motor ability and stable cognitive ability; a second subgroup presented with mild functional decay in both motor and non-motor symptoms; and the third subgroup was characterized by rapid progression of both motor and non-motor symptoms. EHRs do not quantitatively capture PD symptom severity; therefore, our analyses cannot replicate these findings. However, unlike Zhang et al., we can discriminate between specific motor and non-motor symptoms and also suggest a longer, but not necessarily more severe, disease course for the non-motor symptom subgroup.
Alzheimer’s disease
Patients with AD separated into three subgroups marked by AD onset, disease progression, and severity of cognitive impairment (see Figure 4c and Supplementary Table 6).
Subgroup I comprises 399 patients with early-onset AD, i.e., patients whose dementia symptoms have typically developed between the age of 30 and 60 years, and initial neurocognitive disorder. Early-onset AD affects 5% of the individuals with AD in the US [38] and, because clinicians do not usually look for AD in younger patients, the diagnostic process includes extensive evaluations of patient symptoms. In particular, given that a definite AD diagnosis can only be provided post-mortem through brain examination, clinicians first rule out other causes that can lead to early-onset dementia (i.e., differential diagnosis). We find evidence of this practice in this subgroup, which includes postmenopausal women, identifiable by mean age greater than 50, Osteoporosis diagnosis with calcium supplement therapy, and menopausal hormone treatment (i.e., Estradiol). Patients in this group are also tested for infectious diseases (e.g., HIV, Syphilis, Hepatitis C, Chlamydia/Gonorrhoea) that are possible causes of early-onset dementia [39], and screened via structural neuroimaging, e.g., MRI/PET brain. As cognitive dysfunctions that may be mistaken for dementia can also be caused by depression and other psychiatric conditions, the presence of Psychiatric service/procedure suggests psychiatric evaluations to exclude depressive pseudodementia. After the differential diagnosis process and the exclusion of other possible causes, these patients eventually received a diagnosis of AD.
Subgroup II includes 1,170 patients with late-onset AD, mild neuropsychiatric symptoms, and cerebrovascular disease. Here, the absence of behavioral disturbances in 39% of patients and their high average age (M = 84 . , sd = 9.61) suggest a late AD onset, with a progression characterized by a slower rate of cognitive ability decline [40]. Moreover, the presence of Acetylsalicylic acid, an antiplatelet medication, and Intracranial hemorrhage diagnosis indicates the co-occurrence of cerebrovascular disease, which affects blood vessels and blood supply to the brain. Cerebrovascular diseases are common in aging and can often be associated with AD [41]. In this regard, Head CT may have been performed to prevent or identify structural abnormalities related to cerebrovascular disease.
Subgroup III comprises 1,632 individuals with typical onset and mild-to-moderate dementia symptoms. A cohort of 409 patients was treated with Donepezil, a cholinesterase inhibitor that is a primary treatment for cognitive symptoms and is usually administered to patients with mild-to-moderate AD, producing small improvements in cognition, neuropsychiatric symptoms, and activities of daily living [42]. Patients in this subgroup showed dementia both with and without behavioral disturbances.
Discussion
This study proposes a computational framework to disentangle the heterogeneity of complex disorders in large-scale EHRs through the identification of data-driven clinical patterns with machine learning. Specifically, we developed and validated an unsupervised architecture based on deep learning (i.e., ConvAE) to infer informative vector-based representations of millions of patients from a large and diverse hospital setting, which facilitates the identification of disease subgroups that can be leveraged to personalize medicine. These representations aim to be domain-free (i.e., not related to any specific task, since they are learned over a large multi-domain dataset) and enable patient stratification at scale. Results from our experiments show that ConvAE significantly outperformed several baselines on clustering patients with different complex conditions and led to the identification of different clinically meaningful disease subtypes.
Results identified disease progression, symptom severity, and comorbidities as contributing the most to the EHR-based clinical phenotypic variability of complex disorders. In particular, T2D patients divided into three subgroups according to comorbidities (i.e., cardiovascular and microvascular problems) and symptom severity (i.e., newly diagnosed with milder symptoms). Individuals with PD showed different disease duration and symptoms (i.e., motor, non-motor). AD profiles distinguished early- and late-onset groups and separated patients with mild neuropsychiatric symptoms and cerebrovascular disease from patients with mild-to-moderate dementia. Patients with MM were characterized by different comorbidities (e.g., amyloidosis, pulmonary diseases) that manifest alongside precise typical signs of MM. Patients with PC and BC separated according to disease progression. These findings showed that the features learned by ConvAE describe patients in a way that is general and conducive to identifying meaningful insights into different clinical domains.
In particular, this work aims to contribute to the next generation of clinical systems that can 1) scale to include many millions of patient records and 2) use a single, distributed patient representation to effectively support clinicians in their daily activities, rather than multiple systems working with different patient representations derived for different tasks [20].
To this aim, enabling efficient data-driven patient stratification analyses to identify disease subgroups is an important aspect to unlock personalized healthcare. Ideally, when new patients enter the medical system, their health status progression can be tied to a specific subgroup, thereby informing the treating clinician of personalized prognosis and possible effective treatment strategies, or counseling in cases where a certain diagnosis is difficult and a more thorough examination is required (e.g., specific genetic or lab tests). Moreover, the clinical characteristics of the different subtypes can potentially lead to intuitions for novel discoveries, such as comorbidities, side-effects, or repositioned drugs, which can be further investigated by analysing the patient clinical trajectories.
Previous studies mostly focused on a specific disease using ad hoc cohorts of patients and features [8, 9, 10, 11, 43, 44]. While these studies obtained relevant clinically meaningful results, the computational framework is hard to replicate for different diseases and is tied to the specific study and to the specific data. Deep learning has extensively been used to model EHRs for medical analysis [15, 16], including clinical prediction, such as disease onset, mortality, and readmission [45, 46, 47], and disease phenotyping [20, 48]. Because deep learning methods have not yet been leveraged for disease subtyping at scale, ConvAE aims to fill this gap and to provide an architecture that can improve unsupervised EHR pre-processing to favor patient stratification and unveil clinically meaningful and actionable insights.
Additionally, unlike previous representation learning methods, which did not consider the temporality of EHRs [20, 48], ConvAE uses CNNs in combination with embeddings to specifically capture some of the longitudinal aspects of patient clinical status, leading to more robust representations. CNNs were already used to model EHRs for specific predictive analysis, as part of supervised architectures [49, 50]. In contrast, we trained CNNs in an unsupervised framework based on autoencoders to learn general-purpose patient representations. While these representations were used to leverage disease subtype discovery, they can also be fine-tuned and applied to specific supervised tasks, such as disease phenotyping and prediction.
There are several limitations to our study. First, we acknowledge that the lack of any discernible pattern in the multi-disease clustering analysis can also be due to noise and biases in the data, which might affect both learned representations and clustering. In particular, processing EHRs with minimum data engineering, on the one hand, preserves all the available information and, to some extent, prevents systematic biases. On the other hand, it adds hospital-specific biases intrinsic to the EHR structure and noise due to data being redundant and too generic. Improving EHR pre-processing by, e.g., better modeling clinical notes and/or improving feature filtering, should help reduce noise and improve performance. Second, we identified patients related to complex disorders using SNOMED–CT codes, and this likely led to the inclusion of many false positives that affected the learning algorithms [51]. The use of phenotyping algorithms based on manual rules, e.g., PheKB [52], or semi-automated approaches, e.g., [53, 54], should help identify better cohorts of patients and, consequently, better disease subtypes. Another limitation comes from the choice, among all possibilities, of the specific complex disorders.
This allowed us to test the approach on heterogeneous conditions that affect different biological mechanisms, showing the efficacy of the proposed framework in generalizing to various clinical domains. Nevertheless, the approach should be further evaluated with other typologies of conditions as well, such as multiple sclerosis, autoimmune diseases, and psychiatric disorders. Lastly, we identified relevant concepts in the patient subgroups by simply evaluating their frequency. Adding a semantic modeling component based on, e.g., topic modeling [55] or word embeddings [56], might lead to more clinically meaningful patterns.
Future work will attempt to address these limitations and to further improve and replicate the architecture. First, we plan to enable multi-level clustering in order to stratify patients within the subtypes. This should lead to more granular patient stratification and thus to patterns on a more individual level. Second, we plan to verify ConvAE generalizability by replicating the study on EHRs from different healthcare institutions. Third, we will evaluate the use of disease subtypes as labels for training supervised models that can predict stratified patient risk scores. This, besides further validating the relevance of the results, will also provide an initial and intuitive framework to apply the results of patient stratification to clinical practice. To this aim, we plan to first assess treatment safety and efficacy between subtypes of a specific disease. Finally, to develop more comprehensive disease characterizations, we will include other modalities of data, e.g., genetics, into this framework, which will hopefully refine clustering and reveal new etiologies. Multi-modal stratified disease cohorts promise to facilitate better predictive capabilities for future outcomes by modeling how molecular mechanisms interact with clinical states.
Methods
The framework to derive patient representations that enable stratification analysis at scale is based on 3 steps: 1) data pre-processing; 2) unsupervised representation learning (i.e., ConvAE); and 3) clustering analysis of disease-specific cohorts (see Figure 1a). In this section, we report details of this framework as well as the description of the evaluation design.
Dataset
We used de-identified EHRs from the Mount Sinai Health System data warehouse; the study was approved by IRB-19-02369 in accordance with HIPAA guidelines. Mount Sinai Health System is a large and diverse urban hospital located in New York, NY, which generates a high volume of structured, semi-structured, and unstructured data from inpatient, outpatient, and emergency room visits. Patients in the system can have up to 12 years of follow-up data unless they are transferred or move their residence away from the hospital system. We accessed a de-identified dataset containing approximately 4 . V was composed of 57,464 clinical concepts.
We retained all patients with at least two concepts, resulting in a collection of 1 , , 741 different patients, with an average of 88 . , 932 females, 691 , , 488 not declared; the mean age of the population as of 2016 was 48.29 years (sd = 23 . , 000 random patients for tuning the model hyperparameters. Details of the pre-processed train and test sets are reported in Supplementary Table 1.
Data pre-processing
Every patient in the dataset is represented as a longitudinal sequence $s_p$ of length $M$ of aggregated temporally-ordered medical concepts, i.e., $s_p = (w_1, w_2, \ldots, w_M)$, where each $w_i$ is a medical concept from the vocabulary $V$. Pre-processing includes: 1) filtering the least and most frequent concepts; 2) dropping redundant concepts within fixed time frames; 3) splitting long sequences of records to include the complete patient history while leveraging the CNN framework, which requires fixed-size inputs.
We consider all the EHRs as a document $D$ and each patient sequence $s_p$ as a sentence. For each concept $w$ in $V$ we first compute the probability of having $w$ in $D$. We then multiply this by the sum of the probabilities to find $w$ in a sentence $s_p$ for all sentences. In particular, let $P$ be the set of all patients; $\forall w \in V$, the filtering score is defined as:
\[
P(w \in D) \sum_{p \in P} P(w \in s_p) = \frac{|\{s \in D ;\, w \in s\}|}{|D|} \sum_{p \in P} \frac{|\{w_i \in s_p ;\, w_i = w\}|}{|s_p|}, \tag{1}
\]
where $|D|$ is the total number of sentences and $|s_p|$ is the length of a patient sequence. The filtering score combines document frequency, i.e., number of patients with at least one occurrence of $w$, and term frequency, i.e., total number of occurrences of $w$ in a patient sequence. We then drop all concepts with filtering scores outside certain cut-off values to reduce the amount of noise (i.e., uninformative concepts that occur multiple times in few patients, or overly general concepts that occur in many patients).
A patient may have multiple encounters in their health records that span consecutive days and might include repeated concepts that are often artifacts of the EHR system, rather than new clinical entries. To reduce this bias, we drop all duplicate medical concepts from the patient records within overlapping time intervals of $T$ days. Within the same time window, we also randomly shuffle the medical concepts, given that events within the same encounter are generally randomly recorded [59, 54].
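The filtering score of Equation (1) can be sketched in a few lines. This is a minimal illustration under stated assumptions: patient sequences are plain Python lists of concept strings, empty sequences are ignored, and the function names and the `low`/`high` cut-offs in `keep_concepts` are hypothetical (the cut-off values themselves are not given in the text).

```python
from collections import Counter, defaultdict


def filtering_scores(sequences):
    """Score each concept as document frequency times summed term frequency.

    sequences: list of patient sequences, each a list of concept strings.
    Returns a dict mapping concept -> score, following Equation (1).
    """
    sequences = [seq for seq in sequences if seq]  # ignore empty sequences
    n_sentences = len(sequences)                   # |D|
    doc_count = Counter()                          # patients containing w
    tf_sum = defaultdict(float)                    # sum over p of count / |s_p|

    for seq in sequences:
        for concept, count in Counter(seq).items():
            doc_count[concept] += 1
            tf_sum[concept] += count / len(seq)

    return {
        concept: (doc_count[concept] / n_sentences) * tf_sum[concept]
        for concept in doc_count
    }


def keep_concepts(scores, low, high):
    """Keep concepts whose score lies within [low, high] (hypothetical cut-offs)."""
    return {c for c, s in scores.items() if low <= s <= high}


if __name__ == "__main__":
    toy = [["a", "b", "a"], ["a", "c"], ["b"]]
    scores = filtering_scores(toy)
    print(scores)
    print(keep_concepts(scores, low=0.2, high=0.8))
```

In the toy example, concept `a` appears in two of three sequences with relative frequencies 2/3 and 1/2, so its score is (2/3)(7/6) = 7/9.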
Lastly, we eliminate all patients with fewer than 3 concepts in their records.

Patient sequences are then chopped into subsequences of fixed length $L$ that are used to train the ConvAE model. Each patient sequence is thus defined as:

$$s_p = [(w_1, \ldots, w_L), (w_{L+1}, \ldots, w_{2L}), \ldots],$$

and subsequences shorter than $L$ are padded with 0 up to length $L$. For the sake of clarity, in the following section we present the architecture as applied to a general subsequence $s = (w_1, \ldots, w_L)$.

The ConvAE architecture
ConvAE is a representation learning model that transforms patient EHR subsequences into low-dimensional, dense vectors. The architecture consists of three stacked modules (see Figure 1b). This study proposes to combine embeddings, CNNs, and autoencoders to process EHRs and to derive unsupervised vector-based patient representations that can be used for clinical inference and medical analysis.

Given $s$, the architecture first assigns each medical concept $w$ to an $N$-dimensional embedding vector $v_w$ to capture the semantic relationships between medical concepts. Specifically, a patient subsequence is represented as an $(L \times N)$ matrix $E = (v_{w_1}, v_{w_2}, \ldots, v_{w_L})^T$, where $L$ is the subsequence length and $N$ is the embedding dimension. This structure also retains temporal information because the rows of matrix $E$ are temporally ordered according to patient visits.

The architecture is then composed of CNNs, which extract local temporal patterns, and AEs, which learn the embedded representations for each patient subsequence. The CNN applies temporal filters to each embedding matrix. CNN filters applied to EHRs usually perform a one-side convolution operation across time via filter sliding. A filter can be defined as $k \in \mathbb{R}^{h \times N}$, where $h$ is the variable window size and $N$ is the embedding dimension [60, 61]. Our approach differs in that it processes embedding matrices as if they were RGB images carrying a third “depth” dimension. With this approach, we enable the model filters to learn independent weights for each encoding dimension, thus activating for the most salient features in each dimension of the embedding space. Therefore, we reshape the $(L \times N)$ embedding matrix into $\tilde{E} \in \mathbb{R}^{1 \times L \times N}$ and we consider the embedding dimensions as channels. We then apply $f$ filters $k \in \mathbb{R}^{1 \times h \times N}$ to the padded input to keep the same output dimension and learn features that may grasp sequence characteristics. In particular, for each filter $j$, we obtain:

$$(R)_j = \mathrm{ReLU}\left(\sum_{i=0}^{N-1} k_i \star \tilde{e}_i + b_j\right), \quad j = 1, \ldots, f, \qquad (2)$$

where: $R \in \mathbb{R}^{1 \times L \times f}$ is the output matrix; $k_i$ is the $h$-dimensional weight matrix at depth $i$; $\tilde{e}_i \in \mathbb{R}^{1 \times L}$ is the $i$-th embedding dimension of the input matrix; $b$ is the bias vector; and $(\star)$ is the convolution function. We used Rectified Linear Unit (ReLU) as the activation function and max pooling. The output is then reshaped into a concatenated vector of dimension $L \cdot f$. This configuration learns different weights for each embedding dimension to highlight relevant interdependencies of medical concepts, and tunes representations of patient histories to identify the most relevant characteristics of their semantic space.

We then use fully dense layers of autoencoders to derive embedded patient representations that estimate the given input subsequences. Specifically, we extract the hidden representation $y$, an $H$-dimensional vector, as the encoded representation of each patient subsequence. Each patient sequence $s_p$ is then transformed into a sequence of encodings $s_h$ that can be post-modeled to obtain a unique vector-based patient representation. Here we simply component-wise average all the subsequence representations.

To train ConvAE, we set up a multi-class classification task that reconstructs each initial input one-hot subsequence of medical terms from its encoded representation. Given a subsequence of medical concepts $s$, the ConvAE is trained by minimizing the Cross Entropy (CE) loss:

$$\mathrm{CE}(\mathrm{Softmax}(O), s) = -\frac{1}{L} \sum_{j=1}^{L} \log\left(\mathrm{Softmax}(O_j)_{w_j}\right),$$

where $O$ is the output of ConvAE reshaped into a matrix of dimension $|V| \times L$, $w_j$ is the $j$-th element of sequence $s$ that corresponds to a term indexed in $V$, and:

$$\mathrm{Softmax}(O_j)_i = \frac{\exp O_{ji}}{\sum_{k=1}^{|V|} \exp O_{jk}}, \quad i = 1, \ldots, |V|. \qquad (3)$$

Since the objective function consists of only self-reconstruction errors, the model can be trained without any supervised training samples.
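The channel-wise convolution of equation (2) and the reconstruction loss of equation (3) can be illustrated with a plain-Python sketch. It is an illustration under stated assumptions rather than the trained model: the embedding lookup, max pooling, and autoencoder layers are omitted; the kernel is applied without flipping (cross-correlation, as PyTorch convolution layers do); an odd window size `h` is assumed; and `O` is stored as `L` rows of `|V|` scores, i.e., the transpose of the matrix described in the text. All names and toy values are hypothetical.

```python
import math


def conv_channels(E, kernels, biases):
    """Equation (2). E is L x N (rows = time steps, columns = embedding
    dimensions). kernels[j][i] is the length-h weight vector of filter j
    for channel i. Returns an L x f list of lists. Zero padding keeps
    the output length equal to L."""
    L, N = len(E), len(E[0])
    h = len(kernels[0][0])
    pad = h // 2
    out = []
    for t in range(L):
        row = []
        for k_j, b_j in zip(kernels, biases):
            acc = b_j
            for i in range(N):            # sum over embedding channels
                for u in range(h):        # window sliding along time
                    pos = t + u - pad
                    if 0 <= pos < L:      # positions outside are zero padding
                        acc += k_j[i][u] * E[pos][i]
            row.append(max(0.0, acc))     # ReLU
        out.append(row)
    return out


def reconstruction_loss(O, s):
    """Cross entropy of equation (3). O[j] holds the |V| output scores for
    position j; s[j] is the vocabulary index of the true concept there."""
    total = 0.0
    for scores, target in zip(O, s):
        m = max(scores)  # subtract the max for numerical stability
        log_norm = m + math.log(sum(math.exp(x - m) for x in scores))
        total -= scores[target] - log_norm  # -log Softmax(O_j)_{w_j}
    return total / len(s)


# Toy inputs, invented for illustration: L = 3, N = 2, one filter with h = 3.
E = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
kernels = [[[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]]]
R = conv_channels(E, kernels, biases=[-1.5])

# Uniform scores over a 2-term vocabulary give a loss of log(2).
loss = reconstruction_loss([[0.0, 0.0], [0.0, 0.0]], [0, 1])
```

Treating each embedding dimension as a channel means every filter holds a separate length-`h` weight vector per dimension, which is the property the text highlights.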
Clustering analysis for patient stratification
ConvAE-based representations can be used to stratify patients from any preselected cohort without needing additional feature engineering or manual adjustments. To this aim, patients with a specific disease are selected using, e.g., ICD codes, SNOMED–CT diagnosis, or phenotyping algorithms (e.g., [51, 53, 54]), and clustering is applied to the corresponding representations to identify disease subgroups. Here, specifically, we use SNOMED–CT diagnosis to preselect the disease cohorts and hierarchical clustering with Ward’s method and Euclidean distance to derive disease subgroups. We identify the number of subclusters that best disentangles heterogeneity on the disease dataset using the Elbow Method, which empirically selects the smallest number of clusters that minimizes the increase in explained variance.

A systematic analysis of the patients in each subgroup can then automatically identify the medical concepts that significantly and uniquely define each disease subtype. In this work, we rank all the codes by their frequency in the patient sequences. In particular, we compute the percentages of patients whose sequence includes a specific concept both with respect to a subcluster (i.e., in-group frequency) and to the complete disease cohort (i.e., total frequency). Ranking maximizes, first, the in-group percentage, and second, the total percentage. We then analyze the most frequent concepts and we use a pairwise chi-squared test to determine whether the distributions of present/absent concepts with respect to the detected subgroups are significantly different [11].
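The concept ranking described above can be sketched as follows. This is one possible reading, stated as an assumption: the in-group frequency is the percentage of subgroup patients whose sequence contains the concept, and the total frequency is the percentage of all cohort patients whose sequence contains it. The function name and toy data are hypothetical.

```python
def rank_concepts(sequences, labels, subgroup):
    """Rank the concepts of one subgroup by in-group, then total frequency.

    `sequences` maps patient id -> list of concepts; `labels` maps patient
    id -> cluster label. Returns (concept, in_group_pct, total_pct) tuples.
    """
    concept_sets = {p: set(seq) for p, seq in sequences.items()}
    cohort = list(concept_sets)
    group = [p for p in cohort if labels[p] == subgroup]
    if not group:
        return []
    vocabulary = set().union(*(concept_sets[p] for p in group))
    rows = []
    for w in vocabulary:
        in_group = 100.0 * sum(w in concept_sets[p] for p in group) / len(group)
        total = 100.0 * sum(w in concept_sets[p] for p in cohort) / len(cohort)
        rows.append((w, in_group, total))
    # In-group percentage first, total percentage second (both descending);
    # the concept name breaks remaining ties so the order is deterministic.
    rows.sort(key=lambda r: (-r[1], -r[2], r[0]))
    return rows


# Toy cohort, invented for illustration only.
toy_sequences = {"a": ["x", "y"], "b": ["x"], "c": ["y", "z"], "d": ["z"]}
toy_labels = {"a": 0, "b": 0, "c": 1, "d": 1}
ranking = rank_concepts(toy_sequences, toy_labels, subgroup=0)
```

The top-ranked concepts of each subgroup would then be the natural candidates for the pairwise chi-squared comparison mentioned in the text.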
Implementation details
All model hyperparameters were empirically tuned to minimize the network reconstruction error, while balancing training efficiency and computation time. We tested a large number of configurations (e.g., time interval $T$ equal to { , } ; patient subsequence length $L$ equal to { , } ; embedding dimension $N$ { , , } ). For brevity, we report only the final setting used in the patient stratification experiments. All modules were implemented in Python 3 . . 2, using scikit-learn and pytorch as machine learning libraries [62, 63]. Computations were run on a server with an Nvidia Titan V GPU.

We used equation (1) to discard terms with a filtering score less than 10 − , i.e., document frequency ranging from 1 to 10. Examples of discarded concepts are clotrimazole, an antifungal medication, and torsemide, a medication to reduce extra fluid in the body. We decided to retain all the very frequent concepts as most of them seemed clinically informative (e.g., vital signs). Patients with fewer than 3 medical concepts were then discarded. In total, 24,665 medical terms were filtered out, decreasing the vocabulary size to 32 , T = 15 days, shuffled unique medical concepts and dropped redundant terms. Patient sequences were then split into subsequences of length $L = 32$ concepts, obtaining about ∼ M subsequences of medical concepts for training. This value was chosen to enable efficient training of the autoencoder with GPUs.

We initialized medical concept embeddings using word2vec with the skip-gram model [56]. We considered all the subsequences in the training set as sentences and medical concepts as words [54, 59]. We obtained 100-dimensional embeddings for 31,659 medical concepts of the vocabulary. The remaining concepts were initialized randomly; the subsequence padding was initialized as the null vector (i.e., at ). These embedding vectors were then used as input for the ConvAE module and were further refined during the model training.

The CNN module used 50 filters with kernel size equal to 5 and ReLU activation function. The autoencoder was composed of 4 hidden layers with 200, 100, 200 and $|V| \times 32$ hidden nodes, respectively, where $|V|$ is the vocabulary size. We used ReLU activation in the first three layers and Softplus activation in the final layer to obtain continuous output. We applied dropout with p = 0 . − and weight decay = 10 − ) [64] for 5 epochs on all training data and batch size of 128. The size of the patient representations was equal to 100.

We evaluated different CNN configurations composed of one layer (i.e., “ConvAE 1-layer CNN”), two layers (i.e., “ConvAE 2-layer CNN”), and one multikernel layer (i.e., “ConvAE multikernel CNN”). All hyperparameters were the same, except the number of filters in the second CNN of the 2-layer configuration, which was set to 25. Multikernel CNN performs parallel training of distinct CNNs with different kernel sizes, and concatenates the final outputs. We used kernel dimensions equal to 3, 5, and 7.

Baselines

We compared ConvAE with the following representation learning algorithms: “RawCount”, “SVD-RawCount”, “SVD-TFIDF”, and “Deep Patient”. All baselines derived vector-based patient encodings of size 100.

RawCount is a sparse representation where each patient is encoded into a count vector that has the length of the vocabulary. More specifically, each individual health history $s_p$ is represented as an integer vector $x \in \mathbb{Z}^{|V|}$, where each element is the frequency of the corresponding clinical concept in the patient longitudinal history, i.e., $x_i = |\{w_i ;\, w_i \in s_p\}|$.

SVD-RawCount applies truncated singular value decomposition (SVD) to the RawCount matrix to compute the largest singular values of the raw count encodings, which define the dense, lower-dimensional representations.

SVD-TFIDF transforms the raw count encodings using the term frequency–inverse document frequency (TFIDF) weighting schema and applies truncated SVD to the resulting matrix. We considered the patient EHR sequences as documents and the entire dataset as the corpus, and we derived TFIDF scores for all medical concepts. Each patient is then represented as a vector of length $|V|$, with the corresponding TFIDF weight for each concept, and the matrix obtained is reduced via truncated SVD.

Deep Patient transforms the raw count matrix using a stack of denoising autoencoders as proposed by Miotto et al. [20]. We used the implementation details presented in the paper, with batch size equal to 32, corruption noise equal to 5%, and 5 training epochs.

Multi-disease clustering analysis
We evaluated all the representation learning approaches in a clustering task to determine how well they were able to disentangle patients with different conditions. We chose eight complex disorders: type 2 diabetes (T2D), multiple myeloma (MM), Parkinson’s disease (PD), Alzheimer’s disease (AD), Crohn’s disease (CD), prostate cancer (PC), breast cancer (BC) and attention deficit hyperactivity disorder (ADHD). We retrieved all the corresponding patients in the test sets using SNOMED–CT codes after verifying that at least one corresponding ICD-9 code was present in the patient’s EHRs. In particular, we looked for Type 2 diabetes mellitus (250.00) for T2D; Multiple myeloma without mention of having achieved remission (203.00) for MM; Paralysis agitans (332.0) for PD; Alzheimer’s disease (331.0) for AD; Regional enteritis of unspecified site (555.9) for CD; Malignant neoplasm of prostate (185) for PC; Malignant neoplasm of female breast (174.9) for BC; and Attention deficit disorder with hyperactivity (314.01) for ADHD. We discarded all patients with comorbidities within the selected diseases to facilitate the clustering interpretation. We then performed hierarchical clustering with $k = 8$ clusters (i.e., the same as the number of different diseases) for all the representations to evaluate if patients with the same condition were grouping together. The final test sets were composed of about 94,000 patients per fold but were unbalanced, with disease cohorts ranging from about 1,900 to 50,000 patients (see Supplementary Table 2). To use balanced datasets and improve the efficacy of the experiment, we sub-sampled 5,000 random patients for the highly populated diseases, and we iterated this subsampling process 100 times, obtaining 100 different clusterings per test set.

We used entropy and purity scores averaged across the 100 experiments of each fold to measure to what extent the clusters matched the different diseases. In particular, for each cluster $j$, we define the probability that a patient in $j$ has disease $i$ as:

$$p_{ij} = \frac{m_{ij}}{m_j}, \qquad (4)$$

where $m_j$ is the number of patients in cluster $j$ and $m_{ij}$ is the number of patients in cluster $j$ with a diagnosis of disease $i$. Entropy for each cluster is defined as:

$$E_j = -\sum_i p_{ij} \log p_{ij}, \qquad (5)$$

and conditional entropy $H(\text{disease} \mid \text{cluster})$ is then computed as:

$$H(\text{disease} \mid \text{cluster}) = \sum_j \frac{m_j}{m} E_j,$$

where $m$ is the total number of elements in the complex disease dataset.

Purity identifies the most represented disease in each cluster. For a cluster $j$, purity $P_j$ is defined as $P_j = \max_i p_{ij}$, where $p_{ij}$ is computed as before. The overall purity score is then the weighted average of $P_j$ for each cluster $j$. A perfect clustering obtains averaged entropy and purity scores equal to 0 and 1, respectively.

Disease subtyping analysis
We evaluated the usability of ConvAE representations to discover disease subtypes for different and diverse conditions (i.e., patient stratification at scale). In particular, we selected a cohort of patients with T2D, PD, AD, MM, PC, and BC and ran hierarchical clustering on the ConvAE-based patient representations. These are all age-related complex disorders with late onset (i.e., increased prevalence after 60 years of age [26, 27, 28, 29, 30, 31]). We focused only on these conditions to attempt to reduce confounding age effects that could affect the analysis of the subtypes (as could happen in the CD and ADHD cohorts, where a common onset age is less defined). To reduce noise in the sequence encodings, we averaged all patient subsequence representations from the first diagnosis forward, and we dropped sequences shorter than 3 concepts. We varied the number of clusters from 2 to 15 and we used the Elbow Method to empirically select the smallest number of clusters that minimizes the increase in explained variance. We then performed a qualitative analysis of each subtype, similarly to Zhang et al. [11], to identify which medical concepts characterized the specific group of patients. We further verified the various subgroups in the medical literature and with the support of a practicing clinician.
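For reference, the conditional entropy and purity scores defined in equations (4) and (5) of the multi-disease clustering analysis can be sketched in a few lines. The logarithm base is not stated in the text; base 2 is used here as an assumption. The function name and the toy labels are hypothetical.

```python
import math
from collections import Counter


def entropy_and_purity(diseases, clusters):
    """Conditional entropy H(disease | cluster) and overall purity.

    `diseases[n]` and `clusters[n]` are the disease label and the cluster
    assignment of patient n.
    """
    m = len(diseases)
    by_cluster = {}
    for d, c in zip(diseases, clusters):
        by_cluster.setdefault(c, Counter())[d] += 1
    entropy = 0.0
    purity = 0.0
    for counts in by_cluster.values():
        m_j = sum(counts.values())
        p = [m_ij / m_j for m_ij in counts.values()]      # equation (4)
        e_j = -sum(p_ij * math.log2(p_ij) for p_ij in p)  # equation (5), base 2 assumed
        entropy += m_j / m * e_j                          # weight by cluster size
        purity += m_j / m * max(p)                        # weighted average of P_j
    return entropy, purity


# Toy labels, invented for illustration only.
perfect = entropy_and_purity(["A", "A", "B", "B"], [0, 0, 1, 1])
mixed = entropy_and_purity(["A", "B", "A", "B"], [0, 0, 1, 1])
```

With these toy labels, the clustering that separates the two diseases reaches the ideal scores, while the clustering that mixes them evenly gives one bit of entropy and a purity of one half.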
Data availability
The data used for this study are available from the Mount Sinai Health System (NYC), but restrictionsapply to the availability of these data, which were used under license for the current study, and so are notpublicly available. Data are however available from the authors upon reasonable request and with permissionof Mount Sinai Health System.
Code availability
Code is available at: https://github.com/landiisotta/convae_architecture.

Acknowledgments
R.M. would like to thank the support from the Hasso Plattner Foundation, the Alzheimer’s Drug DiscoveryFoundation and a courtesy GPU donation from Nvidia. I.L. acknowledges the support from the BrunoKessler Institute.
Competing interests
The authors declare no competing interests.
Author contributions
I.L. and R.M. conceived and designed the work. I.L. conducted the research and the experimental evaluation, and drafted the manuscript. R.M. created the dataset, supervised and supported the research, and substantially edited the manuscript. B.S.G. substantially edited the manuscript and created the architecture figures. H.L. and S.C. advised on methodological choices and critically revised the manuscript. G.L. provided clinical validation of the results and critically revised the manuscript. M.D. revised the manuscript and contributed to the interpretation of the data. J.T.D. and C.F. supported the research and revised the manuscript. All the authors gave final approval of the completed manuscript version and are accountable for all aspects of the work.

References

[1] Jensen, P. B., Jensen, L. J. & Brunak, S. Mining electronic health records: towards better research applications and clinical care.
Nature Reviews Genetics
395 (2012).[2] Cutting, G. R. Cystic fibrosis genetics: from molecular understanding to clinical application.
NatureReviews Genetics et al.
Large-scale phenome analysis defines a behavioral signature for Huntington’sdisease genotype in mice.
Nature Biotechnology
Annals ofNeurology
BioMed Research International,
Diabetologia
NatureReviews Drug Discovery et al. Patient Subtyping via Time-Aware LSTM Networks in Proceedings of the 23rdACM SIGKDD International Conference on Knowledge Discovery and Data Mining (ACM, Halifax,NS, Canada, 2017), 65–74. doi: .[9] Doshi-Velez, F., Ge, Y. & Kohane, I. Comorbidity Clusters in Autism Spectrum Disorders: an Elec-tronic Health Record Time-Series Analysis.
Pediatrics e54–63 (2013).[10] Li, L. et al.
Identification of type 2 diabetes subgroups through topological analysis of patient similarity.
Science translational medicine et al. Data-Driven Subtyping of Parkinson’s Disease Using Longitudinal Clinical Records:a Cohort Study.
Scientific Reports
797 (2019).[12] Chen, D. et al.
Deep learning and alternative learning strategies for retrospective real-world clinicaldata. npj Digital Medicine IEEEtransactions on pattern analysis and machine intelligence
Nature
Briefings in Bioinformatics
Journal of the American Medical InformaticsAssociation et al.
The Impact of Phenotypic and Genetic Heterogeneity on Results of Genome WideAssociation Studies of Complex Diseases.
PLoS ONE e76295 (2013).[18] Banda, J. M., Seneviratne, M., Hernandez-Boussard, T. & Shah, N. H. Advances in electronic phe-notyping: from rule-based definitions to machine learning models. Annual review of biomedical datascience JAMA
Scientific Reports Pattern Recognition
PLoS ONE (2018).[23] Brun, M. et al. Model-based evaluation of clustering validation measures.
Pattern Recognition
Information retrieval
Journal of Open Source Software
861 (2018).[26] Cowie, C. C., Casagrande, S. S. & Geiss, L. S. Prevalence and incidence of type 2 diabetes andprediabetes.
Diabetes in America, 3rd edn. National Institutes of Health, Bethesda, MD,
The Lancet Neurology Dialogues in clinical neuroscience
111 (2009).[29] Kazandjian, D.
Multiple myeloma epidemiology and survival: a unique malignancy in Seminars inoncology (2016), 676–681.[30] https://seer.cancer.gov/statfacts/html/prost.html . (Accessed on September 17, 2019).[31] https://seer.cancer.gov/statfacts/html/breast.html . (Accessed on September 17, 2019).[32] Vallon, V. & Komers, R. Pathophysiology of the diabetic kidney. Comprehensive Physiology CriticalReviews in Oncology/Hematology et al.
Impaired Leucocyte Functions in Diabetic Patients.
Diabetic Medicine
Archivesof Neurology
Neurology et al.
Fatigue in Parkinson’s disease: A systematic review and meta-analysis.
MovementDisorders .(Accessed on October 14, 2019).[39] Manji, H., J¨ager, H. R. & Winston, A. HIV, dementia and antiretroviral drugs: 30 years of an epidemic.
Journal of Neurology, Neurosurgery & Psychiatry et al.
Prevalence of Neuropsychiatric Symptoms in Dementia and Mild CognitiveImpairment.
JAMA et al.
Vascular contributions to cognitive impairment and dementia including Alzheimer’sdisease.
Alzheimer’s & Dementia
Cochrane Database ofSystematic Reviews CD001190 (2018).[43] Lombardo, M. V. et al.
Unsupervised data-driven stratification of mentalizing heterogeneity in autism.
Scientific Reports et al. Identification and analysis of behavioral phenotypes in autism spectrum disorder viaunsupervised machine learning.
International Journal of Medical Informatics
Doctor AI: Predicting Clinical Events via Recurrent Neural Networks in Proceedings of Machine Learning for Healthcare (2016).[46] Pham, T., Tran, T., Phung, D. & Venkatesh, S. DeepCare: A Deep Dynamic Memory Model forPredictive Medicine in Advances in Knowledge Discovery and Data Mining (Springer InternationalPublishing, 2016), 30–41.[47] Rajkomar, A. et al.
Scalable and accurate deep learning with electronic health records. npj DigitalMedicine
18 (2018).[48] Beaulieu-Jones, B. K., Greene, C. S., et al.
Semi-supervised learning of the electronic health record forphenotype stratification.
Journal of biomedical informatics
Deepr : a convolutional net for medicalrecords.
IEEE Journal of Biomedical and Health Informatics et al.
Deep patient similarity learning for personalized healthcare.
IEEE Transactions onNanoBioscience et al.
Combining billing codes, clinical notes, and medications from electronic health recordsprovides superior phenotyping performance.
Journal of the American Medical Informatics Association e20–27 (2015).[52] Kirby, J. C. et al.
PheKB: a catalog and workflow for creating electronic phenotype algorithms fortransportability.
Journal of the American Medical Informatics Association
Journal of the American Medical Informatics Association et al. Automated disease cohort selection using word embeddings from Electronic HealthRecords in Biocomputing 2018 (World Scientific, 2017), 145–156. doi: .[55] Blei, D., Ng, A. & Jordan, M. Latent Dirichlet Allocation.
Journal of Machine Learning Research Efficient Estimation of Word Representations in VectorSpace.
Preprint at https://arxiv.org/abs/1301.3781 . (2013).[57] Jonquet, C., Shah, N. H. & Musen, M. A.
The Open Biomedical Annotator in AMIA Summits onTranslational Science Proceedings (2009), 56–60.[58] Lependu, P., Iyer, S. V., Fairon, C. & Shah, N. H. Annotation analysis for testing drug safety signalsusing unstructured clinical notes.
Journal of Biomedical Semantics s5 (2012).[59] Choi, Y., Chiu, C. Y. I. & Sontag, D.
Learning low-dimensional representations of medical concepts in AMIA Summits on Translational Science Proceedings (2016), 41–50.2360] Zhu, Z. et al. Measuring Patient Similarities via a Deep Architecture with Medical Concept Embedding in (2016), 749–758. doi: .[61] Suo, Q. et al. Personalized disease prediction using a CNN-based similarity learning method in (2017), 811–816. doi: .[62] Pedregosa, F. et al. Scikit-learn: Machine Learning in Python.
Journal of Machine Learning Research et al. Automatic differentiation in pytorch in NeurIPS Autodiff Workshop (2017).
[64] Kingma, D. P. & Ba, J. Adam: A Method for Stochastic Optimization in Proceedings of the 3rd International Conference on Learning Representations (2014), 1–15.

Entropy Purity Disease Number ConvAE 1-layer CNN 2 .
61 (0 . , [2 . , . ∗∗∗ .
31 (0 . , [0 . , . ∗∗∗ .
50 (0 . ∗∗∗ ConvAE 2-layer CNN 2 .
75 (0 . , [2 . , . .
26 (0 . , [0 . , . .
93 (0 . .
66 (0 . , [2 . , . .
30 (0 . , [0 . , . .
94 (0 . .
90 (0 . , [2 . , . .
18 (0 . , [0 . , . .
76 (0 . .
90 (0 . , [2 . , . .
19 (0 . , [0 . , . .
13 (0 . .
85 (0 . , [2 . , . .
21 (0 . , [0 . , . .
83 (0 . .
81 (0 . , [2 . , . .
24 (0 . , [0 . , . .
96 (0 .

Mean (sd, CI); Mean (standard deviation); ∗ p < . ∗∗ p < . ∗∗∗ p < .

Table 1: Multi-disease clustering performances of ConvAE configurations and baselines. The scores reported are averaged over a 2-fold cross-validation experiment. ConvAE 1-layer CNN significantly outperforms all other configurations and baselines on all measures. Multiple pairwise t-tests with Bonferroni correction are used to compare performances.

Figure 1: Patient stratification framework and ConvAE architecture. (a) Framework enabling patient stratification analysis from deep unsupervised EHR representations; (b) Details of the ConvAE representation learning architecture.

Figure 2: Uniform manifold approximation and projection (UMAP) encoding visualization. (a) ConvAE 1-layer CNN; (b) SVD-RawCount; (c) SVD-TFIDF; (d) Deep Patient. AD = Alzheimer’s disease; ADHD = Attention deficit hyperactivity disorder; BC = Breast cancer; CD = Crohn’s disease; MM = Multiple myeloma; PC = Prostate cancer; PD = Parkinson’s disease; T2D = Type 2 diabetes.

Figure 3: Uniform manifold approximation and projection (UMAP) clustering visualization. (a) ConvAE 1-layer CNN; (b) SVD-RawCount; (c) SVD-TFIDF; (d) Deep Patient. AD = Alzheimer’s disease; ADHD = Attention deficit hyperactivity disorder; BC = Breast cancer; CD = Crohn’s disease; MM = Multiple myeloma; PC = Prostate cancer; PD = Parkinson’s disease; T2D = Type 2 diabetes.

Figure 4: Complex disorder subgroups. A subsample of 5,000 patients with T2D is displayed in Figure (a). Figures (b), (c), (d), (e), (f) display patient subtypes for Parkinson’s and Alzheimer’s disease, multiple myeloma, prostate and breast cancer cohorts, respectively.

Supplementary Material

Clustering comparison for the type 2 diabetes analysis
Li et al. [1] used a cohort of EHRs similar to the one in this study to stratify patients with type 2 diabetes (T2D). Of the 2,472 patients from their paper, we identified 1,050 in our test sets. To compare the results, we evaluated the similarity of the clusters we obtained to those found by Li et al. via the Fowlkes-Mallows index (FMI), which is an external validation similarity measure of two cluster analyses [2, 3]. FMI scores range from 0 to 1, where 1 represents identical clustering and 0 purely independent label assignments. We obtained FMI = 0.40, which suggests that only a portion of patients in groups from Li et al. [1] are identified by our approach as sharing the same characteristics. This may entail that associated clinical phenotypes overlap to a greater extent than hypothesized by Li et al., which may have been overlooked because they collected shorter EHR sequences (i.e., 60-day intervals) and used a manually derived subset of features.
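The Fowlkes-Mallows index used above can be illustrated with a pair-counting sketch in plain Python. It follows the standard definition FMI = TP / sqrt((TP + FP)(TP + FN)), where TP counts pairs of patients grouped together in both clusterings; the function name and toy labels are hypothetical, and the quadratic loop over pairs is meant for small cohorts only.

```python
import math
from itertools import combinations


def fowlkes_mallows(labels_a, labels_b):
    """Fowlkes-Mallows index between two clusterings of the same patients."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        if same_a and same_b:
            tp += 1   # pair grouped together in both clusterings
        elif same_b:
            fp += 1   # together in the second clustering only
        elif same_a:
            fn += 1   # together in the first clustering only
    if tp == 0:
        return 0.0
    return tp / math.sqrt((tp + fp) * (tp + fn))


# Toy label vectors, invented for illustration only.
identical = fowlkes_mallows([0, 0, 1, 1], [1, 1, 0, 0])   # same partition, renamed labels
partial = fowlkes_mallows([0, 0, 0, 1], [0, 0, 1, 1])
```

Because the index only counts co-assigned pairs, it is invariant to how cluster labels are named, which makes it suitable for comparing subgroups derived in two independent studies.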
Disease subtyping
Multiple myeloma

We identified five subgroups for multiple myeloma (MM) (see Figure 4d and Supplementary Table 7). In particular, subgroup I is characterized by pulmonary manifestations; subgroup II shows bone-related signs of MM; subgroup III includes signs of gastrointestinal problems; subgroup IV is defined by kidney problems; and subgroup V shows signs of peripheral neuropathy.

Pulmonary manifestations in subgroup I include Pleura effusion, a rare pulmonary manifestation of amyloidosis [4] that is a comorbidity of MM found in 10–15% of patients (i.e., superimposed amyloidosis). Subgroup I is also characterized by patients with amyloidosis and proteinuria (i.e., excess of proteins in urine) because of the large frequency of Urea nitrogen blood test.

Disorders of bone and cartilage largely characterizes patients in subgroup II, which can be identified with bone-related signs of MM.

Subgroups III and V include patients who received chemotherapy and/or anti-cancer medications. In particular, we often found Bortezomib in combination with Dexamethasone in both subgroups. Bortezomib, for example, is administered to 47% of patients in subgroup III and to 26% of patients in subgroup V. It can be used: 1) for patients ineligible for hematopoietic cell transplantation (HCT); 2) as a maintenance therapy; or 3) in conjunction with HCT for newly diagnosed patients [5]. Given the characterization of subgroups III and V, we expect gastrointestinal problems in subgroup III and the Inflammatory/toxic neuropathy diagnosis in subgroup V to indicate different side effects from anti-cancer medications. Peripheral nerve damage is also one of the most significant non-hematologic toxicities of Bortezomib [6]. Although unlikely, neurologic complications can also be caused by MM. Such neurologic complications can be due to spinal cord compression from an extramedullary plasmacytoma, or to peripheral neuropathy, which is rare and usually caused by superimposed amyloidosis [7]. The Counseling concept in subgroup V likely denotes an encounter to treat severe pain linked to neurologic diseases or to provide psychological support.

Creatinine, Urea nitrogen, and Urinalysis testing indicate renal function estimation for patients in subgroup IV. Moreover, 9% of patients report Nephritis and nephropathy and Chronic kidney disease diagnoses, reinforcing the association of subgroup IV with kidney conditions.
Prostate cancer

We find 2 subgroups of patients with prostate cancer (PC) related to diverging disease courses (see Figure 4e and Supplementary Table 8).

Clinical manifestation of PC is heterogeneous and may range from asymptomatic screen, microscopic, well-differentiated tumor that may never become clinically relevant, to clinically symptomatic aggressive cancer that causes metastases, morbidity, and death. Treatment approaches for PC include: active surveillance, radical prostatectomy, or radiation therapy (RT) for patients with low-risk PC; prostatectomy or RT in combination with Androgen Deprivation Therapy (ADT) for patients with higher-risk, but localized PC; RT and ADT for patients with clinical evidence of lymph node involvement.

Patients in subgroup I report Personal history of PC and Ondansetron medication to prevent RT side effects. This suggests that this group includes patients with recurrent prostate cancer who have either received prostatectomy in the past, and hence require RT and ADT, or have already received RT and thus require a radical approach. Anastomosis and Pelvic lymphadenectomy concepts, which are related to post-prostatectomy procedures and are frequent in these patients, support this description.

Clinical manifestations of PC are usually absent at the time of diagnosis, and over 90% of patients are diagnosed via specific screening (e.g., use of prostate-specific antigen (PSA) or digital rectal examination). Patients in subgroup II show frequent signs of effective PSA screening, indicating probable localized and asymptomatic PC. Diagnoses of Nocturia, Impotence of organic origin, and Urinary frequency, and treatments for male sexual dysfunctions, i.e., Tadalafil and Sildenafil, are all signs of side effects from PC treatments [8]. Among them, at least 22% likely received a prostatectomy (Surgery).

Differently from the second subgroup, patients in the first subgroup do not have PSA among top-ranked concepts. This suggests that subgroup I includes patients who have already received prostatectomy, which makes PSA screening less common. Patients in subgroup I appear to have been in the healthcare system for longer and also to have been diagnosed with PC earlier (i.e., similar median age to subgroup II, but absent PSA screening).
Breast cancer
Stratification of breast cancer (BC) patients led to two different subgroups (see Figure 4f and Supplementary Table 9). Subgroup I is linked to advanced stages of BC. Patients in subgroup II, instead, are younger and present a high number of screening-related medical concepts (e.g., Mammography screening). In addition, concepts like Abnormal mammogram and Carcinoma in situ of breast suggest an early-stage diagnosis.

In subgroup I, 23% of patients report Unlisted chemotherapy and 44% report Surgery. This suggests that these patients may have a more advanced disease, as also evidenced by the lack of screening terms. As a result, they typically undergo chemotherapy treatment, which is more common in advanced stages of BC, whereas primary surgery (lumpectomy, mastectomy), with or without radiation therapy, is preferred for early-stage cancer. This group also includes patients who have already received surgical treatments (33% having received a partial mastectomy) and thus can either be disease-free or have relapsed. The presence of Secondary malignant neoplasm also suggests that subgroup I includes patients with metastatic BC.

It would be important to characterize what the general concepts Unlisted chemotherapy and Antineoplastic chemotherapy refer to in terms of specific treatments (e.g., hormonal drugs, immunotherapy), in order to better understand the clinical characteristics of the different subgroups. Moreover, because different molecular subtypes of BC have been identified based on gene expression profiling [9], including hormonal profiles of patients (not available for this study) might improve the stratification results.
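The percentages quoted in the subgroup descriptions can be reproduced with a simple tally. The following is a minimal Python sketch, not the authors' code. It assumes that the in-group percentage is the share of a subgroup's patients who have a term at least once, and that the percentage in parentheses in the supplementary tables is the share of all cohort patients with that term who fall in the subgroup (a reading consistent with Surgery at 81% and 19% in the two BC subgroups). The function name, the input dictionaries, and the toy data are hypothetical.

```python
from collections import Counter


def term_frequencies(subgroup_of, terms_of):
    """Return {(subgroup, term): (in_group_pct, total_pct)}.

    subgroup_of: patient id -> subgroup label (hypothetical input format)
    terms_of:    patient id -> list of clinical terms in the record
    """
    group_sizes = Counter(subgroup_of.values())
    term_totals = Counter()   # patients in the whole cohort with the term
    pair_counts = Counter()   # patients in a given subgroup with the term
    for patient, terms in terms_of.items():
        group = subgroup_of[patient]
        for term in set(terms):  # count each term once per patient
            term_totals[term] += 1
            pair_counts[(group, term)] += 1
    return {
        (group, term): (
            100.0 * n / group_sizes[group],  # in-group percentage
            100.0 * n / term_totals[term],   # share of the term's patients
        )
        for (group, term), n in pair_counts.items()
    }


# Toy cohort (invented for illustration only).
subgroup_of = {"p1": "I", "p2": "I", "p3": "I", "p4": "I", "p5": "II", "p6": "II"}
terms_of = {
    "p1": ["Surgery", "Chemotherapy"],
    "p2": ["Surgery"],
    "p3": ["Surgery", "Surgery"],
    "p4": [],
    "p5": ["Surgery", "Mammography"],
    "p6": ["Mammography"],
}
freqs = term_frequencies(subgroup_of, terms_of)
print(freqs[("I", "Surgery")])
```

Under this reading, the parenthesized percentages for one term sum to 100% across subgroups, which is a quick sanity check when comparing two subgroups.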
Replication of disease subtyping
In the following, we present the patient stratification results obtained with the second split. As highlighted in Supplementary Figure 3, we found slightly different subgroups only for PC and MM (when compared with the results of the first split).

MM encodings detect 4 instead of 5 subgroups. We found two subgroups showing kidney-related problems, one subgroup reporting signs of chemotherapy treatment side effects (i.e., Inflammatory/toxic neuropathy), and one subgroup identified by signs of possible superimposed amyloidosis, i.e., Disease of salivary glands.

Patients with PC split into three subgroups, where subgroups II and III appear to be a further refinement of subgroup II identified in the first split. In particular, subgroup III includes significantly younger subjects compared to subgroup II. The presence of Personal history of PC suggests that subgroup II includes patients with relapsing PC. This subgroup is of particular importance for the investigation of treatment effectiveness.

The analysis for the other diseases led to very similar results to those obtained with the first split. In particular, for T2D we identified three subgroups: a group with signs of metabolic syndrome and T2D risk factors, a group with microvascular problems, and a third group showing signs of cardiovascular disorders. Patients with PD separate into two subgroups, with motor and non-motor symptoms, respectively, as previously found. AD and BC are again characterized by three and two subgroups, respectively, with the same clinical profiles previously presented.

References

[1] Li, L. et al.
Identification of type 2 diabetes subgroups through topological analysis of patient similarity. Science Translational Medicine.
[2] … Journal of the American Statistical Association.
[3] … Journal of Multivariate Analysis.
[4] … Current Opinion in Pulmonary Medicine.
[5] … et al. Lenalidomide, Bortezomib, and Dexamethasone with Transplantation for Myeloma. New England Journal of Medicine.
[6] … Blood.
[7] … Best Practice & Research Clinical Haematology.
[8] … JNCI: Journal of the National Cancer Institute.
[9] … et al. Repeated observation of breast tumor subtypes in independent gene expression datasets. Proceedings of the National Academy of Sciences.

Split 1 | Split 2
Train | Test | Train | Test
Patients | 741,177 | 751,979 | 740,922 | 751,…
… | …,…,014 | 3,…,238 | 3,…,596 | 3,…,…
… | ….91 (12.13) | 4.86 (12.06) | 4.92 (12.14) | 4.85 (12.…)
… | …,799 | 32,156 | 32,875 | 32,…

Supplementary Table 1: Train and test set characteristics.
Complex disorder | Test set 1 | Test set 2
Type 2 diabetes | 50,253 | 50,…
… | …,124 | 3,…
… | …,374 | 3,…
… | …,947 | 1,…
… | …,401 | 14,…
… | …,330 | 8,…
… | …,668 | 6,…
… | …,510 | 6,…

ADHD = Attention deficit hyperactivity disorder

Supplementary Table 2: Number of subjects in the complex disorder cohorts.
Cohort | Test set 1: Numerosity | Test set 1: N clusters | Test set 2: Numerosity | Test set 2: N clusters
T2D | 48,688 | 3 | 48,759 | 3
PD | 3,052 | 2 | 3,071 | 2
AD | 3,201 | 3 | 3,150 | 3
MM | 1,884 | 5 | 1,883 | 4
PC | 8,522 | 2 | 8,645 | 3
BC | 7,964 | 2 | 7,838 | 2

T2D = Type 2 diabetes; PD = Parkinson’s disease; AD = Alzheimer’s disease; MM = Multiple myeloma; PC = Prostate cancer; BC = Breast cancer
Supplementary Table 3: Complex disorder cohorts and number of subclusters identified via patient stratification.

Type 2 diabetes
Subgroup I | Subgroup II | Subgroup III
Female/Male: ∗a | ∗a | ∗a
Age: ∗b | ∗b | ∗b
ICD-9
Hypertension ∗∗∗ | Pain in limb ∗∗ | Coronary atherosclerosis (vessel) ∗∗∗
Hyperlipidemia ∗∗∗ | Acute kidney failure ∗∗∗ | Coronary artery atherosclerosis ∗∗∗
Chest pain ∗∗∗ (I vs III) | Chronic kidney disease ∗∗∗ | Angina pectoris ∗∗∗
Obesity ∗∗∗ | Nephritis and nephropathy ∗∗∗ | Percutaneous transluminal coronary angioplasty (V45.82) ∗∗∗
Hypercholesterolemia ∗∗∗ | Peripheral vascular disease ∗∗ | Cardiac dysrhythmias ∗∗∗
Medication
Metformin ∗∗∗ | Paracetamol ∗∗∗ | Acetylsalicylic acid ∗∗∗
Calcium ∗∗∗ | Oxycodone ∗∗∗ | Clopidogrel ∗∗∗
Vitamin D ∗∗∗ | Morphine ∗∗∗ | Bivalirudin ∗∗∗
Cholesterol ∗∗∗ | Vancomycin ∗∗∗ | Lisinopril ∗
Atorvastatin ∗∗∗ | Furosemide ∗∗∗ | Amlodipine ∗∗∗
Lab test
Glucose ∗∗∗ | Creatinine ∗∗∗ | Hematocrit ∗∗∗
Creatinine ∗∗∗ | Chloride ∗∗∗ | Mean corpuscular hemoglobin concentration ∗∗∗
Cholesterol ∗∗∗ | Urea nitrogen ∗∗∗ | Chloride ∗∗∗
Triglyceride ∗∗∗ | Albumin ∗∗∗ | Troponin I cardiac ∗∗∗
Microalbumin panel ∗∗∗ | Alkaline phosphatase ∗∗∗ | Cholesterol ∗∗∗
CPT-4
Calcium ∗∗∗ (I vs II) | Potassium ∗∗∗ | ECG; interpretation, report ∗∗
Hemoglobin A1C ∗∗∗ (I vs II) | Urea nitrogen ∗∗∗ | Troponin, quantitative ∗∗∗
Glucose (reagent strip) ∗∗∗ | Urinalysis ∗∗∗ | C-reactive protein, high sensitivity ∗∗∗
Lipid panel ∗∗∗ | Hepatic function panel ∗∗∗ | Echocardiography, transthoracic ∗∗∗
Lipoprotein, direct measurement ∗∗∗ | Duplex scan of extremity veins ∗∗∗ | Radiologic examination, chest ∗∗∗

Mean (standard deviation); from ICD-9 on in-group and (total) percentages; a Multiple pairwise chi-squared test; b Multiple pairwise t-test; ∗ p < …, ∗∗ p < …, ∗∗∗ p < …; ECG = Electrocardiogram

Supplementary Table 4: Most frequent terms for the three subgroups in the type 2 diabetes cohort. We report top five diagnosis (ICD-9), medications, laboratory tests, and CPT-4 procedures. Each clinical term is followed by in-group and total frequencies. Corrected p-values are reported for significant comparisons between groups.
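The table footnotes mention multiple pairwise chi-squared tests with corrected p-values, but the correction method is not stated here. As an illustration only, the sketch below compares how often a term occurs in each pair of subgroups with a 2×2 Pearson chi-squared test (no continuity correction) and applies a Bonferroni correction. Both choices, and all names and counts, are assumptions rather than the authors' procedure.

```python
import math
from itertools import combinations


def chi2_2x2(a, b, c, d):
    """Pearson chi-squared statistic and p-value (1 degree of freedom)
    for the 2x2 table [[a, b], [c, d]], without continuity correction."""
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    if 0 in (row1, row2, col1, col2):
        return 0.0, 1.0  # degenerate table: no evidence of a difference
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # For 1 degree of freedom, P(X > stat) = erfc(sqrt(stat / 2)).
    return stat, math.erfc(math.sqrt(stat / 2.0))


def pairwise_term_tests(with_term, group_size):
    """Bonferroni-adjusted p-values for every pair of subgroups.

    with_term:  subgroup -> number of patients with the term
    group_size: subgroup -> number of patients in the subgroup
    """
    pairs = list(combinations(sorted(with_term), 2))
    adjusted = {}
    for g1, g2 in pairs:
        a, c = with_term[g1], with_term[g2]
        b, d = group_size[g1] - a, group_size[g2] - c
        _, p = chi2_2x2(a, b, c, d)
        adjusted[(g1, g2)] = min(1.0, p * len(pairs))  # Bonferroni (assumed)
    return adjusted


# Invented counts for one term in three subgroups of 100 patients each.
with_term = {"I": 30, "II": 10, "III": 29}
group_size = {"I": 100, "II": 100, "III": 100}
results = pairwise_term_tests(with_term, group_size)
for pair, p_adj in results.items():
    print(pair, p_adj)
```

A comparison annotation such as "(I vs II)" next to a term would then correspond to the pairs whose adjusted p-value falls below the chosen threshold.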
Parkinson’s disease
Subgroup I (N = 1,…) | Subgroup II (N = …,…)
Female/Male: … a | … a
Age: ….76 (13.…) ∗b | ….17 (14.…) ∗b
ICD-9
Essential tremor - 21% (56%) ∗∗∗ | Constipation - 29% (66%) ∗∗∗
Anxiety state - 20% (45%) | Other malaise and fatigue - 25% (72%) ∗∗∗
Depressive disorder - 14% (40%) ∗ | Coronary atherosclerosis - 17% (94%) ∗∗∗
Abnormality of gait - 14% (32%) ∗∗∗ | Dysphagia - 14% (77%) ∗∗∗
Dystonia - 11% (57%) ∗∗∗ | Abdominal pain - 14% (90%) ∗∗∗
Medication
Carbidopa/Levodopa combination - 51% (51%) ∗ | Levodopa - 45% (57%)
Amantadine - 16% (55%) ∗∗∗ | Carbidopa - 45% (58%) ∗∗∗
Pramipexole - 15% (59%) ∗∗∗ | Acetylsalicylic acid - 22% (87%) ∗∗∗
Rasagiline - 14% (60%) ∗∗∗ | Docusate sodium - 19% (85%) ∗∗∗
Selegiline - 12% (57%) ∗∗∗ | Vitamin D - 16% (72%) ∗∗∗
Lab test
Mean corpuscular hemoglobin - 3% (4%) ∗∗∗ | Glucose - 60% (97%) ∗∗∗
Leukocytes - 3% (4%) ∗∗∗ | Urea nitrogen - 60% (97%) ∗∗∗
Mean platelet volume - 3% (4%) ∗∗∗ | Creatinine - 59% (97%) ∗∗∗
Width - 3% (4%) ∗∗∗ | Potassium - 59% (97%) ∗∗∗
Erythrocytes - 3% (4%) ∗∗∗ | Sodium - 59% (97%) ∗∗∗
CPT-4
Unlisted psychiatric service or procedure - 25% (47%) | ECG; interpretation, report - 51% (95%) ∗∗∗
MRI (brain, brain stem) - 13% (36%) ∗∗∗ | Urea nitrogen - 48% (96%) ∗∗∗
Surgery - 11% (24%) ∗∗∗ | Creatinine - 45% (96%) ∗∗∗
CT head/brain - 3% (8%) ∗∗∗ | Metabolic panel - 35% (98%) ∗∗∗
Neuropsychological testing - 2% (36%) | Echocardiography, transthoracic - 11% (97%) ∗∗∗

Mean (standard deviation); from ICD-9 on in-group and (total) percentages; a Multiple pairwise chi-squared test; b Multiple pairwise t-test; ∗ p < …, ∗∗ p < …, ∗∗∗ p < …

Supplementary Table 5: Most frequent terms for the two subgroups in the Parkinson’s disease cohort.

Alzheimer’s disease
Subgroup I | Subgroup II | Subgroup III
Female/Male: ∗∗∗a | ∗∗∗a | ∗∗∗a
Age: ∗∗b | ∗∗b | ∗∗b
ICD-9
Routine gynecological examination (V72.31) ∗∗∗ | Dementia w/o behavioral disturbance ∗∗∗ | Constipation ∗∗∗
Counseling (V65.40) ∗∗ | Altered mental status ∗∗∗ | Anxiety state ∗
Osteoporosis ∗∗∗ (I vs II) | Persistent mental disorders ∗∗∗ | Depressive disorder
Family history of osteoporosis (V17.81) ∗∗ | Dysphagia ∗∗∗ | Dementia, unsp., w/o behavioral disturbance ∗∗∗ (III vs II)
Malignant neoplasm of uterus ∗∗∗ | Intracranial hemorrhage ∗∗∗ | Dementia with behavioral disturbance ∗∗∗
Medication
Calcium ∗ (I vs III) | Acetylsalicylic acid ∗∗∗ | Donepezil ∗∗∗ (III vs I)
Estradiol ∗∗∗ | Donepezil ∗∗ (II vs I) | Memantine ∗∗
Iron ∗ (I vs III) | Levofloxacin ∗∗∗ | Docusate sodium ∗∗∗
Norethisterone ∗∗∗ | Vancomycin ∗∗ | Trazodone ∗∗∗ (III vs I)
Gardasil ∗∗∗ | Haloperidol ∗∗∗ | Zolpidem ∗∗∗ (III vs I)
Lab test
Chlamydia/Gonorrhoeae amplified DNA ∗∗∗ | Mean corpuscular volume ∗∗∗ | Leukocytes ∗∗∗
Syphilis (rapid plasma reagin) ∗∗∗ | Creatinine ∗∗∗ | Glucose ∗∗∗
HIV ∗∗∗ (I vs II) | Erythrocytes ∗∗∗ | Erythrocytes ∗∗∗
Hepatitis C virus ab | Mean corpuscular hemoglobin concentration ∗∗∗ | Hematocrit ∗∗∗
Hepatitis B surface antigen | Glucose ∗∗∗ | Mean corpuscular hemoglobin concentration ∗∗∗
CPT-4
Psychiatric service/procedure ∗∗∗ (I vs II) | ECG ∗∗∗ | TSH ∗∗
Cytopathology, slides, cervical/vaginal ∗∗∗ | Partial Thromboplastin Time Test ∗∗∗ | Urea nitrogen ∗∗∗
MRI brain ∗∗ (I vs III) | Creatinine ∗∗∗ | ECG ∗∗∗
CT procedure ∗∗∗ | Prothrombin time ∗∗∗ | Psychiatric service/procedure ∗∗∗ (III vs II)
Brain imaging, PET ∗∗∗ | Head CT ∗∗∗ | Head/brain CT ∗∗∗

Mean (standard deviation); from ICD-9 on in-group and (total) percentages; a Multiple pairwise chi-squared test; b Multiple pairwise t-test; ∗ p < …, ∗∗ p < …, ∗∗∗ p < …; ECG = Electrocardiogram; ab = antibodies; TSH = Thyroid-stimulating hormone; PET = Positron emission tomography; CT = Computed tomography; MRI = Magnetic resonance imaging

Supplementary Table 6: Most frequent terms for the three subgroups in the Alzheimer’s disease cohort.

Multiple myeloma
Subgroup I | Subgroup II | Subgroup III | Subgroup IV | Subgroup V
Female/Male: a | a | a | ∗∗a | ∗∗a
Age: ∗∗b | b | b | b | b
ICD-9
Edema ∗ | Disease of salivary glands ∗∗∗ | Diarrhea ∗∗∗ | Hyperlipidemia ∗∗∗ (IV vs II/V) | Oth inflammatory/toxic neuropathy ∗∗
Anemia ∗∗∗ | Disorders of bone and cartilage ∗∗∗ | Nausea ∗ | Dysuria ∗ (IV vs II/V) | Unsp inflammatory/toxic neuropathy ∗∗
Shortness of breath ∗ | Other malaise and fatigue ∗∗∗ | Antineoplastic chemotherapy (V58.11) | Malignant neoplasm of colon ∗ | Counseling (V65.40) ∗
Pleura effusion ∗∗∗ | Osteoporosis ∗ (II vs III) | Neutropenia ∗∗∗ | Nephritis and nephropathy ∗ (IV vs II/V) | Organ/tissue transplant (V42.9) ∗∗∗ (V vs II/III/IV)
Fever ∗∗∗ | Fracture (E887) ∗∗∗ (II vs I/III) | Organ/tissue transplant (V42.9) ∗∗∗ | Chronic kidney disease ∗ (IV vs II) | Antineoplastic chemotherapy (V58.11)
Medication
Paracetamol ∗∗ | Vitamin D ∗∗ | Calcium ∗∗∗ | Calcium ∗∗∗ | Calcium ∗∗∗
Sodium chloride ∗∗ | Oxycodone ∗∗∗ | Dexamethasone ∗∗ | Vitamin D ∗∗ | Dexamethasone ∗∗
Oxycodone ∗∗∗ | Fentanyl ∗∗∗ | Ondansetron ∗∗∗ | Cholecalciferol ∗ | Bortezomib ∗∗∗
Fentanyl ∗∗∗ | Ergocalciferol ∗ | Bortezomib ∗∗∗ | Ergocalciferol ∗ | Iron ∗∗∗
Heparin ∗∗∗ | Acetylsalicylic acid … mg ∗ | Aciclovir ∗∗∗ | Atorvastatin ∗ | Acetylsalicylic acid … mg ∗
Lab test
Erythrocytes ∗∗∗ | Hemoglobin ∗ | Chloride ∗ | Protein ∗∗∗ | Hematocrit ∗
Glucose ∗∗∗ | Lymphocytes ∗∗∗ | Glucose ∗∗∗ | Glucose ∗∗∗ | Platelets ∗
Urea nitrogen ∗∗∗ | Leukocytes ∗∗ | Potassium ∗∗∗ | Erythrocytes ∗∗∗ | Erythrocytes ∗∗∗
Mean corpuscular hemoglobin ∗∗ | Mean platelet volume ∗∗ | Mean corpuscular hemoglobin ∗∗ | Creatinine ∗∗∗ | Lymphocytes ∗∗∗
Leukocytes ∗∗∗ (I vs II/III/IV) | Mean corpuscular hemoglobin concentration ∗ | Width ∗∗ | Urea nitrogen ∗∗∗ | Eosinophils ∗∗∗
CPT-4
Blood count ∗∗∗ | Diagnostic/interventional CT ∗ | Calcium ∗∗∗ | Calcium ∗∗∗ | Gammaglobulin ∗∗
Calcium ∗∗∗ | PET limited area (Head/neck) ∗ | Blood count ∗∗ | ECG; interpretation, report ∗∗∗ | Albumin ∗∗∗
ECG; interpretation, report ∗∗∗ | PET-CT (skull base to mid-thigh) ∗ | Albumin ∗∗∗ | Urea nitrogen ∗∗∗ | Calcium, ionized ∗∗∗
Potassium ∗ | Tumor imaging PET-CT ∗∗∗ | Lactate dehydrogenase ∗∗∗ (III vs I/II/IV) | Cholesterol ∗ (IV vs I/II/V) | Lactate dehydrogenase ∗
PTT ∗∗ | CT thorax (no contrast) ∗∗∗ | Bone marrow; biopsy ∗∗∗ | Urinalysis ∗∗∗ (IV vs III/V) | Beta-2 microglobulin ∗∗∗

Mean (standard deviation); from ICD-9 on in-group and (total) percentages; a Multiple pairwise chi-squared test; b Multiple pairwise t-test; ∗ p < …; ∗∗ p < …; ∗∗∗ p < …; ECG = Electrocardiogram; CT = Computed tomography; PET = Positron emission tomography; PTT = Partial thromboplastin time

Supplementary Table 7: Most frequent terms for the five subgroups in the multiple myeloma cohort.

Malignant neoplasm of prostate
Subgroup I (N = 6,…) | Subgroup II (N = …,…)
Age: ….64 (12.…) a | ….78 (10.…) a
ICD-9
Hyperlipidemia - 28% (95%) ∗∗∗ | Nocturia - 29% (33%) ∗∗
Edema - 24% (94%) ∗∗∗ | Elevated PSA - 18% (27%) ∗∗∗
Personal history of PC (V10.46) - 20% (97%) ∗∗∗ | Impotence of organic origin - 18% (35%) ∗∗∗
Hypertrophy (benign) of prostate - 14% (85%) ∗∗∗ | Urinary frequency - 15% (27%) ∗∗∗
Hematuria - 14% (86%) ∗∗∗ | Urinary hesitancy - 11% (33%) ∗∗∗
Medication
Paracetamol - 44% (98%) ∗∗∗ | Midazolam - 17% (12%) ∗∗∗
Oxycodone - 40% (98%) ∗∗∗ | Tadalafil - 14% (35%) ∗∗∗
Ondansetron - 33% (97%) ∗∗∗ | Sildenafil - 12% (33%) ∗∗∗
Propofol - 31% (94%) ∗∗∗ | Tamsulosin - 10% (12%) ∗∗∗
Morphine - 30% (99%) ∗∗∗ | Testosterone - 8% (28%)
Lab test
Glucose - 66% (96%) ∗ | PSA post-prostatectomy - 17% (25%) ∗∗∗
Leukocytes - 63% (98%) ∗ | PSA free - 10% (27%) ∗∗∗
Creatinine - 63% (99%) ∗ | Nitrite - 8% (6%) ∗∗∗
Urea nitrogen - 63% (99%) ∗ | Leukocyte esterase - 6% (5%) ∗∗∗
Potassium - 62% (99%) ∗ | Urine specific gravity - 6% (5%) ∗∗∗
CPT-4
Calcium - 53% (98%) ∗∗∗ | Testosterone total - 29% (32%) ∗∗∗
Anastomosis - 20% (98%) ∗∗∗ | Surgery - 22% (14%) ∗∗∗
Ultrasound, transrectal - 7% (65%) ∗∗∗ | Ultrasound post-voiding residual urine/bladder capacity - 18% (29%) ∗∗∗
Pelvic lymphadenectomy - 6% (100%) ∗∗∗ | Urinalysis - 12% (44%) ∗∗∗
Cystoplasty/cystourethroplasty - 6% (100%) ∗∗∗ | Biopsy, prostate - 7% (31%) ∗∗∗

Mean (standard deviation); from ICD-9 on in-group and (total) percentages; a Multiple pairwise t-test; ∗ p < …, ∗∗ p < …, ∗∗∗ p < …

Supplementary Table 8: Most frequent terms for the two subgroups in the prostate cancer cohort.

Malignant neoplasm of breast (female)
Subgroup I (N = 5,…) | Subgroup II (N = …,…)
Age: ….67 (14.…) ∗a | ….86 (13.…) ∗a
ICD-9
Constipation - 25% (93%) ∗ | Lump or mass in breast - 27% (29%) ∗
Secondary malignant neoplasm - 13% (93%) ∗∗∗ | Abnormal mammogram - 23% (37%) ∗
Acquired absence of breast/nipple (V45.71) - 12% (92%) ∗∗∗ | Carcinoma in situ of breast - 15% (27%) ns
Antineoplastic chemotherapy (V58.11) - 7% (98%) ∗∗∗ | Family history of malignant neoplasm of breast (V16.3) - 6% (28%)
Mammogram for high-risk patient (V76.11) - 6% (63%) ∗∗∗ | Abnormal findings on radiological examination of breast - 4% (36%) ∗∗∗
Medication
Paracetamol - 50% (92%) ∗∗∗ | Propofol - 27% (19%) ∗∗∗
Ondansetron - 46% (87%) ∗∗∗ | Fentanyl - 26% (16%) ∗∗∗
Fentanyl - 45% (84%) ∗∗∗ | Lidocaine - 25% (21%) ∗∗∗
Oxycodone - 43% (91%) ∗∗∗ | Midazolam - 22% (18%) ∗∗∗
Propofol - 40% (81%) ∗∗∗ | Ondansetron - 21% (13%) ∗∗∗
Lab test
Glucose - 67% (97%) ∗∗∗ | Leukocytes - 7% (3%) ∗∗∗
Leukocytes - 67% (97%) ∗∗∗ | Glucose - 6% (3%) ∗∗∗
Erythrocytes - 66% (97%) ∗∗∗ | Platelets - 6% (3%) ∗∗∗
Hemoglobin - 65% (97%) ∗∗∗ | Erythrocytes - 6% (3%) ∗∗∗
Hematocrit - 65% (97%) ∗∗∗ | Mean corpuscular hemoglobin - 6% (3%) ∗∗∗
CPT-4
Surgery - 44% (81%) ∗∗∗ | Mammography - 35% (32%) ∗∗∗
Mastectomy, partial - 33% (78%) ∗ | Ultrasound - 32% (27%) ∗
Ultrasound - 30% (73%) ∗ | Surgery - 30% (19%) ∗∗∗
Unlisted chemotherapy - 23% (85%) ∗∗∗ | Mastectomy, partial - 28% (22%) ∗∗∗
Oncoprotein - 17% (85%) ∗∗∗ | Mammography, bilateral - 26% (39%) ∗∗∗

Mean (standard deviation); from ICD-9 on in-group and (total) percentages; a Multiple pairwise t-test; ∗ p < …, ∗∗ p < …, ∗∗∗ p < …

Supplementary Table 9: Most frequent terms for the two subgroups in the breast cancer cohort.

Type 2 diabetes (Split 2)
Subgroup I | Subgroup II | Subgroup III
Female/Male: ∗∗∗a | ∗∗∗a | ∗∗∗a
Age: ∗b | ∗b | ∗b
ICD-9
Hypertension ∗∗∗ | Edema ∗∗∗ | Coronary artery atherosclerosis ∗∗∗
Hyperlipidemia ∗∗∗ | Acute kidney failure ∗∗∗ | Coronary atherosclerosis (vessel) ∗∗∗
Chest pain | Chronic kidney disease ∗∗∗ | Angina pectoris ∗∗∗
Obesity ∗∗∗ | Pain in limb ∗∗∗ | Abnormal result cardiovascular system function ∗∗∗
Hypercholesterolemia ∗∗∗ | Nephritis and nephropathy ∗∗∗ | Percutaneous transluminal coronary angioplasty (V45.82) ∗∗∗
Medication
Metformin ∗∗∗ | Paracetamol ∗∗∗ | Acetylsalicylic acid ∗∗∗
Acetylsalicylic acid ∗∗∗ (I vs II) | Glucagon ∗∗∗ | Clopidogrel ∗∗∗
Calcium ∗∗∗ | Insulin lispro ∗∗∗ | Intracoronary nitroglycerin ∗∗∗
Paracetamol ∗∗∗ | Vancomycin ∗∗∗ | Bivalirudin ∗∗∗
Cholesterol ∗∗∗ | Furosemide ∗∗∗ | Clopidogrel ∗∗∗
Lab test
Glucose ∗∗∗ | Urea nitrogen ∗∗∗ | Hematocrit ∗∗∗ (III vs I)
Creatinine ∗∗∗ | Creatinine ∗∗∗ | Mean corpuscular hemoglobin concentration ∗∗∗ (III vs I)
Leukocytes ∗∗∗ | Bilirubin ∗∗∗ | Chloride ∗∗∗
Triglyceride ∗∗∗ | ALT test ∗∗∗ | Cholesterol ∗∗∗
Cholesterol ratio ∗ | Alkaline phosphatase ∗∗∗ | Troponin I cardiac ∗∗∗
CPT-4
Calcium ∗∗∗ (I vs II) | Potassium ∗∗∗ | ECG; interpretation, report ∗∗∗
Hemoglobin A1C ∗∗∗ | Urea nitrogen ∗∗∗ | Lipid panel ∗∗∗
Glucose ∗∗∗ | Creatinine ∗∗∗ | Potassium ∗∗∗
Lipid panel ∗∗∗ | Urinalysis ∗∗∗ | Troponin, quantitative ∗
Lipoprotein, direct measurement ∗∗∗ | Hepatic function panel ∗∗∗ | C-reactive protein, high sensitivity ∗∗∗

Mean (standard deviation); from ICD-9 on in-group and (total) percentages; a Multiple pairwise chi-squared test; b Multiple pairwise t-test; ∗ p < …, ∗∗ p < …, ∗∗∗ p < …; ECG = Electrocardiogram; ALT = Alanine aminotransferase test

Supplementary Table 10: Most frequent terms for the three subgroups in the type 2 diabetes second split replication cohort.

Parkinson’s disease (Split 2)
Subgroup I (N = 1,…) | Subgroup II (N = …)
Female/Male: … a | … a
Age: ….39 (12.…) ∗b | ….65 (15.…) ∗b
ICD-9
Anxiety state - 24% (65%) ∗∗ | Other malaise and fatigue - 26% (53%) ∗∗
Constipation - 23% (57%) ∗ | Chest pain - 22% (68%) ∗∗∗
Essential tremor - 22% (79%) ∗∗ | Coronary atherosclerosis - 21% (79%) ∗∗∗
Abnormality of gait - 15% (48%) ∗∗∗ | Atrial fibrillation - 17% (85%) ∗∗∗
Depressive disorder - 14% (53%) ∗∗∗ | Pleural effusion - 17% (95%) ∗∗∗
Medication
Carbidopa/Levodopa combination - 49% (68%) ∗∗∗ | Carbidopa - 47% (43%) ∗
Amantadine - 17% (74%) ∗∗∗ | Levodopa - 46% (42%) ∗
Pramipexole - 15% (75%) ∗∗∗ | Acetylsalicylic acid - 26% (72%) ∗∗∗
Rasagiline - 14% (78%) ∗∗∗ | Heparin - 23% (94%) ∗∗∗
Selegiline - 12% (78%) ∗∗∗ | Metoprolol - 18% (81%) ∗∗∗
Lab test
Glucose - 9% (15%) ∗∗∗ | Erythrocytes - 77% (85%) ∗∗∗
Leukocytes - 9% (15%) ∗∗∗ | Mean corpuscular hemoglobin - 75% (86%) ∗∗∗
Creatinine - 9% (15%) ∗∗∗ | Glucose - 75% (85%) ∗∗∗
Erythrocytes - 9% (15%) ∗∗∗ | Width - 75% (86%) ∗∗∗
Urea nitrogen - 8% (15%) ∗∗∗ | Leukocytes - 75% (85%) ∗∗∗
CPT-4
Unlisted psychiatric service or procedure - 29% (70%) ∗∗∗ | Urea nitrogen - 60% (85%) ∗∗∗
Surgery - 17% (48%) ∗∗∗ | ECG; interpretation, report - 59% (82%) ∗∗∗
MRI (brain, brain stem) - 16% (58%) | Urinalysis - 42% (87%) ∗∗∗
CT head/brain - 5% (21%) ∗∗∗ | Radiologic examination, chest - 38% (86%) ∗∗∗
Implanted neurostimulator - 4% (68%) | Troponin, quantitative - 30% (85%) ∗∗∗

Mean (standard deviation); from ICD-9 on in-group and (total) percentages; a Multiple pairwise chi-squared test; b Multiple pairwise t-test; ∗ p < …, ∗∗ p < …, ∗∗∗ p < …

Supplementary Table 11: Most frequent terms for the two subgroups in the Parkinson’s disease second split replication cohort.

Alzheimer’s disease (Split 2)
Subgroup I | Subgroup II | Subgroup III
Female/Male: a | a | ∗∗∗a
Age: ∗∗b | ∗∗b | ∗∗b
ICD-9
Constipation | Dementia w/o behavioral disturbance ∗∗∗ | Routine gynecological examination (V72.31) ∗∗∗
Anxiety state ∗∗∗ (I vs II) | Altered mental status ∗∗∗ | Counseling (V65.40) ∗∗∗
Memory loss ∗∗∗ | Persistent mental disorders ∗∗∗ | Osteoporosis ∗∗∗
Depressive disorder ∗∗∗ | Congestive heart failure ∗∗∗ | Family history of osteoporosis (V17.81) ∗∗∗
Insomnia ∗∗∗ (I vs III) | Dementia with behavioral disturbance ∗∗∗ | Malignant neoplasm of uterus ∗∗∗
Medication
Ergocalciferol ∗∗∗ | Acetylsalicylic acid ∗∗∗ | Ethinyl estradiol ∗∗∗
Donepezil ∗ | Donepezil ∗ | Iron ∗
Memantine ∗∗∗ | Levofloxacin ∗∗∗ | Gardasil ∗∗∗
Vitamin B-… ∗∗∗ | Metoprolol ∗∗∗ | Norethisterone ∗∗∗
Docusate sodium ∗∗∗ (I vs III) | Haloperidol ∗∗∗ | Norgestimate ∗
Lab test
Glucose ∗∗∗ | Erythrocytes ∗∗∗ | Chlamydia/Gonorrhoeae amplified DNA ∗∗∗
Erythrocytes ∗∗∗ | Hematocrit ∗∗∗ | HIV
Width ∗∗∗ | Mean platelet volume ∗∗∗ | Syphilis (rapid plasma reagin) ∗∗∗ (III vs II)
Platelets ∗∗∗ | Urea nitrogen ∗∗∗ | Hepatitis B surface antigen
Hemoglobin ∗∗∗ | Mean corpuscular hemoglobin concentration ∗∗∗ | Hepatitis C virus ab
CPT-4
TSH ∗∗ | Partial Thromboplastin Time Test ∗∗∗ | Psychiatric service/procedure ∗∗∗
ECG; interpretation, report ∗∗∗ | Prothrombin time ∗∗∗ | Calcium, ionized ∗∗∗
Urea nitrogen ∗∗∗ | X-ray chest ∗∗∗ | Cytopathology, slides, cervical/vaginal ∗∗∗
Creatinine ∗∗∗ | Head/brain CT ∗∗∗ | TSH ∗∗∗
Psychiatric service/procedure ∗∗∗ | Troponin, quantitative | Estradiol ∗∗∗

Mean (standard deviation); from ICD-9 on in-group and (total) percentages; a Multiple pairwise chi-squared test; b Multiple pairwise t-test; ∗ p < …, ∗∗ p < …, ∗∗∗ p < …; ECG = Electrocardiogram; ab = antibodies; TSH = Thyroid-stimulating hormone; CT = Computed tomography

Supplementary Table 12: Most frequent terms for the three subgroups in the Alzheimer’s disease second split replication cohort.

Multiple myeloma (Split 2)
Subgroup I | Subgroup II | Subgroup III | Subgroup IV
Female/Male: a | a | ∗∗a | ∗∗a
Age: ∗∗b (I vs III/IV) | ∗∗b (II vs III/IV) | b | b
ICD-9
Other malaise and fatigue ∗∗∗ | Oth inflammatory/toxic neuropathy ∗∗ | Pleura effusion ∗∗∗ | Hyperlipidemia ∗∗
Disease of salivary glands ∗∗∗ | Unsp inflammatory/toxic neuropathy ∗∗∗ | Acute kidney failure ∗∗∗ | Nephritis and nephropathy ∗∗∗ (IV vs I/II)
Constipation ∗∗ | Disorders of bone and cartilage ∗∗∗ (II vs I/IV) | Organ/tissue transplant (V42.9) ∗∗∗ (III vs I/IV) | Dysuria
Fever ∗∗ | Disease of salivary glands ∗∗∗ | Renal failure ∗∗∗ | Monoclonal paraproteinemia ∗∗ (IV vs I/III)
Counseling (V65.40) ∗∗ | Organ/tissue transplant (V42.9) ∗∗∗ (II vs I/IV) | Antineoplastic chemotherapy (V58.11) ∗∗∗ (III vs I/IV) | Chronic kidney disease ∗
Medication
Oxycodone ∗∗ | Calcium ∗∗∗ | Oxycodone ∗∗ | Ergocalciferol ∗
Lidocaine ∗∗∗ | Dexamethasone ∗∗∗ | Ondansetron ∗∗ | Cholecalciferol ∗∗∗
Acetylsalicylic acid … mg ∗∗ | Bortezomib ∗∗∗ | Diphenhydramine ∗ | Atorvastatin ∗∗∗
Dexamethasone ∗∗∗ | Acetylsalicylic acid … mg ∗∗ | Dexamethasone ∗∗∗ (III vs I/IV) | Furosemide ∗∗∗ (IV vs I/III)
Paracetamol ∗∗∗ | Lenalidomide ∗∗∗ | Lorazepam ∗∗∗ | Losartan ∗∗∗
Lab test
Leukocytes ∗∗∗ | Width ∗∗∗ | Mean corpuscular volume ∗∗∗ | Glucose ∗∗∗
Erythrocytes ∗∗∗ | Mean platelet volume ∗∗∗ | Chloride ∗∗∗ | Leukocytes ∗∗∗
Hematocrit ∗ | Mean corpuscular hemoglobin ∗∗∗ | Urea nitrogen ∗∗∗ | Creatinine ∗
Mean corpuscular volume ∗∗∗ | Hemoglobin ∗∗∗ | Leukocytes ∗∗∗ | Protein ∗∗∗
Platelets ∗∗∗ | Protein ∗∗∗ | Creatinine ∗∗∗ (III vs I/IV) | Urea nitrogen ∗∗∗
CPT-4
Diagnostic/interventional CT ∗∗ | Beta-2 microglobulin ∗∗∗ | ECG; interpretation, report ∗∗ | ECG; interpretation, report ∗∗
PET limited area (Head/neck) ∗∗∗ | Bone marrow, biopsy ∗∗∗ | PTT ∗∗∗ | Vitamin D ∗∗∗
Surgery ∗∗∗ | Nephelometry ∗∗∗ | X-ray, chest ∗∗∗ | Triglycerides ∗∗∗
Psychiatric service/procedure ∗ | Immunofixation ∗∗∗ | Urinalysis ∗∗∗ | Lipid panel ∗
PET-CT (skull base to mid-thigh) ∗∗ | Chemotherapy procedure ∗∗∗ (II vs I/IV) | Phosphorus ∗∗∗ | Cholesterol ∗∗∗

Mean (standard deviation); from ICD-9 on in-group and (total)
p e r ce n t ag e s ; a M u l t i p l e p a i r w i s ec h i - s q u a r e d t e s t ; b M u l t i p l e p a i r w i s e t - t e s t ; ∗ p < . ; ∗∗ p < . ; ∗∗∗ p < . ; E C G = E l ec t r o c a r d i og r a m ; CT = C o m pu t e d t o m og r a ph y ; PE T = P o s i t r o n e m i ss i o n t o m og r a ph y ; P TT = P a r t i a l t h r o m b o p l a s t i n t i m e Supp l e m e n t a r y T a b l e : M o s t f r e q u e n tt e r m s f o r t h e f o u r s ub g r o up s i n t h e M u l t i p l e M y e l o m a s ec o nd s p li t r e p li c a t i o n c o h o r t . alignant neoplasm of prostate (Split 2)Subgroup I Subgroup II Subgroup III (N=2 , , , .
71 (12 . a .
92 (11 . ∗∗ a .
83 (14 . a ICD-9 Nocturia ( ) - 28% (50%) ∗∗∗
Personal history of PC (
V10.46 ) - 28% (77%) ∗∗∗
Palpitations ( ) - 21% (51%) ∗∗∗
Elevated PSA ( ) - 20% (49%) ∗∗∗
Hyperlipidemia ( ) - 25% (47%) ∗∗∗
Asthma ( ) - 18% (51%) ∗∗∗
Urinary frequency ( ) - 17% (45%) ∗∗∗ (I vs II)
Edema ( ) - 23% (47%) ∗∗∗
Vitamin D deficiency ( ) - 15% (72%) ∗∗∗
Impotence of organic origin ( ) - 16% (52%) ∗∗∗
Cardiac dysrhythmias ( ) - 15% (69%) ∗∗∗
Cyanosis ( ) - 14% (54%) ∗∗∗
Urge incontinence ( ) - 5% (52%) ∗∗∗ (I vs II)
Pleural effusion ( ) - 13% (87%) ∗∗∗
Neoplasm of colon ( ) - 11% (52%) ∗∗∗
Medication Midazolam - 15% (18%) ∗∗∗
Paracetamol - 68% (81%) ∗∗∗
Vitamin D3 - 17% (49%) ∗∗∗
Tadalafil - 12% (47%) ∗∗∗
Oxycodone - 61% (82%) ∗∗∗
Fluticasone - 17% (61%) ∗∗∗
Tamsulosin - 11% (23%) ∗∗ Ondansetron - 50% (82%) ∗∗∗
Atorvastatin - 17% (43%) ∗∗∗
Testosterone - 8% (45%) ∗∗∗ (I vs II)
Morphine - 50% (92%) ∗∗∗
Aerosol - 15% (53%) ∗∗∗
Sildenafil - 10% (44%) ∗∗∗ (I vs II)
Lidocaine - 47% (77%) ∗∗∗
Omeprazole - 10% (51%) ∗∗∗
Lab test PSA total - 20% (33%) ∗∗∗
Glucose - 84% (68%) ∗∗∗
Glucose - 47% (21%) ∗∗∗
PSA post-prostatectomy - 15% (37%) ∗∗∗ (I vs III)
Leukocytes - 84% (72%) ∗∗∗
Cholesterol - 35% (49%) ∗∗∗
Nitrite - 15% (18%) ∗∗∗
Urea nitrogen - 84% (72%) ∗ Hemoglobin A1C - 17% (52%) ∗∗∗
PSA free - 11% (47%) ∗∗∗
Potassium - 84% (73%) ∗∗∗ (I vs III)
Hepatitis C virus ab - 11% (53%) ∗∗∗
Testosterone free - 6% (46%) ∗∗∗ (I vs II)
Creatinine - 83% (72%) ∗∗∗
HIV 1 - 8% (55%) ∗∗∗
CPT-4 Surgery - 25% (25%) ∗∗∗
Calcium - 71% (72%) ∗∗∗
PSA total - 51% (23%) ∗∗∗
Ultrasound post-voiding residual urine/bladder capacity - 28% (48%) ∗∗∗
ECG; interpretation, report - 43% (61%) ∗∗∗ (II vs I)
PSA free - 52% (44%) ∗∗∗
Ultrasound, transrectal - 16% (57%) ∗∗∗
Anastomosis - 33% (92%) ∗∗∗ (II vs I)
ECG; interpretation, report - 41% (32%) ∗∗∗ (III vs I)
Urinalysis - 11% (60%) ∗∗∗ (I vs II)
Urine culture, bacterial - 20% (69%) ∗∗∗
Surgery - 34% (26%) ∗∗∗ (III vs I)
MRI, pelvis - 9% (43%) ∗∗ Troponin, quantitative - 19% (90%) ∗∗∗
Spirometry - 14% (73%) ∗∗∗ Mean (standard deviation); from ICD-9 on in-group and (total) percentages; a Multiple pairwise t-test; ∗ p < . ∗∗ p < . ∗∗∗ p < . Supplementary Table 14: Most frequent terms for the three subgroups in the prostate cancer second splitreplication cohort. 16 alignant neoplasm of breast - female (Split 2)Subgroup I Subgroup II (N=5 , , .
98 (14 . ∗ a .
94 (13 . ∗ a ICD-9 Personal history of malignant neoplasm of breast (
V10.3 ) - 54% (79%) ∗∗∗
Lump or mass in breast ( ) - 26% (33%) ∗∗∗
Constipation ( ) - 24% (92%) ∗ Abnormal mammogram ( ) - 22% (43%) ∗∗∗
Secondary malignant neoplasm ( ) - 14% (91%) ∗∗∗
Other screening mammogram (
V76.12 ) - 19% (44%) ∗∗∗
Acquired absence of breast/nipple (
V45.71 ) - 12% (89%) ∗∗∗
Carcinoma in situ of breast ( ) - 15% (32%) ∗ Antineoplastic chemotherapy (
V58.11 ) - 7% (99%) ∗∗∗
Diffuse cystic mastopathy ( ) - 10% (38%) ∗∗∗
Medication Paracetamol - 50% (89%) ∗∗∗
Propofol - 28% (23%) ∗∗∗
Fentanyl - 45% (80%) ∗∗∗
Fentanyl - 28% (20%) ∗∗∗
Ondansetron 44% (83%) ∗∗∗
Midazolam - 24% (22%) ∗∗∗
Oxycodone - 42% (88%) ∗∗∗
Lidocaine - 23% (23%) ∗∗∗
Propofol - 38% (77%) ∗∗∗
Ondansetron - 23% (17%) ∗∗∗
Lab test Leukocytes - 69% (97%) ∗∗∗
Leukocytes - 6% (3%) ∗∗∗
Glucose - 69% (97%) ∗∗∗
Glucose - 6% (3%) ∗∗∗
Hematocrit - 67% (97%) ∗∗∗
Width - 5% (3%) ∗∗∗
Erythrocytes - 67% (97%) ∗∗∗
Mean corpuscular hemoglobin concentration - 5% (3%) ∗∗∗
Width - 66% (97%) ∗∗∗
Erythrocytes - 5% (3%) ∗∗∗
CPT-4 Surgery - 43% (79%) ∗∗∗
Mammography - 33% (36%) ∗∗∗
Mastectomy, partial - 34% (75%) ∗∗∗
Surgery - 30% (21%) ∗∗∗
Ultrasound - 27% (68%) ∗∗∗
Mastectomy, partial - 28% (25%) ∗∗∗
Unlisted chemotherapy - 24% (84%) ∗∗∗
Ultrasound, breast(s) - 24% (40%) ∗∗∗
Oncoprotein - 16% (81%) ∗∗∗
Mammography, bilateral - 23% (42%) ∗∗∗ Mean (standard deviation); from ICD-9 on in-group and (total) percentages; a Multiple pairwise t-test; ∗ p < . ∗∗ p < . ∗∗∗ p < . Supplementary Table 15: Most frequent terms for the two subgroups in the breast cancer second splitreplication cohort. 17
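The "in-group and (total) percentages" reported in these tables can be illustrated with a minimal sketch. It assumes that the in-group percentage is the share of a subgroup's patients who have a term, and that the total percentage is the share of all patients with that term who fall in the subgroup; this reading is consistent with the total percentages summing to roughly 100% across subgroups (for example 97% and 3% for Leukocytes in the breast cancer table). The pairwise chi-squared comparison uses a Bonferroni correction, which is an assumption here, since the footnotes only mention "multiple pairwise" tests. The function names and the toy cohort are hypothetical.

```python
import math
from itertools import combinations


def term_percentages(subgroups, term):
    """Return {label: (in_group_pct, total_pct)} for one term.

    `subgroups` maps a subgroup label to a list of patients,
    each patient being a set of terms (hypothetical data layout).
    """
    counts = {g: sum(term in p for p in pts) for g, pts in subgroups.items()}
    total = sum(counts.values())
    result = {}
    for g, pts in subgroups.items():
        in_group = 100.0 * counts[g] / len(pts) if pts else 0.0
        share = 100.0 * counts[g] / total if total else 0.0
        result[g] = (in_group, share)
    return result


def chi2_2x2(a, b, c, d):
    """Pearson chi-squared statistic (no continuity correction) and
    p-value for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    den = (a + b) * (c + d) * (a + c) * (b + d)
    if den == 0:
        return 0.0, 1.0
    stat = n * (a * d - b * c) ** 2 / den
    # Survival function of a chi-squared variable with 1 degree of freedom.
    p = math.erfc(math.sqrt(stat / 2.0))
    return stat, p


def pairwise_chi2(subgroups, term):
    """Bonferroni-corrected p-values for every pair of subgroups
    (the correction method is an assumption of this sketch)."""
    pairs = list(combinations(sorted(subgroups), 2))
    corrected = {}
    for g1, g2 in pairs:
        a = sum(term in p for p in subgroups[g1])
        c = sum(term in p for p in subgroups[g2])
        b = len(subgroups[g1]) - a
        d = len(subgroups[g2]) - c
        _, p = chi2_2x2(a, b, c, d)
        corrected[(g1, g2)] = min(1.0, p * len(pairs))
    return corrected


# Toy cohort: 4 patients in subgroup I, 6 in subgroup II.
toy = {
    "I": [{"glucose"}, {"glucose"}, {"glucose"}, set()],
    "II": [{"glucose"}, set(), set(), set(), set(), set()],
}
percentages = term_percentages(toy, "glucose")
p_values = pairwise_chi2(toy, "glucose")
```

In the toy cohort, subgroup I has the term in 3 of 4 patients and holds 3 of the 4 patients with the term overall, so both its percentages are 75%; subgroup II has a total percentage of 25%.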
Supplementary Figure 1: Second split Uniform Manifold Approximation and Projection (UMAP) encoding visualization. ConvAE 1-layer CNN (a); SVD-RawCount (b); SVD-TFIDF (c); Deep Patient (d). AD = Alzheimer’s disease; ADHD = Attention deficit hyperactivity disorder; BC = Breast cancer; CD = Crohn’s disease; MM = Multiple myeloma; PC = Prostate cancer; PD = Parkinson’s disease; T2D = Type 2 diabetes.
Supplementary Figure 2: Second split Uniform Manifold Approximation and Projection (UMAP) clustering visualization. ConvAE 1-layer CNN (a); SVD-RawCount (b); SVD-TFIDF (c); Deep Patient (d). AD = Alzheimer’s disease; ADHD = Attention deficit hyperactivity disorder; BC = Breast cancer; CD = Crohn’s disease; MM = Multiple myeloma; PC = Prostate cancer; PD = Parkinson’s disease; T2D = Type 2 diabetes.

Supplementary Figure 3: Complex disorder subgroups identified in the replication set. A subsample of 5 , aa