Deep Representation Learning of Electronic Health Records to Unlock Patient Stratification at Scale

Abstract

Deriving disease subtypes from electronic health records (EHRs) can guide next-generation personalized medicine. However, challenges in summarizing and representing patient data prevent widespread practice of scalable EHR-based stratification analysis. Here we present an unsupervised framework based on deep learning to process heterogeneous EHRs and derive patient representations that can efficiently and effectively enable patient stratification at scale. We considered EHRs of 1,608,741 patients from a diverse hospital cohort comprising a total of 57,464 clinical concepts. We introduce a representation learning model based on word embeddings, convolutional neural networks, and autoencoders (i.e., ConvAE) to transform patient trajectories into low-dimensional latent vectors. We evaluated whether these representations broadly enable patient stratification by applying hierarchical clustering to different multi-disease and disease-specific patient cohorts. ConvAE significantly outperformed several baselines in a clustering task to identify patients with different complex conditions, with average scores of 2.61 entropy and 0.31 purity. When applied to stratify patients within a certain condition, ConvAE led to various clinically relevant subtypes for different disorders, including type 2 diabetes, Parkinson's disease and Alzheimer's disease, largely related to comorbidities, disease progression, and symptom severity. With these results, we demonstrate that ConvAE can generate patient representations that lead to clinically meaningful insights. This scalable framework can help better understand varying etiologies in heterogeneous sub-populations and unlock patterns for EHR-based research in the realm of personalized medicine.



Isotta Landi, Benjamin S. Glicksberg, Hao-Chih Lee, Sarah Cherng, Giulia Landi, Matteo Danieletto, Joel T. Dudley, Cesare Furlanello§, and Riccardo Miotto∗,§

§ These authors share senior authorship.

(1) Bruno Kessler Institute, Via Sommarive 18, 38123 Povo (TN), Italy
(2) Department of Psychology and Cognitive Science, University of Trento, Corso Bettini 84, 38068 Rovereto (TN), Italy
(3) Hasso Plattner Institute for Digital Health at Mount Sinai
(4) Institute for Next Generation Healthcare
(5) Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, 1 Gustave L. Levy Place, New York, NY 10029, USA
(6) Department of Mental Health and Pathological Addiction, Azienda USL Centro “Santi”, Via Vasari 13, 43100 Parma, Italy
(7) HK3 Lab, Via Castel Morrone 14, 20129 Milan, Italy

Corresponding author:

Riccardo Miotto, PhD
Hasso Plattner Institute for Digital Health at Mount Sinai
Department of Genetics and Genomic Sciences
Icahn School of Medicine at Mount Sinai
1 Gustave L. Levy Place
New York, NY 10029, USA
email: [email protected]

Introduction

Electronic health records (EHRs) are collected as part of routine care across the vast majority of healthcare institutions. They consist of heterogeneous structured and unstructured data elements, including demographic information, diagnoses, laboratory results, medication prescriptions, free text clinical notes, and images. EHRs provide snapshots of a patient’s state of health and have created unprecedented opportunities to investigate the properties of clinical events across large populations using data-driven approaches and machine learning. At the individual level, patient trajectories can foster personalized medicine; across a population, EHRs can provide a vital resource to understand population health management and help make better decisions for healthcare operation policies [1].

Personalized medicine focuses on the use of patient-specific data to tailor treatment to an individual’s unique health characteristics. However, even seemingly simple diseases can show different degrees of complexity that can create challenges for identification, treatment, and prognosis, despite equivalence at the diagnostic level [2, 3]. Heterogeneity among patients is particularly evident for complex disorders, where the etiology is due to an amalgamation of multiple genetic, environmental, and lifestyle factors. Several different conditions have been referred to as complex, such as Parkinson’s disease (PD) [4], multiple myeloma (MM) [5], and type 2 diabetes (T2D) [6]. Patients with complex disorders may differ on multiple systemic layers (e.g., different clinical measurements or comorbidity landscape) and in response to treatments, making these conditions difficult to model. Multiple data types in patient longitudinal EHR histories offer a way to examine disease complexity and present an opportunity to refine diseases into subtypes and tailor personalized treatments. This task is usually referred to as “EHR-based patient stratification”.
This follows a common approach in clinical research, where attempts to identify latent patterns within a cohort of patients can contribute to the development of improved personalized therapies [7].

From a computational perspective, patient stratification is a data-driven, unsupervised learning task that groups patients according to their clinical characteristics [8]. Previous work in this domain aggregates clinical data at a patient level, representing each patient as multi-dimensional vectors, and derives subtypes within a disease-specific population via clustering (e.g., in autism [9]) or topological analysis (e.g., for T2D [10]). Deep learning has been applied to derive more robust patient representations to improve disease subtyping [8, 11]. Baytas et al. used time-aware long short-term memory (LSTM) networks to leverage stratification of longitudinal data of PD patients [8]. Similarly, Zhang et al. used LSTM to identify three subgroups of patients with idiopathic PD that differ in disease progression patterns and symptom severity [11]. These studies, however, only focused on curated and small disease-specific cohorts, with ad hoc manually selected features. This approach not only limits scalability and generalizability, but also hinders the possibility to discover unknown patterns that might characterize a condition. Because EHRs tend to be incomplete, using a diverse cohort of patients to derive disease-specific subgroups can adequately capture the features of heterogeneity within the disease of interest [12]. However, it is challenging to create large-scale computational models from EHRs because of data quality issues, such as high dimensionality, heterogeneity, sparseness, random errors, and systematic biases. Advances in machine learning, specifically in representation learning [13] and deep learning [14], are introducing different computational models to leverage EHRs for personalized healthcare [15, 16].
This work fits into this landscape by presenting an unsupervised patient stratification pipeline that aims to automatically detect clinically meaningful subtypes within any condition by using patient representations learned from a heterogeneous and large cohort of EHRs.

In particular, this paper proposes a general framework for identifying disease subtypes at scale (see Figure 1a). We first propose an unsupervised deep learning architecture to derive vector-based patient representations from a large and domain-free collection of EHRs. This model (i.e., ConvAE) combines 1) embeddings to contextualize medical concepts, 2) convolutional neural networks (CNNs) to loosely model the temporal aspects of patient data, and 3) autoencoders (AEs) to enable the application of an unsupervised architecture. Second, we show that ConvAE-based representations learned from real-world EHRs of about 1.6M patients from the Mount Sinai Health System in New York improve clustering of patients with different disorders compared to several commonly used baselines. Last, we demonstrate that ConvAE leads to effective patient stratification with minimal effort. To this end, we used the encodings learned from domain-free and heterogeneous EHRs to derive subtypes for different complex disorders and provide a qualitative analysis to determine their clinical relevance.

This architecture enables patient stratification at scale by eliminating the need for manual feature engineering and explicit labeling of events within patient care timelines, and processes the whole EHR sequence regardless of the length of patient history. By generating disease subgroups from large-scale EHR data, this architecture can help disentangle clinical heterogeneity and identify high-impact patterns within complex disorders, whose effect may be masked in case-control studies [17]. The specific properties of the different subgroups can then potentially inform personalized treatments and improve patient care.

Results

We first evaluated the extent to which ConvAE-based patient representations can be used to identify different clinical diagnoses in the EHRs (i.e., disease phenotyping [18]). To this end, we performed clustering analysis using patients with the following eight complex disorders: T2D, MM, PD, Alzheimer’s disease (AD), Crohn’s disease (CD), breast cancer (BC), prostate cancer (PC), and attention deficit hyperactivity disorder (ADHD). We used SNOMED–CT (Systematized nomenclature of medicine – clinical terms) [19] to find all patients in the data warehouse diagnosed with these conditions; see Supplementary Table 2 and the “Multi-disease clustering analysis” subsection in “Methods” for more details.

Evaluation was organized as a 2-fold cross-validation experiment to show model generalizability and to assess replication of the stratification results. To this aim, we randomly split the dataset in half, obtaining two independent cohorts of about 800,000 patients that we used to train and test the models (and vice versa). While we used all patients in each cohort for training, in the test sets we retained only the patients diagnosed with one of the eight disorders under consideration, obtaining about 94,000 test patients per fold (see the “Dataset” subsection in “Methods” for more details).

Table 1 shows the results using hierarchical clustering for different ConvAE architectures (one, two, and multikernel CNN layers) and baselines in terms of entropy and purity scores averaged over the 2-fold cross-validation experiment. ConvAE performed significantly better than other models largely used in healthcare for representation learning, including Deep Patient [20], for both entropy and purity scores (ps < [...] .50, based on purity score analysis). It is worth noting that, without a predictive theory of clustering [21, 22], validation metrics frequently fail to correlate with clustering errors [23]. However, such a theoretical structure is not applicable in this context because the heterogeneity of the external complex disorder classes does not provide a reliable probabilistic framework. For this reason, rather than estimation error analysis, we used transparent external metrics, such as entropy and purity scores, which evaluate cluster composition and also account for possible subgroups of complex diseases [24].

Figure 2 visualizes the distribution of the different patient representations along with their disease cohort labels obtained using UMAP (Uniform manifold approximation and projection for dimension reduction [25]). ConvAE captures hidden patterns of overlapping phenotypes while still displaying identifiable groups of patients with distinct disorders. Figure 3 shows the same patient distribution highlighting clustering labels and purity percentage scores of each cluster’s dominating disease. These figures refer to only one of the cross-validation splits; results for the second split are similar and are available in Supplementary Figures 1 and 2. ConvAE (with one CNN layer) also led to better clustering, visually, than all baselines. Patients with ADHD were the most separated and detected with 80% purity by hierarchical clustering. Visible clusters with > 50% purity were also identified for T2D, PC and PD. Comparing the encoding projections (Figure 2) to the clustering visualization (Figure 3), we observe that patients whose disease is not correctly identified by clusters tend to not clearly separate in this low-dimensional space. As an example, AD patients were randomly scattered in the plot and did not lead to distinguishable clusters. This might be due to factors such as sex and age, intrinsic biases or noise, but it might also reflect a shared phenotypic characterization that drives the learning process into displaying these patient EHR progressions closely together irrespective of disease labels.

We then evaluated the use of ConvAE representations for patient stratification at scale and the identification of clinically relevant disease subtypes. We considered six diseases: T2D, PD, AD, MM, PC, and BC. These are all age-related complex disorders with late onset (i.e., averaged increased prevalence after 60 years of age) [26, 27, 28, 29, 30, 31]. We decided to focus on these conditions to avoid, to some extent, the confounding effect of age that could affect learning and the evaluation of different subtypes. Figure 4 shows results running hierarchical clustering on the ConvAE-based patient representations of each different disease cohort. To determine the optimal number of clusters, we empirically selected the smallest number of clusters that minimizes the increase in explained variance (i.e., Elbow method). We were able to identify different subtypes for each disease with no additional feature selection and using representations derived from a domain-free cohort of patients. Supplementary Table 3 reports the number of patients in each cohort and the number of subgroups identified.
Similar results were obtained for the second split and are reported in Supplementary Figure 3.

In the following sections, we present the clinical characterization of T2D, PD, and AD subgroups via enrichment analysis of medical concept occurrences (see Supplementary Material for the characterization of the other conditions). We compare T2D and PD results to related studies based on ad hoc cohorts [10, 11]. Conversely, there are no published EHR-based stratification studies for AD, MM, PC, and BC to use for comparison. All subtypes were reviewed by a clinical expert to highlight meaningful descriptors and we used multiple pairwise chi-squared tests to assess group differences. For each disease, we list sex and age statistics of the cohort (between-group comparisons are performed via multiple pairwise chi-squared tests and t-tests), as well as the five most frequent diagnoses, medications, laboratory tests, and procedures, ordered according to in-group and total frequencies, in Supplementary Tables 4-9. The results for the second split are reported in Supplementary Tables 10-15.
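
The entropy and purity scores used in the multi-disease clustering evaluation can be illustrated with a minimal sketch. It assumes purity is the fraction of patients that fall under the dominating disease of their cluster, and entropy is the size-weighted average of the per-cluster Shannon entropy in bits; the exact normalization used for Table 1 is not stated in this section, so the function name `purity_and_entropy` and the base-2 logarithm are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def purity_and_entropy(cluster_ids, disease_labels):
    """Return (purity, entropy) of a clustering against external disease labels.

    Assumed definitions (not confirmed by the paper):
      purity  = sum over clusters of the dominating-disease count, divided by n
      entropy = size-weighted mean of per-cluster Shannon entropy (log base 2)
    """
    # Contingency counts: cluster id -> Counter of disease labels.
    clusters = defaultdict(Counter)
    for cluster, disease in zip(cluster_ids, disease_labels):
        clusters[cluster][disease] += 1

    n = len(cluster_ids)
    purity = sum(max(counts.values()) for counts in clusters.values()) / n

    entropy = 0.0
    for counts in clusters.values():
        size = sum(counts.values())
        # Shannon entropy of the disease distribution inside this cluster.
        h = -sum((k / size) * math.log2(k / size) for k in counts.values())
        entropy += (size / n) * h
    return purity, entropy


if __name__ == "__main__":
    # Toy example: two clusters over six patients with two diseases.
    ids = [0, 0, 0, 1, 1, 1]
    labels = ["T2D", "T2D", "PD", "PD", "PD", "PD"]
    print(purity_and_entropy(ids, labels))
```

Under these definitions, higher purity and lower entropy both indicate clusters that are more homogeneous with respect to the disease labels.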

Type 2 diabetes

Patients with T2D clustered into three different subgroups that relate to different stages of progression for the disease (see Figure 4a and Supplementary Table 4 for details).

Subgroup I included 18,325 patients and represents the mild symptom severity cohort, characterized by common T2D symptoms (e.g., metabolic syndrome), which were treated with Metformin, an oral hypoglycemic medication. Moreover, it also included patients exposed to lifestyle risk factors, such as Obesity [6].

Subgroups II/III, which were composed of 22,659 and 7,704 patients, respectively, showed concomitant conditions associated with T2D progression and worsening symptoms. Specifically, subgroup II clustered patients characterized by microvascular problems, such as diabetic nephropathy, neuropathy, and/or peripheral artery disease. The significant presence of Creatinine and Urea nitrogen laboratory tests, which estimate renal function, suggests monitoring of kidney diseases, which are often related to T2D [32]. The presence of Pain in limb, combined with analgesic drugs (i.e., Paracetamol, Oxycodone), indicates the presence of vascular lesions at the peripheral level, manifested as ischemic rest pain or ulceration. This was confirmed by Peripheral vascular disease diagnoses, which account for 50% of terms in the T2D cohort.

Subgroup III showed severe cardiovascular problems, identified by a significant presence of medical concepts related to coronary artery diseases, e.g., Coronary atherosclerosis, Angina pectoris, which are serious risk factors for heart failure. These subjects were often treated with antiplatelet therapy (i.e., Acetylsalicylic acid, Clopidogrel) to prevent cardiovascular events (e.g., stroke) and were likely to receive invasive procedures to treat severe arteriopathy. For instance, 30% of patients in subgroup III underwent Percutaneous Transluminal Coronary Angioplasty, a procedure to open up blocked coronary arteries.

Our results confirm, in part, what was observed by Li et al. [10], who used topology analysis on an ad hoc cohort of T2D patients and identified three distinct subgroups characterized by 1) microvascular diabetic complications (i.e., diabetic nephropathy, diabetic retinopathy); 2) cancer of bronchus and lungs; and 3) cardiovascular diseases and psychiatric disorders. In particular, we detected the same microvascular and cardiovascular disease groups, which are consequences of T2D. In contrast, we were unable to detect a subgroup significantly characterized by cancer, an epiphenomenon that can be caused by secondary immunodeficiency in patients with T2D [33, 34]. See Supplementary Material for further description and a clustering comparison via Fowlkes-Mallows index.

Parkinson’s disease

Individuals diagnosed with PD divided into two groups (Figure 4b and Supplementary Table 5): one dominated by motor symptoms (1,368 patients) and another (1,684 patients) characterized by non-motor/independent features and longer course of disease.

Subgroup I is characterized as a tremor-dominant cohort (i.e., manifested by motor symptoms) because of the significant presence of diagnoses such as Essential tremor, Anxiety state, and Dystonia. It is interesting to note that motor clinical features likely led to a common misdiagnosis of essential tremor, which is an action tremor that typically involves the hands. Parkinsonian tremor, on the contrary, although it can be present during postural maneuvers and action, is much more severe at rest and decreases with purposeful activities. However, when the tremor is severe, it is difficult to distinguish action tremor from resting tremor, leading to the aforementioned misdiagnosis [35]. Moreover, anxiety states, emotional excitement, and stressful situations can exacerbate the tremor, and lead to a delayed PD diagnosis. Brain MRI, usually non-diagnostic in PD, was ordered for several patients in this subgroup (13%), suggesting its use for differential diagnosis, e.g., to investigate the presence of chronic/vascular encephalopathy.

Subgroup II included non-motor and independent symptoms, such as Constipation and Fatigue. Patients in subgroup II were significantly diagnosed with Coronary artery disease, which is prevalent in older patients (> 50 years old). Constipation and fatigue are among the most common non-motor problems related to autonomic dysfunction, diminished activity level, and slowed intestinal transit time in PD [36, 37].

In their study about PD stratification with PPMI (Parkinson’s progression markers initiative) data, Zhang et al. [11] identified three distinct subgroups of patients based on severity of both motor and non-motor symptoms. In particular, one subgroup included patients with moderate functional decay in motor ability and stable cognitive ability; a second subgroup presented with mild functional decay in both motor and non-motor symptoms; and the third subgroup was characterized by rapid progression of both motor and non-motor symptoms. EHRs do not quantitatively capture PD symptom severity, therefore our analyses cannot replicate these findings. However, unlike Zhang et al., we can discriminate between specific motor and non-motor symptoms and also suggest a longer, but not necessarily more severe, disease course for the non-motor symptom subgroup.

Alzheimer’s disease

Patients with AD separated into three subgroups marked by AD onset, disease progression, and severity of cognitive impairment (see Figure 4c and Supplementary Table 6).

Subgroup I is characterized by 399 patients with early-onset AD, i.e., patients whose dementia symptoms have typically developed between the age of 30 and 60 years, and initial neurocognitive disorder. Early-onset AD affects 5% of the individuals with AD in the US [38] and, because clinicians do not usually look for AD in younger patients, the diagnostic process includes extensive evaluations of patient symptoms. In particular, given that a certain AD diagnosis can only be provided post-mortem through brain examination, clinicians first rule out other causes that can lead to early-onset dementia (i.e., differential diagnosis). We find evidence of this practice in this subgroup, which includes postmenopausal women, identifiable by mean age greater than 50, Osteoporosis diagnosis with calcium supplement therapy, and menopausal hormone treatment (i.e., Estradiol). Patients in this group are also tested for infectious diseases (e.g., HIV, Syphilis, Hepatitis C, Chlamydia/Gonorrhoea) that are possible causes of early-onset dementia [39], and screened via structural neuroimaging, e.g., MRI/PET brain. As cognitive dysfunctions that may be mistaken for dementia can also be caused by depression and other psychiatric conditions, the presence of Psychiatric service/procedure suggests psychiatric evaluations to exclude depressive pseudodementia. After the differential diagnosis process and the exclusion of other possible causes, these patients eventually received a diagnosis of AD.

Subgroup II includes 1,170 patients with late-onset AD, mild neuropsychiatric symptoms and cerebrovascular disease. Here, the absence of behavioral disturbances in 39% of patients, and their high average age (M = 84.[...], sd = 9.61) suggest a late AD onset, with a progression characterized by a slower rate of cognitive ability decline [40]. Moreover, the presence of Acetylsalicylic acid, an antiplatelet medication, and Intracranial hemorrhage diagnosis indicates the co-occurrence of cerebrovascular disease, which affects blood vessels and blood supply to the brain. Cerebrovascular diseases are common in aging, and can often be associated with AD [41]. In this regard, Head CT may have been performed to prevent or identify structural abnormalities related to cerebrovascular disease.

Subgroup III is characterized by 1,632 individuals with typical onset and mild-to-moderate dementia symptoms. A cohort of 409 patients was treated with Donepezil, a cholinesterase inhibitor that is a primary treatment for cognitive symptoms and is usually administered to patients with mild-to-moderate AD, producing small improvement in cognition, neuropsychiatric symptoms, and activities of daily living [42]. Patients in this subgroup also showed both dementia with and without behavioral disturbances.

Discussion

This study proposes a computational framework to disentangle the heterogeneity of complex disorders in large-scale EHRs through the identification of data-driven clinical patterns with machine learning. Specifically, we developed and validated an unsupervised architecture based on deep learning (i.e., ConvAE) to infer informative vector-based representations of millions of patients from a large and diverse hospital setting, which facilitates the identification of disease subgroups that can be leveraged to personalize medicine. These representations aim to be domain-free (i.e., not related to any specific task since learned over a large multi-domain dataset) and enable patient stratification at scale. Results from our experiments show that ConvAE significantly outperformed several baselines on clustering patients with different complex conditions and led to the identification of different clinically meaningful disease subtypes.

Results identified disease progression, symptom severity, and comorbidities as contributing the most to the EHR-based clinical phenotypic variability of complex disorders. In particular, T2D patients divided into three subgroups according to comorbidities (i.e., cardiovascular and microvascular problems) and symptom severity (i.e., newly diagnosed with milder symptoms). Individuals with PD showed different disease duration and symptoms (i.e., motor, non-motor). AD profiles distinguished early- and late-onset groups and separated patients with mild neuropsychiatric symptoms and cerebrovascular disease from patients with mild-to-moderate dementia. Patients with MM were characterized by different comorbidities (e.g., amyloidosis, pulmonary diseases) that manifest alongside precise typical signs of MM. Patients with PC and BC separated according to disease progression. These findings showed that the features learned by ConvAE describe patients in a way that is general and conducive to identifying meaningful insights into different clinical domains.
In particular, this work aims to contribute to the next generation of clinical systems that can 1) scale to include many millions of patient records and 2) use a single, distributed patient representation to effectively support clinicians in their daily activities, rather than multiple systems working with different patient representations derived for different tasks [20].

To this aim, enabling efficient data-driven patient stratification analyses to identify disease subgroups is an important aspect to unlock personalized healthcare. Ideally, when new patients enter the medical system, their health status progression can be tied to a specific subgroup, thereby informing the treating clinician of personalized prognosis and possible effective treatment strategies, or counseling in cases where a certain diagnosis is difficult and a more thorough examination is required (e.g., specific genetic or lab tests). Moreover, the clinical characteristics of the different subtypes can potentially lead to intuitions for novel discoveries, such as comorbidities, side-effects or repositioned drugs, which can be further investigated by analysing the patient clinical trajectories.

Previous studies mostly focused on a specific disease using ad hoc cohorts of patients and features [8, 9, 10, 11, 43, 44]. While these studies obtained relevant clinically meaningful results, the computational framework is hard to replicate for different diseases and it is tied to the specific study and to the specific data. Deep learning has extensively been used to model EHRs for medical analysis [15, 16], including clinical prediction, such as disease onset, mortality, and readmission [45, 46, 47], and disease phenotyping [20, 48]. Because deep learning methods have not yet been leveraged for disease subtyping at scale, ConvAE aims to fill this gap and to provide an architecture that can improve unsupervised EHR pre-processing to favor patient stratification and unveil clinically meaningful and actionable insights.
Additionally, unlike previous representation learning methods which did not consider the temporality of EHRs [20, 48], ConvAE uses CNNs in combination with embeddings to specifically capture some of the longitudinal aspects of patient clinical status, leading to more robust representations. CNNs were already used to model EHRs for specific predictive analysis, as part of supervised architectures [49, 50]. In contrast, we trained CNNs in an unsupervised framework based on autoencoders to learn general-purpose patient representations. While these representations were used to leverage disease subtype discovery, they can also be fine-tuned and applied to specific supervised tasks, such as disease phenotyping and prediction.

There are several limitations to our study. First, we acknowledge that the lack of any discernible pattern in the multi-disease clustering analysis can also be due to noise and biases in the data, which might affect both learned representations and clustering. In particular, processing EHRs with minimum data engineering, on the one hand, preserves all the available information and, to some extent, prevents systematic biases. On the other hand, it adds hospital-specific biases intrinsic to the EHR structure and noise due to data being redundant and too generic. Improving EHR pre-processing by, e.g., better modeling clinical notes and/or improving feature filtering, should help reduce noise and improve performance. Second, we identified patients related to complex disorders using SNOMED–CT codes and this likely led to the inclusion of many false positives that affected the learning algorithms [51]. The use of phenotyping algorithms based on manual rules, e.g., PheKB [52], or semi-automated approaches, e.g., [53, 54], should help identify better cohorts of patients and, consequently, better disease subtypes. Another limitation comes from the choice, among all possibilities, of the specific complex disorders.
This allowed us to test the approach on heterogeneous conditions that affect different biological mechanisms, showing the efficacy of the proposed framework in generalizing to various clinical domains. Nevertheless, the approach should be further evaluated with other typologies of conditions as well, such as multiple sclerosis, autoimmune diseases, and psychiatric disorders. Lastly, we identified relevant concepts in the patient subgroups by simply evaluating their frequency. Adding a semantic modeling component based on, e.g., topic modeling [55] or word embeddings [56], might lead to more clinically meaningful patterns.

Future work will attempt to address these limitations and to further improve and replicate the architecture. First, we plan to enable multi-level clustering in order to stratify patients within the subtypes. This should lead to more granular patient stratification and thus to patterns on a more individual level. Second, we plan to verify ConvAE generalizability by replicating the study on EHRs from different healthcare institutions. Third, we will evaluate the use of disease subtypes as labels for training supervised models that can predict stratified patient risk scores. This, besides further validating the relevance of the results, will also provide an initial and intuitive framework to apply the results of patient stratification to clinical practice. To this aim, we plan to first assess treatment safety and efficacy between subtypes of a specific disease. Finally, to develop more comprehensive disease characterizations, we will include other modalities of data, e.g., genetics, into this framework, which will hopefully refine clustering and reveal new etiologies. Multi-modal stratified disease cohorts promise to facilitate better predictive capabilities for future outcomes by modeling how molecular mechanisms interact with clinical states.

Methods

The framework to derive patient representations that enable stratification analysis at scale is based on 3 steps: 1) data pre-processing; 2) unsupervised representation learning (i.e., ConvAE); and 3) clustering analysis of disease-specific cohorts (see Figure 1a). In this section, we report details of this framework as well as the description of the evaluation design.

Dataset

We used de-identified EHRs from the Mount Sinai Health System data warehouse; the study was approved by IRB-19-02369 in accordance with HIPAA guidelines. Mount Sinai Health System is a large and diverse urban hospital located in New York, NY, which generates a high volume of structured, semi-structured and unstructured data from inpatient, outpatient, and emergency room visits. Patients in the system can have up to 12 years of follow-up data unless they are transferred or move their residence away from the hospital system. We accessed a de-identified dataset containing approximately 4.[…] V was composed of 57,464 clinical concepts.

We retained all patients with at least two concepts, resulting in a collection of 1,608,741 different patients, with an average of 88.[…],932 females, 691,[…],488 not declared; the mean age of the population as of 2016 was 48.29 years (sd = 23.[…],000 random patients for tuning the model hyperparameters. Train and test pre-processed sets’ details are reported in Supplementary Table 1.

Data pre-processing

Every patient in the dataset is represented as a longitudinal sequence $s_p$ of length $M$ of aggregated temporally-ordered medical concepts, i.e., $s_p = (w_1, w_2, \ldots, w_M)$, where each $w_i$ is a medical concept from the vocabulary $V$. Pre-processing includes: 1) filtering the least and most frequent concepts; 2) dropping redundant concepts within fixed time frames; 3) splitting long sequences of records to include the complete patient history while leveraging the CNN framework, which requires fixed-size inputs.

We consider all the EHRs as a document $D$ and each patient sequence $s_p$ as a sentence. For each concept $w$ in $V$ we first compute the probability of having $w$ in $D$. We then multiply this by the sum of the probabilities to find $w$ in a sentence $s_p$ for all sentences. In particular, let $P$ be the set of all patients, $\forall w \in V$, the filtering score is defined as:

$$P(w \in D) \sum_{p \in P} P(w \in s_p) = \frac{|\{s \in D;\ w \in s\}|}{|D|} \sum_{p \in P} \frac{|\{w_i \in s_p;\ w_i = w\}|}{|s_p|}, \tag{1}$$

where $|D|$ is the total number of sentences and $|s_p|$ is the length of a patient sequence. The filtering score combines document frequency, i.e., number of patients with at least one occurrence of $w$, and term frequency, i.e., total number of occurrences of $w$ in a patient sequence. We then drop all concepts with filtering scores outside certain cut-off values to reduce the amount of noise (i.e., uninformative concepts that occur multiple times in few patients, or overly general concepts that occur in many patients).

A patient may have multiple encounters in their health records that span consecutive days and might include repeated concepts that are often artifacts of the EHR system, rather than new clinical entries. To reduce this bias, we drop all duplicate medical concepts from the patient records within overlapping time intervals of $T$ days. Within the same time window, we also randomly shuffle the medical concepts, given that events within the same encounter are generally randomly recorded [59, 54].
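The filtering score of Equation (1) can be illustrated with a short Python sketch. This is a minimal example under stated assumptions, not the authors' implementation: the function names `filtering_scores` and `filter_vocabulary`, the treatment of empty sequences, and the cut-off arguments `low` and `high` are all hypothetical.

```python
from collections import Counter


def filtering_scores(sequences):
    """Score each concept as P(w in D) * sum_p P(w in s_p), following Equation (1).

    `sequences` is a list of patient sequences, each a list of concept codes.
    """
    n_sentences = len(sequences)  # |D|
    doc_freq = Counter()          # number of sequences containing w at least once
    tf_sum = Counter()            # sum over patients of (count of w in s_p) / |s_p|
    for seq in sequences:
        if not seq:
            continue  # assumption: empty sequences contribute nothing
        for concept, count in Counter(seq).items():
            doc_freq[concept] += 1
            tf_sum[concept] += count / len(seq)
    return {w: (doc_freq[w] / n_sentences) * tf_sum[w] for w in doc_freq}


def filter_vocabulary(sequences, low, high=float("inf")):
    """Drop concepts whose score falls outside [low, high] (hypothetical cut-offs)."""
    scores = filtering_scores(sequences)
    keep = {w for w, score in scores.items() if low <= score <= high}
    return [[w for w in seq if w in keep] for seq in sequences]
```

For three toy sequences `[["a", "a", "b"], ["a", "c"], ["b"]]`, concept `c` appears once in a single patient and receives the lowest score, so it is the first to be removed as the lower cut-off increases.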
Lastly, we eliminate all patients with fewer than 3 concepts in their records.

Patient sequences are then chopped into subsequences of fixed length $L$ that are used to train the ConvAE model. Each patient sequence is thus defined as:

$$s_p = [(w_1, \ldots, w_L), (w_{L+1}, \ldots, w_{2L}), \ldots],$$

and subsequences shorter than $L$ are padded with 0 up to length $L$. For the sake of clarity, in the following section we present the architecture as applied to a general subsequence $s = (w_1, \ldots, w_L)$.

The ConvAE architecture

ConvAE is a representation learning model that transforms patient EHR subsequences into low-dimensional, dense vectors. The architecture consists of three stacked modules (see Figure 1b). This study proposes to use embeddings, CNNs, and autoencoders in combination to process EHRs and to derive unsupervised vector-based patient representations that can be used for clinical inference and medical analysis.

Given $s$, the architecture first assigns each medical concept $w$ to an $N$-dimensional embedding vector $v_w$ to capture the semantic relationships between medical concepts. Specifically, a patient subsequence is represented as an $(L \times N)$ matrix $E = (v_{w_1}, v_{w_2}, \ldots, v_{w_L})^T$, where $L$ is the subsequence length, and $N$ is the embedding dimension. This structure also retains temporal information because the rows of matrix $E$ are temporally ordered according to patient visits.

The architecture is then composed of CNNs, which extract local temporal patterns, and AEs, which learn the embedded representations for each patient subsequence. The CNN applies temporal filters to each embedding matrix. CNN filters applied to EHRs usually perform a one-side convolution operation across time via filter sliding. A filter can be defined as $k \in \mathbb{R}^{h \times N}$, where $h$ is the variable window size and $N$ is the embedding dimension [60, 61]. Our approach differs in that it processes embedding matrices as if they were RGB images carrying a third “depth” dimension. With this approach, we enable the model filters to learn independent weights for each encoding dimension, thus activating for the most salient features in each dimension of the embedding space. Therefore, we reshape the $(L \times N)$ embedding matrix into $\tilde{E} \in \mathbb{R}^{1 \times L \times N}$ and we consider the embedding dimensions as channels. We then apply $f$ filters $k \in \mathbb{R}^{1 \times h \times N}$ to the padded input to keep the same output dimension and learn features that may grasp sequence characteristics. In particular, for each filter $j$, we obtain:

$$(R)_j = \mathrm{ReLU}\left(\sum_{i=0}^{N-1} k_i \star \tilde{e}_i + b_j\right), \quad j = 1, \ldots, f, \tag{2}$$

where: $R \in \mathbb{R}^{1 \times L \times f}$ is the output matrix; $k_i$ is the $h$-dimensional weight matrix at depth $i$; $\tilde{e}_i \in \mathbb{R}^{1 \times L}$ is the $i$-th embedding dimension of the input matrix; $b$ is the bias vector; and $(\star)$ is the convolution function. We used Rectified Linear Unit (ReLU) as the activation function and max pooling. The output is then reshaped into a concatenated vector of dimension $L \cdot f$. This configuration learns different weights for each embedding dimension to highlight relevant interdependencies of medical concepts, and tunes representations of patient histories to identify the most relevant characteristics of their semantic space.

We then use fully dense layers of autoencoders to derive embedded patient representations that estimate the given input subsequences. Specifically, we extract the hidden representation $y$, an $H$-dimensional vector, as the encoded representation of each patient subsequence. Each patient sequence $s_p$ is then transformed into a sequence of encodings $s_h$ that can be post-modeled to obtain a unique vector-based patient representation. Here we simply component-wise average all the subsequence representations.

To train ConvAE, we set up a multi-class classification task that reconstructs each initial input one-hot subsequence of medical terms from its encoded representation. Given a subsequence of medical concepts $s$, the ConvAE is trained by minimizing the Cross Entropy (CE) loss:

$$\mathrm{CE}(\mathrm{Softmax}(O), s) = -\frac{1}{L} \sum_{j=1}^{L} \log\left(\mathrm{Softmax}(O_j)_{w_j}\right),$$

where $O$ is the output of ConvAE reshaped into a matrix of dimension $|V| \times L$, $w_j$ is the $j$-th element of sequence $s$ that corresponds to a term indexed in $V$ and:

$$\mathrm{Softmax}(O_j)_i = \frac{\exp O_{ji}}{\sum_{i'=1}^{|V|} \exp O_{ji'}}, \quad i = 1, \ldots, |V|. \tag{3}$$

Since the objective function consists of only self-reconstruction errors, the model can be trained without any supervised training samples.
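To make Equation (2) and the reconstruction loss concrete, the following pure-Python sketch computes both on small inputs. It is an illustration under assumptions rather than the released implementation: the function names are hypothetical, zero "same" padding with an odd window size $h$ is assumed, and the sliding product is a cross-correlation, as is conventional in deep learning libraries.

```python
import math


def conv_relu(E, filters, biases):
    """Channel-wise temporal convolution followed by ReLU, as in Equation (2).

    E is an L x N embedding matrix (list of rows). Each filter is an N x h
    weight matrix: one length-h kernel per embedding dimension (channel).
    Returns an L x f matrix of activations.
    """
    L, N = len(E), len(E[0])
    out = []
    for t in range(L):
        row = []
        for k, b in zip(filters, biases):
            half = len(k[0]) // 2          # assumes odd window size h
            total = b
            for i in range(N):             # sum over embedding dimensions
                for d, weight in enumerate(k[i]):
                    pos = t + d - half
                    if 0 <= pos < L:       # outside the sequence: zero padding
                        total += weight * E[pos][i]
            row.append(max(0.0, total))    # ReLU
        out.append(row)
    return out


def cross_entropy(O, s):
    """Mean negative log-softmax of the true concept at each position.

    O[j] holds |V| scores for position j; s[j] is the index of the true concept.
    """
    loss = 0.0
    for scores, target in zip(O, s):
        m = max(scores)  # subtract the max for numerical stability
        log_norm = m + math.log(sum(math.exp(x - m) for x in scores))
        loss -= scores[target] - log_norm
    return loss / len(s)
```

With uniform scores over a two-concept vocabulary, `cross_entropy` returns $\ln 2$, the loss of a model that guesses at random between two concepts.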
Clustering analysis for patient stratiﬁcation

ConvAE-based representations can be used to stratify patients from any preselected cohort without needing additional feature engineering or manual adjustments. To this aim, patients with a specific disease are selected using, e.g., ICD codes, SNOMED–CT diagnosis, or phenotyping algorithms (e.g., [51, 53, 54]), and clustering is applied to the corresponding representations to identify disease subgroups. Here, specifically, we use SNOMED–CT diagnosis to preselect the disease cohorts and hierarchical clustering with Ward’s method and Euclidean distance to derive disease subgroups. We identify the number of subclusters that best disentangles heterogeneity on the disease dataset using the Elbow Method, which empirically selects the smallest number of clusters that minimize the increase in explained variance.

A systematic analysis of the patients in each subgroup can then automatically identify the medical concepts that significantly and uniquely define each disease subtype. In this work, we rank all the codes by their frequency in the patient sequences. In particular, we compute the percentages of patients whose sequence includes a specific concept both with respect to a subcluster (i.e., in-group frequency) and to the complete disease cohort (i.e., total frequency). Ranking maximizes, first, the in-group percentage, and second, the total percentage. We then analyze the most frequent concepts and we use a pairwise chi-squared test to determine whether the distributions of present/absent concepts with respect to the detected subgroups are significantly different [11].
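The concept ranking described above can be sketched in a few lines of Python. The sketch rests on one interpretation of the text, which is an assumption: the in-group frequency is taken as the share of subcluster patients whose sequence contains the concept, and the total frequency as the share of all cohort patients whose sequence contains it. The function name `rank_concepts` and the alphabetical tie-break are hypothetical, and the chi-squared test is not included.

```python
def rank_concepts(cohort, labels, cluster):
    """Rank the concepts of one subcluster by in-group, then total, percentage.

    cohort: list of patient concept sequences for a disease cohort.
    labels: cluster label of each patient, aligned with `cohort`.
    Returns a list of (concept, in_group_pct, total_pct) tuples.
    """
    everyone = [set(seq) for seq in cohort]
    group = [s for s, lab in zip(everyone, labels) if lab == cluster]
    if not group:
        return []
    rows = []
    for concept in set().union(*group):
        in_group = 100.0 * sum(concept in s for s in group) / len(group)
        total = 100.0 * sum(concept in s for s in everyone) / len(everyone)
        rows.append((concept, in_group, total))
    # Highest in-group percentage first, then highest total percentage;
    # the final alphabetical key only makes the order deterministic.
    rows.sort(key=lambda r: (-r[1], -r[2], r[0]))
    return rows
```

Each patient counts once per concept regardless of how often the concept recurs in the sequence, which matches the "percentages of patients" wording rather than raw occurrence counts.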

Implementation details

All model hyperparameters were empirically tuned to minimize the network reconstruction error, while balancing training efficiency and computation time. We tested a large number of configurations (e.g., time interval $T$ equal to {[…]}; patient subsequence length $L$ equal to {[…]}; embedding dimension $N$ equal to {[…]}). For brevity, we report only the final setting used in the patient stratification experiments. All modules were implemented in Python 3.[…].2, using scikit-learn and pytorch as machine learning libraries [62, 63]. Computations were run on a server with an Nvidia Titan V GPU.

We used equation (1) to discard terms with a filtering score less than 10^−[…], i.e., document frequency ranging from 1 to 10. Examples of discarded concepts are clotrimazole, an antifungal medication, and torsemide, a medication to reduce extra fluid in the body. We decided to retain all the very frequent concepts as most of them seemed clinically informative (e.g., vital signs). Patients with fewer than 3 medical concepts were then discarded. In total, 24,665 medical terms were filtered out, decreasing the vocabulary size to 32,[…] $T = 15$ days, shuffled unique medical concepts and dropped redundant terms. Patient sequences were then split into subsequences of length $L = 32$ concepts, obtaining about ∼[…]M subsequences of medical concepts for training. This value was chosen to enable efficient training of the autoencoder with GPUs.

We initialized medical concept embeddings using word2vec with the skip-gram model [56]. We considered all the subsequences in the training set as sentences and medical concepts as words [54, 59]. We obtained 100-dimensional embeddings for 31,659 medical concepts of the vocabulary. The remaining concepts were initialized randomly; the subsequence padding was initialized as the null vector (i.e., at […]). These embedding vectors were then used as input for the ConvAE module and were further refined during the model training.

The CNN module used 50 filters with kernel size equal to 5 and ReLU activation function. The autoencoder was composed of 4 hidden layers with 200, 100, 200 and $|V| \times 32$ hidden nodes, respectively, where $|V|$ is the vocabulary size. We used ReLU activation in the first three layers and Softplus activation in the final layer to obtain continuous output. We applied dropout with p = 0.[…] 10^−[…] and weight decay = 10^−[…]) [64] for 5 epochs on all training data and batch size of 128. The size of the patient representations was equal to 100.

We evaluated different CNN configurations composed of 1 layer (i.e., “ConvAE 1-layer CNN”), 2 layers (i.e., “ConvAE 2-layer CNN”), and one multikernel layer (i.e., “ConvAE multikernel CNN”). All hyperparameters were the same, except the number of filters in the second CNN of the 2-layer configuration, which was set to 25. Multikernel CNN performs parallel training of distinct CNNs with different kernel sizes, and concatenates the final outputs. We used kernel dimensions equal to 3, 5, and 7.

Baselines

We compared ConvAE with the following representation learning algorithms: “RawCount”, “SVD-RawCount”, “SVD-TFIDF”, and “Deep Patient”. All baselines derived vector-based patient encodings of size 100.

RawCount is a sparse representation where each patient is encoded into a count vector that has the length of the vocabulary. More specifically, each individual health history $s_p$ is represented as an integer vector $x \in \mathbb{Z}^{|V|}$, where each element is the frequency of the corresponding clinical concept in the patient longitudinal history, i.e., $x_i = |\{w_i;\ w_i \in s_p\}|$.

SVD-RawCount applies truncated singular value decomposition (SVD) to the RawCount matrix to compute the largest singular values of the raw count encodings, which define the dense, lower-dimensional representations.

SVD-TFIDF transforms the raw count encodings using the term frequency–inverse document frequency (TFIDF) weighting schema and applies truncated SVD to the resulting matrix. We considered the patient EHR sequences as documents and the entire dataset as corpus, and we derived TFIDF scores for all medical concepts. Each patient is then represented as a vector of length $|V|$, with the corresponding TFIDF weight for each concept, and the matrix obtained is reduced via truncated SVD.

Deep Patient transforms the raw count matrix using a stack of denoising autoencoders as proposed by Miotto et al. [20]. We used the implementation details presented in the paper, with batch size equal to 32, corruption noise equal to 5%, and 5 training epochs.

Multi-disease clustering analysis

We evaluated all the representation learning approaches in a clustering task to determine how well they were able to disentangle patients with different conditions. We chose eight complex disorders: type 2 diabetes (T2D), multiple myeloma (MM), Parkinson’s disease (PD), Alzheimer’s disease (AD), Crohn’s disease (CD), prostate cancer (PC), breast cancer (BC) and attention deficit hyperactivity disorder (ADHD). We retrieved all the corresponding patients in the test sets using SNOMED–CT codes after verifying that at least one corresponding ICD-9 code was present in the patient’s EHRs. In particular, we looked for Type 2 diabetes mellitus (250.00) for T2D; Multiple myeloma without mention of having achieved remission (203.00) for MM; Paralysis agitans (332.0) for PD; Alzheimer’s disease (331.0) for AD; Regional enteritis of unspecified site (555.9) for CD; Malignant neoplasm of prostate (185) for PC; Malignant neoplasm of female breast (174.9) for BC; and Attention deficit disorder with hyperactivity (314.01) for ADHD. We discarded all patients with comorbidities within the selected diseases to facilitate the clustering interpretation. We then performed hierarchical clustering with $k = 8$ clusters (i.e., same as the number of different diseases) for all the representations to evaluate if patients with the same condition were grouping together. The final test sets were composed of about 94,000 patients per fold but were unbalanced, with disease cohorts ranging from about 1,900 to 50,000 patients (see Supplementary Table 2). To use balanced datasets and improve the efficacy of the experiment, we sub-sampled 5,000 random patients for the highly populated diseases, and we iterated this subsampling process 100 times, obtaining 100 different clusterings per test set.

We used entropy and purity scores averaged across the 100 experiments of each fold to measure to what extent the clusters matched the different diseases. In particular, for each cluster $j$, we define the probability that a patient in $j$ has disease $i$ as:

$$p_{ij} = \frac{m_{ij}}{m_j}, \tag{4}$$

where $m_j$ is the number of patients in cluster $j$ and $m_{ij}$ is the number of patients in cluster $j$ with a diagnosis of disease $i$. Entropy for each cluster is defined as:

$$E_j = -\sum_i p_{ij} \log p_{ij}, \tag{5}$$

and conditional entropy $H(\text{disease} \mid \text{cluster})$ is then computed as:

$$H(\text{disease} \mid \text{cluster}) = \sum_j \frac{m_j}{m} E_j,$$

where $m$ is the total number of elements in the complex disease dataset.

Purity identifies the most represented disease in each cluster. For a cluster $j$, purity $P_j$ is defined as $P_j = \max_i p_{ij}$, where $p_{ij}$ is computed as before. The overall purity score is then the weighted average of $P_j$ for each cluster $j$. The perfect clustering obtains averaged entropy and purity scores equal to 0 and 1, respectively.

Disease subtyping analysis

We evaluated the usability of ConvAE representations to discover disease subtypes for different and diverse conditions (i.e., patient stratification at scale). In particular, we selected a cohort of patients with T2D, PD, AD, MM, PC, and BC and ran hierarchical clustering on the ConvAE-based patient representations. These are all age-related complex disorders with late onset (i.e., increased prevalence after 60 years of age [26, 27, 28, 29, 30, 31]). We focused only on these conditions to reduce confounding age effects that could affect the analysis of the subtypes (as could happen with the CD and ADHD cohorts, where a common onset age is less defined). To reduce noise in the sequence encodings, we averaged all patient subsequence representations from the first diagnosis forward, and we dropped sequences shorter than 3 concepts. We varied the number of clusters from 2 to 15 and we used the Elbow Method to empirically select the smallest number of clusters that minimize the increase in explained variance. We then performed a qualitative analysis of each subtype, similarly to Zhang et al. [11], to identify which medical concepts characterized the specific group of patients. We further verified the various subgroups in the medical literature and with the support of a practicing clinician.
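As a worked illustration of the evaluation measures used in the multi-disease clustering analysis, Equations (4) and (5) and the purity score can be computed as in the sketch below. Two points are assumptions: purity is weighted by cluster size, in the same way as entropy, and the logarithm is taken in base 2. The base is not stated in the text, but with eight diseases a natural-log entropy cannot exceed $\ln 8 \approx 2.08$, which is below the reported score of 2.61. The function name is hypothetical.

```python
import math
from collections import Counter


def entropy_and_purity(clusters, diseases):
    """Size-weighted conditional entropy and purity of a clustering.

    clusters[n] and diseases[n] are the cluster id and diagnosis of patient n.
    """
    m = len(clusters)
    by_cluster = {}
    for c, d in zip(clusters, diseases):
        by_cluster.setdefault(c, Counter())[d] += 1
    entropy = 0.0
    purity = 0.0
    for counts in by_cluster.values():
        m_j = sum(counts.values())
        probs = [m_ij / m_j for m_ij in counts.values()]  # Equation (4)
        e_j = -sum(p * math.log2(p) for p in probs)       # Equation (5), base 2 assumed
        entropy += (m_j / m) * e_j
        purity += (m_j / m) * max(probs)                  # P_j = max_i p_ij
    return entropy, purity
```

A clustering that puts every disease in its own cluster yields entropy 0 and purity 1, while a single cluster holding two diseases in equal proportion yields entropy 1 and purity 0.5.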

Data availability

The data used for this study are available from the Mount Sinai Health System (NYC), but restrictions apply to the availability of these data, which were used under license for the current study, and so are not publicly available. Data are however available from the authors upon reasonable request and with permission of Mount Sinai Health System.

Code availability

Code is available at: https://github.com/landiisotta/convae_architecture.

Acknowledgments

R.M. would like to thank the support from the Hasso Plattner Foundation, the Alzheimer’s Drug Discovery Foundation and a courtesy GPU donation from Nvidia. I.L. acknowledges the support from the Bruno Kessler Institute.

Competing interests

The authors declare no competing interests.

Author contributions

I.L. and R.M. conceived and designed the work. I.L. conducted the research and the experimental evaluation, and drafted the manuscript. R.M. created the dataset, supervised and supported the research, and substantially edited the manuscript. B.S.G. substantially edited the manuscript and created the architecture figures. H.L. and S.C. advised on methodological choices and critically revised the manuscript. G.L. provided clinical validation of the results and critically revised the manuscript. M.D. revised the manuscript and contributed to the interpretation of the data. J.T.D. and C.F. supported the research and revised the manuscript. All the authors gave final approval of the completed manuscript version and are accountable for all aspects of the work.

References

[1] Jensen, P. B., Jensen, L. J. & Brunak, S. Mining electronic health records: towards better research applications and clinical care.

Nature Reviews Genetics

395 (2012).[2] Cutting, G. R. Cystic ﬁbrosis genetics: from molecular understanding to clinical application.

NatureReviews Genetics et al.

Large-scale phenome analysis deﬁnes a behavioral signature for Huntington’sdisease genotype in mice.

Nature Biotechnology

Annals ofNeurology

BioMed Research International,

Diabetologia

NatureReviews Drug Discovery et al. Patient Subtyping via Time-Aware LSTM Networks in Proceedings of the 23rdACM SIGKDD International Conference on Knowledge Discovery and Data Mining (ACM, Halifax,NS, Canada, 2017), 65–74. doi: .[9] Doshi-Velez, F., Ge, Y. & Kohane, I. Comorbidity Clusters in Autism Spectrum Disorders: an Elec-tronic Health Record Time-Series Analysis.

Pediatrics e54–63 (2013).[10] Li, L. et al.

Identiﬁcation of type 2 diabetes subgroups through topological analysis of patient similarity.

Science translational medicine et al. Data-Driven Subtyping of Parkinson’s Disease Using Longitudinal Clinical Records:a Cohort Study.

Scientiﬁc Reports

797 (2019).[12] Chen, D. et al.

Deep learning and alternative learning strategies for retrospective real-world clinicaldata. npj Digital Medicine IEEEtransactions on pattern analysis and machine intelligence

Nature

Brieﬁngs in Bioinformatics

Journal of the American Medical InformaticsAssociation et al.

The Impact of Phenotypic and Genetic Heterogeneity on Results of Genome WideAssociation Studies of Complex Diseases.

PLoS ONE e76295 (2013).[18] Banda, J. M., Seneviratne, M., Hernandez-Boussard, T. & Shah, N. H. Advances in electronic phe-notyping: from rule-based deﬁnitions to machine learning models. Annual review of biomedical datascience JAMA

Scientiﬁc Reports Pattern Recognition

PLoS ONE (2018).[23] Brun, M. et al. Model-based evaluation of clustering validation measures.

Pattern Recognition

Information retrieval

Journal of Open Source Software

861 (2018).[26] Cowie, C. C., Casagrande, S. S. & Geiss, L. S. Prevalence and incidence of type 2 diabetes andprediabetes.

Diabetes in America, 3rd edn. National Institutes of Health, Bethesda, MD,

The Lancet Neurology Dialogues in clinical neuroscience

111 (2009).[29] Kazandjian, D.

Multiple myeloma epidemiology and survival: a unique malignancy in Seminars inoncology (2016), 676–681.[30] https://seer.cancer.gov/statfacts/html/prost.html . (Accessed on September 17, 2019).[31] https://seer.cancer.gov/statfacts/html/breast.html . (Accessed on September 17, 2019).[32] Vallon, V. & Komers, R. Pathophysiology of the diabetic kidney. Comprehensive Physiology CriticalReviews in Oncology/Hematology et al.

Impaired Leucocyte Functions in Diabetic Patients.

Diabetic Medicine

Archivesof Neurology

Neurology et al.

Fatigue in Parkinson’s disease: A systematic review and meta-analysis.

MovementDisorders .(Accessed on October 14, 2019).[39] Manji, H., J¨ager, H. R. & Winston, A. HIV, dementia and antiretroviral drugs: 30 years of an epidemic.

Journal of Neurology, Neurosurgery & Psychiatry et al.

Prevalence of Neuropsychiatric Symptoms in Dementia and Mild CognitiveImpairment.

JAMA et al.

Vascular contributions to cognitive impairment and dementia including Alzheimer’sdisease.

Alzheimer’s & Dementia

Cochrane Database ofSystematic Reviews CD001190 (2018).[43] Lombardo, M. V. et al.

Unsupervised data-driven stratiﬁcation of mentalizing heterogeneity in autism.

Scientiﬁc Reports et al. Identiﬁcation and analysis of behavioral phenotypes in autism spectrum disorder viaunsupervised machine learning.

International Journal of Medical Informatics

Doctor AI: Predicting Clinical Events via Recurrent Neural Networks in Proceedings of Machine Learning for Healthcare (2016).[46] Pham, T., Tran, T., Phung, D. & Venkatesh, S. DeepCare: A Deep Dynamic Memory Model forPredictive Medicine in Advances in Knowledge Discovery and Data Mining (Springer InternationalPublishing, 2016), 30–41.[47] Rajkomar, A. et al.

Scalable and accurate deep learning with electronic health records. npj DigitalMedicine

18 (2018).[48] Beaulieu-Jones, B. K., Greene, C. S., et al.

Semi-supervised learning of the electronic health record forphenotype stratiﬁcation.

Journal of biomedical informatics

Deepr : a convolutional net for medicalrecords.

IEEE Journal of Biomedical and Health Informatics et al.

Deep patient similarity learning for personalized healthcare.

IEEE Transactions onNanoBioscience et al.

Combining billing codes, clinical notes, and medications from electronic health recordsprovides superior phenotyping performance.

Journal of the American Medical Informatics Association e20–27 (2015).[52] Kirby, J. C. et al.

PheKB: a catalog and workﬂow for creating electronic phenotype algorithms fortransportability.

Journal of the American Medical Informatics Association

Journal of the American Medical Informatics Association et al. Automated disease cohort selection using word embeddings from Electronic HealthRecords in Biocomputing 2018 (World Scientiﬁc, 2017), 145–156. doi: .[55] Blei, D., Ng, A. & Jordan, M. Latent Dirichlet Allocation.

Journal of Machine Learning Research Eﬃcient Estimation of Word Representations in VectorSpace.

Preprint at https://arxiv.org/abs/1301.3781 . (2013).[57] Jonquet, C., Shah, N. H. & Musen, M. A.

The Open Biomedical Annotator in AMIA Summits onTranslational Science Proceedings (2009), 56–60.[58] Lependu, P., Iyer, S. V., Fairon, C. & Shah, N. H. Annotation analysis for testing drug safety signalsusing unstructured clinical notes.

Journal of Biomedical Semantics s5 (2012).[59] Choi, Y., Chiu, C. Y. I. & Sontag, D.

Learning low-dimensional representations of medical concepts in AMIA Summits on Translational Science Proceedings (2016), 41–50.2360] Zhu, Z. et al. Measuring Patient Similarities via a Deep Architecture with Medical Concept Embedding in (2016), 749–758. doi: .[61] Suo, Q. et al. Personalized disease prediction using a CNN-based similarity learning method in (2017), 811–816. doi: .[62] Pedregosa, F. et al. Scikit-learn: Machine Learning in Python.

Journal of Machine Learning Research et al. Automatic diﬀerentiation in pytorch in NeurIPS Autodiﬀ Workshop (2017).[64] Kingma, D. & Adam, J. B.

Adam: A Method for Stochastic Optimization in Proceedings of the 3rdInternational Conference on Learning Representations (2014), 1–15.24 ntropy Purity Disease Number ConvAE 1-layer CNN 2 .

61 (0 . , [2 . , . ∗∗∗ .

31 (0 . , [0 . , . ∗∗∗ .

50 (0 . ∗∗∗ ConvAE 2-layer CNN 2 .

75 (0 . , [2 . , . .

26 (0 . , [0 . , . .

93 (0 . .

66 (0 . , [2 . , . .

30 (0 . , [0 . , . .

94 (0 . .

90 (0 . , [2 . , . .

18 (0 . , [0 . , . .

76 (0 . .

90 (0 . , [2 . , . .

19 (0 . , [0 . , . .

13 (0 . .

85 (0 . , [2 . , . .

21 (0 . , [0 . , . .

83 (0 . .

81 (0 . , [2 . , . .

24 (0 . , [0 . , . .

96 (0 .

Mean (sd, CI); Mean (standard deviation); ∗ p < . ∗∗ p < . ∗∗∗ p < .

Table 1: Multi-disease clustering performances of ConvAE configurations and baselines. The scores reported are averaged over a 2-fold cross-validation experiment. ConvAE 1-layer CNN significantly outperforms all other configurations and baselines on all measures. Multiple pairwise t-tests with Bonferroni correction are used to compare performances.

Figure 1: Patient stratification framework and ConvAE architecture. (a) Framework enabling patient stratification analysis from deep unsupervised EHR representations; (b) Details of the ConvAE representation learning architecture.

Figure 2: Uniform manifold approximation and projection (UMAP) encoding visualization. (a) ConvAE 1-layer CNN; (b) SVD-RawCount; (c) SVD-TFIDF; (d) Deep Patient. AD = Alzheimer’s disease; ADHD = Attention deficit hyperactivity disorder; BC = Breast cancer; CD = Crohn’s disease; MM = Multiple myeloma; PC = Prostate cancer; PD = Parkinson’s disease; T2D = Type 2 diabetes.

Figure 3: Uniform manifold approximation and projection (UMAP) clustering visualization. (a) ConvAE 1-layer CNN; (b) SVD-RawCount; (c) SVD-TFIDF; (d) Deep Patient. AD = Alzheimer’s disease; ADHD = Attention deficit hyperactivity disorder; BC = Breast cancer; CD = Crohn’s disease; MM = Multiple myeloma; PC = Prostate cancer; PD = Parkinson’s disease; T2D = Type 2 diabetes.

Figure 4: Complex disorder subgroups. A subsample of 5,000 patients with T2D is displayed in Figure (a). Figures (b), (c), (d), (e), (f) display patient subtypes for Parkinson’s and Alzheimer’s disease, multiple myeloma, prostate and breast cancer cohorts, respectively.

Supplementary Material

Clustering comparison for the type 2 diabetes analysis

Li et al. [1] used a similar cohort of EHRs as in this study to stratify patients with type 2 diabetes (T2D). Of the 2,472 patients from their paper, we identified 1,050 of them in our test sets. To compare the results, we evaluated the similarity of the clusters we obtained to those found by Li et al. via the Fowlkes-Mallows index (FMI), which is an external validation similarity measure of two cluster analyses [2, 3]. FMI scores range from 0 to 1, where 1 represents identical clustering and 0 purely independent label assignments. We obtained FMI = 0.40, which suggests that only a portion of patients in groups from Li et al. [1] are identified by our approach as sharing the same characteristics. This may entail that associated clinical phenotypes overlap to a greater extent than hypothesized by Li et al., which may have been overlooked because they collected shorter EHR sequences (i.e., 60-day intervals) and used a manually derived subset of features.
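For reference, the Fowlkes-Mallows index can be computed from pair counts, as in the following sketch. It uses the standard pair-counting definition, $\mathrm{FMI} = TP / \sqrt{(TP + FP)(TP + FN)}$, where $TP$ is the number of patient pairs placed in the same cluster by both analyses. The function name and the value returned for degenerate clusterings are assumptions of this example.

```python
import math
from collections import Counter


def fowlkes_mallows(labels_a, labels_b):
    """Fowlkes-Mallows index between two clusterings of the same patients."""

    def same_cluster_pairs(counter):
        # Number of unordered pairs that share a key.
        return sum(n * (n - 1) // 2 for n in counter.values())

    tp = same_cluster_pairs(Counter(zip(labels_a, labels_b)))  # together in both
    pairs_a = same_cluster_pairs(Counter(labels_a))            # TP + FP
    pairs_b = same_cluster_pairs(Counter(labels_b))            # TP + FN
    if pairs_a == 0 or pairs_b == 0:
        return 0.0  # assumption: no co-clustered pairs in one of the analyses
    return tp / math.sqrt(pairs_a * pairs_b)
```

The index depends only on which patients are grouped together, so relabeling the clusters of either analysis leaves the score unchanged.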

Disease subtyping

Multiple myeloma

We identified five subgroups for multiple myeloma (MM) (see Figure 4d and Supplementary Table 7). In particular, subgroup I is characterized by pulmonary manifestations; subgroup II shows bone-related signs of MM; subgroup III includes signs of gastrointestinal problems; subgroup IV is defined by kidney problems; and subgroup V shows signs of peripheral neuropathy.

Pulmonary manifestations in subgroup I include Pleura effusion, a rare pulmonary manifestation of amyloidosis [4] that is a comorbidity of MM found in 10–15% of patients (i.e., superimposed amyloidosis). Subgroup I is also characterized by patients with amyloidosis and proteinuria (i.e., excess of proteins in urine) because of the large frequency of Urea nitrogen blood test.

Disorders of bone and cartilage largely characterizes patients in subgroup II, which can be identified with bone-related signs of MM.

Subgroups III and V include patients who received chemotherapy and/or anti-cancer medications. In particular, we often found Bortezomib in combination with Dexamethasone in both subgroups. Bortezomib, for example, is administered to 47% of patients from subgroup III and to 26% of patients in group V. It can be used: 1) for patients ineligible for hematopoietic cell transplantation (HCT); 2) as a maintenance therapy; or 3) in conjunction with HCT for newly-diagnosed patients [5]. Given the characterization of subgroups III and V, we expect gastrointestinal problems in subgroup III and Inflammatory/toxic neuropathy diagnosis in subgroup V to indicate different side effects from anti-cancer medications. Peripheral nerve damage is also one of the most significant non-hematologic toxicities of Bortezomib [6]. Although unlikely, neurologic complications can also be caused by MM. Such neurologic complications can be due to spinal cord compression from an extramedullary plasmacytoma, or by peripheral neuropathy, which is rare and usually caused by superimposed amyloidosis [7]. The Counseling concept in subgroup V likely denotes an encounter to treat severe pain linked to neurologic diseases or psychological support.

Creatinine, Urea nitrogen, and Urinalysis testing indicate renal function estimate for patients in subgroup IV. Moreover, 9% of patients report Nephritis and nephropathy and Chronic kidney disease diagnosis, reinforcing the association of subgroup IV to kidney conditions.

Prostate cancer

We found two subgroups of patients with prostate cancer (PC) related to diverging disease courses (see Figure 4e and Supplementary Table 8).

Clinical manifestation of PC is heterogeneous and may range from an asymptomatic, screen-detected, microscopic, well-differentiated tumor that may never become clinically relevant to a clinically symptomatic aggressive cancer that causes metastases, morbidity, and death. Treatment approaches for PC include: active surveillance, radical prostatectomy, or radiation therapy (RT) for patients with low-risk PC; prostatectomy or RT in combination with Androgen Deprivation Therapy (ADT) for patients with higher-risk, but localized PC; RT and ADT for patients with clinical evidence of lymph node involvement.

Patients in subgroup I report Personal history of PC and Ondansetron medication to prevent RT side effects. This suggests that this group includes patients with recurrent prostate cancer who have either received prostatectomy in the past, and hence require RT and ADT, or have already received RT and thus require a radical approach. Anastomosis and Pelvic lymphadenectomy concepts, which are related to post-prostatectomy procedures and are frequent in these patients, support this description.

Clinical manifestations of PC are usually absent at the time of diagnosis, and over 90% of patients are diagnosed via specific screening (e.g., use of prostate-specific antigen (PSA) or digital rectal examination). Patients in subgroup II show frequent signs of effective PSA screening, indicating probable localized and asymptomatic PC. Diagnoses of Nocturia, Impotence of organic origin, Urinary frequency, and treatments for male sexual dysfunctions, i.e., Tadalafil and Sildenafil, are all signs of side effects from PC treatments [8]. Among these patients, at least 22% likely received a prostatectomy (Surgery).

Unlike the second subgroup, patients in the first subgroup do not have PSA among top-ranked concepts. This suggests that subgroup I includes patients who already received prostatectomy, which makes PSA screening less common. Patients in subgroup I appear to have been in the healthcare system for longer and also to have been diagnosed with PC earlier (i.e., similar median age to subgroup II, but absent PSA screening).

Breast cancer

Stratification of breast cancer (BC) patients led to two different subgroups (see Figure 4f and Supplementary Table 9). Subgroup I is linked to advanced stages of BC. Patients in subgroup II, instead, are younger and present a high number of screening-related medical concepts (e.g., Mammography screening). In addition, concepts like Abnormal mammogram and Carcinoma in situ of breast suggest an early-stage diagnosis.

In subgroup I, 23% of patients report Unlisted chemotherapy, with Surgery performed on 44% of them. This suggests that these patients may have a more advanced disease, as also evidenced by the lack of screening terms. As a result, they typically undergo chemotherapy treatment, which is more common in advanced stages of BC, whereas primary surgery (lumpectomy, mastectomy), with or without radiation therapy, is preferred for early-stage cancer. This group also includes patients who have already received surgical treatments (33% having received a partial mastectomy) and thus can either be disease free or have relapsed. The presence of Secondary malignant neoplasm also suggests that subgroup I includes patients with metastatic BC.

It would be important to characterize what the general concepts Unlisted chemotherapy and Antineoplastic chemotherapy refer to in terms of more specific treatments (e.g., hormonal drugs, immunotherapy) to better understand the clinical characteristics of the different subgroups. Moreover, because different molecular subtypes of BC have been identified based on gene expression profiling [9], including hormonal profiles of patients (not available for this study) might improve the stratification results.

Replication of disease subtyping

In the following, we present the patient stratification results obtained with the second split. As highlighted in Supplementary Figure 3, we found slightly different subgroups only for PC and MM (when compared with the results of the first split).

MM encodings detect 4 instead of 5 subgroups. We found two subgroups showing kidney-related problems, one subgroup reporting signs of chemotherapy treatment side effects (i.e., Inflammatory/toxic neuropathy), and one subgroup identified by signs of possible superimposed amyloidosis, i.e., Disease of salivary glands.

Patients with PC split into three subgroups, where subgroups II and III appear to be a further refinement of subgroup II identified in the first split. In particular, subgroup III includes significantly younger subjects compared to subgroup II. The presence of Personal history of PC suggests that subgroup II includes patients with relapsing PC. This subgroup is of particular importance for the investigation of treatment effectiveness.

The analysis for the other diseases led to very similar results to those obtained with the first split. In particular, for T2D we identified three subgroups: a group with signs of metabolic syndrome and T2D risk factors, a group with microvascular problems, and a third group showing signs of cardiovascular disorders. Patients with PD separate into two subgroups, with motor and non-motor symptoms, respectively, as previously found. AD and BC are again characterized by three and two subgroups, respectively, with the same clinical profiles previously presented.

References

[1] Li, L. et al. Identification of type 2 diabetes subgroups through topological analysis of patient similarity. Science Translational Medicine.
[2] […] Journal of the American Statistical Association.
[3] […] Journal of Multivariate Analysis.
[4] […] Current Opinion in Pulmonary Medicine.
[5] […] et al. Lenalidomide, Bortezomib, and Dexamethasone with Transplantation for Myeloma. New England Journal of Medicine.
[6] […] Blood.
[7] […] Best Practice & Research Clinical Haematology.
[8] […] JNCI: Journal of the National Cancer Institute.
[9] […] et al. Repeated observation of breast tumor subtypes in independent gene expression datasets. Proceedings of the National Academy of Sciences.

            Split 1                          Split 2
            Train           Test             Train           Test
Patients    741,177         751,979          740,922         751,[…]
[…]         […],[…],014     3,[…],238        3,[…],596       3,[…],[…]
[…]         […].91 (12.13)  4.86 (12.06)     4.92 (12.14)    4.85 (12.[…])
[…]         […],799         32,156           32,875          32,[…]

Supplementary Table 1: Train and test set characteristics.

Complex disorder    Test set 1    Test set 2
Type 2 diabetes     50,253        50,[…]
[…]                 […],124       3,[…]
[…]                 […],374       3,[…]
[…]                 […],947       1,[…]
[…]                 […],401       14,[…]
[…]                 […],330       8,[…]
[…]                 […],668       6,[…]
[…]                 […],510       6,[…]

ADHD = Attention deficit hyperactivity disorder

Supplementary Table 2: Number of subjects in the complex disorder cohorts.

        Test set 1                  Test set 2
        Numerosity   N clusters     Numerosity   N clusters
T2D     48,688       3              48,759       3
PD      3,052        2              3,071        2
AD      3,201        3              3,150        3
MM      1,884        5              1,883        4
PC      8,522        2              8,645        3
BC      7,964        2              7,838        2

T2D = Type 2 diabetes; PD = Parkinson's disease; AD = Alzheimer's disease; MM = Multiple myeloma; PC = Prostate cancer; BC = Breast cancer

Supplementary Table 3: Complex disorder cohorts and number of subclusters identified via patient stratification.

Type 2 diabetes
[Subgroup sizes, Female/Male counts (∗ a), ages (∗ b), ICD-9 codes, and in-group and total percentages are not recoverable; terms are listed per subgroup in rank order with their significance markers.]

ICD-9
  Subgroup I: Hypertension ∗∗∗; Hyperlipidemia ∗∗∗; Chest pain ∗∗∗ (I vs III); Obesity ∗∗∗; Hypercholesterolemia ∗∗∗
  Subgroup II: Pain in limb ∗∗; Acute kidney failure ∗∗∗; Chronic kidney disease ∗∗∗; Nephritis and nephropathy ∗∗∗; Peripheral vascular disease ∗∗
  Subgroup III: Coronary atherosclerosis (vessel) ∗∗∗; Coronary artery atherosclerosis ∗∗∗; Angina pectoris ∗∗∗; Percutaneous transluminal coronary angioplasty (V45.82) ∗∗∗; Cardiac dysrhythmias ∗∗∗
Medication
  Subgroup I: Metformin ∗∗∗; Calcium ∗∗∗; Vitamin D ∗∗∗; Cholesterol ∗∗∗; Atorvastatin ∗∗∗
  Subgroup II: Paracetamol ∗∗∗; Oxycodone ∗∗∗; Morphine ∗∗∗; Vancomycin ∗∗∗; Furosemide ∗∗∗
  Subgroup III: Acetylsalicylic acid ∗∗∗; Clopidogrel ∗∗∗; Bivalirudin ∗∗∗; Lisinopril ∗; Amlodipine ∗∗∗
Lab test
  Subgroup I: Glucose ∗∗∗; Creatinine ∗∗∗; Cholesterol ∗∗∗; Triglyceride ∗∗∗; Microalbumin panel ∗∗∗
  Subgroup II: Creatinine ∗∗∗; Chloride ∗∗∗; Urea nitrogen ∗∗∗; Albumin ∗∗∗; Alkaline phosphatase ∗∗∗
  Subgroup III: Hematocrit ∗∗∗; Mean corpuscular hemoglobin concentration ∗∗∗; Chloride ∗∗∗; Troponin I cardiac ∗∗∗; Cholesterol ∗∗∗
CPT-4
  Subgroup I: Calcium ∗∗∗ (I vs II); Hemoglobin A1C ∗∗∗ (I vs II); Glucose (reagent strip) ∗∗∗; Lipid panel ∗∗∗; Lipoprotein, direct measurement ∗∗∗
  Subgroup II: Potassium ∗∗∗; Urea nitrogen ∗∗∗; Urinalysis ∗∗∗; Hepatic function panel ∗∗∗; Duplex scan of extremity veins ∗∗∗
  Subgroup III: ECG; interpretation, report ∗∗; Troponin, quantitative ∗∗∗; C-reactive protein, high sensitivity ∗∗∗; Echocardiography, transthoracic ∗∗∗; Radiologic examination, chest ∗∗∗

Mean (standard deviation); from ICD-9 on in-group and (total) percentages; a Multiple pairwise chi-squared test; b Multiple pairwise t-test; ∗ p < […], ∗∗ p < […], ∗∗∗ p < […]; ECG = Electrocardiogram

Supplementary Table 4: Most frequent terms for the three subgroups in the type 2 diabetes cohort. We report top five diagnoses (ICD-9), medications, laboratory tests, and CPT-4 procedures. Each clinical term is followed by in-group and total frequencies. Corrected p-values are reported for significant comparisons between groups.
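The in-group and total frequencies and the corrected pairwise chi-squared tests reported in these tables can be illustrated with a minimal sketch. It rests on several assumptions not stated in this text: that the in-group frequency is the share of a subgroup's patients with the term, that the total frequency is the share of all cohort patients with the term who fall in that subgroup, and that the correction is Bonferroni (the source only says "corrected"). The function names and the patient counts are hypothetical and chosen only for illustration.

```python
import math
from itertools import combinations


def term_frequencies(term_counts, group_sizes):
    """Return {subgroup: (in_group_pct, total_pct)} for one clinical term.

    in_group_pct: % of the subgroup's patients who have the term (assumed definition).
    total_pct:    % of all patients with the term who belong to the subgroup (assumed).
    """
    total_with_term = sum(term_counts.values())
    result = {}
    for group, n_with in term_counts.items():
        in_group = 100.0 * n_with / group_sizes[group]
        total = 100.0 * n_with / total_with_term if total_with_term else 0.0
        result[group] = (in_group, total)
    return result


def chi2_2x2_pvalue(a, b, c, d):
    """Pearson chi-squared p-value (1 degree of freedom, no continuity
    correction) for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 1.0
    chi2 = n * (a * d - b * c) ** 2 / denom
    # For 1 degree of freedom the chi-squared survival function is erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi2 / 2.0))


def pairwise_corrected_pvalues(term_counts, group_sizes):
    """Pairwise chi-squared tests between subgroups, Bonferroni-corrected
    (the correction method is an assumption)."""
    pairs = list(combinations(sorted(term_counts), 2))
    corrected = {}
    for g1, g2 in pairs:
        a, c = term_counts[g1], term_counts[g2]
        b, d = group_sizes[g1] - a, group_sizes[g2] - c
        corrected[(g1, g2)] = min(1.0, chi2_2x2_pvalue(a, b, c, d) * len(pairs))
    return corrected


# Hypothetical counts, not taken from the study.
group_sizes = {"I": 1000, "II": 2000}
term_counts = {"I": 210, "II": 165}
frequencies = term_frequencies(term_counts, group_sizes)  # subgroup I: 21% in-group, 56% total
p_values = pairwise_corrected_pvalues(term_counts, group_sizes)
```

Under these definitions, the total frequencies of one term sum to 100% across subgroups, while the in-group frequencies do not.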
Parkinson's disease
Subgroup I (N = 1,[…]); Subgroup II (N = […]). Female/Male: […] a; […] a. Age: […].76 (13.[…]) ∗ b; […].17 (14.[…]) ∗ b.

ICD-9
  Subgroup I: Essential tremor - 21% (56%) ∗∗∗; Anxiety state - 20% (45%); Depressive disorder - 14% (40%) ∗; Abnormality of gait - 14% (32%) ∗∗∗; Dystonia - 11% (57%) ∗∗∗
  Subgroup II: Constipation - 29% (66%) ∗∗∗; Other malaise and fatigue - 25% (72%) ∗∗∗; Coronary atherosclerosis - 17% (94%) ∗∗∗; Dysphagia - 14% (77%) ∗∗∗; Abdominal pain - 14% (90%) ∗∗∗
Medication
  Subgroup I: Carbidopa/Levodopa combination - 51% (51%) ∗; Amantadine - 16% (55%) ∗∗∗; Pramipexole - 15% (59%) ∗∗∗; Rasagiline - 14% (60%) ∗∗∗; Selegiline - 12% (57%) ∗∗∗
  Subgroup II: Levodopa - 45% (57%); Carbidopa - 45% (58%) ∗∗∗; Acetylsalicylic acid - 22% (87%) ∗∗∗; Docusate sodium - 19% (85%) ∗∗∗; Vitamin D - 16% (72%) ∗∗∗
Lab test
  Subgroup I: Mean corpuscular hemoglobin - 3% (4%) ∗∗∗; Leukocytes - 3% (4%) ∗∗∗; Mean platelet volume - 3% (4%) ∗∗∗; Width - 3% (4%) ∗∗∗; Erythrocytes - 3% (4%) ∗∗∗
  Subgroup II: Glucose - 60% (97%) ∗∗∗; Urea nitrogen - 60% (97%) ∗∗∗; Creatinine - 59% (97%) ∗∗∗; Potassium - 59% (97%) ∗∗∗; Sodium - 59% (97%) ∗∗∗
CPT-4
  Subgroup I: Unlisted psychiatric service or procedure - 25% (47%); MRI (brain, brain stem) - 13% (36%) ∗∗∗; Surgery - 11% (24%) ∗∗∗; CT head/brain - 3% (8%) ∗∗∗; Neuropsychological testing - 2% (36%)
  Subgroup II: ECG; interpretation, report - 51% (95%) ∗∗∗; Urea nitrogen - 48% (96%) ∗∗∗; Creatinine - 45% (96%) ∗∗∗; Metabolic panel - 35% (98%) ∗∗∗; Echocardiography, transthoracic - 11% (97%) ∗∗∗

Mean (standard deviation); from ICD-9 on in-group and (total) percentages; a Multiple pairwise chi-squared test; b Multiple pairwise t-test; ∗ p < […], ∗∗ p < […], ∗∗∗ p < […]

Supplementary Table 5: Most frequent terms for the two subgroups in the Parkinson's disease cohort.

Alzheimer's disease
[Subgroup sizes, Female/Male counts (∗∗∗ a), ages (∗∗ b), ICD-9 codes, and in-group and total percentages are not recoverable; terms are listed per subgroup in rank order with their significance markers.]

ICD-9
  Subgroup I: Routine gynecological examination (V72.31) ∗∗∗; Counseling (V65.40) ∗∗; Osteoporosis ∗∗∗ (I vs II); Family history of osteoporosis (V17.81) ∗∗; Malignant neoplasm of uterus ∗∗∗
  Subgroup II: Dementia w/o behavioral disturbance ∗∗∗; Altered mental status ∗∗∗; Persistent mental disorders ∗∗∗; Dysphagia ∗∗∗; Intracranial hemorrhage ∗∗∗
  Subgroup III: Constipation ∗∗∗; Anxiety state ∗; Depressive disorder; Dementia, unsp., w/o behavioral disturbance ∗∗∗ (III vs II); Dementia with behavioral disturbance ∗∗∗
Medication
  Subgroup I: Calcium ∗ (I vs III); Estradiol ∗∗∗; Iron ∗ (I vs III); Norethisterone ∗∗∗; Gardasil ∗∗∗
  Subgroup II: Acetylsalicylic acid ∗∗∗; Donepezil ∗∗ (II vs I); Levofloxacin ∗∗∗; Vancomycin ∗∗; Haloperidol ∗∗∗
  Subgroup III: Donepezil ∗∗∗ (III vs I); Memantine ∗∗; Docusate sodium ∗∗∗; Trazodone ∗∗∗ (III vs I); Zolpidem ∗∗∗ (III vs I)
Lab test
  Subgroup I: Chlamydia/Gonorrhoeae amplified DNA ∗∗∗; Syphilis (rapid plasma reagin) ∗∗∗; HIV ∗∗∗ (I vs II); Hepatitis C virus ab; Hepatitis B surface antigen
  Subgroup II: Mean corpuscular volume ∗∗∗; Creatinine ∗∗∗; Erythrocytes ∗∗∗; Mean corpuscular hemoglobin concentration ∗∗∗; Glucose ∗∗∗
  Subgroup III: Leukocytes ∗∗∗; Glucose ∗∗∗; Erythrocytes ∗∗∗; Hematocrit ∗∗∗; Mean corpuscular hemoglobin concentration ∗∗∗
CPT-4
  Subgroup I: Psychiatric service/procedure ∗∗∗ (I vs II); Cytopathology, slides, cervical/vaginal ∗∗∗; MRI brain ∗∗ (I vs III); CT procedure ∗∗∗; Brain imaging, PET ∗∗∗
  Subgroup II: ECG ∗∗∗; Partial Thromboplastin Time Test ∗∗∗; Creatinine ∗∗∗; Prothrombin time ∗∗∗; Head CT ∗∗∗
  Subgroup III: TSH ∗∗; Urea nitrogen ∗∗∗; ECG ∗∗∗; Psychiatric service/procedure ∗∗∗ (III vs II); Head/brain CT ∗∗∗

Mean (standard deviation); from ICD-9 on in-group and (total) percentages; a Multiple pairwise chi-squared test; b Multiple pairwise t-test; ∗ p < […], ∗∗ p < […], ∗∗∗ p < […]; ECG = Electrocardiogram; ab = antibodies; TSH = Thyroid-stimulating hormone; PET = Positron emission tomography; CT = Computed tomography; MRI = Magnetic resonance imaging

Supplementary Table 6: Most frequent terms for the three subgroups in the Alzheimer's disease cohort.

Multiple myeloma
[Subgroup sizes, Female/Male counts (a; ∗∗ for subgroups IV and V), ages (b; ∗∗ for subgroup I), ICD-9 codes, and in-group and total percentages are not recoverable; terms are listed per subgroup in rank order with their significance markers.]

ICD-9
  Subgroup I: Edema ∗; Anemia ∗∗∗; Shortness of breath ∗; Pleura effusion ∗∗∗; Fever ∗∗∗
  Subgroup II: Disease of salivary glands ∗∗∗; Disorders of bone and cartilage ∗∗∗; Other malaise and fatigue ∗∗∗; Osteoporosis ∗ (II vs III); Fracture (E887) ∗∗∗ (II vs I/III)
  Subgroup III: Diarrhea ∗∗∗; Nausea ∗; Antineoplastic chemotherapy (V58.11); Neutropenia ∗∗∗; Organ/tissue transplant (V42.9) ∗∗∗
  Subgroup IV: Hyperlipidemia ∗∗∗ (IV vs II/V); Dysuria ∗ (IV vs II/V); Malignant neoplasm of colon ∗; Nephritis and nephropathy ∗ (IV vs II/V); Chronic kidney disease ∗ (IV vs II)
  Subgroup V: Oth inflammatory/toxic neuropathy ∗∗; Unsp inflammatory/toxic neuropathy ∗∗; Counseling (V65.40) ∗; Organ/tissue transplant (V42.9) ∗∗∗ (V vs II/III/IV); Antineoplastic chemotherapy (V58.11)
Medication
  Subgroup I: Paracetamol ∗∗; Sodium chloride ∗∗; Oxycodone ∗∗∗; Fentanyl ∗∗∗; Heparin ∗∗∗
  Subgroup II: Vitamin D ∗∗; Oxycodone ∗∗∗; Fentanyl ∗∗∗; Ergocalciferol ∗; Acetylsalicylic acid […] mg ∗
  Subgroup III: Calcium ∗∗∗; Dexamethasone ∗∗; Ondansetron ∗∗∗; Bortezomib ∗∗∗; Aciclovir ∗∗∗
  Subgroup IV: Calcium ∗∗∗; Vitamin D ∗∗; Cholecalciferol ∗; Ergocalciferol ∗; Atorvastatin ∗
  Subgroup V: Calcium ∗∗∗; Dexamethasone ∗∗; Bortezomib ∗∗∗; Iron ∗∗∗; Acetylsalicylic acid […] mg ∗
Lab test
  Subgroup I: Erythrocytes ∗∗∗; Glucose ∗∗∗; Urea nitrogen ∗∗∗; Mean corpuscular hemoglobin ∗∗; Leukocytes ∗∗∗ (I vs II/III/IV)
  Subgroup II: Hemoglobin ∗; Lymphocytes ∗∗∗; Leukocytes ∗∗; Mean platelet volume ∗∗; Mean corpuscular hemoglobin concentration ∗
  Subgroup III: Chloride ∗; Glucose ∗∗∗; Potassium ∗∗∗; Mean corpuscular hemoglobin ∗∗; Width ∗∗
  Subgroup IV: Protein ∗∗∗; Glucose ∗∗∗; Erythrocytes ∗∗∗; Creatinine ∗∗∗; Urea nitrogen ∗∗∗
  Subgroup V: Hematocrit ∗; Platelets ∗; Erythrocytes ∗∗∗; Lymphocytes ∗∗∗; Eosinophils ∗∗∗
CPT-4
  Subgroup I: Blood count ∗∗∗; Calcium ∗∗∗; ECG; interpretation, report ∗∗∗; Potassium ∗; PTT ∗∗
  Subgroup II: Diagnostic/interventional CT ∗; PET limited area (Head/neck) ∗; PET-CT (skull base to mid-thigh) ∗; Tumor imaging PET-CT ∗∗∗; CT thorax (no contrast) ∗∗∗
  Subgroup III: Calcium ∗∗∗; Blood count ∗∗; Albumin ∗∗∗; Lactate dehydrogenase ∗∗∗ (III vs I/II/IV); Bone marrow; biopsy ∗∗∗
  Subgroup IV: Calcium ∗∗∗; ECG; interpretation, report ∗∗∗; Urea nitrogen ∗∗∗; Cholesterol ∗ (IV vs I/II/V); Urinalysis ∗∗∗ (IV vs III/V)
  Subgroup V: Gammaglobulin ∗∗; Albumin ∗∗∗; Calcium, ionized ∗∗∗; Lactate dehydrogenase ∗; Beta-2-microglobulin ∗∗∗

Mean (standard deviation); from ICD-9 on in-group and (total) percentages; a Multiple pairwise chi-squared test; b Multiple pairwise t-test; ∗ p < […]; ∗∗ p < […]; ∗∗∗ p < […]; ECG = Electrocardiogram; CT = Computed tomography; PET = Positron emission tomography; PTT = Partial thromboplastin time

Supplementary Table 7: Most frequent terms for the five subgroups in the multiple myeloma cohort.

Malignant neoplasm of prostate
Subgroup I (N = 6,[…]); Subgroup II (N = […]). Age: […].64 (12.[…]) a; […].78 (10.[…]) a.

ICD-9
  Subgroup I: Hyperlipidemia - 28% (95%) ∗∗∗; Edema - 24% (94%) ∗∗∗; Personal history of PC (V10.46) - 20% (97%) ∗∗∗; Hypertrophy (benign) of prostate - 14% (85%) ∗∗∗; Hematuria - 14% (86%) ∗∗∗
  Subgroup II: Nocturia - 29% (33%) ∗∗; Elevated PSA - 18% (27%) ∗∗∗; Impotence of organic origin - 18% (35%) ∗∗∗; Urinary frequency - 15% (27%) ∗∗∗; Urinary hesitancy - 11% (33%) ∗∗∗
Medication
  Subgroup I: Paracetamol - 44% (98%) ∗∗∗; Oxycodone - 40% (98%) ∗∗∗; Ondansetron - 33% (97%) ∗∗∗; Propofol - 31% (94%) ∗∗∗; Morphine - 30% (99%) ∗∗∗
  Subgroup II: Midazolam - 17% (12%) ∗∗∗; Tadalafil - 14% (35%) ∗∗∗; Sildenafil - 12% (33%) ∗∗∗; Tamsulosin - 10% (12%) ∗∗∗; Testosterone - 8% (28%)
Lab test
  Subgroup I: Glucose - 66% (96%) ∗; Leukocytes - 63% (98%) ∗; Creatinine - 63% (99%) ∗; Urea nitrogen - 63% (99%) ∗; Potassium - 62% (99%) ∗
  Subgroup II: PSA post-prostatectomy - 17% (25%) ∗∗∗; PSA free - 10% (27%) ∗∗∗; Nitrite - 8% (6%) ∗∗∗; Leukocyte esterase - 6% (5%) ∗∗∗; Urine specific gravity - 6% (5%) ∗∗∗
CPT-4
  Subgroup I: Calcium - 53% (98%) ∗∗∗; Anastomosis - 20% (98%) ∗∗∗; Ultrasound, transrectal - 7% (65%) ∗∗∗; Pelvic lymphadenectomy - 6% (100%) ∗∗∗; Cystoplasty/cystourethroplasty - 6% (100%) ∗∗∗
  Subgroup II: Testosterone total - 29% (32%) ∗∗∗; Surgery - 22% (14%) ∗∗∗; Ultrasound post-voiding residual urine/bladder capacity - 18% (29%) ∗∗∗; Urinalysis - 12% (44%) ∗∗∗; Biopsy, prostate - 7% (31%) ∗∗∗

Mean (standard deviation); from ICD-9 on in-group and (total) percentages; a Multiple pairwise t-test; ∗ p < […], ∗∗ p < […], ∗∗∗ p < […]

Supplementary Table 8: Most frequent terms for the two subgroups in the prostate cancer cohort.

Malignant neoplasm of breast (female)
Subgroup I (N = 5,[…]); Subgroup II (N = […]). Age: […].67 (14.[…]) ∗ a; […].86 (13.[…]) ∗ a.

ICD-9
  Subgroup I: Constipation - 25% (93%) ∗; Secondary malignant neoplasm - 13% (93%) ∗∗∗; Acquired absence of breast/nipple (V45.71) - 12% (92%) ∗∗∗; Antineoplastic chemotherapy (V58.11) - 7% (98%) ∗∗∗; Mammogram for high-risk patient (V76.11) - 6% (63%) ∗∗∗
  Subgroup II: Lump or mass in breast - 27% (29%) ∗; Abnormal mammogram - 23% (37%) ∗; Carcinoma in situ of breast - 15% (27%) ns; Family history of malignant neoplasm of breast (V16.3) - 6% (28%); Abnormal findings on radiological examination of breast - 4% (36%) ∗∗∗
Medication
  Subgroup I: Paracetamol - 50% (92%) ∗∗∗; Ondansetron - 46% (87%) ∗∗∗; Fentanyl - 45% (84%) ∗∗∗; Oxycodone - 43% (91%) ∗∗∗; Propofol - 40% (81%) ∗∗∗
  Subgroup II: Propofol - 27% (19%) ∗∗∗; Fentanyl - 26% (16%) ∗∗∗; Lidocaine - 25% (21%) ∗∗∗; Midazolam - 22% (18%) ∗∗∗; Ondansetron - 21% (13%) ∗∗∗
Lab test
  Subgroup I: Glucose - 67% (97%) ∗∗∗; Leukocytes - 67% (97%) ∗∗∗; Erythrocytes - 66% (97%) ∗∗∗; Hemoglobin - 65% (97%) ∗∗∗; Hematocrit - 65% (97%) ∗∗∗
  Subgroup II: Leukocytes - 7% (3%) ∗∗∗; Glucose - 6% (3%) ∗∗∗; Platelets - 6% (3%) ∗∗∗; Erythrocytes - 6% (3%) ∗∗∗; Mean corpuscular hemoglobin - 6% (3%) ∗∗∗
CPT-4
  Subgroup I: Surgery - 44% (81%) ∗∗∗; Mastectomy, partial - 33% (78%) ∗; Ultrasound - 30% (73%) ∗; Unlisted chemotherapy - 23% (85%) ∗∗∗; Oncoprotein - 17% (85%) ∗∗∗
  Subgroup II: Mammography - 35% (32%) ∗∗∗; Ultrasound - 32% (27%) ∗; Surgery - 30% (19%) ∗∗∗; Mastectomy, partial - 28% (22%) ∗∗∗; Mammography, bilateral - 26% (39%) ∗∗∗

Mean (standard deviation); from ICD-9 on in-group and (total) percentages; a Multiple pairwise t-test; ∗ p < […], ∗∗ p < […], ∗∗∗ p < […]

Supplementary Table 9: Most frequent terms for the two subgroups in the breast cancer cohort.

Type 2 diabetes (Split 2)
[Subgroup sizes, Female/Male counts (∗∗∗ a), ages (∗ b), ICD-9 codes, and in-group and total percentages are not recoverable; terms are listed per subgroup in rank order with their significance markers.]

ICD-9
  Subgroup I: Hypertension ∗∗∗; Hyperlipidemia ∗∗∗; Chest pain; Obesity ∗∗∗; Hypercholesterolemia ∗∗∗
  Subgroup II: Edema ∗∗∗; Acute kidney failure ∗∗∗; Chronic kidney disease ∗∗∗; Pain in limb ∗∗∗; Nephritis and nephropathy ∗∗∗
  Subgroup III: Coronary artery atherosclerosis ∗∗∗; Coronary atherosclerosis (vessel) ∗∗∗; Angina pectoris ∗∗∗; Abnormal result cardiovascular system function ∗∗∗; Percutaneous transluminal coronary angioplasty (V45.82) ∗∗∗
Medication
  Subgroup I: Metformin ∗∗∗; Acetylsalicylic acid ∗∗∗ (I vs II); Calcium ∗∗∗; Paracetamol ∗∗∗; Cholesterol ∗∗∗
  Subgroup II: Paracetamol ∗∗∗; Glucagon ∗∗∗; Insulin lispro ∗∗∗; Vancomycin ∗∗∗; Furosemide ∗∗∗
  Subgroup III: Acetylsalicylic acid ∗∗∗; Clopidogrel ∗∗∗; Intracoronary nitroglycerin ∗∗∗; Bivalirudin ∗∗∗; Clopidogrel ∗∗∗
Lab test
  Subgroup I: Glucose ∗∗∗; Creatinine ∗∗∗; Leukocytes ∗∗∗; Triglyceride ∗∗∗; Cholesterol ratio ∗
  Subgroup II: Urea nitrogen ∗∗∗; Creatinine ∗∗∗; Bilirubin ∗∗∗; ALT test ∗∗∗; Alkaline phosphatase ∗∗∗
  Subgroup III: Hematocrit ∗∗∗ (III vs I); Mean corpuscular hemoglobin concentration ∗∗∗ (III vs I); Chloride ∗∗∗; Cholesterol ∗∗∗; Troponin I cardiac ∗∗∗
CPT-4
  Subgroup I: Calcium ∗∗∗ (I vs II); Hemoglobin A1C ∗∗∗; Glucose ∗∗∗; Lipid panel ∗∗∗; Lipoprotein, direct measurement ∗∗∗
  Subgroup II: Potassium ∗∗∗; Urea nitrogen ∗∗∗; Creatinine ∗∗∗; Urinalysis ∗∗∗; Hepatic function panel ∗∗∗
  Subgroup III: ECG; interpretation, report ∗∗∗; Lipid panel ∗∗∗; Potassium ∗∗∗; Troponin, quantitative ∗; C-reactive protein, high sensitivity ∗∗∗

Mean (standard deviation); from ICD-9 on in-group and (total) percentages; a Multiple pairwise chi-squared test; b Multiple pairwise t-test; ∗ p < […], ∗∗ p < […], ∗∗∗ p < […]; ECG = Electrocardiogram; ALT = Alanine aminotransferase test

Supplementary Table 10: Most frequent terms for the three subgroups in the type 2 diabetes second split replication cohort.

Parkinson's disease (Split 2)
Subgroup I (N = 1,[…]); Subgroup II (N = […]). Female/Male: […] a; […] a. Age: […].39 (12.[…]) ∗ b; […].65 (15.[…]) ∗ b.

ICD-9
  Subgroup I: Anxiety state - 24% (65%) ∗∗; Constipation - 23% (57%) ∗; Essential tremor - 22% (79%) ∗∗; Abnormality of gait - 15% (48%) ∗∗∗; Depressive disorder - 14% (53%) ∗∗∗
  Subgroup II: Other malaise and fatigue - 26% (53%) ∗∗; Chest pain - 22% (68%) ∗∗∗; Coronary atherosclerosis - 21% (79%) ∗∗∗; Atrial fibrillation - 17% (85%) ∗∗∗; Pleural effusion - 17% (95%) ∗∗∗
Medication
  Subgroup I: Carbidopa/Levodopa combination - 49% (68%) ∗∗∗; Amantadine - 17% (74%) ∗∗∗; Pramipexole - 15% (75%) ∗∗∗; Rasagiline - 14% (78%) ∗∗∗; Selegiline - 12% (78%) ∗∗∗
  Subgroup II: Carbidopa - 47% (43%) ∗; Levodopa - 46% (42%) ∗; Acetylsalicylic acid - 26% (72%) ∗∗∗; Heparin - 23% (94%) ∗∗∗; Metoprolol - 18% (81%) ∗∗∗
Lab test
  Subgroup I: Glucose - 9% (15%) ∗∗∗; Leukocytes - 9% (15%) ∗∗∗; Creatinine - 9% (15%) ∗∗∗; Erythrocytes - 9% (15%) ∗∗∗; Urea nitrogen - 8% (15%) ∗∗∗
  Subgroup II: Erythrocytes - 77% (85%) ∗∗∗; Mean corpuscular hemoglobin - 75% (86%) ∗∗∗; Glucose - 75% (85%) ∗∗∗; Width - 75% (86%) ∗∗∗; Leukocytes - 75% (85%) ∗∗∗
CPT-4
  Subgroup I: Unlisted psychiatric service or procedure - 29% (70%) ∗∗∗; Surgery - 17% (48%) ∗∗∗; MRI (brain, brain stem) - 16% (58%); CT head/brain - 5% (21%) ∗∗∗; Implanted neurostimulator - 4% (68%)
  Subgroup II: Urea nitrogen - 60% (85%) ∗∗∗; ECG; interpretation, report - 59% (82%) ∗∗∗; Urinalysis - 42% (87%) ∗∗∗; Radiologic examination, chest - 38% (86%) ∗∗∗; Troponin, quantitative - 30% (85%) ∗∗∗

Mean (standard deviation); from ICD-9 on in-group and (total) percentages; a Multiple pairwise chi-squared test; b Multiple pairwise t-test; ∗ p < […], ∗∗ p < […], ∗∗∗ p < […]

Supplementary Table 11: Most frequent terms for the two subgroups in the Parkinson's disease second split replication cohort.

Alzheimer's disease (Split 2)
[Subgroup sizes, Female/Male counts (a; ∗∗∗ for subgroup III), ages (∗∗ b), ICD-9 codes, and in-group and total percentages are not recoverable; terms are listed per subgroup in rank order with their significance markers.]

ICD-9
  Subgroup I: Constipation; Anxiety state ∗∗∗ (I vs II); Memory loss ∗∗∗; Depressive disorder ∗∗∗; Insomnia ∗∗∗ (I vs III)
  Subgroup II: Dementia w/o behavioral disturbance ∗∗∗; Altered mental status ∗∗∗; Persistent mental disorders ∗∗∗; Congestive heart failure ∗∗∗; Dementia with behavioral disturbance ∗∗∗
  Subgroup III: Routine gynecological examination (V72.31) ∗∗∗; Counseling (V65.40) ∗∗∗; Osteoporosis ∗∗∗; Family history of osteoporosis (V17.81) ∗∗∗; Malignant neoplasm of uterus ∗∗∗
Medication
  Subgroup I: Ergocalciferol ∗∗∗; Donepezil ∗; Memantine ∗∗∗; Vitamin B-[…] ∗∗∗; Docusate sodium ∗∗∗ (I vs III)
  Subgroup II: Acetylsalicylic acid ∗∗∗; Donepezil ∗; Levofloxacin ∗∗∗; Metoprolol ∗∗∗; Haloperidol ∗∗∗
  Subgroup III: Ethinyl estradiol ∗∗∗; Iron ∗; Gardasil ∗∗∗; Norethisterone ∗∗∗; Norgestimate ∗
Lab test
  Subgroup I: Glucose ∗∗∗; Erythrocytes ∗∗∗; Width ∗∗∗; Platelets ∗∗∗; Hemoglobin ∗∗∗
  Subgroup II: Erythrocytes ∗∗∗; Hematocrit ∗∗∗; Mean platelet volume ∗∗∗; Urea nitrogen ∗∗∗; Mean corpuscular hemoglobin concentration ∗∗∗
  Subgroup III: Chlamydia/Gonorrhoeae amplified DNA ∗∗∗; HIV; Syphilis (rapid plasma reagin) ∗∗∗ (III vs II); Hepatitis B surface antigen; Hepatitis C virus ab
CPT-4
  Subgroup I: TSH ∗∗; ECG; interpretation, report ∗∗∗; Urea nitrogen ∗∗∗; Creatinine ∗∗∗; Psychiatric service/procedure ∗∗∗
  Subgroup II: Partial Thromboplastin Time Test ∗∗∗; Prothrombin time ∗∗∗; X-ray chest ∗∗∗; Head/brain CT ∗∗∗; Troponin, quantitative
  Subgroup III: Psychiatric service/procedure ∗∗∗; Calcium, ionized ∗∗∗; Cytopathology, slides, cervical/vaginal ∗∗∗; TSH ∗∗∗; Estradiol ∗∗∗

Mean (standard deviation); from ICD-9 on in-group and (total) percentages; a Multiple pairwise chi-squared test; b Multiple pairwise t-test; ∗ p < […], ∗∗ p < […], ∗∗∗ p < […]; ECG = Electrocardiogram; ab = antibodies; TSH = Thyroid-stimulating hormone; CT = Computed tomography

Supplementary Table 12: Most frequent terms for the three subgroups in the Alzheimer's disease second split replication cohort.

Multiple myeloma (Split 2)
Subgroup I  Subgroup II  Subgroup III

Sub g r o up I V ( N = )( N = )( N = )( N = ) F e m a l e / M a l e a a ∗∗ a ∗∗ a A g e . ( . ) ∗∗ b ( I v s III / I V ) . ( . ) ∗∗ b ( II v s III / I V ) . ( . ) b . ( . ) b I C D - O t h e r m a l a i s e a nd f a t i g u e ( ) - % ( % ) ∗∗∗ O t h i nﬂ a mm a t o r y / t o x i c n e u r o p a t h y ( ) - % ( % ) ∗∗ P l e u r a e ﬀ u s i o n ( ) - % ( % ) ∗∗∗ H y p e r li p i d e m i a ( ) - % ( % ) ∗∗ D i s e a s e o f s a li v a r y g l a nd s ( ) - % ( % ) ∗∗∗ U n s p i nﬂ a mm a t o r y / t o x i c n e u r o p a t h y ( ) - % ( % ) ∗∗∗ A c u t e k i dn e y f a il u r e ( ) - % ( % ) ∗∗∗ N e ph r i t i s a ndn e ph r o p a t h y ( ) - % ( % ) ∗∗∗ ( I V v s I / II ) C o n s t i p a t i o n ( ) - % ( % ) ∗∗ D i s o r d e r s o f b o n e a nd c a r t il ag e ( ) - % ( % ) ∗∗∗ ( II v s I / I V ) O r ga n / t i ss u e t r a n s p l a n t( V42.9 ) - % ( % ) ∗∗∗ ( III v s I / I V ) D y s u r i a ( ) - % ( % ) F e v e r ( ) - % ( % ) ∗∗ D i s e a s e o f s a li v a r y g l a nd s ( ) - % ( % ) ∗∗∗ R e n a l f a il u r e ( ) - % ( % ) ∗∗∗ M o n o c l o n a l p a r a p r o t e i n e m i a ( ) - % ( % ) ∗∗ ( I V v s I / III ) C o un s e li n g ( V65.40 ) - % ( % ) ∗∗ O r ga n / t i ss u e t r a n s p l a n t( V42.9 ) - % ( % ) ∗∗∗ ( II v s I / I V ) A n t i n e o p l a s t i cc h e m o t h e r a p y ( V58.11 ) - % ( % ) ∗∗∗ ( III v s I / I V ) C h r o n i c k i dn e y d i s e a s e ( ) - % ( % ) ∗ M e d i c a t i o n O xy c o d o n e - % ( % ) ∗∗ C a l c i u m - % ( % ) ∗∗∗ O xy c o d o n e - % ( % ) ∗∗ E r go c a l c i f e r o l - % ( % ) ∗ L i d o c a i n e - % ( % ) ∗∗∗ D e x a m e t h a s o n e - % ( % ) ∗∗∗ O nd a n s e t r o n - % ( % ) ∗∗ C h o l ec a l c i f e r o l - % ( % ) ∗∗∗ A ce t y l s a li c y li c a c i d m g - % ( % ) ∗∗ B o r t ez o m i b - % ( % ) ∗∗∗ D i ph e nh y d r a m i n e - % ( % ) ∗ A t o r v a s t a t i n - % ( % ) ∗∗∗ D e x a m e t h a s o n e - % ( % ) ∗∗∗ A ce t y l s a li c y li c a c i d m g - % ( % ) ∗∗ D e x a m e t h a s o n e - % ( % 
) ∗∗∗ ( III v s I / I V ) F u r o s e m i d e - % ( % ) ∗∗∗ ( I V v s I / III ) P a r a ce t a m o l - % ( % ) ∗∗∗ L e n a li d o m i d e - % ( % ) ∗∗∗ L o r a ze p a m - % ( % ) ∗∗∗ L o s a r t a n - % ( % ) ∗∗∗ L a b t e s t L e u k o c y t e s - % ( % ) ∗∗∗ W i d t h - % ( % ) ∗∗∗ M e a n c o r pu s c u l a r v o l u m e - % ( % ) ∗∗∗ G l u c o s e - % ( % ) ∗∗∗ E r y t h r o c y t e s - % ( % ) ∗∗∗ M e a np l a t e l e t v o l u m e - % ( % ) ∗∗∗ C h l o r i d e - % ( % ) ∗∗∗ L e u k o c y t e s - % ( % ) ∗∗∗ H e m a t o c r i t - % ( % ) ∗ M e a n c o r pu s c u l a r h e m og l o b i n - % ( % ) ∗∗∗ U r e a n i t r og e n - % ( % ) ∗∗∗ C r e a t i n i n e - % ( % ) ∗ M e a n c o r pu s c u l a r v o l u m e - % ( % ) ∗∗∗ H e m og l o b i n - % ( % ) ∗∗∗ L e u k o c y t e s - % ( % ) ∗∗∗ P r o t e i n - % ( % ) ∗∗∗ P l a t e l e t s - % ( % ) ∗∗∗ P r o t e i n - % ( % ) ∗∗∗ C r e a t i n i n e - % ( % ) ∗∗∗ ( III v s I / I V ) U r e a n i t r og e n - % ( % ) ∗∗∗ C P T - D i ag n o s t i c / i n t e r v e n t i o n a l CT - % ( % ) ∗∗ B e t a - m i c r og l o bu li n - % ( % ) ∗∗∗ E C G ;i n t e r p r e t a t i o n , r e p o r t - % ( % ) ∗∗ E C G ;i n t e r p r e t a t i o n , r e p o r t - % ( % ) ∗∗ PE T li m i t e d a r e a ( H e a d / n ec k ) - % ( % ) ∗∗∗ B o n e m a rr o w , b i o p s y - % ( % ) ∗∗∗ P TT - % ( % ) ∗∗∗ V i t a m i n D - % ( % ) ∗∗∗ Su r g e r y - % ( % ) ∗∗∗ N e ph e l o m e t r y - % ( % ) ∗∗∗ X - r a y , c h e s t - % ( % ) ∗∗∗ T r i g l y ce r i d e s - % ( % ) ∗∗∗ P s y c h i a t r i c s e r v i ce / p r o ce du r e - % ( % ) ∗ I mm un o ﬁ x a t i o n - % ( % ) ∗∗∗ U r i n a l y s i s - % ( % ) ∗∗∗ L i p i dp a n e l - % ( % ) ∗ PE T - CT ( s k u ll b a s e t o m i d - t h i g h ) - % ( % ) ∗∗ C h e m o t h e r a p y p r o ce du r e - % ( % ) ∗∗∗ ( II v s I / I V ) P h o s ph o r u s - % ( % ) ∗∗∗ C h o l e s t e r o l - % ( % ) ∗∗∗ M e a n ( s t a nd a r dd e v i a t i o n ) ; f r o m I C D - n i n - g r o up a nd (t o t a l ) p e 
r ce n t ag e s ; a M u l t i p l e p a i r w i s ec h i - s q u a r e d t e s t ; b M u l t i p l e p a i r w i s e t - t e s t ; ∗ p < . ; ∗∗ p < . ; ∗∗∗ p < . ; E C G = E l ec t r o c a r d i og r a m ; CT = C o m pu t e d t o m og r a ph y ; PE T = P o s i t r o n e m i ss i o n t o m og r a ph y ; P TT = P a r t i a l t h r o m b o p l a s t i n t i m e Supp l e m e n t a r y T a b l e : M o s t f r e q u e n tt e r m s f o r t h e f o u r s ub g r o up s i n t h e M u l t i p l e M y e l o m a s ec o nd s p li t r e p li c a t i o n c o h o r t . alignant neoplasm of prostate (Split 2)Subgroup I Subgroup II Subgroup III (N=2 , , , .

71 (12 . a . 92 (11 . ∗∗ a . 83 (14 . a
ICD-9
Nocturia ( ) - 28% (50%) ∗∗∗ | Personal history of PC (V10.46) - 28% (77%) ∗∗∗ | Palpitations ( ) - 21% (51%) ∗∗∗
Elevated PSA ( ) - 20% (49%) ∗∗∗ | Hyperlipidemia ( ) - 25% (47%) ∗∗∗ | Asthma ( ) - 18% (51%) ∗∗∗
Urinary frequency ( ) - 17% (45%) ∗∗∗ (I vs II) | Edema ( ) - 23% (47%) ∗∗∗ | Vitamin D deficiency ( ) - 15% (72%) ∗∗∗
Impotence of organic origin ( ) - 16% (52%) ∗∗∗ | Cardiac dysrhythmias ( ) - 15% (69%) ∗∗∗ | Cyanosis ( ) - 14% (54%) ∗∗∗
Urge incontinence ( ) - 5% (52%) ∗∗∗ (I vs II) | Pleural effusion ( ) - 13% (87%) ∗∗∗ | Neoplasm of colon ( ) - 11% (52%) ∗∗∗
Medication
Midazolam - 15% (18%) ∗∗∗ | Paracetamol - 68% (81%) ∗∗∗ | Vitamin D3 - 17% (49%) ∗∗∗
Tadalafil - 12% (47%) ∗∗∗ | Oxycodone - 61% (82%) ∗∗∗ | Fluticasone - 17% (61%) ∗∗∗
Tamsulosin - 11% (23%) ∗∗ | Ondansetron - 50% (82%) ∗∗∗ | Atorvastatin - 17% (43%) ∗∗∗
Testosterone - 8% (45%) ∗∗∗ (I vs II) | Morphine - 50% (92%) ∗∗∗ | Aerosol - 15% (53%) ∗∗∗
Sildenafil - 10% (44%) ∗∗∗ (I vs II) | Lidocaine - 47% (77%) ∗∗∗ | Omeprazole - 10% (51%) ∗∗∗
Lab test
PSA total - 20% (33%) ∗∗∗ | Glucose - 84% (68%) ∗∗∗ | Glucose - 47% (21%) ∗∗∗
PSA post-prostatectomy - 15% (37%) ∗∗∗ (I vs III) | Leukocytes - 84% (72%) ∗∗∗ | Cholesterol - 35% (49%) ∗∗∗
Nitrite - 15% (18%) ∗∗∗ | Urea nitrogen - 84% (72%) ∗ | Hemoglobin A1C - 17% (52%) ∗∗∗
PSA free - 11% (47%) ∗∗∗ | Potassium - 84% (73%) ∗∗∗ (I vs III) | Hepatitis C virus ab - 11% (53%) ∗∗∗
Testosterone free - 6% (46%) ∗∗∗ (I vs II) | Creatinine - 83% (72%) ∗∗∗ | HIV 1 - 8% (55%) ∗∗∗
CPT-4
Surgery - 25% (25%) ∗∗∗ | Calcium - 71% (72%) ∗∗∗ | PSA total - 51% (23%) ∗∗∗
Ultrasound post-voiding residual urine/bladder capacity - 28% (48%) ∗∗∗ | ECG; interpretation, report - 43% (61%) ∗∗∗ (II vs I) | PSA free - 52% (44%) ∗∗∗
Ultrasound, transrectal - 16% (57%) ∗∗∗ | Anastomosis - 33% (92%) ∗∗∗ (II vs I) | ECG; interpretation, report - 41% (32%) ∗∗∗ (III vs I)
Urinalysis - 11% (60%) ∗∗∗ (I vs II) | Urine culture, bacterial - 20% (69%) ∗∗∗ | Surgery - 34% (26%) ∗∗∗ (III vs I)
MRI, pelvis - 9% (43%) ∗∗ | Troponin, quantitative - 19% (90%) ∗∗∗ | Spirometry - 14% (73%) ∗∗∗
Mean (standard deviation); from ICD-9 on in-group and (total) percentages; a Multiple pairwise t-test; ∗ p < . ∗∗ p < . ∗∗∗ p < .

Supplementary Table 14: Most frequent terms for the three subgroups in the prostate cancer second split replication cohort.

Malignant neoplasm of breast - female (Split 2)
Subgroup I | Subgroup II
(N=5 , , .

98 (14 . ∗ a . 94 (13 . ∗ a
ICD-9
Personal history of malignant neoplasm of breast (V10.3) - 54% (79%) ∗∗∗ | Lump or mass in breast ( ) - 26% (33%) ∗∗∗
Constipation ( ) - 24% (92%) ∗ | Abnormal mammogram ( ) - 22% (43%) ∗∗∗
Secondary malignant neoplasm ( ) - 14% (91%) ∗∗∗ | Other screening mammogram (V76.12) - 19% (44%) ∗∗∗
Acquired absence of breast/nipple (V45.71) - 12% (89%) ∗∗∗ | Carcinoma in situ of breast ( ) - 15% (32%) ∗
Antineoplastic chemotherapy (V58.11) - 7% (99%) ∗∗∗ | Diffuse cystic mastopathy ( ) - 10% (38%) ∗∗∗
Medication
Paracetamol - 50% (89%) ∗∗∗ | Propofol - 28% (23%) ∗∗∗
Fentanyl - 45% (80%) ∗∗∗ | Fentanyl - 28% (20%) ∗∗∗
Ondansetron - 44% (83%) ∗∗∗ | Midazolam - 24% (22%) ∗∗∗
Oxycodone - 42% (88%) ∗∗∗ | Lidocaine - 23% (23%) ∗∗∗
Propofol - 38% (77%) ∗∗∗ | Ondansetron - 23% (17%) ∗∗∗
Lab test
Leukocytes - 69% (97%) ∗∗∗ | Leukocytes - 6% (3%) ∗∗∗
Glucose - 69% (97%) ∗∗∗ | Glucose - 6% (3%) ∗∗∗
Hematocrit - 67% (97%) ∗∗∗ | Width - 5% (3%) ∗∗∗
Erythrocytes - 67% (97%) ∗∗∗ | Mean corpuscular hemoglobin concentration - 5% (3%) ∗∗∗
Width - 66% (97%) ∗∗∗ | Erythrocytes - 5% (3%) ∗∗∗
CPT-4
Surgery - 43% (79%) ∗∗∗ | Mammography - 33% (36%) ∗∗∗
Mastectomy, partial - 34% (75%) ∗∗∗ | Surgery - 30% (21%) ∗∗∗
Ultrasound - 27% (68%) ∗∗∗ | Mastectomy, partial - 28% (25%) ∗∗∗
Unlisted chemotherapy - 24% (84%) ∗∗∗ | Ultrasound, breast(s) - 24% (40%) ∗∗∗
Oncoprotein - 16% (81%) ∗∗∗ | Mammography, bilateral - 23% (42%) ∗∗∗
Mean (standard deviation); from ICD-9 on in-group and (total) percentages; a Multiple pairwise t-test; ∗ p < . ∗∗ p < . ∗∗∗ p < .

Supplementary Table 15: Most frequent terms for the two subgroups in the breast cancer second split replication cohort.
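The "in-group and (total) percentages" and the pairwise tests named in the table footnotes can be illustrated with a minimal sketch. It assumes that the in-group value is the share of a subgroup's patients whose records contain the term, and that the value in parentheses is the share of all patients with that term who fall in the subgroup; this reading is an assumption, although it is consistent with Supplementary Table 15, where the paired totals for a shared term sum to 100% (for example, Propofol at 77% and 23%). The function names, the toy data, the uncorrected Pearson chi-squared statistic, and the Bonferroni adjustment are all illustrative choices, not the authors' implementation.

```python
import math
from itertools import combinations


def term_percentages(subgroups, term):
    """Return {subgroup: (in_group_pct, total_pct)} for one term.

    `subgroups` maps a subgroup label to a list of patients, where each
    patient is the set of terms found in that patient's record.
    """
    counts = {g: sum(term in patient for patient in pts)
              for g, pts in subgroups.items()}
    total = sum(counts.values())
    result = {}
    for g, pts in subgroups.items():
        in_group = 100.0 * counts[g] / len(pts) if pts else 0.0
        of_total = 100.0 * counts[g] / total if total else 0.0
        result[g] = (in_group, of_total)
    return result


def chi2_2x2(a, b, c, d):
    """Pearson chi-squared test (no continuity correction) on [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0, 1.0
    stat = n * (a * d - b * c) ** 2 / denom
    # With 1 degree of freedom the chi-squared survival function
    # equals erfc(sqrt(x / 2)).
    return stat, math.erfc(math.sqrt(stat / 2.0))


def pairwise_chi2(subgroups, term):
    """Bonferroni-adjusted p-values (an assumed correction) for every subgroup pair."""
    pairs = list(combinations(sorted(subgroups), 2))
    adjusted = {}
    for g1, g2 in pairs:
        a = sum(term in p for p in subgroups[g1])
        c = sum(term in p for p in subgroups[g2])
        b = len(subgroups[g1]) - a
        d = len(subgroups[g2]) - c
        _, p = chi2_2x2(a, b, c, d)
        adjusted[(g1, g2)] = min(1.0, p * len(pairs))
    return adjusted


# Hypothetical toy cohort: 10 patients per subgroup.
toy = {
    "I": [{"glucose"}] * 7 + [set()] * 3,
    "II": [{"glucose"}] * 1 + [set()] * 9,
}
pcts = term_percentages(toy, "glucose")
pvals = pairwise_chi2(toy, "glucose")
print(pcts)
print(pvals)
```

In this toy cohort, 7 of the 8 patients with the term belong to subgroup I, so subgroup I would be reported as 70% (87.5%) and subgroup II as 10% (12.5%).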


Supplementary Figure 1: Second split Uniform Manifold Approximation and Projection (UMAP) encoding visualization. ConvAE 1-layer CNN (a); SVD-RawCount (b); SVD-TFIDF (c); Deep Patient (d). AD = Alzheimer’s disease; ADHD = Attention deficit hyperactivity disorder; BC = Breast cancer; CD = Crohn’s disease; MM = Multiple myeloma; PC = Prostate cancer; PD = Parkinson’s disease; T2D = Type 2 diabetes.

Supplementary Figure 2: Second split Uniform Manifold Approximation and Projection (UMAP) clustering visualization. ConvAE 1-layer CNN (a); SVD-RawCount (b); SVD-TFIDF (c); Deep Patient (d). AD = Alzheimer’s disease; ADHD = Attention deficit hyperactivity disorder; BC = Breast cancer; CD = Crohn’s disease; MM = Multiple myeloma; PC = Prostate cancer; PD = Parkinson’s disease; T2D = Type 2 diabetes.

Supplementary Figure 3: Complex disorder subgroups identified in the replication set. A subsample of 5 , aa