Temporal Phenotyping using Deep Predictive Clustering of Disease Progression
Changhee Lee Mihaela van der Schaar
Abstract
Due to the wider availability of modern electronic health records, patient care data is often being stored in the form of time-series. Clustering such time-series data is crucial for patient phenotyping, anticipating patients' prognoses by identifying "similar" patients, and designing treatment guidelines that are tailored to homogeneous patient subgroups. In this paper, we develop a deep learning approach for clustering time-series data, where each cluster comprises patients who share similar future outcomes of interest (e.g., adverse events, the onset of comorbidities). To encourage each cluster to have homogeneous future outcomes, the clustering is carried out by learning discrete representations that best describe the future outcome distribution based on novel loss functions. Experiments on two real-world datasets show that our model achieves superior clustering performance over state-of-the-art benchmarks and identifies meaningful clusters that can be translated into actionable information for clinical decision-making.
1. Introduction
Chronic diseases – such as cystic fibrosis and dementia – are heterogeneous in nature, with widely differing outcomes even in narrow patient subgroups. Disease progression manifests through a broad spectrum of clinical factors, collected as a sequence of measurements in electronic health records, which gives rise to complex progression patterns among patients (Samal et al., 2011; Yoon et al., 2017). For example, cystic fibrosis evolves slowly, allowing for the development of comorbidities and bacterial infections, and creating distinct responses to therapeutic interventions, which in turn makes survival and quality of life substantially different (Ramos et al., 2017; Lee et al., 2019). Identifying patient subgroups with similar progression patterns can be advantageous for understanding such heterogeneous diseases.

University of California, Los Angeles, USA; University of Cambridge, UK; Alan Turing Institute, UK. Correspondence to: Changhee Lee <[email protected]>.

Proceedings of the 37th International Conference on Machine Learning, Vienna, Austria, PMLR 108, 2020. Copyright 2020 by the author(s).
Figure 1.
A conceptual illustration of our (real-time) clustering procedure. Here, a new patient is assigned over time to one of the four phenotypes based on the expected future event – either Event A or Event B – as new observations are collected.

This allows clinicians to anticipate patients' prognoses by comparing to "similar" patients and to design treatment guidelines tailored to homogeneous subgroups (Zhang et al., 2019).

Temporal clustering has recently been used as a data-driven framework to partition patients with time-series observations into subgroups. Recent research has typically focused on either finding fixed-length, low-dimensional representations (Zhang et al., 2019; Rusanov et al., 2016) or modifying the similarity measure (Giannoula et al., 2018; Luong & Chandola, 2017), both in an attempt to apply existing clustering algorithms to time-series observations. However, clusters identified by these approaches are purely unsupervised – they do not account for patients' observed outcomes (e.g., adverse events, the onset of comorbidities, etc.) – which leads to heterogeneous clusters when the clinical presentation of the disease differs even for patients with the same outcomes. Thus, a common prognosis in each cluster remains unknown, which can obscure the understanding of the underlying disease progression (Boudier et al., 2019; Wami et al., 2013). To overcome this limitation, we focus on predictive clustering (Blockeel et al., 2017) to combine predictions of the future outcomes with clustering. More specifically, we aim at finding cluster assignments and centroids by learning discrete representations of time-series that best describe the future outcome distribution. By doing so, patients in the same cluster share similar future outcomes, which provides a prognostic value. Figure 1 illustrates the clustering procedure.

In this paper, we propose an actor-critic approach for temporal predictive clustering, which we call AC-TPC. Our model consists of three networks – an encoder, a selector, and a predictor – and a set of centroid candidates. The key insight here is that we model temporal predictive clustering as learning discrete representations of the input time-series that best describe the future outcome distribution. More specifically, the encoder maps an input time-series into a continuous latent encoding; the selector assigns a cluster (i.e., maps to a discrete representation) to which the input belongs by taking the latent encoding as an input; and the predictor estimates the future outcome distributions conditioned on either the encoding or the centroid of the selected cluster (i.e., the selected discrete representation). The following three contributions allow our model to achieve this goal. First, to encourage homogeneous future outcomes in each cluster, we define a clustering objective based on the Kullback-Leibler (KL) divergence between the predictor's output given the time-series and that given the assigned centroids. Second, we transform the combinatorial problem of identifying clusters into iteratively solving two sub-problems: optimization of the cluster assignments and optimization of the centroids.
Finally, we allow "back-propagation" through the sampling process of the selector by adopting actor-critic training (Konda & Tsitsiklis, 2000). Throughout the experiments, we show significant performance improvements over state-of-the-art clustering methods on two real-world medical datasets. To demonstrate the practical significance of our model, we consider a more realistic scenario where the future outcomes of interest are high-dimensional – that is, the development of multiple comorbidities in the next year – and interpreting all possible combinations is intractable. Our experiments show that our model can identify meaningful clusters that can be translated into actionable information for clinical decision-making.
2. Problem Formulation
Let X ∈ X and Y ∈ Y be random variables for an input feature and an output label (i.e., one or a combination of future outcome(s) of interest) with a joint distribution p_XY (and marginal distributions p_X and p_Y), where X is the feature space and Y is the label space. Here, we focus our description on C-class classification tasks, i.e., Y = {1, · · · , C}; in the Supplementary Material, we discuss simple modifications for regression, Y = R, and for M-dimensional binary classification tasks, Y = {0, 1}^M. (Source code is available at https://github.com/chl8856/AC_TPC.)

We are given a time-series dataset D = {(x_t^n, y_t^n)_{t=1}^{T_n}}_{n=1}^N comprising sequences of realizations (i.e., observations) of the pair (X, Y) for N patients. Here, (x_t^n, y_t^n)_{t=1}^{T_n} is a sequence of T_n observation pairs that correspond to patient n, and t ∈ T_n ≜ {1, · · · , T_n} denotes the time stamp at which the observations are made. From this point forward, we omit the dependency on n when it is clear from the context and denote a subsequence as x_{1:t} = (x_1, · · · , x_t).

Our aim is to identify a set of K predictive clusters, C = {C(1), · · · , C(K)}, for time-series data. Each cluster consists of homogeneous data samples that can be represented by its centroid, based on a certain similarity measure. There are two main distinctions from the conventional notion of clustering. First, we treat subsequences of each time-series as data samples and focus on partitioning {{x_{1:t}^n}_{t=1}^{T_n}}_{n=1}^N into C. Hence, we define a cluster as C(k) = {x_{1:t}^n | t ∈ T_n, s_t^n = k} for k ∈ K ≜ {1, · · · , K}, where s_t^n ∈ K is the cluster assignment for a given x_{1:t}^n. This allows us to flexibly update the cluster assignment (in real-time) to which a patient belongs as new observations are accrued over time. Second, we define the similarity measure with respect to the label distribution and associate it with clusters to provide a prognostic value. More specifically, we want the distribution of output labels for the subsequences in each cluster to be homogeneous and, thus, well-represented by the centroid of that cluster.

Let S be a random variable for the cluster assignment – which depends on a given subsequence x_{1:t} – and Y | S = k be a random variable for the output given cluster k. Then, this property of predictive clustering can be achieved by minimizing the following Kullback-Leibler (KL) divergence: KL(Y_t | X_t = x_{1:t} ‖ Y_t | S_t = k) for x_{1:t} ∈ C(k), which is defined as ∫_y p(y | x_{1:t}) (log p(y | x_{1:t}) − log p(y | s_t)) dy, where p(y | x_{1:t}) and p(y | s_t) are the label distributions conditioned on a subsequence x_{1:t} and on a cluster assignment s_t, respectively. Note that the KL divergence achieves its minimum when the two distributions are equivalent.

Finally, we establish our goal as identifying a set of predictive clusters C that optimizes the following objective:

minimize_C  Σ_{k∈K} Σ_{x_{1:t}∈C(k)} KL( Y_t | X_t = x_{1:t} ‖ Y_t | S_t = k ).   (1)

Unfortunately, the optimization problem in (1) is highly non-trivial. We need to estimate the objective function in (1) while solving a non-convex combinatorial problem of finding the optimal cluster assignments and cluster centroids.
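To see how (1) relates to the losses introduced in Section 3, note the standard decomposition (stated here for completeness): KL(Y_t | X_t = x_{1:t} ‖ Y_t | S_t = k) = −H(p(y | x_{1:t})) − ∫_y p(y | x_{1:t}) log p(y | s_t) dy, where H(·) denotes the Shannon entropy. The first term does not depend on the cluster assignments or the centroids, so minimizing (1) over C is equivalent to minimizing the expected cross-entropy between p(y | x_{1:t}) and p(y | s_t) – the quantity that the loss functions in Section 3 estimate.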
3. Method: AC-TPC
To effectively estimate the objective function in (1), we introduce three networks – an encoder, a selector, and a predictor – and an embedding dictionary, as illustrated in Figure 2.
Figure 2.
The block diagram of AC-TPC. The red line indicates the procedure of estimating p(y | S_t = s_t) via a sampling process and the blue line indicates that of estimating p(y | X_t = x_{1:t}).

These components together provide the cluster assignment and the corresponding centroid based on a given sequence of observations and enable us to estimate the probability density p(y | s_t). More specifically, we define each component as follows:

• The encoder, f_θ: ∏_{i=1}^t X → Z, is an RNN (parameterized by θ) that maps a (sub)sequence of a time-series x_{1:t} to a latent representation (i.e., encoding) z_t ∈ Z, where Z is the latent space.

• The selector, h_ψ: Z → ∆^{K−1}, is a fully-connected network (parameterized by ψ) that provides a probabilistic mapping to a categorical distribution from which the cluster assignment s_t ∈ K is sampled.

• The predictor, g_φ: Z → ∆^{C−1}, is a fully-connected network (parameterized by φ) that estimates the label distribution given the encoding of a time-series or the centroid of a cluster.

• The embedding dictionary, E = {e(1), · · · , e(K)} where e(k) ∈ Z for k ∈ K, is a set of cluster centroids lying in the latent space, each of which represents the corresponding cluster.

Here, ∆^{D−1} = {q ∈ [0, 1]^D : q_1 + · · · + q_D = 1} is the (D−1)-simplex that denotes the probability distribution for a D-dimensional categorical (class) variable.

At each time stamp t, the encoder maps an input (sub)sequence x_{1:t} into a latent encoding z_t ≜ f_θ(x_{1:t}). Then, based on the encoding z_t, the cluster assignment s_t is drawn from a categorical distribution defined by the selector output, i.e., s_t ∼ Cat(π_t) where π_t = [π_t(1), · · · , π_t(K)] ≜ h_ψ(z_t). Once the assignment s_t is chosen, we allocate the latent encoding z_t to an embedding e(s_t) in the embedding dictionary E. Since the allocated embedding e(s_t) corresponds to the centroid of the cluster to which x_{1:t} belongs, we can finally estimate the density p(y | s_t) in (1) as the output of the predictor given the embedding e(s_t), i.e., ¯y_t ≜ g_φ(e(s_t)).
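To make the data flow concrete, the following is a minimal sketch of the four components and one forward pass at time t. The released implementation is in TensorFlow (see the source-code link in Section 2), so this PyTorch rendering – and every layer size and name in it – is an illustrative assumption, not the authors' code:

```python
# Illustrative PyTorch sketch of AC-TPC's components (assumptions, not the
# released TensorFlow implementation; sizes follow Section 5).
import torch
import torch.nn as nn

class ACTPC(nn.Module):
    def __init__(self, x_dim, z_dim, n_classes, n_clusters, hidden=50):
        super().__init__()
        # Encoder f_theta: RNN mapping a (sub)sequence x_{1:t} to z_t.
        self.encoder = nn.LSTM(x_dim, z_dim, batch_first=True)
        # Selector h_psi: probabilities pi_t over the K clusters.
        self.selector = nn.Sequential(
            nn.Linear(z_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_clusters), nn.Softmax(dim=-1))
        # Predictor g_phi: label distribution given an encoding or a centroid.
        self.predictor = nn.Sequential(
            nn.Linear(z_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_classes), nn.Softmax(dim=-1))
        # Embedding dictionary E = {e(1), ..., e(K)}: centroids in Z.
        self.embeddings = nn.Parameter(torch.randn(n_clusters, z_dim))

    def forward(self, x):                      # x: (batch, t, x_dim)
        h, _ = self.encoder(x)
        z_t = h[:, -1, :]                      # z_t = f_theta(x_{1:t})
        pi_t = self.selector(z_t)              # pi_t = h_psi(z_t)
        s_t = torch.distributions.Categorical(pi_t).sample()   # s_t ~ Cat(pi_t)
        y_bar = self.predictor(self.embeddings[s_t])           # g_phi(e(s_t))
        return z_t, pi_t, s_t, y_bar
```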
3.1. Loss Functions

In this subsection, we define loss functions to achieve our objective in (1); the details of how we train our model are discussed in the following subsection.

Predictive Clustering Loss:
Since finding the cluster assignment of a given sequence is a probabilistic problem due to the sampling process, the objective function in (1) must be defined as an expectation over the cluster assignment. Thus, we can estimate solving the objective problem in (1) as minimizing the following loss function:

L₁(θ, ψ, φ, E) = E_{x,y ∼ p_XY} [ Σ_{t∈T} E_{s_t ∼ Cat(π_t)} [ ℓ₁(y_t, ¯y_t) ] ]   (2)

where ℓ₁(y_t, ¯y_t) = −Σ_{c=1}^C y_t^c log ¯y_t^c. Here, we slightly abuse notation and denote y = [y^1 · · · y^C] as the one-hot encoding of y, and y^c and ¯y^c indicate the c-th components of y and ¯y, respectively. It is worth highlighting that minimizing ℓ₁ is equivalent to minimizing the KL divergence in (1), since the former term of the KL divergence (the negative entropy, see the decomposition in Section 2) is independent of our optimization procedure.

One critical question that may arise is how to avoid trivial solutions in this unsupervised setting of identifying the cluster assignments and the centroids (Yang et al., 2017). For example, all the embeddings in E may collapse into a single point, or the selector may simply assign equal probability to all the clusters regardless of the input sequence. In both cases, our model will fail to correctly estimate p(y | s_t) and, thus, end up with a trivial solution. To address this issue, we introduce two auxiliary loss functions tailored to this concern. It is worth highlighting that these loss functions are not subject to the sampling process and their gradients can be simply back-propagated.
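Before turning to the auxiliary losses, here is a minimal sketch of the Monte-Carlo estimate of (2) under the illustrative PyTorch setup above, assuming one-hot labels:

```python
# Sampled estimate of L1 in (2): cross-entropy between the one-hot label
# y_t and the centroid-conditioned prediction bar{y}_t = g_phi(e(s_t)).
def predictive_clustering_loss(y_onehot, y_bar, eps=1e-8):
    # y_onehot, y_bar: (batch, C); per-sample l1 = -sum_c y^c log(bar{y}^c)
    return -(y_onehot * torch.log(y_bar + eps)).sum(dim=-1).mean()
```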
Sample-Wise Entropy of Cluster Assignment: To encourage sparse cluster assignments, such that the selector ultimately selects one dominant cluster for each sequence, we introduce the sample-wise entropy of the cluster assignment, given as

L₂(θ, ψ) = E_{x ∼ p_X} [ −Σ_{t∈T} Σ_{k∈K} π_t(k) log π_t(k) ]   (3)

where π_t = [π_t(1) · · · π_t(K)] = h_ψ(f_θ(x_{1:t})). The sample-wise entropy achieves its minimum when π_t becomes a one-hot vector.
Embedding Separation Loss: To prevent the embeddings in E from collapsing into a single point, we define a loss function that encourages the embeddings to represent label distributions, i.e., g_φ(e(k)) for k ∈ K, that differ from each other:

L₃(E) = −Σ_{k≠k'} ℓ₁( g_φ(e(k)), g_φ(e(k')) )   (4)

where ℓ₁ is reused to quantify the distance between the label distributions conditioned on each cluster. We minimize (4) when updating the embedding vectors e(1), · · · , e(K); minimal sketches of both auxiliary losses are given below.
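Sketches of the two auxiliary losses under the same illustrative setup (the helper names are assumptions):

```python
# L2 in (3): sample-wise entropy of the assignment probabilities pi_t,
# minimized so that the selector concentrates on one dominant cluster.
def sample_wise_entropy(pi, eps=1e-8):
    return -(pi * torch.log(pi + eps)).sum(dim=-1).mean()

# L3 in (4): negative sum of pairwise cross-entropies l1 between the
# centroid-conditioned label distributions g_phi(e(k)), pushing them apart.
def embedding_separation(model, eps=1e-8):
    y_emb = model.predictor(model.embeddings)                  # (K, C)
    # ce[k, k'] = -sum_c y_emb[k, c] * log y_emb[k', c]
    ce = -(y_emb.unsqueeze(1) * torch.log(y_emb.unsqueeze(0) + eps)).sum(-1)
    ce_off_diag = ce - torch.diag(torch.diag(ce))              # keep k != k'
    return -ce_off_diag.sum()
```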
3.2. Optimizing AC-TPC

The optimization problem in (1) is a non-convex combinatorial problem because it comprises not only minimizing the KL divergence but also finding the optimal cluster assignments and centroids. Hence, we propose an optimization procedure that iteratively solves two subproblems: i) optimizing the three networks – the encoder, selector, and predictor – while fixing the embedding dictionary, and ii) optimizing the embedding dictionary while fixing the three networks. Pseudo-code of AC-TPC can be found in the Supplementary Material.

3.2.1. Optimizing the Three Networks
Finding predictive clusters incorporates a sampling process that is non-differentiable. Thus, to render "back-propagation" possible, we utilize the training of actor-critic models (Konda & Tsitsiklis, 2000). More specifically, we view the combination of the encoder (f_θ) and the selector (h_ψ) as the "actor", parameterized by ω_A = [θ, ψ], and the predictor (g_φ) as the "critic". The critic takes as input the output of the actor (i.e., the cluster assignment) and estimates its value based on the sample-wise predictive clustering loss (i.e., ℓ₁(y_t, ¯y_t)) given the chosen cluster. This, in turn, lets the actor change the distribution over selected clusters to minimize this loss. Thus, it is important for the critic to perform well on the updated output of the actor, and for the actor to perform well on the updated loss estimation. As such, the parameters of the actor and the critic need to be updated iteratively.

With the embedding dictionary E fixed (we thus omit the dependency on E), we train the actor, i.e., the encoder and the selector, by minimizing a combination of the predictive clustering loss L₁ and the entropy of cluster assignments L₂, which is given by L_A(θ, ψ, φ) = L₁(θ, ψ, φ) + α L₂(θ, ψ), where α ≥ 0 is a coefficient chosen to balance the two losses. To derive the gradient of this loss with respect to ω_A = [θ, ψ], we utilize ideas from actor-critic models (Konda & Tsitsiklis, 2000):

∇_{ω_A} L_A(θ, ψ, φ) = E_{x,y ∼ p_XY} [ Σ_{t∈T} E_{s_t ∼ Cat(π_t)} [ ℓ₁(y_t, ¯y_t) ∇_{ω_A} log π_t(s_t) ] ] + α ∇_{ω_A} L₂(θ, ψ);   (5)

please refer to the Supplementary Material for the detailed derivation. Note that since no sampling process is involved in L₂(θ, ψ), we can simply derive ∇_{ω_A} L₂(θ, ψ).

Iteratively with training the actor, we train the critic, i.e., the predictor, by minimizing the predictive clustering loss L₁, that is, L_C(φ) = L₁(θ, ψ, φ), whose gradient with respect to φ is given as ∇_φ L_C(φ) = ∇_φ L₁(θ, ψ, φ). Note that since the critic is independent of the sampling process, its gradient can be simply back-propagated.
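As a sketch, (5) corresponds to minimizing the surrogate below, in which the per-sample loss ℓ₁(y_t, ¯y_t) is treated as a constant weight on log π_t(s_t) (the score-function form); only θ and ψ would be passed to the optimizer for this loss. As before, this PyTorch rendering is an illustrative assumption:

```python
# Surrogate whose gradient w.r.t. (theta, psi) matches (5): the detached
# critic loss weights grad log pi_t(s_t); the entropy term L2 is added with
# coefficient alpha. Step this with an optimizer over encoder+selector only.
def actor_loss(y_onehot, y_bar, pi, s, alpha, eps=1e-8):
    l1 = -(y_onehot * torch.log(y_bar + eps)).sum(-1)        # per-sample l1
    log_pi_s = torch.log(pi.gather(1, s.unsqueeze(1)).squeeze(1) + eps)
    reinforce = (l1.detach() * log_pi_s).mean()              # score-function term
    entropy = -(pi * torch.log(pi + eps)).sum(-1).mean()     # L2 of (3)
    return reinforce + alpha * entropy
```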
3.2.2. Optimizing the Cluster Centroids

Now, once the parameters of the three networks (θ, ψ, φ) are fixed (we thus omit the dependency on θ, ψ, and φ), we update the embeddings in E by minimizing a combination of the predictive clustering loss L₁ and the embedding separation loss L₃, which is given by L_E(E) = L₁(E) + β L₃(E), where β ≥ 0 is a coefficient chosen to balance the two losses.
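A sketch of one centroid step, reusing the helpers above; the optimizer holds only the embeddings E, so the three networks remain fixed during this sub-problem (again an illustrative assumption):

```python
# One update of the embedding dictionary E on L_E = L1 + beta * L3,
# with (theta, psi, phi) frozen: assignments are drawn without gradients
# and only model.embeddings is registered with the optimizer.
def centroid_step(model, x, y_onehot, beta, emb_opt):
    with torch.no_grad():
        h, _ = model.encoder(x)
        s = torch.distributions.Categorical(model.selector(h[:, -1, :])).sample()
    y_bar = model.predictor(model.embeddings[s])
    loss = predictive_clustering_loss(y_onehot, y_bar) \
           + beta * embedding_separation(model)
    emb_opt.zero_grad(); loss.backward(); emb_opt.step()

# usage (assuming model = ACTPC(...) from the sketch in Section 3;
# the learning rate is illustrative):
# emb_opt = torch.optim.Adam([model.embeddings], lr=1e-3)
```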
3.2.3. Initializing AC-TPC via Pre-Training
Since we transform the combinatorial optimization problem in (1) into iteratively solving two sub-problems, initialization is crucial for achieving better optimization; a similar concern has been addressed in (Yang et al., 2017). Therefore, we initialize our model based on the following procedure. First, we pre-train the encoder and the predictor by minimizing the following loss function based on the predicted label distribution given the latent encodings of input sequences, i.e., ŷ_t ≜ g_φ(z_t) = g_φ(f_θ(x_{1:t})):

L_I(θ, φ) = E_{x,y ∼ p_XY} [ Σ_{t∈T} ℓ₁(y_t, ŷ_t) ].   (6)

Minimizing (6) encourages the latent encoding to be enriched with information for accurately predicting the label distribution. Then, we perform K-means (other clustering methods can also be applied) on the learned representations to initialize the embeddings E and the cluster assignments {{s_t^n}_{t=1}^{T_n}}_{n=1}^N. Finally, we pre-train the selector h_ψ by minimizing the cross-entropy, treating the initialized cluster assignments as the true clusters.
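A minimal sketch of this initialization, assuming scikit-learn's K-means and that the encoder/predictor have already been pre-trained on (6); the function name is an assumption:

```python
# Initialize the embedding dictionary E with K-means centroids computed on
# the pre-trained encodings, and return the K-means labels, which serve as
# the "true" clusters for pre-training the selector with cross-entropy.
from sklearn.cluster import KMeans

def initialize_centroids(model, Z, K):
    # Z: (n_subsequences, z_dim) array of encodings z_t = f_theta(x_{1:t})
    km = KMeans(n_clusters=K).fit(Z)
    with torch.no_grad():
        model.embeddings.copy_(
            torch.as_tensor(km.cluster_centers_, dtype=torch.float32))
    return torch.as_tensor(km.labels_, dtype=torch.long)
```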
4. Related Work
Temporal clustering, also known as time-series clustering, is a process of unsupervised partitioning of time-series data into clusters such that homogeneous time-series are grouped together based on a certain similarity measure. Temporal clustering is challenging because i) the data is often high-dimensional – it consists of sequences not only with high-dimensional features but also with many time points – and ii) defining a proper similarity measure for time-series is not straightforward, since such measures are often highly sensitive to distortions (Ratanamahatana et al., 2005). To address these challenges, there have been various attempts to find a good representation with reduced dimensionality or to define a proper similarity measure for time-series (Aghabozorgi et al., 2015).

Recently, (Baytas et al., 2017) and (Madiraju et al., 2018) proposed temporal clustering methods that utilize low-dimensional representations learned by RNNs. These works are motivated by the success of applying deep neural networks to find "clustering-friendly" latent representations for clustering static data (Xie et al., 2016; Yang et al., 2017). In particular, Baytas et al. (2017) utilized a modified LSTM auto-encoder to find latent representations that effectively summarize the input time-series and conducted K-means on top of the learned representations as an ad-hoc process. Similarly, (Madiraju et al., 2018) proposed a bidirectional-LSTM auto-encoder that jointly optimizes the reconstruction loss for dimensionality reduction and the clustering objective. However, these methods do not associate a target property with the clusters and, thus, provide little prognostic value about the underlying disease progression.

Our work is most closely related to SOM-VAE (Fortuin et al., 2019). This method jointly optimizes a static variational auto-encoder (VAE), which finds latent representations of input features, and a self-organizing map (SOM), which maps the latent representations into more interpretable discrete representations, i.e., the embeddings. However, there are three key differences between our work and SOM-VAE. First, SOM-VAE aims at minimizing the reconstruction loss, specified as the mean squared error between the original input and the input reconstructed from the corresponding embedding. Thus, similar to the aforementioned methods, SOM-VAE does not associate future outcomes of interest with the clusters. In contrast, we focus on minimizing the KL divergence between the outcome distribution given the original input sequence and that given the corresponding embedding, to build an association between future outcomes of interest and the clusters. Second, to overcome the non-differentiability caused by the sampling process (that is, mapping the latent representation to the embeddings), (Fortuin et al., 2019) applies the gradient copying technique proposed by (van den Oord et al., 2017), while we utilize actor-critic training (Konda & Tsitsiklis, 2000). Finally, while we flexibly model time-series using an LSTM, SOM-VAE handles time-series by integrating a Markov model in the latent representations.
This can be a strict assumption, especially in clinical settings where a patient's medical history is informative for predicting future clinical outcomes (Ranganath et al., 2016).
5. Experiments
In this section, we provide a set of experiments using two real-world time-series datasets. We iteratively update the three networks – the encoder, selector, and predictor – and the embedding dictionary as described in Section 3.2. For the network architecture, we constructed the encoder using a single-layer LSTM (Hochreiter & Schmidhuber, 1997) with 50 nodes and constructed the selector and the predictor using two-layer fully-connected networks with 50 nodes in each layer, respectively. The parameters (θ, ψ, φ) are initialized by Xavier initialization (Glorot & Bengio, 2010) and optimized via the Adam optimizer (Kingma & Ba, 2014) with dropout applied during training. We chose the balancing coefficients α and β via a grid search over candidate values, selecting the pair that achieves the minimum validation loss in (2); the effect of the different loss functions is further investigated in the experiments. All results are reported using 5 random 64/16/20 train/validation/test splits.

We conducted experiments to investigate the performance of AC-TPC on two real-world medical datasets; detailed statistics of each dataset can be found in the Supplementary Material.
UK Cystic Fibrosis registry (UKCF): This dataset records annual follow-ups for 5,171 adult patients (aged 18 years or older) enrolled in the UK CF registry over the period from 2008 to 2015, with a total of 25,012 hospital visits. Each patient is associated with 89 variables (i.e., 11 static and 78 time-varying features), including information on demographics, genetic mutations, bacterial infections, lung function scores, therapeutic managements, and diagnoses of comorbidities. We set the development of different comorbidities in the next year as the label of interest at each time stamp.
Alzheimer's Disease Neuroimaging Initiative (ADNI): This dataset consists of 1,346 patients in the Alzheimer's disease study, with a total of 11,651 hospital visits, and tracks disease progression via follow-up observations at 6-month intervals. Each patient is associated with 21 variables (i.e., 5 static and 16 time-varying features), including information on demographics, biomarkers of brain function, and cognitive test results. We set predictions of the three diagnostic groups – normal brain functioning, mild cognitive impairment, and Alzheimer's disease – as the label of interest at each time stamp. (The dataset is available at https://adni.loni.usc.edu.)
We compare AC-TPC with clustering methods ranging from conventional approaches based on K-means to state-of-the-art approaches based on deep neural networks. All the benchmarks compared in the experiments are tailored to incorporate time-series data, as described below:

Dynamic time warping followed by K-means: Dynamic time warping (DTW) is utilized to quantify the pairwise distance between two variable-length sequences and, then, K-means is applied (KM-DTW).

K-means with deep neural networks: To handle variable-length time-series data, we utilize our encoder and predictor, trained based on (6), for fixed-length dimensionality reduction. Then, we apply K-means on the latent encodings z (KM-E2P (Z)) and on the predicted label distributions ŷ (KM-E2P (Y)), respectively.

Extensions of DCN (Yang et al., 2017): Since the DCN is designed for static data, we replace its static auto-encoder with a sequence-to-sequence network to incorporate time-series data
(DCN-S2S). To associate the clusters with the label distribution, we compare a DCN whose static auto-encoder is replaced with our encoder and predictor
(DCN-E2P), which focuses the dimensionality reduction on preserving information for label prediction.
SOM-VAE (Fortuin et al., 2019): We compare with SOM-VAE – though this method aims at visualizing the input – since it naturally clusters time-series data
(SOM-VAE). In addition, we compare with a variation of SOM-VAE obtained by replacing the decoder with our predictor, to find embeddings that capture information for predicting the label
(SOM-VAE-P). For both cases, we set the dimension of the SOM to K.

It is worth highlighting that the label information is provided for training DCN-E2P, KM-E2P, and SOM-VAE-P, while it is not provided for training KM-DTW, DCN-S2S, and SOM-VAE. Please refer to the Supplementary Material for a summary of the major components of the tested benchmarks and the implementation details. The DCN extension is representative of recent deep learning approaches for clustering both static data (Xie et al., 2016; Yang et al., 2017) and time-series data (Baytas et al., 2017; Madiraju et al., 2018), since these methods are built upon the same concept – applying deep networks for dimensionality reduction and then conducting conventional clustering methods, e.g., K-means.

Clustering Performance: We applied the following three standard metrics for evaluating clustering performance when the ground-truth cluster label is available: the purity score, the normalized mutual information (NMI) (Vinh et al., 2010), and the adjusted Rand index (ARI) (Hubert & Arabie, 1985). More specifically, the purity score assesses how homogeneous each cluster is (it ranges from 0 to 1, where 1 means that each cluster consists of a single class); the NMI is an information-theoretic measure of how much information is shared between the clusters and the labels, adjusted for the number of clusters (it ranges from 0 to 1, where 1 is a perfect clustering); and the ARI is a corrected-for-chance version of the Rand index, which measures the percentage of correct cluster assignments (it ranges from -1 to 1, where 1 is a perfect clustering and 0 is a random clustering).

When the ground-truth label is not available, we utilize the average Silhouette index (SI) (Rousseeuw, 1987), which measures how similar a member is to its own cluster (homogeneity within a cluster) compared to other clusters (heterogeneity across clusters). Formally, the SI for a subsequence x_{1:t}^n ∈ C(k) is given as SI(n) = (b(n) − a(n)) / max(a(n), b(n)), where a(n) = (1 / (|C(k)| − 1)) Σ_{m≠n} ‖y_t^n − y_t^m‖ and b(n) = min_{k'≠k} (1 / |C(k')|) Σ_{m∈C(k')} ‖y_t^n − y_t^m‖. Here, we use the L1-distance between the ground-truth labels of the future outcomes of interest, since our goal is to group input subsequences with similar future outcomes.
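As a sketch (assuming scikit-learn and integer-coded classes), the purity score and the L1-based average SI described above can be computed as follows; NMI and ARI are available directly:

```python
# Clustering metrics: purity (majority-class fraction per cluster), plus
# NMI, ARI, and the average silhouette index computed on the ground-truth
# outcome vectors with the L1 (manhattan) distance, as described above.
import numpy as np
from sklearn.metrics import (normalized_mutual_info_score,
                             adjusted_rand_score, silhouette_score)

def purity_score(y_true, y_pred):
    return sum(np.bincount(y_true[y_pred == k]).max()
               for k in np.unique(y_pred)) / len(y_true)

def average_si(y_outcomes, y_pred):
    # y_outcomes: (n, M) ground-truth future-outcome labels per subsequence
    return silhouette_score(y_outcomes, y_pred, metric='manhattan')
```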
Prediction Performance: To assess the prediction performance of the identified predictive clusters, we utilized both the area under the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (AUPRC), based on the label predictions of each cluster and the ground-truth binary labels of the future outcomes of interest. Note that the prediction performance is available only for the benchmarks that incorporate the label information during training.
We start with a simple scenario where the true class (i.e., the ground-truth cluster label) is available and the number of classes is tractable. In particular, we set C = 2³ = 8 based on the binary labels for the development of three common comorbidities of cystic fibrosis – diabetes, ABPA, and intestinal obstruction – in the next year for the UKCF dataset, and C = 3 based on the three mutually exclusive diagnostic groups for the ADNI dataset. We compare AC-TPC against the aforementioned benchmarks with respect to clustering and prediction performance in Table 1.

As shown in Table 1, AC-TPC achieved a performance gain over all the tested benchmarks in terms of both clustering and prediction performance – with most of the improvements being statistically significant – on both datasets. Importantly, the clustering methods that do not associate clusters with the future outcomes of interest – i.e., KM-DTW, DCN-S2S, and SOM-VAE – identified clusters that provide little prognostic value for the future outcomes (note that the true class is derived from the future outcomes of interest). This is clearly shown by ARI values near 0, which indicate that the identified clusters are no different from random assignments. Therefore, sequences that are similar with respect to latent representations tailored for reconstruction, or with respect to the shape-based measurement using DTW, can have very different outcomes.

In Figure 3, we further investigate the purity score, NMI, and ARI on the UKCF dataset by varying the number of clusters K, in the same setting as stated above (i.e., C = 8).

Table 1.
Performance comparison on the UKCF and ADNI datasets.
Dataset  Method       Purity    NMI   ARI   AUROC   AUPRC
UKCF     KM-DTW       0.573∗    –     –     N/A     N/A
         KM-E2P (Z)   0.719∗    –     –     –       –
         KM-E2P (Y)   0.751∗    –     –     –       –
         DCN-S2S      0.607∗    –     –     N/A     N/A
         DCN-E2P      0.751∗    –     –     –       –
         SOM-VAE      0.573∗    –     –     N/A     N/A
         SOM-VAE-P    0.638∗    –     –     –       –
         Proposed     –         –     –     –       –
ADNI     KM-DTW       0.566∗    –     –     N/A     N/A
         KM-E2P (Z)   0.736†    –     –     –       –
         KM-E2P (Y)   0.776     –     –     –       –
         DCN-S2S      –         –     –     N/A     N/A
         DCN-E2P      0.749     –     –     –       –
         SOM-VAE      –         –     –     N/A     N/A
         SOM-VAE-P    0.586∗    –     –     –       –
         Proposed     –         –     –     –       –

∗ indicates p-value < 0.01, † indicates p-value < 0.05.

Figure 3. The purity score, NMI, and ARI (mean and 95% confidence interval) for the UKCF dataset (C = 8) with various K: (a) the averaged purity score; (b) the averaged NMI; (c) the averaged ARI.
Here, the three methods KM-DTW, DCN-S2S, and SOM-VAE are excluded for better visualization. As we can see in Figure 3, our model rarely incurs a performance loss in either NMI or ARI, while the benchmarks (except for SOM-VAE-P) showed a significant decrease in performance as K increased beyond C. This is because the number of clusters identified by AC-TPC (i.e., the number of activated clusters, where we define cluster k as activated if |C(k)| > 0) was the same as C most of the time, while the DCN-based methods identified exactly K clusters (due to K-means). Since the NMI and ARI are adjusted for the number of clusters, a smaller number of identified clusters yields, everything else being equal, a higher performance. In contrast, while our model achieved the same purity score for K ≥ C, the benchmarks showed improved purity as K increased, since the purity score does not penalize having many clusters. This is an important property of AC-TPC: we do not need to know the number of clusters a priori, which is a common practical challenge of applying conventional clustering methods (e.g., K-means).

The performance gain of our model over SOM-VAE-P (our analysis is the same for SOM-VAE) comes from two possible sources: i) SOM-VAE-P mainly focuses on visualizing the input with the SOM, which makes both the encoder and the embeddings less flexible – this is why it performed better with higher K – and ii) the Markov property can be too strict for time-series data, especially in clinical settings where a patient's medical history is informative for predicting future clinical outcomes (Ranganath et al., 2016).

In this experiment, we focus on a more practical scenario where the future outcome of interest is high-dimensional and, thus, the number of classes based on all possible combinations of future outcomes becomes intractable. Suppose that we are interested in the development of M comorbidities in the next year, whose possible combinations grow exponentially as C = 2^M. Interpreting such a large number of patient subgroups would be a daunting task, which hinders the understanding of the underlying disease progression. Since different comorbidities may share common driving factors (Ronan et al., 2017), we expect our model to identify a much smaller number of underlying (latent) clusters that govern the development of comorbidities. Here, to handle M comorbidities (i.e., M binary labels), we redefine the output space as Y = {0, 1}^M and modify the predictor and loss functions accordingly.

Figure 4.
AUROC, AUPRC, and average SI (mean and 95% confidence interval) and the number of activated clusters with various K: (a) AUROC; (b) AUPRC; (c) average SI.

Figure 5.
Clusters with high risk of developing diabetes.
We identified clusters of patients based on the next-year development of different comorbidities in the UKCF dataset and report five of them in Figure 5 – Clusters 0, 5, 7, 8, and 10 – with the frequency of developing important comorbidities in the next year. Here, we selected the clusters that have the highest risk of developing diabetes in the next year, and the frequencies are calculated in a cluster-specific fashion using the true labels. A full list of clusters and comorbidity frequencies can be found in the Supplementary Material. Although all these clusters displayed a high risk of diabetes, the frequencies of other co-occurring comorbidities differed significantly across the clusters. In particular, around 89% of the patients in Cluster 5 experienced asthma in the next year, while this held for less than 3% of the patients in the other clusters. Interestingly, "leukotriene" – a medicine commonly used to manage asthma – and "FEV1 % predicted" – a measure of lung function – were the two input features that differed most between patients in Cluster 5 and those in the other clusters. We observed similar findings in Cluster 7 with ABPA, Cluster 8 with liver disease, and Cluster 10 with osteopenia. Therefore, by grouping patients who are likely to develop a similar set of comorbidities, our method identified clusters that can be translated into actionable information for clinical decision-making.

In predictive clustering, there is a common trade-off between the clustering performance (for better interpretability) – which quantifies how homogeneous the data samples are within each cluster and how heterogeneous they are across clusters with respect to the future outcomes of interest – and the prediction performance. The most important parameter governing this trade-off is the number of clusters. More specifically, increasing the number of clusters gives the predictive clusters higher diversity to represent the output distribution and, thus, increases the prediction performance while decreasing the clustering performance. One extreme example is having as many clusters as data samples, which makes the identified clusters fully individualized; as a consequence, each cluster loses interpretability, as it no longer groups similar data samples.

To highlight this trade-off, we conduct experiments under the same experimental setup as the comorbidity experiment above. For the performance measures, we utilized the AUROC and AUPRC to assess the prediction performance, and the average SI to assess the clustering performance. To control the number of activated clusters, we set β = 0 and β = 1 (since the embedding separation loss in (4) controls the activation of clusters) and report the performance while increasing the number of possible clusters K, i.e., the dimension of the embedding dictionary.

As can be seen in Figure 4, the prediction performance increased with an increasing number of identified clusters, owing to the higher diversity available to represent the label distribution, while the identified clusters became less interpretable. That is, the cohesion and separation among clusters become ambiguous, as shown by the low average SI. On the other hand, when we set β = 1.0 (which is selected based on the validation loss in (2)), our method consistently identified a similar number of clusters for large K – 13.8 on average – in a data-driven fashion, and provided slightly reduced prediction performance with significantly better interpretability, i.e., a substantially higher average SI.
6. Conclusion
In this paper, we introduced AC-TPC, a deep learning approach for predictive clustering of time-series data. We defined novel loss functions to encourage each cluster to have homogeneous future outcomes (e.g., adverse events, the onset of comorbidities, etc.) and designed optimization procedures to avoid trivial solutions in identifying cluster assignments and centroids. Throughout experiments on two real-world datasets, we showed that our model achieves superior clustering performance over state-of-the-art methods and identifies meaningful clusters that can be translated into actionable information for clinical decision-making.
Acknowledgements
The authors would like to thank the reviewers for their helpful comments. This work was supported by the National Science Foundation (NSF grants 1524417 and 1722516), the US Office of Naval Research (ONR), and the UK Cystic Fibrosis Trust. We thank Dr. Janet Allen (Director of Strategic Innovation, UK Cystic Fibrosis Trust) for the vision and encouragement. We thank Rebecca Cosgriff and Elaine Gunn for help with data access, extraction, and analysis. We also thank Prof. Andres Floto and Dr. Tomas Daniels, our collaborators, for the very helpful clinical discussions.
References
Aghabozorgi, S., Shirkhorshidi, A. S., and Wah, T. Y. Time-series clustering – a decade review. Information Systems, 53:16–38, May 2015.

Baytas, I. M., Xiao, C., Zhang, X., Wang, F., Jain, A. K., and Zhou, J. Patient subtyping via time-aware LSTM networks. In Proceedings of the 23rd ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2017), 2017.

Blockeel, H., Dzeroski, S., Struyf, J., and Zenko, B. Predictive Clustering. Springer New York, 2017.

Boudier, A., Chanoine, S., Accordini, S., Anto, J. M., Basagaña, X., Bousquet, J., Demoly, P., Garcia-Aymerich, J., Gormand, F., Heinrich, J., Janson, C., Künzli, N., Matran, R., Pison, C., Raherison, C., Sunyer, J., Varraso, R., Jarvis, D., Leynaert, B., Pin, I., and Siroux, V. Data-driven adult asthma phenotypes based on clinical characteristics are associated with asthma outcomes twenty years later. Allergy, 74(5):953–963, May 2019.

Fortuin, V., Hüser, M., Locatello, F., Strathmann, H., and Rätsch, G. SOM-VAE: Interpretable discrete representation learning on time series. In Proceedings of the 7th International Conference on Learning Representations (ICLR 2019), 2019.

Giannoula, A., Gutierrez-Sacristán, A., Bravo, A., Sanz, F., and Furlong, L. I. Identifying temporal patterns in patient disease trajectories using dynamic time warping: A population-based study. Scientific Reports, 8(4216):1–14, March 2018.

Glorot, X. and Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS 2010), 2010.

Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

Hubert, L. and Arabie, P. Comparing partitions. Journal of Classification, 2(1):193–218, December 1985.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Konda, V. R. and Tsitsiklis, J. N. Actor-critic algorithms. In Proceedings of the 13th Conference on Neural Information Processing Systems (NIPS 2000), 2000.

Lee, C., Yoon, J., and van der Schaar, M. Dynamic-DeepHit: A deep learning approach for dynamic survival analysis with competing risks based on longitudinal data. IEEE Transactions on Biomedical Engineering, April 2019.

Luong, D. T. A. and Chandola, V. A k-means approach to clustering disease progressions. In Proceedings of the 5th IEEE International Conference on Healthcare Informatics (ICHI), 2017.

Madiraju, N. S., Sadat, S. M., Fisher, D., and Karimabadi, H. Deep temporal clustering: Fully unsupervised learning of time-domain features. arXiv preprint arXiv:1802.01059, 2018.

Ramos, K. J., Quon, B. S., Heltshe, S. L., Mayer-Hamblett, N., Lease, E. D., Aitken, M. L., Weiss, N. S., and Goss, C. H. Heterogeneity in survival in adult patients with cystic fibrosis with FEV1 <30% of predicted in the United States. Chest, 151(6):1320–1328, June 2017.

Ranganath, R., Perotte, A., Elhadad, N., and Blei, D. Deep survival analysis. In Proceedings of the 1st Machine Learning for Healthcare Conference (MLHC 2016), 2016.

Ratanamahatana, C. A., Keogh, E., Bagnall, A. J., and Lonardi, S. A novel bit level time series representation with implications for similarity search and clustering. In Proceedings of the 9th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD 2005), 2005.

Ronan, N. J., Elborn, J., and Plant, B. J. Current and emerging comorbidities in cystic fibrosis. Presse Med., 46(6):125–138, June 2017.

Rousseeuw, P. J. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20:53–65, 1987.

Rusanov, A., Prado, P. V., and Weng, C. Unsupervised time-series clustering over lab data for automatic identification of uncontrolled diabetes. In Proceedings of the 4th IEEE International Conference on Healthcare Informatics (ICHI), 2016.

Samal, L., Wright, A., Wong, B., Linder, J., and Bates, D. Leveraging electronic health records to support chronic disease management: the need for temporal data views. Informatics in Primary Care, 19(2):65–74, 2011.

van den Oord, A., Vinyals, O., and Kavukcuoglu, K. Neural discrete representation learning. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), 2017.

Vinh, N. X., Epps, J., and Bailey, J. Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance. Journal of Machine Learning Research, 11:2837–2854, October 2010.

Wami, W. M., Buntinx, F., Bartholomeeusen, S., Goderis, G., Mathieu, C., and Aerts, M. Influence of chronic comorbidity and medication on the efficacy of treatment in patients with diabetes in general practice. The British Journal of General Practice, 63(609):267–273, March 2013.

Xie, J., Girshick, R., and Farhadi, A. Unsupervised deep embedding for clustering analysis. In Proceedings of the 33rd International Conference on Machine Learning (ICML 2016), 2016.

Yang, B., Fu, X., Sidiropoulos, N. D., and Hong, M. Towards k-means-friendly spaces: Simultaneous deep learning and clustering. In Proceedings of the 34th International Conference on Machine Learning (ICML 2017), 2017.

Yoon, J., Davtyan, C., and van der Schaar, M. Discovery and clinical decision support for personalized healthcare. IEEE Journal of Biomedical and Health Informatics, 21(4):1133–1145, 2017.

Zhang, X., Chou, J., Liang, J., Xiao, C., Zhao, Y., Sarva, H., Henchcliffe, C., and Wang, F. Data-driven subtyping of Parkinson's disease using longitudinal clinical records: A cohort study. Scientific Reports, 9(797):1–12, January 2019.

Supplementary Material

Temporal Phenotyping using Deep Predictive Clustering of Disease Progression
Changhee Lee Mihaela van der Schaar
A. AC-TPC for Regression and Binary Classification Tasks
As the task changes, estimating the label distribution and calculating the KL divergence in (1) of the manuscript must be redefined accordingly. For a regression task, i.e., Y = R, we modify the predictor as g_φ: Z → R and replace ℓ₁ by ℓ₁(y_t, ¯y_t) = ‖y_t − ¯y_t‖²₂. Minimizing ℓ₁(y_t, ¯y_t) is equivalent to minimizing the KL divergence between p(y_t | x_{1:t}) and p(y_t | s_t) when we assume these probability densities follow Gaussian distributions with the same variance. For the M-dimensional binary classification task, i.e., Y = {0, 1}^M, we modify the predictor as g_φ: Z → [0, 1]^M and replace ℓ₁ by ℓ₁(y_t, ¯y_t) = −Σ_{m=1}^M ( y_t^m log ¯y_t^m + (1 − y_t^m) log(1 − ¯y_t^m) ), which is what minimizing the KL divergence requires. Here, y_t^m and ¯y_t^m indicate the m-th elements of y_t and ¯y_t, respectively. The basic assumption here is that the distribution of each binary label is independent given the input sequence.
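Minimal sketches of these two task-specific variants of ℓ₁ (illustrative PyTorch, reusing the torch import from the sketches in Section 3):

```python
# Regression: predictor outputs a real value; squared error matches the
# Gaussian-KL argument above (equal variances).
def l1_regression(y, y_bar):
    return ((y - y_bar) ** 2).sum(-1).mean()

# M-dimensional binary labels: predictor outputs M probabilities in [0,1];
# the loss is a sum of M binary cross-entropies (conditional independence).
def l1_multilabel(y, y_bar, eps=1e-8):
    ce = y * torch.log(y_bar + eps) + (1 - y) * torch.log(1 - y_bar + eps)
    return -ce.sum(-1).mean()
```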
B. Detailed Derivation of (5)

To derive the gradient of the predictive clustering loss in (5) of the manuscript with respect to ω_A = [θ, ψ], we utilized the ideas from actor-critic models (Konda & Tsitsiklis, 2000) on L_A(θ, ψ, φ) = L₁(θ, ψ, φ) + α L₂(θ, ψ):

∇_{ω_A} L_A(θ, ψ, φ) = E_{x,y ∼ p_XY} [ ∇_{ω_A} ( Σ_{t=1}^T E_{s_t ∼ Cat(π_t)} [ ℓ₁(y_t, ¯y_t) ] ) ] + α ∇_{ω_A} L₂(θ, ψ)
                     = E_{x,y ∼ p_XY} [ Σ_{t=1}^T E_{s_t ∼ Cat(π_t)} [ ℓ₁(y_t, ¯y_t) ∇_{ω_A} log π_t(s_t) ] ] + α ∇_{ω_A} L₂(θ, ψ),   (S.1)

where the second equality comes from the following derivation of the former term:

E_{x,y ∼ p_XY} [ ∇_{ω_A} ( Σ_{t=1}^T E_{s_t ∼ Cat(π_t)} [ ℓ₁(y_t, ¯y_t) ] ) ]
  = E_{x,y ∼ p_XY} [ ∇_{ω_A} ( Σ_{t=1}^T Σ_{s_t ∈ K} π_t(s_t) ℓ₁(y_t, ¯y_t) ) ]
  = E_{x,y ∼ p_XY} [ Σ_{t=1}^T Σ_{s_t ∈ K} ∇_{ω_A} π_t(s_t) ℓ₁(y_t, ¯y_t) ]
  = E_{x,y ∼ p_XY} [ Σ_{t=1}^T Σ_{s_t ∈ K} π_t(s_t) (∇_{ω_A} π_t(s_t) / π_t(s_t)) ℓ₁(y_t, ¯y_t) ]
  = E_{x,y ∼ p_XY} [ Σ_{t=1}^T Σ_{s_t ∈ K} π_t(s_t) ℓ₁(y_t, ¯y_t) ∇_{ω_A} log π_t(s_t) ]
  = E_{x,y ∼ p_XY} [ Σ_{t=1}^T E_{s_t ∼ Cat(π_t)} [ ℓ₁(y_t, ¯y_t) ∇_{ω_A} log π_t(s_t) ] ].
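The key step above is the log-derivative trick, ∇π = π ∇ log π. As a quick self-contained numerical check (a sketch, not part of the paper), both sides of the identity can be evaluated in closed form for a single categorical distribution parameterized by logits:

```python
# Verify grad_w E_{s~Cat(pi(w))}[l(s)] == E_s[l(s) * grad_w log pi_s(w)]
# for softmax probabilities pi(w); grad_w log pi_s = onehot(s) - pi.
import numpy as np

rng = np.random.default_rng(0)
logits, l = rng.normal(size=4), rng.normal(size=4)   # l(s): per-cluster loss
pi = np.exp(logits) / np.exp(logits).sum()

direct = pi * (l - pi @ l)                           # grad of sum_s pi_s l_s
score = sum(pi[s] * l[s] * ((np.arange(4) == s).astype(float) - pi)
            for s in range(4))                       # E_s[l grad log pi_s]
assert np.allclose(direct, score)
```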
C. Pseudo-Code of AC-TPC

As illustrated in Section 3.2, AC-TPC is trained in an iterative fashion. We provide the pseudo-code for optimizing our model in Algorithm 1 and that for initializing the parameters in Algorithm 2.
Algorithm 1
Pseudo-code for Optimizing AC-TPC
Input: dataset D = {(x_t^n, y_t^n)_{t=1}^{T_n}}_{n=1}^N, number of clusters K, coefficients (α, β), learning rates (η_A, η_C, η_E), mini-batch size n_mb, and update steps M
Output: AC-TPC parameters (θ, ψ, φ) and the embedding dictionary E

Initialize the parameters (θ, ψ, φ) and the embedding dictionary E via Algorithm 2
repeat
  Optimize the encoder, selector, and predictor:
  for m = 1, · · · , M do
    Sample a mini-batch of n_mb data samples: {(x_t^n, y_t^n)_{t=1}^{T_n}}_{n=1}^{n_mb} ∼ D
    for n = 1, · · · , n_mb do
      Calculate the assignment probabilities: π_t^n = [π_t^n(1) · · · π_t^n(K)] ← h_ψ(f_θ(x_{1:t}^n))
      Draw the cluster assignment: s_t^n ∼ Cat(π_t^n)
      Calculate the label distributions: ¯y_t^n ← g_φ(e(s_t^n)) and ŷ_t^n ← g_φ(f_θ(x_{1:t}^n))
    end for
    Update the encoder f_θ and the selector h_ψ:
      θ ← θ − η_A (1/n_mb) Σ_{n=1}^{n_mb} Σ_{t=1}^{T_n} [ ℓ₁(y_t^n, ¯y_t^n) ∇_θ log π_t^n(s_t^n) − α ∇_θ Σ_{k=1}^K π_t^n(k) log π_t^n(k) ]
      ψ ← ψ − η_A (1/n_mb) Σ_{n=1}^{n_mb} Σ_{t=1}^{T_n} [ ℓ₁(y_t^n, ¯y_t^n) ∇_ψ log π_t^n(s_t^n) − α ∇_ψ Σ_{k=1}^K π_t^n(k) log π_t^n(k) ]
    Update the predictor g_φ:
      φ ← φ − η_C (1/n_mb) Σ_{n=1}^{n_mb} Σ_{t=1}^{T_n} ∇_φ ℓ₁(y_t^n, ¯y_t^n)
  end for
  Optimize the cluster centroids:
  for m = 1, · · · , M do
    Sample a mini-batch of n_mb data samples: {(x_t^n, y_t^n)_{t=1}^{T_n}}_{n=1}^{n_mb} ∼ D
    for n = 1, · · · , n_mb do
      Calculate the assignment probabilities: π_t^n = [π_t^n(1) · · · π_t^n(K)] ← h_ψ(f_θ(x_{1:t}^n))
      Draw the cluster assignment: s_t^n ∼ Cat(π_t^n)
      Calculate the label distributions: ¯y_t^n ← g_φ(e(s_t^n))
    end for
    for k = 1, · · · , K do
      Update the embedding e(k):
        e(k) ← e(k) − η_E (1/n_mb) [ Σ_{n=1}^{n_mb} Σ_{t=1}^{T_n} ∇_{e(k)} ℓ₁(y_t^n, ¯y_t^n) − β Σ_{k'=1, k'≠k}^K ∇_{e(k)} ℓ₁(g_φ(e(k)), g_φ(e(k'))) ]
    end for
    Update the embedding dictionary: E ← {e(1), . . . , e(K)}
  end for
until convergence

Algorithm 2
Pseudo-code for pre-training AC-TPC
Input: dataset D = {(x_t^n, y_t^n)_{t=1}^{T_n}}_{n=1}^N, number of clusters K, learning rate η, and mini-batch size n_mb
Output: AC-TPC parameters (θ, ψ, φ) and the embedding dictionary E

Initialize the parameters (θ, ψ, φ) via the Xavier initializer
Pre-train the encoder and predictor:
repeat
  Sample a mini-batch of n_mb data samples: {(x_t^n, y_t^n)_{t=1}^{T_n}}_{n=1}^{n_mb} ∼ D
  for n = 1, · · · , n_mb do
    Calculate the label distributions: ŷ_t^n ← g_φ(f_θ(x_{1:t}^n))
  end for
  θ ← θ − η (1/n_mb) Σ_{n=1}^{n_mb} Σ_{t=1}^{T_n} ∇_θ ℓ₁(y_t^n, ŷ_t^n)
  φ ← φ − η (1/n_mb) Σ_{n=1}^{n_mb} Σ_{t=1}^{T_n} ∇_φ ℓ₁(y_t^n, ŷ_t^n)
until convergence
Initialize the cluster centroids:
  Calculate the embedding dictionary E and the initial cluster assignments c_t^n:
    E, {{c_t^n}_{t=1}^{T_n}}_{n=1}^N ← K-means({{z_t^n}_{t=1}^{T_n}}_{n=1}^N, K)
Pre-train the selector:
repeat
  Sample a mini-batch of n_mb data samples: {(x_t^n, y_t^n)_{t=1}^{T_n}}_{n=1}^{n_mb} ∼ D
  for n = 1, · · · , n_mb do
    Calculate the assignment probabilities: π_t^n = [π_t^n(1) · · · π_t^n(K)] ← h_ψ(f_θ(x_{1:t}^n))
  end for
  Update the selector h_ψ:
    ψ ← ψ + η (1/n_mb) Σ_{n=1}^{n_mb} Σ_{t=1}^{T_n} Σ_{k=1}^K c_t^n(k) log π_t^n(k)
until convergence

D. Details of the Datasets
D.1. UKCF Dataset
UK Cystic Fibrosis registry (UKCF) records annual follow-ups for 5,171 adult patients (aged 18 years or older) over the period from 2008 to 2015, with a total of 25,012 hospital visits. Each patient is associated with 89 variables (i.e., 11 static and 78 time-varying features), including information on demographics, genetic mutations, bacterial infections, lung function scores, therapeutic managements, and diagnoses of comorbidities. The detailed statistics are given in Table S.1.

D.2. ADNI Dataset
Alzheimer's Disease Neuroimaging Initiative (ADNI) study consists of 1,346 patients with a total of 11,651 hospital visits and tracks disease progression via follow-up observations at 6-month intervals. Each patient is associated with 21 variables (i.e., 5 static and 16 time-varying features), including information on demographics, biomarkers of brain function, and cognitive test results. The three diagnostic groups were normal brain functioning, mild cognitive impairment, and Alzheimer's disease. The detailed statistics are given in Table S.2. (The dataset is available at https://adni.loni.usc.edu.)

E. Details of the Benchmarks
We compared AC-TPC in the experiments with clustering methods ranging from conventional approaches based on K-means to state-of-the-art approaches based on deep neural networks. The details of how we implemented the benchmarks are described as follows.

Table S.1.
Summary and description of the UKCF dataset.
STATIC COVARIATES (Type, Mean)

Demographic
Gender Bin. 0.55

Genetic
Class I Mutation Bin. 0.05     Class VI Mutation Bin. 0.86
Class II Mutation Bin. 0.87    DF508 Mutation Bin. 0.87
Class III Mutation Bin. 0.89   G551D Mutation Bin. 0.06
Class IV Mutation Bin. 0.05    Homozygous Bin. 0.58
Class V Mutation Bin. 0.04     Heterozygous Bin. 0.42
TIME-VARYING COVARIATES (Type, Mean, Min / Max)

Demographic
Age Cont. 30.4 18.0 / 86.0        Height Cont. 168.0 129.0 / 198.6
Weight Cont. 64.1 24.0 / 173.3    BMI Cont. 22.6 10.9 / 30.0
Smoking Status Bin. 0.1
Lung Func. Scores
FEV1 Cont. 2.3 0.2 / 6.3               Best FEV1 Cont. 2.5 0.3 / 8.0
FEV1 % Pred. Cont. 65.1 9.0 / 197.6    Best FEV1 % Pred. Cont. 71.2 7.5 / 164.3

Hospitalization
IV ABX Days Hosp. Cont. 12.3 0 / 431   Non-IV Hosp. Adm. Cont. 1.2 0 / 203
IV ABX Days Home Cont. 11.9 0 / 441
Lung Infections
B. Cepacia Bin. 0.05       P. Aeruginosa Bin. 0.59
H. Influenza Bin. 0.05     K. Pneumoniae Bin. 0.00
E. Coli Bin. 0.01          ALCA Bin. 0.03
Aspergillus Bin. 0.14      NTM Bin. 0.03
Gram-Negative Bin. 0.01    Xanthomonas Bin. 0.05
S. Aureus Bin. 0.30
Comorbidities
Liver Disease Bin. 0.16    Depression Bin. 0.07
Asthma Bin. 0.15           Hemoptysis Bin. 0.01
ABPA Bin. 0.12             Pancreatitis Bin. 0.01
Hypertension Bin. 0.04     Hearing Loss Bin. 0.03
Diabetes Bin. 0.28         Gall bladder Bin. 0.01
Arthropathy Bin. 0.09      Colonic stricture Bin. 0.00
Bone fracture Bin. 0.01    Intest. Obstruction Bin. 0.08
Osteoporosis Bin. 0.09     GI bleed – no var. Bin. 0.00
Osteopenia Bin. 0.21       GI bleed – var. Bin. 0.00
Cancer Bin. 0.00           Liver Enzymes Bin. 0.16
Cirrhosis Bin. 0.03        Kidney Stones Bin. 0.02
Treatments
Dornase Alpha Bin. 0.56       Inhaled B. BAAC Bin. 0.03
Anti-fungals Bin. 0.07        Inhaled B. LAAC Bin. 0.08
HyperSaline Bin. 0.23         Inhaled B. SAAC Bin. 0.05
HypertonicSaline Bin. 0.01    Inhaled B. LABA Bin. 0.11
Tobi Solution Bin. 0.20       Inhaled B. Dilators Bin. 0.57
Cortico Combo Bin. 0.41       Cortico Inhaled Bin. 0.15
Non-IV Ventilation Bin. 0.05  Oral B. Theoph. Bin. 0.04
Acetylcysteine Bin. 0.02      Oral B. BA Bin. 0.03
Aminoglycoside Bin. 0.03      Oral Hypo. Agents Bin. 0.01
Ibuprofen Bin. 0.00           Chronic Oral ABX Bin. 0.53
Drug Dornase Bin. 0.02        Cortico Oral Bin. 0.14
HD Ibuprofen Bin. 0.00        Oxygen Therapy Bin. 0.11
Tobramycin Bin. 0.03          O2 Exacerbation Bin. 0.03
Leukotriene Bin. 0.07         O2 Nocturnal Bin. 0.03
Colistin Bin. 0.03            O2 Continuous Bin. 0.03
Diabetes Insulin Bin. 0.01    O2 Pro re nata Bin. 0.01
Macrolide ABX Bin. 0.02

ABX: antibiotics
Table S.2.
Summary and description of the ADNI dataset.
STATIC COVARIATES (Type, Mean, Min/Max (Mode))

Demographic
Race Cat. 0.93 White       Ethnicity Cat. 0.97 No Hisp/Latino
Education Cat. 0.23 C16    Marital Status Cat. 0.75 Married
Genetic
APOE4 Cont. 0.44 0/2
TIME-VARYING COVARIATES (Type, Mean, Min / Max)

Demographic
Age Cont. 73.6 55/92
Biomarker
Entorhinal Cont. 3.6E+3 1.0E+3 / 6.7E+3     Mid Temp Cont. 2.0E+4 8.9E+3 / 3.2E+4
Fusiform Cont. 1.8E+5 9.0E+4 / 2.9E+5       Ventricles Cont. 4.1E+4 5.7E+3 / 1.6E+5
Hippocampus Cont. 6.9E+3 2.8E+3 / 1.1E+4    Whole Brain Cont. 1.0E+6 6.5E+5 / 1.5E+6
Intracranial Cont. 1.5E+6 2.9E+2 / 2.1E+6
Cognitive
ADAS-11 Cont. 8.58 0/70               ADAS-13 Cont. 13.61 0/85
CDR Sum of Boxes Cont. 1.21 0/17      Mini Mental State Cont. 27.84 2/30
RAVLT Forgetting Cont. 4.19 -12/15    RAVLT Immediate Cont. 38.25 0/75
RAVLT Learning Cont. 4.65 -5/14       RAVLT Percent Cont. 51.70 -500/100
Figure S.1.
The block diagrams of the tested benchmarks: (a) DCN-S2S; (b) DCN-E2P, KM-E2P; (c) SOM-VAE; (d) SOM-VAE-P.
Table S.3.
Comparison table of benchmarks.
Methods      Handling Time-Series   Clustering Method    Similarity Measure    Label Provided   Label Associated
KM-DTW       DTW                    K-means              DTW                   N                N
KM-E2P (Z)   RNN                    K-means              Euclidean in Z        Y                Y (indirect)
KM-E2P (Y)   RNN                    K-means              Euclidean in Y        Y                Y (direct)
DCN-S2S      RNN                    K-means              Euclidean in Z        N                N
DCN-E2P      RNN                    K-means              Euclidean in Z        Y                Y (indirect)
SOM-VAE      Markov model           embedding mapping    reconstruction loss   N                N
SOM-VAE-P    Markov model           embedding mapping    prediction loss       Y                Y (direct)
Proposed     RNN                    embedding mapping    KL divergence         Y                Y (direct)

• Dynamic time warping followed by K-means: Dynamic time warping (DTW) is utilized to quantify the pairwise distance between two variable-length sequences and, then, K-means is applied (denoted as KM-DTW; we use the tslearn implementation, https://github.com/rtavenar/tslearn).

• K-means with deep neural networks: To handle variable-length time-series data, we utilized an encoder-predictor network, as depicted in Figure S.1b, and trained the network based on (6) for dimensionality reduction; this provides fixed-length and low-dimensional representations for time-series. Then, we applied K-means on the latent encodings z (denoted as KM-E2P (Z)) and on the predicted label distributions ŷ (denoted as KM-E2P (Y)), respectively. We implemented the encoder and predictor of KM-E2P with the same network architectures as those of our model: the encoder is a single-layer LSTM with 50 nodes and the predictor is a two-layer fully-connected network with 50 nodes in each layer.

• Extensions of DCN (Yang et al., 2017): Since the DCN (https://github.com/boyangumn/DCN) is designed for static data, we utilized a sequence-to-sequence model, as in Figure S.1a, for the encoder-decoder network as an extension to incorporate time-series data (denoted as DCN-S2S) and trained the network based on the reconstruction loss (using the reconstructed input sequence x̂_{1:t}). For implementing DCN-S2S, we used a single-layer LSTM with 50 nodes for both the encoder and the decoder, and we augmented a fully-connected layer with 50 nodes to reconstruct the original input sequence from the latent representation of the decoder. In addition, since predictive clustering is associated with the label distribution, we compared a DCN whose encoder-decoder structure is replaced with our encoder-predictor network in Figure S.1b (denoted as DCN-E2P), to focus the dimensionality reduction – and, thus, the latent encodings on which clustering is performed – on the information for predicting the label distribution. We implemented the encoder and predictor of DCN-E2P with the same network architectures as those of our model, as described in Section 5.

• SOM-VAE (Fortuin et al., 2019): We compare with SOM-VAE – though this method is oriented towards visualization of input data via the SOM – since it naturally clusters time-series data assuming the Markov property (denoted as SOM-VAE). We replaced the original CNN architecture of the encoder and the decoder with three-layer fully-connected networks with 50 nodes in each layer, respectively. The network architecture is depicted in Figure S.1c, where x̂_t and x̄_t indicate the inputs reconstructed from the encoding z_t and from the embedding e_t at time t, respectively. In addition, we compare with a variation of SOM-VAE in which the decoder is replaced with the predictor, to encourage the latent encoding to capture information for predicting the label distribution (denoted as SOM-VAE-P).
For the implementation,we replaced the decoder of SOM-VAE with our predictor which is a two-layered fully-connected layer with 50 nodes ineach layer to predict the label distribution as illustrated in Figure S.1d. Here, ˆ y t and ¯ y t indicate the predicted labelsbased on the encoding z t and the embedding e t at time t , respectively.For both cases, we used the default values for balancing coefficients of SOM-VAE and the dimension of SOM to beequal to K .We compared and summarized major components of the benchmarks in Table S.3. F. Additional Experiments
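For concreteness, the following is a minimal sketch of the KM-DTW baseline using the tslearn library referenced above. The synthetic data and variable names are illustrative assumptions, not the exact script used in our experiments.

```python
import numpy as np
from tslearn.clustering import TimeSeriesKMeans
from tslearn.utils import to_time_series_dataset

K = 8  # number of clusters (illustrative)

# Hypothetical input: 100 variable-length multivariate time-series, each of
# shape (T_i, num_features); tslearn pads them into one NaN-padded array.
rng = np.random.default_rng(0)
sequences = [rng.normal(size=(rng.integers(5, 15), 4)) for _ in range(100)]
X = to_time_series_dataset(sequences)  # shape (N, T_max, num_features)

# K-means with DTW as the similarity measure (the KM-DTW baseline).
km_dtw = TimeSeriesKMeans(n_clusters=K, metric="dtw", random_state=0)
cluster_assignments = km_dtw.fit_predict(X)
```

Because DTW aligns sequences before measuring distance, this baseline handles variable-length inputs directly, without the fixed-length encodings used by the deep benchmarks.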
F. Additional Experiments

F.1. Contributions of the Auxiliary Loss Functions
As described in Section 3.1, we introduced two auxiliary loss functions – the sample-wise entropy of cluster assignments (3) and the embedding separation loss (4) – to avoid trivial solutions that may arise in identifying the predictive clusters. To analyze the contribution of each auxiliary loss function, we report the average number of activated clusters, the clustering performance, and the prediction performance on the UKCF dataset with comorbidities, as described in Section 5.4. Throughout the experiment, we set K = 16 – which is larger than C – to assess how these loss functions affect the number of activated clusters.
Table S.4. Performance comparison with varying balancing coefficients α and β on the UKCF dataset.
Coefficients (α, β)   Activated No.   Purity   NMI   ARI   AUROC   AUPRC
[The numeric entries of Table S.4 were lost in extraction; only the first row's start (α = 0.0, β = 0.0, 16 activated clusters, Purity 0.573) is recoverable.]

As we can see in Table S.4, both auxiliary loss functions make important contributions to improving the quality of predictive clustering. More specifically, the sample-wise entropy encourages the selector to choose one dominant cluster. Thus, as the results with α = 0 show, without the sample-wise entropy our selector assigns equal probability to all clusters, which results in a trivial solution. We also observed that, by augmenting the embedding separation loss (4), AC-TPC identifies a smaller number of clusters owing to the well-separated embeddings. A minimal sketch of these two auxiliary loss terms is given below.
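To make the two auxiliary terms concrete, here is a minimal PyTorch sketch. The entropy term follows the description above (penalizing selectors that spread probability mass across clusters); for the separation term we assume a simple negative mean pairwise Euclidean distance between centroid embeddings as a stand-in for the exact form of loss (4). Function and variable names are ours, not from the released code.

```python
import torch

def sample_wise_entropy(pi: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Auxiliary loss (3): pi is a (batch, K) tensor of cluster-assignment
    probabilities from the selector. Minimizing the mean entropy pushes
    each sample towards one dominant cluster."""
    return -(pi * torch.log(pi + eps)).sum(dim=1).mean()

def embedding_separation(centroids: torch.Tensor) -> torch.Tensor:
    """Stand-in for auxiliary loss (4): centroids is a (K, latent_dim)
    tensor of centroid candidates. Returns the negative mean pairwise
    Euclidean distance, so minimizing it pushes the embeddings apart."""
    K = centroids.shape[0]
    dists = torch.cdist(centroids, centroids)   # (K, K); diagonal is zero
    return -dists.sum() / (K * (K - 1))         # mean over off-diagonal pairs

# Hypothetical overall objective with balancing coefficients alpha and beta:
# loss = predictive_clustering_loss \
#        + alpha * sample_wise_entropy(pi) \
#        + beta * embedding_separation(centroids)
```

Setting alpha = 0 in such an objective removes the pressure towards confident assignments, which is consistent with the degenerate uniform-selector behaviour reported above.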
F.2. Additional Results on Targeting Multiple Future Outcomes

Throughout the experiment in Section 5.5, we identified subgroups of patients that are associated with the next-year development of different comorbidities in the UKCF dataset. In Table S.5, we report the identified clusters and the full list of comorbidities developed in the year following the latest observation, together with the corresponding frequencies, which are calculated in a cluster-specific fashion based on the true labels (a sketch of this computation is given below).

As we can see in the table, the identified clusters displayed very different label distributions; that is, both the combination of comorbidities and their frequencies varied greatly across the clusters. For example, patients in Cluster 1 had a low risk of developing any comorbidity in the next year, whereas 85% of patients in Cluster 0 developed diabetes in the next year.
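As a concrete illustration, the cluster-specific frequencies reported in Table S.5 can be computed by averaging the true next-year binary labels within each cluster. The following sketch uses hypothetical variable names and synthetic data.

```python
import numpy as np
import pandas as pd

# Hypothetical inputs: true next-year comorbidity labels for N patients
# (binary, one column per comorbidity) and the cluster AC-TPC assigned
# to each patient at the latest observation.
rng = np.random.default_rng(0)
labels = pd.DataFrame(rng.binomial(1, 0.1, size=(500, 4)),
                      columns=["Diabetes", "Asthma", "ABPA", "Liver Disease"])
cluster_assignments = rng.integers(0, 12, size=500)

# Cluster-specific frequency: the mean of the binary labels within each
# cluster, i.e., the fraction of that cluster's patients who developed
# each comorbidity in the next year (as reported in Table S.5).
freq = labels.groupby(cluster_assignments).mean().round(2)
print(freq)
```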
Table S.5. The comorbidities developed in the next year for the identified clusters. The values in parentheses indicate the corresponding frequencies.
Cluster 0: Diabetes (0.85), Liver Enzymes (0.21), Arthropathy (0.14), Depression (0.10), Hypertension (0.08), Osteopenia (0.07), Intest. Obstruction (0.07), Cirrhosis (0.04), ABPA (0.04), Liver Disease (0.04), Osteoporosis (0.03), Hearing Loss (0.03), Asthma (0.02), Kidney Stones (0.01), Bone fracture (0.01), Hemoptysis (0.01), Pancreatitis (0.01), Cancer (0.00), Gall bladder (0.00), Colonic stricture (0.00), GI bleed – no var. (0.00), GI bleed – var. (0.00)

Cluster 1: Liver Enzymes (0.09), Arthropathy (0.08), Depression (0.07), Intest. Obstruction (0.06), Diabetes (0.06), Osteopenia (0.05), ABPA (0.04), Asthma (0.03), Liver Disease (0.03), Hearing Loss (0.03), Osteoporosis (0.02), Pancreatitis (0.02), Kidney Stones (0.02), Hypertension (0.01), Cirrhosis (0.01), Gall bladder (0.01), Cancer (0.01), Hemoptysis (0.00), Bone fracture (0.00), Colonic stricture (0.00), GI bleed – no var. (0.00), GI bleed – var. (0.00)

Cluster 2: ABPA (0.77), Osteopenia (0.21), Intest. Obstruction (0.11), Hearing Loss (0.10), Liver Enzymes (0.07), Diabetes (0.06), Depression (0.05), Pancreatitis (0.05), Liver Disease (0.04), Arthropathy (0.04), Asthma (0.03), Bone fracture (0.02), Osteoporosis (0.02), Hypertension (0.01), Cancer (0.01), Cirrhosis (0.01), Kidney Stones (0.01), Gall bladder (0.01), Hemoptysis (0.00), Colonic stricture (0.00), GI bleed – no var. (0.00), GI bleed – var. (0.00)

Cluster 3: Asthma (0.89), Liver Disease (0.87), Diabetes (0.29), Osteopenia (0.28), Liver Enzymes (0.24), ABPA (0.15), Osteoporosis (0.11), Hearing Loss (0.06), Arthropathy (0.05), Intest. Obstruction (0.05), Depression (0.04), Hypertension (0.03), Cirrhosis (0.02), Kidney Stones (0.02), Pancreatitis (0.02), Gall bladder (0.02), Cancer (0.01), Bone fracture (0.00), Hemoptysis (0.00), Colonic stricture (0.00), GI bleed – no var. (0.00), GI bleed – var. (0.00)

Cluster 4: Osteoporosis (0.76), Diabetes (0.43), Arthropathy (0.20), Liver Enzymes (0.18), Osteopenia (0.15), Depression (0.13), Intest. Obstruction (0.11), ABPA (0.11), Hearing Loss (0.09), Liver Disease (0.08), Hypertension (0.07), Cirrhosis (0.07), Kidney Stones (0.03), Asthma (0.02), Hemoptysis (0.02), Bone fracture (0.02), Gall bladder (0.01), Pancreatitis (0.01), Cancer (0.00), Colonic stricture (0.00), GI bleed – no var. (0.00), GI bleed – var. (0.00)

Cluster 5: Asthma (0.88), Diabetes (0.81), Osteopenia (0.28), ABPA (0.24), Liver Enzymes (0.22), Depression (0.15), Osteoporosis (0.14), Intest. Obstruction (0.12), Hypertension (0.10), Cirrhosis (0.10), Liver Disease (0.09), Arthropathy (0.08), Bone fracture (0.01), Hemoptysis (0.01), Pancreatitis (0.01), Hearing Loss (0.01), Cancer (0.01), Kidney Stones (0.01), GI bleed – var. (0.01), Gall bladder (0.00), Colonic stricture (0.00), GI bleed – no var. (0.00)

Cluster 6: Liver Disease (0.85), Liver Enzymes (0.37), Osteopenia (0.27), ABPA (0.09), Arthropathy (0.07), Diabetes (0.06), Intest. Obstruction (0.06), Osteoporosis (0.05), Depression (0.03), Asthma (0.03), Hearing Loss (0.03), Cirrhosis (0.02), Hemoptysis (0.02), Hypertension (0.01), Kidney Stones (0.01), Pancreatitis (0.00), Gall bladder (0.00), Bone fracture (0.00), Cancer (0.00), Colonic stricture (0.00), GI bleed – no var. (0.00), GI bleed – var. (0.00)

Cluster 7: ABPA (0.83), Diabetes (0.78), Osteopenia (0.25), Osteoporosis (0.24), Liver Enzymes (0.15), Intest. Obstruction (0.12), Liver Disease (0.09), Hypertension (0.07), Hearing Loss (0.07), Arthropathy (0.06), Depression (0.04), Cirrhosis (0.02), Asthma (0.01), Bone fracture (0.01), Kidney Stones (0.01), Hemoptysis (0.01), Cancer (0.00), Pancreatitis (0.00), Gall bladder (0.00), Colonic stricture (0.00), GI bleed – no var. (0.00), GI bleed – var. (0.00)

Cluster 8: Diabetes (0.94), Liver Disease (0.83), Liver Enzymes (0.43), Osteopenia (0.30), Hearing Loss (0.11), Osteoporosis (0.10), Intest. Obstruction (0.09), Cirrhosis (0.08), Depression (0.08), ABPA (0.07), Arthropathy (0.06), Hypertension (0.05), Kidney Stones (0.05), Asthma (0.02), Hemoptysis (0.01), Bone fracture (0.01), Cancer (0.00), Pancreatitis (0.00), Gall bladder (0.00), Colonic stricture (0.00), GI bleed – no var. (0.00), GI bleed – var. (0.00)

Cluster 9: Asthma (0.89), Osteopenia (0.26), ABPA (0.19), Arthropathy (0.14), Intest. Obstruction (0.11), Depression (0.08), Osteoporosis (0.08), Diabetes (0.06), Liver Enzymes (0.06), Hemoptysis (0.03), Hypertension (0.02), Liver Disease (0.02), Pancreatitis (0.02), Bone fracture (0.01), Cancer (0.01), Cirrhosis (0.01), Gall bladder (0.01), Hearing Loss (0.01), Kidney Stones (0.00), Colonic stricture (0.00), GI bleed – no var. (0.00), GI bleed – var. (0.00)

Cluster 10: Osteopenia (0.82), Diabetes (0.81), Arthropathy (0.23), Depression (0.19), Liver Enzymes (0.18), Hypertension (0.16), Hearing Loss (0.10), Liver Disease (0.10), Osteoporosis (0.10), Intest. Obstruction (0.09), ABPA (0.09), Kidney Stones (0.07), Cirrhosis (0.05), Asthma (0.01), Cancer (0.00), GI bleed – var. (0.00), Bone fracture (0.00), Hemoptysis (0.00), Pancreatitis (0.00), Gall bladder (0.00), Colonic stricture (0.00), GI bleed – no var. (0.00)

Cluster 11: Osteopenia (0.77), Liver Enzymes (0.18), Arthropathy (0.12), Depression (0.09), Hypertension (0.06), Diabetes (0.06), Hearing Loss (0.06), ABPA (0.05), Liver Disease (0.05), Osteoporosis (0.04), Intest. Obstruction (0.04), Cirrhosis (0.02), Asthma (0.02), Pancreatitis (0.02), Bone fracture (0.01), Cancer (0.01), Kidney Stones (0.00), Gall bladder (0.00), Colonic stricture (0.00), Hemoptysis (0.00), GI bleed – no var. (0.00), GI bleed – var. (0.00)
F.3. How Do the Temporal Phenotypes Change over Time?
In this subsection, we demonstrate run-time examples of how AC-TPC flexibly updates the cluster assignments over time with respect to the future development of comorbidities in the next year. Figure S.2 illustrates three representative patients (a schematic sketch of this run-time assignment procedure follows the figure):

• Patient A had diabetes from the beginning of the study and developed asthma as an additional comorbidity at t = 2. Accordingly, at t = 1, AC-TPC changed the temporal phenotype assigned to this patient from Cluster 0, which consists of patients who are very likely to develop diabetes but very unlikely to develop asthma in the next year, to Cluster 5, which consists of patients who are likely to develop both diabetes and asthma in the next year.

• Patient B had ABPA from the beginning of the study and developed diabetes at t = 5. Similarly, at t = 4, AC-TPC changed the temporal phenotype assigned to this patient from Cluster 2, which consists of patients who are likely to develop ABPA but not diabetes in the next year, to Cluster 7, which consists of patients who are likely to develop both ABPA and diabetes in the next year.

• Patient C had no comorbidity at the beginning of the study and developed asthma and liver disease as additional comorbidities at t = 3 and t = 6, respectively. AC-TPC changed the temporal phenotype assigned to this patient from Cluster 1 to Cluster 9 at t = 2, and then to Cluster 3 at t = 5. These changes were consistent with the actual development of asthma and liver disease, considering the distribution of comorbidity development in the next year: Cluster 1 consists of patients who are unlikely to develop any comorbidity in the next year, Cluster 9 of patients who are likely to develop asthma but not liver disease, and Cluster 3 of patients who are likely to develop both asthma and liver disease in the next year.
Figure S.2. An illustration of run-time examples of AC-TPC on three representative patients.
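To illustrate the mechanics behind Figure S.2, the following is a schematic sketch – with hypothetical module and variable names, not our released implementation – of how a trained encoder-selector pair re-assigns a patient to a temporal phenotype as each new observation arrives.

```python
import torch
import torch.nn as nn

class ACTPCSketch(nn.Module):
    """Schematic encoder-selector pair; architecture sizes are illustrative."""
    def __init__(self, num_features=10, hidden=50, K=12):
        super().__init__()
        self.encoder = nn.LSTM(num_features, hidden, batch_first=True)
        self.selector = nn.Sequential(nn.Linear(hidden, K), nn.Softmax(dim=-1))

    def assign_over_time(self, x):
        """x: (1, T, num_features), one patient's trajectory.
        Returns the cluster index selected at every time step."""
        z, _ = self.encoder(x)                # (1, T, hidden): encoding at each t
        pi = self.selector(z)                 # (1, T, K): assignment probabilities
        return pi.argmax(dim=-1).squeeze(0)   # greedy assignment per time step

model = ACTPCSketch()
trajectory = torch.randn(1, 7, 10)            # hypothetical patient with 7 visits
print(model.assign_over_time(trajectory))     # cluster index at t = 0, ..., 6
```

Because the LSTM hidden state summarizes the history up to each t, a newly observed comorbidity can shift the assignment probabilities and move the patient to a different phenotype, as in the three examples above.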
References
Fortuin, V., Hüser, M., Locatello, F., Strathmann, H., and Rätsch, G. SOM-VAE: Interpretable discrete representation learning on time series. In Proceedings of the 7th International Conference on Learning Representations (ICLR 2019), 2019.

Konda, V. R. and Tsitsiklis, J. N. Actor-critic algorithms. In Proceedings of the 13th Conference on Neural Information Processing Systems (NIPS 2000), 2000.

Yang, B., Fu, X., Sidiropoulos, N. D., and Hong, M. Towards k-means-friendly spaces: Simultaneous deep learning and clustering. In Proceedings of the 34th International Conference on Machine Learning (ICML 2017), 2017.