Clustering methods and Bayesian inference for the analysis of the evolution of immune disorders
A. Carpio∗, A. Simón†, L.F. Villa‡

September 25, 2020

∗ Universidad Complutense de Madrid, Spain
† Universidad Complutense de Madrid and Universidad de Granada, Spain
‡ Hospital Universitario Puerta de Hierro de Madrid, Spain
Abstract.
Choosing appropriate hyperparameters for unsupervised clustering algorithms is a long-standing challenge in data analysis, which we tackle here while adapting clustering algorithms to the diagnosis of immune disorders. We compare the ability of unsupervised clustering algorithms, with different hyperparameter choices, to detect disease flares and remission periods through the analysis of laboratory data from the records of systemic lupus erythematosus (SLE) patients. To determine which clustering strategy is the best one, we resort to a Bayesian analysis of rankings based on the Plackett-Luce model. This analysis quantifies the uncertainty in the choice of clustering method for a given problem.
1 Introduction

Since the early days of the introduction of mathematical methods for medical diagnosis [7], remarkable advances have been made. Nowadays, the increasing availability of medical data related to all sorts of illnesses is fostering the development of machine learning techniques [2] for medical diagnosis and treatment. While neural networks are often used for image based diagnosis [25, 16], supervised and unsupervised clustering techniques [22] are now widely employed to investigate the role of genes in sickness [17, 24, 4, 20] and to study the response to therapies [15, 6], as well as for assisted clinical diagnosis using information from digital devices [18, 21]. Developing tools to assess the reliability of such automatic procedures and to choose the best method for different situations and clinical environments has become essential [18].

We consider here the applicability of unsupervised clustering techniques to identify stages in time-dependent series of clinical data. More precisely, we focus on the study of immune disorders, such as SLE, which are difficult to diagnose and treat properly because many symptoms are nonspecific and change throughout the evolution of the disease. SLE is a chronic autoimmune disease in which the immune system attacks healthy tissues by mistake [1]. This attack causes inflammation and, in some cases, permanent tissue damage. Parts of the body commonly affected include the skin, joints, heart, lungs, kidneys, red bone marrow (blood cell formation) and brain. Symptoms of lupus vary between people and may be mild to severe, affect one area of the body or many, come and go, and change over time. They include painful and swollen joints, fever, fatigue, chest pain when breathing deeply, hair loss, mouth ulcers, swollen glands, a red rash (typically on the face) and cardiac, renal or neural symptoms.

The cause of SLE is unknown. It is thought to involve genetics as well as environmental factors. Female sex hormones, race, age between 15 and 45 and family history appear to be risk factors for SLE development. Besides, emotional or physical stress, sunlight exposure, viral infections, certain drugs, pregnancy, giving birth, smoking or vitamin D deficiency can trigger disease flares. There is currently no cure for lupus. Treatments include nonsteroidal anti-inflammatory drugs, corticosteroids, immunosuppressants, hydroxychloroquine, and immunomodulatory monoclonal antibodies. However, dealing with a chronic disease raises concerns about long-term adverse effects. Several treatments for lupus have attracted great attention recently due to their applicability to covid-19 patients. In fact, serious covid-19 cases develop similar hyperimmune responses [9] which damage the patient's tissues and may cause death.

Lupus patients go through periods of illness, called flares, and periods of wellness, called remission. The symptoms during flares vary. It is essential to be able to distinguish early when the patient is transitioning from remission to flares, as well as which factors are causing it. A great deal of information is contained in laboratory tests. Our purpose is to develop mathematical methods to process it automatically from time series of such tests. We show here that it could be possible to automatically detect transition days by applying clustering techniques.
Different clustering strategies may produce variable results on the same datasets. Therefore, it is important to be able to assess which algorithms perform better in given test problems. We show that a Bayesian analysis of rankings of the performance of clustering algorithms on medical datasets provides information on the probability of each clustering algorithm being the best.

The rest of the paper is organized as follows. Section 2 describes the datasets under study. Sections 3, 4 and 5 apply K-Means, hierarchical clustering and density based spatial clustering to the selected datasets. Section 6 explains how to construct performance rankings to estimate the probability of a particular clustering procedure and hyperparameter choice being the most adequate one to identify automatically the transition from remission to flares, and other stages in the patient evolution, from time series of clinical data. We summarize our conclusions in Section 7. A final Appendix details the clinical variables under consideration.

2 Clustering clinical data
Laboratory tests are often used to diagnose lupus, since the illness involves an immune response by antibodies against the patient's own body. In addition, laboratory tests play a key role when detecting the transition to flares and identifying the kind of health disorder that is building up and needs to be treated. Tables 5-7 in the Appendix list some variables usually monitored. We will work here with 28 patient records, choosing one dataset to illustrate the outcome of the clustering procedures and using all datasets for the Bayesian study of the probability of a specific method being the best one to identify illness stages in these patient records.

All datasets are normalized by subtracting the mean from each variable and dividing by three times the variance. After that, we obtain normalized time series of clinical data as represented in Figure 1. A difficulty in dealing with time series of clinical data is that some measurements are usually missing. White boxes mark missing data. We eliminate from our study all variables with more than 50% of the measurements missing. For the remaining variables, we fill empty boxes with the average value of the variable over the remaining days. Variables labeled as 17, 18, 28, 29, 30, 47, 48 in Tables 5-7 are suppressed. At the end, our dataset is formed by 29 columns (days) and 65 rows (variables). From this set, we may also eliminate six more variables with essentially zero values. Time measurements are not consecutive (they are not recorded on consecutive days) but extend over months. The label attached to the days only indicates time ordering. Visually, we can identify three groups of days. Group 1 is formed by days 1 to 16, group 2 is formed by days 17 to 22, and group 3 is formed by days 23 to 29. The complementary representation in Figure 2 shows how certain variables get out of control as time advances, and later start coming back to normal, responding to treatment. These observations provide the motivation to look for clusters in clinical data.

Automatic clustering techniques may produce clusters even when the original data contain no significant clusters. The Hopkins criterion [11] allows us to establish whether the dataset contains relevant clusters. Given a set $D$, the Hopkins statistic is obtained as follows:

1. We extract a uniform sample $(p_1, \ldots, p_n)$ from $D$ formed by $n$ points.
2. For each point $p_i$ in the sample, we find its closest neighbor $p_j$ in $D$ and denote the distance $x_i = \mathrm{dist}(p_i, p_j)$.
3. We generate a random sample $(q_1, \ldots, q_n)$ with $n$ points, which we call $D_{\mathrm{random}}$, from a uniform distribution keeping the same variance as the original set $D$.
4. For each $q_i \in D_{\mathrm{random}}$, we find its closest neighbor $q_j$ in $D$ and denote $y_i = \mathrm{dist}(q_i, q_j)$. Then
$$H = \frac{\sum_{i=1}^{n} y_i}{\sum_{i=1}^{n} x_i + \sum_{i=1}^{n} y_i}.$$

If $D$ were uniformly distributed, then the sums $\sum_{i=1}^{n} y_i$ and $\sum_{i=1}^{n} x_i$ would be similar, and $H$ would be close to 0.5. However, if there are clusters present in $D$, the distances to the artificial points ($\sum_{i=1}^{n} y_i$) would be larger than those to the true points ($\sum_{i=1}^{n} x_i$) and $H$ would be larger than 0.5. The larger $H$ is, the more likely the presence of clusters in the data. For our dataset, $H$ lies well above 0.5, which supports the search for clusters.
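As an illustration, the following sketch computes the Hopkins statistic along the lines of the steps above (Python with NumPy and scikit-learn). The data matrix is a hypothetical placeholder, and the synthetic sample is drawn uniformly over the range spanned by the data, a common simplification of step 3.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def hopkins(D, n=None, seed=0):
    """Hopkins statistic of a data matrix D (rows = observations).
    H close to 0.5 suggests uniformly spread data; values well above
    0.5 suggest the presence of clusters."""
    rng = np.random.default_rng(seed)
    D = np.asarray(D, dtype=float)
    n = n or min(30, len(D) - 1)

    # Steps 1 and 3: sample n real points and n synthetic points drawn
    # uniformly over the range spanned by the data.
    sample = D[rng.choice(len(D), size=n, replace=False)]
    synthetic = rng.uniform(D.min(axis=0), D.max(axis=0), size=(n, D.shape[1]))

    # Steps 2 and 4: nearest-neighbor distances. For the real sample we ask
    # for two neighbors and keep the second, skipping the point itself.
    nn = NearestNeighbors(n_neighbors=2).fit(D)
    x = nn.kneighbors(sample)[0][:, 1]                    # to true points
    y = nn.kneighbors(synthetic, n_neighbors=1)[0][:, 0]  # synthetic to D

    return y.sum() / (x.sum() + y.sum())

# Example on hypothetical data: two well separated blobs.
rng = np.random.default_rng(1)
D = np.vstack([rng.normal(0, 1, (50, 5)), rng.normal(8, 1, (50, 5))])
print(f"H = {hopkins(D):.2f}")  # clearly above 0.5 for clustered data
```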
3 K-Means clustering

K-Means [14] is one of the most widely used unsupervised clustering algorithms. The idea is to group observations in clusters so that the total intra-cluster variation is minimized.
Given an observation $x_i = (x_{i,1}, \ldots, x_{i,M})$ in an $M$ dimensional space, a cluster $C_j$ of points in the same space and the cluster centroid $\mu_j$, we define the intra-cluster total variation as
$$\sum_{j=1}^{K} W(C_j) = \sum_{j=1}^{K} \sum_{x_i \in C_j} d(x_i, \mu_j)^2,$$
where $d$ represents the euclidean distance
$$d(x_i, \mu_j) = \sqrt{\sum_{\ell=1}^{M} (x_{i,\ell} - \mu_{j,\ell})^2}.$$
The centroids of each cluster $C_j$ with $|C_j|$ observations are defined as the averages of the observations in the cluster, that is, $\mu_j = \sum_{x_i \in C_j} x_i / |C_j|$. The intra-cluster total variation can be interpreted as a measure of the cluster compactness. Each term $W(C_j) = \sum_{x_i \in C_j} d(x_i, \mu_j)^2$ is the intra-cluster variation for a single cluster, where the $x_i$ are the points belonging to the cluster $C_j$. In our case, each observation is formed by the measurements of the $M$ clinical variables on a specific day.

Once the number $k$ of clusters to be formed is specified, the K-Means algorithm proceeds in the following steps:

1. We initialize the centroids $\mu_j$ by generating $k$ random points in the $M$ dimensional space.
2. Each observation $x_i$ is assigned to the closest centroid according to the euclidean distance.
3. For each cluster, we update the centroid as the average of the cluster observations.
4. We minimize the total intra-cluster variation iteratively. To do so, we iterate the previous steps until the clusters do not change or the maximum number of iterations is surpassed.

The main drawback of this algorithm is the need to know the number of clusters beforehand. We describe some strategies to estimate it next.

Selection of the number of clusters

Two methods can be used to estimate the number of clusters in K-Means: the Elbow method and the Silhouette method. While the Elbow method favors cluster compactness, the Silhouette analysis opts for cluster separation when selecting $k$.

The Elbow method is based on running K-Means for different choices of the number of clusters. Each run stores the total intra-cluster variation. The number of clusters $k$ is selected in such a way that the total intra-cluster variation does not diminish noticeably for $k+1$. This can be visualized graphically, plotting the total intra-cluster variation as a function of $k$. When applied to our reference clinical dataset, we find Figure 3(a), which suggests small values such as $k = 3$ or $k = 4$.

Figure 3: (a) Elbow method: total intra-cluster variation as a function of the number of clusters $k$. (b) Silhouette method: Silhouette coefficient as a function of the number of clusters $k$.

The Silhouette analysis [13] measures cluster quality. This method determines the quality of a cluster by estimating how well each point fits in its cluster. This is done by calculating the mean distance from each point to the other clusters, the so-called Silhouette coefficient. Choosing the number of clusters that maximizes the Silhouette coefficient, we guarantee a sharp separation between clusters. Figure 3(b) represents the Silhouette coefficient as a function of $k$ for our reference dataset. The value maximizing the coefficient is $k = 4$, though $k = 3$ is only slightly worse.

The results obtained running K-Means for $k = 3$ and $k = 4$ clusters are represented in Figures 4 and 5. As Figure 4 shows, the algorithm is able to identify groups of days which are almost consecutive. The first (red) group includes days 21 to 29. The second (green) group includes days 1 to 16, skipping day 15. Finally, the third (blue) group includes days 17 to 20, and also day 15.

Figure 4: Clusters obtained with K-Means when k = 3.

Notice that day 15 is an anomaly for K-Means. Let us see the behavior for $k = 4$. Again, Figure 5 shows that the algorithm is able to identify groups of days. The first (red) group includes days 1 to 13. The second (green) group includes days 17 and 18. The third (blue) group includes days 24 to 29. Finally, the fourth (magenta) group includes days 14 to 16, plus days 19 to 23. This time days 17 and 18 are an anomaly for K-Means. As a conclusion, days 15, 17 and 18 are difficult to explain for this algorithm. This may mean that they are days at which the patient's condition changes.

Figure 5: Clusters obtained with K-Means when k = 4.
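For reference, a minimal sketch of this procedure with scikit-learn follows; the data matrix is a random placeholder standing in for the normalized days × variables table, and `inertia_` is scikit-learn's name for the total intra-cluster variation used by the Elbow method.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Placeholder for the normalized days x variables matrix of the text.
X = np.random.default_rng(0).normal(size=(29, 59))

# Elbow and Silhouette curves: run K-Means for several values of k.
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # inertia_ is the total intra-cluster variation of the final clusters.
    print(f"k={k}: intra-cluster variation={km.inertia_:.1f}, "
          f"silhouette={silhouette_score(X, km.labels_):.3f}")

# Final run with the selected number of clusters.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print("cluster assigned to each day:", labels)
```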
4 Hierarchical clustering

Hierarchical clustering [23] is a popular strategy for unsupervised learning. We will use here the agglomerative version of the algorithm, which works bottom up. Each element is initially considered a cluster by itself. A cluster formed by a single element is a leaf. At each step, the two most similar clusters merge to form a bigger one, called a node. This process is repeated progressively until all the points are combined in a big cluster, the tree.
This algorithm requires a similarity measure between elements and a strategy to merge clusters. Different distances can be selected as similarity measures. Here we use the euclidean distance. The elements to be compared are the values of the $M$ variables for different days. Considering them points in an $M$ dimensional space, we can compute the euclidean distance between them. Regarding the clustering strategy, we calculate the distances between all the elements and use the 'complete' approach. Each node's height in the tree represents the distance between the two subnodes merged at that node. All leaves below any node whose height is less than a threshold $C$ are grouped into a cluster (a singleton if the node itself is a leaf). This process is represented in a dendrogram, a graph representing how the clusters merge until they form the tree that contains them all. As we move upwards, the most similar clusters combine in branches that merge later at a higher height, known as the cophenetic distance between the clusters. The higher that height, the more different the clusters are.

Once a hierarchical tree is built, we have to check whether it is representative of our set of data, that is, whether the heights represent the original distances with reasonable accuracy. To do so, we calculate the correlation between the cophenetic distance and the original distance used to check similarity between objects, the euclidean distance in our case. If the correlation coefficient is large, say larger than 0.75, the tree is considered a good representation of the dataset. Selecting a particular height to cut the tree, we obtain different numbers of clusters.
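A minimal sketch of this construction with SciPy, on placeholder data, could read as follows; `cophenet` returns the cophenetic correlation used to validate the tree and `fcluster` cuts it at a threshold height $C$.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet, fcluster
from scipy.spatial.distance import pdist

# Placeholder for the normalized days x variables matrix of the text.
X = np.random.default_rng(1).normal(size=(29, 59))

dists = pdist(X, metric="euclidean")       # pairwise euclidean distances
tree = linkage(dists, method="complete")   # agglomerative, 'complete' strategy

# Cophenetic correlation: how faithfully the tree heights reproduce the
# original distances; above roughly 0.75 the tree is considered reliable.
coph_corr, _ = cophenet(tree, dists)
print(f"cophenetic correlation: {coph_corr:.2f}")

# Cutting the tree at a threshold height C yields flat clusters.
C = 12.0  # hypothetical threshold height
labels = fcluster(tree, t=C, criterion="distance")
print("cluster assigned to each day:", labels)
```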
In this section we use hierarchical clustering to select the onset of severe illness periods in time series of measurements of the clinical variables of lupus patients. The dendrograms in Figures 6 and 7 show the outcome of applying agglomerative hierarchical clustering based on the euclidean distance, cut at different heights. The resulting tree is a good representative of the dataset, since the correlation between the cophenetic distance and the euclidean distance is larger than 0.75. We select the hyperparameter, that is, the height, in such a way that we obtain the numbers of clusters we considered with K-Means. In the first case, choosing a threshold height to have 3 clusters, days 15, 17 and 18 are singled out. In the second one, with 4 clusters, day 17 is marked as a possible onset. Days 15, 17 and 18 are thus again automatically identified as days at which the patient's condition may change significantly.

Figure 6: Dendrogram representing 3 clusters in the time sequence of clinical variables.

Figure 7: Dendrogram representing 4 clusters in the time sequence of clinical variables.
Both K-Means and hierarchical clustering select the same days to mark a strong alteration in the status of a patient. We can analyze the clinical variables by a combination of both.

Figure 8: Heatmap of the matrix relating the K-Means centroids for the k = 3 clusters of days and the clinical variables. We have superimposed a dendrogram classifying the variables in groups according to their influence on the cluster centroids.

The K-Means algorithm provides a matrix describing how each variable influences the final centroids. We represent this matrix by means of a heatmap for $k = 3$ clusters, see Figure 8. Then, we superimpose a dendrogram grouping the variables by similarity. In this way, we distinguish 6 groups of variables according to their interactions through the clusters. Considering as high values the largest positive ones, as intermediate values those about 0, and as low values the largest negative ones, we observe:

• The low values of group 1 identify the first cluster, while intermediate values are typical of the second and third clusters.
• The low values of group 2 identify the first cluster, while intermediate values are typical of the second cluster and large values of the third cluster.
• The large values of group 3 identify the second cluster, while small values are typical of the first and second clusters.
• The large values of group 4 identify the second cluster, while small values are typical of the third cluster and intermediate values of the first cluster.
• The low values of group 5 identify the second cluster, while intermediate-large values are typical of the first and third clusters.
• The large values of group 6 identify the first cluster, while intermediate-small values are typical of the second and third clusters.

A few variables remain almost constant through the clusters and we have left them unassigned. This shows that they have little effect on the overall clustering results and we may as well suppress them. These relations are illustrated in Table 1. Variables for which the heat map reports large positive values within the cluster affect the centroid strongly: its coordinate in the direction of that variable is large. Considering variables for which the heat map reports small values (that is, large negative values) within the cluster, the centroid varies strongly in the opposite direction. In practice, this information would allow us to track automatically which variables play relevant roles in each cluster of days and motivate the change.
           Large values   Intermediate values   Low values
Group 1    -              2, 3                  1
Group 2    3              2                     1
Group 3    2              -                     1, 2
Group 4    2              1                     3
Group 5    1, 3           1, 3                  2
Group 6    1              2, 3                  2, 3

Table 1: Clusters (1, 2, 3) in which the large, intermediate and low values of each group of variables are typical, as described in the list above.
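A possible sketch of this variable-grouping step, on placeholder data, combines the K-Means centroid matrix with a hierarchical clustering of its columns, mirroring the dendrogram superimposed on the heatmap of Figure 8; the choice of 6 groups follows the text.

```python
import numpy as np
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import linkage, fcluster

# Placeholder for the normalized days x variables matrix of the text.
X = np.random.default_rng(2).normal(size=(29, 59))

# The k x M centroid matrix describes how each variable influences the
# centroid of each cluster of days (the heatmap of Figure 8).
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
centroids = km.cluster_centers_

# Group the variables (columns) by the similarity of their centroid
# profiles, as the dendrogram superimposed on the heatmap does.
tree = linkage(centroids.T, method="complete", metric="euclidean")
groups = fcluster(tree, t=6, criterion="maxclust")  # 6 groups of variables
for g in range(1, groups.max() + 1):
    print(f"group {g}: variables {np.where(groups == g)[0].tolist()}")
```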
5 Density based spatial clustering

The Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm usually allows us to detect clusters of any structure, even in the presence of noise and outliers. This technique is based on the spatial density of points, following human intuition to identify clusters. For instance, in Figure 9, we visually spot four clusters in spite of the presence of noise, thanks to the point density variations.

Figure 9: Density based cluster identification in the presence of noise.

The method looks for high density regions and assigns clusters to them, while points in less dense regions are left outside and become anomalies. This is the main advantage of this algorithm: being able to detect outliers. Another advantage is that methods such as K-Means and hierarchical clustering are designed to find spherical or convex clusters, that is, they work well when we must find well separated and compact clusters, which is not always the case. Moreover, K-Means also needs to classify all the points in one cluster, forcing the introduction of weird criteria to detect outliers.
The DBSCAN algorithm is governed by two hyperparameters:

• ε: smallest distance for two points to be considered neighbors.
• MinPts: minimum number of points required to form a cluster.

According to them, we distinguish three types of points:

• Core point: any point with a number of neighbors greater than or equal to the fixed minimum value MinPts (including itself).
• Border point: any neighbor of a core point with a number of neighbors smaller than MinPts.
• Outlier: any point which is neither a core nor a border point.

Figure 10 illustrates the three types of points for a given ε and MinPts = 6. Point x is a core point: it has at least 5 neighbors at a distance smaller than ε, a total of 6 points counting x itself. On the other hand, y is a border point, since its number of neighbors is smaller than 6, but it is a neighbor of the core point x. Finally, z is an outlier. Although it is a neighbor of y, it is not a neighbor of any core point and it has fewer than 6 neighbors.

Figure 10: Cluster detection in the presence of noise by density criteria.

Before describing the algorithm in detail, we need to distinguish three concepts:

• Directly reachable points: a point A is directly reachable from B when it is a neighbor of B and B is a core point.
• Reachable points: a point A is reachable from B when we can find a chain of core points leading from B to A.
• Connected points: two points A and B are connected when there is a core point C such that A and B are reachable from C.

A density based cluster is a group of connected points. The DBSCAN algorithm works as follows (a code sketch is given after this list):

1. For each point x_i, we calculate the distance between x_i and the remaining points. Then, we find all the neighbors within the radius ε and mark as core points those with a number of neighbors greater than or equal to MinPts.
2. For each core point not yet assigned to a cluster, we create a new cluster. Next, we search for all the points connected to that core point and assign them to that cluster.
3. We repeat those steps over the remaining set of points.
4. The points not assigned to any cluster after this process are considered outliers.
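A minimal sketch with scikit-learn's implementation, on placeholder data, could look as follows; points labeled -1 are the outliers that we later interpret as candidate transition days.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Placeholder for the normalized days x variables matrix of the text.
X = np.random.default_rng(3).normal(size=(29, 59))

# eps plays the role of epsilon and min_samples the role of MinPts
# (scikit-learn counts the point itself, as in the definition above).
db = DBSCAN(eps=3.0, min_samples=3).fit(X)

labels = db.labels_                      # label -1 marks outliers
outlier_days = np.where(labels == -1)[0]
print("cluster assigned to each day:", labels)
print("outlier days (candidate transition days):", outlier_days.tolist())
```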
Determining adequate values for ε and MinPts is a difficult task, strongly conditioned by the structure of the dataset we are working with. In general, there is no automatic procedure to do so. The main problems caused by a poor choice are:

• If the value of ε is too small, there will not be enough points to form clusters and most points risk being classified as outliers. On the other hand, if ε is too large, most points will be classified in clusters and we will not be able to identify the outliers.
• If MinPts is too large, too many points are required to form a cluster, and dense regions may be classified as outliers. When it is too small, low density regions would appear as clusters and outliers would remain undetected.

These values have to be carefully tuned to detect meaningful outliers. Our choice will be to find optimal values of ε given MinPts. We start calculating the means of the distances to the k closest points and represent them in increasing order. Turning points will mark thresholds for sharp changes and will provide candidate values for ε.

In this section we explain how to use the clustering algorithm DBSCAN [8] to select the onset of severe illness periods in time series of measurements of the clinical variables of lupus patients. We select ε after fixing the hyperparameter MinPts = 3, the minimum number of days in the clusters usually observed in our previous studies. The resulting graph of ascending 3-distances is shown in Figure 11.

Figure 11: Graph of ascending 3-distances for the reference dataset of medical variables.

We appreciate two turning points, at ε = 3 and ε = 4. However, ε = 4 is too large and all the points form a single cluster. We therefore run the algorithm with ε = 3 and ε = 3.5. When MinPts = 3 and ε = 3, there are five outliers. The first one (day 15) marks the correct onset of the sickness period, see Figure 12(a). On the other hand, when MinPts = 3 and ε = 3.5, we find a single outlier (day 17), two days later, see Figure 12(b). This technique confirms what we have previously observed: the clinical variables are strongly altered between days 15 and 17.

Figure 12: (a) Clusters and outliers with DBSCAN and ε = 3. (b) Clusters and outliers with DBSCAN and ε = 3.5.
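The ε-selection procedure just described can be sketched as follows (placeholder data); the turning points of the sorted mean k-distance curve are read off the plot, as in Figure 11.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def k_distance_curve(X, k=3):
    """Mean distance from each point to its k closest points, sorted in
    increasing order; sharp turning points of this curve are candidate
    values for epsilon."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)  # +1 skips the point itself
    dists, _ = nn.kneighbors(X)
    return np.sort(dists[:, 1:].mean(axis=1))

# Placeholder for the normalized days x variables matrix of the text.
X = np.random.default_rng(4).normal(size=(29, 59))
curve = k_distance_curve(X, k=3)  # MinPts = 3, as in the text
print(np.round(curve, 2))         # plot and look for turning points
```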
6 Bayesian inference for clustering performance
For the same set of data, results vary using different clustering procedures. Thus, comparing the performance of different clustering strategies to extract useful information from clinical datasets is an important task. In the previous sections, we have illustrated the use of clustering strategies on a dataset to obtain information on the phases of the illness of a patient, in particular, to characterize the onset of deterioration and recovery periods. Next, we describe how to infer in an automatic way which methods are more adequate for particular collections of clinical records by combining the construction of rankings on smaller collections of selected datasets with Bayesian analysis.
The idea is to run the clustering methods to be compared on a number of datasets for which the diagnosis is known, and then rank them according to their performance. Finally, we analyze the results using a Plackett-Luce (PL) Bayesian model [5, 12]. This model is well adapted to this type of problem and differs from other approaches for continuous problems [3].

An advantage of PL models is that the normalized parameters directly represent the marginal probability of an algorithm being placed in first position. Another advantage is that the model relies on a finite number of parameters, as many as algorithms we compare. It also fulfills Luce's axiom: 'The probability of an algorithm A being placed before algorithm B is the same independently of the remaining algorithms'.

The PL model combines three ingredients:

1. It selects randomly the algorithm to be positioned.
2. Each algorithm has a weight $w_i$, and the probability to select an algorithm at each stage is the ratio between its weight and the sum of the weights of the remaining algorithms.
3. Representing by $\sigma = (\sigma_1, \ldots, \sigma_n)$ a ranking of size $n$, where $\sigma_i = j$ implies that the $j$-th algorithm is located at the $i$-th position, and by $w = (w_1, \ldots, w_n)$ the vector of weights, the PL probability of that ranking is
$$P_{PL}(\sigma) = \prod_{i=1}^{n} \frac{w_{\sigma_i}}{\sum_{j=i}^{n} w_{\sigma_j}}.$$
For simplicity, we assume that the sum of the weights is 1.

More precisely, the process, schematized in Figure 13, is the following:

1. Run the clustering algorithms over $m$ datasets. Choosing a way to evaluate the performance of the algorithms, we will have a matrix $M$ in which $M(i,j)$ is the performance of algorithm $j$ on dataset $i$.
2. Assign a position in the ranking to each algorithm: the greater the performance, the higher the position in the ranking. In this way, we obtain a matrix $R$, where $R(i,j)$ is the position in the ranking of algorithm $j$ applied to dataset $i$.
3. A given matrix $R$ gives partial information on the likelihood of different clustering algorithms performing better than the rest. We quantify the uncertainty in such conclusions using the Bayes relation for conditional probabilities
$$P(w \mid R) \propto P(w) \cdot P(R \mid w),$$
where $w = (w_1, \ldots, w_n)$ denotes the PL parameters. Since we are assuming $\sum_{i=1}^{n} w_i = 1$, they represent the probability of having a given algorithm in the first position. We set
$$P(R \mid w) = \prod_{\sigma \in R} P_{PL}(\sigma, w), \qquad P(w) = \mathrm{Dir}(w, \alpha),$$
where the Dirichlet distribution models the uncertainty in the weights.
4. Since the posterior distribution does not have a closed form, we sample it using Markov Chain Monte Carlo (MCMC) methods [10]. From the set of samples, we visualize the uncertainty in the probability of a clustering strategy being the best by means of histograms or expected values.

Figure 13: Scheme for the Bayesian analysis of clustering ranking performance.

The Dirichlet distribution $\mathrm{Dir}(\alpha)$ is a family of multivariate distributions parametrized by a vector $\alpha$ of positive real numbers. It generalizes the Beta distribution. The Dirichlet distribution of order $K \geq 2$ with parameters $\alpha_1, \ldots, \alpha_K$ has density function
$$f(x_1, \ldots, x_K; \alpha_1, \ldots, \alpha_K) = \frac{1}{B(\alpha)} \prod_{i=1}^{K} x_i^{\alpha_i - 1},$$
where $\sum_{i=1}^{K} x_i = 1$ and $x_i \geq 0$ for $i \in [1, K]$. The normalization constant is a multivariate Beta function expressed in terms of Gamma functions as
$$B(\alpha) = \frac{\prod_{i=1}^{K} \Gamma(\alpha_i)}{\Gamma\left(\sum_{i=1}^{K} \alpha_i\right)}, \qquad \alpha = (\alpha_1, \ldots, \alpha_K).$$
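As an illustration, a minimal Metropolis-Hastings sampler for this posterior can be written with NumPy and SciPy alone. The rankings, the proposal concentration `conc` and the toy data below are hypothetical choices, not the ones used in the paper; with the uniform Dirichlet prior (α = 1) adopted below, the prior density is constant and cancels in the acceptance ratio.

```python
import numpy as np
from scipy.stats import dirichlet

def pl_log_likelihood(w, rankings):
    """Log Plackett-Luce likelihood of a set of rankings, where
    rankings[r][i] is the index of the algorithm placed at position i."""
    ll = 0.0
    for sigma in rankings:
        v = w[sigma]  # weights in ranked order
        ll += np.sum(np.log(v) - np.log(np.cumsum(v[::-1])[::-1]))
    return ll

def sample_posterior(rankings, n_alg, n_samples=5000, conc=200.0, seed=0):
    """Metropolis-Hastings sampler for P(w | R). With the uniform
    Dirichlet prior (alpha = 1) the prior is constant and cancels in the
    acceptance ratio. Proposals are Dirichlet distributions concentrated
    around the current weights, with the Hastings correction."""
    rng = np.random.default_rng(seed)
    w = np.full(n_alg, 1.0 / n_alg)
    samples = []
    for _ in range(n_samples):
        w_new = rng.dirichlet(conc * w)
        if np.all(w_new > 0):
            log_acc = (pl_log_likelihood(w_new, rankings)
                       - pl_log_likelihood(w, rankings)
                       + dirichlet.logpdf(w, conc * w_new)   # q(w | w_new)
                       - dirichlet.logpdf(w_new, conc * w))  # q(w_new | w)
            if np.log(rng.uniform()) < log_acc:
                w = w_new
        samples.append(w)
    return np.array(samples)

# Toy rankings of 3 algorithms: algorithm 0 is usually placed first.
R = np.array([[0, 1, 2], [0, 2, 1], [0, 1, 2], [1, 0, 2]])
post = sample_posterior(R, n_alg=3)
print("P(algorithm ranked first):", post[2000:].mean(axis=0))  # burn-in discarded
```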
Since we lack information on the performance of the algorithms for general datasets, we use a uniform distribution for the hyperparameter $\alpha$, that is, $\alpha_i = \alpha = 1$, $i = 1, \ldots, n$.

We apply our clustering techniques to identify groups of days reflecting different stages of the evolution of the patient. We consider the clinical records of 28 patients which have been previously diagnosed. In many cases there are sickness periods which require special medical care, starting on a known specific day. This results in a time series of measurements of clinical variables which display a different behavior before and after it. A way to quantify the performance of the different algorithms on the datasets is to check how well they predict the onset of that flare period. Therefore, we need to define an automatic criterion to identify it based on the different clustering strategies:

• For the DBSCAN algorithm, we select the first day not assigned to any cluster, that is, the first outlier.
• For K-Means, we choose the first day which is classified out of sequence, that is, which has no neighbors in the same cluster. When such a day is not found, we choose the first day of the smallest cluster.
• For hierarchical clustering, we choose the smallest of the last isolated points to merge with existing clusters. If not found, we choose the first day within the smallest cluster.

Performance is quantified by means of the difference between the day estimated from the clustering analysis and the known true transition day.

Let us revise the clustering analysis performed on our test dataset and apply these criteria. The onset of an instability period for the patient is defined by day 15.
• For DBSCAN with MinPts = 3 and ε = 3, we had five outliers in Figure 12(a). The smallest corresponds to day 15. The distance is D = 0. When MinPts = 3 and ε = 3.5, we had a single outlier in Figure 12(b), day 17, two days later. The distance is D = 2.
• K-Means with k = 3 introduces the first temporal mismatch on day 15, assigning this day to the third cluster (blue) instead of the second cluster (green), see Figure 4. The distance is D = 0. When k = 4, we have a mini-cluster with days 17 and 18. The distance is D = 2. When k = 5, the first temporal mismatch happens for day 11, and the distance is D = 4.
• Hierarchical clustering with 4 clusters, as depicted in Figure 7, singles out day 17. Therefore the distance is D = 2. Figure 6 considers 3 clusters instead. The smallest day in the smallest cluster is 15, thus the distance is D = 0.

We have applied these methods to 28 different clinical datasets, quantifying for each one the distance between the predicted and true transition days. We have excluded some datasets for several reasons. On three of them, all the algorithms gave the same answer. Three more are discarded because the Hopkins statistic is too small to support looking for clusters: the variables just fluctuate. Three additional datasets are too complicated to analyze, since they seem to present several periods of flares and remission and should be divided into smaller periods. Thus, we study the remaining 19 datasets, which present only one transition. The results are represented in Table 2, where HC stands for hierarchical clustering (3C with 3 clusters, 4C with 4 clusters) and KM for K-Means.

Distance to       KM    KM    KM    HC    HC    DBSCAN
transition day    k=3   k=4   k=5   3C    4C    MinPts=3, ε=3
D15               0     2     4     0     2     0
D4                0     1     1     0     0     0
D9                0     6     6     9     9     2
D3                2     2     2     0     0     0
D1                2     2     0     2     2     0
D1                6     10    10    0     0     0
D16               6     6     9     19    19    15
D14               0     0     11    0     0     0
D5                0     1     1     0     0     0
D9                0     5     5     3     5     5
D19               0     0     16    17    0     17
D50               41    41    41    4     0     0
D1                3     0     0     0     0     0
D23               22    22    41    0     0     21
D17               0     9     9     0     0     12
D19               14    14    14    0     7     18
D11               6     0     0     0     0     4
D3                21    14    0     19    21    0
D12               11    0     0     11    0     10

Table 2: Distances to the transition day for the different algorithms.

Based on those distances, we build the ranking presented in Table 3. We assign a higher position in the ranking to smaller distances. The smallest possible distance is D = 0. Smallest distances rank first. Ties are solved by assigning the same position to tied algorithms and freeing the next positions in equal number. Following this procedure, we obtain Table 3; a sketch of this construction is given at the end of the section.

Table 3: Rankings of the algorithms (KM k=3, KM k=4, KM k=5, HC 3C, HC 4C, DBSCAN MinPts=3, ε=3) on each of the 19 datasets.

We notice that K-Means with k = 3, hierarchical clustering with 3 clusters and DBSCAN with ε = 3 and MinPts = 3 perform quite well. We use the PL method to determine the probability of each algorithm being the best, as well as the uncertainty in our choice of algorithm, noticing that there are repetitions in the ranking. The results are represented in Table 4.

Table 4: Probability of each algorithm (KM k=3, KM k=4, KM k=5, HC 3C, HC 4C, DBSCAN MinPts=3, ε=3) being ranked first.

The results in Table 4 indicate that hierarchical clustering with 4 clusters is the algorithm performing best, with a probability of 21.30%. Next follows K-Means with 3 clusters, with a 20.36% probability, which worsens when increasing the number of clusters. DBSCAN appears with probability 18.73%.

In spite of the narrow differences, hierarchical clustering algorithms outperform the rest for two reasons. First, they adapt well to small datasets. Second, they do not require previous knowledge of the number of clusters: one can infer reasonable values from the tree. DBSCAN algorithms perform worse than expected. This algorithm is devised to look for outliers, since it does not need to place all points in a cluster, unlike K-Means. However, for many parameter choices we may find no outliers. Analyzing the results, we observe that it either gives sharp predictions or produces the worst ones. This may reflect a difficulty in tuning the hyperparameters of these algorithms; we just used the values selected for the reference case. On the other hand, K-Means is a classical algorithm that performs poorly in outlier detection. This is due to the need to place all points in a cluster, the difficulty in handling non convex clusters, and the requirement of a priori information on the number of clusters.

Let us finally point out that the results vary with the definitions of transition days and distances for the different algorithms. If we adopt a different definition for hierarchical clustering techniques exploiting the cophenetic distances, then they outperform the rest. Here we chose the simplest definitions to illustrate the procedure.
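The construction of the ranking matrix R from the distance matrix, including the tie-breaking rule above, can be sketched in a few lines; the distance values below are hypothetical.

```python
import numpy as np
from scipy.stats import rankdata

def build_rankings(M):
    """Turn a (datasets x algorithms) matrix of distances to the true
    transition day into ranking positions: smaller distances rank first;
    tied algorithms share a position and the next positions are freed."""
    return np.vstack([rankdata(row, method="min") for row in M]).astype(int)

# Hypothetical distances for 3 datasets and 4 algorithms.
M = np.array([[0, 2, 2, 4],
              [1, 0, 3, 3],
              [0, 0, 5, 1]])
print(build_rankings(M))
# [[1 2 2 4]
#  [2 1 3 3]
#  [1 1 4 3]]
# np.argsort on each row of M would give the corresponding orderings
# (sigma) consumed by the Plackett-Luce model of this section.
```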
7 Conclusions

Extracting information from real clinical data in an automatic fashion faces a number of challenges, such as the unavailability of large enough amounts of data and the incompleteness of the records, for instance. For each patient undergoing the same illness, slightly different variables may have been monitored, and the time intervals between tests vary largely. Unsupervised clustering methods provide tools to identify stages in the evolution of patients from such records in an automatic way, once their hyperparameters are adequately tuned, and a Bayesian analysis of performance rankings quantifies the uncertainty in the choice of the best clustering strategy.
Acknowledgements.
This research has been partially supported by the FEDER / Ministerio de Ciencia, Innovación y Universidades - Agencia Estatal de Investigación grant No. MTM2017-84446-C2-1-R. The authors thank Hospital Universitario Puerta de Hierro (Madrid, Spain) for providing the clinical data.

Appendix: Clinical variables
We list here the clinical variables involved in the test datasets under study, see Tables 5 to 7. The labels refer to the variables involved in the reference dataset used throughout the text to exemplify the performance of the clustering algorithms. The ranking for the Bayesian analysis uses additional datasets containing measurements of most of these variables. As is to be expected from patients' records stored in hospitals, the datasets do not always record all the variables, and the days at which measurements are taken vary with the patient.
Blood Test - Variable                | Urine Test - Variable
Glucose                              | pH
Urea                                 | Density
Creatinine                           | Proteins (strip)
Glomerular Filtration                | Glucose
Uric Acid                            | Ketone bodies
Cholesterol                          | Bilirubin
Cholesterol HDL                      | Urobilinogen
Cholesterol LDL                      | Nitrites
Triglyceride                         | Leukocytes
Total Proteins                       | Red blood cell
Albumin                              | Turbidity
Calcium                              | Sediment Comment
Phosphorus                           | Leukocytes per field
Sodium                               | G6PD quantification
Potassium                            | Red blood cell shadow
Chloride                             | Hyaline cylinders
Bicarbonate                          | Cell Peeling
Iron                                 | Creatinine
Total Bilirubin                      | Proteins
Creatine Kinase (CK)                 | Albumin
Lactate dehydrogenase (LDH)          | Microalbumin/Creat. Ratio
Alanine Aminotransferase (ALT/GPT)   | Haptoglobin
Aspartate Aminotransf. (AST/GOT)     |
Alkaline Phosphatase                 |
Gamma Glutamyltransferase (GGT)      |
Vitamin B12                          |
Folic Acid                           |
C-Reactive Protein                   |
Ferritin                             |
Transferrin                          |
Transferrin Saturation               |
Complement component 3 (C3)          |
Complement component 4 (C4)          |
Thyroid Stimulating Hormone (TSH)    |
Parathyroid Hormone (PTHi)           |
25-OH Vitamin D                      |

Table 5: Clinical variables monitored in blood and urine tests.

Hematology Laboratory - Variable
Leukocytes
Neutrophils
Lymphocytes
Monocytes
Eosinophils
Basophils
Neutrophils (Percent)
Lymphocytes (Percent)
Monocytes (Percent)
Eosinophils (Percent)
Basophils (Percent)
Immature Granulocyte (IG) count
Red blood cell
Hemoglobin
Hematocrit
Mean Corpuscular Volume (MCV)
Mean Corpuscular Hemoglobin (MCH)
Mean Corpuscular Hemoglobin Concentration (MCHC)
Erythroblasts
Erythroblasts (Percent)
Red blood cell Distribution Width (RDW)
Platelets
Platelet Distribution Width (PDW)
Reticulocytes (Percent)
Erythrocyte Sedimentation Rate (ESR)
Direct Coombs
Glucose-6-Phosphate Dehydrogenase (G6PD) quantification
Haptoglobin
Prothrombin Time
Activated Partial Thromboplastin Time (APTT)
Fibrinogen
Lupus anticoagulant

Table 6: Clinical variables monitored in the hematology laboratory.
Immunology Laboratory - Variable
Anti-dsDNA antibodies
IgM Anti-cardiolipin antibodies
IgG Anti-cardiolipin antibodies
IgM Anti-B2glycoprotein-I antibodies
IgG Anti-B2glycoprotein-I antibodies

Table 7: Clinical variables monitored in the immunology laboratory.